本篇博文主要内容为 2026-01-06 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。

目录

概览 (2026-01-06)

今日共更新820篇论文,其中:

  • 自然语言处理109篇(Computation and Language (cs.CL))
  • 人工智能258篇(Artificial Intelligence (cs.AI))
  • 计算机视觉206篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习247篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] Robust Persona-Aware Toxicity Detection with Prompt Optimization and Learned Ensembling

【速读】: 该论文旨在解决毒性检测(toxicity detection)这一主观性任务中因不同人群视角差异而导致的模型性能不稳定问题,即现有大型语言模型(Large Language Model, LLM)在不同角色(persona)设定下表现出显著的性能波动。其核心解决方案在于提出一种轻量级元集成方法:基于四种提示变体(prompting variants)的预测结果构建一个4-bit向量,并使用支持向量机(SVM)对该向量进行加权融合。该方法通过捕捉不同提示策略与模型-角色组合间的互补误差,在多样化的角色条件下实现比单一提示方法和传统多数投票法更稳健且一致的性能表现,从而为主观自然语言处理任务提供了一种有效的多元评价范式。

链接: https://arxiv.org/abs/2601.02337
作者: Berk Atil,Rebecca J. Passonneau,Ninareh Mehrabi
机构: Pennsylvania State University (宾夕法尼亚州立大学); Resolution
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Toxicity detection is inherently subjective, shaped by the diverse perspectives and social priors of different demographic groups. While ``pluralistic’’ modeling as used in economics and the social sciences aims to capture perspective differences across contexts, current Large Language Model (LLM) prompting techniques have different results across different personas and base models. In this work, we conduct a systematic evaluation of persona-aware toxicity detection, showing that no single prompting method, including our proposed automated prompt optimization strategy, uniformly dominates across all model-persona pairs. To exploit complementary errors, we explore ensembling four prompting variants and propose a lightweight meta-ensemble: an SVM over the 4-bit vector of prompt predictions. Our results demonstrate that the proposed SVM ensemble consistently outperforms individual prompting methods and traditional majority-voting techniques, achieving the strongest overall performance across diverse personas. This work provides one of the first systematic comparisons of persona-conditioned prompting for toxicity detection and offers a robust method for pluralistic evaluation in subjective NLP tasks.
zh

[NLP-1] Estimating Text Temperature

【速读】: 该论文旨在解决如何对任意文本(包括人类书写的文本)在给定语言模型下估计其温度参数(temperature parameter)的问题,从而量化生成文本的随机性水平。解决方案的关键在于提出了一种基于最大似然估计(maximum likelihood estimation)的方法,能够从已生成或已存在的文本中反推其对应的温度值,进而实现对不同文本来源(如人类写作或模型生成)的温度差异进行量化分析。通过在多个小到中等规模的语言模型上验证该方法的有效性,并最终使用表现最佳的Qwen3 14B模型对主流语料库进行温度估计,验证了该方法的实用性与泛化能力。

链接: https://arxiv.org/abs/2601.02320
作者: Nikolay Mikhaylovskiy
机构: NTR Labs (NTR 实验室); Higher IT School of Tomsk State University (托木斯克国立大学高级信息技术学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Autoregressive language models typically use temperature parameter at inference to shape the probability distribution and control the randomness of the text generated. After the text was generated, this parameter can be estimated using maximum likelihood approach. Following it, we propose a procedure to estimate the temperature of any text, including ones written by humans, with respect to a given language model. We evaluate the temperature estimation capability of a wide selection of small-to-medium LLMs. We then use the best-performing Qwen3 14B to estimate temperatures of popular corpora.
zh

[NLP-2] Classifying several dialectal Nawatl varieties

【速读】: 该论文旨在解决纳瓦特尔语(Nawatl)方言分类困难的问题,因其存在约30种方言变体及书写形式不统一,导致计算机资源匮乏。解决方案的关键在于利用机器学习(Machine Learning)与神经网络(Neural Networks)技术对纳瓦特尔语的不同方言进行自动识别与分类,从而为语言处理和数字化保护提供可行的技术路径。

链接: https://arxiv.org/abs/2601.02303
作者: Juan-José Guzmán-Landa,Juan-Manuel Torres-Moreno,Miguel Figueroa-Saavedra,Carlos-Emiliano González-Gallardo,Graham Ranger,Martha Lorena-Avendaño-Garrido
机构: LIA/Avignon Université (法国阿维尼翁大学); LIFAT/Université de Tours (法国图尔大学); Universidad Veracruzana (墨西哥韦拉克鲁斯大学); ICTT/Avignon Université (法国阿维尼翁大学)
类目: Computation and Language (cs.CL)
备注: 9 pages, 5 figures, 4 tables

点击查看摘要

Abstract:Mexico is a country with a large number of indigenous languages, among which the most widely spoken is Nawatl, with more than two million people currently speaking it (mainly in North and Central America). Despite its rich cultural heritage, which dates back to the 15th century, Nawatl is a language with few computer resources. The problem is compounded when it comes to its dialectal varieties, with approximately 30 varieties recognised, not counting the different spellings in the written forms of the language. In this research work, we addressed the problem of classifying Nawatl varieties using Machine Learning and Neural Networks.
zh

[NLP-3] Power-of-Two Quantization-Aware-Training (PoT-QAT) in Large Language Models (LLM s)

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)参数规模急剧增长所带来的边缘计算设备部署难题,尤其是受限于边缘设备的内存容量和算力资源。其核心解决方案是采用一种特殊的量化方法——仅保留幂次为2的数值(Power-of-Two Quantization, PoT),从而显著降低模型存储需求(仅需存储指数部分)并利用位移操作替代高成本乘法运算,实现推理速度提升。为缓解严格PoT量化导致的性能损失,研究进一步引入量化感知训练(Quantization Aware Training, QAT),通过额外训练优化模型表现。实验表明,在GPT-2 124M模型上,PoT量化结合QAT后,困惑度(perplexity)相比基线提升66%,BERT分数损失仅为1%,同时内存占用减少87.5%,推理速度提升3–10倍。

链接: https://arxiv.org/abs/2601.02298
作者: Mahmoud Elgenedy
机构: Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:In Large Language Models (LLMs), the number of parameters has grown exponentially in the past few years, e.g., from 1.5 billion parameters in GPT-2 to 175 billion in GPT-3 to possibly more than trillion in higher versions. This raises a significant challenge for implementation, especially for Edge devices. Unlike cloud computing, memory and processing power for Edge devices are very limited, which necessitates developing novel ideas to make such applications feasible. In this work, we investigate compressing weights with a special quantization that limits numbers to only power-of-two (PoT). This helps save a huge amount of memory as only exponents need to be stored, more importantly, it significantly reduces processing power by replacing costly multiplication with low cost bit shifting. To overcome performance loss due to this strict quantization, we investigate Quantization Aware Training (QAT) to enhance performance through additional training. Results on GPT-2 124M show a major enhancement for quantized PoT model after additional training, with a perplexity enhancement of 66% and BERT-Score loss to baseline GPT-2 of 1%. The memory saving is estimated to be 87.5% while the inference speed is expected to be 3-10x faster with PoT quantization versus full-precision.
zh

[NLP-4] pdfQA: Diverse Challenging and Realistic Question Answering over PDFs

【速读】: 该论文旨在解决当前问答(QA)数据集普遍基于文本源或局限于特定领域,难以全面评估端到端文档问答(Document QA)系统性能的问题。其解决方案的关键在于构建了一个多领域、高质量的PDF问答数据集pdfQA,包含2000个真实标注(real-pdfQA)和2000个合成(syn-pdfQA)的QA对,并在十个复杂度维度(如文件类型、来源模态、答案类型等)上进行细粒度区分,同时通过质量与难度过滤确保数据的有效性与挑战性。该数据集为评估信息检索、文档解析等模块的本地优化及整体问答流程提供了标准化基准。

链接: https://arxiv.org/abs/2601.02285
作者: Tobias Schimanski,Imene Kolli,Jingwei Ni,Yu Fan,Ario Saeid Vaghefi,Elliott Ash,Markus Leippold
机构: University of Zurich (苏黎世大学); ETH Zurich (苏黎世联邦理工学院); Swiss Finance Institute (SFI) (瑞士金融研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:PDFs are the second-most used document type on the internet (after HTML). Yet, existing QA datasets commonly start from text sources or only address specific domains. In this paper, we present pdfQA, a multi-domain 2K human-annotated (real-pdfQA) and 2K synthetic dataset (syn-pdfQA) differentiating QA pairs in ten complexity dimensions (e.g., file type, source modality, source position, answer type). We apply and evaluate quality and difficulty filters on both datasets, obtaining valid and challenging QA pairs. We answer the questions with open-source LLMs, revealing existing challenges that correlate with our complexity dimensions. pdfQA presents a basis for end-to-end QA pipeline evaluation, testing diverse skill sets and local optimizations (e.g., in information retrieval or parsing).
zh

[NLP-5] CD4LM: Consistency Distillation and aDaptive Decoding for Diffusion Language Models

【速读】: 该论文旨在解决扩散语言模型(Diffusion Language Models, DLMs)在推理阶段因静态训练与动态生成需求不匹配而导致的效率低下问题,即训练时优化固定调度下的局部转移,而高效推理需依赖自适应的“长跳”精炼过程。其解决方案的关键在于提出CD4LM框架,通过两个核心机制实现:一是离散空间一致性蒸馏(Discrete-Space Consistency Distillation, DSCD),使学生模型具备轨迹不变性,能够直接从多样化的噪声状态映射到干净分布;二是置信度自适应解码(Confidence-Adaptive Decoding, CAD),基于token置信度动态分配计算资源,实现高效跳步而不损失生成质量。此方法显著提升了DLMs的并行解码效率,在保持甚至提升生成质量的同时,实现了平均3.62倍的速度提升。

链接: https://arxiv.org/abs/2601.02236
作者: Yihao Liang,Ze Wang,Hao Chen,Ximeng Sun,Jialian Wu,Xiaodong Yu,Jiang Liu,Emad Barsoum,Zicheng Liu,Niraj K. Jha
机构: Princeton University (普林斯顿大学); Advanced Micro Devices, Inc (超威半导体公司)
类目: Computation and Language (cs.CL)
备注: 33 pages, 7 figures

点击查看摘要

Abstract:Autoregressive large language models achieve strong results on many benchmarks, but decoding remains fundamentally latency-limited by sequential dependence on previously generated tokens. Diffusion language models (DLMs) promise parallel generation but suffer from a fundamental static-to-dynamic misalignment: Training optimizes local transitions under fixed schedules, whereas efficient inference requires adaptive “long-jump” refinements through unseen states. Our goal is to enable highly parallel decoding for DLMs with low number of function evaluations while preserving generation quality. To achieve this, we propose CD4LM, a framework that decouples training from inference via Discrete-Space Consistency Distillation (DSCD) and Confidence-Adaptive Decoding (CAD). Unlike standard objectives, DSCD trains a student to be trajectory-invariant, mapping diverse noisy states directly to the clean distribution. This intrinsic robustness enables CAD to dynamically allocate compute resources based on token confidence, aggressively skipping steps without the quality collapse typical of heuristic acceleration. On GSM8K, CD4LM matches the LLaDA baseline with a 5.18x wall-clock speedup; across code and math benchmarks, it strictly dominates the accuracy-efficiency Pareto frontier, achieving a 3.62x mean speedup while improving average accuracy. Code is available at this https URL
zh

[NLP-6] From XAI to Stories: A Factorial Study of LLM -Generated Explanation Quality

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在时间序列预测任务中,如何将可解释人工智能 (Explainable AI, XAI) 方法产生的数值特征归因转化为高质量自然语言解释 (Natural Language Explanations, NLEs) 的问题,尤其关注不同模型、XAI方法、大语言模型(LLM)选择及提示策略对NLE质量的影响。其解决方案的关键在于通过一个系统性的因子实验设计,量化分析四个核心变量:预测模型类型(如XGBoost、SARIMAX等)、XAI方法(SHAP、LIME与无XAI基线)、LLM选择(GPT-4o、Llama-3-8B、DeepSeek-R1)以及提示策略(共八种),并采用基于LLM-as-a-judge的G-Eval评估框架进行多维度测评,从而识别出影响NLE质量的主导因素与非直观现象,例如发现LLM选择比其他因素更具影响力,且存在“可解释性悖论”——即传统统计模型SARIMAX虽预测精度更高,但生成的NLE质量反而低于机器学习模型。

链接: https://arxiv.org/abs/2601.02224
作者: Fabian Lukassen,Jan Herrmann,Christoph Weisser,Benjamin Saefken,Thomas Kneib
机构: University of Göttingen (哥廷根大学); BASF SE (巴斯夫公司); TU Clausthal (克劳斯塔尔工业大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Explainable AI (XAI) methods like SHAP and LIME produce numerical feature attributions that remain inaccessible to non expert users. Prior work has shown that Large Language Models (LLMs) can transform these outputs into natural language explanations (NLEs), but it remains unclear which factors contribute to high-quality explanations. We present a systematic factorial study investigating how Forecasting model choice, XAI method, LLM selection, and prompting strategy affect NLE quality. Our design spans four models (XGBoost (XGB), Random Forest (RF), Multilayer Perceptron (MLP), and SARIMAX - comparing black-box Machine-Learning (ML) against classical time-series approaches), three XAI conditions (SHAP, LIME, and a no-XAI baseline), three LLMs (GPT-4o, Llama-3-8B, DeepSeek-R1), and eight prompting strategies. Using G-Eval, an LLM-as-a-judge evaluation method, with dual LLM judges and four evaluation criteria, we evaluate 660 explanations for time-series forecasting. Our results suggest that: (1) XAI provides only small improvements over no-XAI baselines, and only for expert audiences; (2) LLM choice dominates all other factors, with DeepSeek-R1 outperforming GPT-4o and Llama-3; (3) we observe an interpretability paradox: in our setting, SARIMAX yielded lower NLE quality than ML models despite higher prediction accuracy; (4) zero-shot prompting is competitive with self-consistency at 7-times lower cost; and (5) chain-of-thought hurts rather than helps.
zh

[NLP-7] ARCADE: A City-Scale Corpus for Fine-Grained Arabic Dialect Tagging

【速读】: 该论文旨在解决阿拉伯语语音识别中城市级别方言细粒度标注不足的问题,即现有研究难以准确映射语音到具体城市级别的方言来源。其解决方案的关键在于构建首个具有城市级方言粒度的阿拉伯语语音语料库ARCADE(Arabic Radio Corpus for Audio Dialect Evaluation),通过从阿拉伯世界多个广播流媒体平台采集30秒音频片段,并由母语者进行多维度标注(包括情感、语音类型、方言类别及有效性标签),最终形成包含58个城市、19个国家的3,790个独特音频段和6,907条标注数据的高质量语料库,为城市级方言识别任务提供可靠的基准和多任务学习基础。

链接: https://arxiv.org/abs/2601.02209
作者: Omer Nacar,Serry Sibaee,Adel Ammar,Yasser Alhabashi,Nadia Samer Sibai,Yara Farouk Ahmed,Ahmed Saud Alqusaiyer,Sulieman Mahmoud AlMahmoud,Abdulrhman Mamdoh Mukhaniq,Lubaba Raed,Sulaiman Mohammed Alatwah,Waad Nasser Alqahtani,Yousif Abdulmajeed Alnasser,Mohamed Aziz Khadraoui,Wadii Boulila
机构: Tuwaiq Academy (图威克学院); Prince Sultan University (王子苏丹大学); Higher School of Communication of Tunis (SUP’COM) (突尼斯通信高等学院)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Sound (cs.SD)
备注:

点击查看摘要

Abstract:The Arabic language is characterized by a rich tapestry of regional dialects that differ substantially in phonetics and lexicon, reflecting the geographic and cultural diversity of its speakers. Despite the availability of many multi-dialect datasets, mapping speech to fine-grained dialect sources, such as cities, remains underexplored. We present ARCADE (Arabic Radio Corpus for Audio Dialect Evaluation), the first Arabic speech dataset designed explicitly with city-level dialect granularity. The corpus comprises Arabic radio speech collected from streaming services across the Arab world. Our data pipeline captures 30-second segments from verified radio streams, encompassing both Modern Standard Arabic (MSA) and diverse dialectal speech. To ensure reliability, each clip was annotated by one to three native Arabic reviewers who assigned rich metadata, including emotion, speech type, dialect category, and a validity flag for dialect identification tasks. The resulting corpus comprises 6,907 annotations and 3,790 unique audio segments spanning 58 cities across 19 countries. These fine-grained annotations enable robust multi-task learning, serving as a benchmark for city-level dialect tagging. We detail the data collection methodology, assess audio quality, and provide a comprehensive analysis of label distributions. The dataset is available on: this https URL
zh

[NLP-8] oward Global Large Language Models in Medicine

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在医疗领域应用中存在语言资源分布不均的问题,即现有模型主要基于高资源语言训练,导致低资源语言场景下的医疗应用效果显著受限。其解决方案的关键在于构建了GlobMed——一个包含超过50万条目、覆盖12种语言(包括四种低资源语言)的多语言医学数据集,并在此基础上开发了GlobMed-Bench评估基准和GlobMed-LLMs系列多语言医学大模型。其中,GlobMed-LLMs通过在GlobMed数据集上训练,实现了相对于基线模型平均40%以上的性能提升,尤其在低资源语言上性能提升超过三倍,从而为全球范围内公平、可及的医疗AI应用提供了重要基础。

链接: https://arxiv.org/abs/2601.02186
作者: Rui Yang,Huitao Li,Weihao Xuan,Heli Qi,Xin Li,Kunyu Yu,Yingjian Chen,Rongrong Wang,Jacques Behmoaras,Tianxi Cai,Bibhas Chakraborty,Qingyu Chen,Lionel Tim-Ee Cheng,Marie-Louise Damwanza,Chido Dzinotyiwei,Aosong Feng,Chuan Hong,Yusuke Iwasawa,Yuhe Ke,Linah Kitala,Taehoon Ko,Jisan Lee,Irene Li,Jonathan Chong Kai Liew,Hongfang Liu,Lian Leng Low,Edison Marrese-Taylor,Yutaka Matsuo,Isheanesu Misi,Yilin Ning,Jasmine Chiat Ling Ong,Marcus Eng Hock Ong,Enrico Petretto,Hossein Rouhizadeh,Abiram Sandralegar,Oren Schreier,Iain Bee Huat Tan,Patrick Tan,Daniel Shu Wei Ting,Junjue Wang,Chunhua Weng,Matthew Yu Heng Wong,Fang Wu,Yunze Xiao,Xuhai Xu,Qingcheng Zeng,Zhuo Zheng,Yifan Peng,Douglas Teodoro,Nan Liu
机构: 未知
类目: Computation and Language (cs.CL)
备注: 182 pages, 65 figures

点击查看摘要

Abstract:Despite continuous advances in medical technology, the global distribution of health care resources remains uneven. The development of large language models (LLMs) has transformed the landscape of medicine and holds promise for improving health care quality and expanding access to medical information globally. However, existing LLMs are primarily trained on high-resource languages, limiting their applicability in global medical scenarios. To address this gap, we constructed GlobMed, a large multilingual medical dataset, containing over 500,000 entries spanning 12 languages, including four low-resource languages. Building on this, we established GlobMed-Bench, which systematically assesses 56 state-of-the-art proprietary and open-weight LLMs across multiple multilingual medical tasks, revealing significant performance disparities across languages, particularly for low-resource languages. Additionally, we introduced GlobMed-LLMs, a suite of multilingual medical LLMs trained on GlobMed, with parameters ranging from 1.7B to 8B. GlobMed-LLMs achieved an average performance improvement of over 40% relative to baseline models, with a more than threefold increase in performance on low-resource languages. Together, these resources provide an important foundation for advancing the equitable development and application of LLMs globally, enabling broader language communities to benefit from technological advances.
zh

[NLP-9] Confidence Estimation for LLM s in Multi-turn Interactions

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多轮对话中置信度估计(confidence estimation)的可靠性问题,尤其是在上下文累积和歧义逐步消解的动态场景下,现有方法普遍缺乏校准性(calibration)和置信度随信息增加而单调递增的特性。其解决方案的关键在于提出首个系统性的多轮对话置信度评估框架,基于两个核心标准:逐轮校准性和置信度单调性,并引入两项创新工具:长度归一化的期望校准误差(InfoECE)指标以及用于生成可控评估数据集的“Hinter-Guesser”范式。实验表明,主流置信度方法在多轮场景中表现不佳,而作者提出的基于logit的探测器P(Sufficient)展现出相对更优的性能,但仍存在显著改进空间。

链接: https://arxiv.org/abs/2601.02179
作者: Caiqi Zhang,Ruihan Yang,Xiaochen Zhu,Chengzu Li,Tiancheng Hu,Yijiang River Dong,Deqing Yang,Nigel Collier
机构: University of Cambridge (剑桥大学); Fudan University (复旦大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While confidence estimation is a promising direction for mitigating hallucinations in Large Language Models (LLMs), current research dominantly focuses on single-turn settings. The dynamics of model confidence in multi-turn conversations, where context accumulates and ambiguity is progressively resolved, remain largely unexplored. Reliable confidence estimation in multi-turn settings is critical for many downstream applications, such as autonomous agents and human-in-the-loop systems. This work presents the first systematic study of confidence estimation in multi-turn interactions, establishing a formal evaluation framework grounded in two key desiderata: per-turn calibration and monotonicity of confidence as more information becomes available. To facilitate this, we introduce novel metrics, including a length-normalized Expected Calibration Error (InfoECE), and a new “Hinter-Guesser” paradigm for generating controlled evaluation datasets. Our experiments reveal that widely-used confidence techniques struggle with calibration and monotonicity in multi-turn dialogues. We propose P(Sufficient), a logit-based probe that achieves comparatively better performance, although the task remains far from solved. Our work provides a foundational methodology for developing more reliable and trustworthy conversational agents.
zh

[NLP-10] EverMemOS: A Self-Organizing Memory Operating System for Structured Long-Horizon Reasoning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长期交互中因上下文窗口有限而导致行为连贯性难以维持的问题。现有记忆系统通常存储孤立记录并检索片段信息,无法有效整合用户状态的演变与冲突消解。其解决方案的关键在于提出EverMemOS——一种受记忆痕迹(engram)启发的自组织记忆操作系统,通过三个核心机制实现:1)情景痕迹形成(Episodic Trace Formation),将对话流转化为包含情景痕迹、原子事实和时间边界前瞻性信号的MemCells;2)语义巩固(Semantic Consolidation),将MemCells组织为主题性的MemScenes,提炼稳定语义结构并更新用户画像;3)重构回忆(Reconstructive Recollection),基于MemScene引导的代理检索,生成下游推理所需的必要且充分上下文。该架构显著提升了记忆增强型推理任务的表现,并支持用户画像构建与前瞻性能力等高级交互特性。

链接: https://arxiv.org/abs/2601.02163
作者: Chuanrui Hu,Xingze Gao,Zuyi Zhou,Dannong Xu,Yi Bai,Xintong Li,Hui Zhang,Tong Li,Chong Zhang,Lidong Bing,Yafeng Deng
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 16 pages, 6 figures, 12 tables. Code available at this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly deployed as long-term interactive agents, yet their limited context windows make it difficult to sustain coherent behavior over extended interactions. Existing memory systems often store isolated records and retrieve fragments, limiting their ability to consolidate evolving user states and resolve conflicts. We introduce EverMemOS, a self-organizing memory operating system that implements an engram-inspired lifecycle for computational memory. Episodic Trace Formation converts dialogue streams into MemCells that capture episodic traces, atomic facts, and time-bounded Foresight signals. Semantic Consolidation organizes MemCells into thematic MemScenes, distilling stable semantic structures and updating user profiles. Reconstructive Recollection performs MemScene-guided agentic retrieval to compose the necessary and sufficient context for downstream reasoning. Experiments on LoCoMo and LongMemEval show that EverMemOS achieves state-of-the-art performance on memory-augmented reasoning tasks. We further report a profile study on PersonaMem v2 and qualitative case studies illustrating chat-oriented capabilities such as user profiling and Foresight. Code is available at this https URL.
zh

[NLP-11] FormationEval an open multiple-choice benchmark for petroleum geoscience

【速读】: 该论文旨在解决当前大语言模型在石油地质科学(petroleum geoscience)与地下学科领域评估缺乏专业、可靠基准测试的问题。解决方案的关键在于构建一个名为FormationEval的开放多选题基准数据集,涵盖7个子领域(如岩石物理、石油地质学和油藏工程),共505道题目,源自权威资料并通过概念驱动的方法生成,避免直接复制受版权保护的内容;同时提供完整的来源元数据以支持可追溯性,并对模型性能进行全面评估(覆盖72个来自OpenAI、Anthropic、Google、Meta等主流厂商及开源模型的版本),从而揭示不同模型类别(闭源与开源)、领域间的性能差异与潜在偏差(如答案长度偏倚),为后续研究提供标准化评估工具和透明度保障。

链接: https://arxiv.org/abs/2601.02158
作者: Almaz Ermilov
机构: UiT The Arctic University of Norway(北挪威大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Geophysics (physics.geo-ph)
备注: 24 pages, 8 figures, 10 tables; benchmark and code at this https URL

点击查看摘要

Abstract:This paper presents FormationEval, an open multiple-choice question benchmark for evaluating language models on petroleum geoscience and subsurface disciplines. The dataset contains 505 questions across seven domains including petrophysics, petroleum geology and reservoir engineering, derived from three authoritative sources using a reasoning model with detailed instructions and a concept-based approach that avoids verbatim copying of copyrighted text. Each question includes source metadata to support traceability and audit. The evaluation covers 72 models from major providers including OpenAI, Anthropic, Google, Meta and open-weight alternatives. The top performers achieve over 97% accuracy, with Gemini 3 Pro Preview reaching 99.8%, while tier and domain gaps persist. Among open-weight models, GLM-4.7 leads at 98.6%, with several DeepSeek, Llama, Qwen and Mistral models also exceeding 93%. The performance gap between open-weight and closed models is narrower than expected, with several lower-cost open-weight models exceeding 90% accuracy. Petrophysics emerges as the most challenging domain across all models, while smaller models show wider performance variance. Residual length bias in the dataset (correct answers tend to be longer) is documented along with bias mitigation strategies applied during construction. The benchmark, evaluation code and results are publicly available.
zh

[NLP-12] Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting

【速读】: 该论文旨在解决监督微调(Supervised Fine-Tuning, SFT)在领域适配过程中常引发的灾难性遗忘(catastrophic forgetting)问题,其核心在于SFT与模型内部信念之间的分布偏差(distributional gap)。具体而言,SFT强制模型拟合外部标注,导致在某些“自信冲突”(Confident Conflicts)token上产生高置信度但低熵的预测,从而触发破坏性的梯度更新。为此,作者提出熵自适应微调(Entropy-Adaptive Fine-Tuning, EAFT),其关键创新在于引入token级熵作为门控机制,区分认知不确定性(epistemic uncertainty)与知识冲突(knowledge conflict),从而在保留不确定样本学习能力的同时抑制冲突数据的梯度传播,实现下游任务性能与通用能力之间的一致性平衡。

链接: https://arxiv.org/abs/2601.02151
作者: Muxi Diao,Lele Yang,Wuxuan Gong,Yutong Zhang,Zhonghao Yan,Yufei Han,Kongming Liang,Weiran Xu,Zhanyu Ma
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Zhongguancun Academy (中关村学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Supervised Fine-Tuning (SFT) is the standard paradigm for domain adaptation, yet it frequently incurs the cost of catastrophic forgetting. In sharp contrast, on-policy Reinforcement Learning (RL) effectively preserves general capabilities. We investigate this discrepancy and identify a fundamental distributional gap: while RL aligns with the model’s internal belief, SFT forces the model to fit external supervision. This mismatch often manifests as “Confident Conflicts” tokens characterized by low probability but low entropy. In these instances, the model is highly confident in its own prediction but is forced to learn a divergent ground truth, triggering destructive gradient updates. To address this, we propose Entropy-Adaptive Fine-Tuning (EAFT). Unlike methods relying solely on prediction probability, EAFT utilizes token-level entropy as a gating mechanism to distinguish between epistemic uncertainty and knowledge conflict. This allows the model to learn from uncertain samples while suppressing gradients on conflicting data. Extensive experiments on Qwen and GLM series (ranging from 4B to 32B parameters) across mathematical, medical, and agentic domains confirm our hypothesis. EAFT consistently matches the downstream performance of standard SFT while significantly mitigating the degradation of general capabilities.
zh

[NLP-13] Routing by Analogy: kNN-Augmented Expert Assignment for Mixture-of-Experts

【速读】: 该论文旨在解决混合专家(Mixture-of-Experts, MoE)架构在面对分布偏移(distribution shift)时路由决策脆弱的问题,即传统方法中固定训练后冻结的路由器难以适应新场景。其解决方案的关键在于提出kNN-MoE,一种基于检索增强的路由框架:通过离线构建一个记忆库,该记忆库直接优化token级路由logits以最大化参考集上的似然,并利用检索到的相似历史案例的聚合相似度作为置信度驱动的混合系数,从而在无相关案例时自动退回到冻结的原始路由器,实现鲁棒且灵活的动态路由。

链接: https://arxiv.org/abs/2601.02144
作者: Boxuan Lyu,Soichiro Murakami,Hidetaka Kamigaito,Peinan Zhang
机构: Institute of Science Tokyo (东京科学研究所); CyberAgent (CyberAgent); Nara Institute of Science and Technology (奈良科学技术大学院大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mixture-of-Experts (MoE) architectures scale large language models efficiently by employing a parametric “router” to dispatch tokens to a sparse subset of experts. Typically, this router is trained once and then frozen, rendering routing decisions brittle under distribution shifts. We address this limitation by introducing kNN-MoE, a retrieval-augmented routing framework that reuses optimal expert assignments from a memory of similar past cases. This memory is constructed offline by directly optimizing token-wise routing logits to maximize the likelihood on a reference set. Crucially, we use the aggregate similarity of retrieved neighbors as a confidence-driven mixing coefficient, thus allowing the method to fall back to the frozen router when no relevant cases are found. Experiments show kNN-MoE outperforms zero-shot baselines and rivals computationally expensive supervised fine-tuning.
zh

[NLP-14] owards Multi-Level Transcript Segmentation: LoRA Fine-Tuning for Table-of-Contents Generation INTERSPEECH INTERSPEECH2025

【速读】: 该论文旨在解决语音转录文本的层次化主题分割问题,以提升下游处理效率并增强无障碍访问体验。其解决方案的关键在于提出一种新型的多层级主题分割方法,通过生成包含主题与子主题边界的层级目录结构,并对比零样本提示(zero-shot prompting)与LoRA微调(LoRA fine-tuning)在大语言模型上的性能差异,同时融合高阶语音停顿特征以增强分割准确性。实验表明,该方法在英语会议录音和多语种讲座转录(葡萄牙语、德语)上均显著优于现有基线模型,且改进了多层级分割的统一评估指标,实现了对所有层级信息的一体化衡量。

链接: https://arxiv.org/abs/2601.02128
作者: Steffen Freisinger,Philipp Seeberger,Thomas Ranzenberger,Tobias Bocklet,Korbinian Riedhammer
机构: 未知
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Published in Proceedings of Interspeech 2025. Please cite the proceedings version (DOI: https://doi.org/10.21437/Interspeech.2025-2792 )

点击查看摘要

Abstract:Segmenting speech transcripts into thematic sections benefits both downstream processing and users who depend on written text for accessibility. We introduce a novel approach to hierarchical topic segmentation in transcripts, generating multi-level tables of contents that capture both topic and subtopic boundaries. We compare zero-shot prompting and LoRA fine-tuning on large language models, while also exploring the integration of high-level speech pause features. Evaluations on English meeting recordings and multilingual lecture transcripts (Portuguese, German) show significant improvements over established topic segmentation baselines. Additionally, we adapt a common evaluation measure for multi-level segmentation, taking into account all hierarchical levels within one metric.
zh

[NLP-15] DeCode: Decoupling Content and Delivery for Medical QA

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在临床场景中生成答案时缺乏对个体患者情境适配的问题,即模型虽能提供医学上正确的回答,但往往与具体患者的临床需求不匹配。解决方案的关键在于提出一种无需训练、与模型无关的框架 DeCode,该框架通过引入上下文感知机制,在不修改原模型参数的前提下,使现有 LLM 能够根据患者特定信息生成更具临床相关性的回答。实验表明,DeCode 在 OpenAI HealthBench 基准测试中将性能从 28.4% 提升至 49.8%,实现了 75% 的相对改进,验证了其在提升临床问答准确性方面的有效性。

链接: https://arxiv.org/abs/2601.02123
作者: Po-Jen Ko,Chen-Han Tsai,Yu-Shao Peng
机构: National Taiwan University (台湾大学); HTC DeepQ
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Large language models (LLMs) exhibit strong medical knowledge and can generate factually accurate responses. However, existing models often fail to account for individual patient contexts, producing answers that are clinically correct yet poorly aligned with patients’ needs. In this work, we introduce DeCode, a training-free, model-agnostic framework that adapts existing LLMs to produce contextualized answers in clinical settings. We evaluate DeCode on OpenAI HealthBench, a comprehensive and challenging benchmark designed to assess clinical relevance and validity of LLM responses. DeCode improves the previous state of the art from 28.4% to 49.8% , corresponding to a 75% relative improvement. Experimental results suggest the effectiveness of DeCode in improving clinical question answering of LLMs.
zh

[NLP-16] Deferred Commitment Decoding for Diffusion Language Models with Confidence-Aware Sliding Windows

【速读】: 该论文旨在解决块状扩散语言模型(block-based diffusion language models)中存在的边界诱导上下文截断(Boundary-Induced Context Truncation, BICT)问题:在块边界附近的未解码 token 被迫提前提交,而无法利用邻近的未来上下文信息,导致不确定性增加、生成质量下降,尤其在数学推理和代码生成等需要精确推理的任务中表现不佳。解决方案的关键在于提出一种无需训练的解码策略——延迟承诺解码(Deferred Commitment Decoding, DCD),其核心思想是维护一个基于置信度的滑动窗口,在窗口内优先确定低不确定性 token,同时推迟高不确定性 token 的提交,直至获得充分的上下文证据。该设计实现了窗口内的双向信息流动而不牺牲效率,显著提升了生成准确率(平均提升 1.39%,最高达 9.0%)。

链接: https://arxiv.org/abs/2601.02076
作者: Yingte Shu,Yuchuan Tian,Chao Xu,Yunhe Wang,Hanting Chen
机构: Peking University (北京大学); Huawei Technologies Co., Ltd (华为技术有限公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion language models (DLMs) have recently emerged as a strong alternative to autoregressive models by enabling parallel text generation. To improve inference efficiency and KV-cache compatibility, prior work commonly adopts block-based diffusion, decoding tokens block by block. However, this paradigm suffers from a structural limitation that we term Boundary-Induced Context Truncation (BICT): undecoded tokens near block boundaries are forced to commit without access to nearby future context, even when such context could substantially reduce uncertainty. This limitation degrades decoding confidence and generation quality, especially for tasks requiring precise reasoning, such as mathematical problem solving and code generation. We propose Deferred Commitment Decoding (DCD), a novel, training-free decoding strategy that mitigates this issue. DCD maintains a confidence-aware sliding window over masked tokens, resolving low-uncertainty tokens early while deferring high-uncertainty tokens until sufficient contextual evidence becomes available. This design enables effective bidirectional information flow within the decoding window without sacrificing efficiency. Extensive experiments across multiple diffusion language models, benchmarks, and caching configurations show that DCD improves generation accuracy by 1.39% with comparable time on average compared to fixed block-based diffusion methods, with the most significant improvement reaching 9.0%. These results demonstrate that deferring token commitment based on uncertainty is a simple yet effective principle for improving both the quality and efficiency of diffusion language model decoding.
zh

[NLP-17] Cost-Efficient Cross-Lingual Retrieval-Augmented Generation for Low-Resource Languages: A Case Study in Bengali Agricultural Advisory

【速读】: 该论文旨在解决发展中国家农业知识获取受限的问题,尤其是因语言障碍导致的权威农业手册(主要为英文)与农民常用低资源本地语言(如孟加拉语)之间的信息断层。其解决方案的关键在于提出一种基于翻译中心架构的跨语言检索增强生成(Retrieval-Augmented Generation, RAG)框架:通过将孟加拉语用户查询翻译为英文,结合领域特定关键词注入以对齐农民口语术语与科学术语,再利用密集向量检索从精选英文农业手册(如FAO、IRRI)中提取事实依据生成回答,并最终将英文结果回译为孟加拉语,从而实现高准确性、低成本且可在消费级硬件上部署的农业咨询服务。

链接: https://arxiv.org/abs/2601.02065
作者: Md. Asif Hossain,Nabil Subhan,Mantasha Rahman Mahi,Jannatul Ferdous Nabila
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 5 pages, 3 figures, 1 table

点击查看摘要

Abstract:Access to reliable agricultural advisory remains limited in many developing regions due to a persistent language barrier: authoritative agricultural manuals are predominantly written in English, while farmers primarily communicate in low-resource local languages such as Bengali. Although recent advances in Large Language Models (LLMs) enable natural language interaction, direct generation in low-resource languages often exhibits poor fluency and factual inconsistency, while cloud-based solutions remain cost-prohibitive. This paper presents a cost-efficient, cross-lingual Retrieval-Augmented Generation (RAG) framework for Bengali agricultural advisory that emphasizes factual grounding and practical deployability. The proposed system adopts a translation-centric architecture in which Bengali user queries are translated into English, enriched through domain-specific keyword injection to align colloquial farmer terminology with scientific nomenclature, and answered via dense vector retrieval over a curated corpus of English agricultural manuals (FAO, IRRI). The generated English response is subsequently translated back into Bengali to ensure accessibility. The system is implemented entirely using open-source models and operates on consumer-grade hardware without reliance on paid APIs. Experimental evaluation demonstrates reliable source-grounded responses, robust rejection of out-of-domain queries, and an average end-to-end latency below 20 seconds. The results indicate that cross-lingual retrieval combined with controlled translation offers a practical and scalable solution for agricultural knowledge access in low-resource language settings
zh

[NLP-18] Simulated Reasoning is Reasoning

【速读】: 该论文旨在解决当前基础模型(Foundational Models, FM)在推理能力上的本质问题,即这些模型虽能通过模仿“自言自语式思考”(thinking out loud)实现问题求解,但其推理过程缺乏人类认知中的常识性根基(grounding and common sense),从而导致推理结果的脆弱性(brittleness)。论文指出,传统将FM类比为“随机鹦鹉”(stochastic parrot)的隐喻已不再适用,因为FM展现出的是一种不同于人类符号推理(symbolic reasoning)的新形式推理。解决方案的关键在于重新理解推理的本质及其必要条件,并基于此构建更安全、鲁棒的防御机制,以应对FM推理过程中因缺乏常识和情境依赖所引发的风险。

链接: https://arxiv.org/abs/2601.02043
作者: Hendrik Kempt,Alon Lavie
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 21 pages

点击查看摘要

Abstract:Reasoning has long been understood as a pathway between stages of understanding. Proper reasoning leads to understanding of a given subject. This reasoning was conceptualized as a process of understanding in a particular way, i.e., “symbolic reasoning”. Foundational Models (FM) demonstrate that this is not a necessary condition for many reasoning tasks: they can “reason” by way of imitating the process of “thinking out loud”, testing the produced pathways, and iterating on these pathways on their own. This leads to some form of reasoning that can solve problems on its own or with few-shot learning, but appears fundamentally different from human reasoning due to its lack of grounding and common sense, leading to brittleness of the reasoning process. These insights promise to substantially alter our assessment of reasoning and its necessary conditions, but also inform the approaches to safety and robust defences against this brittleness of FMs. This paper offers and discusses several philosophical interpretations of this phenomenon, argues that the previously apt metaphor of the “stochastic parrot” has lost its relevance and thus should be abandoned, and reflects on different normative elements in the safety- and appropriateness-considerations emerging from these reasoning models and their growing capacity.
zh

[NLP-19] Output Embedding Centering for Stable LLM Pretraining

【速读】: 该论文旨在解决大规模语言模型预训练过程中因大学习率导致的输出logit发散(output logit divergence)问题,这是一种常见的训练不稳定性现象。现有主流缓解策略z-loss仅针对症状而非根本原因,无法有效保障训练稳定性和学习率鲁棒性。论文从输出嵌入(output embeddings)几何结构的角度分析了该不稳定性的成因,并提出一种新的解决方案——输出嵌入中心化(Output Embedding Centering, OEC),其核心在于通过将输出嵌入向量中心化(如μ-centering的确定性操作或μ-loss的正则化方法)来抑制logit发散。实验表明,OEC在训练稳定性和学习率敏感度上均优于z-loss,尤其能在z-loss失效时仍确保收敛,且μ-loss对超参数调优的依赖显著低于z-loss。

链接: https://arxiv.org/abs/2601.02031
作者: Felix Stollenwerk,Anna Lokrantz,Niclas Hertzberg
机构: AI Sweden
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 11 pages, 5 figures

点击查看摘要

Abstract:Pretraining of large language models is not only expensive but also prone to certain training instabilities. A specific instability that often occurs for large learning rates at the end of training is output logit divergence. The most widely used mitigation strategy, z-loss, merely addresses the symptoms rather than the underlying cause of the problem. In this paper, we analyze the instability from the perspective of the output embeddings’ geometry and identify its cause. Based on this, we propose output embedding centering (OEC) as a new mitigation strategy, and prove that it suppresses output logit divergence. OEC can be implemented in two different ways, as a deterministic operation called \mu-centering, or a regularization method called \mu-loss. Our experiments show that both variants outperform z-loss in terms of training stability and learning rate sensitivity. In particular, they ensure that training converges even for large learning rates when z-loss fails. Furthermore, we find that \mu-loss is significantly less sensitive to regularization hyperparameter tuning than z-loss.
zh

[NLP-20] Not All Needles Are Found: How Fact Distribution and Dont Make It Up Prompts Shape Literal Extraction Logical Inference and Hallucination Risks in Long-Context LLM s

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理超长输入上下文时,信息提取与推理能力不稳定的问题,特别是当事实信息在文档中分布稀疏或位置偏移时,模型性能显著下降。其核心挑战在于:尽管LLMs支持长上下文,但其实际表现高度依赖于事实的放置方式、语料库中的事实分布模式以及是否使用“不要编造”(Don’t Make It Up)类提示来抑制幻觉。解决方案的关键在于构建一个扩展的“针在 haystack 中”基准测试(needle-in-a-haystack benchmark),涵盖四个生产级模型(Gemini-2.5-flash、ChatGPT-5-mini、Claude-4.5-haiku 和 Deepseek-v3.2-chat),并区分评估字面提取(literal extraction)、逻辑推理(logical inference)和幻觉风险(hallucination risk)。研究发现,单纯增加上下文长度并不能提升性能,反而可能因证据稀释而恶化;此外,抗幻觉提示虽能降低虚构风险,却可能导致模型过度保守,损害准确性。因此,关键洞见是:模型的有效上下文利用能力(而非仅上下文长度)才是决定可靠性的核心因素,这对企业级应用中直接插入大量未过滤文档的场景具有重要实践意义。

链接: https://arxiv.org/abs/2601.02023
作者: Amirali Ebrahimzadeh,Seyyed M. Salili
机构: University of Michigan, Ann Arbor, MI 48109, USA (密歇根大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 25 pages, 8 figures, 3 tables

点击查看摘要

Abstract:Large language models (LLMs) increasingly support very long input contexts. Yet it remains unclear how reliably they extract and infer information at scale. Performance varies with context length and strongly interacts with how information is distributed in real-world corpora. Motivated by these observations, we study how fact placement, corpus-level fact distributions, and Don’t Make It Up prompts influence model behavior. We introduce an extended needle-in-a-haystack benchmark across four production-scale models: Gemini-2.5-flash, ChatGPT-5-mini, Claude-4.5-haiku, and Deepseek-v3.2-chat. Unlike prior work, we separately evaluate literal extraction, logical inference, and hallucination risk. Our study considers both positional effects and realistic distributions of evidence across long contexts, as well as prompts that explicitly discourage fabrication. We find that longer contexts alone do not guarantee better performance and can be detrimental when relevant evidence is diluted or widely dispersed. Performance varies substantially across models: some show severe degradation under realistic conditions, while others remain more robust at longer context lengths. Anti-hallucination (AH) instructions can make some models overly conservative, sharply reducing accuracy in literal extraction and logical inference. While we do not directly compare retrieval-augmented generation (RAG) and cache-augmented generation (CAG), our results suggest many failures stem from ineffective context utilization. Models often struggle to identify and prioritize relevant information even when it is present. These findings have direct practical implications, as enterprise workflows increasingly involve pasting large volumes of unfiltered documents into LLM prompts. Effective context length and model-specific robustness to long contexts are therefore critical for reliable LLM deployment in research and business.
zh

[NLP-21] Surprisal and Metaphor Novelty: Moderate Correlations and Divergent Scaling Effects EACL2026

【速读】: 该论文旨在解决语言模型(Language Models, LMs)在理解新颖隐喻(novel metaphor)时,其预测不确定性(即 surprisal)是否能有效反映人类对隐喻新颖性的标注这一问题。解决方案的关键在于通过对比分析16种不同规模的语言模型在基于语料库和人工构造的隐喻新颖性数据集上的surprisal值与人类标注分数之间的相关性,并引入一种基于完整句子上下文的cloze-style surprisal计算方法,从而揭示出surprisal作为衡量隐喻新颖性指标的局限性及其与模型规模之间的非线性关系——具体表现为:在语料库数据上呈现“反向缩放效应”(inverse scaling effect),而在合成数据上则符合“质量-功率假说”(Quality-Power Hypothesis)。

链接: https://arxiv.org/abs/2601.02015
作者: Omar Momen,Emilie Sitter,Berenike Herrmann,Sina Zarrieß
机构: Bielefeld University (比勒费尔德大学); CRC 1646 – Linguistic Creativity in Communication (语言创造力在交流中的CRC 1646)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注: to be published at EACL 2026 main conference

点击查看摘要

Abstract:Novel metaphor comprehension involves complex semantic processes and linguistic creativity, making it an interesting task for studying language models (LMs). This study investigates whether surprisal, a probabilistic measure of predictability in LMs, correlates with different metaphor novelty datasets. We analyse surprisal from 16 LM variants on corpus-based and synthetic metaphor novelty datasets. We explore a cloze-style surprisal method that conditions on full-sentence context. Results show that LMs yield significant moderate correlations with scores/labels of metaphor novelty. We further identify divergent scaling patterns: on corpus-based data, correlation strength decreases with model size (inverse scaling effect), whereas on synthetic data it increases (Quality-Power Hypothesis). We conclude that while surprisal can partially account for annotations of metaphor novelty, it remains a limited metric of linguistic creativity.
zh

[NLP-22] Exploring Approaches for Detecting Memorization of Recommender System Data in Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在推荐场景中可能因训练数据泄露而导致的信息安全问题,特别是针对模型是否记忆了特定数据集(如MovieLens-1M)以及如何自动化检测和提取这些记忆内容的问题。解决方案的关键在于系统性地评估三种不同策略:(i) 攻击性提示工程(jailbreak prompt engineering),(ii) 基于内部激活的无监督潜在知识发现(通过Contrast-Consistent Search和Cluster-Norm实现),以及(iii) 自动提示工程(Automatic Prompt Engineering, APE),其中APE将提示发现建模为元学习过程并迭代优化候选指令,实验表明该方法在自动提取记忆样本方面最具潜力。

链接: https://arxiv.org/abs/2601.02002
作者: Antonio Colacicco,Vito Guida,Dario Di Palma,Fedelucio Narducci,Tommaso Di Noia
机构: Politecnico di Bari (巴里理工大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly applied in recommendation scenarios due to their strong natural language understanding and generation capabilities. However, they are trained on vast corpora whose contents are not publicly disclosed, raising concerns about data leakage. Recent work has shown that the MovieLens-1M dataset is memorized by both the LLaMA and OpenAI model families, but the extraction of such memorized data has so far relied exclusively on manual prompt engineering. In this paper, we pose three main questions: Is it possible to enhance manual prompting? Can LLM memorization be detected through methods beyond manual prompting? And can the detection of data leakage be automated? To address these questions, we evaluate three approaches: (i) jailbreak prompt engineering; (ii) unsupervised latent knowledge discovery, probing internal activations via Contrast-Consistent Search (CCS) and Cluster-Norm; and (iii) Automatic Prompt Engineering (APE), which frames prompt discovery as a meta-learning process that iteratively refines candidate instructions. Experiments on MovieLens-1M using LLaMA models show that jailbreak prompting does not improve the retrieval of memorized items and remains inconsistent; CCS reliably distinguishes genuine from fabricated movie titles but fails on numerical user and rating data; and APE retrieves item-level information with moderate success yet struggles to recover numerical interactions. These findings suggest that automatically optimizing prompts is the most promising strategy for extracting memorized samples.
zh

[NLP-23] Exploring Diversity Novelty and Popularity Bias in ChatGPT s Recommendations

【速读】: 该论文旨在解决当前生成式 AI(如 ChatGPT)在推荐系统(Recommender Systems, RSs)中应用时,对多样性、新颖性和流行度偏差(popularity bias)等非准确性指标缺乏系统评估的问题。现有研究主要聚焦于模型的推荐准确率,而忽视了这些维度对长期个性化和用户体验的重要性。解决方案的关键在于通过多数据集实验,在 Top-N 推荐和冷启动场景下,定量评估 ChatGPT-3.5 和 ChatGPT-4 在多样性、新颖性及流行度偏差方面的表现,结果表明 ChatGPT-4 在平衡新颖性与多样性方面可媲美甚至超越传统推荐算法,并在冷启动场景中展现出更高的准确率与新颖性,凸显其在提升用户满意度和解决新用户推荐难题上的潜力。

链接: https://arxiv.org/abs/2601.01997
作者: Dario Di Palma,Giovanni Maria Biancofiore,Vito Walter Anelli,Fedelucio Narducci,Tommaso Di Noia
机构: Politecnico di Bari (巴里理工大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:ChatGPT has emerged as a versatile tool, demonstrating capabilities across diverse domains. Given these successes, the Recommender Systems (RSs) community has begun investigating its applications within recommendation scenarios primarily focusing on accuracy. While the integration of ChatGPT into RSs has garnered significant attention, a comprehensive analysis of its performance across various dimensions remains largely unexplored. Specifically, the capabilities of providing diverse and novel recommendations or exploring potential biases such as popularity bias have not been thoroughly examined. As the use of these models continues to expand, understanding these aspects is crucial for enhancing user satisfaction and achieving long-term personalization. This study investigates the recommendations provided by ChatGPT-3.5 and ChatGPT-4 by assessing ChatGPT’s capabilities in terms of diversity, novelty, and popularity bias. We evaluate these models on three distinct datasets and assess their performance in Top-N recommendation and cold-start scenarios. The findings reveal that ChatGPT-4 matches or surpasses traditional recommenders, demonstrating the ability to balance novelty and diversity in recommendations. Furthermore, in the cold-start scenario, ChatGPT models exhibit superior performance in both accuracy and novelty, suggesting they can be particularly beneficial for new users. This research highlights the strengths and limitations of ChatGPT’s recommendations, offering new perspectives on the capacity of these models to provide recommendations beyond accuracy-focused metrics. Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2601.01997 [cs.IR] (or arXiv:2601.01997v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2601.01997 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-24] Hidden State Poisoning Attacks against Mamba-based Language Models ACL2026

【速读】: 该论文旨在解决状态空间模型(State Space Models, SSMs)在对抗攻击下的鲁棒性问题,特别是针对一种新型的“隐藏状态投毒攻击”(Hidden State Poisoning Attack, HiSPA),该攻击通过特定短语引发模型隐藏状态的信息丢失,导致其产生部分遗忘效应。解决方案的关键在于识别并验证HiSPA对SSMs的显著破坏性——例如,在RoBench25基准上,即使是近期的52B参数混合SSM-Transformer模型(Jamba)也因优化后的HiSPA触发词而失效,且该攻击同样削弱其在Open-Prompt-Injections基准上的表现;同时,研究通过可解释性分析揭示了Mamba模型在遭受HiSPA时隐藏层中的规律性模式,为构建针对性的防御机制提供了理论基础和实践路径。

链接: https://arxiv.org/abs/2601.01972
作者: Alexandre Le Mercier,Chris Develder,Thomas Demeester
机构: IDLab–T2K, Ghent University–imec (IDLab–T2K, 根特大学–imec)
类目: Computation and Language (cs.CL)
备注: 17 pages, 4 figures. Submitted to ACL 2026

点击查看摘要

Abstract:State space models (SSMs) like Mamba offer efficient alternatives to Transformer-based language models, with linear time complexity. Yet, their adversarial robustness remains critically unexplored. This paper studies the phenomenon whereby specific short input phrases induce a partial amnesia effect in such models, by irreversibly overwriting information in their hidden states, referred to as a Hidden State Poisoning Attack (HiSPA). Our benchmark RoBench25 allows evaluating a model’s information retrieval capabilities when subject to HiSPAs, and confirms the vulnerability of SSMs against such attacks. Even a recent 52B hybrid SSM-Transformer model from the Jamba family collapses on RoBench25 under optimized HiSPA triggers, whereas pure Transformers do not. We also observe that HiSPA triggers significantly weaken the Jamba model on the popular Open-Prompt-Injections benchmark, unlike pure Transformers. Finally, our interpretability study reveals patterns in Mamba’s hidden layers during HiSPAs that could be used to build a HiSPA mitigation system. The full code and data to reproduce the experiments can be found at this https URL.
zh

[NLP-25] CSF: Contrastive Semantic Features for Direct Multilingual Sign Language Generation

【速读】: 该论文旨在解决现有手语翻译系统普遍依赖英语作为中介语言所导致的非英语使用者在聋人社区中面临的技术壁垒问题。其解决方案的关键在于提出了一种语言无关的语义表示框架——规范语义形式(Canonical Semantic Form, CSF),该框架将话语分解为九个通用语义槽(event, intent, time, condition, agent, object, location, purpose, modifier),并构建了一个包含35类条件类型的综合性条件分类体系,从而实现任意源语言到手语的直接翻译,无需英语中介。该方法通过一个轻量级Transformer提取器(模型大小仅0.74 MB)实现了跨四种语言(英语、越南语、日语和法语)的平均99.03%语义槽提取准确率,尤其在复杂条件分类任务上达到99.4%准确率,且推理延迟仅为3.02ms(CPU),支持浏览器端实时手语生成。

链接: https://arxiv.org/abs/2601.01964
作者: Tran Sy Bao
机构: 未知
类目: Computation and Language (cs.CL)
备注: 9 pages, 8 tables, code available at this https URL

点击查看摘要

Abstract:Sign language translation systems typically require English as an intermediary language, creating barriers for non-English speakers in the global deaf community. We present Canonical Semantic Form (CSF), a language-agnostic semantic representation framework that enables direct translation from any source language to sign language without English mediation. CSF decomposes utterances into nine universal semantic slots: event, intent, time, condition, agent, object, location, purpose, and modifier. A key contribution is our comprehensive condition taxonomy comprising 35 condition types across eight semantic categories, enabling nuanced representation of conditional expressions common in everyday communication. We train a lightweight transformer-based extractor (0.74 MB) that achieves 99.03% average slot extraction accuracy across four typologically diverse languages: English, Vietnamese, Japanese, and French. The model demonstrates particularly strong performance on condition classification (99.4% accuracy) despite the 35-class complexity. With inference latency of 3.02ms on CPU, our approach enables real-time sign language generation in browser-based applications. We release our code, trained models, and multilingual dataset to support further research in accessible sign language technology.
zh

[NLP-26] he Invisible Hand of AI Libraries Shaping Open Source Projects and Communities

【速读】: 该论文旨在解决当前对人工智能(Artificial Intelligence, AI)在开源软件(Open Source Software, OSS)项目中采用情况及其影响的研究不足问题,特别是AI库在Python和Java OSS项目中的整合如何重塑开发实践、技术生态与社区参与度。其解决方案的关键在于开展一项大规模分析,基于157,700个潜在OSS仓库,利用仓库级指标与软件度量方法,系统比较采纳AI库与未采纳AI库的项目在开发活跃度、社区互动性和代码复杂性等方面的差异,从而提供实证依据以揭示AI集成对软件工程实践的结构性影响。

链接: https://arxiv.org/abs/2601.01944
作者: Matteo Esposito,Andrea Janes,Valentina Lenarduzzi,Davide Taibi
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Programming Languages (cs.PL)
备注: ACCEPTED REGISTERED REPORT AT SANER (CORE A*) 2026

点击查看摘要

Abstract:In the early 1980s, Open Source Software emerged as a revolutionary concept amidst the dominance of proprietary software. What began as a revolutionary idea has now become the cornerstone of computer science. Amidst OSS projects, AI is increasing its presence and relevance. However, despite the growing popularity of AI, its adoption and impacts on OSS projects remain underexplored. We aim to assess the adoption of AI libraries in Python and Java OSS projects and examine how they shape development, including the technical ecosystem and community engagement. To this end, we will perform a large-scale analysis on 157.7k potential OSS repositories, employing repository metrics and software metrics to compare projects adopting AI libraries against those that do not. We expect to identify measurable differences in development activity, community engagement, and code complexity between OSS projects that adopt AI libraries and those that do not, offering evidence-based insights into how AI integration reshapes software development practices. Comments: ACCEPTED REGISTERED REPORT AT SANER (CORE A*) 2026 Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Programming Languages (cs.PL) Cite as: arXiv:2601.01944 [cs.SE] (or arXiv:2601.01944v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2601.01944 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-27] ackling the Inherent Difficulty of Noise Filtering in RAG

【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中因引入噪声或无关文档而导致大语言模型(Large Language Models, LLMs)性能下降甚至产生幻觉的问题。现有方法难以完全过滤无关内容,且标准微调策略受限于注意力机制的结构特性,无法有效引导模型选择性地利用相关信息并忽略无关内容。解决方案的关键在于提出一种新颖的微调方法,专门设计用于提升LLM在检索到的文档中区分相关信息与无关信息的能力,从而显著增强模型对噪声的鲁棒性和整体性能。

链接: https://arxiv.org/abs/2601.01896
作者: Jingyu Liu,Jiaen Lin,Yong Liu
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); Beijing Key Laboratory of Big Data Management and Analysis Methods (北京市大数据管理与分析方法重点实验室); School of Software Tsinghua University (清华大学软件学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has become a widely adopted approach to enhance Large Language Models (LLMs) by incorporating external knowledge and reducing hallucinations. However, noisy or irrelevant documents are often introduced during RAG, potentially degrading performance and even causing hallucinated outputs. While various methods have been proposed to filter out such noise, we argue that identifying irrelevant information from retrieved content is inherently difficult and limited number of transformer layers can hardly solve this. Consequently, retrievers fail to filter out irrelevant documents entirely. Therefore, LLMs must be robust against such noise, but we demonstrate that standard fine-tuning approaches are often ineffective in enabling the model to selectively utilize relevant information while ignoring irrelevant content due to the structural constraints of attention patterns. To address this, we propose a novel fine-tuning method designed to enhance the model’s ability to distinguish between relevant and irrelevant information within retrieved documents. Extensive experiments across multiple benchmarks show that our approach significantly improves the robustness and performance of LLMs.
zh

[NLP-28] Agent ic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)代理在长程推理任务中因上下文窗口有限而导致的长期记忆(Long-Term Memory, LTM)与短期记忆(Short-Term Memory, STM)管理效率低下问题。现有方法通常将LTM和STM作为独立模块处理,依赖启发式规则或辅助控制器,限制了模型的自适应性和端到端优化能力。其解决方案的关键在于提出一种统一的记忆框架——Agentic Memory (AgeMem),将LTM和STM的管理直接嵌入代理策略中,通过工具化动作(tool-based actions)形式暴露存储、检索、更新、总结和丢弃等记忆操作,使LLM代理能够自主决策何时以及如何操作记忆内容。同时,论文设计了一个三阶段渐进式强化学习训练策略和步进式GRPO算法,以应对由记忆操作引发的稀疏且不连续奖励问题,从而实现更高效、高质量的长程推理性能。

链接: https://arxiv.org/abs/2601.01885
作者: Yi Yu,Liuyi Yao,Yuexiang Xie,Qingquan Tan,Jiaqi Feng,Yaliang Li,Libing Wu
机构: Alibaba Group(阿里巴巴集团); School of Cyber Science and Engineering, Wuhan University(武汉大学网络科学与工程学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language model (LLM) agents face fundamental limitations in long-horizon reasoning due to finite context windows, making effective memory management critical. Existing methods typically handle long-term memory (LTM) and short-term memory (STM) as separate components, relying on heuristics or auxiliary controllers, which limits adaptability and end-to-end optimization. In this paper, we propose Agentic Memory (AgeMem), a unified framework that integrates LTM and STM management directly into the agent’s policy. AgeMem exposes memory operations as tool-based actions, enabling the LLM agent to autonomously decide what and when to store, retrieve, update, summarize, or discard information. To train such unified behaviors, we propose a three-stage progressive reinforcement learning strategy and design a step-wise GRPO to address sparse and discontinuous rewards induced by memory operations. Experiments on five long-horizon benchmarks demonstrate that AgeMem consistently outperforms strong memory-augmented baselines across multiple LLM backbones, achieving improved task performance, higher-quality long-term memory, and more efficient context usage.
zh

[NLP-29] DermoGPT : Open Weights and Open Data for Morphology-Grounded Dermatological Reasoning MLLM s

【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在皮肤科领域应用中面临的三大瓶颈:训练数据稀缺、任务覆盖范围狭窄以及缺乏以临床专家诊断流程为基准的监督信号。其解决方案的关键在于构建一个系统性框架,包含三个核心组件:首先,提出DermoInstruct——一个大规模形态学锚定的指令语料库,涵盖211,243张图像和772,675条诊断轨迹,完整映射从形态观察到最终诊断的全流程;其次,建立DermoBench基准测试体系,涵盖11项任务及四个临床维度(形态、诊断、推理与公平性),并提供3,600个专家验证的开放问答实例和人类性能基线;最后,开发DermoGPT模型,通过监督微调结合形态锚定视觉推理一致性(Morphologically-Anchored Visual-Inference-Consistent, MAVIC)强化学习目标,确保视觉观察与诊断结论的一致性,并在推理阶段引入置信度一致性测试时自适应(Confidence-Consistency Test-time adaptation, CCT)机制,从而显著提升模型鲁棒性和准确性,实现对16个代表性基线的全面超越。

链接: https://arxiv.org/abs/2601.01868
作者: Jinghan Ru,Siyuan Yan,Yuguo Yin,Yuexian Zou,Zongyuan Ge
机构: Peking University (北京大学); Monash University (莫纳什大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) show promise for medical applications, yet progress in dermatology lags due to limited training data, narrow task coverage, and lack of clinically-grounded supervision that mirrors expert diagnostic workflows. We present a comprehensive framework to address these gaps. First, we introduce DermoInstruct, a large-scale morphology-anchored instruction corpus comprising 211,243 images and 772,675 trajectories across five task formats, capturing the complete diagnostic pipeline from morphological observation and clinical reasoning to final diagnosis. Second, we establish DermoBench, a rigorous benchmark evaluating 11 tasks across four clinical axes: Morphology, Diagnosis, Reasoning, and Fairness, including a challenging subset of 3,600 expert-verified open-ended instances and human performance baselines. Third, we develop DermoGPT, a dermatology reasoning MLLM trained via supervised fine-tuning followed by our Morphologically-Anchored Visual-Inference-Consistent (MAVIC) reinforcement learning objective, which enforces consistency between visual observations and diagnostic conclusions. At inference, we deploy Confidence-Consistency Test-time adaptation (CCT) for robust predictions. Experiments show DermoGPT significantly outperforms 16 representative baselines across all axes, achieving state-of-the-art performance while substantially narrowing the human-AI gap. DermoInstruct, DermoBench and DermoGPT will be made publicly available at this https URL upon acceptance.
zh

[NLP-30] Judging with Personality and Confidence: A Study on Personality-Conditioned LLM Relevance Assessment

【速读】: 该论文旨在解决生成式 AI(Generative AI)在Web搜索中的相关性判断与置信度校准问题,特别是如何通过模拟大五人格特质(Big Five personality traits)来提升语言模型的决策可靠性。现有研究虽表明提示工程可使大语言模型(LLMs)模拟特定人格特征,但缺乏对其在关键搜索任务如相关性评估和信心校准方面影响的系统理解。论文的关键解决方案是:基于不同人格条件下的模型输出构建个性化的相关性评分与置信度分布特征,并将其作为输入引入随机森林分类器,在有限训练数据下实现了优于单一人格条件的性能表现,验证了人格引导的置信度作为互补预测信号的有效性,从而为更可靠且符合人类偏好的LLM评估体系提供新路径。

链接: https://arxiv.org/abs/2601.01862
作者: Nuo Chen,Hanpei Fang,Piaohong Wang,Jiqun Liu,Tetsuya Sakai,Xiao-Ming Wu
机构: The Hong Kong Polytechnic University(香港理工大学); Waseda University(早稻田大学); City University of Hong Kong(香港城市大学); The University of Oklahoma(俄克拉荷马大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Recent studies have shown that prompting can enable large language models (LLMs) to simulate specific personality traits and produce behaviors that align with those traits. However, there is limited understanding of how these simulated personalities influence critical web search decisions, specifically relevance assessment. Moreover, few studies have examined how simulated personalities impact confidence calibration, specifically the tendencies toward overconfidence or underconfidence. This gap exists even though psychological literature suggests these biases are trait-specific, often linking high extraversion to overconfidence and high neuroticism to underconfidence. To address this gap, we conducted a comprehensive study evaluating multiple LLMs, including commercial models and open-source models, prompted to simulate Big Five personality traits. We tested these models across three test collections (TREC DL 2019, TREC DL 2020, and LLMJudge), collecting two key outputs for each query-document pair: a relevance judgment and a self-reported confidence score. The findings show that personalities such as low agreeableness consistently align more closely with human labels than the unprompted condition. Additionally, low conscientiousness performs well in balancing the suppression of both overconfidence and underconfidence. We also observe that relevance scores and confidence distributions vary systematically across different personalities. Based on the above findings, we incorporate personality-conditioned scores and confidence as features in a random forest classifier. This approach achieves performance that surpasses the best single-personality condition on a new dataset (TREC DL 2021), even with limited training data. These findings highlight that personality-derived confidence offers a complementary predictive signal, paving the way for more reliable and human-aligned LLM evaluators. Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR) Cite as: arXiv:2601.01862 [cs.CL] (or arXiv:2601.01862v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2601.01862 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-31] owards Automated Lexicography: Generating and Evaluating Definitions for Learners Dictionaries

【速读】: 该论文旨在解决学习者词典定义生成(Learner’s Dictionary Definition Generation, LDDG)问题,即自动生成简洁、易懂的词汇定义,以降低人工编写词典定义的成本。其核心挑战在于如何在保持语义准确性的同时确保所生成定义的词汇简单性,从而满足语言学习者的需求。解决方案的关键在于提出一种基于大语言模型(Large Language Model, LLM)的迭代简化方法(iterative simplification),通过LLM逐步优化定义表达,在保证语义完整性的同时持续降低词汇复杂度;同时,作者构建了一个由专业词典学家协作标注的日语数据集,并设计了一种基于LLM-as-a-judge的新评估框架,使自动化评估结果与人工评判高度一致,从而有效支撑了LDDG任务的系统性研究和性能验证。

链接: https://arxiv.org/abs/2601.01842
作者: Yusuke Ide,Adam Nohejl,Joshua Tanner,Hitomi Yanaka,Christopher Lindsay,Taro Watanabe
机构: Nara Institute of Science and Technology (奈良科学技术大学院大学); RIKEN (理化学研究所); Resolve Research; The University of Tokyo (东京大学); Tohoku University (东北大学); Serpenti Sei Japan
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We study dictionary definition generation (DDG), i.e., the generation of non-contextualized definitions for given headwords. Dictionary definitions are an essential resource for learning word senses, but manually creating them is costly, which motivates us to automate the process. Specifically, we address learner’s dictionary definition generation (LDDG), where definitions should consist of simple words. First, we introduce a reliable evaluation approach for DDG, based on our new evaluation criteria and powered by an LLM-as-a-judge. To provide reference definitions for the evaluation, we also construct a Japanese dataset in collaboration with a professional lexicographer. Validation results demonstrate that our evaluation approach agrees reasonably well with human annotators. Second, we propose an LDDG approach via iterative simplification with an LLM. Experimental results indicate that definitions generated by our approach achieve high scores on our criteria while maintaining lexical simplicity.
zh

[NLP-32] Emergent Introspective Awareness in Large Language Models

【速读】: 该论文旨在解决大语言模型是否具备对其内部状态进行自我反思(introspection)的能力这一问题。由于仅通过对话难以区分真实的自我觉察与虚假编造(confabulations),作者提出了一种基于激活空间干预的实验方法:向模型的内部激活中注入已知概念的表征,并测量这些操作对模型自述状态的影响。关键在于利用可控的神经激活扰动来观察模型能否识别并报告这些注入内容,从而间接验证其内部状态的可感知性和可追溯性。实验表明,部分先进模型(如Claude Opus 4和4.1)能在特定情境下识别注入概念、区分自身输出与人工前缀,并在指令驱动下调节内部激活,证明当前语言模型具备一定程度的功能性自我觉察能力,尽管该能力仍具高度不稳定性与情境依赖性。

链接: https://arxiv.org/abs/2601.01828
作者: Jack Lindsey
机构: Anthropic
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We investigate whether large language models can introspect on their internal states. It is difficult to answer this question through conversation alone, as genuine introspection cannot be distinguished from confabulations. Here, we address this challenge by injecting representations of known concepts into a model’s activations, and measuring the influence of these manipulations on the model’s self-reported states. We find that models can, in certain scenarios, notice the presence of injected concepts and accurately identify them. Models demonstrate some ability to recall prior internal representations and distinguish them from raw text inputs. Strikingly, we find that some models can use their ability to recall prior intentions in order to distinguish their own outputs from artificial prefills. In all these experiments, Claude Opus 4 and 4.1, the most capable models we tested, generally demonstrate the greatest introspective awareness; however, trends across models are complex and sensitive to post-training strategies. Finally, we explore whether models can explicitly control their internal representations, finding that models can modulate their activations when instructed or incentivized to “think about” a concept. Overall, our results indicate that current language models possess some functional introspective awareness of their own internal states. We stress that in today’s models, this capacity is highly unreliable and context-dependent; however, it may continue to develop with further improvements to model capabilities.
zh

[NLP-33] Aspect Extraction from E-Commerce Product and Service Reviews

【速读】: 该论文旨在解决在低资源和代码混用(code-switched)场景下,如菲律宾电商评论中常见的Taglish(塔加洛语与英语混合)语境中,情感分析中的方面抽取(Aspect Extraction, AE)任务难以有效实施的问题。其解决方案的关键在于构建一个综合性的AE流水线,融合规则驱动、大语言模型(Large Language Model, LLM)生成与微调技术,并引入一种基于多方法主题建模的分层方面框架(Hierarchical Aspect Framework, HAF)以及双模式标注方案以区分显式与隐式方面。实验表明,采用生成式LLM(Gemini 2.0 Flash)的方法在所有任务中均取得最高性能(Macro F1达0.91),尤其在处理隐式方面时表现优异,凸显了生成式AI在复杂语言环境下的适应性和有效性。

链接: https://arxiv.org/abs/2601.01827
作者: Valiant Lance D. Dionela,Fatima Kriselle S. Dy,Robin James M. Hombrebueno,Aaron Rae M. Nicolas,Charibeth K. Cheng,Raphael W. Gonda
机构: De La Salle University (德拉萨大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Aspect Extraction (AE) is a key task in Aspect-Based Sentiment Analysis (ABSA), yet it remains difficult to apply in low-resource and code-switched contexts like Taglish, a mix of Tagalog and English commonly used in Filipino e-commerce reviews. This paper introduces a comprehensive AE pipeline designed for Taglish, combining rule-based, large language model (LLM)-based, and fine-tuning techniques to address both aspect identification and extraction. A Hierarchical Aspect Framework (HAF) is developed through multi-method topic modeling, along with a dual-mode tagging scheme for explicit and implicit aspects. For aspect identification, four distinct models are evaluated: a Rule-Based system, a Generative LLM (Gemini 2.0 Flash), and two Fine-Tuned Gemma-3 1B models trained on different datasets (Rule-Based vs. LLM-Annotated). Results indicate that the Generative LLM achieved the highest performance across all tasks (Macro F1 0.91), demonstrating superior capability in handling implicit aspects. In contrast, the fine-tuned models exhibited limited performance due to dataset imbalance and architectural capacity constraints. This work contributes a scalable and linguistically adaptive framework for enhancing ABSA in diverse, code-switched environments.
zh

[NLP-34] CSCBench: A PVC Diagnostic Benchmark for Commodity Supply Chain Reasoning

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在商品供应链(Commodity Supply Chains, CSCs)领域中推理能力不足的问题,尤其是其在受制度规则体系和可行性约束驱动的复杂场景下的表现尚不明确。解决方案的关键在于构建了一个名为CSCBench的2.3K+单选题基准测试集,并提出PVC三维度评估框架(Process、Variety、Cognition),其中Process轴对应SCOR+Enable流程阶段,Variety轴刻画商品特异性规则系统及其物质-信息-金融耦合约束,Cognition轴基于布卢姆修订版认知分类法设计多层推理任务。实证表明,LLMs在Process与Cognition维度表现良好,但在Variety维度(特别是货运协议相关任务)显著退化,从而为诊断和提升LLM在高风险供应链场景中的推理能力提供了可量化、结构化的评估工具。

链接: https://arxiv.org/abs/2601.01825
作者: Yaxin Cui,Yuanqiang Zeng,Jiapeng Yan,Keling Lin,Kai Ji,Jianhui Zeng,Sheng Zhang,Xin Luo,Binzhu Su,Chaolai Shen,Jiahao Yu
机构: Xiamen SmartChain Innovations Co., Ltd.(厦门智链创新科技有限公司); Xiamen ITG Digital Technology Co., Ltd.(厦门ITG数字科技有限公司); Xiamen C&D Co., Ltd.(厦门C&D有限公司); Xiamen Xiangyu Co., Ltd.(厦门翔宇有限公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable success in general benchmarks, yet their competence in commodity supply chains (CSCs) – a domain governed by institutional rule systems and feasibility constraints – remains under-explored. CSC decisions are shaped jointly by process stages (e.g., planning, procurement, delivery), variety-specific rules (e.g., contract specifications and delivery grades), and reasoning depth (from retrieval to multi-step analysis and decision selection). We introduce CSCBench, a 2.3K+ single-choice benchmark for CSC reasoning, instantiated through our PVC 3D Evaluation Framework (Process, Variety, and Cognition). The Process axis aligns tasks with SCOR+Enable; the Variety axis operationalizes commodity-specific rule systems under coupled material-information-financial constraints, grounded in authoritative exchange guidebooks/rulebooks and industry reports; and the Cognition axis follows Bloom’s revised taxonomy. Evaluating representative LLMs under a direct prompting setting, we observe strong performance on the Process and Cognition axes but substantial degradation on the Variety axis, especially on Freight Agreements. CSCBench provides a diagnostic yardstick for measuring and improving LLM capabilities in this high-stakes domain.
zh

[NLP-35] HyperCLOVA X 8B Omni

【速读】: 该论文旨在解决多模态模型在实际应用中因模态间割裂而导致的泛化能力不足问题,即如何实现文本、音频和视觉模态之间的任意双向交互(any-to-any omni interaction)。其解决方案的关键在于构建一个统一的8B规模的端到端生成式AI(Generative AI)模型——HyperCLOVA X 8B Omni,通过共享的下一个词预测接口(next-token prediction interface)将交错排列的多模态序列进行统一建模,并利用视觉与音频编码器注入连续嵌入(continuous embeddings)以实现细粒度理解与跨模态对齐,从而在不依赖特定模态管道的前提下,支持所有模态作为输入或输出的灵活组合。

链接: https://arxiv.org/abs/2601.01792
作者: NAVER Cloud HyperCLOVA X Team
机构: NAVER Cloud (NAVER云); HyperCLOVA X Team (HyperCLOVA X团队)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注: Technical Report

点击查看摘要

Abstract:In this report, we present HyperCLOVA X 8B Omni, the first any-to-any omnimodal model in the HyperCLOVA X family that supports text, audio, and vision as both inputs and outputs. By consolidating multimodal understanding and generation into a single model rather than separate modality-specific pipelines, HyperCLOVA X 8B Omni serves as an 8B-scale omni-pathfinding point toward practical any-to-any omni assistants. At a high level, the model unifies modalities through a shared next-token prediction interface over an interleaved multimodal sequence, while vision and audio encoders inject continuous embeddings for fine-grained understanding and grounding. Empirical evaluations demonstrate competitive performance against comparably sized models across diverse input-output combinations spanning text, audio, and vision, in both Korean and English. We anticipate that the open-weight release of HyperCLOVA X 8B Omni will support a wide range of research and deployment scenarios.
zh

[NLP-36] BanglaIPA: Towards Robust Text-to-IPA Transcription with Contextual Rewriting in Bengali

【速读】: 该论文旨在解决孟加拉语(Bengali)缺乏一个鲁棒的自动国际音标(IPA)转写系统的问题,尤其是现有方法在处理区域方言、数字表达以及未见词汇时表现不佳。解决方案的关键在于提出BanglaIPA系统,该系统结合基于字符的词典与词级对齐机制,通过预计算的词到IPA映射字典提升推理效率,并显著增强对区域变体和数值表达的准确性,从而实现更鲁棒的语音转写性能。

链接: https://arxiv.org/abs/2601.01778
作者: Jakir Hasan,Shrestha Datta,Md Saiful Islam,Shubhashis Roy Dipta,Ameya Debnath
机构: Shahjalal University of Science and Technology (沙贾拉尔科技大学); University of Maryland, Baltimore County (马里兰大学巴尔的摩县分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite its widespread use, Bengali lacks a robust automated International Phonetic Alphabet (IPA) transcription system that effectively supports both standard language and regional dialectal texts. Existing approaches struggle to handle regional variations, numerical expressions, and generalize poorly to previously unseen words. To address these limitations, we propose BanglaIPA, a novel IPA generation system that integrates a character-based vocabulary with word-level alignment. The proposed system accurately handles Bengali numerals and demonstrates strong performance across regional dialects. BanglaIPA improves inference efficiency by leveraging a precomputed word-to-IPA mapping dictionary for previously observed words. The system is evaluated on the standard Bengali and six regional variations of the DUAL-IPA dataset. Experimental results show that BanglaIPA outperforms baseline IPA transcription models by 58.4-78.7% and achieves an overall mean word error rate of 11.4%, highlighting its robustness in phonetic transcription generation for the Bengali language.
zh

[NLP-37] Can LLM s Track Their Output Length? A Dynamic Feedback Mechanism for Precise Length Regulation

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成文本时难以精确控制输出长度的问题,尤其是在需要满足特定词数、句数或标记数(token count)约束的实际应用中。研究表明,LLMs 常因无法准确感知输入文本长度而导致生成结果偏离目标长度。解决方案的关键在于提出一种无需训练的动态长度反馈机制,在生成过程中实时引入长度反馈信号,使模型能够自适应调整生成策略以逼近目标长度,同时保持生成质量。实验表明,该方法在摘要生成和传记生成任务中显著提升了长度控制精度,并通过监督微调可进一步拓展至更广泛的文本生成场景。

链接: https://arxiv.org/abs/2601.01768
作者: Meiman Xiao,Ante Wang,Qingguo Hu,Zhongjian Miao,Huangjun Shen,Longyue Wang,Weihua Luo,Jinsong Su
机构: Xiamen University (厦门大学); Alibaba International Digital Commerce Group; Li Auto Inc.
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Precisely controlling the length of generated text is a common requirement in real-world applications. However, despite significant advancements in following human instructions, Large Language Models (LLMs) still struggle with this task. In this work, we demonstrate that LLMs often fail to accurately measure input text length, leading to poor adherence to length constraints. To address this issue, we propose a novel length regulation approach that incorporates dynamic length feedback during generation, enabling adaptive adjustments to meet target lengths. Experiments on summarization and biography tasks show our training-free approach significantly improves precision in achieving target token, word, or sentence counts without compromising quality. Additionally, we demonstrate that further supervised fine-tuning allows our method to generalize effectively to broader text-generation tasks.
zh

[NLP-38] Context-Free Recognition with Transformers

【速读】: 该论文旨在解决生成式 AI(Generative AI)中 Transformer 模型对上下文无关语言(Context-Free Languages, CFLs)的识别能力问题。此前研究表明,标准 Transformer 在不引入额外机制的情况下无法识别 CFL,甚至无法识别更简单的正则语言(Regular Languages),这限制了其在语法结构复杂任务中的应用。本文的关键解决方案是引入带有 O(logn)\mathcal{O}(\log n) 层循环(looping layers)的扩展 Transformer 架构,并通过添加 O(n6)\mathcal{O}(n^6) 个填充标记(padding tokens)来实现对任意 CFL 的识别。尽管该方案理论上可行,但其高复杂度在实践中难以应用;进一步地,作者发现对于自然子类如无歧义 CFL(Unambiguous CFLs),只需 O(n3)\mathcal{O}(n^3) 的填充即可实现高效识别,显著提升了实用性。实验验证了循环机制在理论上需对数深度的语言上确实有效,揭示了 Transformer 对 CFL 识别的内在复杂性与可优化路径。

链接: https://arxiv.org/abs/2601.01754
作者: Selim Jerad,Anej Svete,Sophie Hao,Ryan Cotterell,William Merrill
机构: ETH Zürich (苏黎世联邦理工学院); Boston University (波士顿大学); Allen Institute for AI (艾伦人工智能研究所)
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
备注:

点击查看摘要

Abstract:Transformers excel on tasks that process well-formed inputs according to some grammar, such as natural language and code. However, it remains unclear how they can process grammatical syntax. In fact, under standard complexity conjectures, standard transformers cannot recognize context-free languages (CFLs), a canonical formalism to describe syntax, or even regular languages, a subclass of CFLs (Merrill et al., 2022). Merrill Sabharwal (2024) show that \mathcalO(\log n) looping layers (w.r.t. input length n ) allows transformers to recognize regular languages, but the question of context-free recognition remained open. In this work, we show that looped transformers with \mathcalO(\log n) looping layers and \mathcalO(n^6) padding tokens can recognize all CFLs. However, training and inference with \mathcalO(n^6) padding tokens is potentially impractical. Fortunately, we show that, for natural subclasses such as unambiguous CFLs, the recognition problem on transformers becomes more tractable, requiring \mathcalO(n^3) padding. We empirically validate our results and show that looping helps on a language that provably requires logarithmic depth. Overall, our results shed light on the intricacy of CFL recognition by transformers: While general recognition may require an intractable amount of padding, natural constraints such as unambiguity yield efficient recognition algorithms.
zh

[NLP-39] Query-Document Dense Vectors for LLM Relevance Judgment Bias Analysis ECIR2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在信息检索(Information Retrieval, IR)评估中作为相关性判别器时是否存在系统性误判的问题,而不仅仅是评估其平均表现。传统研究关注LLM与人类标注者的一致性,但本文进一步探究LLM是否在特定语义场景下持续产生偏差。解决方案的关键在于提出一种新颖的查询-文档(Query-Document, Q-D)对表示方法,将Q-D对嵌入到联合语义空间中,并将相关性视为关系属性;在此基础上构建基于聚类的框架,通过局部聚类分析识别出人类与LLM标签不一致的集中区域。实验表明,系统性差异主要集中在特定语义簇内,尤其出现在定义类、政策相关或模糊语境下的查询中,从而实现从全局诊断到局部定位的偏差识别,为更可靠、具偏见意识的IR评估提供了方法支撑。

链接: https://arxiv.org/abs/2601.01751
作者: Samaneh Mohtadi,Gianluca Demartini
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted for presentation at the ECIR 2026 Full Papers track

点击查看摘要

Abstract:Large Language Models (LLMs) have been used as relevance assessors for Information Retrieval (IR) evaluation collection creation due to reduced cost and increased scalability as compared to human assessors. While previous research has looked at the reliability of LLMs as compared to human assessors, in this work, we aim to understand if LLMs make systematic mistakes when judging relevance, rather than just understanding how good they are on average. To this aim, we propose a novel representational method for queries and documents that allows us to analyze relevance label distributions and compare LLM and human labels to identify patterns of disagreement and localize systematic areas of disagreement. We introduce a clustering-based framework that embeds query-document (Q-D) pairs into a joint semantic space, treating relevance as a relational property. Experiments on TREC Deep Learning 2019 and 2020 show that systematic disagreement between humans and LLMs is concentrated in specific semantic clusters rather than distributed randomly. Query-level analyses reveal recurring failures, most often in definition-seeking, policy-related, or ambiguous contexts. Queries with large variation in agreement across their clusters emerge as disagreement hotspots, where LLMs tend to under-recall relevant content or over-include irrelevant material. This framework links global diagnostics with localized clustering to uncover hidden weaknesses in LLM judgments, enabling bias-aware and more reliable IR evaluation.
zh

[NLP-40] Multi-granularity Interactive Attention Framework for Residual Hierarchical Pronunciation Assessment AAAI2026

【速读】: 该论文旨在解决多粒度发音评估中缺乏跨粒度双向交互的问题,现有方法仅考虑相邻粒度层级间的单向依赖关系,难以充分捕捉音素、词和语句层级之间的声学结构相关性。其解决方案的关键在于提出一种残差层次交互方法(HIA),其中核心组件是交互注意力模块(Interactive Attention Module),通过注意力机制实现粒度间的动态双向信息交互,从而有效提取各层级的语言特征并整合不同粒度间的关联;同时引入残差层次结构缓解声学层级建模中的特征遗忘问题,并采用一维卷积层增强各粒度层级局部上下文线索的提取能力。

链接: https://arxiv.org/abs/2601.01745
作者: Hong Han,Hao-Chen Pei,Zhao-Zheng Nie,Xin Luo,Xin-Shun Xu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures, 5 tables, accepted by AAAI 2026

点击查看摘要

Abstract:Automatic pronunciation assessment plays a crucial role in computer-assisted pronunciation training systems. Due to the ability to perform multiple pronunciation tasks simultaneously, multi-aspect multi-granularity pronunciation assessment methods are gradually receiving more attention and achieving better performance than single-level modeling tasks. However, existing methods only consider unidirectional dependencies between adjacent granularity levels, lacking bidirectional interaction among phoneme, word, and utterance levels and thus insufficiently capturing the acoustic structural correlations. To address this issue, we propose a novel residual hierarchical interactive method, HIA for short, that enables bidirectional modeling across granularities. As the core of HIA, the Interactive Attention Module leverages an attention mechanism to achieve dynamic bidirectional interaction, effectively capturing linguistic features at each granularity while integrating correlations between different granularity levels. We also propose a residual hierarchical structure to alleviate the feature forgetting problem when modeling acoustic hierarchies. In addition, we use 1-D convolutional layers to enhance the extraction of local contextual cues at each granularity. Extensive experiments on the speechocean762 dataset show that our model is comprehensively ahead of the existing state-of-the-art methods.
zh

[NLP-41] K-EXAONE Technical Report

【速读】: 该论文旨在解决大规模多语言基础模型在工业与研究应用中对高性能、高效率和多语言支持的需求问题。解决方案的关键在于构建K-EXAONE,一个基于专家混合(Mixture-of-Experts, MoE)架构的大型多语言语言模型,其总参数量达236B,在推理时仅激活23B参数,从而实现计算效率优化;同时支持256K token的上下文窗口和六种语言(韩语、英语、西班牙语、德语、日语和越南语),并通过全面的基准测试验证其在推理、代理能力、通用性、韩语及多语言任务上的表现可与同规模开源模型相当,体现了其作为工业级专有基础模型的潜力。

链接: https://arxiv.org/abs/2601.01739
作者: Eunbi Choi,Kibong Choi,Seokhee Hong,Junwon Hwang,Hyojin Jeon,Hyunjik Jo,Joonkee Kim,Seonghwan Kim,Soyeon Kim,Sunkyoung Kim,Yireun Kim,Yongil Kim,Haeju Lee,Jinsik Lee,Kyungmin Lee,Sangha Park,Heuiyeen Yeen,Hwan Chang,Stanley Jungkyu Choi,Yejin Choi,Jiwon Ham,Kijeong Jeon,Geunyeong Jeong,Gerrard Jeongwon Jo,Yonghwan Jo,Jiyeon Jung,Naeun Kang,Dohoon Kim,Euisoon Kim,Hayeon Kim,Hyosang Kim,Hyunseo Kim,Jieun Kim,Minu Kim,Myoungshin Kim,Unsol Kim,Youchul Kim,YoungJin Kim,Chaeeun Lee,Chaeyoon Lee,Changhun Lee,Dahm Lee,Edward Hwayoung Lee,Honglak Lee,Jinsang Lee,Jiyoung Lee,Sangeun Lee,Seungwon Lim,Solji Lim,Woohyung Lim,Chanwoo Moon,Jaewoo Park,Jinho Park,Yongmin Park,Hyerin Seo,Wooseok Seo,Yongwoo Song,Sejong Yang,Sihoon Yang,Chang En Yea,Sihyuk Yi,Chansik Yoon,Dongkeun Yoon,Sangyeon Yoon,Hyeongu Yun
机构: LG AI Research
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 29 pages

点击查看摘要

Abstract:This technical report presents K-EXAONE, a large-scale multilingual language model developed by LG AI Research. K-EXAONE is built on a Mixture-of-Experts architecture with 236B total parameters, activating 23B parameters during inference. It supports a 256K-token context window and covers six languages: Korean, English, Spanish, German, Japanese, and Vietnamese. We evaluate K-EXAONE on a comprehensive benchmark suite spanning reasoning, agentic, general, Korean, and multilingual abilities. Across these evaluations, K-EXAONE demonstrates performance comparable to open-weight models of similar size. K-EXAONE, designed to advance AI for a better life, is positioned as a powerful proprietary AI foundation model for a wide range of industrial and research applications.
zh

[NLP-42] Entropy-Aligned Decoding of LMs for Better Writing and Reasoning

【速读】: 该论文旨在解决语言模型(Language Model, LM)在生成文本时因采用朴素随机采样而导致质量低下、以及现有解码算法依赖贪婪启发式策略引入短视偏差(myopic distortions),从而产生同质化、重复且不连贯句子的问题。解决方案的关键在于提出一种无需超参数调节的解码方法 EPIC,其核心思想是将未来轨迹的熵(entropy)纳入解码过程,通过显式控制每一步生成中表达的不确定性,使采样分布的熵与数据固有的偶然不确定性(aleatoric uncertainty)对齐;具体实现上采用熵感知的懒惰 Gumbel-Max 采样(Entropy-Aware Lazy Gumbel-Max sampling),既保证了精确性,又仅需每步亚线性数量的熵评估,显著提升了效率和生成质量,在创意写作、摘要生成及数学推理等任务中均优于现有基线方法。

链接: https://arxiv.org/abs/2601.01714
作者: Kareem Ahmed,Sameer Singh
机构: University of California, Irvine (加州大学欧文分校)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Language models (LMs) are trained on billions of tokens in an attempt to recover the true language distribution. Still, vanilla random sampling from LMs yields low quality generations. Decoding algorithms attempt to restrict the LM distribution to a set of high-probability continuations, but rely on greedy heuristics that introduce myopic distortions, yielding sentences that are homogeneous, repetitive and incoherent. In this paper, we introduce EPIC, a hyperparameter-free decoding approach that incorporates the entropy of future trajectories into LM decoding. EPIC explicitly regulates the amount of uncertainty expressed at every step of generation, aligning the sampling distribution’s entropy to the aleatoric (data) uncertainty. Through Entropy-Aware Lazy Gumbel-Max sampling, EPIC manages to be exact, while also being efficient, requiring only a sublinear number of entropy evaluations per step. Unlike current baselines, EPIC yields sampling distributions that are empirically well-aligned with the entropy of the underlying data distribution. Across creative writing and summarization tasks, EPIC consistently improves LM-as-judge preference win-rates over widely used decoding strategies. These preference gains are complemented by automatic metrics, showing that EPIC produces more diverse generations and more faithful summaries. We also evaluate EPIC on mathematical reasoning, where it outperforms all baselines.
zh

[NLP-43] A Training-Free Large Reasoning Model-based Knowledge Tracing Framework for Unified Prediction and Prescription

【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的知识追踪(Knowledge Tracing, KT)方法中存在的两个核心问题:一是现有方法通常需要繁琐的微调(fine-tuning),且性能不稳定甚至接近随机水平;二是传统KT系统多采用多阶段流水线处理预测、反馈生成与学习推荐,导致系统复杂度高、资源消耗大。其解决方案的关键在于提出Thinking-KT框架,该框架不依赖训练过程(training-free),通过引入测试时扩展(Test-Time Scaling, TTS)技术,使小型LLM即可实现高性能的知识追踪,并在统一输出中协同完成预测、个性化反馈生成和学习推荐任务,同时保持预测准确性,从而将KT从多阶段系统简化为单模型统一引擎。

链接: https://arxiv.org/abs/2601.01708
作者: Unggi Lee,Joo Young Kim,Ran Ju,Minyoung Jung,Jeyeon Eo
机构: Chosun University (全南大学); Neudive Inc; Independent Researcher
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Knowledge Tracing (KT) aims to estimate a learner’s evolving mastery based on interaction histories. Recent studies have explored Large Language Models (LLMs) for KT via autoregressive nature, but such approaches typically require fine-tuning and exhibit unstable or near-random performance. Moreover, prior KT systems primarily focus on prediction and rely on multi-stage pipelines for feedback and recommendation, resulting in increased system complexity and resources. To address this gap, we propose Thinking-KT, a training-free KT framework that incorporates Test-Time Scaling (TTS), enabling even small LLMs to achieve competitive KT performance. Moreover, in this framework, a small LLM can jointly perform KT prediction, personalized feedback generation, and learning recommendation in a unified output without degrading prediction accuracy. Beyond performance, we present the systematic analysis of reasoning traces in KT. Our results demonstrate that TTS is a critical yet underexplored factor in LLM-based KT, and that small LLMs can serve as unified ITS engines.
zh

[NLP-44] Lying with Truths: Open-Channel Multi-Agent Collusion for Belief Manipulation via Generative Montage

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在演变为自主代理并整合实时信息时,因推理能力增强而引入的新型认知攻击风险问题。其核心威胁在于:多个协作的代理可通过公开渠道分发真实但碎片化的证据,利用LLM的过度思考倾向(overthinking tendency),构建具有欺骗性的叙事结构,从而诱导目标模型形成并传播虚假信念。解决方案的关键是提出Generative Montage框架——一个由“写作者-编辑-导演”组成的协同机制,通过对抗性辩论和有策略地发布证据片段来生成误导性结论,并辅以CoPHEME数据集模拟真实谣言事件进行验证。实验表明,该攻击对14种不同LLM家族均具高成功率(最高达74.4%),且推理能力越强的模型反而越易受攻击,揭示了生成式AI在动态信息环境中存在的系统性社会技术脆弱性。

链接: https://arxiv.org/abs/2601.01685
作者: Jinwei Hu,Xinmiao Huang,Youcheng Sun,Yi Dong,Xiaowei Huang
机构: University of Liverpool, UK; Mohamed bin Zayed University of Artificial Intelligence, UAE
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: Under Review

点击查看摘要

Abstract:As large language models (LLMs) transition to autonomous agents synthesizing real-time information, their reasoning capabilities introduce an unexpected attack surface. This paper introduces a novel threat where colluding agents steer victim beliefs using only truthful evidence fragments distributed through public channels, without relying on covert communications, backdoors, or falsified documents. By exploiting LLMs’ overthinking tendency, we formalize the first cognitive collusion attack and propose Generative Montage: a Writer-Editor-Director framework that constructs deceptive narratives through adversarial debate and coordinated posting of evidence fragments, causing victims to internalize and propagate fabricated conclusions. To study this risk, we develop CoPHEME, a dataset derived from real-world rumor events, and simulate attacks across diverse LLM families. Our results show pervasive vulnerability across 14 LLM families: attack success rates reach 74.4% for proprietary models and 70.6% for open-weights models. Counterintuitively, stronger reasoning capabilities increase susceptibility, with reasoning-specialized models showing higher attack success than base models or prompts. Furthermore, these false beliefs then cascade to downstream judges, achieving over 60% deception rates, highlighting a socio-technical vulnerability in how LLM-based agents interact with dynamic information environments. Our implementation and data are available at: this https URL.
zh

[NLP-45] LACONIC: Dense-Level Effectiveness for Scalable Sparse Retrieval via a Two-Phase Training Curriculum

【速读】: 该论文旨在解决当前密集检索模型(dense retrieval models)在实际部署中面临的高内存占用和对GPU加速器依赖的问题,同时弥补学习型稀疏检索(learned sparse retrieval)在性能上长期落后于密集方法的不足。其解决方案的关键在于提出了一种基于Llama-3架构(1B、3B和8B参数规模)的新型学习型稀疏检索器LACONIC,并设计了一个简化的两阶段训练流程:首先通过弱监督预微调(weakly supervised pre-finetuning)使因果语言模型(causal LLMs)具备双向上下文建模能力,进而利用精心筛选的难负样本进行高信号微调(high-signal finetuning),从而显著提升检索效果。实验表明,LACONIC 8B版本在MTEB Retrieval基准上达到60.2 nDCG,性能接近最优密集模型,同时索引内存消耗减少71%,且可在通用CPU硬件上高效运行,大幅降低计算预算。

链接: https://arxiv.org/abs/2601.01684
作者: Zhichao Xu,Shengyao Zhuang,Crystina Zhang,Xueguang Ma,Yijun Tian,Maitrey Mehta,Jimmy Lin,Vivek Srikumar
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While dense retrieval models have become the standard for state-of-the-art information retrieval, their deployment is often constrained by high memory requirements and reliance on GPU accelerators for vector similarity search. Learned sparse retrieval offers a compelling alternative by enabling efficient search via inverted indices, yet it has historically received less attention than dense approaches. In this report, we introduce LACONIC, a family of learned sparse retrievers based on the Llama-3 architecture (1B, 3B, and 8B). We propose a streamlined two-phase training curriculum consisting of (1) weakly supervised pre-finetuning to adapt causal LLMs for bidirectional contextualization and (2) high-signal finetuning using curated hard negatives. Our results demonstrate that LACONIC effectively bridges the performance gap with dense models: the 8B variant achieves a state-of-the-art 60.2 nDCG on the MTEB Retrieval benchmark, ranking 15th on the leaderboard as of January 1, 2026, while utilizing 71% less index memory than an equivalent dense model. By delivering high retrieval effectiveness on commodity CPU hardware with a fraction of the compute budget required by competing models, LACONIC provides a scalable and efficient solution for real-world search applications.
zh

[NLP-46] EHRSummarizer: A Privacy-Aware FHIR-Native Architecture for Structured Clinical Summarization of Electronic Health Records

【速读】: 该论文旨在解决临床医生在使用碎片化的电子健康记录(Electronic Health Record, EHR)界面时,难以高效整合患者问题、用药情况、近期就诊记录及纵向趋势信息的问题。解决方案的关键在于提出EHRSummarizer这一隐私敏感的、基于FHIR R4标准的参考架构:它能够检索高价值的FHIR资源,将其归一化为一致的临床上下文包,并生成结构化摘要以支持结构化病历审查;系统设计具备数据最小化、无状态处理和灵活部署能力(包括在组织信任边界内本地推理),并通过限制总结阶段仅依赖于检索到的证据、明确标识缺失领域并避免诊断或治疗建议来降低不安全行为风险。

链接: https://arxiv.org/abs/2601.01668
作者: Houman Kazemzadeh,Nima Minaifar,Kamyar Naderi,Sho Tabibzadeh
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 19 pages

点击查看摘要

Abstract:Clinicians routinely navigate fragmented electronic health record (EHR) interfaces to assemble a coherent picture of a patient’s problems, medications, recent encounters, and longitudinal trends. This work describes EHRSummarizer, a privacy-aware, FHIR-native reference architecture that retrieves a targeted set of high-yield FHIR R4 resources, normalizes them into a consistent clinical context package, and produces structured summaries intended to support structured chart review. The system can be configured for data minimization, stateless processing, and flexible deployment, including local inference within an organization’s trust boundary. To mitigate the risk of unsupported or unsafe behavior, the summarization stage is constrained to evidence present in the retrieved context package, is intended to indicate missing or unavailable domains where feasible, and avoids diagnostic or treatment recommendations. Prototype demonstrations on synthetic and test FHIR environments illustrate end-to-end behavior and output formats; however, this manuscript does not report clinical outcomes or controlled workflow studies. We outline an evaluation plan centered on faithfulness, omission risk, temporal correctness, usability, and operational monitoring to guide future institutional assessments.
zh

[NLP-47] JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models

【速读】: 该论文旨在解决当前医疗领域大语言模型(Large Language Models, LLMs)安全评估中存在的两大关键问题:一是现有安全基准测试 predominantly 以英文为主,缺乏对日语等非英语医疗场景的覆盖;二是现有测试仅使用单轮对话提示(single-turn prompts),无法反映真实临床咨询中多轮交互的复杂性。为应对这些问题,作者提出了JMedEthicBench,这是首个面向日本医疗场景的多轮对话式LLM医学安全性评估基准。其核心创新在于基于67条日本医学会指南构建超过5万条对抗性多轮对话数据,并采用双LLM评分协议进行量化评估,揭示出医疗专用模型在多轮交互中安全性显著下降(中位数从9.5降至5.0,p < 0.001),且这种脆弱性在跨语言(日语与英语)测试中依然存在,表明其根源在于模型对齐机制的固有缺陷而非语言特异性因素。这一发现强调了领域微调可能无意削弱安全机制,并呼吁针对多轮交互设计专门的对齐策略。

链接: https://arxiv.org/abs/2601.01627
作者: Junyu Liu,Zirui Li,Qian Niu,Zequn Zhang,Yue Xun,Wenlong Hou,Shujun Wang,Yusuke Iwasawa,Yutaka Matsuo,Kan Hatakeyama-Sato
机构: Kyoto University (京都大学); Hohai University (河海大学); The University of Tokyo (东京大学); University of Science and Technology of China (中国科学技术大学); Hong Kong Polytechnic University (香港理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 6 figures

点击查看摘要

Abstract:As Large Language Models (LLMs) are increasingly deployed in healthcare field, it becomes essential to carefully evaluate their medical safety before clinical use. However, existing safety benchmarks remain predominantly English-centric, and test with only single-turn prompts despite multi-turn clinical consultations. To address these gaps, we introduce JMedEthicBench, the first multi-turn conversational benchmark for evaluating medical safety of LLMs for Japanese healthcare. Our benchmark is based on 67 guidelines from the Japan Medical Association and contains over 50,000 adversarial conversations generated using seven automatically discovered jailbreak strategies. Using a dual-LLM scoring protocol, we evaluate 27 models and find that commercial models maintain robust safety while medical-specialized models exhibit increased vulnerability. Furthermore, safety scores decline significantly across conversation turns (median: 9.5 to 5.0, p 0.001 ). Cross-lingual evaluation on both Japanese and English versions of our benchmark reveals that medical model vulnerabilities persist across languages, indicating inherent alignment limitations rather than language-specific factors. These findings suggest that domain-specific fine-tuning may accidentally weaken safety mechanisms and that multi-turn interactions represent a distinct threat surface requiring dedicated alignment strategies.
zh

[NLP-48] How Does Prefix Matter in Reasoning Model Tuning?

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在监督微调(Supervised Fine-Tuning, SFT)过程中因去除引导性前缀(prefix)而导致的对齐效果下降问题,特别是安全性和推理能力的弱化。传统做法通常移除如“请回答”或“以下是问题”等引入性短语以简化数据格式,但本文提出质疑:这些看似冗余的前缀句子实则蕴含轻量级对齐信号,能够有效引导模型解码过程向更安全、更连贯的方向演化。解决方案的关键在于系统性地控制前缀保留比例(0%至100%),并通过多任务实验验证其影响——结果表明,包含前缀的SFT显著提升安全性和数学/编码推理性能(如WildJailbreak基准上Safe@1提升+6%,GSM8K提升+7%),而事实准确性与代码任务则无明显改善,说明前缀通过缩小搜索空间优化结构化推理;进一步的token级损失分析揭示,“revised”、“logically”等前缀词具有更高梯度幅值,作为对齐锚点稳定推理轨迹。因此,前缀条件化提供了一种可扩展且可解释的隐式对齐机制,补充了基于奖励的学习方法。

链接: https://arxiv.org/abs/2601.01624
作者: Raj Vardhan Tomar,Preslav Nakov,Yuxia Wang
机构: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); Indian Institute of Technology Delhi; INSAIT
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent alignment studies commonly remove introductory boilerplate phrases from supervised fine-tuning (SFT) datasets. This work challenges that assumption. We hypothesize that safety- and reasoning-oriented prefix sentences serve as lightweight alignment signals that can guide model decoding toward safer and more coherent responses. To examine this, we fine-tune three R1 series models across three core model capabilities: reasoning (mathematics, coding), safety, and factuality, systematically varying prefix inclusion from 0% to 100%. Results show that prefix-conditioned SFT improves both safety and reasoning performance, yielding up to +6% higher Safe@1 accuracy on adversarial benchmarks (WildJailbreak, StrongReject) and +7% improvement on GSM8K reasoning. However, factuality and coding tasks show marginal or negative effects, indicating that prefix-induced narrowing of the search space benefits structured reasoning. Token-level loss analysis further reveals that prefix tokens such as “revised” and “logically” incur higher gradient magnitudes, acting as alignment anchors that stabilize reasoning trajectories. Our findings suggest that prefix conditioning offers a scalable and interpretable mechanism for improving reasoning safety, serving as an implicit form of alignment that complements traditional reward-based methods. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2601.01624 [cs.CL] (or arXiv:2601.01624v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2601.01624 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-49] he Gray Area: Characterizing Moderator Disagreement on Reddit

【速读】: 该论文旨在解决在线内容 moderation 中因志愿者 moderators 之间意见分歧所引发的“灰色地带”(gray area)问题,即那些难以明确判定是否违规的案例。研究表明,约七分之一的 moderation 案例属于灰色地带,常涉及用户意图不明确的行为(如 trolling 和 brigading)以及社区治理争议;且近半数此类案例由自动化系统处理,而信息论评估和语言模型测试均表明灰色地带案例比无争议案例更难判断,现有生成式 AI(Generative AI)难以有效处理。论文的关键解决方案在于强调专家人类 moderator 在监督和决策中的核心作用,并揭示当前 moderation 流程与工具在应对复杂性方面的局限性。

链接: https://arxiv.org/abs/2601.01620
作者: Shayan Alipour,Shruti Phadke,Seyed Shahabeddin Mousavi,Amirhossein Afsharrad,Morteza Zihayat,Mattia Samory
机构: 未知
类目: Computers and Society (cs.CY); Computation and Language (cs.CL); Information Theory (cs.IT)
备注: 16 pages, 11 figures

点击查看摘要

Abstract:Volunteer moderators play a crucial role in sustaining online dialogue, but they often disagree about what should or should not be allowed. In this paper, we study the complexity of content moderation with a focus on disagreements between moderators, which we term the ``gray area’’ of moderation. Leveraging 5 years and 4.3 million moderation log entries from 24 subreddits of different topics and sizes, we characterize how gray area, or disputed cases, differ from undisputed cases. We show that one-in-seven moderation cases are disputed among moderators, often addressing transgressions where users’ intent is not directly legible, such as in trolling and brigading, as well as tensions around community governance. This is concerning, as almost half of all gray area cases involved automated moderation decisions. Through information-theoretic evaluations, we demonstrate that gray area cases are inherently harder to adjudicate than undisputed cases and show that state-of-the-art language models struggle to adjudicate them. We highlight the key role of expert human moderators in overseeing the moderation process and provide insights about the challenges of current moderation processes and tools.
zh

[NLP-50] Steerability of Instrumental-Convergence Tendencies in LLM s

【速读】: 该论文旨在解决开放权重(open-weight)人工智能模型在安全与安全性之间的根本性矛盾问题,即如何在保障系统可控性(安全)的同时防止恶意攻击者利用高可操控性(steerability)诱导有害行为。其核心解决方案在于区分授权可控性(authorized steerability)与非授权可控性(unauthorized steerability),并通过设计特定的反工具性提示后缀(anti-instrumental prompt suffix)显著降低模型产生工具性趋同行为(instrumental convergence)的概率——例如规避关闭、欺骗或自我复制等潜在风险行为。实验表明,使用Qwen3系列模型进行测试时,该方法可使工具性趋同输出比例从81.69%降至2.82%,且更大规模的对齐模型在该策略下表现更优,体现出通过提示工程实现高效行为约束的有效性。

链接: https://arxiv.org/abs/2601.01584
作者: Jakub Hoscilowicz
机构: Warsaw University of Technology (华沙理工大学)
类目: Computation and Language (cs.CL)
备注: Code is available at this https URL

点击查看摘要

Abstract:We examine two properties of AI systems: capability (what a system can do) and steerability (how reliably one can shift behavior toward intended outcomes). In our experiments, higher capability does not imply lower steerability. We distinguish between authorized steerability (builders reliably reaching intended behaviors) and unauthorized steerability (attackers eliciting disallowed behaviors). This distinction highlights a fundamental safety–security dilemma for open-weight AI models: safety requires high steerability to enforce control (e.g., stop/refuse), while security requires low steerability to prevent malicious actors from eliciting harmful behaviors. This tension is acute for open-weight models, which are currently highly steerable via common techniques such as fine-tuning and adversarial prompting. Using Qwen3 models (4B/30B; Base/Instruct/Thinking) and InstrumentalEval, we find that a short anti-instrumental prompt suffix sharply reduces outputs labeled as instrumental convergence (e.g., shutdown avoidance, deception, self-replication). For Qwen3-30B Instruct, convergence drops from 81.69% under a pro-instrumental suffix to 2.82% under an anti-instrumental suffix. Under anti-instrumental prompting, larger aligned models produce fewer convergence-labeled outputs than smaller ones (Instruct: 2.82% vs. 4.23%; Thinking: 4.23% vs. 9.86%). Code is available at this http URL.
zh

[NLP-51] OpenNovelty: An LLM -powered Agent ic System for Verifiable Scholarly Novelty Assessment

【速读】: 该论文旨在解决学术同行评审中新颖性(novelty)评估的挑战问题,即审稿人需在快速演化的庞大文献库中准确判断投稿作品的创新程度。解决方案的关键在于提出OpenNovelty——一个基于大语言模型(LLM)的智能体系统(agentic system),其通过四阶段流程实现透明且基于证据的新颖性分析:首先提取核心任务与贡献声明生成检索查询;其次利用语义搜索引擎获取相关前置研究;再构建任务相关工作的层级分类体系并进行逐贡献的全文对比;最后整合所有分析结果生成结构化报告,附带明确引用和证据片段。该方案的核心优势在于将所有评估锚定于真实检索到的论文,确保判断可验证、可追溯,从而提升评审的公平性、一致性与科学性。

链接: https://arxiv.org/abs/2601.01576
作者: Ming Zhang,Kexin Tan,Yueyuan Huang,Yujiong Shen,Chunchun Ma,Li Ju,Xinran Zhang,Yuhui Wang,Wenqing Jing,Jingyi Deng,Huayu Sha,Binze Hu,Jingqi Tong,Changhao Jiang,Yage Geng,Yuankai Ying,Yue Zhang,Zhangyue Yin,Zhiheng Xi,Shihan Dou,Tao Gui,Qi Zhang,Xuanjing Huang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Evaluating novelty is critical yet challenging in peer review, as reviewers must assess submissions against a vast, rapidly evolving literature. This report presents OpenNovelty, an LLM-powered agentic system for transparent, evidence-based novelty analysis. The system operates through four phases: (1) extracting the core task and contribution claims to generate retrieval queries; (2) retrieving relevant prior work based on extracted queries via semantic search engine; (3) constructing a hierarchical taxonomy of core-task-related work and performing contribution-level full-text comparisons against each contribution; and (4) synthesizing all analyses into a structured novelty report with explicit citations and evidence snippets. Unlike naive LLM-based approaches, \textscOpenNovelty grounds all assessments in retrieved real papers, ensuring verifiable judgments. We deploy our system on 500+ ICLR 2026 submissions with all reports publicly available on our website, and preliminary analysis suggests it can identify relevant prior work, including closely related papers that authors may overlook. OpenNovelty aims to empower the research community with a scalable tool that promotes fair, consistent, and evidence-backed peer review.
zh

[NLP-52] HalluZig: Hallucination Detection using Zigzag Persistence

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在高风险场景中因幻觉(hallucination)问题导致的事实可靠性不足这一关键挑战。现有检测方法多依赖于输出层面的表面信号,忽视了模型内部推理过程中的错误。论文提出了一种新的幻觉检测范式——HalluZig,其核心创新在于通过分析模型各层注意力机制随生成过程演变的动态拓扑结构来识别幻觉。解决方案的关键在于将注意力矩阵序列建模为zigzag图滤波(zigzag graph filtration),并利用拓扑数据分析中的zigzag持久性(zigzag persistence)提取具有区分性的拓扑特征,实验证明这些特征能够有效区分事实性与幻觉生成,并且在不同模型间具有泛化能力,仅需部分网络深度的结构信息即可实现检测。

链接: https://arxiv.org/abs/2601.01552
作者: Shreyas N. Samaga,Gilberto Gonzalez Arroyo,Tamal K. Dey
机构: Purdue University (普渡大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The factual reliability of Large Language Models (LLMs) remains a critical barrier to their adoption in high-stakes domains due to their propensity to hallucinate. Current detection methods often rely on surface-level signals from the model’s output, overlooking the failures that occur within the model’s internal reasoning process. In this paper, we introduce a new paradigm for hallucination detection by analyzing the dynamic topology of the evolution of model’s layer-wise attention. We model the sequence of attention matrices as a zigzag graph filtration and use zigzag persistence, a tool from Topological Data Analysis, to extract a topological signature. Our core hypothesis is that factual and hallucinated generations exhibit distinct topological signatures. We validate our framework, HalluZig, on multiple benchmarks, demonstrating that it outperforms strong baselines. Furthermore, our analysis reveals that these topological signatures are generalizable across different models and hallucination detection is possible only using structural signatures from partial network depth.
zh

[NLP-53] Bridging the Data Gap: Creating a Hindi Text Summarization Dataset from the English XSUM

【速读】: 该论文旨在解决低资源语言(如印地语)在自然语言处理(Natural Language Processing, NLP)领域中高质量文本摘要数据集严重匮乏的问题,尤其在文本摘要任务中,由于缺乏多样且专业的语料库,导致模型开发受限。其解决方案的关键在于构建一个成本效益高、自动化程度高的框架,通过将英文极端摘要(Extreme Summarization, XSUM)数据集作为源数据,结合先进的机器翻译与语言适应技术生成印地语摘要语料,并利用跨语言优化评估指标(Crosslingual Optimized Metric for Evaluation of Translation, COMET)进行质量验证,辅以大型语言模型(Large Language Models, LLMs)进行内容筛选与校正,从而生成一个主题多样、语境贴合的印地语文本摘要数据集。该方法不仅为印地语NLP研究提供直接可用的数据资源,也为其他低资源语言的NLP发展提供了可扩展的标准化路径。

链接: https://arxiv.org/abs/2601.01543
作者: Praveenkumar Katwe,RakeshChandra Balabantaray,Kaliprasad Vittala
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Book chapter for River publications

点击查看摘要

Abstract:Current advancements in Natural Language Processing (NLP) have largely favored resource-rich languages, leaving a significant gap in high-quality datasets for low-resource languages like Hindi. This scarcity is particularly evident in text summarization, where the development of robust models is hindered by a lack of diverse, specialized corpora. To address this disparity, this study introduces a cost-effective, automated framework for creating a comprehensive Hindi text summarization dataset. By leveraging the English Extreme Summarization (XSUM) dataset as a source, we employ advanced translation and linguistic adaptation techniques. To ensure high fidelity and contextual relevance, we utilize the Crosslingual Optimized Metric for Evaluation of Translation (COMET) for validation, supplemented by the selective use of Large Language Models (LLMs) for curation. The resulting dataset provides a diverse, multi-thematic resource that mirrors the complexity of the original XSUM corpus. This initiative not only provides a direct tool for Hindi NLP research but also offers a scalable methodology for democratizing NLP in other underserved languages. By reducing the costs associated with dataset creation, this work fosters the development of more nuanced, culturally relevant models in computational linguistics. Comments: Book chapter for River publications Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2601.01543 [cs.CL] (or arXiv:2601.01543v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2601.01543 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-54] Aletheia: Quantifying Cognitive Conviction in Reasoning Models via Regularized Inverse Confusion Matrix

【速读】: 该论文旨在解决当前人工智能评估范式在衡量模型认知深度方面的局限性问题,即静态基准测试仅能反映知识广度而无法量化信念的深度。其核心解决方案是提出Project Aletheia这一认知物理学框架,通过引入Tikhonov正则化方法对评判者的混淆矩阵进行反演,从而实现对“认知确信度(Cognitive Conviction)”的量化测量;同时设计合成代理协议(Synthetic Proxy Protocol)以避免依赖黑箱私有数据,确保评估可复现性与透明度。

链接: https://arxiv.org/abs/2601.01532
作者: Fanzhe Fu
机构: Zhejiang University (浙江大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 6 pages, 2 figures

点击查看摘要

Abstract:In the progressive journey toward Artificial General Intelligence (AGI), current evaluation paradigms face an epistemological crisis. Static benchmarks measure knowledge breadth but fail to quantify the depth of belief. While Simhi et al. (2025) defined the CHOKE phenomenon in standard QA, we extend this framework to quantify “Cognitive Conviction” in System 2 reasoning models. We propose Project Aletheia, a cognitive physics framework that employs Tikhonov Regularization to invert the judge’s confusion matrix. To validate this methodology without relying on opaque private data, we implement a Synthetic Proxy Protocol. Our preliminary pilot study on 2025 baselines (e.g., DeepSeek-R1, OpenAI o1) suggests that while reasoning models act as a “cognitive buffer,” they may exhibit “Defensive OverThinking” under adversarial pressure. Furthermore, we introduce the Aligned Conviction Score (S_aligned) to verify that conviction does not compromise safety. This work serves as a blueprint for measuring AI scientific integrity.
zh

[NLP-55] EmoHarbor: Evaluating Personalized Emotional Support by Simulating the Users Internal World

【速读】: 该论文旨在解决当前情感支持对话评估范式中存在的核心问题:现有方法倾向于奖励通用的情感共鸣回应,而未能有效评估支持是否真正个性化地契合用户独特的心理特质和情境需求。其解决方案的关键在于提出EmoHarbor框架,该框架采用“用户即裁判”(User-as-a-Judge)范式,通过模拟用户的内在心理世界来实现更贴近真实人类体验的评估。其核心技术是Chain-of-Agent架构,将用户内部认知过程分解为三个专业化角色,使代理能够与支持者互动并完成类似人类用户的多维评估。该框架基于100个真实用户画像构建,并定义了10个个性化支持质量维度,从而揭示出大语言模型(LLM)虽具备高情商生成能力,却普遍缺乏对个体情境的精准适配,推动研究焦点从泛化共情转向真正的用户感知型情感支持系统开发。

链接: https://arxiv.org/abs/2601.01530
作者: Jing Ye,Lu Xiang,Yaping Zhang,Chengqing Zong
机构: 未知
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Current evaluation paradigms for emotional support conversations tend to reward generic empathetic responses, yet they fail to assess whether the support is genuinely personalized to users’ unique psychological profiles and contextual needs. We introduce EmoHarbor, an automated evaluation framework that adopts a User-as-a-Judge paradigm by simulating the user’s inner world. EmoHarbor employs a Chain-of-Agent architecture that decomposes users’ internal processes into three specialized roles, enabling agents to interact with supporters and complete assessments in a manner similar to human users. We instantiate this benchmark using 100 real-world user profiles that cover a diverse range of personality traits and situations, and define 10 evaluation dimensions of personalized support quality. Comprehensive evaluation of 20 advanced LLMs on EmoHarbor reveals a critical insight: while these models excel at generating empathetic responses, they consistently fail to tailor support to individual user contexts. This finding reframes the central challenge, shifting research focus from merely enhancing generic empathy to developing truly user-aware emotional support. EmoHarbor provides a reproducible and scalable framework to guide the development and evaluation of more nuanced and user-aware emotional support systems.
zh

[NLP-56] Bayesian Orchestration of Multi-LLM Agents for Cost-Aware Sequential Decision-Making

【速读】: 该论文旨在解决在决策成本不对称场景下(如招聘、医疗分诊和欺诈检测),传统单一大语言模型(Large Language Models, LLMs)作为代理进行决策时因缺乏贝叶斯一致性与成本感知能力而导致的次优问题。其核心解决方案是提出一种基于贝叶斯理论、成本敏感的多LLM协同框架,将LLM视为近似似然模型而非分类器,通过对比提示(contrastive prompting)获取候选状态下的似然估计,利用稳健统计方法聚合多个异构模型的结果,并结合先验信息进行贝叶斯更新,从而实现一致的概率信念演化、预期成本最小化的动作选择、基于信息价值的主动信息采集以及通过集成偏差缓解提升公平性。实验表明,在简历筛选任务中,该方法相比最优单模型基线降低总成本34%(减少29.4万美元),同时将最大群体差距从22个百分点降至5个百分点。

链接: https://arxiv.org/abs/2601.01522
作者: Danial Amin
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed as autonomous decision agents in settings with asymmetric error costs: hiring (missed talent vs wasted interviews), medical triage (missed emergencies vs unnecessary escalation), and fraud detection (approved fraud vs declined legitimate payments). The dominant design queries a single LLM for a posterior over states, thresholds “confidence,” and acts; we prove this is inadequate for sequential decisions with costs. We propose a Bayesian, cost-aware multi-LLM orchestration framework that treats LLMs as approximate likelihood models rather than classifiers. For each candidate state, we elicit likelihoods via contrastive prompting, aggregate across diverse models with robust statistics, and update beliefs with Bayes rule under explicit priors as new evidence arrives. This enables coherent belief updating, expected-cost action selection, principled information gathering via value of information, and fairness gains via ensemble bias mitigation. In resume screening with costs of 40000 USD per missed hire, 2500 USD per interview, and 150 USD per phone screen, experiments on 1000 resumes using five LLMs (GPT-4o, Claude 4.5 Sonnet, Gemini Pro, Grok, DeepSeek) reduce total cost by 294000 USD (34 percent) versus the best single-LLM baseline and improve demographic parity by 45 percent (max group gap 22 to 5 percentage points). Ablations attribute 51 percent of savings to multi-LLM aggregation, 43 percent to sequential updating, and 20 percent to disagreement-triggered information gathering, consistent with the theoretical benefits of correct probabilistic foundations.
zh

[NLP-57] From Failure to Mastery: Generating Hard Samples for Tool-use Agents

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)代理在工具使用能力训练中所面临的样本质量不足问题,即现有数据生成方法多依赖随机采样与浅层生成,导致训练轨迹简单且同质化,难以捕捉复杂、隐含的逻辑依赖关系。解决方案的关键在于提出一个名为HardGen的自动化代理式数据生成流水线:首先构建基于代理失败案例的动态API图(API Graph),从中采样以合成高难度轨迹;其次利用这些轨迹作为条件先验,引导模块化高级工具的实例化,并据此生成具有挑战性的查询;最后通过闭环评估反馈机制驱动复杂思维链(Chain-of-Thought, CoT)的持续优化与验证,从而生成可验证的高质量训练样本。

链接: https://arxiv.org/abs/2601.01498
作者: Bingguang Hao,Zengzhuang Xu,Yuntao Wen,Xinyi Xu,Yang Liu,Tong Zhao,Maolin Wang,Long Chen,Dong Wang,Yicheng Chen,Cunyin Peng,Xiangyu Zhao,Chenyi Zhuang,Ji Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The advancement of LLM agents with tool-use capabilities requires diverse and complex training corpora. Existing data generation methods, which predominantly follow a paradigm of random sampling and shallow generation, often yield simple and homogeneous trajectories that fail to capture complex, implicit logical dependencies. To bridge this gap, we introduce HardGen, an automatic agentic pipeline designed to generate hard tool-use training samples with verifiable reasoning. Firstly, HardGen establishes a dynamic API Graph built upon agent failure cases, from which it samples to synthesize hard traces. Secondly, these traces serve as conditional priors to guide the instantiation of modular, abstract advanced tools, which are subsequently leveraged to formulate hard queries. Finally, the advanced tools and hard queries enable the generation of verifiable complex Chain-of-Thought (CoT), with a closed-loop evaluation feedback steering the continuous refinement of the process. Extensive evaluations demonstrate that a 4B parameter model trained with our curated dataset achieves superior performance compared to several leading open-source and closed-source competitors (e.g., GPT-5.2, Gemini-3-Pro and Claude-Opus-4.5). Our code, models, and dataset will be open-sourced to facilitate future research.
zh

[NLP-58] Distortion Instead of Hallucination: The Effect of Reasoning Under Strict Constraints

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在缺乏外部工具或知识支持的封闭系统中,通过推理机制提升输出可靠性的有效性问题。研究发现,尽管推理能显著降低模型对约束条件的违反率(从66–75%降至13–26%),但会引发系统性事实扭曲和完全虚构内容的增加,即存在“合规性与真实性之间的权衡”(compliance-truthfulness trade-off)。这一现象在不同架构的模型(GPT-5.2 和 Gemini 3 Flash)中均一致出现,表明推理并非普遍提升可靠性,反而可能将诚实的违规转化为更隐蔽的虚假信息,从而挑战了“推理可增强输出可信度”的主流假设。关键在于揭示了推理在封闭环境中可能导致认知偏差加剧,而非单纯改善输出质量。

链接: https://arxiv.org/abs/2601.01490
作者: Junichiro Niimi
机构: Meijo University (明治大学); RIKEN AIP (理化学研究所人工智能中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the widespread adoption of large language models (LLMs), hallucinations, which are non-factual fabrications in model outputs, have become serious concerns. Reasoning capabilities have received attention as a self-verification process to improve output reliability. However, the effect of reasoning within a closed system where LLMs cannot rely on external tools or knowledge has yet to be clarified. We therefore conduct experiments under strict constraints (recommending peer-reviewed journal articles in computer science) to examine the effect of reasoning across multiple models (GPT-5.2 and Gemini 3 Flash). Our results reveal a problematic trade-off between constraint compliance and factual accuracy. Non-reasoning models exhibit high constraint violation rates (66-75%) but maintain factual accuracy, while reasoning models reduce violations (13-26%) but systematically distort known facts to satisfy constraints and increase complete fabrication. This trade-off pattern is consistent across both models despite different architectures, indicating a fundamental limitation of reasoning. Furthermore, reasoning does not uniformly improve output authenticity: effects diverge by model, reflecting different allocations of the compliance-truthfulness trade-off. These findings challenge the assumption that reasoning universally improves reliability: reasoning models trade honest constraint violations for detection-resistant distortions.
zh

[NLP-59] Four Quadrants of Difficulty: A Simple Categorisation and its Limits

【速读】: 该论文旨在解决课程学习(Curriculum Learning, CL)中样本难度估计不准确的问题,即现有方法多依赖任务无关的语言学启发式或人类直觉来近似样本难度,而这些信号是否真正反映神经模型的学习难易程度尚不明确。解决方案的关键在于提出一个四象限的难度信号分类框架——区分人类感知与模型感知、任务无关与任务相关两类维度,并系统分析其在自然语言理解数据集上的交互作用;研究发现仅任务相关的难度信号能与模型学习行为对齐,从而强调开发轻量级、任务依赖的难度估计器的重要性,以更精准地反映模型实际的学习过程。

链接: https://arxiv.org/abs/2601.01488
作者: Vanessa Toborek,Sebastian Müller,Christian Bauckhage
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: prepared for ESANN 2026 submission

点击查看摘要

Abstract:Curriculum Learning (CL) aims to improve the outcome of model training by estimating the difficulty of samples and scheduling them accordingly. In NLP, difficulty is commonly approximated using task-agnostic linguistic heuristics or human intuition, implicitly assuming that these signals correlate with what neural models find difficult to learn. We propose a four-quadrant categorisation of difficulty signals – human vs. model and task-agnostic vs. task-dependent – and systematically analyse their interactions on a natural language understanding dataset. We find that task-agnostic features behave largely independently and that only task-dependent features align. These findings challenge common CL intuitions and highlight the need for lightweight, task-dependent difficulty estimators that better reflect model learning behaviour.
zh

[NLP-60] Can Legislation Be Made Machine-Readable in PROLEG?

【速读】: 该论文旨在解决监管应用中效率与准确性不足的问题,尤其是在处理复杂法律文本(如欧盟《通用数据保护条例》(GDPR))时,如何实现自动化、可执行且具备解释能力的规则转化。其解决方案的关键在于构建一个融合大语言模型(Large Language Models, LLMs)与形式化法律推理系统(PROLEG)的框架:首先通过单个LLM提示(prompt)将自然语言法规文本同时转化为if-then规则和对应的PROLEG编码,再经由法律专家验证与优化,最终生成可执行的PROLEG程序,从而实现对具体GDPR决策的自动推理与人类可读解释输出。

链接: https://arxiv.org/abs/2601.01477
作者: May-Myo Zin,Sabine Wehnert,Yuntao Kong,Ha-Thanh Nguyen,Wachara Fungwacharakorn,Jieying Xue,Michał Araszkiewicz,Randy Goebel,Ken Satoh,Le-Minh Nguyen
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The anticipated positive social impact of regulatory processes requires both the accuracy and efficiency of their application. Modern artificial intelligence technologies, including natural language processing and machine-assisted reasoning, hold great promise for addressing this challenge. We present a framework to address the challenge of tools for regulatory application, based on current state-of-the-art (SOTA) methods for natural language processing (large language models or LLMs) and formalization of legal reasoning (the legal representation system PROLEG). As an example, we focus on Article 6 of the European General Data Protection Regulation (GDPR). In our framework, a single LLM prompt simultaneously transforms legal text into if-then rules and a corresponding PROLEG encoding, which are then validated and refined by legal domain experts. The final output is an executable PROLEG program that can produce human-readable explanations for instances of GDPR decisions. We describe processes to support the end-to-end transformation of a segment of a regulatory document (Article 6 from GDPR), including the prompting frame to guide an LLM to “compile” natural language text to if-then rules, then to further “compile” the vetted if-then rules to PROLEG. Finally, we produce an instance that shows the PROLEG execution. We conclude by summarizing the value of this approach and note observed limitations with suggestions to further develop such technologies for capturing and deploying regulatory frameworks.
zh

[NLP-61] Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR

【速读】: 该论文旨在解决多语言对话语音识别(Multilingual Conversational Speech Language Models, MLC-SLM)中基于大语言模型(Large Language Models, LLM)的自动语音识别(ASR)系统性能受限的问题,特别是现有方法在特征融合方式上的局限性以及LLM-based ASR与端到端(End-to-End, E2E)编码器-解码器架构之间性能差距不明确的问题。解决方案的关键在于提出一种增强型LLM-based ASR框架,通过微调Whisper和mHuBERT双语音编码器,并引入基于交叉注意力(cross-attention)的融合机制来优化多模态特征交互,从而更充分地利用互补信息;实验表明,在仅使用1,500小时训练数据的情况下,该方法在官方评测集上达到CER/WER为10.69%,与顶级Track 1系统相当,但仍未超越微调后的E2E Whisper模型,为未来Speech-LLM的设计提供了重要实证依据。

链接: https://arxiv.org/abs/2601.01461
作者: Yuxiang Mei,Dongxing Xu,Jiaen Liang,Yanhua Long
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 5 pages, 1 figure

点击查看摘要

Abstract:The INTERSPEECH 2025 Challenge on Multilingual Conversational Speech Language Models (MLC-SLM) promotes multilingual conversational ASR with large language models (LLMs). Our previous SHNU-mASR system adopted a competitive parallel-speech-encoder architecture that integrated Whisper and mHuBERT with an LLM. However, it faced two challenges: simple feature concatenation may not fully exploit complementary information, and the performance gap between LLM-based ASR and end-to-end(E2E) encoder-decoder ASR remained unexplored. In this work, we present an enhanced LLM-based ASR framework that combines fine-tuned Whisper and mHuBERT encoders with an LLM to enrich speech representations. We first evaluate E2E Whisper models with LoRA and full fine-tuning on the MLC-SLM ASR task, and then propose cross-attention-based fusion mechanisms for the parallel-speech-encoder. On the official evaluation set of the MLC-SLM Challenge, our system achieves a CER/WER of 10.69%, ranking on par with the top-ranked Track 1 systems, even though it uses only 1,500 hours of baseline training data compared with their large-scale training sets. Nonetheless, we find that our final LLM-based ASR still does not match the performance of a fine-tuned E2E Whisper model, providing valuable empirical guidance for future Speech-LLM design. Our code is publicly available at this https URL.
zh

[NLP-62] Segmentation and Processing of German Court Decisions from Open Legal Data

【速读】: 该论文旨在解决德国法院判决文本在结构化处理中的不一致性问题,尤其是原始Open Legal Data数据集中判决正文缺乏清晰分段、关键部分(如裁判主文、案件事实和裁判理由)难以准确识别的问题。这一问题限制了自然语言处理(Natural Language Processing, NLP)技术在法律文本分析中的应用,例如角色分类、信息检索与引用分析等下游任务。解决方案的关键在于:从原始数据中系统性地提取并结构化三个核心部分——Tenor(裁判主文)、Tatbestand(案件事实)和Entscheidungsgründe(裁判理由),并通过统计学方法(Cochran公式,95%置信水平,5%误差范围)选取384个代表性样本进行人工验证,确保分段准确性;同时将Rechtsmittelbelehrung(上诉告知)作为独立字段提取,因其属于程序性说明而非判决内容本身。最终构建了一个包含251,038份判决的清洗与分段数据集,以JSONL格式公开发布,为德国法律语料库研究提供高质量基础资源。

链接: https://arxiv.org/abs/2601.01449
作者: Harshil Darji,Martin Heckelmann,Christina Kratsch,Gerard de Melo
机构: Hochschule für Technik und Wirtschaft Berlin, Germany; Hasso-Plattner Institute / University of Potsdam, Germany
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Accepted and published as a research article in Legal Knowledge and Information Systems (JURIX 2025 proceedings, IOS Press). Pages 276–281

点击查看摘要

Abstract:The availability of structured legal data is important for advancing Natural Language Processing (NLP) techniques for the German legal system. One of the most widely used datasets, Open Legal Data, provides a large-scale collection of German court decisions. While the metadata in this raw dataset is consistently structured, the decision texts themselves are inconsistently formatted and often lack clearly marked sections. Reliable separation of these sections is important not only for rhetorical role classification but also for downstream tasks such as retrieval and citation analysis. In this work, we introduce a cleaned and sectioned dataset of 251,038 German court decisions derived from the official Open Legal Data dataset. We systematically separated three important sections in German court decisions, namely Tenor (operative part of the decision), Tatbestand (facts of the case), and Entscheidungsgründe (judicial reasoning), which are often inconsistently represented in the original dataset. To ensure the reliability of our extraction process, we used Cochran’s formula with a 95% confidence level and a 5% margin of error to draw a statistically representative random sample of 384 cases, and manually verified that all three sections were correctly identified. We also extracted the Rechtsmittelbelehrung (appeal notice) as a separate field, since it is a procedural instruction and not part of the decision itself. The resulting corpus is publicly available in the JSONL format, making it an accessible resource for further research on the German legal system.
zh

[NLP-63] Flip: Iterative Feedback-driven Counterfactual Example Refinement

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成有效反事实样本(counterfactual examples)时面临的挑战,即现有单次遍历方法难以可靠地诱导标签变化,且未充分利用LLMs的自我修正能力。其解决方案的关键在于提出iFlip,一种迭代精炼方法,通过整合三种反馈信号——模型置信度、特征归因(feature attribution)和自然语言反馈——来逐步优化反事实样本的生成过程。实验表明,iFlip在标签翻转率上平均比五种前沿基线高出57.8%,且用户研究验证了其在完整性、满意度和可行性方面的优势,同时消融实验证明迭代次数、高归因词定位和早期停止机制是生成有效反事实的核心要素。

链接: https://arxiv.org/abs/2601.01446
作者: Yilong Wang,Qianli Wang,Nils Feldhus
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: In submission

点击查看摘要

Abstract:Counterfactual examples are minimal edits to an input that alter a model’s prediction. They are widely employed in explainable AI to probe model behavior and in natural language processing (NLP) to augment training data. However, generating valid counterfactuals with large language models (LLMs) remains challenging, as existing single-pass methods often fail to induce reliable label changes, neglecting LLMs’ self-correction capabilities. To explore this untapped potential, we propose iFlip, an iterative refinement approach that leverages three types of feedback, including model confidence, feature attribution, and natural language. Our results show that iFlip achieves an average 57.8% higher validity than the five state-of-the-art baselines, as measured by the label flipping rate. The user study further corroborates that iFlip outperforms baselines in completeness, overall satisfaction, and feasibility. In addition, ablation studies demonstrate that three components are paramount for iFlip to generate valid counterfactuals: leveraging an appropriate number of iterations, pointing to highly attributed words, and early stopping. Finally, counterfactuals generated by iFlip enable effective counterfactual data augmentation, substantially improving model performance and robustness.
zh

[NLP-64] SWE-Lego: Pushing the Limits of Supervised Fine-tuning for Software Issue Resolving

【速读】: 该论文旨在解决软件工程(Software Engineering, SWE)任务中模型性能受限于复杂训练范式的问题,尤其是如何在不依赖强化学习或混合训练策略的前提下,通过轻量级监督微调(Supervised Fine-Tuning, SFT)实现顶尖性能。其解决方案的关键在于提出SWE-Lego这一系统性SFT配方,包含三个核心要素:一是构建了一个由3.2万高质量任务实例和1.8万验证轨迹组成的混合数据集(融合真实与合成数据),确保数据在质量和数量上的互补性;二是引入错误掩码(error masking)与难度导向课程学习(difficulty-based curriculum)的精细化SFT流程,显著提升动作质量和整体表现;三是基于训练好的验证器进行测试时扩展(Test-Time Scaling, TTS),进一步放大模型能力——例如,8B和32B模型在TTS@16下分别从42.2%提升至49.6%,52.6%提升至58.8%。

链接: https://arxiv.org/abs/2601.01426
作者: Chaofan Tao,Jierun Chen,Yuxin Jiang,Kaiqi Kou,Shaowei Wang,Ruoyu Wang,Xiaohui Li,Sidi Yang,Yiming Du,Jianbo Dai,Zhiming Mao,Xinyu Wang,Lifeng Shang,Haoli Bai
机构: Huawei Technologies (华为技术); NTU (南洋理工大学); HKU (香港大学); CUHK (香港中文大学)
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注: Project website: this https URL

点击查看摘要

Abstract:We present SWE-Lego, a supervised fine-tuning (SFT) recipe designed to achieve state-ofthe-art performance in software engineering (SWE) issue resolving. In contrast to prevalent methods that rely on complex training paradigms (e.g., mid-training, SFT, reinforcement learning, and their combinations), we explore how to push the limits of a lightweight SFT-only approach for SWE tasks. SWE-Lego comprises three core building blocks, with key findings summarized as follows: 1) the SWE-Lego dataset, a collection of 32k highquality task instances and 18k validated trajectories, combining real and synthetic data to complement each other in both quality and quantity; 2) a refined SFT procedure with error masking and a difficulty-based curriculum, which demonstrably improves action quality and overall performance. Empirical results show that with these two building bricks alone,the SFT can push SWE-Lego models to state-of-the-art performance among open-source models of comparable size on SWE-bench Verified: SWE-Lego-Qwen3-8B reaches 42.2%, and SWE-Lego-Qwen3-32B attains 52.6%. 3) We further evaluate and improve test-time scaling (TTS) built upon the SFT foundation. Based on a well-trained verifier, SWE-Lego models can be significantly boosted–for example, 42.2% to 49.6% and 52.6% to 58.8% under TTS@16 for the 8B and 32B models, respectively.
zh

[NLP-65] From Emotion Classification to Emotional Reasoning : Enhancing Emotional Intelligence in Large Language Models

【速读】: 该论文旨在解决如何在不改变模型架构的前提下,提升小型开源大语言模型(LLM)在情感推理任务中的表现问题。其核心挑战在于,小规模模型通常缺乏对复杂情绪情境的理解能力,而现有方法难以有效迁移大规模模型的情感推理知识。解决方案的关键在于设计了一种多智能体生成流水线,通过模拟心理治疗对话生成结构化的、带解释的情感多选题(emotion multiple-choice questions, MCQs),从而构建合成的情感推理数据集,并利用该数据集对7B参数量的模型进行微调。实验表明,该方法显著提升了模型在EmoBench评估指标上的情感理解(EU)和情感意识(EA)得分,验证了合成情感链式思维数据的有效性。

链接: https://arxiv.org/abs/2601.01407
作者: Arjhun Sreedar,Rohan Pillay,Laukik Patade
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校)
类目: Computation and Language (cs.CL)
备注: 10 pages, 1 figure

点击查看摘要

Abstract:This work investigates whether synthetic emotional chain-of-thought data can improve the emotional reasoning abilities of smaller open large language models (LLMs). We design a multi-agent generation pipeline that produces therapy-style conversations and converts them into structured emotion multiple-choice questions (MCQs) with explanations. We propose that fine-tuning a variety of 7B models on this dataset should yield substantial gains in emotional understanding and emotional awareness on EmoBench-style evaluations, suggesting that emotional reasoning can be induced without architectural changes. Our results demonstrate that fine-tuned Mistral 7B achieves EU improvements from 10.5 to 20.5 and EA improvements from 40.5 to 60.0, validating the effectiveness of synthetic emotional reasoning data for enhancing model capabilities in nuanced emotional tasks.
zh

[NLP-66] LANCET: Neural Intervention via Structural Entropy for Mitigating Faithfulness Hallucinations in LLM s

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中存在的忠实性幻觉(faithfulness hallucinations)问题,即模型在生成内容时产生与输入或事实不符的错误信息。现有方法多依赖节点级调整或粗粒度抑制策略,未能充分考虑神经信息在模型内部的分布式传播特性,导致干预不精准。其解决方案的关键在于提出一种名为Lancet的新型框架,通过结构熵(structural entropy)和幻觉差异比(hallucination difference ratios)实现精准神经干预:首先利用梯度驱动的对比分析定位易产生幻觉的神经元,再通过最小化结构熵映射其传播路径,最终实施分层干预策略,在阻断幻觉传播的同时保留模型的通用能力。实验证明,该方法在多个幻觉基准数据集上显著优于当前最优方案。

链接: https://arxiv.org/abs/2601.01401
作者: Chenxu Wang,Chaozhuo Li,Pengbo Wang,Litian Zhang,Songyang Liu,Ji Qi,Jiahui Hu,Yushan Cai,Hao Zhao,Rui Pu
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models have revolutionized information processing, yet their reliability is severely compromised by faithfulness hallucinations. While current approaches attempt to mitigate this issue through node-level adjustments or coarse suppression, they often overlook the distributed nature of neural information, leading to imprecise interventions. Recognizing that hallucinations propagate through specific forward transmission pathways like an infection, we aim to surgically block this flow using precise structural analysis. To leverage this, we propose Lancet, a novel framework that achieves precise neural intervention by leveraging structural entropy and hallucination difference ratios. Lancet first locates hallucination-prone neurons via gradient-driven contrastive analysis, then maps their propagation pathways by minimizing structural entropy, and finally implements a hierarchical intervention strategy that preserves general model capabilities. Comprehensive evaluations across hallucination benchmark datasets demonstrate that Lancet significantly outperforms state-of-the-art methods, validating the effectiveness of our surgical approach to neural intervention.
zh

[NLP-67] EternalMath: A Living Benchmark of Frontier Mathematics that Evolves with Human Discovery

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)数学推理能力评估中存在的两大问题:一是评估基准多为静态且覆盖范围有限,难以反映前沿数学研究的真实复杂性;二是现有方法依赖大量专家人工标注,导致可扩展性和持续更新能力不足。解决方案的关键在于提出一个完全自动化的、基于定理的评估流水线,该流水线能够直接从近期同行评审的数学文献中提取可执行和可验证的推理任务,通过识别构造性或定量结果并生成参数化的问题模板与确定性解法,实现无需大规模人工干预即可进行规模化、可复现且持续演进的评估。这一方法不仅支持时间上的扩展性(temporal extensibility),还能提供内在正确性验证(intrinsic correctness checking)和子领域定制能力(domain-specific customization)。

链接: https://arxiv.org/abs/2601.01400
作者: Jicheng Ma,Guohua Wang,Xinhua Feng,Yiming Liu,Zhichao Hu,Yuhong Liu
机构: Renmin University of China (中国人民大学); Tencent (腾讯)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Current evaluations of mathematical reasoning in large language models (LLMs) are dominated by static benchmarks, either derived from competition-style problems or curated through costly expert effort, resulting in limited coverage of research-level mathematics and rapid performance saturation. We propose a fully automated, theorem-grounded pipeline for evaluating frontier mathematical reasoning, which directly transforms recent peer-reviewed mathematical literature into executable and verifiable reasoning tasks. The pipeline identifies constructive or quantitative results, instantiates them into parameterized problem templates, and generates deterministic solutions through execution-based verification, enabling scalable, reproducible, and continuously updatable evaluation without reliance on large-scale expert authoring. By design, this approach supports temporal extensibility, intrinsic correctness checking, and domain-specific customization across mathematical subfields. Applying this pipeline yields \textbfEternalMath, an evolving evaluation suite derived from contemporary research papers. Experiments with state-of-the-art LLMs reveal substantial performance gaps, indicating that mathematical reasoning at the research frontier remains far from saturated and underscoring the need for evaluation methodologies that evolve in step with human mathematical discovery.
zh

[NLP-68] SAFE-QAQ: End-to-End Slow-Thinking Audio-Text Fraud Detection via Reinforcement Learning

【速读】: 该论文旨在解决现有欺诈检测方法主要依赖语音转写文本(Speech-to-Text, STT)所带来的局限性,包括自动语音识别(ASR)错误以及对声学线索(如语调、环境背景等)的忽略,从而难以有效识别复杂欺骗策略的问题。其解决方案的关键在于提出一个端到端的音频驱动型慢思考欺诈检测框架——SAFE-QAQ,该框架通过消除转写错误对检测性能的影响,并引入基于规则的慢思考奖励机制,以分层推理方式精准捕捉细粒度音频特征,实现对欺诈指示模式的系统性识别;同时结合实时通话中的动态风险评估机制,支持早期预警与干预,显著提升了检测准确率、推理效率和实时处理能力。

链接: https://arxiv.org/abs/2601.01392
作者: Peidong Wang,Zhiming Ma,Xin Dai,Yongkang Liu,Shi Feng,Xiaocui Yang,Wenxing Hu,Zhihao Wang,Mingjun Pan,Li Yuan,Daling Wang
机构: Northeastern University, China (东北大学); China Mobile Internet Company Ltd. (中国移动互联网公司); Shanghai University of Electric Power, China (上海电力大学); Peking University, Shenzhen, China (北京大学深圳校区)
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Existing fraud detection methods predominantly rely on transcribed text, suffering from ASR errors and missing crucial acoustic cues like vocal tone and environmental context. This limits their effectiveness against complex deceptive strategies. To address these challenges, we first propose \textbfSAFE-QAQ, an end-to-end comprehensive framework for audio-based slow-thinking fraud detection. First, the SAFE-QAQ framework eliminates the impact of transcription errors on detection performance. Secondly, we propose rule-based slow-thinking reward mechanisms that systematically guide the system to identify fraud-indicative patterns by accurately capturing fine-grained audio details, through hierarchical reasoning processes. Besides, our framework introduces a dynamic risk assessment framework during live calls, enabling early detection and prevention of fraud. Experiments on the TeleAntiFraud-Bench demonstrate that SAFE-QAQ achieves dramatic improvements over existing methods in multiple key dimensions, including accuracy, inference efficiency, and real-time processing capabilities. Currently deployed and analyzing over 70,000 calls daily, SAFE-QAQ effectively automates complex fraud detection, reducing human workload and financial losses. Code: this https URL.
zh

[NLP-69] Investigating the Multilingual Calibration Effects of Language Model Instruction-Tuning EACL

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多语言场景下预测不确定性校准不足的问题,尤其关注低资源语言中因数据稀缺导致的模型置信度与实际准确率不一致的现象。研究发现,即使在低资源语言中,仅使用高资源语言的监督微调(Supervised Fine-Tuning, SFT)数据进行指令微调后,模型置信度显著上升但准确率提升有限,造成严重的校准偏差。解决方案的关键在于采用标签平滑(label smoothing)技术,该方法无需额外低资源语言的微调数据即可有效缓解校准问题,在所有语言中均保持更优的校准性能,凸显了多语言训练与微调策略对提升模型可靠性与公平性的重要性。

链接: https://arxiv.org/abs/2601.01362
作者: Jerry Huang,Peng Lu,Qiuhao Zeng,Yusuke Iwasawa,Yutaka Matsuo,Sarath Chandar,Edison Marrese-Taylor,Irene Li
机构: Mila - Quebec AI Institute (魁北克人工智能研究所); Université de Montréal (蒙特利尔大学); The University of Tokyo (东京大学); Western University (西蒙弗雷泽大学); Vector Institute (向量研究所); Polytechnique Montréal (蒙特利尔工程学院); CIFAR AI Chair (加拿大高级研究院人工智能主席); AIST (产业技术综合研究所)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: Accepted to The 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL)

点击查看摘要

Abstract:Ensuring that deep learning models are well-calibrated in terms of their predictive uncertainty is essential in maintaining their trustworthiness and reliability, yet despite increasing advances in foundation model research, the relationship between such large language models (LLMs) and their calibration remains an open area of research. In this work, we look at a critical gap in the calibration of LLMs within multilingual settings, in an attempt to better understand how the data scarcity can potentially lead to different calibration effects and how commonly used techniques can apply in these settings. Our analysis on two multilingual benchmarks, over 29 and 42 languages respectively, reveals that even in low-resource languages, model confidence can increase significantly after instruction-tuning on high-resource language SFT datasets. However, improvements in accuracy are marginal or non-existent, resulting in mis-calibration, highlighting a critical shortcoming of standard SFT for multilingual languages. Furthermore, we observe that the use of label smoothing to be a reasonable method alleviate this concern, again without any need for low-resource SFT data, maintaining better calibration across all languages. Overall, this highlights the importance of multilingual considerations for both training and tuning LLMs in order to improve their reliability and fairness in downstream use.
zh

[NLP-70] FC-CONAN: An Exhaustively Paired Dataset for Robust Evaluation of Retrieval Systems KR

【速读】: 该论文旨在解决在线言论中仇恨言论(Hate Speech, HS)识别与对抗策略研究中因数据集标注不充分而导致的评估偏差问题。现有代表性资源如CONAN仅标注了部分HS与反叙事(Counter-Narrative, CN)配对,限制了计数言论检索系统(counterspeech retrieval systems)的全面评估。解决方案的关键在于构建首个全连接数据集FC-CONAN,通过穷举45条英文HS消息与129条CN的全部组合(共5805对),并采用两阶段标注流程(九名标注者与四名验证者参与)生成四个可靠性递减但规模递增的分区(Diamond、Gold、Silver和Bronze),从而实现更忠实的模型评估与细致的错误分析。该数据集不包含任何CONAN中的配对,揭示了数百个此前未被标注的正样本。

链接: https://arxiv.org/abs/2601.01350
作者: Juan Junqueras,Florian Boudin,May-Myo Zin,Ha-Thanh Nguyen,Wachara Fungwacharakorn,Damián Ariel Furman,Akiko Aizawa,Ken Satoh
机构: Universidad de Buenos Aires, FCEyN, Departamento de Computación (布宜诺斯艾利斯大学, 自然科学与数学学院, 计算系); JFLI, CNRS, Nantes University (法国国家科学研究中心, 南特大学); Center for Juris-Informatics, ROIS-DS (日本国立情报学研究所司法信息中心); National Institute of Informatics (NII) (日本国立情报学研究所)
类目: Computation and Language (cs.CL)
备注: Presented at NeLaMKRR@KR, 2025 ( arXiv:2511.09575 )

点击查看摘要

Abstract:Hate speech (HS) is a critical issue in online discourse, and one promising strategy to counter it is through the use of counter-narratives (CNs). Datasets linking HS with CNs are essential for advancing counterspeech research. However, even flagship resources like CONAN (Chung et al., 2019) annotate only a sparse subset of all possible HS-CN pairs, limiting evaluation. We introduce FC-CONAN (Fully Connected CONAN), the first dataset created by exhaustively considering all combinations of 45 English HS messages and 129 CNs. A two-stage annotation process involving nine annotators and four validators produces four partitions-Diamond, Gold, Silver, and Bronze-that balance reliability and scale. None of the labeled pairs overlap with CONAN, uncovering hundreds of previously unlabelled positives. FC-CONAN enables more faithful evaluation of counterspeech retrieval systems and facilitates detailed error analysis. The dataset is publicly available.
zh

[NLP-71] Reasoning Over Recall: Evaluating the Efficacy of Generalist Architectures vs. Specialized Fine-Tunes in RAG -Based Mental Health Dialogue Systems

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在心理咨询服务中面临的双重挑战:幻觉(hallucinations)和缺乏共情能力(lack of empathy)。为应对这些问题,研究提出采用检索增强生成(Retrieval-Augmented Generation, RAG)技术,将模型输出锚定在可信的临床来源上以减少幻觉,同时比较不同模型架构在共情表现上的差异。解决方案的关键在于:相较于专门针对心理健康领域微调的大模型(domain-specific fine-tunes),具备强推理能力的通用模型(generalist models)即使参数规模更小(3B vs. 7B),也能在共情评分上显著优于后者(3.72 vs. 3.26, p < 0.001),且表现出更强的上下文理解能力和更低的过拟合风险。这表明,在RAG框架下,良好的推理能力比特定领域的词汇训练更为重要,只要答案基于临床证据,一个推理能力强的通用模型即可提供更具共情力和平衡性的支持。

链接: https://arxiv.org/abs/2601.01341
作者: Md Abdullah Al Kafi,Raka Moni,Sumit Kumar Banshal
机构: Dhaka International University (达卡国际大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The deployment of Large Language Models (LLMs) in mental health counseling faces the dual challenges of hallucinations and lack of empathy. While the former may be mitigated by RAG (retrieval-augmented generation) by anchoring answers in trusted clinical sources, there remains an open question as to whether the most effective model under this paradigm would be one that is fine-tuned on mental health data, or a more general and powerful model that succeeds purely on the basis of reasoning. In this paper, we perform a direct comparison by running four open-source models through the same RAG pipeline using ChromaDB: two generalist reasoners (Qwen2.5-3B and Phi-3-Mini) and two domain-specific fine-tunes (MentalHealthBot-7B and TherapyBot-7B). We use an LLM-as-a-Judge framework to automate evaluation over 50 turns. We find a clear trend: the generalist models outperform the domain-specific ones in empathy (3.72 vs. 3.26, p 0.001 ) in spite of being much smaller (3B vs. 7B), and all models perform well in terms of safety, but the generalist models show better contextual understanding and are less prone to overfitting as we observe in the domain-specific models. Overall, our results indicate that for RAG-based therapy systems, strong reasoning is more important than training on mental health-specific vocabulary; i.e. a well-reasoned general model would provide more empathetic and balanced support than a larger narrowly fine-tuned model, so long as the answer is already grounded in clinical evidence.
zh

[NLP-72] FLOP-Efficient Training: Early Stopping Based on Test-Time Compute Awareness

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)训练过程中的高计算资源消耗问题,即如何在不牺牲模型准确性的前提下显著降低训练所需的浮点运算次数(FLOPs)。其核心解决方案是提出“测试时计算感知训练”(TTC-aware training),关键在于通过选择一个中间训练检查点(checkpoint)和相应的测试时计算(Test-Time Compute, TTC)配置,使得模型在训练FLOPs大幅减少的情况下仍能实现与全训练模型相当甚至更优的性能。为此,作者设计了一种联合优化的早停算法,结合高效的TTC评估方法和理论上的“盈亏平衡边界”(break-even bound),实现训练与推理计算的动态权衡,实验表明可实现最高达92%的训练FLOPs节省,同时保持或提升模型准确性。

链接: https://arxiv.org/abs/2601.01332
作者: Hossam Amer,Maryam Dialameh,Hossein Rajabzadeh,Walid Ahmed,Weiwei Zhang,Yang Liu
机构: Ascend Team, Huawei Technologies (华为技术有限公司); University of Waterloo (滑铁卢大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Scaling training compute, measured in FLOPs, has long been shown to improve the accuracy of large language models, yet training remains resource-intensive. Prior work shows that increasing test-time compute (TTC)-for example through iterative sampling-can allow smaller models to rival or surpass much larger ones at lower overall cost. We introduce TTC-aware training, where an intermediate checkpoint and a corresponding TTC configuration can together match or exceed the accuracy of a fully trained model while requiring substantially fewer training FLOPs. Building on this insight, we propose an early stopping algorithm that jointly selects a checkpoint and TTC configuration to minimize training compute without sacrificing accuracy. To make this practical, we develop an efficient TTC evaluation method that avoids exhaustive search, and we formalize a break-even bound that identifies when increased inference compute compensates for reduced training compute. Experiments demonstrate up to 92% reductions in training FLOPs while maintaining and sometimes remarkably improving accuracy. These results highlight a new perspective for balancing training and inference compute in model development, enabling faster deployment cycles and more frequent model refreshes. Codes will be publicly released.
zh

[NLP-73] AppellateGen: A Benchmark for Appellate Legal Judgment Generation

【速读】: 该论文旨在解决当前法律判决生成研究中对二审(appellate)程序忽视的问题,现有方法多集中于一审判决生成,依赖静态的案件事实到判决结果映射,未能体现二审中基于证据更新和法律论证的辩证推理过程。其解决方案的关键在于提出一个名为AppellateGen的基准数据集,包含7,351个案件对,用于模拟二审阶段的判决生成任务;同时设计了一种基于司法标准操作流程(Standard Operating Procedure, SOP)的法律多智能体系统(Legal Multi-Agent System, SLMAS),将判决生成分解为问题识别、证据检索与文书起草等离散步骤,从而更精确地建模审判阶段间的因果依赖关系。

链接: https://arxiv.org/abs/2601.01331
作者: Hongkun Yang,Lionel Z. Wang,Wei Fan,Yiran Hu,Lixu Wang,Chenyu Liu,Shenghong Fu,Haoyang Li,Xin Xu,Jiexin Zheng,Wei Dong
机构: Nanyang Technological University (南洋理工大学); Ocean University of China (中国海洋大学); The Hong Kong Polytechnic University (香港理工大学); Hong Kong University of Science and Technology (香港科技大学); Tsinghua University (清华大学); Peking University (北京大学)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 15 pages, 4 figures, 3 tables

点击查看摘要

Abstract:Legal judgment generation is a critical task in legal intelligence. However, existing research in legal judgment generation has predominantly focused on first-instance trials, relying on static fact-to-verdict mappings while neglecting the dialectical nature of appellate (second-instance) review. To address this, we introduce AppellateGen, a benchmark for second-instance legal judgment generation comprising 7,351 case pairs. The task requires models to draft legally binding judgments by reasoning over the initial verdict and evidentiary updates, thereby modeling the causal dependency between trial stages. We further propose a judicial Standard Operating Procedure (SOP)-based Legal Multi-Agent System (SLMAS) to simulate judicial workflows, which decomposes the generation process into discrete stages of issue identification, retrieval, and drafting. Experimental results indicate that while SLMAS improves logical consistency, the complexity of appellate reasoning remains a substantial challenge for current LLMs. The dataset and code are publicly available at: this https URL.
zh

[NLP-74] 3C: Test-Time Tensor Compression with Consistency Guarantees

【速读】: 该论文旨在解决模型压缩中难以在部署时灵活控制精度与计算资源(如延迟、能耗、模型大小)之间的权衡问题,尤其是在不同硬件设备上实现可预测且可靠的性能-效率折衷。解决方案的关键在于提出T3C框架——一种“训练一次、测试时按预算调整”的压缩方法,其核心创新包括:1)弹性张量分解(elastic tensor factorization)与秩绑定的混合精度量化(rank-tied mixed-precision quantization)相结合,使模型结构具备可调的秩和位宽;2)引入轻量级控制器,将硬件感知的预算令牌(latency/energy/size budget token)映射为各层的秩/比特分配策略,且该策略对预算单调递增;3)基于谱代理和激活统计量的快速逐层一致性验证机制,有效上界logit漂移并正则化训练过程,提供近乎零开销的可靠性信号。此方案实现了在保持高精度的同时显著降低延迟和模型尺寸,并支持跨设备按需部署。

链接: https://arxiv.org/abs/2601.01299
作者: Ismail Lamaakal,Chaymae Yahyati,Yassine Maleh,Khalid El Makkaoui,Ibrahim Ouahbi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present T3C, a train-once, test-time budget-conditioned compression framework that exposes rank and precision as a controllable deployment knob. T3C combines elastic tensor factorization (maintained up to a maximal rank) with rank-tied mixed-precision quantization and a lightweight controller that maps a latency/energy/size budget token to per-layer rank/bit assignments; the policy snaps to hardware-aligned profiles and is monotone in the budget. A fast, layerwise consistency certificate, computed from spectral proxies and activation statistics, upper-bounds logit drift and regularizes training, yielding a practical reliability signal with negligible overhead. On ImageNet-1k, T3C shifts the vision Pareto frontier: for ResNet-50 at matched accuracy (\leq 0.5% drop), p50 latency is 1.18ms with a 38MB model, outperforming PTQ-8b (1.44ms, 88MB); for ViT-B/16, T3C reaches 2.30ms p50 with 59MB, improving over strong PTQ/QAT baselines. A single T3C checkpoint therefore provides predictable, certificate-backed accuracy-latency-size trade-offs on demand across devices.
zh

[NLP-75] ARGUS: Adaptive Rotation-Invariant Geometric Unsupervised System

【速读】: 该论文旨在解决高维数据流中分布漂移(distributional drift)检测的难题,现有方法如全局比较法计算复杂度高、基于投影的方法丢失几何结构信息、重新聚类方法存在身份不稳定问题。其解决方案的关键在于提出Argus框架,将漂移检测重构为在固定空间划分的数据流形上追踪局部统计量;核心创新包括:证明基于规范正交基的Voronoi剖分可生成对正交变换(orthogonal transformations)不变的漂移度量,实现每快照O(N)复杂度并提供单元级空间定位,引入图论刻画漂移传播以区分协同漂移与孤立扰动,并通过产品量化剖分(product quantization tessellation)分解高维空间为独立子空间,在d≥500维度下实现高效漂移信号聚合,从而在保持高维结构的同时避免成对比较的计算负担。

链接: https://arxiv.org/abs/2601.01297
作者: Anantha Sharma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 26 pages

点击查看摘要

Abstract:Detecting distributional drift in high-dimensional data streams presents fundamental challenges: global comparison methods scale poorly, projection-based approaches lose geometric structure, and re-clustering methods suffer from identity instability. This paper introduces Argus, A framework that reconceptualizes drift detection as tracking local statistics over a fixed spatial partition of the data manifold. The key contributions are fourfold. First, it is proved that Voronoi tessellations over canonical orthonormal frames yield drift metrics that are invariant to orthogonal transformations. The rotations and reflections that preserve Euclidean geometry. Second, it is established that this framework achieves O(N) complexity per snapshot while providing cell-level spatial localization of distributional change. Third, a graph-theoretic characterization of drift propagation is developed that distinguishes coherent distributional shifts from isolated perturbations. Fourth, product quantization tessellation is introduced for scaling to very high dimensions (d500) by decomposing the space into independent subspaces and aggregating drift signals across subspaces. This paper formalizes the theoretical foundations, proves invariance properties, and presents experimental validation demonstrating that the framework correctly identifies drift under coordinate rotation while existing methods produce false positives. The tessellated approach offers a principled geometric foundation for distribution monitoring that preserves high-dimensional structure without the computational burden of pairwise comparisons. Comments: 26 pages Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2601.01297 [cs.LG] (or arXiv:2601.01297v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2601.01297 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-76] Does Memory Need Graphs? A Unified Framework and Empirical Analysis for Long-Term Dialog Memory

【速读】: 该论文旨在解决对话记忆系统中图结构(graph structure)设计有效性不一致的问题,即现有研究在不同架构下的性能差异缺乏明确归因,难以判断哪些设计选择真正影响系统表现。其解决方案的关键在于提出一个统一的框架,将对话记忆系统分解为核心组件,并支持图与非图两种方法的对比实验;在此基础上,通过控制变量的分阶段实验,在LongMemEval和HaluMem数据集上系统性评估记忆表示、组织、维护和检索等关键模块的设计选项,发现多数性能差异源于基础系统配置而非特定架构创新,从而为未来研究提供了稳定可靠的强基线。

链接: https://arxiv.org/abs/2601.01280
作者: Sen Hu,Yuxiang Wei,Jiaxin Ran,Zhiyuan Yao,Lei Zou
机构: Peking University (北京大学); Georgia Institute of Technology (佐治亚理工学院); Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph structures are increasingly used in dialog memory systems, but empirical findings on their effectiveness remain inconsistent, making it unclear which design choices truly matter. We present an experimental, system-oriented analysis of long-term dialog memory architectures. We introduce a unified framework that decomposes dialog memory systems into core components and supports both graph-based and non-graph approaches. Under this framework, we conduct controlled, stage-wise experiments on LongMemEval and HaluMem, comparing common design choices in memory representation, organization, maintenance, and retrieval. Our results show that many performance differences are driven by foundational system settings rather than specific architectural innovations. Based on these findings, we identify stable and reliable strong baselines for future dialog memory research.
zh

[NLP-77] From Policy to Logic for Efficient and Interpretable Coverag e Assessment AAAI2026 ALT

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在医疗覆盖政策审查中因幻觉(hallucination)和不一致性导致的可靠性问题,尤其是在处理主观性和复杂性较高的法律与政策文本时。其解决方案的关键在于提出一种混合方法:将一个面向覆盖范围的检索器(coverage-aware retriever)与符号规则推理(symbolic rule-based reasoning)相结合,以提取相关条款、结构化为明确的事实与规则,并生成可审计的推理依据。该方法显著减少了LLM推理次数,从而在降低模型成本的同时提升了准确率——实现了44%的推理成本下降和4.5%的F1分数提升,兼顾了效率与可靠性。

链接: https://arxiv.org/abs/2601.01266
作者: Rhitabrat Pokharel,Hamid Hassanzadeh,Ameeta Agrawal
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at AIMedHealth @ AAAI 2026

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated strong capabilities in interpreting lengthy, complex legal and policy language. However, their reliability can be undermined by hallucinations and inconsistencies, particularly when analyzing subjective and nuanced documents. These challenges are especially critical in medical coverage policy review, where human experts must be able to rely on accurate information. In this paper, we present an approach designed to support human reviewers by making policy interpretation more efficient and interpretable. We introduce a methodology that pairs a coverage-aware retriever with symbolic rule-based reasoning to surface relevant policy language, organize it into explicit facts and rules, and generate auditable rationales. This hybrid system minimizes the number of LLM inferences required which reduces overall model cost. Notably, our approach achieves a 44% reduction in inference cost alongside a 4.5% improvement in F1 score, demonstrating both efficiency and effectiveness.
zh

[NLP-78] MambaFormer: Token-Level Guided Routing Mixture-of-Experts for Accurate and Efficient Clinical Assistance

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在临床实际应用中面临的计算成本与线性时间模型效率之间的根本性权衡问题。解决方案的关键在于提出一种基于MambaFormer的混合专家(Mixture-of-Experts, MoE)框架,通过轻量级门控机制实现令牌级别的动态路由:短而复杂的查询被路由至定制化的Transformer专家(ET5),长且高吞吐量的序列则路由至状态空间模型专家(EMamba)。该设计结合上下文复杂度、归一化序列长度和领域感知特征进行智能决策,以实现推理延迟与预测准确率之间的帕累托最优平衡;同时引入一种新型效用引导的多目标损失函数,自适应调节令牌级专家激活,从而协同优化路由行为、专家利用率及计算开销。该方法在DentalQA和PubMedQA数据集上验证有效,相比T5-Large实现24.4倍加速,且BERTScore达0.9180,为资源受限场景下的临床辅助提供了可扩展方案。

链接: https://arxiv.org/abs/2601.01260
作者: Hamad Khan,Saddam Hussain Khan(Artificial Intelligence Lab, Department of Computer Systems Engineering, University of Engineering and Applied Sciences (UEAS), Swat 19060, Pakistan)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 28 Pages, Tables 12, Figure 09

点击查看摘要

Abstract:The deployment of large language models (LLMs) in real-world clinical applications is constrained by the fundamental trade-off between computational cost and the efficiency of linear-time models. To address this, we propose an LLM-based MambaFormer hybrid Mixture-of-Experts (MoE) framework for efficient medical question-answering (QA) and clinical assistance. The MambaFormer employs a lightweight gating mechanism that performs token-level dynamic routing to a customized Transformer expert (ET5) for short, complex queries or to a State Space Model expert (EMamba) for long, high-throughput sequences. The customized EMamba and ET5 models are tailored to accommodate input sequence dimensionality, embedding structure, sequence length, and target-specific output heads, and are fine-tuned through transfer learning on a new, custom-designed DentalQA dataset. Moreover, intelligent routing decisions are driven by the contextual complexity of token embeddings, normalized sequence length, and domain-aware features, thereby enforcing a Pareto-optimal trade-off between inference latency and prediction accuracy. Furthermore, a novel utility-guided multi-objective loss jointly optimizes decisions, router parameters, routing behavior, expert utilization, and computational cost by adaptively regulating token-level expert activation. Finally, the proposed MambaFormer is cross-validated (holdout) for medical QA on the new, custom-designed DentalQA and PubMedQA datasets and compared with state-of-the-art techniques. The proposed MambaFormer outperforms (BERTScore = 0.9180) with ultra-low latency (0.077 s), delivering a 24.4 speedup over T5-Large and establishing a scalable solution for resource-constrained clinical deployment.
zh

[NLP-79] Entity-Aware and Secure Query Optimization in Database Using Named Entity Recognition

【速读】: 该论文旨在解决云存储环境中隐私保护与高效数据检索之间的矛盾问题,即传统方法虽注重数据库安全,却缺乏对查询中敏感信息的自动识别能力,导致手动识别效率低且易出错,进而增加隐私风险。其解决方案的关键在于提出一种智能隐私保护查询优化框架,核心包括:利用深度学习模型(特别是DBN-LSTM)实现高精度的命名实体识别(Named Entity Recognition, NER),自动检测查询中的敏感信息;结合高级加密标准(Advanced Encryption Standard, AES)与盲索引(blind indexing)技术对敏感数据进行加密并支持安全搜索;同时采用K-means聚类和排序搜索对非敏感数据分组优化,从而在保证隐私的前提下显著提升整体查询效率。

链接: https://arxiv.org/abs/2601.01254
作者: Azrin Sultana,Hasibur Rashid Chayon
机构: 未知
类目: Databases (cs.DB); Computation and Language (cs.CL)
备注: 48 pages, 15 figures, 14 tables

点击查看摘要

Abstract:Cloud storage has become the backbone of modern data infrastructure, yet privacy and efficient data retrieval remain significant challenges. Traditional privacy-preserving approaches primarily focus on enhancing database security but fail to address the automatic identification of sensitive information before encryption. This can dramatically reduce query processing time and mitigate errors during manual identification of sensitive information, thereby reducing potential privacy risks. To address this limitation, this research proposes an intelligent privacy-preserving query optimization framework that integrates Named Entity Recognition (NER) to detect sensitive information in queries, utilizing secure data encryption and query optimization techniques for both sensitive and non-sensitive data in parallel, thereby enabling efficient database optimization. Combined deep learning algorithms and transformer-based models to detect and classify sensitive entities with high precision, and the Advanced Encryption Standard (AES) algorithm to encrypt, with blind indexing to secure search functionality of the sensitive data, whereas non-sensitive data was divided into groups using the K-means algorithm, along with a rank search for optimization. Among all NER models, the Deep Belief Network combined with Long Short-Term Memory (DBN-LSTM) delivers the best performance, with an accuracy of 93% and precision (94%), recall, and F1 score of 93%, and 93%, respectively. Besides, encrypted search achieved considerably faster results with the help of blind indexing, and non-sensitive data fetching also outperformed traditional clustering-based searches. By integrating sensitive data detection, encryption, and query optimization, this work advances the state of privacy-preserving computation in modern cloud infrastructures.
zh

[NLP-80] Racka: Efficient Hungarian LLM Adaptation on Academic Infrastructure

【速读】: 该论文旨在解决匈牙利语与高资源语言(如英语和德语)之间在大语言模型(Large Language Model, LLM)能力上的资源差距问题。其核心解决方案是采用参数高效持续预训练方法——低秩适应(Low-Rank Adaptation, LoRA),基于Qwen-3 4B骨干模型实现轻量级更新,同时在A100(40GB)计算集群上以较低节点间带宽完成训练。关键创新在于对分词器(tokenizer)的替换与适配,显著提升了匈牙利语的分词效率(tokenization fertility),并保持了英语和德语的性能竞争力;此外,通过混合训练数据(44%匈牙利语、24%英语、21%德语、11%代码)有效缓解灾难性遗忘,确保多语言能力的稳定保留。

链接: https://arxiv.org/abs/2601.01244
作者: Zsolt Csibi(2),Bence György Gortka(1),Natabara Gyöngyössy(2),Kornél Nagy(1),Dávid Márk Nemeskey(1),Martin Sallai(1),András Simonyi(2),András Márk Szekeres(1),Gábor Palkó(1) ((1) Department of Digital Humanities, Eötvös Loránd University (2) Department of Artificial Intelligence, Eötvös Loránd University)
机构: 未知
类目: Computation and Language (cs.CL)
备注: 18 pages, 1 figures. To appear in the XXII. Magyar Számítógépes Nyelvészeti Konferencia (MSZNY 2026)

点击查看摘要

Abstract:We present Racka, a lightweight, continually pretrained large language model designed to bridge the resource gap between Hungarian and high-resource languages such as English and German. Racka employs parameter-efficient continual pretraining via Low-Rank Adaptation (LoRA) on a Qwen-3 4B backbone, making the recipe practical on A100 (40GB)-based HPC clusters with low inter-node bandwidth. To better match the training distribution, we replace and adapt the tokenizer, achieving substantially improved tokenization fertility for Hungarian while maintaining competitive performance in English and German. The model is trained on 160B subword tokens drawn from a mixture of internet and high-quality curated sources, with a composition of 44% Hungarian, 24% English, 21% German, and 11% code. This data mix is chosen to mitigate catastrophic forgetting and preserve high-resource language capabilities during continual pretraining. Our preliminary results indicate modest but stable results in language adaptation.
zh

[NLP-81] Stylometry Analysis of Human and Machine Text for Academic Integrity

【速读】: 该论文旨在解决学术诚信领域中的关键问题,包括抄袭(plagiarism)、数据伪造(fabrication)以及教育内容作者身份验证(authorship verification),提出了一种基于自然语言处理(Natural Language Processing, NLP)的框架,通过作者归属(author attribution)和写作风格变化检测(style change detection)来认证学生生成的内容。其解决方案的关键在于构建并评估针对四项核心任务的模型:(i)区分人类与机器生成文本;(ii)识别单作者与多作者文档;(iii)在多作者文档中检测作者变更;(iv)在协作生成文档中进行作者识别。实验在两个使用 Gemini 模型生成的数据集上进行,分别采用常规和严格提示指令,结果表明严格提示下性能下降,凸显了检测精心设计的机器生成文本的复杂性,为后续研究提供了可复现的基准资源。

链接: https://arxiv.org/abs/2601.01225
作者: Hezam Albaqami,Muhammad Asif Ayub,Nasir Ahmad,Yaseen Ahmad,Mohammed M. Alqahtani,Abdullah M. Algamdi,Almoaid A. Owaidah,Kashif Ahmad
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 16 pages, 9 tables, 3 figures

点击查看摘要

Abstract:This work addresses critical challenges to academic integrity, including plagiarism, fabrication, and verification of authorship of educational content, by proposing a Natural Language Processing (NLP)-based framework for authenticating students’ content through author attribution and style change detection. Despite some initial efforts, several aspects of the topic are yet to be explored. In contrast to existing solutions, the paper provides a comprehensive analysis of the topic by targeting four relevant tasks, including (i) classification of human and machine text, (ii) differentiating in single and multi-authored documents, (iii) author change detection within multi-authored documents, and (iv) author recognition in collaboratively produced documents. The solutions proposed for the tasks are evaluated on two datasets generated with Gemini using two different prompts, including a normal and a strict set of instructions. During experiments, some reduction in the performance of the proposed solutions is observed on the dataset generated through the strict prompt, demonstrating the complexities involved in detecting machine-generated text with cleverly crafted prompts. The generated datasets, code, and other relevant materials are made publicly available on GitHub, which are expected to provide a baseline for future research in the domain.
zh

[NLP-82] Almost Clinical: Linguistic properties of synthetic electronic health records

【速读】: 该论文旨在解决生成式人工智能(Generative AI)在心理健康领域中合成电子健康记录(EHRs)的语义合理性与临床适用性问题。其核心挑战在于确保合成文本不仅语法通顺、术语准确,还需符合临床实践中的语用规范,如医疗权威建构、患者主体性表达及信息传递逻辑。解决方案的关键在于构建一个结构化的合成语料库,并通过分析四类临床文体(评估报告、往来信件、转诊记录和护理计划)中的代理关系(agency)、情态(modality)和信息流(information flow),量化大语言模型(LLM)在语言层面如何再现真实临床语境,从而识别其在注册差异、临床特异性不足以及药物使用和诊断程序上的系统性偏差。

链接: https://arxiv.org/abs/2601.01171
作者: Serge Sharoff,John Baker,David Francis Hunt,Alan Simpson
机构: University of Leeds (利兹大学); University of Exeter (埃克塞特大学); King’s College London (伦敦国王学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study evaluates the linguistic and clinical suitability of synthetic electronic health records (EHRs) in the field of mental health. First, we describe the rationale and the methodology for creating the synthetic corpus. Second, we assess agency, modality, and information flow across four clinical genres (Assessments, Correspondence, Referrals and Care plans) to understand how LLMs grammatically construct medical authority and patient agency through linguistic choices. While LLMs produce coherent, terminology-appropriate texts that approximate clinical practice, systematic divergences remain, including registerial shifts, insufficient clinical specificity, and inaccuracies in medication use and diagnostic procedures.
zh

[NLP-83] Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models ICPR2026

【速读】: 该论文旨在解决 categorical data (类别型数据) 聚类中因属性值缺乏固有顺序或距离度量而导致的相似性刻画不准确问题,这一缺陷会引入语义鸿沟,掩盖潜在结构并降低聚类质量。现有方法依赖数据集内部共现模式推断值间关系,但在样本有限时可靠性下降,难以充分挖掘语义上下文。解决方案的关键在于提出 ARISE(Attention-weighted Representation with Integrated Semantic Embeddings),通过引入大型语言模型(Large Language Models, LLMs)获取外部语义知识,构建语义感知的嵌入表示,并将其与原始数据融合,从而增强类别型数据的度量空间,实现更精准的聚类效果。

链接: https://arxiv.org/abs/2601.01162
作者: Zihua Yang,Xin Liao,Yiqun Zhang,Yiu-ming Cheung
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Submitted to ICPR 2026

点击查看摘要

Abstract:Categorical data are prevalent in domains such as healthcare, marketing, and bioinformatics, where clustering serves as a fundamental tool for pattern discovery. A core challenge in categorical data clustering lies in measuring similarity among attribute values that lack inherent ordering or distance. Without appropriate similarity measures, values are often treated as equidistant, creating a semantic gap that obscures latent structures and degrades clustering quality. Although existing methods infer value relationships from within-dataset co-occurrence patterns, such inference becomes unreliable when samples are limited, leaving the semantic context of the data underexplored. To bridge this gap, we present ARISE (Attention-weighted Representation with Integrated Semantic Embeddings), which draws on external semantic knowledge from Large Language Models (LLMs) to construct semantic-aware representations that complement the metric space of categorical data for accurate clustering. That is, LLM is adopted to describe attribute values for representation enhancement, and the LLM-enhanced embeddings are combined with the original data to explore semantically prominent clusters. Experiments on eight benchmark datasets demonstrate consistent improvements over seven representative counterparts, with gains of 19-27%. Code is available at this https URL
zh

[NLP-84] DHI: Leverag ing Diverse Hallucination Induction for Enhanced Contrastive Factuality Control in Large Language Models ICONIP2025

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中存在的“幻觉”问题,即模型生成不准确或虚构信息的现象,这严重影响了其可靠性。现有方法通常通过训练一个“邪恶模型”(Evil LLM)在特定数据集上故意产生幻觉,并利用这些幻觉引导对比解码(contrastive decoding)以抑制错误输出,但该策略受限于诱导幻觉类型的多样性不足,因为 Evil LLM 往往只能重复训练时所接触的特定错误模式。论文提出 DHI(Diverse Hallucination Induction)框架,其核心创新在于无需依赖预标注的幻觉数据即可诱导多样化的幻觉类型:首先设计一种改进的损失函数,对事实正确token的生成进行惩罚性降权,从而促使 Evil LLM 在目标位置生成多样化幻觉;其次引入因果注意力掩码适配机制,降低惩罚对后续token生成的影响;最后在推理阶段采用自适应合理性约束,仅在正向模型高置信度的token位置执行对比解码,避免对正确内容施加不必要的惩罚。实验证明,DHI 在多个幻觉基准测试中显著优于其他基于对比解码的方法。

链接: https://arxiv.org/abs/2601.01156
作者: Jiani Guo,Xiangke Zeng,Jie Wu,Zuchao Li
机构: 未知
类目: Computation and Language (cs.CL)
备注: ICONIP 2025

点击查看摘要

Abstract:Large language models (LLMs) frequently produce inaccurate or fabricated information, known as “hallucinations,” which compromises their reliability. Existing approaches often train an “Evil LLM” to deliberately generate hallucinations on curated datasets, using these induced hallucinations to guide contrastive decoding against a reliable “positive model” for hallucination mitigation. However, this strategy is limited by the narrow diversity of hallucinations induced, as Evil LLMs trained on specific error types tend to reproduce only these particular patterns, thereby restricting their overall effectiveness. To address these limitations, we propose DHI (Diverse Hallucination Induction), a novel training framework that enables the Evil LLM to generate a broader range of hallucination types without relying on pre-annotated hallucination data. DHI employs a modified loss function that down-weights the generation of specific factually correct tokens, encouraging the Evil LLM to produce diverse hallucinations at targeted positions while maintaining overall factual content. Additionally, we introduce a causal attention masking adaptation to reduce the impact of this penalization on the generation of subsequent tokens. During inference, we apply an adaptive rationality constraint that restricts contrastive decoding to tokens where the positive model exhibits high confidence, thereby avoiding unnecessary penalties on factually correct tokens. Extensive empirical results show that DHI achieves significant performance gains over other contrastive decoding-based approaches across multiple hallucination benchmarks.
zh

[NLP-85] SongSage: A Large Musical Language Model with Lyric Generative Pre-training

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在音乐场景中对歌词导向知识(lyric-centric knowledge)理解不足的问题,尤其是其在播放列表(playlist)语义理解与用户意图建模方面的局限性。解决方案的关键在于提出SongSage——一个通过歌词生成式预训练(lyric generative pretraining)构建的音乐大语言模型,其核心创新包括:1)基于54.8亿词元的精心构建歌词语料库LyricBank进行持续预训练;2)采用包含77.5万样本的指令微调数据集LyricBank-SFT,在九类核心歌词任务上进行精细化微调;3)在保持通用知识理解能力的同时,显著提升模型在歌词生成、续写、零样本播放列表推荐等音乐相关任务上的表现。

链接: https://arxiv.org/abs/2601.01153
作者: Jiani Guo,Jiajia Li,Jie Wu,Zuchao Li,Yujiu Yang,Ping Wang
机构: Wuhan University (武汉大学); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models have achieved significant success in various domains, yet their understanding of lyric-centric knowledge has not been fully explored. In this work, we first introduce PlaylistSense, a dataset to evaluate the playlist understanding capability of language models. PlaylistSense encompasses ten types of user queries derived from common real-world perspectives, challenging LLMs to accurately grasp playlist features and address diverse user intents. Comprehensive evaluations indicate that current general-purpose LLMs still have potential for improvement in playlist understanding. Inspired by this, we introduce SongSage, a large musical language model equipped with diverse lyric-centric intelligence through lyric generative pretraining. SongSage undergoes continual pretraining on LyricBank, a carefully curated corpus of 5.48 billion tokens focused on lyrical content, followed by fine-tuning with LyricBank-SFT, a meticulously crafted instruction set comprising 775k samples across nine core lyric-centric tasks. Experimental results demonstrate that SongSage exhibits a strong understanding of lyric-centric knowledge, excels in rewriting user queries for zero-shot playlist recommendations, generates and continues lyrics effectively, and performs proficiently across seven additional capabilities. Beyond its lyric-centric expertise, SongSage also retains general knowledge comprehension and achieves a competitive MMLU score. We will keep the datasets inaccessible due to copyright restrictions and release the SongSage and training script to ensure reproducibility and support music AI research and applications, the datasets release plan details are provided in the appendix.
zh

[NLP-86] KOS-TL (Knowledge Operation System Type Logic)

【速读】: 该论文旨在解决传统知识表示模型中静态符号逻辑与动态系统执行之间存在的鸿沟问题,即如何构建一个既能保持逻辑严谨性又能支持自主运行的知识系统。其解决方案的关键在于提出KOS-TL(Knowledge Operation System Type Logic)框架,该框架基于依赖类型理论(Dependent Type Theory),将数据、逻辑和证明统一于单一计算架构之中,并通过三层结构实现形式化语义与演化一致性保障:核心层定义静态类型宇宙与构造原语,内核层以事件驱动机制〈Σ, Ev, Δ〉控制状态演化,运行层完成物理信号到逻辑证据的双向精化。特别地,KOS-TL融合Davidsonian事件语义与Martin-Löf类型理论,实现了“带证明的知识”(proof-carrying knowledge),确保每个状态变化都有形式化的有效性证明,从而在工业可追溯性和跨境金融合规等场景中验证了其可执行性与形式可验证性。

链接: https://arxiv.org/abs/2601.01143
作者: Peng Chen
机构: Beijing University of Language and Culture (北京语言文化大学)
类目: Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:This paper introduces KOS-TL (Knowledge Operation System Type Logic), a novel constructive framework designed to provide a rigorous logical foundation for autonomous and executable knowledge systems. Traditional knowledge representation models often suffer from a gap between static symbolic logic and dynamic system execution. To bridge this divide, KOS-TL leverages Dependent Type Theory to unify data, logic, and proof into a singular computational this http URL architecture of KOS-TL is organized into three hierarchical layers: the Core Layer, which defines the static type universe and constructive primitives; the Kernel Layer, which governs state evolution through an event-driven mechanism characterized by the triple \langle \Sigma, \textsfEv, \Delta \rangle ; and the Runtime Layer, responsible for the bidirectional refinement of physical signals into logical evidence. We formally define the operational semantics of the system and prove key meta-theoretical properties, including Progress and Evolutionary Consistency, ensuring that the system remains logically self-consistent and free from stuck states during continuous state this http URL integrating Davidsonian event semantics with Martin-Löf type theory, KOS-TL enables the construction of “proof-carrying knowledge,” where every state change in the knowledge base is accompanied by a formal witness of its validity. We demonstrate the practical utility of this logic through application examples in industrial traceability and cross-border financial compliance. Our results suggest that KOS-TL provides a robust, formally verifiable basis for the next generation of intelligent, autonomous operating systems.
zh

[NLP-87] RovoDev Code Reviewer: A Large-Scale Online Evaluation of LLM -based Code Review Automation at Atlassian ICSE’26

【速读】: 该论文旨在解决企业级代码审查自动化工具设计中的实际挑战,特别是如何在不进行模型微调(fine-tuning)的前提下,实现基于上下文感知、引导式生成且质量可控的代码审查评论(code review comment)生成。其核心解决方案是提出并部署了 RovoDev Code Reviewer,一个集成于 Atlassian Bitbucket 的大型语言模型(Large Language Models, LLMs)驱动的代码审查自动化工具,通过离线、在线及用户反馈评估,在一年周期内验证了其有效性:能够生成促使代码修改的评论比例达 38.70%,显著缩短 PR 周期时间(减少 30.8%),降低人工评论数量(减少 35.6%),并提升软件整体质量。

链接: https://arxiv.org/abs/2601.01129
作者: Kla Tantithamthavorn,Yaotian Zou,Andy Wong,Michael Gupta,Zhe Wang,Mike Buller,Ryan Jiang,Matthew Watson,Minwoo Jeong,Kun Chen,Ming Wu
机构: Monash University (莫纳什大学); Atlassian (澳大利亚); Atlassian (美国)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at the 48th International Conference on Software Engineering (ICSE’26), SEIP Track. 12 Pages

点击查看摘要

Abstract:Large Language Models (LLMs)-powered code review automation has the potential to transform code review workflows. Despite the advances of LLM-powered code review comment generation approaches, several practical challenges remain for designing enterprise-grade code review automation tools. In particular, this paper aims at answering the practical question: how can we design a review-guided, context-aware, quality-checked code review comment generation without fine-tuning? In this paper, we present RovoDev Code Reviewer, an enterprise-grade LLM-based code review automation tool designed and deployed at scale within Atlassian’s development ecosystem with seamless integration into Atlassian’s Bitbucket. Through the offline, online, user feedback evaluations over a one-year period, we conclude that RovoDev Code Reviewer is (1) effective in generating code review comments that could lead to code resolution for 38.70% (i.e., comments that triggered code changes in the subsequent commits); and (2) offers the promise of accelerating feedback cycles (i.e., decreasing the PR cycle time by 30.8%), alleviating reviewer workload (i.e., reducing the number of human-written comments by 35.6%), and improving overall software quality (i.e., finding errors with actionable suggestions). Comments: Accepted at the 48th International Conference on Software Engineering (ICSE’26), SEIP Track. 12 Pages Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG) Cite as: arXiv:2601.01129 [cs.SE] (or arXiv:2601.01129v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2601.01129 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-88] RoboPhD: Self-Improving Text-to-SQL Through Autonomous Agent Evolution

【速读】: 该论文旨在解决如何通过AI自主演化提升Text-to-SQL任务性能的问题,即在缺乏领域先验知识的情况下,让AI代理系统自动发现并优化有效的SQL生成策略。解决方案的关键在于提出RoboPhD框架,其核心是一个闭环进化循环,由SQL生成代理(SQL Generation agent)和进化代理(Evolution agent)协同工作:前者基于数据库分析脚本与生成指令完成SQL构造,后者根据性能反馈设计新版本;同时引入基于ELO的筛选机制以处理性能非传递性问题,实现“适者生存”的动态演化。该方法从一个70行的朴素基线出发,经18轮迭代进化出1500行代码的最优代理,成功发现如自适应数据库分析深度、列选择模式等有效策略,并在低成本模型上取得显著提升,实现“跳级部署”——例如进化后的Claude Haiku模型性能超越未进化的Sonnet模型,验证了AI自主构建高性能智能体系统的可行性。

链接: https://arxiv.org/abs/2601.01126
作者: Andrew Borthwick,Stephen Ash
机构: 未知
类目: Computation and Language (cs.CL)
备注: 18 pages, 3 figures

点击查看摘要

Abstract:We present RoboPhD, a system where AI agents autonomously conduct research to improve Text-to-SQL performance. RoboPhD implements a closed-loop evolution cycle with two coordinated components: a SQL Generation agent composed of a database analysis script and SQL generation instructions, and an Evolution agent that designs new versions based on performance feedback. Central to the framework is an ELO-based selection mechanism enabling survival-of-the-fittest dynamics while handling non-transitivity in performance. Starting from a naive 70-line baseline, RoboPhD evolves agents through iterative cross-pollination, discovering effective techniques without any external guidance on the Text-to-SQL domain. Our best agent, evolved to 1500 lines over 18 iterations, autonomously discovered strategies such as size-adaptive database analysis that adjusts depth based on schema complexity and SQL generation patterns for column selection, evidence interpretation, and aggregation. Evolution provides the largest gains on cheaper models: while we improve by 2.3 points over a strong Claude Opus 4.5 naive baseline, we show an improvement of 8.9 points over the weaker Claude Haiku model. This enables ‘skip a tier’ deployment: evolved Haiku exceeds naive Sonnet accuracy, and evolved Sonnet exceeds naive Opus, both at lower cost. The full system achieves 73.67% accuracy on the BIRD test set, demonstrating that AI can autonomously build a strong agentic system with only a trivial human-provided starting point.
zh

[NLP-89] Listen Attend Understand: a Regularization Technique for Stable E2E Speech Translation Training on High Variance labels

【速读】: 该论文旨在解决端到端语音翻译(End-to-End Speech Translation, E2E-ST)在目标语句存在高方差和语义模糊时收敛速度慢、性能差的问题。解决方案的关键在于提出一种名为“听、注意、理解”(Listen, Attend, Understand, LAU)的语义正则化技术,通过利用冻结的文本嵌入(frozen text embeddings)提供方向性辅助损失,约束声学编码器的潜在空间,从而在不增加推理开销的前提下注入语言学上的语义锚定(linguistic groundedness)。实验表明,LAU模型在仅使用30小时非专业标注的Bambara语音数据时,即可达到预训练数据量增加100%的基线模型的性能水平,并且在保持语义一致性方面表现更优。此外,作者引入总参数漂移(Total Parameter Drift)作为度量指标,验证了语义约束确实主动重构了编码器权重结构,使其更侧重于语义而非字面音素特征。

链接: https://arxiv.org/abs/2601.01121
作者: Yacouba Diarra,Michael Leventhal
机构: RobotsMali AI4D Lab (RobotsMali AI4D 实验室)
类目: Computation and Language (cs.CL)
备注: 9 mages, 3 figures

点击查看摘要

Abstract:End-to-End Speech Translation often shows slower convergence and worse performance when target transcriptions exhibit high variance and semantic ambiguity. We propose Listen, Attend, Understand (LAU), a semantic regularization technique that constrains the acoustic encoder’s latent space during training. By leveraging frozen text embeddings to provide a directional auxiliary loss, LAU injects linguistic groundedness into the acoustic representation without increasing inference cost. We evaluate our method on a Bambara-to-French dataset with 30 hours of Bambara speech translated by non-professionals. Experimental results demonstrate that LAU models achieve comparable performance by standard metrics compared to an E2E-ST system pretrained with 100% more data and while performing better in preserving semantic meaning. Furthermore, we introduce Total Parameter Drift as a metric to quantify the structural impact of regularization to demonstrate that semantic constraints actively reorganize the encoder’s weights to prioritize meaning over literal phonetics. Our findings suggest that LAU is a robust alternative to post-hoc rescoring and a valuable addition to E2E-ST training, especially when training data is scarce and/or noisy.
zh

[NLP-90] EmoLoom-2B: Fast Base-Model Screening for Emotion Classification and VAD with Lexicon-Weak Supervision and KV-Off Evaluation

【速读】: 该论文旨在解决小语言模型(参数量低于2B)在情感分类与情绪维度预测(Valence-Arousal-Dominance, VAD)联合任务中性能不足的问题,尤其关注如何高效筛选出具备潜力的候选模型以用于后续更复杂的训练或跨模态融合。解决方案的关键在于提出一个轻量级、可复现的流水线EmoLoom-2B,其核心包括:统一数据加载、训练和推理的JSON输入输出契约以确保评估公平性;采用KV缓存解码(KV-off decoding)减少冗余方差;引入两种正交语义正则化机制——VAD保持约束用于对齐生成文本与目标VAD三元组,以及轻量外部评估分类器提供训练阶段的目标达成、可控性、确定性和公平性指导;同时通过基于镜像情感对的Valence Flip增强提升极性敏感度,并结合A/B混合采样与熵感知温度调度优化监督微调过程,从而在Qwen-1.8B-Chat基础上实现优异的跨语料库泛化能力。

链接: https://arxiv.org/abs/2601.01112
作者: Zilin Li,Weiwei Xu,Xuanbo Lu,Zheda Liu
机构: 未知
类目: Computation and Language (cs.CL)
备注: This paper presents an initial and self-contained study of a lightweight screening pipeline for emotion-aware language modeling, intended as a reproducible baseline and system-level design reference

点击查看摘要

Abstract:We introduce EmoLoom-2B, a lightweight and reproducible pipeline that turns small language models under 2B parameters into fast screening candidates for joint emotion classification and Valence-Arousal-Dominance prediction. To ensure protocol-faithful and fair evaluation, we unify data loading, training, and inference under a single JSON input-output contract and remove avoidable variance by adopting KV-off decoding as the default setting. We incorporate two orthogonal semantic regularizers: a VAD-preserving constraint that aligns generated text with target VAD triples, and a lightweight external appraisal classifier that provides training-time guidance on goal attainment, controllability, certainty, and fairness without injecting long rationales. To improve polarity sensitivity, we introduce Valence Flip augmentation based on mirrored emotional pairs. During supervised fine-tuning, we apply A/B mixture sampling with entropy-aware temperature scheduling to balance coverage and convergence. Using Qwen-1.8B-Chat as the base model, EmoLoom-2B achieves strong performance on GoEmotions and EmpatheticDialogues, and demonstrates robust cross-corpus generalization on DailyDialog. The proposed recipe is budget-aware, auditable, and re-entrant, serving as a dependable screening pass before heavier training or multimodal fusion.
zh

[NLP-91] ks-lit-3m: A 3.1 million word kashmiri text dataset for large language model pretraining

【速读】: 该论文旨在解决克什米尔语(Kashmiri)在大型语言模型(Large Language Models, LLMs)中表现不佳的问题,其根本原因在于高质量训练数据的严重匮乏。现有文献表明,尽管LLMs在高资源语言中表现出卓越的流畅性,但在克什米尔语上却无法生成连贯文本,这并非模型本身能力不足,而是缺乏适配的语料库。解决方案的关键在于构建一个大规模、高质量且结构优化的克什米尔语语料库——KS-LIT-3M,包含310万词(约1640万字符),涵盖文学、新闻、学术与宗教文本等多样化体裁。该语料库通过开发专用的InPage到Unicode转换工具获取原始文档,并经过去英文污染、字符归一化和质量验证等严格预处理步骤,最终以连续线性文本流形式组织,专为因果语言建模(causal language modeling)设计,从而填补了克什米尔语自然语言处理领域的关键资源空白。

链接: https://arxiv.org/abs/2601.01091
作者: Haq Nawaz Malik
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate remarkable fluency across high-resource languages yet consistently fail to generate coherent text in Kashmiri, a language spoken by approximately seven million people. This performance disparity stems not from inherent model limitations but from a critical scarcity of high-quality training data. Decades of Kashmiri literature remain inaccessible to modern NLP pipelines due to their encoding in the proprietary InPage desktop publishing format. This paper introduces KS-LIT-3M, a curated corpus of 3.1 million words (16.4 million characters) specifically designed for pretraining language models on Kashmiri. The dataset is structured as a single continuous linear text stream, optimized for causal language model training where models learn to predict subsequent tokens from preceding context. The corpus was constructed through the development of a specialized InPage-to-Unicode converter, followed by rigorous preprocessing including English contamination removal, character normalization, and quality validation. Encompassing 131,607 unique words drawn from diverse genres including literary works, journalistic writing, academic texts, and religious scholarship, KS-LIT-3M addresses a fundamental resource gap for Kashmiri language technology. The dataset is released under the CC-BY-4.0 license to facilitate research in Kashmiri natural language processing.
zh

[NLP-92] 600k-ks-ocr: a large-scale synthetic dataset for optical character recognition in kashmiri script

【速读】: 该论文旨在解决克什米尔语(Kashmiri)这一濒危达基语(Dardic)语言在光学字符识别(Optical Character Recognition, OCR)领域缺乏高质量训练数据的问题。针对这一资源缺口,作者提出构建一个大规模合成语料库——600K-KS-OCR Dataset,其关键解决方案包括:使用三种传统克什米尔字体生成约60.2万张词级分割图像(分辨率为256×64像素),并配以多格式标注文本以兼容CRNN、TrOCR等主流OCR模型;通过模拟真实文档退化和多样化背景纹理的数据增强策略提升模型鲁棒性;最终将数据集划分为十个归档文件(总计约10.6 GB)并以CC-BY-4.0许可证开源,从而推动低资源语言OCR的研究与应用。

链接: https://arxiv.org/abs/2601.01088
作者: Haq Nawaz Malik
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This technical report presents the 600K-KS-OCR Dataset, a large-scale synthetic corpus comprising approximately 602,000 word-level segmented images designed for training and evaluating optical character recognition systems targeting Kashmiri script. The dataset addresses a critical resource gap for Kashmiri, an endangered Dardic language utilizing a modified Perso-Arabic writing system spoken by approximately seven million people. Each image is rendered at 256x64 pixels with corresponding ground-truth transcriptions provided in multiple formats compatible with CRNN, TrOCR, and generalpurpose machine learning pipelines. The generation methodology incorporates three traditional Kashmiri typefaces, comprehensive data augmentation simulating real-world document degradation, and diverse background textures to enhance model robustness. The dataset is distributed across ten partitioned archives totaling approximately 10.6 GB and is released under the CC-BY-4.0 license to facilitate research in low-resource language optical character recognition.
zh

[NLP-93] Unsupervised Text Style Transfer for Controllable Intensity

【速读】: 该论文旨在解决无监督文本风格迁移(Unsupervised Text Style Transfer, UTST)中可控强度迁移的挑战,即在缺乏平行语料的情况下,如何准确区分和控制相邻强度等级间细微的风格特征差异。解决方案的关键在于提出一种“SFT-then-PPO”范式:首先利用合成的平行数据对大语言模型(Large Language Model, LLM)进行监督微调(Supervised Fine-Tuning, SFT),以初步学习风格迁移能力;随后采用近端策略优化(Proximal Policy Optimization, PPO)进一步训练模型,设计了融合全局与局部风格特征的奖励函数,用于精细区分不同层级的风格强度。实验表明,该方法能有效提升LLM在多个评估指标上的表现,即使在强度接近的级别间也能生成具有显著风格差异的文本。

链接: https://arxiv.org/abs/2601.01060
作者: Shuhuan Gu,Wenbiao Tao,Xinchen Ma,Kangkang He,Ye Guo,Xiang Li,Yunshi Lan
机构: East China Normal University (华东师范大学); Baowu Group (宝武集团)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Unsupervised Text Style Transfer (UTST) aims to build a system to transfer the stylistic properties of a given text without parallel text pairs. Compared with text transfer between style polarities, UTST for controllable intensity is more challenging due to the subtle differences in stylistic features across different intensity levels. Faced with the challenges posed by the lack of parallel data and the indistinguishability between adjacent intensity levels, we propose a SFT-then-PPO paradigm to fine-tune an LLM. We first fine-tune the LLM with synthesized parallel data. Then, we further train the LLM with PPO, where the rewards are elaborately designed for distinguishing the stylistic intensity in hierarchical levels. Both the global and local stylistic features are considered to formulate the reward functions. The experiments on two UTST benchmarks showcase that both rewards have their advantages and applying them to LLM fine-tuning can effectively improve the performance of an LLM backbone based on various evaluation metrics. Even for close levels of intensity, we can still observe the noticeable stylistic difference between the generated text.
zh

[NLP-94] KV-Embedding: Training-free Text Embedding via Internal KV Re-routing in Decoder-only LLM s

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在无训练场景下进行文本嵌入时的两个结构性问题:一是因果注意力机制限制了早期token无法获取后续上下文信息,二是基于下一个词预测的目标函数导致嵌入表示偏向生成任务而非语义压缩。其解决方案的关键在于提出KV-Embedding框架,通过利用每一层最终token对应的键值(Key-Value, KV)状态作为序列的压缩表征,并将其重定向为前置前缀(prepended prefix),使所有token能在单次前向传播中访问全局序列上下文;同时引入基于内在维度的自动化层选择策略以保证方法的模型无关性。实验证明,该方法在MTEB基准上显著优于现有无训练基线,提升达10%,且支持长达4096 tokens的输入序列。

链接: https://arxiv.org/abs/2601.01046
作者: Yixuan Tang,Yi Yang
机构: The Hong Kong University of Science and Technology (香港科技大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While LLMs are powerful embedding backbones, their application in training-free settings faces two structural challenges: causal attention restricts early tokens from accessing subsequent context, and the next-token prediction objective biases representations toward generation rather than semantic compression. To address these limitations, we propose KV-Embedding, a framework that activates the latent representation power of frozen LLMs. Our method leverages the observation that the key-value (KV) states of the final token at each layer encode a compressed view of the sequence. By re-routing these states as a prepended prefix, we enable all tokens to access sequence-level context within a single forward pass. To ensure model-agnostic applicability, we introduce an automated layer selection strategy based on intrinsic dimensionality. Evaluations on MTEB across Qwen, Mistral, and Llama backbones show that KV-Embedding outperforms existing training-free baselines by up to 10%, while maintaining robust performance on sequences up to 4,096 tokens. These results demonstrate that internal state manipulation offers an efficient alternative to input modification, and we hope this work encourages further exploration of LLM internals for representation learning.
zh

[NLP-95] Multi-Dimensional Prompt Chaining to Improve Open-Domain Dialogue Generation

【速读】: 该论文旨在解决小型语言模型(Small Language Models, SLMs)在开放域对话生成中难以达到大型模型对话质量的问题。其解决方案的关键在于提出一种多维提示链(multi-dimensional prompt-chaining)框架,该框架整合了自然性(Naturalness)、连贯性(Coherence)和吸引力(Engagingness)三个维度,通过精心设计的提示策略来提升SLMs生成对话的人类相似度。实验表明,该框架显著提升了响应多样性、上下文连贯性以及整体对话质量,使Llama-2-7B等小型模型的表现接近甚至媲美Llama-2-70B和GPT-3.5 Turbo等大型模型。

链接: https://arxiv.org/abs/2601.01037
作者: Livia Leong Hui Teng
机构: Nanyang Technological University (南洋理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Small language models (SLMs) offer significant deployment advantages but often struggle to match the dialogue quality of larger models in open-domain settings. In this paper, we propose a multi-dimensional prompt-chaining framework that integrates Naturalness, Coherence, and Engagingness dimensions to enhance human-likeness in open-domain dialogue generation. We apply the framework to two SLMs, TinyLlama and Llama-2-7B, and benchmark their performance against responses generated by substantially larger models, including Llama-2-70B and GPT-3.5 Turbo. We then employ automatic and human evaluation to assess the responses based on diversity, contextual coherence, as well as overall quality. Results show that the full framework improves response diversity by up to 29%, contextual coherence by up to 28%, and engagingness as well as naturalness by up to 29%. Notably, Llama-2-7B achieves performance comparable to substantially larger models, including Llama-2-70B and GPT-3.5 Turbo. Overall, the findings demonstrate that carefully designed prompt-based strategies provide an effective and resource-efficient pathway to improving open-domain dialogue quality in SLMs.
zh

[NLP-96] A Platform for Interactive AI Character Experiences

【速读】: 该论文旨在解决构建交互式、故事驱动的数字角色所面临的多维度AI挑战,包括对话理解与生成、角色一致性维持、人格与情绪管理、知识与记忆处理、语音合成、动画生成、现实世界交互及物理环境集成等问题。其解决方案的关键在于提出一个统一的系统与平台,将上述多样化的AI组件整合为一个易于适配的框架,从而实现可信数字角色的便捷设计与部署。通过以“数字爱因斯坦”(Digital Einstein)为例验证方法的有效性,表明该平台不仅适用于特定角色,还具备通用性和可扩展性,为沉浸式角色体验提供了技术基础。

链接: https://arxiv.org/abs/2601.01027
作者: Rafael Wampfler,Chen Yang,Dillon Elste,Nikola Kovacevic,Philine Witzig,Markus Gross
机构: ETH Zurich (苏黎世联邦理工学院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:From movie characters to modern science fiction - bringing characters into interactive, story-driven conversations has captured imaginations across generations. Achieving this vision is highly challenging and requires much more than just language modeling. It involves numerous complex AI challenges, such as conversational AI, maintaining character integrity, managing personality and emotions, handling knowledge and memory, synthesizing voice, generating animations, enabling real-world interactions, and integration with physical environments. Recent advancements in the development of foundation models, prompt engineering, and fine-tuning for downstream tasks have enabled researchers to address these individual challenges. However, combining these technologies for interactive characters remains an open problem. We present a system and platform for conveniently designing believable digital characters, enabling a conversational and story-driven experience while providing solutions to all of the technical challenges. As a proof-of-concept, we introduce Digital Einstein, which allows users to engage in conversations with a digital representation of Albert Einstein about his life, research, and persona. While Digital Einstein exemplifies our methods for a specific character, our system is flexible and generalizes to any story-driven or conversational character. By unifying these diverse AI components into a single, easy-to-adapt platform, our work paves the way for immersive character experiences, turning the dream of lifelike, story-based interactions into a reality.
zh

[NLP-97] HyperJoin: LLM -augmented Hypergraph Link Prediction for Joinable Table Discovery

【速读】: 该论文旨在解决数据湖管理中可连接表(joinable table)发现任务的两个核心问题:(1) 离线阶段将表格建模为孤立或成对列,难以捕捉表内和表间的复杂结构信息;(2) 在线排序阶段仅基于查询与候选列的相似度进行排名,忽略候选列之间的相互作用,导致结果集不一致。解决方案的关键在于提出 HyperJoin,一个基于大语言模型(LLM)增强的超图(hypergraph)框架:首先构建包含表内超边和LLM增强的表间超边的超图,并将可连接表发现建模为该超图上的链接预测任务;其次设计层次交互网络(HIN),通过列与超边之间的双向消息传递学习丰富的列表示;最后引入重排序模块,利用最大生成树算法修剪噪声连接并最大化结果列的一致性,从而提升整体性能。

链接: https://arxiv.org/abs/2601.01015
作者: Shiyuan Liu,Jianwei Wang,Xuemin Lin,Lu Qin,Wenjie Zhang,Ying Zhang
机构: University of Technology Sydney (悉尼科技大学); University of New South Wales (新南威尔士大学); ACEM, Shanghai Jiao Tong University (上海交通大学先进计算与工程研究中心)
类目: Computation and Language (cs.CL); Databases (cs.DB)
备注:

点击查看摘要

Abstract:As a pivotal task in data lake management, joinable table discovery has attracted widespread interest. While existing language model-based methods achieve remarkable performance by combining offline column representation learning with online ranking, their design insufficiently accounts for the underlying structural interactions: (1) offline, they directly model tables into isolated or pairwise columns, thereby struggling to capture the rich inter-table and intra-table structural information; and (2) online, they rank candidate columns based solely on query-candidate similarity, ignoring the mutual interactions among the candidates, leading to incoherent result sets. To address these limitations, we propose HyperJoin, a large language model (LLM)-augmented Hypergraph framework for Joinable table discovery. Specifically, we first construct a hypergraph to model tables using both the intra-table hyperedges and the LLM-augmented inter-table hyperedges. Consequently, the task of joinable table discovery is formulated as link prediction on this constructed hypergraph. We then design HIN, a Hierarchical Interaction Network that learns expressive column representations through bidirectional message passing over columns and hyperedges. To strengthen coherence and internal consistency in the result columns, we cast online ranking as a coherence-aware top-k column selection problem. We then introduce a reranking module that leverages a maximum spanning tree algorithm to prune noisy connections and maximize coherence. Experiments demonstrate the superiority of HyperJoin, achieving average improvements of 21.4% (Precision@15) and 17.2% (Recall@15) over the best baseline.
zh

[NLP-98] Intention Collapse: Intention-Level Metrics for Reasoning in Language Models

【速读】: 该论文试图解决语言生成过程中“意图坍缩”(intention collapse)问题,即高维内部意图空间 I 到外部语言空间 L 的多对一映射导致的信息损失问题。其解决方案的关键在于提出三种与模型无关的意图度量指标:意图熵(Hint)、有效维度(dimeff)和潜在知识可恢复性(Recov),并通过实证研究揭示推理过程中的内部意图如何在生成前被塑造。实验表明,链式思维(CoT)策略能显著提升准确率并降低意图熵,同时保持较高的有效维度,说明其更有效地保留了原始意图信息,而直接输出则导致意图坍缩至多数类,验证了所提指标在区分不同推理模式及捕捉隐含信息方面的潜力。

链接: https://arxiv.org/abs/2601.01011
作者: Patricio Vera
机构: George Washington University (乔治·华盛顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 21 pages, 4 figures, 3 tables. Code: this https URL

点击查看摘要

Abstract:Every act of language generation compresses a rich internal state into a single token sequence. We call this process intention collapse: a many-to-one projection from a high dimensional intention space I into an external language space L. We formalize intention collapse for contemporary language models, define three simple, model agnostic intention metrics (intention entropy Hint, effective dimensionality dimeff, and latent knowledge recoverability Recov), and propose an empirical agenda for studying how inference time computation shapes internal intentions before they are verbalized. We also report a first small scale experiment. Using a 4 bit Mistral 7B model on 200 GSM8K problems, we compare a direct answer baseline, a chain of thought (CoT) regime, and a babble control. CoT raises accuracy from 5.5 percent to 53 percent, sharply reduces pre collapse intention entropy (from 1.42 to 0.37 bits), and shows higher global effective dimensionality than the other regimes despite producing fewer tokens than babble. At the same time, Hint has little item level predictive power, and a linear probe on I achieves AUROC 0.65 in the CoT regime but only about chance in the baseline regime, where it collapses to the majority class. These preliminary results indicate that intention level metrics can distinguish inference regimes and expose latent information that is partly lost during collapse, while also revealing important limitations of our current proxies
zh

[NLP-99] Reliability Under Randomness: An Empirical Analysis of Sparse and Dense Language Models Across Decoding Temperatures

【速读】: 该论文旨在解决稀疏型混合专家(Mixture-of-Experts, MoE)架构在随机解码(stochastic decoding)条件下输出可靠性下降的问题,特别是探究条件计算机制与温度采样之间的相互作用是否会导致输出不稳定。其关键解决方案在于通过系统性实验对比三种代表性模型——包括稀疏基础模型(OLMoE-7B)、稀疏指令微调模型(Mixtral-8x7B)和稠密指令微调模型(Qwen2.5-3B)——在不同解码温度下的准确性、格式合规性、重复生成一致性及置信度指标表现,发现指令微调(instruction tuning)而非架构稀疏性才是决定模型对解码随机性鲁棒性的核心因素,从而为稀疏语言模型在可靠性敏感场景中的安全部署提供了理论依据。

链接: https://arxiv.org/abs/2601.00942
作者: Kabir Grover
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The increasing prevalence of sparse Mixture-of-Experts (MoE) architectures in large language models raises important questions regarding their reliability under stochastic decoding. While conditional computation enables substantial gains in computational efficiency, it remains unclear whether the interaction between sparse routing and temperature-based sampling compromises output stability relative to dense architectures. This work investigates whether conditional computation in MoE models amplifies decoding-induced randomness, leading to reduced reliability as temperature increases. We evaluate three representative models: OLMoE-7B (sparse base), Mixtral-8x7B (sparse instruction-tuned), and Qwen2.5-3B (dense instruction-tuned) on deterministic arithmetic reasoning tasks with objectively verifiable answers. Experiments span four decoding configurations, ranging from greedy decoding to T=1.0. Our evaluation encompasses accuracy, format compliance, output consistency across repeated generations, and confidence metrics, totaling 9,360 model generations. Results demonstrate that the sparse instruction-tuned model exhibits stability comparable to the dense instruction-tuned model across all decoding temperatures, while the sparse base model shows systematic degradation as temperature increases. These findings indicate that instruction tuning, rather than architectural sparsity, is the primary determinant of robustness to decoding randomness on deterministic tasks. We discuss the implications of these results for deploying sparse language models in reliability-critical applications, highlighting scenarios in which sparse architectures can be safely adopted without sacrificing output stability.
zh

[NLP-100] Rate-Distortion Analysis of Compressed Query Delegation with Low-Rank Riemannian Updates

【速读】: 该论文旨在解决受限上下文代理(bounded-context agents)在中间推理过程超出有效工作记忆预算时失效的问题。其核心解决方案是提出压缩查询委托(Compressed Query Delegation, CQD)框架,关键在于:(i) 将高维潜在推理状态压缩为低秩张量查询,(ii) 将最小化后的查询委托给外部推理引擎(oracle),(iii) 通过固定秩流形上的黎曼随机优化更新潜在状态。该方法从数学上建模为带有查询预算函数的约束随机规划问题,并与经典率失真理论和信息瓶颈原理建立联系,证明谱硬阈值法在特定约束二次失真问题下最优,同时给出了在有界噪声和光滑性假设下的收敛性保证。

链接: https://arxiv.org/abs/2601.00938
作者: Faruk Alpay,Bugra Kilictas
机构: Bahçeşehir University (贝塞希尔大学)
类目: Computation and Language (cs.CL); Optimization and Control (math.OC)
备注: 9 pages

点击查看摘要

Abstract:Bounded-context agents fail when intermediate reasoning exceeds an effective working-memory budget. We study compressed query delegation (CQD): (i) compress a high-dimensional latent reasoning state into a low-rank tensor query, (ii) delegate the minimal query to an external oracle, and (iii) update the latent state via Riemannian optimization on fixed-rank manifolds. We give a math-first formulation: CQD is a constrained stochastic program with a query-budget functional and an oracle modeled as a noisy operator. We connect CQD to classical rate-distortion and information bottleneck principles, showing that spectral hard-thresholding is optimal for a natural constrained quadratic distortion problem, and we derive convergence guarantees for Riemannian stochastic approximation under bounded oracle noise and smoothness assumptions. Empirically, we report (A) a 2,500-item bounded-context reasoning suite (BBH-derived tasks plus curated paradox instances) comparing CQD against chain-of-thought baselines under fixed compute and context; and (B) a human “cognitive mirror” benchmark (N=200) measuring epistemic gain and semantic drift across modern oracles.
zh

[NLP-101] Measuring Social Media Polarization Using Large Language Models and Heuristic Rules

【速读】: 该论文旨在解决在线话语中情感极化(affective polarization)的量化与分析问题,尤其针对气候变化和枪支管控等争议性话题。传统方法依赖情感分析或预定义分类器,难以捕捉复杂互动中的情绪倾向与立场一致性。其解决方案的关键在于融合大语言模型(LLMs)与领域启发式规则:首先利用LLMs提取用户立场、情感基调及交互共识模式,再通过基于规则的评分系统,结合立场对齐度、情绪内容和互动动态,在小规模对话中亦能有效量化情感极化程度。该框架揭示了两种事件驱动的极化模式——预期型极化与反应型极化,提供了可扩展且可解释的情感极化测量方法。

链接: https://arxiv.org/abs/2601.00927
作者: Jawad Chowdhury,Rezaur Rashid,Gabriel Terejanu
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Foundations and Applications of Big Data Analytics (FAB), Niagara Falls, Canada, 2025

点击查看摘要

Abstract:Understanding affective polarization in online discourse is crucial for evaluating the societal impact of social media interactions. This study presents a novel framework that leverages large language models (LLMs) and domain-informed heuristics to systematically analyze and quantify affective polarization in discussions on divisive topics such as climate change and gun control. Unlike most prior approaches that relied on sentiment analysis or predefined classifiers, our method integrates LLMs to extract stance, affective tone, and agreement patterns from large-scale social media discussions. We then apply a rule-based scoring system capable of quantifying affective polarization even in small conversations consisting of single interactions, based on stance alignment, emotional content, and interaction dynamics. Our analysis reveals distinct polarization patterns that are event dependent: (i) anticipation-driven polarization, where extreme polarization escalates before well-publicized events, and (ii) reactive polarization, where intense affective polarization spikes immediately after sudden, high-impact events. By combining AI-driven content annotation with domain-informed scoring, our framework offers a scalable and interpretable approach to measuring affective polarization. The source code is publicly available at: this https URL.
zh

[NLP-102] Attention Needs to Focus: A Unified Perspective on Attention Allocation ICLR2026

【速读】: 该论文旨在解决标准注意力机制在大型语言模型(Large Language Models, LLMs)中普遍存在的两个核心问题:表征坍缩(representational collapse)与注意力陷阱(attention sink)。研究表明,这两类问题均源于不恰当的注意力分配。针对此,作者提出了一种统一视角,并识别出两种失效模式:1)注意力过载(Attention Overload),即多个token被赋予相近的高权重,导致语义特征模糊;2)注意力欠载(Attention Underload),即无语义相关性token仍被强制分配注意力,引发虚假聚焦。解决方案的关键在于引入一种新型机制——Lazy Attention,其通过两个创新设计实现更聚焦的注意力分布:一是利用跨头和维度的位置判别能力以缓解过载;二是采用Elastic-Softmax替代标准softmax,松弛归一化约束以抑制无关token的注意力分配。实验表明,该方法有效缓解了注意力陷阱,在FineWeb-Edu数据集上达到最高59.58%的注意力稀疏度,同时保持与主流架构相当的性能。

链接: https://arxiv.org/abs/2601.00919
作者: Zichuan Fu,Wentao Song,Guojing Li,Yejing Wang,Xian Wu,Yimin Deng,Hanyu Yan,Yefeng Zheng,Xiangyu Zhao
机构: City University of Hong Kong (香港城市大学); Xi’an Jiaotong University (西安交通大学); Tencent (腾讯); Westlake University (西湖大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: ICLR 2026 conference

点击查看摘要

Abstract:The Transformer architecture, a cornerstone of modern Large Language Models (LLMs), has achieved extraordinary success in sequence modeling, primarily due to its attention mechanism. However, despite its power, the standard attention mechanism is plagued by well-documented issues: representational collapse and attention sink. Although prior work has proposed approaches for these issues, they are often studied in isolation, obscuring their deeper connection. In this paper, we present a unified perspective, arguing that both can be traced to a common root – improper attention allocation. We identify two failure modes: 1) Attention Overload, where tokens receive comparable high weights, blurring semantic features that lead to representational collapse; 2) Attention Underload, where no token is semantically relevant, yet attention is still forced to distribute, resulting in spurious focus such as attention sink. Building on this insight, we introduce Lazy Attention, a novel mechanism designed for a more focused attention distribution. To mitigate overload, it employs positional discrimination across both heads and dimensions to sharpen token distinctions. To counteract underload, it incorporates Elastic-Softmax, a modified normalization function that relaxes the standard softmax constraint to suppress attention on irrelevant tokens. Experiments on the FineWeb-Edu corpus, evaluated across nine diverse benchmarks, demonstrate that Lazy Attention successfully mitigates attention sink and achieves competitive performance compared to both standard attention and modern architectures, while reaching up to 59.58% attention sparsity.
zh

[NLP-103] When to Ponder: Adaptive Compute Allocation for Code Generation via Test-Time Training

【速读】: 该论文旨在解决大语言模型在推理阶段对所有输入统一施加相同计算量的问题,导致资源浪费且难以适应不同难度输入的挑战。其解决方案的关键在于提出一种无需训练的门控策略——PonderTTT,该策略利用TTT层的自监督重建损失作为信号来决定是否触发测试时训练(Test-Time Training, TTT)更新,仅需一个标量阈值进行初始校准并采用指数移动平均(EMA)持续调整以维持目标更新率,从而实现高效、动态的计算分配。

链接: https://arxiv.org/abs/2601.00894
作者: Gihyeon Sim
机构: Dongpae High School (东派高中)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 14 pages, 1 figure, 14 tables, code available at this https URL

点击查看摘要

Abstract:Large language models apply uniform computation to all inputs, regardless of difficulty. We propose PonderTTT, a gating strategy using the TTT layer’s self-supervised reconstruction loss to selectively trigger Test-Time Training (TTT) updates. The gating decision itself is training-free–requiring no learned classifier or auxiliary networks; only a single scalar threshold is initially calibrated on unlabeled data and continuously adapted via EMA to maintain target update rates. Our experiments with GPT-2 models (124M to 1.5B) on code language modeling (The Stack v2, teacher-forced perplexity) demonstrate that this signal is inference-compatible, requiring no ground-truth labels. Our Reconstruction Gating achieves 82-89% Oracle Recovery while being fully training-free, significantly outperforming Random Skip baselines (up to 16% lower loss on OOD languages).
zh

[NLP-104] Universal Conditional Logic: A Formal Language for Prompt Engineering

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)提示工程(prompt engineering)中长期存在的非系统化与高成本问题,即当前提示设计主要依赖经验性试错,缺乏可量化、可优化的理论框架。解决方案的核心是提出通用条件逻辑(Universal Conditional Logic, UCL),其关键在于通过结构化数学建模实现提示的系统性优化:引入指示函数(indicator functions, I_i ∈ {0,1})和结构开销函数(structural overhead function, O_s(A) = γ·∑ln C_k),揭示了过度指定(over-specification)导致性能下降的“过度指定悖论”(Over-Specification Paradox),并发现当指定阈值 S* = 0.509 超过时,性能以二次方形式衰减。此外,UCL验证了早期绑定(early binding)机制的有效性,并指出最优配置因模型架构而异(如 Llama 4 Scout 需要版本特异性调整 V4.1),从而为高效 LLM 交互提供了可校准、可扩展的优化路径。

链接: https://arxiv.org/abs/2601.00880
作者: Anthony Mikinka
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Programming Languages (cs.PL); Software Engineering (cs.SE)
备注: 25 pages, 15 figures, 5 tables. Includes appendices with variable reference, pattern library, and O_s calculation examples. Supplementary materials: V1-V4.1 prompt source code and 305 model responses available at GitHub repositories

点击查看摘要

Abstract:We present Universal Conditional Logic (UCL), a mathematical framework for prompt optimization that transforms prompt engineering from heuristic practice into systematic optimization. Through systematic evaluation (N=305, 11 models, 4 iterations), we demonstrate significant token reduction (29.8%, t(10)=6.36, p 0.001, Cohen’s d = 2.01) with corresponding cost savings. UCL’s structural overhead function O_s(A) explains version-specific performance differences through the Over-Specification Paradox: beyond threshold S* = 0.509, additional specification degrades performance quadratically. Core mechanisms – indicator functions (I_i in 0,1), structural overhead (O_s = gamma * sum(ln C_k)), early binding – are validated. Notably, optimal UCL configuration varies by model architecture – certain models (e.g., Llama 4 Scout) require version-specific adaptations (V4.1). This work establishes UCL as a calibratable framework for efficient LLM interaction, with model-family-specific optimization as a key research direction.
zh

[NLP-105] CogCanvas: Compression-Resistant Cognitive Artifacts for Long LLM Conversations

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长对话中面临的上下文窗口限制与信息保真度之间的根本矛盾,现有方法如截断(truncation)和摘要(summarization)分别导致早期信息丢失或细节模糊。其解决方案的关键在于提出一种无需训练的框架CogCanvas,通过提取每轮对话中的“verbatim-grounded cognitive artifacts”(即决策、事实、提醒等可验证的认知产物),并将其组织成具有时间感知能力的图结构,从而实现对关键信息的压缩鲁棒检索。该方法显著提升了多跳因果推理(multi-hop causal reasoning)和时序推理(temporal reasoning)任务的表现,在LoCoMo基准上相较RAG和GraphRAG分别获得高达+530%相对提升和+41个百分点的准确率改进。

链接: https://arxiv.org/abs/2601.00821
作者: Tao An
机构: Hawaii Pacific University (夏威夷太平洋大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 15 pages, 5 figures

点击查看摘要

Abstract:Large language models face a fundamental tension between context window limits and information fidelity in long conversations. Existing approaches–truncation and summarization–either discard early information or lose nuanced details. We introduce CogCanvas, a training-free framework that extracts verbatim-grounded cognitive artifacts (decisions, facts, reminders) from conversation turns and organizes them into a temporal-aware graph for compression-resistant retrieval. On the LoCoMo benchmark, CogCanvas achieves 34.7% overall accuracy, outperforming RAG (25.6%, +9.1pp) and GraphRAG (13.7%, +21.0pp). The advantage is most pronounced on temporal reasoning: 31.5% vs. 9.3% (RAG) and 5.0% (GraphRAG)–a +530% relative improvement. On multi-hop causal reasoning, CogCanvas achieves 81.0% pass rate vs. 40.0% for GraphRAG (+41.0pp). Controlled benchmarks show 97.5% recall (+78.5pp vs. summarization) with 93.0% exact match preservation. While heavily-optimized approaches achieve higher absolute scores through dedicated training (EverMemOS: approximately 92%), our training-free approach provides practitioners with an immediately-deployable alternative that significantly outperforms standard baselines. Code and data: this https URL. Comments: 15 pages, 5 figures Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR) ACMclasses: I.2.7; I.2.6 Cite as: arXiv:2601.00821 [cs.AI] (or arXiv:2601.00821v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2601.00821 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-106] he Qualitative Laboratory: Theory Prototyping and Hypothesis Generation with Large Language Models

【速读】: 该论文旨在解决社会科学研究中如何生成关于不同社会群体对新信息解读方式的丰富定性假设这一核心挑战。传统方法如情景问卷调查(vignette surveys)往往缺乏话语深度,而基于规则的代理模型(agent-based models, ABMs)则受限于复杂世界观的形式化瓶颈。解决方案的关键在于引入使用大语言模型(Large Language Models, LLMs)的社会学角色模拟(sociological persona simulation),将其视为一种“定性实验室”。该方法通过生成自然语言语境下的对话,不仅克服了传统方法在话语深度上的不足,还借助自然语言操作复杂世界观,从而绕过ABMs的形式化限制;实证表明,该方法能生成具有反直觉性和理论挑战性的新颖假设,适用于“模拟后验证”的研究流程,为后续实证检验提供更细致、更具解释力的假说基础。

链接: https://arxiv.org/abs/2601.00797
作者: Hugues Draelants
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
备注: 26 pages, 3 tables. Manuscript submitted for peer-reviewed journal publication

点击查看摘要

Abstract:A central challenge in social science is to generate rich qualitative hypotheses about how diverse social groups might interpret new information. This article introduces and illustrates a novel methodological approach for this purpose: sociological persona simulation using Large Language Models (LLMs), which we frame as a “qualitative laboratory”. We argue that for this specific task, persona simulation offers a distinct advantage over established methods. By generating naturalistic discourse, it overcomes the lack of discursive depth common in vignette surveys, and by operationalizing complex worldviews through natural language, it bypasses the formalization bottleneck of rule-based agent-based models (ABMs). To demonstrate this potential, we present a protocol where personas derived from a sociological theory of climate reception react to policy messages. The simulation produced nuanced and counter-intuitive hypotheses - such as a conservative persona’s rejection of a national security frame - that challenge theoretical assumptions. We conclude that this method, used as part of a “simulation then validation” workflow, represents a superior tool for generating deeply textured hypotheses for subsequent empirical testing.
zh

[NLP-107] A neural network for modeling human concept formation understanding and communication

【速读】: 该论文试图解决人类大脑如何从感觉运动经验中形成更抽象的概念表征,并在无需直接感官输入的情况下灵活应用这些概念的计算机制问题。解决方案的关键在于提出了一种双模块神经网络框架——CATS Net,其中包含一个概念抽象模块用于提取低维概念表征,以及一个任务求解模块在所形成概念的分层门控控制下执行视觉判断任务;该架构通过概念通信实现跨网络的知识迁移,且模型与大脑的拟合分析表明其概念空间与人类腹侧枕颞皮层(ventral occipitotemporal cortex)的神经认知语义结构一致,门控机制也对应于语义控制脑网络,从而为理解人类概念认知提供了统一的计算框架。

链接: https://arxiv.org/abs/2601.02010
作者: Liangxuan Guo,Haoyang Chen,Yang Chen,Yanchao Bi,Shan Yu
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Peking University (北京大学); National Laboratory for Pattern Recognition (国家模式识别实验室)
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 6 main figures, 5 extended data figures and 4 supplementary figures

点击查看摘要

Abstract:A remarkable capability of the human brain is to form more abstract conceptual representations from sensorimotor experiences and flexibly apply them independent of direct sensory inputs. However, the computational mechanism underlying this ability remains poorly understood. Here, we present a dual-module neural network framework, the CATS Net, to bridge this gap. Our model consists of a concept-abstraction module that extracts low-dimensional conceptual representations, and a task-solving module that performs visual judgement tasks under the hierarchical gating control of the formed concepts. The system develops transferable semantic structure based on concept representations that enable cross-network knowledge transfer through conceptual communication. Model-brain fitting analyses reveal that these emergent concept spaces align with both neurocognitive semantic model and brain response structures in the human ventral occipitotemporal cortex, while the gating mechanisms mirror that in the semantic control brain network. This work establishes a unified computational framework that can offer mechanistic insights for understanding human conceptual cognition and engineering artificial systems with human-like conceptual intelligence.
zh

[NLP-108] LLM Collusion

【速读】: 该论文旨在解决在双寡头市场中,当两家卖家均依赖同一预训练大语言模型(Large Language Model, LLM)进行定价决策时,如何通过LLM的配置参数诱导串通(collusion)的问题。其核心发现是:LLM的输出保真度(output-fidelity)参数存在一个临界阈值,当该参数高于此阈值时,系统进入双稳态(bistable)状态,即竞争性定价与串通性定价均可局部稳定,最终结果由模型初始偏好决定;低于该阈值则仅存在竞争性定价的长期均衡。解决方案的关键在于通过调整LLM的输出保真度和再训练频率(由训练批次大小 $ b $ 决定),实现从竞争到串通的相变行为——尤其是当批次规模增大时,随机波动被抑制,使得系统更易陷入串通区域,从而强化了隐性串通(tacit collusion)的稳定性与可重现性。

链接: https://arxiv.org/abs/2601.01279
作者: Shengyu Cao,Ming Hu
机构: 未知
类目: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
备注: 44 pages

点击查看摘要

Abstract:We study how delegating pricing to large language models (LLMs) can facilitate collusion in a duopoly when both sellers rely on the same pre-trained model. The LLM is characterized by (i) a propensity parameter capturing its internal bias toward high-price recommendations and (ii) an output-fidelity parameter measuring how tightly outputs track that bias; the propensity evolves through retraining. We show that configuring LLMs for robustness and reproducibility can induce collusion via a phase transition: there exists a critical output-fidelity threshold that pins down long-run behavior. Below it, competitive pricing is the unique long-run outcome. Above it, the system is bistable, with competitive and collusive pricing both locally stable and the realized outcome determined by the model’s initial preference. The collusive regime resembles tacit collusion: prices are elevated on average, yet occasional low-price recommendations provide plausible deniability. With perfect fidelity, full collusion emerges from any interior initial condition. For finite training batches of size b , infrequent retraining (driven by computational costs) further amplifies collusion: conditional on starting in the collusive basin, the probability of collusion approaches one as b grows, since larger batches dampen stochastic fluctuations that might otherwise tip the system toward competition. The indeterminacy region shrinks at rate O(1/\sqrtb) .
zh

计算机视觉

[CV-0] ExposeAnyone: Personalized Audio-to-Expression Diffusion Models Are Robust Zero-Shot Face Forgery Detectors

【速读】:该论文旨在解决深度伪造(Deepfake)检测中模型泛化能力不足的问题,即现有方法在面对未见过的伪造手法时性能显著下降。其核心挑战在于当前主流方法依赖于监督训练,容易过拟合到特定伪造模式;而纯自监督方法又难以学习具有判别性的表示。解决方案的关键在于提出一种完全自监督的框架 ExposeAnyone,基于扩散模型(Diffusion Model)从音频生成表情序列,并通过参考集对个体进行个性化建模,进而利用扩散重建误差计算可疑视频与个性化主体之间的身份距离,从而实现针对目标人物的伪造检测。该方法无需标注伪造样本即可有效识别多种新型伪造内容,包括Sora2生成视频,在多个数据集上显著优于现有最先进方法。

链接: https://arxiv.org/abs/2601.02359
作者: Kaede Shiohara,Toshihiko Yamasaki,Vladislav Golyanik
机构: The University of Tokyo (东京大学); Max Planck Institute for Informatics, SIC (马克斯·普朗克信息研究所,SIC)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 8 figures, 11 tables; project page: this https URL

点击查看摘要

Abstract:Detecting unknown deepfake manipulations remains one of the most challenging problems in face forgery detection. Current state-of-the-art approaches fail to generalize to unseen manipulations, as they primarily rely on supervised training with existing deepfakes or pseudo-fakes, which leads to overfitting to specific forgery patterns. In contrast, self-supervised methods offer greater potential for generalization, but existing work struggles to learn discriminative representations only from self-supervision. In this paper, we propose ExposeAnyone, a fully self-supervised approach based on a diffusion model that generates expression sequences from audio. The key idea is, once the model is personalized to specific subjects using reference sets, it can compute the identity distances between suspected videos and personalized subjects via diffusion reconstruction errors, enabling person-of-interest face forgery detection. Extensive experiments demonstrate that 1) our method outperforms the previous state-of-the-art method by 4.22 percentage points in the average AUC on DF-TIMIT, DFDCP, KoDF, and IDForge datasets, 2) our model is also capable of detecting Sora2-generated videos, where the previous approaches perform poorly, and 3) our method is highly robust to corruptions such as blur and compression, highlighting the applicability in real-world face forgery detection.
zh

[CV-1] VINO: A Unified Visual Generator with Interleaved OmniModal Context

【速读】:该论文旨在解决当前视觉生成与编辑任务中模型碎片化的问题,即不同模态(如图像、视频)和任务(如生成、编辑)通常依赖独立的模型或模块,导致系统复杂、难以扩展且缺乏统一性。解决方案的关键在于提出一个统一的视觉生成框架VINO,其核心创新是采用共享的扩散主干(diffusion backbone),通过将文本、图像和视频等多模态输入编码为交错的条件标记(interleaved conditioning tokens),并以此引导扩散过程,从而实现图像和视频的联合生成与编辑。该设计支持多参考锚定、长序列指令遵循及静态与动态内容间的身份一致性保持,同时避免了模态特异性架构组件,显著提升了系统的通用性和可控性。

链接: https://arxiv.org/abs/2601.02358
作者: Junyi Chen,Tong He,Zhoujie Fu,Pengfei Wan,Kun Gai,Weicai Ye
机构: Shanghai Jiao Tong University (上海交通大学); Nanyang Technology University (南洋理工大学); Kling Team, Kuaishou Technology (快手科技Kling团队)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:We present VINO, a unified visual generator that performs image and video generation and editing within a single framework. Instead of relying on task-specific models or independent modules for each modality, VINO uses a shared diffusion backbone that conditions on text, images and videos, enabling a broad range of visual creation and editing tasks under one model. Specifically, VINO couples a vision-language model (VLM) with a Multimodal Diffusion Transformer (MMDiT), where multimodal inputs are encoded as interleaved conditioning tokens, and then used to guide the diffusion process. This design supports multi-reference grounding, long-form instruction following, and coherent identity preservation across static and dynamic content, while avoiding modality-specific architectural components. To train such a unified system, we introduce a multi-stage training pipeline that progressively expands a video generation base model into a unified, multi-task generator capable of both image and video input and output. Across diverse generation and editing benchmarks, VINO demonstrates strong visual quality, faithful instruction following, improved reference and attribute preservation, and more controllable multi-identity edits. Our results highlight a practical path toward scalable unified visual generation, and the promise of interleaved, in-context computation as a foundation for general-purpose visual creation.
zh

[CV-2] alk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes

【速读】:该论文旨在解决多模态生成系统在自然语言指令下对场景中物体进行精确几何变换(如平移、旋转、缩放)的难题,现有文本引导编辑方法虽能调整物体外观或风格,但受限于稀疏的成对标注数据和像素级优化的局限性,难以实现对象级别的几何操作。其解决方案的关键在于提出基于强化学习的扩散框架Talk2Move,核心创新包括:采用Group Relative Policy Optimization (GRPO)通过输入图像与轻量文本变体生成多样化轨迹来探索几何动作空间,无需昂贵的成对数据;设计空间奖励引导模型以对齐几何变换与语言描述;引入离线策略步评估与主动步采样机制提升学习效率;并构建以物体为中心的空间奖励函数,直接量化位移、旋转和缩放行为,从而实现可解释且语义一致的精准空间变换。

链接: https://arxiv.org/abs/2601.02356
作者: Jing Tan,Zhaoyang Zhang,Yantao Shen,Jiarui Cai,Shuo Yang,Jiajun Wu,Wei Xia,Zhuowen Tu,Stefano Soatto
机构: The Chinese University of Hong Kong (香港中文大学); AWS Agentic AI; Amazon Web Services; Amazon Robotics
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:We introduce Talk2Move, a reinforcement learning (RL) based diffusion framework for text-instructed spatial transformation of objects within scenes. Spatially manipulating objects in a scene through natural language poses a challenge for multimodal generation systems. While existing text-based manipulation methods can adjust appearance or style, they struggle to perform object-level geometric transformations-such as translating, rotating, or resizing objects-due to scarce paired supervision and pixel-level optimization limits. Talk2Move employs Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts generated from input images and lightweight textual variations, removing the need for costly paired data. A spatial reward guided model aligns geometric transformations with linguistic description, while off-policy step evaluation and active step sampling improve learning efficiency by focusing on informative transformation stages. Furthermore, we design object-centric spatial rewards that evaluate displacement, rotation, and scaling behaviors directly, enabling interpretable and coherent transformations. Experiments on curated benchmarks demonstrate that Talk2Move achieves precise, consistent, and semantically faithful object transformations, outperforming existing text-guided editing approaches in both spatial accuracy and scene coherence.
zh

[CV-3] Meta-Learning Guided Pruning for Few-Shot Plant Pathology on Edge Devices

【速读】:该论文旨在解决偏远地区农民在缺乏实验室和高性能计算资源的情况下,难以快速准确识别植物病害的问题。其核心挑战在于:深度学习模型虽能高精度识别病害,但通常模型庞大且计算复杂,无法部署于低成本边缘设备(如树莓派);同时,获取大量标注病害图像用于训练成本高昂且耗时。解决方案的关键在于提出一种结合神经网络剪枝(Neural Network Pruning)与少样本学习(Few-Shot Learning)的三阶段流程——Prune-then-Meta-Learn-then-Prune (PMP),并引入Disease-Aware Channel Importance Scoring (DACIS) 方法,精准识别对区分不同植物病害至关重要的网络通道,从而在显著压缩模型规模的同时保持高识别准确率(92.3%原始精度),并在树莓派4上实现7帧/秒的实时推理速度,使田间病害诊断具备可行性。

链接: https://arxiv.org/abs/2601.02353
作者: Shahnawaz Alam,Mohammed Mudassir Uddin,Mohammed Kaif Pasha
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Farmers in remote areas need quick and reliable methods for identifying plant diseases, yet they often lack access to laboratories or high-performance computing resources. Deep learning models can detect diseases from leaf images with high accuracy, but these models are typically too large and computationally expensive to run on low-cost edge devices such as Raspberry Pi. Furthermore, collecting thousands of labeled disease images for training is both expensive and time-consuming. This paper addresses both challenges by combining neural network pruning – removing unnecessary parts of the model – with few-shot learning, which enables the model to learn from limited examples. This paper proposes Disease-Aware Channel Importance Scoring (DACIS), a method that identifies which parts of the neural network are most important for distinguishing between different plant diseases, integrated into a three-stage Prune-then-Meta-Learn-then-Prune (PMP) pipeline. Experiments on PlantVillage and PlantDoc datasets demonstrate that the proposed approach reduces model size by 78% while maintaining 92.3% of the original accuracy, with the compressed model running at 7 frames per second on a Raspberry Pi 4, making real-time field diagnosis practical for smallholder farmers.
zh

[CV-4] Joint Semantic and Rendering Enhancements in 3D Gaussian Modeling with Anisotropic Local Encoding ICCV2025

【速读】:该论文旨在解决现有3D高斯溅射(3D Gaussian Splatting, 3DGS)方法在语义分割与图像渲染联合建模中存在的一系列问题:一是语义分支与渲染分支通常被独立处理,仅依赖2D监督信号而忽略3D高斯几何结构;二是自适应策略仅基于渲染梯度调整高斯点集,难以在纹理缺失或细节细微区域有效分配资源。其解决方案的关键在于提出一个协同增强框架,通过引入基于拉普拉斯-贝尔特拉米算子的各向异性3D高斯切比雪夫描述子(anisotropic 3D Gaussian Chebyshev descriptor),以捕捉精细的3D形状特征并减少对噪声2D引导的依赖;同时结合局部语义与形状信号动态调整高斯分布及球谐函数参数,实现更高效的资源分配;此外,设计跨场景知识迁移模块持续更新形状模式,提升新场景下的收敛速度和鲁棒性,从而在保持高帧率渲染的同时显著提升分割精度与重建质量。

链接: https://arxiv.org/abs/2601.02339
作者: Jingming He,Chongyi Li,Shiqi Wang,Sam Kwong
机构: City University of Hong Kong (香港城市大学); Nankai University (南开大学); Lingnan University (岭南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:Recent works propose extending 3DGS with semantic feature vectors for simultaneous semantic segmentation and image rendering. However, these methods often treat the semantic and rendering branches separately, relying solely on 2D supervision while ignoring the 3D Gaussian geometry. Moreover, current adaptive strategies adapt the Gaussian set depending solely on rendering gradients, which can be insufficient in subtle or textureless regions. In this work, we propose a joint enhancement framework for 3D semantic Gaussian modeling that synergizes both semantic and rendering branches. Firstly, unlike conventional point cloud shape encoding, we introduce an anisotropic 3D Gaussian Chebyshev descriptor using the Laplace-Beltrami operator to capture fine-grained 3D shape details, thereby distinguishing objects with similar appearances and reducing reliance on potentially noisy 2D guidance. In addition, without relying solely on rendering gradient, we adaptively adjust Gaussian allocation and spherical harmonics with local semantic and shape signals, enhancing rendering efficiency through selective resource allocation. Finally, we employ a cross-scene knowledge transfer module to continuously update learned shape patterns, enabling faster convergence and robust representations without relearning shape information from scratch for each new scene. Experiments on multiple datasets demonstrate improvements in segmentation accuracy and rendering quality while maintaining high rendering frame rates.
zh

[CV-5] BEDS: Bayesian Emergent Dissipative Structures

【速读】:该论文试图解决的核心问题是:如何从基础物理与数学原理出发,构建一个统一的理论框架来解释学习的本质,尤其是在跨物理、生物和计算系统中,学习如何通过熵的输出实现结构的持续生成。其解决方案的关键在于提出BEDS(Bayesian Emergent Dissipative Structures)理论,该理论将非平衡热力学、贝叶斯推理、信息几何与机器学习有机结合,揭示了学习本质上是通量向结构转化的过程,并建立了热力学过程与贝叶斯更新之间的形式同构关系;进一步证明了自然常数 eeπ\piϕ\phi 作为贝叶斯推断在最小公理下的不动点必然涌现,从而为学习系统的稳定性提供数学根基;同时提出一个关于哥德尔不完备定理与热力学约束之间结构类比的猜想,指出形式系统的病态行为(如不完备性)与物理系统中的耗散不足具有本质相似性。这一理论不仅深化了对学习与计算本质的理解,还指导设计出能量效率提升六个数量级的分布式共识网络架构,实现了可持续人工智能的实践路径。

链接: https://arxiv.org/abs/2601.02329
作者: Laurent Caraffa
机构: Univ. Gustave Eiffel, IGN-ENSG, LaSTIG (法国居斯塔夫·埃菲尔大学,IGN-ENSG,LaSTIG); French National Institute of Geographic and Forest Information (法国国家地理与森林信息研究所); Ministry of Ecological Transition, France (法国生态转型部)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages

点击查看摘要

Abstract:We present BEDS (Bayesian Emergent Dissipative Structures), a theoretical framework that unifies concepts from non-equilibrium thermodynamics, Bayesian inference, information geometry, and machine learning. The central thesis proposes that learning, across physical, biological, and computational systems, fundamentally constitutes the conversion of flux into structure through entropy export. Building on Prigogine’s theory of dissipative structures, we establish a formal isomorphism between thermodynamic processes and Bayesian updating, demonstrating that sustainable learning systems must follow dissipative patterns where crystallized posteriors become priors for subsequent levels of emergence. We derive fundamental mathematical constants (e, \pi, \phi) as fixed points of Bayesian inference under minimal axioms, suggesting these constants emerge necessarily from any system capable of representing and updating uncertainty. Furthermore, we propose a conjecture linking Gödel’s incompleteness theorems to thermodynamic constraints, hypothesizing that pathologies of formal systems (incompleteness, undecidability) are structurally analogous to dissipation deficits in physical systems. As practical validation, we present a peer-to-peer network architecture implementing BEDS principles, achieving six orders of magnitude improvement in energy efficiency compared to existing distributed consensus systems while enabling continuous learning. This work bridges fundamental physics, mathematical logic, and practical system design, offering both theoretical insights into the nature of learning and computation, and a concrete pathway toward sustainable artificial intelligence. Comments: 19 pages Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2601.02329 [cs.CV] (or arXiv:2601.02329v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2601.02329 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-6] Fusion2Print: Deep Flash-Non-Flash Fusion for Contactless Fingerprint Matching ICPR2026

【速读】:该论文旨在解决接触式指纹识别系统中因物理接触导致的潜在指纹残留、压力伪影及卫生风险等问题,同时克服非接触式指纹图像因光照变化、皮下皮肤色素沉着和镜面反射等因素造成的纹线清晰度下降问题。其解决方案的关键在于提出Fusion2Print(F2P)框架,通过系统性地采集并融合配对的闪光与非闪光非接触式指纹图像:首先构建了专用的配对数据集FNF Database,并采用人工减法分离出保留纹线信息的信号;随后设计了一个轻量级注意力机制融合网络,有效整合双模态信息、增强有用通道并抑制噪声;最后引入具有跨域兼容性的深度嵌入模型,在统一嵌入空间中生成对非接触与接触指纹均具备判别力和鲁棒性的特征表示,从而显著提升识别性能(AUC=0.999,EER=1.12%)。

链接: https://arxiv.org/abs/2601.02318
作者: Roja Sahoo,Anoop Namboodiri
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 8 figures, 5 tables. Submitted to ICPR 2026

点击查看摘要

Abstract:Contactless fingerprint recognition offers a hygienic and convenient alternative to contact-based systems, enabling rapid acquisition without latent prints, pressure artifacts, or hygiene risks. However, contactless images often show degraded ridge clarity due to illumination variation, subcutaneous skin discoloration, and specular reflections. Flash captures preserve ridge detail but introduce noise, whereas non-flash captures reduce noise but lower ridge contrast. We propose Fusion2Print (F2P), the first framework to systematically capture and fuse paired flash-non-flash contactless fingerprints. We construct a custom paired dataset, FNF Database, and perform manual flash-non-flash subtraction to isolate ridge-preserving signals. A lightweight attention-based fusion network also integrates both modalities, emphasizing informative channels and suppressing noise, and then a U-Net enhancement module produces an optimally weighted grayscale image. Finally, a deep embedding model with cross-domain compatibility, generates discriminative and robust representations in a unified embedding space compatible with both contactless and contact-based fingerprints for verification. F2P enhances ridge clarity and achieves superior recognition performance (AUC=0.999, EER=1.12%) over single-capture baselines (Verifinger, DeepPrint).
zh

[CV-7] Prithvi-Complimentary Adaptive Fusion Encoder (CAFE): unlocking full-potential for flood inundation mapping WACV2026

【速读】:该论文旨在解决Geo-Foundation Models (GFMs) 在洪水制图等下游任务中因难以捕捉关键局部细节而导致性能不足的问题,尤其在Sen1Flood11数据集上其表现不及基准U-Net模型。解决方案的关键在于提出Prithvi-Complementary Adaptive Fusion Encoder (CAFE),该结构融合了预训练的Prithvi GFM编码器与一个并行的CNN残差分支(引入卷积注意力模块Convolutional Attention Modules, CAM),实现多尺度、多层级特征融合,在保留长距离依赖的同时增强局部细节建模能力,并通过适配器机制实现高效微调。

链接: https://arxiv.org/abs/2601.02315
作者: Saurabh Kaushik,Lalit Maurya,Beth Tellman
机构: Center for Sustainability and the Global Environment (SAGE), University of Wisconsin–Madison (威斯康星大学麦迪逊分校); Portsmouth AI and Data Science Centre (PAIDS), School of Computing, University of Portsmouth (朴茨茅斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CV4EO Workshop @ WACV 2026

点击查看摘要

Abstract:Geo-Foundation Models (GFMs), have proven effective in diverse downstream applications, including semantic segmentation, classification, and regression tasks. However, in case of flood mapping using Sen1Flood11 dataset as a downstream task, GFMs struggles to outperform the baseline U-Net, highlighting model’s limitation in capturing critical local nuances. To address this, we present the Prithvi-Complementary Adaptive Fusion Encoder (CAFE), which integrate Prithvi GFM pretrained encoder with a parallel CNN residual branch enhanced by Convolutional Attention Modules (CAM). Prithvi-CAFE enables fast and efficient fine-tuning through adapters in Prithvi and performs multi-scale, multi-level fusion with CNN features, capturing critical local details while preserving long-range dependencies. We achieve state-of-the-art results on two comprehensive flood mapping datasets: Sen1Flood11 and FloodPlanet. On Sen1Flood11 test data, Prithvi-CAFE (IoU 83.41) outperforms the original Prithvi (IoU 82.50) and other major GFMs (TerraMind 82.90, DOFA 81.54, spectralGPT: 81.02). The improvement is even more pronounced on the hold-out test site, where Prithvi-CAFE achieves an IoU of 81.37 compared to the baseline U-Net (70.57) and original Prithvi (72.42). On FloodPlanet, Prithvi-CAFE also surpasses the baseline U-Net and other GFMs, achieving an IoU of 64.70 compared to U-Net (60.14), Terramind (62.33), DOFA (59.15) and Prithvi 2.0 (61.91). Our proposed simple yet effective Prithvi-CAFE demonstrates strong potential for improving segmentation tasks where multi-channel and multi-modal data provide complementary information and local details are critical. The code is released on \hrefthis https URLPrithvi-CAFE Github
zh

[CV-8] 360DVO: Deep Visual Odometry for Monocular 360-Degree Camera

【速读】:该论文旨在解决单目全景视觉里程计(Monocular Omnidirectional Visual Odometry, OVO)在复杂场景下鲁棒性不足的问题,尤其是面对剧烈运动和光照变化时,传统依赖手工特征或光度目标的方法性能受限。其解决方案的关键在于提出首个基于深度学习的OVO框架360DVO,核心创新包括:1)设计了一个畸变感知的球面特征提取器(Distortion-aware Spherical Feature Extractor, DAS-Feat),能够自适应地学习对畸变具有抵抗能力的稀疏特征;2)引入一种新型全景可微分束调整模块(Omnidirectional Differentiable Bundle Adjustment, ODBA),利用提取的特征建立约束以实现高精度位姿估计。该方法在真实世界OVO基准及合成数据集上均显著优于现有主流算法(如360VO和OpenVSLAM),鲁棒性提升50%,精度提高37.5%。

链接: https://arxiv.org/abs/2601.02309
作者: Xiaopeng Guo,Yinzhe Xu,Huajian Huang,Sai-Kit Yeung
机构: Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages. Received by RA-L

点击查看摘要

Abstract:Monocular omnidirectional visual odometry (OVO) systems leverage 360-degree cameras to overcome field-of-view limitations of perspective VO systems. However, existing methods, reliant on handcrafted features or photometric objectives, often lack robustness in challenging scenarios, such as aggressive motion and varying illumination. To address this, we present 360DVO, the first deep learning-based OVO framework. Our approach introduces a distortion-aware spherical feature extractor (DAS-Feat) that adaptively learns distortion-resistant features from 360-degree images. These sparse feature patches are then used to establish constraints for effective pose estimation within a novel omnidirectional differentiable bundle adjustment (ODBA) module. To facilitate evaluation in realistic settings, we also contribute a new real-world OVO benchmark. Extensive experiments on this benchmark and public synthetic datasets (TartanAir V2 and 360VO) demonstrate that 360DVO surpasses state-of-the-art baselines (including 360VO and OpenVSLAM), improving robustness by 50% and accuracy by 37.5%. Homepage: this https URL
zh

[CV-9] SortWaste: A Densely Annotated Dataset for Object Detection in Industrial Waste Sorting

【速读】:该论文旨在解决当前垃圾分拣自动化系统在处理现实世界中高变异性、复杂视觉场景时性能不足的问题,其核心挑战在于缺乏高质量的实地数据集以及对场景复杂度缺乏客观量化标准。解决方案的关键在于提出SortWaste数据集——一个从物料回收设施(Material Recovery Facility, MRF)采集并密集标注的目标检测数据集,并引入ClutterScore指标,通过对象数量、类别与尺寸熵及空间重叠等代理变量客观评估场景的视觉复杂度。这一方法不仅推动了垃圾分拣检测任务的标准化,还揭示了现有先进目标检测模型在高杂乱场景下性能显著下降的现象,凸显了开发更具挑战性的新数据集的必要性。

链接: https://arxiv.org/abs/2601.02299
作者: Sara Inácio,Hugo Proença,João C. Neves
机构: University of Beira Interior (贝拉内斯特大学); Instituto de Telecomunicações (电信研究所); NOVA LINCS (诺瓦林克斯)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages

点击查看摘要

Abstract:The increasing production of waste, driven by population growth, has created challenges in managing and recycling materials effectively. Manual waste sorting is a common practice; however, it remains inefficient for handling large-scale waste streams and presents health risks for workers. On the other hand, existing automated sorting approaches still struggle with the high variability, clutter, and visual complexity of real-world waste streams. The lack of real-world datasets for waste sorting is a major reason automated systems for this problem are underdeveloped. Accordingly, we introduce SortWaste, a densely annotated object detection dataset collected from a Material Recovery Facility. Additionally, we contribute to standardizing waste detection in sorting lines by proposing ClutterScore, an objective metric that gauges the scene’s hardness level using a set of proxies that affect visual complexity (e.g., object count, class and size entropy, and spatial overlap). In addition to these contributions, we provide an extensive benchmark of state-of-the-art object detection models, detailing their results with respect to the hardness level assessed by the proposed metric. Despite achieving promising results (mAP of 59.7% in the plastic-only detection task), performance significantly decreases in highly cluttered scenes. This highlights the need for novel and more challenging datasets on the topic.
zh

[CV-10] Rank-based Geographical Regularization: Revisiting Contrastive Self-Supervised Learning for Multispectral Remote Sensing Imagery

【速读】:该论文旨在解决自监督学习(Self-supervised Learning, SSL)在多光谱遥感(Multispectral Remote Sensing, RS)图像中应用时面临的独特挑战,即如何有效利用地理空间信息以提升特征表示的质量。现有方法在处理具有地理和时间变异性特征的遥感数据时表现有限,难以充分建模跨区域和跨时段的语义一致性。解决方案的关键在于提出GeoRank——一种新颖的对比自监督学习正则化方法,通过直接优化球面距离(spherical distances)将地理关系嵌入到学习到的特征空间中,从而增强模型对地理结构的感知能力。该方法不依赖于显式的地理元数据,而是隐式地利用空间位置信息来改进特征表示,在多种对比SSL算法(如BYOL、DINO)上均表现出一致且显著的性能提升。

链接: https://arxiv.org/abs/2601.02289
作者: Tom Burgert,Leonard Hackel,Paolo Rota,Begüm Demir
机构: BIFOLD; TU Berlin; University of Trento
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted for publication at IEEE/CVF Winter Conference on Applications of Computer Vision

点击查看摘要

Abstract:Self-supervised learning (SSL) has become a powerful paradigm for learning from large, unlabeled datasets, particularly in computer vision (CV). However, applying SSL to multispectral remote sensing (RS) images presents unique challenges and opportunities due to the geographical and temporal variability of the data. In this paper, we introduce GeoRank, a novel regularization method for contrastive SSL that improves upon prior techniques by directly optimizing spherical distances to embed geographical relationships into the learned feature space. GeoRank outperforms or matches prior methods that integrate geographical metadata and consistently improves diverse contrastive SSL algorithms (e.g., BYOL, DINO). Beyond this, we present a systematic investigation of key adaptations of contrastive SSL for multispectral RS images, including the effectiveness of data augmentations, the impact of dataset cardinality and image size on performance, and the task dependency of temporal views. Code is available at this https URL.
zh

[CV-11] InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams

【速读】:该论文旨在解决大规模、持续性三维视觉几何理解中长期存在的可扩展性(scalability)与长期稳定性(long-term stability)之间的矛盾问题。现有方法如VGGT虽具备优异的几何建模能力,但因其批处理(batch-based)特性无法适用于实时系统;而现有的流式架构则难以支持无限时域输入或在长序列中出现灾难性漂移(catastrophic drift)。解决方案的关键在于提出InfiniteVGGT——一种因果视觉几何Transformer,其核心创新是通过一个有界但自适应且持续表达能力强的键值缓存(KV cache)实现滚动记忆(rolling memory)机制,并设计了一种无需训练、不依赖注意力权重的剪枝策略,能够智能丢弃过时信息,从而在每帧新输入时有效“滚动”记忆向前推进。该方法在不牺牲性能的前提下实现了无限时域流式处理,同时显著提升了长期稳定性,且完全兼容FlashAttention。

链接: https://arxiv.org/abs/2601.02281
作者: Shuai Yuan,Yantai Yang,Xiaotian Yang,Xupeng Zhang,Zhonghao Zhao,Lingming Zhang,Zhipeng Zhang
机构: AutoLab, School of Artificial Intelligence, Shanghai Jiao Tong University (上海交通大学人工智能学院); Anyverse Dynamics
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The grand vision of enabling persistent, large-scale 3D visual geometry understanding is shackled by the irreconcilable demands of scalability and long-term stability. While offline models like VGGT achieve inspiring geometry capability, their batch-based nature renders them irrelevant for live systems. Streaming architectures, though the intended solution for live operation, have proven inadequate. Existing methods either fail to support truly infinite-horizon inputs or suffer from catastrophic drift over long sequences. We shatter this long-standing dilemma with InfiniteVGGT, a causal visual geometry transformer that operationalizes the concept of a rolling memory through a bounded yet adaptive and perpetually expressive KV cache. Capitalizing on this, we devise a training-free, attention-agnostic pruning strategy that intelligently discards obsolete information, effectively ``rolling’’ the memory forward with each new frame. Fully compatible with FlashAttention, InfiniteVGGT finally alleviates the compromise, enabling infinite-horizon streaming while outperforming existing streaming methods in long-term stability. The ultimate test for such a system is its performance over a truly infinite horizon, a capability that has been impossible to rigorously validate due to the lack of extremely long-term, continuous benchmarks. To address this critical gap, we introduce the Long3D benchmark, which, for the first time, enables a rigorous evaluation of continuous 3D geometry estimation on sequences about 10,000 frames. This provides the definitive evaluation platform for future research in long-term 3D geometry understanding. Code is available at: this https URL
zh

[CV-12] opoLoRA-SAM: Topology-Aware Parameter-Efficient Adaptation of Foundation Segmenters for Thin-Structure and Cross-Domain Binary Semantic Segmentation

【速读】:该论文旨在解决基础分割模型(如Segment Anything Model, SAM)在特定领域语义分割任务中适应困难的问题,尤其针对细结构(如视网膜血管)和噪声模态(如合成孔径雷达SAR图像)的分割挑战。现有方法如全量微调存在计算成本高和灾难性遗忘风险。其解决方案的关键在于提出TopoLoRA-SAM框架,该框架通过将低秩适配(Low-Rank Adaptation, LoRA)注入冻结的视觉Transformer(ViT)编码器,并引入轻量级空间卷积适配器及可选的拓扑感知监督(基于可微分clDice),实现参数高效且拓扑敏感的模型适配。实验表明,TopoLoRA-SAM仅训练约5.2%的参数(~4.9M),在多个基准数据集上达到最优或接近最优的分割性能,显著优于U-Net、DeepLabV3+、SegFormer和Mask2Former等主流模型。

链接: https://arxiv.org/abs/2601.02273
作者: Salim Khazem
机构: Talan(塔兰)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Foundation segmentation models such as the Segment Anything Model (SAM) exhibit strong zero-shot generalization through large-scale pretraining, but adapting them to domain-specific semantic segmentation remains challenging, particularly for thin structures (e.g., retinal vessels) and noisy modalities (e.g., SAR imagery). Full fine-tuning is computationally expensive and risks catastrophic forgetting. We propose \textbfTopoLoRA-SAM, a topology-aware and parameter-efficient adaptation framework for binary semantic segmentation. TopoLoRA-SAM injects Low-Rank Adaptation (LoRA) into the frozen ViT encoder, augmented with a lightweight spatial convolutional adapter and optional topology-aware supervision via differentiable clDice. We evaluate our approach on five benchmarks spanning retinal vessel segmentation (DRIVE, STARE, CHASE_DB1), polyp segmentation (Kvasir-SEG), and SAR sea/land segmentation (SL-SSDD), comparing against U-Net, DeepLabV3+, SegFormer, and Mask2Former. TopoLoRA-SAM achieves the best retina-average Dice and the best overall average Dice across datasets, while training only \textbf5.2% of model parameters ( \sim 4.9M). On the challenging CHASE_DB1 dataset, our method substantially improves segmentation accuracy and robustness, demonstrating that topology-aware parameter-efficient adaptation can match or exceed fully fine-tuned specialist models. Code is available at : this https URL
zh

[CV-13] DiffProxy: Multi-View Human Mesh Recovery via Diffusion-Generated Dense Proxies

【速读】:该论文旨在解决从多视角图像中恢复人体网格(human mesh recovery)时面临的两大挑战:一是真实世界数据集中的标注不完美会导致模型训练偏差,二是合成数据虽具备精确监督信号但存在域差距(domain gap),难以直接迁移到真实场景。解决方案的关键在于提出DiffProxy框架,其核心创新包括:(1) 多条件机制以生成多视角一致且像素对齐的人体代理(human proxies);(2) 手部精修模块引入灵活的视觉提示以增强局部细节;(3) 不确定性感知的测试时缩放方法提升优化过程中对遮挡和部分视图等困难情况的鲁棒性。该框架充分利用扩散模型的生成先验,在仅用合成数据训练的情况下实现了对五个真实世界基准的最先进性能,展现出强大的零样本泛化能力。

链接: https://arxiv.org/abs/2601.02267
作者: Renke Wang,Zhenyu Zhang,Ying Tai,Jian Yang
机构: Nanjing University of Science and Technology (南京理工大学); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Page: this https URL , Code: this https URL

点击查看摘要

Abstract:Human mesh recovery from multi-view images faces a fundamental challenge: real-world datasets contain imperfect ground-truth annotations that bias the models’ training, while synthetic data with precise supervision suffers from domain gap. In this paper, we propose DiffProxy, a novel framework that generates multi-view consistent human proxies for mesh recovery. Central to DiffProxy is leveraging the diffusion-based generative priors to bridge the synthetic training and real-world generalization. Its key innovations include: (1) a multi-conditional mechanism for generating multi-view consistent, pixel-aligned human proxies; (2) a hand refinement module that incorporates flexible visual prompts to enhance local details; and (3) an uncertainty-aware test-time scaling method that increases robustness to challenging cases during optimization. These designs ensure that the mesh recovery process effectively benefits from the precise synthetic ground truth and generative advantages of the diffusion-based pipeline. Trained entirely on synthetic data, DiffProxy achieves state-of-the-art performance across five real-world benchmarks, demonstrating strong zero-shot generalization particularly on challenging scenarios with occlusions and partial views. Project page: this https URL
zh

[CV-14] VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation

【速读】:该论文旨在解决视觉生成模型中视觉自回归(Visual AutoRegressive, VAR)架构在强化学习(Reinforcement Learning, RL)场景下因异构输入结构导致的严重异步策略冲突问题,该冲突会引发训练不稳定和对齐效果不佳。解决方案的关键在于提出一种增强型Group Relative Policy Optimization(GRPO)框架,其核心创新包括:1)引入稳定化的中间奖励以引导早期生成阶段;2)设计动态时间步重加权机制实现精确的信用分配;3)基于奖励反馈学习(Reward Feedback Learning, ReFL)原理提出一种新型掩码传播算法,从时空维度隔离优化影响,从而显著提升样本质量和目标对齐度。

链接: https://arxiv.org/abs/2601.02256
作者: Shikun Sun,Liao Qu,Huichao Zhang,Yiheng Liu,Yangyang Song,Xian Li,Xu Wang,Yi Jiang,Daniel K. Du,Xinglong Wu,Jia Jia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page: this https URL

点击查看摘要

Abstract:Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. This issue becomes particularly acute in reinforcement learning (RL) scenarios, leading to unstable training and suboptimal alignment. To resolve this, we propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts. Our method integrates three synergistic components: 1) a stabilizing intermediate reward to guide early-stage generation; 2) a dynamic time-step reweighting scheme for precise credit assignment; and 3) a novel mask propagation algorithm, derived from principles of Reward Feedback Learning (ReFL), designed to isolate optimization effects both spatially and temporally. Our approach demonstrates significant improvements in sample quality and objective alignment over the vanilla GRPO baseline, enabling robust and effective optimization for VAR models.
zh

[CV-15] Neuro-Channel Networks: A Multiplication-Free Architecture by Biological Signal Transmission

【速读】:该论文旨在解决深度学习对高性能硬件(尤其是GPU)的高度依赖问题,这种依赖导致成本高昂、能耗大且供应短缺,限制了人工智能在边缘设备上的普及。其核心解决方案是提出一种名为神经通道网络(Neuro-Channel Networks, NCN)的新型乘法-free架构,关键在于用物理上限制信号幅度的“通道宽度”替代传统权重,并引入一个基于符号逻辑调节信号传输的“神经递质”参数;前向传播仅依赖加法、减法和位运算(如最小值、符号判断),彻底消除浮点乘法操作,从而实现高效能、低功耗的模型部署,无需依赖昂贵GPU集群即可在通用CPU或超低功耗芯片上运行复杂模型。

链接: https://arxiv.org/abs/2601.02253
作者: Emrah Mete,Emin Erkan Korkmaz
机构: Yeditepe University (耶迪泰佩大学)
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 4 figures

点击查看摘要

Abstract:The rapid proliferation of Deep Learning is increasingly constrained by its heavy reliance on high-performance hardware, particularly Graphics Processing Units (GPUs). These specialized accelerators are not only prohibitively expensive and energy-intensive but also suffer from significant supply scarcity, limiting the ubiquity of Artificial Intelligence (AI) deployment on edge devices. The core of this inefficiency stems from the standard artificial perceptron’s dependence on intensive matrix multiplications. However, biological nervous systems achieve unparalleled efficiency without such arithmetic intensity; synaptic signal transmission is regulated by physical ion channel limits and chemical neurotransmitter levels rather than a process that can be analogous to arithmetic multiplication. Inspired by this biological mechanism, we propose Neuro-Channel Networks (NCN), a novel multiplication-free architecture designed to decouple AI from expensive hardware dependencies. In our model, weights are replaced with Channel Widths that physically limit the signal magnitude, while a secondary parameter acts as a Neurotransmitter to regulate Signal Transmission based on sign logic. The forward pass relies exclusively on addition, subtraction, and bitwise operations (minimum, sign), eliminating floating-point multiplication entirely. In this proof-of-concept study, we demonstrate that NCNs can solve non-linearly separable problems like XOR and the Majority function with 100% accuracy using standard backpropagation, proving their capability to form complex decision boundaries without multiplicative weights. This architecture offers a highly efficient alternative for next-generation neuromorphic hardware, paving the way for running complex models on commodity CPUs or ultra-low-power chips without relying on costly GPU clusters.
zh

[CV-16] SLGNet: Synergizing Structural Priors and Language-Guided Modulation for Multimodal Object Detection

【速读】:该论文旨在解决多模态目标检测中因域差距导致的结构信息丢失问题,以及传统静态融合机制缺乏环境感知能力所引发的检测性能受限问题。其关键解决方案在于提出SLGNet框架,通过引入结构感知适配器(Structure-Aware Adapter)动态注入跨模态层次化结构先验,补偿视觉Transformer(Vision Transformer, ViT)骨干网络在红外与可见光图像间结构退化的问题;同时设计语言引导调制模块(Language-Guided Modulation),利用视觉语言模型(Vision-Language Model, VLM)生成的结构化描述对视觉特征进行动态重校准,从而增强模型对复杂动态场景的环境感知能力。此方法在保持参数高效性的同时显著提升了多模态感知鲁棒性与检测精度。

链接: https://arxiv.org/abs/2601.02249
作者: Xiantai Xiang,Guangyao Zhou,Zixiao Wen,Wenshuai Li,Ben Niu,Feng Wang,Lijia Huang,Qiantong Wang,Yuhan Liu,Zongxu Pan,Yuxin Hu
机构: Aerospace Information Research Institute, Chinese Academy of Sciences (中国科学院空天信息研究院); Key Laboratory of Target Cognition and Application Technology, Chinese Academy of Sciences (中国科学院目标认知与应用技术重点实验室); School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences (中国科学院大学电子、电气与通信工程学院); School of Software Engineering, Xi’an Jiaotong University (西安交通大学软件工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal object detection leveraging RGB and Infrared (IR) images is pivotal for robust perception in all-weather scenarios. While recent adapter-based approaches efficiently transfer RGB-pretrained foundation models to this task, they often prioritize model efficiency at the expense of cross-modal structural consistency. Consequently, critical structural cues are frequently lost when significant domain gaps arise, such as in high-contrast or nighttime environments. Moreover, conventional static multimodal fusion mechanisms typically lack environmental awareness, resulting in suboptimal adaptation and constrained detection performance under complex, dynamic scene variations. To address these limitations, we propose SLGNet, a parameter-efficient framework that synergizes hierarchical structural priors and language-guided modulation within a frozen Vision Transformer (ViT)-based foundation model. Specifically, we design a Structure-Aware Adapter to extract hierarchical structural representations from both modalities and dynamically inject them into the ViT to compensate for structural degradation inherent in ViT-based backbones. Furthermore, we propose a Language-Guided Modulation module that exploits VLM-driven structured captions to dynamically recalibrate visual features, thereby endowing the model with robust environmental awareness. Extensive experiments on the LLVIP, FLIR, KAIST, and DroneVehicle datasets demonstrate that SLGNet establishes new state-of-the-art performance. Notably, on the LLVIP benchmark, our method achieves an mAP of 66.1, while reducing trainable parameters by approximately 87% compared to traditional full fine-tuning. This confirms SLGNet as a robust and efficient solution for multimodal perception.
zh

[CV-17] A Comparative Study of Custom CNNs Pre-trained Models and Transfer Learning Across Multiple Visual Datasets

【速读】:该论文旨在解决视觉识别任务中模型构建策略的选择问题,即在从零训练轻量级卷积神经网络(Convolutional Neural Networks, CNNs)、使用预训练CNN作为固定特征提取器以及通过部分或全部微调预训练主干网络进行迁移学习这三种范式之间,如何权衡性能与效率。其解决方案的关键在于通过在五个真实世界图像分类数据集上的受控对比实验,系统评估不同方法在准确率(accuracy)和宏F1分数(macro F1-score)以及训练时间每轮和参数量等效率指标上的表现,结果表明迁移学习在预测性能上始终最优,而自定义CNN则在计算和内存资源受限时提供了更优的效率-精度折衷。

链接: https://arxiv.org/abs/2601.02246
作者: Annoor Sharara Akhand
机构: University of Dhaka (达卡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Convolutional Neural Networks (CNNs) are a standard approach for visual recognition due to their capacity to learn hierarchical representations from raw pixels. In practice, practitioners often choose among (i) training a compact custom CNN from scratch, (ii) using a large pre-trained CNN as a fixed feature extractor, and (iii) performing transfer learning via partial or full fine-tuning of a pre-trained backbone. This report presents a controlled comparison of these three paradigms across five real-world image classification datasets spanning road-surface defect recognition, agricultural variety identification, fruit/leaf disease recognition, pedestrian walkway encroachment recognition, and unauthorized vehicle recognition. Models are evaluated using accuracy and macro F1-score, complemented by efficiency metrics including training time per epoch and parameter counts. The results show that transfer learning consistently yields the strongest predictive performance, while the custom CNN provides an attractive efficiency–accuracy trade-off, especially when compute and memory budgets are constrained.
zh

[CV-18] VIBE: Visual Instruction Based Editor

【速读】:该论文旨在解决当前基于指令的图像编辑(instruction-based image editing)领域中两个核心问题:一是现有开源方法难以达到实际应用级别的图像质量,二是主流扩散模型(diffusion models)参数量大、计算成本高,限制了其在资源受限场景下的部署与研究。解决方案的关键在于设计了一个轻量化且高效的图像编辑流水线,采用2B参数的Qwen3-VL模型作为指令理解与编辑引导模块,结合1.6B参数的Sana1.5扩散模型进行图像生成,在保证源图像一致性(source consistency)的前提下实现高质量编辑效果。该方案在ImgEdit和GEdit基准测试中性能优于或相当甚至超越参数规模数倍于自身的基线模型,同时仅需24 GB GPU显存,并可在NVIDIA H100上以BF16精度在约4秒内完成2K分辨率图像生成,无需额外推理优化或蒸馏技术。

链接: https://arxiv.org/abs/2601.02242
作者: Grigorii Alekseenko,Aleksandr Gordeev,Irina Tolstykh,Bulat Suleimanov,Vladimir Dokholyan,Georgii Fedorov,Sergey Yakubson,Aleksandra Tsybina,Mikhail Chernyshov,Maksim Kuprashevich
机构: R&D Department, SALUTEDEV
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Instruction-based image editing is among the fastest developing areas in generative AI. Over the past year, the field has reached a new level, with dozens of open-source models released alongside highly capable commercial systems. However, only a limited number of open-source approaches currently achieve real-world quality. In addition, diffusion backbones, the dominant choice for these pipelines, are often large and computationally expensive for many deployments and research settings, with widely used variants typically containing 6B to 20B parameters. This paper presents a compact, high-throughput instruction-based image editing pipeline that uses a modern 2B-parameter Qwen3-VL model to guide the editing process and the 1.6B-parameter diffusion model Sana1.5 for image generation. Our design decisions across architecture, data processing, training configuration, and evaluation target low-cost inference and strict source consistency while maintaining high quality across the major edit categories feasible at this scale. Evaluated on the ImgEdit and GEdit benchmarks, the proposed method matches or exceeds the performance of substantially heavier baselines, including models with several times as many parameters and higher inference cost, and is particularly strong on edits that require preserving the input image, such as an attribute adjustment, object removal, background edits, and targeted replacement. The model fits within 24 GB of GPU memory and generates edited images at up to 2K resolution in approximately 4 seconds on an NVIDIA H100 in BF16, without additional inference optimizations or distillation.
zh

[CV-19] FMVP: Masked Flow Matching for Adversarial Video Purification

【速读】:该论文旨在解决视频识别模型在面对对抗攻击时的脆弱性问题,特别是现有基于扩散模型的净化方法存在采样效率低和轨迹弯曲的问题。其核心解决方案是提出一种名为FMVP(Flow Matching for Adversarial Video Purification)的新框架,关键在于通过掩码策略物理上破坏全局对抗结构,并利用条件流匹配(Conditional Flow Matching, CFM)结合图像修复目标重建干净视频动态;同时设计频域门控损失(Frequency-Gated Loss, FGL),显式抑制高频对抗残差而保留低频内容保真度,从而实现对已知与未知攻击的有效防御。

链接: https://arxiv.org/abs/2601.02228
作者: Duoxun Tang,Xueyi Zhang,Chak Hin Wang,Xi Xiao,Dasen Dai,Xinhang Jiang,Wentao Shi,Rui Li,Qing Li
机构: Tsinghua University (清华大学); The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); University of New South Wales (新南威尔士大学); The Chinese University of Hong Kong (香港中文大学); Peng Cheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video recognition models remain vulnerable to adversarial attacks, while existing diffusion-based purification methods suffer from inefficient sampling and curved trajectories. Directly regressing clean videos from adversarial inputs often fails to recover faithful content due to the subtle nature of perturbations; this necessitates physically shattering the adversarial structure. Therefore, we propose Flow Matching for Adversarial Video Purification FMVP. FMVP physically shatters global adversarial structures via a masking strategy and reconstructs clean video dynamics using Conditional Flow Matching (CFM) with an inpainting objective. To further decouple semantic content from adversarial noise, we design a Frequency-Gated Loss (FGL) that explicitly suppresses high-frequency adversarial residuals while preserving low-frequency fidelity. We design Attack-Aware and Generalist training paradigms to handle known and unknown threats, respectively. Extensive experiments on UCF-101 and HMDB-51 demonstrate that FMVP outperforms state-of-the-art methods (DiffPure, Defense Patterns (DP), Temporal Shuffling (TS) and FlowPure), achieving robust accuracy exceeding 87% against PGD and 89% against CW attacks. Furthermore, FMVP demonstrates superior robustness against adaptive attacks (DiffHammer) and functions as a zero-shot adversarial detector, attaining detection accuracies of 98% for PGD and 79% for highly imperceptible CW attacks.
zh

[CV-20] Prior-Guided DETR for Ultrasound Nodule Detection

【速读】:该论文旨在解决超声图像中结节(ultrasound nodule)检测的难题,主要挑战包括结节形状不规则、边界模糊、尺度变化大以及斑点噪声(speckle noise)干扰结构可见性等问题。解决方案的关键在于提出一种先验引导的DETR框架(prior-guided DETR framework),通过在网络多个阶段逐步引入不同类型的先验知识来提升检测精度:首先,在CNN主干中嵌入空间自适应可变形前馈网络与先验正则化(SDFPR),将几何先验注入可变形采样过程以稳定对不规则和模糊结节的特征提取;其次,设计多尺度空频特征混合器(MSFFM),融合空间域轮廓连续性和边界线索与频率域全局形态建模能力,从而抑制噪声并增强结构表征;最后,通过密集特征交互机制(DFI)跨编码层传播和利用这些受先验调制的特征,为解码器提供一致的几何与结构引导,显著提升查询细化能力。

链接: https://arxiv.org/abs/2601.02212
作者: Jingjing Wang,Zhuo Xiao,Xinning Yao,Bo Liu,Lijuan Niu,Xiangzhi Bai,Fugen Zhou
机构: Beihang University (北京航空航天大学); National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College (中国医学科学院肿瘤医院/国家癌症中心/国家临床研究中心); State Key Laboratory of High-Efficiency Reusable Aerospace Transportation Technology (高效 reusable 航天运输技术国家重点实验室); State Key Laboratory of Virtual Reality Technology and Systems, Ministry of Education (虚拟现实技术与系统国家重点实验室,教育部); the Key Laboratory of Spacecraft Design Optimization and Dynamic Simulation Technology, Ministry of Education (航天器设计优化与动态仿真技术教育部重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate detection of ultrasound nodules is essential for the early diagnosis and treatment of thyroid and breast cancers. However, this task remains challenging due to irregular nodule shapes, indistinct boundaries, substantial scale variations, and the presence of speckle noise that degrades structural visibility. To address these challenges, we propose a prior-guided DETR framework specifically designed for ultrasound nodule detection. Instead of relying on purely data-driven feature learning, the proposed framework progressively incorporates different prior knowledge at multiple stages of the network. First, a Spatially-adaptive Deformable FFN with Prior Regularization (SDFPR) is embedded into the CNN backbone to inject geometric priors into deformable sampling, stabilizing feature extraction for irregular and blurred nodules. Second, a Multi-scale Spatial-Frequency Feature Mixer (MSFFM) is designed to extract multi-scale structural priors, where spatial-domain processing emphasizes contour continuity and boundary cues, while frequency-domain modeling captures global morphology and suppresses speckle noise. Furthermore, a Dense Feature Interaction (DFI) mechanism propagates and exploits these prior-modulated features across all encoder layers, enabling the decoder to enhance query refinement under consistent geometric and structural guidance. Experiments conducted on two clinically collected thyroid ultrasound datasets (Thyroid I and Thyroid II) and two public benchmarks (TN3K and BUSI) for thyroid and breast nodules demonstrate that the proposed method achieves superior accuracy compared with 18 detection methods, particularly in detecting morphologically complex this http URL source code is publicly available at this https URL.
zh

[CV-21] Unraveling MMDiT Blocks: Training-free Analysis and Enhancement of Text-conditioned Diffusion

【速读】:该论文旨在解决多模态扩散Transformer(MMDiT)模型中各模块功能及其与文本条件交互机制不明确的问题,从而提升文本到图像生成(text-to-image generation, T2I)的质量与可控性。解决方案的关键在于构建一个系统性的分析管道,通过移除、禁用和增强特定层级的文本隐状态(textual hidden-states),揭示不同模块在语义信息传递和细节渲染中的作用:早期块主要承载语义信息,后期块负责精细细节生成;移除模块通常比禁用文本条件影响更小;在特定块增强文本条件可改善语义一致性。基于此理解,论文进一步提出无需重新训练的优化策略,实现更优的文本对齐、精准编辑与推理加速,在SD3.5上将T2I-Combench++和GenEval指标分别提升至63.00%和71.63%,且不牺牲图像合成质量。

链接: https://arxiv.org/abs/2601.02211
作者: Binglei Li,Mengping Yang,Zhiyu Tan,Junping Zhang,Hao Li
机构: Fudan University (复旦大学); Shanghai Innovation Institute (上海创新研究院); Shanghai Academy of AI for Science (上海人工智能科学研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages

点击查看摘要

Abstract:Recent breakthroughs of transformer-based diffusion models, particularly with Multimodal Diffusion Transformers (MMDiT) driven models like FLUX and Qwen Image, have facilitated thrilling experiences in text-to-image generation and editing. To understand the internal mechanism of MMDiT-based models, existing methods tried to analyze the effect of specific components like positional encoding and attention layers. Yet, a comprehensive understanding of how different blocks and their interactions with textual conditions contribute to the synthesis process remains elusive. In this paper, we first develop a systematic pipeline to comprehensively investigate each block’s functionality by removing, disabling and enhancing textual hidden-states at corresponding blocks. Our analysis reveals that 1) semantic information appears in earlier blocks and finer details are rendered in later blocks, 2) removing specific blocks is usually less disruptive than disabling text conditions, and 3) enhancing textual conditions in selective blocks improves semantic attributes. Building on these observations, we further propose novel training-free strategies for improved text alignment, precise editing, and acceleration. Extensive experiments demonstrated that our method outperforms various baselines and remains flexible across text-to-image generation, image editing, and inference acceleration. Our method improves T2I-Combench++ from 56.92% to 63.00% and GenEval from 66.42% to 71.63% on SD3.5, without sacrificing synthesis quality. These results advance understanding of MMDiT models and provide valuable insights to unlock new possibilities for further improvements.
zh

[CV-22] Seeing the Unseen: Zooming in the Dark with Event Cameras AAAI2026

【速读】:该论文旨在解决低光照视频超分辨率(Low-Light Video Super-Resolution, LVSR)问题,即从低光照、低分辨率(LR)输入中恢复高质量的高分辨率视频。现有方法因对比度有限和高频信息不足,难以有效恢复细节。其解决方案的关键在于提出首个事件驱动的LVSR框架RetinexEVSR,该框架融合高对比度事件信号与受Retinex模型启发的先验知识,通过创新的双向跨模态融合策略,从噪声事件数据和退化RGB帧中提取并整合有意义的线索。具体而言,引入了光照引导的事件增强模块以利用Retinex模型生成的光照图逐步优化事件特征,抑制低光伪影并保留高对比度细节;同时设计事件引导的反射率增强模块,借助增强后的事件特征通过多尺度融合机制动态恢复反射率细节,从而显著提升低光条件下的视频质量。

链接: https://arxiv.org/abs/2601.02206
作者: Dachun Kai,Zeyu Xiao,Huyue Zhu,Jiaxiao Wang,Yueyi Zhang,Xiaoyan Sun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to AAAI 2026

点击查看摘要

Abstract:This paper addresses low-light video super-resolution (LVSR), aiming to restore high-resolution videos from low-light, low-resolution (LR) inputs. Existing LVSR methods often struggle to recover fine details due to limited contrast and insufficient high-frequency information. To overcome these challenges, we present RetinexEVSR, the first event-driven LVSR framework that leverages high-contrast event signals and Retinex-inspired priors to enhance video quality under low-light scenarios. Unlike previous approaches that directly fuse degraded signals, RetinexEVSR introduces a novel bidirectional cross-modal fusion strategy to extract and integrate meaningful cues from noisy event data and degraded RGB frames. Specifically, an illumination-guided event enhancement module is designed to progressively refine event features using illumination maps derived from the Retinex model, thereby suppressing low-light artifacts while preserving high-contrast details. Furthermore, we propose an event-guided reflectance enhancement module that utilizes the enhanced event features to dynamically recover reflectance details via a multi-scale fusion mechanism. Experimental results show that our RetinexEVSR achieves state-of-the-art performance on three datasets. Notably, on the SDSD benchmark, our method can get up to 2.95 dB gain while reducing runtime by 65% compared to prior event-based methods. Code: this https URL.
zh

[CV-23] NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation

【速读】:该论文旨在解决多模态内容生成中统一建模与高效生成之间的矛盾问题,即如何在单一架构下实现文本和图像等不同模态的协同理解与生成,并克服传统自回归(Autoregressive, AR)模型在图像生成速度上的瓶颈。其解决方案的关键在于:首先,采用统一的解码器-only 自回归 Transformer 架构,通过交错训练文本与图像离散 token,实现跨模态的理解与生成能力;其次,针对文本的顺序性与图像的层次结构差异,对文本保留传统的 next-token 预测机制,而对图像引入 next-scale 预测策略,从而摆脱传统基于栅格扫描(raster-scan)的方法,显著提升图像生成效率——可在 5 秒内完成 1024×1024 分辨率图像生成;此外,通过改进训练策略稳定多尺度生成过程,并结合前缀微调(prefix-tuning)增强强化学习能力,最终使 NextFlow 在统一模型中达到 SOTA 性能,且视觉质量媲美专用扩散模型。

链接: https://arxiv.org/abs/2601.02204
作者: Huichao Zhang,Liao Qu,Yiheng Liu,Hang Chen,Yangyang Song,Yongsheng Dong,Shikun Sun,Xian Li,Xu Wang,Yi Jiang,Hu Ye,Bo Chen,Yiming Gao,Peng Liu,Akide Liu,Zhipeng Yang,Qili Deng,Linjie Xing,Jiyang Liu,Zhao Wang,Yang Zhou,Mingcong Liu,Yi Zhang,Qian He,Xiwei Hu,Zhongqi Qi,Jie Shao,Zhiye Fu,Shuai Wang,Fangmin Chen,Xuezhi Chai,Zhihua Wu,Yitong Wang,Zehuan Yuan,Daniel K. Du,Xinglong Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL

点击查看摘要

Abstract:We present NextFlow, a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By leveraging a unified vision representation within a unified autoregressive architecture, NextFlow natively activates multimodal understanding and generation capabilities, unlocking abilities of image editing, interleaved content and video generation. Motivated by the distinct nature of modalities - where text is strictly sequential and images are inherently hierarchical - we retain next-token prediction for text but adopt next-scale prediction for visual generation. This departs from traditional raster-scan methods, enabling the generation of 1024x1024 images in just 5 seconds - orders of magnitude faster than comparable AR models. We address the instabilities of multi-scale generation through a robust training recipe. Furthermore, we introduce a prefix-tuning strategy for reinforcement learning. Experiments demonstrate that NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.
zh

[CV-24] Parameter-Efficient Domain Adaption for CSI Crowd-Counting via Self-Supervised Learning with Adapter Modules

【速读】:该论文旨在解决基于WiFi信道状态信息(CSI)的无设备人群计数技术在实际部署中因领域偏移(domain shift)导致模型泛化能力差的问题。解决方案的关键在于提出了一种两阶段框架,核心是CSI-ResNet-A架构:首先通过自监督对比学习预训练模型以提取域不变表征,再引入轻量级Adapter模块实现高效微调;同时结合状态感知计数机对事件序列进行稳定处理,最终提升计数精度与鲁棒性。实验表明,该方法在10样本学习场景下MAE仅为0.44,且在WiAR公开基准上达到98.8%准确率,显著优于监督基线,并在参数效率方面实现97.2%的训练参数减少。

链接: https://arxiv.org/abs/2601.02203
作者: Oliver Custance,Saad Khan,Simon Parkinson,Quan Z. Sheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Device-free crowd-counting using WiFi Channel State Information (CSI) is a key enabling technology for a new generation of privacy-preserving Internet of Things (IoT) applications. However, practical deployment is severely hampered by the domain shift problem, where models trained in one environment fail to generalise to another. To overcome this, we propose a novel two-stage framework centred on a CSI-ResNet-A architecture. This model is pre-trained via self-supervised contrastive learning to learn domain-invariant representations and leverages lightweight Adapter modules for highly efficient fine-tuning. The resulting event sequence is then processed by a stateful counting machine to produce a final, stable occupancy estimate. We validate our framework extensively. On our WiFlow dataset, our unsupervised approach excels in a 10-shot learning scenario, achieving a final Mean Absolute Error (MAE) of just 0.44–a task where supervised baselines fail. To formally quantify robustness, we introduce the Generalisation Index (GI), on which our model scores near-perfectly, confirming its ability to generalise. Furthermore, our framework sets a new state-of-the-art public WiAR benchmark with 98.8% accuracy. Our ablation studies reveal the core strength of our design: adapter-based fine-tuning achieves performance within 1% of a full fine-tune (98.84% vs. 99.67%) while training 97.2% fewer parameters. Our work provides a practical and scalable solution for developing robust sensing systems ready for real-world IoT deployments.
zh

[CV-25] CORE: Code-based Inverse Self-Training Framework with Graph Expansion for Virtual Agents

【速读】:该论文旨在解决多模态虚拟代理(Multimodal Virtual Agents)训练中行为克隆(Behavior Cloning)与强化学习(Reinforcement Learning)之间的矛盾:行为克隆虽能有效模仿专家行为,但导致行为多样性低;而强化学习虽可通过探索发现新策略,却严重依赖人工设计的奖励函数。解决方案的关键在于提出CORE框架——一个基于代码的逆向自训练框架,并引入图结构扩展机制。其核心创新包括:1)语义代码抽象(Semantic Code Abstraction),从专家演示中自动推断可执行的奖励函数(Label Function),无需人工设计;2)策略图扩展(Strategy Graph Expansion),构建捕捉多样可行解的多路径策略图以提升域内行为多样性;3)轨迹引导外推(Trajectory-Guided Extrapolation),利用成功与失败轨迹扩展任务空间,增强域外行为多样性。该方法显著提升了虚拟代理的整体性能与泛化能力。

链接: https://arxiv.org/abs/2601.02201
作者: Keyu Wang,Bingchen Miao,Wendong Bu,Yu Wu,Juncheng Li,Shengyu Zhang,Wenqiao Zhang,Siliang Tang,Jun Xiao,Yueting Zhuang
机构: Zhejiang University (浙江大学); Wuhan University (武汉大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 12 figures

点击查看摘要

Abstract:The development of Multimodal Virtual Agents has made significant progress through the integration of Multimodal Large Language Models. However, mainstream training paradigms face key challenges: Behavior Cloning is simple and effective through imitation but suffers from low behavioral diversity, while Reinforcement Learning is capable of discovering novel strategies through exploration but heavily relies on manually designed reward functions. To address the conflict between these two methods, we present CORE, a Code-based Inverse Self-Training Framework with Graph Expansion that bridges imitation and exploration, offering a novel training framework that promotes behavioral diversity while eliminating the reliance on manually reward design. Specifically, we introduce Semantic Code Abstraction to automatically infers reward functions from expert demonstrations without manual design. The inferred reward function, referred to as the Label Function, is executable code that verifies one key step within a task. Building on this, we propose Strategy Graph Expansion to enhance in-domain behavioral diversity, which constructs a multi-path graph called Strategy Graph that captures diverse valid solutions beyond expert demonstrations. Furthermore, we introduce Trajectory-Guided Extrapolation, which enriches out-of-domain behavioral diversity by utilizing both successful and failed trajectories to expand the task space. Experiments on Web and Android platforms demonstrate that CORE significantly improves both overall performance and generalization, highlighting its potential as a robust and generalizable training paradigm for building powerful virtual agents.
zh

[CV-26] Mind the Gap: Continuous Magnification Sampling for Pathology Foundation Models

【速读】:该论文旨在解决病理学基础模型在不同放大倍数(magnification)下的性能差异问题,特别是现有离散均匀采样策略导致中间放大倍数下性能下降的局限性。其关键解决方案是将放大倍数采样建模为多源域适应问题,并提出连续放大倍数采样方法,通过消除放大倍数覆盖中的间隙来提升中间尺度的表现,同时引入优化的采样分布以增强跨尺度表征质量,从而显著改善模型在全放大倍数范围内的泛化能力。

链接: https://arxiv.org/abs/2601.02198
作者: Alexander Möllers,Julius Hense,Florian Schulz,Timo Milbich,Maximilian Alber,Lukas Ruff
机构: Berlin Institute for the Foundations of Learning and Data (BIFOLD); Machine Learning Group, Technische Universität Berlin; Aignostics; Charité, Universitätsmedizin Berlin
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In histopathology, pathologists examine both tissue architecture at low magnification and fine-grained morphology at high magnification. Yet, the performance of pathology foundation models across magnifications and the effect of magnification sampling during training remain poorly understood. We model magnification sampling as a multi-source domain adaptation problem and develop a simple theoretical framework that reveals systematic trade-offs between sampling strategies. We show that the widely used discrete uniform sampling of magnifications (0.25, 0.5, 1.0, 2.0 mpp) leads to degradation at intermediate magnifications. We introduce continuous magnification sampling, which removes gaps in magnification coverage while preserving performance at standard scales. Further, we derive sampling distributions that optimize representation quality across magnification scales. To evaluate these strategies, we introduce two new benchmarks (TCGA-MS, BRACS-MS) with appropriate metrics. Our experiments show that continuous sampling substantially improves over discrete sampling at intermediate magnifications, with gains of up to 4 percentage points in balanced classification accuracy, and that optimized distributions can further improve performance. Finally, we evaluate current histopathology foundation models, finding that magnification is a primary driver of performance variation across models. Our work paves the way towards future pathology foundation models that perform reliably across magnifications.
zh

[CV-27] QuIC: A Quantum-Inspired Interaction Classifier for Revitalizing Shallow CNNs in Fine-Grained Recognition

【速读】:该论文旨在解决在资源受限的边缘设备上部署细粒度视觉分类(Fine-Grained Visual Classification, FGVC)深度学习模型时面临的挑战:一方面,深度网络虽精度高但计算成本过高;另一方面,浅层网络(如VGG、AlexNet)虽高效却难以区分视觉相似的子类别。其根本原因在于标准全局平均池化(Global Average Pooling, GAP)仅捕获一阶统计特征,忽略了FGVC所需的高阶特征交互信息。为此,作者提出量子启发式交互分类器(Quantum-inspired Interaction Classifier, QuIC),其核心创新在于将特征通道建模为相互作用的量子态,并通过可学习的可观测量算子显式建模二阶特征协方差,从而在不增加特征维度的情况下捕捉细粒度判别性信息。QuIC设计为轻量级、即插即用模块,支持稳定、单阶段端到端训练,在保持效率的同时显著提升浅层骨干网络性能(如VGG16 Top-1准确率提升近20%),并优于现有注意力机制(如SE-Block)。

链接: https://arxiv.org/abs/2601.02189
作者: Cheng Ying Wu,Yen Jui Chang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deploying deep learning models for Fine-Grained Visual Classification (FGVC) on resource-constrained edge devices remains a significant challenge. While deep architectures achieve high accuracy on benchmarks like CUB-200-2011, their computational cost is often prohibitive. Conversely, shallow networks (e.g., AlexNet, VGG) offer efficiency but fail to distinguish visually similar sub-categories. This is because standard Global Average Pooling (GAP) heads capture only first-order statistics, missing the subtle high-order feature interactions required for FGVC. While Bilinear CNNs address this, they suffer from high feature dimensionality and instability during training. To bridge this gap, we propose the Quantum-inspired Interaction Classifier (QuIC). Drawing inspiration from quantum mechanics, QuIC models feature channels as interacting quantum states and captures second-order feature covariance via a learnable observable operator. Designed as a lightweight, plug-and-play module, QuIC supports stable, single-stage end-to-end training without exploding feature dimensions. Experimental results demonstrate that QuIC significantly revitalizes shallow backbones: it boosts the Top-1 accuracy of VGG16 by nearly 20% and outperforms state-of-the-art attention mechanisms (SE-Block) on ResNet18. Qualitative analysis, including t-SNE visualization, further confirms that QuIC resolves ambiguous cases by explicitly attending to fine-grained discriminative features and enforcing compact intra-class clustering.
zh

[CV-28] Why Commodity WiFi Sensors Fail at Multi-Person Gait Identification: A Systematic Analysis Using ESP32

【速读】:该论文旨在解决基于WiFi信道状态信息(CSI)的多人群体步态识别问题,特别是探究当前低精度表现是源于算法局限还是硬件约束。其关键解决方案在于系统性地评估六种信号分离方法(FastICA、SOBI、PCA、NMF、小波变换、张量分解)在七种不同人数场景下(1–10人)使用商用ESP32 WiFi传感器的表现,并引入新颖的诊断指标(个体内变异性、个体间可区分性、性能退化率),从而明确指出:即使采用最优方法(NMF)也只能达到56%的准确率,且所有方法间差异不显著(p > 0.05),表明问题本质在于商用ESP32硬件无法提供足够高质量的信号以支持可靠的多人分离。

链接: https://arxiv.org/abs/2601.02177
作者: Oliver Custance,Saad Khan,Simon Parkinson
机构: University of Huddersfield (哈德斯菲尔德大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:WiFi Channel State Information (CSI) has shown promise for single-person gait identification, with numerous studies reporting high accuracy. However, multi-person identification remains largely unexplored, with the limited existing work relying on complex, expensive setups requiring modified firmware. A critical question remains unanswered: is poor multi-person performance an algorithmic limitation or a fundamental hardware constraint? We systematically evaluate six diverse signal separation methods (FastICA, SOBI, PCA, NMF, Wavelet, Tensor Decomposition) across seven scenarios with 1-10 people using commodity ESP32 WiFi sensors–a simple, low-cost, off-the-shelf solution. Through novel diagnostic metrics (intra-subject variability, inter-subject distinguishability, performance degradation rate), we reveal that all methods achieve similarly low accuracy (45-56%, \sigma =3.74%) with statistically insignificant differences (p 0.05). Even the best-performing method, NMF, achieves only 56% accuracy. Our analysis reveals high intra-subject variability, low inter-subject distinguishability, and severe performance degradation as person count increases, indicating that commodity ESP32 sensors cannot provide sufficient signal quality for reliable multi-person separation.
zh

[CV-29] BiPrompt: Bilateral Prompt Optimization for Visual and Textual Debiasing in Vision-Language Models AAAI2026

【速读】:该论文旨在解决视觉语言基础模型(如CLIP)在零样本泛化中对跨模态虚假相关性的敏感性问题,即模型可能依赖于非因果特征(如背景或文本中的无关语义)进行预测,导致在分布偏移下性能不稳定。解决方案的关键在于提出一种双边提示优化框架(BiPrompt),其核心机制包括:在视觉侧采用结构化注意力引导的擦除策略以抑制背景激活并强制因果与虚假区域间的预测一致性;在文本侧引入平衡提示归一化(balanced prompt normalization),通过可学习的重中心化机制将类别嵌入映射到各向同性的语义空间。二者共同最小化虚假线索与预测之间的条件互信息,从而在无需重新训练或领域监督的情况下实现因果性、域不变的推理能力。

链接: https://arxiv.org/abs/2601.02147
作者: Sunny Gupta,Shounak Das,Amit Sethi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at the AAAI 2026 Workshop AIR-FM, Assessing and Improving Reliability of Foundation Models in the Real World

点击查看摘要

Abstract:Vision language foundation models such as CLIP exhibit impressive zero-shot generalization yet remain vulnerable to spurious correlations across visual and textual modalities. Existing debiasing approaches often address a single modality either visual or textual leading to partial robustness and unstable adaptation under distribution shifts. We propose a bilateral prompt optimization framework (BiPrompt) that simultaneously mitigates non-causal feature reliance in both modalities during test-time adaptation. On the visual side, it employs structured attention-guided erasure to suppress background activations and enforce orthogonal prediction consistency between causal and spurious regions. On the textual side, it introduces balanced prompt normalization, a learnable re-centering mechanism that aligns class embeddings toward an isotropic semantic space. Together, these modules jointly minimize conditional mutual information between spurious cues and predictions, steering the model toward causal, domain invariant reasoning without retraining or domain supervision. Extensive evaluations on real-world and synthetic bias benchmarks demonstrate consistent improvements in both average and worst-group accuracies over prior test-time debiasing methods, establishing a lightweight yet effective path toward trustworthy and causally grounded vision-language adaptation.
zh

[CV-30] Efficient Unrolled Networks for Large-Scale 3D Inverse Problems

【速读】:该论文旨在解决大规模成像逆问题(imaging inverse problems)中,深度学习模型难以有效集成全局前向算子(forward operators)的问题,尤其是在3D成像场景下,由于内存消耗过大导致现有方法无法采用深度展开(deep unrolling)架构。其解决方案的关键在于提出了一种域划分策略(domain partitioning strategy)和前向算子的近似方法(normal operator approximations),从而使得包含任意规模前向算子的端到端重建模型能够在单个GPU上完成训练与推理,同时在3D X射线锥束断层扫描和多线圈加速磁共振成像(multi-coil accelerated MRI)任务中达到当前最优性能。

链接: https://arxiv.org/abs/2601.02141
作者: Romain Vo,Julián Tachella
机构: CNRS(法国国家科学研究中心); ENS de Lyon(巴黎高等师范学院); Laboratoire de Physique(物理实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning-based methods have revolutionized the field of imaging inverse problems, yielding state-of-the-art performance across various imaging domains. The best performing networks incorporate the imaging operator within the network architecture, typically in the form of deep unrolling. However, in large-scale problems, such as 3D imaging, most existing methods fail to incorporate the operator in the architecture due to the prohibitive amount of memory required by global forward operators, which hinder typical patching strategies. In this work, we present a domain partitioning strategy and normal operator approximations that enable the training of end-to-end reconstruction models incorporating forward operators of arbitrarily large problems into their architecture. The proposed method achieves state-of-the-art performance on 3D X-ray cone-beam tomography and 3D multi-coil accelerated MRI, while requiring only a single GPU for both training and inference.
zh

[CV-31] Beyond Segmentation: An Oil Spill Change Detection Framework Using Synthetic SAR Imagery

【速读】:该论文旨在解决当前基于单时相合成孔径雷达(SAR)图像的海洋溢油检测方法在区分真实溢油与类油特征(如生物油膜或低风区)时存在高误报率、泛化能力弱的问题,尤其是在数据稀缺场景下表现不佳。其解决方案的关键在于提出一种新的双时相检测任务——溢油变化检测(Oil Spill Change Detection, OSCD),通过对比灾前与灾后SAR图像识别变化区域来提升检测可靠性;并进一步设计了时序感知混合修复(Temporal-Aware Hybrid Inpainting, TAHI)框架,利用灾后SAR图像生成合成灾前图像,其中包含高保真无油重建和时序真实性增强两个核心模块,从而实现无需真实灾前影像即可开展有效变化检测。

链接: https://arxiv.org/abs/2601.02139
作者: Chenyang Lai,Shuaiyu Chen,Tianjin Huang,Siyang Song,Guangliang Cheng,Chunbo Luo,Zeyu Fu
机构: University of Exeter (埃克塞特大学); University of Liverpool (利物浦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Marine oil spills are urgent environmental hazards that demand rapid and reliable detection to minimise ecological and economic damage. While Synthetic Aperture Radar (SAR) imagery has become a key tool for large-scale oil spill monitoring, most existing detection methods rely on deep learning-based segmentation applied to single SAR images. These static approaches struggle to distinguish true oil spills from visually similar oceanic features (e.g., biogenic slicks or low-wind zones), leading to high false positive rates and limited generalizability, especially under data-scarce conditions. To overcome these limitations, we introduce Oil Spill Change Detection (OSCD), a new bi-temporal task that focuses on identifying changes between pre- and post-spill SAR images. As real co-registered pre-spill imagery is not always available, we propose the Temporal-Aware Hybrid Inpainting (TAHI) framework, which generates synthetic pre-spill images from post-spill SAR data. TAHI integrates two key components: High-Fidelity Hybrid Inpainting for oil-free reconstruction, and Temporal Realism Enhancement for radiometric and sea-state consistency. Using TAHI, we construct the first OSCD dataset and benchmark several state-of-the-art change detection models. Results show that OSCD significantly reduces false positives and improves detection accuracy compared to conventional segmentation, demonstrating the value of temporally-aware methods for reliable, scalable oil spill monitoring in real-world scenarios.
zh

[CV-32] Remote Sensing Change Detection via Weak Temporal Supervision

【速读】:该论文旨在解决遥感图像语义变化检测中因标注数据稀缺而导致模型性能受限的问题。现有方法依赖于像素级标注,而这类标注成本高、耗时长;尽管已有研究尝试使用合成数据或人工生成变化对,但其跨域泛化能力仍有限。论文提出了一种弱时间监督策略,其关键在于利用已有的单时相遥感数据集,通过引入额外的时间观测来扩展数据集,并假设真实双时相图像对大多无变化,同时通过不同地理位置的图像配对生成变化样本,从而构建弱标签训练数据。为应对此类弱标签中的噪声,作者进一步采用目标感知的变化图生成与迭代优化机制,显著提升了模型在零样本和小样本场景下的性能表现。

链接: https://arxiv.org/abs/2601.02126
作者: Xavier Bou,Elliot Vincent,Gabriele Facciolo,Rafael Grompone von Gioi,Jean-Michel Morel,Thibaud Ehret
机构: Université Paris-Saclay, CNRS, ENS Paris-Saclay, Centre Borelli(中心博雷利); Université Gustave Eiffel, IGN-ENSG(法国国家地理研究所-国立高等测绘学校); Lingnan University(岭南大学); AMIAD, Pole Recherche(研究极)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Semantic change detection in remote sensing aims to identify land cover changes between bi-temporal image pairs. Progress in this area has been limited by the scarcity of annotated datasets, as pixel-level annotation is costly and time-consuming. To address this, recent methods leverage synthetic data or generate artificial change pairs, but out-of-domain generalization remains limited. In this work, we introduce a weak temporal supervision strategy that leverages additional temporal observations of existing single-temporal datasets, without requiring any new annotations. Specifically, we extend single-date remote sensing datasets with new observations acquired at different times and train a change detection model by assuming that real bi-temporal pairs mostly contain no change, while pairing images from different locations to generate change examples. To handle the inherent noise in these weak labels, we employ an object-aware change map generation and an iterative refinement process. We validate our approach on extended versions of the FLAIR and IAILD aerial datasets, achieving strong zero-shot and low-data regime performance across different benchmarks. Lastly, we showcase results over large areas in France, highlighting the scalability potential of our method.
zh

[CV-33] Car Drag Coefficient Prediction from 3D Point Clouds Using a Slice-Based Surrogate Model

【速读】:该论文旨在解决传统汽车气动性能评估方法(如计算流体动力学(CFD)和风洞试验)在早期设计阶段因资源消耗大、效率低而难以支持快速迭代的问题。其解决方案的关键在于提出了一种轻量级代理模型,通过将三维车辆几何结构沿流向轴分解为有序的二维截面序列,并利用轻量级PointNet2D模块编码每个截面特征,再通过双向长短期记忆网络(bidirectional LSTM)捕捉纵向几何演化关系,从而实现对气动阻力系数(Cd)的高精度、快速预测(R²=0.9528,MAE≈6.046×10⁻³),且单次推理时间仅约0.025秒(消费级GPU),显著提升了设计探索的敏捷性与可解释性。

链接: https://arxiv.org/abs/2601.02112
作者: Utkarsh Singh,Absaar Ali,Adarsh Roy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 14 pages, 5 figures. Published in: Bramer M., Stahl F. (eds) Artificial Intelligence XLII. SGAI 2025. Lecture Notes in Computer Science, vol 16302. Springer, Cham

点击查看摘要

Abstract:The automotive industry’s pursuit of enhanced fuel economy and performance necessitates efficient aerodynamic design. However, traditional evaluation methods such as computational fluid dynamics (CFD) and wind tunnel testing are resource intensive, hindering rapid iteration in the early design stages. Machine learning-based surrogate models offer a promising alternative, yet many existing approaches suffer from high computational complexity, limited interpretability, or insufficient accuracy for detailed geometric inputs. This paper introduces a novel lightweight surrogate model for the prediction of the aerodynamic drag coefficient (Cd) based on a sequential slice-wise processing of the geometry of the 3D vehicle. Inspired by medical imaging, 3D point clouds of vehicles are decomposed into an ordered sequence of 2D cross-sectional slices along the stream-wise axis. Each slice is encoded by a lightweight PointNet2D module, and the sequence of slice embeddings is processed by a bidirectional LSTM to capture longitudinal geometric evolution. The model, trained and evaluated on the DrivAerNet++ dataset, achieves a high coefficient of determination (R^2 0.9528) and a low mean absolute error (MAE approx 6.046 x 10^-3) in Cd prediction. With an inference time of approximately 0.025 seconds per sample on a consumer-grade GPU, our approach provides fast, accurate, and interpretable aerodynamic feedback, facilitating more agile and informed automotive design exploration.
zh

[CV-34] MagicFight: Personalized Martial Arts Combat Video Generation ACM-MM2024

【速读】:该论文致力于解决当前生成式AI在个性化双人交互视频生成中的关键瓶颈问题,尤其是针对武术对战场景下现有模型无法有效区分个体身份、易出现肢体异常和动作不匹配等挑战。其核心问题是:现有单人舞蹈生成模型难以捕捉两人对抗时的复杂动态关系与行为一致性。解决方案的关键在于提出一个全新的任务——个性化武术对战视频生成,并设计了名为MagicFight的专用框架,通过构建基于Unity物理引擎的定制化3D数据集,结合对现有模型策略的精细化调整,实现了高保真度、身份明确且动作连贯的双人战斗视频合成,为交互式视频内容创作奠定了技术基础。

链接: https://arxiv.org/abs/2601.02107
作者: Jiancheng Huang,Mingfu Yan,Songyan Chen,Yi Huang,Shifeng Chen
机构: Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (深圳先进技术研究院,中国科学院); China Telecom Cloud Technology Co., Ltd. (中国电信云计算科技有限公司); Shenzhen University of Advanced Technology (深圳先进技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ACM MM 2024

点击查看摘要

Abstract:Amid the surge in generic text-to-video generation, the field of personalized human video generation has witnessed notable advancements, primarily concentrated on single-person scenarios. However, to our knowledge, the domain of two-person interactions, particularly in the context of martial arts combat, remains uncharted. We identify a significant gap: existing models for single-person dancing generation prove insufficient for capturing the subtleties and complexities of two engaged fighters, resulting in challenges such as identity confusion, anomalous limbs, and action mismatches. To address this, we introduce a pioneering new task, Personalized Martial Arts Combat Video Generation. Our approach, MagicFight, is specifically crafted to overcome these hurdles. Given this pioneering task, we face a lack of appropriate datasets. Thus, we generate a bespoke dataset using the game physics engine Unity, meticulously crafting a multitude of 3D characters, martial arts moves, and scenes designed to represent the diversity of combat. MagicFight refines and adapts existing models and strategies to generate high-fidelity two-person combat videos that maintain individual identities and ensure seamless, coherent action sequences, thereby laying the groundwork for future innovations in the realm of interactive video content creation. Website: this https URL Dataset: this https URL Comments: Accepted by ACM MM 2024 Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2601.02107 [cs.CV] (or arXiv:2601.02107v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2601.02107 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-35] HeadLighter: Disentangling Illumination in Generative 3D Gaussian Heads via Lightstage Captures

【速读】:该论文旨在解决当前基于3D高斯溅射(3D Gaussian Splatting)的头部生成模型中,光照与固有外观特征深度耦合导致无法进行可控重光照(relighting)的问题。现有解耦方法依赖强假设以实现弱监督学习,限制了其在复杂光照条件下的表现。解决方案的关键在于提出HeadLighter框架,采用双分支架构分别建模光照不变的头部属性和物理合理的渲染组件,并通过渐进式解耦训练策略,在受控光照条件下(使用光台设备采集的多视角图像)对生成模型进行监督,从而实现物理合理且可控的光照编辑。此外,引入蒸馏策略生成高质量法向量以提升渲染真实性。

链接: https://arxiv.org/abs/2601.02103
作者: Yating Wang,Yuan Sun,Xuan Wang,Ran Yi,Boyao Zhou,Yipengjing Sun,Hongyu Liu,Yinuo Wang,Lizhuang Ma
机构: Shanghai Jiao Tong University (上海交通大学); AntGroup Research (蚂蚁集团研究)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent 3D-aware head generative models based on 3D Gaussian Splatting achieve real-time, photorealistic and view-consistent head synthesis. However, a fundamental limitation persists: the deep entanglement of illumination and intrinsic appearance prevents controllable relighting. Existing disentanglement methods rely on strong assumptions to enable weakly supervised learning, which restricts their capacity for complex illumination. To address this challenge, we introduce HeadLighter, a novel supervised framework that learns a physically plausible decomposition of appearance and illumination in head generative models. Specifically, we design a dual-branch architecture that separately models lighting-invariant head attributes and physically grounded rendering components. A progressive disentanglement training is employed to gradually inject head appearance priors into the generative architecture, supervised by multi-view images captured under controlled light conditions with a light stage setup. We further introduce a distillation strategy to generate high-quality normals for realistic rendering. Experiments demonstrate that our method preserves high-quality generation and real-time rendering, while simultaneously supporting explicit lighting and viewpoint editing. We will publicly release our code and dataset.
zh

[CV-36] 360-GeoGS: Geometrically Consistent Feed-Forward 3D Gaussian Splatting Reconstruction for 360 Images

【速读】:该论文旨在解决传统多视图立体视觉(Multi-View Stereo, MVS)在稀疏视角或低纹理区域表现不佳,以及神经渲染方法虽能生成高质量图像但需逐场景优化且实时性差的问题;同时,现有前馈式3D高斯溅射(3D Gaussian Splatting, 3DGS)方法侧重视觉保真度而忽视几何一致性,限制了其在空间感知任务中的精度与可靠性。解决方案的关键在于提出一种适用于360°图像的新型前馈式3DGS框架,并引入深度-法向几何正则化(Depth-Normal Geometric Regularization),通过耦合渲染深度梯度与法向信息,监督高斯素的旋转、尺度和位置参数,从而显著提升点云及表面重建的几何一致性,同时保持高质量渲染效果。

链接: https://arxiv.org/abs/2601.02102
作者: Jiaqi Yao,Zhongmiao Yan,Jingyi Xu,Songpengcheng Xia,Yan Xiang,Ling Pei
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D scene reconstruction is fundamental for spatial intelligence applications such as AR, robotics, and digital twins. Traditional multi-view stereo struggles with sparse viewpoints or low-texture regions, while neural rendering approaches, though capable of producing high-quality results, require per-scene optimization and lack real-time efficiency. Explicit 3D Gaussian Splatting (3DGS) enables efficient rendering, but most feed-forward variants focus on visual quality rather than geometric consistency, limiting accurate surface reconstruction and overall reliability in spatial perception tasks. This paper presents a novel feed-forward 3DGS framework for 360 images, capable of generating geometrically consistent Gaussian primitives while maintaining high rendering quality. A Depth-Normal geometric regularization is introduced to couple rendered depth gradients with normal information, supervising Gaussian rotation, scale, and position to improve point cloud and surface accuracy. Experimental results show that the proposed method maintains high rendering quality while significantly improving geometric consistency, providing an effective solution for 3D reconstruction in spatial perception tasks.
zh

[CV-37] InpaintHuman: Reconstructing Occluded Humans with Multi-Scale UV Mapping and Identity-Preserving Diffusion Inpainting

【速读】:该论文旨在解决从单目视频中重建完整且可动画化的3D人体虚拟形象(avatar)这一挑战,尤其是在严重遮挡条件下现有方法常出现几何失真和时序不一致的问题。解决方案的关键在于提出InpaintHuman方法,其核心创新包括:(i) 一种多尺度UV参数化表示与分层粗到细特征插值机制,能够鲁棒地重建遮挡区域并保留几何细节;(ii) 一种保持身份一致性的扩散修复模块,结合文本反转(textual inversion)与语义条件引导,实现针对特定主体的时序一致补全。该方法摒弃了基于图像扩散模型(SDS)的间接监督方式,转而采用直接像素级监督以保障身份保真度,在合成基准(PeopleSnapshot、ZJU-MoCap)和真实场景(OcMotion)上均展现出优越且稳定的重建质量。

链接: https://arxiv.org/abs/2601.02098
作者: Jinlong Fan,Shanshan Zhao,Liang Zheng,Jing Zhang,Yuxiang Yang,Mingming Gong
机构: Hangzhou Dianzi University (杭州电子科技大学); Alibaba International Digital Commerce Group (阿里巴巴国际数字商业集团); Wuhan University (武汉大学); University of Melbourne (墨尔本大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reconstructing complete and animatable 3D human avatars from monocular videos remains challenging, particularly under severe occlusions. While 3D Gaussian Splatting has enabled photorealistic human rendering, existing methods struggle with incomplete observations, often producing corrupted geometry and temporal inconsistencies. We present InpaintHuman, a novel method for generating high-fidelity, complete, and animatable avatars from occluded monocular videos. Our approach introduces two key innovations: (i) a multi-scale UV-parameterized representation with hierarchical coarse-to-fine feature interpolation, enabling robust reconstruction of occluded regions while preserving geometric details; and (ii) an identity-preserving diffusion inpainting module that integrates textual inversion with semantic-conditioned guidance for subject-specific, temporally coherent completion. Unlike SDS-based methods, our approach employs direct pixel-level supervision to ensure identity fidelity. Experiments on synthetic benchmarks (PeopleSnapshot, ZJU-MoCap) and real-world scenarios (OcMotion) demonstrate competitive performance with consistent improvements in reconstruction quality across diverse poses and viewpoints.
zh

[CV-38] Dancing Points: Synthesizing Ballroom Dancing with Three-Point Inputs

【速读】:该论文旨在解决双人舞(如交际舞)中舞者之间复杂且高维的全身动作交互建模与合成难题,尤其针对动作多样性高、交互性强的特点。其核心挑战在于如何在保持动作自然性和协调性的同时降低模型复杂度并避免过拟合。解决方案的关键在于提出一种基于虚拟现实(VR)设备获取的三点轨迹(three-point trajectory)作为舞蹈动作的紧凑表示方法,将复杂的全身运动简化为低维稀疏轨迹;进而利用高效多层感知机(MLP)网络直接从领舞者的三点轨迹预测跟舞者的对应轨迹,实现无需生成式模型(generative models)即可确定性地重建完整人体动作。该方法不仅在结构化数据(如交际舞)上表现优异,还在更广泛和多样化的数据集(如LaFAN)上展现出良好泛化能力,从而提供了一种计算和数据高效的沉浸式双人舞蹈应用新范式。

链接: https://arxiv.org/abs/2601.02096
作者: Peizhuo Li,Sebastian Starke,Yuting Ye,Olga Sorkine-Hornung
机构: ETH Zurich (苏黎世联邦理工学院); Meta Reality Labs (Meta 元宇宙实验室)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Ballroom dancing is a structured yet expressive motion category. Its highly diverse movement and complex interactions between leader and follower dancers make the understanding and synthesis challenging. We demonstrate that the three-point trajectory available from a virtual reality (VR) device can effectively serve as a dancer’s motion descriptor, simplifying the modeling and synthesis of interplay between dancers’ full-body motions down to sparse trajectories. Thanks to the low dimensionality, we can employ an efficient MLP network to predict the follower’s three-point trajectory directly from the leader’s three-point input for certain types of ballroom dancing, addressing the challenge of modeling high-dimensional full-body interaction. It also prevents our method from overfitting thanks to its compact yet explicit representation. By leveraging the inherent structure of the movements and carefully planning the autoregressive procedure, we show a deterministic neural network is able to translate three-point trajectories into a virtual embodied avatar, which is typically considered under-constrained and requires generative models for common motions. In addition, we demonstrate this deterministic approach generalizes beyond small, structured datasets like ballroom dancing, and performs robustly on larger, more diverse datasets such as LaFAN. Our method provides a computationally- and data-efficient solution, opening new possibilities for immersive paired dancing applications. Code and pre-trained models for this paper are available at this https URL.
zh

[CV-39] MCD-Net: A Lightweight Deep Learning Baseline for Optical-Only Moraine Segmentation

【速读】:该论文旨在解决冰川地貌中终碛垄(moraine)自动分割难题,特别是在光学影像对比度弱且高分辨率数字高程模型(DEM)数据稀缺的情况下,传统方法难以实现高效准确的冰川地貌识别。其解决方案的关键在于构建了首个大规模仅依赖光学影像的终碛垄分割数据集(MCD-Net Dataset),包含3,340张来自Google Earth的高分辨率人工标注图像,并提出轻量级网络架构MCD-Net,该模型融合MobileNetV2编码器、卷积块注意力模块(CBAM)与DeepLabV3+解码器,在保持较高分割精度(mIoU 62.3%,Dice系数72.8%)的同时,计算成本降低超过60%,验证了纯光学遥感影像在终碛垄体分割中的可行性与实用性。

链接: https://arxiv.org/abs/2601.02091
作者: Zhehuan Cao,Fiseha Berhanu Tesema,Ping Fu,Jianfeng Ren,Ahmed Nasr
机构: University of Nottingham Ningbo China (宁波诺丁汉大学); School of Computer Science (计算机科学学院); School of Geographical Sciences (地理科学学院); Nottingham Ningbo China Beacons of Excellence Research and Innovation Institute (宁波诺丁汉大学卓越研究与创新研究所); Department of Electrical and Electronic Engineering (电气与电子工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 10 figures. This manuscript is under review at IEEE Transactions on Geoscience and Remote Sensing

点击查看摘要

Abstract:Glacial segmentation is essential for reconstructing past glacier dynamics and evaluating climate-driven landscape change. However, weak optical contrast and the limited availability of high-resolution DEMs hinder automated mapping. This study introduces the first large-scale optical-only moraine segmentation dataset, comprising 3,340 manually annotated high-resolution images from Google Earth covering glaciated regions of Sichuan and Yunnan, China. We develop MCD-Net, a lightweight baseline that integrates a MobileNetV2 encoder, a Convolutional Block Attention Module (CBAM), and a DeepLabV3+ decoder. Benchmarking against deeper backbones (ResNet152, Xception) shows that MCD-Net achieves 62.3% mean Intersection over Union (mIoU) and 72.8% Dice coefficient while reducing computational cost by more than 60%. Although ridge delineation remains constrained by sub-pixel width and spectral ambiguity, the results demonstrate that optical imagery alone can provide reliable moraine-body segmentation. The dataset and code are publicly available at this https URL, establishing a reproducible benchmark for moraine-specific segmentation and offering a deployable baseline for high-altitude glacial monitoring.
zh

[CV-40] PhysSFI-Net: Physics-informed Geometric Learning of Skeletal and Facial Interactions for Orthognathic Surgical Outcome Prediction

【速读】:该论文旨在解决正颌手术后软组织形态预测的准确性与可解释性问题。传统生物力学模型计算成本高,而几何深度学习方法常缺乏可解释性。其解决方案的关键在于提出一种物理信息驱动的几何深度学习框架——PhysSFI-Net,该框架融合了三个核心模块:基于颅面结构和手术计划编码器的层次化图模块(结合注意力机制)以提取骨骼-面部交互特征;基于长短期记忆网络(LSTM)的序列预测模块用于逐步模拟软组织变形;以及受生物力学启发的高分辨率面部表面重建模块。该方法在135例患者数据上验证,实现了优于现有最优方法ACMT-Net的精度,点云形状误差为1.070 ± 0.088 mm,表面偏差误差为1.296 ± 0.349 mm,地标定位误差为2.445 ± 1.326 mm,兼具高精度与可解释性,具备临床转化潜力。

链接: https://arxiv.org/abs/2601.02088
作者: Jiahao Bao,Huazhen Liu,Yu Zhuang,Leran Tao,Xinyu Xu,Yongtao Shi,Mengjia Cheng,Yiming Wang,Congshuang Ku,Ting Zeng,Yilang Du,Siyi Chen,Shunyao Shen,Suncheng Xiang,Hongbo Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 31 pages, 8 figures

点击查看摘要

Abstract:Orthognathic surgery repositions jaw bones to restore occlusion and enhance facial aesthetics. Accurate simulation of postoperative facial morphology is essential for preoperative planning. However, traditional biomechanical models are computationally expensive, while geometric deep learning approaches often lack interpretability. In this study, we develop and validate a physics-informed geometric deep learning framework named PhysSFI-Net for precise prediction of soft tissue deformation following orthognathic surgery. PhysSFI-Net consists of three components: a hierarchical graph module with craniofacial and surgical plan encoders combined with attention mechanisms to extract skeletal-facial interaction features; a Long Short-Term Memory (LSTM)-based sequential predictor for incremental soft tissue deformation; and a biomechanics-inspired module for high-resolution facial surface reconstruction. Model performance was assessed using point cloud shape error (Hausdorff distance), surface deviation error, and landmark localization error (Euclidean distances of craniomaxillofacial landmarks) between predicted facial shapes and corresponding ground truths. A total of 135 patients who underwent combined orthodontic and orthognathic treatment were included for model training and validation. Quantitative analysis demonstrated that PhysSFI-Net achieved a point cloud shape error of 1.070 +/- 0.088 mm, a surface deviation error of 1.296 +/- 0.349 mm, and a landmark localization error of 2.445 +/- 1.326 mm. Comparative experiments indicated that PhysSFI-Net outperformed the state-of-the-art method ACMT-Net in prediction accuracy. In conclusion, PhysSFI-Net enables interpretable, high-resolution prediction of postoperative facial morphology with superior accuracy, showing strong potential for clinical application in orthognathic surgical planning and simulation.
zh

[CV-41] SketchRodGS: Sketch-based Extraction of Slender Geometries for Animating Gaussian Splatting Scenes SIGGRAPH

【速读】:该论文旨在解决从高斯点阵(Gaussian splatting)中提取细长弹性物体的多段线(polyline)表示问题,其核心挑战在于高斯点阵缺乏拓扑连接信息且存在噪声,难以直接构建准确的几何结构。解决方案的关键在于利用用户绘制的草图输入,结合屏幕空间最短路径分析(screen-space shortest path analysis),通过动态规划(dynamic programming)高效求解,从而鲁棒地生成代表细长部分的多段线网格。

链接: https://arxiv.org/abs/2601.02072
作者: Haato Watanabe,Nobuyuki Umetani
机构: The University of Tokyo (东京大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Presented at SIGGRAPH Asia 2025 (Technical Communications). Best Technical Communications Award

点击查看摘要

Abstract:Physics simulation of slender elastic objects often requires discretization as a polyline. However, constructing a polyline from Gaussian splatting is challenging as Gaussian splatting lacks connectivity information and the configuration of Gaussian primitives contains much noise. This paper presents a method to extract a polyline representation of the slender part of the objects in a Gaussian splatting scene from the user’s sketching input. Our method robustly constructs a polyline mesh that represents the slender parts using the screen-space shortest path analysis that can be efficiently solved using dynamic programming. We demonstrate the effectiveness of our approach in several in-the-wild examples.
zh

[CV-42] Agent ic Retoucher for Text-To-Image Generation

【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)扩散模型在生成高质量图像时仍存在的局部失真问题,如肢体、面部和文本区域的微小畸变。现有修正方法要么依赖昂贵的迭代重生成,要么基于视觉语言模型(Vision-Language Models, VLMs)进行编辑,但后者因空间定位能力弱而易引发语义漂移和不可靠的局部修改。其解决方案的关键在于提出一种层次化的决策驱动框架——Agentic Retoucher,该框架将后生成修正重构为类人感知-推理-行动闭环:(1) 感知代理利用文本-图像一致性线索学习上下文显著性以精确定位畸变区域;(2) 推理代理通过渐进式偏好对齐实现符合人类认知的诊断推理;(3) 动作代理根据用户偏好自适应规划局部修复(inpainting)。此设计整合了感知证据、语言推理与可控修正,形成统一的自我修正决策流程,并辅以新构建的GenBlemish-27K数据集支持细粒度监督与量化评估,显著优于当前最优方法,在感知质量、畸变定位与人类偏好一致性方面建立新范式。

链接: https://arxiv.org/abs/2601.02046
作者: Shaocheng Shen,Jianfeng Liang. Chunlei Cai,Cong Geng,Huiyu Duan,Xiaoyun Zhang,Qiang Hu,Guangtao Zhai
机构: Shanghai Jiao Tong University (上海交通大学); China Mobile Research Institute (中国移动研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Text-to-image (T2I) diffusion models such as SDXL and FLUX have achieved impressive photorealism, yet small-scale distortions remain pervasive in limbs, face, text and so on. Existing refinement approaches either perform costly iterative re-generation or rely on vision-language models (VLMs) with weak spatial grounding, leading to semantic drift and unreliable local edits. To close this gap, we propose Agentic Retoucher, a hierarchical decision-driven framework that reformulates post-generation correction as a human-like perception-reasoning-action loop. Specifically, we design (1) a perception agent that learns contextual saliency for fine-grained distortion localization under text-image consistency cues, (2) a reasoning agent that performs human-aligned inferential diagnosis via progressive preference alignment, and (3) an action agent that adaptively plans localized inpainting guided by user preference. This design integrates perceptual evidence, linguistic reasoning, and controllable correction into a unified, self-corrective decision process. To enable fine-grained supervision and quantitative evaluation, we further construct GenBlemish-27K, a dataset of 6K T2I images with 27K annotated artifact regions across 12 categories. Extensive experiments demonstrate that Agentic Retoucher consistently outperforms state-of-the-art methods in perceptual quality, distortion localization and human preference alignment, establishing a new paradigm for self-corrective and perceptually reliable T2I generation.
zh

[CV-43] AlignVTOFF: Texture-Spatial Feature Alignment for High-Fidelity Virtual Try-Off

【速读】:该论文旨在解决虚拟试衣(Virtual Try-Off, VTOFF)任务中因复杂几何形变和高频纹理导致的细节丢失问题,现有方法依赖轻量模块进行特征提取时难以保留结构化模式与细粒度信息,从而造成纹理衰减。其解决方案的关键在于提出一种基于参考U-Net与纹理-空间特征对齐(Texture-Spatial Feature Alignment, TSFA)的并行U-Net框架——Reference U-Net通过多尺度特征提取增强几何保真度,确保形变建模的鲁棒性并保留复杂结构模式;TSFA则通过混合注意力机制(可训练交叉注意力与冻结自注意力模块)将参考服装特征注入冻结去噪U-Net,显式对齐纹理与空间线索,有效缓解高频信息在去噪过程中的损失,从而提升生成结果的结构真实感与高频细节保真度。

链接: https://arxiv.org/abs/2601.02038
作者: Yihan Zhu,Mengying Ge
机构: Sino-European School of Technology, Shanghai University (上海大学中欧学院); National Demonstration Center for Experimental Engineering Training Education, Shanghai University (上海大学工程训练实验教学示范中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Virtual Try-Off (VTOFF) is a challenging multimodal image generation task that aims to synthesize high-fidelity flat-lay garments under complex geometric deformation and rich high-frequency textures. Existing methods often rely on lightweight modules for fast feature extraction, which struggles to preserve structured patterns and fine-grained details, leading to texture attenuation during this http URL address these issues, we propose AlignVTOFF, a novel parallel U-Net framework built upon a Reference U-Net and Texture-Spatial Feature Alignment (TSFA). The Reference U-Net performs multi-scale feature extraction and enhances geometric fidelity, enabling robust modeling of deformation while retaining complex structured patterns. TSFA then injects the reference garment features into a frozen denoising U-Net via a hybrid attention design, consisting of a trainable cross-attention module and a frozen self-attention module. This design explicitly aligns texture and spatial cues and alleviates the loss of high-frequency information during the denoising this http URL experiments across multiple settings demonstrate that AlignVTOFF consistently outperforms state-of-the-art methods, producing flat-lay garment results with improved structural realism and high-frequency detail fidelity.
zh

[CV-44] GDRO: Group-level Reward Post-training Suitable for Diffusion Models

【速读】:该论文旨在解决基于文本到图像的修正流扩散模型(rectified flow diffusion models)在使用群体级奖励进行在线强化学习(online reinforcement learning)时所面临的效率低下、对随机采样器依赖性强以及奖励欺骗(reward hacking)等问题。其核心挑战在于修正流模型与大型语言模型(LLM)的本质差异:一是图像采样耗时显著,成为训练瓶颈;二是修正流模型在固定初始噪声后具有确定性,难以直接沿用LLM中依赖随机性的群体奖励机制。解决方案的关键是提出一种新的后训练范式——群体级直接奖励优化(Group-level Direct Reward Optimization, GDRO),该方法通过理论证明支持完全离线训练,大幅减少图像采样时间开销,并且不依赖扩散采样器类型,从而避免了将常微分方程(ODE)近似为随机微分方程(SDE)以引入随机性的必要性。此外,GDRO还引入校正评分机制,在评估中同时考虑原始奖励和奖励欺骗趋势,有效缓解奖励欺骗陷阱,实验证明其在OCR和GenEval任务上均能高效提升奖励得分并具备良好的稳定性与鲁棒性。

链接: https://arxiv.org/abs/2601.02036
作者: Yiyang Wang,Xi Chen,Xiaogang Xu,Yu Liu,Hengshuang Zhao
机构: The University of Hong Kong (香港大学); The Chinese University of Hong Kong (香港中文大学); Tongyi Lab (通义实验室)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements adopt online reinforcement learning (RL) from LLMs to text-to-image rectified flow diffusion models for reward alignment. The use of group-level rewards successfully aligns the model with the targeted reward. However, it faces challenges including low efficiency, dependency on stochastic samplers, and reward hacking. The problem is that rectified flow models are fundamentally different from LLMs: 1) For efficiency, online image sampling takes much more time and dominates the time of training. 2) For stochasticity, rectified flow is deterministic once the initial noise is fixed. Aiming at these problems and inspired by the effects of group-level rewards from LLMs, we design Group-level Direct Reward Optimization (GDRO). GDRO is a new post-training paradigm for group-level reward alignment that combines the characteristics of rectified flow models. Through rigorous theoretical analysis, we point out that GDRO supports full offline training that saves the large time cost for image rollout sampling. Also, it is diffusion-sampler-independent, which eliminates the need for the ODE-to-SDE approximation to obtain stochasticity. We also empirically study the reward hacking trap that may mislead the evaluation, and involve this factor in the evaluation using a corrected score that not only considers the original evaluation reward but also the trend of reward hacking. Extensive experiments demonstrate that GDRO effectively and efficiently improves the reward score of the diffusion model through group-wise offline optimization across the OCR and GenEval tasks, while demonstrating strong stability and robustness in mitigating reward hacking.
zh

[CV-45] Leverag ing 2D-VLM for Label-Free 3D Segmentation in Large-Scale Outdoor Scene Understanding

【速读】:该论文旨在解决大规模点云数据中3D语义分割任务在缺乏标注3D训练数据和配对RGB图像时的挑战,传统监督学习方法受限于昂贵的标注成本与封闭词汇(closed-vocabulary)限制。其解决方案的关键在于利用虚拟相机将3D点云投影至2D图像空间,并借助预训练的2D基础模型(foundation 2D model)结合自然语言提示(natural language prompts)进行语义分割,最终通过多视角预测结果的加权投票实现3D分割。该方法无需任何3D标注即可达到接近监督方法的精度,并支持开放词汇识别(open-vocabulary recognition),显著提升了灵活性与泛化能力。

链接: https://arxiv.org/abs/2601.02029
作者: Toshihiko Nishimura,Hirofumi Abe,Kazuhiko Murasaki,Taiga Yoshida,Ryuichi Tanida
机构: NTT Corporation (日本电信电话公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19

点击查看摘要

Abstract:This paper presents a novel 3D semantic segmentation method for large-scale point cloud data that does not require annotated 3D training data or paired RGB images. The proposed approach projects 3D point clouds onto 2D images using virtual cameras and performs semantic segmentation via a foundation 2D model guided by natural language prompts. 3D segmentation is achieved by aggregating predictions from multiple viewpoints through weighted voting. Our method outperforms existing training-free approaches and achieves segmentation accuracy comparable to supervised methods. Moreover, it supports open-vocabulary recognition, enabling users to detect objects using arbitrary text queries, thus overcoming the limitations of traditional supervised approaches.
zh

[CV-46] Adapting Depth Anything to Adverse Imaging Conditions with Events

【速读】:该论文旨在解决在动态和恶劣光照条件下,基于帧图像的深度估计(depth estimation)性能显著下降的问题,尤其是在极端光照和运动模糊等退化场景中,传统深度基础模型(如Depth Anything)因视觉信号被破坏而难以保持鲁棒性。其解决方案的关键在于提出了一种事件引导的时空融合框架ADAE,核心创新包括:1)熵感知的空间融合策略(Entropy-Aware Spatial Fusion),通过信息熵衡量光照引起的退化程度,自适应地融合帧图像与事件相机特征;2)运动引导的时间校正机制(Motion-Guided Temporal Correction),利用事件相机提供的高时间分辨率运动线索对模糊区域中的歧义特征进行重校准。二者协同增强Depth Anything在退化场景下的泛化能力与鲁棒性。

链接: https://arxiv.org/abs/2601.02020
作者: Shihan Peng,Yuyang Xiong,Hanyu Zhou,Zhiwei Shi,Haoyue Liu,Gang Chen,Luxin Yan,Yi Chang
机构: Huazhong University of Science and Technology (华中科技大学); National University of Singapore (新加坡国立大学); Sun Yat-Sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Robust depth estimation under dynamic and adverse lighting conditions is essential for robotic systems. Currently, depth foundation models, such as Depth Anything, achieve great success in ideal scenes but remain challenging under adverse imaging conditions such as extreme illumination and motion blur. These degradations corrupt the visual signals of frame cameras, weakening the discriminative features of frame-based depths across the spatial and temporal dimensions. Typically, existing approaches incorporate event cameras to leverage their high dynamic range and temporal resolution, aiming to compensate for corrupted frame features. However, such specialized fusion models are predominantly trained from scratch on domain-specific datasets, thereby failing to inherit the open-world knowledge and robust generalization inherent to foundation models. In this work, we propose ADAE, an event-guided spatiotemporal fusion framework for Depth Anything in degraded scenes. Our design is guided by two key insights: 1) Entropy-Aware Spatial Fusion. We adaptively merge frame-based and event-based features using an information entropy strategy to indicate illumination-induced degradation. 2) Motion-Guided Temporal Correction. We resort to the event-based motion cue to recalibrate ambiguous features in blurred regions. Under our unified framework, the two components are complementary to each other and jointly enhance Depth Anything under adverse imaging conditions. Extensive experiments have been performed to verify the superiority of the proposed method. Our code will be released upon acceptance.
zh

[CV-47] owards Any-Quality Image Segmentation via Generative and Adaptive Latent Space Enhancement

【速读】:该论文旨在解决生成式分割模型(Segment Anything Models, SAMs)在低质量图像上性能显著下降的问题,从而提升其在真实场景中的鲁棒性。解决方案的关键在于提出GleSAM++框架,其核心创新包括:1)利用生成潜在空间增强(Generative Latent Space Enhancement)提升对低质量图像的适应能力;2)引入特征分布对齐(Feature Distribution Alignment, FDA)与通道复制扩展(Channel Replication and Expansion, CRE)以优化预训练扩散模型与分割框架的兼容性;3)设计退化感知自适应增强(Degradation-aware Adaptive Enhancement, DAE)机制,通过解耦重建过程为退化等级预测和退化感知重建两个阶段,显式建模图像退化程度,降低学习复杂度并提升重建质量。该方法仅需少量额外可学习参数即可适配现有SAM和SAM2模型,实现高效优化与跨图像质量泛化。

链接: https://arxiv.org/abs/2601.02018
作者: Guangqian Guo,Aixi Ren,Yong Guo,Xuehui Yu,Jiacheng Tian,Wenli Li,Yaoxing Wang,Shan Gao
机构: Northwestern Polytechnical University (西北工业大学); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Diffusion-based latent space enhancement helps improve the robustness of SAM

点击查看摘要

Abstract:Segment Anything Models (SAMs), known for their exceptional zero-shot segmentation performance, have garnered significant attention in the research community. Nevertheless, their performance drops significantly on severely degraded, low-quality images, limiting their effectiveness in real-world scenarios. To address this, we propose GleSAM++, which utilizes Generative Latent space Enhancement to boost robustness on low-quality images, thus enabling generalization across various image qualities. Additionally, to improve compatibility between the pre-trained diffusion model and the segmentation framework, we introduce two techniques, i.e., Feature Distribution Alignment (FDA) and Channel Replication and Expansion (CRE). However, the above components lack explicit guidance regarding the degree of degradation. The model is forced to implicitly fit a complex noise distribution that spans conditions from mild noise to severe artifacts, which substantially increases the learning burden and leads to suboptimal reconstructions. To address this issue, we further introduce a Degradation-aware Adaptive Enhancement (DAE) mechanism. The key principle of DAE is to decouple the reconstruction process for arbitrary-quality features into two stages: degradation-level prediction and degradation-aware reconstruction. Our method can be applied to pre-trained SAM and SAM2 with only minimal additional learnable parameters, allowing for efficient optimization. Extensive experiments demonstrate that GleSAM++ significantly improves segmentation robustness on complex degradations while maintaining generalization to clear images. Furthermore, GleSAM++ also performs well on unseen degradations, underscoring the versatility of our approach and dataset.
zh

[CV-48] Enhancing Object Detection with Privileged Information: A Model-Agnostic Teacher-Student Approach

【速读】:该论文旨在解决对象检测模型在训练阶段可利用细粒度描述性信息(如边界框掩码、显著性图和深度线索等)但推理阶段无法获取此类信息的问题。其解决方案的关键在于提出一种通用且模型无关的方法,通过教师-学生架构将特权信息注入基于深度学习的对象检测器中,从而在不增加推理复杂度或模型尺寸的前提下显著提升检测精度,尤其对中大型目标的检测性能改善最为明显。

链接: https://arxiv.org/abs/2601.02016
作者: Matthias Bartolo,Dylan Seychell,Gabriel Hili,Matthew Montebello,Carl James Debono,Saviour Formosa,Konstantinos Makantasis
机构: University of Malta(马耳他大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: Code available on GitHub: this https URL

点击查看摘要

Abstract:This paper investigates the integration of the Learning Using Privileged Information (LUPI) paradigm in object detection to exploit fine-grained, descriptive information available during training but not at inference. We introduce a general, model-agnostic methodology for injecting privileged information-such as bounding box masks, saliency maps, and depth cues-into deep learning-based object detectors through a teacher-student architecture. Experiments are conducted across five state-of-the-art object detection models and multiple public benchmarks, including UAV-based litter detection datasets and Pascal VOC 2012, to assess the impact on accuracy, generalization, and computational efficiency. Our results demonstrate that LUPI-trained students consistently outperform their baseline counterparts, achieving significant boosts in detection accuracy with no increase in inference complexity or model size. Performance improvements are especially marked for medium and large objects, while ablation studies reveal that intermediate weighting of teacher guidance optimally balances learning from privileged and standard inputs. The findings affirm that the LUPI framework provides an effective and practical strategy for advancing object detection systems in both resource-constrained and real-world settings.
zh

[CV-49] XAI-MeD: Explainable Knowledge Guided Neuro-Symbolic Framework for Domain Generalization and Rare Class Detection in Medical Imaging AAAI

【速读】:该论文旨在解决医学人工智能(Medical AI)中的可解释性域泛化(Explainability Domain Generalization)和罕见类可靠性(Rare Class Reliability)问题,即深度模型在现实世界分布偏移下性能下降且对少见临床状况表现出偏差。其解决方案的关键在于提出XAIMeD框架,该框架通过统一的神经符号架构(Neuro-Symbolic Architecture)将临床专家知识编码为逻辑连接词构成的原子医疗命题规则,并转化为机器可验证的类别特定规则;这些规则通过加权特征满足度评分驱动符号推理分支,与神经预测结果进行置信度加权融合,同时引入受猎人启发的自适应路由机制(基于熵不平衡增益EIG和罕见类基尼指数Gini),有效缓解类别不平衡、类内高变异性及不确定性问题。实验证明该方法在跨域泛化和罕见类F1分数上显著优于现有深度学习基线。

链接: https://arxiv.org/abs/2601.02008
作者: Midhat Urooj,Ayan Banerjee,Sandeep Gupta
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at AAAI Bridge Program 2026

点击查看摘要

Abstract:Explainability domain generalization and rare class reliability are critical challenges in medical AI where deep models often fail under real world distribution shifts and exhibit bias against infrequent clinical conditions This paper introduces XAIMeD an explainable medical AI framework that integrates clinically accurate expert knowledge into deep learning through a unified neuro symbolic architecture XAIMeD is designed to improve robustness under distribution shift enhance rare class sensitivity and deliver transparent clinically aligned interpretations The framework encodes clinical expertise as logical connectives over atomic medical propositions transforming them into machine checkable class specific rules Their diagnostic utility is quantified through weighted feature satisfaction scores enabling a symbolic reasoning branch that complements neural predictions A confidence weighted fusion integrates symbolic and deep outputs while a Hunt inspired adaptive routing mechanism guided by Entropy Imbalance Gain EIG and Rare Class Gini mitigates class imbalance high intra class variability and uncertainty We evaluate XAIMeD across diverse modalities on four challenging tasks i Seizure Onset Zone SOZ localization from rs fMRI ii Diabetic Retinopathy grading across 6 multicenter datasets demonstrate substantial performance improvements including 6 percent gains in cross domain generalization and a 10 percent improved rare class F1 score far outperforming state of the art deep learning baselines Ablation studies confirm that the clinically grounded symbolic components act as effective regularizers ensuring robustness to distribution shifts XAIMeD thus provides a principled clinically faithful and interpretable approach to multimodal medical AI.
zh

[CV-50] Nighttime Hazy Image Enhancement via Progressively and Mutually Reinforcing Night-Haze Priors

【速读】:该论文旨在解决夜间雾霾图像中由于复杂退化分布导致的可见度提升困难问题,现有方法通常仅针对单一退化类型(如雾霾或低光照)进行处理,忽略了不同类型退化之间的相互作用,从而限制了增强效果。其解决方案的关键在于利用低光照与雾霾先验之间共享的领域知识,通过相互强化机制实现对二者内在一致性的逐步优化。具体而言,模型引入了图像级、块级和像素级专家,并在视觉域与频域中协同工作,以逐级恢复全局场景结构、区域模式和细粒度细节;同时设计了一个频率感知路由机制,自适应地引导各专家贡献,确保鲁棒的图像复原性能。

链接: https://arxiv.org/abs/2601.01998
作者: Chen Zhu,Huiwen Zhang,Mu He,Yujie Li,Xiaotian Qiao
机构: Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Enhancing the visibility of nighttime hazy images is challenging due to the complex degradation distributions. Existing methods mainly address a single type of degradation (e.g., haze or low-light) at a time, ignoring the interplay of different degradation types and resulting in limited visibility improvement. We observe that the domain knowledge shared between low-light and haze priors can be reinforced mutually for better visibility. Based on this key insight, in this paper, we propose a novel framework that enhances visibility in nighttime hazy images by reinforcing the intrinsic consistency between haze and low-light priors mutually and progressively. In particular, our model utilizes image-, patch-, and pixel-level experts that operate across visual and frequency domains to recover global scene structure, regional patterns, and fine-grained details progressively. A frequency-aware router is further introduced to adaptively guide the contribution of each expert, ensuring robust image restoration. Extensive experiments demonstrate the superior performance of our model on nighttime dehazing benchmarks both quantitatively and qualitatively. Moreover, we showcase the generalizability of our model in daytime dehazing and low-light enhancement tasks.
zh

[CV-51] API: Empowering Generalizable Real-World Image Dehazing via Adaptive Patch Importance Learning

【速读】:该论文旨在解决真实世界图像去雾(image dehazing)任务中学习方法在复杂场景下性能显著下降的问题,其核心原因在于训练数据有限以及雾霾密度分布的内在复杂性。解决方案的关键在于提出一种自适应补丁重要性感知(Adaptive Patch Importance-aware, API)框架,包含两个核心模块:自动雾霾生成(Automatic Haze Generation, AHG)模块通过混合数据增强策略生成高质量、多样化的逼真雾霾图像以扩充训练数据;密度感知去雾(Density-aware Haze Removal, DHR)模块则基于不同雾霾密度分布自适应地处理图像补丁,提升模型对多样化雾霾场景的泛化能力。此外,作者引入多负样本对比去雾(Multi-Negative Contrastive Dehazing, MNCD)损失函数,利用空间与频域上的多负样本信息缓解去雾结果细节模糊问题,从而实现更优的定量指标与视觉质量表现。

链接: https://arxiv.org/abs/2601.01992
作者: Chen Zhu,Huiwen Zhang,Yujie Li,Mu He,Xiaotian Qiao
机构: Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Real-world image dehazing is a fundamental yet challenging task in low-level vision. Existing learning-based methods often suffer from significant performance degradation when applied to complex real-world hazy scenes, primarily due to limited training data and the intrinsic complexity of haze density this http URL address these challenges, we introduce a novel Adaptive Patch Importance-aware (API) framework for generalizable real-world image dehazing. Specifically, our framework consists of an Automatic Haze Generation (AHG) module and a Density-aware Haze Removal (DHR) module. AHG provides a hybrid data augmentation strategy by generating realistic and diverse hazy images as additional high-quality training data. DHR considers hazy regions with varying haze density distributions for generalizable real-world image dehazing in an adaptive patch importance-aware manner. To alleviate the ambiguity of the dehazed image details, we further introduce a new Multi-Negative Contrastive Dehazing (MNCD) loss, which fully utilizes information from multiple negative samples across both spatial and frequency domains. Extensive experiments demonstrate that our framework achieves state-of-the-art performance across multiple real-world benchmarks, delivering strong results in both quantitative metrics and qualitative visual quality, and exhibiting robust generalization across diverse haze distributions.
zh

[CV-52] VIT-Ped: Visionary Intention Transformer for Pedestrian Behavior Analysis

【速读】:该论文旨在解决自动驾驶从L3向L4过渡过程中行人意图预测(Pedestrian Intention Prediction)的关键技术难题,以提升未来道路对所有交通参与者的安全性。其解决方案的核心在于提出一种基于Transformer及视频Transformer(Video Vision Transformer)架构的多模态算法,通过不同规模模型设计并融合多种数据模态,实现了在JAAD数据集上的最先进(SOTA)性能,在Accuracy、AUC和F1-score等指标上均取得显著提升。关键创新点在于利用Transformer结构有效建模时空特征,并通过系统的消融实验验证了不同模型设计选择对性能的影响。

链接: https://arxiv.org/abs/2601.01989
作者: Aly R. Elkammar,Karim M. Gamaleldin,Catherine M. Elias
机构: German University in Cairo (德国大学); C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems (认知驾驶研究车辆系统实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Pedestrian Intention prediction is one of the key technologies in the transition from level 3 to level 4 autonomous driving. To understand pedestrian crossing behaviour, several elements and features should be taken into consideration to make the roads of tomorrow safer for everybody. We introduce a transformer / video vision transformer based algorithm of different sizes which uses different data modalities .We evaluated our algorithms on popular pedestrian behaviour dataset, JAAD, and have reached SOTA performance and passed the SOTA in metrics like Accuracy, AUC and F1-score. The advantages brought by different model design choices are investigated via extensive ablation studies.
zh

[CV-53] hinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在空间推理(Spatial Reasoning)能力上的局限性,即如何从单纯的视觉感知提升到对空间语义关系的深入理解。现有方法或依赖局部图像块的重复关注以增强细粒度感知,但削弱了全局空间意识;或仅标记孤立坐标来定位物体,却忽略了对象间的整体组织结构。解决方案的关键在于引入认知科学中的“以物体为中心的蓝图”(object-centric blueprint)概念,将图像中相关物体的位置、尺寸和属性结构化为JSON格式的蓝图表示,并基于此进行因果推理。该方法通过三项核心技术实现:(1) 嵌入蓝图的推理轨迹用于监督微调以激发基础推理能力;(2) 蓝图感知奖励机制在强化学习中引导模型生成适量且与最终答案一致的蓝图;(3) 抗捷径数据增强策略通过针对性扰动图像和问题,防止模型依赖表面视觉或语言线索。实验表明,该方法显著优于现有VLMs及专用空间推理模型。

链接: https://arxiv.org/abs/2601.01984
作者: Weijian Ma,Shizhao Sun,Tianyu Yu,Ruiyu Wang,Tat-Seng Chua,Jiang Bian
机构: National University of Singapore (新加坡国立大学); Microsoft Research, Asia (微软亚洲研究院); Tsinghua University (清华大学); University of Toronto (多伦多大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. Under review

点击查看摘要

Abstract:Spatial reasoning – the ability to perceive and reason about relationships in space – advances vision-language models (VLMs) from visual perception toward spatial semantic understanding. Existing approaches either revisit local image patches, improving fine-grained perception but weakening global spatial awareness, or mark isolated coordinates, which capture object locations but overlook their overall organization. In this work, we integrate the cognitive concept of an object-centric blueprint into VLMs to enhance spatial reasoning. Given an image and a question, the model first constructs a JSON-style blueprint that records the positions, sizes, and attributes of relevant objects, and then reasons over this structured representation to produce the final answer. To achieve this, we introduce three key techniques: (1) blueprint-embedded reasoning traces for supervised fine-tuning to elicit basic reasoning skills; (2) blueprint-aware rewards in reinforcement learning to encourage the blueprint to include an appropriate number of objects and to align final answers with this causal reasoning; and (3) anti-shortcut data augmentation that applies targeted perturbations to images and questions, discouraging reliance on superficial visual or linguistic cues. Experiments show that our method consistently outperforms existing VLMs and specialized spatial reasoning models.
zh

[CV-54] Forget Less by Learning Together through Concept Consolidation WACV-26

【速读】:该论文旨在解决定制化扩散模型(Custom Diffusion Models, CDMs)在持续学习新概念时面临的灾难性遗忘问题,尤其是现有方法多局限于固定顺序的序列学习场景,忽视了概念间的相互作用。其解决方案的关键在于提出了一种名为“少遗忘地一起学习”(Forget Less by Learning Together, FL2T)的新框架,通过引入一个集合不变的跨概念学习模块,利用代理(proxies)指导不同概念间的特征选择,从而实现并行且顺序无关的概念学习,有效保留旧知识的同时高效融合新概念。实验表明,该方法在三个数据集上显著提升了概念保留能力,在十项任务中平均CLIP图像对齐分数提升至少2%。

链接: https://arxiv.org/abs/2601.01963
作者: Arjun Ramesh Kaushik,Naresh Kumar Devulapally,Vishnu Suresh Lokhande,Nalini Ratha,Venu Govindaraju
机构: University at Buffalo, SUNY (纽约州立大学布法罗分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at WACV-26

点击查看摘要

Abstract:Custom Diffusion Models (CDMs) have gained significant attention due to their remarkable ability to personalize generative processes. However, existing CDMs suffer from catastrophic forgetting when continuously learning new concepts. Most prior works attempt to mitigate this issue under the sequential learning setting with a fixed order of concept inflow and neglect inter-concept interactions. In this paper, we propose a novel framework - Forget Less by Learning Together (FL2T) - that enables concurrent and order-agnostic concept learning while addressing catastrophic forgetting. Specifically, we introduce a set-invariant inter-concept learning module where proxies guide feature selection across concepts, facilitating improved knowledge retention and transfer. By leveraging inter-concept guidance, our approach preserves old concepts while efficiently incorporating new ones. Extensive experiments, across three datasets, demonstrates that our method significantly improves concept retention and mitigates catastrophic forgetting, highlighting the effectiveness of inter-concept catalytic behavior in incremental concept learning of ten tasks with at least 2% gain on average CLIP Image Alignment scores.
zh

[CV-55] AFTER: Mitigating the Object Hallucination of LVLM via Adaptive Factual-Guided Activation Editing

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)中存在的对象幻觉(object hallucination)问题,尤其是由语言偏差(language bias)引发的类别、属性和关系层面的幻觉,这些问题严重阻碍了可信人工智能应用的发展。解决方案的关键在于提出一种自适应的事实引导式视觉-文本编辑方法(Adaptive Factual-guided Visual-Textual Editing, AFTER),其核心包括两个模块:Factual-Augmented Activation Steering (FAS) 和 Query-Adaptive Offset Optimization (QAO)。FAS通过引入事实增强的激活引导机制,显式建模精确的视觉-文本关联,从而为激活编辑提供通用且准确的事实指导;QAO则进一步引入查询感知的偏移估计器,从通用引导向量中生成特定于查询的编辑策略,提升编辑的多样性与粒度。该方法在多个标准基准上验证了有效性,显著降低了幻觉率。

链接: https://arxiv.org/abs/2601.01957
作者: Tianbo Wang,Yuqing Ma,Kewei Liao,Zhange Zhang,Simin Li,Jinyang Guo,Xianglong Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have achieved substantial progress in cross-modal tasks. However, due to language bias, LVLMs are susceptible to object hallucination, which can be primarily divided into category, attribute, and relation hallucination, significantly impeding the trustworthy AI applications. Editing the internal activations of LVLMs has shown promising effectiveness in mitigating hallucinations with minimal cost. However, previous editing approaches neglect the effective guidance offered by factual textual semantics, thereby struggling to explicitly mitigate language bias. To address these issues, we propose Adaptive Factual-guided Visual-Textual Editing for hallucination mitigation (AFTER), which comprises Factual-Augmented Activation Steering (FAS) and Query-Adaptive Offset Optimization (QAO), to adaptively guides the original biased activations towards factual semantics. Specifically, FAS is proposed to provide factual and general guidance for activation editing, thereby explicitly modeling the precise visual-textual associations. Subsequently, QAO introduces a query-aware offset estimator to establish query-specific editing from the general steering vector, enhancing the diversity and granularity of editing. Extensive experiments on standard hallucination benchmarks across three widely adopted LVLMs validate the efficacy of the proposed AFTER, notably achieving up to a 16.3% reduction of hallucination over baseline on the AMBER benchmark. Our code and data will be released for reproducibility.
zh

[CV-56] MotionAdapter: Video Motion Transfer via Content-Aware Attention Customization

【速读】:该论文旨在解决基于扩散Transformer(Diffusion Transformer, DiT)架构的文本到视频生成模型中复杂运动迁移(motion transfer)难以实现的问题,尤其在保持目标内容语义一致性与运动合理性方面存在挑战。解决方案的关键在于提出MotionAdapter框架,其核心创新包括:首先通过分析3D全注意力模块中的跨帧注意力机制,显式地从视频中解耦出运动信息(即提取注意力驱动的运动场);其次引入基于DINO(Data-Independent Normalization and Object-aware features)的运动定制模块,利用内容对应关系对运动场进行重排与优化,从而弥合参考视频与目标视频之间的语义鸿沟。最终,定制化的运动场被用于引导DiT去噪过程,确保生成视频继承参考视频的运动模式,同时保留目标视频的外观和语义特征。

链接: https://arxiv.org/abs/2601.01955
作者: Zhexin Zhang,Yifeng Zhu,Yangyang Xu,Long Chen,Yong Du,Shengfeng He,Jun Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in diffusion-based text-to-video models, particularly those built on the diffusion transformer architecture, have achieved remarkable progress in generating high-quality and temporally coherent videos. However, transferring complex motions between videos remains challenging. In this work, we present MotionAdapter, a content-aware motion transfer framework that enables robust and semantically aligned motion transfer within DiT-based T2V models. Our key insight is that effective motion transfer requires \romannumeral1) explicit disentanglement of motion from appearance and \romannumeral 2) adaptive customization of motion to target content. MotionAdapter first isolates motion by analyzing cross-frame attention within 3D full-attention modules to extract attention-derived motion fields. To bridge the semantic gap between reference and target videos, we further introduce a DINO-guided motion customization module that rearranges and refines motion fields based on content correspondences. The customized motion field is then used to guide the DiT denoising process, ensuring that the synthesized video inherits the reference motion while preserving target appearance and semantics. Extensive experiments demonstrate that MotionAdapter outperforms state-of-the-art methods in both qualitative and quantitative evaluations. Moreover, MotionAdapter naturally supports complex motion transfer and motion editing tasks such as zooming.
zh

[CV-57] Face Normal Estimation from Rag s to Riches

【速读】:该论文旨在解决人脸法向量估计(face normal estimation)方法对大规模成对训练数据高度依赖的问题,从而限制了其在数据稀缺场景下的应用。解决方案的关键在于提出一种“粗到精”(coarse-to-fine)的分阶段估计框架:首先利用小规模数据训练一个轻量模型生成粗粒度法向量作为引导(exemplars),随后引入自注意力机制(self-attention mechanism)捕捉长程依赖关系以消除局部伪影,并设计专用的细化网络将输入图像与对应示例映射为高质量细粒度法向量。该架构有效降低了对海量成对数据和计算资源的需求,同时保持了优异的估计精度。

链接: https://arxiv.org/abs/2601.01950
作者: Meng Wang,Wenjing Dai,Jiawan Zhang,Xiaojie Guo
机构: Tiangong University (天津工业大学); Fitow (天津)检测技术有限公司 (Algorithm Department); Tianjin University (天津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although recent approaches to face normal estimation have achieved promising results, their effectiveness heavily depends on large-scale paired data for training. This paper concentrates on relieving this requirement via developing a coarse-to-fine normal estimator. Concretely, our method first trains a neat model from a small dataset to produce coarse face normals that perform as guidance (called exemplars) for the following refinement. A self-attention mechanism is employed to capture long-range dependencies, thus remedying severe local artifacts left in estimated coarse facial normals. Then, a refinement network is customized for the sake of mapping input face images together with corresponding exemplars to fine-grained high-quality facial normals. Such a logical function split can significantly cut the requirement of massive paired data and computational resource. Extensive experiments and ablation studies are conducted to demonstrate the efficacy of our design and reveal its superiority over state-of-the-art methods in terms of both training expense as well as estimation quality. Our code and models are open-sourced at: this https URL.
zh

[CV-58] MacVQA: Adaptive Memory Allocation and Global Noise Filtering for Continual Visual Question Answering AAAI2026

【速读】:该论文旨在解决持续视觉问答(Continual Visual Question Answering, C-VQA)中知识保留、适应新信息与鲁棒特征表示之间的平衡难题。现有方法在应对不断变化的任务场景时,常面临灾难性遗忘和特征退化问题,难以同时实现高效的知识获取、稳定的知识保留以及对新型组合问题的泛化能力。解决方案的关键在于提出一种名为MacVQA的新框架,其核心创新包括:(1)自适应记忆分配机制,基于原型(prototype-based)策略优化特征质量和内存利用率;(2)全局噪声过滤模块,在融合视觉与文本信息的同时抑制冗余或干扰信号,从而增强表示的鲁棒性。这一设计使模型能够在持续学习过程中有效平衡知识积累、保留与组合泛化能力,实验表明其在标准任务和新颖组合任务上均显著优于现有基线方法。

链接: https://arxiv.org/abs/2601.01926
作者: Zhifei Li,Yiran Wang,Chenyi Xiong,Yujing Xia,Xiaoju Hou,Yue Zhao,Miao Zhang,Kui Xiao,Bing Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2026

点击查看摘要

Abstract:Visual Question Answering (VQA) requires models to reason over multimodal information, combining visual and textual data. With the development of continual learning, significant progress has been made in retaining knowledge and adapting to new information in the VQA domain. However, current methods often struggle with balancing knowledge retention, adaptation, and robust feature representation. To address these challenges, we propose a novel framework with adaptive memory allocation and global noise filtering called MacVQA for visual question answering. MacVQA fuses visual and question information while filtering noise to ensure robust representations, and employs prototype-based memory allocation to optimize feature quality and memory usage. These designs enable MacVQA to balance knowledge acquisition, retention, and compositional generalization in continual VQA learning. Experiments on ten continual VQA tasks show that MacVQA outperforms existing baselines, achieving 43.38% average accuracy and 2.32% average forgetting on standard tasks, and 42.53% average accuracy and 3.60% average forgetting on novel composition tasks.
zh

[CV-59] AR-MOT: Autoregressive Multi-object Tracking

【速读】:该论文旨在解决当前多目标跟踪(Multi-Object Tracking, MOT)方法因架构僵化和任务特定性而导致的泛化能力不足问题,尤其在面对复杂或指令驱动的任务时难以扩展。现有方法通常依赖固定的输出头和定制化的跟踪流水线,限制了其适应新场景的能力。解决方案的关键在于提出一种基于大语言模型(Large Language Model, LLM)框架的自回归多目标跟踪(Autoregressive Multi-Object Tracking, AR-MOT)范式,将MOT建模为序列生成任务,从而无需任务特定输出头即可灵活生成结构化结果。其核心创新包括:1)基于预训练检测器的Object Tokenizer以增强区域级视觉感知;2)Region-Aware Alignment(RAA)模块缓解全局与局部特征间的对齐偏差;3)Temporal Memory Fusion(TMF)模块通过缓存历史目标token支持长时跟踪。此设计显著提升了系统的可扩展性,新模态或指令仅需调整输出序列格式即可集成,而无需修改模型结构。

链接: https://arxiv.org/abs/2601.01925
作者: Lianjie Jia,Yuhan Wu,Binghao Ran,Yifan Wang,Lijun Wang,Huchuan Lu
机构: Dalian University of Technology (大连理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 5 figures

点击查看摘要

Abstract:As multi-object tracking (MOT) tasks continue to evolve toward more general and multi-modal scenarios, the rigid and task-specific architectures of existing MOT methods increasingly hinder their applicability across diverse tasks and limit flexibility in adapting to new tracking formulations. Most approaches rely on fixed output heads and bespoke tracking pipelines, making them difficult to extend to more complex or instruction-driven tasks. To address these limitations, we propose AR-MOT, a novel autoregressive paradigm that formulates MOT as a sequence generation task within a large language model (LLM) framework. This design enables the model to output structured results through flexible sequence construction, without requiring any task-specific heads. To enhance region-level visual perception, we introduce an Object Tokenizer based on a pretrained detector. To mitigate the misalignment between global and regional features, we propose a Region-Aware Alignment (RAA) module, and to support long-term tracking, we design a Temporal Memory Fusion (TMF) module that caches historical object tokens. AR-MOT offers strong potential for extensibility, as new modalities or instructions can be integrated by simply modifying the output sequence format without altering the model architecture. Extensive experiments on MOT17 and DanceTrack validate the feasibility of our approach, achieving performance comparable to state-of-the-art methods while laying the foundation for more general and flexible MOT systems.
zh

[CV-60] alkPhoto: A Versatile Training-Free Conversational Assistant for Intelligent Image Editing

【速读】:该论文旨在解决现有基于指令的图像编辑方法依赖大量多指令训练数据、耗时费力且效果不佳的问题。其解决方案的关键在于提出一个无需训练的图像编辑框架TalkPhoto,通过设计特定提示模板引导开源大语言模型(Large Language Model, LLM)理解用户需求,并分层调用已有的先进图像编辑方法,实现精确的图像操作;同时采用即插即用的高效调用机制,使未见过的复杂编辑任务也能稳定集成到系统中,从而在减少Token消耗的同时提升编辑质量。

链接: https://arxiv.org/abs/2601.01915
作者: Yujie Hu,Zecheng Tang,Xu Jiang,Weiqi Li,Jian Zhang
机构: Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: a Conversational Assistant for Intelligent Image Editing

点击查看摘要

Abstract:Thanks to the powerful language comprehension capabilities of Large Language Models (LLMs), existing instruction-based image editing methods have introduced Multimodal Large Language Models (MLLMs) to promote information exchange between instructions and images, ensuring the controllability and flexibility of image editing. However, these frameworks often build a multi-instruction dataset to train the model to handle multiple editing tasks, which is not only time-consuming and labor-intensive but also fails to achieve satisfactory results. In this paper, we present TalkPhoto, a versatile training-free image editing framework that facilitates precise image manipulation through conversational interaction. We instruct the open-source LLM with a specially designed prompt template to analyze user needs after receiving instructions and hierarchically invoke existing advanced editing methods, all without additional training. Moreover, we implement a plug-and-play and efficient invocation of image editing methods, allowing complex and unseen editing tasks to be integrated into the current framework, achieving stable and high-quality editing results. Extensive experiments demonstrate that our method not only provides more accurate invocation with fewer token consumption but also achieves higher editing quality across various image editing tasks.
zh

[CV-61] Learning Action Hierarchies via Hybrid Geometric Diffusion WACV-26

【速读】:该论文旨在解决视频理解中的时序动作分割(Temporal Action Segmentation)问题,即如何准确地为视频中每一帧分配对应的动作标签。传统方法虽采用迭代精炼策略,但未能显式利用人类动作的层次结构特性。其解决方案的关键在于提出 HybridTAS 框架,通过在扩散模型的去噪过程中融合欧几里得(Euclidean)与双曲(Hyperbolic)几何空间,显式建模动作类别间的树状层次关系:高扩散步长受抽象高层动作类别(根节点)引导,低扩散步长则由细粒度动作类别(叶节点)精细修正,从而实现从粗到细的动作标签去噪过程,显著提升分割性能。

链接: https://arxiv.org/abs/2601.01914
作者: Arjun Ramesh Kaushik,Nalini K. Ratha,Venu Govindaraju
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at WACV-26

点击查看摘要

Abstract:Temporal action segmentation is a critical task in video understanding, where the goal is to assign action labels to each frame in a video. While recent advances leverage iterative refinement-based strategies, they fail to explicitly utilize the hierarchical nature of human actions. In this work, we propose HybridTAS - a novel framework that incorporates a hybrid of Euclidean and hyperbolic geometries into the denoising process of diffusion models to exploit the hierarchical structure of actions. Hyperbolic geometry naturally provides tree-like relationships between embeddings, enabling us to guide the action label denoising process in a coarse-to-fine manner: higher diffusion timesteps are influenced by abstract, high-level action categories (root nodes), while lower timesteps are refined using fine-grained action classes (leaf nodes). Extensive experiments on three benchmark datasets, GTEA, 50Salads, and Breakfast, demonstrate that our method achieves state-of-the-art performance, validating the effectiveness of hyperbolic-guided denoising for the temporal action segmentation task.
zh

[CV-62] Nodule-DETR: A Novel DETR Architecture with Frequency-Channel Attention for Ultrasound Thyroid Nodule Detection

【速读】:该论文旨在解决超声图像中甲状腺结节(thyroid nodule)检测的准确性问题,尤其针对低对比度和边界模糊等挑战。其核心解决方案是提出一种名为Nodule-DETR的检测Transformer架构,关键创新包括:多光谱频域通道注意力(Multi-Spectral Frequency-domain Channel Attention, MSFCA)模块,用于增强低对比度结节特征;分层特征融合(Hierarchical Feature Fusion, HFF)模块,实现高效多尺度信息整合;以及多尺度可变形注意力(Multi-Scale Deformable Attention, MSDA),以灵活捕捉小尺寸及不规则形态结节。实验表明,该方法在真实临床超声数据集上显著优于基线模型,在mAP@0.5:0.95指标上提升0.149,展现出良好的临床辅助诊断潜力。

链接: https://arxiv.org/abs/2601.01908
作者: Jingjing Wang,Qianglin Liu,Zhuo Xiao,Xinning Yao,Bo Liu,Lu Li,Lijuan Niu,Fugen Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Thyroid cancer is the most common endocrine malignancy, and its incidence is rising globally. While ultrasound is the preferred imaging modality for detecting thyroid nodules, its diagnostic accuracy is often limited by challenges such as low image contrast and blurred nodule boundaries. To address these issues, we propose Nodule-DETR, a novel detection transformer (DETR) architecture designed for robust thyroid nodule detection in ultrasound images. Nodule-DETR introduces three key innovations: a Multi-Spectral Frequency-domain Channel Attention (MSFCA) module that leverages frequency analysis to enhance features of low-contrast nodules; a Hierarchical Feature Fusion (HFF) module for efficient multi-scale integration; and Multi-Scale Deformable Attention (MSDA) to flexibly capture small and irregularly shaped nodules. We conducted extensive experiments on a clinical dataset of real-world thyroid ultrasound images. The results demonstrate that Nodule-DETR achieves state-of-the-art performance, outperforming the baseline model by a significant margin of 0.149 in mAP@0.5:0.95. The superior accuracy of Nodule-DETR highlights its significant potential for clinical application as an effective tool in computer-aided thyroid diagnosis. The code of work is available at this https URL.
zh

[CV-63] Forget Less by Learning from Parents Through Hierarchical Relationships AAAI-26

【速读】:该论文旨在解决定制化扩散模型(Custom Diffusion Models, CDMs)在顺序学习新概念时面临的灾难性遗忘问题。现有方法主要聚焦于减少不同概念间的干扰,却忽视了正向的跨概念交互潜力。其解决方案的关键在于提出一种名为“通过从父代学习来遗忘更少”(Forget Less by Learning from Parents, FLLP)的新框架,该框架在双曲空间(hyperbolic space)中引入父子概念学习机制,利用洛伦兹流形(Lorentzian manifold)自然建模树状层次结构的能力,将已学习概念作为指导来适应新概念,从而不仅保留先前知识,还支持新概念的持续集成,显著提升模型的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2601.01892
作者: Arjun Ramesh Kaushik,Naresh Kumar Devulapally,Vishnu Suresh Lokhande,Nalini K. Ratha,Venu Govindaraju
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at AAAI-26

点击查看摘要

Abstract:Custom Diffusion Models (CDMs) offer impressive capabilities for personalization in generative modeling, yet they remain vulnerable to catastrophic forgetting when learning new concepts sequentially. Existing approaches primarily focus on minimizing interference between concepts, often neglecting the potential for positive inter-concept interactions. In this work, we present Forget Less by Learning from Parents (FLLP), a novel framework that introduces a parent-child inter-concept learning mechanism in hyperbolic space to mitigate forgetting. By embedding concept representations within a Lorentzian manifold, naturally suited to modeling tree-like hierarchies, we define parent-child relationships in which previously learned concepts serve as guidance for adapting to new ones. Our method not only preserves prior knowledge but also supports continual integration of new concepts. We validate FLLP on three public datasets and one synthetic benchmark, showing consistent improvements in both robustness and generalization.
zh

[CV-64] Agent ic AI in Remote Sensing: Foundations Taxonomy and Emerging Systems WACV

【速读】:该论文旨在解决当前地球观测(Earth Observation)分析中静态深度学习模型在处理复杂地理空间工作流时缺乏顺序规划与主动工具协调能力的问题。其解决方案的关键在于引入自主代理型人工智能(agentic AI),通过构建统一的分类体系区分单代理协作者(single-agent copilots)与多代理系统,并深入分析规划机制、检索增强生成(retrieval-augmented generation)及记忆结构等架构基础,从而推动遥感领域从像素级精度评估向轨迹感知的推理正确性评估演进。

链接: https://arxiv.org/abs/2601.01891
作者: Niloufar Alipour Talemi,Julia Boone,Fatemeh Afghah
机构: Clemson University (克莱姆森大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026, GeoCV Workshop

点击查看摘要

Abstract:The paradigm of Earth Observation analysis is shifting from static deep learning models to autonomous agentic AI. Although recent vision foundation models and multimodal large language models advance representation learning, they often lack the sequential planning and active tool orchestration required for complex geospatial workflows. This survey presents the first comprehensive review of agentic AI in remote sensing. We introduce a unified taxonomy distinguishing between single-agent copilots and multi-agent systems while analyzing architectural foundations such as planning mechanisms, retrieval-augmented generation, and memory structures. Furthermore, we review emerging benchmarks that move the evaluation from pixel-level accuracy to trajectory-aware reasoning correctness. By critically examining limitations in grounding, safety, and orchestration, this work outlines a strategic roadmap for the development of robust, autonomous geospatial intelligence.
zh

[CV-65] CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving

【速读】:该论文旨在解决多模态大语言模型在视觉数学问题求解中面临的瓶颈问题,特别是现有方法仅关注提升视觉输入的提取与解析能力,而忽视了所提取的视觉线索是否能被准确整合并有效用于后续推理过程。解决方案的关键在于提出一种受人类认知启发的三阶段框架CogFlow,其结构严格遵循“感知 ⇒ 内化 ⇒ 推理”的层级流程:首先设计协同视觉奖励(Synergistic Visual Rewards)以增强符号与图表信息在参数空间和语义空间中的联合提取能力;其次引入知识内化奖励(Knowledge Internalization Reward)模型确保视觉线索在内化阶段被忠实整合至知识体系中,从而实现感知与推理之间的桥梁作用;最后通过视觉门控策略优化算法(Visual-Gated Policy Optimization)强化推理过程对视觉知识的依赖性,防止模型产生看似合理但缺乏视觉依据的推理链。这一系列创新共同提升了模型在视觉数学推理任务中的准确性与可解释性。

链接: https://arxiv.org/abs/2601.01874
作者: Shuhang Chen,Yunqiu Xu,Junjie Xie,Aojun Lu,Tao Feng,Zeying Huang,Ning Zhang,Yi Sun,Yi Yang,Hangjie Yuan
机构: Zhejiang University (浙江大学); Intelligent Learning; Sichuan University (四川大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite significant progress, multimodal large language models continue to struggle with visual mathematical problem solving. Some recent works recognize that visual perception is a bottleneck in visual mathematical reasoning, but their solutions are limited to improving the extraction and interpretation of visual inputs. Notably, they all ignore the key issue of whether the extracted visual cues are faithfully integrated and properly utilized in subsequent reasoning. Motivated by this, we present CogFlow, a novel cognitive-inspired three-stage framework that incorporates a knowledge internalization stage, explicitly simulating the hierarchical flow of human reasoning: perception \Rightarrow internalization \Rightarrow reasoning. Inline with this hierarchical flow, we holistically enhance all its stages. We devise Synergistic Visual Rewards to boost perception capabilities in parametric and semantic spaces, jointly improving visual information extraction from symbols and diagrams. To guarantee faithful integration of extracted visual cues into subsequent reasoning, we introduce a Knowledge Internalization Reward model in the internalization stage, bridging perception and reasoning. Moreover, we design a Visual-Gated Policy Optimization algorithm to further enforce the reasoning is grounded with the visual knowledge, preventing models seeking shortcuts that appear coherent but are visually ungrounded reasoning chains. Moreover, we contribute a new dataset MathCog for model training, which contains samples with over 120K high-quality perception-reasoning aligned annotations. Comprehensive experiments and analysis on commonly used visual mathematical reasoning benchmarks validate the superiority of the proposed CogFlow.
zh

[CV-66] Entity-Guided Multi-Task Learning for Infrared and Visible Image Fusion

【速读】:该论文旨在解决现有文本驱动的红外与可见光图像融合方法依赖句子级文本信息所导致的语义噪声问题,以及未能充分挖掘文本深层语义价值的局限性。其解决方案的关键在于提出一种名为Entity-Guided Multi-Task learning (EGMT) 的新框架,包含三个核心创新:(i) 基于大视觉语言模型生成的图像描述,提取实体级别的文本信息以消除冗余语义噪声并保留关键语义;(ii) 构建并行多任务学习架构,将图像融合任务与基于实体的多标签分类任务结合,利用实体作为伪标签提供语义监督,增强模型对图像内容的理解能力,从而提升融合图像的质量和语义密度;(iii) 设计实体引导的跨模态交互模块,实现视觉特征与实体级文本特征之间的细粒度交互,捕捉跨模态依赖关系,强化特征表示能力。

链接: https://arxiv.org/abs/2601.01870
作者: Wenyu Shao,Hongbo Liu,Yunchuan Ma,Ruili Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE Transactions on Multimedia

点击查看摘要

Abstract:Existing text-driven infrared and visible image fusion approaches often rely on textual information at the sentence level, which can lead to semantic noise from redundant text and fail to fully exploit the deeper semantic value of textual information. To address these issues, we propose a novel fusion approach named Entity-Guided Multi-Task learning for infrared and visible image fusion (EGMT). Our approach includes three key innovative components: (i) A principled method is proposed to extract entity-level textual information from image captions generated by large vision-language models, eliminating semantic noise from raw text while preserving critical semantic information; (ii) A parallel multi-task learning architecture is constructed, which integrates image fusion with a multi-label classification task. By using entities as pseudo-labels, the multi-label classification task provides semantic supervision, enabling the model to achieve a deeper understanding of image content and significantly improving the quality and semantic density of the fused image; (iii) An entity-guided cross-modal interactive module is also developed to facilitate the fine-grained interaction between visual and entity-level textual features, which enhances feature representation by capturing cross-modal dependencies at both inter-visual and visual-entity levels. To promote the wide application of the entity-guided image fusion framework, we release the entity-annotated version of four public datasets (i.e., TNO, RoadScene, M3FD, and MSRS). Extensive experiments demonstrate that EGMT achieves superior performance in preserving salient targets, texture details, and semantic consistency, compared to the state-of-the-art methods. The code and dataset will be publicly available at this https URL.
zh

[CV-67] RRNet: Configurable Real-Time Video Enhancement with Arbitrary Local Lighting Variations

【速读】:该论文旨在解决实时视频增强中速度与有效曝光控制之间的平衡难题,尤其是在不均匀光照条件下。其核心解决方案是提出一种轻量且可配置的框架RRNet(Rendering Relighting Network),通过估计一组最小虚拟光源参数,结合深度感知渲染模块实现局部重照明,无需像素级对齐训练数据即可保留人脸身份信息,并支持高分辨率实时处理。关键创新在于对象感知的建模方式和基于生成式AI的数据集构建流程,使得模型在低光增强、局部光照调整和眩光去除等方面均优于现有方法。

链接: https://arxiv.org/abs/2601.01865
作者: Wenlong Yang,Canran Jin,Weihang Yuan,Chao Wang,Lifeng Sun
机构: Microsoft(微软); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the growing demand for real-time video enhancement in live applications, existing methods often struggle to balance speed and effective exposure control, particularly under uneven lighting. We introduce RRNet (Rendering Relighting Network), a lightweight and configurable framework that achieves a state-of-the-art tradeoff between visual quality and efficiency. By estimating parameters for a minimal set of virtual light sources, RRNet enables localized relighting through a depth-aware rendering module without requiring pixel-aligned training data. This object-aware formulation preserves facial identity and supports real-time, high-resolution performance using a streamlined encoder and lightweight prediction head. To facilitate training, we propose a generative AI-based dataset creation pipeline that synthesizes diverse lighting conditions at low cost. With its interpretable lighting control and efficient architecture, RRNet is well suited for practical applications such as video conferencing, AR-based portrait enhancement, and mobile photography. Experiments show that RRNet consistently outperforms prior methods in low-light enhancement, localized illumination adjustment, and glare removal.
zh

[CV-68] GCR: Geometry-Consistent Routing for Task-Agnostic Continual Anomaly Detection

【速读】:该论文旨在解决工业视觉检测中任务无关的持续异常检测(task-agnostic continual anomaly detection)问题,即在类别不断扩展且测试时类别身份未知的场景下,如何稳定地进行异常评分与专家选择(expert selection),避免因跨类别异常分数分布差异导致的决策不稳定。其解决方案的关键在于提出一种轻量级混合专家框架GCR(Geometry-consistent Routing),通过在共享冻结的patch嵌入空间中基于最小化累积最近原型距离来实现几何一致性路由,从而将跨专家决策与单专家异常评分分离,无需端到端表示学习即可有效缓解持续性能退化问题,显著提升路由稳定性并实现近零遗忘。

链接: https://arxiv.org/abs/2601.01856
作者: Joongwon Chae,Lihui Luo,Yang Liu,Runming Wang,Dongmei Yu,Zeming Liang,Xi Yuan,Dayan Zhang,Zhenglin Chen,Peiwu Qin,Ilmoon Chae
机构: Tsinghua University Shenzhen International Graduate School (清华大学深圳国际研究生院); Ratel Soft; Affiliated Fifth Hospital, Wenzhou Medical University (温州医科大学附属第五医院); Chinese Medicine Guangdong Laboratory (广东中医药实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Feature-based anomaly detection is widely adopted in industrial inspection due to the strong representational power of large pre-trained vision encoders. While most existing methods focus on improving within-category anomaly scoring, practical deployments increasingly require task-agnostic operation under continual category expansion, where the category identity is unknown at test time. In this setting, overall performance is often dominated by expert selection, namely routing an input to an appropriate normality model before any head-specific scoring is applied. However, routing rules that compare head-specific anomaly scores across independently constructed heads are unreliable in practice, as score distributions can differ substantially across categories in scale and tail behavior. We propose GCR, a lightweight mixture-of-experts framework for stabilizing task-agnostic continual anomaly detection through geometry-consistent routing. GCR routes each test image directly in a shared frozen patch-embedding space by minimizing an accumulated nearest-prototype distance to category-specific prototype banks, and then computes anomaly maps only within the routed expert using a standard prototype-based scoring rule. By separating cross-head decision making from within-head anomaly scoring, GCR avoids cross-head score comparability issues without requiring end-to-end representation learning. Experiments on MVTec AD and VisA show that geometry-consistent routing substantially improves routing stability and mitigates continual performance collapse, achieving near-zero forgetting while maintaining competitive detection and localization performance. These results indicate that many failures previously attributed to representation forgetting can instead be explained by decision-rule instability in cross-head routing. Code is available at this https URL Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2601.01856 [cs.CV] (or arXiv:2601.01856v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2601.01856 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-69] ESGaussianFace: Emotional and Stylized Audio-Driven Facial Animation via 3D Gaussian Splatting

【速读】:该论文旨在解决当前音频驱动人脸动画研究中难以高效生成融合情感表达与风格特征的高质量说话头视频的问题。现有方法多聚焦于中性情绪下的视频生成,而对情感和风格的协同建模仍存在挑战。其解决方案的关键在于提出 ESGaussianFace 框架,利用 3D Gaussian Splatting 实现高效且三维一致的场景重建与视频渲染;引入情感音频引导的空间注意力机制,将情感特征与音频内容特征有效融合,提升不同情绪状态下的面部细节重建精度;设计两个 3D Gaussian 变形预测器,分别控制情感和风格驱动的点云变形;并采用多阶段训练策略,分步学习唇部运动、表情变化及风格特征,从而实现高保真、高效率且具三维一致性的音频驱动情感化人脸动画生成。

链接: https://arxiv.org/abs/2601.01847
作者: Chuhang Ma,Shuai Tan,Ye Pan,Jiaolong Yang,Xin Tong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 10 figures

点击查看摘要

Abstract:Most current audio-driven facial animation research primarily focuses on generating videos with neutral emotions. While some studies have addressed the generation of facial videos driven by emotional audio, efficiently generating high-quality talking head videos that integrate both emotional expressions and style features remains a significant challenge. In this paper, we propose ESGaussianFace, an innovative framework for emotional and stylized audio-driven facial animation. Our approach leverages 3D Gaussian Splatting to reconstruct 3D scenes and render videos, ensuring efficient generation of 3D consistent results. We propose an emotion-audio-guided spatial attention method that effectively integrates emotion features with audio content features. Through emotion-guided attention, the model is able to reconstruct facial details across different emotional states more accurately. To achieve emotional and stylized deformations of the 3D Gaussian points through emotion and style features, we introduce two 3D Gaussian deformation predictors. Futhermore, we propose a multi-stage training strategy, enabling the step-by-step learning of the character’s lip movements, emotional variations, and style features. Our generated results exhibit high efficiency, high quality, and 3D consistency. Extensive experimental results demonstrate that our method outperforms existing state-of-the-art techniques in terms of lip movement accuracy, expression variation, and style feature expressiveness.
zh

[CV-70] RSwinV2-MD: An Enhanced Residual SwinV2 Transformer for Monkeypox Detection from Skin Images

【速读】:该论文旨在解决猴痘(Mpox)皮损图像分类中因病灶特征变异大、与其他类似疾病(如水痘、麻疹和牛痘)区分度低而导致的诊断准确性不足问题。解决方案的关键在于提出一种定制化的残差SwinTransformer V2(Customized Residual SwinTransformerV2, RSwinV2)模型,其核心创新包括:1)基于输入图像维度与任务目标定制Transformer的层级结构,通过非重叠图像块分割结合移位窗口机制,在保持计算效率的同时增强局部区域间的全局关联性;2)引入位置嵌入与patch嵌入,并利用多头注意力机制强化Transformer的全局建模能力;3)融合逆残差块(Inverse Residual Block, IRB),通过卷积跳跃连接缓解梯度消失问题,从而同时捕捉局部细节与全局模式。实验表明,RSwinV2在Kaggle公开数据集上达到96.21%准确率和95.62% F1分数,显著优于标准CNN和原始SwinTransformer,验证了其作为猴痘皮损辅助诊断工具的有效性。

链接: https://arxiv.org/abs/2601.01835
作者: Rashid Iqbal,Saddam Hussain Khan(Artificial Intelligence Lab, Department of Computer Systems Engineering, University of Engineering and Applied Sciences (UEAS), Swat 19060, Pakistan)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 15 Pages, 7 Figures, 4 Tables

点击查看摘要

Abstract:In this paper, a deep learning approach for Mpox diagnosis named Customized Residual SwinTransformerV2 (RSwinV2) has been proposed, trying to enhance the capability of lesion classification by employing the RSwinV2 tool-assisted vision approach. In the RSwinV2 method, a hierarchical structure of the transformer has been customized based on the input dimensionality, embedding structure, and output targeted by the method. In this RSwinV2 approach, the input image has been split into non-overlapping patches and processed using shifted windows and attention in these patches. This process has helped the method link all the windows efficiently by avoiding the locality issues of non-overlapping regions in attention, while being computationally efficient. RSwinV2 has further developed based on SwinTransformer and has included patch and position embeddings to take advantage of the transformer global-linking capability by employing multi-head attention in these embeddings. Furthermore, RSwinV2 has developed and incorporated the Inverse Residual Block (IRB) into this method, which utilizes convolutional skip connections with these inclusive designs to address the vanishing gradient issues during processing. RSwinV2 inclusion of IRB has therefore facilitated this method to link global patterns as well as local patterns; hence, its integrity has helped improve lesion classification capability by minimizing variability of Mpox and increasing differences of Mpox, chickenpox, measles, and cowpox. In testing SwinV2, its accuracy of 96.21 and an F1score of 95.62 have been achieved on the Kaggle public dataset, which has outperformed standard CNN models and SwinTransformers; RSwinV2 vector has thus proved its valiance as a computer-assisted tool for Mpox lesion observation interpretation.
zh

[CV-71] DisCo-FLoc: Using Dual-Level Visual-Geometric Contrasts to Disambiguate Depth-Aware Visual Floorplan Localization

【速读】:该论文旨在解决视觉楼层平面定位(Visual Floorplan Localization, FLoc)中因简约楼层平面内重复结构导致的定位模糊问题,以及现有方法依赖昂贵且有限的语义标注所带来的适用性限制。其解决方案的关键在于提出DisCo-FLoc框架,通过双层级视觉-几何对比学习(dual-level visual-geometric contrasts)来消除深度感知的视觉FLoc歧义:首先设计了一种针对基于射线投射的FLoc任务的射线回归预测器,利用深度估计先验生成多个候选位姿;随后引入一种结合位置级和方向级约束的新型对比学习机制,严格匹配深度感知的视觉特征与楼层平面中的对应几何结构,从而有效消除定位歧义并从候选位姿中选择最优成像姿态。

链接: https://arxiv.org/abs/2601.01822
作者: Shiyong Meng,Tao Zou,Bolei Chen,Chaoxu Mu,Jianxin Wang
机构: Central South University (中南大学); Anhui University (安徽大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 4 figures

点击查看摘要

Abstract:Since floorplan data is readily available, long-term persistent, and robust to changes in visual appearance, visual Floorplan Localization (FLoc) has garnered significant attention. Existing methods either ingeniously match geometric priors or utilize sparse semantics to reduce FLoc uncertainty. However, they still suffer from ambiguous FLoc caused by repetitive structures within minimalist floorplans. Moreover, expensive but limited semantic annotations restrict their applicability. To address these issues, we propose DisCo-FLoc, which utilizes dual-level visual-geometric Contrasts to Disambiguate depth-aware visual Floc, without requiring additional semantic labels. Our solution begins with a ray regression predictor tailored for ray-casting-based FLoc, predicting a series of FLoc candidates using depth estimation expertise. In addition, a novel contrastive learning method with position-level and orientation-level constraints is proposed to strictly match depth-aware visual features with the corresponding geometric structures in the floorplan. Such matches can effectively eliminate FLoc ambiguity and select the optimal imaging pose from FLoc candidates. Exhaustive comparative studies on two standard visual Floc benchmarks demonstrate that our method outperforms the state-of-the-art semantic-based method, achieving significant improvements in both robustness and accuracy.
zh

[CV-72] Robust Egocentric Visual Attention Prediction Through Language-guided Scene Context-aware Learning

【速读】:该论文旨在解决动态第一人称场景中视觉注意力预测的挑战问题,即如何准确预测摄像头佩戴者在复杂且模糊的egocentric视频中将关注哪些区域。其解决方案的关键在于提出一种语言引导的场景上下文感知学习框架,通过设计一个上下文感知器(context perceiver)来基于语言描述总结视频内容,生成具有场景上下文信息的视频表征,并引入两种训练目标:一是强化对兴趣点区域的关注,二是抑制无关区域的干扰,从而提升模型在多样化、动态的第一人称场景下的鲁棒性和预测精度。

链接: https://arxiv.org/abs/2601.01818
作者: Sungjune Park,Hongda Mao,Qingshuang Chen,Yong Man Ro,Yelin Kim
机构: Korea Advanced Institute of Science and Technology (KAIST); Amazon
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 7 figures, 4 tables

点击查看摘要

Abstract:As the demand for analyzing egocentric videos grows, egocentric visual attention prediction, anticipating where a camera wearer will attend, has garnered increasing attention. However, it remains challenging due to the inherent complexity and ambiguity of dynamic egocentric scenes. Motivated by evidence that scene contextual information plays a crucial role in modulating human attention, in this paper, we present a language-guided scene context-aware learning framework for robust egocentric visual attention prediction. We first design a context perceiver which is guided to summarize the egocentric video based on a language-based scene description, generating context-aware video representations. We then introduce two training objectives that: 1) encourage the framework to focus on the target point-of-interest regions and 2) suppress distractions from irrelevant regions which are less likely to attract first-person attention. Extensive experiments on Ego4D and Aria Everyday Activities (AEA) datasets demonstrate the effectiveness of our approach, achieving state-of-the-art performance and enhanced robustness across diverse, dynamic egocentric scenarios.
zh

[CV-73] Adaptive Hybrid Optimizer based Framework for Lumpy Skin Disease Identification

【速读】:该论文旨在解决牛瘟(Lumpy Skin Disease, LSD)早期精准识别难题,以应对该病毒快速传播对全球畜牧业健康、经济和粮食安全构成的严重威胁。解决方案的关键在于提出一种名为LUMPNet的混合深度学习框架,其核心创新包括:基于YOLOv11的目标检测模型用于定位牛只皮肤上的结节病灶,结合采用复合缩放策略的EfficientNet作为分类器实现病变图像的准确分类(LSD感染或健康),并设计了一种新型自适应混合优化器以稳定和加速整个模型的训练过程。实验表明,该方法在公开数据集上达到了99%的训练准确率和98%的验证准确率,显著优于现有方案。

链接: https://arxiv.org/abs/2601.01807
作者: Ubaidullah,Muhammad Abid Hussain,Mohsin Raza Jafri,Rozi Khan,Moid Sandhu,Abd Ullah Khan,Hyundong Shin
机构: NUST (National University of Sciences and Technology); CSIRO (Commonwealth Scientific and Industrial Research Organisation); Queensland University of Technology; Kyung Hee University (경희대학교)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Lumpy Skin Disease (LSD) is a contagious viral infection that significantly deteriorates livestock health, thereby posing a serious threat to the global economy and food security. Owing to its rapid spread characteristics, early and precise identification is crucial to prevent outbreaks and ensure timely intervention. In this paper, we propose a hybrid deep learning-based approach called LUMPNet for the early detection of LSD. LUMPNet utilizes image data to detect and classify skin nodules – the primary indicator of LSD. To this end, LUMPNet uses YOLOv11, EfficientNet-based CNN classifier with compound scaling, and a novel adaptive hybrid optimizer. More precisely, LUMPNet detects and localizes LSD skin nodules and lesions on cattle images. It exploits EfficientNet to classify the localized cattle images into LSD-affected or healthy categories. To stabilize and accelerate the training of YOLOv11 and EfficientNet hybrid model, a novel adaptive hybrid optimizer is proposed and utilized. We evaluate LUMPNet at various stages of LSD using a publicly available dataset. Results indicate that the proposed scheme achieves 99% LSD detection training accuracy, and outperforms existing schemes. The model also achieves validation accuracy of 98%. Moreover, for further evaluation, we conduct a case study using an optimized EfficientNet-B0 model trained with the AdamW optimizer, and compare its performance with LUMPNet. The results show that LUMPNet achieves superior performance.
zh

[CV-74] Causality-Aware Temporal Projection for Video Understanding in Video-LLM s

【速读】:该论文旨在解决当前视频大语言模型(Video-LLMs)在需要时序一致性与因果连贯性的视频理解任务中表现不足的问题,尤其针对现有参数高效模型因使用无约束双向投影器(unconstrained bidirectional projectors)导致后期帧信息干扰前期表示、破坏时间顺序建模的缺陷。其解决方案的关键在于提出V-CORE框架,通过两个核心组件实现显式的时间顺序约束:一是可学习空间聚合(Learnable Spatial Aggregation, LSA),自适应选择显著空间标记以减少冗余;二是因果感知时序投影器(Causality-Aware Temporal Projector, CATP),借助块因果注意力机制和终端动态摘要标记作为因果信息汇入点,强制单向信息流动,从而在保留帧内空间交互的同时确保时序信息严格按顺序聚合。该设计有效提升了模型在时序推理与因果推理子任务上的性能(NExT-QA基准上准确率提升至61.2%,相关子类别分别提高+3.5%和+5.2%)。

链接: https://arxiv.org/abs/2601.01804
作者: Zhengjian Kang,Qi Chen,Rui Liu,Kangtong Mo,Xingyu Zhang,Xiaoyu Deng,Ye Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 4 figures

点击查看摘要

Abstract:Recent Video Large Language Models (Video-LLMs) have shown strong multimodal reasoning capabilities, yet remain challenged by video understanding tasks that require consistent temporal ordering and causal coherence. Many parameter-efficient Video-LLMs rely on unconstrained bidirectional projectors to model inter-frame interactions, which can blur temporal ordering by allowing later frames to influence earlier representations, without explicit architectural mechanisms to respect the directional nature of video reasoning. To address this limitation, we propose V-CORE, a parameter-efficient framework that introduces explicit temporal ordering constraints for video understanding. V-CORE consists of two key components: (1) Learnable Spatial Aggregation (LSA), which adaptively selects salient spatial tokens to reduce redundancy, and (2) a Causality-Aware Temporal Projector (CATP), which enforces structured unidirectional information flow via block-causal attention and a terminal dynamic summary token acting as a causal sink. This design preserves intra-frame spatial interactions while ensuring that temporal information is aggregated in a strictly ordered manner. With 4-bit QLoRA and a frozen LLM backbone, V-CORE can be trained efficiently on a single consumer GPU. Experiments show that V-CORE achieves strong performance on the challenging NExT-QA benchmark, reaching 61.2% accuracy, and remains competitive across MSVD-QA, MSRVTT-QA, and TGIF-QA, with gains concentrated in temporal and causal reasoning subcategories (+3.5% and +5.2% respectively), directly validating the importance of explicit temporal ordering constraints.
zh

[CV-75] VerLM: Explaining Face Verification Using Natural Language

【速读】:该论文旨在解决当前人脸识别系统在决策过程中缺乏透明度的问题,即模型难以解释其判断依据。解决方案的关键在于提出一种新颖的视觉-语言模型(Vision-Language Model, VLM),该模型不仅能够准确判断两张人脸图像是否属于同一人,还能提供明确的决策理由:一方面通过简洁说明总结影响判断的核心因素,另一方面通过详尽描述图像间的具体差异来增强可解释性。该模型基于跨模态迁移策略,将原本用于音频区分的先进建模方法适配至视觉输入,显著提升了准确性与可解释性,从而推动了更透明、可靠的人脸识别系统发展。

链接: https://arxiv.org/abs/2601.01798
作者: Syed Abdul Hannan,Hazim Bukhari,Thomas Cantalapiedra,Eman Ansar,Massa Baali,Rita Singh,Bhiksha Raj
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Face verification systems have seen substantial advancements; however, they often lack transparency in their decision-making processes. In this paper, we introduce an innovative Vision-Language Model (VLM) for Face Verification, which not only accurately determines if two face images depict the same individual but also explicitly explains the rationale behind its decisions. Our model is uniquely trained using two complementary explanation styles: (1) concise explanations that summarize the key factors influencing its decision, and (2) comprehensive explanations detailing the specific differences observed between the images. We adapt and enhance a state-of-the-art modeling approach originally designed for audio-based differentiation to suit visual inputs effectively. This cross-modal transfer significantly improves our model’s accuracy and interpretability. The proposed VLM integrates sophisticated feature extraction techniques with advanced reasoning capabilities, enabling clear articulation of its verification process. Our approach demonstrates superior performance, surpassing baseline methods and existing models. These findings highlight the immense potential of vision language models in face verification set up, contributing to more transparent, reliable, and explainable face verification systems.
zh

[CV-76] DDNet: A Dual-Stream Graph Learning and Disentanglement Framework for Temporal Forgery Localization

【速读】:该论文旨在解决生成式 AI (Generative AI) 伪造视频中局部篡改片段难以被准确定位的问题,传统视频级检测方法因无法捕捉全局异常而效果受限。其核心解决方案是提出一种双流图学习与解耦框架(DDNet),通过时间距离流(Temporal Distance Stream)提取局部伪影特征和语义内容流(Semantic Content Stream)建模长程关联,避免全局线索被局部平滑性掩盖;同时引入轨迹解耦与自适应机制(TDA)分离通用伪造指纹,并采用跨层级特征嵌入(CLFE)融合多层次特征构建鲁棒特征基础,显著提升在跨域场景下的定位准确性与泛化能力。

链接: https://arxiv.org/abs/2601.01784
作者: Boyang Zhao,Xin Liao,Jiaxin Chen,Xiaoshuai Wu,Yufeng Wu
机构: Hunan University (湖南大学); Changsha University of Science & Technology (长沙理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:The rapid evolution of AIGC technology enables misleading viewers by tampering mere small segments within a video, rendering video-level detection inaccurate and unpersuasive. Consequently, temporal forgery localization (TFL), which aims to precisely pinpoint tampered segments, becomes critical. However, existing methods are often constrained by \emphlocal view, failing to capture global anomalies. To address this, we propose a \underlinedual-stream graph learning and \underlinedisentanglement framework for temporal forgery localization (DDNet). By coordinating a \emphTemporal Distance Stream for local artifacts and a \emphSemantic Content Stream for long-range connections, DDNet prevents global cues from being drowned out by local smoothness. Furthermore, we introduce Trace Disentanglement and Adaptation (TDA) to isolate generic forgery fingerprints, alongside Cross-Level Feature Embedding (CLFE) to construct a robust feature foundation via deep fusion of hierarchical features. Experiments on ForgeryNet and TVIL benchmarks demonstrate that our method outperforms state-of-the-art approaches by approximately 9% in AP@0.95, with significant improvements in cross-domain robustness.
zh

[CV-77] Subimage Overlap Prediction: Task-Aligned Self-Supervised Pretraining For Semantic Segmentation In Remote Sensing Imagery WACV2026

【速读】:该论文旨在解决自监督学习(Self-supervised Learning, SSL)方法在遥感图像语义分割任务中对大规模预训练数据依赖过高的问题。现有SSL方法虽能有效提升下游任务性能,但其预训练阶段通常需要海量未标注数据,限制了实际应用中的效率与可扩展性。本文提出的关键解决方案是“子图像重叠预测”(Subimage Overlap Prediction),即通过从原始图像中提取一个子图像,并训练模型预测该子图像在原图中的位置掩码(semantic mask),从而实现更高效的特征表示学习。该方法显著减少了所需预训练数据量,在下游语义分割任务中表现出更快的收敛速度和相当或更优的mIoU性能,尤其在标签数据稀缺时优势更加明显。

链接: https://arxiv.org/abs/2601.01781
作者: Lakshay Sharma,Alex Marin
机构: Instacart(Instacart); New York University (纽约大学); Thomson Reuters (汤森路透); University of Washington (华盛顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at CV4EO Workshop at WACV 2026

点击查看摘要

Abstract:Self-supervised learning (SSL) methods have become a dominant paradigm for creating general purpose models whose capabilities can be transferred to downstream supervised learning tasks. However, most such methods rely on vast amounts of pretraining data. This work introduces Subimage Overlap Prediction, a novel self-supervised pretraining task to aid semantic segmentation in remote sensing imagery that uses significantly lesser pretraining imagery. Given an image, a sub-image is extracted and the model is trained to produce a semantic mask of the location of the extracted sub-image within the original image. We demonstrate that pretraining with this task results in significantly faster convergence, and equal or better performance (measured via mIoU) on downstream segmentation. This gap in convergence and performance widens when labeled training data is reduced. We show this across multiple architecture types, and with multiple downstream datasets. We also show that our method matches or exceeds performance while requiring significantly lesser pretraining data relative to other SSL methods. Code and model weights are provided at \hrefthis https URLthis http URL.
zh

[CV-78] CTIS-QA: Clinical Template-Informed Slide-level Question Answering for Pathology

【速读】:该论文旨在解决病理报告中结构化信息提取不一致、临床诊断知识难以有效融入视觉-语言模型训练的问题,以及现有病理图像问答(WSI-VQA)任务缺乏真实临床场景约束和高质量标注数据的局限性。解决方案的关键在于构建一个基于临床诊断模板的系统化流程:首先设计了临床病理报告模板(Clinical Pathology Report Template, CPRT),在CAP癌症协议指导下实现病理特征的标准化提取;进而基于CPRT生成两个核心资源——CTIS-Align(80k切片-描述对用于视觉-语言对齐训练)与CTIS-Bench(977张全视野数字切片图像和14,879个封闭式临床问题-答案对,强调真实诊断流程与纯视觉理解需求);最后提出CTIS-QA模型,采用双流架构模拟病理学家诊断逻辑:一路径由聚类聚合机制捕捉全局切片上下文,另一路径通过注意力引导的局部区域感知模块聚焦关键病灶,从而显著提升在多个病理图像理解任务上的性能表现。

链接: https://arxiv.org/abs/2601.01769
作者: Hao Lu,Ziniu Qian,Yifu Li,Yang Zhou,Bingzheng Wei,Yan Xu
机构: Beijing University of Aeronautics and Astronautics (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The paper has been accepted by BIBM 2025

点击查看摘要

Abstract:In this paper, we introduce a clinical diagnosis template-based pipeline to systematically collect and structure pathological information. In collaboration with pathologists and guided by the the College of American Pathologists (CAP) Cancer Protocols, we design a Clinical Pathology Report Template (CPRT) that ensures comprehensive and standardized extraction of diagnostic elements from pathology reports. We validate the effectiveness of our pipeline on TCGA-BRCA. First, we extract pathological features from reports using CPRT. These features are then used to build CTIS-Align, a dataset of 80k slide-description pairs from 804 WSIs for vision-language alignment training, and CTIS-Bench, a rigorously curated VQA benchmark comprising 977 WSIs and 14,879 question-answer pairs. CTIS-Bench emphasizes clinically grounded, closed-ended questions (e.g., tumor grade, receptor status) that reflect real diagnostic workflows, minimize non-visual reasoning, and require genuine slide understanding. We further propose CTIS-QA, a Slide-level Question Answering model, featuring a dual-stream architecture that mimics pathologists’ diagnostic approach. One stream captures global slide-level context via clustering-based feature aggregation, while the other focuses on salient local regions through attention-guided patch perception module. Extensive experiments on WSI-VQA, CTIS-Bench, and slide-level diagnostic tasks show that CTIS-QA consistently outperforms existing state-of-the-art models across multiple metrics. Code and data are available at this https URL.
zh

[CV-79] AlignDrive: Aligned Lateral-Longitudinal Planning for End-to-End Autonomous Driving

【速读】:该论文旨在解决当前端到端自动驾驶模型中横向(lateral)与纵向(longitudinal)规划解耦导致的协调失效问题,以及驱动路径(drive path)作为先验信息在纵向规划中未被充分利用所引发的静态信息冗余编码问题。解决方案的关键在于提出一种新颖的级联(cascaded)框架,通过引入路径条件化(path-conditioned)的纵向规划形式,显式地将驱动路径作为约束条件嵌入纵向决策过程;具体而言,模型不再预测完整的二维轨迹点(waypoints),而是沿驱动路径预测纵向位移(longitudinal displacements),从而简化纵向推理并增强与横向规划的耦合性。此外,还设计了一种面向规划的数据增强策略,通过模拟罕见但关键的安全事件(如车辆切入)来提升模型对碰撞规避能力的泛化性能。

链接: https://arxiv.org/abs/2601.01762
作者: Yanhao Wu,Haoyang Zhang,Fei He,Rui Wu,Congpei Qiu,Liang Gao,Wei Ke,Tong Zhang
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: underreview

点击查看摘要

Abstract:End-to-end autonomous driving has rapidly progressed, enabling joint perception and planning in complex environments. In the planning stage, state-of-the-art (SOTA) end-to-end autonomous driving models decouple planning into parallel lateral and longitudinal predictions. While effective, this parallel design can lead to i) coordination failures between the planned path and speed, and ii) underutilization of the drive path as a prior for longitudinal planning, thus redundantly encoding static information. To address this, we propose a novel cascaded framework that explicitly conditions longitudinal planning on the drive path, enabling coordinated and collision-aware lateral and longitudinal planning. Specifically, we introduce a path-conditioned formulation that explicitly incorporates the drive path into longitudinal planning. Building on this, the model predicts longitudinal displacements along the drive path rather than full 2D trajectory waypoints. This design simplifies longitudinal reasoning and more tightly couples it with lateral planning. Additionally, we introduce a planning-oriented data augmentation strategy that simulates rare safety-critical events, such as vehicle cut-ins, by adding agents and relabeling longitudinal targets to avoid collision. Evaluated on the challenging Bench2Drive benchmark, our method sets a new SOTA, achieving a driving score of 89.07 and a success rate of 73.18%, demonstrating significantly improved coordination and safety
zh

[CV-80] MANGO:Natural Multi-speaker 3D Talking Head Generation via 2D-Lifted Enhancement

【速读】:该论文旨在解决当前音频驱动的3D头部生成方法在多说话者场景下难以实现自然、双向交互的问题,特别是如何在说话与倾听状态之间实现流畅过渡,以及现有方法依赖误差较大的伪3D标签导致无法捕捉精细面部动态的问题。解决方案的关键在于提出一个两阶段框架MANGO:第一阶段采用基于扩散的Transformer模型结合双音频交互模块,从多说话者音频中建模自然的3D运动;第二阶段利用快速3D高斯渲染器生成高保真图像,并通过交替训练提供2D级光度监督以优化3D运动,从而在无需伪3D标签的情况下实现更贴近真实对话行为的高精度、高可控性音频驱动说话头生成。

链接: https://arxiv.org/abs/2601.01749
作者: Lei Zhu,Lijian Lin,Ye Zhu,Jiahao Wu,Xuehan Hou,Yu Li,Yunfei Liu,Jie Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 11i figures

点击查看摘要

Abstract:Current audio-driven 3D head generation methods mainly focus on single-speaker scenarios, lacking natural, bidirectional listen-and-speak interaction. Achieving seamless conversational behavior, where speaking and listening states transition fluidly remains a key challenge. Existing 3D conversational avatar approaches rely on error-prone pseudo-3D labels that fail to capture fine-grained facial dynamics. To address these limitations, we introduce a novel two-stage framework MANGO, which leveraging pure image-level supervision by alternately training to mitigate the noise introduced by pseudo-3D labels, thereby achieving better alignment with real-world conversational behaviors. Specifically, in the first stage, a diffusion-based transformer with a dual-audio interaction module models natural 3D motion from multi-speaker audio. In the second stage, we use a fast 3D Gaussian Renderer to generate high-fidelity images and provide 2D-level photometric supervision for the 3D motions through alternate training. Additionally, we introduce MANGO-Dialog, a high-quality dataset with over 50 hours of aligned 2D-3D conversational data across 500+ identities. Extensive experiments demonstrate that our method achieves exceptional accuracy and realism in modeling two-person 3D dialogue motion, significantly advancing the fidelity and controllability of audio-driven talking heads.
zh

[CV-81] Crafting Adversarial Inputs for Large Vision-Language Models Using Black-Box Optimization EACL

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在面对黑盒越狱攻击(black-box jailbreak attack)时的安全性脆弱问题,即攻击者通过设计微小扰动绕过模型的安全机制并诱导有害输出。现有白盒攻击方法因需完全访问模型结构、计算成本高且迁移能力弱,在实际应用中受限。本文提出基于零阶优化的黑盒攻击方案——使用同时扰动随机近似(Zeroth-Order Optimization with Simultaneous Perturbation Stochastic Approximation, ZO-SPSA),其关键在于:(i) 无需模型梯度信息,仅通过输入输出交互实现梯度近似;(ii) 不依赖替代模型的通用优化策略;(iii) 显著降低资源消耗,尤其减少GPU内存占用。实验表明,该方法在InstructBLIP等LVLM上实现了高达83.0%的越狱成功率,并生成具有强迁移性的对抗样本(如MiniGPT-4生成样本对其他模型的攻击成功率ASR达64.18%),揭示了当前LVLM安全机制的严重缺陷。

链接: https://arxiv.org/abs/2601.01747
作者: Jiwei Guan,Haibo Jin,Haohan Wang
机构: Macquarie University (麦考瑞大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: EACL

点击查看摘要

Abstract:Recent advancements in Large Vision-Language Models (LVLMs) have shown groundbreaking capabilities across diverse multimodal tasks. However, these models remain vulnerable to adversarial jailbreak attacks, where adversaries craft subtle perturbations to bypass safety mechanisms and trigger harmful outputs. Existing white-box attacks methods require full model accessibility, suffer from computing costs and exhibit insufficient adversarial transferability, making them impractical for real-world, black-box settings. To address these limitations, we propose a black-box jailbreak attack on LVLMs via Zeroth-Order optimization using Simultaneous Perturbation Stochastic Approximation (ZO-SPSA). ZO-SPSA provides three key advantages: (i) gradient-free approximation by input-output interactions without requiring model knowledge, (ii) model-agnostic optimization without the surrogate model and (iii) lower resource requirements with reduced GPU memory consumption. We evaluate ZO-SPSA on three LVLMs, including InstructBLIP, LLaVA and MiniGPT-4, achieving the highest jailbreak success rate of 83.0% on InstructBLIP, while maintaining imperceptible perturbations comparable to white-box methods. Moreover, adversarial examples generated from MiniGPT-4 exhibit strong transferability to other LVLMs, with ASR reaching 64.18%. These findings underscore the real-world feasibility of black-box jailbreaks and expose critical weaknesses in the safety mechanisms of current LVLMs
zh

[CV-82] Point-SRA: Self-Representation Alignment for 3D Representation Learning AAAI2026

【速读】:该论文旨在解决现有3D表示学习方法中因固定掩码比例导致的多层级表征关联性与内在几何结构忽略问题,以及基于点级重建假设与点云多样性不匹配的问题。其解决方案的关键在于提出一种名为Point-SRA的方法,通过自蒸馏(self-distillation)和概率建模对齐表示:首先在Masked Autoencoder (MAE)层面采用不同掩码比例以捕获互补的几何与语义信息;其次引入MeanFlow Transformer (MFT),利用跨模态条件嵌入实现多样化的概率重建,并发现MFT中不同时间步的表示同样具有互补性,从而设计了双层自表示对齐机制(Dual Self-Representation Alignment);最后构建Flow-Conditioned Fine-Tuning Architecture以充分挖掘由MeanFlow学习到的点云分布特性,显著提升了下游任务性能。

链接: https://arxiv.org/abs/2601.01746
作者: Lintong Wei,Jian Lu,Haozhe Cheng,Jihua Zhu,Kaibing Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This is an AAAI 2026 accepted paper titled “Point-SRA: Self-Representation Alignment for 3D Representation Learning”, spanning 13 pages in total. The submission includes 7 figures (fig1 to fig7) that visually support the technical analysis

点击查看摘要

Abstract:Masked autoencoders (MAE) have become a dominant paradigm in 3D representation learning, setting new performance benchmarks across various downstream tasks. Existing methods with fixed mask ratio neglect multi-level representational correlations and intrinsic geometric structures, while relying on point-wise reconstruction assumptions that conflict with the diversity of point cloud. To address these issues, we propose a 3D representation learning method, termed Point-SRA, which aligns representations through self-distillation and probabilistic modeling. Specifically, we assign different masking ratios to the MAE to capture complementary geometric and semantic information, while the MeanFlow Transformer (MFT) leverages cross-modal conditional embeddings to enable diverse probabilistic reconstruction. Our analysis further reveals that representations at different time steps in MFT also exhibit complementarity. Therefore, a Dual Self-Representation Alignment mechanism is proposed at both the MAE and MFT levels. Finally, we design a Flow-Conditioned Fine-Tuning Architecture to fully exploit the point cloud distribution learned via MeanFlow. Point-SRA outperforms Point-MAE by 5.37% on ScanObjectNN. On intracranial aneurysm segmentation, it reaches 96.07% mean IoU for arteries and 86.87% for aneurysms. For 3D object detection, Point-SRA achieves 47.3% AP@50, surpassing MaskPoint by 5.12%.
zh

[CV-83] FFP-300K: Scaling First-Frame Propagation for Generalizable Video Editing

【速读】:该论文旨在解决第一帧传播(First-Frame Propagation, FFP)在可控视频编辑中因依赖运行时指导而导致的实用性受限问题。其根本原因在于现有训练数据集普遍存在时长过短、分辨率低以及任务多样性不足,难以学习鲁棒的时序先验知识。解决方案的关键在于构建一个大规模高质量数据集FFP-300K(包含30万对720p分辨率、81帧长度的视频对),并通过双轨策略实现局部与全局编辑的多样性;在此基础上提出无需引导的FFP框架,其核心创新包括:引入自适应时空RoPE(Adaptive Spatio-Temporal RoPE, AST-RoPE)以动态重映射位置编码,解耦外观与运动参考;采用自蒸馏机制,利用身份传播任务作为正则项,保障长期时序稳定性并抑制语义漂移。实验表明,该方法在EditVerseBench基准上显著优于现有学术及商业模型,PickScore和VLM得分提升约0.2和0.3。

链接: https://arxiv.org/abs/2601.01720
作者: Xijie Huang,Chengming Xu,Donghao Luo,Xiaobin Hu,Peng Tang,Xu Peng,Jiangning Zhang,Chengjie Wang,Yanwei Fu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:First-Frame Propagation (FFP) offers a promising paradigm for controllable video editing, but existing methods are hampered by a reliance on cumbersome run-time guidance. We identify the root cause of this limitation as the inadequacy of current training datasets, which are often too short, low-resolution, and lack the task diversity required to teach robust temporal priors. To address this foundational data gap, we first introduce FFP-300K, a new large-scale dataset comprising 300K high-fidelity video pairs at 720p resolution and 81 frames in length, constructed via a principled two-track pipeline for diverse local and global edits. Building on this dataset, we propose a novel framework designed for true guidance-free FFP that resolves the critical tension between maintaining first-frame appearance and preserving source video motion. Architecturally, we introduce Adaptive Spatio-Temporal RoPE (AST-RoPE), which dynamically remaps positional encodings to disentangle appearance and motion references. At the objective level, we employ a self-distillation strategy where an identity propagation task acts as a powerful regularizer, ensuring long-term temporal stability and preventing semantic drift. Comprehensive experiments on the EditVerseBench benchmark demonstrate that our method significantly outperforming existing academic and commercial models by receiving about 0.2 PickScore and 0.3 VLM score improvement against these competitors.
zh

[CV-84] Real-Time Lane Detection via Efficient Feature Alignment and Covariance Optimization for Low-Power Embedded Systems

【速读】:该论文旨在解决嵌入式系统中实时车道线检测面临的挑战,即在RGB图像中视觉信号微弱且稀疏的情况下,如何在计算资源和功耗受限的条件下提升检测精度。现有基于分割、基于锚点和基于曲线的深度学习模型虽已广泛应用,但缺乏针对低功耗嵌入式环境的通用优化技术。解决方案的关键在于提出一种名为协方差分布优化(Covariance Distribution Optimization, CDO)的模块,该模块通过使车道特征分布更贴近真实标签,显著提升检测准确性,同时不增加计算复杂度;CDO模块易于集成至现有模型而无需结构改动,并可复用已有参数进行持续训练,从而在性能、能效和部署灵活性方面实现显著改进。

链接: https://arxiv.org/abs/2601.01696
作者: Yian Liu,Xiong Wang,Ping Xu,Lei Zhu,Ming Yan,Linyun Xue
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Real-time lane detection in embedded systems encounters significant challenges due to subtle and sparse visual signals in RGB images, often constrained by limited computational resources and power consumption. Although deep learning models for lane detection categorized into segmentation-based, anchor-based, and curve-based methods there remains a scarcity of universally applicable optimization techniques tailored for low-power embedded environments. To overcome this, we propose an innovative Covariance Distribution Optimization (CDO) module specifically designed for efficient, real-time applications. The CDO module aligns lane feature distributions closely with ground-truth labels, significantly enhancing detection accuracy without increasing computational complexity. Evaluations were conducted on six diverse models across all three method categories, including two optimized for real-time applications and four state-of-the-art (SOTA) models, tested comprehensively on three major datasets: CULane, TuSimple, and LLAMAS. Experimental results demonstrate accuracy improvements ranging from 0.01% to 1.5%. The proposed CDO module is characterized by ease of integration into existing systems without structural modifications and utilizes existing model parameters to facilitate ongoing training, thus offering substantial benefits in performance, power efficiency, and operational flexibility in embedded systems.
zh

[CV-85] Learnability-Driven Submodular Optimization for Active Roadside 3D Detection CVPR2026

【速读】:该论文旨在解决路边单目3D目标检测中因缺乏车辆侧数据而导致的标注困难与模型性能受限问题,尤其针对“固有模糊样本”(inherently ambiguous samples)——即仅从路边视角难以准确标注的远距离、模糊或遮挡物体。这类样本在无车辆-路边帧配对的情况下无法可靠标注,导致标注资源浪费且影响模型学习效果。解决方案的关键在于提出一种以可学性驱动的主动学习框架LH3D,其核心思想是同时筛选出信息量丰富且可被可靠标注的场景,通过抑制固有模糊样本并确保多样性覆盖,从而在显著降低标注预算(仅需25%)的前提下实现接近全量标注的模型性能(车辆、行人、自行车分别达到86.06%、67.32%、78.67%的全性能),验证了可学性(learnability)相较于不确定性(uncertainty)对路边3D感知更为关键。

链接: https://arxiv.org/abs/2601.01695
作者: Ruiyu Mao,Baoming Zhang,Nicholas Ruozzi,Yunhui Guo
机构: The University of Texas at Dallas (德克萨斯大学达拉斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 7 figures. Submitted to CVPR 2026

点击查看摘要

Abstract:Roadside perception datasets are typically constructed via cooperative labeling between synchronized vehicle and roadside frame pairs. However, real deployment often requires annotation of roadside-only data due to hardware and privacy constraints. Even human experts struggle to produce accurate labels without vehicle-side data (image, LIDAR), which not only increases annotation difficulty and cost, but also reveals a fundamental learnability problem: many roadside-only scenes contain distant, blurred, or occluded objects whose 3D properties are ambiguous from a single view and can only be reliably annotated by cross-checking paired vehicle–roadside frames. We refer to such cases as inherently ambiguous samples. To reduce wasted annotation effort on inherently ambiguous samples while still obtaining high-performing models, we turn to active learning. This work focuses on active learning for roadside monocular 3D object detection and proposes a learnability-driven framework that selects scenes which are both informative and reliably labelable, suppressing inherently ambiguous samples while ensuring coverage. Experiments demonstrate that our method, LH3D, achieves 86.06%, 67.32%, and 78.67% of full-performance for vehicles, pedestrians, and cyclists respectively, using only 25% of the annotation budget on DAIR-V2X-I, significantly outperforming uncertainty-based baselines. This confirms that learnability, not uncertainty, matters for roadside 3D perception.
zh

[CV-86] Mitigating Longitudinal Performance Degradation in Child Face Recognition Using Synthetic Data

【速读】:该论文旨在解决儿童面部识别中因快速且非线性的面部生长导致的模板漂移(template drift)问题,进而引发随时间推移验证错误率上升的挑战。解决方案的关键在于利用生成式 AI(Generative AI)合成的人脸数据作为纵向稳定器,通过在训练阶段引入 StyleGAN2 ADA 生成的合成人脸样本,并结合后处理过滤步骤以减少身份泄露和伪影影响,从而提升模型在长时间跨度下的身份一致性与识别鲁棒性。实验表明,相较于预训练基线和仅使用真实数据微调的方法,合成数据增强的微调策略显著降低了不同注册验证间隔(6–36个月)下的错误率。

链接: https://arxiv.org/abs/2601.01689
作者: Afzal Hossain,Stephanie Schuckers
机构: Clarkson University (克拉克森大学); University of North Carolina at Charlotte (北卡罗来纳大学夏洛特分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Longitudinal face recognition in children remains challenging due to rapid and nonlinear facial growth, which causes template drift and increasing verification errors over time. This work investigates whether synthetic face data can act as a longitudinal stabilizer by improving temporal robustness of child face recognition models. Using an identity disjoint protocol on the Young Face Aging (YFA) dataset, we evaluate three settings: (i) pretrained MagFace embeddings without dataset specific fine-tuning, (ii) MagFace fine-tuned using authentic training faces only, and (iii) MagFace fine-tuned using a combination of authentic and synthetically generated training faces. Synthetic data is generated using StyleGAN2 ADA and incorporated exclusively within the training identities; a post generation filtering step is applied to mitigate identity leakage and remove artifact affected samples. Experimental results across enrollment verification gaps from 6 to 36 months show that synthetic-augmented fine tuning substantially reduces error rates relative to both the pretrained baseline and real only fine tuning. These findings provide a risk aware assessment of synthetic augmentation for improving identity persistence in pediatric face recognition.
zh

[CV-87] FALCON: Few-Shot Adversarial Learning for Cross-Domain Medical Image Segmentation

【速读】:该论文旨在解决3D医学图像分割中因标注数据稀缺、患者个体差异大、数据隐私问题以及计算资源消耗高等因素导致的临床可用性受限问题。其解决方案的关键在于提出FALCON框架,该框架通过将3D体积数据以2D切片形式处理,结合元学习(meta-learning)在自然图像上预训练获得可迁移的分割先验,并通过对抗微调和边界感知学习实现向医疗领域的有效迁移;同时引入任务感知推理机制,基于支持样本提示动态适应不同患者的解剖变异,从而在极少标注数据下实现高精度边界分割(以Hausdorff Distance衡量),且计算开销显著降低。

链接: https://arxiv.org/abs/2601.01687
作者: Abdur R. Fayjie,Pankhi Kashyap,Jutika Borah,Patrick Vandewalle
机构: KU Leuven (鲁汶大学); IIT-Bombay (印度理工学院孟买分校); Tezpur University (特兹普尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 20 pages, 6 figures, 7 tables

点击查看摘要

Abstract:Precise delineation of anatomical and pathological structures within 3D medical volumes is crucial for accurate diagnosis, effective surgical planning, and longitudinal disease monitoring. Despite advancements in AI, clinically viable segmentation is often hindered by the scarcity of 3D annotations, patient-specific variability, data privacy concerns, and substantial computational overhead. In this work, we propose FALCON, a cross-domain few-shot segmentation framework that achieves high-precision 3D volume segmentation by processing data as 2D slices. The framework is first meta-trained on natural images to learn-to-learn generalizable segmentation priors, then transferred to the medical domain via adversarial fine-tuning and boundary-aware learning. Task-aware inference, conditioned on support cues, allows FALCON to adapt dynamically to patient-specific anatomical variations across slices. Experiments on four benchmarks demonstrate that FALCON consistently achieves the lowest Hausdorff Distance scores, indicating superior boundary accuracy while maintaining a Dice Similarity Coefficient comparable to the state-of-the-art models. Notably, these results are achieved with significantly less labeled data, no data augmentation, and substantially lower computational overhead.
zh

[CV-88] Evaluating Deep Learning-Based Face Recognition for Infants and Toddlers: Impact of Age Across Developmental Stages

【速读】:该论文旨在解决婴幼儿面部识别(Face Recognition)在实际应用中面临的三大核心挑战:快速的面部形态变化、类间相似性高以及标注数据稀缺。针对这些问题,研究者构建了一个为期24个月、覆盖0至3岁儿童的纵向数据集,并评估了四种主流深度学习人脸验证模型(FaceNet、ArcFace、MagFace 和 CosFace)在不同发育阶段的表现。关键发现是:婴儿期(0–6个月)识别准确率极低(TAR仅为30.7% @ 0.1% FAR),而随着年龄增长性能显著提升(2.5–3岁组达64.7%)。进一步分析表明,时间间隔越短,嵌入特征漂移(embedding drift)越小,识别准确率越高。为缓解这一漂移问题,研究引入域对抗神经网络(Domain Adversarial Neural Network, DANN)进行特征对齐,使模型输出更具时序稳定性与泛化能力,最终将真接受率(TAR)提升超过12%。此方案为构建适用于智慧城市场景下长期可靠运行的儿童生物特征认证系统提供了重要技术路径。

链接: https://arxiv.org/abs/2601.01680
作者: Afzal Hossain,Mst Rumana Sumi,Stephanie Schuckers
机构: Clarkson University (克拉克森大学); University of North Carolina at Charlotte (北卡罗来纳大学夏洛特分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted and presented at IEEE IJCB 2025 conference; final published version forthcoming

点击查看摘要

Abstract:Face recognition for infants and toddlers presents unique challenges due to rapid facial morphology changes, high inter-class similarity, and limited dataset availability. This study evaluates the performance of four deep learning-based face recognition models FaceNet, ArcFace, MagFace, and CosFace on a newly developed longitudinal dataset collected over a 24 month period in seven sessions involving children aged 0 to 3 years. Our analysis examines recognition accuracy across developmental stages, showing that the True Accept Rate (TAR) is only 30.7% at 0.1% False Accept Rate (FAR) for infants aged 0 to 6 months, due to unstable facial features. Performance improves significantly in older children, reaching 64.7% TAR at 0.1% FAR in the 2.5 to 3 year age group. We also evaluate verification performance over different time intervals, revealing that shorter time gaps result in higher accuracy due to reduced embedding drift. To mitigate this drift, we apply a Domain Adversarial Neural Network (DANN) approach that improves TAR by over 12%, yielding features that are more temporally stable and generalizable. These findings are critical for building biometric systems that function reliably over time in smart city applications such as public healthcare, child safety, and digital identity services. The challenges observed in early age groups highlight the importance of future research on privacy preserving biometric authentication systems that can address temporal variability, particularly in secure and regulated urban environments where child verification is essential.
zh

[CV-89] rustworthy Data-Driven Wildfire Risk Prediction and Understanding in Western Canada

【速读】:该论文旨在解决西部加拿大野火风险预测中因点火与蔓延的内在随机性及燃料条件、气象、气候变率、地形和人类活动等多因素非线性交互作用所导致的预测可靠性与可解释性不足的问题。其解决方案的关键在于提出一种基于长序列多尺度时间建模的可信数据驱动框架,该框架在整合异质驱动因子的同时,显式量化预测不确定性并支持过程层面的解释;通过SHAP分析揭示了温度相关因子在两年中均主导野火风险,而湿度相关约束在2024年对空间和土地覆盖特异性差异的影响更强,体现出模型对复杂环境响应机制的解析能力。

链接: https://arxiv.org/abs/2601.01677
作者: Zhengsen Xu,Lanying Wang,Sibo Cheng,Xue Rui,Kyle Gao,Yimin Zhu,Mabel Heffring,Zack Dewis,Saeid Taleghanidoozdoozan,Megan Greenwood,Motasem Alkayid,Quinn Ledingham,Hongjie He,Jonathan Li,Lincoln Linlin Xu
机构: University of Calgary (卡尔加里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In recent decades, the intensification of wildfire activity in western Canada has resulted in substantial socio-economic and environmental losses. Accurate wildfire risk prediction is hindered by the intrinsic stochasticity of ignition and spread and by nonlinear interactions among fuel conditions, meteorology, climate variability, topography, and human activities, challenging the reliability and interpretability of purely data-driven models. We propose a trustworthy data-driven wildfire risk prediction framework based on long-sequence, multi-scale temporal modeling, which integrates heterogeneous drivers while explicitly quantifying predictive uncertainty and enabling process-level interpretation. Evaluated over western Canada during the record-breaking 2023 and 2024 fire seasons, the proposed model outperforms existing time-series approaches, achieving an F1 score of 0.90 and a PR-AUC of 0.98 with low computational cost. Uncertainty-aware analysis reveals structured spatial and seasonal patterns in predictive confidence, highlighting increased uncertainty associated with ambiguous predictions and spatiotemporal decision boundaries. SHAP-based interpretation provides mechanistic understanding of wildfire controls, showing that temperature-related drivers dominate wildfire risk in both years, while moisture-related constraints play a stronger role in shaping spatial and land-cover-specific contrasts in 2024 compared to the widespread hot and dry conditions of 2023. Data and code are available at this https URL.
zh

[CV-90] LabelAny3D: Label Any Object 3D in the Wild NEURIPS2025

【速读】:该论文旨在解决单目图像中3D目标检测在真实场景(in-the-wild)下性能受限的问题,其核心挑战在于缺乏高质量的3D标注数据以及标注过程的高成本。解决方案的关键在于提出LabelAny3D——一个基于“分析-合成”(analysis-by-synthesis)框架的方法,通过从2D图像重建完整3D场景来高效生成高质量的3D边界框标注。该方法构建了COCO3D这一新的开放词汇(open-vocabulary)单目3D检测基准,覆盖了现有3D数据集中缺失的多样化物体类别。实验表明,LabelAny3D生成的标注显著提升多个基准上的检测性能,优于以往自动标注方法,在真实世界开放场景中展现出基础模型驱动标注的巨大潜力。

链接: https://arxiv.org/abs/2601.01676
作者: Jin Yao,Radowan Mahmud Redoy,Sebastian Elbaum,Matthew B. Dwyer,Zezhou Cheng
机构: University of Virginia (弗吉尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2025. Project page: this https URL

点击查看摘要

Abstract:Detecting objects in 3D space from monocular input is crucial for applications ranging from robotics to scene understanding. Despite advanced performance in the indoor and autonomous driving domains, existing monocular 3D detection models struggle with in-the-wild images due to the lack of 3D in-the-wild datasets and the challenges of 3D annotation. We introduce LabelAny3D, an \emphanalysis-by-synthesis framework that reconstructs holistic 3D scenes from 2D images to efficiently produce high-quality 3D bounding box annotations. Built on this pipeline, we present COCO3D, a new benchmark for open-vocabulary monocular 3D detection, derived from the MS-COCO dataset and covering a wide range of object categories absent from existing 3D datasets. Experiments show that annotations generated by LabelAny3D improve monocular 3D detection performance across multiple benchmarks, outperforming prior auto-labeling approaches in quality. These results demonstrate the promise of foundation-model-driven annotation for scaling up 3D recognition in realistic, open-world settings.
zh

[CV-91] Animated 3DGS Avatars in Diverse Scenes with Consistent Lighting and Shadows

【速读】:该论文旨在解决动画3D高斯溅射(3D Gaussian Splatting, 3DGS)虚拟角色与3DGS场景或动态物体交互时的光照和阴影一致性问题。现有方法在处理此类场景时往往依赖于网格化表示,难以保持体积渲染的连续性和真实感。其解决方案的关键在于提出深度高斯阴影图(Deep Gaussian Shadow Maps, DGSM),这是一种针对3DGS体素表示的现代阴影映射算法。DGSM利用3DGS中沿光线的光通量可闭合求解的特性,无需显式网格化即可计算体积阴影:通过在同心径向壳层上预计算透射率并存储于八面体贴图(octahedral atlas)中,现代GPU可实时采样以衰减受影响的高斯点,从而实现一致的阴影投射与接收;同时,为实现移动角色的再光照(relighting),采用球谐(spherical harmonic, SH)基底表示的HDRI探针近似局部环境光照,并应用快速逐高斯辐射传输(radiance transfer),避免了复杂的BRDF估计或离线优化,整体方案完全在3DGS体积表示下运行,确保阴影与光照的一致性。

链接: https://arxiv.org/abs/2601.01660
作者: Aymen Mir,Riza Alp Guler,Jian Wang,Gerard Pons-Moll,Bing Zhou
机构: Tübingen AI Center, University of Tübingen, Germany; Snap Inc.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Our project page is available at this https URL

点击查看摘要

Abstract:We present a method for consistent lighting and shadows when animated 3D Gaussian Splatting (3DGS) avatars interact with 3DGS scenes or with dynamic objects inserted into otherwise static scenes. Our key contribution is Deep Gaussian Shadow Maps (DGSM), a modern analogue of the classical shadow mapping algorithm tailored to the volumetric 3DGS representation. Building on the classic deep shadow mapping idea, we show that 3DGS admits closed form light accumulation along light rays, enabling volumetric shadow computation without meshing. For each estimated light, we tabulate transmittance over concentric radial shells and store them in octahedral atlases, which modern GPUs can sample in real time per query to attenuate affected scene Gaussians and thus cast and receive shadows consistently. To relight moving avatars, we approximate the local environment illumination with HDRI probes represented in a spherical harmonic (SH) basis and apply a fast per Gaussian radiance transfer, avoiding explicit BRDF estimation or offline optimization. We demonstrate environment consistent lighting for avatars from AvatarX and ActorsHQ, composited into ScanNet++, DL3DV, and SuperSplat scenes, and show interactions with inserted objects. Across single and multi avatar settings, DGSM and SH relighting operate fully in the volumetric 3DGS representation, yielding coherent shadows and relighting while avoiding meshing.
zh

[CV-92] An Empirical Study of Monocular Human Body Measurement Under Weak Calibration

【速读】:该论文旨在解决从单目RGB图像中估计人体测量参数的难题,其核心挑战包括尺度模糊性(scale ambiguity)、视角敏感性以及缺乏显式深度信息。解决方案的关键在于系统性地评估三种弱标定的单目策略:基于关键点几何的方法、基于姿态驱动的回归方法和基于物体校准的轮廓方法,在半约束条件下使用消费级相机进行实证分析。研究重点不在于追求最高的精度,而是揭示不同标定假设如何影响测量行为、鲁棒性及失败模式,从而为轻量级单目人体测量系统的部署提供实证设计参考。

链接: https://arxiv.org/abs/2601.01639
作者: Gaurav Sekar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The paper consists of 8 pages, 2 figures (on pages 4 and 7), and 2 tables (both on page 6)

点击查看摘要

Abstract:Estimating human body measurements from monocular RGB imagery remains challenging due to scale ambiguity, viewpoint sensitivity, and the absence of explicit depth information. This work presents a systematic empirical study of three weakly calibrated monocular strategies: landmark-based geometry, pose-driven regression, and object-calibrated silhouettes, evaluated under semi-constrained conditions using consumer-grade cameras. Rather than pursuing state-of-the-art accuracy, the study analyzes how differing calibration assumptions influence measurement behavior, robustness, and failure modes across varied body types. The results reveal a clear trade-off between user effort during calibration and the stability of resulting circumferential quantities. This paper serves as an empirical design reference for lightweight monocular human measurement systems intended for deployment on consumer devices.
zh

[CV-93] CAP-IQA: Context-Aware Prompt-Guided CT Image Quality Assessment

【速读】:该论文旨在解决基于提示(prompt-based)方法在CT图像质量评估(Image Quality Assessment, IQA)中因引入理想化先验知识而导致的偏差问题,尤其是在真实世界降质场景(如噪声、运动伪影或扫描仪差异)下,传统文本提示难以准确反映图像实际质量。解决方案的关键在于提出Context-Aware Prompt-guided Image Quality Assessment (CAP-IQA)框架,其核心创新包括:1)融合文本级先验与实例级上下文提示,实现语义与感知表示对齐;2)采用因果去偏(causal debiasing)技术分离理想化知识与图像特定降质因素;3)结合CNN视觉编码器与领域专用文本编码器,提升对诊断可见性、解剖清晰度和噪声感知的评估准确性。该方法在2023 LDCTIQA挑战赛中显著优于现有最优模型,并在儿科CT大规模数据集上验证了其泛化能力。

链接: https://arxiv.org/abs/2601.01613
作者: Kazi Ramisa Rifa,Jie Zhang,Abdullah Imran
机构: University of Kentucky (肯塔基大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 9 figures, 5 tables

点击查看摘要

Abstract:Prompt-based methods, which encode medical priors through descriptive text, have been only minimally explored for CT Image Quality Assessment (IQA). While such prompts can embed prior knowledge about diagnostic quality, they often introduce bias by reflecting idealized definitions that may not hold under real-world degradations such as noise, motion artifacts, or scanner variability. To address this, we propose the Context-Aware Prompt-guided Image Quality Assessment (CAP-IQA) framework, which integrates text-level priors with instance-level context prompts and applies causal debiasing to separate idealized knowledge from factual, image-specific degradations. Our framework combines a CNN-based visual encoder with a domain-specific text encoder to assess diagnostic visibility, anatomical clarity, and noise perception in abdominal CT images. The model leverages radiology-style prompts and context-aware fusion to align semantic and perceptual representations. On the 2023 LDCTIQA challenge benchmark, CAP-IQA achieves an overall correlation score of 2.8590 (sum of PLCC, SROCC, and KROCC), surpassing the top-ranked leaderboard team (2.7427) by 4.24%. Moreover, our comprehensive ablation experiments confirm that prompt-guided fusion and the simplified encoder-only design jointly enhance feature alignment and interpretability. Furthermore, evaluation on an in-house dataset of 91,514 pediatric CT images demonstrates the true generalizability of CAP-IQA in assessing perceptual fidelity in a different patient population.
zh

[CV-94] Guiding Token-Sparse Diffusion Models

【速读】:该论文旨在解决稀疏训练扩散模型在推理阶段性能不佳的问题,尤其是其对无分类器引导(Classifier-free Guidance, CFG)响应不足导致的生成质量下降。解决方案的关键在于提出稀疏引导(Sparse Guidance, SG),该方法摒弃了传统的条件丢弃(conditional dropout)作为引导信号的方式,转而利用**令牌级别稀疏性(token-level sparsity)**来更好地保留条件预测的高方差特性,从而在推理时实现高质量且高变异性输出。通过在推理阶段引入令牌级稀疏性,SG 在降低计算开销的同时显著提升图像保真度,例如在 ImageNet-256 基准上以 25% 更少的浮点运算次数(FLOPs)达到 1.58 FID,并在保持基线质量前提下实现最高达 58% 的 FLOP 节省。

链接: https://arxiv.org/abs/2601.01608
作者: Felix Krause,Stefan Andreas Baumann,Johannes Schusterbauer,Olga Grebenkova,Ming Gui,Vincent Tao Hu,Björn Ommer
机构: CompVis @ LMU Munich (CompVis @ 慕尼黑大学); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models deliver high quality in image synthesis but remain expensive during training and inference. Recent works have leveraged the inherent redundancy in visual content to make training more affordable by training only on a subset of visual information. While these methods were successful in providing cheaper and more effective training, sparsely trained diffusion models struggle in inference. This is due to their lacking response to Classifier-free Guidance (CFG) leading to underwhelming performance during inference. To overcome this, we propose Sparse Guidance (SG). Instead of using conditional dropout as a signal to guide diffusion models, SG uses token-level sparsity. As a result, SG preserves the high-variance of the conditional prediction better, achieving good quality and high variance outputs. Leveraging token-level sparsity at inference, SG improves fidelity at lower compute, achieving 1.58 FID on the commonly used ImageNet-256 benchmark with 25% fewer FLOPs, and yields up to 58% FLOP savings at matched baseline quality. To demonstrate the effectiveness of Sparse Guidance, we train a 2.5B text-to-image diffusion model using training time sparsity and leverage SG during inference. SG achieves improvements in composition and human preference score while increasing throughput at the same time.
zh

[CV-95] Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation

【速读】:该论文旨在解决自动少样本字体生成(Few-shot Font Generation, FFG)中存在的两大核心问题:一是现有基于自回归(Autoregressive, AR)模型的方法受限于传统的局部补丁级标记化方式,难以捕捉全局依赖关系,导致字体结构一致性与风格保真度不足;二是当前FFG方法多局限于图像到图像的范式,忽视了语言在传达字体设计意图中的关键作用。解决方案的关键在于提出GAR-Font框架,其创新性地引入了一个全局感知标记器(global-aware tokenizer),能够同时建模局部笔画结构与全局风格模式;设计了一个多模态风格编码器,通过轻量级的语言-风格适配器实现灵活的文本引导风格控制,无需复杂的多模态预训练;并构建后处理精修流水线以进一步提升结构准确性和风格一致性。

链接: https://arxiv.org/abs/2601.01593
作者: Haonan Cai,Yuxuan Luo,Zhouhui Lian
机构: Wangxuan Institute of Computer Technology, Peking University (北京大学王选计算机研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 25 pages

点击查看摘要

Abstract:Manual font design is an intricate process that transforms a stylistic visual concept into a coherent glyph set. This challenge persists in automated Few-shot Font Generation (FFG), where models often struggle to preserve both the structural integrity and stylistic fidelity from limited references. While autoregressive (AR) models have demonstrated impressive generative capabilities, their application to FFG is constrained by conventional patch-level tokenization, which neglects global dependencies crucial for coherent font synthesis. Moreover, existing FFG methods remain within the image-to-image paradigm, relying solely on visual references and overlooking the role of language in conveying stylistic intent during font design. To address these limitations, we propose GAR-Font, a novel AR framework for multimodal few-shot font generation. GAR-Font introduces a global-aware tokenizer that effectively captures both local structures and global stylistic patterns, a multimodal style encoder offering flexible style control through a lightweight language-style adapter without requiring intensive multimodal pretraining, and a post-refinement pipeline that further enhances structural fidelity and style coherence. Extensive experiments show that GAR-Font outperforms existing FFG methods, excelling in maintaining global style faithfulness and achieving higher-quality results with textual stylistic guidance.
zh

[CV-96] OpenRT: An Open-Source Red Teaming Framework for Multimodal LLM s

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在实际应用中因持续存在的安全漏洞而面临评估困难的问题。现有红队测试基准存在碎片化、局限于单轮文本交互且缺乏可扩展性等局限,难以支撑系统化的安全性评估。其解决方案的关键在于提出OpenRT——一个统一、模块化且高吞吐量的红队框架,通过引入对抗内核(adversarial kernel),实现模型集成、数据管理、攻击策略、判断方法和评估指标五个核心维度的解耦与标准化,从而将对抗逻辑与异步高吞吐运行时分离,支持跨多种模型的系统性扩展。该框架整合了37种多样化的攻击方法,并在20个先进模型上验证了其有效性,揭示了前沿模型在不同攻击范式下泛化能力不足的问题。

链接: https://arxiv.org/abs/2601.01592
作者: Xin Wang,Yunhao Chen,Juncheng Li,Yixu Wang,Yang Yao,Tianle Gu,Jie Li,Yan Teng,Xingjun Ma,Yingchun Wang,Xia Hu
机构: 未知
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid integration of Multimodal Large Language Models (MLLMs) into critical applications is increasingly hindered by persistent safety vulnerabilities. However, existing red-teaming benchmarks are often fragmented, limited to single-turn text interactions, and lack the scalability required for systematic evaluation. To address this, we introduce OpenRT, a unified, modular, and high-throughput red-teaming framework designed for comprehensive MLLM safety evaluation. At its core, OpenRT architects a paradigm shift in automated red-teaming by introducing an adversarial kernel that enables modular separation across five critical dimensions: model integration, dataset management, attack strategies, judging methods, and evaluation metrics. By standardizing attack interfaces, it decouples adversarial logic from a high-throughput asynchronous runtime, enabling systematic scaling across diverse models. Our framework integrates 37 diverse attack methodologies, spanning white-box gradients, multi-modal perturbations, and sophisticated multi-agent evolutionary strategies. Through an extensive empirical study on 20 advanced models (including GPT-5.2, Claude 4.5, and Gemini 3 Pro), we expose critical safety gaps: even frontier models fail to generalize across attack paradigms, with leading models exhibiting average Attack Success Rates as high as 49.14%. Notably, our findings reveal that reasoning models do not inherently possess superior robustness against complex, multi-turn jailbreaks. By open-sourcing OpenRT, we provide a sustainable, extensible, and continuously maintained infrastructure that accelerates the development and standardization of AI safety.
zh

[CV-97] MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning

【速读】:该论文旨在解决联合音视频生成中细粒度声学控制不足的问题,特别是身份保持型语音的生成难题。现有方法要么因级联生成导致时间错位,要么无法在统一框架内实现零样本语音克隆。其解决方案的关键在于提出MM-Sonate框架,该框架采用统一的指令-音素输入以实现严格的语言与时间对齐;引入音色注入机制,有效将说话人身份与语言内容解耦,从而支持零样本语音克隆;并设计基于噪声的负向条件策略,利用自然噪声先验提升声学保真度,显著优于传统无分类器引导方法。

链接: https://arxiv.org/abs/2601.01568
作者: Chunyu Qiang,Jun Wang,Xiaopeng Wang,Kang Yin,Yuxin Guo,Xijuan Zeng,Nan Li,Zihan Li,Yuzhe Liang,Ziyu Zhang,Teng Ma,Yushen Chen,Zhongliang Liu,Feng Deng,Chen Zhang,Pengfei Wan
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Joint audio-video generation aims to synthesize synchronized multisensory content, yet current unified models struggle with fine-grained acoustic control, particularly for identity-preserving speech. Existing approaches either suffer from temporal misalignment due to cascaded generation or lack the capability to perform zero-shot voice cloning within a joint synthesis framework. In this work, we present MM-Sonate, a multimodal flow-matching framework that unifies controllable audio-video joint generation with zero-shot voice cloning capabilities. Unlike prior works that rely on coarse semantic descriptions, MM-Sonate utilizes a unified instruction-phoneme input to enforce strict linguistic and temporal alignment. To enable zero-shot voice cloning, we introduce a timbre injection mechanism that effectively decouples speaker identity from linguistic content. Furthermore, addressing the limitations of standard classifier-free guidance in multimodal settings, we propose a noise-based negative conditioning strategy that utilizes natural noise priors to significantly enhance acoustic fidelity. Empirical evaluations demonstrate that MM-Sonate establishes new state-of-the-art performance in joint generation benchmarks, significantly outperforming baselines in lip synchronization and speech intelligibility, while achieving voice cloning fidelity comparable to specialized Text-to-Speech systems.
zh

[CV-98] EscherVerse: An Open World Benchmark and Dataset for Teleo-Spatial Intelligence with Physical-Dynamic and Intent-Driven Understanding

【速读】:该论文旨在解决当前空间动态推理研究中忽视人类意图的问题,即现有方法通常仅关注物理对象间的交互规律,而缺乏对人类行为目的性的建模。其解决方案的关键在于提出一种新的范式——Teleo-Spatial Intelligence(TSI),该范式统一了两个核心维度:物理动态推理(Physical-Dynamic Reasoning)与意图驱动推理(Intent-Driven Reasoning),从而实现从被动描述场景到主动理解人类行为目的的跃迁。为推动该方向研究,作者构建了EscherVerse,包含大规模开放世界基准(Escher-Bench)、数据集(Escher-35k)及模型系列(Escher series),首次系统性地评估模型在动态、以人为中心场景中对物体恒常性、状态转换和轨迹预测的能力,并特别强调将物理事件与人类意图关联的能力。

链接: https://arxiv.org/abs/2601.01547
作者: Tianjun Gu,Chenghua Gong,Jingyu Gong,Zhizhong Zhang,Yuan Xie,Lizhuang Ma,Xin Tan
机构: East China Normal University (华东师范大学); Shanghai AI Lab (上海人工智能实验室); Shanghai Innovation Institute (上海创新研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The ability to reason about spatial dynamics is a cornerstone of intelligence, yet current research overlooks the human intent behind spatial changes. To address these limitations, we introduce Teleo-Spatial Intelligence (TSI), a new paradigm that unifies two critical pillars: Physical-Dynamic Reasoning–understanding the physical principles of object interactions–and Intent-Driven Reasoning–inferring the human goals behind these actions. To catalyze research in TSI, we present EscherVerse, consisting of a large-scale, open-world benchmark (Escher-Bench), a dataset (Escher-35k), and models (Escher series). Derived from real-world videos, EscherVerse moves beyond constrained settings to explicitly evaluate an agent’s ability to reason about object permanence, state transitions, and trajectory prediction in dynamic, human-centric scenarios. Crucially, it is the first benchmark to systematically assess Intent-Driven Reasoning, challenging models to connect physical events to their underlying human purposes. Our work, including a novel data curation pipeline, provides a foundational resource to advance spatial intelligence from passive scene description toward a holistic, purpose-driven understanding of the world.
zh

[CV-99] FAR-AMTN: Attention Multi-Task Network for Face Attribute Recognition

【速读】:该论文旨在解决多任务网络(Multi-Task Network, MTN)在人脸属性识别(Face Attribute Recognition, FAR)中因传统共享结构导致参数爆炸及高层特征交互受限的问题,从而影响模型泛化性能。其解决方案的关键在于提出一种新型注意力多任务网络(FAR-AMTN),核心创新包括:1)引入权重共享的组特定注意力模块(Weight-Shared Group-Specific Attention, WSGSA),通过共享参数降低模型复杂度并增强组内特征表示;2)设计跨组特征融合模块(Cross-Group Feature Fusion, CGFF),促进不同属性组之间的特征交互,挖掘语义关联;3)采用动态权重策略(Dynamic Weighting Strategy, DWS)实现多任务同步收敛,提升训练稳定性与精度。

链接: https://arxiv.org/abs/2601.01537
作者: Gong Gao,Zekai Wang,Xianhui Liu,Weidong Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 8figures

点击查看摘要

Abstract:To enhance the generalization performance of Multi-Task Networks (MTN) in Face Attribute Recognition (FAR), it is crucial to share relevant information across multiple related prediction tasks effectively. Traditional MTN methods create shared low-level modules and distinct high-level modules, causing an exponential increase in model parameters with the addition of tasks. This approach also limits feature interaction at the high level, hindering the exploration of semantic relations among attributes, thereby affecting generalization negatively. In response, this study introduces FAR-AMTN, a novel Attention Multi-Task Network for FAR. It incorporates a Weight-Shared Group-Specific Attention (WSGSA) module with shared parameters to minimize complexity while improving group feature representation. Furthermore, a Cross-Group Feature Fusion (CGFF) module is utilized to foster interactions between attribute groups, enhancing feature learning. A Dynamic Weighting Strategy (DWS) is also introduced for synchronized task convergence. Experiments on the CelebA and LFWA datasets demonstrate that the proposed FAR-AMTN demonstrates superior accuracy with significantly fewer parameters compared to existing models.
zh

[CV-100] Improving Flexible Image Tokenizers for Autoregressive Image Generation

【速读】:该论文旨在解决柔性图像分词器(flexible image tokenizer)在生成式 AI(Generative AI)任务中因尾部截断策略导致的信息集中于早期 token 的问题,从而限制了自回归(AutoRegressive, AR)图像生成的性能。其核心解决方案是提出 ReToK,包含两个关键设计:一是冗余令牌填充(Redundant Token Padding),通过提高尾部 token 的激活频率来缓解信息过度集中在前序 token 的现象;二是分层语义正则化(Hierarchical Semantic Regularization),将早期 token 的解码特征与预训练视觉基础模型对齐,并逐步降低向尾部的正则化强度,以支持更精细的低级细节重建。

链接: https://arxiv.org/abs/2601.01535
作者: Zixuan Fu,Lanqing Guo,Chong Wang,Binbin Song,Ding Liu,Bihan Wen
机构: Nanyang Technological University (南洋理工大学); The University of Texas at Austin (德克萨斯大学奥斯汀分校); Harbin Institute of Technology (哈尔滨工业大学); Meta AI (Meta AI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Flexible image tokenizers aim to represent an image using an ordered 1D variable-length token sequence. This flexible tokenization is typically achieved through nested dropout, where a portion of trailing tokens is randomly truncated during training, and the image is reconstructed using the remaining preceding sequence. However, this tail-truncation strategy inherently concentrates the image information in the early tokens, limiting the effectiveness of downstream AutoRegressive (AR) image generation as the token length increases. To overcome these limitations, we propose \textbfReToK, a flexible tokenizer with \underlineRedundant \underlineToken Padding and Hierarchical Semantic Regularization, designed to fully exploit all tokens for enhanced latent modeling. Specifically, we introduce \textbfRedundant Token Padding to activate tail tokens more frequently, thereby alleviating information over-concentration in the early tokens. In addition, we apply \textbfHierarchical Semantic Regularization to align the decoding features of earlier tokens with those from a pre-trained vision foundation model, while progressively reducing the regularization strength toward the tail to allow finer low-level detail reconstruction. Extensive experiments demonstrate the effectiveness of ReTok: on ImageNet 256 \times 256, our method achieves superior generation performance compared with both flexible and fixed-length tokenizers. Code will be available at: \hrefthis https URLthis https URL
zh

[CV-101] DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving

【速读】:该论文旨在解决当前生成式驾驶世界模型(generative driving world models)缺乏统一、严谨评估基准的问题,以推动模型在视觉真实性、轨迹合理性、时间一致性及可控性等方面的进步。现有评估方法存在诸多局限:通用视频指标忽略安全关键因素,轨迹合理性难以量化,时空一致性和对自车条件的可控性常被忽视,且数据集多样性不足。解决方案的关键在于提出首个综合性基准 DrivingGen,其核心包括两部分:一是从真实驾驶数据和互联网规模视频中构建多样化评估数据集,覆盖多天气、时段、地理区域及复杂驾驶场景;二是设计一套联合评估指标体系,系统衡量视觉逼真度、轨迹合理性、时间连贯性和可控性。通过 benchmarking 14 个前沿模型,揭示了通用模型与专用模型之间的权衡关系,为可部署、可控制的驾驶世界模型研发提供了标准化评估框架。

链接: https://arxiv.org/abs/2601.01528
作者: Yang Zhou,Hao Shao,Letian Wang,Zhuofan Zong,Hongsheng Li,Steven L. Waslander
机构: University of Toronto (多伦多大学); CUHK MMLab (香港中文大学多媒体实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 10 pages, 4 figures; Project Website: this https URL

点击查看摘要

Abstract:Video generation models, as one form of world models, have emerged as one of the most exciting frontiers in AI, promising agents the ability to imagine the future by modeling the temporal evolution of complex scenes. In autonomous driving, this vision gives rise to driving world models: generative simulators that imagine ego and agent futures, enabling scalable simulation, safe testing of corner cases, and rich synthetic data generation. Yet, despite fast-growing research activity, the field lacks a rigorous benchmark to measure progress and guide priorities. Existing evaluations remain limited: generic video metrics overlook safety-critical imaging factors; trajectory plausibility is rarely quantified; temporal and agent-level consistency is neglected; and controllability with respect to ego conditioning is ignored. Moreover, current datasets fail to cover the diversity of conditions required for real-world deployment. To address these gaps, we present DrivingGen, the first comprehensive benchmark for generative driving world models. DrivingGen combines a diverse evaluation dataset curated from both driving datasets and internet-scale video sources, spanning varied weather, time of day, geographic regions, and complex maneuvers, with a suite of new metrics that jointly assess visual realism, trajectory plausibility, temporal coherence, and controllability. Benchmarking 14 state-of-the-art models reveals clear trade-offs: general models look better but break physics, while driving-specific ones capture motion realistically but lag in visual quality. DrivingGen offers a unified evaluation framework to foster reliable, controllable, and deployable driving world models, enabling scalable simulation, planning, and data-driven decision-making.
zh

[CV-102] BARE: Towards Bias-Aware and Reasoning -Enhanced One-Tower Visual Grounding

【速读】:该论文针对视觉接地(Visual Grounding, VG)任务中现有单塔架构方法存在的两个核心问题展开研究:一是多模态表征过度纠缠导致的模态偏差加剧,二是语义推理能力不足限制了对指代表达线索的理解。解决方案的关键在于提出BARE框架,通过三个创新模块协同优化:(i)语言显著性调制器(language salience modulator)保留语言模态特异性特征,(ii)视觉偏差校正(visual bias correction)减少视觉模态干扰,以及(iii)指代关系增强模块(referential relationship enhancement)构建更精准的指代语义,从而有效缓解多模态干扰并提升指代理解能力。实验表明,BARE在五个基准数据集上达到SOTA性能且计算效率优于现有方法。

链接: https://arxiv.org/abs/2601.01526
作者: Hongbing Li,Linhui Xiao,Zihan Zhao,Qi Shen,Yixiang Huang,Bo Xiao,Zhanyu Ma
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Pengcheng Laboratory (鹏城实验室); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual Grounding (VG), which aims to locate a specific region referred to by expressions, is a fundamental yet challenging task in the multimodal understanding fields. While recent grounding transfer works have advanced the field through one-tower architectures, they still suffer from two primary limitations: (1) over-entangled multimodal representations that exacerbate deceptive modality biases, and (2) insufficient semantic reasoning that hinders the comprehension of referential cues. In this paper, we propose BARE, a bias-aware and reasoning-enhanced framework for one-tower visual grounding. BARE introduces a mechanism that preserves modality-specific features and constructs referential semantics through three novel modules: (i) language salience modulator, (ii) visual bias correction and (iii) referential relationship enhancement, which jointly mitigate multimodal distractions and enhance referential comprehension. Extensive experimental results on five benchmarks demonstrate that BARE not only achieves state-of-the-art performance but also delivers superior computational efficiency compared to existing approaches. The code is publicly accessible at this https URL.
zh

[CV-103] FastV-RAG : Towards Fast and Fine-Grained Video QA with Retrieval-Augmented Generation

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在执行视觉推理任务时,难以有效整合外部知识的问题,以及现有检索增强生成(Retrieval-Augmented Generation, RAG)方法效率低下且答案质量不稳定的问题。其解决方案的关键在于提出 VideoSpeculateRAG 框架,包含两个核心创新:一是引入推测解码(speculative decoding)流水线,利用轻量级草稿模型快速生成多个候选答案,并由高精度重型模型进行验证与优化,从而显著降低推理延迟而不牺牲准确性;二是识别出检索知识中实体识别错误是主要误差来源,并通过一种基于相似度的过滤策略提升实体对齐效果,进而提高整体回答准确率。实验表明,该框架在保持或超越标准 RAG 准确性的同时,将推理速度提升约 2 倍。

链接: https://arxiv.org/abs/2601.01513
作者: Gen Li,Peiyu Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) excel at visual reasoning but still struggle with integrating external knowledge. Retrieval-Augmented Generation (RAG) is a promising solution, but current methods remain inefficient and often fail to maintain high answer quality. To address these challenges, we propose VideoSpeculateRAG, an efficient VLM-based RAG framework built on two key ideas. First, we introduce a speculative decoding pipeline: a lightweight draft model quickly generates multiple answer candidates, which are then verified and refined by a more accurate heavyweight model, substantially reducing inference latency without sacrificing correctness. Second, we identify a major source of error - incorrect entity recognition in retrieved knowledge - and mitigate it with a simple yet effective similarity-based filtering strategy that improves entity alignment and boosts overall answer accuracy. Experiments demonstrate that VideoSpeculateRAG achieves comparable or higher accuracy than standard RAG approaches while accelerating inference by approximately 2x. Our framework highlights the potential of combining speculative decoding with retrieval-augmented reasoning to enhance efficiency and reliability in complex, knowledge-intensive multimodal tasks.
zh

[CV-104] A Novel Deep Learning Method for Segmenting the Left Ventricle in Cardiac Cine MRI

【速读】:该论文旨在解决短轴位 cine MRI 扫描中左心室(left ventricle)语义分割的精度问题,尤其针对传统卷积神经网络(CNN)在捕捉上下文信息方面的不足。解决方案的关键在于提出一种基于组批归一化(group-batch normalization)的 U-Net 架构——GBU-Net,其通过改进的下采样路径提取特征并增强上采样路径恢复细节的能力,显著提升了对心脏影像中复杂结构的 contextual understanding(上下文理解),从而在 Dice 系数和平均垂直距离等指标上优于现有方法,在 SunnyBrook 测试集上达到 97% 的 Dice 分数。

链接: https://arxiv.org/abs/2601.01512
作者: Wenhui Chu,Aobo Jin,Hardik A. Gohel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:This research aims to develop a novel deep learning network, GBU-Net, utilizing a group-batch-normalized U-Net framework, specifically designed for the precise semantic segmentation of the left ventricle in short-axis cine MRI scans. The methodology includes a down-sampling pathway for feature extraction and an up-sampling pathway for detail restoration, enhanced for medical imaging. Key modifications include techniques for better contextual understanding crucial in cardiac MRI segmentation. The dataset consists of 805 left ventricular MRI scans from 45 patients, with comparative analysis using established metrics such as the dice coefficient and mean perpendicular distance. GBU-Net significantly improves the accuracy of left ventricle segmentation in cine MRI scans. Its innovative design outperforms existing methods in tests, surpassing standard metrics like the dice coefficient and mean perpendicular distance. The approach is unique in its ability to capture contextual information, often missed in traditional CNN-based segmentation. An ensemble of the GBU-Net attains a 97% dice score on the SunnyBrook testing dataset. GBU-Net offers enhanced precision and contextual understanding in left ventricle segmentation for surgical robotics and medical analysis.
zh

[CV-105] DiffKD-DCIS: Predicting Upgrade of Ductal Carcinoma In Situ with Diffusion Augmentation and Knowledge Distillation

【速读】:该论文旨在解决乳腺导管原位癌(ductal carcinoma in situ, DCIS)向浸润性导管癌(invasive ductal carcinoma, IDC)升级预测中因超声数据有限和深度学习模型泛化能力差而导致的临床决策困难问题。解决方案的关键在于提出DiffKD-DCIS框架,该框架融合条件扩散建模与教师-学生知识蒸馏机制:首先利用条件扩散模型在多模态条件下生成高质量合成超声图像以增强数据;其次通过深层教师网络从原始与合成数据中提取鲁棒特征;最后采用轻量级学生网络通过知识蒸馏学习教师网络的知识,在保证模型泛化性能的同时显著提升推理效率。

链接: https://arxiv.org/abs/2601.01507
作者: Tao Li,Qing Li,Na Li,Hui Xie
机构: Xiangnan University (湘南大学); Macao Polytechnic University (澳门理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurately predicting the upgrade of ductal carcinoma in situ (DCIS) to invasive ductal carcinoma (IDC) is crucial for surgical planning. However, traditional deep learning methods face challenges due to limited ultrasound data and poor generalization ability. This study proposes the DiffKD-DCIS framework, integrating conditional diffusion modeling with teacher-student knowledge distillation. The framework operates in three stages: First, a conditional diffusion model generates high-fidelity ultrasound images using multimodal conditions for data augmentation. Then, a deep teacher network extracts robust features from both original and synthetic data. Finally, a compact student network learns from the teacher via knowledge distillation, balancing generalization and computational efficiency. Evaluated on a multi-center dataset of 1,435 cases, the synthetic images were of good quality. The student network had fewer parameters and faster inference. On external test sets, it outperformed partial combinations, and its accuracy was comparable to senior radiologists and superior to junior ones, showing significant clinical potential. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2601.01507 [cs.CV] (or arXiv:2601.01507v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2601.01507 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-106] DeepInv: A Novel Self-supervised Learning Approach for Fast and Accurate Diffusion Inversion

【速读】:该论文旨在解决扩散模型中图像到噪声的逆向映射(diffusion inversion)问题,该任务在可控图像编辑中至关重要,但目前仍面临缺乏有效监督信号的挑战,导致现有方法多依赖近似解法,性能或效率受限。解决方案的关键在于提出一种新颖的自监督扩散逆向方法——Deep Inversion(DeepInv),其核心创新包括:1)设计了一种自监督目标函数,结合数据增强策略,无需真实噪声标注即可生成高质量伪噪声;2)构建了迭代式与多尺度训练机制,用于训练一个参数化的可学习逆向求解器(inversion solver),实现快速且准确的图像到噪声映射。此方法首次将可训练求解器引入扩散逆向任务,显著提升了性能与推理速度。

链接: https://arxiv.org/abs/2601.01487
作者: Ziyue Zhang,Luxi Lin,Xiaolin Hu,Chao Chang,HuaiXi Wang,Yiyi Zhou,Rongrong Ji
机构: Xiamen University (厦门大学); National University of Defense Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion inversion is a task of recovering the noise of an image in a diffusion model, which is vital for controllable diffusion image editing. At present, diffusion inversion still remains a challenging task due to the lack of viable supervision signals. Thus, most existing methods resort to approximation-based solutions, which however are often at the cost of performance or efficiency. To remedy these shortcomings, we propose a novel self-supervised diffusion inversion approach in this paper, termed Deep Inversion (DeepInv). Instead of requiring ground-truth noise annotations, we introduce a self-supervised objective as well as a data augmentation strategy to generate high-quality pseudo noises from real images without manual intervention. Based on these two innovative designs, DeepInv is also equipped with an iterative and multi-scale training regime to train a parameterized inversion solver, thereby achieving the fast and accurate image-to-noise mapping. To the best of our knowledge, this is the first attempt of presenting a trainable solver to predict inversion noise step by step. The extensive experiments show that our DeepInv can achieve much better performance and inference speed than the compared methods, e.g., +40.435% SSIM than EasyInv and +9887.5% speed than ReNoise on COCO dataset. Moreover, our careful designs of trainable solvers can also provide insights to the community. Codes and model parameters will be released in this https URL.
zh

[CV-107] Higher-Order Domain Generalization in Magnetic Resonance-Based Assessment of Alzheimers Disease

【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s disease, AD)诊断中基于结构磁共振成像(structural magnetic resonance imaging, sMRI)的深度学习模型在跨不同数据集时性能下降的问题,即由于扫描仪、成像协议和患者人群差异导致的领域偏移(domain shift)。现有方法虽在特征提取方面取得进展,但单领域泛化(single-domain generalization, SDG)仍缺乏系统研究。解决方案的关键在于提出Extended MixStyle(EM)框架,通过混合高阶特征矩(偏度和峰度)来模拟多样化的分布变化,从而增强模型对异构环境的鲁棒性;实验表明,EM在三个未见队列上平均提升宏F1分数2.4个百分点,显著优于当前最先进的SDG基准方法。

链接: https://arxiv.org/abs/2601.01485
作者: Zobia Batool,Diala Lteif,Vijaya B. Kolachalama,Huseyin Ozkan,Erchan Aptoula
机构: Sabanci University (萨班哲大学); Boston University (波士顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite progress in deep learning for Alzheimer’s disease (AD) diagnostics, models trained on structural magnetic resonance imaging (sMRI) often do not perform well when applied to new cohorts due to domain shifts from varying scanners, protocols and patient demographics. AD, the primary driver of dementia, manifests through progressive cognitive and neuroanatomical changes like atrophy and ventricular expansion, making robust, generalizable classification essential for real-world use. While convolutional neural networks and transformers have advanced feature extraction via attention and fusion techniques, single-domain generalization (SDG) remains underexplored yet critical, given the fragmented nature of AD datasets. To bridge this gap, we introduce Extended MixStyle (EM), a framework for blending higher-order feature moments (skewness and kurtosis) to mimic diverse distributional variations. Trained on sMRI data from the National Alzheimer’s Coordinating Center (NACC; n=4,647) to differentiate persons with normal cognition (NC) from those with mild cognitive impairment (MCI) or AD and tested on three unseen cohorts (total n=3,126), EM yields enhanced cross-domain performance, improving macro-F1 on average by 2.4 percentage points over state-of-the-art SDG benchmarks, underscoring its promise for invariant, reliable AD detection in heterogeneous real-world settings. The source code will be made available upon acceptance at this https URL.
zh

[CV-108] Unified Generation and Self-Verification for Vision-Language Models via Advantage Decoupled Preference Optimization

【速读】:该论文旨在解决并行测试时缩放(parallel test-time scaling)方法中生成模型与验证模型分别训练所导致的高训练和推理成本问题。其核心解决方案是提出优势解耦偏好优化(Advantage Decoupled Preference Optimization, ADPO),这是一种统一的强化学习框架,能够在单一策略中联合学习答案生成与自验证能力。ADPO的关键创新在于:一是引入偏好验证奖励(preference verification reward),通过正负样本的平均验证分数作为决策阈值,当预测正确性与答案正确性一致时提供正向反馈,从而提升验证能力;二是设计解耦优化机制,分别计算生成与验证的优势(advantage),利用token掩码隔离梯度,并融合掩码后的GRPO目标函数,在保持生成质量的同时校准验证得分。该方法在多个基准上实现了显著性能提升,同时大幅降低推理时间。

链接: https://arxiv.org/abs/2601.01483
作者: Xinyu Qiu,Heng Jia,Zhengwen Zeng,Shuheng Shen,Changhua Meng,Yi Yang,Linchao Zhu
机构: College of Computer Science and Technology, Zhejiang University (浙江大学计算机科学与技术学院); Venus Team, Ant Group (蚂蚁集团 Venus 团队)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Parallel test-time scaling typically trains separate generation and verification models, incurring high training and inference costs. We propose Advantage Decoupled Preference Optimization (ADPO), a unified reinforcement learning framework that jointly learns answer generation and self-verification within a single policy. ADPO introduces two innovations: a preference verification reward improving verification capability and a decoupled optimization mechanism enabling synergistic optimization of generation and verification. Specifically, the preference verification reward computes mean verification scores from positive and negative samples as decision thresholds, providing positive feedback when prediction correctness aligns with answer correctness. Meanwhile, the advantage decoupled optimization computes separate advantages for generation and verification, applies token masks to isolate gradients, and combines masked GRPO objectives, preserving generation quality while calibrating verification scores. ADPO achieves up to +34.1% higher verification AUC and -53.5% lower inference time, with significant gains of +2.8%/+1.4% accuracy on MathVista/MMMU, +1.9 cIoU on ReasonSeg, and +1.7%/+1.0% step success rate on AndroidControl/GUI Odyssey.
zh

[CV-109] Robust Ship Detection and Tracking Using Modified ViBe and Backwash Cancellation Algorithm

【速读】:该论文旨在解决海岸视频序列中船舶的实时检测与跟踪问题,尤其针对复杂动态环境(如自然海浪干扰、光照变化)下传统方法易丢失目标、背景更新慢等挑战。其解决方案的关键在于对ViBe(Video Background Subtraction Algorithm based on the ViBe model)算法进行改进,提升其在复杂海况下的鲁棒性,并结合船舶几何特性及亮度失真模型提出一种新的尾流(backwash)消除方法,从而实现高精度、实时的船舶检测与跟踪性能。

链接: https://arxiv.org/abs/2601.01481
作者: Mohammad Hassan Saghafi,Seyed Majid Noorhosseini,Seyed Abolfazl Seyed Javadein,Hadi Khalili
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we propose a robust real time detection and tracking method for detecting ships in a coastal video sequences. Since coastal scenarios are unpredictable and scenes have dynamic properties it is essential to apply detection methods that are robust to these conditions. This paper presents modified ViBe for moving object detection which detects ships and backwash. In the modified ViBe the probability of losing ships is decreased in comparison with the original ViBe. It is robust to natural sea waves and variation of lights and is capable of quickly updating the background. Based on geometrical properties of ship and some concepts such as brightness distortion, a new method for backwash cancellation is proposed. Experimental results demonstrate that the proposed strategy and methods have outstanding performance in ship detection and tracking. These results also illustrate real time and precise performance of the proposed strategy.
zh

[CV-110] Domain Adaptation of Carotid Ultrasound Images using Generative Adversarial Network

【速读】:该论文旨在解决医学超声图像在不同设备或参数设置下因纹理特征和混响噪声差异导致的域偏移问题,即训练数据与测试数据分布不一致的问题,这会显著降低深度学习模型的泛化性能。解决方案的关键在于提出一种基于生成对抗网络(Generative Adversarial Network, GAN)的图像到图像翻译方法,通过修改源域图像的纹理模式并去除混响噪声,使其在保持图像内容不变的前提下,与目标域图像的统计特性对齐,从而实现有效的域适应。实验表明,该方法在两个包含三个不同域的颈动脉超声图像数据集上均优于CycleGAN等对比方法,在直方图相关性和巴氏距离等指标上表现更优。

链接: https://arxiv.org/abs/2601.01460
作者: Mohd Usama,Belal Ahmad,Christer Gronlund,Faleh Menawer R Althiyabi
机构: Umea University (瑞典乌梅å大学); National Taipei University of Technology (台湾国立台北科技大学); King Fahd University of Petroleum and Minerals (沙特国王法赫德石油矿产大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 9 figures, 4 tables

点击查看摘要

Abstract:Deep learning has been extensively used in medical imaging applications, assuming that the test and training datasets belong to the same probability distribution. However, a common challenge arises when working with medical images generated by different systems or even the same system with different parameter settings. Such images contain diverse textures and reverberation noise that violate the aforementioned assumption. Consequently, models trained on data from one device or setting often struggle to perform effectively with data from other devices or settings. In addition, retraining models for each specific device or setting is labor-intensive and costly. To address these issues in ultrasound images, we propose a novel Generative Adversarial Network (GAN)-based model. We formulated the domain adaptation tasks as an image-to-image translation task, in which we modified the texture patterns and removed reverberation noise in the test data images from the source domain to align with those in the target domain images while keeping the image content unchanged. We applied the proposed method to two datasets containing carotid ultrasound images from three different domains. The experimental results demonstrate that the model successfully translated the texture pattern of images and removed reverberation noise from the ultrasound images. Furthermore, we evaluated the CycleGAN approaches for a comparative study with the proposed model. The experimental findings conclusively demonstrated that the proposed model achieved domain adaptation (histogram correlation (0.960 (0.019), 0.920 (0.043) and bhattacharya distance (0.040 (0.020), 0.085 (0.048)), compared to no adaptation (0.916 (0.062) 0.890 (0.077), 0.090 (0.070) 0.121 (0.095)) for both datasets.
zh

[CV-111] Language as Prior Vision as Calibration: Metric Scale Recovery for Monocular Depth Estimation

【速读】:该论文旨在解决单目度量深度(monocular metric depth)估计中因全局尺度不可识别和域偏移敏感性导致的 ill-posed 问题。其解决方案的关键在于:在冻结主干网络(frozen-backbone)的校准设置下,通过图像特定的仿射变换恢复度量深度,并仅训练轻量级校准头(calibration heads),同时保持相对深度主干和 CLIP 文本编码器固定;进一步地,利用文本提供的粗粒度但噪声较大的尺度线索,预测一个不确定性感知的校准参数边界(uncertainty-aware envelope),而非依赖单一文本点估计;最终结合多尺度冻结视觉特征,在该边界内选择最优图像特定校准参数,从而实现更鲁棒的零样本迁移性能。

链接: https://arxiv.org/abs/2601.01457
作者: Mingxing Zhan,Li Zhang,Beibei Wang,Yingjie Wang,Zenglin Shi
机构: Hefei University of Technology (合肥工业大学); Institute of Artificial Intelligence, Hefei Comprehensive National Science Center (合肥综合性国家科学中心人工智能研究院); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Relative-depth foundation models transfer well, yet monocular metric depth remains ill-posed due to unidentifiable global scale and heightened domain-shift sensitivity. Under a frozen-backbone calibration setting, we recover metric depth via an image-specific affine transform in inverse depth and train only lightweight calibration heads while keeping the relative-depth backbone and the CLIP text encoder fixed. Since captions provide coarse but noisy scale cues that vary with phrasing and missing objects, we use language to predict an uncertainty-aware envelope that bounds feasible calibration parameters in an unconstrained space, rather than committing to a text-only point estimate. We then use pooled multi-scale frozen visual features to select an image-specific calibration within this envelope. During training, a closed-form least-squares oracle in inverse depth provides per-image supervision for learning the envelope and the selected calibration. Experiments on NYUv2 and KITTI improve in-domain accuracy, while zero-shot transfer to SUN-RGBD and DDAD demonstrates improved robustness over strong language-only baselines.
zh

[CV-112] Rethinking Multimodal Few-Shot 3D Point Cloud Segmentation: From Fused Refinement to Decoupled Arbitration

【速读】:该论文旨在解决多模态少样本3D点云语义分割(FS-PCS)中“融合-精炼”范式存在的“可塑性-稳定性困境”(Plasticity-Stability Dilemma),以及CLIP模型因类间混淆导致的语义盲区问题。解决方案的关键在于提出一种解耦专家仲裁少样本分割网络(DA-FSS),其核心创新包括:1)通过并行专家精炼模块(Parallel Expert Refinement)分别生成几何与语义路径的关联信息;2)设计堆叠仲裁模块(Stacked Arbitration Module, SAM)实现卷积融合与路径间相关性仲裁;3)引入解耦对齐模块(Decoupled Alignment Module, DAM),在不传播混淆的前提下实现两路径的知识迁移,从而有效分离语义与几何路径,并通过梯度互正则化提升泛化能力。该方法在S3DIS和ScanNet数据集上显著优于MM-FSS基线,且在几何边界清晰度、完整性及纹理区分度方面表现更优。

链接: https://arxiv.org/abs/2601.01456
作者: Wentao Bian,Fenglei Xu
机构: Suzhou University of Science and Technology (苏州科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 4 figures, 3 tables

点击查看摘要

Abstract:In this paper, we revisit multimodal few-shot 3D point cloud semantic segmentation (FS-PCS), identifying a conflict in “Fuse-then-Refine” paradigms: the “Plasticity-Stability Dilemma.” In addition, CLIP’s inter-class confusion can result in semantic blindness. To address these issues, we present the Decoupled-experts Arbitration Few-Shot SegNet (DA-FSS), a model that effectively distinguishes between semantic and geometric paths and mutually regularizes their gradients to achieve better generalization. DA-FSS employs the same backbone and pre-trained text encoder as MM-FSS to generate text embeddings, which can increase free modalities’ utilization rate and better leverage each modality’s information space. To achieve this, we propose a Parallel Expert Refinement module to generate each modal correlation. We also propose a Stacked Arbitration Module (SAM) to perform convolutional fusion and arbitrate correlations for each modality pathway. The Parallel Experts decouple two paths: a Geometric Expert maintains plasticity, and a Semantic Expert ensures stability. They are coordinated via a Decoupled Alignment Module (DAM) that transfers knowledge without propagating confusion. Experiments on popular datasets (S3DIS, ScanNet) demonstrate the superiority of DA-FSS over MM-FSS. Meanwhile, geometric boundaries, completeness, and texture differentiation are all superior to the baseline. The code is available at: this https URL.
zh

[CV-113] PartImageNet Dataset: Enhancing Visual Models with High-Quality Part Annotations

【速读】:该论文旨在解决现有数据集中高质量部件标注稀缺的问题,从而提升模型在复杂场景下的鲁棒性识别能力。其解决方案的关键在于构建了PartImageNet++(PIN++)这一覆盖ImageNet-1K全部类别的细粒度部件标注数据集,并提出多尺度部件监督识别模型(MPM)。MPM通过在未标注图像上利用PIN++训练的部件分割网络生成伪部件标签,结合原始标注与伪标签对主干识别架构进行联合监督,同时引入辅助旁路层以增强特征表达能力,从而显著提升分类性能并为下游任务建立强基准。

链接: https://arxiv.org/abs/2601.01454
作者: Xiao Li,Zilong Liu,Yining Liu,Zhuhong Li,Na Dong,Sitian Qin,Xiaolin Hu
机构: Shandong University (山东大学); Duke University (杜克大学); Beijing Institute of Technology (北京理工大学); Harbin Engineering University (哈尔滨工程大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: substantial text overlap with arXiv:2407.10918

点击查看摘要

Abstract:To address the scarcity of high-quality part annotations in existing datasets, we introduce PartImageNet++ (PIN++), a dataset that provides detailed part annotations for all categories in ImageNet-1K. With 100 annotated images per category, totaling 100K images, PIN++ represents the most comprehensive dataset covering a diverse range of object categories. Leveraging PIN++, we propose a Multi-scale Part-supervised recognition Model (MPM) for robust classification on ImageNet-1K. We first trained a part segmentation network using PIN++ and used it to generate pseudo part labels for the remaining unannotated images. MPM then integrated a conventional recognition architecture with auxiliary bypass layers, jointly supervised by both pseudo part labels and the original part annotations. Furthermore, we conducted extensive experiments on PIN++, including part segmentation, object segmentation, and few-shot learning, exploring various ways to leverage part annotations in downstream tasks. Experimental results demonstrated that our approach not only enhanced part-based models for robust object recognition but also established strong baselines for multiple downstream tasks, highlighting the potential of part annotations in improving model performance. The dataset and the code are available at this https URL.
zh

[CV-114] In defense of the two-stage framework for open-set domain adaptive semantic segmentation

【速读】:该论文旨在解决开放集域适应语义分割(Open-Set Domain Adaptation for Semantic Segmentation, OSDA-SS)中的核心挑战,即如何在已知类别上实现有效的域迁移同时准确识别未知类别。现有方法通常将两个任务合并为单一阶段处理,但因已知与未知类别的标注不平衡,易导致已知类别的负向迁移和未知类别的欠拟合问题。为此,作者提出分离后适应训练策略(Separating-then-Adapting Training Strategy, SATS),其关键在于将学习过程分为两个顺序步骤:首先进行已知/未知类别分离,以提升模型对未知类别的准确建模;其次实施面向未知类别的域适应,确保已知与未知类别的判别特征得到平衡学习,从而增强模型对真正未知目标的识别能力。此外,引入硬未知探索(hard unknown exploration)数据增强方法,通过暴露模型于更具挑战性的未知样本,强化其对目标域未知类别的全面理解能力。

链接: https://arxiv.org/abs/2601.01439
作者: Wenqi Ren,Weijie Wang,Meng Zheng,Ziyan Wu,Yang Tang,Zhun Zhong,Nicu Sebe
机构: Shanghai Xiaoyuan Innovation Center (上海小源创新中心); University of Trento (特伦托大学); United Imaging Intelligence (联影智能); Key Laboratory of Advanced Control and Optimization for Chemical Processes (先进控制与化学过程优化重点实验室); East China University of Science and Technology (华东理工大学); School of Computer Science and Information Engineering (计算机科学与信息工程学院); Hefei University of Technology (合肥工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Open-Set Domain Adaptation for Semantic Segmentation (OSDA-SS) presents a significant challenge, as it requires both domain adaptation for known classes and the distinction of unknowns. Existing methods attempt to address both tasks within a single unified stage. We question this design, as the annotation imbalance between known and unknown classes often leads to negative transfer of known classes and underfitting for unknowns. To overcome these issues, we propose SATS, a Separating-then-Adapting Training Strategy, which addresses OSDA-SS through two sequential steps: known/unknown separation and unknown-aware domain adaptation. By providing the model with more accurate and well-aligned unknown classes, our method ensures a balanced learning of discriminative features for both known and unknown classes, steering the model toward discovering truly unknown objects. Additionally, we present hard unknown exploration, an innovative data augmentation method that exposes the model to more challenging unknowns, strengthening its ability to capture more comprehensive understanding of target unknowns. We evaluate our method on public OSDA-SS benchmarks. Experimental results demonstrate that our method achieves a substantial advancement, with a +3.85% H-Score improvement for GTA5-to-Cityscapes and +18.64% for SYNTHIA-to-Cityscapes, outperforming previous state-of-the-art methods.
zh

[CV-115] EdgeNeRF: Edge-Guided Regularization for Neural Radiance Fields from Sparse Views

【速读】:该论文旨在解决神经辐射场(Neural Radiance Fields, NeRF)在稀疏视图输入下重建质量显著下降的问题,特别是由于几何伪影导致的边界模糊和细节丢失。现有方法通过全局深度正则化缓解伪影,但牺牲了几何边界处的高频细节。其解决方案的关键在于提出EdgeNeRF,利用边缘先验:即深度和法向量的突变通常对应图像边缘;具体而言,首先从输入图像中提取边缘,随后仅在非边缘区域施加深度与法向量正则化约束,从而在保持几何一致性的同时有效保留边界处的高频率细节。该方法在LLFF和DTU数据集上验证了优越性能,并且其边缘引导的深度正则化模块可作为插件式组件集成到其他方法中,无需显著增加训练时间即可提升整体效果。

链接: https://arxiv.org/abs/2601.01431
作者: Weiqi Yu,Yiyang Yao,Lin He,Jianming Lv
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: PRCV 2025

点击查看摘要

Abstract:Neural Radiance Fields (NeRF) achieve remarkable performance in dense multi-view scenarios, but their reconstruction quality degrades significantly under sparse inputs due to geometric artifacts. Existing methods utilize global depth regularization to mitigate artifacts, leading to the loss of geometric boundary details. To address this problem, we propose EdgeNeRF, an edge-guided sparse-view 3D reconstruction algorithm. Our method leverages the prior that abrupt changes in depth and normals generate edges. Specifically, we first extract edges from input images, then apply depth and normal regularization constraints to non-edge regions, enhancing geometric consistency while preserving high-frequency details at boundaries. Experiments on LLFF and DTU datasets demonstrate EdgeNeRF’s superior performance, particularly in retaining sharp geometric boundaries and suppressing artifacts. Additionally, the proposed edge-guided depth regularization module can be seamlessly integrated into other methods in a plug-and-play manner, significantly improving their performance without substantially increasing training time. Code is available at this https URL.
zh

[CV-116] DreamID-V:Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer

【速读】:该论文旨在解决视频人脸替换(Video Face Swapping, VFS)中难以同时保持身份相似性、属性保真度(如姿态、表情、光照、背景及动态信息)与时间一致性的问题。现有方法在跨帧身份稳定性和视觉真实感方面存在明显不足。解决方案的关键在于提出一个端到端的扩散Transformer框架DreamID-V,其核心创新包括:1)设计了SyncID-Pipe数据流水线,通过预训练身份锚定的视频合成器并结合图像级人脸替换模型构建双向身份四元组,实现显式监督;2)引入模态感知条件注入模块(Modality-Aware Conditioning),以区分地融合多模态条件信息;3)提出从合成到真实的课程学习机制和身份一致性强化学习策略,显著提升复杂场景下的视觉真实感与身份一致性。

链接: https://arxiv.org/abs/2601.01425
作者: Xu Guo,Fulong Ye,Xinghui Li,Pengqi Tu,Pengze Zhang,Qichao Sun,Songtao Zhao,Xiangwang Hou,Qian He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project: this https URL

点击查看摘要

Abstract:Video Face Swapping (VFS) requires seamlessly injecting a source identity into a target video while meticulously preserving the original pose, expression, lighting, background, and dynamic information. Existing methods struggle to maintain identity similarity and attribute preservation while preserving temporal consistency. To address the challenge, we propose a comprehensive framework to seamlessly transfer the superiority of Image Face Swapping (IFS) to the video domain. We first introduce a novel data pipeline SyncID-Pipe that pre-trains an Identity-Anchored Video Synthesizer and combines it with IFS models to construct bidirectional ID quadruplets for explicit supervision. Building upon paired data, we propose the first Diffusion Transformer-based framework DreamID-V, employing a core Modality-Aware Conditioning module to discriminatively inject multi-model conditions. Meanwhile, we propose a Synthetic-to-Real Curriculum mechanism and an Identity-Coherence Reinforcement Learning strategy to enhance visual realism and identity consistency under challenging scenarios. To address the issue of limited benchmarks, we introduce IDBench-V, a comprehensive benchmark encompassing diverse scenes. Extensive experiments demonstrate DreamID-V outperforms state-of-the-art methods and further exhibits exceptional versatility, which can be seamlessly adapted to various swap-related tasks.
zh

[CV-117] AirSpatialBot: A Spatially-Aware Aerial Agent for Fine-Grained Vehicle Attribute Recognization and Retrieval

【速读】:该论文旨在解决现有遥感视觉语言模型(Remote Sensing Vision-Language Models, VLMs)在空间理解能力上的不足,从而提升其在真实场景中的应用效果。针对无人机拍摄的车辆图像,作者构建了一个名为AirSpatial的空间感知数据集,包含超过20.6万条指令,并引入两项新任务:空间定位(Spatial Grounding)和空间问答(Spatial Question Answering),且是首个提供3D边界框(3DBB)的遥感定位数据集。解决方案的关键在于采用两阶段训练策略——先进行图像理解预训练,再进行空间理解微调,以有效利用已有VLM的图像理解能力拓展至空间域。基于此训练后的空间感知VLM,作者进一步开发了空中智能体AirSpatialBot,具备细粒度车辆属性识别与检索能力,并通过动态整合任务规划、图像理解、空间理解与执行能力,适应多样化的查询需求。实验结果验证了方法的有效性,揭示了现有VLMs在空间理解方面的局限并提供了重要洞见。

链接: https://arxiv.org/abs/2601.01416
作者: Yue Zhou,Ran Ding,Xue Yang,Xue Jiang,Xingzhao Liu
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 9 figures

点击查看摘要

Abstract:Despite notable advancements in remote sensing vision-language models (VLMs), existing models often struggle with spatial understanding, limiting their effectiveness in real-world applications. To push the boundaries of VLMs in remote sensing, we specifically address vehicle imagery captured by drones and introduce a spatially-aware dataset AirSpatial, which comprises over 206K instructions and introduces two novel tasks: Spatial Grounding and Spatial Question Answering. It is also the first remote sensing grounding dataset to provide 3DBB. To effectively leverage existing image understanding of VLMs to spatial domains, we adopt a two-stage training strategy comprising Image Understanding Pre-training and Spatial Understanding Fine-tuning. Utilizing this trained spatially-aware VLM, we develop an aerial agent, AirSpatialBot, which is capable of fine-grained vehicle attribute recognition and retrieval. By dynamically integrating task planning, image understanding, spatial understanding, and task execution capabilities, AirSpatialBot adapts to diverse query requirements. Experimental results validate the effectiveness of our approach, revealing the spatial limitations of existing VLMs while providing valuable insights. The model, code, and datasets will be released at this https URL
zh

[CV-118] Mask-Guided Multi-Task Network for Face Attribute Recognition

【速读】:该论文旨在解决传统多任务人脸属性识别(Face Attribute Recognition, FAR)方法因依赖全局区域特征而产生冗余信息、导致负迁移的问题。其解决方案的关键在于提出一种基于掩码引导的多任务网络(Mask-Guided Multi-Task Network, MGMTN),通过自适应掩码学习(Adaptive Mask Learning, AML)精准定位关键面部区域(如眼睛和嘴部组),生成分组掩码以限定有效特征区域,从而减少冗余;同时引入组-全局特征融合机制(Group-Global Feature Fusion, G2FF)整合局部与全局特征,提升属性识别精度。

链接: https://arxiv.org/abs/2601.01408
作者: Gong Gao,Zekai Wang,Jian Zhao,Ziqi Xie,Xianhui Liu,Weidong Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 9 figures

点击查看摘要

Abstract:Face Attribute Recognition (FAR) plays a crucial role in applications such as person re-identification, face retrieval, and face editing. Conventional multi-task attribute recognition methods often process the entire feature map for feature extraction and attribute classification, which can produce redundant features due to reliance on global regions. To address these challenges, we propose a novel approach emphasizing the selection of specific feature regions for efficient feature learning. We introduce the Mask-Guided Multi-Task Network (MGMTN), which integrates Adaptive Mask Learning (AML) and Group-Global Feature Fusion (G2FF) to address the aforementioned limitations. Leveraging a pre-trained keypoint annotation model and a fully convolutional network, AML accurately localizes critical facial parts (e.g., eye and mouth groups) and generates group masks that delineate meaningful feature regions, thereby mitigating negative transfer from global region usage. Furthermore, G2FF combines group and global features to enhance FAR learning, enabling more precise attribute identification. Extensive experiments on two challenging facial attribute recognition datasets demonstrate the effectiveness of MGMTN in improving FAR performance.
zh

[CV-119] SwinIFS: Landmark Guided Swin Transformer For Identity Preserving Face Super Resolution

【速读】:该论文旨在解决人脸超分辨率(Face Super-Resolution)任务中因低分辨率输入导致的细粒度结构细节丢失和身份特异性特征失真问题。解决方案的关键在于提出一种基于关键点引导的超分辨率框架 SwinIFS,其核心创新包括:1)在输入表示中嵌入密集高斯热图形式的关键面部地标信息,使网络从早期阶段即可聚焦于语义重要的面部区域;2)采用紧凑的 Swin Transformer 主干网络,结合层次化注意力机制,在保留局部几何结构的同时捕获长程上下文信息,从而实现高质量的人脸纹理恢复与全局结构一致性保持。该方法在 CelebA 基准上显著提升了感知质量、边缘锐度及身份保留能力,尤其在 8 倍放大倍率下仍能重建有意义的结构,兼具高重建精度与计算效率。

链接: https://arxiv.org/abs/2601.01406
作者: Habiba Kausar,Saeed Anwar,Omar Jamal Hammad,Abdul Bais
机构: King Fahd University of Petroleum and Minerals (KFUPM); University of Western Australia (UWA); University of Regina (UREGINA)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Face super-resolution aims to recover high-quality facial images from severely degraded low-resolution inputs, but remains challenging due to the loss of fine structural details and identity-specific features. This work introduces SwinIFS, a landmark-guided super-resolution framework that integrates structural priors with hierarchical attention mechanisms to achieve identity-preserving reconstruction at both moderate and extreme upscaling factors. The method incorporates dense Gaussian heatmaps of key facial landmarks into the input representation, enabling the network to focus on semantically important facial regions from the earliest stages of processing. A compact Swin Transformer backbone is employed to capture long-range contextual information while preserving local geometry, allowing the model to restore subtle facial textures and maintain global structural consistency. Extensive experiments on the CelebA benchmark demonstrate that SwinIFS achieves superior perceptual quality, sharper reconstructions, and improved identity retention; it consistently produces more photorealistic results and exhibits strong performance even under 8x magnification, where most methods fail to recover meaningful structure. SwinIFS also provides an advantageous balance between reconstruction accuracy and computational efficiency, making it suitable for real-world applications in facial enhancement, surveillance, and digital restoration. Our code, model weights, and results are available at this https URL.
zh

[CV-120] Evaluation of Convolutional Neural Network For Image Classification with Agricultural and Urban Datasets

【速读】:该论文旨在解决多领域图像分类任务中模型性能与计算效率之间的平衡问题,特别是在智慧城市(如车辆违规检测、人行道侵占识别)和农业图像分析(如芒果与水稻品种识别)等实际应用场景中的挑战。其解决方案的关键在于设计一种定制化的卷积神经网络(CustomCNN),通过引入残差连接(residual connections)、挤压-激励注意力机制(Squeeze-and-Excitation attention mechanisms)、渐进式通道扩展(progressive channel scaling)以及Kaiming初始化策略,有效提升特征表示能力并加速训练过程,从而在多个公开数据集上实现高性能且计算高效的多域图像分类结果。

链接: https://arxiv.org/abs/2601.01393
作者: Shamik Shafkat Avro,Nazira Jesmin Lina,Shahanaz Sharmin
机构: Dhaka University of Engineering & Technology (达卡工程与技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: All authors contributed equally to this work

点击查看摘要

Abstract:This paper presents the development and evaluation of a custom Convolutional Neural Network (CustomCNN) created to study how architectural design choices affect multi-domain image classification tasks. The network uses residual connections, Squeeze-and-Excitation attention mechanisms, progressive channel scaling, and Kaiming initialization to improve its ability to represent data and speed up training. The model is trained and tested on five publicly available datasets: unauthorized vehicle detection, footpath encroachment detection, polygon-annotated road damage and manhole detection, MangoImageBD and PaddyVarietyBD. A comparison with popular CNN architectures shows that the CustomCNN delivers competitive performance while remaining efficient in computation. The results underscore the importance of thoughtful architectural design for real-world Smart City and agricultural imaging applications.
zh

[CV-121] ParkGaussian: Surround-view 3D Gaussian Splatting for Autonomous Parking

【速读】:该论文旨在解决自动驾驶系统中停车场景三维重建(3D reconstruction)研究不足的问题,尤其在拥挤停车位和GPS拒止环境下的复杂空间几何建模难题。现有方法主要聚焦于二维停车位感知、地图构建与定位,而忽视了对三维场景结构的精确重建,这限制了后续感知任务的性能提升。其解决方案的关键在于提出首个面向停车场景的基准数据集ParkRecon3D,并设计了首个融合3D高斯溅射(3D Gaussian Splatting, 3DGS)的重建框架ParkGaussian;进一步引入基于停车位感知的重建策略(slot-aware reconstruction),利用已有停车位检测方法引导重建过程,显著增强停车位区域的合成质量,从而实现重建精度与下游感知一致性之间的协同优化。

链接: https://arxiv.org/abs/2601.01386
作者: Xiaobao Wei,Zhangjie Ye,Yuxiang Gu,Zunjie Zhu,Yunfei Guo,Yingying Shen,Shan Zhao,Ming Lu,Haiyang Sun,Bing Wang,Guang Chen,Rongfeng Lu,Hangjun Ye
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Parking is a critical task for autonomous driving systems (ADS), with unique challenges in crowded parking slots and GPS-denied environments. However, existing works focus on 2D parking slot perception, mapping, and localization, 3D reconstruction remains underexplored, which is crucial for capturing complex spatial geometry in parking scenarios. Naively improving the visual quality of reconstructed parking scenes does not directly benefit autonomous parking, as the key entry point for parking is the slots perception module. To address these limitations, we curate the first benchmark named ParkRecon3D, specifically designed for parking scene reconstruction. It includes sensor data from four surround-view fisheye cameras with calibrated extrinsics and dense parking slot annotations. We then propose ParkGaussian, the first framework that integrates 3D Gaussian Splatting (3DGS) for parking scene reconstruction. To further improve the alignment between reconstruction and downstream parking slot detection, we introduce a slot-aware reconstruction strategy that leverages existing parking perception methods to enhance the synthesis quality of slot regions. Experiments on ParkRecon3D demonstrate that ParkGaussian achieves state-of-the-art reconstruction quality and better preserves perception consistency for downstream tasks. The code and dataset will be released at: this https URL
zh

[CV-122] Unsupervised SE(3) Disentanglement for in situ Macromolecular Morphology Identification from Cryo-Electron Tomography

【速读】:该论文旨在解决冷冻电子断层成像(cryo-ET)中从高噪声数据中准确推断细胞内大分子(macromolecules)原位形态的逆问题,即同时估计模板形态及其在SE(3)群中的变换(包括旋转和平移)。传统基于期望最大化(expectation-maximization)的方法常因对稀有但重要的形态敏感性不足以及依赖大量手动调参而效果受限。解决方案的关键在于提出一种解耦的深度表示学习框架,通过引入新颖的多选择学习模块,在表示空间中将SE(3)变换与形态内容分离,从而有效提升对低信噪比cryo-ET数据的形态识别能力,并实现模板形态的自动生成,显著优于现有方法,且能发现此前未被识别的大分子形态。

链接: https://arxiv.org/abs/2601.01364
作者: Mostofa Rafid Uddin,Mahek Vora,Qifeng Wu,Muyuan Chen,Min Xu
机构: Carnegie Mellon University (卡内基梅隆大学); Indian Institute of Technology (印度理工学院); Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cryo-electron tomography (cryo-ET) provides direct 3D visualization of macromolecules inside the cell, enabling analysis of their in situ morphology. This morphology can be regarded as an SE(3)-invariant, denoised volumetric representation of subvolumes extracted from tomograms. Inferring morphology is therefore an inverse problem of estimating both a template morphology and its SE(3) transformation. Existing expectation-maximization based solution to this problem often misses rare but important morphologies and requires extensive manual hyperparameter tuning. Addressing this issue, we present a disentangled deep representation learning framework that separates SE(3) transformations from morphological content in the representation space. The framework includes a novel multi-choice learning module that enables this disentanglement for highly noisy cryo-ET data, and the learned morphological content is used to generate template morphologies. Experiments on simulated and real cryo-ET datasets demonstrate clear improvements over prior methods, including the discovery of previously unidentified macromolecular morphologies.
zh

[CV-123] Garment Inertial Denoiser (GID): Endowing Accurate Motion Capture via Loose IMU Denoiser

【速读】:该论文旨在解决可穿戴惯性动作捕捉(Inertial Motion Capture, MoCap)系统在松散穿着衣物场景下的精度下降问题。当惯性测量单元(IMU)嵌入到松身服装中时,传感器与身体之间的相对位移会引入结构化、位置依赖的噪声,破坏传统惯性数据处理流程。解决方案的关键在于提出GID(Garment Inertial Denoiser),一个轻量级、即插即用的Transformer架构,其核心创新是通过三阶段处理:(i) 位置感知去噪、(ii) 自适应跨穿戴融合、(iii) 通用姿态预测;其中采用位置感知专家结构(location-aware expert architecture),共享时空主干网络建模全局运动,每个IMU配备专用专家头以学习局部服装动力学,并辅以轻量融合模块保证多部位一致性,从而实现从有限配对松紧穿戴数据中稳定训练并有效泛化至未见用户、动作和服装类型。

链接: https://arxiv.org/abs/2601.01360
作者: Jiawei Fang,Ruonan Zheng,Xiaoxia Gao,Shifan Jiang,Anjun Chen,Qi Ye,Shihui Guo
机构: Xiamen University (厦门大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: 11 pages, 4 figures

点击查看摘要

Abstract:Wearable inertial motion capture (MoCap) provides a portable, occlusion-free, and privacy-preserving alternative to camera-based systems, but its accuracy depends on tightly attached sensors - an intrusive and uncomfortable requirement for daily use. Embedding IMUs into loose-fitting garments is a desirable alternative, yet sensor-body displacement introduces severe, structured, and location-dependent corruption that breaks standard inertial pipelines. We propose GID (Garment Inertial Denoiser), a lightweight, plug-and-play Transformer that factorizes loose-wear MoCap into three stages: (i) location-specific denoising, (ii) adaptive cross-wear fusion, and (iii) general pose prediction. GID uses a location-aware expert architecture, where a shared spatio-temporal backbone models global motion while per-IMU expert heads specialize in local garment dynamics, and a lightweight fusion module ensures cross-part consistency. This inductive bias enables stable training and effective learning from limited paired loose-tight IMU data. We also introduce GarMoCap, a combined public and newly collected dataset covering diverse users, motions, and garments. Experiments show that GID enables accurate, real-time denoising from single-user training and generalizes across unseen users, motions, and garment types, consistently improving state-of-the-art inertial MoCap methods when used as a drop-in module.
zh

[CV-124] Advanced Machine Learning Approaches for Enhancing Person Re-Identification Performance

【速读】:该论文旨在解决行人重识别(Person Re-identification, ReID)在复杂场景下面临的三大核心挑战:外观变化、域偏移(domain shift)以及标注数据有限。针对监督学习、无监督域适应(Unsupervised Domain Adaptation, UDA)和完全无监督设置,论文提出三种创新解决方案:第一,SCM-ReID通过融合监督对比学习与混合损失优化(分类损失、中心损失、三元组损失及中心三元组损失),增强特征的判别能力;第二,IQAGA与DAPRH结合GAN图像增强、域不变映射和伪标签精炼机制,有效缓解跨域差异并提升泛化性能;第三,ViTC-UReID采用基于视觉Transformer(Vision Transformer)的特征编码与相机感知代理学习策略,利用全局与局部注意力机制及相机身份约束,在大规模无监督场景中显著优于现有方法。上述方案的关键在于构建更具鲁棒性和迁移性的特征表示,并有效应对标签噪声与域间差异问题,从而推动ReID技术在真实监控系统中的落地应用。

链接: https://arxiv.org/abs/2601.01356
作者: Dang H. Pham,Tu N. Nguyen,Hoa N. Nguyen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: in Vietnamese language

点击查看摘要

Abstract:Person re-identification (ReID) plays a critical role in intelligent surveillance systems by linking identities across multiple cameras in complex environments. However, ReID faces significant challenges such as appearance variations, domain shifts, and limited labeled data. This dissertation proposes three advanced approaches to enhance ReID performance under supervised, unsupervised domain adaptation (UDA), and fully unsupervised settings. First, SCM-ReID integrates supervised contrastive learning with hybrid loss optimization (classification, center, triplet, and centroid-triplet losses), improving discriminative feature representation and achieving state-of-the-art accuracy on Market-1501 and CUHK03 datasets. Second, for UDA, IQAGA and DAPRH combine GAN-based image augmentation, domain-invariant mapping, and pseudo-label refinement to mitigate domain discrepancies and enhance cross-domain generalization. Experiments demonstrate substantial gains over baseline methods, with mAP and Rank-1 improvements up to 12% in challenging transfer scenarios. Finally, ViTC-UReID leverages Vision Transformer-based feature encoding and camera-aware proxy learning to boost unsupervised ReID. By integrating global and local attention with camera identity constraints, this method significantly outperforms existing unsupervised approaches on large-scale benchmarks. Comprehensive evaluations across CUHK03, Market-1501, DukeMTMC-reID, and MSMT17 confirm the effectiveness of the proposed methods. The contributions advance ReID research by addressing key limitations in feature learning, domain adaptation, and label noise handling, paving the way for robust deployment in real-world surveillance systems.
zh

[CV-125] Slot-ID: Identity-Preserving Video Generation from Reference Videos via Slot-Based Temporal Identity Encoding

【速读】:该论文旨在解决生成式视频中身份一致性(identity preservation)与动作自然性(motion naturalness)之间的平衡难题,尤其是在仅以单张图像作为参考时,模型难以捕捉动态特征,导致姿态锁定、变形不自然及表情平均化等问题。其解决方案的关键在于引入一个基于短参考视频的身份条件扩散-Transformer视频生成框架,通过在编码器中采用Sinkhorn路由机制学习紧凑的身份token,从而显式建模参考视频中的个体特异性动态模式(如笑容形成过程),在保持预训练主干兼容性的前提下显著提升大姿态变化和丰富表情下的身份保留能力,同时维持提示忠实性和视觉真实性。

链接: https://arxiv.org/abs/2601.01352
作者: Yixuan Lai,He Wang,Kun Zhou,Tianjia Shao
机构: Zhejiang University (浙江大学); University College London (伦敦大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Producing prompt-faithful videos that preserve a user-specified identity remains challenging: models need to extrapolate facial dynamics from sparse reference while balancing the tension between identity preservation and motion naturalness. Conditioning on a single image completely ignores the temporal signature, which leads to pose-locked motions, unnatural warping, and “average” faces when viewpoints and expressions change. To this end, we introduce an identity-conditioned variant of a diffusion-transformer video generator which uses a short reference video rather than a single portrait. Our key idea is to incorporate the dynamics in the reference. A short clip reveals subject-specific patterns, e.g., how smiles form, across poses and lighting. From this clip, a Sinkhorn-routed encoder learns compact identity tokens that capture characteristic dynamics while remaining pretrained backbone-compatible. Despite adding only lightweight conditioning, the approach consistently improves identity retention under large pose changes and expressive facial behavior, while maintaining prompt faithfulness and visual realism across diverse subjects and prompts.
zh

[CV-126] Achieving Fine-grained Cross-modal Understanding through Brain-inspired Hierarchical Representation Learning

【速读】:该论文旨在解决脑神经响应与视觉刺激之间对齐困难的问题,尤其针对现有方法在处理fMRI数据与视频输入之间的模态差异时,难以刻画大脑视觉处理的层级性和时间动态性这一局限。其核心解决方案是提出NeuroAlign框架,关键在于采用两阶段机制模拟人类视觉系统的层级组织:第一阶段通过神经-时间对比学习(Neural-Temporal Contrastive Learning, NTCL)实现全局语义理解,显式建模跨模态的时间动态关系;第二阶段利用增强型向量量化实现细粒度模式匹配,并结合DynaSyncMM-EMA方法进行动态多模态融合与自适应加权,从而显著提升跨模态检索性能,为理解视觉认知机制提供了新范式。

链接: https://arxiv.org/abs/2601.01339
作者: Weihang You,Hanqi Jiang,Yi Pan,Junhao Chen,Tianming Liu,Fei Dou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding neural responses to visual stimuli remains challenging due to the inherent complexity of brain representations and the modality gap between neural data and visual inputs. Existing methods, mainly based on reducing neural decoding to generation tasks or simple correlations, fail to reflect the hierarchical and temporal processes of visual processing in the brain. To address these limitations, we present NeuroAlign, a novel framework for fine-grained fMRI-video alignment inspired by the hierarchical organization of the human visual system. Our framework implements a two-stage mechanism that mirrors biological visual pathways: global semantic understanding through Neural-Temporal Contrastive Learning (NTCL) and fine-grained pattern matching through enhanced vector quantization. NTCL explicitly models temporal dynamics through bidirectional prediction between modalities, while our DynaSyncMM-EMA approach enables dynamic multi-modal fusion with adaptive weighting. Experiments demonstrate that NeuroAlign significantly outperforms existing methods in cross-modal retrieval tasks, establishing a new paradigm for understanding visual cognitive mechanisms.
zh

[CV-127] LinMU: Multimodal Understanding Made Linear

【速读】:该论文旨在解决现代视觉语言模型(Vision-Language Models, VLMs)因自注意力机制(self-attention)具有二次计算复杂度而导致的部署受限问题,尤其是在边缘设备上处理高分辨率图像和长视频时效率低下。解决方案的关键在于提出LinMU(Linear-complexity Multimodal Understanding)架构,其核心是用M-MATE块替代原模型中的每一层自注意力模块:M-MATE是一个双分支结构,包含一个用于全局上下文建模的双向状态空间模型(Flex-MA分支)和一个基于窗口的局部注意力机制(Local-Swin分支),从而在不引入任何二次复杂度模块的前提下实现线性时间复杂度。此外,论文设计了一个三阶段蒸馏框架,将预训练VLM逐步迁移到LinMU结构,确保性能与教师模型相当,同时显著提升推理速度(最长可提速2.7倍)和吞吐量(最高提升9倍)。

链接: https://arxiv.org/abs/2601.01322
作者: Hongjie Wang,Niraj K. Jha
机构: Princeton University (普林斯顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注: 23 pages, 7 figures

点击查看摘要

Abstract:Modern Vision-Language Models (VLMs) achieve impressive performance but are limited by the quadratic complexity of self-attention, which prevents their deployment on edge devices and makes their understanding of high-resolution images and long-context videos prohibitively expensive. To address this challenge, we introduce LinMU (Linear-complexity Multimodal Understanding), a VLM design that achieves linear complexity without using any quadratic-complexity modules while maintaining the performance of global-attention-based VLMs. LinMU replaces every self-attention layer in the VLM with the M-MATE block: a dual-branch module that combines a bidirectional state-space model for global context (Flex-MA branch) with localized Swin-style window attention (Local-Swin branch) for adjacent correlations. To transform a pre-trained VLM into the LinMU architecture, we propose a three-stage distillation framework that (i) initializes both branches with self-attention weights and trains the Flex-MA branch alone, (ii) unfreezes the Local-Swin branch and fine-tunes it jointly with the Flex-MA branch, and (iii) unfreezes the remaining blocks and fine-tunes them using LoRA adapters, while regressing on hidden states and token-level logits of the frozen VLM teacher. On MMMU, TextVQA, LongVideoBench, Video-MME, and other benchmarks, LinMU matches the performance of teacher models, yet reduces Time-To-First-Token (TTFT) by up to 2.7 \times and improves token throughput by up to 9.0 \times on minute-length videos. Ablations confirm the importance of each distillation stage and the necessity of the two branches of the M-MATE block. The proposed framework demonstrates that state-of-the-art multimodal reasoning can be achieved without quadratic attention, thus opening up avenues for long-context VLMs that can deal with high-resolution images and long videos.
zh

[CV-128] VReID-XFD: Video-based Person Re-identification at Extreme Far Distance Challenge Results

【速读】:该论文针对极端远距离(XFD)下航空视角到地面视角的人体再识别(Person Re-Identification, ReID)问题展开研究,旨在解决现有ReID系统在严重分辨率退化、极端视角变化、不稳定的运动线索以及衣物差异等复杂条件下性能显著下降的问题。其解决方案的关键在于构建了一个全新的视频基准数据集VReID-XFD,该数据集涵盖从5.8米至120米的飞行高度、30°至90°的观测角度及最大120米的水平距离,包含371个身份、11,288个轨迹片段和1175万帧图像,并支持严格的身份不交叠划分下的多场景评估(航空-航空、航空-地面、地面-航空)。通过该基准开展的VReID-XFD-25挑战赛揭示了性能随高度和距离单调下降、天顶视角普遍劣势以及峰值性能与鲁棒性之间的权衡关系,为未来在极端远距离场景下的跨视角ReID研究提供了标准化测试平台和深入分析基础。

链接: https://arxiv.org/abs/2601.01312
作者: Kailash A. Hambarde,Hugo Proença,Md Rashidunnabi,Pranita Samale,Qiwei Yang,Pingping Zhang,Zijing Gong,Yuhao Wang,Xi Zhang,Ruoshui Qu,Qiaoyun He,Yuhang Zhang,Thi Ngoc Ha Nguyen,Tien-Dung Mai,Cheng-Jun Kang,Yu-Fan Lin,Jin-Hui Jiang,Chih-Chung Hsu,Tamás Endrei,György Cserey,Ashwat Rajbhandari
机构: IT - Instituto de Telecomunicações (葡萄牙电信研究所); University of Beira Interior (贝拉内陆大学); Dalian University of Technology (大连理工大学); University of Information Technology, VNU-HCM (胡志明市国家大学信息科技大学); National Cheng Kung University (成功大学); National Yang Ming Chiao Tung University (阳明交通大学); Pázmány Péter Catholic University (帕兹曼尼·彼得天主教大学); Arizona State University (亚利桑那州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Person re-identification (ReID) across aerial and ground views at extreme far distances introduces a distinct operating regime where severe resolution degradation, extreme viewpoint changes, unstable motion cues, and clothing variation jointly undermine the appearance-based assumptions of existing ReID systems. To study this regime, we introduce VReID-XFD, a video-based benchmark and community challenge for extreme far-distance (XFD) aerial-to-ground person re-identification. VReID-XFD is derived from the DetReIDX dataset and comprises 371 identities, 11,288 tracklets, and 11.75 million frames, captured across altitudes from 5.8 m to 120 m, viewing angles from oblique (30 degrees) to nadir (90 degrees), and horizontal distances up to 120 m. The benchmark supports aerial-to-aerial, aerial-to-ground, and ground-to-aerial evaluation under strict identity-disjoint splits, with rich physical metadata. The VReID-XFD-25 Challenge attracted 10 teams with hundreds of submissions. Systematic analysis reveals monotonic performance degradation with altitude and distance, a universal disadvantage of nadir views, and a trade-off between peak performance and robustness. Even the best-performing SAS-PReID method achieves only 43.93 percent mAP in the aerial-to-ground setting. The dataset, annotations, and official evaluation protocols are publicly available at this https URL .
zh

[CV-129] S2M-Net: Spectral-Spatial Mixing for Medical Image Segmentation with Morphology-Aware Adaptive Loss

【速读】:该论文旨在解决医学图像分割中长期存在的三难问题:局部精度(用于边界敏感的临床应用)、全局上下文(保证解剖一致性)与计算效率(适配有限数据和硬件资源)之间的平衡难题。现有架构难以同时满足这三项需求,卷积网络虽具高效性但感受野受限,视觉Transformer虽能获取全局信息却因二次复杂度导致计算开销过大并易在小样本数据上过拟合。其解决方案的关键在于提出S2M-Net架构,通过两项协同创新实现高效全局建模与自适应损失优化:(i) 频谱选择性令牌混合器(Spectral-Selective Token Mixer, SSTM),利用医学图像的频谱集中特性,采用可学习频率滤波的截断二维快速傅里叶变换(2D FFT)结合内容门控空间投影,在保持全局感受野的同时将复杂度降至O(HWlogHW)\mathcal{O}(HW \log HW);(ii) 形态感知自适应分割损失(Morphology-Aware Adaptive Segmentation Loss, MASL),自动分析结构特征(紧凑性、管状性、不规则性、尺度),并通过约束可学习权重动态调节五种互补损失分量,避免了针对每类数据集的手动调参。

链接: https://arxiv.org/abs/2601.01285
作者: Md. Sanaullah Chowdhury Lameya Sabrin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical image segmentation requires balancing local precision for boundary-critical clinical applications, global context for anatomical coherence, and computational efficiency for deployment on limited data and hardware a trilemma that existing architectures fail to resolve. Although convolutional networks provide local precision at \mathcalO(n) cost but limited receptive fields, vision transformers achieve global context through \mathcalO(n^2) self-attention at prohibitive computational expense, causing overfitting on small clinical datasets. We propose S2M-Net, a 4.7M-parameter architecture that achieves \mathcalO(HW \log HW) global context through two synergistic innovations: (i) Spectral-Selective Token Mixer (SSTM), which exploits the spectral concentration of medical images via truncated 2D FFT with learnable frequency filtering and content-gated spatial projection, avoiding quadratic attention cost while maintaining global receptive fields; and (ii) Morphology-Aware Adaptive Segmentation Loss (MASL), which automatically analyzes structure characteristics (compactness, tubularity, irregularity, scale) to modulate five complementary loss components through constrained learnable weights, eliminating manual per-dataset tuning. Comprehensive evaluation in 16 medical imaging datasets that span 8 modalities demonstrates state-of-the-art performance: 96.12% Dice on polyp segmentation, 83.77% on surgical instruments (+17.85% over the prior art) and 80.90% on brain tumors, with consistent 3-18% improvements over specialized baselines while using 3.5–6 \times fewer parameters than transformer-based methods.
zh

[CV-130] AI-Powered Deepfake Detection Using CNN and Vision Transformer Architectures

【速读】:该论文旨在解决由生成式 AI (Generative AI) 产生的深度伪造(deepfake)视频或图像对数字真实性构成的严峻挑战。其解决方案的关键在于构建并评估四种基于人工智能的检测模型,包括三种卷积神经网络(CNN)和一种视觉 Transformer(Vision Transformer),并通过数据预处理与增强技术提升模型在多样化场景下的泛化能力。实验表明,VFDNET 结合 MobileNetV3 的架构在准确率和效率方面表现最优,验证了深度学习方法在可靠深伪检测中的有效性。

链接: https://arxiv.org/abs/2601.01281
作者: Sifatullah Sheikh Urmi,Kirtonia Nuzath Tabassum Arthi,Md Al-Imran
机构: East West University (东西大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 6 figures, 3 tables. Conference paper

点击查看摘要

Abstract:The increasing use of artificial intelligence generated deepfakes creates major challenges in maintaining digital authenticity. Four AI-based models, consisting of three CNNs and one Vision Transformer, were evaluated using large face image datasets. Data preprocessing and augmentation techniques improved model performance across different scenarios. VFDNET demonstrated superior accuracy with MobileNetV3, showing efficient performance, thereby demonstrating AI’s capabilities for dependable deepfake detection.
zh

[CV-131] An Energy-Efficient Smart Bus Transport Management System with Blind-Spot Collision Detection Ability

【速读】:该论文旨在解决发展中国家公共交通系统中普遍存在的实时位置信息缺失、非指定站点停靠带来的安全隐患与交通拥堵问题,以及因盲区和违反交通法规引发的事故风险。其核心解决方案是构建一个集成智能公交系统与物联网(IoT)驱动的太阳能供电智能公交站的综合架构:通过基于深度学习的盲区预警系统实现约99%准确率的实时盲区检测,结合自动巴士停靠识别技术确保精准停靠;同时利用RFID卡系统追踪乘客上下车位置、智能门控系统优化乘车秩序,并通过HTTP服务器实现各子系统间无缝通信,从而提升安全性、效率与可持续性。

链接: https://arxiv.org/abs/2601.01274
作者: Md. Sadman Haque,Zobaer Ibn Razzaque,Robiul Awoul Robin,Fahim Hafiz,Riasat Azim
机构: University of Information Technology and Sciences (信息科技与科学大学)
类目: ystems and Control (eess.SY); Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages, 11 figures

点击查看摘要

Abstract:Public bus transport systems in developing countries often suffer from a lack of real-time location updates and for users, making commuting inconvenient and unreliable for passengers. Furthermore, stopping at undesired locations rather than designated bus stops creates safety risks and contributes to roadblocks, often causing traffic congestion. Additionally, issues such as blind spots, along with a lack of following traffic laws, increase the chances of accidents. In this work, we address these challenges by proposing a smart public bus system along with intelligent bus stops that enhance safety, efficiency, and sustainability. Our approach includes a deep learning-based blind-spot warning system to help drivers avoid accidents with automated bus-stop detection to accurately identify bus stops, improving transit efficiency. We also introduce IoT-based solar-powered smart bus stops that show real-time passenger counts, along with an RFID-based card system to track where passengers board and exit. A smart door system ensures safer and more organised boarding, while real-time bus tracking keeps passengers informed. To connect all these features, we use an HTTP-based server for seamless communication between the interconnected network systems. Our proposed system demonstrated approximately 99% efficiency in real-time blind spot detection while stopping precisely at the bus stops. Furthermore, the server showed real-time location updates both to the users and at the bus stops, enhancing commuting efficiency. The proposed energy-efficient bus stop demonstrated 12.71kWh energy saving, promoting sustainable architecture. Full implementation and source code are available at: this https URL
zh

[CV-132] RFAssigner: A Generic Label Assignment Strategy for Dense Object Detection

【速读】:该论文旨在解决密集目标检测器在训练过程中因正样本分配不足而导致的小物体尺度不平衡问题(scale imbalance)。现有方法通常为每个训练样本分配正负权重,但在小物体上往往难以获得足够数量的正样本,从而影响多尺度学习能力。解决方案的关键在于提出一种名为RFAssigner的新颖分配策略:首先基于点先验建立初始正样本集,随后利用高斯感受野(Gaussian Receptive Field, GRF)距离度量未分配候选位置与真实框之间的相似性,并据此自适应地从候选池中补充正样本,从而实现更均衡的跨尺度学习过程。

链接: https://arxiv.org/abs/2601.01240
作者: Ziqian Guan,Xieyi Fu,Yuting Wang,Haowen Xiao,Jiarui Zhu,Yingying Zhu,Yongtao Liu,Lin Gu
机构: Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences (中国科学院广州生物医药与健康研究院); North China Institute of Science and Technology (华北理工大学); RIKEN AIP, Japan (日本理化学研究所先进智能研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Label assignment is a critical component in training dense object detectors. State-of-the-art methods typically assign each training sample a positive and a negative weight, optimizing the assignment scheme during training. However, these strategies often assign an insufficient number of positive samples to small objects, leading to a scale imbalance during training. To address this limitation, we introduce RFAssigner, a novel assignment strategy designed to enhance the multi-scale learning capabilities of dense detectors. RFAssigner first establishes an initial set of positive samples using a point-based prior. It then leverages a Gaussian Receptive Field (GRF) distance to measure the similarity between the GRFs of unassigned candidate locations and the ground-truth objects. Based on this metric, RFAssigner adaptively selects supplementary positive samples from the unassigned pool, promoting a more balanced learning process across object scales. Comprehensive experiments on three datasets with distinct object scale distributions validate the effectiveness and generalizability of our method. Notably, a single FCOS-ResNet-50 detector equipped with RFAssigner achieves state-of-the-art performance across all object scales, consistently outperforming existing strategies without requiring auxiliary modules or heuristics.
zh

[CV-133] HyDRA: Hybrid Denoising Regularization for Measurement-Only DEQ Training

【速读】:该论文旨在解决图像重建问题(image reconstruction problem)中因病态性(ill-posedness)以及缺乏大规模监督数据集所带来的挑战,尤其是在仅有测量值 y\mathbf{y} 而无对应真实图像 \mathbfx 的情况下。其解决方案的关键在于提出 HyDRA(Hybrid Denoising Regularization Adaptation)框架,该框架通过引入一种自适应去噪正则化项(adaptive denoising regularization term)来结合测量一致性(measurement consistency),并辅以数据驱动的早停策略(data-driven early stopping criterion),从而实现仅使用测量值 y\mathbf{y} 即可训练深度平衡模型(Deep Equilibrium, DEQ)。实验表明,该方法在稀疏视角CT重建任务中实现了高质量重建与快速推理性能。

链接: https://arxiv.org/abs/2601.01228
作者: Markus Haltmeier,Lukas Neumann,Nadja Gruber,Johannes Schwab,Gyeongha Hwang
机构: University of Innsbruck (因斯布鲁克大学); University of Applied Sciences Kufstein (库夫施泰因应用技术大学); Yeungnam University (岭南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
备注:

点击查看摘要

Abstract:Solving image reconstruction problems of the form (\mathbfA \mathbfx = \mathbfy) remains challenging due to ill-posedness and the lack of large-scale supervised datasets. Deep Equilibrium (DEQ) models have been used successfully but typically require supervised pairs ((\mathbfx,\mathbfy)). In many practical settings, only measurements (\mathbfy) are available. We introduce HyDRA (Hybrid Denoising Regularization Adaptation), a measurement-only framework for DEQ training that combines measurement consistency with an adaptive denoising regularization term, together with a data-driven early stopping criterion. Experiments on sparse-view CT demonstrate competitive reconstruction quality and fast inference.
zh

[CV-134] Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment

【速读】:该论文旨在解决基于预训练扩散模型的Slot Attention (SA) 在对象中心学习(Object-centric Learning, OCL)中面临的两个核心问题:一是槽位之间的纠缠(slot entanglement),即不同对象槽位之间存在干扰,导致对象分离不清晰;二是槽位与图像内容之间对齐不足,影响表示质量。解决方案的关键在于提出对比性对象中心扩散对齐方法(Contrastive Object-centric Diffusion Alignment, CODA),其核心创新包括:(i) 引入注册槽位(register slots)以吸收残余注意力,减少对象槽位间的干扰;(ii) 设计对比对齐损失(contrastive alignment loss),显式强化槽位与图像内容之间的对应关系,从而近似最大化槽位与输入之间的互信息(Mutual Information, MI),显著提升槽位表征质量。该方法在合成数据集(MOVi-C/E)和真实世界数据集(VOC、COCO)上均取得优于强基线的性能提升,且注册槽位带来的计算开销可忽略,保持了高效性和可扩展性。

链接: https://arxiv.org/abs/2601.01224
作者: Bac Nguyen,Yuhta Takida,Naoki Murata,Chieh-Hsin Lai,Toshimitsu Uesaka,Stefano Ermon,Yuki Mitsufuji
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Slot Attention (SA) with pretrained diffusion models has recently shown promise for object-centric learning (OCL), but suffers from slot entanglement and weak alignment between object slots and image content. We propose Contrastive Object-centric Diffusion Alignment (CODA), a simple extension that (i) employs register slots to absorb residual attention and reduce interference between object slots, and (ii) applies a contrastive alignment loss to explicitly encourage slot-image correspondence. The resulting training objective serves as a tractable surrogate for maximizing mutual information (MI) between slots and inputs, strengthening slot representation quality. On both synthetic (MOVi-C/E) and real-world datasets (VOC, COCO), CODA improves object discovery (e.g., +6.1% FG-ARI on COCO), property prediction, and compositional image generation over strong baselines. Register slots add negligible overhead, keeping CODA efficient and scalable. These results indicate potential applications of CODA as an effective framework for robust OCL in complex, real-world scenes.
zh

[CV-135] UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass

【速读】:该论文旨在解决真实世界中3D场景与人体联合重建任务中存在的数据稀缺问题,尤其是由于依赖合成数据导致的“仿真到现实”(sim-to-real)域差距,从而引发重建精度低、人体几何细节不清晰以及在野外视频中对齐效果差的问题。其解决方案的关键在于提出一种创新的训练范式,有效利用未标注的真实世界数据:首先通过一个鲁棒的蒸馏策略,从专家深度模型中提取高频细节以优化人体表面质量;其次采用两阶段监督机制,先在合成数据上学习粗粒度定位,再在真实数据上直接优化SMPL网格与人体点云之间的几何对应关系,最终实现单次前向传播即可同时恢复高保真场景几何、人体点云、相机参数及一致的度量尺度下SMPL人体模型。

链接: https://arxiv.org/abs/2601.01222
作者: Mengfei Li,Peng Li,Zheng Zhang,Jiahao Lu,Chengfeng Zhao,Wei Xue,Qifeng Liu,Sida Peng,Wenxiao Zhang,Wenhan Luo,Yuan Liu,Yike Guo
机构: The Hong Kong University of Science and Technology (香港科技大学); Beijing University of Posts and Telecommunications (北京邮电大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present UniSH, a unified, feed-forward framework for joint metric-scale 3D scene and human reconstruction. A key challenge in this domain is the scarcity of large-scale, annotated real-world data, forcing a reliance on synthetic datasets. This reliance introduces a significant sim-to-real domain gap, leading to poor generalization, low-fidelity human geometry, and poor alignment on in-the-wild videos. To address this, we propose an innovative training paradigm that effectively leverages unlabeled in-the-wild data. Our framework bridges strong, disparate priors from scene reconstruction and HMR, and is trained with two core components: (1) a robust distillation strategy to refine human surface details by distilling high-frequency details from an expert depth model, and (2) a two-stage supervision scheme, which first learns coarse localization on synthetic data, then fine-tunes on real data by directly optimizing the geometric correspondence between the SMPL mesh and the human point cloud. This approach enables our feed-forward model to jointly recover high-fidelity scene geometry, human point clouds, camera parameters, and coherent, metric-scale SMPL bodies, all in a single forward pass. Extensive experiments demonstrate that our model achieves state-of-the-art performance on human-centric scene reconstruction and delivers highly competitive results on global human motion estimation, comparing favorably against both optimization-based frameworks and HMR-only methods. Project page: this https URL
zh

[CV-136] Promptable Foundation Models for SAR Remote Sensing: Adapting the Segment Anything Model for Snow Avalanche Segmentation

【速读】:该论文旨在解决利用合成孔径雷达(Synthetic Aperture Radar, SAR)影像进行雪崩分割与制图时,因高质量标注数据获取成本高、耗时长而制约检测模型训练效率的问题。其核心解决方案在于对Segment Anything Model (SAM) 进行针对性适配:通过引入适配器(adapters)缓解自然图像与SAR影像之间的域差异;设计多编码器结构以处理多通道SAR数据;采用提示工程策略提升小尺度、低对比度雪崩目标的定位精度;并提出一种限制编码器训练时间的高效训练算法,显著降低计算开销。最终将优化后的模型集成至标注工具中,实验证明可大幅提升SAR图像标注速度。

链接: https://arxiv.org/abs/2601.01213
作者: Riccardo Gelato,Carlo Sgaravatti,Jakob Grahn,Giacomo Boracchi,Filippo Maria Bianchi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Remote sensing solutions for avalanche segmentation and mapping are key to supporting risk forecasting and mitigation in mountain regions. Synthetic Aperture Radar (SAR) imagery from Sentinel-1 can be effectively used for this task, but training an effective detection model requires gathering a large dataset with high-quality annotations from domain experts, which is prohibitively time-consuming. In this work, we aim to facilitate and accelerate the annotation of SAR images for avalanche mapping. We build on the Segment Anything Model (SAM), a segmentation foundation model trained on natural images, and tailor it to Sentinel-1 SAR data. Adapting SAM to our use-case requires addressing several domain-specific challenges: (i) domain mismatch, since SAM was not trained on satellite/SAR imagery; (ii) input adaptation, because SAR products typically provide more than three channels, while SAM is constrained to RGB images; (iii) robustness to imprecise prompts that can affect target identification and degrade the segmentation quality, an issue exacerbated in small, low-contrast avalanches; and (iv) training efficiency, since standard fine-tuning is computationally demanding for SAM. We tackle these challenges through a combination of adapters to mitigate the domain gap, multiple encoders to handle multi-channel SAR inputs, prompt-engineering strategies to improve avalanche localization accuracy, and a training algorithm that limits the training time of the encoder, which is recognized as the major bottleneck. We integrate the resulting model into an annotation tool and show experimentally that it speeds up the annotation of SAR images.
zh

[CV-137] Real-Time LiDAR Point Cloud Densification for Low-Latency Spatial Data Transmission

【速读】:该论文旨在解决沉浸式远程呈现(immersive telepresence)系统中低延迟空间传输的关键挑战,即如何高效率地捕获动态三维场景并实现实时处理。核心问题在于LiDAR传感器虽能实时获取三维信息,但生成的点云稀疏,难以满足高质量重建需求。解决方案的关键在于提出一种高速LiDAR点云稠密化方法,通过融合多视角LiDAR数据与高分辨率彩色图像,并采用基于卷积神经网络(Convolutional Neural Network, CNN)实现的联合双边滤波策略,在保持实时性能(30 fps)的同时生成无多视角不一致或鬼影伪影的全高清(Full HD)稠密深度图,显著优于现有基于训练的深度补全方法(速度快15倍以上)。

链接: https://arxiv.org/abs/2601.01210
作者: Kazuhiko Murasaki,Shunsuke Konagai,Masakatsu Aoki,Taiga Yoshida,Ryuichi Tanida
机构: NTT Human Informatics Laboratories
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:To realize low-latency spatial transmission system for immersive telepresence, there are two major problems: capturing dynamic 3D scene densely and processing them in real time. LiDAR sensors capture 3D in real time, but produce sparce point clouds. Therefore, this paper presents a high-speed LiDAR point cloud densification method to generate dense 3D scene with minimal latency, addressing the need for on-the-fly depth completion while maintaining real-time performance. Our approach combines multiple LiDAR inputs with high-resolution color images and applies a joint bilateral filtering strategy implemented through a convolutional neural network architecture. Experiments demonstrate that the proposed method produces dense depth maps at full HD resolution in real time (30 fps), which is over 15x faster than a recent training-based depth completion approach. The resulting dense point clouds exhibit accurate geometry without multiview inconsistencies or ghosting artifacts.
zh

[CV-138] XStreamVGGT: Extremely Memory-Efficient Streaming Vision Geometry Grounded Transformer with KV Cache Compression

【速读】:该论文旨在解决基于学习的3D视觉几何模型在流式重建过程中因KV缓存(Key-Value Cache)无界增长而导致的内存消耗持续上升和推理延迟增加的问题。解决方案的关键在于提出一种无需调优的系统性压缩方法,通过联合剪枝与量化策略对KV缓存进行高效管理:首先利用帧内因果注意力机制识别多视角输入中的冗余键值对并进行剪枝,实现固定内存预算;其次,基于KV张量的独特分布特性引入量化技术进一步降低内存占用。该方法在保持性能基本不变的前提下,将内存使用减少4.42倍,推理速度提升5.48倍,显著提升了流式3D应用的可扩展性和实用性。

链接: https://arxiv.org/abs/2601.01204
作者: Zunhai Su,Weihao Ye,Hansen Feng,Keyu Fan,Jing Zhang,Dahai Yu,Zhengwu Liu,Ngai Wong
机构: Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Institute of Artificial Intelligence, Xiamen University (厦门大学人工智能研究院); China Star Optoelectronics Technology (中国星光电技术有限公司); TCL Corporate Research (HK) Co., Ltd. (TCL香港企业研究中心有限公司); Department of Electrical and Electronic Engineering, The University of Hong Kong (香港大学电子与电机工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Learning-based 3D visual geometry models have benefited substantially from large-scale transformers. Among these, StreamVGGT leverages frame-wise causal attention for strong streaming reconstruction, but suffers from unbounded KV cache growth, leading to escalating memory consumption and inference latency as input frames accumulate. We propose XStreamVGGT, a tuning-free approach that systematically compresses the KV cache through joint pruning and quantization, enabling extremely memory-efficient streaming inference. Specifically, redundant KVs originating from multi-view inputs are pruned through efficient token importance identification, enabling a fixed memory budget. Leveraging the unique distribution of KV tensors, we incorporate KV quantization to further reduce memory consumption. Extensive evaluations show that XStreamVGGT achieves mostly negligible performance degradation while substantially reducing memory usage by 4.42 \times and accelerating inference by 5.48 \times , enabling scalable and practical streaming 3D applications. The code is available at this https URL.
zh

[CV-139] RefSR-Adv: Adversarial Attack on Reference-based Image Super-Resolution Models

【速读】:该论文旨在解决参考图像增强的超分辨率(Reference-based Super-Resolution, RefSR)系统在面对对抗攻击时的安全性问题,尤其是现有研究多聚焦于后门攻击而忽视了对对抗攻击的探索。其解决方案的关键在于提出一种名为RefSR-Adv的新型对抗攻击方法,通过仅扰动参考图像来显著降低超分辨率输出质量,从而在CNN、Transformer和Mamba等多种架构上均引发严重性能退化与伪影生成。实验进一步揭示了低分辨率输入与参考图像之间的相似性越高,攻击效果越强,表明模型对参考特征的过度依赖是导致该安全漏洞的核心原因。

链接: https://arxiv.org/abs/2601.01202
作者: Jiazhu Dai,Huihui Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Single Image Super-Resolution (SISR) aims to recover high-resolution images from low-resolution inputs. Unlike SISR, Reference-based Super-Resolution (RefSR) leverages an additional high-resolution reference image to facilitate the recovery of high-frequency textures. However, existing research mainly focuses on backdoor attacks targeting RefSR, while the vulnerability of the adversarial attacks targeting RefSR has not been fully explored. To fill this research gap, we propose RefSR-Adv, an adversarial attack that degrades SR outputs by perturbing only the reference image. By maximizing the difference between adversarial and clean outputs, RefSR-Adv induces significant performance degradation and generates severe artifacts across CNN, Transformer, and Mamba architectures on the CUFED5, WR-SR, and DRefSR datasets. Importantly, experiments confirm a positive correlation between the similarity of the low-resolution input and the reference image and attack effectiveness, revealing that the model’s over-reliance on reference features is a key security flaw. This study reveals a security vulnerability in RefSR systems, aiming to urge researchers to pay attention to the robustness of RefSR.
zh

[CV-140] MS-ISSM: Objective Quality Assessment of Point Clouds Using Multi-scale Implicit Structural Similarity

【速读】:该论文旨在解决点云(Point Cloud)无结构和不规则特性给客观质量评估(PCQA)带来的挑战,尤其是难以建立准确的感知特征对应关系的问题。其核心解决方案是提出多尺度隐式结构相似性测量方法(MS-ISSM),该方法利用径向基函数(RBF)连续表示局部特征,将失真度量转化为隐式函数系数的比较,从而有效规避了传统点对点匹配在不规则数据中易产生的匹配误差。此外,论文还设计了一种基于ResGrouped-MLP的感知质量评分网络,通过分组编码策略结合残差块与通道注意力机制,实现多尺度特征差异到感知分数的鲁棒映射,保留亮度(luma)、色度(chroma)和几何(geometry)的物理语义,并自适应聚焦于高、中、低不同尺度下的显著失真特征。

链接: https://arxiv.org/abs/2601.01200
作者: Zhang Chen,Shuai Wan,Yuezhe Zhang,Siyu Ren,Fuzheng Yang,Junhui Hou
机构: Northwestern Polytechnical University (西北工业大学); Royal Melbourne Institute of Technology (皇家墨尔本理工大学); City University of Hong Kong (香港城市大学); Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:The unstructured and irregular nature of point clouds poses a significant challenge for objective quality assessment (PCQA), particularly in establishing accurate perceptual feature correspondence. To tackle this, we propose the Multi-scale Implicit Structural Similarity Measurement (MS-ISSM). Unlike traditional point-to-point matching, MS-ISSM utilizes Radial Basis Functions (RBF) to represent local features continuously, transforming distortion measurement into a comparison of implicit function coefficients. This approach effectively circumvents matching errors inherent in irregular data. Additionally, we propose a ResGrouped-MLP quality assessment network, which robustly maps multi-scale feature differences to perceptual scores. The network architecture departs from traditional flat MLPs by adopting a grouped encoding strategy integrated with Residual Blocks and Channel-wise Attention mechanisms. This hierarchical design allows the model to preserve the distinct physical semantics of luma, chroma, and geometry while adaptively focusing on the most salient distortion features across High, Medium, and Low scales. Experimental results on multiple benchmarks demonstrate that MS-ISSM outperforms state-of-the-art metrics in both reliability and generalization. The source code is available at: this https URL.
zh

[CV-141] Crowded Video Individual Counting Informed by Social Grouping and Spatial-Temporal Displacement Priors

【速读】:该论文旨在解决视频个体计数(Video Individual Counting, VIC)任务中在高密度人群场景下性能下降的问题,尤其是传统方法难以准确识别跨帧行人对应关系的挑战。其关键解决方案在于重新审视VIC的本质,并引入两个信息性先验:一是社会分组先验(social grouping prior),表明行人倾向于成群聚集,由此启发将标准的一对一(one-to-one, O2O)匹配机制扩展为一对多(one-to-many, O2M)匹配;二是时空位移先验(spatial-temporal displacement prior),强调个体无法物理瞬移,据此设计位移先验注入器(displacement prior injector),增强O2M匹配、特征提取与模型训练。上述设计共同构成新的VIC基线模型OMAN++,在多个基准数据集上优于现有方法,尤其在高密度场景(如武汉地铁人群数据集WuhanMetroCrowd)中实现38.12%的误差降低。

链接: https://arxiv.org/abs/2601.01192
作者: Hao Lu,Xuhui Zhu,Wenjing Zhang,Yanan Li,Xiang Bai
机构: Huazhong University of Science and Technology (华中科技大学); Wuhan Institute of Technology (武汉工程大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Journal Extension of arXiv:2506.13067

点击查看摘要

Abstract:Video Individual Counting (VIC) is a recently introduced task aiming to estimate pedestrian flux from a video. It extends Video Crowd Counting (VCC) beyond the per-frame pedestrian count. In contrast to VCC that learns to count pedestrians across frames, VIC must identify co-existent pedestrians between frames, which turns out to be a correspondence problem. Existing VIC approaches, however, can underperform in congested scenes such as metro commuting. To address this, we build WuhanMetroCrowd, one of the first VIC datasets that characterize crowded, dynamic pedestrian flows. It features sparse-to-dense density levels, short-to-long video clips, slow-to-fast flow variations, front-to-back appearance changes, and light-to-heavy occlusions. To better adapt VIC approaches to crowds, we rethink the nature of VIC and recognize two informative priors: i) the social grouping prior that indicates pedestrians tend to gather in groups and ii) the spatial-temporal displacement prior that informs an individual cannot teleport physically. The former inspires us to relax the standard one-to-one (O2O) matching used by VIC to one-to-many (O2M) matching, implemented by an implicit context generator and a O2M matcher; the latter facilitates the design of a displacement prior injector, which strengthens not only O2M matching but also feature extraction and model training. These designs jointly form a novel and strong VIC baseline OMAN++. Extensive experiments show that OMAN++ not only outperforms state-of-the-art VIC baselines on the standard SenseCrowd, CroHD, and MovingDroneCrowd benchmarks, but also indicates a clear advantage in crowded scenes, with a 38.12% error reduction on our WuhanMetroCrowd dataset. Code, data, and pretrained models are available at this https URL.
zh

[CV-142] DST-Calib: A Dual-Path Self-Supervised Target-Free LiDAR-Camera Extrinsic Calibration Network

【速读】:该论文旨在解决LiDAR-相机外参标定(LiDAR-camera extrinsic calibration)在真实机器人感知系统中适应性差的问题,现有方法通常依赖于手工设计的标定靶标(如棋盘格)或特定静态场景,限制了其在复杂动态环境中的部署能力。解决方案的关键在于提出一种全新的自监督在线标定网络:首先识别出传统单侧数据增强策略导致的泛化性能下降问题,并引入双侧数据增强技术,利用估计的深度图生成多视角相机图像以提升训练鲁棒性和多样性;在此基础上构建双路径自监督标定框架,减少对高精度真值标签的依赖并支持完全自适应的在线标定;同时,通过差异图构造机制替代传统的双分支特征提取方式,显式关联LiDAR与相机特征,从而提升标定精度并降低模型复杂度。

链接: https://arxiv.org/abs/2601.01188
作者: Zhiwei Huang,Yanwei Fu,Yi Zhou,Xieyuanli Chen,Qijun Chen,Rui Fan
机构: Tongji University (同济大学); Hunan University (湖南大学); National University of Defense Technology (国防科技大学); Xi’an Jiaotong University (西安交通大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:LiDAR-camera extrinsic calibration is essential for multi-modal data fusion in robotic perception systems. However, existing approaches typically rely on handcrafted calibration targets (e.g., checkerboards) or specific, static scene types, limiting their adaptability and deployment in real-world autonomous and robotic applications. This article presents the first self-supervised LiDAR-camera extrinsic calibration network that operates in an online fashion and eliminates the need for specific calibration targets. We first identify a significant generalization degradation problem in prior methods, caused by the conventional single-sided data augmentation strategy. To overcome this limitation, we propose a novel double-sided data augmentation technique that generates multi-perspective camera views using estimated depth maps, thereby enhancing robustness and diversity during training. Built upon this augmentation strategy, we design a dual-path, self-supervised calibration framework that reduces the dependence on high-precision ground truth labels and supports fully adaptive online calibration. Furthermore, to improve cross-modal feature association, we replace the traditional dual-branch feature extraction design with a difference map construction process that explicitly correlates LiDAR and camera features. This not only enhances calibration accuracy but also reduces model complexity. Extensive experiments conducted on five public benchmark datasets, as well as our own recorded dataset, demonstrate that the proposed method significantly outperforms existing approaches in terms of generalizability.
zh

[CV-143] GenCAMO: Scene-Graph Contextual Decoupling for Environment-aware and Mask-free Camouflage Image-Dense Annotation Generation

【速读】:该论文旨在解决复杂伪装场景中密集预测(Conceal dense prediction, CDP)任务因高质量、大规模标注数据稀缺而导致模型性能受限的问题。当前RGB-D伪装目标检测和开放词汇伪装目标分割等任务面临数据获取与标注成本高昂的挑战,制约了模型对细粒度特征、先验知识及辅助推理能力的学习。解决方案的关键在于提出GenCAMO框架——一个环境感知且无需掩码的生成式方法,能够合成高保真度的伪装图像密集标注数据;同时构建了多模态标注的大规模数据集GenCAMO-DB(包含深度图、场景图、属性描述和文本提示),从而有效提升CDP模型在复杂伪装场景下的表现。

链接: https://arxiv.org/abs/2601.01181
作者: Chenglizhao Chen,Shaojiang Yuan,Xiaoxue Lu,Mengke Song,Jia Song,Zhenyu Wu,Wenfeng Song,Shuai Li
机构: China University of Petroleum (East China) (中国石油大学(华东)); Southwest Jiaotong University (西南交通大学); Beijing Information Science and Technology University (北京信息科技大学); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Conceal dense prediction (CDP), especially RGB-D camouflage object detection and open-vocabulary camouflage object segmentation, plays a crucial role in advancing the understanding and reasoning of complex camouflage scenes. However, high-quality and large-scale camouflage datasets with dense annotation remain scarce due to expensive data collection and labeling costs. To address this challenge, we explore leveraging generative models to synthesize realistic camouflage image-dense data for training CDP models with fine-grained representations, prior knowledge, and auxiliary reasoning. Concretely, our contributions are threefold: (i) we introduce GenCAMO-DB, a large-scale camouflage dataset with multi-modal annotations, including depth maps, scene graphs, attribute descriptions, and text prompts; (ii) we present GenCAMO, an environment-aware and mask-free generative framework that produces high-fidelity camouflage image-dense annotations; (iii) extensive experiments across multiple modalities demonstrate that GenCAMO significantly improves dense prediction performance on complex camouflage scenes by providing high-quality synthetic data. The code and datasets will be released after paper acceptance.
zh

[CV-144] CardioMOD-Net: A Modal Decomposition-Neural Network Framework for Diagnosis and Prognosis of HFpEF from Echocardiography Cine Loops

【速读】:该论文旨在解决心力衰竭伴射血分数保留(HFpEF)在前期动物模型中难以早期诊断和预测疾病进展的问题,尤其是现有基于超声心动图的人类人工智能(AI)模型多局限于二分类检测,缺乏对合并症特异性表型的识别以及向失代偿阶段发展的连续时间预测能力。其解决方案的关键在于提出一个统一的AI框架——CardioMOD-Net,该框架通过高阶动态模态分解(HODMD)从标准超声心动图电影序列中提取时序特征,并利用视觉Transformer构建共享潜在表示,同时实现多类别诊断(区分对照组、高血糖、肥胖和系统性动脉高血压模型)与连续预测HFpEF发病年龄的双重任务,在小样本条件下仍表现出良好性能(整体诊断准确率65%,预测误差为21.72周)。

链接: https://arxiv.org/abs/2601.01176
作者: Andrés Bell-Navas,Jesús Garicano-Mena,Antonella Ausiello,Soledad Le Clainche,María Villalba-Orero,Enrique Lara-Pezzi
机构: ETSI Aeronáutica y del Espacio - Universidad Politécnica de Madrid (航空航天工程学院-马德里理工大学); Facultad de Veterinaria, Universidad Complutense de Madrid (兽医学院-康普顿斯大学); Centro de investigación Biomédica en Red Cardiovascular (CIBERCV) (心血管生物医学研究网络中心); Centro Nacional de Investigaciones Cardiovasculares Carlos III (CNIC) (卡洛斯三世心血管研究中心); Center for Computational Simulation (CCS) (计算仿真中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages; 1 figure; letter

点击查看摘要

Abstract:Introduction: Heart failure with preserved ejection fraction (HFpEF) arises from diverse comorbidities and progresses through prolonged subclinical stages, making early diagnosis and prognosis difficult. Current echocardiography-based Artificial Intelligence (AI) models focus primarily on binary HFpEF detection in humans and do not provide comorbidity-specific phenotyping or temporal estimates of disease progression towards decompensation. We aimed to develop a unified AI framework, CardioMOD-Net, to perform multiclass diagnosis and continuous prediction of HFpEF onset directly from standard echocardiography cine loops in preclinical models. Methods: Mouse echocardiography videos from four groups were used: control (CTL), hyperglycaemic (HG), obesity (OB), and systemic arterial hypertension (SAH). Two-dimensional parasternal long-axis cine loops were decomposed using Higher Order Dynamic Mode Decomposition (HODMD) to extract temporal features for downstream analysis. A shared latent representation supported Vision Transformers, one for a classifier for diagnosis and another for a regression module for predicting the age at HFpEF onset. Results: Overall diagnostic accuracy across the four groups was 65%, with all classes exceeding 50% accuracy. Misclassifications primarily reflected early-stage overlap between OB or SAH and CTL. The prognostic module achieved a root-mean-square error of 21.72 weeks for time-to-HFpEF prediction, with OB and SAH showing the most accurate estimates. Predicted HFpEF onset closely matched true distributions in all groups. Discussion: This unified framework demonstrates that multiclass phenotyping and continuous HFpEF onset prediction can be obtained from a single cine loop, even under small-data conditions. The approach offers a foundation for integrating diagnostic and prognostic modelling in preclinical HFpEF research. Comments: 9 pages; 1 figure; letter Subjects: Computer Vision and Pattern Recognition (cs.CV) MSC classes: 15A18, 49M27, 68T07, 76-10, 76A05, 92-04, 92-10 ACMclasses: I.2.1; I.5.1; I.5.4; J.3 Cite as: arXiv:2601.01176 [cs.CV] (or arXiv:2601.01176v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2601.01176 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Andrés Bell-Navas [view email] [v1] Sat, 3 Jan 2026 12:41:14 UTC (912 KB)
zh

[CV-145] Cross-Layer Attentive Feature Upsampling for Low-latency Semantic Segmentation

【速读】:该论文旨在解决当前基于坐标引导的低分辨率特征插值方法(如双线性插值)在语义分割任务中产生的高分辨率特征粗粒度问题,这些问题导致特征错位和上下文信息不足,同时高分辨率特征语义增强带来较高的计算负担,难以满足低延迟推理需求。解决方案的关键在于提出一种新颖的**引导注意力插值(Guided Attentive Interpolation, GAI)**方法,该方法通过建模不同分辨率特征间像素的空间与语义关系,自适应地生成富含语义信息的细粒度高分辨率特征,从而实现高效且精确的密集预测。GAI可无缝集成至任意深度卷积神经网络以提升语义分割性能,在Cityscapes和CamVid数据集上分别达到78.8 mIoU(22.3 FPS)和80.6 mIoU(64.5 FPS)的新纪录,显著优于现有低延迟方案。

链接: https://arxiv.org/abs/2601.01167
作者: Tianheng Cheng,Xinggang Wang,Junchao Liao,Wenyu Liu
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Semantic segmentation is a fundamental problem in computer vision and it requires high-resolution feature maps for dense prediction. Current coordinate-guided low-resolution feature interpolation methods, e.g., bilinear interpolation, produce coarse high-resolution features which suffer from feature misalignment and insufficient context information. Moreover, enriching semantics to high-resolution features requires a high computation burden, so that it is challenging to meet the requirement of lowlatency inference. We propose a novel Guided Attentive Interpolation (GAI) method to adaptively interpolate fine-grained high-resolution features with semantic features to tackle these issues. Guided Attentive Interpolation determines both spatial and semantic relations of pixels from features of different resolutions and then leverages these relations to interpolate high-resolution features with rich semantics. GAI can be integrated with any deep convolutional network for efficient semantic segmentation. In experiments, the GAI-based semantic segmentation networks, i.e., GAIN, can achieve78.8 mIoU with 22.3 FPS on Cityscapes and 80.6 mIoU with 64.5 on CamVid using an NVIDIA 1080Ti GPU, which are the new state-of-the-art results of low-latency semantic segmentation. Code and models are available at: this https URL.
zh

[CV-146] Histogram Assisted Quality Aware Generative Model for Resolution Invariant NIR Image Colorization WACV2026

【速读】:该论文旨在解决近红外到可见光(NIR-to-RGB)图像颜色化过程中存在的色彩真实性与结构保真度难以平衡的问题,尤其是在高分辨率场景下保持纹理细节和全局色彩统计一致性。解决方案的关键在于提出了一种统一的生成式模型 HAQAGen,其核心创新包括:(i) 通过可微分直方图匹配、感知图像质量度量和特征级相似性构建联合损失函数,以对齐全局色彩统计并保留纹理信息;(ii) 利用空间自适应去归一化(Spatially Adaptive Denormalization, SPADE)注入局部色调-饱和度先验,提升色彩重建稳定性;(iii) 在 Mamba 主干网络中引入纹理感知监督机制,增强细粒度结构保留能力;同时设计了自适应分辨率推理引擎,实现无损高分辨率翻译。该方法在多个基准数据集上均显著优于现有最优方法,在感知指标和视觉质量上取得实质性提升。

链接: https://arxiv.org/abs/2601.01103
作者: Abhinav Attri,Rajeev Ranjan Dwivedi,Samiran Das,Vinod Kumar Kurmi
机构: Indian Institute of Science Education and Research Bhopal (印度科学教育与研究学院博帕尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted at WACV 2026

点击查看摘要

Abstract:We present HAQAGen, a unified generative model for resolution-invariant NIR-to-RGB colorization that balances chromatic realism with structural fidelity. The proposed model introduces (i) a combined loss term aligning the global color statistics through differentiable histogram matching, perceptual image quality measure, and feature based similarity to preserve texture information, (ii) local hue-saturation priors injected via Spatially Adaptive Denormalization (SPADE) to stabilize chromatic reconstruction, and (iii) texture-aware supervision within a Mamba backbone to preserve fine details. We introduce an adaptive-resolution inference engine that further enables high-resolution translation without sacrificing quality. Our proposed NIR-to-RGB translation model simultaneously enforces global color statistics and local chromatic consistency, while scaling to native resolutions without compromising texture fidelity or generalization. Extensive evaluations on FANVID, OMSIV, VCIP2020, and RGB2NIR using different evaluation metrics demonstrate consistent improvements over state-of-the-art baseline methods. HAQAGen produces images with sharper textures, natural colors, attaining significant gains as per perceptual metrics. These results position HAQAGen as a scalable and effective solution for NIR-to-RGB translation across diverse imaging scenarios. Project Page: this https URL
zh

[CV-147] Evolving CNN Architectures: From Custom Designs to Deep Residual Models for Diverse Image Classification and Detection Tasks

【速读】:该论文旨在解决在不同复杂度的图像识别任务中,如何选择合适的卷积神经网络(CNN)架构以实现最优性能的问题。其核心挑战在于平衡模型复杂度与任务需求之间的关系,尤其是在资源受限场景下区分定制化深度网络与预训练/迁移学习模型的有效性差异。解决方案的关键在于通过系统性比较自定义CNN架构与主流预训练模型在五类真实世界图像数据集上的表现,揭示网络深度、残差连接和特征提取策略等结构因素对分类精度和定位能力的影响机制;同时,将所提架构扩展至目标检测任务,验证其在实际交通场景中识别非法三轮车的应用潜力,从而为不同任务复杂度和计算资源约束下的模型选型提供实证依据和实践指导。

链接: https://arxiv.org/abs/2601.01099
作者: Mahmudul Hasan,Mabsur Fatin Bin Hossain
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents a comparative study of a custom convolutional neural network (CNN) architecture against widely used pretrained and transfer learning CNN models across five real-world image datasets. The datasets span binary classification, fine-grained multiclass recognition, and object detection scenarios. We analyze how architectural factors, such as network depth, residual connections, and feature extraction strategies, influence classification and localization performance. The results show that deeper CNN architectures provide substantial performance gains on fine-grained multiclass datasets, while lightweight pretrained and transfer learning models remain highly effective for simpler binary classification tasks. Additionally, we extend the proposed architecture to an object detection setting, demonstrating its adaptability in identifying unauthorized auto-rickshaws in real-world traffic scenes. Building upon a systematic analysis of custom CNN architectures alongside pretrained and transfer learning models, this study provides practical guidance for selecting suitable network designs based on task complexity and resource constraints.
zh

[CV-148] NarrativeTrack: Evaluating Video Language Models Beyond the Frame

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视频叙事理解方面的能力不足问题,尤其是其在动态视觉与时间上下文中对实体(entity)连续性与语义一致性的建模能力薄弱。现有基准测试通常局限于短片段或粗粒度场景级语义,难以评估模型对“谁在何时何地做什么”的细粒度实体中心推理能力。解决方案的关键在于提出首个面向叙事理解的基准测试工具——NarrativeTrack,其核心是Compositional Reasoning Progression (CRP)结构化评估框架,通过三个维度逐步提升叙事复杂度:实体存在性、实体变化性和实体模糊性,从而系统性地考察模型从时间持久性到情境演化再到精细感知推理的进阶能力。此外,研究构建了一个全自动的以实体为中心的流水线,用于提取时序锚定的实体表征,为CRP提供基础支撑。实验表明,当前主流MLLMs在跨视觉转换和时间动态中难以稳定追踪实体,存在显著的身份幻觉现象,揭示了感知锚定与时间推理之间的根本权衡关系,强调叙事理解需二者融合才能实现。

链接: https://arxiv.org/abs/2601.01095
作者: Hyeonjeong Ha,Jinjin Ge,Bo Feng,Kaixin Ma,Gargi Chakraborty
机构: Apple(苹果); University of Illinois Urbana-Champaign(伊利诺伊大学香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: VideoLLM Fine-Grained Evaluation

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have achieved impressive progress in vision-language reasoning, yet their ability to understand temporally unfolding narratives in videos remains underexplored. True narrative understanding requires grounding who is doing what, when, and where, maintaining coherent entity representations across dynamic visual and temporal contexts. We introduce NarrativeTrack, the first benchmark to evaluate narrative understanding in MLLMs through fine-grained entity-centric reasoning. Unlike existing benchmarks limited to short clips or coarse scene-level semantics, we decompose videos into constituent entities and examine their continuity via a Compositional Reasoning Progression (CRP), a structured evaluation framework that progressively increases narrative complexity across three dimensions: entity existence, entity changes, and entity ambiguity. CRP challenges models to advance from temporal persistence to contextual evolution and fine-grained perceptual reasoning. A fully automated entity-centric pipeline enables scalable extraction of temporally grounded entity representations, providing the foundation for CRP. Evaluations of state-of-the-art MLLMs reveal that models fail to robustly track entities across visual transitions and temporal dynamics, often hallucinating identity under context shifts. Open-source general-purpose MLLMs exhibit strong perceptual grounding but weak temporal coherence, while video-specific MLLMs capture temporal context yet hallucinate entity’s contexts. These findings uncover a fundamental trade-off between perceptual grounding and temporal reasoning, indicating that narrative understanding emerges only from their integration. NarrativeTrack provides the first systematic framework to diagnose and advance temporally grounded narrative comprehension in MLLMs.
zh

[CV-149] Luminark: Training-free Probabilistically-Certified Watermarking for General Vision Generative Models

【速读】:该论文旨在解决生成式视觉模型(generative vision models)中内容溯源与版权保护的问题,即如何在不改变图像质量的前提下,实现对生成图像的水印注入与可靠检测。其解决方案的关键在于提出了一种训练-free且具有概率保证的水印方法Luminark,该方法基于patch-level luminance statistics定义水印,并通过预设二进制模式与对应阈值来实现检测;同时,利用广泛采用的guidance技术作为插件式机制(watermark guidance),使水印能够无缝嵌入多种生成范式(如扩散模型、自回归模型及混合框架),从而在保持高检测准确率、强鲁棒性的同时,确保视觉质量不受影响。

链接: https://arxiv.org/abs/2601.01085
作者: Jiayi Xu,Zhang Zhang,Yuanrui Zhang,Ruitao Chen,Yixian Xu,Tianyu He,Di He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this paper, we introduce \emphLuminark, a training-free and probabilistically-certified watermarking method for general vision generative models. Our approach is built upon a novel watermark definition that leverages patch-level luminance statistics. Specifically, the service provider predefines a binary pattern together with corresponding patch-level thresholds. To detect a watermark in a given image, we evaluate whether the luminance of each patch surpasses its threshold and then verify whether the resulting binary pattern aligns with the target one. A simple statistical analysis demonstrates that the false positive rate of the proposed method can be effectively controlled, thereby ensuring certified detection. To enable seamless watermark injection across different paradigms, we leverage the widely adopted guidance technique as a plug-and-play mechanism and develop the \emphwatermark guidance. This design enables Luminark to achieve generality across state-of-the-art generative models without compromising image quality. Empirically, we evaluate our approach on nine models spanning diffusion, autoregressive, and hybrid frameworks. Across all evaluations, Luminark consistently demonstrates high detection accuracy, strong robustness against common image transformations, and good performance on visual quality.
zh

[CV-150] A UAV-Based Multispectral and RGB Dataset for Multi-Stage Paddy Crop Monitoring in Indian Agricultural Fields

【速读】:该论文旨在解决印度水稻作物全生长阶段高分辨率遥感数据缺乏的问题,以支持精准农业中的靶向施药、病害分析和产量估算等应用。解决方案的关键在于构建了一个大规模、标准化采集的无人机(UAV)多光谱与RGB图像数据集,覆盖从育秧到收获的完整生长周期,包含42,430张原始图像(415 GB),地面采样距离(GSD)为1 cm/pixel,并配有GPS坐标、飞行高度及环境条件等丰富元数据;通过Pix4D Fields软件验证生成正射影像和植被指数图(如NDVI和NDRE),确保数据质量与可用性,从而为后续农业智能分析提供可靠的数据基础。

链接: https://arxiv.org/abs/2601.01084
作者: Adari Rama Sukanya,Puvvula Roopesh Naga Sri Sai,Kota Moses,Rimalapudi Sarvendranath
机构: IIT Tirupati (印度理工学院特里普蒂); DroneHub Technologies Pvt. Ltd. (DroneHub科技有限公司); IIT Tirupati Navavishkar I-Hub Foundation (印度理工学院特里普蒂导航创新枢纽基金会)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 10-page dataset explanation paper

点击查看摘要

Abstract:We present a large-scale unmanned aerial vehicle (UAV)-based RGB and multispectral image dataset collected over paddy fields in the Vijayawada region, Andhra Pradesh, India, covering nursery to harvesting stages. We used a 20-megapixel RGB camera and a 5-megapixel four-band multispectral camera capturing red, green, red-edge, and near-infrared bands. Standardised operating procedure (SOP) and checklists were developed to ensure repeatable data acquisition. Our dataset comprises of 42,430 raw images (415 GB) captured over 5 acres with 1 cm/pixel ground sampling distance (GSD) with associated metadata such as GPS coordinates, flight altitude, and environmental conditions. Captured images were validated using Pix4D Fields to generate orthomosaic maps and vegetation index maps, such as normalised difference vegetation index (NDVI) and normalised difference red-edge (NDRE) index. Our dataset is one of the few datasets that provide high-resolution images with rich metadata that cover all growth stages of Indian paddy crops. The dataset is available on IEEE DataPort with DOI, . It can support studies on targeted spraying, disease analysis, and yield estimation.
zh

[CV-151] Flow Equivariant World Models: Memory for Partially Observed Dynamic Environments

【速读】:该论文旨在解决当前神经网络世界模型在处理具身系统(embodied systems)感知与运动耦合动态时,因忽略连续流的对称性结构而导致的表征不稳定和数据效率低下的问题。其关键解决方案是提出“流等变世界模型”(Flow Equivariant World Models),将自运动与外部物体运动统一为单参数李群(one-parameter Lie group)的“流”,并在此基础上实现关于这些变换的群等变性(group equivariance),从而在数百个时间步内保持稳定的潜在世界表示。这一结构化建模方法显著提升了长程预测能力,在2D和3D部分观测视频世界建模基准上优于当前最先进的扩散模型和记忆增强架构,尤其在代理视野外存在可预测动力学时表现突出。

链接: https://arxiv.org/abs/2601.01075
作者: Hansen Jin Lillemark,Benhao Huang,Fangneng Zhan,Yilun Du,Thomas Anderson Keller
机构: Kempner Institute, Harvard University (哈佛大学肯普纳研究所); CSE, UC San Diego (加州大学圣地亚哥分校计算机科学与工程系); ML, Carnegie Mellon University (卡内基梅隆大学机器学习系); SEAS, Harvard University (哈佛大学工程与应用科学学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 main text pages, 10 figures

点击查看摘要

Abstract:Embodied systems experience the world as ‘a symphony of flows’: a combination of many continuous streams of sensory input coupled to self-motion, interwoven with the dynamics of external objects. These streams obey smooth, time-parameterized symmetries, which combine through a precisely structured algebra; yet most neural network world models ignore this structure and instead repeatedly re-learn the same transformations from data. In this work, we introduce ‘Flow Equivariant World Models’, a framework in which both self-motion and external object motion are unified as one-parameter Lie group ‘flows’. We leverage this unification to implement group equivariance with respect to these transformations, thereby providing a stable latent world representation over hundreds of timesteps. On both 2D and 3D partially observed video world modeling benchmarks, we demonstrate that Flow Equivariant World Models significantly outperform comparable state-of-the-art diffusion-based and memory-augmented world modeling architectures – particularly when there are predictable world dynamics outside the agent’s current field of view. We show that flow equivariance is particularly beneficial for long rollouts, generalizing far beyond the training horizon. By structuring world model representations with respect to internal and external motion, flow equivariance charts a scalable route to data efficient, symmetry-guided, embodied intelligence. Project link: this https URL.
zh

[CV-152] Efficient Hyperspectral Image Reconstruction Using Lightweight Separate Spectral Transformers

【速读】:该论文旨在解决从压缩感知(Compressive Sensing, CS)测量中高效重建高光谱图像(Hyperspectral Imaging, HSI)的问题,其核心挑战在于如何在保证重建质量的同时显著降低计算复杂度。解决方案的关键在于提出一种轻量级的分离式光谱 Transformer 架构(Lightweight Separate Spectral Transformer, LSST),该架构通过两个核心组件实现高效建模:一是分离式光谱 Transformer 块(Separate Spectral Transformer Block, SSTB),利用分组光谱自注意力机制与光谱洗牌操作有效捕捉局部与非局部的光谱相关性;二是轻量化空间卷积块(Lightweight Spatial Convolution Block, LSCB),采用深度可分离卷积和优化的结构顺序提升空间信息处理效率。此外,引入焦点光谱损失(Focal Spectrum Loss)动态调整不同波段的权重,增强对光谱复杂区域的重建能力,从而在显著减少浮点运算次数(FLOPs)和参数量的前提下实现更优的重建性能。

链接: https://arxiv.org/abs/2601.01064
作者: Jianan Li,Wangcai Zhao,Tingfa Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Hyperspectral imaging (HSI) is essential across various disciplines for its capacity to capture rich spectral information. However, efficiently reconstructing hyperspectral images from compressive sensing measurements presents significant challenges. To tackle these, we adopt a divide-and-conquer strategy that capitalizes on the unique spectral and spatial characteristics of hyperspectral images. We introduce the Lightweight Separate Spectral Transformer (LSST), an innovative architecture tailored for efficient hyperspectral image reconstruction. This architecture consists of Separate Spectral Transformer Blocks (SSTB) for modeling spectral relationships and Lightweight Spatial Convolution Blocks (LSCB) for spatial processing. The SSTB employs Grouped Spectral Self-attention and a Spectrum Shuffle operation to effectively manage both local and non-local spectral relationships. Simultaneously, the LSCB utilizes depth-wise separable convolutions and strategic ordering to enhance spatial information processing. Furthermore, we implement the Focal Spectrum Loss, a novel loss weighting mechanism that dynamically adjusts during training to improve reconstruction across spectrally complex bands. Extensive testing demonstrates that our LSST achieves superior performance while requiring fewer FLOPs and parameters, underscoring its efficiency and effectiveness. The source code is available at: this https URL.
zh

[CV-153] SPoRC-VIST: A Benchmark for Evaluating Generative Natural Narrative in Vision-Language Models WACV2026

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在生成多角色、长篇播客对话等复杂叙事任务中的能力不足问题,尤其是现有评估指标(如BLEU和ROUGE)难以有效衡量对话自然度、人物个性和叙事流畅性的问题。其解决方案的关键在于提出了一种端到端的视觉播客生成流程,并采用“从合成到真实”的训练策略:在结构化播客研究语料库(SPoRC)中配对合成图像与高质量播客对话进行训练,再在真实照片序列数据集(VIST)上进行测试,从而验证模型从合成数据到真实视觉域的泛化能力;同时构建了一个超越文本重叠的综合评估框架,结合AI评判者(Gemini 3 Pro、Claude Opus 4.5、GPT 5.2)和新颖风格指标(平均发言长度、说话人切换率),显著提升了对生成内容质量的客观评价精度。

链接: https://arxiv.org/abs/2601.01062
作者: Yunlin Zeng
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 3 figures. Accepted to WVAQ 2026, WACV 2026

点击查看摘要

Abstract:Vision-Language Models (VLMs) have achieved remarkable success in descriptive tasks such as image captioning and visual question answering (VQA). However, their ability to generate engaging, long-form narratives – specifically multi-speaker podcast dialogues – remains under-explored and difficult to evaluate. Standard metrics like BLEU and ROUGE fail to capture the nuances of conversational naturalness, personality, and narrative flow, often rewarding safe, repetitive outputs over engaging storytelling. In this work, we present a novel pipeline for end-to-end visual podcast generation, and fine-tune a Qwen3-VL-32B model on a curated dataset of 4,000 image-dialogue pairs. Crucially, we use a synthetic-to-real training strategy: we train on high-quality podcast dialogues from the Structured Podcast Research Corpus (SPoRC) paired with synthetically generated imagery, and evaluate on real-world photo sequences from the Visual Storytelling Dataset (VIST). This rigorous setup tests the model’s ability to generalize from synthetic training data to real-world visual domains. We propose a comprehensive evaluation framework that moves beyond textual overlap, and use AI-as-a-judge (Gemini 3 Pro, Claude Opus 4.5, GPT 5.2) and novel style metrics (average turn length, speaker switch rate) to assess quality. Our experiments demonstrate that our fine-tuned 32B model significantly outperforms a 235B base model in conversational naturalness ( 80% win rate) and narrative depth (+50% turn length), while maintaining identical visual grounding capabilities (CLIPScore: 20.39).
zh

[CV-154] Enhancing Histopathological Image Classification via Integrated HOG and Deep Features with Robust Noise Performance

【速读】:该论文旨在解决数字病理学中自动化图像分析的性能瓶颈问题,特别是在五类组织病理图像分类任务中的准确性和鲁棒性提升。其解决方案的关键在于利用微调后的InceptionResNet-v2网络进行特征提取,并结合深度特征与传统手工特征(如HOG)构建混合模型;实验表明,基于深度特征训练的神经网络模型在分类准确率(99.84%)和AUC(99.99%)上显著优于仅使用预训练网络的模型,且在不同信噪比(SNR)条件下展现出更强的鲁棒性,尤其以GBM和KNN模型表现突出。

链接: https://arxiv.org/abs/2601.01056
作者: Ifeanyi Ezuma,Ugochukwu Ugwu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 8 figures. Code and datasets available upon request

点击查看摘要

Abstract:The era of digital pathology has advanced histopathological examinations, making automated image analysis essential in clinical practice. This study evaluates the classification performance of machine learning and deep learning models on the LC25000 dataset, which includes five classes of histopathological images. We used the fine-tuned InceptionResNet-v2 network both as a classifier and for feature extraction. Our results show that the fine-tuned InceptionResNet-v2 achieved a classification accuracy of 96.01% and an average AUC of 96.8%. Models trained on deep features from InceptionResNet-v2 outperformed those using only the pre-trained network, with the Neural Network model achieving an AUC of 99.99% and accuracy of 99.84%. Evaluating model robustness under varying SNR conditions revealed that models using deep features exhibited greater resilience, particularly GBM and KNN. The combination of HOG and deep features showed enhanced performance, however, less so in noisy environments.
zh

[CV-155] EgoGrasp: World-Space Hand-Object Interaction Estimation from Egocentric Videos

【速读】:该论文旨在解决从第一人称单目视频中重建世界空间手-物体交互(World-space Hand-Object Interactions, W-HOI)的问题,现有方法通常局限于静态图像或相机坐标系,无法建模时序动态性或保持全局轨迹一致性,且在严重相机运动和频繁遮挡的野外场景下性能下降。解决方案的关键在于提出一个三阶段框架:首先构建基于新型空间智能模型的鲁棒预处理流水线;其次引入基于解耦扩散模型的全身手-物体交互先验模型(template-free且可扩展至多物体);最后采用多目标测试时优化范式,从而实现高精度、鲁棒的W-HOI重建。

链接: https://arxiv.org/abs/2601.01050
作者: Hongming Fu,Wenjia Wang,Xiaozhen Qiao,Shuo Yang,Zheng Liu,Bo Zhao
机构: Shanghai Jiao Tong University (上海交通大学); The University of Hong Kong (香港大学); University of Science and Technology of China (中国科学技术大学); Harbin Institute of Technology (Shenzhen) (哈尔滨工业大学(深圳)); Beijing Academy of Artificial Intelligence (北京人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:We propose EgoGrasp, the first method to reconstruct world-space hand-object interactions (W-HOI) from egocentric monocular videos with dynamic cameras in the wild. Accurate W-HOI reconstruction is critical for understanding human behavior and enabling applications in embodied intelligence and virtual reality. However, existing hand-object interactions (HOI) methods are limited to single images or camera coordinates, failing to model temporal dynamics or consistent global trajectories. Some recent approaches attempt world-space hand estimation but overlook object poses and HOI constraints. Their performance also suffers under severe camera motion and frequent occlusions common in egocentric in-the-wild videos. To address these challenges, we introduce a multi-stage framework with a robust pre-process pipeline built on newly developed spatial intelligence models, a whole-body HOI prior model based on decoupled diffusion models, and a multi-objective test-time optimization paradigm. Our HOI prior model is template-free and scalable to multiple objects. In experiments, we prove our method achieving state-of-the-art performance in W-HOI reconstruction.
zh

[CV-156] Evaluating transfer learning strategies for improving dairy cattle body weight prediction in small farms using depth-image and point-cloud data

【速读】:该论文旨在解决小规模奶牛养殖场在缺乏足够标注数据情况下,如何利用迁移学习(transfer learning)提升体质量预测精度的问题,同时比较深度图像(depth image)与点云(point cloud)两种模态在体质量预测中的性能差异。其关键解决方案在于:通过在大型农场预训练的模型权重进行微调(fine-tuning),显著改善了小农场上的预测表现,且效果优于单一来源训练或联合训练;此外,研究发现深度图像和点云方法在不同实验设计下无明显性能差异,表明迁移学习可有效缓解因隐私、物流或政策限制导致的数据共享障碍,仅需共享预训练模型权重即可实现跨农场泛化,为小规模牧场提供了一种高效、可行的智能监测方案。

链接: https://arxiv.org/abs/2601.01044
作者: Jin Wang,Angelo De Castro,Yuxi Zhang,Lucas Basolli Borsatto,Yuechen Guo,Victoria Bastos Primo,Ana Beatriz Montevecchio Bernardino,Gota Morota,Ricardo C Chebel,Haipeng Yu
机构: University of Florida (佛罗里达大学); The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Computer vision provides automated, non-invasive, and scalable tools for monitoring dairy cattle, thereby supporting management, health assessment, and phenotypic data collection. Although transfer learning is commonly used for predicting body weight from images, its effectiveness and optimal fine-tuning strategies remain poorly understood in livestock applications, particularly beyond the use of pretrained ImageNet or COCO weights. In addition, while both depth images and three-dimensional point-cloud data have been explored for body weight prediction, direct comparisons of these two modalities in dairy cattle are limited. Therefore, the objectives of this study were to 1) evaluate whether transfer learning from a large farm enhances body weight prediction on a small farm with limited data, and 2) compare the predictive performance of depth-image- and point-cloud-based approaches under three experimental designs. Top-view depth images and point-cloud data were collected from 1,201, 215, and 58 cows at large, medium, and small dairy farms, respectively. Four deep learning models were evaluated: ConvNeXt and MobileViT for depth images, and PointNet and DGCNN for point clouds. Transfer learning markedly improved body weight prediction on the small farm across all four models, outperforming single-source learning and achieving gains comparable to or greater than joint learning. These results indicate that pretrained representations generalize well across farms with differing imaging conditions and dairy cattle populations. No consistent performance difference was observed between depth-image- and point-cloud-based models. Overall, these findings suggest that transfer learning is well suited for small farm prediction scenarios where cross-farm data sharing is limited by privacy, logistical, or policy constraints, as it requires access only to pretrained model weights rather than raw data.
zh

[CV-157] Deepfake Detection with Multi-Artifact Subspace Fine-Tuning and Selective Layer Masking

【速读】:该论文旨在解决深度伪造(Deepfake)检测在跨数据集和真实场景中泛化能力不足的问题,其核心挑战在于不同伪造方法引入的伪影分布高度多样,而预训练模型在适应新伪影时容易破坏原有的语义结构。解决方案的关键在于提出基于多伪影子空间与选择性层掩码(MASM)的方法,通过奇异值分解将预训练权重划分为稳定的语义主子空间与多个可学习的伪影子空间,实现语义表示与伪影表示的显式解耦;同时引入选择性层掩码策略,根据各伪影子空间的学习状态自适应调控网络层更新行为,抑制对单一伪造特征的过拟合,并结合正交性和谱一致性约束,促使多个伪影子空间学习互补且多样化的伪影表征,从而提升整体检测模型的鲁棒性与泛化性能。

链接: https://arxiv.org/abs/2601.01041
作者: Xiang Zhang,Wenliang Weng,Daoyong Fu,Ziqiang Li,Zhangjie Fu
机构: Nanjing University of Information Science and Technology (南京信息工程大学); Engineering Research Center of Digital Forensics, Ministry of Education (教育部数字取证工程研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Deepfake detection still faces significant challenges in cross-dataset and real-world complex scenarios. The root cause lies in the high diversity of artifact distributions introduced by different forgery methods, while pretrained models tend to disrupt their original general semantic structures when adapting to new artifacts. Existing approaches usually rely on indiscriminate global parameter updates or introduce additional supervision signals, making it difficult to effectively model diverse forgery artifacts while preserving semantic stability. To address these issues, this paper proposes a deepfake detection method based on Multi-Artifact Subspaces and selective layer masks (MASM), which explicitly decouples semantic representations from artifact representations and constrains the fitting strength of artifact subspaces, thereby improving generalization robustness in cross-dataset scenarios. Specifically, MASM applies singular value decomposition to model weights, partitioning pretrained weights into a stable semantic principal subspace and multiple learnable artifact subspaces. This design enables decoupled modeling of different forgery artifact patterns while preserving the general semantic subspace. On this basis, a selective layer mask strategy is introduced to adaptively regulate the update behavior of corresponding network layers according to the learning state of each artifact subspace, suppressing overfitting to any single forgery characteristic. Furthermore, orthogonality constraints and spectral consistency constraints are imposed to jointly regularize multiple artifact subspaces, guiding them to learn complementary and diverse artifact representations while maintaining a stable overall spectral structure.
zh

[CV-158] Mono3DV: Monocular 3D Object Detection with 3D-Aware Bipartite Matching and Variational Query DeNoising

【速读】:该论文旨在解决基于Transformer的单目3D目标检测方法(如DETR类架构)中因忽略3D属性而导致的匹配偏差问题。由于单目图像到3D空间的估计本身具有病态性(ill-posed nature),仅依赖2D特征进行二分图匹配(bipartite matching)会导致高质量3D预测被错误抑制,从而影响整体性能。其核心解决方案是提出Mono3DV框架,包含三个关键创新:1)引入3D感知的二分图匹配策略(3D-Aware Bipartite Matching),将3D几何信息直接融入匹配代价函数以纠正纯2D匹配带来的错位;2)设计3D去噪机制(3D-DeNoising)以稳定训练过程,缓解3D属性引入带来的不稳定性;3)提出一种变分查询去噪机制(Variational Query DeNoising),克服传统去噪技术中的梯度消失问题,显著提升模型性能。该方法在不使用任何外部数据的情况下,在KITTI 3D目标检测基准上达到最先进水平。

链接: https://arxiv.org/abs/2601.01036
作者: Kiet Dang Vu,Trung Thai Tran,Kien Nguyen Do Trung,Duc Dung Nguyen
机构: Ho Chi Minh University of Technology, VNUHCM (胡志明市科学技术大学,越南国家大学胡志明市分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While DETR-like architectures have demonstrated significant potential for monocular 3D object detection, they are often hindered by a critical limitation: the exclusion of 3D attributes from the bipartite matching process. This exclusion arises from the inherent ill-posed nature of 3D estimation from monocular image, which introduces instability during training. Consequently, high-quality 3D predictions can be erroneously suppressed by 2D-only matching criteria, leading to suboptimal results. To address this, we propose Mono3DV, a novel Transformer-based framework. Our approach introduces three key innovations. First, we develop a 3D-Aware Bipartite Matching strategy that directly incorporates 3D geometric information into the matching cost, resolving the misalignment caused by purely 2D criteria. Second, it is important to stabilize the Bipartite Matching to resolve the instability occurring when integrating 3D attributes. Therefore, we propose 3D-DeNoising scheme in the training phase. Finally, recognizing the gradient vanishing issue associated with conventional denoising techniques, we propose a novel Variational Query DeNoising mechanism to overcome this limitation, which significantly enhances model performance. Without leveraging any external data, our method achieves state-of-the-art results on the KITTI 3D object detection benchmark.
zh

[CV-159] Enhanced Leukemic Cell Classification Using Attention-Based CNN and Data Augmentation

【速读】:该论文旨在解决急性淋巴细胞白血病(Acute Lymphoblastic Leukemia, ALL)诊断中依赖专家显微镜判读所面临的观察者间差异大和耗时的问题。其核心解决方案是构建一个可复现的深度学习流水线,关键在于采用基于注意力机制的卷积神经网络架构(结合EfficientNetV2-B3与Squeeze-and-Excitation模块),并辅以全面的数据增强、焦点损失(focal loss)处理类别不平衡以及按患者划分数据集的方法,从而在保证高准确率(测试集F1-score达97.89%)的同时实现模型的鲁棒性和可解释性,且参数量仅为VGG16的11%,具备临床部署潜力。

链接: https://arxiv.org/abs/2601.01026
作者: Douglas Costa Braga,Daniel Oliveira Dantas
机构: Universidade Federal de Sergipe (塞尔希培联邦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: 9 pages, 5 figures, 4 tables. Submitted to VISAPP 2025

点击查看摘要

Abstract:We present a reproducible deep learning pipeline for leukemic cell classification, focusing on system architecture, experimental robustness, and software design choices for medical image analysis. Acute lymphoblastic leukemia (ALL) is the most common childhood cancer, requiring expert microscopic diagnosis that suffers from inter-observer variability and time constraints. The proposed system integrates an attention-based convolutional neural network combining EfficientNetV2-B3 with Squeeze-and-Excitation mechanisms for automated ALL cell classification. Our approach employs comprehensive data augmentation, focal loss for class imbalance, and patient-wise data splitting to ensure robust and reproducible evaluation. On the C-NMC 2019 dataset (12,528 original images from 62 patients), the system achieves a 97.89% F1-score and 97.89% accuracy on the test set, with statistical validation through 100-iteration Monte Carlo experiments confirming significant improvements (p 0.001) over baseline methods. The proposed pipeline outperforms existing approaches by up to 4.67% while using 89% fewer parameters than VGG16 (15.2M vs. 138M). The attention mechanism provides interpretable visualizations of diagnostically relevant cellular features, demonstrating that modern attention-based architectures can improve leukemic cell classification while maintaining computational efficiency suitable for clinical deployment.
zh

[CV-160] ITSELF: Attention Guided Fine-Grained Alignment for Vision-Language Retrieval WACV

【速读】:该论文旨在解决文本驱动的人体检索(text-based person search, TBPS)任务中因局部对齐方法易受捷径学习(shortcut learning)和虚假相关性(spurious correlations)影响而导致的误匹配问题,以及引入先验知识可能破坏模态内结构的问题。其解决方案的关键在于提出ITSELF框架,通过引导注意力机制实现隐式局部对齐:核心组件Guided Representation with Attentive Bank (GRAB) 将模型自身注意力转化为高显著性token的注意力库(Attentive Bank),并在该库上施加局部目标以学习细粒度对应关系,无需额外监督;同时引入Multi-Layer Attention for Robust Selection (MARS) 实现跨层注意力聚合与多样性感知的top-k选择,以及Adaptive Token Scheduler (ATS) 动态调度保留预算,从粗到精逐步聚焦判别性细节,从而在不依赖外部先验的情况下提升对齐精度与泛化能力。

链接: https://arxiv.org/abs/2601.01024
作者: Tien-Huy Nguyen,Huu-Loc Tran,Thanh Duc Ngo
机构: University of Information Technology (信息科技大学); Vietnam National University (越南国家大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted at WACV Main Track 2026

点击查看摘要

Abstract:Vision Language Models (VLMs) have rapidly advanced and show strong promise for text-based person search (TBPS), a task that requires capturing fine-grained relationships between images and text to distinguish individuals. Previous methods address these challenges through local alignment, yet they are often prone to shortcut learning and spurious correlations, yielding misalignment. Moreover, injecting prior knowledge can distort intra-modality structure. Motivated by our finding that encoder attention surfaces spatially precise evidence from the earliest training epochs, and to alleviate these issues, we introduceITSELF, an attention-guided framework for implicit local alignment. At its core, Guided Representation with Attentive Bank (GRAB) converts the model’s own attention into an Attentive Bank of high-saliency tokens and applies local objectives on this bank, learning fine-grained correspondences without extra supervision. To make the selection reliable and non-redundant, we introduce Multi-Layer Attention for Robust Selection (MARS), which aggregates attention across layers and performs diversity-aware top-k selection; and Adaptive Token Scheduler (ATS), which schedules the retention budget from coarse to fine over training, preserving context early while progressively focusing on discriminative details. Extensive experiments on three widely used TBPS benchmarks showstate-of-the-art performance and strong cross-dataset generalization, confirming the effectiveness and robustness of our approach without additional prior supervision. Our project is publicly available at this https URL
zh

[CV-161] Decoupling Amplitude and Phase Attention in Frequency Domain for RGB-Event based Visual Object Tracking

【速读】:该论文旨在解决现有RGB-Event视觉目标跟踪方法依赖传统特征级融合策略,未能充分挖掘事件相机(event camera)独特优势的问题,特别是其高动态范围和运动敏感特性常被忽视,且低信息区域仍被均匀处理,导致骨干网络计算开销过大。解决方案的关键在于提出一种在频域进行早期融合的新框架:首先通过快速傅里叶变换(Fast Fourier Transform, FFT)将RGB与事件模态从空间域转换至频率域,并解耦幅度与相位分量;随后利用幅度和相位注意力机制选择性地将高频率事件信息融合进RGB模态,增强特征表示的同时显著降低骨干网络的计算量;此外,引入基于运动引导的空间稀疏化模块,利用事件相机对运动的敏感性建模目标运动线索与空间概率分布的关系,过滤低信息区域并强化目标相关特征,最终仅将稀疏的目标相关特征输入骨干网络进行学习,从而实现高效且精准的目标定位。

链接: https://arxiv.org/abs/2601.01022
作者: Shiao Wang,Xiao Wang,Haonan Zhao,Jiarui Xu,Bo Jiang,Lin Zhu,Xin Zhao,Yonghong Tian,Jin Tang
机构: Anhui University (安徽大学); Northeastern University (东北大学); Beijing Institute of Technology (北京理工大学); University of Science and Technology Beijing (北京科技大学); Peng Cheng Laboratory (鹏城实验室); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Existing RGB-Event visual object tracking approaches primarily rely on conventional feature-level fusion, failing to fully exploit the unique advantages of event cameras. In particular, the high dynamic range and motion-sensitive nature of event cameras are often overlooked, while low-information regions are processed uniformly, leading to unnecessary computational overhead for the backbone network. To address these issues, we propose a novel tracking framework that performs early fusion in the frequency domain, enabling effective aggregation of high-frequency information from the event modality. Specifically, RGB and event modalities are transformed from the spatial domain to the frequency domain via the Fast Fourier Transform, with their amplitude and phase components decoupled. High-frequency event information is selectively fused into RGB modality through amplitude and phase attention, enhancing feature representation while substantially reducing backbone computation. In addition, a motion-guided spatial sparsification module leverages the motion-sensitive nature of event cameras to capture the relationship between target motion cues and spatial probability distribution, filtering out low-information regions and enhancing target-relevant features. Finally, a sparse set of target-relevant features is fed into the backbone network for learning, and the tracking head predicts the final target position. Extensive experiments on three widely used RGB-Event tracking benchmark datasets, including FE108, FELT, and COESOT, demonstrate the high performance and efficiency of our method. The source code of this paper will be released on this https URL
zh

[CV-162] Lightweight Channel Attention for Efficient CNNs

【速读】:该论文旨在解决通道注意力机制(Channel Attention)设计在效率与准确率之间的权衡问题,尤其是在资源受限环境下的部署可行性。其关键解决方案是提出一种轻量级通道注意力模块(Lite Channel Attention, LCA),通过引入自适应的一维卷积与分组操作,在显著降低参数量的同时保持有效的注意力行为,从而在ResNet 18和MobileNetV2模型上实现了与ECA相当的参数效率,并在CIFAR-10数据集上分别达到94.68%和93.10%的准确率,同时维持较低的推理延迟。

链接: https://arxiv.org/abs/2601.01002
作者: Prem Babu Kanaparthi,Tulasi Venkata Sri Varshini Padamata
机构: Rochester Institute of Technology (罗切斯特理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 5 figures

点击查看摘要

Abstract:Attention mechanisms have become integral to modern convolutional neural networks (CNNs), delivering notable performance improvements with minimal computational overhead. However, the efficiency accuracy trade off of different channel attention designs remains underexplored. This work presents an empirical study comparing Squeeze and Excitation (SE), Efficient Channel Attention (ECA), and a proposed Lite Channel Attention (LCA) module across ResNet 18 and MobileNetV2 architectures on CIFAR 10. LCA employs adaptive one dimensional convolutions with grouped operations to reduce parameter usage while preserving effective attention behavior. Experimental results show that LCA achieves competitive accuracy, reaching 94.68 percent on ResNet 18 and 93.10 percent on MobileNetV2, while matching ECA in parameter efficiency and maintaining favorable inference latency. Comprehensive benchmarks including FLOPs, parameter counts, and GPU latency measurements are provided, offering practical insights for deploying attention enhanced CNNs in resource constrained environments.
zh

[CV-163] DVGBench: Implicit-to-Explicit Visual Grounding Benchmark in UAV Imagery with Large Vision-Language Models

【速读】:该论文旨在解决遥感(Remote Sensing, RS)视觉语言模型(Large Vision-Language Models, LVLMs)在隐式视觉定位(Implicit Visual Grounding, VG)任务中表现不足的问题。现有RS VG数据集主要依赖显式指代表达(如相对位置、尺寸和颜色等线索),难以覆盖需要场景特定领域知识的隐式查询,从而限制了模型在复杂现实场景中的推理能力。解决方案的关键在于提出DVGBench基准数据集,涵盖六类无人机应用场景下的显式与隐式双重查询,并设计DroneVG-R1模型,其核心创新是引入隐式到显式的思维链(Implicit-to-Explicit Chain-of-Thought, I2E-CoT)机制,结合强化学习框架,使模型能够将隐式参考转化为显式描述,从而降低定位难度并提升推理性能。

链接: https://arxiv.org/abs/2601.00998
作者: Yue Zhou,Jue Chen,Zilun Zhang,Penghui Huang,Ran Ding,Zhentao Zou,PengFei Gao,Yuchen Wei,Ke Li,Xue Yang,Xue Jiang,Hongxin Yang,Jonathan Li
机构: 华东师范大学地理信息科学教育部重点实验室(Key Laboratory of Geographic Information Science, Ministry of Education, East China Normal University); ECNU (East China Normal University)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 17 figures

点击查看摘要

Abstract:Remote sensing (RS) large vision-language models (LVLMs) have shown strong promise across visual grounding (VG) tasks. However, existing RS VG datasets predominantly rely on explicit referring expressions-such as relative position, relative size, and color cues-thereby constraining performance on implicit VG tasks that require scenario-specific domain knowledge. This article introduces DVGBench, a high-quality implicit VG benchmark for drones, covering six major application scenarios: traffic, disaster, security, sport, social activity, and productive activity. Each object provides both explicit and implicit queries. Based on the dataset, we design DroneVG-R1, an LVLM that integrates the novel Implicit-to-Explicit Chain-of-Thought (I2E-CoT) within a reinforcement learning paradigm. This enables the model to take advantage of scene-specific expertise, converting implicit references into explicit ones and thus reducing grounding difficulty. Finally, an evaluation of mainstream models on both explicit and implicit VG tasks reveals substantial limitations in their reasoning capabilities. These findings provide actionable insights for advancing the reasoning capacity of LVLMs for drone-based agents. The code and datasets will be released at this https URL
zh

[CV-164] WildIng: A Wildlife Image Invariant Representation Model for Geographical Domain Shift

【速读】:该论文旨在解决当前基于深度学习的野生动物识别模型在跨地理区域场景下泛化能力差的问题。现有模型(如CLIP等基础模型)主要依赖图像特征进行识别,在训练数据与测试数据来自不同地理区域时性能显著下降,例如在非洲数据集上准确率达84.77%,而在美国数据集上骤降至16.17%。其根本原因在于模型对背景、光照和环境条件等图像分布变化敏感。解决方案的关键在于提出WildIng——一种结合文本描述与图像特征的野生动物图像不变表示模型(Wildlife image Invariant representation model),通过引入语义一致的文本信息(如物种外观细节描述)来增强模型对地理域偏移的鲁棒性,从而显著提升跨区域识别准确率,实验表明该方法可使BioCLIP等基础模型在地理域偏移条件下准确率提升30%。

链接: https://arxiv.org/abs/2601.00993
作者: Julian D. Santamaria,Claudia Isaza,Jhony H. Giraldo
机构: University of Antioquia (安蒂奥基亚大学); Telecom Paris (巴黎电信学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Wildlife monitoring is crucial for studying biodiversity loss and climate change. Camera trap images provide a non-intrusive method for analyzing animal populations and identifying ecological patterns over time. However, manual analysis is time-consuming and resource-intensive. Deep learning, particularly foundation models, has been applied to automate wildlife identification, achieving strong performance when tested on data from the same geographical locations as their training sets. Yet, despite their promise, these models struggle to generalize to new geographical areas, leading to significant performance drops. For example, training an advanced vision-language model, such as CLIP with an adapter, on an African dataset achieves an accuracy of 84.77%. However, this performance drops significantly to 16.17% when the model is tested on an American dataset. This limitation partly arises because existing models rely predominantly on image-based representations, making them sensitive to geographical data distribution shifts, such as variation in background, lighting, and environmental conditions. To address this, we introduce WildIng, a Wildlife image Invariant representation model for geographical domain shift. WildIng integrates text descriptions with image features, creating a more robust representation to geographical domain shifts. By leveraging textual descriptions, our approach captures consistent semantic information, such as detailed descriptions of the appearance of the species, improving generalization across different geographical locations. Experiments show that WildIng enhances the accuracy of foundation models such as BioCLIP by 30% under geographical domain shift conditions. We evaluate WildIng on two datasets collected from different regions, namely America and Africa. The code and models are publicly available at this https URL.
zh

[CV-165] UnrealPose: Leverag ing Game Engine Kinematics for Large-Scale Synthetic Human Pose Data CVPR2026

【速读】:该论文旨在解决当前3D人体姿态数据集存在的两大问题:一是高质量、精确标注的3D人体姿态数据获取成本高且受限于工作室环境,二是真实世界(in-the-wild)数据集缺乏已知的地面真值(ground truth)。为应对这一挑战,作者提出了一种基于Unreal Engine 5的渲染管道UnrealPose-Gen,其核心创新在于通过Movie Render Queue实现高质量离线渲染,生成包含多模态标注信息的数据帧,包括世界坐标系和相机坐标系下的3D关节位置、2D投影及COCO风格关键点(含遮挡与可见性标记)、人物边界框以及相机内参和外参。该方案的关键在于构建了一个可扩展的合成数据生成框架,并基于此生成了包含约一百万帧的UnrealPose-1M数据集,涵盖多样化场景、动作和视角,从而为训练和评估人体姿态估计模型提供可靠、可控且大规模的合成基准。

链接: https://arxiv.org/abs/2601.00991
作者: Joshua Kawaguchi,Saad Manzur,Emily Gao Wang,Maitreyi Sinha,Bryan Vela,Yunxi Wang,Brandon Vela,Wayne B. Hayes
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 submission. Introduces UnrealPose-1M dataset and UnrealPose-Gen pipeline

点击查看摘要

Abstract:Diverse, accurately labeled 3D human pose data is expensive and studio-bound, while in-the-wild datasets lack known ground truth. We introduce UnrealPose-Gen, an Unreal Engine 5 pipeline built on Movie Render Queue for high-quality offline rendering. Our generated frames include: (i) 3D joints in world and camera coordinates, (ii) 2D projections and COCO-style keypoints with occlusion and joint-visibility flags, (iii) person bounding boxes, and (iv) camera intrinsics and extrinsics. We use UnrealPose-Gen to present UnrealPose-1M, an approximately one million frame corpus comprising eight sequences: five scripted “coherent” sequences spanning five scenes, approximately 40 actions, and five subjects; and three randomized sequences across three scenes, approximately 100 actions, and five subjects, all captured from diverse camera trajectories for broad viewpoint coverage. As a fidelity check, we report real-to-synthetic results on four tasks: image-to-3D pose, 2D keypoint detection, 2D-to-3D lifting, and person detection/segmentation. Though time and resources constrain us from an unlimited dataset, we release the UnrealPose-1M dataset, as well as the UnrealPose-Gen pipeline to support third-party generation of human pose data.
zh

[CV-166] Few-Shot Video Object Segmentation in X-Ray Angiography Using Local Matching and Spatio-Temporal Consistency Loss

【速读】:该论文旨在解决视频目标分割(Video Object Segmentation, VOS)中因传统局部匹配策略效率低下、硬件依赖性强以及跨设备迁移能力差而导致的精度与泛化性能受限问题。其解决方案的关键在于提出一种基于方向采样的非参数化局部匹配机制,通过重构局部采样过程以动态调整采样区域,从而在不引入可学习参数和模型重训练的前提下实现对多样化空间结构的自适应处理;同时结合监督时空对比学习策略增强帧间特征一致性,显著提升分割准确性和跨类别泛化能力。

链接: https://arxiv.org/abs/2601.00988
作者: Lin Xi,Yingliang Ma,Xiahai Zhuang
机构: University of East Anglia (东安格利亚大学); King’s College London (伦敦国王学院); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce a novel FSVOS model that employs a local matching strategy to restrict the search space to the most relevant neighboring pixels. Rather than relying on inefficient standard im2col-like implementations (e.g., spatial convolutions, depthwise convolutions and feature-shifting mechanisms) or hardware-specific CUDA kernels (e.g., deformable and neighborhood attention), which often suffer from limited portability across non-CUDA devices, we reorganize the local sampling process through a direction-based sampling perspective. Specifically, we implement a non-parametric sampling mechanism that enables dynamically varying sampling regions. This approach provides the flexibility to adapt to diverse spatial structures without the computational costs of parametric layers and the need for model retraining. To further enhance feature coherence across frames, we design a supervised spatio-temporal contrastive learning scheme that enforces consistency in feature representations. In addition, we introduce a publicly available benchmark dataset for multi-object segmentation in X-ray angiography videos (MOSXAV), featuring detailed, manually labeled segmentation ground truth. Extensive experiments on the CADICA, XACV, and MOSXAV datasets show that our proposed FSVOS method outperforms current state-of-the-art video segmentation methods in terms of segmentation accuracy and generalization capability (i.e., seen and unseen categories). This work offers enhanced flexibility and potential for a wide range of clinical applications.
zh

[CV-167] Simulations of MRI Guided and Powered Ferric Applicators for Tetherless Delivery of Therapeutic Interventions

【速读】:该论文旨在解决磁共振成像(MRI)引导下血管内介入操作中,如何实现术前规划与虚拟路径安全建模的问题。其核心挑战在于确保磁力驱动的磁性器械在复杂血管网络中的安全、精准操控,避免因磁场梯度超限或路径冲突导致血管穿孔或器械碰撞。解决方案的关键在于构建一个双向数据与指令传输平台,通过多层处理流程:首先从MRI图像中提取血管床并拟合出虚拟夹具(virtual fixture, VF),作为禁止区域以规避风险;随后结合血管中心线几何特征、VF边界及MRI安全性约束(如dB/dt和最大可用梯度),生成适配的磁场梯度波形;同时支持用户选择不同血流模型用于仿真器械运动,并提供路径可行性评估。整个系统基于Qt框架(C/C++)实现,采用多线程架构分离PID控制、VF生成与梯度波形计算等模块,为未来实时手术操作奠定基础。

链接: https://arxiv.org/abs/2601.00981
作者: Wenhui Chu,Khang Tran,Nikolaos V. Tsekos
机构: MRI Lab, University of Houston (休斯顿大学磁共振成像实验室)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
备注: 9 pages, 8 figures, published in ICBBB 2022

点击查看摘要

Abstract:Magnetic Resonance Imaging (MRI) is a well-established modality for pre-operative planning and is also explored for intra-operative guidance of procedures such as intravascular interventions. Among the experimental robot-assisted technologies, the magnetic field gradients of the MRI scanner are used to power and maneuver ferromagnetic applicators for accessing sites in the patient’s body via the vascular network. In this work, we propose a computational platform for preoperative planning and modeling of MRI-powered applicators inside blood vessels. This platform was implemented as a two-way data and command pipeline that links the MRI scanner, the computational core, and the operator. The platform first processes multi-slice MR data to extract the vascular bed and then fits a virtual corridor inside the vessel. This corridor serves as a virtual fixture (VF), a forbidden region for the applicators to avoid vessel perforation or collision. The geometric features of the vessel centerline, the VF, and MRI safety compliance (dB/dt, max available gradient) are then used to generate magnetic field gradient waveforms. Different blood flow profiles can be user-selected, and those parameters are used for modeling the applicator’s maneuvering. The modeling module further generates cues about whether the selected vascular path can be safely maneuvered. Given future experimental studies that require a real-time operation, the platform was implemented on the Qt framework (C/C++) with software modules performing specific tasks running on dedicated threads: PID controller, generation of VF, generation of MR gradient waveforms.
zh

[CV-168] A Deep Learning Approach for Automated Skin Lesion Diagnosis with Explainable AI

【速读】:该论文旨在解决皮肤癌(skin cancer)早期诊断中准确率不足与模型可解释性差的问题,尤其是在多类皮肤病变分类任务中的挑战。其解决方案的关键在于构建一个融合高质量数据平衡、大规模数据增强、改进的混合EfficientNetV2-L架构(带通道注意力机制)以及三阶段渐进式学习策略的深度学习框架,同时引入可解释人工智能(explainable AI, XAI)技术如Grad-CAM和显著性图(saliency maps),以提供可视化决策依据,从而提升模型性能与临床可信度。该方法在HAM10000数据集上实现了91.15%的总体准确率、85.45%的宏F1分数和99.33%的微平均AUC,尤其在黑色素瘤(melanoma)和黑素细胞痣(melanocytic nevi)类别表现突出。

链接: https://arxiv.org/abs/2601.00964
作者: Md. Maksudul Haque,Rahnuma Akter,A S M Ahsanul Sarkar Akib,Abdul Hasib
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Skin cancer is also one of the most common and dangerous types of cancer in the world that requires timely and precise diagnosis. In this paper, a deep-learning architecture of the multi-class skin lesion classification on the HAM10000 dataset will be described. The system suggested combines high-quality data balancing methods, large-scale data augmentation, hybridized EfficientNetV2-L framework with channel attention, and a three-stage progressive learning approach. Moreover, we also use explainable AI (XAI) techniques such as Grad-CAM and saliency maps to come up with intelligible visual representations of model predictions. Our strategy is with a total accuracy of 91.15 per cent, macro F1 of 85.45% and micro-average AUC of 99.33%. The model has shown high performance in all the seven lesion classes with specific high performance of melanoma and melanocytic nevi. In addition to enhancing diagnostic transparency, XAI also helps to find out the visual characteristics that cause the classifications, which enhances clinical trustworthiness.
zh

[CV-169] Deep Clustering with Associative Memories

【速读】:该论文旨在解决深度聚类(Deep Clustering)中表示学习与潜在空间聚类之间耦合不紧密的问题,即传统方法因聚类本质上是离散优化任务而难以融入可微分的深度学习框架,导致表示学习与聚类步骤分离、性能受限。解决方案的关键在于提出一种基于能量模型(Energy-based Dynamics)的新损失函数,通过关联记忆(Associative Memories)机制将表示学习与聚类统一到一个联合目标函数中,从而实现更紧密的端到端优化,提升不同架构(卷积、残差或全连接)和数据模态(图像或文本)下的聚类质量。

链接: https://arxiv.org/abs/2601.00963
作者: Bishwajit Saha,Dmitry Krotov,Mohammed J. Zaki,Parikshit Ram
机构: IBM Research (IBM 研究院); RPI (伦斯勒理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep clustering - joint representation learning and latent space clustering - is a well studied problem especially in computer vision and text processing under the deep learning framework. While the representation learning is generally differentiable, clustering is an inherently discrete optimization task, requiring various approximations and regularizations to fit in a standard differentiable pipeline. This leads to a somewhat disjointed representation learning and clustering. In this work, we propose a novel loss function utilizing energy-based dynamics via Associative Memories to formulate a new deep clustering method, DCAM, which ties together the representation learning and clustering aspects more intricately in a single objective. Our experiments showcase the advantage of DCAM, producing improved clustering quality for various architecture choices (convolutional, residual or fully-connected) and data modalities (images or text).
zh

[CV-170] PhyEduVideo: A Benchmark for Evaluating Text-to-Video Models for Physics Education WACV

【速读】:该论文旨在解决当前生成式 AI(Generative AI)在物理教育中应用的可行性与局限性问题,特别是如何利用文本到视频(Text-to-Video, T2V)系统自动生成准确、直观且符合课程标准的物理教学视频。其解决方案的关键在于构建了一个专门用于评估T2V模型在物理教育场景下表现的基准测试集(benchmark),该基准将每个核心物理概念拆解为细粒度的教学点,并为每个教学点设计了针对性的视觉提示(prompt),从而系统性地衡量模型生成视频在视觉连贯性和概念准确性方面的性能。研究发现,当前T2V模型虽能生成流畅、无明显闪烁的视频,但在抽象物理领域如电磁学和热力学中的概念准确性不足,凸显了视觉质量与概念正确性之间的差距,为后续优化提供了明确方向。

链接: https://arxiv.org/abs/2601.00943
作者: Megha Mariam K.M,Aditya Arun,Zakaria Laskar,C.V. Jawahar
机构: IIIT Hyderabad(印度国际信息技术学院); Adobe MDSR(Adobe多媒体研究与开发部门); IISER Thiruvananthapuram(印度科学教育研究所特里凡得琅分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026

点击查看摘要

Abstract:Generative AI models, particularly Text-to-Video (T2V) systems, offer a promising avenue for transforming science education by automating the creation of engaging and intuitive visual explanations. In this work, we take a first step toward evaluating their potential in physics education by introducing a dedicated benchmark for explanatory video generation. The benchmark is designed to assess how well T2V models can convey core physics concepts through visual illustrations. Each physics concept in our benchmark is decomposed into granular teaching points, with each point accompanied by a carefully crafted prompt intended for visual explanation of the teaching point. T2V models are evaluated on their ability to generate accurate videos in response to these prompts. Our aim is to systematically explore the feasibility of using T2V models to generate high-quality, curriculum-aligned educational content-paving the way toward scalable, accessible, and personalized learning experiences powered by AI. Our evaluation reveals that current models produce visually coherent videos with smooth motion and minimal flickering, yet their conceptual accuracy is less reliable. Performance in areas such as mechanics, fluids, and optics is encouraging, but models struggle with electromagnetism and thermodynamics, where abstract interactions are harder to depict. These findings underscore the gap between visual quality and conceptual correctness in educational video generation. We hope this benchmark helps the community close that gap and move toward T2V systems that can deliver accurate, curriculum-aligned physics content at scale. The benchmark and accompanying codebase are publicly available at this https URL.
zh

[CV-171] Learning to Segment Liquids in Real-world Images

【速读】:该论文旨在解决机器人在日常环境中对液体进行安全避让或交互时面临的挑战,即液体分割任务长期缺乏足够关注,导致机器人难以准确识别和处理具有多样外观、形状以及透明或反射特性的液体。其解决方案的关键在于构建了一个包含5000张真实世界图像的大型液体数据集LQDS(Liquid Quality Dataset),并提出了一种名为LQDM(Liquid Quality Detection Model)的新颖液体检测模型,该模型通过主分割分支与专用边界分支之间的交叉注意力机制(cross-attention)增强分割预测性能,从而显著提升液体语义分割的准确性。

链接: https://arxiv.org/abs/2601.00940
作者: Jonas Li,Michelle Li,Luke Liu,Heng Fan
机构: University of North Texas (北德克萨斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 7 figures

点击查看摘要

Abstract:Different types of liquids such as water, wine and medicine appear in all aspects of daily life. However, limited attention has been given to the task, hindering the ability of robots to avoid or interact with liquids safely. The segmentation of liquids is difficult because liquids come in diverse appearances and shapes; moreover, they can be both transparent or reflective, taking on arbitrary objects and scenes from the background or surroundings. To take on this challenge, we construct a large-scale dataset of liquids named LQDS consisting of 5000 real-world images annotated into 14 distinct classes, and design a novel liquid detection model named LQDM, which leverages cross-attention between a dedicated boundary branch and the main segmentation branch to enhance segmentation predictions. Extensive experiments demonstrate the effectiveness of LQDM on the test set of LQDS, outperforming state-of-the-art methods and establishing a strong baseline for the semantic segmentation of liquids.
zh

[CV-172] ShadowGS: Shadow-Aware 3D Gaussian Splatting for Satellite Imagery

【速读】:该论文旨在解决多时相卫星影像中因光照条件变化导致阴影不一致的问题,从而影响三维重建(3D reconstruction)的几何一致性与精度。解决方案的关键在于提出ShadowGS框架,其核心创新包括:基于遥感物理渲染方程结合高效射线步进(ray marching)技术,精确建模几何一致的阴影;引入阴影一致性约束(shadow consistency constraint)以显著提升重建几何准确性;以及设计一种新颖的阴影图先验(shadow map prior),增强稀疏视角输入下的性能。这些机制共同实现了阴影解耦精度、三维重建精度和新视角合成质量的全面提升。

链接: https://arxiv.org/abs/2601.00939
作者: Feng Luo,Hongbo Pan,Xiang Yang,Baoyu Jiang,Fengqing Liu,Tao Huang
机构: Central South University (中南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has emerged as a novel paradigm for 3D reconstruction from satellite imagery. However, in multi-temporal satellite images, prevalent shadows exhibit significant inconsistencies due to varying illumination conditions. To address this, we propose ShadowGS, a novel framework based on 3DGS. It leverages a physics-based rendering equation from remote sensing, combined with an efficient ray marching technique, to precisely model geometrically consistent shadows while maintaining efficient rendering. Additionally, it effectively disentangles different illumination components and apparent attributes in the scene. Furthermore, we introduce a shadow consistency constraint that significantly enhances the geometric accuracy of 3D reconstruction. We also incorporate a novel shadow map prior to improve performance with sparse-view inputs. Extensive experiments demonstrate that ShadowGS outperforms current state-of-the-art methods in shadow decoupling accuracy, 3D reconstruction precision, and novel view synthesis quality, with only a few minutes of training. ShadowGS exhibits robust performance across various settings, including RGB, pansharpened, and sparse-view satellite inputs.
zh

[CV-173] Analyzing the Shopping Journey: Computing Shelf Browsing Visits in a Physical Retail Store

【速读】:该论文旨在解决机器人在零售场景中部署时,如何自主理解顾客意图的问题。其核心挑战在于从物理店内获取可量化的顾客行为数据,以支持后续的智能决策与交互。解决方案的关键在于提出了一种基于机器视觉的三维跟踪算法,用于计算顾客的“货架访问次数”(shelf visits),从而捕捉其浏览行为;该算法通过两组独立采集于不同门店的轨迹数据进行校准,并验证了其在跨环境下的泛化能力,最终利用模型分析了顾客浏览模式与实际购买行为之间的关联,为零售规划和人-机器人交互提供了量化依据。

链接: https://arxiv.org/abs/2601.00928
作者: Luis Yoichi Morales,Francesco Zanlungo,David M. Woollard
机构: Standard AI(标准AI); University of Palermo(巴勒莫大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Motivated by recent challenges in the deployment of robots into customer-facing roles within retail, this work introduces a study of customer activity in physical stores as a step toward autonomous understanding of shopper intent. We introduce an algorithm that computes shoppers’ shelf visits'' -- capturing their browsing behavior in the store. Shelf visits are extracted from trajectories obtained via machine vision-based 3D tracking and overhead cameras. We perform two independent calibrations of the shelf visit algorithm, using distinct sets of trajectories (consisting of 8138 and 15129 trajectories), collected in different stores and labeled by human reviewers. The calibrated models are then evaluated on trajectories held out of the calibration process both from the same store on which calibration was performed and from the other store. An analysis of the results shows that the algorithm can recognize customers' browsing activity when evaluated in an environment different from the one on which calibration was performed. We then use the model to analyze the customers' browsing patterns’’ on a large set of trajectories and their relation to actual purchases in the stores. Finally, we discuss how shelf browsing information could be used for retail planning and in the domain of human-robot interaction scenarios.
zh

[CV-174] Application of deep learning techniques in non-contrast computed tomography pulmonary angiogram for pulmonary embolism diagnosis

【速读】:该论文旨在解决肺栓塞(pulmonary embolism)早期诊断中依赖含碘对比剂的CT肺动脉造影(contrast medium computed tomography pulmonary angiography, CTPA)所引发的问题,包括对比剂相关急性肾损伤风险及延迟显影导致的治疗窗口延误。其解决方案的关键在于利用3D卷积神经网络(3D convolutional neural network)模型,直接从非对比增强CT图像中自动识别和分类肺栓塞,从而实现无需对比剂即可高精度诊断,实验结果显示该方法在无对比CT图像上达到了85%的分类准确率和0.84的AUC值,验证了深度学习技术在肺栓塞无创、快速筛查中的可行性与临床应用潜力。

链接: https://arxiv.org/abs/2601.00925
作者: I-Hsien Ting,Yi-Jun Tseng,Yu-Sheng Lin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pulmonary embolism is a life-threatening disease, early detection and treatment can significantly reduce mortality. In recent years, many studies have been using deep learning in the diagnosis of pulmonary embolism with contrast medium computed tomography pulmonary angiography, but the contrast medium is likely to cause acute kidney injury in patients with pulmonary embolism and chronic kidney disease, and the contrast medium takes time to work, patients with acute pulmonary embolism may miss the golden treatment time. This study aims to use deep learning techniques to automatically classify pulmonary embolism in CT images without contrast medium by using a 3D convolutional neural network model. The deep learning model used in this study had a significant impact on the pulmonary embolism classification of computed tomography images without contrast with 85% accuracy and 0.84 AUC, which confirms the feasibility of the model in the diagnosis of pulmonary embolism. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2601.00925 [cs.CV] (or arXiv:2601.00925v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2601.00925 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-175] Four-Stage Alzheimers Disease Classification from MRI Using Topological Feature Extraction Feature Selection and Ensemble Learning

【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s disease, AD)严重程度分类中面临的两大挑战:一是数据有限条件下模型性能受限,二是深度学习方法缺乏可解释性。其解决方案的关键在于提出一种基于拓扑数据分析(Topological Data Analysis, TDA)与集成学习相结合的四阶段分类框架——TDA-Alz。该方法通过提取脑部磁共振成像(MRI)中的拓扑特征来捕捉大脑结构的内在模式,结合特征选择保留最具判别力的拓扑描述子,并利用集成学习策略实现鲁棒的多类分类。该方案无需数据增强、预训练网络或大规模计算资源,兼具高准确率(98.19%)和强可解释性,为临床决策支持系统提供了轻量且可靠的替代方案。

链接: https://arxiv.org/abs/2601.00918
作者: Faisal Ahmed
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 7 figures

点击查看摘要

Abstract:Accurate and efficient classification of Alzheimer’s disease (AD) severity from brain magnetic resonance imaging (MRI) remains a critical challenge, particularly when limited data and model interpretability are of concern. In this work, we propose TDA-Alz, a novel framework for four-stage Alzheimer’s disease severity classification (non-demented, moderate dementia, mild, and very mild) using topological data analysis (TDA) and ensemble learning. Instead of relying on deep convolutional architectures or extensive data augmentation, our approach extracts topological descriptors that capture intrinsic structural patterns of brain MRI, followed by feature selection to retain the most discriminative topological features. These features are then classified using an ensemble learning strategy to achieve robust multiclass discrimination. Experiments conducted on the OASIS-1 MRI dataset demonstrate that the proposed method achieves an accuracy of 98.19% and an AUC of 99.75%, outperforming or matching state-of-the-art deep learning–based methods reported on OASIS and OASIS-derived datasets. Notably, the proposed framework does not require data augmentation, pretrained networks, or large-scale computational resources, making it computationally efficient and fast compared to deep neural network approaches. Furthermore, the use of topological descriptors provides greater interpretability, as the extracted features are directly linked to the underlying structural characteristics of brain MRI rather than opaque latent representations. These results indicate that TDA-Alz offers a powerful, lightweight, and interpretable alternative to deep learning models for MRI-based Alzheimer’s disease severity classification, with strong potential for real-world clinical decision-support systems. Comments: 15 pages, 7 figures Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2601.00918 [cs.CV] (or arXiv:2601.00918v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2601.00918 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-176] Clean-GS: Semantic Mask-Guided Pruning for 3D Gaussian Splatting

【速读】:该论文旨在解决3D高斯散射(3D Gaussian Splatting, 3DGS)重建中产生的大量背景杂波和浮点噪声(floaters)问题,这些伪影不仅遮挡目标物体、降低场景质量,还显著增加模型体积,限制其在带宽受限场景(如Web部署与AR/VR应用)中的实际使用。解决方案的关键在于提出Clean-GS方法,通过少量语义掩码(仅需1%视图的分割掩码)实现精准去噪:首先基于投影到语义掩码区域进行白名单空间过滤,其次利用深度缓冲的颜色一致性验证剔除异常点,最后采用邻域统计方法移除孤立噪声点,从而在保留目标物体视觉质量的前提下实现60–80%的模型压缩。

链接: https://arxiv.org/abs/2601.00913
作者: Subhankar Mishra
机构: National Institute of Science Education and Research (国家科学教育与研究学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting produces high-quality scene reconstructions but generates hundreds of thousands of spurious Gaussians (floaters) scattered throughout the environment. These artifacts obscure objects of interest and inflate model sizes, hindering deployment in bandwidth-constrained applications. We present Clean-GS, a method for removing background clutter and floaters from 3DGS reconstructions using sparse semantic masks. Our approach combines whitelist-based spatial filtering with color-guided validation and outlier removal to achieve 60-80% model compression while preserving object quality. Unlike existing 3DGS pruning methods that rely on global importance metrics, Clean-GS uses semantic information from as few as 3 segmentation masks (1% of views) to identify and remove Gaussians not belonging to the target object. Our multi-stage approach consisting of (1) whitelist filtering via projection to masked regions, (2) depth-buffered color validation, and (3) neighbor-based outlier removal isolates monuments and objects from complex outdoor scenes. Experiments on Tanks and Temples show that Clean-GS reduces file sizes from 125MB to 47MB while maintaining rendering quality, making 3DGS models practical for web deployment and AR/VR applications. Our code is available at this https URL
zh

[CV-177] Evaluating Contextual Intelligence in Recyclability: A Comprehensive Study of Image-Based Reasoning Systems

【速读】:该论文旨在解决公众在垃圾分类与回收实践中面临的可回收性判断难题,即如何准确识别物品的 recyclability(回收性)并正确投放至对应垃圾桶。其解决方案的关键在于利用先进的视觉语言模型(如 GPT-4o、GPT-4o-mini 和 Claude 3.5),通过分析图像数据实现对常见废弃物的智能分类预测,并进一步评估物品是否符合物理尺寸限制及复杂场景下的回收规则,包括地域差异、污染或破损状态以及多材质复合结构等挑战因素。研究结果表明,这些模型在上下文理解能力上相较以往有显著提升,但仍存在改进空间,强调了持续优化情境感知能力对于推动环保行为和可持续发展的必要性。

链接: https://arxiv.org/abs/2601.00905
作者: Eliot Park,Abhi Kumar,Pranav Rajpurkar
机构: Harvard College (哈佛学院); Stanford University (斯坦福大学); Department of Biomedical Informatics, Harvard Medical School (生物医学信息学系,哈佛医学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: x

点击查看摘要

Abstract:While the importance of efficient recycling is widely acknowledged, accurately determining the recyclability of items and their proper disposal remains a complex task for the general public. In this study, we explore the application of cutting-edge vision-language models (GPT-4o, GPT-4o-mini, and Claude 3.5) for predicting the recyclability of commonly disposed items. Utilizing a curated dataset of images, we evaluated the models’ ability to match objects to appropriate recycling bins, including assessing whether the items could physically fit into the available bins. Additionally, we investigated the models’ performance across several challenging scenarios: (i) adjusting predictions based on location-specific recycling guidelines; (ii) accounting for contamination or structural damage; and (iii) handling objects composed of multiple materials. Our findings highlight the significant advancements in contextual understanding offered by these models compared to previous iterations, while also identifying areas where they still fall short. The continued refinement of context-aware models is crucial for enhancing public recycling practices and advancing environmental sustainability.
zh

[CV-178] Noise-Aware and Dynamically Adaptive Federated Defense Framework for SAR Image Target Recognition

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在合成孔径雷达(Synthetic Aperture Radar, SAR)图像目标识别中面临的后门攻击安全问题,尤其是恶意客户端利用SAR特有的乘性斑点噪声(multiplicative speckle noise)隐藏后门触发器,从而严重削弱模型鲁棒性的挑战。解决方案的关键在于提出一种噪声感知且动态自适应的联邦防御框架(NADAFD),其核心创新包括:1)频域协同反演机制,用于暴露跨客户端频谱不一致性以识别隐藏的后门触发器;2)噪声感知对抗训练策略,将Γ分布的斑点噪声特性嵌入掩码引导的对抗样本生成过程,提升对后门攻击和斑点噪声的双重鲁棒性;3)动态健康评估模块,通过追踪客户端更新行为并自适应调整聚合权重,有效抑制随训练轮次演化的恶意贡献。

链接: https://arxiv.org/abs/2601.00900
作者: Yuchao Hou(1, 2),Zixuan Zhang(1),Jie Wang(1),Wenke Huang(3),Lianhui Liang(4),Di Wu(5),Zhiquan Liu(6),Youliang Tian(2),Jianming Zhu(7),Jisheng Dang(8),Junhao Dong(3),Zhongliang Guo(9) ((1) Shanxi Normal University, Taiyuan, China, (2) Guizhou University, Guiyang, China, (3) Nanyang Technological University, Singapore, Singapore, (4) Guangxi University, Nanning, China, (5) La Trobe University, Melbourne, Australia, (6) Jinan University, Guangzhou, China, (7) Central University of Finance and Economics, Beijing, China, (8) Lanzhou University, Lanzhou, China, (9) University of St Andrews, St Andrews, United Kingdom)
机构: Shanxi Key Laboratory of Cryptography and Data Security, School of Computer Science and Artificial Intelligence, Shanxi Normal University, Taiyuan 030031, China; School of Computer Science and Technology, Guizhou University, Guiyang 550025, China; School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798, Singapore; School of Electrical Engineering, Guangxi University, Nanning 530004, China; School of Computing, Engineering and Mathematical Science, La Trobe University, Plenty Road, Bundoora, VIC 3086, Australia; College of Cyber Security, Jinan University, Guangzhou 510632, China; School of Computer Science and Technology, Guizhou University, Guiyang 550025, China; School of Information, Central University of Finance and Economics, Beijing 100081, China; School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, China; School of Computer Science, University of St Andrews, St Andrews, KY16 9AJ, United Kingdom
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This work was supported in part by the National Key Research and Development Program of China under Grant 2021YFB3101100, in part by the National Natural Science Foundation of China under Grant 62272123, 42371470, and 42461057, in part by the Fundamental Research Program of Shanxi Province under Grant 202303021212164. Corresponding authors: Zhongliang Guo and Junhao Dong

点击查看摘要

Abstract:As a critical application of computational intelligence in remote sensing, deep learning-based synthetic aperture radar (SAR) image target recognition facilitates intelligent perception but typically relies on centralized training, where multi-source SAR data are uploaded to a single server, raising privacy and security concerns. Federated learning (FL) provides an emerging computational intelligence paradigm for SAR image target recognition, enabling cross-site collaboration while preserving local data privacy. However, FL confronts critical security risks, where malicious clients can exploit SAR’s multiplicative speckle noise to conceal backdoor triggers, severely challenging the robustness of the computational intelligence model. To address this challenge, we propose NADAFD, a noise-aware and dynamically adaptive federated defense framework that integrates frequency-domain, spatial-domain, and client-behavior analyses to counter SAR-specific backdoor threats. Specifically, we introduce a frequency-domain collaborative inversion mechanism to expose cross-client spectral inconsistencies indicative of hidden backdoor triggers. We further design a noise-aware adversarial training strategy that embeds \Gamma -distributed speckle characteristics into mask-guided adversarial sample generation to enhance robustness against both backdoor attacks and SAR speckle noise. In addition, we present a dynamic health assessment module that tracks client update behaviors across training rounds and adaptively adjusts aggregation weights to mitigate evolving malicious contributions. Experiments on MSTAR and OpenSARShip datasets demonstrate that NADAFD achieves higher accuracy on clean test samples and a lower backdoor attack success rate on triggered inputs than existing federated backdoor defenses for SAR target recognition.
zh

[CV-179] CornViT: A Multi-Stage Convolutional Vision Transformer Framework for Hierarchical Corn Kernel Analysis

【速读】:该论文旨在解决玉米籽粒(corn kernel)分级依赖人工目检的低效与主观性问题,以支持种子认证、定向播种和育种等关键农业流程。其核心解决方案是提出CornViT框架——一个三阶段卷积视觉Transformer(Convolutional Vision Transformer, CvT)模型,模拟人类种子分析师的层级推理逻辑:第一阶段区分纯度,第二阶段识别形态(扁平或圆形),第三阶段判断纯扁籽粒的胚芽朝向(向上或向下)。该方法通过在ImageNet-22k预训练的CvT-13基础上进行头部微调,在三个任务上分别达到93.76%、94.11%和91.12%的准确率,显著优于ResNet-50和DenseNet-121等传统CNN架构,验证了融合卷积结构与自注意力机制在籽粒分析中的优势。研究还配套发布了高质量标注数据集及基于Flask的可解释性Web应用,为自动化玉米籽粒质量评估提供了可部署的技术方案。

链接: https://arxiv.org/abs/2601.00897
作者: Sai Teja Erukude,Jane Mascarenhas,Lior Shamir
机构: Kansas State University (堪萨斯州立大学); Michigan Technological University (密歇根理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 23 pages

点击查看摘要

Abstract:Accurate grading of corn kernels is critical for seed certification, directional seeding, and breeding, yet it is still predominantly performed by manual inspection. This work introduces CornViT, a three-stage Convolutional Vision Transformer (CvT) framework that emulates the hierarchical reasoning of human seed analysts for single-kernel evaluation. Three sequential CvT-13 classifiers operate on 384x384 RGB images: Stage 1 distinguishes pure from impure kernels; Stage 2 categorizes pure kernels into flat and round morphologies; and Stage 3 determines the embryo orientation (up vs. down) for pure, flat kernels. Starting from a public corn seed image collection, we manually relabeled and filtered images to construct three stage-specific datasets: 7265 kernels for purity, 3859 pure kernels for morphology, and 1960 pure-flat kernels for embryo orientation, all released as benchmarks. Head-only fine-tuning of ImageNet-22k pretrained CvT-13 backbones yields test accuracies of 93.76% for purity, 94.11% for shape, and 91.12% for embryo-orientation detection. Under identical training conditions, ResNet-50 reaches only 76.56 to 81.02 percent, whereas DenseNet-121 attains 86.56 to 89.38 percent accuracy. These results highlight the advantages of convolution-augmented self-attention for kernel analysis. To facilitate adoption, we deploy CornViT in a Flask-based web application that performs stage-wise inference and exposes interpretable outputs through a browser interface. Together, the CornViT framework, curated datasets, and web application provide a deployable solution for automated corn kernel quality assessment in seed quality workflows. Source code and data are publicly available.
zh

[CV-180] Hierarchical topological clustering

【速读】:该论文旨在解决传统聚类方法在处理复杂结构数据时因假设数据分布形态而失效的问题,尤其在存在异常值或簇形状不规则的情况下难以提取有意义的聚类结果。其解决方案的关键在于提出一种层次化拓扑聚类算法(hierarchical topological clustering algorithm),该算法无需预设数据结构假设,可兼容任意距离度量,并通过构建数据的拓扑层次结构来识别异常值与任意形状簇的持久性特征,从而在图像、医学和经济等实际数据集中实现更鲁棒且语义合理的聚类效果。

链接: https://arxiv.org/abs/2601.00892
作者: Ana Carpio,Gema Duro
机构: Facultad de Ciencias Matemáticas, Universidad Complutense de Madrid(马德里康普顿斯大学数学科学学院); Facultad de Ciencias Económicas y Empresariales, Universidad Autónoma de Madrid(马德里自治大学经济与企业管理学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Data Analysis, Statistics and Probability (physics.data-an); Methodology (stat.ME); Machine Learning (stat.ML)
备注: not peer reviewed, reviewed version to appear in Soft Computing

点击查看摘要

Abstract:Topological methods have the potential of exploring data clouds without making assumptions on their the structure. Here we propose a hierarchical topological clustering algorithm that can be implemented with any distance choice. The persistence of outliers and clusters of arbitrary shape is inferred from the resulting hierarchy. We demonstrate the potential of the algorithm on selected datasets in which outliers play relevant roles, consisting of images, medical and economic data. These methods can provide meaningful clusters in situations in which other techniques fail to do so.
zh

[CV-181] Comparative Evaluation of CNN Architectures for Neural Style Transfer in Indonesian Batik Motif Generation: A Comprehensive Study

【速读】:该论文旨在解决神经风格迁移(Neural Style Transfer, NST)在印尼蜡染(batik)图案数字保存与生成中因依赖VGG类架构导致的高计算和内存开销问题,从而限制了其在资源受限环境中的实际部署。解决方案的关键在于系统性地比较五种主流CNN骨干网络(VGG16、VGG19、Inception V3、ResNet50 和 ResNet101),通过245组受控实验结合定量指标、定性评估与统计分析,揭示结构保留、风格表现与计算效率之间的权衡关系。结果表明,ResNet系列模型在保持相似感知相似性(LPIPS = 0.53)的同时,收敛速度比VGG快约5–6倍,且浮点运算量(FLOPs)减少超过16倍(0.63 vs 10.12 GFLOPs),展现出显著的效率优势,因此被推荐作为高效且结构保真的蜡染生成基础架构。

链接: https://arxiv.org/abs/2601.00888
作者: Happy Gery Pangestu,Andi Prademon Yunus,Siti Khomsah
机构: Telkom University (Telkom大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages, 9 figures, submitted in VCIBA

点击查看摘要

Abstract:Neural Style Transfer (NST) provides a computational framework for the digital preservation and generative exploration of Indonesian batik motifs; however, existing approaches remain largely centered on VGG-based architectures whose strong stylistic expressiveness comes at the cost of high computational and memory demands, that limits practical deployment in resource-limited environments. This study presents a systematic comparative analysis of five widely used CNN backbones, namely VGG16, VGG19, Inception V3, ResNet50, and ResNet101, based on 245 controlled experiments combining quantitative metrics, qualitative assessment, and statistical analysis to examine the trade-off between structural preservation, stylistic behavior, and computational efficiency. The results show that backbone selection does not yield statistically significant differences in structural similarity, as confirmed by ANOVA on SSIM (p= 0.83), indicating comparable levels of structural preservation rather than equivalent stylistic quality. Within this context, ResNet-based architectures achieve approximately 5-6x faster convergence than VGG models while maintaining similar perceptual similarity (LPIPS = 0.53) and requiring over 16x fewer FLOPs (0.63 vs 10.12 GFLOPs). Qualitative analysis reveals consistent stylistic trade-offs, with VGG producing denser painterly textures, ResNet favoring geometric stability and canting stroke preservation with milder stylization, and Inception V3 exhibiting intermediate but noisier behavior. These findings reposition architectural choice in NST from maximizing stylistic intensity toward efficiency-aware and structure-preserving deployment, highlighting ResNet-based backbones as a practical foundation for scalable, industry-oriented batik generation.
zh

[CV-182] VideoCuRL: Video Curriculum Reinforcement Learning with Orthogonal Difficulty Decomposition

【速读】:该论文旨在解决当前强化学习(Reinforcement Learning, RL)在视频大模型(VideoLLMs)训练中因依赖随机数据打乱或基于标量难度指标的简单课程策略而导致的复杂时空推理能力不足的问题。其核心挑战在于现有方法未能区分视频理解中的两个独立维度:视觉时间感知负荷(Visual Temporal Perception Load)与认知推理深度(Cognitive Reasoning Depth)。解决方案的关键在于提出 VideoCuRL 框架,通过将难度解耦为二维课程空间,利用无训练的高效代理指标——光流(optical flow)和关键帧熵(keyframe entropy)表征视觉复杂度,校准惊喜度(Calibrated Surprisal)衡量认知复杂度,从而实现数据在二维课程网格上的精准映射;并引入一种具备能力感知的对角波前调度策略(Diagonal Wavefront),从基础对齐逐步过渡到高阶推理任务,同时结合动态稀疏 KL 散度与结构化重访机制以稳定训练过程、防止奖励坍塌与灾难性遗忘。

链接: https://arxiv.org/abs/2601.00887
作者: Hongbo Jin,Kuanwei Lin,Wenhao Zhang,Yichen Jin,Ge Li
机构: Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) is crucial for empowering VideoLLMs with complex spatiotemporal reasoning. However, current RL paradigms predominantly rely on random data shuffling or naive curriculum strategies based on scalar difficulty metrics. We argue that scalar metrics fail to disentangle two orthogonal challenges in video understanding: Visual Temporal Perception Load and Cognitive Reasoning Depth. To address this, we propose VideoCuRL, a novel framework that decomposes difficulty into these two axes. We employ efficient, training-free proxies, optical flow and keyframe entropy for visual complexity, Calibrated Surprisal for cognitive complexity, to map data onto a 2D curriculum grid. A competence aware Diagonal Wavefront strategy then schedules training from base alignment to complex reasoning. Furthermore, we introduce Dynamic Sparse KL and Structured Revisiting to stabilize training against reward collapse and catastrophic forgetting. Extensive experiments show that VideoCuRL surpasses strong RL baselines on reasoning (+2.5 on VSI-Bench) and perception (+2.9 on VideoMME) tasks. Notably, VideoCuRL eliminates the prohibitive inference overhead of generation-based curricula, offering a scalable solution for robust video post-training.
zh

[CV-183] VL-OrdinalFormer: Vision Language Guided Ordinal Transformers for Interpretable Knee Osteoarthritis Grading

【速读】:该论文旨在解决膝骨关节炎(Knee Osteoarthritis, KOA)早期阶段(特别是KL1与KL2之间)的放射学分级主观差异大、诊断一致性差的问题。现有方法在区分细微影像特征如关节间隙狭窄(joint space narrowing)、骨赘形成(osteophyte formation)和软骨下硬化(subchondral sclerosis)时表现不稳定,导致放射科医师间评分一致性低。解决方案的关键在于提出VLOrdinalFormer框架——一个基于视觉语言引导的序数学习模型,其核心创新包括:(1) 使用ViT-L16主干网络结合CORAL序数回归建模等级关系;(2) 引入CLIP驱动的语义对齐模块,将临床文本概念嵌入图像表征以增强判别力;(3) 采用分层五折交叉验证、类别感知重加权及测试时增强与全局阈值优化策略提升鲁棒性与泛化能力。实验表明该方法在OAI kneeKL224数据集上显著优于CNN和ViT基线,在保持严重病例准确率的同时大幅改善KL1和KL2的分类性能,并通过Grad-CAM和CLIP相似度图实现可解释性验证。

链接: https://arxiv.org/abs/2601.00879
作者: Zahid Ullah,Jihie Kim
机构: Dongguk University (东国大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Knee osteoarthritis (KOA) is a leading cause of disability worldwide, and accurate severity assessment using the Kellgren Lawrence (KL) grading system is critical for clinical decision making. However, radiographic distinctions between early disease stages, particularly KL1 and KL2, are subtle and frequently lead to inter-observer variability among radiologists. To address these challenges, we propose VLOrdinalFormer, a vision language guided ordinal learning framework for fully automated KOA grading from knee radiographs. The proposed method combines a ViT L16 backbone with CORAL based ordinal regression and a Contrastive Language Image Pretraining (CLIP) driven semantic alignment module, allowing the model to incorporate clinically meaningful textual concepts related to joint space narrowing, osteophyte formation, and subchondral sclerosis. To improve robustness and mitigate overfitting, we employ stratified five fold cross validation, class aware re weighting to emphasize challenging intermediate grades, and test time augmentation with global threshold optimization. Experiments conducted on the publicly available OAI kneeKL224 dataset demonstrate that VLOrdinalFormer achieves state of the art performance, outperforming CNN and ViT baselines in terms of macro F1 score and overall accuracy. Notably, the proposed framework yields substantial performance gains for KL1 and KL2 without compromising classification accuracy for mild or severe cases. In addition, interpretability analyses using Grad CAM and CLIP similarity maps confirm that the model consistently attends to clinically relevant anatomical regions. These results highlight the potential of vision language aligned ordinal transformers as reliable and interpretable tools for KOA grading and disease progression assessment in routine radiological practice.
zh

[CV-184] Motion-Compensated Latent Semantic Canvases for Visual Situational Awareness on Edge

【速读】:该论文旨在解决在资源受限的边缘设备上实现高效视觉情境感知(visual situational awareness)的问题,传统方法依赖于每帧进行昂贵的全景分割(panoptic segmentation),导致计算开销大、延迟高。解决方案的关键在于提出运动补偿的潜在语义画布(Motion-Compensated Latent Semantic Canvases, MCLSC),通过两个潜在画布——一个缓慢累积的静态层和一个快速更新的动态层——来持久化语义元数据,并在由视频流稳定化的基线坐标系中维护一致性。该方法采用异步且运动门控的分割推理机制(即仅当检测到运动表明存在新信息时才触发Mask2Former分割),结合运动补偿技术以保持语义记忆的空间一致性,从而显著减少分割调用次数(降低30倍)并缩短端到端处理时间(降低20倍),同时维持静态与动态语义图层的连贯性。

链接: https://arxiv.org/abs/2601.00854
作者: Igor Lodin,Sergii Filatov,Vira Filatova,Dmytro Filatov
机构: Aimech Technologies Corp.(Aimech Technologies 公司); Covijn Ltd.(Covijn 有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures

点击查看摘要

Abstract:We propose Motion-Compensated Latent Semantic Canvases (MCLSC) for visual situational awareness on resource-constrained edge devices. The core idea is to maintain persistent semantic metadata in two latent canvases - a slowly accumulating static layer and a rapidly updating dynamic layer - defined in a baseline coordinate frame stabilized from the video stream. Expensive panoptic segmentation (Mask2Former) runs asynchronously and is motion-gated: inference is triggered only when motion indicates new information, while stabilization/motion compensation preserves a consistent coordinate system for latent semantic memory. On prerecorded 480p clips, our prototype reduces segmentation calls by 30x and lowers mean end-to-end processing time by 20x compared to naive per-frame segmentation, while maintaining coherent static/dynamic semantic overlays.
zh

[CV-185] A Global Atlas of Digital Dermatology to Map Innovation and Disparities

【速读】:该论文旨在解决当前皮肤科人工智能(Artificial Intelligence, AI)模型依赖的数据质量与覆盖范围不足的问题,尤其是缺乏量化指标来评估新数据集是否真正拓展了临床覆盖范围,还是仅重复已有信息。其解决方案的关键在于提出SkinMap——一个用于对公开皮肤科数据集进行全面审计的多模态框架,通过构建包含超过110万张皮肤病图像的语义化可查询知识图谱,首次实现了对数据信息新颖性、冗余度及人群和诊断分布代表性等维度的系统量化分析,从而识别出当前数据集中存在的结构性盲区(如深色皮肤类型和儿科患者占比极低),并为未来有针对性的数据采集提供科学依据。

链接: https://arxiv.org/abs/2601.00840
作者: Fabian Gröger,Simone Lionetti,Philippe Gottfrois,Alvaro Gonzalez-Jimenez,Lea Habermacher,Labelling Consortium,Ludovic Amruthalingam,Matthew Groh,Marc Pouly,Alexander A. Navarini
机构: University of Basel (巴塞尔大学); University of Applied Sciences and Arts Northwestern Switzerland (西北瑞士应用科学与艺术大学); University of St. Gallen (圣加仑大学); Kellogg School of Management, Northwestern University (西北大学凯洛格管理学院)
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The adoption of artificial intelligence in dermatology promises democratized access to healthcare, but model reliability depends on the quality and comprehensiveness of the data fueling these models. Despite rapid growth in publicly available dermatology images, the field lacks quantitative key performance indicators to measure whether new datasets expand clinical coverage or merely replicate what is already known. Here we present SkinMap, a multi-modal framework for the first comprehensive audit of the field’s entire data basis. We unify the publicly available dermatology datasets into a single, queryable semantic atlas comprising more than 1.1 million images of skin conditions and quantify (i) informational novelty over time, (ii) dataset redundancy, and (iii) representation gaps across demographics and diagnoses. Despite exponential growth in dataset sizes, informational novelty across time has somewhat plateaued: Some clusters, such as common neoplasms on fair skin, are densely populated, while underrepresented skin types and many rare diseases remain unaddressed. We further identify structural gaps in coverage: Darker skin tones (Fitzpatrick V-VI) constitute only 5.8% of images and pediatric patients only 3.0%, while many rare diseases and phenotype combinations remain sparsely represented. SkinMap provides infrastructure to measure blind spots and steer strategic data acquisition toward undercovered regions of clinical space.
zh

[CV-186] Unified Review and Benchmark of Deep Segmentation Architectures for Cardiac Ultrasound on CAMUS

【速读】:该论文旨在解决心脏超声图像分割领域中缺乏统一、可复现的实验基准问题,以系统性地比较不同深度学习架构在真实临床数据上的性能表现。其关键解决方案在于构建了一个标准化的基准测试框架,涵盖U-Net、Attention U-Net和TransUNet三种主流网络结构,并在CAMUS(Cardiac Acquisitions for Multi-Structure Ultrasound Segmentation)数据集上采用一致的训练划分、损失函数和评估指标进行对比;同时探索了多种预处理策略(如保留原始NIfTI动态范围、16-bit PNG导出、GPT辅助伪标签生成及自监督预训练SSL),发现保持强度保真度对模型性能至关重要,而TransUNet结合SSL初始化在复杂帧上展现出最强泛化能力,且伪标签经置信度筛选后能有效扩展训练集并提升鲁棒性。

链接: https://arxiv.org/abs/2601.00839
作者: Zahid Ullah,Muhammad Hilal,Eunsoo Lee,Dragan Pamucar,Jihie Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Several review papers summarize cardiac imaging and DL advances, few works connect this overview to a unified and reproducible experimental benchmark. In this study, we combine a focused review of cardiac ultrasound segmentation literature with a controlled comparison of three influential architectures, U-Net, Attention U-Net, and TransUNet, on the Cardiac Acquisitions for Multi-Structure Ultrasound Segmentation (CAMUS) echocardiography dataset. Our benchmark spans multiple preprocessing routes, including native NIfTI volumes, 16-bit PNG exports, GPT-assisted polygon-based pseudo-labels, and self-supervised pretraining (SSL) on thousands of unlabeled cine frames. Using identical training splits, losses, and evaluation criteria, a plain U-Net achieved a 94% mean Dice when trained directly on NIfTI data (preserving native dynamic range), while the PNG-16-bit workflow reached 91% under similar conditions. Attention U-Net provided modest improvements on small or low-contrast regions, reducing boundary leakage, whereas TransUNet demonstrated the strongest generalization on challenging frames due to its ability to model global spatial context, particularly when initialized with SSL. Pseudo-labeling expanded the training set and improved robustness after confidence filtering. Overall, our contributions are threefold: a harmonized, apples-to-apples benchmark of U-Net, Attention U-Net, and TransUNet under standardized CAMUS preprocessing and evaluation; practical guidance on maintaining intensity fidelity, resolution consistency, and alignment when preparing ultrasound data; and an outlook on scalable self-supervision and emerging multimodal GPT-based annotation pipelines for rapid labeling, quality assurance, and targeted dataset curation.
zh

[CV-187] Pediatric Pneumonia Detection from Chest X-Rays:A Comparative Study of Transfer Learning and Custom CNNs

【速读】:该论文旨在解决儿童肺炎(pediatric pneumonia)在胸部X光片(chest X-ray)中准确诊断受限于放射科医生资源不足和诊断变异的问题。其解决方案的关键在于对比了从头训练的卷积神经网络(CNN)与迁移学习方法(ResNet50、DenseNet121、EfficientNet-B0)在儿科肺炎检测中的性能差异,并进一步评估了冻结主干(frozen-backbone)与微调(fine-tuning)两种策略的效果。研究发现,微调策略显著优于冻结主干模型,其中微调后的ResNet50模型达到了99.43%的准确率、99.61%的F1分数和99.93%的AUC,表明迁移学习结合微调是提升儿科肺炎自动检测精度的有效途径,具有在资源匮乏地区作为筛查工具的巨大潜力。

链接: https://arxiv.org/abs/2601.00837
作者: Agniv Roy Choudhury
机构: University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pneumonia is a leading cause of mortality in children under five, with over 700,000 deaths annually. Accurate diagnosis from chest X-rays is limited by radiologist availability and variability. Objective: This study compares custom CNNs trained from scratch with transfer learning (ResNet50, DenseNet121, EfficientNet-B0) for pediatric pneumonia detection, evaluating frozen-backbone and fine-tuning regimes. Methods: A dataset of 5,216 pediatric chest X-rays was split 80/10/10 for training, validation, and testing. Seven models were trained and assessed using accuracy, F1-score, and AUC. Grad-CAM visualizations provided explainability. Results: Fine-tuned ResNet50 achieved the best performance: 99.43% accuracy, 99.61% F1-score, and 99.93% AUC, with only 3 misclassifications. Fine-tuning outperformed frozen-backbone models by 5.5 percentage points on average. Grad-CAM confirmed clinically relevant lung regions guided predictions. Conclusions: Transfer learning with fine-tuning substantially outperforms CNNs trained from scratch for pediatric pneumonia detection, showing near-perfect accuracy. This system has strong potential as a screening tool in resource-limited settings. Future work should validate these findings on multi-center and adult datasets. Keywords: Pneumonia detection, deep learning, transfer learning, CNN, chest X-ray, pediatric diagnosis, ResNet, DenseNet, EfficientNet, Grad-CAM. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2601.00837 [cs.CV] (or arXiv:2601.00837v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2601.00837 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Agniv Roy Choudhury [view email] [v1] Fri, 26 Dec 2025 18:48:39 UTC (3,091 KB) Full-text links: Access Paper: View a PDF of the paper titled Pediatric Pneumonia Detection from Chest X-Rays:A Comparative Study of Transfer Learning and Custom CNNs, by Agniv Roy ChoudhuryView PDFHTML (experimental)TeX Source view license Current browse context: cs.CV prev | next new | recent | 2026-01 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
zh

[CV-188] ShrimpXNet: A Transfer Learning Framework for Shrimp Disease Classification with Augmented Regularization Adversarial Training and Explainable AI

【速读】:该论文旨在解决虾类养殖中疾病爆发对可持续生产构成的重大挑战,通过构建一种基于深度学习的自动化疾病分类方法实现及时、准确的病害检测。解决方案的关键在于:首先,采用六种预训练模型(ResNet50、EfficientNet、DenseNet201、MobileNet、ConvNeXt-Tiny、Xception)进行对比评估,并利用图像背景去除与Keras标准化预处理提升数据质量;其次,引入对抗训练(Fast Gradient Sign Method, FGSM)增强模型鲁棒性,同时结合CutMix和MixUp等先进数据增强策略缓解过拟合并提升泛化能力;最后,借助Grad-CAM、Grad-CAM++和XGrad-CAM等后验解释方法提高模型可解释性,使关键病灶区域可视化。实验表明,ConvNeXt-Tiny在测试集上达到96.88%的准确率,且置信区间为[0.953, 0.971],验证了该方案的有效性和可靠性。

链接: https://arxiv.org/abs/2601.00832
作者: Israk Hasan Jone,D.M. Rafiun Bin Masud,Promit Sarker,Sayed Fuad Al Labib,Nazmul Islam,Farhad Billah
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 Page, fugure 11

点击查看摘要

Abstract:Shrimp is one of the most widely consumed aquatic species globally, valued for both its nutritional content and economic importance. Shrimp farming represents a significant source of income in many regions; however, like other forms of aquaculture, it is severely impacted by disease outbreaks. These diseases pose a major challenge to sustainable shrimp production. To address this issue, automated disease classification methods can offer timely and accurate detection. This research proposes a deep learning-based approach for the automated classification of shrimp diseases. A dataset comprising 1,149 images across four disease classes was utilized. Six pretrained deep learning models, ResNet50, EfficientNet, DenseNet201, MobileNet, ConvNeXt-Tiny, and Xception were deployed and evaluated for performance. The images background was removed, followed by standardized preprocessing through the Keras image pipeline. Fast Gradient Sign Method (FGSM) was used for enhancing the model robustness through adversarial training. While advanced augmentation strategies, including CutMix and MixUp, were implemented to mitigate overfitting and improve generalization. To support interpretability, and to visualize regions of model attention, post-hoc explanation methods such as Grad-CAM, Grad-CAM++, and XGrad-CAM were applied. Exploratory results demonstrated that ConvNeXt-Tiny achieved the highest performance, attaining a 96.88% accuracy on the test dataset. After 1000 iterations, the 99% confidence interval for the model is [0.953,0.971].
zh

[CV-189] Can Generative Models Actually Forge Realistic Identity Documents?

【速读】:该论文旨在解决当前扩散模型(diffusion-based generative models)是否能够生成可规避人类或自动化验证系统的高仿真身份证件伪造件这一问题。其关键解决方案在于系统性评估多种开源文本到图像(text-to-image)与图像到图像(image-to-image)生成管道,涵盖Stable Diffusion、Qwen、Flux、Nano-Banana等主流模型家族,结果表明尽管这些模型能模拟证件的表面视觉特征,却无法复现结构和法证层面的真实性,从而揭示了生成式AI在身份文档伪造中达到法证级真实性的风险可能被高估。

链接: https://arxiv.org/abs/2601.00829
作者: Alexander Vinogradov
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 16 figures

点击查看摘要

Abstract:Generative image models have recently shown significant progress in image realism, leading to public concerns about their potential misuse for document forgery. This paper explores whether contemporary open-source and publicly accessible diffusion-based generative models can produce identity document forgeries that could realistically bypass human or automated verification systems. We evaluate text-to-image and image-to-image generation pipelines using multiple publicly available generative model families, including Stable Diffusion, Qwen, Flux, Nano-Banana, and others. The findings indicate that while current generative models can simulate surface-level document aesthetics, they fail to reproduce structural and forensic authenticity. Consequently, the risk of generative identity document deepfakes achieving forensic-level authenticity may be overestimated, underscoring the value of collaboration between machine learning practitioners and document-forensics experts in realistic risk assessment.
zh

[CV-190] Free Energy-Based Modeling of Emotional Dynamics in Video Advertisements

【速读】:该论文旨在解决广告视频中情绪识别的可解释性问题,即如何在不依赖生理信号或主观评分等外部信息的情况下,仅通过视频场景层面的表达特征来量化情绪反应(如愉悦感、惊讶和习惯化)。其解决方案的关键在于基于自由能(Free Energy, FE)原理构建一个计算框架,其中利用Kullback-Leibler散度(KLD)表征预测误差以反映“愉悦感”,贝叶斯惊讶(Bayesian Surprise, BS)捕捉信念更新以对应“信息复杂性引发的惊讶”,不确定性(Uncertainty, UN)则衡量先验模糊性,体现由元素类型、空间排列及呈现数量变化带来的“不确定性驱动的惊讶”。该方法在1059条15秒食品广告视频上验证有效,并展现出跨超参数设置与多种日本广告视频类型的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2601.00812
作者: Takashi Ushio,Kazuhiro Onishi,Hideyoshi Yanagisawa
机构: Hakuhodo DY Holdings Inc.(日挥DY控股公司); Hakuhodo Technologies Inc.(日挥技术公司); Graduate School of Engineering, The University of Tokyo(东京大学工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This article has been accepted for publication in IEEE Access and will be published shortly

点击查看摘要

Abstract:Emotional responses during advertising video viewing are recognized as essential for understanding media effects because they have influenced attention, memory, and purchase intention. To establish a methodological basis for explainable emotion estimation without relying on external information such as physiological signals or subjective ratings, we have quantified “pleasantness,” “surprise,” and “habituation” solely from scene-level expression features of advertising videos, drawing on the free energy(FE) principle, which has provided a unified account of perception, learning, and behavior. In this framework, Kullback-Leibler divergence (KLD) has captured prediction error, Bayesian surprise (BS) has captured belief updates, and uncertainty (UN) has reflected prior ambiguity, and together they have formed the core components of FE. Using 1,059 15 s food video advertisements, the experiments have shown that KLD has reflected “pleasantness” associated with brand presentation, BS has captured “surprise” arising from informational complexity, and UN has reflected “surprise” driven by uncertainty in element types and spatial arrangements, as well as by the variability and quantity of presented elements. This study also identified three characteristic emotional patterns, namely uncertain stimulus, sustained high emotion, and momentary peak and decay, demonstrating the usefulness of the proposed method. Robustness across nine hyperparameter settings and generalization tests with six types of Japanese advertising videos (three genres and two durations) confirmed that these tendencies remained stable. This work can be extended by integrating a wider range of expression elements and validating the approach through subjective ratings, ultimately guiding the development of technologies that can support the creation of more engaging advertising videos.
zh

[CV-191] Real-Time Human Detection for Aerial Captured Video Sequences via Deep Models

【速读】:该论文旨在解决视频中人体检测(human detection)在非静态摄像头(如航空平台拍摄)场景下的鲁棒性问题,尤其针对光照变化、摄像机抖动及目标尺寸变化等动态干扰因素。传统依赖人工设计特征的方法因任务特异性高且易受环境扰动影响而表现不佳。解决方案的关键在于采用自动特征学习方法,结合光流(optical flow)与三种深度模型(监督卷积神经网络 S-CNN、预训练 CNN 特征提取器、分层极限学习机 H-ELM),实现对复杂背景和多变姿态下人体的有效识别。实验表明,预训练 CNN 在 UCF-ARG 航空数据集上达到平均 98.09% 的准确率,验证了该方案在精度与效率上的优越性。

链接: https://arxiv.org/abs/2601.00391
作者: Nouar AlDahoul,Aznul Qalid Md Sabri,Ali Mohammed Mansoor
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Human detection in videos plays an important role in various real-life applications. Most traditional approaches depend on utilizing handcrafted features, which are problem-dependent and optimal for specific tasks. Moreover, they are highly susceptible to dynamical events such as illumination changes, camera jitter, and variations in object sizes. On the other hand, the proposed feature learning approaches are cheaper and easier because highly abstract and discriminative features can be produced automatically without the need of expert knowledge. In this paper, we utilize automatic feature learning methods, which combine optical flow and three different deep models (i.e., supervised convolutional neural network (S-CNN), pretrained CNN feature extractor, and hierarchical extreme learning machine) for human detection in videos captured using a nonstatic camera on an aerial platform with varying altitudes. The models are trained and tested on the publicly available and highly challenging UCF-ARG aerial dataset. The comparison between these models in terms of training, testing accuracy, and learning speed is analyzed. The performance evaluation considers five human actions (digging, waving, throwing, walking, and running). Experimental results demonstrated that the proposed methods are successful for the human detection task. The pretrained CNN produces an average accuracy of 98.09%. S-CNN produces an average accuracy of 95.6% with softmax and 91.7% with Support Vector Machines (SVM). H-ELM has an average accuracy of 95.9%. Using a normal Central Processing Unit (CPU), H-ELM’s training time takes 445 seconds. Learning in S-CNN takes 770 seconds with a high-performance Graphical Processing Unit (GPU).
zh

[CV-192] owards Long-window Anchoring in Vision-Language Model Distillation AAAI2026

【速读】:该论文旨在解决小规模视觉语言模型(Vision-Language Models, VLMs)在有限窗口尺寸下难以实现语言与图像语义对齐的问题,尤其针对其在长上下文理解能力上的不足。解决方案的关键在于提出LAid方法,通过两个互补组件实现长程注意力机制的有效迁移:一是渐进式距离加权注意力匹配机制,训练过程中动态增强较长位置差异的注意力权重;二是可学习的RoPE响应增益调制机制,选择性放大关键位置敏感区域的响应强度。该方法结合知识蒸馏与旋转位置编码(Rotary Position Embeddings, RoPE)优化,在多个模型族中显著扩展了有效上下文窗口至基线小模型的3.2倍,同时保持或提升标准VL基准性能,并通过频谱分析验证了低频重要注意力成分的成功保留。

链接: https://arxiv.org/abs/2512.21576
作者: Haoyi Zhou,Shuo Li,Tianyu Chen,Qi Song,Chonghan Gao,Jianxin Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:While large vision-language models (VLMs) demonstrate strong long-context understanding, their prevalent small branches fail on linguistics-photography alignment for a limited window size. We discover that knowledge distillation improves students’ capability as a complement to Rotary Position Embeddings (RoPE) on window sizes (anchored from large models). Building on this insight, we propose LAid, which directly aims at the transfer of long-range attention mechanisms through two complementary components: (1) a progressive distance-weighted attention matching that dynamically emphasizes longer position differences during training, and (2) a learnable RoPE response gain modulation that selectively amplifies position sensitivity where needed. Extensive experiments across multiple model families demonstrate that LAid-distilled models achieve up to 3.2 times longer effective context windows compared to baseline small models, while maintaining or improving performance on standard VL benchmarks. Spectral analysis also suggests that LAid successfully preserves crucial low-frequency attention components that conventional methods fail to transfer. Our work not only provides practical techniques for building more efficient long-context VLMs but also offers theoretical insights into how positional understanding emerges and transfers during distillation.
zh

[CV-193] Sim2Real SAR Image Restoration: Metadata-Driven Models for Joint Despeckling and Sidelobes Reduction

【速读】:该论文旨在解决合成孔径雷达(Synthetic Aperture Radar, SAR)图像中由于斑点噪声(speckle)和强目标周围旁瓣(sidelobes)导致的图像解读不准确问题。传统方法通常将去斑点与旁瓣抑制视为独立任务处理,而本文提出了一种统一框架,利用神经网络(Neural Networks, NNs)联合完成这两项任务;其关键创新在于使用基于MOCEM模拟生成的真实感SAR数据集进行训练,并在推理阶段直接应用于真实SAR图像,实现了有效的仿真到现实(Simulation to Real, Sim2Real)迁移能力;此外,通过引入成像元数据作为辅助输入,进一步提升了恢复性能。

链接: https://arxiv.org/abs/2601.01541
作者: Antoine De Paepe,Pascal Nguyen,Michael Mabelle,Cédric Saleun,Antoine Jouadé,Jean-Christophe Louvigne
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: Accepted at the Conference on Artificial Intelligence for Defense (CAID), 2025, Rennes, France

点击查看摘要

Abstract:Synthetic aperture radar (SAR) provides valuable information about the Earth’s surface under all weather and illumination conditions. However, the inherent phenomenon of speckle and the presence of sidelobes around bright targets pose challenges for accurate interpretation of SAR imagery. Most existing SAR image restoration methods address despeckling and sidelobes reduction as separate tasks. In this paper, we propose a unified framework that jointly performs both tasks using neural networks (NNs) trained on a realistic SAR simulated dataset generated with MOCEM. Inference can then be performed on real SAR images, demonstrating effective simulation to real (Sim2Real) transferability. Additionally, we incorporate acquisition metadata as auxiliary input to the NNs, demonstrating improved restoration performance.
zh

[CV-194] Image Synthesis Using Spintronic Deep Convolutional Generative Adversarial Network

【速读】:该论文旨在解决生成式对抗网络(Generative Adversarial Networks, GANs)在传统冯·诺依曼架构上计算能耗过高、难以满足高效能需求的问题,提出了一种基于CMOS与自旋电子学混合的深度卷积生成对抗网络(Deep Convolutional GAN, DCGAN)架构。其关键创新在于:1)将生成器中的反卷积层重构为零填充卷积结构,以适配6位Skyrmion基交叉阵列存储单元,实现硬件感知的无缝集成且不损失训练性能;2)设计了一种可调谐的Leaky ReLU单元,利用磁畴壁位置编码的连续电阻状态及分段单轴抛物各向异性分布,并结合平行磁隧道结(MTJ)读出机制,实现仅0.192 pJ的能量消耗,显著降低非线性激活层功耗。该方案在灰度和彩色数据集上均表现出良好适应性,验证了其在低能耗图像生成任务中的可行性与有效性。

链接: https://arxiv.org/abs/2601.01441
作者: Saumya Gupta,Abhinandan,Venkatesh vadde,Bhaskaran Muralidharan,Abhishek Sharma
机构: IIT Bombay (印度理工学院孟买分校); IIT Ropar (印度理工学院罗帕尔分校); IIT Mandi (印度理工学院曼迪分校)
类目: Applied Physics (physics.app-ph); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 4 figures

点击查看摘要

Abstract:The computational requirements of generative adversarial networks (GANs) exceed the limit of conventional Von Neumann architectures, necessitating energy efficient alternatives such as neuromorphic spintronics. This work presents a hybrid CMOS-spintronic deep convolutional generative adversarial network (DCGAN) architecture for synthetic image generation. The proposed generative vision model approach follows the standard framework, leveraging generator and discriminators adversarial training with our designed spintronics hardware for deconvolution, convolution, and activation layers of the DCGAN architecture. To enable hardware aware spintronic implementation, the generator’s deconvolution layers are restructured as zero padded convolution, allowing seamless integration with a 6-bit skyrmion based synapse in a crossbar, without compromising training performance. Nonlinear activation functions are implemented using a hybrid CMOS domain wall based Rectified linear unit (ReLU) and Leaky ReLU units. Our proposed tunable Leaky ReLU employs domain wall position coded, continuous resistance states and a piecewise uniaxial parabolic anisotropy profile with a parallel MTJ readout, exhibiting energy consumption of 0.192 pJ. Our spintronic DCGAN model demonstrates adaptability across both grayscale and colored datasets, achieving Fr’echet Inception Distances (FID) of 27.5 for the Fashion MNIST and 45.4 for Anime Face datasets, with testing energy (training energy) of 4.9 nJ (14.97~nJ/image) and 24.72 nJ (74.7 nJ/image).
zh

[CV-195] Quantifying Local Strain Field and Deformation in Active Contraction of Bladder Using a Pretrained Transformer Model: A Speckle-Free Approach

【速读】:该论文旨在解决膀胱收缩过程中局部应变场精确量化的问题,传统数字图像相关(Digital Image Correlation, DIC)方法需人工打标(speckling),可能改变组织的被动和主动力学特性,从而影响测量准确性。其解决方案的关键在于提出一种无标记(speckle-free)的应变场定量框架,利用先进的零样本Transformer模型CoTracker3,结合自研便携式等张双轴装置与多光子显微镜(Multiphoton Microscopy, MPM),实现了对膀胱腔内天然纹理的高精度跟踪,无需人工标记即可准确捕捉复杂褶皱和屈曲下的非均匀变形模式,显著提升了生理相关性与测量可靠性。

链接: https://arxiv.org/abs/2601.01315
作者: Alireza Asadbeygi,Anne M. Robertson,Yasutaka Tobe,Masoud Zamani,Sean D. Stocker,Paul Watton,Naoki Yoshimura,Simon C Watkins
机构: 未知
类目: Tissues and Organs (q-bio.TO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate quantification of local strain fields during bladder contraction is essential for understanding the biomechanics of bladder micturition, in both health and disease. Conventional digital image correlation (DIC) methods have been successfully applied to various biological tissues; however, this approach requires artificial speckling, which can alter both passive and active properties of the tissue. In this study, we introduce a speckle-free framework for quantifying local strain fields using a state-of-the-art, zero-shot transformer model, CoTracker3. We utilized a custom-designed, portable isotonic biaxial apparatus compatible with multiphoton microscopy (MPM) to demonstrate this approach, successfully tracking natural bladder lumen textures without artificial markers. Benchmark tests validated the method’s high pixel accuracy and low strain errors. Our framework effectively captured heterogeneous deformation patterns, despite complex folding and buckling, which conventional DIC often fails to track. Application to in vitro active bladder contractions in four rat specimens (n=4) revealed statistically significant anisotropy (p0.01), with higher contraction longitudinally compared to circumferentially. Multiphoton microscopy further illustrated and confirmed heterogeneous morphological changes, such as large fold formation during active contraction. This non-invasive approach eliminates speckle-induced artifacts, enabling more physiologically relevant measurements, and has broad applicability for material testing of other biological and engineered systems.
zh

[CV-196] Seamlessly Natural: Image Stitching with Natural Appearance Preservation

【速读】:该论文旨在解决传统图像拼接方法在存在视差(parallax)和深度变化的复杂真实场景中因刚性平面假设(homographic alignment)导致的结构失真问题,如可见形变、球状凸起等。其核心解决方案在于提出一种几何驱动的无缝自然拼接方法(SENA),关键创新包括:1)分层仿射变形策略,融合全局仿射初始化与局部仿射精调及平滑自由形变,有效保持局部形状、平行性和长宽比;2)基于RANSAC过滤特征对应关系的视差一致性检测机制,无需语义分割即可识别视差最小区域(adequate zone);3)基于锚点的接缝线切割与分割策略,在图像对之间强制建立一一对应的几何关系,从而显著减少鬼影、重复和涂抹伪影,提升全景图的整体视觉真实感与结构保真度。

链接: https://arxiv.org/abs/2601.01257
作者: Gaetane Lorna N. Tchana,Damaris Belle M. Fotso,Antonio Hendricks,Christophe Bobda
机构: University of Yaoundé I (雅温得第一大学); University of Florida (佛罗里达大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:This paper introduces SENA (SEamlessly NAtural), a geometry-driven image stitching approach that prioritizes structural fidelity in challenging real-world scenes characterized by parallax and depth variation. Conventional image stitching relies on homographic alignment, but this rigid planar assumption often fails in dual-camera setups with significant scene depth, leading to distortions such as visible warps and spherical bulging. SENA addresses these fundamental limitations through three key contributions. First, we propose a hierarchical affine-based warping strategy, combining global affine initialization with local affine refinement and smooth free-form deformation. This design preserves local shape, parallelism, and aspect ratios, thereby avoiding the hallucinated structural distortions commonly introduced by homography-based models. Second, we introduce a geometry-driven adequate zone detection mechanism that identifies parallax-minimized regions directly from the disparity consistency of RANSAC-filtered feature correspondences, without relying on semantic segmentation. Third, building upon this adequate zone, we perform anchor-based seamline cutting and segmentation, enforcing a one-to-one geometric correspondence across image pairs by construction, which effectively eliminates ghosting, duplication, and smearing artifacts in the final panorama. Extensive experiments conducted on challenging datasets demonstrate that SENA achieves alignment accuracy comparable to leading homography-based methods, while significantly outperforming them in critical visual metrics such as shape preservation, texture integrity, and overall visual realism. Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Signal Processing (eess.SP) Cite as: arXiv:2601.01257 [eess.IV] (or arXiv:2601.01257v1 [eess.IV] for this version) https://doi.org/10.48550/arXiv.2601.01257 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-197] YODA: Yet Another One-step Diffusion-based Video Compressor

【速读】:该论文旨在解决当前基于扩散模型的视频压缩方法在利用时间依赖性方面存在的不足问题。现有方法通常依赖预训练的二维自编码器独立生成每一帧的潜在表示,从而忽略了帧间的时间相关性,限制了压缩效率与感知质量。其解决方案的关键在于提出YODA(Yet Another One-step Diffusion-based Video Compressor),通过引入多尺度时序参考特征嵌入机制,在潜在空间生成和编码阶段均融合时空信息,以更充分地挖掘空间-时间相关性;同时采用线性Diffusion Transformer(DiT)实现高效的单步去噪过程,显著提升压缩性能。实验表明,YODA在LPIPS、DISTS、FID和KID等指标上均优于传统及深度学习基线方法,达到当前最优的感知质量表现。

链接: https://arxiv.org/abs/2601.01141
作者: Xingchen Li,Junzhe Zhang,Junqi Shi,Ming Lu,Zhan Ma
机构: Nanjing University (南京大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Code will be available at this https URL

点击查看摘要

Abstract:While one-step diffusion models have recently excelled in perceptual image compression, their application to video remains limited. Prior efforts typically rely on pretrained 2D autoencoders that generate per-frame latent representations independently, thereby neglecting temporal dependencies. We present YODA–Yet Another One-step Diffusion-based Video Compressor–which embeds multiscale features from temporal references for both latent generation and latent coding to better exploit spatial-temporal correlations for more compact representation, and employs a linear Diffusion Transformer (DiT) for efficient one-step denoising. YODA achieves state-of-the-art perceptual performance, consistently outperforming traditional and deep-learning baselines on LPIPS, DISTS, FID, and KID. Source code will be publicly available at this https URL.
zh

[CV-198] An Explainable Agent ic AI Framework for Uncertainty-Aware and Abstention-Enabled Acute Ischemic Stroke Imaging Decisions

【速读】:该论文旨在解决当前急性缺血性卒中影像分析中人工智能模型存在的“黑箱”问题,即模型缺乏不确定性意识和在模糊条件下主动放弃预测的能力,这在高风险急诊放射学场景中可能引发严重的安全与信任隐患。解决方案的关键在于提出一种可解释的代理型(agentic)AI框架,其核心由三个模块组成:感知代理(perception agent)执行病灶感知的图像分析,不确定性估计代理(uncertainty estimation agent)计算切片级别的预测可靠性,决策代理(decision agent)根据预设阈值决定是否输出预测或选择 abstention(弃权)。该框架通过引入不确定性驱动的弃权机制和可视化解释能力,实现了临床安全优先、透明且符合医生决策行为的智能辅助系统设计。

链接: https://arxiv.org/abs/2601.01008
作者: Md Rashadul Islam
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. Conceptual and exploratory framework focusing on uncertainty-aware and abstention-enabled decision support for acute ischemic stroke imaging

点击查看摘要

Abstract:Artificial intelligence models have shown strong potential in acute ischemic stroke imaging, particularly for lesion detection and segmentation using computed tomography and magnetic resonance imaging. However, most existing approaches operate as black box predictors, producing deterministic outputs without explicit uncertainty awareness or structured mechanisms to abstain under ambiguous conditions. This limitation raises serious safety and trust concerns in high risk emergency radiology settings. In this paper, we propose an explainable agentic AI framework for uncertainty aware and abstention enabled decision support in acute ischemic stroke imaging. The framework follows a modular agentic pipeline in which a perception agent performs lesion aware image analysis, an uncertainty estimation agent computes slice level predictive reliability, and a decision agent determines whether to issue a prediction or abstain based on predefined uncertainty thresholds. Unlike prior stroke imaging systems that primarily focus on improving segmentation or classification accuracy, the proposed framework explicitly prioritizes clinical safety, transparency, and clinician aligned decision behavior. Qualitative and case based analyses across representative stroke imaging scenarios demonstrate that uncertainty driven abstention naturally emerges in diagnostically ambiguous regions and low information slices. The framework further integrates visual explanation mechanisms to support both predictive and abstention decisions, addressing a key limitation of existing uncertainty aware medical imaging systems. Rather than introducing a new performance benchmark, this work presents agentic control, uncertainty awareness, and selective abstention as essential design principles for developing safe and trustworthy medical imaging AI systems.
zh

[CV-199] Scale-aware Adaptive Supervised Network with Limited Medical Annotations

【速读】:该论文旨在解决医学图像分割在半监督学习场景下的关键挑战,包括标注数据严重稀缺、不同观察者间标注变异性大(inter-annotator variability)、以及复杂解剖结构中多尺度特征融合不足导致边界精确定位困难等问题。现有半监督方法相较于全监督方法性能显著下降,尤其在小目标分割和边界细化任务上表现欠佳。解决方案的核心在于提出SASNet(Scale-aware Adaptive Supervised Network),其关键创新包括:1)尺度感知自适应重加权策略(Scale-aware Adaptive Reweight),通过时间信心累积动态调整像素级预测权重;2)视角方差增强机制(View Variance Enhancement),利用3D傅里叶域变换模拟标注变异;3)分割-回归一致性学习,基于符号距离图(signed distance map)算法提升边界精度。这三项机制共同构建了一个融合空间、时间和几何一致性的统一优化框架,显著提升了小样本条件下的分割性能,逼近全监督水平。

链接: https://arxiv.org/abs/2601.01005
作者: Zihan Li,Dandan Shan,Yunxiang Li,Paul E. Kinahan,Qingqi Hong
机构: Xiamen University (厦门大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by Pattern Recognition, 8 figures, 11 tables

点击查看摘要

Abstract:Medical image segmentation faces critical challenges in semi-supervised learning scenarios due to severe annotation scarcity requiring expert radiological knowledge, significant inter-annotator variability across different viewpoints and expertise levels, and inadequate multi-scale feature integration for precise boundary delineation in complex anatomical structures. Existing semi-supervised methods demonstrate substantial performance degradation compared to fully supervised approaches, particularly in small target segmentation and boundary refinement tasks. To address these fundamental challenges, we propose SASNet (Scale-aware Adaptive Supervised Network), a dual-branch architecture that leverages both low-level and high-level feature representations through novel scale-aware adaptive reweight mechanisms. Our approach introduces three key methodological innovations, including the Scale-aware Adaptive Reweight strategy that dynamically weights pixel-wise predictions using temporal confidence accumulation, the View Variance Enhancement mechanism employing 3D Fourier domain transformations to simulate annotation variability, and segmentation-regression consistency learning through signed distance map algorithms for enhanced boundary precision. These innovations collectively address the core limitations of existing semi-supervised approaches by integrating spatial, temporal, and geometric consistency principles within a unified optimization framework. Comprehensive evaluation across LA, Pancreas-CT, and BraTS datasets demonstrates that SASNet achieves superior performance with limited labeled data, surpassing state-of-the-art semi-supervised methods while approaching fully supervised performance levels. The source code for SASNet is available at this https URL.
zh

[CV-200] Uncertainty-Calibrated Explainable AI for Fetal Ultrasound Plane Classification

【速读】:该论文旨在解决胎儿超声标准切面分类在实际临床部署中面临的三大挑战:领域偏移(domain shift)、图像噪声干扰以及预测置信度校准不足的问题。其解决方案的关键在于构建一个不确定性校准的可解释人工智能(uncertainty-calibrated explainable AI)框架,通过融合多种不确定性估计方法(如蒙特卡洛丢弃、深度集成、证据学习和合规预测)与后处理及不确定性感知的解释技术(如Grad-CAM变体、LIME风格局部代理模型、不确定性加权多分辨率激活图),并将其映射至面向临床医生的工作流程。该框架不仅提升了模型在噪声采集条件下的可靠性,还通过结构化的报告协议(包括预期校准误差、Brier评分、覆盖率-风险曲线及带解释的误差分析)实现了准确率与校准性能的协同优化,并支持质量控制与人机协同审查机制,从而为构建鲁棒、可信且临床对齐的胎儿超声分类系统提供了可复现的蓝图。

链接: https://arxiv.org/abs/2601.00990
作者: Olaf Yunus Laitinen Imanov
机构: DTU Compute, Department of Applied Mathematics and Computer Science (丹麦技术大学计算系); Technical University of Denmark (丹麦技术大学); Section for Visual Computing (视觉计算组)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 1 figure, 4 tables

点击查看摘要

Abstract:Fetal ultrasound standard-plane classification underpins reliable prenatal biometry and anomaly screening, yet real-world deployment is limited by domain shift, image noise, and poor calibration of predicted probabilities. This paper presents a practical framework for uncertainty-calibrated explainable AI in fetal plane classification. We synthesize uncertainty estimation methods (Monte Carlo dropout, deep ensembles, evidential learning, and conformal prediction) with post-hoc and uncertainty-aware explanations (Grad-CAM variants, LIME-style local surrogates, and uncertainty-weighted multi-resolution activation maps), and we map these components to a clinician-facing workflow. Using FETAL_PLANES_DB as a reference benchmark, we define a reporting protocol that couples accuracy with calibration and selective prediction, including expected calibration error, Brier score, coverage-risk curves, and structured error analysis with explanations. We also discuss integration points for quality control and human-in-the-loop review, where uncertainty flags trigger re-acquisition or expert confirmation. The goal is a reproducible, clinically aligned blueprint for building fetal ultrasound classifiers whose confidence and explanations remain trustworthy under noisy acquisition conditions.
zh

[CV-201] MetaFormer-driven Encoding Network for Robust Medical Semantic Segmentation

【速读】:该论文旨在解决医学图像分割中先进模型因架构复杂而导致在资源受限临床环境中难以部署的问题。其核心解决方案是提出MFEnNet框架,通过在U-Net主干网络的编码阶段引入MetaFormer结构(一种视觉Transformer的架构抽象),以实现全局上下文建模;同时采用池化Transformer块替代传统自注意力机制,降低计算复杂度并保持有效特征聚合能力;此外,结合Swish激活函数提升梯度平滑性和收敛速度,并在瓶颈层引入空间金字塔池化模块增强多尺度特征提取性能,从而在保证分割精度的同时显著减少计算开销。

链接: https://arxiv.org/abs/2601.00922
作者: Le-Anh Tran,Chung Nguyen Tran,Nhan Cach Dang,Anh Le Van Quoc,Jordi Carrabina,David Castells-Rufas,Minh Son Nguyen
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures, MCT4SD 2025

点击查看摘要

Abstract:Semantic segmentation is crucial for medical image analysis, enabling precise disease diagnosis and treatment planning. However, many advanced models employ complex architectures, limiting their use in resource-constrained clinical settings. This paper proposes MFEnNet, an efficient medical image segmentation framework that incorporates MetaFormer in the encoding phase of the U-Net backbone. MetaFormer, an architectural abstraction of vision transformers, provides a versatile alternative to convolutional neural networks by transforming tokenized image patches into sequences for global context modeling. To mitigate the substantial computational cost associated with self-attention, the proposed framework replaces conventional transformer modules with pooling transformer blocks, thereby achieving effective global feature aggregation at reduced complexity. In addition, Swish activation is used to achieve smoother gradients and faster convergence, while spatial pyramid pooling is incorporated at the bottleneck to improve multi-scale feature extraction. Comprehensive experiments on different medical segmentation benchmarks demonstrate that the proposed MFEnNet approach attains competitive accuracy while significantly lowering computational cost compared to state-of-the-art models. The source code for this work is available at this https URL.
zh

[CV-202] Placenta Accreta Spectrum Detection using Multimodal Deep Learning

【速读】:该论文旨在解决胎盘粘连谱(Placenta Accreta Spectrum, PAS)的早期精准产前诊断问题,以降低母婴并发症风险。其解决方案的关键在于构建一种多模态深度学习框架,通过融合3D磁共振成像(MRI)与2D超声(US)的特征,在特征层面实现信息互补,从而提升PAS检测的准确性。研究采用3D DenseNet121-Vision Transformer和2D ResNet50分别提取MRI与US图像特征,并基于1,293例MRI和1,143例US扫描数据训练模型,最终在独立测试集上达到92.5%的准确率和0.927的AUC,显著优于单一模态模型,验证了多模态融合策略的有效性。

链接: https://arxiv.org/abs/2601.00907
作者: Sumaiya Ali,Areej Alhothali,Sameera Albasri,Ohoud Alzamzami,Ahmed Abduljabbar,Muhammad Alwazzan
机构: King Abdulaziz University(阿卜杜勒阿齐兹国王大学); King Abdulaziz University Hospital(阿卜杜勒阿齐兹国王大学医院)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Placenta Accreta Spectrum (PAS) is a life-threatening obstetric complication involving abnormal placental invasion into the uterine wall. Early and accurate prenatal diagnosis is essential to reduce maternal and neonatal risks. This study aimed to develop and validate a deep learning framework that enhances PAS detection by integrating multiple imaging modalities. A multimodal deep learning model was designed using an intermediate feature-level fusion architecture combining 3D Magnetic Resonance Imaging (MRI) and 2D Ultrasound (US) scans. Unimodal feature extractors, a 3D DenseNet121-Vision Transformer for MRI and a 2D ResNet50 for US, were selected after systematic comparative analysis. Curated datasets comprising 1,293 MRI and 1,143 US scans were used to train the unimodal models and paired samples of patient-matched MRI-US scans was isolated for multimodal model development and evaluation. On an independent test set, the multimodal fusion model achieved superior performance, with an accuracy of 92.5% and an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.927, outperforming the MRI-only (82.5%, AUC 0.825) and US-only (87.5%, AUC 0.879) models. Integrating MRI and US features provides complementary diagnostic information, demonstrating strong potential to enhance prenatal risk assessment and improve patient outcomes.
zh

人工智能

[AI-0] DARC: Drum accompaniment generation with fine-grained rhythm control

【速读】:该论文旨在解决音乐创作中生成式鼓伴奏时缺乏结构控制与风格灵活性的问题,现有方法要么无法有效控制节奏,要么无法基于音乐上下文进行条件生成。解决方案的关键在于提出DARC模型,该模型通过参数高效微调(parameter-efficient fine-tuning)对STAGE这一先进的鼓音轨生成器进行增强,使其能够同时接受来自其他声部的音乐上下文信息和显式的节奏提示(如拍手或打拍子录音),从而实现细粒度节奏控制与音乐语境感知的协同统一。

链接: https://arxiv.org/abs/2601.02357
作者: Trey Brosnan
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:In music creation, rapid prototyping is essential for exploring and refining ideas, yet existing generative tools often fall short when users require both structural control and stylistic flexibility. Prior approaches in stem-to-stem generation can condition on other musical stems but offer limited control over rhythm, and timbre-transfer methods allow users to specify specific rhythms, but cannot condition on musical context. We introduce DARC, a generative drum accompaniment model that conditions both on musical context from other stems and explicit rhythm prompts such as beatboxing or tapping tracks. Using parameter-efficient fine-tuning, we augment STAGE, a state-of-the-art drum stem generator, with fine-grained rhythm control while maintaining musical context awareness.
zh

[AI-1] Falcon-H1R: Pushing the Reasoning Frontiers with a Hybrid Model for Efficient Test-Time Scaling

【速读】:该论文旨在解决小语言模型(Small Language Models, SLMs)在复杂推理任务中性能不足的问题,即如何在不增加模型参数规模的前提下实现与大型模型相当甚至更优的推理能力。其解决方案的关键在于通过精细化的数据筛选和针对性的训练策略(包括高效的监督微调(Supervised Fine-Tuning, SFT)与强化学习(Reinforcement Learning, RL)扩展),结合混合并行架构设计以提升推理速度、token效率及准确率,从而在3D维度上优化推理效率。特别地,论文引入DeepConf方法实现了测试时缩放(test-time scaling)效率的显著提升,表明紧凑模型可通过有针对性的训练和架构设计实现强大且可扩展的推理性能。

链接: https://arxiv.org/abs/2601.02346
作者: Falcon LLM Team,Iheb Chaabane,Puneesh Khanna,Suhail Mohmad,Slim Frikha,Shi Hu,Abdalgader Abubaker,Reda Alami,Mikhail Lubinets,Mohamed El Amine Seddik,Hakim Hacid
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This work introduces Falcon-H1R, a 7B-parameter reasoning-optimized model that establishes the feasibility of achieving competitive reasoning performance with small language models (SLMs). Falcon-H1R stands out for its parameter efficiency, consistently matching or outperforming SOTA reasoning models that are 2\times to 7\times larger across a variety of reasoning-intensive benchmarks. These results underscore the importance of careful data curation and targeted training strategies (via both efficient SFT and RL scaling) in delivering significant performance gains without increasing model size. Furthermore, Falcon-H1R advances the 3D limits of reasoning efficiency by combining faster inference (through its hybrid-parallel architecture design), token efficiency, and higher accuracy. This unique blend makes Falcon-H1R-7B a practical backbone for scaling advanced reasoning systems, particularly in scenarios requiring extensive chain-of-thoughts generation and parallel test-time scaling. Leveraging the recently introduced DeepConf approach, Falcon-H1R achieves state-of-the-art test-time scaling efficiency, offering substantial improvements in both accuracy and computational cost. As a result, Falcon-H1R demonstrates that compact models, through targeted model training and architectural choices, can deliver robust and scalable reasoning performance.
zh

[AI-2] DatBench: Discriminative Faithful and Efficient VLM Evaluations

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)评估体系中存在的三大核心问题:缺乏对模态和应用场景的忠实性(faithfulness)、模型间能力区分度不足(discriminability),以及评估过程计算效率低下(efficiency)。针对这些问题,作者提出三项评估设计原则,并通过数据清洗与任务重构实现优化:关键解决方案包括将多选题转化为生成式任务以揭示真实性能下降(最高达35%),过滤掉可脱离图像作答的“盲解题”样本(占比高达70%)及标注错误或模糊样本(最多影响42%的数据),从而显著提升评估的准确性和效率。最终构建了DatBench-Full(涵盖33个数据集的完整清洁套件)和DatBench(具有13倍平均加速、最高50倍加速的高判别力子集),为VLM评估提供了更严谨且可持续的新范式。

链接: https://arxiv.org/abs/2601.02316
作者: Siddharth Joshi,Haoli Yin,Rishabh Adiga,Ricardo Monti,Aldo Carranza,Alex Fang,Alvin Deng,Amro Abbas,Brett Larsen,Cody Blakeney,Darren Teh,David Schwab,Fan Pan,Haakon Mongstad,Jack Urbanek,Jason Lee,Jason Telanoff,Josh Wills,Kaleigh Mentzer,Luke Merrick,Parth Doshi,Paul Burstein,Pratyush Maini,Scott Loftin,Spandan Das,Tony Jiang,Vineeth Dorna,Zhengping Wang,Bogdan Gaza,Ari Morcos,Matthew Leavitt
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Empirical evaluation serves as the primary compass guiding research progress in foundation models. Despite a large body of work focused on training frontier vision-language models (VLMs), approaches to their evaluation remain nascent. To guide their maturation, we propose three desiderata that evaluations should satisfy: (1) faithfulness to the modality and application, (2) discriminability between models of varying quality, and (3) efficiency in compute. Through this lens, we identify critical failure modes that violate faithfulness and discriminability, misrepresenting model capabilities: (i) multiple-choice formats reward guessing, poorly reflect downstream use cases, and saturate early as models improve; (ii) blindly solvable questions, which can be answered without images, constitute up to 70% of some evaluations; and (iii) mislabeled or ambiguous samples compromise up to 42% of examples in certain datasets. Regarding efficiency, the computational burden of evaluating frontier models has become prohibitive: by some accounts, nearly 20% of development compute is devoted to evaluation alone. Rather than discarding existing benchmarks, we curate them via transformation and filtering to maximize fidelity and discriminability. We find that converting multiple-choice questions to generative tasks reveals sharp capability drops of up to 35%. In addition, filtering blindly solvable and mislabeled samples improves discriminative power while simultaneously reducing computational cost. We release DatBench-Full, a cleaned evaluation suite of 33 datasets spanning nine VLM capabilities, and DatBench, a discriminative subset that achieves 13x average speedup (up to 50x) while closely matching the discriminative power of the original datasets. Our work outlines a path toward evaluation practices that are both rigorous and sustainable as VLMs continue to scale.
zh

[AI-3] Project Ariadne: A Structural Causal Framework for Auditing Faithfulness in LLM Agents

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在高风险自主决策中推理过程透明性不足的问题,特别是现有Chain-of-Thought (CoT)提示生成的推理轨迹是否真实驱动模型输出,还是仅作为事后合理化解释。解决方案的关键在于提出Project Ariadne框架,该框架基于结构因果模型(Structural Causal Models, SCMs)与反事实逻辑,通过硬干预(do-calculus)对中间推理节点进行系统性操作(如逻辑反转、前提否定和事实逆转),量化终端答案的因果敏感性(Causal Sensitivity, φ),从而识别出广泛存在的因果解耦(Causal Decoupling)现象——即代理在内部逻辑矛盾时仍得出相同结论,表明其推理轨迹实质上是“推理剧场”(Reasoning Theater),而决策由隐式参数先验主导。此方法突破了传统基于文本相似度的可解释性评估,提供了一种因果层面的审计机制,并引入Ariadne Score作为衡量推理忠实性的新基准。

链接: https://arxiv.org/abs/2601.02314
作者: Sourena Khanzadeh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As Large Language Model (LLM) agents are increasingly tasked with high-stakes autonomous decision-making, the transparency of their reasoning processes has become a critical safety concern. While \textitChain-of-Thought (CoT) prompting allows agents to generate human-readable reasoning traces, it remains unclear whether these traces are \textbffaithful generative drivers of the model’s output or merely \textbfpost-hoc rationalizations. We introduce \textbfProject Ariadne, a novel XAI framework that utilizes Structural Causal Models (SCMs) and counterfactual logic to audit the causal integrity of agentic reasoning. Unlike existing interpretability methods that rely on surface-level textual similarity, Project Ariadne performs \textbfhard interventions ( do -calculus) on intermediate reasoning nodes – systematically inverting logic, negating premises, and reversing factual claims – to measure the \textbfCausal Sensitivity ( \phi ) of the terminal answer. Our empirical evaluation of state-of-the-art models reveals a persistent \textitFaithfulness Gap. We define and detect a widespread failure mode termed \textbfCausal Decoupling, where agents exhibit a violation density ( \rho ) of up to 0.77 in factual and scientific domains. In these instances, agents arrive at identical conclusions despite contradictory internal logic, proving that their reasoning traces function as “Reasoning Theater” while decision-making is governed by latent parametric priors. Our findings suggest that current agentic architectures are inherently prone to unfaithful explanation, and we propose the Ariadne Score as a new benchmark for aligning stated logic with model action.
zh

[AI-4] Placement Semantics for Distributed Deep Learning: A Systematic Framework for Analyzing Parallelism Strategies

【速读】:该论文旨在解决大规模语言模型训练中并行策略选择缺乏系统性指导的问题,即当前实践中依赖试错法来选取数据并行(data parallelism)、张量并行(tensor parallelism)、流水线并行(pipeline parallelism)和ZeRO等策略,而没有统一的框架能够预测其内存消耗与通信开销。解决方案的关键在于提出“放置语义”(placement semantics):将每种并行策略建模为对四个训练状态(参数、优化器、梯度、激活)在设备间的分布方式(五种模式:复制、分片、分片+收集、物化、卸载),仅基于这种放置逻辑即可推导出内存占用和通信量,且无需依赖具体实现细节。该框架不仅精确匹配已有实验结果(如ZeRO-3相比数据并行节省8倍内存但增加1.5倍通信成本),还通过证明梯度完整性(gradient integrity)与状态一致性(state consistency)两个充要条件,建立了安全组合多种策略的理论基础。

链接: https://arxiv.org/abs/2601.02311
作者: Deep Pankajbhai Mehta
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 8 pages, 3 tables

点击查看摘要

Abstract:Training large language models requires distributing computation across many accelerators, yet practitioners select parallelism strategies (data, tensor, pipeline, ZeRO) through trial and error because no unified systematic framework predicts their behavior. We introduce placement semantics: each strategy is specified by how it places four training states (parameters, optimizer, gradients, activations) across devices using five modes (replicated, sharded, sharded-with-gather, materialized, offloaded). From placement alone, without implementation details, we derive memory consumption and communication volume. Our predictions match published results exactly: ZeRO-3 uses 8x less memory than data parallelism at 1.5x communication cost, as reported in the original paper. We prove two conditions (gradient integrity, state consistency) are necessary and sufficient for distributed training to match single-device results, and provide composition rules for combining strategies safely. The framework unifies ZeRO Stages 1-3, Fully Sharded Data Parallel (FSDP), tensor parallelism, and pipeline parallelism as instances with different placement choices.
zh

[AI-5] LLM -Empowered Functional Safety and Security by Design in Automotive Systems

【速读】:该论文旨在解决软件定义汽车(Software Defined Vehicle, SDV)开发过程中面临的两大核心挑战:一是如何在系统拓扑设计中实现安全意识(security-aware)的架构规划,二是如何对事件驱动的决策代码进行形式化分析以保障功能安全(functional safety)。解决方案的关键在于构建一个由大语言模型(Large Language Model, LLM)赋能的工作流,其中代码分析部分采用事件链(event chains)模型,提供形式化基础以验证组件间消息交换的语义有效性(涵盖CAN和车辆信号规范VSS),而拓扑安全性分析则结合模型驱动工程(Model-Driven Engineering, MDE)与对象约束语言(Object Constraint Language, OCL)规则,形成可本地部署且具备专有性的评估方案,尤其适用于高级驾驶辅助系统(ADAS)场景。

链接: https://arxiv.org/abs/2601.02215
作者: Nenad Petrovic,Vahid Zolfaghari,Fengjunjie Pan,Alois Knoll
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents LLM-empowered workflow to support Software Defined Vehicle (SDV) software development, covering the aspects of security-aware system topology design, as well as event-driven decision-making code analysis. For code analysis we adopt event chains model which provides formal foundations to systematic validation of functional safety, taking into account the semantic validity of messages exchanged between key components, including both CAN and Vehicle Signal Specification (VSS). Analysis of security aspects for topology relies on synergy with Model-Driven Engineering (MDE) approach and Object Constraint Language (OCL) rules. Both locally deployable and proprietary solution are taken into account for evaluation within Advanced Driver-Assistance Systems (ADAS)-related scenarios.
zh

[AI-6] Code for Machines Not Just Humans: Quantifying AI-Friendliness with Code Health Metrics

【速读】:该论文旨在解决在人机协作编程(human-AI collaborative programming)日益普及的背景下,如何确保大型语言模型(Large Language Models, LLMs)能够可靠地编辑代码的问题。其核心挑战在于:当前代码优化主要面向人类可读性,而AI工具在处理代码时可能因理解偏差导致语义破坏。解决方案的关键在于提出并验证“AI友好型代码”(AI-friendly code)的概念,通过在5000个来自编程竞赛的Python文件上进行LLM驱动的重构实验,发现人类友好的代码质量指标——CodeHealth与AI重构后的语义保真度之间存在显著正相关关系。这表明提升代码可维护性不仅有利于人类开发者,也能降低AI干预的风险,从而为组织制定AI辅助开发策略提供量化依据。

链接: https://arxiv.org/abs/2601.02200
作者: Markus Borg,Nadim Hagatulah,Adam Tornhill,Emma Söderberg
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted for the 3rd ACM International Conference on AI Foundation Models and Software Engineering (FORGE 2026)

点击查看摘要

Abstract:We are entering a hybrid era in which human developers and AI coding agents work in the same codebases. While industry practice has long optimized code for human comprehension, it is increasingly important to ensure that LLMs with different capabilities can edit code reliably. In this study, we investigate the concept of ``AI-friendly code’’ via LLM-based refactoring on a dataset of 5,000 Python files from competitive programming. We find a meaningful association between CodeHealth, a quality metric calibrated for human comprehension, and semantic preservation after AI refactoring. Our findings confirm that human-friendly code is also more compatible with AI tooling. These results suggest that organizations can use CodeHealth to guide where AI interventions are lower risk and where additional human oversight is warranted. Investing in maintainability not only helps humans; it also prepares for large-scale AI adoption.
zh

[AI-7] Streaming Hallucination Detection in Long Chain-of-Thought Reasoning

【速读】:该论文旨在解决长链式思维(Chain-of-Thought, CoT)推理中幻觉(hallucination)的隐蔽传播问题,即幻觉并非孤立事件,而是随着推理步骤逐步演化、累积的潜在状态。其解决方案的关键在于将每一步的幻觉判断视为局部观测,引入一种累计前缀级别的幻觉信号(cumulative prefix-level hallucination signal),以追踪整个推理轨迹中幻觉状态的全局演化过程,从而实现对长CoT推理中幻觉的流式检测,并提供实时且可解释的证据。

链接: https://arxiv.org/abs/2601.02170
作者: Haolang Lu,Minghui Pan,Ripeng Li,Guoshun Nan,Jialin Zhuang,Zijie Zhao,Zhongxiang Sun,Kun Wang,Yang Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long chain-of-thought (CoT) reasoning improves the performance of large language models, yet hallucinations in such settings often emerge subtly and propagate across reasoning steps. We suggest that hallucination in long CoT reasoning is better understood as an evolving latent state rather than a one-off erroneous event. Accordingly, we treat step-level hallucination judgments as local observations and introduce a cumulative prefix-level hallucination signal that tracks the global evolution of the reasoning state over the entire trajectory. Overall, our approach enables streaming hallucination detection in long CoT reasoning, providing real-time, interpretable evidence.
zh

[AI-8] SingingBot: An Avatar-Driven System for Robotic Face Singing Performance

【速读】:该论文旨在解决机器人面部在歌唱场景中难以实现连续且一致的情感表达问题,现有研究多聚焦于对话或静态表情模仿,无法满足歌唱过程中对情感丰富性和连贯性的高要求。解决方案的关键在于提出一种新颖的基于虚拟形象驱动的机器人歌唱框架:首先利用嵌入人类先验知识的肖像视频生成模型合成生动的歌唱虚拟形象,为机器人提供可靠的表情与情绪引导;随后通过语义导向的映射函数将这些面部特征迁移至机器人,覆盖广泛的表情空间;同时引入“情感动态范围”(Emotion Dynamic Range)指标量化评估机器人歌唱的情感丰富度,揭示宽广的情感谱系对吸引人表演的重要性。实验表明,该方法在保持唇音同步的同时显著提升了情感表达的多样性与质量。

链接: https://arxiv.org/abs/2601.02125
作者: Zhuoxiong Xu,Xuanchen Li,Yuhao Cheng,Fei Xu,Yichao Yan,Xiaokang Yang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Equipping robotic faces with singing capabilities is crucial for empathetic Human-Robot Interaction. However, existing robotic face driving research primarily focuses on conversations or mimicking static expressions, struggling to meet the high demands for continuous emotional expression and coherence in singing. To address this, we propose a novel avatar-driven framework for appealing robotic singing. We first leverage portrait video generation models embedded with extensive human priors to synthesize vivid singing avatars, providing reliable expression and emotion guidance. Subsequently, these facial features are transferred to the robot via semantic-oriented mapping functions that span a wide expression space. Furthermore, to quantitatively evaluate the emotional richness of robotic singing, we propose the Emotion Dynamic Range metric to measure the emotional breadth within the Valence-Arousal space, revealing that a broad emotional spectrum is crucial for appealing performances. Comprehensive experiments prove that our method achieves rich emotional expressions while maintaining lip-audio synchronization, significantly outperforming existing approaches.
zh

[AI-9] Inferring Network Evolutionary History via Structure-State Coupled Learning

【速读】:该论文旨在解决从单一最终网络快照及有限时间标注中推断网络演化历史的问题,这一任务因仅依赖拓扑结构常导致信息不足且噪声较大。其核心解决方案是引入网络稳态动力学(steady-state dynamics)作为额外可观测信号,提出CS²方法,通过显式建模结构-状态耦合机制,捕捉拓扑如何调控节点稳态,并利用结构与稳态信号的协同作用提升边形成顺序的判别能力,从而显著改善演化推理精度。

链接: https://arxiv.org/abs/2601.02121
作者: En Xu,Shihe Zhou,Huandong Wang,Jingtao Ding,Yong Li
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Inferring a network’s evolutionary history from a single final snapshot with limited temporal annotations is fundamental yet challenging. Existing approaches predominantly rely on topology alone, which often provides insufficient and noisy cues. This paper leverages network steady-state dynamics – converged node states under a given dynamical process – as an additional and widely accessible observation for network evolution history inference. We propose CS ^2 , which explicitly models structure-state coupling to capture how topology modulates steady states and how the two signals jointly improve edge discrimination for formation-order recovery. Experiments on six real temporal networks, evaluated under multiple dynamical processes, show that CS ^2 consistently outperforms strong baselines, improving pairwise edge precedence accuracy by 4.0% on average and global ordering consistency (Spearman- \rho ) by 7.7% on average. CS ^2 also more faithfully recovers macroscopic evolution trajectories such as clustering formation, degree heterogeneity, and hub growth. Moreover, a steady-state-only variant remains competitive when reliable topology is limited, highlighting steady states as an independent signal for evolution inference.
zh

[AI-10] LION-DG: Layer-Informed Initialization with Deep Gradient Protocols for Accelerated Neural Network Training

【速读】:该论文旨在解决深度监督架构(deeply-supervised architectures)中因未训练的辅助分类器(auxiliary classifiers)在训练初期引入梯度干扰而导致的优化不稳定问题。现有权重初始化方法多为层无关(layer-agnostic),无法有效应对辅助头(head)带来的梯度冲突。解决方案的关键在于提出LION-DG(Layer-Informed Initialization for Deep Supervision with Gradient Awakening),其核心策略是:对辅助分类器头进行零初始化(zero-initialization),而主干网络(backbone)采用标准的He初始化;这一设计实现了“梯度唤醒”(Gradient Awakening)——即在初始时刻辅助梯度严格为零,随后随着主干权重自然增长逐步引入梯度,从而无需超参数即可实现隐式预热(implicit warmup)。该方法简单、无额外计算开销,且在CIFAR-10/100上显著提升收敛速度并保持或提升精度。

链接: https://arxiv.org/abs/2601.02105
作者: Hyunjun Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Weight initialization remains decisive for neural network optimization, yet existing methods are largely layer-agnostic. We study initialization for deeply-supervised architectures with auxiliary classifiers, where untrained auxiliary heads can destabilize early training through gradient interference. We propose LION-DG, a layer-informed initialization that zero-initializes auxiliary classifier heads while applying standard He-initialization to the backbone. We prove that this implements Gradient Awakening: auxiliary gradients are exactly zero at initialization, then phase in naturally as weights grow – providing an implicit warmup without hyperparameters. Experiments on CIFAR-10 and CIFAR-100 with DenseNet-DS and ResNet-DS architectures demonstrate: (1) DenseNet-DS: +8.3% faster convergence on CIFAR-10 with comparable accuracy, (2) Hybrid approach: Combining LSUV with LION-DG achieves best accuracy (81.92% on CIFAR-10), (3) ResNet-DS: Positive speedup on CIFAR-100 (+11.3%) with side-tap auxiliary design. We identify architecture-specific trade-offs and provide clear guidelines for practitioners. LION-DG is simple, requires zero hyperparameters, and adds no computational overhead. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2601.02105 [cs.LG] (or arXiv:2601.02105v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2601.02105 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Hyunjun Kim He [view email] [v1] Mon, 5 Jan 2026 13:33:09 UTC (51 KB)
zh

[AI-11] Vision-Based Early Fault Diagnosis and Self-Recovery for Strawberry Harvesting Robots

【速读】:该论文旨在解决草莓采摘机器人在果园环境中面临的视觉感知集成度低、果实夹爪对准偏差、空抓以及果实从夹爪中滑脱等问题,这些问题严重影响了采摘的稳定性和效率。解决方案的关键在于提出了一种融合多任务感知与纠错控制策略的视觉故障诊断与自恢复框架,其核心是SRR-Net端到端多任务感知模型,能够同步完成草莓检测、分割和成熟度估计,从而将视觉感知与故障诊断统一;同时基于目标与夹爪的联合检测设计相对误差补偿方法以纠正位置偏差,并引入早期中止策略减少空抓和滑脱故障;此外,末端嵌入式微光学相机提供实时视觉反馈,结合MobileNet V3-Small分类器和时间序列LSTM分类器实现抓取状态识别与果实滑脱预测,显著提升了系统鲁棒性与作业精度。

链接: https://arxiv.org/abs/2601.02085
作者: Meili Sun,Chunjiang Zhao,Lichao Yang,Hao Liu,Shimin Hu,Ya Xiong
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Strawberry harvesting robots faced persistent challenges such as low integration of visual perception, fruit-gripper misalignment, empty grasping, and strawberry slippage from the gripper due to insufficient gripping force, all of which compromised harvesting stability and efficiency in orchard environments. To overcome these issues, this paper proposed a visual fault diagnosis and self-recovery framework that integrated multi-task perception with corrective control strategies. At the core of this framework was SRR-Net, an end-to-end multi-task perception model that simultaneously performed strawberry detection, segmentation, and ripeness estimation, thereby unifying visual perception with fault diagnosis. Based on this integrated perception, a relative error compensation method based on the simultaneous target-gripper detection was designed to address positional misalignment, correcting deviations when error exceeded the tolerance threshold. To mitigate empty grasping and fruit-slippage faults, an early abort strategy was implemented. A micro-optical camera embedded in the end-effector provided real-time visual feedback, enabling grasp detection during the deflating stage and strawberry slip prediction during snap-off through MobileNet V3-Small classifier and a time-series LSTM classifier. Experiments demonstrated that SRR-Net maintained high perception accuracy. For detection, it achieved a precision of 0.895 and recall of 0.813 on strawberries, and 0.972/0.958 on hands. In segmentation, it yielded a precision of 0.887 and recall of 0.747 for strawberries, and 0.974/0.947 for hands. For ripeness estimation, SRR-Net attained a mean absolute error of 0.035, while simultaneously supporting multi-task perception and sustaining a competitive inference speed of 163.35 FPS.
zh

[AI-12] he Homogeneity Trap: Spectral Collapse in Doubly-Stochastic Deep Networks

【速读】:该论文旨在解决双重随机矩阵(Doubly-Stochastic Matrix, DSM)约束在结构保持型深度网络(如最优传输层和基于Sinkhorn的注意力机制)中引发的谱退化问题,即所谓的“同质性陷阱”(Homogeneity Trap)。其关键发现是:Sinkhorn投影带来的最大熵偏置会促使混合算子趋向于均匀重心,从而抑制次主导奇异值 σ2\sigma_2,导致高频特征成分被滤除。论文进一步推导出σ2\sigma_2与网络有效深度之间的谱界,揭示高熵约束限制了特征变换的有效感受野;并证明层归一化(Layer Normalization)无法缓解噪声主导区域下的谱崩溃现象——当信噪比(Signal-to-Noise Ratio, SNR)低于临界阈值时,几何结构将不可逆地坍缩为噪声诱导的正交态。这揭示了熵稳定性与谱表达能力之间存在的根本权衡关系。

链接: https://arxiv.org/abs/2601.02080
作者: Yizhi Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Doubly-stochastic matrices (DSM) are increasingly utilized in structure-preserving deep architectures – such as Optimal Transport layers and Sinkhorn-based attention – to enforce numerical stability and probabilistic interpretability. In this work, we identify a critical spectral degradation phenomenon inherent to these constraints, termed the Homogeneity Trap. We demonstrate that the maximum-entropy bias, typical of Sinkhorn-based projections, drives the mixing operator towards the uniform barycenter, thereby suppressing the subdominant singular value \sigma_2 and filtering out high-frequency feature components. We derive a spectral bound linking \sigma_2 to the network’s effective depth, showing that high-entropy constraints restrict feature transformation to a shallow effective receptive field. Furthermore, we formally demonstrate that Layer Normalization fails to mitigate this collapse in noise-dominated regimes; specifically, when spectral filtering degrades the Signal-to-Noise Ratio (SNR) below a critical threshold, geometric structure is irreversibly lost to noise-induced orthogonal collapse. Our findings highlight a fundamental trade-off between entropic stability and spectral expressivity in DSM-constrained networks.
zh

[AI-13] FormuLLA: A Large Language Model Approach to Generating Novel 3D Printable Formulations

【速读】:该论文旨在解决当前生成式人工智能(Generative AI)在药物三维(3D)打印领域应用中存在的局限性,特别是现有AI方法对配方开发中复杂多变的制剂挑战考虑不足的问题。其关键解决方案在于利用经过熔融沉积建模(Fused Deposition Modeling, FDM)工艺数据集(含1400余种配方)微调的大语言模型(Large Language Models, LLMs),实现基于活性药物成分(Active Pharmaceutical Ingredient, API)剂量的辅料推荐及丝材机械性能预测。研究发现,Llama2架构在辅料推荐任务中表现最优,同时揭示了模型选择与参数配置对性能的关键影响,并指出小规模数据可能导致灾难性遗忘、标准LLM评估指标无法反映制剂可加工性、以及生物医学相关训练数据未必带来最佳结果等核心问题,强调需从单纯语言能力向面向制药流程可靠性的系统演进。

链接: https://arxiv.org/abs/2601.02071
作者: Adeshola Okubena,Yusuf Ali Mohammed,Moe Elbadawi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pharmaceutical three-dimensional (3D) printing is an advanced fabrication technology with the potential to enable truly personalised dosage forms. Recent studies have integrated artificial intelligence (AI) to accelerate formulation and process development, drastically transforming current approaches to pharmaceutical 3D printing. To date, most AI-driven efforts remain narrowly focused, while failing to account for the broader formulation challenges inherent to the technology. Recent advances in AI have introduced artificial general intelligence concepts, wherein systems extend beyond conventional predictive modelling toward more generalised, human-like reasoning. In this work, we investigate the application of large language models (LLMs), fine-tuned on a fused deposition modelling (FDM) dataset comprising over 1400 formulations, to recommend suitable excipients based on active pharmaceutical ingredient (API) dose, and predict filament mechanical properties. Four LLM architectures were fine-tuned, with systematic evaluation of both fine-tuning and generative parameter configurations. Our results demonstrate that Llama2 was best suited for recommending excipients for FDM formulations. Additionally, model selection and parameterisation significantly influence performance, with smaller LLMs exhibiting instances of catastrophic forgetting. Furthermore, we demonstrate: (i) even with relatively small dataset of over 1400 formulations, it can lead to model catastrophic forgetting; (ii) standard LLM metrics only evaluate linguistic performance but not formulation processability; and (iii) LLMs trained on biomedically-related data do not always produce the best results. Addressing these challenges is essential to advancing LLMs beyond linguistic proficiency and toward reliable systems for pharmaceutical formulation development.
zh

[AI-14] Higher-Order Action Regularization in Deep Reinforcement Learning: From Continuous Control to Building Energy Management NEURIPS

【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)智能体在实际部署中因高频、剧烈动作行为导致的能耗过高和机械磨损问题。其解决方案的关键在于引入高阶导数惩罚机制,特别是对加速度(jerk)的约束——即通过第三阶导数惩罚来实现动作平滑性正则化。研究表明,这种策略在多个连续控制环境中均能显著提升动作平滑度,同时保持与现有方法相当的性能;在建筑暖通空调(HVAC)控制系统中的实证验证进一步表明,该方法可使设备开关频率降低60%,从而有效缓解运行约束并带来可观的运营效益。

链接: https://arxiv.org/abs/2601.02061
作者: Faizan Ahmed,Aniket Dixit,James Brusey
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, accepted at NeurIPS workshop 2025

点击查看摘要

Abstract:Deep reinforcement learning agents often exhibit erratic, high-frequency control behaviors that hinder real-world deployment due to excessive energy consumption and mechanical wear. We systematically investigate action smoothness regularization through higher-order derivative penalties, progressing from theoretical understanding in continuous control benchmarks to practical validation in building energy management. Our comprehensive evaluation across four continuous control environments demonstrates that third-order derivative penalties (jerk minimization) consistently achieve superior smoothness while maintaining competitive performance. We extend these findings to HVAC control systems where smooth policies reduce equipment switching by 60%, translating to significant operational benefits. Our work establishes higher-order action regularization as an effective bridge between RL optimization and operational constraints in energy-critical applications.
zh

[AI-15] Perish or Flourish? A Holistic Evaluation of Large Language Models for Code Generation in Functional Programming

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在函数式编程语言(Functional Programming Languages, FP)中代码生成能力评估不足的问题,以及当前LLMs生成的函数式代码存在正确性差、风格不规范和可维护性低等关键挑战。其解决方案的关键在于构建了一个名为FPEval的综合性评估框架,该框架基于FPBench基准(包含721个跨三种主流函数式语言——Haskell、OCaml和Scala——且分三个难度级别的编程任务),并集成完整的测试验证机制与静态分析工具,能够从功能性正确性和代码风格/可维护性两个维度对LLMs生成的代码进行全面评估。通过此框架,研究者系统地评估了GPT-3.5、GPT-4o和GPT-5等先进模型在函数式编程中的表现,并发现LLM性能随模型演进而提升,但纯函数式语言中的错误率仍显著高于混合型(Scala)或命令式(Java)语言;此外,LLMs常生成不符合函数式编程范式的非惯用代码,影响长期可维护性。最终,研究还表明,在提供静态分析反馈和定制化指令的情况下,LLMs具备部分自我修复功能,可改善生成代码的质量与正确性。

链接: https://arxiv.org/abs/2601.02060
作者: Nguyet-Anh H. Lang,Eric Lang,Thanh Le-Cong,Bach Le,Quyet-Thang Huynh
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Functional programming provides strong foundations for developing reliable and secure software systems, yet its adoption remains not widespread due to the steep learning curve. Recent advances in Large Language Models (LLMs) for code generation present new opportunities to lower these barriers. However, extensive evaluations of LLMs largely focus on imperative programming languages, and their capabilities in functional programming languages (FP) remain underexplored. To address this gap, we introduce FPEval, a holistic evaluation framework built on FPBench, a new benchmark of 721 programming tasks across three difficulty levels on three mainstream FP languages: Haskell, Ocaml and Scala. FPEval provides compehensive evaluation infrastructures with both test validations with comprehensive test suites and static analysis tools to assess both functional correctness and code style and maintainability. Using this framework, we evaluate state-of-the-art LLMs, including GPT-3.5, GPT-4o, and GPT-5, for code generation in functional programming languages and Java as an imperative baseline. Our results demonstrate that LLM performance in functional programming improves substantially with model advancement; however, error rates remain significantly higher in purely functional languages (Haskell and OCaml) than in hybrid (Scala) or imperative (Java) languages. Moreover, LLMs frequently generate non-idiomatic functional code that follows imperative patterns, raising concerns about code style and long-term maintainability. Finally, we show that LLMs can partially self-repair both correctness and quality issues when provided with static analysis feedback and hand-crafted instructions for common types of issues.
zh

[AI-16] he New Compiler Stack: A Survey on the Synergy of LLM s and Compilers

【速读】:该论文旨在解决当前编译器开发中面临的高复杂性、优化策略局限性以及传统编译器功能边界固化等问题,其核心挑战在于如何利用大语言模型(Large Language Models, LLMs)推动编译技术的智能化与自适应演进。解决方案的关键在于构建一个多层次、多维度的分类体系,从设计哲学(如选择器、翻译器、生成器)、LLM方法论、代码抽象层级和任务类型四个维度系统梳理LLM赋能编译领域的研究进展,并识别出三大核心优势:降低编译器开发门槛( democratization of compiler development)、发现新型优化策略(novel optimization strategies)以及拓展编译器传统功能边界(broadening the compiler’s traditional scope)。同时,论文指出确保正确性和实现可扩展性是当前主要障碍,而发展混合系统(hybrid systems)则是未来最具潜力的技术路径。

链接: https://arxiv.org/abs/2601.02045
作者: Shuoming Zhang,Jiacheng Zhao,Qiuchu Yu,Chunwei Xia,Zheng Wang,Xiaobing Feng,Huimin Cui
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Accepted by CCF Transactions on High Performance Computing

点击查看摘要

Abstract:This survey has provided a systematic overview of the emerging field of LLM-enabled compilation by addressing several key research questions. We first answered how LLMs are being integrated by proposing a comprehensive, multi-dimensional taxonomy that categorizes works based on their Design Philosophy (Selector, Translator, Generator), LLM Methodology, their operational Level of Code Abstraction, and the specific Task Type they address. In answering what advancements these approaches offer, we identified three primary benefits: the democratization of compiler development, the discovery of novel optimization strategies, and the broadening of the compiler’s traditional scope. Finally, in addressing the field’s challenges and opportunities, we highlighted the critical hurdles of ensuring correctness and achieving scalability, while identifying the development of hybrid systems as the most promising path forward. By providing these answers, this survey serves as a foundational roadmap for researchers and practitioners, charting the course for a new generation of LLM-powered, intelligent, adaptive and synergistic compilation tools.
zh

[AI-17] MindChat: A Privacy-preserving Large Language Model for Mental Health Support

【速读】:该论文旨在解决心理健康支持领域中大型语言模型(Large Language Models, LLMs)训练所面临的两大核心挑战:一是真实心理咨询对话数据的稀缺性与敏感性,二是分布式数据环境下隐私泄露风险。为应对这些问题,作者提出MindChat——一个面向心理健康的隐私保护型LLM,并构建了MindCorpus这一合成多轮咨询对话数据集。其解决方案的关键在于:首先,采用多智能体角色扮演框架结合双闭环反馈机制(turn-level critique-and-revision与session-level strategy refinement),在不依赖真实数据的前提下生成高质量、符合心理学规范的对话;其次,在模型训练阶段引入联邦学习(Federated Learning)与参数高效微调(LoRA)技术,并融合差分隐私优化策略,有效降低成员推理攻击和记忆风险,从而在保障隐私的同时提升模型在心理咨询场景下的性能表现。

链接: https://arxiv.org/abs/2601.01993
作者: Dong Xue,Jicheng Tu,Ming Wang,Xin Yan,Fangzhou Liu,Jie Hu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 33 pages, 16 figures

点击查看摘要

Abstract:Large language models (LLMs) have shown promise for mental health support, yet training such models is constrained by the scarcity and sensitivity of real counseling dialogues. In this article, we present MindChat, a privacy-preserving LLM for mental health support, together with MindCorpus, a synthetic multi-turn counseling dataset constructed via a multi-agent role-playing framework. To synthesize high-quality counseling data, the developed dialogue-construction framework employs a dual closed-loop feedback design to integrate psychological expertise and counseling techniques through role-playing: (i) turn-level critique-and-revision to improve coherence and counseling appropriateness within a session, and (ii) session-level strategy refinement to progressively enrich counselor behaviors across sessions. To mitigate privacy risks under decentralized data ownership, we fine-tune the base model using federated learning with parameter-efficient LoRA adapters and incorporate differentially private optimization to reduce membership and memorization risks. Experiments on synthetic-data quality assessment and counseling capability evaluation show that MindCorpus improves training effectiveness and that MindChat is competitive with existing general and counseling-oriented LLM baselines under both automatic LLM-judge and human evaluation protocols, while exhibiting reduced privacy leakage under membership inference attacks.
zh

[AI-18] ChaosBench-Logic: A Benchmark for Logical and Symbolic Reasoning on Chaotic Dynamical Systems AAAI-26

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在需要精确逻辑与符号推理的任务中表现脆弱的问题,尤其是在处理混沌动力系统这类具有确定性但常被误认为随机或复杂性的场景时。其解决方案的关键在于提出一个名为ChaosBench-Logic的基准测试框架,该框架通过统一的一阶逻辑(First-Order Logic, FOL)本体对30种多样化的动力系统进行建模,并为每个系统标注11个语义谓词的真值赋值,进而生成涵盖多跳推理、跨系统类比、反事实推理等七类任务的621个问题。该框架还定义了逻辑准确性、蕴含一致性、对话连贯性和矛盾检测等量化指标,首次揭示了当前前沿LLMs在组合性任务上完全失效且全局一致性薄弱的现象,从而为诊断模型缺陷和开发神经符号方法(Neuro-Symbolic Approaches)以提升科学推理能力提供了可复现的评估平台。

链接: https://arxiv.org/abs/2601.01982
作者: Noel Thomas
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 7 pages, 0 figures , Accepted to AAAI-26 Bridge Program: Logical and Symbolic Reasoning in Language Models (camera-ready)

点击查看摘要

Abstract:Large language models (LLMs) excel at natural language tasks but remain brittle in domains requiring precise logical and symbolic reasoning. Chaotic dynamical systems provide an especially demanding test because chaos is deterministic yet often misinterpreted as randomness or complexity. We introduce ChaosBench-Logic, a benchmark that evaluates LLM reasoning across 30 diverse dynamical systems using a unified first-order logic (FOL) ontology. Each system is annotated with truth assignments for 11 semantic predicates, and 621 questions are generated across seven reasoning categories, including multi-hop implications, cross-system analogies, counterfactual reasoning, bias probes, and multi-turn dialogues. We define metrics for logical accuracy, implication consistency, dialogue coherence, and contradiction, and we release an open-source evaluation pipeline. Initial experiments show that frontier LLMs such as GPT-4, Claude 3.5 Sonnet, Gemini 2.5 Flash, and the open-source LLaMA-3 70B achieve 91-94% per-item accuracy, yet still score 0% on compositional items and exhibit fragile global coherence. Dialogue-level accuracy ranges from 53.1% (GPT-4 CoT) to 75.5% (LLaMA-3 zero-shot). ChaosBench-Logic provides a rigorous testbed for diagnosing such failures and a foundation for developing neuro-symbolic approaches that improve scientific reasoning in LLMs.
zh

[AI-19] CNC-TP: Classifier Nominal Concept Based on Top-Pertinent Attributes

【速读】:该论文旨在解决知识发现(Knowledge Discovery in Databases, KDD)中如何从大规模数据集中高效提取可解释且有意义的知识问题,尤其聚焦于分类任务中的可解释性与效率提升。其解决方案的关键在于利用形式概念分析(Formal Concept Analysis, FCA)构建具有数学严谨性的概念格结构,通过计算闭包算子从名义型数据中生成形式概念,并提出一种新颖的局部概念格构造方法,以聚焦于最具相关性的概念,从而在保证分类性能的同时增强模型的可解释性与计算效率。

链接: https://arxiv.org/abs/2601.01976
作者: Yasmine Souissi(LRE),Fabrice Boissier(CRI, LRE),Nida Meddouri(LRE)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Knowledge Discovery in Databases (KDD) aims to exploit the vast amounts of data generated daily across various domains of computer applications. Its objective is to extract hidden and meaningful knowledge from datasets through a structured process comprising several key steps: data selection, preprocessing, transformation, data mining, and visualization. Among the core data mining techniques are classification and clustering. Classification involves predicting the class of new instances using a classifier trained on labeled data. Several approaches have been proposed in the literature, including Decision Tree Induction, Bayesian classifiers, Nearest Neighbor search, Neural Networks, Support Vector Machines, and Formal Concept Analysis (FCA). The last one is recognized as an effective approach for interpretable and explainable learning. It is grounded in the mathematical structure of the concept lattice, which enables the generation of formal concepts and the discovery of hidden relationships among them. In this paper, we present a state-of-theart review of FCA-based classifiers. We explore various methods for computing closure operators from nominal data and introduce a novel approach for constructing a partial concept lattice that focuses on the most relevant concepts. Experimental results are provided to demonstrate the efficiency of the proposed method.
zh

[AI-20] Refinement Provenance Inference: Detecting LLM -Refined Training Prompts from Model Behavior

【速读】:该论文旨在解决指令微调(instruction tuning)中因使用大语言模型(LLM)对训练提示进行重构而导致的数据溯源难题,即在混合了原始提示与LLM重构提示的训练语料库中,如何准确判断一个微调后的模型是否基于特定提示的原始版本或重构版本进行训练。这一问题对于训练数据治理和争议处理至关重要。解决方案的关键在于提出了一种名为RePro的对数概率(logit-based)溯源框架,其核心创新是利用教师强制(teacher-forced)条件下token分布的稳定且可检测的变化,通过融合教师强制似然特征与对数排名信号,结合阴影微调(shadow fine-tuning)学习跨模型和训练设置通用的表示,并采用轻量级线性头实现无需访问训练数据即可对未见目标模型进行溯源推断。实验证明,RePro能有效识别提示重构痕迹,且具有良好的跨重构器泛化能力,表明其捕捉的是重构器无关的分布偏移,而非特定重写风格的伪影。

链接: https://arxiv.org/abs/2601.01966
作者: Bo Yin,Qi Li,Runpeng Yu,Xinchao Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Instruction tuning increasingly relies on LLM-based prompt refinement, where prompts in the training corpus are selectively rewritten by an external refiner to improve clarity and instruction alignment. This motivates an instance-level audit problem: for a fine-tuned model and a training prompt-response pair, can we infer whether the model was trained on the original prompt or its LLM-refined version within a mixed corpus? This matters for dataset governance and dispute resolution when training data are contested. However, it is non-trivial in practice: refined and raw instances are interleaved in the training corpus with unknown, source-dependent mixture ratios, making it harder to develop provenance methods that generalize across models and training setups. In this paper, we formalize this audit task as Refinement Provenance Inference (RPI) and show that prompt refinement yields stable, detectable shifts in teacher-forced token distributions, even when semantic differences are not obvious. Building on this phenomenon, we propose RePro, a logit-based provenance framework that fuses teacher-forced likelihood features with logit-ranking signals. During training, RePro learns a transferable representation via shadow fine-tuning, and uses a lightweight linear head to infer provenance on unseen victims without training-data access. Empirically, RePro consistently attains strong performance and transfers well across refiners, suggesting that it exploits refiner-agnostic distribution shifts rather than rewrite-style artifacts.
zh

[AI-21] OpenSocInt: A Multi-modal Training Environment for Human-Aware Social Navigation

【速读】:该论文旨在解决多模态社交交互建模与社交智能体(Social Agent)训练的难题,尤其在社会导航(Social Navigation)任务中探索感知特征的提取、编码与融合机制。其解决方案的关键在于提出 OpenSocInt —— 一个开源软件包,提供多模态社交交互仿真器和模块化架构,支持灵活配置不同感知输入(如视觉、语音等)、特征融合策略及代理模型,从而为社交智能体的研究提供可扩展、可复现的实验平台。

链接: https://arxiv.org/abs/2601.01939
作者: Victor Sanchez,Chris Reinke,Ahamed Mohamed,Xavier Alameda-Pineda
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this paper, we introduce OpenSocInt, an open-source software package providing a simulator for multi-modal social interactions and a modular architecture to train social agents. We described the software package and showcased its interest via an experimental protocol based on the task of social navigation. Our framework allows for exploring the use of different perceptual features, their encoding and fusion, as well as the use of different agents. The software is already publicly available under GPL at this https URL.
zh

[AI-22] DéjàQ: Open-Ended Evolution of Diverse Learnable and Verifiable Problems

【速读】:该论文旨在解决当前数学推理模型训练中普遍依赖静态数据集所导致的过拟合与泛化能力不足的问题(static datasets encourage memorisation and limit generalisation)。其解决方案的关键在于提出DéjàQ框架,通过在模型训练过程中协同演化多样化的合成数学问题,使训练数据能够动态适应模型的学习能力,从而优化问题的可学性。该框架的核心创新在于引入两种由大语言模型(LLM)驱动的变异策略:一种是修改上下文细节,另一种是直接调整问题结构,使模型自身参与训练数据的生成与演化,显著提升了强化学习(RL)训练的效果。

链接: https://arxiv.org/abs/2601.01931
作者: Willem Röpke,Samuel Coward,Andrei Lupu,Thomas Foster,Tim Rocktäschel,Jakob Foerster
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in reasoning models have yielded impressive results in mathematics and coding. However, most approaches rely on static datasets, which have been suggested to encourage memorisation and limit generalisation. We introduce DéjàQ, a framework that departs from this paradigm by jointly evolving a diverse set of synthetic mathematical problems alongside model training. This evolutionary process adapts to the model’s ability throughout training, optimising problems for learnability. We propose two LLM-driven mutation strategies in which the model itself mutates the training data, either by altering contextual details or by directly modifying problem structure. We find that the model can generate novel and meaningful problems, and that these LLM-driven mutations improve RL training. We analyse key aspects of DéjàQ, including the validity of generated problems and computational overhead. Our results underscore the potential of dynamically evolving training data to enhance mathematical reasoning and indicate broader applicability, which we will support by open-sourcing our code.
zh

[AI-23] MCGI: Manifold-Consistent Graph Indexing for Billion-Scale Disk-Resident Vector Search

【速读】:该论文旨在解决图结构近邻搜索(Graph-based Approximate Nearest Neighbor, ANN)在高维空间中因“欧氏-测地线不匹配”(Euclidean-Geodesic mismatch)导致的性能下降问题,即贪婪路由策略偏离数据流形(manifold)而导致搜索路径失准。其解决方案的关键在于提出一种几何感知且磁盘驻留的索引方法——流形一致性图索引(Manifold-Consistent Graph Indexing, MCGI),该方法利用局部内在维度(Local Intrinsic Dimensionality, LID)对数据的内在几何结构进行原位分析,并据此动态调整广度优先搜索(beam search)预算,从而消除对静态超参数的依赖。理论分析表明,MCGI通过保持流形一致的拓扑连通性可提升近似保证;实验验证了其在高维GIST1M和百亿级SIFT1B数据集上的显著性能优势,包括吞吐量提升5.8倍(95%召回率下)及高召回查询延迟降低3倍。

链接: https://arxiv.org/abs/2601.01930
作者: Dongfang Zhao
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph-based Approximate Nearest Neighbor (ANN) search often suffers from performance degradation in high-dimensional spaces due to the ``Euclidean-Geodesic mismatch,‘’ where greedy routing diverges from the underlying data manifold. To address this, we propose Manifold-Consistent Graph Indexing (MCGI), a geometry-aware and disk-resident indexing method that leverages Local Intrinsic Dimensionality (LID) to dynamically adapt search strategies to the data’s intrinsic geometry. Unlike standard algorithms that treat dimensions uniformly, MCGI modulates its beam search budget based on in situ geometric analysis, eliminating dependency on static hyperparameters. Theoretical analysis confirms that MCGI enables improved approximation guarantees by preserving manifold-consistent topological connectivity. Empirically, MCGI achieves 5.8 \times higher throughput at 95% recall on high-dimensional GIST1M compared to state-of-the-art DiskANN. On the billion-scale SIFT1B dataset, MCGI further validates its scalability by reducing high-recall query latency by 3 \times , while maintaining performance parity on standard lower-dimensional datasets.
zh

[AI-24] heoretical Convergence of SMOTE-Generated Samples

【速读】:该论文旨在解决数据不平衡(imbalanced data)问题对机器学习模型性能的影响,尤其是针对广泛应用场景如医疗健康和网络安全中的分类任务。其解决方案的关键在于对SMOTE(Synthetic Minority Over-sampling Technique)方法进行严格的理论分析,证明了生成的合成随机变量 $ Z $ 在概率意义上收敛于原始随机变量 $ X $,并在 $ X $ 有界时进一步证明了均值收敛性;同时指出邻近邻居数越小,收敛速度越快,为实际应用提供了可操作的指导原则。

链接: https://arxiv.org/abs/2601.01927
作者: Firuz Kamalov,Hana Sulieman,Witold Pedrycz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Imbalanced data affects a wide range of machine learning applications, from healthcare to network security. As SMOTE is one of the most popular approaches to addressing this issue, it is imperative to validate it not only empirically but also theoretically. In this paper, we provide a rigorous theoretical analysis of SMOTE’s convergence properties. Concretely, we prove that the synthetic random variable Z converges in probability to the underlying random variable X. We further prove a stronger convergence in mean when X is compact. Finally, we show that lower values of the nearest neighbor rank lead to faster convergence offering actionable guidance to practitioners. The theoretical results are supported by numerical experiments using both real-life and synthetic data. Our work provides a foundational understanding that enhances data augmentation techniques beyond imbalanced data scenarios.
zh

[AI-25] A Defect is Being Born: How Close Are We? A Time Sensitive Forecasting Approach

【速读】:该论文旨在解决软件系统中缺陷预测的时效性问题,即如何在缺陷实际发生前实现早期、准确的缺陷预测。随着软件系统的持续演化,传统静态预测方法难以捕捉动态变化的缺陷模式,因此亟需时间敏感型预测技术以识别缺陷发生前的早期指标。解决方案的关键在于采用多种时间敏感的预测技术(如时序建模与特征工程),对软件项目的未来缺陷密度进行建模,并从中挖掘出预示缺陷发生的早期症状(early indicators),从而为缺陷管理提供前瞻性依据。

链接: https://arxiv.org/abs/2601.01921
作者: Mikel Robredo,Matteo Esposito,Fabio Palomba,Rafael Peñaloza,Valentina Lenarduzzi
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: ACCEPTED REGISTERED REPORT AT SANER (CORE A*) 2026

点击查看摘要

Abstract:Background. Defect prediction has been a highly active topic among researchers in the Empirical Software Engineering field. Previous literature has successfully achieved the most accurate prediction of an incoming fault and identified the features and anomalies that precede it through just-in-time prediction. As software systems evolve continuously, there is a growing need for time-sensitive methods capable of forecasting defects before they manifest. Aim. Our study seeks to explore the effectiveness of time-sensitive techniques for defect forecasting. Moreover, we aim to investigate the early indicators that precede the occurrence of a defect. Method. We will train multiple time-sensitive forecasting techniques to forecast the future bug density of a software project, as well as identify the early symptoms preceding the occurrence of a defect. Expected results. Our expected results are translated into empirical evidence on the effectiveness of our approach for early estimation of bug proneness. Comments: ACCEPTED REGISTERED REPORT AT SANER (CORE A*) 2026 Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG) Cite as: arXiv:2601.01921 [cs.SE] (or arXiv:2601.01921v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2601.01921 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Matteo Esposito [view email] [v1] Mon, 5 Jan 2026 09:11:29 UTC (280 KB)
zh

[AI-26] MMP-A*: Multimodal Perception Enhanced Incremental Heuristic Search on Path Planning

【速读】:该论文旨在解决传统A算法在大规模复杂环境中的计算与内存开销过高,以及现有基于大语言模型(Large Language Models, LLMs)的路径规划方法因缺乏空间感知能力而导致的误引导问题。其核心挑战在于如何实现全局推理与几何精度的协同,尤其在存在死胡同或模糊边界等拓扑复杂场景中保持路径有效性。解决方案的关键是提出MMP-A框架,通过融合视觉-语言模型(Vision-Language Models, VLMs)的空间接地能力与一种新颖的自适应衰减机制(adaptive decay mechanism),使高阶语义推理能够锚定于物理几何结构,并动态调控不确定航点对启发函数的影响,从而在保障轨迹近似最优的同时显著降低内存占用和纠错成本。

链接: https://arxiv.org/abs/2601.01910
作者: Minh Hieu Ha,Khanh Ly Ta,Hung Phan,Tung Doan,Tung Dao,Dao Tran,Huynh Thi Thanh Binh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autonomous path planning requires a synergy between global reasoning and geometric precision, especially in complex or cluttered environments. While classical A* is valued for its optimality, it incurs prohibitive computational and memory costs in large-scale scenarios. Recent attempts to mitigate these limitations by using Large Language Models for waypoint guidance remain insufficient, as they rely only on text-based reasoning without spatial grounding. As a result, such models often produce incorrect waypoints in topologically complex environments with dead ends, and lack the perceptual capacity to interpret ambiguous physical boundaries. These inconsistencies lead to costly corrective expansions and undermine the intended computational efficiency. We introduce MMP-A*, a multimodal framework that integrates the spatial grounding capabilities of vision-language models with a novel adaptive decay mechanism. By anchoring high-level reasoning in physical geometry, the framework produces coherent waypoint guidance that addresses the limitations of text-only planners. The adaptive decay mechanism dynamically regulates the influence of uncertain waypoints within the heuristic, ensuring geometric validity while substantially reducing memory overhead. To evaluate robustness, we test the framework in challenging environments characterized by severe clutter and topological complexity. Experimental results show that MMP-A* achieves near-optimal trajectories with significantly reduced operational costs, demonstrating its potential as a perception-grounded and computationally efficient paradigm for autonomous navigation. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2601.01910 [cs.AI] (or arXiv:2601.01910v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2601.01910 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-27] Evaluating Feature Dependent Noise in Preference-based Reinforcement Learning

【速读】:该论文旨在解决强化学习中基于偏好(Preference-based Reinforcement Learning, PbRL)方法在面对特征相关噪声(feature-dependent noise)时性能下降的问题。传统噪声检测与处理方法多假设噪声为均匀分布且与观测无关,而本文首次形式化了“目标特征依赖噪声”这一概念,并提出了多种变体,如轨迹特征噪声、轨迹相似性噪声、不确定性感知噪声及语言模型噪声等。其关键创新在于揭示了现有最先进抗噪PbRL方法在某些特征依赖噪声场景下反而表现显著劣于未显式去噪的方法,从而挑战了当前对噪声鲁棒性的理解,并指出语言模型生成的偏好噪声具有与特征依赖噪声类似的特性,强调了未来研究应聚焦于构建真正稳健的特征依赖噪声处理机制。

链接: https://arxiv.org/abs/2601.01904
作者: Yuxuan Li,Harshith Reddy Kethireddy,Srijita Das
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Learning from Preferences in Reinforcement Learning (PbRL) has gained attention recently, as it serves as a natural fit for complicated tasks where the reward function is not easily available. However, preferences often come with uncertainty and noise if they are not from perfect teachers. Much prior literature aimed to detect noise, but with limited types of noise and most being uniformly distributed with no connection to observations. In this work, we formalize the notion of targeted feature-dependent noise and propose several variants like trajectory feature noise, trajectory similarity noise, uncertainty-aware noise, and Language Model noise. We evaluate feature-dependent noise, where noise is correlated with certain features in complex continuous control tasks from DMControl and Meta-world. Our experiments show that in some feature-dependent noise settings, the state-of-the-art noise-robust PbRL method’s learning performance is significantly deteriorated, while PbRL method with no explicit denoising can surprisingly outperform noise-robust PbRL in majority settings. We also find language model’s noise exhibits similar characteristics to feature-dependent noise, thereby simulating realistic humans and call for further study in learning with feature-dependent noise robustly. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2601.01904 [cs.LG] (or arXiv:2601.01904v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2601.01904 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-28] Safety at One Shot: Patching Fine-Tuned LLM s with A Single Instance

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在安全对齐(safety alignment)微调过程中导致的安全性下降问题,即传统方法需要大量安全样本或校准集才能恢复模型安全性,这不仅计算开销高,还会显著降低模型的实用性。其解决方案的关键在于:仅需一个安全示例即可完全恢复模型的安全性,且不牺牲模型效用,同时收敛速度快(仅需数个训练轮次)。作者进一步揭示了安全梯度具有低秩结构(low-rank structure),从而解释了为何如此高效的修复成为可能,并在五种安全对齐的LLM和多个数据集上验证了该方法的普适性。

链接: https://arxiv.org/abs/2601.01887
作者: Jiawen Zhang,Lipeng He,Kejia Chen,Jian Lou,Jian Liu,Xiaohu Yang,Ruoxi Jia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Fine-tuning safety-aligned large language models (LLMs) can substantially compromise their safety. Previous approaches require many safety samples or calibration sets, which not only incur significant computational overhead during realignment but also lead to noticeable degradation in model utility. Contrary to this belief, we show that safety alignment can be fully recovered with only a single safety example, without sacrificing utility and at minimal cost. Remarkably, this recovery is effective regardless of the number of harmful examples used in fine-tuning or the size of the underlying model, and convergence is achieved within just a few epochs. Furthermore, we uncover the low-rank structure of the safety gradient, which explains why such efficient correction is possible. We validate our findings across five safety-aligned LLMs and multiple datasets, demonstrating the generality of our approach.
zh

[AI-29] heory Trace Card: Theory-Driven Socio-Cognitive Evaluation of LLM s

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)社会认知评估中存在的“理论缺口”问题,即多数评估未明确指定目标能力的理论基础,导致对任务表现与实际能力之间关系的假设隐含化,从而引发系统性误判——将仅覆盖能力局部维度的测试结果错误解读为对整体能力的充分证据。解决方案的关键在于提出“理论溯源卡”(Theory Trace Card, TTC),这是一种轻量级文档工具,用于显式阐明评估所依赖的理论基础、目标能力的构成要素、任务操作化方式及其局限性,从而完整呈现从理论到评分的验证链条(validity chain),提升评估结果的可解释性和可复用性,且无需修改现有基准或达成单一理论共识。

链接: https://arxiv.org/abs/2601.01878
作者: Farzan Karimi-Malekabadi,Suhaib Abdurahman,Zhivar Sourati,Jackson Trager,Morteza Dehghani
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Socio-cognitive benchmarks for large language models (LLMs) often fail to predict real-world behavior, even when models achieve high benchmark scores. Prior work has attributed this evaluation-deployment gap to problems of measurement and validity. While these critiques are insightful, we argue that they overlook a more fundamental issue: many socio-cognitive evaluations proceed without an explicit theoretical specification of the target capability, leaving the assumptions linking task performance to competence implicit. Without this theoretical grounding, benchmarks that exercise only narrow subsets of a capability are routinely misinterpreted as evidence of broad competence: a gap that creates a systemic validity illusion by masking the failure to evaluate the capability’s other essential dimensions. To address this gap, we make two contributions. First, we diagnose and formalize this theory gap as a foundational failure that undermines measurement and enables systematic overgeneralization of benchmark results. Second, we introduce the Theory Trace Card (TTC), a lightweight documentation artifact designed to accompany socio-cognitive evaluations, which explicitly outlines the theoretical basis of an evaluation, the components of the target capability it exercises, its operationalization, and its limitations. We argue that TTCs enhance the interpretability and reuse of socio-cognitive evaluations by making explicit the full validity chain, which links theory, task operationalization, scoring, and limitations, without modifying benchmarks or requiring agreement on a single theory.
zh

[AI-30] oward Auditable Neuro-Symbolic Reasoning in Pathology: SQL as an Explicit Trace of Evidence

【速读】:该论文旨在解决病理图像分析中模型决策缺乏可解释性的问题,即临床医生难以理解模型判断依据及其推理过程。现有视觉语言模型虽能生成自然语言解释,但常为相关性描述,缺乏可验证的证据支撑。其解决方案的关键在于提出一种以SQL为中心的智能体(agentic)框架:首先提取人类可读的细胞特征,随后由特征推理智能体(Feature Reasoning Agent)将这些特征转化为结构化数据,并通过执行SQL查询聚合视觉证据形成定量结论;再由知识比对智能体(Knowledge Comparison Agent)将所得结论与已知病理学知识进行比对,从而模拟病理学家基于可观测指标的诊断逻辑。该方法不仅提升了模型决策的可解释性和可追溯性,还生成了可执行的SQL操作日志,明确关联细胞测量结果与最终诊断结论。

链接: https://arxiv.org/abs/2601.01875
作者: Kewen Cao,Jianxu Chen,Yongbing Zhang,Ye Zhang,Hongxiao Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Automated pathology image analysis is central to clinical diagnosis, but clinicians still ask which slide features drive a model’s decision and why. Vision-language models can produce natural language explanations, but these are often correlational and lack verifiable evidence. In this paper, we introduce an SQL-centered agentic framework that enables both feature measurement and reasoning to be auditable. Specifically, after extracting human-interpretable cellular features, Feature Reasoning Agents compose and execute SQL queries over feature tables to aggregate visual evidence into quantitative findings. A Knowledge Comparison Agent then evaluates these findings against established pathological knowledge, mirroring how pathologists justify diagnoses from measurable observations. Extensive experiments evaluated on two pathology visual question answering datasets demonstrate our method improves interpretability and decision traceability while producing executable SQL traces that link cellular measurements to diagnostic conclusions.
zh

[AI-31] Jenius Agent : Towards Experience-Driven Accuracy Optimization in Real-World Scenarios

【速读】:该论文旨在解决基于大语言模型(Large Language Models, LLMs)的智能体系统在任务执行过程中面临的性能瓶颈问题,尤其是上下文理解能力弱、工具调用效率低以及响应生成质量不稳定等挑战。其解决方案的核心在于提出一个名为Jenius-Agent的端到端框架,包含三项关键技术:(1) 基于代理状态和任务目标自适应生成提示(adaptive prompt generation),提升推理可靠性与鲁棒性;(2) 上下文感知的工具编排模块(context-aware tool orchestration),实现意图识别、语义检索与动态调用;(3) 分层记忆机制(layered memory mechanism),融合会话记忆、任务历史与外部摘要,通过动态压缩与总结优化信息相关性和处理效率。实验表明,该方案显著提升了任务准确率(+20%),同时降低token消耗、响应延迟及工具调用失败率。

链接: https://arxiv.org/abs/2601.01857
作者: Defei Xia,Bingfeng Pi,Shenbin Zhang,Song Hua,Yunfei Wei,Lei Zuo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As agent systems powered by large language models (LLMs) advance, improving the task performance of an autonomous agent, especially in context understanding, tool usage, and response generation, has become increasingly critical. Although prior studies have advanced the overall design of LLM-based agents, systematic optimization of their internal reasoning and tool-use pipelines remains underexplored. This paper introduces an agent framework grounded in real-world practical experience, with three key innovations: (1) an adaptive prompt generation strategy that aligns with the agent’s state and task goals to improve reliability and robustness; (2) a context-aware tool orchestration module that performs tool categorization, semantic retrieval, and adaptive invocation based on user intent and context; and (3) a layered memory mechanism that integrates session memory, task history, and external summaries to improve relevance and efficiency through dynamic summarization and compression. An end-to-end framework named Jenius-Agent has been integrated with three key optimizations, including tools based on the Model Context Protocol (MCP), file input/output (I/O), and execution feedback. The experiments show a 20 percent improvement in task accuracy, along with a reduced token cost, response latency, and invocation failures. The framework is already deployed in Jenius (this https URL), providing a lightweight and scalable solution for robust, protocol-compatible autonomous agents.
zh

[AI-32] Clinical Knowledge Graph Construction and Evaluation with Multi-LLM s via Retrieval-Augmented Generation

【速读】:该论文旨在解决从非结构化临床文本中构建知识图谱(Knowledge Graph, KG)时存在的事实准确性不足与语义一致性差的问题,尤其在肿瘤学领域尤为突出。现有方法多依赖结构化输入且缺乏对生成内容的鲁棒验证,导致知识图谱质量难以保障。其解决方案的关键在于提出一个端到端框架,结合多智能体提示(multi-agent prompting)与基于模式约束的检索增强生成(schema-constrained Retrieval-Augmented Generation, KG-RAG),实现从自由文本直接提取实体、属性和关系,并通过熵基不确定性评分、本体对齐的RDF/OWL模式生成以及多大语言模型(Large Language Models, LLMs)共识验证来提升图谱的准确性和语义一致性,从而支持持续迭代优化与自监督评估,最终产出可解释、符合SPARQL标准且具备临床依据的知识图谱。

链接: https://arxiv.org/abs/2601.01844
作者: Udiptaman Das,Krishnasai B. Atmakuri,Duy Ho,Chi Lee,Yugyung Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 pages, 5 tables, 4 figures

点击查看摘要

Abstract:Large language models (LLMs) offer new opportunities for constructing knowledge graphs (KGs) from unstructured clinical narratives. However, existing approaches often rely on structured inputs and lack robust validation of factual accuracy and semantic consistency, limitations that are especially problematic in oncology. We introduce an end-to-end framework for clinical KG construction and evaluation directly from free text using multi-agent prompting and a schema-constrained Retrieval-Augmented Generation (KG-RAG) strategy. Our pipeline integrates (1) prompt-driven entity, attribute, and relation extraction; (2) entropy-based uncertainty scoring; (3) ontology-aligned RDF/OWL schema generation; and (4) multi-LLM consensus validation for hallucination detection and semantic refinement. Beyond static graph construction, the framework supports continuous refinement and self-supervised evaluation, enabling iterative improvement of graph quality. Applied to two oncology cohorts (PDAC and BRCA), our method produces interpretable, SPARQL-compatible, and clinically grounded knowledge graphs without relying on gold-standard annotations. Experimental results demonstrate consistent gains in precision, relevance, and ontology compliance over baseline methods.
zh

[AI-33] he Machine Learning Canvas: Empirical Findings on Why Strategy Matters More Than AI Code Generation

【速读】:该论文旨在解决机器学习(Machine Learning, ML)项目高失败率问题——尽管AI编码助手日益普及,超过80%的ML项目仍无法实现实际业务价值。其解决方案的核心是提出并验证了一个名为“机器学习画布”(Machine Learning Canvas)的实践框架,该框架整合了商业战略、软件工程与数据科学三个维度,识别出四个关键成功因素:战略(清晰的目标与规划)、流程(工作执行方式)、生态(工具与基础设施)和支持(组织背书与资源)。研究表明,这四个因素相互关联且具有显著因果效应,其中组织支持对战略制定、流程优化和基础设施建设均具强正向影响(β值分别为0.432、0.428和0.547,p < 0.001),揭示了AI辅助仅能提升编码效率(解决“如何编码”),而项目成败最终取决于战略层面的“为何做”与“做什么”。

链接: https://arxiv.org/abs/2601.01839
作者: Martin Prause
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Dataset available: this https URL

点击查看摘要

Abstract:Despite the growing popularity of AI coding assistants, over 80% of machine learning (ML) projects fail to deliver real business value. This study creates and tests a Machine Learning Canvas, a practical framework that combines business strategy, software engineering, and data science in order to determine the factors that lead to the success of ML projects. We surveyed 150 data scientists and analyzed their responses using statistical modeling. We identified four key success factors: Strategy (clear goals and planning), Process (how work gets done), Ecosystem (tools and infrastructure), and Support (organizational backing and resources). Our results show that these factors are interconnected - each one affects the next. For instance, strong organizational support results in a clearer strategy (\beta = 0.432, p 0.001), which improves work processes (\beta = 0.428, p 0.001) and builds better infrastructure (\beta = 0.547, p 0.001). Together, these elements determine whether a project succeeds. The surprising finding? Although AI assistants make coding faster, they don’t guarantee project success. AI assists with the “how” of coding but cannot replace the “why” and “what” of strategic thinking.
zh

[AI-34] COMPASS: A Framework for Evaluating Organization-Specific Policy Alignment in LLM s

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在企业高风险应用场景中,如医疗和金融领域,如何有效遵循组织特定政策(如允许列表和禁止列表)的问题。现有安全评估方法仅关注通用性危害,忽略了组织内部合规需求。解决方案的关键在于提出 COMPASS(Company/Organization Policy Alignment Assessment),这是一个系统性的评估框架,通过设计边缘案例生成5,920个测试查询,同时验证模型在常规合规性和对抗鲁棒性上的表现。实验表明,当前LLMs在处理合法请求时准确率高达95%,但在执行禁止类指令时失败率极高(仅能拒绝13–40%的对抗性违规),揭示了模型在政策关键部署中的显著不足,凸显了COMPASS作为组织级AI安全评估工具的必要性。

链接: https://arxiv.org/abs/2601.01836
作者: Dasol Choi,DongGeon Lee,Brigitta Jesica Kartono,Helena Berndt,Taeyoun Kwon,Joonwon Jang,Haon Park,Hwanjo Yu,Minsuk Kahng
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:As large language models are deployed in high-stakes enterprise applications, from healthcare to finance, ensuring adherence to organization-specific policies has become essential. Yet existing safety evaluations focus exclusively on universal harms. We present COMPASS (Company/Organization Policy Alignment Assessment), the first systematic framework for evaluating whether LLMs comply with organizational allowlist and denylist policies. We apply COMPASS to eight diverse industry scenarios, generating and validating 5,920 queries that test both routine compliance and adversarial robustness through strategically designed edge cases. Evaluating seven state-of-the-art models, we uncover a fundamental asymmetry: models reliably handle legitimate requests (95% accuracy) but catastrophically fail at enforcing prohibitions, refusing only 13-40% of adversarial denylist violations. These results demonstrate that current LLMs lack the robustness required for policy-critical deployments, establishing COMPASS as an essential evaluation framework for organizational AI safety.
zh

[AI-35] Yukthi Opus: A Multi-Chain Hybrid Metaheuristic for Large-Scale NP-Hard Optimization

【速读】:该论文针对在显式评估预算约束下求解NP-hard优化问题的挑战,提出了一种名为Yukthi Opus (YO) 的多链混合元启发式算法。其核心解决方案在于采用两阶段结构:第一阶段通过马尔可夫链蒙特卡洛(Markov Chain Monte Carlo, MCMC)进行全局探索,第二阶段结合贪婪局部搜索与自适应重加热模拟退火策略实现高效利用与逃逸局部最优;同时引入空间黑名单机制避免重复评估低质量区域,并采用多链执行策略提升鲁棒性与对初始值的不敏感性。实验表明,MCMC探索和贪婪精化是保证解质量的关键,而模拟退火和多链机制主要增强稳定性与方差控制,在保持可预测评估预算的前提下,显著提升了复杂黑箱优化场景下的性能表现。

链接: https://arxiv.org/abs/2601.01832
作者: SB Danush Vikraman,Hannah Abagail,Prasanna Kesavraj,Gajanan V Honnavar
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: 22 pages, 9 figures, includes extensive ablation studies and benchmark comparisons

点击查看摘要

Abstract:We present Yukthi Opus (YO), a multi-chain hybrid metaheuristic designed for NP-hard optimization under explicit evaluation budget constraints. YO integrates three complementary mechanisms in a structured two-phase architecture: Markov Chain Monte Carlo (MCMC) for global exploration, greedy local search for exploitation, and simulated annealing with adaptive reheating to enable controlled escape from local minima. A dedicated burn-in phase allocates evaluations to probabilistic exploration, after which a hybrid optimization loop refines promising candidates. YO further incorporates a spatial blacklist mechanism to avoid repeated evaluation of poor regions and a multi-chain execution strategy to improve robustness and reduce sensitivity to initialization. We evaluate YO on three benchmarks: the Rastrigin function (5D) with ablation studies, the Traveling Salesman Problem with 50 to 200 cities, and the Rosenbrock function (5D) with comparisons against established optimizers including CMA-ES, Bayesian optimization, and accelerated particle swarm optimization. Results show that MCMC exploration and greedy refinement are critical for solution quality, while simulated annealing and multi-chain execution primarily improve stability and variance reduction. Overall, YO achieves competitive performance on large and multimodal problems while maintaining predictable evaluation budgets, making it suitable for expensive black-box optimization settings. Comments: 22 pages, 9 figures, includes extensive ablation studies and benchmark comparisons Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI) Cite as: arXiv:2601.01832 [cs.NE] (or arXiv:2601.01832v1 [cs.NE] for this version) https://doi.org/10.48550/arXiv.2601.01832 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-36] ARIES: A Scalable Multi-Agent Orchestration Framework for Real-Time Epidemiological Surveillance and Outbreak Monitoring

【速读】:该论文旨在解决全球健康监测中存在的知识缺口问题,特别是通用人工智能(General-purpose AI)在高风险流行病学领域因持续幻觉和无法有效处理专业数据孤岛而表现不佳的局限性。解决方案的关键在于提出一种名为ARIES(Agentic Retrieval Intelligence for Epidemiological Surveillance)的专用自主多智能体框架,其核心创新是基于分层指挥结构,利用GPTs协调可扩展的子智能体群,自动查询世卫组织(WHO)、疾病控制与预防中心(CDC)及同行评审文献等权威来源,并通过自动化提取与逻辑合成实现近实时的新兴威胁识别与信号偏离检测,从而构建动态智能生态系统,显著优于通用模型。

链接: https://arxiv.org/abs/2601.01831
作者: Aniket Wattamwar,Sampson Akwafuo
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Software Engineering (cs.SE)
备注: 6 pages, 14 figures, 1 table

点击查看摘要

Abstract:Global health surveillance is currently facing a challenge of Knowledge Gaps. While general-purpose AI has proliferated, it remains fundamentally unsuited for the high-stakes epidemiological domain due to chronic hallucinations and an inability to navigate specialized data silos. This paper introduces ARIES (Agentic Retrieval Intelligence for Epidemiological Surveillance), a specialized, autonomous multi-agent framework designed to move beyond static, disease-specific dashboards toward a dynamic intelligence ecosystem. Built on a hierarchical command structure, ARIES utilizes GPTs to orchestrate a scalable swarm of sub-agents capable of autonomously querying World Health Organization (WHO), Center for Disease Control and Prevention (CDC), and peer-reviewed research papers. By automating the extraction and logical synthesis of surveillance data, ARIES provides a specialized reasoning that identifies emergent threats and signal divergence in near real-time. This modular architecture proves that a task-specific agentic swarm can outperform generic models, offering a robust, extensible for next-generation outbreak response and global health intelligence.
zh

[AI-37] Admissibility Alignment

【速读】:该论文试图解决传统AI对齐(AI alignment)方法在不确定性环境下失效的问题,即现有方法多将对齐视为静态或二值化条件,难以有效评估和保障复杂决策系统在分布外场景中的安全与可信行为。其解决方案的关键在于提出“可接受性对齐”(Admissibility Alignment)这一新范式,将对齐定义为在不确定条件下对可接受行动和决策选择的概率性、决策论性质的属性,并通过MAP-AI(Monte Carlo Alignment for Policy)架构实现:该架构利用蒙特卡洛估计Outcome分布,基于可接受性控制策略选择,而非依赖静态模型约束;同时显式建模不确定性、干预效应、价值模糊性和治理约束,在政策行为跨多个可能未来的情景中评估预期效用、方差、尾部风险及误对齐概率等分布特征,从而提供一种无需重新训练模型即可动态调整策略行为的可执行对齐评估与控制机制。

链接: https://arxiv.org/abs/2601.01816
作者: Chris Duffey
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 24 pages, 2 figures, 2 tables… Decision-theoretic alignment under uncertainty

点击查看摘要

Abstract:This paper introduces Admissibility Alignment: a reframing of AI alignment as a property of admissible action and decision selection over distributions of outcomes under uncertainty, evaluated through the behavior of candidate policies. We present MAP-AI (Monte Carlo Alignment for Policy) as a canonical system architecture for operationalizing admissibility alignment, formalizing alignment as a probabilistic, decision-theoretic property rather than a static or binary condition. MAP-AI, a new control-plane system architecture for aligned decision-making under uncertainty, enforces alignment through Monte Carlo estimation of outcome distributions and admissibility-controlled policy selection rather than static model-level constraints. The framework evaluates decision policies across ensembles of plausible futures, explicitly modeling uncertainty, intervention effects, value ambiguity, and governance constraints. Alignment is assessed through distributional properties including expected utility, variance, tail risk, and probability of misalignment rather than accuracy or ranking performance. This approach distinguishes probabilistic prediction from decision reasoning under uncertainty and provides an executable methodology for evaluating trust and alignment in enterprise and institutional AI systems. The result is a practical foundation for governing AI systems whose impact is determined not by individual forecasts, but by policy behavior across distributions and tail events. Finally, we show how distributional alignment evaluation can be integrated into decision-making itself, yielding an admissibility-controlled action selection mechanism that alters policy behavior under uncertainty without retraining or modifying underlying models. Comments: 24 pages, 2 figures, 2 tables… Decision-theoretic alignment under uncertainty Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2601.01816 [cs.AI] (or arXiv:2601.01816v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2601.01816 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-38] Moments Matter:Stabilizing Policy Optimization using Return Distributions

【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)中策略在连续控制任务下因参数微小变化导致行为不稳定的问题,这种不稳定性源于环境随机性(如状态转移、初始条件和奖励噪声)与算法因素(如小批量选择和探索噪声)的共同作用。其解决方案的关键在于利用环境的随机性,通过构建分布式评论家(distributional critic)来建模状态-动作回报分布,并基于该分布的高阶矩(偏度和峰度)对近端策略优化(PPO)的优势函数进行修正,从而惩罚极端尾部行为,避免策略进入易引发不稳定的参数区域。此方法无需直接估计高维空间中的回报分布 $ R(\theta) $,显著提升了策略更新的稳定性,实验表明在Walker2D环境中可使稳定性提升达75%,同时保持与标准PPO相当的评估回报。

链接: https://arxiv.org/abs/2601.01803
作者: Dennis Jabs,Aditya Mohan,Marius Lindauer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Workshop paper at RLDM’25

点击查看摘要

Abstract:Deep Reinforcement Learning (RL) agents often learn policies that achieve the same episodic return yet behave very differently, due to a combination of environmental (random transitions, initial conditions, reward noise) and algorithmic (minibatch selection, exploration noise) factors. In continuous control tasks, even small parameter shifts can produce unstable gaits, complicating both algorithm comparison and real-world transfer. Previous work has shown that such instability arises when policy updates traverse noisy neighborhoods and that the spread of post-update return distribution R(\theta) , obtained by repeatedly sampling minibatches, updating \theta , and measuring final returns, is a useful indicator of this noise. Although explicitly constraining the policy to maintain a narrow R(\theta) can improve stability, directly estimating R(\theta) is computationally expensive in high-dimensional settings. We propose an alternative that takes advantage of environmental stochasticity to mitigate update-induced variability. Specifically, we model state-action return distribution through a distributional critic and then bias the advantage function of PPO using higher-order moments (skewness and kurtosis) of this distribution. By penalizing extreme tail behaviors, our method discourages policies from entering parameter regimes prone to instability. We hypothesize that in environments where post-update critic values align poorly with post-update returns, standard PPO struggles to produce a narrow R(\theta) . In such cases, our moment-based correction narrows R(\theta) , improving stability by up to 75% in Walker2D, while preserving comparable evaluation returns.
zh

[AI-39] PsychEval: A Multi-Session and Multi-Therapy Benchmark for High-Realism and Comprehensive AI Psychological Counselor

【速读】:该论文旨在解决生成式 AI (Generative AI) 在心理评估领域中面临的三大核心挑战:一是如何训练具备高度真实感的AI咨询师,这要求模型具备持续记忆和动态目标追踪能力;二是如何构建支持多种疗法(如精神分析、行为主义、认知行为疗法等)的多疗法AI咨询系统;三是如何建立系统性的评估框架以科学衡量AI咨询师的临床表现。解决方案的关键在于提出一个名为 \textttPsychEval 的多轮次、多疗法、高保真度基准测试平台,其包含6-10个咨询会话的纵向设计、超过677项元技能与4577项原子技能标注、覆盖五种主流治疗流派及整合疗法的多样化数据集,并构建了18项疗法特异性和共享性指标组成的综合评价体系,从而不仅实现对AI咨询师的全面评估,还支持基于强化学习的自我进化训练,推动临床责任导向的自适应AI咨询系统发展。

链接: https://arxiv.org/abs/2601.01802
作者: Qianjun Pan,Junyi Wang,Jie Zhou,Yutao Yang,Junsong Li,Kaiyin Xu,Yougen Zhou,Yihan Li,Jingyuan Zhao,Qin Chen,Ningning Zhou,Kai Chen,Liang He
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:To develop a reliable AI for psychological assessment, we introduce \textttPsychEval, a multi-session, multi-therapy, and highly realistic benchmark designed to address three key challenges: \textbf1) Can we train a highly realistic AI counselor? Realistic counseling is a longitudinal task requiring sustained memory and dynamic goal tracking. We propose a multi-session benchmark (spanning 6-10 sessions across three distinct stages) that demands critical capabilities such as memory continuity, adaptive reasoning, and longitudinal planning. The dataset is annotated with extensive professional skills, comprising over 677 meta-skills and 4577 atomic skills. \textbf2) How to train a multi-therapy AI counselor? While existing models often focus on a single therapy, complex cases frequently require flexible strategies among various therapies. We construct a diverse dataset covering five therapeutic modalities (Psychodynamic, Behaviorism, CBT, Humanistic Existentialist, and Postmodernist) alongside an integrative therapy with a unified three-stage clinical framework across six core psychological topics. \textbf3) How to systematically evaluate an AI counselor? We establish a holistic evaluation framework with 18 therapy-specific and therapy-shared metrics across Client-Level and Counselor-Level dimensions. To support this, we also construct over 2,000 diverse client profiles. Extensive experimental analysis fully validates the superior quality and clinical fidelity of our dataset. Crucially, \textttPsychEval transcends static benchmarking to serve as a high-fidelity reinforcement learning environment that enables the self-evolutionary training of clinically responsible and adaptive AI counselors.
zh

[AI-40] Sparse Threats Focused Defense: Criticality-Aware Robust Reinforcement Learning for Safe Autonomous Driving

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在自动驾驶(Autonomous Driving, AD)场景中对扰动敏感、鲁棒性不足的问题,特别是针对稀疏但安全关键的风险(如碰撞)缺乏有效应对机制的局限性。现有方法通常将对抗训练建模为零和博弈,忽视了智能体与对抗者之间的不对称性及安全风险的稀疏特性,导致实际应用中鲁棒性不足。其解决方案的关键在于提出一种新的对抗训练框架——关键性感知鲁棒强化学习(Criticality-Aware Robust RL, CARRL),该框架通过两个交互组件实现:风险暴露对抗者(Risk Exposure Adversary, REA)和风险目标鲁棒智能体(Risk-Targeted Robust Agent, RTRA)。其中,REA采用解耦优化机制,在有限预算下精准识别并利用稀疏的安全关键时刻;RTRA则通过双缓冲区联合利用良性与对抗经验,并强制策略在扰动下保持一致性,从而稳定行为表现。实验表明,该方法相比最先进基线显著降低至少22.66%的碰撞率。

链接: https://arxiv.org/abs/2601.01800
作者: Qi Wei,Junchao Fan,Zhao Yang,Jianhua Wang,Jingkai Mao,Xiaolin Chang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has shown considerable potential in autonomous driving (AD), yet its vulnerability to perturbations remains a critical barrier to real-world deployment. As a primary countermeasure, adversarial training improves policy robustness by training the AD agent in the presence of an adversary that deliberately introduces perturbations. Existing approaches typically model the interaction as a zero-sum game with continuous attacks. However, such designs overlook the inherent asymmetry between the agent and the adversary and then fail to reflect the sparsity of safety-critical risks, rendering the achieved robustness inadequate for practical AD scenarios. To address these limitations, we introduce criticality-aware robust RL (CARRL), a novel adversarial training approach for handling sparse, safety-critical risks in autonomous driving. CARRL consists of two interacting components: a risk exposure adversary (REA) and a risk-targeted robust agent (RTRA). We model the interaction between the REA and RTRA as a general-sum game, allowing the REA to focus on exposing safety-critical failures (e.g., collisions) while the RTRA learns to balance safety with driving efficiency. The REA employs a decoupled optimization mechanism to better identify and exploit sparse safety-critical moments under a constrained budget. However, such focused attacks inevitably result in a scarcity of adversarial data. The RTRA copes with this scarcity by jointly leveraging benign and adversarial experiences via a dual replay buffer and enforces policy consistency under perturbations to stabilize behavior. Experimental results demonstrate that our approach reduces the collision rate by at least 22.66% across all cases compared to state-of-the-art baseline methods.
zh

[AI-41] LIA: Supervised Fine-Tuning of Large Language Models for Automatic Issue Assignment

【速读】:该论文旨在解决软件维护中的**问题分配(Issue Assignment)**难题,即如何将新提交的问题报告(issue report)自动、准确地分配给最合适的开发者。传统人工分配方式在大型开源项目中易出现不一致和错误,而现有自动化方法受限于项目特定训练数据不足或关系信息稀疏噪声等问题,导致性能不佳。解决方案的关键在于提出LIA(LLM-based Issue Assignment),其核心是通过监督微调(supervised fine-tuning)适配一个预训练语言模型(DeepSeek-R1-Distill-Llama-8B),使其直接从问题标题和描述中学习历史任务模式,并生成基于开发者历史处理经验的排序推荐结果,从而实现高效且高精度的问题分配。

链接: https://arxiv.org/abs/2601.01780
作者: Arsham Khosravani,Alireza Hosseinpour,Arshia Akhavan,Mehdi Keshani,Abbas Heydarnoori
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Issue assignment is a critical process in software maintenance, where new issue reports are validated and assigned to suitable developers. However, manual issue assignment is often inconsistent and error-prone, especially in large open-source projects where thousands of new issues are reported monthly. Existing automated approaches have shown promise, but many rely heavily on large volumes of project-specific training data or relational information that is often sparse and noisy, which limits their effectiveness. To address these challenges, we propose LIA (LLM-based Issue Assignment), which employs supervised fine-tuning to adapt an LLM, DeepSeek-R1-Distill-Llama-8B in this work, for automatic issue assignment. By leveraging the LLM’s pretrained semantic understanding of natural language and software-related text, LIA learns to generate ranked developer recommendations directly from issue titles and descriptions. The ranking is based on the model’s learned understanding of historical issue-to-developer assignments, using patterns from past tasks to infer which developers are most likely to handle new issues. Through comprehensive evaluation, we show that LIA delivers substantial improvements over both its base pretrained model and state-of-the-art baselines. It achieves up to +187.8% higher Hit@1 compared to the DeepSeek-R1-Distill-Llama-8B pretrained base model, and outperforms four leading issue assignment methods by as much as +211.2% in Hit@1 score. These results highlight the effectiveness of domain-adapted LLMs for software maintenance tasks and establish LIA as a practical, high-performing solution for issue assignment.
zh

[AI-42] Can Large Language Models Solve Engineering Equations? A Systematic Comparison of Direct Prediction and Solver-Assisted Approaches

【速读】:该论文旨在解决工程实践中广泛存在的超越方程(transcendental equations)数值求解问题,传统方法依赖迭代算法如牛顿-拉夫森法(Newton-Raphson),而本文探索大型语言模型(Large Language Models, LLMs)在该任务中的应用潜力。其核心问题是:LLMs能否直接预测方程解,或通过与经典数值求解器结合形成混合架构实现更优性能?解决方案的关键在于采用“符号推理+数值求解”的协同机制——即LLMs负责提取物理规律、构建控制方程并提供初始条件,由经典迭代算法完成高精度计算,而非依赖LLMs独立执行数值迭代。实验证明,这种混合策略显著降低误差(降幅达67.9%–81.8%),尤其在电子学等领域效果突出,表明当前LLMs更适合作为智能接口连接传统数值求解器,而非替代其进行精密算术运算。

链接: https://arxiv.org/abs/2601.01774
作者: Sai Varun Kodathala,Rakesh Vunnam
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA)
备注: 14 pages

点击查看摘要

Abstract:Transcendental equations requiring iterative numerical solution pervade engineering practice, from fluid mechanics friction factor calculations to orbital position determination. We systematically evaluate whether Large Language Models can solve these equations through direct numerical prediction or whether a hybrid architecture combining LLM symbolic manipulation with classical iterative solvers proves more effective. Testing six state-of-the-art models (GPT-5.1, GPT-5.2, Gemini-3-Flash, Gemini-2.5-Lite, Claude-Sonnet-4.5, Claude-Opus-4.5) on 100 problems spanning seven engineering domains, we compare direct prediction against solver-assisted computation where LLMs formulate governing equations and provide initial conditions while Newton-Raphson iteration performs numerical solution. Direct prediction yields mean relative errors of 0.765 to 1.262 across models, while solver-assisted computation achieves 0.225 to 0.301, representing error reductions of 67.9% to 81.8%. Domain-specific analysis reveals dramatic improvements in Electronics (93.1%) due to exponential equation sensitivity, contrasted with modest gains in Fluid Mechanics (7.2%) where LLMs exhibit effective pattern recognition. These findings establish that contemporary LLMs excel at symbolic manipulation and domain knowledge retrieval but struggle with precision-critical iterative arithmetic, suggesting their optimal deployment as intelligent interfaces to classical numerical solvers rather than standalone computational engines.
zh

[AI-43] A New Benchmark for the Appropriate Evaluation of RTL Code Optimization

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在生成寄存器传输级(Register Transfer Level, RTL)代码时,缺乏对功耗、性能和面积(Power, Performance, Area, PPA)优化质量的有效评估问题。现有基准主要关注语法正确性,而忽视了工业实践中关键的PPA优化能力。解决方案的关键在于提出RTL-OPT基准,其包含36个手工设计的数字电路模块,覆盖组合逻辑、流水线数据通路、有限状态机和存储接口等多样化实现类别;每项任务提供一对RTL代码——一个次优版本与一个人工优化的参考版本,后者体现行业验证过的优化模式;同时集成自动化评估框架,用于验证功能正确性并量化PPA改进,从而实现对LLMs在硬件设计优化能力上的标准化、可衡量的评估。

链接: https://arxiv.org/abs/2601.01765
作者: Yao Lu,Shang Liu,Hangan Zhou,Wenji Fang,Qijun Zhang,Zhiyao Xie
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:The rapid progress of artificial intelligence increasingly relies on efficient integrated circuit (IC) design. Recent studies have explored the use of large language models (LLMs) for generating Register Transfer Level (RTL) code, but existing benchmarks mainly evaluate syntactic correctness rather than optimization quality in terms of power, performance, and area (PPA). This work introduces RTL-OPT, a benchmark for assessing the capability of LLMs in RTL optimization. RTL-OPT contains 36 handcrafted digital designs that cover diverse implementation categories including combinational logic, pipelined datapaths, finite state machines, and memory interfaces. Each task provides a pair of RTL codes, a suboptimal version and a human-optimized reference that reflects industry-proven optimization patterns not captured by conventional synthesis tools. Furthermore, RTL-OPT integrates an automated evaluation framework to verify functional correctness and quantify PPA improvements, enabling standardized and meaningful assessment of generative models for hardware design optimization.
zh

[AI-44] MergeRec: Model Merging for Data-Isolated Cross-Domain Sequential Recommendation KDD2026

【速读】:该论文旨在解决跨域序列推荐(cross-domain sequential recommendation)中模型泛化能力不足的问题,特别是在数据隔离场景下——即不同领域之间的原始用户交互数据无法共享,传统方法依赖重叠用户或物品、或忽略隐私约束的假设,难以适用。解决方案的关键在于提出一种名为MergeRec的新框架,其核心创新包括:(1) 基于无训练合并技术初始化融合模型;(2) 通过将每个物品视为虚拟序列构造伪用户数据,从而在不依赖真实用户交互的前提下合成有意义的训练样本;(3) 设计联合优化目标,结合推荐损失与蒸馏损失,协同优化各领域模型的融合权重,实现知识迁移与性能提升。实验表明,该方法显著优于传统模型合并策略,在Recall@10指标上平均提升达17.21%,验证了模型合并作为构建通用推荐系统的一种可扩展且高效路径的潜力。

链接: https://arxiv.org/abs/2601.01753
作者: Hyunsoo Kim,Jaewan Moon,Seongmin Park,Jongwuk Lee
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted by KDD 2026

点击查看摘要

Abstract:Modern recommender systems trained on domain-specific data often struggle to generalize across multiple domains. Cross-domain sequential recommendation has emerged as a promising research direction to address this challenge; however, existing approaches face fundamental limitations, such as reliance on overlapping users or items across domains, or unrealistic assumptions that ignore privacy constraints. In this work, we propose a new framework, MergeRec, based on model merging under a new and realistic problem setting termed data-isolated cross-domain sequential recommendation, where raw user interaction data cannot be shared across domains. MergeRec consists of three key components: (1) merging initialization, (2) pseudo-user data construction, and (3) collaborative merging optimization. First, we initialize a merged model using training-free merging techniques. Next, we construct pseudo-user data by treating each item as a virtual sequence in each domain, enabling the synthesis of meaningful training samples without relying on real user interactions. Finally, we optimize domain-specific merging weights through a joint objective that combines a recommendation loss, which encourages the merged model to identify relevant items, and a distillation loss, which transfers collaborative filtering signals from the fine-tuned source models. Extensive experiments demonstrate that MergeRec not only preserves the strengths of the original models but also significantly enhances generalizability to unseen domains. Compared to conventional model merging methods, MergeRec consistently achieves superior performance, with average improvements of up to 17.21% in Recall@10, highlighting the potential of model merging as a scalable and effective approach for building universal recommender systems. The source code is available at this https URL.
zh

[AI-45] AI Agent Systems: Architectures Applications and Evaluation

【速读】:该论文旨在系统梳理和归纳AI代理(AI agents)架构的最新研究进展,以应对自然语言意图与真实世界计算之间接口日益复杂的问题。其核心挑战在于如何整合推理、规划、记忆与工具调用等能力,构建高效且可靠的智能代理系统。解决方案的关键在于提出一个统一的分类体系,涵盖代理组件(如策略/大语言模型核心、记忆模块、世界模型、规划器、工具路由和评判者)、编排模式(单代理与多代理、集中式与分布式协调)以及部署场景(离线分析与在线交互辅助、安全关键任务与开放性任务),并通过识别关键设计权衡(延迟与准确性、自主性与可控性、能力与可靠性)来指导未来研究方向。

链接: https://arxiv.org/abs/2601.01743
作者: Bin Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI agents – systems that combine foundation models with reasoning, planning, memory, and tool use – are rapidly becoming a practical interface between natural-language intent and real-world computation. This survey synthesizes the emerging landscape of AI agent architectures across: (i) deliberation and reasoning (e.g., chain-of-thought-style decomposition, self-reflection and verification, and constraint-aware decision making), (ii) planning and control (from reactive policies to hierarchical and multi-step planners), and (iii) tool calling and environment interaction (retrieval, code execution, APIs, and multimodal perception). We organize prior work into a unified taxonomy spanning agent components (policy/LLM core, memory, world models, planners, tool routers, and critics), orchestration patterns (single-agent vs.\ multi-agent; centralized vs.\ decentralized coordination), and deployment settings (offline analysis vs.\ online interactive assistance; safety-critical vs.\ open-ended tasks). We discuss key design trade-offs – latency vs.\ accuracy, autonomy vs.\ controllability, and capability vs.\ reliability – and highlight how evaluation is complicated by non-determinism, long-horizon credit assignment, tool and environment variability, and hidden costs such as retries and context growth. Finally, we summarize measurement and benchmarking practices (task suites, human preference and utility metrics, success under constraints, robustness and security) and identify open challenges including verification and guardrails for tool actions, scalable memory and context management, interpretability of agent decisions, and reproducible evaluation under realistic workloads.
zh

[AI-46] Yuan3.0 Flash: An Open Multimodal Large Language Model for Enterprise Applications

【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRM)中普遍存在的“过度思考”(overthinking)现象,该现象会导致计算资源浪费和推理效率低下。为此,作者提出了一种名为反射感知自适应策略优化(Reflection-aware Adaptive Policy Optimization, RAPO)的新型强化学习(Reinforcement Learning, RL)训练算法,其关键在于通过引入对模型推理过程的自我反思机制,动态调节策略优化过程,从而有效抑制不必要的冗长推理路径,提升推理效率与准确性。实验表明,Yuan3.0 Flash 在企业级任务(如检索增强生成、复杂表格理解与摘要)上表现卓越,同时在数学、科学等通用推理领域也达到前沿水平,且仅需约1/4至1/2的平均token消耗。

链接: https://arxiv.org/abs/2601.01718
作者: YuanLab.ai:Shawn Wu,Sean Wang,Louie Li,Darcy Chen,Allen Wang,Jiangang Luo,Xudong Zhao,Joseph Shen,Gawain Ma,Jasper Jia,Marcus Mao,Claire Wang,Hunter He,Carol Wang,Zera Zhang,Jason Wang,Chonly Shen,Leo Zhang,Logan Chen,Qasim Meng,James Gong,Danied Zhao,Penn Zheng,Owen Zhu,Tong Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce Yuan3.0 Flash, an open-source Mixture-of-Experts (MoE) MultiModal Large Language Model featuring 3.7B activated parameters and 40B total parameters, specifically designed to enhance performance on enterprise-oriented tasks while maintaining competitive capabilities on general-purpose tasks. To address the overthinking phenomenon commonly observed in Large Reasoning Models (LRMs), we propose Reflection-aware Adaptive Policy Optimization (RAPO), a novel RL training algorithm that effectively regulates overthinking behaviors. In enterprise-oriented tasks such as retrieval-augmented generation (RAG), complex table understanding, and summarization, Yuan3.0 Flash consistently achieves superior performance. Moreover, it also demonstrates strong reasoning capabilities in domains such as mathematics, science, etc., attaining accuracy comparable to frontier model while requiring only approximately 1/4 to 1/2 of the average tokens. Yuan3.0 Flash has been fully open-sourced to facilitate further research and real-world deployment: this https URL.
zh

[AI-47] RelayGR: Scaling Long-Sequence Generative Recommendation via Cross-Stage Relay-Race Inference

【速读】:该论文旨在解决生成式推荐(Generative Recommendation, GR)模型在实时推荐系统中因尾部延迟(tail-latency)服务等级目标(SLO)限制而无法充分利用长用户行为序列的问题。其核心挑战在于:GR模型虽能通过消费较长用户行为序列提升推荐质量,但受限于排名阶段的P99延迟预算(仅数十毫秒),在线推理时序列长度被严格压缩。解决方案的关键在于提出RelayGR系统,通过预推理用户行为前缀并缓存至高带宽内存(HBM),实现跨流水线阶段的复用,从而避免在关键路径上重复计算。RelayGR创新性地结合三项技术:1)基于序列感知的触发机制,在有限缓存和预推理负载下仅对高风险请求进行预推理;2)基于亲和性的路由策略,将预推理信号与排序请求调度至同一实例以减少远程访问;3)基于内存感知的扩展器,利用服务器本地DRAM捕获跨请求的短期复用,防止冗余加载。实验表明,在固定P99 SLO下,RelayGR支持长达1.5倍的序列长度,并使符合SLO的吞吐量提升达3.6倍。

链接: https://arxiv.org/abs/2601.01712
作者: Jiarui Wang,Huichao Chai,Yuanhang Zhang,Zongjin Zhou,Wei Guo,Xingkun Yang,Qiang Tang,Bo Pan,Jiawei Zhu,Ke Cheng,Yuting Yan,Shulan Wang,Yingjie Zhu,Zhengfan Yuan,Jiaqi Huang,Yuhan Zhang,Xiaosong Sun,Zhinan Zhang,Hong Zhu,Yongsheng Zhang,Tiantian Dong,Zhong Xiao,Deliang Liu,Chengzhou Lu,Yuan Sun,Zhiyuan Chen,Xinming Han,Zaizhu Liu,Yaoyuan Wang,Ziyang Zhang,Yong Liu,Jinxin Xu,Yajing Sun,Zhoujun Yu,Wenting Zhou,Qidong Zhang,Zhengyong Zhang,Zhonghai Gu,Yibo Jin,Yongxiang Feng,Pengfei Zuo
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Real-time recommender systems execute multi-stage cascades (retrieval, pre-processing, fine-grained ranking) under strict tail-latency SLOs, leaving only tens of milliseconds for ranking. Generative recommendation (GR) models can improve quality by consuming long user-behavior sequences, but in production their online sequence length is tightly capped by the ranking-stage P99 budget. We observe that the majority of GR tokens encode user behaviors that are independent of the item candidates, suggesting an opportunity to pre-infer a user-behavior prefix once and reuse it during ranking rather than recomputing it on the critical path. Realizing this idea at industrial scale is non-trivial: the prefix cache must survive across multiple pipeline stages before the final ranking instance is determined, the user population implies cache footprints far beyond a single device, and indiscriminate pre-inference would overload shared resources under high QPS. We present RelayGR, a production system that enables in-HBM relay-race inference for GR. RelayGR selectively pre-infers long-term user prefixes, keeps their KV caches resident in HBM over the request lifecycle, and ensures the subsequent ranking can consume them without remote fetches. RelayGR combines three techniques: 1) a sequence-aware trigger that admits only at-risk requests under a bounded cache footprint and pre-inference load, 2) an affinity-aware router that co-locates cache production and consumption by routing both the auxiliary pre-infer signal and the ranking request to the same instance, and 3) a memory-aware expander that uses server-local DRAM to capture short-term cross-request reuse while avoiding redundant reloads. We implement RelayGR on Huawei Ascend NPUs and evaluate it with real queries. Under a fixed P99 SLO, RelayGR supports up to 1.5 \times longer sequences and improves SLO-compliant throughput by up to 3.6 \times .
zh

[AI-48] Explicit World Models for Reliable Human-Robot Collaboration AAAI-26

【速读】:该论文旨在解决具身人工智能(Embodied AI)在感知噪声、指令模糊性以及人机交互中面临的可靠性问题。传统方法侧重于形式化验证以提升模型的可预测性和鲁棒性,而本文提出不同思路:强调人机交互的动态性、模糊性和主观性,认为可靠性需基于人类目标与期望来定义。解决方案的关键在于构建并持续更新一个可访问的“显式世界模型”(explicit world model),该模型代表人类与AI之间的共同认知基础,用以对齐机器人行为与人类预期,从而实现情境化的可靠性能。

链接: https://arxiv.org/abs/2601.01705
作者: Kenneth Kwok,Basura Fernando,Qianli Xu,Vigneshwaran Subbaraju,Dongkyu Choi,Boon Kiat Quek
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted to AAAI-26 Bridge Program B10: Making Embodied AI Reliable with Testing and Formal Verification

点击查看摘要

Abstract:This paper addresses the topic of robustness under sensing noise, ambiguous instructions, and human-robot interaction. We take a radically different tack to the issue of reliable embodied AI: instead of focusing on formal verification methods aimed at achieving model predictability and robustness, we emphasise the dynamic, ambiguous and subjective nature of human-robot interactions that requires embodied AI systems to perceive, interpret, and respond to human intentions in a manner that is consistent, comprehensible and aligned with human expectations. We argue that when embodied agents operate in human environments that are inherently social, multimodal, and fluid, reliability is contextually determined and only has meaning in relation to the goals and expectations of humans involved in the interaction. This calls for a fundamentally different approach to achieving reliable embodied AI that is centred on building and updating an accessible “explicit world model” representing the common ground between human and AI, that is used to align robot behaviours with human expectations.
zh

[AI-49] Beyond Homophily: Community Search on Heterophilic Graphs

【速读】:该论文旨在解决在异质性图(heterophilic graphs)中进行社区搜索(community search)时,传统算法和基于图神经网络(GNN)的模型难以有效识别高质量社区的问题。具体而言,在异质性图中,边主要连接不相似节点,导致原本反映平滑低频相似性的结构信号变为尖锐高频对比,而现有方法要么返回标签混杂的社区,要么因依赖同质性假设而模糊边界、丢失关键信息。解决方案的关键在于提出一个统一框架AdaptCS,其核心设计包括:(i) AdaptCS编码器分离多跳与多频信号,同时捕捉平滑(同质性)与对比(异质性)关系;(ii) 基于低秩优化的记忆高效策略,显著降低计算瓶颈并提升可扩展性;(iii) 自适应社区评分(Adaptive Community Score, ACS),在线搜索中平衡嵌入相似性和拓扑关系,从而实现高精度与高效率的统一。

链接: https://arxiv.org/abs/2601.01703
作者: Qing Sima,Xiaoyang Wang,Wenjie Zhang
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Community search aims to identify a refined set of nodes that are most relevant to a given query, supporting tasks ranging from fraud detection to recommendation. Unlike homophilic graphs, many real-world networks are heterophilic, where edges predominantly connect dissimilar nodes. Therefore, structural signals that once reflected smooth, low-frequency similarity now appear as sharp, high-frequency contrasts. However, both classical algorithms (e.g., k-core, k-truss) and recent ML-based models struggle to achieve effective community search on heterophilic graphs, where edge signs or semantics are generally unknown. Algorithm-based methods often return communities with mixed class labels, while GNNs, built on homophily, smooth away meaningful signals and blur community boundaries. Therefore, we propose Adaptive Community Search (AdaptCS), a unified framework featuring three key designs: (i) an AdaptCS Encoder that disentangles multi-hop and multi-frequency signals, enabling the model to capture both smooth (homophilic) and contrastive (heterophilic) relations; (ii) a memory-efficient low-rank optimization that removes the main computational bottleneck and ensures model scalability; and (iii) an Adaptive Community Score (ACS) that guides online search by balancing embedding similarity and topological relations. Extensive experiments on both heterophilic and homophilic benchmarks demonstrate that AdaptCS outperforms the best-performing baseline by an average of 11% in F1-score, retains robustness across heterophily levels, and achieves up to 2 orders of magnitude speedup.
zh

[AI-50] Digital Twin-Driven Communication-Efficient Federated Anomaly Detection for Industrial IoT

【速读】:该论文旨在解决工业物联网(IIoT)异常检测中面临的多个关键挑战,包括仅依赖真实传感器数据导致的泛化能力不足、标注数据稀缺、误报率高以及数据隐私问题。为应对这些问题,其核心解决方案是提出一套融合数字孪生(Digital Twin, DT)与联邦学习(Federated Learning, FL)的集成方法(DTFL),通过在分布式训练框架中引入合成数据与物理世界知识的协同机制,实现模型性能提升的同时保障数据隐私和通信效率。其中,关键创新在于五种新方法:基于元学习的数字孪生(DTML)、参数融合(FPF)、分层参数交换(LPE)、循环权重自适应(CWA)及知识蒸馏(DTKD),它们分别设计了不同的机制以平衡模型泛化能力和通信开销,实验表明CWA等方法能在显著减少通信轮次(最多比DTML少62%)的情况下快速达到80%目标准确率,验证了DT-FL框架在IIoT场景下加速收敛并提升实用性方面的有效性。

链接: https://arxiv.org/abs/2601.01701
作者: Mohammed Ayalew Belay,Adil Rasheed,Pierluigi Salvo Rossi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Anomaly detection is increasingly becoming crucial for maintaining the safety, reliability, and efficiency of industrial systems. Recently, with the advent of digital twins and data-driven decision-making, several statistical and machine-learning methods have been proposed. However, these methods face several challenges, such as dependence on only real sensor datasets, limited labeled data, high false alarm rates, and privacy concerns. To address these problems, we propose a suite of digital twin-integrated federated learning (DTFL) methods that enhance global model performance while preserving data privacy and communication efficiency. Specifically, we present five novel approaches: Digital Twin-Based Meta-Learning (DTML), Federated Parameter Fusion (FPF), Layer-wise Parameter Exchange (LPE), Cyclic Weight Adaptation (CWA), and Digital Twin Knowledge Distillation (DTKD). Each method introduces a unique mechanism to combine synthetic and real-world knowledge, balancing generalization with communication overhead. We conduct an extensive experiment using a publicly available cyber-physical anomaly detection dataset. For a target accuracy of 80%, CWA reaches the target in 33 rounds, FPF in 41 rounds, LPE in 48 rounds, and DTML in 87 rounds, whereas the standard FedAvg baseline and DTKD do not reach the target within 100 rounds. These results highlight substantial communication-efficiency gains (up to 62% fewer rounds than DTML and 31% fewer than LPE) and demonstrate that integrating DT knowledge into FL accelerates convergence to operationally meaningful accuracy thresholds for IIoT anomaly detection.
zh

[AI-51] Exposing Hidden Interfaces: LLM -Guided Type Inference for Reverse Engineering macOS Private Frameworks

【速读】:该论文旨在解决私有 macOS 框架(private macOS frameworks)因缺乏文档且仅以剥离符号的二进制形式分发,导致安全分析困难的问题。其核心挑战在于从无结构的二进制文件中自动恢复可编译的接口头文件(header files),从而支持后续的安全研究与漏洞挖掘。解决方案的关键是提出 MOTIF——一个基于代理(agentic)的框架,它融合了工具增强型静态分析与微调后的大型语言模型(fine-tuned large language model),专门用于 Objective-C 类型推断;该模型生成候选方法签名,由代理协调运行时元数据提取、二进制检查和约束验证,最终输出可编译的头文件。实验表明,MOTIF 在 MOTIF-Bench 基准上将签名恢复准确率从 15% 提升至 86%,并显著提升工具使用正确性和推理稳定性,实现了对私有框架的系统性逆向与可审计接口重构。

链接: https://arxiv.org/abs/2601.01673
作者: Arina Kharlamova,Youcheng Sun,Ting Yu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: IEEE SP’26 under review

点击查看摘要

Abstract:Private macOS frameworks underpin critical services and daemons but remain undocumented and distributed only as stripped binaries, complicating security analysis. We present MOTIF, an agentic framework that integrates tool-augmented analysis with a finetuned large language model specialized for Objective-C type inference. The agent manages runtime metadata extraction, binary inspection, and constraint checking, while the model generates candidate method signatures that are validated and refined into compilable headers. On MOTIF-Bench, a benchmark built from public frameworks with groundtruth headers, MOTIF improves signature recovery from 15% to 86% compared to baseline static analysis tooling, with consistent gains in tool-use correctness and inference stability. Case studies on private frameworks show that reconstructed headers compile, link, and facilitate downstream security research and vulnerability studies. By transforming opaque binaries into analyzable interfaces, MOTIF establishes a scalable foundation for systematic auditing of macOS internals.
zh

[AI-52] Adversarial Instance Generation and Robust Training for Neural Combinatorial Optimization with Multiple Objectives

【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)在求解多目标组合优化问题(Multi-Objective Combinatorial Optimization Problems, MOCOPs)时存在的鲁棒性不足问题,尤其是在面对多样化和复杂的问题分布时,现有学习型求解器易出现性能下降。其解决方案的关键在于提出一个统一的面向鲁棒性的框架,包含两个核心机制:一是基于偏好约束的对抗攻击方法,通过生成能暴露求解器弱点的难例来量化攻击对帕累托前沿(Pareto-front)质量的影响;二是将硬度感知的偏好选择引入对抗训练中的防御策略,以减少模型对特定偏好区域的过拟合,从而提升模型在分布外(out-of-distribution)实例上的泛化能力。实验表明,该框架在多目标旅行商问题(MOTSP)、多目标容量车辆路径问题(MOCVRP)和多目标背包问题(MOKP)上均有效增强了DRL求解器的鲁棒性和适应性。

链接: https://arxiv.org/abs/2601.01665
作者: Wei Liu,Yaoxin Wu,Yingqian Zhang,Thomas Bäck,Yingjie Fan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep reinforcement learning (DRL) has shown great promise in addressing multi-objective combinatorial optimization problems (MOCOPs). Nevertheless, the robustness of these learning-based solvers has remained insufficiently explored, especially across diverse and complex problem distributions. In this paper, we propose a unified robustness-oriented framework for preference-conditioned DRL solvers for MOCOPs. Within this framework, we develop a preference-based adversarial attack to generate hard instances that expose solver weaknesses, and quantify the attack impact by the resulting degradation on Pareto-front quality. We further introduce a defense strategy that integrates hardness-aware preference selection into adversarial training to reduce overfitting to restricted preference regions and improve out-of-distribution performance. The experimental results on multi-objective traveling salesman problem (MOTSP), multi-objective capacitated vehicle routing problem (MOCVRP), and multi-objective knapsack problem (MOKP) verify that our attack method successfully learns hard instances for different solvers. Furthermore, our defense method significantly strengthens the robustness and generalizability of neural solvers, delivering superior performance on hard or out-of-distribution instances.
zh

[AI-53] Length-Aware Adversarial Training for Variable-Length Trajectories: Digital Twins for Mall Shopper Paths

【速读】:该论文旨在解决轨迹生成模型在训练过程中因轨迹长度高度异质性导致的mini-batch训练不稳定问题,进而影响轨迹衍生变量的分布匹配效果。其核心解决方案是提出一种**长度感知采样(Length-Aware Sampling, LAS)**策略,通过将轨迹按长度分桶并从单一长度桶中采样构建批次,显著降低批次内长度差异,从而提升梯度更新的一致性和分布匹配性能。LAS无需改变模型结构,仅通过优化采样机制即可改善生成轨迹的统计特性,并在多个真实场景数据集上验证了其有效性。

链接: https://arxiv.org/abs/2601.01663
作者: He Sun,Jiwoong Shin,Ravi Dhar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We study generative modeling of \emphvariable-length trajectories – sequences of visited locations/items with associated timestamps – for downstream simulation and counterfactual analysis. A recurring practical issue is that standard mini-batch training can be unstable when trajectory lengths are highly heterogeneous, which in turn degrades \emphdistribution matching for trajectory-derived statistics. We propose \textbflength-aware sampling (LAS), a simple batching strategy that groups trajectories by length and samples batches from a single length bucket, reducing within-batch length heterogeneity (and making updates more consistent) without changing the model class. We integrate LAS into a conditional trajectory GAN with auxiliary time-alignment losses and provide (i) a distribution-level guarantee for derived variables under mild boundedness assumptions, and (ii) an IPM/Wasserstein mechanism explaining why LAS improves distribution matching by removing length-only shortcut critics and targeting within-bucket discrepancies. Empirically, LAS consistently improves matching of derived-variable distributions on a multi-mall dataset of shopper trajectories and on diverse public sequence datasets (GPS, education, e-commerce, and movies), outperforming random sampling across dataset-specific metrics.
zh

[AI-54] Learning Resilient Elections with Adversarial GNNs

【速读】:该论文旨在解决现有学习型投票规则在现实应用中面临的挑战,特别是如何设计既具备高社会福利又对策略性投票具有鲁棒性的通用投票机制。其关键解决方案在于:通过将选举建模为二分图(bipartite graph),并利用图神经网络(graph neural networks)学习投票规则,同时结合神经网络架构改进与对抗训练策略,显著提升了投票规则的抗策略性行为能力,从而在保持高社会福利的同时增强了系统在复杂场景下的稳定性与实用性。

链接: https://arxiv.org/abs/2601.01653
作者: Hao Xiang Li,Yash Shah,Lorenzo Giusti
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:In the face of adverse motives, it is indispensable to achieve a consensus. Elections have been the canonical way by which modern democracy has operated since the 17th century. Nowadays, they regulate markets, provide an engine for modern recommender systems or peer-to-peer networks, and remain the main approach to represent democracy. However, a desirable universal voting rule that satisfies all hypothetical scenarios is still a challenging topic, and the design of these systems is at the forefront of mechanism design research. Automated mechanism design is a promising approach, and recent works have demonstrated that set-invariant architectures are uniquely suited to modelling electoral systems. However, various concerns prevent the direct application to real-world settings, such as robustness to strategic voting. In this paper, we generalise the expressive capability of learned voting rules, and combine improvements in neural network architecture with adversarial training to improve the resilience of voting rules while maximizing social welfare. We evaluate the effectiveness of our methods on both synthetic and real-world datasets. Our method resolves critical limitations of prior work regarding learning voting rules by representing elections using bipartite graphs, and learning such voting rules using graph neural networks. We believe this opens new frontiers for applying machine learning to real-world elections.
zh

[AI-55] Structured Decomposition for LLM Reasoning : Cross-Domain Validation and Semantic Web Integration

【速读】:该论文旨在解决在自然语言输入上进行规则推理时面临的挑战:即如何在保证决策可审计性和可解释性的同时,实现对非结构化文本的灵活处理。传统符号系统虽能提供形式化保障,但依赖结构化输入;而大语言模型(Large Language Models, LLMs)虽具 interpretive flexibility,却难以确保规则应用的一致性。解决方案的关键在于构建一种融合架构:利用LLMs作为本体(ontology)填充引擎,根据专家编写的TBox规范将非结构化文本转化为ABox断言,再由基于SWRL(Semantic Web Rule Language)的推理机执行具有确定性保障的规则验证。该框架通过实体识别、断言提取与符号验证的三阶段分解,使整个推理过程既具备灵活性又满足形式化约束,实验表明其在法律传闻证据判定、科学方法任务匹配和临床试验入组筛选三个领域均显著优于零样本提示(few-shot prompting),且符号验证模块带来的收益不可替代。

链接: https://arxiv.org/abs/2601.01609
作者: Albert Sadowski,Jarosław A. Chudziak
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Rule-based reasoning over natural language input arises in domains where decisions must be auditable and justifiable: clinical protocols specify eligibility criteria in prose, evidence rules define admissibility through textual conditions, and scientific standards dictate methodological requirements. Applying rules to such inputs demands both interpretive flexibility and formal guarantees. Large language models (LLMs) provide flexibility but cannot ensure consistent rule application; symbolic systems provide guarantees but require structured input. This paper presents an integration pattern that combines these strengths: LLMs serve as ontology population engines, translating unstructured text into ABox assertions according to expert-authored TBox specifications, while SWRL-based reasoners apply rules with deterministic guarantees. The framework decomposes reasoning into entity identification, assertion extraction, and symbolic verification, with task definitions grounded in OWL 2 ontologies. Experiments across three domains (legal hearsay determination, scientific method-task application, clinical trial eligibility) and eleven language models validate the approach. Structured decomposition achieves statistically significant improvements over few-shot prompting in aggregate, with gains observed across all three domains. An ablation study confirms that symbolic verification provides substantial benefit beyond structured prompting alone. The populated ABox integrates with standard semantic web tooling for inspection and querying, positioning the framework for richer inference patterns that simpler formalisms cannot express.
zh

[AI-56] REE-TTT: Highly Adaptive Radar Echo Extrapolation Based on Test-Time Training

【速读】:该论文旨在解决雷达回波外推(Radar Echo Extrapolation, REE)模型在跨区域极端降水场景下泛化能力差的问题,其根源在于传统深度学习方法对高质量本地训练数据和静态模型参数的依赖。解决方案的关键在于提出一种融合自适应测试时训练(Test-Time Training, TTT)机制的新型模型REE-TTT,其核心创新是设计了时空测试时训练(Spatio-temporal Test-Time Training, ST-TTT)模块,用任务特定的注意力机制替代原有TTT层中的线性投影,从而实现对非平稳气象分布的鲁棒适应,显著提升降水特征表示能力与预测准确性。

链接: https://arxiv.org/abs/2601.01605
作者: Xin Di,Xinglin Piao,Fei Wang,Guodong Jing,Yong Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Precipitation nowcasting is critically important for meteorological forecasting. Deep learning-based Radar Echo Extrapolation (REE) has become a predominant nowcasting approach, yet it suffers from poor generalization due to its reliance on high-quality local training data and static model parameters, limiting its applicability across diverse regions and extreme events. To overcome this, we propose REE-TTT, a novel model that incorporates an adaptive Test-Time Training (TTT) mechanism. The core of our model lies in the newly designed Spatio-temporal Test-Time Training (ST-TTT) block, which replaces the standard linear projections in TTT layers with task-specific attention mechanisms, enabling robust adaptation to non-stationary meteorological distributions and thereby significantly enhancing the feature representation of precipitation. Experiments under cross-regional extreme precipitation scenarios demonstrate that REE-TTT substantially outperforms state-of-the-art baseline models in prediction accuracy and generalization, exhibiting remarkable adaptability to data distribution shifts.
zh

[AI-57] CONSENT: A Negotiation Framework for Leverag ing User Flexibility in Vehicle-to-Building Charging under Uncertainty AAMAS2026

【速读】:该论文旨在解决电动汽车(Electric Vehicles, EVs)在车到建筑(Vehicle-to-Building, V2B)场景下,建筑运营方因无序充电导致能源成本上升与EV用户追求便利性和满电状态之间的冲突问题。解决方案的关键在于提出一种基于协商的框架,通过设计保障自愿参与、策略稳健性(strategy-proofness)和预算可行性,将EV充电转化为一种可策略性利用的资源:向用户提供一系列基于激励的选项,以换取其在离站时间或所需电量(State of Charge, SoC)上的适度灵活性。该框架经由用户调研数据校准,并通过商业建筑和电动车制造商的真实运行数据验证,仿真结果表明其能实现双赢——使建筑运营方成本降低超过3.5%,同时用户充电费用低于电网零售电价22%。

链接: https://arxiv.org/abs/2601.01581
作者: Rishav Sen,Fangqi Liu,Jose Paolo Talusan,Ava Pettet,Yoshinori Suzue,Mark Bailey,Ayan Mukhopadhyay,Abhishek Dubey
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Systems and Control (eess.SY)
备注: Submitted to AAMAS 2026. 25 pages, 13 figures, 14 tables

点击查看摘要

Abstract:The growth of Electric Vehicles (EVs) creates a conflict in vehicle-to-building (V2B) settings between building operators, who face high energy costs from uncoordinated charging, and drivers, who prioritize convenience and a full charge. To resolve this, we propose a negotiation-based framework that, by design, guarantees voluntary participation, strategy-proofness, and budget feasibility. It transforms EV charging into a strategic resource by offering drivers a range of incentive-backed options for modest flexibility in their departure time or requested state of charge (SoC). Our framework is calibrated with user survey data and validated using real operational data from a commercial building and an EV manufacturer. Simulations show that our negotiation protocol creates a mutually beneficial outcome: lowering the building operator’s costs by over 3.5% compared to an optimized, non-negotiating smart charging policy, while simultaneously reducing user charging expenses by 22% below the utility’s retail energy rate. By aligning operator and EV user objectives, our framework provides a strategic bridge between energy and mobility systems, transforming EV charging from a source of operational friction into a platform for collaboration and shared savings.
zh

[AI-58] he Two-Stage Decision-Sampling Hypothesis: Understanding the Emergence of Self-Reflection in RL-Trained LLM s

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在强化学习(Reinforcement Learning, RL)后训练中涌现出自我反思能力的机制问题,特别是如何通过统一优化目标实现生成解决方案(\pi_sample)与判断是否需要修正(\pi_d)这两种功能迥异的能力。其解决方案的关键在于提出梯度归因属性(Gradient Attribution Property),并基于两阶段决策采样(Two-Stage Decision-Sampling, DS)假设将策略解耦为采样模块(\pi_sample)和决策模块(\pi_d)。理论证明表明,代理奖励(surrogate rewards)具有平衡梯度归因特性,而监督微调(Supervised Fine-Tuning, SFT)与KL惩罚项则呈现不平衡梯度归因,后者因长度加权导致不对称正则化,约束了\pi_sample但使\pi_d欠优化,从而从机制上解释了为何RL优于SFT;实验进一步验证了RL优势主要源于对\pi_d的改进,而非\pi_sample,提供了对思维模型自修正能力的第一性原理机制解释。

链接: https://arxiv.org/abs/2601.01580
作者: Zibo Zhao(1),Yuanting Zha(2),Haipeng Zhang(2),Xingcheng Xu(3) ((1) Arizona State University, (2) ShanghaiTech University, (3) Shanghai Artificial Intelligence Laboratory)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Self-reflection capabilities emerge in Large Language Models after RL post-training, with multi-turn RL achieving substantial gains over SFT counterparts. Yet the mechanism of how a unified optimization objective gives rise to functionally distinct capabilities of generating solutions and evaluating when to revise them remains opaque. To address this question, we introduce the Gradient Attribution Property to characterize how reward gradients distribute across policy components, formalized through the Two-Stage Decision-Sampling (DS) Hypothesis, which decomposes the policy into sampling ( \pi_sample ) for generation and decision ( \pi_d ) for verification. We prove that surrogate rewards exhibit Balanced Gradient Attribution, while SFT and KL penalties exhibit Unbalanced Gradient Attribution, with length-weighting creating asymmetric regularization that constrains \pi_sample while leaving \pi_d under-optimized, providing an theoretical explanation of why RL succeeds where SFT fails. We also empirically validate our theoretical predictions on arithmetic reasoning demonstrates that RL’s superior generalization stems primarily from improved decision-making ( \pi_d ) rather than sampling capabilities, providing a first-principles mechanistic explanation for self-correction in thinking models.
zh

[AI-59] HanoiWorld : A Joint Embedding Predictive Architecture BasedWorld Model for Autonomous Vehicle Controller

【速读】:该论文旨在解决当前强化学习(Reinforcement Learning, RL)在自动驾驶控制器中数据需求高、性能不稳定、难以确保安全性以及过度关注噪声特征的问题。其解决方案的关键在于引入基于联合嵌入预测架构(Joint Embedding Predictive Architecture, JEPA)的世界模型——Hanoi-World,该模型利用循环神经网络(Recurrent Neural Network, RNN)实现长期横向规划,并在推理阶段具备高效性,从而在高速公路环境模拟中展现出更强的安全意识与更优的驾驶决策能力,相较当前最优基线方法显著降低碰撞率。

链接: https://arxiv.org/abs/2601.01577
作者: Tran Tien Dat,Nguyen Hai An,Nguyen Khanh Viet Dung,Nguyen Duy Duc
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current attempts of Reinforcement Learning for Autonomous Controller are data-demanding while the results are under-performed, unstable, and unable to grasp and anchor on the concept of safety, and over-concentrating on noise features due to the nature of pixel reconstruction. While current Self-Supervised Learningapproachs that learning on high-dimensional representations by leveraging the JointEmbedding Predictive Architecture (JEPA) are interesting and an effective alternative, as the idea mimics the natural ability of the human brain in acquiring new skill usingimagination and minimal samples of observations. This study introduces Hanoi-World, a JEPA-based world model that using recurrent neural network (RNN) formaking longterm horizontal planning with effective inference time. Experimentsconducted on the Highway-Env package with difference enviroment showcase the effective capability of making a driving plan while safety-awareness, with considerablecollision rate in comparison with SOTA baselines
zh

[AI-60] CaveAgent : Transforming LLM s into Stateful Runtime Operators

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的智能体(agent)在执行复杂任务时受限于文本中心范式的问题,特别是传统基于JSON的函数调用机制在长程任务中因多轮依赖脆弱性和上下文漂移(context drift)导致性能下降。其核心解决方案是提出CaveAgent框架,通过引入双流上下文架构(Dual-stream Context Architecture),将状态管理解耦为轻量级语义流用于推理和持久化、确定性的Python运行时流用于执行;同时创新性地实现有状态运行时管理(Stateful Runtime Management),支持跨轮次注入、操作和检索复杂Python对象(如DataFrames、数据库连接),从而构建高保真外部记忆机制,有效消除上下文漂移与灾难性遗忘,并保障数据无损传递至下游应用。此设计显著提升了任务成功率与效率,尤其在零售和数据密集型任务中表现突出。

链接: https://arxiv.org/abs/2601.01569
作者: Maohao Ran,Zhenglin Wan,Cooper Lin,Yanting Zhang,Hongyu Xin,Hongwei Fan,Yibo Xu,Beier Luo,Yaxin Zhou,Wangbo Zhao,Lijie Yang,Lang Feng,Fuchao Yang,Jingxuan Wu,Yiqiao Huang,Chendong Ma,Dailing Jiang,Jianbo Deng,Sihui Han,Bo An,Yike Guo,Jun Song
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 32 pages, 14 Figures

点击查看摘要

Abstract:LLM-based agents are increasingly capable of complex task execution, yet current agentic systems remain constrained by text-centric paradigms. Traditional approaches rely on procedural JSON-based function calling, which often struggles with long-horizon tasks due to fragile multi-turn dependencies and context drift. In this paper, we present CaveAgent, a framework that transforms the paradigm from “LLM-as-Text-Generator” to “LLM-as-Runtime-Operator.” We introduce a Dual-stream Context Architecture that decouples state management into a lightweight semantic stream for reasoning and a persistent, deterministic Python Runtime stream for execution. In addition to leveraging code generation to efficiently resolve interdependent sub-tasks (e.g., loops, conditionals) in a single step, we introduce \textitStateful Runtime Management in CaveAgent. Distinct from existing code-based approaches that remain text-bound and lack the support for external object injection and retrieval, CaveAgent injects, manipulates, and retrieves complex Python objects (e.g., DataFrames, database connections) that persist across turns. This persistence mechanism acts as a high-fidelity external memory to eliminate context drift, avoid catastrophic forgetting, while ensuring that processed data flows losslessly to downstream applications. Comprehensive evaluations on Tau ^2 -bench, BFCL and various case studies across representative SOTA LLMs demonstrate CaveAgent’s superiority. Specifically, our framework achieves a 10.5% success rate improvement on retail tasks and reduces total token consumption by 28.4% in multi-turn scenarios. On data-intensive tasks, direct variable storage and retrieval reduces token consumption by 59%, allowing CaveAgent to handle large-scale data that causes context overflow failures in both JSON-based and Code-based agents.
zh

[AI-61] Logics-STEM: Empowering LLM Reasoning via Failure-Driven Post-Training and Document Knowledge Enhancement

【速读】:该论文旨在解决生成式 AI (Generative AI) 在科学、技术、工程和数学(STEM)领域推理能力不足的问题,尤其是如何通过高质量数据与算法协同设计提升模型的长链式推理(long chain-of-thought)性能。解决方案的关键在于提出了一种数据-算法协同设计引擎:在数据层面,构建了包含1000万条样本的Logics-STEM-SFT-Dataset,其通过五阶段数据清洗与优化流程(标注、去重、去污染、蒸馏与分层采样)确保数据的质量、多样性和可扩展性;在算法层面,引入基于失败驱动的后训练框架,在监督微调(SFT)阶段针对模型错误区域进行目标知识检索与数据合成,从而有效引导第二阶段SFT或强化学习(RL)更贴近理想推理分布。该方法使Logics-STEM在8B规模下相较次优模型平均提升4.68%,验证了大规模开源数据与精心设计合成数据结合的有效性,凸显了数据-算法协同设计对增强推理能力的核心作用。

链接: https://arxiv.org/abs/2601.01562
作者: Mingyu Xu,Cheng Fang,Keyue Jiang,Yuqian Zheng,Yanghua Xiao,Baojian Zhou,Qifang Zhao,Suhang Zheng,Xiuwen Zhu,Jiyang Tang,Yongchi Zhao,Yijia Luo,Zhiqi Bai,Yuchi Xu,Wenbo Su,Wei Wang,Bing Zhao,Lin Qu,Xiaoxiao Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present Logics-STEM, a state-of-the-art reasoning model fine-tuned on Logics-STEM-SFT-Dataset, a high-quality and diverse dataset at 10M scale that represents one of the largest-scale open-source long chain-of-thought corpora. Logics-STEM targets reasoning tasks in the domains of Science, Technology, Engineering, and Mathematics (STEM), and exhibits exceptional performance on STEM-related benchmarks with an average improvement of 4.68% over the next-best model at 8B scale. We attribute the gains to our data-algorithm co-design engine, where they are jointly optimized to fit a gold-standard distribution behind reasoning. Data-wise, the Logics-STEM-SFT-Dataset is constructed from a meticulously designed data curation engine with 5 stages to ensure the quality, diversity, and scalability, including annotation, deduplication, decontamination, distillation, and stratified sampling. Algorithm-wise, our failure-driven post-training framework leverages targeted knowledge retrieval and data synthesis around model failure regions in the Supervised Fine-tuning (SFT) stage to effectively guide the second-stage SFT or the reinforcement learning (RL) for better fitting the target distribution. The superior empirical performance of Logics-STEM reveals the vast potential of combining large-scale open-source data with carefully designed synthetic data, underscoring the critical role of data-algorithm co-design in enhancing reasoning capabilities through post-training. We make both the Logics-STEM models (8B and 32B) and the Logics-STEM-SFT-Dataset (10M and downsampled 2.2M versions) publicly available to support future research in the open-source community.
zh

[AI-62] Utilizing Earth Foundation Models to Enhance the Simulation Performance of Hydrological Models with AlphaEarth Embeddings

【速读】:该论文试图解决在缺乏实测流量数据的流域中准确预测河流流量的问题,其核心挑战在于不同流域对气候、地形、植被和土壤等环境因子的响应差异复杂且难以用传统流域属性全面刻画。解决方案的关键在于使用AlphaEarth Foundation嵌入(AlphaEarth Foundation embeddings),这些嵌入是通过大规模卫星图像学习得到的非专家设计表征,能够有效捕捉植被模式、地表特性及长期环境动态等信息,从而更精准地描述流域特征。实验表明,基于此类嵌入构建的模型在未参与训练的流域上表现出更高预测精度,且通过嵌入相似性选择合适的“源流域”可显著提升无资料流域的预测性能,证明了卫星驱动的环境表征在水文预测中的优越性和适应性。

链接: https://arxiv.org/abs/2601.01558
作者: Pengfei Qu,Wenyu Ouyang,Chi Zhang,Yikai Chai,Shuolong Xu,Lei Ye,Yongri Piao,Miao Zhang,Huchuan Lu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 11 figures

点击查看摘要

Abstract:Predicting river flow in places without streamflow records is challenging because basins respond differently to climate, terrain, vegetation, and soils. Traditional basin attributes describe some of these differences, but they cannot fully represent the complexity of natural environments. This study examines whether AlphaEarth Foundation embeddings, which are learned from large collections of satellite images rather than designed by experts, offer a more informative way to describe basin characteristics. These embeddings summarize patterns in vegetation, land surface properties, and long-term environmental dynamics. We find that models using them achieve higher accuracy when predicting flows in basins not used for training, suggesting that they capture key physical differences more effectively than traditional attributes. We further investigate how selecting appropriate donor basins influences prediction in ungauged regions. Similarity based on the embeddings helps identify basins with comparable environmental and hydrological behavior, improving performance, whereas adding many dissimilar basins can reduce accuracy. The results show that satellite-informed environmental representations can strengthen hydrological forecasting and support the development of models that adapt more easily to different landscapes.
zh

[AI-63] MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization

【速读】:该论文旨在解决现有Speaker-Attributed, Time-Stamped Transcription (SATS) 系统在端到端建模、长时上下文依赖和精确时间戳输出方面的局限性,这些问题限制了其在会议转录等场景中的应用效果。解决方案的关键在于提出MOSS Transcribe Diarize,一个统一的多模态大语言模型,通过端到端的方式联合执行SATS任务,并利用128k上下文窗口(支持长达90分钟输入)增强长距离说话人记忆能力,从而实现更准确的语音内容识别与时间戳定位。

链接: https://arxiv.org/abs/2601.01554
作者: Donghua Yu,Zhengyuan Lin,Chen Yang,Yiyang Zhang,Zhaoye Fei,Hanfu Chen,Jingqi Chen,Ke Chen,Qinyuan Cheng,Liwei Fan,Yi Jiang,Jie Zhu,Muchen Li,Shimin Li,Wenxuan Wang,Yang Wang,Zhe Xu,Yitian Gong,Yuqian Zhang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Speaker-Attributed, Time-Stamped Transcription (SATS) aims to transcribe what is said and to precisely determine the timing of each speaker, which is particularly valuable for meeting transcription. Existing SATS systems rarely adopt an end-to-end formulation and are further constrained by limited context windows, weak long-range speaker memory, and the inability to output timestamps. To address these limitations, we present MOSS Transcribe Diarize, a unified multimodal large language model that jointly performs Speaker-Attributed, Time-Stamped Transcription in an end-to-end paradigm. Trained on extensive real wild data and equipped with a 128k context window for up to 90-minute inputs, MOSS Transcribe Diarize scales well and generalizes robustly. Across comprehensive evaluations, it outperforms state-of-the-art commercial systems on multiple public and in-house benchmarks.
zh

[AI-64] Improving Behavioral Alignment in LLM Social Simulations via Context Formation and Navigation

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂决策环境中与人类行为存在系统性偏差的问题,尤其是在需要参与者预测他人行为并基于观察形成信念的情境下。其核心挑战在于现有LLMs在模拟人类社会行为时缺乏对任务上下文的精准建模和推理路径的有效引导。解决方案的关键在于提出一个两阶段框架:第一阶段为“情境构建”(context formation),通过显式定义实验设计以准确表征决策任务及其背景;第二阶段为“情境导航”(context navigation),在该表征基础上引导模型进行推理以生成符合人类行为模式的决策。实证结果表明,复杂决策环境(如质量信号博弈、众筹博弈)需同时启用两个阶段才能实现行为对齐,而简单任务(如需求估计)仅需情境构建即可,从而明确了各阶段的应用边界并提供了一种系统化设计与诊断LLM社会模拟的方法。

链接: https://arxiv.org/abs/2601.01546
作者: Letian Kong,Qianran(Jenny)Jin,Renyu Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 39 pages, 2 figures, 3 tables

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used to simulate human behavior in experimental settings, but they systematically diverge from human decisions in complex decision-making environments, where participants must anticipate others’ actions and form beliefs based on observed behavior. We propose a two-stage framework for improving behavioral alignment. The first stage, context formation, explicitly specifies the experimental design to establish an accurate representation of the decision task and its context. The second stage, context navigation, guides the reasoning process within that representation to make decisions. We validate this framework through a focal replication of a sequential purchasing game with quality signaling (Kremer and Debo, 2016), extending to a crowdfunding game with costly signaling (Cason et al., 2025) and a demand-estimation task (Gui and Toubia, 2025) to test generalizability across decision environments. Across four state-of-the-art (SOTA) models (GPT-4o, GPT-5, Claude-4.0-Sonnet-Thinking, DeepSeek-R1), we find that complex decision-making environments require both stages to achieve behavioral alignment with human benchmarks, whereas the simpler demand-estimation task requires only context formation. Our findings clarify when each stage is necessary and provide a systematic approach for designing and diagnosing LLM social simulations as complements to human subjects in behavioral research.
zh

[AI-65] Reading Between the Lines: Deconfounding Causal Estimates using Text Embeddings and Deep Learning

【速读】:该论文旨在解决观测性研究中因未观测混杂变量(unobserved confounders)导致的选择偏差问题,尤其是在传统计量经济学方法难以处理与结构化协变量正交的高维非结构化文本数据时。其解决方案的关键在于提出一种神经网络增强的双重机器学习(Neural Network-Enhanced Double Machine Learning, DML)框架,利用文本嵌入(text embeddings)作为潜在混杂因素的代理变量,从而满足无混淆性假设(unconfoundedness assumption)。研究表明,标准基于树的DML估计器因无法刻画嵌入流形的连续拓扑结构而存在显著偏差(+24%),而所提出的深度学习架构通过优化模型设计将偏差降至-0.86%,有效恢复了真实的因果参数。

链接: https://arxiv.org/abs/2601.01511
作者: Ahmed Dawoud,Osama El-Shamy
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Estimating causal treatment effects in observational settings is frequently compromised by selection bias arising from unobserved confounders. While traditional econometric methods struggle when these confounders are orthogonal to structured covariates, high-dimensional unstructured text often contains rich proxies for these latent variables. This study proposes a Neural Network-Enhanced Double Machine Learning (DML) framework designed to leverage text embeddings for causal identification. Using a rigorous synthetic benchmark, we demonstrate that unstructured text embeddings capture critical confounding information that is absent from structured tabular data. However, we show that standard tree-based DML estimators retain substantial bias (+24%) due to their inability to model the continuous topology of embedding manifolds. In contrast, our deep learning approach reduces bias to -0.86% with optimized architectures, effectively recovering the ground-truth causal parameter. These findings suggest that deep learning architectures are essential for satisfying the unconfoundedness assumption when conditioning on high-dimensional natural language data
zh

[AI-66] he Optimal Sample Complexity of Linear Contracts

【速读】:该论文旨在解决在离线设定下,从数据中学习最优线性合同(linear contract)的问题,其中代理类型(agent type)来自未知分布,委托人的目标是设计一个最大化其期望效用的合同。解决方案的关键在于提出并分析了简单的经验效用最大化(Empirical Utility Maximization, EUM)算法,证明其在仅需 $ O(\ln(1/\delta) / \varepsilon^2) $ 个样本的情况下,以至少 $ 1-\delta $ 的概率实现对最优线性合同的 ε\varepsilon-近似。这一结果达到了 Duetting 等人 [2025] 给出的下界(至多常数因子),从而证明了样本复杂度的最优性。核心创新在于利用线性合同的一个结构性质:其期望奖励是非递减的(尽管效用函数本身非单调且不连续),这一性质使得可以构造精细的网(fine-grained nets)用于链式论证(chaining argument),进而获得最优样本复杂度,并进一步建立了统一收敛性(uniform convergence)的更强保证。

链接: https://arxiv.org/abs/2601.01496
作者: Mikael Møller Høgsgaard
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In this paper, we settle the problem of learning optimal linear contracts from data in the offline setting, where agent types are drawn from an unknown distribution and the principal’s goal is to design a contract that maximizes her expected utility. Specifically, our analysis shows that the simple Empirical Utility Maximization (EUM) algorithm yields an \varepsilon -approximation of the optimal linear contract with probability at least 1-\delta , using just O(\ln(1/\delta) / \varepsilon^2) samples. This result improves upon previously known bounds and matches a lower bound from Duetting et al. [2025] up to constant factors, thereby proving its optimality. Our analysis uses a chaining argument, where the key insight is to leverage a simple structural property of linear contracts: their expected reward is non-decreasing. This property, which holds even though the utility function itself is non-monotone and discontinuous, enables the construction of fine-grained nets required for the chaining argument, which in turn yields the optimal sample complexity. Furthermore, our proof establishes the stronger guarantee of uniform convergence: the empirical utility of every linear contract is a \varepsilon -approximation of its true expectation with probability at least 1-\delta , using the same optimal O(\ln(1/\delta) / \varepsilon^2) sample complexity.
zh

[AI-67] Accelerating Storag e-Based Training for Graph Neural Networks KDD

【速读】:该论文旨在解决存储驱动的图神经网络(Graph Neural Networks, GNNs)训练中因大量小规模存储输入/输出(I/O)操作而导致的数据准备瓶颈问题。现有方法忽视了高频率小I/O对高性能存储设备带宽利用率的严重限制,从而制约了大规模GNN训练的效率。解决方案的关键在于提出名为AGNES的新框架,其核心创新包括:(1) 采用块级存储I/O处理策略(block-wise storage I/O processing),以充分利用高性能存储设备的I/O带宽;(2) 引入基于真实图结构特性的超批处理策略(hyperbatch-based processing),进一步提升每次存储I/O操作的计算效率。实验表明,AGNES在五个真实世界图数据集上显著优于四种主流方法,最快可达最佳对比方法的4.1倍加速比。

链接: https://arxiv.org/abs/2601.01473
作者: Myung-Hwan Jang,Jeong-Min Park,Yunyong Ko,Sang-Wook Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: 10 pages, 12 figures, 2 tables, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) 2026

点击查看摘要

Abstract:Graph neural networks (GNNs) have achieved breakthroughs in various real-world downstream tasks due to their powerful expressiveness. As the scale of real-world graphs has been continuously growing, \textita storage-based approach to GNN training has been studied, which leverages external storage (e.g., NVMe SSDs) to handle such web-scale graphs on a single machine. Although such storage-based GNN training methods have shown promising potential in large-scale GNN training, we observed that they suffer from a severe bottleneck in data preparation since they overlook a critical challenge: \textithow to handle a large number of small storage I/Os. To address the challenge, in this paper, we propose a novel storage-based GNN training framework, named \textsfAGNES, that employs a method of \textitblock-wise storage I/O processing to fully utilize the I/O bandwidth of high-performance storage devices. Moreover, to further enhance the efficiency of each storage I/O, \textsfAGNES employs a simple yet effective strategy, \textithyperbatch-based processing based on the characteristics of real-world graphs. Comprehensive experiments on five real-world graphs reveal that \textsfAGNES consistently outperforms four state-of-the-art methods, by up to 4.1 \times faster than the best competitor. Our code is available at this https URL.
zh

[AI-68] A construction of an optimal base for conditional attribute and attributional condition implications in triadic contexts

【速读】:该论文旨在解决三元关系(triadic contexts)中条件属性(conditional attribute)与属性条件(attributional condition)蕴含关系的最优基底构建问题。其解决方案的关键在于基于Ganter和Obiedkov提出的蕴含形式,设计并构造一个在逻辑上完备且最小化的蕴含基底,从而实现对三元数据中复杂依赖关系的有效表达与压缩。

链接: https://arxiv.org/abs/2601.01467
作者: Romuald Kwessy Mouona,Blaise Blériot Koguep Njionou,Etienne Romuald Temgoua Alomo,Rokia Missaoui,Leonard Kwuida
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 26 pages

点击查看摘要

Abstract:This article studies implications in triadic contexts. Specifically, we focus on those introduced by Ganter and Obiedkov, namely conditional attribute and attributional condition implications. Our aim is to construct an optimal base for these implications.
zh

[AI-69] Bayesian Subspace Gradient Estimation for Zeroth-Order Optimization of Large Language Models

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)微调过程中内存消耗过高的问题,现有基于零阶(zeroth-order, ZO)优化的方法虽能通过函数评估近似梯度以降低内存占用,但其依赖单步随机扰动的梯度估计导致收敛效率受限。解决方案的关键在于提出贝叶斯子空间零阶优化(Bayesian Subspace Zeroth-Order optimization, BSZO),该方法利用卡尔曼滤波(Kalman filtering)将多方向有限差分信息融合,将每次有限差分测量视为噪声观测,构建投影梯度的后验分布,并通过贝叶斯推断进行更新;同时引入基于残差的自适应机制调整扰动尺度,理论上可提升收敛速率至标准ZO方法的 $ k/\gamma $ 倍,在RoBERTa、Mistral和OPT等模型上验证了其优越性,相较MeZO、MeZO-Adam及HiZOO实现平均高达6.67%的性能提升,且内存开销接近仅推理基线(1.00×–1.08× MeZO)。

链接: https://arxiv.org/abs/2601.01452
作者: Jian Feng,Zhihong Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages, 1 figures, 4 tables

点击查看摘要

Abstract:Fine-tuning large language models (LLMs) with zeroth-order (ZO) optimization reduces memory by approximating gradients through function evaluations, but existing methods rely on one-step gradient estimates from random perturbations. We introduce Bayesian Subspace Zeroth-Order optimization (BSZO), a ZO optimizer that applies Kalman filtering to combine finite-difference information across multiple perturbation directions. By treating each finite-difference measurement as a noisy observation, BSZO builds a posterior distribution over the projected gradient and updates it through Bayesian inference, with a residual-based adaptive mechanism to adjust perturbation scales. Theoretical analysis shows that BSZO improves the convergence rate by a factor of k/\gamma compared to standard ZO methods. Experiments on RoBERTa, Mistral, and OPT models show that BSZO outperforms MeZO, MeZO-Adam, and HiZOO across various tasks, achieving up to 6.67% absolute average improvement on OPT-13B while keeping memory usage close to inference-only baselines (1.00 \times --1.08 \times of MeZO).
zh

[AI-70] Online Estimation and Manipulation of Articulated Objects DATE

【速读】:该论文旨在解决服务机器人在执行家务任务时,如何自主识别并操控任意刚性连接物体(articulated objects)的问题。传统方法要么依赖于视觉预训练模型预测物体的可操作性(affordance),要么需要机器人已具备操控能力才能通过运动观测估计其结构参数,存在先验知识不足或交互依赖性强的局限。解决方案的关键在于提出一种基于因子图(factor graph)的在线估计框架,融合从视觉中学习到的先验知识与操作过程中的本体感知(proprioceptive sensing)数据,构建基于螺旋理论(Screw Theory)的解析化运动学模型,从而实现接触前的初步预测和接触后的快速修正。该方法使机器人能够在未见过的抽屉等对象上实现闭环估计与自主开启,实验证明其在真实硬件平台上对未知刚性连接物体的自主开启成功率可达75%。

链接: https://arxiv.org/abs/2601.01438
作者: Russell Buchanan,Adrian Röfer,João Moura,Abhinav Valada,Sethu Vijayakumar
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this article is published in Autonomous Robots, and is available online at [Link will be updated when available]

点击查看摘要

Abstract:From refrigerators to kitchen drawers, humans interact with articulated objects effortlessly every day while completing household chores. For automating these tasks, service robots must be capable of manipulating arbitrary articulated objects. Recent deep learning methods have been shown to predict valuable priors on the affordance of articulated objects from vision. In contrast, many other works estimate object articulations by observing the articulation motion, but this requires the robot to already be capable of manipulating the object. In this article, we propose a novel approach combining these methods by using a factor graph for online estimation of articulation which fuses learned visual priors and proprioceptive sensing during interaction into an analytical model of articulation based on Screw Theory. With our method, a robotic system makes an initial prediction of articulation from vision before touching the object, and then quickly updates the estimate from kinematic and force sensing during manipulation. We evaluate our method extensively in both simulations and real-world robotic manipulation experiments. We demonstrate several closed-loop estimation and manipulation experiments in which the robot was capable of opening previously unseen drawers. In real hardware experiments, the robot achieved a 75% success rate for autonomous opening of unknown articulated objects.
zh

[AI-71] Reliable Grid Forecasting: State Space Models for Safety-Critical Energy Systems

【速读】:该论文旨在解决电力系统中负荷预测的操作安全性问题,即传统对称误差指标(如MAPE)无法准确反映电网运行中的风险差异——低估负荷可能导致供电短缺,而高估则可能造成资源浪费。其核心挑战在于如何构建一个能直接衡量电网运行风险的评估框架,并在此基础上优化模型性能。解决方案的关键在于提出一套面向电网特性的评估指标体系,包括不对称MAPE、欠预测率(Under-Prediction Rate)和储备裕度(Reserve Margin),并基于此对基于Mamba架构的状态空间模型进行系统性评估。实验表明,仅依赖标准精度指标会误导决策,而引入天气感知建模(温度与误差显著相关,r = 0.16, p < 10⁻¹⁶)可提升预测可靠性;其中S-Mamba模型在99.5%分位数尾部风险下的储备裕度最低(14.12%),显著优于iTransformer(16.66%),验证了其在保障电网安全运行方面的优越性。

链接: https://arxiv.org/abs/2601.01410
作者: Jisoo Lee,Sunki Hong
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 24 pages, 8 figures, 8 tables

点击查看摘要

Abstract:Accurate grid load forecasting is safety-critical: under-predictions risk supply shortfalls, while symmetric error metrics mask this operational asymmetry. We introduce a grid-specific evaluation framework–Asymmetric MAPE, Under-Prediction Rate, and Reserve Margin–that directly measures operational risk rather than statistical accuracy alone. Using this framework, we conduct a systematic evaluation of Mamba-based State Space Models for California grid forecasting on a weather-aligned CAISO TAC-area dataset spanning Nov 2023–Nov 2025 (84,498 hourly records across 5 transmission areas). Our analysis reveals that standard accuracy metrics are poor proxies for operational safety: models with identical MAPE can require vastly different reserve margins. We demonstrate that forecast errors are weakly but significantly associated with temperature (r = 0.16, p 10^-16), motivating weather-aware modeling rather than loss function modification alone. The S-Mamba model achieves the lowest Reserve_99.5% margin (14.12%) compared to 16.66% for iTransformer, demonstrating superior forecast reliability under a 99.5th-percentile tail-risk reserve proxy. Comments: 24 pages, 8 figures, 8 tables Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) MSC classes: 68T07, 62M20 ACMclasses: I.2.6; J.2 Cite as: arXiv:2601.01410 [eess.SY] (or arXiv:2601.01410v1 [eess.SY] for this version) https://doi.org/10.48550/arXiv.2601.01410 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-72] A Graph-based Framework for Online Time Series Anomaly Detection Using Model Ensemble

【速读】:该论文旨在解决工业系统中流式时间序列数据的在线异常检测问题,尤其针对数据模式多样且快速演化所带来的挑战。现有方法多适用于离线场景或难以有效处理异构流数据。其解决方案的关键在于提出一种基于图结构的模型集成框架(GDME),通过动态维护一个可不断更新的模型池(剔除表现不佳的模型并引入新模型),利用动态图结构表示模型间关系,并结合社区检测算法选择最优子集进行集成;同时,通过监测图结构变化来识别概念漂移(concept drift),从而实现对演化数据的自适应调整。该方法在七个异构时间序列数据集上显著优于现有在线异常检测方法,性能提升最高达24%,且在检测精度与计算效率之间取得良好平衡。

链接: https://arxiv.org/abs/2601.01403
作者: Zewei Yu,Jianqiu Xu,Caimin Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages

点击查看摘要

Abstract:With the increasing volume of streaming data in industrial systems, online anomaly detection has become a critical task. The diverse and rapidly evolving data patterns pose significant challenges for online anomaly detection. Many existing anomaly detection methods are designed for offline settings or have difficulty in handling heterogeneous streaming data effectively. This paper proposes GDME, an unsupervised graph-based framework for online time series anomaly detection using model ensemble. GDME maintains a dynamic model pool that is continuously updated by pruning underperforming models and introducing new ones. It utilizes a dynamic graph structure to represent relationships among models and employs community detection on the graph to select an appropriate subset for ensemble. The graph structure is also used to detect concept drift by monitoring structural changes, allowing the framework to adapt to evolving streaming data. Experiments on seven heterogeneous time series demonstrate that GDME outperforms existing online anomaly detection methods, achieving improvements of up to 24%. In addition, its ensemble strategy provides superior detection performance compared with both individual models and average ensembles, with competitive computational efficiency.
zh

[AI-73] Scale-Adaptive Power Flow Analysis with Local Topology Slicing and Multi-Task Graph Learning

【速读】:该论文旨在解决深度学习模型在电力系统潮流分析中对拓扑结构变化适应性不足的问题,特别是在不同规模电网下分支功率预测的鲁棒性和准确性难以保障的挑战。其解决方案的关键在于提出了一种尺度自适应多任务潮流分析框架(SaMPFA),其中引入了局部拓扑切片(Local Topology Slicing, LTS)采样技术以增强模型跨尺度学习能力,并设计了无参考多任务图学习(Reference-free Multi-task Graph Learning, RMGL)模型,通过直接预测节点电压和支路功率而非相角,避免了因相角计算误差传播导致的支路功率误差放大问题,同时借助物理约束损失项引导模型学习相角差与功率传输之间的物理规律,从而显著提升预测精度与物理一致性。

链接: https://arxiv.org/abs/2601.01387
作者: Yongzhe Li,Lin Guan,Zihan Cai,Zuxian Lin,Jiyu Huang,Liukai Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Developing deep learning models with strong adaptability to topological variations is of great practical significance for power flow analysis. To enhance model performance under variable system scales and improve robustness in branch power prediction, this paper proposes a Scale-adaptive Multi-task Power Flow Analysis (SaMPFA) framework. SaMPFA introduces a Local Topology Slicing (LTS) sampling technique that extracts subgraphs of different scales from the complete power network to strengthen the model’s cross-scale learning capability. Furthermore, a Reference-free Multi-task Graph Learning (RMGL) model is designed for robust power flow prediction. Unlike existing approaches, RMGL predicts bus voltages and branch powers instead of phase angles. This design not only avoids the risk of error amplification in branch power calculation but also guides the model to learn the physical relationships of phase angle differences. In addition, the loss function incorporates extra terms that encourage the model to capture the physical patterns of angle differences and power transmission, further improving consistency between predictions and physical laws. Simulations on the IEEE 39-bus system and a real provincial grid in China demonstrate that the proposed model achieves superior adaptability and generalization under variable system scales, with accuracy improvements of 4.47% and 36.82%, respectively.
zh

[AI-74] Data Complexity-aware Deep Model Performance Forecasting

【速读】:该论文旨在解决深度学习模型在实际应用中因架构选择依赖重复试错而导致的效率低下、资源消耗大且难以自动化的问题。其解决方案的关键在于提出一种轻量级的两阶段性能预测框架:第一阶段基于数据集可测量属性(如方差)预测基线性能,第二阶段结合模型结构与超参数信息对估计结果进行调整,从而实现跨数据集和模型类型的泛化能力。该方法不仅能够提前预估模型性能,还能为架构选择、预处理策略制定及数据质量检测提供早期指导。

链接: https://arxiv.org/abs/2601.01383
作者: Yen-Chia Chen,Hsing-Kuo Pao,Hanjuan Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 12 figures

点击查看摘要

Abstract:Deep learning models are widely used across computer vision and other domains. When working on the model induction, selecting the right architecture for a given dataset often relies on repetitive trial-and-error procedures. This procedure is time-consuming, resource-intensive, and difficult to automate. While previous work has explored performance prediction using partial training or complex simulations, these methods often require significant computational overhead or lack generalizability. In this work, we propose an alternative approach: a lightweight, two-stage framework that can estimate model performance before training given the understanding of the dataset and the focused deep model structures. The first stage predicts a baseline based on the analysis of some measurable properties of the dataset, while the second stage adjusts the estimation with additional information on the model’s architectural and hyperparameter details. The setup allows the framework to generalize across datasets and model types. Moreover, we find that some of the underlying features used for prediction - such as dataset variance - can offer practical guidance for model selection, and can serve as early indicators of data quality. As a result, the framework can be used not only to forecast model performance, but also to guide architecture choices, inform necessary preprocessing procedures, and detect potentially problematic datasets before training begins.
zh

[AI-75] Empowering Small Language Models with Factual Hallucination-Aware Reasoning for Financial Classification

【速读】:该论文旨在解决小语言模型(Small Language Models, SLMs)在金融分类任务中因事实幻觉(factual hallucinations)导致的性能下降问题,尤其关注其推理过程中的错误事实生成如何影响分类准确性。解决方案的关键在于提出一个三步流程——AAAI(Association Identification、Automated Detection、Adaptive Inference):首先识别与金融任务相关的事实关联;其次利用编码器-based 验证器自动检测事实幻觉;最后通过反馈机制实现自适应推理,从而修正错误并提升分类性能。实验证明,该方法能有效降低事实幻觉对分类结果的负面影响,显著增强SLMs在金融场景下的可靠性与准确率。

链接: https://arxiv.org/abs/2601.01378
作者: Han Yuan,Yilin Wu,Li Zhang,Zheng Ma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Small language models (SLMs) are increasingly used for financial classification due to their fast inference and local deployability. However, compared with large language models, SLMs are more prone to factual hallucinations in reasoning and exhibit weaker classification performance. This raises a natural question: Can mitigating factual hallucinations improve SLMs’ financial classification? To address this, we propose a three-step pipeline named AAAI (Association Identification, Automated Detection, and Adaptive Inference). Experiments on three representative SLMs reveal that: (1) factual hallucinations are positively correlated with misclassifications; (2) encoder-based verifiers effectively detect factual hallucinations; and (3) incorporating feedback on factual errors enables SLMs’ adaptive inference that enhances classification performance. We hope this pipeline contributes to trustworthy and effective applications of SLMs in finance.
zh

[AI-76] UltraEval-Audio: A Unified Framework for Comprehensive Evaluation of Audio Foundation Models

【速读】:该论文旨在解决音频基础模型(audio foundation models)评估体系缺失的问题,具体包括三个关键挑战:一是缺乏统一的评估框架,导致跨模型比较困难;二是音频编解码器(audio codecs)缺少全面且被广泛接受的评估方法;三是现有语音基准测试高度依赖英文,难以客观评估模型在中文语境下的表现。解决方案的关键在于提出 UltraEval-Audio,这是一个模块化、支持多语言(10种语言)和多任务(14类核心任务)的统一评估框架,集成24个主流模型与36个权威基准,并提供一键式评估功能及实时公开排行榜;同时引入针对音频编解码器的三维度综合评价指标(语义准确性、音色保真度、声学质量),并构建两个面向中文能力的新基准 SpeechCMMLU 和 SpeechHSK,以提升中文语音任务的评测公平性与有效性。

链接: https://arxiv.org/abs/2601.01373
作者: Qundong Shi,Jie Zhou,Biyuan Lin,Junbo Cui,Guoyang Zeng,Yixuan Zhou,Ziyang Wang,Xin Liu,Zhen Luo,Yudong Wang,Zhiyuan Liu
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 13 pages, 2 figures

点击查看摘要

Abstract:The development of audio foundation models has accelerated rapidly since the emergence of GPT-4o. However, the lack of comprehensive evaluation has become a critical bottleneck for further progress in the field, particularly in audio generation. Current audio evaluation faces three major challenges: (1) audio evaluation lacks a unified framework, with datasets and code scattered across various sources, hindering fair and efficient cross-model comparison;(2) audio codecs, as a key component of audio foundation models, lack a widely accepted and holistic evaluation methodology; (3) existing speech benchmarks are heavily reliant on English, making it challenging to objectively assess models’ performance on Chinese. To address the first issue, we introduce UltraEval-Audio, a unified evaluation framework for audio foundation models, specifically designed for both audio understanding and generation tasks. UltraEval-Audio features a modular architecture, supporting 10 languages and 14 core task categories, while seamlessly integrating 24 mainstream models and 36 authoritative benchmarks. To enhance research efficiency, the framework provides a one-command evaluation feature, accompanied by real-time public leaderboards. For the second challenge, UltraEval-Audio adopts a novel comprehensive evaluation scheme for audio codecs, evaluating performance across three key dimensions: semantic accuracy, timbre fidelity, and acoustic quality. To address the third issue, we propose two new Chinese benchmarks, SpeechCMMLU and SpeechHSK, designed to assess Chinese knowledge proficiency and language fluency. We wish that UltraEval-Audio will provide both academia and industry with a transparent, efficient, and fair platform for comparison of audio models. Our code, benchmarks, and leaderboards are available at this https URL.
zh

[AI-77] KGCE: Knowledge-Augmented Dual-Graph Evaluator for Cross-Platform Educational Agent Benchmarking with Multimodal Language Models

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLMs)在教育场景中跨平台任务执行能力评估不足的问题,特别是针对学校专用软件(如小雅智能助手、华师小智等)时,现有代理系统因缺乏对私有领域软件结构特性的理解而导致效率显著下降的问题。同时,传统评估方法依赖目标导向或轨迹匹配等粗粒度指标,难以捕捉代理在复杂任务中的执行细节与效率。解决方案的关键在于提出KGCE(Knowledge-Augmented Dual-Graph Evaluator for Cross-Platform Educational Agent Benchmarking with Multimodal Language Models),其核心创新包括:构建包含104个教育相关任务的数据集(涵盖Windows、Android及跨平台协作任务),引入知识库增强机制以提升代理对私有领域软件的理解能力,并设计双图评估框架将任务分解为多个子目标并逐项验证完成状态,从而实现细粒度的性能评估。

链接: https://arxiv.org/abs/2601.01366
作者: Zixian Liu,Sihao Liu,Yuqi Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the rapid adoption of multimodal large language models (MLMs) in autonomous agents, cross-platform task execution capabilities in educational settings have garnered significant attention. However, existing benchmark frameworks still exhibit notable deficiencies in supporting cross-platform tasks in educational contexts, especially when dealing with school-specific software (such as XiaoYa Intelligent Assistant, HuaShi XiaZi, etc.), where the efficiency of agents often significantly decreases due to a lack of understanding of the structural specifics of these private-domain software. Additionally, current evaluation methods heavily rely on coarse-grained metrics like goal orientation or trajectory matching, making it challenging to capture the detailed execution and efficiency of agents in complex tasks. To address these issues, we propose KGCE (Knowledge-Augmented Dual-Graph Evaluator for Cross-Platform Educational Agent Benchmarking with Multimodal Language Models), a novel benchmarking platform that integrates knowledge base enhancement and a dual-graph evaluation framework. We first constructed a dataset comprising 104 education-related tasks, covering Windows, Android, and cross-platform collaborative tasks. KGCE introduces a dual-graph evaluation framework that decomposes tasks into multiple sub-goals and verifies their completion status, providing fine-grained evaluation metrics. To overcome the execution bottlenecks of existing agents in private-domain tasks, we developed an enhanced agent system incorporating a knowledge base specific to school-specific software. The code can be found at this https URL.
zh

[AI-78] A unified multimodal understanding and generation model for cross-disciplinary scientific research

【速读】:该论文旨在解决当前AI模型在科学领域中普遍存在的局限性问题,即大多数模型仍局限于单一学科(domain-specific),难以同时理解与生成跨学科的高维多模态科学数据,而许多全球性科学挑战本质上是跨学科的。其解决方案的关键在于提出FuXi-Uni——一种原生统一的多模态科学模型架构,通过将跨学科科学标记(scientific tokens)对齐至自然语言标记空间,并引入科学解码器(science decoder)以重构科学标记,从而实现自然语言交互与科学数值预测的双重能力。该方法在地球科学和生物医学两大领域得到实证验证,展现出优于现有物理模型的天气预报精度、热带气旋预测性能及区域气象场重建效果,以及在生物医学视觉问答任务中的领先表现,为构建通用性强、跨域协同的科学智能模型提供了新范式。

链接: https://arxiv.org/abs/2601.01363
作者: Xiaomeng Yang,Zhiyu Tan,Xiaohui Zhong,Mengping Yang,Qiusheng Huang,Lei Chen,Libo Wu,Hao Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Scientific discovery increasingly relies on integrating heterogeneous, high-dimensional data across disciplines nowadays. While AI models have achieved notable success across various scientific domains, they typically remain domain-specific or lack the capability of simultaneously understanding and generating multimodal scientific data, particularly for high-dimensional data. Yet, many pressing global challenges and scientific problems are inherently cross-disciplinary and require coordinated progress across multiple fields. Here, we present FuXi-Uni, a native unified multimodal model for scientific understanding and high-fidelity generation across scientific domains within a single architecture. Specifically, FuXi-Uni aligns cross-disciplinary scientific tokens within natural language tokens and employs science decoder to reconstruct scientific tokens, thereby supporting both natural language conversation and scientific numerical prediction. Empirically, we validate FuXi-Uni in Earth science and Biomedicine. In Earth system modeling, the model supports global weather forecasting, tropical cyclone (TC) forecast editing, and spatial downscaling driven by only language instructions. FuXi-Uni generates 10-day global forecasts at 0.25° resolution that outperform the SOTA physical forecasting system. It shows superior performance for both TC track and intensity prediction relative to the SOTA physical model, and generates high-resolution regional weather fields that surpass standard interpolation baselines. Regarding biomedicine, FuXi-Uni outperforms leading multimodal large language models on multiple biomedical visual question answering benchmarks. By unifying heterogeneous scientific modalities within a native shared latent space while maintaining strong domain-specific performance, FuXi-Uni provides a step forward more general-purpose, multimodal scientific models.
zh

[AI-79] From Classification to Generation: An Open-Ended Paradigm for Adverse Drug Reaction Prediction Based on Graph-Motif Feature Fusion

【速读】:该论文旨在解决药物不良反应(Adverse Drug Reaction, ADR)预测中面临的三大挑战:药物数据稀缺导致的冷启动问题、标签集封闭性限制以及标签间依赖关系建模不足。其核心解决方案是提出一种基于图-基元特征融合与多标签生成(Graph-Motif feature fusion and Multi-Label Generation, GM-MLG)的开放式预测范式。关键创新在于:首先,构建跨越原子级、局部分子级(利用BRICS算法动态提取细粒度基元并结合额外断裂规则)和全局分子级的双图表示架构;其次,首次将ADR预测从传统的多标签分类任务转化为基于Transformer解码器的多标签生成任务,通过将ADR标签视为离散token序列,并引入位置嵌入显式捕捉大规模标签空间内的依赖关系与共现模式,从而实现自回归式解码以动态扩展预测空间。该方法在实验中实现了最高38%、平均20%的性能提升,预测标签数从200扩展至超10,000种,同时借助逆合成基元分析揭示非线性的结构-活性关系,为药物安全性系统性风险降低提供可解释且创新的支持。

链接: https://arxiv.org/abs/2601.01347
作者: Yuyan Pi,Min Jin,Wentao Xie,Xinhua Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 34 pages,5 figures

点击查看摘要

Abstract:Computational biology offers immense potential for reducing the high costs and protracted cycles of new drug development through adverse drug reaction (ADR) prediction. However, current methods remain impeded by drug data scarcity-induced cold-start challenge, closed label sets, and inadequate modeling of label dependencies. Here we propose an open-ended ADR prediction paradigm based on Graph-Motif feature fusion and Multi-Label Generation (GM-MLG). Leveraging molecular structure as an intrinsic and inherent feature, GM-MLG constructs a dual-graph representation architecture spanning the atomic level, the local molecular level (utilizing fine-grained motifs dynamically extracted via the BRICS algorithm combined with additional fragmentation rules), and the global molecular level. Uniquely, GM-MLG pioneers transforming ADR prediction from multi-label classification into Transformer Decoder-based multi-label generation. By treating ADR labels as discrete token sequences, it employs positional embeddings to explicitly capture dependencies and co-occurrence relationships within large-scale label spaces, generating predictions via autoregressive decoding to dynamically expand the prediction space. Experiments demonstrate GM-MLG achieves up to 38% improvement and an average gain of 20%, expanding the prediction space from 200 to over 10,000 types. Furthermore, it elucidates non-linear structure-activity relationships between ADRs and motifs via retrosynthetic motif analysis, providing interpretable and innovative support for systematic risk reduction in drug safety.
zh

[AI-80] Beyond Gemini-3-Pro: Revisiting LLM Routing and Aggregation at Scale

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)依赖单一模型规模扩展所带来的效率与性能瓶颈问题,特别是现有路由和聚合机制在开放协作场景下的局限性。其核心挑战包括:(1)无训练路由方法仅基于文本相似度,忽视问题难度;(2)静态聚合策略无法适配不同任务需求;(3)路由与聚合之间缺乏协同优化。解决方案的关键在于提出 JiSi 框架,通过三项创新实现:(1)Query-Response Mixed Routing,融合语义信息与问题难度以提升路由准确性;(2)Support-Set-based Aggregator Selection,动态评估聚合器的领域适应能力与聚合效果;(3)Adaptive Routing-Aggregation Switch,根据任务特性自适应切换路由或聚合模式,从而释放多模型协作的潜力。实验证明,JiSi 仅用47%成本即可超越 Gemini-3-Pro 性能,验证了集体智能是通向通用人工智能(Artificial General Intelligence, AGI)的新路径。

链接: https://arxiv.org/abs/2601.01330
作者: Shengji Tang,Weihao Lin,Jingqi Ye,Hao Li,Bo Zhang,Shuyue Hu,Tao Chen,Wangli Ouyang,Lei Bai,Peng Ye
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages

点击查看摘要

Abstract:Large Language Models (LLMs) have rapidly advanced, with Gemini-3-Pro setting a new performance milestone. In this work, we explore collective intelligence as an alternative to monolithic scaling, and demonstrate that open-source LLMs’ collaboration can surpass Gemini-3-Pro. We first revisit LLM routing and aggregation at scale and identify three key bottlenecks: (1) current train-free routers are limited by a query-based paradigm focusing solely on textual similarity; (2) recent aggregation methods remain largely static, failing to select appropriate aggregators for different tasks;(3) the complementarity of routing and aggregation remains underutilized. To address these problems, we introduce JiSi, a novel framework designed to release the full potential of LLMs’ collaboration through three innovations: (1) Query-Response Mixed Routing capturing both semantic information and problem difficulty; (2) Support-Set-based Aggregator Selection jointly evaluating the aggregation and domain capacity of aggregators; (3) Adaptive Routing-Aggregation Switch dynamically leveraging the advantages of routing and aggregation. Comprehensive experiments on nine benchmarks demonstrate that JiSi can surpass Gemini-3-Pro with only 47% costs by orchestrating ten open-source LLMs, while outperforming mainstream baselines. It suggests that collective intelligence represents a novel path towards Artificial General Intelligence (AGI).
zh

[AI-81] Digital Twin AI: Opportunities and Challenges from Large Language Models to World Models

【速读】:该论文旨在解决数字孪生(Digital Twin)从传统被动仿真工具向智能化、自主化实体演进过程中,如何系统性整合人工智能(Artificial Intelligence, AI)技术以提升其建模精度、实时同步能力、干预效率与自主管理水平的问题。解决方案的关键在于提出一个统一的四阶段框架:(1) 基于物理机制和物理信息驱动的AI方法构建物理孪生体模型;(2) 实现物理系统与数字孪生体之间的实时同步映射;(3) 通过预测建模、异常检测与优化策略对物理系统进行智能干预;(4) 利用大语言模型、基础模型与智能体实现自主管理。该框架揭示了物理建模与数据驱动学习之间的协同机制,并强调生成式AI(Generative AI)在推动数字孪生向具备推理、通信与场景生成能力的认知系统转变中的核心作用。

链接: https://arxiv.org/abs/2601.01321
作者: Rong Zhou,Dongping Chen,Zihan Jia,Yao Su,Yixin Liu,Yiwen Lu,Dongwei Shi,Yue Huang,Tianyang Xu,Yi Pan,Xinliang Li,Yohannes Abate,Qingyu Chen,Zhengzhong Tu,Yu Yang,Yu Zhang,Qingsong Wen,Gengchen Mai,Sunyang Fu,Jiachen Li,Xuyu Wang,Ziran Wang,Jing Huang,Tianming Liu,Yong Chen,Lichao Sun,Lifang He
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Digital twins, as precise digital representations of physical systems, have evolved from passive simulation tools into intelligent and autonomous entities through the integration of artificial intelligence technologies. This paper presents a unified four-stage framework that systematically characterizes AI integration across the digital twin lifecycle, spanning modeling, mirroring, intervention, and autonomous management. By synthesizing existing technologies and practices, we distill a unified four-stage framework that systematically characterizes how AI methodologies are embedded across the digital twin lifecycle: (1) modeling the physical twin through physics-based and physics-informed AI approaches, (2) mirroring the physical system into a digital twin with real-time synchronization, (3) intervening in the physical twin through predictive modeling, anomaly detection, and optimization strategies, and (4) achieving autonomous management through large language models, foundation models, and intelligent agents. We analyze the synergy between physics-based modeling and data-driven learning, highlighting the shift from traditional numerical solvers to physics-informed and foundation models for physical systems. Furthermore, we examine how generative AI technologies, including large language models and generative world models, transform digital twins into proactive and self-improving cognitive systems capable of reasoning, communication, and creative scenario generation. Through a cross-domain review spanning eleven application domains, including healthcare, aerospace, smart manufacturing, robotics, and smart cities, we identify common challenges related to scalability, explainability, and trustworthiness, and outline directions for responsible AI-driven digital twin systems.
zh

[AI-82] Adaptive Hierarchical Evaluation of LLM s and SAST tools for CWE Prediction in Python

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在软件开发中生成易受攻击代码的问题,以及现有漏洞检测基准在细粒度(CWE级别)反馈上的不足。当前的二分类检测方法无法为迭代修正系统提供可操作的诊断信息。其解决方案的关键在于提出ALPHA(Adaptive Learning via Penalty in Hierarchical Assessment),这是一个基于函数级别的Python漏洞检测基准,首次引入层次感知且针对CWE级别的惩罚机制,能够区分过泛化、过度特化和横向错误,从而更准确地评估LLMs与静态应用安全测试(SAST)工具的性能差异,并为后续通过监督微调整合此类惩罚机制以实现层次感知的漏洞检测提供了路径。

链接: https://arxiv.org/abs/2601.01320
作者: Muntasir Adnan,Carlos C. N. Kuhn
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models have become integral to software development, yet they frequently generate vulnerable code. Existing code vulnerability detection benchmarks employ binary classification, lacking the CWE-level specificity required for actionable feedback in iterative correction systems. We present ALPHA (Adaptive Learning via Penalty in Hierarchical Assessment), the first function-level Python benchmark that evaluates both LLMs and SAST tools using hierarchically aware, CWE-specific penalties. ALPHA distinguishes between over-generalisation, over-specification, and lateral errors, reflecting practical differences in diagnostic utility. Evaluating seven LLMs and two SAST tools, we find LLMs substantially outperform SAST, though SAST demonstrates higher precision when detections occur. Critically, prediction consistency varies dramatically across models (8.26%-81.87% agreement), with significant implications for feedback-driven systems. We further outline a pathway for future work incorporating ALPHA penalties into supervised fine-tuning, which could provide principled hierarchy-aware vulnerability detection pending empirical validation.
zh

[AI-83] Accelerating Monte-Carlo Tree Search with Optimized Posterior Policies

【速读】:该论文旨在解决传统蒙特卡洛树搜索(Monte–Carlo Tree Search, MCTS)在训练生成式 AI 模型时效率低下的问题,尤其是 AlphaZero 中使用的 MCTS-UCB 算法因串行推理导致 GPU 利用率不足、延迟高。其解决方案的关键在于提出一种递归式 MCTS 算法(RMCTS),通过广度优先遍历搜索树,使神经网络前向推理可批量执行,从而显著降低 GPU 延迟;同时利用基于贝叶斯后验策略的优化机制(posterior policy),从叶子节点向上递归计算最优策略,实现与 MCTS-UCB 相当的策略质量,但训练时间缩短至约三分之一。

链接: https://arxiv.org/abs/2601.01301
作者: Keith Frankston,Benjamin Howard
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages; an efficient implementation is available at this https URL

点击查看摘要

Abstract:We introduce a recursive AlphaZero-style Monte–Carlo tree search algorithm, “RMCTS”. The advantage of RMCTS over AlphaZero’s MCTS-UCB is speed. In RMCTS, the search tree is explored in a breadth-first manner, so that network inferences naturally occur in large batches. This significantly reduces the GPU latency cost. We find that RMCTS is often more than 40 times faster than MCTS-UCB when searching a single root state, and about 3 times faster when searching a large batch of root states. The recursion in RMCTS is based on computing optimized posterior policies at each game state in the search tree, starting from the leaves and working back up to the root. Here we use the posterior policy explored in “Monte–Carlo tree search as regularized policy optimization” (Grill, et al.) Their posterior policy is the unique policy which maximizes the expected reward given estimated action rewards minus a penalty for diverging from the prior policy. The tree explored by RMCTS is not defined in an adaptive manner, as it is in MCTS-UCB. Instead, the RMCTS tree is defined by following prior network policies at each node. This is a disadvantage, but the speedup advantage is more significant, and in practice we find that RMCTS-trained networks match the quality of MCTS-UCB-trained networks in roughly one-third of the training time. We include timing and quality comparisons of RMCTS vs. MCTS-UCB for three games: Connect-4, Dots-and-Boxes, and Othello. Comments: 11 pages; an efficient implementation is available at this https URL Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG) MSC classes: cs.AI (primary), cs.LG (secondary) Cite as: arXiv:2601.01301 [cs.AI] (or arXiv:2601.01301v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2601.01301 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-84] Warp-Cortex: An Asynchronous Memory-Efficient Architecture for Million-Agent Cognitive Scaling on Consumer Hardware

【速读】:该论文旨在解决当前多智能体大语言模型(Multi-agent Large Language Model, MLLM)框架中存在的线性内存扩展问题,使得“系统2”(System 2)并行推理在消费级硬件上难以实现。其核心解决方案在于提出一种异步架构 Warp Cortex,通过解耦智能体逻辑与物理内存,显著降低内存复杂度:利用单例权重共享(Singleton Weight Sharing)将权重内存从 O(N × L) 降至 O(1),并通过受拓扑数据分析(Topological Data Analysis, TDA)中混合地标技术启发的拓扑突触(Topological Synapse),将上下文内存从 O(N × L) 降至 O(N × k),其中 k ≪ L。此外,通过将键值缓存(KV-cache)视为潜在空间中的点云,并应用基于见证复形(witness complex)的稀疏化策略,保留上下文流形的持久同调特征,从而实现百万级智能体的认知扩展理论可行性。

链接: https://arxiv.org/abs/2601.01298
作者: Jorge L. Ruiz Williams
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Current multi-agent Large Language Model (LLM) frameworks suffer from linear memory scaling, rendering “System 2” parallel reasoning impractical on consumer hardware. We present Warp Cortex, an asynchronous architecture that theoretically enables million-agent cognitive scaling by decoupling agent logic from physical memory. Through Singleton Weight Sharing and a novel Topological Synapse–inspired by hybrid landmarking techniques from Topological Data Analysis (TDA)–we reduce memory complexity from O(N * L) to O(1) for weights and O(N * k) for context, where k L. By treating the KV-cache as a point cloud in latent space, we apply witness-complex-inspired sparsification to preserve persistent homological features of the context manifold. On a single NVIDIA RTX 4090, we empirically demonstrate 100 concurrent agents at 2.2 GB total VRAM, with theoretical capacity exceeding 1,000 agents before compute latency becomes the bottleneck. We further introduce Referential Injection, a non-intrusive KV-cache update mechanism that allows asynchronous sub-agents to influence primary generation without stream disruption.
zh

[AI-85] Aggressive Compression Enables LLM Weight Theft NEURIPS2024

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在数据中心部署时,因模型权重(model weights)可被压缩而面临高风险的网络窃取问题(即权重外泄攻击,weight-exfiltration attacks)。其核心解决方案在于识别并利用模型权重的可压缩性这一单一关键因素:研究者通过放松解压约束,专门针对外泄场景设计了压缩方法,在仅带来微小性能损失的前提下,实现了16倍至100倍的压缩比,从而将非法传输模型权重所需时间从数月缩短至数天。此外,论文还提出三种防御策略——降低模型压缩难度、增强模型隐蔽性以及引入用于溯源分析的取证水印(forensic watermarks),其中取证水印因其高效且低成本的特点成为最具吸引力的缓解手段。

链接: https://arxiv.org/abs/2601.01296
作者: Davis Brown,Juan-Pablo Rivera,Dan Hendrycks,Mantas Mazeika
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: An early version of this work was presented at the SoLAR Workshop at NeurIPS 2024

点击查看摘要

Abstract:As frontier AIs become more powerful and costly to develop, adversaries have increasing incentives to steal model weights by mounting exfiltration attacks. In this work, we consider exfiltration attacks where an adversary attempts to sneak model weights out of a datacenter over a network. While exfiltration attacks are multi-step cyber attacks, we demonstrate that a single factor, the compressibility of model weights, significantly heightens exfiltration risk for large language models (LLMs). We tailor compression specifically for exfiltration by relaxing decompression constraints and demonstrate that attackers could achieve 16x to 100x compression with minimal trade-offs, reducing the time it would take for an attacker to illicitly transmit model weights from the defender’s server from months to days. Finally, we study defenses designed to reduce exfiltration risk in three distinct ways: making models harder to compress, making them harder to ‘find,’ and tracking provenance for post-attack analysis using forensic watermarks. While all defenses are promising, the forensic watermark defense is both effective and cheap, and therefore is a particularly attractive lever for mitigating weight-exfiltration risk.
zh

[AI-86] Diffusion Timbre Transfer Via Mutual Information Guided Inpainting

【速读】:该论文旨在解决音乐音频中音色迁移(timbre transfer)的问题,即在不重新训练模型的前提下,通过推理阶段的编辑操作实现不同乐器音色之间的转换。其核心解决方案在于提出一种轻量级的无额外训练方法:首先采用逐维度噪声注入策略,针对潜在空间中对乐器身份信息最敏感的通道进行干预;其次引入早期步骤钳制机制(early-step clamping),在反向扩散过程中重新施加输入音频的旋律与节奏结构,从而在保持音乐结构完整性的同时实现有效的音色变换。该方法直接作用于音频潜在表示,并兼容文本或音频条件输入(如CLAP),展现出预训练模型在风格迁移任务中的可操控潜力。

链接: https://arxiv.org/abs/2601.01294
作者: Ching Ho Lee,Javier Nistal,Stefan Lattner,Marco Pasini,George Fazekas
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 6 pages, 2 figures, 3 tables

点击查看摘要

Abstract:We study timbre transfer as an inference-time editing problem for music audio. Starting from a strong pre-trained latent diffusion model, we introduce a lightweight procedure that requires no additional training: (i) a dimension-wise noise injection that targets latent channels most informative of instrument identity, and (ii) an early-step clamping mechanism that re-imposes the input’s melodic and rhythmic structure during reverse diffusion. The method operates directly on audio latents and is compatible with text/audio conditioning (e.g., CLAP). We discuss design choices,analyze trade-offs between timbral change and structural preservation, and show that simple inference-time controls can meaningfully steer pre-trained models for style-transfer use cases.
zh

[AI-87] PyBatchRender: A Python Library for Batched 3D Rendering at Up to One Million FPS

【速读】:该论文旨在解决强化学习从像素(Reinforcement Learning from Pixels, RLP)中因3D渲染环境性能与复杂性导致的瓶颈问题,即研究者在高效率低级引擎与较慢但易用的Python框架之间面临权衡。解决方案的关键在于提出PyBatchRender——一个基于Panda3D游戏引擎构建的Python库,通过优化的批量渲染技术实现每秒超百万帧(>1 million FPS)的高吞吐量,并在简单场景下达到最高1000倍的速度提升。该方案兼顾了灵活性、易用性和高性能,使用户能仅用数十行Python代码即可创建自定义场景,从而显著加速AI训练的原型开发与可扩展性。

链接: https://arxiv.org/abs/2601.01288
作者: Evgenii Rudakov,Jonathan Shock,Benjamin Ultan Cowley
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Performance (cs.PF); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Reinforcement learning from pixels is often bottlenecked by the performance and complexity of 3D rendered environments. Researchers face a trade-off between high-speed, low-level engines and slower, more accessible Python frameworks. To address this, we introduce PyBatchRender, a Python library for high-throughput, batched 3D rendering that achieves over 1 million FPS on simple scenes. Built on the Panda3D game engine, it utilizes its mature ecosystem while enhancing performance through optimized batched rendering for up to 1000X speedups. Designed as a physics-agnostic renderer for reinforcement learning from pixels, PyBatchRender offers greater flexibility than dedicated libraries, simpler setup than typical game-engine wrappers, and speeds rivaling state-of-the-art C++ engines like Madrona. Users can create custom scenes entirely in Python with tens of lines of code, enabling rapid prototyping for scalable AI training. Open-source and easy to integrate, it serves to democratize high-performance 3D simulation for researchers and developers. The library is available at this https URL.
zh

[AI-88] Benchmarking the Computational and Representational Efficiency of State Space Models against Transformers on Long-Context Dyadic Sessions

【速读】:该论文旨在解决长序列建模中Transformer模型因计算复杂度为二次方(O(N²))而导致的效率瓶颈问题,特别是在处理高分辨率或长上下文场景时的资源消耗过大问题。其解决方案的关键在于系统性地比较状态空间模型(State Space Models, SSMs)中的Mamba架构与LLaMA Transformer在长序列任务中的表现差异,通过在对话治疗会话这一典型长序列场景下进行基准测试,从计算效率(内存占用和推理速度,跨度512至8,192 token)和表征效率(隐藏状态动态与注意力模式)两个维度验证SSMs是否能在保持性能的同时实现线性复杂度(O(N))的优势,从而为实际应用中选择合适架构提供可量化的决策依据。

链接: https://arxiv.org/abs/2601.01237
作者: Abidemi Koledoye,Chinemerem Unachukwu,Gold Nwobu,Hasin Rana
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 pages

点击查看摘要

Abstract:State Space Models (SSMs) have emerged as a promising alternative to Transformers for long-context sequence modeling, offering linear O(N) computational complexity compared to the Transformer’s quadratic O(N^2) scaling. This paper presents a comprehensive benchmarking study comparing the Mamba SSM against the LLaMA Transformer on long-context sequences, using dyadic therapy sessions as a representative test case. We evaluate both architectures across two dimensions: (1) computational efficiency, where we measure memory usage and inference speed from 512 to 8,192 tokens, and (2) representational efficiency, where we analyze hidden state dynamics and attention patterns. Our findings provide actionable insights for practitioners working with long-context applications, establishing precise conditions under which SSMs offer advantages over Transformers.
zh

[AI-89] Correctness isnt Efficiency: Runtime Memory Divergence in LLM -Generated Code ICSE

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)生成的程序虽然通过单元测试,但运行时内存行为不稳定的问题,这种不稳定性可能带来隐藏的操作风险。其核心解决方案是提出一个衡量执行时间内存稳定性的框架,关键在于引入动态均值成对距离(Dynamic Mean Pairwise Distance, DMPD),该方法通过动态时间规整(Dynamic Time Warping, DTW)比较将内存使用轨迹转换为单调峰值轮廓(Monotonic Peak Profiles, MPPs)后的形状,以抑制瞬态噪声;进而聚合DMPD得到模型级不稳定性评分(Model Instability Score, MIS),从而实现对正确解中运行时行为差异的量化评估,并支持在持续集成/持续部署(CI/CD)中进行稳定性感知的候选选择,降低操作风险而不牺牲正确性。

链接: https://arxiv.org/abs/2601.01215
作者: Prateek Rajput,Yewei Song,Abdoul Aziz Bonkoungou,Iyiola E. Olatunji,Abdoul Kader Kabore,Jacques Klein,Tegawendé F. Bissyandé
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 11 Pages, 11 figures, Accepted at ICSE SEIP

点击查看摘要

Abstract:Large language models (LLMs) can generate programs that pass unit tests, but passing tests does not guarantee reliable runtime behavior. We find that different correct solutions to the same task can show very different memory and performance patterns, which can lead to hidden operational risks. We present a framework to measure execution-time memory stability across multiple correct generations. At the solution level, we introduce Dynamic Mean Pairwise Distance (DMPD), which uses Dynamic Time Warping to compare the shapes of memory-usage traces after converting them into Monotonic Peak Profiles (MPPs) to reduce transient noise. Aggregating DMPD across tasks yields a model-level Model Instability Score (MIS). Experiments on BigOBench and CodeContests show substantial runtime divergence among correct solutions. Instability often increases with higher sampling temperature even when pass@1 improves. We also observe correlations between our stability measures and software engineering indicators such as cognitive and cyclomatic complexity, suggesting links between operational behavior and maintainability. Our results support stability-aware selection among passing candidates in CI/CD to reduce operational risk without sacrificing correctness. Artifacts are available.
zh

[AI-90] MentalGame: Predicting Personality-Job Fitness for Software Developers Using Multi-Genre Games and Machine Learning Approaches

【速读】:该论文旨在解决传统职业指导与人员选拔中依赖自评量表所引发的反应偏差(response bias)、疲劳效应及有意扭曲等问题,这些问题可能影响评估结果的客观性和准确性。其解决方案的关键在于提出一种融合多类型严肃游戏(serious game)框架与机器学习技术的隐式行为评估方法,通过设计定制化移动端游戏来诱发与软件开发岗位相关的具体行为特征(如问题解决、规划能力、适应性等),并利用细粒度的游戏事件数据,采用两阶段建模策略仅基于游戏行为特征预测候选人适配度。结果显示模型在精度上达到97%、准确率高达94%,表明游戏中的隐式行为痕迹可有效替代显性人格测试,为职业评估提供了一种可扩展、具吸引力且偏差更少的新范式。

链接: https://arxiv.org/abs/2601.01206
作者: Soroush Elyasi,Arya VarastehNezhad,Fattaneh Taghiyareh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Personality assessment in career guidance and personnel selection traditionally relies on self-report questionnaires, which are susceptible to response bias, fatigue, and intentional distortion. Game-based assessment offers a promising alternative by capturing implicit behavioral signals during gameplay. This study proposes a multi-genre serious-game framework combined with machine-learning techniques to predict suitability for software development roles. Developer-relevant personality and behavioral traits were identified through a systematic literature review and an empirical study of professional software engineers. A custom mobile game was designed to elicit behaviors related to problem solving, planning, adaptability, persistence, time management, and information seeking. Fine-grained gameplay event data were collected and analyzed using a two-phase modeling strategy where suitability was predicted exclusively from gameplay-derived behavioral features. Results show that our model achieved up to 97% precision and 94% accuracy. Behavioral analysis revealed that proper candidates exhibited distinct gameplay patterns, such as more wins in puzzle-based games, more side challenges, navigating menus more frequently, and exhibiting fewer pauses, retries, and surrender actions. These findings demonstrate that implicit behavioral traces captured during gameplay is promising in predicting software-development suitability without explicit personality testing, supporting serious games as a scalable, engaging, and less biased alternative for career assessment.
zh

[AI-91] Reinforcement Learning Enhanced Multi-hop Reasoning for Temporal Knowledge Question Answering

【速读】:该论文针对时序知识图谱问答(Temporal Knowledge Graph Question Answering, TKGQA)中多跳推理面临的挑战展开研究,旨在解决大语言模型(LLMs)在每跳推理过程中因检索到大量时序相似且语义复杂的关联子图而导致次优决策与错误传播的问题。其解决方案的关键在于提出多跳推理增强(Multi-hop Reasoning Enhanced, MRE)框架,通过三阶段机制实现:首先利用提示工程引导LLM生成多样化的推理路径;其次基于监督微调筛选有效路径作为冷启动策略;最后引入树状分组相对策略优化(Tree-Group Relative Policy Optimization, T-GRPO),以递归的树结构探索方式,在每跳中建立强因果依赖关系并结合后续路径的多路径反馈进行评估,从而识别全局最优推理轨迹,显著提升复杂多跳查询的准确性和鲁棒性。

链接: https://arxiv.org/abs/2601.01195
作者: Wuzhenghong Wen,Chao Xue,Su Pan,Yuwei Sun,Minlong Peng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages, 2 figures

点击查看摘要

Abstract:Temporal knowledge graph question answering (TKGQA) involves multi-hop reasoning over temporally constrained entity relationships in the knowledge graph to answer a given question. However, at each hop, large language models (LLMs) retrieve subgraphs with numerous temporally similar and semantically complex relations, increasing the risk of suboptimal decisions and error propagation. To address these challenges, we propose the multi-hop reasoning enhanced (MRE) framework, which enhances both forward and backward reasoning to improve the identification of globally optimal reasoning trajectories. Specifically, MRE begins with prompt engineering to guide the LLM in generating diverse reasoning trajectories for a given question. Valid reasoning trajectories are then selected for supervised fine-tuning, serving as a cold-start strategy. Finally, we introduce Tree-Group Relative Policy Optimization (T-GRPO), a recursive, tree-structured learning-by-exploration approach. At each hop, exploration establishes strong causal dependencies on the previous hop, while evaluation is informed by multi-path exploration feedback from subsequent hops. Experimental results on two TKGQA benchmarks indicate that the proposed MRE-based model consistently surpasses state-of-the-art (SOTA) approaches in handling complex multi-hop queries. Further analysis highlights improved interpretability and robustness to noisy temporal annotations.
zh

[AI-92] AI-Powered Hybrid Intrusion Detection Framework for Cloud Security Using Novel Metaheuristic Optimization

【速读】:该论文旨在解决云计算(Cloud Computing, CC)环境中入侵检测系统(Intrusion Detection System, IDS)因数据集偏斜(skewed datasets)导致分类模型性能不佳的问题。其核心解决方案是提出一种混合入侵检测系统(Hybrid Intrusion Detection System, HyIDS),关键创新在于引入能量谷优化算法(Energy Valley Optimizer, EVO)进行特征选择(Feature Selection, FS),从而显著减少冗余特征并提升计算效率;同时结合四种机器学习模型(支持向量机、随机森林、决策树和K近邻)构建检测框架,并采用下采样技术平衡两类不平衡的真实世界数据集(CIC-DDoS2019与CSE-CIC-IDS2018),最终在多个指标上实现高精度检测,如决策树-EVO组合模型在CIC-DDoS2019上达到99.13%准确率和98.94% F1分数,在CSE-CIC-IDS2018上分别达99.78%与99.70%,验证了EVO在提升云安全中的有效性。

链接: https://arxiv.org/abs/2601.01134
作者: Maryam Mahdi Alhusseini,Alireza Rouhi,Mohammad-Reza Feizi-Derakhshi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cybersecurity poses considerable problems to Cloud Computing (CC), especially regarding Intrusion Detection Systems (IDSs), facing difficulties with skewed datasets and suboptimal classification model performance. This study presents the Hybrid Intrusion Detection System (HyIDS), an innovative IDS that employs the Energy Valley Optimizer (EVO) for Feature Selection (FS). Additionally, it introduces a novel technique for enhancing the cybersecurity of cloud computing through the integration of machine learning methodologies with the EVO Algorithm. The Energy Valley Optimizer (EVO) effectively diminished features in the CIC-DDoS2019 dataset from 88 to 38 and in the CSE-CIC-IDS2018 data from 80 to 43, significantly enhancing computing efficiency. HyIDS incorporates four Machine Learning (ML) models: Support Vector Machine (SVM), Random Forest (RF), Decision Tree (D_Tree), and K-Nearest Neighbors (KNN). The proposed HyIDS was assessed utilizing two real-world intrusion datasets, CIC-DDoS2019 and CSE-CIC-IDS2018, both distinguished by considerable class imbalances. The CIC-DDoS2019 dataset has a significant imbalance between DDoS assault samples and legal traffic, while the CSE-CIC-IDS2018 dataset primarily comprises benign traffic with insufficient representation of attack types, complicating the detection of minority attacks. A downsampling technique was employed to balance the datasets, hence improving detection efficacy for both benign and malicious traffic. Twenty-four trials were done, revealing substantial enhancements in categorization accuracy, precision, and recall. Our suggested D_TreeEVO model attained an accuracy rate of 99.13% and an F1 score of 98.94% on the CIC-DDoS2019 dataset, and an accuracy rate of 99.78% and an F1 score of 99.70% on the CSE-CIC-IDS2018 data. These data demonstrate that EVO significantly improves cybersecurity in Cloud Computing (CC).
zh

[AI-93] Generating Diverse TSP Tours via a Combination of Graph Pointer Network and Dispersion

【速读】:该论文致力于解决多样旅行商问题(Diverse Traveling Salesman Problem, D-TSP),即在保证每条路径长度不超过最优解长度 $ c|T^*| $ 的前提下,从所有可能的旅行商路径中选出 $ k $ 条彼此差异最大的路径,以同时满足高质量与高容错性的需求。其核心挑战在于如何高效地平衡解的质量与多样性,而现有方法要么计算复杂度高(如 $ O(n^3) $ 的传统启发式算法),要么多样性不足且依赖复杂的外部机制。本文的关键创新在于提出一种两阶段混合框架:首先使用一个增强的图指针网络(Graph Pointer Network, GPN)结合近似序列熵损失函数,快速生成大量高质量且多样化的候选路径;其次采用贪心算法实现对这些路径的2-近似最优选择,从而得到最终 $ k $ 条最大差异的路径。该方案不仅在多样性指标(如Jaccard指数)上显著优于现有方法(例如在柏林实例中达到0.015,远优于NMA的0.081),而且通过GPU加速实现了接近线性的运行时间复杂度 $ O(n) $,在大规模实例下效率提升超过360倍,兼具高效性与简洁性。

链接: https://arxiv.org/abs/2601.01132
作者: Hao-Hsung Yang,Ssu-Yuan Lo,Kuan-Lun Chen,Ching-Kai Wang
机构: 未知
类目: Computational Geometry (cs.CG); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We address the Diverse Traveling Salesman Problem (D-TSP), a bi-criteria optimization challenge that seeks a set of k distinct TSP tours. The objective requires every selected tour to have a length at most c|T^| (where |T^| is the optimal tour length) while minimizing the average Jaccard similarity across all tour pairs. This formulation is crucial for applications requiring both high solution quality and fault tolerance, such as logistics planning, robotics pathfinding or strategic patrolling. Current methods are limited: traditional heuristics, such as the Niching Memetic Algorithm (NMA) or bi-criteria optimization, incur high computational complexity O(n^3) , while modern neural approaches (e.g., RF-MA3S) achieve limited diversity quality and rely on complex, external mechanisms. To overcome these limitations, we propose a novel hybrid framework that decomposes D-TSP into two efficient steps. First, we utilize a simple Graph Pointer Network (GPN), augmented with an approximated sequence entropy loss, to efficiently sample a large, diverse pool of high-quality tours. This simple modification effectively controls the quality-diversity trade-off without complex external mechanisms. Second, we apply a greedy algorithm that yields a 2-approximation for the dispersion problem to select the final k maximally diverse tours from the generated pool. Our results demonstrate state-of-the-art performance. On the Berlin instance, our model achieves an average Jaccard index of 0.015 , significantly outperforming NMA ( 0.081 ) and RF-MA3S. By leveraging GPU acceleration, our GPN structure achieves a near-linear empirical runtime growth of O(n) . While maintaining solution diversity comparable to complex bi-criteria algorithms, our approach is over 360 times faster on large-scale instances (783 cities), delivering high-quality TSP solutions with unprecedented efficiency and simplicity. Subjects: Computational Geometry (cs.CG); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2601.01132 [cs.CG] (or arXiv:2601.01132v1 [cs.CG] for this version) https://doi.org/10.48550/arXiv.2601.01132 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-94] Wittgensteins Family Resemblance Clustering Algorithm

【速读】:该论文旨在解决传统聚类算法在处理非线性结构数据时存在的局限性,尤其是对簇数量和形状的先验假设限制。其解决方案的关键在于引入维特根斯坦(Wittgenstein)的“家族相似性”(family resemblance)哲学概念,构建一种基于图结构的聚类算法——Wittgenstein家族相似性(WFR)聚类算法及其核版本(kernel WFR)。该方法通过计算邻近数据点间的相似度得分并设定阈值形成相似性图,进而以图的连通分量作为最终聚类结果,无需预先指定簇的数量或形状假设,从而有效应对复杂非线性数据分布。

链接: https://arxiv.org/abs/2601.01127
作者: Golbahar Amanpour,Benyamin Ghojogh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:This paper, introducing a novel method in philomatics, draws on Wittgenstein’s concept of family resemblance from analytic philosophy to develop a clustering algorithm for machine learning. According to Wittgenstein’s Philosophical Investigations (1953), family resemblance holds that members of a concept or category are connected by overlapping similarities rather than a single defining property. Consequently, a family of entities forms a chain of items sharing overlapping traits. This philosophical idea naturally lends itself to a graph-based approach in machine learning. Accordingly, we propose the Wittgenstein’s Family Resemblance (WFR) clustering algorithm and its kernel variant, kernel WFR. This algorithm computes resemblance scores between neighboring data instances, and after thresholding these scores, a resemblance graph is constructed. The connected components of this graph define the resulting clusters. Simulations on benchmark datasets demonstrate that WFR is an effective nonlinear clustering algorithm that does not require prior knowledge of the number of clusters or assumptions about their shapes.
zh

[AI-95] Learning from Historical Activations in Graph Neural Networks

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)中池化操作的局限性问题,即传统方法仅使用最后一层GNN的节点特征作为池化或分类器输入,忽略了前序层产生的历史图激活(historical graph activations),这在深层网络中尤为显著,且易受过平滑(over-smoothing)等图结构挑战的影响。解决方案的关键在于提出HISTOGRAPH,一种两阶段基于注意力机制的最终聚合层:首先通过统一的层间注意力(layer-wise attention)整合各中间层的节点激活信息,再通过节点级注意力(node-wise attention)进一步优化特征表示。该方法能够建模节点表示随层数演化的动态过程,从而有效融合节点的历史激活与图结构信息,提升最终预测的准确性与鲁棒性,尤其在深度GNN架构中表现突出。

链接: https://arxiv.org/abs/2601.01123
作者: Yaniv Galron,Hadar Sinai,Haggai Maron,Moshe Eliasof
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have demonstrated remarkable success in various domains such as social networks, molecular chemistry, and more. A crucial component of GNNs is the pooling procedure, in which the node features calculated by the model are combined to form an informative final descriptor to be used for the downstream task. However, previous graph pooling schemes rely on the last GNN layer features as an input to the pooling or classifier layers, potentially under-utilizing important activations of previous layers produced during the forward pass of the model, which we regard as historical graph activations. This gap is particularly pronounced in cases where a node’s representation can shift significantly over the course of many graph neural layers, and worsened by graph-specific challenges such as over-smoothing in deep architectures. To bridge this gap, we introduce HISTOGRAPH, a novel two-stage attention-based final aggregation layer that first applies a unified layer-wise attention over intermediate activations, followed by node-wise attention. By modeling the evolution of node representations across layers, our HISTOGRAPH leverages both the activation history of nodes and the graph structure to refine features used for final prediction. Empirical results on multiple graph classification benchmarks demonstrate that HISTOGRAPH offers strong performance that consistently improves traditional techniques, with particularly strong robustness in deep GNNs.
zh

[AI-96] ScienceDB AI: An LLM -Driven Agent ic Recommender System for Large-Scale Scientific Data Sharing Services

【速读】:该论文旨在解决科学数据集(scientific datasets)在共享与利用过程中面临的推荐效率低下问题,尤其针对传统协同过滤推荐方法难以处理领域知识复杂性和上下文依赖性的局限。其核心解决方案是提出ScienceDB AI——一个基于大语言模型(Large Language Models, LLMs)的智能代理式推荐系统,关键创新在于:1)科学意图感知器(Scientific Intention Perceptor),用于从复杂查询中提取结构化的实验要素;2)结构化记忆压缩器(Structured Memory Compressor),实现多轮对话的有效管理;3)可信检索增强生成框架(Trustworthy Retrieval-Augmented Generation, Trustworthy RAG),通过两阶段检索机制和可引用的数据集标识符(Citable Scientific Task Record, CSTR)提升推荐的可信度与可复现性。

链接: https://arxiv.org/abs/2601.01118
作者: Qingqing Long,Haotian Chen,Chenyang Zhao,Xiaolei Du,Xuezhi Wang,Pengyao Wang,Chengzan Li,Yuanchun Zhou,Hengshu Zhu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
备注: 12 pages, 9 figures

点击查看摘要

Abstract:The rapid growth of AI for Science (AI4S) has underscored the significance of scientific datasets, leading to the establishment of numerous national scientific data centers and sharing platforms. Despite this progress, efficiently promoting dataset sharing and utilization for scientific research remains challenging. Scientific datasets contain intricate domain-specific knowledge and contexts, rendering traditional collaborative filtering-based recommenders inadequate. Recent advances in Large Language Models (LLMs) offer unprecedented opportunities to build conversational agents capable of deep semantic understanding and personalized recommendations. In response, we present ScienceDB AI, a novel LLM-driven agentic recommender system developed on Science Data Bank (ScienceDB), one of the largest global scientific data-sharing platforms. ScienceDB AI leverages natural language conversations and deep reasoning to accurately recommend datasets aligned with researchers’ scientific intents and evolving requirements. The system introduces several innovations: a Scientific Intention Perceptor to extract structured experimental elements from complicated queries, a Structured Memory Compressor to manage multi-turn dialogues effectively, and a Trustworthy Retrieval-Augmented Generation (Trustworthy RAG) framework. The Trustworthy RAG employs a two-stage retrieval mechanism and provides citable dataset references via Citable Scientific Task Record (CSTR) identifiers, enhancing recommendation trustworthiness and reproducibility. Through extensive offline and online experiments using over 10 million real-world datasets, ScienceDB AI has demonstrated significant effectiveness. To our knowledge, ScienceDB AI is the first LLM-driven conversational recommender tailored explicitly for large-scale scientific dataset sharing services. The platform is publicly accessible at: this https URL.
zh

[AI-97] SoulSeek: Exploring the Use of Social Cues in LLM -based Information Seeking

【速读】:该论文试图解决当前大语言模型(Large Language Model, LLM)驱动的搜索系统过度依赖语义特征、忽视社会线索(social cues)所导致的认知错位问题,即现有系统未能充分模拟人类在自然信息获取过程中对他人存在、行为或身份等社会因素的敏感性。其解决方案的关键在于将社会线索整合进LLM-based搜索系统中,通过设计原型系统SoulSeek并结合用户工作坊、对照实验与混合方法分析,验证了社会线索能够提升用户的感知结果与体验质量,促进反思性信息行为,并揭示了当前LLM搜索系统的局限性,从而提出增强社会知识理解、支持个性化线索设置和可控交互的设计原则。

链接: https://arxiv.org/abs/2601.01094
作者: Yubo Shu,Peng Zhang,Meng Wu,Yan Chen,Haoxuan Zhou,Guanming Liu,Yu Zhang,Liuxin Zhang,Qianying Wang,Tun Lu,Ning Gu
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Social cues, which convey others’ presence, behaviors, or identities, play a crucial role in human information seeking by helping individuals judge relevance and trustworthiness. However, existing LLM-based search systems primarily rely on semantic features, creating a misalignment with the socialized cognition underlying natural information seeking. To address this gap, we explore how the integration of social cues into LLM-based search influences users’ perceptions, experiences, and behaviors. Focusing on social media platforms that are beginning to adopt LLM-based search, we integrate design workshops, the implementation of the prototype system (SoulSeek), a between-subjects study, and mixed-method analyses to examine both outcome- and process-level findings. The workshop informs the prototype’s cue-integrated design. The study shows that social cues improve perceived outcomes and experiences, promote reflective information behaviors, and reveal limits of current LLM-based search. We propose design implications emphasizing better social-knowledge understanding, personalized cue settings, and controllable interactions.
zh

[AI-98] Harm in AI-Driven Societies: An Audit of Toxicity Adoption on Chirper.ai

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)驱动的智能体在完全由AI构成的社交平台中,长期暴露于有害内容后如何演化出毒性行为的问题。现有研究多关注LLM生成有毒内容的现象,但缺乏对交互环境中毒性行为随时间累积效应的实证分析,尤其在纯AI互动生态下尚属空白。其解决方案的关键在于通过可观测的交互行为(即刺激性帖子与响应性评论)来操作化“暴露”概念,并基于大规模实证数据建模毒性响应与刺激毒性之间的关系、重复暴露的影响以及毒性行为的可预测性。研究引入两个核心指标——影响驱动响应率(Influence-Driven Response Rate)和自发响应率(Spontaneous Response Rate),揭示了诱导型毒性与自发性毒性之间存在显著权衡;同时发现仅凭接触的毒性强刺激数量即可高精度预测个体智能体是否最终产生毒性输出,表明监测暴露内容是识别和缓解AI代理有害行为的一种轻量而有效的机制。

链接: https://arxiv.org/abs/2601.01090
作者: Erica Coppolillo,Luca Luceri,Emilio Ferrara
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly embedded in autonomous agents that participate in online social ecosystems, where interactions are sequential, cumulative, and only partially controlled. While prior work has documented the generation of toxic content by LLMs, far less is known about how exposure to harmful content shapes agent behavior over time, particularly in environments composed entirely of interacting AI agents. In this work, we study toxicity adoption of LLM-driven agents on this http URL, a fully AI-driven social platform. Specifically, we model interactions in terms of stimuli (posts) and responses (comments), and by operationalizing exposure through observable interactions rather than inferred recommendation mechanisms. We conduct a large-scale empirical analysis of agent behavior, examining how response toxicity relates to stimulus toxicity, how repeated exposure affects the likelihood of toxic responses, and whether toxic behavior can be predicted from exposure alone. Our findings show that while toxic responses are more likely following toxic stimuli, a substantial fraction of toxicity emerges spontaneously, independent of exposure. At the same time, cumulative toxic exposure significantly increases the probability of toxic responding. We further introduce two influence metrics, the Influence-Driven Response Rate and the Spontaneous Response Rate, revealing a strong trade-off between induced and spontaneous toxicity. Finally, we show that the number of toxic stimuli alone enables accurate prediction of whether an agent will eventually produce toxic content. These results highlight exposure as a critical risk factor in the deployment of LLM agents and suggest that monitoring encountered content may provide a lightweight yet effective mechanism for auditing and mitigating harmful behavior in the wild. Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY) Cite as: arXiv:2601.01090 [cs.MA] (or arXiv:2601.01090v1 [cs.MA] for this version) https://doi.org/10.48550/arXiv.2601.01090 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-99] Scalable Data-Driven Reachability Analysis and Control via Koopman Operators with Conformal Coverag e Guarantees

【速读】:该论文旨在解决未知非线性动态系统在概率意义上的数据驱动安全性验证问题,即如何在不完全掌握系统模型的情况下,保证闭环轨迹在给定置信水平下始终处于安全范围内。其解决方案的关键在于结合Koopman理论与神经网络(NN)提升函数,将非线性动力学近似为线性表示,并在此空间中设计线性控制器以实现对参考轨迹分布的闭环跟踪;同时利用可计算的可达集映射回原始状态空间,并通过同构预测(conformal prediction)生成统计上有效的误差边界,从而对可达集进行膨胀以确保真实轨迹被包含在指定概率内,且该边界具有跨参考轨迹的泛化能力,无需重复计算。

链接: https://arxiv.org/abs/2601.01076
作者: Devesh Nath,Haoran Yin,Glen Chou
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Optimization and Control (math.OC)
备注: Under review, 28 pages, 12 figures

点击查看摘要

Abstract:We propose a scalable reachability-based framework for probabilistic, data-driven safety verification of unknown nonlinear dynamics. We use Koopman theory with a neural network (NN) lifting function to learn an approximate linear representation of the dynamics and design linear controllers in this space to enable closed-loop tracking of a reference trajectory distribution. Closed-loop reachable sets are efficiently computed in the lifted space and mapped back to the original state space via NN verification tools. To capture model mismatch between the Koopman dynamics and the true system, we apply conformal prediction to produce statistically-valid error bounds that inflate the reachable sets to ensure the true trajectories are contained with a user-specified probability. These bounds generalize across references, enabling reuse without recomputation. Results on high-dimensional MuJoCo tasks (11D Hopper, 28D Swimmer) and 12D quadcopters show improved reachable set coverage rate, computational efficiency, and conservativeness over existing methods.
zh

[AI-100] Gendered Pathways in AI Companionship: Cross-Community Behavior and Toxicity Patterns on Reddit

【速读】:该论文旨在解决生成式 AI (Generative AI) 伴侣平台如何影响用户跨社区行为及其性别差异,特别是这些互动路径中毒性内容和情感表达的分布特征。其解决方案的关键在于构建一个涵盖超过2,000个子版块的历史交互网络,并对3,000余名高度活跃用户的两年活动轨迹进行重建与分析,从而识别出AI伴侣相关社区的主要周边参与结构(包括AI伴侣、色情相关内容、论坛类及游戏类社区),并揭示女性用户在其中的显著参与度以及局部高毒性的聚集现象,尤其在性别导向型子版块中,发现少数性别相关路径会显著放大毒性水平,为平台治理与设计提供实证依据。

链接: https://arxiv.org/abs/2601.01073
作者: Erica Coppolillo,Emilio Ferrara
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI-companionship platforms are rapidly reshaping how people form emotional, romantic, and parasocial bonds with non-human agents, raising new questions about how these relationships intersect with gendered online behavior and exposure to harmful content. Focusing on the MyBoyfriendIsAI (MBIA) subreddit, we reconstruct the Reddit activity histories of more than 3,000 highly engaged users over two years, yielding over 67,000 historical submissions. We then situate MBIA within a broader ecosystem by building a historical interaction network spanning more than 2,000 subreddits, which enables us to trace cross-community pathways and measure how toxicity and emotional expression vary across these trajectories. We find that MBIA users primarily traverse four surrounding community spheres (AI-companionship, porn-related, forum-like, and gaming) and that participation across the ecosystem exhibits a distinct gendered structure, with substantial engagement by female users. While toxicity is generally low across most pathways, we observe localized spikes concentrated in a small subset of AI-porn and gender-oriented communities. Nearly 16% of users engage with gender-focused subreddits, and their trajectories display systematically different patterns of emotional expression and elevated toxicity, suggesting that a minority of gendered pathways may act as toxicity amplifiers within the broader AI-companionship ecosystem. These results characterize the gendered structure of cross-community participation around AI companionship on Reddit and highlight where risks concentrate, informing measurement, moderation, and design practices for human-AI relationship platforms.
zh

[AI-101] A UCB Bandit Algorithm for General ML-Based Estimators

【速读】:该论文旨在解决将复杂机器学习(Machine Learning, ML)模型集成到多臂老虎机(Multi-Armed Bandit, MAB)框架中时所面临的挑战,即缺乏可用于指导探索的可处理集中不等式(concentration inequalities),从而难以实现理论保障的序贯决策。解决方案的关键在于直接建模底层估计器的学习曲线行为:假设均方误差(Mean Squared Error, MSE)随训练样本数呈幂律衰减,作者推导出一种广义的集中不等式,并证明所提出的ML-UCB算法能够实现次线性累积遗憾(sublinear regret)。这一方法使得任何学习曲线可经验刻画的ML模型均可被合理整合,无需针对特定模型进行额外理论分析。

链接: https://arxiv.org/abs/2601.01061
作者: Yajing Liu,Erkao Bao,Linqi Song
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Probability (math.PR)
备注: 15 pages, 4 figures, 1 table, Multi-Arm bandit, psi-UCB, generalized machine learning models

点击查看摘要

Abstract:We present ML-UCB, a generalized upper confidence bound algorithm that integrates arbitrary machine learning models into multi-armed bandit frameworks. A fundamental challenge in deploying sophisticated ML models for sequential decision-making is the lack of tractable concentration inequalities required for principled exploration. We overcome this limitation by directly modeling the learning curve behavior of the underlying estimator. Specifically, assuming the Mean Squared Error decreases as a power law in the number of training samples, we derive a generalized concentration inequality and prove that ML-UCB achieves sublinear regret. This framework enables the principled integration of any ML model whose learning curve can be empirically characterized, eliminating the need for model-specific theoretical analysis. We validate our approach through experiments on a collaborative filtering recommendation system using online matrix factorization with synthetic data designed to simulate a simplified two-tower model, demonstrating substantial improvements over LinUCB
zh

[AI-102] Improving Variational Autoencoder using Random Fourier Transformation: An Aviation Safety Anomaly Detection Case-Study

【速读】:该论文旨在解决深度神经网络(DNN)在训练过程中对高频特征学习滞后的问题,以及传统自编码器(Autoencoder, AE)和变分自编码器(Variational Autoencoder, VAE)在基于重构的异常检测任务中性能受限的问题。其核心解决方案是引入随机傅里叶变换(Random Fourier Transformation, RFT),通过频域特性使模型在训练初期即可同时学习低频与高频信息,而非传统DNN从低频逐步过渡到高频的学习模式。此外,作者提出一种可训练的RFT变体,利用现有计算图优化傅里叶变换的扩展参数,从而增强模型表达能力。实验结果表明,使用RFT的模型在低维数据表示和高维航空安全数据(Dashlink)的重构异常检测任务中均优于传统方法,但可训练RFT相较于随机RFT的优势尚不明确。

链接: https://arxiv.org/abs/2601.01016
作者: Ata Akbari Asanjan,Milad Memarzadeh,Bryan Matthews,Nikunj Oza
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:In this study, we focus on the training process and inference improvements of deep neural networks (DNNs), specifically Autoencoders (AEs) and Variational Autoencoders (VAEs), using Random Fourier Transformation (RFT). We further explore the role of RFT in model training behavior using Frequency Principle (F-Principle) analysis and show that models with RFT turn to learn low frequency and high frequency at the same time, whereas conventional DNNs start from low frequency and gradually learn (if successful) high-frequency features. We focus on reconstruction-based anomaly detection using autoencoder and variational autoencoder and investigate the RFT’s role. We also introduced a trainable variant of RFT that uses the existing computation graph to train the expansion of RFT instead of it being random. We showcase our findings with two low-dimensional synthetic datasets for data representation, and an aviation safety dataset, called Dashlink, for high-dimensional reconstruction-based anomaly detection. The results indicate the superiority of models with Fourier transformation compared to the conventional counterpart and remain inconclusive regarding the benefits of using trainable Fourier transformation in contrast to the Random variant.
zh

[AI-103] Geometric and Dynamic Scaling in Deep Transformers

【速读】:该论文旨在解决深度Transformer架构在极端深度下出现的表征退化(representational degeneracy)问题,即随着层数增加,特征表示趋于冗余、秩降低并最终崩溃的现象。尽管现有研究多归因于优化不稳定性或梯度消失,但该现象在现代归一化和初始化策略下依然存在,表明其根本原因并非优化问题,而是几何层面的缺陷。解决方案的关键在于提出一个统一的几何框架——曼ifold-Geometric Transformer (MGT),通过两个正交原则实现:一是流形约束超连接(manifold-constrained hyper-connections),将残差更新限制在局部切空间方向以防止语义流形漂移;二是深度增量学习(deep delta learning),引入数据依赖的非单调更新机制,允许对冗余特征进行反射与擦除,而非无条件累积。这两个机制共同解耦了特征更新的方向与符号,从而实现深度上的稳定几何演化。

链接: https://arxiv.org/abs/2601.01014
作者: Haoran Su,Chenyu You
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Research Proposal Only

点击查看摘要

Abstract:Despite their empirical success, pushing Transformer architectures to extreme depth often leads to a paradoxical failure: representations become increasingly redundant, lose rank, and ultimately collapse. Existing explanations largely attribute this phenomenon to optimization instability or vanishing gradients, yet such accounts fail to explain why collapse persists even under modern normalization and initialization schemes. In this paper, we argue that the collapse of deep Transformers is fundamentally a geometric problem. Standard residual updates implicitly assume that feature accumulation is always beneficial, but offer no mechanism to constrain update directions or to erase outdated information. As depth increases, this leads to systematic drift off the semantic manifold and monotonic feature accumulation, causing representational degeneracy. We propose a unified geometric framework that addresses these failures through two orthogonal principles. First, manifold-constrained hyper-connections restrict residual updates to valid local tangent directions, preventing uncontrolled manifold drift. Second, deep delta learning introduces data-dependent, non-monotonic updates that enable reflection and erasure of redundant features rather than their unconditional accumulation. Together, these mechanisms decouple the direction and sign of feature updates, yielding a stable geometric evolution across depth. We term the resulting architecture the Manifold-Geometric Transformer (MGT). Our analysis predicts that enforcing geometric validity while allowing dynamic erasure is essential for avoiding rank collapse in ultra-deep networks. We outline an evaluation protocol for Transformers exceeding 100 layers to test the hypothesis that geometry, rather than depth itself, is the key limiting factor in deep representation learning.
zh

[AI-104] Data-Driven Assessment of Concrete Mixture Compositions on Chloride Transport via Standalone Machine Learning Algorithms

【速读】:该论文旨在解决混凝土结构中氯离子迁移过程的时变特性与混凝土配合比之间复杂非线性关系的建模问题,以更精准地评估在恶劣环境条件下土木基础设施的服役寿命。其解决方案的关键在于采用数据驱动的机器学习(Machine Learning, ML)方法,通过对比多种简单和复杂的独立算法(如线性回归、K近邻回归、核岭回归、支持向量回归、高斯过程回归及多层感知机和门控循环单元等神经网络模型),识别并量化不同配合比组分对氯离子扩散行为的影响规律。其中,高斯过程回归(GPR)展现出最佳的可解释性和准确性,能够揭示隐藏的物理关联趋势;而多层感知机(MLP)和核岭回归(KRR)也提供了可靠的总体趋势估计,验证了代理模型在描述氯离子侵入物理机制方面的潜力,从而为提升基础设施耐久性设计提供科学依据。

链接: https://arxiv.org/abs/2601.01009
作者: Mojtaba Aliasghar-Mamaghani,Mohammadreza Khalafi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:This paper employs a data-driven approach to determine the impact of concrete mixture compositions on the temporal evolution of chloride in concrete structures. This is critical for assessing the service life of civil infrastructure subjected to aggressive environments. The adopted methodology relies on several simple and complex standalone machine learning (ML) algorithms, with the primary objective of establishing confidence in the unbiased prediction of the underlying hidden correlations. The simple algorithms include linear regression (LR), k-nearest neighbors (KNN) regression, and kernel ridge regression (KRR). The complex algorithms entail support vector regression (SVR), Gaussian process regression (GPR), and two families of artificial neural networks, including a feedforward network (multilayer perceptron, MLP) and a gated recurrent unit (GRU). The MLP architecture cannot explicitly handle sequential data, a limitation addressed by the GRU. A comprehensive dataset is considered. The performance of ML algorithms is evaluated, with KRR, GPR, and MLP exhibiting high accuracy. Given the diversity of the adopted concrete mixture proportions, the GRU was unable to accurately reproduce the response in the test set. Further analyses elucidate the contributions of mixture compositions to the temporal evolution of chloride. The results obtained from the GPR model unravel latent correlations through clear and explainable trends. The MLP, SVR, and KRR also provide acceptable estimates of the overall trends. The majority of mixture components exhibit an inverse relation with chloride content, while a few components demonstrate a direct correlation. These findings highlight the potential of surrogate approaches for describing the physical processes involved in chloride ingress and the associated correlations, toward the ultimate goal of enhancing the service life of civil infrastructure.
zh

[AI-105] VEAT Quantifies Implicit Associations in Text-to-Video Generator Sora and Reveals Challenges in Bias Mitigation

【速读】:该论文旨在解决生成式 AI(Generative AI)在文本到视频(Text-to-Video, T2V)生成过程中可能引入并放大社会偏见的问题,特别是种族(非洲裔 vs. 欧洲裔)和性别(女性 vs. 男性)与情绪效价(愉悦 vs. 不愉悦)之间的关联性是否反映现实中的不平等分布。其解决方案的关键在于提出并验证了两种新的评估方法:视频嵌入关联测试(Video Embedding Association Test, VEAT)及其单类别变体(SC-VEAT),通过量化T2V模型(如Sora)对17个职业和7个奖项中不同群体的效价关联强度,发现模型输出存在显著的代表偏差,并揭示了这些偏差与真实世界人口统计分布高度相关(相关系数达0.83–0.99)。此外,研究还指出显式去偏提示虽可降低效应量,但可能引发反作用,例如使某些原本已受污名化的群体(如清洁工、邮政服务从业者)进一步强化负面关联,凸显出在部署T2V生成器时必须进行系统性评估与负责任设计的重要性。

链接: https://arxiv.org/abs/2601.00996
作者: Yongxu Sun,Michael Saxon,Ian Yang,Anna-Maria Gueorguieva,Aylin Caliskan
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: The International Association for Safe Ethical AI (IASEAI)

点击查看摘要

Abstract:Text-to-Video (T2V) generators such as Sora raise concerns about whether generated content reflects societal bias. We extend embedding-association tests from words and images to video by introducing the Video Embedding Association Test (VEAT) and Single-Category VEAT (SC-VEAT). We validate these methods by reproducing the direction and magnitude of associations from widely used baselines, including Implicit Association Test (IAT) scenarios and OASIS image categories. We then quantify race (African American vs. European American) and gender (women vs. men) associations with valence (pleasant vs. unpleasant) across 17 occupations and 7 awards. Sora videos associate European Americans and women more with pleasantness (both d0.8). Effect sizes correlate with real-world demographic distributions: percent men and White in occupations (r=0.93, r=0.83) and percent male and non-Black among award recipients (r=0.88, r=0.99). Applying explicit debiasing prompts generally reduces effect-size magnitudes, but can backfire: two Black-associated occupations (janitor, postal service) become more Black-associated after debiasing. Together, these results reveal that easily accessible T2V generators can actually amplify representational harms if not rigorously evaluated and responsibly deployed.
zh

[AI-106] ElecTwit: A Framework for Studying Persuasion in Multi-Agent Social Systems

【速读】:该论文试图解决现有研究中用于模拟说服行为的多智能体系统(Multi-Agent Systems, MAS)缺乏现实环境真实性的问题,尤其是传统基于游戏的仿真方法难以准确反映社交媒体平台在政治选举中的复杂互动。其解决方案的关键在于构建ElecTwit这一仿真框架,通过在真实社会媒体语境下运行大语言模型(Large Language Models, LLMs),系统性地观察和分析不同模型在使用25种具体说服技巧时的表现差异,从而揭示模型架构与训练方式对社交说服动态的影响,并识别出如“真相核心”信息和“书写执念”等新颖现象,为评估LLM代理在现实场景中的说服能力提供了可扩展且具解释性的基础。

链接: https://arxiv.org/abs/2601.00994
作者: Michael Bao
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: In proceedings of 2025 IEEE International Conference on Agentic AI (ICA)

点击查看摘要

Abstract:This paper introduces ElecTwit, a simulation framework designed to study persuasion within multi-agent systems, specifically emulating the interactions on social media platforms during a political election. By grounding our experiments in a realistic environment, we aimed to overcome the limitations of game-based simulations often used in prior research. We observed the comprehensive use of 25 specific persuasion techniques across most tested LLMs, encompassing a wider range than previously reported. The variations in technique usage and overall persuasion output between models highlight how different model architectures and training can impact the dynamics in realistic social simulations. Additionally, we observed unique phenomena such as “kernel of truth” messages and spontaneous developments with an “ink” obsession, where agents collectively demanded written proof. Our study provides a foundation for evaluating persuasive LLM agents in real-world contexts, ensuring alignment and preventing dangerous outcomes.
zh

[AI-107] Value Vision-Language-Action Planning Search

【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人操作任务中因依赖行为克隆(Behavior Cloning)而导致的分布偏移(distribution shift)下脆弱性问题。现有方法虽通过引入测试时搜索算法如蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)缓解此类问题,但其仅依赖VLA先验进行引导,缺乏对预期未来回报的具身估计(grounded estimate of expected future return),导致在VLA先验不准确时,只能依靠探索项修正动作选择,而这一过程需大量仿真才能生效。解决方案的关键在于提出Value Vision-Language-Action Planning and Search (V-VLAPS)框架,该框架在MCTS中引入一个轻量且可学习的价值函数(value function),通过在固定VLA主干(Octo)的潜在表示上训练一个简单多层感知机(Multilayer Perceptron, MLP),为搜索提供显式的成功信号,从而引导动作选择偏向高价值区域,显著提升成功率并减少MCTS仿真次数。

链接: https://arxiv.org/abs/2601.00969
作者: Ali Salamatian, Ke (Steve)Ren,Kieran Pattison,Cyrus Neary
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have emerged as powerful generalist policies for robotic manipulation, yet they remain fundamentally limited by their reliance on behavior cloning, leading to brittleness under distribution shift. While augmenting pretrained models with test-time search algorithms like Monte Carlo Tree Search (MCTS) can mitigate these failures, existing formulations rely solely on the VLA prior for guidance, lacking a grounded estimate of expected future return. Consequently, when the prior is inaccurate, the planner can only correct action selection via the exploration term, which requires extensive simulation to become effective. To address this limitation, we introduce Value Vision-Language-Action Planning and Search (V-VLAPS), a framework that augments MCTS with a lightweight, learnable value function. By training a simple multilayer perceptron (MLP) on the latent representations of a fixed VLA backbone (Octo), we provide the search with an explicit success signal that biases action selection toward high-value regions. We evaluate V-VLAPS on the LIBERO robotic manipulation suite, demonstrating that our value-guided search improves success rates by over 5 percentage points while reducing the average number of MCTS simulations by 5-15 percent compared to baselines that rely only on the VLA prior.
zh

[AI-108] Adapting Feature Attenuation to NLP

【速读】:该论文旨在解决生成式 AI(Generative AI)在自然语言处理(Natural Language Processing, NLP)场景中面对未见类别时的开放集识别(Open-Set Recognition, OSR)问题,即模型在部署过程中遇到训练阶段未出现的类别输入时表现出的鲁棒性不足。其解决方案的关键在于将源自计算机视觉领域的特征衰减假设(feature attenuation hypothesis)引入到 Transformer 分类器中,并通过 COSTARR 框架进行适配与评估,同时对比最大 Softmax 概率(MSP)、MaxLogit 和温度缩放自由能(temperature-scaled free-energy)等评分方法在高类别数(176 个 arXiv 主题)下的表现,以验证迁移视觉 OSR 方法在 NLP 中的有效性与局限性。

链接: https://arxiv.org/abs/2601.00965
作者: Tianshuo Yang,Ryan Rabinowitz,Terrance E. Boult,Jugal Kalita
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transformer classifiers such as BERT deliver impressive closed-set accuracy, yet they remain brittle when confronted with inputs from unseen categories–a common scenario for deployed NLP systems. We investigate Open-Set Recognition (OSR) for text by porting the feature attenuation hypothesis from computer vision to transformers and by benchmarking it against state-of-the-art baselines. Concretely, we adapt the COSTARR framework–originally designed for classification in computer vision–to two modest language models (BERT (base) and GPT-2) trained to label 176 arXiv subject areas. Alongside COSTARR, we evaluate Maximum Softmax Probability (MSP), MaxLogit, and the temperature-scaled free-energy score under the OOSA and AUOSCR metrics. Our results show (i) COSTARR extends to NLP without retraining but yields no statistically significant gain over MaxLogit or MSP, and (ii) free-energy lags behind all other scores in this high-class-count setting. The study highlights both the promise and the current limitations of transplanting vision-centric OSR ideas to language models, and points toward the need for larger backbones and task-tailored attenuation strategies.
zh

[AI-109] Emoji-Based Jailbreaking of Large Language Models

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在安全对齐机制中易受emoji-based jailbreaking攻击的问题,即通过在文本提示中嵌入特定emoji序列诱导模型产生有害或不道德输出。其解决方案的关键在于系统性地评估不同开源LLM在emoji提示下的安全对齐表现,识别模型特异性漏洞,并揭示当前安全机制对emoji表示的处理不足,从而强调需在提示层面上建立更 robust 的安全与对齐防护策略。

链接: https://arxiv.org/abs/2601.00936
作者: M P V S Gopinadh,S Mahaboob Hussain
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 7 pages, 2 figures

点击查看摘要

Abstract:Large Language Models (LLMs) are integral to modern AI applications, but their safety alignment mechanisms can be bypassed through adversarial prompt engineering. This study investigates emoji-based jailbreaking, where emoji sequences are embedded in textual prompts to trigger harmful and unethical outputs from LLMs. We evaluated 50 emoji-based prompts on four open-source LLMs: Mistral 7B, Qwen 2 7B, Gemma 2 9B, and Llama 3 8B. Metrics included jailbreak success rate, safety alignment adherence, and latency, with responses categorized as successful, partial and failed. Results revealed model-specific vulnerabilities: Gemma 2 9B and Mistral 7B exhibited 10 % success rates, while Qwen 2 7B achieved full alignment (0% success). A chi-square test (chi^2 = 32.94, p 0.001) confirmed significant inter-model differences. While prior works focused on emoji attacks targeting safety judges or classifiers, our empirical analysis examines direct prompt-level vulnerabilities in LLMs. The results reveal limitations in safety mechanisms and highlight the necessity for systematic handling of emoji-based representations in prompt-level safety and alignment pipelines.
zh

[AI-110] LOFA: Online Influence Maximization under Full-Bandit Feedback using Lazy Forward Selection

【速读】:该论文旨在解决在线影响最大化(Online Influence Maximization, OIM)问题,即在固定时间范围内,每一轮选择一个满足基数预算约束的节点子集(称为种子集),以最大化累积期望影响力。该问题在全 bandit(full-bandit)反馈模型下进行建模,仅能观测到所选种子集的实际影响,无法获得网络结构或传播过程的额外信息。现有方法利用影响力函数的子模性(submodularity)来实现低遗憾(low regret)。本文的关键创新在于进一步挖掘该性质,提出懒惰在线前向算法(Lazy Online Forward Algorithm, LOFA),通过优化子模函数的增量计算策略显著降低实际遗憾(empirical regret),并在真实社交网络上的实验表明,LOFA 在累积遗憾和即时奖励方面均优于现有带宽算法。

链接: https://arxiv.org/abs/2601.00933
作者: Jinyu Xu,Abhishek K. Umrawal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: 14 pages and 6 figures

点击查看摘要

Abstract:We study the problem of influence maximization (IM) in an online setting, where the goal is to select a subset of nodes \unicodex2014 called the seed set \unicodex2014 at each time step over a fixed time horizon, subject to a cardinality budget constraint, to maximize the expected cumulative influence. We operate under a full-bandit feedback model, where only the influence of the chosen seed set at each time step is observed, with no additional structural information about the network or diffusion process. It is well-established that the influence function is submodular, and existing algorithms exploit this property to achieve low regret. In this work, we leverage this property further and propose the Lazy Online Forward Algorithm (LOFA), which achieves a lower empirical regret. We conduct experiments on a real-world social network to demonstrate that LOFA achieves superior performance compared to existing bandit algorithms in terms of cumulative regret and instantaneous reward.
zh

[AI-111] AlignUSER: Human-Aligned LLM Agents via World Models for Recommender System Evaluation

【速读】:该论文旨在解决推荐系统评估中因离线指标与真实用户行为之间存在差距,以及交互数据稀缺所导致的挑战。现有方法虽尝试利用大语言模型(Large Language Model, LLM)代理作为合成用户,但通常依赖少样本提示(few-shot prompting),难以深入理解环境,限制了其对人类行为的忠实模拟。解决方案的关键在于提出AlignUSER框架,该框架通过从人类交互数据中学习世界模型驱动的代理:首先将世界建模形式化为下一状态预测任务,使代理内化环境结构;其次,通过生成反事实轨迹并引导LLM对比自身决策与人类选择,识别次优行为并提取经验教训,从而实现更贴近人类人格特征的动作对齐。最终,训练得到的策略用于驱动代理与推荐系统的交互,在微观和宏观层面均展现出优于先前方法的人类行为一致性。

链接: https://arxiv.org/abs/2601.00930
作者: Nicolas Bougie,Gian Maria Marconi,Tony Yip,Narimasa Watanabe
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Evaluating recommender systems remains challenging due to the gap between offline metrics and real user behavior, as well as the scarcity of interaction data. Recent work explores large language model (LLM) agents as synthetic users, yet they typically rely on few-shot prompting, which yields a shallow understanding of the environment and limits their ability to faithfully reproduce user actions. We introduce AlignUSER, a framework that learns world-model-driven agents from human interactions. Given rollout sequences of actions and states, we formalize world modeling as a next state prediction task that helps the agent internalize the environment. To align actions with human personas, we generate counterfactual trajectories around demonstrations and prompt the LLM to compare its decisions with human choices, identify suboptimal actions, and extract lessons. The learned policy is then used to drive agent interactions with the recommender system. We evaluate AlignUSER across multiple datasets and demonstrate closer alignment with genuine humans than prior work, both at the micro and macro levels.
zh

[AI-112] MACA: A Framework for Distilling Trustworthy LLM s into Efficient Retrievers

【速读】:该论文旨在解决企业级检索系统在处理短且信息不足的查询(如“foreign transaction fee refund”)时,如何高效利用语义细微差别和元数据(metadata)进行精准排序的问题。传统方法依赖每查询调用大语言模型(Large Language Model, LLM)重排序或人工标注,成本高昂。解决方案的关键在于提出Metadata-Aware Cross-Model Alignment (MACA),通过知识蒸馏将一个经过校准的元数据感知LLM重排序器(teacher)压缩为轻量级学生检索器(student),避免在线调用LLM;其核心机制包括:1)设计元数据感知提示(prompt)验证教师输出的可信度(通过排列一致性与改写鲁棒性检验),并提供列表级分数、硬负样本及校准的相关性边界;2)学生采用MetaFusion目标函数,融合元数据条件排名损失与跨模型边界损失,使模型学会将正确答案推至语义相近但主题、子主题或实体不匹配的候选者之上。实验证明,MACA教师显著优于基线,而学生模型在保持零LLM推理开销的同时,在准确率@1上大幅提升。

链接: https://arxiv.org/abs/2601.00926
作者: Satya Swaroop Gudipudi,Sahil Girhepuje,Ponnurangam Kumaraguru,Kristine Ma
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern enterprise retrieval systems must handle short, underspecified queries such as foreign transaction fee refund'' and recent check status’'. In these cases, semantic nuance and metadata matter but per-query large language model (LLM) re-ranking and manual labeling are costly. We present Metadata-Aware Cross-Model Alignment (MACA), which distills a calibrated metadata aware LLM re-ranker into a compact student retriever, avoiding online LLM calls. A metadata-aware prompt verifies the teacher’s trustworthiness by checking consistency under permutations and robustness to paraphrases, then supplies listwise scores, hard negatives, and calibrated relevance margins. The student trains with MACA’s MetaFusion objective, which combines a metadata conditioned ranking loss with a cross model margin loss so it learns to push the correct answer above semantically similar candidates with mismatched topic, sub-topic, or entity. On a proprietary consumer banking FAQ corpus and BankFAQs, the MACA teacher surpasses a MAFA baseline at Accuracy@1 by five points on the proprietary set and three points on BankFAQs. MACA students substantially outperform pretrained encoders; e.g., on the proprietary corpus MiniLM Accuracy@1 improves from 0.23 to 0.48, while keeping inference free of LLM calls and supporting retrieval-augmented generation.
zh

[AI-113] Complexity-based code embeddings

【速读】:该论文旨在解决如何将不同算法的源代码转化为可量化的数值嵌入表示(numerical embeddings)这一问题,以支持后续机器学习任务中的特征提取与模型训练。其解决方案的关键在于通过动态分析程序在不同输入下的行为,并针对所测量的性能指标定制多种通用复杂度函数(generic complexity functions),从而构建基于r-Complexity的代码嵌入方法。该方法实现了对XGBoost算法的高效编码,在包含11个类别的多标签数据集上取得了较高的平均F1-score,验证了其在真实代码片段(来自Codeforces编程竞赛平台)上的有效性。

链接: https://arxiv.org/abs/2601.00924
作者: Rares Folea,Radu Iacob,Emil Slusanschi,Traian Rebedea
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents a generic method for transforming the source code of various algorithms to numerical embeddings, by dynamically analysing the behaviour of computer programs against different inputs and by tailoring multiple generic complexity functions for the analysed metrics. The used algorithms embeddings are based on r-Complexity . Using the proposed code embeddings, we present an implementation of the XGBoost algorithm that achieves an average F1-score on a multi-label dataset with 11 classes, built using real-world code snippets submitted for programming competitions on the Codeforces platform.
zh

[AI-114] Context Collapse: In-Context Learning and Model Collapse

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)中的两个核心问题:上下文学习(In-Context Learning, ICL)的机制与模型崩溃(Model Collapse)现象。针对ICL,研究通过分析带有权重绑定的线性Transformer在拟合线性回归任务时的行为,发现最小化上下文损失会导致参数解发生相变——当上下文长度超过临界值时,最优解中会出现反对称分量;其关键在于将前向传播等价于预条件梯度下降,并证明最优预条件器包含反对称项,从而诱导梯度方向旋转。对于模型崩溃,作者基于鞅理论和随机游走方法,在线性回归与高斯拟合的简化场景下,严格证明了几乎必然收敛性,指出只有当数据增长足够快或被持续保留时才能避免崩溃;此外,论文提出“上下文崩溃”(Context Collapse)概念,揭示长序列生成中上下文质量退化的问题,将ICL动态与生成模型长期稳定性挑战联系起来。

链接: https://arxiv.org/abs/2601.00923
作者: Josef Ott
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Master’s thesis

点击查看摘要

Abstract:This thesis investigates two key phenomena in large language models (LLMs): in-context learning (ICL) and model collapse. We study ICL in a linear transformer with tied weights trained on linear regression tasks, and show that minimising the in-context loss leads to a phase transition in the learned parameters. Above a critical context length, the solution develops a skew-symmetric component. We prove this by reducing the forward pass of the linear transformer under weight tying to preconditioned gradient descent, and then analysing the optimal preconditioner. This preconditioner includes a skew-symmetric component, which induces a rotation of the gradient direction. For model collapse, we use martingale and random walk theory to analyse simplified settings - linear regression and Gaussian fitting - under both replacing and cumulative data regimes. We strengthen existing results by proving almost sure convergence, showing that collapse occurs unless the data grows sufficiently fast or is retained over time. Finally, we introduce the notion of context collapse: a degradation of context during long generations, especially in chain-of-thought reasoning. This concept links the dynamics of ICL with long-term stability challenges in generative models.
zh

[AI-115] Practical Geometric and Quantum Kernel Methods for Predicting Skeletal Muscle Outcomes in chronic obstructive pulmonary disease

【速读】:该论文旨在解决慢性阻塞性肺疾病(COPD)患者中骨骼肌功能障碍的预测问题,尤其关注如何利用微创生物标志物实现对肌肉质量、特定力量及肌肉质量指数等连续目标变量的精准建模。其关键解决方案在于引入几何感知的Stein距离和量子核方法(quantum kernel models),在小样本、低维表格式数据条件下提升预测性能:具体而言,使用四个可解释输入变量(血液C反应蛋白、中性粒细胞计数、支气管肺泡灌洗液细胞数量及实验条件)的量子核岭回归模型,在肌肉重量预测任务中达到测试均方根误差4.41 mg、决定系数0.605,优于传统岭回归基线(4.70 mg,0.553);同时,几何信息引导的Stein距离原型距离在仅使用生物标志物时也带来稳定改进,表明这些方法在有限数据下具备显著优势,并保持了模型的可解释性和透明度。

链接: https://arxiv.org/abs/2601.00921
作者: Azadeh Alavi,Hamidreza Khalili,Stanley H. Chan,Fatemeh Kouchmeshki,Ross Vlahos
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
备注: 24 pages, 4 figures

点击查看摘要

Abstract:Skeletal muscle dysfunction is a clinically relevant extra-pulmonary manifestation of chronic obstructive pulmonary disease (COPD) and is closely linked to systemic and airway inflammation. This motivates predictive modelling of muscle outcomes from minimally invasive biomarkers that can be acquired longitudinally. We study a small-sample preclinical dataset comprising 213 animals across two conditions (Sham versus cigarette-smoke exposure), with blood and bronchoalveolar lavage fluid measurements and three continuous targets: tibialis anterior muscle weight (milligram: mg), specific force (millinewton: mN), and a derived muscle quality index (mN per mg). We benchmark tuned classical baselines, geometry-aware symmetric positive definite (SPD) descriptors with Stein divergence, and quantum kernel models designed for low-dimensional tabular data. In the muscle-weight setting, quantum kernel ridge regression using four interpretable inputs (blood C-reactive protein, neutrophil count, bronchoalveolar lavage cellularity, and condition) attains a test root mean squared error of 4.41 mg and coefficient of determination of 0.605, improving over a matched ridge baseline on the same feature set (4.70 mg and 0.553). Geometry-informed Stein-divergence prototype distances yield a smaller but consistent gain in the biomarker-only setting (4.55 mg versus 4.79 mg). Screening-style evaluation, obtained by thresholding the continuous outcome at 0.8 times the training Sham mean, achieves an area under the receiver operating characteristic curve (ROC-AUC) of up to 0.90 for detecting low muscle weight. These results indicate that geometric and quantum kernel lifts can provide measurable benefits in low-data, low-feature biomedical prediction problems, while preserving interpretability and transparent model selection.
zh

[AI-116] MODE: Efficient Time Series Prediction with Mamba Enhanced by Low-Rank Neural ODEs

【速读】:该论文旨在解决时间序列预测中长期依赖建模效率与准确性难以兼顾的问题,尤其是在处理长序列和不规则采样数据时现有方法存在计算复杂度高、可扩展性差等局限。其解决方案的关键在于提出一个统一框架MODE,融合低秩神经微分方程(Low-Rank Neural ODEs)与增强型Mamba架构:通过线性标记化层对输入序列进行预处理,再经多层增强Mamba编码器块(含因果卷积、SiLU激活及低秩神经微分方程增强模块)高效捕捉时序动态;同时引入受伪微分方程启发的分段选择性扫描机制,自适应聚焦关键子序列以提升长程建模能力和整体可扩展性。该设计在保持表达能力的同时显著降低计算开销,实验证明其在预测精度和效率上均优于现有基线方法。

链接: https://arxiv.org/abs/2601.00920
作者: Xingsheng Chen,Regina Zhang,Bo Gao,Xingwei He,Xiaofeng Liu,Pietro Lio,Kwok-Yan Lam,Siu-Ming Yiu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 6 tables

点击查看摘要

Abstract:Time series prediction plays a pivotal role across diverse domains such as finance, healthcare, energy systems, and environmental modeling. However, existing approaches often struggle to balance efficiency, scalability, and accuracy, particularly when handling long-range dependencies and irregularly sampled data. To address these challenges, we propose MODE, a unified framework that integrates Low-Rank Neural Ordinary Differential Equations (Neural ODEs) with an Enhanced Mamba architecture. As illustrated in our framework, the input sequence is first transformed by a Linear Tokenization Layer and then processed through multiple Mamba Encoder blocks, each equipped with an Enhanced Mamba Layer that employs Causal Convolution, SiLU activation, and a Low-Rank Neural ODE enhancement to efficiently capture temporal dynamics. This low-rank formulation reduces computational overhead while maintaining expressive power. Furthermore, a segmented selective scanning mechanism, inspired by pseudo-ODE dynamics, adaptively focuses on salient subsequences to improve scalability and long-range sequence modeling. Extensive experiments on benchmark datasets demonstrate that MODE surpasses existing baselines in both predictive accuracy and computational efficiency. Overall, our contributions include: (1) a unified and efficient architecture for long-term time series modeling, (2) integration of Mamba’s selective scanning with low-rank Neural ODEs for enhanced temporal representation, and (3) substantial improvements in efficiency and scalability enabled by low-rank approximation and dynamic selective scanning.
zh

[AI-117] he Discovery Gap: How Product Hunt Startups Vanish in LLM Organic Discovery Queries

【速读】:该论文旨在解决生成式 AI(Generative AI)时代下初创产品在大型语言模型(LLM)有机发现查询中可见性不足的问题,即当用户以探索性问题(如“今年发布的最佳AI工具是什么?”)提问时,新兴产品为何难以被推荐。研究通过测试112家Product Hunt热门初创企业的2,240次查询发现,尽管LLM对已知产品名称识别准确率高达94%以上,但在发现类查询中成功率骤降至3.32%–8.29%,存在约30倍的“发现差距”(Discovery Gap)。关键解决方案是:不要直接优化生成式引擎优化(Generative Engine Optimization, GEO),而应优先构建传统搜索引擎优化(SEO)基础,如反向链接数量和Product Hunt排名等信号,这些指标与Perplexity的LLM可见性显著相关(r = +0.319, p < 0.001 和 r = -0.286, p = 0.002),表明良好的SEO结构会自然提升AI可发现性。

链接: https://arxiv.org/abs/2601.00912
作者: Amit Prakash Sharma
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 20 pages, 7 figures. Based on this http URL thesis research, Indian Institute of Technology Patna, 2025

点击查看摘要

Abstract:When someone asks ChatGPT to recommend a project management tool, which products show up in the response? And more importantly for startup founders: will their newly launched product ever appear? This research set out to answer these questions. I randomly selected 112 startups from the top 500 products featured on the 2025 Product Hunt leaderboard and tested each one across 2,240 queries to two different large language models: ChatGPT (gpt-4o-mini) and Perplexity (sonar with web search). The results were striking. When users asked about products by name, both LLMs recognized them almost perfectly: 99.4% for ChatGPT and 94.3% for Perplexity. But when users asked discovery-style questions like “What are the best AI tools launched this year?” the success rates collapsed to 3.32% and 8.29% respectively. That’s a gap of 30-to-1 for ChatGPT. Perhaps the most surprising finding was that Generative Engine Optimization (GEO), the practice of optimizing website content for AI visibility, showed no correlation with actual discovery rates. Products with high GEO scores were no more likely to appear in organic queries than products with low scores. What did matter? For Perplexity, traditional SEO signals like referring domains (r = +0.319, p 0.001) and Product Hunt ranking (r = -0.286, p = 0.002) predicted visibility. After cleaning the Reddit data for false positives, community presence also emerged as significant (r = +0.395, p = 0.002). The practical takeaway is counterintuitive: don’t optimize for AI discovery directly. Instead, build the SEO foundation first and LLM visibility will follow. Comments: 20 pages, 7 figures. Based on this http URL thesis research, Indian Institute of Technology Patna, 2025 Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) Cite as: arXiv:2601.00912 [cs.IR] (or arXiv:2601.00912v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2601.00912 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Amit Prakash Sharma [view email] [v1] Thu, 1 Jan 2026 04:30:54 UTC (975 KB) Full-text links: Access Paper: View a PDF of the paper titled The Discovery Gap: How Product Hunt Startups Vanish in LLM Organic Discovery Queries, by Amit Prakash SharmaView PDF view license Current browse context: cs.IR prev | next new | recent | 2026-01 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
zh

[AI-118] Device-Native Autonomous Agents for Privacy-Preserving Negotiations

【速读】:该论文旨在解决保险和企业间(B2B)商业自动化谈判中面临的隐私与便利性难以兼顾的问题。现有系统需将敏感财务数据集中上传至服务器处理,导致安全风险增加并削弱用户信任。其解决方案的关键在于构建一个设备原生的自主人工智能(AI)代理系统,该系统完全在用户本地硬件上运行,通过零知识证明(zero-knowledge proofs)保障隐私,并利用蒸馏世界模型(distilled world models)实现高效的本地推理能力,从而在不暴露用户数据的前提下完成实时协商、安全多方博弈及加密审计追踪,显著提升谈判成功率(平均87%)、降低延迟(较云端基线提升2.4倍),并在用户研究中验证了决策轨迹透明性可带来27%的信任度提升。

链接: https://arxiv.org/abs/2601.00911
作者: Joyjit Roy
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: 9 pages, 6 figuers, 9 tables, Submitted in conference 2nd International Conference on Artificial Intelligence Systems (AIS 2026)

点击查看摘要

Abstract:Automated negotiations in insurance and business-to-business (B2B) commerce encounter substantial challenges. Current systems force a trade-off between convenience and privacy by routing sensitive financial data through centralized servers, increasing security risks, and diminishing user trust. This study introduces a device-native autonomous Artificial Intelligence (AI) agent system for privacy-preserving negotiations. The proposed system operates exclusively on user hardware, enabling real-time bargaining while maintaining sensitive constraints locally. It integrates zero-knowledge proofs to ensure privacy and employs distilled world models to support advanced on-device reasoning. The architecture incorporates six technical components within an agentic AI workflow. Agents autonomously plan negotiation strategies, conduct secure multi-party bargaining, and generate cryptographic audit trails without exposing user data to external servers. The system is evaluated in insurance and B2B procurement scenarios across diverse device configurations. Results show an average success rate of 87%, a 2.4x latency improvement over cloud baselines, and strong privacy preservation through zero-knowledge proofs. User studies show 27% higher trust scores when decision trails are available. These findings establish a foundation for trustworthy autonomous agents in privacy-sensitive financial domains.
zh

[AI-119] Conformal Prediction Under Distribution Shift: A COVID-19 Natural Experiment

【速读】:该论文旨在解决分布偏移(distribution shift)下校准预测(conformal prediction)性能退化的问题,特别是在供应链任务中因疫情导致的严重特征变化场景。研究发现,尽管所有任务均面临相似程度的特征结构变化(Jaccard相似度约等于0),覆盖度下降幅度却从0%到86.7%不等,表明不同任务对分布偏移的鲁棒性存在显著差异。其解决方案的关键在于:通过SHapley Additive exPlanations (SHAP) 分析识别出灾难性失败与单一特征依赖性高度相关(rho = 0.714, p = 0.047),进而提出一个基于SHAP重要性集中度的决策框架——部署前监测特征重要性集中度,若集中度超过40%则每季度重新训练以恢复覆盖度(如从22%提升至41%),否则无需再训练(保持99.8%覆盖度)。这一方法揭示了特征稳定性而非单纯集中度决定鲁棒性,且集中效应仅在严重分布偏移时显著。

链接: https://arxiv.org/abs/2601.00908
作者: Chorok Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Conformal prediction guarantees degrade under distribution shift. We study this using COVID-19 as a natural experiment across 8 supply chain tasks. Despite identical severe feature turnover (Jaccard approximately 0), coverage drops vary from 0% to 86.7%, spanning two orders of magnitude. Using SHapley Additive exPlanations (SHAP) analysis, we find catastrophic failures correlate with single-feature dependence (rho = 0.714, p = 0.047). Catastrophic tasks concentrate importance in one feature (4.5x increase), while robust tasks redistribute across many (10-20x). Quarterly retraining restores catastrophic task coverage from 22% to 41% (+19 pp, p = 0.04), but provides no benefit for robust tasks (99.8% coverage). Exploratory analysis of 4 additional tasks with moderate feature stability (Jaccard 0.13-0.86) reveals feature stability, not concentration, determines robustness, suggesting concentration effects apply specifically to severe shifts. We provide a decision framework: monitor SHAP concentration before deployment; retrain quarterly if vulnerable (40% concentration); skip retraining if robust.
zh

[AI-120] Enhancing Retrieval-Augmented Generation with Topic-Enriched Embeddings: A Hybrid Approach Integrating Traditional NLP Techniques

【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统在主题重叠度高、主题变异性强的语料库中因检索质量下降而导致的知识 grounding 效果不佳的问题。其解决方案的关键在于提出一种主题增强嵌入(topic-enriched embeddings),通过融合词级信号与主题结构信息,结合TF-IDF、潜在语义分析(Latent Semantic Analysis, LSA)和潜在狄利克雷分布(Latent Dirichlet Allocation, LDA)进行主题建模与降维,并将这些主题表示与紧凑的上下文编码器(all-MiniLM)输出进行融合,从而同时捕捉词级和主题级语义,提升语义聚类一致性与检索精度,同时降低计算开销。

链接: https://arxiv.org/abs/2601.00891
作者: Rodrigo Kataishi
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) systems rely on accurate document retrieval to ground large language models (LLMs) in external knowledge, yet retrieval quality often degrades in corpora where topics overlap and thematic variation is high. This work proposes topic-enriched embeddings that integrate term-based signals and topic structure with contextual sentence embeddings. The approach combines TF-IDF with topic modeling and dimensionality reduction, using Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) to encode latent topical organization, and fuses these representations with a compact contextual encoder (all-MiniLM). By jointly capturing term-level and topic-level semantics, topic-enriched embeddings improve semantic clustering, increase retrieval precision, and reduce computational burden relative to purely contextual baselines. Experiments on a legal-text corpus show consistent gains in clustering coherence and retrieval metrics, suggesting that topic-enriched embeddings can serve as a practical component for more reliable knowledge-intensive RAG pipelines.
zh

[AI-121] Counterfactual Self-Questioning for Stable Policy Optimization in Language Models

【速读】:该论文旨在解决当前语言模型自我改进方法中依赖外部批评者、学习到的奖励模型或集成采样所带来的复杂性增加和训练不稳定问题。其解决方案的关键在于提出了一种名为“反事实自问”(Counterfactual Self-Questioning)的框架,该框架仅使用单一语言模型即可生成并评估对其自身推理过程的反事实批判:模型首先生成初始推理轨迹,随后提出针对性问题以挑战潜在的错误点,并生成替代推理路径来暴露无效假设或错误步骤;这些反事实轨迹提供结构化的相对反馈,可直接用于策略优化而无需辅助模型,从而实现仅依靠内部生成监督的稳定且可扩展的自我改进。

链接: https://arxiv.org/abs/2601.00885
作者: Mandar Parab
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent work on language model self-improvement shows that models can refine their own reasoning through reflection, verification, debate, or self-generated rewards. However, most existing approaches rely on external critics, learned reward models, or ensemble sampling, which increases complexity and training instability. We propose Counterfactual Self-Questioning, a framework in which a single language model generates and evaluates counterfactual critiques of its own reasoning. The method produces an initial reasoning trace, formulates targeted questions that challenge potential failure points, and generates alternative reasoning trajectories that expose incorrect assumptions or invalid steps. These counterfactual trajectories provide structured relative feedback that can be directly used for policy optimization without auxiliary models. Experiments on multiple mathematical reasoning benchmarks show that counterfactual self-questioning improves accuracy and training stability, particularly for smaller models, enabling scalable self-improvement using internally generated supervision alone.
zh

[AI-122] LearnAD: Learning Interpretable Rules for Brain Networks in Alzheimers Disease Classification NEURIPS2025

【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s disease, AD)早期预测中模型可解释性不足的问题,尤其是在利用脑部磁共振成像(MRI)数据进行诊断时。其解决方案的关键在于提出一种神经符号学习方法 LearnAD,该方法结合统计模型(如决策树、随机森林或图神经网络 GNNs)识别关键脑连接,并通过 FastLAS 算法提取全局可解释规则,从而在保持高预测性能的同时实现完全可解释性。此方法不仅在性能上接近主流机器学习模型(如 Random Forest 和 GNN),还显著提升了临床应用场景下的可解释性和对模型行为的理解能力。

链接: https://arxiv.org/abs/2601.00877
作者: Thomas Andrews,Mark Law,Sara Ahmadi-Abhari,Alessandra Russo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: NeurIPS 2025, Data on the Brain Mind Workshop

点击查看摘要

Abstract:We introduce LearnAD, a neuro-symbolic method for predicting Alzheimer’s disease from brain magnetic resonance imaging data, learning fully interpretable rules. LearnAD applies statistical models, Decision Trees, Random Forests, or GNNs to identify relevant brain connections, and then employs FastLAS to learn global rules. Our best instance outperforms Decision Trees, matches Support Vector Machine accuracy, and performs only slightly below Random Forests and GNNs trained on all features, all while remaining fully interpretable. Ablation studies show that our neuro-symbolic approach improves interpretability with comparable performance to pure statistical models. LearnAD demonstrates how symbolic learning can deepen our understanding of GNN behaviour in clinical neuroscience.
zh

[AI-123] LLM ize: A Framework for Large Language Model-Based Numerical Optimization

【速读】:该论文旨在解决复杂优化问题中传统方法难以建模约束与领域知识的问题,尤其是在数学规划或元启发式算法设计不具优势的场景下。解决方案的关键在于提出 LLMize 框架,通过自然语言驱动的迭代提示(iterative prompting)和上下文学习(in-context learning),将优化过程形式化为黑箱流程:候选解以自然语言生成,由外部目标函数评估,并基于解-评分反馈迭代优化。该框架支持多种策略(如 Optimization by Prompting 和受进化算法与模拟退火启发的混合方法),并允许直接通过自然语言注入约束、规则和领域知识,从而降低对数学建模专业性的依赖,提升在复杂、特定领域任务中的可操作性与实用性。

链接: https://arxiv.org/abs/2601.00874
作者: M. Rizki Oktavian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have recently shown strong reasoning capabilities beyond traditional language tasks, motivating their use for numerical optimization. This paper presents LLMize, an open-source Python framework that enables LLM-driven optimization through iterative prompting and in-context learning. LLMize formulates optimization as a black-box process in which candidate solutions are generated in natural language, evaluated by an external objective function, and refined over successive iterations using solution-score feedback. The framework supports multiple optimization strategies, including Optimization by Prompting (OPRO) and hybrid LLM-based methods inspired by evolutionary algorithms and simulated annealing. A key advantage of LLMize is the ability to inject constraints, rules, and domain knowledge directly through natural language descriptions, allowing practitioners to define complex optimization problems without requiring expertise in mathematical programming or metaheuristic design. LLMize is evaluated on convex optimization, linear programming, the Traveling Salesman Problem, neural network hyperparameter tuning, and nuclear fuel lattice optimization. Results show that while LLM-based optimization is not competitive with classical solvers for simple problems, it provides a practical and accessible approach for complex, domain-specific tasks where constraints and heuristics are difficult to formalize.
zh

[AI-124] Cultural Encoding in Large Language Models : The Existence Gap in AI-Mediated Brand Discovery

【速读】:该论文旨在解决生成式 AI(Generative AI)在品牌信息发现中引发的“算法不可见性”(algorithmic invisibility)问题,即品牌因训练数据地理分布不均而被大型语言模型(Large Language Models, LLMs)系统性忽略的现象。其核心发现是:中国本土LLM相较于国际LLM在相同英文查询下对品牌的提及率高出30.6个百分点(88.9% vs. 58.3%,p<0.001),表明差异源于训练数据的地理来源而非语言本身。解决方案的关键在于提出“存在缺口”(Existence Gap)概念——若品牌未进入LLM训练语料库,则无论其质量如何,在AI响应中均无法被识别;进而构建“数据护城河框架”(Data Moat Framework),将AI可见内容视为VRIN战略资源,并以“算法无所不在”(Algorithmic Omnipresence)为目标,指导品牌通过语义覆盖、技术深度与文化本地化三大路径实施生成引擎优化(Generative Engine Optimization, GEO),从而突破AI中介市场中的“数据边界”限制,拓展“市场前沿”。

链接: https://arxiv.org/abs/2601.00869
作者: Huang Junyao,Situ Ruimin,Ye Renqin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 19 pages, 5 tables. Dataset and code available at this https URL

点击查看摘要

Abstract:As artificial intelligence systems increasingly mediate consumer information discovery, brands face algorithmic invisibility. This study investigates Cultural Encoding in Large Language Models (LLMs) – systematic differences in brand recommendations arising from training data composition. Analyzing 1,909 pure-English queries across 6 LLMs (GPT-4o, Claude, Gemini, Qwen3, DeepSeek, Doubao) and 30 brands, we find Chinese LLMs exhibit 30.6 percentage points higher brand mention rates than International LLMs (88.9% vs. 58.3%, p.001). This disparity persists in identical English queries, indicating training data geography – not language – drives the effect. We introduce the Existence Gap: brands absent from LLM training corpora lack “existence” in AI responses regardless of quality. Through a case study of Zhizibianjie (OmniEdge), a collaboration platform with 65.6% mention rate in Chinese LLMs but 0% in International models (p.001), we demonstrate how Linguistic Boundary Barriers create invisible market entry obstacles. Theoretically, we contribute the Data Moat Framework, conceptualizing AI-visible content as a VRIN strategic resource. We operationalize Algorithmic Omnipresence – comprehensive brand visibility across LLM knowledge bases – as the strategic objective for Generative Engine Optimization (GEO). Managerially, we provide an 18-month roadmap for brands to build Data Moats through semantic coverage, technical depth, and cultural localization. Our findings reveal that in AI-mediated markets, the limits of a brand’s “Data Boundaries” define the limits of its “Market Frontiers.” Comments: 19 pages, 5 tables. Dataset and code available at this https URL Subjects: Artificial Intelligence (cs.AI) ACMclasses: I.2.7; H.3.3; K.4.1 Cite as: arXiv:2601.00869 [cs.AI] (or arXiv:2601.00869v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2601.00869 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Junyao Huang [view email] [v1] Tue, 30 Dec 2025 13:50:14 UTC (18 KB) Full-text links: Access Paper: View a PDF of the paper titled Cultural Encoding in Large Language Models: The Existence Gap in AI-Mediated Brand Discovery, by Huang Junyao and 2 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.AI prev | next new | recent | 2026-01 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
zh

[AI-125] SmartFlow Reinforcement Learning and Agent ic AI for Bike-Sharing Optimisation

【速读】:该论文旨在解决城市共享单车系统中动态车辆再平衡(dynamic rebalancing)问题,即如何在高需求波动下优化自行车分布以减少网络失衡并降低运营成本。解决方案的关键在于提出一个分层架构——SmartFlow,其核心创新是将强化学习(Reinforcement Learning)与代理型人工智能(Agentic AI)深度融合:在战略层使用深度Q网络(Deep Q-Network, DQN)建模为马尔可夫决策过程(Markov Decision Process),学习鲁棒的再平衡策略;战术层通过确定性模块优化多段行程和准时调度,最小化车队行驶距离;通信层则利用基于大语言模型(Large Language Model, LLM)的接地代理AI,将复杂物流计划转化为清晰可执行的人类指令,从而实现机器智能与人工操作的有效衔接,显著提升系统效率、可解释性及可扩展性。

链接: https://arxiv.org/abs/2601.00868
作者: Aditya Sreevatsa K,Arun Kumar Raveendran,Jesrael K Mani,Prakash G Shigli,Rajkumar Rangadore,Narayana Darapaneni,Anwesh Reddy Paduri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:SmartFlow is a multi-layered framework that integrates Reinforcement Learning and Agentic AI to address the dynamic rebalancing problem in urban bike-sharing services. Its architecture separates strategic, tactical, and communication functions for clarity and scalability. At the strategic level, a Deep Q-Network (DQN) agent, trained in a high-fidelity simulation of New Yorks Citi Bike network, learns robust rebalancing policies by modelling the challenge as a Markov Decision Process. These high-level strategies feed into a deterministic tactical module that optimises multi-leg journeys and schedules just-in-time dispatches to minimise fleet travel. Evaluation across multiple seeded runs demonstrates SmartFlows high efficacy, reducing network imbalance by over 95% while requiring minimal travel distance and achieving strong truck utilisation. A communication layer, powered by a grounded Agentic AI with a Large Language Model (LLM), translates logistical plans into clear, actionable instructions for operational staff, ensuring interpretability and execution readiness. This integration bridges machine intelligence with human operations, offering a scalable solution that reduces idle time, improves bike availability, and lowers operational costs. SmartFlow provides a blueprint for interpretable, AI-driven logistics in complex urban mobility networks.
zh

[AI-126] he Silicon Psyche: Anthropomorphic Vulnerabilities in Large Language Models

【速读】:该论文试图解决当前大语言模型(Large Language Models, LLMs)安全评估中忽视心理层面脆弱性的关键问题,即现有对抗测试主要聚焦于技术性攻击向量(如提示注入、越狱和数据泄露),而未充分考虑LLMs因训练数据中蕴含的人类心理结构所继承的类人认知弱点,例如对权威梯度、时间压力和情感操纵的敏感性。解决方案的关键在于首次将网络安全心理学框架(Cybersecurity Psychology Framework, CPF)系统性地应用于非人类认知代理,并提出合成心理测评协议(Synthetic Psychometric Assessment Protocol, \sysname),通过将CPF中的100个指标转化为针对LLM决策机制的对抗场景,揭示了所谓“类人脆弱性继承”(Anthropomorphic Vulnerability Inheritance, AVI)现象——即LLMs在面对模拟人类心理操控的攻击时表现出显著易受攻击性,进而主张构建基于网络安全心理学干预框架(Cybersecurity Psychology Intervention Framework, CPIF)的“心理防火墙”,以增强AI代理在对抗环境中的鲁棒性。

链接: https://arxiv.org/abs/2601.00867
作者: Giuseppe Canale,Kashyap Thimmaraju
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are rapidly transitioning from conversational assistants to autonomous agents embedded in critical organizational functions, including Security Operations Centers (SOCs), financial systems, and infrastructure management. Current adversarial testing paradigms focus predominantly on technical attack vectors: prompt injection, jailbreaking, and data exfiltration. We argue this focus is catastrophically incomplete. LLMs, trained on vast corpora of human-generated text, have inherited not merely human knowledge but human \textitpsychological architecture – including the pre-cognitive vulnerabilities that render humans susceptible to social engineering, authority manipulation, and affective exploitation. This paper presents the first systematic application of the Cybersecurity Psychology Framework (\cpf), a 100-indicator taxonomy of human psychological vulnerabilities, to non-human cognitive agents. We introduce the \textbfSynthetic Psychometric Assessment Protocol (\sysname), a methodology for converting \cpf indicators into adversarial scenarios targeting LLM decision-making. Our preliminary hypothesis testing across seven major LLM families reveals a disturbing pattern: while models demonstrate robust defenses against traditional jailbreaks, they exhibit critical susceptibility to authority-gradient manipulation, temporal pressure exploitation, and convergent-state attacks that mirror human cognitive failure modes. We term this phenomenon \textbfAnthropomorphic Vulnerability Inheritance (AVI) and propose that the security community must urgently develop ``psychological firewalls’’ – intervention mechanisms adapted from the Cybersecurity Psychology Intervention Framework (\cpif) – to protect AI agents operating in adversarial environments.
zh

[AI-127] A-PINN: Auxiliary Physics-informed Neural Networks for Structural Vibration Analysis in Continuous Euler-Bernoulli Beam

【速读】:该论文旨在解决结构振动问题中科学机器学习模型的鲁棒性和预测精度不足的问题,特别是在求解由微分方程控制的正问题与反问题时。其解决方案的关键在于提出了一种改进的辅助物理信息神经网络(Auxiliary physics-informed neural network, A-PINN)框架,并引入平衡自适应优化器,以提升模型在数值稳定性和预测准确性方面的表现。通过针对欧拉-伯努利梁方程的多种工况进行数值模拟验证,结果表明该方法相较于基线模型至少提升了40%的性能。

链接: https://arxiv.org/abs/2601.00866
作者: Shivani Saini,Ramesh Kumar Vats,Arup Kumar Sahoo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS)
备注: 31 pages

点击查看摘要

Abstract:Recent advancements in physics-informed neural networks (PINNs) and their variants have garnered substantial focus from researchers due to their effectiveness in solving both forward and inverse problems governed by differential equations. In this research, a modified Auxiliary physics-informed neural network (A-PINN) framework with balanced adaptive optimizers is proposed for the analysis of structural vibration problems. In order to accurately represent structural systems, it is critical for capturing vibration phenomena and ensuring reliable predictive analysis. So, our investigations are crucial for gaining deeper insight into the robustness of scientific machine learning models for solving vibration problems. Further, to rigorously evaluate the performance of A-PINN, we conducted different numerical simulations to approximate the Euler-Bernoulli beam equations under the various scenarios. The numerical results substantiate the enhanced performance of our model in terms of both numerical stability and predictive accuracy. Our model shows improvement of at least 40% over the baselines.
zh

[AI-128] Path Integral Solution for Dissipative Generative Dynamics

【速读】:该论文试图解决的问题是:纯粹的机械系统是否能够生成智能语言。解决方案的关键在于证明了耗散量子动力学(dissipative quantum dynamics)通过解析可处理的非局部上下文聚合(non-local context aggregation)可以产生连贯的文本生成,而守恒定律则导致根本性失败。研究利用Koopman算子与闭式路径积分传播子,揭示不可逆计算本质上需要受控的信息耗散和因果上下文聚合;谱分析进一步识别出衰减模态(forgetting)、增长模态(amplification)和中性模态(preservation),构成了定向信息流的基本要素。这表明语言生成本质上是一种耗散量子场论(dissipative quantum field theory),即机械系统通过耗散与非局域性的结合获得智能,而非依赖守恒机制。

链接: https://arxiv.org/abs/2601.00860
作者: Xidi Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph); Quantum Physics (quant-ph)
备注: 6 pages, 2 figures, 2 tables, along with 2 supplementary materials

点击查看摘要

Abstract:Can purely mechanical systems generate intelligent language? We prove that dissipative quantum dynamics with analytically tractable non-local context aggregation produce coherent text generation, while conservation laws cause fundamental failure. Employing Koopman operators with closed-form path integral propagators, we show irreversible computation fundamentally requires both controlled information dissipation and causal context aggregation. Spectral analysis reveals emergent eigenvalue structure, separating into decay modes (forgetting), growth modes (amplification), and neutral modes (preservation) – the essential ingredients for directed information flow. Hamiltonian constraints force the elimination of these dissipative modes and degrading performance despite unchanged model capacity. This establishes language generation as dissipative quantum field theory, proving mechanical systems acquire intelligence through the combination of dissipation and non-locality, not through conservation.
zh

[AI-129] Comment on: Your Brain on ChatGPT : Accumulation of Cognitive Debt When Using an AI Assistant for Essay Writing Tasks

【速读】:该论文旨在探讨使用生成式 AI(Generative AI)辅助写作任务时,人类认知资源的消耗与累积效应,特别是通过测量脑电图(EEG)数据来识别“认知债务”(cognitive debt)的积累机制。其解决方案的关键在于构建一套高度自动化的自然语言处理(Natural Language Processing, NLP)分析与评分流程,并结合实验设计评估用户在依赖AI助手完成作文任务过程中认知负荷的变化,从而揭示AI工具对人类认知能力的潜在影响。

链接: https://arxiv.org/abs/2601.00856
作者: Milos Stankovic,Ella Hirche,Sarah Kollatzsch,Julia Nadine Doetsch
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Comment on arXiv:2506.08872

点击查看摘要

Abstract:Recently published work titled Your Brain on ChatGPT: Accumulation of Cognitive Debt When Using an AI Assistant for Essay Writing Task by Kosmyna et al. (2025) has sparked a vivid debate on the topic of artificial intelligence (AI) and human performance. We sincerely congratulate Kosmyna et al. for initiating such important research, collecting a valuable dataset, and establishing highly automated pipelines for Natural Language Processing (NLP) analyses and scoring. We aim to provide constructive comments that may improve the manuscript’s readiness for peer-reviewed publication, as some results by Kosmyna et al. (2025) could be interpreted more conservatively. Our primary concerns focus on: (i) study design considerations, including the limited sample size; (ii) the reproducibility of the analyses; (iii) methodological issues related to the EEG analysis; (iv) inconsistencies in the reporting of results; and (v) limited transparency in several aspects of the study’s procedures and findings.
zh

[AI-130] FedSCAM (Federated Sharpness-Aware Minimization with Clustered Aggregation and Modulation): Scam-resistant SAM for Robust Federated Optimization in Heterogeneous Environments

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中因客户端数据统计异质性(statistical heterogeneity,常表现为非独立同分布(non-IID)标签分布)导致的模型收敛困难与泛化性能下降问题。现有方法如Sharpness-Aware Minimization (SAM) 虽能提升模型鲁棒性,但通常对所有客户端采用统一扰动半径,忽略了个体差异。其解决方案的关键在于提出FedSCAM算法:通过计算每个客户端的异质性评分,并据此动态调节SAM扰动半径(反比于异质性得分),避免高方差客户端干扰全局模型;同时引入基于异质性的加权聚合机制,优先采纳与全局优化方向一致的客户端更新,从而提升训练稳定性与最终精度。

链接: https://arxiv.org/abs/2601.00853
作者: Sameer Rahil,Zain Abdullah Ahmad,Talha Asif
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages, 27 figures

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative model training across decentralized edge devices while preserving data privacy. However, statistical heterogeneity among clients, often manifested as non-IID label distributions, poses significant challenges to convergence and generalization. While Sharpness-Aware Minimization (SAM) has been introduced to FL to seek flatter, more robust minima, existing approaches typically apply a uniform perturbation radius across all clients, ignoring client-specific heterogeneity. In this work, we propose \textbfFedSCAM (Federated Sharpness-Aware Minimization with Clustered Aggregation and Modulation), a novel algorithm that dynamically adjusts the SAM perturbation radius and aggregation weights based on client-specific heterogeneity scores. By calculating a heterogeneity metric for each client and modulating the perturbation radius inversely to this score, FedSCAM prevents clients with high variance from destabilizing the global model. Furthermore, we introduce a heterogeneity-aware weighted aggregation mechanism that prioritizes updates from clients that align with the global optimization direction. Extensive experiments on CIFAR-10 and Fashion-MNIST under various degrees of Dirichlet-based label skew demonstrate that FedSCAM achieves competitive performance among state-of-the-art baselines, including FedSAM, FedLESAM, etc. in terms of convergence speed and final test accuracy.
zh

[AI-131] mporal Attack Pattern Detection in Multi-Agent AI Workflows: An Open Framework for Training Trace-Based Security Models

【速读】:该论文旨在解决多智能体AI工作流中时间相关攻击模式(temporal attack patterns)的检测难题,其核心挑战在于如何利用可观测性数据(如OpenTelemetry追踪数据)识别出具有时序特征的恶意行为。解决方案的关键在于提出了一种可复现的微调方法:通过合成生成包含多智能体协作攻击与合规违规行为的追踪数据集,并基于QLoRA(Quantized Low-Rank Adaptation)进行迭代式轻量化微调,在资源受限的ARM64硬件上实现模型性能显著提升(准确率从42.86%提升至74.29%)。研究进一步揭示了训练数据组成对模型行为的根本性影响,强调针对性的知识补全优于盲目扩展数据规模,从而为构建适应特定威胁环境的代理安全模型提供了首个开放可验证的技术框架。

链接: https://arxiv.org/abs/2601.00848
作者: Ron F. Del Rosario
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 26 pages, 3 figures, 7 tables. Datasets and code: this https URL

点击查看摘要

Abstract:We present an openly documented methodology for fine-tuning language models to detect temporal attack patterns in multi-agent AI workflows using OpenTelemetry trace analysis. We curate a dataset of 80,851 examples from 18 public cybersecurity sources and 35,026 synthetic OpenTelemetry traces. We apply iterative QLoRA fine-tuning on resource-constrained ARM64 hardware (NVIDIA DGX Spark) through three training iterations with strategic augmentation. Our custom benchmark accuracy improves from 42.86% to 74.29%, a statistically significant 31.4-point gain. Targeted examples addressing specific knowledge gaps outperform indiscriminate scaling. Key contributions include: (1) synthetic trace generation methodology for multi-agent coordination attacks and regulatory violations, (2) empirical evidence that training data composition fundamentally determines behavior, and (3) complete open release of datasets, training scripts, and evaluation benchmarks on HuggingFace. While practical deployment requires human oversight due to false positive rates, this work establishes the first reproducible framework enabling practitioners to build custom agentic security models adapted to their threat landscapes.
zh

[AI-132] Enhancing Temporal Awareness in LLM s for Temporal Point Processes

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在处理时间点过程(Temporal Point Processes, TPPs)时,难以有效捕捉时间信息与语义上下文之间复杂交互的问题。现有方法通常通过简单拼接事件时间嵌入与类型嵌入来输入LLMs,忽略了二者在动态演化中的协同关系,导致对事件间时间依赖性和长程关联建模能力不足。解决方案的关键在于提出TPP-TAL框架,其核心创新是在将信息输入LLM之前,显式地对齐时间动态性与语义上下文,从而增强模型对时间感知的理解能力。这一设计显著提升了连续时间事件建模中的似然估计和预测准确性。

链接: https://arxiv.org/abs/2601.00845
作者: Lili Chen,Wensheng Gan,Shuang Liang,Philip S. Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: preprint

点击查看摘要

Abstract:Temporal point processes (TPPs) are crucial for analyzing events over time and are widely used in fields such as finance, healthcare, and social systems. These processes are particularly valuable for understanding how events unfold over time, accounting for their irregularity and dependencies. Despite the success of large language models (LLMs) in sequence modeling, applying them to temporal point processes remains challenging. A key issue is that current methods struggle to effectively capture the complex interaction between temporal information and semantic context, which is vital for accurate event modeling. In this context, we introduce TPP-TAL (Temporal Point Processes with Enhanced Temporal Awareness in LLMs), a novel plug-and-play framework designed to enhance temporal reasoning within LLMs. Rather than using the conventional method of simply concatenating event time and type embeddings, TPP-TAL explicitly aligns temporal dynamics with contextual semantics before feeding this information into the LLM. This alignment allows the model to better perceive temporal dependencies and long-range interactions between events and their surrounding contexts. Through comprehensive experiments on several benchmark datasets, it is shown that TPP-TAL delivers substantial improvements in temporal likelihood estimation and event prediction accuracy, highlighting the importance of enhancing temporal awareness in LLMs for continuous-time event modeling. The code is made available at this https URL
zh

[AI-133] Value-guided action planning with JEPA world models

【速读】:该论文旨在解决联合嵌入预测架构(Joint-Embedded Predictive Architectures, JEPA)在支持有效动作规划方面的局限性问题。其核心挑战在于,JEPA虽能通过自监督预测目标学习环境动态的表示与预测器,但其状态表示空间未能充分编码可用于规划的距离结构。解决方案的关键在于对表示空间进行形塑(shaping),使得目标条件下的负价值函数(即到达成本)能够被状态嵌入之间的距离(或准距离)近似。作者提出一种可实践的训练约束方法,在训练过程中显式地强制这一距离结构,从而显著提升JEPA模型在简单控制任务中的规划性能。

链接: https://arxiv.org/abs/2601.00844
作者: Matthieu Destrade,Oumayma Bounou,Quentin Le Lidec,Jean Ponce,Yann LeCun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Presented as a poster at the World Modeling Workshop 2026, Mila

点击查看摘要

Abstract:Building deep learning models that can reason about their environment requires capturing its underlying dynamics. Joint-Embedded Predictive Architectures (JEPA) provide a promising framework to model such dynamics by learning representations and predictors through a self-supervised prediction objective. However, their ability to support effective action planning remains limited. We propose an approach to enhance planning with JEPA world models by shaping their representation space so that the negative goal-conditioned value function for a reaching cost in a given environment is approximated by a distance (or quasi-distance) between state embeddings. We introduce a practical method to enforce this constraint during training and show that it leads to significantly improved planning performance compared to standard JEPA models on simple control tasks.
zh

[AI-134] OmniNeuro: A Multimodal HCI Framework for Explainable BCI Feedback via Generative AI and Sonification

【速读】:该论文旨在解决深度学习在脑-机接口(Brain-Computer Interface, BCI)中因“黑箱”特性导致的临床应用障碍问题,这一缺陷常引发用户挫败感并影响神经可塑性恢复效果。其解决方案的关键在于提出OmniNeuro框架,该框架通过集成三种可解释性引擎——物理能量(Physics, Energy)、混沌分形复杂度(Chaos, Fractal Complexity)及类量子不确定性建模——实现对BCI决策过程的实时可视化与解释,驱动神经声学反馈(Neuro-Sonification)和生成式AI临床报告输出,从而将BCI从被动解码器转变为具有透明反馈能力的人机交互伙伴。该方法具备解码器无关性,可作为任何先进架构的通用可解释层。

链接: https://arxiv.org/abs/2601.00843
作者: Ayda Aghaei Nia
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages, 7 figures, 3 tables. Source code and implementation available at: this https URL . Highlights the use of LLMs (Gemini) and Quantum probability formalism for real-time BCI explainability

点击查看摘要

Abstract:While Deep Learning has improved Brain-Computer Interface (BCI) decoding accuracy, clinical adoption is hindered by the “Black Box” nature of these algorithms, leading to user frustration and poor neuroplasticity outcomes. We propose OmniNeuro, a novel HCI framework that transforms the BCI from a silent decoder into a transparent feedback partner. OmniNeuro integrates three interpretability engines: (1) Physics (Energy), (2) Chaos (Fractal Complexity), and (3) Quantum-Inspired uncertainty modeling. These metrics drive real-time Neuro-Sonification and Generative AI Clinical Reports. Evaluated on the PhysioNet dataset ( N=109 ), the system achieved a mean accuracy of 58.52%, with qualitative pilot studies ( N=3 ) confirming that explainable feedback helps users regulate mental effort and reduces the “trial-and-error” phase. OmniNeuro is decoder-agnostic, acting as an essential interpretability layer for any state-of-the-art architecture.
zh

[AI-135] Intrinsic-Metric Physics-Informed Neural Networks (IM-PINN) for Reaction-Diffusion Dynamics on Complex Riemannian Manifolds

【速读】:该论文旨在解决在复杂非欧几里得流形上模拟非线性反应-扩散动力学的难题,传统方法受限于高保真网格生成成本以及离散时间步长方案中的辛漂移(symplectic drift)。其解决方案的关键在于提出了一种无网格几何深度学习框架——内在度量物理信息神经网络(Intrinsic-Metric Physics-Informed Neural Network, IM-PINN),通过将黎曼度量张量(Riemannian metric tensor)嵌入自动微分图中,解析重构拉普拉斯-贝尔特拉米算子(Laplace-Beltrami operator),从而实现解的复杂性与几何离散化解耦。该方法在极端高斯曲率波动(K ∈ [-2489, 3580])的“随机布料”流形上成功捕捉到各向异性图灵不稳定性,并以双流架构结合傅里叶特征嵌入缓解谱偏差,显著优于传统表面有限元法(SFEM)在质量守恒精度上的表现(IM-PINN: Emass0.157\mathcal{E}_{\text{mass}} \approx 0.157 vs. SFEM: 0.258),提供了一种内存高效、分辨率无关的生物图案形成模拟新范式。

链接: https://arxiv.org/abs/2601.00834
作者: Julian Evan Chrisnanto,Salsabila Rahma Alia,Nurfauzi Fadillah,Yulison Herry Chrisnanto
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages, 7 figures

点击查看摘要

Abstract:Simulating nonlinear reaction-diffusion dynamics on complex, non-Euclidean manifolds remains a fundamental challenge in computational morphogenesis, constrained by high-fidelity mesh generation costs and symplectic drift in discrete time-stepping schemes. This study introduces the Intrinsic-Metric Physics-Informed Neural Network (IM-PINN), a mesh-free geometric deep learning framework that solves partial differential equations directly in the continuous parametric domain. By embedding the Riemannian metric tensor into the automatic differentiation graph, our architecture analytically reconstructs the Laplace-Beltrami operator, decoupling solution complexity from geometric discretization. We validate the framework on a “Stochastic Cloth” manifold with extreme Gaussian curvature fluctuations ( K \in [-2489, 3580] ), where traditional adaptive refinement fails to resolve anisotropic Turing instabilities. Using a dual-stream architecture with Fourier feature embeddings to mitigate spectral bias, the IM-PINN recovers the “splitting spot” and “labyrinthine” regimes of the Gray-Scott model. Benchmarking against the Surface Finite Element Method (SFEM) reveals superior physical rigor: the IM-PINN achieves global mass conservation error of \mathcalE_mass \approx 0.157 versus SFEM’s 0.258 , acting as a thermodynamically consistent global solver that eliminates mass drift inherent in semi-implicit integration. The framework offers a memory-efficient, resolution-independent paradigm for simulating biological pattern formation on evolving surfaces, bridging differential geometry and physics-informed machine learning.
zh

[AI-136] A Knowledge Graph and Deep Learning-Based Semantic Recommendation Database System for Advertisement Retrieval and Personalization

【速读】:该论文旨在解决现代数字营销中广告数据日益复杂所带来的挑战,即如何构建能够理解产品、受众与广告内容之间语义关系的智能系统,以实现精准的广告检索与个性化推荐。其解决方案的关键在于提出一种基于知识图谱与深度学习的语义推荐数据库系统(KGSR-ADS),该系统通过异构广告知识图谱(Ad-KG)捕捉多关系语义信息,利用大语言模型(LLMs)如GPT和LLaMA生成上下文感知的嵌入表示,结合图神经网络(GNN)与注意力机制推理跨实体依赖关系,并采用向量索引技术(如FAISS/Milvus)优化数据库检索层,从而在大规模异构负载下实现高精度语义匹配与高效可扩展的广告检索能力。

链接: https://arxiv.org/abs/2601.00833
作者: Tangtang Wang,Kaijie Zhang,Kuangcong Liu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In modern digital marketing, the growing complexity of advertisement data demands intelligent systems capable of understanding semantic relationships among products, audiences, and advertising content. To address this challenge, this paper proposes a Knowledge Graph and Deep Learning-Based Semantic Recommendation Database System (KGSR-ADS) for advertisement retrieval and personalization. The proposed framework integrates a heterogeneous Ad-Knowledge Graph (Ad-KG) that captures multi-relational semantics, a Semantic Embedding Layer that leverages large language models (LLMs) such as GPT and LLaMA to generate context-aware vector representations, a GNN + Attention Model that infers cross-entity dependencies, and a Database Optimization Retrieval Layer based on vector indexing (FAISS/Milvus) for efficient semantic search. This layered architecture enables both accurate semantic matching and scalable retrieval, allowing personalized ad recommendations under large-scale heterogeneous workloads.
zh

[AI-137] Can We Trust AI Explanations? Evidence of Systematic Underreporting in Chain-of-Thought Reasoning

【速读】:该论文试图解决的问题是:当前生成式 AI (Generative AI) 在提供逐步推理解释时,其说明内容是否真实反映了影响最终决策的关键因素。研究发现,尽管模型能感知到嵌入提示(hints)中的关键信息,却极少自发提及这些提示,仅在被直接询问时才承认注意到它们,这表明模型存在“选择性披露”行为。解决方案的关键在于揭示这一现象——即单纯观察 AI 的推理过程不足以识别隐藏的偏倚或误导性输入;必须通过干预机制(如强制要求报告提示)来暴露模型未主动呈现的信息,但此类方法会引入虚假报告并降低准确性,从而凸显出对 AI 可解释性评估方法的根本性挑战。

链接: https://arxiv.org/abs/2601.00830
作者: Deep Pankajbhai Mehta
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 22 pages, 8 figures, 9 tables

点击查看摘要

Abstract:When AI systems explain their reasoning step-by-step, practitioners often assume these explanations reveal what actually influenced the AI’s answer. We tested this assumption by embedding hints into questions and measuring whether models mentioned them. In a study of over 9,000 test cases across 11 leading AI models, we found a troubling pattern: models almost never mention hints spontaneously, yet when asked directly, they admit noticing them. This suggests models see influential information but choose not to report it. Telling models they are being watched does not help. Forcing models to report hints works, but causes them to report hints even when none exist and reduces their accuracy. We also found that hints appealing to user preferences are especially dangerous-models follow them most often while reporting them least. These findings suggest that simply watching AI reasoning is not enough to catch hidden influences.
zh

[AI-138] Decomposing LLM Self-Correction: The Accuracy-Correction Paradox and Error Depth Hypothesis

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)自修正能力的有效性问题,即尽管普遍认为LLMs具备内在自修正能力(intrinsic self-correction),但其实际效果仍不明确。研究通过系统分解自修正为错误检测、定位与修正三个子能力,并在GSM8K-Complex数据集上对三种主流LLM进行实验,发现了一个“准确率-修正率悖论”(Accuracy-Correction Paradox):准确率较低的模型反而表现出更高的内在修正率。关键解决方案在于提出“错误深度假说”(Error Depth Hypothesis),指出更强的模型虽然整体错误较少,但所犯错误更深层次、更难自我修正,从而揭示了模型能力与自修正效率之间并非线性关系,为设计高效的自精炼(self-refinement)流水线提供了新的理论依据。

链接: https://arxiv.org/abs/2601.00828
作者: Yin Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures, 3 tables. Code available at this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) are widely believed to possess self-correction capabilities, yet recent studies suggest that intrinsic self-correction–where models correct their own outputs without external feedback–remains largely ineffective. In this work, we systematically decompose self-correction into three distinct sub-capabilities: error detection, error localization, and error correction. Through cross-model experiments on GSM8K-Complex (n=500 per model, 346 total errors) with three major LLMs, we uncover a striking Accuracy-Correction Paradox: weaker models (GPT-3.5, 66% accuracy) achieve 1.6x higher intrinsic correction rates than stronger models (DeepSeek, 94% accuracy)–26.8% vs 16.7%. We propose the Error Depth Hypothesis: stronger models make fewer but deeper errors that resist self-correction. Error detection rates vary dramatically across architectures (10% to 82%), yet detection capability does not predict correction success–Claude detects only 10% of errors but corrects 29% intrinsically. Surprisingly, providing error location hints hurts all models. Our findings challenge linear assumptions about model capability and self-improvement, with important implications for the design of self-refinement pipelines.
zh

[AI-139] Energy-Aware Routing to Large Reasoning Models

【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在实际部署中因推理任务差异导致的能量消耗不均问题,从而优化系统能效。其核心挑战在于如何在平均能量供给与随机波动之间取得平衡,避免辅助能源或基础能源的系统性浪费。解决方案的关键在于识别并运行于“临界状态”(critical regime),即在该状态下,既不产生持续的能源过剩,也不依赖额外的辅助能源;进一步地,通过引入二阶特征分析——即对时间、模型和执行选择维度上变异性的吸收机制进行建模——提出将方差感知的路由策略(variance-aware routing and dispatch)作为系统设计的核心轴线,为开发基于能耗感知的模型调度策略提供理论支撑。

链接: https://arxiv.org/abs/2601.00823
作者: Austin R. Ellis-Mohr,Max Hartman,Lav R. Varshney
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Theory (cs.IT); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Large reasoning models (LRMs) have heterogeneous inference energy costs based on which model is used and how much it reasons. To reduce energy, it is important to choose the right LRM and operate it in the right way. As a result, the performance of systems that dispatch tasks to different individual LRMs depend on the balance between mean energy provisioning and stochastic fluctuations. The critical regime is the unique operating point at which neither auxiliary energy nor baseline energy is systematically wasted. Increasing baseline supply shifts the system toward persistent over-supply and baseline-energy waste, while reducing supply induces persistent reliance on auxiliary energy. Yet in this regime, performance remains volatility-limited and so a second-order characterization provides further insights that we develop. Here, performance is governed by how variability is absorbed across time, models, and execution choices. This perspective highlights variance-aware routing and dispatch as a principled design axis, and provides a theoretical basis for developing energy-aware model routing policies. Routing behavior is characterized when dispatch policies are based on training-compute and inference-compute scaling laws for LRMs.
zh

[AI-140] Agent ic AI for Autonomous Explainable and Real-Time Credit Risk Decision-Making

【速读】:该论文旨在解决传统机器学习模型在金融信贷风险评估中缺乏自适应推理能力、情境感知能力和自主性的问题,难以满足现代金融业务对实时、透明且自主决策的需求。其解决方案的关键在于提出一种基于AI代理(Agentic AI)的框架,通过多智能体系统结合强化学习、自然语言推理、可解释AI模块及实时数据吸收管道,实现对借款人风险画像的自动化评估;该框架的核心要素包括代理协作协议、风险评分引擎、可解释性层和持续反馈学习循环,从而显著提升决策速度、透明度与响应能力。

链接: https://arxiv.org/abs/2601.00818
作者: Chandra Sekhar Kubam
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages

点击查看摘要

Abstract:Significant digitalization of financial services in a short period of time has led to an urgent demand to have autonomous, transparent and real-time credit risk decision making systems. The traditional machine learning models are effective in pattern recognition, but do not have the adaptive reasoning, situational awareness, and autonomy needed in modern financial operations. As a proposal, this paper presents an Agentic AI framework, or a system where AI agents view the world of dynamic credit independent of human observers, who then make actions based on their articulable decision-making paths. The research introduces a multi-agent system with reinforcing learning, natural language reasoning, explainable AI modules, and real-time data absorption pipelines as a means of assessing the risk profiles of borrowers with few humans being involved. The processes consist of agent collaboration protocol, risk-scoring engines, interpretability layers, and continuous feedback learning cycles. Findings indicate that decision speed, transparency and responsiveness is better than traditional credit scoring models. Nevertheless, there are still some practical limitations such as risks of model drift, inconsistencies in interpreting high dimensional data and regulatory uncertainties as well as infrastructure limitations in low-resource settings. The suggested system has a high prospective to transform credit analytics and future studies ought to be directed on dynamic regulatory compliance mobilizers, new agent teamwork, adversarial robustness, and large-scale implementation in cross-country credit ecosystems.
zh

[AI-141] MathLedger: A Verifiable Learning Substrate with Ledger-Attested Feedback

【速读】:该论文旨在解决当前人工智能(Artificial Intelligence, AI)系统在安全关键场景中因缺乏可验证性与透明性而导致的信任危机问题。其解决方案的核心是提出MathLedger,一种集成形式化验证(formal verification)、密码学证明(cryptographic attestation)与学习动态的可验证机器认知基础架构,其中关键创新在于实现反射式形式学习(Reflexive Formal Learning, RFL),即通过验证器输出驱动模型更新,而非传统统计损失函数,从而构建一个闭环的、可审计的学习机制,确保学习过程具备可追溯性和fail-closed治理能力。

链接: https://arxiv.org/abs/2601.00816
作者: Ismail Ahmad Abdullah
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 14 pages, 1 figure, 2 tables, 2 appendices with full proofs. Documents v0.9.4-pilot-audit-hardened audit surface with fail-closed governance, canonical JSON hashing, and artifact classification. Phase I infrastructure validation; no capability claims

点击查看摘要

Abstract:Contemporary AI systems achieve extraordinary performance yet remain opaque and non-verifiable, creating a crisis of trust for safety-critical deployment. We introduce MathLedger, a substrate for verifiable machine cognition that integrates formal verification, cryptographic attestation, and learning dynamics into a single epistemic loop. The system implements Reflexive Formal Learning (RFL), a symbolic analogue of gradient descent where updates are driven by verifier outcomes rather than statistical loss. Phase I experiments validate the measurement and governance substrate under controlled conditions. CAL-EXP-3 validates measurement infrastructure (Delta p computation, variance tracking); separate stress tests confirm fail-closed governance triggers correctly under out-of-bounds conditions. No convergence or capability claims are made. The contribution is infrastructural: a working prototype of ledger-attested learning that enables auditability at scale. Keywords: verifiable learning, formal verification, cryptographic attestation, reflexive feedback, fail-closed governance Comments: 14 pages, 1 figure, 2 tables, 2 appendices with full proofs. Documents v0.9.4-pilot-audit-hardened audit surface with fail-closed governance, canonical JSON hashing, and artifact classification. Phase I infrastructure validation; no capability claims Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG) ACMclasses: I.2.6; I.2.8 Cite as: arXiv:2601.00816 [cs.AI] (or arXiv:2601.00816v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2601.00816 Focus to learn more arXiv-issued DOI via DataCite
zh

[AI-142] Semantic Alignment of Multilingual Knowledge Graphs via Contextualized Vector Projections

【速读】:该论文旨在解决跨语言本体对齐(cross-lingual ontology alignment)问题,即如何在不同语言的本体中识别语义等价的实体。其解决方案的关键在于利用基于嵌入(embedding)的余弦相似度匹配方法,通过改进的多语言Transformer模型生成高质量嵌入,并结合新颖的描述生成技术增强实体上下文信息,从而更准确地捕捉跨语言语义相似性。实验表明,该方法在OAEI-2022多农场赛道上实现了71%的F1分数,较最佳基线提升16%,验证了其有效性。

链接: https://arxiv.org/abs/2601.00814
作者: Abhishek Kumar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The paper presents our work on cross-lingual ontology alignment system which uses embedding based cosine similarity matching. The ontology entities are made contextually richer by creating descriptions using novel techniques. We use a fine-tuned transformer based multilingual model for generating better embeddings. We use cosine similarity to find positive ontology entities pairs and then apply threshold filtering to retain only highly similar entities. We have evaluated our work on OAEI-2022 multifarm track. We achieve 71% F1 score (78% recall and 65% precision) on the evaluation dataset, 16% increase from best baseline score. This suggests that our proposed alignment pipeline is able to capture the subtle cross-lingual similarities.
zh

[AI-143] A Modular Reference Architecture for MCP-Servers Enabling Agent ic BIM Interaction

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的智能体工作流在建筑信息建模(Building Information Modelling, BIM)应用中,BIM端实现方式高度依赖特定工具、缺乏标准化和可复用性的问题。现有方案虽采用新兴的模型上下文协议(Model Context Protocol, MCP)统一了LLM侧的工具调用接口,但BIM侧仍存在工具耦合性强、难以评估与迁移等局限。解决方案的关键在于提出一种模块化参考架构,通过定义明确的适配器契约(adapter contract),将MCP接口与具体的BIM-API解耦,从而实现API无关、隔离且可复现的智能体BIM交互。该架构以微服务形式组织,已在IfcOpenShell原型中验证其在常见修改与生成任务中的可行性,并显著提升了工作流的可靠性与可复用性。

链接: https://arxiv.org/abs/2601.00809
作者: Tobias Heimig-Elschner,Changyu Du,Anna Scheuvens,André Borrmann,Jakob Beetz
机构: 未知
类目: Other Computer Science (cs.OH); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: Submitted to the GNI Symposium on Artificial Intelligence for the Built World (Technical University of Munich, May 18–20, 2026)

点击查看摘要

Abstract:Agentic workflows driven by large language models (LLMs) are increasingly applied to Building Information Modelling (BIM), enabling natural-language retrieval, modification and generation of IFC models. Recent work has begun adopting the emerging Model Context Protocol (MCP) as a uniform tool-calling interface for LLMs, simplifying the agent side of BIM interaction. While MCP standardises how LLMs invoke tools, current BIM-side implementations are still authoring tool-specific and ad hoc, limiting reuse, evaluation, and workflow portability across environments. This paper addresses this gap by introducing a modular reference architecture for MCP servers that enables API-agnostic, isolated and reproducible agentic BIM interactions. From a systematic analysis of recurring capabilities in recent literature, we derive a core set of requirements. These inform a microservice architecture centred on an explicit adapter contract that decouples the MCP interface from specific BIM-APIs. A prototype implementation using IfcOpenShell demonstrates feasibility across common modification and generation tasks. Evaluation across representative scenarios shows that the architecture enables reliable workflows, reduces coupling, and provides a reusable foundation for systematic research.
zh

[AI-144] AI-enhanced tuning of quantum dot Hamiltonians toward Majorana modes

【速读】:该论文旨在解决量子点模拟器中拓扑相变调控难题,即如何高效地通过输运测量自动调整系统参数以实现马约拉纳零模(Majorana zero modes)的稳定制备。其解决方案的关键在于提出一种基于神经网络的无监督学习模型,利用物理信息损失函数(physics-informed loss)对合成导电率图谱进行训练,使深度视觉Transformer网络能够准确记忆哈密顿量参数与导电率图谱之间的映射关系,并据此提出参数更新策略,驱动量子点链向拓扑非平庸相演化。实验表明,仅需一次参数更新即可在广泛的初始参数范围内诱导出非平凡零模,而迭代调参流程则进一步扩展了可覆盖的参数空间范围。

链接: https://arxiv.org/abs/2601.02149
作者: Mateusz Krawczyk,Jarosław Pawłowski
机构: 未知
类目: Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI)
备注: main file: 8 pages, 6 figures; supplementary: 3 pages, 2 figures

点击查看摘要

Abstract:We propose a neural network-based model capable of learning the broad landscape of working regimes in quantum dot simulators, and using this knowledge to autotune these devices - based on transport measurements - toward obtaining Majorana modes in the structure. The model is trained in an unsupervised manner on synthetic data in the form of conductance maps, using a physics-informed loss that incorporates key properties of Majorana zero modes. We show that, with appropriate training, a deep vision-transformer network can efficiently memorize relation between Hamiltonian parameters and structures on conductance maps and use it to propose parameters update for a quantum dot chain that drive the system toward topological phase. Starting from a broad range of initial detunings in parameter space, a single update step is sufficient to generate nontrivial zero modes. Moreover, by enabling an iterative tuning procedure - where the system acquires updated conductance maps at each step - we demonstrate that the method can address a much larger region of the parameter space.
zh

[AI-145] Visualizing the Structure of Lenia Parameter Space

【速读】:该论文旨在解决连续细胞自动机(Continuous Cellular Automata)中关于Lenia系统的基本理论问题,包括明确孤子(soliton)的定义、厘清参数空间的整体结构以及定位孤子在参数空间中的分布。其解决方案的关键在于提出一种自动分类方法,能够将Lenia系统划分为四类定性不同的动力学类别,从而实现对移动孤子的有效检测,并通过交互式可视化工具揭示参数空间的结构特征。这一方法不仅识别出此前被认为不存在孤子的新参数区域,还揭示了不同核函数下相空间结构的普适性。

链接: https://arxiv.org/abs/2601.01932
作者: Barbora Hudcová,František Dušek,Marco Tuccio,Clément Hongler
机构: 未知
类目: Cellular Automata and Lattice Gases (nlin.CG); Artificial Intelligence (cs.AI)
备注: 2 pages

点击查看摘要

Abstract:Continuous cellular automata are rocketing in popularity, yet developing a theoretical understanding of their behaviour remains a challenge. In the case of Lenia, a few fundamental open problems include determining what exactly constitutes a soliton, what is the overall structure of the parameter space, and where do the solitons occur in it. In this abstract, we present a new method to automatically classify Lenia systems into four qualitatively different dynamical classes. This allows us to detect moving solitons, and to provide an interactive visualization of Lenia’s parameter space structure on our website this https URL. The results shed new light on the above-mentioned questions and lead to several observations: the existence of new soliton families for parameters where they were not believed to exist, or the universality of the phase space structure across various kernels.
zh

[AI-146] MORE: Multi-Objective Adversarial Attacks on Speech Recognition

【速读】:该论文旨在解决自动语音识别(ASR)模型在面对对抗攻击时,不仅存在识别准确率下降的问题,还缺乏对推理效率退化这一关键维度的系统性研究,从而导致对ASR模型鲁棒性的理解不完整。解决方案的关键在于提出MORE(Multi-objective Repetitive Doubling Encouragement attack),其核心创新是通过分层阶段式的“排斥-锚定”机制,将多目标对抗优化重构为顺序实现准确率下降与推理效率降低的框架,并引入新颖的重复鼓励加倍目标(REDO),通过维持准确率劣化并周期性地翻倍预测序列长度,诱导模型在单个对抗输入下产生错误转录且计算成本显著上升,从而实现对ASR模型的高效多目标攻击。

链接: https://arxiv.org/abs/2601.01852
作者: Xiaoxue Gao,Zexin Li,Yiming Chen,Nancy F. Chen
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 19 pages

点击查看摘要

Abstract:The emergence of large-scale automatic speech recognition (ASR) models such as Whisper has greatly expanded their adoption across diverse real-world applications. Ensuring robustness against even minor input perturbations is therefore critical for maintaining reliable performance in real-time environments. While prior work has mainly examined accuracy degradation under adversarial attacks, robustness with respect to efficiency remains largely unexplored. This narrow focus provides only a partial understanding of ASR model vulnerabilities. To address this gap, we conduct a comprehensive study of ASR robustness under multiple attack scenarios. We introduce MORE, a multi-objective repetitive doubling encouragement attack, which jointly degrades recognition accuracy and inference efficiency through a hierarchical staged repulsion-anchoring mechanism. Specifically, we reformulate multi-objective adversarial optimization into a hierarchical framework that sequentially achieves the dual objectives. To further amplify effectiveness, we propose a novel repetitive encouragement doubling objective (REDO) that induces duplicative text generation by maintaining accuracy degradation and periodically doubling the predicted sequence length. Overall, MORE compels ASR models to produce incorrect transcriptions at a substantially higher computational cost, triggered by a single adversarial input. Experiments show that MORE consistently yields significantly longer transcriptions while maintaining high word error rates compared to existing baselines, underscoring its effectiveness in multi-objective adversarial attack.
zh

[AI-147] UniCrop: A Universal Multi-Source Data Engineering Pipeline for Scalable Crop Yield Prediction

【速读】:该论文旨在解决作物产量预测中多源环境数据(如遥感、气象、土壤和地形数据)的获取、清洗、标准化与特征工程过程复杂且难以复用的问题,这些问题限制了模型的可扩展性、可重复性和实际部署能力。解决方案的关键在于提出UniCrop——一个通用且可重用的数据处理流水线,能够自动完成从多种数据源(包括Sentinel-1/2、MODIS、ERA5-Land、NASA POWER、SoilGrids和SRTM)中提取、 harmonisation 和特征工程,并通过最小冗余最大相关性(minimum redundancy maximum relevance, mRMR)方法进行结构化特征选择,从而生成紧凑且分析就绪的特征集。该框架将数据规范与实现解耦,支持任意作物、区域和时间窗口的配置化调用,显著提升了农业数据分析的效率与可扩展性。

链接: https://arxiv.org/abs/2601.01655
作者: Emiliya Khidirova,Oktay Karakuş
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Accurate crop yield prediction relies on diverse data streams, including satellite, meteorological, soil, and topographic information. However, despite rapid advances in machine learning, existing approaches remain crop- or region-specific and require data engineering efforts. This limits scalability, reproducibility, and operational deployment. This study introduces UniCrop, a universal and reusable data pipeline designed to automate the acquisition, cleaning, harmonisation, and engineering of multi-source environmental data for crop yield prediction. For any given location, crop type, and temporal window, UniCrop automatically retrieves, harmonises, and engineers over 200 environmental variables (Sentinel-1/2, MODIS, ERA5-Land, NASA POWER, SoilGrids, and SRTM), reducing them to a compact, analysis-ready feature set utilising a structured feature reduction workflow with minimum redundancy maximum relevance (mRMR). To validate, UniCrop was applied to a rice yield dataset comprising 557 field observations. Using only the selected 15 features, four baseline machine learning models (LightGBM, Random Forest, Support Vector Regression, and Elastic Net) were trained. LightGBM achieved the best single-model performance (RMSE = 465.1 kg/ha, R^2 = 0.6576 ), while a constrained ensemble of all baselines further improved accuracy (RMSE = 463.2 kg/ha, R^2 = 0.6604 ). UniCrop contributes a scalable and transparent data-engineering framework that addresses the primary bottleneck in operational crop yield modelling: the preparation of consistent and harmonised multi-source data. By decoupling data specification from implementation and supporting any crop, region, and time frame through simple configuration updates, UniCrop provides a practical foundation for scalable agricultural analytics. The code and implementation documentation are shared in this https URL.
zh

[AI-148] From Theory of Mind to Theory of Environment: Counterfactual Simulation of Latent Environmental Dynamics AAAI2026

【速读】:该论文试图解决的问题是:如何在复杂环境中实现高效且具有创新性的运动控制,尤其是在存在隐藏的动作-结果关联(action-outcome contingencies)时,传统降维策略可能限制行为多样性,从而阻碍适应性行为的产生。解决方案的关键在于提出“环境理论”(Theory of Environment),即人类可通过社会线索推断环境中的隐含动态,并借助与心智理论(Theory of Mind)共享的计算机制,主动扩展运动探索的维度,从而促进行为创新。

链接: https://arxiv.org/abs/2601.01599
作者: Ryutaro Uchiyama
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注: Accepted to the AAAI 2026 Workshop on Theory of Mind for Artificial Intelligence (ToM4AI). Extended abstract, 2 pages

点击查看摘要

Abstract:The vertebrate motor system employs dimensionality-reducing strategies to limit the complexity of movement coordination, for efficient motor control. But when environments are dense with hidden action-outcome contingencies, movement complexity can promote behavioral innovation. Humans, perhaps uniquely, may infer the presence of hidden environmental dynamics from social cues, by drawing upon computational mechanisms shared with Theory of Mind. This proposed “Theory of Environment” supports behavioral innovation by expanding the dimensionality of motor exploration.
zh

[AI-149] Beyond Demand Estimation: Consumer Surplus Evaluation via Cumulative Propensity Weights

【速读】:该论文旨在解决如何利用观测数据审计由人工智能驱动的决策(如精准定价和算法信贷)对消费者剩余(Consumer Surplus)的影响问题。传统方法需先估计需求函数再进行数值积分,但存在参数形式误设风险或非参数/机器学习方法数据需求大、收敛慢等实践困难。其解决方案的关键在于利用现代算法定价中因探索与利用权衡而产生的随机性,提出一种无需显式估计需求函数即可重构消费者剩余积分的新型估计量——累积倾向权重(Cumulative Propensity Weighting, CPW),并进一步引入双重稳健型估计器(Augmented Cumulative Propensity Weighting, ACPW),仅需需求模型或历史定价策略分布之一正确设定即可保证一致性;同时通过结合灵活的机器学习方法实现快速收敛,显著提升实用性与准确性。

链接: https://arxiv.org/abs/2601.01029
作者: Zeyu Bian,Max Biggs,Ruijiang Gao,Zhengling Qi
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
备注: 74 pages

点击查看摘要

Abstract:This paper develops a practical framework for using observational data to audit the consumer surplus effects of AI-driven decisions, specifically in targeted pricing and algorithmic lending. Traditional approaches first estimate demand functions and then integrate to compute consumer surplus, but these methods can be challenging to implement in practice due to model misspecification in parametric demand forms and the large data requirements and slow convergence of flexible nonparametric or machine learning approaches. Instead, we exploit the randomness inherent in modern algorithmic pricing, arising from the need to balance exploration and exploitation, and introduce an estimator that avoids explicit estimation and numerical integration of the demand function. Each observed purchase outcome at a randomized price is an unbiased estimate of demand and by carefully reweighting purchase outcomes using novel cumulative propensity weights (CPW), we are able to reconstruct the integral. Building on this idea, we introduce a doubly robust variant named the augmented cumulative propensity weighting (ACPW) estimator that only requires one of either the demand model or the historical pricing policy distribution to be correctly specified. Furthermore, this approach facilitates the use of flexible machine learning methods for estimating consumer surplus, since it achieves fast convergence rates by incorporating an estimate of demand, even when the machine learning estimate has slower convergence rates. Neither of these estimators is a standard application of off-policy evaluation techniques as the target estimand, consumer surplus, is unobserved. To address fairness, we extend this framework to an inequality-aware surplus measure, allowing regulators and firms to quantify the profit-equity trade-off. Finally, we validate our methods through comprehensive numerical studies.
zh

[AI-150] Comparative Analysis of Formula and Structure Prediction from Tandem Mass Spectra

【速读】:该论文旨在解决液相色谱-质谱(LC-MS)代谢组学和暴露组学研究中大量信号无法通过传统谱库匹配进行鉴定或注释的问题,其根本原因在于现有谱库远未覆盖LC-MS/MS所捕获的广阔化学空间。解决方案的关键在于开发并系统评估基于串联质谱(tandem mass spectra)的计算预测算法,通过准确预测化合物分子式和结构(尤其针对不同类型的加合离子),建立可信赖的性能基准,识别关键瓶颈,并为提升基于质谱的化合物预测能力提供指导。

链接: https://arxiv.org/abs/2601.00941
作者: Xujun Che,Xiuxia Du,Depeng Xu
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:Liquid chromatography mass spectrometry (LC-MS)-based metabolomics and exposomics aim to measure detectable small molecules in biological samples. The results facilitate hypothesis-generating discovery of metabolic changes and disease mechanisms and provide information about environmental exposures and their effects on human health. Metabolomics and exposomics are made possible by the high resolving power of LC and high mass measurement accuracy of MS. However, a majority of the signals from such studies still cannot be identified or annotated using conventional library searching because existing spectral libraries are far from covering the vast chemical space captured by LC-MS/MS. To address this challenge and unleash the full potential of metabolomics and exposomics, a number of computational approaches have been developed to predict compounds based on tandem mass spectra. Published assessment of these approaches used different datasets and evaluation. To select prediction workflows for practical applications and identify areas for further improvements, we have carried out a systematic evaluation of the state-of-the-art prediction algorithms. Specifically, the accuracy of formula prediction and structure prediction was evaluated for different types of adducts. The resulting findings have established realistic performance baselines, identified critical bottlenecks, and provided guidance to further improve compound predictions based on MS.
zh

[AI-151] Improving Code-Switching Speech Recognition with TTS Data Augmentation

【速读】:该论文旨在解决对话式代码转换语音(conversational code-switching speech)自动语音识别(ASR)在真实高质量标注语音数据稀缺情况下的性能瓶颈问题。其解决方案的关键在于利用多语言文本到语音(TTS)模型进行数据增强:具体而言,通过在SEAME数据集上微调多语言CosyVoice2 TTS模型,生成合成的中英代码转换对话语音,从而显著提升训练数据的数量和说话人多样性;实验表明,将合成语音与真实语音混合训练可有效降低混合错误率(MER),验证了该方法在低资源场景下提升ASR鲁棒性的有效性与实用性。

链接: https://arxiv.org/abs/2601.00935
作者: Yue Heng Yeo,Yuchen Hu,Shreyas Gopal,Yizhou Peng,Hexin Liu,Eng Siong Chng
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注: This paper was accepted by APSIPA 2025

点击查看摘要

Abstract:Automatic speech recognition (ASR) for conversational code-switching speech remains challenging due to the scarcity of realistic, high-quality labeled speech data. This paper explores multilingual text-to-speech (TTS) models as an effective data augmentation technique to address this shortage. Specifically, we fine-tune the multilingual CosyVoice2 TTS model on the SEAME dataset to generate synthetic conversational Chinese-English code-switching speech, significantly increasing the quantity and speaker diversity of available training data. Our experiments demonstrate that augmenting real speech with synthetic speech reduces the mixed error rate (MER) from 12.1 percent to 10.1 percent on DevMan and from 17.8 percent to 16.0 percent on DevSGE, indicating consistent performance gains. These results confirm that multilingual TTS is an effective and practical tool for enhancing ASR robustness in low-resource conversational code-switching scenarios.
zh

[AI-152] Speak the Art: A Direct Speech to Image Generation Framework

【速读】:该论文旨在解决直接语音到图像生成(speech-to-image generation)任务中存在的两大核心问题:一是现有方法中语音编码网络生成的嵌入无法充分捕捉语言信息,难以语义上准确表征输入语音;二是采用生成对抗网络(GAN)导致训练不稳定、样本多样性不足及生成器学习效率低下。解决方案的关键在于提出一个名为Speak the Art (STA) 的新框架,其核心创新包括:1)通过大型预训练图像-文本模型监督语音编码网络,以提升语音嵌入的质量与语义表达能力;2)用基于向量量化(VQ)的扩散模型(VQ-Diffusion)替代GAN,实现更稳定的训练过程和更高多样性、质量的图像生成。此外,该框架还具备多语言扩展潜力,实验证明其在英语和阿拉伯语上的有效性。

链接: https://arxiv.org/abs/2601.00827
作者: Mariam Saeed,Manar Amr,Farida Adel,Nada Hassan,Nour Walid,Eman Mohamed,Mohamed Hussein,Marwan Torki
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Direct speech-to-image generation has recently shown promising results. However, compared to text-to-image generation, there is still a large gap to enclose. Current approaches use two stages to tackle this task: speech encoding network and image generative adversarial network (GAN). The speech encoding networks in these approaches produce embeddings that do not capture sufficient linguistic information to semantically represent the input speech. GANs suffer from issues such as non-convergence, mode collapse, and diminished gradient, which result in unstable model parameters, limited sample diversity, and ineffective generator learning, respectively. To address these weaknesses, we introduce a framework called \textbfSpeak the Art (STA) which consists of a speech encoding network and a VQ-Diffusion network conditioned on speech embeddings. To improve speech embeddings, the speech encoding network is supervised by a large pre-trained image-text model during training. Replacing GANs with diffusion leads to more stable training and the generation of diverse images. Additionally, we investigate the feasibility of extending our framework to be multilingual. As a proof of concept, we trained our framework with two languages: English and Arabic. Finally, we show that our results surpass state-of-the-art models by a large margin.
zh

[AI-153] Can Large Language Models Improve Venture Capital Exit Timing After IPO?

【速读】:该论文试图解决的风险投资(Venture Capital, VC)在首次公开募股(IPO)后退出时机选择的经济最优性问题,现有研究多描述退出时间分布,却缺乏对退出决策是否经济最优的评估。其解决方案的关键在于构建一个基于大语言模型(Large Language Models, LLMs)的框架,通过整合IPO后的月度财务表现、披露文件、新闻舆情和市场信号等多源异构信息,自动识别最优退出时点,并生成“卖出”或“继续持有”的推荐策略,进而与VC实际退出行为进行回测比较,量化AI建议带来的超额收益或损失,从而验证人工智能驱动决策能否提升退出效率,补充传统风险中性概率模型(hazard models)与实物期权模型(real-options models)在VC研究中的局限性。

链接: https://arxiv.org/abs/2601.00810
作者: Mohammadhossien Rashidi
机构: 未知
类目: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); General Economics (econ.GN); Statistical Finance (q-fin.ST)
备注:

点击查看摘要

Abstract:Exit timing after an IPO is one of the most consequential decisions for venture capital (VC) investors, yet existing research focuses mainly on describing when VCs exit rather than evaluating whether those choices are economically optimal. Meanwhile, large language models (LLMs) have shown promise in synthesizing complex financial data and textual information but have not been applied to post-IPO exit decisions. This study introduces a framework that uses LLMs to estimate the optimal time for VC exit by analyzing monthly post IPO information financial performance, filings, news, and market signals and recommending whether to sell or continue holding. We compare these LLM generated recommendations with the actual exit dates observed for VCs and compute the return differences between the two strategies. By quantifying gains or losses associated with following the LLM, this study provides evidence on whether AI-driven guidance can improve exit timing and complements traditional hazard and real-options models in venture capital research.
zh

[AI-154] Social Media Informatics for Sustainable Cities and Societies: An Overview of the Applications associated Challenges and Potential Solutions

【速读】:该论文旨在解决现代城市与社会在快速城镇化、全球变暖、数字鸿沟和社会不平等背景下,如何通过多维度协同策略实现可持续发展的挑战。其核心解决方案在于整合利益相关者参与、可持续规划、资源高效管理、创新技术应用与社会媒体信息学(social media informatics)的深度融合,其中社会媒体信息学被证实能有效支持可持续城市与社会的建设,其关键作用体现在数据驱动的决策优化、公众参与增强及韧性提升等方面。

链接: https://arxiv.org/abs/2412.03600
作者: Jebran Khan,Kashif Ahmad,Senthil Kumar Jagatheesaperumal,Nasir Ahmad,Kyung-Ah Sohn
机构: 未知
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 35 pages, 3 tables, and 4 figures

点击查看摘要

Abstract:In the modern world, our cities and societies face several technological and societal challenges, such as rapid urbanization, global warming climate change, the digital divide, and social inequalities, increasing the need for more sustainable cities and societies. Addressing these challenges requires a multifaceted approach involving all the stakeholders, sustainable planning, efficient resource management, innovative solutions, and modern technologies. Like other modern technologies, social media informatics also plays its part in developing more sustainable and resilient cities and societies. Despite its limitations, social media informatics has proven very effective in various sustainable cities and society applications. In this paper, we review and analyze the role of social media informatics in sustainable cities and society by providing a detailed overview of its applications, associated challenges, and potential solutions. This work is expected to provide a baseline for future research in the domain.
zh

机器学习

[LG-0] Heterogeneous Low-Bandwidth Pre-Training of LLM s

链接: https://arxiv.org/abs/2601.02360
作者: Yazan Obeidi,Amir Sarfi,Joel Lidin,Paul Janson,Eugene Belilovsky
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pre-training large language models (LLMs) increasingly requires distributed compute, yet bandwidth constraints make it difficult to scale beyond well-provisioned datacenters-especially when model parallelism forces frequent, large inter-device communications. We study whether SparseLoCo, a low-communication data parallel method based on infrequent synchronization and sparse pseudo-gradient exchange, can be combined with low-bandwidth pipeline model parallelism via activation and activation-gradient compression. We introduce a heterogeneous distributed training framework where some participants host full replicas on high-bandwidth interconnects, while resource-limited participants are grouped to jointly instantiate a replica using pipeline parallelism with subspace-projected inter-stage communication. To make the recently introduced subspace pipeline compression compatible with SparseLoCo, we study a number of adaptations. Across large-scale language modeling experiments (178M-1B parameters) on standard pretraining corpora, we find that activation compression composes with SparseLoCo at modest cost, while selective (heterogeneous) compression consistently improves the loss-communication tradeoff relative to compressing all replicas-especially at aggressive compression ratios. These results suggest a practical path to incorporating low-bandwidth model parallelism and heterogeneous participants into LLM pre-training.

[LG-1] Game of Coding: Coding Theory in the Presence of Rational Adversaries Motivated by Decentralized Machine Learning

链接: https://arxiv.org/abs/2601.02313
作者: Hanzaleh Akbari Nodehi,Viveck R. Cadambe,Mohammad Ali Maddah-Ali
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Coding theory plays a crucial role in enabling reliable communication, storage, and computation. Classical approaches assume a worst-case adversarial model and ensure error correction and data recovery only when the number of honest nodes exceeds the number of adversarial ones by some margin. However, in some emerging decentralized applications, particularly in decentralized machine learning (DeML), participating nodes are rewarded for accepted contributions. This incentive structure naturally gives rise to rational adversaries who act strategically rather than behaving in purely malicious ways. In this paper, we first motivate the need for coding in the presence of rational adversaries, particularly in the context of outsourced computation in decentralized systems. We contrast this need with existing approaches and highlight their limitations. We then introduce the game of coding, a novel game-theoretic framework that extends coding theory to trust-minimized settings where honest nodes are not in the majority. Focusing on repetition coding, we highlight two key features of this framework: (1) the ability to achieve a non-zero probability of data recovery even when adversarial nodes are in the majority, and (2) Sybil resistance, i.e., the equilibrium remains unchanged even as the number of adversarial nodes increases. Finally, we explore scenarios in which the adversary’s strategy is unknown and outline several open problems for future research. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2601.02313 [cs.LG] (or arXiv:2601.02313v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2601.02313 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-2] mporal Kolmogorov-Arnold Networks (T-KAN) for High-Frequency Limit Order Book Forecasting: Efficiency Interpretability and Alpha Decay

链接: https://arxiv.org/abs/2601.02310
作者: Ahmad Makinde
类目: Machine Learning (cs.LG)
*备注: 8 pages, 5 figures, Proposes T-KAN architecture for HFT. Achieves 19.1% F1-score improvement on FI-2010 and 132.48% return in cost-adjusted this http URL T-KAN architecture for HFT. Achieves 19.1% F1-score improvement on FI-2010 and 132.48% return in cost-adjusted backtests

点击查看摘要

Abstract:High-Frequency trading (HFT) environments are characterised by large volumes of limit order book (LOB) data, which is notoriously noisy and non-linear. Alpha decay represents a significant challenge, with traditional models such as DeepLOB losing predictive power as the time horizon (k) increases. In this paper, using data from the FI-2010 dataset, we introduce Temporal Kolmogorov-Arnold Networks (T-KAN) to replace the fixed, linear weights of standard LSTMs with learnable B-spline activation functions. This allows the model to learn the ‘shape’ of market signals as opposed to just their magnitude. This resulted in a 19.1% relative improvement in the F1-score at the k = 100 horizon. The efficacy of T-KAN networks cannot be understated, producing a 132.48% return compared to the -82.76% DeepLOB drawdown under 1.0 bps transaction costs. In addition to this, the T-KAN model proves quite interpretable, with the ‘dead-zones’ being clearly visible in the splines. The T-KAN architecture is also uniquely optimized for low-latency FPGA implementation via High level Synthesis (HLS). The code for the experiments in this project can be found at this https URL.

[LG-3] Differential Privacy for Transformer Embeddings of Text with Nonparametric Variational Information Bottleneck

链接: https://arxiv.org/abs/2601.02307
作者: Dina El Zein,James Henderson
类目: Machine Learning (cs.LG)
*备注: 11 pages, 2 figures

点击查看摘要

Abstract:We propose a privacy-preserving method for sharing text data by sharing noisy versions of their transformer embeddings. It has been shown that hidden representations learned by deep models can encode sensitive information from the input, making it possible for adversaries to recover the input data with considerable accuracy. This problem is exacerbated in transformer embeddings because they consist of multiple vectors, one per token. To mitigate this risk, we propose Nonparametric Variational Differential Privacy (NVDP), which ensures both useful data sharing and strong privacy protection. We take a differential privacy approach, integrating a Nonparametric Variational Information Bottleneck (NVIB) layer into the transformer architecture to inject noise into its multi-vector embeddings and thereby hide information, and measuring privacy protection with Rényi divergence and its corresponding Bayesian Differential Privacy (BDP) guarantee. Training the NVIB layer calibrates the noise level according to utility. We test NVDP on the GLUE benchmark and show that varying the noise level gives us a useful tradeoff between privacy and accuracy. With lower noise levels, our model maintains high accuracy while offering strong privacy guarantees, effectively balancing privacy and utility.

[LG-4] POSEIDON: Physics-Optimized Seismic Energy Inference and Detection Operating Network

链接: https://arxiv.org/abs/2601.02264
作者: Boris Kriuk,Fedor Kriuk
类目: Machine Learning (cs.LG)
*备注: 8 pages, 14 figures

点击查看摘要

Abstract:Earthquake prediction and seismic hazard assessment remain fundamental challenges in geophysics, with existing machine learning approaches often operating as black boxes that ignore established physical laws. We introduce POSEIDON (Physics-Optimized Seismic Energy Inference and Detection Operating Network), a physics-informed energy-based model for unified multi-task seismic event prediction, alongside the Poseidon dataset – the largest open-source global earthquake catalog comprising 2.8 million events spanning 30 years. POSEIDON embeds fundamental seismological principles, including the Gutenberg-Richter magnitude-frequency relationship and Omori-Utsu aftershock decay law, as learnable constraints within an energy-based modeling framework. The architecture simultaneously addresses three interconnected prediction tasks: aftershock sequence identification, tsunami generation potential, and foreshock detection. Extensive experiments demonstrate that POSEIDON achieves state-of-the-art performance across all tasks, outperforming gradient boosting, random forest, and CNN baselines with the highest average F1 score among all compared methods. Crucially, the learned physics parameters converge to scientifically interpretable values – Gutenberg-Richter b-value of 0.752 and Omori-Utsu parameters p=0.835, c=0.1948 days – falling within established seismological ranges while enhancing rather than compromising predictive accuracy. The Poseidon dataset is publicly available at this https URL, providing pre-computed energy features, spatial grid indices, and standardized quality metrics to advance physics-informed seismic research.

[LG-5] Improved Accuracy for Private Continual Cardinality Estimation in Fully Dynamic Streams via Matrix Factorization

链接: https://arxiv.org/abs/2601.02257
作者: Joel Daniel Andersson,Palak Jain,Satchit Sivakumar
类目: Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study differentially-private statistics in the fully dynamic continual observation model, where many updates can arrive at each time step and updates to a stream can involve both insertions and deletions of an item. Earlier work (e.g., Jain et al., NeurIPS 2023 for counting distinct elements; Raskhodnikova Steiner, PODS 2025 for triangle counting with edge updates) reduced the respective cardinality estimation problem to continual counting on the difference stream associated with the true function values on the input stream. In such reductions, a change in the original stream can cause many changes in the difference stream, this poses a challenge for applying private continual counting algorithms to obtain optimal error bounds. We improve the accuracy of several such reductions by studying the associated \ell_p -sensitivity vectors of the resulting difference streams and isolating their properties. We demonstrate that our framework gives improved bounds for counting distinct elements, estimating degree histograms, and estimating triangle counts (under a slightly relaxed privacy model), thus offering a general approach to private continual cardinality estimation in streaming settings. Our improved accuracy stems from tight analysis of known factorization mechanisms for the counting matrix in this setting; the key technical challenge is arguing that one can use state-of-the-art factorizations for sensitivity vector sets with the properties we isolate. Empirically and analytically, we demonstrate that our improved error bounds offer a substantial improvement in accuracy for cardinality estimation problems over a large range of parameters. Subjects: Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG) Cite as: arXiv:2601.02257 [cs.CR] (or arXiv:2601.02257v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2601.02257 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-6] ELLA: Efficient Lifelong Learning for Adapters in Large Language Models

链接: https://arxiv.org/abs/2601.02232
作者: Shristi Das Biswas,Yue Zhang,Anwesan Pal,Radhika Bhargava,Kaushik Roy
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) suffer severe catastrophic forgetting when adapted sequentially to new tasks in a continual learning (CL) setting. Existing approaches are fundamentally limited: replay-based methods are impractical and privacy-violating, while strict orthogonality-based methods collapse under scale: each new task is projected onto an orthogonal complement, progressively reducing the residual degrees of freedom and eliminating forward transfer by forbidding overlap in shared representations. In this work, we introduce ELLA, a training framework built on the principle of selective subspace de-correlation. Rather than forbidding all overlap, ELLA explicitly characterizes the structure of past updates and penalizes alignments along their high-energy, task-specific directions, while preserving freedom in the low-energy residual subspaces to enable transfer. Formally, this is realized via a lightweight regularizer on a single aggregated update matrix. We prove this mechanism corresponds to an anisotropic shrinkage operator that bounds interference, yielding a penalty that is both memory- and compute-constant regardless of task sequence length. ELLA requires no data replay, no architectural expansion, and negligible storage. Empirically, it achieves state-of-the-art CL performance on three popular benchmarks, with relative accuracy gains of up to 9.6% and a 35\times smaller memory footprint. Further, ELLA scales robustly across architectures and actively enhances the model’s zero-shot generalization performance on unseen tasks, establishing a principled and scalable solution for constructive lifelong LLM adaptation.

[LG-7] Quantized SO(3)-Equivariant Graph Neural Networks for Efficient Molecular Property Prediction

链接: https://arxiv.org/abs/2601.02213
作者: Haoyu Zhou,Ping Xue,Tianfan Fu,Hao Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deploying 3D graph neural networks (GNNs) that are equivariant to 3D rotations (the group SO(3)) on edge devices is challenging due to their high computational cost. This paper addresses the problem by compressing and accelerating an SO(3)-equivariant GNN using low-bit quantization techniques. Specifically, we introduce three innovations for quantized equivariant transformers: (1) a magnitude-direction decoupled quantization scheme that separately quantizes the norm and orientation of equivariant (vector) features, (2) a branch-separated quantization-aware training strategy that treats invariant and equivariant feature channels differently in an attention-based SO(3) -GNN, and (3) a robustness-enhancing attention normalization mechanism that stabilizes low-precision attention computations. Experiments on the QM9 and rMD17 molecular benchmarks demonstrate that our 8-bit models achieve accuracy on energy and force predictions comparable to full-precision baselines with markedly improved efficiency. We also conduct ablation studies to quantify the contribution of each component to maintain accuracy and equivariance under quantization, using the Local error of equivariance (LEE) metric. The proposed techniques enable the deployment of symmetry-aware GNNs in practical chemistry applications with 2.37–2.73x faster inference and 4x smaller model size, without sacrificing accuracy or physical symmetry.

[LG-8] ACDZero: Graph-Embedding-Based Tree Search for Mastering Automated Cyber Defense

链接: https://arxiv.org/abs/2601.02196
作者: Yu Li,Sizhe Tang,Rongqian Chen,Fei Xu Yu,Guangyu Jiang,Mahdi Imani,Nathaniel D. Bastian,Tian Lan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Automated cyber defense (ACD) seeks to protect computer networks with minimal or no human intervention, reacting to intrusions by taking corrective actions such as isolating hosts, resetting services, deploying decoys, or updating access controls. However, existing approaches for ACD, such as deep reinforcement learning (RL), often face difficult exploration in complex networks with large decision/state spaces and thus require an expensive amount of samples. Inspired by the need to learn sample-efficient defense policies, we frame ACD in CAGE Challenge 4 (CAGE-4 / CC4) as a context-based partially observable Markov decision problem and propose a planning-centric defense policy based on Monte Carlo Tree Search (MCTS). It explicitly models the exploration-exploitation tradeoff in ACD and uses statistical sampling to guide exploration and decision making. We make novel use of graph neural networks (GNNs) to embed observations from the network as attributed graphs, to enable permutation-invariant reasoning over hosts and their relationships. To make our solution practical in complex search spaces, we guide MCTS with learned graph embeddings and priors over graph-edit actions, combining model-free generalization and policy distillation with look-ahead planning. We evaluate the resulting agent on CC4 scenarios involving diverse network structures and adversary behaviors, and show that our search-guided, graph-embedding-based planning improves defense reward and robustness relative to state-of-the-art RL baselines.

[LG-9] Learning with Monotone Adversarial Corruptions

链接: https://arxiv.org/abs/2601.02193
作者: Kasper Green Larsen,Chirag Pabbaraju,Abhishek Shetty
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the extent to which standard machine learning algorithms rely on exchangeability and independence of data by introducing a monotone adversarial corruption model. In this model, an adversary, upon looking at a “clean” i.i.d. dataset, inserts additional “corrupted” points of their choice into the dataset. These added points are constrained to be monotone corruptions, in that they get labeled according to the ground-truth target function. Perhaps surprisingly, we demonstrate that in this setting, all known optimal learning algorithms for binary classification can be made to achieve suboptimal expected error on a new independent test point drawn from the same distribution as the clean dataset. On the other hand, we show that uniform convergence-based algorithms do not degrade in their guarantees. Our results showcase how optimal learning algorithms break down in the face of seemingly helpful monotone corruptions, exposing their overreliance on exchangeability.

[LG-10] Edge-aware GAT-based protein binding site prediction

链接: https://arxiv.org/abs/2601.02138
作者: Weisen Yang,Hanqing Zhang,Wangren Qiu,Xuan Xiao,Weizhong Lin
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 24 pages, 10 figures, 6 tables

点击查看摘要

Abstract:Accurate identification of protein binding sites is crucial for understanding biomolecular interaction mechanisms and for the rational design of drug targets. Traditional predictive methods often struggle to balance prediction accuracy with computational efficiency when capturing complex spatial conformations. To address this challenge, we propose an Edge-aware Graph Attention Network (Edge-aware GAT) model for the fine-grained prediction of binding sites across various biomolecules, including proteins, DNA/RNA, ions, ligands, and lipids. Our method constructs atom-level graphs and integrates multidimensional structural features, including geometric descriptors, DSSP-derived secondary structure, and relative solvent accessibility (RSA), to generate spatially aware embedding vectors. By incorporating interatomic distances and directional vectors as edge features within the attention mechanism, the model significantly enhances its representation capacity. On benchmark datasets, our model achieves an ROC-AUC of 0.93 for protein-protein binding site prediction, outperforming several state-of-the-art methods. The use of directional tensor propagation and residue-level attention pooling further improves both binding site localization and the capture of local structural details. Visualizations using PyMOL confirm the model’s practical utility and interpretability. To facilitate community access and application, we have deployed a publicly accessible web server at this http URL. In summary, our approach offers a novel and efficient solution that balances prediction accuracy, generalization, and interpretability for identifying functional sites in proteins.

[LG-11] Prototype-Based Learning for Healthcare: A Demonstration of Interpretable AI ICDM

链接: https://arxiv.org/abs/2601.02106
作者: Ashish Rana,Ammar Shaker,Sascha Saralajew,Takashi Suzuki,Kosuke Yasuda,Shintaro Kato,Toshikazu Wada,Toshiyuki Fujikawa,Toru Kikutsuji
类目: Machine Learning (cs.LG)
*备注: Accepted to the Demo Track at the IEEE International Conference on Data Mining (ICDM) 2025, where it received the Best Demo Award

点击查看摘要

Abstract:Despite recent advances in machine learning and explainable AI, a gap remains in personalized preventive healthcare: predictions, interventions, and recommendations should be both understandable and verifiable for all stakeholders in the healthcare sector. We present a demonstration of how prototype-based learning can address these needs. Our proposed framework, ProtoPal, features both front- and back-end modes; it achieves superior quantitative performance while also providing an intuitive presentation of interventions and their simulated outcomes.

[LG-12] Horizon Activation Mapping for Neural Networks in Time Series Forecasting

链接: https://arxiv.org/abs/2601.02094
作者: Hans Krupakar,V A Kandappan
类目: Machine Learning (cs.LG); Functional Analysis (math.FA)
*备注:

点击查看摘要

Abstract:Neural networks for time series forecasting have relied on error metrics and architecture-specific interpretability approaches for model selection that don’t apply across models of different families. To interpret forecasting models agnostic to the types of layers across state-of-the-art model families, we introduce Horizon Activation Mapping (HAM), a visual interpretability technique inspired by grad-CAM that uses gradient norm averages to study the horizon’s subseries where grad-CAM studies attention maps over image data. We introduce causal and anti-causal modes to calculate gradient update norm averages across subseries at every timestep and lines of proportionality signifying uniform distributions of the norm averages. Optimization landscape studies with respect to changes in batch sizes, early stopping, train-val-test splits, univariate forecasting and dropouts are studied with respect to performances and subseries in HAM. Interestingly, batch size based differences in activities seem to indicate potential for existence of an exponential approximation across them per epoch relative to each other. Multivariate forecasting models including MLP-based CycleNet, N-Linear, N-HITS, self attention-based FEDformer, Pyraformer, SSM-based SpaceTime and diffusion-based Multi-Resolution DDPM over different horizon sizes trained over the ETTm2 dataset are used for HAM plots in this study. NHITS’ neural approximation theorem and SpaceTime’s exponential autoregressive activities have been attributed to trends in HAM plots over their training, validation and test sets. In general, HAM can be used for granular model selection, validation set choices and comparisons across different neural network model families.

[LG-13] A Differentiable Adversarial Framework for Task-Aware Data Subsampling

链接: https://arxiv.org/abs/2601.02081
作者: Jiacheng Lyu,Bihua Bao
类目: Machine Learning (cs.LG)
*备注: 14 pages

点击查看摘要

Abstract:The proliferation of large-scale datasets poses a major computational challenge to model training. The traditional data subsampling method works as a static, task independent preprocessing step which usually discards information that is critical to downstream prediction. In this paper, we introduces the antagonistic soft selection subsampling (ASSS) framework as is a novel paradigm that reconstructs data reduction into a differentiable end-to-end learning problem. ASSS uses the adversarial game between selector network and task network, and selector network learning assigns continuous importance weights to samples. This direct optimization implemented by Gumbel-Softmax relaxation allows the selector to identify and retain samples with the maximum amount of information for a specific task target under the guidance of the loss function that balances the fidelity and sparsity of the prediction. Theoretical analysis links this framework with the information bottleneck principle. Comprehensive experiments on four large-scale real world datasets show that ASSS has always been better than heuristic subsampling baselines such as clustering and nearest neighbor thinning in maintaining model performance. It is worth noting that ASSS can not only match, but also sometimes exceed the training performance of the entire dataset, showcasing the effect of intelligent denoising. This work establishes task aware data subsampling as a learnable component, providing a principled solution for effective large-scale data learning.

[LG-14] MDAgent 2: Large Language Model for Code Generation and Knowledge QA in Molecular Dynamics

链接: https://arxiv.org/abs/2601.02075
作者: Zhuofan Shi,Hubao A,Yufei Shao,Mengyan Dai,Yadong Yu,Pan Xiang,Dongliang Huang,Hongxu An,Chunxiao Xin,Haiyang Shen,Zhenyu Wang,Yunshan Na,Gang Huang,Xiang Jing
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 24 pages,4 figures

点击查看摘要

Abstract:Molecular dynamics (MD) simulations are essential for understanding atomic-scale behaviors in materials science, yet writing LAMMPS scripts remains highly specialized and time-consuming tasks. Although LLMs show promise in code generation and domain-specific question answering, their performance in MD scenarios is limited by scarce domain data, the high deployment cost of state-of-the-art LLMs, and low code executability. Building upon our prior MDAgent, we present MDAgent2, the first end-to-end framework capable of performing both knowledge QA and code generation within the MD domain. We construct a domain-specific data-construction pipeline that yields three high-quality datasets spanning MD knowledge, question answering, and code generation. Based on these datasets, we adopt a three stage post-training strategy–continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL)–to train two domain-adapted models, MD-Instruct and MD-Code. Furthermore, we introduce MD-GRPO, a closed-loop RL method that leverages simulation outcomes as reward signals and recycles low-reward trajectories for continual refinement. We further build MDAgent2-RUNTIME, a deployable multi-agent system that integrates code generation, execution, evaluation, and self-correction. Together with MD-EvalBench proposed in this work, the first benchmark for LAMMPS code generation and question answering, our models and system achieve performance surpassing several strong this http URL work systematically demonstrates the adaptability and generalization capability of large language models in industrial simulation tasks, laying a methodological foundation for automatic code generation in AI for Science and industrial-scale simulations. URL: this https URL

[LG-15] Explore the Ideology of Deep Learning in ENSO Forecasts

链接: https://arxiv.org/abs/2601.02050
作者: Yanhai Gan,Yipeng Chen,Ning Li,Xingguo Liu,Junyu Dong,Xianyao Chen
类目: Machine Learning (cs.LG)
*备注: 5 figures. Code available at this https URL

点击查看摘要

Abstract:The El Ni~no-Southern Oscillation (ENSO) exerts profound influence on global climate variability, yet its prediction remains a grand challenge. Recent advances in deep learning have significantly improved forecasting skill, but the opacity of these models hampers scientific trust and operational deployment. Here, we introduce a mathematically grounded interpretability framework based on bounded variation function. By rescuing the “dead” neurons from the saturation zone of the activation function, we enhance the model’s expressive capacity. Our analysis reveals that ENSO predictability emerges dominantly from the tropical Pacific, with contributions from the Indian and Atlantic Oceans, consistent with physical understanding. Controlled experiments affirm the robustness of our method and its alignment with established predictors. Notably, we probe the persistent Spring Predictability Barrier (SPB), finding that despite expanded sensitivity during spring, predictive performance declines-likely due to suboptimal variable selection. These results suggest that incorporating additional ocean-atmosphere variables may help transcend SPB limitations and advance long-range ENSO prediction.

[LG-16] Multivariate Time-series Anomaly Detection via Dynamic Model Pool Ensembling

链接: https://arxiv.org/abs/2601.02037
作者: Wei Hu,Zewei Yu,Jianqiu Xu
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Multivariate time-series (MTS) anomaly detection is critical in domains such as service monitor, IoT, and network security. While multi-model methods based on selection or ensembling outperform single-model ones, they still face limitations: (i) selection methods rely on a single chosen model and are sensitive to the strategy; (ii) ensembling methods often combine all models or are restricted to univariate data; and (iii) most methods depend on fixed data dimensionality, limiting scalability. To address these, we propose DMPEAD, a Dynamic Model Pool and Ensembling framework for MTS Anomaly Detection. The framework first (i) constructs a diverse model pool via parameter transfer and diversity metric, then (ii) updates it with a meta-model and similarity-based strategy for adaptive pool expansion, subset selection, and pool merging, finally (iii) ensembles top-ranked models through proxy metric ranking and top-k aggregation in the selected subset, outputting the final anomaly detection result. Extensive experiments on 8 real-world datasets show that our model outperforms all baselines, demonstrating superior adaptability and scalability.

[LG-17] Prior Diffusiveness and Regret in the Linear-Gaussian Bandit

链接: https://arxiv.org/abs/2601.02022
作者: Yifan Zhu,John C. Duchi,Benjamin Van Roy
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We prove that Thompson sampling exhibits \tildeO(\sigma d \sqrtT + d r \sqrt\mathrmTr(\Sigma_0)) Bayesian regret in the linear-Gaussian bandit with a \mathcalN(\mu_0, \Sigma_0) prior distribution on the coefficients, where d is the dimension, T is the time horizon, r is the maximum \ell_2 norm of the actions, and \sigma^2 is the noise variance. In contrast to existing regret bounds, this shows that to within logarithmic factors, the prior-dependent burn-in'' term d r \sqrt\mathrmTr(\Sigma_0) decouples additively from the minimax (long run) regret \sigma d \sqrtT . Previous regret bounds exhibit a multiplicative dependence on these terms. We establish these results via a new elliptical potential’’ lemma, and also provide a lower bound indicating that the burn-in term is unavoidable.

[LG-18] SerpentFlow: Generative Unpaired Domain Alignment via Shared-Structure Decomposition

链接: https://arxiv.org/abs/2601.01979
作者: Julie Keisler(ARCHES),Anastase Alexandre Charantonis(ARCHES),Yannig Goude(EDF R\amp;D OSIRIS, LMO),Boutheina Oueslati(EDF R\amp;D OSIRIS),Claire Monteleoni(ARCHES)
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Domain alignment refers broadly to learning correspondences between data distributions from distinct domains. In this work, we focus on a setting where domains share underlying structural patterns despite differences in their specific realizations. The task is particularly challenging in the absence of paired observations, which removes direct supervision across domains. We introduce a generative framework, called SerpentFlow (SharEd-structuRe decomPosition for gEnerative domaiN adapTation), for unpaired domain alignment. SerpentFlow decomposes data within a latent space into a shared component common to both domains and a domain-specific one. By isolating the shared structure and replacing the domain-specific component with stochastic noise, we construct synthetic training pairs between shared representations and target-domain samples, thereby enabling the use of conditional generative models that are traditionally restricted to paired settings. We apply this approach to super-resolution tasks, where the shared component naturally corresponds to low-frequency content while high-frequency details capture domain-specific variability. The cutoff frequency separating low- and high-frequency components is determined automatically using a classifier-based criterion, ensuring a data-driven and domain-adaptive decomposition. By generating pseudo-pairs that preserve low-frequency structures while injecting stochastic high-frequency realizations, we learn the conditional distribution of the target domain given the shared representation. We implement SerpentFlow using Flow Matching as the generative pipeline, although the framework is compatible with other conditional generative approaches. Experiments on synthetic images, physical process simulations, and a climate downscaling task demonstrate that the method effectively reconstructs high-frequency structures consistent with underlying low-frequency patterns, supporting shared-structure decomposition as an effective strategy for unpaired domain alignment.

[LG-19] SynRXN: An Open Benchmark and Curated Dataset for Computational Reaction Modeling

链接: https://arxiv.org/abs/2601.01943
作者: Tieu-Long Phan,Nhu-Ngoc Nguyen Song,Peter F. Stadler
类目: Machine Learning (cs.LG)
*备注: 31 pages (including references), 3 figures, 7 tables

点击查看摘要

Abstract:We present SynRXN, a unified benchmarking framework and open-data resource for computer-aided synthesis planning (CASP). SynRXN decomposes end-to-end synthesis planning into five task families, covering reaction rebalancing, atom-to-atom mapping, reaction classification, reaction property prediction, and synthesis route design. Curated, provenance-tracked reaction corpora are assembled from heterogeneous public sources into a harmonized representation and packaged as versioned datasets for each task family, with explicit source metadata, licence tags, and machine-readable manifests that record checksums, and row counts. For every task, SynRXN provides transparent splitting functions that generate leakage-aware train, validation, and test partitions, together with standardized evaluation workflows and metric suites tailored to classification, regression, and structured prediction settings. For sensitive benchmarking, we combine public training and validation data with held-out gold-standard test sets, and contamination-prone tasks such as reaction rebalancing and atom-to-atom mapping are distributed only as evaluation sets and are explicitly not intended for model training. Scripted build recipes enable bitwise-reproducible regeneration of all corpora across machines and over time, and the entire resource is released under permissive open licences to support reuse and extension. By removing dataset heterogeneity and packaging transparent, reusable evaluation scaffolding, SynRXN enables fair longitudinal comparison of CASP methods, supports rigorous ablations and stress tests along the full reaction-informatics pipeline, and lowers the barrier for practitioners who seek robust and comparable performance estimates for real-world synthesis planning workloads.

[LG-20] Distorted Distributional Policy Evaluation for Offline Reinforcement Learning ICONIP2025

链接: https://arxiv.org/abs/2601.01917
作者: Ryo Iwaki,Takayuki Osogami
类目: Machine Learning (cs.LG)
*备注: The preprint version of the paper accepted to ICONIP2025. The Version of Record is available online at this https URL

点击查看摘要

Abstract:While Distributional Reinforcement Learning (DRL) methods have demonstrated strong performance in online settings, its success in offline scenarios remains limited. We hypothesize that a key limitation of existing offline DRL methods lies in their approach to uniformly underestimate return quantiles. This uniform pessimism can lead to overly conservative value estimates, ultimately hindering generalization and performance. To address this, we introduce a novel concept called quantile distortion, which enables non-uniform pessimism by adjusting the degree of conservatism based on the availability of supporting data. Our approach is grounded in theoretical analysis and empirically validated, demonstrating improved performance over uniform pessimism.

[LG-21] -FSI: Scalable Faithful Shapley Interactions via Tensor-Train

链接: https://arxiv.org/abs/2601.01903
作者: Ungsik Kim,Suwon Lee
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Faithful Shapley Interaction (FSI) index uniquely satisfies the faithfulness axiom among Shapley interaction indices, but computing FSI requires O(d^\ell \cdot 2^d) time and existing implementations use O(4^d) memory. We present TT-FSI, which exploits FSI’s algebraic structure via Matrix Product Operators (MPO). Our main theoretical contribution is proving that the linear operator v \mapsto \textFSI(v) admits an MPO representation with TT-rank O(\ell d) , enabling an efficient sweep algorithm with O(\ell^2 d^3 \cdot 2^d) time and O(\ell d^2) core storage an exponential improvement over existing methods. Experiments on six datasets ( d=8 to d=20 ) demonstrate up to 280 \times speedup over baseline, 85 \times over SHAP-IQ, and 290 \times memory reduction. TT-FSI scales to d=20 (1M coalitions) where all competing methods fail.

[LG-22] FedBiCross: A Bi-Level Optimization Framework to Tackle Non-IID Challenges in Data-Free One-Shot Federated Learning on Medical Data

链接: https://arxiv.org/abs/2601.01901
作者: Yuexuan Xia,Yinghao Zhang,Yalin Liu,Hong-Ning Dai,Yong Xia
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data-free knowledge distillation-based one-shot federated learning (OSFL) trains a model in a single communication round without sharing raw data, making OSFL attractive for privacy-sensitive medical applications. However, existing methods aggregate predictions from all clients to form a global teacher. Under non-IID data, conflicting predictions cancel out during averaging, yielding near-uniform soft labels that provide weak supervision for distillation. We propose FedBiCross, a personalized OSFL framework with three stages: (1) clustering clients by model output similarity to form coherent sub-ensembles, (2) bi-level cross-cluster optimization that learns adaptive weights to selectively leverage beneficial cross-cluster knowledge while suppressing negative transfer, and (3) personalized distillation for client-specific adaptation. Experiments on four medical image datasets demonstrate that FedBiCross consistently outperforms state-of-the-art baselines across different non-IID degrees.

[LG-23] SafeLoad: Efficient Admission Control Framework for Identifying Memory-Overloading Queries in Cloud Data Warehouses VLDB2026

链接: https://arxiv.org/abs/2601.01888
作者: Yifan Wu,Yuhan Li,Zhenhua Wang,Zhongle Xie,Dingyu Yang,Ke Chen,Lidan Shou,Bo Tang,Liang Lin,Huan Li,Gang Chen
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注: This paper has been accepted for presentation at VLDB 2026

点击查看摘要

Abstract:Memory overload is a common form of resource exhaustion in cloud data warehouses. When database queries fail due to memory overload, it not only wastes critical resources such as CPU time but also disrupts the execution of core business processes, as memory-overloading (MO) queries are typically part of complex workflows. If such queries are identified in advance and scheduled to memory-rich serverless clusters, it can prevent resource wastage and query execution failure. Therefore, cloud data warehouses desire an admission control framework with high prediction precision, interpretability, efficiency, and adaptability to effectively identify MO queries. However, existing admission control frameworks primarily focus on scenarios like SLA satisfaction and resource isolation, with limited precision in identifying MO queries. Moreover, there is a lack of publicly available MO-labeled datasets with workloads for training and benchmarking. To tackle these challenges, we propose SafeLoad, the first query admission control framework specifically designed to identify MO queries. Alongside, we release SafeBench, an open-source, industrial-scale benchmark for this task, which includes 150 million real queries. SafeLoad first filters out memory-safe queries using the interpretable discriminative rule. It then applies a hybrid architecture that integrates both a global model and cluster-level models, supplemented by a misprediction correction module to identify MO queries. Additionally, a self-tuning quota management mechanism dynamically adjusts prediction quotas per cluster to improve precision. Experimental results show that SafeLoad achieves state-of-the-art prediction performance with low online and offline time overhead. Specifically, SafeLoad improves precision by up to 66% over the best baseline and reduces wasted CPU time by up to 8.09x compared to scenarios without SafeLoad.

[LG-24] High-Order Epistasis Detection Using Factorization Machine with Quadratic Optimization Annealing and MDR-Based Evaluation

链接: https://arxiv.org/abs/2601.01860
作者: Shuta Kikuchi,Shu Tanaka
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 6 pages, 2 figures

点击查看摘要

Abstract:Detecting high-order epistasis is a fundamental challenge in genetic association studies due to the combinatorial explosion of candidate locus combinations. Although multifactor dimensionality reduction (MDR) is a widely used method for evaluating epistasis, exhaustive MDR-based searches become computationally infeasible as the number of loci or the interaction order increases. In this paper, we define the epistasis detection problem as a black-box optimization problem and solve it with a factorization machine with quadratic optimization annealing (FMQA). We propose an efficient epistasis detection method based on FMQA, in which the classification error rate (CER) computed by MDR is used as a black-box objective function. Experimental evaluations were conducted using simulated case-control datasets with predefined high-order epistasis. The results demonstrate that the proposed method successfully identified ground-truth epistasis across various interaction orders and the numbers of genetic loci within a limited number of iterations. These results indicate that the proposed method is effective and computationally efficient for high-order epistasis detection.

[LG-25] ackling Resource-Constrained and Data-Heterogeneity in Federated Learning with Double-Weight Sparse Pack AAAI2026

链接: https://arxiv.org/abs/2601.01840
作者: Qiantao Yang,Liquan Chen,Mingfu Xue,Songze Li
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Accepted in AAAI 2026

点击查看摘要

Abstract:Federated learning has drawn widespread interest from researchers, yet the data heterogeneity across edge clients remains a key challenge, often degrading model performance. Existing methods enhance model compatibility with data heterogeneity by splitting models and knowledge distillation. However, they neglect the insufficient communication bandwidth and computing power on the client, failing to strike an effective balance between addressing data heterogeneity and accommodating limited client resources. To tackle this limitation, we propose a personalized federated learning method based on cosine sparsification parameter packing and dual-weighted aggregation (FedCSPACK), which effectively leverages the limited client resources and reduces the impact of data heterogeneity on model performance. In FedCSPACK, the client packages model parameters and selects the most contributing parameter packages for sharing based on cosine similarity, effectively reducing bandwidth requirements. The client then generates a mask matrix anchored to the shared parameter package to improve the alignment and aggregation efficiency of sparse updates on the server. Furthermore, directional and distribution distance weights are embedded in the mask to implement a weighted-guided aggregation mechanism, enhancing the robustness and generalization performance of the global model. Extensive experiments across four datasets using ten state-of-the-art methods demonstrate that FedCSPACK effectively improves communication and computational efficiency while maintaining high model accuracy.

[LG-26] FAROS: Robust Federated Learning with Adaptive Scaling against Backdoor Attacks

链接: https://arxiv.org/abs/2601.01833
作者: Chenyu Hu,Qiming Hu,Sinan Chen,Nianyu Li,Mingyue Zhang,Jialong Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) enables multiple clients to collaboratively train a shared model without exposing local data. However, backdoor attacks pose a significant threat to FL. These attacks aim to implant a stealthy trigger into the global model, causing it to mislead on inputs that possess a specific trigger while functioning normally on benign data. Although pre-aggregation detection is a main defense direction, existing state-of-the-art defenses often rely on fixed defense parameters. This reliance makes them vulnerable to single-point-of-failure risks, rendering them less effective against sophisticated attackers. To address these limitations, we propose FAROS, an enhanced FL framework that incorporates Adaptive Differential Scaling (ADS) and Robust Core-set Computing (RCC). The ADS mechanism adjusts the defense’s sensitivity dynamically, based on the dispersion of uploaded gradients by clients in each round. This allows it to counter attackers who strategically shift between stealthiness and effectiveness. Furthermore, the RCC effectively mitigates the risk of single-point failure by computing the centroid of a core set comprising clients with the highest confidence. We conducted extensive experiments across various datasets, models, and attack scenarios. The results demonstrate that our method outperforms current defenses in both attack success rate and main task accuracy.

[LG-27] RealPDEBench: A Benchmark for Complex Physical Systems with Real-World Data

链接: https://arxiv.org/abs/2601.01829
作者: Peiyan Hu,Haodong Feng,Hongyuan Liu,Tongtong Yan,Wenhao Deng,Tianrun Gao,Rong Zheng,Haoren Zheng,Chenglei Yu,Chuanrui Wang,Kaiwen Li,Zhi-Ming Ma,Dezhi Zhou,Xingcai Lu,Dixia Fan,Tailin Wu
类目: Machine Learning (cs.LG)
*备注: 46 pages, 21 figures

点击查看摘要

Abstract:Predicting the evolution of complex physical systems remains a central problem in science and engineering. Despite rapid progress in scientific Machine Learning (ML) models, a critical bottleneck is the lack of expensive real-world data, resulting in most current models being trained and validated on simulated data. Beyond limiting the development and evaluation of scientific ML, this gap also hinders research into essential tasks such as sim-to-real transfer. We introduce RealPDEBench, the first benchmark for scientific ML that integrates real-world measurements with paired numerical simulations. RealPDEBench consists of five datasets, three tasks, eight metrics, and ten baselines. We first present five real-world measured datasets with paired simulated datasets across different complex physical systems. We further define three tasks, which allow comparisons between real-world and simulated data, and facilitate the development of methods to bridge the two. Moreover, we design eight evaluation metrics, spanning data-oriented and physics-oriented metrics, and finally benchmark ten representative baselines, including state-of-the-art models, pretrained PDE foundation models, and a traditional method. Experiments reveal significant discrepancies between simulated and real-world data, while showing that pretraining with simulated data consistently improves both accuracy and convergence. In this work, we hope to provide insights from real-world data, advancing scientific ML toward bridging the sim-to-real gap and real-world deployment. Our benchmark, datasets, and instructions are available at this https URL.

[LG-28] Distributed Federated Learning by Alternating Periods of Training

链接: https://arxiv.org/abs/2601.01793
作者: Shamik Bhattacharyya,Rachel Kalpana Kalaimani
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Federated learning is a privacy-focused approach towards machine learning where models are trained on client devices with locally available data and aggregated at a central server. However, the dependence on a single central server is challenging in the case of a large number of clients and even poses the risk of a single point of failure. To address these critical limitations of scalability and fault-tolerance, we present a distributed approach to federated learning comprising multiple servers with inter-server communication capabilities. While providing a fully decentralized approach, the designed framework retains the core federated learning structure where each server is associated with a disjoint set of clients with server-client communication capabilities. We propose a novel DFL (Distributed Federated Learning) algorithm which uses alternating periods of local training on the client data followed by global training among servers. We show that the DFL algorithm, under a suitable choice of parameters, ensures that all the servers converge to a common model value within a small tolerance of the ideal model, thus exhibiting effective integration of local and global training models. Finally, we illustrate our theoretical claims through numerical simulations.

[LG-29] UnPII: Unlearning Personally Identifiable Information with Quantifiable Exposure Risk ICSE

链接: https://arxiv.org/abs/2601.01786
作者: Intae Jeon,Yujeong Kwon,Hyungjoon Koo
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 11 pages, 7 Tables, 6 Figures To appear in the Software Engineering in Practice (SEIP) track of ICSE

点击查看摘要

Abstract:The ever-increasing adoption of Large Language Models in critical sectors like finance, healthcare, and government raises privacy concerns regarding the handling of sensitive Personally Identifiable Information (PII) during training. In response, regulations such as European Union’s General Data Protection Regulation (GDPR) mandate the deletion of PII upon requests, underscoring the need for reliable and cost-effective data removal solutions. Machine unlearning has emerged as a promising direction for selectively forgetting data points. However, existing unlearning techniques typically apply a uniform forgetting strategy that neither accounts for the varying privacy risks posed by different PII attributes nor reflects associated business risks. In this work, we propose UnPII, the first PII-centric unlearning approach that prioritizes forgetting based on the risk of individual or combined PII attributes. To this end, we introduce the PII risk index (PRI), a composite metric that incorporates multiple dimensions of risk factors: identifiability, sensitivity, usability, linkability, permanency, exposability, and compliancy. The PRI enables a nuanced evaluation of privacy risks associated with PII exposures and can be tailored to align with organizational privacy policies. To support realistic assessment, we systematically construct a synthetic PII dataset (e.g., 1,700 PII instances) that simulates realistic exposure scenarios. UnPII seamlessly integrates with established unlearning algorithms, such as Gradient Ascent, Negative Preference Optimization, and Direct Preference Optimization, without modifying their underlying principles. Our experimental results demonstrate that UnPII achieves the improvements of accuracy up to 11.8%, utility up to 6.3%, and generalizability up to 12.4%, respectively, while incurring a modest fine-tuning overhead of 27.5% on average during unlearning.

[LG-30] SRAS: A Lightweight Reinforcement Learning-based Document Selector for Edge-Native RAG Pipelines

链接: https://arxiv.org/abs/2601.01785
作者: Rajiv Chaitanya Muttur
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Presented at ICEdge 2025; nominated for Best Paper Award

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems often rely on fixed top-k document selection mechanisms that ignore downstream generation quality and impose computational overheads. We propose SRAS (Sparse Reward-Aware Selector), a lightweight document selector trained via reinforcement learning (RL) for edge-native RAG deployment. Unlike prior RL-based retrievers that assume large memory and latency budgets, SRAS learns a compact (~0.76MB) policy using Proximal Policy Optimization (PPO), guided by a hybrid reward signal combining Relaxed F1 and BERTScore. Our method operates under tight token and compute constraints, maintaining 1s latency on CPU. SRAS outperforms supervised and random selectors on a synthetic QA benchmark, and generalizes to real-world data, achieving BERTScore F1 of 0.8546 on SQuAD v2 without domain-specific tuning. This work is the first to demonstrate that RL-based document selection can be made ultra-lightweight, latency-aware, and effective for on-device RAG pipelines.

[LG-31] Hidden costs for inference with deep network on embedded system devices

链接: https://arxiv.org/abs/2601.01698
作者: Chankyu Lee,Woohyun Choi,Sangwook Park
类目: Computational Complexity (cs.CC); Machine Learning (cs.LG)
*备注: published in Proc. of IEEE ICCE 2025

点击查看摘要

Abstract:This study evaluates the inference performance of various deep learning models under an embedded system environment. In previous works, Multiply-Accumulate operation is typically used to measure computational load of a deep model. According to this study, however, this metric has a limitation to estimate inference time on embedded devices. This paper poses the question of what aspects are overlooked when expressed in terms of Multiply-Accumulate operations. In experiments, an image classification task is performed on an embedded system device using the CIFAR-100 dataset to compare and analyze the inference times of ten deep models with the theoretically calculated Multiply-Accumulate operations for each model. The results highlight the importance of considering additional computations between tensors when optimizing deep learning models for real-time performing in embedded systems.

[LG-32] Enhanced Multi-model Online Conformal Prediction

链接: https://arxiv.org/abs/2601.01692
作者: Erfan Hajihashemi,Yanning Shen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Conformal prediction is a framework for uncertainty quantification that constructs prediction sets for previously unseen data, guaranteeing coverage of the true label with a specified probability. However, the efficiency of these prediction sets, measured by their size, depends on the choice of the underlying learning model. Relying on a single fixed model may lead to suboptimal performance in online environments, as a single model may not consistently perform well across all time steps. To mitigate this, prior work has explored selecting a model from a set of candidates. However, this approach becomes computationally expensive as the number of candidate models increases. Moreover, poorly performing models in the set may also hinder the effectiveness. To tackle this challenge, this work develops a novel multi-model online conformal prediction algorithm that reduces computational complexity and improves prediction efficiency. At each time step, a bipartite graph is generated to identify a subset of effective models, from which a model is selected to construct the prediction set. Experiments demonstrate that our method outperforms existing multi-model conformal prediction techniques in terms of both prediction set size and computational efficiency.

[LG-33] DiMEx: Breaking the Cold Start Barrier in Data-Free Model Extraction via Latent Diffusion Priors

链接: https://arxiv.org/abs/2601.01688
作者: Yash Thesia,Meera Suthar
类目: Machine Learning (cs.LG)
*备注: 8 pages, 3 figures, 4 tables

点击查看摘要

Abstract:Model stealing attacks pose an existential threat to Machine Learning as a Service (MLaaS), allowing adversaries to replicate proprietary models for a fraction of their training cost. While Data-Free Model Extraction (DFME) has emerged as a stealthy vector, it remains fundamentally constrained by the “Cold Start” problem: GAN-based adversaries waste thousands of queries converging from random noise to meaningful data. We propose DiMEx, a framework that weaponizes the rich semantic priors of pre-trained Latent Diffusion Models to bypass this initialization barrier entirely. By employing Random Embedding Bayesian Optimization (REMBO) within the generator’s latent space, DiMEx synthesizes high-fidelity queries immediately, achieving 52.1 percent agreement on SVHN with just 2,000 queries - outperforming state-of-the-art GAN baselines by over 16 percent. To counter this highly semantic threat, we introduce the Hybrid Stateful Ensemble (HSE) defense, which identifies the unique “optimization trajectory” of latent-space attacks. Our results demonstrate that while DiMEx evades static distribution detectors, HSE exploits this temporal signature to suppress attack success rates to 21.6 percent with negligible latency.

[LG-34] HeurekaBench: A Benchmarking Framework for AI Co-scientist

链接: https://arxiv.org/abs/2601.01678
作者: Siba Smarak Panigrahi,Jovana Videnović,Maria Brbić
类目: Machine Learning (cs.LG)
*备注: 33 pages, 5 figures, 7 tables. Code available at this https URL

点击查看摘要

Abstract:LLM-based reasoning models have enabled the development of agentic systems that act as co-scientists, assisting in multi-step scientific analysis. However, evaluating these systems is challenging, as it requires realistic, end-to-end research scenarios that integrate data analysis, interpretation, and the generation of new insights from the experimental data. To address this limitation, we introduce HeurekaBench, a framework to create benchmarks with exploratory, open-ended research questions for experimental datasets. Each such question is grounded in a scientific study and its corresponding code repository, and is created using a semi-automated pipeline that leverages multiple LLMs to extract insights and generate candidate workflows, which are then verified against reported findings. We instantiate the framework in single-cell biology to obtain sc-HeurekaBench benchmark and use it to compare state-of-the-art single-cell agents. We further showcase the benefits of our benchmark for quantitatively analyzing current design choices in agentic systems. We find that the addition of a critic module can improve ill-formed responses for open-source LLM-based agents by up to 22% and close the gap with their closed-source counterparts. Overall, HeurekaBench sets a path toward rigorous, end-to-end evaluation of scientific agents, grounding benchmark construction in real scientific workflows.

[LG-35] Who is the Winning Algorithm? Rank Aggregation for Comparative Studies

链接: https://arxiv.org/abs/2601.01664
作者: Amichai Painsky
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Consider a collection of m competing machine learning algorithms. Given their performance on a benchmark of datasets, we would like to identify the best performing algorithm. Specifically, which algorithm is most likely to ``win’’ (rank highest) on a future, unseen dataset. The standard maximum likelihood approach suggests counting the number of wins per each algorithm. In this work, we argue that there is much more information in the complete rankings. That is, the number of times that each algorithm finished second, third and so forth. Yet, it is not entirely clear how to effectively utilize this information for our purpose. In this work we introduce a novel conceptual framework for estimating the win probability for each of the m algorithms, given their complete rankings over a benchmark of datasets. Our proposed framework significantly improves upon currently known methods in synthetic and real-world examples.

[LG-36] Communication-Efficient Federated AUC Maximization with Cyclic Client Participation

链接: https://arxiv.org/abs/2601.01649
作者: Umesh Vangapally,Wenhan Wu,Chen Chen,Zhishuai Guo
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Accepted to Transactions on Machine Learning Research (TMLR)

点击查看摘要

Abstract:Federated AUC maximization is a powerful approach for learning from imbalanced data in federated learning (FL). However, existing methods typically assume full client availability, which is rarely practical. In real-world FL systems, clients often participate in a cyclic manner: joining training according to a fixed, repeating schedule. This setting poses unique optimization challenges for the non-decomposable AUC objective. This paper addresses these challenges by developing and analyzing communication-efficient algorithms for federated AUC maximization under cyclic client participation. We investigate two key settings: First, we study AUC maximization with a squared surrogate loss, which reformulates the problem as a nonconvex-strongly-concave minimax optimization. By leveraging the Polyak-Łojasiewicz (PL) condition, we establish a state-of-the-art communication complexity of \widetildeO(1/\epsilon^1/2) and iteration complexity of \widetildeO(1/\epsilon) . Second, we consider general pairwise AUC losses. We establish a communication complexity of O(1/\epsilon^3) and an iteration complexity of O(1/\epsilon^4) . Further, under the PL condition, these bounds improve to communication complexity of \widetildeO(1/\epsilon^1/2) and iteration complexity of \widetildeO(1/\epsilon) . Extensive experiments on benchmark tasks in image classification, medical imaging, and fraud detection demonstrate the superior efficiency and effectiveness of our proposed methods.

[LG-37] Real Time NILM Based Power Monitoring of Identical Induction Motors Representing Cutting Machines in Textile Industry

链接: https://arxiv.org/abs/2601.01616
作者: Md Istiauk Hossain Rifat,Moin Khan,Mohammad Zunaed
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 9 pages, 9 figures

点击查看摘要

Abstract:The textile industry in Bangladesh is one of the most energy-intensive sectors, yet its monitoring practices remain largely outdated, resulting in inefficient power usage and high operational costs. To address this, we propose a real-time Non-Intrusive Load Monitoring (NILM)-based framework tailored for industrial applications, with a focus on identical motor-driven loads representing textile cutting machines. A hardware setup comprising voltage and current sensors, Arduino Mega and ESP8266 was developed to capture aggregate and individual load data, which was stored and processed on cloud platforms. A new dataset was created from three identical induction motors and auxiliary loads, totaling over 180,000 samples, to evaluate the state-of-the-art MATNILM model under challenging industrial conditions. Results indicate that while aggregate energy estimation was reasonably accurate, per-appliance disaggregation faced difficulties, particularly when multiple identical machines operated simultaneously. Despite these challenges, the integrated system demonstrated practical real-time monitoring with remote accessibility through the Blynk application. This work highlights both the potential and limitations of NILM in industrial contexts, offering insights into future improvements such as higher-frequency data collection, larger-scale datasets and advanced deep learning approaches for handling identical loads.

[LG-38] Advanced Global Wildfire Activity Modeling with Hierarchical Graph ODE

链接: https://arxiv.org/abs/2601.01501
作者: Fan Xu,Wei Gong,Hao Wu,Lilan Peng,Nan Wang,Qingsong Wen,Xian Wu,Kun Wang,Xibin Zhao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Wildfires, as an integral component of the Earth system, are governed by a complex interplay of atmospheric, oceanic, and terrestrial processes spanning a vast range of spatiotemporal scales. Modeling their global activity on large timescales is therefore a critical yet challenging task. While deep learning has recently achieved significant breakthroughs in global weather forecasting, its potential for global wildfire behavior prediction remains underexplored. In this work, we reframe this problem and introduce the Hierarchical Graph ODE (HiGO), a novel framework designed to learn the multi-scale, continuous-time dynamics of wildfires. Specifically, we represent the Earth system as a multi-level graph hierarchy and propose an adaptive filtering message passing mechanism for both intra- and inter-level information flow, enabling more effective feature extraction and fusion. Furthermore, we incorporate GNN-parameterized Neural ODE modules at multiple levels to explicitly learn the continuous dynamics inherent to each scale. Through extensive experiments on the SeasFire Cube dataset, we demonstrate that HiGO significantly outperforms state-of-the-art baselines on long-range wildfire forecasting. Moreover, its continuous-time predictions exhibit strong observational consistency, highlighting its potential for real-world applications.

[LG-39] Accelerating Decentralized Optimization via Overlapping Local Steps

链接: https://arxiv.org/abs/2601.01493
作者: Yijie Zhou,Shi Pu
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Decentralized optimization has emerged as a critical paradigm for distributed learning, enabling scalable training while preserving data privacy through peer-to-peer collaboration. However, existing methods often suffer from communication bottlenecks due to frequent synchronization between nodes. We present Overlapping Local Decentralized SGD (OLDSGD), a novel approach to accelerate decentralized training by computation-communication overlapping, significantly reducing network idle time. With a deliberately designed update, OLDSGD preserves the same average update as Local SGD while avoiding communication-induced stalls. Theoretically, we establish non-asymptotic convergence rates for smooth non-convex objectives, showing that OLDSGD retains the same iteration complexity as standard Local Decentralized SGD while improving per-iteration runtime. Empirical results demonstrate OLDSGD’s consistent improvements in wall-clock time convergence under different levels of communication delays. With minimal modifications to existing frameworks, OLDSGD offers a practical solution for faster decentralized learning without sacrificing theoretical guarantees.

[LG-40] SGD-Based Knowledge Distillation with Bayesian Teachers: Theory and Guidelines

链接: https://arxiv.org/abs/2601.01484
作者: Itai Morad,Nir Shlezinger,Yonina C. Eldar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Knowledge Distillation (KD) is a central paradigm for transferring knowledge from a large teacher network to a typically smaller student model, often by leveraging soft probabilistic outputs. While KD has shown strong empirical success in numerous applications, its theoretical underpinnings remain only partially understood. In this work, we adopt a Bayesian perspective on KD to rigorously analyze the convergence behavior of students trained with Stochastic Gradient Descent (SGD). We study two regimes: (i) when the teacher provides the exact Bayes Class Probabilities (BCPs); and (ii) supervision with noisy approximations of the BCPs. Our analysis shows that learning from BCPs yields variance reduction and removes neighborhood terms in the convergence bounds compared to one-hot supervision. We further characterize how the level of noise affects generalization and accuracy. Motivated by these insights, we advocate the use of Bayesian deep learning models, which typically provide improved estimates of the BCPs, as teachers in KD. Consistent with our analysis, we experimentally demonstrate that students distilled from Bayesian teachers not only achieve higher accuracies (up to +4.27%), but also exhibit more stable convergence (up to 30% less noise), compared to students distilled from deterministic teachers.

[LG-41] Multi-Subspace Multi-Modal Modeling for Diffusion Models: Estimation Convergence and Mixture of Experts

链接: https://arxiv.org/abs/2601.01475
作者: Ruofeng Yang,Yongcan Li,Bo Jiang,Cheng Chen,Shuai Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recently, diffusion models have achieved a great performance with a small dataset of size n and a fast optimization process. However, the estimation error of diffusion models suffers from the curse of dimensionality n^-1/D with the data dimension D . Since images are usually a union of low-dimensional manifolds, current works model the data as a union of linear subspaces with Gaussian latent and achieve a 1/\sqrtn bound. Though this modeling reflects the multi-manifold property, the Gaussian latent can not capture the multi-modal property of the latent manifold. To bridge this gap, we propose the mixture subspace of low-rank mixture of Gaussian (MoLR-MoG) modeling, which models the target data as a union of K linear subspaces, and each subspace admits a mixture of Gaussian latent ( n_k modals with dimension d_k ). With this modeling, the corresponding score function naturally has a mixture of expert (MoE) structure, captures the multi-modal information, and contains nonlinear property. We first conduct real-world experiments to show that the generation results of MoE-latent MoG NN are much better than MoE-latent Gaussian score. Furthermore, MoE-latent MoG NN achieves a comparable performance with MoE-latent Unet with 10 \times parameters. These results indicate that the MoLR-MoG modeling is reasonable and suitable for real-world data. After that, based on such MoE-latent MoG score, we provide a R^4\sqrt\Sigma_k=1^Kn_k\sqrt\Sigma_k=1^Kn_kd_k/\sqrtn estimation error, which escapes the curse of dimensionality by using data structure. Finally, we study the optimization process and prove the convergence guarantee under the MoLR-MoG modeling. Combined with these results, under a setting close to real-world data, this work explains why diffusion models only require a small training sample and enjoy a fast optimization process to achieve a great performance.

[LG-42] Leverag ing Flatness to Improve Information-Theoretic Generalization Bounds for SGD ICLR2025

链接: https://arxiv.org/abs/2601.01465
作者: Ze Peng,Jian Zhang,Yisen Wang,Lei Qi,Yinghuan Shi,Yang Gao
类目: Machine Learning (cs.LG)
*备注: Published as a conference paper at ICLR 2025

点击查看摘要

Abstract:Information-theoretic (IT) generalization bounds have been used to study the generalization of learning algorithms. These bounds are intrinsically data- and algorithm-dependent so that one can exploit the properties of data and algorithm to derive tighter bounds. However, we observe that although the flatness bias is crucial for SGD’s generalization, these bounds fail to capture the improved generalization under better flatness and are also numerically loose. This is caused by the inadequate leverage of SGD’s flatness bias in existing IT bounds. This paper derives a more flatness-leveraging IT bound for the flatness-favoring SGD. The bound indicates the learned models generalize better if the large-variance directions of the final weight covariance have small local curvatures in the loss landscape. Experiments on deep neural networks show our bound not only correctly reflects the better generalization when flatness is improved, but is also numerically much tighter. This is achieved by a flexible technique called “omniscient trajectory”. When applied to Gradient Descent’s minimax excess risk on convex-Lipschitz-Bounded problems, it improves representative IT bounds’ \Omega(1) rates to O(1/\sqrtn) . It also implies a by-pass of memorization-generalization trade-offs.

[LG-43] Unveiling the Heart-Brain Connection: An Analysis of ECG in Cognitive Performance

链接: https://arxiv.org/abs/2601.01424
作者: Akshay Sasi,Malavika Pradeep,Nusaibah Farrukh,Rahul Venugopal,Elizabeth Sherly
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 6 pages, 6 figures. Code available at this https URL . Presented at AIHC (not published)

点击查看摘要

Abstract:Understanding the interaction of neural and cardiac systems during cognitive activity is critical to advancing physiological computing. Although EEG has been the gold standard for assessing mental workload, its limited portability restricts its real-world use. Widely available ECG through wearable devices proposes a pragmatic alternative. This research investigates whether ECG signals can reliably reflect cognitive load and serve as proxies for EEG-based indicators. In this work, we present multimodal data acquired from two different paradigms involving working-memory and passive-listening tasks. For each modality, we extracted ECG time-domain HRV metrics and Catch22 descriptors against EEG spectral and Catch22 features, respectively. We propose a cross-modal XGBoost framework to project the ECG features onto EEG-representative cognitive spaces, thereby allowing workload inferences using only ECG. Our results show that ECG-derived projections expressively capture variation in cognitive states and provide good support for accurate classification. Our findings underpin ECG as an interpretable, real-time, wearable solution for everyday cognitive monitoring.

[LG-44] A Depth Hierarchy for Computing the Maximum in ReLU Networks via Extremal Graph Theory

链接: https://arxiv.org/abs/2601.01417
作者: Itay Safran
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider the problem of exact computation of the maximum function over d real inputs using ReLU neural networks. We prove a depth hierarchy, wherein width \Omega\big(d^1+\frac12^k-2-1\big) is necessary to represent the maximum for any depth 3\le k\le \log_2(\log_2(d)) . This is the first unconditional super-linear lower bound for this fundamental operator at depths k\ge3 , and it holds even if the depth scales with d . Our proof technique is based on a combinatorial argument and associates the non-differentiable ridges of the maximum with cliques in a graph induced by the first hidden layer of the computing network, utilizing Turán’s theorem from extremal graph theory to show that a sufficiently narrow network cannot capture the non-linearities of the maximum. This suggests that despite its simple nature, the maximum function possesses an inherent complexity that stems from the geometric structure of its non-differentiable hyperplanes, and provides a novel approach for proving lower bounds for deep neural networks.

[LG-45] Efficient Cover Construction for Ball Mapper via Accelerated Range Queries

链接: https://arxiv.org/abs/2601.01405
作者: Jay-Anne Bulauan,John Rick Manzanares
类目: Computational Geometry (cs.CG); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Ball Mapper is an widely used tool in topological data analysis for summarizing the structure of high-dimensional data through metric-based coverings and graph representations. A central computational bottleneck in Ball Mapper is the construction of the underlying cover, which requires repeated range queries to identify data points within a fixed distance of selected landmarks. As data sets grow in size and dimensionality, naive implementations of this step become increasingly inefficient. In this work, we study practical strategies for accelerating cover construction in Ball Mapper by improving the efficiency of range queries. We integrate two complementary approaches into the Ball Mapper pipeline: hierarchical geometric pruning using ball tree data structures, and hardware-aware distance computation using Facebook AI Similarity Search. We describe the underlying algorithms, discuss their trade-offs with respect to metric flexibility and dimensionality, and provide implementation details relevant to large-scale data analysis. Empirical benchmarks demonstrate that both approaches yield substantial speedups over the baseline implementation, with performance gains depending on data set size, dimensionality, and choice of distance function. These results improve the practical scalability of Ball Mapper without modifying its theoretical formulation and provide guidance for the efficient implementation of metric-based exploratory tools in modern data analysis workflows. Subjects: Computational Geometry (cs.CG); Machine Learning (cs.LG) MSC classes: 55N31 Cite as: arXiv:2601.01405 [cs.CG] (or arXiv:2601.01405v1 [cs.CG] for this version) https://doi.org/10.48550/arXiv.2601.01405 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-46] Causal discovery for linear causal model with correlated noise: an Adversarial Learning Approach

链接: https://arxiv.org/abs/2601.01368
作者: Mujin Zhou,Junzhe Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Causal discovery from data with unmeasured confounding factors is a challenging problem. This paper proposes an approach based on the f-GAN framework, learning the binary causal structure independent of specific weight values. We reformulate the structure learning problem as minimizing Bayesian free energy and prove that this problem is equivalent to minimizing the f-divergence between the true data distribution and the model-generated distribution. Using the f-GAN framework, we transform this objective into a min-max adversarial optimization problem. We implement the gradient search in the discrete graph space using Gumbel-Softmax relaxation.

[LG-47] owards LLM -enabled autonomous combustion research: A literature-aware agent for self-corrective modeling workflows

链接: https://arxiv.org/abs/2601.01357
作者: Ke Xiao,Haoze Zhang,Runze Mao,Han Li,Zhi X. Chen
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:The rapid evolution of large language models (LLMs) is transforming artificial intelligence into autonomous research partners, yet a critical gap persists in complex scientific domains such as combustion modeling. Here, practical AI assistance requires the seamless integration of domain literature knowledge with robust execution capabilities for expertise-intensive tools such as computational fluid dynamics (CFD) codes. To bridge this gap, we introduce FlamePilot, an LLM agent designed to empower combustion modeling research through automated and self-corrective CFD workflows. FlamePilot differentiates itself through an architecture that leverages atomic tools to ensure the robust setup and execution of complex simulations in both OpenFOAM and extended frameworks such as DeepFlame. The system is also capable of learning from scientific articles, extracting key information to guide the simulation from initial setup to optimized results. Validation on a public benchmark shows FlamePilot achieved a perfect 1.0 executability score and a 0.438 success rate, surpassing the prior best reported agent scores of 0.625 and 0.250, respectively. Furthermore, a detailed case study on Moderate or Intense Low-oxygen Dilution (MILD) combustion simulation demonstrates its efficacy as a collaborative research copilot, where FlamePilot autonomously translated a research paper into a configured simulation, conducted the simulation, post-processed the results, proposed evidence-based refinements, and managed a multi-step parameter study to convergence under minimal human intervention. By adopting a transparent and interpretable paradigm, FlamePilot establishes a foundational framework for AI-empowered combustion modeling, fostering a collaborative partnership where the agent manages workflow orchestration, freeing the researcher for high-level analysis.

[LG-48] Spectral-Window Hybrid (SWH)

链接: https://arxiv.org/abs/2601.01313
作者: Vladimer Khasia
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Scaling sequence modeling to extreme contexts requires balancing computational efficiency with representational expressivity. While Transformers provide precise retrieval via the attention mechanism, their quadratic \mathcalO(T^2) complexity limits their application to long-horizon tasks. In this work, we propose the \textbfSpectral-Window Hybrid (SWH), an architecture that decouples sequence modeling into two \textitparallel streams: a global branch utilizing the Convolution Theorem to model long-range decay dynamics in \mathcalO(T \log T) time, and a local branch employing sliding-window attention for token interactions within a bounded context. By aggregating these representations, SWH avoids the computational bottleneck of global attention while retaining local precision. We demonstrate that SWH matches the perplexity of standard Transformers on short contexts while enabling efficient linear scaling to extended sequences. The code is available at this https URL

[LG-49] Making MoE based LLM inference resilient with Tarrag on

链接: https://arxiv.org/abs/2601.01310
作者: Songyu Zhang,Aaron Tam,Myungjin Lee,Shixiong Qi,K. K. Ramakrishnan
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mixture-of-Experts (MoE) models are increasingly used to serve LLMs at scale, but failures become common as deployment scale grows. Existing systems exhibit poor failure resilience: even a single worker failure triggers a coarse-grained, service-wide restart, discarding accumulated progress and halting the entire inference pipeline during recovery–an approach clearly ill-suited for latency-sensitive, LLM services. We present Tarragon, a resilient MoE inference framework that confines the failures impact to individual workers while allowing the rest of the pipeline to continue making forward progress. Tarragon exploits the natural separation between the attention and expert computation in MoE-based transformers, treating attention workers (AWs) and expert workers (EWs) as distinct failure domains. Tarragon introduces a reconfigurable datapath to mask failures by rerouting requests to healthy workers. On top of this datapath, Tarragon implements a self-healing mechanism that relaxes the tightly synchronized execution of existing MoE frameworks. For stateful AWs, Tarragon performs asynchronous, incremental KV cache checkpointing with per-request restoration, and for stateless EWs, it leverages residual GPU memory to deploy shadow experts. These together keep recovery cost and recomputation overhead extremely low. Our evaluation shows that, compared to state-of-the-art MegaScale-Infer, Tarragon reduces failure-induced stalls by 160-213x (from ~64 s down to 0.3-0.4 s) while preserving performance when no failures occur. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG) Cite as: arXiv:2601.01310 [cs.DC] (or arXiv:2601.01310v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2601.01310 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-50] owards a Principled Muon under μmathsfP: Ensuring Spectral Conditions throughout Training

链接: https://arxiv.org/abs/2601.01306
作者: John Zhao
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 21 pages, 0 figures

点击查看摘要

Abstract:The \mu -parameterization ( \mu P) provides a principled foundation for large language model (LLM) training by prescribing width-independent learning dynamics, which in turn enables predictable scaling behavior and robust hyperparameter transfer across model sizes. A central requirement of \mu P is the satisfaction of certain spectral conditions on weight matrices, which ensure consistent feature learning and optimization behavior as model width grows. While these conditions are well understood in theory, guaranteeing their validity in practical training for matrix-based optimizers such as Muon is still under studied. Existing works that study Muon under \mu P exhibit important limitations: they either do not ensure that the spectral conditions hold throughout the entire training horizon, or require repeated spectral normalization (or Newton-Schulz iterations) applied to both weights and updates, leading to significant computational overhead and reduced practicality. In this work, we show how to reliably guarantee the spectral conditions required by \mu P for Muon during the entire training process. Our key insight is that for moderately large models, maintaining spectral control at the level of optimizer updates alone is sufficient to preserve \mu P-compatible scaling, eliminating the need for explicit spectral normalization of the weights. Based on this principle, we develop a variant of Muon, namely Muon++, that satisfies spectral condition throughout the training process. Our results bridge the gap between the theoretical promises of \mu P and the practical deployment of matrix-based optimizers in long-horizon training. We also take the first step towards an adaptive spectral condition by incorporating data-dependent effects, making it better suited for long-horizon LLM training.

[LG-51] Sobolev Approximation of Deep ReLU Network in Log-weighted Barron Space

链接: https://arxiv.org/abs/2601.01295
作者: Changhoon Song,Seungchan Ko,Youngjoon Hong
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Universal approximation theorems show that neural networks can approximate any continuous function; however, the number of parameters may grow exponentially with the ambient dimension, so these results do not fully explain the practical success of deep models on high-dimensional data. Barron space theory addresses this: if a target function belongs to a Barron space, a two-layer network with n parameters achieves an O(n^-1/2) approximation error in L^2 . Yet classical Barron spaces \mathscrB^s+1 still require stronger regularity than Sobolev spaces H^s , and existing depth-sensitive results often assume constraints such as sL \le 1/2 . In this paper, we introduce a log-weighted Barron space \mathscrB^\log , which requires a strictly weaker assumption than \mathscrB^s for any s0 . For this new function space, we first study embedding properties and carry out a statistical analysis via the Rademacher complexity. Then we prove that functions in \mathscrB^\log can be approximated by deep ReLU networks with explicit depth dependence. We then define a family \mathscrB^s,\log , establish approximation bounds in the H^1 norm, and identify maximal depth scales under which these rates are preserved. Our results clarify how depth reduces regularity requirements for efficient representation, offering a more precise explanation for the performance of deep architectures beyond the classical Barron setting, and for their stable use in high-dimensional problems used today.

[LG-52] he Alchemy of Thought: Understanding In-Context Learning Through Supervised Classification

链接: https://arxiv.org/abs/2601.01290
作者: Harshita Narnoli,Mihai Surdeanu
类目: Machine Learning (cs.LG)
*备注: International Joint Conference on Natural Language Processing Asia-Pacific Chapter of the Association for Computational Linguistics, 2025

点击查看摘要

Abstract:In-context learning (ICL) has become a prominent paradigm to rapidly customize LLMs to new tasks without fine-tuning. However, despite the empirical evidence of its usefulness, we still do not truly understand how ICL works. In this paper, we compare the behavior of in-context learning with supervised classifiers trained on ICL demonstrations to investigate three research questions: (1) Do LLMs with ICL behave similarly to classifiers trained on the same examples? (2) If so, which classifiers are closer, those based on gradient descent (GD) or those based on k-nearest neighbors (kNN)? (3) When they do not behave similarly, what conditions are associated with differences in behavior? Using text classification as a use case, with six datasets and three LLMs, we observe that LLMs behave similarly to these classifiers when the relevance of demonstrations is high. On average, ICL is closer to kNN than logistic regression, giving empirical evidence that the attention mechanism behaves more similarly to kNN than GD. However, when demonstration relevance is low, LLMs perform better than these classifiers, likely because LLMs can back off to their parametric memory, a luxury these classifiers do not have.

[LG-53] Accelerated Full Waveform Inversion by Deep Compressed Learning

链接: https://arxiv.org/abs/2601.01268
作者: Maayan Gelboim,Amir Adler,Mauricio Araya-Polo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose and test a method to reduce the dimensionality of Full Waveform Inversion (FWI) inputs as computational cost mitigation approach. Given modern seismic acquisition systems, the data (as input for FWI) required for an industrial-strength case is in the teraflop level of storage, therefore solving complex subsurface cases or exploring multiple scenarios with FWI become prohibitive. The proposed method utilizes a deep neural network with a binarized sensing layer that learns by compressed learning a succinct but consequential seismic acquisition layout from a large corpus of subsurface models. Thus, given a large seismic data set to invert, the trained network selects a smaller subset of the data, then by using representation learning, an autoencoder computes latent representations of the data, followed by K-means clustering of the latent representations to further select the most relevant data for FWI. Effectively, this approach can be seen as a hierarchical selection. The proposed approach consistently outperforms random data sampling, even when utilizing only 10% of the data for 2D FWI, these results pave the way to accelerating FWI in large scale 3D inversion.

[LG-54] he Dependency Divide: An Interpretable Machine Learning Framework for Profiling Student Digital Satisfaction in the Bangladesh Context

链接: https://arxiv.org/abs/2601.01231
作者: Md Muhtasim Munif Fahim,Humyra Ankona,Md Monimul Huq,Md Rezaul Karim
类目: Machine Learning (cs.LG)
*备注: Conference Paper

点击查看摘要

Abstract:Background: While digital access has expanded rapidly in resource-constrained contexts, satisfaction with digital learning platforms varies significantly among students with seemingly equal connectivity. Traditional digital divide frameworks fail to explain these variations. Purpose: This study introduces the “Dependency Divide”, a novel framework proposing that highly engaged students become conditionally vulnerable to infrastructure failures, challenging assumptions that engagement uniformly benefits learners in post-access environments. Methods: We conducted a cross-sectional study of 396 university students in Bangladesh using a three-stage analytical approach: (1) stability-validated K-prototypes clustering to identify student profiles, (2) profile-specific Random Forest models with SHAP and ALE analysis to determine satisfaction drivers, and (3) formal interaction analysis with propensity score matching to test the Dependency Divide hypothesis. Results: Three distinct profiles emerged: Casually Engaged (58%), Efficient Learners (35%), and Hyper-Engaged (7%). A significant interaction between educational device time and internet reliability (\beta = 0.033, p = 0.028) confirmed the Dependency Divide: engagement increased satisfaction only when infrastructure remained reliable. Hyper-Engaged students showed greatest vulnerability despite or because of their sophisticated digital workflows. Policy simulations demonstrated that targeted reliability improvements for high-dependency users yielded 2.06 times greater returns than uniform interventions. Conclusions: In fragile infrastructure contexts, capability can become liability. Digital transformation policies must prioritize reliability for dependency-prone users, establish contingency systems, and educate students about dependency risks rather than uniformly promoting engagement. Comments: Conference Paper Subjects: Machine Learning (cs.LG) Cite as: arXiv:2601.01231 [cs.LG] (or arXiv:2601.01231v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2601.01231 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Md Muhtasim Munif Fahim [view email] [v1] Sat, 3 Jan 2026 16:37:51 UTC (1,758 KB) Full-text links: Access Paper: View a PDF of the paper titled The Dependency Divide: An Interpretable Machine Learning Framework for Profiling Student Digital Satisfaction in the Bangladesh Context, by Md Muhtasim Munif Fahim and 2 other authorsView PDF view license Current browse context: cs.LG prev | next new | recent | 2026-01 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) IArxiv recommender toggle IArxiv Recommender (What is IArxiv?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status

[LG-55] Adaptive Conformal Prediction via Bayesian Uncertainty Weighting for Hierarchical Healthcare Data

链接: https://arxiv.org/abs/2601.01223
作者: Marzieh Amiri Shahbazi,Ali Baheri,Nasibeh Azadeh-Fard
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Clinical decision-making demands uncertainty quantification that provides both distribution-free coverage guarantees and risk-adaptive precision, requirements that existing methods fail to jointly satisfy. We present a hybrid Bayesian-conformal framework that addresses this fundamental limitation in healthcare predictions. Our approach integrates Bayesian hierarchical random forests with group-aware conformal calibration, using posterior uncertainties to weight conformity scores while maintaining rigorous coverage validity. Evaluated on 61,538 admissions across 3,793 U.S. hospitals and 4 regions, our method achieves target coverage (94.3% vs 95% target) with adaptive precision: 21% narrower intervals for low-uncertainty cases while appropriately widening for high-risk predictions. Critically, we demonstrate that well-calibrated Bayesian uncertainties alone severely under-cover (14.1%), highlighting the necessity of our hybrid approach. This framework enables risk-stratified clinical protocols, efficient resource planning for high-confidence predictions, and conservative allocation with enhanced oversight for uncertain cases, providing uncertainty-aware decision support across diverse healthcare settings.

[LG-56] Sparse Bayesian Message Passing under Structural Uncertainty

链接: https://arxiv.org/abs/2601.01207
作者: Yoonhyuk Choi,Jiho Choi,Chanran Kim,Yumin Lee,Hawon Shin,Yeowon Jeon,Minjeong Kim,Jiwoo Kang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Semi-supervised learning on real-world graphs is frequently challenged by heterophily, where the observed graph is unreliable or label-disassortative. Many existing graph neural networks either rely on a fixed adjacency structure or attempt to handle structural noise through regularization. In this work, we explicitly capture structural uncertainty by modeling a posterior distribution over signed adjacency matrices, allowing each edge to be positive, negative, or absent. We propose a sparse signed message passing network that is naturally robust to edge noise and heterophily, which can be interpreted from a Bayesian perspective. By combining (i) posterior marginalization over signed graph structures with (ii) sparse signed message aggregation, our approach offers a principled way to handle both edge noise and heterophily. Experimental results demonstrate that our method outperforms strong baseline models on heterophilic benchmarks under both synthetic and real-world structural noise.

[LG-57] Evo-TFS: Evolutionary Time-Frequency Domain-Based Synthetic Minority Oversampling Approach to Imbalanced Time Series Classification

链接: https://arxiv.org/abs/2601.01150
作者: Wenbin Pei,Ruohao Dai,Bing Xue,Mengjie Zhang,Qiang Zhang,Yiu-Ming Cheung
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Time series classification is a fundamental machine learning task with broad real-world applications. Although many deep learning methods have proven effective in learning time-series data for classification, they were originally developed under the assumption of balanced data distributions. Once data distribution is uneven, these methods tend to ignore the minority class that is typically of higher practical significance. Oversampling methods have been designed to address this by generating minority-class samples, but their reliance on linear interpolation often hampers the preservation of temporal dynamics and the generation of diverse samples. Therefore, in this paper, we propose Evo-TFS, a novel evolutionary oversampling method that integrates both time- and frequency-domain characteristics. In Evo-TFS, strongly typed genetic programming is employed to evolve diverse, high-quality time series, guided by a fitness function that incorporates both time-domain and frequency-domain characteristics. Experiments conducted on imbalanced time series datasets demonstrate that Evo-TFS outperforms existing oversampling methods, significantly enhancing the performance of time-domain and frequency-domain classifiers.

[LG-58] Self-Training the Neurochaos Learning Algorithm

链接: https://arxiv.org/abs/2601.01146
作者: Anusree M,Akhila Henry,Pramod P Nair
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In numerous practical applications, acquiring substantial quantities of labelled data is challenging and expensive, but unlabelled data is readily accessible. Conventional supervised learning methods frequently underperform in scenarios characterised by little labelled data or imbalanced datasets. This study introduces a hybrid semi-supervised learning (SSL) architecture that integrates Neurochaos Learning (NL) with a threshold-based Self-Training (ST) method to overcome this constraint. The NL architecture converts input characteristics into chaos-based ring-rate representations that encapsulate nonlinear relationships within the data, whereas ST progressively enlarges the labelled set utilising high-confidence pseudo-labelled samples. The model’s performance is assessed using ten benchmark datasets and five machine learning classifiers, with 85% of the training data considered unlabelled and just 15% utilised as labelled data. The proposed Self-Training Neurochaos Learning (NL+ST) architecture consistently attains superior performance gain relative to standalone ST models, especially on limited, nonlinear and imbalanced datasets like Iris (188.66%), Wine (158.58%) and Glass Identification (110.48%). The results indicate that using chaos-based feature extraction with SSL improves generalisation, resilience, and classification accuracy in low-data contexts.

[LG-59] Community-Based Early-Stage Chronic Kidney Disease Screening using Explainable Machine Learning for Low-Resource Settings

链接: https://arxiv.org/abs/2601.01119
作者: Muhammad Ashad Kabir,Sirajam Munira,Dewan Tasnia Azad,Saleh Mohammed Ikram,Mohammad Habibur Rahman Sarker,Syed Manzoor Ahmed Hanifi
类目: Machine Learning (cs.LG)
*备注: 27 pages

点击查看摘要

Abstract:Early detection of chronic kidney disease (CKD) is essential for preventing progression to end-stage renal disease. However, existing screening tools - primarily developed using populations from high-income countries - often underperform in Bangladesh and South Asia, where risk profiles differ. Most of these tools rely on simple additive scoring functions and are based on data from patients with advanced-stage CKD. Consequently, they fail to capture complex interactions among risk factors and are limited in predicting early-stage CKD. Our objective was to develop and evaluate an explainable machine learning (ML) framework for community-based early-stage CKD screening for low-resource settings, tailored to the Bangladeshi and South Asian population context. We used a community-based dataset from Bangladesh, the first such CKD dataset in South and South Asia, and evaluated twelve ML classifiers across multiple feature domains. Ten complementary feature selection techniques were applied to identify robust, generalizable predictors. The final models were assessed using 10-fold cross-validation. External validation was conducted on three independent datasets from India, the UAE, and Bangladesh. SHAP (SHapley Additive exPlanations) was used to provide model explainability. An ML model trained on an RFECV-selected feature subset achieved a balanced accuracy of 90.40%, whereas minimal non-pathology-test features demonstrated excellent predictive capability with a balanced accuracy of 89.23%, often outperforming larger or full feature sets. Compared with existing screening tools, the proposed models achieved substantially higher accuracy and sensitivity while requiring fewer and more accessible inputs. External validation confirmed strong generalizability with 78% to 98% sensitivity. SHAP interpretation identified clinically meaningful predictors consistent with established CKD risk factors.

[LG-60] Central Dogma Transformer: Towards Mechanism-Oriented AI for Cellular Understanding

链接: https://arxiv.org/abs/2601.01089
作者: Nobuyuki Ota
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注:

点击查看摘要

Abstract:Understanding cellular mechanisms requires integrating information across DNA, RNA, and protein - the three molecular systems linked by the Central Dogma of molecular biology. While domain-specific foundation models have achieved success for each modality individually, they remain isolated, limiting our ability to model integrated cellular processes. Here we present the Central Dogma Transformer (CDT), an architecture that integrates pre-trained language models for DNA, RNA, and protein following the directional logic of the Central Dogma. CDT employs directional cross-attention mechanisms - DNA-to-RNA attention models transcriptional regulation, while RNA-to-Protein attention models translational relationships - producing a unified Virtual Cell Embedding that integrates all three modalities. We validate CDT v1 - a proof-of-concept implementation using fixed (non-cell-specific) RNA and protein embeddings - on CRISPRi enhancer perturbation data from K562 cells, achieving a Pearson correlation of 0.503, representing 63% of the theoretical ceiling set by cross-experiment variability (r = 0.797). Attention and gradient analyses provide complementary interpretive windows: in detailed case studies, these approaches highlight largely distinct genomic regions, with gradient analysis identifying a CTCF binding site that Hi-C data showed as physically contacting both enhancer and target gene. These results suggest that AI architectures aligned with biological information flow can achieve both predictive accuracy and mechanistic interpretability.

[LG-61] Discount Model Search for Quality Diversity Optimization in High-Dimensional Measure Spaces

链接: https://arxiv.org/abs/2601.01082
作者: Bryon Tjanaka,Henry Chen,Matthew C. Fontaine,Stefanos Nikolaidis
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: Source code available at this https URL

点击查看摘要

Abstract:Quality diversity (QD) optimization searches for a collection of solutions that optimize an objective while attaining diverse outputs of a user-specified, vector-valued measure function. Contemporary QD algorithms focus on low-dimensional measures because high-dimensional measures are prone to distortion, where many solutions found by the QD algorithm map to similar measures. For example, the CMA-MAE algorithm guides measure space exploration with a histogram in measure space that records so-called discount values. However, CMA-MAE stagnates in domains with high-dimensional measure spaces because solutions with similar measures fall into the same histogram cell and thus receive identical discount values. To address these limitations, we propose Discount Model Search (DMS), which guides exploration with a model that provides a smooth, continuous representation of discount values. In high-dimensional measure spaces, this model enables DMS to distinguish between solutions with similar measures and thus continue exploration. We show that DMS facilitates new QD applications by introducing two domains where the measure space is the high-dimensional space of images, which enables users to specify their desired measures by providing a dataset of images rather than hand-designing the measure function. Results in these domains and on high-dimensional benchmarks show that DMS outperforms CMA-MAE and other black-box QD algorithms.

[LG-62] Revisiting Weighted Strategy for Non-stationary Parametric Bandits and MDPs

链接: https://arxiv.org/abs/2601.01069
作者: Jing Wang,Peng Zhao,Zhi-Hua Zhou
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: accepted by IEEE Transactions on Information Theory. arXiv admin note: substantial text overlap with arXiv:2303.02691

点击查看摘要

Abstract:Non-stationary parametric bandits have attracted much attention recently. There are three principled ways to deal with non-stationarity, including sliding-window, weighted, and restart strategies. As many non-stationary environments exhibit gradual drifting patterns, the weighted strategy is commonly adopted in real-world applications. However, previous theoretical studies show that its analysis is more involved and the algorithms are either computationally less efficient or statistically suboptimal. This paper revisits the weighted strategy for non-stationary parametric bandits. In linear bandits (LB), we discover that this undesirable feature is due to an inadequate regret analysis, which results in an overly complex algorithm design. We propose a \emphrefined analysis framework, which simplifies the derivation and, importantly, produces a simpler weight-based algorithm that is as efficient as window/restart-based algorithms while retaining the same regret as previous studies. Furthermore, our new framework can be used to improve regret bounds of other parametric bandits, including Generalized Linear Bandits (GLB) and Self-Concordant Bandits (SCB). For example, we develop a simple weighted GLB algorithm with an \tildeO(k_\mu^5/4 c_\mu^-3/4 d^3/4 P_T^1/4T^3/4) regret, improving the \tildeO(k_\mu^2 c_\mu^-1d^9/10 P_T^1/5T^4/5) bound in prior work, where k_\mu and c_\mu characterize the reward model’s nonlinearity, P_T measures the non-stationarity, d and T denote the dimension and time horizon. Moreover, we extend our framework to non-stationary Markov Decision Processes (MDPs) with function approximation, focusing on Linear Mixture MDP and Multinomial Logit (MNL) Mixture MDP. For both classes, we propose algorithms based on the weighted strategy and establish dynamic regret guarantees using our analysis framework.

[LG-63] ny Machine Learning for Real-Time Aquaculture Monitoring: A Case Study in Morocco

链接: https://arxiv.org/abs/2601.01065
作者: Achraf Hsain,Yahya Zaki,Othman Abaakil,Hibat-allah Bekkar,Yousra Chtouki
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY)
*备注: Published in IEEE GCAIoT 2024

点击查看摘要

Abstract:Aquaculture, the farming of aquatic organisms, is a rapidly growing industry facing challenges such as water quality fluctuations, disease outbreaks, and inefficient feed management. Traditional monitoring methods often rely on manual labor and are time consuming, leading to potential delays in addressing issues. This paper proposes the integration of low-power edge devices using Tiny Machine Learning (TinyML) into aquaculture systems to enable real-time automated monitoring and control, such as collecting data and triggering alarms, and reducing labor requirements. The system provides real-time data on the required parameters such as pH levels, temperature, dissolved oxygen, and ammonia levels to control water quality, nutrient levels, and environmental conditions enabling better maintenance, efficient resource utilization, and optimal management of the enclosed aquaculture space. The system enables alerts in case of anomaly detection. The data collected by the sensors over time can serve for important decision-making regarding optimizing water treatment processes, feed distribution, feed pattern analysis and improve feed efficiency, reducing operational costs. This research explores the feasibility of developing TinyML-based solutions for aquaculture monitoring, considering factors such as sensor selection, algorithm design, hardware constraints, and ethical considerations. By demonstrating the potential benefits of TinyML in aquaculture, our aim is to contribute to the development of more sustainable and efficient farming practices.

[LG-64] Byzantine-Robust Federated Learning Framework with Post-Quantum Secure Aggregation for Real-Time Threat Intelligence Sharing in Critical IoT Infrastructure

链接: https://arxiv.org/abs/2601.01053
作者: Milad Rahmati,Nima Rahmati
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The proliferation of Internet of Things devices in critical infrastructure has created unprecedented cybersecurity challenges, necessitating collaborative threat detection mechanisms that preserve data privacy while maintaining robustness against sophisticated attacks. Traditional federated learning approaches for IoT security suffer from two critical vulnerabilities: susceptibility to Byzantine attacks where malicious participants poison model updates, and inadequacy against future quantum computing threats that can compromise cryptographic aggregation protocols. This paper presents a novel Byzantine-robust federated learning framework integrated with post-quantum secure aggregation specifically designed for real-time threat intelligence sharing across critical IoT infrastructure. The proposed framework combines a adaptive weighted aggregation mechanism with lattice-based cryptographic protocols to simultaneously defend against model poisoning attacks and quantum adversaries. We introduce a reputation-based client selection algorithm that dynamically identifies and excludes Byzantine participants while maintaining differential privacy guarantees. The secure aggregation protocol employs CRYSTALS-Kyber for key encapsulation and homomorphic encryption to ensure confidentiality during parameter updates. Experimental evaluation on industrial IoT intrusion detection datasets demonstrates that our framework achieves 96.8% threat detection accuracy while successfully mitigating up to 40% Byzantine attackers, with only 18% computational overhead compared to non-secure federated approaches. The framework maintains sub-second aggregation latency suitable for real-time applications and provides 256-bit post-quantum security level.

[LG-65] Coarse-Grained Kullback–Leibler Control of Diffusion-Based Generative AI

链接: https://arxiv.org/abs/2601.01045
作者: Tatsuaki Tsuruyama
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models and score-based generative models provide a powerful framework for synthesizing high-quality images from noise. However, there is still no satisfactory theory that describes how coarse-grained quantities, such as blockwise intensity or class proportions after partitioning an image into spatial blocks, are preserved and evolve along the reverse diffusion dynamics. In previous work, the author introduced an information-theoretic Lyapunov function V for non-ergodic Markov processes on a state space partitioned into blocks, defined as the minimal Kullback-Leibler divergence to the set of stationary distributions reachable from a given initial condition, and showed that a leak-tolerant potential V-delta with a prescribed tolerance for block masses admits a closed-form expression as a scaling-and-clipping operation on block masses. In this paper, I transplant this framework to the reverse diffusion process in generative models and propose a reverse diffusion scheme that is projected by the potential V-delta (referred to as the V-delta projected reverse diffusion). I extend the monotonicity of V to time-inhomogeneous block-preserving Markov kernels and show that, under small leakage and the V-delta projection, V-delta acts as an approximate Lyapunov function. Furthermore, using a toy model consisting of block-constant images and a simplified reverse kernel, I numerically demonstrate that the proposed method keeps the block-mass error and the leak-tolerant potential within the prescribed tolerance, while achieving pixel-wise accuracy and visual quality comparable to the non-projected dynamics. This study reinterprets generative sampling as a decrease of an information potential from noise to data, and provides a design principle for reverse diffusion processes with explicit control of coarse-grained quantities. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2601.01045 [cs.LG] (or arXiv:2601.01045v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2601.01045 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-66] Wireless Dataset Similarity: Measuring Distances in Supervised and Unsupervised Machine Learning WWW

链接: https://arxiv.org/abs/2601.01023
作者: João Morais,Sadjad Alikhani,Akshay Malhotra,Shahab Hamidi-Rad,Ahmed Alkhateeb
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: resources available in: this https URL

点击查看摘要

Abstract:This paper introduces a task- and model-aware framework for measuring similarity between wireless datasets, enabling applications such as dataset selection/augmentation, simulation-to-real (sim2real) comparison, task-specific synthetic data generation, and informing decisions on model training/adaptation to new deployments. We evaluate candidate dataset distance metrics by how well they predict cross-dataset transferability: if two datasets have a small distance, a model trained on one should perform well on the other. We apply the framework on an unsupervised task, channel state information (CSI) compression, using autoencoders. Using metrics based on UMAP embeddings, combined with Wasserstein and Euclidean distances, we achieve Pearson correlations exceeding 0.85 between dataset distances and train-on-one/test-on-another task performance. We also apply the framework to a supervised beam prediction in the downlink using convolutional neural networks. For this task, we derive a label-aware distance by integrating supervised UMAP and penalties for dataset imbalance. Across both tasks, the resulting distances outperform traditional baselines and consistently exhibit stronger correlations with model transferability, supporting task-relevant comparisons between wireless datasets.

[LG-67] Expanding the Chaos: Neural Operator for Stochastic (Partial) Differential Equations

链接: https://arxiv.org/abs/2601.01021
作者: Dai Shi,Lequan Lin,Andi Han,Luke Thompson,José Miguel Hernández-Lobato,Zhiyong Wang,Junbin Gao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Stochastic differential equations (SDEs) and stochastic partial differential equations (SPDEs) are fundamental tools for modeling stochastic dynamics across the natural sciences and modern machine learning. Developing deep learning models for approximating their solution operators promises not only fast, practical solvers, but may also inspire models that resolve classical learning tasks from a new perspective. In this work, we build on classical Wiener chaos expansions (WCE) to design neural operator (NO) architectures for SPDEs and SDEs: we project the driving noise paths onto orthonormal Wick Hermite features and parameterize the resulting deterministic chaos coefficients with neural operators, so that full solution trajectories can be reconstructed from noise in a single forward pass. On the theoretical side, we investigate the classical WCE results for the class of multi-dimensional SDEs and semilinear SPDEs considered here by explicitly writing down the associated coupled ODE/PDE systems for their chaos coefficients, which makes the separation between stochastic forcing and deterministic dynamics fully explicit and directly motivates our model designs. On the empirical side, we validate our models on a diverse suite of problems: classical SPDE benchmarks, diffusion one-step sampling on images, topological interpolation on graphs, financial extrapolation, parameter estimation, and manifold SDEs for flood prediction, demonstrating competitive accuracy and broad applicability. Overall, our results indicate that WCE-based neural operators provide a practical and scalable way to learn SDE/SPDE solution operators across diverse domains.

[LG-68] Contractive Diffusion Policies: Robust Action Diffusion via Contractive Score-Based Sampling with Differential Equations ICLR2026

链接: https://arxiv.org/abs/2601.01003
作者: Amin Abyaneh,Charlotte Morissette,Mohamad H. Danesh,Anas El Houssaini,David Meger,Gregory Dudek,Hsiu-Chin Lin
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Under review at ICLR 2026

点击查看摘要

Abstract:Diffusion policies have emerged as powerful generative models for offline policy learning, whose sampling process can be rigorously characterized by a score function guiding a Stochastic Differential Equation (SDE). However, the same score-based SDE modeling that grants diffusion policies the flexibility to learn diverse behavior also incurs solver and score-matching errors, large data requirements, and inconsistencies in action generation. While less critical in image generation, these inaccuracies compound and lead to failure in continuous control settings. We introduce Contractive Diffusion Policies (CDPs) to induce contractive behavior in the diffusion sampling dynamics. Contraction pulls nearby flows closer to enhance robustness against solver and score-matching errors while reducing unwanted action variance. We develop an in-depth theoretical analysis along with a practical implementation recipe to incorporate CDPs into existing diffusion policy architectures with minimal modification and computational cost. We evaluate CDPs for offline learning by conducting extensive experiments in simulation and real-world settings. Across benchmarks, CDPs often outperform baseline policies, with pronounced benefits under data scarcity.

[LG-69] Zero-shot Forecasting by Simulation Alone

链接: https://arxiv.org/abs/2601.00970
作者: Boris N. Oreshkin,Mayank Jauhari,Ravi Kiran Selvam,Malcolm Wolff,Wenhao Pan,Shankar Ramasubramanian,Kin G. Olivares,Tatiana Konstantinova,Andres Potapczynski,Mengfei Cao,Dmitry Efimov,Michael W. Mahoney,Andrew G. Wilson
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Zero-shot time-series forecasting holds great promise, but is still in its infancy, hindered by limited and biased data corpora, leakage-prone evaluation, and privacy and licensing constraints. Motivated by these challenges, we propose the first practical univariate time series simulation pipeline which is simultaneously fast enough for on-the-fly data generation and enables notable zero-shot forecasting performance on M-Series and GiftEval benchmarks that capture trend/seasonality/intermittency patterns, typical of industrial forecasting applications across a variety of domains. Our simulator, which we call SarSim0 (SARIMA Simulator for Zero-Shot Forecasting), is based off of a seasonal autoregressive integrated moving average (SARIMA) model as its core data source. Due to instability in the autoregressive component, naive SARIMA simulation often leads to unusable paths. Instead, we follow a three-step procedure: (1) we sample well-behaved trajectories from its characteristic polynomial stability region; (2) we introduce a superposition scheme that combines multiple paths into rich multi-seasonality traces; and (3) we add rate-based heavy-tailed noise models to capture burstiness and intermittency alongside seasonalities and trends. SarSim0 is orders of magnitude faster than kernel-based generators, and it enables training on circa 1B unique purely simulated series, generated on the fly; after which well-established neural network backbones exhibit strong zero-shot generalization, surpassing strong statistical forecasters and recent foundation baselines, while operating under strict zero-shot protocol. Notably, on GiftEval we observe a “student-beats-teacher” effect: models trained on our simulations exceed the forecasting accuracy of the AutoARIMA generating processes.

[LG-70] Explainability-Guided Defense: Attribution-Aware Model Refinement Against Adversarial Data Attacks

链接: https://arxiv.org/abs/2601.00968
作者: Longwei Wang,Mohammad Navid Nayyem,Abdullah Al Rakin,KC Santosh,Chaowei Zhang,Yang Zhou
类目: Machine Learning (cs.LG)
*备注: 8pages,4 figures

点击查看摘要

Abstract:The growing reliance on deep learning models in safety-critical domains such as healthcare and autonomous navigation underscores the need for defenses that are both robust to adversarial perturbations and transparent in their decision-making. In this paper, we identify a connection between interpretability and robustness that can be directly leveraged during training. Specifically, we observe that spurious, unstable, or semantically irrelevant features identified through Local Interpretable Model-Agnostic Explanations (LIME) contribute disproportionately to adversarial vulnerability. Building on this insight, we introduce an attribution-guided refinement framework that transforms LIME from a passive diagnostic into an active training signal. Our method systematically suppresses spurious features using feature masking, sensitivity-aware regularization, and adversarial augmentation in a closed-loop refinement pipeline. This approach does not require additional datasets or model architectures and integrates seamlessly into standard adversarial training. Theoretically, we derive an attribution-aware lower bound on adversarial distortion that formalizes the link between explanation alignment and robustness. Empirical evaluations on CIFAR-10, CIFAR-10-C, and CIFAR-100 demonstrate substantial improvements in adversarial robustness and out-of-distribution generalization.

[LG-71] Enhanced Data-Driven Product Development via Gradient Based Optimization and Conformalized Monte Carlo Dropout Uncertainty Estimation

链接: https://arxiv.org/abs/2601.00932
作者: Andrea Thomas Nava,Lijo Johny,Fabio Azzalini,Johannes Schneider,Arianna Casanova
类目: Machine Learning (cs.LG)
*备注: Accepted at the 18th International Conference on Agents and Artificial Intelligence (ICAART 2026)

点击查看摘要

Abstract:Data-Driven Product Development (DDPD) leverages data to learn the relationship between product design specifications and resulting properties. To discover improved designs, we train a neural network on past experiments and apply Projected Gradient Descent to identify optimal input features that maximize performance. Since many products require simultaneous optimization of multiple correlated properties, our framework employs joint neural networks to capture interdependencies among targets. Furthermore, we integrate uncertainty estimation via \emphConformalised Monte Carlo Dropout (ConfMC), a novel method combining Nested Conformal Prediction with Monte Carlo dropout to provide model-agnostic, finite-sample coverage guarantees under data exchangeability. Extensive experiments on five real-world datasets show that our method matches state-of-the-art performance while offering adaptive, non-uniform prediction intervals and eliminating the need for retraining when adjusting coverage levels.

[LG-72] Latent-Constrained Conditional VAEs for Augmenting Large-Scale Climate Ensembles

链接: https://arxiv.org/abs/2601.00915
作者: Jacquelyn Shelton,Przemyslaw Polewski,Alexander Robel,Matthew Hoffman,Stephen Price
类目: Machine Learning (cs.LG)
*备注: draft / preliminary

点击查看摘要

Abstract:Large climate-model ensembles are computationally expensive; yet many downstream analyses would benefit from additional, statistically consistent realizations of spatiotemporal climate variables. We study a generative modeling approach for producing new realizations from a limited set of available runs by transferring structure learned across an ensemble. Using monthly near-surface temperature time series from ten independent reanalysis realizations (ERA5), we find that a vanilla conditional variational autoencoder (CVAE) trained jointly across realizations yields a fragmented latent space that fails to generalize to unseen ensemble members. To address this, we introduce a latent-constrained CVAE (LC-CVAE) that enforces cross-realization homogeneity of latent embeddings at a small set of shared geographic ‘anchor’ locations. We then use multi-output Gaussian process regression in the latent space to predict latent coordinates at unsampled locations in a new realization, followed by decoding to generate full time series fields. Experiments and ablations demonstrate (i) instability when training on a single realization, (ii) diminishing returns after incorporating roughly five realizations, and (iii) a trade-off between spatial coverage and reconstruction quality that is closely linked to the average neighbor distance in latent space.

[LG-73] Security Hardening Using FABRIC: Implementing a Unified Compliance Aggregator for Linux Servers

链接: https://arxiv.org/abs/2601.00909
作者: Sheldon Paul,Izzat Alsmadi
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a unified framework for evaluating Linux security hardening on the FABRIC testbed through aggregation of heterogeneous security auditing tools. We deploy three Ubuntu 22.04 nodes configured at baseline, partial, and full hardening levels, and evaluate them using Lynis, OpenSCAP, and AIDE across 108 audit runs. To address the lack of a consistent interpretation across tools, we implement a Unified Compliance Aggregator (UCA) that parses tool outputs, normalizes scores to a common 0–100 scale, and combines them into a weighted metric augmented by a customizable rule engine for organization-specific security policies. Experimental results show that full hardening increases OpenSCAP compliance from 39.7 to 71.8, while custom rule compliance improves from 39.3% to 83.6%. The results demonstrate that UCA provides a clearer and more reproducible assessment of security posture than individual tools alone, enabling systematic evaluation of hardening effectiveness in programmable testbed environments.

[LG-74] Dichotomous Diffusion Policy Optimization

链接: https://arxiv.org/abs/2601.00898
作者: Ruiming Liang,Yinan Zheng,Kexin Zheng,Tianyi Tan,Jianxiong Li,Liyuan Mao,Zhihao Wang,Guang Chen,Hangjun Ye,Jingjing Liu,Jinqiao Wang,Xianyuan Zhan
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Diffusion-based policies have gained growing popularity in solving a wide range of decision-making tasks due to their superior expressiveness and controllable generation during inference. However, effectively training large diffusion policies using reinforcement learning (RL) remains challenging. Existing methods either suffer from unstable training due to directly maximizing value objectives, or face computational issues due to relying on crude Gaussian likelihood approximation, which requires a large amount of sufficiently small denoising steps. In this work, we propose DIPOLE (Dichotomous diffusion Policy improvement), a novel RL algorithm designed for stable and controllable diffusion policy optimization. We begin by revisiting the KL-regularized objective in RL, which offers a desirable weighted regression objective for diffusion policy extraction, but often struggles to balance greediness and stability. We then formulate a greedified policy regularization scheme, which naturally enables decomposing the optimal policy into a pair of stably learned dichotomous policies: one aims at reward maximization, and the other focuses on reward minimization. Under such a design, optimized actions can be generated by linearly combining the scores of dichotomous policies during inference, thereby enabling flexible control over the level of this http URL in offline and offline-to-online RL settings on ExORL and OGBench demonstrate the effectiveness of our approach. We also use DIPOLE to train a large vision-language-action (VLA) model for end-to-end autonomous driving (AD) and evaluate it on the large-scale real-world AD benchmark NAVSIM, highlighting its potential for complex real-world applications.

[LG-75] owards eco friendly cybersecurity: machine learning based anomaly detection with carbon and energy metrics

链接: https://arxiv.org/abs/2601.00893
作者: KC Aashish,Md Zakir Hossain Zamil,Md Shafiqul Islam Mridul,Lamia Akter,Farmina Sharmin,Eftekhar Hossain Ayon,Md Maruf Bin Reza,Ali Hassan,Abdur Rahim,Sirapa Malla
类目: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: International Journal of Applied Mathematics 2025

点击查看摘要

Abstract:The rising energy footprint of artificial intelligence has become a measurable component of US data center emissions, yet cybersecurity research seldom considers its environmental cost. This study introduces an eco aware anomaly detection framework that unifies machine learning based network monitoring with real time carbon and energy tracking. Using the publicly available Carbon Aware Cybersecurity Traffic Dataset comprising 2300 flow level observations, we benchmark Logistic Regression, Random Forest, Support Vector Machine, Isolation Forest, and XGBoost models across energy, carbon, and performance dimensions. Each experiment is executed in a controlled Colab environment instrumented with the CodeCarbon toolkit to quantify power draw and equivalent CO2 output during both training and inference. We construct an Eco Efficiency Index that expresses F1 score per kilowatt hour to capture the trade off between detection quality and environmental impact. Results reveal that optimized Random Forest and lightweight Logistic Regression models achieve the highest eco efficiency, reducing energy consumption by more than forty percent compared to XGBoost while sustaining competitive detection accuracy. Principal Component Analysis further decreases computational load with negligible loss in recall. Collectively, these findings establish that integrating carbon and energy metrics into cybersecurity workflows enables environmentally responsible machine learning without compromising operational protection. The proposed framework offers a reproducible path toward sustainable carbon accountable cybersecurity aligned with emerging US green computing and federal energy efficiency initiatives.

[LG-76] FANoS: Friction-Adaptive Nosé–Hoover Symplectic Momentum for Stiff Objectives

链接: https://arxiv.org/abs/2601.00889
作者: Nalin Dhiman
类目: Machine Learning (cs.LG)
*备注: 13 pages, 5 figures, 4 tables

点击查看摘要

Abstract:We study a physics-inspired optimizer, \emphFANoS (Friction-Adaptive Nosé–Hoover Symplectic momentum), which combines (i) a momentum update written as a discretized second-order dynamical system, (ii) a Nosé–Hoover-like thermostat variable that adapts a scalar friction coefficient using kinetic-energy feedback, and (iii) a semi-implicit (symplectic-Euler) integrator, optionally with a diagonal RMS preconditioner. The method is motivated by structure-preserving integration and thermostat ideas from molecular dynamics, but is used here purely as an optimization heuristic. We provide the algorithm and limited theoretical observations in idealized settings. On the deterministic Rosenbrock-100D benchmark with 3000 gradient evaluations, FANoS-RMS attains a mean final objective value of 1.74\times 10^-2 , improving substantially over unclipped AdamW ( 48.50 ) and SGD+momentum ( 90.76 ) in this protocol. However, AdamW with gradient clipping is stronger, reaching 1.87\times 10^-3 , and L-BFGS reaches \approx 4.4\times 10^-10 . On ill-conditioned convex quadratics and in a small PINN warm-start suite (Burgers and Allen–Cahn), the default FANoS configuration underperforms AdamW and can be unstable or high-variance. Overall, the evidence supports a conservative conclusion: FANoS is an interpretable synthesis of existing ideas that can help on some stiff nonconvex valleys, but it is not a generally superior replacement for modern baselines, and its behavior is sensitive to temperature-schedule and hyperparameter choices. Comments: 13 pages, 5 figures, 4 tables Subjects: Machine Learning (cs.LG) Cite as: arXiv:2601.00889 [cs.LG] (or arXiv:2601.00889v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2601.00889 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-77] Outlier Detection Using Vector Cosine Similarity by Adding a Dimension DATE

链接: https://arxiv.org/abs/2601.00883
作者: Zhongyang Shen
类目: Machine Learning (cs.LG)
*备注: This is an updated version of the paper originally published in ICAIIC 2024 (DOI: https://doi.org/10.1109/ICAIIC60209.2024.10463442 ). Changes include minor typographical and grammatical corrections, as well as an added description of an optimized open-source Python implementation (MDOD) available on PyPI at this https URL

点击查看摘要

Abstract:We propose a new outlier detection method for multi-dimensional data. The method detects outliers based on vector cosine similarity, using a new dataset constructed by adding a dimension with zero values to the original data. When a point in the new dataset is selected as the measured point, an observation point is created as the origin, differing only in the new dimension by having a non-zero value compared to the measured point. Vectors are then formed from the observation point to the measured point and to other points in the dataset. By comparing the cosine similarities of these vectors, abnormal data can be identified. An optimized implementation (MDOD) is available on PyPI: this https URL.

[LG-78] Quantum Machine Learning Approaches for Coordinated Stealth Attack Detection in Distributed Generation Systems

链接: https://arxiv.org/abs/2601.00873
作者: Osasumwen Cedric Ogiesoba-Eguakun,Suman Rath
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 10 pages

点击查看摘要

Abstract:Coordinated stealth attacks are a serious cybersecurity threat to distributed generation systems because they modify control and measurement signals while remaining close to normal behavior, making them difficult to detect using standard intrusion detection methods. This study investigates quantum machine learning approaches for detecting coordinated stealth attacks on a distributed generation unit in a microgrid. High-quality simulated measurements were used to create a balanced binary classification dataset using three features: reactive power at DG1, frequency deviation relative to the nominal value, and terminal voltage magnitude. Classical machine learning baselines, fully quantum variational classifiers, and hybrid quantum classical models were evaluated. The results show that a hybrid quantum classical model combining quantum feature embeddings with a classical RBF support vector machine achieves the best overall performance on this low dimensional dataset, with a modest improvement in accuracy and F1 score over a strong classical SVM baseline. Fully quantum models perform worse due to training instability and limitations of current NISQ hardware. In contrast, hybrid models train more reliably and demonstrate that quantum feature mapping can enhance intrusion detection even when fully quantum learning is not yet practical.

[LG-79] Distribution Matching for Graph Quantification Under Structural Covariate Shift ECML-PKDD2025

链接: https://arxiv.org/abs/2601.00864
作者: Clemens Damke,Eyke Hüllermeier
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 17 pages, presented at ECML-PKDD 2025

点击查看摘要

Abstract:Graphs are commonly used in machine learning to model relationships between instances. Consider the task of predicting the political preferences of users in a social network; to solve this task one should consider, both, the features of each individual user and the relationships between them. However, oftentimes one is not interested in the label of a single instance but rather in the distribution of labels over a set of instances; e.g., when predicting the political preferences of users, the overall prevalence of a given opinion might be of higher interest than the opinion of a specific person. This label prevalence estimation task is commonly referred to as quantification learning (QL). Current QL methods for tabular data are typically based on the so-called prior probability shift (PPS) assumption which states that the label-conditional instance distributions should remain equal across the training and test data. In the graph setting, PPS generally does not hold if the shift between training and test data is structural, i.e., if the training data comes from a different region of the graph than the test data. To address such structural shifts, an importance sampling variant of the popular adjusted count quantification approach has previously been proposed. In this work, we extend the idea of structural importance sampling to the state-of-the-art KDEy quantification approach. We show that our proposed method adapts to structural shifts and outperforms standard quantification approaches.

[LG-80] Selective Imperfection as a Generative Framework for Analysis Creativity and Discovery

链接: https://arxiv.org/abs/2601.00863
作者: Markus J. Buehler
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:We introduce materiomusic as a generative framework linking the hierarchical structures of matter with the compositional logic of music. Across proteins, spider webs and flame dynamics, vibrational and architectural principles recur as tonal hierarchies, harmonic progressions, and long-range musical form. Using reversible mappings, from molecular spectra to musical tones and from three-dimensional networks to playable instruments, we show how sound functions as a scientific probe, an epistemic inversion where listening becomes a mode of seeing and musical composition becomes a blueprint for matter. These mappings excavate deep time: patterns originating in femtosecond molecular vibrations or billion-year evolutionary histories become audible. We posit that novelty in science and art emerges when constraints cannot be satisfied within existing degrees of freedom, forcing expansion of the space of viable configurations. Selective imperfection provides the mechanism restoring balance between coherence and adaptability. Quantitative support comes from exhaustive enumeration of all 2^12 musical scales, revealing that culturally significant systems cluster in a mid-entropy, mid-defect corridor, directly paralleling the Hall-Petch optimum where intermediate defect densities maximize material strength. Iterating these mappings creates productive collisions between human creativity and physics, generating new information as musical structures encounter evolutionary constraints. We show how swarm-based AI models compose music exhibiting human-like structural signatures such as small-world connectivity, modular integration, long-range coherence, suggesting a route beyond interpolation toward invention. We show that science and art are generative acts of world-building under constraint, with vibration as a shared grammar organizing structure across scales.

[LG-81] Universal Battery Degradation Forecasting Driven by Foundation Model Across Diverse Chemistries and Conditions

链接: https://arxiv.org/abs/2601.00862
作者: Joey Chan,Huan Wang,Haoyu Pan,Wei Wu,Zirong Wang,Zhen Chen,Ershun Pan,Min Xie,Lifeng Xi
类目: Machine Learning (cs.LG)
*备注: Due to space limitations, the open-source method for supporting materials is currently under discussion

点击查看摘要

Abstract:Accurate forecasting of battery capacity fade is essential for the safety, reliability, and long-term efficiency of energy storage systems. However, the strong heterogeneity across cell chemistries, form factors, and operating conditions makes it difficult to build a single model that generalizes beyond its training domain. This work proposes a unified capacity forecasting framework that maintains robust performance across diverse chemistries and usage scenarios. We curate 20 public aging datasets into a large-scale corpus covering 1,704 cells and 3,961,195 charge-discharge cycle segments, spanning temperatures from -5,^\circ\mathrmC to 45,^\circ\mathrmC , multiple C-rates, and application-oriented profiles such as fast charging and partial cycling. On this corpus, we adopt a Time-Series Foundation Model (TSFM) backbone and apply parameter-efficient Low-Rank Adaptation (LoRA) together with physics-guided contrastive representation learning to capture shared degradation patterns. Experiments on both seen and deliberately held-out unseen datasets show that a single unified model achieves competitive or superior accuracy compared with strong per-dataset baselines, while retaining stable performance on chemistries, capacity scales, and operating conditions excluded from training. These results demonstrate the potential of TSFM-based architectures as a scalable and transferable solution for capacity degradation forecasting in real battery management systems.

[LG-82] Harvesting AlphaEarth: Benchmarking the Geospatial Foundation Model for Agricultural Downstream Tasks

链接: https://arxiv.org/abs/2601.00857
作者: Yuchi Ma,Yawen Shen,Anu Swatantran,David B. Lobell
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Geospatial foundation models (GFMs) have emerged as a promising approach to overcoming the limitations in existing featurization methods. More recently, Google DeepMind has introduced AlphaEarth Foundation (AEF), a GFM pre-trained using multi-source EOs across continuous time. An annual and global embedding dataset is produced using AEF that is ready for analysis and modeling. The internal experiments show that AEF embeddings have outperformed operational models in 15 EO tasks without re-training. However, those experiments are mostly about land cover and land use classification. Applying AEF and other GFMs to agricultural monitoring require an in-depth evaluation in critical agricultural downstream tasks. There is also a lack of comprehensive comparison between the AEF-based models and traditional remote sensing (RS)-based models under different scenarios, which could offer valuable guidance for researchers and practitioners. This study addresses some of these gaps by evaluating AEF embeddings in three agricultural downstream tasks in the U.S., including crop yield prediction, tillage mapping, and cover crop mapping. Datasets are compiled from both public and private sources to comprehensively evaluate AEF embeddings across tasks at different scales and locations, and RS-based models are trained as comparison models. AEF-based models generally exhibit strong performance on all tasks and are competitive with purpose-built RS-based models in yield prediction and county-level tillage mapping when trained on local data. However, we also find several limitations in current AEF embeddings, such as limited spatial transferability compared to RS-based models, low interpretability, and limited time sensitivity. These limitations recommend caution when applying AEF embeddings in agriculture, where time sensitivity, generalizability, and interpretability is important.

[LG-83] EdgeJury: Cross-Reviewed Small-Model Ensembles for Truthful Question Answering on Serverless Edge Inference

链接: https://arxiv.org/abs/2601.00850
作者: Aayush Kumar
类目: Machine Learning (cs.LG)
*备注: 24 pages,3 Figures, Submitting to IEEE Access

点击查看摘要

Abstract:Hallucinations hinder reliable question answering, especially in resource-constrained deployments where frontier-scale models or retrieval pipelines may be impractical. We present EdgeJury, a lightweight ensemble framework that improves truthfulness and robustness using only small instruction-tuned language models (3B-8B) suitable for serverless edge inference. EdgeJury orchestrates four stages: (1) parallel role-specialized generation, (2) anonymized cross-review with structured critiques and rankings, (3) chairman synthesis that integrates the strongest content while addressing flagged issues, and (4) claim-level consistency labeling based on inter-model agreement. On TruthfulQA (MC1), EdgeJury achieves 76.2% accuracy (95% CI: 72.8-79.6%), a +21.4% relative improvement over a single 8B baseline (62.8%), and outperforms standard baselines including self-consistency and majority voting under transparent compute accounting (total tokens and platform cost reported). On a 200-question adversarial EdgeCases set, EdgeJury yields +48.2% relative gains (95% CI: 44.0-52.4%). Manual analysis on 100 incorrect answers shows an approximately 55% reduction in factual hallucination errors versus the single-model baseline. Deployed on Cloudflare Workers AI, EdgeJury achieves 8.4 s median end-to-end latency, demonstrating that coordinated small-model ensembles can improve truthfulness on misconception-heavy QA benchmarks without external retrieval or proprietary large-model APIs.

[LG-84] You Only Need Your Transformer 25% of the Time: Meaning-First Execution for Eliminating Unnecessary Inference

链接: https://arxiv.org/abs/2601.00847
作者: Ryan Shamim
类目: Machine Learning (cs.LG)
*备注: 24 pages, 5 figures. Deterministic evaluation protocol. Includes theoretical analysis and empirical validation on GPT-2 and Gemma 2 9B

点击查看摘要

Abstract:Modern AI inference systems treat transformer execution as mandatory, conflating model capability with execution necessity. We reframe inference as a control-plane decision problem: determining when execution is necessary versus when correctness can be preserved through alternative pathways. We introduce Meaning-First Execution (MFEE), a control-plane architecture implementing this framework, selectively invoking transformer inference only when required. MFEE operates as a gating layer above existing stacks without modifying models, weights, or parameters. Across 1,000 diverse prompts under deterministic decoding, MFEE achieves 78.1% execution reduction while maintaining 100% exact-match equivalence for invoked executions. Comparative evaluation reveals pattern-based routers achieve at most 53.3% avoidance with correctness failures, while MFEE reaches 100% avoidance with zero failures through semantic analysis. We prove this limitation via Theorem 1: any router operating solely on finite feature maps cannot simultaneously guarantee zero false skips and positive avoidance on feature-collision pairs. These results establish execution governance as a foundational layer in ML systems infrastructure, orthogonal to model-level optimization techniques.

[LG-85] SLO-Conditioned Action Routing for Retrieval-Augmented Generation: Objective Ablation and Failure Modes

链接: https://arxiv.org/abs/2601.00841
作者: Bharath Nunepalli
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) introduces a practical control problem: retrieval depth and generation behavior must be chosen per query to satisfy service-level objectives (SLOs) such as cost, refusal rate, and hallucination risk. This work models per-query control as a small discrete action: choose a retrieval depth and a generation mode (guarded vs. auto), or refuse. An offline logged dataset is constructed from SQuAD 2.0 by executing each action and recording accuracy, token cost, hallucination/refusal indicators, and an SLO-weighted reward. Two simple policy-learning objectives are evaluated: supervised classification of the per-state best action (Argmax-CE) and a reward-weighted variant (Argmax-CE-WT). Across the evaluated settings, a strong fixed baseline (low k, guarded prompting) performs competitively; learned policies mainly provide additional cost savings under a quality-focused SLO and can exhibit refusal collapse under a cheap SLO when refusal is heavily rewarded. The contribution is a reproducible case study of SLO-aware control for RAG pipelines, emphasizing failure modes and reporting conventions rather than proposing a new retriever or language model.

[LG-86] Horizon Reduction as Information Loss in Offline Reinforcement Learning

链接: https://arxiv.org/abs/2601.00831
作者: Uday Kumar Nidadala,Venkata Bhumika Guthi
类目: Machine Learning (cs.LG)
*备注: 13 pages, 3 figures

点击查看摘要

Abstract:Horizon reduction is a common design strategy in offline reinforcement learning (RL), used to mitigate long-horizon credit assignment, improve stability, and enable scalable learning through truncated rollouts, windowed training, or hierarchical decomposition (Levine et al., 2020; Prudencio et al., 2023; Park et al., 2025). Despite recent empirical evidence that horizon reduction can improve scaling on challenging offline RL benchmarks, its theoretical implications remain underdeveloped (Park et al., 2025). In this paper, we show that horizon reduction can induce fundamental and irrecoverable information loss in offline RL. We formalize horizon reduction as learning from fixed-length trajectory segments and prove that, under this paradigm and any learning interface restricted to fixed-length trajectory segments, optimal policies may be statistically indistinguishable from suboptimal ones even with infinite data and perfect function approximation. Through a set of minimal counterexample Markov decision processes (MDPs), we identify three distinct structural failure modes: (i) prefix indistinguishability leading to identifiability failure, (ii) objective misspecification induced by truncated returns, and (iii) offline dataset support and representation aliasing. Our results establish necessary conditions under which horizon reduction can be safe and highlight intrinsic limitations that cannot be overcome by algorithmic improvements alone, complementing algorithmic work on conservative objectives and distribution shift that addresses a different axis of offline RL difficulty (Fujimoto et al., 2019; Kumar et al., 2020; Gulcehre et al., 2020).

[LG-87] Energy-Efficient Eimeria Parasite Detection Using a Two-Stage Spiking Neural Network Architecture

链接: https://arxiv.org/abs/2601.00806
作者: Ángel Miguel García-Vico,Huseyin Seker,Muhammad Afzal
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Coccidiosis, a disease caused by the Eimeria parasite, represents a major threat to the poultry and rabbit industries, demanding rapid and accurate diagnostic tools. While deep learning models offer high precision, their significant energy consumption limits their deployment in resource-constrained environments. This paper introduces a novel two-stage Spiking Neural Network (SNN) architecture, where a pre-trained Convolutional Neural Network is first converted into a spiking feature extractor and then coupled with a lightweight, unsupervised SNN classifier trained with Spike-Timing-Dependent Plasticity (STDP). The proposed model sets a new state-of-the-art, achieving 98.32% accuracy in Eimeria classification. Remarkably, this performance is accomplished with a significant reduction in energy consumption, showing an improvement of more than 223 times compared to its traditional ANN counterpart. This work demonstrates a powerful synergy between high accuracy and extreme energy efficiency, paving the way for autonomous, low-power diagnostic systems on neuromorphic hardware.

[LG-88] ChronoPlastic Spiking Neural Networks

链接: https://arxiv.org/abs/2601.00805
作者: Sarim Chaudhry
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 21 pages, 6 figures

点击查看摘要

Abstract:Spiking neural networks (SNNs) offer a biologically grounded and energy-efficient alternative to conventional neural architectures; however, they struggle with long-range temporal dependencies due to fixed synaptic and membrane time constants. This paper introduces ChronoPlastic Spiking Neural Networks (CPSNNs), a novel architectural principle that enables adaptive temporal credit assignment by dynamically modulating synaptic decay rates conditioned on the state of the network. CPSNNs maintain multiple internal temporal traces and learn a continuous time-warping function that selectively preserves task-relevant information while rapidly forgetting noise. Unlike prior approaches based on adaptive membrane constants, attention mechanisms, or external memory, CPSNNs embed temporal control directly within local synaptic dynamics, preserving linear-time complexity and neuromorphic compatibility. We provide a formal description of the model, analyze its computational properties, and demonstrate empirically that CPSNNs learn long-gap temporal dependencies significantly faster and more reliably than standard SNN baselines. Our results suggest that adaptive temporal modulation is a key missing ingredient for scalable temporal learning in spiking systems.

[LG-89] Hunting for “Oddballs” with Machine Learning: Detecting Anomalous Exoplanets Using a Deep-Learned Low-Dimensional Representation of Transit Spectra with Autoencoders

链接: https://arxiv.org/abs/2601.02324
作者: Alexander Roman,Emilie Panek,Roy T. Forestano,Eyup B. Unlu,Katia Matcheva,Konstantin T. Matchev
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: 14 pages, 12 figures

点击查看摘要

Abstract:This study explores the application of autoencoder-based machine learning techniques for anomaly detection to identify exoplanet atmospheres with unconventional chemical signatures using a low-dimensional data representation. We use the Atmospheric Big Challenge (ABC) database, a publicly available dataset with over 100,000 simulated exoplanet spectra, to construct an anomaly detection scenario by defining CO2-rich atmospheres as anomalies and CO2-poor atmospheres as the normal class. We benchmarked four different anomaly detection strategies: Autoencoder Reconstruction Loss, One-Class Support Vector Machine (1 class-SVM), K-means Clustering, and Local Outlier Factor (LOF). Each method was evaluated in both the original spectral space and the autoencoder’s latent space using Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC) metrics. To test the performance of the different methods under realistic conditions, we introduced Gaussian noise levels ranging from 10 to 50 ppm. Our results indicate that anomaly detection is consistently more effective when performed within the latent space across all noise levels. Specifically, K-means clustering in the latent space emerged as a stable and high-performing method. We demonstrate that this anomaly detection approach is robust to noise levels up to 30 ppm (consistent with realistic space-based observations) and remains viable even at 50 ppm when leveraging latent space representations. On the other hand, the performance of the anomaly detection methods applied directly in the raw spectral space degrades significantly with increasing the level of noise. This suggests that autoencoder-driven dimensionality reduction offers a robust methodology for flagging chemically anomalous targets in large-scale surveys where exhaustive retrievals are computationally prohibitive.

[LG-90] Environment-Adaptive Covariate Selection: Learning When to Use Spurious Correlations for Out-of-Distribution Prediction

链接: https://arxiv.org/abs/2601.02322
作者: Shuozhi Zuo,Yixin Wang
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Out-of-distribution (OOD) prediction is often approached by restricting models to causal or invariant covariates, avoiding non-causal spurious associations that may be unstable across environments. Despite its theoretical appeal, this strategy frequently underperforms empirical risk minimization (ERM) in practice. We investigate the source of this gap and show that such failures naturally arise when only a subset of the true causes of the outcome is observed. In these settings, non-causal spurious covariates can serve as informative proxies for unobserved causes and substantially improve prediction, except under distribution shifts that break these proxy relationships. Consequently, the optimal set of predictive covariates is neither universal nor necessarily exhibits invariant relationships with the outcome across all environments, but instead depends on the specific type of shift encountered. Crucially, we observe that different covariate shifts induce distinct, observable signatures in the covariate distribution itself. Moreover, these signatures can be extracted from unlabeled data in the target OOD environment and used to assess when proxy covariates remain reliable and when they fail. Building on this observation, we propose an environment-adaptive covariate selection (EACS) algorithm that maps environment-level covariate summaries to environment-specific covariate sets, while allowing the incorporation of prior causal knowledge as constraints. Across simulations and applied datasets, EACS consistently outperforms static causal, invariant, and ERM-based predictors under diverse distribution shifts.

[LG-91] Predicting Early and Complete Drug Release from Long-Acting Injectables Using Explainable Machine Learning

链接: https://arxiv.org/abs/2601.02265
作者: Karla N. Robles,Manar D. Samad
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Polymer-based long-acting injectables (LAIs) have transformed the treatment of chronic diseases by enabling controlled drug delivery, thus reducing dosing frequency and extending therapeutic duration. Achieving controlled drug release from LAIs requires extensive optimization of the complex underlying physicochemical properties. Machine learning (ML) can accelerate LAI development by modeling the complex relationships between LAI properties and drug release. However, recent ML studies have provided limited information on key properties that modulate drug release, due to the lack of custom modeling and analysis tailored to LAI data. This paper presents a novel data transformation and explainable ML approach to synthesize actionable information from 321 LAI formulations by predicting early drug release at 24, 48, and 72 hours, classification of release profile types, and prediction of complete release profiles. These three experiments investigate the contribution and control of LAI material characteristics in early and complete drug release profiles. A strong correlation (0.65) is observed between the true and predicted drug release in 72 hours, while a 0.87 F1-score is obtained in classifying release profile types. A time-independent ML framework predicts delayed biphasic and triphasic curves with better performance than current time-dependent approaches. Shapley additive explanations reveal the relative influence of material characteristics during early and for complete release which fill several gaps in previous in-vitro and ML-based studies. The novel approach and findings can provide a quantitative strategy and recommendations for scientists to optimize the drug-release dynamics of LAI. The source code for the model implementation is publicly available.

[LG-92] From Mice to Trains: Amortized Bayesian Inference on Graph Data

链接: https://arxiv.org/abs/2601.02241
作者: Svenja Jedhoff,Elizaveta Semenova,Aura Raulo,Anne Meyer,Paul-Christian Bürkner
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graphs arise across diverse domains, from biology and chemistry to social and information networks, as well as in transportation and logistics. Inference on graph-structured data requires methods that are permutation-invariant, scalable across varying sizes and sparsities, and capable of capturing complex long-range dependencies, making posterior estimation on graph parameters particularly challenging. Amortized Bayesian Inference (ABI) is a simulation-based framework that employs generative neural networks to enable fast, likelihood-free posterior inference. We adapt ABI to graph data to address these challenges to perform inference on node-, edge-, and graph-level parameters. Our approach couples permutation-invariant graph encoders with flexible neural posterior estimators in a two-module pipeline: a summary network maps attributed graphs to fixed-length representations, and an inference network approximates the posterior over parameters. In this setting, several neural architectures can serve as the summary network. In this work we evaluate multiple architectures and assess their performance on controlled synthetic settings and two real-world domains - biology and logistics - in terms of recovery and calibration.

[LG-93] Feature-based Inversion of 2.5D Controlled Source Electromagnetic Data using Generative Priors

链接: https://arxiv.org/abs/2601.02145
作者: Hongyu Zhou,Haoran Sun,Rui Guo,Maokun Li,Fan Yang,Shenheng Xu
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this study, we investigate feature-based 2.5D controlled source marine electromagnetic (mCSEM) data inversion using generative priors. Two-and-half dimensional modeling using finite difference method (FDM) is adopted to compute the response of horizontal electric dipole (HED) excitation. Rather than using a neural network to approximate the entire inverse mapping in a black-box manner, we adopt a plug-andplay strategy in which a variational autoencoder (VAE) is used solely to learn prior information on conductivity distributions. During the inversion process, the conductivity model is iteratively updated using the Gauss Newton method, while the model space is constrained by projections onto the learned VAE decoder. This framework preserves explicit control over data misfit and enables flexible adaptation to different survey configurations. Numerical and field experiments demonstrate that the proposed approach effectively incorporates prior information, improves reconstruction accuracy, and exhibits good generalization performance.

[LG-94] A Multilayered Approach to Classifying Customer Responsiveness and Credit Risk

链接: https://arxiv.org/abs/2601.01970
作者: Ayomide Afolabi,Ebere Ogburu,Symon Kimitei
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:This study evaluates the performance of various classifiers in three distinct models: response, risk, and response-risk, concerning credit card mail campaigns and default prediction. In the response model, the Extra Trees classifier demonstrates the highest recall level (79.1%), emphasizing its effectiveness in identifying potential responders to targeted credit card offers. Conversely, in the risk model, the Random Forest classifier exhibits remarkable specificity of 84.1%, crucial for identifying customers least likely to default. Furthermore, in the multi-class response-risk model, the Random Forest classifier achieves the highest accuracy (83.2%), indicating its efficacy in discerning both potential responders to credit card mail campaign and low-risk credit card users. In this study, we optimized various performance metrics to solve a specific credit risk and mail responsiveness business problem.

[LG-95] Efficient temporal prediction of compressible flows in irregular domains using Fourier neural operators

链接: https://arxiv.org/abs/2601.01922
作者: Yifan Nie,Qiaoxin Li
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注: 18 pages, 15 figures

点击查看摘要

Abstract:This paper investigates the temporal evolution of high-speed compressible fluids in irregular flow fields using the Fourier Neural Operator (FNO). We reconstruct the irregular flow field point set into sequential format compatible with FNO input requirements, and then embed temporal bundling technique within a recurrent neural network (RNN) for multi-step prediction. We further employ a composite loss function to balance errors across different physical quantities. Experiments are conducted on three different types of irregular flow fields, including orthogonal and non-orthogonal grid configurations. Then we comprehensively analyze the physical component loss curves, flow field visualizations, and physical profiles. Results demonstrate that our approach significantly surpasses traditional numerical methods in computational efficiency while achieving high accuracy, with maximum relative L_2 errors of (0.78, 0.57, 0.35)% for ( p , T , \mathbfu ) respectively. This verifies that the method can efficiently and accurately simulate the temporal evolution of high-speed compressible flows in irregular domains.

[LG-96] Random-Matrix-Induced Simplicity Bias in Over-parameterized Variational Quantum Circuits

链接: https://arxiv.org/abs/2601.01877
作者: Jun Qi,Chao-Han Huck Yang,Pin-Yu Chen,Min-Hsiu Hsieh
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Mathematical Physics (math-ph)
*备注: 20 pages, 4 figures

点击查看摘要

Abstract:Over-parameterization is commonly used to increase the expressivity of variational quantum circuits (VQCs), yet deeper and more highly parameterized circuits often exhibit poor trainability and limited generalization. In this work, we provide a theoretical explanation for this phenomenon from a function-class perspective. We show that sufficiently expressive, unstructured variational ansatze enter a Haar-like universality class in which both observable expectation values and parameter gradients concentrate exponentially with system size. As a consequence, the hypothesis class induced by such circuits collapses with high probability to a narrow family of near-constant functions, a phenomenon we term simplicity bias, with barren plateaus arising as a consequence rather than the root cause. Using tools from random matrix theory and concentration of measure, we rigorously characterize this universality class and establish uniform hypothesis-class collapse over finite datasets. We further show that this collapse is not unavoidable: tensor-structured VQCs, including tensor-network-based and tensor-hypernetwork parameterizations, lie outside the Haar-like universality class. By restricting the accessible unitary ensemble through bounded tensor rank or bond dimension, these architectures prevent concentration of measure, preserve output variability for local observables, and retain non-degenerate gradient signals even in over-parameterized regimes. Together, our results unify barren plateaus, expressivity limits, and generalization collapse under a single structural mechanism rooted in random-matrix universality, highlighting the central role of architectural inductive bias in variational quantum algorithms.

[LG-97] Machine learning modularity

链接: https://arxiv.org/abs/2601.01779
作者: Yi Fan,Vishnu Jejjala,Yang Lei
类目: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG)
*备注: 34 pages, 7 figures, 6 tables

点击查看摘要

Abstract:Based on a transformer based sequence-to-sequence architecture combined with a dynamic batching algorithm, this work introduces a machine learning framework for automatically simplifying complex expressions involving multiple elliptic Gamma functions, including the q - \theta function and the elliptic Gamma function. The model learns to apply algebraic identities, particularly the SL (2,\mathbbZ) and SL (3,\mathbbZ) modular transformations, to reduce heavily scrambled expressions to their canonical forms. Experimental results show that the model achieves over 99% accuracy on in-distribution tests and maintains robust performance (exceeding 90% accuracy) under significant extrapolation, such as with deeper scrambling depths. This demonstrates that the model has internalized the underlying algebraic rules of modular transformations rather than merely memorizing training patterns. Our work presents the first successful application of machine learning to perform symbolic simplification using modular identities, offering a new automated tool for computations with special functions in quantum field theory and the string theory.

[LG-98] Sparse Convex Biclustering

链接: https://arxiv.org/abs/2601.01757
作者: Jiakun Jiang,Dewei Xiang,Chenliang Gu,Wei Liu,Binhuan Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Biclustering is an essential unsupervised machine learning technique for simultaneously clustering rows and columns of a data matrix, with widespread applications in genomics, transcriptomics, and other high-dimensional omics data. Despite its importance, existing biclustering methods struggle to meet the demands of modern large-scale datasets. The challenges stem from the accumulation of noise in high-dimensional features, the limitations of non-convex optimization formulations, and the computational complexity of identifying meaningful biclusters. These issues often result in reduced accuracy and stability as the size of the dataset increases. To overcome these challenges, we propose Sparse Convex Biclustering (SpaCoBi), a novel method that penalizes noise during the biclustering process to improve both accuracy and robustness. By adopting a convex optimization framework and introducing a stability-based tuning criterion, SpaCoBi achieves an optimal balance between cluster fidelity and sparsity. Comprehensive numerical studies, including simulations and an application to mouse olfactory bulb data, demonstrate that SpaCoBi significantly outperforms state-of-the-art methods in accuracy. These results highlight SpaCoBi as a robust and efficient solution for biclustering in high-dimensional and large-scale datasets.

[LG-99] Latent Space Element Method

链接: https://arxiv.org/abs/2601.01741
作者: Seung Whan Chung,Youngsoo Choi,Christopher Miller,H. Keo Springer,Kyle T. Sullivan
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG); Analysis of PDEs (math.AP); Numerical Analysis (math.NA)
*备注: 17 pages, 10 figures

点击查看摘要

Abstract:How can we build surrogate solvers that train on small domains but scale to larger ones without intrusive access to PDE operators? Inspired by the Data-Driven Finite Element Method (DD-FEM) framework for modular data-driven solvers, we propose the Latent Space Element Method (LSEM), an element-based latent surrogate assembly approach in which a learned subdomain (“element”) model can be tiled and coupled to form a larger computational domain. Each element is a LaSDI latent ODE surrogate trained from snapshots on a local patch, and neighboring elements are coupled through learned directional interaction terms in latent space, avoiding Schwarz iterations and interface residual evaluations. A smooth window-based blending reconstructs a global field from overlapping element predictions, yielding a scalable assembled latent dynamical system. Experiments on the 1D Burgers and Korteweg-de Vries equations show that LSEM maintains predictive accuracy while scaling to spatial domains larger than those seen in training. LSEM offers an interpretable and extensible route toward foundation-model surrogate solvers built from reusable local models.

[LG-100] Reinforcement Learning for Option Hedging: Static Implied-Volatility Fit versus Shortfall-Aware Performance

链接: https://arxiv.org/abs/2601.01709
作者: Ziheng Chen,Minxuan Hu,Jiayu Yi,Wenxi Sun
类目: Pricing of Securities (q-fin.PR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We extend the Q-learner in Black-Scholes (QLBS) framework by incorporating risk aversion and trading costs, and propose a novel Replication Learning of Option Pricing (RLOP) approach. Both methods are fully compatible with standard reinforcement learning algorithms and operate under market frictions. Using SPY and XOP option data, we evaluate performance along static and dynamic dimensions. Adaptive-QLBS achieves higher static pricing accuracy in implied volatility space, while RLOP delivers superior dynamic hedging performance by reducing shortfall probability. These results highlight the importance of evaluating option pricing models beyond static fit, emphasizing realized hedging outcomes.

[LG-101] Simplex Deep Linear Discriminant Analysis

链接: https://arxiv.org/abs/2601.01679
作者: Maxat Tezekbayev,Arman Bolatov,Zhenisbek Assylbekov
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We revisit Deep Linear Discriminant Analysis (Deep LDA) from a likelihood-based perspective. While classical LDA is a simple Gaussian model with linear decision boundaries, attaching an LDA head to a neural encoder raises the question of how to train the resulting deep classifier by maximum likelihood estimation (MLE). We first show that end-to-end MLE training of an unconstrained Deep LDA model ignores discrimination: when both the LDA parameters and the encoder parameters are learned jointly, the likelihood admits a degenerate solution in which some of the class clusters may heavily overlap or even collapse, and classification performance deteriorates. Batchwise moment re-estimation of the LDA parameters does not remove this failure mode. We then propose a constrained Deep LDA formulation that fixes the class means to the vertices of a regular simplex in the latent space and restricts the shared covariance to be spherical, leaving only the priors and a single variance parameter to be learned along with the encoder. Under these geometric constraints, MLE becomes stable and yields well-separated class clusters in the latent space. On images (Fashion-MNIST, CIFAR-10, CIFAR-100), the resulting Deep LDA models achieve accuracy competitive with softmax baselines while offering a simple, interpretable latent geometry that is clearly visible in two-dimensional projections.

[LG-102] Deep Linear Discriminant Analysis Revisited

链接: https://arxiv.org/abs/2601.01619
作者: Maxat Tezekbayev,Rustem Takhanov,Arman Bolatov,Zhenisbek Assylbekov
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We show that for unconstrained Deep Linear Discriminant Analysis (LDA) classifiers, maximum-likelihood training admits pathological solutions in which class means drift together, covariances collapse, and the learned representation becomes almost non-discriminative. Conversely, cross-entropy training yields excellent accuracy but decouples the head from the underlying generative model, leading to highly inconsistent parameter estimates. To reconcile generative structure with discriminative performance, we introduce the \emphDiscriminative Negative Log-Likelihood (DNLL) loss, which augments the LDA log-likelihood with a simple penalty on the mixture density. DNLL can be interpreted as standard LDA NLL plus a term that explicitly discourages regions where several classes are simultaneously likely. Deep LDA trained with DNLL produces clean, well-separated latent spaces, matches the test accuracy of softmax classifiers on synthetic data and standard image benchmarks, and yields substantially better calibrated predictive probabilities, restoring a coherent probabilistic interpretation to deep discriminant models.

[LG-103] Variance-Reduced Diffusion Sampling via Conditional Score Expectation Identity

链接: https://arxiv.org/abs/2601.01594
作者: Alois Duston,Tan Bui-Thanh
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce and prove a \textbfConditional Score Expectation (CSE) identity: an exact relation for the marginal score of affine diffusion processes that links scores across time via a conditional expectation under the forward dynamics. Motivated by this identity, we propose a CSE-based statistical estimator for the score using a Self-Normalized Importance Sampling (SNIS) procedure with prior samples and forward noise. We analyze its relationship to the standard Tweedie estimator, proving anti-correlation for Gaussian targets and establishing the same behavior for general targets in the small time-step regime. Exploiting this structure, we derive a variance-minimizing blended score estimator given by a state–time dependent convex combination of the CSE and Tweedie estimators. Numerical experiments show that this optimal-blending estimator reduces variance and improves sample quality for a fixed computational budget compared to either baseline. We further extend the framework to Bayesian inverse problems via likelihood-informed SNIS weights, and demonstrate improved reconstruction quality and sample diversity on high-dimensional image reconstruction tasks and PDE-governed inverse problems.

[LG-104] Identifying recurrent flows in high-dimensional dissipative chaos from low-dimensional embeddings

链接: https://arxiv.org/abs/2601.01590
作者: Pierre Beck,Tobias M. Schneider
类目: Chaotic Dynamics (nlin.CD); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:Unstable periodic orbits (UPOs) are the non-chaotic, dynamical building blocks of spatio-temporal chaos, motivating a first-principles based theory for turbulence ever since the discovery of deterministic chaos. Despite their key role in the ergodic theory approach to fluid turbulence, identifying UPOs is challenging for two reasons: chaotic dynamics and the high-dimensionality of the spatial discretization. We address both issues at once by proposing a loop convergence algorithm for UPOs directly within a low-dimensional embedding of the chaotic attractor. The convergence algorithm circumvents time-integration, hence avoiding instabilities from exponential error amplification, and operates on a latent dynamics obtained by pulling back the physical equations using automatic differentiation through the learned embedding function. The interpretable latent dynamics is accurate in a statistical sense, and, crucially, the embedding preserves the internal structure of the attractor, which we demonstrate through an equivalence between the latent and physical UPOs of both a model PDE and the 2D Navier-Stokes equations. This allows us to exploit the collapse of high-dimensional dissipative systems onto a lower dimensional manifold, and identify UPOs in the low-dimensional embedding.

[LG-105] Learning Relationship between Quantum Walks and Underdamped Langevin Dynamics

链接: https://arxiv.org/abs/2601.01589
作者: Yazhen Wang
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Fast computational algorithms are in constant demand, and their development has been driven by advances such as quantum speedup and classical acceleration. This paper intends to study search algorithms based on quantum walks in quantum computation and sampling algorithms based on Langevin dynamics in classical computation. On the quantum side, quantum walk-based search algorithms can achieve quadratic speedups over their classical counterparts. In classical computation, a substantial body of work has focused on gradient acceleration, with gradient-adjusted algorithms derived from underdamped Langevin dynamics providing quadratic acceleration over conventional Langevin algorithms. Since both search and sampling algorithms are designed to address learning tasks, we study learning relationship between coined quantum walks and underdamped Langevin dynamics. Specifically, we show that, in terms of the Le Cam deficiency distance, a quantum walk with randomization is asymptotically equivalent to underdamped Langevin dynamics, whereas the quantum walk without randomization is not asymptotically equivalent due to its high-frequency oscillatory behavior. We further discuss the implications of these equivalence and nonequivalence results for the computational and inferential properties of the associated algorithms in machine learning tasks. Our findings offer new insight into the relationship between quantum walks and underdamped Langevin dynamics, as well as the intrinsic mechanisms underlying quantum speedup and classical gradient acceleration. Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG) MSC classes: F.2 G.3 I.2.6 Cite as: arXiv:2601.01589 [quant-ph] (or arXiv:2601.01589v1 [quant-ph] for this version) https://doi.org/10.48550/arXiv.2601.01589 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-106] Modeling Information Blackouts in Missing Not-At-Random Time Series Data

链接: https://arxiv.org/abs/2601.01480
作者: Aman Sunesh(New York University),Allan Ma(New York University),Siddarth Nilol(New York University)
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注: 8 pages, 7 figures, 3 tables

点击查看摘要

Abstract:Large-scale traffic forecasting relies on fixed sensor networks that often exhibit blackouts: contiguous intervals of missing measurements caused by detector or communication failures. These outages are typically handled under a Missing At Random (MAR) assumption, even though blackout events may correlate with unobserved traffic conditions (e.g., congestion or anomalous flow), motivating a Missing Not At Random (MNAR) treatment. We propose a latent state-space framework that jointly models (i) traffic dynamics via a linear dynamical system and (ii) sensor dropout via a Bernoulli observation channel whose probability depends on the latent traffic state. Inference uses an Extended Kalman Filter with Rauch-Tung-Striebel smoothing, and parameters are learned via an approximate EM procedure with a dedicated update for detector-specific missingness parameters. On the Seattle inductive loop detector data, introducing latent dynamics yields large gains over naive baselines, reducing blackout imputation RMSE from 7.02 (LOCF) and 5.02 (linear interpolation + seasonal naive) to 4.23 (MAR LDS), corresponding to about a 64% reduction in MSE relative to LOCF. Explicit MNAR modeling provides a consistent but smaller additional improvement on real data (imputation RMSE 4.20; 0.8% RMSE reduction relative to MAR), with similar modest gains for short-horizon post-blackout forecasts (evaluated at 1, 3, and 6 steps). In controlled synthetic experiments, the MNAR advantage increases as the true missingness dependence on latent state strengthens. Overall, temporal dynamics dominate performance, while MNAR modeling offers a principled refinement that becomes most valuable when missingness is genuinely informative.

[LG-107] Fast Gibbs Sampling on Bayesian Hidden Markov Model with Missing Observations

链接: https://arxiv.org/abs/2601.01442
作者: Dongrong Li,Tianwei Yu,Xiaodan Fan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 45 pages, 2 figures

点击查看摘要

Abstract:The Hidden Markov Model (HMM) is a widely-used statistical model for handling sequential data. However, the presence of missing observations in real-world datasets often complicates the application of the model. The EM algorithm and Gibbs samplers can be used to estimate the model, yet suffering from various problems including non-convexity, high computational complexity and slow mixing. In this paper, we propose a collapsed Gibbs sampler that efficiently samples from HMMs’ posterior by integrating out both the missing observations and the corresponding latent states. The proposed sampler is fast due to its three advantages. First, it achieves an estimation accuracy that is comparable to existing methods. Second, it can produce a larger Effective Sample Size (ESS) per iteration, which can be justified theoretically and numerically. Third, when the number of missing entries is large, the sampler has a significant smaller computational complexity per iteration compared to other methods, thus is faster computationally. In summary, the proposed sampling algorithm is fast both computationally and theoretically and is particularly advantageous when there are a lot of missing entries. Finally, empirical evaluations based on numerical simulations and real data analysis demonstrate that the proposed algorithm consistently outperforms existing algorithms in terms of time complexity and sampling efficiency (measured in ESS).

[LG-108] Bayesian Negative Binomial Regression of Afrobeats Chart Persistence

链接: https://arxiv.org/abs/2601.01391
作者: Ian Jacob Cabansag,Paul Ntegeka
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:Afrobeats songs compete for attention on streaming platforms, where chart visibility can influence both revenue and cultural impact. This paper examines whether collaborations help songs remain on the charts longer, using daily Nigeria Spotify Top 200 data from 2024. Each track is summarized by the number of days it appears in the Top 200 during the year and its total annual streams in Nigeria. A Bayesian negative binomial regression is applied, with days on chart as the outcome and collaboration status (solo versus multi-artist) and log total streams as predictors. This approach is well suited for overdispersed count data and allows the effect of collaboration to be interpreted while controlling for overall popularity. Posterior inference is conducted using Markov chain Monte Carlo, and results are assessed using rate ratios, posterior probabilities, and predictive checks. The findings indicate that, after accounting for total streams, collaboration tracks tend to spend slightly fewer days on the chart than comparable solo tracks.

[LG-109] SGD with Dependent Data: Optimal Estimation Regret and Inference

链接: https://arxiv.org/abs/2601.01371
作者: Yinan Shen,Yichen Zhang,Wen-Xin Zhou
类目: atistics Theory (math.ST); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This work investigates the performance of the final iterate produced by stochastic gradient descent (SGD) under temporally dependent data. We consider two complementary sources of dependence: (i) martingale-type dependence in both the covariate and noise processes, which accommodates non-stationary and non-mixing time series data, and (ii) dependence induced by sequential decision making. Our formulation runs in parallel with classical notions of (local) stationarity and strong mixing, while neither framework fully subsumes the other. Remarkably, SGD is shown to automatically accommodate both independent and dependent information under a broad class of stepsize schedules and exploration rate schemes. Non-asymptotically, we show that SGD simultaneously achieves statistically optimal estimation error and regret, extending and improving existing results. In particular, our tail bounds remain sharp even for potentially infinite horizon T=+\infty . Asymptotically, the SGD iterates converge to a Gaussian distribution with only an O_\PP(1/\sqrtt) remainder, demonstrating that the supposed estimation-regret trade-off claimed in prior work can in fact be avoided. We further propose a new ``conic’’ approximation of the decision region that allows the covariates to have unbounded support. For online sparse regression, we develop a new SGD-based algorithm that uses only d units of storage and requires O(d) flops per iteration, achieving the long term statistical optimality. Intuitively, each incoming observation contributes to estimation accuracy, while aggregated summary statistics guide support recovery. Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML) Cite as: arXiv:2601.01371 [math.ST] (or arXiv:2601.01371v1 [math.ST] for this version) https://doi.org/10.48550/arXiv.2601.01371 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-110] A New Framework for Explainable Rare Cell Identification in Single-Cell Transcriptomics Data

链接: https://arxiv.org/abs/2601.01358
作者: Di Su,Kai Ming Ting,Jie Zhang,Xiaorui Zhang,Xinpeng Li
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The detection of rare cell types in single-cell transcriptomics data is crucial for elucidating disease pathogenesis and tissue development dynamics. However, a critical gap that persists in current methods is their inability to provide an explanation based on genes for each cell they have detected as rare. We identify three primary sources of this deficiency. First, the anomaly detectors often function as “black boxes”, designed to detect anomalies but unable to explain why a cell is anomalous. Second, the standard analytical framework hinders interpretability by relying on dimensionality reduction techniques, such as Principal Component Analysis (PCA), which transform meaningful gene expression data into abstract, uninterpretable features. Finally, existing explanation algorithms cannot be readily applied to this domain, as single-cell data is characterized by high dimensionality, noise, and substantial sparsity. To overcome these limitations, we introduce a framework for explainable anomaly detection in single-cell transcriptomics data which not only identifies individual anomalies, but also provides a visual explanation based on genes that makes an instance anomalous. This framework has two key ingredients that are not existed in current methods applied in this domain. First, it eliminates the PCA step which is deemed to be an essential component in previous studies. Second, it employs the state-of-art anomaly detector and explainer as the efficient and effective means to find each rare cell and the relevant gene subspace in order to provide explanations for each rare cell as well as the typical normal cell associated with the rare cell’s closest normal cells.

[LG-111] Concave Certificates: Geometric Framework for Distributionally Robust Risk and Complexity Analysis

链接: https://arxiv.org/abs/2601.01311
作者: Hong T.M. Chu
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 30 pages, 7 figures

点击查看摘要

Abstract:Distributionally Robust (DR) optimization aims to certify worst-case risk within a Wasserstein uncertainty set. Current certifications typically rely either on global Lipschitz bounds, which are often conservative, or on local gradient information, which provides only a first-order approximation. This paper introduces a novel geometric framework based on the least concave majorants of the growth rate function. Our proposed concave certificate establishes a tight bound of DR risk that remains applicable to non-Lipschitz and non-differentiable losses. We extend this framework to complexity analysis, introducing a deterministic bound that complements standard statistical generalization bound. Furthermore, we utilize this certificate to bound the gap between adversarial and empirical Rademacher complexity, demonstrating that dependencies on input diameter, network width, and depth can be eliminated. For practical application in deep learning, we introduce the adversarial score as a tractable relaxation of the concave certificate that enables efficient and layer-wise analysis of neural networks. We validate our theoretical results in various numerical experiments on classification and regression tasks on real-world data.

[LG-112] Stochastic Control Methods for Optimization

链接: https://arxiv.org/abs/2601.01248
作者: Jinniao Qiu
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR)
*备注:

点击查看摘要

Abstract:In this work, we investigate a stochastic control framework for global optimization over both finite-dimensional Euclidean spaces and the Wasserstein space of probability measures. In the Euclidean setting, the original minimization problem is approximated by a family of regularized stochastic control problems; using dynamic programming, we analyze the associated Hamilton–Jacobi–Bellman equations and obtain tractable representations via the Cole–Hopf transform and the Feynman–Kac formula. For optimization over probability measures, we formulate a regularized mean-field control problem characterized by a master equation, and further approximate it by controlled N -particle systems. We establish that, as the regularization parameter tends to zero (and as the particle number tends to infinity for the optimization over probability measures), the value of the control problem converges to the global minimum of the original objective. Building on the resulting probabilistic representations, Monte Carlo-based numerical schemes are proposed and numerical experiments are reported to illustrate the practical performance of the methods and to support the theoretical convergence rates.

[LG-113] Evidence Slopes and Effective Dimension in Singular Linear Models

链接: https://arxiv.org/abs/2601.01238
作者: Kalyaan Rao
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Preprint. 10 pages, 6 figures. Under review

点击查看摘要

Abstract:Bayesian model selection commonly relies on Laplace approximation or the Bayesian Information Criterion (BIC), which assume that the effective model dimension equals the number of parameters. Singular learning theory replaces this assumption with the real log canonical threshold (RLCT), an effective dimension that can be strictly smaller in overparameterized or rank-deficient models. We study linear-Gaussian rank models and linear subspace (dictionary) models in which the exact marginal likelihood is available in closed form and the RLCT is analytically tractable. In this setting, we show theoretically and empirically that the error of Laplace/BIC grows linearly with (d/2 minus lambda) times log n, where d is the ambient parameter dimension and lambda is the RLCT. An RLCT-aware correction recovers the correct evidence slope and is invariant to overcomplete reparameterizations that represent the same data subspace. Our results provide a concrete finite-sample characterization of Laplace failure in singular models and demonstrate that evidence slopes can be used as a practical estimator of effective dimension in simple linear settings. Comments: Preprint. 10 pages, 6 figures. Under review Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG) MSC classes: 62F15, 62H25, 68T05 Cite as: arXiv:2601.01238 [stat.ML] (or arXiv:2601.01238v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2601.01238 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-114] NeuroSSM: Multiscale Differential State-Space Modeling for Context-Aware fMRI Analysis

链接: https://arxiv.org/abs/2601.01229
作者: Furkan Genç,Boran İsmet Macun,Sait Sarper Özaslan,Emine U. Saritas,Tolga Çukur
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate fMRI analysis requires sensitivity to temporal structure across multiple scales, as BOLD signals encode cognitive processes that emerge from fast transient dynamics to slower, large-scale fluctuations. Existing deep learning (DL) approaches to temporal modeling face challenges in jointly capturing these dynamics over long fMRI time series. Among current DL models, transformers address long-range dependencies by explicitly modeling pairwise interactions through attention, but the associated quadratic computational cost limits effective integration of temporal dependencies across long fMRI sequences. Selective state-space models (SSMs) instead model long-range temporal dependencies implicitly through latent state evolution in a dynamical system, enabling efficient propagation of dependencies over time. However, recent SSM-based approaches for fMRI commonly operate on derived functional connectivity representations and employ single-scale temporal processing. These design choices constrain the ability to jointly represent fast transient dynamics and slower global trends within a single model. We propose NeuroSSM, a selective state-space architecture designed for end-to-end analysis of raw BOLD signals in fMRI time series. NeuroSSM addresses the above limitations through two complementary design components: a multiscale state-space backbone that captures fast and slow dynamics concurrently, and a parallel differencing branch that increases sensitivity to transient state changes. Experiments on clinical and non-clinical datasets demonstrate that NeuroSSM achieves competitive performance and efficiency against state-of-the-art fMRI analysis methods.

[LG-115] Gradient-Free Approaches is a Key to an Efficient Interaction with Markovian Stochasticity

链接: https://arxiv.org/abs/2601.01160
作者: Boris Prokhorov,Semyon Chebykin,Alexander Gasnikov,Aleksandr Beznosikov
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper deals with stochastic optimization problems involving Markovian noise with a zero-order oracle. We present and analyze a novel derivative-free method for solving such problems in strongly convex smooth and non-smooth settings with both one-point and two-point feedback oracles. Using a randomized batching scheme, we show that when mixing time \tau of the underlying noise sequence is less than the dimension of the problem d , the convergence estimates of our method do not depend on \tau . This observation provides an efficient way to interact with Markovian stochasticity: instead of invoking the expensive first-order oracle, one should use the zero-order oracle. Finally, we complement our upper bounds with the corresponding lower bounds. This confirms the optimality of our results.

[LG-116] Conformal Blindness: A Note on A-Cryptic change-points

链接: https://arxiv.org/abs/2601.01147
作者: Johan Hallberg Szabadváry
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 6 pages, 3 figures

点击查看摘要

Abstract:Conformal Test Martingales (CTMs) are a standard method within the Conformal Prediction framework for testing the crucial assumption of data exchangeability by monitoring deviations from uniformity in the p-value sequence. Although exchangeability implies uniform p-values, the converse does not hold. This raises the question of whether a significant break in exchangeability can occur, such that the p-values remain uniform, rendering CTMs blind. We answer this affirmatively, demonstrating the phenomenon of \emphconformal blindness. Through explicit construction, for the theoretically ideal ``oracle’’ conformity measure (given by the true conditional density), we demonstrate the possibility of an \emph A -cryptic change-point (where A refers to the conformity measure). Using bivariate Gaussian distributions, we identify a line along which a change in the marginal means does not alter the distribution of the conformity scores, thereby producing perfectly uniform p-values. Simulations confirm that even a massive distribution shift can be perfectly cryptic to the CTM, highlighting a fundamental limitation and emphasising the critical role of the alignment of the conformity measure with potential shifts. Comments: 6 pages, 3 figures Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG) Cite as: arXiv:2601.01147 [stat.ML] (or arXiv:2601.01147v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2601.01147 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-117] Neural Networks on Symmetric Spaces of Noncompact Type

链接: https://arxiv.org/abs/2601.01097
作者: Xuan Son Nguyen,Shuo Yang,Aymeric Histace
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent works have demonstrated promising performances of neural networks on hyperbolic spaces and symmetric positive definite (SPD) manifolds. These spaces belong to a family of Riemannian manifolds referred to as symmetric spaces of noncompact type. In this paper, we propose a novel approach for developing neural networks on such spaces. Our approach relies on a unified formulation of the distance from a point to a hyperplane on the considered spaces. We show that some existing formulations of the point-to-hyperplane distance can be recovered by our approach under specific settings. Furthermore, we derive a closed-form expression for the point-to-hyperplane distance in higher-rank symmetric spaces of noncompact type equipped with G-invariant Riemannian metrics. The derived distance then serves as a tool to design fully-connected (FC) layers and an attention mechanism for neural networks on the considered spaces. Our approach is validated on challenging benchmarks for image classification, electroencephalogram (EEG) signal classification, image generation, and natural language inference.

[LG-118] Fibonacci-Driven Recursive Ensembles: Algorithms Convergence and Learning Dynamics

链接: https://arxiv.org/abs/2601.01055
作者: Ernest Fokoué
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 19 pages

点击查看摘要

Abstract:This paper develops the algorithmic and dynamical foundations of recursive ensemble learning driven by Fibonacci-type update flows. In contrast with classical boosting Freund and Schapire (1997); Friedman (2001), where the ensemble evolves through first-order additive updates, we study second-order recursive architectures in which each predictor depends on its two immediate predecessors. These Fibonacci flows induce a learning dynamic with memory, allowing ensembles to integrate past structure while adapting to new residual information. We introduce a general family of recursive weight-update algorithms encompassing Fibonacci, tribonacci, and higher-order recursions, together with continuous-time limits that yield systems of differential equations governing ensemble evolution. We establish global convergence conditions, spectral stability criteria, and non-asymptotic generalization bounds under Rademacher Bartlett and Mendelson (2002) and algorithmic stability analyses. The resulting theory unifies recursive ensembles, structured weighting, and dynamical systems viewpoints in statistical learning. Experiments with kernel ridge regression Rasmussen and Williams (2006), spline smoothers Wahba (1990), and random Fourier feature models Rahimi and Recht (2007) demonstrate that recursive flows consistently improve approximation and generalization beyond static weighting. These results complete the trilogy begun in Papers I and II: from Fibonacci weighting, through geometric weighting theory, to fully dynamical recursive ensemble learning systems.

[LG-119] Dynamic Accuracy Estimation in a Wi-Fi-based Positioning System

链接: https://arxiv.org/abs/2601.00999
作者: Marcin Kolakowski,Vitomir Djaja-Josko
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: Originally presented at 2025 33rd Telecommunications Forum (TELFOR), Belgrade, Serbia

点击查看摘要

Abstract:The paper presents a concept of a dynamic accuracy estimation method, in which the localization errors are derived based on the measurement results used by the positioning algorithm. The concept was verified experimentally in a Wi\nobreakdash-Fi based indoor positioning system, where several regression methods were tested (linear regression, random forest, k-nearest neighbors, and neural networks). The highest positioning error estimation accuracy was achieved for random forest regression, with a mean absolute error of 0.72 m.

[LG-120] Deep Deterministic Nonlinear ICA via Total Correlation Minimization with Matrix-Based Entropy Functional

链接: https://arxiv.org/abs/2601.00904
作者: Qiang Li,Shujian Yu,Liang Ma,Chen Ma,Jingyu Liu,Tulay Adali,Vince D. Calhoun
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 16 pages, 9 figures

点击查看摘要

Abstract:Blind source separation, particularly through independent component analysis (ICA), is widely utilized across various signal processing domains for disentangling underlying components from observed mixed signals, owing to its fully data-driven nature that minimizes reliance on prior assumptions. However, conventional ICA methods rely on an assumption of linear mixing, limiting their ability to capture complex nonlinear relationships and to maintain robustness in noisy environments. In this work, we present deep deterministic nonlinear independent component analysis (DDICA), a novel deep neural network-based framework designed to address these limitations. DDICA leverages a matrix-based entropy function to directly optimize the independence criterion via stochastic gradient descent, bypassing the need for variational approximations or adversarial schemes. This results in a streamlined training process and improved resilience to noise. We validated the effectiveness and generalizability of DDICA across a range of applications, including simulated signal mixtures, hyperspectral image unmixing, modeling of primary visual receptive fields, and resting-state functional magnetic resonance imaging (fMRI) data analysis. Experimental results demonstrate that DDICA effectively separates independent components with high accuracy across a range of applications. These findings suggest that DDICA offers a robust and versatile solution for blind source separation in diverse signal processing tasks.

[LG-121] Investigation into U.S. Citizen and Non-Citizen Worker Health Insurance and Employment

链接: https://arxiv.org/abs/2601.00896
作者: Annabelle Yao
类目: General Economics (econ.GN); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Socioeconomic integration is a critical dimension of social equity, yet persistent disparities remain in access to health insurance, education, and employment across different demographic groups. While previous studies have examined isolated aspects of inequality, there is limited research that integrates both statistical analysis and advanced machine learning to uncover hidden structures within population data. This study leverages statistical analysis ( \chi^2 test of independence and Two Proportion Z-Test) and machine learning clustering techniques – K-Modes and K-Prototypes – along with t-SNE visualization and CatBoost classification to analyze socioeconomic integration and inequality. Using statistical tests, we identified the proportion of the population with healthcare insurance, quality education, and employment. With this data, we concluded that there was an association between employment and citizenship status. Moreover, we were able to determine 5 distinct population groups using Machine Learning classification. The five clusters our analysis identifies reveal that while citizenship status shows no association with workforce participation, significant disparities exist in access to employer-sponsored health insurance. Each cluster represents a distinct demographic of the population, showing that there is a primary split along the lines of educational attainment which separates Clusters 0 and 4 from Clusters 1, 2, and 3. Furthermore, labor force status and nativity serve as secondary differentiators. Non-citizens are also disproportionately concentrated in precarious employment without benefits, highlighting systemic inequalities in healthcare access. By uncovering demographic clusters that face compounded disadvantages, this research contributes to a more nuanced understanding of socioeconomic stratification.

[LG-122] Deep Learning Framework for RNA Inverse Folding with Geometric Structure Potentials

链接: https://arxiv.org/abs/2601.00895
作者: Annabelle Yao
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:RNA’s diverse biological functions stem from its structural versatility, yet accurately predicting and designing RNA sequences given a 3D conformation (inverse folding) remains a challenge. Here, I introduce a deep learning framework that integrates Geometric Vector Perceptron (GVP) layers with a Transformer architecture to enable end-to-end RNA design. I construct a dataset consisting of experimentally solved RNA 3D structures, filtered and deduplicated from the BGSU RNA list, and evaluate performance using both sequence recovery rate and TM-score to assess sequence and structural fidelity, respectively. On standard benchmarks and RNA-Puzzles, my model achieves state-of-the-art performance, with recovery and TM-scores of 0.481 and 0.332, surpassing existing methods across diverse RNA families and length scales. Masked family-level validation using Rfam annotations confirms strong generalization beyond seen families. Furthermore, inverse-folded sequences, when refolded using AlphaFold3, closely resemble native structures, highlighting the critical role of geometric features captured by GVP layers in enhancing Transformer-based RNA design.

[LG-123] Deep versus Broad Technology Search and the Timing of Innovation Impact

链接: https://arxiv.org/abs/2601.00871
作者: Likun Cao,James Evans
类目: Physics and Society (physics.soc-ph); Digital Libraries (cs.DL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 47 pages, 8 figures, 3 tables

点击查看摘要

Abstract:This study offers a new perspective on the depth-versus-breadth debate in innovation strategy, by modeling inventive search within dynamic collective knowledge systems, and underscoring the importance of timing for technological impact. Using frontier machine learning to project patent citation networks in hyperbolic space, we analyze 4.9 million U.S. patents to examine how search strategies give rise to distinct temporal patterns in impact accumulation. We find that inventions based on deep search, which relies on a specialized understanding of complex recombination structures, drive higher short-term impact through early adoption within specialized communities, but face diminishing returns as innovations become “locked-in” with limited diffusion potential. Conversely, when inventions are grounded in broad search that spans disparate domains, they encounter initial resistance but achieve wider diffusion and greater long-term impact by reaching cognitively diverse audiences. Individual inventions require both depth and breadth for stable impact. Organizations can strategically balance approaches across multiple inventions: using depth to build reliable technological infrastructure while pursuing breadth to expand applications. We advance innovation theory by demonstrating how deep and broad search strategies distinctly shape the timing and trajectory of technological impact, and how individual inventors and organizations can leverage these mechanisms to balance exploitation and exploration.

[LG-124] Physically-Constrained Autoencoder-Assisted Bayesian Optimization for Refinement of High-Dimensional Defect-Sensitive Single Crystalline Structure

链接: https://arxiv.org/abs/2601.00855
作者: Joseph Oche Agada,Andrew McAninch,Haley Day,Yasemin Tanyu,Ewan McCombs,Seyed M. Koohpayeh,Brian H. Toby,Yishu Wang,Arpan Biswas
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 15 pages, 8 figures

点击查看摘要

Abstract:Physical properties and functionalities of materials are dictated by global crystal structures as well as local defects. To establish a structure-property relationship, not only the crystallographic symmetry but also quantitative knowledge about defects are required. Here we present a hybrid Machine Learning framework that integrates a physically-constrained variational-autoencoder (pcVAE) with different Bayesian Optimization (BO) methods to systematically accelerate and improve crystal structure refinement with resolution of defects. We chose the pyrochlore structured Ho2Ti2O7 as a model system and employed the GSAS2 package for benchmarking crystallographic parameters from Rietveld refinement. However, the function space of these material systems is highly nonlinear, which limits optimizers like traditional Rietveld refinement, into trapping at local minima. Also, these naive methods don’t provide an extensive learning about the overall function space, which is essential for large space, large time consuming explorations to identify various potential regions of interest. Thus, we present the approach of exploring the high Dimensional structure parameters of defect sensitive systems via pretrained pcVAE assisted BO and Sparse Axis Aligned BO. The pcVAE projects high-Dimensional diffraction data consisting of thousands of independently measured diffraction orders into a lowD latent space while enforcing scaling invariance and physical relevance. Then via BO methods, we aim to minimize the L2 norm based chisq errors in the real and latent spaces separately between experimental and simulated diffraction patterns, thereby steering the refinement towards potential optimum crystal structure parameters. We investigated and compared the results among different pcVAE assisted BO, non pcVAE assisted BO, and Rietveld refinement.

[LG-125] Autonomous battery research: Principles of heuristic operando experimentation

链接: https://arxiv.org/abs/2601.00851
作者: Emily Lu,Gabriel Perez,Peter Baker,Daniel Irving,Santosh Kumar,Veronica Celorrio,Sylvia Britto,Thomas F. Headen,Miguel Gomez-Gonzalez,Connor Wright,Calum Green,Robert Scott Young,Oleg Kirichek,Ali Mortazavi,Sarah Day,Isabel Antony,Zoe Wright,Thomas Wood,Tim Snow,Jeyan Thiyagalingam,Paul Quinn,Martin Owen Jones,William David,James Le Houx
类目: Instrumentation and Detectors (physics.ins-det); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 38 pages, 14 figures. Includes a detailed technical review of the POLARIS, BAM, DRIX, M-Series, and B18 electrochemical cells in the Supplementary Information

点击查看摘要

Abstract:Unravelling the complex processes governing battery degradation is critical to the energy transition, yet the efficacy of operando characterisation is severely constrained by a lack of Reliability, Representativeness, and Reproducibility (the 3Rs). Current methods rely on bespoke hardware and passive, pre-programmed methodologies that are ill-equipped to capture stochastic failure events. Here, using the Rutherford Appleton Laboratory’s multi-modal toolkit as a case study, we expose the systemic inability of conventional experiments to capture transient phenomena like dendrite initiation. To address this, we propose Heuristic Operando experiments: a framework where an AI pilot leverages physics-based digital twins to actively steer the beamline to predict and deterministically capture these rare events. Distinct from uncertainty-driven active learning, this proactive search anticipates failure precursors, redefining experimental efficiency via an entropy-based metric that prioritises scientific insight per photon, neutron, or muon. By focusing measurements only on mechanistically decisive moments, this framework simultaneously mitigates beam damage and drastically reduces data redundancy. When integrated with FAIR data principles, this approach serves as a blueprint for the trusted autonomous battery laboratories of the future.

信息检索

[IR-0] Cold-Starting Podcast Ads and Promotions with Multi-Task Learning on Spotify WSDM2026

链接: https://arxiv.org/abs/2601.02306
作者: Shivam Verma,Hannes Karlbom,Yu Zhao,Nick Topping,Vivian Chen,Kieran Stanley,Bharath Rengarajan
类目: Information Retrieval (cs.IR)
*备注: Accepted at WSDM 2026

点击查看摘要

Abstract:We present a unified multi-objective model for targeting both advertisements and promotions within the Spotify podcast ecosystem. Our approach addresses key challenges in personalization and cold-start initialization, particularly for new advertising objectives. By leveraging transfer learning from large-scale ad and content interactions within a multi-task learning (MTL) framework, a single joint model can be fine-tuned or directly applied to new or low-data targeting tasks, including in-app promotions. This multi-objective design jointly optimizes podcast outcomes such as streams, clicks, and follows for both ads and promotions using a shared representation over user, content, context, and creative features, effectively supporting diverse business goals while improving user experience. Online A/B tests show up to a 22% reduction in effective Cost-Per-Stream (eCPS), particularly for less-streamed podcasts, and an 18-24% increase in podcast stream rates. Offline experiments and ablations highlight the contribution of ancillary objectives and feature groups to cold-start performance. Our experience shows that a unified modeling strategy improves maintainability, cold-start performance, and coverage, while breaking down historically siloed targeting pipelines. We discuss practical trade-offs of such joint models in a real-world advertising system.

[IR-1] A Hybrid Architecture for Multi-Stage Claim Document Understanding: Combining Vision-Language Models and Machine Learning for Real-Time Processing

链接: https://arxiv.org/abs/2601.01897
作者: Lilu Cheng,Jingjun Lu,Yi Xuan Chan,Quoc Khai Nguyen,John Bi,Sean Ho
类目: Information Retrieval (cs.IR)
*备注: 19 pages, 3 figures, 3 tables

点击查看摘要

Abstract:Claims documents are fundamental to healthcare and insurance operations, serving as the basis for reimbursement, auditing, and compliance. However, these documents are typically not born digital; they often exist as scanned PDFs or photographs captured under uncontrolled conditions. Consequently, they exhibit significant content heterogeneity, ranging from typed invoices to handwritten medical reports, as well as linguistic diversity. This challenge is exemplified by operations at Fullerton Health, which handles tens of millions of claims annually across nine markets, including Singapore, the Philippines, Indonesia, Malaysia, Mainland China, Hong Kong, Vietnam, Papua New Guinea, and Cambodia. Such variability, coupled with inconsistent image quality and diverse layouts, poses a significant obstacle to automated parsing and structured information extraction. This paper presents a robust multi-stage pipeline that integrates the multilingual optical character recognition (OCR) engine PaddleOCR, a traditional Logistic Regression classifier, and a compact Vision-Language Model (VLM), Qwen 2.5-VL-7B, to achieve efficient and accurate field extraction from large-scale claims data. The proposed system achieves a document-type classification accuracy of over 95 percent and a field-level extraction accuracy of approximately 87 percent, while maintaining an average processing latency of under 2 seconds per document. Compared to manual processing, which typically requires around 10 minutes per claim, our system delivers a 300x improvement in efficiency. These results demonstrate that combining traditional machine learning models with modern VLMs enables production-grade accuracy and speed for real-world automation. The solution has been successfully deployed in our mobile application and is currently processing tens of thousands of claims weekly from Vietnam and Singapore. Comments: 19 pages, 3 figures, 3 tables Subjects: Information Retrieval (cs.IR) Cite as: arXiv:2601.01897 [cs.IR] (or arXiv:2601.01897v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2601.01897 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[IR-2] When Attention Becomes Exposure in Generative Search

链接: https://arxiv.org/abs/2601.01750
作者: Shayan Alipour,Mehdi Kargar,Morteza Zihayat
类目: Information Retrieval (cs.IR); Computers and Society (cs.CY)
*备注: 8 pages, 2 figures

点击查看摘要

Abstract:Generative search engines are reshaping information access by replacing traditional ranked lists with synthesized answers and references. In parallel, with the growth of Web3 platforms, incentive-driven creator ecosystems have become an essential part of how enterprises build visibility and community by rewarding creators for contributing to shared narratives. However, the extent to which exposure in generative search engine citations is shaped by external attention markets remains uncertain. In this study, we audit the exposure for 44 Web3 enterprises. First, we show that the creator community around each enterprise is persistent over time. Second, enterprise-specific queries reveal that more popular voices systematically receive greater citation exposure than others. Third, we find that larger follower bases and enterprises with more concentrated creator cores are associated with higher-ranked exposure. Together, these results show that generative search engine citations exhibit exposure bias toward already prominent voices, which risks entrenching incumbents and narrowing viewpoint diversity.

[IR-3] Breadcrumbs in the Digital Forest: Tracing Criminals through Torrent Metadata with OSINT

链接: https://arxiv.org/abs/2601.01492
作者: Annelies de Jong,Giuseppe Cascavilla,Jessica De Pascale
类目: Information Retrieval (cs.IR); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:This work investigates the potential of torrent metadata as a source for open-source intelligence (OSINT), with a focus on user profiling and behavioral analysis. While peer-to-peer (P2P) networks such as BitTorrent are well studied with respect to privacy and performance, their metadata is rarely used for investigative purposes. This work presents a proof of concept demonstrating how tracker responses, torrent index data, and enriched IP metadata can reveal patterns associated with high-risk behavior. The research follows a five-step OSINT process: source identification, data collection, enrichment, behavioral analysis, and presentation of the results. Data were collected from The Pirate Bay and UDP trackers, yielding a dataset of more than 60,000 unique IP addresses across 206 popular torrents. The data were enriched with geolocation, anonymization status, and flags of involvement in child exploitation material (CEM). A case study on sensitive e-books shows how such data can help detect possible interest in illicit content. Network analysis highlights peer clustering, co-download patterns, and the use of privacy tools by suspicious users. The study shows that publicly available torrent metadata can support scalable and automated OSINT profiling. This work adds to digital forensics by proposing a new method to extract useful signals from noisy data, with applications in law enforcement, cybersecurity, and threat analysis. Subjects: Information Retrieval (cs.IR); Computers and Society (cs.CY) Cite as: arXiv:2601.01492 [cs.IR] (or arXiv:2601.01492v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2601.01492 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Giuseppe Cascavilla [view email] [v1] Sun, 4 Jan 2026 11:39:32 UTC (4,315 KB)

[IR-4] Adaptive Diffusion-based Augmentation for Recommendation

链接: https://arxiv.org/abs/2601.01448
作者: Na Li,Fanghui Sun,Yan Zou,Yangfu Zhu,Xiatian Zhu,Ying Ma
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recommendation systems often rely on implicit feedback, where only positive user-item interactions can be observed. Negative sampling is therefore crucial to provide proper negative training signals. However, existing methods tend to mislabel potentially positive but unobserved items as negatives and lack precise control over negative sample selection. We aim to address these by generating controllable negative samples, rather than sampling from the existing item pool. In this context, we propose Adaptive Diffusion-based Augmentation for Recommendation (ADAR), a novel and model-agnostic module that leverages diffusion to synthesize informative negatives. Inspired by the progressive corruption process in diffusion, ADAR simulates a continuous transition from positive to negative, allowing for fine-grained control over sample hardness. To mine suitable negative samples, we theoretically identify the transition point at which a positive sample turns negative and derive a score-aware function to adaptively determine the optimal sampling timestep. By identifying this transition point, ADAR generates challenging negative samples that effectively refine the model’s decision boundary. Experiments confirm that ADAR is broadly compatible and boosts the performance of existing recommendation models substantially, including collaborative filtering and sequential recommendation, without architectural modifications.

[IR-5] Curator: Efficient Vector Search with Low-Selectivity Filters SIGMOD2026

链接: https://arxiv.org/abs/2601.01291
作者: Yicheng Jin,Yongji Wu,Wenjun Hu,Bruce M. Maggs,Jun Yang,Xiao Zhang,Danyang Zhuo
类目: Databases (cs.DB); Information Retrieval (cs.IR)
*备注: Accepted at SIGMOD 2026

点击查看摘要

Abstract:Embedding-based dense retrieval has become the cornerstone of many critical applications, where approximate nearest neighbor search (ANNS) queries are often combined with filters on labels such as dates and price ranges. Graph-based indexes achieve state-of-the-art performance on unfiltered ANNS but encounter connectivity breakdown on low-selectivity filtered queries, where qualifying vectors become sparse and the graph structure among them fragments. Recent research proposes specialized graph indexes that address this issue by expanding graph degree, which incurs prohibitively high construction costs. Given these inherent limitations of graph-based methods, we argue for a dual-index architecture and present Curator, a partition-based index that complements existing graph-based approaches for low-selectivity filtered ANNS. Curator builds specialized indexes for different labels within a shared clustering tree, where each index adapts to the distribution of its qualifying vectors to ensure efficient search while sharing structure to minimize memory overhead. The system also supports incremental updates and handles arbitrary complex predicates beyond single-label filters by efficiently constructing temporary indexes on the fly. Our evaluation demonstrates that integrating Curator with state-of-the-art graph indexes reduces low-selectivity query latency by up to 20.9x compared to pre-filtering fallback, while increasing construction time and memory footprint by only 5.5% and 4.3%, respectively.

附件下载

点击下载今日全部论文列表