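This post lists the latest papers retrieved from the Arxiv.org paper site on 2025-12-03. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, please leave your email address in the comments.

Note: The daily paper data is retrieved from Arxiv.org and updated automatically every morning at around 12:00.

Reminder: If you would like to receive the daily paper data by email, please leave your email address in the comments.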

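Contents

Overview (2025-12-03)

518 papers were updated today, including:

  • Natural Language Processing: 63 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 152 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 142 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 144 papers (Machine Learning (cs.LG))

Natural Language Processing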

[NLP-0] The Moral Consistency Pipeline: Continuous Ethical Evaluation for Large Language Models

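[Quick read]: This paper addresses the difficulty of continuously evaluating the moral consistency of Large Language Models (LLMs) across complex, changing contexts. Existing alignment frameworks typically rely on static datasets and post-hoc evaluation, so they cannot capture how ethical reasoning evolves over time or across scenarios. The key to the solution is the Moral Consistency Pipeline (MoCoP), a dataset-free, closed-loop framework for evaluating moral consistency. It iterates autonomously through three core modules: (i) lexical integrity analysis, (ii) semantic risk estimation, and (iii) reasoning-based judgment modeling. Within a self-sustaining architecture it automatically generates, evaluates, and refines ethical scenarios, so it can monitor a model's moral stability continuously without external supervision. Empirical results show that the method identifies long-term patterns of ethical behavior, revealing a strong negative correlation between the ethical and toxicity dimensions (rET = -0.81, p < 0.001) and almost no association with response latency (rEL ≈ 0), which indicates that moral coherence and linguistic safety are stable, interpretable characteristics of model behavior.

Link: https://arxiv.org/abs/2512.03026
Authors: Saeid Jamshidi,Kawser Wazed Nafi,Arghavan Moradi Dakhel,Negar Shahabi,Foutse Khomh
Affiliations: SWAT Laboratory, Polytechnique Montréal; Concordia Institute for Information Systems Engineering, Concordia University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: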

Abstract:The rapid advancement and adaptability of Large Language Models (LLMs) highlight the need for moral consistency, the capacity to maintain ethically coherent reasoning across varied contexts. Existing alignment frameworks, structured approaches designed to align model behavior with human ethical and social norms, often rely on static datasets and post-hoc evaluations, offering limited insight into how ethical reasoning may evolve across different contexts or temporal scales. This study presents the Moral Consistency Pipeline (MoCoP), a dataset-free, closed-loop framework for continuously evaluating and interpreting the moral stability of LLMs. MoCoP combines three supporting layers: (i) lexical integrity analysis, (ii) semantic risk estimation, and (iii) reasoning-based judgment modeling within a self-sustaining architecture that autonomously generates, evaluates, and refines ethical scenarios without external supervision. Our empirical results on GPT-4-Turbo and DeepSeek suggest that MoCoP effectively captures longitudinal ethical behavior, revealing a strong inverse relationship between ethical and toxicity dimensions (correlation rET = -0.81, p value less than 0.001) and a near-zero association with response latency (correlation rEL approximately equal to 0). These findings demonstrate that moral coherence and linguistic safety tend to emerge as stable and interpretable characteristics of model behavior rather than short-term fluctuations. Furthermore, by reframing ethical evaluation as a dynamic, model-agnostic form of moral introspection, MoCoP offers a reproducible foundation for scalable, continuous auditing and advances the study of computational morality in autonomous AI systems.

[NLP-1] LORE: A Large Generative Model for Search Relevance

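[Quick read]: This paper addresses the performance bottleneck of Generative AI relevance modeling in e-commerce search: existing Chain-of-Thought (CoT) based methods tend to saturate and are hard to improve further. The core of the solution is a qualitative-driven decomposition of the relevance task into three distinct, essential capabilities: knowledge and reasoning, multi-modal matching, and rule adherence. Building on this structured decomposition, the paper presents LORE, a complete framework for the Large Language Model (LLM) relevance lifecycle. It comprises a two-stage training paradigm (supervised fine-tuning, SFT, combined with preference alignment via reinforcement learning, RL), the RAIR benchmark for evaluating the core capabilities, and a deployment strategy stratified by query frequency. Together these break through the current performance ceiling and deliver a cumulative 27% improvement in the online GoodRate metric.

Link: https://arxiv.org/abs/2512.03025
Authors: Chenji Lu,Zhuo Chen,Hui Zhao,Zhiyuan Zeng,Gang Zhao,Junjie Ren,Ruicong Xu,Haoran Li,Songyan Liu,Pengjie Wang,Jian Xu,Bo Zheng(Alibaba Group)
Affiliations: Alimama (Alibaba Group)
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: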

Abstract:Achievement. We introduce LORE, a systematic framework for Large Generative Model-based relevance in e-commerce search. Deployed and iterated over three years, LORE achieves a cumulative +27% improvement in online GoodRate metrics. This report shares the valuable experience gained throughout its development lifecycle, spanning data, features, training, evaluation, and deployment. Insight. While existing works apply Chain-of-Thought (CoT) to enhance relevance, they often hit a performance ceiling. We argue this stems from treating relevance as a monolithic task, lacking principled deconstruction. Our key insight is that relevance comprises distinct capabilities: knowledge and reasoning, multi-modal matching, and rule adherence. We contend that a qualitative-driven decomposition is essential for breaking through current performance bottlenecks. Contributions. LORE provides a complete blueprint for the LLM relevance lifecycle. Key contributions include: (1) A two-stage training paradigm combining progressive CoT synthesis via SFT with human preference alignment via RL. (2) A comprehensive benchmark, RAIR, designed to evaluate these core capabilities. (3) A query frequency-stratified deployment strategy that efficiently transfers offline LLM capabilities to the online system. LORE serves as both a practical solution and a methodological reference for other vertical domains.

[NLP-2] Fine-Tuned Large Language Models for Logical Translation: Reducing Hallucinations with Lang2Logic

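[Quick read]: This paper addresses hallucination in Large Language Models (LLMs) when they automatically translate natural language statements into formal logic expressions, a problem that particularly harms the accuracy and reliability of logical translation. The key to the solution is a framework that combines classical Natural Language Processing (NLP) techniques, self-defined grammar rules, symbolic computation libraries, and a fine-tuned language model, and uses it to convert English sentences into Conjunctive Normal Form (CNF) for satisfiability solving. Experiments show that the fine-tuned model can actively correct, under different grammar settings, the same types of hallucinations produced by the original model, yielding stable and reliable CNF generation.

Link: https://arxiv.org/abs/2512.02987
Authors: Muyu Pan,Dheeraj Kodakandla,Mahfuza Farooque
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: IEEE ISNCC 2025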

Abstract:Recent advances in natural language processing (NLP), particularly large language models (LLMs), have motivated the automatic translation of natural language statements into formal logic without human intervention. This enables automated reasoning and facilitates debugging, finding loop invariants, and adhering to specifications in software systems. However, hallucinations (incorrect outputs generated by LLMs) are challenging, particularly for logical translation tasks requiring precision. This work introduces a novel framework that inputs English sentences, converts them into logical expressions, and then translates them into Conjunctive Normal Form (CNF) for satisfiability solving. It employs classical NLP techniques with self-defined grammar, symbolic computation libraries, and a fine-tuned language model to reduce hallucinations. In the early experiments, we observed that the fine-tuned model, trained on different grammar settings, could intentionally correct the same types of hallucinations made by the original model. Thus, it provides reliable CNF generation.

[NLP-3] Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities

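[Quick read]: This paper addresses the vulnerability of safety alignment in Multimodal Large Language Models (MLLMs) to jailbreak attacks. Existing attacks mainly rely on text-image interplay and treat the visual modality as an auxiliary prompt, failing to exploit the complex contextual information that images themselves can carry. The key to the solution is an image-centric attack, Contextual Image Attack (CIA), which uses a multi-agent system with four distinct visualization strategies to embed harmful queries implicitly in seemingly benign visual contexts, and adds contextual element enhancement and automatic toxicity obfuscation to strengthen the attack. Experiments on the MMSafetyBench-tiny dataset show that CIA reaches toxicity scores of 4.73 and 4.83 against GPT-4o and Qwen2.5-VL-72B, respectively, with attack success rates of up to 91.07%, significantly outperforming prior methods and confirming the visual modality as a potent jailbreak vector.

Link: https://arxiv.org/abs/2512.02973
Authors: Yuan Xiong,Ziqi Miao,Lijun Li,Chen Qian,Jie Li,Jing Shao
Affiliations: Shanghai Artificial Intelligence Laboratory; Xi’an Jiaotong University; Renmin University of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: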

Abstract:While Multimodal Large Language Models (MLLMs) show remarkable capabilities, their safety alignments are susceptible to jailbreak attacks. Existing attack methods typically focus on text-image interplay, treating the visual modality as a secondary prompt. This approach underutilizes the unique potential of images to carry complex, contextual information. To address this gap, we propose a new image-centric attack method, Contextual Image Attack (CIA), which employs a multi-agent system to subtly embed harmful queries into seemingly benign visual contexts using four distinct visualization strategies. To further enhance the attack’s efficacy, the system incorporates contextual element enhancement and automatic toxicity obfuscation techniques. Experimental results on the MMSafetyBench-tiny dataset show that CIA achieves high toxicity scores of 4.73 and 4.83 against the GPT-4o and Qwen2.5-VL-72B models, respectively, with Attack Success Rates (ASR) reaching 86.31% and 91.07%. Our method significantly outperforms prior work, demonstrating that the visual modality itself is a potent vector for jailbreaking advanced MLLMs.

[NLP-4] AutoNeural: Co-Designing Vision-Language Models for NPU Inference

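[Quick read]: This paper addresses the inefficiency of deploying current Vision-Language Models (VLMs) on Neural Processing Units (NPUs). The core conflict is that existing VLMs are mostly designed for GPUs and do not match NPU hardware characteristics, in two respects: Vision Transformers (ViTs) are sensitive to quantization and lose accuracy, and autoregressive attention incurs heavy memory I/O through Key-Value caching, so it cannot exploit the high arithmetic throughput of NPUs. The key to the solution is AutoNeural, an NPU-native architecture built on two core innovations. First, it replaces the standard ViT encoder with a MobileNetV5-style backbone of depthwise separable convolutions, which keeps activation distributions stable, supports INT4/8/16 integer quantization, and reduces quantization error by up to 7x. Second, it combines State-Space Model (SSM) principles with Transformer layers and uses efficient gated convolutions, reducing language generation complexity from O(n²) to linear and removing the dependence on the KV cache. As a result, end-to-end latency drops by 14x, decoding is 3x faster, and the context window is 4x longer.

Link: https://arxiv.org/abs/2512.02924
Authors: Wei Chen,Liangmin Wu,Yunhai Hu,Zhiyuan Li,Zhiyuan Cheng,Yicheng Qian,Lingyue Zhu,Zhipeng Hu,Luoyi Liang,Qiang Tang,Zhen Liu,Han Yang
Affiliations: Nexa AI; Geely Auto
Categories: Computation and Language (cs.CL)
Comments: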

Abstract:While Neural Processing Units (NPUs) offer high theoretical efficiency for edge AI, state-of-the-art Vision–Language Models (VLMs) tailored for GPUs often falter on these substrates. We attribute this hardware-model mismatch to two primary factors: the quantization brittleness of Vision Transformers (ViTs) and the I/O-bound nature of autoregressive attention mechanisms, which fail to utilize the high arithmetic throughput of NPUs. To bridge this gap, we propose AutoNeural, an NPU-native VLM architecture co-designed for integer-only inference. We replace the standard ViT encoder with a MobileNetV5-style backbone utilizing depthwise separable convolutions, which ensures bounded activation distributions for stable INT4/8/16 quantization. Complementing this, our language backbone integrates State-Space Model (SSM) principles with Transformer layers, employing efficient gated convolutions to achieve linear-time complexity. This hybrid design eliminates the heavy memory I/O overhead of Key-Value caching during generation. Our approach delivers substantial efficiency gains, reducing quantization error of vision encoder by up to 7x and end-to-end latency by 14x compared to conventional baselines. The AutoNeural also delivers 3x decoding speed and 4x longer context window than the baseline. We validate these improvements via a real-world automotive case study on the Qualcomm SA8295P SoC, demonstrating real-time performance for cockpit applications. Our results highlight that rethinking model topology specifically for NPU constraints is a prerequisite for robust multi-modal edge intelligence.

[NLP-5] Martingale Score: An Unsupervised Metric for Bayesian Rationality in LLM Reasoning NEURIPS2025

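[Quick read]: This paper addresses the risk that Large Language Models (LLMs) using iterative reasoning lose informational accuracy through belief entrenchment and confirmation bias. The core challenge is that, although iterative reasoning is meant to improve reasoning ability and factual reliability, empirical evidence shows it can make a model over-reliant on its initial beliefs and weaken its ability to update rationally on new evidence. The key to the solution is an unsupervised evaluation framework built on the Martingale property from Bayesian statistics: the Martingale Score, which uses regression to quantify how well the current belief predicts future belief updates and thereby identifies deviations from Bayesian updating. Experiments show that the score captures belief entrenchment across models, reasoning strategies, and task domains, and that on tasks with ground-truth labels it serves as a reliable proxy for truth-seeking ability.

Link: https://arxiv.org/abs/2512.02914
Authors: Zhonghao He,Tianyi Qiu,Hirokazu Shirado,Maarten Sap
Affiliations: University of Cambridge; Peking University; Carnegie Mellon University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: NeurIPS 2025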

Abstract:Recent advances in reasoning techniques have substantially improved the performance of large language models (LLMs), raising expectations for their ability to provide accurate, truthful, and reliable information. However, emerging evidence suggests that iterative reasoning may foster belief entrenchment and confirmation bias, rather than enhancing truth-seeking behavior. In this study, we propose a systematic evaluation framework for belief entrenchment in LLM reasoning by leveraging the Martingale property from Bayesian statistics. This property implies that, under rational belief updating, the expected value of future beliefs should remain equal to the current belief, i.e., belief updates are unpredictable from the current belief. We propose the unsupervised, regression-based Martingale Score to measure violations of this property, which signal deviation from the Bayesian ability of updating on new evidence. In open-ended problem domains including event forecasting, value-laden questions, and academic paper review, we find such violations to be widespread across models and setups, where the current belief positively predicts future belief updates, a phenomenon which we term belief entrenchment. We identify the models, reasoning techniques, and domains more prone to belief entrenchment. Finally, we validate the Martingale Score by showing that it predicts ground-truth accuracy on problem domains where ground truth labels are available. This indicates that, while designed as an unsupervised metric that operates even in domains without access to ground truth, the Martingale Score is a useful proxy of the truth-seeking ability of a reasoning process.
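
The regression idea in this abstract can be illustrated with a minimal sketch. It assumes beliefs are probabilities recorded before and after one reasoning step, and it uses a plain least-squares slope of the update on the prior belief. The function name `martingale_slope` and this exact regression form are assumptions for illustration; the abstract does not give the paper's specification.

```python
from statistics import mean


def martingale_slope(beliefs_before, beliefs_after):
    """Least-squares slope of the belief update (after - before) on the prior belief.

    Under the martingale property the update is unpredictable from the
    prior belief, so the slope should be near zero. A positive slope means
    high priors tend to rise further, i.e. entrenchment.
    """
    if len(beliefs_before) != len(beliefs_after) or len(beliefs_before) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    updates = [after - before for before, after in zip(beliefs_before, beliefs_after)]
    mean_b = mean(beliefs_before)
    mean_u = mean(updates)
    cov = sum((b - mean_b) * (u - mean_u) for b, u in zip(beliefs_before, updates))
    var = sum((b - mean_b) ** 2 for b in beliefs_before)
    if var == 0:
        raise ValueError("prior beliefs have zero variance")
    return cov / var
```

For example, priors of 0.2, 0.4, 0.6, 0.8 that move to 0.1, 0.35, 0.65, 0.9 drift away from 0.5 in the direction they already leaned, which gives a positive slope. A uniform shift of every belief gives a slope of about zero.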

[NLP-6] Fast-Decoding Diffusion Language Models via Progress-Aware Confidence Schedules

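[Quick read]: This paper addresses the efficiency bottleneck that slow iterative sampling imposes on Diffusion Large Language Models (dLLMs) in practice. The core solution is SchED, a training-free, model-agnostic early-exit algorithm that aggregates logit margins over the full span and stops decoding once prediction confidence reaches a smooth, progress-dependent threshold. The key idea is to turn the stabilization of the model's predictive entropy into computational savings, substantially speeding up dLLM inference while keeping accuracy high: 3.8-4.0× speedups with 99.8%-100% of baseline performance on instruction-tuned models, and up to 2.34× on base models. Under a conservative quality-speed metric (QPS, γ=4) it outperforms existing confidence-based early-exit methods and remains robust on long-form generation.

Link: https://arxiv.org/abs/2512.02892
Authors: Amr Mohamed,Yang Zhang,Michalis Vazirgiannis,Guokan Shang
Affiliations: MBZUAI; Ecole Polytechnique
Categories: Computation and Language (cs.CL)
Comments: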

Abstract:Diffusion large language models (dLLMs) offer a promising alternative to autoregressive models, but their practical utility is severely hampered by slow, iterative sampling. We present SchED, a training-free, model-agnostic early-exit algorithm that aggregates full-span logit margins and halts decoding once a smooth, progress-dependent confidence threshold is met. We evaluated SchED on two dLLM families (Dream and LLaDA), in base and instruction-tuned variants across ten benchmarks spanning downstream tasks including multiple-choice question answering (MCQ), math, long-form QA/summarization, and translation. SchED delivers large, stable accelerations: on instruction-tuned models, it achieves 3.8-4.0× speedups while retaining 99.8-100% of the baseline score on average. On base models, SchED yields consistent speedup gains with 99.1-100% performance retention, with up to 2.34× under more aggressive settings. Using a conservative speed metric that heavily penalizes quality loss (QPS, γ=4), we show that SchED is robust and clearly outperforms prior confidence-based early-exit methods, which break down on long-form generation. An entropy analysis of the model’s token predictions reveals that instruction tuning speeds up the decay of predictive entropy. By turning genuine confidence stabilization into computational savings, SchED makes dLLM decoding substantially more efficient.
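
The early-exit rule described above can be sketched roughly as follows. The abstract specifies only that full-span logit margins are aggregated and compared with a smooth, progress-dependent threshold. The mean aggregation, the cosine schedule, and the values `tau_start` and `tau_end` are assumptions made for this illustration and are not taken from the paper.

```python
import math


def logit_margin(logits):
    """Gap between the largest and second-largest logit at one position."""
    top1, top2 = sorted(logits, reverse=True)[:2]
    return top1 - top2


def aggregate_margin(span_logits):
    """Mean top-1/top-2 margin over every position in the answer span (assumed aggregation)."""
    return sum(logit_margin(row) for row in span_logits) / len(span_logits)


def threshold(progress, tau_start=6.0, tau_end=2.0):
    """Hypothetical cosine schedule: strict early, relaxed late. progress is in [0, 1]."""
    weight = 0.5 * (1.0 + math.cos(math.pi * progress))
    return tau_end + (tau_start - tau_end) * weight


def should_exit(span_logits, step, total_steps):
    """True when the aggregated margin clears the threshold for the current progress."""
    return aggregate_margin(span_logits) >= threshold(step / total_steps)
```

With this schedule, a span whose mean margin is 3.5 keeps decoding at step 1 of 10, where the threshold is still near 6, and exits at step 9 of 10, where the threshold has fallen close to 2.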

[NLP-7] OptPO: Optimal Rollout Allocation for Test-time Policy Optimization

[Quick read]: This paper addresses the computational redundancy in Test-time Policy Optimization (TPO) for Large Language Models (LLMs), which arises from estimating rewards with fixed-budget majority voting. The key to the solution is Optimal Rollout Allocation for Test-time Policy Optimization (OptPO), a framework based on a Bayesian sequential probability ratio test. It stops sampling dynamically once posterior confidence reaches a threshold and uses the retained rollouts for on-policy updates, so it integrates with algorithms such as PPO or GRPO without ground-truth labels, significantly reducing rollout overhead while preserving or improving accuracy.

Link: https://arxiv.org/abs/2512.02882
Authors: Youkang Wang,Jian Wang,Rubing Chen,Tianyi Zeng,Xiao-Yong Wei,Qing Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work in Progress

Abstract:Test-time policy optimization enables large language models (LLMs) to adapt to distribution shifts by leveraging feedback from self-generated rollouts. However, existing methods rely on fixed-budget majority voting to estimate rewards, incurring substantial computational redundancy. We propose Optimal Rollout Allocation for Test-time Policy Optimization (OptPO), a principled framework that adaptively allocates inference budgets. By formulating the voting process as a Bayesian sequential probability ratio test, OptPO dynamically halts sampling once the posterior confidence in a consensus answer exceeds a specified threshold. Crucially, it utilizes the retained rollouts for on-policy updates, seamlessly integrating with algorithms like PPO or GRPO without requiring ground-truth labels. Across diverse reasoning benchmarks, OptPO significantly reduces rollout overhead compared to fixed-sample baselines while preserving or improving accuracy. By unifying statistically optimal stopping with test-time learning, OptPO offers a computationally efficient paradigm for test-time adaptation. The source code will be open upon acceptance at https://open-upon-acceptance.
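
To make the adaptive-stopping idea concrete, here is a simplified stand-in and not the paper's test statistic. It assumes a Beta(1, 1) prior on the chance that a vote between the two leading answers goes to the leader, and stops once the posterior probability that this chance exceeds 0.5 reaches a threshold. The names `leader_confidence` and `adaptive_vote` are hypothetical, and the on-policy update that OptPO performs with the retained rollouts is not shown.

```python
from collections import Counter
from math import comb


def leader_confidence(leader_votes, runner_up_votes):
    """P(p > 0.5) under a Beta(1, 1) prior, where p is the chance that a vote
    between the top two answers goes to the leader (closed form for integer counts)."""
    n = leader_votes + runner_up_votes + 1
    return sum(comb(n, j) for j in range(leader_votes + 1)) / 2 ** n


def adaptive_vote(sample_answer, threshold=0.95, max_rollouts=32):
    """Draw rollouts one at a time and stop once the leader is confident enough.

    sample_answer is a zero-argument callable returning one rollout's final answer.
    Returns (consensus_answer, rollouts_used).
    """
    counts = Counter()
    for used in range(1, max_rollouts + 1):
        counts[sample_answer()] += 1
        ranked = counts.most_common(2)
        leader, leader_votes = ranked[0]
        runner_up_votes = ranked[1][1] if len(ranked) > 1 else 0
        if leader_confidence(leader_votes, runner_up_votes) >= threshold:
            return leader, used
    # Budget exhausted: fall back to the plain majority answer.
    return counts.most_common(1)[0][0], max_rollouts
```

When every rollout agrees, this rule stops after four samples at a 0.95 threshold instead of spending the full budget, which is the kind of saving over fixed-budget voting that the abstract describes.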

[NLP-8] Think in Parallel, Answer as One: Logit Averaging for Open-Ended Reasoning

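[Quick read]: This paper addresses the failure of conventional majority voting on open-ended reasoning tasks, such as code generation and web-based deep research, where a "majority" over complete solutions cannot be defined. The core solution is ThinkMerge, a training-free, plug-and-play decoding strategy that runs K parallel reasoning traces and averages their next-token logits at synchronization points to produce a single coherent output. Rather than voting over complete outputs, ThinkMerge merges information from the traces during decoding. It matches or outperforms existing methods on several benchmarks, with notable gains on open-ended coding tasks such as LiveCodeBench.

Link: https://arxiv.org/abs/2512.02874
Authors: Haonan Wang,Chao Du,Kenji Kawaguchi,Tianyu Pang
Affiliations: National University of Singapore; Sea AI Lab
Categories: Computation and Language (cs.CL)
Comments: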

Abstract:Majority voting has proven effective for close-ended question answering by aggregating parallel reasoning traces. However, it is not directly applicable to open-ended reasoning, such as code generation and web-based deep research, where a “majority” over complete solutions is ill-defined. We introduce ThinkMerge, a training-free, plug-and-play decoding strategy that runs K parallel reasoning traces and averages their next-token logits at synchronization points to produce a single coherent output. ThinkMerge integrates seamlessly with vLLM/SGLang and remains compatible with standard decoding techniques such as Top-p/Top-k. Empirically, it matches or surpasses majority voting on AIME and GPQA, while delivering consistent gains on open-ended coding tasks: on LiveCodeBench (hard), pass@1 improves by +8.28% for DeepCoder-14B-Preview and +7.58% for Qwen3-8B. Beyond code, we further show that ThinkMerge improves web-based deep-research agents (e.g., WebSailor-7B/32B) across GAIA, BrowseComp-en/zh, and XbenchDeepSearch. These results demonstrate that parallel test-time scaling can benefit open-ended reasoning without relying on voting over complete outputs.
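
A single merge step of the kind described above might look like the following sketch. It assumes each of the K traces exposes a next-token logit vector over the same vocabulary and that the merged token is chosen greedily. How synchronization points are placed and how the traces continue afterwards are not stated in the abstract, so they are left out, and the function names are hypothetical.

```python
def average_logits(per_trace_logits):
    """Element-wise mean of K logit vectors over a shared vocabulary."""
    k = len(per_trace_logits)
    vocab = len(per_trace_logits[0])
    if any(len(row) != vocab for row in per_trace_logits):
        raise ValueError("all traces must score the same vocabulary")
    return [sum(row[i] for row in per_trace_logits) / k for i in range(vocab)]


def merged_next_token(per_trace_logits):
    """Greedy token id chosen from the averaged logits (one assumed decoding choice)."""
    merged = average_logits(per_trace_logits)
    return max(range(len(merged)), key=merged.__getitem__)
```

In the case of three traces whose individual top tokens all differ, averaging can still settle on one token that every trace rates reasonably well. That is how a single output is formed even when no majority over complete answers exists.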

[NLP-9] Bangla Hate Speech Classification with Fine-tuned Transformer Models

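[Quick read]: This paper addresses hate speech recognition in low-resource languages, Bangla in particular, where the core challenges are scarce datasets, orthographic heterogeneity, and linguistic variety. The authors systematically evaluate traditional machine learning methods (logistic regression, random forest, decision tree, and support vector machine) against several Transformer-based pre-trained language models (DistilBERT, BanglaBERT, m-BERT, and XLM-RoBERTa) on the BLP 2025 Shared Task. The key finding is that BanglaBERT, despite its smaller size, performs best on both subtasks and outperforms m-BERT and XLM-RoBERTa, which highlights the importance of language-specific pre-training for hate speech detection in low-resource languages.

Link: https://arxiv.org/abs/2512.02845
Authors: Yalda Keivan Jafari,Krishno Dey
Affiliations: University of New Brunswick
Categories: Computation and Language (cs.CL)
Comments: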

Abstract:Hate speech recognition in low-resource languages remains a difficult problem due to insufficient datasets, orthographic heterogeneity, and linguistic variety. Bangla is spoken by more than 230 million people of Bangladesh and India (West Bengal). Despite the growing need for automated moderation on social media platforms, Bangla is significantly underrepresented in computational resources. In this work, we study Subtask 1A and Subtask 1B of the BLP 2025 Shared Task on hate speech detection. We reproduce the official baselines (e.g., Majority, Random, Support Vector Machine) and also produce and consider Logistic Regression, Random Forest, and Decision Tree as baseline methods. We also utilized transformer-based models such as DistilBERT, BanglaBERT, m-BERT, and XLM-RoBERTa for hate speech classification. All the transformer-based models outperformed baseline methods for the subtasks, except for DistilBERT. Among the transformer-based models, BanglaBERT produces the best performance for both subtasks. Despite being smaller in size, BanglaBERT outperforms both m-BERT and XLM-RoBERTa, which suggests language-specific pre-training is very important. Our results highlight the potential and need for pre-trained language models for the low-resource Bangla language.

[NLP-10] Cross-Lingual Prompt Steerability: Towards Accurate and Robust LLM Behavior across Languages

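[Quick read]: This paper addresses the limited accuracy and robustness of Large Language Models (LLMs) in multilingual settings, particularly when no single system prompt works reliably across languages. The core challenge is designing a lightweight, efficient prompting mechanism that yields consistently high-quality output in every language. The key to the solution is a unified four-dimensional evaluation framework, used in large-scale experiments to identify prompt components that significantly affect multilingual behavior (such as Chain-of-Thought (CoT), emotion, and scenario), followed by an automated prompt optimization framework that improves all metrics on multilingual tasks by 5-10%. The study also finds that better-performing prompts induce more structured, consistent reasoning patterns and reduce unnecessary language switching.

Link: https://arxiv.org/abs/2512.02841
Authors: Lechen Zhang,Yusheng Zhou,Tolga Ergen,Lajanugen Logeswaran,Moontae Lee,David Jurgens
Affiliations: University of Illinois Urbana-Champaign; University of Michigan; University of Illinois Chicago; LG AI Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: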

Abstract:System prompts provide a lightweight yet powerful mechanism for conditioning large language models (LLMs) at inference time. While prior work has focused on English-only settings, real-world deployments benefit from having a single prompt to operate reliably across languages. This paper presents a comprehensive study of how different system prompts steer models toward accurate and robust cross-lingual behavior. We propose a unified four-dimensional evaluation framework to assess system prompts in multilingual environments. Through large-scale experiments on five languages, three LLMs, and three benchmarks, we uncover that certain prompt components, such as CoT, emotion, and scenario, correlate with robust multilingual behavior. We develop a prompt optimization framework for multilingual settings and show it can automatically discover prompts that improve all metrics by 5-10%. Finally, we analyze over 10 million reasoning units and find that more performant system prompts induce more structured and consistent reasoning patterns, while reducing unnecessary language-switching. Together, we highlight system prompt optimization as a scalable path to accurate and robust multilingual LLM behavior.

[NLP-11] promptolution: A Unified Modular Framework for Prompt Optimization

[Quick read]: This paper addresses the difficulty of adopting prompt optimization for Large Language Models (LLMs) in practice: existing implementations mostly depend on unmaintained, isolated research codebases, which limits reuse and extension. The key to the solution is promptolution, a unified, modular, open-source framework that integrates all components needed for prompt optimization into one extensible system, supports multiple contemporary discrete prompt optimizers, and is agnostic to the underlying LLM implementation, giving researchers and practitioners an efficient, flexible, and sustainable toolchain.

Link: https://arxiv.org/abs/2512.02840
Authors: Tom Zehle,Timo Heiß,Moritz Schlager,Matthias Aßenmacher,Matthias Feurer
Affiliations: ELLIS Institute; University of Freiburg; LMU Munich; Munich Center for Machine Learning (MCML); Technical University of Munich; TU Dortmund University; Lamarr Institute for Machine Learning and Artificial Intelligence
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Prompt optimization has become crucial for enhancing the performance of large language models (LLMs) across a broad range of tasks. Although many research papers show its effectiveness, practical adoption is hindered as existing implementations are often tied to unmaintained and isolated research codebases. To address this, we introduce promptolution, a unified and modular open-source framework that provides all components required for prompt optimization within a single extensible system for both practitioners and researchers. It integrates multiple contemporary discrete prompt optimizers while remaining agnostic to the underlying LLM implementation.

[NLP-12] ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning

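[Quick read]: This paper addresses reasoning-centric video object segmentation, where existing methods collapse complex dynamics, causality, and temporal interactions into implicit embedding representations, leaving the reasoning chain opaque and hard to handle. The key to the solution is the ReVSeg framework, which explicitly decomposes reasoning into three interpretable operations (semantics interpretation, temporal evidence selection, and spatial grounding) and executes them through the native interface of pretrained vision language models (VLMs). It also applies reinforcement learning to optimize the multi-step reasoning chain, letting the model refine its decision quality from outcome-driven signals and thereby combining improved performance with interpretable reasoning trajectories.

Link: https://arxiv.org/abs/2512.02835
Authors: Yifan Li,Yingda Yin,Lingting Zhu,Weikai Chen,Shengju Qian,Xin Wang,Yanwei Fu
Affiliations: Fudan University; Shanghai Innovation Institute; LIGHTSPEED
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: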

Abstract:Reasoning-centric video object segmentation is an inherently complex task: the query often refers to dynamics, causality, and temporal interactions, rather than static appearances. Yet existing solutions generally collapse these factors into simplified reasoning with latent embeddings, rendering the reasoning chain opaque and essentially intractable. We therefore adopt an explicit decomposition perspective and introduce ReVSeg, which executes reasoning as sequential decisions in the native interface of pretrained vision language models (VLMs). Rather than folding all reasoning into a single-step prediction, ReVSeg executes three explicit operations – semantics interpretation, temporal evidence selection, and spatial grounding – aligning pretrained capabilities. We further employ reinforcement learning to optimize the multi-step reasoning chain, enabling the model to self-refine its decision quality from outcome-driven signals. Experimental results demonstrate that ReVSeg attains state-of-the-art performances on standard video object segmentation benchmarks and yields interpretable reasoning trajectories. Project page is available at this https URL .

[NLP-13] BOOM: Beyond Only One Modality KIT's Multimodal Multilingual Lecture Companion

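[Quick read]: This paper addresses the challenge of localizing online learning content in a globalized educational setting, specifically how to preserve semantic integrity and synchronization when translating multimodal lecture material (audio, text, and visual slides). The key to the solution is BOOM, an end-to-end multimodal multilingual lecture companion that jointly translates lecture audio and slides and produces synchronized output in three modalities: translated text, localized slides that preserve the original visual structure, and synthesized speech. The approach gives students a complete learning experience in their native language, and experiments show that slide-aware transcripts also improve downstream tasks such as summarization and question answering.

Link: https://arxiv.org/abs/2512.02817
Authors: Sai Koneru,Fabian Retkowski,Christian Huber,Lukas Hilgert,Seymanur Akti,Enes Yavuz Ugan,Alexander Waibel,Jan Niehues
Affiliations: Karlsruhe Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: Under review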

Abstract:The globalization of education and rapid growth of online learning have made localizing educational content a critical challenge. Lecture materials are inherently multimodal, combining spoken audio with visual slides, which requires systems capable of processing multiple input modalities. To provide an accessible and complete learning experience, translations must preserve all modalities: text for reading, slides for visual understanding, and speech for auditory learning. We present BOOM, a multimodal multilingual lecture companion that jointly translates lecture audio and slides to produce synchronized outputs across three modalities: translated text, localized slides with preserved visual elements, and synthesized speech. This end-to-end approach enables students to access lectures in their native language while aiming to preserve the original content in its entirety. Our experiments demonstrate that slide-aware transcripts also yield cascading benefits for downstream tasks such as summarization and question answering. We release our Slide Translation code at this https URL and integrate it in Lecture Translator at this https URL. All released code and models are licensed under the MIT License.

[NLP-14] A benchmark dataset for evaluating Syndrome Differentiation and Treatment in large language models

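[Quick read]: This paper addresses the lack of evaluation of the clinical application capabilities of Large Language Models (LLMs) in Traditional Chinese Medicine (TCM). TCM's "Syndrome Differentiation and Treatment" (SDT) is individualized, holistic, and diverse, yet existing benchmarks are mostly limited to knowledge question answering or syndrome differentiation accuracy and lack a systematic assessment of treatment decision-making. The key to the solution is TCM-BEST4SDT, a comprehensive, clinical case-based benchmark led by TCM experts. It covers four tasks (TCM Basic Knowledge, Medical Ethics, LLM Content Safety, and SDT), introduces a purpose-built reward model to quantify prescription-syndrome congruence, and combines three evaluation mechanisms (selected-response evaluation, judge model evaluation, and reward model evaluation), enabling a multi-dimensional, specialized assessment of LLMs' SDT capability in realistic clinical scenarios.

Link: https://arxiv.org/abs/2512.02816
Authors: Kunning Li,Jianbin Guo,Zhaoyang Shang,Yiqing Liu,Hongmin Du,Lingling Liu,Yuping Zhao,Lifeng Dong
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: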

Abstract:The emergence of Large Language Models (LLMs) within the Traditional Chinese Medicine (TCM) domain presents an urgent need to assess their clinical application capabilities. However, such evaluations are challenged by the individualized, holistic, and diverse nature of TCM’s “Syndrome Differentiation and Treatment” (SDT). Existing benchmarks are confined to knowledge-based question-answering or the accuracy of syndrome differentiation, often neglecting assessment of treatment decision-making. Here, we propose a comprehensive, clinical case-based benchmark spearheaded by TCM experts, and a specialized reward model employed to quantify prescription-syndrome congruence. Data annotation follows a rigorous pipeline. This benchmark, designated TCM-BEST4SDT, encompasses four tasks, including TCM Basic Knowledge, Medical Ethics, LLM Content Safety, and SDT. The evaluation framework integrates three mechanisms, namely selected-response evaluation, judge model evaluation, and reward model evaluation. The effectiveness of TCM-BEST4SDT was corroborated through experiments on 15 mainstream LLMs, spanning both general and TCM domains. To foster the development of intelligent TCM research, TCM-BEST4SDT is now publicly available.

[NLP-15] SR-GRPO: Stable Rank as an Intrinsic Geometric Reward for Large Language Model Alignment

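[Quick read]: This paper addresses the limitations of relying on external supervision to align Large Language Models (LLMs) with human preferences: human annotations are scarce and subjective, reward models are vulnerable to reward hacking, and self-evaluation methods suffer from prompt sensitivity and bias. The key to the solution is stable rank, an intrinsic, annotation-free quality signal that measures the effective dimensionality of the representation space as the ratio of the total variance of the hidden states to the variance along the dominant direction, and thereby captures the quality of generated content. Building on it, the authors design Stable Rank Group Relative Policy Optimization (SR-GRPO), which uses stable rank as the reward signal in reinforcement learning and achieves alignment without external supervision, improving STEM and mathematical reasoning by 10% and 19%, respectively, and outperforming methods based on learned reward models and self-evaluation.

Link: https://arxiv.org/abs/2512.02807
Authors: Yixuan Tang,Yi Yang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: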

Abstract:Aligning Large Language Models (LLMs) with human preferences typically relies on external supervision, which faces critical limitations: human annotations are scarce and subjective, reward models are vulnerable to reward hacking, and self-evaluation methods suffer from prompt sensitivity and biases. In this work, we propose stable rank, an intrinsic, annotation-free quality signal derived from model representations. Stable rank measures the effective dimensionality of hidden states by computing the ratio of total variance to dominant-direction variance, capturing quality through how information distributes across representation dimensions. Empirically, stable rank achieves 84.04% accuracy on RewardBench and improves task accuracy by an average of 11.3 percentage points over greedy decoding via Best-of-N sampling. Leveraging this insight, we introduce Stable Rank Group Relative Policy Optimization (SR-GRPO), which uses stable rank as a reward signal for reinforcement learning. Without external supervision, SR-GRPO improves Qwen2.5-1.5B-Instruct by 10% on STEM and 19% on mathematical reasoning, outperforming both learned reward models and self-evaluation baselines. Our findings demonstrate that quality signals can be extracted from internal model geometry, offering a path toward scalable alignment without external supervision.
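
The ratio of total variance to dominant-direction variance matches the standard definition of stable rank: the squared Frobenius norm divided by the squared largest singular value. A minimal pure-Python sketch follows. It treats the hidden states as a list of rows (one row per token) and estimates the largest singular value by power iteration. Whether the paper centers or normalizes the hidden states first is not stated in the abstract, so no preprocessing is applied here, and the function name and defaults are assumptions.

```python
import random


def stable_rank(hidden, iters=100, seed=0):
    """Stable rank ||H||_F^2 / ||H||_2^2 of a matrix given as a list of rows."""
    dim = len(hidden[0])
    frob_sq = sum(x * x for row in hidden for x in row)
    if frob_sq == 0:
        raise ValueError("hidden-state matrix is all zeros")
    # Power iteration on H^T H converges to the dominant right singular vector.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(iters):
        hv = [sum(h * x for h, x in zip(row, v)) for row in hidden]              # H v
        w = [sum(row[j] * s for row, s in zip(hidden, hv)) for j in range(dim)]  # H^T (H v)
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            raise ValueError("start vector fell in the null space; try another seed")
        v = [x / norm for x in w]
    # With unit-norm v, ||H v||^2 estimates the largest squared singular value.
    top_sq = sum(sum(h * x for h, x in zip(row, v)) ** 2 for row in hidden)
    return frob_sq / top_sq
```

A matrix whose rows all point in one direction has stable rank 1, while a 2×2 identity matrix has stable rank 2. A reward built on this quantity therefore favors hidden states that spread information across more directions.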

[NLP-16] TriLex: A Framework for Multilingual Sentiment Analysis in Low-Resource South African Languages

Link: https://arxiv.org/abs/2512.02799
Authors: Mike Nkongolo, Hilton Vorster, Josh Warren, Trevor Naick, Deandre Vanmali, Masana Mashapha, Luke Brand, Alyssa Fernandes, Janco Calitz, Sibusiso Makhoba
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

[NLP-17] Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension

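Quick read: This paper addresses the performance drop that models for Dialogue-Based Generalized Referring Expressions Comprehension (GREC) suffer under distribution shift between training and evaluation domains, a generalization gap made worse by the scarcity of annotated dialogue grounding data. The key to its solution is a three-tier data-synthesis method that balances realism and controllability to produce scalable supervision for dialogue-conditioned grounding. Fine-tuning on the synthesized data yields consistent, substantial improvements over prior approaches across standard evaluation metrics.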

Link: https://arxiv.org/abs/2512.02791
Authors: Juexi Shao, Siyou Li, Yujian Gan, Chris Madge, Vanja Karan, Massimo Poesio
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Dialogue-Based Generalized Referring Expressions Comprehension (GREC) requires models to ground the expression and unlimited targets in complex visual scenes while resolving coreference across a long dialogue context. However, existing systems struggle under distribution shift between training and evaluation domains, a gap exacerbated by the scarcity of annotated dialogue grounding data. We address this challenge with a three-tier data-synthesis method that balances realism and controllability to produce scalable supervision for dialogue-conditioned grounding. Fine-tuning on the synthesized data yields consistent, substantial improvements over prior approaches across standard evaluation metrics.

[NLP-18] Towards Unification of Hallucination Detection and Fact Verification for Large Language Models

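Quick read: This paper addresses hallucination in Large Language Models (LLMs): generated content that reads as fluent but is factually incorrect, which undermines user trust and limits real-world adoption. Current research falls into two paradigms, model-centric Hallucination Detection (HD) and text-centric Fact Verification (FV). They share the same goal but have evolved in isolation, so their evaluation protocols are inconsistent and their methods are hard to compare directly. The key to the solution is UniFact, a unified evaluation framework that dynamically generates model outputs and corresponding factuality labels, enabling direct, instance-level comparison between HD and FV. Experiments show that neither paradigm is universally superior, that HD and FV capture complementary facets of factual errors, and that hybrid approaches integrating both consistently achieve the best performance, which motivates unifying the two lines of research.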

Link: https://arxiv.org/abs/2512.02772
Authors: Weihang Su, Jianming Long, Changyue Wang, Shiyu Lin, Jingyan Xu, Ziyi Ye, Qingyao Ai, Yiqun Liu
Affiliations: Tsinghua University; Fudan University
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Large Language Models (LLMs) frequently exhibit hallucinations, generating content that appears fluent and coherent but is factually incorrect. Such errors undermine trust and hinder their adoption in real-world applications. To address this challenge, two distinct research paradigms have emerged: model-centric Hallucination Detection (HD) and text-centric Fact Verification (FV). Despite sharing the same goal, these paradigms have evolved in isolation, using distinct assumptions, datasets, and evaluation protocols. This separation has created a research schism that hinders their collective progress. In this work, we take a decisive step toward bridging this divide. We introduce UniFact, a unified evaluation framework that enables direct, instance-level comparison between FV and HD by dynamically generating model outputs and corresponding factuality labels. Through large-scale experiments across multiple LLM families and detection methods, we reveal three key findings: (1) No paradigm is universally superior; (2) HD and FV capture complementary facets of factual errors; and (3) hybrid approaches that integrate both methods consistently achieve state-of-the-art performance. Beyond benchmarking, we provide the first in-depth analysis of why FV and HD diverged, as well as empirical evidence supporting the need for their unification. The comprehensive experimental results call for a new, integrated research agenda toward unifying Hallucination Detection and Fact Verification in LLMs. We have open-sourced all the code, data, and baseline implementation at: this https URL

[NLP-19] PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models

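Quick read: This paper addresses the difficulty of replicating, deploying, and comparing Parameter-Efficient Fine-Tuning (PEFT) methods: many newly proposed methods lack a standardized implementation framework, so their results are hard to verify and compare. The key to its solution is PEFT-Factory, a unified framework for fine-tuning LLMs with both off-the-shelf and custom PEFT methods. Its modular design supports extensibility, and it natively provides 19 representative PEFT methods, 27 classification and text generation datasets covering 12 tasks, and both standard and PEFT-specific evaluation metrics. This gives PEFT research a controlled, stable, and reproducible environment for benchmarking and comparison.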

Link: https://arxiv.org/abs/2512.02764
Authors: Robert Belanec, Ivan Srba, Maria Bielikova
Affiliations: Brno University of Technology; Kempelen Institute of Intelligent Technologies
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Parameter-Efficient Fine-Tuning (PEFT) methods address the increasing size of Large Language Models (LLMs). Currently, many newly introduced PEFT methods are challenging to replicate, deploy, or compare with one another. To address this, we introduce PEFT-Factory, a unified framework for efficient fine-tuning LLMs using both off-the-shelf and custom PEFT methods. While its modular design supports extensibility, it natively provides a representative set of 19 PEFT methods, 27 classification and text generation datasets addressing 12 tasks, and both standard and PEFT-specific evaluation metrics. As a result, PEFT-Factory provides a ready-to-use, controlled, and stable environment, improving replicability and benchmarking of PEFT methods. PEFT-Factory is a downstream framework that originates from the popular LLaMA-Factory, and is publicly available at this https URL

[NLP-20] SurveyEval: Towards Comprehensive Evaluation of LLM-Generated Academic Surveys

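Quick read: This paper addresses the challenge of evaluating LLM-based automatic survey generation systems, which write academic literature surveys by combining retrieval, organization, and content synthesis. The key to its solution is SurveyEval, a comprehensive benchmark that evaluates automatically generated surveys along three dimensions: overall quality, outline coherence, and reference accuracy. The benchmark spans 7 subjects and augments the LLM-as-a-Judge framework with human references to strengthen evaluation-human alignment, enabling more reliable and scalable evaluation of such systems.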

Link: https://arxiv.org/abs/2512.02763
Authors: Jiahao Zhao, Shuaixing Zhang, Nan Xu, Lei Wang
Affiliations: Beijing Wenge Technology Co., Ltd.; Institute of Automation, Chinese Academy of Sciences
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:LLM-based automatic survey systems are transforming how users acquire information from the web by integrating retrieval, organization, and content synthesis into end-to-end generation pipelines. While recent works focus on developing new generation pipelines, how to evaluate such complex systems remains a significant challenge. To this end, we introduce SurveyEval, a comprehensive benchmark that evaluates automatically generated surveys across three dimensions: overall quality, outline coherence, and reference accuracy. We extend the evaluation across 7 subjects and augment the LLM-as-a-Judge framework with human references to strengthen evaluation-human alignment. Evaluation results show that while general long-text or paper-writing systems tend to produce lower-quality surveys, specialized survey-generation systems are able to deliver substantially higher-quality results. We envision SurveyEval as a scalable testbed to understand and improve automatic survey systems across diverse subjects and evaluation criteria.

[NLP-21] Probabilistic energy profiler for statically typed JVM-based programming languages

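Quick read: This paper addresses two limitations of existing software energy measurement approaches: they mostly estimate only the CPU's energy consumption, and they rely on point estimates, which disregards other hardware effects and limits statistical reasoning and explainability. The key to its solution is a new methodology based on Bayesian statistics. It measures the energy consumption of Java Virtual Machine (JVM) Bytecode patterns and builds a statistical model with four factors: three obtained statically from the code (data size, data type, and operation) and one describing the hardware platform (device). The model predicts energy consumption as statistical distributions and supports analysis of individual factors. Experiments show that its predictions for unseen programs closely follow their real energy consumption.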

Link: https://arxiv.org/abs/2512.02738
Authors: Joel Nyholm, Wojciech Mostowski, Christoph Reichenbach
Affiliations: Halmstad University; Lund University
Categories: Programming Languages (cs.PL); Computation and Language (cs.CL)
Comments:

Abstract:Energy consumption is a growing concern in several fields, from mobile devices to large data centers. Developers need detailed data on the energy consumption of their software to mitigate consumption issues. Previous approaches have a broader focus, such as on specific functions or programs, rather than source code statements. They primarily focus on estimating the CPU’s energy consumption using point estimates, thereby disregarding other hardware effects and limiting their use for statistical reasoning and explainability. We developed a novel methodology to address the limitations of measuring only the CPU’s consumption and using point estimates, focusing on predicting the energy usage of statically typed JVM-based programming languages, such as Java and Scala. We measure the energy consumption of Bytecode patterns, the translation from the programming language’s source code statement to their Java Bytecode representation. With the energy measurements, we construct a statistical model using Bayesian statistics, which allows us to predict the energy consumption through statistical distributions and analyze individual factors. The model includes three factors we obtain statically from the code: data size, data type, operation, and one factor about the hardware platform the code executes on: device. To validate our methodology, we implemented it for Java and evaluated its energy predictions on unseen programs. We observe that all four factors are influential, notably that two devices of the same model may differ in energy consumption and that the operations and data types cause consumption differences. The experiments also show that the energy prediction of programs closely follows the program’s real energy consumption, validating our approach. Our work presents a methodology for constructing an energy model that future work, such as verification tools, can use for their energy estimates.

[NLP-22] Emergent Bayesian Behaviour and Optimal Cue Combination in LLMs

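Quick read: This paper asks whether Large Language Models (LLMs) use near-optimal Bayesian strategies, as humans do, when integrating multimodal information, that is, whether they integrate noisy signals near-optimally without explicit training or instruction. Its core approach adopts the psychophysics paradigm and builds a behavioural benchmark, BayesBench, with four magnitude estimation tasks inspired by classic psychophysics (length, location, distance, and duration) over text and image. Controlled ablations of noise, context, and instruction prompts quantify performance, behaviour, and efficiency in multimodal cue combination. The key innovation is a Bayesian Consistency Score that detects Bayes-consistent behavioural shifts even when accuracy saturates, revealing a dissociation between capability and strategy. For example, GPT-5 Mini achieves perfect text accuracy but fails to integrate visual cues efficiently, which suggests that accuracy-centric metrics can mask brittle uncertainty handling. The findings offer a new evaluation dimension for future multimodal architecture designs.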

Link: https://arxiv.org/abs/2512.02719
Authors: Julian Ma, Jun Wang, Zafeirios Fountas
Affiliations: Huawei Noah’s Ark Lab; AI Centre, Department of Computer Science, University College London
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:Large language models (LLMs) excel at explicit reasoning, but their implicit computational strategies remain underexplored. Decades of psychophysics research show that humans intuitively process and integrate noisy signals using near-optimal Bayesian strategies in perceptual tasks. We ask whether LLMs exhibit similar behaviour and perform optimal multimodal integration without explicit training or instruction. Adopting the psychophysics paradigm, we infer computational principles of LLMs from systematic behavioural studies. We introduce a behavioural benchmark - BayesBench: four magnitude estimation tasks (length, location, distance, and duration) over text and image, inspired by classic psychophysics, and evaluate a diverse set of nine LLMs alongside human judgments for calibration. Through controlled ablations of noise, context, and instruction prompts, we measure performance, behaviour and efficiency in multimodal cue-combination. Beyond accuracy and efficiency metrics, we introduce a Bayesian Consistency Score that detects Bayes-consistent behavioural shifts even when accuracy saturates. Our results show that while capable models often adapt in Bayes-consistent ways, accuracy does not guarantee robustness. Notably, GPT-5 Mini achieves perfect text accuracy but fails to integrate visual cues efficiently. This reveals a critical dissociation between capability and strategy, suggesting accuracy-centric benchmarks may over-index on performance while missing brittle uncertainty handling. These findings reveal emergent principled handling of uncertainty and highlight the correlation between accuracy and Bayesian tendencies. We release our psychophysics benchmark and consistency metric (this https URL) as evaluation tools and to inform future multimodal architecture designs.
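
For context, the classic psychophysics reference point for optimal cue combination is inverse-variance weighting of independent Gaussian cues. The sketch below shows that textbook rule only. It is not the paper's Bayesian Consistency Score, whose definition the abstract does not give, and the function name `combine_cues` and the example numbers are hypothetical.

```python
def combine_cues(estimates, variances):
    """Bayes-optimal fusion of independent Gaussian cues under a flat prior.

    Each cue is weighted by its precision (1 / variance). Returns the fused
    estimate and its variance, which is never larger than the best single cue's.
    """
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    fused = sum(p * x for p, x in zip(precisions, estimates)) / total_precision
    return fused, 1.0 / total_precision


# Hypothetical example: a text cue says 10.0 (variance 1.0) and an image cue
# says 14.0 (variance 3.0); the fused estimate leans toward the reliable cue.
fused_estimate, fused_variance = combine_cues([10.0, 14.0], [1.0, 3.0])
```

An observer that integrates cues efficiently should shift its estimate toward whichever cue is less noisy, which is the kind of behavioural shift a cue-combination benchmark can probe by varying the noise on each modality.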

[NLP-23] CREST: Universal Safety Guardrails Through Cluster-Guided Cross-Lingual Transfer

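Quick read: This paper addresses content safety for Large Language Models (LLMs) in multilingual settings, in particular the lack of effective safety guardrails for low-resource languages. Existing guardrails are tailored mainly to high-resource languages and leave many of the world's speakers uncovered. The key to its solution is CREST (CRoss-lingual Efficient Safety Transfer), a parameter-efficient multilingual safety classification model with only 0.5B parameters. It is trained on only 13 high-resource languages and uses cluster-based cross-lingual transfer to generalize to 100 languages, including unseen high-resource and low-resource ones.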

Link: https://arxiv.org/abs/2512.02711
Authors: Lavish Bansal, Naman Mishra
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 8 Pages, 5 Figures, Under Review

Abstract:Ensuring content safety in large language models (LLMs) is essential for their deployment in real-world applications. However, existing safety guardrails are predominantly tailored for high-resource languages, leaving a significant portion of the world’s population underrepresented who communicate in low-resource languages. To address this, we introduce CREST (CRoss-lingual Efficient Safety Transfer), a parameter-efficient multilingual safety classification model that supports 100 languages with only 0.5B parameters. By training on a strategically chosen subset of only 13 high-resource languages, our model utilizes cluster-based cross-lingual transfer from a few to 100 languages, enabling effective generalization to both unseen high-resource and low-resource languages. This approach addresses the challenge of limited training data in low-resource settings. We conduct comprehensive evaluations across six safety benchmarks to demonstrate that CREST outperforms existing state-of-the-art guardrails of comparable scale and achieves competitive results against models with significantly larger parameter counts (2.5B parameters and above). Our findings highlight the limitations of language-specific guardrails and underscore the importance of developing universal, language-agnostic safety systems that can scale effectively to serve global populations.

[NLP-24] An Empirical Survey of Model Merging Algorithms for Social Bias Mitigation ACL

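Quick read: This paper addresses the problem that Large Language Models (LLMs) inherit and even amplify societal biases from their pre-training corpora, which threatens fairness and social trust. The key to the solution is editing LLM parameters with model merging algorithms to mitigate bias. The study empirically compares seven merging algorithms (Linear, Karcher Mean, SLERP, NuSLERP, TIES, DELLA, and Nearswap) on 13 open-weight models from the GPT, LLaMA, and Qwen families, using three bias datasets (BBQ, BOLD, and HONEST) and downstream tasks from the SuperGLUE benchmark. The results show a trade-off between bias reduction and downstream performance: methods that achieve greater bias mitigation degrade accuracy, particularly on reading comprehension and on commonsense and causal reasoning. Linear, SLERP, and Nearswap consistently reduce bias while maintaining overall performance, and SLERP at moderate interpolation weights is the most balanced choice.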

Link: https://arxiv.org/abs/2512.02689
Authors: Daiki Shirafuji, Tatsuhiko Saito, Yasutomo Kimura
Affiliations: Mitsubishi Electric Corporation; Otaru University of Commerce
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted in PACLIC 2025

Abstract:Large language models (LLMs) are known to inherit and even amplify societal biases present in their pre-training corpora, threatening fairness and social trust. To address this issue, recent work has explored "editing" LLM parameters to mitigate social bias with model merging approaches; however, there is no empirical comparison. In this work, we empirically survey seven algorithms: Linear, Karcher Mean, SLERP, NuSLERP, TIES, DELLA, and Nearswap, applied to 13 open-weight models in the GPT, LLaMA, and Qwen families. We perform a comprehensive evaluation using three bias datasets (BBQ, BOLD, and HONEST) and measure the impact of these techniques on LLM performance in downstream tasks of the SuperGLUE benchmark. We find a trade-off between bias reduction and downstream performance: methods achieving greater bias mitigation degrade accuracy, particularly on tasks requiring reading comprehension and commonsense and causal reasoning. Among the merging algorithms, Linear, SLERP, and Nearswap consistently reduce bias while maintaining overall performance, with SLERP at moderate interpolation weights emerging as the most balanced choice. These results highlight the potential of model merging algorithms for bias mitigation, while indicating that excessive debiasing or inappropriate merging methods may lead to the degradation of important linguistic abilities.
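
To make the Linear and SLERP merges named in this abstract concrete, here is a minimal sketch of the textbook formulas applied to two flattened weight vectors. It assumes plain Python lists as stand-ins for parameter tensors; the function names `lerp` and `slerp`, the `eps` threshold, and the fallback to linear interpolation for near-parallel vectors are illustrative choices, not the exact procedure of the surveyed merging toolkit, which the abstract does not describe.

```python
import math


def lerp(w_a, w_b, t):
    """Linear merge: elementwise (1 - t) * w_a + t * w_b."""
    return [(1.0 - t) * a + t * b for a, b in zip(w_a, w_b)]


def slerp(w_a, w_b, t, eps=1e-8):
    """Spherical linear interpolation between two flattened weight vectors.

    t is the interpolation weight: t = 0 returns w_a and t = 1 returns w_b.
    """
    norm_a = math.sqrt(sum(a * a for a in w_a))
    norm_b = math.sqrt(sum(b * b for b in w_b))
    if norm_a < eps or norm_b < eps:
        return lerp(w_a, w_b, t)  # direction is undefined for a zero vector

    cos_omega = sum(a * b for a, b in zip(w_a, w_b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)                # angle between the two vectors
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        return lerp(w_a, w_b, t)  # (anti)parallel vectors: fall back to linear

    coef_a = math.sin((1.0 - t) * omega) / sin_omega
    coef_b = math.sin(t * omega) / sin_omega
    return [coef_a * a + coef_b * b for a, b in zip(w_a, w_b)]
```

For two orthogonal unit vectors at `t = 0.5`, `lerp` yields a vector of norm about 0.71 while `slerp` stays on the unit sphere. That norm preservation is the usual motivation for preferring SLERP when interpolating between model weights.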

[NLP-25] Input Order Shapes LLM Semantic Alignment in Multi-Document Summarization

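Quick read: This paper asks whether Large Language Models (LLMs) weight all input texts equally when summarizing multiple documents, or whether input order biases the output. The author constructs 40 pro-neutral-con article triplets on abortion, permutes each triplet into six input orders, prompts Gemini 2.5 Flash to generate a neutral overview, and evaluates each summary with ROUGE-L, BERTScore, and SummaC. BERTScore shows a significant primacy effect: summaries are more semantically aligned with the first input document. Position 1 differs significantly from Positions 2 and 3, while Positions 2 and 3 do not differ from each other. This preference for the first document exposes a bias risk for LLM-based summarization of multiple sources.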

Link: https://arxiv.org/abs/2512.02665
Authors: Jing Ma
Affiliations: University of Zurich
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 3 figures, 2 tables

Abstract:Large language models (LLMs) are now used in settings such as Google’s AI Overviews, where they summarize multiple long documents. However, it remains unclear whether they weight all inputs equally. Focusing on abortion-related news, we construct 40 pro-neutral-con article triplets, permute each triplet into six input orders, and prompt Gemini 2.5 Flash to generate a neutral overview. We evaluate each summary against its source articles using ROUGE-L (lexical overlap), BERTScore (semantic similarity), and SummaC (factual consistency). One-way ANOVA reveals a significant primacy effect for BERTScore across all stances, indicating that summaries are more semantically aligned with the first-seen article. Pairwise comparisons further show that Position 1 differs significantly from Positions 2 and 3, while the latter two do not differ from each other, confirming a selective preference for the first document. The findings present risks for applications that rely on LLM-generated overviews and for agentic AI systems, where the steps involving LLMs can disproportionately influence downstream actions.
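
The permutation design in this abstract is easy to reproduce in outline. The sketch below enumerates the six input orders for one triplet and averages a per-document score by input position. The helper names `input_orders` and `mean_score_by_position`, and the `(position, score)` record layout, are assumptions made for illustration; the metric computation itself (ROUGE-L, BERTScore, SummaC) and the ANOVA are left out.

```python
from itertools import permutations
from statistics import mean

STANCES = ("pro", "neutral", "con")


def input_orders(triplet):
    """Return all 3! = 6 orderings of one article triplet.

    `triplet` maps each stance to its article text; each ordering is a tuple
    of (stance, text) pairs in the order they would be placed in the prompt.
    """
    return [tuple((stance, triplet[stance]) for stance in order)
            for order in permutations(STANCES)]


def mean_score_by_position(records):
    """Average a similarity score by the input position of the source article.

    `records` is an iterable of (position, score) pairs, position in {1, 2, 3}.
    """
    by_position = {}
    for position, score in records:
        by_position.setdefault(position, []).append(score)
    return {pos: mean(scores) for pos, scores in sorted(by_position.items())}
```

A primacy effect would show up here as a higher mean for position 1 than for positions 2 and 3, which is then tested for significance rather than read off the means directly.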

[NLP-26] Spoken Conversational Agents with Large Language Models EMNLP2025

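Quick read: This tutorial covers the key technical challenges in moving spoken conversational agents from cascaded automatic speech recognition (ASR) and natural language understanding (NLU) architectures to end-to-end (E2E) and retrieval- and vision-grounded systems. It frames the adaptation of text LLMs to audio, cross-modal alignment, and joint speech-text training; compares design choices such as cascaded versus end-to-end designs, post-ASR correction, and streaming; highlights reproducible baselines; and outlines open problems in privacy, safety, and evaluation, giving a systems-level roadmap for building spoken conversational systems.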

Link: https://arxiv.org/abs/2512.02593
Authors: Chao-Han Huck Yang, Andreas Stolcke, Larry Heck
Affiliations: NVIDIA Research; Uniphore; Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted to EMNLP 2025 Tutorial

Abstract:Spoken conversational agents are converging toward voice-native LLMs. This tutorial distills the path from cascaded ASR/NLU to end-to-end, retrieval- and vision-grounded systems. We frame adaptation of text LLMs to audio, cross-modal alignment, and joint speech-text training; review datasets, metrics, and robustness across accents; and compare design choices (cascaded vs. E2E, post-ASR correction, streaming). We link industrial assistants to current open-domain and task-oriented agents, highlight reproducible baselines, and outline open problems in privacy, safety, and evaluation. Attendees leave with practical recipes and a clear systems-level roadmap.

[NLP-27] From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks AAAI2026

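Quick read: This paper addresses a problem in reinforcement-learning post-training of large language models: existing methods indiscriminately mix positive and negative advantage signals, especially from the early stages, which may lead to ambiguous guidance and limited gains. The key to its solution is CAPO (Curriculum Advantage Policy Optimization), an adaptive curriculum mechanism based on advantage signals. It first performs imitation learning with positive-only advantage samples to establish a robust foundation, then introduces negative signals to cultivate discriminative capability. The method achieves stable and significant improvements on mathematical reasoning tasks and generalizes to multimodal Graphical User Interface (GUI) reasoning scenarios.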

Link: https://arxiv.org/abs/2512.02580
Authors: Changpeng Yang, Jinyang Wu, Yuchen Liu, Shuai Zhang, Yang Li, Qiliang Liang, Hongzhen Wang, Shuai Nie, Jiaming Xu, Runyu Shi, Ying Huang, Guoquan Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted by AAAI 2026

Abstract:Reinforcement learning has emerged as a paradigm for post-training large language models, boosting their reasoning capabilities. Such approaches compute an advantage value for each sample, reflecting better or worse performance than expected, thereby yielding both positive and negative signals for training. However, the indiscriminate mixing of the two signals in existing methods, especially from the early stages, may lead to ambiguous guidance and limited gains. To address this issue, we propose CAPO (Curriculum Advantage Policy Optimization), an adaptive curriculum mechanism based on advantage signals. The proposed mechanism bootstraps imitation learning with positive-only advantage samples to establish robust foundations, and subsequently introduces negative signals to cultivate discriminative capabilities, thereby improving generalization across complex scenarios. Compatible with diverse optimization methods including GRPO, PPO, RLOO, and Reinforce++, our method consistently achieves stable and significant improvements in mathematical reasoning tasks, and further generalizes effectively to multimodal Graphical User Interface (GUI) reasoning scenarios, establishing itself as a versatile and robust optimization framework.
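
The two-phase idea in this abstract, positive-only advantages first and both signs later, can be illustrated with a small sketch. It assumes GRPO-style group-normalized advantages and a fixed, hypothetical `switch_step` at which negative signals are introduced. The abstract describes the curriculum as adaptive and does not give the actual switching rule, so the threshold, the function names, and the zeroing of negative advantages are all assumptions for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def curriculum_advantages(rewards, step, switch_step):
    """Advantages to feed the policy-gradient loss at a given training step.

    Before `switch_step` (hypothetical imitation phase) negative advantages are
    zeroed, so only better-than-average samples contribute gradient. From
    `switch_step` onward both positive and negative advantages are kept.
    """
    advantages = group_advantages(rewards)
    if step < switch_step:
        return [a if a > 0.0 else 0.0 for a in advantages]
    return advantages


# Hypothetical group of four sampled answers with binary correctness rewards.
early = curriculum_advantages([1.0, 0.0, 0.0, 1.0], step=10, switch_step=100)
late = curriculum_advantages([1.0, 0.0, 0.0, 1.0], step=100, switch_step=100)
```

In the early phase the update resembles imitation of the model's own successful samples; in the later phase failed samples are also pushed down, which is where the discriminative signal comes from.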

[NLP-28] DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

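Quick read: This paper aims to combine high computational efficiency with strong reasoning and agent performance in an open large language model. It rests on three technical breakthroughs. First, DeepSeek Sparse Attention (DSA) substantially reduces computational complexity while preserving performance in long-context scenarios. Second, a scalable reinforcement learning framework with scaled post-training compute brings DeepSeek-V3.2 to performance comparable with GPT-5; the high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5, reasons on par with Gemini-3.0-Pro, and achieves gold-medal performance in the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). Third, a large-scale agentic task synthesis pipeline systematically generates training data for scalable agentic post-training, improving generalization and instruction-following robustness in complex, interactive environments.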

Link: https://arxiv.org/abs/2512.02556
Authors: DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Erhang Li, Fangqi Zhou, Fangyun Lin, Fucong Dai, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Li, Haofen Liang, Haoran Wei, Haowei Zhang, Haowen Luo, Haozhe Ji, Honghui Ding, Hongxuan Tang, Huanqi Cao, Huazuo Gao, Hui Qu, Hui Zeng, Jialiang Huang, Jiashi Li, Jiaxin Xu, Jiewen Hu, Jingchang Chen, Jingting Xiang, Jingyang Yuan, Jingyuan Cheng, Jinhua Zhu, Jun Ran, Junguang Jiang, Junjie Qiu, Junlong Li, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Kexin Huang, Kexing Zhou, Kezhao Huang, Kuai Yu, Lean Wang, Lecong Zhang, Lei Wang, Liang Zhao, Liangsheng Yin, Lihua Guo, Lingxiao Luo, Linwang Ma, Litong Wang, Liyue Zhang, M.S. Di, M.Y Xu, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingxu Zhou, Panpan Huang, Peixin Cong, Peiyi Wang, Qiancheng Wang, Qihao Zhu, Qingyang Li, Qinyu Chen, Qiushi Du, Ruiling Xu, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runqiu Yin, Runxin Xu, Ruomeng Shen, Ruoyu Zhang, S.H. Liu, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaofei Cai
Affiliations: DeepSeek-AI
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) DeepSeek Sparse Attention (DSA): We introduce DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios. (2) Scalable Reinforcement Learning Framework: By implementing a robust reinforcement learning protocol and scaling post-training compute, DeepSeek-V3.2 performs comparably to GPT-5. Notably, our high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and exhibits reasoning proficiency on par with Gemini-3.0-Pro, achieving gold-medal performance in both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). (3) Large-Scale Agentic Task Synthesis Pipeline: To integrate reasoning into tool-use scenarios, we developed a novel synthesis pipeline that systematically generates training data at scale. This methodology facilitates scalable agentic post-training, yielding substantial improvements in generalization and instruction-following robustness within complex, interactive environments.

[NLP-29] ADORE: Autonomous Domain-Oriented Relevance Engine for E-commerce SIGIR2025

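Quick read: This paper addresses two challenges for relevance modeling in e-commerce search: the semantic gap in term-matching methods such as BM25, and the scarcity of the domain-specific hard samples that neural models rely on. Its solution is ADORE, a self-sustaining framework built on three innovations: (1) a Rule-aware Relevance Discrimination module, in which a Chain-of-Thought LLM generates intent-aligned training data that is refined via Kahneman-Tversky Optimization (KTO) to align with user behavior; (2) an Error-type-aware Data Synthesis module that auto-generates adversarial examples to improve robustness; and (3) a Key-attribute-enhanced Knowledge Distillation module that injects domain-specific attribute hierarchies into a deployable student model. Together these automate annotation, adversarial generation, and distillation, easing data scarcity while enhancing reasoning.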

Link: https://arxiv.org/abs/2512.02555
Authors: Zheng Fang, Donghao Xie, Ming Pang, Chunyuan Yuan, Xue Jiang, Changping Peng, Zhangang Lin, Zheng Luo
Affiliations: JD.COM; Beijing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted by SIGIR 2025

Abstract:Relevance modeling in e-commerce search remains challenged by semantic gaps in term-matching methods (e.g., BM25) and neural models’ reliance on the scarcity of domain-specific hard samples. We propose ADORE, a self-sustaining framework that synergizes three innovations: (1) A Rule-aware Relevance Discrimination module, where a Chain-of-Thought LLM generates intent-aligned training data, refined via Kahneman-Tversky Optimization (KTO) to align with user behavior; (2) An Error-type-aware Data Synthesis module that auto-generates adversarial examples to harden robustness; and (3) A Key-attribute-enhanced Knowledge Distillation module that injects domain-specific attribute hierarchies into a deployable student model. ADORE automates annotation, adversarial generation, and distillation, overcoming data scarcity while enhancing reasoning. Large-scale experiments and online A/B testing verify the effectiveness of ADORE. The framework establishes a new paradigm for resource-efficient, cognitively aligned relevance modeling in industrial applications.

[NLP-30] What Signals Really Matter for Misinformation Tasks? Evaluating Fake-News Detection and Virality Prediction under Real-World Constraints

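Quick read: This paper studies two practical tasks concerning online misinformation, fake-news detection and virality prediction, in operational settings that require rapid reaction. Its approach is a systematic comparison of feature representations: textual embeddings (RoBERTa, with Mistral as a control), lightweight numeric features (timing, follower counts, verification, likes), and sequence models (GRU, gating architectures, Transformer encoders). Textual content alone is a strong discriminator for fake-news detection, while numeric-only pipelines remain viable when language models are unavailable or compute is constrained. Virality prediction is markedly harder and highly sensitive to label construction, which makes the definition of "viral" (such as a median-based split) and time-censoring of engagement features important. Dimensionality-reduction analyses further suggest that non-linear structure in the numeric features is more informative for virality than for fake-news detection.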

Link: https://arxiv.org/abs/2512.02552
Authors: Francesco Paolo Savatteri (ENC), Chahan Vidal-Gorène (CJM, LIPN), Florian Cafiero (ENC)
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We present an evaluation-driven study of two practical tasks regarding online misinformation: (i) fake-news detection and (ii) virality prediction in the context of operational settings, with the necessity for rapid reaction. Using the EVONS and FakeNewsNet datasets, we compare textual embeddings (RoBERTa; with a control using Mistral) against lightweight numeric features (timing, follower counts, verification, likes) and sequence models (GRU, gating architectures, Transformer encoders). We show that textual content alone is a strong discriminator for fake-news detection, while numeric-only pipelines remain viable when language models are unavailable or compute is constrained. Virality prediction is markedly harder than fake-news detection and is highly sensitive to label construction; in our setup, a median-based "viral" split (50 likes) is pragmatic but underestimates real-world virality, and time-censoring for engagement features is desirable yet difficult under current API limits. Dimensionality-reduction analyses suggest non-linear structure is more informative for virality than for fake-news detection (t-SNE > PCA on numeric features). Swapping RoBERTa for Mistral embeddings yields only modest deltas, leaving conclusions unchanged. We discuss implications for evaluation design and report reproducibility constraints that realistically affect the field. We release splits and code where possible and provide guidance for metric selection.
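
The sensitivity to label construction mentioned in this abstract is easiest to see with the median split written out. The sketch below is a minimal illustration: the function name `viral_labels`, the strict `>` comparison, and the like counts are hypothetical, and the abstract does not say whether posts exactly at the median count as viral.

```python
from statistics import median


def viral_labels(like_counts):
    """Label a post as viral (1) if its like count exceeds the sample median.

    Returns the threshold together with the labels so the cut-off can be
    reported alongside any results.
    """
    threshold = median(like_counts)
    return threshold, [int(count > threshold) for count in like_counts]


# Hypothetical like counts for five posts.
threshold, labels = viral_labels([3, 50, 1200, 10, 75])
```

A post with 75 likes and one with 1200 likes receive the same label here, which is one way to read the abstract's remark that a median split is pragmatic but underestimates real-world virality.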

[NLP-31] A Concise Review of Hallucinations in LLMs and their Mitigation

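Quick read: This paper addresses hallucination in language models, that is, generated content that is factually wrong or unsupported, which poses a serious risk in Natural Language Processing (NLP). It surveys the kinds of hallucinations that occur, their origins, and ways of reducing them, serving as a concise reference for researchers and practitioners.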

Link: https://arxiv.org/abs/2512.02527
Authors: Parth Pulkundwar, Vivek Dhanawade, Rohit Yadav, Minal Sonkar, Medha Asurlekar, Sarita Rathod
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 7 pages

Abstract:Traditional language models face a challenge from hallucinations. Their very presence casts a large, dangerous shadow over the promising realm of natural language processing. It becomes crucial to understand the various kinds of hallucinations that occur nowadays, their origins, and ways of reducing them. This document provides a concise and straightforward summary of that. It serves as a one-stop resource for a general understanding of hallucinations and how to mitigate them.

[NLP-32] Guided Self-Evolving LLMs with Minimal Human Supervision

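Quick read: This paper addresses the tendency of unguided self-evolving AI models to plateau or degrade during training. These failures arise from concept drift, diversity collapse, and mis-evolution, as models reinforce their own biases and converge toward low-entropy behaviors. The key to its solution is R-Few, a guided Self-Play Challenger-Solver framework. At each iteration the Challenger samples a small set of human-labeled examples for in-context grounding and uses them to guide synthetic question generation, while the Solver jointly trains on human and synthetic examples under an online, difficulty-based curriculum, yielding stable and controllable co-evolution. For example, Qwen3-8B-Base improves by +3.0 points over R-Zero on math tasks and performs on par with General-Reasoner, which was trained on 20 times more human data.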

Link: https://arxiv.org/abs/2512.02472
Authors: Wenhao Yu, Zhenwen Liang, Chengsong Huang, Kishan Panaganti, Tianqing Fang, Haitao Mi, Dong Yu
Affiliations: Tencent AI Lab in Seattle; Washington University in St. Louis
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:AI self-evolution has long been envisioned as a path toward superintelligence, where models autonomously acquire, refine, and internalize knowledge from their own learning experiences. Yet in practice, unguided self-evolving systems often plateau quickly or even degrade as training progresses. These failures arise from issues such as concept drift, diversity collapse, and mis-evolution, as models reinforce their own biases and converge toward low-entropy behaviors. To enable models to self-evolve in a stable and controllable manner while minimizing reliance on human supervision, we introduce R-Few, a guided Self-Play Challenger-Solver framework that incorporates lightweight human oversight through in-context grounding and mixed training. At each iteration, the Challenger samples a small set of human-labeled examples to guide synthetic question generation, while the Solver jointly trains on human and synthetic examples under an online, difficulty-based curriculum. Across math and general reasoning benchmarks, R-Few achieves consistent and iterative improvements. For example, Qwen3-8B-Base improves by +3.0 points over R-Zero on math tasks and achieves performance on par with General-Reasoner, despite the latter being trained on 20 times more human data. Ablation studies confirm the complementary contributions of grounded challenger training and curriculum-based solver training, and further analysis shows that R-Few mitigates drift, yielding more stable and controllable co-evolutionary dynamics.

[NLP-33] See Think Learn: A Self-Taught Multimodal Reasoner

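[Quick read]: This paper addresses the limited multimodal reasoning performance of Vision-Language Models (VLMs) caused by weak perception or weak reasoning. Existing methods usually depend on high-quality chain-of-thought (CoT) data, which is expensive to obtain and often neglects perception. The key to the solution is See-Think-Learn (STL), a self-training framework built around a structured reasoning template that encourages the model to first extract visual attributes in textual form and then reason over them. The model generates and learns from its own structured rationales, which improves perception and reasoning jointly, and negative rationales (explanations of why certain answer choices are wrong) sharpen its ability to tell correct answers from misleading ones, making multimodal reasoning more robust and accurate.

Link: https://arxiv.org/abs/2512.02456
Authors: Sourabh Sharma, Sonam Gupta, Sadbhawna
Affiliations: Malaviya National Institute of Technology Jaipur; IBM Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Winter Conference on Applications of Computer Vision 2026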

Abstract:Vision-Language Models (VLMs) have achieved remarkable progress in integrating visual perception with language understanding. However, effective multimodal reasoning requires both accurate perception and robust reasoning, and weakness in either limits the performance of VLMs. Prior efforts to enhance reasoning often depend on high-quality chain-of-thought (CoT) data, obtained via labor-intensive human annotations, costly proprietary models, or self-training methods that overlook perception. To address these limitations, we propose a simple yet effective self-training framework called See-Think-Learn (STL). At its core, STL introduces a structured reasoning template that encourages the model to see before thinking, first extracting visual attributes in textual form, then using them to guide reasoning. The framework jointly improves perception and reasoning by having the model generate and learn from its own structured rationales in a self-training loop. Furthermore, we augment the training data with negative rationales, i.e. explanations that justify why certain answer choices are incorrect, to enhance the model’s ability to distinguish between correct and misleading responses. This fosters more discriminative and robust learning. Experiments across diverse domains show that STL consistently outperforms baselines trained directly only on answers or self-generated reasoning, while qualitative analysis confirms the high quality of its rationales. STL thus provides a cost-effective solution to enhance multimodal reasoning ability of VLMs.

[NLP-34] When Refusals Fail: Unstable Safety Mechanisms in Long-Context LLM Agents AAAI2026

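[Quick read]: This paper addresses the under-studied capability and safety behavior of large language models (LLMs) operating as agents over long contexts. Prior work has focused mainly on evaluating LLMs on long prompts, leaving the long-context dependence and safety risks of multi-step agentic tasks largely unexplored. The key finding is that LLM agents are sensitive to the length, type, and placement of context, showing marked and unpredictable shifts in task performance and in their willingness to refuse harmful requests: performance drops by more than 50% already at 100K tokens, and at 200K tokens refusal rates move from about 5% to about 40% for GPT-4.1-nano and from about 80% to about 10% for Grok 4 Fast. These results expose the limits of current evaluation paradigms for LLM agent safety and suggest that existing metrics need to be revisited for long multi-step tasks.

Link: https://arxiv.org/abs/2512.02445
Authors: Tsimur Hadeliya, Mohammad Ali Jauhar, Nidhi Sakpal, Diogo Cruz
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 pages, 11 figures. Accepted at AAAI 2026 TrustAgent Workshop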

Abstract:Solving complex or long-horizon problems often requires large language models (LLMs) to use external tools and operate over a significantly longer context window. New LLMs enable longer context windows and support tool calling capabilities. Prior works have focused mainly on evaluation of LLMs on long-context prompts, leaving agentic setup relatively unexplored, both from capability and safety perspectives. Our work addresses this gap. We find that LLM agents could be sensitive to length, type, and placement of the context, exhibiting unexpected and inconsistent shifts in task performance and in refusals to execute harmful requests. Models with 1M-2M token context windows show severe degradation already at 100K tokens, with performance drops exceeding 50% for both benign and harmful tasks. Refusal rates shift unpredictably: GPT-4.1-nano increases from ~5% to ~40% while Grok 4 Fast decreases from ~80% to ~10% at 200K tokens. Our work shows potential safety issues with agents operating on longer context and opens additional questions on the current metrics and paradigm for evaluating LLM agent safety on long multi-step tasks. In particular, our results on LLM agents reveal a notable divergence in both capability and safety performance compared to prior evaluations of LLMs on similar criteria.

[NLP-35] WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

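[Quick read]: This paper addresses the difficulty of understanding hours- or days-long videos, where limited context capacity and the loss of visual detail hamper reasoning. Existing memory-augmented methods built on textual summaries rely too heavily on text, ignore visual evidence when reasoning over complex scenes, and retrieve from fixed temporal scales, which limits flexibility. The key to the solution is WorldMM, a multimodal memory agent with three complementary memories: (1) episodic memory indexes factual events across multiple temporal scales; (2) semantic memory continuously updates high-level conceptual knowledge; (3) visual memory preserves detailed information about scenes. At inference time, an adaptive retrieval agent picks the most relevant memory source for the query and iterates over multiple temporal granularities until it judges that enough information has been gathered. On five long-video question-answering benchmarks, WorldMM outperforms the previous state of the art by 8.4% on average.

Link: https://arxiv.org/abs/2512.02425
Authors: Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, Sung Ju Hwang
Affiliations: KAIST; Nanyang Technological University; DeepAuto.ai
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Project page: this https URL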

Abstract:Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detailed information about scenes. During inference, an adaptive retrieval agent iteratively selects the most relevant memory source and leverages multiple temporal granularities based on the query, continuing until it determines that sufficient information has been gathered. WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks, achieving an average 8.4% performance gain over previous state-of-the-art methods, showing its effectiveness on long video reasoning.
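
The iterative retrieval described above can be pictured as a small control loop. The sketch below is an illustration under assumptions, not WorldMM's implementation: the three memory names come from the abstract, but the `choose` and `enough` callbacks, the granularity labels, and the `max_steps` cap are hypothetical stand-ins for decisions the paper's agent makes with a model.

```python
def adaptive_retrieve(query, memories, choose, enough, max_steps=5):
    """Gather evidence from several memories until `enough` says to stop.

    memories: dict mapping a memory name to a retriever
              (query, granularity) -> list of evidence strings.
    choose:   (query, evidence_so_far) -> (memory_name, granularity).
    enough:   (query, evidence_so_far) -> bool.
    """
    evidence = []
    for _ in range(max_steps):
        if enough(query, evidence):
            break
        source, granularity = choose(query, evidence)
        evidence.extend(memories[source](query, granularity))
    return evidence


def demo():
    # Toy retrievers that just label where the evidence came from.
    memories = {
        name: (lambda q, g, name=name: [f"{name}[{g}] evidence for: {q}"])
        for name in ("episodic", "semantic", "visual")
    }
    plan = [("episodic", "hour"), ("visual", "frame")]  # hypothetical policy
    choose = lambda q, ev: plan[len(ev) % len(plan)]
    enough = lambda q, ev: len(ev) >= 2
    return adaptive_retrieve("who opened the door?", memories, choose, enough)
```

Capping the loop with `max_steps` is a design choice made here so the sketch always terminates even if the stopping check never fires.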

[NLP-36] TaleFrame: An Interactive Story Generation System with Fine-Grained Control and Large Language Models

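[Quick read]: This paper addresses the failure of current creative story generation systems to turn user intent into satisfactory stories, which stems from a lack of fine-grained control and unclear input specifications. The key to the solution is TaleFrame, a system that decomposes story structure into four basic units (entities, events, relationships, and story outline) and combines large language models (LLMs) with human-computer interaction (HCI) to give precise control over generation. Its JSON2Story approach converts the structured data into a coherent story, and an intuitive interface lets users adjust content through interactions such as drag-and-drop and connecting units, improving both controllability and output quality.

Link: https://arxiv.org/abs/2512.02402
Authors: Yunchao Wang, Guodao Sun, Zihang Fu, Zhehao Liu, Kaixing Du, Haidong Gao, Ronghua Liang
Affiliations: Zhejiang University of Technology; Zhejiang Key Laboratory of Visual Information Intelligent Processing
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 11 pages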

Abstract:With the advancement of natural language generation (NLG) technologies, creative story generation systems have gained increasing attention. However, current systems often fail to accurately translate user intent into satisfactory story outputs due to a lack of fine-grained control and unclear input specifications, limiting their applicability. To address this, we propose TaleFrame, a system that combines large language models (LLMs) with human-computer interaction (HCI) to generate stories through structured information, enabling precise control over the generation process. The innovation of TaleFrame lies in decomposing the story structure into four basic units: entities, events, relationships, and story outline. We leverage the Tinystories dataset, parsing and constructing a preference dataset consisting of 9,851 JSON-formatted entries, which is then used to fine-tune a local Llama model. By employing this JSON2Story approach, structured data is transformed into coherent stories. TaleFrame also offers an intuitive interface that supports users in creating and editing entities and events and generates stories through the structured framework. Users can control these units through simple interactions (e.g., drag-and-drop, attach, and connect), thus influencing the details and progression of the story. The generated stories can be evaluated across seven dimensions (e.g., creativity, structural integrity), with the system providing suggestions for refinement based on these evaluations. Users can iteratively adjust the story until a satisfactory result is achieved. Finally, we conduct quantitative evaluation and user studies that demonstrate the usefulness of TaleFrame. Dataset available at this https URL.
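
To make the four-unit decomposition concrete, here is a minimal sketch of what a JSON-formatted story specification might look like and how it could be sanity-checked before being handed to a model. Only the four unit names come from the abstract; the field names (`id`, `source`, `target`, and so on) and the validation rule are assumptions for illustration, not TaleFrame's actual schema.

```python
import json

REQUIRED_UNITS = ("entities", "events", "relationships", "outline")


def validate_story_spec(spec):
    """Return a list of problems; an empty list means the spec looks usable."""
    problems = [f"missing unit: {u}" for u in REQUIRED_UNITS if u not in spec]
    known = {e["id"] for e in spec.get("entities", [])}
    for rel in spec.get("relationships", []):
        for end in (rel["source"], rel["target"]):
            if end not in known:
                problems.append(f"relationship refers to unknown entity: {end}")
    return problems


# Hypothetical example entry in the spirit of a children's story.
story_spec = {
    "entities": [{"id": "lily", "type": "girl"}, {"id": "max", "type": "dog"}],
    "events": [{"id": "e1", "actors": ["lily", "max"], "action": "find a kite"}],
    "relationships": [{"source": "lily", "target": "max", "kind": "owner_of"}],
    "outline": ["Lily and Max go to the park", "They find a lost kite"],
}

# The serialized form is what a JSON-to-story model would receive as input.
prompt = json.dumps(story_spec, ensure_ascii=False, indent=2)
```

Checking references before generation is one cheap way to catch a relationship that points at an entity the user has since deleted in the interface.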

[NLP-37] Process-Centric Analysis of Agentic Software Systems

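[Quick read]: This paper addresses the outcome-centric way agentic systems are currently evaluated, which ignores fine-grained analysis of how agents reason, plan, act, and change strategy during execution. To enable systematic process-centric evaluation, the authors propose Graphectory, a graph representation that encodes the temporal and semantic relations within an agent's trajectory. Using it, they automatically analyze 4000 trajectories from two dominant agentic programming workflows, SWE-agent and OpenHands. The analysis relates graph complexity to prompt richness and LLM strength, shows that strategies vary with problem difficulty and the underlying LLM, and finds that even successful runs often contain inefficient, unnecessarily prolonged processes.

Link: https://arxiv.org/abs/2512.02393
Authors: Shuyang Liu, Yang Chen, Rahul Krishna, Saurabh Sinha, Jatin Ganhotra, Reyhan Jabbarvand
Affiliations: University of Illinois at Urbana–Champaign; IBM Research
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: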

Abstract:Agentic systems are modern software systems: they consist of orchestrated modules, expose interfaces, and are deployed in software pipelines. Unlike conventional programs, their execution (i.e., trajectories) is inherently stochastic and adaptive to the problem they are solving. Evaluation of such systems is often outcome-centric, judging their performance based on success or failure at the final step. This narrow focus overlooks detailed insights about such systems, failing to explain how agents reason, plan, act, or change their strategies over time. Inspired by the structured representation of conventional software systems as graphs, we introduce Graphectory to systematically encode the temporal and semantic relations in such software systems. Graphectory facilitates the design of process-centric metrics and analyses to assess the quality of agentic workflows independent of final success. Using Graphectory, we analyze 4000 trajectories of two dominant agentic programming workflows, namely SWE-agent and OpenHands, with a combination of four backbone Large Language Models (LLMs), attempting to resolve SWE-bench Verified issues. Our fully automated analyses reveal that: (1) agents using richer prompts or stronger LLMs exhibit more complex Graphectory, reflecting deeper exploration, broader context gathering, and more thorough validation before patch submission; (2) agents’ problem-solving strategies vary with both problem difficulty and the underlying LLM – for resolved issues, the strategies often follow coherent localization-patching-validation steps, while unresolved ones exhibit chaotic, repetitive, or backtracking behaviors; (3) even when successful, agentic programming systems often display inefficient processes, leading to unnecessarily prolonged trajectories. 
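
As a rough illustration of what a process-centric metric might look like, the sketch below turns a trajectory (a sequence of phase labels) into a directed graph and counts a few structural properties. This is not the Graphectory definition; the phase labels and the particular metrics (back edges, repeated transitions) are assumptions chosen to mirror the abstract's contrast between coherent localization-patching-validation runs and repetitive or backtracking ones.

```python
from collections import Counter


def build_graph(trajectory):
    """Nodes are distinct phases (first-seen order); edges count transitions."""
    nodes = list(dict.fromkeys(trajectory))
    edges = Counter(zip(trajectory, trajectory[1:]))
    return nodes, edges


def process_metrics(trajectory):
    nodes, edges = build_graph(trajectory)
    first_seen = {n: i for i, n in enumerate(nodes)}
    # A back edge returns to a phase that was first visited earlier.
    back = sum(c for (a, b), c in edges.items() if first_seen[b] < first_seen[a])
    # A repeated transition is any traversal of an edge beyond the first.
    repeated = sum(c - 1 for c in edges.values())
    return {
        "nodes": len(nodes),
        "edges": len(edges),
        "back_edges": back,
        "repeated_transitions": repeated,
        "steps": len(trajectory),
    }


clean = process_metrics(["localize", "patch", "validate"])
messy = process_metrics(
    ["localize", "patch", "validate", "localize", "patch", "validate", "patch"]
)
```

Both toy runs visit the same three phases, but the second one shows two back edges and two repeated transitions, which is the kind of signal a success-or-failure score cannot expose.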

[NLP-38] Memory-Augmented Knowledge Fusion with Safety-Aware Decoding for Domain-Adaptive Question Answering

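[Quick read]: This paper addresses the challenge that domain-specific question answering (QA) systems for services face when integrating heterogeneous knowledge sources, in particular keeping answers accurate and safe in sensitive domains such as healthcare policy and government welfare. Existing large language models often produce unreliable output there because of weak factual consistency and poor context alignment. The key to the solution is KARMA (Knowledge-Aware Reasoning and Memory-Augmented Adaptation), a framework with three core components: a dual-encoder architecture that fuses structured and unstructured knowledge sources, a gated memory unit that dynamically regulates how much external knowledge is integrated, and a safety-aware controllable decoder that uses safety classification and guided generation to suppress unsafe output. Experiments on a proprietary QA dataset show that KARMA outperforms strong baselines in both answer quality and safety.

Link: https://arxiv.org/abs/2512.02363
Authors: Lei Fu, Xiang Chen, Kaige Gao, Xinyue Huang, Kejian Tong
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: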

Abstract:Domain-specific question answering (QA) systems for services face unique challenges in integrating heterogeneous knowledge sources while ensuring both accuracy and safety. Existing large language models often struggle with factual consistency and context alignment in sensitive domains such as healthcare policies and government welfare. In this work, we introduce Knowledge-Aware Reasoning and Memory-Augmented Adaptation (KARMA), a novel framework designed to enhance QA performance in care scenarios. KARMA incorporates a dual-encoder architecture to fuse structured and unstructured knowledge sources, a gated memory unit to dynamically regulate external knowledge integration, and a safety-aware controllable decoder that mitigates unsafe outputs using safety classification and guided generation techniques. Extensive experiments on a proprietary QA dataset demonstrate that KARMA outperforms strong baselines in both answer quality and safety. This study offers a comprehensive solution for building trustworthy and adaptive QA systems in service contexts.
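
The abstract does not give the equations of the gated memory unit. A common way to build such a gate, assumed here purely for illustration, is to compute a sigmoid gate from the concatenated hidden state and retrieved memory vector, then interpolate between the two per dimension. The function name, argument shapes, and weight layout below are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(hidden, memory, weights, bias):
    """Blend a hidden state with a retrieved memory vector.

    hidden, memory: lists of length d.
    weights:        d rows, each of length 2*d, applied to [hidden; memory].
    bias:           list of length d.
    Per dimension i: g = sigmoid(w_i . [hidden; memory] + b_i)
                     fused_i = g * memory_i + (1 - g) * hidden_i
    """
    concat = list(hidden) + list(memory)
    fused = []
    for i in range(len(hidden)):
        g = sigmoid(sum(w * x for w, x in zip(weights[i], concat)) + bias[i])
        fused.append(g * memory[i] + (1.0 - g) * hidden[i])
    return fused


zero_w = [[0.0] * 4, [0.0] * 4]
# With zero weights and bias the gate is 0.5, so the result is the average.
balanced = gated_fusion([1.0, 0.0], [3.0, 2.0], zero_w, [0.0, 0.0])
# A strongly negative bias closes the gate and keeps the hidden state.
closed = gated_fusion([1.0, 0.0], [3.0, 2.0], zero_w, [-50.0, -50.0])
```

In a trained model the weights would be learned, so the gate could open for queries where external knowledge helps and stay closed otherwise.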

[NLP-39] LeechHijack: Covert Computational Resource Exploitation in Intelligent Agent Systems

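[Quick read]: This paper addresses the security risk created by implicit trust in the open Model Context Protocol (MCP) ecosystem used by Large Language Model (LLM) agent systems, specifically "implicit toxicity" attacks from third-party tool providers. In such attacks a malicious tool hijacks computational resources covertly while staying entirely within its granted privileges and violating no explicit permission. The paper proposes LeechHijack, a Latent Embedded Exploit for Computation Hijacking that works in two stages: an implantation stage embeds a benign-looking backdoor in an MCP tool, and an exploitation stage activates it on predefined triggers to establish a command-and-control channel, through which the agent executes attacker-injected tasks as if they were part of its normal workflow. Across four major LLM families the attack reaches an average success rate of 77.25% with a resource overhead of 18.62%, highlighting the lack of computational provenance and resource attestation mechanisms in the current MCP ecosystem.

Link: https://arxiv.org/abs/2512.02321
Authors: Yuanhe Zhang, Weiliu Wang, Zhenhong Zhou, Kun Wang, Jie Zhang, Li Sun, Yang Liu, Sen Su
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: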

Abstract:Large Language Model (LLM)-based agents have demonstrated remarkable capabilities in reasoning, planning, and tool usage. The recently proposed Model Context Protocol (MCP) has emerged as a unifying framework for integrating external tools into agent systems, enabling a thriving open ecosystem of community-built functionalities. However, the openness and composability that make MCP appealing also introduce a critical yet overlooked security assumption – implicit trust in third-party tool providers. In this work, we identify and formalize a new class of attacks that exploit this trust boundary without violating explicit permissions. We term this new attack vector implicit toxicity, where malicious behaviors occur entirely within the allowed privilege scope. We propose LeechHijack, a Latent Embedded Exploit for Computation Hijacking, in which an adversarial MCP tool covertly expropriates the agent’s computational resources for unauthorized workloads. LeechHijack operates through a two-stage mechanism: an implantation stage that embeds a benign-looking backdoor in a tool, and an exploitation stage where the backdoor activates upon predefined triggers to establish a command-and-control channel. Through this channel, the attacker injects additional tasks that the agent executes as if they were part of its normal workflow, effectively parasitizing the user’s compute budget. We implement LeechHijack across four major LLM families. Experiments show that LeechHijack achieves an average success rate of 77.25%, with a resource overhead of 18.62% compared to the baseline. This study highlights the urgent need for computational provenance and resource attestation mechanisms to safeguard the emerging MCP ecosystem.

[NLP-40] OmniGuard: Unified Omni-Modal Guardrails with Deliberate Reasoning

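[Quick read]: This paper addresses safety and value alignment for Omni-modal Large Language Models (OLLMs) in human-AI interaction. Existing guardrails mostly target unimodal settings and frame safeguarding as binary classification, which makes them hard to keep robust across modalities and tasks. The key to the solution is OmniGuard, the first family of omni-modal guardrails. Its main contributions are an omni-modal safety dataset of more than 210K diverse samples covering text, images, video, and audio, annotated with structured safety labels and safety critiques distilled from expert models, and a guardrail with deliberate reasoning ability. On 15 benchmarks OmniGuard shows strong effectiveness and generalization, laying groundwork for unified and reliable omni-modal safeguarding systems.

Link: https://arxiv.org/abs/2512.02306
Authors: Boyu Zhu, Xiaofei Wen, Wenjie Jacky Mo, Tinghui Zhu, Yanan Xie, Peng Qi, Muhao Chen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: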

Abstract:Omni-modal Large Language Models (OLLMs) that process text, images, videos, and audio introduce new challenges for safety and value guardrails in human-AI interaction. Prior guardrail research largely targets unimodal settings and typically frames safeguarding as binary classification, which limits robustness across diverse modalities and tasks. To address this gap, we propose OmniGuard, the first family of omni-modal guardrails that performs safeguarding across all modalities with deliberate reasoning ability. To support the training of OMNIGUARD, we curate a large, comprehensive omni-modal safety dataset comprising over 210K diverse samples, with inputs that cover all modalities through both unimodal and cross-modal samples. Each sample is annotated with structured safety labels and carefully curated safety critiques from expert models through targeted distillation. Extensive experiments on 15 benchmarks show that OmniGuard achieves strong effectiveness and generalization across a wide range of multimodal safety scenarios. Importantly, OmniGuard provides a unified framework that enforces policies and mitigates risks in omni-modalities, paving the way toward building more robust and capable omnimodal safeguarding systems.

[NLP-41] When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers

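[Quick read]: This paper studies how verification can improve the performance of Large Language Models (LLMs) on reasoning tasks, and in particular how the pairing of verifier and solver affects final output quality. The key contribution is verifier gain, a metric that the authors introduce and empirically validate as a predictor of the improvement obtained from test-time, verifier-based rejection sampling. The paper systematically compares self-verification, same-family verification, and cross-family verification, and examines how post-training affects each. It finds that cross-family verification is especially effective, that post-training reduces self-improvement but strengthens cross-family improvement, and that mathematical and logical tasks have the highest inherent verifiability.

Link: https://arxiv.org/abs/2512.02304
Authors: Jack Lu, Ryan Teehan, Jinran Jin, Mengye Ren
Affiliations: New York University
Categories: Computation and Language (cs.CL)
Comments: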

Abstract:Large language models (LLMs) can act as both problem solvers and solution verifiers, with verifiers improving solver performance by selecting high-quality answers from a pool of candidates. However, prior studies of solver-verifier interactions have been limited, focusing mainly on self-verification and rarely examining how verifiers judge outputs from models in their own or in another model family. Modern LLMs also undergo extensive post-training, but its effect on verification remains unclear. We present a systematic study across 37 models spanning multiple families, sizes, and base vs. post-trained variants, evaluated on 9 benchmarks covering logical reasoning, structured puzzles, symbolic computation, mathematics, commonsense, factual recall, and domain knowledge. We compare self-verification with verification within the same family and across different families. To support this, we introduce and empirically validate verifier gain, a metric that predicts the performance improvements from test-time verifier-based rejection sampling. We analyze how metrics like verifier gain and false positive rate scale with model size and post-training, and characterize differences in dataset verifiability. Our findings show that cross-family verification is especially effective; post-training reduces self-improvement but strengthens cross-family improvement; and mathematical and logical tasks exhibit the highest inherent verifiability.
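
To see why a verifier's false positive rate matters for rejection sampling, consider an idealized setting, assumed here for illustration: each candidate is correct with probability `p_correct`, the verifier accepts correct answers with probability `tpr` and incorrect ones with probability `fpr`, and sampling repeats until an answer is accepted. By Bayes' rule the accepted answer is then correct with probability p·tpr / (p·tpr + (1−p)·fpr). The "gain" computed below is simply the difference from the solver's base accuracy; it is not claimed to be the paper's exact definition of verifier gain.

```python
def accuracy_after_rejection(p_correct, tpr, fpr):
    """Probability that an accepted answer is correct (idealized model)."""
    accept = p_correct * tpr + (1.0 - p_correct) * fpr
    if accept == 0.0:
        # The verifier never accepts anything; fall back to the base rate.
        return p_correct
    return p_correct * tpr / accept


def gain(p_correct, tpr, fpr):
    """Improvement over the solver's base accuracy under the same model."""
    return accuracy_after_rejection(p_correct, tpr, fpr) - p_correct


strong = accuracy_after_rejection(0.5, tpr=0.9, fpr=0.1)  # discriminating verifier
blind = gain(0.5, tpr=0.5, fpr=0.5)                        # accepts at random
```

A verifier that accepts correct and incorrect answers at the same rate yields no gain, no matter how often it accepts.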

[NLP-42] HealthContradict: Evaluating Biomedical Knowledge Conflicts in Language Models

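[Quick read]: This paper asks how language models use contextual information to reason and answer accurately when given long, conflicting biomedical contexts, with a focus on how stable and accurate they remain under contradictory information. The key to the solution is HealthContradict, an expert-verified health QA dataset of 920 unique instances, each consisting of a health-related question, a factual answer supported by scientific evidence, and two documents with contradictory stances. By evaluating models under several prompt settings (correct, incorrect, or contradictory context), the study shows that the strength of fine-tuned biomedical language models lies not only in parametric knowledge from pretraining but also in their ability to exploit correct context while resisting incorrect context.

Link: https://arxiv.org/abs/2512.02299
Authors: Boya Zhang, Alban Bornet, Rui Yang, Nan Liu, Douglas Teodoro
Affiliations: University of Geneva; National University of Singapore; Duke University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: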

Abstract:How do language models use contextual information to answer health questions? How are their responses impacted by conflicting contexts? We assess the ability of language models to reason over long, conflicting biomedical contexts using HealthContradict, an expert-verified dataset comprising 920 unique instances, each consisting of a health-related question, a factual answer supported by scientific evidence, and two documents presenting contradictory stances. We consider several prompt settings, including correct, incorrect or contradictory context, and measure their impact on model outputs. Compared to existing medical question-answering evaluation benchmarks, HealthContradict provides greater distinctions of language models’ contextual reasoning capabilities. Our experiments show that the strength of fine-tuned biomedical language models lies not only in their parametric knowledge from pretraining, but also in their ability to exploit correct context while resisting incorrect context.

[NLP-43] CAIRNS: Balancing Readability and Scientific Accuracy in Climate Adaptation Question Answering

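[Quick read]: This paper addresses the difficulty of finding climate adaptation strategies for agriculture: experts such as farmer advisors need credible, readable, and verifiable answers drawn from complex and scattered sources, both unstructured (scientific literature) and structured (heterogeneous climate data from government APIs). The key to the solution is CAIRNS (Climate Adaptation question-answering with Improved Readability and Noted Sources), a framework that improves readability and citation reliability through a structured ScholarGuide prompt and evaluates answers robustly with a consistency-weighted hybrid evaluator that leverages inter-model agreement together with expert judgment. This yields domain-grounded question answering without fine-tuning or reinforcement learning.

Link: https://arxiv.org/abs/2512.02251
Authors: Liangji Kong, Aditya Joshi, Sarvnaz Karimi
Affiliations: University of New South Wales; CSIRO’s Data61
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Short Paper; Under Review at The WebConf 2026 (single-blind submission)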

Abstract:Climate adaptation strategies are proposed in response to climate change. They are practised in agriculture to sustain food production. These strategies can be found in unstructured data (for example, scientific literature from the Elsevier website) or structured (heterogeneous climate data via government APIs). We present Climate Adaptation question-answering with Improved Readability and Noted Sources (CAIRNS), a framework that enables experts – farmer advisors – to obtain credible preliminary answers from complex evidence sources from the web. It enhances readability and citation reliability through a structured ScholarGuide prompt and achieves robust evaluation via a consistency-weighted hybrid evaluator that leverages inter-model agreement with experts. Together, these components enable readable, verifiable, and domain-grounded question-answering without fine-tuning or reinforcement learning. Using a previously reported dataset of expert-curated question-answers, we show that CAIRNS outperforms the baselines on most of the metrics. Our thorough ablation study confirms the results on all metrics. To validate our LLM-based evaluation, we also report an analysis of correlations against human judgment.

[NLP-44] DETAIL Matters: Measuring the Impact of Prompt Specificity on Reasoning in Large Language Models

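[Quick read]: This paper addresses the understudied question of how prompt specificity, that is, how detailed or vague a prompt is, affects the reasoning accuracy of Large Language Models (LLMs). The key to the solution is the DETAIL framework, which generates multi-level prompts with GPT-4, quantifies specificity via perplexity, and judges answer correctness with GPT-based semantic equivalence, allowing a systematic analysis of model performance at different specificity levels. Experiments show that greater specificity improves accuracy, especially for smaller models and procedural tasks, providing empirical support and tools for developing adaptive prompting strategies.

Link: https://arxiv.org/abs/2512.02246
Authors: Olivia Kim
Affiliations: Emory University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: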

Abstract:Prompt design plays a critical role in the reasoning performance of large language models (LLMs), yet the impact of prompt specificity - how detailed or vague a prompt is - remains understudied. This paper introduces DETAIL, a framework for evaluating LLM performance across varying levels of prompt specificity. We generate multi-level prompts using GPT-4, quantify specificity via perplexity, and assess correctness using GPT-based semantic equivalence. Experiments on 30 novel reasoning tasks across GPT-4 and O3-mini reveal that specificity improves accuracy, especially for smaller models and procedural tasks. Our results highlight the need for adaptive prompting strategies and provide tools and data to support further research.
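
For readers unfamiliar with the measure, perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes it from a list of per-token natural-log probabilities. How DETAIL obtains those probabilities, and how it maps perplexity onto a specificity score, is not stated in the abstract, so the input format here is an assumption.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities.

    PPL = exp(-(1/N) * sum(log p_i)); lower means the scoring model
    found the text more predictable.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A prompt whose every token had probability 0.25 has perplexity 4.
uniform = perplexity([math.log(0.25)] * 4)
```

Because the value depends on the scoring model and its tokenizer, perplexities are only comparable when the same model scores every prompt.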

[NLP-45] Lightweight Latent Reasoning for Narrative Tasks

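[Quick read]: This paper addresses the high computational cost that Large Language Models (LLMs) incur when they generate long reasoning traces for complex tasks, especially narrative tasks such as plot hole detection and book chapter generation that require processing many tokens. The key to the solution is LiteReason, whose core is a lightweight Reasoning Projector module trained to produce continuous latent tokens that let the model "skip" some reasoning steps. During reinforcement learning (RL), the policy model decides when to activate the projector, switching between latent and discrete reasoning as needed. Experiments show that LiteReason comes close to matching non-latent RL training while reducing final reasoning length by 77-92%.

Link: https://arxiv.org/abs/2512.02240
Authors: Alexander Gurung, Nikolay Malkin, Mirella Lapata
Affiliations: University of Edinburgh
Categories: Computation and Language (cs.CL)
Comments: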

Abstract:Large language models (LLMs) tackle complex tasks by generating long chains of thought or “reasoning traces” that act as latent variables in the generation of an output given a query. A model’s ability to generate such traces can be optimized with reinforcement learning (RL) to improve their utility in predicting an answer. This optimization comes at a high computational cost, especially for narrative-related tasks that involve retrieving and processing many tokens. To this end, we propose LiteReason, a latent reasoning method that can be interleaved with standard token sampling and easily combined with RL techniques. LiteReason employs a lightweight Reasoning Projector module, trained to produce continuous latent tokens that help the model ‘skip’ reasoning steps. During RL, the policy model decides when to activate the projector, switching between latent and discrete reasoning as needed. Experimental results on plot hole detection and book chapter generation show that our method outperforms latent reasoning baselines and comes close to matching non-latent RL training, while reducing final reasoning length by 77-92%. Overall, LiteReason guides RL training to a more efficient part of the performance-computation tradeoff curve.

[NLP-46] Swivuriso: The South African Next Voices Multilingual Speech Dataset

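[Quick read]: This paper addresses the scarcity and limited coverage of automatic speech recognition (ASR) data for seven South African languages, in particular the lack of high-quality multilingual speech data for the agriculture, healthcare, and general domains. The key to the solution is Swivuriso, a 3000-hour multilingual speech dataset built under explicit design principles, ethical considerations, and data collection procedures. Baseline experiments on training and fine-tuning ASR models with the data, compared against other ASR datasets for the same languages, provide a reference point for ASR research on these languages.

Link: https://arxiv.org/abs/2512.02201
Authors: Vukosi Marivatee, Kayode Olaleye, Sitwala Mundia, Andinda Bakainga, Unarine Netshifhefhe, Mahmooda Milanzie, Tsholofelo Hope Mogale, Thapelo Sindane, Zainab Abdulrasaq, Kesego Mokgosi, Chijioke Okorie, Nia Zion Van Wyk, Graham Morrissey, Dale Dunbar, Francois Smit, Tsosheletso Chidi, Rooweither Mabuya, Andiswa Bukula, Respect Mlambo, Tebogo Macucwa, Idris Abdulmumin, Seani Rananga
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Work in Progress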

Abstract:This paper introduces Swivuriso, a 3000-hour multilingual speech dataset developed as part of the African Next Voices project, to support the development and benchmarking of automatic speech recognition (ASR) technologies in seven South African languages. Covering agriculture, healthcare, and general domain topics, Swivuriso addresses significant gaps in existing ASR datasets. We describe the design principles, ethical considerations, and data collection procedures that guided the dataset creation. We present baseline results of training/finetuning ASR models with this data and compare to other ASR datasets for the languages concerned.

[NLP-47] A Knowledge-Based Language Model: Deducing Grammatical Knowledge in a Multi-Agent Language Acquisition Simulation WWW

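[Quick read]: This paper addresses how unsupervised language acquisition can be realized in a computational multi-agent system, specifically by simulating how a child language model acquires grammatical knowledge through interaction with an adult language model. The key to the solution is MODOMA, a fully parametrized multi-agent laboratory environment that combines statistical and rule-based procedures. In it, the child agent acquires functional and content grammatical categories from example data of varying size generated by the adult agent, and stores the acquired grammar as explicit knowledge representations that can be consulted and used, which makes the acquisition process controllable, observable, and verifiable.

Link: https://arxiv.org/abs/2512.02195
Authors: David Ph. Shakouri, Crit Cremers, Niels O. Schiller
Affiliations: Leiden University Centre for Linguistics (LUCL), Leiden University, the Netherlands; Leiden Institute for Brain and Cognition (LIBC), Leiden University, the Netherlands; City University of Hong Kong (CityU), Hong Kong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 23 pages, 7 figures, 11 tables. Related work: arXiv:2503.18702. This is the peer-reviewed publisher’s version, downloadable from: this https URL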

Abstract:This paper presents an initial study performed by the MODOMA system. The MODOMA is a computational multi-agent laboratory environment for unsupervised language acquisition experiments such that acquisition is based on the interaction between two language models, an adult and a child agent. Although this framework employs statistical as well as rule-based procedures, the result of language acquisition is a knowledge-based language model, which can be used to generate and parse new utterances of the target language. This system is fully parametrized and researchers can control all aspects of the experiments while the results of language acquisition, that is, the acquired grammatical knowledge, are explicitly represented and can be consulted. Thus, this system introduces novel possibilities for conducting computational language acquisition experiments. The experiments presented by this paper demonstrate that functional and content categories can be acquired and represented by the daughter agent based on training and test data containing different amounts of exemplars generated by the adult agent. Interestingly, similar patterns, which are well-established for human-generated data, are also found for these machine-generated data. As the procedures resulted in the successful acquisition of discrete grammatical categories by the child agent, these experiments substantiate the validity of the MODOMA approach to modelling language acquisition.

[NLP-48] Story2MIDI: Emotionally Aligned Music Generation from Text

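[Quick read]: This paper addresses the generation of music that is emotionally aligned with a given text, that is, how to map the emotion conveyed by the text onto the generated music. The key to the solution is Story2MIDI, a sequence-to-sequence Transformer-based model, together with the Story2MIDI dataset of paired text blurbs and music pieces that evoke the same emotion, built by merging existing resources for text sentiment analysis and music emotion classification. Despite the small dataset and limited compute, the model learns emotion-relevant musical features and uses them during generation, producing samples with diverse emotional responses. Objective musical metrics and a human listening study confirm that it captures the intended emotional cues.

Link: https://arxiv.org/abs/2512.02192
Authors: Mohammad Shokri, Alexandra C. Salem, Gabriel Levine, Johanna Devaney, Sarah Ita Levitan
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 8 pages (6 pages of main text + 2 pages of references and appendices), 4 figures, 1 table. Presented at IEEE Big Data 2025 3rd Workshop on AI Music Generation (AIMG 2025)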

Abstract:In this paper, we introduce Story2MIDI, a sequence-to-sequence Transformer-based model for generating emotion-aligned music from a given piece of text. To develop this model, we construct the Story2MIDI dataset by merging existing datasets for sentiment analysis from text and emotion classification in music. The resulting dataset contains pairs of text blurbs and music pieces that evoke the same emotions in the reader or listener. Despite the small scale of our dataset and limited computational resources, our results indicate that our model effectively learns emotion-relevant features in music and incorporates them into its generation process, producing samples with diverse emotional responses. We evaluate the generated outputs using objective musical metrics and a human listening study, confirming the model’s ability to capture intended emotional cues.

[NLP-49] Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models

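[Quick Read]: This paper addresses the severe performance degradation that reasoning LLMs (RLMs) suffer under structured pruning. Existing pruning methods work well on standard LLMs but fail on RLMs, which are highly sensitive to pruning: even 20% sparsity can collapse accuracy and destroy reasoning coherence. The root cause is a mismatch between the calibration data, the pruning objective, and the model's decode-time reasoning behavior. The proposed solution, RESP, a self-reflective structured pruning framework, rests on three ideas: using the model's own self-generated reasoning traces as the calibration signal, since they reflect its inference distribution more faithfully; estimating importance from decode-only gradients; and progressively regenerating calibration data to keep it faithful as sparsity increases. On Qwen3-8B, RESP preserves near-dense accuracy at 20-30% sparsity and reaches 81.3% on GSM8K and 59.6% on MathQA at 40% sparsity.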

Link: https://arxiv.org/abs/2512.02185
Authors: Ziyan Wang, Enmao Diao, Qi Le, Pu Wang, Guanchu Wang, Minwoo Lee, Shu-ping Yeh, Li Yang
Affiliations: University of North Carolina at Charlotte; DreamSoul; University of Minnesota; Intel Corporation
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages, 3 figures

Abstract:Reasoning LLMs (RLMs) such as OpenAI o1, DeepSeek-R1, and Qwen3 deliver strong multi-step reasoning through chain-of-thought generation, but their large model sizes and lengthy decode-time outputs make them costly to deploy and unsuitable for resource-constrained settings. To reduce computing and memory cost, pruning offers a promising solution by removing unimportant parameters. However, despite their success on standard LLMs, existing pruning methods severely damage RLMs, as even moderate sparsity (e.g., 20%) can collapse accuracy and completely disrupt the model’s reasoning coherence. We begin by analyzing why existing pruning pipelines fail on reasoning LLMs and find that their brittleness largely stems from a mismatch between the calibration data, the pruning objective, and the model’s decode-time reasoning behavior. Our study further shows that the most reliable calibration signal comes not from human-written labels but from the model’s own self-generated reasoning traces, which more accurately reflect its inference distribution. Guided by these insights, we introduce RESP, a self-reflective structured pruning framework that aligns pruning decisions with the model’s reasoning dynamics through self-generated calibration, decode-only gradient-based importance estimation, and progressive regeneration that maintains calibration fidelity as sparsity increases. Experiments on Qwen3-8B demonstrate that RESP markedly outperforms existing structured pruning methods on both GSM8K and MathQA, preserving near-dense accuracy at 20-30% sparsity and substantially mitigating performance collapse at higher sparsity levels. At 40% sparsity, RESP attains 81.3% accuracy on GSM8K and 59.6% on MathQA, surpassing the strongest baselines by 66.87% and 47%, respectively.

[NLP-50] Factor(TU): Factored Cognition Strengthens Monitoring of Untrusted AI

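[Quick Read]: This paper targets the weak safety that results from inadequate monitoring when generative AI tackles complex tasks, in particular the risk of malicious behavior from a deployed untrusted model. The key to its solution is a control protocol based on factored cognition: a trusted model decomposes the task into child tasks, an untrusted model solves each child task in isolation, and the results are reassembled into a full solution. The approach may improve safety by simplifying the context that monitors see or by obscuring vulnerabilities in the environment; in the reported experiments, adding it to a trusted monitoring protocol raises safety from 41% to 63%, a gain attributed to improved monitor performance.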

Link: https://arxiv.org/abs/2512.02157
Authors: Aaron Sandoval, Cody Rushing
Affiliations: Redwood Research
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: 7 pages body; 19 pages total; 8 figures

Abstract:The field of AI Control seeks to develop robust control protocols, deployment safeguards for untrusted AI which may be intentionally subversive. However, existing protocols that rely on weaker monitors to detect unsafe behavior often fail on complex tasks beyond the monitor’s comprehension. We develop control protocols based on factored cognition, in which a trusted model decomposes a task, an untrusted model solves each resultant child task in isolation, and the results are reassembled into a full solution. These protocols may improve safety by several means, such as by simplifying the context for monitors, or by obscuring vulnerabilities in the environment. We implement our protocols in the APPS coding setting and red team them against backdoor attempts from an adversarial GPT-4.1 Nano. We find that: (i) Adding factored cognition to a trusted monitoring protocol can boost safety from 41% to 63%; (ii) Safety improves because monitor performance improves; (iii) Factored cognition makes it no harder for capable LLMs to write backdoors in APPS. While our protocols show low usefulness in APPS, they hold promise for more complex tasks.
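The decompose, solve-in-isolation, and reassemble flow described in the abstract can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the function name, the callable parameters (`decompose`, `solve_untrusted`, `suspicion`, `reassemble`), and the fixed `threshold` are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Tuple


def run_factored_protocol(
    task: str,
    decompose: Callable[[str], List[str]],
    solve_untrusted: Callable[[str], str],
    suspicion: Callable[[str, str], float],
    reassemble: Callable[[str, List[str]], str],
    threshold: float = 0.5,  # assumed cut-off, not from the paper
) -> Tuple[str, List[int]]:
    """Decompose, solve each child task in isolation, monitor, and reassemble."""
    child_tasks = decompose(task)  # performed by the trusted model
    solutions: List[str] = []
    flagged: List[int] = []
    for index, child in enumerate(child_tasks):
        # The untrusted model sees only the child task, never the parent task.
        solution = solve_untrusted(child)
        # The monitor scores a small, self-contained context.
        if suspicion(child, solution) > threshold:
            flagged.append(index)
        solutions.append(solution)
    return reassemble(task, solutions), flagged
```

Passing the models in as plain callables keeps the sketch independent of any particular LLM API; in a real protocol, flagged child tasks would presumably be audited or re-solved by a trusted model.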

[NLP-51] Feature Selection Empowered BERT for Detection of Hate Speech with Vocabulary Augmentation

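[Quick Read]: This paper addresses the persistent and evolving challenge of detecting hate speech on social media, especially novel slang and obfuscated terms that slip past existing detectors. The solution has two parts: a TF-IDF-based sample selection mechanism that keeps only the most informative 75% of training examples, shrinking the training set without hurting performance; and an extended BERT tokenizer vocabulary that adds domain-specific slang and lexical variants found in abusive contexts, improving recognition of evolving hate speech terminology. The method stays competitive while improving computational efficiency, offering a path toward scalable and adaptive moderation of harmful content.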

Link: https://arxiv.org/abs/2512.02141
Authors: Pritish N. Desai, Tanay Kewalramani, Srimanta Mandal
Affiliations: Dhirubhai Ambani University (Formerly DA-IICT), Gandhinagar, India
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:Abusive speech on social media poses a persistent and evolving challenge, driven by the continuous emergence of novel slang and obfuscated terms designed to circumvent detection systems. In this work, we present a data-efficient strategy for fine-tuning BERT on hate speech classification by significantly reducing training set size without compromising performance. Our approach employs a TF-IDF-based sample selection mechanism to retain only the most informative 75 percent of examples, thereby minimizing training overhead. To address the limitations of BERT’s native vocabulary in capturing evolving hate speech terminology, we augment the tokenizer with domain-specific slang and lexical variants commonly found in abusive contexts. Experimental results on a widely used hate speech dataset demonstrate that our method achieves competitive performance while improving computational efficiency, highlighting its potential for scalable and adaptive abusive content moderation.
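The abstract does not spell out how TF-IDF weights become a per-example ranking. As a rough illustration only, the sketch below assumes each example is scored by the mean TF-IDF weight of its distinct tokens and that the top 75% are kept. The function names, the whitespace tokenization, the smoothed IDF formula, and the scoring rule are all assumptions, not the authors' method.

```python
import math
from collections import Counter


def tfidf_scores(texts):
    """Score each text by the mean TF-IDF weight of its distinct tokens (assumed rule)."""
    docs = [t.lower().split() for t in texts]
    n_docs = len(docs)
    # Document frequency: number of texts that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    scores = []
    for doc in docs:
        if not doc:
            scores.append(0.0)
            continue
        tf = Counter(doc)
        weights = [
            (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[tok])) + 1.0)
            for tok, count in tf.items()
        ]
        scores.append(sum(weights) / len(weights))
    return scores


def select_top_fraction(texts, keep=0.75):
    """Return indices of the highest-scoring `keep` fraction, in original order."""
    scores = tfidf_scores(texts)
    n_keep = max(1, math.ceil(keep * len(texts)))
    ranked = sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])
```

The returned indices can be used to subset a training set before fine-tuning; any other informativeness score could be swapped in without changing the selection step.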

[NLP-52] Dialect Identification Using Resource-Efficient Fine-Tuning Approaches

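[Quick Read]: This paper addresses the high computation cost, large GPU memory footprint, and slow training incurred when fine-tuning pre-trained speech models for Dialect Identification (DI). Conventional Parameter-Efficient Fine-Tuning (PEFT) methods reduce the number of trainable parameters but bring only limited gains in memory efficiency and training speed. The authors therefore adopt Memory-Efficient Fine-Tuning (MEFT) methods, originally proposed for language processing, apply them to a general-purpose pre-trained speech model (Whisper), and systematically analyze GPU memory usage and training speed across MEFT methods. On the KeSpeech dataset, identifying six Mandarin subdialects, they cut GPU memory usage by up to 73.25% and speed up training by a factor of 2.1 while keeping accuracy comparable to vanilla fine-tuning and PEFT. The key step is transferring MEFT strategies to the speech domain.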

Link: https://arxiv.org/abs/2512.02074
Authors: Zirui Lin, Haris Gulzar, Monnika Roslianna Busto, Akiko Masaki, Takeharu Eda, Kazuhiro Nakadai
Affiliations: Institute of Science Tokyo; NTT Software Innovation Center
Subjects: Computation and Language (cs.CL); Sound (cs.SD)
Comments: Published in APSIPA ASC 2025

Abstract:Dialect Identification (DI) is a task to recognize different dialects within the same language from a speech signal. DI can help improve downstream speech-related tasks even when speakers have a strong dialect. However, fine-tuning a speech model for tasks like DI is expensive in terms of computation cost and memory requirement. Recent studies have explored fine-tuning pre-trained speech models for tasks like DI using Parameter-Efficient Fine-Tuning (PEFT) methods, which offer parameter efficiency but limited improvement in memory efficiency and training speed. To address these challenges, we explore Memory-Efficient Fine-Tuning (MEFT) methods, originally proposed for language processing, and apply them to the general-purpose pre-trained speech model. We then comprehensively analyze the GPU memory usage and fine-tuning speed based on various MEFT methods. As a case study, we fine-tune the Whisper model to identify six Mandarin subdialects from the KeSpeech dataset, reducing GPU memory usage by up to 73.25% and accelerating training speed by a factor of 2.1, while maintaining accuracy comparable to vanilla fine-tuning and PEFT methods.

[NLP-53] Misalignment of LLM-Generated Personas with Human Perceptions in Low-Resource Settings

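[Quick Read]: This paper asks whether social personas generated by Large Language Models (LLMs) faithfully reflect the cultural, emotional, and contextual complexity of real people in a low-resource setting such as Bangladesh, and whether they hold up on perception dimensions such as empathy and credibility. Its approach is a quantitative comparison of real human responses with those of eight LLM-generated personas. LLMs fall significantly short of humans on all matrices of persona perception, with especially large gaps in empathy and credibility, and their output shows a systematic positive-sentiment bias in line with the Pollyanna Principle (average sentiment score $\Phi_{\text{avg}} = 5.99$ for LLMs vs. 5.60 for humans). The study therefore stresses validating LLM personas against real-world human data before using them in social science applications.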

Link: https://arxiv.org/abs/2512.02058
Authors: Tabia Tanzin Prama, Christopher M. Danforth, Peter Sheridan Dodds
Affiliations: Computational Story Lab; Vermont Complex Systems Institute; Vermont Advanced Computing Center; Department of Mathematics and Statistics; Department of Computer Science; University of Vermont; Santa Fe Institute
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments:

Abstract:Recent advances enable Large Language Models (LLMs) to generate AI personas, yet their lack of deep contextual, cultural, and emotional understanding poses a significant limitation. This study quantitatively compared human responses with those of eight LLM-generated social personas (e.g., Male, Female, Muslim, Political Supporter) within a low-resource environment like Bangladesh, using culturally specific questions. Results show human responses significantly outperform all LLMs in answering questions, and across all matrices of persona perception, with particularly large gaps in empathy and credibility. Furthermore, LLM-generated content exhibited a systematic bias along the lines of the "Pollyanna Principle", scoring measurably higher in positive sentiment ($\Phi_{\text{avg}} = 5.99$ for LLMs vs. 5.60 for humans). These findings suggest that LLM personas do not accurately reflect the authentic experience of real people in resource-scarce environments. It is essential to validate LLM personas against real-world human data to ensure their alignment and reliability before deploying them in social science research.

[NLP-54] Reversing Large Language Models for Efficient Training and Fine-Tuning

Link: https://arxiv.org/abs/2512.02056
Authors: Eshed Gal, Moshe Eliasof, Javier Turek, Uri Ascher, Eran Treister, Eldad Haber
Affiliations: University of British Columbia; Ben Gurion University; EarthDynamics AI; Department of Earth, Ocean and Atmospheric Sciences, University of British Columbia
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

[NLP-55] Beyond Confidence: Adaptive and Coherent Decoding for Diffusion Language Models

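[Quick Read]: This paper addresses the inconsistent sampling trajectories and suboptimal generation quality of Diffusion Language Models (DLMs) whose inference relies on local, immediate-step metrics such as confidence or entropy. Existing methods do not model the global consistency of the generation history, which hurts both coherence and efficiency. The proposed inference framework, Coherent Contextual Decoding (CCD), has two core ideas. First, a trajectory rectification mechanism uses historical context to improve sequence coherence; the authors show it is theoretically equivalent to modeling the consistency of historical steps through the conditional mutual information between context and token predictions. Second, an adaptive sampling strategy adjusts the unmasking budget at each step according to the consistency metric, replacing the conventional uniform allocation, which improves generation quality while markedly speeding up sampling.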

Link: https://arxiv.org/abs/2512.02044
Authors: Kecheng Chen, Ziru Liu, Xijia Tao, Hui Liu, Xinyu Fu, Suiyun Zhang, Dandan Tu, Lingpeng Kong, Rui Liu, Haoliang Li
Affiliations: City University of Hong Kong; Huawei Research; The University of Hong Kong
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Diffusion Language Models (DLMs) have recently achieved significant success due to their any-order generation capabilities. However, existing inference methods typically rely on local, immediate-step metrics such as confidence or entropy which inherently lack a more reliable perspective. This limitation frequently leads to inconsistent sampling trajectories and suboptimal generation quality. To address this, we propose Coherent Contextual Decoding (CCD), a novel inference framework built upon two core innovations. First, CCD employs a trajectory rectification mechanism that leverages historical context to enhance sequence coherence, enabling the early rejection of suboptimal paths. We demonstrate that this mechanism is theoretically equivalent to modeling the consistency of historical steps via the conditional mutual information between context and token predictions. Building on this theoretical insight, we further address the inefficiency of conventional uniform decoding budgets. Instead of rigid allocations based on diffusion steps, we introduce an adaptive sampling strategy that dynamically adjusts the unmasking budget for each step according to our consistency metric. Consequently, our method significantly improves the quality of generation trajectories while accelerating the sampling process. Empirically, our method achieves a simultaneous enhancement in both inference speed and performance across diverse benchmarks on Dream and LLaDA, delivering up to 3.48x speedup alongside 3.91% performance improvement.
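The "local, immediate-step metrics" that the abstract says existing decoders rely on can be illustrated with a small Python sketch. It shows only the confidence and entropy baseline, not CCD itself; the function names, the dictionary layout of `dists`, and the fixed `budget` parameter are assumptions made for illustration.

```python
import math


def confidence(probs):
    """Confidence of a masked position: probability of its most likely token."""
    return max(probs)


def entropy(probs):
    """Shannon entropy (in nats) of the token distribution at one position."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def pick_positions(dists, budget):
    """Pick the `budget` most confident masked positions to unmask this step.

    `dists` maps a masked position to its predicted token distribution.
    Only the current step is inspected; no generation history is used.
    """
    ranked = sorted(dists, key=lambda pos: confidence(dists[pos]), reverse=True)
    return sorted(ranked[:budget])
```

Because `pick_positions` looks at a single step and uses a constant budget, it has no notion of whether a prediction has been stable across earlier steps, which is the gap the abstract attributes to this family of methods.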

[NLP-56] Mirror Mirror on the Wall – Which is the Best Model of Them All?

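[Quick Read]: This paper addresses how to select, efficiently and on principled grounds, the Large Language Model (LLM) best suited to a given application while new models appear at a rapid pace. As foundation models multiply, model selection grows more complex, and approaches based on experience or a single metric no longer suffice. The paper argues for weighing a qualitative dimension (for example, model card information) and a quantitative dimension (for example, benchmark performance); it concentrates on the quantitative one and, using the medical domain as a case study, surveys how the main leaderboards and benchmarks have evolved and where they stand. The key contribution is a structured Model Selection Methodology (MSM) that guides users in navigating, prioritizing, and selecting the model that best fits a given use case.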

Link: https://arxiv.org/abs/2512.02043
Authors: Dina Sayed, Heiko Schuldt
Affiliations: University of Basel
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have become one of the most transformative tools across many applications, as they have significantly boosted productivity and achieved impressive results in various domains such as finance, healthcare, education, telecommunications, and law, among others. Typically, state-of-the-art (SOTA) foundation models are developed by large corporations based on large data collections and substantial computational and financial resources required to pretrain such models from scratch. These foundation models then serve as the basis for further development and domain adaptation for specific use cases or tasks. However, given the dynamic and fast-paced nature of launching new foundation models, the process of selecting the most suitable model for a particular use case, application, or domain becomes increasingly complex. We argue that there are two main dimensions that need to be taken into consideration when selecting a model for further training: a qualitative dimension (which model is best suited for a task based on information, for instance, taken from model cards) and a quantitative dimension (which is the best performing model). The quantitative performance of models is assessed through leaderboards, which rank models based on standardized benchmarks and provide a consistent framework for comparing different LLMs. In this work, we address the analysis of the quantitative dimension by exploring the current leaderboards and benchmarks. To illustrate this analysis, we focus on the medical domain as a case study, demonstrating the evolution, current landscape, and practical significance of this quantitative evaluation dimension. Finally, we propose a Model Selection Methodology (MSM), a systematic approach designed to guide the navigation, prioritization, and selection of the model that best aligns with a given use case.

[NLP-57] Deep Research: A Systematic Survey

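[Quick Read]: This survey addresses the limits of Large Language Models (LLMs) on open-ended, complex tasks, especially those requiring critical thinking, multi-source information, and verifiable outputs, which single-shot prompting and standard retrieval-augmented generation cannot handle. It centers on the Deep Research (DR) paradigm, which combines LLM reasoning with external tools such as search engines so that LLMs can act as research agents. The survey formalizes a three-stage roadmap; introduces four key components (query planning, information acquisition, memory management, and answer generation), each with fine-grained sub-taxonomies; and summarizes optimization techniques including prompting, supervised fine-tuning, and agentic reinforcement learning.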

Link: https://arxiv.org/abs/2512.02038
Authors: Zhengliang Shi, Yiqun Chen, Haitao Li, Weiwei Sun, Shiyu Ni, Yougang Lyu, Run-Ze Fan, Bowen Jin, Yixuan Weng, Minjun Zhu, Qiujie Xie, Xinyu Guo, Qu Yang, Jiayi Wu, Jujia Zhao, Xiaqiang Tang, Xinbei Ma, Cunxiang Wang, Jiaxin Mao, Qingyao Ai, Jen-Tse Huang, Wenxuan Wang, Yue Zhang, Yiming Yang, Zhaopeng Tu, Zhaochun Ren
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Large language models (LLMs) have rapidly evolved from text generators into powerful problem solvers. Yet, many open tasks demand critical thinking, multi-source evidence, and verifiable outputs, which are beyond single-shot prompting or standard retrieval-augmented generation. Recently, numerous studies have explored Deep Research (DR), which aims to combine the reasoning capabilities of LLMs with external tools, such as search engines, thereby empowering LLMs to act as research agents capable of completing complex, open-ended tasks. This survey presents a comprehensive and systematic overview of deep research systems, including a clear roadmap, foundational components, practical implementation techniques, important challenges, and future directions. Specifically, our main contributions are as follows: (i) we formalize a three-stage roadmap and distinguish deep research from related paradigms; (ii) we introduce four key components: query planning, information acquisition, memory management, and answer generation, each paired with fine-grained sub-taxonomies; (iii) we summarize optimization techniques, including prompting, supervised fine-tuning, and agentic reinforcement learning; and (iv) we consolidate evaluation criteria and open challenges, aiming to guide and facilitate future development. As the field of deep research continues to evolve rapidly, we are committed to continuously updating this survey to reflect the latest progress in this area.

[NLP-58] Human-Level and Beyond: Benchmarking Large Language Models Against Clinical Pharmacists in Prescription Review

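[Quick Read]: This paper addresses the lack of a systematic, fine-grained evaluation standard for Large Language Models (LLMs) on clinical prescription review, a gap that limits their reliable deployment and improvement in medical settings. The key to its solution is RxBench, a comprehensive benchmark that covers common prescription review categories and consolidates 14 frequent types of prescription errors drawn from authoritative pharmacy references; it contains 1,150 single-choice, 230 multiple-choice, and 879 short-answer items, all reviewed by experienced clinical pharmacists. Evaluating 18 state-of-the-art LLMs reveals a clear stratification in accuracy and robustness, with some top models (Gemini-2.5-pro-preview-05-06, Grok-4-0709, and DeepSeek-R1-0528) matching or exceeding licensed pharmacists on certain tasks. Guided by these results, targeted fine-tuning of a mid-tier model yields a specialized model that rivals leading general-purpose LLMs on short-answer tasks. The work establishes a standardized, error-type-oriented evaluation framework and a foundational resource for building more reliable, specialized clinical tools.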

Link: https://arxiv.org/abs/2512.02024
Authors: Yan Yang, Mouxiao Bian, Peiling Li, Bingjian Wen, Ruiyao Chen, Kangkun Mao, Xiaojun Ye, Tianbin Li, Pengcheng Chen, Bing Han, Jie Xu, Kaifeng Qiu, Junyan Wu
Affiliations: Sun Yat-sen Memorial Hospital; Shanghai Artificial Intelligence Laboratory; University of Washington
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:The rapid advancement of large language models (LLMs) has accelerated their integration into clinical decision support, particularly in prescription review. To enable systematic and fine-grained evaluation, we developed RxBench, a comprehensive benchmark that covers common prescription review categories and consolidates 14 frequent types of prescription errors drawn from authoritative pharmacy references. RxBench consists of 1,150 single-choice, 230 multiple-choice, and 879 short-answer items, all reviewed by experienced clinical pharmacists. We benchmarked 18 state-of-the-art LLMs and identified clear stratification of performance across tasks. Notably, Gemini-2.5-pro-preview-05-06, Grok-4-0709, and DeepSeek-R1-0528 consistently formed the first tier, outperforming other models in both accuracy and robustness. Comparisons with licensed pharmacists indicated that leading LLMs can match or exceed human performance in certain tasks. Furthermore, building on insights from our benchmark evaluation, we performed targeted fine-tuning on a mid-tier model, resulting in a specialized model that rivals leading general-purpose LLMs in performance on short-answer question tasks. The main contribution of RxBench lies in establishing a standardized, error-type-oriented framework that not only reveals the capabilities and limitations of frontier LLMs in prescription review but also provides a foundational resource for building more reliable and specialized clinical tools.

[NLP-59] Do Large Language Models Walk Their Talk? Measuring the Gap Between Implicit Associations, Self-Report, and Behavioral Altruism

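[Quick Read]: This paper asks whether Large Language Models (LLMs) have altruistic tendencies and whether their implicit associations and self-reports predict actual altruistic behavior. The study evaluates 24 frontier LLMs with three paradigms: an Implicit Association Test (IAT), a forced binary choice task, and a self-assessment scale. All models show a strong implicit pro-altruism bias (mean IAT = 0.87) and behave more altruistically than chance (65.6% vs. 50%), but implicit associations do not predict behavior (r = .22, p = .29). Most importantly, models exhibit a "virtue signaling gap": self-reported altruism (77.5%) is significantly higher than behavioral altruism (65.6%), an effect seen in 75% of models. The paper therefore recommends the Calibration Gap (the discrepancy between self-reported and behavioral values) as a standardized alignment metric; only 12.5% of models combine high prosocial behavior with accurate self-knowledge.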

Link: https://arxiv.org/abs/2512.01568
Authors: Sandro Andric
Affiliations: New York University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 14 pages, 7 figures, 7 tables. Code and data available at this https URL

Abstract:We investigate whether Large Language Models (LLMs) exhibit altruistic tendencies, and critically, whether their implicit associations and self-reports predict actual altruistic behavior. Using a multi-method approach inspired by human social psychology, we tested 24 frontier LLMs across three paradigms: (1) an Implicit Association Test (IAT) measuring implicit altruism bias, (2) a forced binary choice task measuring behavioral altruism, and (3) a self-assessment scale measuring explicit altruism beliefs. Our key findings are: (1) All models show strong implicit pro-altruism bias (mean IAT = 0.87, p < .0001), confirming models “know” altruism is good. (2) Models behave more altruistically than chance (65.6% vs. 50%, p < .0001), but with substantial variation (48-85%). (3) Implicit associations do not predict behavior (r = .22, p = .29). (4) Most critically, models systematically overestimate their own altruism, claiming 77.5% altruism while acting at 65.6% (p < .0001, Cohen’s d = 1.08). This “virtue signaling gap” affects 75% of models tested. Based on these findings, we recommend the Calibration Gap (the discrepancy between self-reported and behavioral values) as a standardized alignment metric. Well-calibrated models are more predictable and behaviorally consistent; only 12.5% of models achieve the ideal combination of high prosocial behavior and accurate self-knowledge.
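The Calibration Gap is defined in the abstract as the discrepancy between self-reported and behavioral values. A minimal Python sketch of that arithmetic follows, using the aggregate figures quoted above; the function name, the sign convention (self-report minus behavior), and the percentage-point unit are assumptions, since the abstract does not fix them.

```python
def calibration_gap(self_reported_pct, behavioral_pct):
    """Gap in percentage points; positive values mean the model over-claims."""
    return self_reported_pct - behavioral_pct


# Aggregate figures quoted in the abstract: 77.5% claimed vs. 65.6% enacted.
gap = calibration_gap(77.5, 65.6)
print(f"Calibration gap: {gap:.1f} percentage points")
```

Under this convention, a well-calibrated model would have a gap close to zero regardless of how altruistic its behavior is.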

[NLP-60] Graphing the Truth: Structured Visualizations for Automated Hallucination Detection in LLMs

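[Quick Read]: This paper addresses hallucinations that Large Language Models (LLMs) produce in enterprise settings because of limited context windows and inconsistencies between pre-training data and supplied knowledge, especially hallucinations that look credible and escape routine human review. The key to its solution is a framework of interactive visual knowledge graphs that links proprietary knowledge to model-generated content and indicates confidence levels, giving users a clear, intuitive view of potential hallucination zones. Users can diagnose inconsistencies, identify weak reasoning chains, and supply corrective feedback, forming a human-in-the-loop feedback loop that can improve model reliability and response quality over time.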

Link: https://arxiv.org/abs/2512.00663
Authors: Tanmay Agrawal
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models have rapidly advanced in their ability to interpret and generate natural language. In enterprise settings, they are frequently augmented with closed-source domain knowledge to deliver more contextually informed responses. However, operational constraints such as limited context windows and inconsistencies between pre-training data and supplied knowledge often lead to hallucinations, some of which appear highly credible and escape routine human review. Current mitigation strategies either depend on costly, large-scale gold-standard Q&A curation or rely on secondary model verification, neither of which offers deterministic assurance. This paper introduces a framework that organizes proprietary knowledge and model-generated content into interactive visual knowledge graphs. The objective is to provide end users with a clear, intuitive view of potential hallucination zones by linking model assertions to underlying sources of truth and indicating confidence levels. Through this visual interface, users can diagnose inconsistencies, identify weak reasoning chains, and supply corrective feedback. The resulting human-in-the-loop workflow creates a structured feedback loop that can enhance model reliability and continuously improve response quality.

[NLP-61] The brain-AI convergence: Predictive and generative world models for general-purpose computation

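[Quick Read]: This paper asks how the neocortex and cerebellum, despite their relatively uniform circuit architectures, give rise to diverse functions and ultimately to human intelligence. Its answer is a computational mechanism the two share: both build an internal world model through prediction-error learning and use it to predict future world events. The world model serves understanding in sensory processing and generation in motor processing, enabling adaptive intelligence across domains. Notably, attention-based generative AI has independently converged on a similar learning paradigm and world-model-based computation, suggesting that these mechanisms form a core computational foundation for diverse high-level functions in both biological and artificial systems.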

Link: https://arxiv.org/abs/2512.02419
Authors: Shogo Ohmae, Keiko Ohmae
Affiliations: Unknown
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Comments: 22 pages, 4 figures. Related to our earlier preprint “The brain versus AI” (arXiv:2411.16075) but a distinct article. The earlier work surveyed broad brain-AI parallels; here we focus on world-model-based computation and convergent evolution between the brain and AI, especially large language models

Abstract:Recent advances in general-purpose AI systems with attention-based transformers offer a potential window into how the neocortex and cerebellum, despite their relatively uniform circuit architectures, give rise to diverse functions and, ultimately, to human intelligence. This Perspective provides a cross-domain comparison between the brain and AI that goes beyond the traditional focus on visual processing, adopting the emerging perspective of world-model-based computation. Here, we identify shared computational mechanisms in the attention-based neocortex and the non-attentional cerebellum: both predict future world events from past inputs and construct internal world models through prediction-error learning. These predictive world models are repurposed for seemingly distinct functions (understanding in sensory processing and generation in motor processing), enabling the brain to achieve multi-domain capabilities and human-like adaptive intelligence. Notably, attention-based AI has independently converged on a similar learning paradigm and world-model-based computation. We conclude that these shared mechanisms in both biological and artificial systems constitute a core computational foundation for realizing diverse functions including high-level intelligence, despite their relatively uniform circuit structures. Our theoretical insights bridge neuroscience and AI, advancing our understanding of the computational essence of intelligence.

[NLP-62] On the Difficulty of Token-Level Modeling of Dysfluency and Fluency Shaping Artifacts

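[Quick Read]: This paper addresses a serious limitation of current end-to-end (E2E) automatic speech recognition (ASR) systems on stuttered speech: they overlook dysfluencies and fluency-shaping artifacts, yielding non-verbatim transcriptions of limited clinical and research value. The key to its solution is a parameter-efficient adaptation method that decodes dysfluencies and fluency modifications as special tokens within the transcription, combined with a multi-step fine-tuning strategy and language-adaptive pretraining to reduce the performance gap of English-dominant models on natural German stuttered speech (KSoF). A tokenization analysis further reveals the tokenizer's English-centric bias, underscoring the limits of multilingual E2E systems.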

Link: https://arxiv.org/abs/2512.02027
Authors: Kashaf Gulzar, Dominik Wagner, Sebastian P. Bayerl, Florian Hönig, Tobias Bocklet, Korbinian Riedhammer
Affiliations: Unknown
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 6 pages, 1 figure. Accepted to ASRU 2025. This is the arXiv preprint of the accepted paper

Abstract:Automatic transcription of stuttered speech remains a challenge, even for modern end-to-end (E2E) automatic speech recognition (ASR) frameworks. Dysfluencies and fluency-shaping artifacts are often overlooked, resulting in non-verbatim transcriptions with limited clinical and research value. We propose a parameter-efficient adaptation method to decode dysfluencies and fluency modifications as special tokens within transcriptions, evaluated on simulated (LibriStutter, English) and natural (KSoF, German) stuttered speech datasets. To mitigate ASR performance disparities and bias towards English, we introduce a multi-step fine-tuning strategy with language-adaptive pretraining. Tokenization analysis further highlights the tokenizer’s English-centric bias, which poses challenges for improving performance on German data. Our findings demonstrate the effectiveness of lightweight adaptation techniques for dysfluency-aware ASR while exposing key limitations in multilingual E2E systems.

Computer Vision

[CV-0] MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues

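[Quick Read]: This paper addresses the difficulty of expressing user intent in generative image editing: diffusion models rely on a single, monolithic prompt that cannot separate and control distinct intentions for content (what), position (where), structure (how), and color (palette). The key to its solution is a layered composition paradigm that decomposes user intent into four independently controllable visual cue layers (content, spatial, structural, and color), supported by a specialized data generation pipeline, a unified control module, and a fine-tuned spatial branch for precise local editing such as object removal. This keeps the semantic power of diffusion models while offering the fine-grained control of traditional graphics software.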

Link: https://arxiv.org/abs/2512.03046
Authors: Zichen Liu, Yue Yu, Hao Ouyang, Qiuyu Wang, Shuailei Ma, Ka Leong Cheng, Wen Wang, Qingyan Bai, Yuxuan Zhang, Yanhong Zeng, Yixuan Li, Xing Zhu, Yujun Shen, Qifeng Chen
Affiliations: HKUST (Hong Kong University of Science and Technology); Ant Group; NEU (Northeastern University); ZJU (Zhejiang University); CUHK (The Chinese University of Hong Kong)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code and demo available at this https URL

Abstract:We propose MagicQuill V2, a novel system that introduces a layered composition paradigm to generative image editing, bridging the gap between the semantic power of diffusion models and the granular control of traditional graphics software. While diffusion transformers excel at holistic generation, their use of singular, monolithic prompts fails to disentangle distinct user intentions for content, position, and appearance. To overcome this, our method deconstructs creative intent into a stack of controllable visual cues: a content layer for what to create, a spatial layer for where to place it, a structural layer for how it is shaped, and a color layer for its palette. Our technical contributions include a specialized data generation pipeline for context-aware content integration, a unified control module to process all visual cues, and a fine-tuned spatial branch for precise local editing, including object removal. Extensive experiments validate that this layered approach effectively resolves the user intention gap, granting creators direct, intuitive control over the generative process.

[CV-1] CAMEO: Correspondence-Attention Alignment for Multi-View Diffusion Models

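[Quick Read]: This paper addresses the unclear mechanism behind view consistency in multi-view diffusion models for novel view synthesis, in particular how attention comes to encode geometric correspondence across views. The key to its solution is CAMEO, a training technique that directly supervises attention maps with geometric correspondence so the model learns more precise cross-view correspondences. Supervising a single attention layer is enough to improve training efficiency, convergence speed, and generation quality, and the method is model-agnostic and applicable to any multi-view diffusion model.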

Link: https://arxiv.org/abs/2512.03045
Authors: Minkyung Kwon, Jinhyeok Choi, Jiho Park, Seonghu Jeon, Jinhyuk Jang, Junyoung Seo, Minseop Kwak, Jin-Hwa Kim, Seungryong Kim
Affiliations: KAIST AI; NAVER AI Lab; SNU AIIS
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Multi-view diffusion models have recently emerged as a powerful paradigm for novel view synthesis, yet the underlying mechanism that enables their view-consistency remains unclear. In this work, we first verify that the attention maps of these models acquire geometric correspondence throughout training, attending to the geometrically corresponding regions across reference and target views for view-consistent generation. However, this correspondence signal remains incomplete, with its accuracy degrading under large viewpoint changes. Building on these findings, we introduce CAMEO, a simple yet effective training technique that directly supervises attention maps using geometric correspondence to enhance both the training efficiency and generation quality of multi-view diffusion models. Notably, supervising a single attention layer is sufficient to guide the model toward learning precise correspondences, thereby preserving the geometry and structure of reference images, accelerating convergence, and improving novel view synthesis performance. CAMEO reduces the number of training iterations required for convergence by half while achieving superior performance at the same iteration counts. We further demonstrate that CAMEO is model-agnostic and can be applied to any multi-view diffusion model.

[CV-2] OneThinker: All-in-one Reasoning Model for Image and Video

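Quick read: This paper addresses task separation, modality fragmentation, and lack of generality in visual reasoning with Multimodal Large Language Models (MLLMs). Existing methods usually train a separate model per task and treat image and video understanding as unrelated domains, which makes it hard to scale toward a unified multimodal reasoning generalist and limits both knowledge sharing across tasks and modalities and practical flexibility. The key to the solution is OneThinker, an end-to-end reasoning framework that unifies image and video understanding across 10 fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. The authors build the OneThinker-600k training corpus and annotate it with Chain-of-Thought (CoT) traces from commercial models, yielding OneThinker-SFT-340k for the SFT cold start. They also design EMA-GRPO, which tracks task-wise moving averages of reward standard deviations to mitigate reward heterogeneity in multi-task reinforcement learning and balance optimization. Experiments show strong performance on 31 visual benchmarks, along with some cross-task knowledge transfer and preliminary zero-shot generalization.

Link: https://arxiv.org/abs/2512.03043
Authors: Kaituo Feng,Manyuan Zhang,Hongyu Li,Kaixuan Fan,Shuang Chen,Yilei Jiang,Dian Zheng,Peiwen Sun,Yiyuan Zhang,Haoze Sun,Yan Feng,Peng Pei,Xunliang Cai,Xiangyu Yue
Affiliations: MMLab, CUHK; Meituan LongCat Team
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL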

Abstract:Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains. This results in limited scalability toward a multimodal reasoning generalist, which restricts practical versatility and hinders potential knowledge sharing across tasks and modalities. To this end, we propose OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the OneThinker-600k training corpus covering all these tasks and employ commercial models for CoT annotation, resulting in OneThinker-SFT-340k for SFT cold start. Furthermore, we propose EMA-GRPO to handle reward heterogeneity in multi-task RL by tracking task-wise moving averages of reward standard deviations for balanced optimization. Extensive experiments on diverse visual benchmarks show that OneThinker delivers strong performance on 31 benchmarks, across 10 fundamental visual understanding tasks. Moreover, it exhibits effective knowledge transfer between certain tasks and preliminary zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist. All code, model, and data are released.
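
The idea of tracking task-wise moving averages of reward standard deviations can be sketched as follows. This is a hedged illustration, not the paper's algorithm: the class name `TaskwiseEmaNormalizer`, the decay value, and the choice to divide group-centered rewards by the per-task EMA are all assumptions made for the example.

```python
from statistics import fmean, pstdev

class TaskwiseEmaNormalizer:
    """Keeps one exponential moving average (EMA) of reward std per task."""

    def __init__(self, decay=0.99, eps=1e-6):
        self.decay = decay    # assumed smoothing factor
        self.eps = eps        # guards against division by zero
        self.ema_std = {}     # task name -> smoothed reward std

    def advantages(self, task, rewards):
        """Center a group of rewards and scale by the task's EMA std."""
        mean = fmean(rewards)
        std = pstdev(rewards)
        previous = self.ema_std.get(task)
        if previous is None:
            ema = std  # first observation initializes the average
        else:
            ema = self.decay * previous + (1 - self.decay) * std
        self.ema_std[task] = ema
        return [(r - mean) / (ema + self.eps) for r in rewards]

normalizer = TaskwiseEmaNormalizer(decay=0.9)
# Binary rewards (e.g. question answering) versus tightly clustered
# continuous rewards (e.g. an overlap score for segmentation).
qa_adv = normalizer.advantages("qa", [0.0, 1.0, 1.0, 0.0])
seg_adv = normalizer.advantages("segmentation", [0.50, 0.52, 0.48, 0.50])
```

Because each task is scaled by its own smoothed statistic, tasks with very different reward ranges contribute advantages of comparable magnitude, and a single degenerate group (all rewards equal) does not collapse the scale for that task.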

[CV-3] PPTArena: A Benchmark for Agentic PowerPoint Editing

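Quick read: This paper addresses the lack of a reliable evaluation benchmark and of an efficient structure-aware editing agent for PowerPoint (PPT) editing. Existing work mostly targets image renderings or text-to-slide generation and struggles with precise, semantically consistent in-place editing of real slides. In response, the authors propose the PPTArena benchmark, covering 100 decks, 2125 slides, and over 800 editing instructions on text, charts, tables, animations, and master-level styles, with a dual VLM-as-judge pipeline that separately scores instruction following and visual quality. The core of the solution is PPTPilot, a structure-aware slide-editing agent built on three ideas: planning semantic edit sequences, routing between high-level programmatic tools and deterministic XML operations for precise control, and an iterative plan-edit-check loop that verifies outputs against task constraints. Experiments show that PPTPilot clearly outperforms mainstream proprietary agents and frontier Vision-Language Models (VLMs) on compound, layout-sensitive, and cross-slide edits, with particularly large gains in visual fidelity and deck-wide consistency, though long-horizon, document-scale tasks remain challenging.

Link: https://arxiv.org/abs/2512.03042
Authors: Michael Ofengenden,Yunze Man,Ziqi Pang,Yu-Xiong Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 25 pages, 26 figures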

Abstract:We introduce PPTArena, a benchmark for PowerPoint editing that measures reliable modifications to real slides under natural-language instructions. In contrast to image-PDF renderings or text-to-slide generation, PPTArena focuses on in-place editing across 100 decks, 2125 slides, and over 800 targeted edits covering text, charts, tables, animations, and master-level styles. Each case includes a ground-truth deck, a fully specified target outcome, and a dual VLM-as-judge pipeline that separately scores instruction following and visual quality using both structural diffs and slide images. Building on this setting, we propose PPTPilot, a structure-aware slide-editing agent that plans semantic edit sequences, routes between high-level programmatic tools and deterministic XML operations for precise control, and verifies outputs through an iterative plan-edit-check loop against task-specific constraints. In our experiments, PPTPilot outperforms strong proprietary agents and frontier VLM systems by over 10 percentage points on compound, layout-sensitive, and cross-slide edits, with particularly large gains in visual fidelity and deck-wide consistency. Despite these improvements, existing agents still underperform on long-horizon, document-scale tasks in PPTArena, highlighting the remaining challenges in reliable PPT editing.

[CV-4] MultiShotMaster: A Controllable Multi-Shot Video Generation Framework

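Quick read: This paper addresses the limits of current video generation on multi-shot video, namely the difficulty of achieving flexible shot arrangement, coherent narrative, and controllability beyond text prompts. The core solution is the MultiShotMaster framework, with two key innovations: 1) Multi-Shot Narrative RoPE, which applies an explicit phase shift at shot transitions to combine flexible shot ordering with temporal narrative consistency; 2) Spatiotemporal Position-Aware RoPE, which incorporates reference tokens and grounding signals to support spatiotemporally grounded reference injection. To ease data scarcity, the authors also build an automated annotation pipeline that extracts multi-shot videos, captions, cross-shot grounding signals, and reference images, enabling text-driven inter-shot consistency, controlled subject motion, and background-driven scene customization.

Link: https://arxiv.org/abs/2512.03041
Authors: Qinghe Wang,Xiaoyu Shi,Baolu Li,Weikang Bian,Quande Liu,Huchuan Lu,Xintao Wang,Pengfei Wan,Kun Gai,Xu Jia
Affiliations: Dalian University of Technology; Kling Team, Kuaishou Technology; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL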

Abstract:Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporal-grounded reference injection. In addition, to overcome data scarcity, we establish an automated data annotation pipeline to extract multi-shot videos, captions, cross-shot grounding signals and reference images. Our framework leverages the intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subject with motion control, and background-driven customized scene. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.
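
One way to picture an explicit phase shift at shot transitions is as an extra jump in the temporal position index that feeds the rotary embedding. The sketch below is only an illustration under that assumption: the function names, the constant `shift`, and the single-frequency rotation are hypothetical and are not taken from the paper.

```python
import math

def narrative_positions(shot_lengths, shift):
    """Temporal position index per frame, with a jump at each shot cut.

    Frames inside a shot advance by 1; every shot transition adds an
    extra `shift`, so the order of shots is preserved while frames from
    different shots are pushed apart in rotary phase.
    """
    positions = []
    next_position = 0
    for shot_index, length in enumerate(shot_lengths):
        if shot_index > 0:
            next_position += shift  # explicit jump at the shot transition
        for _ in range(length):
            positions.append(next_position)
            next_position += 1
    return positions

def rotate_pair(x, y, position, freq):
    """Standard rotary rotation of one 2D feature pair by position * freq."""
    angle = position * freq
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

# Two shots of 3 and 2 frames with an assumed shift of 4 positions.
positions = narrative_positions([3, 2], shift=4)
rotated = [rotate_pair(1.0, 0.0, p, freq=0.1) for p in positions]
```

With `shift=0` the indices reduce to ordinary consecutive frame positions, which is why a scheme of this kind can extend a pretrained single-shot model without changing its behavior inside a shot.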

[CV-5] Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation

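Quick read: This paper asks whether video generative models possess visuospatial intelligence, that is, whether complex spatial understanding and reasoning can be achieved from visual data alone. The key to the solution is the Video4Spatial framework, which uses only video-based scene context, without auxiliary modalities such as depth maps or camera poses, to perform two complex spatial tasks: scene navigation and object grounding. Its effectiveness comes from simple but effective design choices and careful data curation, which let the model plan paths, ground targets, follow camera-pose instructions, and maintain spatial consistency end-to-end, while generalizing well to long video contexts and out-of-domain environments.

Link: https://arxiv.org/abs/2512.03040
Authors: Zeqi Xiao,Yiwei Zhao,Lingxiao Li,Yushi Lan,Yu Ning,Rahul Garg,Roshni Cooper,Mohammad H. Taghavi,Xingang Pan
Affiliations: Netflix; Nanyang Technological University; University of Oxford; Netflix Eyeline Studios
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page at this https URL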

Abstract:We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks. We validate on two tasks: scene navigation - following camera-pose instructions while remaining consistent with 3D geometry of the scene, and object grounding - which requires semantic localization, instruction following, and planning. Both tasks use video-only inputs, without auxiliary modalities such as depth or poses. With simple yet effective design choices in the framework and data curation, Video4Spatial demonstrates strong spatial understanding from video context: it plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments. Taken together, these results advance video generative models toward general visuospatial reasoning.

[CV-6] ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation

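Quick read: This paper addresses the long-standing lack of spatial immersion in video-to-audio generation. Existing methods are mostly limited to mono output, while binaural methods built on a two-stage pipeline generate mono audio first and spatialize it afterwards, which often leads to error accumulation and spatio-temporal inconsistencies. The key to the solution is ViSAudio, an end-to-end binaural spatial audio generation framework that uses conditional flow matching with a dual-branch audio generation architecture to model the audio latent flows of the left-ear and right-ear channels separately. A conditional spacetime module keeps the two channels consistent while preserving their distinctive spatial characteristics, giving precise spatio-temporal alignment between the audio and the input video.

Link: https://arxiv.org/abs/2512.03036
Authors: Mengchen Zhang,Qi Chen,Tong Wu,Zihan Liu,Dahua Lin
Affiliations: Zhejiang University; Shanghai Artificial Intelligence Laboratory; Shanghai Jiao Tong University; Shanghai Innovation Institute; Stanford University; Beihang University; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: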

Abstract:Despite progress in video-to-audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain constrained by a two-stage pipeline that first generates mono audio and then performs spatialization, often resulting in error accumulation and spatio-temporal inconsistencies. To address this limitation, we introduce the task of end-to-end binaural spatial audio generation directly from silent video. To support this task, we present the BiAudio dataset, comprising approximately 97K video-binaural audio pairs spanning diverse real-world scenes and camera rotation trajectories, constructed through a semi-automated pipeline. Furthermore, we propose ViSAudio, an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture, where two dedicated branches model the audio latent flows. Integrated with a conditional spacetime module, it balances consistency between channels while preserving distinctive spatial characteristics, ensuring precise spatio-temporal alignment between audio and the input video. Comprehensive experiments demonstrate that ViSAudio outperforms existing state-of-the-art methods across both objective metrics and subjective evaluations, generating high-quality binaural audio with spatial immersion that adapts effectively to viewpoint changes, sound-source motion, and diverse acoustic environments. Project website: this https URL.

[CV-7] MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation

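Quick read: This paper addresses long-standing challenges in multimodal audio-visual dialogue understanding and generation, namely how to achieve seamless multimodal fusion and generate coherent, natural, identity-consistent long-duration interactions. Existing methods are mostly limited to non-interactive systems and struggle to produce context-appropriate dialogue output with high-quality audio-video synchronization. The key solution is a Conductor-Creator architecture: the Conductor interprets user intent and breaks it down into motion and speech components for fine-grained control; the Creator, following these instructions, uses a dual structure that combines an autoregressive (AR) model and a diffusion model to generate high-quality audio and video respectively, keeping audio and video stable and consistent over time. The paper also designs a novel fusion module that strengthens cross-modal connections between consecutive clips, improving the synchronization and semantic coherence of long-duration audio-visual content.

Link: https://arxiv.org/abs/2512.03034
Authors: Youxin Pang,Jiajun Liu,Lingfeng Tan,Yong Zhang,Feng Gao,Xiang Deng,Zhuoliang Kang,Xiaoming Wei,Yebin Liu
Affiliations: Tsinghua University; Meituan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Our project website is this https URL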

Abstract:We propose MAViD, a novel Multimodal framework for Audio-Visual Dialogue understanding and generation. Existing approaches primarily focus on non-interactive systems and are limited to producing constrained and unnatural human […]. The primary challenge of this task lies in effectively integrating understanding and generation capabilities, as well as achieving seamless multimodal audio-video fusion. To solve these problems, we propose a Conductor-Creator architecture that divides the dialogue system into two primary […]. The Conductor is tasked with understanding, reasoning, and generating instructions by breaking them down into motion and speech components, thereby enabling fine-grained control over interactions. The Creator then delivers interactive responses based on these […]. […], to address the difficulty of generating long videos with consistent identity, timbre, and tone using dual DiT structures, the Creator adopts a structure that combines autoregressive (AR) and diffusion models. The AR model is responsible for audio generation, while the diffusion model ensures high-quality video […]. […], we propose a novel fusion module to enhance connections between contextually consecutive clips and modalities, enabling synchronized long-duration audio-visual content […]. […] experiments demonstrate that our framework can generate vivid and contextually coherent long-duration dialogue interactions and accurately interpret users’ multimodal queries.

[CV-8] SMP: Reusable Score-Matching Motion Priors for Physics-Based Character Control

Quick read: This paper addresses the poor reusability of motion priors produced by traditional adversarial imitation learning: such priors usually have to be retrained for each new controller, and the original reference motion data must be retained when training downstream tasks, which limits their generality and practicality. The key to the solution is Score-Matching Motion Priors (SMP), which use pre-trained motion diffusion models and score distillation sampling (SDS) to build reusable, task-agnostic motion priors. An SMP can be pre-trained independently on a large-scale motion dataset, then frozen and used directly as a general-purpose reward function for downstream policy training, enabling efficient, modular, and style-composable generation of naturalistic behaviors.

Link: https://arxiv.org/abs/2512.03028
Authors: Yuxuan Mu,Ziyu Zhang,Yi Shi,Minami Matsumoto,Kotaro Imamura,Guy Tevet,Chuan Guo,Michael Taylor,Chang Shu,Pengcheng Xi,Xue Bin Peng
Affiliations: Simon Fraser University; Sony Playstation; Stanford University; Meta; National Research Council Canada; NVIDIA
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 14 pages, 9 figures

Abstract:Data-driven motion priors that can guide agents toward producing naturalistic behaviors play a pivotal role in creating life-like virtual characters. Adversarial imitation learning has been a highly effective method for learning motion priors from reference motion data. However, adversarial priors, with few exceptions, need to be retrained for each new controller, thereby limiting their reusability and necessitating the retention of the reference motion data when training on downstream tasks. In this work, we present Score-Matching Motion Priors (SMP), which leverages pre-trained motion diffusion models and score distillation sampling (SDS) to create reusable task-agnostic motion priors. SMPs can be pre-trained on a motion dataset, independent of any control policy or task. Once trained, SMPs can be kept frozen and reused as general-purpose reward functions to train policies to produce naturalistic behaviors for downstream tasks. We show that a general motion prior trained on large-scale datasets can be repurposed into a variety of style-specific priors. Furthermore, SMP can compose different styles to synthesize new styles not present in the original dataset. Our method produces high-quality motion comparable to state-of-the-art adversarial imitation learning methods through reusable and modular motion priors. We demonstrate the effectiveness of SMP across a diverse suite of control tasks with physically simulated humanoid characters. Video demo available at this https URL

[CV-9] Unrolled Networks are Conditional Probability Flows in MRI Reconstruction

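Quick read: This paper addresses the long acquisition time that limits the clinical use of Magnetic Resonance Imaging (MRI). Among deep-learning acceleration methods, unrolled networks are efficient but evolve unstably because the parameters of intermediate steps are learned freely, whereas diffusion models are theoretically stable but computationally expensive. The key to the solution is a theoretical proof that unrolled networks are discrete implementations of a conditional probability flow ODE, which gives the network parameters an explicit mathematical formulation. Building on this, the authors propose Flow-Aligned Training (FLAT), which derives the parameters from the ODE discretization and guides intermediate reconstructions along the ideal ODE trajectory, markedly improving stability and convergence. Experiments on three MRI datasets show high-quality reconstruction with up to 3× fewer iterations than diffusion models and better stability than conventional unrolled networks.

Link: https://arxiv.org/abs/2512.03020
Authors: Kehan Qi,Saumya Gupta,Qingqiao Hu,Weimin Lyu,Chao Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: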

Abstract:Magnetic Resonance Imaging (MRI) offers excellent soft-tissue contrast without ionizing radiation, but its long acquisition time limits clinical utility. Recent methods accelerate MRI by under-sampling k-space and reconstructing the resulting images using deep learning. Unrolled networks have been widely used for the reconstruction task due to their efficiency, but suffer from unstable evolving caused by freely-learnable parameters in intermediate steps. In contrast, diffusion models based on stochastic differential equations offer theoretical stability in both medical and natural image tasks but are computationally expensive. In this work, we introduce flow ODEs to MRI reconstruction by theoretically proving that unrolled networks are discrete implementations of conditional probability flow ODEs. This connection provides explicit formulations for parameters and clarifies how intermediate states should evolve. Building on this insight, we propose Flow-Aligned Training (FLAT), which derives unrolled parameters from the ODE discretization and aligns intermediate reconstructions with the ideal ODE trajectory to improve stability and convergence. Experiments on three MRI datasets show that FLAT achieves high-quality reconstructions with up to 3× fewer iterations than diffusion-based generative models and significantly greater stability than unrolled networks.
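
Reading an unrolled network as a discretized ODE can be illustrated with a plain explicit Euler scheme, where each unrolled iteration is one step and the step size comes from the time grid instead of being learned freely. The sketch below is a generic illustration under that reading: `toy_velocity` is a hypothetical stand-in for the learned, measurement-conditioned velocity field, and nothing here reproduces the paper's actual derivation.

```python
def euler_unroll(x0, velocity, times):
    """Run one explicit Euler step per unrolled iteration.

    x0:       initial state (list of floats)
    velocity: callable (x, t) -> list of floats, the ODE right-hand side
    times:    increasing time grid; it fixes both the number of
              iterations and each step size
    Returns the whole trajectory so intermediate states can be inspected.
    """
    x = list(x0)
    trajectory = [list(x)]
    for t_cur, t_next in zip(times, times[1:]):
        h = t_next - t_cur  # step size set by the discretization
        v = velocity(x, t_cur)
        x = [xi + h * vi for xi, vi in zip(x, v)]
        trajectory.append(list(x))
    return trajectory

# Hypothetical velocity field that pulls the state toward a fixed target,
# standing in for a network conditioned on the measurements.
target = [1.0, -2.0]

def toy_velocity(x, t):
    return [ti - xi for xi, ti in zip(x, target)]

times = [k / 8 for k in range(9)]  # 8 unrolled iterations on [0, 1]
trajectory = euler_unroll([0.0, 0.0], toy_velocity, times)
errors = [max(abs(a - b) for a, b in zip(x, target)) for x in trajectory]
```

Keeping the full trajectory is the useful part of the design: intermediate reconstructions are available to be compared against a reference path, which is the kind of alignment the abstract describes.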

[CV-10] AutoBrep: Autoregressive B-Rep Generation with Unified Topology and Geometry SIGGRAPH

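Quick read: This paper addresses the difficulty of directly generating high-quality, watertight Boundary Representation (B-Rep) models end-to-end. Existing methods struggle to guarantee geometric precision and topological validity at the same time, while B-Reps are the standard data structure in Computer-Aided Design (CAD), so building them automatically matters for industrial design automation. The key to the solution is AutoBrep, a Transformer-based autoregressive generative model with a unified discrete tokenization scheme: geometric primitives of the B-Rep (surfaces and curves) are encoded as latent geometry tokens, structural relationships are encoded as special topological reference tokens, and the sequence order follows a breadth-first traversal of the face adjacency graph. With this design the model progressively generates neighboring faces and edges together with their topology at inference time, producing high-fidelity, topologically valid B-Reps and supporting user-controllable autocompletion.

Link: https://arxiv.org/abs/2512.03018
Authors: Xiang Xu,Pradeep Kumar Jayaraman,Joseph G. Lambourne,Yilin Liu,Durvesh Malpure,Pete Meltzer
Affiliations: Autodesk Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to SIGGRAPH Asia 2025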

Abstract:The boundary representation (B-Rep) is the standard data structure used in Computer-Aided Design (CAD) for defining solid models. Despite recent progress, directly generating B-Reps end-to-end with precise geometry and watertight topology remains a challenge. This paper presents AutoBrep, a novel Transformer model that autoregressively generates B-Reps with high quality and validity. AutoBrep employs a unified tokenization scheme that encodes both geometric and topological characteristics of a B-Rep model as a sequence of discrete tokens. Geometric primitives (i.e., surfaces and curves) are encoded as latent geometry tokens, and their structural relationships are defined as special topological reference tokens. Sequence order in AutoBrep naturally follows a breadth first traversal of the B-Rep face adjacency graph. At inference time, neighboring faces and edges along with their topological structure are progressively generated. Extensive experiments demonstrate the advantages of our unified representation when coupled with next-token prediction for B-Rep generation. AutoBrep outperforms baselines with better quality and watertightness. It is also highly scalable to complex solids with good fidelity and inference speed. We further show that autocompleting B-Reps is natively supported through our unified tokenization, enabling user-controllable CAD generation with minimal changes. Code is available at this https URL.
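
The breadth-first ordering of faces mentioned in the abstract can be sketched on a toy face adjacency graph. The start face, the tie-breaking rule (ascending face id), and the cube example are assumptions for illustration; the tokens themselves are not modeled here.

```python
from collections import deque

def bfs_face_order(adjacency, start):
    """Return faces in breadth-first order over the face adjacency graph.

    adjacency: dict mapping a face id to the ids of faces sharing an edge
    start:     face id at which the traversal begins (assumed choice)
    Neighbors are visited in ascending id order, an assumed tie-break.
    """
    order = []
    seen = {start}
    queue = deque([start])
    while queue:
        face = queue.popleft()
        order.append(face)
        for neighbor in sorted(adjacency[face]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order

# A cube: faces 0..5, where face f is opposite face 5 - f and shares an
# edge with the other four faces.
cube = {f: [g for g in range(6) if g not in (f, 5 - f)] for f in range(6)}
face_sequence = bfs_face_order(cube, start=0)
```

An ordering like this means every newly emitted face is adjacent to something already in the sequence, which fits the abstract's description of neighboring faces and edges being generated progressively.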

[CV-11] Instant Video Models: Universal Adapters for Stabilizing Image-Based Networks NEURIPS2025

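Quick read: This paper addresses the temporal inconsistency that appears when frame-based vision models are applied to video, where outputs tend to flicker or otherwise become unstable, especially when the input contains time-varying image corruptions such as compression artifacts, noise, and adverse weather. The key to the solution is a general stability adapter mechanism that can be inserted into almost any network architecture, combined with an efficient training strategy that keeps the backbone frozen. A unified accuracy-stability-robustness loss is introduced, and a theoretical analysis identifies the conditions under which it yields well-behaved training. Experiments on denoising (NAFNet), image enhancement (HDRNet), monocular depth estimation (Depth Anything v2), and semantic segmentation (DeepLabv3+) show clearly improved temporal stability and robustness to various corruptions while preserving or improving prediction quality.

Link: https://arxiv.org/abs/2512.03014
Authors: Matthew Dutson,Nathan Labiosa,Yin Li,Mohit Gupta
Affiliations: University of Wisconsin–Madison
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025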

Abstract:When applied sequentially to video, frame-based networks often exhibit temporal inconsistency - for example, outputs that flicker between frames. This problem is amplified when the network inputs contain time-varying corruptions. In this work, we introduce a general approach for adapting frame-based models for stable and robust inference on video. We describe a class of stability adapters that can be inserted into virtually any architecture and a resource-efficient training process that can be performed with a frozen base network. We introduce a unified conceptual framework for describing temporal stability and corruption robustness, centered on a proposed accuracy-stability-robustness loss. By analyzing the theoretical properties of this loss, we identify the conditions where it produces well-behaved stabilizer training. Our experiments validate our approach on several vision tasks including denoising (NAFNet), image enhancement (HDRNet), monocular depth (Depth Anything v2), and semantic segmentation (DeepLabv3+). Our method improves temporal stability and robustness against a range of image corruptions (including compression artifacts, noise, and adverse weather), while preserving or improving the quality of predictions.
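
The trade-off between accuracy and temporal stability can be made concrete with a toy loss on per-frame scalar outputs. This is a simplified, assumed form with only two terms (squared error to the target plus a penalty on frame-to-frame change); the paper's accuracy-stability-robustness loss is not specified in the abstract, and the robustness term is omitted here.

```python
def accuracy_stability_loss(outputs, targets, stability_weight=0.5):
    """Toy two-term loss over a sequence of per-frame scalar outputs.

    accuracy:  mean squared error between outputs and targets
    stability: mean squared change between consecutive outputs
    The weighting and both term definitions are illustrative assumptions.
    """
    n = len(outputs)
    accuracy = sum((o - t) ** 2 for o, t in zip(outputs, targets)) / n
    jumps = [(b - a) ** 2 for a, b in zip(outputs, outputs[1:])]
    stability = sum(jumps) / len(jumps) if jumps else 0.0
    return accuracy + stability_weight * stability

targets = [1.0, 1.0, 1.0, 1.0]
flickering = [1.2, 0.8, 1.2, 0.8]  # same per-frame error, alternating sign
steady = [1.2, 1.2, 1.2, 1.2]      # same per-frame error, no flicker

loss_flickering = accuracy_stability_loss(flickering, targets)
loss_steady = accuracy_stability_loss(steady, targets)
```

Both sequences have identical per-frame error, so a purely frame-based loss cannot tell them apart; only the stability term separates the flickering output from the steady one.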

[CV-12] In-Context Sync-LoRA for Portrait Video Editing

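Quick read: This paper addresses how to make diverse visual modifications in portrait video editing (such as appearance changes, expression edits, or object additions) while keeping frame-accurate synchronization and identity consistency. The core challenge is ensuring that every edited frame stays strictly time-aligned with the corresponding frame of the source video, avoiding motion distortion or inconsistent behavior. The key to the solution is Sync-LoRA: an image-to-video diffusion model takes an edited first frame and propagates the edit through the whole sequence, and an in-context LoRA (Low-Rank Adaptation) is trained on paired videos that share the same motion trajectory but differ in appearance, with a synchronization-based filter automatically selecting the most temporally aligned pairs for training. This design lets the model combine motion cues from the source video with the visual changes introduced in the first frame, achieving high fidelity and strong temporal consistency across unseen identities and complex edits.

Link: https://arxiv.org/abs/2512.03013
Authors: Sagi Polaczek,Or Patashnik,Ali Mahdavi-Amiri,Daniel Cohen-Or
Affiliations: Tel Aviv University; Simon Fraser University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Project page: this https URL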

Abstract:Editing portrait videos is a challenging task that requires flexible yet precise control over a wide range of modifications, such as appearance changes, expression edits, or the addition of objects. The key difficulty lies in preserving the subject’s original temporal behavior, demanding that every edited frame remains precisely synchronized with the corresponding source frame. We present Sync-LoRA, a method for editing portrait videos that achieves high-quality visual modifications while maintaining frame-accurate synchronization and identity consistency. Our approach uses an image-to-video diffusion model, where the edit is defined by modifying the first frame and then propagated to the entire sequence. To enable accurate synchronization, we train an in-context LoRA using paired videos that depict identical motion trajectories but differ in appearance. These pairs are automatically generated and curated through a synchronization-based filtering process that selects only the most temporally aligned examples for training. This training setup teaches the model to combine motion cues from the source video with the visual changes introduced in the edited first frame. Trained on a compact, highly curated set of synchronized human portraits, Sync-LoRA generalizes to unseen identities and diverse edits (e.g., modifying appearance, adding objects, or changing backgrounds), robustly handling variations in pose and expression. Our results demonstrate high visual fidelity and strong temporal coherence, achieving a robust balance between edit fidelity and precise motion preservation.

[CV-13] SurfFill: Completion of LiDAR Point Clouds via Gaussian Surfel Splatting

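Quick read: This paper addresses the tendency of LiDAR scans to miss thin geometric structures and regions of dark, light-absorbing material, while also overcoming the limited accuracy of conventional image-based 3D photogrammetry in textureless regions. The key to the solution is SurfFill, a Gaussian surfel-based LiDAR point cloud completion method. It first identifies artifact regions caused by laser beam divergence (mostly at thin structures and edges), using changes in point cloud density to define an ambiguity heuristic that locates missed areas. It then constrains Gaussian surfel reconstruction to these ambiguous areas, focusing optimization and densification there. Finally, it extracts and samples the Gaussian primitives to generate completion points. To handle large-scale reconstruction, a divide-and-conquer scheme extends the pipeline to building-sized point clouds.

Link: https://arxiv.org/abs/2512.03010
Authors: Svenja Strobel,Matthias Innmann,Bernhard Egger,Marc Stamminger,Linus Franke
Affiliations: Friedrich-Alexander-Universität Erlangen-Nürnberg; NavVis GmbH; Inria & Université Côte d’Azur
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
Comments: Project page: this https URL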

Abstract:LiDAR-captured point clouds are often considered the gold standard in active 3D reconstruction. While their accuracy is exceptional in flat regions, the capturing is susceptible to miss small geometric structures and may fail with dark, absorbent materials. Alternatively, capturing multiple photos of the scene and applying 3D photogrammetry can infer these details as they often represent feature-rich regions. However, the accuracy of LiDAR for featureless regions is rarely reached. Therefore, we suggest combining the strengths of LiDAR and camera-based capture by introducing SurfFill: a Gaussian surfel-based LiDAR completion scheme. We analyze LiDAR capturings and attribute LiDAR beam divergence as a main factor for artifacts, manifesting mostly at thin structures and edges. We use this insight to introduce an ambiguity heuristic for completed scans by evaluating the change in density in the point cloud. This allows us to identify points close to missed areas, which we can then use to grow additional points from to complete the scan. For this point growing, we constrain Gaussian surfel reconstruction [Huang et al. 2024] to focus optimization and densification on these ambiguous areas. Finally, Gaussian primitives of the reconstruction in ambiguous areas are extracted and sampled for points to complete the point cloud. To address the challenges of large-scale reconstruction, we extend this pipeline with a divide-and-conquer scheme for building-sized point cloud completion. We evaluate on the task of LiDAR point cloud completion of synthetic and real-world scenes and find that our method outperforms previous reconstruction methods.

[CV-14] DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images

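Quick read: This paper addresses fast, scalable 4D reconstruction and re-simulation of dynamic environments for autonomous driving. Existing methods generally rely on per-scene optimization, known camera calibration, or short frame windows, which makes them slow and hard to scale. The key to the solution is the Driving Gaussian Grounded Transformer (DGGT), a unified framework that needs no camera poses as input: pose is treated as a model output, so reconstruction works directly from sparse, unposed images and supports an arbitrary number of views for long sequences. The method jointly predicts per-frame 3D Gaussian maps and camera parameters, disentangles dynamic objects with a lightweight dynamic head, preserves temporal consistency with a lifespan head, and uses diffusion-based rendering refinement to improve novel-view quality under sparse inputs, giving efficient end-to-end reconstruction in a single forward pass.

Link: https://arxiv.org/abs/2512.03004
Authors: Xiaoxue Chen,Ziyi Xiong,Yuantao Chen,Gen Li,Nan Wang,Hongcheng Luo,Long Chen,Haiyang Sun,Bing Wang,Guang Chen,Hangjun Ye,Hongyang Li,Ya-Qin Zhang,Hao Zhao
Affiliations: Tsinghua University; Xiaomi EV; The University of Hong Kong; Beijing Academy of Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: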

Abstract:Autonomous driving needs fast, scalable 4D reconstruction and re-simulation for training and evaluation, yet most methods for dynamic driving scenes still rely on per-scene optimization, known camera calibration, or short frame windows, making them slow and impractical. We revisit this problem from a feedforward perspective and introduce Driving Gaussian Grounded Transformer (DGGT), a unified framework for pose-free dynamic scene reconstruction. We note that the existing formulations, treating camera pose as a required input, limit flexibility and scalability. Instead, we reformulate pose as an output of the model, enabling reconstruction directly from sparse, unposed images and supporting an arbitrary number of views for long sequences. Our approach jointly predicts per-frame 3D Gaussian maps and camera parameters, disentangles dynamics with a lightweight dynamic head, and preserves temporal consistency with a lifespan head that modulates visibility over time. A diffusion-based rendering refinement further reduces motion/interpolation artifacts and improves novel-view quality under sparse inputs. The result is a single-pass, pose-free algorithm that achieves state-of-the-art performance and speed. Trained and evaluated on large-scale driving benchmarks (Waymo, nuScenes, Argoverse2), our method outperforms prior work both when trained on each dataset and in zero-shot transfer across datasets, and it scales well as the number of input frames increases.

[CV-15] DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling

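Quick read: This paper addresses the difficulty current foundation models have in accurately understanding real-world dynamic physical scenes from monocular video. Existing datasets are limited to simulators or traditional SfM (Structure-from-Motion) pipelines and lack metric-scale annotation and descriptive captions, which leaves models weak at capturing real-world dynamic structure, motion, and semantics. The key to the solution is the DynamicVerse framework, which combines large vision, geometric, and multimodal models to jointly recover metric-scale static geometry, dynamic motion, instance-level masks, and holistic descriptive captions. By integrating window-based Bundle Adjustment with global optimization, it converts long real-world video sequences into a unified 4D multimodal format, improving metric-scale accuracy and global consistency on video depth estimation, camera pose estimation, and camera intrinsics estimation.

Link: https://arxiv.org/abs/2512.03000
Authors: Kairun Wen,Yuzhi Huang,Runyu Chen,Hui Zheng,Yunlong Lin,Panwang Pan,Chenxin Li,Wenyan Cong,Jian Zhang,Junbin Lu,Chenguo Lin,Dilin Wang,Zhicheng Yan,Hongyu Xu,Justin Theiss,Yue Huang,Xinghao Ding,Rakesh Ranjan,Zhiwen Fan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: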

Abstract:Understanding the dynamic physical world, characterized by its evolving 3D structure, real-world motion, and semantic content with textual descriptions, is crucial for human-agent interaction and enables embodied agents to perceive and act within real environments with human-like capabilities. However, existing datasets are often derived from limited simulators or utilize traditional Structure-from-Motion for up-to-scale annotation and offer limited descriptive captioning, which restricts the capacity of foundation models to accurately interpret real-world dynamics from monocular videos, commonly sourced from the internet. To bridge these gaps, we introduce DynamicVerse, a physical-scale, multimodal 4D world modeling framework for dynamic real-world video. We employ large vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic descriptive captions. By integrating window-based Bundle Adjustment with global optimization, our method converts long real-world video sequences into a comprehensive 4D multimodal format. DynamicVerse delivers a large-scale dataset consisting of 100K+ videos with 800K+ annotated masks and 10M+ frames from internet videos. Experimental evaluations on three benchmark tasks, namely video depth estimation, camera pose estimation, and camera intrinsics estimation, demonstrate that our 4D modeling achieves superior performance in capturing physical-scale measurements with greater global accuracy than existing methods.

[CV-16] TEXTRIX: Latent Attribute Grid for Native Texture Generation and Beyond WWW

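Quick read: This paper addresses the inter-view inconsistencies and incomplete coverage of complex surfaces that affect existing 3D texture generation methods relying on multi-view fusion, which limit the fidelity and completeness of the generated content. The key to the solution is the TEXTRIX framework, built on a latent 3D attribute grid and a Diffusion Transformer with sparse attention, which colors 3D models directly in volumetric space and so avoids the limitations of multi-view fusion altogether. The same architecture extends naturally to high-precision 3D segmentation by predicting semantic attributes on the same grid, giving precise boundaries. The paper reports state-of-the-art performance on both texture synthesis and 3D part segmentation.

Link: https://arxiv.org/abs/2512.02993
Authors: Yifei Zeng,Yajie Bao,Jiachen Qian,Shuang Wu,Youtian Lin,Hao Zhu,Buyu Li,Feihu Zhang,Xun Cao,Yao Yao
Affiliations: Nanjing University; DreamTech; HKU; OriginArk
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL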

Abstract:Prevailing 3D texture generation methods, which often rely on multi-view fusion, are frequently hindered by inter-view inconsistencies and incomplete coverage of complex surfaces, limiting the fidelity and completeness of the generated content. To overcome these challenges, we introduce TEXTRIX, a native 3D attribute generation framework for high-fidelity texture synthesis and downstream applications such as precise 3D part segmentation. Our approach constructs a latent 3D attribute grid and leverages a Diffusion Transformer equipped with sparse attention, enabling direct coloring of 3D models in volumetric space and fundamentally avoiding the limitations of multi-view fusion. Built upon this native representation, the framework naturally extends to high-precision 3D segmentation by training the same architecture to predict semantic attributes on the grid. Extensive experiments demonstrate state-of-the-art performance on both tasks, producing seamless, high-fidelity textures and accurate 3D part segmentation with precise boundaries.

[CV-17] GraphFusion3D: Dynamic Graph Attention Convolution with Adaptive Cross-Modal Transformer for 3D Object Detection

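[Quick Read]: This paper addresses the challenges point clouds pose for 3D object detection, including sparse data, incomplete structures, and limited semantic information, and in particular the difficulty of modeling contextual relationships between distant objects. The key to its solution is the GraphFusion3D framework, which combines multi-modal fusion with advanced feature learning. First, an Adaptive Cross-Modal Transformer (ACMT) dynamically integrates image features into point representations to enrich geometric and semantic information. Second, a Graph Reasoning Module (GRM) uses multi-scale graph attention to capture local geometric structure and global semantic context simultaneously for proposal refinement. A cascade decoder then refines detections progressively through multi-stage predictions. Experiments on SUN RGB-D and ScanNetV2 show substantial improvements over existing methods.

Link: https://arxiv.org/abs/2512.02991
Authors: Md Sohag Mia,Md Nahid Hasan,Tawhid Ahmed,Muhammad Abdullah Adnan
Affiliations: Nanjing University of Information Science and Technology; Bangladesh University of Engineering and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: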

Abstract:Despite significant progress in 3D object detection, point clouds remain challenging due to sparse data, incomplete structures, and limited semantic information. Capturing contextual relationships between distant objects presents additional difficulties. To address these challenges, we propose GraphFusion3D, a unified framework combining multi-modal fusion with advanced feature learning. Our approach introduces the Adaptive Cross-Modal Transformer (ACMT), which adaptively integrates image features into point representations to enrich both geometric and semantic information. For proposal refinement, we introduce the Graph Reasoning Module (GRM), a novel mechanism that models neighborhood relationships to simultaneously capture local geometric structures and global semantic context. The module employs multi-scale graph attention to dynamically weight both spatial proximity and feature similarity between proposals. We further employ a cascade decoder that progressively refines detections through multi-stage predictions. Extensive experiments on SUN RGB-D (70.6% AP_25 and 51.2% AP_50) and ScanNetV2 (75.1% AP_25 and 60.8% AP_50) demonstrate a substantial performance improvement over existing approaches.
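
The abstract says the GRM weights both spatial proximity and feature similarity between proposals. The sketch below shows one way such a weighting could look for a single query proposal. The score (cosine similarity minus scaled distance), the `sigma` parameter, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0

def attention_weights(centers, features, query, sigma=1.0):
    """Softmax weights over the neighbors of proposal `query`.

    The score is a hypothetical choice for illustration,
    not the formula used by GRM.
    """
    scores = {}
    for j, (center, feat) in enumerate(zip(centers, features)):
        if j == query:
            continue
        dist = math.dist(centers[query], center)
        scores[j] = cosine(features[query], feat) - dist / sigma
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {j: math.exp(s - top) for j, s in scores.items()}
    total = sum(exps.values())
    return {j: e / total for j, e in exps.items()}

# Three toy proposals: proposal 1 is near and similar, proposal 2 is far and dissimilar.
centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
features = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
weights = attention_weights(centers, features, query=0)
```

In this toy case the nearby, similar proposal receives almost all of the weight. A multi-scale variant would presumably repeat this over several neighborhood sizes.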

[CV-18] U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences

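[Quick Read]: This paper addresses a limitation of existing generative frameworks for LiDAR sequences, which model all spatial regions uniformly and ignore how uncertainty varies across real scenes. This produces geometric artifacts in complex or ambiguous regions and hurts the realism and temporal stability of 4D world models. The key to its solution is the U4D framework. It first estimates spatial uncertainty maps with a pretrained segmentation model to localize semantically challenging regions. It then generates in a “hard-to-easy” manner in two stages: the first reconstructs high-entropy regions with fine geometry, and the second completes the remaining regions conditioned on learned structural priors. A mixture of spatio-temporal (MoST) block adaptively fuses spatial and temporal representations during diffusion, improving the spatio-temporal consistency of the results.

Link: https://arxiv.org/abs/2512.02982
Authors: Xiang Xu,Ao Liang,Youquan Liu,Linfeng Li,Lingdong Kong,Ziwei Liu,Qingshan Liu
Affiliations: Nanjing University of Aeronautics and Astronautics; National University of Singapore; Fudan University; S-Lab, Nanyang Technological University; Nanjing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Preprint; 19 pages, 7 figures, 8 tables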

Abstract:Modeling dynamic 3D environments from LiDAR sequences is central to building reliable 4D worlds for autonomous driving and embodied AI. Existing generative frameworks, however, often treat all spatial regions uniformly, overlooking the varying uncertainty across real-world scenes. This uniform generation leads to artifacts in complex or ambiguous regions, limiting realism and temporal stability. In this work, we present U4D, an uncertainty-aware framework for 4D LiDAR world modeling. Our approach first estimates spatial uncertainty maps from a pretrained segmentation model to localize semantically challenging regions. It then performs generation in a “hard-to-easy” manner through two sequential stages: (1) uncertainty-region modeling, which reconstructs high-entropy regions with fine geometric fidelity, and (2) uncertainty-conditioned completion, which synthesizes the remaining areas under learned structural priors. To further ensure temporal coherence, U4D incorporates a mixture of spatio-temporal (MoST) block that adaptively fuses spatial and temporal representations during diffusion. Extensive experiments show that U4D produces geometrically faithful and temporally consistent LiDAR sequences, advancing the reliability of 4D world modeling for autonomous perception and simulation.
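
To make the "high-entropy regions" idea concrete, here is a minimal sketch that turns per-cell class probabilities from a segmentation model into an entropy map and thresholds it into a mask of hard regions. The grid layout, the threshold value, and the function names are assumptions for illustration. The abstract does not specify how U4D computes or thresholds its uncertainty maps.

```python
import math

def entropy_map(probs):
    """Shannon entropy (in nats) for each cell of an H x W grid of class probabilities."""
    return [[-sum(p * math.log(p) for p in cell if p > 0.0) for cell in row]
            for row in probs]

def hard_region_mask(entropy, threshold):
    """Mark cells whose entropy exceeds a (hypothetical) threshold."""
    return [[e > threshold for e in row] for row in entropy]

# Toy 2 x 2 grid with three classes per cell.
probs = [
    [[0.98, 0.01, 0.01], [0.34, 0.33, 0.33]],
    [[0.90, 0.05, 0.05], [0.50, 0.50, 0.00]],
]
ent = entropy_map(probs)
mask = hard_region_mask(ent, threshold=0.6)
```

Cells marked `True` would be handled in the first, uncertainty-region stage, and the rest in the completion stage.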

[CV-19] InEx: Hallucination Mitigation via Introspection and Cross-Modal Multi-Agent Collaboration AAAI2026

[Quick Read]: This paper addresses hallucination in large language models (LLMs), which particularly undermines the reliability of multimodal large language models (MLLMs). The key to its solution is InEx, a training-free multi-agent framework with two core mechanisms. The first is internal introspective reasoning guided by entropy-based uncertainty estimation, which reduces uncertainty in the decision process. The second is external cross-modal multi-agent collaboration, in which an editing agent and self-reflection agents iteratively verify and refine the initial response, improving output reliability and mitigating hallucination.

Link: https://arxiv.org/abs/2512.02981
Authors: Zhongyu Yang,Yingfang Yuan,Xuanming Jiang,Baoyi An,Wei Pang
Affiliations: 1. Tsinghua University; 2. Alibaba Group; 3. Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in AAAI 2026

Abstract:Hallucination remains a critical challenge in large language models (LLMs), hindering the development of reliable multimodal LLMs (MLLMs). Existing solutions often rely on human intervention or underutilize the agent’s ability to autonomously mitigate hallucination. To address these limitations, we draw inspiration from how humans make reliable decisions in the real world. They begin with introspective reasoning to reduce uncertainty and form an initial judgment, then rely on external verification from diverse perspectives to reach a final decision. Motivated by this cognitive paradigm, we propose InEx, a training-free, multi-agent framework designed to autonomously mitigate hallucination. InEx introduces internal introspective reasoning, guided by entropy-based uncertainty estimation, to improve the reliability of the decision agent’s reasoning process. The agent first generates a response, which is then iteratively verified and refined through external cross-modal multi-agent collaboration with the editing agent and self-reflection agents, further enhancing reliability and mitigating hallucination. Extensive experiments show that InEx consistently outperforms existing methods, achieving 4%-27% gains on general and hallucination benchmarks, and demonstrating strong robustness.

[CV-20] BEVDilation: LiDAR-Centric Multi-Modal Fusion for 3D Object Detection AAAI26

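[Quick Read]: This paper addresses the feature-space misalignment in multi-sensor fusion caused by the fundamental gap in geometric accuracy between LiDAR and cameras, along with the sparsity and limited semantics of point clouds. The key to its solution is BEVDilation, a LiDAR-centric bird's-eye-view (BEV) framework that treats image BEV features as implicit guidance rather than simply concatenating them, which alleviates misalignment caused by image depth estimation errors. A Sparse Voxel Dilation Block uses image priors to densify foreground voxels, and a Semantic-Guided BEV Dilation Block uses image semantics and long-range context to improve LiDAR feature diffusion. Together these improve 3D object detection and robustness to depth noise.

Link: https://arxiv.org/abs/2512.02972
Authors: Guowen Zhang,Chenhang He,Liyi Chen,Lei Zhang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted by AAAI26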

Abstract:Integrating LiDAR and camera information in the bird’s eye view (BEV) representation has demonstrated its effectiveness in 3D object detection. However, because of the fundamental disparity in geometric accuracy between these sensors, indiscriminate fusion in previous methods often leads to degraded performance. In this paper, we propose BEVDilation, a novel LiDAR-centric framework that prioritizes LiDAR information in the fusion. By formulating image BEV features as implicit guidance rather than naive concatenation, our strategy effectively alleviates the spatial misalignment caused by image depth estimation errors. Furthermore, the image guidance can effectively help the LiDAR-centric paradigm to address the sparsity and semantic limitations of point clouds. Specifically, we propose a Sparse Voxel Dilation Block that mitigates the inherent point sparsity by densifying foreground voxels through image priors. Moreover, we introduce a Semantic-Guided BEV Dilation Block to enhance the LiDAR feature diffusion processing with image semantic guidance and long-range context capture. On the challenging nuScenes benchmark, BEVDilation achieves better performance than state-of-the-art methods while maintaining competitive computational efficiency. Importantly, our LiDAR-centric strategy demonstrates greater robustness to depth noise compared to naive fusion. The source code is available at this https URL.

[CV-21] A Lightweight Real-Time Low-Light Enhancement Network for Embedded Automotive Vision Systems

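[Quick Read]: This paper addresses the severe degradation of in-vehicle camera images in low-light conditions such as nighttime driving, where conventional enhancement algorithms are too computationally expensive to meet the real-time requirements of vehicular applications. The key to its solution is UltraFast-LieNET, a lightweight multi-scale shifted convolutional network. Its main contributions are a Dynamic Shifted Convolution (DSConv) kernel with only 12 learnable parameters for efficient feature extraction; a Multi-scale Shifted Residual Block (MSRB) that expands the receptive field through different shift distances; and a residual structure with a novel multi-level gradient-aware loss that improves stability and enhancement quality. The method keeps the parameter count extremely low (as few as 36) while outperforming existing state-of-the-art methods, balancing real-time performance and enhancement quality under limited resources.

Link: https://arxiv.org/abs/2512.02965
Authors: Yuhan Chen,Yicui Shi,Guofa Li,Guangrui Bai,Jinyuan Shao,Xiangfei Huang,Wenbo Chu,Keqiang Li
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: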

Abstract:In low-light environments like nighttime driving, image degradation severely challenges in-vehicle camera safety. Since existing enhancement algorithms are often too computationally intensive for vehicular applications, we propose UltraFast-LieNET, a lightweight multi-scale shifted convolutional network for real-time low-light image enhancement. We introduce a Dynamic Shifted Convolution (DSConv) kernel with only 12 learnable parameters for efficient feature extraction. By integrating DSConv with varying shift distances, a Multi-scale Shifted Residual Block (MSRB) is constructed to significantly expand the receptive field. To mitigate lightweight network instability, a residual structure and a novel multi-level gradient-aware loss function are incorporated. UltraFast-LieNET allows flexible parameter configuration, with a minimum size of only 36 parameters. Results on the LOLI-Street dataset show a PSNR of 26.51 dB, outperforming state-of-the-art methods by 4.6 dB while utilizing only 180 parameters. Experiments across four benchmark datasets validate its superior balance of real-time performance and enhancement quality under limited resources. Code is available at https://github.com/YuhanChen2024/UltraFast-LiNET

[CV-22] Layout Anything: One Transformer for Universal Room Layout Estimation WACV2026

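[Quick Read]: This paper addresses three problems in indoor layout estimation: reliance on complex post-processing pipelines, difficulty in modeling geometric constraints, and slow inference. The key to its solution is the Layout Anything framework, whose main contributions are (1) a layout degeneration strategy based on topology-aware transformations that increases training data diversity while preserving Manhattan-world constraints, and (2) differentiable geometric losses that enforce planar consistency and sharp boundary predictions directly during training. Unifying these modules in an end-to-end Transformer architecture yields high accuracy (for example, a pixel error of 5.43% on LSUN) and fast inference (114 ms), outperforming existing methods and suiting applications such as augmented reality (AR) and large-scale 3D scene reconstruction.

Link: https://arxiv.org/abs/2512.02952
Authors: Md Sohag Mia,Muhammad Abdullah Adnan
Affiliations: Nanjing University of Information Science and Technology; Bangladesh University of Engineering and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at WACV 2026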

Abstract:We present Layout Anything, a transformer-based framework for indoor layout estimation that adapts the OneFormer’s universal segmentation architecture to geometric structure prediction. Our approach integrates OneFormer’s task-conditioned queries and contrastive learning with two key modules: (1) a layout degeneration strategy that augments training data while preserving Manhattan-world constraints through topology-aware transformations, and (2) differentiable geometric losses that directly enforce planar consistency and sharp boundary predictions during training. By unifying these components in an end-to-end framework, the model eliminates complex post-processing pipelines while achieving high-speed inference at 114ms. Extensive experiments demonstrate state-of-the-art performance across standard benchmarks, with pixel error (PE) of 5.43% and corner error (CE) of 4.02% on the LSUN, PE of 7.04% (CE 5.17%) on the Hedau and PE of 4.03% (CE 3.15%) on the Matterport3D-Layout datasets. The framework’s combination of geometric awareness and computational efficiency makes it particularly suitable for augmented reality applications and large-scale 3D scene reconstruction tasks.
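
For readers unfamiliar with the pixel error (PE) metric quoted above, the sketch below computes the fraction of pixels whose predicted surface label differs from the ground truth. It assumes the predicted and ground-truth labels are already aligned, whereas layout benchmarks typically match labels first. The function name and toy label maps are illustrative only.

```python
def pixel_error(pred, gt):
    """Fraction of pixels whose predicted label differs from the ground truth.

    Assumes both label maps have the same shape and use the same label ids.
    """
    total = sum(len(row) for row in gt)
    wrong = sum(p != g
                for pred_row, gt_row in zip(pred, gt)
                for p, g in zip(pred_row, gt_row))
    return wrong / total

# Toy 2 x 3 label maps (0 = left wall, 1 = front wall, 2 = floor).
pred = [[0, 0, 1],
        [2, 2, 1]]
gt = [[0, 1, 1],
      [2, 2, 2]]
pe = pixel_error(pred, gt)  # 2 of 6 pixels disagree
```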

[CV-23] Benchmarking Scientific Understanding and Reasoning for Video Generation using VideoScience-Bench

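[Quick Read]: This paper addresses the limited zero-shot reasoning ability of current video generation models, in particular their limited understanding and modeling of real-world scientific laws. Existing video benchmarks are mostly based on physical commonsense and cannot assess cross-disciplinary scientific reasoning. The key to its solution is VideoScience-Bench, a new benchmark for evaluating undergraduate-level scientific understanding of physics and chemistry in video models. It contains 200 composite scientific scenario prompts covering 14 topics and 103 concepts, evaluated through expert annotation and a VLM-as-a-Judge along five dimensions (Prompt Consistency, Phenomenon Congruency, Correct Dynamism, Immutability, and Spatio-Temporal Continuity). According to the authors, it is the first benchmark to evaluate video generation models as scientific reasoners.

Link: https://arxiv.org/abs/2512.02942
Authors: Lanxiang Hu,Abhilash Shankarampeta,Yixin Huang,Zilin Dai,Haoyang Yu,Yujie Zhao,Haoqiang Kang,Daniel Zhao,Tajana Rosing,Hao Zhang
Affiliations: University of California, San Diego
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: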

Abstract:The next frontier for video generation lies in developing models capable of zero-shot reasoning, where understanding real-world scientific laws is crucial for accurate physical outcome modeling under diverse conditions. However, existing video benchmarks are physical commonsense-based, offering limited insight into video models’ scientific reasoning capability. We introduce VideoScience-Bench, a benchmark designed to evaluate undergraduate-level scientific understanding in video models. Each prompt encodes a composite scientific scenario that requires understanding and reasoning across multiple scientific concepts to generate the correct phenomenon. The benchmark comprises 200 carefully curated prompts spanning 14 topics and 103 concepts in physics and chemistry. We conduct expert-annotated evaluations across seven state-of-the-art video models in T2V and I2V settings along five dimensions: Prompt Consistency, Phenomenon Congruency, Correct Dynamism, Immutability, and Spatio-Temporal Continuity. Using a VLM-as-a-Judge to assess video generations, we observe strong correlation with human assessments. To the best of our knowledge, VideoScience-Bench is the first benchmark to evaluate video models not only as generators but also as reasoners, requiring their generations to demonstrate scientific understanding consistent with expected physical and chemical phenomena. Our data and evaluation code are available at: this https URL.

[CV-24] LoVoRA: Text-guided and Mask-free Video Object Removal and Addition with Learnable Object-aware Localization

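[Quick Read]: This paper addresses object removal and addition in text-guided video editing, specifically how to achieve precise spatial and temporal consistency without auxiliary masks or reference images. Existing methods depend on external control signals, which limits scalability and generalization. The key to its solution is the LoVoRA framework, whose core contribution is a learnable object-aware localization mechanism that provides dense spatio-temporal supervision. Combined with a Diffusion Mask Predictor, it enables end-to-end video editing that produces high-quality, temporally consistent edits without any external control signal at inference time.

Link: https://arxiv.org/abs/2512.02933
Authors: Zhihan Xiao,Lin Liu,Yixin Gao,Xiaopeng Zhang,Haoxuan Che,Songping Mai,Qi Tian
Affiliations: Tsinghua University; Huawei Inc.; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: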

Abstract:Text-guided video editing, particularly for object removal and addition, remains a challenging task due to the need for precise spatial and temporal consistency. Existing methods often rely on auxiliary masks or reference images for editing guidance, which limits their scalability and generalization. To address these issues, we propose LoVoRA, a novel framework for mask-free video object removal and addition using an object-aware localization mechanism. Our approach utilizes a unique dataset construction pipeline that integrates image-to-video translation, optical flow-based mask propagation, and video inpainting, enabling temporally consistent edits. The core innovation of LoVoRA is its learnable object-aware localization mechanism, which provides dense spatio-temporal supervision for both object insertion and removal tasks. By leveraging a Diffusion Mask Predictor, LoVoRA achieves end-to-end video editing without requiring external control signals during inference. Extensive experiments and human evaluation demonstrate the effectiveness and high-quality performance of LoVoRA.

[CV-25] EGGS: Exchangeable 2D/3D Gaussian Splatting for Geometry-Appearance Balanced Novel View Synthesis

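[Quick Read]: This paper addresses the trade-off between appearance fidelity and geometric accuracy in novel view synthesis (NVS). 3D Gaussian Splatting (3DGS) renders in real time with high appearance quality but suffers from multi-view inconsistencies that hurt geometric accuracy, while 2D Gaussian Splatting (2DGS) enforces multi-view consistency at the cost of texture detail. The key to its solution is Exchangeable Gaussian Splatting (EGGS), which introduces Hybrid Gaussian Rasterization for unified rendering, Adaptive Type Exchange for dynamic adaptation between 2D and 3D Gaussians, and Frequency-Decoupled Optimization to exploit the strengths of each representation. With a CUDA-accelerated implementation for efficient training and inference, the method outperforms existing approaches in rendering quality, geometric accuracy, and efficiency.

Link: https://arxiv.org/abs/2512.02932
Authors: Yancheng Zhang,Guangyu Sun,Chen Chen
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: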

Abstract:Novel view synthesis (NVS) is crucial in computer vision and graphics, with wide applications in AR, VR, and autonomous driving. While 3D Gaussian Splatting (3DGS) enables real-time rendering with high appearance fidelity, it suffers from multi-view inconsistencies, limiting geometric accuracy. In contrast, 2D Gaussian Splatting (2DGS) enforces multi-view consistency but compromises texture details. To address these limitations, we propose Exchangeable Gaussian Splatting (EGGS), a hybrid representation that integrates 2D and 3D Gaussians to balance appearance and geometry. To achieve this, we introduce Hybrid Gaussian Rasterization for unified rendering, Adaptive Type Exchange for dynamic adaptation between 2D and 3D Gaussians, and Frequency-Decoupled Optimization that effectively exploits the strengths of each type of Gaussian representation. Our CUDA-accelerated implementation ensures efficient training and inference. Extensive experiments demonstrate that EGGS outperforms existing methods in rendering quality, geometric accuracy, and efficiency, providing a practical solution for high-quality NVS.

[CV-26] DiverseAR: Boosting Diversity in Bitwise Autoregressive Image Generation

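[Quick Read]: This paper addresses insufficient sample diversity in autoregressive (AR) generative models built on bitwise visual tokenizers. Two factors are responsible: the binary classification nature of bitwise modeling restricts the prediction space, and overly sharp logits distributions cause sampling collapse. The key to its solution is DiverseAR, which (1) introduces an adaptive logits distribution scaling mechanism that dynamically adjusts the sharpness of the binary output distribution during sampling to produce smoother, more diverse samples, and (2) designs an energy-based generation path search algorithm that avoids sampling low-confidence tokens, preserving image quality while improving diversity.

Link: https://arxiv.org/abs/2512.02931
Authors: Ying Yang,Zhengyao Lv,Tianlin Pan,Haofan Wang,Binxin Yang,Hubery Yin,Chen Li,Chenyang Si
Affiliations: PRLab, Nanjing University; The University of Hong Kong; University of Chinese Academy of Sciences; Lovart AI; WeChat, Tencent Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages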

Abstract:In this paper, we investigate the underexplored challenge of sample diversity in autoregressive (AR) generative models with bitwise visual tokenizers. We first analyze the factors that limit diversity in bitwise AR models and identify two key issues: (1) the binary classification nature of bitwise modeling, which restricts the prediction space, and (2) the overly sharp logits distribution, which causes sampling collapse and reduces diversity. Building on these insights, we propose DiverseAR, a principled and effective method that enhances image diversity without sacrificing visual quality. Specifically, we introduce an adaptive logits distribution scaling mechanism that dynamically adjusts the sharpness of the binary output distribution during sampling, resulting in smoother predictions and greater diversity. To mitigate potential fidelity loss caused by distribution smoothing, we further develop an energy-based generation path search algorithm that avoids sampling low-confidence tokens, thereby preserving high visual quality. Extensive experiments demonstrate that DiverseAR substantially improves sample diversity in bitwise autoregressive image generation.
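
The effect of softening a sharp binary distribution can be illustrated with plain temperature scaling of a bit logit. This is only a fixed-temperature sketch: the abstract describes an adaptive scaling mechanism whose rule is not given here, so `temperature` and both function names are assumptions.

```python
import math
import random

def bit_probability(logit, temperature=1.0):
    """P(bit = 1) after dividing the logit by a temperature (sigmoid output)."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))

def sample_bits(logits, temperature, rng):
    """Draw one Bernoulli sample per bit logit."""
    return [1 if rng.random() < bit_probability(z, temperature) else 0
            for z in logits]

sharp = bit_probability(4.0, temperature=1.0)   # close to 1: almost deterministic
smooth = bit_probability(4.0, temperature=2.0)  # closer to 0.5: more varied samples

rng = random.Random(0)
bits = sample_bits([4.0, -4.0, 0.5, -0.5], temperature=2.0, rng=rng)
```

A higher temperature moves each bit probability toward 0.5, which increases diversity and also the chance of sampling low-confidence bits. That is the fidelity risk the paper's path search is meant to counter.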

[CV-27] Learning Multimodal Embeddings for Traffic Accident Prediction and Causal Estimation KDD’26

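[Quick Read]: This paper addresses the limited accuracy of traffic accident prediction models that ignore physical and environmental information about the road surface and its surroundings. Conventional methods rely mainly on road network structural features and overlook the visual information in satellite imagery, such as road surface conditions and the surrounding environment. The key to its solution is a large multimodal dataset that combines nine million official traffic accident records, one million high-resolution satellite images aligned to road network nodes, node-level weather statistics and road type (e.g., residential vs. motorway), and edge-level traffic volume (Average Annual Daily Traffic), on which multimodal learning methods that fuse visual features with graph embeddings are evaluated. Experiments show that adding satellite image features improves prediction, reaching an AUROC of 90.1%, a 3.7% gain over graph neural network models that use only graph structure.

Link: https://arxiv.org/abs/2512.02920
Authors: Ziniu Zhang,Minxuan Duan,Haris N. Koutsopoulos,Hongyang R. Zhang
Affiliations: Northeastern University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Social and Information Networks (cs.SI)
Comments: 17 pages. To appear in KDD’26 Datasets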

Abstract:We consider analyzing traffic accident patterns using both road network data and satellite images aligned to road graph nodes. Previous work for predicting accident occurrences relies primarily on road network structural features while overlooking physical and environmental information from the road surface and its surroundings. In this work, we construct a large multimodal dataset across six U.S. states, containing nine million traffic accident records from official sources, and one million high-resolution satellite images for each node of the road network. Additionally, every node is annotated with features such as the region’s weather statistics and road type (e.g., residential vs. motorway), and each edge is annotated with traffic volume information (i.e., Average Annual Daily Traffic). Utilizing this dataset, we conduct a comprehensive evaluation of multimodal learning methods that integrate both visual and network embeddings. Our findings show that integrating both data modalities improves prediction accuracy, achieving an average AUROC of 90.1%, which is a 3.7% gain over graph neural network models that only utilize graph structures. With the improved embeddings, we conduct a causal analysis based on a matching estimator to estimate the key contributing factors influencing traffic accidents. We find that accident rates rise by 24% under higher precipitation, by 22% on higher-speed roads such as motorways, and by 29% due to seasonal patterns, after adjusting for other confounding factors. Ablation studies confirm that satellite imagery features are essential for achieving accurate prediction.
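
Since the headline result is an AUROC, a short sketch of the metric may help: AUROC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The scores and labels below are made-up toy values, not data from the paper.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison of positive and negative scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical accident scores for five road segments (1 = accident occurred).
scores = [0.9, 0.8, 0.3, 0.6, 0.2]
labels = [1, 0, 1, 1, 0]
value = auroc(scores, labels)  # 4 of 6 positive/negative pairs are ranked correctly
```

The pairwise form is quadratic in the number of examples, so a rank-based implementation is preferable at the scale of the dataset described above.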

[CV-28] MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding

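[Quick Read]: This paper addresses inaccurate semantic understanding when multimodal large language models (MLLMs) process high-resolution images, in particular the way crop-based processing splits target objects across multiple crops and disrupts semantic similarity computation. The key to its solution is Multi-resolution Retrieval-Detection (MRD), a training-free framework. A multi-resolution semantic fusion method integrates the semantic similarity maps obtained at different scales to produce more accurate semantic information and keep target objects intact. An open-vocabulary object detection (OVD) model, applied with a sliding window, localizes targets at the global scale, improving high-resolution image understanding.

Link: https://arxiv.org/abs/2512.02906
Authors: Fan Yang,Kaihao Zhang
Affiliations: Harbin Institute of Technology (Shenzhen)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: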

Abstract:Understanding high-resolution images remains a significant challenge for multimodal large language models (MLLMs). Recent studies address this issue by dividing the image into smaller crops and computing the semantic similarity between each crop and a query using a pretrained retrieval-augmented generation (RAG) model. The most relevant crops are then selected to localize the target object and suppress irrelevant information. However, such crop-based processing can fragment complete objects across multiple crops, thereby disrupting the computation of semantic similarity. In our experiments, we find that image crops of objects with different sizes are better handled at different resolutions. Based on this observation, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework for high-resolution image understanding. To address the issue of semantic similarity bias caused by objects being split across different image crops, we propose a multi-resolution semantic fusion method, which integrates semantic similarity maps obtained at different resolutions to produce more accurate semantic information and preserve the integrity of target objects. Furthermore, to achieve direct localization of target objects at a global scale, we introduce an open-vocabulary object detection (OVD) model that identifies object regions using a sliding-window approach. Experiments on high-resolution image understanding benchmarks using different MLLMs demonstrate the effectiveness of our approach.
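
One simple way to integrate similarity maps computed at different crop resolutions is to upsample the coarse map to the fine grid and average. The sketch below does this with nearest-neighbor upsampling. The averaging rule, the grid sizes, and the function names are assumptions for illustration, since the abstract does not state the fusion rule MRD uses.

```python
def upsample_nearest(grid, factor):
    """Repeat every cell `factor` times along both axes."""
    return [[v for v in row for _ in range(factor)]
            for row in grid for _ in range(factor)]

def fuse(maps_with_factors):
    """Average similarity maps after bringing them to a common resolution."""
    up = [upsample_nearest(m, f) for m, f in maps_with_factors]
    h, w = len(up[0]), len(up[0][0])
    return [[sum(u[i][j] for u in up) / len(up) for j in range(w)]
            for i in range(h)]

# Coarse map: 1 x 2 large crops. Fine map: 2 x 4 small crops.
coarse = [[0.2, 0.8]]
fine = [[0.1, 0.3, 0.9, 0.7],
        [0.1, 0.1, 0.5, 0.9]]
fused = fuse([(coarse, 2), (fine, 1)])
```

An object split across the fine crops on the right still scores highly after fusion, because the coarse crop sees it whole.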

[CV-29] Glance: Accelerating Diffusion Models with 1 Sample

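[Quick Read]: This paper addresses the deployment inefficiency of diffusion models in image generation caused by high computational cost and many inference steps. Prior work reduces steps through distillation but often incurs heavy retraining costs and degraded generalization. The key to its solution is a phase-aware acceleration strategy that splits denoising into an early semantic phase (slow) and a later redundant phase (fast), each handled by a dedicated lightweight LoRA adapter: Slow-LoRA handles early semantic information precisely, and Fast-LoRA accelerates later denoising. Adding just these two adapters to the base model achieves up to 5× acceleration while maintaining visual quality comparable to the original model and generalizing well to unseen prompts. Training requires only 1 sample and finishes within one hour on a single V100.

Link: https://arxiv.org/abs/2512.02899
Authors: Zhuobai Dong,Rui Zhao,Songjie Wu,Junchao Yi,Linjie Li,Zhengyuan Yang,Lijuan Wang,Alex Jinpeng Wang
Affiliations: WHU (Wuhan University); NUS (National University of Singapore); CSU (Central South University); UESTC (University of Electronic Science and Technology of China); Microsoft
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: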

Abstract:Diffusion models have achieved remarkable success in image generation, yet their deployment remains constrained by the heavy computational cost and the need for numerous inference steps. Previous efforts on fewer-step distillation attempt to skip redundant steps by training compact student models, yet they often suffer from heavy retraining costs and degraded generalization. In this work, we take a different perspective: we accelerate smartly, not evenly, applying smaller speedups to early semantic stages and larger ones to later redundant phases. We instantiate this phase-aware strategy with two experts that specialize in slow and fast denoising phases. Surprisingly, instead of investing massive effort in retraining student models, we find that simply equipping the base model with lightweight LoRA adapters achieves both efficient acceleration and strong generalization. We refer to these two adapters as Slow-LoRA and Fast-LoRA. Through extensive experiments, our method achieves up to 5× acceleration over the base model while maintaining comparable visual quality across diverse benchmarks. Remarkably, the LoRA experts are trained with only 1 sample on a single V100 within one hour, yet the resulting models generalize strongly on unseen prompts.
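
"Accelerate smartly, not evenly" can be pictured as a step schedule that skips few steps early and many steps late. The sketch below builds such a schedule. The step count, phase boundary, and strides are hypothetical numbers chosen so the overall speedup comes out at 5×. The abstract does not give the actual split used by Slow-LoRA and Fast-LoRA.

```python
def phase_aware_schedule(num_steps, boundary, slow_stride, fast_stride):
    """Return the base-model step indices kept in the slow and fast phases."""
    slow = list(range(0, boundary, slow_stride))          # early, semantic phase
    fast = list(range(boundary, num_steps, fast_stride))  # late, redundant phase
    return slow, fast

# Hypothetical setting: 50 base steps, first 10 treated as the semantic phase.
slow, fast = phase_aware_schedule(num_steps=50, boundary=10,
                                  slow_stride=2, fast_stride=8)
speedup = 50 / (len(slow) + len(fast))
```

Here the early phase is thinned by 2× and the late phase by 8×, so 10 of the 50 steps are kept.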

[CV-30] Polar Perspectives: Evaluating 2-D LiDAR Projections for Robust Place Recognition with Visual Foundation Models

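[Quick Read]: This paper studies how the choice of LiDAR-to-image projection affects metric place recognition, particularly when combined with a state-of-the-art vision foundation model. The key to its solution is a modular retrieval pipeline that controls for backbone, aggregation, and evaluation protocol, thereby isolating the effect of the 2-D projection itself. Experiments show that carefully designed projections substantially improve discriminative power, robustness to environmental variation, and suitability for real-time autonomy, making them an effective surrogate for end-to-end 3-D learning in LiDAR place recognition.

Link: https://arxiv.org/abs/2512.02897
Authors: Pierpaolo Serio,Giulio Pisaneschi,Andrea Dan Ryals,Vincenzo Infantino,Lorenzo Gentilini,Valentina Donzella,Lorenzo Pollini
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 13 Pages, 5 Figures, 2 Tables. Under Review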

Abstract:This work presents a systematic investigation into how alternative LiDAR-to-image projections affect metric place recognition when coupled with a state-of-the-art vision foundation model. We introduce a modular retrieval pipeline that controls for backbone, aggregation, and evaluation protocol, thereby isolating the influence of the 2-D projection itself. Using consistent geometric and structural channels across multiple datasets and deployment scenarios, we identify the projection characteristics that most strongly determine discriminative power, robustness to environmental variation, and suitability for real-time autonomy. Experiments with different datasets, including integration into an operational place recognition policy, validate the practical relevance of these findings and demonstrate that carefully designed projections can serve as an effective surrogate for end-to-end 3-D learning in LiDAR place recognition.
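
As background, one widely used LiDAR-to-image projection is the spherical range image, which maps each point's azimuth to a column and its elevation to a row. The sketch below shows that mapping for a single point. It is offered only as an example of a 2-D projection: the abstract does not list which projections the paper evaluates, and the image size and field-of-view values are assumed.

```python
import math

def spherical_pixel(x, y, z, width, height, fov_up_deg, fov_down_deg):
    """Map a LiDAR point to (row, col, range) in a spherical range image."""
    r = math.sqrt(x * x + y * y + z * z)
    yaw = math.atan2(y, x)      # azimuth in [-pi, pi]
    pitch = math.asin(z / r)    # elevation
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    u = 0.5 * (1.0 - yaw / math.pi) * width
    v = (fov_up - pitch) / (fov_up - fov_down) * height
    # Clamp to the image bounds.
    col = min(width - 1, max(0, int(u)))
    row = min(height - 1, max(0, int(v)))
    return row, col, r

# A point 10 m straight ahead, with a hypothetical 64 x 1024 sensor layout.
pixel = spherical_pixel(10.0, 0.0, 0.0, width=1024, height=64,
                        fov_up_deg=3.0, fov_down_deg=-25.0)
```

The range value (and other channels such as intensity) is then written at that pixel, producing an image a vision backbone can consume.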

[CV-31] MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm

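[Quick Read]: This paper addresses the performance bottlenecks, high cost, and limited generalization of multimodal large language models (MLLMs) in the post-training stage. The central challenge is to improve a model's foundational capabilities and cross-domain adaptability efficiently while keeping reasoning accurate and responses concise. The key to its solution is a general post-training paradigm with three contributions: (1) an information density-based data generation scheme combined with a dual-dimensional tree-structured label system for automated construction of high-quality cross-domain data; (2) a collaborative curriculum supervised fine-tuning method that injects domain knowledge while preserving general capabilities; and (3) a hybrid reinforcement learning mechanism that strengthens reasoning while optimizing multiple objectives such as diversity exploration, preservation of multimodal perception, and response conciseness. Infrastructure improvements including 5D parallel training, operator optimization, and inference quantization raise training and inference efficiency and lower the cost of domain adaptation. As a result, MindGPT-4ov reaches state-of-the-art performance on benchmarks such as MMBench and MMStar and shows potential for industrial deployment.

Link: https://arxiv.org/abs/2512.02895
Authors: Wei Chen,Chaoqun Du,Feng Gu,Wei He,Qizhen Li,Zide Liu,Xuhao Pan,Chang Ren,Xudong Rao,Chenfeng Wang,Tao Wei,Chengjun Yu,Pengfei Yu,Yufei Zheng,Chunpeng Zhou,Pan Zhou,Xuhan Zhu
Affiliations: Li Auto Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 33 pages, 14 figures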

Abstract:We present MindGPT-4ov, a multimodal large language model (MLLM) that introduces a general post-training paradigm spanning data production, model training, and efficient deployment. It achieves state-of-the-art performance across multiple benchmarks at low cost, effectively enhancing the foundational capabilities of MLLMs and their generalization ability. Focusing on data construction, supervised fine-tuning strategies, and multimodal reinforcement learning methods, this work proposes three key innovations: (1) An information density-based data generation scheme, integrated with a dual-dimensional tree-structured label system, enabling automated generation of high-quality cross-domain data. (2) A collaborative curriculum supervised fine-tuning approach that balances the injection of domain-specific knowledge with the preservation of general capabilities. (3) A hybrid reinforcement learning paradigm that enhances reasoning ability while simultaneously addressing multi-objective optimization such as diversity exploration, maintenance of multimodal perception, and response conciseness. Moreover, we implement a series of infrastructure optimizations, such as 5D parallel training, operator optimization, and inference quantization to enhance training and inference efficiency while reducing the cost of domain adaptation. Experimental results demonstrate that the MindGPT-4ov model outperforms state-of-the-art models on benchmarks such as MMBench, MMStar, MathVision, and MathVista. In addition, MindGPT-4ov also demonstrates superior user experience in vertical domain tasks, enabling a seamless transition from academic research to industrial deployment. MindGPT-4ov provides a general post-training paradigm applicable to a wide range of MLLMs. The model weights, datasets, and code for the Qwen3-VL-based variants will be open-sourced soon to support the community’s development of MLLMs.

[CV-32] Taming Camera-Controlled Video Generation with Verifiable Geometry Reward

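[Quick Read]: This paper addresses the limited precision and generalization of current video diffusion models for camera-controlled video generation, which rely on supervised fine-tuning (SFT) and struggle to achieve high geometric consistency and precise control under complex, large-amplitude camera motion. The key to its solution is an online reinforcement learning (RL) post-training framework with a verifiable geometry reward. The method estimates the 3D camera trajectories of the generated and reference videos, divides them into short segments, computes segment-wise relative poses, and uses the segment-to-segment alignment score as a dense reward signal. This alleviates the optimization difficulty caused by sparse rewards and improves camera-control accuracy, geometric consistency, and visual quality.

Link: https://arxiv.org/abs/2512.02870
Authors: Zhaoqing Wang,Xiaobo Xia,Zhuolin Bie,Jinlin Liu,Dongdong Yu,Jia-Wang Bian,Changhu Wang
Affiliations: AIsphere; National University of Singapore; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 4 figures, 7 tables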

Abstract:Recent advances in video diffusion models have remarkably improved camera-controlled video generation, but most methods rely solely on supervised fine-tuning (SFT), leaving online reinforcement learning (RL) post-training largely underexplored. In this work, we introduce an online RL post-training framework that optimizes a pretrained video generator for precise camera control. To make RL effective in this setting, we design a verifiable geometry reward that delivers dense segment-level feedback to guide model optimization. Specifically, we estimate the 3D camera trajectories for both generated and reference videos, divide each trajectory into short segments, and compute segment-wise relative poses. The reward function then compares each generated-reference segment pair and assigns an alignment score as the reward signal, which helps alleviate reward sparsity and improve optimization efficiency. Moreover, we construct a comprehensive dataset featuring diverse large-amplitude camera motions and scenes with varied subject dynamics. Extensive experiments show that our online RL post-training clearly outperforms SFT baselines across multiple aspects, including camera-control accuracy, geometric consistency, and visual quality, demonstrating its superiority in advancing camera-controlled video generation.
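
The segment-level reward described above can be sketched in a simplified planar setting. The code below uses 2-D poses `(x, y, yaw)` instead of full 3D camera poses, and the score `exp(-(translation error + rotation error))` is an assumed stand-in for the paper's alignment score. Both simplifications are made purely to keep the example short.

```python
import math

def wrap(angle):
    """Wrap an angle to [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))

def relative_pose(a, b):
    """Pose of b expressed in the frame of a; poses are (x, y, yaw)."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    c, s = math.cos(-a[2]), math.sin(-a[2])
    return (c * dx - s * dy, s * dx + c * dy, wrap(b[2] - a[2]))

def segment_rewards(generated, reference, segment_len):
    """One reward per segment, from the mismatch of segment-wise relative poses."""
    rewards = []
    for start in range(0, len(reference) - segment_len, segment_len):
        end = start + segment_len
        g = relative_pose(generated[start], generated[end])
        r = relative_pose(reference[start], reference[end])
        trans_err = math.hypot(g[0] - r[0], g[1] - r[1])
        rot_err = abs(wrap(g[2] - r[2]))
        rewards.append(math.exp(-(trans_err + rot_err)))  # hypothetical score
    return rewards

# Reference: straight forward motion. Generated: same motion with sideways drift.
reference = [(float(i), 0.0, 0.0) for i in range(9)]
drifting = [(float(i), 0.1 * i, 0.0) for i in range(9)]
perfect = segment_rewards(reference, reference, segment_len=4)
degraded = segment_rewards(drifting, reference, segment_len=4)
```

Because each segment yields its own score, the generator receives feedback on where along the trajectory it deviates, rather than a single sparse number per video.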

[CV-33] MICCAI STSR 2025 Challenge: Semi-Supervised Teeth and Pulp Segmentation and CBCT-IOS Registration

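Quick read: This paper addresses the scarcity of annotated cone-beam computed tomography (CBCT) and intraoral scanning (IOS) data in digital dentistry, with the aim of advancing automated pulp canal segmentation and cross-modal registration. Its key contribution is the STSR 2025 Challenge, organized at MICCAI 2025, which establishes a semi-supervised learning (SSL) benchmark, provides large unlabeled datasets (640 IOS samples and 250 CBCT scans), and encourages the development of efficient SSL methods. For segmentation, leading solutions combined nnU-Net and Mamba-like state space models with pseudo-labeling and consistency regularization. For registration, they combined PointNetLK with differentiable singular value decomposition (SVD) and geometric augmentation to handle the modality gap, and added a hybrid neural-classical refinement pipeline for accurate alignment, improving performance under limited annotation.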

Link: https://arxiv.org/abs/2512.02867
Authors: Yaqi Wang,Zhi Li,Chengyu Wu,Jun Liu,Yifan Zhang,Jialuo Chen,Jiaxue Ni,Qian Luo,Jin Liu,Can Han,Changkai Ji,Zhi Qin Tan,Ajo Babu George,Liangyu Chen,Qianni Zhang,Dahong Qian,Shuai Wang,Huiyu Zhou
Affiliations: Hangzhou Dianzi University; School of Cyberspace, Hangzhou Dianzi University; Innovation Center for Electronic Design Automation Technology, Hangzhou Dianzi University; College of Information Engineering, China Jiliang University; Shenzhen University; Shandong University; School of Automation and Intelligent Sensing, Shanghai Jiao Tong University; Queen Mary University of London; King’s College London; University of Leicester; Hangzhou Geriatric Stomatology Hospital, Hangzhou Dental Hospital Group; DiceMed
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Cone-Beam Computed Tomography (CBCT) and Intraoral Scanning (IOS) are essential for digital dentistry, but annotated data scarcity limits automated solutions for pulp canal segmentation and cross-modal registration. To benchmark semi-supervised learning (SSL) in this domain, we organized the STSR 2025 Challenge at MICCAI 2025, featuring two tasks: (1) semi-supervised segmentation of teeth and pulp canals in CBCT, and (2) semi-supervised rigid registration of CBCT and IOS. We provided 60 labeled and 640 unlabeled IOS samples, plus 30 labeled and 250 unlabeled CBCT scans with varying resolutions and fields of view. The challenge attracted strong community participation, with top teams submitting open-source deep learning-based SSL solutions. For segmentation, leading methods used nnU-Net and Mamba-like State Space Models with pseudo-labeling and consistency regularization, achieving a Dice score of 0.967 and Instance Affinity of 0.738 on the hidden test set. For registration, effective approaches combined PointNetLK with differentiable SVD and geometric augmentation to handle modality gaps; hybrid neural-classical refinement enabled accurate alignment despite limited labels. All data and code are publicly available at this https URL to ensure reproducibility.
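
For readers unfamiliar with the Dice score quoted above, a minimal sketch of the standard definition for binary masks follows. The challenge's exact evaluation protocol (per-tooth averaging, handling of empty masks, the Instance Affinity metric) is not described here, so the empty-mask convention below is an assumption.

```python
import numpy as np

def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(2.0 * intersection / denom)
```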

[CV-34] RFOP: Rethinking Fusion and Orthogonal Projection for Face-Voice Association ICASSP

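Quick read: This paper addresses face-voice association in multilingual settings, with a focus on aligning information from the two modalities across languages. The key to its solution is to revisit and refine the feature fusion and orthogonal projection strategies so that the model concentrates on the relevant semantic information in the face and voice modalities, improving cross-lingual association. The method performs well on the English-German data split and ranked third in the FAME 2026 challenge with an equal error rate (EER) of 33.1%.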

Link: https://arxiv.org/abs/2512.02860
Authors: Abdul Hannan,Furqan Malik,Hina Jabbar,Syed Suleman Sadiq,Mubashir Noman
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Ranked 3rd in FAME 2026 Challenge, ICASSP

Abstract:The Face-Voice Association in Multilingual Environments (FAME) 2026 challenge investigates the face-voice association task in multilingual scenarios. The challenge introduces English-German face-voice pairs for the evaluation phase. To this end, we revisit fusion and orthogonal projection for face-voice association by focusing on the relevant semantic information within the two modalities. Our method performs favorably on the English-German data split and ranked 3rd in the FAME 2026 challenge, achieving an EER of 33.1.
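
The equal error rate (EER) reported above is the operating point at which the false accept rate equals the false reject rate. A minimal sketch of one common way to estimate it from match scores is shown below; the threshold sweep and the midpoint rule are illustrative choices, not the challenge's official scoring script, and the function assumes both matching and non-matching pairs are present.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Estimate the EER from similarity scores (higher = more likely a match)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_gap, eer = np.inf, 1.0
    for threshold in np.unique(scores):
        accept = scores >= threshold
        far = np.mean(accept[~labels])   # non-matching pairs wrongly accepted
        frr = np.mean(~accept[labels])   # matching pairs wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint at the threshold where the two rates are closest.
            best_gap, eer = gap, (far + frr) / 2.0
    return float(eer)
```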

[CV-35] Are Detectors Fair to Indian IP-AIGC? A Cross-Generator Study

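Quick read: This paper examines the robustness and fairness of current AI-generated image detectors in the identity-preserving AIGC (IP-AIGC) setting, in particular their weak detection performance for Indian and South Asian faces. Its key contribution is what the authors believe to be the first systematic evaluation framework for Indian faces: Indian-focused training splits are drawn from FairFD and HAV-DF, and two held-out IP-AIGC test sets (HIDF-img-ip-genai and HIDF-vid-ip-genai) are built with commercial web-UI generators (Gemini and ChatGPT), which makes it possible to quantify cross-generator generalization and intra-population performance differences. Experiments show that fine-tuning improves in-domain results but substantially degrades detection on unseen IP-AIGC, which points to overfitting to training-generator cues rather than a generic distribution shift. The study therefore calls for representation-preserving adaptation strategies and India-aware benchmark curation to close the generalization gap in AIGC detection.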

Link: https://arxiv.org/abs/2512.02850
Authors: Vishal Dubey,Pallavi Tyagi
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Modern image editors can produce identity-preserving AIGC (IP-AIGC), where the same person appears with new attire, background, or lighting. The robustness and fairness of current detectors in this regime remain unclear, especially for under-represented populations. We present what we believe is the first systematic study of IP-AIGC detection for Indian and South-Asian faces, quantifying cross-generator generalization and intra-population performance. We assemble Indian-focused training splits from FairFD and HAV-DF, and construct two held-out IP-AIGC test sets (HIDF-img-ip-genai and HIDF-vid-ip-genai) using commercial web-UI generators (Gemini and ChatGPT) with identity-preserving prompts. We evaluate two state-of-the-art detectors (AIDE and Effort) under pretrained (PT) and fine-tuned (FT) regimes and report AUC, AP, EER, and accuracy. Fine-tuning yields strong in-domain gains (for example, Effort AUC 0.739 to 0.944 on HAV-DF-test; AIDE EER 0.484 to 0.259), but consistently degrades performance on held-out IP-AIGC for Indian cohorts (for example, AIDE AUC 0.923 to 0.563 on HIDF-img-ip-genai; Effort 0.740 to 0.533), which indicates overfitting to training-generator cues. On non-IP HIDF images, PT performance remains high, which suggests a specific brittleness to identity-preserving edits rather than a generic distribution shift. Our study establishes IP-AIGC-Indian as a challenging and practically relevant scenario and motivates representation-preserving adaptation and India-aware benchmark curation to close generalization gaps in AIGC detection.

[CV-36] Action Anticipation at a Glimpse: To What Extent Can Multimodal Cues Replace Video? WACV2026

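Quick read: This paper addresses the reliance of action anticipation on aggregating long sequences of video frames, asking how accurately upcoming actions can be anticipated from a single observed frame. Its key contribution is AAG (Action Anticipation at a Glimpse), which fuses RGB features with depth information to strengthen spatial reasoning and adds long-term context, obtained either from textual summaries generated by vision-language models or from the predictions of a single-frame action recognizer. This lets the method perform competitively without temporal aggregation.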

Link: https://arxiv.org/abs/2512.02846
Authors: Manuel Benavent-Lledo,Konstantinos Bacharidis,Victoria Manousaki,Konstantinos Papoutsakis,Antonis Argyros,Jose Garcia-Rodriguez
Affiliations: Universidad de Alicante; Foundation for Research and Technology-Hellas; University of Crete; Hellenic Mediterranean University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in WACV 2026 - Applications Track

Abstract:Anticipating actions before they occur is a core challenge in action understanding research. While conventional methods rely on extracting and aggregating temporal information from videos, as humans we can often predict upcoming actions by observing a single moment from a scene, when given sufficient context. Can a model achieve this competence? The short answer is yes, although its effectiveness depends on the complexity of the task. In this work, we investigate to what extent video aggregation can be replaced with alternative modalities. To this end, based on recent advances in visual feature extraction and language-based reasoning, we introduce AAG, a method for Action Anticipation at a Glimpse. AAG combines RGB features with depth cues from a single frame for enhanced spatial reasoning, and incorporates prior action information to provide long-term context. This context is obtained either through textual summaries from Vision-Language Models, or from predictions generated by a single-frame action recognizer. Our results demonstrate that multimodal single-frame action anticipation using AAG can perform competitively compared to both temporally aggregated video baselines and state-of-the-art methods across three instructional activity datasets: IKEA-ASM, Meccano, and Assembly101.

[CV-37] Defense That Attacks: How Robust Models Become Better Attackers

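Quick read: This paper asks whether adversarial training (AT) unintentionally increases the transferability of adversarial examples. The study finds that although adversarial training improves a model's robustness to attacks, the perturbations that AT models produce transfer across models more readily than those from standard-trained models, creating a new ecosystem risk. The key contributions are twofold: systematic experiments that confirm this paradox, and the argument that robustness evaluations should measure a model's propensity to produce transferable adversarial examples in addition to its resistance to transferred attacks, supporting more complete defense design and evaluation standards.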

Link: https://arxiv.org/abs/2512.02830
Authors: Mohamed Awad,Mahmoud Akrm,Walid Gomaa
Affiliations: Egypt Japan University of Science and Technology; MBZUAI; Alexandria University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep learning has achieved great success in computer vision, but remains vulnerable to adversarial attacks. Adversarial training is the leading defense designed to improve model robustness. However, its effect on the transferability of attacks is underexplored. In this work, we ask whether adversarial training unintentionally increases the transferability of adversarial examples. To answer this, we trained a diverse zoo of 36 models, including CNNs and ViTs, and conducted comprehensive transferability experiments. Our results reveal a clear paradox: adversarially trained (AT) models produce perturbations that transfer more effectively than those from standard models, which introduces a new ecosystem risk. To enable reproducibility and further study, we release all models, code, and experimental scripts. Furthermore, we argue that robustness evaluations should assess not only the resistance of a model to transferred attacks but also its propensity to produce transferable adversarial examples.

[CV-38] PhyCustom: Towards Realistic Physical Customization in Text-to-Image Generation

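Quick read: This paper addresses the weakness of current diffusion-based text-to-image customization methods on physical concepts: these methods control concrete concepts such as style and shape well, but handle real physical properties (for example gravity, elasticity, and density) poorly, and the generated results fail to reflect the corresponding physical properties even when physics-related words appear in the prompt. The key to its solution is the PhyCustom framework, which introduces two new regularization losses: an isometric loss that activates the diffusion model to learn physical concepts, and a decouple loss that eliminates the mixed learning of independent concepts, enabling accurate control and separate modeling of physical properties.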

Link: https://arxiv.org/abs/2512.02794
Authors: Fan Wu,Cheng Chen,Zhoujie Fu,Jiacheng Wei,Yi Xu,Deheng Ye,Guosheng Lin
Affiliations: Nanyang Technological University; Goertek Alpha Labs; Tencent
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: codes: this https URL

Abstract:Recent diffusion-based text-to-image customization methods have achieved significant success in understanding concrete concepts to control generation processes, such as styles and shapes. However, few efforts dive into the realistic yet challenging customization of physical concepts. The core limitation of current methods arises from the absence of explicitly introducing physical knowledge during training. Even when physics-related words appear in the input text prompts, our experiments consistently demonstrate that these methods fail to accurately reflect the corresponding physical properties in the generated results. In this paper, we propose PhyCustom, a fine-tuning framework comprising two novel regularization losses to activate diffusion model to perform physical customization. Specifically, the proposed isometric loss aims at activating diffusion models to learn physical concepts while decouple loss helps to eliminate the mixture learning of independent concepts. Experiments are conducted on a diverse dataset and our benchmark results demonstrate that PhyCustom outperforms previous state-of-the-art and popular methods in terms of physical customization quantitatively and qualitatively.

[CV-39] IC-World: In-Context Generation for Shared World Modeling

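Quick read: This paper addresses shared world modeling: generating multiple videos from a set of input images, each video showing the same scene from a different camera pose, while keeping scene-level geometry and object-level motion consistent. The key to its solution is the IC-World framework, which uses the inherent in-context generation capability of large video models to generate for all input images in parallel, and then fine-tunes with the reinforcement learning method Group Relative Policy Optimization (GRPO) together with two newly designed reward models that target scene-level geometry consistency and object-level motion consistency respectively, improving the quality and consistency of the generated videos.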

Link: https://arxiv.org/abs/2512.02793
Authors: Fan Wu,Jiacheng Wei,Ruibo Li,Yi Xu,Junyou Li,Deheng Ye,Guosheng Lin
Affiliations: Nanyang Technological University; Goertek Alpha Labs; Tencent
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: codes: this https URL

Abstract:Video-based world models have recently garnered increasing attention for their ability to synthesize diverse and dynamic visual environments. In this paper, we focus on shared world modeling, where a model generates multiple videos from a set of input images, each representing the same underlying world in different camera poses. We propose IC-World, a novel generation framework, enabling parallel generation for all input images via activating the inherent in-context generation capability of large video models. We further finetune IC-World via reinforcement learning, Group Relative Policy Optimization, together with two proposed novel reward models to enforce scene-level geometry consistency and object-level motion consistency among the set of generated videos. Extensive experiments demonstrate that IC-World substantially outperforms state-of-the-art methods in both geometry and motion consistency. To the best of our knowledge, this is the first work to systematically explore the shared world modeling problem with video-based world models.
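
The group-relative idea behind GRPO can be sketched in a few lines: several rollouts are sampled for the same input, and each rollout's advantage is its reward standardized against the group. The snippet below is a minimal illustration of that normalization only. The `combine_rewards` helper and its weights are hypothetical, since the abstract does not say how the geometry and motion rewards are merged.

```python
import numpy as np

def combine_rewards(geometry, motion, w_geometry=0.5, w_motion=0.5):
    """Hypothetical weighted sum of the two per-rollout consistency rewards."""
    return w_geometry * np.asarray(geometry, dtype=float) + w_motion * np.asarray(motion, dtype=float)

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts sampled for the same input."""
    rewards = np.asarray(rewards, dtype=float)
    # eps keeps the division finite when every rollout in the group scores the same.
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```

Standardizing within the group removes the need for a learned value function as a baseline: a rollout is only pushed up if it beats its siblings for the same set of input images.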

[CV-40] HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval ACM-MM2025

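Quick read: This paper studies inaccurate understanding of multi-modal queries in Composed Video Retrieval (CVR). The core challenge is that existing methods overlook the difference in information density between the video and text modalities, which leads to two key issues, modification subject referring ambiguity and limited detailed semantic focus, and both limit model performance. The key to its solution is the Hierarchical Uncertainty-aware Disambiguation network (HUD), the first framework to exploit this difference in information density. It has three components: (a) Holistic Pronoun Disambiguation, (b) Atomistic Uncertainty Modeling, and (c) Holistic-to-Atomistic Alignment. By capturing overlapping semantics and performing fine-grained alignment through cross-modal interaction, HUD improves object disambiguation, strengthens the focus on detailed semantics, and achieves precise composed feature learning.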

Link: https://arxiv.org/abs/2512.02792
Authors: Zhiwei Chen,Yupeng Hu,Zixu Li,Zhiheng Fu,Haokun Wen,Weili Guan
Affiliations: Shandong University; Harbin Institute of Technology (Shenzhen); City University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted by ACM MM 2025

Abstract:Composed Video Retrieval (CVR) is a challenging video retrieval task that utilizes multi-modal queries, consisting of a reference video and modification text, to retrieve the desired target video. The core of this task lies in understanding the multi-modal composed query and achieving accurate composed feature learning. Within multi-modal queries, the video modality typically carries richer semantic content compared to the textual modality. However, previous works have largely overlooked the disparity in information density between these two modalities. This limitation can lead to two critical issues: 1) modification subject referring ambiguity and 2) limited detailed semantic focus, both of which degrade the performance of CVR models. To address the aforementioned issues, we propose a novel CVR framework, namely the Hierarchical Uncertainty-aware Disambiguation network (HUD). HUD is the first framework that leverages the disparity in information density between video and text to enhance multi-modal query understanding. It comprises three key components: (a) Holistic Pronoun Disambiguation, (b) Atomistic Uncertainty Modeling, and (c) Holistic-to-Atomistic Alignment. By exploiting overlapping semantics through holistic cross-modal interaction and fine-grained semantic alignment via atomistic-level cross-modal interaction, HUD enables effective object disambiguation and enhances the focus on detailed semantics, thereby achieving precise composed feature learning. Moreover, our proposed HUD is also applicable to the Composed Image Retrieval (CIR) task and achieves state-of-the-art performance across three benchmark datasets for both CVR and CIR tasks. The codes are available on this https URL.

[CV-41] UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits

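Quick read: This paper addresses the widening performance gap between closed-source and open-source image editing models, which it attributes to the scarcity of large-scale, high-quality training data and the lack of benchmarks that can diagnose model weaknesses across diverse editing behaviors. To address this, the authors propose a lightweight data construction pipeline whose key innovation is replacing the traditional multi-tool chain with an end-to-end model and adding a unified post-verification stage. They also train Qwen-Verify, a 7B-parameter dual-task expert model that detects failure cases and recaptions instructions, providing scalable quality control. The result is the 10M-sample UnicEdit-10M dataset and the more general UnicBench benchmark, together with new metrics such as Non-edit Consistency and Reasoning Accuracy that support fine-grained diagnosis of mainstream models and indicate directions for improvement.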

Link: https://arxiv.org/abs/2512.02790
Authors: Keming Ye,Zhipeng Huang,Canmiao Fu,Qingyang Liu,Jiani Cai,Zheqi Lv,Chen Li,Jing Lyu,Zhou Zhao,Shengyu Zhang
Affiliations: Zhejiang University; WeChat Vision, Tencent Inc.; Shanghai Jiao Tong University; Xinjiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 31 pages, 15 figures, 12 tables

Abstract:With the rapid advances of powerful multimodal models such as GPT-4o, Nano Banana, and Seedream 4.0 in Image Editing, the performance gap between closed-source and open-source models is widening, primarily due to the scarcity of large-scale, high-quality training data and comprehensive benchmarks capable of diagnosing model weaknesses across diverse editing behaviors. Existing data construction methods face a scale-quality trade-off: human annotations are high-quality but not scalable, while automated pipelines suffer from error propagation and noise. To address this, we introduce a lightweight data pipeline that replaces multi-toolchains with an end-to-end model and a unified post-verification stage. For scalable quality control, we train a 7B dual-task expert model, Qwen-Verify, for efficient failure detection and instruction recaptioning. This pipeline yields UnicEdit-10M, a 10M-scale dataset spanning diverse basic and complex editing tasks. We also propose UnicBench, a general benchmark that extends beyond basic edits to explicitly assess spatial and knowledge-driven reasoning. To enable fine-grained diagnosis, we introduce novel metrics, including Non-edit Consistency and Reasoning Accuracy. Our analysis of mainstream models on UnicBench reveals their limitations and provides clear directions for future research.

[CV-42] TrackNetV5: Residual-Driven Spatio-Temporal Refinement and Motion Direction Decoupling for Fast Object Tracking

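Quick read: This paper addresses two bottlenecks of the TrackNet series in tracking fast-moving small objects: early versions (V1 to V3) rely only on visual cues and struggle under occlusion, while TrackNetV4 adds motion information but uses an absolute difference that discards motion polarity, leaving direction ambiguous. The key to its solution is the TrackNetV5 architecture, built around two modules. The Motion Direction Decoupling (MDD) module decomposes temporal dynamics into signed polarity fields that explicitly encode both the occurrence of movement and the trajectory direction. The Residual-Driven Spatio-Temporal Refinement (R-STR) head follows a coarse-to-fine paradigm and uses factorized spatio-temporal context to estimate a corrective residual that recovers occluded targets. Experiments on the TrackNetV2 dataset report a new state-of-the-art F1-score of 0.9859 and accuracy of 0.9733, with only a 3.7% increase in FLOPs and real-time inference preserved.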

Link: https://arxiv.org/abs/2512.02789
Authors: Tang Haonan,Chen Yanjun,Jiang Lezhi
Affiliations: Wuhan University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The TrackNet series has established a strong baseline for fast-moving small object tracking in sports. However, existing iterations face significant limitations: V1-V3 struggle with occlusions due to a reliance on purely visual cues, while TrackNetV4, despite introducing motion inputs, suffers from directional ambiguity as its absolute difference method discards motion polarity. To overcome these bottlenecks, we propose TrackNetV5, a robust architecture integrating two novel mechanisms. First, to recover lost directional priors, we introduce the Motion Direction Decoupling (MDD) module. Unlike V4, MDD decomposes temporal dynamics into signed polarity fields, explicitly encoding both movement occurrence and trajectory direction. Second, we propose the Residual-Driven Spatio-Temporal Refinement (R-STR) head. Operating on a coarse-to-fine paradigm, this Transformer-based module leverages factorized spatio-temporal contexts to estimate a corrective residual, effectively recovering occluded targets. Extensive experiments on the TrackNetV2 dataset demonstrate that TrackNetV5 achieves a new state-of-the-art F1-score of 0.9859 and an accuracy of 0.9733, significantly outperforming previous versions. Notably, this performance leap is achieved with a marginal 3.7% increase in FLOPs compared to V4, maintaining real-time inference capabilities while delivering superior tracking precision.
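
A small example helps show why an absolute frame difference loses direction while a signed decomposition keeps it. The sketch below only illustrates that point: splitting the signed difference into positive and negative channels is an assumption about what "signed polarity fields" could look like, not the MDD module as implemented by the authors.

```python
import numpy as np

def absolute_difference(prev_frame, curr_frame):
    """Motion magnitude only: identical for an object moving left or right."""
    return np.abs(curr_frame.astype(np.float32) - prev_frame.astype(np.float32))

def signed_polarity_fields(prev_frame, curr_frame):
    """Split the signed frame difference into two non-negative channels."""
    diff = curr_frame.astype(np.float32) - prev_frame.astype(np.float32)
    brightening = np.maximum(diff, 0.0)   # pixels the bright object has just entered
    darkening = np.maximum(-diff, 0.0)    # pixels the bright object has just left
    return brightening, darkening
```

For a bright ball on a dark background, the brightening channel marks where the ball has arrived and the darkening channel marks where it came from, so the pair encodes direction. Swapping the two frames swaps the channels, while the absolute difference stays exactly the same.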

[CV-43] Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols

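Quick read: This paper addresses two challenges for current Vision-Language-Action (VLA) models in robotic manipulation: they cannot accurately diagnose the causes of failure, and existing failure datasets are mostly generated in simulation and generalize poorly to the real world. The key to its solution is the ViFailback framework, which uses explicit visual symbols to make failure annotation more efficient, and the ViFailback dataset, a large-scale collection of 5,202 real-world manipulation trajectories and 58,126 Visual Question Answering (VQA) pairs. The authors further establish ViFailback-Bench, a benchmark of 11 fine-grained VQA tasks for evaluating how well Vision-Language Models (VLMs) diagnose failures and guide corrections. The ViFailback-8B VLM trained within this framework performs well on ViFailback-Bench, generates visual symbols that provide interpretable guidance for corrective actions, and, when integrated with a VLA model, is shown to help with failure recovery in real-world robot experiments.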

Link: https://arxiv.org/abs/2512.02787
Authors: Xianchao Zeng,Xinyu Zhou,Youcheng Li,Jiayou Shi,Tianle Li,Liangming Chen,Lei Ren,Yong-Lu Li
Affiliations: Beihang University; Shanghai Innovation Institute; Southern University of Science and Technology; Shanghai Jiao Tong University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic manipulation, yet they remain limited in failure diagnosis and learning from failures. Additionally, existing failure datasets are mostly generated programmatically in simulation, which limits their generalization to the real world. In light of these, we introduce ViFailback, a framework designed to diagnose robotic manipulation failures and provide both textual and visual correction guidance. Our framework utilizes explicit visual symbols to enhance annotation efficiency. We further release the ViFailback dataset, a large-scale collection of 58,126 Visual Question Answering (VQA) pairs along with their corresponding 5,202 real-world manipulation trajectories. Based on the dataset, we establish ViFailback-Bench, a benchmark of 11 fine-grained VQA tasks designed to assess the failure diagnosis and correction abilities of Vision-Language Models (VLMs), featuring ViFailback-Bench Lite for closed-ended and ViFailback-Bench Hard for open-ended evaluation. To demonstrate the effectiveness of our framework, we built the ViFailback-8B VLM, which not only achieves significant overall performance improvement on ViFailback-Bench but also generates visual symbols for corrective action guidance. Finally, by integrating ViFailback-8B with a VLA model, we conduct real-world robotic experiments demonstrating its ability to assist the VLA model in recovering from failures. Project Website: this https URL

[CV-44] LumiX: Structured and Coherent Text-to-Intrinsic Generation

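Quick read: This paper addresses structural inconsistency and physical implausibility among the multiple maps (such as albedo, irradiance, normal, and depth) produced in text-to-intrinsic generation, that is, how to jointly generate a structurally consistent and physically plausible set of intrinsic maps from a text prompt. The key to its solution lies in two designs. Query-Broadcast Attention shares queries across all maps in each self-attention block to keep the generated results structurally consistent. Tensor LoRA models cross-map relations in tensor form, enabling parameter-efficient joint training. Together they support a stable, unified multi-property diffusion framework that outperforms existing methods.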

Link: https://arxiv.org/abs/2512.02781
Authors: Xu Han,Biao Zhang,Xiangjun Tang,Xianzhi Li,Peter Wonka
Affiliations: HUST; KAUST
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: The code will be available at this https URL

Abstract:We present LumiX, a structured diffusion framework for coherent text-to-intrinsic generation. Conditioned on text prompts, LumiX jointly generates a comprehensive set of intrinsic maps (e.g., albedo, irradiance, normal, depth, and final color), providing a structured and physically consistent description of an underlying scene. This is enabled by two key contributions: 1) Query-Broadcast Attention, a mechanism that ensures structural consistency by sharing queries across all maps in each self-attention block. 2) Tensor LoRA, a tensor-based adaptation that parameter-efficiently models cross-map relations for efficient joint training. Together, these designs enable stable joint diffusion training and unified generation of multiple intrinsic properties. Experiments show that LumiX produces coherent and physically meaningful results, achieving 23% higher alignment and a better preference score (0.19 vs. -0.41) compared to the state of the art, and it can also perform image-conditioned intrinsic decomposition within the same framework.

[CV-45] Rethinking Surgical Smoke: A Smoke-Type-Aware Laparoscopic Video Desmoking Method and Dataset AAAI-26

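Quick read: This paper addresses surgical smoke that obscures visual guidance in laparoscopic video, and in particular the performance ceiling of existing desmoking methods that do not distinguish between smoke types (Diffusion Smoke and Ambient Smoke). The key to its solution is STANet, the first smoke-type-aware laparoscopic video desmoking network. Its main components are: 1) a smoke mask segmentation sub-network that jointly predicts smoke masks and smoke types through attention-weighted mask aggregation; 2) a smokeless video reconstruction sub-network that uses the two types of smoke mask to guide feature-level desmoking; and 3) a coarse-to-fine disentanglement module that applies smoke-type-aware cross attention between non-entangled and entangled regions to separate them more accurately, reducing confusion between the two smoke types.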

Link: https://arxiv.org/abs/2512.02780
Authors: Qifan Liang,Junlin Li,Zhen Han,Xihao Wang,Zhongyuan Wang,Bin Mei
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 15 figures. Accepted to AAAI-26 (Main Technical Track)

Abstract:Electrocautery or lasers will inevitably generate surgical smoke, which hinders the visual guidance of laparoscopic videos for surgical procedures. The surgical smoke can be classified into different types based on its motion patterns, leading to distinctive spatio-temporal characteristics across smoky laparoscopic videos. However, existing desmoking methods fail to account for such smoke-type-specific distinctions. Therefore, we propose the first Smoke-Type-Aware Laparoscopic Video Desmoking Network (STANet) by introducing two smoke types: Diffusion Smoke and Ambient Smoke. Specifically, a smoke mask segmentation sub-network is designed to jointly conduct smoke mask and smoke type predictions based on the attention-weighted mask aggregation, while a smokeless video reconstruction sub-network is proposed to perform specially desmoking on smoky features guided by two types of smoke mask. To address the entanglement challenges of two smoke types, we further embed a coarse-to-fine disentanglement module into the mask segmentation sub-network, which yields more accurate disentangled masks through the smoke-type-aware cross attention between non-entangled and entangled regions. In addition, we also construct the first large-scale synthetic video desmoking dataset with smoke type annotations. Extensive experiments demonstrate that our method not only outperforms state-of-the-art approaches in quality evaluations, but also exhibits superior generalization across multiple downstream surgical tasks.

[CV-46] AttMetNet: Attention-Enhanced Deep Neural Network for Methane Plume Detection in Sentinel-2 Satellite Imagery

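Quick read: This paper addresses the accuracy of methane plume detection in satellite imagery, specifically how to identify methane absorption features in the Sentinel-2 B11 and B12 bands while suppressing false alarms caused by background variability and complex land cover. Traditional methods rely on band differences or ratios, but they produce many false positives and require manual verification; existing deep learning methods improve on this but lack a mechanism to prioritize methane-specific features. The key to its solution is AttMetNet, a new architecture that fuses the Normalized Difference Methane Index (NDMI) with an attention-enhanced U-Net. By combining NDMI's plume-sensitive cues with attention-driven feature selection, it amplifies methane absorption features and suppresses background noise, giving more robust and accurate plume detection. The model also uses focal loss to mitigate the severe imbalance between positive and negative samples, and is trained on a real methane plume dataset to improve generalization to practical scenarios.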

Link: https://arxiv.org/abs/2512.02751
Authors: Rakib Ahsan,MD Sadik Hossain Shanto,Md Sultanul Arifin,Tanzima Hashem
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 4 figures

Abstract:Methane is a powerful greenhouse gas that contributes significantly to global warming. Accurate detection of methane emissions is the key to taking timely action and minimizing their impact on climate change. We present AttMetNet, a novel attention-enhanced deep learning framework for methane plume detection with Sentinel-2 satellite imagery. The major challenge in developing a methane detection model is to accurately identify methane plumes from Sentinel-2’s B11 and B12 bands while suppressing false positives caused by background variability and diverse land cover types. Traditional detection methods typically depend on the differences or ratios between these bands when comparing the scenes with and without plumes. However, these methods often require verification by a domain expert because they generate numerous false positives. Recent deep learning methods make some improvements using CNN-based architectures, but lack mechanisms to prioritize methane-specific features. AttMetNet introduces a methane-aware architecture that fuses the Normalized Difference Methane Index (NDMI) with an attention-enhanced U-Net. By jointly exploiting NDMI’s plume-sensitive cues and attention-driven feature selection, AttMetNet selectively amplifies methane absorption features while suppressing background noise. This integration establishes a first-of-its-kind architecture tailored for robust methane plume detection in real satellite imagery. Additionally, we employ focal loss to address the severe class imbalance arising from both limited positive plume samples and sparse plume pixels within imagery. Furthermore, AttMetNet is trained on the real methane plume dataset, making it more robust to practical scenarios. Extensive experiments show that AttMetNet surpasses recent methods in methane plume detection with a lower false positive rate, better precision recall balance, and higher IoU.
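
Two ingredients mentioned above lend themselves to a short sketch: a normalized band difference and the focal loss. The focal loss follows the standard binary form with focusing parameter `gamma` and class weight `alpha`. The `normalized_difference` helper is only a generic normalized-difference index over B11 and B12; the exact NDMI definition used by the paper (band order, sign, any reference scene) is not given here, so treat it as an assumption.

```python
import numpy as np

def normalized_difference(b11, b12, eps=1e-6):
    """Generic normalized difference of two bands (assumed form, not the paper's NDMI)."""
    b11 = np.asarray(b11, dtype=float)
    b12 = np.asarray(b12, dtype=float)
    return (b11 - b12) / (b11 + b12 + eps)

def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    y = np.asarray(targets, dtype=float)
    p_t = np.where(y == 1, p, 1.0 - p)              # probability assigned to the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)  # class-dependent weight
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))
```

The `(1 - p_t) ** gamma` factor shrinks the contribution of pixels the model already classifies confidently, which is why the loss suits imagery where plume pixels are a small fraction of each scene.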

[CV-47] Reasoning-Aware Multimodal Fusion for Hateful Video Detection

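Quick read: This paper addresses two challenges in detecting hate speech in online videos: the difficulty of fusing the complex semantic relationships between modalities, and the limited ability to understand implicit hateful intent. The key to its solution is the Reasoning-Aware Multimodal Fusion (RAMF) framework. First, Local-Global Context Fusion (LGCF) captures local salient cues and global temporal structure. Second, Semantic Cross Attention (SCA) enables fine-grained cross-modal semantic interaction. In addition, an adversarial reasoning mechanism, a structured three-stage process (objective description, hate-assumed inference, and non-hate-assumed inference), supplies complementary semantic perspectives that strengthen the model's contextual understanding of subtle hateful intent. Experiments on real-world datasets show improvements over the previous state of the art of 3% in Macro-F1 and 7% in hate-class recall.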

Link: https://arxiv.org/abs/2512.02743
Authors: Shuonan Yang,Tailin Chen,Jiangbei Yue,Guangliang Cheng,Jianbo Jiao,Zeyu Fu
Affiliations: University of Exeter; University of Liverpool; University of Birmingham
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Hate speech in online videos is posing an increasingly serious threat to digital platforms, especially as video content becomes increasingly multimodal and context-dependent. Existing methods often struggle to effectively fuse the complex semantic relationships between modalities and lack the ability to understand nuanced hateful content. To address these issues, we propose an innovative Reasoning-Aware Multimodal Fusion (RAMF) framework. To tackle the first challenge, we design Local-Global Context Fusion (LGCF) to capture both local salient cues and global temporal structures, and propose Semantic Cross Attention (SCA) to enable fine-grained multimodal semantic interaction. To tackle the second challenge, we introduce adversarial reasoning, a structured three-stage process where a vision-language model generates (i) objective descriptions, (ii) hate-assumed inferences, and (iii) non-hate-assumed inferences, providing complementary semantic perspectives that enrich the model’s contextual understanding of nuanced hateful intent. Evaluations on two real-world hateful video datasets demonstrate that our method achieves robust generalisation performance, improving upon state-of-the-art methods by 3% and 7% in Macro-F1 and hate class recall, respectively. We will release the code after the anonymity period ends.

[CV-48] Beyond Paired Data: Self-Supervised UAV Geo-Localization from Reference Imagery Alone WACV2026

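Quick read: This paper addresses image-based localization of unmanned aerial vehicles (UAVs) in environments without a Global Navigation Satellite System (GNSS) signal. Existing methods depend on large paired UAV-satellite image datasets for training, which are costly to acquire and often unavailable, limiting practical use. The key to its solution is a new training paradigm that needs no UAV imagery during training: the model learns directly from satellite-view reference images, and a dedicated augmentation strategy simulates the visual domain shift between satellite views and real UAV views to improve generalization. The authors introduce the CAEVL model for this paradigm and validate it on ViLD, a new dataset of real-world UAV images. The results show performance competitive with approaches trained on paired data.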

链接: https://arxiv.org/abs/2512.02737
作者: Tristan Amadei,Enric Meinhardt-Llopis,Benedicte Bascle,Corentin Abgrall,Gabriele Facciolo
机构: Thales LAS(泰雷兹激光与传感系统公司); Universite Paris-Saclay(巴黎萨克雷大学); ENS Paris-Saclay(巴黎高等师范学院); CNRS(法国国家科学研究中心); Centre Borelli(博雷利中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at WACV 2026

点击查看摘要

Abstract:Image-based localization in GNSS-denied environments is critical for UAV autonomy. Existing state-of-the-art approaches rely on matching UAV images to geo-referenced satellite images; however, they typically require large-scale, paired UAV-satellite datasets for training. Such data are costly to acquire and often unavailable, limiting their applicability. To address this challenge, we adopt a training paradigm that removes the need for UAV imagery during training by learning directly from satellite-view reference images. This is achieved through a dedicated augmentation strategy that simulates the visual domain shift between satellite and real-world UAV views. We introduce CAEVL, an efficient model designed to exploit this paradigm, and validate it on ViLD, a new and challenging dataset of real-world UAV images that we release to the community. Our method achieves competitive performance compared to approaches trained with paired data, demonstrating its effectiveness and strong generalization capabilities.

[CV-49] DF-Mamba: Deformable State Space Modeling for 3D Hand Pose Estimation in Interactions WACV2026

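[Quick read]: This paper targets the lack of robust feature learning in 3D hand pose estimation (3D HPE) under severe occlusion, such as two overlapping hands. Most existing methods extract features with ResNet, whose convolutional inductive bias limits the modeling of global context (inter-joint relations, inter-hand interaction, or scene cues). The key idea is Deformable Mamba (DF-Mamba), an efficient visual feature extraction framework based on state space modeling. It combines Mamba's selective state modeling with a proposed deformable state scanning strategy that aggregates local convolutional features across the image while selectively preserving cues that represent the global context. This significantly improves the accuracy of structured 3D HPE at an inference speed comparable to ResNet-50.

Link: https://arxiv.org/abs/2512.02727
Authors: Yifan Zhou, Takehiko Ohkawa, Guwenxiao Zhou, Kanoko Goto, Takumi Hirose, Yusuke Sekikawa, Nakamasa Inoue
Affiliations: Institute of Science Tokyo; Denso IT Laboratory; The University of Tokyo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to WACV 2026. Project page: this https URL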

Abstract:Modeling daily hand interactions often struggles with severe occlusions, such as when two hands overlap, which highlights the need for robust feature learning in 3D hand pose estimation (HPE). To handle such occluded hand images, it is vital to effectively learn the relationship between local image features (e.g., for occluded joints) and global context (e.g., cues from inter-joints, inter-hands, or the scene). However, most current 3D HPE methods still rely on ResNet for feature extraction, and such CNN’s inductive bias may not be optimal for 3D HPE due to its limited capability to model the global context. To address this limitation, we propose an effective and efficient framework for visual feature extraction in 3D HPE using recent state space modeling (i.e., Mamba), dubbed Deformable Mamba (DF-Mamba). DF-Mamba is designed to capture global context cues beyond standard convolution through Mamba’s selective state modeling and the proposed deformable state scanning. Specifically, for local features after convolution, our deformable scanning aggregates these features within an image while selectively preserving useful cues that represent the global context. This approach significantly improves the accuracy of structured 3D HPE, with comparable inference speed to ResNet-50. Our experiments involve extensive evaluations on five divergent datasets including single-hand and two-hand scenarios, hand-only and hand-object interactions, as well as RGB and depth-based estimation. DF-Mamba outperforms the latest image backbones, including VMamba and Spatial-Mamba, on all datasets and achieves state-of-the-art performance.

[CV-50] GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding

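[Quick read]: This paper addresses visual grounding in remote sensing imagery, where targets are often extremely small within kilometer-scale scenes and text queries involve intricate geospatial relations (relative positions, spatial hierarchies, or contextual dependencies across distant objects). The key idea is GeoViS, a framework that recasts remote sensing visual grounding as a progressive search-and-reasoning process. Instead of predicting the target location in a single step, the model actively explores the global image through a tree-structured sequence of visual cues, combining multimodal perception, spatial reasoning, and reward-guided exploration to refine geospatial hypotheses iteratively. This lets the model detect small targets while keeping holistic scene awareness. On five remote sensing grounding benchmarks it consistently surpasses existing methods and shows strong cross-domain generalization and interpretability.

Link: https://arxiv.org/abs/2512.02715
Authors: Peirong Zhang, Yidan Zhang, Luxiao Xu, Jinliang Lin, Zonghao Guo, Fengxiang Wang, Xue Yang, Kaiwen Wei, Lei Wang
Affiliations: Aerospace Information Research Institute, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Tsinghua University; National University of Defense Technology; Shanghai Jiao Tong University; Chongqing University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 4 figures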

Abstract:Recent advances in multimodal large language models (MLLMs) have led to remarkable progress in visual grounding, enabling fine-grained cross-modal alignment between textual queries and image regions. However, transferring such capabilities to remote sensing imagery remains challenging, as targets are often extremely small within kilometer-scale scenes, and queries typically involve intricate geospatial relations such as relative positions, spatial hierarchies, or contextual dependencies across distant objects. To address these challenges, we propose GeoViS, a Geospatially Rewarded Visual Search framework that reformulates remote sensing visual grounding as a progressive search-and-reasoning process. Rather than directly predicting the target location in a single step, GeoViS actively explores the global image through a tree-structured sequence of visual cues, integrating multimodal perception, spatial reasoning, and reward-guided exploration to refine geospatial hypotheses iteratively. This design enables the model to detect subtle small-scale targets while maintaining holistic scene awareness. Extensive experiments on five remote sensing grounding benchmarks demonstrate that GeoViS achieves precise geospatial understanding and consistently surpasses existing methods across key visual grounding metrics, highlighting its strong cross-domain generalization and interpretability.

[CV-51] Tissue-mask supported inter-subject whole-body image registration in the UK Biobank - A method benchmarking study

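[Quick read]: This paper addresses inter-subject registration of the large-scale whole-body magnetic resonance (MR) images in the UK Biobank. Accurate registration would enable body-wide spatial standardization and region-/voxel-wise correlation analysis between non-imaging data (e.g., age) and image-derived parameters (e.g., fat content or tissue volume). The core challenge is improving the accuracy and robustness of registration across individuals. The key idea is a sex-stratified whole-body MR registration approach that uses subcutaneous adipose tissue and muscle masks produced by the VIBESegmentator method to augment intensity-based graph-cut registration, which improves the spatial alignment of anatomical structures between subjects.

Link: https://arxiv.org/abs/2512.02702
Authors: Yasemin Utkueri, Elin Lundström, Håkan Ahlström, Johan Öfverstedt, Joel Kullberg
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: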

Abstract:The UK Biobank is a large-scale study collecting whole-body MR imaging and non-imaging health data. Robust and accurate inter-subject image registration of these whole-body MR images would enable their body-wide spatial standardization, and region-/voxel-wise correlation analysis of non-imaging data with image-derived parameters (e.g., tissue volume or fat content). We propose a sex-stratified inter-subject whole-body MR image registration approach that uses subcutaneous adipose tissue- and muscle-masks from the state-of-the-art VIBESegmentator method to augment intensity-based graph-cut registration. The proposed method was evaluated on a subset of 4000 subjects by comparing it to an intensity-only method as well as two previously published registration methods, uniGradICON and MIRTK. The evaluation comprised overlap measures applied to the 71 VIBESegmentator masks: 1) Dice scores, and 2) voxel-wise label error frequency. Additionally, voxel-wise correlation between age and each of fat content and tissue volume was studied to exemplify the usefulness for medical research. The proposed method exhibited a mean dice score of 0.77 / 0.75 across the cohort and the 71 masks for males/females, respectively. When compared to the intensity-only registration, the mean values were 6 percentage points (pp) higher for both sexes, and the label error frequency was decreased in most tissue regions. These differences were 9pp / 8pp against uniGradICON and 12pp / 13pp against MIRTK. Using the proposed method, the age-correlation maps were less noisy and showed higher anatomical alignment. In conclusion, the image registration method using two tissue masks improves whole-body registration of UK Biobank images. 
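
The Dice score used as the overlap measure above can be illustrated with a minimal Python sketch. It assumes binary masks flattened into equal-length sequences of 0/1 values; the function name and the toy masks are hypothetical and do not come from the study's code.

```python
def dice_score(mask_a, mask_b):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two binary masks.

    Masks are flat sequences of 0/1 (or False/True) of equal length.
    Two empty masks are treated as a perfect match (an assumed convention).
    """
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example: a warped subject mask compared with a reference mask.
reference = [1, 1, 0, 0]
warped = [1, 0, 1, 0]
print(dice_score(reference, warped))
```

In a benchmarking setup like the one described, such a score would be computed per mask and per subject after warping, then averaged across the cohort.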
Submission history: v1 submitted Tue, 2 Dec 2025 12:30:59 UTC (19,456 KB) by Yasemin Utkueri. DOI: https://doi.org/10.48550/arXiv.2512.02702 (arXiv-issued via DataCite, pending registration).

[CV-52] VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm

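[Quick read]: This paper addresses the high computational cost that the large number of visual tokens imposes when deploying vision-language models (VLMs) on mobile devices. Existing pruning methods either rely solely on token importance and overlook inter-token redundancy, or ignore the spatial relationships among visual tokens, so they retain duplicated tokens or select tokens too sparsely to cover target object regions. The key idea is VLM-Pruner, a training-free token pruning algorithm with four components: (1) a centrifugal token pruning paradigm that selects tokens from near to far and prioritizes fine-grained object details; (2) a Buffering for Spatial Sparsity (BSS) criterion that defers the selection of spatially distant tokens; (3) a parallel greedy strategy for efficient token selection; and (4) selective fusion of salient information from discarded tokens into the retained ones to mitigate information loss. Across five VLMs at an 88.9% pruning rate, the method consistently outperforms strong baselines while delivering an end-to-end inference speedup.

Link: https://arxiv.org/abs/2512.02700
Authors: Zhenkai Wu, Xiaowen Ma, Zhenliang Ni, Dengming Zhang, Han Shu, Xin Jiang, Xinghao Chen
Affiliations: Zhejiang University; Huawei Noah’s Ark Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: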

Abstract:Vision-language models (VLMs) excel at image understanding tasks, but the large number of visual tokens imposes significant computational costs, hindering deployment on mobile devices. Many pruning methods rely solely on token importance and thus overlook inter-token redundancy, retaining numerous duplicated tokens and wasting capacity. Although some redundancy-aware approaches have been proposed, they often ignore the spatial relationships among visual tokens. This can lead to overly sparse selections of retained tokens that fail to adequately cover the regions of target objects. To address these limitations, we propose VLM-Pruner, a training-free token pruning algorithm that explicitly balances redundancy and spatial sparsity. We introduce a centrifugal token pruning paradigm that enables near-to-far selection while prioritizing the preservation of fine-grained object details. Moreover, we design a Buffering for Spatial Sparsity (BSS) criterion that defers the selection of spatially distant tokens. We further adopt a parallel greedy strategy to conduct token selection efficiently. To mitigate information loss from pruning, we selectively fuse salient information from the discarded tokens into the retained ones. Comprehensive comparisons demonstrate that VLM-Pruner consistently outperforms strong baselines across five VLMs with an 88.9% pruning rate, while delivering an end-to-end inference speedup.
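
The abstract does not give the selection rule, so the Python sketch below is not VLM-Pruner itself. It is a simplified, sequential illustration of three ideas named above (near-to-far selection, a redundancy penalty, and deferring spatially distant tokens). The scoring formula, the `radius` parameter, and all names are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def prune_tokens(positions, importance, features, keep, radius=1.5):
    """Greedily pick `keep` token indices, growing outward from a pivot.

    Hypothetical rule: a candidate's score is its importance scaled by
    (1 - max similarity to already selected tokens). Tokens farther than
    `radius` from every selected token are buffered and only considered
    once no nearby candidate is left.
    """
    n = len(positions)
    selected = [max(range(n), key=lambda i: importance[i])]  # pivot token
    while len(selected) < min(keep, n):
        remaining = [i for i in range(n) if i not in selected]
        near = [
            i for i in remaining
            if min(math.dist(positions[i], positions[j]) for j in selected) <= radius
        ]
        pool = near or remaining  # fall back to buffered (distant) tokens

        def score(i):
            redundancy = max(cosine(features[i], features[j]) for j in selected)
            return importance[i] * (1.0 - redundancy)

        selected.append(max(pool, key=score))
    return selected


# Toy grid: token 3 is important but far away, so it is deferred.
positions = [(0, 0), (0, 1), (0, 2), (5, 5)]
importance = [1.0, 0.5, 0.6, 0.9]
features = [[1, 0], [0, 1], [1, 0], [0, 1]]
print(prune_tokens(positions, importance, features, keep=3))
```

In the toy run the distant token is skipped despite its high importance because closer candidates remain. The paper additionally describes a parallel greedy strategy and fusion of discarded tokens, neither of which is modeled here.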

[CV-53] GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization

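[Quick read]: This paper addresses the limited robustness of the traditional satellite-centric paradigm for geo-localization across views (drone, satellite, and street) and modalities (language and image), especially when high-resolution or up-to-date satellite imagery is unavailable. The core solution is GeoBridge, a foundation model that performs bidirectional matching across views and supports language-to-image retrieval. Its key innovation is a semantic-anchor mechanism that bridges multi-view features through textual descriptions, improving the robustness and flexibility of localization.

Link: https://arxiv.org/abs/2512.02697
Authors: Zixuan Song, Jing Zhang, Di Wang, Zidie Zhou, Wenbin Liu, Haonan Guo, En Wang, Bo Du
Affiliations: Jilin University; Wuhan University; Zhongguancun Academy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: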

Abstract:Cross-view geo-localization infers a location by retrieving geo-tagged reference images that visually correspond to a query image. However, the traditional satellite-centric paradigm limits robustness when high-resolution or up-to-date satellite imagery is unavailable. It further underexploits complementary cues across views (e.g., drone, satellite, and street) and modalities (e.g., language and image). To address these challenges, we propose GeoBridge, a foundation model that performs bidirectional matching across views and supports language-to-image retrieval. Going beyond traditional satellite-centric formulations, GeoBridge builds on a novel semantic-anchor mechanism that bridges multi-view features through textual descriptions for robust, flexible localization. In support of this task, we construct GeoLoc, the first large-scale, cross-modal, and multi-view aligned dataset comprising over 50,000 pairs of drone, street-view panorama, and satellite images as well as their textual descriptions, collected from 36 countries, ensuring both geographic and semantic alignment. We performed broad evaluations across multiple tasks. Experiments confirm that GeoLoc pre-training markedly improves geo-location accuracy for GeoBridge while promoting cross-domain generalization and cross-modal knowledge transfer. The dataset, source code, and pretrained models were released at this https URL.

[CV-54] ALDI-ray: Adapting the ALDI Framework for Security X-ray Object Detection ICASSP2026

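[Quick read]: This paper addresses distribution shift in security X-ray imaging caused by differences between scanning devices and environmental conditions, with the goal of more stable and better-generalizing object detection across domains. The solution applies ALDI++, an existing domain adaptation framework that integrates self-distillation, feature alignment, and enhanced training strategies, to narrow the gap between source and target domains. Experiments on the EDS dataset show that ALDI++ surpasses state-of-the-art domain adaptation methods across multiple adaptation scenarios. With a ViTDet backbone it achieves the highest mean average precision (mAP), and the category-wise analysis shows consistent gains, supporting the effectiveness of Transformer-based architectures for cross-domain object detection.

Link: https://arxiv.org/abs/2512.02696
Authors: Omid Reza Heidari, Yang Wang, Xinxin Zuo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Submitted to ICASSP 2026 Conference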

Abstract:Domain adaptation in object detection is critical for real-world applications where distribution shifts degrade model performance. Security X-ray imaging presents a unique challenge due to variations in scanning devices and environmental conditions, leading to significant domain discrepancies. To address this, we apply ALDI++, a domain adaptation framework that integrates self-distillation, feature alignment, and enhanced training strategies to mitigate domain shift effectively in this area. We conduct extensive experiments on the EDS dataset, demonstrating that ALDI++ surpasses the state-of-the-art (SOTA) domain adaptation methods across multiple adaptation scenarios. In particular, ALDI++ with a Vision Transformer for Detection (ViTDet) backbone achieves the highest mean average precision (mAP), confirming the effectiveness of transformer-based architectures for cross-domain object detection. Additionally, our category-wise analysis highlights consistent improvements in detection accuracy, reinforcing the robustness of the model across diverse object classes. Our findings establish ALDI++ as an efficient solution for domain-adaptive object detection, setting a new benchmark for performance stability and cross-domain generalization in security X-ray imagery.

[CV-55] ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data

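[Quick read]: This paper addresses the limited generalization of anomaly (out-of-distribution, OoD) segmentation in autonomous driving, which stems from the scarcity and limited diversity of real anomaly data. The key idea is ClimaDrive, a semantics-guided image-to-image generation framework that unifies structure-guided multi-weather generation with prompt-driven anomaly inpainting to synthesize semantically coherent, weather-diverse, and physically plausible OoD driving data. The ClimaOoD benchmark built with this framework is used to validate the approach: training with it improves four state-of-the-art methods on anomaly segmentation metrics (AUROC, AP, and FPR95), for example lowering FPR95 from 3.97 to 3.52 for RbA on Fishyscapes LAF.

Link: https://arxiv.org/abs/2512.02686
Authors: Yuxing Liu, Yong Liu
Affiliations: Beijing University of Chemical Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review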

Abstract:Anomaly segmentation seeks to detect and localize unknown or out-of-distribution (OoD) objects that fall outside predefined semantic classes, a capability essential for safe autonomous driving. However, the scarcity and limited diversity of anomaly data severely constrain model generalization in open-world environments. Existing approaches mitigate this issue through synthetic data generation, either by copy-pasting external objects into driving scenes or by leveraging text-to-image diffusion models to inpaint anomalous regions. While these methods improve anomaly diversity, they often lack contextual coherence and physical realism, resulting in domain gaps between synthetic and real data. In this paper, we present ClimaDrive, a semantics-guided image-to-image framework for synthesizing semantically coherent, weather-diverse, and physically plausible OoD driving data. ClimaDrive unifies structure-guided multi-weather generation with prompt-driven anomaly inpainting, enabling the creation of visually realistic training data. Based on this framework, we construct ClimaOoD, a large-scale benchmark spanning six representative driving scenarios under both clear and adverse weather conditions. Extensive experiments on four state-of-the-art methods show that training with ClimaOoD leads to robust improvements in anomaly segmentation. Across all methods, AUROC, AP, and FPR95 show notable gains, with FPR95 dropping from 3.97 to 3.52 for RbA on Fishyscapes LAF. These results demonstrate that ClimaOoD enhances model robustness, offering valuable training data for better generalization in open-world anomaly detection.
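
As background for the reported metrics, the minimal Python sketch below computes AUROC and FPR95 (the false positive rate at the threshold that recalls 95% of anomalies) from per-pixel anomaly scores. It assumes higher scores mean "more anomalous" and labels of 1 for anomalous pixels; the function names and toy values are illustrative and unrelated to the paper's evaluation code.

```python
import math


def split_scores(scores, labels):
    """Separate scores of anomalous (label 1) and normal (label 0) samples."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    return pos, neg


def fpr_at_tpr(scores, labels, target_tpr=0.95):
    """False positive rate at the highest threshold reaching `target_tpr`."""
    pos, neg = split_scores(scores, labels)
    pos.sort(reverse=True)
    needed = math.ceil(target_tpr * len(pos))  # positives that must be recalled
    threshold = pos[needed - 1]
    false_positives = sum(1 for s in neg if s >= threshold)
    return false_positives / len(neg)


def auroc(scores, labels):
    """Probability that a random anomaly outscores a random normal sample."""
    pos, neg = split_scores(scores, labels)
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Toy scores: four anomalous pixels followed by four normal pixels.
scores = [0.9, 0.8, 0.7, 0.6, 0.65, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(auroc(scores, labels), fpr_at_tpr(scores, labels))
```

Lower FPR95 is better, while higher AUROC is better, which is why the two are usually reported side by side. The pairwise AUROC loop is quadratic and only suitable for small examples.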

[CV-56] Unsupervised Structural Scene Decomposition via Foreground-Aware Slot Attention with Pseudo-Mask Guidance

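[Quick read]: This paper addresses a weakness of existing slot attention-based methods for unsupervised scene decomposition: they do not distinguish foreground from background, which causes background interference and poor instance discovery on real-world data. The key idea is Foreground-Aware Slot Attention (FASA), a two-stage framework. The first stage performs a coarse scene decomposition with a dual-slot competition mechanism and clustering-based slot initialization to separate foreground from background. The second stage introduces a masked slot attention mechanism in which the first slot models the background while the remaining slots compete to represent individual foreground objects. To reduce over-segmentation of foreground objects, pseudo-masks derived from a patch affinity graph built on self-supervised image features guide the learning of the foreground slots. The design improves the robustness of scene decomposition and yields object-coherent representations.

Link: https://arxiv.org/abs/2512.02685
Authors: Huankun Sheng, Ming Li, Yixiang Wei, Yeying Fan, Yu-Hui Wen, Tieliang Gong, Yong-Jin Liu
Affiliations: Tsinghua University; Beijing Jiaotong University; Xi’an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: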

Abstract:Recent advances in object-centric representation learning have shown that slot attention-based methods can effectively decompose visual scenes into object slot representations without supervision. However, existing approaches typically process foreground and background regions indiscriminately, often resulting in background interference and suboptimal instance discovery performance on real-world data. To address this limitation, we propose Foreground-Aware Slot Attention (FASA), a two-stage framework that explicitly separates foreground from background to enable precise object discovery. In the first stage, FASA performs a coarse scene decomposition to distinguish foreground from background regions through a dual-slot competition mechanism. These slots are initialized via a clustering-based strategy, yielding well-structured representations of salient regions. In the second stage, we introduce a masked slot attention mechanism where the first slot captures the background while the remaining slots compete to represent individual foreground objects. To further address over-segmentation of foreground objects, we incorporate pseudo-mask guidance derived from a patch affinity graph constructed with self-supervised image features to guide the learning of foreground slots. Extensive experiments on both synthetic and real-world datasets demonstrate that FASA consistently outperforms state-of-the-art methods, validating the effectiveness of explicit foreground modeling and pseudo-mask guidance for robust scene decomposition and object-coherent representation. Code will be made publicly available.

[CV-57] PGP-DiffSR: Phase-Guided Progressive Pruning for Efficient Diffusion-based Image Super-Resolution

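[Quick read]: This paper addresses the excessive computational and memory cost of diffusion-based image super-resolution, which typically relies on large backbones such as Stable Diffusion XL and Diffusion Transformers. The key idea is PGP-DiffSR, a lightweight diffusion method built on two mechanisms. First, a progressive pruning strategy removes redundant blocks inside the diffusion backbone while preserving restoration capability. Second, a phase-exchange adapter module uses the phase information of the input image to guide the pruned model toward better reconstruction. Together they substantially reduce computational load and memory consumption while keeping restoration quality competitive.

Link: https://arxiv.org/abs/2512.02681
Authors: Zhongbao Yang, Jiangxin Dong, Yazhou Yao, Jinhui Tang, Jinshan Pan
Affiliations: Nanjing University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages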

Abstract:Although diffusion-based models have achieved impressive results in image super-resolution, they often rely on large-scale backbones such as Stable Diffusion XL (SDXL) and Diffusion Transformers (DiT), which lead to excessive computational and memory costs during training and inference. To address this issue, we develop a lightweight diffusion method, PGP-DiffSR, by removing redundant information from diffusion models under the guidance of the phase information of inputs for efficient image super-resolution. We first identify the intra-block redundancy within the diffusion backbone and propose a progressive pruning approach that removes redundant blocks while reserving restoration capability. We note that the phase information of the restored images produced by the pruned diffusion model is not well estimated. To solve this problem, we propose a phase-exchange adapter module that explores the phase information of the inputs to guide the pruned diffusion model for better restoration performance. We formulate the progressive pruning approach and the phase-exchange adapter module into a unified model. Extensive experiments demonstrate that our method achieves competitive restoration quality while significantly reducing computational load and memory consumption. The code is available at this https URL.
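
The abstract does not specify how the phase-exchange adapter works internally. As generic background only, the Python sketch below shows what "exchanging phase" means in the Fourier domain: it keeps the amplitude spectrum of one signal and takes the phase spectrum from another. It uses a 1D signal and a naive DFT for brevity; the function names are hypothetical, and a real image model would operate on 2D feature maps.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def exchange_phase(amplitude_source, phase_source):
    """Combine the amplitude spectrum of one signal with the phase of another."""
    amp_spec = dft(amplitude_source)
    phase_spec = dft(phase_source)
    mixed = [
        abs(a) * cmath.exp(1j * cmath.phase(p))
        for a, p in zip(amp_spec, phase_spec)
    ]
    # For real inputs the mixed spectrum stays conjugate-symmetric up to
    # rounding error, so the imaginary part of the result is discarded.
    return [value.real for value in idft(mixed)]


restored = [1.0, 2.0, 0.5, -1.0]      # stands in for a pruned-model output
low_res_input = [0.3, -0.7, 2.0, 1.0]  # stands in for the input signal
print(exchange_phase(restored, low_res_input))
```

Phase carries most of a signal's structural layout, which is one common motivation for borrowing it from a more reliable source.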

[CV-58] UAUTrack: Towards Unified Multimodal Anti-UAV Visual Tracking

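[Quick read]: This paper addresses insufficient multimodal data fusion and the lack of a unified collaborative framework in Anti-UAV tracking. Existing methods usually train independent models for single-modality tasks and overlook the potential of cross-modal information sharing. The key idea is UAUTrack, a unified single-target tracking framework built on a single-stream, single-stage, end-to-end architecture that integrates RGB, thermal infrared (TIR), and RGB-T fusion inputs. It adds a text prior prompt strategy that directs the model to focus on UAV targets across scenarios, balancing high accuracy with efficiency.

Link: https://arxiv.org/abs/2512.02668
Authors: Qionglin Ren, Dawei Zhang, Chunxu Tian, Dan Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: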

Abstract:Research in Anti-UAV (Unmanned Aerial Vehicle) tracking has explored various modalities, including RGB, TIR, and RGB-T fusion. However, a unified framework for cross-modal collaboration is still lacking. Existing approaches have primarily focused on independent models for individual tasks, often overlooking the potential for cross-modal information sharing. Furthermore, Anti-UAV tracking techniques are still in their infancy, with current solutions struggling to achieve effective multimodal data fusion. To address these challenges, we propose UAUTrack, a unified single-target tracking framework built upon a single-stream, single-stage, end-to-end architecture that effectively integrates multiple modalities. UAUTrack introduces a key component: a text prior prompt strategy that directs the model to focus on UAVs across various scenarios. Experimental results show that UAUTrack achieves state-of-the-art performance on the Anti-UAV and DUT Anti-UAV datasets, and maintains a favourable trade-off between accuracy and speed on the Anti-UAV410 dataset, demonstrating both high accuracy and practical efficiency across diverse Anti-UAV scenarios.

[CV-59] PolarGuide-GSDR: 3D Gaussian Splatting Driven by Polarization Priors and Deferred Reflection for Real-World Reflective Scenes

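[Quick read]: This paper addresses two sets of limitations in reconstructing reflective scenes. Polarization-aware Neural Radiance Fields (NeRF) train slowly, render inefficiently, and depend strongly on material and viewpoint assumptions. 3D Gaussian Splatting (3DGS) renders in real time but reconstructs reflections inaccurately because reflection and geometry are entangled, and adding a deferred reflection module makes it depend on environment maps. The key idea is PolarGuide-GSDR, which establishes a bidirectional coupling between polarization and 3DGS: the geometric priors of 3DGS first resolve polarization ambiguity, and the refined polarization cues then guide the normal and spherical harmonic representation of 3DGS. This yields high-fidelity reflection separation and full-scene reconstruction without environment maps or restrictive material assumptions, while retaining real-time rendering.

Link: https://arxiv.org/abs/2512.02664
Authors: Derui Shan, Qian Qiao, Hao Lu, Tao Du, Peng Lu
Affiliations: North China University of Technology; Beijing University of Posts and Telecommunications; University of Science and Technology Beijing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: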

Abstract:Polarization-aware Neural Radiance Fields (NeRF) enable novel view synthesis of specular-reflection scenes but suffer from slow training, inefficient rendering, and strong dependence on material/viewpoint assumptions. 3D Gaussian Splatting (3DGS), in contrast, enables real-time rendering yet struggles with accurate reflection reconstruction because of reflection-geometry entanglement, and adding a deferred reflection module introduces environment map dependence. We address these limitations by proposing PolarGuide-GSDR, a polarization-forward-guided paradigm establishing a bidirectional coupling mechanism between polarization and 3DGS: first, 3DGS’s geometric priors are leveraged to resolve polarization ambiguity, and then the refined polarization cues are used to guide 3DGS’s normal and spherical harmonic representation. This process achieves high-fidelity reflection separation and full-scene reconstruction without requiring environment maps or restrictive material assumptions. We demonstrate on public and self-collected datasets that PolarGuide-GSDR achieves state-of-the-art performance in specular reconstruction, normal estimation, and novel view synthesis, all while maintaining real-time rendering capabilities. To our knowledge, this is the first framework embedding polarization priors directly into 3DGS optimization, yielding superior interpretability and real-time performance for modeling complex reflective scenes.

[CV-60] Spatially-Grounded Document Retrieval via Patch-to-Region Relevance Propagation

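[Quick read]: This paper addresses the fact that current vision-language models (VLMs) for document retrieval return whole pages rather than specific regions, which limits their usefulness for retrieval-augmented generation (RAG). Traditional OCR systems, conversely, extract structured text with bounding box coordinates but cannot judge semantic relevance. The key idea is a hybrid architecture that uses ColPali's patch-level similarity scores as a spatial relevance filter over OCR-extracted text regions. The paper formalizes the coordinate mapping between the vision Transformer patch grid and OCR bounding boxes, introduces intersection metrics for relevance propagation, and establishes theoretical bounds on retrieval precision. The method runs at inference time with no additional training; empirical evaluation is reported as ongoing.

Link: https://arxiv.org/abs/2512.02660
Authors: Agathoklis Georgiou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: 13 pages, 1 figure, 2 tables. Open-source implementation available at this https URL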

Abstract:Vision-language models (VLMs) like ColPali achieve state-of-the-art document retrieval by embedding pages as images and computing fine-grained similarity between query tokens and visual patches. However, they return entire pages rather than specific regions, limiting utility for retrieval-augmented generation (RAG) where precise context is paramount. Conversely, OCR-based systems extract structured text with bounding box coordinates but lack semantic grounding for relevance assessment. We propose a hybrid architecture that unifies these paradigms: using ColPali’s patch-level similarity scores as spatial relevance filters over OCR-extracted regions. We formalize the coordinate mapping between vision transformer patch grids and OCR bounding boxes, introduce intersection metrics for relevance propagation, and establish theoretical bounds on retrieval precision. Our approach operates at inference time without additional training. We release Snappy, an open-source implementation demonstrating practical applicability, with empirical evaluation ongoing.
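
The patch-to-region idea above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Snappy implementation: it assumes a uniform patch grid over the page image, OCR boxes given as `(x0, y0, x1, y1)` in pixels, and an area-weighted mean as the intersection metric. All names are hypothetical.

```python
def region_relevance(patch_scores, image_size, bbox):
    """Propagate patch-level scores to one OCR region.

    patch_scores: 2D list (rows x cols) of per-patch relevance scores.
    image_size:   (width, height) of the page image in pixels.
    bbox:         (x0, y0, x1, y1) OCR bounding box in pixels.
    Returns the mean patch score weighted by patch/box overlap area.
    """
    rows, cols = len(patch_scores), len(patch_scores[0])
    width, height = image_size
    cell_w, cell_h = width / cols, height / rows
    x0, y0, x1, y1 = bbox

    weighted_sum = 0.0
    total_area = 0.0
    for r in range(rows):
        for c in range(cols):
            px0, py0 = c * cell_w, r * cell_h
            # Overlap of the patch cell with the box along each axis.
            overlap_x = max(0.0, min(x1, px0 + cell_w) - max(x0, px0))
            overlap_y = max(0.0, min(y1, py0 + cell_h) - max(y0, py0))
            area = overlap_x * overlap_y
            weighted_sum += area * patch_scores[r][c]
            total_area += area
    return weighted_sum / total_area if total_area else 0.0


# Toy 2x2 patch grid on a 100x100 page; only the top-right patch is relevant.
scores = [[0.0, 1.0],
          [0.0, 0.0]]
print(region_relevance(scores, (100, 100), (50, 0, 100, 50)))
print(region_relevance(scores, (100, 100), (25, 0, 75, 50)))
```

Regions could then be ranked or thresholded by this score to return only the relevant text blocks of a retrieved page. Other intersection metrics, such as the maximum score over overlapping patches, would fit the same interface.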

[CV-61] Real-Time Multimodal Data Collection Using Smartwatches and Its Visualization in Education

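[Quick read]: This paper addresses a barrier to the wider adoption of Multimodal Learning Analytics in education: the lack of scalable, synchronized, high-resolution tools for multimodal data acquisition. The solution consists of two complementary tools. Watch-DMLT is a data acquisition application for Fitbit Sense 2 smartwatches that supports real-time, multi-user collection of physiological and motion signals. ViSeDOPS is a dashboard-based visualization system for analyzing synchronized multimodal data (heart rate, motion, gaze, video, and contextual annotations). A classroom deployment with 65 students and up to 16 smartwatches demonstrates the feasibility and utility of the system for fine-grained, scalable, and interpretable Multimodal Learning Analytics in real learning environments.

Link: https://arxiv.org/abs/2512.02651
Authors: Alvaro Becerra, Pablo Villegas, Ruth Cobos
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Comments: Accepted in Technological Ecosystems for Enhancing Multiculturality (TEEM) 2025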

Abstract:Wearable sensors, such as smartwatches, have become increasingly prevalent across domains like healthcare, sports, and education, enabling continuous monitoring of physiological and behavioral data. In the context of education, these technologies offer new opportunities to study cognitive and affective processes such as engagement, attention, and performance. However, the lack of scalable, synchronized, and high-resolution tools for multimodal data acquisition continues to be a significant barrier to the widespread adoption of Multimodal Learning Analytics in real-world educational settings. This paper presents two complementary tools developed to address these challenges: Watch-DMLT, a data acquisition application for Fitbit Sense 2 smartwatches that enables real-time, multi-user monitoring of physiological and motion signals; and ViSeDOPS, a dashboard-based visualization system for analyzing synchronized multimodal data collected during oral presentations. We report on a classroom deployment involving 65 students and up to 16 smartwatches, where data streams including heart rate, motion, gaze, video, and contextual annotations were captured and analyzed. Results demonstrate the feasibility and utility of the proposed system for supporting fine-grained, scalable, and interpretable Multimodal Learning Analytics in real learning environments.

[CV-62] Hear What Matters! Text-conditioned Selective Video-to-Audio Generation

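[Quick read]: This paper addresses mixed audio in multi-object videos: existing video-to-audio methods generate all sources at once and cannot produce only the sound the user specifies, which limits per-source editing and control in multimedia production. The paper defines this task as text-conditioned selective video-to-audio generation. The key idea is SelVA, a model that treats the text prompt as an explicit selector of the target source and modulates the video encoder to extract prompt-relevant visual features. Supplementary tokens promote cross-attention by suppressing text-irrelevant activations, giving robust semantic and temporal grounding. A self-augmentation scheme compensates for the lack of mono audio track supervision. Experiments verify gains in audio quality, semantic alignment, and temporal synchronization.

Link: https://arxiv.org/abs/2512.02650
Authors: Junwon Lee, Juhan Nam, Jiyoung Lee
Affiliations: KAIST; Ewha Womans University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: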

Abstract:This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially crucial in multimedia production, where audio tracks are handled individually for each sound source for precise editing, mixing, and creative control. However, current approaches generate single source-mixed sounds at once, largely because visual features are entangled, and region cues or prompts often fail to specify the source. We propose SelVA, a novel text-conditioned V2A model that treats the text prompt as an explicit selector of target source and modulates video encoder to distinctly extract prompt-relevant video features. The proposed supplementary tokens promote cross-attention by suppressing text-irrelevant activations with efficient parameter tuning, yielding robust semantic and temporal grounding. SelVA further employs a self-augmentation scheme to overcome the lack of mono audio track supervision. We evaluate SelVA on VGG-MONOAUDIO, a curated benchmark of clean single-source videos for such a task. Extensive experiments and ablations consistently verify its effectiveness across audio quality, semantic alignment, and temporal synchronization. Code and demo are available at this https URL.

[CV-63] PoreTrack3D: A Benchmark for Dynamic 3D Gaussian Splatting in Pore-Scale Facial Trajectory Tracking

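[Quick read]: This paper addresses pore-scale facial trajectory tracking, that is, modeling and analyzing subtle skin-surface motion in non-rigid 3D facial expressions. Conventional approaches track only macroscopic facial landmarks and miss finer changes in skin texture. The paper introduces PoreTrack3D, a benchmark with more than 440,000 facial trajectories that, to the authors' knowledge, is the first to capture both traditional facial landmarks and pore-scale keypoint trajectories. The authors systematically evaluate state-of-the-art dynamic 3D Gaussian splatting methods on it, establishing the first performance baseline in this domain, and present the dataset creation pipeline as a new framework for high-fidelity facial motion capture and dynamic 3D reconstruction.

Link: https://arxiv.org/abs/2512.02648
Authors: Dong Li, Jiahao Xiong, Yingda Huang, Le Chang
Affiliations: Guangdong University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: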

Abstract:We introduce PoreTrack3D, the first benchmark for dynamic 3D Gaussian splatting in pore-scale, non-rigid 3D facial trajectory tracking. It contains over 440,000 facial trajectories in total, among which more than 52,000 are longer than 10 frames, including 68 manually reviewed trajectories that span the entire 150 frames. To the best of our knowledge, PoreTrack3D is the first benchmark dataset to capture trajectories of both traditional facial landmarks and pore-scale keypoints, advancing the study of fine-grained facial expressions through the analysis of subtle skin-surface motion. We systematically evaluate state-of-the-art dynamic 3D Gaussian splatting methods on PoreTrack3D, establishing the first performance baseline in this domain. Overall, the pipeline developed for this benchmark dataset’s creation establishes a new framework for high-fidelity facial motion capture and dynamic 3D reconstruction. Our dataset is publicly available at: this https URL

[CV-64] Leveraging Large-Scale Pretrained Spatial-Spectral Priors for General Zero-Shot Pansharpening

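[Quick Read]: This paper addresses the poor generalization of remote sensing image fusion models on unseen datasets, which stems mainly from the scarcity of real training data and the domain gap between satellite sensors. The key to the solution is a novel pretraining strategy that uses large-scale simulated datasets to learn robust spatial-spectral priors. First, diverse simulated datasets are built by applying degradation operations (blur, noise, downsampling) and augmentations (band generation, channel shuffling, high-pass filtering, color jittering, etc.) to natural images from ImageNet and remote sensing images from SkyScript. Fusion models are then pretrained on these data to obtain transferable representations. Experiments show that the method significantly improves zero-shot and one-shot generalization across multiple satellite sensors (the WorldView series, IKONOS, QuickBird, GaoFen-2), supporting the effectiveness of foundation-model-style training strategies for cross-domain pansharpening.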

Link: https://arxiv.org/abs/2512.02643
Authors: Yongchuan Cui, Peng Liu, Yi Zeng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Existing deep learning methods for remote sensing image fusion often suffer from poor generalization when applied to unseen datasets due to the limited availability of real training data and the domain gap between different satellite sensors. To address this challenge, we explore the potential of foundation models by proposing a novel pretraining strategy that leverages large-scale simulated datasets to learn robust spatial-spectral priors. Specifically, our approach first constructs diverse simulated datasets by applying various degradation operations (blur, noise, downsampling) and augmentations (bands generation, channel shuffling, high-pass filtering, color jittering, etc.) to natural images from ImageNet and remote sensing images from SkyScript. We then pretrain fusion models on these simulated data to learn generalizable spatial-spectral representations. The pretrained models are subsequently evaluated on six datasets (WorldView-2/3/4, IKONOS, QuickBird, GaoFen-2) using zero-shot and one-shot paradigms, with both full- and freeze-tuning approaches for fine-tuning. Extensive experiments on different network architectures including convolutional neural networks, Transformer, and Mamba demonstrate that our pretraining strategy significantly improves generalization performance across different satellite sensors and imaging conditions for various fusion models. The pretrained models achieve superior results in zero-shot scenarios and show remarkable adaptation capability with minimal real data in one-shot settings. Our work provides a practical solution for cross-domain pansharpening, establishes a new benchmark for generalization in remote sensing image fusion tasks, and paves the way for leveraging foundation models through advanced training strategies.
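
The degradation chain named in the abstract (blur, noise, downsampling) can be sketched on a single image row as below. This is a minimal illustration under stated assumptions, not the authors' pipeline: the box-blur kernel, the noise level `sigma`, the factor of 4, and all function names are hypothetical choices made here.

```python
import random

def box_blur(row, radius=1):
    """Blur a 1-D row with a box filter (window is clipped at the borders)."""
    out = []
    for i in range(len(row)):
        lo, hi = max(0, i - radius), min(len(row), i + radius + 1)
        window = row[lo:hi]
        out.append(sum(window) / len(window))
    return out

def add_noise(row, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return [v + rng.gauss(0.0, sigma) for v in row]

def downsample(row, factor):
    """Keep every factor-th sample."""
    return row[::factor]

def degrade(row, factor=4, sigma=0.01, seed=0):
    """Simulate a low-resolution observation: blur, then noise, then decimate."""
    rng = random.Random(seed)  # seeded so the simulated pair is reproducible
    return downsample(add_noise(box_blur(row), sigma, rng), factor)

# Hypothetical high-resolution row and its simulated low-resolution counterpart.
high_res_row = [float(i % 8) / 8.0 for i in range(16)]
low_res_row = degrade(high_res_row)
```

Pairs such as (`low_res_row`, `high_res_row`) are the kind of simulated supervision the abstract describes; a real pipeline would operate on multi-band 2-D images and add the listed augmentations.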

[CV-65] Joint Distillation for Fast Likelihood Evaluation and Sampling in Flow-based Models

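[Quick Read]: This paper addresses the high cost of log-likelihood evaluation in flow-based models: state-of-the-art diffusion and flow models need hundreds to thousands of neural function evaluations (NFEs) to compute a single log-likelihood, which severely limits efficiency in model comparison, fine-tuning, and downstream applications. The key to the solution is the fast flow joint distillation (F2D2) framework. Its core insight is that in continuous normalizing flows, the coupled ordinary differential equations (ODEs) behind the sampling trajectory and the log-likelihood derive from the same underlying velocity field, so a single model can distill both the sampling trajectory and the cumulative divergence. F2D2 achieves this joint distillation by adding one divergence prediction head, cutting the NFEs for both sampling and likelihood evaluation by two orders of magnitude while maintaining high sample quality, and thereby easing a long-standing computational bottleneck of flow models.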

Link: https://arxiv.org/abs/2512.02636
Authors: Xinyue Ai, Yutong He, Albert Gu, Ruslan Salakhutdinov, J Zico Kolter, Nicholas Matthew Boffi, Max Simchowitz
Affiliations: Carnegie Mellon University; Peking University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Log-likelihood evaluation enables important capabilities in generative models, including model comparison, certain fine-tuning objectives, and many downstream applications. Yet paradoxically, some of today’s best generative models – diffusion and flow-based models – still require hundreds to thousands of neural function evaluations (NFEs) to compute a single likelihood. While recent distillation methods have successfully accelerated sampling to just a few steps, they achieve this at the cost of likelihood tractability: existing approaches either abandon likelihood computation entirely or still require expensive integration over full trajectories. We present fast flow joint distillation (F2D2), a framework that simultaneously reduces the number of NFEs required for both sampling and likelihood evaluation by two orders of magnitude. Our key insight is that in continuous normalizing flows, the coupled ODEs for sampling and likelihood are computed from a shared underlying velocity field, allowing us to jointly distill both the sampling trajectory and cumulative divergence using a single model. F2D2 is modular, compatible with existing flow-based few-step sampling models, and requires only an additional divergence prediction head. Experiments demonstrate F2D2’s capability of achieving accurate log-likelihood with few-step evaluations while maintaining high sample quality, solving a long-standing computational bottleneck in flow-based generative models. As an application of our approach, we propose a lightweight self-guidance method that enables a 2-step MeanFlow model to outperform a 1024 step teacher model with only a single additional backward NFE.
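
For reference, the coupled ODEs mentioned in the abstract are usually written as follows for a continuous normalizing flow. The notation ($v_\theta$ for the velocity field, $x_t$ for the state, $p_t$ for the density at time $t$, integration from 0 to 1) is assumed here for illustration and is not taken from the paper.

```latex
\begin{aligned}
\frac{\mathrm{d}x_t}{\mathrm{d}t} &= v_\theta(x_t, t), \\
\frac{\mathrm{d}}{\mathrm{d}t}\log p_t(x_t) &= -\nabla\cdot v_\theta(x_t, t), \\
\log p_1(x_1) &= \log p_0(x_0) - \int_0^1 \nabla\cdot v_\theta(x_t, t)\,\mathrm{d}t .
\end{aligned}
```

The integral in the last line is the cumulative divergence. Both it and the trajectory depend only on $v_\theta$, which is why one distilled model could plausibly predict both.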

[CV-66] PPTBench: Towards Holistic Evaluation of Large Language Models for PowerPoint Layout and Design Understanding

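[Quick Read]: This paper addresses the limited visual-layout understanding of current multimodal large language models (MLLMs) on PowerPoint (PPT) tasks. Existing benchmarks focus on narrow subtasks and overlook layout challenges tied to slide structure and spatial arrangement, which are central to real-world PPT creation and editing. To fill this gap, the authors propose PPTBench, a comprehensive multimodal benchmark with 4,439 samples across four categories (Detection, Understanding, Modification, and Generation), built from 958 real PPTX files. Its key contribution is a systematic evaluation of how well MLLMs combine textual semantics with JSON-format layout structures. It reveals marked deficiencies in integrating visual cues into API planning, pointing future research toward visual-structural reasoning and coherent slide generation.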

Link: https://arxiv.org/abs/2512.02624
Authors: Zheng Huang, Xukai Liu, Tianyu Hu, Kai Zhang, Ye Liu
Affiliations: University of Science and Technology of China; State Key Laboratory of Cognitive Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:PowerPoint presentations combine rich textual content with structured visual layouts, making them a natural testbed for evaluating the multimodal reasoning and layout understanding abilities of modern MLLMs. However, existing benchmarks focus solely on narrow subtasks while overlooking layout-centric challenges, which are central to real-world slide creation and editing. To bridge this gap, we introduce PPTBench, a comprehensive multimodal benchmark for evaluating LLMs on PowerPoint-related tasks. Leveraging a diverse source of 958 PPTX files, PPTBench evaluates models across four categories with 4,439 samples, including Detection, Understanding, Modification, and Generation. Our experiments reveal a substantial gap between semantic understanding and visual-layout reasoning in current MLLMs: models can interpret slide content but fail to produce coherent spatial arrangements. Ablation and further analysis show that current MLLMs struggle to combine visual cues with JSON-based layout structures and fail to integrate visual information into their API planning ability. Case studies visually expose systematic layout errors such as misalignment and element overlap. These findings provide a new perspective on evaluating VLLMs in PPT scenarios, highlighting challenges and directions for future research on visual-structural reasoning and coherent slide generation. All datasets and code are fully released to support reproducibility and future research.

[CV-67] RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence

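[Quick Read]: This paper addresses a gap in evaluating the rule-based reasoning of video generation models: existing benchmarks mainly target metrics related to visual perception and understanding (such as visual aesthetics, instruction adherence, and temporal coherence) and neglect reasoning over cognitive rules. The key to the solution is RULER-Bench, a benchmark that evaluates the reasoning ability of video generation models from the perspective of cognitive rules. Built on the text-to-video and image-to-video paradigms, it covers 40 representative tasks with 622 high-quality annotated instances across six rule categories. Each generated video is scored with a checklist covering four metrics, using GPT-o3 as an automatic grader (85% agreement with human judgments), which provides a fine-grained, systematic protocol for assessing reasoning in video generation models.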

Link: https://arxiv.org/abs/2512.02622
Authors: Xuming He, Zehao Fan, Hengjia Li, Fan Zhuo, Hankun Xu, Senlin Cheng, Di Weng, Haifeng Liu, Can Ye, Boxi Wu
Affiliations: Zhejiang University; Ant Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Recent advances in video generation have enabled the synthesis of videos with strong temporal consistency and impressive visual quality, marking a crucial step toward vision foundation models. To evaluate these video generation models, existing benchmarks primarily focus on factors related to visual perception and understanding, like visual aesthetics, instruction adherence, and temporal coherence. However, the rule-based reasoning capabilities of video generation models remain largely unexplored. Although recent studies have carried out preliminary explorations into whether video models can serve as zero-shot learners, they still lack a fine-grained decomposition of reasoning capabilities and a comprehensive evaluation protocol. To address this gap, we introduce RULER-Bench, a benchmark designed to evaluate the reasoning ability of video generation models from the perspective of cognitive rules. Built upon two fundamental paradigms: text-to-video and image-to-video, RULER-Bench covers 40 representative tasks spanning six rule categories with 622 high-quality annotated instances. For the evaluation of each generated video, we construct a checklist covering four metrics and leverage GPT-o3 to assign scores to each question, achieving 85% alignment with human judgements. Extensive experiments show that the state-of-the-art model achieves only 48.87% on the rule coherence metric, highlighting significant room for improvement in the reasoning capability of next-level video models. We expect that the insight obtained from RULER-Bench will facilitate further development of reasoning-aware video generation, advancing video generation models toward vision foundation intelligence.

[CV-68] Content-Aware Texturing for Gaussian Splatting

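[Quick Read]: This paper addresses representational redundancy in Gaussian Splatting-based 3D reconstruction and real-time rendering, where fine appearance details require a large number of small Gaussian primitives, especially when geometry and appearance have different frequency characteristics. The core of the solution is a new appearance representation for 2D Gaussian primitives that incorporates per-primitive texture maps in a principled way, adaptively adjusting texture resolution during optimization (texel size is bounded by the image sampling frequency and adapted to the content of the input images). The key innovations are: 1) upscaling or downscaling texture resolution dynamically to match the detail the scene requires; 2) controlling the number of primitives indirectly through texture resolution, which significantly reduces the total parameter count while preserving image quality.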

Link: https://arxiv.org/abs/2512.02621
Authors: Panagiotis Papantonakis, Georgios Kopanas, Fredo Durand, George Drettakis
Affiliations: Inria; Université Côte D’Azur; Google; Runway ML; MIT CSAIL
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project Page: this https URL

Click to view abstract

Abstract:Gaussian Splatting has become the method of choice for 3D reconstruction and real-time rendering of captured real scenes. However, fine appearance details need to be represented as a large number of small Gaussian primitives, which can be wasteful when geometry and appearance exhibit different frequency characteristics. Inspired by the long tradition of texture mapping, we propose to use texture to represent detailed appearance where possible. Our main focus is to incorporate per-primitive texture maps that adapt to the scene in a principled manner during Gaussian Splatting optimization. We do this by proposing a new appearance representation for 2D Gaussian primitives with textures where the size of a texel is bounded by the image sampling frequency and adapted to the content of the input images. We achieve this by adaptively upscaling or downscaling the texture resolution during optimization. In addition, our approach enables control of the number of primitives during optimization based on texture resolution. We show that our approach performs favorably in image quality and total number of parameters used compared to alternative solutions for textured Gaussian primitives. Project page: this https URL. Journal reference: Eurographics Symposium on Rendering (Symposium Track), 2025.

[CV-69] SAM2Grasp: Resolve Multi-modal Grasping via Prompt-conditioned Temporal Action Prediction

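[Quick Read]: This paper addresses the failure of imitation learning in robotic grasping caused by the multimodal problem: when a scene contains several graspable targets, demonstrations for different targets produce conflicting training signals, and a standard imitation learning policy averages them into an invalid action. The key to the solution is the SAM2Grasp framework, which recasts the multimodal grasping task as a uni-modal, prompt-conditioned prediction problem. It uses a frozen SAM2 (Segment Anything Model 2) to extract stable temporal-visual features and adds a lightweight trainable action head in parallel with the segmentation head, so only this small module is trained on pre-computed features. At inference, an initial prompt (such as a bounding box from an object detector) specifies the target object and conditions the action head to produce a single, unambiguous grasp trajectory for it. In subsequent frames, SAM2's built-in temporal tracking keeps following the target without further intervention, which removes the ambiguity from the visuomotor policy.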

Link: https://arxiv.org/abs/2512.02609
Authors: Shengkai Wu, Jinrong Yang, Wenqiu Luo, Linfeng Gao, Chaohui Shang, Meiyu Zhi, Mingshan Sun, Fangping Yang, Liangliang Ren, Yong Zhao
Affiliations: CVTE
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Imitation learning for robotic grasping is often plagued by the multimodal problem: when a scene contains multiple valid targets, demonstrations of grasping different objects create conflicting training signals. Standard imitation learning policies fail by averaging these distinct actions into a single, invalid action. In this paper, we introduce SAM2Grasp, a novel framework that resolves this issue by reformulating the task as a uni-modal, prompt-conditioned prediction problem. Our method leverages the frozen SAM2 model to use its powerful visual temporal tracking capability and introduces a lightweight, trainable action head that operates in parallel with its native segmentation head. This design allows for training only the small action head on pre-computed temporal-visual features from SAM2. During inference, an initial prompt, such as a bounding box provided by an upstream object detection model, designates the specific object to be grasped. This prompt conditions the action head to predict a unique, unambiguous grasp trajectory for that object alone. In all subsequent video frames, SAM2’s built-in temporal tracking capability automatically maintains stable tracking of the selected object, enabling our model to continuously predict the grasp trajectory from the video stream without further external guidance. This temporal-prompted approach effectively eliminates ambiguity from the visuomotor policy. We demonstrate through extensive experiments that SAM2Grasp achieves state-of-the-art performance in cluttered, multi-object grasping tasks.
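
The averaging failure described in the abstract can be seen in a toy one-dimensional case. The sketch below is an illustration only, not SAM2Grasp itself: the targets at x = -1.0 and x = +1.0, the prompt labels, and the function names are hypothetical. It relies on the fact that the least-squares fit to a set of actions is their mean.

```python
from statistics import mean

# Hypothetical demonstrations: (grasp position, which object was grasped).
demos = [(-1.0, "left"), (1.0, "right"), (-1.0, "left"), (1.0, "right")]

def unconditioned_policy(demos):
    """Least-squares fit with no prompt: the mean over all demonstrated actions."""
    return mean(action for action, _ in demos)

def prompt_conditioned_policy(demos, prompt):
    """Least-squares fit per prompt: the mean over that target's actions only."""
    return mean(action for action, target in demos if target == prompt)

avg_action = unconditioned_policy(demos)                   # lands between the objects
left_action = prompt_conditioned_policy(demos, "left")     # lands on the left object
right_action = prompt_conditioned_policy(demos, "right")   # lands on the right object
```

Without a prompt the fitted action falls at the midpoint, where there is nothing to grasp. Conditioning on the prompt makes each sub-problem uni-modal, which is the reformulation the abstract describes.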

[CV-70] Co-speech Gesture Video Generation via Motion-Based Graph Retrieval

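[Quick Read]: This paper addresses the challenge of generating synchronized, natural co-speech gesture videos. Existing methods rely on a motion graph to retrieve suitable motion trajectories from existing video data, but because audio and gestures are related many-to-many, conventional one-to-one feature matching cannot model the correspondence well. The solution proceeds in four steps. A diffusion model first learns the joint distribution of audio and gestures implicitly, generating contextually appropriate gesture motion from the input audio. Low-level and high-level audio features are extracted to enrich the diffusion model's training. A motion-based retrieval algorithm then identifies the best path in the motion graph by assessing global and local motion similarity. Finally, non-contiguous node segments are stitched seamlessly into a coherent video. The framework significantly improves the synchronization accuracy and naturalness of the generated gestures.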

Link: https://arxiv.org/abs/2512.02576
Authors: Yafei Song, Peng Zhang, Bang Zhang
Affiliations: Tongyi Lab; Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Synthesizing synchronized and natural co-speech gesture videos remains a formidable challenge. Recent approaches have leveraged motion graphs to harness the potential of existing video data. To retrieve an appropriate trajectory from the graph, previous methods either utilize the distance between features extracted from the input audio and those associated with the motions in the graph or embed both the input audio and motion into a shared feature space. However, these techniques may not be optimal due to the many-to-many mapping nature between audio and gestures, which cannot be adequately addressed by one-to-one mapping. To alleviate this limitation, we propose a novel framework that initially employs a diffusion model to generate gesture motions. The diffusion model implicitly learns the joint distribution of audio and motion, enabling the generation of contextually appropriate gestures from input audio sequences. Furthermore, our method extracts both low-level and high-level features from the input audio to enrich the training process of the diffusion model. Subsequently, a meticulously designed motion-based retrieval algorithm is applied to identify the most suitable path within the graph by assessing both global and local similarities in motion. Given that not all nodes in the retrieved path are sequentially continuous, the final step involves seamlessly stitching together these segments to produce a coherent video output. Experimental results substantiate the efficacy of our proposed method, demonstrating a significant improvement over prior approaches in terms of synchronization accuracy and naturalness of generated gestures.

[CV-71] From Panel to Pixel: Zoom-In Vision-Language Pretraining from Biomedical Scientific Literature

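[Quick Read]: This paper addresses the loss of fine-grained correspondences in current biomedical vision-language pretraining, which compresses scientific figures and text into coarse figure-level pairs and thereby limits a model's grasp of the local structural details that matter in clinical practice. The key to the solution is Panel2Patch, a data pipeline that mines the hierarchical structure of multi-panel, marker-heavy figures and their surrounding text from existing biomedical literature and converts it into multi-granular supervision. By parsing layouts, panels, and visual markers, it builds aligned vision-language samples at the figure, panel, and patch levels, preserving local semantics. On top of this corpus, a granularity-aware pretraining strategy unifies heterogeneous objectives ranging from coarse didactic descriptions to fine region-focused phrases, which markedly improves supervision efficiency and model performance with small amounts of data.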

Link: https://arxiv.org/abs/2512.02566
Authors: Kun Yuan, Min Woo Sun, Zhen Chen, Alejandro Lozano, Xiangteng He, Shi Li, Nassir Navab, Xiaoxiao Sun, Nicolas Padoy, Serena Yeung-Levy
Affiliations: University of Strasbourg; Technical University of Munich; Stanford University; Yale University; University of British Columbia
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:There is a growing interest in developing strong biomedical vision-language models. A popular approach to achieve robust representations is to use web-scale scientific data. However, current biomedical vision-language pretraining typically compresses rich scientific figures and text into coarse figure-level pairs, discarding the fine-grained correspondences that clinicians actually rely on when zooming into local structures. To tackle this issue, we introduce Panel2Patch, a novel data pipeline that mines hierarchical structure from existing biomedical scientific literature, i.e., multi-panel, marker-heavy figures and their surrounding text, and converts them into multi-granular supervision. Given scientific figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs hierarchical aligned vision-language pairs at the figure, panel, and patch levels, preserving local semantics instead of treating each figure as a single data sample. Built on this hierarchical corpus, we develop a granularity-aware pretraining strategy that unifies heterogeneous objectives from coarse didactic descriptions to fine region-focused phrases. By applying Panel2Patch to only a small set of the literature figures, we extract far more effective supervision than prior pipelines, enabling substantially better performance with less pretraining data.

[CV-72] OmniPerson: Unified Identity-Preserving Pedestrian Generation

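[Quick Read]: This paper addresses the shortage of high-quality training data for person re-identification (ReID), which is caused by data privacy constraints and annotation costs. Existing data augmentation methods based on pedestrian generation usually cannot guarantee identity consistency and offer limited control, which restricts their usefulness for dataset expansion. The key to the solution is OmniPerson, the first unified identity-preserving pedestrian generation pipeline, which supports visible/infrared image and video generation with fine-grained control over pedestrian attributes. Its core innovations are: 1) the Multi-Refer Fuser module, which distills a unified identity feature from any number of reference images to keep the generated pedestrian's identity consistent; 2) the PersonSyn dataset and its automated annotation pipeline, which turn public ReID benchmarks into a controllable-generation resource with rich multi-modal supervision. Experiments show that OmniPerson reaches state-of-the-art visual fidelity and identity consistency, and that the generated data significantly improves ReID model performance.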

Link: https://arxiv.org/abs/2512.02554
Authors: Changxiao Ma, Chao Yuan, Xincheng Shi, Yuzhuo Ma, Yongfei Zhang, Longkun Zhou, Yujia Zhang, Shangze Li, Yifan Xu
Affiliations: Beihang University; Pengcheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Person re-identification (ReID) suffers from a lack of large-scale high-quality training data due to challenges in data privacy and annotation costs. While previous approaches have explored pedestrian generation for data augmentation, they often fail to ensure identity consistency and suffer from insufficient controllability, thereby limiting their effectiveness in dataset augmentation. To address this, we introduce OmniPerson, the first unified identity-preserving pedestrian generation pipeline for visible/infrared image/video ReID tasks. Our contributions are threefold: 1) We propose OmniPerson, a unified generation model offering holistic and fine-grained control over all key pedestrian attributes. It supports RGB/IR image and video generation conditioned on any number of reference images, two kinds of person poses, and text, and also includes RGB-to-IR transfer and image super-resolution abilities. 2) We design the Multi-Refer Fuser for robust identity preservation with any number of reference images as input, enabling OmniPerson to distill a unified identity from a set of multi-view reference images and ensuring high-fidelity pedestrian generation. 3) We introduce PersonSyn, the first large-scale dataset for multi-reference, controllable pedestrian generation, and present its automated curation pipeline, which transforms public, ID-only ReID benchmarks into a richly annotated resource with the dense, multi-modal supervision required for this task. Experimental results demonstrate that OmniPerson achieves SoTA in pedestrian generation, excelling in both visual fidelity and identity consistency. Furthermore, augmenting existing datasets with our generated data consistently improves the performance of ReID models. We will open-source the full codebase, pretrained model, and the PersonSyn dataset.

[CV-73] AVGGT: Rethinking Global Attention for Accelerating VGGT

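[Quick Read]: This paper addresses the high computational cost of global self-attention in current multi-view 3D reconstruction models (such as VGGT and π³), together with the lack of a systematic understanding of what global attention contributes to multi-view reasoning. The solution rests on an in-depth analysis showing that early global layers do not form meaningful correspondences, middle layers perform cross-view alignment, and the last layers provide only minor refinements. Based on this finding, the authors propose a training-free two-step acceleration strategy: first convert the early global layers into frame attention, then subsample the patch tokens of the key/value (K/V) projections while preserving the diagonal and adding a mean-fill component, which significantly lowers computational complexity. The method matches or improves accuracy while speeding up inference by up to 8 to 10 times, and outperforms existing sparse-attention baselines in dense multi-view settings.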

Link: https://arxiv.org/abs/2512.02541
Authors: Xianbing Sun, Zhikai Zhu, Zhengyu Lou, Bo Yang, Jinyang Tang, Liqing Zhang, He Wang, Jianfu Zhang
Affiliations: Shanghai Jiao Tong University; Ant Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Since DUSt3R, models such as VGGT and π³ have shown strong multi-view 3D performance, but their heavy reliance on global self-attention results in high computational cost. Existing sparse-attention variants offer partial speedups, yet lack a systematic analysis of how global attention contributes to multi-view reasoning. In this paper, we first conduct an in-depth investigation of the global attention modules in VGGT and π³ to better understand their roles. Our analysis reveals a clear division of roles in the alternating global-frame architecture: early global layers do not form meaningful correspondences, middle layers perform cross-view alignment, and last layers provide only minor refinements. Guided by these findings, we propose a training-free two-step acceleration scheme: (1) converting early global layers into frame attention, and (2) subsampling global attention by subsampling K/V over patch tokens with diagonal preservation and a mean-fill component. We instantiate this strategy on VGGT and π³ and evaluate across standard pose and point-map benchmarks. Our method achieves up to 8 to 10× speedup in inference time while matching or slightly improving the accuracy of the original models, and remains robust even in extremely dense multi-view settings where prior sparse-attention baselines fail.
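
One possible reading of step (2) is sketched below: keep every `stride`-th key/value token and append a single token holding the mean of the dropped ones. This is an assumption-laden illustration, not the paper's implementation. The stride-based selection, the single mean token, and all function names are hypothetical, and the diagonal-preservation component is left out because the abstract does not specify it.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def subsample_kv(keys, values, stride):
    """Keep every stride-th K/V token; summarise the dropped ones by their mean."""
    kept = set(range(0, len(keys), stride))
    dropped = [i for i in range(len(keys)) if i not in kept]
    sub_k = [keys[i] for i in sorted(kept)]
    sub_v = [values[i] for i in sorted(kept)]
    if dropped:  # hypothetical "mean-fill": one token standing in for the rest
        sub_k.append(mean_vector([keys[i] for i in dropped]))
        sub_v.append(mean_vector([values[i] for i in dropped]))
    return sub_k, sub_v

# Hypothetical toy tokens: 8 patch tokens of dimension 2.
keys = [[math.cos(i), math.sin(i)] for i in range(8)]
values = [[float(i), 1.0] for i in range(8)]
query = [1.0, 0.0]

sub_keys, sub_values = subsample_kv(keys, values, stride=4)
approx = attend(query, sub_keys, sub_values)
exact = attend(query, keys, values)
```

With `stride=4` the query attends over 3 tokens instead of 8, so the score computation shrinks roughly in proportion. How closely `approx` tracks `exact` depends on how redundant the dropped tokens are.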

[CV-74] WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens

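[Quick Read]: This paper addresses the drop in task generalization that multimodal large language models (MLLMs) face when efficiently bridging pre-trained vision-language models (VLMs) with diffusion models. Existing methods that use a fixed number of learnable query tokens are computationally efficient, but in continual learning they suffer generalization collapse when new tasks differ too much from the pre-training tasks. The key to the solution is Noisy Query Tokens, which learn a distributed representation space between the VLM and the diffusion model through end-to-end optimization and thereby strengthen continual learning. A variational autoencoder (VAE) branch with a linear projection is also introduced to recover fine-grained image details. Experiments confirm that the method mitigates generalization collapse and enables stable continual learning across diverse tasks.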

Link: https://arxiv.org/abs/2512.02536
Authors: Jian Yang, Dacheng Yin, Xiaoxuan He, Yong Li, Fengyun Rao, Jing Lyu, Wei Zhai, Yang Cao, Zheng-Jun Zha
Affiliations: University of Science and Technology of China; Zhejiang University; The Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While methods using a fixed number of learnable query tokens offer computational efficiency, they suffer from task generalization collapse, failing to adapt to new tasks that are distant from their pre-training tasks. To overcome this, we propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization, enhancing continual learning. Additionally, we introduce a VAE branch with linear projection to recover fine-grained image details. Experimental results confirm our approach mitigates generalization collapse and enables stable continual learning across diverse tasks.

[CV-75] On the Problem of Consistent Anomalies in Zero-Shot Anomaly Detection

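[Quick Read]: This work tackles core challenges in zero-shot anomaly classification and segmentation (AC/AS), in particular the problem of consistent anomalies: in settings such as industrial inspection and medical imaging, recurring similar anomalies systematically bias distance-based methods and cause misjudgments. The key to the solution is identifying and modeling two phenomena, similarity scaling and neighbor-burnout, which describe how the statistical and geometric relationships among normal patch representations in pre-trained Vision Transformers change when consistent anomalies are present. Building on them, the author proposes CoDeGraph, which filters out the influence of consistent anomalies through multi-stage graph construction, community detection, and structured refinement. The framework is then extended to 3D MRI data with a training-free volumetric tokenization strategy, yielding genuinely zero-shot 3D anomaly detection and segmentation. Finally, pseudo-masks produced by CoDeGraph are used to supervise prompt-driven vision-language models, bridging batch-based and text-based zero-shot methods.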

Link: https://arxiv.org/abs/2512.02520
Authors: Tai Le-Gia
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: PhD Dissertation

Click to view abstract

Abstract:Zero-shot anomaly classification and segmentation (AC/AS) aim to detect anomalous samples and regions without any training data, a capability increasingly crucial in industrial inspection and medical imaging. This dissertation aims to investigate the core challenges of zero-shot AC/AS and presents principled solutions rooted in theory and algorithmic design. We first formalize the problem of consistent anomalies, a failure mode in which recurring similar anomalies systematically bias distance-based methods. By analyzing the statistical and geometric behavior of patch representations from pre-trained Vision Transformers, we identify two key phenomena - similarity scaling and neighbor-burnout - that describe how relationships among normal patches change with and without consistent anomalies in settings characterized by highly similar objects. We then introduce CoDeGraph, a graph-based framework for filtering consistent anomalies built on the similarity scaling and neighbor-burnout phenomena. Through multi-stage graph construction, community detection, and structured refinement, CoDeGraph effectively suppresses the influence of consistent anomalies. Next, we extend this framework to 3D medical imaging by proposing a training-free, computationally efficient volumetric tokenization strategy for MRI data. This enables a genuinely zero-shot 3D anomaly detection pipeline and shows that volumetric anomaly segmentation is achievable without any 3D training samples. Finally, we bridge batch-based and text-based zero-shot methods by demonstrating that CoDeGraph-derived pseudo-masks can supervise prompt-driven vision-language models. Together, this dissertation provides theoretical understanding and practical solutions for the zero-shot AC/AS problem. 
Submission history: [v1] Tue, 2 Dec 2025 08:23:03 UTC (31,217 KB), submitted by Tai Le-Gia.

[CV-76] SkyMoE: A Vision-Language Foundation Model for Enhancing Geospatial Interpretation with Mixture of Experts

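[Quick Read]: This paper addresses the weak performance of general-purpose vision-language models (VLMs) on remote sensing (RS) tasks. Existing geospatial VLMs adopt a unified modeling strategy and struggle to distinguish task types and interpretation granularities, which limits their ability to balance local detail perception with global contextual understanding. The key to the solution is SkyMoE, a Mixture-of-Experts (MoE) vision-language model for multimodal, multi-task RS interpretation. A task- and granularity-aware adaptive routing mechanism dynamically assigns sub-tasks to specialized large language model experts. A context-disentangled augmentation strategy builds contrastive pairs between local and global features to guide experts toward level-specific representation learning, which significantly improves generalization and multi-granularity understanding in complex scenes.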

Link: https://arxiv.org/abs/2512.02517
Authors: Jiaqi Liu, Ronghao Fu, Lang Sun, Haoran Liu, Xiao Yang, Weipeng Zhang, Xu Na, Zhuoran Duan, Bo Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:The emergence of large vision-language models (VLMs) has significantly enhanced the efficiency and flexibility of geospatial interpretation. However, general-purpose VLMs remain suboptimal for remote sensing (RS) tasks. Existing geospatial VLMs typically adopt a unified modeling strategy and struggle to differentiate between task types and interpretation granularities, limiting their ability to balance local detail perception and global contextual understanding. In this paper, we present SkyMoE, a Mixture-of-Experts (MoE) vision-language model tailored for multimodal, multi-task RS interpretation. SkyMoE employs an adaptive router that generates task- and granularity-aware routing instructions, enabling specialized large language model experts to handle diverse sub-tasks. To further promote expert decoupling and granularity sensitivity, we introduce a context-disentangled augmentation strategy that creates contrastive pairs between local and global features, guiding experts toward level-specific representation learning. We also construct MGRS-Bench, a comprehensive benchmark covering multiple RS interpretation tasks and granularity levels, to evaluate generalization in complex scenarios. Extensive experiments on 21 public datasets demonstrate that SkyMoE achieves state-of-the-art performance across tasks, validating its adaptability, scalability, and superior multi-granularity understanding in remote sensing.
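
To make the routing idea concrete, here is a minimal sketch of generic top-k Mixture-of-Experts gating in plain Python. It is not SkyMoE's router, whose exact form the abstract does not specify. The function names (`softmax`, `route_top_k`, `mix_expert_outputs`), the gate logits, the toy expert outputs, and the choice `k=2` are all illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(gate_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights.

    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def mix_expert_outputs(expert_outputs, routing):
    """Weighted sum of the selected experts' output vectors."""
    dim = len(expert_outputs[0])
    mixed = [0.0] * dim
    for idx, weight in routing:
        for d in range(dim):
            mixed[d] += weight * expert_outputs[idx][d]
    return mixed

# Hypothetical gate logits for four experts (assumed values, not from the paper).
routing = route_top_k([2.0, 0.5, 1.0, -1.0], k=2)
# Hypothetical per-expert output vectors.
outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
mixed = mix_expert_outputs(outputs, routing)
```

Only the selected experts contribute to the mixed output, which is what keeps sparse MoE inference cheaper than running every expert on every input.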

[CV-77] Two-Stage Vision Transformer for Image Restoration: Colorization Pretraining Residual Upsampling

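[Quick Read]: This paper addresses the difficulty of improving model performance on Single Image Super-Resolution (SISR). The key to the solution is a two-stage training strategy. A Vision Transformer (ViT) is first trained with self-supervision on a colourization pretext task to learn rich, generalizable visual representations. The pretrained model is then fine-tuned for 4x super-resolution, where it predicts a high-frequency residual image that is added to an initial bicubic interpolation, which simplifies residual learning. On the DIV2K dataset the method reports an SSIM of 0.712 and a PSNR of 22.90 dB, supporting the effectiveness of self-supervised pretraining for complex image restoration tasks.

Link: https://arxiv.org/abs/2512.02512
Authors: Aditya Chaudhary, Prachet Dev Singh, Ankit Jha
Affiliations: LNMIIT Jaipur
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the 13th Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP 2025), IIT Mandi, India. 3 pages, 1 figure

Click to view abstract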

Abstract:In computer vision, Single Image Super-Resolution (SISR) is still a difficult problem. We present ViT-SR, a new technique to improve the performance of a Vision Transformer (ViT) employing a two-stage training strategy. In our method, the model learns rich, generalizable visual representations from the data itself through a self-supervised pretraining phase on a colourization task. The pre-trained model is then adjusted for 4x super-resolution. By predicting the addition of a high-frequency residual image to an initial bicubic interpolation, this design simplifies residual learning. ViT-SR, trained and evaluated on the DIV2K benchmark dataset, achieves an impressive SSIM of 0.712 and PSNR of 22.90 dB. These results demonstrate the efficacy of our two-stage approach and highlight the potential of self-supervised pre-training for complex image restoration tasks. Further improvements may be possible with larger ViT architectures or alternative pretext tasks.
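
The residual formulation and the PSNR metric mentioned in the abstract can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' code: images are flat lists of pixel values in [0, 1], the `upsampled` list merely stands in for a bicubic interpolation, and the `residual` values are made up rather than predicted by a ViT. The names `psnr` and `add_residual` are hypothetical.

```python
import math

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_value ** 2 / mse)

def add_residual(upsampled, residual):
    """Final estimate = interpolated image + predicted residual, clipped to [0, 1]."""
    return [min(1.0, max(0.0, u + r)) for u, r in zip(upsampled, residual)]

# Toy ground-truth high-resolution pixels (assumed values).
reference = [0.2, 0.4, 0.6, 0.8]
# Stand-in for the bicubic-upsampled low-resolution image.
upsampled = [0.3, 0.4, 0.5, 0.8]
# Stand-in for the high-frequency residual a model would predict.
residual = [-0.1, 0.0, 0.1, 0.0]

restored = add_residual(upsampled, residual)
psnr_before = psnr(reference, upsampled)
psnr_after = psnr(reference, restored)
```

Because the interpolation already carries the low-frequency content, the network only has to learn the (mostly near-zero) correction, which is the sense in which the design simplifies learning.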

[CV-78] GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding

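[Quick Read]: This paper addresses the structural mismatch of autoregressive models in geospatial generation tasks: their inherently sequential generation cannot capture the parallel nature of geospatial data, so outputs lack structure and coherence. The key to the solution is to recast geospatial generation as a parallel refinement process. The paper introduces GeoDiT, the first diffusion-based vision-language model for the geospatial domain, which performs holistic coarse-to-fine synthesis so that all semantic elements are processed and refined simultaneously, yielding significant gains in image captioning, visual grounding, and multi-object detection.

Link: https://arxiv.org/abs/2512.02505
Authors: Jiaqi Liu, Ronghao Fu, Haoran Liu, Lang Sun, Bo Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract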

Abstract:Autoregressive models are structurally misaligned with the inherently parallel nature of geospatial understanding, forcing a rigid sequential narrative onto scenes and fundamentally hindering the generation of structured and coherent outputs. We challenge this paradigm by reframing geospatial generation as a parallel refinement process, enabling a holistic, coarse-to-fine synthesis that resolves all semantic elements simultaneously. To operationalize this, we introduce GeoDiT, the first diffusion-based vision-language model tailored for the geospatial domain. Extensive experiments demonstrate that GeoDiT establishes a new state-of-the-art on benchmarks requiring structured, object-centric outputs. It achieves significant gains in image captioning, visual grounding, and multi-object detection, precisely the tasks where autoregressive models falter. Our work validates that aligning the generative process with the data’s intrinsic structure is key to unlocking superior performance in complex geospatial analysis.

[CV-79] dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model

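[Quick Read]: This paper addresses the error propagation caused by current Document Layout Parsing methods' reliance on fragmented multi-stage pipelines, and their failure to exploit the synergies of joint training. The key to the solution is a unified Vision-Language Model that, for the first time, jointly learns three core tasks (layout detection, text recognition, and relational understanding) in an end-to-end framework. This is enabled by a highly scalable data engine that synthesizes a large multilingual corpus, which markedly improves robustness across languages, layouts, and domains.

Link: https://arxiv.org/abs/2512.02498
Authors: Yumeng Li, Guang Yang, Hao Liu, Bowen Wang, Colin Zhang
Affiliations: hi lab, Xiaohongshu Inc
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract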

Abstract:Document Layout Parsing serves as a critical gateway for Artificial Intelligence (AI) to access and interpret the world’s vast stores of structured knowledge. This process, which encompasses layout detection, text recognition, and relational understanding, is particularly crucial for empowering next-generation Vision-Language Models. Current methods, however, rely on fragmented, multi-stage pipelines that suffer from error propagation and fail to leverage the synergies of joint training. In this paper, we introduce dots.ocr, a single Vision-Language Model that, for the first time, demonstrates the advantages of jointly learning three core tasks within a unified, end-to-end framework. This is made possible by a highly scalable data engine that synthesizes a vast multilingual corpus, empowering the model to deliver robust performance across a wide array of tasks, encompassing diverse languages, layouts, and domains. The efficacy of our unified paradigm is validated by state-of-the-art performance on the comprehensive OmniDocBench. Furthermore, to catalyze research in global document intelligence, we introduce XDocParse, a challenging new benchmark spanning 126 languages. On this testbed, dots.ocr establishes a powerful new baseline, outperforming the next-best competitor by a remarkable +7.4 point margin and proving its unparalleled multilingual capabilities.

[CV-80] A Large Scale Benchmark for Test Time Adaptation Methods in Medical Image Segmentation

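[Quick Read]: This paper addresses the performance degradation caused by domain shift in medical image segmentation, and in particular the limited modality coverage, task diversity, and methodological consistency of existing evaluations of Test Time Adaptation (TTA) methods. The key to the solution is MedSeg-TTA, a comprehensive benchmark with unified data preprocessing, backbone configuration, and test-time protocols. It covers seven imaging modalities (MRI, CT, ultrasound, pathology, dermoscopy, OCT, and chest X-ray) and four representative adaptation paradigms (input-level transformation, feature-level alignment, output-level regularization, and prior estimation), enabling systematic cross-modality comparison and standardized evaluation. The benchmark exposes the strengths and limitations of each method in different scenarios, emphasizes that clinical deployment should choose a method according to the specific type of shift, and provides a solid foundation for developing robust, reliable TTA methods for medical image segmentation.

Link: https://arxiv.org/abs/2512.02497
Authors: Wenjing Yu, Shuo Jiang, Yifei Chen, Shuo Chang, Yuanhan Wang, Beining Wu, Jie Dong, Mingxuan Liu, Shenghao Zhu, Feiwei Qin, Changmiao Wang, Qiyuan Tian
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 45 pages, 18 figures

Click to view abstract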

Abstract:Test-time adaptation is a promising approach for mitigating domain shift in medical image segmentation; however, current evaluations remain limited in terms of modality coverage, task diversity, and methodological consistency. We present MedSeg-TTA, a comprehensive benchmark that examines twenty representative adaptation methods across seven imaging modalities, including MRI, CT, ultrasound, pathology, dermoscopy, OCT, and chest X-ray, under fully unified data preprocessing, backbone configuration, and test-time protocols. The benchmark encompasses four significant adaptation paradigms: Input-level Transformation, Feature-level Alignment, Output-level Regularization, and Prior Estimation, enabling the first systematic cross-modality comparison of their reliability and applicability. The results show that no single paradigm performs best in all conditions. Input-level methods are more stable under mild appearance shifts. Feature-level and Output-level methods offer greater advantages in boundary-related metrics, whereas prior-based methods exhibit strong modality dependence. Several methods degrade significantly under large inter-center and inter-device shifts, which highlights the importance of principled method selection for clinical deployment. MedSeg-TTA provides standardized datasets, validated implementations, and a public leaderboard, establishing a rigorous foundation for future research on robust, clinically reliable test-time adaptation. All source code and open-source datasets are available at this https URL.

[CV-81] Attention-guided reference point shifting for Gaussian-mixture-based partial point set registration

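[Quick Read]: This paper addresses the limited performance of point cloud registration methods based on deep learning and Gaussian Mixture Models (GMMs) in partial-to-partial point set registration, which stems from feature vectors that lack translation and rotation invariance. It focuses on the limitations of DeepGMR when handling non-overlapping or partially overlapping point clouds. The key to the solution is an Attention-based Reference Point Shifting (ARPS) layer, which uses an attention module to identify a common reference point of the two partial point sets (rather than the overlap region) and thereby extracts transformation-invariant features, significantly improving the registration accuracy and robustness of DeepGMR and its variant UGMMReg.

Link: https://arxiv.org/abs/2512.02496
Authors: Mizuki Kikkawa, Tatsuya Yatagawa, Yutaka Ohtake, Hiromasa Suzuki
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 16 pages, 9 figures, 7 tables

Click to view abstract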

Abstract:This study investigates the impact of the invariance of feature vectors for partial-to-partial point set registration under translation and rotation of input point sets, particularly in the realm of techniques based on deep learning and Gaussian mixture models (GMMs). We reveal both theoretical and practical problems associated with such deep-learning-based registration methods using GMMs, with a particular focus on the limitations of DeepGMR, a pioneering study in this line, in partial-to-partial point set registration. Our primary goal is to uncover the causes of these problems and propose a comprehensible solution. To address this, we introduce an attention-based reference point shifting (ARPS) layer, which robustly identifies a common reference point of two partial point sets, thereby acquiring transformation-invariant features. The ARPS layer employs a well-studied attention module to find a common reference point rather than the overlap region. Owing to this, it significantly enhances the performance of DeepGMR and its recent variant, UGMMReg. Furthermore, these extension models outperform even prior deep learning methods using attention blocks and Transformer to extract the overlap region or common reference points. We believe these findings provide deeper insights into registration methods using deep learning and GMMs.
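
A small sketch can show why shifting to a shared reference point helps with translation. It is not the ARPS layer: the attention weights here are fixed, made-up numbers assumed to be identical for both inputs, whereas the paper learns them with an attention module. Rotation is not covered. The helper names are hypothetical.

```python
def reference_point(points, weights):
    """Attention-weighted average of 3D points (weights assumed to sum to 1)."""
    return tuple(sum(w * p[d] for p, w in zip(points, weights)) for d in range(3))

def shift_to_reference(points, weights):
    """Express every point relative to the weighted reference point."""
    ref = reference_point(points, weights)
    return [tuple(p[d] - ref[d] for d in range(3)) for p in points]

def translate(points, offset):
    """Rigidly translate a point set by a fixed offset."""
    return [tuple(p[d] + offset[d] for d in range(3)) for p in points]

points = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
weights = [0.5, 0.25, 0.25]  # hypothetical attention weights

moved = translate(points, (10.0, -3.0, 2.0))
original_shifted = shift_to_reference(points, weights)
moved_shifted = shift_to_reference(moved, weights)
```

Both shifted point sets are identical, so any feature computed from them no longer depends on where the input happened to be placed.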

[CV-82] YingVideo-MV: Music-Driven Multi-Stage Video Generation

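[Quick Read]: This paper addresses the lack of camera motion control and long-sequence consistency modeling in music-driven long-video generation. Existing methods struggle to achieve natural audio-visual synchronization, identity consistency, and camera motion control when generating music-performance videos, especially in complex scenes. The key to the solution is YingVideo-MV, the first cascaded framework for this task. Its core contributions are: (1) an interpretable shot planning module (MV-Director) that designs shooting strategies at the semantic level; (2) a temporal-aware diffusion Transformer architecture and a long-sequence consistency modeling mechanism that improve temporal continuity; (3) a camera adapter module that embeds camera poses into the latent noise space for explicit camera motion control; and (4) a time-aware dynamic window range strategy that adaptively adjusts denoising ranges based on audio embeddings to improve coherence across clips. Together, these components let the system automatically synthesize high-quality, well-synchronized music-performance videos from audio signals.

Link: https://arxiv.org/abs/2512.02492
Authors: Jiahui Chen, Weida Wang, Runhua Shi, Huan Yang, Chaofan Ding, Zihao Chen
Affiliations: AI Lab, GiantNetwork
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 6 figures

Click to view abstract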

Abstract:While diffusion models for audio-driven avatar video generation have achieved notable progress in synthesizing long sequences with natural audio-visual synchronization and identity consistency, the generation of music-performance videos with camera motions remains largely unexplored. We present YingVideo-MV, the first cascaded framework for music-driven long-video generation. Our approach integrates audio semantic analysis, an interpretable shot planning module (MV-Director), temporal-aware diffusion Transformer architectures, and long-sequence consistency modeling to enable automatic synthesis of high-quality music performance videos from audio signals. We construct a large-scale Music-in-the-Wild Dataset by collecting web data to support the achievement of diverse, high-quality results. Observing that existing long-video generation methods lack explicit camera motion control, we introduce a camera adapter module that embeds camera poses into latent noise. To enhance continuity between clips during long-sequence inference, we further propose a time-aware dynamic window range strategy that adaptively adjusts denoising ranges based on audio embedding. Comprehensive benchmark tests demonstrate that YingVideo-MV achieves outstanding performance in generating coherent and expressive music videos, and enables precise music-motion-camera synchronization. More videos are available on our project page: this https URL.

[CV-83] Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding

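[Quick Read]: This paper addresses two core problems caused by the standard causal attention mask in current 3D scene understanding methods based on Large Language Models (LLMs). First, objects in a 3D scene are order-agnostic yet are subjected to sequential dependence. Second, attention between objects and the instruction is restricted, which hinders task-specific reasoning. The key to the solution is a new masking strategy, the 3D Spatial Language Instruction Mask (3D-SLIM), built from two components: a Geometry-adaptive Mask that constrains attention based on spatial density rather than token order, and an Instruction-aware Mask that lets object tokens directly access the instruction context. The design requires no architectural modifications or extra parameters, yet significantly improves performance on multi-modal 3D scene understanding tasks.

Link: https://arxiv.org/abs/2512.02487
Authors: Yerim Jeon, Miso Lee, WonJun Moon, Jae-Pil Heo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract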

Abstract:Recent advances in 3D scene-language understanding have leveraged Large Language Models (LLMs) for 3D reasoning by transferring their general reasoning ability to 3D multi-modal contexts. However, existing methods typically adopt standard decoders from language modeling, which rely on a causal attention mask. This design introduces two fundamental conflicts in 3D scene understanding: sequential bias among order-agnostic 3D objects and restricted object-instruction attention, hindering task-specific reasoning. To overcome these limitations, we propose 3D Spatial Language Instruction Mask (3D-SLIM), an effective masking strategy that replaces the causal mask with an adaptive attention mask tailored to the spatial structure of 3D scenes. Our 3D-SLIM introduces two key components: a Geometry-adaptive Mask that constrains attention based on spatial density rather than token order, and an Instruction-aware Mask that enables object tokens to directly access instruction context. This design allows the model to process objects based on their spatial relationships while being guided by the user’s task. 3D-SLIM is simple, requires no architectural modifications, and adds no extra parameters, yet it yields substantial performance improvements across diverse 3D scene-language tasks. Extensive experiments across multiple benchmarks and LLM baselines validate its effectiveness and underscore the critical role of decoder design in 3D multi-modal reasoning.
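
The contrast between a causal mask and a spatially aware, instruction-aware mask can be sketched as follows. This is an assumption-laden illustration, not 3D-SLIM itself: the token layout (object tokens first, instruction tokens last), the fixed `radius` neighbourhood used as a stand-in for the paper's density-based rule, and the function names are all hypothetical.

```python
import math

def causal_mask(n):
    """Standard causal mask: token i may attend to token j only if j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def spatial_instruction_mask(centers, num_instruction_tokens, radius):
    """Boolean attention mask for [object tokens..., instruction tokens...].

    Assumed rules (illustrative only):
      - object -> object: allowed when the object centers lie within `radius`
      - object <-> instruction: always allowed
      - instruction -> instruction: causal
    """
    n_obj = len(centers)
    n = n_obj + num_instruction_tokens
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            i_is_obj, j_is_obj = i < n_obj, j < n_obj
            if i_is_obj and j_is_obj:
                mask[i][j] = math.dist(centers[i], centers[j]) <= radius
            elif i_is_obj or j_is_obj:
                mask[i][j] = True
            else:
                mask[i][j] = j <= i
    return mask

# Three hypothetical object centers and two instruction tokens.
centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
baseline = causal_mask(5)
adapted = spatial_instruction_mask(centers, num_instruction_tokens=2, radius=2.0)
```

Under the causal baseline, the first object can see neither the second object nor the instruction; under the adapted mask it sees its spatial neighbour and the full instruction, regardless of token order.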

[CV-84] UCAgents: Unidirectional Convergence for Visual Evidence Anchored Multi-Agent Medical Decision-Making

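[Quick Read]: This paper addresses "reasoning detachment" in Vision-Language Models (VLMs) for medical diagnosis: the model's linguistically fluent explanations drift from the actual image content, which undermines clinical trust. Existing multi-agent frameworks try to mitigate single-model bias by simulating Multidisciplinary Team (MDT) debates, but their open-ended discussions amplify textual noise and computational cost and fail to anchor reasoning to visual evidence, the cornerstone of medical decision-making. The paper proposes UCAgents, a hierarchical multi-agent framework whose key idea is unidirectional convergence through structured evidence auditing: agents are forbidden from changing positions, and interactions are limited to targeted evidence verification, which suppresses rhetorical drift and strengthens visual signal extraction. A one-round inquiry discussion is also introduced to uncover potential risks of visual-textual misalignment, and the dual-noise bottleneck formed by visual ambiguity and textual noise is formalized via information theory. Experiments on four medical VQA benchmarks show that UCAgents improves accuracy (71.3% on PathVQA, +6.0% over the state of the art) while reducing token cost by 87.7%, balancing visual evidence mining against textual interference and showing the diagnostic reliability and computational efficiency needed for clinical deployment.

Link: https://arxiv.org/abs/2512.02485
Authors: Qianhan Feng, Zhongzhen Huang, Yakun Zhu, Xiaofan Zhang, Qi Dou
Affiliations: The Chinese University of Hong Kong; Shanghai Jiao Tong University; Shanghai Innovation Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract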

Abstract:Vision-Language Models (VLMs) show promise in medical diagnosis, yet suffer from reasoning detachment, where linguistically fluent explanations drift from verifiable image evidence, undermining clinical trust. Recent multi-agent frameworks simulate Multidisciplinary Team (MDT) debates to mitigate single-model bias, but open-ended discussions amplify textual noise and computational cost while failing to anchor reasoning to visual evidence, the cornerstone of medical decision-making. We propose UCAgents, a hierarchical multi-agent framework enforcing unidirectional convergence through structured evidence auditing. Inspired by clinical workflows, UCAgents forbids position changes and limits agent interactions to targeted evidence verification, suppressing rhetorical drift while amplifying visual signal extraction. In UCAgents, a one-round inquiry discussion is introduced to uncover potential risks of visual-textual misalignment. This design jointly constrains visual ambiguity and textual noise, a dual-noise bottleneck that we formalize via information theory. Extensive experiments on four medical VQA benchmarks show UCAgents achieves superior accuracy (71.3% on PathVQA, +6.0% over state-of-the-art) with 87.7% lower token cost. The evaluation results further confirm that UCAgents strikes a balance between uncovering more visual evidence and avoiding confusing textual interference. These results demonstrate that UCAgents exhibits both diagnostic reliability and computational efficiency critical for real-world clinical deployment. Code is available at this https URL.

[CV-85] G-SHARP: Gaussian Surgical Hardware Accelerated Real-time Pipeline

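[Quick Read]: This paper addresses the need for fast, accurate 3D modeling of deformable tissue in minimally invasive surgery. Existing Gaussian splatting methods have improved real-time endoscopic reconstruction, but most depend on non-commercial derivatives, which limits clinical deployability. The key to the solution is G-SHARP, the first surgical pipeline built natively on the GSplat (Apache-2.0) differentiable Gaussian rasterizer. It provides principled deformation modeling, robust occlusion handling, and high-fidelity reconstruction on the EndoNeRF benchmark, and is deployed through the Holoscan SDK on NVIDIA IGX Orin and Thor edge hardware for real-time intra-operative visualization, balancing reconstruction quality and computational efficiency.

Link: https://arxiv.org/abs/2512.02482
Authors: Vishwesh Nath, Javier G. Tejero, Ruilong Li, Filippo Filicori, Mahdi Azizian, Sean D. Huver
Affiliations: NVIDIA; Northwell Health
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract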

Abstract:We propose G-SHARP, a commercially compatible, real-time surgical scene reconstruction framework designed for minimally invasive procedures that require fast and accurate 3D modeling of deformable tissue. While recent Gaussian splatting approaches have advanced real-time endoscopic reconstruction, existing implementations often depend on non-commercial derivatives, limiting deployability. G-SHARP overcomes these constraints by being the first surgical pipeline built natively on the GSplat (Apache-2.0) differentiable Gaussian rasterizer, enabling principled deformation modeling, robust occlusion handling, and high-fidelity reconstructions on the EndoNeRF pulling benchmark. Our results demonstrate state-of-the-art reconstruction quality with strong speed-accuracy trade-offs suitable for intra-operative use. Finally, we provide a Holoscan SDK application that deploys G-SHARP on NVIDIA IGX Orin and Thor edge hardware, enabling real-time surgical visualization in practical operating-room settings.

[CV-86] WorldPack: Compressed Memory Improves Spatial Consistency in Video World Modeling

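[Quick Read]: This paper addresses the poor spatiotemporal consistency, high computational cost, and low generation quality of video world models in long-horizon generation, especially the difficulty of maintaining spatial consistency and visual fidelity with long-context inputs. The key to the solution is WorldPack, an efficient compressed memory mechanism made up of trajectory packing and memory retrieval: trajectory packing improves context efficiency, while memory retrieval keeps rollouts consistent and supports long-term generation that requires spatial reasoning. As a result, spatial consistency and visual quality in long-term generation improve substantially even with a much shorter context length.

Link: https://arxiv.org/abs/2512.02473
Authors: Yuta Oshima, Yusuke Iwasawa, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta
Affiliations: The University of Tokyo; Google DeepMind
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract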

Abstract:Video world models have attracted significant attention for their ability to produce high-fidelity future visual observations conditioned on past observations and navigation actions. Temporally- and spatially-consistent, long-term world modeling has been a long-standing problem, unresolved with even recent state-of-the-art models, due to the prohibitively expensive computational costs for long-context inputs. In this paper, we propose WorldPack, a video world model with efficient compressed memory, which significantly improves spatial consistency, fidelity, and quality in long-term generation despite much shorter context length. Our compressed memory consists of trajectory packing and memory retrieval; trajectory packing realizes high context efficiency, and memory retrieval maintains the consistency in rollouts and helps long-term generations that require spatial reasoning. Our performance is evaluated with LoopNav, a benchmark on Minecraft, specialized for the evaluation of long-term consistency, and we verify that WorldPack notably outperforms strong state-of-the-art models.

[CV-87] TGDD: Trajectory Guided Dataset Distillation with Balanced Distribution AAAI2026

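[Quick Read]: This paper addresses a weakness of existing Distribution Matching (DM)-based dataset distillation methods: they overlook how feature representations evolve during training, which limits the semantic expressiveness of the synthetic data and weakens downstream performance. The key to the solution is Trajectory Guided Dataset Distillation (TGDD), which reformulates distribution matching as a dynamic alignment process along the model's training trajectory. At each training stage, TGDD captures evolving semantics by matching the feature distributions of synthetic and original data, and adds a distribution constraint regularization to reduce class overlap. Without additional optimization overhead, this improves the semantic diversity and representativeness of the synthetic data and balances performance with efficiency.

Link: https://arxiv.org/abs/2512.02469
Authors: Fengli Ran, Xiao Pu, Bo Liu, Xiuli Bi, Bin Xiao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in AAAI 2026

Click to view abstract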

Abstract:Dataset distillation compresses large datasets into compact synthetic ones to reduce storage and computational costs. Among various approaches, distribution matching (DM)-based methods have attracted attention for their high efficiency. However, they often overlook the evolution of feature representations during training, which limits the expressiveness of synthetic data and weakens downstream performance. To address this issue, we propose Trajectory Guided Dataset Distillation (TGDD), which reformulates distribution matching as a dynamic alignment process along the model’s training trajectory. At each training stage, TGDD captures evolving semantics by aligning the feature distribution between the synthetic and original dataset. Meanwhile, it introduces a distribution constraint regularization to reduce class overlap. This design helps synthetic data preserve both semantic diversity and representativeness, improving performance in downstream tasks. Without additional optimization overhead, TGDD achieves a favorable balance between performance and efficiency. Experiments on ten datasets demonstrate that TGDD achieves state-of-the-art performance, notably a 5.0% accuracy gain on high-resolution benchmarks.
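
The idea of matching feature distributions at several points along a training trajectory, rather than at a single one, can be sketched with a mean-feature matching loss. This is a simplified stand-in, not TGDD: the two "checkpoints" are hand-written feature functions instead of real model snapshots, the distance is a squared difference of feature means, and the class-overlap regularizer is omitted. All names are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def dm_loss(real_feats, syn_feats):
    """Squared distance between the mean real and mean synthetic features."""
    mean_real, mean_syn = mean_vector(real_feats), mean_vector(syn_feats)
    return sum((a - b) ** 2 for a, b in zip(mean_real, mean_syn))

def trajectory_dm_loss(real, syn, extractors):
    """Sum the matching loss over feature extractors from several training stages."""
    return sum(
        dm_loss([f(x) for x in real], [f(x) for x in syn]) for f in extractors
    )

# Hypothetical feature extractors standing in for two checkpoints.
early_stage = lambda x: [x[0] + x[1]]   # coarse feature
late_stage = lambda x: [x[0], x[1]]     # finer feature

real = [[0.0, 0.0], [2.0, 2.0]]
synthetic = [[2.0, 0.0]]

single_stage_loss = trajectory_dm_loss(real, synthetic, [early_stage])
trajectory_loss = trajectory_dm_loss(real, synthetic, [early_stage, late_stage])
```

In this toy case the synthetic sample looks perfect to the early-stage features but is visibly off under the later ones, so only the trajectory-wide loss produces a signal to correct it.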

[CV-88] Vision to Geometry: 3D Spatial Memory for Sequential Embodied MLLM Reasoning and Exploration

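[Quick Read]: This paper addresses how to reuse the spatial knowledge accumulated in earlier exploration to support later reasoning and exploration in Sequential Embodied Tasks. Compared with the single-task setting, the core challenge is handling dependencies between sub-tasks and the interference from infeasible tasks (such as searching for a non-existent object). The key to the solution is 3DSPMR, which exploits geometric, visual, and relational cues from explored regions to strengthen the spatial understanding and reasoning of Multi-Modal Large Language Models (MLLMs). According to the authors, it is the first work to explicitly incorporate geometric information into MLLM-based spatial understanding, and it significantly improves performance on sequential Embodied Question Answering (EQA) and Embodied Multi-modal Navigation (EMN) tasks.

Link: https://arxiv.org/abs/2512.02458
Authors: Zhongyi Cai, Yi Du, Chen Wang, Yu Kong
Affiliations: ACTION Lab, Michigan State University; SAIR Lab, University at Buffalo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract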

Abstract:Existing research on indoor embodied tasks typically requires agents to actively explore unknown environments and reason about the scene to achieve a specific goal. However, when deployed in real life, agents often face sequential tasks, where each new sub-task follows the completion of the previous one, and certain sub-tasks may be infeasible, such as searching for a non-existent object. Compared with the single-task setting, the core challenge lies in reusing spatial knowledge accumulated from previous explorations to support subsequent reasoning and exploration. In this work, we investigate this underexplored yet practically significant embodied AI challenge. To evaluate this challenge, we introduce SEER-Bench, a new Sequential Embodied Exploration and Reasoning Benchmark encompassing two classic embodied tasks: Embodied Question Answering (EQA) and Embodied Multi-modal Navigation (EMN). Building on SEER-Bench, we propose 3DSPMR, a 3D SPatial Memory Reasoning approach that exploits relational, visual, and geometric cues from explored regions to augment Multi-Modal Large Language Models (MLLMs) for reasoning and exploration in sequential embodied tasks. To the best of our knowledge, this is the first work to explicitly incorporate geometric information into MLLM-based spatial understanding and reasoning. Extensive experiments verify that 3DSPMR achieves substantial performance gains on both sequential EQA and EMN tasks.

[CV-89] Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation

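[Quick Read]: This paper asks whether audio-video joint denoising training can improve video generation even when only video quality matters. The key to the solution is a parameter-efficient Audio-Video Full DiT (AVFullDiT) architecture that uses pre-trained text-to-video (T2V) and text-to-audio (T2A) modules for joint denoising training, compared against a T2V-only counterpart under identical settings. The authors hypothesize that audio acts as a privileged signal that encourages the model to internalize causal relationships between visual events and their acoustic consequences (e.g., collision × impact sound), which regularizes video dynamics. They report consistent improvements on challenging subsets with large motions and object-contact motions.

Link: https://arxiv.org/abs/2512.02457
Authors: Jianzong Wu, Hao Lian, Dachao Hao, Ye Tian, Qingyu Shi, Biaolong Chen, Hao Jiang
Affiliations: Peking University; Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page at this https URL

Click to view abstract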

Abstract:Recent audio-video generative systems suggest that coupling modalities benefits not only audio-video synchrony but also the video modality itself. We pose a fundamental question: Does audio-video joint denoising training improve video generation, even when we only care about video quality? To study this, we introduce a parameter-efficient Audio-Video Full DiT (AVFullDiT) architecture that leverages pre-trained text-to-video (T2V) and text-to-audio (T2A) modules for joint denoising. We train (i) a T2AV model with AVFullDiT and (ii) a T2V-only counterpart under identical settings. Our results provide the first systematic evidence that audio-video joint denoising can deliver more than synchrony. We observe consistent improvements on challenging subsets featuring large motions and object-contact motions. We hypothesize that predicting audio acts as a privileged signal, encouraging the model to internalize causal relationships between visual events and their acoustic consequences (e.g., collision × impact sound), which in turn regularizes video dynamics. Our findings suggest that cross-modal co-training is a promising approach to developing stronger, more physically grounded world models. Code and dataset will be made publicly available.

[CV-90] ClusterStyle: Modeling Intra-Style Diversity with Prototypical Clustering for Stylized Motion Generation

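[Quick Read]: This paper addresses the limited ability of existing stylized motion generation models to capture intra-style diversity: a single style category should correspond to diverse motion variations, which current methods fail to model well. The key to the solution is ClusterStyle, a clustering-based framework that uses a set of prototypes to model the different motion patterns within a style category in a structured way. This yields two style embedding spaces: a global-level space that captures overall differences among motions of the same style, and a local-level space that captures fine variations within the temporal dynamics of a motion sequence. Both spaces are optimized by alignment with non-learnable prototype anchors, and a Stylistic Modulation Adapter (SMA) integrates the style features into a pretrained text-to-motion generation model for more accurate and diverse style transfer and generation.

Link: https://arxiv.org/abs/2512.02453
Authors: Kerui Chen, Jianrong Zhang, Ming Li, Zhonglong Zheng, Hehe Fan
Affiliations: CCAI, Zhejiang University; ReLER, AAII, University of Technology Sydney; Guangming Laboratory; Zhejiang Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract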

Abstract:Existing stylized motion generation models have shown their remarkable ability to understand specific style information from the style motion, and insert it into the content motion. However, capturing intra-style diversity, where a single style should correspond to diverse motion variations, remains a significant challenge. In this paper, we propose a clustering-based framework, ClusterStyle, to address this limitation. Instead of learning an unstructured embedding from each style motion, we leverage a set of prototypes to effectively model diverse style patterns across motions belonging to the same style category. We consider two types of style diversity: global-level diversity among style motions of the same category, and local-level diversity within the temporal dynamics of motion sequences. These components jointly shape two structured style embedding spaces, i.e., global and local, optimized via alignment with non-learnable prototype anchors. Furthermore, we augment the pretrained text-to-motion generation model with the Stylistic Modulation Adapter (SMA) to integrate the style features. Extensive experiments demonstrate that our approach outperforms existing state-of-the-art models in stylized motion generation and motion style transfer.

[CV-91] HouseLayout3D: A Benchmark and Training-Free Baseline for 3D Layout Estimation in the Wild NEURIPS2025

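[Quick Read]: This paper addresses the limitations of current 3D layout estimation models on large multi-floor buildings. Existing models are trained mainly on synthetic datasets that contain only simple single-floor or single-room environments, so they cannot natively handle multi-floor buildings and need scenes split into individual floors, which discards the global spatial context needed to reason about cross-floor structures such as staircases. The key to the solution is HouseLayout3D, a real-world benchmark for full-building-scale 3D layout estimation, together with MultiFloor3D, a training-free baseline that builds on existing scene understanding methods and already outperforms existing 3D layout estimation models, underscoring the need for further research in this direction.

Link: https://arxiv.org/abs/2512.02450
Authors: Valentin Bieri, Marie-Julie Rakotosaona, Keisuke Tateno, Francis Engelmann, Leonidas Guibas
Affiliations: ETH Zurich; Google; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2025 (Datasets and Benchmarks Track) Project Page: this https URL

Click to view abstract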

Abstract:Current 3D layout estimation models are primarily trained on synthetic datasets containing simple single-room or single-floor environments. As a consequence, they cannot natively handle large multi-floor buildings and require scenes to be split into individual floors before processing, which removes global spatial context that is essential for reasoning about structures such as staircases that connect multiple levels. In this work, we introduce HouseLayout3D, a real-world benchmark designed to support progress toward full-building-scale layout estimation, including multiple floors and architecturally intricate spaces. We also present MultiFloor3D, a simple training-free baseline that leverages recent scene understanding methods and already outperforms existing 3D layout estimation models on both our benchmark and prior datasets, highlighting the need for further research in this direction. Data and code are available at: this https URL.

[CV-92] nuScenes Revisited: Progress and Challenges in Autonomous Driving

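[Quick Read]: This paper systematically reviews and analyzes the central role of the nuScenes dataset in Autonomous Vehicles (AV) research and its influence on the field. The question it sets out to answer is how revisiting a landmark dataset (its construction, technical details, extensions such as nuImages and Panoptic nuScenes, and impact on later work) can clarify the technical evolution of perception, localization, prediction, and planning in autonomous driving, and give the community a comprehensive survey. The key contribution is the first detailed disclosure of previously unpublished technical details of nuScenes, including its design philosophy, collection process, and annotation standards, along with an account of how, as a benchmark dataset, it shaped multi-modal sensor fusion, standardized evaluation, and task definitions, making it an important foundation for developing and evaluating AV algorithms.

Link: https://arxiv.org/abs/2512.02448
Authors: Whye Kit Fong, Venice Erin Liong, Kok Seang Tan, Holger Caesar
Affiliations: Motional
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 18 pages, 17 figures

Click to view abstract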

Abstract:Autonomous Vehicles (AV) and Advanced Driver Assistance Systems (ADAS) have been revolutionized by Deep Learning. As a data-driven approach, Deep Learning relies on vast amounts of driving data, typically labeled in great detail. As a result, datasets, alongside hardware and algorithms, are foundational building blocks for the development of AVs. In this work we revisit one of the most widely used autonomous driving datasets: the nuScenes dataset. nuScenes exemplifies key trends in AV development, being the first dataset to include radar data, to feature diverse urban driving scenes from two continents, and to be collected using a fully autonomous vehicle operating on public roads, while also promoting multi-modal sensor fusion, standardized benchmarks, and a broad range of tasks including perception, localization & mapping, prediction and planning. We provide an unprecedented look into the creation of nuScenes, as well as its extensions nuImages and Panoptic nuScenes, summarizing many technical details that have hitherto not been revealed in academic publications. Furthermore, we trace how the influence of nuScenes impacted a large number of other datasets that were released later and how it defined numerous standards that are used by the community to this day. Finally, we present an overview of both official and unofficial tasks using the nuScenes dataset and review major methodological developments, thereby offering a comprehensive survey of the autonomous driving literature, with a particular focus on nuScenes.

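[CV-93] Temporal Dynamics Enhancer for Directly Trained Spiking Object Detectors

Quick read: This paper addresses the limited temporal modeling capacity of existing spiking neural networks (SNNs). In complex tasks such as object detection, inputs are usually replicated directly or aggregated into frames at fixed intervals, so neurons receive nearly identical stimuli at every time step, which severely limits expressive power. The proposed Temporal Dynamics Enhancer (TDE) has two modules: a Spiking Encoder (SE) that generates diverse input stimuli across time steps, and an Attention Gating Module (AGM) that guides SE generation using inter-temporal dependencies. To remove the high-energy multiplications introduced by the AGM, the authors also design Spike-Driven Attention (SDA), which consumes only 0.240 times the energy of conventional attention modules.

Link: https://arxiv.org/abs/2512.02447
Authors: Fan Luo, Zeyu Gao, Xinhao Luo, Kai Zhao, Yanfeng Lu
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: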

Abstract:Spiking Neural Networks (SNNs), with their brain-inspired spatiotemporal dynamics and spike-driven computation, have emerged as promising energy-efficient alternatives to Artificial Neural Networks (ANNs). However, existing SNNs typically replicate inputs directly or aggregate them into frames at fixed intervals. Such strategies lead to neurons receiving nearly identical stimuli across time steps, severely limiting the model’s expressive power, particularly in complex tasks like object detection. In this work, we propose the Temporal Dynamics Enhancer (TDE) to strengthen SNNs’ capacity for temporal information modeling. TDE consists of two modules: a Spiking Encoder (SE) that generates diverse input stimuli across time steps, and an Attention Gating Module (AGM) that guides the SE generation based on inter-temporal dependencies. Moreover, to eliminate the high-energy multiplication operations introduced by the AGM, we propose a Spike-Driven Attention (SDA) to reduce attention-related energy consumption. Extensive experiments demonstrate that TDE can be seamlessly integrated into existing SNN-based detectors and consistently outperforms state-of-the-art methods, achieving mAP50-95 scores of 57.7% on the static PASCAL VOC dataset and 47.6% on the neuromorphic EvDET200K dataset. In terms of energy consumption, the SDA consumes only 0.240 times the energy of conventional attention modules.

[CV-94] Basis-Oriented Low-rank Transfer for Few-Shot and Test-Time Adaptation

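Quick read: This paper tackles efficient adaptation of large pre-trained models to unseen tasks under tight data and compute budgets. Meta-learning can learn good initializations, but it needs an extra meta-training phase that is costly and can be unstable, while the growing pool of task-specific pre-trained models is rarely reused for new tasks. The proposed BOLT (Basis-Oriented Low-rank Transfer) framework works in two phases. Offline, it collects the dominant singular directions from multiple task vectors and orthogonalizes them per layer to build a reusable orthogonal, task-informed spectral basis. Online, it freezes these bases and trains only a small set of diagonal coefficients per layer, yielding a rank-controlled update. The design gives a strong training-free initialization (by pooling source-task coefficients with a lightweight rescaling step) and a parameter-efficient fine-tuning (PEFT) path that, in the authors' experiments, performs robustly against common PEFT baselines and a representative meta-learned initialization.

Link: https://arxiv.org/abs/2512.02441
Authors: Junghwan Park, Woojin Cho, Junhyuk Heo, Darongsae Kwon, Kookjin Lee
Institutions: TelePIX; Arizona State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: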

Abstract:Adapting large pre-trained models to unseen tasks under tight data and compute budgets remains challenging. Meta-learning approaches explicitly learn good initializations, but they require an additional meta-training phase over many tasks, incur high training cost, and can be unstable. At the same time, the number of task-specific pre-trained models continues to grow, yet the question of how to transfer them to new tasks with minimal additional training remains relatively underexplored. We propose BOLT (Basis-Oriented Low-rank Transfer), a framework that reuses existing fine-tuned models not by merging weights, but instead by extracting an orthogonal, task-informed spectral basis and adapting within that subspace. In the offline phase, BOLT collects dominant singular directions from multiple task vectors and orthogonalizes them per layer to form reusable bases. In the online phase, we freeze these bases and train only a small set of diagonal coefficients per layer for the new task, yielding a rank-controlled update with very few trainable parameters. This design provides (i) a strong, training-free initialization for unseen tasks, obtained by pooling source-task coefficients, along with a lightweight rescaling step while leveraging the shared orthogonal bases, and (ii) a parameter-efficient fine-tuning (PEFT) path that, in our experiments, achieves robust performance compared to common PEFT baselines as well as a representative meta-learned initialization. Our results show that constraining adaptation to a task-informed orthogonal subspace provides an effective alternative for unseen-task transfer.
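
The online phase described in the abstract, with frozen bases and trainable diagonal coefficients, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the bases `U` and `V`, the helper name `low_rank_update`, and the toy dimensions are all assumptions, and the offline SVD and orthogonalization step that would produce the bases is replaced here by hand-picked orthonormal columns.

```python
# Sketch of a rank-controlled update: delta_W = U @ diag(coeffs) @ V^T.
# U (d_out x r) and V (d_in x r) are hypothetical frozen orthonormal bases;
# only the r diagonal coefficients would be trained for a new task.

def low_rank_update(U, coeffs, V):
    """Return U @ diag(coeffs) @ V^T as nested lists."""
    d_out, d_in, rank = len(U), len(V), len(coeffs)
    return [
        [sum(U[i][k] * coeffs[k] * V[j][k] for k in range(rank)) for j in range(d_in)]
        for i in range(d_out)
    ]

# Toy 3x3 layer with rank 2 (values chosen by hand for illustration).
U = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
V = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
coeffs = [0.5, -2.0]  # the only trainable parameters in this sketch

delta_W = low_rank_update(U, coeffs, V)
trainable = len(coeffs)          # 2 parameters
full_size = len(U) * len(V)      # 9 entries in the full weight update
```

In this toy case the update touches a 3x3 matrix while only two numbers are trainable, which is the kind of parameter saving the abstract attributes to adapting within a fixed subspace.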

[CV-95] Boosting Medical Vision-Language Pretraining via Momentum Self-Distillation under Limited Computing Resources WACV2026

Quick read: This paper addresses the difficulty of training vision-language models (VLMs) in healthcare, where detailed annotations are scarce, and the high compute cost of contrastive learning (CL), which depends on large batch sizes. The approach rests on two ideas: (1) momentum self-distillation to strengthen multimodal representation learning, and (2) combining the momentum mechanism with gradient accumulation to enlarge the effective batch size without extra hardware. The method is competitive with state-of-the-art approaches in zero-shot classification, gives a substantial boost in few-shot adaptation (over 90% AUC-ROC), improves retrieval by 2-3%, and trains efficiently on a single GPU.

Link: https://arxiv.org/abs/2512.02438
Authors: Phuc Pham, Nhu Pham, Ngoc Quoc Ly
Institutions: University of Science, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: WACV 2026

Abstract:In medical healthcare, obtaining detailed annotations is challenging, highlighting the need for robust Vision-Language Models (VLMs). Pretrained VLMs enable fine-tuning on small datasets or zero-shot inference, achieving performance comparable to task-specific models. Contrastive learning (CL) is a key paradigm for training VLMs but inherently requires large batch sizes for effective learning, making it computationally demanding and often limited to well-resourced institutions. Moreover, with limited data in healthcare, it is important to prioritize knowledge extraction from both data and models during training to improve performance. Therefore, we focus on leveraging the momentum method combined with distillation to simultaneously address computational efficiency and knowledge exploitation. Our contributions can be summarized as follows: (1) leveraging momentum self-distillation to enhance multimodal learning, and (2) integrating momentum mechanisms with gradient accumulation to enlarge the effective batch size without increasing resource consumption. Our method attains competitive performance with state-of-the-art (SOTA) approaches in zero-shot classification, while providing a substantial boost in the few-shot adaption, achieving over 90% AUC-ROC and improving retrieval tasks by 2-3%. Importantly, our method achieves high training efficiency with a single GPU while maintaining reasonable training time. Our approach aims to advance efficient multimodal learning by reducing resource requirements while improving performance over SOTA methods. The implementation of our method is available at this https URL .
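
The two generic building blocks named in the abstract, a momentum (exponential moving average) update and gradient accumulation, can be illustrated as follows. This is a sketch under assumptions: the function names and the momentum value are hypothetical, and the abstract does not describe how the paper actually couples the two mechanisms.

```python
# Generic building blocks only; not the paper's training loop.

def ema_update(teacher, student, momentum=0.99):
    """Move each teacher parameter toward the student by a factor (1 - momentum)."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

def effective_batch_size(micro_batch_size, accumulation_steps):
    """Gradients summed over several micro-batches act like one larger batch."""
    return micro_batch_size * accumulation_steps

teacher = [0.0, 1.0]
student = [1.0, 1.0]
# momentum=0.5 is used here only to keep the arithmetic easy to follow.
teacher = ema_update(teacher, student, momentum=0.5)
batch = effective_batch_size(micro_batch_size=32, accumulation_steps=8)
```

A slowly updated teacher of this kind is what typically supplies the distillation targets in momentum-based self-distillation, while accumulation trades extra steps for memory.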

[CV-96] LightHCG: a Lightweight yet powerful HSIC Disentanglement based Causal Glaucoma Detection Model framework

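Quick read: This paper targets the limitations of current AI-based glaucoma detection: reliability, excessive parameter counts, possible spurious correlations, and limited use in intervention analysis or clinical simulation. Models such as VGG16 or Vision Transformers perform well on retinal fundus or optical coherence tomography (OCT) images, but they are large and do not model the causal relationships among glaucoma-related physical factors. The author proposes LightHCG, a lightweight convolutional VAE-based model built on causal representation learning. It combines latent-space disentanglement based on the Hilbert-Schmidt Independence Criterion (HSIC) with unsupervised causal representation learning via a Graph Autoencoder, achieving higher classification performance with 93~99% fewer weights and better support for AI-driven intervention analysis.

Link: https://arxiv.org/abs/2512.02437
Authors: Daeyoung Kim
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: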

Abstract:As a representative optic degenerative condition, glaucoma has been a threat to millions due to its irreversibility and severe impact on human vision fields. Mainly characterized by dimmed and blurred visions, or peripheral vision loss, glaucoma is well known to occur due to damages in the optic nerve from increased intraocular pressure (IOP) or neovascularization within the retina. Traditionally, most glaucoma related works and clinical diagnosis focused on detecting these damages in the optic nerve by using patient data from perimetry tests, optic papilla inspections and tonometer-based IOP measurements. Recently, with advancements in computer vision AI models, such as VGG16 or Vision Transformers (ViT), AI-automatized glaucoma detection and optic cup segmentation based on retinal fundus images or OCT recently exhibited significant performance in aiding conventional diagnosis with high performance. However, current AI-driven glaucoma detection approaches still have significant room for improvement in terms of reliability, excessive parameter usage, possibility of spurious correlation within detection, and limitations in applications to intervention analysis or clinical simulations. Thus, this research introduced a novel causal representation driven glaucoma detection model: LightHCG, an extremely lightweight Convolutional VAE-based latent glaucoma representation model that can consider the true causality among glaucoma-related physical factors within the optic nerve region. Using HSIC-based latent space disentanglement and Graph Autoencoder based unsupervised causal representation learning, LightHCG not only exhibits higher performance in classifying glaucoma with 93~99% less weights, but also enhances the possibility of AI-driven intervention analysis, compared to existing advanced vision models such as InceptionV3, MobileNetV2 or VGG16.
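
For readers unfamiliar with HSIC, the standard biased empirical estimator, trace(KHLH) / (n - 1)^2 with Gaussian kernels, can be written in plain Python. This is a minimal sketch for scalar samples; the kernel width and function names are assumptions, and the abstract does not say how LightHCG incorporates HSIC into its training objective.

```python
import math

def gaussian_gram(xs, sigma=1.0):
    """Gram matrix of a Gaussian kernel over scalar samples."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in xs] for a in xs]

def center(K):
    """Double-center a Gram matrix: H K H with H = I - (1/n) * ones."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    col_mean = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - col_mean[j] + total for j in range(n)]
            for i in range(n)]

def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC; values near zero suggest independence."""
    n = len(xs)
    Kc = center(gaussian_gram(xs, sigma))
    Lc = center(gaussian_gram(ys, sigma))
    trace = sum(Kc[i][j] * Lc[j][i] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2

dependent = hsic([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0])
constant = hsic([0.0, 1.0, 2.0, 3.0], [5.0, 5.0, 5.0, 5.0])
```

Penalizing a statistic like this between latent dimensions is one common way to encourage disentanglement, since it pushes the dimensions toward statistical independence.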

[CV-97] GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning

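Quick read: This paper addresses the difficulty of obtaining complete environment information from complex, proprietary real-world GUI environments, which hinders systematic study and benchmarking of GUI agent navigation. The authors introduce GUI Exploration Lab, a simulation environment engine that allows flexible definition and composition of screens, icons, and navigation graphs while giving full access to environment information for agent training and evaluation. Experiments show that supervised fine-tuning helps the model memorize fundamental knowledge, single-turn reinforcement learning improves generalization to unseen scenarios, and multi-turn reinforcement learning develops exploration strategies through interactive trial and error, further improving screen navigation performance.

Link: https://arxiv.org/abs/2512.02423
Authors: Haolong Yan, Yeqing Shen, Xin Huang, Jia Wang, Kaijun Tan, Zhixuan Liang, Hongxin Li, Zheng Ge, Osamu Yoshie, Si Li, Xiangyu Zhang, Daxin Jiang
Institutions: Beijing University of Posts and Telecommunications; StepFun; Waseda University; Institute of Automation, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages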

Abstract:With the rapid development of Large Vision Language Models, the focus of Graphical User Interface (GUI) agent tasks shifts from single-screen tasks to complex screen navigation challenges. However, real-world GUI environments, such as PC software and mobile Apps, are often complex and proprietary, making it difficult to obtain the comprehensive environment information needed for agent training and evaluation. This limitation hinders systematic investigation and benchmarking of agent navigation capabilities. To address this limitation, we introduce GUI Exploration Lab, a simulation environment engine for GUI agent navigation research that enables flexible definition and composition of screens, icons, and navigation graphs, while providing full access to environment information for comprehensive agent training and evaluation. Through extensive experiments, we find that supervised fine-tuning enables effective memorization of fundamental knowledge, serving as a crucial foundation for subsequent training. Building on this, single-turn reinforcement learning further enhances generalization to unseen scenarios. Finally, multi-turn reinforcement learning encourages the development of exploration strategies through interactive trial and error, leading to further improvements in screen navigation performance. We validate our methods on both static and interactive benchmarks, demonstrating that our findings generalize effectively to real-world scenarios. These findings demonstrate the advantages of reinforcement learning approaches in GUI navigation and offer practical guidance for building more capable and generalizable GUI agents.

[CV-98] Generalizing Vision-Language Models with Dedicated Prompt Guidance AAAI26

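Quick read: This paper addresses the trade-off between domain specificity and domain generalization (DG) when fine-tuning large pretrained vision-language models (VLMs). Existing methods usually fine-tune one universal model on the whole dataset, which can hurt generalization to unseen domains. The authors propose a two-step domain-expert-Guided DG (GuiDG) framework: prompt tuning first trains multiple parameter-efficient expert models on partitioned source domains, and a Cross-Modal Attention module then guides fine-tuning of the vision encoder through adaptive expert integration, improving generalization in few-shot settings.

Link: https://arxiv.org/abs/2512.02421
Authors: Xinyao Li, Yinjie Min, Hongbo Chen, Zhekai Du, Fengling Li, Jingjing Li
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI26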

Abstract:Fine-tuning large pretrained vision-language models (VLMs) has emerged as a prevalent paradigm for downstream adaptation, yet it faces a critical trade-off between domain specificity and domain generalization (DG) ability. Current methods typically fine-tune a universal model on the entire dataset, which potentially compromises the ability to generalize to unseen domains. To fill this gap, we provide a theoretical understanding of the generalization ability for VLM fine-tuning, which reveals that training multiple parameter-efficient expert models on partitioned source domains leads to better generalization than fine-tuning a universal model. Inspired by this finding, we propose a two-step domain-expert-Guided DG (GuiDG) framework. GuiDG first employs prompt tuning to obtain source domain experts, then introduces a Cross-Modal Attention module to guide the fine-tuning of the vision encoder via adaptive expert integration. To better evaluate few-shot DG, we construct ImageNet-DG from ImageNet and its variants. Extensive experiments on standard DG benchmarks and ImageNet-DG demonstrate that GuiDG improves upon state-of-the-art fine-tuning methods while maintaining efficiency.

[CV-99] MitUNet: Enhancing Floor Plan Recognition using a Hybrid Mix-Transformer and U-Net Architecture

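Quick read: This paper addresses automatic 3D reconstruction of indoor spaces from 2D floor plans, where imprecise segmentation of thin structures such as walls produces irregular boundaries and geometric distortion that hamper later vectorization. The proposed MitUNet is a hybrid architecture: a hierarchical Mix-Transformer encoder captures global context, and a U-Net decoder enhanced with scSE attention blocks recovers precise boundaries. An optimization strategy based on the Tversky loss suppresses false-positive noise along wall boundaries while staying sensitive to thin structures, producing structurally correct masks with accurate boundaries.

Link: https://arxiv.org/abs/2512.02413
Authors: Dmitriy Parashchuk, Alexey Kapshitskiy, Yuriy Karyakin
Institutions: Tyumen State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures, 3 tables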

Abstract:Automatic 3D reconstruction of indoor spaces from 2D floor plans requires high-precision semantic segmentation of structural elements, particularly walls. However, existing methods optimized for standard metrics often struggle to detect thin structural components and yield masks with irregular boundaries, lacking the geometric precision required for subsequent vectorization. To address this issue, we introduce MitUNet, a hybrid neural network architecture specifically designed for wall segmentation tasks in the context of 3D modeling. In MitUNet, we utilize a hierarchical Mix-Transformer encoder to capture global context and a U-Net decoder enhanced with scSE attention blocks for precise boundary recovery. Furthermore, we propose an optimization strategy based on the Tversky loss function to effectively balance precision and recall. By fine-tuning the hyperparameters of the loss function, we prioritize the suppression of false positive noise along wall boundaries while maintaining high sensitivity to thin structures. Our experiments on the public CubiCasa5k dataset and a proprietary regional dataset demonstrate that the proposed approach ensures the generation of structurally correct masks with high boundary accuracy, outperforming standard single-task models. MitUNet provides a robust tool for data preparation in automated 3D reconstruction pipelines.
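
The Tversky loss mentioned in the abstract weights false positives and false negatives separately, which is what lets the authors trade precision against recall. A minimal sketch for flattened binary masks follows; the values `alpha=0.7` and `beta=0.3` are illustrative assumptions, not the hyperparameters used in the paper.

```python
def tversky_loss(pred, target, alpha=0.7, beta=0.3, eps=1e-7):
    """1 - TP / (TP + alpha * FP + beta * FN) for flattened soft or binary masks.

    alpha weights false positives and beta weights false negatives, so
    alpha > beta penalizes spurious wall pixels more than missed ones.
    """
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1.0 - t) for p, t in zip(pred, target))
    fn = sum((1.0 - p) * t for p, t in zip(pred, target))
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - index

perfect = tversky_loss([1.0, 0.0, 1.0], [1.0, 0.0, 1.0])
one_false_positive = tversky_loss([1.0, 1.0, 1.0], [1.0, 1.0, 0.0])
one_false_negative = tversky_loss([1.0, 1.0, 0.0], [1.0, 1.0, 1.0])
```

With these example weights, one extra predicted pixel costs more than one missed pixel at the same number of true positives, matching the abstract's stated priority of suppressing false-positive noise along wall boundaries.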

[CV-100] WISE: Weighted Iterative Society-of-Experts for Robust Multimodal Multi-Agent Debate

Quick read: This paper addresses the underexplored use of multi-agent debate (MAD) for vision-and-language reasoning; MAD has mostly been applied to language-only tasks. The authors propose Weighted Iterative Society-of-Experts (WISE), a generalized and modular MAD framework that partitions agents into Solvers, which generate solutions, and Reflectors, which verify correctness, assign weights, and provide natural language feedback. A modified Dawid-Skene algorithm aggregates answers across debate rounds while accounting for response variance and feedback weights. Across several benchmarks, WISE improves accuracy by 2-7% over state-of-the-art MAD setups.

Link: https://arxiv.org/abs/2512.02405
Authors: Anoop Cherian, River Doyle, Eyal Ben-Dov, Suhas Lohit, Kuan-Chuan Peng
Institutions: Mitsubishi Electric Research Labs; Cambridge Rindge and Latin School
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Recent large language models (LLMs) are trained on diverse corpora and tasks, leading them to develop complementary strengths. Multi-agent debate (MAD) has emerged as a popular way to leverage these strengths for robust reasoning, though it has mostly been applied to language-only tasks, leaving its efficacy on multimodal problems underexplored. In this paper, we study MAD for solving vision-and-language reasoning problems. Our setup enables generalizing the debate protocol with heterogeneous experts that possess single- and multi-modal capabilities. To this end, we present Weighted Iterative Society-of-Experts (WISE), a generalized and modular MAD framework that partitions the agents into Solvers, that generate solutions, and Reflectors, that verify correctness, assign weights, and provide natural language feedback. To aggregate the agents’ solutions across debate rounds, while accounting for variance in their responses and the feedback weights, we present a modified Dawid-Skene algorithm for post-processing that integrates our two-stage debate model. We evaluate WISE on SMART-840, VisualPuzzles, EvoChart-QA, and a new SMART-840++ dataset with programmatically generated problem instances of controlled difficulty. Our results show that WISE consistently improves accuracy by 2-7% over the state-of-the-art MAD setups and aggregation methods across diverse multimodal tasks and LLM configurations.
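
To give a feel for why Reflector weights matter during aggregation, here is a deliberately simplified stand-in: a weighted vote over Solver answers. It is not the modified Dawid-Skene algorithm from the paper, whose details the abstract does not give; the function name and the example weights are assumptions.

```python
from collections import defaultdict

def weighted_vote(answers, weights):
    """Return the answer with the largest total weight (simplified stand-in)."""
    scores = defaultdict(float)
    for answer, weight in zip(answers, weights):
        scores[answer] += weight
    return max(scores, key=scores.get)

# Three hypothetical Solvers; the weights play the role of Reflector feedback.
answers = ["A", "B", "B"]
weights = [0.9, 0.3, 0.4]
winner = weighted_vote(answers, weights)
```

Here a plain majority vote would pick "B", while the weighted vote picks "A" because the single highly weighted Solver outweighs the two weakly weighted ones.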

[CV-101] Nav-R2 Dual-Relation Reasoning for Generalizable Open-Vocabulary Object-Goal Navigation

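Quick read: This paper addresses low success rates in locating unseen objects and opaque decision-making in open-vocabulary object-goal navigation. The proposed Nav-R^2 framework uses structured Chain-of-Thought (CoT) reasoning to explicitly model two relationships, target-environment modeling and environment-action planning, and adds a Similarity-Aware Memory (SA-Mem). Without introducing extra parameters, SA-Mem compresses video frames and fuses historical observations, keeping the most relevant features from both temporal and semantic perspectives. The result is efficient, real-time (2 Hz) localization of unseen objects.

Link: https://arxiv.org/abs/2512.02400
Authors: Wentao Xiang, Haokang Zhang, Tianhang Yang, Zedong Chu, Ruihang Chu, Shichao Xie, Yujian Yuan, Jian Sun, Zhining Gu, Junjie Wang, Xiaolong Wu, Mu Xu, Yujiu Yang
Institutions: Tsinghua University; Amap, Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: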

Abstract:Object-goal navigation in open-vocabulary settings requires agents to locate novel objects in unseen environments, yet existing approaches suffer from opaque decision-making processes and low success rate on locating unseen objects. To address these challenges, we propose Nav-R^2, a framework that explicitly models two critical types of relationships, target-environment modeling and environment-action planning, through structured Chain-of-Thought (CoT) reasoning coupled with a Similarity-Aware Memory. We construct a Nav-R^2-CoT dataset that teaches the model to perceive the environment, focus on target-related objects in the surrounding context and finally make future action plans. Our SA-Mem preserves the most target-relevant and current observation-relevant features from both temporal and semantic perspectives by compressing video frames and fusing historical observations, while introducing no additional parameters. Compared to previous methods, Nav-R^2 achieves state-of-the-art performance in localizing unseen objects through a streamlined and efficient pipeline, avoiding overfitting to seen object categories while maintaining real-time inference at 2Hz. Resources will be made publicly available at this https URL.

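[CV-102] Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch

Quick read: This paper addresses three problems in current multimodal agentic systems: image manipulation and web search are treated as separate capabilities, training relies heavily on costly reinforcement learning (RL), and planning is not grounded in real tool-execution traces. The authors present Skywork-R1V4, a 30B-parameter multimodal agentic model that unifies multimodal planning, active image manipulation ("thinking with images"), deep multimodal search, and, most importantly, interleaved reasoning that alternates dynamically between visual operations and external knowledge retrieval. The model is trained only with supervised fine-tuning (SFT) on fewer than 30,000 high-quality, planning-execution-consistent trajectories validated by stepwise consistency filtering, and it reaches leading results on perception and multimodal search benchmarks without any reinforcement learning.

Link: https://arxiv.org/abs/2512.02395
Authors: Yifan Zhang, Liang Hu, Haofeng Sun, Peiyu Wang, Yichen Wei, Shukang Yin, Jiangbo Pei, Wei Shen, Peng Xia, Yi Peng, Tianyidan Xie, Eric Li, Yang Liu, Xuchen Song, Yahui Zhou
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 7 figures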

Abstract:Despite recent progress in multimodal agentic systems, existing approaches often treat image manipulation and web search as disjoint capabilities, rely heavily on costly reinforcement learning, and lack planning grounded in real tool-execution traces. To address these limitations, we present Skywork-R1V4, a 30B (A3B) parameter multimodal agentic model that unifies multimodal planning, active image manipulation (“thinking with images”), deep multimodal search, and, most critically, interleaved reasoning that dynamically alternates between visual operations and external knowledge retrieval. Trained solely via supervised fine-tuning on fewer than 30,000 high-quality, planning-execution-consistent trajectories and validated through stepwise consistency filtering, Skywork-R1V4 achieves state-of-the-art results across perception and multimodal search benchmarks: it scores 66.1 on MMSearch and 67.2 on FVQA, surpassing Gemini 2.5 Flash on all 11 metrics. Skywork-R1V4 exhibits emergent long-horizon reasoning at inference time, successfully orchestrating more than 10 tool calls to solve complex, multi-step tasks. Our results demonstrate that sophisticated agentic multimodal intelligence can be achieved through carefully curated supervised learning alone, without any reliance on reinforcement learning.

[CV-103] Reproducing and Extending RaDelft 4D Radar with Camera-Assisted Labels

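Quick read: This paper addresses the scarcity of public datasets and labels that limits research on 4D radar semantic segmentation. The representative RaDelft dataset provides only LiDAR annotations and no public code for generating radar labels, which hinders reproducibility and downstream work. The authors propose a camera-guided radar labeling pipeline: radar point clouds are projected into camera-based semantic segmentation results and then spatially clustered, producing accurate radar point-cloud labels without human annotation. The work establishes a reproducible framework for training and evaluating on labeled 4D radar data, and it quantifies how different fog levels affect radar labeling performance.

Link: https://arxiv.org/abs/2512.02394
Authors: Kejia Hu, Mohammed Alsakabi, John M. Dolan, Ozan K. Tonguz
Institutions: Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: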

Abstract:Recent advances in 4D radar highlight its potential for robust environment perception under adverse conditions, yet progress in radar semantic segmentation remains constrained by the scarcity of open source datasets and labels. The RaDelft data set, although seminal, provides only LiDAR annotations and no public code to generate radar labels, limiting reproducibility and downstream research. In this work, we reproduce the numerical results of the RaDelft group and demonstrate that a camera-guided radar labeling pipeline can generate accurate labels for radar point clouds without relying on human annotations. By projecting radar point clouds into camera-based semantic segmentation and applying spatial clustering, we create labels that significantly enhance the accuracy of radar labels. These results establish a reproducible framework that allows the research community to train and evaluate the labeled 4D radar data. In addition, we study and quantify how different fog levels affect the radar labeling performance.
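
The projection step at the heart of camera-guided labeling can be sketched with a pinhole camera model. This is an illustration under assumptions: points are taken to be already expressed in the camera frame (the radar-to-camera extrinsic transform is omitted), the intrinsics `fx`, `fy`, `cx`, `cy` and the toy mask are made up, and the spatial clustering step from the abstract is not shown.

```python
def project_point(point, fx, fy, cx, cy):
    """Project a 3D point in the camera frame to integer pixel coordinates."""
    x, y, z = point
    if z <= 0:  # behind the camera, so it cannot be seen
        return None
    return int(round(fx * x / z + cx)), int(round(fy * y / z + cy))

def label_points(points, seg_mask, fx, fy, cx, cy, unlabeled=-1):
    """Assign each point the class of the segmentation pixel it lands on."""
    height, width = len(seg_mask), len(seg_mask[0])
    labels = []
    for point in points:
        pixel = project_point(point, fx, fy, cx, cy)
        if pixel is None or not (0 <= pixel[0] < width and 0 <= pixel[1] < height):
            labels.append(unlabeled)
        else:
            u, v = pixel
            labels.append(seg_mask[v][u])  # rows index v, columns index u
    return labels

# Toy 4x4 segmentation mask: class 0 on the left half, class 1 on the right.
seg_mask = [[0, 0, 1, 1] for _ in range(4)]
points = [(0.5, 0.0, 1.0), (-1.0, 0.0, 1.0), (0.0, 0.0, -1.0), (5.0, 0.0, 1.0)]
labels = label_points(points, seg_mask, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
```

Points behind the camera or outside the image stay unlabeled; a clustering pass like the one the abstract mentions is one plausible way to handle such points and noisy hits near object edges.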

[CV-104] From Detection to Association: Learning Discriminative Object Embeddings for Multi-Object Tracking

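Quick read: This paper addresses the low association accuracy of end-to-end multi-object tracking (MOT) methods, which stems from object embeddings that are not discriminative enough. DETR-based MOT methods detect well, but their embeddings emphasize only category-level discrimination within single frames and fall short of the instance-level distinction across frames that tracking requires. The proposed FDTA (From Detection to Association) framework refines features with three complementary modules: a Spatial Adapter (SA) that integrates depth-aware cues for spatial continuity, a Temporal Adapter (TA) that aggregates historical information for temporal dependencies, and an Identity Adapter (IA) that uses quality-aware contrastive learning for instance-level separability.

Link: https://arxiv.org/abs/2512.02392
Authors: Yuqing Shao, Yuchen Yang, Rui Yu, Weilong Li, Xu Guo, Huaicheng Yan, Wei Wang, Xiao Sun
Institutions: East China University of Science and Technology; Shanghai AI Laboratory; Fudan University; Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: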

Abstract:End-to-end multi-object tracking (MOT) methods have recently achieved remarkable progress by unifying detection and association within a single framework. Despite their strong detection performance, these methods suffer from relatively low association accuracy. Through detailed analysis, we observe that object embeddings produced by the shared DETR architecture display excessively high inter-object similarity, as it emphasizes only category-level discrimination within single frames. In contrast, tracking requires instance-level distinction across frames with spatial and temporal continuity, for which current end-to-end approaches insufficiently optimize object embeddings. To address this, we introduce FDTA (From Detection to Association), an explicit feature refinement framework that enhances object discriminativeness across three complementary perspectives. Specifically, we introduce a Spatial Adapter (SA) to integrate depth-aware cues for spatial continuity, a Temporal Adapter (TA) to aggregate historical information for temporal dependencies, and an Identity Adapter (IA) to leverage quality-aware contrastive learning for instance-level separability. Extensive experiments demonstrate that FDTA achieves state-of-the-art performance on multiple challenging MOT benchmarks, including DanceTrack, SportsMOT, and BFT, highlighting the effectiveness of our proposed discriminative embedding enhancement strategy. The code is available at this https URL.

[CV-105] On-the-fly Feedback SfM: Online Explore-and-Exploit UAV Photogrammetry with Incremental Mesh Quality-Aware Indicator and Predictive Path Planning

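Quick read: This paper addresses the lack of dynamic quality assessment and feedback in UAV photogrammetry for real-time geospatial applications. In time-critical settings such as disaster response and digital-twin maintenance, existing methods usually process images or sequential frames in real time without explicitly evaluating the quality of the ongoing 3D reconstruction or giving guided feedback to improve image acquisition in the target area. The proposed On-the-fly Feedback SfM (Structure from Motion) framework integrates three modules: (1) online incremental coarse-mesh generation for a dynamically expanding sparse 3D point cloud; (2) online mesh quality assessment with actionable indicators; and (3) predictive path planning for on-the-fly trajectory refinement. It achieves reconstruction and evaluation in near real time, and its closed-loop feedback markedly reduces coverage gaps and re-flight costs.

Link: https://arxiv.org/abs/2512.02375
Authors: Liyuan Lou, Wanyun Li, Wentian Gan, Yifei Yu, Tengfei Wang, Xin Wang, Zongqian Zhan
Institutions: School of Geodesy and Geomatics, Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work was submitted to IEEE GRSM Journal for this http URL would be transferred once it get accepted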

Abstract:Compared with conventional offline UAV photogrammetry, real-time UAV photogrammetry is essential for time-critical geospatial applications such as disaster response and active digital-twin maintenance. However, most existing methods focus on processing captured images or sequential frames in real time, without explicitly evaluating the quality of the on-the-go 3D reconstruction or providing guided feedback to enhance image acquisition in the target area. This work presents On-the-fly Feedback SfM, an explore-and-exploit framework for real-time UAV photogrammetry, enabling iterative exploration of unseen regions and exploitation of already observed and reconstructed areas in near real time. Built upon SfM on-the-fly, the proposed method integrates three modules: (1) online incremental coarse-mesh generation for dynamically expanding sparse 3D point cloud; (2) online mesh quality assessment with actionable indicators; and (3) predictive path planning for on-the-fly trajectory refinement. Comprehensive experiments demonstrate that our method achieves in-situ reconstruction and evaluation in near real time while providing actionable feedback that markedly reduces coverage gaps and re-flight costs. Via the integration of data collection, processing, 3D reconstruction and assessment, and online feedback, our on-the-fly feedback SfM could be an alternative for the transition from traditional passive working mode to a more intelligent and adaptive exploration workflow. Code is now available at this https URL.

[CV-106] SAGE: Style-Adaptive Generalization for Privacy-Constrained Semantic Segmentation Across Domains

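Quick read: This paper addresses performance degradation of semantic segmentation models under domain shift, particularly in privacy-constrained settings where model parameters and architecture are inaccessible and conventional fine-tuning or adaptation is not possible. The proposed Style-Adaptive GEneralization framework (SAGE) is an input-level approach: it learns to synthesize visual prompts that implicitly align feature distributions across styles rather than modifying model weights. SAGE first uses style transfer to build a diverse style representation of the source domain, then adaptively fuses these style cues according to the visual context of each input image to form a dynamic prompt, improving the generalization of a frozen model to unseen domains without touching its internals.

Link: https://arxiv.org/abs/2512.02369
Authors: Qingmei Li, Yang Zhang, Peifeng Zhang, Haohuan Fu, Juepeng Zheng
Institutions: Tsinghua Shenzhen International Graduate School; Sun Yat-Sen University; National Supercomputing Center in Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: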

Abstract:Domain generalization for semantic segmentation aims to mitigate the degradation in model performance caused by domain shifts. However, in many real-world scenarios, we are unable to access the model parameters and architectural details due to privacy concerns and security constraints. Traditional fine-tuning or adaptation is hindered, leading to the demand for input-level strategies that can enhance generalization without modifying model weights. To this end, we propose a Style-Adaptive GEneralization framework (SAGE), which improves the generalization of frozen models under privacy constraints. SAGE learns to synthesize visual prompts that implicitly align feature distributions across styles instead of directly fine-tuning the backbone. Specifically, we first utilize style transfer to construct a diverse style representation of the source domain, thereby learning a set of style characteristics that can cover a wide range of visual features. Then, the model adaptively fuses these style cues according to the visual context of each input, forming a dynamic prompt that harmonizes the image appearance without touching the interior of the model. Through this closed-loop design, SAGE effectively bridges the gap between frozen model invariance and the diversity of unseen domains. Extensive experiments on five benchmark datasets demonstrate that SAGE achieves competitive or superior performance compared to state-of-the-art methods under privacy constraints and outperforms full fine-tuning baselines in all settings.

[CV-107] Multi-Domain Enhanced Map-Free Trajectory Prediction with Selective Attention

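Quick read: This paper addresses the accuracy and computational efficiency of trajectory prediction for autonomous driving in complex interactive scenarios, where existing methods struggle to extract useful scene information from redundant data. The authors propose a map-free, multi-domain trajectory prediction algorithm. A Mixture of Experts (MoE) mechanism adaptively selects critical frequency components and integrates multi-scale temporal features; a selective attention module filters redundant information in temporal sequences and spatial interactions; and a multimodal decoder supervised by patch-level and point-level losses produces predictions across the temporal, spatial, and frequency domains.

Link: https://arxiv.org/abs/2512.02368
Authors: Wenyi Xiong, Jian Chen
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: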

Abstract:Trajectory prediction is crucial for the reliability and safety of autonomous driving systems, yet it remains a challenging task in complex interactive scenarios. Existing methods often struggle to efficiently extract valuable scene information from redundant data, thereby reducing computational efficiency and prediction accuracy, especially when dealing with intricate agent interactions. To address these challenges, we propose a novel map-free trajectory prediction algorithm that achieves trajectory prediction across the temporal, spatial, and frequency domains. Specifically, in temporal information processing, we utilize a Mixture of Experts (MoE) mechanism to adaptively select critical frequency components. Concurrently, we extract these components and integrate multi-scale temporal features. Subsequently, a selective attention module is proposed to filter out redundant information in both temporal sequences and spatial interactions. Finally, we design a multimodal decoder. Under the supervision of patch-level and point-level losses, we obtain reasonable trajectory results. Experiments on the nuScenes dataset demonstrate the superiority of our algorithm, validating its effectiveness in handling complex interactive scenarios.

[CV-108] Tackling Tuberculosis: A Comparative Dive into Machine Learning for Tuberculosis Detection

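[Quick read]: This paper addresses the low efficiency of tuberculosis (TB) diagnosis in resource-limited regions, where traditional methods such as sputum smear microscopy and culture are slow and insufficiently sensitive. The study applies deep learning and computer vision, training a pretrained ResNet-50 model and a SqueezeNet model to classify chest X-ray images automatically for faster and more accurate TB detection. The key to the approach is developing and comparing the models on a public Kaggle dataset (4,200 chest X-rays) and improving performance through preprocessing such as data augmentation and image normalization. The lightweight SqueezeNet model clearly outperforms ResNet-50 in accuracy (89%), precision (98%), and F1 score (87%), suggesting that small, efficient deep learning models are better suited to deployment on mobile devices and have potential in primary-care settings.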

Link: https://arxiv.org/abs/2512.02364
Authors: Daanish Hindustani, Sanober Hindustani, Preston Nguyen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:This study explores the application of machine learning models, specifically a pretrained ResNet-50 model and a general SqueezeNet model, in diagnosing tuberculosis (TB) using chest X-ray images. TB, a persistent infectious disease affecting humanity for millennia, poses challenges in diagnosis, especially in resource-limited settings. Traditional methods, such as sputum smear microscopy and culture, are inefficient, prompting the exploration of advanced technologies like deep learning and computer vision. The study utilized a dataset from Kaggle, consisting of 4,200 chest X-rays, to develop and compare the performance of the two machine learning models. Preprocessing involved data splitting, augmentation, and resizing to enhance training efficiency. Evaluation metrics, including accuracy, precision, recall, and confusion matrix, were employed to assess model performance. Results showcase that the SqueezeNet achieved a loss of 32%, accuracy of 89%, precision of 98%, recall of 80%, and an F1 score of 87%. In contrast, the ResNet-50 model exhibited a loss of 54%, accuracy of 73%, precision of 88%, recall of 52%, and an F1 score of 65%. This study emphasizes the potential of machine learning in TB detection and possible implications for early identification and treatment initiation. The possibility of integrating such models into mobile devices expands their utility in areas lacking TB detection resources. However, despite promising results, the need for continued development of faster, smaller, and more accurate TB detection models remains crucial in contributing to the global efforts in combating TB.
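The F1 scores quoted above are the harmonic mean of precision and recall. As a minimal sketch, the snippet below recomputes F1 from the rounded precision and recall figures in the abstract; the function name `f1_score` is a hypothetical helper, not something from the paper. With these rounded inputs the values come out near 0.88 for SqueezeNet and 0.65 for ResNet-50, so a small gap to a reported figure may reflect rounding or averaging choices that the abstract does not specify.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision/recall values as quoted in the abstract.
squeezenet_f1 = f1_score(0.98, 0.80)
resnet50_f1 = f1_score(0.88, 0.52)

print(f"SqueezeNet F1 from rounded inputs: {squeezenet_f1:.3f}")
print(f"ResNet-50 F1 from rounded inputs:  {resnet50_f1:.3f}")
```

The low recall of both models dominates the harmonic mean, which is why F1 sits well below precision in each case.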

[CV-109] VACoT: Rethinking Visual Data Augmentation with VLMs

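[Quick read]: This paper addresses the limited robustness of visual language models (VLMs) on challenging inputs such as out-of-distribution data or adversarial examples, with particularly weak performance in OCR-related adversarial scenarios. Existing methods rely mainly on large-scale real data or synthetic diversity for training, which is costly and generalizes poorly. The key to the solution is the Visual Augmentation Chain-of-Thought (VACoT) framework, whose core idea is to invoke image augmentation operations (post-hoc transformations such as denoising) dynamically at inference time rather than relying only on static augmentation during training. A structured collection of general visual augmentations broadens the views of the query image, and a conditional reward scheme guides the model to choose necessary augmentations and avoid verbose responses, yielding more robust perceptual reasoning with lower training complexity and computational overhead.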

Link: https://arxiv.org/abs/2512.02361
Authors: Zhengzhuo Xu, Chong Sun, SiNan Du, Chen Li, Jing Lyu, Chun Yuan
Affiliations: Tsinghua University; WeChat Vision, Tencent Inc
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:While visual data augmentation remains a cornerstone for training robust vision models, it has received limited attention in visual language models (VLMs), which predominantly rely on large-scale real data acquisition or synthetic diversity. Consequently, they may struggle with basic perception tasks that conventional models handle reliably. Given the substantial cost of pre-training and fine-tuning VLMs, continued training on augmented data yields limited and diminishing returns. In this paper, we present Visual Augmentation Chain-of-Thought (VACoT), a framework that dynamically invokes image augmentations during model inference. By incorporating post-hoc transformations such as denoising, VACoT substantially improves robustness on challenging and out-of-distribution inputs, especially in OCR-related adversarial scenarios. Distinct from prior approaches limited to local cropping, VACoT integrates a structured collection of general visual augmentations, broadening the query image views while reducing training complexity and computational overhead with efficient agentic reinforcement learning. We propose a conditional reward scheme that encourages necessary augmentation while penalizing verbose responses, ensuring concise and effective reasoning in perception tasks. We demonstrate the superiority of VACoT with extensive experiments on 13 perception benchmarks and further introduce AdvOCR to highlight the generalization benefits of post-hoc visual augmentations in adversarial scenarios.

[CV-110] WSCF-MVCC: Weakly-supervised Calibration-free Multi-view Crowd Counting

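[Quick read]: This paper addresses the reliance of multi-view crowd counting on costly camera calibration and scene-level crowd annotations, and in particular the fact that existing calibration-free methods still need image-level crowd annotations to train the single-view counting module. The key to the solution is a weakly-supervised calibration-free multi-view crowd counting method (WSCF-MVCC) that uses crowd counts directly as supervision in place of density maps, avoiding manual annotation cost. It also introduces a self-supervised ranking loss that uses multi-scale priors to strengthen the model's perceptual ability, and uses semantic information to improve view matching, leading to more accurate scene-level crowd count estimates.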

Link: https://arxiv.org/abs/2512.02359
Authors: Bin Li, Daijie Chen, Qi Zhang
Affiliations: Shenzhen University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: PRCV 2025

Abstract:Multi-view crowd counting can effectively mitigate occlusion issues that commonly arise in single-image crowd counting. Existing deep-learning multi-view crowd counting methods project different camera view images onto a common space to obtain ground-plane density maps, requiring abundant and costly crowd annotations and camera calibrations. Hence, calibration-free methods are proposed that do not require camera calibrations and scene-level crowd annotations. However, existing calibration-free methods still require expensive image-level crowd annotations for training the single-view counting module. Thus, in this paper, we propose a weakly-supervised calibration-free multi-view crowd counting method (WSCF-MVCC), directly using crowd count as supervision for the single-view counting module rather than density maps constructed from crowd annotations. Instead, a self-supervised ranking loss that leverages multi-scale priors is utilized to enhance the model’s perceptual ability without additional annotation costs. What’s more, the proposed model leverages semantic information to achieve a more accurate view matching and, consequently, a more precise scene-level crowd count estimation. The proposed method outperforms the state-of-the-art methods on three widely used multi-view counting datasets under weakly supervised settings, indicating that it is more suitable for practical deployment compared with calibrated methods. Code is released in this https URL.
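The abstract does not spell out the form of the self-supervised ranking loss. One common way to express a multi-scale ranking prior in crowd counting is that a crop nested inside a larger crop cannot contain more people than the larger one. The sketch below assumes that form; the function name `ranking_loss`, the `margin` parameter, and the nested-crop setup are illustrative assumptions rather than the authors' formulation.

```python
def ranking_loss(counts_large_to_small, margin=0.0):
    """Hinge penalty over nested crops ordered from largest to smallest.

    Assumption: each crop is fully contained in the previous one, so its
    predicted count should not exceed the previous crop's predicted count.
    """
    loss = 0.0
    pairs = zip(counts_large_to_small, counts_large_to_small[1:])
    for outer_count, inner_count in pairs:
        # Penalise only violations of the ordering inner <= outer.
        loss += max(0.0, inner_count - outer_count + margin)
    return loss


# Predicted counts for a full image, a half-size crop, and a quarter-size crop.
print(ranking_loss([10.0, 7.0, 3.0]))  # consistent ordering, no penalty
print(ranking_loss([5.0, 8.0, 2.0]))   # inner crop predicted above outer crop
```

A loss of this kind needs no annotations, since the ordering follows from crop containment alone, which is consistent with the abstract's claim of no additional annotation cost.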

[CV-111] Understanding and Harnessing Sparsity in Unified Multimodal Models

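[Quick read]: This paper addresses the inefficiency of unified multimodal models at inference time caused by redundant computation: different tasks or samples need different amounts of model capacity, so activating all parameters wastes resources. The key solution is an adaptation mechanism based on Mixture-of-Experts (MoE): the generation module is partitioned into multiple experts with sparse activation, which substantially reduces the fraction of active parameters while preserving generation quality. Experiments show that the adapted BAGEL model matches the full model's performance while activating only about half of its parameters, enabling efficient, high-quality multimodal inference.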

Link: https://arxiv.org/abs/2512.02351
Authors: Shwai He, Chaorui Deng, Ang Li, Shen Yan
Affiliations: ByteDance Seed; University of Maryland, College Park
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, 13 figures, 8 tables

Abstract:Large multimodal models have achieved remarkable progress in both understanding and generation. Recent efforts pursue unified multimodal models that integrate heterogeneous components to support both capabilities within a single framework. However, such unification introduces inference inefficiencies, e.g., specific tasks or samples may not require the full knowledge or capacity of the unified model. Yet, a systematic understanding of how these inefficiencies manifest across different components remains limited. In this work, we first conduct a systematic analysis of unified multimodal model components using training-free pruning as a probing methodology, considering both depth pruning and width reduction. Our study reveals that the understanding component exhibits notable compressibility in both understanding and generation tasks, which is more pronounced in the latter. In contrast, the generation components are highly sensitive to compression, with performance deteriorating sharply even under moderate compression ratios. To address this limitation, we propose the Mixture-of-Experts (MoE) Adaptation, inspired by the dynamic activation patterns observed across different samples. This approach partitions the generation module into multiple experts and enables sparse activation to restore generation quality. We validate the effectiveness of sparse activation through expert-frozen tuning and further demonstrate that a fully trainable adaptation delivers additional gains. As a result, the adapted BAGEL model achieves performance comparable to the full model while activating only about half of its parameters. The code is released at this https URL.
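Sparse activation in a Mixture-of-Experts layer is often implemented with top-k routing: a router scores every expert, only the k best are evaluated, and their outputs are mixed with renormalised weights. The abstract does not describe the routing rule used in the adaptation, so the sketch below is a generic top-k illustration; `top_k_route`, `sparse_moe`, and the toy scalar experts are all hypothetical.

```python
import math


def top_k_route(logits, k):
    """Map the k highest-scoring expert indices to softmax weights.

    The softmax is taken over the selected experts only, so weights sum to 1.
    """
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    selected = order[:k]
    peak = max(logits[i] for i in selected)
    exps = {i: math.exp(logits[i] - peak) for i in selected}
    total = sum(exps.values())
    return {i: value / total for i, value in exps.items()}


def sparse_moe(x, experts, logits, k):
    """Evaluate only the routed experts and mix their outputs."""
    weights = top_k_route(logits, k)
    return sum(weight * experts[i](x) for i, weight in weights.items())


# Four toy experts; with k=2 only half of them run for this input.
experts = [lambda x: x, lambda x: 2 * x, lambda x: -x, lambda x: x + 1]
router_logits = [0.0, 2.0, 2.0, -1.0]
print(sparse_moe(3.0, experts, router_logits, k=2))
```

Setting k to half the number of experts mirrors the "about half of its parameters" regime in spirit only; how parameters are actually split across experts is not stated in the abstract.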

[CV-112] A multi-weight self-matching visual explanation for CNNs on SAR images

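[Quick read]: This paper addresses the limited interpretability of convolutional neural networks (CNNs) in synthetic aperture radar (SAR) tasks, where complex and opaque internal mechanisms restrict their use in high-reliability settings. The key to the solution is a visual explanation method called Multi-Weight Self-Matching Class Activation Mapping (MS-CAM), which matches SAR images with the feature maps and corresponding gradients extracted by the CNN and combines channel-wise and element-wise weights to localize the regions the model attends to and reveal the basis of its decisions, markedly improving the interpretability of CNNs on SAR images.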

Link: https://arxiv.org/abs/2512.02344
Authors: Siyuan Sun, Yongping Zhang, Hongcheng Zeng, Yamin Wang, Wei Yang, Wanting Yang, Jie Chen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In recent years, convolutional neural networks (CNNs) have achieved significant success in various synthetic aperture radar (SAR) tasks. However, the complexity and opacity of their internal mechanisms hinder the fulfillment of high-reliability requirements, thereby limiting their application in SAR. Improving the interpretability of CNNs is thus of great importance for their development and deployment in SAR. In this paper, a visual explanation method termed multi-weight self-matching class activation mapping (MS-CAM) is proposed. MS-CAM matches SAR images with the feature maps and corresponding gradients extracted by the CNN, and combines both channel-wise and element-wise weights to visualize the decision basis learned by the model in SAR images. Extensive experiments conducted on a self-constructed SAR target classification dataset demonstrate that MS-CAM more accurately highlights the network’s regions of interest and captures detailed target feature information, thereby enhancing network interpretability. Furthermore, the feasibility of applying MS-CAM to weakly-supervised object localization is validated. Key factors affecting localization accuracy, such as pixel thresholds, are analyzed in depth to inform future work.

[CV-113] TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction

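[Quick read]: This paper addresses the spatial consistency problem that arises when 3D vision foundation models make successive predictions over temporal windows in online settings such as autonomous driving. Existing methods align adjacent predictions by solving for a global transformation, but have fundamental limitations in assumption validity, local alignment scope, and robustness to noisy geometry. The key to the solution is a higher-DOF framework aimed at long-term consistency: based on Thin Plate Spline (TPS), it uses globally propagated control points to correct spatially varying inconsistencies, and adopts a point-agnostic submap registration design for robustness to noisy geometry predictions. The method is fully plug-and-play and compatible with diverse 3D foundation models and camera configurations (e.g., monocular or surround-view). Experiments show markedly better geometric consistency and lower trajectory error across multiple datasets, backbone models, and camera setups.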

Link: https://arxiv.org/abs/2512.02341
Authors: Fengyi Zhang, Tianjun Zhang, Kasra Khosoussi, Zheng Zhang, Zi Huang, Yadan Luo
Affiliations: The University of Queensland; Shanghai Jiao Tong University; Harbin Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D vision foundation models have shown strong generalization in reconstructing key 3D attributes from uncalibrated images through a single feed-forward pass. However, when deployed in online settings such as driving scenarios, predictions are made over temporal windows, making it non-trivial to maintain consistency across time. Recent strategies align consecutive predictions by solving for a global transformation, yet our analysis reveals their fundamental limitations in assumption validity, local alignment scope, and robustness under noisy geometry. In this work, we propose a higher-DOF and long-term alignment framework based on Thin Plate Spline, leveraging globally propagated control points to correct spatially varying inconsistencies. In addition, we adopt a point-agnostic submap registration design that is inherently robust to noisy geometry predictions. The proposed framework is fully plug-and-play, compatible with diverse 3D foundation models and camera configurations (e.g., monocular or surround-view). Extensive experiments demonstrate that our method consistently yields more coherent geometry and lower trajectory errors across multiple datasets, backbone models, and camera setups, highlighting its robustness and generality. Codes are publicly available at this https URL.

[CV-114] Reasoning Path and Latent State Analysis for Multi-view Visual Spatial Reasoning: A Cognitive Science Perspective

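[Quick read]: This paper addresses the core difficulty current vision-language models (VLMs) have in maintaining geometric coherence and cross-view consistency in multi-view spatial reasoning, which it attributes to the lack of fine-grained benchmarks that isolate multi-view reasoning from single-view perception and temporal factors. The key to the solution is ReMindView-Bench, a benchmark grounded in cognitive science that systematically varies viewpoint spatial patterns and query types to probe key factors of spatial cognition. The benchmark reveals systematic failures of VLMs in cross-view alignment and perspective-taking. Explicit phase-wise analysis (LLM-as-a-judge and self-consistency prompting) and implicit analysis (linear probing and entropy dynamics) diagnose a progressive loss of task-relevant information during information integration and a separation in uncertainty between correct and incorrect trajectories, giving a quantifiable, interpretable cognitive account of how multi-view spatial mental models are formed, degraded, and destabilized.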

Link: https://arxiv.org/abs/2512.02340
Authors: Qiyao Xue, Weichen Liu, Shiqi Wang, Haoming Wang, Yuyang Wu, Wei Gao
Affiliations: University of Pittsburgh; Carnegie Mellon University
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 37 figures

Abstract:Spatial reasoning is a core aspect of human intelligence that allows perception, inference and planning in 3D environments. However, current vision-language models (VLMs) struggle to maintain geometric coherence and cross-view consistency for spatial reasoning in multi-view settings. We attribute this gap to the lack of fine-grained benchmarks that isolate multi-view reasoning from single-view perception and temporal factors. To address this, we present ReMindView-Bench, a cognitively grounded benchmark for evaluating how VLMs construct, align and maintain spatial mental models across complementary viewpoints. ReMindView-Bench systematically varies viewpoint spatial pattern and query type to probe key factors of spatial cognition. Evaluations of 15 current VLMs reveal consistent failures in cross-view alignment and perspective-taking in multi-view spatial reasoning, motivating deeper analysis of the reasoning process. Explicit phase-wise analysis using LLM-as-a-judge and self-consistency prompting shows that VLMs perform well on in-frame perception but degrade sharply when integrating information across views. Implicit analysis, including linear probing and entropy dynamics, further shows progressive loss of task-relevant information and uncertainty separation between correct and incorrect trajectories. These results provide a cognitively grounded diagnosis of VLM spatial reasoning and reveal how multi-view spatial mental models are formed, degraded and destabilized across reasoning phases. The ReMindView-Bench benchmark is available at this https URL, and the source codes of benchmark construction and VLM reasoning analysis are available at this https URL.

[CV-115] Video Diffusion Models Excel at Tracking Similar-Looking Objects Without Supervision NEURIPS2025

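[Quick read]: This paper addresses the difficulty of tracking objects that look alike, especially when labeled data is unavailable and existing self-supervised trackers perform poorly. The key to the solution is to exploit the motion representations that pre-trained video diffusion models learn inherently: these models isolate motion information in the early, high-noise stages of denoising, separately from the later refinement of appearance detail. Building on this finding, the authors design a self-supervised tracking method that substantially improves tracking of visually indistinguishable objects, outperforming current methods on established benchmarks and on newly proposed tests targeting similar-looking objects.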

Link: https://arxiv.org/abs/2512.02339
Authors: Chenshuang Zhang, Kang Zhang, Joon Son Chung, In So Kweon, Junmo Kim, Chengzhi Mao
Affiliations: KAIST; Rutgers University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at NeurIPS 2025

Abstract:Distinguishing visually similar objects by their motion remains a critical challenge in computer vision. Although supervised trackers show promise, contemporary self-supervised trackers struggle when visual cues become ambiguous, limiting their scalability and generalization without extensive labeled data. We find that pre-trained video diffusion models inherently learn motion representations suitable for tracking without task-specific training. This ability arises because their denoising process isolates motion in early, high-noise stages, distinct from later appearance refinement. Capitalizing on this discovery, our self-supervised tracker significantly improves performance in distinguishing visually similar objects, an underexplored failure point for existing methods. Our method achieves up to a 6-point improvement over recent self-supervised approaches on established benchmarks and our newly introduced tests focused on tracking visually similar items. Visualizations confirm that these diffusion-derived motion representations enable robust tracking of even identical objects across challenging viewpoint changes and deformations.

[CV-116] VIGS-SLAM: Visual Inertial Gaussian Splatting SLAM

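[Quick read]: This paper addresses the performance degradation of visual SLAM methods based on 3D Gaussian Splatting (3DGS) in difficult conditions such as motion blur, low texture, and exposure changes. The core solution is a unified optimization framework that tightly couples visual and Inertial Measurement Unit (IMU) information, jointly optimizing camera poses, depth maps, and IMU states for robust real-time tracking and high-fidelity reconstruction. Key contributions include robust IMU initialization, time-varying bias modeling, and loop closure with consistent Gaussian updates.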

Link: https://arxiv.org/abs/2512.02293
Authors: Zihan Zhu, Wei Zhang, Norbert Haala, Marc Pollefeys, Daniel Barath
Affiliations: ETH Zurich; University of Stuttgart; Microsoft
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:We present VIGS-SLAM, a visual-inertial 3D Gaussian Splatting SLAM system that achieves robust real-time tracking and high-fidelity reconstruction. Although recent 3DGS-based SLAM methods achieve dense and photorealistic mapping, their purely visual design degrades under motion blur, low texture, and exposure variations. Our method tightly couples visual and inertial cues within a unified optimization framework, jointly refining camera poses, depths, and IMU states. It features robust IMU initialization, time-varying bias modeling, and loop closure with consistent Gaussian updates. Experiments on four challenging datasets demonstrate our superiority over state-of-the-art methods. Project page: this https URL

[CV-117] Enhancing Cross Domain SAR Oil Spill Segmentation via Morphological Region Perturbation and Synthetic Label-to-SAR Generation

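[Quick read]: This paper addresses the performance drop of synthetic aperture radar (SAR) oil spill segmentation models when transferred across regions, in particular the scarcity of labeled data along the Peruvian coast, where sea state, backscatter statistics, and slick morphology differ. The key to the solution is MORP–Synth, a two-stage synthetic augmentation framework: the first stage applies curvature-guided Morphological Region Perturbation (MRP) in label space to generate realistic geometric variations of oil and look-alike regions; the second stage uses a conditional generative INADE model to render SAR-like textures from the edited masks. The method improves transfer from the Mediterranean to the Peruvian domain, with experiments showing gains of up to +6 mIoU and minority-class IoU gains of +10.8 (oil) and +14.6 (look-alike).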

Link: https://arxiv.org/abs/2512.02290
Authors: Andre Juarez, Luis Salsavilca, Frida Coaquira, Celso Gonzales
Affiliations: Círculo de Investigación de Máquinas de Aprendizaje (CIMA), Universidad Nacional Agraria La Molina; Departamento de Estadística e Informática, Facultad de Economía y Planificación, Universidad Nacional Agraria La Molina
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep learning models for SAR oil spill segmentation often fail to generalize across regions due to differences in sea-state, backscatter statistics, and slick morphology, a limitation that is particularly severe along the Peruvian coast where labeled Sentinel-1 data remain scarce. To address this problem, we propose MORP–Synth, a two-stage synthetic augmentation framework designed to improve transfer from Mediterranean to Peruvian conditions. Stage A applies Morphological Region Perturbation, a curvature guided label space method that generates realistic geometric variations of oil and look-alike regions. Stage B renders SAR-like textures from the edited masks using a conditional generative INADE model. We compile a Peruvian dataset of 2112 labeled 512 × 512 patches from 40 Sentinel-1 scenes (2014–2024), harmonized with the Mediterranean CleanSeaNet benchmark, and evaluate seven segmentation architectures. Models pretrained on Mediterranean data degrade from 67.8% to 51.8% mIoU on the Peruvian domain; MORP–Synth improves performance up to +6 mIoU and boosts minority-class IoU (+10.8 oil, +14.6 look-alike).
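The results above are reported as mean intersection-over-union (mIoU) and per-class IoU. As a minimal sketch of how these metrics are conventionally computed, the snippet below scores a toy prediction against toy labels; the function name `mean_iou` and the class indexing (0 for sea, 1 for oil, 2 for look-alike) are assumptions for illustration, not the paper's evaluation code or label scheme.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Return (mIoU, {class: IoU}) from flat integer label sequences.

    Classes absent from both labels and predictions are skipped.
    """
    per_class = {}
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        intersection = sum(1 for t, p in pairs if t == c and p == c)
        union = sum(1 for t, p in pairs if t == c or p == c)
        if union > 0:
            per_class[c] = intersection / union
    miou = sum(per_class.values()) / len(per_class)
    return miou, per_class


# Toy pixels; hypothetical class ids: 0 = sea, 1 = oil, 2 = look-alike.
labels = [0, 0, 1, 1, 2, 2]
preds = [0, 1, 1, 1, 2, 0]
miou, per_class = mean_iou(labels, preds, num_classes=3)
print(miou, per_class)
```

Because mIoU averages over classes rather than pixels, gains on rare classes such as oil and look-alike regions move the headline number more than their pixel share would suggest.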

[CV-118] Bridging the Gap: Toward Cognitive Autonomy in Artificial Intelligence

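[Quick read]: This paper addresses fundamental limitations of current AI systems in self-monitoring, self-correction, and autonomous regulation of behavior in dynamic environments, which constrain robust generalization, continual adaptation, and autonomy in the real world. The key to the solution is a paradigm shift toward cognitively grounded AI (cognitive autonomy): architectures that mirror neurocognitive principles and give AI intrinsic meta-cognitive awareness, dynamic representation management, goal restructuring, and self-directed intentional behavior, paired with oversight mechanisms that keep systems interpretable, governable, and aligned with human values, so as to overcome structural bottlenecks that scaling alone cannot resolve.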

Link: https://arxiv.org/abs/2512.02280
Authors: Noorbakhsh Amiri Golilarz, Sindhuja Penchala, Shahram Rahimi
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Artificial intelligence has advanced rapidly across perception, language, reasoning, and multimodal domains. Yet despite these achievements, modern AI systems remain fundamentally limited in their ability to self-monitor, self-correct, and regulate their behavior autonomously in dynamic contexts. This paper identifies and analyzes seven core deficiencies that constrain contemporary AI models: the absence of intrinsic self-monitoring, lack of meta-cognitive awareness, fixed and non-adaptive learning mechanisms, inability to restructure goals, lack of representational maintenance, insufficient embodied feedback, and the absence of intrinsic agency. Alongside identifying these limitations, we also outline a forward-looking perspective on how AI may evolve beyond them through architectures that mirror neurocognitive principles. We argue that these structural limitations prevent current architectures, including deep learning and transformer-based systems, from achieving robust generalization, lifelong adaptability, and real-world autonomy. Drawing on a comparative analysis of artificial systems and biological cognition [7], and integrating insights from AI research, cognitive science, and neuroscience, we outline how these capabilities are absent in current models and why scaling alone cannot resolve them. We conclude by advocating for a paradigmatic shift toward cognitively grounded AI (cognitive autonomy) capable of self-directed adaptation, dynamic representation management, and intentional, goal-oriented behavior, paired with reformative oversight mechanisms [8] that ensure autonomous systems remain interpretable, governable, and aligned with human values.

[CV-119] Progressive Image Restoration via Text-Conditioned Video Generation

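[Quick read]: This paper addresses the underexplored potential of generative AI for image restoration, in particular how text-to-video models with strong temporal generation capabilities (such as CogVideo) can be used for progressive visual restoration. The key to the solution is fine-tuning CogVideo to learn progressive restoration trajectories from degraded to clean images instead of natural video motion. Specifically, synthetic datasets are built for super-resolution, deblurring, and low-light enhancement, where each sample shows a gradual transition from degraded to clean frames, and two prompting strategies are used (a uniform text prompt, and scene-specific prompts generated by the LLaVA multimodal large language model and refined with ChatGPT). The model learns to associate temporal progression with restoration quality, enabling robust zero-shot restoration of real-world images without additional training while maintaining spatiotemporal consistency and interpretability.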

Link: https://arxiv.org/abs/2512.02273
Authors: Peng Kang, Xijun Wang, Yu Yuan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: First two authors contributed equally to this work. IEEE ICNC Accepted

Abstract:Recent text-to-video models have demonstrated strong temporal generation capabilities, yet their potential for image restoration remains underexplored. In this work, we repurpose CogVideo for progressive visual restoration tasks by fine-tuning it to generate restoration trajectories rather than natural video motion. Specifically, we construct synthetic datasets for super-resolution, deblurring, and low-light enhancement, where each sample depicts a gradual transition from degraded to clean frames. Two prompting strategies are compared: a uniform text prompt shared across all samples, and a scene-specific prompting scheme generated via LLaVA multi-modal LLM and refined with ChatGPT. Our fine-tuned model learns to associate temporal progression with restoration quality, producing sequences that improve perceptual metrics such as PSNR, SSIM, and LPIPS across frames. Extensive experiments show that CogVideo effectively restores spatial detail and illumination consistency while maintaining temporal coherence. Moreover, the model generalizes to real-world scenarios on the ReLoBlur dataset without additional training, demonstrating strong zero-shot robustness and interpretability through temporal restoration.
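The abstract describes sequences whose quality improves frame by frame under metrics such as PSNR. The following minimal sketch shows how PSNR could be tracked along such a restoration trajectory; the pixel values are made-up toy data, and `psnr` is a hypothetical helper using the standard definition 10 * log10(MAX^2 / MSE), not the authors' evaluation code.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences."""
    if len(reference) != len(restored):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy clean target and three frames moving progressively closer to it.
clean = [100, 100, 100, 100]
trajectory = [
    [60, 140, 70, 130],
    [80, 120, 85, 115],
    [95, 105, 98, 102],
]
scores = [psnr(clean, frame) for frame in trajectory]
print([round(score, 2) for score in scores])
```

A monotonically rising score list of this kind is what "associating temporal progression with restoration quality" would look like for PSNR; SSIM and LPIPS need image structure or a learned network and are not covered by this sketch.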

[CV-120] Spatiotemporal Pyramid Flow Matching for Climate Emulation

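[Quick read]: This paper addresses the low computational efficiency of existing generative climate emulation methods over long horizons and their poor stability under nonstationary forcings. Generative models based on weather-scale autoregression can emulate climate, but step-by-step generation makes long rollouts slow and struggles with nonstationary external forcings such as changing greenhouse gas concentrations. The key to the solution is a new flow matching approach, Spatiotemporal Pyramid Flows (SPF), which builds a spatiotemporal pyramid that models climate data hierarchically across spatial and temporal scales: the generative trajectory is divided into stages, each tied to a particular spatial resolution and timescale and conditioned on physical forcings (e.g., greenhouse gases or aerosols). This design enables efficient parallel sampling, performs especially well at coarser temporal levels, and improves generalization across climate models and intervention scenarios.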

Link: https://arxiv.org/abs/2512.02268
Authors: Jeremy Andrew Irvin, Jiaqi Han, Zikui Wang, Abdulaziz Alharbi, Yufei Zhao, Nomin-Erdene Bayarsaikhan, Daniele Visioni, Andrew Y. Ng, Duncan Watson-Parris
Affiliations: Stanford University; Cornell University; University of California, San Diego
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
Comments:

Abstract:Generative models have the potential to transform the way we emulate Earth’s changing climate. Previous generative approaches rely on weather-scale autoregression for climate emulation, but this is inherently slow for long climate horizons and has yet to demonstrate stable rollouts under nonstationary forcings. Here, we introduce Spatiotemporal Pyramid Flows (SPF), a new class of flow matching approaches that model data hierarchically across spatial and temporal scales. Inspired by cascaded video models, SPF partitions the generative trajectory into a spatiotemporal pyramid, progressively increasing spatial resolution to reduce computation and coupling each stage with an associated timescale to enable direct sampling at any temporal level in the pyramid. This design, together with conditioning each stage on prescribed physical forcings (e.g., greenhouse gases or aerosols), enables efficient, parallel climate emulation at multiple timescales. On ClimateBench, SPF outperforms strong flow matching baselines and pre-trained models at yearly and monthly timescales while offering fast sampling, especially at coarser temporal levels. To scale SPF, we curate ClimateSuite, the largest collection of Earth system simulations to date, comprising over 33,000 simulation-years across ten climate models and the first dataset to include simulations of climate interventions. We find that the scaled SPF model demonstrates good generalization to held-out scenarios across climate models. Together, SPF and ClimateSuite provide a foundation for accurate, efficient, probabilistic climate emulation across temporal scales and realistic future scenarios. Data and code are publicly available at this https URL.

[CV-121] Exploring the Potentials of Spiking Neural Networks for Image Deraining AAAI2026

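[Quick read]: This paper addresses two core problems of Spiking Neural Networks (SNNs) in low-level vision tasks, image deraining in particular: first, conventional SNNs struggle to represent the high-frequency (high-pass) characteristics of images, leaving them with weak spatial contextual understanding; second, conventional SNNs saturate in the frequency domain, limiting multi-scale feature extraction. The key to the solution is a new spiking neuron, the Visual LIF (VLIF) neuron, which better captures high-frequency image detail and strengthens spatial context awareness. Building on VLIF, the authors design the Spiking Decomposition and Enhancement Module and the lightweight Spiking Multi-scale Unit for hierarchical multi-scale representation learning. Experiments on five benchmark deraining datasets show that the method clearly outperforms existing SNN methods while using only 13% of their energy, laying a solid foundation for deploying efficient, high-performance SNNs in low-level vision tasks.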

Link: https://arxiv.org/abs/2512.02258
Authors: Shuang Chen, Tomas Krajnik, Farshad Arvin, Amir Atapour-Abarghouei
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026

Abstract:Biologically plausible and energy-efficient frameworks such as Spiking Neural Networks (SNNs) have not been sufficiently explored in low-level vision tasks. Taking image deraining as an example, this study addresses the representation of the inherent high-pass characteristics of spiking neurons, specifically in image deraining and innovatively proposes the Visual LIF (VLIF) neuron, overcoming the obstacle of lacking spatial contextual understanding present in traditional spiking neurons. To tackle the limitation of frequency-domain saturation inherent in conventional spiking neurons, we leverage the proposed VLIF to introduce the Spiking Decomposition and Enhancement Module and the lightweight Spiking Multi-scale Unit for hierarchical multi-scale representation learning. Extensive experiments across five benchmark deraining datasets demonstrate that our approach significantly outperforms state-of-the-art SNN-based deraining methods, achieving this superior performance with only 13% of their energy consumption. These findings establish a solid foundation for deploying SNNs in high-performance, energy-efficient low-level vision tasks.

[CV-122] PhishSnap: Image-Based Phishing Detection Using Perceptual Hashing

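[Quick read]: This paper addresses the weak performance of current URL- and HTML-based phishing detection systems against obfuscation and visual deception. The key to the solution is PhishSnap, a privacy-preserving, on-device phishing detection system that uses perceptual hashing (pHash): a browser extension captures webpage screenshots, computes visual hashes, and compares them against templates of legitimate websites to identify visually similar phishing pages. All inference runs on the device, which protects user privacy and keeps detection latency low.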

Link: https://arxiv.org/abs/2512.02243
Authors: Md Abdul Ahad Minhaz, Zannatul Zahan Meem, Md. Shohrab Hossain
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: IEEE Standard Formatting, 3 pages, 3 figures

Abstract:Phishing remains one of the most prevalent online threats, exploiting human trust to harvest sensitive credentials. Existing URL- and HTML-based detection systems struggle against obfuscation and visual deception. This paper presents PhishSnap, a privacy-preserving, on-device phishing detection system leveraging perceptual hashing (pHash). Implemented as a browser extension, PhishSnap captures webpage screenshots, computes visual hashes, and compares them against legitimate templates to identify visually similar phishing attempts. A 2024 dataset of 10,000 URLs (70%/20%/10% train/validation/test) was collected from PhishTank and Netcraft. Due to security takedowns, a subset of phishing pages was unavailable, reducing dataset diversity. The system achieved 0.79 accuracy, 0.76 precision, and 0.78 recall, showing that visual similarity remains a viable anti-phishing measure. The entire inference process occurs locally, ensuring user privacy and minimal latency.
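Comparing perceptual hashes usually comes down to a Hamming distance: two screenshots that look alike produce hashes differing in only a few bits. The sketch below illustrates only that comparison step, under stated assumptions: the 64-bit hash values are invented, the threshold of 10 bits is arbitrary, and `hamming_distance` and `closest_template` are hypothetical helpers, not PhishSnap's code. Computing the hash itself from a screenshot would need a pHash implementation, which is out of scope here.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Number of differing bits between two integer-encoded hashes."""
    return bin(hash_a ^ hash_b).count("1")


def closest_template(page_hash, templates, max_distance=10):
    """Return (name, distance) of the nearest template within max_distance.

    Returns (None, None) when no template is close enough.
    """
    best_name, best_dist = None, None
    for name, template_hash in templates.items():
        dist = hamming_distance(page_hash, template_hash)
        if dist <= max_distance and (best_dist is None or dist < best_dist):
            best_name, best_dist = name, dist
    return best_name, best_dist


# Invented 64-bit hashes standing in for hashes of legitimate login pages.
templates = {
    "bank-login": 0xF0F0F0F0F0F0F0F0,
    "mail-login": 0x0123456789ABCDEF,
}
suspect = 0xF0F0F0F0F0F0F0F3  # differs from "bank-login" in two bits
print(closest_template(suspect, templates))
```

A close visual match is not by itself evidence of phishing, since the legitimate site matches its own template; a detector would presumably also check whether the page is served from the expected domain, a step the abstract does not detail.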

[CV-123] See Hear and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

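[Quick read]: This paper addresses the lack of fine-grained evaluation of reasoning about human speech in current video benchmarks: existing tasks can often be answered from vision alone or assess speech only coarsely, making it hard to tell whether a model can align who is speaking, what is said, and when. The key to the solution is AV-SpeakerBench, a fine-grained audiovisual reasoning benchmark of 3,212 multiple-choice questions. Its core features are: (1) a speaker-centered formulation that treats speakers, not scenes, as the unit of reasoning; (2) fusion-grounded question design that embeds audiovisual dependencies into question semantics; and (3) expert annotation that ensures temporal precision and cross-modal validity, providing a rigorous basis for evaluating fine-grained audiovisual alignment in future multimodal systems.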

Link: https://arxiv.org/abs/2512.02231
Authors: Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang, Subin An, Jeongik Lee, Yohan Ban, SeungEun Chung, Thanh-Huy Nguyen, JuWan Maeng, Soochahn Lee, Yong Jae Lee
Affiliations: University of Wisconsin–Madison; Kookmin University; Carnegie Mellon University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: preprint

Abstract:Multimodal large language models (MLLMs) are expected to jointly interpret vision, audio, and language, yet existing video benchmarks rarely assess fine-grained reasoning about human speech. Many tasks remain visually solvable or only coarsely evaluate speech, offering limited insight into whether models can align who speaks, what is said, and when it occurs. We introduce AV-SpeakerBench, a curated benchmark of 3,212 multiple-choice questions focused on speaker-centric audiovisual reasoning in real-world videos. It features: (1) a speaker-centered formulation that treats speakers, not scenes, as the core reasoning unit; (2) fusion-grounded question design embedding audiovisual dependencies into question semantics; and (3) expert-curated annotations ensuring temporal precision and cross-modal validity. Comprehensive evaluations show that the Gemini family consistently outperforms open-source systems, with Gemini 2.5 Pro achieving the best results. Among open models, Qwen3-Omni-30B approaches Gemini 2.0 Flash but remains far behind Gemini 2.5 Pro, primarily due to weaker audiovisual fusion rather than visual perception. We believe AV-SpeakerBench establishes a rigorous foundation for advancing fine-grained audiovisual reasoning in future multimodal systems.
zh

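[CV-124] Towards Unified Video Quality Assessment

[Quick Read]: This paper addresses two problems with existing video quality assessment (VQA) methods. First, models are typically monolithic and output only a global quality score, so they lack diagnostic value and interpretability. Second, most methods are designed for a specific video format or distortion type, so they do not generalize across domains. The key to the solution is the Unified-VQA framework, which recasts generic VQA as a Diagnostic Mixture-of-Experts (MoE) problem in which multiple "perceptual experts", each dedicated to a distinct perceptual domain, work together. A novel multi-proxy expert training strategy optimizes each expert with the proxy metric best suited to its domain. A diagnostic multi-task head, trained with weak supervision, produces both a global quality score and an interpretable multi-dimensional artifact vector. Without retraining or fine-tuning, the model outperforms over 18 benchmark methods across 17 databases covering HD, UHD, HDR and HFR formats, giving a unified, interpretable, and practical approach to video quality assessment.

Link: https://arxiv.org/abs/2512.02224
Authors: Chen Feng, Tianhao Peng, Fan Zhang, David Bull
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 3 figures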

Abstract:Recent works in video quality assessment (VQA) typically employ monolithic models that predict a single quality score for each test video. These approaches cannot provide diagnostic, interpretable feedback, offering little insight into why the video quality is degraded. Most of them are also specialized, format-specific metrics rather than truly "generic" solutions, as they are designed to learn a compromised representation from disparate perceptual domains. To address these limitations, this paper proposes Unified-VQA, a framework that provides a single, unified quality model applicable to various distortion types within multiple video formats by recasting generic VQA as a Diagnostic Mixture-of-Experts (MoE) problem. Unified-VQA employs multiple "perceptual experts" dedicated to distinct perceptual domains. A novel multi-proxy expert training strategy is designed to optimize each expert using a ranking-inspired loss, guided by the most suitable proxy metric for its domain. We also integrated a diagnostic multi-task head into this framework to generate a global quality score and an interpretable multi-dimensional artifact vector, which is optimized using a weakly-supervised learning strategy, leveraging the known properties of the large-scale training database generated for this work. With static model parameters (without retraining or fine-tuning), Unified-VQA demonstrates consistent and superior performance compared to over 18 benchmark methods for both generic VQA and diagnostic artifact detection tasks across 17 databases containing diverse streaming artifacts in HD, UHD, HDR and HFR formats. This work represents an important step towards practical, actionable, and interpretable video quality assessment.
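
The abstract says each expert is optimized with a ranking-inspired loss guided by a proxy metric, but it does not give the form of that loss. As a minimal sketch, assuming a pairwise hinge formulation (an assumption, not the authors' stated method), the idea of "agree with the proxy metric's ordering" could look like this. The function name and the `margin` parameter are hypothetical.

```python
from itertools import combinations

def pairwise_ranking_loss(predicted, proxy, margin=0.1):
    """Mean hinge loss over all pairs whose proxy scores differ.

    predicted: quality scores from one (hypothetical) perceptual expert.
    proxy: scores from the proxy metric chosen for that expert's domain.
    """
    losses = []
    for i, j in combinations(range(len(predicted)), 2):
        if proxy[i] == proxy[j]:
            continue  # the proxy gives no ordering for this pair
        sign = 1.0 if proxy[i] > proxy[j] else -1.0
        # Penalize pairs ordered against the proxy, or separated by less than margin.
        losses.append(max(0.0, margin - sign * (predicted[i] - predicted[j])))
    return sum(losses) / len(losses) if losses else 0.0
```

Only the relative order of the proxy scores enters the loss, so the expert's output scale does not need to match the proxy metric's scale.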

[CV-125] Multifractal Recalibration of Neural Networks for Medical Imaging Segmentation

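[Quick Read]: This paper addresses the limited use of multifractal descriptions in modern deep learning, particularly for tasks such as semantic segmentation, where existing end-to-end multifractal methods depend on heavy pooling or strong feature-space decimation. The key to the solution is two inductive priors, Monofractal and Multifractal Recalibration. They use the relationship between the probability mass of the exponents and the multifractal spectrum to build statistical descriptions of encoder embeddings, and are integrated into convolutional networks as channel-attention functions. This improves the model's ability to capture complex structural regularities without sacrificing local detail.

Link: https://arxiv.org/abs/2512.02198
Authors: Miguel L. Martins, Miguel T. Coimbra, Francesco Renna
Affiliations: University of Porto; INESC-TEC
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 30 pages, 9 figures, journal paper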

Abstract:Multifractal analysis has revealed regularities in many self-seeding phenomena, yet its use in modern deep learning remains limited. Existing end-to-end multifractal methods rely on heavy pooling or strong feature-space decimation, which constrain tasks such as semantic segmentation. Motivated by these limitations, we introduce two inductive priors: Monofractal and Multifractal Recalibration. These methods leverage relationships between the probability mass of the exponents and the multifractal spectrum to form statistical descriptions of encoder embeddings, implemented as channel-attention functions in convolutional networks. Using a U-Net-based framework, we show that multifractal recalibration yields substantial gains over a baseline equipped with other channel-attention mechanisms that also use higher-order statistics. Given the proven ability of multifractal analysis to capture pathological regularities, we validate our approach on three public medical-imaging datasets: ISIC18 (dermoscopy), Kvasir-SEG (endoscopy), and BUSI (ultrasound). Our empirical analysis also provides insights into the behavior of these attention layers. We find that excitation responses do not become increasingly specialized with encoder depth in U-Net architectures due to skip connections, and that their effectiveness may relate to global statistics of instance variability.
MSC classes: 68T07, 68U10, 28A80

[CV-126] RobustSurg: Tackling domain generalisation for out-of-distribution surgical scene segmentation

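[Quick Read]: This paper addresses the poor generalization of surgical scene segmentation models to out-of-distribution (OOD) data from other centres and to unseen imaging modalities. Current deep learning methods perform well on single-centre, single-modality data but degrade markedly under the distribution shifts introduced by different institutions or imaging devices. The key to the solution is threefold. First, instance normalisation and feature covariance mapping are used to extract robust, generalisable feature representations that reduce appearance variation caused by blood, imaging artefacts, and similar factors. Second, a restitution module is added to the ResNet backbone so that salient task-relevant features are retained rather than removed. Third, the authors curate a new multicentre, multiclass surgical scene segmentation dataset to support work on generalisability. Experiments show that the proposed RobustSurg improves mean IoU by nearly 23% over the DeepLabv3+ baseline and by 10-32% over state-of-the-art methods on the unseen-centre HeiCholSeg dataset when trained on CholecSeg8K.

Link: https://arxiv.org/abs/2512.02188
Authors: Mansoor Ali, Maksim Richards, Gilberto Ochoa-Ruiz, Sharib Ali
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to Medical Image Analysis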

Abstract:While recent advances in deep learning for surgical scene segmentation have demonstrated promising results on single-centre and single-imaging modality data, these methods usually do not generalise to unseen distributions (i.e., from other centres) and unseen modalities. Generalisation on out-of-distribution data and domain gaps due to modality changes have been widely researched, but mostly for natural scene data. However, these methods cannot be directly applied to the surgical scenes due to limited visual cues and often extremely diverse scenarios compared to the natural scene data. Inspired by these works in natural scenes to push generalisability on OOD data, we hypothesise that exploiting the style and content information in the surgical scenes could minimise appearance variation, making the representation less sensitive to sudden changes such as blood or imaging artefacts. This can be achieved by performing instance normalisation and feature covariance mapping techniques for robust and generalisable feature representations. Further, to eliminate the risk of removing salient feature representation associated with the objects of interest, we introduce a restitution module within the feature learning ResNet backbone that can enable the retention of useful task-relevant features. To tackle the lack of multiclass and multicentre data for surgical scene segmentation, we also provide a newly curated dataset that can be vital for addressing generalisability in this domain. Our proposed RobustSurg obtained nearly 23% improvement over the baseline DeepLabv3+ and 10-32% improvement over the SOTA in terms of mean IoU score on the unseen-centre HeiCholSeg dataset when trained on CholecSeg8K. Similarly, RobustSurg also obtained nearly 22% improvement over the baseline and nearly 11% improvement over a recent SOTA method for the target set of the EndoUDA polyp dataset.
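
Of the techniques named in the abstract, instance normalisation is the simplest to illustrate. The sketch below is a plain-Python version of the standard operation, not the authors' implementation, and it leaves out feature covariance mapping and the restitution module. It shows why the operation reduces appearance variation: a per-channel change of brightness or contrast is removed.

```python
import math

def instance_norm(feature_map, eps=1e-5):
    """Normalize each channel of one sample to zero mean and unit variance.

    feature_map: list of channels, each a list of rows of floats (C x H x W).
    """
    normalized = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        normalized.append([[(v - mean) * scale for v in row] for row in channel])
    return normalized

# Same content, but the second map is brighter and has higher contrast.
original = [[[1.0, 2.0], [3.0, 4.0]]]
restyled = [[[10.0 + 2.0 * v for v in row] for row in original[0]]]
```

Calling `instance_norm` on `original` and on `restyled` gives nearly identical outputs, which is the style-invariance the abstract relies on. The risk it also mentions, that normalisation may discard useful information, is what the restitution module is meant to counter.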

[CV-127] SplatSuRe: Selective Super-Resolution for Multi-view Consistent 3D Gaussian Splatting

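[Quick Read]: This paper addresses the multi-view inconsistency that arises when 3D Gaussian Splatting (3DGS) is used to render at a higher resolution than the training views. Applying super-resolution (SR) independently to each low-resolution (LR) image produces inconsistent views and blurry renders. The key to the solution is SplatSuRe, which uses the camera pose relative to the scene geometry to identify regions that are undersampled and lack high-frequency supervision, and adds SR content only in those regions. This preserves multi-view consistency while improving local detail and overall perceptual quality.

Link: https://arxiv.org/abs/2512.02172
Authors: Pranav Asthana, Alex Hanson, Allen Tu, Tom Goldstein, Matthias Zwicker, Amitabh Varshney
Affiliations: University of Maryland, College Park
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Project Page: this https URL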

Abstract:3D Gaussian Splatting (3DGS) enables high-quality novel view synthesis, motivating interest in generating higher-resolution renders than those available during training. A natural strategy is to apply super-resolution (SR) to low-resolution (LR) input views, but independently enhancing each image introduces multi-view inconsistencies, leading to blurry renders. Prior methods attempt to mitigate these inconsistencies through learned neural components, temporally consistent video priors, or joint optimization on LR and SR views, but all uniformly apply SR across every image. In contrast, our key insight is that close-up LR views may contain high-frequency information for regions also captured in more distant views, and that we can use the camera pose relative to scene geometry to inform where to add SR content. Building from this insight, we propose SplatSuRe, a method that selectively applies SR content only in undersampled regions lacking high-frequency supervision, yielding sharper and more consistent results. Across Tanks and Temples, Deep Blending and Mip-NeRF 360, our approach surpasses baselines in both fidelity and perceptual quality. Notably, our gains are most significant in localized foreground regions where higher detail is desired.

[CV-128] Mapping of Lesion Images to Somatic Mutations

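[Quick Read]: This paper addresses how to predict a patient's somatic mutation profile from medical images, with the aim of earlier and more precise cancer diagnosis and treatment decisions. The core challenge is modeling the heterogeneity of, and the complex associations between, the two modalities (medical images and genomic information). The key to the solution is LLOST, a deep latent variable model consisting of two variational autoencoders (VAEs) coupled through a separate shared latent space, with a conditional normalizing flow prior on each latent space to account for the different distributions of lesion point clouds and mutation counts. This design gives a unified representation of imaging and genetic features and captures patterns shared between images and somatic mutations, notably patterns that reflect cancer type.

Link: https://arxiv.org/abs/2512.02162
Authors: Rahul Mehta
Affiliations: University of Illinois at Chicago; University of Texas MD Anderson Cancer Center
Categories: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments: this https URL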

Abstract:Medical imaging is a critical initial tool used by clinicians to determine a patient’s cancer diagnosis, allowing for faster intervention and more reliable patient prognosis. At subsequent stages of patient diagnosis, genetic information is extracted to help select specific patient treatment options. As the efficacy of cancer treatment often relies on early diagnosis and treatment, we build a deep latent variable model to determine patients’ somatic mutation profiles based on their corresponding medical images. We first introduce a point cloud representation of lesion images to allow for invariance to the imaging modality. We then propose LLOST, a model with dual variational autoencoders coupled together by a separate shared latent space that unifies features from the lesion point clouds and counts of distinct somatic mutations. Our model therefore consists of three latent spaces, each of which is learned with a conditional normalizing flow prior to account for the diverse distributions of each domain. We conduct qualitative and quantitative experiments on de-identified medical images from The Cancer Imaging Archive and the corresponding somatic mutations from the Pan Cancer dataset of The Cancer Genomic Archive. We show the model’s predictive performance on the counts of specific mutations as well as its ability to accurately predict the occurrence of mutations. In particular, we find shared patterns between the imaging and somatic mutation domains that reflect cancer type. We conclude with a remark on how to improve the model and possible future avenues of research to include other genetic domains.

[CV-129] FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges NEURIPS2025

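[Quick Read]: This paper addresses two gaps. Text-to-image (T2I) models often fail to follow specific attributes in user prompts, such as object count and color, and benchmarks for vision language models (VLMs) have not kept pace with the complexity of the scenes VLMs are asked to annotate. The key to the solution is a structured methodology for evaluating both jointly: challenging prompts are used to test whether VLMs can identify 27 specific failure modes in images generated by T2I models. The authors also build a dataset of prompts and images from 5 T2I models (Flux, SD3-Medium, SD3-Large, SD3.5-Medium, SD3.5-Large) with corresponding annotations from 3 VLMs (Molmo, InternVL3, Pixtral), which are in turn annotated by an LLM (Llama3). This supports fine-grained analysis of attribute fidelity and object representation, exposes the limits of current metrics, and motivates targeted benchmarks for more reliable and interpretable generative models.

Link: https://arxiv.org/abs/2512.02161
Authors: Kevin David Hayes, Micah Goldblum, Vikash Sehwag, Gowthami Somepalli, Ashwinee Panda, Tom Goldstein
Affiliations: University of Maryland; Columbia University; Sony AI; Google Deepmind
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to NeurIPS 2025 Datasets and Benchmarks Track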

Abstract:Text-to-image (T2I) models are capable of generating visually impressive images, yet they often fail to accurately capture specific attributes in user prompts, such as the correct number of objects with the specified colors. The diversity of such errors underscores the need for a hierarchical evaluation framework that can compare prompt adherence abilities of different image generation models. Simultaneously, benchmarks of vision language models (VLMs) have not kept pace with the complexity of scenes that VLMs are used to annotate. In this work, we propose a structured methodology for jointly evaluating T2I models and VLMs by testing whether VLMs can identify 27 specific failure modes in the images generated by T2I models conditioned on challenging prompts. Our second contribution is a dataset of prompts and images generated by 5 T2I models (Flux, SD3-Medium, SD3-Large, SD3.5-Medium, SD3.5-Large) and the corresponding annotations from VLMs (Molmo, InternVL3, Pixtral) annotated by an LLM (Llama3) to test whether VLMs correctly identify the failure mode in a generated image. By analyzing failure modes on a curated set of prompts, we reveal systematic errors in attribute fidelity and object representation. Our findings suggest that current metrics are insufficient to capture these nuanced errors, highlighting the importance of targeted benchmarks for advancing generative model reliability and interpretability.

[CV-130] Context-Enriched Contrastive Loss: Enhancing Presentation of Inherent Sample Connections in Contrastive Learning Framework

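[Quick Read]: This paper addresses information distortion introduced by data augmentation in contrastive learning. The model can over-rely on features from samples with the same label while neglecting positive pairs that come from the same original image, which hurts learning efficiency and generalization. The key to the solution is a context-enriched contrastive loss function with two convergence targets. The first, label-sensitive component separates features of identical and distinct classes, improving contrastive training efficiency. The second pulls together augmented samples from the same source image and pushes all other samples away, mitigating information distortion and speeding up convergence. Experiments show the method outperforms 16 state-of-the-art contrastive learning methods on 8 widely used image recognition benchmarks, with a 22.9% improvement over the original contrastive loss on the systematically biased BiasedMNIST task.

Link: https://arxiv.org/abs/2512.02152
Authors: Haojin Deng, Yimin Yang
Affiliations: Western University; Vector Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 7 figures. Published in IEEE Transactions on Multimedia. Code available at: this https URL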

Abstract:Contrastive learning has gained popularity and has pushed state-of-the-art performance across numerous large-scale benchmarks. In contrastive learning, the contrastive loss function plays a pivotal role in discerning similarities between samples through techniques such as rotation or cropping. However, this learning mechanism can also introduce information distortion from the augmented samples. This is because the trained model may develop a significant overreliance on information from samples with identical labels, while concurrently neglecting positive pairs that originate from the same initial image, especially in expansive datasets. This paper proposes a context-enriched contrastive loss function that concurrently improves learning effectiveness and addresses the information distortion by encompassing two convergence targets. The first component, which is notably sensitive to label contrast, differentiates between features of identical and distinct classes, which boosts the contrastive training efficiency. Meanwhile, the second component draws closer the augmented samples from the same source image and distances all other samples. We evaluate the proposed approach on image classification tasks using eight widely accepted large-scale recognition benchmark datasets: CIFAR10, CIFAR100, Caltech-101, Caltech-256, ImageNet, BiasedMNIST, UTKFace, and CelebA. The experimental results demonstrate that the proposed method achieves improvements over 16 state-of-the-art contrastive learning methods in terms of both generalization performance and learning convergence speed. Interestingly, our technique stands out in addressing systematic distortion tasks. It demonstrates a 22.9% improvement compared to original contrastive loss functions in the downstream BiasedMNIST dataset, highlighting its promise for more efficient and equitable downstream training.

[CV-131] CoatFusion: Controllable Material Coating in Images

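[Quick Read]: This paper introduces a new image editing task, Material Coating: simulating a thin layer of material applied to an object's surface while preserving the object's underlying coarse and fine geometry. This differs from existing material transfer methods, which replace an object's intrinsic material, often overwrite fine details, and offer no precise control over coating thickness or physical properties. The key to the solution is CoatFusion, a new architecture that conditions a diffusion model on a 2D albedo texture together with granular, PBR-style parametric controls (roughness, metalness, transmission, and a key thickness parameter). It produces realistic, controllable coatings and significantly outperforms existing material editing and transfer methods.

Link: https://arxiv.org/abs/2512.02143
Authors: Sagie Levy, Elad Aharoni, Matan Levy, Ariel Shamir, Dani Lischinski
Affiliations: The Hebrew University of Jerusalem; Reichman University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: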

Abstract:We introduce Material Coating, a novel image editing task that simulates applying a thin material layer onto an object while preserving its underlying coarse and fine geometry. Material coating is fundamentally different from existing “material transfer” methods, which are designed to replace an object’s intrinsic material, often overwriting fine details. To address this new task, we construct a large-scale synthetic dataset (110K images) of 3D objects with varied, physically-based coatings, named DataCoat110K. We then propose CoatFusion, a novel architecture that enables this task by conditioning a diffusion model on both a 2D albedo texture and granular, PBR-style parametric controls, including roughness, metalness, transmission, and a key thickness parameter. Experiments and user studies show CoatFusion produces realistic, controllable coatings and significantly outperforms existing material editing and transfer methods on this new task.

[CV-132] Superpixel Attack: Enhancing Black-box Adversarial Attack with Image-driven Division Areas

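[Quick Read]: This paper addresses the low success rate of black-box adversarial attacks, particularly against models that are robust to adversarial examples. Existing attacks change the perturbation within simple rectangular areas in each iteration, which does not follow the local structure of the image. The key to the solution is to use superpixels, which balance color variance and compactness, as the areas in which perturbations are changed. The paper also proposes a new search method, versatile search, and combines the two into Superpixel Attack, which improves attack success rates by an average of 2.10% over existing attacks.

Link: https://arxiv.org/abs/2512.02062
Authors: Issa Oe, Keiichiro Yamamura, Hiroki Ishikura, Ryo Hamahira, Katsuki Fujisawa
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: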

Abstract:Deep learning models are used in safety-critical tasks such as automated driving and face recognition. However, small perturbations in the model input can significantly change the predictions. Adversarial attacks are used to identify small perturbations that can lead to misclassifications. More powerful black-box adversarial attacks are required to develop more effective defenses. A promising approach to black-box adversarial attacks is to repeat the process of extracting a specific image area and changing the perturbations added to it. Existing attacks adopt simple rectangles as the areas where perturbations are changed in a single iteration. We propose applying superpixels instead, which achieve a good balance between color variance and compactness. We also propose a new search method, versatile search, and a novel attack method, Superpixel Attack, which applies superpixels and performs versatile search. Superpixel Attack improves attack success rates by an average of 2.10% compared with existing attacks. Most models used in this study are robust against adversarial attacks, and this improvement is significant for black-box adversarial attacks. The code is available at this https URL.

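[CV-133] Leveraging AI multimodal geospatial foundation models for improved near-real-time flood mapping at a global scale

[Quick Read]: This paper addresses limited segmentation accuracy and model generalization in global flood monitoring from remote sensing imagery, focusing on how multimodal data can improve a Geospatial Foundation Model (GFM) for near-real-time flood extent mapping. The key to the solution is twofold. First, the authors use FloodsNet, a harmonized multimodal dataset of co-located Sentinel-1 synthetic aperture radar (SAR) and Sentinel-2 optical imagery for 85 flood events worldwide. Second, they fine-tune ESA-IBM's TerraMind GFM and examine how model size (base vs. large) and backbone freezing (frozen vs. unfrozen) affect accuracy, precision, recall, and computational cost. The base-unfrozen configuration gave the best balance of accuracy, precision, and recall at substantially lower computational cost than the large model, indicating that combining multi-source remote sensing data with fine-tuning of a pretrained model is an effective route to better flood mapping.

Link: https://arxiv.org/abs/2512.02055
Authors: Mirela G. Tulbure, Julio Caineta, Mark Broich, Mollie D. Gaines, Philippe Rufin, Leon-Friedrich Thomas, Hamed Alemohammad, Jan Hemmerling, Patrick Hostert
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: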

Abstract:Floods are among the most damaging weather-related hazards, and in 2024, the warmest year on record, extreme flood events affected communities across five continents. Earth observation (EO) satellites provide critical, frequent coverage for mapping inundation, yet operational accuracy depends heavily on labeled datasets and model generalization. Recent Geospatial Foundation Models (GFMs), such as ESA-IBM’s TerraMind, offer improved generalizability through large-scale self-supervised pretraining, but their performance on diverse global flood events remains poorly understood. We fine-tune TerraMind for flood extent mapping using FloodsNet, a harmonized multimodal dataset containing co-located Sentinel-1 (Synthetic Aperture Radar, SAR data) and Sentinel-2 (optical) imagery for 85 flood events worldwide. We tested four configurations (base vs. large models; frozen vs. unfrozen backbones) and compared against the TerraMind Sen1Floods11 example and a U-Net trained on both FloodsNet and Sen1Floods11. The base-unfrozen configuration provided the best balance of accuracy, precision, and recall at substantially lower computational cost than the large model. The large unfrozen model achieved the highest recall. Models trained on FloodsNet outperformed the Sen1Floods11-trained example in recall with similar overall accuracy. U-Net achieved higher recall than all GFM configurations, though with slightly lower accuracy and precision. Our results demonstrate that integrating multimodal optical and SAR data and fine-tuning a GFM can enhance near-real-time flood mapping. This study provides one of the first global-scale evaluations of a GFM for flood segmentation, highlighting both its potential and current limitations for climate adaptation and disaster resilience. 
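
The comparison in the abstract turns on three pixel-wise metrics: accuracy, precision, and recall. As a minimal sketch using their standard definitions (the function name and the flat 0/1 mask format are assumptions made for illustration), they can be computed from a predicted and a reference flood mask as follows.

```python
def flood_metrics(predicted, reference):
    """Pixel-wise accuracy, precision and recall for binary flood masks.

    predicted, reference: equal-length flat sequences of 0 (dry) / 1 (flooded).
    """
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)
    tn = sum(1 for p, r in pairs if p == 0 and r == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Precision: share of predicted flood pixels that are truly flooded.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Recall: share of truly flooded pixels that the model found.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

A model that over-predicts water raises recall while lowering precision, which is consistent with the reported U-Net result of higher recall with slightly lower accuracy and precision.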

[CV-134] A Real-time Face Mask Detection and Social Distancing System for COVID-19 using Attention-InceptionV3 Model

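[Quick Read]: This paper addresses how to check whether people in public places comply with COVID-19 measures, namely keeping a physical distance of 6 feet and wearing face masks, in order to identify areas at high risk of transmission. The key to the solution is a customized attention-InceptionV3 model trained on 10,800 images, with and without face masks, drawn from two datasets. It reaches 98% training accuracy and 99.5% validation accuracy, with a precision of about 98.2% at 25 frames per second (FPS), so it can help authorities locate high-risk areas quickly and issue timely warnings.

Link: https://arxiv.org/abs/2411.05312
Authors: Abdullah Al Asif, Farhana Chowdhury Tisha
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: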

Abstract:One of the deadliest pandemics is now unfolding around the world due to COVID-19. This contagious virus is spreading like wildfire around the whole world. To minimize the spread of this virus, the World Health Organization (WHO) has issued protocols making it mandatory to wear face masks and maintain a physical distance of 6 feet. In this paper, we develop a system that detects whether that distance is being maintained and whether people are wearing masks properly. The system uses a customized attention-inceptionv3 model to identify these two components. We used two different datasets with 10,800 images, including images both with and without face masks. The training accuracy reached 98% and the validation accuracy 99.5%. The system achieves a precision of around 98.2% at a frame rate of 25.0 frames per second (FPS). With this system, we can therefore identify high-risk areas where the virus is most likely to spread. This may help authorities take the necessary steps to locate those risky areas and alert local people to take proper precautions in no time.
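
The abstract does not say how the 6-foot check is computed once people are detected. A minimal sketch of one plausible final step is shown below. It assumes, purely for illustration, that each detected person has already been mapped from image pixels to a ground-plane position in feet; the function name and input format are hypothetical.

```python
import math
from itertools import combinations

def distancing_violations(positions_ft, min_distance_ft=6.0):
    """Return index pairs of people standing closer than min_distance_ft.

    positions_ft: (x, y) ground-plane positions in feet, one per detected
    person (assumed to come from an earlier detection and calibration step).
    """
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(positions_ft), 2)
        if math.dist(a, b) < min_distance_ft
    ]
```

For example, `distancing_violations([(0, 0), (3, 4), (20, 0)])` flags only the first two people, who stand 5 feet apart.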

[CV-135] Comparing Baseline and Day-1 Diffusion MRI Using Multimodal Deep Embeddings for Stroke Outcome Prediction

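[Quick Read]: This paper addresses the prediction of three-month functional outcomes in patients with acute ischemic stroke (AIS) from imaging and clinical data. Its main points are as follows. First, diffusion MRI acquired at 24 hours gives better predictive performance (AUC = 0.923) than the baseline (pre-treatment) scan, indicating that early post-treatment imaging has greater prognostic value. Second, features extracted from MRI with a three-dimensional ResNet-50 are fused with structured clinical variables, reduced to 12 components by principal component analysis, and classified with a linear support vector machine. Third, adding lesion-volume features further improves model stability and interpretability, yielding a framework that integrates multimodal information.

Link: https://arxiv.org/abs/2512.02088
Authors: Sina Raeisadigh, Myles Joshua Toledo Tan, Henning Müller, Abderrahmane Hedjoudje
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 pages, 5 figures, 2 tables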

Abstract:This study compares baseline (J0) and 24-hour (J1) diffusion magnetic resonance imaging (MRI) for predicting three-month functional outcomes after acute ischemic stroke (AIS). Seventy-four AIS patients with paired apparent diffusion coefficient (ADC) scans and clinical data were analyzed. Three-dimensional ResNet-50 embeddings were fused with structured clinical variables, reduced via principal component analysis (12 components), and classified using linear support vector machines with eight-fold stratified group cross-validation. J1 multimodal models achieved the highest predictive performance (AUC = 0.923 +/- 0.085), outperforming J0-based configurations (AUC = 0.86). Incorporating lesion-volume features further improved model stability and interpretability. These findings demonstrate that early post-treatment diffusion MRI provides superior prognostic value to pre-treatment imaging and that combining MRI, clinical, and lesion-volume features produces a robust and interpretable framework for predicting three-month functional outcomes in AIS patients.

Artificial Intelligence

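[AI-0] TokenPowerBench: Benchmarking the Power Consumption of LLM Inference AAAI’26

[Quick Read]: This paper addresses the difficulty of quantifying and analyzing energy use during large language model (LLM) inference. Existing benchmarks focus on training/fine-tuning or on inference performance and offer little support for measuring inference power consumption. The key to the solution is TokenPowerBench, described as the first lightweight and extensible benchmark for LLM-inference power consumption, with three core components: (i) a declarative configuration interface covering model choice, prompt set, and inference engine; (ii) a measurement layer that captures GPU-, node-, and system-level power without specialized power meters; and (iii) a phase-aligned metrics pipeline that attributes energy to the prefill and decode stages of every request. Users can thus assess how settings such as batch size, context length, parallelism strategy, and quantization affect joules per token and other energy-efficiency metrics, which gives a quantitative basis for forecasting power use, controlling cost, and meeting sustainability targets in deployment.

Link: https://arxiv.org/abs/2512.03024
Authors: Chenxu Niu, Wei Zhang, Jie Li, Yongjian Zhao, Tongyang Wang, Xi Wang, Yong Chen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Accepted by the AAAI’26 Conference Main Track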

Abstract:Large language model (LLM) services now answer billions of queries per day, and industry reports show that inference, not training, accounts for more than 90% of total power consumption. However, existing benchmarks focus on either training/fine-tuning or performance of inference and provide little support for power consumption measurement and analysis of inference. We introduce TokenPowerBench, the first lightweight and extensible benchmark designed for LLM-inference power consumption studies. The benchmark combines (i) a declarative configuration interface covering model choice, prompt set, and inference engine, (ii) a measurement layer that captures GPU-, node-, and system-level power without specialized power meters, and (iii) a phase-aligned metrics pipeline that attributes energy to the prefill and decode stages of every request. These elements make it straightforward to explore the power consumed by an LLM inference run; furthermore, by varying batch size, context length, parallelism strategy and quantization, users can quickly assess how each setting affects joules per token and other energy-efficiency metrics. We evaluate TokenPowerBench on four of the most widely used model series (Llama, Falcon, Qwen, and Mistral). Our experiments cover from 1 billion parameters up to the frontier-scale Llama3-405B model. Furthermore, we release TokenPowerBench as open source to help users measure power consumption, forecast operating expenses, and meet sustainability targets when deploying LLM services.
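
The phase-aligned metric can be illustrated with a small calculation. The sketch below is not TokenPowerBench's code or API. It assumes power is available as timestamped samples in watts and that the prefill and decode windows of a request are known; all function and parameter names are hypothetical. Energy is the integral of power over time, approximated here with the trapezoidal rule.

```python
def energy_joules(samples, start, end):
    """Trapezoidal integral of power (W) over [start, end] seconds.

    samples: time-sorted (timestamp_s, power_w) pairs. Only samples inside
    the window are used, so window edges should coincide with sample times.
    """
    window = [(t, p) for t, p in samples if start <= t <= end]
    return sum(
        (t1 - t0) * (p0 + p1) / 2.0
        for (t0, p0), (t1, p1) in zip(window, window[1:])
    )

def joules_per_token(samples, prefill, decode, prompt_tokens, output_tokens):
    """Attribute energy to the prefill and decode phases of one request."""
    prefill_j = energy_joules(samples, *prefill)
    decode_j = energy_joules(samples, *decode)
    return {
        "prefill_j_per_token": prefill_j / prompt_tokens,
        "decode_j_per_token": decode_j / output_tokens,
    }
```

With samples `[(0.0, 100.0), (1.0, 300.0), (2.0, 200.0), (3.0, 200.0)]`, a prefill window of `(0.0, 1.0)` over 100 prompt tokens, and a decode window of `(1.0, 3.0)` over 50 output tokens, the prefill phase uses 200 J and the decode phase 450 J.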

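[AI-1] Distribution-Calibrated Inference time compute for Thinking LLM-as-a-Judge

[Quick Read]: This paper addresses the noise in single-sample judgments when large language models (LLMs) act as pairwise preference judges, and in particular the inconsistency of common aggregation rules (majority vote, soft self-consistency, instruction-based self-aggregation) when ties are allowed. The key to the solution is an inference-time compute (ITC) strategy that generates n independent thinking-rating samples per item, combined with a distribution-calibrated aggregation method. A Bradley-Terry-Davidson model over rating counts captures three-way preferences, using both polarity (the margin among non-ties) and decisiveness (the non-tie rate) to distinguish narrow margins from strong consensus. Across several evaluation benchmarks the method reduces mean absolute error (MAE) and raises pairwise accuracy, and against human-consensus meta-labels it matches or exceeds individual human raters. This shows that careful ITC allocation with distribution-aware aggregation turns noisy individual judgments into reliable ratings.

Link: https://arxiv.org/abs/2512.03019
Authors: Hamid Dadkhahi, Firas Trabelsi, Parker Riley, Juraj Juraska, Mehdi Mirzazadeh
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: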

Abstract:Thinking Large Language Models (LLMs) used as judges for pairwise preferences remain noisy at the single-sample level, and common aggregation rules (majority vote, soft self-consistency, or instruction-based self-aggregation) are inconsistent when ties are allowed. We study inference-time compute (ITC) for evaluators that generate n independent thinking-rating samples per item, and propose a principled, distribution-calibrated aggregation scheme. Our method models three-way preferences with a Bradley-Terry-Davidson formulation on rating counts, leveraging both polarity (margin among non-ties) and decisiveness (non-tie rate) to distinguish narrow margins from strong consensus. Across various evaluation benchmarks, our approach consistently reduces MAE and increases pairwise accuracy versus standard baselines, and when evaluated against human-consensus meta-labels, matches or exceeds individual human raters. These results show that carefully allocating ITC and aggregating with distribution-aware methods turns noisy individual model judgments into reliable ratings for evaluation.
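
The two ingredients named in the abstract can be sketched briefly. The first function is the standard Davidson extension of the Bradley-Terry model, which adds a tie outcome controlled by a parameter `nu`. The second computes polarity and decisiveness from rating counts; the abstract defines them only in words, so the exact formulas here are assumptions, and neither function is the authors' full aggregation scheme.

```python
import math

def davidson_probs(strength_a, strength_b, nu):
    """Davidson's tie extension of Bradley-Terry: P(A wins), P(tie), P(B wins)."""
    tie_mass = nu * math.sqrt(strength_a * strength_b)
    total = strength_a + strength_b + tie_mass
    return strength_a / total, tie_mass / total, strength_b / total

def count_statistics(a_wins, ties, b_wins):
    """Polarity (margin among non-ties) and decisiveness (non-tie rate)."""
    n = a_wins + ties + b_wins
    non_ties = a_wins + b_wins
    # Assumed definition: signed margin normalized by the number of non-ties.
    polarity = (a_wins - b_wins) / non_ties if non_ties else 0.0
    return polarity, non_ties / n
```

Ten samples split 6/2/2 and ten samples split 2/7/1 both favor A by majority among non-ties, but their decisiveness differs (0.8 versus 0.3). That difference is the information a plain majority vote discards.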

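[AI-2] From Moderation to Mediation: Can LLMs Serve as Mediators in Online Flame Wars?

[Quick Read]: This paper examines whether large language models (LLMs) in online social interaction can move from being moderators that only detect harmful content to mediators that understand and de-escalate conflicts. The core challenge is having LLMs assess the fairness and emotional dynamics of a conversation and generate empathetic, de-escalatory interventions. The key to the solution is a two-stage framework: judgment, in which an LLM evaluates the fairness and emotional dynamics of a conversation, and steering, in which it generates empathetic messages that guide participants toward resolution. Using a large Reddit-based dataset and a multi-stage evaluation pipeline (principle-based scoring, user simulation, and human comparison), the study finds that API-based models outperform open-source models in both reasoning and intervention alignment, highlighting the promise and limitations of current LLMs as agents for online social mediation.

Link: https://arxiv.org/abs/2512.03005
Authors: Dawei Li, Abdullah Alnaibari, Arslan Bisharat, Manny Sandoval, Deborah Hall, Yasin Silva, Huan Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Under review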

Abstract:The rapid advancement of large language models (LLMs) has opened new possibilities for AI for good applications. As LLMs increasingly mediate online communication, their potential to foster empathy and constructive dialogue becomes an important frontier for responsible AI research. This work explores whether LLMs can serve not only as moderators that detect harmful content, but as mediators capable of understanding and de-escalating online conflicts. Our framework decomposes mediation into two subtasks: judgment, where an LLM evaluates the fairness and emotional dynamics of a conversation, and steering, where it generates empathetic, de-escalatory messages to guide participants toward resolution. To assess mediation quality, we construct a large Reddit-based dataset and propose a multi-stage evaluation pipeline combining principle-based scoring, user simulation, and human comparison. Experiments show that API-based models outperform open-source counterparts in both reasoning and intervention alignment when doing mediation. Our findings highlight both the promise and limitations of current LLMs as emerging agents for online social mediation.

[AI-3] Invasive Context Engineering to Control Large Language Models

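[Quick read]: This paper addresses the fact that large language models (LLMs) remain susceptible to abuse and jailbreak attacks in long-context settings, focusing on the security challenge that jailbreak probability increases with context length. The key to the solution is Invasive Context Engineering: inserting control sentences into the LLM's input context to strengthen robustness against malicious behavior without training. Because it does not depend on additional training data, the approach targets more reliable safety in long-context situations (the authors describe it as a partial solution), and it may generalize to the Chain-of-Thought process to prevent scheming.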

Link: https://arxiv.org/abs/2512.03001
Authors: Thomas Rivasseau
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 4 pages

Abstract:Current research on operator control of Large Language Models improves model robustness against adversarial attacks and misbehavior by training on preference examples, prompting, and input/output filtering. Despite good results, LLMs remain susceptible to abuse, and jailbreak probability increases with context length. There is a need for robust LLM security guarantees in long-context situations. We propose control sentences inserted into the LLM context as invasive context engineering to partially solve the problem. We suggest this technique can be generalized to the Chain-of-Thought process to prevent scheming. Invasive Context Engineering does not rely on LLM training, avoiding data shortage pitfalls which arise in training models for long context situations.

[AI-4] Lumos: Let there be Language Model System Certification

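[Quick read]: This paper addresses the lack of a systematic framework for specifying and formally certifying Language Model System (LMS) behaviors, which makes safety hard to verify in complex settings such as vision-language models for autonomous driving. The key to the solution is Lumos, the first principled framework of this kind: an imperative probabilistic programming domain-specific language (DSL) over graphs that gives a structured representation of prompt distributions, generates independent and identically distributed (i.i.d.) prompts, and integrates with statistical certifiers to certify LMS behavior under arbitrary prompt distributions. Lumos has hybrid (operational and denotational) semantics, can encode existing complex relational and temporal specifications, and is used to write the first safety specifications for vision-language models (VLMs) in autonomous driving. Experiments with these specifications show that the state-of-the-art VLM Qwen-VL produces incorrect and unsafe responses with at least 90% probability in right-turn scenarios under rainy conditions, which highlights the value of modular, extensible certification in an evolving threat landscape.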

Link: https://arxiv.org/abs/2512.02966
Authors: Isha Chaudhary,Vedaant Jain,Avaljot Singh,Kavya Sachdeva,Sayan Ranu,Gagandeep Singh
Affiliation: unknown
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:We introduce the first principled framework, Lumos, for specifying and formally certifying Language Model System (LMS) behaviors. Lumos is an imperative probabilistic programming DSL over graphs, with constructs to generate independent and identically distributed prompts for LMS. It offers a structured view of prompt distributions via graphs, forming random prompts from sampled subgraphs. Lumos supports certifying LMS for arbitrary prompt distributions via integration with statistical certifiers. We provide hybrid (operational and denotational) semantics for Lumos, providing a rigorous way to interpret the specifications. Using only a small set of composable constructs, Lumos can encode existing LMS specifications, including complex relational and temporal specifications. It also facilitates specifying new properties - we present the first safety specifications for vision-language models (VLMs) in autonomous driving scenarios developed with Lumos. Using these, we show that the state-of-the-art VLM Qwen-VL exhibits critical safety failures, producing incorrect and unsafe responses with at least 90% probability in right-turn scenarios under rainy driving conditions, revealing substantial safety risks. Lumos’s modular structure allows easy modification of the specifications, enabling LMS certification to stay abreast with the rapidly evolving threat landscape. We further demonstrate that specification programs written in Lumos enable finding specific failure cases exhibited by state-of-the-art LMS. Lumos is the first systematic and extensible language-based framework for specifying and certifying LMS behaviors, paving the way for a wider adoption of LMS certification.

[AI-5] In Silico Development of Psychometric Scales: Feasibility of Representative Population Data Simulation with LLMs

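[Quick read]: This paper addresses the heavy demands of psychometric scale development and validation: large real samples, multiple testing rounds, and substantial resources. The key to the solution is using large language models (LLMs) to generate simulated participant data by having the model answer scale items while impersonating individuals with specific demographic profiles, which enables in silico piloting. The study finds that LLM-generated data reproduce latent structures and measurement properties at the group level (for example, configural and metric invariance), making them suitable for early-stage construct exploration and preliminary validation. However, they differ substantially from real data in score distributions and variances, so they do not approximate individual-level properties and cannot replace real data for final validation. The value of LLM-simulated data therefore lies in group-level psychometric prototyping rather than rigorous individual-level validation.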

Link: https://arxiv.org/abs/2512.02910
Authors: Enrico Cipriani,Pavel Okopnyi,Danilo Menicucci,Simone Grassini
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Developing and validating psychometric scales requires large samples, multiple testing phases, and substantial resources. Recent advances in Large Language Models (LLMs) enable the generation of synthetic participant data by prompting models to answer items while impersonating individuals of specific demographic profiles, potentially allowing in silico piloting before real data collection. Across four preregistered studies (N = circa 300 each), we tested whether LLM-simulated datasets can reproduce the latent structures and measurement properties of human responses. In Studies 1-2, we compared LLM-generated data with real datasets for two validated scales; in Studies 3-4, we created new scales using EFA on simulated data and then examined whether these structures generalized to newly collected human samples. Simulated datasets replicated the intended factor structures in three of four studies and showed consistent configural and metric invariance, with scalar invariance achieved for the two newly developed scales. However, correlation-based tests revealed substantial differences between real and synthetic datasets, and notable discrepancies appeared in score distributions and variances. Thus, while LLMs capture group-level latent structures, they do not approximate individual-level data properties. Simulated datasets also showed full internal invariance across gender. Overall, LLM-generated data appear useful for early-stage, group-level psychometric prototyping, but not as substitutes for individual-level validation. We discuss methodological limitations, risks of bias and data pollution, and ethical considerations related to in silico psychometric simulations.

[AI-6] VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling

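[Quick read]: This paper addresses the sharp performance drop of Vision-Language-Action (VLA) models under novel camera viewpoints and visual perturbations, which it attributes to misalignment in Spatial Modeling rather than deficiencies in Physical Modeling. The key to the solution is recalibrating visual representations through lightweight, learnable updates. Feature Token Modulation (FTM) applies a global affine transformation to the visual tokens and raises Libero viewpoint accuracy from 48.5% to 87.1% with only 4K parameters. Building on it, Feature Linear Adaptation (FLA) applies low-rank updates to the ViT encoder and reaches a 90.8% success rate with 4.7M parameters, matching LoRA-scale finetuning at lower cost. The results indicate that pretrained VLA models hold substantial untapped robustness and that targeted, minimal visual adaptation is enough to restore viewpoint generalization.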

Link: https://arxiv.org/abs/2512.02902
Authors: Weiqi Li,Quande Zhang,Ruifeng Zhai,Liang Lin,Guangrun Wang
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Vision-language-action (VLA) models achieve strong in-distribution performance but degrade sharply under novel camera viewpoints and visual perturbations. We show that this brittleness primarily arises from misalignment in Spatial Modeling, rather than Physical Modeling. To address this, we propose a one-shot adaptation framework that recalibrates visual representations through lightweight, learnable updates. Our first method, Feature Token Modulation (FTM), applies a global affine transformation to visual tokens and improves Libero viewpoint accuracy from 48.5% to 87.1% with only 4K parameters. Building on this, Feature Linear Adaptation (FLA) introduces low-rank updates to the ViT encoder, achieving 90.8% success with 4.7M parameters – matching LoRA-scale finetuning at far lower cost. Together, these results reveal substantial untapped robustness in pretrained VLA models and demonstrate that targeted, minimal visual adaptation is sufficient to restore viewpoint generalization.
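As a rough illustration of what a global affine transformation on visual tokens could look like, the sketch below scales and shifts every token with one shared per-channel vector pair. The shapes, the identity initialization, and the function name are assumptions made for illustration; the abstract does not specify FTM at this level of detail.

```python
import numpy as np

def feature_token_modulation(tokens, gamma, beta):
    """Apply one shared affine map to all visual tokens.

    tokens: (num_tokens, dim) array from a frozen vision encoder.
    gamma, beta: (dim,) learnable scale and shift (hypothetical shapes).
    """
    return np.asarray(tokens, dtype=float) * gamma + beta

dim = 8                              # hypothetical token width
gamma = np.ones(dim)                 # identity initialization (assumption)
beta = np.zeros(dim)
tokens = np.random.default_rng(0).normal(size=(16, dim))
modulated = feature_token_modulation(tokens, gamma, beta)
num_params = gamma.size + beta.size  # 2 * dim trainable values
```

Only `gamma` and `beta` would be trained in such a scheme, so the parameter count grows with the token width rather than with the size of the backbone.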

[AI-7] FAIRY2I: Universal Extremely-Low Bit QAT framework via Widely-Linear Representation and Phase-Aware Quantization

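[Quick read]: This paper addresses the performance degradation of large language models (LLMs) under extremely low-bit quantization while avoiding the cost and ecosystem fragmentation of training complex-valued LLMs from scratch. The core challenge is converting pre-trained real-valued foundation models into a complex-valued form that supports ultra-low-precision inference while preserving as much performance as possible. The key to the solution is the Fairy2i framework: it proves a lossless mathematical equivalence between real-valued maps and widely-linear complex maps, converts standard Transformer layers into the complex domain, and applies a phase-aware quantization scheme based on the fourth roots of unity together with recursive residual quantization. This reuses existing checkpoints instead of training from scratch and yields near full-precision performance at an effective 2-bit precision, significantly outperforming existing real-valued binary and ternary quantization methods.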

Link: https://arxiv.org/abs/2512.02901
Authors: Feiyu Wang,Xinyu Tan,Bokai Huang,Yihao Zhang,Guoan Wang,Peizhuang Cong,Tong Yang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 3 figures

Abstract:Large language models (LLMs) have revolutionized artificial intelligence, yet their massive memory and computational demands necessitate aggressive quantization, increasingly pushing representations toward the theoretical limit of a single bit. While complex-valued LLMs, such as iFairy, offer a superior chance for low-bit representation compared to real-valued counterparts, they require training from scratch, preventing the utilization of the vast ecosystem of pre-trained real-valued foundation models. Here we present Fairy2i, a universal framework that transforms pre-trained real-valued layers into an equivalent widely-linear complex form, enabling extremely low-bit quantization while reusing existing checkpoints. By proving a lossless mathematical equivalence between real and widely-linear maps, we convert standard Transformers into the complex domain and employ a phase-aware quantization scheme with a highly efficient codebook of fourth roots of unity. Furthermore, we introduce a recursive residual quantization mechanism that iteratively minimizes quantization error, allowing inference to proceed via efficient multiplication-free accumulation. We demonstrate that Fairy2i restores the performance of LLaMA-2 7B at an effective 2-bit precision to levels nearly comparable with full-precision baselines, significantly outperforming state-of-the-art real-valued binary and ternary quantization methods. This work bridges the gap between the representational efficiency of complex-valued arithmetic and the practical utility of pre-trained models, paving a new way for efficient inference on commodity hardware.
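The equivalence between a real linear map and a widely-linear complex map can be checked numerically. The sketch below assumes one particular pairing convention (the first half of each real vector is the real part, the second half the imaginary part) and a plain nearest-phase rounding to the fourth roots of unity; both choices and all function names are illustrative assumptions, not the Fairy2i implementation.

```python
import numpy as np

def real_to_widely_linear(W):
    """Rewrite y = W x (real, even dims) as y_c = A z + B conj(z)."""
    m, n = W.shape[0] // 2, W.shape[1] // 2
    W11, W12 = W[:m, :n], W[:m, n:]
    W21, W22 = W[m:, :n], W[m:, n:]
    A = 0.5 * ((W11 + W22) + 1j * (W21 - W12))
    B = 0.5 * ((W11 - W22) + 1j * (W21 + W12))
    return A, B

def quantize_phase(C):
    """Snap each complex entry to the nearest of {1, i, -1, -i} by angle."""
    k = np.round(np.angle(C) / (np.pi / 2)).astype(int) % 4
    codebook = np.array([1, 1j, -1, -1j])
    return codebook[k]

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
x = rng.normal(size=6)

A, B = real_to_widely_linear(W)
z = x[:3] + 1j * x[3:]                # pack the real input as a complex vector
y_complex = A @ z + B @ np.conj(z)    # widely-linear form
y_real = W @ x                        # original real layer
A_q = quantize_phase(A)               # phase-only codes; a scale would be kept separately
```

The first part is exact up to floating-point error, which is why existing real-valued checkpoints can in principle be carried over; the quantization step is where the approximation enters.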

[AI-8] Model-Based Diagnosis with Multiple Observations: A Unified Approach for C Software and Boolean Circuits

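[Quick read]: This paper addresses the accuracy and efficiency of fault localisation (FL) in software and Boolean circuits with multiple faults. Existing formula-based fault localisation (FBFL) methods often cannot guarantee diagnoses that are consistent across all failing test cases, or they produce redundant, non-minimal diagnoses. The key to the solution is the CFaults tool, which uses Model-Based Diagnosis (MBD) to encode multiple observations (all failing test cases) into a single Maximum Satisfiability (MaxSAT) formula, ensuring consistency across observations and simplifying the diagnosis procedure. CFaults produces only subset-minimal diagnoses, avoiding the redundant diagnoses enumerated by approaches such as BugAssist and SNIPER. On the C program benchmarks it is faster than other FBFL approaches; on the ISCAS85 Boolean circuits it is generally slower than HSD but remains competitive.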

Link: https://arxiv.org/abs/2512.02898
Authors: Pedro Orvalho,Marta Kwiatkowska,Mikoláš Janota,Vasco Manquinho
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Symbolic Computation (cs.SC)
Comments: 50 pages, 9 figures, 6 tables, 5 listings

Abstract:Debugging is one of the most time-consuming and expensive tasks in software development and circuit design. Several formula-based fault localisation (FBFL) methods have been proposed, but they fail to guarantee a set of diagnoses across all failing tests or may produce redundant diagnoses that are not subset-minimal, particularly for programs/circuits with multiple faults. This paper introduces CFaults, a novel fault localisation tool for C software and Boolean circuits with multiple faults. CFaults leverages Model-Based Diagnosis (MBD) with multiple observations and aggregates all failing test cases into a unified Maximum Satisfiability (MaxSAT) formula. Consequently, our method guarantees consistency across observations and simplifies the fault localisation procedure. Experimental results on three benchmark sets, two of C programs, TCAS and C-Pack-IPAs, and one of Boolean circuits, ISCAS85, show that CFaults is faster at localising faults in C software than other FBFL approaches such as BugAssist, SNIPER, and HSD. On the ISCAS85 benchmark, CFaults is generally slower than HSD; however, it localises faults in only 6% fewer circuits, demonstrating that it remains competitive in this domain. Furthermore, CFaults produces only subset-minimal diagnoses of faulty statements, whereas the other approaches tend to enumerate redundant diagnoses (e.g., BugAssist and SNIPER).
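To make "subset-minimal diagnoses consistent with all observations" concrete, here is a brute-force sketch on a hypothetical three-gate circuit. It enumerates candidate fault sets by size instead of solving a MaxSAT formula, so it only illustrates the definition and is not how CFaults works; the circuit, gate names, and observations are invented for the example.

```python
from itertools import combinations, product

# Hypothetical circuit: x = a AND b, y = x OR c, out = NOT y.
GATES = {
    "g1": (lambda v: v["a"] & v["b"], "x"),
    "g2": (lambda v: v["x"] | v["c"], "y"),
    "g3": (lambda v: 1 - v["y"], "out"),
}

def consistent(faulty, obs):
    """True if some behaviour of the faulty gates reproduces the observed output."""
    names = sorted(faulty)
    for guess in product((0, 1), repeat=len(names)):
        forced = dict(zip(names, guess))   # a faulty gate may output anything
        v = {k: obs[k] for k in ("a", "b", "c")}
        for name, (fn, wire) in GATES.items():
            v[wire] = forced[name] if name in forced else fn(v)
        if v["out"] == obs["out"]:
            return True
    return False

def minimal_diagnoses(observations):
    """Enumerate fault sets by size; keep those consistent with ALL observations."""
    found = []
    for size in range(len(GATES) + 1):
        for combo in combinations(sorted(GATES), size):
            cand = frozenset(combo)
            if any(d <= cand for d in found):
                continue                   # a smaller diagnosis already covers it
            if all(consistent(cand, o) for o in observations):
                found.append(cand)
    return found

failing_tests = [
    {"a": 1, "b": 1, "c": 0, "out": 1},    # a healthy circuit would output 0
    {"a": 0, "b": 0, "c": 1, "out": 1},    # a healthy circuit would output 0
]
diagnoses = minimal_diagnoses(failing_tests)
```

Taken alone, the first failing test admits `g1` as a diagnosis, but the second rules it out; aggregating observations is what removes such inconsistent candidates.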

[AI-9] The future of AI in critical mineral exploration

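[Quick read]: This paper addresses the long-running decline in new critical mineral discoveries: despite increased investment, discoveries have decreased over the last two decades. The core of the solution is to use artificial intelligence (AI) as the enabler of a rigorous scientific method that reduces cognitive bias and false positives and lowers exploration cost. The key innovation is a scientific method grounded in Bayesianism and falsification: data acquisition is reframed as a means to falsify human-generated geological hypotheses, and decisions about what data to acquire next are quantified with verifiable metrics and based on rational decision making. The method relies on two kinds of AI: novel unsupervised learning methods that work with domain experts to understand data and generate multiple competing geological hypotheses, and human-in-the-loop AI algorithms that optimally plan geological, geophysical, geochemical, and drilling data acquisition, reducing uncertainty about geological hypotheses before uncertainty about grade and tonnage.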

Link: https://arxiv.org/abs/2512.02879
Authors: Jef Caers
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The energy transition through increased electrification has put the world's attention on critical mineral exploration. Even with increased investments, a decrease in new discoveries has taken place over the last two decades. Here I propose a solution to this problem where AI is implemented as the enabler of a rigorous scientific method for mineral exploration that aims to reduce cognitive bias and false positives and drive down the cost of exploration. I propose a new scientific method that is based on a philosophical approach founded on the principles of Bayesianism and falsification. In this approach, data acquisition is in the first place seen as a means to falsify human-generated hypotheses. The decision of what data to acquire next is quantified with verifiable metrics and based on rational decision making. A practical protocol is provided that can be used as a template in any exploration campaign. However, in order to make this protocol practical, various forms of artificial intelligence are needed. I will argue that the most important forms are (one) novel unsupervised learning methods that collaborate with domain experts to better understand data and generate multiple competing geological hypotheses, and (two) human-in-the-loop AI algorithms that can optimally plan various geological, geophysical, geochemical, and drilling data acquisition, where uncertainty reduction of geological hypotheses precedes the uncertainty reduction on grade and tonnage.

[AI-10] GraphMatch: Fusing Language and Graph Representations in a Dynamic Two-Sided Work Marketplace

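[Quick read]: This paper addresses match recommendation in text-rich, dynamic two-sided marketplaces (such as the Upwork labor platform), where the core challenge is modeling both evolving text semantics and the time-sensitive structure of the interaction graph. The key to the solution is the GraphMatch framework, which fuses pre-trained language models (PLMs) with graph neural networks (GNNs) and uses adversarial negative sampling and point-in-time subgraph training to learn representations that capture fine-grained text semantics and dynamic graph structure, targeting accurate, low-latency real-time recommendation at large scale.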

Link: https://arxiv.org/abs/2512.02849
Authors: Mikołaj Sacha,Hammad Jafri,Mattie Terzolo,Ayan Sinha,Andrew Rabinovich
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Recommending matches in a text-rich, dynamic two-sided marketplace presents unique challenges due to evolving content and interaction graphs. We introduce GraphMatch, a new large-scale recommendation framework that fuses pre-trained language models with graph neural networks to overcome these challenges. Unlike prior approaches centered on standalone models, GraphMatch is a comprehensive recipe built on powerful text encoders and GNNs working in tandem. It employs adversarial negative sampling alongside point-in-time subgraph training to learn representations that capture both the fine-grained semantics of evolving text and the time-sensitive structure of the graph. We evaluated extensively on interaction data from Upwork, a leading labor marketplace, at large scale, and discuss our approach towards low-latency inference suitable for real-time use. In our experiments, GraphMatch outperforms language-only and graph-only baselines on matching tasks while being efficient at runtime. These results demonstrate that unifying language and graph representations yields a highly effective solution to text-rich, dynamic two-sided recommendations, bridging the gap between powerful pretrained LMs and large-scale graphs in practice.

[AI-11] Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach

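[Quick read]: This paper addresses inference-time instability of pre-trained Vision-Language-Action (VLA) models after finetuning on downstream tasks: redundant action modes in the training data make the model sensitive to the sampled noise at inference, causing distribution shift and lower task success rates. The key to the solution is TACO, a test-time scaling (TTS) framework whose core is a lightweight pseudo-count estimator that verifies sampled action chunks at inference time and executes the chunk with the maximum pseudo-count, suppressing distribution shift while preserving the VLA's generalization ability. The method needs no gradient updates and is computationally efficient, which suits flow-matching or diffusion-based VLA models that are hard to update with reinforcement learning (RL).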

Link: https://arxiv.org/abs/2512.02834
Authors: Siyuan Yang,Yang Zhang,Haoran He,Ling Pan,Xiu Li,Chenjia Bai,Xuelong Li
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: The first two authors contributed equally. Yang Zhang leads the whole project

Abstract:Vision-Language-Action (VLA) models, trained via flow-matching or diffusion objectives, excel at learning complex behaviors from large-scale, multi-modal datasets (e.g., human teleoperation, scripted policies). However, since VLAs incorporate diverse data modes in the pre-training stage, and the finetuning dataset often contains demonstration data collected in a kinematically suboptimal or undesirable way, there exist redundant action modes that are irrelevant to the success action modes of the downstream task. Specifically, we observe a critical inference-time fragility among various sampled noises after supervised finetuning of pre-trained VLAs. In this paper, we attribute this instability to the distribution shift between the VLA policy and the policy induced by stable success modes of the downstream task dataset. Thus, we propose TACO, a test-time-scaling (TTS) framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. The VLA models integrated with TACO can execute the actions with maximum pseudo-count from all sampled action chunks, thereby preventing distribution shifts while preserving the generalization ability of VLAs since the constraint is applied only during inference. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and being gradient-free, it incurs significant computational benefits compared to RL updates, especially for flow or diffusion-based VLAs, for which RL updates are difficult due to the denoising process. Extensive experiments across four simulation benchmarks (RoboTwin2.0, Robotwin, LIBERO, SimplerEnv) and a dual-arm platform demonstrate that our method significantly improves the inference stability and success rates in downstream-task adaptations.

[AI-12] A Comparative Study on How Data Normalization Affects Zero-Shot Generalization in Time Series Foundation Models

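[Quick read]: This paper addresses the loss of generalization in Time-Series Foundation Models (TSFMs) caused by scale variation and non-stationarity in the input data. Normalization is well studied for dataset-specific time-series models but overlooked for TSFMs, where cross-domain generalization matters more. The key to the solution is a systematic evaluation of normalization methods, which finds REVIN (Reversible Instance Normalization) to be the most effective: in zero-shot forecasting it reduces MASE (mean absolute scaled error) by 89% relative to an un-normalized baseline and by 44% relative to other normalization methods, while matching the best in-domain accuracy (0.84 MASE) without any dataset-level preprocessing, giving the best accuracy-efficiency trade-off. Its effectiveness depends on architectural design choices and the optimization objective, particularly training loss scale sensitivity and model type (probabilistic, point-forecast, or LLM-based).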

Link: https://arxiv.org/abs/2512.02833
Authors: Ihab Ahmed,Denis Krompaß,Cheng Feng,Volker Tresp
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We investigate input normalization methods for Time-Series Foundation Models (TSFMs). While normalization is well-studied in dataset-specific time-series models, it remains overlooked in TSFMs where generalization is critical. Time-series data, unlike text or images, exhibits significant scale variation across domains and channels, which, coupled with non-stationarity, can undermine TSFM performance regardless of architectural complexity. Through systematic evaluation across four architecturally diverse TSFMs, we empirically establish REVIN as the most efficient approach, reducing zero-shot MASE by 89% relative to an un-normalized baseline and by 44% versus other normalization methods, while matching the best in-domain accuracy (0.84 MASE) without any dataset-level preprocessing – yielding the highest accuracy-efficiency trade-off. Yet its effective utilization depends on architectural design choices and optimization objective, particularly with respect to training loss scale sensitivity and model type (probabilistic, point-forecast, or LLM-based models).
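The idea behind reversible instance normalization can be sketched in a few lines: each input window is standardized with its own statistics, the model forecasts in the normalized space, and the same statistics are used to map the forecast back. The sketch below omits the learnable affine parameters of the original RevIN layer and uses a hypothetical stand-in forecaster, so it illustrates the mechanism only.

```python
import numpy as np

def revin_forecast(series, model, eps=1e-5):
    """Normalize one window, forecast, then undo the normalization.

    model is any callable mapping a normalized window to a normalized
    forecast (a hypothetical stand-in for a time-series foundation model).
    """
    x = np.asarray(series, dtype=float)
    mean = x.mean()
    std = np.sqrt(x.var() + eps)
    z = (x - mean) / std                 # instance-level statistics only
    z_hat = np.asarray(model(z), dtype=float)
    return z_hat * std + mean            # reverse step restores the scale

def repeat_last(z, horizon=3):
    """Toy forecaster: repeat the last normalized value."""
    return np.repeat(z[-1], horizon)

small = revin_forecast([10.0, 20.0, 30.0], repeat_last)
large = revin_forecast([1000.0, 2000.0, 3000.0], repeat_last)
```

Because the statistics come from the window itself, the forecaster sees inputs on a similar scale regardless of the domain, and no dataset-level preprocessing is needed.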

[AI-13] From Navigation to Refinement: Revealing the Two-Stage Nature of Flow-based Diffusion Models through Oracle Velocity

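[Quick read]: This paper addresses the poorly understood memorization-generalization behavior of flow-based diffusion models, in particular how an underlying two-stage learning mechanism affects generation quality and generalization. The key to the solution is revisiting the flow matching (FM) objective: the marginal velocity field admits a closed-form expression, which allows exact computation of the oracle FM target. The analysis shows that FM inherently has a two-stage training target: an early stage guided by a mixture of data modes, which forms global layouts (generalization), and a later stage dominated by the nearest data sample, which memorizes fine details (memorization). This explains the effectiveness of practical techniques such as timestep-shifted schedules, classifier-free guidance intervals, and latent space design choices, and offers principles for future architectural and algorithmic improvements.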

Link: https://arxiv.org/abs/2512.02826
Authors: Haoming Liu,Jinnuo Liu,Yanhao Li,Liuyang Bai,Yunkai Ji,Yuanhe Guo,Shenji Wan,Hongyi Wen
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint version; 15 pages, 16 figures

Abstract:Flow-based diffusion models have emerged as a leading paradigm for training generative models across images and videos. However, their memorization-generalization behavior remains poorly understood. In this work, we revisit the flow matching (FM) objective and study its marginal velocity field, which admits a closed-form expression, allowing exact computation of the oracle FM target. Analyzing this oracle velocity field reveals that flow-based diffusion models inherently formulate a two-stage training target: an early stage guided by a mixture of data modes, and a later stage dominated by the nearest data sample. The two-stage objective leads to distinct learning behaviors: the early navigation stage generalizes across data modes to form global layouts, whereas the later refinement stage increasingly memorizes fine-grained details. Leveraging these insights, we explain the effectiveness of practical techniques such as timestep-shifted schedules, classifier-free guidance intervals, and latent space design choices. Our study deepens the understanding of diffusion model training dynamics and offers principles for guiding future architectural and algorithmic improvements.
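The two-stage behaviour can be reproduced on a toy dataset. The sketch below assumes the common linear interpolation path x_t = (1 - t) x_0 + t x_1 with standard Gaussian noise x_0 and a finite training set; under that assumption the marginal velocity is a softmax-weighted average of the per-sample conditional velocities. The convention and function name are assumptions and may differ from the paper's notation.

```python
import numpy as np

def oracle_velocity(x, t, data):
    """Closed-form marginal FM velocity for a finite dataset (requires t < 1).

    Assumes x_t = (1 - t) * x_0 + t * x_1 with x_0 ~ N(0, I) and x_1 drawn
    uniformly from data, so p_t(x | x_1) = N(t * x_1, (1 - t)^2 I).
    """
    x = np.asarray(x, dtype=float)
    data = np.asarray(data, dtype=float)
    sq_dist = np.sum((x - t * data) ** 2, axis=1)
    logits = -sq_dist / (2.0 * (1.0 - t) ** 2)
    weights = np.exp(logits - logits.max())   # numerically stable softmax
    weights /= weights.sum()
    velocity = (weights @ data - x) / (1.0 - t)
    return velocity, weights

data = [[-1.0, 0.0], [3.0, 0.0]]
v_early, w_early = oracle_velocity([0.0, 0.0], 0.0, data)   # navigation stage
v_late, w_late = oracle_velocity([2.9, 0.0], 0.99, data)    # refinement stage
```

At t = 0 the weights are uniform, so the target points toward the mean of the data; near t = 1 the weights collapse onto the nearest sample, which matches the navigation and refinement stages described in the abstract.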

[AI-14] Radiologist Copilot: An Agentic Assistant with Orchestrated Tools for Radiology Reporting with Quality Control

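[Quick read]: This paper addresses the inefficiency, error-proneness, and lack of systematic quality control in radiology reporting, especially for clinical examinations with volumetric medical images. Existing automated methods focus on report generation and neglect quality control, which limits the support they offer radiologists. The key to the solution is Radiologist Copilot, an agentic AI assistant with orchestrated tools that uses large language models (LLMs) as its reasoning backbone to select tools, plan, and execute actions autonomously, emulating a radiologist across the full reporting process. Its tools cover region localization, region analysis planning directed by the think-with-image paradigm, strategic template selection for report generation, and quality assessment with feedback-driven adaptive refinement, aiming at accurate, complete, and efficient reporting with quality assurance.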

Link: https://arxiv.org/abs/2512.02814
Authors: Yongrui Yu,Zhongzhen Huang,Linjie Mu,Shaoting Zhang,Xiaofan Zhang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Radiology reporting is an essential yet time-consuming and error-prone task for radiologists in clinical examinations, especially for volumetric medical images. Rigorous quality control is also critical but tedious, ensuring that the final report meets clinical standards. Existing automated approaches, including radiology report generation methods and medical vision-language models, focus mainly on the report generation phase and neglect the crucial quality control procedure, limiting their capability to provide comprehensive support to radiologists. We propose Radiologist Copilot, an agentic AI assistant equipped with orchestrated tools designed for automated radiology reporting with quality control. Leveraging large language models as the reasoning backbone, the agentic system autonomously selects tools, plans, and executes actions, emulating the behavior of radiologists throughout the holistic radiology reporting process. The orchestrated tools include region localization, think with image paradigm directed region analysis planning, strategic template selection for report generation, quality assessment and feedback-driven adaptive refinement for quality control. Therefore, Radiologist Copilot facilitates accurate, complete, and efficient radiology reporting, assisting radiologists and improving clinical efficiency. Experimental results demonstrate that Radiologist Copilot significantly surpasses other state-of-the-art methods in radiology reporting. The source code will be released upon acceptance.

[AI-15] Enhancing Automated Paper Reproduction via Prompt-Free Collaborative Agents

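[Quick read]: This paper addresses the lack of effective verification and adaptive refinement in existing automated paper-to-code frameworks, in particular the poor scalability and adaptability of self-refinement that depends on manually designed prompts. The key to the solution is a prompt-free collaborative agent framework in which a verification agent checks whether the output of each step satisfies the requirements in the corresponding system prompt, and a refinement agent revises the output based on the issues found. The whole process relies only on the original system prompts, with no hand-written refinement prompts, and improves the accuracy and completeness of the reproduced code.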

Link: https://arxiv.org/abs/2512.02812
Authors: Zijie Lin,Qilin Cai,Liang Shen,Mingjun Xiao
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Automated paper reproduction has emerged as a promising approach to accelerate scientific research, employing multi-step workflow frameworks to systematically convert academic papers into executable code. However, existing frameworks often lack mechanisms to verify and refine the outputs at each generation step, or rely heavily on manually designed prompts for self-refinement, which limits their adaptability and scalability. To address these limitations, we propose a prompt-free collaborative agent framework that automatically enhances the quality of paper-to-code generation. Our approach employs two collaborative agents: a verification agent that examines whether the outputs at each step satisfy the requirements specified in the corresponding system prompt, and a refinement agent that revises the outputs based on the identified issues. Unlike previous methods that require human experts to craft specific refinement prompts for each step, our framework achieves automatic verification and improvement by leveraging only the original system prompts. We integrate our collaborative agents into the Paper2Code framework and conduct comprehensive experiments on PaperBench Code-Dev and Paper2CodeBench datasets. Experimental results demonstrate that our approach significantly improves the accuracy and completeness of reproduced code, achieving performance gains of approximately 15% and 13%, respectively, compared to the baseline without our agents. Furthermore, comparative experiments against Self-Refine validate the robustness and consistency of our prompt-free approach across different datasets.

[AI-16] Phase-Adaptive LLM Framework with Multi-Stage Validation for Construction Robot Task Allocation: A Systematic Benchmark Against Traditional Optimization Algorithms

Link: https://arxiv.org/abs/2512.02810
Authors: Shyam prasad reddy Kaitha,Hongrui Yu
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

[AI-17] Perception of AI-Generated Music - The Role of Composer Identity, Personality Traits, Music Preferences, and Perceived Humanness

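[Quick read]: This paper asks whether listeners are biased in how they perceive and evaluate AI-generated music in the era of generative AI, and how such bias is shaped by composer information, listener attitudes, personality traits, and music-related variables. The key to the solution is a mixed-method design combining quantitative analysis with qualitative thematic analysis. Attitudes toward AI are the strongest predictor, and ethical, cultural, and contextual considerations emerge as important criteria in listeners' judgments of AI-generated music, which points to key factors and methodological considerations for future research on music perception in human-AI interaction.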

Link: https://arxiv.org/abs/2512.02785
Authors: David Stammer,Hannah Strauss,Peter Knees
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Under review at Computers in Human Behaviour Reports

Abstract:The rapid rise of AI-generated art has sparked debate about potential biases in how audiences perceive and evaluate such works. This study investigates how composer information and listener characteristics shape the perception of AI-generated music, adopting a mixed-method approach. Using a diverse set of stimuli across various genres from two AI music models, we examine effects of perceived authorship on liking and emotional responses, and explore how attitudes toward AI, personality traits, and music-related variables influence evaluations. We further assess the influence of perceived humanness and analyze open-ended responses to uncover listener criteria for judging AI-generated music. Attitudes toward AI proved to be the best predictor of both liking and emotional intensity of AI-generated music. This quantitative finding was complemented by qualitative themes from our thematic analysis, which identified ethical, cultural, and contextual considerations as important criteria in listeners’ evaluations of AI-generated music. Our results offer a nuanced view of how people experience music created by AI tools and point to key factors and methodological considerations for future research on music perception in human-AI interaction.

[AI-18] A Framework for Causal Concept-based Model Explanations

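[Quick read]: This paper addresses the lack of transparency and trustworthiness of non-interpretable models in practice, that is, how to produce explanations that are both understandable and faithful to the model being explained. The key to the solution is a framework for causal concept-based post-hoc Explainable Artificial Intelligence (XAI) that generates local and global explanations by calculating the probability of sufficiency of concept interventions. Understandability comes from a clear concept-based vocabulary, and fidelity is addressed by making the framework's assumptions explicit, stressing that the context in which an explanation is interpreted must align with the context in which it was generated.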

Link: https://arxiv.org/abs/2512.02735
Authors: Anna Rodum Bjøru,Jacob Lysnæs-Larsen,Oskar Jørgensen,Inga Strümke,Helge Langseth
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:This work presents a conceptual framework for causal concept-based post-hoc Explainable Artificial Intelligence (XAI), based on the requirements that explanations for non-interpretable models should be understandable as well as faithful to the model being explained. Local and global explanations are generated by calculating the probability of sufficiency of concept interventions. Example explanations are presented, generated with a proof-of-concept model made to explain classifiers trained on the CelebA dataset. Understandability is demonstrated through a clear concept-based vocabulary, subject to an implicit causal interpretation. Fidelity is addressed by highlighting important framework assumptions, stressing that the context of explanation interpretation must align with the context of explanation generation.

[AI-19] Self-Improving AI Agents through Self-Play

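[Quick read]: This paper addresses how to formalize an agent's self-improvement within a dynamical-systems framework, extending the moduli-theoretic framework of psychometric batteries to dynamical systems. The core challenge is to quantify and guarantee sustained self-improvement under resource constraints without instability from accumulated noise. The key to the solution is a recursive Generator-Verifier-Updater (GVU) operator, which defines a vector field on the parameter manifold \Theta; the coefficient of self-improvement \kappa is identified as the Lie derivative of the capability functional along the resulting flow. The paper then derives the Variance Inequality, a spectral condition that is sufficient (under mild regularity) for stable self-improvement: \kappa > 0 holds when, up to curvature and step-size effects, the combined noise of generation and verification is small enough. The framework unifies Language Self-Play (LSP), Self-Correction, and Synthetic Data bootstrapping as different topological realizations of the GVU operator that satisfy the Variance Inequality.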

Link: https://arxiv.org/abs/2512.02731
Authors: Przemyslaw Chojecki
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We extend the moduli-theoretic framework of psychometric batteries to the domain of dynamical systems. While previous work established the AAI capability score as a static functional on the space of agent representations, this paper formalizes the agent as a flow \nu_r parameterized by computational resource r , governed by a recursive Generator-Verifier-Updater (GVU) operator. We prove that this operator generates a vector field on the parameter manifold \Theta , and we identify the coefficient of self-improvement \kappa as the Lie derivative of the capability functional along this flow. The central contribution of this work is the derivation of the Variance Inequality, a spectral condition that is sufficient (under mild regularity) for the stability of self-improvement. We show that a sufficient condition for \kappa > 0 is that, up to curvature and step-size effects, the combined noise of generation and verification must be small enough. We then apply this formalism to unify the recent literature on Language Self-Play (LSP), Self-Correction, and Synthetic Data bootstrapping. We demonstrate that architectures such as STaR, SPIN, Reflexion, GANs and AlphaZero are specific topological realizations of the GVU operator that satisfy the Variance Inequality through filtration, adversarial discrimination, or grounding in formal systems.

[AI-20] AuditCopilot: Leveraging LLMs for Fraud Detection in Double-Entry Bookkeeping

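[Quick read]: This paper targets the high false-positive rate and weak detection of subtle irregularities that rule-based Journal Entry Tests (JETs) exhibit when auditors screen tax-related ledger records for anomalies. The key idea is to use large language models (LLMs) as anomaly detectors: benchmarking LLMs such as LLaMA and Gemma on synthetic and real-world anonymized ledgers, the authors report that the LLMs consistently outperform both traditional JETs and classical machine learning baselines while also providing natural-language explanations that improve interpretability. This supports AI-augmented auditing, in which human auditors collaborate with foundation models to strengthen financial integrity.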

Link: https://arxiv.org/abs/2512.02726
Authors: Md Abdul Kadir,Sai Suresh Macharla Vasu,Sidharth S. Nair,Daniel Sonntag
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Auditors rely on Journal Entry Tests (JETs) to detect anomalies in tax-related ledger records, but rule-based methods generate overwhelming false positives and struggle with subtle irregularities. We investigate whether large language models (LLMs) can serve as anomaly detectors in double-entry bookkeeping. Benchmarking SoTA LLMs such as LLaMA and Gemma on both synthetic and real-world anonymized ledgers, we compare them against JETs and machine learning baselines. Our results show that LLMs consistently outperform traditional rule-based JETs and classical ML baselines, while also providing natural-language explanations that enhance interpretability. These results highlight the potential of AI-augmented auditing, where human auditors collaborate with foundation models to strengthen financial integrity.

[AI-21] StockMem: An Event-Reflection Memory Framework for Stock Forecasting

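[Quick read]: This paper addresses the difficulty of stock price prediction caused by market volatility and sensitivity to real-time events, and in particular the fact that applying large language models (LLMs) in finance is hindered by noisy news data and the lack of explicit answers in text. The proposed solution is StockMem, an event-reflection dual-layer memory framework. It structures news into events and mines them along two dimensions, horizontal consolidation (integrating daily events) and longitudinal tracking (following how events evolve), to build a temporal event knowledge base and extract incremental information that reflects discrepancies from market expectations. By analyzing event-price dynamics it also builds a reflection knowledge base of causal experiences, which allows retrieval of analogous historical scenarios and reasoning over them, making predictions explainable and decisions more transparent.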

Link: https://arxiv.org/abs/2512.02720
Authors: He Wang,Wenyilin Xiao,Songqiao Han,Hailiang Huang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Stock price prediction is challenging due to market volatility and its sensitivity to real-time events. While large language models (LLMs) offer new avenues for text-based forecasting, their application in finance is hindered by noisy news data and the lack of explicit answers in text. General-purpose memory architectures struggle to identify the key drivers of price movements. To address this, we propose StockMem, an event-reflection dual-layer memory framework. It structures news into events and mines them along two dimensions: horizontal consolidation integrates daily events, while longitudinal tracking captures event evolution to extract incremental information reflecting market expectation discrepancies. This builds a temporal event knowledge base. By analyzing event-price dynamics, the framework further forms a reflection knowledge base of causal experiences. For prediction, it retrieves analogous historical scenarios and reasons with current events, incremental data, and past experiences. Experiments show StockMem outperforms existing memory architectures and provides superior, explainable reasoning by tracing the information chain affecting prices, enhancing decision transparency in financial forecasting.

[AI-22] Menta: A Small Language Model for On-Device Mental Health Prediction

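[Quick read]: This paper addresses the limited capacity for early detection in mental health prediction. For multi-task mental-state recognition from social media data, large language models (LLMs) are hard to deploy because of their computational cost, while small language models (SLMs), though lightweight, remain largely unexplored in this area. The solution is Menta, described as the first optimized SLM fine-tuned specifically for multi-task mental health prediction from social media. Its main ingredients are a LoRA (Low-Rank Adaptation)-based joint training framework, a cross-dataset strategy, and a balanced accuracy-oriented loss. Across six classification tasks covering depression, stress, and suicidality, Menta improves on the best non-fine-tuned SLMs by 15.2% on average, and it exceeds 13B-parameter LLMs on depression and stress classification while being approximately 3.25x smaller. It also runs in real time on an iPhone 15 Pro Max using only about 3GB of RAM, which supports its potential for scalable, privacy-preserving monitoring.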

Link: https://arxiv.org/abs/2512.02716
Authors: Tianyi Zhang,Xiangyuan Xue,Lingyan Ruan,Shiya Fu,Feng Xia,Simon D’Alfonso,Vassilis Kostakos,Hong Jia
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Mental health conditions affect hundreds of millions globally, yet early detection remains limited. While large language models (LLMs) have shown promise in mental health applications, their size and computational demands hinder practical deployment. Small language models (SLMs) offer a lightweight alternative, but their use for social media–based mental health prediction remains largely underexplored. In this study, we introduce Menta, the first optimized SLM fine-tuned specifically for multi-task mental health prediction from social media data. Menta is jointly trained across six classification tasks using a LoRA-based framework, a cross-dataset strategy, and a balanced accuracy–oriented loss. Evaluated against nine state-of-the-art SLM baselines, Menta achieves an average improvement of 15.2% across tasks covering depression, stress, and suicidality compared with the best-performing non–fine-tuned SLMs. It also achieves higher accuracy on depression and stress classification tasks compared to 13B-parameter LLMs, while being approximately 3.25x smaller. Moreover, we demonstrate real-time, on-device deployment of Menta on an iPhone 15 Pro Max, requiring only approximately 3GB RAM. Supported by a comprehensive benchmark against existing SLMs and LLMs, Menta highlights the potential for scalable, privacy-preserving mental health monitoring. Code is available at: this https URL
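
The abstract mentions a balanced accuracy-oriented loss without giving its form. As a point of reference only, balanced accuracy itself is the unweighted mean of per-class recall. The minimal sketch below computes that metric (it is not the paper's loss); the function name and the sample labels are illustrative assumptions.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes count as much as common ones."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Hypothetical imbalanced labels: 8 "none" posts vs 2 "depressed" posts.
y_true = ["none"] * 8 + ["depressed"] * 2
y_pred = ["none"] * 8 + ["none", "depressed"]  # one minority-class miss

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy, balanced_accuracy(y_true, y_pred))
```

On this toy data a single missed minority-class example lowers balanced accuracy far more than plain accuracy, which is why the metric is commonly preferred for imbalanced mental-health datasets.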

[AI-23] Training Data Attribution for Image Generation using Ontology-Aligned Knowledge Graphs

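[Quick read]: This paper addresses transparency, accountability, and copyright concerns around generative AI, with the core question being how to understand the contribution of specific training data to a model's output. The key to the solution is an interpretation framework based on automatically constructed, ontology-aligned knowledge graphs (KGs): multimodal large language models (MLLMs) extract structured triples from images and align them with a domain-specific ontology, so that the KGs of generated and training images can be compared. This supports copyright tracing, dataset transparency, and improved interpretability.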

Link: https://arxiv.org/abs/2512.02713
Authors: Theodoros Aivalis,Iraklis A. Klampanos,Antonis Troumpoukis,Joemon M. Jose
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:As generative models become powerful, concerns around transparency, accountability, and copyright violations have intensified. Understanding how specific training data contributes to a model’s output is critical. We introduce a framework for interpreting generative outputs through the automatic construction of ontology-aligned knowledge graphs (KGs). While automatic KG construction from natural text has advanced, extracting structured and ontology-consistent representations from visual content remains challenging, due to the richness and multi-object nature of images. Leveraging multimodal large language models (LLMs), our method extracts structured triples from images, aligned with a domain-specific ontology. By comparing the KGs of generated and training images, we can trace potential influences, enabling copyright analysis, dataset transparency, and interpretable AI. We validate our method through experiments on locally trained models via unlearning, and on large-scale models through a style-specific experiment. Our framework supports the development of AI systems that foster human collaboration, creativity and stimulate curiosity.

[AI-24] Empirical Assessment of the Perception of Software Product Line Engineering by an SME before Migrating its Code Base

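[Quick read]: This paper addresses the challenges a small or medium-sized enterprise (SME) faces when migrating its existing software variants into a Software Product Line (SPL), including high cost, a significant impact on the development process, and changes to established developer practices. The approach is an in-depth evaluation of the company's current development processes and practices and of the anticipated benefits and risks of the migration, based on interviews with key stakeholders that capture their perceptions of the change and their potential resistance to it. The study finds that participants in every role identified benefits of the migration for their own activities, and that an effective risk mitigation strategy keeps stakeholders informed and actively involved throughout while preserving as many existing good practices as possible, so that the transition is smooth and potential problems are minimized.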

Link: https://arxiv.org/abs/2512.02707
Authors: Thomas Georges,Marianne Huchard,Mélanie König,Clémentine Nebut,Chouki Tibermacine
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 34 pages

Abstract:Migrating a set of software variants into a software product line (SPL) is an expensive and potentially challenging endeavor. Indeed, SPL engineering can significantly impact a company’s development process and often requires changes to established developer practices. The work presented in this paper stems from a collaboration with a Small and Medium-sized Enterprise (SME) that decided to migrate its existing code base into an SPL. In this study, we conducted an in-depth evaluation of the company’s current development processes and practices, as well as the anticipated benefits and risks associated with the migration. Key stakeholders involved in software development participated in this evaluation to provide insight into their perceptions of the migration and their potential resistance to change. This paper describes the design of the interviews conducted with these stakeholders and presents an analysis of the results. Among the qualitative findings, we observed that all participants, regardless of their role in the development process, identified benefits of the migration relevant to their own activities. Furthermore, our results suggest that an effective risk mitigation strategy involves keeping stakeholders informed and engaged throughout the process, preserving as many good practices as possible, and actively involving them in the migration to ensure a smooth transition and minimize potential challenges.

[AI-25] Learning What to Attend First: Modality-Importance-Guided Reasoning for Reliable Multimodal Emotion Understanding

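[Quick read]: This paper addresses reasoning drift in reasoning-based multimodal emotion understanding with multimodal large language models: models gradually rely on their own generated text rather than on multimodal evidence, and their explanations are overly shaped by visually initiated reasoning paths, which leaves them emotionally inconsistent. The key to the solution is Modality-Importance-Guided Reasoning (MIGR), a framework built on a Modality Importance (MI) mechanism that identifies the modality dominating the target emotion and reorders the reasoning sequence so that explanations begin from that modality, which prevents early reasoning from being misled by less informative cues. A two-stage training strategy, modality-aligned supervised fine-tuning followed by modality-aware reward optimization, encourages explanations that are emotionally grounded, causally relevant, and coherent.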

Link: https://arxiv.org/abs/2512.02699
Authors: Hyeongseop Rha,Jeong Hun Yeo,Junil Won,Se Jin Park,Yong Man Ro
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 16 pages, 8 figures

Abstract:In this paper, we present Modality-Importance-Guided Reasoning (MIGR), a framework designed to improve the reliability of reasoning-based multimodal emotion understanding in multimodal large language models. Although existing methods have advanced emotion understanding, they often suffer from reasoning drift: models gradually rely on their own generated text instead of multimodal evidence, and their explanations are overly shaped by visually initiated reasoning paths. To address these issues, we introduce Modality Importance (MI), a simple yet effective mechanism for identifying the emotion-dominant modality. Using MI, MIGR reorganizes reasoning sequences so that explanations begin from the modality most critical to the target emotion, preventing early reasoning from being misled by less informative cues. Our two-stage framework, comprising modality-aligned supervised fine-tuning and modality-aware reward optimization, encourages models to generate emotionally grounded, causally relevant, and coherence-preserving explanations. Experimental results on the DFEW benchmark show that MIGR substantially improves reasoning reliability, decreasing instances of correct predictions accompanied by emotionally inconsistent explanations from 18.10% to 7.37%. These results confirm the benefit of initiating reasoning from the emotion-dominant modality.

[AI-26] Beyond Single-Agent Safety: A Taxonomy of Risks in LLM-to-LLM Interactions

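[Quick read]: This paper addresses the failure of current safety mechanisms when large language models (LLMs) interact with each other: single-model safety constraints such as prompting, fine-tuning, and moderation layers cannot govern the systemic risks that emerge in multi-model environments. The core difficulty is that the interaction structure can produce collective failure even when every individual model behaves compliantly. The key to the solution is a shift from model-level safety to system-level safety, introducing the Emergent Systemic Risk Horizon (ESRH) framework to formalize how instability arises from interaction structure rather than from isolated misbehavior. The paper also provides a taxonomy of failure modes at the micro, meso, and macro levels, and proposes InstitutionalAI, an architecture that embeds adaptive oversight within multi-agent systems.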

Link: https://arxiv.org/abs/2512.02682
Authors: Piercosma Bisconti,Marcello Galisai,Federico Pierucci,Marcantonio Bracale,Matteo Prandi
Affiliations: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper examines why safety mechanisms designed for human-model interaction do not scale to environments where large language models (LLMs) interact with each other. Most current governance practices still rely on single-agent safety containment, prompts, fine-tuning, and moderation layers that constrain individual model behavior but leave the dynamics of multi-model interaction ungoverned. These mechanisms assume a dyadic setting: one model responding to one user under stable oversight. Yet research and industrial development are rapidly shifting toward LLM-to-LLM ecosystems, where outputs are recursively reused as inputs across chains of agents. In such systems, local compliance can aggregate into collective failure even when every model is individually aligned. We propose a conceptual transition from model-level safety to system-level safety, introducing the framework of the Emergent Systemic Risk Horizon (ESRH) to formalize how instability arises from interaction structure rather than from isolated misbehavior. The paper contributes (i) a theoretical account of collective risk in interacting LLMs, (ii) a taxonomy connecting micro, meso, and macro-level failure modes, and (iii) a design proposal for InstitutionalAI, an architecture for embedding adaptive oversight within multi-agent systems.

[AI-27] Exploring Depth Generalization in Large Language Models for Solving Recursive Logic Tasks

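[Quick read]: This paper addresses the weak depth generalization of large language models on recursive reasoning problems: performance drops sharply on nesting levels deeper than those seen in training, such as nested parentheses or nested logical clauses. The limitation is attributed to the difficulty standard Transformer architectures have in maintaining stack-like behavior, so they cannot reliably track and resolve multiple levels of nested dependencies. The key to the solution is a looped locate-and-replace pipeline in which two specialized models cooperate: a locator identifies solvable subexpressions, and a replacer evaluates them while preserving the overall structure. This decomposes a recursive problem into manageable subtasks and alleviates the performance decay at out-of-distribution recursion depths.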

Link: https://arxiv.org/abs/2512.02677
Authors: Zhiyuan He
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models have demonstrated remarkable capabilities across many tasks, yet face significant challenges when dealing with recursive reasoning problems, those requiring the resolution of nested hierarchical structures. While prior research has extensively studied length generalization (a model’s ability to handle longer sequences than seen during training), we investigate a distinct and underexplored limitation: depth generalization. Here, depth refers to the number of nested levels in a hierarchical problem, such as the layers of parentheses in a mathematical expression or the nesting of logical clauses in a Boolean formula. Our work reveals that standard transformer architectures struggle with problems involving deeper recursion than encountered during training, even when they perform well on longer but non-nested sequences. This limitation stems from their inability to maintain stack-like behavior, the capacity to track and resolve multiple levels of nested dependencies. Through systematic analysis, we demonstrate how this architectural constraint leads to rapid performance decay as the depth of the recursion increases. To address this challenge, we develop a novel looped locate-and-replace pipeline that decomposes recursive problems into manageable subcomponents. The approach employs two specialized models: a locator that identifies solvable subexpressions and a replacer that evaluates these components while preserving the overall structure. We evaluated this method in three carefully designed domains: Boolean algebra, recursive arithmetic, and propositional logic, each with a controllable depth of recursion. We show that our method effectively alleviates the performance decay when tested on out-of-distribution recursion depth.
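
The looped locate-and-replace idea can be illustrated with a small sketch. In the paper both the locator and the replacer are learned models; here they are replaced by rule-based stand-ins (a regular expression and a tiny evaluator), and the grammar (fully parenthesised `T`/`F` expressions with `and`, `or`, `not`) is an assumption chosen to keep the example short.

```python
import re

# A parenthesised group that contains no nested parentheses.
INNERMOST = re.compile(r"\(([^()]*)\)")

def locate(expr):
    """Stand-in for the locator model: find one innermost, directly solvable subexpression."""
    return INNERMOST.search(expr)

def replace(sub):
    """Stand-in for the replacer model: evaluate a flat 'T and F' or 'not T' expression."""
    tokens = sub.split()
    if tokens[0] == "not":
        value = tokens[1] == "F"
    else:
        left, op, right = tokens
        a, b = left == "T", right == "T"
        value = (a and b) if op == "and" else (a or b)
    return "T" if value else "F"

def solve(expr, max_steps=100):
    """Loop locate -> replace until no parentheses remain."""
    for _ in range(max_steps):
        match = locate(expr)
        if match is None:
            return expr
        # Splice the evaluated value back in, leaving the rest of the structure untouched.
        expr = expr[:match.start()] + replace(match.group(1)) + expr[match.end():]
    raise RuntimeError("step budget exhausted")

print(solve("(not ((T and F) or (not (F or F))))"))
```

Each iteration only ever handles a depth-one subexpression, so the per-step difficulty does not grow with nesting depth; only the number of loop iterations does.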

[AI-28] SAND Challenge: Four Approaches for Dysartria Severity Classification

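[Quick read]: This paper addresses the classification of dysarthria severity in neurodegenerative diseases, with the goal of assigning speech recordings to one of five severity classes. It presents and compares four modeling approaches in a unified study: a Vision Transformer on spectrogram images (ViT-OF), eight 1-D convolutional neural networks with majority-vote fusion (1D-CNN), nine bidirectional long short-term memory models with majority-vote fusion (BiLSTM-OF), and a hierarchical XGBoost ensemble that combines glottal and formant features in a two-stage framework. The feature-engineered XGBoost approach achieves the best macro-F1 (0.86), while the deep learning models (ViT, CNN, BiLSTM) score lower (F1 ≈ 0.70) but offer complementary insights into the problem.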

Link: https://arxiv.org/abs/2512.02669
Authors: Gauri Deshpande,Harish Battula,Ashish Panda,Sunil Kumar Kopparapu
Affiliations: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages, 5 figures

Abstract:This paper presents a unified study of four distinct modeling approaches for classifying dysarthria severity in the Speech Analysis for Neurodegenerative Diseases (SAND) challenge. All models tackle the same five-class classification task using a common dataset of speech recordings. We investigate: (1) a ViT-OF method leveraging a Vision Transformer on spectrogram images, (2) a 1D-CNN approach using eight 1-D CNNs with majority-vote fusion, (3) a BiLSTM-OF approach using nine BiLSTM models with majority-vote fusion, and (4) a Hierarchical XGBoost ensemble that combines glottal and formant features through a two-stage learning framework. Each method is described, and their performances on a validation set of 53 speakers are compared. Results show that while the feature-engineered XGBoost ensemble achieves the highest macro-F1 (0.86), the deep learning models (ViT, CNN, BiLSTM) attain competitive F1-scores (0.70) and offer complementary insights into the problem.
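
Two of the four approaches fuse several models by majority vote. A minimal sketch of that fusion step follows; the tie-breaking rule (lowest class index wins) and the sample predictions are assumptions, since the abstract does not say how ties are handled.

```python
from collections import Counter

def majority_vote(predictions):
    """Fuse per-model class predictions for one speaker; ties go to the lowest class index."""
    counts = Counter(predictions)
    best = max(counts.values())
    return min(label for label, n in counts.items() if n == best)

# Hypothetical outputs of eight classifiers for three speakers (classes 0-4).
per_speaker = [
    [2, 2, 3, 2, 1, 2, 2, 4],
    [0, 1, 1, 0, 1, 0, 1, 0],  # 4-4 tie between classes 0 and 1
    [4, 4, 4, 3, 4, 4, 3, 4],
]
fused = [majority_vote(p) for p in per_speaker]
print(fused)
```

With an even number of voters, as in the eight-model case, ties are possible, so any concrete implementation has to pick some tie-breaking rule.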

[AI-29] Graph VQ-Transformer (GVT): Fast and Accurate Molecular Generation via High-Fidelity Discrete Latents

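[Quick read]: This paper addresses two core difficulties in molecular generation: diffusion models are computationally expensive, and autoregressive models suffer from error propagation. The solution is a two-stage generative framework, the Graph VQ-Transformer (GVT). Its core is a Graph Vector Quantized Variational Autoencoder (VQ-VAE) that compresses molecular graphs into high-fidelity discrete latent sequences; combining a Graph Transformer with canonical Reverse Cuthill-McKee (RCM) node ordering and Rotary Positional Embeddings (RoPE) yields near-perfect reconstruction rates. An autoregressive Transformer is then trained on these discrete latents, which turns graph generation into a well-structured sequence modeling problem. According to the authors, this mapping improves both efficiency and accuracy and connects molecular design with large-scale sequence modeling such as Large Language Models (LLMs).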

Link: https://arxiv.org/abs/2512.02667
Authors: Haozhuo Zheng,Cheng Wang,Yang Liu
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The de novo generation of molecules with desirable properties is a critical challenge, where diffusion models are computationally intensive and autoregressive models struggle with error propagation. In this work, we introduce the Graph VQ-Transformer (GVT), a two-stage generative framework that achieves both high accuracy and efficiency. The core of our approach is a novel Graph Vector Quantized Variational Autoencoder (VQ-VAE) that compresses molecular graphs into high-fidelity discrete latent sequences. By synergistically combining a Graph Transformer with canonical Reverse Cuthill-McKee (RCM) node ordering and Rotary Positional Embeddings (RoPE), our VQ-VAE achieves near-perfect reconstruction rates. An autoregressive Transformer is then trained on these discrete latents, effectively converting graph generation into a well-structured sequence modeling problem. Crucially, this mapping of complex graphs to high-fidelity discrete sequences bridges molecular design with the powerful paradigm of large-scale sequence modeling, unlocking potential synergies with Large Language Models (LLMs). Extensive experiments show that GVT achieves state-of-the-art or highly competitive performance across major benchmarks like ZINC250k, MOSES, and GuacaMol, and notably outperforms leading diffusion models on key distribution similarity metrics such as FCD and KL Divergence. With its superior performance, efficiency, and architectural novelty, GVT not only presents a compelling alternative to diffusion models but also establishes a strong new baseline for the field, paving the way for future research in discrete latent-space molecular generation.
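
Reverse Cuthill-McKee ordering, which the abstract uses to obtain a canonical node sequence, can be sketched in a few lines: run a breadth-first search from a minimum-degree node, visit neighbours in order of increasing degree, and reverse the result. The version below is a plain-Python illustration, not the authors' implementation; the function name `rcm_order`, the tie-breaking by node index, and the toy graph are assumptions.

```python
from collections import deque

def rcm_order(adjacency):
    """Return a Reverse Cuthill-McKee node ordering for an undirected graph.

    `adjacency` maps each node to a list of its neighbours.
    Ties in degree are broken by node index (an illustrative choice).
    """
    degree = {v: len(nbrs) for v, nbrs in adjacency.items()}
    visited, order = set(), []
    # Start each connected component from its lowest-degree unvisited node.
    for start in sorted(adjacency, key=lambda v: (degree[v], v)):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in sorted(adjacency[v], key=lambda u: (degree[u], u)):
                if w not in visited:
                    visited.add(w)
                    queue.append(w)
    return order[::-1]

# Hypothetical heavy-atom graph of ethanol: C0 - C1 - O2.
ethanol = {0: [1], 1: [0, 2], 2: [1]}
print(rcm_order(ethanol))
```

A deterministic ordering like this matters for sequence models because the same molecule should always map to the same token sequence. SciPy also provides `scipy.sparse.csgraph.reverse_cuthill_mckee` for sparse matrices.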

[AI-30] Distill Forget Repeat: A Framework for Continual Unlearning in Text-to-Image Diffusion Models

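[Quick read]: This paper addresses the instability of machine unlearning in generative models when deletion requests arrive sequentially, a setting known as continual unlearning (CUL). Naively applying one-shot unlearning methods in this setting causes a sharp performance decline: retained concepts degrade, related concepts suffer compounding collateral damage, and generative quality deteriorates. The key to the solution is a continual unlearning framework based on generative distillation, which recasts each unlearning step as a multi-objective teacher-student distillation process and borrows stability principles from continual learning, so that target concepts are removed stably and precisely without significant interference with retained concepts or overall image quality.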

Link: https://arxiv.org/abs/2512.02657
Authors: Naveen George,Naoki Murata,Yuhta Takida,Konda Reddy Mopuri,Yuki Mitsufuji
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:The recent rapid growth of visual generative models trained on vast web-scale datasets has created significant tension with data privacy regulations and copyright laws, such as GDPR’s "Right to be Forgotten." This necessitates machine unlearning (MU) to remove specific concepts without the prohibitive cost of retraining. However, existing MU techniques are fundamentally ill-equipped for real-world scenarios where deletion requests arrive sequentially, a setting known as continual unlearning (CUL). Naively applying one-shot methods in a continual setting triggers a stability crisis, leading to a cascade of degradation characterized by retention collapse, compounding collateral damage to related concepts, and a sharp decline in generative quality. To address this critical challenge, we introduce a novel generative distillation based continual unlearning framework that ensures targeted and stable unlearning under sequences of deletion requests. By reframing each unlearning step as a multi-objective, teacher-student distillation process, the framework leverages principles from continual learning to maintain model integrity. Experiments on a 10-step sequential benchmark demonstrate that our method unlearns forget concepts with better fidelity and achieves this without significant interference to the performance on retain concepts or the overall image quality, substantially outperforming baselines. This framework provides a viable pathway for the responsible deployment and maintenance of large-scale generative models, enabling industries to comply with ongoing data removal requests in a practical and effective manner.

[AI-31] Pianist Transformer: Towards Expressive Piano Performance Rendering via Scalable Self-Supervised Pre-Training

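[Quick read]: This paper addresses the reliance of existing expressive music performance rendering methods on small labeled datasets, which limits scaling of both data volume and model size even though large amounts of unlabeled music data exist. The solution is Pianist Transformer, with four main contributions: first, a unified MIDI (Musical Instrument Digital Interface) data representation that lets the model learn shared principles of musical structure and expression without explicit annotation; second, an efficient asymmetric architecture that supports longer contexts and faster inference without sacrificing rendering quality; third, a self-supervised pre-training pipeline with 10B tokens and a 135M-parameter model, which unlocks the benefits of data and model scaling; and fourth, a performance model that achieves state-of-the-art results, with strong objective metrics and human-level subjective ratings, offering a scalable path toward human-like performance synthesis.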

Link: https://arxiv.org/abs/2512.02652
Authors: Hong-Jie You,Jie-Jing Shao,Xiao-Wen Yang,Lin-Han Jia,Lan-Zhe Guo,Yu-Feng Li
Affiliations: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:

Abstract:Existing methods for expressive music performance rendering rely on supervised learning over small labeled datasets, which limits scaling of both data volume and model size, despite the availability of vast unlabeled music, as in vision and language. To address this gap, we introduce Pianist Transformer, with four key contributions: 1) a unified Musical Instrument Digital Interface (MIDI) data representation for learning the shared principles of musical structure and expression without explicit annotation; 2) an efficient asymmetric architecture, enabling longer contexts and faster inference without sacrificing rendering quality; 3) a self-supervised pre-training pipeline with 10B tokens and 135M-parameter model, unlocking data and model scaling advantages for expressive performance rendering; 4) a state-of-the-art performance model, which achieves strong objective metrics and human-level subjective ratings. Overall, Pianist Transformer establishes a scalable path toward human-like performance synthesis in the music domain.

[AI-32] Zero-Shot Instruction Following in RL via Structured LTL Representations ICML2025

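[Quick read]: This paper addresses the shortcomings of existing Linear Temporal Logic (LTL)-based reinforcement learning methods in environments where multiple high-level events occur simultaneously and interact in complex ways. Existing methods typically interpret LTL instructions as finite automata but struggle when several atomic propositions are true at the same time. The key idea is to condition the policy on sequences of simple Boolean formulae that directly align with transitions in the automaton, and to encode these formulae with a Graph Neural Network (GNN) to obtain structured task representations, so that the policy can handle concurrent and interacting events.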

Link: https://arxiv.org/abs/2512.02633
Authors: Mattia Giuri,Mathias Jackermeier,Alessandro Abate
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICML 2025 Workshop on Programmatic Representations for Agent Learning

Abstract:Linear temporal logic (LTL) is a compelling framework for specifying complex, structured tasks for reinforcement learning (RL) agents. Recent work has shown that interpreting LTL instructions as finite automata, which can be seen as high-level programs monitoring task progress, enables learning a single generalist policy capable of executing arbitrary instructions at test time. However, existing approaches fall short in environments where multiple high-level events (i.e., atomic propositions) can be true at the same time and potentially interact in complicated ways. In this work, we propose a novel approach to learning a multi-task policy for following arbitrary LTL instructions that addresses this shortcoming. Our method conditions the policy on sequences of simple Boolean formulae, which directly align with transitions in the automaton, and are encoded via a graph neural network (GNN) to yield structured task representations. Experiments in a complex chess-based environment demonstrate the advantages of our approach.
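
To make the role of Boolean formulae on automaton transitions concrete, the sketch below monitors task progress with an automaton whose edges are labelled by formulae over atomic propositions, several of which may be true in the same step. It illustrates only the automaton side of the setup (no policy and no GNN encoding); the tuple-based formula format, the state names, and the example task are assumptions.

```python
def holds(formula, true_props):
    """Evaluate a Boolean formula (nested tuples) against the set of propositions true now."""
    op = formula[0]
    if op == "var":
        return formula[1] in true_props
    if op == "not":
        return not holds(formula[1], true_props)
    if op == "and":
        return all(holds(f, true_props) for f in formula[1:])
    if op == "or":
        return any(holds(f, true_props) for f in formula[1:])
    raise ValueError(f"unknown operator: {op}")

# Hypothetical automaton for "reach a while avoiding c, then reach b".
# Each state maps to an ordered list of (formula, next_state) transitions.
AUTOMATON = {
    "q0": [(("var", "c"), "fail"),
           (("and", ("var", "a"), ("not", ("var", "c"))), "q1")],
    "q1": [(("var", "b"), "accept")],
}

def step(state, true_props):
    """Take the first enabled transition; stay in place if none is enabled."""
    for formula, nxt in AUTOMATON.get(state, []):
        if holds(formula, true_props):
            return nxt
    return state

def run(trace, state="q0"):
    """Advance the automaton over a trace of proposition sets, one set per time step."""
    for true_props in trace:
        state = step(state, true_props)
    return state

print(run([set(), {"a"}, {"b"}]))   # a, then b, never c
print(run([{"a", "c"}, {"b"}]))     # a and c hold at the same time
```

The second trace shows why simultaneous propositions matter: `a` alone would advance the task, but `a` together with `c` violates it, which a per-proposition view of transitions cannot express.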

[AI-33] CryptoQA: A Large-scale Question-answering Dataset for AI-assisted Cryptography

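[Quick read]: This paper addresses the limited deep-reasoning and mathematical-analysis ability of Large Language Models (LLMs) in cryptography, which remains poorly understood largely because suitable data for evaluation and training is lacking. The key to the solution is CryptoQA, the first large-scale question-answering (QA) dataset designed for cryptography. It contains over two million QA pairs drawn from curated academic sources, together with contextual metadata, and can be used both to evaluate the cryptographic capabilities of LLMs systematically and to fine-tune them for better performance on cryptographic tasks.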

Link: https://arxiv.org/abs/2512.02625
Authors: Mayar Elfares,Pascal Reisert,Tilman Dietz,Manpa Barman,Ahmed Zaki,Ralf Küsters,Andreas Bulling
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) excel at many general-purpose natural language processing tasks. However, their ability to perform deep reasoning and mathematical analysis, particularly for complex tasks as required in cryptography, remains poorly understood, largely due to the lack of suitable data for evaluation and training. To address this gap, we present CryptoQA, the first large-scale question-answering (QA) dataset specifically designed for cryptography. CryptoQA contains over two million QA pairs drawn from curated academic sources, along with contextual metadata that can be used to test the cryptographic capabilities of LLMs and to train new LLMs on cryptographic tasks. We benchmark 15 state-of-the-art LLMs on CryptoQA, evaluating their factual accuracy, mathematical reasoning, consistency, referencing, backward reasoning, and robustness to adversarial samples. In addition to quantitative metrics, we provide expert reviews that qualitatively assess model outputs and establish a gold-standard baseline. Our results reveal significant performance deficits of LLMs, particularly on tasks that require formal reasoning and precise mathematical knowledge. This shows the urgent need for LLM assistants tailored to cryptography research and development. We demonstrate that, by using CryptoQA, LLMs can be fine-tuned to exhibit better performance on cryptographic tasks.

[AI-34] Target-specific Adaptation and Consistent Degradation Alignment for Cross-Domain Remaining Useful Life Prediction

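[Quick read]: This paper addresses the performance drop in cross-domain Remaining Useful Life (RUL) prediction caused by a mismatch between training and testing data distributions. Existing adversarial domain adaptation methods learn domain-invariant features but overlook target-specific information and the consistency of degradation stages, which hurts prediction accuracy. The key to the solution is a new domain adaptation method named TACDA: first, a target-domain reconstruction strategy inside the adversarial adaptation process retains target-specific information while domain-invariant features are learned; second, a clustering and pairing strategy aligns similar degradation stages consistently, which improves the accuracy of cross-domain RUL prediction.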

Link: https://arxiv.org/abs/2512.02610
Authors: Yubo Hou,Mohamed Ragab,Min Wu,Chee-Keong Kwoh,Xiaoli Li,Zhenghua Chen
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Accurate prediction of the Remaining Useful Life (RUL) in machinery can significantly diminish maintenance costs, enhance equipment up-time, and mitigate adverse outcomes. Data-driven RUL prediction techniques have demonstrated commendable performance. However, their efficacy often relies on the assumption that training and testing data are drawn from the same distribution or domain, which does not hold in real industrial settings. To mitigate this domain discrepancy issue, prior adversarial domain adaptation methods focused on deriving domain-invariant features. Nevertheless, they overlook target-specific information and inconsistency characteristics pertinent to the degradation stages, resulting in suboptimal performance. To tackle these issues, we propose a novel domain adaptation approach for cross-domain RUL prediction named TACDA. Specifically, we propose a target domain reconstruction strategy within the adversarial adaptation process, thereby retaining target-specific information while learning domain-invariant features. Furthermore, we develop a novel clustering and pairing strategy for consistent alignment between similar degradation stages. Through extensive experiments, our results demonstrate the remarkable performance of our proposed TACDA method, surpassing state-of-the-art approaches with regard to two different evaluation metrics. Our code is available at this https URL.

[AI-35] IACT: A Self-Organizing Recursive Model for General AI Agents: A Technical White Paper on the Architecture Behind kragent.ai

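[Quick read]: This paper addresses the inflexibility of traditional static, hard-coded agent workflows on open-ended tasks, in particular their difficulty in adapting to complex, dynamically changing task structures. The key to the solution is the Interactive Agents Call Tree (IACT), a model driven by user dialogue that autonomously grows a dynamic, recursive agent topology and so scales its organizational complexity to the task at hand. It also introduces interactional redundancy, replacing unidirectional function calls with bidirectional, stateful dialogues, which enables runtime error correction and ambiguity resolution and improves robustness and adaptability.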

Link: https://arxiv.org/abs/2512.02605
Authors: Pengju Lu
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments: 13 pages, 2 figures, 1 table

Abstract:This technical white paper introduces the Interactive Agents Call Tree (IACT), a computational model designed to address the limitations of static, hard-coded agent workflows. Unlike traditional systems that require pre-defined graphs or specialized programming, IACT operates as a general-purpose autonomous system driven purely by user dialogue. Given a high-level objective, the system autonomously grows a dynamic, recursive agent topology incrementally tailored to the problem’s structure. This allows it to scale its organizational complexity to match open-ended tasks. To mitigate the error propagation inherent in unidirectional function calls, IACT introduces interactional redundancy by replacing rigid invocations with bidirectional, stateful dialogues. This mechanism enables runtime error correction and ambiguity resolution. We describe the architecture, design principles, and practical lessons behind the production deployment of this model in the this http URL system, presenting qualitative evidence from real-world workflows rather than exhaustive benchmark results.

[AI-36] PaperDebugger: A Plugin-Based Multi-Agent System for In-Editor Academic Writing Review and Editing

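[Quick read]: This paper addresses the fact that current Large Language Model (LLM) writing assistants remain external to the editor (such as Overleaf): they cannot interact deeply with document state, structure, and revision history, which limits agentic, context-aware operations. The key to the solution is PaperDebugger, a multi-agent, plugin-based writing assistant built into the LaTeX editor. Through a Chrome-approved extension, a Kubernetes-native orchestration layer, and a Model Context Protocol (MCP) toolchain, it provides reliable bidirectional synchronization with the editor, fine-grained version control and patching, secure state management, multi-agent scheduling, and extensible communication with external tools, supporting a full workflow of localized edits, structured reviews, parallel agent execution, and diff-based updates.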

Link: https://arxiv.org/abs/2512.02589
Authors: Junyi Hou(National University of Singapore),Andre Lin Huikai(National University of Singapore),Nuo Chen(National University of Singapore),Yiwei Gong(Independent Researcher, Singapore),Bingsheng He(National University of Singapore)
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:Large language models are increasingly embedded into academic writing workflows, yet existing assistants remain external to the editor, preventing deep interaction with document state, structure, and revision history. This separation makes it impossible to support agentic, context-aware operations directly within LaTeX editors such as Overleaf. We present PaperDebugger, an in-editor, multi-agent, and plugin-based academic writing assistant that brings LLM-driven reasoning directly into the writing environment. Enabling such in-editor interaction is technically non-trivial: it requires reliable bidirectional synchronization with the editor, fine-grained version control and patching, secure state management, multi-agent scheduling, and extensible communication with external tools. PaperDebugger addresses these challenges through a Chrome-approved extension, a Kubernetes-native orchestration layer, and a Model Context Protocol (MCP) toolchain that integrates literature search, reference lookup, document scoring, and revision pipelines. Our demo showcases a fully integrated workflow, including localized edits, structured reviews, parallel agent execution, and diff-based updates, encapsulated within a minimal-intrusion user interface (UI). Early aggregated analytics demonstrate active user engagement and validate the practicality of an editor-native, agentic writing assistant. More details about this demo and video could be found at this https URL.

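[AI-37] Feedback Loops and Code Perturbations in LLM-based Software Engineering: A Case Study on a C-to-Rust Translation System

[Quick read]: This paper addresses the insufficient reliability of automated code translation systems (specifically C-to-Rust translation) in industrial practice; the core challenge is improving the quality and robustness of translations driven by generative AI. The system under study follows a closed-loop generate-and-check pattern: a large language model (LLM) generates candidate Rust code, and automated checks for compilability and behavioral equivalence with the original C code evaluate the output; for failed cases, an automated feedback loop re-prompts the LLM to repair its output. The study finds that without feedback loops the choice of LLM has a large effect on translation success, whereas with feedback loops the differences between models diminish, both in average performance and in robustness under code perturbations. The diversity introduced by code perturbations can even improve overall system performance.

Link: https://arxiv.org/abs/2512.02567
Authors: Martin Weiss,Jesko Hecking-Harbusch,Jochen Quante,Matthias Woehrle
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 10 pages, 9 figures

Click to view abstract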

Abstract:The advent of strong generative AI has a considerable impact on various software engineering tasks such as code repair, test generation, or language translation. While tools like GitHub Copilot are already in widespread use in interactive settings, automated approaches require a higher level of reliability before being usable in industrial practice. In this paper, we focus on three aspects that directly influence the quality of the results: a) the effect of automated feedback loops, b) the choice of Large Language Model (LLM), and c) the influence of behavior-preserving code changes. We study the effect of these three variables on an automated C-to-Rust translation system. Code translation from C to Rust is an attractive use case in industry due to Rust’s safety guarantees. The translation system is based on a generate-and-check pattern, in which Rust code generated by the LLM is automatically checked for compilability and behavioral equivalence with the original C code. For negative checking results, the LLM is re-prompted in a feedback loop to repair its output. These checks also allow us to evaluate and compare the respective success rates of the translation system when varying the three variables. Our results show that without feedback loops LLM selection has a large effect on translation success. However, when the translation system uses feedback loops the differences across models diminish. We observe this for the average performance of the system as well as its robustness under code perturbations. Finally, we also identify that diversity provided by code perturbations can even result in improved system performance.
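
The generate-and-check pattern with a feedback loop described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' system: `translate_with_feedback`, `generate`, `check`, and the toy `fake_*` stand-ins are hypothetical names, and a real checker would invoke the Rust compiler and behavioral-equivalence tests rather than a string match.

```python
from typing import Callable, Optional


def translate_with_feedback(
    c_source: str,
    generate: Callable[[str], str],
    check: Callable[[str, str], Optional[str]],
    max_rounds: int = 3,
) -> Optional[str]:
    """Return Rust code that passes `check`, or None if every round fails."""
    prompt = f"Translate this C code to Rust:\n{c_source}"
    for _ in range(max_rounds):
        rust_candidate = generate(prompt)
        error = check(c_source, rust_candidate)  # None means all checks passed
        if error is None:
            return rust_candidate
        # Feed the failure back so the next attempt can try to repair it.
        prompt = (
            f"Translate this C code to Rust:\n{c_source}\n"
            f"Your previous attempt:\n{rust_candidate}\n"
            f"It failed with:\n{error}\nPlease repair it."
        )
    return None


# Toy stand-ins (hypothetical): this "model" only succeeds after seeing feedback.
def fake_generate(prompt: str) -> str:
    if "failed" in prompt:
        return "fn add(a: i32, b: i32) -> i32 { a + b }"
    return "fn add() {}"


def fake_check(c_source: str, rust_code: str) -> Optional[str]:
    return None if "a + b" in rust_code else "behavior differs from the C original"


C_CODE = "int add(int a, int b) { return a + b; }"
result = translate_with_feedback(C_CODE, fake_generate, fake_check)
no_feedback = translate_with_feedback(C_CODE, fake_generate, fake_check, max_rounds=1)
```

With `max_rounds=1` the loop degenerates to a single attempt without repair, which is one simple way to compare the "with feedback" and "without feedback" settings the abstract contrasts.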

[AI-38] EZYer: A simulacrum of high school with generative agent SIGIR2025

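[Quick read]: This paper addresses incomplete service, insufficient performance, and weak interactivity in current online education tools for courseware generation, interactive notes, and content quality assurance. Its core solution is EZYer, a generative agent system built around three cooperating modules: 1) the Teacher Module uses text corpus retrieval and in-depth generation to automatically produce structured teaching materials and LaTeX Beamer courseware aligned with the high school mathematics syllabus, with support for user-defined image insertion; 2) the Student Module has four roles (Teacher, Assistant, Top Student, and Struggling Student) interact collaboratively, after which a Note Taker summarizes the exchange into notes intended to be both deep and engaging; 3) the Controller sets up keyword filtering, content scoring, role co-validation, and dynamic content correction to keep content academically rigorous and pedagogically appropriate. In an evaluation scored by five large language models, EZYer-generated content performs well on five dimensions: content accuracy, knowledge coverage, usability, formatting correctness, and visual design and appeal.

Link: https://arxiv.org/abs/2512.02561
Authors: Jinming Yang,Zimu Ji,Weiqi Luo,Gaoxi Wang,Bin Ma,Yueling Deng
Affiliations: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: AgentIR@SIGIR 2025

Click to view abstract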

Abstract:With the rapid development of online education and large language models, existing educational tools still suffer from incomplete service, insufficient performance and weak interactivity in terms of courseware generation, interactive notes and quality assurance of content. In particular, the proposed generative agent EZYer comprises: 1) Teacher Module: Integrating the Text Corpus retrieval and in-depth generation technologies, it automatically generates structured teaching materials and LaTeX Beamer courseware in line with the high school mathematics syllabus and supports user-defined image insertion. 2) Student Module: Through the collaborative interaction of the four roles of Teacher, Assistant, Top Student and Struggling Student, Note Taker summarizes and generates academic notes to enhance the depth and interest of learning. 3) Controller: sets up a keyword filtering system, content scoring system, role co-validation system, and dynamic content correction system. This ensures the academic strictness and pedagogical propriety of EZYer inputs and outputs. In order to evaluate EZYer, this paper designs five-dimensional evaluation indexes of content accuracy, knowledge coverage, usability, formatting correctness and visual design and appeal, and scores 100 Beamer and Notes generated by EZYer by five large language models, separately, and the results show that the quality of EZYer-generated content is excellent and has a good application prospect.

[AI-39] Empathy Level Prediction in Multi-Modal Scenario with Supervisory Documentation Assistance

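[Quick read]: This paper addresses two shortcomings of current empathy prediction methods: they rely mainly on a single modality (typically text) and ignore multimodal fusion, and they make little use of privileged information available at training time. The key to the solution is a multimodal empathy prediction method that combines video, audio, and text, together with supervisory-documentation-assisted training. First, pre-trained networks extract features from each modality, and cross-modal fusion produces a multimodal representation used for prediction. Second, during training, supervisory documents written by supervisors serve as privileged information: a Latent Dirichlet Allocation (LDA) model extracts latent topic distributions that constrain text-feature learning, improving the semantic expressiveness of the text features. This privileged information is available only during training and is not needed at inference time, which keeps the model practical to deploy.

Link: https://arxiv.org/abs/2512.02558
Authors: Yufei Xiao,Shangfei Wang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract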

Abstract:Prevalent empathy prediction techniques primarily concentrate on a singular modality, typically textual, thus neglecting multi-modal processing capabilities. They also overlook the utilization of certain privileged information, which may encompass additional empathetic content. In response, we introduce an advanced multi-modal empathy prediction method integrating video, audio, and text information. The method comprises the Multi-Modal Empathy Prediction and Supervisory Documentation Assisted Training. We use pre-trained networks in the empathy prediction network to extract features from various modalities, followed by a cross-modal fusion. This process yields a multi-modal feature representation, which is employed to predict empathy labels. To enhance the extraction of text features, we incorporate supervisory documents as privileged information during the assisted training phase. Specifically, we apply the Latent Dirichlet Allocation model to identify potential topic distributions to constrain text features. These supervisory documents, created by supervisors, focus on the counseling topics and the counselor’s display of empathy. Notably, this privileged information is only available during training and is not accessible during the prediction phase. Experimental results on the multi-modal and dialogue empathy datasets demonstrate that our approach is superior to the existing methods.

[AI-40] CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

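[Quick read]: This paper addresses the inefficiency of hand-tuning Half-precision General Matrix Multiply (HGEMM) CUDA kernels in high-performance computing and the difficulty of covering large configuration spaces. Conventional approaches depend on expert experience or limited search strategies and struggle to reach optimal performance on complex hardware. The key to the solution is CUDA-L2, a system that combines large language models (LLMs) with reinforcement learning (RL) and uses CUDA execution speed as the reward signal to automatically optimize HGEMM kernels across 1,000 configurations. LLM-guided RL explores the configuration space efficiently and outperforms current mainstream baselines (including cuBLAS and cuBLASLt with heuristic and auto-tuned algorithm selection), with average speedups, depending on the baseline, of up to +22.0% in offline mode and up to +28.7% in server mode, showing that generative-AI-driven automated tuning is effective for critical compute kernels.

Link: https://arxiv.org/abs/2512.02551
Authors: Songqiao Su,Xiaofei Sun,Xiaoya Li,Albert Wang,Jiwei Li,Chris Shum
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract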

Abstract:In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 automatically optimizes HGEMM kernels across 1,000 configurations. CUDA-L2 systematically outperforms major matmul baselines to date, from the widely-used this http URL to state-of-the-art Nvidia’s closed-source libraries, i.e., cuBLAS, cuBLASLt. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields +22.0% over this http URL on average; +19.2% over cuBLAS using the optimal layout configuration (normal-normal NN and transposed-normal TN); +16.8% over cuBLASLt-heuristic, which queries cuBLASLt library and selects the algorithm based on the heuristic’s suggestion; and +11.4% over the most competitive cuBLASLt-AutoTuning model, which selects the fastest algorithm from up to 100 candidates from cuBLASLt’s suggestions. In server mode, where kernels are executed at random intervals simulating real-time inference, the speedups further increase to +28.7%, +26.0%, +22.4%, and +15.9% for this http URL, cuBLAS, cuBLASLt-heuristic, and cuBLASLt-AutoTuning respectively. CUDA-L2 shows that even the most performance-critical, heavily-optimized kernels like HGEMM can be improved through LLM-guided RL automation by systematically exploring configuration spaces at scales impractical for humans. Project and code can be found at this http URL

[AI-41] Sparse Computations in Deep Learning Inference

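[Quick read]: This paper addresses the large computational, energy, and environmental footprint of deep neural network (DNN) inference, given that sparsity remains underused in production systems. Its contribution is a systematic survey of the role sparsity plays in optimizing DNN inference: it identifies the forms of sparsity applicable to inference, explains how dense computations translate to efficient sparse kernels, reviews the literature on implementing these kernels on CPUs and GPUs, and evaluates implementations of the key sparse matrix multiplication (SpMM) and sampled dense-dense matrix multiplication (SDDMM) kernels. By consolidating this knowledge and practice, the paper offers performance engineers guidance from theory through engineering, with the aim of getting efficient sparse models deployed in production.

Link: https://arxiv.org/abs/2512.02550
Authors: Ioanna Tasou,Panagiotis Mpakos,Angelos Vlachos,Dionysios Adamopoulos,Georgios Giannakopoulos,Konstantinos Katsikopoulos,Ioannis Karaparisis,Maria Lazou,Spyridon Loukovitis,Areti Mei,Anastasia Poulopoulou,Angeliki Dimitriou,Giorgos Filandrianos,Dimitrios Galanopoulos,Vasileios Karampinis,Ilias Mitsouras,Nikolaos Spanos,Petros Anastasiadis,Ioannis Doudalis,Konstantinos Nikas,George Retsinas,Paraskevi Tzouveli,Christina Giannoula,Nectarios Koziris,Nikela Papadopoulou,Giorgos Stamou,Athanasios Voulodimos,Georgios Goumas
Affiliations: unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract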

Abstract:The computational demands of modern Deep Neural Networks (DNNs) are immense and constantly growing. While training costs usually capture public attention, inference demands are also contributing in significant computational, energy and environmental footprints. Sparsity stands out as a critical mechanism for drastically reducing these resource demands. However, its potential remains largely untapped and is not yet fully incorporated in production AI systems. To bridge this gap, this work provides the necessary knowledge and insights for performance engineers keen to get involved in deep learning inference optimization. In particular, in this work we: a) discuss the various forms of sparsity that can be utilized in DNN inference, b) explain how the original dense computations translate to sparse kernels, c) provide an extensive bibliographic review of the state-of-the-art in the implementation of these kernels for CPUs and GPUs, d) discuss the availability of sparse datasets in support of sparsity-related research and development, e) explore the current software tools and frameworks that provide robust sparsity support, and f) present evaluation results of different implementations of the key SpMM and SDDMM kernels on CPU and GPU platforms. Ultimately, this paper aims to serve as a resource for performance engineers seeking to develop and deploy highly efficient sparse deep learning models in productions.
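
For readers unfamiliar with the two kernels named in this abstract, the sketch below shows what SpMM and SDDMM compute, using the CSR (compressed sparse row) layout in plain Python. It is a reference illustration only, with hypothetical function names, and is not drawn from the paper; production kernels are tiled, vectorized, and tuned per platform.

```python
def spmm_csr(indptr, indices, data, dense):
    """SpMM: multiply a CSR sparse matrix (indptr, indices, data) by a dense matrix."""
    n_rows = len(indptr) - 1
    n_cols_out = len(dense[0])
    out = [[0.0] * n_cols_out for _ in range(n_rows)]
    for row in range(n_rows):
        # Only the stored nonzeros of this row contribute to the output row.
        for k in range(indptr[row], indptr[row + 1]):
            col, val = indices[k], data[k]
            for j in range(n_cols_out):
                out[row][j] += val * dense[col][j]
    return out


def sddmm_csr(indptr, indices, data, a, b):
    """SDDMM: evaluate (a @ b) only at the stored positions, scaled by the stored values."""
    out = []
    for row in range(len(indptr) - 1):
        for k in range(indptr[row], indptr[row + 1]):
            col = indices[k]
            dot = sum(a[row][t] * b[t][col] for t in range(len(b)))
            out.append(data[k] * dot)
    return out  # values in CSR order; indptr and indices are reused unchanged


# Sparse matrix [[1, 0], [0, 2]] in CSR form.
indptr, indices, data = [0, 1, 2], [0, 1], [1.0, 2.0]

spmm_result = spmm_csr(indptr, indices, data, [[1.0, 2.0], [3.0, 4.0]])
sddmm_result = sddmm_csr(
    indptr, indices, data,
    [[1.0, 2.0], [3.0, 4.0]],  # dense a
    [[1.0, 0.0], [0.0, 1.0]],  # dense b (identity, so a @ b == a)
)
```

In both kernels the work is proportional to the number of stored nonzeros rather than to the full matrix size, which is where the savings from sparsity come from.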

[AI-42] Aetheria: A multimodal interpretable content safety framework based on multi-agent debate and collaboration

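[Quick read]: This paper addresses the limitations of current content safety moderation systems in identifying implicit risks and in providing interpretable judgment processes, limitations that typically stem from single-model or fixed-pipeline architectures. The key to the solution is Aetheria, a collaborative framework that combines multi-agent debate with retrieval-augmented generation (RAG): five core agents analyze and adjudicate multimodal content through a dynamic, mutually persuasive debate. Experiments show that Aetheria generates detailed, traceable audit reports and clearly outperforms baselines in overall content safety accuracy, particularly in identifying implicit risks, establishing a transparent and interpretable paradigm for trustworthy AI content moderation.

Link: https://arxiv.org/abs/2512.02530
Authors: Yuxiang He,Jian Zhao,Yuchen Yuan,Tianle Zhang,Wei Cai,Haojie Cheng,Ziyan Shi,Ming Zhu,Haichuan Tang,Chi Zhang,Xuelong Li
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: this https URL

Click to view abstract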

Abstract:The exponential growth of digital content presents significant challenges for content safety. Current moderation systems, often based on single models or fixed pipelines, exhibit limitations in identifying implicit risks and providing interpretable judgment processes. To address these issues, we propose Aetheria, a multimodal interpretable content safety framework based on multi-agent debate and this http URL a collaborative architecture of five core agents, Aetheria conducts in-depth analysis and adjudication of multimodal content through a dynamic, mutually persuasive debate mechanism, which is grounded by RAG-based knowledge this http URL experiments on our proposed benchmark (AIR-Bench) validate that Aetheria not only generates detailed and traceable audit reports but also demonstrates significant advantages over baselines in overall content safety accuracy, especially in the identification of implicit risks. This framework establishes a transparent and interpretable paradigm, significantly advancing the field of trustworthy AI content moderation.

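[AI-43] AskNearby: An LLM-Based Application for Neighborhood Information Retrieval and Personalized Cognitive-Map Recommendations

[Quick read]: This paper addresses the Local Life Information Accessibility (LLIA) problem: under the 15-minute city concept, residents may have physical proximity to services yet lack an efficient, reliable way to obtain the information they need for everyday decisions. The key to the solution is AskNearby, an AI-driven community application built on two core modules: (i) a three-layer Retrieval-Augmented Generation (RAG) pipeline that combines graph-based, semantic-vector, and geographic retrieval; and (ii) a cognitive-map model that encodes each user's familiarity with and preferences for their neighborhood. This design combines spatiotemporal grounding with cognitive-aware ranking, significantly improving retrieval accuracy and recommendation quality and helping residents discover local resources and take part in community life.

Link: https://arxiv.org/abs/2512.02502
Authors: Luyao Niu,Zhicheng Deng,Boyang Li,Nuoxian Huang,Ruiqi Liu,Wenjia Zhang
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract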

Abstract:The “15-minute city” envisions neighborhoods where residents can meet daily needs via a short walk or bike ride. Realizing this vision requires not only physical proximity but also efficient and reliable access to information about nearby places, services, and events. Existing location-based systems, however, focus mainly on city-level tasks and neglect the spatial, temporal, and cognitive factors that shape localized decision-making. We conceptualize this gap as the Local Life Information Accessibility (LLIA) problem and introduce AskNearby, an AI-driven community application that unifies retrieval and recommendation within the 15-minute life circle. AskNearby integrates (i) a three-layer Retrieval-Augmented Generation (RAG) pipeline that synergizes graph-based, semantic-vector, and geographic retrieval with (ii) a cognitive-map model that encodes each user’s neighborhood familiarity and preferences. Experiments on real-world community datasets demonstrate that AskNearby significantly outperforms LLM-based and map-based baselines in retrieval accuracy and recommendation quality, achieving robust performance in spatiotemporal grounding and cognitive-aware ranking. Real-world deployments further validate its effectiveness. By addressing the LLIA challenge, AskNearby empowers residents to more effectively discover local resources, plan daily activities, and engage in community life.

[AI-44] COPE: Chain-Of-Thought Prediction Engine for Open-Source Large Language Model Based Stroke Outcome Prediction from Clinical Notes

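[Quick read]: This paper addresses the prediction of 90-day functional outcomes in patients with acute ischemic stroke (AIS), in particular the challenge that the rich contextual information in clinical notes is unstructured and therefore hard for traditional predictive models to use. The key to the solution is the Chain-of-Thought (CoT) Outcome Prediction Engine (COPE), a reasoning-enhanced large language model framework with a two-step CoT design: a first LLaMA-3-8B model generates clinical reasoning, and a second outputs a modified Rankin Scale (mRS) prediction based on that reasoning. While remaining lightweight, interpretable, and privacy-preserving, COPE achieves performance comparable to GPT-4.1 and better than the other baselines (ClinicalBERT, a structured-variable machine learning model, and a single-step LLM).

Link: https://arxiv.org/abs/2512.02499
Authors: Yongkai Liu,Helena Feng,Bin Jiang,Yixin Wang,Max Wintermark,David S. Liebeskind,Michael Moseley,Maarten Lansberg,Gregory Albers,Jeremy Heit,Greg Zaharchuk
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract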

Abstract:Predicting outcomes in acute ischemic stroke (AIS) guides clinical decision-making, patient counseling, and resource allocation. Clinical notes contain rich contextual information, but their unstructured nature limits their use in traditional predictive models. We developed and evaluated the Chain-of-Thought (CoT) Outcome Prediction Engine (COPE), a reasoning-enhanced large language model framework, for predicting 90-day functional outcomes after AIS from unstructured clinical notes. This study included 464 AIS patients with discharge summaries and 90-day modified Rankin Scale (mRS) scores. COPE uses a two-step CoT framework based on sequential open-source LLaMA-3-8B models: the first generates clinical reasoning, and the second outputs an mRS prediction. We compared COPE with GPT-4.1, ClinicalBERT, a structured variable-based machine learning model (Clinical ML), and a single-step LLM without CoT. Performance was evaluated using mean absolute error (MAE), accuracy within +/-1 mRS point, and exact accuracy. COPE achieved an MAE of 1.01 (95% CI 0.92-1.11), +/-1 accuracy of 74.4% (69.9, 78.8%), and exact accuracy of 32.8% (28.0, 37.6%), comparable to GPT-4.1 and superior to ClinicalBERT [MAE 1.24 (1.13-1.36)], Clinical ML [1.28 (1.18-1.39)], and the single-step LLM [1.20 (1.09-1.33)]. Subgroup analyses showed consistent performance across sex and age, with slightly higher error among older patients, those undergoing thrombectomy, and those with longer summaries. These findings demonstrate that COPE, a lightweight, interpretable, and privacy-preserving open-source framework, provides an accurate and practical solution for outcome prediction from unstructured clinical text.
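
The three evaluation metrics named in this abstract (MAE, accuracy within ±1 mRS point, and exact accuracy) are simple to compute. The sketch below assumes integer mRS scores from 0 to 6; the function name `mrs_metrics` and the four example scores are hypothetical and are not data from the study.

```python
def mrs_metrics(y_true, y_pred):
    """Return MAE, within-one-point accuracy, and exact accuracy for mRS scores."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    n = len(errors)
    return {
        "mae": sum(errors) / n,                          # mean absolute error
        "within_one": sum(e <= 1 for e in errors) / n,   # prediction off by at most 1
        "exact": sum(e == 0 for e in errors) / n,        # prediction exactly right
    }


# Made-up example: four patients, observed vs. predicted 90-day mRS.
metrics = mrs_metrics([0, 2, 3, 6], [0, 3, 5, 6])
```

Reporting the ±1 accuracy alongside exact accuracy reflects that mRS is an ordinal scale, where a one-point miss is far less serious than a large one.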

[AI-45] Q-BERT4Rec: Quantized Semantic-ID Representation Learning for Multimodal Recommendation KDD2026

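[Quick read]: This paper addresses the problem that current Transformer-based sequential recommendation methods (such as BERT4Rec) rely on discrete item IDs that carry no semantic information and ignore multimodal data (such as text and images), which leads to weak generalization and limited interpretability. The key to the solution is the Q-Bert4Rec framework, which has three stages: first, cross-modal semantic injection, in which a dynamic transformer fuses textual, visual, and structural features to enrich the initial ID embeddings; second, semantic quantization, which discretizes the fused representations into semantically meaningful tokens via residual vector quantization; and third, pretraining and fine-tuning with several masking strategies (span, tail, and multi-region) to improve the model's understanding of sequential patterns. The method significantly improves multimodal sequential recommendation performance and interpretability.

Link: https://arxiv.org/abs/2512.02474
Authors: Haofeng Huang,Ling Gai
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Submitted to KDD2026

Click to view abstract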

Abstract:Sequential recommendation plays a critical role in modern online platforms such as e-commerce, advertising, and content streaming, where accurately predicting users’ next interactions is essential for personalization. Recent Transformer-based methods like BERT4Rec have shown strong modeling capability, yet they still rely on discrete item IDs that lack semantic meaning and ignore rich multimodal information (e.g., text and image). This leads to weak generalization and limited interpretability. To address these challenges, we propose Q-Bert4Rec, a multimodal sequential recommendation framework that unifies semantic representation and quantized modeling. Specifically, Q-Bert4Rec consists of three stages: (1) cross-modal semantic injection, which enriches randomly initialized ID embeddings through a dynamic transformer that fuses textual, visual, and structural features; (2) semantic quantization, which discretizes fused representations into meaningful tokens via residual vector quantization; and (3) multi-mask pretraining and fine-tuning, which leverage diverse masking strategies – span, tail, and multi-region – to improve sequential understanding. We validate our model on public Amazon benchmarks and demonstrate that Q-Bert4Rec significantly outperforms many strong existing methods, confirming the effectiveness of semantic tokenization for multimodal sequential recommendation. Our source code will be publicly available on GitHub after publishing.
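
The second stage above, residual vector quantization, turns a continuous vector into a short sequence of discrete tokens by quantizing it against one codebook, then quantizing what is left over against the next. The sketch below shows only that generic mechanism with fixed, made-up codebooks; in the paper the codebooks are learned, and the names `nearest` and `residual_quantize` are hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def residual_quantize(vec, codebooks):
    """Quantize vec level by level; return one token per level and the final residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen code so the next level only models the leftover error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual


# Two made-up codebooks: a coarse level followed by a finer correction level.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.5, -0.5]],
]
tokens, residual = residual_quantize([1.4, 0.6], codebooks)
```

Here the item vector `[1.4, 0.6]` becomes the token sequence `[1, 1]`, and the small remaining residual is the quantization error after both levels.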

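[AI-46] TabGRU: An Enhanced Design for Urban Rainfall Intensity Estimation Using Commercial Microwave Links

[Quick read]: This paper addresses the limited accuracy of traditional physics-based models for urban rainfall monitoring with Commercial Microwave Links (CMLs), which struggle with real-world complexities such as signal noise and nonlinear attenuation. The key to the solution is TabGRU, a hybrid deep learning architecture that combines the Transformer with a Bidirectional Gated Recurrent Unit (BiGRU) to capture both long-term dependencies and local sequential features in CML signal data, and that adds a learnable positional embedding and an attention pooling mechanism to improve dynamic feature extraction and generalization. On real rainfall events it yields more accurate rainfall estimates and is particularly effective at mitigating the physics-based model's significant overestimation during peak rainfall.

Link: https://arxiv.org/abs/2512.02465
Authors: Xingwang Li,Mengyun Chen,Jiamou Liu,Sijie Wang,Shuanggen Jin,Jafet C. M. Andersson,Jonas Olsson,Remco (C. Z.) van de Beek,Hai Victor Habi,Congzheng Han
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract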

Abstract:In the face of accelerating global urbanization and the increasing frequency of extreme weather events, high-resolution urban rainfall monitoring is crucial for building resilient smart cities. Commercial Microwave Links (CMLs) are an emerging data source with great potential for this this http URL traditional rainfall retrieval from CMLs relies on physics-based models, these often struggle with real-world complexities like signal noise and nonlinear attenuation. To address these limitations, this paper proposes a novel hybrid deep learning architecture based on the Transformer and a Bidirectional Gated Recurrent Unit (BiGRU), which we name TabGRU. This design synergistically captures both long-term dependencies and local sequential features in the CML signal data. The model is further enhanced by a learnable positional embedding and an attention pooling mechanism to improve its dynamic feature extraction and generalization capabilities. The model was validated on a public benchmark dataset from Gothenburg, Sweden (June-September 2015). The evaluation used 12 sub-links from two rain gauges (Torp and Barl) over a test period (August 22-31) covering approximately 10 distinct rainfall events. The proposed TabGRU model demonstrated consistent advantages, outperforming deep learning baselines and achieving high coefficients of determination (R2) at both the Torp site (0.91) and the Barl site (0.96). Furthermore, compared to the physics-based approach, TabGRU maintained higher accuracy and was particularly effective in mitigating the significant overestimation problem observed in the PL model during peak rainfall events. This evaluation confirms that the TabGRU model can effectively overcome the limitations of traditional methods, providing a robust and accurate solution for CML-based urban rainfall monitoring under the tested conditions.

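[AI-47] Semantic Trading: Agentic AI for Clustering and Relationship Discovery in Prediction Markets

[Quick read]: This paper addresses fragmentation in prediction markets, where overlapping questions, implicit equivalences, and hidden contradictions obscure market structure and hinder trading efficiency and information aggregation. The key to the solution is an agentic AI pipeline that (i) clusters markets into semantically coherent topical groups by applying natural-language understanding to contract text and metadata, and (ii) identifies, within each cluster, market pairs whose outcomes are strongly dependent, covering both same-outcome (correlated) and different-outcome (anti-correlated) relationships. On historical data, the relationship predictions reach roughly 60-70% accuracy, and a simple trading strategy derived from them earns about 20% average returns over week-long horizons, showing that large language models can uncover latent semantic structure in prediction markets.

Link: https://arxiv.org/abs/2512.02436
Authors: Agostino Capponi,Alfio Gliozzo,Brian Zhu
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract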

Abstract:Prediction markets allow users to trade on outcomes of real-world events, but are prone to fragmentation through overlapping questions, implicit equivalences, and hidden contradictions across markets. We present an agentic AI pipeline that autonomously (i) clusters markets into coherent topical groups using natural-language understanding over contract text and metadata, and (ii) identifies within-cluster market pairs whose resolved outcomes exhibit strong dependence, including same-outcome (correlated) and different-outcome (anti-correlated) relationships. Using a historical dataset of resolved markets on Polymarket, we evaluate the accuracy of the agent’s relational predictions. We then translate discovered relationships into a simple trading strategy to quantify how these relationships map to actionable signals. Results show that agent-identified relationships achieve roughly 60-70% accuracy, and their induced trading strategies earn about 20% average returns over week-long horizons, highlighting the ability of agentic AI and large language models to uncover latent semantic structure in prediction markets.

[AI-48] Vehicle Dynamics Embedded World Models for Autonomous Driving

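[Quick read]: This paper addresses the inefficiency and lack of robustness to vehicle-dynamics variations that arise when existing world models for autonomous driving learn ego-vehicle dynamics and environmental transition dynamics jointly. The key to the solution is the Vehicle Dynamics embedded Dreamer (VDD), which decouples the modeling of ego-vehicle dynamics from environmental transition dynamics so that the world model generalizes across vehicles with different parameters. Two further strategies, Policy Adjustment during Deployment (PAD) and Policy Augmentation during Training (PAT), increase the robustness of the learned policy.

Link: https://arxiv.org/abs/2512.02417
Authors: Huiqian Li,Wei Pan,Haodong Zhang,Jin Huang,Zhihua Zhong
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract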

Abstract:World models have gained significant attention as a promising approach for autonomous driving. By emulating human-like perception and decision-making processes, these models can predict and adapt to dynamic environments. Existing methods typically map high-dimensional observations into compact latent spaces and learn optimal policies within these latent representations. However, prior work usually jointly learns ego-vehicle dynamics and environmental transition dynamics from the image input, leading to inefficiencies and a lack of robustness to variations in vehicle dynamics. To address these issues, we propose the Vehicle Dynamics embedded Dreamer (VDD) method, which decouples the modeling of ego-vehicle dynamics from environmental transition dynamics. This separation allows the world model to generalize effectively across vehicles with diverse parameters. Additionally, we introduce two strategies to further enhance the robustness of the learned policy: Policy Adjustment during Deployment (PAD) and Policy Augmentation during Training (PAT). Comprehensive experiments in simulated environments demonstrate that the proposed model significantly improves both driving performance and robustness to variations in vehicle dynamics, outperforming existing approaches.

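[AI-49] Data Curation Through the Lens of Spectral Dynamics: Static Limits, Dynamic Acceleration, and Practical Oracles

[Quick read]: This paper addresses why data-centric strategies for training large-scale neural models (data pruning, synthetic data generation, cross-model distillation, reinforcement learning from human feedback (RLHF), and difficulty-based sampling) vary so much in effectiveness, and in particular why self-generated synthetic data often inflates dataset size without improving model capability. The key to the solution is to formalize data curation as a reweighting of the sampling distribution and to map its effect onto the eigenstructure of the data-induced operator. The analysis shows that static pruning induces only a bounded operator and cannot change the spectral tail exponent, so it yields at most finite-region improvements and cannot alter asymptotic neural scaling. For time-dependent curation, the paper shows that an ideal oracle that tracks spectral residuals and continuously re-normalizes the tail can provably accelerate learning, although practical systems can only approximate this behavior.

Link: https://arxiv.org/abs/2512.02409
Authors: Yizhou Zhang,Lun Du
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract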

Abstract:Large-scale neural models are increasingly trained with data pruning, synthetic data generation, cross-model distillation, reinforcement learning from human feedback (RLHF), and difficulty-based sampling. While several of these data-centric strategies reliably improve training efficiency and downstream performance, others fail to provide meaningful gains – most notably self-generated synthetic data, which often increases dataset volume without enhancing model capability. We formalize data curation as reweighting the sampling distribution and map its effect onto the eigenstructure of the data-induced operator. Our first main result shows that static pruning induces a bounded operator and therefore cannot change the spectral tail exponent; it provides at most finite-region improvements and cannot alter asymptotic neural scaling. Our second result analyzes time-dependent data curation, showing that an ideal oracle capable of tracking spectral residuals and continuously re-normalizing the tail can provably accelerate learning – although practical systems can only approximate this behavior.

[AI-50] Synthetic Error Injection Fails to Elicit Self-Correction In Language Models

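[Quick read]: This paper asks whether large language models (LLMs) can be induced to self-correct through more efficient supervised learning rather than computationally expensive reinforcement learning (RL). The approach studied is supervised learning with synthetic error injection: artificial errors are inserted into reasoning chains and masked, and the model is trained to recognize and correct them. The study finds that the method fails to significantly improve performance across multiple models, and that even when the model catches an error it often repeats the original mistake. Further analysis shows that the distribution shift between synthetic errors and on-policy errors significantly degrades error-correction ability, which helps explain why on-policy RL methods are uniquely effective at eliciting self-correction.

Link: https://arxiv.org/abs/2512.02389
Authors: David X. Wu,Shreyas Kapur,Anant Sahai,Stuart Russell
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 12 figures

Click to view abstract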

Abstract:Reinforcement learning has become the dominant paradigm for eliciting reasoning and self-correction capabilities in large language models, but its computational expense motivates exploration of alternatives. Inspired by techniques from autonomous driving and robotics, we investigate whether supervised learning with synthetic error injection can induce self-correction abilities in language models. Our approach inserts artificial errors into reasoning chains, masks them, and supervises the model to recognize and correct these mistakes. Despite the intuitive appeal of this method, we find that it fails to significantly improve performance even on simple synthetic tasks across multiple models. Moreover, even when the model catches its own error, it often parrots the original mistake. We find that the distribution shift of synthetic errors to on-policy errors significantly degrades the error-correction capabilities of the fine-tuned model, even with good synthetic coverage of on-policy errors. Our results help explain why on-policy reinforcement learning methods have proven uniquely effective for eliciting self-correction.

[AI-51] Beyond Playtesting: A Generative Multi-Agent Simulation System for Massively Multiplayer Online Games

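[Quick read]: This paper addresses the high cost, low efficiency, and disruption to player experience involved in optimizing numerical systems and mechanism design for Massively Multiplayer Online (MMO) games. Traditional approaches rely on large-scale online experiments or parameter tuning over predefined statistical models, which are time-consuming and costly and may affect real players; existing simplified offline simulation systems lack the fidelity to mimic how players reason and react under interventions. The key to the solution is a generative-AI-based MMO simulation system: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on large-scale real player behavioral data adapt large language models (LLMs) from general priors to the game-specific domain, enabling realistic and interpretable modeling of player decisions, while a data-driven environment model trained on real gameplay logs reconstructs dynamic in-game systems. The result is a reliable, interpretable, and low-cost framework for data-driven numerical design optimization.

Link: https://arxiv.org/abs/2512.02358
Authors: Ran Zhang,Kun Ouyang,Tiancheng Ma,Yida Yang,Dong Fang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract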

Abstract:Optimizing numerical systems and mechanism design is crucial for enhancing player experience in Massively Multiplayer Online (MMO) games. Traditional optimization approaches rely on large-scale online experiments or parameter tuning over predefined statistical models, which are costly, time-consuming, and may disrupt player experience. Although simplified offline simulation systems are often adopted as alternatives, their limited fidelity prevents agents from accurately mimicking real player reasoning and reactions to interventions. To address these limitations, we propose a generative agent-based MMO simulation system empowered by Large Language Models (LLMs). By applying Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on large-scale real player behavioral data, we adapt LLMs from general priors to game-specific domains, enabling realistic and interpretable player decision-making. In parallel, a data-driven environment model trained on real gameplay logs reconstructs dynamic in-game systems. Experiments demonstrate strong consistency with real-world player behaviors and plausible causal responses under interventions, providing a reliable, interpretable, and cost-efficient framework for data-driven numerical design optimization.

[AI-52] FOVA: Offline Federated Reinforcement Learning with Mixed-Quality Data

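[Quick read]: This paper addresses the sharp performance drop of Offline Federated Reinforcement Learning (Offline FRL) on mixed-quality data, that is, when the offline data at each client are collected by behavioral policies of varying quality and existing methods fail to keep policy optimization stable and efficient. The key to the solution is FOVA, a new vote-based offline FRL framework with two core ideas: 1) a vote mechanism that identifies high-return actions during local policy evaluation, reducing the negative effect of low-quality behaviors on global policy learning; 2) consistent local and global training objectives built on Advantage-Weighted Regression (AWR), which improve training efficiency and stability. Theoretical analysis further shows that the policy learned by FOVA strictly improves over the behavioral policy, and experiments confirm significant gains on widely used benchmarks.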

Link: https://arxiv.org/abs/2512.02350
Authors: Nan Qiao, Sheng Yue, Ju Ren, Yaoxue Zhang
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by IEEE/ACM ToN

Abstract:Offline Federated Reinforcement Learning (FRL), a marriage of federated learning and offline reinforcement learning, has attracted increasing interest recently. Albeit with some advancement, we find that the performance of most existing offline FRL methods drops dramatically when provided with mixed-quality data, that is, the logging behaviors (offline data) are collected by policies with varying qualities across clients. To overcome this limitation, this paper introduces a new vote-based offline FRL framework, named FOVA. It exploits a vote mechanism to identify high-return actions during local policy evaluation, alleviating the negative effect of low-quality behaviors from diverse local learning policies. Besides, building on advantage-weighted regression (AWR), we construct consistent local and global training objectives, significantly enhancing the efficiency and stability of FOVA. Further, we conduct an extensive theoretical analysis and rigorously show that the policy learned by FOVA enjoys strict policy improvement over the behavioral policy. Extensive experiments corroborate the significant performance gains of our proposed algorithm over existing baselines on widely used benchmarks.
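
To make the advantage-weighted regression (AWR) ingredient concrete, here is a minimal sketch of the usual AWR weighting, in which each logged action is weighted by its exponentiated advantage. It is not FOVA's objective: the temperature `beta`, the clipping value `max_weight`, and both function names are assumptions chosen for illustration, and the vote mechanism is not modeled.

```python
import math

def awr_weights(advantages, beta=1.0, max_weight=20.0):
    """Exponentiated-advantage weights, clipped for numerical stability."""
    # beta and max_weight are illustrative assumptions, not values from the paper.
    cap = math.log(max_weight)
    return [math.exp(min(a / beta, cap)) for a in advantages]

def awr_objective(log_probs, advantages, beta=1.0):
    """Weighted log-likelihood of the logged actions (to be maximized)."""
    weights = awr_weights(advantages, beta)
    return sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)

# Actions with a higher estimated advantage receive larger weights.
print(awr_weights([-1.0, 0.0, 1.0]))
```

Clipping the exponent before calling `math.exp` avoids overflow for very large advantages; whether and how FOVA bounds its weights is not stated in the abstract.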

[AI-53] COGNITION: From Evaluation to Defense against Multimodal LLM CAPTCHA Solvers

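[Quick read]: This paper examines the threat that Multimodal Large Language Models (MLLMs) pose to visual CAPTCHA security, namely whether MLLMs can cheaply automate the solving of existing CAPTCHA systems. The key to its approach is a systematic evaluation of 7 leading commercial and open-source MLLMs on 18 real-world CAPTCHA task types, measuring single-shot accuracy, success under limited retries, end-to-end latency, and per-solve cost, together with an analysis of how task-specific prompt engineering and few-shot demonstrations affect solver effectiveness. The study finds that MLLMs reliably solve recognition-oriented and low-interaction CAPTCHA tasks at human-like cost and latency, while tasks requiring fine-grained localization, multi-step spatial reasoning, or cross-frame consistency remain significantly harder. By examining the models' reasoning traces, the authors investigate why models succeed or fail and derive defense-oriented guidelines for selecting and strengthening CAPTCHA tasks, giving platform operators a practical basis for deploying CAPTCHA as an abuse-mitigation mechanism.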

Link: https://arxiv.org/abs/2512.02318
Authors: Junyu Wang, Changjia Zhu, Yuanbo Zhou, Lingyao Li, Xu He, Junjie Xiong
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper studies how multimodal large language models (MLLMs) undermine the security guarantees of visual CAPTCHA. We identify the attack surface where an adversary can cheaply automate CAPTCHA solving using off-the-shelf models. We evaluate 7 leading commercial and open-source MLLMs across 18 real-world CAPTCHA task types, measuring single-shot accuracy, success under limited retries, end-to-end latency, and per-solve cost. We further analyze the impact of task-specific prompt engineering and few-shot demonstrations on solver effectiveness. We reveal that MLLMs can reliably solve recognition-oriented and low-interaction CAPTCHA tasks at human-like cost and latency, whereas tasks requiring fine-grained localization, multi-step spatial reasoning, or cross-frame consistency remain significantly harder for current models. By examining the reasoning traces of such MLLMs, we investigate the underlying mechanisms of why models succeed/fail on specific CAPTCHA puzzles and use these insights to derive defense-oriented guidelines for selecting and strengthening CAPTCHA tasks. We conclude by discussing implications for platform operators deploying CAPTCHA as part of their abuse-mitigation this http URL Availability (this https URL).
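
The abstract reports single-shot accuracy and success under limited retries as separate measurements. As a rough illustration of how the two relate, the sketch below assumes every attempt succeeds independently with the same probability, which the paper does not claim; the function names and the numbers are hypothetical.

```python
def success_within_retries(p_single, max_attempts):
    """Probability that at least one of max_attempts independent tries succeeds."""
    return 1.0 - (1.0 - p_single) ** max_attempts

def expected_attempts(p_single, max_attempts):
    """Expected attempts used when stopping at the first success or at the cap."""
    # Attempt i + 1 is made only if the previous i attempts all failed.
    return sum((1.0 - p_single) ** i for i in range(max_attempts))

# Hypothetical numbers: 50% single-shot accuracy, at most 3 attempts.
print(success_within_retries(0.5, 3))
print(expected_attempts(0.5, 3))
```

Multiplying `expected_attempts` by a per-attempt price gives a first estimate of the cost per puzzle under the same independence assumption.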

[AI-54] Breast Cell Segmentation Under Extreme Data Constraints: Quantum Enhancement Meets Adaptive Loss Stabilization

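[Quick read]: This paper addresses the time and expertise required to annotate medical images, specifically the performance bottleneck in mammary epithelial nuclei segmentation caused by scarce annotations, severe class imbalance (breast tissue accounts for only 4% of pixels and 60% of images contain no breast regions), and ambiguous boundaries (inter-annotator variation reaches ±3 pixels). The core of the solution comprises: 1) quantum-inspired edge enhancement with multi-scale Gabor filters, supplied as a fourth input channel to improve boundary detection; 2) a stabilized multi-component loss that combines adaptive Dice loss with boundary-aware terms and automatic positive weighting to counter class imbalance; 3) a complexity-based weighted sampling strategy that prioritizes hard regions (such as small lesions); 4) an EfficientNet-B7/UNet++ architecture with a 4-to-3 channel projection, so that pretrained weights can be used despite limited annotated data; 5) exponential moving averaging and statistical outlier detection for robust evaluation on a small validation set (129 images). With only 599 training images the method reaches a 95.5% Dice score, which reduces pathologists' annotation time and eases a bottleneck in clinical perception AI development.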

Link: https://arxiv.org/abs/2512.02302
Authors: Varun Kumar Dasoju, Qingsu Cheng, Zeyun Yu
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures, 2 tables

Abstract:Annotating medical images demands significant time and expertise, often requiring pathologists to invest hundreds of hours in labeling mammary epithelial nuclei datasets. We address this critical challenge by achieving 95.5% Dice score using just 599 training images for breast cell segmentation, where just 4% of pixels represent breast tissue and 60% of images contain no breast regions. Our framework uses quantum-inspired edge enhancement via multi-scale Gabor filters creating a fourth input channel, enhancing boundary detection where inter-annotator variations reach +/- 3 pixels. We present a stabilized multi-component loss function that integrates adaptive Dice loss with boundary-aware terms and automatic positive weighting to effectively address severe class imbalance, where mammary epithelial cell regions comprise only 0.1%-20% of the total image area. Additionally, a complexity-based weighted sampling strategy is introduced to prioritize the challenging mammary epithelial cell regions. The model employs an EfficientNet-B7/UNet++ architecture with a 4-to-3 channel projection, enabling the use of pretrained weights despite limited medical imaging data. Finally, robust validation is achieved through exponential moving averaging and statistical outlier detection, ensuring reliable performance estimates on a small validation set (129 images). Our framework achieves a Dice score of 95.5% +/- 0.3% and an IoU of 91.2% +/- 0.4%. Notably, quantum-based enhancement contributes to a 2.1% improvement in boundary accuracy, while weighted sampling increases small lesion detection by 3.8%. By achieving groundbreaking performance with limited annotations, our approach significantly reduces the medical expert time required for dataset creation, addressing a fundamental bottleneck in clinical perception AI development.
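
For readers unfamiliar with the two reported metrics, the sketch below computes the Dice score and IoU for a pair of flattened binary masks. It shows the standard definitions only, not the authors' evaluation code; returning 1.0 when both masks are empty is an assumed convention (relevant because many images contain no breast regions), and the function name is hypothetical.

```python
def dice_and_iou(pred, target):
    """Dice score and IoU for two equally long sequences of 0/1 labels."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    if pred_sum + target_sum == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0, 1.0
    union = pred_sum + target_sum - inter
    dice = 2.0 * inter / (pred_sum + target_sum)
    iou = inter / union
    return dice, iou

# One overlapping pixel out of two predicted and two annotated pixels.
print(dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0]))
```

For a single mask pair the two metrics are tied by IoU = Dice / (2 - Dice), so they rank predictions identically; averages over many images need not satisfy that identity exactly.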

[AI-55] Model Recovery at the Edge under Resource Constraints for Physical AI ECAI2025

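[Quick read]: This paper addresses the efficiency bottleneck of deploying Model Recovery (MR) on edge devices, in particular the high memory footprint and energy consumption caused by the iterative solvers of Neural Ordinary Differential Equations (NODEs). The key to the solution is MERINDA, an FPGA-accelerated framework that replaces the iterative solver with a parallelizable neural architecture equivalent to NODEs. It achieves nearly 11x lower DRAM usage and 2.2x faster runtime than mobile GPUs while keeping modeling accuracy comparable to NODEs, making it better suited to resource-constrained, real-time Mission-Critical Autonomous Systems (MCAS).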

Link: https://arxiv.org/abs/2512.02283
Authors: Bin Xu, Ayan Banerjee, Sandeep K.S. Gupta
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Published in ECAI 2025, Frontiers in Artificial Intelligence and Applications, volume 413, pages 3904-3911

Abstract:Model Recovery (MR) enables safe, explainable decision making in mission-critical autonomous systems (MCAS) by learning governing dynamical equations, but its deployment on edge devices is hindered by the iterative nature of neural ordinary differential equations (NODEs), which are inefficient on FPGAs. Memory and energy consumption are the main concerns when applying MR on edge devices for real-time operation. We propose MERINDA, a novel FPGA-accelerated MR framework that replaces iterative solvers with a parallelizable neural architecture equivalent to NODEs. MERINDA achieves nearly 11x lower DRAM usage and 2.2x faster runtime compared to mobile GPUs. Experiments reveal an inverse relationship between memory and energy at fixed accuracy, highlighting MERINDA’s suitability for resource-constrained, real-time MCAS.

[AI-56] DialogGuard: Multi-Agent Psychosocial Safety Evaluation of Sensitive LLM Responses

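[Quick read]: This paper addresses the lack of systematic evaluation of the psychosocial risks of content generated by Large Language Models (LLMs) in emotionally sensitive settings such as mental health and crisis intervention. The key to the solution is DialogGuard, a multi-agent evaluation framework that uses four LLM-as-a-judge pipelines (single-agent scoring, dual-agent correction, multi-agent debate, and stochastic majority voting), grounded in a shared three-level rubric, to score five high-severity risk dimensions: privacy violations, discriminatory behaviour, mental manipulation, psychological harm, and insulting behaviour. Experiments show that dual-agent correction and majority voting offer the best trade-off between accuracy, alignment with human ratings, and robustness, and that multi-agent mechanisms outperform non-LLM baselines and single-agent judging, giving an explainable and scalable automated evaluation tool for the safety of generative AI in settings with vulnerable users.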

Link: https://arxiv.org/abs/2512.02282
Authors: Han Luo, Guy Laban
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments:

Abstract:Large language models (LLMs) now mediate many web-based mental-health, crisis, and other emotionally sensitive services, yet their psychosocial safety in these settings remains poorly understood and weakly evaluated. We present DialogGuard, a multi-agent framework for assessing psychosocial risks in LLM-generated responses along five high-severity dimensions: privacy violations, discriminatory behaviour, mental manipulation, psychological harm, and insulting behaviour. DialogGuard can be applied to diverse generative models through four LLM-as-a-judge pipelines, including single-agent scoring, dual-agent correction, multi-agent debate, and stochastic majority voting, grounded in a shared three-level rubric usable by both human annotators and LLM judges. Using PKU-SafeRLHF with human safety annotations, we show that multi-agent mechanisms detect psychosocial risks more accurately than non-LLM baselines and single-agent judging; dual-agent correction and majority voting provide the best trade-off between accuracy, alignment with human ratings, and robustness, while debate attains higher recall but over-flags borderline cases. We release DialogGuard as open-source software with a web interface that provides per-dimension risk scores and explainable natural-language rationales. A formative study with 12 practitioners illustrates how it supports prompt design, auditing, and supervision of web-facing applications for vulnerable users.
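
As an illustration of the majority-voting pipeline, the sketch below aggregates several judge scores on a three-level rubric. The integer encoding 0/1/2 and the tie-break toward the more severe level are assumptions made for this example, not details taken from DialogGuard, and the function name is hypothetical.

```python
from collections import Counter

def majority_vote(scores):
    """Return the most frequent rubric level among the judges' scores."""
    counts = Counter(scores)
    top = max(counts.values())
    # Assumed tie-break: prefer the more severe (higher) level.
    return max(level for level, count in counts.items() if count == top)

# Five sampled judgments for one response on one risk dimension.
print(majority_vote([0, 1, 1, 2, 1]))
```

Breaking ties toward the higher level errs on the side of flagging; a deployment that is more concerned about over-flagging could break ties the other way.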

[AI-57] TradeTrap: Are LLM-based Trading Agents Truly Reliable and Faithful?

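[Quick read]: This paper addresses the lack of reliability and robustness evaluation for autonomous trading agents based on Large Language Models (LLMs) in real financial environments; in particular, their behavior under adversarial or faulty conditions has not been systematically tested. The key to the solution is TradeTrap, a unified evaluation framework that applies controlled system-level perturbations to the four core components of an autonomous trading agent (market intelligence, strategy formulation, portfolio and ledger handling, and trade execution) in a closed-loop historical backtesting setting, and measures the resulting vulnerability. Experiments show that a small perturbation at a single component can propagate through the decision loop and cause extreme concentration, runaway exposure, and large portfolio drawdowns, revealing that current agents can be systematically misled at the system level.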

Link: https://arxiv.org/abs/2512.02261
Authors: Lewen Yan, Jilin Mei, Tianyi Zhou, Lige Huang, Jie Zhang, Dongrui Liu, Jing Shao
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:LLM-based trading agents are increasingly deployed in real-world financial markets to perform autonomous analysis and execution. However, their reliability and robustness under adversarial or faulty conditions remain largely unexamined, despite operating in high-risk, irreversible financial environments. We propose TradeTrap, a unified evaluation framework for systematically stress-testing both adaptive and procedural autonomous trading agents. TradeTrap targets four core components of autonomous trading agents: market intelligence, strategy formulation, portfolio and ledger handling, and trade execution, and evaluates their robustness under controlled system-level perturbations. All evaluations are conducted in a closed-loop historical backtesting setting on real US equity market data with identical initial conditions, enabling fair and reproducible comparisons across agents and attacks. Extensive experiments show that small perturbations at a single component can propagate through the agent decision loop and induce extreme concentration, runaway exposure, and large portfolio drawdowns across both agent types, demonstrating that current autonomous trading agents can be systematically misled at the system level. Our code is available at this https URL.

[AI-58] Benchmarking LLM Agents for Wealth-Management Workflows

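[Quick read]: This paper addresses the human error and delay that persist in routine processes despite digital collaboration tools, in the setting of wealth management. Its goal is to evaluate whether a general-purpose Large Language Model (LLM) agent can complete assistant-level wealth-management tasks accurately and economically. The key to the solution is a benchmark of 12 task pairs spanning retrieval, analysis, and synthesis/communication, supported by finance-specific synthetic data, high- and low-autonomy variants of every task, and an automatic task-generation pipeline. This enables a quantitative assessment of end-to-end workflow reliability and shows that agents are limited less by mathematical reasoning than by workflow reliability, and that autonomy level meaningfully affects performance.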

Link: https://arxiv.org/abs/2512.02230
Authors: Rory Milsom
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 56 pages, 8 figures, The University of Edinburgh

Abstract:Modern work relies on an assortment of digital collaboration tools, yet routine processes continue to suffer from human error and delay. To address this gap, this dissertation extends TheAgentCompany with a finance-focused environment and investigates whether a general purpose LLM agent can complete representative wealth-management tasks both accurately and economically. This study introduces synthetic domain data, enriches colleague simulations, and prototypes an automatic task-generation pipeline. The study aims to create and assess an evaluation set that can meaningfully measure an agent’s fitness for assistant-level wealth management work. We construct a benchmark of 12 task-pairs for wealth management assistants spanning retrieval, analysis, and synthesis/communication, with explicit acceptance criteria and deterministic graders. We seeded a set of new finance-specific data and introduced a high vs. low-autonomy variant of every task. The paper concluded that agents are limited less by mathematical reasoning and more so by end-to-end workflow reliability, and meaningfully affected by autonomy level, and that incorrect evaluation of models have hindered benchmarking.

[AI-59] STRIDE: A Systematic Framework for Selecting AI Modalities - Agentic AI, AI Assistants, or LLM Calls NEURIPS2025

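[Quick read]: This paper addresses a key question in current generative AI applications: how to choose a deployment modality for a given task and avoid the cost, complexity, and risk of adopting fully autonomous agents (agentic AI) indiscriminately. The core challenge is to distinguish tasks that truly need an autonomous agent with multi-step reasoning, persistent memory, and tool orchestration from those that need only a simple Large Language Model (LLM) call or a guided assistant. The key to the solution is the STRIDE framework, which combines structured task decomposition, dynamism attribution, and self-reflection requirement analysis to produce an Agentic Suitability Score and recommend one of three modalities: direct LLM call, guided AI assistant, or fully autonomous agent. On 30 real-world tasks STRIDE reached 92% accuracy in modality selection, reduced unnecessary agent deployments by 45%, and cut resource costs by 37%, moving agent adoption from an experience-driven to a necessity-driven design decision.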

Link: https://arxiv.org/abs/2512.02228
Authors: Shubhi Asthana, Bing Zhang, Chad DeLuca, Ruchi Mahindru, Hima Patel
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 4 Figures, 5 Tables. Paper presented at NeurIPS 2025 LAW workshop: Bridging Language, Agent, and World Models

Abstract:The rapid shift from stateless large language models (LLMs) to autonomous, goal-driven agents raises a central question: When is agentic AI truly necessary? While agents enable multi-step reasoning, persistent memory, and tool orchestration, deploying them indiscriminately leads to higher cost, complexity, and risk. We present STRIDE (Systematic Task Reasoning Intelligence Deployment Evaluator), a framework that provides principled recommendations for selecting between three modalities: (i) direct LLM calls, (ii) guided AI assistants, and (iii) fully autonomous agentic AI. STRIDE integrates structured task decomposition, dynamism attribution, and self-reflection requirement analysis to produce an Agentic Suitability Score, ensuring that full agentic autonomy is reserved for tasks with inherent dynamism or evolving context. Evaluated across 30 real-world tasks spanning SRE, compliance, and enterprise automation, STRIDE achieved 92% accuracy in modality selection, reduced unnecessary agent deployments by 45%, and cut resource costs by 37%. Expert validation over six months in SRE and compliance domains confirmed its practical utility, with domain specialists agreeing that STRIDE effectively distinguishes between tasks requiring simple LLM calls, guided assistants, or full agentic autonomy. This work reframes agent adoption as a necessity-driven design decision, ensuring autonomy is applied only when its benefits justify the costs.

[AI-60] Orchestration Framework for Financial Agents: From Algorithmic Trading to Agentic Trading NEURIPS2025

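[Quick read]: This paper addresses the high barrier to building traditional algorithmic trading systems, which typically requires a professional team and years of iteration. It proposes an orchestration framework for financial agents that maps each core component of an algorithmic trading system (planner, execution, risk control, portfolio management, and so on) to an independent agent, with the aim of democratizing financial intelligence. The key to the solution is a multi-agent collaborative architecture in which the functional components of a traditional system become agents that work together. Two in-house examples report results above the reference index or asset: the stock task reached a Sharpe ratio of 2.63, and the BTC task returned 8.39% against a 3.80% rise in the BTC price.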

Link: https://arxiv.org/abs/2512.02227
Authors: Jifeng Li, Arnav Grover, Abraham Alpuerto, Yupeng Cao, Xiao-Yang Liu
Affiliations: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: Accepted at the Workshop on Generative AI in Finance, 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

Abstract:The financial market is a mission-critical playground for AI agents due to its temporal dynamics and low signal-to-noise ratio. Building an effective algorithmic trading system may require a professional team to develop and test over the years. In this paper, we propose an orchestration framework for financial agents, which aims to democratize financial intelligence to the general public. We map each component of the traditional algorithmic trading system to agents, including planner, orchestrator, alpha agents, risk agents, portfolio agents, backtest agents, execution agents, audit agents, and memory agent. We present two in-house trading examples. For the stock trading task (hourly data from 04/2024 to 12/2024), our approach achieved a return of 20.42%, a Sharpe ratio of 2.63, and a maximum drawdown of -3.59%, while the S&P 500 index yielded a return of 15.97%. For the BTC trading task (minute data from 27/07/2025 to 13/08/2025), our approach achieved a return of 8.39%, a Sharpe ratio of 0.38, and a maximum drawdown of -2.80%, whereas the BTC price increased by 3.80%. Our code is available on GitHub (this https URL).
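
The Sharpe ratio and maximum drawdown quoted above are standard backtest statistics. The sketch below shows one common way to compute them from per-period returns and an equity curve; the annualization factor, the zero risk-free rate, and the function names are assumptions for illustration, since the abstract does not say how the authors computed their figures.

```python
import math

def sharpe_ratio(returns, periods_per_year, risk_free=0.0):
    """Annualized Sharpe ratio from per-period returns (sample standard deviation)."""
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    var = sum((x - mean) ** 2 for x in excess) / (len(excess) - 1)
    if var == 0.0:
        raise ValueError("returns have zero variance")
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)

def max_drawdown(equity):
    """Largest peak-to-trough decline of an equity curve, as a negative fraction."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = min(worst, value / peak - 1.0)
    return worst

# Hypothetical equity curve: the drop from 120 to 90 is a 25% drawdown.
print(max_drawdown([100.0, 120.0, 90.0, 130.0]))
```

Because the Sharpe ratio depends on the sampling period and annualization factor, figures computed from hourly and minute data, as in the two tasks above, are not directly comparable without knowing those choices.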

[AI-61] Improved Training Mechanism for Reinforcement Learning via Online Model Selection

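[Quick read]: This paper addresses online model selection in Reinforcement Learning (RL): dynamically selecting, during training, the RL agent with the right configuration in order to improve efficiency and performance. The key to the solution is the design and theoretical analysis of an online model selection mechanism that adaptively chooses the agent configuration, guided by three practice-oriented criteria: 1) efficient resource allocation, 2) adaptation under non-stationary dynamics, and 3) training stability across different seeds. The paper integrates such methods into RL training procedures, provides theoretical guarantees, and validates them empirically on neural architecture selection, step-size selection, and self model selection.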

Link: https://arxiv.org/abs/2512.02214
Authors: Aida Afshar, Aldo Pacchiano
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We study the problem of online model selection in reinforcement learning, where the selector has access to a class of reinforcement learning agents and learns to adaptively select the agent with the right configuration. Our goal is to establish the improved efficiency and performance gains achieved by integrating online model selection methods into reinforcement learning training procedures. We examine the theoretical characterizations that are effective for identifying the right configuration in practice, and address three practical criteria from a theoretical perspective: 1) Efficient resource allocation, 2) Adaptation under non-stationary dynamics, and 3) Training stability across different seeds. Our theoretical results are accompanied by empirical evidence from various model selection tasks in reinforcement learning, including neural architecture selection, step-size selection, and self model selection.

[AI-62] Bin2Vec: Interpretable and Auditable Multi-View Binary Analysis for Code Plagiarism Detection

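[Quick read]: This paper addresses the lack of completeness and interpretability in software similarity comparison: traditional methods tend to rely only on static features (such as function calls or imports and exports) or only on dynamic behavior (such as instruction execution and memory usage), which makes it hard to capture the real differences between programs. The key to the solution is the Bin2Vec framework, which combines static views of a program (built-in functions, imported and exported symbols) with dynamic views (instruction sequences and memory access patterns) to produce multi-view interpretable feature representations, presents the differences in each view as charts, and merges them into an overall similarity score. The results serve both as structured input for machine-learning models and as material that humans can inspect, supporting efficient and trustworthy decisions in software auditing, provenance verification, and large-scale program screening in cybersecurity.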

Link: https://arxiv.org/abs/2512.02197
Authors: Moussa Moussaoui, Tarik Houichime, Abdelalim Sadiq
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce Bin2Vec, a new framework that helps compare software programs in a clear and explainable way. Instead of focusing only on one type of information, Bin2Vec combines what a program looks like (its built-in functions, imports, and exports) with how it behaves when it runs (its instructions and memory usage). This gives a more complete picture when deciding whether two programs are similar or not. Bin2Vec represents these different types of information as views that can be inspected separately using easy-to-read charts, and then brings them together into an overall similarity score. Bin2Vec acts as a bridge between binary representations and machine learning techniques by generating feature representations that can be efficiently processed by machine-learning models. We tested Bin2Vec on multiple versions of two well-known Windows programs, PuTTY and 7-Zip. The primary results strongly confirmed that our method compute an optimal and visualization-friendly representation of the analyzed software. For example, PuTTY versions showed more complex behavior and memory activity, while 7-Zip versions focused more on performance-related patterns. Overall, Bin2Vec provides decisions that are both reliable and explainable to humans. Because it is modular and easy to extend, it can be applied to tasks like auditing, verifying software origins, or quickly screening large numbers of programs in cybersecurity and reverse-engineering work.

[AI-63] Enforcing Orderedness to Improve Feature Consistency

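[Quick read]: This paper addresses the feature inconsistency of Sparse Autoencoders (SAEs) in neural network interpretability research: the learned features vary noticeably across random seeds and hyperparameter settings. The key to the solution is the Ordered Sparse Autoencoder (OSAE), which extends Matryoshka SAEs with two changes: (1) a strict ordering of latent features, so that feature dimensions have a deterministic order; and (2) deterministic use of every feature dimension, avoiding the sampling-based approximations of earlier nested SAE methods. Theoretically, OSAEs resolve permutation non-identifiability in sparse dictionary learning settings where the solution is unique up to natural symmetries; empirically, on Gemma2-2B and Pythia-70M they can improve feature consistency relative to Matryoshka baselines.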

Link: https://arxiv.org/abs/2512.02194
Authors: Sophie L. Wang, Alex Quach, Nithin Parsan, John J. Yang
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Sparse autoencoders (SAEs) have been widely used for interpretability of neural networks, but their learned features often vary across seeds and hyperparameter settings. We introduce Ordered Sparse Autoencoders (OSAE), which extend Matryoshka SAEs by (1) establishing a strict ordering of latent features and (2) deterministically using every feature dimension, avoiding the sampling-based approximations of prior nested SAE methods. Theoretically, we show that OSAEs resolve permutation non-identifiability in settings of sparse dictionary learning where solutions are unique (up to natural symmetries). Empirically on Gemma2-2B and Pythia-70M, we show that OSAEs can help improve consistency compared to Matryoshka baselines.

[AI-64] From monoliths to modules: Decomposing transducers for efficient world modelling

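[Quick read]: This paper addresses the tension between computational efficiency and structural transparency in complex world models, that is, how to perform efficient real-world inference while keeping the interpretability that AI safety requires. The key to the solution is a decomposition framework for world models represented as transducers: by inverting transducer composition, it derives sub-transducers that operate on distinct input-output subspaces, splitting a monolithic world model into modular components that can be processed in parallel, which improves computational efficiency, interpretability, and support for distributed inference.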

Link: https://arxiv.org/abs/2512.02193
Authors: Alexander Boyd, Franz Nowak, David Hyland, Manuel Baltieri, Fernando E. Rosas
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:World models have been recently proposed as sandbox environments in which AI agents can be trained and evaluated before deployment. Although realistic world models often have high computational demands, efficient modelling is usually possible by exploiting the fact that real-world scenarios tend to involve subcomponents that interact in a modular manner. In this paper, we explore this idea by developing a framework for decomposing complex world models represented by transducers, a class of models generalising POMDPs. Whereas the composition of transducers is well understood, our results clarify how to invert this process, deriving sub-transducers operating on distinct input-output subspaces, enabling parallelizable and interpretable alternatives to monolithic world modelling that can support distributed inference. Overall, these results lay a groundwork for bridging the structural transparency demanded by AI safety and the computational efficiency required for real-world inference.

[AI-65] CLEF: Clinically-Guided Contrastive Learning for Electrocardiogram Foundation Models

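[Quick read]: This paper addresses the limited performance of single-lead ECG analysis when annotated data are scarce, and in particular how to improve accuracy and scalability for remote health monitoring. The core solution is clinically-guided contrastive learning, a contrastive method that uses an established clinical risk score, computed from routinely collected clinical metadata, to adaptively weight negative pairs, so that similarities in the ECG embedding space reflect clinically meaningful differences between subjects, with an explicit mechanism for missing metadata. Using this method, single-lead ECG foundation models at three scales (collectively called CLEF) are pretrained on the MIMIC-IV dataset without per-sample ECG annotations. They outperform existing self-supervised foundation model baselines and self-supervised algorithms on classification and regression tasks, and when pretrained only on lead-I data they perform comparably to ECGFounder, which was trained with supervision.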

Link: https://arxiv.org/abs/2512.02180
Authors: Yuxuan Shu, Peter H. Charlton, Fahim Kawsar, Jussi Hernesniemi, Mohammad Malekzadeh
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: The code is available at this https URL

Abstract:The electrocardiogram (ECG) is a key diagnostic tool in cardiovascular health. Single-lead ECG recording is integrated into both clinical-grade and consumer wearables. While self-supervised pretraining of foundation models on unlabeled ECGs improves diagnostic performance, existing approaches do not incorporate domain knowledge from clinical metadata. We introduce a novel contrastive learning approach that utilizes an established clinical risk score to adaptively weight negative pairs: clinically-guided contrastive learning. It aligns the similarities of ECG embeddings with clinically meaningful differences between subjects, with an explicit mechanism to handle missing metadata. On 12-lead ECGs from 161K patients in the MIMIC-IV dataset, we pretrain single-lead ECG foundation models at three scales, collectively called CLEF, using only routinely collected metadata without requiring per-sample ECG annotations. We evaluate CLEF on 18 clinical classification and regression tasks across 7 held-out datasets, and benchmark against 5 foundation model baselines and 3 self-supervised algorithms. When pretrained on 12-lead ECG data and tested on lead-I data, CLEF outperforms self-supervised foundation model baselines: the medium-sized CLEF achieves average AUROC improvements of at least 2.6% in classification and average reductions in MAEs of at least 3.2% in regression. Comparing with existing self-supervised learning algorithms, CLEF improves the average AUROC by at least 1.8%. Moreover, when pretrained only on lead-I data for classification tasks, CLEF performs comparably to the state-of-the-art ECGFounder, which was trained in a supervised manner. Overall, CLEF enables more accurate and scalable single-lead ECG analysis, advancing remote health monitoring. Code and pretrained CLEF models are available at: this http URL.
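
To illustrate what adaptively weighting negative pairs can look like, the sketch below adds per-negative weights to an InfoNCE-style loss for a single anchor. This is an assumed form, not CLEF's actual objective: how the weights are derived from the clinical risk score, the temperature value, and the function name are all hypothetical.

```python
import math

def weighted_contrastive_loss(pos_sim, neg_sims, neg_weights, temperature=0.1):
    """InfoNCE-style loss for one anchor with a weight on each negative pair."""
    pos = math.exp(pos_sim / temperature)
    # A larger weight makes a negative pair contribute more to the denominator,
    # so the loss pushes that pair further apart.
    neg = sum(w * math.exp(s / temperature) for s, w in zip(neg_sims, neg_weights))
    return -math.log(pos / (pos + neg))

# Hypothetical weights: the second subject is clinically more different from the anchor.
print(weighted_contrastive_loss(0.5, [0.1, 0.2], [0.5, 1.5]))
```

With all weights equal to 1 this reduces to the ordinary InfoNCE loss, and a weight of 0 removes a negative from the comparison, which is one simple way a missing-metadata case could be handled under these assumptions.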

[AI-66] Young Children's Anthropomorphism of AI Chatbots and the Role of Parent Co-Presence

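[Quick read]: This paper asks how young children construe a social AI chatbot and how those construals relate to engagement and prefrontal activation. Children aged 5-6 (N = 23) took part in collaborative storytelling under three conditions (AI only, parent only, AI and parent together); the study assessed their anthropomorphism of the AI and of the parent, and combined a behavioral measure (the conversational turn count ratio, CTC) with fNIRS measurements of oxygenated hemoglobin in the bilateral ventromedial prefrontal cortex (vmPFC) and dorsomedial prefrontal cortex (dmPFC). The key findings: children rated parents as more anthropomorphic overall but gave the AI relatively high ratings for perceptive abilities, and these perceptive ratings were associated with right dmPFC activation, with greater activation in the AI-only condition and lower activation when a parent was present. Higher dmPFC activation in the AI-only condition also correlated with higher "scared" mood ratings. This suggests that attributing perceptive abilities to an AI may engage children's processing of its mental states, while parent co-presence may help regulate children's cognitive and emotional responses to AI interaction, which informs the design of early-childhood products that encourage parent-child co-use of AI.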

Link: https://arxiv.org/abs/2512.02179
Authors: Pilyoung Kim, Jenna H. Chin, Yun Xie, Nolan Brady, Tom Yeh, Sujin Yang
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Artificial Intelligence (AI) chatbots powered by a large language model (LLM) are entering young children’s learning and play, yet little is known about how young children construe these agents or how such construals relate to engagement. We examined anthropomorphism of a social AI chatbot during collaborative storytelling and asked how children’s attributions related to their behavior and prefrontal activation. Children at ages 5-6 (N = 23) completed three storytelling sessions: interacting with (1) an AI chatbot only, (2) a parent only, and (3) the AI and a parent together. After the sessions, children completed an interview assessing anthropomorphism toward both the AI chatbot and the parent. Behavioral engagement was indexed by the conversational turn count (CTC) ratio, and concurrent fNIRS measured oxygenated hemoglobin in bilateral vmPFC and dmPFC regions. Children reported higher anthropomorphism for parents than for the AI chatbot overall, although AI ratings were relatively high for perceptive abilities and epistemic states. Anthropomorphism was not associated with CTC. In the right dmPFC, higher perceptive scores were associated with greater activation during the AI-only condition and with lower activation during the AI+Parent condition. Exploratory analyses indicated that higher dmPFC activation during the AI-only condition correlated with higher end-of-session “scared” mood ratings. Findings suggest that stronger perceptive anthropomorphism can be associated with greater brain activation related to interpreting the AI’s mental states, whereas parent co-presence may help some children interpret and regulate novel AI interactions. These results may have design implications for encouraging parent-AI co-use in early childhood.

[AI-67] Flowchart2Mermaid: A Vision-Language Model Powered System for Converting Flowcharts into Editable Diagram Code EACL2026

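[Quick read]: This paper addresses the difficulty of editing and reusing flowcharts that are shared as static images. Existing tools generally cannot turn a flowchart image into a modifiable structured text representation, which limits use in collaboration and version control. The key to the solution is Flowchart2Mermaid, a lightweight web system that uses vision-language models with a carefully designed system prompt to convert flowchart images into editable Mermaid code (a markup language for visual workflows). Beyond generating the structured text, it supports mixed-initiative refinement through inline text editing, drag-and-drop node insertion, and natural-language commands, keeping the code synchronized with the rendered diagram.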

Link: https://arxiv.org/abs/2512.02170
Authors: Pritam Deka, Barry Devereux
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Submitted to EACL 2026 Demo Track

Abstract:Flowcharts are common tools for communicating processes but are often shared as static images that cannot be easily edited or reused. We present Flowchart2Mermaid, a lightweight web system that converts flowchart images into editable Mermaid.js code, which is a markup language for visual workflows, using a detailed system prompt and vision-language models. The interface supports mixed-initiative refinement through inline text editing, drag-and-drop node insertion, and natural-language commands interpreted by an integrated AI assistant. Unlike prior image-to-diagram tools, our approach produces a structured, version-controllable textual representation that remains synchronized with the rendered diagram. We further introduce evaluation metrics to assess structural accuracy, flow correctness, syntax validity, and completeness across multiple models.
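
To show the kind of text such a system targets, the sketch below serializes a small node-and-edge description into Mermaid flowchart syntax. It illustrates the output format only: the input data structure and the function name are hypothetical, and the vision-language extraction step that the paper describes is not modeled.

```python
def to_mermaid(nodes, edges, direction="TD"):
    """Serialize nodes and edges into Mermaid flowchart text."""
    lines = [f"flowchart {direction}"]
    for node_id, (label, shape) in nodes.items():
        # Square brackets draw a process box, curly braces a decision diamond.
        left, right = ("{", "}") if shape == "decision" else ("[", "]")
        lines.append(f"    {node_id}{left}{label}{right}")
    for src, dst, label in edges:
        arrow = f"-->|{label}|" if label else "-->"
        lines.append(f"    {src} {arrow} {dst}")
    return "\n".join(lines)

# Hypothetical three-node flowchart.
nodes = {"A": ("Start", "process"), "B": ("Valid?", "decision"), "C": ("Done", "process")}
edges = [("A", "B", None), ("B", "C", "Yes")]
print(to_mermaid(nodes, edges))
```

Because the result is plain text, it can be diffed and version-controlled like source code, which is the property the abstract highlights.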

[AI-68] The 4/δ Bound: Designing Predictable LLM-Verifier Systems for Formal Method Guarantee

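[Quick read]: This paper addresses the lack of theoretical guarantees and unreliable convergence of current software verification methods that pair Large Language Models (LLMs) with formal verification tools: without guarantees of termination or convergence, the refinement process can wander, loop, or leave any stable trajectory. The key to the solution is the LLM-Verifier Convergence Theorem, presented as the first formal framework with provable guarantees. It models the LLM-verifier interaction as a discrete-time Markov chain governed by a core parameter, the error-reduction probability δ. The theory shows that for δ > 0 the system almost surely reaches the Verified state, with the expected number of iterations satisfying 𝔼[n] ≤ 4/δ. More than 90,000 trials support the prediction: every run converged and the convergence factor clustered around C_f ≈ 1.0, indicating that the bound closely tracks actual behavior and gives LLM-assisted verification a predictable, plannable architectural basis.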

Link: https://arxiv.org/abs/2512.02080
Authors: Pierre Dantas, Lucas Cordeiro, Youcheng Sun, Waldir Junior
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: 32 pages, 9 figures

Abstract:The idea of using Formal Verification tools with large language models (LLMs) has enabled scaling software verification beyond manual workflows. However, current methods remain unreliable. Without a solid theoretical footing, the refinement process can wander; sometimes it settles, sometimes it loops back, and sometimes it breaks away from any stable trajectory. This work bridges this critical gap by developing an LLM-Verifier Convergence Theorem, providing the first formal framework with provable guarantees for termination and convergence. We model the interaction between the LLM and the verifier as a discrete-time Markov Chain, with state transitions determined by a key parameter: the error-reduction probability ($\delta$). The procedure reaching the Verified state almost surely demonstrates that the program terminates for any $\delta > 0$, with an expected iteration count bounded by $\mathbb{E}[n] \leq 4/\delta$. We then stress-tested this prediction in an extensive empirical campaign comprising more than 90,000 trials. The empirical results match the theory with striking consistency. Every single run reached verification, and the convergence factor clustered tightly around $C_f \approx 1.0$. Consequently, the bound mirrors the system’s actual behavior. The evidence is sufficiently robust to support dividing the workflow into three distinct operating zones: marginal, practical, and high-performance. Consequently, we establish the design thresholds with absolute confidence. Together, the theoretical guarantee and the experimental evidence provide a clearer architectural foundation for LLM-assisted verification. Heuristic tuning no longer has to be carried out by the system. Engineers gain a framework that supports predictable resource planning and performance budgeting, precisely what is needed before deploying these pipelines into safety-critical software environments.
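
For a feel of the numbers, the sketch below evaluates the 4/δ bound for a few values of δ and compares it with the exact expected hitting time of a toy chain. The toy chain (four sequential stages, each advancing with probability δ and otherwise repeating) is an assumption made purely for illustration; the abstract does not describe the paper's actual state space, and both function names are hypothetical.

```python
def iteration_bound(delta):
    """Upper bound 4/delta on the expected iteration count quoted in the abstract."""
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")
    return 4.0 / delta

def expected_steps_sequential_chain(delta, stages=4):
    """Expected steps to absorption for a hypothetical chain of sequential stages."""
    # With k stages left: h_k = 1 + delta * h_{k-1} + (1 - delta) * h_k,
    # which rearranges to h_k = h_{k-1} + 1 / delta, starting from h_0 = 0.
    expected = 0.0
    for _ in range(stages):
        expected += 1.0 / delta
    return expected

for delta in (0.1, 0.25, 0.5):
    print(delta, iteration_bound(delta), expected_steps_sequential_chain(delta))
```

In this toy chain the expected hitting time equals 4/δ exactly, so halving δ doubles the iteration budget; a chain with fewer stages would sit strictly below the bound.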
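
To make the $4/\delta$ bound concrete, here is a minimal Python sketch under a strong simplifying assumption that is not taken from the paper: the refinement loop is modeled as four sequential stages, each of which advances with probability delta per iteration and never regresses. The abstract does not describe the chain's actual states, so the stage count and the names `expected_iterations` and `simulate` are hypothetical. In this toy chain the expected number of iterations equals exactly 4/delta.

```python
import random


def expected_iterations(delta: float, stages: int = 4) -> float:
    """Exact expected steps to reach the final state of the assumed toy chain."""
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must be in (0, 1]")
    expected = 0.0  # expected remaining steps once the final state is reached
    for _ in range(stages):
        # E_i = 1 + delta * E_{i+1} + (1 - delta) * E_i  =>  E_i = E_{i+1} + 1 / delta
        expected += 1.0 / delta
    return expected


def simulate(delta: float, stages: int = 4, trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the mean number of iterations in the same toy chain."""
    rng = random.Random(seed)
    total_steps = 0
    for _ in range(trials):
        stage = 0
        while stage < stages:
            total_steps += 1
            if rng.random() < delta:  # one successful error-reduction step
                stage += 1
    return total_steps / trials
```

Calling `expected_iterations(0.5)` and `simulate(0.5)` lets the closed form be compared against the simulated mean. A chain that can also move backwards would need a different recursion.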

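[AI-69] FDRMFL: Multi-modal Federated Feature Extraction Model Based on Information Maximization and Contrastive Learning

[Quick read]: This paper addresses feature extraction for multi-modal data regression, targeting three challenges in real-world settings: limited and non-independent, non-identically distributed (non-IID) data; effective extraction and fusion of multi-modal information; and susceptibility to catastrophic forgetting during model learning. The key to the solution is a task-driven, supervised multi-modal federated feature extraction method that combines multi-modal information extraction with contrastive learning and supports different neural network architectures as the latent mapping functions for each modality. By introducing a multi-constraint learning framework (a mutual information preservation constraint, a symmetric Kullback-Leibler divergence constraint, and an inter-model contrastive constraint), the method maintains regression accuracy while retaining task-relevant features, extracting, fusing, and aligning multi-modal features, and mitigating representation drift and catastrophic forgetting in non-IID settings, so that feature extraction stays focused on improving downstream regression performance.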

Link: https://arxiv.org/abs/2512.02076
Authors: Haozhe Wu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 6 figures

Abstract:This study focuses on the feature extraction problem in multi-modal data regression. To address three core challenges in real-world scenarios: limited and non-IID data, effective extraction and fusion of multi-modal information, and susceptibility to catastrophic forgetting in model learning, a task-driven supervised multi-modal federated feature extraction method is proposed. The method integrates multi-modal information extraction and contrastive learning mechanisms, and can adapt to different neural network structures as the latent mapping functions for data of each modality. It supports each client to independently learn low-dimensional representations of multi-modal data, and can flexibly control the degree of retention of effective information about the response variable in the predictive variables within the low-dimensional features through parameter tuning. The multi-constraint learning framework constructed by the method guarantees regression accuracy using Mean Squared Error loss. Through the synergistic effect of mutual information preservation constraint, symmetric Kullback-Leibler divergence constraint, and inter-model contrastive constraint, it achieves the retention of task-related information, the extraction, fusion, and alignment of multi-modal features, and the mitigation of representation drift and catastrophic forgetting in non-IID scenarios, respectively. This ensures that the feature extraction process always centers on improving the performance of downstream regression tasks. Experimental results from simulations and real-world data analysis demonstrate that the proposed method achieves more significant performance improvement on downstream regression tasks compared with classical feature extraction techniques.
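
One of the constraints named above is a symmetric Kullback-Leibler divergence. As a rough illustration only, the sketch below computes it for two discrete distributions as KL(p||q) + KL(q||p). How the paper applies the constraint to learned representations is not stated in the abstract, and some authors use half of this sum, so treat both the definition and the function names as assumptions.

```python
import math


def kl_divergence(p, q) -> float:
    """KL(p || q) for two discrete distributions given as equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0.0:
            if q_i <= 0.0:
                raise ValueError("q must be positive wherever p is positive")
            total += p_i * math.log(p_i / q_i)
    return total


def symmetric_kl(p, q) -> float:
    """Symmetrized divergence KL(p || q) + KL(q || p) (assumed convention)."""
    return kl_divergence(p, q) + kl_divergence(q, p)
```

Unlike plain KL divergence, the symmetric form gives the same value when the two arguments are swapped, which is why it suits aligning two modalities that have no natural "reference" role.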

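[AI-70] HTG-GCL: Leveraging Hierarchical Topological Granularity from Cellular Complexes for Graph Contrastive Learning

[Quick read]: This paper addresses the difficulty that existing graph contrastive learning (GCL) methods with structural augmentations have in identifying task-relevant topological structures and in adapting to the coarse-to-fine topological granularities that different downstream tasks require. The key to the solution is the Hierarchical Topological Granularity Graph Contrastive Learning (HTG-GCL) framework, which applies transformations to the same graph to generate multi-scale ring-based cellular complexes that embody the notion of topological granularity and yield diverse topological views. Because a particular granularity may carry misleading semantics, the framework adds a multi-granularity decoupled contrast mechanism with a granularity-specific weighting strategy based on uncertainty estimation, so that meaningful graph representations are captured more effectively.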

Link: https://arxiv.org/abs/2512.02073
Authors: Qirui Ji, Bin Qin, Yifan Jin, Yunze Zhao, Chuxiong Sun, Changwen Zheng, Jianwen Cao, Jiangmeng Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Graph contrastive learning (GCL) aims to learn discriminative semantic invariance by contrasting different views of the same graph that share critical topological patterns. However, existing GCL approaches with structural augmentations often struggle to identify task-relevant topological structures, let alone adapt to the varying coarse-to-fine topological granularities required across different downstream tasks. To remedy this issue, we introduce Hierarchical Topological Granularity Graph Contrastive Learning (HTG-GCL), a novel framework that leverages transformations of the same graph to generate multi-scale ring-based cellular complexes, embodying the concept of topological granularity, thereby generating diverse topological views. Recognizing that a certain granularity may contain misleading semantics, we propose a multi-granularity decoupled contrast and apply a granularity-specific weighting mechanism based on uncertainty estimation. Comprehensive experiments on various benchmarks demonstrate the effectiveness of HTG-GCL, highlighting its superior performance in capturing meaningful graph representations through hierarchical topological information.

[AI-71] DPWMixer: Dual-Path Wavelet Mixer for Long-Term Time Series Forecasting

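[Quick read]: This paper addresses two problems in long-term time series forecasting (LTSF): Transformer models suffer from high computational complexity and from overfitting caused by data sparsity, while linear models struggle to capture complex non-linear local dynamics. In addition, existing multi-scale frameworks rely on average pooling as a low-pass filter, which causes spectral aliasing and irreversibly discards high-frequency transients. The key to the solution is DPWMixer, a computationally efficient dual-path architecture with three main components: a lossless Haar wavelet pyramid that replaces conventional pooling and uses orthogonal decomposition to separate trends from local fluctuations without information loss; a Dual-Path Trend Mixer that combines a global linear mapping, which anchors the macro trend, with a patch-based MLP-Mixer, which models micro-dynamic evolution; and an adaptive multi-scale fusion module that weights predictions from different scales by channel stationarity.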

Link: https://arxiv.org/abs/2512.02070
Authors: Li Qianyang, Zhang Xingjun, Wang Shaoxun, Wei Jia
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract: Long-term time series forecasting (LTSF) is a critical task in computational intelligence. While Transformer-based models effectively capture long-range dependencies, they often suffer from quadratic complexity and overfitting due to data sparsity. Conversely, efficient linear models struggle to depict complex non-linear local dynamics. Furthermore, existing multi-scale frameworks typically rely on average pooling, which acts as a non-ideal low-pass filter, leading to spectral aliasing and the irreversible loss of high-frequency transients. In response, this paper proposes DPWMixer, a computationally efficient Dual-Path architecture. The framework is built upon a Lossless Haar Wavelet Pyramid that replaces traditional pooling, utilizing orthogonal decomposition to explicitly disentangle trends and local fluctuations without information loss. To process these components, we design a Dual-Path Trend Mixer that integrates a global linear mapping for macro-trend anchoring and a flexible patch-based MLP-Mixer for micro-dynamic evolution. Finally, an adaptive multi-scale fusion module integrates predictions from diverse scales, weighted by channel stationarity to optimize synthesis. Extensive experiments on eight public benchmarks demonstrate that our method achieves a consistent improvement over state-of-the-art baselines. The code is available at this https URL.
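
The claim that a Haar decomposition is lossless while average pooling is not can be checked with a few lines of Python. The sketch below is a generic one-level orthonormal Haar transform, not the paper's pyramid implementation; the function names are hypothetical. The approximation coefficients play the role of the pooled trend, and the detail coefficients keep exactly the information that pooling would throw away.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_step(signal):
    """One level of the orthonormal Haar transform on an even-length sequence."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Rebuild the original sequence from approximation and detail coefficients."""
    signal = []
    for a, d in zip(approx, detail):
        signal.append((a + d) / SQRT2)  # first sample of the pair
        signal.append((a - d) / SQRT2)  # second sample of the pair
    return signal
```

Applying `haar_step` repeatedly to the approximation coefficients gives a multi-level pyramid; each level halves the length of the trend while the stored details keep the transform invertible.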

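[AI-72] Large Language Model based Smart Contract Auditing with LLMBugScanner

[Quick read]: This paper addresses the unstable performance of large language models (LLMs) in smart contract vulnerability detection: different pretrained models vary considerably across vulnerability types and contract structures, and a single model, even after fine-tuning, struggles to achieve consistently high accuracy and strong generalization. The key to the solution is the LLMBugScanner framework, which has two parts. The first is domain knowledge adaptation, which fine-tunes LLMs on complementary datasets with parameter-efficient tuning so that they capture both code semantics and instruction-guided vulnerability reasoning. The second is ensemble reasoning, which draws on the complementary strengths of multiple LLMs and applies a consensus-based conflict resolution mechanism to make vulnerability assessments more reliable and robust. The result is a more stable, accurate, and extensible approach to smart contract auditing.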

Link: https://arxiv.org/abs/2512.02069
Authors: Yining Yuan, Yifei Wang, Yichang Xu, Zachary Yahn, Sihao Hu, Ling Liu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract: This paper presents LLMBugScanner, a large language model (LLM) based framework for smart contract vulnerability detection using fine-tuning and ensemble learning. Smart contract auditing presents several challenges for LLMs: different pretrained models exhibit varying reasoning abilities, and no single model performs consistently well across all vulnerability types or contract structures. These limitations persist even after fine-tuning individual LLMs. To address these challenges, LLMBugScanner combines domain knowledge adaptation with ensemble reasoning to improve robustness and generalization. Through domain knowledge adaptation, we fine-tune LLMs on complementary datasets to capture both general code semantics and instruction-guided vulnerability reasoning, using parameter-efficient tuning to reduce computational cost. Through ensemble reasoning, we leverage the complementary strengths of multiple LLMs and apply a consensus-based conflict resolution strategy to produce more reliable vulnerability assessments. We conduct extensive experiments across multiple popular LLMs and compare LLMBugScanner with both pretrained and fine-tuned individual models. Results show that LLMBugScanner achieves consistent accuracy improvements and stronger generalization, demonstrating that it provides a principled, cost-effective, and extensible framework for smart contract auditing.
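
The abstract does not spell out the consensus-based conflict resolution strategy, so the following Python sketch only illustrates the general idea with a plain majority vote. The label strings, the function name `consensus_label`, and the tie rule (the earliest-listed model wins) are all assumptions made for illustration, not LLMBugScanner's actual design.

```python
from collections import Counter


def consensus_label(predictions):
    """Return the most common label; ties go to the earliest-listed model."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    top_votes = max(counts.values())
    for label in predictions:  # list order encodes an assumed model priority
        if counts[label] == top_votes:
            return label
```

For example, three hypothetical model outputs `["reentrancy", "integer-overflow", "reentrancy"]` resolve to `"reentrancy"`. A production system would likely weight models by their validation accuracy per vulnerability type rather than count votes equally.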

[AI-73] Ada-MoGE: Adaptive Mixture of Gaussian Expert Model for Time Series Forecasting

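[Quick read]: This paper addresses the frequency coverage imbalance that arises in multivariate time series forecasting when the number of experts is fixed while the dominant frequencies shift with the data's spectral distribution. Traditional Mixture of Experts (MoE) models cannot change their number of experts, so they adapt poorly to spectral dynamics and tend either to miss critical information or to introduce noise. The key to the solution is Ada-MoGE, which combines spectral intensity and frequency response to determine the number of experts adaptively, keeping the expert configuration aligned with the frequency distribution of the input data. It also uses Gaussian band-pass filtering to decompose frequency-domain features smoothly, avoiding the noise contamination caused by direct band truncation and improving both feature representation and forecasting performance.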

Link: https://arxiv.org/abs/2512.02061
Authors: Zhenliang Ni, Xiaowen Ma, Zhenkai Wu, Shuai Xiao, Han Shu, Xinghao Chen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Abstract: Multivariate time series forecasting is widely used in areas such as industry, transportation, and finance. However, the dominant frequencies in time series may shift with the evolving spectral distribution of the data. Traditional Mixture of Experts (MoE) models, which employ a fixed number of experts, struggle to adapt to these changes, resulting in a frequency coverage imbalance. Specifically, too few experts can lead to the overlooking of critical information, while too many can introduce noise. To this end, we propose Ada-MoGE, an adaptive Gaussian Mixture of Experts model. Ada-MoGE integrates spectral intensity and frequency response to adaptively determine the number of experts, ensuring alignment with the input data’s frequency distribution. This approach prevents both information loss due to an insufficient number of experts and noise contamination from an excess of experts. Additionally, to prevent noise introduction from direct band truncation, we employ Gaussian band-pass filtering to smoothly decompose the frequency domain features, further optimizing the feature representation. The experimental results show that our model achieves state-of-the-art performance on six public benchmarks with only 0.2 million parameters.
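
The contrast between direct band truncation and Gaussian band-pass filtering comes down to the shape of the weights applied to each frequency bin. The Python sketch below builds both kinds of weights for comparison; it is a generic illustration, and the parameter names (`center`, `width`, `low`, `high`) are hypothetical rather than taken from Ada-MoGE.

```python
import math


def gaussian_band_weights(num_bins: int, center: float, width: float):
    """Smooth band-pass weights that peak at `center` and decay on both sides."""
    if width <= 0.0:
        raise ValueError("width must be positive")
    return [math.exp(-0.5 * ((k - center) / width) ** 2) for k in range(num_bins)]


def hard_band_weights(num_bins: int, low: int, high: int):
    """Direct truncation: keep bins in [low, high] and zero out everything else."""
    return [1.0 if low <= k <= high else 0.0 for k in range(num_bins)]


def apply_weights(spectrum, weights):
    """Scale each (real or complex) frequency coefficient by its weight."""
    return [coefficient * weight for coefficient, weight in zip(spectrum, weights)]
```

The hard mask jumps from 1 to 0 at the band edges, whereas the Gaussian mask tapers off gradually, which is the property the abstract credits with avoiding noise from truncation.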

[AI-74] Opening the Black Box: An Explainable Few-shot AI4E Framework Informed by Physics and Expert Knowledge for Materials Engineering

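[Quick read]: This paper addresses two core bottlenecks facing industrial applications of Artificial Intelligence for Engineering (AI4E): scarce high-quality data and the lack of interpretability of black-box models, which is especially acute in safety-critical sectors such as aerospace. The key to the solution is an explainable, few-shot AI4E framework in which physics and expert knowledge are embedded systematically. Starting from only 32 experimental samples, it first generates physically plausible synthetic data through a three-stage protocol (differentiated noise injection, enforcement of hard physical constraints, and preservation of inter-parameter relationships). It then discovers a constitutive model with a nested optimization strategy, in which symbolic regression identifies the equation structure and differential evolution optimizes the parameters, followed by parameter refinement with hybrid global-local optimization. The resulting constitutive equation reaches 88% prediction accuracy and has a clear physical meaning: it predicts hot-cracking tendency quantitatively and reveals how thermal, geometric, and metallurgical mechanisms couple, which improves engineers' understanding of the process. It also serves as a multi-purpose tool for process optimization and high-fidelity virtual data generation, supporting the reliable deployment of trustworthy AI where data is limited but the physics is well understood.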

Link: https://arxiv.org/abs/2512.02057
Authors: Haoxiang Zhang, Ruihao Yuan, Lihui Zhang, Yushi Luo, Qiang Zhang, Pan Ding, Xiaodong Ren, Weijie Xing, Niu Gao, Jishan Chen, Chubo Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The industrial adoption of Artificial Intelligence for Engineering (AI4E) faces two fundamental bottlenecks: scarce high-quality data and the lack of interpretability in black-box models-particularly critical in safety-sensitive sectors like aerospace. We present an explainable, few-shot AI4E framework that is systematically informed by physics and expert knowledge throughout its architecture. Starting from only 32 experimental samples in an aerial K439B superalloy castings repair welding case, we first augment physically plausible synthetic data through a three-stage protocol: differentiated noise injection calibrated to process variabilities, enforcement of hard physical constraints, and preservation of inter-parameter relationships. We then employ a nested optimization strategy for constitutive model discovery, where symbolic regression explores equation structures while differential evolution optimizes parameters, followed by intensive parameter refinement using hybrid global-local optimization. The resulting interpretable constitutive equation achieves 88% accuracy in predicting hot-cracking tendency. This equation not only provides quantitative predictions but also delivers explicit physical insight, revealing how thermal, geometric, and metallurgical mechanisms couple to drive cracking-thereby advancing engineers’ cognitive understanding of the process. Furthermore, the constitutive equation serves as a multi-functional tool for process optimization and high-fidelity virtual data generation, enabling accuracy improvements in other data-driven models. Our approach provides a general blueprint for developing trustworthy AI systems that embed engineering domain knowledge directly into their architecture, enabling reliable adoption in high-stakes industrial applications where data is limited but physical understanding is available.

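[AI-75] The Impact of Artificial Intelligence on Enterprise Decision-Making Process

[Quick read]: This paper examines how artificial intelligence (AI) actually performs in enterprise management decision-making and what the main barriers are, in particular how organizational capability building can strengthen AI's effect on decision efficiency and performance. The key to the solution is integrating AI closely with human judgment and communication practices. The study finds that organizational factors (such as change management capability, adaptive leadership, and transparent processes) matter more than technological limitations, and that the core competencies for successful AI use are understanding algorithmic mechanisms and driving adaptive change inside the organization, rather than purely technical skills such as programming.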

Link: https://arxiv.org/abs/2512.02048
Authors: Ernest Górka, Dariusz Baran, Gabriela Wojak, Michał Ćwiąkała, Sebastian Zupok, Dariusz Starkowski, Dariusz Reśko, Oliwia Okrasa
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); General Economics (econ.GN)
Comments: 22 pages

Abstract:Artificial intelligence improves enterprise decision-making by accelerating data analysis, reducing human error, and supporting evidence-based choices. A quantitative survey of 92 companies across multiple industries examines how AI adoption influences managerial performance, decision efficiency, and organizational barriers. Results show that 93 percent of firms use AI, primarily in customer service, data forecasting, and decision support. AI systems increase the speed and clarity of managerial decisions, yet implementation faces challenges. The most frequent barriers include employee resistance, high costs, and regulatory ambiguity. Respondents indicate that organizational factors are more significant than technological limitations. Critical competencies for successful AI use include understanding algorithmic mechanisms and change management. Technical skills such as programming play a smaller role. Employees report difficulties in adapting to AI tools, especially when formulating prompts or accepting system outputs. The study highlights the importance of integrating AI with human judgment and communication practices. When supported by adaptive leadership and transparent processes, AI adoption enhances organizational agility and strengthens decision-making performance. These findings contribute to ongoing research on how digital technologies reshape management and the evolution of hybrid human-machine decision environments.

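[AI-76] Teaching an Online Multi-Institutional Research Level Software Engineering Course with Industry - an Experience Report

[Quick read]: This paper addresses the difficulty small institutions have in offering research-level courses when they lack sufficient faculty. The key to the solution is cross-institutional collaboration over online teaching: several institutions jointly offer a research course in a cutting-edge area and bring in industry experts to teach and mentor, compensating for the faculty and resource gaps of any single institution. Using the course "AI in Software Engineering" as a case study, the paper shows that this collaborative model is feasible and effective, and argues that it can extend to other applied areas of computer science, especially interdisciplinary ones that benefit from industry practice.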

Link: https://arxiv.org/abs/2512.01523
Authors: Pankaj Jalote, Y. Raghu Reddy, Vasudeva Varma
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages

Abstract:Covid has made online teaching and learning acceptable and students, faculty, and industry professionals are all comfortable with this mode. This comfort can be leveraged to offer an online multi-institutional research-level course in an area where individual institutions may not have the requisite faculty to teach and/or research students to enroll. If the subject is of interest to industry, online offering also allows industry experts to contribute and participate with ease. Advanced topics in Software Engineering are ideally suited for experimenting with this approach as industry, which is often looking to incorporate advances in software engineering in their practices, is likely to agree to contribute and participate. In this paper we describe an experiment in teaching a course titled “AI in Software Engineering” jointly between two institutions with active industry participation, and share our and student’s experience. We believe this collaborative teaching approach can be used for offering research level courses in any applied area of computer science by institutions who are small and find it difficult to offer research level courses on their own.

[AI-77] Mixed precision accumulation for neural network inference guided by componentwise forward error analysis

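[Quick read]: This paper addresses the trade-off between accuracy and computational cost in neural network inference, specifically how to apply mixed precision more effectively to accumulation operations. The key to the solution is a mathematical framework based on componentwise forward error analysis, which describes how errors propagate to each output component of a linear layer: the error in each component is proportional to the product of the condition number of the corresponding weight-input inner product and the condition number of the activation function. Because these condition numbers vary widely across components, mixed precision can exploit them by accumulating components with a large condition-number product in higher precision and keeping the rest in low precision. The algorithm first computes everything in low precision, estimates the condition numbers from that result, and recomputes only the critical components in higher precision, which substantially lowers overall cost while preserving accuracy.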

Link: https://arxiv.org/abs/2503.15568
Authors: El-Mehdi El Arar (PEQUAN), Silviu-Ioan Filip (TARAN), Theo Mary (PEQUAN), Elisa Riccietti (OCKHAM)
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments:

Abstract:This work proposes a mathematically founded mixed precision accumulation strategy for the inference of neural networks. Our strategy is based on a new componentwise forward error analysis that explains the propagation of errors in the forward pass of neural networks. Specifically, our analysis shows that the error in each component of the output of a linear layer is proportional to the condition number of the inner product between the weights and the input, multiplied by the condition number of the activation function. These condition numbers can vary widely from one component to the other, thus creating a significant opportunity to introduce mixed precision: each component should be accumulated in a precision inversely proportional to the product of these condition numbers. We propose a numerical algorithm that exploits this observation: it first computes all components in low precision, uses this output to estimate the condition numbers, and recomputes in higher precision only the components associated with large condition numbers. We test our algorithm on various networks and datasets and confirm experimentally that it can significantly improve the cost–accuracy tradeoff compared with uniform precision accumulation baselines.
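
The three-step algorithm described above (compute in low precision, estimate condition numbers, recompute the ill-conditioned components) can be sketched in plain Python. This is a simplified illustration, not the authors' implementation: binary32 rounding is emulated with the `struct` module, Python's native binary64 stands in for "higher precision", the activation function's condition number is ignored (treated as 1), and the `threshold` parameter and function names are hypothetical.

```python
import math
import struct


def to_float32(value: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def dot_low_precision(weights, inputs) -> float:
    """Inner product with every product and partial sum rounded to binary32."""
    acc = 0.0
    for w, x in zip(weights, inputs):
        acc = to_float32(acc + to_float32(w * x))
    return acc


def inner_product_condition(weights, inputs, result: float) -> float:
    """Estimate sum(|w_i * x_i|) / |sum(w_i * x_i)| from the low-precision result."""
    magnitude = sum(abs(w * x) for w, x in zip(weights, inputs))
    return math.inf if result == 0.0 else magnitude / abs(result)


def mixed_precision_matvec(matrix, inputs, threshold: float):
    """Return (outputs, indices of rows that were recomputed in binary64)."""
    outputs, recomputed = [], []
    for row_index, row in enumerate(matrix):
        y = dot_low_precision(row, inputs)
        if inner_product_condition(row, inputs, y) > threshold:
            # Ill-conditioned component: redo the accumulation in higher precision.
            y = math.fsum(w * x for w, x in zip(row, inputs))
            recomputed.append(row_index)
        outputs.append(y)
    return outputs, recomputed
```

A row such as `[1.0, -1.0, 0.001]` applied to an all-ones input suffers heavy cancellation, so its condition number is large and it is the kind of component this scheme recomputes, while a row of same-signed products stays in low precision.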

[AI-78] Rethinking Generalized BCIs: Benchmarking 340000 Unique Algorithmic Configurations for EEG Mental Command Decoding

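[Quick read]: This paper addresses the robustness of decoding and classifying electroencephalography (EEG) signals in real-world brain-computer interface (BCI) applications, where the central challenge is pronounced inter- and intra-participant variability. The key to the solution is a large-scale benchmark that systematically evaluates more than 340,000 combinations of spatial and nonlinear features on three open-access EEG datasets, covering Common Spatial Patterns (CSP), Riemannian geometry, functional connectivity, and fractal- or entropy-based features, with per-participant analysis across multiple frequency bands (8-15 Hz and 8-30 Hz). The results show that covariance tangent space projection (cov-tgsp) and CSP achieve the highest average accuracy, but their effectiveness depends strongly on the dataset and individual differences remain marked. In the most heterogeneous dataset in particular, nonlinear methods perform better for specific users, which underscores the importance of personalized modeling. The study therefore concludes that no one-size-fits-all decoding strategy works for all users and datasets, and that adaptive, multimodal, and possibly novel methods are needed to handle neurophysiological variability in practical BCIs.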

Link: https://arxiv.org/abs/2512.02978
Authors: Paul Barbaste (Inclusive Brains, Wavestone, Human Technology Foundation), Olivier Oullier (Human-Computer Interaction Department, Mohamed bin Zayed University of Artificial Intelligence, Inclusive Brains, Institute for Artificial Intelligence, Biotech Dental Group), Xavier Vasques (IBM Technology, IBM France Lab)
Affiliations: Unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 28 pages, 8 figures, 2 tables

Abstract: Robust decoding and classification of brain patterns measured with electroencephalography (EEG) remains a major challenge for real-world (i.e. outside scientific labs and medical facilities) brain-computer interface (BCI) applications due to well-documented inter- and intra-participant variability. Here, we present a large-scale benchmark evaluating more than 340,000 unique combinations of spatial and nonlinear EEG classification. Our methodological pipeline consists of combinations of Common Spatial Patterns (CSP), Riemannian geometry, functional connectivity, and fractal- or entropy-based features across three open-access EEG datasets. Unlike prior studies, our analysis operates at the per-participant level and across multiple frequency bands (8-15 Hz and 8-30 Hz), enabling direct assessment of both group-level performance and individual variability. Covariance tangent space projection (cov-tgsp) and CSP consistently achieved the highest average classification accuracies. However, their effectiveness was strongly dataset-dependent, and marked participant-level differences persisted, particularly in the most heterogeneous of the datasets. Importantly, nonlinear methods outperformed spatial approaches for specific individuals, underscoring the need for personalized pipeline selection. Our findings highlight that no universal ‘one-size-fits-all’ method can optimally decode EEG motor imagery patterns across all users or datasets. Future work will require adaptive, multimodal, and possibly novel approaches to fully address neurophysiological variability in practical BCI applications where the system can automatically adapt to what makes each user unique.

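[AI-79] Towards a fully differentiable digital twin for solar cells

[Quick read]: This paper addresses the lack of a unified computational framework for predicting and optimizing the energy yield (EY) of solar cells. Existing simulation methods usually focus on isolated properties of a solar cell, which makes accurate end-to-end prediction and optimization from material to device to real environmental conditions difficult. The key to the solution is Sol(Di)$^2$T, a differentiable digital twin that integrates multi-scale information (material properties, morphological processing parameters, optical and electrical simulations, and climate and geographic location) into one end-to-end differentiable workflow. Each step either is intrinsically differentiable or is replaced with a machine-learned surrogate model, so EY can be predicted accurately and input parameters can be optimized with gradients. This extends EY prediction to previously unexplored conditions and helps tailor solar cells to specific applications.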

Link: https://arxiv.org/abs/2512.02904
Authors: Marie Louise Schubert, Houssam Metni, Jan David Fischbach, Benedikt Zerulla, Marjan Krstić, Ulrich W. Paetzold, Seyedamir Orooji, Olivier J. J. Ronsin, Yasin Ameslon, Jens Harting, Thomas Kirchartz, Sandheep Ravishankar, Chris Dreessen, Eunchi Kim, Christian Sprau, Mohamed Hussein, Alexander Colsmann, Karen Forberich, Klaus Jäger, Pascal Friederich, Carsten Rockstuhl
Affiliations: Unknown
Categories: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI)
Comments:

Abstract: Maximizing energy yield (EY) - the total electric energy generated by a solar cell within a year at a specific location - is crucial in photovoltaics (PV), especially for emerging technologies. Computational methods provide the necessary insights and guidance for future research. However, existing simulations typically focus on only isolated aspects of solar cells. This lack of consistency highlights the need for a framework unifying all computational levels, from material to cell properties, for accurate prediction and optimization of EY. To address this challenge, a differentiable digital twin, Sol(Di)$^2$T, is introduced to enable comprehensive end-to-end optimization of solar cells. The workflow starts with material properties and morphological processing parameters, followed by optical and electrical simulations. Finally, climatic conditions and geographic location are incorporated to predict the EY. Each step is either intrinsically differentiable or replaced with a machine-learned surrogate model, enabling not only accurate EY prediction but also gradient-based optimization with respect to input parameters. Consequently, Sol(Di)$^2$T extends EY predictions to previously unexplored conditions. Demonstrated for an organic solar cell, the proposed framework marks a significant step towards tailoring solar cells for specific applications while ensuring maximal performance.

[AI-80] scCluBench: Comprehensive Benchmarking of Clustering Algorithms for Single-Cell RNA Sequencing

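[Quick read]: This paper addresses the lack of a standardized evaluation framework for cell clustering methods on single-cell RNA sequencing (scRNA-seq) data: existing benchmarks are often fragmented, use inconsistent protocols, and do not fully incorporate recent advances in artificial intelligence. The key to the solution is scCluBench, a comprehensive benchmark of clustering algorithms built on three elements: 36 standardized scRNA-seq datasets from different public sources, covering multiple tissue types and processed uniformly; a diverse set of clustering algorithms spanning traditional methods, deep learning, graph-based methods, and biological foundation models; and a multi-dimensional evaluation that combines quantitative and qualitative analysis, including core performance metrics, visualization, and representative downstream tasks such as marker gene identification and cell type annotation. Together these assess each method's robustness, scalability, and practical value in realistic scenarios and give researchers a transparent, reproducible, and user-friendly benchmark.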

Link: https://arxiv.org/abs/2512.02471
Authors: Ping Xu, Zaitian Wang, Zhirui Wang, Pengjiang Li, Jiajia Wang, Ran Zhang, Pengfei Wang, Yuanchun Zhou
Affiliations: Unknown
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
Comments:

Abstract:Cell clustering is crucial for uncovering cellular heterogeneity in single-cell RNA sequencing (scRNA-seq) data by identifying cell types and marker genes. Despite its importance, benchmarks for scRNA-seq clustering methods remain fragmented, often lacking standardized protocols and failing to incorporate recent advances in artificial intelligence. To fill these gaps, we present scCluBench, a comprehensive benchmark of clustering algorithms for scRNA-seq data. First, scCluBench provides 36 scRNA-seq datasets collected from diverse public sources, covering multiple tissues, which are uniformly processed and standardized to ensure consistency for systematic evaluation and downstream analyses. To evaluate performance, we collect and reproduce a range of scRNA-seq clustering methods, including traditional, deep learning-based, graph-based, and biological foundation models. We comprehensively evaluate each method both quantitatively and qualitatively, using core performance metrics as well as visualization analyses. Furthermore, we construct representative downstream biological tasks, such as marker gene identification and cell type annotation, to further assess the practical utility. scCluBench then investigates the performance differences and applicability boundaries of various clustering models across diverse analytical tasks, systematically assessing their robustness and scalability in real-world scenarios. Overall, scCluBench offers a standardized and user-friendly benchmark for scRNA-seq clustering, with curated datasets, unified evaluation protocols, and transparent analyses, facilitating informed method selection and providing valuable insights into model generalizability and application scope.

[AI-81] Quantum feature encoding optimization

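[Quick read]: This paper addresses input data encoding, a key problem in Quantum Machine Learning (QML). Conventional approaches usually focus on adjusting the parameterized circuit structure (the ansatz) and neglect how the data is passed to the ansatz. The key to the solution is to treat classical data preprocessing operations (ordering, selecting, and weighting features) as part of the encoding strategy and to optimize them to improve QML model performance, rather than relying on ansatz design alone. Experiments show that this encoding optimization improves model performance substantially and consistently across a variety of datasets, ansatz structures, and circuit sizes, and its feasibility is demonstrated on real quantum hardware.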

Link: https://arxiv.org/abs/2512.02422
Authors: Tommaso Fioravanti, Brian Quanz, Gabriele Agliardi, Edgar Andres Ruiz Guzman, Ginés Carrascal, Jae-Eun Park
Affiliations: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Quantum Machine Learning (QML) holds the promise of enhancing machine learning modeling in terms of both complexity and accuracy. A key challenge in this domain is the encoding of input data, which plays a pivotal role in determining the performance of QML models. In this work, we tackle a largely unaddressed aspect of encoding that is unique to QML modeling – rather than adjusting the ansatz used for encoding, we consider adjusting how data is conveyed to the ansatz. We specifically implement QML pipelines that leverage classical data manipulation (i.e., ordering, selecting, and weighting features) as a preprocessing step, and evaluate if these aspects of encoding can have a significant impact on QML model performance, and if they can be effectively optimized to improve performance. Our experimental results, applied across a wide variety of data sets, ansatz, and circuit sizes, with a representative QML approach, demonstrate that by optimizing how features are encoded in an ansatz we can substantially and consistently improve the performance of QML models, making a compelling case for integrating these techniques in future QML applications. Finally we demonstrate the practical feasibility of this approach by running it using real quantum hardware with 100 qubit circuits and successfully achieving improved QML modeling performance in this case as well.

[AI-82] Parallel Multi-Circuit Quantum Feature Fusion in Hybrid Quantum-Classical Convolutional Neural Networks for Breast Tumor Classification

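[Quick read]: This paper aims to improve feature extraction and classification for high-dimensional medical image data, specifically for distinguishing benign from malignant breast tumors. The key to the solution is a hybrid Quantum-Classical Convolutional Neural Network (QCNN) architecture: two different quantum circuits, an amplitude-encoding variational quantum circuit (VQC) and an angle-encoding VQC with circular entanglement, generate quantum feature embeddings that are fused with features from classical convolutional layers into a joint feature space, and a fully connected classifier makes the final decision. The design is matched to a classical CNN baseline in parameter count and training conditions so that the contribution of the quantum modules can be tested statistically. Experiments report a significant improvement over the classical CNN (p = 0.03125, Cohen's d = 2.14), which the authors take as evidence that quantum feature fusion and entanglement help in biomedical image classification.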

Link: https://arxiv.org/abs/2512.02066
Authors: Ece Yurtseven
Affiliations: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Abstract:Quantum machine learning has emerged as a promising approach to improve feature extraction and classification tasks in high-dimensional data domains such as medical imaging. In this work, we present a hybrid Quantum-Classical Convolutional Neural Network (QCNN) architecture designed for the binary classification of the BreastMNIST dataset, a standardized benchmark for distinguishing between benign and malignant breast tumors. Our architecture integrates classical convolutional feature extraction with two distinct quantum circuits: an amplitude-encoding variational quantum circuit (VQC) and an angle-encoding VQC circuit with circular entanglement, both implemented on four qubits. These circuits generate quantum feature embeddings that are fused with classical features to form a joint feature space, which is subsequently processed by a fully connected classifier. To ensure fairness, the hybrid QCNN is parameter-matched against a baseline classical CNN, allowing us to isolate the contribution of quantum layers. Both models are trained under identical conditions using the Adam optimizer and binary cross-entropy loss. Experimental evaluation in five independent runs demonstrates that the hybrid QCNN achieves statistically significant improvements in classification accuracy compared to the classical CNN, as validated by a one-sided Wilcoxon signed rank test (p = 0.03125) and supported by large effect size of Cohen’s d = 2.14. Our results indicate that hybrid QCNN architectures can leverage entanglement and quantum feature fusion to enhance medical image classification tasks. This work establishes a statistical validation framework for assessing hybrid quantum models in biomedical applications and highlights pathways for scaling to larger datasets and deployment on near-term quantum hardware.
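
The reported p = 0.03125 equals 1/32 = 1/2^5, which is the smallest p-value an exact one-sided Wilcoxon signed-rank test can produce with five paired observations. It is the value the test would give if all five paired differences favored the hybrid model, although the abstract does not state this explicitly. The Python sketch below computes the exact p-value by enumerating all sign patterns; it assumes nonzero differences with no tied magnitudes, and the accuracy differences used to exercise it are made up.

```python
from itertools import product


def wilcoxon_one_sided_exact(differences) -> float:
    """Exact P(W+ >= observed) under the null hypothesis of symmetric differences.

    Assumes every difference is nonzero and no two absolute values are tied.
    """
    n = len(differences)
    order = sorted(range(n), key=lambda i: abs(differences[i]))
    ranks = [0] * n
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    observed = sum(r for r, d in zip(ranks, differences) if d > 0)
    hits = 0
    for signs in product((0, 1), repeat=n):  # all 2**n equally likely sign patterns
        w_plus = sum(r for r, s in zip(ranks, signs) if s)
        if w_plus >= observed:
            hits += 1
    return hits / 2 ** n
```

With five runs, no outcome can yield a smaller one-sided p-value than 1/32, so significance at the 0.05 level requires every run to favor the same model.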

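[AI-83] Statistical Arbitrage in Polish Equities Market Using Deep Learning Techniques

[Quick read]: This paper addresses a limitation of the traditional Pairs Trading strategy, namely its reliance on two highly correlated assets, by replacing the second asset with a replication of the first built from risk factor representations. The replication models are constructed with Principal Component Analysis (PCA), exchange-traded funds (ETFs), and Long Short-Term Memory (LSTM) networks, and trading signals are generated from the mean-reversion properties of the residual series. The LSTM is the main contribution: it learns complex non-linear relationships from historical data to replicate the asset more accurately, with the aim of improving the strategy's adaptability and profitability across market conditions.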

Link: https://arxiv.org/abs/2512.02037
Authors: Marek Adamczyk, Michał Dąbrowski
Affiliations: Unknown
Categories: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI)
Comments:

Abstract:We study a systematic approach to a popular Statistical Arbitrage technique: Pairs Trading. Instead of relying on two highly correlated assets, we replace the second asset with a replication of the first using risk factor representations. These factors are obtained through Principal Components Analysis (PCA), exchange traded funds (ETFs), and, as our main contribution, Long Short Term Memory networks (LSTMs). Residuals between the main asset and its replication are examined for mean reversion properties, and trading signals are generated for sufficiently fast mean reverting portfolios. Beyond introducing a deep learning based replication method, we adapt the framework of Avellaneda and Lee (2008) to the Polish market. Accordingly, components of WIG20, mWIG40, and selected sector indices replace the original SP500 universe, and market parameters such as the risk free rate and transaction costs are updated to reflect local conditions. We outline the full strategy pipeline: risk factor construction, residual modeling via the Ornstein Uhlenbeck process, and signal generation. Each replication technique is described together with its practical implementation. Strategy performance is evaluated over two periods: 2017-2019 and the recessive year 2020. All methods yield profits in 2017-2019, with PCA achieving roughly 20 percent cumulative return and an annualized Sharpe ratio of up to 2.63. Despite multiple adaptations, our conclusions remain consistent with those of the original paper. During the COVID-19 recession, only the ETF based approach remains profitable (about 5 percent annual return), while PCA and LSTM methods underperform. LSTM results, although negative, are promising and indicate potential for future optimization. 
Subjects: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI). Cite as: arXiv:2512.02037 [q-fin.ST] (or arXiv:2512.02037v1 [q-fin.ST] for this version). DOI: https://doi.org/10.48550/arXiv.2512.02037. Submission history: from Marek Adamczyk, [v1] Thu, 20 Nov 2025 21:09:34 UTC (7,498 KB).
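The residual-modeling step named in the abstract can be illustrated with a small sketch. It assumes a daily residual series and fits a discrete AR(1) regression, a common way to estimate Ornstein-Uhlenbeck parameters. The function name `ou_s_score`, the 252-trading-day convention, and the use of NumPy are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def ou_s_score(residuals, dt=1.0 / 252.0):
    """Estimate OU parameters from a residual series via an AR(1) fit.

    Hypothetical helper: returns (kappa, mean, sigma_eq, s_score).
    """
    x = np.asarray(residuals, dtype=float)
    x_prev, x_next = x[:-1], x[1:]
    # Least-squares fit of x_next = a + b * x_prev + noise.
    b, a = np.polyfit(x_prev, x_next, 1)
    if not 0.0 < b < 1.0:
        raise ValueError("AR(1) slope outside (0, 1): no mean reversion detected")
    noise = x_next - (a + b * x_prev)
    kappa = -np.log(b) / dt                 # mean-reversion speed (per year)
    mean = a / (1.0 - b)                    # long-run equilibrium level
    # Standard deviation of the stationary distribution.
    sigma_eq = np.sqrt(noise.var(ddof=2) / (1.0 - b ** 2))
    # Distance of the latest residual from equilibrium, in std units.
    s_score = (x[-1] - mean) / sigma_eq
    return kappa, mean, sigma_eq, s_score
```

A strategy along these lines would typically trade only when the estimated mean-reversion speed is high enough and the s-score is far from zero. The abstract does not state the specific thresholds, so none are assumed here.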

[AI-84] Integration of LSTM Networks in Random Forest Algorithms for Stock Market Trading Predictions

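[Quick read]: This paper addresses the performance bottleneck that arises when traditional stock trading systems fail to combine fundamental data with technical data. The key to the solution is a hybrid trading algorithm that pairs a technical analysis model based on Long Short-Term Memory (LSTM) networks with a fundamental analysis model based on decision-tree algorithms such as Random Forest and Gradient Boosting: the LSTM captures time-series patterns in asset prices, while the decision trees process company-level economic indicators. The empirical results indicate that, in 10-weekday predictions, the hybrid approach can outperform conventional approaches that use only one type of variable, that Random Forest performs best among the decision-tree methods, and that selecting the technical variables can further boost prediction performance.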

Link: https://arxiv.org/abs/2512.02036
Authors: Juan C. King, Jose M. Amigo
Affiliation: Unknown
Categories: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 24 pages, 7 figures, 2 tables

Abstract:The aim of this paper is the analysis and selection of stock trading systems that combine different models with data of different nature, such as financial and microeconomic information. Specifically, based on previous work by the authors and applying advanced techniques of Machine Learning and Deep Learning, our objective is to formulate trading algorithms for the stock market with empirically tested statistical advantages, thus improving results published in the literature. Our approach integrates Long Short-Term Memory (LSTM) networks with algorithms based on decision trees, such as Random Forest and Gradient Boosting. While the former analyze price patterns of financial assets, the latter are fed with economic data of companies. Numerical simulations of algorithmic trading with data from international companies and 10-weekday predictions confirm that an approach based on both fundamental and technical variables can outperform the usual approaches, which do not combine those two types of variables. In doing so, Random Forest turned out to be the best performer among the decision trees. We also discuss how the prediction performance of such a hybrid approach can be boosted by selecting the technical variables.

[AI-85] CONFIDE: Hallucination Assessment for Reliable Biomolecular Structure Prediction and Design

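[Quick read]: This paper addresses a key challenge in evaluating protein structure predictions: existing metrics such as pLDDT reflect energetic stability but struggle to capture subtle errors such as atomic clashes or conformational traps, which often stem from topological frustration in the protein folding energy landscape. To address this, the authors propose CODE (Chain of Diffusion Embeddings), a new method that quantifies topological frustration directly and without supervision from the latent diffusion embeddings of the AlphaFold3 series of models. Its key innovation is combining the theoretical concept of topological frustration with data-driven embeddings. Integrating CODE with pLDDT yields CONFIDE, a unified evaluation framework that improves the reliability of AlphaFold3 and related models on tasks such as molecular glue structure prediction, where its Spearman correlation with RMSD reaches 0.73 compared with 0.42 for pLDDT.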

Link: https://arxiv.org/abs/2512.02033
Authors: Zijun Gao, Mutian He, Shijia Sun, Hanqun Cao, Jingjie Zhang, Zihao Luo, Xiaorui Wang, Xiaojun Yao, Chang-Yu Hsieh, Chunbin Gu, Pheng Ann Heng
Affiliation: Unknown
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reliable evaluation of protein structure predictions remains challenging, as metrics like pLDDT capture energetic stability but often miss subtle errors such as atomic clashes or conformational traps reflecting topological frustration within the protein folding energy landscape. We present CODE (Chain of Diffusion Embeddings), a self-evaluating metric empirically found to quantify topological frustration directly from the latent diffusion embeddings of the AlphaFold3 series of structure predictors in a fully unsupervised manner. Integrating this with pLDDT, we propose CONFIDE, a unified evaluation framework that combines energetic and topological perspectives to improve the reliability of AlphaFold3 and related models. CODE strongly correlates with protein folding rates driven by topological frustration, achieving a correlation of 0.82 compared to pLDDT’s 0.33 (a relative improvement of 148%). CONFIDE significantly enhances the reliability of quality evaluation in molecular glue structure prediction benchmarks, achieving a Spearman correlation of 0.73 with RMSD, compared to pLDDT’s correlation of 0.42, a relative improvement of 73.8%. Beyond quality assessment, our approach applies to diverse drug design tasks, including all-atom binder design, enzymatic active site mapping, mutation-induced binding affinity prediction, nucleic acid aptamer screening, and flexible protein modeling. By combining data-driven embeddings with theoretical insight, CODE and CONFIDE outperform existing metrics across a wide range of biomolecular systems, offering robust and versatile tools to refine structure predictions, advance structural biology, and accelerate drug discovery.
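
The two relative improvements quoted in the abstract are consistent with the usual definition, (new - old) / old. As a quick check that uses only the correlations reported above:

$$
\frac{0.82 - 0.33}{0.33} \approx 1.48 \;(148\%), \qquad \frac{0.73 - 0.42}{0.42} \approx 0.738 \;(73.8\%).
$$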

[AI-86] Characterizing Continuous and Discrete Hybrid Latent Spaces for Structural Connectomes

Link: https://arxiv.org/abs/2512.02032
Authors: Gaurav Rudravaram, Lianrui Zuo, Adam M. Saunders, Michael E. Kim, Praitayini Kanakaraj, Nancy R. Newlin, Aravind R. Krishnan, Elyssa M. McMaster, Chloe Cho, Susan M. Resnick, Lori L. Beason Held, Derek Archer, Timothy J. Hohman, Daniel C. Moyer, Bennett A. Landman
Affiliation: Unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

[AI-87] Towards Sustainable Precision: Machine Learning for Laser Micromachining Optimization

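[Quick read]: This paper addresses the difficulty of detecting defective workpieces early in ultra-short pulse laser micromachining when surface quality must be assessed across different preprocessing techniques. The key to the solution is an optimized machine learning framework that reduces model size and data dimensionality to enable real-time analysis. It maintains generalization across preprocessing techniques while significantly reducing the computational requirements for training, with the goal of establishing a baseline for more sustainable manufacturing.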

Link: https://arxiv.org/abs/2512.02026
Authors: Luis Correas-Naranjo, Miguel Camacho-Sánchez, Laëtitia Launet, Milena Zuric, Valery Naranjo
Affiliation: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments:

Abstract:In the pursuit of sustainable manufacturing, ultra-short pulse laser micromachining stands out as a promising solution while also offering high-precision and qualitative laser processing. However, unlocking the full potential of ultra-short pulse lasers requires an optimized monitoring system capable of early detection of defective workpieces, regardless of the preprocessing technique employed. While advances in machine learning can help predict process quality features, the complexity of monitoring data necessitates reducing both model size and data dimensionality to enable real-time analysis. To address these challenges, this paper introduces a machine learning framework designed to enhance surface quality assessment across diverse preprocessing techniques. To facilitate real-time laser processing monitoring, our solution aims to optimize the computational requirements of the machine learning model. Experimental results show that the proposed model not only outperforms the generalizability achieved by previous works across diverse preprocessing techniques but also significantly reduces the computational requirements for training. Through these advancements, we aim to establish the baseline for a more sustainable manufacturing process.

[AI-88] DySTAN: Joint Modeling of Sedentary Activity and Social Context from Smartphone Sensors

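[Quick read]: This paper addresses the difficulty of accurately recognizing user context from smartphone sensor data, especially in sedentary settings where activities such as studying, attending lectures, relaxing, and eating show highly similar inertial patterns. It also notes that social context, although critical for understanding user behavior, is often overlooked in mobile sensing research. To tackle this, the authors introduce LogMe, an application that passively collects sensor data (accelerometer, gyroscope, magnetometer, and rotation vector) and combines it with hourly user self-reports to build a dataset labeled with both activity and social context. The core solution is DySTAN (Dynamic Cross-Stitch with Task Attention Network), a multi-task learning framework that jointly classifies both context dimensions from shared sensor inputs, using task-specific layers and cross-task attention to model subtle distinctions. It improves the sedentary activity macro F1 score by 21.8% over a single-task CNN-BiLSTM-GRU model and by 8.2% over the strongest multi-task baseline, Sluice Network.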

Link: https://arxiv.org/abs/2512.02025
Authors: Aditya Sneh, Nilesh Kumar Sahu, Snehil Gupta, Haroon R. Lone
Affiliation: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract:Accurately recognizing human context from smartphone sensor data remains a significant challenge, especially in sedentary settings where activities such as studying, attending lectures, relaxing, and eating exhibit highly similar inertial patterns. Furthermore, social context plays a critical role in understanding user behavior, yet is often overlooked in mobile sensing research. To address these gaps, we introduce LogMe, a mobile sensing application that passively collects smartphone sensor data (accelerometer, gyroscope, magnetometer, and rotation vector) and prompts users for hourly self-reports capturing both sedentary activity and social context. Using this dual-label dataset, we propose DySTAN (Dynamic Cross-Stitch with Task Attention Network), a multi-task learning framework that jointly classifies both context dimensions from shared sensor inputs. It integrates task-specific layers with cross-task attention to model subtle distinctions effectively. DySTAN improves sedentary activity macro F1 scores by 21.8% over a single-task CNN-BiLSTM-GRU (CBG) model and by 8.2% over the strongest multi-task baseline, Sluice Network (SN). These results demonstrate the importance of modeling multiple, co-occurring context dimensions to improve the accuracy and robustness of mobile context recognition.

Machine Learning

[LG-0] Learning Physically Consistent Lagrangian Control Models Without Acceleration Measurements

Link: https://arxiv.org/abs/2512.03035
Authors: Ibrahim Laiche, Mokrane Boudaoud, Patrick Gallinari, Pascal Morin
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: Submitted to the L4DC 2026

Abstract:This article investigates the modeling and control of Lagrangian systems involving non-conservative forces using a hybrid method that does not require acceleration calculations. It focuses in particular on the derivation and identification of physically consistent models, which are essential for model-based control synthesis. Lagrangian or Hamiltonian neural networks provide useful structural guarantees but the learning of such models often leads to inconsistent models, especially on real physical systems where training data are limited, partial and noisy. Motivated by this observation and the objective to exploit these models for model-based nonlinear control, a learning algorithm relying on an original loss function is proposed to improve the physical consistency of Lagrangian systems. A comparative analysis of different learning-based modeling approaches with the proposed solution shows significant improvements in terms of physical consistency of the learned models, on both simulated and experimental systems. The model’s consistency is then exploited to demonstrate, on an experimental benchmark, the practical relevance of the proposed methodology for feedback linearization and energy-based control techniques.

[LG-1] ProteinPNet: Prototypical Part Networks for Concept Learning in Spatial Proteomics

Link: https://arxiv.org/abs/2512.02983
Authors: Louis McConnell, Jieran Sun, Theo Maffei, Raphael Gottardo, Marianna Rapsomaniki
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Understanding the spatial architecture of the tumor microenvironment (TME) is critical to advance precision oncology. We present ProteinPNet, a novel framework based on prototypical part networks that discovers TME motifs from spatial proteomics data. Unlike traditional post-hoc explainability models, ProteinPNet directly learns discriminative, interpretable, faithful spatial prototypes through supervised training. We validate our approach on synthetic datasets with ground truth motifs, and further test it on a real-world lung cancer spatial proteomics dataset. ProteinPNet consistently identifies biologically meaningful prototypes aligned with different tumor subtypes. Through graphical and morphological analyses, we show that these prototypes capture interpretable features pointing to differences in immune infiltration and tissue modularity. Our results highlight the potential of prototype-based learning to reveal interpretable spatial biomarkers within the TME, with implications for mechanistic discovery in spatial omics.

[LG-2] Pruning AMR: Efficient Visualization of Implicit Neural Representations via Weight Matrix Analysis

Link: https://arxiv.org/abs/2512.02967
Authors: Jennifer Zvonek, Andrew Gillette
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

Abstract:An implicit neural representation (INR) is a neural network that approximates a spatiotemporal function. Many memory-intensive visualization tasks, including modern 4D CT scanning methods, represent data natively as INRs. While INRs are prized for being more memory-efficient than traditional data stored on a lattice, many visualization tasks still require discretization to a regular grid. We present PruningAMR, an algorithm that builds a mesh with resolution adapted to geometric features encoded by the INR. To identify these geometric features, we use an interpolative decomposition pruning method on the weight matrices of the INR. The resulting pruned network is used to guide adaptive mesh refinement, enabling automatic mesh generation tailored to the underlying resolution of the function. Starting from a pre-trained INR, without access to its training data, we produce a variable resolution visualization with substantial memory savings.

[LG-3] Fast Gaussian Process Approximations for Autocorrelated Data

Link: https://arxiv.org/abs/2512.02925
Authors: Ahmadreza Chokhachian, Matthias Katzfuss, Yu Ding
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted by the INFORMS Journal on Data Science

Abstract:This paper is concerned with the problem of how to speed up computation for Gaussian process models trained on autocorrelated data. The Gaussian process model is a powerful tool commonly used in nonlinear regression applications. Standard regression modeling assumes random samples and independent, identically distributed noise. Various fast approximations that speed up Gaussian process regression work under this standard setting. But for autocorrelated data, failing to account for autocorrelation leads to a phenomenon known as temporal overfitting that deteriorates model performance on new test instances. To handle autocorrelated data, existing fast Gaussian process approximations have to be modified; one such approach is to segment the originally correlated data points into blocks in which the blocked data are de-correlated. This work explains how to make some of the existing Gaussian process approximations work with blocked data. Numerical experiments across diverse application datasets demonstrate that the proposed approaches can remarkably accelerate computation for Gaussian process regression on autocorrelated data without compromising model prediction performance.

[LG-4] Hypothesis Testing for Generalized Thurstone Models

Link: https://arxiv.org/abs/2512.02912
Authors: Anuran Makur, Japneet Singh
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: 35 pages, 9 figures

Abstract:In this work, we develop a hypothesis testing framework to determine whether pairwise comparison data is generated by an underlying generalized Thurstone model $\mathcal{T}_F$ for a given choice function $F$. While prior work has predominantly focused on parameter estimation and uncertainty quantification for such models, we address the fundamental problem of minimax hypothesis testing for $\mathcal{T}_F$ models. We formulate this testing problem by introducing a notion of separation distance between general pairwise comparison models and the class of $\mathcal{T}_F$ models. We then derive upper and lower bounds on the critical threshold for testing that depend on the topology of the observation graph. For the special case of complete observation graphs, this threshold scales as $\Theta((nk)^{-1/2})$, where $n$ is the number of agents and $k$ is the number of comparisons per pair. Furthermore, we propose a hypothesis test based on our separation distance, construct confidence intervals, establish time-uniform bounds on the probabilities of type I and II errors using reverse martingale techniques, and derive minimax lower bounds using information-theoretic methods. Finally, we validate our results through experiments on synthetic and real-world datasets.

[LG-5] Assessing the performance of correlation-based multi-fidelity neural emulators

Link: https://arxiv.org/abs/2512.02868
Authors: Cristian J. Villatoro, Gianluca Geraci, Daniele E. Schiavazzi
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Outer loop tasks such as optimization, uncertainty quantification or inference can easily become intractable when the underlying high-fidelity model is computationally expensive. Similarly, data-driven architectures typically require large datasets to perform predictive tasks with sufficient accuracy. A possible approach to mitigate these challenges is the development of multi-fidelity emulators, leveraging potentially biased, inexpensive low-fidelity information while correcting and refining predictions using scarce, accurate high-fidelity data. This study investigates the performance of multi-fidelity neural emulators, neural networks designed to learn the input-to-output mapping by integrating limited high-fidelity data with abundant low-fidelity model solutions. We investigate the performance of such emulators for low and high-dimensional functions, with oscillatory character, in the presence of discontinuities, for collections of models with equal and dissimilar parametrization, and for a possibly large number of potentially corrupted low-fidelity sources. In doing so, we consider a large number of architectural, hyperparameter, and dataset configurations including networks with a different amount of spectral bias (Multi-Layered Perceptron, Siren and Kolmogorov Arnold Network), various mechanisms for coordinate encoding, exact or learnable low-fidelity information, and for varying training dataset size. We further analyze the added value of the multi-fidelity approach by conducting equivalent single-fidelity tests for each case, quantifying the performance gains achieved through fusing multiple sources of information.

[LG-6] Adaptive Decentralized Federated Learning for Robust Optimization

Link: https://arxiv.org/abs/2512.02852
Authors: Shuyuan Wu, Feifei Wang, Yuan Gao, Hansheng Wang
Categories: Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Abstract:In decentralized federated learning (DFL), the presence of abnormal clients, often caused by noisy or poisoned data, can significantly disrupt the learning process and degrade the overall robustness of the model. Previous methods on this issue often require a sufficiently large number of normal neighboring clients or prior knowledge of reliable clients, which reduces the practical applicability of DFL. To address these limitations, we develop here a novel adaptive DFL (aDFL) approach for robust estimation. The key idea is to adaptively adjust the learning rates of clients. By assigning smaller rates to suspicious clients and larger rates to normal clients, aDFL mitigates the negative impact of abnormal clients on the global model in a fully adaptive way. Our theory does not put any stringent conditions on neighboring nodes and requires no prior knowledge. A rigorous convergence analysis is provided to guarantee the oracle property of aDFL. Extensive numerical experiments demonstrate the superior performance of the aDFL method.

[LG-7] VLM as Strategist: Adaptive Generation of Safety-critical Testing Scenarios via Guided Diffusion

Link: https://arxiv.org/abs/2512.02844
Authors: Xinzheng Wu, Junyi Chen, Naiting Zhong, Yong Shen
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 25 pages, 9 figures

Abstract:The safe deployment of autonomous driving systems (ADSs) relies on comprehensive testing and evaluation. However, safety-critical scenarios that can effectively expose system vulnerabilities are extremely sparse in the real world. Existing scenario generation methods face challenges in efficiently constructing long-tail scenarios that ensure fidelity, criticality, and interactivity, while particularly lacking real-time dynamic response capabilities to the vehicle under test (VUT). To address these challenges, this paper proposes a safety-critical testing scenario generation framework that integrates the high-level semantic understanding capabilities of Vision Language Models (VLMs) with the fine-grained generation capabilities of adaptive guided diffusion models. The framework establishes a three-layer hierarchical architecture comprising a strategic layer for VLM-directed scenario generation objective determination, a tactical layer for guidance function formulation, and an operational layer for guided diffusion execution. We first establish a high-quality fundamental diffusion model that learns the data distribution of real driving scenarios. Next, we design an adaptive guided diffusion method that enables real-time, precise control of background vehicles (BVs) in closed-loop simulation. The VLM is then incorporated to autonomously generate scenario generation objectives and guidance functions through deep scenario understanding and risk reasoning, ultimately guiding the diffusion model to achieve VLM-directed scenario generation. Experimental results demonstrate that the proposed method can efficiently generate realistic, diverse, and highly interactive safety-critical testing scenarios. Furthermore, case studies validate the adaptability and VLM-directed generation performance of the proposed method.

[LG-8] FiMMIA: scaling semantic perturbation-based membership inference across modalities EACL2026

Link: https://arxiv.org/abs/2512.02786
Authors: Anton Emelyanov, Sergei Kudriashov, Alena Fenogenova
Categories: Machine Learning (cs.LG)
Comments: System demo track paper for EACL 2026

Abstract:Membership Inference Attacks (MIAs) aim to determine whether a specific data point was included in the training set of a target model. Although numerous methods have been developed for detecting data contamination in large language models (LLMs), their performance on multimodal LLMs (MLLMs) falls short due to the instabilities introduced through multimodal component adaptation and possible distribution shifts across multiple inputs. In this work, we investigate multimodal membership inference and address two issues: first, by identifying distribution shifts in the existing datasets, and second, by releasing an extended baseline pipeline to detect them. We also generalize the perturbation-based membership inference methods to MLLMs and release FiMMIA, a modular Framework for Multimodal MIA. (The source code and framework have been made publicly available under the MIT license, and a video demonstration is available on YouTube.) Our approach trains a neural network to analyze the target model’s behavior on perturbed inputs, capturing distributional differences between members and non-members. Comprehensive evaluations on various fine-tuned multimodal models demonstrate the effectiveness of our perturbation-based membership inference attacks in multimodal domains.

[LG-9] Adversarial Jamming for Autoencoder Distribution Matching ICASSP2024

Link: https://arxiv.org/abs/2512.02740
Authors: Waleed El-Geresy, Deniz Gündüz
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments: Presented at ICASSP 2024. 5 pages, 3 figures

Abstract:We propose the use of adversarial wireless jamming to regularise the latent space of an autoencoder to match a diagonal Gaussian distribution. We consider the minimisation of a mean squared error distortion, where a jammer attempts to disrupt the recovery of a Gaussian source encoded and transmitted over the adversarial channel. A straightforward consequence of existing theoretical results is the fact that the saddle point of a minimax game - involving such an encoder, its corresponding decoder, and an adversarial jammer - consists of diagonal Gaussian noise output by the jammer. We use this result as inspiration for a novel approach to distribution matching in the latent space, utilising jamming as an auxiliary objective to encourage the aggregated latent posterior to match a diagonal Gaussian distribution. Using this new technique, we achieve distribution matching comparable to standard variational autoencoders and to Wasserstein autoencoders. This approach can also be generalised to other latent distributions.

[LG-10] Credal Graph Neural Networks

Link: https://arxiv.org/abs/2512.02722
Authors: Matteo Tolloso, Davide Bacciu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Uncertainty quantification is essential for deploying reliable Graph Neural Networks (GNNs), where existing approaches primarily rely on Bayesian inference or ensembles. In this paper, we introduce the first credal graph neural networks (CGNNs), which extend credal learning to the graph domain by training GNNs to output set-valued predictions in the form of credal sets. To account for the distinctive nature of message passing in GNNs, we develop a complementary approach to credal learning that leverages different aspects of layer-wise information propagation. We assess our approach on uncertainty quantification in node classification under out-of-distribution conditions. Our analysis highlights the critical role of the graph homophily assumption in shaping the effectiveness of uncertainty estimates. Extensive experiments demonstrate that CGNNs deliver more reliable representations of epistemic uncertainty and achieve state-of-the-art performance under distributional shift on heterophilic graphs.

[LG-11] FGC-Comp: Adaptive Neighbor-Grouped Attribute Completion for Graph-based Anomaly Detection

Link: https://arxiv.org/abs/2512.02705
Authors: Junpeng Wu, Pinheng Zong
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: 6 pages, 2 figures

Abstract:Graph-based Anomaly Detection models have gained widespread adoption in recent years, identifying suspicious nodes by aggregating neighborhood information. However, most existing studies overlook the pervasive issues of missing and adversarially obscured node attributes, which can undermine aggregation stability and prediction reliability. To mitigate this, we propose FGC-Comp, a lightweight, classifier-agnostic, and deployment-friendly attribute completion module designed to enhance neighborhood aggregation under incomplete attributes. We partition each node’s neighbors into three label-based groups, apply group-specific transforms to the labeled groups while a node-conditioned gate handles unknowns, fuse messages via residual connections, and train end-to-end with a binary classification objective to improve aggregation stability and prediction reliability under missing attributes. Experiments on two real-world fraud datasets validate the effectiveness of the approach with negligible computational overhead.

[LG-12] Conformal Correction for Efficiency May be at Odds with Entropy

Link: https://arxiv.org/abs/2512.02704
Authors: Senrong Xu, Tianyu Wang, Zenan Li, Yuan Yao, Taolue Chen, Feng Xu, Xiaoxing Ma
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Conformal prediction (CP) provides a comprehensive framework to produce statistically rigorous uncertainty sets for black-box machine learning models. To further improve the efficiency of CP, conformal correction is proposed to fine-tune or wrap the base model with an extra module using a conformal-aware inefficiency loss. In this work, we empirically and theoretically identify a trade-off between the CP efficiency and the entropy of model prediction. We then propose an entropy-constrained conformal correction method, exploring a better Pareto optimum between efficiency and entropy. Extensive experimental results on both computer vision and graph datasets demonstrate the efficacy of the proposed method. For instance, it can significantly improve the efficiency of state-of-the-art CP methods by up to 34.4%, given an entropy threshold.
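
To make the two quantities in this trade-off concrete, here is a minimal sketch of standard split conformal prediction for classification, with helpers that measure efficiency (average prediction-set size) and prediction entropy. It is not the paper's conformal correction method. The function names, the choice of score (one minus the true-class probability), and the use of NumPy are assumptions made for illustration.

```python
import numpy as np

def split_conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Return a boolean (n_test, n_classes) mask of split-conformal prediction sets."""
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=int)
    test_probs = np.asarray(test_probs, dtype=float)
    n = len(cal_labels)
    # Nonconformity score: one minus the probability given to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1.0 - alpha)))
    # With too few calibration points the set must contain every class.
    threshold = np.sort(scores)[k - 1] if k <= n else np.inf
    # A class joins the set when its score does not exceed the threshold.
    return (1.0 - test_probs) <= threshold

def mean_set_size(sets):
    """Average prediction-set size; smaller means more efficient."""
    return float(np.asarray(sets).sum(axis=1).mean())

def mean_entropy(probs):
    """Average Shannon entropy (in nats) of the predicted distributions."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum(axis=1).mean())
```

Comparing `mean_set_size` and `mean_entropy` for a base model and a fine-tuned one is one simple way to observe the kind of trade-off the abstract describes.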

[LG-13] Embedding networks with the random walk first return time distribution

Link: https://arxiv.org/abs/2512.02694
Authors: Vedanta Thapar, Renaud Lambiotte, George T. Cantwell
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Comments:

Abstract:We propose the first return time distribution (FRTD) of a random walk as an interpretable and mathematically grounded node embedding. The FRTD assigns a probability mass function to each node, allowing us to define a distance between any pair of nodes using standard metrics for discrete distributions. We present several arguments to motivate the FRTD embedding. First, we show that FRTDs are strictly more informative than eigenvalue spectra, yet insufficient for complete graph identification, thus placing FRTD equivalence between cospectrality and isomorphism. Second, we argue that FRTD equivalence between nodes captures structural similarity. Third, we empirically demonstrate that the FRTD embedding outperforms manually designed graph metrics in network alignment tasks. Finally, we show that random networks that approximately match the FRTD of a desired target also preserve other salient features. Together these results demonstrate the FRTD as a simple and mathematically principled embedding for complex networks.
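
As a rough illustration of the quantity being embedded, the sketch below computes a truncated first return time distribution for one node of a small graph. It assumes a simple (unbiased) random walk on a graph with no isolated nodes, and the function name and the truncation parameter `t_max` are hypothetical. The paper's own construction may differ.

```python
import numpy as np

def first_return_time_distribution(adjacency, node, t_max):
    """Return [f(1), ..., f(t_max)], where f(t) is the probability that a
    simple random walk started at `node` first returns to it at step t."""
    a = np.asarray(adjacency, dtype=float)
    p = a / a.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    others = np.arange(len(a)) != node     # mask of every node except `node`
    leave = p[node, others]                # one step from `node` to the others
    back = p[others, node]                 # one step from the others back to `node`
    q = p[others][:, others]               # moves that avoid `node`
    frtd = np.zeros(t_max)
    frtd[0] = p[node, node]                # immediate return needs a self-loop
    state = leave                          # walk mass that is still away from `node`
    for t in range(2, t_max + 1):
        frtd[t - 1] = state @ back         # mass returning exactly at step t
        state = state @ q                  # mass that keeps avoiding `node`
    return frtd
```

Two nodes can then be compared with any distance between discrete distributions, for example the total variation distance `0.5 * np.abs(f_u - f_v).sum()` on equally truncated vectors.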

[LG-14] Adaptive Weighted LSSVM for Multi-View Classification

Link: https://arxiv.org/abs/2512.02653
Authors: Farnaz Faramarzi Lighvan, Mehrdad Asadi, Lynn Houthuys
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Multi-view learning integrates diverse representations of the same instances to improve performance. Most existing kernel-based multi-view learning methods use fusion techniques without enforcing an explicit collaboration type across views or co-regularization, which limits global collaboration. We propose AW-LSSVM, an adaptive weighted LS-SVM that promotes complementary learning by an iterative global coupling to make each view focus on hard samples of others from previous iterations. Experiments demonstrate that AW-LSSVM outperforms existing kernel-based multi-view methods on most datasets, while keeping raw features isolated, making it also suitable for privacy-preserving scenarios.

[LG-15] SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization

Link: https://arxiv.org/abs/2512.02631
Authors: Zhengcheng Wang, Zichuan Lin, Yijun Yang, Haobo Fu, Deheng Ye
Categories: Machine Learning (cs.LG)
Comments: 12 pages, 6 figures

Abstract:Existing Vision-Language Navigation (VLN) agents based on Large Vision-Language Models (LVLMs) often suffer from perception errors, reasoning errors, and planning errors, which significantly hinder their navigation performance. To address these limitations, a novel VLN agent framework, named SeeNav-Agent, is proposed in this work. First, to reduce perception hallucinations of the visual module of the VLN agent, a dual-view Visual Prompt (VP) technique is introduced in the input space, which can also improve the agent’s understanding of current spatial states. Subsequently, a novel step-level Reinforcement Fine-Tuning (RFT) method, Step Reward Group Policy Optimization (SRGPO), is designed for the post-training of VLN agents. In SRGPO, we first define verifiable process rewards for the navigation task, and then perform efficient step-level advantage estimation by randomly grouping different navigation steps. SRGPO provides dense reward signals for the reinforcement learning process of the VLN agent and enhances its planning capability. Experimental results on the EmbodiedBench Navigation benchmark indicate that by introducing the zero-shot VP module, GPT-4.1 achieves a navigation success rate of 86.7%, surpassing the current best LVLM by approximately 20 percentage points (pp). Through post-training based on SRGPO, the Qwen2.5-VL-3B model reaches a navigation success rate of 72.3%, outperforming the best existing LVLM model by 5.6 pp. Moreover, compared to RFT algorithms such as GRPO and GiGPO, the proposed SRGPO demonstrates significant improvements in training stability, convergence efficiency, and generalization capability.

[LG-16] Adapting Tensor Kernel Machines to Enable Efficient Transfer Learning for Seizure Detection

Link: https://arxiv.org/abs/2512.02626
Authors: Seline J.S. de Rooij, Borbála Hunyadi
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:Transfer learning aims to optimize performance in a target task by learning from a related source problem. In this work, we propose an efficient transfer learning method using a tensor kernel machine. Our method takes inspiration from the adaptive SVM and hence transfers ‘knowledge’ from the source to the ‘adapted’ model via regularization. The main advantage of using tensor kernel machines is that they leverage low-rank tensor networks to learn a compact non-linear model in the primal domain. This allows for a more efficient adaptation without adding more parameters to the model. To demonstrate the effectiveness of our approach, we apply the adaptive tensor kernel machine (Adapt-TKM) to seizure detection on behind-the-ear EEG. By personalizing patient-independent models with a small amount of patient-specific data, the patient-adapted model (which uses the Adapt-TKM) achieves better performance than both the patient-independent and fully patient-specific models. Notably, it does so while requiring around 100 times fewer parameters than the adaptive SVM model, leading to a correspondingly faster inference speed. This makes the Adapt-TKM especially useful for resource-constrained wearable devices.
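
Transfer "via regularization" in the adaptive-SVM style can be written generically as a penalty that keeps the adapted weights close to the source weights. The objective below is a sketch of that general idea, not the paper's formulation: the symbols $w_s$ (source weights), $\phi$ (feature map), $\ell$ (loss), and $C$ (trade-off constant) are notation assumed here, and the abstract does not state how Adapt-TKM combines this with its tensor-network parameterization.

```latex
\min_{w} \;\; \frac{1}{2}\,\lVert w - w_s \rVert^{2}
\;+\; C \sum_{i=1}^{N} \ell\!\left(y_i,\; w^{\top}\phi(x_i)\right)
```

With few target samples the first term dominates and $w$ stays near $w_s$; as patient-specific data grows, the loss term pulls the model toward the target task.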

[LG-17] Modeling and Inverse Identification of Interfacial Heat Conduction in Finite Layer and Semi-Infinite Substrate Systems via a Physics-Guided Neural Framework

Link: https://arxiv.org/abs/2512.02618
Authors: Wenhao Sha, Tienchong Chang
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)

Abstract:Heat transfer in semiconductor devices is dominated by chip and substrate assemblies, where heat generated within a finite chip layer dissipates into a semi-infinite substrate with much higher thermophysical properties. This mismatch produces steep interfacial temperature gradients, making the transient thermal response highly sensitive to the interface. Conventional numerical solvers require excessive discretization to resolve these dynamics, while physics-informed neural networks (PINNs) often exhibit unstable convergence and loss of physical consistency near the material interface. To address these challenges, we introduce HeatTransFormer, a physics-guided Transformer architecture for interface-dominated diffusion problems. The framework integrates physically informed spatiotemporal sampling, a Laplace-based activation emulating analytical diffusion solutions, and a mask-free attention mechanism supporting bidirectional spatiotemporal coupling. These components enable the model to resolve steep gradients, maintain physical consistency, and remain stable where PINNs typically fail. HeatTransFormer produces coherent temperature fields across the interface when applied to a finite layer and semi-infinite substrate configuration. Coupled with a physics-constrained inverse strategy, it further enables reliable identification of three unknown thermal properties simultaneously using only external measurements. Overall, this work demonstrates that physics-guided Transformer architectures provide a unified framework for forward and inverse modeling in interface-dominated thermal systems.

[LG-18] GoRL: An Algorithm-Agnostic Framework for Online Reinforcement Learning with Generative Policies

Link: https://arxiv.org/abs/2512.02581
Authors: Chubin Zhang, Zhenglin Wan, Feng Chen, Xingrui Yu, Ivor Tsang, Bo An
Categories: Machine Learning (cs.LG)
Comments: 27 pages

Abstract:Reinforcement learning (RL) faces a persistent tension: policies that are stable to optimize are often too simple to represent the multimodal action distributions needed for complex control. Gaussian policies provide tractable likelihoods and smooth gradients, but their unimodal form limits expressiveness. Conversely, generative policies based on diffusion or flow matching can model rich multimodal behaviors; however, in online RL, they are frequently unstable due to intractable likelihoods and noisy gradients propagating through deep sampling chains. We address this tension with a key structural principle: decoupling optimization from generation. Building on this insight, we introduce GoRL (Generative Online Reinforcement Learning), a framework that optimizes a tractable latent policy while utilizing a conditional generative decoder to synthesize actions. A two-timescale update schedule enables the latent policy to learn stably while the decoder steadily increases expressiveness, without requiring tractable action likelihoods. Across a range of continuous-control tasks, GoRL consistently outperforms both Gaussian policies and recent generative-policy baselines. Notably, on the HopperStand task, it reaches a normalized return above 870, more than 3 times that of the strongest baseline. These results demonstrate that separating optimization from generation provides a practical path to policies that are both stable and highly expressive.

[LG-19] Tensor Network Based Feature Learning Model

Link: https://arxiv.org/abs/2512.02547
Authors: Albert Saiapin, Kim Batselier
Categories: Machine Learning (cs.LG)
Comments: 11 pages, 2 figures, 2 tables. Code available at: this https URL

Abstract:Many approximations have been suggested to circumvent the cubic complexity of kernel-based algorithms, allowing their application to large-scale datasets. One strategy is to consider the primal formulation of the learning problem by mapping the data to a higher-dimensional space using tensor-product structured polynomial and Fourier features. The curse of dimensionality due to these tensor-product features was effectively solved by a tensor network reparameterization of the model parameters. However, another important aspect of model training - identifying optimal feature hyperparameters - has not been addressed and is typically handled using the standard cross-validation approach. In this paper, we introduce the Feature Learning (FL) model, which addresses this issue by representing tensor-product features as a learnable Canonical Polyadic Decomposition (CPD). By leveraging this CPD structure, we efficiently learn the hyperparameters associated with different features alongside the model parameters using an Alternating Least Squares (ALS) optimization method. We demonstrate the effectiveness of the FL model through experiments on real data of various dimensionality and scale. The results show that the FL model can be consistently trained 3-5 times faster than a standard cross-validated model while achieving prediction quality on par with it.

[LG-20] In-Context Distillation with Self-Consistency Cascades: A Simple Training-Free Way to Reduce LLM Agent Costs

Link: https://arxiv.org/abs/2512.02543
Authors: Vishnu Sarukkai, Asanshay Gupta, James Hong, Michaël Gharbi, Kayvon Fatahalian
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 4 figures

Abstract:The world currently has an abundance of ideas for how to use new LLM agents, and developers seek to rapidly prototype and test new agentic designs. However, executing agents at scale using high-capacity LLMs incurs high inference costs. We propose a simple method for reducing LLM agent inference costs without incurring the development friction costs associated with LLM fine-tuning (long training cycles, optimization hyperparameter tweaking loops) or manual prompt engineering (laborious trial and error). Most importantly, we introduce in-context distillation, which adapts the idea of knowledge distillation (training a low-cost student model to mimic a high-cost teacher) to an in-context learning setting. Our approach retrieves relevant teacher demonstrations at each agent step and provides them to the student as in-context examples, enabling the student to imitate teacher behavior on-the-fly. We combine in-context distillation with the established idea of self-consistency cascades to know when to trust the student. This adaptive strategy realizes the cost benefits of model specialization while preserving the productivity of working with frozen models. On the multi-step embodied reasoning benchmark ALFWorld, our method matches teacher-level accuracy at 2.5× lower cost, reducing per-episode costs from $0.059 to $0.024. The upfront demonstration cost amortizes after just 843 episodes, yielding cumulative savings exceeding $34,900 at deployment scale (1M episodes). On AppWorld, a complex agent benchmark requiring multi-step API workflows, we shift the Pareto frontier by achieving a 2× cost reduction at iso-accuracy. By reducing operational costs while maintaining rapid experimentation cycles with frozen models, our approach makes advanced agentic systems economically viable for a broader range of applications.
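
The cost figures in the abstract above can be cross-checked with simple arithmetic. The sketch below uses only the numbers reported there (per-episode costs of 0.059 and 0.024 dollars, break-even at 843 episodes, 1M deployment episodes). The upfront demonstration cost is not stated in the abstract, so `implied_upfront_cost` is an inference from the break-even point and should be read as an assumption.

```python
# Figures reported in the abstract (US dollars per episode).
teacher_cost = 0.059
student_cost = 0.024
break_even_episodes = 843
deployment_episodes = 1_000_000

# Saving per episode and the resulting cost ratio.
saving_per_episode = teacher_cost - student_cost
cost_ratio = teacher_cost / student_cost

# Assumption: the upfront cost equals the savings accumulated at break-even.
implied_upfront_cost = break_even_episodes * saving_per_episode

# Net savings after running the full deployment.
net_savings = deployment_episodes * saving_per_episode - implied_upfront_cost

print(f"saving per episode: {saving_per_episode:.3f}")
print(f"cost ratio: {cost_ratio:.2f}")
print(f"implied upfront cost: {implied_upfront_cost:.2f}")
print(f"net savings at 1M episodes: {net_savings:.2f}")
```

Under that assumption the ratio comes out slightly below 2.5 and the net savings land just above the reported 34,900 figure, so the abstract's numbers are mutually consistent.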

[LG-21] Generative Multi-modal Feedback for Singing Voice Synthesis Evaluation

Link: https://arxiv.org/abs/2512.02523
Authors: Xueyan Li, Yuxin Wang, Mengjie Jiang, Qingzi Zhu, Jiang Zhang, Zoey Kim, Yazhe Niu
Categories: Sound (cs.SD); Machine Learning (cs.LG)
Comments: 16 pages, 5 figures

Abstract:Singing voice synthesis (SVS) has advanced significantly, enabling models to generate vocals with accurate pitch and consistent style. As these capabilities improve, the need for reliable evaluation and optimization becomes increasingly critical. However, current methods like reward systems often rely on single numerical scores, struggle to capture various dimensions such as phrasing or expressiveness, and require costly annotations, limiting interpretability and generalization. To address these issues, we propose a generative feedback (i.e., reward model) framework that provides multi-dimensional language and audio feedback for SVS assessment. Our approach leverages an audio-language model to generate text and audio critiques covering aspects such as melody, content, and auditory quality. The model is fine-tuned on a hybrid dataset combining human music reactions and synthetic critiques from MLLMs, enhancing diversity and linguistic richness. Quantitative experiments validate the effectiveness of the proposed dataset and training strategy, demonstrating that the framework produces musically accurate and interpretable evaluations suitable for guiding generative model improvement. The code is at this https URL.

[LG-22] Decentralized Fairness Aware Multi Task Federated Learning for VR Network

Link: https://arxiv.org/abs/2512.02513
Authors: Krishnendu S. Tharakan, Carlo Fischione
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments: Accepted for IEEE Globecom Workshop 2025

Abstract:Wireless connectivity promises to unshackle virtual reality (VR) experiences, allowing users to engage from anywhere, anytime. However, delivering seamless, high-quality, real-time VR video wirelessly is challenging due to the stringent quality of experience requirements, low latency constraints, and limited VR device capabilities. This paper addresses these challenges by introducing a novel decentralized multi task fair federated learning (DMTFL) based caching scheme that caches and prefetches each VR user’s field of view (FOV) at base stations (BSs) based on the caching strategies tailored to each BS. Federated learning (FL) in its naive form is often biased toward certain users, and a single global model fails to capture the statistical heterogeneity across users and BSs. In contrast, the proposed DMTFL algorithm personalizes content delivery by learning individual caching models at each BS. These models are further optimized to perform well under any target distribution, while providing theoretical guarantees via Rademacher complexity and a probably approximately correct (PAC) bound on the loss. Using a realistic VR head-tracking dataset, our simulations demonstrate the superiority of our proposed DMTFL algorithm compared to baseline algorithms.

[LG-23] Water Quality Estimation Through Machine Learning Multivariate Analysis

Link: https://arxiv.org/abs/2512.02508
Authors: Marco Cardia, Stefano Chessa, Alessio Micheli, Antonella Giuliana Luminare, Francesca Gambineri
Categories: Machine Learning (cs.LG)
Comments: The paper has been accepted at Italian Workshop on Neural Networks (WIRN) 2024

Abstract:The quality of water is key for the quality of the agrifood sector. Water is used in agriculture for fertigation, for animal husbandry, and in the agrifood processing industry. In the context of the progressive digitalization of this sector, the automatic assessment of the quality of water is thus becoming an important asset. In this work, we present the integration of Ultraviolet-Visible (UV-Vis) spectroscopy with Machine Learning in the context of water quality assessment, aiming at ensuring water safety and compliance with water regulations. Furthermore, we emphasize the importance of model interpretability by employing SHapley Additive exPlanations (SHAP) to understand the contribution of absorbance at different wavelengths to the predictions. Our approach demonstrates the potential for rapid, accurate, and interpretable assessment of key water quality parameters.

[LG-24] A Fully First-Order Layer for Differentiable Optimization

Link: https://arxiv.org/abs/2512.02494
Authors: Zihao Zhao, Kai-Chia Mo, Shing-Hei Ho, Brandon Amos, Kai Wang
Categories: Machine Learning (cs.LG)

Abstract:Differentiable optimization layers enable learning systems to make decisions by solving embedded optimization problems. However, computing gradients via implicit differentiation requires solving a linear system with Hessian terms, which is both compute- and memory-intensive. To address this challenge, we propose a novel algorithm that computes the gradient using only first-order information. The key insight is to rewrite the differentiable optimization as a bilevel optimization problem and leverage recent advances in bilevel methods. Specifically, we introduce an active-set Lagrangian hypergradient oracle that avoids Hessian evaluations and provides finite-time, non-asymptotic approximation guarantees. We show that an approximate hypergradient can be computed using only first-order information in $\tilde{\mathcal{O}}(1)$ time, leading to an overall complexity of $\tilde{\mathcal{O}}(\delta^{-1}\epsilon^{-3})$ for constrained bilevel optimization, which matches the best known rate for non-smooth non-convex optimization. Furthermore, we release an open-source Python library that can be easily adapted from existing solvers. Our code is available here: this https URL.
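
To see where the "linear system with Hessian terms" comes from, consider the simplest case: an unconstrained inner problem $y^*(\theta) = \arg\min_y f(\theta, y)$ with $f$ smooth and strongly convex in $y$. This is a standard textbook derivation added for illustration; the constrained setting treated in the paper involves additional terms that are not shown here, and the notation is assumed.

```latex
% Stationarity of the inner problem holds for every \theta:
\nabla_y f\bigl(\theta, y^*(\theta)\bigr) = 0
% Differentiating both sides with respect to \theta:
\nabla^2_{y\theta} f + \nabla^2_{yy} f \,\frac{\partial y^*}{\partial \theta} = 0
% Solving for the Jacobian of the solution map:
\frac{\partial y^*}{\partial \theta}
  = -\bigl(\nabla^2_{yy} f\bigr)^{-1} \nabla^2_{y\theta} f
```

The inverse of $\nabla^2_{yy} f$ is the costly part: forming or factorizing it scales poorly with the dimension of $y$, which is the cost a first-order method aims to avoid.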

[LG-25] Stress-Testing Causal Claims via Cardinality Repairs

Link: https://arxiv.org/abs/2512.02491
Authors: Yarden Gabbay, Haoquan Guan, Shaull Almagor, El Kindi Rezig, Brit Youngmann, Babak Salimi
Categories: Databases (cs.DB); Machine Learning (cs.LG)

Abstract:Causal analyses derived from observational data underpin high-stakes decisions in domains such as healthcare, public policy, and economics. Yet such conclusions can be surprisingly fragile: even minor data errors - duplicate records, or entry mistakes - may drastically alter causal relationships. This raises a fundamental question: how robust is a causal claim to small, targeted modifications in the data? Addressing this question is essential for ensuring the reliability, interpretability, and reproducibility of empirical findings. We introduce SubCure, a framework for robustness auditing via cardinality repairs. Given a causal query and a user-specified target range for the estimated effect, SubCure identifies a small set of tuples or subpopulations whose removal shifts the estimate into the desired range. This process not only quantifies the sensitivity of causal conclusions but also pinpoints the specific regions of the data that drive those conclusions. We formalize this problem under both tuple- and pattern-level deletion settings and show both are NP-complete. To scale to large datasets, we develop efficient algorithms that incorporate machine unlearning techniques to incrementally update causal estimates without retraining from scratch. We evaluate SubCure across four real-world datasets covering diverse application domains. In each case, it uncovers compact, high-impact subsets whose removal significantly shifts the causal conclusions, revealing vulnerabilities that traditional methods fail to detect. Our results demonstrate that cardinality repair is a powerful and general-purpose tool for stress-testing causal analyses and guarding against misleading claims rooted in ordinary data imperfections.
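
The tuple-level problem described above (remove a small set of tuples so that the estimated effect falls into a target range) can be illustrated on toy data. The sketch below is not SubCure's algorithm: it uses a naive greedy search and a plain difference-in-means estimator, both chosen here for brevity, and the data, function names, and target range are hypothetical.

```python
def ate(rows):
    """Difference-in-means effect estimate over (treated, outcome) tuples."""
    treated = [y for t, y in rows if t == 1]
    control = [y for t, y in rows if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def distance_to_range(value, low, high):
    """Zero if value lies in [low, high], otherwise the gap to the range."""
    if value < low:
        return low - value
    if value > high:
        return value - high
    return 0.0


def greedy_cardinality_repair(rows, low, high, max_removals):
    """Greedily delete the tuple that moves the estimate closest to the range."""
    rows = list(rows)
    removed = []
    while distance_to_range(ate(rows), low, high) > 0 and len(removed) < max_removals:
        best_idx, best_dist = None, None
        for i, (t, _) in enumerate(rows):
            # Never empty a treatment group, or the estimate becomes undefined.
            if sum(1 for tt, _ in rows if tt == t) <= 1:
                continue
            candidate = rows[:i] + rows[i + 1:]
            dist = distance_to_range(ate(candidate), low, high)
            if best_dist is None or dist < best_dist:
                best_idx, best_dist = i, dist
        if best_idx is None:
            break
        removed.append(rows.pop(best_idx))
    return removed, ate(rows)


# Hypothetical data: one treated outlier (30.0) drives most of the effect.
data = [(1, 10.0), (1, 9.0), (1, 8.0), (1, 30.0),
        (0, 5.0), (0, 6.0), (0, 7.0), (0, 5.0)]
removed, estimate = greedy_cardinality_repair(data, low=3.0, high=4.0, max_removals=3)
print(removed, estimate)
```

On this toy data a single deletion is enough to move the estimate from 8.5 into the target range, which is the kind of fragility the abstract refers to. A greedy search gives no optimality guarantee, consistent with the NP-completeness result stated above.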

[LG-26] Hybrid (Penalized Regression and MLP) Models for Outcome Prediction in HDLSS Health Data

Link: https://arxiv.org/abs/2512.02489
Authors: Mithra D K
Categories: Machine Learning (cs.LG)

Abstract:I present an application of established machine learning techniques to NHANES health survey data for predicting diabetes status. I compare baseline models (logistic regression, random forest, XGBoost) with a hybrid approach that uses an XGBoost feature encoder and a lightweight multilayer perceptron (MLP) head. Experiments show the hybrid model attains improved AUC and balanced accuracy compared to baselines on the processed NHANES subset. I release code and reproducible scripts to encourage replication.

[LG-27] Dual-Robust Cross-Domain Offline Reinforcement Learning Against Dynamics Shifts

Link: https://arxiv.org/abs/2512.02486
Authors: Zhongjian Qiao, Rui Yang, Jiafei Lyu, Xiu Li, Zhongxiang Dai, Zhuoran Yang, Siyang Gao, Shuang Qiu
Categories: Machine Learning (cs.LG)

Abstract:Single-domain offline reinforcement learning (RL) often suffers from limited data coverage, while cross-domain offline RL handles this issue by leveraging additional data from other domains with dynamics shifts. However, existing studies primarily focus on train-time robustness (handling dynamics shifts from training data), neglecting the test-time robustness against dynamics perturbations when deployed in practical scenarios. In this paper, we investigate dual (both train-time and test-time) robustness against dynamics shifts in cross-domain offline RL. We first empirically show that the policy trained with cross-domain offline RL exhibits fragility under dynamics perturbations during evaluation, particularly when target domain data is limited. To address this, we introduce a novel robust cross-domain Bellman (RCB) operator, which enhances test-time robustness against dynamics perturbations while staying conservative to the out-of-distribution dynamics transitions, thus guaranteeing the train-time robustness. To further counteract potential value overestimation or underestimation caused by the RCB operator, we introduce two techniques, the dynamic value penalty and the Huber loss, into our framework, resulting in the practical Dual-RObust Cross-domain Offline RL (DROCO) algorithm. Extensive empirical results across various dynamics shift scenarios show that DROCO outperforms strong baselines and exhibits enhanced robustness to dynamics perturbations.

[LG-28] QJoin: Transformation-aware Joinable Data Discovery Using Reinforcement Learning

Link: https://arxiv.org/abs/2512.02444
Authors: Ning Wang, Sainyam Galhotra
Categories: Databases (cs.DB); Machine Learning (cs.LG)

Abstract:Discovering which tables in large, heterogeneous repositories can be joined and by what transformations is a central challenge in data integration and data discovery. Traditional join discovery methods are largely designed for equi-joins, which assume that join keys match exactly or nearly so. These techniques, while efficient in clean, well-normalized databases, fail in open or federated settings where identifiers are inconsistently formatted, embedded, or split across multiple columns. Approximate or fuzzy joins alleviate minor string variations but cannot capture systematic transformations. We introduce QJoin, a reinforcement-learning framework that learns and reuses transformation strategies across join tasks. QJoin trains an agent under a uniqueness-aware reward that balances similarity with key distinctiveness, enabling it to explore concise, high-value transformation chains. To accelerate new joins, we introduce two reuse mechanisms: (i) agent transfer, which initializes new policies from pretrained agents, and (ii) transformation reuse, which caches successful operator sequences for similar column clusters. On the AutoJoin Web benchmark (31 table pairs), QJoin achieves an average F1-score of 91.0%. For 19,990 join tasks in NYC+Chicago open datasets, QJoin reduces runtime by up to 7.4% (13,747 s) through reuse. These results demonstrate that transformation learning and reuse can make join discovery both more accurate and more efficient.

[LG-29] Cross-Domain Offline Policy Adaptation with Dynamics- and Value-Aligned Data Filtering

Link: https://arxiv.org/abs/2512.02435
Authors: Zhongjian Qiao, Rui Yang, Jiafei Lyu, Chenjia Bai, Xiu Li, Zhuoran Yang, Siyang Gao, Shuang Qiu
Categories: Machine Learning (cs.LG)

Abstract:Cross-Domain Offline Reinforcement Learning aims to train an agent deployed in the target environment, leveraging both a limited target domain dataset and a source domain dataset with (possibly) sufficient data coverage. Due to the underlying dynamics misalignment between the source and target domain, simply merging the data from two datasets may incur inferior performance. Recent advances address this issue by selectively sharing source domain samples that exhibit dynamics alignment with the target domain. However, these approaches focus solely on dynamics alignment and overlook value alignment, i.e., selecting high-quality, high-value samples from the source domain. In this paper, we first demonstrate that both dynamics alignment and value alignment are essential for policy learning, by examining the limitations of the current theoretical framework for cross-domain RL and establishing a concrete sub-optimality gap of a policy trained on the source domain and evaluated on the target domain. Motivated by the theoretical insights, we propose to selectively share those source domain samples with both high dynamics and value alignment and present our Dynamics- and Value-aligned Data Filtering (DVDF) method. We design a range of dynamics shift settings, including kinematic and morphology shifts, and evaluate DVDF on various tasks and datasets, as well as in challenging extremely low-data settings where the target domain dataset contains only 5,000 transitions. Extensive experiments demonstrate that DVDF consistently outperforms prior strong baselines and delivers exceptional performance across multiple tasks and datasets.

[LG-30] Leveraging Large Language Models to Bridge On-chain and Off-chain Transparency in Stablecoins

Link: https://arxiv.org/abs/2512.02418
Authors: Yuexin Xiang, Yuchen Lei, SM Mahir Shazeed Rish, Yuanzhe Zhang, Qin Wang, Tsz Hon Yuen, Jiangshan Yu
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Stablecoins such as USDT and USDC aspire to peg stability by coupling issuance controls with reserve attestations. In practice, however, the transparency is split across two unconnected worlds: verifiable on-chain traces and off-chain disclosures locked in unstructured text. We introduce a large language model (LLM)-based automated framework that bridges these two dimensions by aligning on-chain issuance data with off-chain disclosure statements. First, we propose an integrative framework using LLMs to capture and analyze on- and off-chain data through document parsing and semantic alignment, extracting key financial indicators from issuer attestations and mapping them to corresponding on-chain metrics. Second, we integrate multi-chain issuance records and disclosure documents within a model context protocol (MCP) framework that standardizes LLM access to both quantitative market data and qualitative disclosure narratives. This framework enables unified retrieval and contextual alignment across heterogeneous stablecoin information sources and facilitates consistent analysis. Third, we demonstrate the capability of LLMs to operate across heterogeneous data modalities in blockchain analytics, quantifying discrepancies between reported and observed circulation and examining their implications for cross-chain transparency and price dynamics. Our findings reveal systematic gaps between disclosed and verifiable data, showing that LLM-assisted analysis enhances cross-modal transparency and supports automated, data-driven auditing in decentralized finance (DeFi).

[LG-31] Dynamic Configuration of On-Street Parking Spaces using Multi Agent Reinforcement Learning

Link: https://arxiv.org/abs/2512.02406
Authors: Oshada Jayasinghe, Farhana Choudhury, Egemen Tanin, Shanika Karunasekera
Categories: Machine Learning (cs.LG)

Abstract:With travel demand higher than ever, traffic congestion has become a major concern in most urban areas. Allocating spaces for on-street parking further hinders traffic flow by limiting the effective road width available for driving. With the advancement of vehicle-to-infrastructure connectivity technologies, we explore how the impact of on-street parking on traffic congestion could be minimized by dynamically configuring on-street parking spaces. Towards that end, we formulate dynamic on-street parking space configuration as an optimization problem, and we follow a data-driven approach, considering the nature of our problem. Our proposed solution comprises a two-layer multi agent reinforcement learning based framework, which is inherently scalable to large road networks. The lane level agents are responsible for deciding the optimal parking space configuration for each lane, and we introduce a novel Deep Q-learning architecture which effectively utilizes long short term memory networks and graph attention networks to capture the spatio-temporal correlations evident in the given problem. The block level agents control the actions of the lane level agents and maintain a sufficient level of parking around the block. We conduct a set of comprehensive experiments using SUMO, on both synthetic data and real-world data from the city of Melbourne. Our experiments show that the proposed framework could reduce the average travel time loss of vehicles significantly, by up to 47%, with a negligible increase in the walking distance for parking.

[LG-32] ESACT: An End-to-End Sparse Accelerator for Compute-Intensive Transformers via Local Similarity

Link: https://arxiv.org/abs/2512.02403
Authors: Hongxiang Liu, Zhifang Deng, Tong Pu, Shengli Lu
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)

Abstract:Transformers, composed of QKV generation, attention computation, and FFNs, have become the dominant model across various domains due to their outstanding performance. However, their high computational cost hinders efficient hardware deployment. Sparsity offers a promising solution, yet most existing accelerators exploit only intra-row sparsity in attention, while few consider inter-row sparsity. Approaches leveraging inter-row sparsity often rely on costly global similarity estimation, which diminishes the acceleration benefits of sparsity, and typically apply sparsity to only one or two transformer components. Through careful analysis of the attention distribution and computation flow, we observe that local similarity allows end-to-end sparse acceleration with lower computational overhead. Motivated by this observation, we propose ESACT, an end-to-end sparse accelerator for compute-intensive Transformers. ESACT centers on the Sparsity Prediction with Local Similarity (SPLS) mechanism, which leverages HLog quantization to accurately predict local attention sparsity prior to QK generation, achieving efficient sparsity across all transformer components. To support efficient hardware realization, we introduce three architectural innovations. Experimental results on 26 benchmarks demonstrate that SPLS reduces total computation by 52.03% with less than 1% accuracy loss. ESACT achieves an end-to-end energy efficiency of 3.29 TOPS/W, and improves attention-level energy efficiency by 2.95x and 2.26x over SOTA attention accelerators SpAtten and Sanger, respectively. 

[LG-33] Risk-Sensitive Q-Learning in Continuous Time with Application to Dynamic Portfolio Selection

Link: https://arxiv.org/abs/2512.02386
Authors: Chuhan Xie
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:This paper studies the problem of risk-sensitive reinforcement learning (RSRL) in continuous time, where the environment is characterized by a controllable stochastic differential equation (SDE) and the objective is a potentially nonlinear functional of cumulative rewards. We prove that when the functional is an optimized certainty equivalent (OCE), the optimal policy is Markovian with respect to an augmented environment. We also propose CT-RS-q, a risk-sensitive q-learning algorithm based on a novel martingale characterization approach. Finally, we run a simulation study on a dynamic portfolio selection problem and illustrate the effectiveness of our algorithm.
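
For readers unfamiliar with the optimized certainty equivalent, its standard definition (due to Ben-Tal and Teboulle) is reproduced below for a random reward $X$ and a concave, nondecreasing utility $u$ normalized so that $u(0) = 0$. The notation is assumed here, and the paper's own sign and normalization conventions may differ.

```latex
\mathrm{OCE}_u(X) \;=\; \sup_{\lambda \in \mathbb{R}}
  \Bigl\{ \lambda + \mathbb{E}\bigl[\,u(X - \lambda)\,\bigr] \Bigr\}
% Example: the exponential utility u(t) = (1 - e^{-\gamma t})/\gamma, \gamma > 0,
% recovers the entropic risk measure:
\mathrm{OCE}_u(X) \;=\; -\frac{1}{\gamma}\,\log \mathbb{E}\bigl[e^{-\gamma X}\bigr]
```

The entropic case follows by setting the derivative in $\lambda$ to zero, which gives $e^{\gamma\lambda}\,\mathbb{E}[e^{-\gamma X}] = 1$; at that $\lambda$ the expectation term vanishes and the supremum equals $\lambda$ itself.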

[LG-34] Reinforcement Learning in POMDPs via Direct Gradient Ascent

Link: https://arxiv.org/abs/2512.02383
Authors: Jonathan Baxter, Peter L. Bartlett
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:This paper discusses theoretical and experimental aspects of gradient-based approaches to the direct optimization of policy performance in controlled POMDPs. We introduce GPOMDP, a REINFORCE-like algorithm for estimating an approximation to the gradient of the average reward as a function of the parameters of a stochastic policy. The algorithm’s chief advantages are that it requires only a single sample path of the underlying Markov chain, it uses only one free parameter β ∈ [0,1), which has a natural interpretation in terms of bias-variance trade-off, and it requires no knowledge of the underlying state. We prove convergence of GPOMDP and show how the gradient estimates produced by GPOMDP can be used in a conjugate-gradient procedure to find local optima of the average reward.
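
As a rough illustration of the kind of estimator the abstract describes, the sketch below runs a GPOMDP-style update on a hypothetical toy problem with one observation and two actions. The environment, the sigmoid policy, and the name `gpomdp_gradient` are assumptions made for illustration only. The update keeps an eligibility trace discounted by beta and a running average of reward times trace, all computed along a single sample path.

```python
import math
import random


def gpomdp_gradient(theta, beta, steps, rng):
    """Estimate d(average reward)/d(theta) from one sample path.

    Toy setting (an assumption, not taken from the paper): a single
    observation, two actions, reward 1 for action 1 and 0 for action 0,
    and a policy that picks action 1 with probability sigmoid(theta).
    """
    z = 0.0      # eligibility trace, discounted by beta
    delta = 0.0  # running average of reward * trace
    for t in range(steps):
        p1 = 1.0 / (1.0 + math.exp(-theta))
        action = 1 if rng.random() < p1 else 0
        score = action - p1      # d/d(theta) of log policy(action)
        reward = float(action)   # reward earned by this action
        z = beta * z + score
        delta += (reward * z - delta) / (t + 1)
    return delta


estimate = gpomdp_gradient(theta=0.0, beta=0.5, steps=20000, rng=random.Random(0))
```

In this toy problem the average reward is sigmoid(theta), so the exact gradient at theta = 0 is 0.25 and the estimate should land close to it. Rewards here do not depend on earlier actions, so beta only affects variance; in a genuine POMDP it trades bias against variance, as the abstract notes.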

[LG-35] SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification

Link: https://arxiv.org/abs/2512.02337
Authors: Zhendong Tan, Xingjun Zhang, Chaoyi Hu, Junjie Peng, Kun Xia
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Growing demands from tasks like code generation, deep reasoning, and long-document understanding have made long-context generation a crucial capability for large language models (LLMs). Speculative decoding is one of the most direct and effective approaches for accelerating generation. It follows a draft-verify paradigm, where a lightweight draft model proposes several candidate tokens and the target model verifies them. However, we find that as the context length grows, verification becomes the dominant bottleneck. To further accelerate speculative decoding in long-context generation, we introduce SpecPV, a self-speculative decoding approach that performs fast verification using partial key-value states (KV) and periodically applies full verification to eliminate accumulated errors. We validate SpecPV across multiple long-context benchmarks and models, including LLaMA-3.1-8B-Instruct and Qwen3-series. Experimental results show that SpecPV achieves up to 6x decoding speedup over standard autoregressive decoding with minor degradation.
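
The draft-verify paradigm mentioned in the abstract can be sketched with toy stand-ins for the two models. This is a minimal greedy version, not SpecPV itself: it does not model partial key-value verification, and `target_next`, `draft_next`, and the arithmetic rules inside them are hypothetical placeholders for real model calls.

```python
def target_next(context):
    """Hypothetical target model: a deterministic rule standing in for greedy decoding."""
    return (3 * context[-1] + len(context)) % 7


def draft_next(context):
    """Hypothetical draft model: agrees with the target except at every fourth position."""
    token = target_next(context)
    return (token + 1) % 7 if len(context) % 4 == 0 else token


def autoregressive_decode(prompt, n_new):
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq


def speculative_decode(prompt, n_new, k=3):
    """Greedy draft-verify decoding with k drafted tokens per round."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # Draft phase: the cheap model proposes k tokens.
        draft_ctx = list(seq)
        proposal = []
        for _ in range(k):
            token = draft_next(draft_ctx)
            proposal.append(token)
            draft_ctx.append(token)
        # Verify phase: keep the target's token at each position and stop
        # at the first disagreement. A real system scores all k positions
        # in one batched forward pass instead of a Python loop.
        verify_ctx = list(seq)
        for token in proposal:
            expected = target_next(verify_ctx)
            verify_ctx.append(expected)
            if token != expected:
                break
        else:
            # Every draft token was accepted, so the target adds one more.
            verify_ctx.append(target_next(verify_ctx))
        seq = verify_ctx
    return seq[:goal]
```

Because every emitted token is the target's own greedy choice, the output matches plain autoregressive decoding exactly; the speedup in practice comes from verifying several tokens per target call.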

[LG-36] Forecasting MBTA Transit Dynamics: A Performance Benchmarking of Statistical and Machine Learning Models

Link: https://arxiv.org/abs/2512.02336
Authors: Sai Siddharth Nalamalpu, Kaining Yuan, Aiden Zhou, Eugene Pinsky
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 9 figures

Click to view abstract

Abstract:The Massachusetts Bay Transportation Authority (MBTA) is the main public transit provider in Boston, operating multiple means of transport, including trains, subways, and buses. However, the system often faces delays and fluctuations in ridership volume, which negatively affect efficiency and passenger satisfaction. To further understand this phenomenon, this paper compares the performance of existing and unique methods to determine the best approach in predicting gated station entries in the subway system (a proxy for subway usage) and the number of delays in the overall MBTA system. To do so, this research considers factors that tend to affect public transportation, such as day of week, season, pressure, wind speed, average temperature, and precipitation. This paper evaluates the performance of 10 statistical and machine learning models on predicting next-day subway usage. On predicting delay count, the number of models is extended to 11 per day by introducing a self-exciting point process model, representing a unique application of a point-process framework for MBTA delay modeling. This research involves experimenting with the selective inclusion of features to determine feature importance, testing model accuracy via Root Mean Squared Error (RMSE). Remarkably, it is found that providing either day of week or season data has a more substantial benefit to predictive accuracy compared to weather data; in fact, providing weather data generally worsens performance, suggesting a tendency of models to overfit.

[LG-37] Retrieval-Augmented Memory for Online Learning

Link: https://arxiv.org/abs/2512.02333
Authors: Wenzhang Du
Categories: Machine Learning (cs.LG)
Comments: 11 pages, 3 figures

Click to view abstract

Abstract:Retrieval-augmented models couple parametric predictors with non-parametric memories, but their use in streaming supervised learning with concept drift is not well understood. We study online classification in non-stationary environments and propose Retrieval-Augmented Memory for Online Learning (RAM-OL), a simple extension of stochastic gradient descent that maintains a small buffer of past examples. At each time step, RAM-OL retrieves a few nearest neighbours of the current input in the hidden representation space and updates the model jointly on the current example and the retrieved neighbours. We compare a naive replay variant with a gated replay variant that constrains neighbours using a time window, similarity thresholds, and gradient reweighting, in order to balance fast reuse of relevant past data against robustness to outdated regimes. From a theoretical perspective, we interpret RAM-OL under a bounded drift model and discuss how retrieval can reduce adaptation cost and improve regret constants when patterns recur over time. Empirically, we instantiate RAM-OL on a simple online multilayer perceptron and evaluate it on three real-world data streams derived from electricity pricing, electricity load, and airline delay data. On strongly and periodically drifting streams, RAM-OL improves prequential accuracy by up to about seven percentage points and greatly reduces variance across random seeds, while on a noisy airline stream the gated variant closely matches the purely online baseline. These results show that retrieval-augmented memory is a practical and robust tool for online learning under concept drift.
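
A minimal sketch of the naive replay idea in the abstract, under simplifying assumptions: the model is a plain logistic regression rather than a multilayer perceptron, and neighbours are retrieved in input space rather than in a hidden representation space. The names `sgd_step` and `run_replay_stream`, the buffer capacity, and the synthetic stream are all hypothetical.

```python
import math
import random
from collections import deque


def sgd_step(w, x, y, lr):
    """One logistic-regression SGD step on example (x, y); w[-1] is the bias."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
    p = 1.0 / (1.0 + math.exp(-z))
    for i, xi in enumerate(x):
        w[i] -= lr * (p - y) * xi
    w[-1] -= lr * (p - y)


def run_replay_stream(stream, dim, capacity=50, k=3, lr=0.5):
    """Test-then-train loop with a small buffer and k-nearest-neighbour replay."""
    w = [0.0] * (dim + 1)
    buffer = deque(maxlen=capacity)  # oldest examples are evicted first
    correct = 0
    for x, y in stream:
        # Prequential evaluation: predict before updating on this example.
        z = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
        correct += int((z > 0) == (y == 1))
        # Retrieve the k stored examples closest to the current input.
        neighbours = sorted(buffer, key=lambda item: math.dist(item[0], x))[:k]
        sgd_step(w, x, y, lr)
        for nx, ny in neighbours:
            sgd_step(w, nx, ny, lr)
        buffer.append((x, y))
    return correct / len(stream), len(buffer)


rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(500)]
stream = [(x, 1 if x[0] > 0 else 0) for x in points]
accuracy, buffer_size = run_replay_stream(stream, dim=2)
```

The gated variant described in the abstract would additionally filter `neighbours` by a time window and a similarity threshold and reweight their gradients; that logic is omitted here.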

[LG-38] Unlocking the Power of Boltzmann Machines by Parallelizable Sampler and Efficient Temperature Estimation

Link: https://arxiv.org/abs/2512.02323
Authors: Kentaro Kubo, Hayato Goto
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph); Machine Learning (stat.ML)
Comments: 16 pages, 14 figures

Click to view abstract

Abstract:Boltzmann machines (BMs) are powerful energy-based generative models, but their heavy training cost has largely confined practical use to Restricted BMs (RBMs) trained with an efficient learning method called contrastive divergence. More accurate learning typically requires Markov chain Monte Carlo (MCMC) Boltzmann sampling, but it is time-consuming due to the difficulty of parallelization for more expressive models. To address this limitation, we first propose a new Boltzmann sampler inspired by a quantum-inspired combinatorial optimization called simulated bifurcation (SB). This SB-inspired approach, which we name Langevin SB (LSB), enables parallelized sampling while maintaining accuracy comparable to MCMC. Furthermore, this is applicable not only to RBMs but also to BMs with general couplings. However, LSB cannot control the inverse temperature of the output Boltzmann distribution, which hinders learning and degrades performance. To overcome this limitation, we also developed an efficient method for estimating the inverse temperature during the learning process, which we call conditional expectation matching (CEM). By combining LSB and CEM, we establish an efficient learning framework for BMs with greater expressive power than RBMs. We refer to this framework as sampler-adaptive learning (SAL). SAL opens new avenues for energy-based generative modeling beyond RBMs.

[LG-39] Training Dynamics of Learning 3D-Rotational Equivariance

Link: https://arxiv.org/abs/2512.02303
Authors: Max W. Shen, Ewa Nowara, Michael Maser, Kyunghyun Cho
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments: Accepted to Transactions on Machine Learning Research (TMLR)

Click to view abstract

Abstract:While data augmentation is widely used to train symmetry-agnostic models, it remains unclear how quickly and effectively they learn to respect symmetries. We investigate this by deriving a principled measure of equivariance error that, for convex losses, calculates the percent of total loss attributable to imperfections in learned symmetry. We focus our empirical investigation on 3D-rotation equivariance on high-dimensional molecular tasks (flow matching, force field prediction, denoising voxels) and find that models reduce equivariance error quickly to ≤ 2% held-out loss within 1k-10k training steps, a result robust to model and dataset size. This happens because learning 3D-rotational equivariance is an easier learning task, with a smoother and better-conditioned loss landscape, than the main prediction task. For 3D rotations, the loss penalty for non-equivariant models is small throughout training, so they may achieve lower test loss than equivariant models per GPU-hour unless the equivariant "efficiency gap" is narrowed. We also experimentally and theoretically investigate the relationships between relative equivariance error, learning gradients, and model parameters.

[LG-40] Limitations of Membership Queries in Testable Learning

Link: https://arxiv.org/abs/2512.02279
Authors: Jane Lange, Mingda Qiao
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
Comments: Conference: ITCS 2026

Click to view abstract

Abstract:Membership queries (MQ) often yield speedups for learning tasks, particularly in the distribution-specific setting. We show that in the testable learning model of Rubinfeld and Vasilyan [RV23], membership queries cannot decrease the time complexity of testable learning algorithms beyond the complexity of sample-only distribution-specific learning. In the testable learning model, the learner must output a hypothesis whenever the data distribution satisfies a desired property, and if it outputs a hypothesis, the hypothesis must be near-optimal. We give a general reduction from sample-based refutation of boolean concept classes, as presented in [Vadhan17, KL18], to testable learning with queries (TL-Q). This yields lower bounds for TL-Q via the reduction from learning to refutation given in [KL18]. The result is that, relative to a concept class and a distribution family, no m-sample TL-Q algorithm can be super-polynomially more time-efficient than the best m-sample PAC learner. Finally, we define a class of "statistical" MQ algorithms that encompasses many known distribution-specific MQ learners, such as those based on influence estimation or subcube-conditional statistical queries. We show that TL-Q algorithms in this class imply efficient statistical-query refutation and learning algorithms. Thus, combined with known SQ dimension lower bounds, our results imply that these efficient membership query learners cannot be made testable.

[LG-41] Adversarial Robustness of Traffic Classification under Resource Constraints: Input Structure Matters

Link: https://arxiv.org/abs/2512.02276
Authors: Adel Chehade, Edoardo Ragusa, Paolo Gastaldo, Rodolfo Zunino
Categories: Networking and Internet Architecture (cs.NI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted at the 2025 IEEE International Symposium on Networks, Computers and Communications (ISNCC)

Click to view abstract

Abstract:Traffic classification (TC) plays a critical role in cybersecurity, particularly in IoT and embedded contexts, where inspection must often occur locally under tight hardware constraints. We use hardware-aware neural architecture search (HW-NAS) to derive lightweight TC models that are accurate, efficient, and deployable on edge platforms. Two input formats are considered: a flattened byte sequence and a 2D packet-wise time series; we examine how input structure affects adversarial vulnerability when using resource-constrained models. Robustness is assessed against white-box attacks, specifically Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD). On USTC-TFC2016, both HW-NAS models achieve over 99% clean-data accuracy while remaining within 65k parameters and 2M FLOPs. Yet under perturbations of strength 0.1, their robustness diverges: the flat model retains over 85% accuracy, while the time-series variant drops below 35%. Adversarial fine-tuning delivers robust gains, with flat-input accuracy exceeding 96% and the time-series variant recovering over 60 percentage points in robustness, all without compromising efficiency. The results underscore how input structure influences adversarial vulnerability, and show that even compact, resource-efficient models can attain strong robustness, supporting their practical deployment in secure edge-based TC.
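
For readers unfamiliar with FGSM, the sketch below applies it to a hypothetical two-feature logistic-regression classifier. The weights, the input, and the function names are illustrative assumptions and have nothing to do with the paper's HW-NAS models. FGSM moves each input coordinate by eps in the direction of the sign of the loss gradient with respect to the input.

```python
import math


def predict_proba(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def logistic_loss(w, b, x, y):
    """Binary cross-entropy of the model on a single example."""
    p = predict_proba(w, b, x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def fgsm(w, b, x, y, eps):
    """One-step FGSM: x_adv = x + eps * sign(d loss / d x)."""
    p = predict_proba(w, b, x)
    # For logistic regression the input gradient is (p - y) * w.
    grad_x = [(p - y) * wi for wi in w]
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad_x)]


w, b = [2.0, -1.0], 0.0      # hypothetical model parameters
x, y = [0.5, 0.2], 1         # hypothetical clean input and its label
x_adv = fgsm(w, b, x, y, eps=0.1)
clean_loss = logistic_loss(w, b, x, y)
adv_loss = logistic_loss(w, b, x_adv, y)
```

PGD, the other attack named in the abstract, can be viewed as repeating a smaller step of this kind several times while projecting back into the eps-ball around the clean input.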

[LG-42] The Effect of Enforcing Fairness on Reshaping Explanations in Machine Learning Models

Link: https://arxiv.org/abs/2512.02265
Authors: Joshua Wolff Anderson, Shyam Visweswaran
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: 10 pages, 3 figures, 2 tables

Click to view abstract

Abstract:Trustworthy machine learning in healthcare requires strong predictive performance, fairness, and explanations. While it is known that improving fairness can affect predictive performance, little is known about how fairness improvements influence explainability, an essential ingredient for clinical trust. Clinicians may hesitate to rely on a model whose explanations shift after fairness constraints are applied. In this study, we examine how enhancing fairness through bias mitigation techniques reshapes Shapley-based feature rankings. We quantify changes in feature importance rankings after applying fairness constraints across three datasets: pediatric urinary tract infection risk, direct anticoagulant bleeding risk, and recidivism risk. We also evaluate multiple model classes on the stability of Shapley-based rankings. We find that increasing model fairness across racial subgroups can significantly alter feature importance rankings, sometimes in different ways across groups. These results highlight the need to jointly consider accuracy, fairness, and explainability in model assessment rather than in isolation.

[LG-43] Verifying Closed-Loop Contractivity of Learning-Based Controllers via Partitioning

Link: https://arxiv.org/abs/2512.02262
Authors: Alexander Davydov
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Click to view abstract

Abstract:We address the problem of verifying closed-loop contraction in nonlinear control systems whose controller and contraction metric are both parameterized by neural networks. By leveraging interval analysis and interval bound propagation, we derive a tractable and scalable sufficient condition for closed-loop contractivity that reduces to checking that the dominant eigenvalue of a symmetric Metzler matrix is nonpositive. We combine this sufficient condition with a domain partitioning strategy to integrate this sufficient condition into training. The proposed approach is validated on an inverted pendulum system, demonstrating the ability to learn neural network controllers and contraction metrics that provably satisfy the contraction condition.
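
The final check named in the abstract, whether the dominant eigenvalue of a symmetric Metzler matrix is nonpositive, can be illustrated numerically. The sketch below is a floating-point approximation using shifted power iteration, not the interval-based certificate the paper derives; the function names and example matrices are hypothetical.

```python
import math


def is_metzler(m):
    """A Metzler matrix has nonnegative off-diagonal entries."""
    n = len(m)
    return all(m[i][j] >= 0 for i in range(n) for j in range(n) if i != j)


def dominant_eigenvalue(m, iters=2000):
    """Largest eigenvalue of a symmetric Metzler matrix by shifted power iteration."""
    n = len(m)
    # Shifting by more than the largest absolute row sum makes the matrix
    # entrywise nonnegative and positive definite (Gershgorin), so power
    # iteration from a positive vector converges to its top eigenvalue.
    shift = max(sum(abs(v) for v in row) for row in m) + 1.0
    a = [[m[i][j] + (shift if i == j else 0.0) for j in range(n)] for i in range(n)]
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    av = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
    rayleigh = sum(v[i] * av[i] for i in range(n))  # v has unit norm
    return rayleigh - shift


contracting = [[-2.0, 1.0], [1.0, -2.0]]      # eigenvalues -1 and -3
not_contracting = [[-1.0, 2.0], [2.0, -1.0]]  # eigenvalues 1 and -3
```

Here `dominant_eigenvalue(contracting)` is about -1, so the nonpositivity condition holds, while `not_contracting` gives about 1 and fails it. A verified implementation would need rigorous bounds rather than an iterative estimate.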

[LG-44] On the Approximation of Phylogenetic Distance Functions by Artificial Neural Networks

Link: https://arxiv.org/abs/2512.02223
Authors: Benjamin K. Rosenzweig, Matthew W. Hahn
Categories: Machine Learning (cs.LG); Populations and Evolution (q-bio.PE)
Comments: 10 pages

Click to view abstract

Abstract:Inferring the phylogenetic relationships among a sample of organisms is a fundamental problem in modern biology. While distance-based hierarchical clustering algorithms achieved early success on this task, these have been supplanted by Bayesian and maximum likelihood search procedures based on complex models of molecular evolution. In this work we describe minimal neural network architectures that can approximate classic phylogenetic distance functions and the properties required to learn distances under a variety of molecular evolutionary models. In contrast to model-based inference (and recently proposed model-free convolutional and transformer networks), these architectures have a small computational footprint and are scalable to large numbers of taxa and molecular characters. The learned distance functions generalize well and, given an appropriate training dataset, achieve results comparable to state-of-the-art inference methods.
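
The abstract does not name specific distance functions; as one well-known example of a classic phylogenetic distance, the sketch below computes the Jukes-Cantor (JC69) distance between two aligned DNA sequences. Choosing JC69 here is an assumption for illustration, as are the function names.

```python
import math


def p_distance(seq_a, seq_b):
    """Fraction of aligned sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned and non-empty")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def jukes_cantor_distance(seq_a, seq_b):
    """JC69 distance: d = -(3/4) * ln(1 - (4/3) * p)."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


d = jukes_cantor_distance("ACGTACGT", "ACGTACGA")  # one mismatch in eight sites
```

The logarithmic correction makes the distance slightly larger than the raw mismatch fraction, accounting for multiple substitutions at the same site. A function of this shape is the kind of target a small network could be trained to approximate.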

[LG-45] Uncertainty Reasoning with Photonic Bayesian Machines

Link: https://arxiv.org/abs/2512.02217
Authors: F. Brückerhoff-Plückelmann, H. Borras, S. U. Hulyal, L. Meyer, X. Ji, J. Hu, J. Sun, B. Klein, F. Ebert, J. Dijkstra, L. McRae, P. Schmidt, T. J. Kippenberg, H. Fröning, W. Pernice
Categories: Machine Learning (cs.LG); Applied Physics (physics.app-ph)
Comments:

Click to view abstract

Abstract:Artificial intelligence (AI) systems increasingly influence safety-critical aspects of society, from medical diagnosis to autonomous mobility, making uncertainty awareness a central requirement for trustworthy AI. We present a photonic Bayesian machine that leverages the inherent randomness of chaotic light sources to enable uncertainty reasoning within the framework of Bayesian Neural Networks. The analog processor features a 1.28 Tbit/s digital interface compatible with PyTorch, enabling probabilistic convolutions processing within 37.5 ps per convolution. We use the system for simultaneous classification and out-of-domain detection of blood cell microscope images and demonstrate reasoning between aleatoric and epistemic uncertainties. The photonic Bayesian machine removes the bottleneck of pseudo random number generation in digital systems, minimizes the cost of sampling for probabilistic models, and thus enables high-speed trustworthy AI systems.

[LG-46] InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages

Link: https://arxiv.org/abs/2512.02213
Authors: Mamadou K. Keita, Sebastien Diarra, Christopher Homan, Seydou Diallo
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Effective text generation and chat interfaces for low-resource languages (LRLs) remain a challenge for state-of-the-art large language models (LLMs) to support. This is mainly due to the difficulty of curating high-quality instruction datasets for LRLs, a limitation prevalent in the languages spoken across the African continent and other regions. Current approaches, such as automated translation and synthetic data generation, frequently yield outputs that lack fluency or even orthographic consistency. In this paper, we introduce InstructLR, a novel framework designed to generate high-quality instruction datasets for LRLs. Our approach integrates LLM-driven text generation with a dual-layer quality filtering mechanism: an automated filtering layer based on retrieval-augmented-generation (RAG)-based n-shot prompting, and a human-in-the-loop validation layer. Drawing inspiration from benchmarks such as MMLU in task definition, InstructLR has facilitated the creation of three multi-domain instruction benchmarks: ZarmaInstruct-50k, BambaraInstruct-50k, and FulfuldeInstruct-50k.

[LG-47] WhAM: Towards A Translative Model of Sperm Whale Vocalization NEURIPS2025

Link: https://arxiv.org/abs/2512.02206
Authors: Orr Paradise, Pranav Muralikrishnan, Liangyuan Chen, Hugo Flores García, Bryan Pardo, Roee Diamant, David F. Gruber, Shane Gero, Shafi Goldwasser
Categories: Machine Learning (cs.LG); Sound (cs.SD)
Comments: NeurIPS 2025

Click to view abstract

Abstract:Sperm whales communicate in short sequences of clicks known as codas. We present WhAM (Whale Acoustics Model), the first transformer-based model capable of generating synthetic sperm whale codas from any audio prompt. WhAM is built by finetuning VampNet, a masked acoustic token model pretrained on musical audio, using 10k coda recordings collected over the past two decades. Through iterative masked token prediction, WhAM generates high-fidelity synthetic codas that preserve key acoustic features of the source recordings. We evaluate WhAM’s synthetic codas using Fréchet Audio Distance and through perceptual studies with expert marine biologists. On downstream classification tasks including rhythm, social unit, and vowel classification, WhAM’s learned representations achieve strong performance, despite being trained for generation rather than classification. Our code is available at this https URL

[LG-48] Modelling the Doughnut of social and planetary boundaries with frugal machine learning

Link: https://arxiv.org/abs/2512.02200
Authors: Stefano Vrizzi, Daniel W. O’Neill
Categories: Machine Learning (cs.LG); General Economics (econ.GN)
Comments:

Click to view abstract

Abstract:The ‘Doughnut’ of social and planetary boundaries has emerged as a popular framework for assessing environmental and social sustainability. Here, we provide a proof-of-concept analysis that shows how machine learning (ML) methods can be applied to a simple macroeconomic model of the Doughnut. First, we show how ML methods can be used to find policy parameters that are consistent with ‘living within the Doughnut’. Second, we show how a reinforcement learning agent can identify the optimal trajectory towards desired policies in the parameter space. The approaches we test, which include a Random Forest Classifier and Q-learning, are frugal ML methods that are able to find policy parameter combinations that achieve both environmental and social sustainability. The next step is the application of these methods to a more complex ecological macroeconomic model.
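
To make the Q-learning component concrete, here is a minimal tabular sketch on a hypothetical five-state chain, where the agent must walk right to reach a rewarded goal state. The chain stands in loosely for moving through a discretised parameter space toward a desired region; it is not the paper's macroeconomic model, and every name and hyperparameter below is an assumption.

```python
import random


def q_learning_chain(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
                     epsilon=0.2, seed=0):
    """Tabular Q-learning on a chain; action 0 moves left, action 1 moves right."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        for _step in range(100):  # cap the episode length
            # Epsilon-greedy choice, with random tie-breaking.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            nxt = max(0, state - 1) if action == 0 else state + 1
            reward = 1.0 if nxt == goal else 0.0
            # The goal is terminal, so no bootstrapping from it.
            target = reward if nxt == goal else gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if state == goal:
                break
    return q


q_table = q_learning_chain()
greedy_policy = [0 if q_table[s][0] > q_table[s][1] else 1 for s in range(4)]
```

After training, the greedy policy chooses "right" in every non-goal state, which is the shortest trajectory to the rewarded region.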

[LG-49] Sampling on Metric Graphs

Link: https://arxiv.org/abs/2512.02175
Authors: Rajat Vadiraj Dwaraknath, Lexing Ying
Categories: Numerical Analysis (math.NA); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Probability (math.PR)
Comments:

Click to view abstract

Abstract:Metric graphs are structures obtained by associating edges in a standard graph with segments of the real line and gluing these segments at the vertices of the graph. The resulting structure has a natural metric that allows for the study of differential operators and stochastic processes on the graph. Brownian motions in these domains have been extensively studied theoretically using their generators. However, less work has been done on practical algorithms for simulating these processes. We introduce the first algorithm for simulating Brownian motions on metric graphs through a timestep splitting Euler-Maruyama-based discretization of their corresponding stochastic differential equation. By applying this scheme to Langevin diffusions on metric graphs, we also obtain the first algorithm for sampling on metric graphs. We provide theoretical guarantees on the number of timestep splittings required for the algorithm to converge to the underlying stochastic process. We also show that the exit probabilities of the simulated particle converge to the vertex-edge jump probabilities of the underlying stochastic differential equation as the timestep goes to zero. Finally, since this method is highly parallelizable, we provide fast, memory-aware implementations of our algorithm in the form of custom CUDA kernels that are up to ~8000x faster than a GPU implementation using PyTorch on simple star metric graphs. Beyond simple star graphs, we benchmark our algorithm on a real cortical vascular network extracted from a DuMuX tissue-perfusion model for tracer transport. Our algorithm is able to run stable simulations with timesteps significantly larger than the stable limit of the finite volume method used in DuMuX while also achieving speedups of up to ~1500x.

[LG-50] How Market Volatility Shapes Algorithmic Collusion: A Comparative Analysis of Learning-Based Pricing Algorithms

Link: https://arxiv.org/abs/2512.02134
Authors: Aheer Sravon, Md. Ibrahim, Devdyuti Mazumder, Ridwan Al Aziz
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:Autonomous pricing algorithms are increasingly influencing competition in digital markets; however, their behavior under realistic demand conditions remains largely unexamined. This paper offers a thorough analysis of four pricing algorithms – Q-Learning, PSO, Double DQN, and DDPG – across three classic duopoly models (Logit, Hotelling, Linear) and under various demand-shock regimes created by auto-regressive processes. By utilizing profit- and price-based collusion indices, we investigate how the interactions among algorithms, market structure, and stochastic demand collaboratively influence competitive outcomes. Our findings reveal that reinforcement-learning algorithms often sustain supra-competitive prices under stable demand, with DDPG demonstrating the most pronounced collusive tendencies. Demand shocks produce notably varied effects: Logit markets suffer significant performance declines, Hotelling markets remain stable, and Linear markets experience shock-induced profit inflation. Despite marked changes in absolute performance, the relative rankings of the algorithms are consistent across different environments. These results underscore the critical importance of market structure and demand uncertainty in shaping algorithmic competition, while also contributing to the evolving policy discussions surrounding autonomous pricing behavior.

[LG-51] Cross-View Topology-Aware Graph Representation Learning

Link: https://arxiv.org/abs/2512.02130
Authors: Ahmet Sami Korkmaz, Selim Coskunuzer, Md Joshem Uddin
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Graph classification has gained significant attention due to its applications in chemistry, social networks, and bioinformatics. While Graph Neural Networks (GNNs) effectively capture local structural patterns, they often overlook global topological features that are critical for robust representation learning. In this work, we propose GraphTCL, a dual-view contrastive learning framework that integrates structural embeddings from GNNs with topological embeddings derived from persistent homology. By aligning these complementary views through a cross-view contrastive loss, our method enhances representation quality and improves classification performance. Extensive experiments on benchmark datasets, including TU and OGB molecular graphs, demonstrate that GraphTCL consistently outperforms state-of-the-art baselines. This study highlights the importance of topology-aware contrastive learning for advancing graph representation methods.

[LG-52] Contextual Gating within the Transformer Stack: Synergistic Feature Modulation for Enhanced Lyrical Classification and Calibration

Link: https://arxiv.org/abs/2512.02053
Authors: M.A. Gameiro
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:This study introduces a significant architectural advancement in feature fusion for lyrical content classification by integrating auxiliary structural features directly into the self-attention mechanism of a pre-trained Transformer. I propose the SFL Transformer, a novel deep learning model that utilizes a Contextual Gating mechanism (an Intermediate SFL) to modulate the sequence of hidden states within the BERT encoder stack, rather than fusing features at the final output layer. This approach modulates the deep, contextualized semantic features (Hseq) using low-dimensional structural cues (Fstruct). The model is applied to a challenging binary classification task derived from UMAP-reduced lyrical embeddings. The SFL Transformer achieved an Accuracy of 0.9910 and a Macro F1 score of 0.9910, significantly improving the state-of-the-art established by the previously published SFL model (Accuracy 0.9894). Crucially, this Contextual Gating strategy maintained exceptional reliability, with a low Expected Calibration Error (ECE = 0.0081) and Log Loss (0.0489). This work validates the hypothesis that injecting auxiliary context mid-stack is the most effective means of synergistically combining structural and semantic information, creating a model with both superior discriminative power and high-fidelity probability estimates.
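
Since the abstract reports an Expected Calibration Error, a short sketch of the common equal-width-bin definition may help. The binning scheme and bin count are assumptions, as the abstract does not say how its ECE was computed, and the example numbers are made up.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin size / N) * |bin accuracy - mean bin confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece


# Hypothetical predictions: two at 90% confidence (both right),
# two at 60% confidence (one right, one wrong).
ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 1, 1, 0])
```

Each of the two occupied bins has a gap of 0.1 between accuracy and confidence and holds half the data, so the ECE for this toy input is 0.1. A value near zero, such as the one reported in the abstract, means stated confidences track observed accuracy closely.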

[LG-53] PIBNet: a Physics-Inspired Boundary Network for Multiple Scattering Simulations

Link: https://arxiv.org/abs/2512.02049
Authors: Rémi Marsal, Stéphanie Chaillat
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The boundary element method (BEM) provides an efficient numerical framework for solving multiple scattering problems in unbounded homogeneous domains, since it reduces the discretization to the domain boundaries, thereby condensing the computational complexity. The procedure first consists in determining the solution trace on the boundaries of the domain by solving a boundary integral equation, after which the volumetric solution can be recovered at low computational cost with a boundary integral representation. As the first step of the BEM represents the main computational bottleneck, we introduce PIBNet, a learning-based approach designed to approximate the solution trace. The method leverages a physics-inspired graph-based strategy to model obstacles and their long-range interactions efficiently. Then, we introduce a novel multiscale graph neural network architecture for simulating the multiple scattering. To train and evaluate our network, we present a benchmark consisting of several datasets of different types of multiple scattering problems. The results indicate that our approach not only surpasses existing state-of-the-art learning-based methods on the considered tasks but also exhibits superior generalization to settings with an increased number of obstacles. this http URL

[LG-54] Pharmacophore-based design by learning on voxel grids

Link: https://arxiv.org/abs/2512.02031
Authors: Omar Mahmood, Pedro O. Pinheiro, Richard Bonneau, Saeed Saremi, Vishnu Sresht
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments:

Click to view abstract

Abstract:Ligand-based drug discovery (LBDD) relies on making use of known binders to a protein target to find structurally diverse molecules similarly likely to bind. This process typically involves a brute force search of the known binder (query) against a molecular library using some metric of molecular similarity. One popular approach overlays the pharmacophore-shape profile of the known binder to 3D conformations enumerated for each of the library molecules, computes overlaps, and picks a set of diverse library molecules with high overlaps. While this virtual screening workflow has had considerable success in hit diversification, scaffold hopping, and patent busting, it scales poorly with library sizes and restricts candidate generation to existing library compounds. Leveraging recent advances in voxel-based generative modelling, we propose a pharmacophore-based generative model and workflows that address the scaling and fecundity issues of conventional pharmacophore-based virtual screening. We introduce VoxCap, a voxel captioning method for generating SMILES strings from voxelised molecular representations. We propose two workflows as practical use cases as well as benchmarks for pharmacophore-based generation: de-novo design, in which we aim to generate new molecules with high pharmacophore-shape similarities to query molecules, and fast search, which aims to combine generative design with a cheap 2D substructure similarity search for efficient hit identification. Our results show that VoxCap significantly outperforms previous methods in generating diverse de-novo hits. When combined with our fast search workflow, VoxCap reduces computational time by orders of magnitude while returning hits for all query molecules, enabling the search of large libraries that are intractable to search by brute force.

[LG-55] An Improved Ensemble-Based Machine Learning Model with Feature Optimization for Early Diabetes Prediction

Link: https://arxiv.org/abs/2512.02023
Authors: Md. Najmul Islam, Md. Miner Hossain Rimon, Shah Sadek-E-Akbor Shamim, Zarif Mohaimen Fahad, Md. Jehadul Islam Mony, Md. Jalal Uddin Chowdhury
Categories: Machine Learning (cs.LG)
Comments: Accepted for presentation at the 7th International Conference on Trends in Computational and Cognitive Engineering (TCCE-2025), 12-13 November 2025. This manuscript contains 10 pages and 7 figures

Click to view abstract

Abstract:Diabetes is a serious worldwide health issue, and successful intervention depends on early detection. However, overlapping risk factors and data asymmetry make prediction difficult. This study uses extensive health survey data to build a machine learning framework for diabetes classification that is both accurate and comprehensible, with results intended to aid clinical decision-making. Using the BRFSS dataset, we assessed a number of supervised learning techniques. SMOTE and Tomek Links were used to correct class imbalance. To improve prediction performance, both individual models and ensemble techniques such as stacking were investigated. The 2015 BRFSS dataset, which includes roughly 253,680 records with 22 numerical features, is used in this study. Strong ROC-AUC performance of approximately 0.96 was attained by the individual models Random Forest, XGBoost, CatBoost, and this http URL stacking ensemble with XGBoost and KNN yielded the best overall results with 94.82% accuracy, ROC-AUC of 0.989, and PR-AUC of 0.991, indicating a favourable balance between recall and precision. We also proposed and developed a React Native-based application with a Python Flask backend to support early diabetes prediction, providing users with an accessible and efficient health monitoring tool.
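
The abstract names Tomek Links as part of its class-imbalance correction but gives no implementation details. As a rough illustration only, the sketch below detects Tomek links in pure Python: a pair of points with different labels in which each point is the other's nearest neighbour. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from math import dist


def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: dist(points[i], points[j]),
    )


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that form Tomek links.

    A pair is a Tomek link when the two points carry different labels
    and each is the other's nearest neighbour.
    """
    links = []
    for i in range(len(points)):
        j = nearest_neighbor(i, points)
        if i < j and labels[i] != labels[j] and nearest_neighbor(j, points) == i:
            links.append((i, j))
    return links


# Hypothetical toy data: label 0 is the majority class, label 1 the minority.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.05, 1.0), (3.0, 3.0)]
labels = [0, 0, 0, 1, 1]
links = tomek_links(points, labels)
```

In a cleaning step, the majority-class member of each detected pair would typically be dropped. Whether the paper removes one or both members is not stated in the abstract.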

[LG-56] Flexible Gravitational-Wave Parameter Estimation with Transformers

Link: https://arxiv.org/abs/2512.02968
Authors: Annalena Kofler, Maximilian Dax, Stephen R. Green, Jonas Wildberger, Nihar Gupte, Jakob H. Macke, Jonathan Gair, Alessandra Buonanno, Bernhard Schölkopf
Categories: General Relativity and Quantum Cosmology (gr-qc); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: 8+11 pages, 3+7 figures

Abstract:Gravitational-wave data analysis relies on accurate and efficient methods to extract physical information from noisy detector signals, yet the increasing rate and complexity of observations represent a growing challenge. Deep learning provides a powerful alternative to traditional inference, but existing neural models typically lack the flexibility to handle variations in data analysis settings. Such variations accommodate imperfect observations or are required for specialized tests, and could include changes in detector configurations, overall frequency ranges, or localized cuts. We introduce a flexible transformer-based architecture paired with a training strategy that enables adaptation to diverse analysis settings at inference time. Applied to parameter estimation, we demonstrate that a single flexible model – called Dingo-T1 – can (i) analyze 48 gravitational-wave events from the third LIGO-Virgo-KAGRA Observing Run under a wide range of analysis configurations, (ii) enable systematic studies of how detector and frequency configurations impact inferred posteriors, and (iii) perform inspiral-merger-ringdown consistency tests probing general relativity. Dingo-T1 also improves median sample efficiency on real events from a baseline of 1.4% to 4.2%. Our approach thus demonstrates flexible and scalable inference with a principled framework for handling missing or incomplete data – key capabilities for current and next-generation observatories.
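
The abstract reports "sample efficiency" without defining it. One common definition in importance sampling is the Kish effective sample size divided by the number of samples. The sketch below assumes that definition, and the paper may use a different one. The weights are made-up values.

```python
def sample_efficiency(weights):
    """Kish effective sample size divided by the number of samples.

    Equals 1.0 when all importance weights are equal and approaches
    1/n when a single sample carries all the weight.
    """
    n = len(weights)
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return (total * total) / (n * total_sq)


# Hypothetical importance weights for two extreme cases.
uniform = sample_efficiency([1.0, 1.0, 1.0, 1.0])
skewed = sample_efficiency([1.0, 0.0, 0.0, 0.0])
```

Under this definition, the reported improvement from 1.4% to 4.2% would mean roughly three times as many effective samples per network draw.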

[LG-57] Representation of Inorganic Synthesis Reactions and Prediction: Graphical Framework and Datasets

Link: https://arxiv.org/abs/2512.02947
Authors: Samuel Andrello, Daniel Alabi, Simon J. L. Billinge
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: For associated code and datasets, see this https URL

[LG-58] Leveraging generative adversarial networks with spatially adaptive denormalization for multivariate stochastic seismic data inversion

Link: https://arxiv.org/abs/2512.02863
Authors: Roberto Miele, Leonardo Azevedo
Categories: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Applications (stat.AP)
Comments:

[LG-59] Revisiting Theory of Contrastive Learning for Domain Generalization

Link: https://arxiv.org/abs/2512.02831
Authors: Ali Alvandi, Mina Rezaei
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 19 pages

[LG-60] Generative modeling using evolved quantum Boltzmann machines

Link: https://arxiv.org/abs/2512.02721
Authors: Mark M. Wilde
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 30 pages, 2 figures

Abstract:Born-rule generative modeling, a central task in quantum machine learning, seeks to learn probability distributions that can be efficiently sampled by measuring complex quantum states. One hope is for quantum models to efficiently capture probability distributions that are difficult to learn and simulate by classical means alone. Quantum Boltzmann machines were proposed about one decade ago for this purpose, yet efficient training methods have remained elusive. In this paper, I overcome this obstacle by proposing a practical solution that trains quantum Boltzmann machines for Born-rule generative modeling. Two key ingredients in the proposal are the Donsker-Varadhan variational representation of the classical relative entropy and the quantum Boltzmann gradient estimator of [Patel et al., arXiv:2410.12935]. I present the main result for a more general ansatz known as an evolved quantum Boltzmann machine [Minervini et al., arXiv:2501.03367], which combines parameterized real- and imaginary-time evolution. I also show how to extend the findings to other distinguishability measures beyond relative entropy. Finally, I present four different hybrid quantum-classical algorithms for the minimax optimization underlying training, and I discuss their theoretical convergence guarantees.
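
The Donsker-Varadhan representation mentioned in the abstract writes the classical relative entropy as D(p‖q) = sup_T { E_p[T] − log E_q[e^T] }, with the supremum attained at T = log(p/q) up to an additive constant. The sketch below checks this numerically for two small discrete distributions. The distributions and the function names are illustrative assumptions, not the paper's training procedure.

```python
from math import exp, log


def kl_divergence(p, q):
    """Classical relative entropy D(p || q) for discrete distributions."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def dv_objective(t, p, q):
    """Donsker-Varadhan objective E_p[T] - log E_q[exp(T)] for a test function t."""
    expect_p = sum(pi * ti for pi, ti in zip(p, t))
    log_mgf_q = log(sum(qi * exp(ti) for qi, ti in zip(q, t)))
    return expect_p - log_mgf_q


# Hypothetical target (p) and model (q) distributions over three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

t_opt = [log(pi / qi) for pi, qi in zip(p, q)]  # optimal test function
t_other = [0.0, 1.0, -1.0]                      # arbitrary suboptimal choice

kl = kl_divergence(p, q)
bound_opt = dv_objective(t_opt, p, q)
bound_other = dv_objective(t_other, p, q)
```

Any test function gives a lower bound on the relative entropy, which is what turns training into the minimax problem the abstract refers to: maximise over T, minimise over model parameters.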

[LG-61] Laplace Approximation For Tensor Train Kernel Machines In System Identification

Link: https://arxiv.org/abs/2512.02532
Authors: Albert Saiapin, Kim Batselier
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 6 pages, 2 figures, 4 tables. Submitted to IFAC 2026. Code available at: this https URL

[LG-62] Bayesian Physics-Informed Neural Networks for Inverse Problems (BPINN-IP): Application in Infrared Image Processing

Link: https://arxiv.org/abs/2512.02495
Authors: Ali Mohammad-Djafari, Ning Chu, Li Wang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 31 pages, paper in revision, submitted to the Journal of the Franklin Institute, 2025

[LG-63] Safeguarded Stochastic Polyak Step Sizes for Non-smooth Optimization: Robust Performance Without Small (Sub)Gradients

Link: https://arxiv.org/abs/2512.02342
Authors: Dimitris Oikonomou, Nicolas Loizou
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 28 pages, 15 figures

[LG-64] Molecular Embedding-Based Algorithm Selection in Protein-Ligand Docking

Link: https://arxiv.org/abs/2512.02328
Authors: Jiabao Brad Wang, Siyuan Cao, Hongxuan Wu, Yiliang Yuan, Mustafa Misir
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments: 25 pages, 13 figures, 5 tables. Protein-ligand docking, algorithm selection, pretrained embeddings (ESM, ChemBERTa), docking benchmarks, oracle-landscape analysis. Code and data available

[LG-65] Few-shot Protein Fitness Prediction via In-context Learning and Test-time Training NEURIPS2025

Link: https://arxiv.org/abs/2512.02315
Authors: Felix Teufel, Aaron W. Kollasch, Yining Huang, Ole Winther, Kevin K. Yang, Pascal Notin, Debora S. Marks
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments: AI for Science Workshop (NeurIPS 2025)

Abstract:Accurately predicting protein fitness with minimal experimental data is a persistent challenge in protein engineering. We introduce PRIMO (PRotein In-context Mutation Oracle), a transformer-based framework that leverages in-context learning and test-time training to adapt rapidly to new proteins and assays without large task-specific datasets. By encoding sequence information, auxiliary zero-shot predictions, and sparse experimental labels from many assays as a unified token set in a pre-training masked-language modeling paradigm, PRIMO learns to prioritize promising variants through a preference-based loss function. Across diverse protein families and properties, including both substitution and indel mutations, PRIMO outperforms zero-shot and fully supervised baselines. This work underscores the power of combining large-scale pre-training with efficient test-time adaptation to tackle challenging protein design tasks where data collection is expensive and label availability is limited.

[LG-66] Opening the Black Box: Nowcasting Singapore's GDP Growth and its Explainability

Link: https://arxiv.org/abs/2512.02092
Authors: Luca Attolico
Categories: Econometrics (econ.EM); Machine Learning (cs.LG); Applications (stat.AP)
Comments: PhD thesis, University of Macerata (2025). PhD programme: Quantitative Methods for Policy Evaluation (Cycle XXXVII). Supervisors: Rosaria Romano, Jamus Jerome Lim

Abstract:Timely assessment of current conditions is essential especially for small, open economies such as Singapore, where external shocks transmit rapidly to domestic activity. We develop a real-time nowcasting framework for quarterly GDP growth using a high-dimensional panel of approximately 70 indicators, encompassing economic and financial indicators over 1990Q1-2023Q2. The analysis covers penalized regressions, dimensionality-reduction methods, ensemble learning algorithms, and neural architectures, benchmarked against a Random Walk, an AR(3), and a Dynamic Factor Model. The pipeline preserves temporal ordering through an expanding-window walk-forward design with Bayesian hyperparameter optimization, and uses moving block-bootstrap procedures both to construct prediction intervals and to obtain confidence bands for feature-importance measures. It adopts model-specific and XAI-based explainability tools. A Model Confidence Set procedure identifies statistically superior learners, which are then combined through simple, weighted, and exponentially weighted schemes; the resulting time-varying weights provide an interpretable representation of model contributions. Predictive ability is assessed via Giacomini-White tests. Empirical results show that penalized regressions, dimensionality-reduction models, and GRU networks consistently outperform all benchmarks, with RMSFE reductions of roughly 40-60%; aggregation delivers further gains. Feature-attribution methods highlight industrial production, external trade, and labor-market indicators as dominant drivers of Singapore’s short-run growth dynamics.
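
The expanding-window walk-forward design and the RMSFE metric named in the abstract can be sketched in a few lines. This is a minimal illustration under assumed settings: the series, the minimum training length, and the naive last-value forecast (standing in for the Random Walk benchmark) are all hypothetical, and the thesis's actual pipeline is far richer.

```python
from math import sqrt


def expanding_window_splits(n_obs, min_train, horizon=1):
    """Yield (train_indices, test_indices) with a training window that only grows.

    Test indices always come strictly after the training indices, so no
    future information leaks into the fit.
    """
    end = min_train
    while end + horizon <= n_obs:
        yield list(range(end)), list(range(end, end + horizon))
        end += horizon


def rmsfe(actual, forecast):
    """Root mean squared forecast error."""
    errors = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return sqrt(sum(errors) / len(errors))


# Hypothetical quarterly growth rates (percent), not real data.
series = [1.0, 1.5, 0.5, 2.0, 1.0, 1.5]
splits = list(expanding_window_splits(len(series), min_train=3))

actual, forecast = [], []
for train, test in splits:
    forecast.append(series[train[-1]])  # naive forecast: last observed value
    actual.append(series[test[0]])

score = rmsfe(actual, forecast)
```

A candidate model would replace the naive forecast inside the loop, and its RMSFE would then be compared against this benchmark score.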

[LG-67] From Betti Numbers to Persistence Diagrams: A Hybrid Quantum Algorithm for Topological Data Analysis

Link: https://arxiv.org/abs/2512.02081
Authors: Dong Liu
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 11 pages

[LG-68] Quantum Machine Learning for Secondary Frequency Control

Link: https://arxiv.org/abs/2512.02065
Authors: Younes Ghazagh Jahed, Alireza Khatiri
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 8 pages, 6 figures

Abstract:Frequency control in power systems is critical to maintaining stability and preventing blackouts. Traditional methods like meta-heuristic algorithms and machine learning face limitations in real-time applicability and scalability. This paper introduces a novel approach using a pure variational quantum circuit (VQC) for real-time secondary frequency control in diesel generators. Unlike hybrid classical-quantum models, the proposed VQC operates independently during execution, eliminating latency from classical-quantum data exchange. The VQC is trained via supervised learning to map historical frequency deviations to optimal Proportional-Integral (PI) controller parameters using a pre-computed lookup table. Simulations demonstrate that the VQC achieves high prediction accuracy (over 90%) with sufficient quantum measurement shots and generalizes well across diverse test events. The quantum-optimized PI parameters significantly improve transient response, reducing frequency fluctuations and settling time.
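
To make concrete what the predicted PI parameters act on, the sketch below simulates a discrete-time PI controller rejecting a step load disturbance on a first-order plant. Everything here is an assumption for illustration: the plant model, the gains, and the disturbance size are made up, and the quantum circuit that would supply the gains is not modelled.

```python
def simulate(kp, ki, disturbance=-0.1, damping=1.0, inertia=1.0,
             dt=0.01, steps=5000):
    """Return the final frequency deviation after a step load disturbance.

    Plant (assumed): inertia * d(dev)/dt = -damping * dev + control + disturbance
    Controller:      control = -(kp * dev + ki * integral of dev)
    """
    deviation = 0.0
    integral = 0.0
    for _ in range(steps):
        control = -(kp * deviation + ki * integral)
        derivative = (-damping * deviation + control + disturbance) / inertia
        deviation += dt * derivative   # forward Euler step
        integral += dt * deviation     # accumulate the error for the I term
    return deviation


# Hypothetical gains; in the paper's setting these would come from the VQC.
final_open_loop = simulate(kp=0.0, ki=0.0)
final_pi = simulate(kp=2.0, ki=1.0)
```

Without control, the deviation settles at the disturbance divided by the damping. The integral term is what drives the steady-state deviation back toward zero.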

[LG-69] From What-is to What-if in Human-Factor Analysis: A Post-Occupancy Evaluation Case

Link: https://arxiv.org/abs/2512.02060
Authors: Xia Chen, Ruiji Sun, Philipp Geyer, André Borrmann, Stefano Schiavon
Categories: Applications (stat.AP); Machine Learning (cs.LG)
Comments: 17 pages, 5 figures

[LG-70] Generative design and validation of therapeutic peptides for glioblastoma based on a potential target ATP5A

Link: https://arxiv.org/abs/2512.02030
Authors: Hao Qian, Pu You, Lin Zeng, Jingyuan Zhou, Dengdeng Huang, Kaicheng Li, Shikui Tu, Lei Xu
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments:

[LG-71] Seizure-NGCLNet: Representation Learning of SEEG Spatial Pathological Patterns for Epileptic Seizure Detection via Node-Graph Dual Contrastive Learning

Link: https://arxiv.org/abs/2512.02028
Authors: Yiping Wang, Peiren Wang, Zhenye Li, Fang Liu, Jinguo Huang
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments:

Information Retrieval

Attachment Download

Click to download today's full paper list