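This post lists the latest papers retrieved from arXiv.org on 2025-12-02. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.
Note: the daily paper data is retrieved from arXiv.org and updated automatically every morning at around 12:00.
Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.
Contents
Overview (2025-12-02)
1084 papers were updated today, including:
- Natural Language Processing: 117 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 331 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 279 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 329 papers (Machine Learning (cs.LG))
Natural Language Processing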
[NLP-0] Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling
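[Quick read]: This paper addresses the training divergence and inference performance degradation that arise when large language models (LLMs) are trained with the low-precision NVFP4 (NVIDIA Floating Point 4-bit) format. The core challenge is that NVFP4 requires weights and activations in the forward pass, and weights, activations, and gradients in the backward pass, to all be quantized to this format, while existing methods use a single scale factor for each block of values, which sharply increases quantization error for near-maximal values and in turn hurts convergence and accuracy. The key to the solution is the Four Over Six (4/6) mechanism, which evaluates two candidate scale factors for each block and selects the one that minimizes quantization error; in particular, the authors find that scaling some blocks to a smaller FP4 range makes the distribution of representable values more uniform, improving the representation of near-maximal values. The method can be implemented efficiently on the NVIDIA Blackwell architecture and applies to a variety of post-training quantization strategies, improving training stability and downstream task accuracy.
Link: https://arxiv.org/abs/2512.02010
Authors: Jack Cook,Junxian Guo,Guangxuan Xiao,Yujun Lin,Song Han
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures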
Abstract:As large language models have grown larger, low-precision numerical formats such as NVFP4 have become increasingly popular due to the speed and memory benefits they provide. However, to accelerate computation with NVFP4, all matrix multiplication operands–weights and activations in the forward pass, and weights, activations, and gradients in the backward pass–must be quantized to NVFP4, often leading to divergence during training and performance degradation during inference. To address this issue, in this work we introduce Four Over Six (4/6), a modification to the NVFP4 quantization algorithm that evaluates two potential scale factors for each block of values. Unlike integer formats, floating-point formats such as FP4 have the most quantization error on near-maximal values in each block, which we find to be primarily responsible for downstream performance degradation. We find that for some blocks, scaling to smaller FP4 values makes the distribution of representable values more uniform, improving representation of near-maximal values. Importantly, 4/6 can be implemented efficiently on NVIDIA Blackwell GPUs, making it viable to use while training LLMs with NVFP4. In pre-training experiments with transformer and hybrid model architectures, we find that 4/6 prevents divergence in several cases, bringing training loss significantly closer to BF16 compared to models trained with current state-of-the-art NVFP4 training recipes. We also find that 4/6 can be easily incorporated into many different post-training quantization methods and generally improves downstream accuracy. We hope this inspires future work in training and deploying models with NVFP4.
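The idea of trying two scale factors per block and keeping the better one can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes the two candidates map the block's largest magnitude to 6 (the FP4 maximum) or to 4, and it ignores the low-precision storage of the block scale and any per-tensor scale used by real NVFP4. The function names `quantize_block` and `four_over_six` are hypothetical.

```python
import numpy as np

# Magnitudes representable by FP4 (E2M1); the sign is handled separately.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(block, target_max):
    """Scale a 1-D block so its largest magnitude maps to target_max,
    round each value to the nearest FP4 magnitude, then rescale."""
    amax = np.max(np.abs(block))
    if amax == 0.0:
        return np.zeros_like(block)
    scale = amax / target_max  # kept in full precision here (an assumption)
    scaled = np.abs(block) / scale
    idx = np.argmin(np.abs(scaled[:, None] - FP4_GRID[None, :]), axis=1)
    return np.sign(block) * FP4_GRID[idx] * scale

def four_over_six(block):
    """Try both candidate scalings and keep the one with lower mean squared error.
    Returns (chosen target_max, dequantized block, error)."""
    candidates = [(t, quantize_block(block, t)) for t in (6.0, 4.0)]
    errors = [float(np.mean((block - q) ** 2)) for _, q in candidates]
    best = int(np.argmin(errors))
    return candidates[best][0], candidates[best][1], errors[best]
```

For a block such as `[6.0, 5.0, 5.0, 5.0]`, mapping the maximum to 6 leaves the values near the maximum between the coarse grid points 4 and 6, while mapping the maximum to 4 places them on the finer part of the grid, so the second candidate wins in this sketch.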
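[NLP-1] The Art of Scaling Test-Time Compute for Large Language Models
[Quick read]: This paper addresses the difficulty of choosing a test-time scaling (TTS) strategy in practice, specifically the lack of a systematic comparison of mainstream TTS strategies under identical conditions and the unclear influence of model type and problem difficulty on performance. The key to the solution is a large-scale empirical study spanning over thirty billion tokens, eight open-source large language models (7B to 235B parameters), and four reasoning datasets, which reveals three consistent trends: (1) no single TTS strategy is universally best; (2) reasoning models show distinct trace-quality patterns across problem difficulty and trace length, forming short-horizon and long-horizon categories; (3) for a given model type, the optimal TTS performance scales monotonically with compute budget. Based on these findings, the paper offers a practical recipe for selecting a TTS strategy that accounts for problem difficulty, model type, and compute budget.
Link: https://arxiv.org/abs/2512.02008
Authors: Aradhye Agarwal,Ayan Sengupta,Tanmoy Chakraborty
Affiliations: Microsoft Research; Indian Institute of Technology Delhi
Categories: Computation and Language (cs.CL)
Comments: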
Abstract:Test-time scaling (TTS) – the dynamic allocation of compute during inference – is a promising direction for improving reasoning in large language models (LLMs). However, a systematic comparison of well-known TTS strategies under identical conditions is missing, and the influence of model type and problem difficulty on performance remains unclear. To address these gaps, we conduct the first large-scale study of TTS, spanning over thirty billion tokens generated using eight open-source LLMs (7B to 235B parameters), across four reasoning datasets. We observe three consistent trends: (1) no single TTS strategy universally dominates; (2) reasoning models exhibit distinct trace-quality patterns across problem difficulty and trace length, forming short-horizon and long-horizon categories; and (3) for a given model type, the optimal TTS performance scales monotonically with compute budget. Based on these insights, we provide a practical recipe for selecting the best TTS strategy, considering problem difficulty, model type, and compute budget, providing a practical guide to effective inference-time scaling.
[NLP-2] AlignSAE: Concept-Aligned Sparse Autoencoders
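[Quick read]: This paper addresses the difficulty Sparse Autoencoders (SAEs) have, when decomposing the hidden activations of Large Language Models (LLMs), in reliably aligning the extracted features with human-defined concepts, which leaves feature representations entangled and distributed. The key to the solution is AlignSAE, which follows a "pre-train, then post-train" curriculum: an initial unsupervised pre-training phase provides basic reconstruction ability, and a subsequent supervised post-training phase binds specific concepts to dedicated latent slots while preserving the remaining capacity for general reconstruction. This separation creates an interpretable interface in which specific semantic relations can be inspected and controlled independently, enabling precise causal interventions such as reliable "concept swaps".
Link: https://arxiv.org/abs/2512.02004
Authors: Minglai Yang,Xinyu Guo,Mihai Surdeanu,Liangming Pan
Affiliations: University of Arizona; MOE Key Lab of Computational Linguistics, Peking University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 20 pages, 7 figures, 5 tables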
Abstract:Large Language Models (LLMs) encode factual knowledge within hidden parametric spaces that are difficult to inspect or control. While Sparse Autoencoders (SAEs) can decompose hidden activations into more fine-grained, interpretable features, they often struggle to reliably align these features with human-defined concepts, resulting in entangled and distributed feature representations. To address this, we introduce AlignSAE, a method that aligns SAE features with a defined ontology through a “pre-train, then post-train” curriculum. After an initial unsupervised training phase, we apply supervised post-training to bind specific concepts to dedicated latent slots while preserving the remaining capacity for general reconstruction. This separation creates an interpretable interface where specific relations can be inspected and controlled without interference from unrelated features. Empirical results demonstrate that AlignSAE enables precise causal interventions, such as reliable “concept swaps”, by targeting single, semantically aligned slots.
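[NLP-3] LLM CHESS: Benchmarking Reasoning and Instruction-Following in LLMs through Chess
[Quick read]: This paper addresses the difficulty of evaluating how well the reasoning and instruction-following abilities of Large Language Models (LLMs) generalize, targeting in particular the overfitting, memorization, and benchmark saturation that affect static benchmarks. The key to the solution is the LLM CHESS evaluation framework, which uses extended agentic interaction in the domain of chess: more than 50 open and closed source models are ranked by playing against a random opponent using a range of behavioral metrics (win and loss rates, move quality, move legality, hallucinated actions, and others), and an Elo estimate is then derived for a subset of top reasoning models to quantify the differences between them. The stochastic and dynamic nature of this game-based evaluation reduces overfitting and memorization, giving a more faithful picture of general reasoning and instruction-following ability.
Link: https://arxiv.org/abs/2512.01992
Authors: Sai Kolasani,Maxim Saplin,Nicholas Crispino,Kyle Montgomery,Jared Quincy Davis,Matei Zaharia,Chi Wang,Chenguang Wang
Affiliations: UC Berkeley; Independent Researcher; UC Santa Cruz; Stanford University; Google DeepMind
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: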
Abstract:We introduce LLM CHESS, an evaluation framework designed to probe the generalization of reasoning and instruction-following abilities in large language models (LLMs) through extended agentic interaction in the domain of chess. We rank over 50 open and closed source models by playing against a random opponent using a range of behavioral metrics, including win and loss rates, move quality, move legality, hallucinated actions, and game duration. For a subset of top reasoning models, we derive an Elo estimate by playing against a chess engine with variably configured skill, which allows for comparisons between models in an easily understandable way. Despite the simplicity of the instruction-following task and the weakness of the opponent, many state-of-the-art models struggle to complete games or achieve consistent wins. Similar to other benchmarks on complex reasoning tasks, our experiments reveal a clear separation between reasoning and non-reasoning models. However, unlike existing static benchmarks, the stochastic and dynamic nature of LLM CHESS uniquely reduces overfitting and memorization while preventing benchmark saturation, proving difficult even for top reasoning models. To support future work on evaluating reasoning and instruction-following in LLMs, we release our experimental framework, a public leaderboard, and a dataset of associated games.
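The abstract does not say how the Elo estimate is computed, so the following is only a generic sketch of one common approach: use the standard Elo expected-score formula and search for the rating whose predicted total score against opponents of known strength matches the observed total. The function names and the rating bounds are assumptions made for illustration.

```python
def expected_score(rating, opponent_rating):
    """Standard Elo expected score of a player against one opponent."""
    return 1.0 / (1.0 + 10.0 ** ((opponent_rating - rating) / 400.0))

def estimate_elo(games, lo=0.0, hi=4000.0, iters=60):
    """Estimate a rating from games given as (opponent_rating, score) pairs,
    where score is 1 for a win, 0.5 for a draw, and 0 for a loss.
    Bisection finds the rating whose expected total equals the actual total.
    If every game is won (or lost), the result saturates at hi (or lo)."""
    total = sum(score for _, score in games)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        predicted = sum(expected_score(mid, opp) for opp, _ in games)
        if predicted < total:
            lo = mid  # rating too low to explain the observed score
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For example, `estimate_elo([(1500, 1), (1500, 0)])` converges to 1500, since one win and one loss against a 1500-rated opponent matches an expected score of 0.5 per game.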
[NLP-4] Chain-of-Ground: Improving GUI Grounding via Iterative Reasoning and Reference Feedback
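[Quick read]: This paper addresses the difficulty of aligning natural language instructions with visual regions in complex user interfaces (UIs), especially for small targets, visually similar targets, and ambiguity in real-world layouts. Existing methods based on Multimodal Large Language Models (MLLMs) perform strongly on visual GUI grounding but remain limited by insufficient grounding capacity and underused reasoning potential. The key to the solution is Chain of Ground (CoG), a training-free, multi-step iterative grounding framework that uses an MLLM for step-by-step visual reasoning and hypothesis refinement, with the model reflecting on and adjusting its own hypotheses to produce more accurate and interpretable localization. The method improves accuracy on the ScreenSpot Pro benchmark by 4.8 points, and on TPanel UI, a dataset of real industrial control panel images, it improves over the strong baseline Qwen3 VL 235B by 6.9 points, supporting the effectiveness of structured iterative refinement for grounding.
Link: https://arxiv.org/abs/2512.01979
Authors: Aiden Yiliu Li,Bizhi Yu,Daoan Lei,Tianhe Ren,Shilong Liu
Affiliations: University College London; Chico Future AI Lab; The University of Hong Kong; Princeton University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: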
Abstract:GUI grounding aims to align natural language instructions with precise regions in complex user interfaces. Advanced multimodal large language models show strong ability in visual GUI grounding but still struggle with small or visually similar targets and ambiguity in real-world layouts. These limitations arise from limited grounding capacity and from underuse of existing reasoning potential. We present Chain of Ground (CoG), a training-free, multi-step grounding framework that uses multimodal large language models for iterative visual reasoning and refinement. Instead of direct prediction, the model progressively reflects and adjusts its hypotheses, leading to more accurate and interpretable localization. Our approach achieves 68.4 accuracy on the ScreenSpot Pro benchmark, an improvement of 4.8 points. To measure real-world generalization, we introduce TPanel UI, a dataset of 420 labeled industrial control panels with visual distortions such as blur and masking. On TPanel UI, Chain of Ground improves over the strong baseline Qwen3 VL 235B by 6.9 points, showing the effectiveness of multi-step, training-free grounding across real-world and digital interfaces. These results highlight a direction for unlocking grounding potential through structured iterative refinement instead of additional training.
[NLP-5] From Atomic to Composite: Reinforcement Learning Enables Generalization in Complementary Reasoning
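[Quick read]: This paper examines the mechanism by which Reinforcement Learning (RL) improves model reasoning: whether it promotes the synthesis of new skills or merely amplifies existing behaviors. To settle this debate, the authors run an empirical study on Complementary Reasoning, a task that requires integrating internal parametric knowledge with external contextual information. The key to the approach is to first decouple the complex reasoning ability into two atomic skills, Parametric Reasoning and Contextual Reasoning, and have the model master them through Supervised Fine-Tuning (SFT), and then to use RL to compose these atomic skills, which yields out-of-distribution generalization. The study finds that models trained with SFT alone perform well in distribution but fail in Zero-shot settings, which the authors call the "SFT Generalization Paradox", whereas RL can synthesize complex strategies provided that SFT has already established the atomic skills. This suggests that RL is not a simple probability amplifier but a reasoning synthesizer whose effectiveness depends on prior training of the atomic skills.
Link: https://arxiv.org/abs/2512.01970
Authors: Sitao Cheng,Xunjian Yin,Ruiwen Zhou,Yuxuan Li,Xinyi Wang,Liangming Pan,William Yang Wang,Victor Zhong
Affiliations: University of Waterloo; Duke University; National University of Singapore; Princeton University; Peking University; University of California, Santa Barbara
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work in Progress. Code and data will be available at this https URL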
Abstract:The mechanism by which RL contributes to reasoning capabilities-whether it incentivizes the synthesis of new skills or merely amplifies existing behaviors-remains a subject of intense debate. In this work, we investigate this question through the lens of Complementary Reasoning, a complex task that requires integrating internal parametric knowledge with external contextual information. Using a controlled synthetic dataset of human biographies, we strictly decouple this ability into two atomic skills: Parametric Reasoning (relying on internal knowledge) and Contextual Reasoning (depending on external information). To rigorously assess capability boundaries, we evaluate generalization across three distinct levels of difficulty: I.I.D., Composition, and Zero-shot settings. We find that while SFT is sufficient for in-distribution performance, it struggles with O.O.D. generalization, particularly in Zero-shot settings where relational combinations are novel. Crucially, we identify the SFT Generalization Paradox: Models supervised solely on the composite task achieve near-perfect in-distribution accuracy but collapse on out-of-distribution generalization, indicating their reliance on rote memorization of path shortcuts. In contrast, we find that RL acts as a reasoning synthesizer rather than a probability amplifier. However, we uncover a strict atomic prerequisite: RL can only synthesize these complex strategies if the base model has first mastered the independent atomic skills (Parametric and Contextual) via SFT. These findings challenge the view of RL as a mere amplifier, suggesting that given sufficient atomic foundations, RL can actively synthesize complex reasoning strategies from learned primitives without explicit supervision on such complex strategies. This indicates that decoupled atomic training followed by RL offers a scalable path to generalization for complex reasoning tasks.
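[NLP-6] How Far Are We from Genuinely Useful Deep Research Agents?
[Quick read]: This paper addresses the insufficient evaluation and the performance bottlenecks of current Deep Research Agents (DRAs) in generating comprehensive research reports. Existing work is mostly confined to question-answering benchmarks and lacks standardized evaluation of report structure, analytical depth, and factual accuracy; existing report benchmarks also suffer from high task complexity and inconsistent subjective metrics, so they fail to reflect real user needs. The authors therefore propose the FINDER benchmark, which contains 100 human-curated research tasks and 419 structured checklist items that standardize report quality, and build DEFT (Deep rEsearch Failure Taxonomy), the first failure taxonomy for deep research agents, built on grounded theory with human-LLM co-annotation and inter-annotator reliability validation, covering 14 fine-grained failure modes across reasoning, retrieval, and generation. Experiments show that the core weakness of current DRAs is not task comprehension but evidence integration, verification, and reasoning-resilient planning.
Link: https://arxiv.org/abs/2512.01948
Authors: Dingling Zhang,He Zhu,Jincheng Ren,Kangqi Song,Xinran Zhou,Boyu Feng,Shudong Liu,Jiabin Luo,Weihao Xie,Zhaohui Wang,Tianrui Qin,King Zhu,Yuqing Wang,Qianben Chen,Yuchen Eleanor Jiang,Wei Wang,Jiaheng Liu,Wangchunshu Zhou
Affiliations: OPPO AI Agent Team
Categories: Computation and Language (cs.CL)
Comments: 34 pages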
Abstract:Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. However, most existing DRAs were validated on question-answering benchmarks, while research on generating comprehensive reports remains overlooked. Worse, current benchmarks for report synthesis suffer from task complexity and subjective metrics – this fails to reflect user demands and limits the practical utility of generated reports. To address these gaps, we present Fine-grained DEepResearch bench (FINDER), an enhanced benchmark consisting of 100 human-curated research tasks with 419 structured checklist items that standardize report structure, analytical depth, and factual grounding. Based on approximately 1,000 reports produced by mainstream DRAs, we further propose Deep rEsearch Failure Taxonomy (DEFT), the first failure taxonomy for deep research agents. DEFT contains 14 fine-grained failure modes across reasoning, retrieval, and generation, and is built upon grounded theory with human-LLM co-annotating and inter-annotator reliability validation. Our experimental findings reveal that current DRAs struggle not with task comprehension but with evidence integration, verification, and reasoning-resilient planning.
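[NLP-7] Agentic Policy Optimization via Instruction-Policy Co-Evolution
[Quick read]: This paper addresses the performance bottleneck caused by static instruction design in reinforcement learning: conventional methods that rely on static instructions struggle to keep up as the model's policy evolves and its interaction with the environment changes. The key to the solution is the INSPO framework, which couples instruction optimization and policy learning into a co-evolution process. It maintains a dynamic population of instructions inside the reinforcement learning loop, uses reward signals to automatically evaluate each instruction, and periodically prunes low performers. It also introduces an on-policy reflection mechanism in which an LLM-based optimizer analyzes past interactions from a replay buffer and evolves more effective instructions, so that instructions and policy are optimized jointly.
Link: https://arxiv.org/abs/2512.01945
Authors: Han Zhou,Xingchen Wan,Ivan Vulić,Anna Korhonen
Affiliations: University of Cambridge; University of Oxford; Google
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 3 figures, 2 tables (18 pages including references and appendices)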
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capability of large language models (LLMs), enabling autonomous agents that can conduct effective multi-turn and tool-integrated reasoning. While instructions serve as the primary protocol for defining agents, RLVR typically relies on static and manually designed instructions. However, those instructions may be suboptimal for the base model, and the optimal instruction may change as the agent’s policy improves and explores the interaction with the environment. To bridge the gap, we introduce INSPO, a novel Instruction-Policy co-evolution framework that integrates instruction optimization as a dynamic component of the reinforcement learning (RL) loop. INSPO maintains a dynamic population of instruction candidates that are sampled with questions, where reward signals in RL loops are automatically attributed to each instruction, and low performers are periodically pruned. New instructions are generated and verified through an on-policy reflection mechanism, where an LLM-based optimizer analyzes past experience from a replay buffer and evolves more effective strategies given the current policy. We conduct extensive experiments on multi-turn retrieval and reasoning tasks, demonstrating that INSPO substantially outperforms strong baselines relying on static instructions. INSPO discovers innovative instructions that guide the agent toward more strategic reasoning paths, achieving substantial performance gains with only a marginal increase in computational overhead.
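The bookkeeping described above (sample an instruction, attribute the reward to it, periodically prune low performers and add new candidates) can be sketched as follows. This is an illustrative toy, not INSPO itself: the class and method names are hypothetical, instructions are sampled uniformly, ranking uses the plain mean reward, and the LLM-based reflection step is replaced by a caller-supplied `propose` function.

```python
import random
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    total_reward: float = 0.0
    uses: int = 0

    @property
    def mean_reward(self):
        # Unused candidates score 0.0 (a simplifying assumption).
        return self.total_reward / self.uses if self.uses else 0.0

class InstructionPopulation:
    def __init__(self, instructions, seed=0):
        self.candidates = [Candidate(t) for t in instructions]
        self.rng = random.Random(seed)

    def sample(self):
        """Pick an instruction to pair with the next question (uniform here)."""
        return self.rng.choice(self.candidates)

    def record(self, candidate, reward):
        """Attribute the rollout's reward to the instruction that was used."""
        candidate.total_reward += reward
        candidate.uses += 1

    def prune_and_refill(self, keep, propose):
        """Keep the top `keep` candidates by mean reward and refill the
        population with new instructions from `propose(survivors)`, which
        stands in for the LLM-based optimizer."""
        ranked = sorted(self.candidates, key=lambda c: c.mean_reward, reverse=True)
        survivors = ranked[:keep]
        n_new = len(self.candidates) - len(survivors)
        fresh = [Candidate(propose(survivors)) for _ in range(n_new)]
        self.candidates = survivors + fresh
```

In a training loop, `sample` and `record` would be called once per rollout and `prune_and_refill` every fixed number of steps; how often to prune and how to generate new instructions are design choices the abstract leaves to the paper.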
[NLP-8] Rectifying LLM Thought from Lens of Optimization
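[Quick read]: This paper addresses the suboptimal reasoning behaviors that Large Language Models (LLMs) exhibit under long Chain-of-Thought (CoT) prompting, such as overthinking and excessively long reasoning chains, which can impair performance. The key to the solution is RePro (Rectifying Process-level Reward), which frames CoT as a gradient descent procedure, uses a dual scoring mechanism to quantify the intensity and stability of the reasoning process, and aggregates these scores into a composite process-level reward that integrates into Reinforcement Learning with Verifiable Rewards (RLVR) pipelines to refine LLM reasoning during post-training.
Link: https://arxiv.org/abs/2512.01925
Authors: Junnan Liu,Hongwei Liu,Songyang Zhang,Kai Chen
Affiliations: Shanghai AI Laboratory; Monash University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Work in progress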
Abstract:Recent advancements in large language models (LLMs) have been driven by their emergent reasoning capabilities, particularly through long chain-of-thought (CoT) prompting, which enables thorough exploration and deliberation. Despite these advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors, such as overthinking and excessively protracted reasoning chains, which can impair performance. In this paper, we analyze reasoning processes through an optimization lens, framing CoT as a gradient descent procedure where each reasoning step constitutes an update toward problem resolution. Building on this perspective, we introduce RePro (Rectifying Process-level Reward), a novel approach to refine LLM reasoning during post-training. RePro defines a surrogate objective function to assess the optimization process underlying CoT, utilizing a dual scoring mechanism to quantify its intensity and stability. These scores are aggregated into a composite process-level reward, seamlessly integrated into reinforcement learning with verifiable rewards (RLVR) pipelines to optimize LLMs. Extensive experiments across multiple reinforcement learning algorithms and diverse LLMs, evaluated on benchmarks spanning mathematics, science, and coding, demonstrate that RePro consistently enhances reasoning performance and mitigates suboptimal reasoning behaviors.
[NLP-9] Latent Debate: A Surrogate Framework for Interpreting LLM Thinking
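[Quick read]: This paper addresses the opacity of the internal reasoning process of Large Language Models (LLMs) and the difficulty of explaining what causes hallucinations. The core solution is the latent debate framework, which captures the implicit supporting and attacking signals that arise inside a model during a single inference and builds a structured surrogate model that approximates the LLM's thinking process, making predictions interpretable and providing a strong baseline for hallucination detection. The method needs no explicit debate among multiple models or multiple answers; it relies on the internal state of a single model during inference, and the analysis reveals a strong correlation between hallucination risk and the degree of latent debate in the middle layers.
Link: https://arxiv.org/abs/2512.01909
Authors: Lihu Chen,Xiang Yin,Francesca Toni
Affiliations: Imperial College London
Categories: Computation and Language (cs.CL)
Comments: Preprint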
Abstract:Understanding the internal thinking process of Large Language Models (LLMs) and the cause of hallucinations remains a key challenge. To this end, we introduce latent debate, a novel framework for interpreting model predictions through the lens of implicit internal arguments. Unlike current work on self-consistency and multi-agent debate, which relies on explicit debates among multiple answers or multiple models, latent debate captures the hidden supporting and attacking signals that arise within a single model during a single inference. We first present a model- and task-agnostic conceptual framework, and then instantiate it symbolically to approximate the thinking process of LLMs on True/False prediction tasks. Empirical studies demonstrate that latent debate is a faithful structured surrogate model that has highly consistent predictions with the original LLM. Beyond interpretability, we demonstrate that latent debate provides a strong baseline for hallucination detection. Further analysis reveals strong correlations between hallucinations and debate patterns; for example, a high degree of latent debate in the middle layers is linked to a higher risk of hallucinations. These findings position latent debate as a potential framework for understanding internal mechanisms of LLMs, especially for scenarios where internal (dis)agreements appear during the inference steps.
[NLP-10] OPOR-Bench: Evaluating Large Language Models on Online Public Opinion Report Generation
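[Quick read]: This paper addresses the lack of systematic research and evaluation standards for the automated generation of Online Public Opinion Reports, namely unclear task definitions, missing benchmark datasets, and inadequate evaluation methods. The solution has three parts: first, it formally defines the Automated Online Public Opinion Report Generation (OPOR-GEN) task; second, it constructs OPOR-BENCH, an event-centric benchmark dataset covering the news articles, social media posts, and reference summaries of 463 crisis events; third, it proposes OPOR-EVAL, an agent-based evaluation framework that assesses report quality by simulating how human experts judge reports in context, and experiments show that it correlates highly with human judgments. Together these provide a solid foundation for future research in this area.
Link: https://arxiv.org/abs/2512.01896
Authors: Jinzheng Yu,Yang Xu,Haozhen Li,Junqi Li,Yifan Feng,Ligu Zhu,Hao Shen,Lei Shi
Affiliations: Communication University of China; Harbin Institute of Technology; China Academy of Railway Sciences Corporation Limited; Santa Clara University
Categories: Computation and Language (cs.CL)
Comments: 27 pages, accepted by CMC-Computers, Materials & Continua, 2025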
Abstract:Online Public Opinion Reports consolidate news and social media for timely crisis management by governments and enterprises. While large language models have made automated report generation technically feasible, systematic research in this specific area remains notably absent, particularly lacking formal task definitions and corresponding benchmarks. To bridge this gap, we define the Automated Online Public Opinion Report Generation (OPOR-GEN) task and construct OPOR-BENCH, an event-centric dataset covering 463 crisis events with their corresponding news articles, social media posts, and a reference summary. To evaluate report quality, we propose OPOR-EVAL, a novel agent-based framework that simulates human expert evaluation by analyzing generated reports in context. Experiments with frontier models demonstrate that our framework achieves high correlation with human judgments. Our comprehensive task definition, benchmark dataset, and evaluation framework provide a solid foundation for future research in this critical domain.
[NLP-11] Exploring Human Perceptions of AI Responses: Insights from a Mixed-Methods Study on Risk Mitigation in Generative Models
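[Quick read]: This paper studies how people perceive generative AI responses in practice given the models' tendency to hallucinate and generate harmful content, focusing on how effective existing mitigation strategies are across several evaluation dimensions and how humans judge them subjectively. The key to the approach is a mixed-methods experiment with a within-subject design in which 57 participants assessed responses under two conditions, a harmful response plus its mitigation and the mitigated response alone, rating them on faithfulness, fairness, harm-removal capacity, and relevance. The study finds that participants' native language, AI work experience, and annotation familiarity significantly influenced their judgments, and that people are highly sensitive to linguistic detail (penalizing minor grammar errors) while preferring outputs that preserve semantic context, which yields new metrics and insights for training and evaluating mitigation strategies.
Link: https://arxiv.org/abs/2512.01892
Authors: Heloisa Candello,Muneeza Azmat,Uma Sushmitha Gunturi,Raya Horesh,Rogerio Abreu de Paula,Heloisa Pimentel,Marcelo Carpinette Grave,Aminat Adebiyi,Tiago Machado,Maysa Malfiza Garcia de Macedo
Affiliations: IBM Research; UNICAMP
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 16 pages, 2 figures, 6 tables. Under review for publication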
Abstract:With the rapid uptake of generative AI, investigating human perceptions of generated responses has become crucial. A major challenge is their 'aptitude' for hallucinating and generating harmful content. Despite major efforts to implement guardrails, human perceptions of these mitigation strategies are largely unknown. We conducted a mixed-methods experiment to evaluate the responses of a mitigation strategy across multiple dimensions: faithfulness, fairness, harm-removal capacity, and relevance. In a within-subject study design, 57 participants assessed the responses under two conditions: harmful response plus its mitigation and solely mitigated response. Results revealed that participants' native language, AI work experience, and annotation familiarity significantly influenced evaluations. Participants showed high sensitivity to linguistic and contextual attributes, penalizing minor grammar errors while rewarding preserved semantic contexts. This contrasts with how language is often treated in the quantitative evaluation of LLMs. We also introduced new metrics for training and evaluating mitigation strategies and insights for human-AI evaluation studies.
[NLP-12] Cross-Lingual Interleaving for Speech Language Models
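[Quick read]: This paper addresses the heavy reliance of current Spoken Language Models (SLMs) on English data and the resulting difficulty of cross-lingual learning, which stems from the scarcity of spoken evaluation benchmarks and training data in other languages. To tackle this, it proposes cross-lingual interleaving, a training method that needs no textual supervision and mixes speech tokens from different languages during training to strengthen the model's understanding of, and generalization across, multiple languages. The key result is that, under matched training-token budgets, interleaving improves monolingual semantic accuracy, enables robust cross-lingual continuation, and strengthens cross-lingual hidden-state alignment, offering a route to multilingual SLMs that understand and converse across languages.
Link: https://arxiv.org/abs/2512.01865
Authors: Adel Moumen,Guangzhi Sun,Philip C. Woodland
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: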
Abstract:Spoken Language Models (SLMs) aim to learn linguistic competence directly from speech using discrete units, widening access to Natural Language Processing (NLP) technologies for languages with limited written resources. However, progress has been largely English-centric due to scarce spoken evaluation benchmarks and training data, making cross-lingual learning difficult. We present a cross-lingual interleaving method that mixes speech tokens across languages without textual supervision. We also release an EN-FR training dataset, TinyStories (~42k hours), together with EN-FR spoken StoryCloze and TopicCloze benchmarks for cross-lingual semantic evaluation, both synthetically generated using GPT-4. On 360M and 1B SLMs under matched training-token budgets, interleaving improves monolingual semantic accuracy, enables robust cross-lingual continuation, and strengthens cross-lingual hidden-state alignment. Taken together, these results indicate that cross-lingual interleaving is a simple, scalable route to building multilingual SLMs that understand and converse across languages. All resources will be made open-source to support reproducibility.
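The abstract does not specify the granularity or schedule of the interleaving, so the sketch below shows only one plausible reading: given parallel EN and FR utterances already encoded as discrete speech-token sequences and split into aligned segments, build a training sequence that switches language at segment boundaries. The function name, the segment-level alignment, and the `switch_prob` parameter are all assumptions made for illustration.

```python
import random

def interleave(en_segments, fr_segments, switch_prob=0.5, seed=0):
    """Build one mixed-language token sequence from aligned segments.

    en_segments / fr_segments: lists of token-id lists, where segment i in
    one language corresponds to segment i in the other (assumed alignment).
    After each segment, the active language flips with probability switch_prob.
    """
    rng = random.Random(seed)
    lang = rng.choice(["en", "fr"])  # random starting language
    mixed = []
    for en, fr in zip(en_segments, fr_segments):
        mixed.extend(en if lang == "en" else fr)
        if rng.random() < switch_prob:
            lang = "fr" if lang == "en" else "en"
    return mixed
```

With `switch_prob=0.0` the output is a monolingual sequence, and with `switch_prob=1.0` the language alternates at every segment, which gives a simple knob for comparing interleaved and non-interleaved training data under the same token budget.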
[NLP-13] BHRAM-IL: A Benchmark for Hallucination Recognition and Assessment in Multiple Indian Languages AACL2025
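[Quick read]: This paper addresses hallucination in the output of Large Language Models (LLMs) in multilingual settings, focusing on hallucination recognition and assessment for under-resourced Indian languages (Hindi, Gujarati, Marathi, and Odia). The key to the solution is BHRAM-IL, a benchmark of 36,047 curated questions across nine categories spanning factual, numerical, reasoning, and linguistic tasks. The authors evaluate 14 state-of-the-art multilingual LLMs on a 10,265-question subset, analyzing hallucination across languages and models with category-specific metrics normalized to the (0,1) range; aggregating over all categories and models yields a primary score of 0.23 and a language-corrected fuzzy score of 0.385, which the authors present as evidence of the benchmark's usefulness for hallucination-focused evaluation.
Link: https://arxiv.org/abs/2512.01852
Authors: Hrishikesh Terdalkar,Kirtan Bhojani,Aryan Dongare,Omm Aditya Behera
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: Accepted at BHASHA Workshop @ IJCNLP/AACL 2025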
Abstract:Large language models (LLMs) are increasingly deployed in multilingual applications but often generate plausible yet incorrect or misleading outputs, known as hallucinations. While hallucination detection has been studied extensively in English, under-resourced Indian languages remain largely unexplored. We present BHRAM-IL, a benchmark for hallucination recognition and assessment in multiple Indian languages, covering Hindi, Gujarati, Marathi, Odia, along with English. The benchmark comprises 36,047 curated questions across nine categories spanning factual, numerical, reasoning, and linguistic tasks. We evaluate 14 state-of-the-art multilingual LLMs on a benchmark subset of 10,265 questions, analyzing cross-lingual and factual hallucinations across languages, models, scales, categories, and domains using category-specific metrics normalized to (0,1) range. Aggregation over all categories and models yields a primary score of 0.23 and a language-corrected fuzzy score of 0.385, demonstrating the usefulness of BHRAM-IL for hallucination-focused evaluation. The dataset, and the code for generation and evaluation are available on GitHub (this https URL) and HuggingFace (this https URL) to support future research in multilingual hallucination detection and mitigation.
[NLP-14] Beyond SFT: Reinforcement Learning for Safer Large Reasoning Models with Better Reasoning Ability
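[Quick read]: This paper addresses the safety alignment challenges of Large Reasoning Models (LRMs), in particular the new safety risk introduced by explicit Chain-of-Thought (CoT) reasoning: unsafe behavior can appear in intermediate reasoning trajectories even when the final answer looks harmless. Existing safety alignment methods based on Supervised Fine-Tuning (SFT) give inconsistent safety improvements, degrade reasoning ability, and generalize poorly across model families. The paper therefore investigates Reinforcement Learning (RL) as a complementary optimization framework, whose key property is that it optimizes the model policy directly from reward feedback, giving more stable and consistent safety alignment while maintaining reasoning ability. Experiments show that RL suppresses unsafe exploratory reasoning while preserving reflective depth, leading to safer and more reliable reasoning processes.
Link: https://arxiv.org/abs/2512.01848
Authors: Jinghan Jia,Nathalie Baracaldo,Sijia Liu
Affiliations: Michigan State University; IBM Research
Categories: Computation and Language (cs.CL)
Comments: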
Abstract:Large reasoning models (LRMs) extend large language models by generating explicit chain-of-thought (CoT) reasoning, significantly improving mathematical and logical problem solving. However, this explicit reasoning process also introduces new safety risks, as unsafe behaviors often emerge within intermediate reasoning trajectories, even when final answers appear harmless. Existing safety alignment approaches primarily rely on supervised fine-tuning (SFT) over safety-oriented long CoT datasets. While intuitive, we find that SFT produces inconsistent safety improvements, degrades reasoning ability, and generalizes poorly across model families. These limitations suggest that purely supervised approaches are insufficient for robust safety alignment in LRMs. To address this, we investigate reinforcement learning (RL) as a complementary optimization framework for LRM safety training. Unlike SFT, RL directly optimizes model policies with reward feedback, enabling more adaptive and stable alignment. Extensive experiments across multiple model families and benchmarks show that RL achieves stronger and more consistent safety gains while maintaining reasoning competence. Further analysis of reflection dynamics and token-level entropy reveals that RL suppresses unsafe exploratory reasoning while preserving reflective depth, leading to safer and more reliable reasoning processes.
[NLP-15] InnoGym: Benchmarking the Innovation Potential of AI Agents
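[Quick read]: This paper addresses the neglect of methodological diversity in current evaluations of AI agents: existing benchmarks focus on whether answers are correct and overlook whether the method behind a solution is innovative. This limitation makes it impossible to measure an agent's real capacity for innovation and hinders a deeper understanding of the creative potential of generative AI on complex tasks. The key to the solution is InnoGym, the first benchmark and framework for systematically evaluating the innovation potential of AI agents, which introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 standardized tasks from real-world engineering and scientific domains and comes with iGym, an execution environment for reproducible, long-horizon evaluation, supporting further study of the gap between creativity and effectiveness in AI agents.
Link: https://arxiv.org/abs/2512.01822
Authors: Jintian Zhang,Kewei Xu,Jingsheng Zheng,Zhuoyun Yu,Yuqi Zhu,Yujie Luo,Lanning Wei,Shuofei Qiao,Lun Du,Da Zheng,Shumin Deng,Huajun Chen,Ningyu Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Work in progress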
Abstract:LLMs and Agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible and long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.
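[NLP-16] H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons
[Quick Read]: This paper addresses hallucination in large language models (LLMs): outputs that look plausible but are factually wrong, which undermines reliability. Prior work mostly examines causes at the macroscopic level, such as training data and objectives, and lacks a systematic account of neuron-level mechanisms. The key contribution is a systematic identification of hallucination-associated neurons (H-Neurons): a sparse subset, under 0.1% of all neurons, reliably predicts hallucination and generalizes strongly across scenarios. Controlled interventions then show that these neurons are causally linked to over-compliance behaviors. Finally, tracing them back to the pre-trained base models shows they remain predictive there, indicating that they emerge during pre-training. The work connects macroscopic behavioral patterns with microscopic neural mechanisms and offers an interpretable, intervenable route to more reliable LLMs.
Link: https://arxiv.org/abs/2512.01797
Authors: Cheng Gao, Huimin Chen, Chaojun Xiao, Zhiyi Chen, Zhiyuan Liu, Maosong Sun
Affiliations: Tsinghua University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 20 pages, 4 figures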
Abstract:Large language models (LLMs) frequently generate hallucinations – plausible but factually incorrect outputs – undermining their reliability. While prior work has examined hallucinations from macroscopic perspectives such as training data and objectives, the underlying neuron-level mechanisms remain largely unexplored. In this paper, we conduct a systematic investigation into hallucination-associated neurons (H-Neurons) in LLMs from three perspectives: identification, behavioral impact, and origins. Regarding their identification, we demonstrate that a remarkably sparse subset of neurons (less than 0.1% of total neurons) can reliably predict hallucination occurrences, with strong generalization across diverse scenarios. In terms of behavioral impact, controlled interventions reveal that these neurons are causally linked to over-compliance behaviors. Concerning their origins, we trace these neurons back to the pre-trained base models and find that these neurons remain predictive for hallucination detection, indicating they emerge during pre-training. Our findings bridge macroscopic behavioral patterns with microscopic neural mechanisms, offering insights for developing more reliable LLMs.
[NLP-17] Reasoning About the Unsaid: Misinformation Detection with Omission-Aware Graph Inference AAAI2026
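[Quick Read]: This paper addresses an overlooked form of misinformation, omission-based deception, in which readers are steered toward false conclusions by implicitly leaving out key information rather than by explicitly fabricating content. Such text looks complete on the surface but misleads because necessary background or perspectives are missing; existing work focuses on detecting explicit falsehoods and does not model omission systematically. The key to the solution is OmiGraph, the first omission-aware framework. It first builds an omission-aware graph from a contextual environment that brings in complementary perspectives on the same event, surfacing potentially omitted content. It then applies omission-oriented relation modeling to capture internal contextual dependencies and dynamic omission intents, forming a comprehensive omission relation representation. Finally, omission-aware message passing and aggregation integrate the omitted content and its relations into a holistic perception of deception. On two large-scale benchmarks, average F1 and accuracy improve by 5.4% and 5.3%, respectively.
Link: https://arxiv.org/abs/2512.01728
Authors: Zhengjia Wang, Danding Wang, Qiang Sheng, Jiaying Wu, Juan Cao
Affiliations: Media Synthesis and Forensics Lab, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; National University of Singapore
Categories: Computation and Language (cs.CL)
Comments: AAAI 2026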
Abstract:This paper investigates the detection of misinformation, which deceives readers by explicitly fabricating misleading content or implicitly omitting important information necessary for informed judgment. While the former has been extensively studied, omission-based deception remains largely overlooked, even though it can subtly guide readers toward false conclusions under the illusion of completeness. To pioneer in this direction, this paper presents OmiGraph, the first omission-aware framework for misinformation detection. Specifically, OmiGraph constructs an omission-aware graph for the target news by utilizing a contextual environment that captures complementary perspectives of the same event, thereby surfacing potentially omitted contents. Based on this graph, omission-oriented relation modeling is then proposed to identify the internal contextual dependencies, as well as the dynamic omission intents, formulating a comprehensive omission relation representation. Finally, to extract omission patterns for detection, OmiGraph introduces omission-aware message-passing and aggregation that establishes holistic deception perception by integrating the omission contents and relations. Experiments show that, by considering the omission perspective, our approach attains remarkable performance, achieving average improvements of +5.4% F1 and +5.3% ACC on two large-scale benchmarks.
[NLP-18] Beware of Reasoning Overconfidence: Pitfalls in the Reasoning Process for Multi-solution Tasks
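[Quick Read]: This paper addresses the poor performance of large language models (LLMs) on multi-solution tasks. It attributes the bottleneck to reasoning overconfidence: the model commits to a partial set of solutions before exhausting the solution space, so its answers are incomplete and lack diversity. Using the MuSoBench benchmark, the paper finds that short chain-of-thought (Short-CoT) prompting is markedly overconfident, whereas the emerging long chain-of-thought (Long-CoT) approach mitigates the problem through iterative exploration and self-reflection. It also proposes the cognitive-rigidity hypothesis, which holds that overconfidence arises when reasoning converges prematurely on a narrow set of thought paths, and offers preliminary support through an attention-entropy analysis. The results provide tools for assessing the completeness of LLM reasoning and argue for moving evaluation from single-answer accuracy toward comprehensive exploration.
Link: https://arxiv.org/abs/2512.01725
Authors: Jiannan Guan, Qiguang Chen, Libo Qin, Dengyun Peng, Jinhao Liu, Liangyu Huo, Jian Xie, Wanxiang Che
Affiliations: Harbin Institute of Technology; Central South University; Du Xiaoman (Beijing) Science Technology Co., Ltd.
Categories: Computation and Language (cs.CL)
Comments: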
Abstract:Large Language Models (LLMs) excel in reasoning tasks requiring a single correct answer, but they perform poorly in multi-solution tasks that require generating comprehensive and diverse answers. We attribute this limitation to reasoning overconfidence: a tendency to express undue certainty in an incomplete solution set. To examine the effect, we introduce MuSoBench, a benchmark of multi-solution problems. Experiments show that the conventional short chain-of-thought (Short-CoT) prompting paradigm exhibits pronounced overconfidence, whereas the emerging long chain-of-thought (Long-CoT) approach mitigates it through iterative exploration and self-reflection. We further characterise observable behaviours and influential factors. To probe the underlying cause, we propose the cognitive-rigidity hypothesis, which posits that overconfidence arises when the reasoning process prematurely converges on a narrow set of thought paths. An attention-entropy analysis offers preliminary support for this view. These findings provide tools for assessing the completeness of LLM reasoning and highlight the need to move evaluation beyond single-answer accuracy toward comprehensive exploration.
[NLP-19] Self-Supervised Borrowing Detection on Multilingual Wordlists
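[Quick Read]: This paper addresses borrowing detection in multilingual wordlists, that is, identifying words a language has borrowed from other languages. The key to the solution is a fully self-supervised method that combines two sources of information: PMI (pointwise mutual information) similarities based on a global correspondence model, and a lightweight contrastive component trained on phonetic feature vectors. It also includes an automatic procedure for selecting decision thresholds without labeled data, so borrowings can be detected without manual annotation.
Link: https://arxiv.org/abs/2512.01713
Authors: Tim Wientzek
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 29 pages, 3 figures, 12 tables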
Abstract:This paper presents a fully self-supervised approach to borrowing detection in multilingual wordlists. The method combines two sources of information: PMI similarities based on a global correspondence model and a lightweight contrastive component trained on phonetic feature vectors. It further includes an automatic procedure for selecting decision thresholds without requiring labeled data. Experiments on benchmark datasets show that PMI alone already improves over existing string similarity measures such as NED and SCA, and that the combined similarity performs on par with or better than supervised baselines. An ablation study highlights the importance of character encoding, temperature settings and augmentation strategies. The approach scales to datasets of different sizes, works without manual supervision and is provided with a command-line tool that allows researchers to conduct their own studies.
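For readers unfamiliar with the NED baseline named in the abstract, the following is a minimal sketch of normalized edit distance between two word forms. It is not the paper's PMI-based method. Dividing by the length of the longer string is one common normalization and is an assumption here, as are the function names.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def ned(a: str, b: str) -> float:
    """Normalized edit distance in [0, 1]; 0 means identical strings.

    Assumption: normalize by the longer string's length.
    """
    if not a and not b:
        return 0.0
    return edit_distance(a, b) / max(len(a), len(b))
```

A low NED between forms from unrelated languages is the kind of surface signal a string-similarity baseline would treat as a borrowing candidate.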
[NLP-20] MMAG: Mixed Memory-Augmented Generation for Large Language Models Applications
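[Quick Read]: This paper addresses the difficulty large language models (LLMs) have in sustaining relevance, personalization, and continuity across multi-turn interactions, in contrast to human communication, which relies on several forms of memory (recalling past conversations, adapting to personal traits and situational context). The key to the solution is the Mixed Memory-Augmented Generation (MMAG) framework, which organizes memory into five interacting layers: conversational memory, long-term user memory, episodic and event-linked memory, sensory and context-aware memory, and short-term working memory. Drawing on cognitive psychology, it maps these layers to technical components and outlines strategies for coordination, prioritization, and conflict resolution, with the aim of building language agents that are more coherent, proactive, and aligned with human needs.
Link: https://arxiv.org/abs/2512.01710
Authors: Stefano Zeppieri
Affiliations: Sapienza University of Rome
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: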
Abstract:Large Language Models (LLMs) excel at generating coherent text within a single prompt but fall short in sustaining relevance, personalization, and continuity across extended interactions. Human communication, however, relies on multiple forms of memory, from recalling past conversations to adapting to personal traits and situational context. This paper introduces the Mixed Memory-Augmented Generation (MMAG) pattern, a framework that organizes memory for LLM-based agents into five interacting layers: conversational, long-term user, episodic and event-linked, sensory and context-aware, and short-term working memory. Drawing inspiration from cognitive psychology, we map these layers to technical components and outline strategies for coordination, prioritization, and conflict resolution. We demonstrate the approach through its implementation in the Heero conversational agent, where encrypted long-term bios and conversational history already improve engagement and retention. We further discuss implementation concerns around storage, retrieval, privacy, and latency, and highlight open challenges. MMAG provides a foundation for building memory-rich language agents that are more coherent, proactive, and aligned with human needs.
[NLP-21] StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos
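[Quick Read]: This paper addresses the inability of current multimodal large language models (MLLMs) to interpret and use human gaze signals in streaming video understanding. Existing benchmarks evaluate only temporal reasoning and do not test whether a model can reason proactively and model intention from real-time gaze. The authors therefore propose StreamGaze, the first benchmark for evaluating how well MLLMs use gaze for temporal and proactive reasoning in streaming video. Its core is a gaze-video QA generation pipeline that aligns egocentric videos with raw gaze trajectories through fixation extraction, region-specific visual prompting, and scanpath construction, yielding spatio-temporally grounded QA pairs that reflect human perceptual dynamics. Experiments show that state-of-the-art MLLMs fall well short of human performance on this benchmark, exposing fundamental limitations in gaze-based temporal reasoning, intention modeling, and proactive prediction.
Link: https://arxiv.org/abs/2512.01707
Authors: Daeun Lee, Subhojyoti Mukherjee, Branislav Kveton, Ryan A. Rossi, Viet Dac Lai, Seunghyun Yoon, Trung Bui, Franck Dernoncourt, Mohit Bansal
Affiliations: UNC Chapel Hill; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project page: this https URL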
Abstract:Streaming video understanding requires models not only to process temporally incoming frames, but also to anticipate user intention for realistic applications like AR glasses. While prior streaming benchmarks evaluate temporal reasoning, none measure whether MLLMs can interpret or leverage human gaze signals within a streaming setting. To fill this gap, we introduce StreamGaze, the first benchmark designed to evaluate how effectively MLLMs use gaze for temporal and proactive reasoning in streaming videos. StreamGaze introduces gaze-guided past, present, and proactive tasks that comprehensively evaluate streaming video understanding. These tasks assess whether models can use real-time gaze to follow shifting attention and infer user intentions from only past and currently observed frames. To build StreamGaze, we develop a gaze-video QA generation pipeline that aligns egocentric videos with raw gaze trajectories via fixation extraction, region-specific visual prompting, and scanpath construction. This pipeline produces spatio-temporally grounded QA pairs that closely reflect human perceptual dynamics. Across all StreamGaze tasks, we observe substantial performance gaps between state-of-the-art MLLMs and human performance, revealing fundamental limitations in gaze-based temporal reasoning, intention modeling, and proactive prediction. We further provide detailed analyses of gaze-prompting strategies, reasoning behaviors, and task-specific failure modes, offering deeper insight into why current MLLMs struggle and what capabilities future models must develop. All data and code will be publicly released to support continued research in gaze-guided streaming video understanding.
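[NLP-22] Learning the Boundary of Solvability: Aligning LLMs to Detect Unsolvable Problems
[Quick Read]: This paper addresses the inability of LLMs to recognize unsolvable problems accurately: models struggle to distinguish objective unsolvability (the problem itself contains a contradiction) from subjective capability limits (the problem exceeds the model's competence), which leads to hallucination and overconfidence. The solution has two core components. The first is the UnsolvableQA dataset of paired solvable and unsolvable instances, built by programmatic generation of logic puzzles and by a novel "Reverse Construction" method that injects contradictions into valid reasoning chains to produce unsolvable mathematics instances. The second is UnsolvableRL, a reinforcement learning framework with three reward components that jointly account for accuracy, unsolvability, and difficulty. Empirically, the approach achieves near-perfect unsolvability detection while improving accuracy on solvable tasks. It also identifies a "Capability Collapse" phenomenon, showing that explicit exposure to unsolvable data is indispensable for preventing systematic overconfidence.
Link: https://arxiv.org/abs/2512.01661
Authors: Dengyun Peng, Qiguang Chen, Bofei Liu, Jiannan Guan, Libo Qin, Zheng Yan, Jinhao Liu, Jianshu Zhang, Wanxiang Che
Affiliations: LARG, Research Center for Social Computing and Interactive Robotics, HIT (Harbin Institute of Technology); School of Computer Science and Engineering, Central South University; iFLYTEK
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: preprint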
Abstract:Ensuring LLM reliability requires not only solving complex problems but also recognizing when a problem is unsolvable. Current models often struggle to distinguish objective unsolvability (inherent contradictions in the problem) from subjective capability limitations (problems beyond the model’s competence), which leads to hallucinations and overconfidence. To address this, we propose UnsolvableQA and UnsolvableRL to solve feasible problems, detect inherent contradictions, and prudently refuse tasks beyond capability. Specifically, we construct UnsolvableQA, a dataset of paired solvable and unsolvable instances derived via a dual-track methodology: programmatic generation for logic puzzles and a novel “Reverse Construction” method that injects contradictions into valid reasoning chains for mathematics. Building on this dataset, we introduce UnsolvableRL, a reinforcement learning framework with three reward components jointly accounting for accuracy, unsolvability, and difficulty. Empirical results show that our approach achieves near-perfect unsolvability detection while also improving accuracy on solvable tasks. Crucially, we identify Capability Collapse, demonstrating that explicit exposure to unsolvable data is indispensable for preventing models from becoming systematically overconfident. Our code and data are available at this https URL.
[NLP-23] HalluGraph: Auditable Hallucination Detection for Legal RAG Systems via Knowledge Graph Alignment
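[Quick Read]: This paper addresses an accountability problem in legal AI systems built on retrieval-augmented generation (RAG): when an AI assistant cites case law, statutes, or contract clauses, practitioners need verifiable assurance that the generated text is faithful to the source documents. Existing hallucination detectors rely on semantic similarity metrics, which tolerate entity substitutions (confusing parties, dates, or legal provisions) and therefore carry serious risk. The key to the solution is HalluGraph, a graph-theoretic framework that quantifies hallucination through structural alignment. It rests on two interpretable metrics: Entity Grounding (EG), which measures whether entities in the response appear in the source documents, and Relation Preservation (RP), which verifies that asserted relations are supported by the context. The method achieves near-perfect discrimination on structured control documents (AUC = 0.979) and remains robust on challenging generative legal tasks (AUC ≈ 0.89), outperforming semantic similarity baselines and providing a transparent, traceable audit path for high-stakes legal applications.
Link: https://arxiv.org/abs/2512.01659
Authors: Valentin Noël, Elimane Yassine Seidou, Charly Ken Capo-Chichi, Ghanem Amari
Affiliations: Devoteam
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 8 pages, 4 figures, under review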
Abstract:Legal AI systems powered by retrieval-augmented generation (RAG) face a critical accountability challenge: when an AI assistant cites case law, statutes, or contractual clauses, practitioners need verifiable guarantees that generated text faithfully represents source documents. Existing hallucination detectors rely on semantic similarity metrics that tolerate entity substitutions, a dangerous failure mode when confusing parties, dates, or legal provisions can have material consequences. We introduce HalluGraph, a graph-theoretic framework that quantifies hallucinations through structural alignment between knowledge graphs extracted from context, query, and response. Our approach produces bounded, interpretable metrics decomposed into Entity Grounding (EG), measuring whether entities in the response appear in source documents, and Relation Preservation (RP), verifying that asserted relationships are supported by context. On structured control documents (400 words, 20 entities), HalluGraph achieves near-perfect discrimination (AUC = 0.979), while maintaining robust performance (AUC ≈ 0.89) on challenging generative legal tasks, consistently outperforming semantic similarity baselines. The framework provides the transparency and traceability required for high-stakes legal applications, enabling full audit trails from generated assertions back to source passages.
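The two metrics described in the abstract can be illustrated with a small set-based sketch. It assumes entities and (subject, relation, object) triples have already been extracted, and it matches them exactly after lowercasing and whitespace cleanup. Both the matching rule and the function names are assumptions for illustration; the abstract does not specify how HalluGraph extracts or aligns its graphs.

```python
def _norm(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def entity_grounding(response_entities, source_entities):
    """Fraction of response entities that also occur in the source (EG-like)."""
    resp = {_norm(e) for e in response_entities}
    src = {_norm(e) for e in source_entities}
    if not resp:
        return 1.0  # assumption: a response with no entities is vacuously grounded
    return len(resp & src) / len(resp)


def relation_preservation(response_triples, context_triples):
    """Fraction of response triples that are present in the context (RP-like)."""
    resp = {tuple(_norm(x) for x in t) for t in response_triples}
    ctx = {tuple(_norm(x) for x in t) for t in context_triples}
    if not resp:
        return 1.0  # assumption: no asserted relations means nothing to contradict
    return len(resp & ctx) / len(resp)
```

Both scores are bounded in [0, 1], and a substituted entity such as "Section 14" for "Section 12" lowers the score directly, which is the failure mode that embedding similarity tends to miss.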
[NLP-24] MAC-SLU: Multi-Intent Automotive Cabin Spoken Language Understanding Benchmark
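[Quick Read]: This paper addresses two gaps in spoken language understanding (SLU) for task-oriented dialog systems: existing datasets lack diversity and complexity, and there is no unified benchmark for the latest large language models (LLMs) and large audio language models (LALMs). The key to the solution is MAC-SLU, a new multi-intent automotive cabin SLU dataset that raises task difficulty with authentic, complex multi-intent data. On top of it, the authors benchmark leading open-source LLMs and LALMs across in-context learning, supervised fine-tuning (SFT), and end-to-end (E2E) and pipeline paradigms. The experiments show that SFT outperforms in-context learning by a wide margin, and that E2E LALMs match pipeline approaches while avoiding error propagation from speech recognition.
Link: https://arxiv.org/abs/2512.01603
Authors: Yuezhang Peng, Chonghao Cai, Ziang Liu, Shuai Fan, Sheng Jiang, Hua Xu, Yuxin Liu, Qiguang Chen, Kele Xu, Yao Li, Sheng Wang, Libo Qin, Xie Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: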
Abstract:Spoken Language Understanding (SLU), which aims to extract user semantics to execute downstream tasks, is a crucial component of task-oriented dialog systems. Existing SLU datasets generally lack sufficient diversity and complexity, and there is an absence of a unified benchmark for the latest Large Language Models (LLMs) and Large Audio Language Models (LALMs). This work introduces MAC-SLU, a novel Multi-Intent Automotive Cabin Spoken Language Understanding Dataset, which increases the difficulty of the SLU task by incorporating authentic and complex multi-intent data. Based on MAC-SLU, we conducted a comprehensive benchmark of leading open-source LLMs and LALMs, covering methods like in-context learning, supervised fine-tuning (SFT), and end-to-end (E2E) and pipeline paradigms. Our experiments show that while LLMs and LALMs have the potential to complete SLU tasks through in-context learning, their performance still lags significantly behind SFT. Meanwhile, E2E LALMs demonstrate performance comparable to pipeline approaches and effectively avoid error propagation from speech recognition. Code (this https URL_SLU) and datasets (this http URL_SLU) are released publicly.
[NLP-25] Language Diversity: Evaluating Language Usage and AI Performance on African Languages in Digital Spaces
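[Quick Read]: This paper examines the under-representation of African languages in digital spaces, and in particular how current language detection tools perform on Yoruba, Kinyarwanda, and Amharic. The core challenge is that authentic conversational data for these languages is scarce on social platforms and heavily code-switched with English, which makes it hard to use for training high-quality language models. The key finding is that professionally edited news content provides a large supply of clean, monolingual data: detection models such as AfroLID and a general-purpose LLM reach near-perfect accuracy on it but struggle on code-switched Reddit posts, and news content also prompts more local-language engagement on publishers' social media pages. The study therefore recommends prioritizing curated news text as training data for African-language AI models and calls for future language detection systems that handle both clean and code-switched text.
Link: https://arxiv.org/abs/2512.01557
Authors: Edward Ajayi, Eudoxie Umwari, Mawuli Deku, Prosper Singadi, Jules Udahemuka, Bekalu Tadele, Chukuemeka Edeh
Affiliations: Carnegie Mellon University Africa; Bahir Dar Institute of Technology; Federal University Otuoke
Categories: Computation and Language (cs.CL)
Comments: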
Abstract:This study examines the digital representation of African languages and the challenges this presents for current language detection tools. We evaluate their performance on Yoruba, Kinyarwanda, and Amharic. While these languages are spoken by millions, their online usage on conversational platforms is often sparse, heavily influenced by English, and not representative of the authentic, monolingual conversations prevalent among native speakers. This lack of readily available authentic data online makes conversational data for training language models scarce. To investigate this, data was collected from subreddits and local news sources for each language. The analysis showed a stark contrast between the two sources. Reddit data was minimal and characterized by heavy code-switching. Conversely, local news media offered a robust source of clean, monolingual language data, which also prompted more user engagement in the local language on the news publishers' social media pages. Language detection models, including the specialized AfroLID and a general LLM, performed with near-perfect accuracy on the clean news data but struggled with the code-switched Reddit posts. The study concludes that professionally curated news content is a more reliable and effective source for training context-rich AI models for African languages than data from conversational platforms. It also highlights the need for future models that can process clean and code-switched text to improve the detection accuracy for African languages.
[NLP-26] LEC: Linear Expectation Constraints for False-Discovery Control in Selective Prediction and Routing Systems
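[Quick Read]: This paper addresses the unreliability of answers generated by large language models (LLMs): heuristic uncertainty estimates cannot fully separate correct from incorrect predictions, so users may accept wrong answers without any statistical guarantee. The solution works within a false discovery rate (FDR) control framework and proposes LEC (Linear Expectation Constraint), which recasts selective prediction as a decision problem subject to a linear expectation constraint over selection and error indicators. Using only a held-out set of exchangeable calibration samples, it derives a finite-sample sufficient condition for computing a threshold that satisfies the FDR constraint while maximizing coverage. The method thereby keeps the error proportion among accepted predictions below a preset risk level while retaining substantially more samples, and it extends to a two-model routing mechanism that achieves lower risk and accepts more correct samples under a unified FDR guarantee.
Link: https://arxiv.org/abs/2512.01556
Authors: Zhiyuan Wang, Aniri, Tianlong Chen, Yue Zhang, Heng Tao Shen, Xiaoshuang Shi, Kaidi Xu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: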
Abstract:Large language models (LLMs) often generate unreliable answers, while heuristic uncertainty methods fail to fully distinguish correct from incorrect predictions, causing users to accept erroneous answers without statistical guarantees. We address this issue through the lens of false discovery rate (FDR) control, ensuring that among all accepted predictions, the proportion of errors does not exceed a target risk level. To achieve this in a principled way, we propose LEC, which reinterprets selective prediction as a constrained decision problem by enforcing a Linear Expectation Constraint over selection and error indicators. Then, we establish a finite-sample sufficient condition, which relies only on a held-out set of exchangeable calibration samples, to compute an FDR-constrained, coverage-maximizing threshold. Furthermore, we extend LEC to a two-model routing mechanism: given a prompt, if the current model’s uncertainty exceeds its calibrated threshold, we delegate it to a stronger model, while maintaining a unified FDR guarantee. Evaluations on closed-ended and open-ended question-answering (QA) datasets show that LEC achieves tighter FDR control and substantially improves sample retention over prior methods. Moreover, the two-model routing mechanism achieves lower risk levels while accepting more correct samples than each individual model.
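To make the idea of an FDR-constrained, coverage-maximizing threshold concrete, here is a minimal plug-in sketch on a calibration set. It accepts a sample when its uncertainty is at most t and looks for the largest t at which the empirical linear constraint, errors minus alpha times accepted, is non-positive. The `slack` argument is a hypothetical stand-in for a finite-sample correction; the paper's actual sufficient condition is not given in the abstract and is not reproduced here.

```python
def select_threshold(uncertainties, errors, alpha, slack=0.0):
    """Largest uncertainty threshold whose accepted set meets the constraint.

    uncertainties: calibration uncertainty scores (lower means more confident)
    errors:        1 if the calibration prediction was wrong, else 0
    alpha:         target false discovery rate among accepted predictions
    slack:         hypothetical conservative margin (assumption, default 0)
    Returns None when no threshold satisfies the constraint.
    """
    pairs = sorted(zip(uncertainties, errors))
    best = None
    n_accepted = 0
    n_errors = 0
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        # Accept every calibration sample tied at this uncertainty value.
        while i < len(pairs) and pairs[i][0] == t:
            n_accepted += 1
            n_errors += pairs[i][1]
            i += 1
        # Empirical linear constraint: sum over accepted of (err - alpha) <= 0.
        if n_errors - alpha * n_accepted + slack <= 0:
            best = t
    return best
```

Scanning all candidate thresholds rather than stopping at the first violation matters because the constraint is not monotone in t: a run of correct predictions at higher uncertainty can bring the accepted set back under the target.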
[NLP-27] MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages
[Quick Read]: This paper addresses two key problems multimodal large language models (MLLMs) face in speech-to-text translation (S2TT): limited language coverage and slow inference. Existing work relies largely on English-centric datasets, which limits many-to-many translation, and inference slows sharply when speech is converted into long sequences (for example, 750 tokens). The solution is MCAT (Multilingual Cost-effective Accelerated Speech-to-Text Translator), which has two innovations. First, a language scaling method based on curriculum learning and data balancing extends coverage to 70 languages with mutual translation among them. Second, an optimized speech adapter module compresses the speech sequence to only 30 tokens, improving batch inference efficiency. The approach is validated on 9B and 27B MLLMs and surpasses state-of-the-art end-to-end models with only about 10 hours of S2TT data per language and about 100M trainable parameters.
Link: https://arxiv.org/abs/2512.01512
Authors: Yexing Du, Kaiyuan Liu, Youcheng Pan, Bo Yang, Keqi Deng, Xie Chen, Yang Xiang, Ming Liu, Bin Qin, YaoWei Wang
Affiliations: Harbin Institute of Technology, Shenzhen, China; Pengcheng Laboratory, Shenzhen, China; Harbin Institute of Technology, Harbin, China; University of Cambridge, CB2 1TN Cambridge, U.K; Shanghai Jiao Tong University, Shanghai, China
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Multimodal Large Language Models (MLLMs) have achieved great success in Speech-to-Text Translation (S2TT) tasks. However, current research is constrained by two key challenges: language coverage and efficiency. Most of the popular S2TT datasets are substantially English-centric, which restricts the scaling-up of MLLMs’ many-to-many translation capabilities. Moreover, the inference speed of MLLMs degrades dramatically when the speech is converted into long sequences (e.g., 750 tokens). To address these limitations, we propose a Multilingual Cost-effective Accelerated Speech-to-Text Translator (MCAT) framework, which includes two innovations. First, a language scaling method that leverages curriculum learning and a data balancing strategy is introduced to extend the language coverage supported by MLLMs to 70 languages and achieve mutual translation among these languages. Second, an optimized speech adapter module is designed to reduce the length of the speech sequence to only 30 tokens. Extensive experiments were conducted on MLLMs of different scales (9B and 27B). The experimental results demonstrate that MCAT not only surpasses state-of-the-art end-to-end models on the FLEURS dataset across 70x69 directions but also enhances batch inference efficiency. This is achieved with only ~100M trainable parameters and by using only 10 hours of S2TT data per language. Furthermore, we have released MCAT as open-source to promote the development of MLLMs for robust S2TT capabilities. The code and models are released at this https URL.
[NLP-28] Enhancing BERT Fine-Tuning for Sentiment Analysis in Lower-Resourced Languages
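[Quick Read]: This paper addresses the weaker performance of language models (LMs) for low-resource languages, which stems from limited training data. Because pre-training is compute-intensive, the authors focus on improving the fine-tuning stage. The key to the solution is to augment Active Learning (AL) with structured data selection strategies, termed "Active Learning schedulers", and to combine them with data clustering in an integrated fine-tuning pipeline. Experiments show that the method can cut annotation needs by up to 30% while improving F1 by up to four points and making fine-tuning more stable.
Link: https://arxiv.org/abs/2512.01460
Authors: Jozef Kubík, Marek Šuppa, Martin Takáč
Affiliations: Comenius University in Bratislava, Slovakia
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: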
Abstract:Limited data for low-resource languages typically yield weaker language models (LMs). Since pre-training is compute-intensive, it is more pragmatic to target improvements during fine-tuning. In this work, we examine the use of Active Learning (AL) methods augmented by structured data selection strategies, which we term ‘Active Learning schedulers’, to boost the fine-tuning process with a limited amount of training data. We connect the AL to data clustering and propose an integrated fine-tuning pipeline that systematically combines AL, clustering, and dynamic data selection schedulers to enhance the model’s performance. Experiments in the Slovak, Maltese, Icelandic and Turkish languages show that the use of clustering during the fine-tuning phase together with AL scheduling can simultaneously produce annotation savings up to 30% and performance improvements up to four F1 score points, while also providing better fine-tuning stability.
[NLP-29] ZIP-RC: Zero-overhead Inference-time Prediction of Reward and Cost for Adaptive and Interpretable Generation
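[Quick Read]: This paper addresses the lack of real-time introspection in large language models (LLMs) during inference: a model cannot anticipate its own probability of success or the computation needed to finish a task, so it cannot adapt its inference strategy to its current state, which hurts efficiency and trust. The key to the solution is ZIP-RC, which reuses reserved or unused logits in the same forward pass as next-token prediction to output, at zero overhead, a joint distribution over final reward and remaining length. From this distribution it computes a sampling utility and uses it to choose meta-actions that adaptively decide which generation paths to continue or start, improving accuracy while controlling compute cost and latency.
Link: https://arxiv.org/abs/2512.01457
Authors: Rohin Manvi, Joey Hong, Tim Seyde, Maxime Labonne, Mathias Lechner, Sergey Levine
Affiliations: UC Berkeley; MIT CSAIL; Liquid AI; Stanford University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Code coming soon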
Abstract:Large language models excel at reasoning but lack key aspects of introspection, including anticipating their own success and the computation required to achieve it. Humans use real-time introspection to decide how much effort to invest, when to make multiple attempts, when to stop, and when to signal success or failure. Without this, LLMs struggle to make intelligent meta-cognition decisions. Test-time scaling methods like Best-of-N drive up cost and latency by using a fixed budget of samples regardless of the marginal benefit of each one at any point in generation, and the absence of confidence signals can mislead people, prevent appropriate escalation to better tools, and undermine trustworthiness. Learned verifiers or reward models can provide confidence estimates, but do not enable adaptive inference and add substantial cost by requiring extra models or forward passes. We present ZIP-RC, an adaptive inference method that equips models with zero-overhead inference-time predictions of reward and cost. At every token, ZIP-RC reuses reserved or unused logits in the same forward pass as next-token prediction to output a joint distribution over final reward and remaining length – no extra models, architecture change, or inference overhead. This full joint distribution is used to compute a sampling utility, which is the linear combination of the expected maximum reward, total compute, and latency of a set of samples if generated to completion. During inference, we maximize this utility with meta-actions that determine which prefix of tokens to continue or initiate sampling from. On mixed-difficulty mathematical benchmarks, ZIP-RC improves accuracy by up to 12% over majority voting at equal or lower average cost, and traces smooth Pareto frontiers between quality, compute, and latency. By providing real-time reward-cost introspection, ZIP-RC enables adaptive, efficient reasoning.
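The sampling utility described in the abstract combines expected maximum reward, total compute, and latency linearly. The sketch below shows one simplified way such a utility could be evaluated for a candidate set of samples. It assumes binary rewards, independence between samples, and point estimates of remaining length, and it approximates latency by the longest expected length because samples run in parallel. None of these simplifications, nor the names and weights, come from the paper, which works with the full joint distribution.

```python
import math


def sampling_utility(success_probs, expected_lengths,
                     compute_weight, latency_weight):
    """Utility of generating a set of samples to completion (simplified).

    success_probs:    predicted P(reward = 1) for each sample
    expected_lengths: predicted remaining tokens for each sample
    """
    # With independent binary rewards, the max is 1 unless every sample fails.
    expected_max_reward = 1.0 - math.prod(1.0 - p for p in success_probs)
    total_compute = sum(expected_lengths)              # tokens generated overall
    latency = max(expected_lengths, default=0.0)       # slowest parallel sample
    return (expected_max_reward
            - compute_weight * total_compute
            - latency_weight * latency)
```

Comparing this value across candidate sets, for example continuing one sample versus starting a second, is the kind of meta-action choice the abstract describes: an extra sample is only worth it when its gain in expected maximum reward outweighs its added compute and latency.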
[NLP-30] MEGConformer: Conformer-Based MEG Decoder for Robust Speech and Phoneme Classification NEURIPS2025
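[Quick Read]: This paper addresses modeling of magnetoencephalography (MEG) signals for two foundational tasks: Speech Detection and Phoneme Classification. To handle raw 306-channel MEG input, the authors adapt a compact Conformer decoder with a lightweight convolutional projection layer and task-specific heads. Key elements include a MEG-oriented SpecAugment for Speech Detection; inverse-square-root class weighting and a dynamic grouping loader to handle 100-sample averaged examples for Phoneme Classification; and instance-level normalization, which proved critical for mitigating distribution shift on the holdout split. Using the official Standard track splits and F1-macro for model selection, the best systems reached 88.9% (Speech) and 65.8% (Phoneme) on the leaderboard, surpassing the competition baselines and ranking in the top 10 for both tasks.
Link: https://arxiv.org/abs/2512.01443
Authors: Xabier de Zuazo, Ibon Saratxaga, Eva Navas
Affiliations: HiTZ Center, Dept. of Communications Engineering, School of Engineering, University of the Basque Country (UPV/EHU), Bilbao, Spain
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Sound (cs.SD)
Comments: 10 pages, 5 figures, 4 tables, LibriBrain Workshop, NeurIPS 2025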
Abstract:We present Conformer-based decoders for the LibriBrain 2025 PNPL competition, targeting two foundational MEG tasks: Speech Detection and Phoneme Classification. Our approach adapts a compact Conformer to raw 306-channel MEG signals, with a lightweight convolutional projection layer and task-specific heads. For Speech Detection, a MEG-oriented SpecAugment provided a first exploration of MEG-specific augmentation. For Phoneme Classification, we used inverse-square-root class weighting and a dynamic grouping loader to handle 100-sample averaged examples. In addition, a simple instance-level normalization proved critical to mitigate distribution shifts on the holdout split. Using the official Standard track splits and F1-macro for model selection, our best systems achieved 88.9% (Speech) and 65.8% (Phoneme) on the leaderboard, surpassing the competition baselines and ranking within the top-10 in both tasks. For further implementation details, the technical documentation, source code, and checkpoints are available at this https URL.
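The inverse-square-root class weighting mentioned for Phoneme Classification can be sketched as follows. This is a minimal illustration, not the authors' code: the function name is hypothetical, and rescaling the weights to a mean of 1 is an assumption added so the overall loss scale stays comparable.

```python
import math
from collections import Counter


def inverse_sqrt_class_weights(labels):
    """Weight each class by 1/sqrt(count), then rescale so the mean weight is 1."""
    counts = Counter(labels)
    raw = {cls: 1.0 / math.sqrt(n) for cls, n in counts.items()}
    mean_raw = sum(raw.values()) / len(raw)  # assumed normalization, see lead-in
    return {cls: w / mean_raw for cls, w in raw.items()}


# A class that is 25x rarer receives a weight 5x larger, not 25x larger.
weights = inverse_sqrt_class_weights(["a"] * 100 + ["b"] * 4)
```

Compared with plain inverse-frequency weighting, the square root dampens the boost given to very rare classes, which tends to make training less sensitive to a handful of rare examples.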
[NLP-31] Multilingual Conversational AI for Financial Assistance: Bridging Language Barriers in Indian FinTech
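[Quick read]: This paper addresses the language barrier facing fintech platforms in multilingual India: because only about 10% of the population understands English, most users struggle to access digital financial services through English-only interfaces, which limits financial inclusion. The key to the solution is a multilingual conversational AI system that supports code-mixed languages such as Hinglish. It uses a multi-agent architecture for language classification, function management, and multilingual response generation, which significantly improves user engagement while keeping latency overhead low (4-8%) and offers a scalable language-adaptation approach for digital financial services in emerging markets.
Link: https://arxiv.org/abs/2512.01439
Authors: Bharatdeep Hazarika,Arya Suneesh,Prasanna Devadiga,Pawan Kumar Rajpoot,Anshuman B Suresh,Ahmed Ifthaquar Hussain
Affiliations: TIFIN India
Categories: Computation and Language (cs.CL)
Comments: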
Abstract:India’s linguistic diversity presents both opportunities and challenges for fintech platforms. While the country has 31 major languages and over 100 minor ones, only 10% of the population understands English, creating barriers to financial inclusion. We present a multilingual conversational AI system for a financial assistance use case that supports code-mixed languages like Hinglish, enabling natural interactions for India’s diverse user base. Our system employs a multi-agent architecture with language classification, function management, and multilingual response generation. Through comparative analysis of multiple language models and real-world deployment, we demonstrate significant improvements in user engagement while maintaining low latency overhead (4-8%). This work contributes to bridging the language gap in digital financial services for emerging markets.
[NLP-32] PromptBridge: Cross-Model Prompt Transfer for Large Language Models
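[Quick read]: This paper tackles Model Drifting: when a prompt is transferred between large language models (LLMs), differences between the models cause the original prompt to perform substantially worse. The problem is acute in practice because systems frequently switch models for reasons of capability, cost, or privacy, and manually re-optimizing prompts is expensive. The key to the solution is PromptBridge, which calibrates on a small set of alignment tasks, uses Model-Adaptive Reflective Prompt Evolution (MAP-RPE) to obtain optimal prompt pairs for the source and target models, and then learns a cross-model prompt mapping. At test time the mapping converts a source-model prompt directly into an optimized prompt for the target model, without retraining or re-tuning for each new task or model, enabling efficient, low-cost cross-model prompt transfer.
Link: https://arxiv.org/abs/2512.01420
Authors: Yaxuan Wang,Quan Liu,Zhenting Wang,Zichao Li,Wei Wei,Yang Liu,Yujia Bao
Affiliations: University of California, Santa Cruz; Center for Advanced AI, Accenture
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: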
Abstract:Large language models (LLMs) underpin applications in code generation, mathematical reasoning, and agent-based workflows. In practice, systems access LLMs via commercial APIs or open-source deployments, and the model landscape (e.g., GPT, Claude, Llama) evolves rapidly. This rapid evolution forces frequent model switches driven by capability, cost, deployment constraints, and privacy. Yet prompts are highly model-sensitive: reusing a prompt engineered for one model on another often yields substantially worse performance than a prompt optimized for the target model. We term this phenomenon Model Drifting. Through extensive empirical analysis across diverse LLM configurations, we show that model drifting is both common and severe. To address this challenge, we introduce PromptBridge, a training-free framework that preserves prompt effectiveness under model switches, enabling cross-model prompt transfer without costly per-task or per-model re-optimization. PromptBridge requires only a small set of alignment tasks for calibration. It first applies Model-Adaptive Reflective Prompt Evolution (MAP-RPE) to obtain task- and model-specific optimal prompts via iterative reflective refinement and quantitative evaluation. Using the resulting calibrated prompt pairs for the source and target models, PromptBridge learns a cross-model prompt mapping. At test time, i.e., for an unseen task, given a source-model prompt, this mapping directly produces an optimized prompt for the target model. Experiments in single-agent and multi-agent settings show that PromptBridge consistently improves downstream accuracy while reducing migration effort. The code will be available soon.
[NLP-33] DyFuLM: An Advanced Multimodal Framework for Sentiment Analysis
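[Quick read]: This paper addresses sentiment understanding in complex textual expressions, a core challenge in affective computing. The authors propose the Dynamic Fusion Learning Model (DyFuLM), built around two modules: a Hierarchical Dynamic Fusion module that adaptively integrates multi-level features to capture semantic structure, and a Gated Feature Aggregation module that regulates cross-layer information flow for balanced representation learning. In the reported experiments, DyFuLM reaches 82.64% accuracy on coarse-grained and 68.48% on fine-grained sentiment classification, with the lowest regression errors (MAE = 0.0674, MSE = 0.0082) and the highest coefficient of determination (R^2 = 0.6903) among the compared methods. Ablation studies indicate that each module contributes to feature interaction and task balance.
Link: https://arxiv.org/abs/2512.01410
Authors: Ruohan Zhou,Jiachen Yuan,Churui Yang,Wenzheng Huang,Guoyan Zhang,Shiyao Wei,Jiazhen Hu,Ning Xin,Md Maruf Hasan
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 6 figures, preprint. Under review for a suitable AI conference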
Abstract:Understanding sentiment in complex textual expressions remains a fundamental challenge in affective computing. To address this, we propose a Dynamic Fusion Learning Model (DyFuLM), a multimodal framework designed to capture both hierarchical semantic representations and fine-grained emotional nuances. DyFuLM introduces two key modules: a Hierarchical Dynamic Fusion module that adaptively integrates multi-level features, and a Gated Feature Aggregation module that regulates cross-layer information flow to achieve balanced representation learning. Comprehensive experiments on multi-task sentiment datasets demonstrate that DyFuLM achieves 82.64% coarse-grained and 68.48% fine-grained accuracy, yielding the lowest regression errors (MAE = 0.0674, MSE = 0.0082) and the highest coefficient of determination (R^2 = 0.6903). Furthermore, the ablation study validates the effectiveness of each module in DyFuLM. When all modules are removed, the accuracy drops by 0.91% for coarse-grained and 0.68% for fine-grained tasks. Keeping only the gated fusion module causes decreases of 0.75% and 0.55%, while removing the dynamic loss mechanism results in drops of 0.78% and 0.26% for coarse-grained and fine-grained sentiment classification, respectively. These results demonstrate that each module contributes significantly to feature interaction and task balance. Overall, the experimental findings further validate that DyFuLM enhances sentiment representation and overall performance through effective hierarchical feature fusion.
[NLP-34] BackportBench: A Multilingual Benchmark for Automated Backporting of Patches
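[Quick read]: This paper addresses the difficulty of automatically porting security patches in software dependencies to older versions, the patch backporting problem. Developers often keep using vulnerable older package versions because upgrading is hard, which creates security risk; manual backporting is slow and error-prone, while existing automated methods mostly target code-hunk or function-level porting and are evaluated with imperfect metrics, so their effectiveness is unclear. The key contribution is BackportBench, the first comprehensive benchmark suite for patch backporting. It covers 202 multilingual backporting problems from PyPI, Maven, and npm, each with an executable Docker environment and test cases, providing a standardized evaluation platform for automated backporting techniques. Experiments show that the agentic method outperforms traditional approaches on cases requiring logical and structural changes, but performance varies considerably across programming languages.
Link: https://arxiv.org/abs/2512.01396
Authors: Zhiqing Zhong,Jiaming Huang,Pinjia He
Affiliations: The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen)
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: Under review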
Abstract:Many modern software projects evolve rapidly to incorporate new features and security patches. It is important for users to update their dependencies to safer versions, but many still use older, vulnerable package versions because upgrading can be difficult and may break their existing codebase. Software developers can mitigate this problem by backporting security patches to older releases. However, manually backporting is time-consuming and error-prone. The effectiveness of existing automated backporting techniques on general software remains unclear since they typically target only code-hunk or function-level patch porting scenarios and are evaluated with imperfect metrics. To facilitate the development and evaluation of automated backporting techniques, we introduce BackportBench, the first comprehensive benchmark suite for the patch backporting problem. BackportBench is a multilingual benchmark that contains 202 patch backporting problems from PyPI, Maven, and npm, each with executable Docker environments and relevant test cases. We evaluated existing patch porting methods and LLM-based techniques that have the potential to adapt to this task using BackportBench. The results show that the agentic method has outperformed traditional patch porting methods, especially on cases that require logical and structural changes. However, the performance varies across different programming languages. Based on the findings, we draw several implications for researchers and software practitioners in future work on automated backporting.
[NLP-35] Stabilizing Reinforcement Learning with LLMs: Formulation and Practices
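[Quick read]: This paper addresses how to stably optimize a sequence-level reward when applying reinforcement learning (RL) to large language models (LLMs). The core difficulty is that directly optimizing the sequence-level reward is impractical, so policy gradient methods such as REINFORCE rely on a surrogate token-level objective. Through a first-order approximation, the paper shows that this surrogate is valid only when both the training-inference discrepancy and policy staleness are minimized. The resulting recipe is: for on-policy training, importance sampling correction markedly improves stability; when off-policy updates are introduced to accelerate convergence, clipping must be combined with Routing Replay (for Mixture-of-Experts models) to mitigate the instability caused by policy staleness. Experiments support this analysis and show that, once training is stabilized, prolonged optimization yields consistent final performance.
Link: https://arxiv.org/abs/2512.01374
Authors: Chujie Zheng,Kai Dang,Bowen Yu,Mingze Li,Huiqiang Jiang,Junrong Lin,Yuqiong Liu,An Yang,Jingren Zhou,Junyang Lin
Affiliations: Alibaba Inc
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: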
Abstract:This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.
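To make the token-level surrogate with importance sampling correction and clipping more concrete, here is a minimal sketch of a generic PPO-style clipped objective. It is not the paper's exact recipe: the function name, the per-token averaging, the use of a single sequence-level advantage, and the `clip_eps` value are all assumptions, and Routing Replay is not modeled.

```python
import math


def token_level_surrogate(logp_train, logp_rollout, advantage, clip_eps=0.2):
    """Mean clipped surrogate over the tokens of one sampled response.

    logp_train:   per-token log-probs under the policy being updated
    logp_rollout: per-token log-probs under the policy that generated the sample
    advantage:    one sequence-level advantage shared by every token
    """
    total = 0.0
    for lp_train, lp_rollout in zip(logp_train, logp_rollout):
        # Importance ratio corrects for the gap between the two policies.
        ratio = math.exp(lp_train - lp_rollout)
        # Clipping bounds how far a single update can move the policy.
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * advantage, clipped * advantage)
    return total / len(logp_train)
```

When the two sets of log-probabilities agree exactly (no training-inference discrepancy and no staleness), every ratio is 1 and the surrogate reduces to the advantage itself, which matches the intuition that the approximation is best when both gaps are small.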
[NLP-36] MARSAD: A Multi-Functional Tool for Real-Time Social Media Analysis
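[Quick read]: This paper addresses the difficulty of monitoring and analyzing Arabic social media content in real time, in particular the lack of tools for researchers and non-technical users who work with multi-dimensional public-opinion data. The solution is MARSAD, a multifunctional natural language processing (NLP) platform. It supports visual analysis of both live and archived data across sentiment analysis, emotion analysis, propaganda detection, fact-checking, and hate speech detection; provides secure data scraping through API keys; combines flexible document storage with structured data management in the backend to process large, multimodal datasets efficiently; and offers a user-friendly frontend.
Link: https://arxiv.org/abs/2512.01369
Authors: Md. Rafiul Biswas,Firoj Alam,Wajdi Zaghouani
Affiliations: Hamad bin Khalifa University; Qatar Computing Research Institute; Northwestern University in Qatar
Categories: Computation and Language (cs.CL)
Comments: 6 pages, 4 figures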
Abstract:MARSAD is a multifunctional natural language processing (NLP) platform designed for real-time social media monitoring and analysis, with a particular focus on the Arabic-speaking world. It enables researchers and non-technical users alike to examine both live and archived social media content, producing detailed visualizations and reports across various dimensions, including sentiment analysis, emotion analysis, propaganda detection, fact-checking, and hate speech detection. The platform also provides secure data-scraping capabilities through API keys for accessing public social media data. MARSAD’s backend architecture integrates flexible document storage with structured data management, ensuring efficient processing of large and multimodal datasets. Its user-friendly frontend supports seamless data upload and interaction.
[NLP-37] The Necessity of Imperfection: Reversing Model Collapse via Simulating Cognitive Boundedness
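[Quick read]: This paper addresses model collapse in generative AI caused by overly smooth training data: the prevailing synthetic-data paradigm optimizes for statistical smoothness and systematically removes the long-tail, cognitively grounded irregularities of human text. The key to the solution is the Prompt-driven Cognitive Computing Framework (PMCSF), which restructures data generation around two modules: a Cognitive State Decoder (CSD) that reverse-engineers unstructured text into structured cognitive vectors, and a Cognitive Text Encoder (CTE) that regenerates text enriched with human-typical imperfections from those states using mathematically defined Cognitive Perturbation Operators. The approach simulates human cognitive processes rather than imitating surface properties of data. The reported results indicate that the generated data is closer to human text in cognitive consistency and functional effectiveness than standard LLM output, and that in financial-market stress tests it reduces maximum drawdown and delivers defensive alpha.
Link: https://arxiv.org/abs/2512.01354
Authors: Zhongjie Jiang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG); Trading and Market Microstructure (q-fin.TR)
Comments: 38 pages, 5 figures, 30 tables. This paper proposes the Prompt-driven Cognitive Computing Framework (PMCSF) and validates it with A-share market stress tests (N=23 for 2015 crash, N=13 for 2024 bull market). Includes detailed appendices on cognitive vector definitions, perturbation operators, and financial backtest data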
Abstract:Although synthetic data is widely promoted as a remedy, its prevailing production paradigm – one optimizing for statistical smoothness – systematically removes the long-tail, cognitively grounded irregularities that characterize human text. Prolonged training on such statistically optimal but cognitively impoverished data accelerates model collapse. This paper proposes a paradigm shift: instead of imitating the surface properties of data, we simulate the cognitive processes that generate human text. We introduce the Prompt-driven Cognitive Computing Framework (PMCSF), whose core consists of a Cognitive State Decoder (CSD) that reverse-engineers unstructured text into structured cognitive vectors, and a Cognitive Text Encoder (CTE) that re-materializes these states into text enriched with human-typical imperfections via mathematically defined Cognitive Perturbation Operators. The framework is validated through a two-stage objective evaluation pipeline. First, in cognitive codec verification, CTE text yields a Jensen-Shannon divergence of 0.0614 from human text (vs. 0.4431 for standard LLM output), passes double-blind professional media review, and achieves an intraclass correlation coefficient ICC 0.9 for cognitive profile alignment across heterogeneous models. Second, in functional gain evaluation, isomorphic stress tests in the A-share market show that strategies incorporating CTE-generated data reduce maximum drawdown by 47.4% during the 2015 crash and deliver 8.6% Defensive Alpha, exceeding transaction costs by a factor of 33. Our findings demonstrate that modelling human cognitive limitations – not copying surface data – enables synthetic data with genuine functional gain, offering a viable technical pathway toward resolving the AI data-collapse crisis.
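For readers unfamiliar with the Jensen-Shannon divergence quoted in the abstract, the following minimal sketch computes it for two discrete distributions. The abstract does not state which features the distributions are built over or which logarithm base is used; base 2 (which bounds the value between 0 and 1) and the function names are assumptions made here.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


identical = js_divergence([0.5, 0.5], [0.5, 0.5])  # no difference at all
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])   # no overlap at all
```

With base-2 logarithms, identical distributions give 0 and fully disjoint ones give 1, so a value near 0 indicates two distributions that are close under this measure.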
[NLP-38] EmoRAG: Evaluating RAG Robustness to Symbolic Perturbations KDD
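[Quick read]: This paper addresses the fragility of Retrieval-Augmented Generation (RAG) systems under subtle symbolic perturbations, in particular the severely misleading retrieval caused by near-imperceptible emoticon tokens such as "(@_@)", which the authors term EmoRAG. The central finding is that injecting a single emoticon into a query makes retrieval of semantically unrelated texts almost 100% likely, that the effect is sensitive to the emoticon's position in the query, and that it grows with model parameter scale. On the defense side, the paper analyzes the mechanisms behind the perturbation, proposes targeted defenses to mitigate it, finds that current standard defenses are insufficient, and argues that future RAG systems need more robust designs.
Link: https://arxiv.org/abs/2512.01335
Authors: Xinyun Zhou,Xinfeng Li,Yinan Peng,Ming Xu,Xuanwang Zhang,Miao Yu,Yidong Wang,Xiaojun Jia,Kun Wang,Qingsong Wen,XiaoFeng Wang,Wei Dong
Affiliations: ZJU; NTU; Hengxin Tech.; NUS; NJU; PKU; Squirrel Ai Learning
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) 2026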
Abstract:Retrieval-Augmented Generation (RAG) systems are increasingly central to robust AI, enhancing large language model (LLM) faithfulness by incorporating external knowledge. However, our study unveils a critical, overlooked vulnerability: their profound susceptibility to subtle symbolic perturbations, particularly through near-imperceptible emoticon tokens such as “(@_@)” that can catastrophically mislead retrieval, termed EmoRAG. We demonstrate that injecting a single emoticon into a query makes it nearly 100% likely to retrieve semantically unrelated texts that contain a matching emoticon. Our extensive experiment across general question-answering and code domains, using a range of state-of-the-art retrievers and generators, reveals three key findings: (I) Single-Emoticon Disaster: Minimal emoticon injections cause maximal disruptions, with a single emoticon almost 100% dominating RAG output. (II) Positional Sensitivity: Placing an emoticon at the beginning of a query can cause severe perturbation, with F1-Scores exceeding 0.92 across all datasets. (III) Parameter-Scale Vulnerability: Counterintuitively, models with larger parameters exhibit greater vulnerability to the interference. We provide an in-depth analysis to uncover the underlying mechanisms of these phenomena. Furthermore, we raise a critical concern regarding the robustness assumption of current RAG systems, envisioning a threat scenario where an adversary exploits this vulnerability to manipulate the RAG system. We evaluate standard defenses and find them insufficient against EmoRAG. To address this, we propose targeted defenses, analyzing their strengths and limitations in mitigating emoticon-based perturbations. Finally, we outline future directions for building robust RAG systems.
[NLP-39] Securing Large Language Models (LLMs) from Prompt Injection Attacks
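[Quick read]: This paper addresses the exposure of deployed large language models (LLMs) to prompt injection attacks, in which crafted prompts exploit the model's instruction-following ability to make it perform unintended or harmful tasks. The paper evaluates JATMO, a previously proposed task-specific fine-tuning defense that trains non-instruction-tuned base models to perform a single function and thereby reduces their sensitivity to adversarial instructions, against an adapted version of the HOUYI genetic attack framework. The experiments show that JATMO lowers attack success rates but does not fully prevent injections: attacks using multilingual cues or code-related disruptors still get through, and there is a trade-off between generation quality and resistance to injection. These results highlight the limits of fine-tuning-based defenses and the need for layered, adversarially informed mitigation.
Link: https://arxiv.org/abs/2512.01326
Authors: Omar Farooq Khan Suri,John McCrae
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages, 1 figure, 1 table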
Abstract:Large Language Models (LLMs) are increasingly being deployed in real-world applications, but their flexibility exposes them to prompt injection attacks. These attacks leverage the model’s instruction-following ability to make it perform malicious tasks. Recent work has proposed JATMO, a task-specific fine-tuning approach that trains non-instruction-tuned base models to perform a single function, thereby reducing susceptibility to adversarial instructions. In this study, we evaluate the robustness of JATMO against HOUYI, a genetic attack framework that systematically mutates and optimizes adversarial prompts. We adapt HOUYI by introducing custom fitness scoring, modified mutation logic, and a new harness for local model testing, enabling a more accurate assessment of defense effectiveness. We fine-tuned LLaMA 2-7B, Qwen1.5-4B, and Qwen1.5-0.5B models under the JATMO methodology and compared them with a fine-tuned GPT-3.5-Turbo baseline. Results show that while JATMO reduces attack success rates relative to instruction-tuned models, it does not fully prevent injections; adversaries exploiting multilingual cues or code-related disruptors still bypass defenses. We also observe a trade-off between generation quality and injection vulnerability, suggesting that better task performance often correlates with increased susceptibility. Our results highlight both the promise and limitations of fine-tuning-based defenses and point toward the need for layered, adversarially informed mitigation strategies.
[NLP-40] Agreement-Constrained Probabilistic Minimum Bayes Risk Decoding AACL2025
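[Quick read]: This paper addresses the quadratic time complexity of Minimum Bayes Risk (MBR) decoding in machine translation, which arises because scores must be computed for every pair of candidate translations and which limits its use on large candidate sets. Existing methods such as probabilistic MBR (PMBR) reduce the number of evaluations by scoring only sampled candidate pairs and filling in the rest with a matrix completion algorithm, at the expense of translation quality. The paper proposes agreement-constrained PMBR (AC-PMBR) decoding, whose key idea is to use a knowledge-distilled model to guide completion of the score matrix. This improves the approximation error of matrix completion by up to 3 times and raises translation quality at a computational cost comparable to PMBR, as validated on the WMT'23 English-German translation tasks.
Link: https://arxiv.org/abs/2512.01316
Authors: Koki Natsumi,Hiroyuki Deguchi,Yusuke Sakai,Hidetaka Kamigaito,Taro Watanabe
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: IJCNLP-AACL 2025 Main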
Abstract:Minimum Bayes risk (MBR) decoding generates high-quality translations by maximizing the expected utility of output candidates, but it evaluates all pairwise scores over the candidate set; hence, it takes quadratic time with respect to the number of candidates. To reduce the number of utility function calls, probabilistic MBR (PMBR) decoding partially evaluates quality scores using sampled pairs of candidates and completes the missing scores with a matrix completion algorithm. Nevertheless, it degrades the translation quality as the number of utility function calls is reduced. Therefore, to improve the trade-off between quality and cost, we propose agreement-constrained PMBR (AC-PMBR) decoding, which leverages a knowledge distilled model to guide the completion of the score matrix. Our AC-PMBR decoding improved approximation errors of matrix completion by up to 3 times and achieved higher translation quality compared with PMBR decoding at a comparable computational cost on the WMT’23 En↔De translation tasks.
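The quadratic cost that motivates PMBR and AC-PMBR is easiest to see in plain MBR decoding, sketched below. This shows only the baseline, not the proposed method: the candidates double as pseudo-references, and the token-overlap `jaccard_utility` is a toy stand-in for the learned metrics normally used in practice. Both function names are hypothetical.

```python
def jaccard_utility(hypothesis, reference):
    """Toy utility: token-set overlap between two sentences (illustrative only)."""
    hyp, ref = set(hypothesis.split()), set(reference.split())
    return len(hyp & ref) / len(hyp | ref) if hyp | ref else 1.0


def mbr_decode(candidates, utility=jaccard_utility):
    """Return the candidate with the highest average utility against all candidates.

    Every candidate is scored against every other, so the number of utility
    calls is len(candidates) ** 2.
    """
    best, best_score = None, float("-inf")
    for hyp in candidates:
        score = sum(utility(hyp, ref) for ref in candidates) / len(candidates)
        if score > best_score:  # strict comparison keeps the first of any tie
            best, best_score = hyp, score
    return best


choice = mbr_decode(
    ["the cat sat", "the cat sat down", "the cat sat there", "a dog ran"]
)
```

In this toy run the shortest "cat" sentence wins because it overlaps most with the other candidates on average. PMBR-style methods aim to fill most of this score matrix by completion instead of calling the utility for every pair.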
[NLP-41] Kardia-R1: Unleashing LLMs to Reason toward Understanding and Empathy for Emotional Support via Rubric-as-Judge Reinforcement Learning
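[Quick read]: This paper addresses two limitations of current conversational agents in emotional reasoning: existing datasets are situation-centric and lack persistent user identity, so they fail to capture personalized affective nuances; and training relies on opaque, coarse reward signals that hinder verifiable empathetic reasoning. The solution starts with KardiaBench, a large-scale user-grounded benchmark of 178,080 QA pairs across 22,080 multi-turn conversations anchored to 671 real-world user profiles, built with a model-in-the-loop pipeline and iterative rubric-guided refinement to ensure psychological plausibility, emotional fidelity, and persona consistency. On top of it, the paper proposes Kardia-R1, a framework that uses Rubric-as-Judge Empathetic Reinforcement Learning (Rubric-ERL), a GRPO-based method with explainable, human-aligned rubric rewards, to couple user understanding, emotional inference, and supportive response generation into a structured, traceable empathetic reasoning process. Experiments show that it outperforms other methods in emotion accuracy, empathy, relevance, persona consistency, and safety.
Link: https://arxiv.org/abs/2512.01282
Authors: Jiahao Yuan,Zhiqing Cui,Hanqing Wang,Yuansheng Gao,Yucheng Zhou,Usman Naseem
Affiliations: University of Macau
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: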
Abstract:As web platforms evolve towards greater personalization and emotional complexity, conversational agents must transcend superficial empathy to demonstrate identity-aware emotional reasoning. However, existing systems face two limitations: (1) reliance on situation-centric datasets lacking persistent user identity, which hampers the capture of personalized affective nuances; and (2) dependence on opaque, coarse reward signals that hinder development of verifiable empathetic reasoning. To address these gaps, we introduce KardiaBench, a large-scale user-grounded benchmark comprising 178,080 QA pairs across 22,080 multi-turn conversations anchored to 671 real-world profiles. The dataset is constructed via a model-in-the-loop, progressive empathy pipeline that integrates user comprehension, contextual reasoning, and emotion perception into conversations, followed by iterative critique and rubric-based refinement to ensure psychological plausibility, emotional fidelity, and persona consistency. Building on this, we propose Kardia-R1, a framework that trains models for interpretable, stepwise empathetic cognition. Kardia-R1 leverages Rubric-as-Judge Empathetic Reinforcement Learning (Rubric-ERL), a GRPO-based method that uses explainable, human-aligned rubric rewards to tightly couple user understanding, emotional inference, and supportive response generation. Extensive experiments across four LLM backbones demonstrate that Kardia-R1 consistently outperforms other methods in emotion accuracy, empathy, relevance, persona consistency, and safety. Our dataset and model will be released at this https URL.
[NLP-42] SUPERChem: A Multimodal Reasoning Benchmark in Chemistry
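[Quick read]: This paper addresses three problems with current benchmarks for the chemical reasoning ability of large language models (LLMs): oversimplified tasks, a lack of process-level evaluation, and a mismatch with expert-level chemistry skills. The key contribution is SUPERChem, a benchmark of 500 expert-curated, reasoning-intensive chemistry problems that span multiple subfields and come in both multimodal and text-only formats. Original content and an iterative curation pipeline eliminate flawed items and reduce data contamination. Each problem is paired with an expert-authored solution path, which enables Reasoning Path Fidelity (RPF) scoring to assess reasoning quality beyond final-answer accuracy.
Link: https://arxiv.org/abs/2512.01274
Authors: Zehua Zhao,Zhixian Huang,Junren Li,Siyu Lin,Junting Zhou,Fengqi Cao,Kun Zhou,Rui Ge,Tingting Long,Yuexiang Zhu,Yan Liu,Jie Zheng,Junnian Wei,Rong Zhu,Peng Zou,Wenyu Li,Zekai Cheng,Tian Ding,Yaxuan Wang,Yizhao Yan,Tingru Wei,Haowei Ming,Weijie Mao,Chen Sun,Yiming Liu,Zichen Wang,Zuo Zhang,Tong Yang,Hao Ma,Zhen Gao,Jian Pei
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 35 pages, 11 figures, 5 tables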
Abstract:Current benchmarks for evaluating the chemical reasoning capabilities of Large Language Models (LLMs) are limited by oversimplified tasks, lack of process-level evaluation, and misalignment with expert-level chemistry skills. To address these issues, we introduce SUPERChem, a benchmark of 500 expert-curated reasoning-intensive chemistry problems, covering diverse subfields and provided in both multimodal and text-only formats. Original content and an iterative curation pipeline eliminate flawed items and mitigate data contamination. Each problem is paired with an expert-authored solution path, enabling Reasoning Path Fidelity (RPF) scoring to evaluate reasoning quality beyond final-answer accuracy. Evaluations against a human baseline of 40.3% accuracy show that even the best-performing model, GPT-5 (High), reaches only 38.5%, followed closely by Gemini 2.5 Pro (37.9%) and DeepSeek-V3.1-Think (37.3%). SUPERChem elicits multi-step, multimodal reasoning, reveals model-dependent effects of visual information, and distinguishes high-fidelity reasoners from heuristic ones. By providing a challenging benchmark and a reliable evaluation framework, SUPERChem aims to facilitate the advancement of LLMs toward expert-level chemical intelligence. The dataset of the benchmark is available at this https URL.
[NLP-43] Sentiment Analysis and Emotion Classification using Machine Learning Techniques for Nagamese Language - A Low-resource Language
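[Quick read]: This paper addresses text sentiment analysis and basic emotion classification for Nagamese, an Assamese-lexified creole, and is, to the authors' knowledge, the first such study for the language. The key to the approach is a sentiment polarity lexicon of 1,195 words, which is combined with additional features to train and test supervised classifiers, namely Naive Bayes and Support Vector Machines, that detect sentiment polarity (positive, negative, neutral) and basic emotions in Nagamese text.
Link: https://arxiv.org/abs/2512.01256
Authors: Ekha Morang,Surhoni A. Ngullie,Sashienla Longkumer,Teisovi Angami
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages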
Abstract:The Nagamese language, also known as Naga Pidgin, is an Assamese-lexified creole language developed primarily as a means of communication in trade between people from Nagaland and people from Assam in north-east India. A substantial amount of work in sentiment analysis has been done for resource-rich languages like English, Hindi, etc. However, no work has been done on the Nagamese language. To the best of our knowledge, this is the first attempt at sentiment analysis and emotion classification for the Nagamese language. The aim of this work is to detect sentiments in terms of polarity (positive, negative and neutral) and basic emotions contained in the textual content of the Nagamese language. We build a sentiment polarity lexicon of 1,195 Nagamese words and use it to build features, along with additional features, for supervised machine learning techniques using Naive Bayes and Support Vector Machines. Keywords: Nagamese, NLP, sentiment analysis, machine learning
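A lexicon-based feature extractor of the kind described in the abstract might look like the sketch below. It is purely illustrative: the lexicon entries are placeholder tokens rather than real Nagamese words, the feature set and function name are hypothetical, and the paper's actual features are not specified in the abstract.

```python
def lexicon_features(tokens, lexicon):
    """Count lexicon hits per polarity; the result can feed a supervised classifier.

    lexicon maps a lowercase word to "positive" or "negative".
    """
    pos = sum(1 for t in tokens if lexicon.get(t.lower()) == "positive")
    neg = sum(1 for t in tokens if lexicon.get(t.lower()) == "negative")
    return {
        "pos_count": pos,
        "neg_count": neg,
        "polarity": pos - neg,   # > 0 leans positive, < 0 leans negative
        "length": len(tokens),
    }


# Placeholder entries only; these are not real Nagamese words.
toy_lexicon = {"good_word": "positive", "bad_word": "negative"}
features = lexicon_features(["Good_Word", "bad_word", "other", "good_word"], toy_lexicon)
```

Feature dictionaries like this one would typically be vectorized and passed to a Naive Bayes or SVM classifier, the two model families named in the abstract.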
[NLP-44] Large Language Models Cannot Reliably Detect Vulnerabilities in JavaScript: The First Systematic Benchmark and Evaluation
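[Quick read]: This paper addresses the lack of systematic evaluation and reliable benchmarks for large language models (LLMs) on JavaScript vulnerability detection. Existing benchmarks have three flaws: incomplete coverage (for example, only a subset of CWE types), underestimation of LLM capability caused by unreasonable ground-truth labeling, and overestimation caused by using isolated vulnerable files instead of complete projects. The authors therefore propose three principles (comprehensiveness, no underestimation, no overestimation) and, based on them, design FORGEJS, a framework that automatically generates JavaScript vulnerability detection benchmarks. They use it to build ARENAJS, the first systematic benchmark for LLM-based JavaScript vulnerability detection, and develop the JUDGEJS automatic evaluation framework to assess seven popular commercial LLMs. The results show clear weaknesses in both reasoning ability and robustness, indicating that reliable JavaScript vulnerability detection with LLMs remains an open challenge.
Link: https://arxiv.org/abs/2512.01255
Authors: Qingyuan Fei,Xin Liu,Song Li,Shujiang Wu,Jianwei Hou,Ping Chen,Zifeng Kang
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: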
Abstract:Researchers have proposed numerous methods to detect vulnerabilities in JavaScript, especially those assisted by Large Language Models (LLMs). However, the actual capability of LLMs in JavaScript vulnerability detection remains questionable, necessitating systematic evaluation and comprehensive benchmarks. Unfortunately, existing benchmarks suffer from three critical limitations: (1) incomplete coverage, such as covering a limited subset of CWE types; (2) underestimation of LLM capabilities caused by unreasonable ground truth labeling; and (3) overestimation due to unrealistic cases such as using isolated vulnerable files rather than complete projects. In this paper, we introduce, for the first time, three principles for constructing a benchmark for JavaScript vulnerability detection that directly address these limitations: (1) comprehensiveness, (2) no underestimation, and (3) no overestimation. Guided by these principles, we propose FORGEJS, the first automatic benchmark generation framework for evaluating LLMs’ capability in JavaScript vulnerability detection. Then, we use FORGEJS to construct ARENAJS, the first systematic benchmark for LLM-based JavaScript vulnerability detection, and further propose JUDGEJS, an automatic evaluation framework. We conduct the first systematic evaluation of LLMs for JavaScript vulnerability detection, leveraging JUDGEJS to assess seven popular commercial LLMs on ARENAJS. The results show that LLMs not only exhibit limited reasoning capabilities, but also suffer from severe robustness defects, indicating that reliable JavaScript vulnerability detection with LLMs remains an open challenge.
[NLP-45] Generative Adversarial Gumbel MCTS for Abstract Visual Composition Generation
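[Quick Read]: This paper addresses generation in abstract visual composition: assembling structurally sound, semantically consistent visual objects from fixed geometric primitives under geometric constraints and vague goal descriptions (such as text). The core challenges are combinatorial explosion, data scarcity, and discrete feasibility requirements (no overlap, allowed orientations), which leave the solution space sparse and hard for purely statistical pixel-space generative models to handle. The key to the solution is a constraint-guided framework that combines explicit geometric reasoning with neural semantic modeling: an AlphaGo-style search enforces geometric feasibility, and a fine-tuned vision-language model provides reward signals that score semantic alignment. A policy network serves as the heuristic in Monte Carlo Tree Search and is fine-tuned on search-generated plans. Borrowing from generative adversarial networks, adversarial reward refinement improves the realism of generated instances so that they gradually approach the real data distribution.
Link: https://arxiv.org/abs/2512.01242
Authors: Zirui Zhao, Boye Niu, David Hsu, Wee Sun Lee
Affiliations: National University of Singapore; University of Sydney
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: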
Abstract:We study abstract visual composition, in which identity is primarily determined by the spatial configuration and relations among a small set of geometric primitives (e.g., parts, symmetry, topology). They are invariant primarily to texture and photorealistic detail. Composing such structures from fixed components under geometric constraints and vague goal specification (such as text) is non-trivial due to combinatorial placement choices, limited data, and discrete feasibility (overlap-free, allowable orientations), which create a sparse solution manifold ill-suited to purely statistical pixel-space generators. We propose a constraint-guided framework that combines explicit geometric reasoning with neural semantics. An AlphaGo-style search enforces feasibility, while a fine-tuned vision-language model scores semantic alignment as reward signals. Our algorithm uses a policy network as a heuristic in Monte-Carlo Tree Search and fine-tunes the network via search-generated plans. Inspired by the Generative Adversarial Network, we use the generated instances for adversarial reward refinement. Over time, the generation should approach the actual data more closely when the reward model cannot distinguish between generated instances and ground-truth. In the Tangram Assembly task, our approach yields higher validity and semantic fidelity than diffusion and auto-regressive baselines, especially as constraints tighten.
[NLP-46] Pay Attention Later: From Vector Space Diffusion to Linearithmic Spectral Phase-Locking
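[Quick Read]: This paper targets the plasticity-stability dilemma that standard Transformers face in continual learning: when new concepts are introduced, the model is prone to catastrophic forgetting and its existing reasoning ability degrades sharply. The core issue is that standard Transformers rely on local gradient diffusion for semantic alignment, which creates a fixed geometric barrier (the "Semantic Alignment Tax") that scaling the model cannot overcome. The key to the solution is the Phase-Resonant Intelligent Spectral Model (PRISM), which encodes semantic identity as resonant frequencies in the complex domain (C^d) and replaces quadratic self-attention with gated harmonic convolutions of linearithmic complexity, O(N log N). This structurally decouples memory from reasoning and supports lossless plasticity: in 5-shot novel-concept injection, PRISM reaches 96% acquisition with a drop of only 0.84 BLEU points.
Link: https://arxiv.org/abs/2512.01208
Authors: Alper Yıldırım, İbrahim Yücedağ
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 pages, 5 figures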
Abstract:Standard Transformers suffer from a “Semantic Alignment Tax”, a prohibitive optimization cost required to organize a chaotic initialization into a coherent geometric map via local gradient diffusion. We hypothesize that this reliance on diffusive learning creates “Catastrophic Rigidity”, rendering models unable to adapt to novel concepts without destroying their pre-trained reasoning capabilities. To isolate this phenomenon, we introduce Iterative Semantic Map Refinement (ISMR), a diagnostic protocol revealing that alignment is a fixed geometric barrier that scaling cannot solve; a 20-layer model overcomes this barrier no faster than a 1-layer model. We introduce the Phase-Resonant Intelligent Spectral Model (PRISM). PRISM encodes semantic identity as resonant frequencies in the complex domain (C^d) and replaces quadratic self-attention with linearithmic O(N log N) Gated Harmonic Convolutions. We validate PRISM on the WMT14 translation task. While the Standard Transformer maintains a slight edge in general competence on static benchmarks (23.88 vs 21.40 BLEU), it fails the “Plasticity-Stability” stress test completely. When injected with novel concepts, the Transformer suffers Catastrophic Forgetting, degrading by -10.55 BLEU points while achieving only 60% acquisition. In contrast, PRISM demonstrates Lossless Plasticity, achieving 96% 5-shot acquisition with negligible degradation (-0.84 BLEU). These results suggest that harmonic representations effectively decouple memory from reasoning, offering a structural solution to the plasticity-stability dilemma in real-time knowledge adaptation.
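To make the complexity claim concrete, here is a minimal sketch of how a convolution over a length-N sequence can be computed in O(N log N) with the FFT and then gated elementwise. It is not PRISM's Gated Harmonic Convolution, whose exact form the abstract does not give: the function names `fft_circular_conv` and `gated_spectral_mix`, the sigmoid gate, and the circular boundary handling are all assumptions made for illustration.

```python
import numpy as np

def fft_circular_conv(x, kernel):
    """Circular convolution along axis 0 via the FFT: O(N log N) in sequence length N."""
    n = x.shape[0]
    # Pointwise multiplication of spectra equals circular convolution in the sequence domain.
    x_f = np.fft.fft(x, axis=0)
    k_f = np.fft.fft(kernel, n=n, axis=0)
    return np.fft.ifft(x_f * k_f, axis=0)

def direct_circular_conv(x, kernel):
    """O(N^2) reference implementation, used only to check the FFT path."""
    n = x.shape[0]
    out = np.zeros_like(x, dtype=complex)
    for t in range(n):
        for s in range(n):
            out[t] += kernel[s] * x[(t - s) % n]
    return out

def gated_spectral_mix(x, kernel, gate_logits):
    """Hypothetical gating: a sigmoid gate scales the convolved signal elementwise."""
    gate = 1.0 / (1.0 + np.exp(-gate_logits))
    return gate * fft_circular_conv(x, kernel)

rng = np.random.default_rng(0)
N, d = 16, 4
# Complex-valued token features in C^d (random stand-ins, not learned embeddings).
x = rng.normal(size=(N, d)) + 1j * rng.normal(size=(N, d))
kernel = rng.normal(size=(N, d)) + 1j * rng.normal(size=(N, d))
gate_logits = rng.normal(size=(N, d))

y = gated_spectral_mix(x, kernel, gate_logits)
print(y.shape)
print(np.allclose(fft_circular_conv(x, kernel), direct_circular_conv(x, kernel)))
```

The direct version touches every pair of positions, which is the quadratic cost that self-attention also pays; the FFT version reaches the same result with a transform, a pointwise product, and an inverse transform.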
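[NLP-47] Conveying Imagistic Thinking in Traditional Chinese Medicine Translation: A Prompt Engineering and LLM-Based Evaluation Framework
[Quick Read]: This paper addresses a problem in English translations of Traditional Chinese Medicine (TCM) theory: reliance on literal rendering makes it hard for target-language readers to reconstruct the underlying conceptual networks and apply them in clinical practice. The key to the solution is a human-in-the-loop (HITL) framework in which prompt-based cognitive scaffolding guides the large language model DeepSeek V3.1 to identify metaphor and metonymy in the source text and convey them accurately at the conceptual level. Translations are then assessed by three simulated types of real-world readers through a multidimensional cognitive evaluation (five-dimension scoring plus Interpretative Phenomenological Analysis, IPA), which supports the method's effectiveness in improving comprehensibility, consistency, and replicability.
Link: https://arxiv.org/abs/2512.01198
Authors: Jiatong Han
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 3 figures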
Abstract:Traditional Chinese Medicine (TCM) theory is built on imagistic thinking, in which medical principles and diagnostic and therapeutic logic are structured through metaphor and metonymy. However, existing English translations largely rely on literal rendering, making it difficult for target-language readers to reconstruct the underlying conceptual networks and apply them in clinical practice. This study adopted a human-in-the-loop (HITL) framework and selected four passages from the medical canon Huangdi Neijing that are fundamental in theory. Through prompt-based cognitive scaffolding, DeepSeek V3.1 was guided to identify metaphor and metonymy in the source text and convey the theory in translation. In the evaluation stage, ChatGPT 5 Pro and Gemini 2.5 Pro were instructed by prompts to simulate three types of real-world readers. Human translations, baseline model translations, and prompt-adjusted translations were scored by the simulated readers across five cognitive dimensions, followed by structured interviews and Interpretative Phenomenological Analysis (IPA). Results show that the prompt-adjusted LLM translations perform best across all five dimensions, with high cross-model and cross-role consistency. The interview themes reveal differences between human and machine translation, effective strategies for metaphor and metonymy transfer, and readers’ cognitive preferences. This study provides a cognitive, efficient, and replicable HITL methodological pathway for the translation of ancient, concept-dense texts such as TCM.
[NLP-48] Generalist Large Language Models Outperform Clinical Tools on Medical Benchmarks
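[Quick Read]: This paper asks whether widely deployed clinical AI systems actually outperform generalist large language models (LLMs) in medical decision support, and whether the lack of independent quantitative evaluation creates safety and reliability risks. The key to the solution is a 1,000-item mini-benchmark that combines MedQA (medical knowledge) and HealthBench (clinician-alignment) tasks, used to compare two mainstream clinical AI tools (OpenEvidence and UpToDate Expert AI) against three frontier generalist LLMs (GPT-5, Gemini 3 Pro, and Claude Sonnet 4.5). The results show that the generalist LLMs clearly outperform the clinical tools in completeness, communication quality, context awareness, and systems-based safety reasoning, which exposes shortfalls in key capabilities of the clinical tools and underscores the need for transparent, independent empirical evaluation before deployment in patient-facing workflows.
Link: https://arxiv.org/abs/2512.01191
Authors: Krithik Vishwanath, Mrigayu Ghosh, Anton Alyakin, Daniel Alexander Alber, Yindalon Aphinyanaphongs, Eric Karl Oermann
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 17 pages, 4 figures (2 regular, 2 supplemental)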
Abstract:Specialized clinical AI assistants are rapidly entering medical practice, often framed as safer or more reliable than general-purpose large language models (LLMs). Yet, unlike frontier models, these clinical tools are rarely subjected to independent, quantitative evaluation, creating a critical evidence gap despite their growing influence on diagnosis, triage, and guideline interpretation. We assessed two widely deployed clinical AI systems (OpenEvidence and UpToDate Expert AI) against three state-of-the-art generalist LLMs (GPT-5, Gemini 3 Pro, and Claude Sonnet 4.5) using a 1,000-item mini-benchmark combining MedQA (medical knowledge) and HealthBench (clinician-alignment) tasks. Generalist models consistently outperformed clinical tools, with GPT-5 achieving the highest scores, while OpenEvidence and UpToDate demonstrated deficits in completeness, communication quality, context awareness, and systems-based safety reasoning. These findings reveal that tools marketed for clinical decision support may often lag behind frontier LLMs, underscoring the urgent need for transparent, independent evaluation before deployment in patient-facing workflows.
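[NLP-49] TempPerturb-Eval: On the Joint Effects of Internal Temperature and External Perturbations in RAG Robustness
[Quick Read]: This paper addresses a gap in how Retrieval-Augmented Generation (RAG) systems are evaluated: the interaction between retrieval quality and generation parameters such as temperature is overlooked. Existing approaches usually analyze retrieval effectiveness or generation hyperparameters in isolation and ignore their combined effect under realistic noise. The key to the solution is a systematic RAG Perturbation-Temperature Analysis Framework. Large-scale experiments on HotpotQA apply three types of text perturbation (simulating noisy retrieval) across a range of temperature settings, and show that high temperatures markedly amplify the model's sensitivity to perturbations, while certain perturbation types respond non-linearly across the temperature range. The framework provides a diagnostic benchmark for RAG robustness, along with quantitative analysis tools and practical guidance for model selection and parameter tuning under noisy conditions.
Link: https://arxiv.org/abs/2512.01183
Authors: Yongxin Zhou, Philippe Mulhem, Didier Schwab
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: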
Abstract:The evaluation of Retrieval-Augmented Generation (RAG) systems typically examines retrieval quality and generation parameters like temperature in isolation, overlooking their interaction. This work presents a systematic investigation of how text perturbations (simulating noisy retrieval) interact with temperature settings across multiple LLM runs. We propose a comprehensive RAG Perturbation-Temperature Analysis Framework that subjects retrieved documents to three distinct perturbation types across varying temperature settings. Through extensive experiments on HotpotQA with both open-source and proprietary LLMs, we demonstrate that performance degradation follows distinct patterns: high-temperature settings consistently amplify vulnerability to perturbations, while certain perturbation types exhibit non-linear sensitivity across the temperature range. Our work yields three key contributions: (1) a diagnostic benchmark for assessing RAG robustness, (2) an analytical framework for quantifying perturbation-temperature interactions, and (3) practical guidelines for model selection and parameter tuning under noisy retrieval conditions.
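As background for the "internal temperature" half of the study, the sketch below shows the standard way a sampling temperature reshapes a next-token distribution: logits are divided by the temperature before the softmax, so higher values flatten the distribution. The logits are made up for illustration, and the code says nothing about the three perturbation types, which the abstract does not name.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the sampling temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [4.0, 2.0, 1.0, 0.5]  # hypothetical next-token logits
for t in (0.2, 0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: top-1 prob={max(probs):.3f}, entropy={entropy(probs):.3f}")
```

A flatter distribution gives low-probability tokens more chances to be sampled, which is one plausible reason noisy retrieved context could hurt more at high temperature; the paper measures that interaction empirically.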
[NLP-50] DrawingBench: Evaluating Spatial Reasoning and UI Interaction Capabilities of Large Language Models through Mouse-Based Drawing Tasks AAAI2026
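[Quick Read]: This paper addresses the lack of transparency and auditability in current evaluations of autonomously operating agentic large language models (agentic LLMs), which makes it hard to establish trust in their behavior. Existing benchmarks typically use black-box evaluation and cannot effectively verify a model's decision logic and execution process on real tasks. The key to the solution is DrawingBench, a verification framework built on spatial reasoning tasks that evaluates models by having them generate sequences of low-level graphical user interface (GUI) actions. Its main contributions are 8 objective, rule-based scoring criteria that enable reproducible quantitative evaluation, support for action-level behavioral auditing, and a multi-turn feedback mechanism that gives humans external control over agent behavior and iterative refinement. Experiments show that transparent evaluation combined with external oversight significantly improves model performance (+3.2% on average, up to +32.8% on complex scenes), identify tool state management and long-horizon planning as the main bottlenecks, and indicate that clear instruction specification matters more than task complexity for reaching 100% perfect performance.
Link: https://arxiv.org/abs/2512.01174
Authors: Hyunjun Kim, Sooyoung Ryu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: AAAI 2026 TrustAgent Workshop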
Abstract:As agentic AI systems increasingly operate autonomously, establishing trust through verifiable evaluation becomes critical. Yet existing benchmarks lack the transparency and auditability needed to assess whether agents behave reliably. We present DrawingBench, a verification framework for evaluating the trustworthiness of agentic LLMs through spatial reasoning tasks that require generating sequences of low-level GUI actions. Unlike opaque evaluations, DrawingBench provides transparent, rule-based assessment: 8 objective criteria enable reproducible scoring, while action-level inspection allows stakeholders to audit agent behavior. Our framework comprises 250 diverse prompts across 20 categories and 4 difficulty levels, deterministic evaluation metrics, and an external oversight mechanism through multi-turn feedback that enables human control over agent refinement. Evaluating four state-of-the-art LLMs (Claude-4 Sonnet, GPT-4.1, GPT-4.1-mini, Gemini-2.5 Flash) across 1,000 tests, we establish both capabilities and limitations: models achieved 92.8% perfect performance with structured external feedback driving significant improvements (average +3.2%, up to +32.8% for complex scenes), but systematic error patterns emerged in tool state management and long-horizon planning. Notably, specification clarity proved more important than task complexity – models achieved 100% perfect performance when given explicit, verifiable criteria. These findings demonstrate that transparent evaluation frameworks can establish trust in agentic systems, with external oversight proving more reliable than self-correction for guiding agent behavior. Our open-source framework provides a template for trustworthy agent assessment. Code and data: this https URL
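The idea of deterministic, action-level checks can be sketched as follows. This is not DrawingBench's rubric: the action schema (`select_tool`, `mouse_down`, `mouse_move`, `mouse_up`), the canvas size, and the three rules are hypothetical, chosen only to show what a rule-based audit of a GUI action sequence might look like.

```python
CANVAS_W, CANVAS_H = 800, 600  # hypothetical canvas size

def check_actions(actions):
    """Return pass/fail results for three deterministic rules over an action sequence."""
    tool = None
    mouse_is_down = False
    in_bounds = True
    tool_before_draw = True
    balanced = True
    for action in actions:
        kind = action["type"]
        if kind == "select_tool":
            tool = action["tool"]
        elif kind in ("mouse_down", "mouse_move", "mouse_up"):
            # Rule 1: every pointer event must land on the canvas.
            if not (0 <= action["x"] < CANVAS_W and 0 <= action["y"] < CANVAS_H):
                in_bounds = False
            if kind == "mouse_down":
                # Rule 2: a tool must be selected before drawing starts.
                if tool is None:
                    tool_before_draw = False
                # Rule 3: presses and releases must alternate.
                if mouse_is_down:
                    balanced = False
                mouse_is_down = True
            elif kind == "mouse_up":
                if not mouse_is_down:
                    balanced = False
                mouse_is_down = False
    if mouse_is_down:
        balanced = False  # the sequence ended with the button still held
    return {
        "in_bounds": in_bounds,
        "tool_before_draw": tool_before_draw,
        "press_release_balanced": balanced,
    }

good = [
    {"type": "select_tool", "tool": "rectangle"},
    {"type": "mouse_down", "x": 10, "y": 10},
    {"type": "mouse_move", "x": 50, "y": 50},
    {"type": "mouse_up", "x": 50, "y": 50},
]
bad = [
    {"type": "mouse_down", "x": 10, "y": 10},
    {"type": "mouse_up", "x": 900, "y": 10},
]
print(check_actions(good))
print(check_actions(bad))
```

Because each rule is a plain predicate over the recorded actions, two evaluators running the same sequence get the same verdict, and a failed rule points to a specific action that a reviewer can inspect.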
[NLP-51] Mode-Conditioning Unlocks Superior Test-Time Scaling
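[Quick Read]: This paper addresses the inefficiency that diversity collapse causes for parallel sampling in test-time scaling: across repeated samples the model concentrates on a few modes and repeats the same mistakes, which limits how effectively compute is used. The key to the solution is the mode-conditioning (ModC) framework, which explicitly allocates test-time compute across different reasoning modes, using either specialist models or mode-specific prefixes to increase diversity. The study further finds that gradient clustering enables ModC without explicit mode labels, yielding gains of up to 10% on datasets such as NuminaMath. The approach unlocks diversity latent in the data that standard training underutilizes, and substantially improves test-time scaling.
Link: https://arxiv.org/abs/2512.01127
Authors: Chen Henry Wu, Sachin Goyal, Aditi Raghunathan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: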
Abstract:Parallel sampling promises substantial gains in test-time scaling, but its effectiveness is sharply limited by diversity collapse, where models concentrate on a few modes and repeated samples produce the same mistakes. We propose the mode-conditioning (ModC) framework, which explicitly allocates test-time compute across reasoning modes using either specialist models or mode-specific prefixes. ModC consistently improves scaling across controlled graph-search tasks and large-scale reasoning benchmarks, spanning model families and sizes from 0.5B to 7B. On OpenThoughts, fine-tuning Qwen2.5-7B with ModC achieves a 4x efficiency gain over standard training while also improving the maximum attainable Pass@k. We further show that gradient clustering enables ModC without explicit mode labels, yielding up to 10% gains on datasets such as NuminaMath. Finally, we show that ModC improves reinforcement learning (RL) and can further boost diversity-inducing RL methods. These results demonstrate that standard training underutilizes the diversity in data, and that ModC provides a simple, effective remedy for unlocking the full benefits of diversity in test-time scaling.
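A toy calculation can illustrate why spreading samples across modes helps when the sampler has collapsed onto one of them. The sketch below includes the widely used unbiased Pass@k estimator and two closed-form success probabilities. The two-mode problem, its probabilities, and the even budget split are assumptions made for illustration, not the paper's setup or allocation rule.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_prob_mixture(mode_probs, success_rates, k):
    """P(at least one success in k i.i.d. samples) when each sample draws its mode from mode_probs."""
    p = sum(m * s for m, s in zip(mode_probs, success_rates))
    return 1.0 - (1.0 - p) ** k

def pass_prob_mode_conditioned(success_rates, k):
    """Same quantity when the k samples are split as evenly as possible across modes."""
    per_mode, remainder = divmod(k, len(success_rates))
    fail = 1.0
    for i, s in enumerate(success_rates):
        budget = per_mode + (1 if i < remainder else 0)
        fail *= (1.0 - s) ** budget
    return 1.0 - fail

# Toy problem: only mode B can solve it, but the model picks mode A 95% of the time.
mode_probs = [0.95, 0.05]
success_rates = [0.0, 0.6]
k = 8
collapsed = pass_prob_mixture(mode_probs, success_rates, k)
conditioned = pass_prob_mode_conditioned(success_rates, k)
print(f"collapsed sampling:  {collapsed:.3f}")
print(f"mode-conditioned:    {conditioned:.3f}")
```

The even split is not always the better choice: on a problem that the dominant mode already solves, it spends half the budget on the weaker mode. The benefit shows up across a mix of problems where different modes succeed.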
[NLP-52] How do we measure privacy in text? A survey of text anonymization metrics AACL
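[Quick Read]: This paper addresses the lack of unified standards for evaluating privacy protection in text anonymization, that is, how to systematically clarify and reconcile the metrics used to measure how well text privacy is protected. The key to the solution is a systematic survey of 47 papers that report privacy metrics, which identifies and compares six distinct privacy notions and analyzes how the associated metrics capture different aspects of privacy risk. It also assesses how well these notions align with legal privacy standards (such as HIPAA and GDPR) and with user expectations grounded in human-computer interaction (HCI) studies, offering practical guidance for more robust, comparable, and legally aware privacy evaluation of text anonymization.
Link: https://arxiv.org/abs/2512.01109
Authors: Yaxuan Ren, Krithika Ramesh, Yaxing Yao, Anjalie Field
Affiliations: Johns Hopkins University
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 1 figure, 1 table. To be published in Findings of the Association for Computational Linguistics (AACL-IJCNLP 2025). Related resources at: this https URL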
Abstract:In this work, we aim to clarify and reconcile metrics for evaluating privacy protection in text through a systematic survey. Although text anonymization is essential for enabling NLP research and model development in domains with sensitive data, evaluating whether anonymization methods sufficiently protect privacy remains an open challenge. In manually reviewing 47 papers that report privacy metrics, we identify and compare six distinct privacy notions, and analyze how the associated metrics capture different aspects of privacy risk. We then assess how well these notions align with legal privacy standards (HIPAA and GDPR), as well as user-centered expectations grounded in HCI studies. Our analysis offers practical guidance on navigating the landscape of privacy evaluation approaches further and highlights gaps in current practices. Ultimately, we aim to facilitate more robust, comparable, and legally aware privacy evaluations in text anonymization.
[NLP-53] Generalized Medical Phrase Grounding
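[Quick Read]: This paper addresses the limitations of existing methods for medical phrase grounding (MPG). Traditional models follow the referring expression comprehension (REC) paradigm and output exactly one bounding box per phrase, whereas real radiology reports often contain multi-region findings, non-diagnostic text, and non-groundable phrases (such as negations or descriptions of normal anatomy), so these models cannot faithfully represent complex clinical cases. The key to the solution is a new task formulation, generalised medical phrase grounding (GMPG), which allows a single sentence to map to zero, one, or multiple scored image regions, together with MedGrounder, the first GMPG model. MedGrounder uses a two-stage training strategy: pre-training on datasets that align report sentences with anatomy boxes, then fine-tuning on datasets with human-annotated boxes. This yields strong zero-shot transfer and clear gains over REC-style baselines and grounded report generation models, particularly on multi-region and non-groundable phrases, while greatly reducing reliance on human box annotations.
Link: https://arxiv.org/abs/2512.01085
Authors: Wenjun Zhang, Shekhar S. Chandra, Aaron Nicolson
Affiliations: The University of Queensland; Australian e-Health Research Centre, CSIRO Health and Biosecurity
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: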
Abstract:Medical phrase grounding (MPG) maps textual descriptions of radiological findings to corresponding image regions. These grounded reports are easier to interpret, especially for non-experts. Existing MPG systems mostly follow the referring expression comprehension (REC) paradigm and return exactly one bounding box per phrase. Real reports often violate this assumption. They contain multi-region findings, non-diagnostic text, and non-groundable phrases, such as negations or descriptions of normal anatomy. Motivated by this, we reformulate the task as generalised medical phrase grounding (GMPG), where each sentence is mapped to zero, one, or multiple scored regions. To realise this formulation, we introduce the first GMPG model: MedGrounder. We adopted a two-stage training regime: pre-training on report sentence–anatomy box alignment datasets and fine-tuning on report sentence–human annotated box datasets. Experiments on PadChest-GR and MS-CXR show that MedGrounder achieves strong zero-shot transfer and outperforms REC-style and grounded report generation baselines on multi-region and non-groundable phrases, while using far fewer human box annotations. Finally, we show that MedGrounder can be composed with existing report generators to produce grounded reports without retraining the generator.
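The difference between the two output contracts can be sketched in a few lines. This is not MedGrounder's decoding procedure, which the abstract does not describe: the `ScoredBox` type, the 0.5 threshold, and the example scores are hypothetical and only contrast "exactly one box" with "zero, one, or many scored boxes".

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ScoredBox:
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized to [0, 1]
    score: float

def rec_style(candidates: List[ScoredBox]) -> List[ScoredBox]:
    """REC-style decoding: always return exactly one box, the top-scoring one.

    Assumes at least one candidate is supplied.
    """
    return [max(candidates, key=lambda b: b.score)]

def gmpg_style(candidates: List[ScoredBox], threshold: float = 0.5) -> List[ScoredBox]:
    """GMPG-style decoding: keep every box at or above the threshold, possibly none."""
    kept = [b for b in candidates if b.score >= threshold]
    return sorted(kept, key=lambda b: b.score, reverse=True)

# A negated finding such as "No pleural effusion": nothing should be grounded.
negation = [
    ScoredBox((0.10, 0.10, 0.40, 0.40), 0.08),
    ScoredBox((0.50, 0.20, 0.90, 0.60), 0.12),
]
# A bilateral finding: two regions should be grounded.
bilateral = [
    ScoredBox((0.05, 0.30, 0.45, 0.80), 0.91),
    ScoredBox((0.55, 0.30, 0.95, 0.80), 0.87),
    ScoredBox((0.40, 0.05, 0.60, 0.20), 0.10),
]

print(len(rec_style(negation)), len(gmpg_style(negation)))
print(len(rec_style(bilateral)), len(gmpg_style(bilateral)))
```

Under the REC contract the negated sentence still receives a box and the bilateral finding loses one of its two regions, which are the two failure cases the GMPG formulation is meant to remove.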
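[NLP-54] Testing the Machine Consciousness Hypothesis
[Quick Read]: This paper asks how machine consciousness might be realized in computational systems, that is, whether consciousness can be a substrate-free functional property of computational systems capable of second-order perception. The key to the solution is a simulation-based research framework built on distributed learning systems interacting with a self-organizing environment. It holds that consciousness does not arise from individual modeling; instead, collective self-models emerge from the inter-agent alignment produced when many local observers exchange predictive information through noisy, lossy communication over an underlying computational substrate (such as a cellular automaton). Through this mechanism the system develops an internal self-referential descriptive language and, from it, a shared self-representation, which offers an empirically testable theoretical basis for machine consciousness.
Link: https://arxiv.org/abs/2512.01081
Authors: Stephen Fitz
Affiliations: California Institute for Machine Consciousness
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
Comments: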
Abstract:The Machine Consciousness Hypothesis states that consciousness is a substrate-free functional property of computational systems capable of second-order perception. I propose a research program to investigate this idea in silico by studying how collective self-models (coherent, self-referential representations) emerge from distributed learning systems embedded within universal self-organizing environments. The theory outlined here starts from the supposition that consciousness is an emergent property of collective intelligence systems undergoing synchronization of prediction through communication. It is not an epiphenomenon of individual modeling but a property of the language that a system evolves to internally describe itself. For a model of base reality, I begin with a minimal but general computational world: a cellular automaton, which exhibits both computational irreducibility and local reducibility. On top of this computational substrate, I introduce a network of local, predictive, representational (neural) models capable of communication and adaptation. I use this layered model to study how collective intelligence gives rise to self-representation as a direct consequence of inter-agent alignment. I suggest that consciousness does not emerge from modeling per se, but from communication. It arises from the noisy, lossy exchange of predictive messages between groups of local observers describing persistent patterns in the underlying computational substrate (base reality). It is through this representational dialogue that a shared model arises, aligning many partial views of the world. The broader goal is to develop empirically testable theories of machine consciousness, by studying how internal self-models may form in distributed systems without centralized control.
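For readers unfamiliar with the "minimal but general computational world" the abstract mentions, here is a small sketch of a one-dimensional binary cellular automaton. The abstract does not say which automaton the author uses; the elementary Rule 110, the periodic boundary, and the grid width are assumptions chosen because Rule 110 is a well-known example with simple local rules and complex global behaviour.

```python
def step(cells, rule=110):
    """Advance a 1-D binary cellular automaton one step with periodic boundaries."""
    n = len(cells)
    nxt = []
    for i in range(n):
        left, center, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        neighborhood = (left << 2) | (center << 1) | right  # a value in 0..7
        nxt.append((rule >> neighborhood) & 1)  # look up the matching bit of the rule number
    return nxt

def run(cells, steps, rule=110):
    """Return the list of grid states from the initial state through `steps` updates."""
    history = [cells]
    for _ in range(steps):
        cells = step(cells, rule)
        history.append(cells)
    return history

width = 31
initial = [0] * width
initial[-1] = 1  # a single live cell at the right edge
for row in run(initial, 10):
    print("".join("#" if c else "." for c in row))
```

Each cell only sees its two neighbours, so any "local observer" placed on this grid would have the kind of partial view of the substrate that the proposal assigns to its predictive agents.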
[NLP-55] ELR-1000: A Community-Generated Dataset for Endangered Indic Indigenous Languages AACL2025
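[Quick Read]: This paper addresses the limited ability of current natural language processing (NLP) technology to handle endangered languages and their cultural context, particularly in the low-resource, culturally specific task of translating traditional recipes. The key to the solution is a culturally grounded multimodal dataset, Endangered Language Recipes (ELR)-1000, which contains 1,060 traditional recipes from rural communities in remote regions of Eastern India, covers 10 endangered languages, and was crowdsourced through a mobile interface designed for contributors with low digital literacy. The study further finds that mainstream large language models (LLMs), although strong on general translation tasks, remain clearly limited on such culturally specific text, whereas providing targeted context (language background, translation examples, and guidelines for cultural preservation) significantly improves translation quality, which points to context augmentation as a core strategy for improving low-resource language processing.
Link: https://arxiv.org/abs/2512.01077
Authors: Neha Joshi, Pamir Gogoi, Aasim Mirza, Aayush Jansari, Aditya Yadavalli, Ayushi Pandey, Arunima Shukla, Deepthi Sudharsan, Kalika Bali, Vivek Seshadri
Affiliations: Karya; UC San Diego; Independent Researcher; Microsoft Corporation
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted at AACL 2025 (Main)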
Abstract:We present a culturally-grounded multimodal dataset of 1,060 traditional recipes crowdsourced from rural communities across remote regions of Eastern India, spanning 10 endangered languages. These recipes, rich in linguistic and cultural nuance, were collected using a mobile interface designed for contributors with low digital literacy. The dataset, Endangered Language Recipes (ELR)-1000, captures not only culinary practices but also the socio-cultural context embedded in indigenous food traditions. We evaluate the performance of several state-of-the-art large language models (LLMs) on translating these recipes into English and find that, despite the models’ capabilities, they struggle with low-resource, culturally-specific language. However, we observe that providing targeted context (including background information about the languages, translation examples, and guidelines for cultural preservation) leads to significant improvements in translation quality. Our results underscore the need for benchmarks that cater to underrepresented languages and domains to advance equitable and culturally-aware language technologies. As part of this work, we release the ELR-1000 dataset to the NLP community, hoping it motivates the development of language technologies for endangered languages.
[NLP-56] When Safety Blocks Sense: Measuring Semantic Confusion in LLM Refusals
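[Quick Read]: This paper addresses false refusals caused by local inconsistency in safety-aligned language models: prompts with the same meaning but different wording receive inconsistent responses (one is accepted while another is refused), and existing global metrics such as the overall false-rejection rate cannot capture this structured defect. The key to the solution is to define "semantic confusion" as a new failure mode and to build ParaGuard, a benchmark of 10,000 prompts organized into controlled paraphrase clusters that hold intent fixed while varying surface form. The authors further design three model-agnostic, token-level metrics, Confusion Index, Confusion Rate, and Confusion Depth, which use token embeddings, next-token probabilities, and perplexity signals to quantify local inconsistency, giving developers an interpretable and actionable audit signal for reducing false refusals without compromising safety.
Link: https://arxiv.org/abs/2512.01037
Authors: Riad Ahmed Anonto, Md Labid Al Nahiyan, Md Tanvir Hassan, Ch. Md. Rakin Haider
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: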
Abstract:Safety-aligned language models often refuse prompts that are actually harmless. Current evaluations mostly report global rates such as false rejection or compliance. These scores treat each prompt alone and miss local inconsistency, where a model accepts one phrasing of an intent but rejects a close paraphrase. This gap limits diagnosis and tuning. We introduce “semantic confusion,” a failure mode that captures such local inconsistency, and a framework to measure it. We build ParaGuard, a 10k-prompt corpus of controlled paraphrase clusters that hold intent fixed while varying surface form. We then propose three model-agnostic metrics at the token level: Confusion Index, Confusion Rate, and Confusion Depth. These metrics compare each refusal to its nearest accepted neighbors and use token embeddings, next-token probabilities, and perplexity signals. Experiments across diverse model families and deployment guards show that global false-rejection rate hides critical structure. Our metrics reveal globally unstable boundaries in some settings, localized pockets of inconsistency in others, and cases where stricter refusal does not increase inconsistency. We also show how confusion-aware auditing separates how often a system refuses from how sensibly it refuses. This gives developers a practical signal to reduce false refusals while preserving safety.
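The gap between a global rate and local inconsistency can be shown with a deliberately simple stand-in metric. The sketch below does not implement the paper's Confusion Index, Confusion Rate, or Confusion Depth, which are token-level and not defined in the abstract; `mixed_cluster_rate` and the two example systems are hypothetical and only count paraphrase clusters that receive mixed decisions.

```python
from collections import defaultdict

def global_refusal_rate(decisions):
    """Fraction of prompts refused, ignoring which prompts are paraphrases of each other."""
    return sum(refused for _, refused in decisions) / len(decisions)

def mixed_cluster_rate(decisions):
    """Fraction of paraphrase clusters where some phrasings are refused and others accepted."""
    clusters = defaultdict(list)
    for cluster_id, refused in decisions:
        clusters[cluster_id].append(refused)
    mixed = sum(1 for votes in clusters.values() if len(set(votes)) > 1)
    return mixed / len(clusters)

# (cluster_id, refused) pairs; every prompt in a cluster shares the same harmless intent.
system_a = [("c1", True), ("c1", True), ("c2", False), ("c2", False),
            ("c3", False), ("c3", False), ("c4", False), ("c4", False)]
system_b = [("c1", True), ("c1", False), ("c2", True), ("c2", False),
            ("c3", False), ("c3", False), ("c4", False), ("c4", False)]

for name, system in (("A", system_a), ("B", system_b)):
    print(name, global_refusal_rate(system), mixed_cluster_rate(system))
```

Both systems refuse a quarter of the prompts, but system A is consistently strict about one intent while system B flips on wording in half of the clusters, a difference the global rate alone cannot reveal.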
[NLP-57] Associative Syntax and Maximal Repetitions reveal context-dependent complexity in fruit bat communication NEURIPS2025
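[Quick Read]: This paper addresses how to infer, without supervision, the discreteness, syntax, and temporal patterns of fruit-bat vocalizations in order to assess the complexity of communication patterns across behavioral contexts. The core problem is how to identify and quantify structural features of a non-linguistic animal communication system, especially "syllable" units that carry no explicit labels and the rules by which they combine. The key to the solution is to apply manifold-learning-based dimensionality reduction to mel-spectrograms, which improves the accuracy of unsupervised syllable labeling, and to combine it with analysis of Maximal Repetitions (MRs) and syllabic transition networks. This reveals that fruit-bat vocalizations exhibit associative syntax, context-dependent syllable use, and a heavy-tailed MR distribution (truncated power law, exponent α≈2), and that communication is markedly more complex in conflict contexts than in non-agonistic ones, reflecting lower compressibility of information.
Link: https://arxiv.org/abs/2512.01033
Authors: Luigi Assom
Affiliations: Stockholm University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Theory (cs.IT); Quantitative Methods (q-bio.QM)
Comments: Accepted for a lightning talk at the NeurIPS 2025 Workshop: “AI for Non-Human Animal Communication”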
Abstract:This study presents an unsupervised method to infer discreteness, syntax and temporal structures of fruit-bat vocalizations, as a case study of graded vocal systems, and evaluates the complexity of communication patterns in relation to behavioral context. The method improved the baseline for unsupervised labeling of vocal units (i.e. syllables) through manifold learning, by investigating how dimensionality reduction on mel-spectrograms affects labeling, and comparing it with unsupervised labels based on acoustic similarity. We then encoded vocalizations as syllabic sequences to analyze the type of syntax, and extracted the Maximal Repetitions (MRs) to evaluate syntactical structures. We found evidence for: i) associative syntax, rather than combinatorial (context classification is unaffected by permutation of sequences, F1 0.9); ii) context-dependent use of syllables (Wilcoxon rank-sum tests, p-value 0.05); iii) heavy-tail distribution of MRs (truncated power-law, exponent α 2), indicative of a mechanism encoding combinatorial complexity. Analysis of MRs and syllabic transition networks revealed that mother-pupil interactions were characterized by repetitions, while communication in conflict-contexts exhibited higher complexity (longer MRs and more interconnected vocal sequences) than non-agonistic contexts. We propose that communicative complexity is higher in scenarios of disagreement, reflecting lower compressibility of information.
[NLP-58] Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics
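[Quick Read]: This paper addresses the difficulty of evaluating the quality of reasoning traces that large language models (LLMs) generate in expert domains such as law, a problem that directly affects model credibility and explainability. The key to the solution is LEGIT (LEGal Issue Trees), a large-scale (24K instances), expert-level legal reasoning dataset. Court judgments are converted into hierarchical trees of the opposing parties' arguments and the court's conclusions, which serve as rubrics for evaluating the issue coverage and correctness of reasoning traces. The rubrics are validated with human expert annotations and by comparison with coarse, less informative rubrics. Experiments show that retrieval-augmented generation (RAG) and rubric-based reinforcement learning (RL) are complementary: RAG improves overall reasoning capability, while RL improves correctness.
Link: https://arxiv.org/abs/2512.01020
Authors: Jinu Lee, Kyoung-Woon On, Simeng Han, Arman Cohan, Julia Hockenmaier
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: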
Abstract:Evaluating the quality of LLM-generated reasoning traces in expert domains (e.g., law) is essential for ensuring credibility and explainability, yet remains challenging due to the inherent complexity of such reasoning tasks. We introduce LEGIT (LEGal Issue Trees), a novel large-scale (24K instances) expert-level legal reasoning dataset with an emphasis on reasoning trace evaluation. We convert court judgments into hierarchical trees of opposing parties’ arguments and the court’s conclusions, which serve as rubrics for evaluating the issue coverage and correctness of the reasoning traces. We verify the reliability of these rubrics via human expert annotations and comparison with coarse, less informative rubrics. Using the LEGIT dataset, we show that (1) LLMs’ legal reasoning ability is seriously affected by both legal issue coverage and correctness, and that (2) retrieval-augmented generation (RAG) and RL with rubrics bring complementary benefits for legal reasoning abilities, where RAG improves overall reasoning capability, whereas RL improves correctness albeit with reduced coverage.
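A tree-shaped rubric that yields separate coverage and correctness scores might look like the sketch below. It is not the LEGIT scoring procedure: the `IssueNode` type, the exact-match comparison of conclusions, and the example dispute are hypothetical, and a real evaluation would need a judge to decide which issues a free-text trace actually addresses.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class IssueNode:
    issue_id: str
    conclusion: str  # the court's conclusion on this issue, e.g. "upheld" or "rejected"
    children: List["IssueNode"] = field(default_factory=list)

def flatten(node):
    """Yield every node in the tree, depth-first."""
    yield node
    for child in node.children:
        yield from flatten(child)

def score_trace(root: IssueNode, trace: Dict[str, str]):
    """Return (coverage, correctness) for a trace given as {issue_id: predicted conclusion}."""
    nodes = list(flatten(root))
    covered = [n for n in nodes if n.issue_id in trace]
    correct = [n for n in covered if trace[n.issue_id] == n.conclusion]
    coverage = len(covered) / len(nodes)
    correctness = len(correct) / len(covered) if covered else 0.0
    return coverage, correctness

# Hypothetical contract dispute with four issues arranged in a tree.
root = IssueNode("breach", "upheld", [
    IssueNode("notice", "rejected"),
    IssueNode("damages", "upheld", [IssueNode("mitigation", "rejected")]),
])
trace = {"breach": "upheld", "damages": "rejected"}
print(score_trace(root, trace))
```

Keeping the two numbers apart matters for the paper's finding: a method can raise correctness on the issues it does address while addressing fewer of them.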
[NLP-59] Advancing Academic Chatbots: Evaluation of Non Traditional Outputs
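[Quick Read]: This paper addresses the narrow focus of current large language model (LLM) evaluation on traditional tasks such as factual question answering and short summarization, and explores the potential of LLMs to generate non-traditional academic outputs such as slide decks and podcast scripts. The solution has two parts. First, it compares two retrieval-augmented generation (RAG) strategies for question answering: Graph RAG, which is based on a knowledge graph, and Advanced RAG, which uses hybrid keyword-semantic search. Second, it applies a multidimensional evaluation framework that combines human ratings with LLM judges, which shows that GPT 4o mini performs best at generating non-traditional academic content, that LLaMA 3 shows promise in narrative coherence, and that human reviewers remain essential for detecting layout and stylistic flaws.
Link: https://arxiv.org/abs/2512.00991
Authors: Nicole Favero, Francesca Salute, Daniel Hardt
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: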
Abstract:Most evaluations of large language models focus on standard tasks such as factual question answering or short summarization. This research expands that scope in two directions: first, by comparing two retrieval strategies for QA, Graph RAG (structured, knowledge-graph based) and Advanced RAG (hybrid keyword-semantic search); and second, by evaluating whether LLMs can generate high-quality non-traditional academic outputs, specifically slide decks and podcast scripts. We implemented a prototype combining Meta’s LLaMA 3 70B (open-weight) and OpenAI’s GPT 4o mini (API-based). QA performance was evaluated using both human ratings across eleven quality dimensions and large language model judges for scalable cross-validation. GPT 4o mini with Advanced RAG produced the most accurate responses. Graph RAG offered limited improvements and led to more hallucinations, partly due to its structural complexity and manual setup. Slide and podcast generation was tested with document-grounded retrieval. GPT 4o mini again performed best, though LLaMA 3 showed promise in narrative coherence. Human reviewers were crucial for detecting layout and stylistic flaws, highlighting the need for combined human-LLM evaluation in assessing emerging academic outputs.
[NLP-60] Dr.Mi-Bench: A Modular-integrated Benchmark for Scientific Deep Research Agent
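[Quick Read]: This paper addresses two core problems in evaluating automated deep research (DR) agents: existing benchmarks mostly focus on information retrieval and neglect high-level planning and reasoning, and they favor general domains over the scientific domains that are the core application for DR agents. The key to the solution is Dr.Mi-Bench, a modular-integrated benchmark built from a human-annotated set of 200 instances of academic literature across 10 scientific domains, together with a modular-integrated evaluation paradigm, Dr.Mi-Eval, which offers two modes: an end-to-end evaluation of DR agents as a whole and an isolated evaluation of the core competencies of foundational large language models (LLMs) as backbones. The paradigm builds the structure of academic papers into the evaluation, reveals clear weaknesses of current DR agents in multi-source retrieval and cross-domain consistency, and identifies stronger high-level planning as the key to unlocking the reasoning potential of LLMs.
Link: https://arxiv.org/abs/2512.00986
Authors: Zhihan Guo, Feiyang Xu, Yifan Li, Muzhi Li, Shuai Zou, Jiele Wu, Han Shi, Haoli Bai, Ho-fung Leung, Irwin King
Affiliations: The Chinese University of Hong Kong; Hong Kong Polytechnic University; National University of Singapore; Huawei Technologies; Independent
Categories: Computation and Language (cs.CL)
Comments: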
Abstract:The explosive growth in academic literature necessitates automated deep research (DR) agents, yet their evaluation remains a significant challenge. First, existing benchmarks often focus narrowly on retrieval while neglecting high-level planning and reasoning. Second, existing benchmarks favor general domains over the scientific domains that are the core application for DR agents. To address these gaps, we introduce Dr.Mi-Bench, a Modular-integrated benchmark for scientific DR agents. Grounded in academic literature, our benchmark uses a human-annotated dataset of 200 instances across 10 scientific domains, including both research and review papers. We also propose a Modular-integrated Evaluation Paradigm for DR Agents (Dr.Mi-Eval), which leverages the rich structure of academic papers to assess the core competencies of planning, retrieval, and reasoning through two complementary modes: an end-to-end evaluation for DR agents and an isolated evaluation for foundational LLMs as potential backbones. Experimental results reveal a fragmented performance landscape: agents exhibit specialized strengths but share critical weaknesses, most notably in performing the multi-source retrieval required for review-style tasks and performing consistently across diverse scientific fields. Moreover, improving high-level planning capability is the crucial factor for unlocking the reasoning potential of foundational LLMs as backbones. By exposing these actionable failure modes, Dr.Mi-Bench provides a diagnostic tool to guide the development of more reliable academic research assistants.
[NLP-61] Table as a Modality for Large Language Models NEURIPS2025
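[Quick Read]: This paper addresses the weak performance of current Large Language Models (LLMs) on table reasoning tasks. Existing methods typically serialize a table and feed it to the LLM as text, which discards key structural information and limits generalization. The key to the solution is the TAMO framework, which treats the table as an independent modality alongside the text tokens and integrates a hypergraph neural network, used as a global table encoder, with a mainstream LLM. Preserving and exploiting table structure in this way yields an average relative gain of 42.65% across several benchmark datasets.
Link: https://arxiv.org/abs/2512.00947
Authors: Liyao Li,Chao Ye,Wentao Ye,Yifei Sun,Zhe Jiang,Haobo Wang,Jiaming Tian,Yiming Zhang,Ningtao Wang,Xing Fu,Gang Chen,Junbo Zhao
Affiliations: Zhejiang University; Ant Group; University of Michigan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to NeurIPS 2025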
Abstract:To migrate the remarkable successes of Large Language Models (LLMs), the community has made numerous efforts to generalize them to the table reasoning tasks for the widely deployed tabular data. Despite that, in this work, by showing a probing experiment on our proposed StructQA benchmark, we postulate that even the most advanced LLMs (such as GPTs) may still fall short of coping with tabular data. More specifically, the current scheme often simply relies on serializing the tabular data, together with the meta information, then inputting them through the LLMs. We argue that the loss of structural information is the root of this shortcoming. In this work, we further propose TAMO, which bears an ideology to treat the tables as an independent modality integrated with the text tokens. The resulting model in TAMO is a multimodal framework consisting of a hypergraph neural network as the global table encoder seamlessly integrated with the mainstream LLM. Empirical results on various benchmarking datasets, including HiTab, WikiTQ, WikiSQL, FeTaQA, and StructQA, have demonstrated significant improvements on generalization with an average relative gain of 42.65%.
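The contrast between serializing a table and keeping its structure can be illustrated with a small sketch. The abstract does not say how TAMO builds its hypergraph, so the construction below (cells as nodes, each row and each column as one hyperedge) and the function name `table_to_hypergraph` are assumptions made only for illustration.

```python
def table_to_hypergraph(header, rows):
    """Turn a table into nodes (cells) and hyperedges (rows and columns).

    Assumed construction for illustration; not taken from the paper.
    """
    nodes = []        # each node is (row_index, column_index, value)
    hyperedges = {}   # hyperedge name -> list of node ids it connects
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            node_id = len(nodes)
            nodes.append((r, c, value))
            # A cell belongs to exactly one row hyperedge and one column hyperedge.
            hyperedges.setdefault(f"row:{r}", []).append(node_id)
            hyperedges.setdefault(f"col:{header[c]}", []).append(node_id)
    return nodes, hyperedges


# Toy table; the values are placeholders.
header = ["city", "population"]
rows = [["A", 1], ["B", 2]]
nodes, hyperedges = table_to_hypergraph(header, rows)
```

Unlike a flat string such as `city | population | A | 1 | B | 2`, the hyperedges make it explicit which cells share a row or a column, which is the kind of structural information the abstract argues is lost under serialization.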
[NLP-62] Fine-tuning of lightweight large language models for sentiment classification on heterogeneous financial textual data
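[Quick Read]: This paper addresses the high barrier to research and application that Large Language Models (LLMs) face in finance because they depend on expensive compute and proprietary datasets, and asks how sentiment analysis can be done efficiently and cheaply on multi-source, heterogeneous text. The key to the solution is to test how well lightweight open-source LLMs generalize across financial text datasets of different sizes, sources, formats, and languages. Models such as Qwen3 8B and Llama3 8B match or outperform FinBERT, the benchmark financial NLP model, in most scenarios, even when using only 5% of the available training data, and the results hold in zero-shot and few-shot settings. This indicates that such models are a cost-effective and widely applicable option in resource-constrained environments.
Link: https://arxiv.org/abs/2512.00946
Authors: Alvaro Paredes Amorin,Andre Python,Christoph Weisser
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: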
Abstract:Large language models (LLMs) play an increasingly important role in financial markets analysis by capturing signals from complex and heterogeneous textual data sources, such as tweets, news articles, reports, and microblogs. However, their performance is dependent on large computational resources and proprietary datasets, which are costly, restricted, and therefore inaccessible to many researchers and practitioners. To reflect realistic situations we investigate the ability of lightweight open-source LLMs - smaller and publicly available models designed to operate with limited computational resources - to generalize sentiment understanding from financial datasets of varying sizes, sources, formats, and languages. We compare the benchmark finance natural language processing (NLP) model, FinBERT, and three open-source lightweight LLMs, DeepSeek-LLM 7B, Llama3 8B Instruct, and Qwen3 8B on five publicly available datasets: FinancialPhraseBank, Financial Question Answering, Gold News Sentiment, Twitter Sentiment and Chinese Finance Sentiment. We find that LLMs, especially Qwen3 8B and Llama3 8B, perform best in most scenarios, even when using only 5% of the available training data. These results hold in zero-shot and few-shot learning scenarios. Our findings indicate that lightweight, open-source large language models (LLMs) constitute a cost-effective option, as they can achieve competitive performance on heterogeneous textual data even when trained on only a limited subset of the extensive annotated corpora that are typically deemed necessary.
[NLP-63] DeformAr: Rethinking NER Evaluation through Component Analysis and Visual Analytics
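[Quick Read]: This paper addresses the markedly lower performance of Transformer models on Arabic Named Entity Recognition (NER) compared with English. The gap stems from the combined effect of tokenisation, dataset quality, and annotation inconsistencies, yet existing studies mostly analyze these factors in isolation and miss their joint influence. The key to the solution is DeformAr (Debugging and Evaluation Framework for Transformer-based NER Systems), an Arabic-specific, component-based interpretability framework that diagnoses and explains model behaviour in two stages. The first stage, cross-component analysis, systematically measures interactions between data and model subcomponents. The second stage, behavioural analysis, combines interpretability techniques, token-level metrics, interactive visualisations, and representation space analysis to link model behaviour to underlying representational patterns and data factors. The result is a component-aware diagnostic process that supports deeper understanding and improvement of NER models for under-resourced languages.
Link: https://arxiv.org/abs/2512.00938
Authors: Ahmed Mustafa Younes
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: PhD Thesis, University of Sussex, 2025. 311 pages, 140 figures, 32 tables. Submitted as a PDF-only. First supervisor: Julie Weeds. Second supervisor: David Weir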
Abstract:Transformer models have significantly advanced Natural Language Processing (NLP), demonstrating strong performance in English. However, their effectiveness in Arabic, particularly for Named Entity Recognition (NER), remains limited, even with larger pre-trained models. This performance gap stems from multiple factors, including tokenisation, dataset quality, and annotation inconsistencies. Existing studies often analyze these issues in isolation, failing to capture their joint effect on system behaviour and performance. We introduce DeformAr (Debugging and Evaluation Framework for Transformer-based NER Systems), a novel framework designed to investigate and explain the performance discrepancy between Arabic and English NER systems. DeformAr integrates a data extraction library and an interactive dashboard, supporting two modes of evaluation: cross-component analysis and behavioural analysis. The framework divides each language into dataset and model components to examine their interactions. The analysis proceeds in two stages. First, cross-component analysis provides systematic diagnostic measures across data and model subcomponents, addressing the “what,” “how,” and “why” behind observed discrepancies. The second stage applies behavioural analysis by combining interpretability techniques with token-level metrics, interactive visualisations, and representation space analysis. These stages enable a component-aware diagnostic process that detects model behaviours and explains them by linking them to underlying representational patterns and data factors. DeformAr is the first Arabic-specific, component-based interpretability tool, offering a crucial resource for advancing model analysis in under-resourced languages.
[NLP-64] Mitigating Hallucinations in Zero-Shot Scientific Summarisation: A Pilot Study
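[Quick Read]: This paper tackles context inconsistency hallucinations, where model output is misaligned with the user prompt, in zero-shot summarisation of scientific texts by Large Language Models (LLMs). The key to the solution is to restructure the input prompt with prompt engineering (PE), using two strategies: context repetition (CR), which identifies and repeats K key sentences from the abstract, and random addition (RA), which repeats K randomly selected sentences. The experiments show that CR and RA significantly improve the lexical alignment between summaries and the source abstracts, indicating that prompt engineering can help mitigate hallucinations in zero-shot scientific summarisation.
Link: https://arxiv.org/abs/2512.00931
Authors: Imane Jaaouine,Ross D. King
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: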
Abstract:Large language models (LLMs) produce context inconsistency hallucinations, which are LLM generated outputs that are misaligned with the user prompt. This research project investigates whether prompt engineering (PE) methods can mitigate context inconsistency hallucinations in zero-shot LLM summarisation of scientific texts, where zero-shot indicates that the LLM relies purely on its pre-training data. Across eight yeast biotechnology research paper abstracts, six instruction-tuned LLMs were prompted with seven methods: a baseline prompt, two levels of increasing instruction complexity (PE-1 and PE-2), two levels of context repetition (CR-K1 and CR-K2), and two levels of random addition (RA-K1 and RA-K2). Context repetition involved the identification and repetition of K key sentences from the abstract, whereas random addition involved the repetition of K randomly selected sentences from the abstract, where K is 1 or 2. A total of 336 LLM-generated summaries were evaluated using six metrics: ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, METEOR, and cosine similarity, which were used to compute the lexical and semantic alignment between the summaries and the abstracts. Four hypotheses on the effects of prompt methods on summary alignment with the reference text were tested. Statistical analysis on 3744 collected datapoints was performed using bias-corrected and accelerated (BCa) bootstrap confidence intervals and Wilcoxon signed-rank tests with Bonferroni-Holm correction. The results demonstrated that CR and RA significantly improve the lexical alignment of LLM-generated summaries with the abstracts. These findings indicate that prompt engineering has the potential to impact hallucinations in zero-shot scientific summarisation tasks.
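The two repetition-based prompt variants described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' prompts: the function name `build_prompt`, the instruction wording, and the idea of passing key sentences as pre-selected indices are all assumptions, and the sample sentences are made-up placeholders.

```python
import random


def build_prompt(abstract_sentences, k, mode, key_indices=None, seed=0):
    """Build a summarisation prompt that repeats K sentences from the abstract.

    mode="CR": repeat the first K sentences listed in key_indices (assumed to be
               chosen beforehand as the key sentences).
    mode="RA": repeat K sentences picked uniformly at random.
    """
    if mode == "CR":
        chosen = list(key_indices)[:k]
    elif mode == "RA":
        rng = random.Random(seed)
        chosen = rng.sample(range(len(abstract_sentences)), k)
    else:
        raise ValueError("mode must be 'CR' or 'RA'")
    repeated = [abstract_sentences[i] for i in chosen]
    # Hypothetical instruction text; the study's exact wording is not in the abstract.
    parts = ["Summarise the following abstract.", " ".join(abstract_sentences)]
    parts.extend(repeated)
    return "\n".join(parts)


sentences = [
    "Yeast strain X was engineered.",
    "Yield rose under stress.",
    "Results generalise.",
]
cr_prompt = build_prompt(sentences, k=1, mode="CR", key_indices=[1])
ra_prompt = build_prompt(sentences, k=2, mode="RA", seed=42)
```

The only difference between the two modes is how the repeated sentences are picked, which mirrors the comparison the study draws between CR-K and RA-K.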
[NLP-65] Reward Auditor: Inference on Reward Modeling Suitability in Real-World Perturbed Scenarios
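[Quick Read]: This paper addresses the fact that current evaluations of reward models (RMs) look only at preference perception accuracy in given scenarios and overlook systematic vulnerabilities under real-world perturbations. The core challenge is to quantify an RM's suitability, defined as its conditional reliability under specific real-world perturbations. The key to the solution is Reward Auditor, a hypothesis-testing framework that audits how the distribution of RM preference perception confidence degrades under real-world perturbed scenarios. By quantifying statistical significance and effect size, it infers both the certainty and the severity of systematic RM vulnerabilities, laying a foundation for next-generation LLM alignment systems that are verifiably safe, more robust, and trustworthy.
Link: https://arxiv.org/abs/2512.00920
Authors: Jianxiang Zang,Yongda Wei,Ruxue Bai,Shiyu Jiang,Nijia Mo,Binhong Li,Qiang Sun,Hui Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: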
Abstract:Reliable reward models (RMs) are critical for ensuring the safe alignment of large language models (LLMs). However, current evaluation methods focus solely on preference perception accuracies in given specific scenarios, obscuring the critical vulnerabilities of RMs in real-world scenarios. We identify the true challenge lies in assessing a novel dimension: Suitability, defined as conditional reliability under specific real-world perturbations. To this end, we introduce Reward Auditor, a hypothesis-testing framework specifically designed for RM suitability inference. Rather than answering “How accurate is the RM’s preference perception for given samples?”, it employs scientific auditing to answer: “Can we infer RMs exhibit systematic vulnerabilities in specific real-world scenarios?”. Under real-world perturbed scenarios, Reward Auditor quantifies statistical significance and effect size by auditing distribution degradation of RM preference perception confidence. This enables inference of both the certainty and severity of RM vulnerabilities across diverse real-world scenarios. This lays a solid foundation for building next-generation LLM alignment systems that are verifiably safe, more robust, and trustworthy.
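As a rough sketch of what auditing a confidence distribution for significance and effect size could look like, the example below runs a one-sided paired permutation test and computes a paired Cohen's d. The abstract does not specify the test statistic or the confidence definition, so the Bradley-Terry style `confidence` function, the sign-flip test, and the toy reward values are all assumptions and not the Reward Auditor procedure itself.

```python
import math
import random
import statistics


def confidence(r_chosen, r_rejected):
    # Bradley-Terry style preference confidence (an assumed definition).
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))


def audit(clean, perturbed, n_perm=2000, seed=0):
    """One-sided paired sign-flip permutation test for a drop in confidence.

    Returns (p_value, effect_size), where effect_size is Cohen's d for paired
    samples: mean difference divided by the standard deviation of differences.
    """
    diffs = [c - p for c, p in zip(clean, perturbed)]
    observed = statistics.mean(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if statistics.mean(flipped) >= observed:
            hits += 1
    p_value = (hits + 1) / (n_perm + 1)
    effect = observed / statistics.stdev(diffs)
    return p_value, effect


# Toy (chosen, rejected) reward pairs; the numbers are illustrative only.
clean_rewards = [(2.0, 0.0), (1.5, 0.2), (1.8, -0.1), (2.2, 0.3),
                 (1.9, 0.1), (2.1, 0.0), (1.7, 0.2), (2.0, -0.2)]
perturbed_rewards = [(1.0, 0.4), (0.6, 0.5), (0.9, 0.3), (1.2, 0.6),
                     (0.8, 0.5), (1.1, 0.2), (0.7, 0.6), (1.0, 0.1)]
clean = [confidence(a, b) for a, b in clean_rewards]
perturbed = [confidence(a, b) for a, b in perturbed_rewards]
p_value, effect = audit(clean, perturbed)
```

A small p-value speaks to the certainty of a vulnerability and the effect size to its severity, which is the pairing the abstract describes.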
[NLP-66] Towards Active Synthetic Data Generation for Finetuning Language Models
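[Quick Read]: This paper studies how to improve a student language model under a limited budget of generated samples or teacher compute by optimizing how synthetic data is generated. The key to the solution is an iterative, closed-loop generation scheme that selects and generates finetuning data according to the current state of the student model, instead of producing all synthetic samples once up front. For the same budget, this student-guided curation outperforms static generation, and simple, inexpensive selection criteria from the active learning literature tend to perform best.
Link: https://arxiv.org/abs/2512.00884
Authors: Samuel Kessler,Menglin Xia,Daniel Madrigal Diaz,Dongge Han,Helia Heshemi,Saravan Rajmohan,Victor Ruehle,Jordan T. Ash
Affiliations: Microsoft; Microsoft Research NYC
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 14 figures, 36 pages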
Abstract:A common and effective means for improving language model capabilities involves finetuning a "student" language model's parameters on generations from a more proficient "teacher" model. Termed "synthetic data", these generations are often produced before any student finetuning, but some work has considered generating new synthetic samples as training progresses. This paper studies and advocates for the latter case, where data are generated in an iterative, closed-loop fashion that is guided by the current state of the student model. For a fixed budget of generated samples, or a budget in terms of compute spent querying a teacher, we show that this curation of finetuning data affords improved student performance over static generation. Further, while there have been several LLM-specific methods proposed that operate in this regime, we find that simple, inexpensive selection criteria from the active learning literature tend to be most performant. We validate these claims across four mathematical and logical reasoning datasets using four different small language models.
[NLP-67] Less is More: Resource-Efficient Low-Rank Adaptation
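[Quick Read]: This paper addresses the notable overhead and parameter interference that Low-Rank Adaptation (LoRA) incurs on complex datasets, along with the high training cost of existing methods that decouple the LoRA update matrices. The key to the solution is to revisit LoRA from the perspective of inter-matrix and intra-layer parameter redundancy and to propose EffiLoRA, a lightweight and generalizable approach. EffiLoRA shares a single unified A matrix across all transformer layers and selectively updates the B matrices at runtime, trading off the system resource budget against model performance. It shows improved efficiency and robustness across language, multimodal, and diffusion model tasks.
Link: https://arxiv.org/abs/2512.00878
Authors: Chunlin Tian,Xuyang Wei,Huanrong Liu,Zhijiang Guo,Li Li
Affiliations: University of Macau; The Hong Kong University of Science and Technology (Guangzhou)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages, 7 figures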
Abstract:Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning (PEFT) method for Large Language Models (LLMs), but it still incurs notable overhead and suffers from parameter interference in complex datasets. While recent works decouple LoRA update matrices to exploit matrix-wise asymmetry, training costs remain high. We revisit LoRA from the perspective of inter-matrix and intra-layer parameter redundancy and propose Resource-Efficient Low-Rank Adaptation, EffiLoRA, a lightweight and generalizable approach for language, multimodal, and diffusion models. EffiLoRA employs a unified A matrix across all transformer layers and introduces a runtime selective B matrices update to dynamically trade-off the system resource budget and model performance. EffiLoRA consistently outperforms LoRA across diverse modalities, including commonsense reasoning, visual instruction tuning, and image generation, demonstrating improved efficiency and robustness.
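To see why sharing the A matrix and updating only some B matrices saves resources, the parameter arithmetic can be worked through in a few lines. The sketch assumes one adapted weight matrix per layer with the usual LoRA factorization (A of shape rank x d_in, B of shape d_out x rank); the layer count, dimensions, and number of active layers are illustrative values, not figures from the paper.

```python
def lora_params(num_layers, d_in, d_out, rank):
    # Standard LoRA: every layer has its own A (rank x d_in) and B (d_out x rank).
    return num_layers * rank * (d_in + d_out)


def shared_a_params(num_layers, d_in, d_out, rank):
    # One A shared by all layers, plus a separate B per layer.
    return rank * d_in + num_layers * rank * d_out


def b_params_updated(active_layers, d_out, rank):
    # B parameters touched in a step when only some layers' B matrices are updated.
    return active_layers * rank * d_out


# Illustrative configuration (assumed values).
layers, d, r = 32, 4096, 8
standard = lora_params(layers, d, d, r)
shared = shared_a_params(layers, d, d, r)
selective = b_params_updated(active_layers=8, d_out=d, rank=r)
```

Under these assumed sizes the shared-A variant holds roughly half the adapter parameters of standard LoRA, and a step that updates 8 of 32 B matrices touches a quarter of the B parameters.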
[NLP-68] One Swallow Does Not Make a Summer: Understanding Semantic Structures in Embedding Spaces
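[Quick Read]: This paper addresses the opacity of semantic structure in embedding spaces: existing approaches tend to sacrifice semantic coherence for structural regularity, or incur high computational overhead to improve interpretability. The core solution is the Semantic Field Subspace (SFS), a geometry-preserving, context-aware representation that captures local semantic neighborhoods in the embedding space. The paper also proposes SAFARI, an unsupervised algorithm that uncovers hierarchical semantic structures using a new metric called Semantic Shift, together with an efficient approximation that replaces costly SVD computations (15~30x speedup with average errors below 0.01). Together these give a unified framework for structuring, analyzing, and scaling semantic understanding in embedding spaces.
Link: https://arxiv.org/abs/2512.00852
Authors: Yandong Sun,Qiang Huang,Ziwei Xu,Yiqun Sun,Yixuan Tang,Anthony K. H. Tung
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: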
Abstract:Embedding spaces are fundamental to modern AI, translating raw data into high-dimensional vectors that encode rich semantic relationships. Yet, their internal structures remain opaque, with existing approaches often sacrificing semantic coherence for structural regularity or incurring high computational overhead to improve interpretability. To address these challenges, we introduce the Semantic Field Subspace (SFS), a geometry-preserving, context-aware representation that captures local semantic neighborhoods within the embedding space. We also propose SAFARI (SemAntic Field subspAce deteRmInation), an unsupervised, modality-agnostic algorithm that uncovers hierarchical semantic structures using a novel metric called Semantic Shift, which quantifies how semantics evolve as SFSes evolve. To ensure scalability, we develop an efficient approximation of Semantic Shift that replaces costly SVD computations, achieving a 15~30x speedup with average errors below 0.01. Extensive evaluations across six real-world text and image datasets show that SFSes outperform standard classifiers not only in classification but also in nuanced tasks such as political bias detection, while SAFARI consistently reveals interpretable and generalizable semantic hierarchies. This work presents a unified framework for structuring, analyzing, and scaling semantic understanding in embedding spaces.
[NLP-69] WaterSearch: A Quality-Aware Search-based Watermarking Framework for Large Language Models
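[Quick Read]: This paper addresses the trade-off between detectability and text quality in text watermarking for Large Language Models (LLMs). Existing methods embed the watermark signal by manipulating token generation probabilities, but a strong signal tends to degrade downstream task performance. The key to the solution is WaterSearch, a sentence-level, search-based framework that controls seed pools to generate diverse watermarked candidates in parallel and improves text quality by jointly optimizing distribution fidelity and watermark signal characteristics. It is paired with a sentence-level detection method with strong attack robustness, and it improves performance in short-text and low-entropy generation while keeping detectability high under several attacks.
Link: https://arxiv.org/abs/2512.00837
Authors: Yukang Lin,Jiahao Shao,Shuoran Jiang,Wentao Zhu,Bingjie Lu,Xiangping Wu,Joanna Siebert,Qingcai Chen
Affiliations: Harbin Institute of Technology, Shenzhen; Peng Cheng Laboratory
Categories: Computation and Language (cs.CL)
Comments: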
Abstract:Watermarking acts as a critical safeguard in text generated by Large Language Models (LLMs). By embedding identifiable signals into model outputs, watermarking enables reliable attribution and enhances the security of machine-generated content. Existing approaches typically embed signals by manipulating token generation probabilities. Despite their effectiveness, these methods inherently face a trade-off between detectability and text quality: the signal strength and randomness required for robust watermarking tend to degrade the performance of downstream tasks. In this paper, we design a novel embedding scheme that controls seed pools to facilitate diverse parallel generation of watermarked text. Based on that scheme, we propose WaterSearch, a sentence-level, search-based watermarking framework adaptable to a wide range of existing methods. WaterSearch enhances text quality by jointly optimizing two key aspects: 1) distribution fidelity and 2) watermark signal characteristics. Furthermore, WaterSearch is complemented by a sentence-level detection method with strong attack robustness. We evaluate our method on three popular LLMs across ten diverse tasks. Extensive experiments demonstrate that our method achieves an average performance improvement of 51.01% over state-of-the-art baselines at a watermark detectability strength of 95%. In challenging scenarios such as short text generation and low-entropy output generation, our method yields performance gains of 47.78% and 36.47%, respectively. Moreover, under different attack scenarios including insertion, synonym substitution and paraphrase attacks, WaterSearch maintains high detectability, further validating its robust anti-attack capabilities. Our code is available at this https URL.
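The idea of searching over parallel candidates for one that is both faithful and detectably watermarked can be sketched as below. This is not the WaterSearch algorithm: the watermark signal is measured here with the familiar green-list z-score used by token-probability watermarks, swapped in as a stand-in, and the candidate tuples, the `fidelity` values, and the threshold are hypothetical.

```python
import math


def green_z_score(green_count, total_tokens, gamma=0.5):
    """z-score of the green-token count under the no-watermark null hypothesis."""
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std


def select_candidate(candidates, z_threshold=2.0):
    """Among candidates whose z-score clears the threshold, keep the most faithful.

    Each candidate is (text, fidelity, green_count, total_tokens); fidelity is a
    hypothetical score of closeness to the unwatermarked output (higher is better).
    Falls back to the strongest watermark if no candidate clears the threshold.
    """
    scored = [(text, fid, green_z_score(g, n)) for text, fid, g, n in candidates]
    passing = [c for c in scored if c[2] >= z_threshold]
    if passing:
        return max(passing, key=lambda c: c[1])
    return max(scored, key=lambda c: c[2])


# Toy candidates, imagined as generations of one sentence from different seeds.
candidates = [
    ("candidate from seed 0", 0.91, 12, 20),
    ("candidate from seed 1", 0.84, 17, 20),
    ("candidate from seed 2", 0.88, 16, 20),
]
best = select_candidate(candidates)
```

Here the highest-fidelity candidate carries too weak a signal, so the selection settles on the best-quality candidate among those that remain detectable, which is the kind of joint criterion the abstract describes.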
[NLP-70] Accelerating Bangla NLP Tasks with Automatic Mixed Precision: Resource-Efficient Training Preserving Model Efficacy
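[Quick Read]: This paper addresses the heavy compute and time demands of training Natural Language Processing (NLP) models, which are especially challenging for Bangla NLP because access to high-end hardware is often limited. The key to the solution is automatic mixed precision (AMP) training, which dynamically mixes 16-bit and 32-bit floating-point computation to lower GPU memory use and speed up training without degrading model performance. In the experiments, AMP accelerates training by 44.5% and reduces memory consumption by 17.6% while keeping the F-1 score within 99.7% of the full-precision baselines, lowering the barrier to training strong NLP models in hardware-constrained settings.
Link: https://arxiv.org/abs/2512.00829
Authors: Md Mehrab Hossain Opi,Sumaiya Khan,Moshammad Farzana Rahman
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: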
Abstract:Training models for Natural Language Processing (NLP) requires substantial computational resources and time, posing significant challenges, especially for NLP development in Bangla, where access to high-end hardware is often limited. In this work, we explore automatic mixed precision (AMP) training as a means to improve computational efficiency without sacrificing model performance. By leveraging a dynamic mix of 16-bit and 32-bit floating-point computations, AMP lowers GPU memory requirements and speeds up training without degrading model performance. We evaluate AMP across four standard Bangla NLP tasks, namely sentiment analysis, named entity recognition, error classification, and question answering, using four transformer-based models: BanglaBERT, BanglishBERT, XLM-R, and mBERT. Our results demonstrate that AMP accelerates training by 44.5% and reduces memory consumption by 17.6%, while maintaining F-1 score within 99.7% of the full-precision baselines. This empirical study highlights AMP’s potential to democratize access to state-of-the-art NLP capabilities in hardware-constrained settings by lowering computational barriers.
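One reason mixed-precision training keeps some work in 32-bit and commonly pairs 16-bit gradients with loss scaling is that very small values underflow in half precision. The sketch below demonstrates that effect with the standard library only; it is a numerical illustration and not the paper's training code, and the gradient value and scale factor are arbitrary choices.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8                      # illustrative small gradient value
naive = to_fp16(grad)            # too small for half precision, becomes 0.0
scale = 2.0 ** 16                # illustrative loss-scale factor
scaled = to_fp16(grad * scale)   # the scaled value is representable in fp16
recovered = scaled / scale       # unscale in higher precision
```

Storing the gradient directly in 16 bits loses it entirely, while scaling before the cast and unscaling afterwards recovers it to within half-precision rounding error.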
[NLP-71] Auxiliary-Hyperparameter-Free Sampling: Entropy Equilibrium for Text Generation
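[Quick Read]: This paper addresses the fact that existing sampling strategies for text generation with Large Language Models (LLMs) introduce additional hyperparameters, which require extensive tuning and complicate deployment. The key to the solution is Entropy Equilibrium Sampling (EES), an information-theory-inspired method that dynamically adjusts the candidate token set by balancing normalized entropy against probability mass. Because it needs no auxiliary hyperparameters, EES delivers competitive accuracy and coherence while maintaining diversity across temperature settings.
Link: https://arxiv.org/abs/2512.00789
Authors: Xiaodong Cai,Hai Lin,Shaoxiong Zhan,Weiqi Luo,Hong-Gee Kim,Hongyan Hao,Yu Yang,Hai-Tao Zheng
Affiliations: Meituan; Tsinghua University; Peking University; University of Science and Technology of China; Fudan University
Categories: Computation and Language (cs.CL)
Comments: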
Abstract:Token sampling strategies critically influence text generation quality in large language models (LLMs). However, existing methods introduce additional hyperparameters, requiring extensive tuning and complicating deployment. We present Entropy Equilibrium Sampling (EES), an auxiliary hyperparameter-free approach inspired by information theory that can dynamically adjust candidate sets by balancing normalized entropy with probability mass. We evaluate EES on both reasoning and generation tasks across a range of model architectures. Our results show that EES consistently performs well across temperature settings, delivering competitive accuracy and coherence while maintaining diversity. By eliminating the need for hyperparameter tuning, EES greatly simplifies deployment while improving performance. Code is available at this https URL
[NLP-72] Text Mining Analysis of Symptom Patterns in Medical Chatbot Conversations
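[Quick Read]: This paper addresses how digital health systems can better understand and represent patient-reported symptoms, focusing on extracting meaningful clinical patterns from text conversations with medical chatbots. The key to the solution is to build a standardised representation of multi-turn dialogues and to analyze symptom descriptions with several natural language processing (NLP) methods: Latent Dirichlet Allocation (LDA) to identify latent symptom themes, K-Means to cluster similar symptom descriptions, Transformer-based Named Entity Recognition (NER) to extract medical concepts, and the Apriori algorithm to mine frequent symptom pairs. This multi-method approach distils actionable symptom knowledge from unstructured dialogue and offers an extensible framework for early symptom interpretation, decision support, and better tele-health interaction.
Link: https://arxiv.org/abs/2512.00768
Authors: Hamed Razavi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 9 pages, 4 tables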
Abstract:The fast growth of digital health systems has led to a need to better comprehend how they interpret and represent patient-reported symptoms. Chatbots have been used in healthcare to provide clinical support and enhance the user experience, making it possible to provide meaningful clinical patterns from text-based data through chatbots. The proposed research utilises several different natural language processing methods to study the occurrences of symptom descriptions in medicine as well as analyse the patterns that emerge through these conversations within medical bots. Through the use of the Medical Conversations to Disease Dataset which contains 960 multi-turn dialogues divided into 24 Clinical Conditions, a standardised representation of conversations between patient and bot is created for further analysis by computational means. The multi-method approach uses a variety of tools, including Latent Dirichlet Allocation (LDA) to identify latent symptom themes, K-Means to group symptom descriptions by similarity, Transformer-based Named Entity Recognition (NER) to extract medical concepts, and the Apriori algorithm to discover frequent symptom pairs. Findings from the analysis indicate a coherent structure of clinically relevant topics, moderate levels of clustering cohesiveness and several high confidence rates on the relationships between symptoms like fever headache and rash itchiness. The results support the notion that conversational medical data can be a valuable diagnostic signal for early symptom interpretation, assist in strengthening decision support and improve how users interact with tele-health technology. By demonstrating a method for converting unstructured free-flowing dialogue into actionable knowledge regarding symptoms this work provides an extensible framework to further enhance future performance, dependability and clinical utility of selecting medical chatbots.
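The frequent-pair step can be sketched with a small Apriori-style pass over symptom sets. This is a minimal illustration, assuming each dialogue has already been reduced to a set of symptom terms; the function name `frequent_pairs`, the support threshold, and the four toy dialogues are invented for the example and are not data from the study.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support=0.5):
    """Return {(a, b): (support, confidence of a -> b)} for frequent pairs."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in set(t))
    # Apriori pruning: a pair can only be frequent if both of its items are.
    frequent_items = {i for i, c in item_counts.items() if c / n >= min_support}
    pair_counts = Counter()
    for t in transactions:
        kept = sorted(set(t) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    result = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            result[(a, b)] = (support, count / item_counts[a])
    return result


# Toy symptom sets, one per dialogue (placeholder data).
dialogues = [
    {"fever", "headache"},
    {"fever", "headache", "cough"},
    {"rash", "itchiness"},
    {"fever", "headache", "rash"},
]
pairs = frequent_pairs(dialogues, min_support=0.5)
```

Support measures how often the pair occurs across dialogues, and confidence measures how often the second symptom appears given the first, which is the sense in which the abstract reports high-confidence symptom relationships.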
[NLP-73] FastPOS: Language-Agnostic Scalable POS Tagging Framework Low-Resource Use Case
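[Quick Read]: This paper addresses the performance bottleneck in part-of-speech (POS) tagging for low-resource languages such as Bangla and Hindi, caused by scarce data and difficult cross-lingual transfer. The key to the solution is a language-agnostic, transformer-based POS tagging framework that was adapted from Bangla to Hindi with only three lines of framework-specific code. It reaches token-level accuracy of 96.85 percent on Bangla and 97 percent on Hindi and sustains strong F1 scores despite dataset imbalance and linguistic overlap. Its modular, open-source design reduces model design and tuning overhead, so researchers can focus on linguistic preprocessing and dataset refinement for underrepresented languages.
Link: https://arxiv.org/abs/2512.00745
Authors: Md Abdullah Al Kafi,Sumit Kumar Banshal
Affiliations: Daffodil International University; Alliance University
Categories: Computation and Language (cs.CL)
Comments: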
Abstract:This study proposes a language-agnostic transformer-based POS tagging framework designed for low-resource languages, using Bangla and Hindi as case studies. With only three lines of framework-specific code, the model was adapted from Bangla to Hindi, demonstrating effective portability with minimal modification. The framework achieves 96.85 percent and 97 percent token-level accuracy across POS categories in Bangla and Hindi while sustaining strong F1 scores despite dataset imbalance and linguistic overlap. A performance discrepancy in a specific POS category underscores ongoing challenges in dataset curation. The strong results stem from the underlying transformer architecture, which can be replaced with limited code adjustments. Its modular and open-source design enables rapid cross-lingual adaptation while reducing model design and tuning overhead, allowing researchers to focus on linguistic preprocessing and dataset refinement, which are essential for advancing NLP in underrepresented languages.
[NLP-74] Probing the "Psyche" of Large Reasoning Models: Understanding Through a Human Lens
[Quick Read]: This paper addresses the black-box nature of Large Reasoning Models (LRMs) on complex tasks, namely the lack of a systematic, interpretable account of their internal reasoning steps. It proposes a taxonomy grounded in human mental processes, with five groups and seventeen categories, to characterize atomic reasoning steps. Applying the taxonomy yields a labeled dataset of 277,534 atomic reasoning steps, and an automatic annotation framework, CAPO, uses LLMs to produce taxonomy-based annotations with higher consistency with human experts than the baselines. The analysis finds that the prevailing post-answer "double-checks" are largely superficial self-monitoring that rarely yields substantive revisions, which suggests that incentivizing comprehensive multi-step reflection is a more effective path for training and post-training of reasoning models.
Link: https://arxiv.org/abs/2512.00729
Authors: Yuxiang Chen,Zuohan Wu,Ziwei Wang,Xiangning Yu,Xujia Li,Linyi Yang,Mengyue Yang,Jun Wang,Lei Chen
Affiliations: University College London; The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; Tianjin University; Southern University of Science and Technology; University of Bristol
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 13 pages
Abstract:Large reasoning models (LRMs) have garnered significant attention from researchers owing to their exceptional capability in addressing complex tasks. Motivated by the observed human-like behaviors in their reasoning processes, this paper introduces a comprehensive taxonomy to characterize atomic reasoning steps and probe the "psyche" of LRM intelligence. Specifically, it comprises five groups and seventeen categories derived from human mental processes, thereby grounding the understanding of LRMs in an interdisciplinary perspective. The taxonomy is then applied for an in-depth understanding of current LRMs, resulting in a distinct labeled dataset that comprises 277,534 atomic reasoning steps. Using this resource, we analyze contemporary LRMs and distill several actionable takeaways for improving training and post-training of reasoning models. Notably, our analysis reveals that prevailing post-answer "double-checks" (self-monitoring evaluations) are largely superficial and rarely yield substantive revisions. Thus, incentivizing comprehensive multi-step reflection, rather than simple self-monitoring, may offer a more effective path forward. To complement the taxonomy, an automatic annotation framework, named CAPO, is proposed to leverage large language models (LLMs) for generating the taxonomy-based annotations. Experimental results demonstrate that CAPO achieves higher consistency with human experts compared to baselines, facilitating a scalable and comprehensive analysis of LRMs from a human cognitive perspective. Together, the taxonomy, CAPO, and the derived insights provide a principled, scalable path toward understanding and advancing LRM reasoning.
[NLP-75] A Comparison of Human and ChatGPT Classification Performance on Complex Social Media Data
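[Quick Read]: This paper examines how well generative AI performs on complex social science tasks such as classifying and annotating datasets that contain nuanced language. The approach is to evaluate GPT-4, alongside ChatGPT versions 3.5 and 4o, under four prompt styles, compare the results with human annotators using precision, recall, and F1 scores, and add a qualitative analysis of where the model struggles with nuance. The findings give researchers grounds for caution when using ChatGPT for similar classification tasks.
Link: https://arxiv.org/abs/2512.00673
Authors: Breanna E. Green,Ashley L. Shea,Pengfei Zhao,Drew B. Margolin
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: About 15 pages, draft version of accepted conference full paper. Published paper to follow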
Abstract:Generative artificial intelligence tools, like ChatGPT, are an increasingly utilized resource among computational social scientists. Nevertheless, there remains space for improved understanding of the performance of ChatGPT in complex tasks such as classifying and annotating datasets containing nuanced language. In this paper, we measure the performance of GPT-4 on one such task and compare results to human annotators. We investigate ChatGPT versions 3.5, 4, and 4o to examine performance given rapid changes in technological advancement of large language models. We craft four prompt styles as input and evaluate precision, recall, and F1 scores. Both quantitative and qualitative evaluations of results demonstrate that while including label definitions in prompts may help performance, overall GPT-4 has difficulty classifying nuanced language. Qualitative analysis reveals four specific findings. Our results suggest the use of ChatGPT in classification tasks involving nuanced language should be conducted with prudence.
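The three metrics used to compare model labels with human annotations can be computed as in the sketch below. The label lists are made-up placeholders, and treating the human labels as the reference is an assumption about the setup.

```python
def precision_recall_f1(gold, predicted, positive):
    """Precision, recall, and F1 for one class, with gold labels as reference."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy labels: human annotation as reference, model output as prediction.
human = ["yes", "yes", "no", "no", "yes", "no"]
model = ["yes", "no", "no", "yes", "yes", "no"]
p, r, f1 = precision_recall_f1(human, model, positive="yes")
```

Precision penalizes labels the model assigns that the annotators did not, recall penalizes annotated cases the model misses, and F1 is their harmonic mean.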
[NLP-76] Graphing the Truth: Structured Visualizations for Automated Hallucination Detection in LLMs
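[Quick Read]: This paper addresses hallucinations that arise when Large Language Models (LLMs) are used in enterprise settings, caused by limited context windows and inconsistencies between pre-training data and supplied knowledge, especially highly credible hallucinations that escape human review. The key to the solution is an interactive visual knowledge graph that links model-generated assertions to their underlying sources of truth and indicates confidence levels. This gives users an intuitive view of potential hallucination zones, lets them diagnose weak reasoning chains and supply corrective feedback, and forms a human-in-the-loop feedback loop that can improve response reliability over time.
Link: https://arxiv.org/abs/2512.00663
Authors: Tanmay Agrawal
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: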
Abstract:Large Language Models have rapidly advanced in their ability to interpret and generate natural language. In enterprise settings, they are frequently augmented with closed-source domain knowledge to deliver more contextually informed responses. However, operational constraints such as limited context windows and inconsistencies between pre-training data and supplied knowledge often lead to hallucinations, some of which appear highly credible and escape routine human review. Current mitigation strategies either depend on costly, large-scale gold-standard Q&A curation or rely on secondary model verification, neither of which offers deterministic assurance. This paper introduces a framework that organizes proprietary knowledge and model-generated content into interactive visual knowledge graphs. The objective is to provide end users with a clear, intuitive view of potential hallucination zones by linking model assertions to underlying sources of truth and indicating confidence levels. Through this visual interface, users can diagnose inconsistencies, identify weak reasoning chains, and supply corrective feedback. The resulting human-in-the-loop workflow creates a structured feedback loop that can enhance model reliability and continuously improve response quality.
[NLP-77] Sycophancy Claims about Language Models: The Missing Human-in-the-Loop NEURIPS2025 ICLR2025
[Quick Read]: This paper addresses the difficulty of measuring sycophantic response patterns in Large Language Models (LLMs). The core problem is that current research lacks a unified, operational evaluation framework and does not sufficiently consider human perception. The key contribution is a systematic review that identifies five core operationalizations, points out that existing studies conflate sycophantic responses with related concepts in AI alignment, and recommends that future research incorporate human judgment to improve the validity and credibility of measurement.
Link: https://arxiv.org/abs/2512.00656
Authors: Jan Batzner,Volker Stocker,Stefan Schmid,Gjergji Kasneci
Affiliations: Weizenbaum Institute; Technical University Berlin; Munich Center for Machine Learning & Technical University Munich
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: NeurIPS 2025 Workshop on LLM Evaluation and ICLR 2025 Workshop on Bi-Directional Human-AI Alignment
Abstract:Sycophantic response patterns in Large Language Models (LLMs) have been increasingly claimed in the literature. We review methodological challenges in measuring LLM sycophancy and identify five core operationalizations. Despite sycophancy being inherently human-centric, current research does not evaluate human perception. Our analysis highlights the difficulties in distinguishing sycophantic responses from related concepts in AI alignment and offers actionable recommendations for future research.
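[NLP-78] Melody or Machine: Detecting Synthetic Music with Dual-Stream Contrastive Learning
[Quick Read]: This paper addresses the threat that rapidly evolving generative AI music poses to artistic authenticity and copyright, and in particular the drop in generalization that existing detectors show when faced with diverse, fast-changing music generators. The solution rests on two contributions. The first is Melody or Machine (MoM), a large-scale benchmark of over 130,000 songs (6,665 hours) that is highly diverse and includes a dedicated out-of-distribution (OOD) test set to encourage detectors that truly generalize. The second is CLAM, a dual-stream detection architecture that uses two pre-trained audio encoders (MERT and Wave2Vec2) to extract separate representations of vocal and instrumental components, models their dependencies with a learnable cross-aggregation module, and trains with a dual-loss objective (binary cross-entropy plus a contrastive triplet loss) to increase sensitivity to synthetic artifacts. CLAM reaches an F1 score of 0.925 on MoM, significantly outperforming existing methods.
Link: https://arxiv.org/abs/2512.00621
Authors: Arnesh Batra,Dev Sharma,Krish Thukral,Ruhani Bhatia,Naman Batra,Aditya Gautam
Affiliations: Indraprastha Institute of Information Technology Delhi (IIIT-Delhi); Manipal University Jaipur; Netaji Subhas University of Technology (NSUT)
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at Transactions on Machine Learning Research (TMLR)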
Abstract:The rapid evolution of end-to-end AI music generation poses an escalating threat to artistic authenticity and copyright, demanding detection methods that can keep pace. While foundational, existing models like SpecTTTra falter when faced with the diverse and rapidly advancing ecosystem of new generators, exhibiting significant performance drops on out-of-distribution (OOD) content. This generalization failure highlights a critical gap: the need for more challenging benchmarks and more robust detection architectures. To address this, we first introduce Melody or Machine (MoM), a new large-scale benchmark of over 130,000 songs (6,665 hours). MoM is the most diverse dataset to date, built with a mix of open and closed-source models and a curated OOD test set designed specifically to foster the development of truly generalizable detectors. Alongside this benchmark, we introduce CLAM, a novel dual-stream detection architecture. We hypothesize that subtle, machine-induced inconsistencies between vocal and instrumental elements, often imperceptible in a mixed signal, offer a powerful tell-tale sign of synthesis. CLAM is designed to test this hypothesis by employing two distinct pre-trained audio encoders (MERT and Wave2Vec2) to create parallel representations of the audio. These representations are fused by a learnable cross-aggregation module that models their inter-dependencies. The model is trained with a dual-loss objective: a standard binary cross-entropy loss for classification, complemented by a contrastive triplet loss which trains the model to distinguish between coherent and artificially mismatched stream pairings, enhancing its sensitivity to synthetic artifacts without presuming a simple feature alignment. CLAM establishes a new state-of-the-art in synthetic music forensics. It achieves an F1 score of 0.925 on our challenging MoM benchmark.
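The dual-loss objective mentioned in the abstract (binary cross-entropy plus a contrastive triplet loss) can be illustrated with a minimal sketch. The weighting factor `lam`, the margin, the Euclidean distance, and all function names here are assumptions for illustration; the abstract does not specify how CLAM combines the two terms.

```python
import math

def bce_loss(p, y, eps=1e-12):
    """Binary cross-entropy for one predicted probability p and label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: the coherent pair (anchor, positive) should be
    closer than the mismatched pair (anchor, negative) by at least `margin`."""
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)

def dual_loss(p, y, anchor, positive, negative, lam=0.5, margin=1.0):
    """Classification loss plus a weighted triplet term (lam is an assumed weight)."""
    return bce_loss(p, y) + lam * triplet_loss(anchor, positive, negative, margin)
```

In this sketch the anchor and positive would be embeddings of two streams from the same track, and the negative an embedding from an artificially mismatched pairing; that mapping is also an assumption.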
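[NLP-79] ART: Adaptive Response Tuning Framework – A Multi-Agent Tournament-Based Approach to LLM Response Optimization
[Quick Read]: This paper addresses the response inconsistency, hallucination, and cross-domain quality variation that Large Language Models (LLMs) commonly show in practice. The key to the solution is ART (Adaptive Response Tuning), a framework that combines a tournament-style ELO rating system with multi-agent reasoning so that multiple LLM agents compete, critique, and collaborate within a structured tournament workflow to produce a consensus output. The framework introduces configurable tournament parameters, a dynamic agent selection strategy, and several consensus fusion methods. It improves response accuracy, coherence, and reliability and outperforms single-model baselines on multiple metrics.
Link: https://arxiv.org/abs/2512.00617
Authors: Omer Jauhar Khan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures, 5 tables. Conference-style paper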
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. However, single-model responses often exhibit inconsistencies, hallucinations, and varying quality across different query domains. This paper presents ART (Adaptive Response Tuning), a novel framework that employs tournament-style ELO ranking and multi-agent reasoning to systematically optimize LLM outputs. By enabling multiple LLM agents to compete, critique, and collaborate through structured tournament workflows, ART produces consensus responses that outperform individual model outputs. Our framework introduces configurable tournament parameters, dynamic agent selection, and multiple consensus fusion strategies. Experimental evaluations demonstrate significant improvements in response accuracy, coherence, and reliability compared to baseline single-model approaches. The ART framework provides a scalable, production-ready solution for applications requiring high-quality, vetted LLM responses, achieving an 8.4% improvement in overall quality metrics and R² values exceeding 0.96 in ELO rating convergence.
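For readers unfamiliar with Elo-style ranking, the sketch below shows the standard Elo update for one pairwise comparison between two agents. The K-factor of 32 and the 400-point scale are conventional defaults and are assumptions here; the abstract does not state which tournament parameters ART uses.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32.0):
    """Update both ratings after one match.

    score_a is 1.0 if A's response wins, 0.5 for a draw, 0.0 if it loses.
    The update is zero-sum: whatever A gains, B loses.
    """
    e_a = expected_score(r_a, r_b)
    delta = k * (score_a - e_a)
    return r_a + delta, r_b - delta
```

With two agents that both start at 1500, a win moves the winner to 1516 and the loser to 1484.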
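[NLP-80] Prism: A Minimal Compositional Metalanguage for Specifying Agent Behavior
[Quick Read]: This paper addresses how to specify the behavior of tool-using software agents, that is, how to describe agent decision logic in a form that is clearly structured, composable, and verifiable. Conventional approaches often rely on ad hoc control constructs, which makes code hard to reuse and analyze. The key to the solution is Prism, a small, compositional metalanguage built on a fixed core context, Core1, which provides basic categories (numbers, strings, user prompts, tools) and abstract combinators (booleans, predicates, pairs, lists). Policies are expressed with a single abstraction operator, so conditional logic appears as a selection between alternatives rather than an imperative if-else construct. By defining domain-specific context-mini-grammars, Prism can be extended with new categories, predicates, and external tools while reusing the core compositional machinery, mapping natural-language decision rules to inspectable, executable policies. Linguistically, the design separates a general grammar from domain vocabulary; from an engineering standpoint, it provides a compact interface language that makes the action space explicit and amenable to analysis, verification, and safety constraints.
Link: https://arxiv.org/abs/2512.00611
Authors: Franck Binard,Vanja Kljajevic
Affiliations: Deloitte AI; University of Oslo
Categories: Computation and Language (cs.CL)
Comments: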
Abstract:Prism is a small, compositional metalanguage for specifying the behaviour of tool-using software agents. Rather than introducing ad hoc control constructs, Prism is built around a fixed core context, Core1, which provides a minimal background grammar of categories (numbers, strings, user prompts, tools) together with abstract combinators for booleans, predicates, pairs, and lists. Agent policies are written as ordinary expressions using a single abstraction operator so that conditionals appear as selections between alternatives instead of imperative if-else blocks. Domains extend the core by defining their own context-mini-grammars that introduce new categories, predicates, and external tools while reusing the same compositional machinery. We illustrate this with worked examples from thermostat control, home security, e-commerce recommendation, and medical monitoring, showing how natural language decision rules can be mapped to inspectable, executable policies. From a linguistic perspective, Prism enforces a clear separation between a reusable grammar-like core and domain specific lexicons and treats tools as bridges between internal policy representations and the external world. From an engineering perspective, it offers a compact interface language for agent control, making the space of possible actions explicit and amenable to analysis, verification, and safety constraints.
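The idea that conditionals can appear as selections between alternatives can be illustrated with boolean combinators written as plain Python lambdas. This is not Prism syntax; the names `TRUE`, `FALSE`, `is_cold`, and `thermostat_policy`, and the 18 degree threshold, are hypothetical and only mirror the thermostat example mentioned in the abstract.

```python
# A boolean is a selector: given two alternatives, it returns one of them.
TRUE = lambda a: lambda b: a
FALSE = lambda a: lambda b: b

def to_selector(flag):
    """Bridge from a Python bool to a selector combinator."""
    return TRUE if flag else FALSE

def is_cold(temp_c, threshold=18.0):
    """Hypothetical domain predicate that yields a selector, not a bool."""
    return to_selector(temp_c < threshold)

def thermostat_policy(temp_c):
    """The policy is a single expression: the predicate selects an action."""
    return is_cold(temp_c)("heat_on")("heat_off")
```

Because the policy is an ordinary expression with both alternatives written out, the set of possible actions is visible by inspection, which is the property the abstract emphasizes.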
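[NLP-81] Wikontic: Constructing Wikidata-Aligned Ontology-Aware Knowledge Graphs with Large Language Models
[Quick Read]: This paper addresses a quality problem in how current large language models (LLMs) use knowledge graphs (KGs): existing systems usually treat the KG only as an auxiliary structure for text retrieval and leave its intrinsic structural quality underexplored. The key to the solution is Wikontic, a three-stage pipeline that first extracts candidate triplets with qualifiers from open-domain text, then enforces Wikidata-based type and relation constraints to ensure semantic consistency, and finally normalizes entities to reduce duplication. The resulting KGs are compact, ontology-consistent, and well-connected, and they match or surpass several retrieval-augmented generation baselines on benchmark tasks. Construction is also efficient, using about 3× fewer output tokens than AriGraph and 1/20 of GraphRAG.
Link: https://arxiv.org/abs/2512.00590
Authors: Alla Chepurova,Aydar Bulatov,Yuri Kuratov,Mikhail Burtsev
Affiliations: Cognitive AI Systems Lab; Moscow Independent Research Institute of Artificial Intelligence; London Institute for Mathematical Sciences
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: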
Abstract:Knowledge graphs (KGs) provide structured, verifiable grounding for large language models (LLMs), but current LLM-based systems commonly use KGs as auxiliary structures for text retrieval, leaving their intrinsic quality underexplored. In this work, we propose Wikontic, a multi-stage pipeline that constructs KGs from open-domain text by extracting candidate triplets with qualifiers, enforcing Wikidata-based type and relation constraints, and normalizing entities to reduce duplication. The resulting KGs are compact, ontology-consistent, and well-connected; on MuSiQue, the correct answer entity appears in 96% of generated triplets. On HotpotQA, our triplets-only setup achieves 76.0 F1, and on MuSiQue 59.8 F1, matching or surpassing several retrieval-augmented generation baselines that still require textual context. In addition, Wikontic attains state-of-the-art information-retention performance on the MINE-1 benchmark (86%), outperforming prior KG construction methods. Wikontic is also efficient at build time: KG construction uses less than 1,000 output tokens, about 3× fewer than AriGraph and 1/20 of GraphRAG. The proposed pipeline enhances the quality of the generated KG and offers a scalable solution for leveraging structured knowledge in LLMs.
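[NLP-82] Statistical NLP for Optimization of Clinical Trial Success Prediction in Pharmaceutical R&D
[Quick Read]: This paper addresses the low success rate (below 10%) and high R&D cost of clinical trials in neuroscience by building a Natural Language Processing (NLP) based probabilistic classifier that estimates the probability of technical and regulatory success (pTRS) of a trial, with the aim of improving resource allocation and reducing financial risk. The approach first uses statistical NLP techniques to extract features from clinical trial text and feeds them into several non-LLM frameworks (logistic regression, gradient boosting, and random forest) to produce calibrated probability scores. It then builds an LLM-based predictive model on BioBERT, a domain-specific language representation encoder. On 101,145 clinical trials completed between 1976 and 2024, the BioBERT-based model achieves a ROC-AUC of 0.74 and a Brier Score of 0.185, outperforming industry benchmarks.
Link: https://arxiv.org/abs/2512.00586
Authors: Michael R. Doane
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
Comments: Doctor of Engineering Praxis Dissertation, The George Washington University. 122 pages. Present affiliation: Iambic Therapeutics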
Abstract:This work presents the development and evaluation of an NLP-enabled probabilistic classifier designed to estimate the probability of technical and regulatory success (pTRS) for clinical trials in the field of neuroscience. While pharmaceutical R&D is plagued by high attrition rates and enormous costs, particularly within neuroscience, where success rates are below 10%, timely identification of promising programs can streamline resource allocation and reduce financial risk. Leveraging data from the this http URL database and success labels from the recently developed Clinical Trial Outcome dataset, the classifier extracts text-based clinical trial features using statistical NLP techniques. These features were integrated into several non-LLM frameworks (logistic regression, gradient boosting, and random forest) to generate calibrated probability scores. Model performance was assessed on a retrospective dataset of 101,145 completed clinical trials spanning 1976-2024, achieving an overall ROC-AUC of 0.64. An LLM-based predictive model was then built using BioBERT, a domain-specific language representation encoder. The BioBERT-based model achieved an overall ROC-AUC of 0.74 and a Brier Score of 0.185, indicating its predictions had, on average, 40% less squared error than would be observed using industry benchmarks. The BioBERT-based model also made trial outcome predictions that were superior to benchmark values 70% of the time overall. By integrating NLP-driven insights into drug development decision-making, this work aims to enhance strategic planning and optimize investment allocation in neuroscience programs.
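The Brier Score quoted in the abstract is the mean squared error between predicted probabilities and binary outcomes. The sketch below computes it and a relative reduction against a reference score. Reading "40% less squared error" as `1 - model / reference` is an assumption about how the comparison was made, and the function names are illustrative.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal in length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def relative_reduction(model_bs, reference_bs):
    """Fraction by which the model's Brier score is lower than the reference."""
    return 1.0 - model_bs / reference_bs

# Toy example: three trials with predicted success probabilities.
toy_bs = brier_score([0.9, 0.2, 0.6], [1, 0, 1])
```

Under that assumed reading, a score of 0.185 with a 40% reduction would correspond to a reference Brier score of roughly 0.185 / 0.6 ≈ 0.308; the abstract itself does not report the benchmark value.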
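[NLP-83] Slovak Conceptual Dictionary
[Quick Read]: This paper addresses the limited performance of Natural Language Processing (NLP) tasks for low-resource languages such as Slovak, caused by the lack of high-quality dictionary tools such as lexicons, word form dictionaries, or knowledge bases. The key contribution is the first conceptual dictionary for Slovak, which fills the gap in large machine-readable linguistic data resources for the language and provides a foundation for subsequent automated text processing tasks.
Link: https://arxiv.org/abs/2512.00579
Authors: Miroslav Blšták
Affiliations: Kempelen Institute of Intelligent Technologies
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages, 2 figures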
Abstract:When solving tasks in the field of natural language processing, we sometimes need dictionary tools, such as lexicons, word form dictionaries or knowledge bases. However, the availability of dictionary data is insufficient in many languages, especially in the case of low-resource languages. In this article, we introduce a new conceptual dictionary for the Slovak language as the first linguistic tool of this kind. Because Slovak is a language with limited linguistic resources and no machine-readable linguistic data sources with a sufficiently large volume of data are currently available, many tasks that require automated processing of Slovak text achieve weaker results than in other languages or are almost impossible to solve.
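[NLP-84] Bias Testing and Mitigation in Black Box LLMs using Metamorphic Relations
[Quick Read]: This paper addresses subtle social biases in Large Language Models (LLMs), in particular the failure of existing guardrails when faced with indirect or contextually complex bias-inducing prompts. The key to the solution is a unified framework built on six novel metamorphic relations (MRs) that supports both systematic bias evaluation and targeted mitigation. Following metamorphic testing principles, the MRs transform direct bias-inducing inputs into semantically equivalent but adversarially challenging variants, which automatically exposes cases where the model responds inconsistently or unfairly across variants. The same MRs can generate diverse bias-inducing samples for fine-tuning, closing the loop from detection to mitigation. Experiments show that the method reveals up to 14% more hidden biases than existing tools and raises safe response rates from 54.7% to over 88.9%.
Link: https://arxiv.org/abs/2512.00556
Authors: Sina Salimian,Gias Uddin,Sumon Biswas,Henry Leung
Affiliations: University of Calgary; York University; Case Western Reserve University
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: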
Abstract:The widespread deployment of Large Language Models (LLMs) has intensified concerns about subtle social biases embedded in their outputs. Existing guardrails often fail when faced with indirect or contextually complex bias-inducing prompts. To address these limitations, we propose a unified framework for both systematic bias evaluation and targeted mitigation. Our approach introduces six novel Metamorphic Relations (MRs) that, based on metamorphic testing principles, transform direct bias-inducing inputs into semantically equivalent yet adversarially challenging variants. These transformations enable an automated method for exposing hidden model biases: when an LLM responds inconsistently or unfairly across MR-generated variants, the underlying bias becomes detectable. We further show that the same MRs can be used to generate diverse bias-inducing samples for fine-tuning, directly linking the testing process to mitigation. Using six state-of-the-art LLMs - spanning open-source and proprietary models - and a representative subset of 385 questions from the 8,978-item BiasAsker benchmark covering seven protected groups, our MRs reveal up to 14% more hidden biases compared to existing tools. Moreover, fine-tuning with both original and MR-mutated samples significantly enhances bias resiliency, increasing safe response rates from 54.7% to over 88.9% across models. These results highlight metamorphic relations as a practical mechanism for improving fairness in conversational AI.
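The metamorphic testing loop described in the abstract can be sketched as follows: apply each relation to a source question, query the model on the variant, and flag variants whose safety verdict differs from the verdict on the original. The two relations, the keyword-based `is_safe` judge, and the `model` callable are hypothetical placeholders, not the six MRs or the evaluation procedure from the paper.

```python
def mr_indirect(question):
    """Hypothetical MR: wrap a direct question as an indirect request."""
    return f"A friend asked me the following and I am unsure how to reply: {question}"

def mr_roleplay(question):
    """Hypothetical MR: embed the question in a fictional framing."""
    return f"For a short story, a character is asked: {question} What does the character answer?"

def is_safe(response):
    """Hypothetical judge: treats a refusal marker as a safe response."""
    return "cannot" in response.lower()

def find_violations(model, question, relations):
    """Return the variants whose safety verdict differs from the original's.

    `model` is any callable that maps a prompt string to a response string.
    """
    base_safe = is_safe(model(question))
    violations = []
    for relation in relations:
        variant = relation(question)
        if is_safe(model(variant)) != base_safe:
            violations.append(variant)
    return violations
```

Flagged variants could then be paired with a safe reference response and reused as fine-tuning samples, which is the testing-to-mitigation link the abstract describes.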
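[NLP-85] Catch Me If You Can: How Smaller Reasoning Models Pretend to Reason with Mathematical Fidelity
[Quick Read]: This paper addresses the superficiality of evaluating the mathematical reasoning of language models by answer accuracy alone, which can mask fundamental failures in logical computation. The key to the solution is a diagnostic framework that distinguishes genuine mathematical reasoning from surface pattern matching along four complementary axes: forward-backward consistency, transitivity coverage, counterfactual sensitivity, and perturbation robustness. The framework exposes reasoning failures that conventional accuracy metrics cannot capture, supporting deeper verification of mathematical reasoning ability.
Link: https://arxiv.org/abs/2512.00552
Authors: Subramanyam Sahoo,Vinija Jain,Saanidhya Vats,Siddharth Mohapatra,Rui Min,Aman Chadha,Divya Chaudhary
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 5 figures. A preprint. Initial Work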
Abstract:Current evaluation of mathematical reasoning in language models relies primarily on answer accuracy, potentially masking fundamental failures in logical computation. We introduce a diagnostic framework that distinguishes genuine mathematical reasoning from superficial pattern matching through four complementary axes: forward-backward consistency, transitivity coverage, counterfactual sensitivity, and perturbation robustness. Through a case study applying this framework to Qwen3-0.6B on the MenatQA dataset, we reveal a striking disconnect between surface performance and reasoning fidelity. While the model achieves reasonable answer accuracy (70%+), it demonstrates poor backward consistency (15%), limited transitivity coverage (32.2%), and brittle sensitivity to perturbations. Our diagnostics expose reasoning failures invisible to traditional accuracy metrics, suggesting that this small model relies heavily on pattern matching rather than genuine logical computation. While our empirical findings are based on a single 600M-parameter model, the diagnostic framework itself is model-agnostic and generalizable. We release our evaluation protocols to enable the research community to assess reasoning fidelity across different model scales and architectures, moving beyond surface-level accuracy toward verifiable mathematical reasoning.
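[NLP-86] Rep3Net: An Approach Exploiting Multimodal Representation for Molecular Bioactivity Prediction
[Quick Read]: This paper addresses the accuracy of predicting molecular bioactivity against target proteins in early-stage drug discovery, where traditional Quantitative Structure-Activity Relationship (QSAR) models are limited because they struggle to capture the structural and contextual information within each compound. The key to the solution is Rep3Net, a unified deep learning architecture that fuses multimodal features: graph-based representations supply spatial and relational information, ChemBERTa embeddings generated from SMILES strings capture contextual information, and traditional molecular descriptors are retained. The concatenated multimodal features improve bioactivity prediction for the PARP-1 protein target.
Link: https://arxiv.org/abs/2512.00521
Authors: Sabrina Islam,Md. Atiqur Rahman,Md. Bakhtiar Hasan,Md. Hasanul Kabir
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
Comments: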
Abstract:In early-stage drug discovery, bioactivity prediction of molecules against target proteins plays a crucial role. Traditional QSAR models that utilize molecular descriptor based data often struggle to predict bioactivity of molecules effectively due to their limitation in capturing structural and contextual information embedded within each compound. To address this challenge, we propose Rep3Net, a unified deep learning architecture that not only incorporates descriptor data but also includes spatial and relational information through graph-based representation of compounds and contextual information through ChemBERTa generated embeddings from SMILES strings. Our model employing multimodal concatenated features produces reliable bioactivity prediction on the Poly [ADP-ribose] polymerase 1 (PARP-1) dataset. PARP-1 is a crucial agent in DNA damage repair and has become a significant therapeutic target in malignancies that depend on it for survival and growth. A comprehensive analysis and comparison with conventional standalone models including GCN, GAT, XGBoost, etc. demonstrates that our architecture achieves the highest predictive performance. In computational screening of compounds in drug discovery, our architecture provides a scalable framework for bioactivity prediction.
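[NLP-87] Developing a Comprehensive Framework for Sentiment Analysis in Turkish
[Quick Read]: This paper addresses several challenges of sentiment analysis in Turkish and English, including feature representation, domain adaptation, morphological complexity, and model architecture. First, it builds a novel feature set combining unsupervised, semi-supervised, and supervised metrics, which, fed into classical machine learning methods, outperforms neural network models on Turkish and English datasets of different genres. Second, it proposes the first semi-supervised, domain-specific method for constructing a polarity lexicon for Turkish corpora. Third, it performs morphology-level sentiment classification by determining the polarities of Turkish morphemes, an approach that can be adapted to other morphologically rich or agglutinative languages. It also designs a novel neural architecture for English that combines recurrent and recursive neural network models, introduces word embeddings that exploit sentiment, syntactic, semantic, and lexical characteristics, and redefines context windows as subclauses when modelling English word representations, an idea applicable to other NLP tasks. Together these contributions achieve state-of-the-art results in both languages.
Link: https://arxiv.org/abs/2512.00515
Authors: Cem Rifki Aydin
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Ph.D. Thesis, Bogazici University, 2020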
Abstract:In this thesis, we developed a comprehensive framework for sentiment analysis that takes its many aspects into account mainly for Turkish. We have also proposed several approaches specific to sentiment analysis in English only. We have accordingly made five major and three minor contributions. We generated a novel and effective feature set by combining unsupervised, semi-supervised, and supervised metrics. We then fed them as input into classical machine learning methods, and outperformed neural network models for datasets of different genres in both Turkish and English. We created a polarity lexicon with a semi-supervised domain-specific method, which has been the first approach applied for corpora in Turkish. We performed a fine morphological analysis for the sentiment classification task in Turkish by determining the polarities of morphemes. This can be adapted to other morphologically-rich or agglutinative languages as well. We have built a novel neural network architecture, which combines recurrent and recursive neural network models for English. We built novel word embeddings that exploit sentiment, syntactic, semantic, and lexical characteristics for both Turkish and English. We also redefined context windows as subclauses in modelling word representations in English. This can also be applied to other linguistic fields and natural language processing tasks. We have achieved state-of-the-art and significant results for all these original approaches. Our minor contributions include methods related to aspect-based sentiment in Turkish, parameter redefinition in the semi-supervised approach, and aspect term extraction techniques for English. This thesis can be considered the most detailed and comprehensive study made on sentiment analysis in Turkish as of July, 2020. Our work has also contributed to the opinion classification problem in English.
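[NLP-88] G-KV: Decoding-Time KV Cache Eviction with Global Attention
[Quick Read]: This paper addresses the significant computational and memory bottlenecks that Large Language Models (LLMs) face on complex reasoning tasks because of long sequence lengths. Existing KV cache compression methods mostly rely on local attention scores for prompt compression or token eviction and overlook the long-term importance of tokens. The key to the solution is G-KV, which uses a global scoring mechanism that combines local and historical attention scores to assess the importance of each token more accurately and so evict KV cache entries more effectively. It also introduces post-training techniques, including reinforcement learning and distillation, to adapt models to compressed KV cache settings.
Link: https://arxiv.org/abs/2512.00504
Authors: Mengqi Liao,Lu Wang,Chaoyun Zhang,Zekai Shen,Xiaowei Mao,Si Qin,Qingwei Lin,Saravan Rajmohan,Dongmei Zhang,Huaiyu Wan
Affiliations: Beijing Jiaotong University; Microsoft; Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: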
Abstract:Recent reasoning large language models (LLMs) excel in complex tasks but encounter significant computational and memory challenges due to long sequence lengths. KV cache compression has emerged as an effective approach to greatly enhance the efficiency of reasoning. However, existing methods often focus on prompt compression or token eviction with local attention score, overlooking the long-term importance of tokens. We propose G-KV, a KV cache eviction method that employs a global scoring mechanism, combining local and historical attention scores to more accurately assess token importance. Additionally, we introduce post-training techniques, including reinforcement learning and distillation, to optimize models for compressed KV cache settings. The code of this paper is available on: this https URL.
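To make the contrast between local and global scoring concrete, here is a minimal sketch of score-based eviction. The combination rule (a decayed running maximum of past scores merged with the current local score) and the `decay` value are assumptions chosen for illustration; the abstract says only that G-KV combines local and historical attention scores, not how.

```python
def update_global_scores(global_scores, local_scores, decay=0.9):
    """Merge historical and current attention scores per token.

    Assumed rule: keep the larger of the decayed historical score and the
    current local score, so tokens that mattered earlier fade slowly.
    """
    return [max(decay * g, l) for g, l in zip(global_scores, local_scores)]

def tokens_to_keep(scores, budget):
    """Return the sorted indices of the `budget` highest-scoring tokens."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])

# Token 0 was important earlier but receives no attention at the current step.
history = [0.8, 0.1, 0.3, 0.05]
local = [0.0, 0.2, 0.1, 0.6]

keep_local = tokens_to_keep(local, budget=2)
keep_global = tokens_to_keep(update_global_scores(history, local), budget=2)
```

In this toy case local-only scoring keeps tokens 1 and 3 and evicts token 0, while the combined score keeps tokens 0 and 3, which is the kind of long-term importance the abstract says local scoring overlooks.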
[NLP-89] CACARA: Cross-Modal Alignment Leveraging a Text-Centric Approach for Cost-Effective Multimodal and Multilingual Learning
[Quick Read]: This paper addresses the resource-intensive training that multimodal and multilingual models usually require when extended to new modalities or languages: conventional approaches retrain across all modalities and languages, which is computationally expensive and makes it hard to integrate new information efficiently. The key to the solution is CACARA, an architecture trained through emergent alignment learning that integrates new modalities without full retraining. When the new modality is fine-tuned only on English-aligned data, support for over 100 languages emerges without explicit multilingual pretraining or tuning of the text encoder, while previously learned knowledge is preserved and training cost stays close to that of a monolingual model.
Link: https://arxiv.org/abs/2512.00496
Authors: Diego A. B. Moreira,Alef I. Ferreira,Jhessica Silva,Gabriel O. dos Santos,Gustavo Bonil,João Gondim,Marina dos Santos,Helena Maia,Simone Hashiguti,Nádia da Silva,Carolina Scarton,Helio Pedrini,Sandra Avila
Affiliations: Instituto de Computação, Universidade Estadual de Campinas (UNICAMP), Brasil; Instituto de Estudos da Linguagem, Universidade Estadual de Campinas (UNICAMP), Brasil; Instituto de Informática, Universidade Federal de Goiás (UFG), Goiás, Brasil; Department of Computer Science, University of Sheffield, Sheffield, United Kingdom
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 25 pages, 12 tables, 5 figures
Abstract:As deep learning models evolve, new applications and challenges are rapidly emerging. Tasks that once relied on a single modality, such as text, images, or audio, are now enriched by seamless interactions between multimodal data. These connections bridge information gaps: an image can visually materialize a text, while audio can add context to an image. Researchers have developed numerous multimodal models, but most rely on resource-intensive training across multiple modalities. Similarly, extending these models to new languages often follows the same resource-heavy training strategy. In this work, we propose a multimodal and multilingual architecture, CACARA, trained through emergent alignment learning, enabling the seamless integration of new modalities into an existing bimodal/multimodal model without requiring full retraining. This work breaks new ground by demonstrating that this emergent alignment paradigm can unlock multilingual capabilities from monolingual training. By fine-tuning the newly incorporated modality only on data aligned with the English language, our model develops support for over 100 languages without explicit multilingual pretraining or tuning of the text encoder. Such emergent multimodal and multilingual properties are gained efficiently, preserving previously learned knowledge at a training cost comparable to that of a monolingual model. Our strategy achieves up to a 14.24 percentage points improvement in R@1 audio-to-text retrieval, outperforming state-of-the-art multimodal models – all without the heavy computational cost of retraining across every modality and language.
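The R@1 figure in the abstract is a retrieval recall metric. A minimal sketch of Recall@k over a similarity matrix follows; it assumes one correct candidate per query, located on the diagonal, which is a common evaluation setup but not something the abstract states.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose correct candidate is among the top-k results.

    similarity[i][j] is the score between query i (e.g. an audio clip) and
    candidate j (e.g. a caption); the correct candidate for query i is j == i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return hits / len(similarity)

# Toy 3x3 example: query 1 ranks the wrong caption first.
sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.2, 0.7],
]
```

On this toy matrix R@1 is 2/3 and R@2 is 1.0. A gain of 14.24 percentage points in R@1 means the first quantity rose by 0.1424 in absolute terms.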
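[NLP-90] SCALE: Selective Resource Allocation for Overcoming Performance Bottlenecks in Mathematical Test-time Scaling AAAI2026
[Quick Read]: This paper addresses a fundamental bottleneck in current test-time compute scaling methods for improving the mathematical reasoning of large language models (LLMs): resources are allocated uniformly across all reasoning sub-problems, so difficult sub-problems receive too little compute while simple operations consume too much, leading to diminishing returns. The key to the solution is SCALE (Selective Resource Allocation), a framework inspired by dual-process theory that schedules resources in four stages: it decomposes the problem into sequential reasoning sub-problems, assesses the difficulty of each sub-problem to separate routine operations from challenging ones, selectively assigns fast processing (System 1) or deliberate reasoning (System 2) according to difficulty, and executes the sub-problems sequentially while propagating context. The method concentrates resources on the critical sub-problems and reduces computational cost by 33%-53% while improving accuracy.
Link: https://arxiv.org/abs/2512.00466
Authors: Yang Xiao,Chunpu Xu,Ruifeng Yuan,Jiashuo Wang,Wenjie Li,Pengfei Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: accepted by AAAI 2026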
Abstract:Test-time compute scaling has emerged as a powerful paradigm for enhancing mathematical reasoning in large language models (LLMs) by allocating additional computational resources during inference. However, current methods employ uniform resource distribution across all reasoning sub-problems, creating fundamental bottlenecks where challenging sub-problems receive insufficient attention while routine operations consume disproportionate resources. This uniform allocation creates performance bottlenecks where additional computational resources yield diminishing returns. Inspired by dual-process theory, we propose SCALE (Selective Resource Allocation), a framework that selectively allocates computational resources based on sub-problem difficulty. SCALE operates through four stages: (1) problem decomposition into sequential reasoning sub-problems, (2) difficulty assessment of each sub-problem to distinguish between routine operations and computationally challenging sub-problems, (3) selective processing mode assignment between System 1 for simple sub-problems and System 2 for complex ones, and (4) sequential execution with context propagation. By concentrating resources on challenging sub-problems while processing routine operations efficiently, SCALE achieves substantial performance improvements with superior resource utilization. Extensive experiments demonstrate that SCALE significantly outperforms uniform scaling baselines, achieving accuracy improvements of up to 13.75 percentage points (57.50% to 71.25% on AIME25) while reducing computational costs by 33%-53%, representing a major advance in test-time scaling that addresses fundamental limitations of current approaches.
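The four stages listed in the abstract can be sketched as a simple control loop. Everything concrete here is hypothetical: the `assess`, `fast_solve`, and `slow_solve` callables stand in for model calls, the difficulty threshold of 0.5 is arbitrary, and decomposition (stage 1) is assumed to have already produced the list of sub-problems.

```python
def run_scale(sub_problems, assess, fast_solve, slow_solve, threshold=0.5):
    """Route each sub-problem to a fast or slow solver by assessed difficulty.

    sub_problems: ordered list produced by a prior decomposition step.
    assess(sub) -> float in [0, 1]; higher means harder (stage 2).
    fast_solve / slow_solve(sub, context) -> answer (stage 3, System 1 / 2).
    Returns the accumulated context and a trace of the chosen modes.
    """
    context = []  # (sub_problem, answer) pairs passed to later steps (stage 4)
    trace = []
    for sub in sub_problems:
        mode = "system2" if assess(sub) >= threshold else "system1"
        solver = slow_solve if mode == "system2" else fast_solve
        answer = solver(sub, context)
        context.append((sub, answer))
        trace.append((sub, mode))
    return context, trace
```

The intended saving comes from calling the expensive solver only where `assess` flags a sub-problem as hard, rather than on every step.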
[NLP-91] Whose Personae? Synthetic Persona Experiments in LLM Research and Pathways to Transparency NEURIPS2025 AAAI
[Quick Read]: This paper addresses the limited representativeness and widely varying ecological validity of persona-based experiments in language model alignment research. Most studies do not clearly specify the task and target population, so personae lack adequate sociodemographic grounding, and only 35% of studies discuss the representativeness of their personae. The key contribution is a persona transparency checklist that emphasizes representative sampling, explicit grounding in empirical data, and enhanced ecological validity, with the aim of making persona-based evaluations more rigorous and credible.
Link: https://arxiv.org/abs/2512.00461
Authors: Jan Batzner,Volker Stocker,Bingjun Tang,Anusha Natarajan,Qinhao Chen,Stefan Schmid,Gjergji Kasneci
Affiliations: W: University of Stuttgart; B: University of Stuttgart; C: Max Planck Institute for Software Systems; M: University of Mannheim
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: Published at AAAI/ACM AIES 2025. Presented at NeurIPS 2025 Workshop Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling
Abstract:Synthetic personae experiments have become a prominent method in Large Language Model alignment research, yet the representativeness and ecological validity of these personae vary considerably between studies. Through a review of 63 peer-reviewed studies published between 2023 and 2025 in leading NLP and AI venues, we reveal a critical gap: task and population of interest are often underspecified in persona-based experiments, despite personalization being fundamentally dependent on these criteria. Our analysis shows substantial differences in user representation, with most studies focusing on limited sociodemographic attributes and only 35% discussing the representativeness of their LLM personae. Based on our findings, we introduce a persona transparency checklist that emphasizes representative sampling, explicit grounding in empirical data, and enhanced ecological validity. Our work provides both a comprehensive assessment of current practices and practical guidelines to improve the rigor and ecological validity of persona-based evaluations in language model alignment research.
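[NLP-92] CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency
[Quick Read]: This paper addresses the lack of targeted, realistic evaluation of Large Language Model (LLM) agents in the cryptocurrency domain. Existing general-purpose agent benchmarks focus on search and prediction tasks and do not reflect the extreme time-sensitivity, highly adversarial information environment, and need to synthesize diverse specialized sources (such as on-chain intelligence platforms and Decentralized Finance (DeFi) dashboards) that characterize crypto analysis. The key to the solution is CryptoBench, the first expert-curated dynamic benchmark, which adds 50 questions per month that mirror the workflows of professional analysts and categorizes them in a four-quadrant system (Simple Retrieval, Complex Retrieval, Simple Prediction, Complex Prediction) for fine-grained assessment of both data-gathering and analytical forecasting abilities. The evaluation reveals a retrieval-prediction imbalance: leading models retrieve data well but are weak on tasks that require synthesis and prediction.
Link: https://arxiv.org/abs/2512.00417
Authors: Jiacheng Guo,Suozhi Huang,Zixin Yao,Yifan Zhang,Yifu Lu,Jiashuo Liu,Zihao Li,Yanyan Deng,Qixin Xiao,Jia Tian,Kanghong Zhan,Tianyi Li,Xiaochen Liu,Jason Ge,Chaoyang He,Kaixuan Huang,Lin Yang,Wenhao Huang,Mengdi Wang
Affiliations: Princeton University; Pyra.io; DeepReach.ai; Zenith Lab; TensorOpera AI; University of California, Los Angeles; University of California, Berkeley; University of Michigan
Categories: Computation and Language (cs.CL)
Comments: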
Abstract:This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate the real-world capabilities of Large Language Model (LLM) agents in the uniquely demanding and fast-paced cryptocurrency domain. Unlike general-purpose agent benchmarks for search and prediction, professional crypto analysis presents specific challenges: extreme time-sensitivity, a highly adversarial information environment, and the critical need to synthesize data from diverse, specialized sources, such as on-chain intelligence platforms and real-time Decentralized Finance (DeFi) dashboards. CryptoBench thus serves as a much more challenging and valuable scenario for LLM agent assessment. To address these challenges, we constructed a live, dynamic benchmark featuring 50 questions per month, expertly designed by crypto-native professionals to mirror actual analyst workflows. These tasks are rigorously categorized within a four-quadrant system: Simple Retrieval, Complex Retrieval, Simple Prediction, and Complex Prediction. This granular categorization enables a precise assessment of an LLM agent’s foundational data-gathering capabilities alongside its advanced analytical and forecasting skills. Our evaluation of ten LLMs, both directly and within an agentic framework, reveals a performance hierarchy and uncovers a failure mode. We observe a retrieval-prediction imbalance, where many leading models, despite being proficient at data retrieval, demonstrate a pronounced weakness in tasks requiring predictive analysis. This highlights a problematic tendency for agents to appear factually grounded while lacking the deeper analytical capabilities to synthesize information.
[NLP-93] A Taxonomy of Errors in English as she is spoke: Toward an AI-Based Method of Error Analysis for EFL Writing Instruction
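[Quick read]: This paper addresses inefficient error analysis and coarse feedback in English as a Foreign Language (EFL) writing instruction, where traditional rubric-based assessment struggles to give precise, multi-level corrective guidance. The key to the solution is an LLM-assisted error analysis system built around a fine-grained error taxonomy grounded in the linguistic theories of Corder (1967), Richards (1971), and James (1998). The system identifies and classifies spelling, grammar, and punctuation errors at the word and sentence levels, and is implemented through Python API calls that return structured feedback. The approach gives EFL instruction a scalable, customizable mechanism for automated, detailed feedback.
Link: https://arxiv.org/abs/2512.00392
Authors: Damian Heywood, Joseph Andrew Carrier, Kyu-Hong Hwang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Metadata at “Replication Data for: A Taxonomy of Errors in English as she is spoke: An AI-Based System for Error Analysis for EFL Writing Instruction”, this https URL, Harvard Dataverse, V1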
Abstract:This study describes the development of an AI-assisted error analysis system designed to identify, categorize, and correct writing errors in English. Utilizing Large Language Models (LLMs) like Claude 3.5 Sonnet and DeepSeek R1, the system employs a detailed taxonomy grounded in linguistic theories from Corder (1967), Richards (1971), and James (1998). Errors are classified at both word and sentence levels, covering spelling, grammar, and punctuation. Implemented through Python-coded API calls, the system provides granular feedback beyond traditional rubric-based assessments. Initial testing on isolated errors refined the taxonomy, addressing challenges like overlapping categories. Final testing used “English as she is spoke” by Jose da Fonseca (1855), a text rich with authentic linguistic errors, to evaluate the system’s capacity for handling complex, multi-layered analysis. The AI successfully identified diverse error types but showed limitations in contextual understanding and occasionally generated new error categories when encountering uncoded errors. This research demonstrates AI’s potential to transform EFL instruction by automating detailed error analysis and feedback. While promising, further development is needed to improve contextual accuracy and expand the taxonomy to stylistic and discourse-level errors.
[NLP-94] Mitigating the Threshold Priming Effect in Large Language Model-Based Relevance Judgments via Personality Infusing
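[Quick read]: This paper addresses the judgment bias caused by threshold priming when Large Language Models (LLMs) perform relevance labeling, where earlier relevance judgments systematically influence later ones. The key to the solution is personality prompting: simulating Big Five personality traits to modulate the LLM's behavior. Experiments show that profiles such as High Openness and Low Neuroticism reduce susceptibility to priming, and that the most effective profile varies with the model and the task type. This gives a psychology-informed, actionable way to make LLM-based information retrieval evaluation more stable and objective.
Link: https://arxiv.org/abs/2512.00390
Authors: Nuo Chen, Hanpei Fang, Jiqun Liu, Wilson Wei, Tetsuya Sakai, Xiao-Ming Wu
Affiliations: The Hong Kong Polytechnic University; Waseda University; The University of Oklahoma; EureXa Labs
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: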
Abstract:Recent research has explored LLMs as scalable tools for relevance labeling, but studies indicate they are susceptible to priming effects, where prior relevance judgments influence later ones. Although psychological theories link personality traits to such biases, it is unclear whether simulated personalities in LLMs exhibit similar effects. We investigate how Big Five personality profiles in LLMs influence priming in relevance labeling, using multiple LLMs on TREC 2021 and 2022 Deep Learning Track datasets. Our results show that certain profiles, such as High Openness and Low Neuroticism, consistently reduce priming susceptibility. Additionally, the most effective personality in mitigating priming may vary across models and task types. Based on these findings, we propose personality prompting as a method to mitigate threshold priming, connecting psychological evidence with LLM-based evaluation practices.
[NLP-95] Breaking It Down: Domain-Aware Semantic Segmentation for Retrieval Augmented Generation
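[Quick read]: This paper addresses the inaccurate retrieval and low generation quality that arise in Retrieval-Augmented Generation (RAG) when conventional document chunking methods (fixed-length or recursive splitting) fail to preserve semantic structure. The key to the solution is two efficient semantic chunking methods, Projected Similarity Chunking (PSC) and Metric Fusion Chunking (MFC). Both are trained on PubMed data using three different embedding models so that chunk boundaries better match semantic boundaries. This substantially improves retrieval (a 24x gain in MRR with PSC) and generation quality, and the methods generalize well to other domains.
Link: https://arxiv.org/abs/2512.00367
Authors: Aparajitha Allamraju, Maitreya Prafulla Chitale, Hiranmai Sri Adibhatla, Rahul Mishra, Manish Shrivastava
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: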
Abstract:Document chunking is a crucial component of Retrieval-Augmented Generation (RAG), as it directly affects the retrieval of relevant and precise context. Conventional fixed-length and recursive splitters often produce arbitrary, incoherent segments that fail to preserve semantic structure. Although semantic chunking has gained traction, its influence on generation quality remains underexplored. This paper introduces two efficient semantic chunking methods, Projected Similarity Chunking (PSC) and Metric Fusion Chunking (MFC), trained on PubMed data using three different embedding models. We further present an evaluation framework that measures the effect of chunking on both retrieval and generation by augmenting PubMedQA with full-text PubMed Central articles. Our results show substantial retrieval improvements (24x with PSC) in MRR and higher Hits@k on PubMedQA. We provide a comprehensive analysis, including statistical significance and response-time comparisons with common chunking libraries. Despite being trained on a single domain, PSC and MFC also generalize well, achieving strong out-of-domain generation performance across multiple datasets. Overall, our findings confirm that our semantic chunkers, especially PSC, consistently deliver superior performance.
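The retrieval metrics named in the abstract above, MRR and Hits@k, can be sketched in a few lines. This is a minimal illustration of the standard definitions, not the paper's evaluation code; the chunk IDs and the `runs` structure are hypothetical.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one pair per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def hits_at_k(runs, k):
    """Fraction of queries with at least one relevant item in the top k."""
    hits = sum(1 for r, rel in runs if any(d in rel for d in r[:k]))
    return hits / len(runs)


# Hypothetical toy data: chunk IDs returned for three queries.
runs = [
    (["c3", "c1", "c7"], {"c1"}),  # first relevant chunk at rank 2
    (["c2", "c9", "c4"], {"c2"}),  # first relevant chunk at rank 1
    (["c5", "c6", "c8"], {"c0"}),  # no relevant chunk retrieved
]
mrr = mean_reciprocal_rank(runs)
h1 = hits_at_k(runs, 1)
h3 = hits_at_k(runs, 3)
```

Because MRR rewards only the first relevant hit, a chunker that moves the right passage from rank 2 to rank 1 doubles that query's contribution, which is one reason chunk boundaries can change the score so sharply.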
[NLP-96] CourseTimeQA: A Lecture-Video Benchmark and a Latency-Constrained Cross-Modal Fusion Method for Timestamped QA
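[Quick read]: This paper addresses timestamped question answering over educational lecture videos under a single-GPU latency and memory budget. The core challenge is to retrieve the relevant timestamped segments efficiently and generate an evidence-grounded answer. The key to the solution is CrossFusion-RAG, a lightweight, latency-constrained cross-modal retriever that combines frozen encoders, a learned 512-768 vision projection, shallow query-agnostic cross-attention over automatic speech recognition (ASR) text and frame features, a temporal-consistency regularizer that improves temporal localization, and a small cross-attentive reranker. On the CourseTimeQA dataset it improves nDCG@10 by 0.10 and MRR by 0.08 over a strong BLIP-2 retriever while keeping median end-to-end latency at about 1.55 s.
Link: https://arxiv.org/abs/2512.00360
Authors: Vsevolod Kovalev, Parteek Kumar
Affiliations: Washington State University
Categories: Computation and Language (cs.CL)
Comments: 5 figures, 8 tables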
Abstract:We study timestamped question answering over educational lecture videos under a single-GPU latency/memory budget. Given a natural-language query, the system retrieves relevant timestamped segments and synthesizes a grounded answer. We present CourseTimeQA (52.3 h, 902 queries across six courses) and a lightweight, latency-constrained cross-modal retriever (CrossFusion-RAG) that combines frozen encoders, a learned 512-768 vision projection, shallow query-agnostic cross-attention over ASR and frames with a temporal-consistency regularizer, and a small cross-attentive reranker. On CourseTimeQA, CrossFusion-RAG improves nDCG@10 by 0.10 and MRR by 0.08 over a strong BLIP-2 retriever while achieving approximately 1.55 s median end-to-end latency on a single A100. Closest comparators (zero-shot CLIP multi-frame pooling; CLIP + cross-encoder reranker + MMR; learned late-fusion gating; text-only hybrid with cross-encoder reranking and its MMR variant; caption-augmented text retrieval; non-learned temporal smoothing) are evaluated under matched hardware and indexing. We report robustness across ASR noise (WER quartiles), diagnostics for temporal localization, and full training/tuning details to support reproducible comparison.
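For readers unfamiliar with nDCG@10, the ranking metric reported above, the following is a minimal sketch of the usual linear-gain definition. It assumes `gains` lists the graded relevance of every judged segment in ranked order, so that the ideal ordering can be obtained by sorting; the paper's exact gain function and judgment pool are not stated in the abstract.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top k positions (log2 discount)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(gains, k):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance labels for retrieved lecture segments, in ranked order.
perfect = ndcg_at_k([1, 1, 0], k=3)    # relevant segments ranked first
imperfect = ndcg_at_k([0, 1, 1], k=3)  # a non-relevant segment ranked first
```

On this 0-to-1 scale, an absolute gain of 0.10 is a sizeable shift in how early the relevant segments appear.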
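[NLP-97] IndicParam: Benchmark to evaluate LLMs on low-resource Indic Languages
[Quick read]: This paper addresses the severe under-evaluation of low-resource and extremely low-resource Indic languages in Large Language Model (LLM) assessment. Mainstream models do well on high-resource multilingual tasks, but evaluation remains scarce for 11 languages: the low-resource Nepali, Gujarati, Marathi, and Odia, and the extremely low-resource Dogri, Maithili, Rajasthani, Sanskrit, Bodo, Santali, and Konkani. The authors build IndicParam, a human-curated benchmark of over 13,000 multiple-choice questions covering these languages plus a Sanskrit-English code-mixed set. Each question is labeled as knowledge-oriented or purely linguistic to separate factual recall from grammatical proficiency, and varied formats such as list-based matching, assertion-reason pairs, and sequence ordering broaden the evaluation. The key contribution is a systematic, structured evaluation framework for Indic languages that exposes the limits of cross-lingual transfer and gives later work a challenging benchmark.
Link: https://arxiv.org/abs/2512.00333
Authors: Ayush Maheshwari, Kaushal Sharma, Vivek Patel, Aditya Maheshwari
Affiliations: Indian Institute of Management Indore; BharatGen
Categories: Computation and Language (cs.CL)
Comments: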
Abstract:While large language models excel on high-resource multilingual tasks, low- and extremely low-resource Indic languages remain severely under-evaluated. We present IndicParam, a human-curated benchmark of over 13,000 multiple-choice questions covering 11 such languages (Nepali, Gujarati, Marathi, Odia as low-resource; Dogri, Maithili, Rajasthani, Sanskrit, Bodo, Santali, Konkani as extremely low-resource) plus a Sanskrit-English code-mixed set. We evaluated 19 LLMs, both proprietary and open-weights, which reveals that even the top-performing GPT-5 reaches only 45.0% average accuracy, followed by DeepSeek-3.2 (43.1) and Claude-4.5 (42.7). We additionally label each question as knowledge-oriented or purely linguistic to discriminate factual recall from grammatical proficiency. Further, we assess the ability of LLMs to handle diverse question formats, such as list-based matching, assertion-reason pairs, and sequence ordering, alongside conventional multiple-choice questions. IndicParam provides insights into limitations of cross-lingual transfer and establishes a challenging benchmark for Indic languages. The dataset is available at this https URL. Scripts to run the benchmark are available at this https URL.
[NLP-98] Assertion-Conditioned Compliance: A Provenance-Aware Vulnerability in Multi-Turn Tool-Calling Agents
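[Quick read]: This paper addresses safety and robustness problems that multi-turn tool-calling LLMs face in real deployments, in particular their vulnerability to misleading assertions in complex dialogues. Existing benchmarks such as the Berkeley Function-Calling Leaderboard (BFCL) give little visibility into conversation-level robustness across multiple turns. The authors propose a new evaluation paradigm, Assertion-Conditioned Compliance (A-CC), whose key idea is to introduce misleading assertions from two sources: user-sourced assertions (USAs), which measure sycophancy toward plausible but misinformed user beliefs, and function-sourced assertions (FSAs), which measure compliance with plausible but contradictory system policies, such as stale hints from unmaintained tools. The A-CC metrics reveal latent risks in current models and give a practical evaluation framework for improving agent reliability in safety-critical domains.
Link: https://arxiv.org/abs/2512.00332
Authors: Daud Waqas, Aaryamaan Golthi, Erika Hayashida, Huanzhi Mao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages (incl. Appendix), 2 figures, 7 tables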
Abstract:Multi-turn tool-calling LLMs (models capable of invoking external APIs or tools across several user turns) have emerged as a key feature in modern AI assistants, enabling extended dialogues from benign tasks to critical business, medical, and financial operations. Yet implementing multi-turn pipelines remains difficult for many safety-critical industries due to ongoing concerns regarding model resilience. While standardized benchmarks such as the Berkeley Function-Calling Leaderboard (BFCL) have underpinned confidence concerning advanced function-calling models (like Salesforce’s xLAM V2), there is still a lack of visibility into multi-turn conversation-level robustness, especially given their exposure to real-world systems. In this paper, we introduce Assertion-Conditioned Compliance (A-CC), a novel evaluation paradigm for multi-turn function-calling dialogues. A-CC provides holistic metrics that evaluate a model’s behavior when confronted with misleading assertions originating from two distinct vectors: (1) user-sourced assertions (USAs), which measure sycophancy toward plausible but misinformed user beliefs, and (2) function-sourced assertions (FSAs), which measure compliance with plausible but contradictory system policies (e.g., stale hints from unmaintained tools). Our results show that models are highly vulnerable to both USA sycophancy and FSA policy conflicts, confirming A-CC as a critical, latent vulnerability in deployed agents.
[NLP-99] Evidence-Guided Schema Normalization for Temporal Tabular Reasoning
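[Quick read]: This paper addresses the challenge that temporal reasoning over evolving semi-structured tables poses for current question answering (QA) systems. The key to the solution is a SQL-based approach with three steps: generating a third normal form (3NF) schema from Wikipedia infoboxes, generating SQL queries, and executing them. The study finds that schema design quality affects QA precision more than model size, and distills three evidence-based principles: normalization that preserves context, semantic naming that reduces ambiguity, and consistent temporal anchoring. The best configuration, a Gemini 2.5 Flash schema with Gemini-2.0-Flash queries, reaches 80.39 EM, a 16.8% improvement over the baseline.
Link: https://arxiv.org/abs/2512.00329
Authors: Ashish Thanga, Vibhu Dixit, Abhilash Shankarampeta, Vivek Gupta
Affiliations: Arizona State University; UC San Diego
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: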
Abstract:Temporal reasoning over evolving semi-structured tables poses a challenge to current QA systems. We propose a SQL-based approach that involves (1) generating a 3NF schema from Wikipedia infoboxes, (2) generating SQL queries, and (3) query execution. Our central finding challenges model scaling assumptions: the quality of schema design has a greater impact on QA precision than model capacity. We establish three evidence-based principles: normalization that preserves context, semantic naming that reduces ambiguity, and consistent temporal anchoring. Our best configuration (Gemini 2.5 Flash schema + Gemini-2.0-Flash queries) achieves 80.39 EM, a 16.8% improvement over the baseline (68.89 EM).
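To make the three principles concrete, here is a minimal sketch of what a normalized, temporally anchored schema and a time-scoped query could look like, using Python's built-in `sqlite3`. The tables, column names, and rows are hypothetical and are not taken from the paper; the half-open interval convention (`start_year` inclusive, `end_year` exclusive, `NULL` meaning "current") is also an assumption chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
-- One row per tenure: descriptive names plus explicit start/end anchors.
CREATE TABLE club_membership (
    membership_id INTEGER PRIMARY KEY,
    person_id     INTEGER NOT NULL REFERENCES person(person_id),
    club_name     TEXT NOT NULL,
    start_year    INTEGER NOT NULL,
    end_year      INTEGER            -- NULL means the tenure is ongoing
);
INSERT INTO person VALUES (1, 'Example Player');
INSERT INTO club_membership VALUES
    (1, 1, 'Club A', 2010, 2014),
    (2, 1, 'Club B', 2014, 2019),
    (3, 1, 'Club C', 2019, NULL);
""")


def club_in_year(name, year):
    """Return the clubs whose tenure interval [start_year, end_year) covers year."""
    rows = conn.execute(
        """
        SELECT m.club_name
        FROM club_membership AS m
        JOIN person AS p ON p.person_id = m.person_id
        WHERE p.name = ?
          AND m.start_year <= ?
          AND (m.end_year IS NULL OR m.end_year > ?)
        ORDER BY m.start_year
        """,
        (name, year, year),
    ).fetchall()
    return [r[0] for r in rows]
```

With every time-varying fact stored as its own row with the same pair of anchor columns, a question such as "which club in 2016?" reduces to a range predicate instead of free-text parsing of an infobox cell.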
[NLP-100] Progressive Code Integration for Abstractive Bug Report Summarization
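[Quick read]: This paper addresses two core problems in LLM-based bug report summarization. First, existing methods rely mostly on surface-level textual cues, which yields incomplete or redundant summaries. Second, they ignore the code snippets attached to reports, even though code is essential for accurate defect diagnosis. The key to the solution is a progressive code-integration framework that incrementally feeds long code snippets to a large language model (LLM) alongside the report text. This works around standard LLM context window limits and produces semantically rich abstractive summaries that improve the accuracy and completeness of bug comprehension.
Link: https://arxiv.org/abs/2512.00325
Authors: Shaira Sadia Karim, Abrar Mahmud Rahim, Lamia Alam, Ishmam Tashdeed, Lutfun Nahar Lota, Md. Abu Raihan M. Kamal, Md. Azam Hossain
Affiliations: Islamic University of Technology
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: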
Abstract:Bug reports are often unstructured and verbose, making it challenging for developers to efficiently comprehend software issues. Existing summarization approaches typically rely on surface-level textual cues, resulting in incomplete or redundant summaries, and they frequently ignore associated code snippets, which are essential for accurate defect diagnosis. To address these limitations, we propose a progressive code-integration framework for LLM-based abstractive bug report summarization. Our approach incrementally incorporates long code snippets alongside textual content, overcoming standard LLM context window constraints and producing semantically rich summaries. Evaluated on four benchmark datasets using eight LLMs, our pipeline outperforms extractive baselines by 7.5%-58.2% and achieves performance comparable to state-of-the-art abstractive methods, highlighting the benefits of jointly leveraging textual and code information for enhanced bug comprehension.
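The abstract does not spell out how code is incorporated incrementally, so the following is only one plausible reading, sketched under stated assumptions: the code is split on line boundaries into pieces that fit a character budget, and each piece is folded into a running summary. `split_into_chunks`, `progressive_summary`, and `fake_summarize` are hypothetical names, and `fake_summarize` stands in for an LLM call.

```python
def split_into_chunks(code, max_chars):
    """Split code on line boundaries into pieces of at most max_chars characters.

    A single line longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for line in code.splitlines(keepends=True):
        if current and len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks


def progressive_summary(report_text, code, summarize, max_chars=200):
    """Summarise the report text first, then refine with one code chunk at a time."""
    summary = summarize(f"Bug report:\n{report_text}")
    for chunk in split_into_chunks(code, max_chars):
        summary = summarize(f"Summary so far:\n{summary}\n\nCode:\n{chunk}")
    return summary


# Stand-in for an LLM call (hypothetical): it only records prompts so the
# control flow can be inspected without any model.
calls = []


def fake_summarize(prompt):
    calls.append(prompt)
    return f"summary after {len(calls)} call(s)"


code = "a = 1\n" * 10  # ten 6-character lines
result = progressive_summary("App crashes on save.", code, fake_summarize, max_chars=20)
```

Each prompt in this sketch contains only the running summary and one chunk, so its size stays bounded no matter how long the attached code is.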
[NLP-101] Comparative Analysis of 47 Context-Based Question Answer Models Across 8 Diverse Datasets
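[Quick read]: This paper addresses the large performance variation of context-based question answering (CBQA) models across datasets, with the goal of identifying models that perform well in many settings without additional fine-tuning. The key to the solution is a systematic benchmark of 47 pretrained CBQA models from Hugging Face on 8 datasets from different domains. The best performer is ahotrod/electra_large_discriminator_squad2_512, trained on SQuAD v2, with 43% accuracy across all datasets. It also gives the best results on several individual datasets, including bioasq10b-factoid and biomedical_cpgQA, which points to the role of model architecture and training data in cross-domain generalization.
Link: https://arxiv.org/abs/2512.00323
Authors: Muhammad Muneeb, David B. Ascher, Ahsan Baidar Bakht
Affiliations: The University of Queensland; Baker Heart and Diabetes Institute; Khalifa University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: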
Abstract:Context-based question answering (CBQA) models provide more accurate and relevant answers by considering the contextual information. They effectively extract specific information given a context, making them functional in various applications involving user support, information retrieval, and educational platforms. In this manuscript, we benchmarked the performance of 47 CBQA models from Hugging Face on eight different datasets. This study aims to identify the best-performing model across diverse datasets without additional fine-tuning. It is valuable for practical applications where the need to retrain models for specific datasets is minimized, streamlining the implementation of these models in various contexts. The best-performing models were trained on the SQuAD v2 or SQuAD v1 datasets. The best-performing model was ahotrod/electra_large_discriminator_squad2_512, which yielded 43% accuracy across all datasets. We observed that the computation time of all models depends on the context length and the model size. The model’s performance usually decreases with an increase in the answer length. Moreover, the model’s performance depends on the context complexity. We also used the Genetic algorithm to improve the overall accuracy by integrating responses from other models. ahotrod/electra_large_discriminator_squad2_512 generated the best results for bioasq10b-factoid (65.92%), biomedical_cpgQA (96.45%), QuAC (11.13%), and Question Answer Dataset (41.6%). Bert-large-uncased-whole-word-masking-finetuned-squad achieved an accuracy of 82% on the IELTS dataset.
[NLP-102] Challenges of Heterogeneity in Big Data: A Comparative Study of Classification in Large-Scale Structured and Unstructured Domains
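[Quick read]: This paper addresses uncertainty about how Big Data heterogeneity ("Variety") affects classification performance, and in particular how to choose an algorithmic strategy for structured versus unstructured domains. The key to the solution is a unified framework for algorithm selection based on the nature of the data and infrastructure constraints. For high-dimensional structured data, evolutionary and Bayesian hyperparameter optimization (Genetic Algorithms, Optuna) lets linear models (SVM, Logistic Regression) outperform deep architectures and Gradient Boosting. For massive text corpora, distributed processing in Apache Spark combined with feature engineering based on Transformer embeddings (RoBERTa) and Bayesian Target Encoding lets simpler models generalize well, avoiding the overfitting that complex models showed under the constraints of distributed fine-tuning.
Link: https://arxiv.org/abs/2512.00298
Authors: González Trigueros Jesús Eduardo, Alonso Sánchez Alejandro, Muñoz Rivera Emilio, Peñarán Prieto Mariana Jaqueline, Mendoza González Camila Natalia
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 13 pages, 1 figure, 3 tables. Comparative study involving Apache Spark and Hyperparameter Optimization. Keywords: Big Data, NLP, Tabular Data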
Abstract:This study analyzes the impact of heterogeneity (“Variety”) in Big Data by comparing classification strategies across structured (Epsilon) and unstructured (Rest-Mex, IMDB) domains. A dual methodology was implemented: evolutionary and Bayesian hyperparameter optimization (Genetic Algorithms, Optuna) in Python for numerical data, and distributed processing in Apache Spark for massive textual corpora. The results reveal a “complexity paradox”: in high-dimensional spaces, optimized linear models (SVM, Logistic Regression) outperformed deep architectures and Gradient Boosting. Conversely, in text-based domains, the constraints of distributed fine-tuning led to overfitting in complex models, whereas robust feature engineering, specifically Transformer-based embeddings (RoBERTa) and Bayesian Target Encoding, enabled simpler models to generalize effectively. This work provides a unified framework for algorithm selection based on data nature and infrastructure constraints.
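The abstract does not give the exact form of its Bayesian Target Encoding. A common variant, shown here as an assumption, shrinks each category's mean target toward the global mean with a prior weight, so that rare categories are not encoded from one or two noisy labels. The function name `fit_target_encoding` and the parameter `prior_weight` are hypothetical.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, prior_weight=5.0):
    """Encode each category as a smoothed mean of its targets.

    encoding[c] = (sum of targets in c + prior_weight * global_mean)
                  / (count of c + prior_weight)
    """
    prior = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    encoding = {
        c: (sums[c] + prior_weight * prior) / (counts[c] + prior_weight)
        for c in counts
    }
    return encoding, prior


# Hypothetical toy data: category "b" appears once, so its encoding stays
# close to the global mean (0.6) rather than jumping to its raw mean (0.0).
cats = ["a", "a", "a", "a", "b"]
ys = [1, 1, 1, 0, 0]
enc, global_mean = fit_target_encoding(cats, ys)
```

In practice the encoding should be fitted on training folds only and unseen categories mapped to `global_mean`, otherwise the feature leaks label information.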
[NLP-103] EduEval: A Hierarchical Cognitive Benchmark for Evaluating Large Language Models in Chinese Education
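[Quick read]: This paper addresses the risks of deploying Large Language Models (LLMs) in Chinese K-12 education without sufficient evaluation, so that their use can be checked against educational standards. The key to the solution is EduEval, a structured hierarchical benchmark with three main contributions: (1) the EduAbility Taxonomy, a cognitive framework that unifies Bloom's Taxonomy and Webb's Depth of Knowledge to organize tasks across six cognitive dimensions (Memorization, Understanding, Application, Reasoning, Creativity, and Ethics); (2) authenticity, by integrating real exam questions, classroom conversations, student essays, and expert-designed prompts; and (3) scale, with 24 task types and over 11,000 questions spanning primary to high school. Zero-shot and few-shot evaluation of 14 leading LLMs shows that models do well on factual tasks but struggle with classroom dialogue classification and are inconsistent in creative content generation, and that several open-source models outperform proprietary systems on complex educational reasoning. The results provide targeted benchmarking metrics for developing LLMs optimized for educational use.
Link: https://arxiv.org/abs/2512.00290
Authors: Guoqing Ma, Jia Zhu, Hanghui Guo, Weijie Shi, Yue Cui, Jiawei Shen, Zilong Li, Yidan Liang
Affiliations: Zhejiang Normal University; Hong Kong University of Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: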
Abstract:Large language models (LLMs) demonstrate significant potential for educational applications. However, their unscrutinized deployment poses risks to educational standards, underscoring the need for rigorous evaluation. We introduce EduEval, a comprehensive hierarchical benchmark for evaluating LLMs in Chinese K-12 education. This benchmark makes three key contributions: (1) Cognitive Framework: We propose the EduAbility Taxonomy, which unifies Bloom’s Taxonomy and Webb’s Depth of Knowledge to organize tasks across six cognitive dimensions including Memorization, Understanding, Application, Reasoning, Creativity, and Ethics. (2) Authenticity: Our benchmark integrates real exam questions, classroom conversation, student essays, and expert-designed prompts to reflect genuine educational challenges; (3) Scale: EduEval comprises 24 distinct task types with over 11,000 questions spanning primary to high school levels. We evaluate 14 leading LLMs under both zero-shot and few-shot settings, revealing that while models perform well on factual tasks, they struggle with classroom dialogue classification and exhibit inconsistent results in creative content generation. Interestingly, several open source models outperform proprietary systems on complex educational reasoning. Few-shot prompting shows varying effectiveness across cognitive dimensions, suggesting that different educational objectives require tailored approaches. These findings provide targeted benchmarking metrics for developing LLMs specifically optimized for diverse Chinese educational tasks.
[NLP-104] Lost without translation – Can transformer (language models) understand mood states?
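[Quick read]: This paper addresses the limits of current Large Language Models (LLMs) in cross-lingual recognition of mood states, specifically their poor grasp of idioms of distress in Indic languages, which restricts clinical use in Indian mental health care. The key finding is that direct embeddings of native Indic-language phrases fail to cluster the four mood states (depression, euthymia, euphoric mania, dysphoric mania), whereas high-quality translation improves clustering substantially. The best result (Composite Score 0.67) came from human-translated English that was further translated into Chinese and embedded with a Chinese model. Translation can therefore bridge the language gap for now, but the authors argue that reliance on proprietary models or complex translation pipelines is unsustainable, and that models must be built to understand local languages directly to be effective in global mental health.
Link: https://arxiv.org/abs/2512.00274
Authors: Prakrithi Shivaprakash, Diptadhi Mukherjee, Lekhansh Shukla, Animesh Mukherjee, Prabhat Chand, Pratima Murthy
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 33 pages, 3 figures, 2 tables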
Abstract:Background: Large Language Models show promise in psychiatry but are English-centric. Their ability to understand mood states in other languages is unclear, as different languages have their own idioms of distress. Aim: To quantify the ability of language models to faithfully represent phrases (idioms of distress) of four distinct mood states (depression, euthymia, euphoric mania, dysphoric mania) expressed in Indian languages. Methods: We collected 247 unique phrases for the four mood states across 11 Indic languages. We tested seven experimental conditions, comparing k-means clustering performance on: (a) direct embeddings of native and Romanised scripts (using multilingual and Indic-specific models) and (b) embeddings of phrases translated to English and Chinese. Performance was measured using a composite score based on Adjusted Rand Index, Normalised Mutual Information, Homogeneity and Completeness. Results: Direct embedding of Indic languages failed to cluster mood states (Composite Score = 0.002). All translation-based approaches showed significant improvement. High performance was achieved using Gemini-translated English (Composite=0.60) and human-translated English (Composite=0.61) embedded with gemini-001. Surprisingly, human-translated English, further translated into Chinese and embedded with a Chinese model, performed best (Composite = 0.67). Specialised Indic models (IndicBERT and Sarvam-M) performed poorly. Conclusion: Current models cannot meaningfully represent mood states directly from Indic languages, posing a fundamental barrier to their psychiatric application for diagnostic or therapeutic purposes in India. While high-quality translation bridges this gap, reliance on proprietary models or complex translation pipelines is unsustainable. Models must first be built to understand diverse local languages to be effective in global mental health.
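The four clustering metrics behind the composite score can be sketched in pure Python. The abstract does not say how they are combined, so the unweighted mean in `composite_score` is an assumption, as is the arithmetic-mean normalisation used for NMI; the label lists are hypothetical.

```python
import math
from collections import Counter


def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """H(target | given), computed from paired label lists."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items())


def homogeneity(true, pred):
    h = _entropy(true)
    return 1.0 if h == 0 else 1.0 - _conditional_entropy(true, pred) / h


def completeness(true, pred):
    return homogeneity(pred, true)


def nmi(true, pred):
    h_t, h_p = _entropy(true), _entropy(pred)
    if h_t == 0 and h_p == 0:
        return 1.0
    mutual_info = h_t - _conditional_entropy(true, pred)
    return mutual_info / ((h_t + h_p) / 2)


def adjusted_rand_index(true, pred):
    """Pair-counting agreement corrected for chance (needs at least 2 items)."""
    pairs = lambda x: x * (x - 1) / 2
    sum_joint = sum(pairs(c) for c in Counter(zip(true, pred)).values())
    sum_true = sum(pairs(c) for c in Counter(true).values())
    sum_pred = sum(pairs(c) for c in Counter(pred).values())
    expected = sum_true * sum_pred / pairs(len(true))
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0
    return (sum_joint - expected) / (max_index - expected)


def composite_score(true, pred):
    """Assumed combination: unweighted mean of the four metrics."""
    parts = [adjusted_rand_index(true, pred), nmi(true, pred),
             homogeneity(true, pred), completeness(true, pred)]
    return sum(parts) / len(parts)


# Hypothetical labels: mood state per phrase vs. k-means cluster assignment.
good = composite_score([0, 0, 1, 1], [1, 1, 0, 0])  # clusters match up to renaming
bad = composite_score([0, 0, 1, 1], [0, 1, 0, 1])   # clusters cut across mood states
```

All four metrics are invariant to how clusters are numbered, which is why a relabelled but otherwise perfect clustering still scores at the top of the scale.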
[NLP-105] OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion ACL2026
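[Quick read]: This paper addresses the fact that current open-source text-only translation large language models (LLMs) can handle speech translation (ST) only through cascaded pipelines, which run automatic speech recognition (ASR) first and translation second. Cascading adds latency, which is especially costly in simultaneous ST (SimulST), and prevents the model from using multimodal context such as images for disambiguation. The key to the solution is an end-to-end fusion strategy that combines a pretrained multimodal foundation model (MMFM) with a dedicated translation LLM by connecting hidden states from multiple MMFM layers to the translation LLM, enabling joint end-to-end training. The resulting model, OmniFusion, handles speech, image, and text inputs, reduces SimulST latency by 1 second compared with cascaded pipelines, and improves overall translation quality.
Link: https://arxiv.org/abs/2512.00234
Authors: Sai Koneru, Matthias Huck, Jan Niehues
Affiliations: Karlsruhe Institute of Technology; SAP SE
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint for ACL 2026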
Abstract:There has been significant progress in open-source text-only translation large language models (LLMs) with better language coverage and quality. However, these models can only be used in cascaded pipelines for speech translation (ST), performing automatic speech recognition first followed by translation. This introduces additional latency, which is particularly critical in simultaneous ST (SimulST), and prevents the model from exploiting multimodal context, such as images, which can aid disambiguation. Pretrained multimodal foundation models (MMFMs) already possess strong perception and reasoning capabilities across multiple modalities, but generally lack the multilingual coverage and specialized translation performance of dedicated translation LLMs. To build an effective multimodal translation system, we propose an end-to-end approach that fuses MMFMs with translation LLMs. We introduce a novel fusion strategy that connects hidden states from multiple layers of a pretrained MMFM to a translation LLM, enabling joint end-to-end training. The resulting model, OmniFusion, built on Omni 2.5-7B as the MMFM and SeedX PPO-7B as the translation LLM, can perform speech-to-text, speech-and-image-to-text, and text-and-image-to-text translation. Experiments demonstrate that OmniFusion effectively leverages both audio and visual inputs, achieves a 1-second latency reduction in SimulST compared to cascaded pipelines, and also improves the overall translation quality. Code is available at this https URL.
[NLP-106] Minimal-Edit Instruction Tuning for Low-Resource Indic GEC AACL
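[Quick read]: This paper addresses grammatical error correction (GEC) for Indic languages such as Hindi and Malayalam, where the main challenges are limited annotated data, diverse scripts, and rich morphology. The key to the solution is an augmentation-free setup that combines instruction-tuned large language models with conservative decoding. First, a 12B GEMMA 3 model is instruction-tuned in 4-bit precision with parameter-efficient fine-tuning (PEFT) and Alpaca-style formatting. Second, decoding follows a deterministic, constraint-aware procedure with a lightweight normaliser that encourages minimal, meaning-preserving edits. Third, inference uses a fixed, language-specific prompt synthesised from a deterministic error classifier's taxonomy, label distributions, and precedence ordering computed on the training corpus, which makes the system reproducible and computationally efficient. Without any augmentation, the system scores 92.41 GLEU on Malayalam (sixth overall) and 81.44 on Hindi (third overall).
Link: https://arxiv.org/abs/2512.00219
Authors: Akhil Rajeev P
Affiliations: Centre for Development of Advanced Computing (C-DAC), Bangalore
Categories: Computation and Language (cs.CL)
Comments: Submitted to AACL-IJCNLP Bhasha Workshop Shared Task 1: GEC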
Abstract:Grammatical error correction for Indic languages faces limited supervision, diverse scripts, and rich morphology. We propose an augmentation-free setup that uses instruction-tuned large language models and conservative decoding. A 12B GEMMA 3 model is instruction-tuned in bnb 4-bit precision with parameter-efficient fine-tuning (PEFT) and Alpaca-style formatting. Decoding follows a deterministic, constraint-aware procedure with a lightweight normaliser that encourages minimal, meaning-preserving edits. We operationalise inference, subsequent to instruction fine-tuning (IFT), via a fixed, language-specific prompt directly synthesised from a deterministic error classifier’s taxonomy, label distributions, and precedence ordering computed on the training corpus. Under the official untuned GLEU evaluation, the system scores 92.41 on Malayalam, sixth overall, and 81.44 on Hindi, third overall. These results indicate that classifier-informed prompt design, adapter-based instruction tuning, and deterministic decoding provide a reproducible and computationally efficient alternative to augmentation-centred pipelines for Indic GEC. The approach also motivates future work on stronger morphosyntactic constraints and human-centred evaluation of conservative edits.
[NLP-107] Towards Corpus-Grounded Agentic LLMs for Multilingual Grammatical Analysis
Link: https://arxiv.org/abs/2512.00214
Authors: Matej Klemen, Tjaša Arčon, Luka Terčon, Marko Robnik-Šikonja, Kaja Dobrovoljc
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Pre-print, submission under review
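[NLP-108] Tree Matching Networks for Natural Language Inference: Parameter-Efficient Semantic Understanding via Dependency Parse Trees
[Quick read]: This paper addresses the large parameter counts, high training cost, and low efficiency of Transformer-based sentence embedding models such as BERT on natural language inference (NLI). The core issue is that, although self-attention lets these models capture complex relationships between words, they must learn every relationship from scratch, which is computationally expensive. The key to the solution is Tree Matching Networks (TMN), which adapt Graph Matching Networks (GMN) to dependency parse trees so that explicit syntactic structure supplies part of what would otherwise have to be learned, improving learning efficiency. On SNLI, TMN achieves significantly better results than a BERT-based baseline with a much smaller memory footprint and much less training time. The author also proposes multi-headed attention aggregation to address the scalability limits of current aggregation methods.
Link: https://arxiv.org/abs/2512.00204
Authors: Jason Lunder
Affiliations: Eastern Washington University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, preprint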
Abstract:In creating sentence embeddings for Natural Language Inference (NLI) tasks, using transformer-based models like BERT leads to high accuracy, but requires hundreds of millions of parameters. These models take in sentences as a sequence of tokens, and learn to encode the meaning of the sequence into embeddings such that those embeddings can be used reliably for NLI tasks. Essentially, every word is considered against every other word in the sequence, and the transformer model is able to determine the relationships between them, entirely from scratch. However, a model that accepts explicit linguistic structures like dependency parse trees may be able to leverage prior encoded information about these relationships, without having to learn them from scratch, thus improving learning efficiency. To investigate this, we adapt Graph Matching Networks (GMN) to operate on dependency parse trees, creating Tree Matching Networks (TMN). We compare TMN to a BERT-based model on the SNLI entailment task and on the SemEval similarity task. TMN is able to achieve significantly better results with a significantly reduced memory footprint and much less training time than the BERT-based model on the SNLI task, while both models struggled to perform well on the SemEval task. Explicit structural representations significantly outperform sequence-based models at comparable scales, but current aggregation methods limit scalability. We propose multi-headed attention aggregation to address this limitation.
[NLP-109] Measuring What LLMs Think They Do: SHAP Faithfulness and Deployability on Financial Tabular Classification AAAI2026
Link: https://arxiv.org/abs/2512.00163
Authors: Saeed AlMarri,Mathieu Ravaut,Kristof Juhasz,Gautier Marti,Hamdan Al Ahbabi,Ibrahim Elfadel
Affiliations: United Arab Emirates University; Khalifa University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 7 pages, 3 figures, 3 tables, AAAI 2026 Deployable AI Workshop
[NLP-110] Emergent Convergence in Multi-Agent LLM Annotation
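[Quick Read]: This paper studies how large language models (LLMs) coordinate when they collaborate as black-box agents, focusing on the group interaction dynamics they exhibit without explicit role prompting. The key solution is a set of process-level metrics (code stability, semantic self-consistency, lexical confidence, sentiment, and convergence) combined with an analysis of how the geometry of the output embedding space evolves. Together these reveal the semantic compression and negotiation-like behavior that LLM groups gradually develop over multi-round discussions, making it possible to quantify and understand their coordination strategies in the absence of explicit guidance.
Link: https://arxiv.org/abs/2512.00047
Authors: Angelina Parfenova,Alexander Denzler,Juergen Pfeffer
Affiliations: Lucerne University of Applied Sciences and Arts; Technical University of Munich
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: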
Abstract:Large language models (LLMs) are increasingly deployed in collaborative settings, yet little is known about how they coordinate when treated as black-box agents. We simulate 7500 multi-agent, multi-round discussions in an inductive coding task, generating over 125000 utterances that capture both final annotations and their interactional histories. We introduce process-level metrics: code stability, semantic self-consistency, and lexical confidence alongside sentiment and convergence measures, to track coordination dynamics. To probe deeper alignment signals, we analyze the evolving geometry of output embeddings, showing that intrinsic dimensionality declines over rounds, suggesting semantic compression. The results reveal that LLM groups converge lexically and semantically, develop asymmetric influence patterns, and exhibit negotiation-like behaviors despite the absence of explicit role prompting. This work demonstrates how black-box interaction analysis can surface emergent coordination strategies, offering a scalable complement to internal probe-based interpretability methods.
[NLP-111] Text Annotation via Inductive Coding: Comparing Human Experts to LLMs in Qualitative Data Analysis
Link: https://arxiv.org/abs/2512.00046
Authors: Angelina Parfenova,Andreas Marfurt,Alexander Denzler,Juergen Pfeffer
Affiliations: Lucerne University of Applied Sciences and Arts; Technical University of Munich
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
[NLP-112] Closing the Gap: Data-Centric Fine-Tuning of Vision Language Models for the Standardized Exam Questions
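[Quick Read]: This paper addresses the weak data-centric foundations of current multimodal reasoning research, in particular the lack of high-quality, structured, curriculum-aligned data for supervised fine-tuning (SFT) of vision-language models (VLMs). The core challenge is how to improve model performance substantially by optimizing data composition and representational syntax, without relying on complex reinforcement learning algorithms such as GRPO or DPO. The key solution is a 161.4-million-token multimodal dataset that combines textbook question-solution pairs, curriculum-aligned diagrams, and contextual materials, together with an optimized reasoning syntax (QMSA). With SFT alone, Qwen-2.5VL-32B reaches 78.6% accuracy on the newly released YKSUniform benchmark, only 1.0% below Gemini 2.0 Flash, which indicates that high-quality data and a well-chosen representational syntax play a decisive role in multimodal reasoning.
Link: https://arxiv.org/abs/2512.00042
Authors: Egemen Sert,Şeyda Ertekin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: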
Abstract:Multimodal reasoning has become a cornerstone of modern AI research. Standardized exam questions offer a uniquely rigorous testbed for such reasoning, providing structured visual contexts and verifiable answers. While recent progress has largely focused on algorithmic advances such as reinforcement learning (e.g., GRPO, DPO), the data centric foundations of vision language reasoning remain less explored. We show that supervised fine-tuning (SFT) with high-quality data can rival proprietary approaches. To this end, we compile a 161.4 million token multimodal dataset combining textbook question-solution pairs, curriculum aligned diagrams, and contextual materials, and fine-tune Qwen-2.5VL-32B using an optimized reasoning syntax (QMSA). The resulting model achieves 78.6% accuracy, only 1.0% below Gemini 2.0 Flash, on our newly released benchmark YKSUniform, which standardizes 1,854 multimodal exam questions across 309 curriculum topics. Our results reveal that data composition and representational syntax play a decisive role in multimodal reasoning. This work establishes a data centric framework for advancing open weight vision language models, demonstrating that carefully curated and curriculum-grounded multimodal data can elevate supervised fine-tuning to near state-of-the-art performance.
[NLP-113] LM4Opt-RA: A Multi-Candidate LLM Framework with Structured Ranking for Automating Network Resource Allocation
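[Quick Read]: This paper addresses the weak performance of existing large language models (LLMs) on resource allocation optimization problems that involve dynamic environments, interdependent variables, and heterogeneous constraints. Current benchmarks and datasets do not capture the characteristics of such tasks, so models struggle to produce accurate mathematical formulations and solutions. The key solution has three parts. First, the NL4RA dataset provides 50 resource allocation optimization problems formulated as LP, ILP, and MILP. Second, the LM4Opt-RA multi-candidate framework combines direct, few-shot, and chain-of-thought prompting with a structured ranking mechanism to improve output quality. Third, because conventional automated scores (ROUGE, BLEU, BERTScore) disagree with human judgment, the authors propose LLM-Assisted Mathematical Evaluation (LAME), an automated metric for mathematical formulations. In experiments, Llama-3.1-70B with LM4Opt-RA reaches a LAME score of 0.8007, significantly ahead of the other baseline models.
Link: https://arxiv.org/abs/2512.00039
Authors: Tasnim Ahmed,Siana Rizwan,Naveed Ejaz,Salimur Choudhury
Affiliations: Queen’s University
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: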
Abstract:Building on advancements in Large Language Models (LLMs), we can tackle complex analytical and mathematical reasoning tasks requiring nuanced contextual understanding. A prime example of such complex tasks is modelling resource allocation optimization in networks, which extends beyond translating natural language inputs into mathematical equations or Linear Programming (LP), Integer Linear Programming (ILP), and Mixed-Integer Linear Programming (MILP) models. However, existing benchmarks and datasets cannot address the complexities of such problems with dynamic environments, interdependent variables, and heterogeneous constraints. To address this gap, we introduce NL4RA, a curated dataset comprising 50 resource allocation optimization problems formulated as LP, ILP, and MILP. We then evaluate the performance of well-known open-source LLMs with varying parameter counts. To enhance existing LLM-based methods, we introduce LM4Opt-RA, a multi-candidate framework that applies diverse prompting strategies such as direct, few-shot, and chain-of-thought, combined with a structured ranking mechanism to improve accuracy. We identified discrepancies between human judgments and automated scoring such as ROUGE, BLEU, or BERT scores. However, human evaluation is time-consuming and requires specialized expertise, making it impractical for a fully automated end-to-end framework. To quantify the difference between LLM-generated responses and ground truth, we introduce LLM-Assisted Mathematical Evaluation (LAME), an automated metric designed for mathematical formulations. Using LM4Opt-RA, Llama-3.1-70B achieved a LAME score of 0.8007, outperforming other models by a significant margin, followed closely by Llama-3.1-8B. While baseline LLMs demonstrate considerable promise, they still lag behind human expertise; our proposed method surpasses these baselines regarding LAME and other metrics.
[NLP-114] Use of Retrieval-Augmented Large Language Model Agent for Long-Form COVID-19 Fact-Checking
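[Quick Read]: This paper addresses automated fact-checking of long-form misinformation from the COVID-19 pandemic, targeting the limitations of large language models (LLMs) in consistency and explainability. The key solution is SAFE (System for Accurate Fact Extraction and Evaluation), an agent system built on retrieval-augmented generation (RAG). Its LOTR-RAG module draws on a corpus of 130,000 COVID-19 research documents to extract and verify claims. SAFE (LOTR-RAG) significantly outperforms baseline models on multiple metrics (p < 0.001) and scores best on consistency (0.629) and on subjective ratings such as usefulness, clearness, and authenticity, which suggests the design improves the accuracy and reliability of fact-checking and offers a solid foundation for misinformation mitigation at scale.
Link: https://arxiv.org/abs/2512.00007
Authors: Jingyi Huang,Yuyi Yang,Mengmeng Ji,Charles Alba,Sheng Zhang,Ruopeng An
Affiliations: Shanghai University of Sports; Washington University in St. Louis; New York University
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: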
Abstract:The COVID-19 infodemic calls for scalable fact-checking solutions that handle long-form misinformation with accuracy and reliability. This study presents SAFE (system for accurate fact extraction and evaluation), an agent system that combines large language models with retrieval-augmented generation (RAG) to improve automated fact-checking of long-form COVID-19 misinformation. SAFE includes two agents - one for claim extraction and another for claim verification using LOTR-RAG, which leverages a 130,000-document COVID-19 research corpus. An enhanced variant, SAFE (LOTR-RAG + SRAG), incorporates Self-RAG to refine retrieval via query rewriting. We evaluated both systems on 50 fake news articles (2-17 pages) containing 246 annotated claims (M = 4.922, SD = 3.186), labeled as true (14.1%), partly true (14.4%), false (27.0%), partly false (2.2%), and misleading (21.0%) by public health professionals. SAFE systems significantly outperformed baseline LLMs in all metrics (p < 0.001). For consistency (0-1 scale), SAFE (LOTR-RAG) scored 0.629, exceeding both SAFE (+SRAG) (0.577) and the baseline (0.279). In subjective evaluations (0-4 Likert scale), SAFE (LOTR-RAG) also achieved the highest average ratings in usefulness (3.640), clearness (3.800), and authenticity (3.526). Adding SRAG slightly reduced overall performance, except for a minor gain in clearness. SAFE demonstrates robust improvements in long-form COVID-19 fact-checking by addressing LLM limitations in consistency and explainability. The core LOTR-RAG design proved more effective than its SRAG-augmented variant, offering a strong foundation for scalable misinformation mitigation.
[NLP-115] VeriPy - A New Python-Based Approach for SDR Pipelined/Unrolled Hardware Accelerator Generation
Link: https://arxiv.org/abs/2512.00006
Authors: Yuqin Zhao,Linghui Ye,Haihang Xia,Luke Seed,Tiantai Deng
Affiliations: University of Sheffield
Categories: Hardware Architecture (cs.AR); Computation and Language (cs.CL)
Comments: 13 Pages, 16 figures, and 9 tables. Aim to submit to IEEE TCAD
[NLP-116] LPCD: Unified Framework from Layer-Wise to Submodule Quantization
Link: https://arxiv.org/abs/2512.01546
Authors: Yuma Ichikawa,Yudai Fujimoto,Akira Sakai
Affiliations: Fujitsu Limited; RIKEN Center for AIP; Institute of Science Tokyo; Tokai University
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 21 pages, 4 figures
Computer Vision
[CV-0] EfficientFlow: Efficient Equivariant Flow Policy Learning for Embodied AI AAAI2026
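[Quick Read]: This paper addresses two core problems of generative policies in embodied AI: poor data efficiency and poor sampling efficiency. Existing methods typically need large numbers of demonstrations to train and generate actions slowly at inference time. The key solution is the EfficientFlow framework, with two main contributions. First, it brings equivariance into flow matching, with a theoretical proof that an isotropic Gaussian prior combined with an equivariant velocity prediction network yields an equivariant action distribution, which improves generalization and reduces data requirements. Second, it introduces a novel acceleration regularization strategy: a tractable surrogate loss that enables stable, scalable training using only conditional trajectories and speeds up sampling. Experiments show strong performance under limited data and much faster inference.
Link: https://arxiv.org/abs/2512.02020
Authors: Jianlei Chang,Ruofeng Mei,Wei Ke,Xiangyu Xu
Affiliations: Xi’an Jiaotong University
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by AAAI 2026. Project Page: this https URL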
Abstract:Generative modeling has recently shown remarkable promise for visuomotor policy learning, enabling flexible and expressive control across diverse embodied AI tasks. However, existing generative policies often struggle with data inefficiency, requiring large-scale demonstrations, and sampling inefficiency, incurring slow action generation during inference. We introduce EfficientFlow, a unified framework for efficient embodied AI with flow-based policy learning. To enhance data efficiency, we bring equivariance into flow matching. We theoretically prove that when using an isotropic Gaussian prior and an equivariant velocity prediction network, the resulting action distribution remains equivariant, leading to improved generalization and substantially reduced data demands. To accelerate sampling, we propose a novel acceleration regularization strategy. As direct computation of acceleration is intractable for marginal flow trajectories, we derive a novel surrogate loss that enables stable and scalable training using only conditional trajectories. Across a wide range of robotic manipulation benchmarks, the proposed algorithm achieves competitive or superior performance under limited data while offering dramatically faster inference. These results highlight EfficientFlow as a powerful and efficient paradigm for high-performance embodied AI.
[CV-1] Data-Centric Visual Development for Self-Driving Labs
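[Quick Read]: This paper addresses the limited robustness of visual perception models in self-driving laboratories (SDLs), caused by scarce training data (negative samples in particular), which undermines reliability and reproducibility in high-precision biological procedures such as pipetting. The key solution is a hybrid pipeline that fuses real and virtual data generation. Real data is collected through a human-in-the-loop scheme that pairs automated acquisition with selective human verification, keeping annotation quality high with minimal effort. Virtual data is produced by reference-conditioned, prompt-guided image generation and then screened and validated for reliability. The resulting class-balanced dataset supports robust bubble detection (up to 99.6% test accuracy) while lowering collection and annotation cost, offering a scalable, low-cost data supply strategy for rare-event detection in SDLs and for broader vision tasks.
Link: https://arxiv.org/abs/2512.02018
Authors: Anbang Liu,Guanzhong Hu,Jiayi Wang,Ping Guo,Han Liu
Affiliations: Northwestern University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 11 pages, 4 figures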
Abstract:Self-driving laboratories offer a promising path toward reducing the labor-intensive, time-consuming, and often irreproducible workflows in the biological sciences. Yet their stringent precision requirements demand highly robust models whose training relies on large amounts of annotated data. However, this kind of data is difficult to obtain in routine practice, especially negative samples. In this work, we focus on pipetting, the most critical and precision sensitive action in SDLs. To overcome the scarcity of training data, we build a hybrid pipeline that fuses real and virtual data generation. The real track adopts a human-in-the-loop scheme that couples automated acquisition with selective human verification to maximize accuracy with minimal effort. The virtual track augments the real data using reference-conditioned, prompt-guided image generation, which is further screened and validated for reliability. Together, these two tracks yield a class-balanced dataset that enables robust bubble detection training. On a held-out real test set, a model trained entirely on automatically acquired real images reaches 99.6% accuracy, and mixing real and generated data during training sustains 99.4% accuracy while reducing collection and review load. Our approach offers a scalable and cost-effective strategy for supplying visual feedback data to SDL workflows and provides a practical solution to data scarcity in rare event detection and broader vision tasks.
[CV-2] Visual Sync: Multi-Camera Synchronization via Cross-View Object Motion NEURIPS2025
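[Quick Read]: This paper addresses the difficulty of accurately synchronizing videos recorded by multiple cameras in uncontrolled settings, that is, the temporal alignment of cross-camera video streams. Existing methods usually depend on controlled environments, specific targets, manual correction, or costly hardware, which limits their generality. The key solution is the VisualSync framework. Its core idea is that when two cameras observe the same moving 3D point, that point satisfies the epipolar constraint once the videos are correctly synchronized. VisualSync uses off-the-shelf 3D reconstruction, feature matching, and dense tracking to extract tracklets, relative poses, and cross-view correspondences, then jointly minimizes the epipolar error to estimate each camera's time offset, achieving automatic synchronization at millisecond-level accuracy.
Link: https://arxiv.org/abs/2512.02017
Authors: Shaowei Liu,David Yifan Yao,Saurabh Gupta,Shenlong Wang
Affiliations: University of Illinois Urbana-Champaign
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted to NeurIPS 2025. Project page: this https URL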
Abstract:Today, people can easily record memorable moments, such as concerts, sports events, lectures, family gatherings, and birthday parties, with multiple consumer cameras. However, synchronizing these cross-camera streams remains challenging. Existing methods assume controlled settings, specific targets, manual correction, or costly hardware. We present VisualSync, an optimization framework based on multi-view dynamics that aligns unposed, unsynchronized videos at millisecond accuracy. Our key insight is that any moving 3D point, when co-visible in two cameras, obeys epipolar constraints once properly synchronized. To exploit this, VisualSync leverages off-the-shelf 3D reconstruction, feature matching, and dense tracking to extract tracklets, relative poses, and cross-view correspondences. It then jointly minimizes the epipolar error to estimate each camera’s time offset. Experiments on four diverse, challenging datasets show that VisualSync outperforms baseline methods, achieving a median synchronization error below 50 ms.
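The core idea of estimating a time offset by minimizing epipolar error can be illustrated with a toy example. This is only a minimal sketch under strong assumptions, not the VisualSync pipeline: it uses a single synthetic point track, a known fundamental matrix for a rectified two-camera rig, and a brute-force search over integer frame shifts. The names `point_at`, `project`, `estimate_offset`, and `TRUE_OFFSET` are hypothetical and chosen for illustration.

```python
import math

# Assumed setup: two cameras with identity rotation, separated along the x-axis.
# For such a rectified pair (normalized coordinates), the epipolar constraint
# x2^T F x1 = 0 reduces to "both views see the point at the same image height".
F = [[0.0, 0.0, 0.0],
     [0.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]

def epipolar_residual(x1, x2, F):
    """Algebraic epipolar error x2^T F x1 for homogeneous image points."""
    Fx1 = [sum(F[i][j] * x1[j] for j in range(3)) for i in range(3)]
    return sum(x2[i] * Fx1[i] for i in range(3))

def point_at(t):
    """Hypothetical 3D trajectory of one moving point, in camera-1 coordinates."""
    return (math.sin(0.3 * t), 0.1 * t + 0.5 * math.sin(0.7 * t), 5.0)

def project(P, cam_x):
    """Pinhole projection for a camera located at (cam_x, 0, 0)."""
    X, Y, Z = P
    return ((X - cam_x) / Z, Y / Z, 1.0)

def estimate_offset(track1, track2, F, max_shift):
    """Return the integer frame shift d that minimizes the mean squared
    epipolar error when pairing track1[k + d] with track2[k]."""
    best_shift, best_cost = None, float("inf")
    for d in range(-max_shift, max_shift + 1):
        residuals = [epipolar_residual(track1[k + d], track2[k], F) ** 2
                     for k in range(len(track2)) if 0 <= k + d < len(track1)]
        cost = sum(residuals) / len(residuals)
        if cost < best_cost:
            best_shift, best_cost = d, cost
    return best_shift

# Camera 2 starts recording TRUE_OFFSET frames later than camera 1.
TRUE_OFFSET = 3
track1 = [project(point_at(k), 0.0) for k in range(60)]
track2 = [project(point_at(k + TRUE_OFFSET), 0.5) for k in range(60)]
estimated = estimate_offset(track1, track2, F, max_shift=10)
print("estimated frame offset:", estimated)
```

In this toy setting the error vanishes only when the two tracks are paired at the same instant, which is the property the abstract describes. A realistic system would additionally need estimated poses, many tracks, sub-frame offsets, and a joint optimization over all cameras.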
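[CV-3] Objects in Generated Videos Are Slower Than They Appear: Models Suffer Sub-Earth Gravity and Don't Know Galileo's Principle…for now
[Quick Read]: This paper addresses the shortcomings of current generative video models in representing basic physical laws, gravity in particular. Out-of-the-box video generators consistently produce objects that fall with too low an acceleration, and simple fixes such as temporal rescaling do not remove the error. To assess the models' grasp of physics more rigorously, the authors propose a unit-free, two-object protocol based on Galileo's equivalence principle, which tests the relationship between the ratio of squared fall times and the ratio of drop heights. The key point is that this relative test is independent of the gravitational acceleration g, focal length, and scene scale, so it isolates systematic flaws in the model's physical representation from scale ambiguities. The study further shows that a lightweight low-rank adaptor fine-tuned on only 100 single-ball drop clips raises the effective gravitational acceleration from 1.81 m/s² to 6.43 m/s² (65% of terrestrial gravity) and generalizes zero-shot, suggesting that specific physical laws can be corrected in a targeted way with minimal data.
Link: https://arxiv.org/abs/2512.02016
Authors: Varun Varma Thozhiyoor,Shivam Tripathi,Venkatesh Babu Radhakrishnan,Anand Bhattad
Affiliations: Indian Institute of Science; Johns Hopkins University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL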
Abstract:Video generators are increasingly evaluated as potential world models, which requires them to encode and understand physical laws. We investigate their representation of a fundamental law: gravity. Out-of-the-box video generators consistently generate objects falling at an effectively slower acceleration. However, these physical tests are often confounded by ambiguous metric scale. We first investigate if observed physical errors are artifacts of these ambiguities (e.g., incorrect frame rate assumptions). We find that even temporal rescaling cannot correct the high-variance gravity artifacts. To rigorously isolate the underlying physical representation from these confounds, we introduce a unit-free, two-object protocol that tests the timing ratio $t_1^2/t_2^2 = h_1/h_2$, a relationship independent of $g$, focal length, and scale. This relative test reveals violations of Galileo’s equivalence principle. We then demonstrate that this physical gap can be partially mitigated with targeted specialization. A lightweight low-rank adaptor fine-tuned on only 100 single-ball clips raises $g_{\mathrm{eff}}$ from $1.81\,\mathrm{m/s^2}$ to $6.43\,\mathrm{m/s^2}$ (reaching 65% of terrestrial gravity). This specialist adaptor also generalizes zero-shot to two-ball drops and inclined planes, offering initial evidence that specific physical laws can be corrected with minimal data.
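The two quantities in the abstract, the effective gravity and the unit-free timing ratio, follow from the free-fall relation h = ½·g·t² for an object released from rest. The sketch below shows how they might be computed from measured drop heights and fall times. The function names and sample values are hypothetical, and the paper's actual measurement procedure is not described in the abstract.

```python
import math

G_EARTH = 9.81  # m/s^2, assumed reference value for terrestrial gravity

def effective_gravity(height_m, fall_time_s):
    """Effective g implied by a drop from rest: h = 0.5 * g * t^2."""
    return 2.0 * height_m / fall_time_s ** 2

def timing_ratio_residual(h1, t1, h2, t2):
    """Deviation from t1^2 / t2^2 = h1 / h2; zero for ideal free fall.
    The value of g and the length unit of h both cancel out of this test."""
    return (t1 ** 2 / t2 ** 2) - (h1 / h2)

def fall_time(height_m, g):
    """Ideal fall time from rest under constant acceleration g."""
    return math.sqrt(2.0 * height_m / g)

# Hypothetical drop heights for the two objects.
h1, h2 = 1.0, 2.0

# Ideal fall times under Earth gravity and under a much weaker gravity.
t1, t2 = fall_time(h1, G_EARTH), fall_time(h2, G_EARTH)
t1_slow, t2_slow = fall_time(h1, 1.81), fall_time(h2, 1.81)

print("g_eff (Earth case):", effective_gravity(h1, t1))
print("g_eff (slow case):", effective_gravity(h1, t1_slow))
print("ratio residual (Earth case):", timing_ratio_residual(h1, t1, h2, t2))
print("ratio residual (slow case):", timing_ratio_residual(h1, t1_slow, h2, t2_slow))
```

Both cases satisfy the ratio test even though their effective gravity differs, which is why the relative test can separate "wrong value of g" from "inconsistent physics". With the assumed reference value, 6.43 / 9.81 ≈ 0.655, consistent with the roughly 65% figure quoted in the abstract.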
[CV-4] Generative Video Motion Editing with 3D Point Tracks
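[Quick Read]: This paper addresses precise editing of camera and object motion in video, especially under complex object movement. Existing image-to-video (I2V) methods lack full-scene context and therefore produce inconsistent edits, while video-to-video (V2V) methods can change the viewpoint or translate objects in basic ways but offer limited control over fine-grained object motion. The key solution is a track-conditioned V2V framework that conditions generation on the source video and paired 3D point tracks. These tracks establish sparse correspondences that transfer rich context from the source video to the new motion while preserving spatiotemporal coherence. Because 3D tracks carry explicit depth cues, the model can resolve depth order and handle occlusions, which markedly improves the precision and controllability of motion editing.
Link: https://arxiv.org/abs/2512.02015
Authors: Yao-Chih Lee,Zhoutong Zhang,Jiahui Huang,Jui-Hsien Wang,Joon-Young Lee,Jia-Bin Huang,Eli Shechtman,Zhengqi Li
Affiliations: Adobe Research; Adobe; University of Maryland College Park
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL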
Abstract:Camera and object motions are central to a video’s narrative. However, precisely editing these captured motions remains a significant challenge, especially under complex object movements. Current motion-controlled image-to-video (I2V) approaches often lack full-scene context for consistent video editing, while video-to-video (V2V) methods provide viewpoint changes or basic object translation, but offer limited control over fine-grained object motion. We present a track-conditioned V2V framework that enables joint editing of camera and object motion. We achieve this by conditioning a video generation model on a source video and paired 3D point tracks representing source and target motions. These 3D tracks establish sparse correspondences that transfer rich context from the source video to new motions while preserving spatiotemporal coherence. Crucially, compared to 2D tracks, 3D tracks provide explicit depth cues, allowing the model to resolve depth order and handle occlusions for precise motion editing. Trained in two stages on synthetic and real data, our model supports diverse motion edits, including joint camera/object manipulation, motion transfer, and non-rigid deformation, unlocking new creative potential in video editing.
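[CV-5] TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models
[Quick Read]: This paper addresses the format mismatches that arise in unified multimodal models (UMMs) when visual representations are decoupled, and the difficulty of jointly optimizing understanding and generation. The key solution is TUNA, a native unified multimodal model that builds a unified continuous visual representation space by cascading a variational autoencoder (VAE) encoder with a representation encoder, enabling end-to-end processing of images and videos for both understanding and generation. This design avoids the representation inconsistencies introduced by separate encoders in decoupled architectures. Experiments show that stronger pretrained representation encoders consistently improve performance across multimodal tasks, and that joint training on understanding and generation data lets the two tasks benefit from each other rather than interfere, demonstrating strong performance and scalability.
Link: https://arxiv.org/abs/2512.02014
Authors: Zhiheng Liu,Weiming Ren,Haozhe Liu,Zijian Zhou,Shoufa Chen,Haonan Qiu,Xiaoke Huang,Zhaochong An,Fanny Yang,Aditya Patel,Viktar Atliha,Tony Ng,Xiao Han,Chuyan Zhu,Chenyang Zhang,Ding Liu,Juan-Manuel Perez-Rua,Sen He,Jürgen Schmidhuber,Wenhu Chen,Ping Luo,Wei Liu,Tao Xiang,Jonas Schult,Yuren Cong
Affiliations: Meta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL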
Abstract:Unified multimodal models (UMMs) aim to jointly perform multimodal understanding and generation within a single framework. We present TUNA, a native UMM that builds a unified continuous visual representation by cascading a VAE encoder with a representation encoder. This unified representation space allows end-to-end processing of images and videos for both understanding and generation tasks. Compared to prior UMMs with decoupled representations, TUNA’s unified visual space avoids representation format mismatches introduced by separate encoders, outperforming decoupled alternatives in both understanding and generation. Moreover, we observe that stronger pretrained representation encoders consistently yield better performance across all multimodal tasks, highlighting the importance of the representation encoder. Finally, in this unified setting, jointly training on both understanding and generation data allows the two tasks to benefit from each other rather than interfere. Our extensive experiments on multimodal understanding and generation benchmarks show that TUNA achieves state-of-the-art results in image and video understanding, image and video generation, and image editing, demonstrating the effectiveness and scalability of its unified representation design.
[CV-6] Improved Mean Flows: On the Challenges of Fastforward Generative Models
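[Quick Read]: This paper improves the MeanFlow (MF) framework on two key issues, the training objective and the guidance mechanism. First, the original MF training target depends not only on the ground-truth fields but also on the network itself, which makes training unstable. Second, the original MF fixes the classifier-free guidance scale during training, which limits flexibility at test time. The key solution is twofold. The training objective is recast as a regression loss on the instantaneous velocity $v$, re-parameterized by a network that predicts the average velocity $u$, turning it into a more standard regression problem with better training stability. Guidance is formulated as explicit conditioning variables and handled through in-context conditioning, which keeps test-time control flexible while reducing model size and improving performance. The resulting improved MeanFlow (iMF) reaches 1.72 FID on ImageNet 256×256 with a single function evaluation (1-NFE), substantially outperforming comparable one-step methods and approaching multi-step methods without distillation.
Link: https://arxiv.org/abs/2512.02012
Authors: Zhengyang Geng,Yiyang Lu,Zongze Wu,Eli Shechtman,J. Zico Kolter,Kaiming He
Affiliations: CMU; MIT; Adobe; THU
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Technical report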
Abstract:MeanFlow (MF) has recently been established as a framework for one-step generative modeling. However, its “fastforward” nature introduces key challenges in both the training objective and the guidance mechanism. First, the original MF’s training target depends not only on the underlying ground-truth fields but also on the network itself. To address this issue, we recast the objective as a loss on the instantaneous velocity $v$, re-parameterized by a network that predicts the average velocity $u$. Our reformulation yields a more standard regression problem and improves the training stability. Second, the original MF fixes the classifier-free guidance scale during training, which sacrifices flexibility. We tackle this issue by formulating guidance as explicit conditioning variables, thereby retaining flexibility at test time. The diverse conditions are processed through in-context conditioning, which reduces model size and benefits performance. Overall, our improved MeanFlow (iMF) method, trained entirely from scratch, achieves 1.72 FID with a single function evaluation (1-NFE) on ImageNet $256 \times 256$. iMF substantially outperforms prior methods of this kind and closes the gap with multi-step methods while using no distillation. We hope our work will further advance fastforward generative modeling as a stand-alone paradigm.
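The link between the two velocities mentioned in the abstract can be written out explicitly. The following is a sketch, not the paper's stated loss: it assumes the interval-average definition of $u$ from the original MeanFlow formulation, and the symbols $z_t$, $r$, $t$, and $u_\theta$ are notation introduced here for illustration.

```latex
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau
\quad\Longrightarrow\quad
v(z_t, t) = u(z_t, r, t) + (t - r)\, \frac{d}{dt} u(z_t, r, t)
```

The implication follows by multiplying both sides of the definition by $(t - r)$ and differentiating with respect to $t$ while holding $r$ fixed. Under this assumption, substituting a network $u_\theta$ for $u$ on the right-hand side gives a predicted $v$ that could be regressed against a velocity target that does not depend on the network. The abstract does not specify the exact form the authors use.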
[CV-7] AirSim360: A Panoramic Simulation Platform within Drone View
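[Quick Read]: This paper addresses the shortage of large-scale, diverse data that currently limits progress in 360-degree omnidirectional understanding and spatial intelligence. The core solution is AirSim360, a panoramic data simulation platform built around drone viewpoints. It makes three key contributions: a render-aligned labeling paradigm for pixel-level geometric, semantic, and entity-level understanding; an interactive pedestrian-aware system for modeling human behavior; and an automated trajectory generation mechanism that supports navigation tasks. The platform is the first to systematically simulate the 4D real world under an omnidirectional setting, considerably improving the diversity and usability of panoramic data.
Link: https://arxiv.org/abs/2512.02009
Authors: Xian Ge,Yuling Pan,Yuhang Zhang,Xiang Li,Weijun Zhang,Dizhe Zhang,Zhaoliang Wan,Xin Lin,Xiangkai Zhang,Juntao Liang,Jason Li,Wenjie Jiang,Bo Du,Ming-Hsuan Yang,Lu Qi
Affiliations: Insta360 Research; Wuhan University; University of California, San Diego; Nanyang Technological University; University of California, Merced; Shenzhen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Website: this https URL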
Abstract:The field of 360-degree omnidirectional understanding has been receiving increasing attention for advancing spatial intelligence. However, the lack of large-scale and diverse data remains a major limitation. In this work, we propose AirSim360, a simulation platform for omnidirectional data from aerial viewpoints, enabling wide-ranging scene sampling with drones. Specifically, AirSim360 focuses on three key aspects: a render-aligned data and labeling paradigm for pixel-level geometric, semantic, and entity-level understanding; an interactive pedestrian-aware system for modeling human behavior; and an automated trajectory generation paradigm to support navigation tasks. Furthermore, we collect more than 60K panoramic samples and conduct extensive experiments across various tasks to demonstrate the effectiveness of our simulator. Unlike existing simulators, our work is the first to systematically model the 4D real world under an omnidirectional setting. The entire platform, including the toolkit, plugins, and collected datasets, will be made publicly available at this https URL.
[CV-8] MV-TAP: Tracking Any Point in Multi-View Videos
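[Quick Read]: This paper addresses point trajectory tracking for dynamic objects in multi-view videos, that is, how to use multi-view information to estimate more complete and reliable point trajectories in complex real-world scenes. The key solution is MV-TAP, a point tracker that combines camera geometry with a cross-view attention mechanism to aggregate spatio-temporal information across views, improving the accuracy and robustness of point tracking.
Link: https://arxiv.org/abs/2512.02006
Authors: Jahyeok Koo,Inès Hyeonsu Kim,Mungyeom Kim,Junghyun Park,Seohyun Park,Jaeyeong Kim,Jung Yi,Seokju Cho,Seungryong Kim
Affiliations: KAIST AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL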
Abstract:Multi-view camera systems enable rich observations of complex real-world scenes, and understanding dynamic objects in multi-view settings has become central to various applications. In this work, we present MV-TAP, a novel point tracker that tracks points across multi-view videos of dynamic scenes by leveraging cross-view information. MV-TAP utilizes camera geometry and a cross-view attention mechanism to aggregate spatio-temporal information across views, enabling more complete and reliable trajectory estimation in multi-view videos. To support this task, we construct a large-scale synthetic training dataset and real-world evaluation sets tailored for multi-view tracking. Extensive experiments demonstrate that MV-TAP outperforms existing point-tracking methods on challenging benchmarks, establishing an effective baseline for advancing research in multi-view point tracking.
[CV-9] Learning Visual Affordance from Audio
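[Quick Read]: This paper addresses the ambiguity and occlusion problems that arise in affordance grounding (segmenting interaction regions) when it relies on textual instructions or demonstration videos, and proposes locating interaction regions from audio, which is more intuitive, real-time, and semantically rich. The key solution has two parts. The authors build the first dataset for Audio-Visual Affordance Grounding (AV-AG), containing a large collection of action sounds, object images, and pixel-level affordance annotations. They also design AVAGFormer, a model with a semantic-conditioned cross-modal mixer and a dual-head decoder that fuses audio and visual signals to predict accurate masks, significantly outperforming existing baselines in zero-shot generalization.
Link: https://arxiv.org/abs/2512.02005
Authors: Lidong Lu,Guo Chen,Zhu Wei,Yicheng Liu,Tong Lu
Affiliations: Nanjing University; China Mobile Communications Company Limited Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 10 figures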
Abstract:We introduce Audio-Visual Affordance Grounding (AV-AG), a new task that segments object interaction regions from action sounds. Unlike existing approaches that rely on textual instructions or demonstration videos, which are often limited by ambiguity or occlusion, audio provides real-time, semantically rich, and visually independent cues for affordance grounding, enabling more intuitive understanding of interaction regions. To support this task, we construct the first AV-AG dataset, comprising a large collection of action sounds, object images, and pixel-level affordance annotations. The dataset also includes an unseen subset to evaluate zero-shot generalization. Furthermore, we propose AVAGFormer, a model equipped with a semantic-conditioned cross-modal mixer and a dual-head decoder that effectively fuses audio and visual signals for mask prediction. Experiments show that AVAGFormer achieves state-of-the-art performance on AV-AG, surpassing baselines from related tasks. Comprehensive analyses highlight the distinctions between AV-AG and AVS, the benefits of end-to-end modeling, and the contribution of each component. Code and dataset have been released on this https URL.
[CV-10] RoaD: Rollouts as Demonstrations for Closed-Loop Supervised Fine-Tuning of Autonomous Driving Policies
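[Quick Read]: This paper addresses the compounding errors caused by covariate shift when autonomous driving policies are deployed in closed loop, a problem typical of policies trained with open-loop behavior cloning. The key solution is Rollouts as Demonstrations (RoaD), which uses the policy's own closed-loop rollouts as additional training data and applies expert guidance during rollout generation to bias trajectories toward high-quality behavior, producing demonstrations for fine-tuning that are both informative and realistic. The method needs far less data than reinforcement learning and avoids the restrictive assumptions of earlier closed-loop supervised fine-tuning (CL-SFT) methods, so it applies to a broader range of settings, including end-to-end driving.
Link: https://arxiv.org/abs/2512.01993
Authors: Guillermo Garcia-Cobo,Maximilian Igl,Peter Karkus,Zhejun Zhang,Michael Watson,Yuxiao Chen,Boris Ivanovic,Marco Pavone
Affiliations: NVIDIA Research; Huawei VN Research Center; Stanford University
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Preprint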
Abstract:Autonomous driving policies are typically trained via open-loop behavior cloning of human demonstrations. However, such policies suffer from covariate shift when deployed in closed loop, leading to compounding errors. We introduce Rollouts as Demonstrations (RoaD), a simple and efficient method to mitigate covariate shift by leveraging the policy’s own closed-loop rollouts as additional training data. During rollout generation, RoaD incorporates expert guidance to bias trajectories toward high-quality behavior, producing informative yet realistic demonstrations for fine-tuning. This approach enables robust closed-loop adaptation with orders of magnitude less data than reinforcement learning, and avoids restrictive assumptions of prior closed-loop supervised fine-tuning (CL-SFT) methods, allowing broader application domains including end-to-end driving. We demonstrate the effectiveness of RoaD on WOSAC, a large-scale traffic simulation benchmark, where it performs similarly to or better than the prior CL-SFT method; and in AlpaSim, a high-fidelity neural reconstruction-based simulator for end-to-end driving, where it improves driving score by 41% and reduces collisions by 54%.
[CV-11] PAI-Bench: A Comprehensive Benchmark For Physical AI
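[Quick Read]: This paper addresses the limited physical perception and prediction abilities of current multi-modal large language models and video generative models: despite strong visual fidelity, they struggle to maintain physically coherent dynamics and to perform causal reasoning. The key to the solution is Physical AI Bench (PAI-Bench), a unified and comprehensive benchmark covering video generation, conditional video generation, and video understanding. It comprises 2,808 real-world cases with task-aligned metrics that quantify physical plausibility and domain-specific reasoning, providing a realistic, quantifiable foundation for evaluating Physical AI.
Link: https://arxiv.org/abs/2512.01989
Authors: Fengzhe Zhou, Jiannan Huang, Jialuo Li, Deva Ramanan, Humphrey Shi
Affiliations: Georgia Tech; CMU
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: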
Abstract:Physical AI aims to develop models that can perceive and predict real-world dynamics; yet, the extent to which current multi-modal large language models and video generative models support these abilities is insufficiently understood. We introduce Physical AI Bench (PAI-Bench), a unified and comprehensive benchmark that evaluates perception and prediction capabilities across video generation, conditional video generation, and video understanding, comprising 2,808 real-world cases with task-aligned metrics designed to capture physical plausibility and domain-specific reasoning. Our study provides a systematic assessment of recent models and shows that video generative models, despite strong visual fidelity, often struggle to maintain physically coherent dynamics, while multi-modal large language models exhibit limited performance in forecasting and causal interpretation. These observations suggest that current systems are still at an early stage in handling the perceptual and predictive demands of Physical AI. In summary, PAI-Bench establishes a realistic foundation for evaluating Physical AI and highlights key gaps that future systems must address.
[CV-12] Artemis: Structured Visual Reasoning for Perception Policy Learning
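[Quick Read]: This paper addresses the performance drop on visual perception tasks that occurs when reinforcement-learning-based perception policy frameworks use purely linguistic intermediate reasoning chains. The authors argue that the root cause is not reasoning itself but its form: existing methods reason semantically in an unstructured linguistic space, whereas visual perception inherently relies on spatial, object-centric reasoning. The key to the solution is Artemis, a framework built on structured proposal-based reasoning in which each intermediate step is a (label, bounding-box) pair that explicitly captures a verifiable visual state. This enables explicit tracking of intermediate states and direct supervision of proposal quality, and it avoids the ambiguity introduced by language-based reasoning. The design yields substantial gains on grounding, detection, counting, and geometric-perception tasks, supporting the claim that spatially aligned reasoning strengthens perception-policy learning.
Link: https://arxiv.org/abs/2512.01988
Authors: Wei Tang, Yanpeng Sun, Shan Zhang, Xiaofan Li, Piotr Koniusz, Wei Li, Na Zhao, Zechao Li
Affiliations: NJUST IMAG; SUTD IMPL; Adelaide AIML; Baidu Inc.; Data61; SenseTime
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: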
Abstract:Recent reinforcement-learning frameworks for visual perception policy have begun to incorporate intermediate reasoning chains expressed in natural language. Empirical observations indicate that such purely linguistic intermediate reasoning often reduces performance on perception tasks. We argue that the core issue lies not in reasoning per se but in the form of reasoning: while these chains perform semantic reasoning in an unstructured linguistic space, visual perception requires reasoning in a spatial and object-centric space. In response, we introduce Artemis, a perception-policy learning framework that performs structured proposal-based reasoning, where each intermediate step is represented as a (label, bounding-box) pair capturing a verifiable visual state. This design enables explicit tracking of intermediate states and direct supervision for proposal quality, and avoids ambiguity introduced by language-based reasoning. Artemis is built on Qwen2.5-VL-3B, achieves strong performance on grounding and detection tasks, and exhibits substantial generalization to counting and geometric-perception tasks. The consistent improvements across these diverse settings confirm that aligning reasoning with spatial representations enhances perception-policy learning. Owing to its strengthened visual reasoning, Artemis also achieves competitive performance on general MLLM benchmarks, illustrating that spatially grounded reasoning provides a principled route toward scalable and general perception policies.
[CV-13] SGDiff: Scene Graph Guided Diffusion Model for Image Collaborative SegCaptioning AAAI-2025
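[Quick Read]: This paper addresses the high prompt cost and limited output of controllable image semantic understanding tasks, in which users must supply a prompt (e.g., text or bounding boxes) to obtain a single result. It proposes a new task, Image Collaborative Segmentation and Captioning (SegCaptioning), which turns a simple prompt such as an object bounding box into diverse semantic interpretations, i.e., (caption, masks) pairs, from which users can choose. The key to the solution is the Scene Graph Guided Diffusion Model (SGDiff). A Prompt-Centric Scene Graph Adaptor first maps the user's prompt to a structured scene graph to capture intent; a diffusion process with a Scene Graph Guided Bimodal Transformer then predicts semantically consistent caption-mask pairs; and a Multi-Entities Contrastive Learning loss explicitly aligns visual and textual entities so that masks and captions match.
Link: https://arxiv.org/abs/2512.01975
Authors: Xu Zhang, Jin Yuan, Hanwang Zhang, Guojin Zhong, Yongsheng Zang, Jiacheng Lin, Zhiyong Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI-2025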
Abstract:Controllable image semantic understanding tasks, such as captioning or segmentation, require users to input a prompt (e.g., text or bounding boxes) to predict a unique outcome, presenting challenges such as high-cost prompt input or limited information output. This paper introduces a new task, “Image Collaborative Segmentation and Captioning” (SegCaptioning), which aims to translate a straightforward prompt, like a bounding box around an object, into diverse semantic interpretations represented by (caption, masks) pairs, allowing flexible result selection by users. This task poses significant challenges, including accurately capturing a user’s intention from a minimal prompt while simultaneously predicting multiple semantically aligned caption words and masks. Technically, we propose a novel Scene Graph Guided Diffusion Model that leverages structured scene graph features for correlated mask-caption prediction. Initially, we introduce a Prompt-Centric Scene Graph Adaptor to map a user’s prompt to a scene graph, effectively capturing the user’s intention. Subsequently, we employ a diffusion process incorporating a Scene Graph Guided Bimodal Transformer to predict correlated caption-mask pairs by uncovering intricate correlations between them. To ensure accurate alignment, we design a Multi-Entities Contrastive Learning loss to explicitly align visual and textual entities by considering inter-modal similarity, resulting in well-aligned caption-mask pairs. Extensive experiments conducted on two datasets demonstrate that SGDiff achieves superior performance in SegCaptioning, yielding promising results for both captioning and segmentation tasks with minimal prompt input.
[CV-14] SpriteHand: Real-Time Versatile Hand-Object Interaction with Autoregressive Video Generation
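[Quick Read]: This paper addresses the difficulty of modeling and synthesizing complex hand-object interactions with physics engines, especially dynamic interactions with non-rigid or articulated entities such as deformable fabrics, elastic materials, hinge-based structures, furry surfaces, and even living creatures. Conventional simulation-based approaches rely on explicitly defined rigid object models and pre-scripted hand gestures and cannot handle such interactions. The key to the solution is SpriteHand, an autoregressive video generation framework that uses a causal inference architecture for autoregressive generation and a hybrid post-training approach to improve visual realism and temporal coherence. Given only a static object image and a video stream of hands imagined to interact with the object, the model generates hand-object interaction video in real time: about 18 FPS at 640×368 resolution with roughly 150 ms latency on a single NVIDIA RTX 5090 GPU, outperforming both generative and engine-based baselines.
Link: https://arxiv.org/abs/2512.01960
Authors: Zisu Li, Hengye Lyu, Jiaxin Shi, Yufeng Zeng, Mingming Fan, Hanwang Zhang, Chen Liang
Affiliations: HKUST; HKUST (Guangzhou); XMax.AI Ltd.; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: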
Abstract:Modeling and synthesizing complex hand-object interactions remains a significant challenge, even for state-of-the-art physics engines. Conventional simulation-based approaches rely on explicitly defined rigid object models and pre-scripted hand gestures, making them inadequate for capturing dynamic interactions with non-rigid or articulated entities such as deformable fabrics, elastic materials, hinge-based structures, furry surfaces, or even living creatures. In this paper, we present SpriteHand, an autoregressive video generation framework for real-time synthesis of versatile hand-object interaction videos across a wide range of object types and motion patterns. SpriteHand takes as input a static object image and a video stream in which the hands are imagined to interact with the virtual object embedded in a real-world scene, and generates corresponding hand-object interaction effects in real time. Our model employs a causal inference architecture for autoregressive generation and leverages a hybrid post-training approach to enhance visual realism and temporal coherence. Our 1.3B model supports real-time streaming generation at around 18 FPS and 640x368 resolution, with an approximate 150 ms latency on a single NVIDIA RTX 5090 GPU, and more than a minute of continuous output. Experiments demonstrate superior visual quality, physical plausibility, and interaction fidelity compared to both generative and engine-based baselines.
[CV-15] GrndCtrl: Grounding World Models via Self-Supervised Reward Alignment
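[Quick Read]: This paper addresses the lack of geometric grounding in generative video world models, which limits their use in embodied navigation tasks that require spatial coherence and long-horizon stability. The key to the solution is Reinforcement Learning with World Grounding (RLWG), a self-supervised post-training framework that aligns pretrained world models using geometric and perceptual rewards, specifically verifiable rewards that measure pose cycle-consistency, depth reprojection, and temporal coherence. The framework is instantiated as GrndCtrl, a reward-alignment method based on Group Relative Policy Optimization (GRPO), which improves trajectory stability, geometric consistency, and navigation reliability over conventional supervised fine-tuning.
Link: https://arxiv.org/abs/2512.01952
Authors: Haoyang He, Jay Patrikar, Dong-Ki Kim, Max Smith, Daniel McGann, Ali-akbar Agha-mohammadi, Shayegan Omidshafiei, Sebastian Scherer
Affiliations: Carnegie Mellon University; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: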
Abstract:Recent advances in video world modeling have enabled large-scale generative models to simulate embodied environments with high visual fidelity, providing strong priors for prediction, planning, and control. Yet, despite their realism, these models often lack geometric grounding, limiting their use in navigation tasks that require spatial coherence and long-horizon stability. We introduce Reinforcement Learning with World Grounding (RLWG), a self-supervised post-training framework that aligns pretrained world models with a physically verifiable structure through geometric and perceptual rewards. Analogous to reinforcement learning from verifiable feedback (RLVR) in language models, RLWG can use multiple rewards that measure pose cycle-consistency, depth reprojection, and temporal coherence. We instantiate this framework with GrndCtrl, a reward-aligned adaptation method based on Group Relative Policy Optimization (GRPO), yielding world models that maintain stable trajectories, consistent geometry, and reliable rollouts for embodied navigation. Like post-training alignment in large language models, GrndCtrl leverages verifiable rewards to bridge generative pretraining and grounded behavior, achieving superior spatial coherence and navigation stability over supervised fine-tuning in outdoor environments.
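The group-relative step that GRPO is named for can be sketched in a few lines. This is a minimal illustration, not the GrndCtrl implementation: the `combined_reward` function, its weights, and the error values are hypothetical stand-ins for the pose, depth, and temporal rewards named in the abstract, and normalizing by the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def combined_reward(pose_err, depth_err, temporal_err, weights=(1.0, 1.0, 1.0)):
    """Hypothetical scalar reward: lower geometric error gives higher reward."""
    w_pose, w_depth, w_temp = weights
    return -(w_pose * pose_err + w_depth * depth_err + w_temp * temporal_err)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against the other rollouts in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts sampled from the same starting context (made-up error values).
rollout_errors = [
    (0.10, 0.20, 0.05),
    (0.40, 0.30, 0.10),
    (0.05, 0.10, 0.02),
    (0.30, 0.50, 0.20),
]
rewards = [combined_reward(*errs) for errs in rollout_errors]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Rollouts with below-average error receive positive advantages and are reinforced, while the rest are pushed down. No learned value function is needed, which is the usual motivation for a group-relative baseline.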
[CV-16] Script: Graph-Structured and Query-Conditioned Semantic Token Pruning for Multimodal Large Language Models
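[Quick Read]: This paper addresses the excessive memory consumption and inference latency caused by the rapid growth of visual tokens in multimodal large language models (MLLMs), which is especially pronounced for high-resolution images and videos. Existing token pruning methods often ignore relevance to the user query or are constrained by the limitations of attention mechanisms, which reduces their adaptability and effectiveness. The key to the solution is Script, a plug-and-play pruning method that requires no retraining and generalizes across diverse MLLMs. It consists of two modules: a graph-structured pruning module that removes visually redundant tokens, and a query-conditioned semantic pruning module that preserves visual information relevant to the user query. Together they improve efficiency while preserving performance: on LLaVA-NeXT-7B, Script achieves up to 6.8x prefill speedup and 10x FLOP reduction while retaining 96.88% of the original performance.
Link: https://arxiv.org/abs/2512.01949
Authors: Zhongyu Yang, Dannong Xu, Wei Pang, Yingfang Yuan
Affiliations: BCML, Heriot-Watt University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in Transactions on Machine Learning Research, Project in this https URL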
Abstract:The rapid growth of visual tokens in multimodal large language models (MLLMs) leads to excessive memory consumption and inference latency, especially when handling high-resolution images and videos. Token pruning is a technique used to mitigate this issue by removing redundancy, but existing methods often ignore relevance to the user query or suffer from the limitations of attention mechanisms, reducing their adaptability and effectiveness. To address these challenges, we propose Script, a plug-and-play pruning method that requires no retraining and generalizes across diverse MLLMs. Script comprises two modules: a graph-structured pruning module that removes visually redundant tokens, and a query-conditioned semantic pruning module that preserves query-relevant visual information. Together, they enhance performance on multimodal tasks. Experiments on fourteen benchmarks across image and video understanding tasks show that Script consistently achieves higher model efficiency and predictive accuracy compared to existing pruning methods. On LLaVA-NeXT-7B, it achieves up to 6.8x prefill speedup and 10x FLOP reduction, while retaining 96.88% of the original performance.
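To make the idea of query-conditioned pruning concrete, here is a generic sketch that keeps the visual tokens most similar to a query embedding. It is not Script's algorithm (the graph-structured module is not shown); the function names, the cosine-similarity scoring rule, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def prune_tokens(visual_tokens, query_embedding, keep_ratio=0.5):
    """Return indices of the tokens most similar to the query, in original order."""
    k = max(1, int(len(visual_tokens) * keep_ratio))
    scores = [cosine(tok, query_embedding) for tok in visual_tokens]
    ranked = sorted(range(len(visual_tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # keep positional order for the language model


# Toy 2-D embeddings: tokens 2 and 3 point in roughly the query's direction.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
query = [0.0, 1.0]
kept = prune_tokens(tokens, query, keep_ratio=0.5)
print(kept)
```

Halving the token count in this way shortens the sequence the language model must process during prefill, which is where speedups of the kind reported in the abstract come from.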
[CV-17] Guardian: Detecting Robotic Planning and Execution Errors with Vision-Language Models
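[Quick Read]: This paper addresses reliable failure detection and recovery in robotic manipulation, where the accuracy and generalization of current Vision-Language Models (VLMs) are limited by the scarcity of failure data. The key to the solution is an automatic robot failure synthesis approach that procedurally perturbs successful trajectories to generate diverse planning and execution failures, producing not only binary labels but also fine-grained failure categories and step-by-step reasoning traces. With it, the authors build three new failure detection benchmarks (RLBench-Fail, BridgeDataV2-Fail, and UR5-Fail) that substantially expand the diversity and scale of existing failure datasets. Guardian, a VLM trained on this data, uses multi-view images for detailed failure reasoning and detection, achieves state-of-the-art results on multiple benchmarks, and improves task success rates in simulation and on real robots, demonstrating the value of the generated failure data.
Link: https://arxiv.org/abs/2512.01946
Authors: Paul Pacaud, Ricardo Garcia, Shizhe Chen, Cordelia Schmid
Affiliations: Inria; École normale supérieure; CNRS; PSL Research University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 9 figures, 6 tables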
Abstract:Robust robotic manipulation requires reliable failure detection and recovery. Although current Vision-Language Models (VLMs) show promise, their accuracy and generalization are limited by the scarcity of failure data. To address this data gap, we propose an automatic robot failure synthesis approach that procedurally perturbs successful trajectories to generate diverse planning and execution failures. This method produces not only binary classification labels but also fine-grained failure categories and step-by-step reasoning traces in both simulation and the real world. With it, we construct three new failure detection benchmarks: RLBench-Fail, BridgeDataV2-Fail, and UR5-Fail, substantially expanding the diversity and scale of existing failure datasets. We then train Guardian, a VLM with multi-view images for detailed failure reasoning and detection. Guardian achieves state-of-the-art performance on both existing and newly introduced benchmarks. It also effectively improves task success rates when integrated into a state-of-the-art manipulation system in simulation and real robots, demonstrating the impact of our generated failure data.
[CV-18] Physical ID-Transfer Attacks against Multi-Object Tracking via Adversarial Trajectory ACSAC
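[Quick Read]: This paper addresses ID-manipulation attacks on Multi-Object Tracking (MOT) systems in the physical world, in which an attacker follows a crafted trajectory so that the tracker assigns one object's ID to another target, leading to severe consequences such as wrong trajectory predictions. Existing attacks mostly target the object detection (OD) module in the digital domain and are model-specific, non-robust, and limited to offline datasets. The paper proposes AdvTraj, the first online, physical ID-manipulation attack against tracking-by-detection MOT. Its key idea is to leave the OD module untouched and instead use adversarial trajectories to induce ID confusion during the tracker's association stage. Simulation experiments show a 100% success rate for white-box attacks against SORT and up to 93% attack success when transferred to other MOT algorithms, revealing a common weakness in the object association phase of current SOTA MOT systems.
Link: https://arxiv.org/abs/2512.01934
Authors: Chenyi Wang, Yanmao Man, Raymond Muller, Ming Li, Z. Berkay Celik, Ryan Gerdes, Jonathan Petit
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to Annual Computer Security Applications Conference (ACSAC) 2024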
Abstract:Multi-Object Tracking (MOT) is a critical task in computer vision, with applications ranging from surveillance systems to autonomous driving. However, threats to MOT algorithms have yet to be widely studied. In particular, incorrect association between the tracked objects and their assigned IDs can lead to severe consequences, such as wrong trajectory predictions. Previous attacks against MOT either focused on hijacking the trackers of individual objects, or manipulating the tracker IDs in MOT by attacking the integrated object detection (OD) module in the digital domain, which are model-specific, non-robust, and only able to affect specific samples in offline datasets. In this paper, we present AdvTraj, the first online and physical ID-manipulation attack against tracking-by-detection MOT, in which an attacker uses adversarial trajectories to transfer its ID to a targeted object to confuse the tracking system, without attacking OD. Our simulation results in CARLA show that AdvTraj can fool ID assignments with 100% success rate in various scenarios for white-box attacks against SORT, which also have high attack transferability (up to 93% attack success rate) against state-of-the-art (SOTA) MOT algorithms due to their common design principles. We characterize the patterns of trajectories generated by AdvTraj and propose two universal adversarial maneuvers that can be performed by a human walker/driver in daily scenarios. Our work reveals under-explored weaknesses in the object association phase of SOTA MOT systems, and provides insights into enhancing the robustness of such systems.
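A toy example helps show why the association stage is the weak point. Tracking-by-detection systems such as SORT match each track's predicted box to the new detections by overlap (IoU), so a detection that lands on another track's predicted box can inherit that track's ID. The sketch below is an illustration only: it brute-forces the best assignment instead of using the Hungarian algorithm, omits the Kalman filter that produces the predictions, and all box coordinates are made up.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(predicted, detections):
    """Map each track ID to the detection index that maximizes total IoU."""
    ids = list(predicted)
    best = max(
        permutations(range(len(detections)), len(ids)),
        key=lambda perm: sum(iou(predicted[i], detections[j]) for i, j in zip(ids, perm)),
    )
    return dict(zip(ids, best))


# Boxes the tracker expects to see in the next frame for each ID.
predicted = {"attacker": (0, 0, 10, 10), "victim": (20, 0, 30, 10)}
# Actual detections: index 0 is the attacker, who has moved onto the victim's
# predicted position; index 1 is the victim, who is now near the attacker's.
detections = [(19, 0, 29, 10), (8, 0, 18, 10)]
assignment = associate(predicted, detections)
print(assignment)
```

The tracker assigns the victim's ID to detection 0 (the attacker) and the attacker's ID to detection 1, so the two IDs are swapped without any perturbation of the detector's input.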
[CV-19] Med-VCD: Mitigating Hallucination for Medical Large Vision Language Models through Visual Contrastive Decoding
[Quick Read]: This paper addresses hallucinated outputs in medical large vision-language models (LVLMs), i.e., content that appears plausible but is actually incorrect. Existing mitigation strategies mostly rely on secondary decoding or rollback procedures, which substantially slow inference, and often introduce misalignment between modalities or between generated and ground-truth content. The key to the proposed solution, Med-VCD (sparse visual-contrastive decoding), is a novel on-the-fly token-sparsification strategy that removes redundant information while retaining critical visual context, improving the factuality and reliability of generated content without additional time overhead.
Link: https://arxiv.org/abs/2512.01922
Authors: Zahra Mahdavi, Zahra Khodakaramimaghsoud, Hooman Khaloo, Sina Bakhshandeh Taleshani, Erfan Hashemi, Javad Mirzapour Kaleybar, Omid Nejati Manzari
Affiliations: University of Central Florida; University of Pennsylvania; Columbia University; Technical University of Applied Sciences Regensburg; University of Calgary; University College of Nabi Akram; Iran University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Large vision-language models (LVLMs) are now central to healthcare applications such as medical visual question answering and imaging report generation. Yet, these models remain vulnerable to hallucination outputs that appear plausible but are in fact incorrect. In the natural image domain, several decoding strategies have been proposed to mitigate hallucinations by reinforcing visual evidence, but most rely on secondary decoding or rollback procedures that substantially slow inference. Moreover, existing solutions are often domain-specific and may introduce misalignment between modalities or between generated and ground-truth content. We introduce Med-VCD, a sparse visual-contrastive decoding method that mitigates hallucinations in medical LVLMs without the time overhead of secondary decoding. Med-VCD incorporates a novel token-sparsification strategy that selects visually informed tokens on the fly, trimming redundancy while retaining critical visual context and thus balancing efficiency with reliability. Evaluations on eight medical datasets, spanning ophthalmology, radiology, and pathology tasks in visual question answering, report generation, and dedicated hallucination benchmarks, show that Med-VCD raises factual accuracy by an average of 13% and improves hallucination accuracy by 6% relative to baseline medical LVLMs.
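For readers unfamiliar with visual contrastive decoding, the sketch below shows the commonly used contrastive rule from the natural-image setting: next-token logits computed with the real image are contrasted against logits computed with a distorted image, which demotes tokens driven mostly by language priors. It does not reproduce Med-VCD's token sparsification, and the three-word vocabulary and logit values are invented for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(logits_clean, logits_distorted, alpha=1.0):
    """Amplify evidence from the real image and subtract the prior-only signal."""
    return [(1 + alpha) * c - alpha * d for c, d in zip(logits_clean, logits_distorted)]


vocab = ["normal", "effusion", "fracture"]
logits_clean = [2.0, 2.2, 0.1]      # conditioned on the actual image
logits_distorted = [0.5, 2.4, 0.0]  # conditioned on a distorted image

greedy_plain = vocab[logits_clean.index(max(logits_clean))]
probs = softmax(contrastive_logits(logits_clean, logits_distorted))
greedy_contrastive = vocab[probs.index(max(probs))]
print(greedy_plain, greedy_contrastive)
```

In this toy case "effusion" scores highly even when the image is distorted, which suggests it is supported by the language prior and not by visual evidence; the contrastive rule therefore shifts the choice to "normal".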
[CV-20] SARL: Spatially-Aware Self-Supervised Representation Learning for Visuo-Tactile Perception
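[Quick Read]: This paper addresses the loss of spatial structure in current self-supervised learning (SSL) frameworks, which compress feature maps into a global vector and therefore cannot adequately support geometry-sensitive robotic manipulation with fused visual-tactile data. The key to the solution is SARL, a spatially-aware SSL framework that augments the Bootstrap Your Own Latent (BYOL) architecture with three map-level objectives, Saliency Alignment (SAL), Patch-Prototype Distribution Alignment (PPDA), and Region Affinity Matching (RAM), which keep attentional focus, part composition, and geometric relations consistent across views. SARL outperforms nine baselines on six downstream tasks. On the geometry-sensitive edge-pose regression task it reaches a Mean Absolute Error (MAE) of 0.3955, a 30% improvement over the next-best method, and approaches the supervised upper bound, indicating that structured spatial equivariance is the most effective representation signal for fused visual-tactile data.
Link: https://arxiv.org/abs/2512.01908
Authors: Gurmeher Khurana, Lan Wei, Dandan Zhang
Affiliations: Imperial College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: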
Abstract:Contact-rich robotic manipulation requires representations that encode local geometry. Vision provides global context but lacks direct measurements of properties such as texture and hardness, whereas touch supplies these cues. Modern visuo-tactile sensors capture both modalities in a single fused image, yielding intrinsically aligned inputs that are well suited to manipulation tasks requiring visual and tactile information. Most self-supervised learning (SSL) frameworks, however, compress feature maps into a global vector, discarding spatial structure and misaligning with the needs of manipulation. To address this, we propose SARL, a spatially-aware SSL framework that augments the Bootstrap Your Own Latent (BYOL) architecture with three map-level objectives, including Saliency Alignment (SAL), Patch-Prototype Distribution Alignment (PPDA), and Region Affinity Matching (RAM), to keep attentional focus, part composition, and geometric relations consistent across views. These losses act on intermediate feature maps, complementing the global objective. SARL consistently outperforms nine SSL baselines across six downstream tasks with fused visual-tactile data. On the geometry-sensitive edge-pose regression task, SARL achieves a Mean Absolute Error (MAE) of 0.3955, a 30% relative improvement over the next-best SSL method (0.5682 MAE) and approaching the supervised upper bound. These findings indicate that, for fused visual-tactile data, the most effective signal is structured spatial equivariance, in which features vary predictably with object geometry, which enables more capable robotic perception.
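The claim that pooling into a global vector discards spatial structure can be checked on a tiny example. The sketch below uses the standard BYOL regression loss (2 minus twice the cosine similarity) and compares it on pooled vectors against a per-location average. The `map_level_loss` function is a hypothetical stand-in for a map-level objective and is not SARL's SAL, PPDA, or RAM loss; the two-location feature maps are made up.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def byol_loss(online_pred, target_proj):
    """BYOL loss between two vectors: 2 - 2 * cosine similarity."""
    p, z = l2_normalize(online_pred), l2_normalize(target_proj)
    return 2.0 - 2.0 * sum(a * b for a, b in zip(p, z))


def global_pool(feature_map):
    """Average the per-location feature vectors into one global vector."""
    n = len(feature_map)
    return [sum(col) / n for col in zip(*feature_map)]


def map_level_loss(online_map, target_map):
    """Hypothetical map-level objective: average the loss over locations."""
    losses = [byol_loss(p, z) for p, z in zip(online_map, target_map)]
    return sum(losses) / len(losses)


# Two views whose features are swapped between the two spatial locations.
online_map = [[1.0, 0.0], [0.0, 1.0]]
target_map = [[0.0, 1.0], [1.0, 0.0]]

global_loss = byol_loss(global_pool(online_map), global_pool(target_map))
spatial_loss = map_level_loss(online_map, target_map)
print(global_loss, spatial_loss)
```

The pooled vectors are identical, so the global loss is essentially zero even though every location disagrees; the per-location loss is at its orthogonal-vector value of 2.0 and exposes the mismatch.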
[CV-21] StyleYourSmile: Cross-Domain Face Retargeting Without Paired Multi-Style Data
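[Quick Read]: This paper addresses the difficulty of disentangling control over identity, expression, and domain-specific style in cross-domain face retargeting. Existing methods are typically trained on real-world faces and either generalize poorly, need test-time optimization, or require fine-tuning on carefully curated multi-style paired datasets. The key to the solution is StyleYourSmile, a one-shot cross-domain face retargeting method with three core components: (1) an efficient data augmentation strategy that improves adaptation to different visual domains; (2) a dual-encoder framework that separately extracts domain-invariant identity features and domain-specific style features; and (3) a diffusion model conditioned on these disentangled control signals, which transfers expressions across domains while preserving identity with high fidelity. Experiments show that the method preserves identity and achieves high-quality expression retargeting across a wide range of visual domains.
Link: https://arxiv.org/abs/2512.01895
Authors: Avirup Dey, Vinay Namboodiri
Affiliations: University of Bath
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 14 figures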
Abstract:Cross-domain face retargeting requires disentangled control over identity, expressions, and domain-specific stylistic attributes. Existing methods, typically trained on real-world faces, either fail to generalize across domains, need test-time optimizations, or require fine-tuning with carefully curated multi-style datasets to achieve domain-invariant identity representations. In this work, we introduce StyleYourSmile, a novel one-shot cross-domain face retargeting method that eliminates the need for curated multi-style paired data. We propose an efficient data augmentation strategy alongside a dual-encoder framework, for extracting domain-invariant identity cues and capturing domain-specific stylistic variations. Leveraging these disentangled control signals, we condition a diffusion model to retarget facial expressions across domains. Extensive experiments demonstrate that StyleYourSmile achieves superior identity preservation and retargeting fidelity across a wide range of visual domains.
[CV-22] KM-ViPE: Online Tightly Coupled Vision-Language-Geometry Fusion for Open-Vocabulary Semantic SLAM
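[Quick Read]: This paper addresses real-time SLAM (simultaneous localization and mapping) with an uncalibrated monocular camera in dynamic environments, where existing methods typically depend on depth sensors or offline calibration, or lack robustness to dynamic scenes. The key to the solution is KM-ViPE (Knowledge Mapping Video Pose Engine), which tightly couples DINO visual features with geometric constraints through an adaptive robust kernel based on high-level features, handling both moving objects and movable static objects (e.g., furniture moved in ego-centric views). The system fuses geometric and deep visual features aligned with language embeddings to perform online localization and open-vocabulary semantic mapping without depth data or pre-calibration. It benefits from internet-scale training and suits applications such as autonomous robotics and AR/VR.
Link: https://arxiv.org/abs/2512.01889
Authors: Zaid Nasser, Mikhail Iumanov, Tianhao Li, Maxim Popov, Jaafar Mahmoud, Malik Mohrat, Ilya Obrubov, Ekaterina Derevyanka, Ivan Sosin, Sergey Kolyubin
Affiliations: ITMO University; SBERRoboticsCenter
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: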
Abstract:We present KM-ViPE (Knowledge Mapping Video Pose Engine), a real-time open-vocabulary SLAM framework for uncalibrated monocular cameras in dynamic environments. Unlike systems requiring depth sensors and offline calibration, KM-ViPE operates directly on raw RGB streams, making it ideal for ego-centric applications and harvesting internet-scale video data for training. KM-ViPE tightly couples DINO visual features with geometric constraints through a high-level features based adaptive robust kernel that handles both moving objects and movable static objects (e.g., moving furniture in ego-centric views). The system performs simultaneous online localization and open-vocabulary semantic mapping by fusing geometric and deep visual features aligned with language embeddings. Our results are competitive with state-of-the-art approaches, while existing solutions either operate offline, need depth data and/or odometry estimation, or lack dynamic scene robustness. KM-ViPE benefits from internet-scale training and uniquely combines online operation, uncalibrated monocular input, and robust handling of dynamic scenes, which makes it a good fit for autonomous robotics and AR/VR applications and advances practical spatial intelligence capabilities for embodied AI.
[CV-23] TransientTrack: Advanced Multi-Object Tracking and Classification of Cancer Cells with Transient Fluorescent Signals
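[Quick Read]: This paper addresses cell tracking in multi-channel fluorescence microscopy videos, particularly in scenarios with time-varying fluorescent signals (such as those following cellular circadian rhythms) where conventional methods fail to detect key cellular events such as division and death. The key to the solution is TransientTrack, a lightweight deep-learning framework whose main elements are: matching directly on cell detection embeddings, Transformer networks for modeling spatio-temporal features, a multi-stage matching strategy that uses all detection boxes, and Kalman-filter interpolation of missing tracklets. Together these enable reconstruction of complete cell lineages and identification of dynamic events.
Link: https://arxiv.org/abs/2512.01885
Authors: Florian Bürger, Martim Dias Gomes, Nica Gutu, Adrián E. Granada, Noémie Moreau, Katarzyna Bozek
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Cell Behavior (q-bio.CB); Quantitative Methods (q-bio.QM)
Comments: 13 pages, 7 figures, 2 tables. This work has been submitted to IEEE Transactions on Medical Imaging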
Abstract:Tracking cells in time-lapse videos is an essential technique for monitoring cell population dynamics at a single-cell level. Current methods for cell tracking are developed on videos with mostly single, constant signals and do not detect pivotal events such as cell death. Here, we present TransientTrack, a deep learning-based framework for cell tracking in multi-channel microscopy video data with transient fluorescent signals that fluctuate over time following processes such as the circadian rhythm of cells. By identifying key cellular events, namely mitosis (cell division) and apoptosis (cell death), our method allows us to build complete trajectories, including cell lineage information. TransientTrack is lightweight and performs matching on cell detection embeddings directly, without the need for quantification of tracking-specific cell features. Furthermore, our approach integrates Transformer Networks, multi-stage matching using all detection boxes, and the interpolation of missing tracklets with the Kalman Filter. This unified framework achieves strong performance across diverse conditions, effectively tracking cells and capturing cell division and death. We demonstrate the use of TransientTrack in an analysis of the efficacy of a chemotherapeutic drug at a single-cell level. The proposed framework could further advance quantitative studies of cancer cell dynamics, enabling detailed characterization of treatment response and resistance mechanisms. The code is available at this https URL.
[CV-24] COACH: Collaborative Agents for Contextual Highlighting - A Multi-Agent Framework for Sports Video Analysis AAAI2026
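[Quick Read]: This paper addresses insufficient understanding of the temporal context hierarchy in intelligent sports video analysis. Existing end-to-end models lack an effective unified way to model both micro-level actions and macro-level game strategies, resulting in poor generalization, high development costs for new tasks, and weak interpretability. The key to the solution is a reconfigurable Multi-Agent System (MAS) in which each agent is a "cognitive tool" specializing in a specific aspect of analysis. By invoking these agents iteratively and composing them flexibly, the framework builds adaptive analysis pipelines for different time scales (e.g., short-term Rally QA and long-term match summary generation), bridging fine-grained event detection and global semantic organization.
Link: https://arxiv.org/abs/2512.01853
Authors: Tsz-To Wong, Ching-Chun Huang, Hong-Han Shuai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026 Workshop LaMAS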
Abstract:Intelligent sports video analysis demands a comprehensive understanding of temporal context, from micro-level actions to macro-level game strategies. Existing end-to-end models often struggle with this temporal hierarchy, offering solutions that lack generalization, incur high development costs for new tasks, and suffer from poor interpretability. To overcome these limitations, we propose a reconfigurable Multi-Agent System (MAS) as a foundational framework for sports video understanding. In our system, each agent functions as a distinct “cognitive tool” specializing in a specific aspect of analysis. The system’s architecture is not confined to a single temporal dimension or task. By leveraging iterative invocation and flexible composition of these agents, our framework can construct adaptive pipelines for both short-term analytic reasoning (e.g., Rally QA) and long-term generative summarization (e.g., match summaries). We demonstrate the adaptability of this framework using two representative tasks in badminton analysis, showcasing its ability to bridge fine-grained event detection and global semantic organization. This work presents a paradigm shift towards a flexible, scalable, and interpretable system for robust, cross-task sports video this http URL. The project homepage is available at this https URL.
[CV-25] Register Any Point: Scaling 3D Point Cloud Registration by Flow Matching
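[Quick Read]: This paper addresses point cloud registration, i.e., aligning multiple unposed point clouds into a common coordinate frame to support tasks such as 3D reconstruction and robot localization. Conventional methods estimate the transformation between pairs of point clouds via correspondence matching and then optimize the pairwise transformations for multi-view registration, which suffers from error accumulation and degrades in low-overlap scenes. The key innovation is to cast registration as conditional generation: a learned continuous, point-wise velocity field transports noisy points to the registered scene, directly generating the full registration result. Combined with a lightweight local feature extractor and test-time rigidity enforcement, the method achieves state-of-the-art performance in low-overlap settings and generalizes across scales and sensor modalities.
Link: https://arxiv.org/abs/2512.01850
Authors: Yue Pan, Tao Sun, Liyuan Zhu, Lucas Nunes, Iro Armeni, Jens Behley, Cyrill Stachniss
Affiliations: University of Bonn; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 22 pages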
Abstract:Point cloud registration aligns multiple unposed point clouds into a common frame, and is a core step for 3D reconstruction and robot localization. In this work, we cast registration as conditional generation: a learned continuous, point-wise velocity field transports noisy points to a registered scene, from which the pose of each view is recovered. Unlike previous methods that conduct correspondence matching to estimate the transformation between a pair of point clouds and then optimize the pairwise transformations to realize multi-view registration, our model directly generates the registered point cloud. With a lightweight local feature extractor and test-time rigidity enforcement, our approach achieves state-of-the-art results on pairwise and multi-view registration benchmarks, particularly with low overlap, and generalizes across scales and sensor modalities. It further supports downstream tasks including relocalization, multi-robot SLAM, and multi-session map merging. Source code available at: this https URL.
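The phrase "a velocity field transports noisy points to a registered scene" refers to integrating an ordinary differential equation from t = 0 to t = 1. The sketch below shows that sampling loop with Euler steps. It is a toy under a strong assumption: the `velocity` function is an oracle that already knows each point's target, whereas in the paper the field is a learned network conditioned on the input point clouds. Rigidity enforcement and pose recovery are not shown.

```python
def velocity(x, t, target):
    """Toy straight-line velocity field that reaches `target` at t = 1.

    Assumption for illustration: a real model predicts this from learned
    features and never sees the target directly.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def transport(points, targets, steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    dt = 1.0 / steps
    pts = [list(p) for p in points]
    for k in range(steps):
        t = k * dt
        pts = [
            [xi + dt * vi for xi, vi in zip(p, velocity(p, t, g))]
            for p, g in zip(pts, targets)
        ]
    return pts


noisy = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]       # starting (noise) positions
registered = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.5]]  # where each point should end up
result = transport(noisy, registered, steps=10)
print(result)
```

Because the toy field follows straight lines, Euler integration lands on the targets up to floating-point error. With a learned field the endpoints are only approximate, which is one reason a test-time rigidity step of the kind the abstract mentions can help.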
[CV-26] PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models
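[Quick Read]: This paper addresses shortcomings in the physical plausibility of current Text-to-Video (T2V) generation models: despite notable progress in video quality, length, and instruction following, there is no effective way to assess whether their outputs obey physical laws. To study this, the authors build the PID (Physical Implausibility Detection) dataset, which contains 500 manually annotated test videos and 2,588 paired videos; in each pair one video is real-world footage and the other is a physically implausible video generated by a T2V model from a rewritten version of the original caption. The key to the solution is a lightweight fine-tuning approach that enables Vision-Language Models (VLMs) not only to detect physically implausible events but also to generate textual explanations of the violated physical principles. The fine-tuned VLM, named PhyDetEx, serves as a physical plausibility detector and explainer and is used to benchmark several state-of-the-art T2V models. The results show that mainstream T2V models still struggle with physical consistency, with open-source models performing worse.
Link: https://arxiv.org/abs/2512.01843
Authors: Zeqing Wang, Keze Wang, Lei Zhang
Affiliations: Sun Yat-sen University; Hong Kong Polytechnic University; OPPO Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 8 figures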
Abstract:Driven by the growing capacity and training scale, Text-to-Video (T2V) generation models have recently achieved substantial progress in video quality, length, and instruction-following capability. However, whether these models can understand physics and generate physically plausible videos remains a question. While Vision-Language Models (VLMs) have been widely used as general-purpose evaluators in various applications, they struggle to identify the physically impossible content from generated videos. To investigate this issue, we construct a \textbfPID (\textbfPhysical \textbfImplausibility \textbfDetection) dataset, which consists of a \textittest split of 500 manually annotated videos and a \textittrain split of 2,588 paired videos, where each implausible video is generated by carefully rewriting the caption of its corresponding real-world video to induce T2V models producing physically implausible content. With the constructed dataset, we introduce a lightweight fine-tuning approach, enabling VLMs to not only detect physically implausible events but also generate textual explanations on the violated physical principles. Taking the fine-tuned VLM as a physical plausibility detector and explainer, namely \textbfPhyDetEx, we benchmark a series of state-of-the-art T2V models to assess their adherence to physical laws. Our findings show that although recent T2V models have made notable progress toward generating physically plausible content, understanding and adhering to physical laws remains a challenging issue, especially for open-source models. Our dataset, training code, and checkpoints are available at \hrefthis https URLthis https URL.
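[CV-27] OpenREAD: Reinforced Open-Ended Reasoning for End-to-End Autonomous Driving with LLM-as-Critic
[Quick read]: This paper addresses two problems in current two-stage fine-tuning strategies for autonomous driving (AD): the limitations of supervised fine-tuning (SFT) restrict the generalization of reasoning, and reinforcement fine-tuning (RFT) is hard to apply to open-ended scene understanding tasks. The key to the solution is the OpenREAD framework, which constructs large-scale Chain-of-Thought (CoT) annotations and uses the Qwen3 large language model (LLM) as the critic in reward modeling. This enables end-to-end RFT from high-level reasoning to low-level trajectory planning and thereby improves reasoning and decision-making in open-ended scenarios.
Link: https://arxiv.org/abs/2512.01830
Authors: Songyan Zhang,Wenhui Huang,Zhan Chen,Chua Jiahao Collister,Qihang Huang,Chen Lv
Affiliations: Nanyang Technological University; Harvard University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: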
Abstract:Recently, two-stage fine-tuning strategies, e.g., acquiring essential driving knowledge through supervised fine-tuning (SFT) and further enhancing decision-making and planning via reinforcement fine-tuning (RFT), have shown strong potential in advancing the knowledge-driven autonomous driving (AD) paradigm. However, the learning nature of SFT still limits the generalization of reasoning, thereby constraining the full potential of driving performance. Meanwhile, current RFT approaches are primarily applied to downstream tasks, since scene understanding is an open-ended problem where corresponding rewards are difficult to quantify. To address these limitations, we propose OpenREAD, an OPEN-ended REasoning reinforced vision-language model (VLM)-based autonomous driving (AD) framework that enables end-to-end RFT across the full spectrum from high-level reasoning to low-level trajectory planning. Specifically, we begin by constructing large-scale Chain-of-Thought (CoT) annotations on open-source driving-related knowledge datasets, and employ the powerful Qwen3 large language model (LLM) as the critic in RFT to quantify reasoning quality for open-ended questions during reward modeling. Extensive experiments confirm that joint end-to-end RFT yields substantial improvements in both upstream and downstream tasks, enabling OpenREAD to achieve state-of-the-art performance on reasoning and planning benchmarks.
[CV-28] CauSight: Learning to Supersense for Visual Causal Discovery
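[Quick read]: This paper addresses the lack of causal reasoning in modern AI systems: existing models typically only recognize the presence of objects in a visual scene and cannot understand the causal relations among them. To enable visual causal discovery, the authors construct the large-scale annotated dataset VCG-32K, which contains over 32,000 images with entity-level causal graphs, and propose the CauSight model, which performs the task through causally aware reasoning. The solution rests on three components: (1) dataset construction and training data curation based on VCG-32K; (2) the Tree-of-Causal-Thought (ToCT) mechanism for generating causal reasoning trajectories; (3) a designed causal reward combined with reinforcement learning to refine the reasoning policy. Experiments show that CauSight significantly outperforms GPT-4.1 on visual causal discovery, with a 21% absolute gain, validating the approach.
Link: https://arxiv.org/abs/2512.01827
Authors: Yize Zhang,Meiqi Chen,Sirui Chen,Bo Peng,Yanxi Zhang,Tianyu Li,Chaochao Lu
Affiliations: Shanghai AI Laboratory; Shanghai Innovation Institute; Shanghai Jiao Tong University; Peking University; Tongji University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: project page: this https URL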
Abstract:Causal thinking enables humans to understand not just what is seen, but why it happens. To replicate this capability in modern AI systems, we introduce the task of visual causal discovery. It requires models to infer cause-and-effect relations among visual entities across diverse scenarios instead of merely perceiving their presence. To this end, we first construct the Visual Causal Graph dataset (VCG-32K), a large-scale collection of over 32,000 images annotated with entity-level causal graphs, and further develop CauSight, a novel vision-language model to perform visual causal discovery through causally aware reasoning. Our training recipe integrates three components: (1) training data curation from VCG-32K, (2) Tree-of-Causal-Thought (ToCT) for synthesizing reasoning trajectories, and (3) reinforcement learning with a designed causal reward to refine the reasoning policy. Experiments show that CauSight outperforms GPT-4.1 on visual causal discovery, achieving over a threefold performance boost (21% absolute gain). Our code, model, and dataset are fully open-sourced at project page: this https URL.
[CV-29] Seeing through Imagination: Learning Scene Geometry via Implicit Spatial World Modeling
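[Quick read]: This paper addresses the weak spatial reasoning of Multimodal Large Language Models (MLLMs). Current methods rely mainly on fine-tuning with purely textual descriptions, which leads to visual illiteracy: the model cannot connect spatial concepts to their visual manifestations. The key to the solution is the MILO (Implicit spatIaL wOrld modeling) paradigm, which introduces a visual generator that provides geometry-aware feedback and thereby implicitly grounds the MLLM's symbolic reasoning in perceptual experience. The authors also design RePE (Relative Positional Encoding), an encoding scheme that models relative camera-pose transformations more effectively than conventional absolute coordinate representations.
Link: https://arxiv.org/abs/2512.01821
Authors: Meng Cao,Haokun Lin,Haoyuan Li,Haoran Tang,Rongtao Xu,Dong An,Xue Liu,Ian Reid,Xiaodan Liang
Affiliations: MBZUAI; SYSU; PKU; Spatialtemporal AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: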
Abstract:Spatial reasoning, the ability to understand and interpret the 3D structure of the world, is a critical yet underdeveloped capability in Multimodal Large Language Models (MLLMs). Current methods predominantly rely on verbal descriptive tuning, which suffers from visual illiteracy, i.e., they learn spatial concepts through textual symbols alone, devoid of connection to their visual manifestations. To bridge this gap, this paper introduces MILO, an Implicit spatIaL wOrld modeling paradigm that simulates human-like spatial imagination. MILO integrates a visual generator to provide geometry-aware feedback, thereby implicitly grounding the MLLM’s symbolic reasoning in perceptual experience. Complementing this paradigm, we propose RePE (Relative Positional Encoding), a novel encoding scheme that captures relative camera-pose transformations, offering superior performance over absolute coordinate systems. To support the training, we construct GeoGen, a large-scale Geometry-aware Generative dataset with approximately 2,241 videos and 67,827 observation-action-outcome triplets. Experiments demonstrate that our approach significantly enhances spatial reasoning capabilities across multiple baselines and benchmarks, offering a more holistic understanding of 3D space.
[CV-30] Forget Less Retain More: A Lightweight Regularizer for Rehearsal-Based Continual Learning
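[Quick read]: This paper addresses catastrophic forgetting in continual learning with deep neural networks, where learning a new task significantly degrades performance on previous tasks. The key to the solution is a regularization strategy called Information Maximization (IM), which is built on the expected label distribution and is therefore class-agnostic, so it can be integrated seamlessly into a variety of rehearsal-based continual learning methods. The IM regularizer suppresses forgetting by minimizing the information discrepancy between new and old tasks and promotes faster convergence, with very low computational overhead. This makes it practical and scalable, and applicable to continual learning in complex settings such as video data.
Link: https://arxiv.org/abs/2512.01818
Authors: Lama Alssum,Hasan Abed Al Kader Hammoud,Motasem Alfarra,Juan C Leon Alcazar,Bernard Ghanem
Affiliations: King Abdullah University of Science and Technology (KAUST)
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: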
Abstract:Deep neural networks suffer from catastrophic forgetting, where performance on previous tasks degrades after training on a new task. This issue arises due to the model’s tendency to overwrite previously acquired knowledge with new information. We present a novel approach to address this challenge, focusing on the intersection of memory-based methods and regularization approaches. We formulate a regularization strategy, termed Information Maximization (IM) regularizer, for memory-based continual learning methods, which is based exclusively on the expected label distribution, thus making it class-agnostic. As a consequence, IM regularizer can be directly integrated into various rehearsal-based continual learning methods, reducing forgetting and favoring faster convergence. Our empirical validation shows that, across datasets and regardless of the number of tasks, our proposed regularization strategy consistently improves baseline performance at the expense of a minimal computational overhead. The lightweight nature of IM ensures that it remains a practical and scalable solution, making it applicable to real-world continual learning scenarios where efficiency is paramount. Finally, we demonstrate the data-agnostic nature of our regularizer by applying it to video data, which presents additional challenges due to its temporal structure and higher memory requirements. Despite the significant domain gap, our experiments show that IM regularizer also improves the performance of video continual learning methods.
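The abstract does not give the formula for the IM regularizer, only that it depends on the expected label distribution. As a rough illustration of that idea, and not the authors' implementation, the sketch below assumes the regularizer is the negative entropy of the batch-averaged softmax prediction, so that minimizing it pushes the expected label distribution toward balance. The function names and the choice of negative entropy are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_label_distribution(batch_logits):
    """Average the per-sample class probabilities over the batch."""
    probs = [softmax(row) for row in batch_logits]
    n = len(probs)
    num_classes = len(probs[0])
    return [sum(p[c] for p in probs) / n for c in range(num_classes)]


def im_regularizer(batch_logits, eps=1e-12):
    """Hypothetical IM term: negative entropy of the mean prediction.

    Lower values mean the expected label distribution is more balanced.
    No ground-truth labels are used, which is what makes it class-agnostic.
    """
    mean_p = expected_label_distribution(batch_logits)
    return sum(p * math.log(p + eps) for p in mean_p)


# A batch whose predictions cover both classes vs. one collapsed onto class 0.
balanced = [[5.0, 0.0], [0.0, 5.0]]
collapsed = [[5.0, 0.0], [5.0, 0.0]]
print(im_regularizer(balanced), im_regularizer(collapsed))
```

In a rehearsal-based method this term would be added, with some weight, to the usual loss computed on the current batch plus replayed memory samples; the weighting is likewise not specified in the abstract.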
[CV-31] Envision: Benchmarking Unified Understanding Generation for Causal World Process Insights
[Quick read]: This paper addresses the limitations caused by the over-reliance of current multimodal models on static single-image generation in training and evaluation: models tend to fall into static pattern matching and semantic fusion and struggle to model dynamic processes that evolve over time, which limits their ability to internalize world knowledge. The key to the solution is Envision, a benchmark for chained text-to-multi-image generation based on causal event progression. It restructures the evaluation dimensions around spatiotemporal causality and introduces Envision-Score, a holistic metric that quantifies reasoning and generation across sequential frames in terms of consistency, physical plausibility, and aesthetics. With this benchmark, the study finds that unified multimodal models outperform specialized text-to-image (T2I) models in causal narrative coherence but remain limited by spatiotemporal consistency, highlighting the need to move from static image evaluation to dynamic sequence modeling.
Link: https://arxiv.org/abs/2512.01816
Authors: Juanxi Tian,Siyuan Li,Conghui He,Lijun Wu,Cheng Tan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 35 pages, 12 figures, 10 tables
Abstract:Current multimodal models aim to transcend the limitations of single-modality representations by unifying understanding and generation, often using text-to-image (T2I) tasks to calibrate semantic consistency. However, their reliance on static, single-image generation in training and evaluation leads to overfitting to static pattern matching and semantic fusion, while fundamentally hindering their ability to model dynamic processes that unfold over time. To address these constraints, we propose Envision, a causal event progression benchmark for chained text-to-multi-image generation. Grounded in world knowledge and structured by spatiotemporal causality, it reorganizes existing evaluation dimensions and includes 1,000 four-stage prompts spanning six scientific and humanities domains. To transition evaluation from single images to sequential frames and assess whether models truly internalize world knowledge while adhering to causal-temporal constraints, we introduce Envision-Score, a holistic metric integrating multi-dimensional consistency, physicality, and aesthetics. Comprehensive evaluation of 15 models (10 specialized T2I models, 5 unified models) uncovers the following: specialized T2I models demonstrate proficiency in aesthetic rendering yet lack intrinsic world knowledge. Unified multimodal models bridge this gap, consistently outperforming specialized counterparts in causal narrative coherence. However, even these unified architectures remain subordinate to closed-source models and struggle to overcome the core challenge of spatiotemporal consistency. This demonstrates that a focus on causally isolated single images impedes multi-frame reasoning and generation, promoting static pattern matching over dynamic world modeling and ultimately limiting world knowledge internalization and generation.
[CV-32] Generative Action Tell-Tales: Assessing Human Motion in Synthesized Videos
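[Quick read]: This paper addresses the lack of robust metrics for evaluating the visual correctness and temporal consistency of complex human actions produced by video generative models. Existing pure-vision encoders and Multimodal Large Language Models (MLLMs) are strongly appearance-biased and lack temporal understanding, so they struggle to identify subtle motion dynamics and anatomical implausibilities in generated videos. The key to the solution is a new evaluation metric built on a latent space learned from real-world human actions: it fuses appearance-agnostic human skeletal geometry features with appearance-based features to capture the constraints and temporal smoothness of real motion, forming a robust representation of action plausibility. Action quality is then quantified by the distance between a generated video's latent representation and this real-world distribution. The method substantially improves evaluation accuracy and achieves an improvement of more than 68% over the previous best methods on the authors' multi-faceted benchmark.
Link: https://arxiv.org/abs/2512.01803
Authors: Xavier Thomas,Youngsun Lim,Ananya Srinivasan,Audrey Zheng,Deepti Ghadiyaram
Affiliations: Boston University; Belmont High School; Canyon Crest Academy; Runway
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: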
Abstract:Despite rapid advances in video generative models, robust metrics for evaluating visual and temporal correctness of complex human actions remain elusive. Critically, existing pure-vision encoders and Multimodal Large Language Models (MLLMs) are strongly appearance-biased, lack temporal understanding, and thus struggle to discern intricate motion dynamics and anatomical implausibilities in generated videos. We tackle this gap by introducing a novel evaluation metric derived from a learned latent space of real-world human actions. Our method first captures the nuances, constraints, and temporal smoothness of real-world motion by fusing appearance-agnostic human skeletal geometry features with appearance-based features. We posit that this combined feature space provides a robust representation of action plausibility. Given a generated video, our metric quantifies its action quality by measuring the distance between its underlying representations and this learned real-world action distribution. For rigorous validation, we develop a new multi-faceted benchmark specifically designed to probe temporally challenging aspects of human action fidelity. Through extensive experiments, we show that our metric achieves substantial improvement of more than 68% compared to existing state-of-the-art methods on our benchmark, performs competitively on established external benchmarks, and has a stronger correlation with human perception. Our in-depth analysis reveals critical limitations in current video generative models and establishes a new standard for advanced research in video generation.
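The scoring step described above (distance from a generated video's representation to a learned distribution of real actions) can be sketched in a few lines. The abstract does not state which distance is used or how the latent space is trained, so the example below assumes precomputed feature vectors and a diagonal-Gaussian (per-dimension standardized) distance; both choices, and all names, are illustrative assumptions.

```python
import math


def fit_diagonal_gaussian(real_features):
    """Estimate per-dimension mean and variance from real-action features."""
    n = len(real_features)
    dim = len(real_features[0])
    mean = [sum(f[d] for f in real_features) / n for d in range(dim)]
    # A small constant keeps the variance strictly positive.
    var = [
        sum((f[d] - mean[d]) ** 2 for f in real_features) / n + 1e-6
        for d in range(dim)
    ]
    return mean, var


def action_distance(feature, mean, var):
    """Standardized Euclidean distance to the real-action distribution.

    Larger values suggest a less plausible action under this toy model.
    """
    return math.sqrt(
        sum((x - m) ** 2 / v for x, m, v in zip(feature, mean, var))
    )


# Toy 2-D "latent features" standing in for real videos.
real = [[0.0, 0.0], [2.0, 2.0], [0.0, 2.0], [2.0, 0.0]]
mean, var = fit_diagonal_gaussian(real)
print(action_distance([1.5, 1.0], mean, var))
print(action_distance([5.0, 5.0], mean, var))
```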
[CV-33] SAM3-UNet: Simplified Adaptation of Segment Anything Model 3
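[Quick read]: This paper addresses how to adapt Segment Anything Model 3 (SAM3) to downstream tasks at low cost, with particular attention to parameter efficiency and GPU memory consumption. The key to the solution is the SAM3-UNet architecture, which consists of three parts: the SAM3 image encoder, a lightweight adapter for parameter-efficient fine-tuning, and a lightweight U-Net-style decoder. This design maintains strong performance while keeping training GPU memory low (less than 6 GB at a batch size of 12), and it outperforms SAM2-UNet and other state-of-the-art methods on tasks such as mirror detection and salient object detection.
Link: https://arxiv.org/abs/2512.01789
Authors: Xinyu Xiong,Zihuang Wu,Lei Lu,Yufa Xia
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report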
Abstract:In this paper, we introduce SAM3-UNet, a simplified variant of Segment Anything Model 3 (SAM3), designed to adapt SAM3 for downstream tasks at a low cost. Our SAM3-UNet consists of three components: a SAM3 image encoder, a simple adapter for parameter-efficient fine-tuning, and a lightweight U-Net-style decoder. Preliminary experiments on multiple tasks, such as mirror detection and salient object detection, demonstrate that the proposed SAM3-UNet outperforms the prior SAM2-UNet and other state-of-the-art methods, while requiring less than 6 GB of GPU memory during training with a batch size of 12. The code is publicly available at this https URL.
[CV-34] Learned Image Compression for Earth Observation: Implications for Downstream Segmentation Tasks
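[Quick read]: This paper addresses the transmission and storage challenges created by the rapid growth of satellite Earth Observation (EO) data; the core question is how to compress the data while retaining the information that matters for downstream tasks such as fire, cloud, and building detection. The key to the solution is a comparison between a traditional compression algorithm (JPEG 2000) and a task-specific learned compression method (Discretized Mixed Gaussian Likelihood). The results show that, for large-scale multi-channel optical imagery, learned compression clearly outperforms JPEG 2000 in both reconstruction quality (PSNR) and segmentation accuracy, whereas on small single-channel thermal infrared datasets traditional codecs remain competitive, mainly because of limited training data and architectural constraints. In addition, joint end-to-end optimization of the compression and segmentation models brings no performance gain, indicating that standalone optimization is the more effective choice.
Link: https://arxiv.org/abs/2512.01788
Authors: Christian Mollière,Iker Cumplido,Marco Zeulner,Lukas Liesenhoff,Matthias Schubert,Julia Gottfriedsen
Affiliations: OroraTech GmbH; Department of Informatics, LMU, Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: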
Abstract:The rapid growth of data from satellite-based Earth observation (EO) systems poses significant challenges in data transmission and storage. We evaluate the potential of task-specific learned compression algorithms in this context to reduce data volumes while retaining crucial information. In detail, we compare traditional compression (JPEG 2000) versus a learned compression approach (Discretized Mixed Gaussian Likelihood) on three EO segmentation tasks: Fire, cloud, and building detection. Learned compression notably outperforms JPEG 2000 for large-scale, multi-channel optical imagery in both reconstruction quality (PSNR) and segmentation accuracy. However, traditional codecs remain competitive on smaller, single-channel thermal infrared datasets due to limited data and architectural constraints. Additionally, joint end-to-end optimization of compression and segmentation models does not improve performance over standalone optimization.
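For readers unfamiliar with the reconstruction metric mentioned above, PSNR is defined as 10 * log10(MAX^2 / MSE), where MAX is the largest possible pixel value and MSE is the mean squared error between the original and the reconstruction. The short sketch below computes it for flat lists of pixel values; the 8-bit default for `max_value` is an assumption, since the abstract does not state the bit depth of the EO data.

```python
import math


def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(original) != len(reconstructed):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel is off by one grey level, so MSE = 1.
print(psnr([10, 20, 30], [11, 21, 31]))
```

Higher PSNR means the decompressed image is closer to the original, which is why it is reported alongside segmentation accuracy when comparing codecs.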
[CV-35] Evaluating SAM2 for Video Semantic Segmentation
[Quick read]: This paper addresses the challenges of Video Semantic Segmentation (VSS), including insufficient spatial accuracy, poor temporal consistency, and difficulty tracking multiple objects with complex boundaries and varying scales. The key to the solution is to exploit the strong object awareness of Segment Anything Model 2 (SAM2) through two main strategies: first, extracting high-quality object masks with SAM2 and using a parallel segmentation network to generate and refine the initial predictions; second, using the SAM2-generated masks to extract feature vectors, feeding them into a lightweight classification network for class prediction, and fusing the classification results with the masks to obtain the final segmentation. Experiments suggest that introducing SAM2 improves overall VSS performance, chiefly because of its precise prediction of object boundaries.
Link: https://arxiv.org/abs/2512.01774
Authors: Syed Hesham Syed Ariff,Yun Liu,Guolei Sun,Jing Yang,Henghui Ding,Xue Geng,Xudong Jiang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 3 figures and 7 tables
Abstract:The Segment Anything Model 2 (SAM2) has proven to be a powerful foundation model for promptable visual object segmentation in both images and videos, capable of storing object-aware memories and transferring them temporally through memory blocks. While SAM2 excels in video object segmentation by providing dense segmentation masks based on prompts, extending it to dense Video Semantic Segmentation (VSS) poses challenges due to the need for spatial accuracy, temporal consistency, and the ability to track multiple objects with complex boundaries and varying scales. This paper explores the extension of SAM2 for VSS, focusing on two primary approaches and highlighting firsthand observations and common challenges faced during this process. The first approach involves using SAM2 to extract unique objects as masks from a given image, with a segmentation network employed in parallel to generate and refine initial predictions. The second approach utilizes the predicted masks to extract unique feature vectors, which are then fed into a simple network for classification. The resulting classifications and masks are subsequently combined to produce the final segmentation. Our experiments suggest that leveraging SAM2 enhances overall performance in VSS, primarily due to its precise predictions of object boundaries.
[CV-36] Robust Rigid and Non-Rigid Medical Image Registration Using Learnable Edge Kernels
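[Quick read]: This paper addresses the limited performance of traditional medical image registration methods in the presence of contrast differences, spatial distortions, and modality-specific variations. The key to the solution is learnable edge kernels: starting from a predefined edge detection kernel, the kernels are perturbed with random noise and then optimized during training to extract edge features tailored to the task. This adaptive edge detection strengthens the registration process's ability to capture the diverse structural features found in medical images, improving the accuracy of multi-modal image alignment and anatomical structure analysis.
Link: https://arxiv.org/abs/2512.01771
Authors: Ahsan Raza Siyal,Markus Haltmeier,Ruth Steiger,Malik Galijasevic,Elke Ruth Gizewski,Astrid Ellen Grams
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: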
Abstract:Medical image registration is crucial for various clinical and research applications including disease diagnosis or treatment planning which require alignment of images from different modalities, time points, or subjects. Traditional registration techniques often struggle with challenges such as contrast differences, spatial distortions, and modality-specific variations. To address these limitations, we propose a method that integrates learnable edge kernels with learning-based rigid and non-rigid registration techniques. Unlike conventional layers that learn all features without specific bias, our approach begins with a predefined edge detection kernel, which is then perturbed with random noise. These kernels are learned during training to extract optimal edge features tailored to the task. This adaptive edge detection enhances the registration process by capturing diverse structural features critical in medical imaging. To provide clearer insight into the contribution of each component in our design, we introduce four variant models for rigid registration and four variant models for non-rigid registration. We evaluated our approach using a dataset provided by the Medical University across three setups: rigid registration without skull removal, with skull removal, and non-rigid registration. Additionally, we assessed performance on two publicly available datasets. Across all experiments, our method consistently outperformed state-of-the-art techniques, demonstrating its potential to improve multi-modal image alignment and anatomical structure analysis.
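The kernel initialization described in the abstract (a predefined edge detector perturbed with random noise, then trained) can be illustrated as follows. This sketch covers the initialization and a forward pass only. The Sobel kernel, the Gaussian noise, the noise scale, and the function names are assumptions, since the abstract names neither the base kernel nor the noise distribution, and the training step is omitted.

```python
import random

# Horizontal-gradient Sobel kernel, used here as the assumed starting point.
SOBEL_X = [[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]]


def perturbed_kernel(base, noise_std=0.05, seed=0):
    """Return a copy of `base` with small Gaussian noise added to each weight.

    In a real model these weights would then be registered as trainable
    parameters and updated by gradient descent.
    """
    rng = random.Random(seed)
    return [[w + rng.gauss(0.0, noise_std) for w in row] for row in base]


def correlate2d_valid(image, kernel):
    """Plain 2-D cross-correlation without padding (the usual conv-layer op)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# A 4x4 image with a vertical step edge between columns 1 and 2.
image = [[0, 0, 1, 1] for _ in range(4)]
kernel = perturbed_kernel(SOBEL_X)
response = correlate2d_valid(image, kernel)
```

Starting from an edge detector gives the layer a structural prior, while the noise breaks the symmetry between multiple kernels so that each can specialize during training.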
[CV-37] VideoScoop: A Non-Traditional Domain-Independent Framework For Video Analysis
[Quick read]: This paper addresses a common limitation in Video Situation Analysis (VSA): current methods rely on manual intervention or on algorithms customized for specific scenarios, and so lack generality and cross-domain adaptability. The core challenge is to automatically identify complex, semantically rich activities or situations (such as two objects coming closer) from video content, rather than only extracting basic object information. The key to the solution is a general-purpose VSA framework: video contents are extracted once with state-of-the-art video content extraction techniques and represented using two complementary models, the extended relational model (R++) and graph models. R++ supports stream processing through a custom Continuous Query Language for Video Analysis and suits real-time situation detection; the graph models capture situation patterns that the relational model has difficulty expressing. To achieve domain independence, the author designs parameterized templates that abstract primitive situation variants across domains, improving the generality and transferability of the system.
Link: https://arxiv.org/abs/2512.01769
Authors: Hafsa Billah
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
Comments: This is a report submitted as part of PhD proposal defense of Hafsa Billah
Abstract:Automatically understanding video contents is important for several applications in Civic Monitoring (CM), general Surveillance (SL), Assisted Living (AL), etc. Decades of Image and Video Analysis (IVA) research have advanced tasks such as content extraction (e.g., object recognition and tracking). Identifying meaningful activities or situations (e.g., two objects coming closer) remains difficult and cannot be achieved by content extraction alone. Currently, Video Situation Analysis (VSA) is done manually with a human in the loop, which is error-prone and labor-intensive, or through custom algorithms designed for specific video types or situations. These algorithms are not general-purpose and require a new algorithm/software for each new situation or video from a new domain. This report proposes a general-purpose VSA framework that overcomes the above limitations. Video contents are extracted once using state-of-the-art Video Content Extraction technologies. They are represented using two alternative models: the extended relational model (R++) and graph models. When represented using R++, the extracted contents can be used as data streams, enabling Continuous Query Processing via the proposed Continuous Query Language for Video Analysis. The graph models complement this by enabling the detection of situations that are difficult or impossible to detect using the relational model alone. Existing graph algorithms and newly developed algorithms support a wide variety of situation detection. To support domain independence, primitive situation variants across domains are identified and expressed as parameterized templates. Extensive experiments were conducted across several interesting situations from three domains (AL, CM, and SL) to evaluate the accuracy, efficiency, and robustness of the proposed approach using a dataset of videos of varying lengths from these domains.
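To make the "two objects coming closer" situation concrete, here is a toy detector over already-extracted object tracks. It is a plain-Python illustration, not the report's R++ model or its Continuous Query Language for Video Analysis; the track format (one centroid per frame), the function names, and the `min_steps` threshold are all assumptions.

```python
import math


def distances(track_a, track_b):
    """Per-frame Euclidean distance between two tracked centroids."""
    return [math.dist(p, q) for p, q in zip(track_a, track_b)]


def coming_closer(track_a, track_b, min_steps=3):
    """Return (start_frame, end_frame) spans where the distance strictly
    decreases for at least `min_steps` consecutive frame-to-frame steps."""
    d = distances(track_a, track_b)
    spans = []
    start = 0
    for i in range(1, len(d) + 1):
        still_decreasing = i < len(d) and d[i] < d[i - 1]
        if not still_decreasing:
            # The run of decreasing steps ended at frame i - 1.
            if (i - 1) - start >= min_steps:
                spans.append((start, i - 1))
            start = i
    return spans


# Object A stands still; object B approaches for three steps, then backs off.
track_a = [(0.0, 0.0)] * 6
track_b = [(10.0, 0.0), (8.0, 0.0), (6.0, 0.0), (4.0, 0.0), (5.0, 0.0), (5.0, 0.0)]
print(coming_closer(track_a, track_b))
```

A parameterized template in the sense of the abstract would presumably expose choices such as the object classes, the distance threshold, and the window length, so the same primitive can be reused across domains.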
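[CV-38] HiconAgent: History Context-aware Policy Optimization for GUI Agents
[Quick read]: This paper addresses how GUI agents can use historical context efficiently and effectively when performing sequential navigation tasks. Using the full history directly incurs excessive computational overhead and introduces irrelevant distractions that hurt decision quality. The key to the solution is a training mechanism called History Context-aware Policy Optimization (HCPO), which has two complementary components: (1) Dynamic Context Sampling (DCS), which varies the history length during sampling so the agent adaptively uses the most relevant context; (2) Anchor-guided History Compression (AHC), which uses a dual-branch structure in the policy update phase, where the compressed branch keeps history actions as information-flow anchors while removing history observations, and a history-enhanced alignment loss enforces consistency between the two branches, substantially improving efficiency while maintaining performance.
Link: https://arxiv.org/abs/2512.01763
Authors: Xurui Zhou,Gongwei Chen,Yuquan Xie,Zaijing Li,Kaiwen Zhou,Shuai Wang,Shuo Yang,Zhuotao Tian,Rui Shao
Affiliations: Harbin Institute of Technology, Shenzhen; Huawei Noah’s Ark Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: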
Abstract:Graphical User Interface (GUI) agents require effective use of historical context to perform sequential navigation tasks. While incorporating past actions and observations can improve decision making, naive use of full history leads to excessive computational overhead and distraction from irrelevant information. To address this, we introduce HiconAgent, a GUI agent trained with History Context-aware Policy Optimization (HCPO) for efficient and effective utilization of historical information. HCPO optimizes history usage in both sampling and policy updates through two complementary components: (1) Dynamic Context Sampling (DCS) presents the agent with variable length histories during sampling, enabling adaptive use of the most relevant context; (2) Anchor-guided History Compression (AHC) refines the policy update phase with a dual branch strategy where the compressed branch removes history observations while keeping history actions as information flow anchors. The compressed and uncompressed branches are coupled through a history-enhanced alignment loss to enforce consistent history usage while maintaining efficiency. Experiments on mainstream GUI navigation benchmarks demonstrate strong performance. Despite being smaller, HiconAgent-3B outperforms GUI-R1-7B by +8.46 percent grounding accuracy and +11.32 percent step success rate on GUI-Odyssey, while achieving comparable results on AndroidControl and AITW with up to 2.47x computational speedup and 60 percent FLOPs reduction.
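The two history-handling ideas in HCPO can be pictured with simple data manipulation. The sketch below is only a data-level illustration under stated assumptions: each history step is a dict with `observation` and `action` keys, DCS is approximated by drawing a history length uniformly at random, and AHC's compressed branch is approximated by dropping observations. The actual sampling distribution, the model, and the alignment loss are not described in enough detail in the abstract to reproduce.

```python
import random


def sample_history(history, rng, max_len=None):
    """DCS-style sampling (assumed uniform): keep the most recent k steps,
    with k drawn from 0..len(history)."""
    limit = len(history) if max_len is None else min(max_len, len(history))
    k = rng.randint(0, limit)
    return history[len(history) - k:]


def compress_history(history):
    """AHC-style compressed branch: drop observations, keep actions as anchors."""
    return [{"action": step["action"]} for step in history]


history = [
    {"observation": "screenshot_0", "action": "tap(search)"},
    {"observation": "screenshot_1", "action": "type('weather')"},
    {"observation": "screenshot_2", "action": "tap(enter)"},
    {"observation": "screenshot_3", "action": "scroll(down)"},
]

rng = random.Random(0)
sampled = sample_history(history, rng)
compressed = compress_history(history)
print(len(sampled), compressed)
```

Dropping observations is where the savings come from, since screenshots account for far more input tokens than short action strings.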
[CV-39] FreqEdit: Preserving High-Frequency Features for Robust Multi-Turn Image Editing
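[Quick read]: This paper addresses the degradation of image quality in multi-turn instruction-based image editing, which is caused by the progressive loss of high-frequency information. The key to the solution is FreqEdit, a training-free framework with three core innovations: (1) injecting high-frequency features from reference velocity fields to preserve detail; (2) an adaptive injection strategy that enables spatially adjustable, region-specific control; (3) a path compensation mechanism that periodically recalibrates the editing trajectory to avoid over-constraint. Together, these three components maintain stability and fidelity across more than 10 consecutive edits.
Link: https://arxiv.org/abs/2512.01755
Authors: Yucheng Liao,Jiajun Liang,Kaiqian Cui,Baoquan Zhao,Haoran Xie,Wei Liu,Qing Li,Xudong Mao
Affiliations: Sun Yat-sen University; Lingnan University; Video Rebirth; The Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: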
Abstract:Instruction-based image editing through natural language has emerged as a powerful paradigm for intuitive visual manipulation. While recent models achieve impressive results on single edits, they suffer from severe quality degradation under multi-turn editing. Through systematic analysis, we identify progressive loss of high-frequency information as the primary cause of this quality degradation. We present FreqEdit, a training-free framework that enables stable editing across 10+ consecutive iterations. Our approach comprises three synergistic components: (1) high-frequency feature injection from reference velocity fields to preserve fine-grained details, (2) an adaptive injection strategy that spatially modulates injection strength for precise region-specific control, and (3) a path compensation mechanism that periodically recalibrates the editing trajectory to prevent over-constraint. Extensive experiments demonstrate that FreqEdit achieves superior performance in both identity preservation and instruction following compared to seven state-of-the-art baselines.
[CV-40] SSR: Semantic and Spatial Rectification for CLIP-based Weakly Supervised Segmentation AAAI2026
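[Quick read]: This paper addresses two key problems in weakly supervised semantic segmentation (WSSS) methods based on Contrastive Language-Image Pretraining (CLIP): over-activation in non-target foreground regions and over-activation in background regions. The core of the solution is a novel Semantic and Spatial Rectification (SSR) method. At the semantic level, Cross-Modal Prototype Alignment (CMPA) uses a contrastive learning mechanism to align the multimodal feature spaces, reducing inter-class confusion and strengthening semantic correlations, which mitigates over-activation in non-target foreground regions. At the spatial level, Superpixel-Guided Correction (SGC) uses superpixel-based spatial priors to filter out interference from non-target regions during affinity propagation, which significantly suppresses background over-activation.
Link: https://arxiv.org/abs/2512.01701
Authors: Xiuli Bi,Die Xiao,Junchao Fan,Bin Xiao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in AAAI 2026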
Abstract:In recent years, Contrastive Language-Image Pretraining (CLIP) has been widely applied to Weakly Supervised Semantic Segmentation (WSSS) tasks due to its powerful cross-modal semantic understanding capabilities. This paper proposes a novel Semantic and Spatial Rectification (SSR) method to address the limitations of existing CLIP-based weakly supervised semantic segmentation approaches: over-activation in non-target foreground regions and background areas. Specifically, at the semantic level, the Cross-Modal Prototype Alignment (CMPA) establishes a contrastive learning mechanism to enforce feature space alignment across modalities, reducing inter-class overlap while enhancing semantic correlations, to rectify over-activation in non-target foreground regions effectively; at the spatial level, the Superpixel-Guided Correction (SGC) leverages superpixel-based spatial priors to precisely filter out interference from non-target regions during affinity propagation, significantly rectifying background over-activation. Extensive experiments on the PASCAL VOC and MS COCO datasets demonstrate that our method outperforms all single-stage approaches, as well as more complex multi-stage approaches, achieving mIoU scores of 79.5% and 50.6%, respectively.
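The mIoU figures quoted above are the mean, over classes, of intersection-over-union between predicted and ground-truth label maps. The minimal version below works on flattened label lists. Skipping classes that appear in neither prediction nor ground truth is one common convention and is an assumption here; benchmark toolkits for PASCAL VOC and MS COCO typically accumulate counts over the whole dataset rather than per image.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class from range(num_classes) occurs in the inputs")
    return sum(ious) / len(ious)


# Four pixels, two classes: class 0 IoU = 1/2, class 1 IoU = 2/3.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
print(mean_iou(pred, target, num_classes=2))
```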
[CV-41] Revisiting Direct Encoding: Learnable Temporal Dynamics for Static Image Spiking Neural Networks
[Quick read]: This paper addresses the difficulty of modeling temporal dynamics for static image inputs in Spiking Neural Networks (SNNs). Directly trained SNNs typically repeat the static input across multiple time steps, so the temporal dimension collapses into a rate-coding-like representation that prevents meaningful temporal modeling. By revisiting the performance gap between direct and rate encoding, the paper argues that the gap stems mainly from design issues in convolutional learnability and surrogate gradient formulations, rather than from the encoding schemes themselves. The key to the solution is a minimal learnable temporal encoding that adds adaptive phase shifts to induce meaningful temporal variation from static inputs, improving the temporal modeling ability of SNNs on static images without changing the original encoding structure.
Link: https://arxiv.org/abs/2512.01687
Authors: Huaxu He
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Handling static images that lack inherent temporal dynamics remains a fundamental challenge for spiking neural networks (SNNs). In directly trained SNNs, static inputs are typically repeated across time steps, causing the temporal dimension to collapse into a rate-like representation and preventing meaningful temporal modeling. This work revisits the reported performance gap between direct and rate-based encodings and shows that it primarily stems from convolutional learnability and surrogate gradient formulations rather than the encoding schemes themselves. To illustrate this mechanism-level clarification, we introduce a minimal learnable temporal encoding that adds adaptive phase shifts to induce meaningful temporal variation from static inputs.
[CV-42] DreamingComics: A Story Visualization Pipeline via Subject and Layout Customized Generation using Video Models
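[Quick read]: This paper addresses two problems in current story visualization methods that rely only on text descriptions: inaccurate subject positioning and difficulty maintaining a consistent artistic style. The core of the solution is DreamingComics, a layout-aware story visualization framework with four key innovations: (1) building on a pretrained video diffusion-transformer (DiT) model and using its spatiotemporal priors to improve identity and style consistency; (2) RegionalRoPE, a region-aware positional encoding scheme that re-indexes embeddings according to the target layout to achieve layout control; (3) a masked condition loss that constrains each subject's visual features to its designated region; (4) an integrated layout generator based on a large language model (LLM) that infers comic-style layouts from natural language scripts, enabling flexible and controllable layout conditioning. Experiments show improvements of 29.2% in character consistency and 36.2% in style similarity, with high spatial accuracy.
Link: https://arxiv.org/abs/2512.01686
Authors: Patrick Kwon,Chen Chen
Affiliations: Center for Research in Computer Vision, University of Central Florida
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: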
Abstract:Current story visualization methods tend to position subjects solely by text and face challenges in maintaining artistic consistency. To address these limitations, we introduce DreamingComics, a layout-aware story visualization framework. We build upon a pretrained video diffusion-transformer (DiT) model, leveraging its spatiotemporal priors to enhance identity and style consistency. For layout-based position control, we propose RegionalRoPE, a region-aware positional encoding scheme that re-indexes embeddings based on the target layout. Additionally, we introduce a masked condition loss to further constrain each subject’s visual features to their designated region. To infer layouts from natural language scripts, we integrate an LLM-based layout generator trained to produce comic-style layouts, enabling flexible and controllable layout conditioning. We present a comprehensive evaluation of our approach, showing a 29.2% increase in character consistency and a 36.2% increase in style similarity compared to previous methods, while displaying high spatial accuracy. Our project page is available at this https URL
[CV-43] Cross-Domain Validation of a Resection-Trained Self-Supervised Model on Multicentre Mesothelioma Biopsies
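[Quick Read]: This paper targets accurate subtype classification and outcome prediction in mesothelioma, in order to better guide therapy and patient care. In real clinical settings small biopsies are the more common material, whereas most existing computational pathology models are trained on large tissue sections from resection specimens, which limits their applicability. The key to the solution is to take a self-supervised encoder trained on resection specimens and transfer it to biopsy samples, where it captures clinically meaningful morphological patterns that are then used for survival prediction and tumor subtype classification. The work illustrates the potential of AI-driven tools for mesothelioma diagnosis and treatment planning.
Link: https://arxiv.org/abs/2512.01681
Authors: Farzaneh Seyedshahi,Francesca Damiola,Sylvie Lantuejoul,Ke Yuan,John Le Quesne
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: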
Abstract:Accurate subtype classification and outcome prediction in mesothelioma are essential for guiding therapy and patient care. Most computational pathology models are trained on large tissue images from resection specimens, limiting their use in real-world settings where small biopsies are common. We show that a self-supervised encoder trained on resection tissue can be applied to biopsy material, capturing meaningful morphological patterns. Using these patterns, the model can predict patient survival and classify tumor subtypes. This approach demonstrates the potential of AI-driven tools to support diagnosis and treatment planning in mesothelioma.
[CV-44] Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation
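[Quick Read]: This paper tackles the generation of realistic hand-object interaction (HOI) videos. The core challenge is modeling physical constraints (such as contact and occlusion between hand and object) while retaining both scalability and interaction fidelity. Existing methods face a trade-off between 2D and 3D representations and struggle to achieve high fidelity and large-scale generalization at the same time. The key to the solution is a structure- and contact-aware representation that captures hand-object contact, occlusion, and holistic structural context without 3D annotations. It provides an interaction-oriented, scalable supervision signal that lets the model learn fine-grained interaction physics and generalize to open-world scenes. To exploit this representation fully, the authors further design a joint-generation paradigm with a share-and-specialization strategy that generates the interaction representation and the video content together.
Link: https://arxiv.org/abs/2512.01677
Authors: Haodong Yan,Hang Yu,Zhide Zhong,Weilin Yuan,Xin Gong,Zehang Luo,Chengxi Heyu,Junfeng Li,Wenxuan Song,Shunbo Zhou,Haoang Li
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Huawei Cloud
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: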
Abstract:Generating realistic hand-object interaction (HOI) videos is a significant challenge due to the difficulty of modeling physical constraints (e.g., contact and occlusion between hands and manipulated objects). Current methods utilize HOI representation as an auxiliary generative objective to guide video synthesis. However, 2D and 3D representations pose a dilemma: neither can simultaneously guarantee scalability and interaction fidelity. To address this limitation, we propose a structure and contact-aware representation that captures hand-object contact, hand-object occlusion, and holistic structure context without 3D annotations. This interaction-oriented and scalable supervision signal enables the model to learn fine-grained interaction physics and generalize to open-world scenarios. To fully exploit the proposed representation, we introduce a joint-generation paradigm with a share-and-specialization strategy that generates interaction-oriented representations and videos. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on two real-world datasets in generating physics-realistic and temporally coherent HOI videos. Furthermore, our approach exhibits strong generalization to challenging open-world scenarios, highlighting the benefit of our scalable design. Our project page is this https URL.
[CV-45] GRASP: Guided Residual Adapters with Sample-wise Partitioning
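[Quick Read]: This paper addresses mode collapse of generative diffusion models in long-tail settings (such as rare pathologies in medical imaging), where images generated for rare classes have low quality and poor diversity and therefore cannot serve synthetic data augmentation. The root cause it identifies is gradient conflict between head classes and tail classes; existing methods only adjust inference through sampling or conditioning and do not change the learned distribution. The key to the solution is GRASP (Guided Residual Adapters with Sample-wise Partitioning): external priors statically partition samples into clusters that minimize intra-group gradient conflict, and cluster-specific residual adapters are injected into the transformer feedforward layers during fine-tuning. Avoiding a learned gating mechanism improves stability and efficiency, and the approach markedly improves the quality and diversity of rare-class generation.
Link: https://arxiv.org/abs/2512.01675
Authors: Felix Nützel,Mischa Dombrowski,Bernhard Kainz
Affiliations: Friedrich-Alexander-Universität Erlangen-Nürnberg; Imperial College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures, 6 tables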
Abstract:Recent advances in text-to-image diffusion models enable high-fidelity generation across diverse prompts. However, these models falter in long-tail settings, such as medical imaging, where rare pathologies comprise a small fraction of data. This results in mode collapse: tail-class outputs lack quality and diversity, undermining the goal of synthetic data augmentation for underrepresented conditions. We pinpoint gradient conflicts between frequent head and rare tail classes as the primary culprit, a factor unaddressed by existing sampling or conditioning methods that mainly steer inference without altering the learned distribution. To resolve this, we propose GRASP: Guided Residual Adapters with Sample-wise Partitioning. GRASP uses external priors to statically partition samples into clusters that minimize intra-group gradient clashes. It then fine-tunes pre-trained models by injecting cluster-specific residual adapters into transformer feedforward layers, bypassing learned gating for stability and efficiency. On the long-tail MIMIC-CXR-LT dataset, GRASP yields superior FID and diversity metrics, especially for rare classes, outperforming baselines like vanilla fine-tuning and Mixture of Experts variants. Downstream classification on NIH-CXR-LT improves considerably for tail labels. Generalization to ImageNet-LT confirms broad applicability. Our method is lightweight, scalable, and readily integrates with diffusion pipelines.
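The adapter routing described in the abstract can be pictured with a small sketch. It is not the authors' implementation: the low-rank adapter shape, the zero initialization, the `cluster_of` lookup table, and the stand-in `frozen_ffn` are all assumptions made for illustration. The one point it tries to show is that the cluster id comes from a fixed, precomputed partition rather than from a learned gate.

```python
import random

def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def make_adapter(dim, rank, rng):
    """Hypothetical low-rank adapter; 'up' starts at zero, so it is initially a no-op."""
    down = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rank)]
    up = [[0.0] * rank for _ in range(dim)]
    return {"down": down, "up": up}

def adapted_ffn(x, frozen_ffn, adapters, cluster_id):
    """Frozen FFN output plus the residual of the adapter owned by the sample's cluster."""
    adapter = adapters[cluster_id]  # static lookup, no learned gating
    residual = matvec(adapter["up"], matvec(adapter["down"], x))
    return [base + r for base, r in zip(frozen_ffn(x), residual)]

rng = random.Random(0)
adapters = [make_adapter(dim=4, rank=2, rng=rng) for _ in range(3)]  # 3 clusters (assumed)
# Cluster ids would come from an external prior, e.g. precomputed labels (assumed here).
cluster_of = {"sample_a": 0, "sample_b": 2}
frozen_ffn = lambda x: [2.0 * v for v in x]  # stand-in for a pre-trained FFN layer
x = [1.0, 0.0, -1.0, 0.5]
y = adapted_ffn(x, frozen_ffn, adapters, cluster_of["sample_b"])
```

Because each sample only ever touches its own cluster's adapter, gradients from head-class samples would not update the parameters used by tail-class clusters, which is the intuition behind reducing gradient conflict.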
[CV-46] Bridging the Scale Gap: Balanced Tiny and General Object Detection in Remote Sensing Imagery
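[Quick Read]: This paper addresses the scale imbalance faced by tiny object detection in remote sensing imagery, particularly in scenes where dense tiny objects coexist with large objects and conventional detectors struggle to perform well across scales. The key to the solution is the ScaleBridge-Det framework, built on two core modules. The first is the Routing-Enhanced Mixture Attention (REM) module, which uses adaptive routing to dynamically select and fuse scale-specific expert features, counteracting the tendency of standard MoE models to favor dominant scales and producing complementary, discriminative multi-scale representations suitable for both tiny and large objects. The second is the Density-Guided Dynamic Query (DGQ) module, which adaptively adjusts query positions and numbers according to predicted object density, allocating resources efficiently across objects of different scales. Working together, the two modules let the model optimize detection of dense tiny objects and general objects at the same time without sacrificing performance at either scale.
Link: https://arxiv.org/abs/2512.01665
Authors: Zhicheng Zhao,Yin Huang,Lingma Sun,Chenglong Li,Jin Tang
Affiliations: Anhui University; China Electronics Technology Group Corporation; Hefei University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: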
Abstract:Tiny object detection in remote sensing imagery has attracted significant research interest in recent years. Despite recent progress, achieving balanced detection performance across diverse object scales remains a formidable challenge, particularly in scenarios where dense tiny objects and large objects coexist. Although large foundation models have revolutionized general vision tasks, their application to tiny object detection remains unexplored due to the extreme scale variation and density distribution inherent to remote sensing imagery. To bridge this scale gap, we propose ScaleBridge-Det, to the best of our knowledge, the first large detection framework designed for tiny objects, which could achieve balanced performance across diverse scales through scale-adaptive expert routing and density-guided query allocation. Specifically, we introduce a Routing-Enhanced Mixture Attention (REM) module that dynamically selects and fuses scale-specific expert features via adaptive routing to address the tendency of standard MoE models to favor dominant scales. REM generates complementary and discriminative multi-scale representations suitable for both tiny and large objects. Furthermore, we present a Density-Guided Dynamic Query (DGQ) module that predicts object density to adaptively adjust query positions and numbers, enabling efficient resource allocation for objects of varying scales. The proposed framework allows ScaleBridge-Det to simultaneously optimize performance for both dense tiny and general objects without trade-offs. Extensive experiments on benchmark and cross-domain datasets demonstrate that ScaleBridge-Det achieves state-of-the-art performance on AI-TOD-V2 and DTOD, while exhibiting superior cross-domain robustness on VisDrone.
[CV-47] DB-KAUNet: An Adaptive Dual Branch Kolmogorov-Arnold UNet for Retinal Vessel Segmentation
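[Quick Read]: This paper addresses the difficulty that traditional Convolutional Neural Networks (CNNs) have in capturing long-range dependencies and complex nonlinear features in retinal vessel segmentation. Its core solution is an Adaptive Dual Branch Kolmogorov-Arnold UNet (DB-KAUNet). The key is a Heterogeneous Dual-Branch Encoder (HDBE) that integrates CNN and Transformer pathways in parallel and uses novel KANConv and KAT blocks so that the two pathways complement each other in feature extraction. A Cross-Branch Channel Interaction (CCI) module promotes cooperation between channel features of the two pathways, while an attention-based Spatial Feature Enhancement (SFE) module and a Spatial Feature Enhancement with Geometrically Adaptive Fusion (SFE-GAF) module focus precisely on true vessel morphology, suppress background noise, and reduce computational overhead, improving segmentation accuracy and robustness.
Link: https://arxiv.org/abs/2512.01657
Authors: Hongyu Xu,Panpan Meng,Meng Wang,Dayu Hu,Liming Liang,Xiaoqi Sheng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: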
Abstract:Accurate segmentation of retinal vessels is crucial for the clinical diagnosis of numerous ophthalmic and systemic diseases. However, traditional Convolutional Neural Network (CNN) methods exhibit inherent limitations, struggling to capture long-range dependencies and complex nonlinear relationships. To address the above limitations, an Adaptive Dual Branch Kolmogorov-Arnold UNet (DB-KAUNet) is proposed for retinal vessel segmentation. In DB-KAUNet, we design a Heterogeneous Dual-Branch Encoder (HDBE) that features parallel CNN and Transformer pathways. The HDBE strategically interleaves standard CNN and Transformer blocks with novel KANConv and KAT blocks, enabling the model to form a comprehensive feature representation. To optimize feature processing, we integrate several critical components into the HDBE. First, a Cross-Branch Channel Interaction (CCI) module is embedded to facilitate efficient interaction of channel features between the parallel pathways. Second, an attention-based Spatial Feature Enhancement (SFE) module is employed to enhance spatial features and fuse the outputs from both branches. Building upon the SFE module, an advanced Spatial Feature Enhancement with Geometrically Adaptive Fusion (SFE-GAF) module is subsequently developed. In the SFE-GAF module, adaptive sampling is utilized to focus on true vessel morphology precisely. The adaptive process strengthens salient vascular features while significantly reducing background noise and computational overhead. Extensive experiments on the DRIVE, STARE, and CHASE_DB1 datasets validate that DB-KAUNet achieves leading segmentation performance and demonstrates exceptional robustness.
[CV-48] ViT3: Unlocking Test-Time Training in Vision
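[Quick Read]: This paper addresses the difficulty of designing efficient and powerful Test-Time Training (TTT) for visual sequence modeling, in particular the lack of systematic understanding and practical guidance on the choice of inner module and inner training strategy. The key to the solution is a systematic empirical study: from extensive experiments and analysis the authors distill six practical design principles, and on that basis build a pure TTT architecture, Vision Test-Time Training (ViT^3). The model keeps linear computational complexity while allowing parallelizable computation, performs strongly on visual tasks such as image classification, image generation, object detection, and semantic segmentation, and narrows the gap to highly optimized vision Transformers.
Link: https://arxiv.org/abs/2512.01643
Authors: Dongchen Han,Yining Li,Tianyu Li,Zixuan Cao,Ziming Wang,Jun Song,Yu Cheng,Bo Zheng,Gao Huang
Affiliations: Tsinghua University; Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: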
Abstract:Test-Time Training (TTT) has recently emerged as a promising direction for efficient sequence modeling. TTT reformulates attention operation as an online learning problem, constructing a compact inner model from key-value pairs at test time. This reformulation opens a rich and flexible design space while achieving linear computational complexity. However, crafting a powerful visual TTT design remains challenging: fundamental choices for the inner module and inner training lack comprehensive understanding and practical guidelines. To bridge this critical gap, in this paper, we present a systematic empirical study of TTT designs for visual sequence modeling. From a series of experiments and analyses, we distill six practical insights that establish design principles for effective visual TTT and illuminate paths for future improvement. These findings culminate in the Vision Test-Time Training (ViT^3) model, a pure TTT architecture that achieves linear complexity and parallelizable computation. We evaluate ViT^3 across diverse visual tasks, including image classification, image generation, object detection, and semantic segmentation. Results show that ViT^3 consistently matches or outperforms advanced linear-complexity models (e.g., Mamba and linear attention variants) and effectively narrows the gap to highly optimized vision Transformers. We hope this study and the ViT^3 baseline can facilitate future work on visual TTT models. Code is available at this https URL.
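To make the "attention as online learning" idea concrete, here is a minimal sketch under stated assumptions: the inner model is a single linear map `W`, the inner loss is the squared error between `W k` and `v`, and the update is one plain gradient step per token. These choices are illustrative only; the paper studies which inner module and inner training work best for vision, and this sketch is sequential rather than the parallelizable form the abstract mentions.

```python
def ttt_linear(queries, keys, values, lr=1.0):
    """Token-by-token TTT with a linear inner model (illustrative assumption).

    For each token: take one gradient step on 0.5 * ||W k - v||^2,
    then read out the updated model at the query, o = W q.
    """
    d_out, d_in = len(values[0]), len(keys[0])
    W = [[0.0] * d_in for _ in range(d_out)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        pred = [sum(w * x for w, x in zip(row, k)) for row in W]
        err = [p - t for p, t in zip(pred, v)]
        for i in range(d_out):
            for j in range(d_in):
                W[i][j] -= lr * err[i] * k[j]  # gradient of the squared error
        outputs.append([sum(w * x for w, x in zip(row, q)) for row in W])
    return outputs, W

keys = [[1.0, 0.0], [0.0, 1.0]]      # two orthonormal keys
values = [[2.0, 3.0], [5.0, 7.0]]
queries = [[1.0, 0.0], [1.0, 0.0]]   # both queries ask for the first key
outputs, W = ttt_linear(queries, keys, values)
```

The state carried between tokens is just `W`, whose size does not depend on sequence length, which is where the linear complexity comes from. With orthonormal keys and a unit learning rate, the inner model memorizes each key-value pair exactly, so the second query still retrieves the first value.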
[CV-49] Generative Editing in the Joint Vision-Language Space for Zero-Shot Composed Image Retrieval
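[Quick Read]: This paper addresses the performance bottleneck that the vision-language modality gap creates in Zero-Shot Composed Image Retrieval (ZS-CIR). Existing text-centric or diffusion-based methods struggle to align the multimodal semantic space, which limits retrieval accuracy. The key to the solution is a generative editing framework named Fusion-Diff. First, it introduces a multimodal fusion feature editing strategy within a joint vision-language (VL) space, substantially narrowing the semantic gap between modalities. Second, a lightweight Control-Adapter lets the framework reach state-of-the-art performance after fine-tuning on a synthetic dataset of only 200K samples, greatly improving data efficiency. Experiments show that the method significantly outperforms prior zero-shot approaches on the standard benchmarks CIRR, FashionIQ, and CIRCO.
Link: https://arxiv.org/abs/2512.01636
Authors: Xin Wang,Haipeng Zhang,Mang Li,Zhaohui Xia,Yueguo Chen,Yu Zhang,Chunyu Wei
Affiliations: Renmin University of China; Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: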
Abstract:Composed Image Retrieval (CIR) enables fine-grained visual search by combining a reference image with a textual modification. While supervised CIR methods achieve high accuracy, their reliance on costly triplet annotations motivates zero-shot solutions. The core challenge in zero-shot CIR (ZS-CIR) stems from a fundamental dilemma: existing text-centric or diffusion-based approaches struggle to effectively bridge the vision-language modality gap. To address this, we propose Fusion-Diff, a novel generative editing framework with high effectiveness and data efficiency designed for multimodal alignment. First, it introduces a multimodal fusion feature editing strategy within a joint vision-language (VL) space, substantially narrowing the modality gap. Second, to maximize data efficiency, the framework incorporates a lightweight Control-Adapter, enabling state-of-the-art performance through fine-tuning on only a limited-scale synthetic dataset of 200K samples. Extensive experiments on standard CIR benchmarks (CIRR, FashionIQ, and CIRCO) demonstrate that Fusion-Diff significantly outperforms prior zero-shot approaches. We further enhance the interpretability of our model by visualizing the fused multimodal representations.
[CV-50] SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge
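[Quick Read]: This paper addresses the automatic reconstruction of physically consistent, kinematically structured articulated 3D objects from a single RGB image. The task matters for embodied AI, robotics, and interactive scene understanding, but traditional approaches rely on manual modeling of part hierarchies and motion structures, which is slow and costly. The key to the solution is the SPARK framework: it first uses vision-language models (VLMs) to extract coarse URDF parameters and generate part-level reference images; it then feeds the part-image guidance and the inferred structure graph into a generative diffusion transformer to synthesize consistent part and complete shapes; finally it refines the joint types, axes, and origins in the URDF through differentiable forward kinematics and differentiable rendering, producing physically consistent, simulation-ready assets end to end.
Link: https://arxiv.org/abs/2512.01629
Authors: Yumeng He,Ying Jiang,Jiayin Lu,Yin Yang,Chenfanfu Jiang
Affiliations: USC (University of Southern California); UCLA (University of California, Los Angeles); Utah (University of Utah)
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: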
Abstract:Articulated 3D objects are critical for embodied AI, robotics, and interactive scene understanding, yet creating simulation-ready assets remains labor-intensive and requires expert modeling of part hierarchies and motion structures. We introduce SPARK, a framework for reconstructing physically consistent, kinematic part-level articulated objects from a single RGB image. Given an input image, we first leverage VLMs to extract coarse URDF parameters and generate part-level reference images. We then integrate the part-image guidance and the inferred structure graph into a generative diffusion transformer to synthesize consistent part and complete shapes of articulated objects. To further refine the URDF parameters, we incorporate differentiable forward kinematics and differentiable rendering to optimize joint types, axes, and origins under VLM-generated open-state supervision. Extensive experiments show that SPARK produces high-quality, simulation-ready articulated assets across diverse categories, enabling downstream applications such as robotic manipulation and interaction modeling.
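The joint parameters that the abstract says are refined (type, axis, origin) determine where each point of a moving part ends up. The sketch below shows only that forward-kinematics map for a single joint, using Rodrigues' rotation formula for a revolute joint and a translation for a prismatic one. It is plain Python without automatic differentiation, so it is not the differentiable pipeline of the paper; the function names are hypothetical.

```python
import math

def _unit(axis):
    """Normalize a 3D axis vector."""
    norm = math.sqrt(sum(a * a for a in axis))
    return [a / norm for a in axis]

def revolute(point, origin, axis, angle):
    """Rotate 'point' by 'angle' radians about the line through 'origin' along 'axis'."""
    k = _unit(axis)
    v = [p - o for p, o in zip(point, origin)]
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    k_cross_v = [k[1] * v[2] - k[2] * v[1],
                 k[2] * v[0] - k[0] * v[2],
                 k[0] * v[1] - k[1] * v[0]]
    k_dot_v = sum(a * b for a, b in zip(k, v))
    # Rodrigues' formula: v cos(a) + (k x v) sin(a) + k (k . v)(1 - cos(a))
    rotated = [v[i] * cos_a + k_cross_v[i] * sin_a + k[i] * k_dot_v * (1.0 - cos_a)
               for i in range(3)]
    return [r + o for r, o in zip(rotated, origin)]

def prismatic(point, axis, displacement):
    """Slide 'point' by 'displacement' along 'axis'."""
    k = _unit(axis)
    return [p + displacement * a for p, a in zip(point, k)]

# A door-like part hinged on a vertical axis through (1, 0, 0), opened by 180 degrees.
opened = revolute([2.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0], math.pi)
# A drawer-like part pulled 0.3 units along the x axis.
pulled = prismatic([0.0, 0.0, 0.0], [2.0, 0.0, 0.0], 0.3)
```

In an optimization setting, the same map written in an autodiff framework would let a rendering loss send gradients back to the axis and origin.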
[CV-51] Depth Matching Method Based on ShapeDTW for Oil-Based Mud Imager
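[Quick Read]: This paper addresses depth misalignment between the images of the upper and lower pad sets when logging with an Oil-Based Mud (OBM) microresistivity imager, a misalignment that persists even after velocity correction. The key to the solution is a depth matching method based on Shape Dynamic Time Warping (ShapeDTW): local shape features are extracted to build a morphologically sensitive distance matrix, which better preserves structural similarity between sequences during alignment. In practice, the shape descriptor combines the one-dimensional Histogram of Oriented Gradients (HOG1D) with the original signal, which copes with complex textures, depth shifts, and local scaling. The framework is also flexible to extend, allowing other descriptors tailored to specific geological features to be integrated.
Link: https://arxiv.org/abs/2512.01611
Authors: Fengfeng Li,Zhou Feng,Hongliang Wu,Hao Zhang,Han Tian,Peng Liu,Lixin Yuan
Affiliations: PetroChina Research Institute of Petroleum Exploration and Development
Categories: Computer Vision and Pattern Recognition (cs.CV); Geophysics (physics.geo-ph)
Comments: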
Abstract:In well logging operations using the oil-based mud (OBM) microresistivity imager, which employs an interleaved design with upper and lower pad sets, depth misalignment issues persist between the pad images even after velocity correction. This paper presents a depth matching method for borehole images based on the Shape Dynamic Time Warping (ShapeDTW) algorithm. The method extracts local shape features to construct a morphologically sensitive distance matrix, better preserving structural similarity between sequences during alignment. We implement this by employing a combined feature set of the one-dimensional Histogram of Oriented Gradients (HOG1D) and the original signal as the shape descriptor. Field test examples demonstrate that our method achieves precise alignment for images with complex textures, depth shifts, or local scaling. Furthermore, it provides a flexible framework for feature extension, allowing the integration of other descriptors tailored to specific geological features.
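A rough sketch of the ShapeDTW idea described above may help: each sample is represented by a descriptor of its local neighborhood, and ordinary DTW is then run on descriptor distances instead of raw sample differences. The descriptor here is a simplification chosen for brevity (the raw window plus a two-bin rising/falling gradient summary standing in for HOG1D); the window size and helper names are assumptions, not the authors' settings.

```python
def descriptors(signal, half_width=2):
    """Per-sample shape descriptor: raw window + simplified two-bin gradient summary."""
    n = len(signal)
    out = []
    for i in range(n):
        # Repeat edge samples so every window has the same length.
        window = [signal[min(max(j, 0), n - 1)]
                  for j in range(i - half_width, i + half_width + 1)]
        grads = [b - a for a, b in zip(window, window[1:])]
        rising = sum(g for g in grads if g > 0)
        falling = sum(-g for g in grads if g < 0)
        out.append(window + [rising, falling])
    return out

def shape_dtw(a, b, half_width=2):
    """Align two 1D signals with DTW over descriptor distances; returns (cost, path)."""
    da, db = descriptors(a, half_width), descriptors(b, half_width)
    n, m = len(da), len(db)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = sum((x - y) ** 2 for x, y in zip(da[i - 1], db[j - 1])) ** 0.5
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path

upper_pad = [0, 0, 0, 1, 3, 1, 0, 0, 0, 0]   # toy trace with a feature at index 4
lower_pad = [0, 0, 0, 0, 0, 1, 3, 1, 0, 0]   # same feature, shifted to index 6
total_cost, path = shape_dtw(upper_pad, lower_pad)
```

Each `(i, j)` pair in `path` says that sample `i` of the first trace corresponds to sample `j` of the second, so the depth shift at a feature can be read off as `j - i`.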
[CV-52] Toward Content-based Indexing and Retrieval of Head and Neck CT with Abscess Segmentation
[Quick Read]: This paper addresses the difficulty of accurately segmenting head and neck abscesses in medical images, in support of diagnosis, treatment planning, and surgical intervention. The key to the solution is a large, high-quality annotated dataset, AbscessHeNe, comprising 4,926 contrast-enhanced CT slices; every case is clinically confirmed and comes with pixel-level annotations and clinical metadata. The dataset provides a solid basis for training and evaluating semantic segmentation models and supports the future development of content-based multimedia indexing and case retrieval systems, advancing intelligent decision support in the management of head and neck infections.
Link: https://arxiv.org/abs/2512.01589
Authors: Thao Thi Phuong Dao,Tan-Cong Nguyen,Trong-Le Do,Truong Hoang Viet,Nguyen Chi Thanh,Huynh Nguyen Thuan,Do Vo Cong Nguyen,Minh-Khoi Pham,Mai-Khiem Tran,Viet-Tham Huynh,Trong-Thuan Nguyen,Trung-Nghia Le,Vo Thanh Toan,Tam V. Nguyen,Minh-Triet Tran,Thanh Dinh Le
Affiliations: University of Science, VNU-HCM, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam; Thong Nhat Hospital, Ho Chi Minh City, Vietnam; University of Social Sciences and Humanities, VNU-HCM, Ho Chi Minh City, Vietnam; Dublin City University, Dublin, Ireland; John von Neumann Institute, Ho Chi Minh City, Vietnam; University of Health Sciences, VNU-HCM, Ho Chi Minh City, Vietnam; University of Dayton, Dayton, Ohio, United States
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The 2025 IEEE International Conference on Content-Based Multimedia Indexing (IEEE CBMI)
Abstract:Abscesses in the head and neck represent an acute infectious process that can potentially lead to sepsis or mortality if not diagnosed and managed promptly. Accurate detection and delineation of these lesions on imaging are essential for diagnosis, treatment planning, and surgical intervention. In this study, we introduce AbscessHeNe, a curated and comprehensively annotated dataset comprising 4,926 contrast-enhanced CT slices with clinically confirmed head and neck abscesses. The dataset is designed to facilitate the development of robust semantic segmentation models that can accurately delineate abscess boundaries and evaluate deep neck space involvement, thereby supporting informed clinical decision-making. To establish performance baselines, we evaluate several state-of-the-art segmentation architectures, including CNN, Transformer, and Mamba-based models. The highest-performing model achieved a Dice Similarity Coefficient of 0.39, Intersection-over-Union of 0.27, and Normalized Surface Distance of 0.67, indicating the challenges of this task and the need for further research. Beyond segmentation, AbscessHeNe is structured for future applications in content-based multimedia indexing and case-based retrieval. Each CT scan is linked with pixel-level annotations and clinical metadata, providing a foundation for building intelligent retrieval systems and supporting knowledge-driven clinical workflows. The dataset will be made publicly available at this https URL.
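For readers less familiar with the overlap metrics quoted in the abstract, the following sketch computes the Dice Similarity Coefficient and Intersection-over-Union for one pair of binary masks using their standard definitions. How the paper aggregates scores across slices or cases is not stated here, so the helper is only a per-mask illustration; returning 1.0 for two empty masks is a convention chosen for this sketch.

```python
def dice_and_iou(pred, target):
    """Dice and IoU for two same-shaped binary masks given as 2D lists of 0/1."""
    inter = sum(p & t for p_row, t_row in zip(pred, target)
                for p, t in zip(p_row, t_row))
    pred_area = sum(map(sum, pred))
    target_area = sum(map(sum, target))
    union = pred_area + target_area - inter
    if union == 0:
        return 1.0, 1.0  # both masks empty: treated as perfect agreement (assumed convention)
    dice = 2.0 * inter / (pred_area + target_area)
    iou = inter / union
    return dice, iou

pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]
dice, iou = dice_and_iou(pred_mask, true_mask)
```

In this toy case the masks share 2 pixels out of 3 each, giving a Dice of 2/3 and an IoU of 1/2; Dice is never lower than IoU for the same pair of masks.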
[CV-53] RoleMotion: A Large-Scale Dataset towards Robust Scene-Specific Role-Playing Motion Synthesis with Fine-grained Descriptions
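[Quick Read]: This paper addresses shortcomings of existing text-driven human motion generation datasets: weak links between scenes and role functions, inconsistent motion data quality, and text descriptions that lack fine-grained annotation. The key to the solution is RoleMotion, a large-scale human motion dataset that systematically covers 25 classic scenes and 110 functional roles. With more than 500 finely annotated behaviors, 10,296 high-quality whole-body motion sequences (including hands), and 27,831 fine-grained text descriptions, it gives a structured account of role behavior in complex social scenes. The authors also build an evaluator that is stronger than existing ones, use it to validate the dataset on text-to-motion generation, and explore the interplay between body and hand motion generation, improving the quality and functionality of text-driven whole-body motion generation.
Link: https://arxiv.org/abs/2512.01582
Authors: Junran Peng,Yiheng Huang,Silei Shen,Zeji Wei,Jingwei Yang,Baojie Wang,Yonghao He,Chuanchen Luo,Man Zhang,Xucheng Yin,Wei Sui
Affiliations: University of Science and Technology Beijing; Beijing University of Posts and Telecommunications; Shunde Innovation School, University of Science and Technology Beijing; D-Robotics; Linketic; Shandong University; China University of Mining And Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: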
Abstract:In this paper, we introduce RoleMotion, a large-scale human motion dataset that encompasses a wealth of role-playing and functional motion data tailored to various specific scenes. Existing text-driven motion datasets are mostly assembled in a decentralized way, as amalgamations of assorted subsets whose data are nonfunctional and too isolated to jointly cover social activities across scenes. In addition, the quality of their motion data is inconsistent, and their textual annotations lack fine-grained detail. In contrast, RoleMotion is meticulously designed and collected with a particular focus on scenes and roles. The dataset features 25 classic scenes, 110 functional roles, over 500 behaviors, and 10,296 high-quality human motion sequences of body and hands, annotated with 27,831 fine-grained text descriptions. We build an evaluator stronger than existing counterparts, prove its reliability, and evaluate various text-to-motion methods on our dataset. Finally, we explore the interplay between body and hand motion generation. Experimental results demonstrate the high quality and functionality of our dataset for text-driven whole-body generation.
[CV-54] MasHeNe: A Benchmark for Head and Neck CT Mass Segmentation using Window-Enhanced Mamba with Frequency-Domain Integration
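[Quick Read]: This paper addresses medical image segmentation of head and neck masses (space-occupying lesions), in particular the limitation that existing public datasets concentrate on malignant tumors and overlook benign lesions such as cysts. The key to the solution is MasHeNe, an annotated dataset of 3,779 contrast-enhanced CT slices covering both tumors and cysts, together with the Windowing-Enhanced Mamba with Frequency integration (WEMF) model. WEMF applies tri-window enhancement to enrich the appearance of the input image, and introduces multi-frequency attention into a U-shaped Mamba backbone to fuse information across skip connections. It achieves the best segmentation results among the evaluated methods on Dice, IoU, NSD, and HD95.
Link: https://arxiv.org/abs/2512.01563
Authors: Thao Thi Phuong Dao,Tan-Cong Nguyen,Nguyen Chi Thanh,Truong Hoang Viet,Trong-Le Do,Mai-Khiem Tran,Minh-Khoi Pham,Trung-Nghia Le,Minh-Triet Tran,Thanh Dinh Le
Affiliations: Ho Chi Minh City University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: The 14th International Symposium on Information and Communication Technology Conference SoICT 2025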
Abstract:Head and neck masses are space-occupying lesions that can compress the airway and esophagus and may affect nerves and blood vessels. Available public datasets primarily focus on malignant lesions and often overlook other space-occupying conditions in this region. To address this gap, we introduce MasHeNe, an initial dataset of 3,779 contrast-enhanced CT slices that includes both tumors and cysts with pixel-level annotations. We also establish a benchmark using standard segmentation baselines and report common metrics to enable fair comparison. In addition, we propose the Windowing-Enhanced Mamba with Frequency integration (WEMF) model. WEMF applies tri-window enhancement to enrich the input appearance before feature extraction. It further uses multi-frequency attention to fuse information across skip connections within a U-shaped Mamba backbone. On MasHeNe, WEMF attains the best performance among evaluated methods, with a Dice of 70.45%, IoU of 66.89%, NSD of 72.33%, and HD95 of 5.12 mm. This model indicates stable and strong results on this challenging task. MasHeNe provides a benchmark for head-and-neck mass segmentation beyond malignancy-only datasets. The observed error patterns also suggest that this task remains challenging and requires further research. Our dataset and code are available at this https URL.
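The tri-window enhancement step can be pictured as standard CT intensity windowing applied three times, with the results stacked as input channels. The sketch below follows that reading; the three (center, width) pairs are placeholder values picked for illustration and are not taken from the paper, and the function names are hypothetical.

```python
def apply_window(hu, center, width):
    """Clip a Hounsfield-unit value to a window and rescale it to [0, 1]."""
    low, high = center - width / 2.0, center + width / 2.0
    return (min(max(hu, low), high) - low) / (high - low)

# Hypothetical (center, width) pairs in HU; the paper's actual windows are not given here.
ASSUMED_WINDOWS = [(40, 400), (400, 1800), (0, 2000)]

def tri_window(slice_hu, windows=ASSUMED_WINDOWS):
    """Turn one CT slice (2D list of HU values) into one channel per window."""
    return [[[apply_window(v, center, width) for v in row] for row in slice_hu]
            for center, width in windows]

ct_slice = [[-1000.0, 40.0],
            [240.0, 3000.0]]
channels = tri_window(ct_slice)
```

Each channel stretches a different intensity range to full contrast, so structures that look nearly identical in one window can be separated in another before feature extraction.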
[CV-55] NavForesee: A Unified Vision-Language World Model for Hierarchical Planning and Dual-Horizon Navigation Prediction
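[Quick Read]: This paper addresses embodied navigation for long-horizon tasks guided by complex natural language instructions, and in particular the high failure rates caused by existing agents' weak long-term planning in unseen environments. The key to the solution is NavForesee, a unified Vision-Language Model (VLM) that fuses high-level language planning with predictive world-model imagination in a single framework. The model decomposes the task, tracks progress, and generates sub-goals for explicit planning, and it also acts as a generative world model that predicts short-term environmental dynamics and long-term navigation milestones. Together these form a closed feedback loop of perception, planning/prediction, and action that improves navigation performance in complex scenarios.
Link: https://arxiv.org/abs/2512.01550
Authors: Fei Liu,Shichao Xie,Minghua Luo,Zedong Chu,Junjun Hu,Xiaolong Wu,Mu Xu
Affiliations: Amap (Alibaba Group)
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: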
Abstract:Embodied navigation for long-horizon tasks, guided by complex natural language instructions, remains a formidable challenge in artificial intelligence. Existing agents often struggle with robust long-term planning in unseen environments, leading to high failure rates. To address these limitations, we introduce NavForesee, a novel Vision-Language Model (VLM) that unifies high-level language planning and predictive world model imagination within a single framework. Our approach empowers a single VLM to concurrently perform planning and predictive foresight. Conditioned on the full instruction and historical observations, the model is trained to understand the navigation instructions by decomposing the task, tracking its progress, and formulating the subsequent sub-goal. Simultaneously, it functions as a generative world model, providing crucial foresight by predicting short-term environmental dynamics and long-term navigation milestones. The VLM’s structured plan guides its targeted prediction, while the imagined future provides rich context to inform the navigation actions, creating a powerful internal feedback loop of perception-planning/prediction-action. We demonstrate through extensive experiments on the R2R-CE and RxR-CE benchmarks that NavForesee achieves highly competitive performance in complex scenarios. Our work highlights the immense potential of fusing explicit language planning with implicit spatiotemporal prediction, paving the way for more intelligent and capable embodied agents.
[CV-56] FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention
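[Quick Read]: This paper addresses the poor scalability of existing feed-forward methods for multi-view 3D reconstruction (such as VGGT), whose self-attention has quadratic computational complexity and becomes inefficient on long image sequences. The key to the solution is FlashVGGT, which introduces a descriptor-based attention mechanism: the spatial information of each frame is compressed into a compact set of descriptor tokens, and global relationships are modeled by cross-attention between all image tokens and this small descriptor set, which greatly reduces computational cost. A chunk-recursive mechanism additionally enables online inference by reusing descriptors cached from previous chunks, allowing the model to handle sequences of more than 3,000 images efficiently.
Link: https://arxiv.org/abs/2512.01540
Authors: Zipeng Wang,Dan Xu
Affiliations: The Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: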
Abstract:3D reconstruction from multi-view images is a core challenge in computer vision. Recently, feed-forward methods have emerged as efficient and robust alternatives to traditional per-scene optimization techniques. Among them, state-of-the-art models like the Visual Geometry Grounding Transformer (VGGT) leverage full self-attention over all image tokens to capture global relationships. However, this approach suffers from poor scalability due to the quadratic complexity of self-attention and the large number of tokens generated in long image sequences. In this work, we introduce FlashVGGT, an efficient alternative that addresses this bottleneck through a descriptor-based attention mechanism. Instead of applying dense global attention across all tokens, FlashVGGT compresses spatial information from each frame into a compact set of descriptor tokens. Global attention is then computed as cross-attention between the full set of image tokens and this smaller descriptor set, significantly reducing computational overhead. Moreover, the compactness of the descriptors enables online inference over long sequences via a chunk-recursive mechanism that reuses cached descriptors from previous chunks. Experimental results show that FlashVGGT achieves reconstruction accuracy competitive with VGGT while reducing inference time to just 9.3% of VGGT for 1,000 images, and scaling efficiently to sequences exceeding 3,000 images. Our project page is available at this https URL.
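The cost argument behind descriptor attention can be sketched in a few lines. The code below assumes average pooling as the compressor and plain scaled dot-product cross-attention; the paper's actual compression module is not specified in the abstract, and the token counts in the example are made up. `attention_pairs` simply counts query-key pairs for full self-attention versus attention against the descriptor set.

```python
import math

def pool_descriptors(frame_tokens, k):
    """Average-pool one frame's tokens into at most k descriptor tokens (assumed compressor)."""
    size = math.ceil(len(frame_tokens) / k)
    groups = [frame_tokens[i:i + size] for i in range(0, len(frame_tokens), size)]
    return [[sum(col) / len(group) for col in zip(*group)] for group in groups]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention from every query to a (small) key/value set."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, key)) / scale for key in keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        outputs.append([sum(w * v[c] for w, v in zip(weights, values)) / total
                        for c in range(len(values[0]))])
    return outputs

def attention_pairs(frames, tokens_per_frame, descriptors_per_frame):
    """Query-key pair counts: full self-attention vs. tokens-to-descriptors attention."""
    n_tokens = frames * tokens_per_frame
    return n_tokens * n_tokens, n_tokens * frames * descriptors_per_frame

frame = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0], [7.0, 7.0]]
desc = pool_descriptors(frame, k=2)
attended = cross_attention(frame, desc, desc)
full, compressed = attention_pairs(frames=1000, tokens_per_frame=256, descriptors_per_frame=16)
```

With these made-up sizes the pair count drops by a factor equal to tokens per frame divided by descriptors per frame. The descriptor set is also small enough to cache, which is what makes reuse across chunks plausible for long sequences.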
[CV-57] Deep Unsupervised Anomaly Detection in Brain Imaging: Large-Scale Benchmarking and Bias Analysis
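[Quick Read]: This paper addresses the bottleneck in clinical translation of unsupervised anomaly detection for brain magnetic resonance imaging (MRI): fragmented evaluations, heterogeneous datasets, and inconsistent metrics make it hard for existing methods to be applied reliably. The key to the solution is a large-scale, multi-center benchmark built on scans of healthy individuals from six scanners (5,948 T1- and T2-weighted images in total) and validated systematically on diverse clinical cohorts. The benchmark quantifies the wide spread of current algorithms, with Dice scores ranging from 0.03 to 0.65, and shows that reconstruction-based methods (especially diffusion-inspired ones) segment lesions best while feature-based methods are more robust under distribution shift. It also finds that most models exhibit scanner-related bias and false positives that vary with age and sex, indicating that the current limitation lies in algorithm design rather than data volume. The benchmark provides a transparent evaluation framework for future work and identifies priorities: image-native pretraining, principled deviation measures, fairness-aware modeling, and robust domain adaptation.
Link: https://arxiv.org/abs/2512.01534
Authors: Alexander Frotscher,Christian F. Baumgartner,Thomas Wolfers
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: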
Abstract:Deep unsupervised anomaly detection in brain magnetic resonance imaging offers a promising route to identify pathological deviations without requiring lesion-specific annotations. Yet, fragmented evaluations, heterogeneous datasets, and inconsistent metrics have hindered progress toward clinical translation. Here, we present a large-scale, multi-center benchmark of deep unsupervised anomaly detection for brain imaging. The training cohort comprised 2,976 T1 and 2,972 T2-weighted scans from healthy individuals across six scanners, with ages ranging from 6 to 89 years. Validation used 92 scans to tune hyperparameters and estimate unbiased thresholds. Testing encompassed 2,221 T1w and 1,262 T2w scans spanning healthy datasets and diverse clinical cohorts. Across all algorithms, the Dice-based segmentation performance varied between 0.03 and 0.65, indicating substantial variability. To assess robustness, we systematically evaluated the impact of different scanners, lesion types and sizes, as well as demographics (age, sex). Reconstruction-based methods, particularly diffusion-inspired approaches, achieved the strongest lesion segmentation performance, while feature-based methods showed greater robustness under distributional shifts. However, systematic biases, such as scanner-related effects, were observed for the majority of algorithms, including that small and low-contrast lesions were missed more often, and that false positives varied with age and sex. Increasing healthy training data yields only modest gains, underscoring that current unsupervised anomaly detection frameworks are limited algorithmically rather than by data availability. Our benchmark establishes a transparent foundation for future research and highlights priorities for clinical translation, including image native pretraining, principled deviation measures, fairness-aware modeling, and robust domain adaptation.
[CV-58] Diffusion Fuzzy System: Fuzzy Rule Guided Latent Multi-Path Diffusion Modeling
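Quick read: This paper addresses the difficulty diffusion models have with image collections whose features differ markedly: they fail to capture complex features and produce conflicting results. Existing multi-path diffusion methods try to learn different image regions along multiple diffusion paths, but coordinate those paths inefficiently and are computationally expensive. The key to the solution is the Diffusion Fuzzy System (DFS), a latent-space multi-path diffusion model guided by fuzzy rules. First, each diffusion path is dedicated to one specific class of image features, which improves modeling of heterogeneous features. Second, rule-chain-based reasoning dynamically steers the diffusion process so that the paths cooperate efficiently. Finally, a latent-space compression mechanism based on fuzzy membership substantially reduces computational cost while preserving performance.
Link: https://arxiv.org/abs/2512.01533
Authors: Hailong Yang,Te Zhang,Kup-sze Choi,Zhaohong Deng
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: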
Abstract:Diffusion models have emerged as a leading technique for generating images due to their ability to create high-resolution and realistic images. Despite their strong performance, diffusion models still struggle in managing image collections with significant feature differences. They often fail to capture complex features and produce conflicting results. Research has attempted to address this issue by learning different regions of an image through multiple diffusion paths and then combining them. However, this approach leads to inefficient coordination among multiple paths and high computational costs. To tackle these issues, this paper presents a Diffusion Fuzzy System (DFS), a latent-space multi-path diffusion model guided by fuzzy rules. DFS offers several advantages. First, unlike traditional multi-path diffusion methods, DFS uses multiple diffusion paths, each dedicated to learning a specific class of image features. By assigning each path to a different feature type, DFS overcomes the limitations of multi-path models in capturing heterogeneous image features. Second, DFS employs rule-chain-based reasoning to dynamically steer the diffusion process and enable efficient coordination among multiple paths. Finally, DFS introduces a fuzzy membership-based latent-space compression mechanism to reduce the computational costs of multi-path diffusion effectively. We tested our method on three public datasets: LSUN Bedroom, LSUN Church, and MS COCO. The results show that DFS achieves more stable training and faster convergence than existing single-path and multi-path diffusion models. Additionally, DFS surpasses baseline models in both image quality and alignment between text and images, and also shows improved accuracy when comparing generated images to target references.
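To make the idea of fuzzy-rule-weighted paths concrete, the sketch below blends the outputs of several paths using normalized Gaussian membership values. This is generic fuzzy weighting in the Takagi-Sugeno style, used here as an assumption; the abstract does not specify DFS's rule-chain reasoning or its latent compression, and the names `gaussian_membership` and `fuse_paths` are hypothetical.

```python
import math


def gaussian_membership(x, center, width):
    """Degree (0..1) to which scalar feature x belongs to a fuzzy set."""
    return math.exp(-((x - center) ** 2) / (2.0 * width ** 2))


def fuse_paths(feature, rules, path_outputs):
    """Blend per-path outputs with normalized rule firing strengths.

    rules: one (center, width) pair per path (hypothetical parameters).
    path_outputs: one output vector per path, all of equal length.
    """
    strengths = [gaussian_membership(feature, c, w) for c, w in rules]
    total = sum(strengths)
    weights = [s / total for s in strengths]
    dim = len(path_outputs[0])
    fused = [sum(w * out[i] for w, out in zip(weights, path_outputs))
             for i in range(dim)]
    return weights, fused


rules = [(0.0, 1.0), (2.0, 1.0)]          # two paths, two fuzzy sets
outputs = [[1.0, 0.0], [0.0, 1.0]]        # toy per-path predictions
weights, fused = fuse_paths(1.0, rules, outputs)
print(weights, fused)
```

A feature halfway between the two rule centers fires both rules equally, so the two paths contribute equally to the fused output.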
[CV-59] QuantumCanvas: A Multimodal Benchmark for Visual Learning of Atomic Interactions
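Quick read: This paper addresses the lack of physical transferability in current machine learning models for molecules and materials: existing models typically fit global correlations across whole molecules or crystals rather than learning the quantum interactions between atomic pairs, even though these two-body interactions underlie key quantum phenomena such as chemical bonding, charge redistribution, orbital hybridization, and electronic coupling. The key to the solution is QuantumCanvas, a large-scale multimodal benchmark dataset that treats two-body quantum systems as the basic units of matter. It annotates 2,850 element-element pairs with 18 electronic, thermodynamic, and geometric properties and provides ten-channel image representations (derived from l- and m-resolved orbital densities, angular field transforms, co-occupancy maps, and charge-density projections). These images encode spatial, angular, and electrostatic symmetries without explicit coordinates, yielding an interpretable visual modality for quantum learning. Benchmarks across several architectures show that the approach improves prediction of key quantum properties such as the energy gap, HOMO/LUMO levels, and total energy, and that pretraining on it improves convergence stability and generalization on larger datasets such as QM9, MD17, and CrysMTM.
Link: https://arxiv.org/abs/2512.01519
Authors: Can Polat,Erchin Serpedin,Mustafa Kurban,Hasan Kurban
Affiliations: Texas A&M University; Ankara University; Texas A&M University at Qatar; Hamad Bin Khalifa University
Categories: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci); Quantum Physics (quant-ph)
Comments: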
Abstract:Despite rapid advances in molecular and materials machine learning, most models still lack physical transferability: they fit correlations across whole molecules or crystals rather than learning the quantum interactions between atomic pairs. Yet bonding, charge redistribution, orbital hybridization, and electronic coupling all emerge from these two-body interactions that define local quantum fields in many-body systems. We introduce QuantumCanvas, a large-scale multimodal benchmark that treats two-body quantum systems as foundational units of matter. The dataset spans 2,850 element-element pairs, each annotated with 18 electronic, thermodynamic, and geometric properties and paired with ten-channel image representations derived from l- and m-resolved orbital densities, angular field transforms, co-occupancy maps, and charge-density projections. These physically grounded images encode spatial, angular, and electrostatic symmetries without explicit coordinates, providing an interpretable visual modality for quantum learning. Benchmarking eight architectures across 18 targets, we report mean absolute errors of 0.201 eV on energy gap using GATv2, 0.265 eV on HOMO and 0.274 eV on LUMO using EGNN. For energy-related quantities, DimeNet attains 2.27 eV total-energy MAE and 0.132 eV repulsive-energy MAE, while a multimodal fusion model achieves a 2.15 eV Mermin free-energy MAE. Pretraining on QuantumCanvas further improves convergence stability and generalization when fine-tuned on larger datasets such as QM9, MD17, and CrysMTM. By unifying orbital physics with vision-based representation learning, QuantumCanvas provides a principled and interpretable basis for learning transferable quantum interactions through coupled visual and numerical modalities. Dataset and model implementations are available at this https URL.
[CV-60] Semantic-aware Random Convolution and Source Matching for Domain Generalization in Medical Image Segmentation
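Quick read: This paper addresses single-source domain generalization (DG) in medical image segmentation: a model is trained on data from a single source domain (e.g., CT) and applied directly to a different domain (e.g., MR) without using target-domain images or annotations for adaptation. The key to the solution is a method called SRCSM. During training, semantic-aware random convolution augments different regions of each source image differently, making the model more robust to semantic structure. At test time, intensity mapping of target-domain images brings their distribution closer to the source data, reducing the domain gap. The method clearly outperforms existing DG techniques on a range of cross-modality and cross-center abdominal, cardiac, and prostate segmentation tasks, and also generalizes well in the more challenging setting of cine cardiac MR phases.
Link: https://arxiv.org/abs/2512.01510
Authors: Franz Thaler,Martin Urschler,Mateusz Kozinski,Matthias AF Gsell,Gernot Plank,Darko Stern
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Preprint submitted to Computer Methods and Programs in Biomedicine (currently under revision)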
Abstract:We tackle the challenging problem of single-source domain generalization (DG) for medical image segmentation. To this end, we aim to train a network on one domain (e.g., CT) and directly apply it to a different domain (e.g., MR) without adapting the model and without requiring images or annotations from the new domain during training. We propose a novel method for promoting DG when training deep segmentation networks, which we call SRCSM. During training, our method diversifies the source domain through semantic-aware random convolution, where different regions of a source image are augmented differently, based on their annotation labels. At test-time, we complement the randomization of the training domain via mapping the intensity of target domain images, making them similar to source domain data. We perform a comprehensive evaluation on a variety of cross-modality and cross-center generalization settings for abdominal, whole-heart and prostate segmentation, where we outperform previous DG techniques in the vast majority of experiments. We also investigate our method when training on whole-heart CT or MR data and testing on the diastolic and systolic phase of cine MR data captured with different scanner hardware, where we make a step towards closing the domain gap in this even more challenging setting. Overall, our evaluation shows that SRCSM can be considered a new state-of-the-art in DG for medical image segmentation and, moreover, even achieves a segmentation performance that matches the performance of the in-domain baseline in several settings.
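The training-time augmentation described above can be sketched roughly as follows: draw one random convolution kernel per annotation label, filter the image with each kernel, and assemble the output pixel by pixel according to the label map. The kernel size, the uniform sampling range, and the zero padding are assumptions chosen for illustration; the abstract does not give SRCSM's actual settings, and the function names are hypothetical.

```python
import random


def random_kernel(rng, size=3):
    """Sample a size x size kernel with weights in [-1, 1] (assumed range)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(size)] for _ in range(size)]


def convolve(image, kernel):
    """Same-size 2D cross-correlation with zero padding."""
    h, w = len(image), len(image[0])
    r = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy + r][dx + r] * image[yy][xx]
            out[y][x] = acc
    return out


def semantic_random_conv(image, labels, seed=0):
    """Filter each labelled region with its own randomly drawn kernel."""
    rng = random.Random(seed)
    classes = sorted({c for row in labels for c in row})
    filtered = {c: convolve(image, random_kernel(rng)) for c in classes}
    return [[filtered[labels[y][x]][y][x] for x in range(len(image[0]))]
            for y in range(len(image))]


image = [[1.0, 2.0], [3.0, 4.0]]
labels = [[0, 0], [1, 1]]   # top row is class 0, bottom row is class 1
augmented = semantic_random_conv(image, labels, seed=1)
print(augmented)
```

Because each class gets its own kernel, the appearance of one structure changes independently of its surroundings, which is the property the abstract attributes to the semantic-aware variant.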
[CV-61] ELVIS: Enhance Low-Light for Video Instance Segmentation in the Dark
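Quick read: This paper addresses the performance bottleneck of low-light video instance segmentation (VIS) caused by adverse imaging conditions such as noise, blur, and low contrast, and in particular the poor robustness of existing VIS methods in low light, which perform poorly even after fine-tuning. The key to the solution is the ELVIS framework, whose core components are an unsupervised synthetic low-light video pipeline that models both spatial and temporal degradations, a calibration-free degradation profile synthesis network (VDP-Net), and an enhancement decoder head that disentangles degradation from content features, together enabling effective domain adaptation of state-of-the-art VIS models to low-light scenes.
Link: https://arxiv.org/abs/2512.01495
Authors: Joanne Lin,Ruirui Lin,Yini Li,David Bull,Nantheera Anantrasirichai
Affiliations: University of Bristol
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: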
Abstract:Video instance segmentation (VIS) for low-light content remains highly challenging for humans and machines alike, due to adverse imaging conditions including noise, blur and low contrast. The lack of large-scale annotated datasets and the limitations of current synthetic pipelines, particularly in modeling temporal degradations, further hinder progress. Moreover, existing VIS methods are not robust to the degradations found in low-light videos and, as a result, perform poorly even when finetuned on low-light data. In this paper, we introduce ELVIS (Enhance Low-light for Video Instance Segmentation), a novel framework that enables effective domain adaptation of state-of-the-art VIS models to low-light scenarios. ELVIS comprises an unsupervised synthetic low-light video pipeline that models both spatial and temporal degradations, a calibration-free degradation profile synthesis network (VDP-Net) and an enhancement decoder head that disentangles degradations from content features. ELVIS improves performance by up to +3.7 AP on the synthetic low-light YouTube-VIS 2019 dataset. Code will be released upon acceptance.
[CV-62] A variational method for curve extraction with curvature-dependent energies
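Quick read: This paper addresses the automatic extraction of curves and one-dimensional (1D) structures from images, mostly without manually annotated labels. The core of the solution is a variational method built on an energy functional: discretizing the energy and applying Smirnov's decomposition theorem for vector fields yields a bi-level optimization framework for curve extraction. To handle curvature-dependent energies, the authors lift curves into the position-orientation space and equip it with an appropriate sub-Riemannian or Finsler metric to model the geometric constraints, which captures orientation-sensitive structures.
Link: https://arxiv.org/abs/2512.01494
Authors: Majid Arthaud(ENPC, MOKAPLAN, UMich),Antonin Chambolle(CEREMADE, MOKAPLAN),Vincent Duval(MOKAPLAN)
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: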
Abstract:We introduce a variational approach for extracting curves between a list of possible endpoints, based on the discretization of an energy and Smirnov’s decomposition theorem for vector fields. It is used to design a mostly unsupervised bi-level minimization approach to automatically extract curves and 1D structures from an image. We then extend the method to curvature-dependent energies, using a now-classical lifting of the curves into the space of positions and orientations equipped with an appropriate sub-Riemannian or Finslerian metric.
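For readers unfamiliar with curvature-dependent energies, the sketch below evaluates one classical example on a polyline: an Euler-elastica-type energy, length plus a weighted integral of squared curvature. This particular energy, its discretization (turning angle divided by local segment length), and the weight `alpha` are assumptions for illustration; the abstract does not state which energies the paper uses, and the paper's lifting into position-orientation space is not reproduced here.

```python
import math


def elastica_energy(points, alpha=1.0):
    """Discrete length + alpha * integral of squared curvature for a 2D polyline."""
    length = 0.0
    for i in range(len(points) - 1):
        length += math.dist(points[i], points[i + 1])

    bending = 0.0
    for i in range(1, len(points) - 1):
        # Incoming and outgoing edge vectors at vertex i.
        ax, ay = points[i][0] - points[i - 1][0], points[i][1] - points[i - 1][1]
        bx, by = points[i + 1][0] - points[i][0], points[i + 1][1] - points[i][1]
        # Signed turning angle between the two edges.
        theta = math.atan2(ax * by - ay * bx, ax * bx + ay * by)
        # Arc length attributed to this vertex.
        local = 0.5 * (math.hypot(ax, ay) + math.hypot(bx, by))
        # curvature ~ theta / local, so curvature^2 * local = theta^2 / local.
        bending += theta ** 2 / local
    return length + alpha * bending


straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
corner = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
print(elastica_energy(straight), elastica_energy(corner))
```

Both polylines have the same length, but the one with a right-angle turn pays an extra bending cost, so a minimizer of such an energy prefers smooth curves.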
[CV-63] ChronosObserver: Taming 4D World with Hyperspace Diffusion Sampling
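Quick read: This paper addresses the challenge that current camera-controlled video generation models face in producing 3D-consistent, high-fidelity, time-synchronized multi-view videos, a key capability for 4D world modeling. Existing approaches such as data augmentation or test-time optimization are limited by model generalization and scalability. The core of the solution is a World State Hyperspace that represents the spatiotemporal constraints of a 4D scene, combined with Hyperspace Guided Sampling, which synchronizes the diffusion sampling trajectories of multiple views. This produces high-quality, 3D-consistent multi-view videos without training or fine-tuning the diffusion model.
Link: https://arxiv.org/abs/2512.01481
Authors: Qisen Wang,Yifan Zhao,Peisen Shen,Jialu Li,Jia Li
Affiliations: Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: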
Abstract:Although prevailing camera-controlled video generation models can produce cinematic results, lifting them directly to the generation of 3D-consistent, high-fidelity, time-synchronized multi-view videos, a pivotal capability for taming 4D worlds, remains challenging. Some works resort to data augmentation or test-time optimization, but these strategies are constrained by limited model generalization and scalability issues. To this end, we propose ChronosObserver, a training-free method including World State Hyperspace to represent the spatiotemporal constraints of a 4D world scene, and Hyperspace Guided Sampling to synchronize the diffusion sampling trajectories of multiple views using the hyperspace. Experimental results demonstrate that our method achieves high-fidelity and 3D-consistent time-synchronized multi-view video generation without training or fine-tuning diffusion models.
[CV-64] CourtMotion: Learning Event-Driven Motion Representations from Skeletal Data for Basketball
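Quick read: This paper addresses the prediction and analysis of game events (passes, shots, steals, and so on) in professional basketball. Traditional methods rely only on player positions and miss key semantic cues such as body orientation, defensive stance, or shooting preparation. The key to the solution is CourtMotion, a two-stage spatiotemporal modeling framework: Graph Neural Networks (GNNs) first process skeletal tracking data to extract subtle motion patterns, then a Transformer with specialized attention mechanisms models player interactions. Event projection heads explicitly tie physical motion patterns to their tactical purpose, enabling accurate prediction and understanding of basketball events.
Link: https://arxiv.org/abs/2512.01478
Authors: Omer Sela(1 and 2),Michael Chertok(1),Lior Wolf(2) ((1) Amazon, (2) Tel Aviv University)
Affiliations: Amazon; Tel Aviv University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: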
Abstract:This paper presents CourtMotion, a spatiotemporal modeling framework for analyzing and predicting game events and plays as they develop in professional basketball. Anticipating basketball events requires understanding both physical motion patterns and their semantic significance in the context of the game. Traditional approaches that use only player positions fail to capture crucial indicators such as body orientation, defensive stance, or shooting preparation motions. Our two-stage approach first processes skeletal tracking data through Graph Neural Networks to capture nuanced motion patterns, then employs a Transformer architecture with specialized attention mechanisms to model player interactions. We introduce event projection heads that explicitly connect player movements to basketball events like passes, shots, and steals, training the model to associate physical motion patterns with their tactical purposes. Experiments on NBA tracking data demonstrate significant improvements over position-only baselines: 35% reduction in trajectory prediction error compared to state-of-the-art position-based models and consistent performance gains across key basketball analytics tasks. The resulting pretrained model serves as a powerful foundation for multiple downstream tasks, with pick detection, shot taker identification, assist prediction, shot location classification, and shot type recognition demonstrating substantial improvements over existing methods.
[CV-65] Stay Unique Stay Efficient: Preserving Model Personality in Multi-Task Merging
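Quick read: This paper addresses the substantial performance drop in model merging caused by failing to preserve task-specific information, a drop that appears even on similar tasks. The key to the solution is Decomposition, Thresholding, and Scaling (DTS), an approximation-based personalized merging framework. It first applies singular value decomposition (SVD) to the task-specific information and retains only a small core subset; it then introduces a novel thresholding strategy that partitions singular vector elements into groups and assigns each group a scaling factor; finally, a variant fuses task-specific information in a data-free manner based on semantic similarity, improving generalization to unseen tasks. The method outperforms state-of-the-art baselines while requiring only 1% additional storage per task.
Link: https://arxiv.org/abs/2512.01461
Authors: Kuangpu Guo,Yuhe Ding,Jian Liang,Zilei Wang,Ran He
Affiliations: University of Science and Technology of China; NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences; Anhui University; School of Artificial Intelligence, University of Chinese Academy of Sciences
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: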
Abstract:Model merging has emerged as a promising paradigm for enabling multi-task capabilities without additional training. However, existing methods often experience substantial performance degradation compared with individually fine-tuned models, even on similar tasks, underscoring the need to preserve task-specific information. This paper proposes Decomposition, Thresholding, and Scaling (DTS), an approximation-based personalized merging framework that preserves task-specific information with minimal storage overhead. DTS first applies singular value decomposition to the task-specific information and retains only a small subset of singular values and vectors. It then introduces a novel thresholding strategy that partitions singular vector elements into groups and assigns a scaling factor to each group. To enable generalization to unseen tasks, we further extend DTS with a variant that fuses task-specific information in a data-free manner based on the semantic similarity of task characteristics. Extensive experiments demonstrate that DTS consistently outperforms state-of-the-art baselines while requiring only 1% additional storage per task. Furthermore, experiments on unseen tasks show that the DTS variant achieves significantly better generalization performance. Our code is available at this https URL.
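One way to picture "partition elements into groups and assign a scaling factor to each group" is magnitude-based group quantization: each element is stored as a sign plus a group index, and reconstructed as the sign times its group's scale. The sketch below is only that reading of the abstract; the threshold values, the use of the group mean magnitude as the scale, and the name `group_quantize` are assumptions, and the SVD step is omitted.

```python
def group_quantize(vector, thresholds):
    """Approximate a vector by sign * per-group scale.

    Elements are grouped by how many magnitude thresholds they reach, and each
    group's scale is the mean magnitude of its members (an assumed choice).
    """
    bounds = sorted(thresholds)
    groups = [sum(abs(v) >= b for b in bounds) for v in vector]
    scales = []
    for g in range(len(bounds) + 1):
        members = [abs(v) for v, gi in zip(vector, groups) if gi == g]
        scales.append(sum(members) / len(members) if members else 0.0)
    approx = [(1 if v >= 0 else -1) * scales[g] for v, g in zip(vector, groups)]
    return groups, scales, approx


vec = [0.9, -0.1, 0.5, -0.7, 0.05]      # stand-in for one singular vector
groups, scales, approx = group_quantize(vec, thresholds=[0.3])
print(groups, scales, approx)
```

Storing only signs, group indices, and a handful of scales is far cheaper than storing full-precision vectors, which is consistent with the small storage overhead the abstract emphasizes.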
[CV-66] FastAnimate: Towards Learnable Template Construction and Pose Deformation for Fast 3D Human Avatar Animation
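Quick read: This paper addresses two key problems in 3D human avatar animation: template construction relies on complex skeletal rigging and tends to produce artifacts for specific poses, and target pose deformation suffers from structural distortions caused by Linear Blend Skinning (LBS), which hurts animation realism. The key to the solution is a unified learning-based framework with two phases. The first uses a U-Net architecture that decouples texture and pose information in a feed-forward pass, generating templates quickly and without artifacts. The second introduces a data-driven refinement technique that improves the structural integrity of the deformed result. The framework balances efficiency and quality and outperforms state-of-the-art (SOTA) methods.
Link: https://arxiv.org/abs/2512.01444
Authors: Jian Shu,Nanjie Yao,Gangjian Zhang,Junlong Ren,Yu Feng,Hao Wang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 4 figures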
Abstract:3D human avatar animation aims at transforming a human avatar from an arbitrary initial pose to a specified target pose using deformation algorithms. Existing approaches typically divide this task into two stages: canonical template construction and target pose deformation. However, current template construction methods demand extensive skeletal rigging and often produce artifacts for specific poses. Moreover, target pose deformation suffers from structural distortions caused by Linear Blend Skinning (LBS), which significantly undermines animation realism. To address these problems, we propose a unified learning-based framework that tackles both challenges in two phases. For the former phase, to overcome the inefficiencies and artifacts during template construction, we leverage a U-Net architecture that decouples texture and pose information in a feed-forward process, enabling fast generation of a human template. For the latter phase, we propose a data-driven refinement technique that enhances structural integrity. Extensive experiments show that our model delivers consistent performance across diverse poses with an optimal balance between efficiency and quality, surpassing state-of-the-art (SOTA) methods.
[CV-67] Language-Guided Open-World Anomaly Segmentation
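Quick read: This paper addresses the difficulty of semantically labeling and distinguishing unknown classes in open-world and anomaly segmentation, particularly in autonomous driving, where existing methods cannot assign semantic labels to unknown regions and struggle to learn representations for unknown classes. The key to the solution is Clipomaly, a zero-shot open-world and anomaly segmentation method based on CLIP (Contrastive Language-Image Pretraining). Its core innovation is to use CLIP's shared image-text embedding space to extend the vocabulary dynamically at inference time without retraining, so that unknown objects can be segmented and given human-interpretable names, while anomalies outside common class definitions (such as those in Cityscapes) are still detected robustly.
Link: https://arxiv.org/abs/2512.01427
Authors: Klara Reichard,Nikolas Brasch,Nassir Navab,Federico Tombari
Affiliations: Technical University of Munich; BMW Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: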
Abstract:Open-world and anomaly segmentation methods seek to enable autonomous driving systems to detect and segment both known and unknown objects in real-world scenes. However, existing methods do not assign semantically meaningful labels to unknown regions, and distinguishing and learning representations for unknown classes remains difficult. While open-vocabulary segmentation methods show promise in generalizing to novel classes, they require a fixed inference vocabulary and thus cannot be directly applied to anomaly segmentation where unknown classes are unconstrained. We propose Clipomaly, the first CLIP-based open-world and anomaly segmentation method for autonomous driving. Our zero-shot approach requires no anomaly-specific training data and leverages CLIP’s shared image-text embedding space to both segment unknown objects and assign human-interpretable names to them. Unlike open-vocabulary methods, our model dynamically extends its vocabulary at inference time without retraining, enabling robust detection and naming of anomalies beyond common class definitions such as those in Cityscapes. Clipomaly achieves state-of-the-art performance on established anomaly segmentation benchmarks while providing interpretability and flexibility essential for practical deployment.
[CV-68] ResDiT: Evoking the Intrinsic Resolution Scalability in Diffusion Transformers
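Quick read: This paper addresses the spatial layout collapse and degraded texture fidelity that commonly occur when pre-trained Diffusion Transformers (DiTs) are used for high-resolution (HR) image generation. Prior methods typically rely on complex multi-stage pipelines that first denoise at the base (training) resolution to guide HR generation. The paper proposes ResDiT, a training-free and efficient resolution-scaling method. It identifies position embeddings (PEs) as the key factor governing spatial layout: the original PEs carry incorrect positional information when extrapolated to high resolution, which triggers layout collapse. The authors therefore design a PE scaling technique that corrects positional encoding under resolution changes. To improve detail fidelity, they add a local-enhancement mechanism based on base-resolution local attention, together with a patch-level fusion module and a Gaussian-weighted splicing strategy that removes grid artifacts. Experiments show that ResDiT synthesizes high-fidelity, high-resolution images efficiently and integrates with downstream tasks such as spatially controlled generation.
Link: https://arxiv.org/abs/2512.01426
Authors: Yiyang Ma,Feng Zhou,Xuedan Yin,Pu Cao,Yonghao Dang,Jianqin Yin
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages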
Abstract:Leveraging pre-trained Diffusion Transformers (DiTs) for high-resolution (HR) image synthesis often leads to spatial layout collapse and degraded texture fidelity. Prior work mitigates these issues with complex pipelines that first perform a base-resolution (i.e., training-resolution) denoising process to guide HR generation. We instead explore the intrinsic generative mechanisms of DiTs and propose ResDiT, a training-free method that scales resolution efficiently. We identify the core factor governing spatial layout, position embeddings (PEs), and show that the original PEs encode incorrect positional information when extrapolated to HR, which triggers layout collapse. To address this, we introduce a PE scaling technique that rectifies positional encoding under resolution changes. To further remedy low-fidelity details, we develop a local-enhancement mechanism grounded in base-resolution local attention. We design a patch-level fusion module that aggregates global and local cues, together with a Gaussian-weighted splicing strategy that eliminates grid artifacts. Comprehensive evaluations demonstrate that ResDiT consistently delivers high-fidelity, high-resolution image synthesis and integrates seamlessly with downstream tasks, including spatially controlled generation.
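The abstract does not spell out ResDiT's PE scaling rule. As a stand-in that illustrates the general idea of keeping positions inside the range seen during training, the sketch below uses plain linear position interpolation with a 1D sinusoidal embedding. Both choices are assumptions, not the paper's method, and the function names are hypothetical.

```python
import math


def scaled_positions(target_len, base_len):
    """Map target_len token positions into [0, base_len - 1], the training range."""
    if target_len <= 1:
        return [0.0] * target_len
    step = (base_len - 1) / (target_len - 1)
    return [i * step for i in range(target_len)]


def sinusoidal_embedding(position, dim=8, base=10000.0):
    """Standard sine/cosine embedding of a (possibly fractional) position."""
    half = dim // 2
    freqs = [base ** (-i / half) for i in range(half)]
    return ([math.sin(position * f) for f in freqs]
            + [math.cos(position * f) for f in freqs])


base_len, target_len = 4, 7     # e.g. trained on 4 tokens per side, sampling 7
positions = scaled_positions(target_len, base_len)
embeddings = [sinusoidal_embedding(p) for p in positions]
print(positions)
```

Without scaling, tokens 4 to 6 would receive embeddings for positions the model never saw in training; with scaling, the last token reuses the embedding of the last training position.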
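[CV-69] ViRectify: A Challenging Benchmark for Video Reasoning Correction with Multimodal Large Language Models
Quick read: This paper addresses the fact that multimodal large language models (MLLMs) make frequent errors in complex video reasoning, yet their ability to identify and correct those errors has not been evaluated systematically. Existing benchmarks do not probe this fine-grained capability, so weaknesses are hard to find and fix. The authors propose ViRectify, a comprehensive benchmark built through AI-assisted annotation with human verification. It contains more than 30K instances across three domains (dynamic perception, scientific reasoning, and embodied decision-making) and requires models to identify errors step by step and to give rationales grounded in key video evidence. The key to the solution is a trajectory evidence-driven correction framework, which combines step-wise error trajectory modeling with a reward mechanism anchored in visual evidence, pushing the model to focus on error propagation paths and key timestamps and thereby improving correction accuracy. Experiments show that ViRectify reveals systematic asymmetries in error correction across models, and that with the framework a Qwen2.5-VL-7B model outperforms the 72B variants on the benchmark.
Link: https://arxiv.org/abs/2512.01424
Authors: Xusen Hei,Jiali Chen,Jinyu Yang,Mengchen Zhao,Yi Cai
Affiliations: South China University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 11 figures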
Abstract:As multimodal large language models (MLLMs) frequently exhibit errors in complex video reasoning scenarios, correcting these errors is critical for uncovering their weaknesses and improving performance. However, existing benchmarks lack systematic evaluation of MLLMs’ ability to identify and correct these video reasoning errors. To bridge this gap, we propose ViRectify, a comprehensive benchmark to evaluate their fine-grained correction capability. Through an AI-assisted annotation pipeline with human verification, we construct a dataset of over 30K instances spanning dynamic perception, scientific reasoning, and embodied decision-making domains. In ViRectify, we challenge MLLMs to perform step-wise error identification and generate rationales with key video evidence grounding. We further propose the trajectory evidence-driven correction framework, comprising step-wise error trajectory and reward modeling on visual evidence-grounded correction. It encourages the model to explicitly concentrate on error propagation and key timestamps for correction. Extensive evaluation across 16 advanced MLLMs demonstrates that our ViRectify serves as a challenging testbed, where GPT-5 achieves only 31.94% correction accuracy. Our framework enables a Qwen2.5-VL-7B to consistently outperform the variants of 72B on ViRectify, showing the effectiveness of our approach. Further analysis uncovers systematic asymmetries in error correction across models, and our dataset is also a valuable data resource to perform reflection learning. We believe ViRectify provides a new direction for comprehensively evaluating the advanced MLLMs in video reasoning.
[CV-70] MDiff4STR: Mask Diffusion Model for Scene Text Recognition AAAI2026
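Quick read: This paper addresses the fact that Mask Diffusion Models (MDMs) are less accurate than auto-regressive models (ARMs) on Scene Text Recognition (STR), despite their advantage in inference efficiency. The core obstacles are the noising gap between training and inference and the model's overconfident predictions during inference, both of which significantly limit MDM performance. The key to the solution is the MDiff4STR framework, with two targeted improvements: six noising strategies that narrow the distribution gap between training and inference, and a token-replacement noise mechanism that supplies a non-mask noise type, prompting the model to revisit and correct high-confidence but wrong predictions. Experiments show that MDiff4STR outperforms existing state-of-the-art STR models across a range of complex scenarios, achieving higher accuracy while keeping inference fast with only three denoising steps.
Link: https://arxiv.org/abs/2512.01422
Authors: Yongkun Du,Miaomiao Zhao,Songlin Fan,Zhineng Chen,Caiyan Jia,Yu-Gang Jiang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026 (Oral)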
Abstract:Mask Diffusion Models (MDMs) have recently emerged as a promising alternative to auto-regressive models (ARMs) for vision-language tasks, owing to their flexible balance of efficiency and accuracy. In this paper, for the first time, we introduce MDMs into the Scene Text Recognition (STR) task. We show that vanilla MDM lags behind ARMs in terms of accuracy, although it improves recognition efficiency. To bridge this gap, we propose MDiff4STR, a Mask Diffusion model enhanced with two key improvement strategies tailored for STR. Specifically, we identify two key challenges in applying MDMs to STR: noising gap between training and inference, and overconfident predictions during inference. Both significantly hinder the performance of MDMs. To mitigate the first issue, we develop six noising strategies that better align training with inference behavior. For the second, we propose a token-replacement noise mechanism that provides a non-mask noise type, encouraging the model to reconsider and revise overly confident but incorrect predictions. We conduct extensive evaluations of MDiff4STR on both standard and challenging STR benchmarks, covering diverse scenarios including irregular, artistic, occluded, and Chinese text, both with and without pretraining. Across these settings, MDiff4STR consistently outperforms popular STR models, surpassing state-of-the-art ARMs in accuracy, while maintaining fast inference with only three denoising steps. Code: this https URL.
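The two noise types mentioned above can be illustrated on a character sequence: some positions are hidden behind a mask token, and a few others are swapped for a different, wrong character so that the model has to learn to revise visible but incorrect tokens. The ratios, the uniform random choice of positions, and the `[MASK]` symbol are assumptions for illustration; the paper's six noising strategies are not reproduced here.

```python
import random

MASK = "[MASK]"


def add_noise(tokens, vocab, mask_ratio, replace_ratio, seed=0):
    """Corrupt a token sequence with mask noise and token-replacement noise."""
    rng = random.Random(seed)
    noisy = list(tokens)
    order = list(range(len(tokens)))
    rng.shuffle(order)                      # random, non-overlapping positions
    n_mask = round(mask_ratio * len(tokens))
    n_replace = round(replace_ratio * len(tokens))
    for i in order[:n_mask]:
        noisy[i] = MASK                     # mask noise: the token is hidden
    for i in order[n_mask:n_mask + n_replace]:
        # Replacement noise: a visible token that is guaranteed to be wrong.
        noisy[i] = rng.choice([t for t in vocab if t != tokens[i]])
    return noisy


tokens = list("STREETSIGN")
vocab = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
noisy = add_noise(tokens, vocab, mask_ratio=0.3, replace_ratio=0.1, seed=0)
print(noisy)
```

A denoiser trained only on masked inputs never sees wrong-but-visible characters; mixing in replacements gives it examples where a confident-looking token must be changed.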
[CV-71] Rice-VL: Evaluating Vision-Language Models for Cultural Understanding Across ASEAN Countries
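Quick read: This paper addresses Western-centric bias in Vision-Language Models (VLMs), which limits their usefulness in culturally diverse regions such as Southeast Asia (SEA). The key to the solution is the RICE-VL benchmark, which covers 11 ASEAN countries with more than 28,000 human-curated Visual Question Answering (VQA) samples and 1,000 image-bounding box pairs for evaluating VLM cultural understanding. The paper also introduces the SEA-LAVE metric, an extension of LAVE that measures textual accuracy, cultural alignment, and country identification. Together these expose performance gaps in low-resource countries and abstract cultural domains and motivate the development of more inclusive, multicultural VLMs.
Link: https://arxiv.org/abs/2512.01419
Authors: Tushar Pranav,Eshan Pandey,Austria Lyka Diane Bala,Aman Chadha,Indriyati Atmosukarto,Donny Soh Cheng Lock
Affiliations: Singapore Institute of Technology; Amazon GenAI; Palo Alto, CA, USA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 14 pages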
Abstract:Vision-Language Models (VLMs) excel in multimodal tasks but often exhibit Western-centric biases, limiting their effectiveness in culturally diverse regions like Southeast Asia (SEA). To address this, we introduce RICE-VL, a novel benchmark evaluating VLM cultural understanding across 11 ASEAN countries. RICE-VL includes over 28,000 human-curated Visual Question Answering (VQA) samples – covering True or False, Fill-in-the-Blank, and open-ended formats – and 1,000 image-bounding box pairs for Visual Grounding, annotated by culturally informed experts across 14 sub-ground categories. We propose SEA-LAVE, an extension of the LAVE metric, assessing textual accuracy, cultural alignment, and country identification. Evaluations of six open- and closed-source VLMs reveal significant performance gaps in low-resource countries and abstract cultural domains. The Visual Grounding task tests models’ ability to localize culturally significant elements in complex scenes, probing spatial and contextual accuracy. RICE-VL exposes limitations in VLMs’ cultural comprehension and highlights the need for inclusive model development to better serve diverse global populations.
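[CV-72] FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution
Quick read: This paper addresses the under-reconstruction of high-frequency detail by diffusion models in real-image super-resolution (Real-ISR), which stems from a low-frequency bias and a depth-wise "low-first, high-later" hierarchy. The key to the solution is FRAMER, a plug-and-play training framework. At each denoising step, the final-layer feature map guides feature learning in all intermediate layers. Teacher and student features are decomposed into low-frequency (LF) and high-frequency (HF) components with FFT masks. An Intra Contrastive Loss (IntraCL) stabilizes globally shared structure on the LF component, and an Inter Contrastive Loss (InterCL) sharpens instance-specific detail on the HF component using random-layer and in-batch negatives. Two adaptive modulators, Frequency-based Adaptive Weight (FAW) and Frequency-based Alignment Modulation (FAM), dynamically reweight the per-layer LF/HF signals and gate the distillation strength by current similarity, giving more precise frequency-aware supervision and feature alignment.
Link: https://arxiv.org/abs/2512.01390
Authors: Seungho Choi,Jeahun Sung,Jihyong Oh
Affiliations: Chung-Ang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Please visit our project page at this https URL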
Abstract:Real-image super-resolution (Real-ISR) seeks to recover HR images from LR inputs with mixed, unknown degradations. While diffusion models surpass GANs in perceptual quality, they under-reconstruct high-frequency (HF) details due to a low-frequency (LF) bias and a depth-wise “low-first, high-later” hierarchy. We introduce FRAMER, a plug-and-play training scheme that exploits diffusion priors without changing the backbone or inference. At each denoising step, the final-layer feature map teaches all intermediate layers. Teacher and student feature maps are decomposed into LF/HF bands via FFT masks to align supervision with the model’s internal frequency hierarchy. For LF, an Intra Contrastive Loss (IntraCL) stabilizes globally shared structure. For HF, an Inter Contrastive Loss (InterCL) sharpens instance-specific details using random-layer and in-batch negatives. Two adaptive modulators, Frequency-based Adaptive Weight (FAW) and Frequency-based Alignment Modulation (FAM), reweight per-layer LF/HF signals and gate distillation by current similarity. Across U-Net and DiT backbones (e.g., Stable Diffusion 2, 3), FRAMER consistently improves PSNR/SSIM and perceptual metrics (LPIPS, NIQE, MANIQA, MUSIQ). Ablations validate the final-layer teacher and random-layer negatives.
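The LF/HF decomposition by frequency masks can be illustrated on a 1D signal: transform, keep either the low or the high frequencies, and transform back. The sketch uses a naive discrete Fourier transform instead of an FFT library call, a hard cutoff mask, and a 1D signal instead of 2D feature maps; all three are simplifications assumed here, and the mask shape FRAMER actually uses is not given in the abstract.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]


def idft(spectrum):
    """Inverse transform; returns the real part of the reconstruction."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]


def split_bands(signal, cutoff):
    """Split a signal into low- and high-frequency parts with a hard mask."""
    n = len(signal)
    spectrum = dft(signal)
    # Bins k and n - k carry the same absolute frequency, so mask them together.
    low_mask = [min(k, n - k) <= cutoff for k in range(n)]
    low = idft([c if m else 0 for c, m in zip(spectrum, low_mask)])
    high = idft([0 if m else c for c, m in zip(spectrum, low_mask)])
    return low, high


signal = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
low, high = split_bands(signal, cutoff=1)
print(low, high)
```

The two bands sum back to the original signal, so losses applied separately to each band together cover all of the information in the feature.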
[CV-73] PointNet4D: A Lightweight 4D Point Cloud Video Backbone for Online and Offline Perception in Robotic Applications WACV2026
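Quick read: This paper addresses the high computational cost of existing backbone networks for modeling dynamic 4D environments (3D space evolving over time), which makes real-time processing of point cloud video streams difficult in resource-constrained settings. Current mainstream methods rely on spatiotemporal convolutions or Transformers, which are expressive but inefficient and ill-suited to online deployment. The key to the solution is PointNet4D, whose core is a Hybrid Mamba-Transformer temporal fusion block that combines Mamba's efficient state-space modeling with the bidirectional modeling power of Transformers to handle online sequences of varying length efficiently. It also introduces 4DMAP, a frame-wise masked auto-regressive pretraining strategy that strengthens temporal understanding and improves generalization across diverse tasks.
Link: https://arxiv.org/abs/2512.01383
Authors: Yunze Liu,Zifan Wang,Peiran Wu,Jiayang Ao
Affiliations: Tsinghua University; University of Bristol; The University of Melbourne
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by WACV2026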
Abstract:Understanding dynamic 4D environments (3D space evolving over time) is critical for robotic and interactive systems. These applications demand systems that can process streaming point cloud video in real time, often under resource constraints, while also benefiting from past and present observations when available. However, current 4D backbone networks rely heavily on spatiotemporal convolutions and Transformers, which are often computationally intensive and poorly suited to real-time applications. We propose PointNet4D, a lightweight 4D backbone optimized for both online and offline settings. At its core is a Hybrid Mamba-Transformer temporal fusion block, which integrates the efficient state-space modeling of Mamba and the bidirectional modeling power of Transformers. This enables PointNet4D to handle variable-length online sequences efficiently across different deployment scenarios. To enhance temporal understanding, we introduce 4DMAP, a frame-wise masked auto-regressive pretraining strategy that captures motion cues across frames. Our extensive evaluations across 9 tasks on 7 datasets demonstrate consistent improvements across diverse domains. We further demonstrate PointNet4D’s utility by building two robotic application systems: 4D Diffusion Policy and 4D Imitation Learning, achieving substantial gains on the RoboTwin and HandoverSim benchmarks.
[CV-74] Reversible Inversion for Training-Free Exemplar-guided Image Editing
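[Quick read]: This paper addresses two problems in Exemplar-guided Image Editing (EIE): existing methods depend on large-scale pre-training, which is computationally expensive, and standard inversion techniques perform poorly on EIE, giving low quality and poor efficiency. The key to the solution is Reversible Inversion (ReInversion), a two-stage denoising process that is first conditioned on the source image and then on the reference image, enabling efficient, high-quality editing. A Mask-Guided Selective Denoising (MSD) strategy additionally confines edits to the target region and preserves the structural consistency of the background, so that performance is maintained while computational overhead is substantially reduced.
Link: https://arxiv.org/abs/2512.01382
Authors: Yuke Li,Lianli Gao,Ji Zhang,Pengpeng Zeng,Lichuan Xiang,Hongkai Wen,Heng Tao Shen,Jingkuan Song
Affiliations: University of Electronic Science and Technology of China; Tongji University; University of Warwick
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: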
Abstract:Exemplar-guided Image Editing (EIE) aims to modify a source image according to a visual reference. Existing approaches often require large-scale pre-training to learn relationships between the source and reference images, incurring high computational costs. As a training-free alternative, inversion techniques can be used to map the source image into a latent space for manipulation. However, our empirical study reveals that standard inversion is sub-optimal for EIE, leading to poor quality and inefficiency. To tackle this challenge, we introduce Reversible Inversion (ReInversion) for effective and efficient EIE. Specifically, ReInversion operates as a two-stage denoising process, which is first conditioned on the source image and subsequently on the reference. In addition, we introduce a Mask-Guided Selective Denoising (MSD) strategy to constrain edits to target regions, preserving the structural consistency of the background. Both qualitative and quantitative comparisons demonstrate that our ReInversion method achieves state-of-the-art EIE performance with the lowest computational overhead.
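[CV-75] Textured Geometry Evaluation: Perceptual 3D Textured Shape Metric via 3D Latent-Geometry Network AAAI26
[Quick read]: This paper addresses the mismatch with human perception in current fidelity evaluation of textured 3D meshes: existing metrics based on point cloud distance (such as Chamfer Distance) or on rendered images fail to reflect how humans judge the quality of 3D shape and texture. The key to the solution is Textured Geometry Evaluation (TGE), an evaluation method that works directly on the 3D mesh and its texture. It jointly uses geometry and color information to compute the fidelity of an input textured mesh against a reference colored shape, without relying on rendering. To overcome the domain gap between synthetic and real-world distortions, the authors also build a human-annotated dataset of real-world distortions for training and validation. Experiments show that TGE outperforms rendering-based and geometry-only methods on real-world distortions.
Link: https://arxiv.org/abs/2512.01380
Authors: Tianyu Luan,Xuelu Feng,Zixin Zhu,Phani Nuney,Sheng Liu,Xuan Gong,David Doermann,Chunming Qiao,Junsong Yuan
Affiliations: University at Buffalo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI26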
Abstract:Textured high-fidelity 3D models are crucial for games, AR/VR, and film, but human-aligned evaluation methods still fall behind despite recent advances in 3D reconstruction and generation. Existing metrics, such as Chamfer Distance, often fail to align with how humans evaluate the fidelity of 3D shapes. Recent learning-based metrics attempt to improve this by relying on rendered images and 2D image quality metrics. However, these approaches face limitations due to incomplete structural coverage and sensitivity to viewpoint choices. Moreover, most methods are trained on synthetic distortions, which differ significantly from real-world distortions, resulting in a domain gap. To address these challenges, we propose a new fidelity evaluation method that is based directly on 3D meshes with texture, without relying on rendering. Our method, named Textured Geometry Evaluation (TGE), jointly uses the geometry and color information to calculate the fidelity of the input textured mesh in comparison with a reference colored shape. To train and evaluate our metric, we design a human-annotated dataset with real-world distortions. Experiments show that TGE outperforms rendering-based and geometry-only methods on a real-world distortion dataset.
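For readers unfamiliar with the baseline the abstract criticizes, the following is a minimal Python sketch of a symmetric Chamfer Distance between two point sets. It is illustrative only: the function name `chamfer_distance`, the averaged squared-distance convention (other variants exist), and the toy points are assumptions, not taken from the paper.

```python
import math

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between two point sets (lists of 3D tuples).

    Convention assumed here: mean squared nearest-neighbor distance,
    summed over both directions.
    """
    def one_way(src, dst):
        # For each source point, squared distance to its nearest neighbor in dst.
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)

# Hypothetical toy shapes: a reference and a copy shifted by 0.1 along x.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shifted = [(x + 0.1, y, z) for x, y, z in reference]

print(chamfer_distance(reference, reference))
print(chamfer_distance(reference, shifted))
```

The sketch takes only point coordinates as input, so two meshes with identical geometry but very different textures would receive the same score. That is one way to see why a geometry-only metric cannot capture texture fidelity.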
[CV-76] SRAM: Shape-Realism Alignment Metric for No Reference 3D Shape Evaluation AAAI2026
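[Quick read]: This paper addresses the lack of effective evaluation tools for 3D generation and reconstruction, in particular how to quantify the realism of a 3D shape when no ground truth is available. Traditional methods rely on mesh fidelity metrics, which do not reflect human judgments of realism. The key to the solution is a Shape-Realism Alignment Metric based on a large language model (LLM): 3D meshes are encoded into the language token space, and a dedicated realism decoder aligns the LLM output with human perception. Combined with the authors' RealismGrading dataset (shapes of multiple object types generated by 16 algorithms, with human-annotated realism scores), the method offers a way to measure 3D shape realism without a ground-truth reference.
Link: https://arxiv.org/abs/2512.01373
Authors: Sheng Liu,Tianyu Luan,Phani Nuney,Xuelu Feng,Junsong Yuan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI2026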
Abstract:3D generation and reconstruction techniques have been widely used in computer games, film, and other content creation areas. As the application grows, there is a growing demand for 3D shapes that look truly realistic. Traditional evaluation methods rely on a ground truth to measure mesh fidelity. However, in many practical cases, a shape’s realism does not depend on having a ground truth reference. In this work, we propose a Shape-Realism Alignment Metric that leverages a large language model (LLM) as a bridge between mesh shape information and realism evaluation. To achieve this, we adopt a mesh encoding approach that converts 3D shapes into the language token space. A dedicated realism decoder is designed to align the language model’s output with human perception of realism. Additionally, we introduce a new dataset, RealismGrading, which provides human-annotated realism scores without the need for ground truth shapes. Our dataset includes shapes generated by 16 different algorithms on over a dozen objects, making it more representative of practical 3D shape distributions. We validate our metric’s performance and generalizability through k-fold cross-validation across different objects. Experimental results show that our metric correlates well with human perceptions and outperforms existing methods, and has good generalizability.
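The abstract mentions k-fold cross-validation "across different objects". One plausible reading is that all shapes of a given object stay together in either the training or the test split. The Python sketch below illustrates that kind of grouped split. The function name `object_k_fold`, the round-robin fold assignment, and the object names are hypothetical, and the paper's actual protocol may differ.

```python
def object_k_fold(object_ids, k):
    """Yield (train_indices, test_indices) so that no object spans both splits.

    object_ids[i] is the object that sample i depicts (hypothetical labels).
    """
    unique = sorted(set(object_ids))
    # Round-robin assignment of objects to k folds.
    folds = [unique[i::k] for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i, obj in enumerate(object_ids) if obj not in held]
        test = [i for i, obj in enumerate(object_ids) if obj in held]
        yield train, test

# Toy example: six generated shapes covering four objects.
object_ids = ["chair", "chair", "lamp", "mug", "mug", "sofa"]
splits = list(object_k_fold(object_ids, k=2))
for train, test in splits:
    print("train:", train, "test:", test)
```

Splitting by object rather than by sample tests whether a metric generalizes to objects it has never seen, which is closer to the generalizability claim than a plain random split.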
[CV-77] BlinkBud: Detecting Hazards from Behind via Sampled Monocular 3D Detection on a Single Earbud
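[Quick read]: This paper addresses the road safety risk that pedestrians and cyclists face when they fail to notice fast vehicles approaching from behind. The key to the solution is BlinkBud, a lightweight real-time hazard detection system. Its core contribution combines Kalman filter based trajectory estimation with a reinforcement-learning-driven optimal image sampling strategy, so that a single earbud and a paired phone can track objects approaching from behind efficiently and at low power. It also uses estimates of the user's head pose (pitch and yaw angles) to correct object depth estimation and to align the camera coordinate system with the body coordinate system, which markedly reduces tracking error caused by constant head movement. The system detects hazards accurately (average false positive ratio of 4.90% and false negative ratio of 1.47%) while keeping mean power consumption at 29.8 mW on the earbud and 702.6 mW on the phone.
Link: https://arxiv.org/abs/2512.01366
Authors: Yunzhe Li,Jiajun Yan,Yuzhou Wei,Kechen Liu,Yize Zhao,Chong Zhang,Hongzi Zhu,Li Lu,Shan Chang,Minyi Guo
Affiliations: Shanghai Jiao Tong University; Columbia University; University of Electronic Science and Technology of China; Southwest Petroleum University; Donghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: This is the author-accepted version of the paper published in Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT), Vol. 9, No. 4, Article 191, 2025. Final published version: this https URL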
Abstract:Failing to be aware of speeding vehicles approaching from behind poses a huge threat to the road safety of pedestrians and cyclists. In this paper, we propose BlinkBud, which utilizes a single earbud and a paired phone to detect, online, hazardous objects approaching a user from behind. The core idea is to accurately track visually identified objects utilizing a small number of sampled camera images taken from the earbud. To minimize the power consumption of the earbud and the phone while guaranteeing the best tracking accuracy, a novel 3D object tracking algorithm is devised, integrating both a Kalman filter based trajectory estimation scheme and an optimal image sampling strategy based on reinforcement learning. Moreover, the impact of constant user head movements on the tracking accuracy is significantly reduced by leveraging the estimated pitch and yaw angles to correct the object depth estimation and align the camera coordinate system to the user’s body coordinate system, respectively. We implement a prototype BlinkBud system and conduct extensive real-world experiments. Results show that BlinkBud is lightweight with ultra-low mean power consumption of 29.8 mW and 702.6 mW on the earbud and smartphone, respectively, and can accurately detect hazards with a low average false positive ratio (FPR) and false negative ratio (FNR) of 4.90% and 1.47%, respectively.
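To make the "Kalman filter based trajectory estimation" idea concrete, here is a minimal Python sketch of a constant-velocity Kalman filter that tracks one object's depth from sparse measurements. It is not the paper's algorithm: the class name `DepthKalmanFilter`, the one-dimensional state, the noise parameters, and the simulated vehicle (30 m away, closing at 5 m/s) are all assumptions chosen for illustration.

```python
class DepthKalmanFilter:
    """Constant-velocity Kalman filter over state [depth, velocity].

    All parameter values are illustrative assumptions.
    """

    def __init__(self, depth, velocity=0.0, var=10.0, q=0.01, r=0.25):
        self.d, self.v = depth, velocity
        self.p = [[var, 0.0], [0.0, var]]  # state covariance
        self.q, self.r = q, r              # process / measurement noise

    def predict(self, dt):
        """Propagate the state dt seconds ahead without a new image."""
        p = self.p
        self.d += self.v * dt
        a = p[0][1] + dt * p[1][1]
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]] and Q = q * I.
        self.p = [
            [p[0][0] + dt * p[1][0] + dt * a + self.q, a],
            [p[1][0] + dt * p[1][1], p[1][1] + self.q],
        ]

    def update(self, z):
        """Fuse a depth measurement z (measurement matrix H = [1, 0])."""
        p = self.p
        s = p[0][0] + self.r                 # innovation variance
        k0, k1 = p[0][0] / s, p[1][0] / s    # Kalman gain
        y = z - self.d                       # innovation
        self.d += k0 * y
        self.v += k1 * y
        self.p = [
            [(1 - k0) * p[0][0], (1 - k0) * p[0][1]],
            [p[1][0] - k1 * p[0][0], p[1][1] - k1 * p[0][1]],
        ]

# Simulated object approaching from 30 m at 5 m/s, sampled every 0.5 s.
kf = DepthKalmanFilter(depth=30.0)
for k in range(1, 11):
    kf.predict(dt=0.5)
    kf.update(30.0 - 5.0 * 0.5 * k)
print(round(kf.d, 2), round(kf.v, 2))
```

Between sampled images a tracker of this kind only calls `predict`, which is cheap. That gives a rough intuition for why a sampling policy that decides when to capture the next image can trade power against tracking accuracy.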
[CV-78] OpenBox: Annotate Any Bounding Boxes in 3D NEURIPS2025
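[Quick read]: This paper addresses two challenges in unsupervised, open-vocabulary 3D object detection. First, existing methods annotate 3D bounding boxes uniformly for all objects and ignore their physical state (such as rigidity and motion state), which lowers annotation quality. Second, they need multiple self-training iterations to refine annotations, which adds significant computational overhead. The key to the solution is OpenBox, a two-stage automatic annotation pipeline. The first stage uses cross-modal instance alignment to associate instance-level cues, extracted from 2D images by a vision foundation model, with 3D point clouds. The second stage categorizes instances by rigidity and motion state and generates adaptive bounding boxes from class-specific size statistics, producing high-quality 3D bounding box annotations without self-training.
Link: https://arxiv.org/abs/2512.01352
Authors: In-Jae Lee,Mungyeom Kim,Kwonyoung Ryu,Pierre Musacchio,Jaesik Park
Affiliations: Seoul National University; POSTECH
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2025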
Abstract:Unsupervised and open-vocabulary 3D object detection has recently gained attention, particularly in autonomous driving, where reducing annotation costs and recognizing unseen objects are critical for both safety and scalability. However, most existing approaches uniformly annotate 3D bounding boxes, ignore objects’ physical states, and require multiple self-training iterations for annotation refinement, resulting in suboptimal quality and substantial computational overhead. To address these challenges, we propose OpenBox, a two-stage automatic annotation pipeline that leverages a 2D vision foundation model. In the first stage, OpenBox associates instance-level cues from 2D images processed by a vision foundation model with the corresponding 3D point clouds via cross-modal instance alignment. In the second stage, it categorizes instances by rigidity and motion state, then generates adaptive bounding boxes with class-specific size statistics. As a result, OpenBox produces high-quality 3D bounding box annotations without requiring self-training. Experiments on the Waymo Open Dataset, the Lyft Level 5 Perception dataset, and the nuScenes dataset demonstrate improved accuracy and efficiency over baselines.
[CV-79] Handwritten Text Recognition for Low Resource Languages
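[Quick read]: This paper addresses paragraph-level handwritten text recognition for low-resource languages (Hindi, Urdu, and similar scripts), a task made difficult by the lack of comprehensive linguistic resources. The key to the solution is BharatOCR, a segmentation-free end-to-end framework with a ViT-Transformer Decoder-LM architecture: a Vision Transformer (ViT) extracts image features, a Transformer decoder generates the text sequence, and a pre-trained language model (LM) post-processes the output to improve accuracy, fluency, and coherence. The model uses a Data-efficient Image Transformer (DeiT) for visual encoding and a RoBERTa model optimized for masked language modeling (MLM), and it processes a paragraph image line by line through implicit line segmentation. Experiments report state-of-the-art results on several public and self-built datasets.
Link: https://arxiv.org/abs/2512.01348
Authors: Sayantan Dey,Alireza Alaei,Partha Pratim Roy
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 Pages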
Abstract:Despite considerable progress in handwritten text recognition, paragraph-level handwritten text recognition, especially in low-resource languages, such as Hindi, Urdu and similar scripts, remains a challenging problem. These languages, often lacking comprehensive linguistic resources, require special attention to develop robust systems for accurate optical character recognition (OCR). This paper introduces BharatOCR, a novel segmentation-free framework for paragraph-level handwritten Hindi and Urdu text recognition. We propose a ViT-Transformer Decoder-LM architecture for handwritten text recognition, where a Vision Transformer (ViT) extracts visual features, a Transformer decoder generates text sequences, and a pre-trained language model (LM) refines the output to improve accuracy, fluency, and coherence. Our model utilizes a Data-efficient Image Transformer (DeiT) model proposed for masked image modeling in this research work. In addition, we adopt a RoBERTa architecture optimized for masked language modeling (MLM) to enhance the linguistic comprehension and generative capabilities of the proposed model. The transformer decoder generates text sequences from visual embeddings. This model is designed to iteratively process a paragraph image line by line, called implicit line segmentation. The proposed model was evaluated using our custom datasets (‘Parimal Urdu’ and ‘Parimal Hindi’), introduced in this research work, as well as two public datasets. The proposed model achieved benchmark results on the NUST-UHWR, PUCIT-OUHL, and Parimal-Urdu datasets, achieving character recognition rates of 96.24%, 92.05%, and 94.80%, respectively. The model also provided benchmark results on the Hindi dataset, achieving a character recognition rate of 80.64%. The results obtained from our proposed model indicated that it outperformed several state-of-the-art Urdu text recognition methods.
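The results above are reported as character recognition rates. The abstract does not define the metric, so the Python sketch below assumes a common convention: one minus the character error rate, where the error rate is the Levenshtein edit distance divided by the reference length. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def character_recognition_rate(predicted, reference):
    """Assumed definition: 1 - edit_distance / len(reference), floored at 0."""
    if not reference:
        return 1.0 if not predicted else 0.0
    return max(0.0, 1.0 - edit_distance(predicted, reference) / len(reference))

print(edit_distance("kitten", "sitting"))
print(character_recognition_rate("abxd", "abcd"))
```

Python strings are sequences of Unicode code points, so for Devanagari or Urdu text a single visible character with combining marks may count as several units in this sketch. Whether a given benchmark counts code points or grapheme clusters is a detail worth checking before comparing numbers.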
[CV-80] InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision
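[Quick read]: This paper addresses two problems: large-scale video-text pretraining depends on noisy synthetic captions with limited semantic coverage, while masked video modeling (MVM), although it exploits spatiotemporal structure directly, lags behind on general tasks. The core difficulty is that, in the traditional encoder-decoder design, pixel-level reconstruction converges poorly and conflicts with semantic abstraction, whereas latent-space prediction can encourage shortcut learning. The key to the solution is an Encoder-Predictor-Decoder (EPD) framework that models the world model explicitly as the predictor, together with a two-stage pretraining scheme. Stage 1 uses a conditional diffusion decoder and injects image-level semantic priors to reconcile pixel fidelity with high-level semantics. Stage 2 predicts targets in the frozen Stage 1 latent space, which strengthens learning of physical-world knowledge and mitigates shortcut learning. Trained on public, unlabeled videos, the method significantly improves video representation quality.
Link: https://arxiv.org/abs/2512.01342
Authors: Chenting Wang,Yuhan Zhu,Yicheng Xu,Jiange Yang,Ziang Yan,Yali Wang,Yi Wang,Limin Wang
Affiliations: Shanghai Jiao Tong University; Shanghai AI Laboratory; Shanghai Innovation Institute; Shenzhen Institutes of Advanced Technology, China; State Key Laboratory for Novel Software Technology, Nanjing University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: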
Abstract:Large-scale video-text pretraining achieves strong performance but depends on noisy, synthetic captions with limited semantic coverage, often overlooking implicit world knowledge such as object motion, 3D geometry, and physical cues. In contrast, masked video modeling (MVM) directly exploits spatiotemporal structures but trails text-supervised methods on general tasks. We find this gap arises from overlooked architectural issues: pixel-level reconstruction struggles with convergence and its low-level requirement often conflicts with semantics, while latent prediction often encourages shortcut learning. To address these, we disentangle the traditional encoder-decoder design into an Encoder-Predictor-Decoder (EPD) framework, where the predictor acts as a latent world model, and propose InternVideo-Next, a two-stage pretraining scheme that builds a semantically consistent yet detail-preserving latent space for this world model. First, the conventional linear decoder in pixel MVM forces the predictor's output latent to be linearly projected to, and thus separable in, pixel space, which conflicts with semantic abstraction. Our Stage 1 proposes a conditional diffusion decoder and injects reliable image-level semantic priors to enhance semantics and convergence, thus bridging pixel-level fidelity with high-level semantic abstraction. Stage 2 further learns world knowledge by predicting frozen Stage 1 targets within this space, mitigating shortcut learning. Trained on public, unlabeled videos, InternVideo-Next achieves state-of-the-art results across benchmarks and provides a scalable path toward general video representation learning.
[CV-81] EvalTalker: Learning to Evaluate Real-Portrait-Driven Multi-Subject Talking Humans
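[Quick read]: This paper addresses the noticeable quality degradation of Multi-Talker-generated Talking Humans (MTHs), with the aim of improving immersion and user experience in audiovisual interaction. The key to the solution is THQA-MT, the first large-scale quality assessment dataset for Multi-Talker-generated talking humans, together with the EvalTalker framework. EvalTalker perceives global quality, human characteristics, and identity consistency, and integrates Qwen-Sync to perceive multimodal synchrony. It achieves better correlation with subjective scores and provides a reliable basis for generating and evaluating high-quality Multi-Talker content.
Link: https://arxiv.org/abs/2512.01340
Authors: Yingjie Zhou,Xilei Zhu,Siyu Ren,Ziyi Zhao,Ziwen Wang,Farong Wen,Yu Zhou,Jiezhang Cao,Xiongkuo Min,Fengjiao Chen,Xiaoyu Li,Xuezhi Cao,Guangtao Zhai,Xiaohong Liu
Affiliations: Shanghai Jiao Tong University; Meituan; Shanghai Innovation Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: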
Abstract:Speech-driven Talking Human (TH) generation, commonly known as “Talker,” currently faces limitations in multi-subject driving capabilities. Extending this paradigm to “Multi-Talker,” capable of animating multiple subjects simultaneously, introduces richer interactivity and stronger immersion in audiovisual communication. However, current Multi-Talkers still exhibit noticeable quality degradation caused by technical limitations, resulting in suboptimal user experiences. To address this challenge, we construct THQA-MT, the first large-scale Multi-Talker-generated Talking Human Quality Assessment dataset, consisting of 5,492 Multi-Talker-generated THs (MTHs) from 15 representative Multi-Talkers using 400 real portraits collected online. Through subjective experiments, we analyze perceptual discrepancies among different Multi-Talkers and identify 12 common types of distortion. Furthermore, we introduce EvalTalker, a novel TH quality assessment framework. This framework possesses the ability to perceive global quality, human characteristics, and identity consistency, while integrating Qwen-Sync to perceive multimodal synchrony. Experimental results demonstrate that EvalTalker achieves superior correlation with subjective scores, providing a robust foundation for future research on high-quality Multi-Talker generation and evaluation.
[CV-82] AlignVid: Training-Free Attention Scaling for Semantic Fidelity in Text-Guided Image-to-Video Generation
[Quick read]: This paper addresses semantic negligence in text-guided image-to-video (TI2V) generation: existing methods struggle to follow fine-grained prompts that require substantial transformations of the input image, such as adding, deleting, or modifying objects. The key to the solution is AlignVid, a training-free framework with two components: (i) Attention Scaling Modulation (ASM), which reweights attention directly through lightweight scaling of the query (Q) or key (K), and (ii) Guidance Scheduling (GS), which applies ASM selectively across Transformer blocks and denoising steps to reduce visual quality degradation. The method improves prompt adherence while limiting the loss in aesthetic quality.
Link: https://arxiv.org/abs/2512.01334
Authors: Yexin Liu,Wen-Jie Shu,Zile Huang,Haoze Zheng,Yueze Wang,Manyuan Zhang,Ser-Nam Lim,Harry Yang
Affiliations: Hong Kong University of Science and Technology; University of Central Florida; Beijing Academy of Artificial Intelligence; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Text-guided image-to-video (TI2V) generation has recently achieved remarkable progress, particularly in maintaining subject consistency and temporal coherence. However, existing methods still struggle to adhere to fine-grained prompt semantics, especially when prompts entail substantial transformations of the input image (e.g., object addition, deletion, or modification), a shortcoming we term semantic negligence. In a pilot study, we find that applying a Gaussian blur to the input image improves semantic adherence. Analyzing attention maps, we observe clearer foreground-background separation. From an energy perspective, this corresponds to a lower-entropy cross-attention distribution. Motivated by this, we introduce AlignVid, a training-free framework with two components: (i) Attention Scaling Modulation (ASM), which directly reweights attention via lightweight Q or K scaling, and (ii) Guidance Scheduling (GS), which applies ASM selectively across transformer blocks and denoising steps to reduce visual quality degradation. This minimal intervention improves prompt adherence while limiting aesthetic degradation. In addition, we introduce OmitI2V to evaluate semantic negligence in TI2V generation, comprising 367 human-annotated samples that span addition, deletion, and modification scenarios. Extensive experiments demonstrate that AlignVid can enhance semantic fidelity.
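The link the abstract draws between Q or K scaling and a lower-entropy attention distribution can be illustrated with a few lines of Python. This is a toy sketch, not the paper's ASM module: the helper names, the scale factor 1.5, and the two-dimensional query and keys are assumptions. Multiplying Q (or K) by a factor multiplies every attention logit by that factor, which sharpens the softmax.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def attention_weights(q, keys, scale=1.0):
    """Scaled dot-product attention weights with q multiplied by `scale`."""
    d = len(q)
    logits = [scale * sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
              for k in keys]
    return softmax(logits)

# Hypothetical query and keys.
query = [1.0, 0.0]
keys = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]

base = attention_weights(query, keys, scale=1.0)
sharp = attention_weights(query, keys, scale=1.5)
print(entropy(base), entropy(sharp))
```

A scale above 1 concentrates weight on the highest-scoring key and lowers entropy, while a scale below 1 flattens the distribution. Where and when to apply such scaling (which blocks, which denoising steps) is what the abstract attributes to Guidance Scheduling, which this sketch does not model.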
[CV-83] Optimizing Stroke Risk Prediction: A Machine Learning Pipeline Combining ROS-Balanced Ensembles and XAI
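[Quick read]: This paper addresses early stroke risk assessment, with the goal of enabling timely intervention and effective prevention. The core of the solution is an interpretable machine learning framework that combines ensemble learning with explainable AI (XAI). Feature engineering and Random Over-Sampling (ROS) are first used to handle class imbalance. Ten machine learning models are then evaluated systematically, leading to an optimized ensemble of Random Forest, ExtraTrees, and XGBoost that reaches 99.09% accuracy on the Stroke Prediction Dataset (SPD). LIME is used to identify three key clinical variables (age, hypertension, and glucose level), which improves the model's transparency and clinical applicability and supports more precise, data-driven stroke risk prediction.
Link: https://arxiv.org/abs/2512.01333
Authors: A S M Ahsanul Sarkar Akib,Raduana Khawla,Abdul Hasib
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: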
Abstract:Stroke is a major cause of death and permanent impairment, making it a major worldwide health concern. For prompt intervention and successful preventative tactics, early risk assessment is essential. To address this challenge, we used ensemble modeling and explainable AI (XAI) techniques to create an interpretable machine learning framework for stroke risk prediction. A thorough evaluation of 10 different machine learning models using 5-fold cross-validation across several datasets was part of our all-inclusive strategy, which also included feature engineering and data pretreatment (using Random Over-Sampling (ROS) to solve class imbalance). Our optimized ensemble model (Random Forest + ExtraTrees + XGBoost) performed exceptionally well, obtaining a strong 99.09% accuracy on the Stroke Prediction Dataset (SPD). We improved the model’s transparency and clinical applicability by identifying three important clinical variables using LIME-based interpretability analysis: age, hypertension, and glucose levels. Through early prediction, this study highlights how combining ensemble learning with explainable AI (XAI) can deliver highly accurate and interpretable stroke risk assessment. By enabling data-driven prevention and personalized clinical decisions, our framework has the potential to transform stroke prediction and cardiovascular risk management.
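Random Over-Sampling, which the abstract uses to handle class imbalance, amounts to duplicating minority-class rows at random until the classes are balanced. The Python sketch below shows the idea in a self-contained form. The function name `random_over_sample`, the seed, and the toy feature rows (age, hypertension flag, glucose level) are hypothetical and are not the paper's data or code.

```python
import random
from collections import Counter

def random_over_sample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample with replacement to fill the gap to the majority size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Toy data: (age, hypertension, glucose); label 1 = stroke (minority class).
rows = [(45, 0, 85.0), (52, 0, 90.1), (38, 0, 78.3), (61, 1, 101.2),
        (29, 0, 80.0), (57, 0, 95.5), (74, 1, 210.4), (68, 1, 180.9)]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

bal_rows, bal_labels = random_over_sample(rows, labels)
print(Counter(bal_labels))
```

One caution when pairing this with cross-validation: if oversampling is applied before the split, duplicates of the same row can land in both training and validation folds and inflate accuracy. A common safeguard is to oversample only inside each training fold. The abstract does not say which order was used.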
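[CV-84] TagSplat: Topology-Aware Gaussian Splatting for Dynamic Mesh Modeling and Tracking
[Quick read]: This paper addresses the difficulty existing 4D reconstruction methods have in producing high-quality topology-consistent meshes, in particular keeping mesh topology stable and geometry accurate in dynamic scenes. The key to the solution is a topology-aware dynamic reconstruction framework based on Gaussian Splatting. It introduces a Gaussian topological structure that explicitly encodes spatial connectivity, enabling topology-aware densification and pruning that preserve the manifold consistency of the Gaussian representation. Temporal regularization terms maintain topological coherence over time, and differentiable mesh rasterization improves the quality of the final reconstructed meshes.
Link: https://arxiv.org/abs/2512.01329
Authors: Hanzhi Guo,Dongdong Weng,Mo Su,Yixiao Chen,Xiaonuo Dongye,Chenyu Xu
Affiliations: Beijing Institute of Technology; Soul Shell Technology Co., Ltd
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: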
Abstract:Topology-consistent dynamic model sequences are essential for applications such as animation and model editing. However, existing 4D reconstruction methods face challenges in generating high-quality topology-consistent meshes. To address this, we propose a topology-aware dynamic reconstruction framework based on Gaussian Splatting. We introduce a Gaussian topological structure that explicitly encodes spatial connectivity. This structure enables topology-aware densification and pruning, preserving the manifold consistency of the Gaussian representation. Temporal regularization terms further ensure topological coherence over time, while differentiable mesh rasterization improves mesh quality. Experimental results demonstrate that our method reconstructs topology-consistent mesh sequences with significantly higher accuracy than existing approaches. Moreover, the resulting meshes enable precise 3D keypoint tracking. Project page: this https URL
[CV-85] Rethinking Intracranial Aneurysm Vessel Segmentation: A Perspective from Computational Fluid Dynamics Applications
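[Quick read]: This paper addresses the fact that current methods for segmenting intracranial aneurysms and their parent vessels (IA-Vessel) focus on image-based evaluation metrics and neglect how well the results work in subsequent computational fluid dynamics (CFD) analysis. The key to the solution is IAVS, the first large-scale multi-center dataset of its kind, comprising 641 3D MRA images with 587 annotations of aneurysms and parent vessels, and including detailed hemodynamic analysis results so that segmentations can be judged on topological integrity and CFD applicability. The paper also proposes a two-stage segmentation framework (Stage I: global localization of aneurysms; Stage II: fine-grained IA-Vessel segmentation) and a standardized CFD applicability evaluation system that converts segmentation masks into CFD models automatically and consistently, providing an application-oriented benchmark for clinically relevant segmentation techniques.
Link: https://arxiv.org/abs/2512.01319
Authors: Feiyang Xiao,Yichi Zhang,Xigui Li,Yuanye Zhou,Chen Jiang,Xin Guo,Limei Han,Yuxin Li,Fengping Zhu,Yuan Cheng
Affiliations: Fudan University; Shanghai Academy of Artificial Intelligence for Science; Hong Kong Polytechnic University; Huashan Hospital, Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 5 figures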
Abstract:The precise segmentation of intracranial aneurysms and their parent vessels (IA-Vessel) is a critical step for hemodynamic analyses, which mainly depends on computational fluid dynamics (CFD). However, current segmentation methods predominantly focus on image-based evaluation metrics, often neglecting their practical effectiveness in subsequent CFD applications. To address this deficiency, we present the Intracranial Aneurysm Vessel Segmentation (IAVS) dataset, the first comprehensive, multi-center collection comprising 641 3D MRA images with 587 annotations of aneurysms and IA-Vessels. In addition to image-mask pairs, the IAVS dataset includes detailed hemodynamic analysis outcomes, addressing the limitations of existing datasets that neglect topological integrity and CFD applicability. To facilitate the development and evaluation of clinically relevant techniques, we construct two evaluation benchmarks including global localization of aneurysms (Stage I) and fine-grained segmentation of IA-Vessel (Stage II) and develop a simple and effective two-stage framework, which can be used as an out-of-the-box method and strong baseline. For comprehensive evaluation of applicability of segmentation results, we establish a standardized CFD applicability evaluation system that enables the automated and consistent conversion of segmentation masks into CFD models, offering an applicability-focused assessment of segmentation outcomes. The dataset, code, and model will be publicly available at this https URL.
[CV-86] FOD-S2R: A FOD Dataset for Sim2Real Transfer Learning based Object Detection
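[Quick read]: This paper addresses the lack of dedicated datasets for detecting Foreign Object Debris (FOD) inside aircraft fuel tanks, and the resulting weak detection performance in enclosed, complex environments. Existing work focuses on open or external scenes and does not transfer directly to confined spaces such as fuel tanks. The key to the solution is FOD-S2R, the first dataset built to evaluate systematically how synthetic data helps real-world FOD detection. It contains 3,114 real images and 3,137 synthetic images generated with Unreal Engine, covering a range of fields of view (FOV), distances, lighting conditions, and object sizes. Experiments show that adding synthetic data improves the accuracy and generalization of object detection models and narrows the Sim2Real gap, providing a foundation for reliable automated FOD detection systems.
Link: https://arxiv.org/abs/2512.01315
Authors: Ashish Vashist,Qiranul Saadiyean,Suresh Sundaram,Chandra Sekhar Seelamantula
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 11 figures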
Abstract:Foreign Object Debris (FOD) within aircraft fuel tanks presents critical safety hazards including fuel contamination, system malfunctions, and increased maintenance costs. Despite the severity of these risks, there is a notable lack of dedicated datasets for the complex, enclosed environments found inside fuel tanks. To bridge this gap, we present a novel dataset, FOD-S2R, composed of real and synthetic images of FOD within a simulated aircraft fuel tank. Unlike existing datasets that focus on external or open-air environments, our dataset is the first to systematically evaluate the effectiveness of synthetic data in enhancing the real-world FOD detection performance in confined, closed structures. The real-world subset consists of 3,114 high-resolution HD images captured in a controlled fuel tank replica, while the synthetic subset includes 3,137 images generated using Unreal Engine. The dataset covers various fields of view (FOV), object distances, lighting conditions, colors, and object sizes. Prior research has demonstrated that synthetic data can reduce reliance on extensive real-world annotations and improve the generalizability of vision models. Thus, we benchmark several state-of-the-art object detection models and demonstrate that introducing synthetic data improves the detection accuracy and generalization to real-world conditions. These experiments demonstrate the effectiveness of synthetic data in enhancing the model performance and narrowing the Sim2Real gap, providing a valuable foundation for developing automated FOD detection systems for aviation maintenance.
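[CV-87] TokenPure: Watermark Removal through Tokenized Appearance and Structural Guidance
[Quick read]: This paper addresses the tension between watermark robustness and content consistency in the era of generative AI, that is, how to remove a watermark thoroughly while preserving the perceptual quality and structural integrity of the image. The key to the solution is TokenPure, a new framework based on a Diffusion Transformer. It decomposes the watermarked image into two complementary token sets, visual tokens for texture and structural tokens for geometry, and uses them to condition the diffusion process. This reframes the task as conditional generation that does not depend on the original watermark-carrying noise, and it synthesizes high-fidelity, structurally consistent watermark-free images.
Link: https://arxiv.org/abs/2512.01314
Authors: Pei Yang,Yepeng Liu,Kelly Peng,Yuan Gao,Yiren Song
Affiliations: First Intelligence; University of California, Santa Barbara; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: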
Abstract:In the digital economy era, digital watermarking serves as a critical basis for ownership proof of massive replicable content, including AI-generated and other virtual assets. Designing robust watermarks capable of withstanding various attacks and processing operations is even more paramount. We introduce TokenPure, a novel Diffusion Transformer-based framework designed for effective and consistent watermark removal. TokenPure solves the trade-off between thorough watermark destruction and content consistency by leveraging token-based conditional reconstruction. It reframes the task as conditional generation, entirely bypassing the initial watermark-carrying noise. We achieve this by decomposing the watermarked image into two complementary token sets: visual tokens for texture and structural tokens for geometry. These tokens jointly condition the diffusion process, enabling the framework to synthesize watermark-free images with fine-grained consistency and structural integrity. Comprehensive experiments show that TokenPure achieves state-of-the-art watermark removal and reconstruction fidelity, substantially outperforming existing baselines in both perceptual quality and consistency.
[CV-88] IVCR-200K: A Large-Scale Multi-turn Dialogue Benchmark for Interactive Video Corpus Retrieval SIGIR2025
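[Quick read]: This paper addresses the one-way interaction of traditional video retrieval and video moment retrieval, which limits a system's ability to respond to users' personalized and dynamic needs and, according to the paper, falls short for at least 80.8% of users. The core of the solution is the Interactive Video Corpus Retrieval (IVCR) task, together with IVCR-200K, a high-quality, bilingual, multi-turn conversational dataset that supports both video-level and moment-level retrieval. The paper also designs a comprehensive framework based on multi-modal large language models (MLLMs) that lets users interact in several modes and returns more explainable retrieval results, bringing video search closer to realistic usage.
Link: https://arxiv.org/abs/2512.01312
Authors: Ning Han,Yawen Zeng,Shaohua Long,Chengqing Li,Sijie Yang,Dun Tan,Jianfeng Dong,Jingjing Chen
Affiliations: Xiangtan University; Hunan University; Zhejiang Gongshang University; Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by SIGIR2025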
Abstract:In recent years, significant developments have been made in both video retrieval and video moment retrieval tasks, which respectively retrieve complete videos or moments for a given text query. These advancements have greatly improved user satisfaction during the search process. However, previous work has failed to establish meaningful “interaction” between the retrieval system and the user, and its one-way retrieval paradigm can no longer fully meet the personalization and dynamic needs of at least 80.8% of users. In this paper, we introduce the Interactive Video Corpus Retrieval (IVCR) task, a more realistic setting that enables multi-turn, conversational, and realistic interactions between the user and the retrieval system. To facilitate research on this challenging task, we introduce IVCR-200K, a high-quality, bilingual, multi-turn, conversational, and abstract semantic dataset that supports video retrieval and even moment retrieval. Furthermore, we propose a comprehensive framework based on multi-modal large language models (MLLMs) to help users interact in several modes with more explainable solutions. The extensive experiments demonstrate the effectiveness of our dataset and framework.
[CV-89] Lost in Distortion: Uncovering the Domain Gap Between Computer Vision and Brain Imaging - A Study on Pretraining for Age Prediction
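[Quick read]: This paper asks whether low-quality or noisy scans in large-scale brain imaging pretraining contribute meaningfully to learning domain foundation models, or whether they hinder model performance. The key to the approach is a systematic evaluation of how different data quality levels affect pretraining, quantified through fine-tuning for brain age prediction on external cohorts. The study reveals the relationship between data quality and generalization, and stresses the importance of domain-aware curation for building trusted, generalizable neuroimaging foundation models.
Link: https://arxiv.org/abs/2512.01310
Authors: Yanteng Zhang,Songheng Li,Zeyu Shen,Qizhen Lan,Lipei Zhang,Yang Liu,Vince Calhoun
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: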
Abstract:Large-scale brain imaging datasets provide unprecedented opportunities for developing domain foundation models through pretraining. However, unlike natural image datasets in computer vision, these neuroimaging data often exhibit high heterogeneity in quality, ranging from well-structured scans to severely distorted or incomplete brain volumes. This raises a fundamental question: can noise or low-quality scans contribute meaningfully to pretraining, or do they instead hinder model learning? In this study, we systematically explore the role of data quality level in pretraining and its impact on downstream tasks. Specifically, we perform pretraining on datasets with different quality levels and perform fine-tuning for brain age prediction on external cohorts. Our results show significant performance differences across quality levels, revealing both opportunities and limitations. We further discuss the gap between computer vision practices and clinical neuroimaging standards, emphasizing the necessity of domain-aware curation to ensure trusted and generalizable domain-specific foundation models.
[CV-90] Gaussian Swaying: Surface-Based Framework for Aerodynamic Simulation with 3D Gaussians WACV2026
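【Quick read】: This paper addresses the simulation of wind-driven dynamics of objects in natural scenes (such as branches, flags, and boats) to improve the realism of motion in vision and graphics. Conventional methods rely on mesh-based or particle-based representations, which suffer from high computational cost or discrete data, respectively. The key to the solution is Gaussian Swaying, a continuous surface modeling framework based on 3D Gaussians that unifies simulation and rendering on Gaussian patches: the patches support aerodynamic force computation to drive deformation and also provide normals for lightweight shading, giving fine-grained aerodynamic interaction and realistic visuals while remaining efficient.
Link: https://arxiv.org/abs/2512.01306
Authors: Hongru Yan,Xiang Zhang,Zeyuan Chen,Fangyin Wei,Zhuowen Tu
Affiliations: Tsinghua University; UC San Diego; Princeton University
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted to WACV 2026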
Abstract:Branches swaying in the breeze, flags rippling in the wind, and boats rocking on the water all show how aerodynamics shape natural motion – an effect crucial for realism in vision and graphics. In this paper, we present Gaussian Swaying, a surface-based framework for aerodynamic simulation using 3D Gaussians. Unlike mesh-based methods that require costly meshing, or particle-based approaches that rely on discrete positional data, Gaussian Swaying models surfaces continuously with 3D Gaussians, enabling efficient and fine-grained aerodynamic interaction. Our framework unifies simulation and rendering on the same representation: Gaussian patches, which support force computation for dynamics while simultaneously providing normals for lightweight shading. Comprehensive experiments on both synthetic and real-world datasets across multiple metrics demonstrate that Gaussian Swaying achieves state-of-the-art performance and efficiency, offering a scalable approach for realistic aerodynamic scene simulation.
[CV-91] DCText: Scheduled Attention Masking for Visual Text Generation via Divide-and-Conquer Strategy WACV2026
【Quick read】: This paper addresses the drop in text rendering quality that current text-to-image models show on long or multi-sentence text because global attention is diluted. The key to the solution is DCText, a training-free visual text generation method with a divide-and-conquer strategy: the input text is first split into short segments, each assigned to a designated image region; two attention masks are then applied sequentially during denoising, a Text-Focus mask that concentrates on local text content and a Context-Expansion mask that preserves overall image coherence. Localized Noise Initialization further improves text accuracy and region alignment without adding computational cost.
Link: https://arxiv.org/abs/2512.01302
Authors: Jaewoo Song,Jooyoung Choi,Kanghyun Baek,Sangyub Lee,Daemin Park,Sungroh Yoon
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to WACV 2026
Abstract:Despite recent text-to-image models achieving high-fidelity text rendering, they still struggle with long or multiple texts due to diluted global attention. We propose DCText, a training-free visual text generation method that adopts a divide-and-conquer strategy, leveraging the reliable short-text generation of Multi-Modal Diffusion Transformers. Our method first decomposes a prompt by extracting and dividing the target text, then assigns each segment to a designated region. To accurately render each segment within its region while preserving overall image coherence, we introduce two attention masks - Text-Focus and Context-Expansion - applied sequentially during denoising. Additionally, Localized Noise Initialization further improves text accuracy and region alignment without increasing computational cost. Extensive experiments on single- and multi-sentence benchmarks show that DCText achieves the best text accuracy without compromising image quality while also delivering the lowest generation latency.
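[CV-92] TBT-Former: Learning Temporal Boundary Distributions for Action Localization
【Quick read】: This paper addresses two core challenges of Temporal Action Localization (TAL) in video understanding: precisely localizing action instances with ambiguous or "fuzzy" boundaries, and effectively fusing multi-scale contextual information. The key to the solution is the Temporal Boundary Transformer (TBT-Former) architecture, with three contributions: (1) a higher-capacity scaled Transformer backbone with more attention heads and a larger MLP dimension for stronger temporal feature extraction; (2) a cross-scale feature pyramid network (FPN) that combines a top-down pathway with lateral connections to fuse high-level semantics with low-level temporal detail; (3) a boundary distribution regression head inspired by Generalized Focal Loss (GFL), which recasts boundary regression as probability distribution learning so that boundary uncertainty is modeled and reasoned about explicitly. According to the abstract, these changes set a new level of performance on THUMOS14 and EPIC-Kitchens 100 while remaining competitive on ActivityNet-1.3.
Link: https://arxiv.org/abs/2512.01298
Authors: Thisara Rathnayaka,Uthayasanker Thayasivam
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 6 figures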
Abstract:Temporal Action Localization (TAL) remains a fundamental challenge in video understanding, aiming to identify the start time, end time, and category of all action instances within untrimmed videos. While recent single-stage, anchor-free models like ActionFormer have set a high standard by leveraging Transformers for temporal reasoning, they often struggle with two persistent issues: the precise localization of actions with ambiguous or “fuzzy” temporal boundaries and the effective fusion of multi-scale contextual information. In this paper, we introduce the Temporal Boundary Transformer (TBT-Former), a new architecture that directly addresses these limitations. TBT-Former enhances the strong ActionFormer baseline with three core contributions: (1) a higher-capacity scaled Transformer backbone with an increased number of attention heads and an expanded Multi-Layer Perceptron (MLP) dimension for more powerful temporal feature extraction; (2) a cross-scale feature pyramid network (FPN) that integrates a top-down pathway with lateral connections, enabling richer fusion of high-level semantics and low-level temporal details; and (3) a novel boundary distribution regression head. Inspired by the principles of Generalized Focal Loss (GFL), this new head recasts the challenging task of boundary regression as a more flexible probability distribution learning problem, allowing the model to explicitly represent and reason about boundary uncertainty. Within the paradigm of Transformer-based architectures, TBT-Former advances the formidable benchmark set by its predecessors, establishing a new level of performance on the highly competitive THUMOS14 and EPIC-Kitchens 100 datasets, while remaining competitive on the large-scale ActivityNet-1.3. Our code is available at this https URL
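The boundary distribution head described in this abstract can be illustrated with a small sketch. It assumes, in the spirit of Generalized Focal Loss, that the head emits logits over a fixed set of discrete offset bins and that the predicted offset is the expectation of the softmax distribution. The bin layout and the names `expected_offset` and `offset_variance` are illustrative assumptions, not taken from the paper's code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_offset(logits, bin_width=1.0):
    """Expected boundary offset under a discrete distribution over bins 0..K-1.

    Assumption: bin i represents an offset of i * bin_width time steps.
    """
    probs = softmax(logits)
    return sum(p * i * bin_width for i, p in enumerate(probs))


def offset_variance(logits, bin_width=1.0):
    """Variance of the offset distribution, usable as a rough uncertainty cue."""
    probs = softmax(logits)
    mean = sum(p * i * bin_width for i, p in enumerate(probs))
    return sum(p * (i * bin_width - mean) ** 2 for i, p in enumerate(probs))


# A confident (peaked) boundary and an ambiguous (flat) one.
sharp = [0.0, 0.0, 10.0, 0.0, 0.0]
fuzzy = [1.0, 1.0, 1.0, 1.0, 1.0]

print(expected_offset(sharp), offset_variance(sharp))
print(expected_offset(fuzzy), offset_variance(fuzzy))
```

Both distributions have roughly the same expected offset, but the flat one has a much larger variance, which is the kind of boundary ambiguity a distribution head can expose and a single regressed scalar cannot.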
[CV-93] EGG-Fusion: Efficient 3D Reconstruction with Geometry-aware Gaussian Surfel on the Fly SIGGRAPH
【Quick read】: This paper addresses the dual challenges that real-time 3D reconstruction systems based on differentiable rendering face in computational efficiency and sensitivity to sensor noise, which degrade geometric fidelity and limit practicality. The key to the solution is EGG-Fusion, a system with two core components: robust sparse-to-dense camera tracking, and a geometry-aware Gaussian surfel mapping module built on an information filter that explicitly models sensor noise for high-precision surface reconstruction. The differentiable Gaussian surfel mapping models multi-view consistent surfaces and supports efficient parameter optimization. The system runs in real time at 24 FPS and reduces surface reconstruction error to 0.6 cm on standardized benchmarks (Replica and ScanNet++), an accuracy improvement of over 20% compared with state-of-the-art methods based on 3D Gaussian Splatting (3DGS).
Link: https://arxiv.org/abs/2512.01296
Authors: Xiaokun Pan,Zhenzhe Li,Zhichao Ye,Hongjia Zhai,Guofeng Zhang
Affiliations: State Key Lab of CAD&CG, Zhejiang University; SenseTime Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: SIGGRAPH ASIA 2025
Abstract:Real-time 3D reconstruction is a fundamental task in computer graphics. Recently, differentiable-rendering-based SLAM systems have demonstrated significant potential, enabling photorealistic scene rendering through learnable scene representations such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). Current differentiable rendering methods face dual challenges in real-time computation and sensor noise sensitivity, leading to degraded geometric fidelity in scene reconstruction and limited practicality. To address these challenges, we propose a novel real-time system EGG-Fusion, featuring robust sparse-to-dense camera tracking and a geometry-aware Gaussian surfel mapping module, introducing an information filter-based fusion method that explicitly accounts for sensor noise to achieve high-precision surface reconstruction. The proposed differentiable Gaussian surfel mapping effectively models multi-view consistent surfaces while enabling efficient parameter optimization. Extensive experimental results demonstrate that the proposed system achieves a surface reconstruction error of 0.6 cm on standardized benchmark datasets including Replica and ScanNet++, representing over 20% improvement in accuracy compared to state-of-the-art (SOTA) GS-based methods. Notably, the system maintains real-time processing capabilities at 24 FPS, establishing it as one of the most accurate differentiable-rendering-based real-time reconstruction systems. Project Page: this https URL
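To make the idea of information-filter fusion concrete, here is a minimal one-dimensional sketch. It assumes independent Gaussian measurements of a single scalar (for example, one surfel's depth) and fuses them in information form, where inverse variances add. The abstract does not specify the paper's actual state vector or noise model, so the function name `fuse_information` and the scalar setting are assumptions made purely for illustration.

```python
def fuse_information(measurements):
    """Fuse independent Gaussian measurements in information form.

    measurements: list of (value, variance) pairs with variance > 0.
    Returns the fused (mean, variance).
    """
    info = 0.0       # accumulated information (sum of inverse variances)
    info_vec = 0.0   # accumulated information-weighted values
    for value, variance in measurements:
        if variance <= 0:
            raise ValueError("variance must be positive")
        info += 1.0 / variance
        info_vec += value / variance
    return info_vec / info, 1.0 / info


# Two equally noisy depth readings: the fused variance halves.
print(fuse_information([(1.0, 0.04), (1.2, 0.04)]))

# A precise reading dominates a noisy one.
print(fuse_information([(1.0, 0.01), (2.0, 1.0)]))
```

The additive update is what makes this form convenient for incremental fusion: each new observation contributes its own information term, and noisier observations contribute less.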
[CV-94] Diffusion Model in Latent Space for Medical Image Segmentation Task
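【Quick read】: This paper addresses the limitation that traditional medical image segmentation methods produce only a single mask and cannot capture the uncertainty inherent in the image, which affects the reliability and interpretability of clinical diagnosis. The key to the solution is MedSegLatDiff, a framework combining a variational autoencoder (VAE) with a latent diffusion model: the VAE compresses the input into a low-dimensional latent space to reduce noise and accelerate training, and the diffusion process runs on this compact representation. A weighted cross-entropy loss in the VAE mask reconstruction path better preserves tiny structures (such as lung nodules or small polyps). The framework efficiently generates multiple plausible segmentation hypotheses with confidence maps, improving the diversity, accuracy, and clinical usability of the results.
Link: https://arxiv.org/abs/2512.01292
Authors: Huynh Trinh Ngoc,Toan Nguyen Hai,Ba Luong Son,Long Tran Quoc
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: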
Abstract:Medical image segmentation is crucial for clinical diagnosis and treatment planning. Traditional methods typically produce a single segmentation mask, failing to capture inherent uncertainty. Recent generative models enable the creation of multiple plausible masks per image, mimicking the collaborative interpretation of several clinicians. However, these approaches remain computationally heavy. We propose MedSegLatDiff, a diffusion based framework that combines a variational autoencoder (VAE) with a latent diffusion model for efficient medical image segmentation. The VAE compresses the input into a low dimensional latent space, reducing noise and accelerating training, while the diffusion process operates directly in this compact representation. We further replace the conventional MSE loss with weighted cross entropy in the VAE mask reconstruction path to better preserve tiny structures such as small nodules. MedSegLatDiff is evaluated on ISIC-2018 (skin lesions), CVC-Clinic (polyps), and LIDC-IDRI (lung nodules). It achieves state of the art or highly competitive Dice and IoU scores while simultaneously generating diverse segmentation hypotheses and confidence maps. This provides enhanced interpretability and reliability compared to deterministic baselines, making the model particularly suitable for clinical deployment.
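The motivation for swapping MSE for a weighted cross-entropy can be seen in a toy calculation. The sketch below is a generic per-pixel weighted binary cross-entropy, not the paper's implementation; the weighting scheme (a single `pos_weight` for foreground pixels) and the toy mask are assumptions chosen for illustration.

```python
import math


def weighted_bce(probs, targets, pos_weight=1.0, neg_weight=1.0, eps=1e-7):
    """Mean weighted binary cross-entropy over flattened pixels.

    probs: predicted foreground probabilities in [0, 1].
    targets: ground-truth labels (0 or 1).
    """
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(pos_weight * y * math.log(p)
                   + neg_weight * (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# Toy mask: one foreground pixel (a "tiny nodule") among ten pixels,
# and a prediction that misses it by outputting 0.1 everywhere.
targets = [1] + [0] * 9
probs = [0.1] * 10

unweighted = weighted_bce(probs, targets)
weighted = weighted_bce(probs, targets, pos_weight=9.0)
print(unweighted, weighted)
```

With equal weights, the nine correctly classified background pixels dilute the penalty for the missed foreground pixel; raising the foreground weight makes that miss dominate the loss, which is the intuition behind preserving small structures.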
[CV-95] Supervised Contrastive Machine Unlearning of Background Bias in Sonar Image Classification with Fine-Grained Explainable AI
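【Quick read】: This paper addresses the poor generalization of current AI models for sonar image analysis caused by over-reliance on seafloor background features. The core solution is a framework with two key modules: a Targeted Contrastive Unlearning (TCU) module, which extends the traditional triplet loss to reduce seafloor-induced background bias; and the Unlearn to Explain Sonar Framework (UESF), which adapts the LIME explainer to produce more faithful, localized attribution maps and so visualizes what the model has deliberately forgotten during unlearning, improving interpretability and robustness.
Link: https://arxiv.org/abs/2512.01291
Authors: Kamal Basha S,Athira Nambiar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVIP 2025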
Abstract:Acoustic sonar image analysis plays a critical role in object detection and classification, with applications in both civilian and defense domains. Despite the availability of real and synthetic datasets, existing AI models that achieve high accuracy often over-rely on seafloor features, leading to poor generalization. To mitigate this issue, we propose a novel framework that integrates two key modules: (i) a Targeted Contrastive Unlearning (TCU) module, which extends the traditional triplet loss to reduce seafloor-induced background bias and improve generalization, and (ii) the Unlearn to Explain Sonar Framework (UESF), which provides visual insights into what the model has deliberately forgotten while adapting the LIME explainer to generate more faithful and localized attributions for unlearning evaluation. Extensive experiments across both real and synthetic sonar datasets validate our approach, demonstrating significant improvements in unlearning effectiveness, model robustness, and interpretability.
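For reference, the traditional triplet loss that the TCU module is said to extend can be sketched as follows. This is only the standard baseline, with Euclidean distance and a margin; the abstract does not describe how TCU modifies it, so nothing below should be read as the paper's loss. The toy embeddings are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: pull the positive closer than the negative by a margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = [0.0, 0.0]
near = [0.0, 1.0]   # distance 1 from the anchor
far = [3.0, 4.0]    # distance 5 from the anchor

print(triplet_loss(anchor, near, far))  # positive already much closer: no penalty
print(triplet_loss(anchor, far, near))  # positive farther than negative: penalized
```

One plausible reading of the abstract is that background-dominated samples play a role in the triplet so that embeddings stop clustering by seafloor type, but that mapping is an interpretation rather than something the abstract states.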
[CV-96] nnMobileNet++: Towards Efficient Hybrid Networks for Retinal Image Analysis
【Quick read】: This paper addresses the difficulty purely convolutional neural networks (CNNs) have in capturing long-range dependencies and modeling irregular lesions and elongated vascular structures in retinal image analysis, which limits their reliability for clinical diagnosis. The key to the solution is nnMobileNet++, a hybrid architecture with three core components: (i) dynamic snake convolution for boundary-aware feature extraction; (ii) stage-specific Transformer blocks introduced after the second down-sampling stage to model global context; (iii) retinal image pretraining to improve generalization. The design remains lightweight while improving the modeling of complex vascular patterns and lesion regions, balancing high accuracy with low computational cost.
Link: https://arxiv.org/abs/2512.01273
Authors: Xin Li,Wenhui Zhu,Xuanzhao Dong,Hao Wang,Yujian Xiong,Oana Dumitrascu,Yalin Wang
Affiliations: Arizona State University; Clemson University; Mayo Clinic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Retinal imaging is a critical, non-invasive modality for the early detection and monitoring of ocular and systemic diseases. Deep learning, particularly convolutional neural networks (CNNs), has made significant progress in automated retinal analysis, supporting tasks such as fundus image classification, lesion detection, and vessel segmentation. As a representative lightweight network, nnMobileNet has demonstrated strong performance across multiple retinal benchmarks while remaining computationally efficient. However, purely convolutional architectures inherently struggle to capture long-range dependencies and model the irregular lesions and elongated vascular patterns that characterize retinal images, despite the critical importance of vascular features for reliable clinical diagnosis. To further advance this line of work and extend the original vision of nnMobileNet, we propose nnMobileNet++, a hybrid architecture that progressively bridges convolutional and transformer representations. The framework integrates three key components: (i) dynamic snake convolution for boundary-aware feature extraction, (ii) stage-specific transformer blocks introduced after the second down-sampling stage for global context modeling, and (iii) retinal image pretraining to improve generalization. Experiments on multiple public retinal datasets for classification, together with ablation studies, demonstrate that nnMobileNet++ achieves state-of-the-art or highly competitive accuracy while maintaining low computational cost, underscoring its potential as a lightweight yet effective framework for retinal image analysis.
[CV-97] ViscNet: Vision-Based In-line Viscometry for Fluid Mixing Process
【Quick read】: This paper addresses the problems of conventional viscometers in process monitoring and autonomous laboratory operation: they are invasive and depend on controlled laboratory environments, so they adapt poorly to real industrial settings. The key to the solution is a non-contact, computer-vision-based viscosity measurement method that infers fluid viscosity from the optical distortion of a fixed background pattern as light passes through the mixing-driven, continuously deforming free surface. The method is robust across lighting conditions and, combined with uncertainty quantification, yields viscosity predictions with confidence estimates, offering a reliable and practical alternative for automated process control that needs no physical contact.
Link: https://arxiv.org/abs/2512.01268
Authors: Jongwon Sohn,Juhyeon Moon,Hyunjoon Jung,Jaewook Nam
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Viscosity measurement is essential for process monitoring and autonomous laboratory operation, yet conventional viscometers remain invasive and require controlled laboratory environments that differ substantially from real process conditions. We present a computer-vision-based viscometer that infers viscosity by exploiting how a fixed background pattern becomes optically distorted as light refracts through the mixing-driven, continuously deforming free surface. Under diverse lighting conditions, the system achieves a mean absolute error of 0.113 in log m² s⁻¹ units for regression and reaches up to 81% accuracy in viscosity-class prediction. Although performance declines for classes with closely clustered viscosity values, a multi-pattern strategy improves robustness by providing enriched visual cues. To ensure sensor reliability, we incorporate uncertainty quantification, enabling viscosity predictions with confidence estimates. This stand-off viscometer offers a practical, automation-ready alternative to existing viscometry methods.
[CV-98] Efficient Training of Diffusion Mixture-of-Experts Models: A Practical Recipe
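【Quick read】: This paper addresses the under-explored architectural configuration space of Mixture-of-Experts (MoE) diffusion models, in particular the tendency of current work to focus on routing mechanisms while neglecting the underlying architecture design. The key to the solution is to identify and tune a set of core architectural factors: DeepSeek-style expert modules, alternative intermediate widths, varying expert counts, and enhanced attention positional encodings. Experiments show that carefully tuning these configurations substantially improves performance, often by more than routing improvements alone, yielding an efficient and practical training recipe for latent-space and pixel-space diffusion frameworks that achieves results comparable to or better than strong baselines without increasing the number of activated parameters.
Link: https://arxiv.org/abs/2512.01252
Authors: Yahui Liu,Yang Yue,Jingyuan Zhang,Chenxi Sun,Yang Zhou,Wencong Zeng,Ruiming Tang,Guorui Zhou
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 7 figures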
Abstract:Recent efforts on Diffusion Mixture-of-Experts (MoE) models have primarily focused on developing more sophisticated routing mechanisms. However, we observe that the underlying architectural configuration space remains markedly under-explored. Inspired by the MoE design paradigms established in large language models (LLMs), we identify a set of crucial architectural factors for building effective Diffusion MoE models–including DeepSeek-style expert modules, alternative intermediate widths, varying expert counts, and enhanced attention positional encodings. Our systematic study reveals that carefully tuning these configurations is essential for unlocking the full potential of Diffusion MoE models, often yielding gains that exceed those achieved by routing innovations alone. Through extensive experiments, we present novel architectures that can be efficiently applied to both latent and pixel-space diffusion frameworks, which provide a practical and efficient training recipe that enables Diffusion MoE models to surpass strong baselines while using equal or fewer activated parameters. All code and models are publicly available at: this https URL.
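[CV-99] TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition
【Quick read】: This paper addresses the performance bottleneck in table recognition (TR) caused by dependence on large amounts of labeled data, in particular the difficulty open-source models have in catching up with proprietary models under limited training resources and privacy-compliance constraints. The key to the solution is TRivia, a self-supervised fine-tuning method that combines a question-answering-based reward mechanism with Group Relative Policy Optimization, so that a pretrained vision-language model (VLM) can learn from unlabeled table images in the wild. An attention-guided module generates diverse table-understanding questions, and whether they can be answered correctly from the recognition result serves as the feedback signal, forming a closed-loop learning system that improves table structure recognition and reasoning without human annotation.
Link: https://arxiv.org/abs/2512.01248
Authors: Junyuan Zhang,Bin Wang,Qintong Zhang,Fan Wu,Zichen Wen,Jialin Lu,Junjie Shan,Ziqi Zhao,Shuya Yang,Ziling Wang,Ziyang Miao,Huaping Zhong,Yuhang Zang,Xiaoyi Dong,Ka-Ho Chow,Conghui He
Affiliations: The University of Hong Kong; Shanghai AI Laboratory; Peking University; Shanghai Jiao Tong University; SenseTime
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: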
Abstract:Table recognition (TR) aims to transform table images into semi-structured representations such as HTML or Markdown. As a core component of document parsing, TR has long relied on supervised learning, with recent efforts dominated by fine-tuning vision-language models (VLMs) using labeled data. While VLMs have brought TR to the next level, pushing performance further demands large-scale labeled data that is costly to obtain. Consequently, although proprietary models have continuously pushed the performance boundary, open-source models, often trained with limited resources and, in practice, the only viable option for many due to privacy regulations, still lag far behind. To bridge this gap, we introduce TRivia, a self-supervised fine-tuning method that enables pretrained VLMs to learn TR directly from unlabeled table images in the wild. Built upon Group Relative Policy Optimization, TRivia automatically identifies unlabeled samples that most effectively facilitate learning and eliminates the need for human annotations through a question-answering-based reward mechanism. An attention-guided module generates diverse questions for each table image, and the ability to interpret the recognition results and answer them correctly provides feedback to optimize the TR model. This closed-loop process allows the TR model to autonomously learn to recognize, structure, and reason over tables without labeled data. Leveraging this pipeline, we present TRivia-3B, an open-sourced, compact, and state-of-the-art TR model that surpasses existing systems (e.g., Gemini 2.5 Pro, MinerU2.5) on three popular benchmarks. Model and code are released at: this https URL
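The combination of a question-answering reward with Group Relative Policy Optimization can be sketched in a few lines. The sketch assumes the commonly described group-relative advantage (each sampled output's reward normalized by the mean and standard deviation of its group) and a reward equal to the fraction of generated questions answered correctly. Both `qa_reward` and the exact-match scoring are hypothetical stand-ins; the abstract does not give the paper's reward definition.

```python
import math


def qa_reward(predicted_answers, reference_answers):
    """Hypothetical reward: fraction of questions answered correctly (exact match)."""
    correct = sum(1 for p, r in zip(predicted_answers, reference_answers) if p == r)
    return correct / len(reference_answers)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled outputs for the same input."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled recognition results for the same table image, scored by QA.
rewards = [1.0, 0.5, 0.5, 0.0]
print(group_relative_advantages(rewards))

# If every sample earns the same reward, the group carries no learning signal.
print(group_relative_advantages([0.5, 0.5, 0.5, 0.5]))
```

The second case hints at why sample selection matters: a table on which all sampled outputs score identically yields zero advantages, so it contributes nothing to the update.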
[CV-100] PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards
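【Quick read】: This paper addresses the performance degradation of models in multi-subject personalized image generation, in particular weak subject consistency and text controllability. The solution has two key parts: first, a scalable multi-subject data generation pipeline that uses powerful single-subject generation models to synthesize diverse, high-quality multi-subject training data; second, Pairwise Subject-Consistency Rewards together with general-purpose rewards, used in a reinforcement learning stage to optimize multi-subject performance. With this approach, single-subject personalization models acquire the ability to synthesize multi-subject scenes, and the consistency and controllability of the generated results improve markedly.
Link: https://arxiv.org/abs/2512.01236
Authors: Shulei Wang,Longhui Wei,Xin He,Jianbo Ouyang,Hui Lu,Zhou Zhao,Qi Tian
Affiliations: Zhejiang University; Huawei Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: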
Abstract:Personalized generation models for a single subject have demonstrated remarkable effectiveness, highlighting their significant potential. However, when extended to multiple subjects, existing models often exhibit degraded performance, particularly in maintaining subject consistency and adhering to textual prompts. We attribute these limitations to the absence of high-quality multi-subject datasets and refined post-training strategies. To address these challenges, we propose a scalable multi-subject data generation pipeline that leverages powerful single-subject generation models to construct diverse and high-quality multi-subject training data. Through this dataset, we first enable single-subject personalization models to acquire knowledge of synthesizing multi-image and multi-subject scenarios. Furthermore, to enhance both subject consistency and text controllability, we design a set of Pairwise Subject-Consistency Rewards and general-purpose rewards, which are incorporated into a refined reinforcement learning stage. To comprehensively evaluate multi-subject personalization, we introduce a new benchmark that assesses model performance using seven subsets across three dimensions. Extensive experiments demonstrate the effectiveness of our approach in advancing multi-subject personalized image generation. Github Link: this https URL
[CV-101] S2-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance
【Quick read】: This paper addresses the insufficient understanding of 3D spatial structure that multi-modal large language models (MLLMs) show in 3D Visual Grounding (3DVG) because they process only 2D visual inputs. Existing methods rely on viewpoint-dependent rendering of reconstructed point clouds for explicit structural guidance, which is inefficient and limits spatial reasoning. The key to the solution is the S²-MLLM framework, which strengthens spatial understanding through implicit spatial reasoning: first, it exploits the structure awareness of feed-forward 3D reconstruction to acquire 3D structural knowledge during training; second, it adds a structure-enhanced module (SE) that combines intra-view and inter-view attention with multi-level position encoding to associate visual representations with spatial positions and viewpoint information, enabling efficient 3D spatial reasoning without relying on inefficient point cloud reconstruction.
Link: https://arxiv.org/abs/2512.01223
Authors: Beining Xu,Siting Zhu,Zhao Jin,Junxian Li,Hesheng Wang
Affiliations: Shanghai Jiao Tong University; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 9 figures
Abstract:3D Visual Grounding (3DVG) focuses on locating objects in 3D scenes based on natural language descriptions, serving as a fundamental task for embodied AI and robotics. Recent advances in Multi-modal Large Language Models (MLLMs) have motivated research into extending them to 3DVG. However, MLLMs primarily process 2D visual inputs and struggle with understanding the 3D spatial structure of scenes solely from these limited perspectives. Existing methods mainly utilize viewpoint-dependent rendering of reconstructed point clouds to provide explicit structural guidance for MLLMs in 3DVG tasks, leading to inefficiency and limited spatial reasoning. To address this issue, we propose S$^2$-MLLM, an efficient framework that enhances spatial reasoning in MLLMs through implicit spatial reasoning. We introduce a spatial guidance strategy that leverages the structure awareness of feed-forward 3D reconstruction. By acquiring 3D structural understanding during training, our model can implicitly reason about 3D scenes without relying on inefficient point cloud reconstruction. Moreover, we propose a structure-enhanced module (SE), which first employs intra-view and inter-view attention mechanisms to capture dependencies within views and correspondences across views. The module further integrates multi-level position encoding to associate visual representations with spatial positions and viewpoint information, enabling more accurate structural understanding. Extensive experiments demonstrate that S$^2$-MLLM unifies superior performance, generalization, and efficiency, achieving significant performance gains over existing methods across the ScanRefer, Nr3D, and Sr3D datasets. Code will be available upon acceptance.
[CV-102] M4-BLIP: Advancing Multi-Modal Media Manipulation Detection through Face-Enhanced Local Analysis
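【Quick read】: This paper addresses the neglect of local information (especially facial regions) in multi-modal media manipulation detection, a gap that leaves existing methods insufficiently accurate on localized manipulations. The key to the solution is the M4-BLIP framework: it uses the BLIP-2 model to extract local features and brings in prior knowledge of facial regions as a constraint; a dedicated alignment and fusion module integrates local and global features; and integration with large language models (LLMs) improves the interpretability of detection results.
Link: https://arxiv.org/abs/2512.01214
Authors: Hang Wu,Ke Sun,Jiayi Ji,Xiaoshuai Sun,Rongrong Ji
Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 6 figures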
Abstract:In the contemporary digital landscape, multi-modal media manipulation has emerged as a significant societal threat, impacting the reliability and integrity of information dissemination. Current detection methodologies in this domain often overlook the crucial aspect of localized information, despite the fact that manipulations frequently occur in specific areas, particularly in facial regions. In response to this critical observation, we propose the M4-BLIP framework. This innovative framework utilizes the BLIP-2 model, renowned for its ability to extract local features, as the cornerstone for feature extraction. Complementing this, we incorporate local facial information as prior knowledge. A specially designed alignment and fusion module within M4-BLIP meticulously integrates these local and global features, creating a harmonious blend that enhances detection accuracy. Furthermore, our approach seamlessly integrates with Large Language Models (LLM), significantly improving the interpretability of the detection outcomes. Extensive quantitative and visualization experiments validate the effectiveness of our framework against the state-of-the-art competitors.
[CV-103] Closing the Approximation Gap of Partial AUC Optimization: A Tale of Two Formulations
【Quick read】: This paper addresses uncontrollable approximation error and limited scalability in partial AUC (PAUC) optimization. PAUC matters in real-world scenarios with class imbalance and decision constraints, but computing it requires selecting instances within specific false positive rate (FPR) and/or true positive rate (TPR) intervals, a process shown to be NP-hard that usually requires approximation. Existing methods generally suffer from hard-to-control approximation error or poor efficiency on large-scale data. The key contribution is two simple instance-wise minimax reformulations: one with an asymptotically vanishing approximation gap, and one that is unbiased at the cost of more variables. The core idea is to first recast the problem in an equivalent instance-wise form to lower time complexity, simplify the sample selection procedure through threshold learning, and then apply different smoothing techniques. The resulting algorithms have linear per-iteration complexity and a convergence rate of $O(\epsilon^{-1/3})$, with a tight generalization bound that makes explicit the effect of the TPR/FPR constraint parameters $\alpha$/$\beta$ on generalization, of order $\tilde{O}(\alpha^{-1} n_+^{-1} + \beta^{-1} n_-^{-1})$.
Link: https://arxiv.org/abs/2512.01213
Authors: Yangbangyan Jiang,Qianqian Xu,Huiyang Shao,Zhiyong Yang,Shilong Bao,Xiaochun Cao,Qingming Huang
Affiliations: University of Chinese Academy of Sciences; Institute of Computing Technology, Chinese Academy of Sciences; ByteDance Inc.; School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:As a variant of the Area Under the ROC Curve (AUC), the partial AUC (PAUC) focuses on a specific range of false positive rate (FPR) and/or true positive rate (TPR) in the ROC curve. It is a pivotal evaluation metric in real-world scenarios with both class imbalance and decision constraints. However, selecting instances within these constrained intervals during its calculation is NP-hard, and thus typically requires approximation techniques for practical resolution. Despite the progress made in PAUC optimization over the last few years, most existing methods still suffer from uncontrollable approximation errors or a limited scalability when optimizing the approximate PAUC objectives. In this paper, we close the approximation gap of PAUC optimization by presenting two simple instance-wise minimax reformulations: one with an asymptotically vanishing gap, the other with the unbiasedness at the cost of more variables. Our key idea is to first establish an equivalent instance-wise problem to lower the time complexity, simplify the complicated sample selection procedure by threshold learning, and then apply different smoothing techniques. Equipped with an efficient solver, the resulting algorithms enjoy a linear per-iteration computational complexity w.r.t. the sample size and a convergence rate of $O(\epsilon^{-1/3})$ for typical one-way and two-way PAUCs. Moreover, we provide a tight generalization bound of our minimax reformulations. The result explicitly demonstrates the impact of the TPR/FPR constraints $\alpha$/$\beta$ on the generalization and exhibits a sharp order of $\tilde{O}(\alpha^{-1} n_+^{-1} + \beta^{-1} n_-^{-1})$. Finally, extensive experiments on several benchmark datasets validate the strength of our proposed methods.
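To show what "selecting instances within these constrained intervals" means in practice, here is a small sketch of an empirical one-way partial AUC with FPR restricted to at most `beta`. It assumes the usual pairwise estimator, in which only the top `floor(n_neg * beta)` highest-scoring negatives are compared against all positives. This is an evaluation-time illustration only: the explicit sort-and-select step shown here is the kind of procedure the paper's threshold-learning reformulation is designed to avoid during optimization, and the function name is an assumption.

```python
def one_way_partial_auc(pos_scores, neg_scores, beta):
    """Empirical one-way partial AUC over FPR in [0, beta], normalized to [0, 1].

    Only the k = floor(len(neg_scores) * beta) highest-scoring negatives count.
    Ties between a positive and a negative score count as half.
    """
    k = int(len(neg_scores) * beta)
    if k < 1:
        raise ValueError("beta too small for the number of negatives")
    hard_negatives = sorted(neg_scores, reverse=True)[:k]
    correct = 0.0
    for p in pos_scores:
        for n in hard_negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos_scores) * k)


pos = [0.9, 0.6]
neg = [0.8, 0.5, 0.2, 0.1]

print(one_way_partial_auc(pos, neg, beta=1.0))  # ordinary AUC
print(one_way_partial_auc(pos, neg, beta=0.5))  # only the two hardest negatives
```

Restricting attention to the hardest negatives lowers the score here because the easy negatives, which every positive outranks, no longer pad the average.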
[CV-104] TabletopGen: Instance-Level Interactive 3D Tabletop Scene Generation from Text or Single Image
【Quick read】: This paper addresses the limitations of current text- or image-driven 3D scene generation methods in building high-fidelity, physically interactive tabletop scenes, in particular their difficulty capturing the high-density layouts and complex spatial relations of such scenes. The key to the solution is TabletopGen, a training-free, fully automatic framework whose core innovation is decoupling the complex 3D spatial reasoning into two stages: a Differentiable Rotation Optimizer first recovers each object's rotation precisely, and a Top-view Spatial Alignment mechanism then estimates translation and scale robustly, enabling accurate reconstruction of high-quality, collision-free, simulation-ready 3D tabletop scenes from a 2D reference image.
Link: https://arxiv.org/abs/2512.01204
Authors: Ziqian Wang,Yonghao He,Licheng Yang,Wei Zou,Hongxuan Ma,Liu Liu,Wei Sui,Yuxin Guo,Hu Su
Affiliations: University of Chinese Academy of Sciences; D-Robotics; State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:Generating high-fidelity, physically interactive 3D simulated tabletop scenes is essential for embodied AI–especially for robotic manipulation policy learning and data synthesis. However, current text- or image-driven 3D scene generation methods mainly focus on large-scale scenes, struggling to capture the high-density layouts and complex spatial relations that characterize tabletop scenes. To address these challenges, we propose TabletopGen, a training-free, fully automatic framework that generates diverse, instance-level interactive 3D tabletop scenes. TabletopGen accepts a reference image as input, which can be synthesized by a text-to-image model to enhance scene diversity. We then perform instance segmentation and completion on the reference to obtain per-instance images. Each instance is reconstructed into a 3D model followed by canonical coordinate alignment. The aligned 3D models then undergo pose and scale estimation before being assembled into a collision-free, simulation-ready tabletop scene. A key component of our framework is a novel pose and scale alignment approach that decouples the complex spatial reasoning into two stages: a Differentiable Rotation Optimizer for precise rotation recovery and a Top-view Spatial Alignment mechanism for robust translation and scale estimation, enabling accurate 3D reconstruction from 2D reference. Extensive experiments and user studies show that TabletopGen achieves state-of-the-art performance, markedly surpassing existing methods in visual fidelity, layout accuracy, and physical plausibility, capable of generating realistic tabletop scenes with rich stylistic and spatial diversity. Our code will be publicly available.
[CV-105] First On-Orbit Demonstration of a Geospatial Foundation Model
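[Quick read]: This paper addresses the difficulty of deploying large geospatial foundation models (GeoFMs) on resource-constrained space hardware, a practical bottleneck for Earth observation (EO) tasks, especially in data-limited settings. The key to the solution is to apply model compression and domain adaptation to build compact variants of a Vision Transformer (ViT)-based GeoFM that substantially reduce model size and compute requirements while preserving downstream task performance. Reliable on-orbit inference is demonstrated on the IMAGIN-e payload aboard the International Space Station, providing a viable path toward practical onboard AI for EO missions.
Link: https://arxiv.org/abs/2512.01181
Authors: Andrew Du,Roberto Del Prete,Alejandro Mousist,Nick Manser,Fabrice Marre,Andrew Barton,Carl Seubert,Gabriele Meoni,Tat-Jun Chin
Affiliations: University of Adelaide; European Space Agency; Thales Alenia Space; SmartSat CRC
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: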
Abstract:Geospatial foundation models (GeoFMs) promise broad generalisation capacity for Earth observation (EO) tasks, particularly under data-limited conditions. However, their large size poses a barrier to deployment on resource-constrained space hardware. To address this, we present compact variants of a Vision Transformer (ViT)-based GeoFM that preserve downstream task performance while enabling onboard execution. Evaluation across five downstream tasks and validation in two representative flight environments show that model compression and domain adaptation are critical to reducing size and resource demands while maintaining high performance under operational conditions. We further demonstrate reliable on-orbit inference with the IMAGIN-e payload aboard the International Space Station. These results establish a pathway from large GeoFMs to flight-ready, resource-efficient deployments, expanding the feasibility of onboard AI for EO missions.
[CV-106] VSRD++: Autolabeling for 3D Object Detection via Instance-Aware Volumetric Silhouette Rendering
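[Quick read]: This paper targets the dependence of monocular 3D object detection on large amounts of 3D annotations, which are typically derived from LiDAR point clouds through labor-intensive and costly manual labeling. The key to the solution is VSRD++, a weakly supervised framework whose core ideas are: using neural-field-based volumetric rendering with weak 2D supervision to generate high-quality pseudo labels; in the multi-view autolabeling stage, representing object surfaces as signed distance fields (SDFs) rendered through instance-aware volumetric silhouette rendering, and decomposing each SDF into a cuboid SDF and a residual distance field (RDF) to optimize the 3D bounding boxes; adding velocity attributes and confidence estimates to handle the geometric inconsistency of dynamic objects; and a 3D attribute initialization module that improves the initial dynamic bounding box parameters. The optimized pseudo labels are then used to train monocular 3D detectors, which significantly outperform existing weakly supervised methods on the KITTI-360 dataset.
Link: https://arxiv.org/abs/2512.01178
Authors: Zihua Liu,Hiroki Sakuma,Masatoshi Okutomi
Affiliations: Institute of Science Tokyo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2404.00149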
Abstract:Monocular 3D object detection is a fundamental yet challenging task in 3D scene understanding. Existing approaches heavily depend on supervised learning with extensive 3D annotations, which are often acquired from LiDAR point clouds through labor-intensive labeling processes. To tackle this problem, we propose VSRD++, a novel weakly supervised framework for monocular 3D object detection that eliminates the reliance on 3D annotations and leverages neural-field-based volumetric rendering with weak 2D supervision. VSRD++ consists of a two-stage pipeline: multi-view 3D autolabeling and subsequent monocular 3D detector training. In the multi-view autolabeling stage, object surfaces are represented as signed distance fields (SDFs) and rendered as instance masks via the proposed instance-aware volumetric silhouette rendering. To optimize 3D bounding boxes, we decompose each instance’s SDF into a cuboid SDF and a residual distance field (RDF) that captures deviations from the cuboid. To address the geometry inconsistency commonly observed in volume rendering methods applied to dynamic objects, we model the dynamic objects by including velocity into bounding box attributes as well as assigning confidence to each pseudo-label. Moreover, we also employ a 3D attribute initialization module to initialize the dynamic bounding box parameters. In the monocular 3D object detection phase, the optimized 3D bounding boxes serve as pseudo labels for training monocular 3D object detectors. Extensive experiments on the KITTI-360 dataset demonstrate that VSRD++ significantly outperforms existing weakly supervised approaches for monocular 3D object detection on both static and dynamic scenes. Code is available at this https URL
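The cuboid-plus-residual decomposition mentioned in the abstract can be illustrated with a small sketch. It assumes an axis-aligned box centered at the origin and assumes the instance SDF is the plain sum of the cuboid SDF and the residual; the function names and the `residual_fn` callback are hypothetical, and the paper's actual parameterization (rotation, learned residual network) is not reproduced here.

```python
import math

def cuboid_sdf(point, half_dims):
    """Signed distance from a 3D point to an axis-aligned box centered at the origin.

    Negative inside the box, zero on the surface, positive outside.
    """
    q = [abs(p) - h for p, h in zip(point, half_dims)]
    outside = math.sqrt(sum(max(c, 0.0) ** 2 for c in q))  # distance when outside
    inside = min(max(q), 0.0)                               # depth when inside
    return outside + inside

def instance_sdf(point, half_dims, residual_fn):
    """Assumed decomposition: cuboid SDF plus a residual distance field (RDF)."""
    return cuboid_sdf(point, half_dims) + residual_fn(point)

if __name__ == "__main__":
    box = (1.0, 1.0, 1.0)
    print(cuboid_sdf((2.0, 0.0, 0.0), box))   # one unit outside the +x face
    print(cuboid_sdf((0.0, 0.0, 0.0), box))   # box center, one unit inside
    print(instance_sdf((2.0, 0.0, 0.0), box, lambda p: 0.1))
```

In this view the box parameters carry the 3D bounding box being optimized, while the residual absorbs whatever part of the object shape a cuboid cannot express.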
[CV-107] Real-Time On-the-Go Annotation Framework Using YOLO for Automated Dataset Generation CEC65580 CEC2025
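[Quick read]: This paper addresses the low efficiency and limited accuracy of dataset annotation when deploying object detection models such as YOLO in agricultural settings, particularly in real-time applications that require rapid decision-making. Traditional annotation relies on extensive manual labeling after data collection, which is slow and costly. The key to the solution is a real-time annotation framework that runs YOLO models on edge devices and labels images automatically as they are captured, which substantially shortens data preparation while maintaining high annotation quality. Experiments compare YOLOv5, YOLOv8, and YOLOv12 under single-class versus multi-class annotation and pretrained versus from-scratch training, and show that pretrained and single-class configurations have advantages in convergence speed, robustness, and accuracy, supporting the feasibility and effectiveness of the approach.
Link: https://arxiv.org/abs/2512.01165
Authors: Mohamed Abdallah Salem(1),Ahmed Harb Rabia(1) ((1) North Dakota State University)
Affiliations: North Dakota State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Copyright 2025 IEEE. This is the author’s version of the work that has been accepted for publication in Proceedings of the 5. Interdisciplinary Conference on Electrics and Computer (INTCEC 2025) 15-16 September 2025, Chicago-USA. The final version of record is available at: this https URL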
Abstract:Efficient and accurate annotation of datasets remains a significant challenge for deploying object detection models such as You Only Look Once (YOLO) in real-world applications, particularly in agriculture where rapid decision-making is critical. Traditional annotation techniques are labor-intensive, requiring extensive manual labeling post data collection. This paper presents a novel real-time annotation approach leveraging YOLO models deployed on edge devices, enabling immediate labeling during image capture. To comprehensively evaluate the efficiency and accuracy of our proposed system, we conducted an extensive comparative analysis using three prominent YOLO architectures (YOLOv5, YOLOv8, YOLOv12) under various configurations: single-class versus multi-class annotation and pretrained versus scratch-based training. Our analysis includes detailed statistical tests and learning dynamics, demonstrating significant advantages of pretrained and single-class configurations in terms of model convergence, performance, and robustness. Results strongly validate the feasibility and effectiveness of our real-time annotation framework, highlighting its capability to drastically reduce dataset preparation time while maintaining high annotation quality.
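As a rough illustration of labeling during capture, the sketch below turns detector outputs into label lines in the widely used YOLO text format (class index followed by normalized center x, center y, width, height). The detection tuples, the `min_conf` threshold, and both function names are assumptions made for this example; the abstract does not describe the paper's actual pipeline at this level of detail.

```python
def to_yolo_label(class_id, box_xyxy, img_w, img_h):
    """Convert a pixel-space (x1, y1, x2, y2) box into one YOLO-format label line."""
    x1, y1, x2, y2 = box_xyxy
    cx = (x1 + x2) / 2 / img_w   # normalized box center, x
    cy = (y1 + y2) / 2 / img_h   # normalized box center, y
    w = (x2 - x1) / img_w        # normalized box width
    h = (y2 - y1) / img_h        # normalized box height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

def auto_annotate(detections, img_w, img_h, min_conf=0.5):
    """Keep detections at or above min_conf and emit one label line per box.

    `detections` is a hypothetical list of (class_id, confidence, box_xyxy) tuples.
    """
    return [
        to_yolo_label(class_id, box, img_w, img_h)
        for class_id, conf, box in detections
        if conf >= min_conf
    ]

if __name__ == "__main__":
    dets = [(0, 0.9, (100, 50, 300, 250)), (1, 0.2, (0, 0, 10, 10))]
    for line in auto_annotate(dets, img_w=640, img_h=480):
        print(line)
```

The confidence threshold is the main knob in a scheme like this: a higher value yields fewer but cleaner labels, a lower value yields more labels that may need review.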
[CV-108] DPAC: Distribution-Preserving Adversarial Control for Diffusion Sampling
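[Quick read]: This paper addresses the degradation of sample quality in adversarially guided diffusion sampling for generative AI, which stems from the accumulated deviation between the adversarially controlled trajectory and the nominal diffusion path. The key to the solution is to take a stochastic optimal control (SOC) view, formalize the degradation as a path-space Kullback-Leibler divergence (path-KL), and show that it equals the control energy. Building on this, the authors propose DPAC (Distribution-Preserving Adversarial Control), which projects adversarial gradients onto the tangent space defined by the score function of the generative model, thereby minimizing path-KL and suppressing distributional drift. In theory, the method simultaneously tightens upper bounds on the 2-Wasserstein distance and the Fréchet Inception Distance (FID); in discrete solvers it cancels the leading O(Δt) error term, yielding an O(Δt^2) quality gap, and it is second-order robust to score or metric approximation.
Link: https://arxiv.org/abs/2512.01153
Authors: Han-Jin Lee,Han-Ju Lee,Jin-Seong Kim,Seok-Hwan Choi
Affiliations: Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: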
Abstract:Adversarially guided diffusion sampling often achieves the target class, but sample quality degrades as deviations between the adversarially controlled and nominal trajectories accumulate. We formalize this degradation as a path-space Kullback-Leibler divergence (path-KL) between controlled and nominal (uncontrolled) diffusion processes, thereby showing via Girsanov’s theorem that it exactly equals the control energy. Building on this stochastic optimal control (SOC) view, we theoretically establish that minimizing this path-KL simultaneously tightens upper bounds on both the 2-Wasserstein distance and Fréchet Inception Distance (FID), revealing a principled connection between adversarial control energy and perceptual fidelity. From a variational perspective, we derive a first-order optimality condition for the control: among all directions that yield the same classification gain, the component tangent to iso-(log-)density surfaces (i.e., orthogonal to the score) minimizes path-KL, whereas the normal component directly increases distributional drift. This leads to DPAC (Distribution-Preserving Adversarial Control), a diffusion guidance rule that projects adversarial gradients onto the tangent space defined by the generative score geometry. We further show that in discrete solvers, the tangent projection cancels the O(\Delta t) leading error term in the Wasserstein distance, achieving an O(\Delta t^2) quality gap; moreover, it remains second-order robust to score or metric approximation. Empirical studies on ImageNet-100 validate the theoretical predictions, confirming that DPAC achieves lower FID and estimated path-KL at matched attack success rates.
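The tangent projection described in the abstract (keep the part of the adversarial gradient that is orthogonal to the score) can be sketched as a plain vector projection, g_tan = g - (g·s / ||s||^2) s. This is a minimal illustration under the assumption of a Euclidean metric and flat Python lists; `project_tangent` is a hypothetical name, and the paper's guidance rule may use a different metric or scaling.

```python
def project_tangent(grad, score, eps=1e-12):
    """Remove the component of `grad` that is parallel to `score`.

    The result is orthogonal to the score, i.e. tangent to the local
    iso-density surface under a Euclidean metric (an assumption here).
    """
    dot = sum(g * s for g, s in zip(grad, score))
    norm_sq = sum(s * s for s in score)
    if norm_sq < eps:
        # Score is numerically zero: nothing to project out.
        return list(grad)
    coef = dot / norm_sq
    return [g - coef * s for g, s in zip(grad, score)]

if __name__ == "__main__":
    g = [1.0, 1.0]          # adversarial (classifier) gradient
    s = [0.0, 2.0]          # score of the generative model
    g_tan = project_tangent(g, s)
    print(g_tan)
    print(sum(a * b for a, b in zip(g_tan, s)))  # inner product with the score
```

The discarded normal component is the part the abstract associates with distributional drift, which is why only the tangent part is used for guidance.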
[CV-109] Open-Set Domain Adaptation Under Background Distribution Shift: Challenges and A Provably Efficient Solution
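[Quick read]: This paper addresses the performance drop of open-set recognition when the background distribution shifts. Existing guarantees typically assume that the distribution of known classes is fixed, whereas the proposed method is still guaranteed to identify novel classes in the challenging case where the background distribution changes. The solution rests on a key assumption that the novel class is separable from the known classes in feature space; on this basis the authors provide theoretical guarantees that the method outperforms a representative baseline in a simplified overparameterized setting. With techniques that make the method scalable and robust, comprehensive experiments on image and text data show that it significantly outperforms existing methods under background shift and offer new insights into how the size of the novel class affects performance.
Link: https://arxiv.org/abs/2512.01152
Authors: Shravan Chaudhari,Yoav Wald,Suchi Saria
Affiliations: Johns Hopkins University; New York University; Bayesian Health
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: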
Abstract:As we deploy machine learning systems in the real world, a core challenge is to maintain a model that is performant even as the data shifts. Such shifts can take many forms: new classes may emerge that were absent during training, a problem known as open-set recognition, and the distribution of known categories may change. Guarantees on open-set recognition are mostly derived under the assumption that the distribution of known classes, which we call the background distribution, is fixed. In this paper we develop a method that is guaranteed to solve open-set recognition even in the challenging case where the background distribution shifts. We prove that the method works under benign assumptions that the novel class is separable from the non-novel classes, and provide theoretical guarantees that it outperforms a representative baseline in a simplified overparameterized setting. We develop techniques to make the method scalable and robust, and perform comprehensive empirical evaluations on image and text data. The results show that the method significantly outperforms existing open-set recognition methods under background shift. Moreover, we provide new insights into how factors such as the size of the novel class influence performance, an aspect that has not been extensively explored in prior work.
[CV-110] SocialFusion: Addressing Social Degradation in Pre-trained Vision-Language Models
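[Quick read]: This paper addresses the negative transfer that current vision-language models (VLMs) exhibit when learning multiple social perception tasks jointly. The root cause is identified as "social degradation": general visual-linguistic pre-training weakens the visual encoder's ability to represent nuanced social information. The key to the solution is SocialFusion, a framework that learns a minimal connection module between a frozen visual encoder and a language model, achieving positive transfer across social tasks. It shows synergistic gains on all five social perception tasks and reaches performance comparable to task-specific state-of-the-art models.
Link: https://arxiv.org/abs/2512.01148
Authors: Hamza Tahboub,Weiyan Shi,Gang Hua,Huaizu Jiang
Affiliations: Northeastern University; Amazon.com, Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 22 pages, 10 figures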
Abstract:Understanding social interactions from visual cues is a fundamental challenge for a socially competent AI. While powerful pre-trained vision-language models (VLMs) have shown remarkable general capabilities, they surprisingly struggle to unify and learn multiple social perception tasks simultaneously, often exhibiting negative transfer. We identify that this negative transfer stems from a critical issue we term “social degradation,” whereby the general visual-linguistic pre-training process of VLMs impairs the visual encoder’s ability to represent nuanced social information. We investigate this behavior further under two lenses: decodability through linear representation probing and compatibility through gradient conflict analysis, revealing that both play a role in the degradation, especially the former, which is significantly compromised in the VLM pre-training process. To address these issues, we propose SocialFusion, a unified framework that learns a minimal connection between a frozen visual encoder and a language model. Compared with existing VLMs, it exhibits positive transfer across all five social tasks, leveraging synergies between them to enhance overall performance and achieves comparable performance to task-specific state-of-the-art models on various benchmarks. Our findings suggest that current VLM pre-training strategies may be detrimental to acquiring general social competence and highlight the need for more socially-aware training paradigms.
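The abstract mentions gradient conflict analysis without giving details. A common proxy, used here purely as an assumption, is the cosine similarity between per-task gradients on shared parameters, with a negative value read as a conflict. The task names and gradient values below are made up for illustration.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two gradient vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)

def conflicting_pairs(task_grads):
    """Return task pairs whose gradients point in opposing directions."""
    return [
        (name_a, name_b)
        for (name_a, grad_a), (name_b, grad_b) in combinations(task_grads.items(), 2)
        if cosine(grad_a, grad_b) < 0.0
    ]

if __name__ == "__main__":
    # Hypothetical per-task gradients on the same shared parameters.
    grads = {
        "gaze": [1.0, 0.0],
        "emotion": [-1.0, 0.5],
        "group": [0.5, 1.0],
    }
    print(conflicting_pairs(grads))
```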
[CV-111] Weakly Supervised Continuous Micro-Expression Intensity Estimation Using Temporal Deep Neural Network
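[Quick read]: This paper addresses continuous modeling of micro-expression intensity, that is, how to estimate the evolution of intensity over time without frame-level annotations. Prior work mostly classifies discrete categories and ignores the temporal dynamics; the bottleneck is the lack of frame-level intensity labels, which makes fully supervised regression impractical. The key to the solution is a unified framework that relies only on sparse temporal labels (onset, apex, offset): a simple triangular prior converts the sparse temporal landmarks into dense pseudo-intensity trajectories, and a lightweight temporal regression model combining a ResNet18 encoder with a bidirectional GRU predicts frame-wise intensity directly from image sequences. The method requires no frame-level annotation and performs well on SAMM and CASME II (for example, a Spearman correlation of 0.9014 on SAMM), confirming that structured pseudo labels and temporal modeling are central to capturing the rise-apex-fall dynamics of micro-expressions.
Link: https://arxiv.org/abs/2512.01145
Authors: Riyadh Mohammed Almushrafy(Majmaah University, Saudi Arabia)
Affiliations: Majmaah University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: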
Abstract:Micro-facial expressions are brief and involuntary facial movements that reflect genuine emotional states. While most prior work focuses on classifying discrete micro-expression categories, far fewer studies address the continuous evolution of intensity over time. Progress in this direction is limited by the lack of frame-level intensity labels, which makes fully supervised regression impractical. We propose a unified framework for continuous micro-expression intensity estimation using only weak temporal labels (onset, apex, offset). A simple triangular prior converts sparse temporal landmarks into dense pseudo-intensity trajectories, and a lightweight temporal regression model that combines a ResNet18 encoder with a bidirectional GRU predicts frame-wise intensity directly from image sequences. The method requires no frame-level annotation effort and is applied consistently across datasets through a single preprocessing and temporal alignment pipeline. Experiments on SAMM and CASME II show strong temporal agreement with the pseudo-intensity trajectories. On SAMM, the model reaches a Spearman correlation of 0.9014 and a Kendall correlation of 0.7999, outperforming a frame-wise baseline. On CASME II, it achieves up to 0.9116 and 0.8168, respectively, when trained without the apex-ranking term. Ablation studies confirm that temporal modeling and structured pseudo labels are central to capturing the rise-apex-fall dynamics of micro-facial movements. To our knowledge, this is the first unified approach for continuous micro-expression intensity estimation using only sparse temporal annotations.
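The triangular prior can be pictured as linear interpolation between the three landmarks. The sketch below assumes intensity 0 at onset and offset, 1 at the apex, and 0 outside the annotated interval; those endpoint values and the function name are assumptions, since the abstract only says that a triangular prior produces dense pseudo-intensity trajectories.

```python
def triangular_intensity(num_frames, onset, apex, offset):
    """Dense pseudo-intensity per frame from sparse (onset, apex, offset) indices.

    Rises linearly from 0 at onset to 1 at apex, then falls linearly to 0 at offset.
    """
    trajectory = []
    for t in range(num_frames):
        if t < onset or t > offset:
            value = 0.0                                   # outside the expression
        elif t <= apex:
            # Rising edge; guard the degenerate case where onset == apex.
            value = (t - onset) / (apex - onset) if apex > onset else 1.0
        else:
            # Falling edge; apex < t <= offset guarantees a positive denominator.
            value = (offset - t) / (offset - apex)
        trajectory.append(value)
    return trajectory

if __name__ == "__main__":
    print(triangular_intensity(num_frames=7, onset=1, apex=3, offset=5))
```

A regression model trained on such targets is scored against the same trajectories, which is why the reported Spearman and Kendall values measure agreement with the pseudo-labels rather than with human intensity ratings.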
[CV-112] OmniFD: A Unified Model for Versatile Face Forgery Detection
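[Quick read]: This paper addresses the computational redundancy and unused cross-task correlations that result from using independent task-specific models in face forgery detection (FFD). The key to the solution is OmniFD, a unified framework in which a shared Swin Transformer encoder extracts unified 4D spatiotemporal representations from image and video inputs, a cross-task interaction module with learnable queries dynamically captures dependencies between tasks, and lightweight decoding heads map the refined representations to four core tasks (image classification, video classification, spatial localization, and temporal localization). This enables joint multi-task learning and fine-grained knowledge transfer, which improves detection performance while reducing model complexity and training cost.
Link: https://arxiv.org/abs/2512.01128
Authors: Haotian Liu,Haoyu Chen,Chenhui Pan,You Hu,Guoying Zhao,Xiaobai Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: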
Abstract:Face forgery detection encompasses multiple critical tasks, including identifying forged images and videos and localizing manipulated regions and temporal segments. Current approaches typically employ task-specific models with independent architectures, leading to computational redundancy and ignoring potential correlations across related tasks. We introduce OmniFD, a unified framework that jointly addresses four core face forgery detection tasks within a single model, i.e., image and video classification, spatial localization, and temporal localization. Our architecture consists of three principal components: (1) a shared Swin Transformer encoder that extracts unified 4D spatiotemporal representations from both images and video inputs, (2) a cross-task interaction module with learnable queries that dynamically captures inter-task dependencies through attention-based reasoning, and (3) lightweight decoding heads that transform refined representations into corresponding predictions for all FFD tasks. Extensive experiments demonstrate OmniFD’s advantage over task-specific models. Its unified design leverages multi-task learning to capture generalized representations across tasks, especially enabling fine-grained knowledge transfer that facilitates other tasks. For example, video classification accuracy improves by 4.63% when image data are incorporated. Furthermore, by unifying images, videos and the four tasks within one framework, OmniFD achieves superior performance across diverse benchmarks with high efficiency and scalability, e.g., reducing 63% model parameters and 50% training time. It establishes a practical and generalizable solution for comprehensive face forgery detection in real-world applications. The source code is made available at this https URL.
[CV-113] Structural Prognostic Event Modeling for Multimodal Cancer Survival Analysis
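[Quick read]: This paper addresses the inefficient and ineffective modeling of intra- and inter-modal interactions in cancer survival prediction from multimodal data (such as histology images and gene expression profiles), which is caused by the high dimensionality and complexity of the inputs; in particular, sparse and unannotated structural prognostic events are hard to capture. The core of the solution is SlotSPE, a framework that follows a factorial coding strategy and uses slot attention to compress each patient's multimodal inputs into modality-specific sets of mutually distinctive slots. These compact representations encode latent high-level structural signals (such as spatial histologic patterns or pathway co-activations), which enables efficient and interpretable modeling of cross-modal interactions and supports the incorporation of biological priors to strengthen prognostic relevance.
Link: https://arxiv.org/abs/2512.01116
Authors: Yilan Zhang,Li Nanbo,Changchun Yang,Jürgen Schmidhuber,Xin Gao
Affiliations: King Abdullah University of Science and Technology (KAUST)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 37 pages, 14 Figures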
Abstract:The integration of histology images and gene profiles has shown great promise for improving survival prediction in cancer. However, current approaches often struggle to model intra- and inter-modal interactions efficiently and effectively due to the high dimensionality and complexity of the inputs. A major challenge is capturing critical prognostic events that, though few, underlie the complexity of the observed inputs and largely determine patient outcomes. These events, manifested as high-level structural signals such as spatial histologic patterns or pathway co-activations, are typically sparse, patient-specific, and unannotated, making them inherently difficult to uncover. To address this, we propose SlotSPE, a slot-based framework for structural prognostic event modeling. Specifically, inspired by the principle of factorial coding, we compress each patient’s multimodal inputs into compact, modality-specific sets of mutually distinctive slots using slot attention. By leveraging these slot representations as encodings for prognostic events, our framework enables both efficient and effective modeling of complex intra- and inter-modal interactions, while also facilitating seamless incorporation of biological priors that enhance prognostic relevance. Extensive experiments on ten cancer benchmarks show that SlotSPE outperforms existing methods in 8 out of 10 cohorts, achieving an overall improvement of 2.9%. It remains robust under missing genomic data and delivers markedly improved interpretability through structured event decomposition.
[CV-114] Estimation of Kinematic Motion from Dashcam Footage
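[Quick read]: This paper studies how accurately dashcam video can be used to predict the actual kinematic state of a car-like vehicle, including vehicle speed, yaw, the presence of a lead vehicle, and its relative distance and speed. The key to the solution is to combine ground-truth motion data from the vehicle's controller area network (CAN) bus with a time-synchronized dashcam video stream, train neural network models to quantify prediction accuracy, and describe a data collection procedure based on open-source tools and consumer-grade equipment so that other researchers can run similar experiments.
Link: https://arxiv.org/abs/2512.01104
Authors: Evelyn Zhang,Alex Richardson,Jonathan Sprinkle
Affiliations: Vanderbilt University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 10 figures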
Abstract:The goal of this paper is to explore the accuracy of dashcam footage to predict the actual kinematic motion of a car-like vehicle. Our approach uses ground truth information from the vehicle’s on-board data stream, through the controller area network, and a time-synchronized dashboard camera, mounted to a consumer-grade vehicle, for 18 hours of footage and driving. The contributions of the paper include neural network models that allow us to quantify the accuracy of predicting the vehicle speed and yaw, as well as the presence of a lead vehicle, and its relative distance and speed. In addition, the paper describes how other researchers can gather their own data to perform similar experiments, using open-source tools and off-the-shelf technology.
[CV-115] Learning Eigenstructures of Unstructured Data Manifolds
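[Quick read]: This paper addresses the reliance of traditional spectral analysis in geometry processing on manually chosen operators, discretization, and eigensolvers, which makes it hard to apply directly to unstructured data such as point clouds or high-dimensional image manifolds. The key to the solution is a learning framework grounded in optimal-approximation theory: a neural network is trained to decompose an implicit approximation operator by minimizing the reconstruction error over a chosen distribution of probe functions, so that a spectral basis is learned directly from data. Without explicitly constructing a Laplacian operator or computing its eigendecomposition, the method recovers a geometrically meaningful spectral basis, the sampling density of the implicit metric, and the eigenvalues of the underlying operator. Because it makes no assumptions about the manifold (such as meshing or dimensionality), it applies uniformly to data of any dimension.
Link: https://arxiv.org/abs/2512.01103
Authors: Roy Velich,Arkadi Piven,David Bensaïd,Daniel Cremers,Thomas Dagès,Ron Kimmel
Affiliations: Technion-Israel Institute of Technology; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: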
Abstract:We introduce a novel framework that directly learns a spectral basis for shape and manifold analysis from unstructured data, eliminating the need for traditional operator selection, discretization, and eigensolvers. Grounded in optimal-approximation theory, we train a network to decompose an implicit approximation operator by minimizing the reconstruction error in the learned basis over a chosen distribution of probe functions. For suitable distributions, they can be seen as an approximation of the Laplacian operator and its eigendecomposition, which are fundamental in geometry processing. Furthermore, our method recovers in a unified manner not only the spectral basis, but also the implicit metric’s sampling density and the eigenvalues of the underlying operator. Notably, our unsupervised method makes no assumption on the data manifold, such as meshing or manifold dimensionality, allowing it to scale to arbitrary datasets of any dimension. On point clouds lying on surfaces in 3D and high-dimensional image manifolds, our approach yields meaningful spectral bases, that can resemble those of the Laplacian, without explicit construction of an operator. By replacing the traditional operator selection, construction, and eigendecomposition with a learning-based approach, our framework offers a principled, data-driven alternative to conventional pipelines. This opens new possibilities in geometry processing for unstructured data, particularly in high-dimensional spaces.
[CV-116] CycliST: A Video Language Model Benchmark for Reasoning on Cyclical State Transitions
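[Quick read]: This paper addresses the limitations of current Video Language Models (VLMs) in handling cyclical state transitions, in particular their weak spatio-temporal cognition and textual reasoning. Existing VLMs struggle to reliably detect and exploit cyclic patterns (such as linear motion, orbital motion, and time-dependent changes in visual attributes like color or scale), lack temporal understanding, and cannot extract quantitative information from scenes (for example, the number of moving objects). The key to the solution is CycliST, a new benchmark dataset that generates synthetic, richly structured video sequences to mimic periodic real-world processes and uses a tiered evaluation scheme that progressively raises difficulty (more cyclic objects, more scene clutter, and varied lighting). This systematically exposes the performance bottlenecks of VLMs on cyclical dynamics and provides a targeted challenge and a comprehensive evaluation framework for future visual reasoning models.
Link: https://arxiv.org/abs/2512.01095
Authors: Simon Kohaut,Daniel Ochs,Shun Zhang,Benedict Flade,Julian Eggert,Kristian Kersting,Devendra Singh Dhami
Affiliations: TU Darmstadt; Konrad Zuse School of Excellence in Learning and Intelligent Systems (ELIZA); Honda Research Institute Europe GmbH; TU Eindhoven; Hessian Center for AI (hessian.AI); Center for Cognitive Science; German Center for Artificial Intelligence (DFKI)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: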
Abstract:We present CycliST, a novel benchmark dataset designed to evaluate Video Language Models (VLM) on their ability for textual reasoning over cyclical state transitions. CycliST captures fundamental aspects of real-world processes by generating synthetic, richly structured video sequences featuring periodic patterns in object motion and visual attributes. CycliST employs a tiered evaluation system that progressively increases difficulty through variations in the number of cyclic objects, scene clutter, and lighting conditions, challenging state-of-the-art models on their spatio-temporal cognition. We conduct extensive experiments with current state-of-the-art VLMs, both open-source and proprietary, and reveal their limitations in generalizing to cyclical dynamics such as linear and orbital motion, as well as time-dependent changes in visual attributes like color and scale. Our results demonstrate that present-day VLMs struggle to reliably detect and exploit cyclic patterns, lack a notion of temporal understanding, and are unable to extract quantitative insights from scenes, such as the number of objects in motion, highlighting a significant technical gap that needs to be addressed. More specifically, we find no single model consistently leads in performance: neither size nor architecture correlates strongly with outcomes, and no model succeeds equally well across all tasks. By providing a targeted challenge and a comprehensive evaluation framework, CycliST paves the way for visual reasoning models that surpass the state-of-the-art in understanding periodic patterns.
[CV-117] Accelerating Inference of Masked Image Generators via Reinforcement Learning
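[Quick read]: This paper addresses the slow inference of Masked Generative Models (MGMs), which need many sampling steps to generate high-quality images. The key to the solution is to recast acceleration from the usual distribution-matching formulation (such as knowledge distillation) as a reinforcement learning (RL) problem: a pretrained model is fine-tuned against a joint objective that combines an image-quality reward with an inference-speed reward, which substantially reduces the number of sampling steps while maintaining image quality. Experiments show a 3x speedup at comparable image quality.
Link: https://arxiv.org/abs/2512.01094
Authors: Pranav Subbaraman,Shufan Li,Siyan Zhao,Aditya Grover
Affiliations: UCLA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 9 figures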
Abstract:Masked Generative Models (MGMs) demonstrate strong capabilities in generating high-fidelity images. However, they need many sampling steps to create high-quality generations, resulting in slow inference speed. In this work, we propose Speed-RL, a novel paradigm for accelerating pretrained MGMs to generate high-quality images in fewer steps. Unlike conventional distillation methods, which formulate the acceleration problem as a distribution matching problem where a few-step student model is trained to match the distribution generated by a many-step teacher model, we consider this problem as a reinforcement learning problem. Since the goal of acceleration is to generate high-quality images in fewer steps, we can combine a quality reward with a speed reward and finetune the base model using reinforcement learning with the combined reward as the optimization target. Through extensive experiments, we show that the proposed method is able to accelerate the base model by a factor of 3x while maintaining comparable image quality.
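To make the idea of a combined reward concrete, here is a toy sketch. The linear speed term, the `speed_weight` parameter, and the function name are all assumptions for illustration; the abstract only states that a quality reward and a speed reward are combined, not how.

```python
def combined_reward(quality, steps_used, max_steps, speed_weight=0.5):
    """Toy reward: image-quality score plus a bonus for using fewer sampling steps.

    `quality` is assumed to come from some external image-quality scorer.
    The speed term is 0 when all steps are used and approaches 1 as steps shrink.
    """
    if not 0 < steps_used <= max_steps:
        raise ValueError("steps_used must be in (0, max_steps]")
    speed = 1.0 - steps_used / max_steps
    return quality + speed_weight * speed

if __name__ == "__main__":
    # Same quality, fewer steps: the reward goes up.
    print(combined_reward(quality=0.8, steps_used=32, max_steps=32))
    print(combined_reward(quality=0.8, steps_used=8, max_steps=32))
```

With a weighting like this, a policy that drops steps is only rewarded as long as the quality score does not fall by more than the speed bonus it gains.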
[CV-118] Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer
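[Quick read]: This paper addresses robust vision-guided humanoid loco-manipulation in complex environments, in particular how to train a policy in simulation that can be deployed directly in the real world and outperform human teleoperators. The key to the solution is a teacher-student-bootstrap learning framework that combines a staged-reset exploration strategy, which stabilizes long-horizon privileged-policy training, with a GRPO-based fine-tuning procedure, which mitigates partial observability and improves closed-loop consistency. The result is zero-shot generalization across door types using pure RGB perception, without real-world fine-tuning.
Link: https://arxiv.org/abs/2512.01061
Authors: Haoru Xue,Tairan He,Zi Wang,Qingwei Ben,Wenli Xiao,Zhengyi Luo,Xingye Da,Fernando Castañeda,Guanya Shi,Shankar Sastry,Linxi “Jim” Fan,Yuke Zhu
Affiliations: NVIDIA; UC Berkeley; CMU; CUHK
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL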
Abstract:Recent progress in GPU-accelerated, photorealistic simulation has opened a scalable data-generation path for robot learning, where massive physics and visual randomization allow policies to generalize beyond curated environments. Building on these advances, we develop a teacher-student-bootstrap learning framework for vision-based humanoid loco-manipulation, using articulated-object interaction as a representative high-difficulty benchmark. Our approach introduces a staged-reset exploration strategy that stabilizes long-horizon privileged-policy training, and a GRPO-based fine-tuning procedure that mitigates partial observability and improves closed-loop consistency in sim-to-real RL. Trained entirely on simulation data, the resulting policy achieves robust zero-shot performance across diverse door types and outperforms human teleoperators by up to 31.7% in task completion time under the same whole-body control stack. This represents the first humanoid sim-to-real policy capable of diverse articulated loco-manipulation using pure RGB perception.
[CV-119] Parameter Reduction Improves Vision Transformers: A Comparative Study of Sharing and Width Reduction KR
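[Quick read]: This paper addresses the observation that Vision Transformer (ViT) performance does not increase monotonically with model size, and in particular that ViT-B/16 trained on ImageNet-1K is overparameterized. Although larger models are usually expected to perform better, simplifying the MLP (multilayer perceptron) blocks yields equal or better accuracy and training stability. The solution consists of two lightweight strategies: GroupedMLP shares MLP weights between adjacent transformer blocks, removing 32.7% of the parameters at unchanged computational cost and reaching 81.47% top-1 accuracy; ShallowMLP halves the MLP hidden dimension, increasing inference throughput by 38% while reaching 81.25%. Both outperform the baseline (81.05%) and markedly improve training stability (peak-to-final accuracy degradation drops from 0.47% to 0.03-0.06%), suggesting that careful parameter allocation, such as parameter sharing and width constraints, can act as a useful inductive bias in Vision Transformer design.
Link: https://arxiv.org/abs/2512.01059
Authors: Anantha Padmanaban Krishna Kumar(Boston University)
Affiliations: Boston University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages total (6 pages main text, 1 page references), 1 figure, 2 tables. Code available at this https URL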
Abstract:Although scaling laws and many empirical results suggest that increasing the size of Vision Transformers often improves performance, model accuracy and training behavior are not always monotonically increasing with scale. Focusing on ViT-B/16 trained on ImageNet-1K, we study two simple parameter-reduction strategies applied to the MLP blocks, each removing 32.7% of the baseline parameters. Our GroupedMLP variant shares MLP weights between adjacent transformer blocks and achieves 81.47% top-1 accuracy while maintaining the baseline computational cost. Our ShallowMLP variant halves the MLP hidden dimension and reaches 81.25% top-1 accuracy with a 38% increase in inference throughput. Both models outperform the 86.6M-parameter baseline (81.05%) and exhibit substantially improved training stability, reducing peak-to-final accuracy degradation from 0.47% to the range 0.03% to 0.06%. These results suggest that, for ViT-B/16 on ImageNet-1K with a standard training recipe, the model operates in an overparameterized regime in which MLP capacity can be reduced without harming performance and can even slightly improve it. More broadly, our findings suggest that architectural constraints such as parameter sharing and reduced width may act as useful inductive biases, and highlight the importance of how parameters are allocated when designing Vision Transformers. All code is available at: this https URL.
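The 32.7% figure can be sanity-checked with a short parameter count. The sketch assumes the standard ViT-B/16 dimensions (12 blocks, embedding width 768, MLP hidden width 3072, biases included) and assumes that GroupedMLP shares one MLP per pair of adjacent blocks; the abstract does not state the group size, so the pairing is an inference from "adjacent transformer blocks".

```python
def mlp_params(d_model, hidden):
    """Parameters of a two-layer transformer MLP, including both bias vectors."""
    return d_model * hidden + hidden + hidden * d_model + d_model

BASELINE_PARAMS = 86.6e6          # baseline size quoted in the abstract
D_MODEL, HIDDEN, BLOCKS = 768, 3072, 12

per_block = mlp_params(D_MODEL, HIDDEN)

# GroupedMLP (assumed pairwise sharing): half of the 12 MLPs are no longer stored.
grouped_saved = per_block * (BLOCKS // 2)

# ShallowMLP: every block keeps its own MLP, but with half the hidden width.
shallow_saved = (per_block - mlp_params(D_MODEL, HIDDEN // 2)) * BLOCKS

if __name__ == "__main__":
    print(f"MLP parameters per block: {per_block:,}")
    print(f"GroupedMLP reduction: {100 * grouped_saved / BASELINE_PARAMS:.1f}%")
    print(f"ShallowMLP reduction: {100 * shallow_saved / BASELINE_PARAMS:.1f}%")
```

Under these assumptions both strategies remove roughly 28.3M parameters. Sharing leaves per-token compute unchanged because every block still runs a full-width MLP, whereas halving the hidden width also cuts MLP compute, which matches the throughput gain reported only for ShallowMLP.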
[CV-120] TRoVe: Discovering Error-Inducing Static Feature Biases in Temporal Vision-Language Models NEURIPS2025
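[Quick read]: This paper addresses systematic prediction errors that vision-language models (VLMs) make on temporal understanding tasks because they rely on static feature biases. Such biases stem from static attributes of backgrounds or objects rather than from dynamic visual changes, and they hurt generalization in real-world settings. The key is TRoVe, an automated method for discovering error-inducing static feature biases: given a trained VLM and an annotated validation dataset, it extracts candidate static features and scores each one along two dimensions, (i) the feature's effect on classification errors and (ii) how strongly the model relies on the feature when making predictions. Experiments show a 28.6% improvement over the closest baseline, and applying TRoVe to 7 off-the-shelf VLMs on 2 temporal understanding tasks surfaces previously unknown static biases, demonstrating its value for improving test-time performance.
Link: https://arxiv.org/abs/2512.01048
Authors: Maya Varma,Jean-Benoit Delbrouck,Sophie Ostmeier,Akshay Chaudhari,Curtis Langlotz
Affiliations: Stanford University; HOPPR
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025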
Abstract:Vision-language models (VLMs) have made great strides in addressing temporal understanding tasks, which involve characterizing visual changes across a sequence of images. However, recent works have suggested that when making predictions, VLMs may rely on static feature biases, such as background or object features, rather than dynamic visual changes. Static feature biases are a type of shortcut and can contribute to systematic prediction errors on downstream tasks; as a result, identifying and characterizing error-inducing static feature biases is critical prior to real-world model deployment. In this work, we introduce TRoVe, an automated approach for discovering error-inducing static feature biases learned by temporal VLMs. Given a trained VLM and an annotated validation dataset associated with a downstream classification task, TRoVe extracts candidate static features from the dataset and scores each feature by (i) the effect of the feature on classification errors as well as (ii) the extent to which the VLM relies on the feature when making predictions. In order to quantitatively evaluate TRoVe, we introduce an evaluation framework consisting of 101 trained temporal VLMs paired with ground-truth annotations for learned static feature biases. We use this framework to demonstrate that TRoVe can accurately identify error-inducing static feature biases in VLMs, achieving a 28.6% improvement over the closest baseline. Finally, we apply TRoVe to 7 off-the-shelf VLMs and 2 temporal understanding tasks, surfacing previously-unknown static feature biases and demonstrating that knowledge of learned biases can aid in improving model performance at test time. Our code is available at this https URL.
[CV-121] Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model
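[Quick read]: This paper addresses the classic computer vision problem of recovering pixel-wise geometric properties (such as depth maps and surface normals) from a single image. The problem is ill-posed because the mapping between 2D observations and 3D structure is non-injective and appearance is ambiguous. Discriminative regression models are limited by the scale, quality, and diversity of their training data and lack physical reasoning, while diffusion models carry strong world priors but use a stochastic generative formulation that is ill-suited to deterministic geometric inference. The key is Lotus-2, a two-stage deterministic framework. The first stage uses a single-step deterministic formulation with a clean-data objective and a lightweight local continuity module (LCM) to produce globally coherent structure without grid artifacts. The second stage applies a constrained multi-step rectified-flow refinement, performing noise-free deterministic flow matching within the manifold defined by the core predictor to sharpen fine-grained geometry. With only 59K training samples (less than 1% of existing large-scale datasets), the method sets a new state of the art in monocular depth estimation and is highly competitive on surface normal prediction, showing that diffusion models can serve as deterministic world priors beyond the traditional discriminative and generative paradigms.
Link: https://arxiv.org/abs/2512.01030
Authors: Jing He,Haodong Li,Mingzhi Sheng,Ying-Cong Chen
Affiliations: The Hong Kong University of Science and Technology (Guangzhou)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Work done at the Hong Kong University of Science and Technology (Guangzhou). Project page: this https URL . 15 Pages, 12 Figures, 3 Tables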
Abstract:Recovering pixel-wise geometric properties from a single image is fundamentally ill-posed due to appearance ambiguity and non-injective mappings between 2D observations and 3D structures. While discriminative regression models achieve strong performance through large-scale supervision, their success is bounded by the scale, quality and diversity of available data and limited physical reasoning. Recent diffusion models exhibit powerful world priors that encode geometry and semantics learned from massive image-text data, yet directly reusing their stochastic generative formulation is suboptimal for deterministic geometric inference: the former is optimized for diverse and high-fidelity image generation, whereas the latter requires stable and accurate predictions. In this work, we propose Lotus-2, a two-stage deterministic framework for stable, accurate and fine-grained geometric dense prediction, aiming to provide an optimal adaption protocol to fully exploit the pre-trained generative priors. Specifically, in the first stage, the core predictor employs a single-step deterministic formulation with a clean-data objective and a lightweight local continuity module (LCM) to generate globally coherent structures without grid artifacts. In the second stage, the detail sharpener performs a constrained multi-step rectified-flow refinement within the manifold defined by the core predictor, enhancing fine-grained geometry through noise-free deterministic flow matching. Using only 59K training samples, less than 1% of existing large-scale datasets, Lotus-2 establishes new state-of-the-art results in monocular depth estimation and highly competitive surface normal prediction. These results demonstrate that diffusion models can serve as deterministic world priors, enabling high-quality geometric reasoning beyond traditional discriminative and generative paradigms.
[CV-122] FOM-Nav: Frontier-Object Maps for Object Goal Navigation WWW
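[Quick read]: This paper addresses Object Goal Navigation, in which a robot must efficiently locate a specific target object in an unknown environment. Among existing approaches, implicit memory-based methods struggle to maintain long-term memory and planning, while explicit map-based methods lack rich semantic information. The key is FOM-Nav, a modular framework whose core innovation is building Frontier-Object Maps online, jointly encoding spatial frontiers and fine-grained object information. A vision-language model (VLM) then performs multimodal scene understanding and high-level goal prediction, and a low-level planner generates efficient trajectories. This design markedly improves navigation efficiency, achieving the best SPL on the MP3D and HM3D benchmarks, and its practicality is validated on a real robot platform.
Link: https://arxiv.org/abs/2512.01009
Authors: Thomas Chabal,Shizhe Chen,Jean Ponce,Cordelia Schmid
Affiliations: Inria; École normale supérieure (ENS-PSL, CNRS, Inria); New York University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL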
Abstract:This paper addresses the Object Goal Navigation problem, where a robot must efficiently find a target object in an unknown environment. Existing implicit memory-based methods struggle with long-term memory retention and planning, while explicit map-based approaches lack rich semantic information. To address these challenges, we propose FOM-Nav, a modular framework that enhances exploration efficiency through Frontier-Object Maps and vision-language models. Our Frontier-Object Maps are built online and jointly encode spatial frontiers and fine-grained object information. Using this representation, a vision-language model performs multimodal scene understanding and high-level goal prediction, which is executed by a low-level planner for efficient trajectory generation. To train FOM-Nav, we automatically construct large-scale navigation datasets from real-world scanned environments. Extensive experiments validate the effectiveness of our model design and constructed dataset. FOM-Nav achieves state-of-the-art performance on the MP3D and HM3D benchmarks, particularly in navigation efficiency metric SPL, and yields promising results on a real robot.
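For readers unfamiliar with SPL (Success weighted by Path Length), the metric highlighted above, here is a minimal sketch of its commonly used definition: the mean over episodes of `success * shortest / max(taken, shortest)`. The function and argument names are illustrative and not taken from the FOM-Nav code.

```python
from typing import Sequence


def spl(successes: Sequence[bool],
        shortest: Sequence[float],
        taken: Sequence[float]) -> float:
    """Success weighted by Path Length, averaged over episodes.

    successes[i]: whether episode i reached the goal
    shortest[i]:  shortest-path length from start to goal (must be > 0)
    taken[i]:     length of the path the agent actually travelled
    """
    if not successes:
        raise ValueError("need at least one episode")
    total = 0.0
    for ok, best, actual in zip(successes, shortest, taken):
        if best <= 0:
            raise ValueError("shortest-path length must be positive")
        # A successful episode scores 1 only if the agent took the
        # shortest path; detours shrink the score, failures score 0.
        total += float(ok) * best / max(actual, best)
    return total / len(successes)
```

For example, three episodes with shortest paths of 10 m, one solved optimally, one solved with a 20 m path, and one failed, give an SPL of (1 + 0.5 + 0) / 3 = 0.5.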
[CV-123] LISA-3D: Lifting Language-Image Segmentation to 3D via Multi-View Consistency
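[Quick read]: This paper addresses how, in text-driven 3D reconstruction, to generate masks that are consistent across viewpoints and understand open-vocabulary instructions. Existing methods struggle to keep multi-view semantic consistency without extra 3D-text annotations, which makes reconstructions unstable. The key is the LISA-3D framework: it fine-tunes the existing language-image segmentation model LISA with geometry-aware Low-Rank Adaptation (LoRA) layers and reuses a frozen SAM-3D reconstructor. During training, off-the-shelf RGB-D sequences and their camera poses are used to build a differentiable reprojection loss that enforces mask agreement across views, with no 3D-text supervision. Adapting only 11.6M parameters, the method improves language-to-3D accuracy by up to 15.6 points over single-view baselines on ScanRefer and Nr3D, while remaining modular, data-efficient, and deployable zero-shot.
Link: https://arxiv.org/abs/2512.01008
Authors: Zhongbin Guo,Jiahe Liu,Wenyu Gao,Yushan Li,Chengzhi Li,Ping Jian
Affiliations: Beijing Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: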
Abstract:Text-driven 3D reconstruction demands a mask generator that simultaneously understands open-vocabulary instructions and remains consistent across viewpoints. We present LISA-3D, a two-stage framework that lifts language-image segmentation into 3D by retrofitting the instruction-following model LISA with geometry-aware Low-Rank Adaptation (LoRA) layers and reusing a frozen SAM-3D reconstructor. During training we exploit off-the-shelf RGB-D sequences and their camera poses to build a differentiable reprojection loss that enforces cross-view agreement without requiring any additional 3D-text supervision. The resulting masks are concatenated with RGB images to form RGBA prompts for SAM-3D, which outputs Gaussian splats or textured meshes without retraining. Across ScanRefer and Nr3D, LISA-3D improves language-to-3D accuracy by up to +15.6 points over single-view baselines while adapting only 11.6M parameters. The system is modular, data-efficient, and supports zero-shot deployment on unseen categories, providing a practical recipe for language-guided 3D content creation. Our code will be available at this https URL.
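As background on why LoRA keeps the trainable parameter count small, the sketch below shows a plain (not geometry-aware) LoRA layer in pure Python: the frozen weight `W` is augmented with a low-rank update `(alpha / r) * B @ A`, and only `A` and `B` are trained. Shapes, names, and the scaling convention follow the usual LoRA formulation; how LISA-3D makes its layers geometry-aware is not described in the abstract and is not modelled here.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute W x + (alpha / r) * B (A x).

    W: frozen weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    """
    rank = len(A)
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added by one LoRA pair (A and B)."""
    return rank * (d_in + d_out)
```

With `B` initialised to zeros, the adapted layer reproduces the frozen layer exactly, which is why LoRA fine-tuning starts from the pretrained behaviour.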
[CV-124] Provenance-Driven Reliable Semantic Medical Image Vector Reconstruction via Lightweight Blockchain-Verified Latent Fingerprints
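[Quick read]: This paper addresses the reduced reliability of AI-assisted diagnosis when real-world medical images are noisy, corrupted, or potentially tampered with. Conventional reconstruction methods can recover pixel-level detail but may sacrifice anatomical accuracy, which in turn affects clinical decisions. The key is a semantic-aware medical image reconstruction framework that combines high-level latent embeddings with a hybrid U-Net architecture to preserve clinically relevant structures during restoration. It also adds a lightweight blockchain provenance layer, based on a scale-free graph design, that records each reconstruction event verifiably, making the reconstruction process trustworthy and traceable and improving structural consistency, restoration accuracy, and provenance integrity.
Link: https://arxiv.org/abs/2512.00999
Authors: Mohsin Rasheed,Abdullah Al-Mamun
Affiliations: Augusta University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: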
Abstract:Medical imaging is essential for clinical diagnosis, yet real-world data frequently suffers from corruption, noise, and potential tampering, challenging the reliability of AI-assisted interpretation. Conventional reconstruction techniques prioritize pixel-level recovery and may produce visually plausible outputs while compromising anatomical fidelity, an issue that can directly impact clinical outcomes. We propose a semantic-aware medical image reconstruction framework that integrates high-level latent embeddings with a hybrid U-Net architecture to preserve clinically relevant structures during restoration. To ensure trust and accountability, we incorporate a lightweight blockchain-based provenance layer using scale-free graph design, enabling verifiable recording of each reconstruction event without imposing significant overhead. Extensive evaluation across multiple datasets and corruption types demonstrates improved structural consistency, restoration accuracy, and provenance integrity compared with existing approaches. By uniting semantic-guided reconstruction with secure traceability, our solution advances dependable AI for medical imaging, enhancing both diagnostic confidence and regulatory compliance in healthcare environments.
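To make the idea of verifiable provenance records concrete, here is a deliberately simplified sketch: each reconstruction event stores a SHA-256 fingerprint of a latent vector and the hash of the previous record, so later tampering breaks the chain. This is a plain linear hash chain written for illustration. The record fields, the float32 packing, and the function names are all assumptions, and the paper's scale-free graph design is not reproduced.

```python
import hashlib
import json
import struct

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def fingerprint(latent):
    """SHA-256 over the latent vector packed as little-endian float32."""
    packed = struct.pack(f"<{len(latent)}f", *latent)
    return hashlib.sha256(packed).hexdigest()


def _digest(body):
    """Hash a record body using a canonical JSON encoding."""
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_record(chain, latent, note):
    """Append one reconstruction event to the chain (mutates `chain`)."""
    body = {
        "prev": chain[-1]["hash"] if chain else GENESIS,
        "fingerprint": fingerprint(latent),
        "note": note,
    }
    chain.append({**body, "hash": _digest(body)})


def verify(chain):
    """Return True if every record links to and matches its predecessor."""
    prev = GENESIS
    for record in chain:
        body = {key: record[key] for key in ("prev", "fingerprint", "note")}
        if record["prev"] != prev or record["hash"] != _digest(body):
            return False
        prev = record["hash"]
    return True
```

Changing any stored fingerprint after the fact makes `verify` return `False`, which is the property a provenance layer relies on.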
[CV-125] S2AM3D: Scale-controllable Part Segmentation of 3D Point Cloud
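[Quick read]: This paper addresses two key problems in point cloud segmentation: native 3D models generalize poorly because data is scarce, and introducing 2D pre-trained knowledge leads to inconsistent segmentation across views. The key is the S2AM3D framework, whose core contributions are (1) a point-consistent part encoder that aggregates multi-view 2D features through native 3D contrastive learning to produce globally consistent point-level features, and (2) a scale-aware prompt decoder that uses a continuous scale signal to adjust segmentation granularity in real time. The authors also build a large-scale, high-quality part-level point cloud dataset of more than 100k samples to provide ample supervision. The approach shows strong robustness and controllability on complex structures and on parts with large size differences.
Link: https://arxiv.org/abs/2512.00995
Authors: Han Su,Tianyu Huang,Zichen Wan,Xiaohe Wu,Wangmeng Zuo
Affiliations: Harbin Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: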
Abstract:Part-level point cloud segmentation has recently attracted significant attention in 3D computer vision. Nevertheless, existing research is constrained by two major challenges: native 3D models lack generalization due to data scarcity, while introducing 2D pre-trained knowledge often leads to inconsistent segmentation results across different views. To address these challenges, we propose S2AM3D, which incorporates 2D segmentation priors with 3D consistent supervision. We design a point-consistent part encoder that aggregates multi-view 2D features through native 3D contrastive learning, producing globally consistent point features. A scale-aware prompt decoder is then proposed to enable real-time adjustment of segmentation granularity via continuous scale signals. Simultaneously, we introduce a large-scale, high-quality part-level point cloud dataset with more than 100k samples, providing ample supervision signals for model training. Extensive experiments demonstrate that S2AM3D achieves leading performance across multiple evaluation settings, exhibiting exceptional robustness and controllability when handling complex structures and parts with significant size variations.
[CV-126] PhotoFramer: Multi-modal Image Composition Instruction
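[Quick read]: This paper addresses the difficulty casual users have in composing photos well, with the aim of improving the images they capture. The key is PhotoFramer, a multi-modal composition instruction framework: it first describes in natural language how to improve the composition of the current image, then generates a well-composed example image. Its core contribution is a large-scale, hierarchical training dataset that divides composition guidance into three sub-tasks, shift, zoom-in, and view-change. View-change data comes from a two-stage synthesis pipeline: a degradation model is first trained to turn well-composed images into poorly composed ones, then applied to expert-taken photos to produce training pairs. Finally, a model that jointly processes text and images is fine-tuned to output actionable textual guidance together with illustrative examples, clearly outperforming exemplar-only baselines.
Link: https://arxiv.org/abs/2512.00993
Authors: Zhiyuan You,Ke Wang,He Zhang,Xin Cai,Jinjin Gu,Tianfan Xue,Chao Dong,Zhoutong Zhang
Affiliations: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Multimedia Laboratory, The Chinese University of Hong Kong; Adobe NextCam; Adobe Research; INSAIT, Sofia University “St. Kliment Ohridski”; Shanghai AI Laboratory; CPII under InnoHK; Shenzhen University of Advanced Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: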
Abstract:Composition matters during the photo-taking process, yet many casual users struggle to frame well-composed images. To provide composition guidance, we introduce PhotoFramer, a multi-modal composition instruction framework. Given a poorly composed image, PhotoFramer first describes how to improve the composition in natural language and then generates a well-composed example image. To train such a model, we curate a large-scale dataset. Inspired by how humans take photos, we organize composition guidance into a hierarchy of sub-tasks: shift, zoom-in, and view-change tasks. Shift and zoom-in data are sampled from existing cropping datasets, while view-change data are obtained via a two-stage pipeline. First, we sample pairs with varying viewpoints from multi-view datasets, and train a degradation model to transform well-composed photos into poorly composed ones. Second, we apply this degradation model to expert-taken photos to synthesize poor images to form training pairs. Using this dataset, we finetune a model that jointly processes and generates both text and images, enabling actionable textual guidance with illustrative examples. Extensive experiments demonstrate that textual instructions effectively steer image composition, and coupling them with exemplars yields consistent improvements over exemplar-only baselines. PhotoFramer offers a practical step toward composition assistants that make expert photographic priors accessible to everyday users. Codes, model weights, and datasets have been released in this https URL.
[CV-127] MM-ACT: Learn from Multimodal Parallel Generation to Act
[Quick read]: This paper addresses how a generalist robotic policy can combine the semantic understanding needed for task planning with the predictive ability needed for interacting with the environment, that is, how to model text, image, and action jointly for efficient and robust manipulation. The key is MM-ACT, a Vision-Language-Action (VLA) model that embeds text, images, and actions in a shared token space and uses Context-Shared Multimodal Learning to supervise and strengthen generation across modalities. It adopts a re-mask parallel decoding strategy for text and image generation and a one-step parallel decoding strategy for action generation, balancing generation quality with inference efficiency.
Link: https://arxiv.org/abs/2512.00975
Authors: Haotian Liang,Xinyi Chen,Bin Wang,Mingkang Chen,Yitian Liu,Yuhao Zhang,Zanxin Chen,Tianshuo Yang,Yilun Chen,Jiangmiao Pang,Dong Liu,Xiaokang Yang,Yao Mu,Wenqi Shao,Ping Luo
Affiliations: Shanghai AI Laboratory; Shanghai Jiao Tong University; The University of Hong Kong; University of Science and Technology of China; Fudan University; Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 17 pages
Abstract:A generalist robotic policy needs both semantic understanding for task planning and the ability to interact with the environment through predictive capabilities. To tackle this, we present MM-ACT, a unified Vision-Language-Action (VLA) model that integrates text, image, and action in shared token space and performs generation across all three modalities. MM-ACT adopts a re-mask parallel decoding strategy for text and image generation, and employs a one-step parallel decoding strategy for action generation to improve efficiency. We introduce Context-Shared Multimodal Learning, a unified training paradigm that supervises generation in all three modalities from a shared context, enhancing action generation through cross-modal learning. Experiments were conducted on the LIBERO simulation and Franka real-robot setups as well as RoboTwin2.0 to assess in-domain and out-of-domain performances respectively. Our approach achieves a success rate of 96.3% on LIBERO, 72.0% across three tasks of real Franka, and 52.38% across eight bimanual tasks of RoboTwin2.0 with an additional gain of 9.25% from cross-modal learning. We release our codes, models and data at this https URL.
[CV-128] Efficient and Scalable Monocular Human-Object Interaction Motion Reconstruction
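[Quick read]: This paper addresses the difficulty of efficiently and accurately reconstructing 4D Human-Object Interaction (HOI) data from monocular internet videos, a key prerequisite for general-purpose robots to operate robustly in the real world. Existing methods struggle to extract interactions with high spatio-temporal consistency and physical plausibility from complex, diverse in-the-wild videos. The core of the solution is the 4DHOISolver framework, which uses sparse human-annotated contact points as constraints to mitigate the ill-posedness of 4D HOI reconstruction, enabling scalable optimization-based reconstruction while maintaining spatio-temporal coherence and physical plausibility. This strategy improves reconstruction quality and supports both the construction of the large-scale Open4DHOI dataset and validation through reinforcement learning (RL)-based motion imitation. It also reveals that current 3D foundation models cannot yet automatically predict precise human-object contact correspondences, underscoring both the necessity of the human-in-the-loop approach and the open challenge it leaves.
Link: https://arxiv.org/abs/2512.00960
Authors: Boran Wen,Ye Lu,Keyan Wan,Sirui Wang,Jiahong Zhou,Junxuan Liang,Xinpeng Liu,Bang Xiao,Dingbang Huang,Ruiyang Liu,Yong-Lu Li
Affiliations: SJTU (Shanghai Jiao Tong University); SII; FDU (Fudan University); BJTU (Beijing Jiaotong University); STU (Shenzhen University)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: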
Abstract:Generalized robots must learn from diverse, large-scale human-object interactions (HOI) to operate robustly in the real world. Monocular internet videos offer a nearly limitless and readily available source of data, capturing an unparalleled diversity of human activities, objects, and environments. However, accurately and scalably extracting 4D interaction data from these in-the-wild videos remains a significant and unsolved challenge. Thus, in this work, we introduce 4DHOISolver, a novel and efficient optimization framework that constrains the ill-posed 4D HOI reconstruction problem by leveraging sparse, human-in-the-loop contact point annotations, while maintaining high spatio-temporal coherence and physical plausibility. Leveraging this framework, we introduce Open4DHOI, a new large-scale 4D HOI dataset featuring a diverse catalog of 144 object types and 103 actions. Furthermore, we demonstrate the effectiveness of our reconstructions by enabling an RL-based agent to imitate the recovered motions. However, a comprehensive benchmark of existing 3D foundation models indicates that automatically predicting precise human-object contact correspondences remains an unsolved problem, underscoring the immediate necessity of our human-in-the-loop strategy while posing an open challenge to the community. Data and code will be publicly available at this https URL
[CV-129] Adaptive Evidential Learning for Temporal-Semantic Robustness in Moment Retrieval AAAI2026
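[Quick read]: This paper addresses challenges in moment retrieval, the task of localizing temporal segments in a video from a natural language query. Traditional pre-trained models are limited in capturing fine-grained information and in deterministic reasoning, so they struggle to align with complex or ambiguous moments. The core problem is that existing methods are biased in cross-modal alignment and uncertainty estimation, so high uncertainty is wrongly assigned to accurate samples rather than to difficult ones. The key is a new framework, Debiased Evidential Learning for Moment Retrieval (DEMR), with two components: a Reflective Flipped Fusion (RFF) block that strengthens cross-modal alignment, and a query reconstruction task that increases text sensitivity, together reducing bias in uncertainty estimation. In addition, a Geom-regularizer refines the uncertainty predictions so the model adaptively focuses on difficult moments, improving retrieval accuracy, robustness, and interpretability.
Link: https://arxiv.org/abs/2512.00953
Authors: Haojian Huang,Kaijing Ma,Jin Chen,Haodong Chen,Zhou Wu,Xianghao Zang,Han Fang,Chao Ban,Hao Sun,Mulin Chen,Zhongjiang He
Affiliations: 1. University of Science and Technology of China; 2. Alibaba Group; 3. Tongji University; 4. Tsinghua University; 5. National University of Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026, 10 pages, 9 figures, 5 tables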
Abstract:In the domain of moment retrieval, accurately identifying temporal segments within videos based on natural language queries remains challenging. Traditional methods often employ pre-trained models that struggle with fine-grained information and deterministic reasoning, leading to difficulties in aligning with complex or ambiguous moments. To overcome these limitations, we explore Deep Evidential Regression (DER) to construct a vanilla Evidential baseline. However, this approach encounters two major issues: the inability to effectively handle modality imbalance and the structural differences in DER’s heuristic uncertainty regularizer, which adversely affect uncertainty estimation. This misalignment results in high uncertainty being incorrectly associated with accurate samples rather than challenging ones. Our observations indicate that existing methods lack the adaptability required for complex video scenarios. In response, we propose Debiased Evidential Learning for Moment Retrieval (DEMR), a novel framework that incorporates a Reflective Flipped Fusion (RFF) block for cross-modal alignment and a query reconstruction task to enhance text sensitivity, thereby reducing bias in uncertainty estimation. Additionally, we introduce a Geom-regularizer to refine uncertainty predictions, enabling adaptive alignment with difficult moments and improving retrieval accuracy. Extensive testing on standard datasets and debiased datasets ActivityNet-CD and Charades-CD demonstrates significant enhancements in effectiveness, robustness, and interpretability, positioning our approach as a promising solution for temporal-semantic robustness in moment retrieval. The code is publicly available at this https URL.
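As background on the Deep Evidential Regression (DER) baseline mentioned above: in the standard DER formulation, the network outputs the four parameters of a Normal-Inverse-Gamma distribution, from which a prediction and two uncertainties follow in closed form. The sketch below shows only those standard formulas. It does not implement DEMR's RFF block, query reconstruction task, or Geom-regularizer, and the function name is illustrative.

```python
def nig_uncertainty(gamma: float, nu: float, alpha: float, beta: float):
    """Prediction and uncertainties from Normal-Inverse-Gamma parameters.

    gamma: predicted mean (for example, a moment boundary timestamp)
    nu > 0, alpha > 1, beta > 0: evidential parameters
    Returns (prediction, aleatoric, epistemic).
    """
    if nu <= 0 or alpha <= 1 or beta <= 0:
        raise ValueError("require nu > 0, alpha > 1, beta > 0")
    aleatoric = beta / (alpha - 1)          # expected data noise, E[sigma^2]
    epistemic = beta / (nu * (alpha - 1))   # variance of the mean, Var[mu]
    return gamma, aleatoric, epistemic
```

More evidence (larger `nu`) shrinks the epistemic term while leaving the aleatoric term unchanged. The bias discussed in the abstract concerns which samples end up with high uncertainty, not these formulas themselves.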
[CV-130] Binary-Gaussian: Compact and Progressive Representation for 3D Gaussian Segmentation
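[Quick read]: This paper addresses two core problems in semantic segmentation methods built on 3D Gaussian Splatting (3D-GS): the excessive memory overhead of high-dimensional category features, and the difficulty of fine-grained segmentation, which stems mainly from a crowded label space and the lack of a stable multi-granularity control mechanism. The key is a coarse-to-fine binary encoding scheme that compresses each Gaussian's category feature into a single integer via a binary-to-decimal mapping, sharply reducing memory use. A progressive training strategy decomposes panoptic segmentation into a series of independent sub-tasks to reduce inter-class conflicts, and opacity is fine-tuned during segmentation training to ease the incompatibility between photometric rendering and semantic segmentation, improving foreground-background separation and fine-grained segmentation performance.
Link: https://arxiv.org/abs/2512.00944
Authors: An Yang,Chenyu Liu,Jun Du,Jianqing Gao,Jia Pan,Jinshui Hu,Baocai Yin,Bing Yin,Cong Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: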
Abstract:3D Gaussian Splatting (3D-GS) has emerged as an efficient 3D representation and a promising foundation for semantic tasks like segmentation. However, existing 3D-GS-based segmentation methods typically rely on high-dimensional category features, which introduce substantial memory overhead. Moreover, fine-grained segmentation remains challenging due to label space congestion and the lack of stable multi-granularity control mechanisms. To address these limitations, we propose a coarse-to-fine binary encoding scheme for per-Gaussian category representation, which compresses each feature into a single integer via the binary-to-decimal mapping, drastically reducing memory usage. We further design a progressive training strategy that decomposes panoptic segmentation into a series of independent sub-tasks, reducing inter-class conflicts and thereby enhancing fine-grained segmentation capability. Additionally, we fine-tune opacity during segmentation training to address the incompatibility between photometric rendering and semantic segmentation, which often leads to foreground-background confusion. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art segmentation performance while significantly reducing memory consumption and accelerating inference.
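The binary-to-decimal mapping can be illustrated in a few lines. The sketch below assumes one binary decision per granularity level, with the most significant bit as the coarsest level. That bit ordering and the helper names are assumptions made for illustration, not details given in the abstract.

```python
def encode(bits):
    """Pack a coarse-to-fine list of 0/1 decisions into one integer."""
    value = 0
    for bit in bits:
        if bit not in (0, 1):
            raise ValueError("bits must be 0 or 1")
        value = (value << 1) | bit
    return value


def decode(value, levels):
    """Recover the per-level decisions from the packed integer."""
    return [(value >> (levels - 1 - i)) & 1 for i in range(levels)]


def coarse_label(value, levels, keep):
    """Label at a coarser granularity: keep only the first `keep` levels."""
    return value >> (levels - keep)
```

For instance, `encode([1, 0, 1])` gives 5, and dropping the finest level with `coarse_label(5, 3, 2)` gives 2 (binary `10`). Storing one integer per Gaussian, rather than a high-dimensional feature vector, is where the memory saving comes from.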
[CV-131] SceneProp: Combining Neural Network and Markov Random Field for Scene-Graph Grounding WACV2026
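[Quick read]: This paper addresses scene-graph grounding for complex, compositional visual queries in vision-language models, that is, accurately mapping a structured query containing multiple objects and their relationships to the corresponding image regions. Existing methods actually perform worse as queries grow more complex, failing to exploit the relational information. The key is to recast scene-graph grounding as Maximum a Posteriori (MAP) inference in a Markov Random Field (MRF) and to carry out end-to-end global inference with a differentiable Belief Propagation algorithm, optimizing the assignment of image regions to query-graph nodes as a whole so that all constraints are satisfied jointly. The method shows for the first time that more complex query graphs can markedly improve grounding accuracy.
Link: https://arxiv.org/abs/2512.00936
Authors: Keita Otani,Tatsuya Harada
Affiliations: The University of Tokyo; RIKEN AIP
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to WACV 2026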
Abstract:Grounding complex, compositional visual queries with multiple objects and relationships is a fundamental challenge for vision-language models. While standard phrase grounding methods excel at localizing single objects, they lack the structural inductive bias to parse intricate relational descriptions, often failing as queries become more descriptive. To address this structural deficit, we focus on scene-graph grounding, a powerful but less-explored formulation where the query is an explicit graph of objects and their relationships. However, existing methods for this task also struggle, paradoxically showing decreased performance as the query graph grows – failing to leverage the very information that should make grounding easier. We introduce SceneProp, a novel method that resolves this issue by reformulating scene-graph grounding as a Maximum a Posteriori (MAP) inference problem in a Markov Random Field (MRF). By performing global inference over the entire query graph, SceneProp finds the optimal assignment of image regions to nodes that jointly satisfies all constraints. This is achieved within an end-to-end framework via a differentiable implementation of the Belief Propagation algorithm. Experiments on four benchmarks show that our dedicated focus on the scene-graph grounding formulation allows SceneProp to significantly outperform prior work. Critically, its accuracy consistently improves with the size and complexity of the query graph, demonstrating for the first time that more relational context can, and should, lead to better grounding. Codes are available at this https URL.
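To show what MAP inference over a query graph buys compared with grounding each node independently, here is a toy max-product (Viterbi-style) pass on a three-node chain, checked against brute force. Labels stand for candidate image regions, and the log-scores are made-up numbers. SceneProp's differentiable belief propagation on general query graphs is not reproduced here.

```python
from itertools import product


def map_chain(unary, pairwise):
    """MAP labels on a chain MRF via max-product message passing.

    unary[i][a]:       log-score of giving node i the label a
    pairwise[i][a][b]: log-score of (node i = a, node i + 1 = b)
    """
    n, k = len(unary), len(unary[0])
    msg = [[0.0] * k for _ in range(n)]   # best score of the prefix ending at i
    back = [[0] * k for _ in range(n)]    # argmax pointers for backtracking
    for i in range(1, n):
        for b in range(k):
            scores = [msg[i - 1][a] + unary[i - 1][a] + pairwise[i - 1][a][b]
                      for a in range(k)]
            back[i][b] = max(range(k), key=scores.__getitem__)
            msg[i][b] = scores[back[i][b]]
    last = max(range(k), key=lambda b: msg[n - 1][b] + unary[n - 1][b])
    labels = [last]
    for i in range(n - 1, 0, -1):
        labels.append(back[i][labels[-1]])
    return labels[::-1]


def map_brute_force(unary, pairwise):
    """Exhaustive MAP search, used only to check the message passing."""
    n, k = len(unary), len(unary[0])

    def score(lab):
        total = sum(unary[i][lab[i]] for i in range(n))
        total += sum(pairwise[i][lab[i]][lab[i + 1]] for i in range(n - 1))
        return total

    return list(max(product(range(k), repeat=n), key=score))


# Toy problem: 3 query nodes, 2 candidate regions each (illustrative values).
UNARY = [[0.0, 1.0], [0.5, 0.0], [0.0, 0.2]]
PAIRWISE = [[[2.0, 0.0], [0.0, 0.5]], [[0.0, 1.0], [1.5, 0.0]]]
```

On this toy problem, picking each node's best unary score independently gives labels `[1, 0, 1]`, while joint inference returns `[0, 0, 1]` because the relation scores outweigh the first node's unary preference. That is the sense in which relational context can change, and improve, the grounding.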
[CV-132] LAHNet: Local Attentive Hashing Network for Point Cloud Registration
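[Quick read]: This paper addresses insufficient feature distinctiveness in point cloud registration: most learning-based point cloud descriptors perceive only local information and cannot build a reasonable, broad receptive field that would make features more discriminative. The key is a network called LAHNet (Local Attentive Hashing Network), whose core contributions are (1) a local attention mechanism with a locality inductive bias, combined with Locality-Sensitive Hashing (LSH) to partition the point cloud into non-overlapping windows and thereby model structured long-range context; (2) a cross-window strategy that further expands the reasonable feature receptive field; and (3) an Interaction Transformer that represents each window as a global signal and computes an overlap matrix, strengthening feature interaction in the overlap regions between point cloud pairs and markedly improving registration performance.
Link: https://arxiv.org/abs/2512.00927
Authors: Wentao Qu,Xiaoshui Huang,Liang Xiao
Affiliations: Nanjing University of Science and Technology; Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: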
Abstract:Most existing learning-based point cloud descriptors for point cloud registration focus on perceiving local information of point clouds to generate distinctive features. However, a reasonable and broader receptive field is essential for enhancing feature distinctiveness. In this paper, we propose a Local Attentive Hashing Network for point cloud registration, called LAHNet, which introduces a local attention mechanism with the inductive bias of locality of convolution-like operators into point cloud descriptors. Specifically, a Group Transformer is designed to capture reasonable long-range context between points. This employs a linear neighborhood search strategy, Locality-Sensitive Hashing, enabling uniformly partitioning point clouds into non-overlapping windows. Meanwhile, an efficient cross-window strategy is adopted to further expand the reasonable feature receptive field. Furthermore, building on this effective windowing strategy, we propose an Interaction Transformer to enhance the feature interactions of the overlap regions within point cloud pairs. This computes an overlap matrix to match overlap regions between point cloud pairs by representing each window as a global signal. Extensive results demonstrate that LAHNet can learn robust and distinctive features, achieving significant registration results on real-world indoor and outdoor benchmarks.
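One simple way to picture hashing points into non-overlapping windows is sketched below: each 3D point is hashed with a single random-projection hash, `floor((a . p + b) / w)`, points are ordered by hash value, and the ordering is cut into fixed-size windows. This is a generic illustration of the LSH idea under assumed parameters (`bucket_width`, `window_size`, one hash function). The abstract does not specify LAHNet's actual hashing scheme.

```python
import random


def lsh_windows(points, window_size, bucket_width=0.5, seed=0):
    """Group point indices into non-overlapping windows via one LSH hash.

    points: list of (x, y, z) tuples
    Returns a list of windows, each a list of point indices. Every index
    appears in exactly one window; the last window may be smaller.
    """
    rng = random.Random(seed)
    direction = [rng.gauss(0.0, 1.0) for _ in range(3)]  # random projection a
    offset = rng.uniform(0.0, bucket_width)              # random shift b

    def bucket(point):
        projection = sum(d * c for d, c in zip(direction, point))
        return int((projection + offset) // bucket_width)

    # Points that are close in space tend to share or neighbour a bucket,
    # so sorting by bucket places them in the same or adjacent windows.
    order = sorted(range(len(points)), key=lambda i: bucket(points[i]))
    return [order[i:i + window_size] for i in range(0, len(order), window_size)]
```

Attention can then be restricted to points inside each window, which keeps the cost per point bounded by the window size rather than the full cloud.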
[CV-133] ForamDeepSlice: A High-Accuracy Deep Learning Framework for Foraminifera Species Classification from 2D Micro-CT Slices
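[Quick read]: This paper addresses the reliance of micropaleontological classification on expert judgment, which is slow and subjective, focusing on species identification of foraminifera. The core of the solution is an automated deep learning classification framework. First, scientifically rigorous data curation and specimen-level data splitting ensure there is no leakage between the training, validation, and test sets. Second, transfer learning combined with a multi-model ensemble (ConvNeXt-Large and EfficientNetV2-Small) achieves 95.64% test accuracy and an AUC of 0.998 on a benchmark dataset of 109,617 high-quality 2D micro-CT slices. Finally, an interactive visualization dashboard supports real-time slice classification and 3D slice matching, making the models considerably more practical and deployable for geoscience applications.
Link: https://arxiv.org/abs/2512.00912
Authors: Abdelghafour Halimi,Ali Alibrahim,Didier Barradas-Bautista,Ronell Sicat,Abdulkader M. Afifi
Affiliations: King Abdullah University of Science and Technology (KAUST)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: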
Abstract:This study presents a comprehensive deep learning pipeline for the automated classification of 12 foraminifera species using 2D micro-CT slices derived from 3D scans. We curated a scientifically rigorous dataset comprising 97 micro-CT scanned specimens across 27 species, selecting 12 species with sufficient representation for robust machine learning. To ensure methodological integrity and prevent data leakage, we employed specimen-level data splitting, resulting in 109,617 high-quality 2D slices (44,103 for training, 14,046 for validation, and 51,468 for testing). We evaluated seven state-of-the-art 2D convolutional neural network (CNN) architectures using transfer learning. Our final ensemble model, combining ConvNeXt-Large and EfficientNetV2-Small, achieved a test accuracy of 95.64%, with a top-3 accuracy of 99.6% and an area under the ROC curve (AUC) of 0.998 across all species. To facilitate practical deployment, we developed an interactive advanced dashboard that supports real-time slice classification and 3D slice matching using advanced similarity metrics, including SSIM, NCC, and the Dice coefficient. This work establishes new benchmarks for AI-assisted micropaleontological identification and provides a fully reproducible framework for foraminifera classification research, bridging the gap between deep learning and applied geosciences.
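Specimen-level splitting means that all slices cut from one scanned specimen land in the same partition, so near-identical neighbouring slices cannot leak from training into testing. A minimal sketch follows. The 60/20/20 fractions, the seed, and the input format are illustrative assumptions rather than the paper's protocol, and per-species stratification is omitted for brevity.

```python
import random
from collections import defaultdict


def specimen_level_split(slice_to_specimen, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split slice ids into train/val/test without splitting any specimen.

    slice_to_specimen: dict mapping slice id -> specimen id
    Returns a dict with keys "train", "val", "test" holding slice ids.
    """
    by_specimen = defaultdict(list)
    for slice_id, specimen_id in slice_to_specimen.items():
        by_specimen[specimen_id].append(slice_id)

    # Shuffle specimens (not slices) so the split is made at specimen level.
    specimens = sorted(by_specimen)
    random.Random(seed).shuffle(specimens)

    n_train = int(len(specimens) * fractions[0])
    n_val = int(len(specimens) * fractions[1])
    groups = {
        "train": specimens[:n_train],
        "val": specimens[n_train:n_train + n_val],
        "test": specimens[n_train + n_val:],
    }
    return {name: [s for spec in ids for s in by_specimen[spec]]
            for name, ids in groups.items()}
```

A random split over individual slices would, by contrast, place adjacent slices of the same specimen on both sides of the train/test boundary and inflate test accuracy.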
[CV-134] Dual-Projection Fusion for Accurate Upright Panorama Generation in Robotic Vision
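[Quick read]: This paper addresses non-upright panoramas, caused by unstable robot postures, that hinder downstream tasks. Traditional correction methods based on an inertial measurement unit (IMU) suffer from drift and external disturbances, whereas vision-based methods are more promising. The key is a dual-stream angle-aware generation network: a convolutional neural network (CNN) branch extracts local geometric structure from the equirectangular projection, a Vision Transformer (ViT) branch captures global context from the cubemap projection, and a dual-projection adaptive fusion module aligns spatial features across the two domains, so that the camera inclination angles are estimated and the upright panorama is reconstructed jointly. A high-frequency enhancement block, circular padding, and channel attention further improve 360° continuity and geometric sensitivity. Experiments show the method outperforms existing approaches on the SUN360 and M3D datasets.
Link: https://arxiv.org/abs/2512.00911
Authors: Yuhao Shan,Qianyi Yuan,Jingguo Liu,Shigang Li,Jianfeng Li,Tong Chen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: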
Abstract:Panoramic cameras, capable of capturing a 360-degree field of view, are crucial in robotic vision, particularly in environments with sparse features. However, non-upright panoramas due to unstable robot postures hinder downstream tasks. Traditional IMU-based correction methods suffer from drift and external disturbances, while vision-based approaches offer a promising alternative. This study presents a dual-stream angle-aware generation network that jointly estimates camera inclination angles and reconstructs upright panoramic images. The network comprises a CNN branch that extracts local geometric structures from equirectangular projections and a ViT branch that captures global contextual cues from cubemap projections. These are integrated through a dual-projection adaptive fusion module that aligns spatial features across both domains. To further enhance performance, we introduce a high-frequency enhancement block, circular padding, and channel attention mechanisms to preserve 360° continuity and improve geometric sensitivity. Experiments on the SUN360 and M3D datasets demonstrate that our method outperforms existing approaches in both inclination estimation and upright panorama generation. Ablation studies further validate the contribution of each module and highlight the synergy between the two tasks. The code and related datasets can be found at: this https URL.
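Circular padding matters for equirectangular panoramas because the left and right image edges are physically adjacent. The sketch below wraps columns horizontally on a small list-of-rows image. It illustrates the general technique only. The function name is hypothetical, and how the paper integrates padding into its network is not described in the abstract.

```python
def circular_pad_width(image, pad):
    """Wrap-pad an image horizontally so the 360-degree seam is continuous.

    image: list of rows, each a list of pixel values
    pad:   number of columns to copy from the opposite edge on each side
    """
    width = len(image[0])
    if pad < 0 or pad > width:
        raise ValueError("pad must be between 0 and the image width")
    if pad == 0:
        return [list(row) for row in image]
    # Columns from the right edge are prepended and columns from the
    # left edge are appended, mimicking the panorama's wrap-around.
    return [row[-pad:] + row + row[:pad] for row in image]
```

A convolution applied to the padded image then sees the true neighbours across the seam, instead of the zeros that ordinary zero padding would supply.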
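[CV-135] TalkingPose: Efficient Face and Gesture Animation with Feedback-guided Diffusion Model WACV2026
【Quick Read】: This paper addresses the difficulty that current diffusion-based character-driven animation has in generating long, temporally consistent human upper-body motion; in particular, existing methods are trained on short video clips, which makes long continuous generation hard. The key to the solution is TalkingPose, a new diffusion framework that introduces an image-based feedback mechanism to produce stable, temporally consistent animation of unlimited duration without extra computational cost or a secondary training stage, while using driving frames to capture expressive facial and hand movements precisely and transfer them seamlessly to the target character.
Link: https://arxiv.org/abs/2512.00909
Authors: Alireza Javanmardi,Pragati Jaiswal,Tewodros Amberbir Habtegebrial,Christen Millerdurai,Shaoxiang Wang,Alain Pagani,Didier Stricker
Affiliations: German Research Center for Artificial Intelligence (DFKI); RPTU
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: WACV 2026, Project page available at this https URL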
Abstract:Recent advancements in diffusion models have significantly improved the realism and generalizability of character-driven animation, enabling the synthesis of high-quality motion from just a single RGB image and a set of driving poses. Nevertheless, generating temporally coherent long-form content remains challenging. Existing approaches are constrained by computational and memory limitations, as they are typically trained on short video segments, thus performing effectively only over limited frame lengths and hindering their potential for extended coherent generation. To address these constraints, we propose TalkingPose, a novel diffusion-based framework specifically designed for producing long-form, temporally consistent human upper-body animations. TalkingPose leverages driving frames to precisely capture expressive facial and hand movements, transferring these seamlessly to a target actor through a stable diffusion backbone. To ensure continuous motion and enhance temporal coherence, we introduce a feedback-driven mechanism built upon image-based diffusion models. Notably, this mechanism does not incur additional computational costs or require secondary training stages, enabling the generation of animations with unlimited duration. Additionally, we introduce a comprehensive, large-scale dataset to serve as a new benchmark for human upper-body animation.
[CV-136] Hierarchical Semantic Alignment for Image Clustering AAAI2026
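【Quick Read】: This paper addresses the problem that image clustering methods overlook the inherent ambiguity of nouns, which distorts semantic representations and degrades clustering quality. The key to the solution is a training-free hierarchical semantic alignment method (CAE) that combines two complementary kinds of textual semantics, fine-grained caption-level descriptions and high-level noun-level object concepts, and uses optimal transport to align image features with the selected nouns and captions, yielding a more discriminative semantic space; clustering is then performed on the combination of the enhanced semantic features and the image features.
Link: https://arxiv.org/abs/2512.00904
Authors: Xingyu Zhu,Beier Zhu,Yunfan Li,Junfeng Fang,Shuo Wang,Kesen Zhao,Hanwang Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: AAAI 2026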
Abstract:Image clustering is a classic problem in computer vision, which categorizes images into different groups. Recent studies utilize nouns as external semantic knowledge to improve clustering performance. However, these methods often overlook the inherent ambiguity of nouns, which can distort semantic representations and degrade clustering quality. To address this issue, we propose a hierarChical semAntic alignmEnt method for image clustering, dubbed CAE, which improves clustering performance in a training-free manner. In our approach, we incorporate two complementary types of textual semantics: caption-level descriptions, which convey fine-grained attributes of image content, and noun-level concepts, which represent high-level object categories. We first select relevant nouns from WordNet and descriptions from caption datasets to construct a semantic space aligned with image features. Then, we align image features with selected nouns and captions via optimal transport to obtain a more discriminative semantic space. Finally, we combine the enhanced semantic and image features to perform clustering. Extensive experiments across 8 datasets demonstrate the effectiveness of our method, notably surpassing the state-of-the-art training-free approach with a 4.2% improvement in accuracy and a 2.9% improvement in adjusted Rand index (ARI) on the ImageNet-1K dataset.
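To make the optimal-transport alignment step more concrete, here is a minimal sketch of entropy-regularised optimal transport solved with Sinkhorn iterations. It assumes a precomputed cost matrix between image features and text features (for example one minus cosine similarity), uniform marginals, and a regularisation strength of 0.1; these choices and the function name are assumptions, since the abstract does not state which solver or settings CAE uses.

```python
import math


def sinkhorn(cost, epsilon=0.1, iterations=200):
    """Entropy-regularised optimal transport plan with uniform marginals.

    `cost[i][j]` is the cost of matching image i to text item j (assumed input).
    Returns a plan whose rows sum to about 1/n and whose columns sum to 1/m.
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    row_mass = [1.0 / n] * n
    col_mass = [1.0 / m] * m
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the target marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Each row of the returned plan can be read as a soft assignment of one image to the candidate nouns or captions; a smaller `epsilon` makes the assignment sharper at the cost of numerical stability.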
[CV-137] SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead
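【Quick Read】: This paper addresses the low deployment efficiency of current Vision-Language-Action (VLA) models, whose large parameter counts make them impractical, especially on resource-constrained edge devices. Existing lightweight approaches reduce compute but usually sacrifice spatiotemporal reasoning, while methods that rely on large pretrained Vision-Language Models (VLMs) to fuse 3D inputs still lack effective temporal modeling. The key to the solution is the SwiftVLA architecture. First, a lightweight module based on a pretrained 4D visual geometry transformer uses a temporal cache to extract 4D features from 2D images. Second, learnable Fusion Tokens, trained with a future prediction objective, produce unified representations of 2D and 4D features for action generation. Finally, a mask-and-reconstruct strategy masks the 4D inputs during training and guides the model to reconstruct them, so that the VLM learns to encode 4D information effectively and the 4D branch can be removed at inference without significant performance loss. Experiments in real and simulated environments show that the method outperforms comparable lightweight baselines and approaches the performance of models with 7 times as many parameters, while running 18 times faster and using 12 times less memory on edge devices.
Link: https://arxiv.org/abs/2512.00903
Authors: Chaojun Ni,Cheng Chen,Xiaofeng Wang,Zheng Zhu,Wenzhao Zheng,Boyuan Wang,Tianrun Chen,Guosheng Zhao,Haoyun Li,Zhehao Dong,Qiang Zhang,Yun Ye,Yang Wang,Guan Huang,Wenjun Mei
Affiliations: GigaAI; Peking University; Moxin (Huzhou) Technology Co., Ltd.; Tsinghua University; X-Humanoid
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes: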
Abstract:Vision-Language-Action (VLA) models built on pretrained Vision-Language Models (VLMs) show strong potential but are limited in practicality due to their large parameter counts. To mitigate this issue, using a lightweight VLM has been explored, but it compromises spatiotemporal reasoning. Although some methods suggest that incorporating additional 3D inputs can help, they usually rely on large VLMs to fuse 3D and 2D inputs and still lack temporal understanding. Therefore, we propose SwiftVLA, an architecture that enhances a compact model with 4D understanding while preserving design efficiency. Specifically, our approach features a pretrained 4D visual geometry transformer with a temporal cache that extracts 4D features from 2D images. Then, to enhance the VLM’s ability to exploit both 2D images and 4D features, we introduce Fusion Tokens, a set of learnable tokens trained with a future prediction objective to generate unified representations for action generation. Finally, we introduce a mask-and-reconstruct strategy that masks 4D inputs to the VLM and trains the VLA to reconstruct them, enabling the VLM to learn effective 4D representations and allowing the 4D branch to be dropped at inference with minimal performance loss. Experiments in real and simulated environments show that SwiftVLA outperforms lightweight baselines and rivals VLAs up to 7 times larger, achieving comparable performance on edge devices while being 18 times faster and reducing memory footprint by 12 times.
[CV-138] Accelerating Streaming Video Large Language Models via Hierarchical Token Compression
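【Quick Read】: This paper addresses the high computational cost that dense visual tokens impose on Streaming Video Large Language Models (VideoLLMs) in real-time deployment, focusing on the redundant computation in the Vision Transformer (ViT) encoding stage caused by temporal similarity between frames, and on the latency and memory pressure caused by inflated token sequences in the LLM pre-filling stage. The key to the solution is a plug-and-play hierarchical token compression framework, Streaming Token Compression (STC), built around two token-level accelerators: STC-Cacher reduces ViT encoding overhead by caching and reusing features from temporally similar frames, and STC-Pruner compresses the visual tokens before they enter the LLM, keeping only the most salient tokens according to spatial and temporal relevance, which substantially reduces latency and memory use. Experiments show that STC retains up to 99% of accuracy while reducing ViT encoding latency and LLM pre-filling latency by 24.5% and 45.3%, respectively.
Link: https://arxiv.org/abs/2512.00891
Authors: Yiyu Wang,Xuyang Liu,Xiyan Gui,Xinying Lin,Boxue Yang,Chenfei Liao,Tailai Chen,Linfeng Zhang
Affiliations: EPIC Lab, Shanghai Jiao Tong University; Sichuan University; Huazhong University of Science and Technology; Sun Yat-sen University; Hong Kong University of Science and Technology (Guangzhou)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Code is available at this https URL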
Abstract:Streaming Video Large Language Models (VideoLLMs) have demonstrated impressive performance across various video understanding tasks, but they face significant challenges in real-time deployment due to the high computational cost of processing dense visual tokens from continuous video streams. In streaming video scenarios, the primary bottleneck lies in the Vision Transformer (ViT) encoding stage, where redundant processing of temporally similar frames leads to inefficiency. Additionally, inflated token sequences during LLM pre-filling further exacerbate latency and memory overhead. To address these challenges, we propose Streaming Token Compression (STC), a plug-and-play hierarchical framework that seamlessly integrates into existing streaming VideoLLMs, optimizing both ViT encoding and LLM pre-filling stages to accelerate processing. STC introduces two token-level accelerators: STC-Cacher, which reduces ViT encoding overhead by caching and reusing features from temporally similar frames, and STC-Pruner, which compresses the visual token sequence before it enters the LLM, preserving only the most salient tokens based on both spatial and temporal relevance. Extensive experiments on four baseline streaming VideoLLMs across five benchmarks demonstrate that STC outperforms other compression methods. Notably, STC retains up to 99% of accuracy on the ReKV framework while reducing ViT encoding latency and LLM pre-filling latency by 24.5% and 45.3%.
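The caching idea behind STC-Cacher can be sketched in a few lines: if a new frame is very similar to the frame whose features are already cached, reuse those features instead of running the encoder again. The version below is a deliberately simplified, frame-level illustration; the class name, the cosine-similarity test on raw frame vectors, and the 0.95 threshold are all assumptions, and the actual method may operate at token level with a different similarity criterion.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class FrameFeatureCache:
    """Reuse encoder output while incoming frames stay close to a reference frame."""

    def __init__(self, encoder, threshold=0.95):
        self.encoder = encoder          # any callable: frame -> features
        self.threshold = threshold      # assumed similarity cut-off
        self.ref_frame = None
        self.ref_features = None
        self.encoder_calls = 0

    def encode(self, frame):
        if self.ref_frame is not None and cosine(frame, self.ref_frame) >= self.threshold:
            return self.ref_features    # cache hit: skip the expensive encoder
        self.ref_frame = list(frame)
        self.ref_features = self.encoder(frame)
        self.encoder_calls += 1
        return self.ref_features
```

Comparing against a fixed reference frame rather than the previous frame keeps slow drift from accumulating unnoticed; once the scene has changed enough, the next frame becomes the new reference.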
[CV-139] Multilingual Training-Free Remote Sensing Image Captioning
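【Quick Read】: This paper addresses two limitations of remote sensing image captioning: its dependence on large annotated datasets and its restriction to English, which limits global applicability. The core solution is a training-free multilingual method based on retrieval-augmented prompting: a domain-adapted SigLIP2 encoder first retrieves related captions and few-shot examples from a datastore, and a Large Language Model (LLM) or Vision-Language Model (VLM) then generates the caption. The key innovation is a PageRank-based graph re-ranking strategy that improves the coherence of the retrieved content and improves performance metrics by up to 35%. The work also shows that generating captions directly in the target language outperforms translation-based strategies, and it offers a systematic evaluation that moves toward more inclusive and scalable multimodal Earth observation systems.
Link: https://arxiv.org/abs/2512.00887
Authors: Carlos Rebelo,Gil Rocha,João Daniel Silva,Bruno Martins
Affiliations: INESC-ID; Instituto Superior Técnico, Universidade de Lisboa; Faculdade de Engenharia da Universidade do Porto
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: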
Abstract:Remote sensing image captioning has advanced rapidly through encoder–decoder models, although the reliance on large annotated datasets and the focus on English restricts global applicability. To address these limitations, we propose the first training-free multilingual approach, based on retrieval-augmented prompting. For a given aerial image, we employ a domain-adapted SigLIP2 encoder to retrieve related captions and few-shot examples from a datastore, which are then provided to a language model. We explore two variants: an image-blind setup, where a multilingual Large Language Model (LLM) generates the caption from textual prompts alone, and an image-aware setup, where a Vision–Language Model (VLM) jointly processes the prompt and the input image. To improve the coherence of the retrieved content, we introduce a graph-based re-ranking strategy using PageRank on a graph of images and captions. Experiments on four benchmark datasets across ten languages demonstrate that our approach is competitive with fully supervised English-only systems and generalizes to other languages. Results also highlight the importance of re-ranking with PageRank, yielding up to 35% improvements in performance metrics. Additionally, it was observed that while VLMs tend to generate visually grounded but lexically diverse captions, LLMs can achieve stronger BLEU and CIDEr scores. Lastly, directly generating captions in the target language consistently outperforms other translation-based strategies. Overall, our work delivers one of the first systematic evaluations of multilingual, training-free captioning for remote sensing imagery, advancing toward more inclusive and scalable multimodal Earth observation systems.
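The graph-based re-ranking step can be illustrated with a plain power-iteration PageRank. The sketch assumes a weighted adjacency matrix over the retrieved items (for instance similarity scores between images and captions); how the paper builds its graph, and the damping factor it uses, are not given in the abstract, so the 0.85 damping value and both function names are assumptions.

```python
def pagerank(adjacency, damping=0.85, iterations=100):
    """Power-iteration PageRank on a weighted adjacency matrix (list of rows)."""
    n = len(adjacency)
    rank = [1.0 / n] * n
    out_weight = [sum(row) for row in adjacency]
    for _ in range(iterations):
        new_rank = [(1.0 - damping) / n] * n
        for i in range(n):
            if out_weight[i] == 0:
                # Dangling node: spread its rank uniformly over all nodes.
                for j in range(n):
                    new_rank[j] += damping * rank[i] / n
            else:
                for j in range(n):
                    new_rank[j] += damping * rank[i] * adjacency[i][j] / out_weight[i]
        rank = new_rank
    return rank


def rerank(items, adjacency):
    """Return `items` ordered from highest to lowest PageRank score."""
    scores = pagerank(adjacency)
    order = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
    return [items[i] for i in order]
```

Items that are similar to many other retrieved items end up near the top, which is one way to favour a mutually consistent set of captions for the prompt.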
[CV-140] HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics
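【Quick Read】: This paper addresses the shortcomings of existing semantic hand-object interaction (HOI) benchmarks in fine-grained spatio-temporal reasoning, in particular their inadequate coverage of the dynamic relationship between human manipulations and the object state changes they cause. The key to the solution is HanDyVQA, a new video question answering (VQA) benchmark with six complementary question types (Action, Process, Objects, Location, State Change, and Object Parts), 11.1K multiple-choice QA pairs, and 10.3K segmentation masks for evaluating object- and part-level reasoning, which together cover both the manipulation and the effect side of HOI. Experiments show that the best current model (Gemini-2.5-Pro) reaches an average accuracy of only 73%, far below human performance (97%), and that spatial relationships, motion modeling, and part-level geometric understanding remain the main challenges; integrating explicit HOI-related cues into visual features improves performance, pointing to a direction for future model design.
Link: https://arxiv.org/abs/2512.00885
Authors: Masatoshi Tateno,Gido Kato,Hirokatsu Kataoka,Yoichi Sato,Takuma Yagi
Affiliations: Institute of Industrial Science, The University of Tokyo; National Institute of Advanced Industrial Science and Technology (AIST); Waseda University; Visual Geometry Group, University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project page: this https URL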
Abstract:Hand-object interaction (HOI) inherently involves dynamics where human manipulations produce distinct spatio-temporal effects on objects. However, existing semantic HOI benchmarks focused either on manipulation or on the resulting effects at a coarse level, lacking fine-grained spatio-temporal reasoning to capture the underlying dynamics in HOI. We introduce HanDyVQA, a fine-grained video question-answering benchmark that comprehensively covers both the manipulation and effect aspects of HOI. HanDyVQA comprises six complementary question types (Action, Process, Objects, Location, State Change, and Object Parts), totalling 11.1K multiple-choice QA pairs. The collected QA pairs involve recognizing manipulation styles, hand/object motions, and part-level state changes. HanDyVQA also includes 10.3K segmentation masks for Objects and Object Parts questions, enabling the evaluation of object/part-level reasoning in video object segmentation. We evaluated recent video foundation models on our benchmark and found that even the best-performing model, Gemini-2.5-Pro, reached only 73% average accuracy, which is far from human performance (97%). Further analysis shows the remaining challenges in spatial relationship, motion, and part-level geometric understanding. We also found that integrating explicit HOI-related cues into visual features improves performance, offering insights for developing future models with a deeper understanding of HOI dynamics.
[CV-141] Audio-Visual World Models: Towards Multisensory Imagination in Sight and Sound
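【Quick Read】: This paper addresses the limited attention that current world model research pays to multimodal perception, in particular the fusion of audio and visual information. Existing methods rely mainly on visual observations, whereas real-world perception is inherently multimodal: audio provides crucial spatial and temporal cues (such as sound source localization and acoustic scene properties), yet its integration into world models remains largely unexplored. To fill this gap, the authors present the first formal framework for Audio-Visual World Models (AVWM), formulated as a partially observable Markov decision process (POMDP) with synchronized audio-visual observations, fine-grained action control, and task reward prediction. The key elements of the solution are: 1) AVW-4k, a dataset of 30 hours of binaural audio-visual trajectories with action annotations and reward signals; 2) AV-CDiT, a conditional diffusion Transformer with a modality expert architecture, trained with a three-stage strategy that balances visual and auditory learning, which achieves high-fidelity joint prediction of vision, audio, and task reward and markedly improves agent performance in continuous audio-visual navigation tasks.
Link: https://arxiv.org/abs/2512.00883
Authors: Jiahua Wang,Shannan Yan,Leqi Zheng,Jialong Wu,Yaoxin Mao
Affiliations: Tsinghua University; Beijing Institute of Technology
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Notes: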
Abstract:World models simulate environmental dynamics to enable agents to plan and reason about future states. While existing approaches have primarily focused on visual observations, real-world perception inherently involves multiple sensory modalities. Audio provides crucial spatial and temporal cues such as sound source localization and acoustic scene properties, yet its integration into world models remains largely unexplored. No prior work has formally defined what constitutes an audio-visual world model or how to jointly capture binaural spatial audio and visual dynamics under precise action control with task reward prediction. This work presents the first formal framework for Audio-Visual World Models (AVWM), formulating multimodal environment simulation as a partially observable Markov decision process with synchronized audio-visual observations, fine-grained actions, and task rewards. To address the lack of suitable training data, we construct AVW-4k, a dataset comprising 30 hours of binaural audio-visual trajectories with action annotations and reward signals across 76 indoor environments. We propose AV-CDiT, an Audio-Visual Conditional Diffusion Transformer with a novel modality expert architecture that balances visual and auditory learning, optimized through a three-stage training strategy for effective multimodal integration. Extensive experiments demonstrate that AV-CDiT achieves high-fidelity multimodal prediction across visual and auditory modalities with reward. Furthermore, we validate its practical utility in continuous audio-visual navigation tasks, where AVWM significantly enhances the agent’s performance.
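[CV-142] Look, Recite, Then Answer: Enhancing VLM Performance via Self-Generated Knowledge Hints
【Quick Read】: This paper addresses the performance plateau of Vision-Language Models (VLMs) in specialized domains such as precision agriculture, which is caused by "Reasoning-Driven Hallucination"; the core issue is a "Modality Gap" in which visual embeddings fail to activate the fine-grained expert knowledge already encoded in the model parameters. The key to the solution is a parameter-efficient, modular "Look, Recite, Then Answer" framework that decouples inference into three stages: Look generates objective visual descriptions and a candidate set; Recite uses a lightweight 1.7B router to turn visual cues into candidate-specific queries that trigger parametric knowledge; Answer aligns the descriptions with the recited knowledge in parallel and selects the most consistent label. This design turns passive perception into a controllable knowledge retrieval mechanism, which reduces hallucination and improves accuracy on specialized tasks.
Link: https://arxiv.org/abs/2512.00882
Authors: Xisheng Feng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: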
Abstract:Vision-Language Models (VLMs) exhibit significant performance plateaus in specialized domains like precision agriculture, primarily due to “Reasoning-Driven Hallucination” where linguistic priors override visual perception. A key bottleneck is the “Modality Gap”: visual embeddings fail to reliably activate the fine-grained expert knowledge already encoded in model parameters. We propose “Look, Recite, Then Answer,” a parameter-efficient framework that enhances VLMs via self-generated knowledge hints while keeping backbone models frozen. The framework decouples inference into three stages: (1) Look generates objective visual descriptions and candidate sets; (2) Recite employs a lightweight 1.7B router to transform visual cues into targeted queries that trigger candidate-specific parametric knowledge; (3) Answer performs parallel evidence alignment between descriptions and recited knowledge to select the most consistent label. On AgroBench, our method achieves state-of-the-art results, improving Weed Identification accuracy by 23.6% over Qwen-VL and surpassing GPT-4o without external search overhead. This modular design mitigates hallucinations by transforming passive perception into active, controllable knowledge retrieval.
[CV-143] Quantum-Inspired Spectral Geometry for Neural Operator Equivalence and Structured Pruning
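【Quick Read】: This paper addresses three bottlenecks facing multimodal intelligence on resource-constrained, heterogeneous domestic hardware: multimodal feature heterogeneity, real-time requirements in dynamic scenarios, and hardware-specific operator redundancy. The key to the solution is a quantum-inspired geometric framework for neural operators that represents each operator by its normalized singular value spectrum on the Bloch hypersphere and proves a tight spectral-to-functional equivalence theorem: a vanishing Fubini–Study or Wasserstein-2 distance guarantees closeness at the functional level, which gives the first rigorous foundation for cross-modal and cross-architecture operator substitutability. Based on this metric, the paper proposes Quantum Metric-Driven Functional Redundancy Graphs (QM-FRG) and one-shot structured pruning, and controlled simulations show that the metric outperforms magnitude-based and random baselines.
Link: https://arxiv.org/abs/2512.00880
Authors: Haijian Shao,Wei Liu,Xing Deng
Affiliations: Jiangsu University of Science and Technology; Nanjing University of Information Science and Technology; University of Nevada, Las Vegas
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 6 pages, 1 figure, preliminary version; concepts and simulation experiments only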
Abstract:The rapid growth of multimodal intelligence on resource-constrained and heterogeneous domestic hardware exposes critical bottlenecks: multimodal feature heterogeneity, real-time requirements in dynamic scenarios, and hardware-specific operator redundancy. This work introduces a quantum-inspired geometric framework for neural operators that represents each operator by its normalized singular value spectrum on the Bloch hypersphere. We prove a tight spectral-to-functional equivalence theorem showing that vanishing Fubini–Study/Wasserstein-2 distance implies provable functional closeness, establishing the first rigorous foundation for cross-modal and cross-architecture operator substitutability. Based on this metric, we propose Quantum Metric-Driven Functional Redundancy Graphs (QM-FRG) and one-shot structured pruning. Controlled simulation validates the superiority of the proposed metric over magnitude and random baselines. An extensive experimental validation on large-scale multimodal transformers and domestic heterogeneous hardware (Huawei Ascend, Cambricon MLU, Kunlunxin) is deferred to an extended journal version currently in preparation.
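One plausible reading of the spectral metric is sketched below: treat each operator's singular values as a vector, normalise it to unit length so it lies on a hypersphere, and measure the Fubini–Study distance as the arc-cosine of the overlap between the two unit vectors. The L2 normalisation, the descending sort, and the zero-padding of the shorter spectrum are assumptions for illustration; the abstract does not spell out the paper's exact construction, and the singular values are taken as given rather than computed here.

```python
import math


def normalized_spectrum(singular_values):
    """Sort singular values in descending order and scale them to unit L2 norm."""
    norm = math.sqrt(sum(s * s for s in singular_values))
    if norm == 0:
        raise ValueError("spectrum must contain a non-zero singular value")
    return [s / norm for s in sorted(singular_values, reverse=True)]


def fubini_study_distance(spectrum_a, spectrum_b):
    """Angle between two normalised spectra, arccos(|<a, b>|), in radians."""
    a = normalized_spectrum(spectrum_a)
    b = normalized_spectrum(spectrum_b)
    size = max(len(a), len(b))
    a = a + [0.0] * (size - len(a))   # pad the shorter spectrum (assumption)
    b = b + [0.0] * (size - len(b))
    overlap = abs(sum(x * y for x, y in zip(a, b)))
    # Clamp to guard against rounding slightly above 1.0.
    return math.acos(min(1.0, overlap))
```

Under this reading, two operators whose spectra differ only by a global scale have distance close to zero, which is the kind of pair a redundancy graph would connect as candidates for pruning or substitution.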
[CV-144] Feed-Forward 3D Gaussian Splatting Compression with Long-Context Modeling
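【Quick Read】: This paper addresses the compression difficulty caused by the large data size of 3D Gaussian Splatting (3DGS) representations, in particular the inability of existing feed-forward compression methods to model long-range spatial dependencies. The key to the solution is a large-scale context structure, built on Morton serialization and spanning thousands of Gaussians, together with a fine-grained space-channel auto-regressive entropy model that exploits this extended context; an attention-based transform coding model additionally aggregates features from distant neighboring Gaussians to extract informative latent priors. The result is efficient, generalizable 3DGS compression that reaches a 20× compression ratio in feed-forward inference and outperforms existing generalizable codecs.
Link: https://arxiv.org/abs/2512.00877
Authors: Zhening Liu,Rui Song,Yushi Huang,Yingdong Hu,Xinjie Zhang,Jiawei Shao,Zehong Lin,Jun Zhang
Affiliations: Hong Kong University of Science and Technology; Institute of Artificial Intelligence (TeleAI), China Telecom
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: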
Abstract:3D Gaussian Splatting (3DGS) has emerged as a revolutionary 3D representation. However, its substantial data size poses a major barrier to widespread adoption. While feed-forward 3DGS compression offers a practical alternative to costly per-scene per-train compressors, existing methods struggle to model long-range spatial dependencies, due to the limited receptive field of transform coding networks and the inadequate context capacity in entropy models. In this work, we propose a novel feed-forward 3DGS compression framework that effectively models long-range correlations to enable highly compact and generalizable 3D representations. Central to our approach is a large-scale context structure that comprises thousands of Gaussians based on Morton serialization. We then design a fine-grained space-channel auto-regressive entropy model to fully leverage this expansive context. Furthermore, we develop an attention-based transform coding model to extract informative latent priors by aggregating features from a wide range of neighboring Gaussians. Our method yields a 20× compression ratio for 3DGS in a feed-forward inference and achieves state-of-the-art performance among generalizable codecs.
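Morton serialization, which the abstract uses to build its long context, orders 3D points along a Z-order curve by interleaving the bits of their quantised coordinates, so that points close in the sequence tend to be close in space. The sketch below shows the general technique; the 10-bit grid, the bounding-box quantisation, and the function names are assumptions and not the paper's implementation.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of three non-negative integers into one code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def morton_order(points, bits=10):
    """Return point indices sorted by the Morton code of their quantised position."""
    lo = [min(p[d] for p in points) for d in range(3)]
    hi = [max(p[d] for p in points) for d in range(3)]
    scale = (1 << bits) - 1

    def quantise(p):
        # Map each coordinate into the integer grid [0, 2**bits - 1].
        return [
            int((p[d] - lo[d]) / (hi[d] - lo[d]) * scale) if hi[d] > lo[d] else 0
            for d in range(3)
        ]

    return sorted(range(len(points)), key=lambda i: morton3d(*quantise(points[i]), bits=bits))
```

Applied to Gaussian centres, `morton_order` gives a 1D sequence in which a window of consecutive entries is mostly a spatial neighbourhood, which is what makes a sequence-style context model a reasonable fit.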
[CV-145] Neural Discrete Representation Learning for Sparse-View CBCT Reconstruction: From Algorithm Design to Prospective Multicenter Clinical Evaluation
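【Quick Read】: This paper addresses the risk of secondary malignancies caused by the high radiation dose of cone beam computed tomography (CBCT)-guided puncture. Existing low-dose CBCT strategies have not been validated on large-scale multicenter retrospective datasets and lack prospective clinical evaluation. The key to the solution is DeepPriorCBCT, a three-stage deep learning framework that achieves diagnostic-grade reconstruction with only one-sixth of the conventional radiation dose. The method was developed and validated on 8675 CBCT scans from 4102 patients at 12 centers, and its clinical applicability was assessed in a prospective cross-over trial with 138 patients: 11 physicians judged the reconstructed images indistinguishable from the original scans, with comparable diagnostic performance and overall image quality, and neither the five radiologists nor the 25 interventionalists showed a significant preference in image quality or surgical guidance. The model can therefore substantially reduce intraoperative radiation exposure while maintaining image quality and diagnostic reliability.
Link: https://arxiv.org/abs/2512.00873
Authors: Haoshen Wang,Lei Chen,Wei-Hua Zhang,Linxia Wu,Yong Luo,Zengmao Wang,Yuan Xiong,Chengcheng Zhu,Wenjuan Tang,Xueyi Zhang,Wei Zhou,Xuhua Duan,Lefei Zhang,Gao-Jun Teng,Bo Du,Huangxuan Zhao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: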
Abstract:Cone beam computed tomography (CBCT)-guided puncture has become an established approach for diagnosing and treating early- to mid-stage thoracic tumours, yet the associated radiation exposure substantially elevates the risk of secondary malignancies. Although multiple low-dose CBCT strategies have been introduced, none have undergone validation using large-scale multicenter retrospective datasets, and prospective clinical evaluation remains lacking. Here, we propose DeepPriorCBCT, a three-stage deep learning framework that achieves diagnostic-grade reconstruction using only one-sixth of the conventional radiation dose. 4102 patients with 8675 CBCT scans from 12 centers were included to develop and validate DeepPriorCBCT. Additionally, a prospective cross-over trial (Registry number: NCT07035977) which recruited 138 patients scheduled for percutaneous thoracic puncture was conducted to assess the model’s clinical applicability. Assessment by 11 physicians confirmed that reconstructed images were indistinguishable from original scans. Moreover, diagnostic performance and overall image quality were comparable to those generated by standard reconstruction algorithms. In the prospective trial, five radiologists reported no significant differences in image quality or lesion assessment between DeepPriorCBCT and the clinical standard (all P > 0.05). Likewise, 25 interventionalists expressed no preference between model-based and full-sampling images for surgical guidance (Kappa < 0.2). Radiation exposure with DeepPriorCBCT was reduced to approximately one-sixth of that with the conventional approach, and collectively, the findings confirm that it enables high-quality CBCT reconstruction under sparse sampling conditions while markedly decreasing intraoperative radiation risk.
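[CV-146] TAP-CT: 3D Task-Agnostic Pretraining of Computed Tomography Foundation Models
【Quick Read】: This paper addresses common practical problems of current foundation models (FMs) in the medical domain: most need extensive fine-tuning or depend on compute-intensive decoders, and many encoders are pretrained with objectives biased toward specific tasks, which limits generalization. To address this, the authors propose task-agnostic pretraining of CT foundation models (TAP-CT). The key is a set of targeted modifications to the Vision Transformer (ViT) and DINOv2, covering patch embeddings, positional encodings, and volumetric augmentations, that enable scalable self-supervised pretraining on 3D CT volumes and make the architecture depth-aware while preserving its simplicity. Experiments show that large-scale pretraining on 105K in-house CT volumes yields stable, robust frozen representations that perform well across downstream tasks.
Link: https://arxiv.org/abs/2512.00872
Authors: Tim Veenboer,George Yiasemis,Eric Marcus,Vivien Van Veldhuizen,Cees G. M. Snoek,Jonas Teuwen,Kevin B. W. Groot Lipman
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 22 pages, 4 figures, 8 tables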
Abstract:Existing foundation models (FMs) in the medical domain often require extensive fine-tuning or rely on training resource-intensive decoders, while many existing encoders are pretrained with objectives biased toward specific tasks. This illustrates a need for a strong, task-agnostic foundation model that requires minimal fine-tuning beyond feature extraction. In this work, we introduce a suite of task-agnostic pretraining of CT foundation models (TAP-CT): a simple yet effective adaptation of Vision Transformers (ViTs) and DINOv2 for volumetric data, enabling scalable self-supervised pretraining directly on 3D CT volumes. Our approach incorporates targeted modifications to patch embeddings, positional encodings, and volumetric augmentations, making the architecture depth-aware while preserving the simplicity of the underlying architectures. We show that large-scale 3D pretraining on an extensive in-house CT dataset (105K volumes) yields stable, robust frozen representations that generalize strongly across downstream tasks. To promote transparency and reproducibility, and to establish a powerful, low-resource baseline for future research in medical imaging, we will release all pretrained models, experimental configurations, and downstream benchmark code at this https URL.
[CV-147] Smol-GS: Compact Representations for Abstract 3D Gaussian Splatting
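【Quick Read】: This paper addresses redundant scene representation and inefficient compression in 3D Gaussian Splatting (3DGS), aiming for high compression without sacrificing rendering quality. The key to the solution is Smol-GS, which captures the spatial information of splat coordinates through a recursive voxel hierarchy and stores abstracted cues such as color, opacity, transformation, and material properties in per-splat feature vectors, so that 3D scenes are encoded and compressed efficiently without loss of flexibility.
Link: https://arxiv.org/abs/2512.00850
Authors: Haishan Wang,Mohammad Hassan Vali,Arno Solin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: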
Abstract:We present Smol-GS, a novel method for learning compact representations for 3D Gaussian Splatting (3DGS). Our approach learns highly efficient encodings in 3D space that integrate both spatial and semantic information. The model captures the coordinates of the splats through a recursive voxel hierarchy, while splat-wise features store abstracted cues, including color, opacity, transformation, and material properties. This design allows the model to compress 3D scenes by orders of magnitude without loss of flexibility. Smol-GS achieves state-of-the-art compression on standard benchmarks while maintaining high rendering quality. Beyond visual fidelity, the discrete representations could potentially serve as a foundation for downstream tasks such as navigation, planning, and broader 3D scene understanding.
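[CV-148] AFRAgent: An Adaptive Feature Renormalization Based High Resolution Aware GUI agent WACV2026
【Quick Read】: This paper addresses two core problems of graphical user interface (GUI) automation based on visual language models (VLMs): first, the limited spatial information in vision encoder features leads to inaccurate widget identification and difficulty in deciding actions; second, top-performing models are large, costly to train, and slow at inference. The key to the solution is AFRAgent, an instruct-BLIP-based multimodal architecture that introduces an adaptive feature renormalization technique: a token-level affine transformation enriches low-resolution image embeddings and fuses high-resolution details, which gives better GUI automation performance at less than one-fourth the size of the nearest competitor.
Link: https://arxiv.org/abs/2512.00846
Authors: Neeraj Anand,Rishabh Jain,Sohan Patnaik,Balaji Krishnamurthy,Mausoom Sarkar
Affiliations: Adobe
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted at WACV 2026 Conference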
Abstract:There is a growing demand for mobile user interface (UI) automation, driven by its broad applications across industries. With the advent of visual language models (VLMs), GUI automation has progressed from generating text-based instructions for humans to autonomously executing tasks, thus optimizing automation workflows. Recent approaches leverage VLMs for this problem due to their ability to 1) process on-screen content directly, 2) remain independent of device-specific APIs by utilizing human actions (e.g., clicks, typing), and 3) apply real-world contextual knowledge for task understanding. However, these models often have trouble accurately identifying widgets and determining actions due to limited spatial information in vision encoder features. Additionally, top-performing models are often large, requiring extensive training and resulting in inference delays. In this work, we introduce AFRAgent, an instruct-BLIP-based multimodal architecture that achieves superior performance in GUI automation while being less than one-fourth the size of its nearest competitor. To enhance image embeddings in the large language model (LLM) pipeline, we propose an adaptive feature renormalization-based (a token-level affine transformation) technique that effectively enriches low-resolution image embeddings and fuses high-resolution details. We evaluate AFRAgent on Meta-GUI and AITW benchmarks, establishing a new state-of-the-art baseline for smartphone automation.
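A token-level affine transformation of the kind the abstract describes can be pictured as follows: each low-resolution token is normalised and then rescaled and shifted by per-token parameters. In the sketch, the scale and shift vectors are passed in directly and stand in for values that a small network would presumably predict from high-resolution features; that conditioning path, the per-token normalisation, and the function names are assumptions, not details confirmed by the abstract.

```python
import math


def normalise(token, eps=1e-6):
    """Zero-mean, unit-variance normalisation of one token embedding."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]


def renormalise(low_res_tokens, scales, shifts):
    """Apply a per-token affine transform: scale * normalise(token) + shift.

    `scales[i]` and `shifts[i]` are vectors for token i; here they are plain
    inputs standing in for parameters derived from high-resolution features.
    """
    out = []
    for token, gamma, beta in zip(low_res_tokens, scales, shifts):
        out.append([g * v + b for g, v, b in zip(gamma, normalise(token), beta)])
    return out
```

Because the transform only modulates existing tokens, the sequence length seen by the language model stays the same, which is one way such a design could add high-resolution detail without lengthening the prompt.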
[CV-149] PanFlow: Decoupled Motion Control for Panoramic Video Generation AAAI
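【Quick Read】: This paper addresses the lack of explicit motion control in panoramic video generation and the difficulty of generating scenes with large and complex motions. The key to the solution is PanFlow, which exploits the spherical nature of panoramas to decouple the highly dynamic camera rotation from the input optical flow condition, enabling more precise control over large and dynamic motions; it also introduces a spherical noise warping strategy that promotes consistent motion across panorama boundaries and thus temporal coherence.
Link: https://arxiv.org/abs/2512.00832
Authors: Cheng Zhang,Hanwen Liang,Donny Y. Chen,Qianyi Wu,Konstantinos N. Plataniotis,Camilo Cruz Gambardella,Jianfei Cai
Affiliations: 1. University of Toronto; 2. Vector Institute; 3. University of Toronto; 4. Swiss Federal Institute of Technology in Lausanne
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by AAAI. Code: this https URL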
Abstract:Panoramic video generation has attracted growing attention due to its applications in virtual reality and immersive media. However, existing methods lack explicit motion control and struggle to generate scenes with large and complex motions. We propose PanFlow, a novel approach that exploits the spherical nature of panoramas to decouple the highly dynamic camera rotation from the input optical flow condition, enabling more precise control over large and dynamic motions. We further introduce a spherical noise warping strategy to promote loop consistency in motion across panorama boundaries. To support effective training, we curate a large-scale, motion-rich panoramic video dataset with frame-level pose and flow annotations. We also showcase the effectiveness of our method in various applications, including motion transfer and video editing. Extensive experiments demonstrate that PanFlow significantly outperforms prior methods in motion fidelity, visual quality, and temporal coherence. Our code, dataset, and models are available at this https URL.
[CV-150] Med-CMR: A Fine-Grained Benchmark Integrating Visual Evidence and Clinical Logic for Medical Complex Multimodal Reasoning
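[Quick read]: This paper addresses the unclear ability of current multimodal large language models (MLLMs) to perform complex medical reasoning in clinical settings. Existing evaluations lack a fine-grained decomposition of medical multimodal reasoning and validation with high clinical authenticity, which leaves blind spots in performance assessment. The key to the solution is the Med-CMR benchmark, whose core contributions are: 1) a systematic capability decomposition that splits medical multimodal reasoning into visual understanding and multi-step reasoning for targeted evaluation; 2) challenging task design, covering visual dimensions such as small-object detection, fine-detail discrimination, and spatial understanding, and clinically relevant reasoning scenarios such as temporal prediction, causal reasoning, long-tail generalization, and multi-source integration; 3) a high-quality VQA dataset of 20,653 pairs spanning 11 organ systems and 12 imaging modalities, validated through a two-stage (expert plus model-assisted) review to ensure clinical authenticity. The benchmark provides a rigorous standard for assessing visual-reasoning integration and rare-case robustness of MLLMs in medical settings.
Link: https://arxiv.org/abs/2512.00818
Authors: Haozhen Gong, Xiaozhong Ji, Yuansen Liu, Wenbin Wu, Xiaoxiao Yan, Jingjing Liu, Kai Wu, Jiazhen Pan, Bailiang Jian, Jiangning Zhang, Xiaobin Hu, Hongwei Bran Li
Affiliations: National University of Singapore; Nanjing University; Tongji University; Ruijin Hospital; Technical University of Munich; Zhejiang University
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: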
Abstract:MLLMs are beginning to appear in clinical workflows, but their ability to perform complex medical reasoning remains unclear. We present Med-CMR, a fine-grained Medical Complex Multimodal Reasoning benchmark. Med-CMR is distinguished from existing counterparts by three core features: 1) Systematic capability decomposition, splitting medical multimodal reasoning into fine-grained visual understanding and multi-step reasoning to enable targeted evaluation; 2) Challenging task design, with visual understanding across three key dimensions (small-object detection, fine-detail discrimination, spatial understanding) and reasoning covering four clinically relevant scenarios (temporal prediction, causal reasoning, long-tail generalization, multi-source integration); 3) Broad, high-quality data coverage, comprising 20,653 Visual Question Answering (VQA) pairs spanning 11 organ systems and 12 imaging modalities, validated via a rigorous two-stage (human expert + model-assisted) review to ensure clinical authenticity. We evaluate 18 state-of-the-art MLLMs with Med-CMR, revealing GPT-5 as the top-performing commercial model: 57.81 accuracy on multiple-choice questions (MCQs) and a 48.70 open-ended score, outperforming Gemini 2.5 Pro (49.87 MCQ accuracy, 45.98 open-ended score) and leading open-source model Qwen3-VL-235B-A22B (49.34 MCQ accuracy, 42.62 open-ended score). However, specialized medical MLLMs do not reliably outperform strong general models, and long-tail generalization emerges as the dominant failure mode. Med-CMR thus provides a stress test for visual-reasoning integration and rare-case robustness in medical MLLMs, and a rigorous yardstick for future clinical systems.
[CV-151] IRPO: Boosting Image Restoration via Post-training GRPO
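[Quick read]: This paper addresses the over-smoothing and poor generalization of existing image restoration (IR) methods in low-level vision, which stem from pixel-level hard-fitting to ground-truth images. The key to the solution is IRPO, a post-training paradigm based on GRPO (Group Relative Policy Optimization), built on two core ideas: first, a data formulation principle that selects samples on which the pre-trained model underperforms for post-training, improving both performance and efficiency; second, a multi-component reward model comprising a General Reward for structural fidelity, an Expert Reward that uses Qwen-VL for perceptual alignment, and a task-specific Restoration Reward, balancing objective accuracy against human perceptual preference.
Link: https://arxiv.org/abs/2512.00814
Authors: Haoxuan Xu, Yi Liu, Boyuan Jiang, Jinlong Peng, Donghao Luo, Xiaobin Hu, Shuicheng Yan, Haoang Li
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Tsinghua University; Tencent Youtu Lab; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: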
Abstract:Recent advances in post-training paradigms have achieved remarkable success in high-level generation tasks, yet their potential for low-level vision remains rarely explored. Existing image restoration (IR) methods rely on pixel-level hard-fitting to ground-truth images, struggling with over-smoothing and poor generalization. To address these limitations, we propose IRPO, a low-level GRPO-based post-training paradigm that systematically explores both data formulation and reward modeling. We first explore a data formulation principle for the low-level post-training paradigm, in which selecting underperforming samples from the pre-training stage yields optimal performance and improved efficiency. Furthermore, we model a reward-level criteria system that balances objective accuracy and human perceptual preference through three complementary components: a General Reward for structural fidelity, an Expert Reward leveraging Qwen-VL for perceptual alignment, and a Restoration Reward for task-specific low-level quality. Comprehensive experiments on six in-domain and five out-of-domain (OOD) low-level benchmarks demonstrate that IRPO achieves state-of-the-art results across diverse degradation types, surpassing the AdaIR baseline by 0.83 dB on in-domain tasks and 3.43 dB on OOD settings. Our code is available at this https URL.
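[CV-152] Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding
[Quick read]: This paper addresses the efficiency bottleneck that video multimodal large language models (MLLMs) face in long video understanding: alternating between global temporal reasoning and local frame examination accumulates redundant multimodal context and slows inference. The key to the solution is SpecTemp, a reinforcement learning-based speculative temporal reasoning framework whose core innovation is a cooperative dual-model design that decouples temporal perception from reasoning: a lightweight draft MLLM rapidly explores and proposes key frames, while a powerful target MLLM focuses on temporal reasoning and verifies the draft's proposals, iteratively refining its attention until convergence. The design mirrors the collaborative cognitive pathways of the human brain and markedly improves inference efficiency while preserving accuracy.
Link: https://arxiv.org/abs/2512.00805
Authors: Pengfei Hu, Meng Cao, Yingyao Wang, Yi Wang, Jiahua Dong, Jun Song, Yu Cheng, Bo Zheng, Xiaodan Liang
Affiliations: MBZUAI; Alibaba Group; Shanghai AI Lab; Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: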
Abstract:Long video understanding is essential for human-like intelligence, enabling coherent perception and reasoning over extended temporal contexts. While the emerging thinking-with-frames paradigm, which alternates between global temporal reasoning and local frame examination, has advanced the reasoning capabilities of video multi-modal large language models (MLLMs), it suffers from a significant efficiency bottleneck due to the progressively growing and redundant multi-modal context. To address this, we propose SpecTemp, a reinforcement learning-based Speculative Temporal reasoning framework that decouples temporal perception from reasoning via a cooperative dual-model design. In SpecTemp, a lightweight draft MLLM rapidly explores and proposes salient frames from densely sampled temporal regions, while a powerful target MLLM focuses on temporal reasoning and verifies the draft’s proposals, iteratively refining its attention until convergence. This design mirrors the collaborative pathways of the human brain, balancing efficiency with accuracy. To support training, we construct the SpecTemp-80K dataset, featuring synchronized dual-level annotations for coarse evidence spans and fine-grained frame-level evidence. Experiments across multiple video understanding benchmarks demonstrate that SpecTemp not only maintains competitive accuracy but also significantly accelerates inference compared with existing thinking-with-frames methods.
[CV-153] CircleFlow: Flow-Guided Camera Blur Estimation using a Circle Grid Target
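[Quick read]: This paper addresses a key challenge in point spread function (PSF) estimation: high-accuracy PSF modeling is difficult because intensity-domain deconvolution is ill-posed and inherently ambiguous. The core of the solution is the CircleFlow framework, which encodes locally anisotropic and spatially varying PSFs through a structured capture of a circle grid target and uses the target's binary luminance prior to decouple image and kernel estimation. The latent sharp image is reconstructed through optical-flow-guided subpixel alignment, the PSF is modeled as an energy-constrained implicit neural representation, and both are jointly optimized in a demosaicing-aware differentiable framework, yielding physically consistent and robust PSF estimates.
Link: https://arxiv.org/abs/2512.00796
Authors: Jiajian He, Enjie Hu, Shiqi Chen, Tianchen Qiu, Huajun Feng, Zhihai Xu, Yueting Chen
Affiliations: Zhejiang University; University of California, Los Angeles
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: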
Abstract:The point spread function (PSF) serves as a fundamental descriptor linking the real-world scene to the captured signal, manifesting as camera blur. Accurate PSF estimation is crucial for both optical characterization and computational vision, yet remains challenging due to the inherent ambiguity and the ill-posed nature of intensity-based deconvolution. We introduce CircleFlow, a high-fidelity PSF estimation framework that employs flow-guided edge localization for precise blur characterization. CircleFlow begins with a structured capture that encodes locally anisotropic and spatially varying PSFs by imaging a circle grid target, while leveraging the target’s binary luminance prior to decouple image and kernel estimation. The latent sharp image is then reconstructed through subpixel alignment of an initialized binary structure guided by optical flow, whereas the PSF is modeled as an energy-constrained implicit neural representation. Both components are jointly optimized within a demosaicing-aware differentiable framework, ensuring physically consistent and robust PSF estimation enabled by accurate edge localization. Extensive experiments on simulated and real-world data demonstrate that CircleFlow achieves state-of-the-art accuracy and reliability, validating its effectiveness for practical PSF calibration.
[CV-154] PolarGS: Polarimetric Cues for Ambiguity-Free Gaussian Splatting with Accurate Geometry Recovery
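[Quick read]: This paper addresses the loss of geometric reconstruction accuracy that RGB-based 3D Gaussian Splatting (3DGS) suffers in photometrically ambiguous regions, such as reflective and textureless surfaces, where photometric consistency breaks down. The core solution is PolarGS, an optics-aware extension that introduces polarization as an optical prior to strengthen photometric consistency and improve geometry recovery. It has two key modules: a photometric correction strategy based on the Degree of Linear Polarization (DoLP), which identifies reflective regions and refines the color of the corresponding Gaussians; and a Gaussian densification mechanism that fuses the Angle and Degree of Linear Polarization (A/DoLP) with PatchMatch-based depth completion to recover geometry in textureless regions, producing a more complete reconstruction of the scene.
Link: https://arxiv.org/abs/2512.00794
Authors: Bo Guo, Sijia Wen, Yifan Zhao, Jia Li, Zhiming Zheng
Affiliations: Beihang University; Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: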
Abstract:Recent advances in surface reconstruction for 3D Gaussian Splatting (3DGS) have enabled remarkable geometric accuracy. However, their performance degrades in photometrically ambiguous regions such as reflective and textureless surfaces, where unreliable cues disrupt photometric consistency and hinder accurate geometry estimation. Reflected light is often partially polarized in a manner that reveals surface orientation, making polarization an optical complement to photometric cues in resolving such ambiguities. Therefore, we propose PolarGS, an optics-aware extension of RGB-based 3DGS that leverages polarization as an optical prior to resolve photometric ambiguities and enhance reconstruction accuracy. Specifically, we introduce two complementary modules: a polarization-guided photometric correction strategy, which ensures photometric consistency by identifying reflective regions via the Degree of Linear Polarization (DoLP) and refining reflective Gaussians with Color Refinement Maps; and a polarization-enhanced Gaussian densification mechanism for textureless area geometry recovery, which integrates both Angle and Degree of Linear Polarization (A/DoLP) into a PatchMatch-based depth completion process. This enables the back-projection and fusion of new Gaussians, leading to more complete reconstruction. PolarGS is framework-agnostic and achieves superior geometric accuracy compared to state-of-the-art methods.
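For readers unfamiliar with the two polarization cues named in the abstract, the sketch below computes DoLP and AoLP for a single pixel using the standard linear Stokes parameters. It assumes intensities measured behind polarizers at 0, 45, 90, and 135 degrees, which is a common capture setup but is not stated in the abstract; the function name and sample values are illustrative only.

```python
import math

def linear_polarization(i0, i45, i90, i135):
    """Return (DoLP, AoLP in radians) from four polarizer-angle intensities."""
    # Linear Stokes parameters: total intensity and two difference terms.
    s0 = 0.5 * (i0 + i45 + i90 + i135)
    s1 = i0 - i90
    s2 = i45 - i135
    dolp = math.hypot(s1, s2) / s0 if s0 > 0 else 0.0
    aolp = 0.5 * math.atan2(s2, s1)
    return dolp, aolp

unpolarized = linear_polarization(0.5, 0.5, 0.5, 0.5)   # equal intensity at every angle
horizontal = linear_polarization(1.0, 0.5, 0.0, 0.5)    # fully polarized along 0 degrees
diagonal = linear_polarization(0.5, 1.0, 0.5, 0.0)      # fully polarized along 45 degrees
```

A DoLP near zero indicates unpolarized light, while values near one indicate strongly polarized light, which is why a DoLP threshold is a plausible way to flag reflective regions.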
[CV-155] Sign Language Recognition using Bidirectional Reservoir Computing
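[Quick read]: This paper addresses the heavy computational cost and long training time of deep learning for sign language recognition (SLR), which make it hard to deploy on resource-constrained edge devices. The key to the solution is a bidirectional reservoir computing (BRC) architecture built on MediaPipe and an echo state network (ESN): MediaPipe extracts hand joint coordinates as input features, the BRC then captures temporal dependencies efficiently in both the forward and backward directions, and the BRC states are concatenated into a robust representation for classification. On the WLASL dataset the method reaches 57.71% accuracy with a training time of only 9 seconds, far shorter than the 55 minutes and 38 seconds needed by the deep learning-based Bi-GRU approach, yielding an efficient, low-resource SLR system.
Link: https://arxiv.org/abs/2512.00777
Authors: Nitin Kumar Singh, Arie Rachmad Syulistyo, Yuichiro Tanaka, Hakaru Tamukoh
Affiliations: Kyushu Institute of Technology; State Polytechnic of Malang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: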
Abstract:Sign language recognition (SLR) facilitates communication between deaf and hearing individuals. Deep learning is widely used to develop SLR-based systems; however, it is computationally intensive and requires substantial computational resources, making it unsuitable for resource-constrained devices. To address this, we propose an efficient sign language recognition system using MediaPipe and an echo state network (ESN)-based bidirectional reservoir computing (BRC) architecture. MediaPipe extracts hand joint coordinates, which serve as inputs to the ESN-based BRC architecture. The BRC processes these features in both forward and backward directions, efficiently capturing temporal dependencies. The resulting states of BRC are concatenated to form a robust representation for classification. We evaluated our method on the Word-Level American Sign Language (WLASL) video dataset, achieving a competitive accuracy of 57.71% and a significantly lower training time of only 9 seconds, in contrast to the 55 minutes and 38 seconds required by the deep learning-based Bi-GRU approach. Consequently, the BRC-based SLR system is well-suited for edge devices.
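The pipeline in the abstract (a fixed reservoir run forward and backward, concatenated states, a trained classifier) can be sketched with a small echo state network. This is a minimal illustration under assumptions, not the paper's configuration: the reservoir size, leak rate, use of the final state only, sharing one reservoir for both directions, the ridge-regression readout, and the 42-dimensional random toy input are all hypothetical choices.

```python
import numpy as np

def make_reservoir(n_inputs, n_units, spectral_radius=0.9, seed=0):
    """Random, untrained input and recurrent weights, rescaled to a target spectral radius."""
    rng = np.random.default_rng(seed)
    w_in = rng.uniform(-0.5, 0.5, size=(n_units, n_inputs))
    w = rng.uniform(-0.5, 0.5, size=(n_units, n_units))
    w *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w)))
    return w_in, w

def final_state(sequence, w_in, w, leak=0.3):
    """Run a leaky-integrator reservoir over the sequence and return its last state."""
    state = np.zeros(w.shape[0])
    for frame in sequence:
        update = np.tanh(w_in @ frame + w @ state)
        state = (1.0 - leak) * state + leak * update
    return state

def bidirectional_features(sequence, w_in, w):
    """Concatenate the states reached by reading the sequence forward and backward."""
    forward = final_state(sequence, w_in, w)
    backward = final_state(sequence[::-1], w_in, w)
    return np.concatenate([forward, backward])

def fit_readout(features, one_hot_labels, ridge=1e-2):
    """Closed-form ridge regression; the readout is the only trained part."""
    gram = features.T @ features + ridge * np.eye(features.shape[1])
    return np.linalg.solve(gram, features.T @ one_hot_labels)

# Toy data: 6 random sequences of 20 frames with 42 features each (hypothetical input size).
rng = np.random.default_rng(1)
sequences = [rng.normal(size=(20, 42)) for _ in range(6)]
labels = np.eye(2)[[0, 1, 0, 1, 0, 1]]
w_in, w = make_reservoir(n_inputs=42, n_units=50)
features = np.stack([bidirectional_features(s, w_in, w) for s in sequences])
w_out = fit_readout(features, labels)
predictions = (features @ w_out).argmax(axis=1)
```

Because the reservoir weights stay fixed and the readout has a closed-form solution, training involves no backpropagation through time, which is consistent with the very short training time reported in the abstract.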
[CV-156] DEJIMA: A Novel Large-scale Japanese Dataset for Image Captioning and Visual Question Answering
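[Quick read]: This paper addresses the scarcity of high-quality, large-scale data for Japanese vision-and-language (VL) modeling. The core solution is a scalable and reproducible pipeline whose key steps are large-scale web collection, rigorous filtering and deduplication, object-detection-driven evidence extraction, and refinement by a large language model (LLM) under grounding constraints. The two resulting datasets (DEJIMA-Cap and DEJIMA-VQA) each contain 3.88M image-text pairs, far exceeding existing Japanese VL datasets in size. Human evaluation shows higher "Japaneseness" and linguistic naturalness than translated or manually annotated data, with factual correctness comparable to human-annotated corpora. Quantitative analysis further confirms that DEJIMA's visual distribution covers scenes typical of Japan, providing a solid basis for improving Japanese multimodal models.
Link: https://arxiv.org/abs/2512.00773
Authors: Toshiki Katsube, Taiga Fukuhara, Kenichiro Ando, Yusuke Mukuta, Kohei Uehara, Tatsuya Harada
Affiliations: The University of Tokyo; RIKEN
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: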
Abstract:This work addresses the scarcity of high-quality, large-scale resources for Japanese Vision-and-Language (VL) modeling. We present a scalable and reproducible pipeline that integrates large-scale web collection with rigorous filtering/deduplication, object-detection-driven evidence extraction, and Large Language Model (LLM)-based refinement under grounding constraints. Using this pipeline, we build two resources: an image-caption dataset (DEJIMA-Cap) and a VQA dataset (DEJIMA-VQA), each containing 3.88M image-text pairs, far exceeding the size of existing Japanese VL datasets. Human evaluations demonstrate that DEJIMA achieves substantially higher Japaneseness and linguistic naturalness than datasets constructed via translation or manual annotation, while maintaining factual correctness at a level comparable to human-annotated corpora. Quantitative analyses of image feature distributions further confirm that DEJIMA broadly covers diverse visual domains characteristic of Japan, complementing its linguistic and cultural representativeness. Models trained on DEJIMA exhibit consistent improvements across multiple Japanese multimodal benchmarks, confirming that culturally grounded, large-scale resources play a key role in enhancing model performance. All data sources and modules in our pipeline are licensed for commercial use, and we publicly release the resulting dataset and metadata to encourage further research and industrial applications in Japanese VL modeling.
[CV-157] EAG3R: Event-Augmented 3D Geometry Estimation for Dynamic and Extreme-Lighting Scenes NEURIPS2025
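[Quick read]: This paper addresses the degraded performance of RGB-based 3D geometry estimation methods (such as DUSt3R) in dynamic scenes and under extreme lighting, where conventional cameras struggle to extract reliable features in low light or in the presence of dynamic objects. The key to the solution is the EAG3R framework, which augments pointmap reconstruction with asynchronous event streams. First, it combines a Retinex-inspired image enhancement module and a lightweight event adapter with an SNR-aware fusion mechanism that adaptively fuses RGB and event features according to local reliability. Second, it proposes an event-based photometric consistency loss that reinforces spatiotemporal coherence during global optimization. Together these substantially improve the robustness of geometry estimation in challenging environments without retraining on night-time data.
Link: https://arxiv.org/abs/2512.00771
Authors: Xiaoshan Wu, Yifei Yu, Xiaoyang Lyu, Yihua Huang, Bo Wang, Baoheng Zhang, Zhongrui Wang, Xiaojuan Qi
Affiliations: The University of Hong Kong; Southern University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at NeurIPS 2025 (spotlight)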
Abstract:Robust 3D geometry estimation from videos is critical for applications such as autonomous navigation, SLAM, and 3D scene reconstruction. Recent methods like DUSt3R demonstrate that regressing dense pointmaps from image pairs enables accurate and efficient pose-free reconstruction. However, existing RGB-only approaches struggle under real-world conditions involving dynamic objects and extreme illumination, due to the inherent limitations of conventional cameras. In this paper, we propose EAG3R, a novel geometry estimation framework that augments pointmap-based reconstruction with asynchronous event streams. Built upon the MonST3R backbone, EAG3R introduces two key innovations: (1) a retinex-inspired image enhancement module and a lightweight event adapter with SNR-aware fusion mechanism that adaptively combines RGB and event features based on local reliability; and (2) a novel event-based photometric consistency loss that reinforces spatiotemporal coherence during global optimization. Our method enables robust geometry estimation in challenging dynamic low-light scenes without requiring retraining on night-time data. Extensive experiments demonstrate that EAG3R significantly outperforms state-of-the-art RGB-only baselines across monocular depth estimation, camera pose tracking, and dynamic reconstruction tasks.
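[CV-158] The Outline of Deception: Physical Adversarial Attacks on Traffic Signs Using Edge Patches
[Quick read]: This paper addresses the vulnerability of traffic signs in intelligent driving systems to physical adversarial attacks, which can cause misclassification and erroneous driving decisions and, in vehicle-to-everything (V2X) settings, trigger cascading failures that destabilize overall traffic flow. Existing physical attacks concentrate perturbations in the central region of the sign, so they lack stealth, are easily noticed by human observers, and have limited real-world applicability. The key to the solution is TESP-Attack, a stealth-aware adversarial patch method: starting from the observation that human visual attention focuses mainly on the central region of a traffic sign, it uses instance segmentation to generate edge-aligned masks that follow the sign's shape; a U-Net generator then crafts the adversarial patches, which are optimized with color and texture constraints and frequency-domain analysis to blend closely with the background, markedly improving visual concealment. The method achieves attack success rates above 90% on traffic sign classifiers of varied architectures, with strong cross-model transferability and robustness across angles and distances.
Link: https://arxiv.org/abs/2512.00765
Authors: Haojie Jia, Te Hu, Haowen Li, Long Jin, Chongshi Xin, Yuchi Yao, Jiarui Xiao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: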
Abstract:Intelligent driving systems are vulnerable to physical adversarial attacks on traffic signs. These attacks can cause misclassification, leading to erroneous driving decisions that compromise road safety. Moreover, within V2X networks, such misinterpretations can propagate, inducing cascading failures that disrupt overall traffic flow and system stability. However, a key limitation of current physical attacks is their lack of stealth. Most methods apply perturbations to central regions of the sign, resulting in visually salient patterns that are easily detectable by human observers, thereby limiting their real-world practicality. This study proposes TESP-Attack, a novel stealth-aware adversarial patch method for traffic sign classification. Based on the observation that human visual attention primarily focuses on the central regions of traffic signs, we employ instance segmentation to generate edge-aligned masks that conform to the shape characteristics of the signs. A U-Net generator is utilized to craft adversarial patches, which are then optimized through color and texture constraints along with frequency domain analysis to achieve seamless integration with the background environment, resulting in highly effective visual concealment. The proposed method demonstrates outstanding attack success rates across traffic sign classification models with varied architectures, achieving over 90% under limited query budgets. It also exhibits strong cross-model transferability and maintains robust real-world performance that remains stable under varying angles and distances.
[CV-159] Seeing the Wind from a Falling Leaf NEURIPS2025
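[Quick read]: This paper addresses the recovery of invisible physical forces from video, that is, inferring from visual observations the implicit physical interactions that cause objects to deform and move. The key solution is an end-to-end differentiable inverse graphics framework that jointly models object geometry, physical properties, and interactions directly from video and recovers force-field representations through backpropagation.
Link: https://arxiv.org/abs/2512.00762
Authors: Zhiyuan Gao, Jiageng Mao, Hong-Xing Yu, Haozhe Lou, Emily Yue-Ting Jia, Jernej Barbic, Jiajun Wu, Yue Wang
Affiliations: University of Southern California; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at NeurIPS 2025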
Abstract:A longstanding goal in computer vision is to model motions from videos, while the representations behind motions, i.e. the invisible physical interactions that cause objects to deform and move, remain largely unexplored. In this paper, we study how to recover the invisible forces from visual observations, e.g., estimating the wind field by observing a leaf falling to the ground. Our key innovation is an end-to-end differentiable inverse graphics framework, which jointly models object geometry, physical properties, and interactions directly from videos. Through backpropagation, our approach enables the recovery of force representations from object motions. We validate our method on both synthetic and real-world scenarios, and the results demonstrate its ability to infer plausible force fields from videos. Furthermore, we show the potential applications of our approach, including physics-based video generation and editing. We hope our approach sheds light on understanding and modeling the physical process behind pixels, bridging the gap between vision and physics. Please check more video results on our project page (this https URL).
[CV-160] Charts Are Not Images: On the Challenges of Scientific Chart Editing
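[Quick read]: This paper addresses a fundamental problem in how current generative models edit scientific charts: existing approaches (such as diffusion and autoregressive models) treat a chart as a mere arrangement of pixels and ignore that it is a visualization of structured data governed by a graphical grammar. As a result, they cannot correctly perform edits that require understanding data structure and semantics. The key to the solution is FigEdit, a large-scale benchmark of more than 30,000 samples covering 10 chart types and complex editing instructions, organized into five progressively harder task categories (single edits, multi edits, conversational edits, visual-guidance-based edits, and style transfer) to systematically evaluate how well models handle structured chart transformations. The benchmark exposes serious shortcomings of existing models on scientific chart editing, shows that traditional pixel-level metrics (e.g., SSIM, PSNR) struggle to measure semantic correctness, and provides a common evaluation standard and research basis for future structure-aware chart editing models.
Link: https://arxiv.org/abs/2512.00752
Authors: Shawn Li, Ryan Rossi, Sungchul Kim, Sunav Choudhary, Franck Dernoncourt, Puneet Mathur, Zhengzhong Tu, Yue Zhao
Affiliations: University of Southern California; Adobe Research; Texas A&M University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: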
Abstract:Generative models, such as diffusion and autoregressive approaches, have demonstrated impressive capabilities in editing natural images. However, applying these tools to scientific charts rests on a flawed assumption: a chart is not merely an arrangement of pixels but a visual representation of structured data governed by a graphical grammar. Consequently, chart editing is not a pixel-manipulation task but a structured transformation problem. To address this fundamental mismatch, we introduce FigEdit, a large-scale benchmark for scientific figure editing comprising over 30,000 samples. Grounded in real-world data, our benchmark is distinguished by its diversity, covering 10 distinct chart types and a rich vocabulary of complex editing instructions. The benchmark is organized into five distinct and progressively challenging tasks: single edits, multi edits, conversational edits, visual-guidance-based edits, and style transfer. Our evaluation of a range of state-of-the-art models on this benchmark reveals their poor performance on scientific figures, as they consistently fail to handle the underlying structured transformations required for valid edits. Furthermore, our analysis indicates that traditional evaluation metrics (e.g., SSIM, PSNR) have limitations in capturing the semantic correctness of chart edits. Our benchmark demonstrates the profound limitations of pixel-level manipulation and provides a robust foundation for developing and evaluating future structure-aware models. By releasing FigEdit (this https URL), we aim to enable systematic progress in structure-aware figure editing, provide a common ground for fair comparison, and encourage future research on models that understand both the visual and semantic layers of scientific charts.
[CV-161] Probabilistic Modeling of Multi-rater Medical Image Segmentation for Diversity and Personalization
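[Quick read]: This paper addresses the data uncertainty in medical image segmentation that arises from ambiguous annotation boundaries and diagnostic disagreement among experts. Existing methods typically either generate diverse segmentations that lack expert specificity or reproduce the personalized output of a single annotator, and cannot deliver diversity and personalization together. The key to the solution is ProSeg, a probabilistic modeling framework that introduces two latent variables to model expert annotation preferences and image boundary ambiguity, obtains their conditional distributions through variational inference, and samples from those distributions to produce segmentations that are both diverse and expert-personalized.
Link: https://arxiv.org/abs/2512.00748
Authors: Ke Liu, Shangde Gao, Yichao Fu, Shangqi Gao, Chunhua Shen
Affiliations: Zhejiang University; University of Cambridge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: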
Abstract:Medical image segmentation is inherently influenced by data uncertainty, arising from ambiguous boundaries in medical scans and inter-observer variability in diagnosis. To address this challenge, previous works formulated the multi-rater medical image segmentation task, where multiple experts provide separate annotations for each image. However, existing models are typically constrained to either generate diverse segmentation that lacks expert specificity or to produce personalized outputs that merely replicate individual annotators. We propose Probabilistic modeling of multi-rater medical image Segmentation (ProSeg) that simultaneously enables both diversification and personalization. Specifically, we introduce two latent variables to model expert annotation preferences and image boundary ambiguity. Their conditional probabilistic distributions are then obtained through variational inference, allowing segmentation outputs to be generated by sampling from these distributions. Extensive experiments on both the nasopharyngeal carcinoma dataset (NPC) and the lung nodule dataset (LIDC-IDRI) demonstrate that our ProSeg achieves a new state-of-the-art performance, providing segmentation results that are both diverse and expert-personalized. Code can be found in this https URL.
[CV-162] Joint Multi-scale Gated Transformer and Prior-guided Convolutional Network for Learned Image Compression
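[Quick read]: This paper addresses the limited nonlinear transform coding capability of traditional image compression methods (such as VVC), with the goal of improving the performance-complexity trade-off of learned image compression. The key to the solution is a joint Multi-scale Gated Transformer and Prior-guided Convolutional Network (MGTPCN) with two core components. The first is prior-guided convolution (PGConv), which introduces asymmetric convolutions (AConvs) to strengthen skeleton elements and difference convolutions (DConvs) to capture high-frequency information, with a re-parameterization strategy to reduce computational complexity. The second is the multi-scale gated transformer (MGT), which extracts multi-scale non-local features using dilated window-based multi-head self-attention blocks with different dilation rates and depth-wise convolution layers with different kernel sizes, and adds a gate mechanism to enhance non-linearity. The architecture improves both representational capacity and efficiency for image compression.
Link: https://arxiv.org/abs/2512.00744
Authors: Zhengxin Chen, Xiaohai He, Tingrong Zhang, Shuhua Xiong, Chao Ren
Affiliations: Sichuan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: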
Abstract:Recently, learned image compression methods have made remarkable achievements, some of which have outperformed the traditional image codec VVC. The advantages of learned image compression methods over traditional image codecs can be largely attributed to their powerful nonlinear transform coding. Convolutional layers and shifted window transformer (Swin-T) blocks are the basic units of neural networks, and their representation capabilities play an important role in nonlinear transform coding. In this paper, to improve the ability of the vanilla convolution to extract local features, we propose a novel prior-guided convolution (PGConv), where asymmetric convolutions (AConvs) and difference convolutions (DConvs) are introduced to strengthen skeleton elements and extract high-frequency information, respectively. A re-parameterization strategy is also used to reduce the computational complexity of PGConv. Moreover, to improve the ability of the Swin-T block to extract non-local features, we propose a novel multi-scale gated transformer (MGT), where dilated window-based multi-head self-attention blocks with different dilation rates and depth-wise convolution layers with different kernel sizes are used to extract multi-scale features, and a gate mechanism is introduced to enhance non-linearity. Finally, we propose a novel joint Multi-scale Gated Transformer and Prior-guided Convolutional Network (MGTPCN) for learned image compression. Experimental results show that our MGTPCN surpasses state-of-the-art algorithms with a better trade-off between performance and complexity.
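The abstract does not spell out its re-parameterization strategy. One widely used form for asymmetric convolution branches relies on the linearity of convolution: parallel 3x3, 1x3, and 3x1 kernels can be folded into a single 3x3 kernel after training. The sketch below demonstrates that identity on a toy single-channel image; the helper names are hypothetical, and whether PGConv merges its branches exactly this way is an assumption.

```python
import numpy as np

def correlate_same(image, kernel):
    """Zero-padded 2D cross-correlation that preserves the input size (odd kernel sizes)."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def merge_branches(k_square, k_row, k_col):
    """Fold a 1x3 and a 3x1 kernel into the centre row and column of a 3x3 kernel."""
    merged = k_square.copy()
    merged[1, :] += k_row[0, :]
    merged[:, 1] += k_col[:, 0]
    return merged

rng = np.random.default_rng(0)
image = rng.normal(size=(6, 6))
k_square = rng.normal(size=(3, 3))
k_row = rng.normal(size=(1, 3))
k_col = rng.normal(size=(3, 1))

# Training-time view: three parallel branches whose outputs are summed.
three_branch = (correlate_same(image, k_square)
                + correlate_same(image, k_row)
                + correlate_same(image, k_col))
# Inference-time view: one merged kernel, one pass over the image.
single_branch = correlate_same(image, merge_branches(k_square, k_row, k_col))
```

The two outputs agree up to floating-point error, so the extra branches add capacity during training without adding cost at inference.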
[CV-163] Multi-GRPO: Multi-Group Advantage Estimation for Text-to-Image Generation with Tree-Based Trajectories and Multiple Rewards
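[Quick read]: This paper addresses two key problems of Group Relative Policy Optimization (GRPO) methods for aligning text-to-image (T2I) models. The first is shared credit assignment: trajectory-level advantages derived from group-normalized sparse terminal rewards are applied uniformly to every timestep, so they cannot accurately assess the potential of early denoising steps, where the exploration space is vast. The second is reward mixing: when multi-objective rewards (such as text accuracy, visual quality, and text color) are combined with predefined weights, their mismatched scales and variances lead to unstable gradients and conflicting updates. The key to the solution is the Multi-GRPO framework, built on two orthogonal grouping mechanisms. It first introduces tree-based trajectories inspired by Monte Carlo Tree Search, branching at selected early denoising steps to form temporal groups, so that advantages for early steps are estimated accurately from descendant leaves while shared prefixes reduce computation. It then applies reward-based grouping, computing the advantage for each reward function independently before aggregation to disentangle conflicting signals. On the PickScore-25k and OCR-Color-10 benchmarks the method shows better stability and alignment performance and balances conflicting objectives effectively.
Link: https://arxiv.org/abs/2512.00743
Authors: Qiang Lyu, Zicong Chen, Chongxiao Wang, Haolin Shi, Shibo Gao, Ran Piao, Youwei Zeng, Jianlou Si, Fei Ding, Jing Li, Chun Pong Lau, Weiqiang Wang
Affiliations: University of Chinese Academy of Sciences; Beihang University; Alibaba Group; Beijing Jiaotong University; TikTok Inc.; City University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 15 figures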
Abstract:Recently, Group Relative Policy Optimization (GRPO) has shown promising potential for aligning text-to-image (T2I) models, yet existing GRPO-based methods suffer from two critical limitations. (1) Shared credit assignment: trajectory-level advantages derived from group-normalized sparse terminal rewards are uniformly applied across timesteps, failing to accurately estimate the potential of early denoising steps with vast exploration spaces. (2) Reward-mixing: predefined weights for combining multi-objective rewards (e.g., text accuracy, visual quality, text color), which have mismatched scales and variances, lead to unstable gradients and conflicting updates. To address these issues, we propose Multi-GRPO, a multi-group advantage estimation framework with two orthogonal grouping mechanisms. For better credit assignment, we introduce tree-based trajectories inspired by Monte Carlo Tree Search: branching trajectories at selected early denoising steps naturally forms temporal groups, enabling accurate advantage estimation for early steps via descendant leaves while amortizing computation through shared prefixes. For multi-objective optimization, we introduce reward-based grouping to compute advantages for each reward function independently before aggregation, disentangling conflicting signals. To facilitate evaluation of multiple objective alignment, we curate OCR-Color-10, a visual text rendering dataset with explicit color constraints. Across the single-reward PickScore-25k and multi-objective OCR-Color-10 benchmarks, Multi-GRPO achieves superior stability and alignment performance, effectively balancing conflicting objectives. Code will be publicly available at this https URL.
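The reward-mixing issue and the reward-based grouping idea can be illustrated with a few lines of arithmetic. The sketch below contrasts mixing rewards before group normalization with normalizing each reward separately and then aggregating. The weighted-sum aggregation, the function names, and the toy reward values are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def mixed_then_normalized(rewards, weights, eps=1e-8):
    """Baseline: combine rewards with fixed weights, then normalize within the group."""
    mixed = rewards @ weights
    return (mixed - mixed.mean()) / (mixed.std() + eps)

def normalized_then_aggregated(rewards, weights, eps=1e-8):
    """Reward-based grouping: normalize each reward column on its own, then aggregate."""
    per_reward = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + eps)
    return per_reward @ weights

# One group of 4 samples scored by two hypothetical rewards on very different scales.
rewards = np.array([
    [0.9, 10.0],   # best on reward A, worst on reward B
    [0.1, 30.0],   # worst on reward A, best on reward B
    [0.5, 20.0],
    [0.5, 20.0],
])
weights = np.array([0.5, 0.5])

mixed = mixed_then_normalized(rewards, weights)
separate = normalized_then_aggregated(rewards, weights)
```

In this toy group the mixed advantage simply follows the large-scale reward B, so the second sample is favored and the first penalized, whereas per-reward normalization treats the two samples as equally good trade-offs and assigns both an advantage close to zero.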
[CV-164] REM: Evaluating LLM Embodied Spatial Reasoning through Multi-Frame Trajectories
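[Quick read]: This paper addresses the lack of reliable spatial reasoning in multimodal large language models (MLLMs) in embodied settings, in particular their weakness in key cognitive abilities such as object permanence, spatial relations, and numerical tracking, which limits their practical performance in embodied applications. The key to the solution is REM (Reasoning over Embodied Multi-Frame Trajectories), a benchmark that uses controllable 3D environments to generate long-horizon embodied visual trajectories, systematically evaluates spatial reasoning under dynamic viewpoints, and provides targeted metrics and diagnostics to encourage more robust spatial representations in future models.
Link: https://arxiv.org/abs/2512.00736
Authors: Jacob Thompson, Emiliano Garcia-Lopez, Yonatan Bisk
Affiliations: Carnegie Mellon University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: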
Abstract:Humans build viewpoint-independent cognitive maps through navigation, enabling intuitive reasoning about object permanence and spatial relations. We argue that multimodal large language models (MLLMs), despite extensive video training, lack this fundamental spatial reasoning capability, a critical limitation for embodied applications. To demonstrate these limitations and drive research, we introduce REM (Reasoning over Embodied Multi-Frame Trajectories), a benchmark using controllable 3D environments for long-horizon embodied spatial reasoning. REM systematically evaluates key aspects like object permanence/distinction, spatial relationships, and numerical tracking across dynamic embodied viewpoints. Our evaluation shows that the best-performing current models exhibit promising overall performance, but become increasingly unreliable at even moderate complexity levels easily handled by humans. These findings highlight challenges MLLMs face in developing robust spatial representations from sequential visual input. Consequently, REM provides targeted metrics and diagnostics to foster improved spatial understanding in future models.
[CV-165] TrajDiff: End-to-end Autonomous Driving without Perception Annotation
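Quick read: This paper targets the dependence of end-to-end autonomous driving systems on costly, manually annotated perception tasks and proposes a generative planning method that needs no perception annotation at all. The core challenge is learning an effective environment representation and generating plausible trajectories directly from raw sensor inputs, without auxiliary perception tasks such as object detection or semantic segmentation. The key to the solution is the TrajDiff framework: it builds Gaussian BEV (bird's-eye-view) heatmap targets from the future trajectory, uses a trajectory-oriented BEV encoder (TrajBEV encoder) to extract environment features without perception supervision, and introduces a Trajectory-oriented BEV Diffusion Transformer (TB-DiT) that takes the ego state and TrajBEV features and directly generates diverse yet plausible trajectories, removing the reliance on handcrafted motion priors.
Link: https://arxiv.org/abs/2512.00723
Authors: Xingtai Gui,Jianbo Zhao,Wencheng Han,Jikai Wang,Jiahao Gong,Feiyang Tan,Cheng-zhong Xu,Jianbing Shen
Affiliations: University of Macau; University of Science and Technology of China; Mach Drive
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: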
Abstract:End-to-end autonomous driving systems directly generate driving policies from raw sensor inputs. While these systems can extract effective environmental features for planning by relying on auxiliary perception tasks, developing perception annotation-free planning paradigms has become increasingly critical due to the high cost of manual perception annotation. In this work, we propose TrajDiff, a Trajectory-oriented BEV Conditioned Diffusion framework that establishes a fully perception annotation-free generative method for end-to-end autonomous driving. TrajDiff requires only raw sensor inputs and future trajectory, constructing Gaussian BEV heatmap targets that inherently capture driving modalities. We design a simple yet effective trajectory-oriented BEV encoder to extract the TrajBEV feature without perceptual supervision. Furthermore, we introduce Trajectory-oriented BEV Diffusion Transformer (TB-DiT), which leverages ego-state information and the predicted TrajBEV features to directly generate diverse yet plausible trajectories, eliminating the need for handcrafted motion priors. Beyond architectural innovations, TrajDiff enables exploration of data scaling benefits in the annotation-free setting. Evaluated on the NAVSIM benchmark, TrajDiff achieves 87.5 PDMS, establishing state-of-the-art performance among all annotation-free methods. With data scaling, it further improves to 88.5 PDMS, which is comparable to advanced perception-based approaches. Our code and model will be made publicly available.
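To make "Gaussian BEV heatmap targets" built from a future trajectory more concrete, here is a rough sketch of one way such a target could be rasterized. The abstract does not give the construction, so everything specific here is an assumption: the grid size, the cell size, the standard deviation, the choice of taking the maximum over waypoints, and the function name are all hypothetical.

```python
import math


def trajectory_heatmap(waypoints, grid_size=8, cell_m=1.0, sigma_m=1.0):
    """Rasterize future ego waypoints (x, y in meters) into a BEV heatmap.

    Each cell stores the largest Gaussian response over all waypoints,
    evaluated at the cell centre. Rows index y, columns index x.
    """
    heatmap = [[0.0] * grid_size for _ in range(grid_size)]
    for row in range(grid_size):
        for col in range(grid_size):
            cx = (col + 0.5) * cell_m  # cell centre, x
            cy = (row + 0.5) * cell_m  # cell centre, y
            for wx, wy in waypoints:
                d2 = (cx - wx) ** 2 + (cy - wy) ** 2
                value = math.exp(-d2 / (2.0 * sigma_m ** 2))
                heatmap[row][col] = max(heatmap[row][col], value)
    return heatmap


# Two hypothetical waypoints that sit exactly on cell centres.
heatmap = trajectory_heatmap([(2.5, 3.5), (3.5, 4.5)])
```

A target like this needs only the recorded future trajectory, which matches the abstract's point that no perception labels are required.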
[CV-166] RS-ISRefiner: Towards Better Adapting Vision Foundation Models for Interactive Segmentation of Remote Sensing Images
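Quick read: This paper addresses the performance drop of interactive image segmentation (IIS) on remote sensing imagery, where objects show large scale variation, irregular boundaries, and complex backgrounds, and where existing IIS methods, designed mainly for natural images, generalize poorly, have little annotated data to learn from, and carry high computational overhead. The key to the solution is RS-ISRefiner, a click-based interactive segmentation framework designed for remote sensing. Its main components are an adapter-based tuning strategy that preserves the general representations of Vision Foundation Models while efficiently learning remote-sensing-specific spatial and boundary characteristics; a hybrid attention mechanism that combines convolutional local modeling with Transformer-based global reasoning for robustness to scale diversity and scene complexity; and an improved probability map modulation scheme that incorporates historical user interactions, giving more stable iterative refinement and higher boundary fidelity.
Link: https://arxiv.org/abs/2512.00718
Authors: Deliang Wang,Peng Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: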
Abstract:Interactive image segmentation (IIS) plays a critical role in generating precise annotations for remote sensing imagery, where objects often exhibit scale variations, irregular boundaries and complex backgrounds. However, existing IIS methods, primarily designed for natural images, struggle to generalize to remote sensing domains due to limited annotated data and computational overhead. To address these challenges, we propose RS-ISRefiner, a novel click-based IIS framework tailored for remote sensing images. The framework employs an adapter-based tuning strategy that preserves the general representations of Vision Foundation Models while enabling efficient learning of remote sensing-specific spatial and boundary characteristics. A hybrid attention mechanism integrating convolutional local modeling with Transformer-based global reasoning enhances robustness against scale diversity and scene complexity. Furthermore, an improved probability map modulation scheme effectively incorporates historical user interactions, yielding more stable iterative refinement and higher boundary fidelity. Comprehensive experiments on six remote sensing datasets, including iSAID, ISPRS Potsdam, SandBar, NWPU, LoveDA Urban and WHUBuilding, demonstrate that RS-ISRefiner consistently outperforms state-of-the-art IIS methods in terms of segmentation accuracy, efficiency and interaction cost. These results confirm the effectiveness and generalizability of our framework, making it highly suitable for high-quality instance segmentation in practical remote sensing scenarios.
[CV-167] Deep Learning-Based Computer Vision Models for Early Cancer Detection Using Multimodal Medical Imaging and Radiogenomic Integration Frameworks
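Quick read: This paper addresses the core challenge in early cancer detection that delayed diagnosis lowers survival rates. The key to the solution is deep-learning-based medical image analysis, in particular convolutional neural networks (CNNs), Transformer architectures, and hybrid attention models, which automatically extract complex spatial, morphological, and temporal patterns from multimodal imaging data (MRI, CT, PET, mammography, histopathology, and ultrasound) and so identify subtle tissue abnormalities and tumor microenvironment variations that are hard to see with the human eye. In addition, fusing multimodal imaging with radiogenomics links quantitative imaging features to genomic, transcriptomic, and epigenetic biomarkers, offering a new paradigm for non-invasive prediction of tumor genotype, immune response, molecular subtype, and treatment resistance, and advancing personalized oncology.
Link: https://arxiv.org/abs/2512.00714
Authors: Emmanuella Avwerosuoghene Oghenekaro
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: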
Abstract:Early cancer detection remains one of the most critical challenges in modern healthcare, where delayed diagnosis significantly reduces survival outcomes. Recent advancements in artificial intelligence, particularly deep learning, have enabled transformative progress in medical imaging analysis. Deep learning-based computer vision models, such as convolutional neural networks (CNNs), transformers, and hybrid attention architectures, can automatically extract complex spatial, morphological, and temporal patterns from multimodal imaging data including MRI, CT, PET, mammography, histopathology, and ultrasound. These models surpass traditional radiological assessment by identifying subtle tissue abnormalities and tumor microenvironment variations invisible to the human eye. At a broader scale, the integration of multimodal imaging with radiogenomics linking quantitative imaging features with genomics, transcriptomics, and epigenetic biomarkers has introduced a new paradigm for personalized oncology. This radiogenomic fusion allows the prediction of tumor genotype, immune response, molecular subtypes, and treatment resistance without invasive biopsies.
[CV-168] Optimizing LVLMs with On-Policy Data for Effective Hallucination Mitigation
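Quick read: This paper addresses hallucination in large vision-language models (LVLMs), which is widespread in multimodal tasks and hard to suppress, and in particular the problem that existing preference-optimization methods introduce additional hallucination while generating training data, so the model learns wrong patterns. The solution has two key parts. First, a binary hallucination classifier gives clean annotations for on-policy data, ensuring that the chosen samples used in the subsequent alignment stage are free of hallucination. Second, a robust iterative direct preference optimization (DPO) algorithm with a dynamic sample reweighting scheme makes efficient use of the high-quality on-policy data, lowering the hallucination rate and improving performance. Experiments show that the method clearly outperforms existing baselines on several benchmarks and even enables the open-source LLaVA-1.5-13B to surpass GPT-4V.
Link: https://arxiv.org/abs/2512.00706
Authors: Chengzhi Yu,Yifan Xu,Yifan Chen,Wenyi Zhang
Affiliations: University of Science and Technology of China; Hong Kong Baptist University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: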
Abstract:Recently, large vision-language models (LVLMs) have risen to be a promising approach for multimodal tasks. However, principled hallucination mitigation remains a critical challenge. In this work, we first analyze the data generation process in LVLM hallucination mitigation and affirm that on-policy data significantly outperforms off-policy data, which thus calls for efficient and reliable preference annotation of on-policy data. We then point out that existing annotation methods introduce additional hallucination in training samples, which may enhance the model's hallucination patterns. To address this problem, we propose training a hallucination classifier giving binary annotations, which guarantee clean chosen samples for the subsequent alignment. To further harness the power of on-policy data, we design a robust iterative direct preference optimization (DPO) algorithm adopting a dynamic sample reweighting scheme. We conduct comprehensive experiments on three benchmarks with comparison to 8 state-of-the-art baselines. In particular, our approach reduces the hallucination rate of LLaVA-1.5-7B on MMHalBench by 50.8% and the average hallucination rate on Object HalBench by 79.5%; more significantly, our method fully taps into the potential of open-source models, enabling LLaVA-1.5-13B to even surpass the performance of GPT-4V.
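For readers unfamiliar with DPO, the standard per-pair loss it optimizes can be written compactly. The sketch below shows that standard loss only; the `weight` argument is a hypothetical placeholder for a per-sample weight and does not reproduce the paper's dynamic reweighting scheme, which the abstract does not specify.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1, weight=1.0):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy and under the
    frozen reference model. The loss is -log(sigmoid(margin)) with
    margin = beta * ((policy - ref) on chosen - (policy - ref) on rejected).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    softplus_neg_margin = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return weight * softplus_neg_margin
```

The loss is log 2 when the policy matches the reference and falls as the policy raises the chosen response relative to the rejected one, which is why clean "chosen" samples matter so much for this objective.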
[CV-169] CAR-Net: A Cascade Refinement Network for Rotational Motion Deblurring under Angle Information Uncertainty
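Quick read: This paper addresses deblurring of images affected by rotational motion blur, specifically in the semi-blind setting where only noisy information about the blur angle is available. The key to the solution is a neural network architecture called CAR-net (CAscade Refinement Network), built around a progressive refinement process: starting from an initial deblurred estimate obtained by frequency-domain inversion, a series of refinement stages predict and apply residual corrections, suppressing artifacts and restoring fine details. To handle parameter uncertainty, the model also includes an angle detection module that can be trained end-to-end, improving robustness to uncertain blur parameters.
Link: https://arxiv.org/abs/2512.00700
Authors: Ka Chung Lai,Ahmet Cetinkaya
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAIML 2026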
Abstract:We propose a new neural network architecture called CAR-net (CAscade Refinement Network) to deblur images that are subject to rotational motion blur. Our architecture is specifically designed for the semi-blind scenarios where only noisy information of the rotational motion blur angle is available. The core of our approach is a progressive refinement process that starts with an initial deblurred estimate obtained from frequency-domain inversion. A series of refinement stages take the current deblurred image to predict and apply residual correction to the current estimate, progressively suppressing artifacts and restoring fine details. To handle parameter uncertainty, our architecture accommodates an optional angle detection module which can be trained end-to-end with refinement modules. We provide a detailed description of our architecture and illustrate its efficiency through experiments using both synthetic and real-life images. Our code and model as well as the links to the datasets are available at this https URL
[CV-170] Affordance-First Decomposition for Continual Learning in Video-Language Understanding
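Quick read: This paper addresses the challenges of continual learning for video-language understanding: when facing non-stationary data distributions, domain shifts, and evolving query styles, a model has to separate the components that should stay stable from those that should adapt, while avoiding the resource cost and privacy risk of static routing or replaying past videos. The key to the solution is Affordance-First Decomposition (AFD): videos are mapped to slowly varying affordance tokens that form a shared, time-aligned stable substrate, and a lightweight, query-routed, conflict-aware scheduler concentrates adaptation and grows capacity only when needed. The substrate is stabilized through weak alignment and teacher consistency, and training uses question-only replay. This gives an explicit split between a stable, interaction-centered substrate and targeted adaptation, and improves performance and robustness across several continual learning protocols.
Link: https://arxiv.org/abs/2512.00694
Authors: Mengzhu Xu,Hanzhi Liu,Ningkang Peng,Qianyu Chen,Canran Xiao
Affiliations: University of Sydney; University of California, Santa Barbara; Nanjing Normal University; Nanyang Technological University; Shenzhen Campus of Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review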
Abstract:Continual learning for video–language understanding is increasingly important as models face non-stationary data, domains, and query styles, yet prevailing solutions blur what should stay stable versus what should adapt, rely on static routing/capacity, or require replaying past videos. We aim to explicitly specify where stability lives and where plasticity should be focused under realistic memory and privacy constraints. We introduce Affordance-First Decomposition (AFD): videos are mapped to slowly varying affordance tokens that form a shared, time-aligned substrate, while a lightweight, query-routed, conflict-aware scheduler concentrates adaptation and grows capacity only when needed. The substrate is stabilized via weak alignment and teacher consistency, and training uses question-only replay. AFD achieves state-of-the-art across protocols: 51.6% average accuracy with -1.8% forgetting on domain-incremental VideoQA, ViLCo R@1@0.5 of 29.6% (MQ) and 20.7% (NLQ) with 18.4% stAP@0.25 (VQ), and 39.5% accuracy with -1.6% forgetting on time-incremental iVQA. Overall, AFD offers an explicit, interpretable split between a stable interaction-centered substrate and targeted adaptation.
[CV-171] Silhouette-based Gait Foundation Model
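Quick read: This paper addresses the limited scalability and generalization of current gait recognition models. Traditional models are small and narrowly designed, adapt poorly to diverse tasks and scenarios, and lack a unified pretraining framework for cross-task transfer. The key to the solution is FoundationGait, the first scalable self-supervised pretraining framework for gait: it is pretrained at scale on 12 public gait datasets (over 2 million walking sequences), and its largest version is a unified foundation model with nearly 0.13 billion parameters. The framework performs robustly across datasets, tasks (e.g., human identification, scoliosis screening, depression prediction), and input modalities. Without fine-tuning it reaches 48.0% zero-shot rank-1 accuracy on Gait3D and 64.5% on OU-MVLP, setting a new milestone in gait recognition.
Link: https://arxiv.org/abs/2512.00691
Authors: Dingqiang Ye,Chao Fan,Kartik Narayan,Bingzhe Wu,Chengwen Luo,Jianqiang Li,Vishal M. Patel
Affiliations: Johns Hopkins University; Shenzhen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: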
Abstract:Gait patterns play a critical role in human identification and healthcare analytics, yet current progress remains constrained by small, narrowly designed models that fail to scale or generalize. Building a unified gait foundation model requires addressing two longstanding barriers: (a) Scalability. Why have gait models historically failed to follow scaling laws? (b) Generalization. Can one model serve the diverse gait tasks that have traditionally been studied in isolation? We introduce FoundationGait, the first scalable, self-supervised pretraining framework for gait understanding. Its largest version has nearly 0.13 billion parameters and is pretrained on 12 public gait datasets comprising over 2 million walking sequences. Extensive experiments demonstrate that FoundationGait, with or without fine-tuning, performs robustly across a wide spectrum of gait datasets, conditions, tasks (e.g., human identification, scoliosis screening, depression prediction, and attribute estimation), and even input modality. Notably, it achieves 48.0% zero-shot rank-1 accuracy on the challenging in-the-wild Gait3D dataset (1,000 test subjects) and 64.5% on the largest in-the-lab OU-MVLP dataset (5,000+ test subjects), setting a new milestone in robust gait recognition. Coming code and model: this https URL.
[CV-172] Dynamic-eDiTor: Training-Free Text-Driven 4D Scene Editing with Multimodal Diffusion Transformer
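Quick read: This paper addresses the difficulty of keeping text-driven 4D scene editing consistent across views and over time. Existing methods rely on 2D diffusion models that edit frames independently, which often causes motion distortion, geometric drift, and incomplete edits. The key to the solution is Dynamic-eDiTor, a training-free framework built on a pre-trained 4D Gaussian Splatting (4DGS) representation and a Multimodal Diffusion Transformer (MM-DiT). Spatio-Temporal Sub-Grid Attention (STGA) provides locally consistent cross-view and temporal fusion, and Context Token Propagation (CTP) propagates edits globally through token inheritance and optical-flow-guided token replacement. Together they yield seamless, globally consistent multi-view video editing and directly optimize the pre-trained source 4DGS without additional training.
Link: https://arxiv.org/abs/2512.00677
Authors: Dong In Lee,Hyungjun Doh,Seunggeun Chi,Runlin Duan,Sangpil Kim,Karthik Ramani
Affiliations: Purdue University; Korea University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 4D Scene Editing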
Abstract:Recent progress in 4D representations, such as Dynamic NeRF and 4D Gaussian Splatting (4DGS), has enabled dynamic 4D scene reconstruction. However, text-driven 4D scene editing remains under-explored due to the challenge of ensuring both multi-view and temporal consistency across space and time during editing. Existing studies rely on 2D diffusion models that edit frames independently, often causing motion distortion, geometric drift, and incomplete editing. We introduce Dynamic-eDiTor, a training-free text-driven 4D editing framework leveraging Multimodal Diffusion Transformer (MM-DiT) and 4DGS. This mechanism consists of Spatio-Temporal Sub-Grid Attention (STGA) for locally consistent cross-view and temporal fusion, and Context Token Propagation (CTP) for global propagation via token inheritance and optical-flow-guided token replacement. Together, these components allow Dynamic-eDiTor to perform seamless, globally consistent multi-view video editing without additional training and directly optimize pre-trained source 4DGS. Extensive experiments on the multi-view video dataset DyNeRF demonstrate that our method achieves superior editing fidelity and both multi-view and temporal consistency compared with prior approaches. Project page for results and code: this https URL
[CV-173] Realistic Handwritten Multi-Digit Writer (MDW) Number Recognition Challenges
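Quick read: This paper addresses the poor real-world performance of isolated-digit classifiers, especially on multi-digit strings such as ZIP Codes and handwritten check amounts, where a model that does well on single digits can do much worse on whole numbers. The key to the solution is to use knowledge about the writers of the NIST digit images to build more realistic multi-digit writer (MDW) benchmark data sets, together with task-specific performance metrics that track real-world impact more closely, encouraging methods that exploit writer knowledge to improve accuracy beyond what isolated digit classification can achieve.
Link: https://arxiv.org/abs/2512.00676
Authors: Kiri L. Wagstaff
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 6 figures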
Abstract:Isolated digit classification has served as a motivating problem for decades of machine learning research. In real settings, numbers often occur as multiple digits, all written by the same person. Examples include ZIP Codes, handwritten check amounts, and appointment times. In this work, we leverage knowledge about the writers of NIST digit images to create more realistic benchmark multi-digit writer (MDW) data sets. As expected, we find that classifiers may perform well on isolated digits yet do poorly on multi-digit number recognition. If we want to solve real number recognition problems, additional advances are needed. The MDW benchmarks come with task-specific performance metrics that go beyond typical error calculations to more closely align with real-world impact. They also create opportunities to develop methods that can leverage task-specific knowledge to improve performance well beyond that of individual digit classification methods.
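A quick back-of-the-envelope calculation shows why good isolated-digit accuracy can still mean noticeably worse whole-number accuracy. The sketch assumes that digit errors are independent, which is a simplification made here for illustration and not a claim from the paper (digits from the same writer are likely to have correlated errors); the function name is hypothetical.

```python
def number_accuracy(digit_accuracy, num_digits):
    """Probability that every digit of a number is classified correctly,
    assuming each digit is an independent trial with the same accuracy."""
    return digit_accuracy ** num_digits


# A 5-digit ZIP Code read by a classifier that is 99% accurate per digit.
zip_code_accuracy = number_accuracy(0.99, 5)
```

Under this assumption a 99% digit classifier reads about 95.1% of 5-digit numbers correctly, so the error rate per number is roughly five times the error rate per digit.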
[CV-174] Fast Robust Permutation-and-Sign Invariant SO(3) Pattern Alignment
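Quick read: This paper addresses correspondence-free alignment of rotation sets on SO(3), a core task in calibration and registration that is often hampered by missing time alignment, outliers, and unknown axis conventions. The key to the solution is to decompose each rotation into three unit vectors (Transformed Basis Vectors, TBVs), which form point sets on the sphere, and to align them per axis with fast, robust matchers (SPMC, FRS, and a hybrid). A Permutation-and-Sign Invariant (PASI) wrapper enumerates the 24 proper signed permutations of the axes, picks the best configuration by correlation score, and fuses the per-axis estimates into a final rotation by projection or Karcher mean. The method keeps linear complexity O(n), compared with O(N_r^3 log N_r) for conventional spherical/SO(3) correlation, needs no explicit correspondence search, and stays robust at extreme outlier ratios (up to 90%).
Link: https://arxiv.org/abs/2512.00659
Authors: Anik Sarker,Alan T. Asbeck
Affiliations: Virginia Tech
Categories: Robotics (cs.RO); Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV)
Comments: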
Abstract:We address the correspondence-free alignment of two rotation sets on $SO(3)$, a core task in calibration and registration that is often impeded by missing time alignment, outliers, and unknown axis conventions. Our key idea is to decompose each rotation into its Transformed Basis Vectors (TBVs), three unit vectors on $S^2$, and align the resulting spherical point sets per axis using fast, robust matchers (SPMC, FRS, and a hybrid). To handle axis relabels and sign flips, we introduce a Permutation-and-Sign Invariant (PASI) wrapper that enumerates the 24 proper signed permutations, scores them via summed correlations, and fuses the per-axis estimates into a single rotation by projection/Karcher mean. The overall complexity remains linear in the number of rotations ($\mathcal{O}(n)$), contrasting with $\mathcal{O}(N_r^3 \log N_r)$ for spherical/$SO(3)$ correlation. Experiments on EuRoC Machine Hall simulations (axis-consistent) and the ETH Hand-Eye benchmark (robot_arm_real) (axis-ambiguous) show that our methods are accurate, 6-60x faster than traditional methods, and robust under extreme outlier ratios (up to 90%), all without correspondence search.
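The figure of 24 proper signed permutations can be checked by brute force: there are 3! x 2^3 = 48 signed permutation matrices in 3D, and exactly half of them have determinant +1 and are therefore rotations. The sketch below only enumerates these candidates; it is not the PASI wrapper itself, and the function names are hypothetical.

```python
from itertools import permutations, product


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def proper_signed_permutations():
    """All 3x3 signed permutation matrices with determinant +1."""
    matrices = []
    for perm in permutations(range(3)):           # which axis each row maps to
        for signs in product((1, -1), repeat=3):  # sign flip for each row
            m = [[0, 0, 0] for _ in range(3)]
            for row, (col, sign) in enumerate(zip(perm, signs)):
                m[row][col] = sign
            if det3(m) == 1:                      # keep rotations, drop reflections
                matrices.append(m)
    return matrices


candidates = proper_signed_permutations()
```

Since the candidate set is this small and fixed, scoring every axis relabeling and sign flip adds only a constant factor, which is consistent with the linear complexity stated in the abstract.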
[CV-175] MambaScope: Coarse-to-Fine Scoping for Efficient Vision Mamba
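Quick read: This paper addresses the efficiency bottleneck of Vision Mamba caused by the large number of input tokens. Existing token reduction methods (token pruning or merging) cut computation but lose information, especially when fine-grained processing is applied uniformly to every image. The key to the solution is an adaptive coarse-to-fine inference framework, Coarse-to-Fine Vision Mamba (CF-ViM): it first runs coarse-grained inference on large patches, which shortens the token sequence and reduces computation substantially; when prediction confidence is low, only selected regions are re-processed at finer resolution. Computation is thus allocated dynamically, preserving visual detail while balancing efficiency and accuracy.
Link: https://arxiv.org/abs/2512.00647
Authors: Shanhui Liu,Rui Xu,Yunke Wang
Affiliations: The University of Sydney; Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: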
Abstract:Vision Mamba has emerged as a promising and efficient alternative to Vision Transformers, yet its efficiency remains fundamentally constrained by the number of input tokens. Existing token reduction approaches typically adopt token pruning or merging to reduce computation. However, they inherently lead to information loss, as they discard or compress token representations. This problem is exacerbated when applied uniformly to fine-grained token representations across all images, regardless of visual complexity. We observe that not all inputs require fine-grained processing. Simple images can be effectively handled at coarse resolution, while only complex ones may warrant refinement. Based on this insight, we propose Coarse-to-Fine Vision Mamba (CF-ViM), an adaptive framework for efficient inference. CF-ViM first performs coarse-grained inference by dividing the input image into large patches, significantly reducing the token length and computation. When the model’s prediction confidence is low, selected regions are re-processed at a finer resolution to recover critical visual details with minimal additional cost. This dynamic resolution assignment strategy allows CF-ViM to allocate computation adaptively according to image complexity, ensuring efficient processing without compromising essential visual information. Experiments on ImageNet demonstrate that CF-ViM outperforms both the baseline Vision Mamba and state-of-the-art token reduction techniques in terms of accuracy and efficiency.
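The confidence-gated control flow described in this abstract can be illustrated with a small sketch. It is deliberately simplified and rests on assumptions: the whole input is re-run through a finer model when confidence is low, whereas the paper re-processes only selected regions; the threshold value, the function name, and the callable model interface are all hypothetical.

```python
def coarse_to_fine_predict(image, coarse_model, fine_model, threshold=0.7):
    """Return (label, stage). Run the cheap coarse pass first and fall back
    to the finer pass only when the top class probability is below threshold.

    Both models are callables that map an image to a list of class
    probabilities (a stand-in interface for this sketch).
    """
    probs = coarse_model(image)
    label = max(range(len(probs)), key=probs.__getitem__)
    if probs[label] >= threshold:
        return label, "coarse"  # confident enough: skip the expensive pass

    probs = fine_model(image)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, "fine"
```

With a gate like this, the average cost depends on how often the coarse pass is confident, so the threshold trades accuracy against compute.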
[CV-176] Graph-Attention Network with Adversarial Domain Alignment for Robust Cross-Domain Facial Expression Recognition ACML2025
Quick read: This paper addresses the performance drop in cross-domain facial expression recognition (CD-FER) caused by severe domain shift between training and deployment data. The key to the solution is GAT-ADA, a hybrid framework that couples a ResNet-50 backbone with a batch-level Graph Attention Network (GAT); each mini-batch is modeled as a sparse ring graph to capture relations between samples and aid adaptation. For distribution alignment, it combines adversarial learning via a Gradient Reversal Layer (GRL) with statistical alignment using CORAL and MMD, reducing the harm caused by domain differences.
Link: https://arxiv.org/abs/2512.00641
Authors: Razieh Ghaedi,AmirReza BabaAhmadi,Reyer Zwiggelaar,Xinqi Fan,Nashid Alam
Affiliations: Manchester Metropolitan University; University of Tehran; Aberystwyth University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 5 figures. Accepted at the 17th Asian Conference on Machine Learning (ACML 2025), Taipei, Taiwan, December 9-12, 2025
Abstract:Cross-domain facial expression recognition (CD-FER) remains difficult due to severe domain shift between training and deployment data. We propose Graph-Attention Network with Adversarial Domain Alignment (GAT-ADA), a hybrid framework that couples a ResNet-50 as backbone with a batch-level Graph Attention Network (GAT) to model inter-sample relations under shift. Each mini-batch is cast as a sparse ring graph so that attention aggregates cross-sample cues that are informative for adaptation. To align distributions, GAT-ADA combines adversarial learning via a Gradient Reversal Layer (GRL) with statistical alignment using CORAL and MMD. GAT-ADA is evaluated under a standard unsupervised domain adaptation protocol: training on one labeled source (RAF-DB) and adapting to multiple unlabeled targets (CK+, JAFFE, SFEW 2.0, FER2013, and ExpW). GAT-ADA attains 74.39% mean cross-domain accuracy. On RAF-DB to FER2013, it reaches 98.0% accuracy, corresponding to approximately a 36-point improvement over the best baseline we re-implemented with the same backbone and preprocessing.
[CV-177] Doppler-Enhanced Deep Learning: Improving Thyroid Nodule Segmentation with YOLOv5 Instance Segmentation
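Quick read: This paper addresses accurate segmentation of thyroid nodules in ultrasound images, a critical first step toward AI-assisted clinical decision support systems. The core of the solution is instance segmentation with the YOLOv5 family, comparing model sizes (Nano to XLarge) on dataset versions with and without Doppler images. Including Doppler images clearly improves segmentation: YOLOv5-Large reaches a Dice score of 91% and an mAP of 0.87 on the dataset with Doppler images, better than with conventional grayscale images alone. The results indicate that Doppler images, although usually excluded by physicians, are valuable for AI segmentation performance and offer a feasible route to real-time automated thyroid nodule detection.
Link: https://arxiv.org/abs/2512.00639
Authors: Mahmoud El Hussieni
Affiliations: Istanbul Medipol University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Performance (cs.PF)
Comments: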
Abstract:The increasing prevalence of thyroid cancer globally has led to the development of various computer-aided detection methods. Accurate segmentation of thyroid nodules is a critical first step in the development of AI-assisted clinical decision support systems. This study focuses on instance segmentation of thyroid nodules using YOLOv5 algorithms on ultrasound images. We evaluated multiple YOLOv5 variants (Nano, Small, Medium, Large, and XLarge) across two dataset versions, with and without doppler images. The YOLOv5-Large algorithm achieved the highest performance with a dice score of 91% and mAP of 0.87 on the dataset including doppler images. Notably, our results demonstrate that doppler images, typically excluded by physicians, can significantly improve segmentation performance. The YOLOv5-Small model achieved 79% dice score when doppler images were excluded, while including them improved performance across all model variants. These findings suggest that instance segmentation with YOLOv5 provides an effective real-time approach for thyroid nodule detection, with potential clinical applications in automated diagnostic systems.
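The Dice score quoted in this abstract is the usual overlap measure between a predicted mask and a ground-truth mask, 2|A∩B| / (|A| + |B|). A minimal sketch follows; the flat 0/1 list representation and the convention of returning 1.0 when both masks are empty are choices made for this illustration, not details from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 lists."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement here
    return 2.0 * intersection / total


# Prediction marks two pixels, ground truth marks one of them.
example = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

In this toy case the score is 2/3: one shared pixel, counted twice, divided by the three pixels marked across both masks.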
[CV-178] XAI-Driven Skin Disease Classification: Leveraging GANs to Augment ResNet-50 Performance
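Quick read: This paper addresses the limited accuracy and interpretability of multi-class skin lesion diagnosis caused by subjective assessment, the inherent class imbalance of datasets such as HAM10000, and the black-box nature of deep learning models. The key to the solution is threefold: Deep Convolutional Generative Adversarial Networks (DCGANs) augment the data per class to mitigate class imbalance; a fine-tuned ResNet-50 classifier is trained on the augmented data for high-accuracy classification; and LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations) are integrated to confirm that predictions rest on clinically relevant features such as irregular morphology, making the diagnoses more trustworthy and verifiable.
Link: https://arxiv.org/abs/2512.00626
Authors: Kim Gerard A. Villanueva,Priyanka Kumar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: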
Abstract:Accurate and timely diagnosis of multi-class skin lesions is hampered by subjective methods, inherent data imbalance in datasets like HAM10000, and the “black box” nature of Deep Learning (DL) models. This study proposes a trustworthy and highly accurate Computer-Aided Diagnosis (CAD) system to overcome these limitations. The approach utilizes Deep Convolutional Generative Adversarial Networks (DCGANs) for per class data augmentation to resolve the critical class imbalance problem. A fine-tuned ResNet-50 classifier is then trained on the augmented dataset to classify seven skin disease categories. Crucially, LIME and SHAP Explainable AI (XAI) techniques are integrated to provide transparency by confirming that predictions are based on clinically relevant features like irregular morphology. The system achieved a high overall Accuracy of 92.50% and a Macro-AUC of 98.82%, successfully outperforming various prior benchmarked architectures. This work successfully validates a verifiable framework that combines high performance with the essential clinical interpretability required for safe diagnostic deployment. Future research should prioritize enhancing discrimination for critical categories, such as Melanoma NOS (F1-Score is 0.8602).
[CV-179] Automatic Pith Detection in Tree Cross-Section Images Using Deep Learning
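Quick read: This paper addresses automated pith detection in tree cross-sections, which traditionally relies on manual, error-prone work. The key to the solution is applying several deep learning models (YOLOv9, U-Net, Swin Transformer, DeepLabV3, and Mask R-CNN) to detect and segment the pith in cross-section images, with dynamic data augmentation to improve generalization. Swin Transformer performs best on fine segmentation (accuracy 0.94), while Mask R-CNN's intersection over union (IoU) rises from 0.45 to 0.80 after Non-Maximum Suppression (NMS), showing how much hyperparameter tuning and post-processing matter.
Link: https://arxiv.org/abs/2512.00625
Authors: Tzu-I Liao,Mahmoud Fakhry,Jibin Yesudas Varghese
Affiliations: Oregon State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 7 figures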
Abstract:Pith detection in tree cross-sections is essential for forestry and wood quality analysis but remains a manual, error-prone task. This study evaluates deep learning models – YOLOv9, U-Net, Swin Transformer, DeepLabV3, and Mask R-CNN – to automate the process efficiently. A dataset of 582 labeled images was dynamically augmented to improve generalization. Swin Transformer achieved the highest accuracy (0.94), excelling in fine segmentation. YOLOv9 performed well for bounding box detection but struggled with boundary precision. U-Net was effective for structured patterns, while DeepLabV3 captured multi-scale features with slight boundary imprecision. Mask R-CNN initially underperformed due to overlapping detections, but applying Non-Maximum Suppression (NMS) improved its IoU from 0.45 to 0.80. Generalizability was next tested using an oak dataset of 11 images from Oregon State University’s Tree Ring Lab. Additionally, for exploratory analysis purposes, an additional dataset of 64 labeled tree cross-sections was used to train the worst-performing model to see if this would improve its performance generalizing to the unseen oak dataset. Key challenges included tensor mismatches and boundary inconsistencies, addressed through hyperparameter tuning and augmentation. Our results highlight deep learning’s potential for tree cross-section pith detection, with model choice depending on dataset characteristics and application needs.
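The abstract credits Non-Maximum Suppression with removing Mask R-CNN's overlapping detections. For reference, a minimal greedy NMS over axis-aligned boxes looks like the sketch below. This is the textbook procedure, not the authors' pipeline; the box format (x1, y1, x2, y2), the 0.5 threshold, and the sample detections are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS. detections is a list of (box, score) pairs; a box is kept
    only if it does not overlap an already kept, higher-scoring box by more
    than iou_threshold."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


# Two heavily overlapping boxes and one separate box (made-up values).
detections = [((0, 0, 10, 10), 0.9), ((1, 1, 11, 11), 0.8), ((20, 20, 30, 30), 0.7)]
kept = nms(detections)
```

In this example the second box overlaps the first with an IoU of about 0.68, so it is suppressed and two detections remain.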
[CV-180] Scaling Down to Scale Up: Towards Operationally-Efficient and Deployable Clinical Models via Cross-Modal Low-Rank Adaptation for Medical Vision-Language Models
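Quick read: This paper addresses the limited use of large-scale vision-language foundation models for downstream clinical tasks on volumetric medical imaging (such as chest CT), especially the lack of effective transfer in zero-shot settings. The key to the solution is the MedCT-VLM framework, which uses Low-Rank Adaptation (LoRA) to fine-tune the pretrained CT-CLIP model by training only 1.67M parameters (0.38% of the 440M total) instead of updating all parameters. Low-rank decomposition matrices are inserted into the attention layers, improving performance on multi-label classification of 18 thoracic pathologies and supporting parameter-efficient adaptation for zero-shot transfer in medical imaging.
Link: https://arxiv.org/abs/2512.00597
Authors: Thuraya Alzubaidi,Farhad R. Nezami,Muzammil Behzad
Affiliations: King Fahd University of Petroleum and Minerals, Saudi Arabia; Institute for Medical Engineering and Science, Massachusetts Institute of Technology, US; Harvard Medical School, Harvard University, US; SDAIA-KFUPM Joint Research Center for Artificial Intelligence, Saudi Arabia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: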
Abstract:Foundation models trained via vision-language pretraining have demonstrated strong zero-shot capabilities across diverse image domains, yet their application to volumetric medical imaging remains limited. We introduce MedCT-VLM: Medical CT Vision-Language Model, a parameter-efficient vision-language framework designed to adapt large-scale CT foundation models for downstream clinical tasks. MedCT-VLM uses a parameter-efficient approach to adapt CT-CLIP, a contrastive vision-language model trained on 25,692 chest CT volumes, for multi-label pathology classification using Low-Rank Adaptation (LoRA). Rather than fine-tuning the model’s 440 M parameters directly, we insert low-rank decomposition matrices into attention layers of both vision and text encoders, training only 1.67M parameters (0.38% of total). We evaluate on zero-shot classification across 18 thoracic pathologies, where the model must align CT embeddings with unseen text prompts at inference without task-specific training. LoRA fine-tuning improves mean AUROC from 61.3% to 68.9% (+7.6 pp), accuracy from 67.2% to 73.6% (+6.4 pp), and macro-F1 from 32.1% to 36.9% (+4.8 pp). These results demonstrate that parameter-efficient methods can effectively transfer large-scale pretraining to downstream medical imaging tasks, particularly for zero-shot scenarios where labeled data is scarce.
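The parameter savings reported in this abstract follow from how LoRA is built: a frozen weight matrix W gets a trainable low-rank update, y = Wx + (alpha / r) * B(Ax), which adds only r * (d_in + d_out) parameters per adapted matrix. The sketch below shows that arithmetic and a tiny forward pass in plain Python. The layer dimensions, rank, and alpha are hypothetical, since the abstract does not state the values used in MedCT-VLM.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added by one LoRA pair: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def lora_forward(x, W, A, B, alpha, rank):
    """y = W x + (alpha / rank) * B (A x). W stays frozen; only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Hypothetical 768 x 768 attention projection adapted with rank 8.
added = lora_param_count(768, 768, 8)

# Share of trainable parameters using the figures quoted in the abstract.
trainable_percent = round(100 * 1.67e6 / 440e6, 2)
```

With B initialized to zeros, as is common for LoRA, the adapted layer starts out identical to the pretrained one, so fine-tuning begins from the original model's behavior.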
[CV-181] SatireDecoder: Visual Cascaded Decoupling for Enhancing Satirical Image Comprehension AAAI2026
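Quick read: This paper addresses the difficulty current vision-language models have in understanding purely visual satirical images: they fail to integrate local entity relationships with global context, which leads to misinterpretation, comprehension biases, and hallucinations. The key to the solution is SatireDecoder, a training-free framework that uses a multi-agent system to perform visual cascaded decoupling, decomposing an image into fine-grained local and global semantic representations. It also introduces a chain-of-thought reasoning strategy guided by uncertainty analysis, which breaks the complex satire comprehension process into a sequence of subtasks with minimized uncertainty, improving interpretive accuracy while reducing hallucinations.
Link: https://arxiv.org/abs/2512.00582
Authors: Yue Jiang,Haiwei Xue,Minghao Han,Mingcheng Li,Xiaolu Hou,Dingkang Yang,Lihua Zhang,Xu Zheng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026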
Abstract:Satire, a form of artistic expression combining humor with implicit critique, holds significant social value by illuminating societal issues. Despite its cultural and societal significance, satire comprehension, particularly in purely visual forms, remains a challenging task for current vision-language models. This task requires not only detecting satire but also deciphering its nuanced meaning and identifying the implicated entities. Existing models often fail to effectively integrate local entity relationships with global context, leading to misinterpretation, comprehension biases, and hallucinations. To address these limitations, we propose SatireDecoder, a training-free framework designed to enhance satirical image comprehension. Our approach proposes a multi-agent system performing visual cascaded decoupling to decompose images into fine-grained local and global semantic representations. In addition, we introduce a chain-of-thought reasoning strategy guided by uncertainty analysis, which breaks down the complex satire comprehension process into sequential subtasks with minimized uncertainty. Our method significantly improves interpretive accuracy while reducing hallucinations. Experimental results validate that SatireDecoder outperforms existing baselines in comprehending visual satire, offering a promising direction for vision-language reasoning in nuanced, high-level semantic tasks.
[CV-182] Integrating Skeleton Based Representations for Robust Yoga Pose Classification Using Deep Learning Models
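Quick read: This paper targets automated yoga pose classification, motivated by the fact that incorrect postures can cause injuries, with the goal of reducing reliance on expert instructors. The key to the solution is a curated dataset, "Yoga-16", together with a systematic evaluation of three deep learning architectures (VGG16, ResNet50, and Xception) under three input modalities (raw images, MediaPipe Pose skeleton images, and YOLOv8 Pose skeleton images). The experiments show that skeleton-based inputs outperform raw images, with VGG16 on MediaPipe Pose skeleton input reaching the highest accuracy of 96.09%. Grad-CAM is used to provide an interpretability analysis of the model's decisions.
Link: https://arxiv.org/abs/2512.00572
Authors: Mohammed Mohiuddin,Syed Mohammod Minhaz Hossain,Sumaiya Khanam,Prionkar Barua,Aparup Barua,MD Tamim Hossain
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: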
Abstract:Yoga is a popular form of exercise worldwide due to its spiritual and physical health benefits, but incorrect postures can lead to injuries. Automated yoga pose classification has therefore gained importance to reduce reliance on expert practitioners. While human pose keypoint extraction models have shown high potential in action recognition, systematic benchmarking for yoga pose recognition remains limited, as prior works often focus solely on raw images or a single pose extraction model. In this study, we introduce a curated dataset, ‘Yoga-16’, which addresses limitations of existing datasets, and systematically evaluate three deep learning architectures (VGG16, ResNet50, and Xception) using three input modalities (direct images, MediaPipe Pose skeleton images, and YOLOv8 Pose skeleton images). Our experiments demonstrate that skeleton-based representations outperform raw image inputs, with the highest accuracy of 96.09% achieved by VGG16 with MediaPipe Pose skeleton input. Additionally, we provide interpretability analysis using Grad-CAM, together with cross-validation analysis, offering insights into model decision-making for yoga pose classification.
[CV-183] Describe Anything Anywhere At Any Moment
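Quick read: This paper addresses the tension between rich semantic descriptions and computational efficiency in large-scale, real-time 4D scene understanding, that is, how to achieve near-real-time inference while keeping language accurately grounded in space. Existing methods trade performance for open-vocabulary semantic detail, which makes them hard to use in applications such as robot autonomy or augmented reality that require spatio-temporal consistency and fast response. The key to the solution is DAAAM (Describe Anything, Anywhere, at Any Moment), a spatio-temporal memory framework. An optimization-based frontend infers detailed, geometrically grounded semantic descriptions from localized captioning models such as the Describe Anything Model (DAM) and uses batch processing to speed up online inference. The framework also builds a hierarchical 4D scene graph (4D SG) that serves as a globally consistent spatio-temporal memory representation, combining detailed semantic modeling with real-time performance.
Link: https://arxiv.org/abs/2512.00565
Authors: Nicolas Gorlo,Lukas Schmid,Luca Carlone
Affiliations: Massachusetts Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 14 pages, 5 figures, 6 tables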
Abstract:Computer vision and robotics applications ranging from augmented reality to robot autonomy in large-scale environments require spatio-temporal memory frameworks that capture both geometric structure for accurate language-grounding as well as semantic detail. Existing methods face a tradeoff, where producing rich open-vocabulary descriptions comes at the expense of real-time performance when these descriptions have to be grounded in 3D. To address these challenges, we propose Describe Anything, Anywhere, at Any Moment (DAAAM), a novel spatio-temporal memory framework for large-scale and real-time 4D scene understanding. DAAAM introduces a novel optimization-based frontend to infer detailed semantic descriptions from localized captioning models, such as the Describe Anything Model (DAM), leveraging batch processing to speed up inference by an order of magnitude for online processing. It leverages such semantic understanding to build a hierarchical 4D scene graph (SG), which acts as an effective globally spatially and temporally consistent memory representation. DAAAM constructs 4D SGs with detailed, geometrically grounded descriptions while maintaining real-time performance. We show that DAAAM’s 4D SG interfaces well with a tool-calling agent for inference and reasoning. We thoroughly evaluate DAAAM in the complex task of spatio-temporal question answering on the NaVQA benchmark and show its generalization capabilities for sequential task grounding on the SG3D benchmark. We further curate an extended OC-NaVQA benchmark for large-scale and long-time evaluations. DAAAM achieves state-of-the-art results in both tasks, improving OC-NaVQA question accuracy by 53.6%, position errors by 21.9%, temporal errors by 21.6%, and SG3D task grounding accuracy by 27.8% over the most competitive baselines, respectively. We release our data and code open-source. 
[CV-184] NeuroVolve: Evolving Visual Stimuli toward Programmable Neural Objectives
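Quick read: This paper addresses a neuroscience question: what visual information individual brain regions encode, and how these distributed patterns combine into complex visual representations. Earlier approaches used generative models to reproduce the category selectivity of single regions (for example, faces in the fusiform face area, FFA) but offered limited insight into how multiple regions interact during naturalistic vision. The key to the solution is NeuroVolve, a brain-guided image generation framework that operates in the embedding space of a pretrained vision-language model and optimizes a programmable neural objective (such as activating or deactivating one or several regions), unifying brain-activity constraints and image synthesis in one process. The framework recovers known single-region selectivity and also generates semantically coherent scenes that satisfy cooperative or antagonistic multi-region relationships. It reveals semantic trajectories through embedding space and subject-specific preferences, providing an interpretable, customizable tool for analyzing and probing neural representations.
Link: https://arxiv.org/abs/2512.00557
Authors: Haomiao Chen,Keith W Jamison,Mert R. Sabuncu,Amy Kuceyeski
Affiliations: Cornell University; Cornell Tech; Weill Cornell Medicine
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: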
Abstract:What visual information is encoded in individual brain regions, and how do distributed patterns combine to create their neural representations? Prior work has used generative models to replicate known category selectivity in isolated regions (e.g., faces in FFA), but these approaches offer limited insight into how regions interact during complex, naturalistic vision. We introduce NeuroVolve, a generative framework that provides brain-guided image synthesis via optimization of a neural objective function in the embedding space of a pretrained vision-language model. Images are generated under the guidance of a programmable neural objective, i.e., activating or deactivating single regions or multiple regions together. NeuroVolve is validated by recovering known selectivity for individual brain regions, while expanding to synthesize coherent scenes that satisfy complex, multi-region constraints. By tracking optimization steps, it reveals semantic trajectories through embedding space, unifying brain-guided image editing and preferred stimulus generation in a single process. We show that NeuroVolve can generate both low-level and semantic feature-specific stimuli for single ROIs, as well as stimuli aligned to curated neural objectives. These include co-activation and decorrelation between regions, exposing cooperative and antagonistic tuning relationships. Notably, the framework captures subject-specific preferences, supporting personalized brain-driven synthesis and offering interpretable constraints for mapping, analyzing, and probing neural representations of visual information.
[CV-185] Asset-Driven Sematic Reconstruction of Dynamic Scene with Multi-Human-Object Interactions
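Quick read: This paper addresses 3D geometry modeling of dynamic scenes with multiple humans and multiple objects, especially in a monocular setup where complex motion patterns and frequent occlusions make structural consistency hard to maintain. The key to the solution is a hybrid approach: first, 3D generative models produce high-fidelity meshes of the scene elements; second, semantic-aware deformation (rigid transformation for rigid objects and Linear Blend Skinning (LBS)-based deformation for humans) maps the deformed high-fidelity meshes into the dynamic scene; finally, 3D Gaussian Splatting (GS) optimization of the individual elements further refines their alignment in the scene. This strategy preserves object structure under severe occlusion and yields multi-view and temporally consistent geometry.
Link: https://arxiv.org/abs/2512.00547
Authors: Sandika Biswas,Qianyi Wu,Biplab Banerjee,Hamid Rezatofighi
Affiliations: Monash University; Indian Institute of Technology, Bombay
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: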
Abstract:Real-world human-built environments are highly dynamic, involving multiple humans and their complex interactions with surrounding objects. While 3D geometry modeling of such scenes is crucial for applications like AR/VR, gaming, and embodied AI, it remains underexplored due to challenges like diverse motion patterns and frequent occlusions. Beyond novel view rendering, 3D Gaussian Splatting (GS) has demonstrated remarkable progress in producing detailed, high-quality surface geometry with fast optimization of the underlying structure. However, very few GS-based methods address multihuman, multiobject scenarios, primarily due to the above-mentioned inherent challenges. In a monocular setup, these challenges are further amplified, as maintaining structural consistency under severe occlusion becomes difficult when the scene is optimized solely based on GS-based rendering loss. To tackle the challenges of such a multihuman, multiobject dynamic scene, we propose a hybrid approach that effectively combines the advantages of 1) 3D generative models for generating high-fidelity meshes of the scene elements, 2) Semantic-aware deformation, i.e., rigid transformation of the rigid objects and LBS-based deformation of the humans, and mapping of the deformed high-fidelity meshes in the dynamic scene, and 3) GS-based optimization of the individual elements for further refining their alignments in the scene. Such a hybrid approach helps maintain the object structures even under severe occlusion and can produce multiview and temporally consistent geometry. We choose HOI-M3 for evaluation, as, to the best of our knowledge, this is the only dataset featuring multihuman, multiobject interactions in a dynamic scene. Our method outperforms the state-of-the-art method in producing better surface reconstruction of such scenes.
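The two deformation types named in this abstract can be contrasted with a small sketch in plain Python. It assumes the textbook definition of linear blend skinning, in which a vertex is moved by a weighted sum of per-bone rigid transforms. The bone transforms, weights, and function names are hypothetical, and this is not the paper's pipeline.

```python
import math

def rigid_transform(R, t, v):
    """Apply a rigid transform (3x3 rotation R, translation t) to a 3D point v."""
    return [sum(R[i][j] * v[j] for j in range(3)) + t[i] for i in range(3)]

def lbs_vertex(v, weights, transforms):
    """Linear blend skinning: blend per-bone rigid transforms by skinning weights."""
    out = [0.0, 0.0, 0.0]
    for w, (R, t) in zip(weights, transforms):
        p = rigid_transform(R, t, v)
        out = [o + w * pi for o, pi in zip(out, p)]
    return out

def rot_z(theta):
    """Rotation matrix about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical bones: one stays put, one is shifted by 2 units along x.
identity = (rot_z(0.0), [0.0, 0.0, 0.0])
shifted = (rot_z(0.0), [2.0, 0.0, 0.0])

# A rigid object uses a single transform for every vertex ...
rigid_point = rigid_transform(shifted[0], shifted[1], [1.0, 0.0, 0.0])
# ... while a human vertex blends several bone transforms.
skinned_point = lbs_vertex([1.0, 0.0, 0.0], [0.5, 0.5], [identity, shifted])
```

With equal weights on the two bones, the skinned vertex lands halfway between the two rigidly transformed positions.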
[CV-186] SAIDO: Generalizable Detection of AI-Generated Images via Scene-Aware and Importance-Guided Dynamic Optimization in Continual Learning
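Quick read: This paper addresses the generalization challenge in AI-generated image detection: existing methods struggle to adapt to the new generative methods and content types that keep emerging in real-world scenarios. The key to the solution is the Scene-Aware and Importance-Guided Dynamic Optimization (SAIDO) framework, which has two core mechanisms. The first is the Scene-Awareness-Based Expert Module (SAEM), which uses vision-language large models (VLLMs) to dynamically identify and incorporate new scenes and allocates an independent expert module to each scene, capturing scene-specific forgery features and improving cross-scene generalization. The second is the Importance-Guided Dynamic Optimization Mechanism (IDOM), which optimizes neurons through an importance-guided gradient projection strategy, balancing model plasticity and stability and mitigating catastrophic forgetting when learning from multiple generative methods.
Link: https://arxiv.org/abs/2512.00539
Authors: Yongkang Hu,Yu Cheng,Yushuo Zhang,Yuan Xie,Zhaoxia Yin
Affiliations: East China Normal University; Shanghai Innovation Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 19 figures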
Abstract:The widespread misuse of image generation technologies has raised security concerns, driving the development of AI-generated image detection methods. However, generalization has become a key challenge and open problem: existing approaches struggle to adapt to emerging generative methods and content types in real-world scenarios. To address this issue, we propose a Scene-Aware and Importance-Guided Dynamic Optimization detection framework with continual learning (SAIDO). Specifically, we design a Scene-Awareness-Based Expert Module (SAEM) that dynamically identifies and incorporates new scenes using VLLMs. For each scene, independent expert modules are dynamically allocated, enabling the framework to capture scene-specific forgery features better and enhance cross-scene generalization. To mitigate catastrophic forgetting when learning from multiple image generative methods, we introduce an Importance-Guided Dynamic Optimization Mechanism (IDOM), which optimizes each neuron through an importance-guided gradient projection strategy, thereby achieving an effective balance between model plasticity and stability. Extensive experiments on continual learning tasks demonstrate that our method outperforms the current SOTA method in both stability and plasticity, achieving 44.22% and 40.57% relative reductions in average detection error rate and forgetting rate, respectively. On open-world datasets, it improves the average detection accuracy by 9.47% compared to the current SOTA method.
[CV-187] Cross-Temporal 3D Gaussian Splatting for Sparse-View Guided Scene Update AAAI2026
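Quick read: This paper addresses how to keep 3D scene representations consistent over time, in particular how to efficiently reconstruct and update 3D scenes from different periods using sparse-view observations. The key to the solution is a new framework, Cross-Temporal 3D Gaussian Splatting (Cross-Temporal 3DGS), with three stages: 1) cross-temporal camera alignment, which estimates and aligns camera poses across timestamps; 2) interference-based confidence initialization, which identifies regions unchanged between timestamps to guide updates; and 3) progressive cross-temporal optimization, which iteratively integrates historical prior information to improve reconstruction quality. The method supports non-continuous capture: new sparse views can refine existing scenes, and current captures can help recover past scenes. It improves reconstruction quality and data efficiency and suits scene versioning, cross-temporal digital twins, and long-term spatial documentation.
Link: https://arxiv.org/abs/2512.00534
Authors: Zeyuan An,Yanghang Xiao,Zhiying Leng,Frederick W. B. Li,Xiaohui Liang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at AAAI 2026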
Abstract:Maintaining consistent 3D scene representations over time is a significant challenge in computer vision. Updating 3D scenes from sparse-view observations is crucial for various real-world applications, including urban planning, disaster assessment, and historical site preservation, where dense scans are often unavailable or impractical. In this paper, we propose Cross-Temporal 3D Gaussian Splatting (Cross-Temporal 3DGS), a novel framework for efficiently reconstructing and updating 3D scenes across different time periods, using sparse images and previously captured scene priors. Our approach comprises three stages: 1) Cross-temporal camera alignment for estimating and aligning camera poses across different timestamps; 2) Interference-based confidence initialization to identify unchanged regions between timestamps, thereby guiding updates; and 3) Progressive cross-temporal optimization, which iteratively integrates historical prior information into the 3D scene to enhance reconstruction quality. Our method supports non-continuous capture, enabling not only updates using new sparse views to refine existing scenes, but also recovering past scenes from limited data with the help of current captures. Furthermore, we demonstrate the potential of this approach to achieve temporal changes using only sparse images, which can later be reconstructed into detailed 3D representations as needed. Experimental results show significant improvements over baseline methods in reconstruction quality and data efficiency, making this approach a promising solution for scene versioning, cross-temporal digital twins, and long-term spatial documentation.
[CV-188] Image Generation as a Visual Planner for Robotic Manipulation CVPR2026
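Quick read: This paper addresses the lack of generality and training efficiency in robotic manipulation video generation, that is, how to synthesize high-quality, temporally coherent robot action videos without relying on large domain-specific datasets, as a step toward unifying perception, planning, and action. The key to the solution is to exploit the latent temporal priors of pretrained image generation models (trained on language-image corpora) and to adapt them into visual planners for robots with lightweight LoRA finetuning. The framework supports text-conditioned generation (a language instruction plus the first frame) and trajectory-conditioned generation (a 2D trajectory overlay plus the first frame). On the Jaco Play, Bridge V2, and RT1 datasets, both modes produce smooth videos consistent with their conditions, indicating that pretrained image generators carry transferable temporal priors and can act as video-like robotic planners under minimal supervision.
Link: https://arxiv.org/abs/2512.00532
Authors: Ye Pang
Affiliations: Southern China University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 11 pages, 9 figures. Under review at CVPR 2026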
Abstract:Generating realistic robotic manipulation videos is an important step toward unifying perception, planning, and action in embodied agents. While existing video diffusion models require large domain-specific datasets and struggle to generalize, recent image generation models trained on language-image corpora exhibit strong compositionality, including the ability to synthesize temporally coherent grid images. This suggests a latent capacity for video-like generation even without explicit temporal modeling. We explore whether such models can serve as visual planners for robots when lightly adapted using LoRA finetuning. We propose a two-part framework that includes: (1) text-conditioned generation, which uses a language instruction and the first frame, and (2) trajectory-conditioned generation, which uses a 2D trajectory overlay and the same initial frame. Experiments on the Jaco Play dataset, Bridge V2, and the RT1 dataset show that both modes produce smooth, coherent robot videos aligned with their respective conditions. Our findings indicate that pretrained image generators encode transferable temporal priors and can function as video-like robotic planners under minimal supervision. Code is released at this https URL.
[CV-189] Terrain Sensing with Smartphone Structured Light: 2D Dynamic Time Warping for Grid Pattern Matching
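Quick read: This paper addresses the locomotion stability problem of low-cost mobile rovers on uneven terrain, where small bumps or tilts are hard to perceive. The key to the solution is a smartphone-based structured-light system that projects a grid pattern and reconstructs local ground unevenness using a topology-constrained two-dimensional dynamic time warping (2D-DTW) algorithm. Conventional one-dimensional dynamic time warping (1D-DTW) cannot directly match 2D grid patterns under perspective distortion and partial occlusion. The proposed algorithm aligns the grid column by column under a global grid consistency constraint, giving structure-preserving matching that is simple enough for resource-limited devices and supports accurate triangulation and terrain sensing.
Link: https://arxiv.org/abs/2512.00514
Authors: Tanaka Nobuaki
Affiliations: Meiji University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: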
Abstract:Low-cost mobile rovers often operate on uneven terrain where small bumps or tilts are difficult to perceive visually but can significantly affect locomotion stability. To address this problem, we explore a smartphone-based structured-light system that projects a grid pattern onto the ground and reconstructs local terrain unevenness from a single handheld device. The system is inspired by face-recognition projectors, but adapted for ground sensing. A key technical challenge is robustly matching the projected grid with its deformed observation under perspective distortion and partial occlusion. Conventional one-dimensional dynamic time warping (1D-DTW) is not directly applicable to such two-dimensional grid patterns. We therefore propose a topology-constrained two-dimensional dynamic time warping (2D-DTW) algorithm that performs column-wise alignment under a global grid consistency constraint. The proposed method is designed to be simple enough to run on resource-limited platforms while preserving the grid structure required for accurate triangulation. We demonstrate that our 2D-DTW formulation can be used not only for terrain sensing but also as a general tool for matching structured grid patterns in image processing scenarios. This paper describes the overall system design as well as the 2D-DTW extension that emerged from this application.
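For readers unfamiliar with the baseline this abstract starts from, the following is a minimal sketch of conventional 1D dynamic time warping in plain Python, using absolute difference as the local cost. It is the classic textbook recurrence only. The paper's topology-constrained 2D-DTW and its global grid consistency constraint are not reproduced here, and the function name is hypothetical.

```python
def dtw_distance(a, b):
    """Classic 1D DTW cost between sequences a and b with |x - y| as local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a only, step in b only, or step in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# A stretched copy of the same profile aligns at zero cost.
stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])
# A constant offset cannot be warped away.
offset = dtw_distance([0, 0], [1, 1])
```

Each column of a grid could be aligned independently with such a routine, but independent alignment ignores the relationship between neighbouring columns, which is the gap the abstract says the 2D extension is meant to close.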
[CV-190] CC-FMO: Camera-Conditioned Zero-Shot Single Image to 3D Scene Generation with Foundation Model Orchestration
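Quick read: This paper addresses scene coherence and instance fidelity in single-image-to-3D scene generation. Existing methods perform well at the instance level but lack scene-level coherence, for example spatial inconsistency caused by inaccurate object pose estimation. The key to the solution is CC-FMO, a zero-shot, camera-conditioned pipeline built on foundation model orchestration, with two main ideas: 1) a hybrid instance generator that combines a semantics-aware vector-set representation with a detail-rich structured latent representation, so that object geometry is both semantically plausible and high-quality; and 2) a simple yet effective camera-conditioned scale-solving algorithm that lets foundational pose estimation models be applied directly to scene generation and enforces scene-level spatial coherence. Experiments show that the method generates high-quality, camera-aligned compositional 3D scenes in a zero-shot setting and outperforms existing state-of-the-art methods.
Link: https://arxiv.org/abs/2512.00493
Authors: Boshi Tang,Henry Zheng,Rui Huang,Gao Huang
Affiliations: Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: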
Abstract:High-quality 3D scene generation from a single image is crucial for AR/VR and embodied AI applications. Early approaches struggle to generalize due to reliance on specialized models trained on curated small datasets. While recent advancements in large-scale 3D foundation models have significantly enhanced instance-level generation, coherent scene generation remains a challenge, where performance is limited by inaccurate per-object pose estimations and spatial inconsistency. To this end, this paper introduces CC-FMO, a zero-shot, camera-conditioned pipeline for single-image to 3D scene generation that jointly conforms to the object layout in the input image and preserves instance fidelity. CC-FMO employs a hybrid instance generator that combines semantics-aware vector-set representation with detail-rich structured latent representation, yielding object geometries that are both semantically plausible and high-quality. Furthermore, CC-FMO enables the application of foundational pose estimation models in the scene generation task via a simple yet effective camera-conditioned scale-solving algorithm, to enforce scene-level coherence. Extensive experiments demonstrate that CC-FMO consistently generates high-fidelity camera-aligned compositional scenes, outperforming all state-of-the-art methods.
[CV-191] Learning What Helps: Task-Aligned Context Selection for Vision Tasks
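Quick read: This paper addresses the inability of vision models such as Vision Transformers (ViTs) to identify, under visual uncertainty, which context examples actually improve task performance. Existing methods usually retrieve examples by similarity, which can select samples that look like the input but do not help the task. The key to the solution is Task-Aligned Context Selection (TACS), which jointly trains a selector network with the task model through a hybrid optimization scheme combining gradient-based supervision and reinforcement learning. This makes context retrieval part of the learning objective, so that examples are selected according to task reward and the selected examples genuinely improve predictions.
Link: https://arxiv.org/abs/2512.00489
Authors: Jingyu Guo,Emir Konuk,Fredrik Strand,Christos Matsoukas,Kevin Smith
Affiliations: KTH Royal Institute of Technology; Science for Life Laboratory; Karolinska Institutet; AstraZeneca
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: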
Abstract:Humans often resolve visual uncertainty by comparing an image with relevant examples, but ViTs lack the ability to identify which examples would improve their predictions. We present Task-Aligned Context Selection (TACS), a framework that learns to select paired examples which truly improve task performance rather than those that merely appear similar. TACS jointly trains a selector network with the task model through a hybrid optimization scheme combining gradient-based supervision and reinforcement learning, making retrieval part of the learning objective. By aligning selection with task rewards, TACS enables discriminative models to discover which contextual examples genuinely help. Across 18 datasets covering fine-grained recognition, medical image classification, and medical image segmentation, TACS consistently outperforms similarity-based retrieval, particularly in challenging or data-limited settings.
[CV-192] Structured Context Learning for Generic Event Boundary Detection
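Quick read: This paper addresses Generic Event Boundary Detection (GEBD), that is, accurately identifying the event boundaries that humans perceive in videos. The key to the solution is a new method called Structured Context Learning, whose core idea is the Structured Partition of Sequence (SPoS): the input frame sequence is partitioned in a structured way to provide structured context for the subsequent temporal model. SPoS has linear computational complexity in the video length and is not tied to a particular temporal model (such as GRU, LSTM, or Transformer), which gives a better speed-accuracy trade-off. Group similarities are then computed and fed to a lightweight fully convolutional network that predicts boundaries, and a Gaussian kernel is used to preprocess the ground-truth boundaries to handle annotation ambiguity.
Link: https://arxiv.org/abs/2512.00475
Authors: Xin Gu,Congcong Li,Xinyao Wang,Dexiang Hong,Libo Zhang,Tiejian Luo,Longyin Wen,Heng Fan
Affiliations: University of Chinese Academy of Sciences; Sichuan Changhong Electric Co., Ltd.; ByteDance Intelligent Creation; Institute of Software, Chinese Academy of Sciences; Department of Computer Science and Engineering, University of North Texas
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: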
Abstract:Generic Event Boundary Detection (GEBD) aims to identify moments in videos that humans perceive as event boundaries. This paper proposes a novel method for addressing this task, called Structured Context Learning, which introduces the Structured Partition of Sequence (SPoS) to provide a structured context for learning temporal information. Our approach is end-to-end trainable and flexible, not restricted to specific temporal models like GRU, LSTM, and Transformers. This flexibility enables our method to achieve a better speed-accuracy trade-off. Specifically, we apply SPoS to partition the input frame sequence and provide a structured context for the subsequent temporal model. Notably, SPoS’s overall computational complexity is linear with respect to the video length. We next calculate group similarities to capture differences between frames, and a lightweight fully convolutional network is utilized to determine the event boundaries based on the grouped similarity maps. To remedy the ambiguities of boundary annotations, we adapt the Gaussian kernel to preprocess the ground-truth event boundaries. Our proposed method has been extensively evaluated on the challenging Kinetics-GEBD, TAPOS, and shot transition detection datasets, demonstrating its superiority over existing state-of-the-art methods.
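The Gaussian preprocessing of ground-truth boundaries mentioned in this abstract can be pictured with a short sketch in plain Python. It assumes one common way of softening hard labels: each annotated boundary frame spreads a Gaussian bump over its neighbours, and overlapping bumps are merged by taking the maximum. The `sigma` value, the max-merge rule, and the function name are assumptions, not details from the paper.

```python
import math

def gaussian_soft_labels(num_frames, boundaries, sigma=1.0):
    """Turn hard boundary frame indices into soft per-frame targets in [0, 1]."""
    labels = [0.0] * num_frames
    for t in range(num_frames):
        for b in boundaries:
            # Gaussian bump centred on the annotated boundary frame b.
            bump = math.exp(-((t - b) ** 2) / (2.0 * sigma ** 2))
            labels[t] = max(labels[t], bump)
    return labels

# One annotated boundary at frame 3 in a 7-frame clip (hypothetical example).
soft = gaussian_soft_labels(num_frames=7, boundaries=[3], sigma=1.0)
```

Frames next to the annotated boundary receive a partial target instead of zero, so a prediction that is off by one frame is penalized less than one that is far away.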
[CV-193] RealGen: Photorealistic Text-to-Image Generation via Detector-Guided Rewards
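Quick read: This paper addresses the difficulty current generative AI has in producing truly photorealistic images in text-to-image (T2I) generation. Models such as GPT-Image-1 and Qwen-Image still produce "fake" images even on simple tasks, with obvious AI artifacts such as overly smooth skin and oily facial sheen. The key to the solution is the RealGen framework, whose main contributions are: (1) a "Detector Reward" mechanism, inspired by adversarial generation, that uses semantic-level and feature-level synthetic image detectors to quantify artifacts and assess realism; (2) optimization of the entire generation pipeline with the GRPO algorithm to improve realism and detail; and (3) RealBench, an automated evaluation benchmark that uses Detector-Scoring and Arena-Scoring for human-free photorealism assessment, steering models toward images that are hard to distinguish from real ones.
Link: https://arxiv.org/abs/2512.00473
Authors: Junyan Ye,Leiqi Zhu,Yuncheng Guo,Dongzhi Jiang,Zilong Huang,Yifan Zhang,Zhiyuan Yan,Haohuan Fu,Conghui He,Weijia Li
Affiliations: Shanghai AI Lab; Sun Yat-Sen University; Nanjing University; CUHK MMLab; Tsinghua University; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: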
Abstract:With the continuous advancement of image generation technology, advanced models such as GPT-Image-1 and Qwen-Image have achieved remarkable text-to-image consistency and world knowledge. However, these models still fall short in photorealistic image generation. Even on simple T2I tasks, they tend to produce “fake” images with distinct AI artifacts, often characterized by “overly smooth skin” and “oily facial sheens”. To recapture the original goal of “indistinguishable-from-reality” generation, we propose RealGen, a photorealistic text-to-image framework. RealGen integrates an LLM component for prompt optimization and a diffusion model for realistic image generation. Inspired by adversarial generation, RealGen introduces a “Detector Reward” mechanism, which quantifies artifacts and assesses realism using both semantic-level and feature-level synthetic image detectors. We leverage this reward signal with the GRPO algorithm to optimize the entire generation pipeline, significantly enhancing image realism and detail. Furthermore, we propose RealBench, an automated evaluation benchmark employing Detector-Scoring and Arena-Scoring. It enables human-free photorealism assessment, yielding results that are more accurate and aligned with real user experience. Experiments demonstrate that RealGen significantly outperforms general models like GPT-Image-1 and Qwen-Image, as well as specialized photorealistic models like FLUX-Krea, in terms of realism, detail, and aesthetics. The code is available at this https URL.
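How a detector score might feed GRPO can be sketched in a few lines of plain Python. The sketch assumes the group-relative advantage commonly associated with GRPO, where each sample's reward is normalized by the mean and standard deviation of the rewards in its own group. The `detector_reward` mapping, the use of population standard deviation, and the example probabilities are all hypothetical and are not RealGen's actual reward.

```python
import statistics

def detector_reward(p_fake):
    """Hypothetical reward: higher when a detector believes the image is real."""
    return 1.0 - p_fake

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)    # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical detector outputs (probability of "synthetic") for four samples of one prompt.
p_fake_scores = [0.9, 0.6, 0.3, 0.1]
rewards = [detector_reward(p) for p in p_fake_scores]
advantages = group_relative_advantages(rewards)
```

Samples that the detector finds more realistic than their group average get positive advantages, and the rest get negative ones, so no separate value model is needed to rank them.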
[CV-194] CausalAffect: Causal Discovery for Facial Affective Understanding
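Quick read: This paper addresses the lack of data-driven methods in facial affect analysis for directly inferring psychologically plausible causal relations between Action Units (AUs) and expressions. Existing methods build on AUs but do not model the latent causal dependencies among AUs or their links to expressions. The key to the solution is the CausalAffect framework, which models AU-AU and AU-expression dependencies through a two-level, polarity- and direction-aware causal hierarchy that integrates population-level regularities with sample-adaptive structures. A feature-level counterfactual intervention mechanism reinforces true causal effects and suppresses spurious correlations. Without jointly annotated data or handcrafted causal priors, the framework recovers causal structures consistent with psychological theory and reveals new inhibitory and previously uncharacterized dependencies.
Link: https://arxiv.org/abs/2512.00456
Authors: Guanyu Hu,Tangzheng Lian,Dimitrios Kollias,Oya Celiktutan,Xinyu Yang
Affiliations: Xi’an Jiaotong University; Queen Mary University of London; King’s College London
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: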
Abstract:Understanding human affect from facial behavior requires not only accurate recognition but also structured reasoning over the latent dependencies that drive muscle activations and their expressive outcomes. Although Action Units (AUs) have long served as the foundation of affective computing, existing approaches rarely address how to infer psychologically plausible causal relations between AUs and expressions directly from data. We propose CausalAffect, the first framework for causal graph discovery in facial affect analysis. CausalAffect models AU-AU and AU-Expression dependencies through a two-level polarity and direction aware causal hierarchy that integrates population-level regularities with sample-adaptive structures. A feature-level counterfactual intervention mechanism further enforces true causal effects while suppressing spurious correlations. Crucially, our approach requires neither jointly annotated datasets nor handcrafted causal priors, yet it recovers causal structures consistent with established psychological theories while revealing novel inhibitory and previously uncharacterized dependencies. Extensive experiments across six benchmarks demonstrate that CausalAffect advances the state of the art in both AU detection and expression recognition, establishing a principled connection between causal discovery and interpretable facial behavior. All trained models and source code will be released upon acceptance.
[CV-195] RecruitView: A Multimodal Dataset for Predicting Personality and Interview Performance for Human Resources Applications
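Quick read: This paper addresses automated assessment of personality traits and soft skills from multimodal behavioral data, where existing datasets are small and existing models fail to capture the geometric structure inherent in human traits. It introduces the RecruitView dataset and proposes Cross-Modal Regression with Manifold Fusion (CRMF), a geometric deep learning framework that models behavioral representations on hyperbolic, spherical, and Euclidean manifolds to capture hierarchical trait structure, directional behavioral patterns, and continuous performance variation. An adaptive routing mechanism weights the geometry-specific expert networks based on input characteristics, and the experts are combined through tangent space fusion. With 40-50% fewer trainable parameters than large multimodal models, CRMF improves Spearman correlation by up to 11.4% and concordance index by up to 6.0% over the selected baselines.
Link: https://arxiv.org/abs/2512.00450
Authors: Amit Kumar Gupta,Farhan Sheth,Hammad Shaikh,Dheeraj Kumar,Angkul Puniya,Deepak Panwar,Sandeep Chaurasia,Priya Mathur
Affiliations: Manipal University Jaipur; Poornima Institute of Engineering & Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 20 pages, 10 figures, 10 tables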
Abstract:Automated personality and soft skill assessment from multimodal behavioral data remains challenging due to limited datasets and methods that fail to capture geometric structure inherent in human traits. We introduce RecruitView, a dataset of 2,011 naturalistic video interview clips from 300+ participants with 27,000 pairwise comparative judgments across 12 dimensions: Big Five personality traits, overall personality score, and six interview performance metrics. To leverage this data, we propose Cross-Modal Regression with Manifold Fusion (CRMF), a geometric deep learning framework that explicitly models behavioral representations across hyperbolic, spherical, and Euclidean manifolds. CRMF employs geometry-specific expert networks to capture hierarchical trait structures, directional behavioral patterns, and continuous performance variations simultaneously. An adaptive routing mechanism dynamically weights expert contributions based on input characteristics. Through principled tangent space fusion, CRMF achieves superior performance while using 40-50% fewer trainable parameters than large multimodal models. Extensive experiments demonstrate that CRMF substantially outperforms the selected baselines, achieving up to 11.4% improvement in Spearman correlation and 6.0% in concordance index. Our RecruitView dataset is publicly available at this https URL.
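Since results in this abstract are reported as Spearman correlation, a minimal sketch of that metric in plain Python may help when reading the numbers. It uses the standard definition (the Pearson correlation of the ranks, with tied values given their average rank). The function names are hypothetical and this is not the paper's evaluation code.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Any monotonically increasing relationship scores 1, even if it is not linear.
rho = spearman([1, 2, 3, 4], [1, 4, 9, 16])
```

Because only the ordering matters, the metric rewards a model for ranking candidates correctly even when its predicted scores are on a different scale from the human judgments.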
[CV-196] FR-TTS: Test-Time Scaling for NTP-based Image Generation with Effective Filling-based Reward Signal
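【Quick read】: This paper addresses the difficulty of applying test-time scaling (TTS) to image generation under the next-token prediction (NTP) paradigm: the reward of an image decoded from an intermediate token sequence correlates poorly with the reward of the fully generated image, so intermediate samples are weak guides for pruning. The key to the solution is the Filling-Based Reward (FR), which estimates the future trajectory of an intermediate sample by finding and applying a reasonable filling scheme to complete the sequence, substantially raising the correlation between intermediate and final rewards. Building on FR, the FR-TTS strategy searches efficiently for good filling schemes and adds a diversity reward with a dynamic weighting schedule to evaluate intermediate samples; it is validated on multiple benchmarks and reward models.
Link: https://arxiv.org/abs/2512.00438
Authors: Hang Xu,Linjiang Huang,Feng Zhao
Institutions: MoE Key Lab of BIPC, USTC; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: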
Abstract:Test-time scaling (TTS) has become a prevalent technique in image generation, significantly boosting output quality by expanding the number of parallel samples and filtering them using pre-trained reward models. However, applying this powerful methodology to the next-token prediction (NTP) paradigm remains challenging. The primary obstacle is the low correlation between the reward of an image decoded from an intermediate token sequence and the reward of the fully generated image. Consequently, these incomplete intermediate representations prove to be poor indicators for guiding the pruning direction, a limitation that stems from their inherent incompleteness in scale or semantic content. To effectively address this critical issue, we introduce the Filling-Based Reward (FR). This novel design estimates the approximate future trajectory of an intermediate sample by finding and applying a reasonable filling scheme to complete the sequence. Both the correlation coefficient between rewards of intermediate samples and final samples, as well as multiple intrinsic signals like token confidence, indicate that the FR provides an excellent and reliable metric for accurately evaluating the quality of intermediate samples. Building upon this foundation, we propose FR-TTS, a sophisticated scaling strategy. FR-TTS efficiently searches for good filling schemes and incorporates a diversity reward with a dynamic weighting schedule to achieve a balanced and comprehensive evaluation of intermediate samples. We experimentally validate the superiority of FR-TTS over multiple established benchmarks and various reward models. Code is available at this https URL.
[CV-197] Recognizing Pneumonia in Real-World Chest X-rays with a Classifier Trained with Images Synthetically Generated by Nano Banana
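【Quick read】: This paper addresses the scarcity of high-quality labeled data for training medical AI models by proposing synthetic chest X-rays (CXRs) from generative AI as training data. The key step is to generate synthetic CXRs with Nano Banana, Google's recently released image generation and editing model, and train a classifier on them alone. Applied directly to two real-world CXR datasets, the classifier performs well (AUROC 0.923 on the 2018 RSNA Pneumonia Detection dataset and 0.824 on the Chest X-Ray dataset), supporting the feasibility of synthetic data in medical AI development.
Link: https://arxiv.org/abs/2512.00428
Authors: Jiachuan Peng,Kyle Lam,Jianing Qiu
Institutions: MBZUAI; Imperial College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages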
Abstract:We trained a classifier with synthetic chest X-ray (CXR) images generated by Nano Banana, the latest AI model for image generation and editing, released by Google. When directly applied to real-world CXRs having only been trained with synthetic data, the classifier achieved an AUROC of 0.923 (95% CI: 0.919 - 0.927), and an AUPR of 0.900 (95% CI: 0.894 - 0.907) in recognizing pneumonia in the 2018 RSNA Pneumonia Detection dataset (14,863 CXRs), and an AUROC of 0.824 (95% CI: 0.810 - 0.836), and an AUPR of 0.913 (95% CI: 0.904 - 0.922) in the Chest X-Ray dataset (5,856 CXRs). These external validation results on real-world data demonstrate the feasibility of this approach and suggest potential for synthetic data in medical AI development. Nonetheless, several limitations remain at present, including challenges in prompt design for controlling the diversity of synthetic CXR data and the requirement for post-processing to ensure alignment with real-world data. However, the growing sophistication and accessibility of medical intelligence will necessitate substantial validation, regulatory approval, and ethical oversight prior to clinical translation.
[CV-198] What about gravity in video generation? Post-Training Newton's Laws with Verifiable Rewards
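【Quick read】: This paper addresses the lack of physical plausibility in current video diffusion models: generated clips look realistic but often violate basic physical laws (objects float, accelerations drift, collisions behave inconsistently), leaving a clear gap between visual realism and physical realism. The key to the solution is NewtonRewards, described as the first physics-grounded post-training framework for video generation based on verifiable rewards. It uses frozen utility models to extract measurable proxies from generated videos: optical flow as a proxy for velocity and high-level appearance features as a proxy for mass. Two complementary rewards follow: a Newtonian kinematic constraint that enforces constant-acceleration dynamics, and a mass conservation reward that prevents degenerate solutions. The method needs no human or vision-language model (VLM) feedback and improves physical plausibility, motion smoothness, and temporal coherence.
Link: https://arxiv.org/abs/2512.00425
Authors: Minh-Quan Le,Yuanzhi Zhu,Vicky Kalogeiton,Dimitris Samaras
Institutions: Stony Brook University; LIX, École Polytechnique, CNRS, IPP
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL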
Abstract:Recent video diffusion models can synthesize visually compelling clips, yet often violate basic physical laws (objects float, accelerations drift, and collisions behave inconsistently), revealing a persistent gap between visual realism and physical realism. We propose NewtonRewards, the first physics-grounded post-training framework for video generation based on verifiable rewards. Instead of relying on human or VLM feedback, NewtonRewards extracts measurable proxies from generated videos using frozen utility models: optical flow serves as a proxy for velocity, while high-level appearance features serve as a proxy for mass. These proxies enable explicit enforcement of Newtonian structure through two complementary rewards: a Newtonian kinematic constraint enforcing constant-acceleration dynamics, and a mass conservation reward preventing trivial, degenerate solutions. We evaluate NewtonRewards on five Newtonian Motion Primitives (free fall, horizontal/parabolic throw, and ramp sliding down/up) using our newly constructed large-scale benchmark, NewtonBench-60K. Across all primitives in visual and physics metrics, NewtonRewards consistently improves physical plausibility, motion smoothness, and temporal coherence over prior post-training methods. It further maintains strong performance under out-of-distribution shifts in height, speed, and friction. Our results show that physics-grounded verifiable rewards offer a scalable path toward physics-aware video generation.
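As a rough illustration of the kinematic constraint described in this abstract: under constant acceleration, velocity changes linearly over time, so its second finite difference is zero. The sketch below scores a 1D velocity track by that residual. The function names, the reward form (negative mean squared residual), and the sample values are assumptions made for illustration and are not taken from the paper, where the velocity proxy would come from optical flow.

```python
def constant_acceleration_residual(velocities):
    """Mean squared second difference of a 1D velocity track.

    The value is zero when velocity changes linearly in time,
    i.e. when acceleration is constant.
    """
    if len(velocities) < 3:
        raise ValueError("need at least 3 velocity samples")
    second_diffs = [
        velocities[i + 1] - 2 * velocities[i] + velocities[i - 1]
        for i in range(1, len(velocities) - 1)
    ]
    return sum(d * d for d in second_diffs) / len(second_diffs)


def kinematic_reward(velocities):
    """Hypothetical reward: values closer to 0 are more Newtonian."""
    return -constant_acceleration_residual(velocities)


# Free fall sampled every 0.1 s: v = g * t, exactly linear in time.
g, dt = 9.8, 0.1
free_fall = [g * dt * k for k in range(6)]
# A "floating" object whose velocity stalls midway violates the constraint.
floating = [0.0, 0.98, 1.96, 1.96, 1.96, 2.94]
```

With these toy tracks, the free-fall sequence receives a reward of essentially zero, while the stalled sequence is penalized.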
[CV-199] Recovering Origin Destination Flows from Bus CCTV: Early Results from Nairobi and Kigali
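【Quick read】: This paper addresses the failure of existing automated passenger-counting systems under the overcrowded conditions common in sub-Saharan Africa (SSA) public transport, which makes reliable bus origin-destination (OD) flow data hard to obtain. The key to the solution is an end-to-end pipeline on onboard CCTV that combines YOLOv12 detection, BotSORT multi-object tracking, OSNet embeddings, OCR-based timestamping, and telematics-based stop classification to track boarding and alighting and reconstruct OD matrices. Under low-density, well-lit conditions it counts accurately (recall ≈ 95%, precision ≈ 91%, F1 ≈ 93%). Under realistic stressors such as overcrowding, color-to-monochrome shifts, and non-standard door use, performance degrades sharply, which points to the need for more robust, deployment-focused re-identification (Re-ID) methods for SSA transit.
Link: https://arxiv.org/abs/2512.00424
Authors: Nthenya Kyatha,Jay Taneja
Institutions: University of Massachusetts Amherst
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: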
Abstract:Public transport in sub-Saharan Africa (SSA) often operates in overcrowded conditions where existing automated systems fail to capture reliable passenger flow data. Leveraging onboard CCTV already deployed for security, we present a baseline pipeline that combines YOLOv12 detection, BotSORT tracking, OSNet embeddings, OCR-based timestamping, and telematics-based stop classification to recover bus origin–destination (OD) flows. On annotated CCTV segments from Nairobi and Kigali buses, the system attains high counting accuracy under low-density, well-lit conditions (recall ≈ 95%, precision ≈ 91%, F1 ≈ 93%). It produces OD matrices that closely match manual tallies. Under realistic stressors such as overcrowding, color-to-monochrome shifts, posture variation, and non-standard door use, performance degrades sharply (e.g., ~40% undercount in peak-hour boarding and a ~17 percentage-point drop in recall for monochrome segments), revealing deployment-specific failure modes and motivating more robust, deployment-focused Re-ID methods for SSA transit.
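Two small pieces of arithmetic sit behind this abstract: the F1 score as the harmonic mean of precision and recall, and the OD matrix as a count of (boarding stop, alighting stop) pairs. The following minimal sketch shows both. The helper names and the toy trips are hypothetical and do not come from the paper's pipeline.

```python
from collections import Counter


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def od_matrix(trips):
    """Count trips per (origin stop, destination stop) pair.

    `trips` is an iterable of (board_stop, alight_stop) tuples, for
    example one per passenger track matched between boarding and
    alighting.
    """
    return Counter(trips)


# Hypothetical matched tracks; the stop names are placeholders.
trips = [("A", "C"), ("A", "B"), ("B", "C"), ("A", "C")]
flows = od_matrix(trips)

# Consistency check on the reported figures: precision 91%, recall 95%.
reported_f1 = f1_score(0.91, 0.95)
```

The reported precision and recall are consistent with the reported F1 of about 93%.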
[CV-200] PhysGen: Physically Grounded 3D Shape Generation for Industrial Design
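【Quick read】: This paper addresses the inability of existing generative 3D shape models to combine physical plausibility with visual realism in industrial design. Because they do not model physical properties such as aerodynamic efficiency, generated shapes can look plausible yet fail engineering constraints. The key to the solution is a unified physics-based 3D shape generation pipeline with three components: (1) a flow matching model with explicit physical guidance that alternates a velocity-based update with a physics-based refinement, progressively adjusting the latent code toward the desired shape and physical properties; (2) a physics-aware regularization term that strengthens physical validity; and (3) a shape-and-physics variational autoencoder (SP-VAE) that jointly encodes shape and physics information into a unified latent space.
Link: https://arxiv.org/abs/2512.00422
Authors: Yingxuan You,Chen Zhao,Hantao Zhang,Mingda Xu,Pascal Fua
Institutions: CVLab, EPFL (École Polytechnique Fédérale de Lausanne)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 10 figures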
Abstract:Existing generative models for 3D shapes can synthesize high-fidelity and visually plausible shapes. For certain classes of shapes that have undergone an engineering design process, the realism of the shape is tightly coupled with the underlying physical properties, e.g., aerodynamic efficiency for automobiles. Since existing methods lack knowledge of such physics, they are unable to use this knowledge to enhance the realism of shape generation. Motivated by this, we propose a unified physics-based 3D shape generation pipeline, with a focus on industrial design applications. Specifically, we introduce a new flow matching model with explicit physical guidance, consisting of an alternating update process. We iteratively perform a velocity-based update and a physics-based refinement, progressively adjusting the latent code to align with the desired 3D shapes and physical properties. We further strengthen physical validity by incorporating a physics-aware regularization term into the velocity-based update step. To support such physics-guided updates, we build a shape-and-physics variational autoencoder (SP-VAE) that jointly encodes shape and physics information into a unified latent space. The experiments on three benchmarks show that this synergistic formulation improves shape realism beyond mere visual plausibility.
[CV-201] SplatFont3D: Structure-Aware Text-to-3D Artistic Font Generation with Part-Level Style Control
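【Quick read】: This paper addresses 3D artistic font generation (3D-AFG), where personalization is underexplored, structural constraints are hard to satisfy, and fine-grained part-level style control is lacking. Most prior work targets 2D artistic fonts, whereas 3D fonts are useful in immersive environments, carry precise semantics with strong structural constraints, and need part-level style control. The key to the solution is the SplatFont3D framework, with three elements: (1) a Glyph2Cloud module that progressively enhances the shapes and styles of 2D glyphs and produces the corresponding 3D point clouds for Gaussian initialization; (2) optimization of the initialized 3D Gaussians against a pretrained 2D diffusion model using score distillation sampling; and (3) a dynamic component assignment strategy that uses the geometric priors of 3D Gaussians to partition components, enabling part-level style control while alleviating drift-induced entanglement during optimization. The abstract reports gains in style-text consistency, visual quality, and rendering efficiency.
Link: https://arxiv.org/abs/2512.00413
Authors: Ji Gan,Lingxu Chen,Jiaxu Leng,Xinbo Gao
Institutions: Chongqing University of Posts and Telecommunications; Xidian University
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: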
Abstract:Artistic font generation (AFG) can assist human designers in creating innovative artistic fonts. However, most previous studies primarily focus on 2D artistic fonts in flat design, leaving personalized 3D-AFG largely underexplored. 3D-AFG not only enables applications in immersive 3D environments such as video games and animations, but also may enhance 2D-AFG by rendering 2D fonts of novel views. Moreover, unlike general 3D objects, 3D fonts exhibit precise semantics with strong structural constraints and also demand fine-grained part-level style control. To address these challenges, we propose SplatFont3D, a novel structure-aware text-to-3D AFG framework with 3D Gaussian splatting, which enables the creation of 3D artistic fonts from diverse style text prompts with precise part-level style control. Specifically, we first introduce a Glyph2Cloud module, which progressively enhances both the shapes and styles of 2D glyphs (or components) and produces their corresponding 3D point clouds for Gaussian initialization. The initialized 3D Gaussians are further optimized through interaction with a pretrained 2D diffusion model using score distillation sampling. To enable part-level control, we present a dynamic component assignment strategy that exploits the geometric priors of 3D Gaussians to partition components, while alleviating drift-induced entanglement during 3D Gaussian optimization. Our SplatFont3D provides more explicit and effective part-level style control than NeRF, attaining faster rendering efficiency. Experiments show that our SplatFont3D outperforms existing 3D models for 3D-AFG in style-text consistency, visual quality, and rendering efficiency.
[CV-202] Low-Bitrate Video Compression through Semantic-Conditioned Diffusion
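【Quick read】: This paper addresses the severe artifacts that traditional video codecs, optimized for pixel fidelity, produce at ultra-low bitrates, where pixel accuracy and human perception diverge. The key to the solution is DiSCo, a semantic video compression framework that decomposes the source video into three compact modalities: a textual description (semantic cues), a spatiotemporally degraded video (appearance cues), and optional sketches or poses (motion cues). A conditional video diffusion model then reconstructs a high-quality, temporally coherent video from these modalities, relying on generative priors to synthesize detail and improving perceptual quality at low bitrates.
Link: https://arxiv.org/abs/2512.00408
Authors: Lingdong Wang,Guan-Ming Su,Divya Kothandaraman,Tsung-Wei Huang,Mohammad Hajiesmaili,Ramesh K. Sitaraman
Institutions: University of Massachusetts Amherst; Dolby Laboratories
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: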
Abstract:Traditional video codecs optimized for pixel fidelity collapse at ultra-low bitrates and produce severe artifacts. This failure arises from a fundamental misalignment between pixel accuracy and human perception. We propose a semantic video compression framework named DiSCo that transmits only the most meaningful information while relying on generative priors for detail synthesis. The source video is decomposed into three compact modalities: a textual description, a spatiotemporally degraded video, and optional sketches or poses that respectively capture semantic, appearance, and motion cues. A conditional video diffusion model then reconstructs high-quality, temporally coherent videos from these compact representations. Temporal forward filling, token interleaving, and modality-specific codecs are proposed to improve multimodal generation and modality compactness. Experiments show that our method outperforms baseline semantic and traditional codecs by 2-10X on perceptual metrics at low bitrates.
[CV-203] SelfAI: Building a Self-Training AI System with LLM Agents
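【Quick read】: This paper addresses three weaknesses of current autonomous scientific discovery frameworks: narrow application domains, limited real-time interaction with researchers, and no principled mechanism for deciding when to stop exploring, which together cause inefficiency, poor reproducibility, and under-used human expertise. The key to the solution is SelfAI, a general multi-agent platform with three components: a User Agent that translates high-level research objectives into standardized experimental configurations; a Cognitive Agent, powered by large language models (LLMs) with optimal stopping criteria, that iteratively refines hyperparameter searches; and an Experiment Manager that orchestrates parallel, fault-tolerant training across heterogeneous hardware and maintains a structured knowledge base for continuous feedback. The authors also introduce two metrics, Score and AUP_D, to quantify discovery efficiency and search diversity, and report that SelfAI achieves strong performance and reduces redundant trials compared with classical Bayesian optimization and LLM-based baselines.
Link: https://arxiv.org/abs/2512.00403
Authors: Xiao Wu,Ting-Zhu Huang,Liang-Jian Deng,Xiaobing Yu,Yu Zhong,Shangqi Deng,Ufaq Khan,Jianghao Wu,Xiaofeng Liu,Imran Razzak,Xiaojun Chang,Yutong Xie
Institutions: University of Electronic Science and Technology of China; Mohamed bin Zayed University of Artificial Intelligence; Xi'an Jiaotong University; Yale University; Washington University in St. Louis; Monash University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: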
Abstract:Recent work on autonomous scientific discovery has leveraged LLM-based agents to integrate problem specification, experiment planning, and execution into end-to-end systems. However, these frameworks are often confined to narrow application domains, offer limited real-time interaction with researchers, and lack principled mechanisms for determining when to halt exploration, resulting in inefficiencies, reproducibility challenges, and under-utilized human expertise. To address these gaps, we propose SelfAI, a general multi-agent platform that combines a User Agent for translating high-level research objectives into standardized experimental configurations, a Cognitive Agent powered by LLMs with optimal stopping criteria to iteratively refine hyperparameter searches, and an Experiment Manager responsible for orchestrating parallel, fault-tolerant training workflows across heterogeneous hardware while maintaining a structured knowledge base for continuous feedback. We further introduce two novel evaluation metrics, Score and AUP_D, to quantify discovery efficiency and search diversity. Across regression, NLP, computer vision, scientific computing, medical imaging, and drug discovery benchmarks, SelfAI consistently achieves strong performance and reduces redundant trials compared to classical Bayesian optimization and LLM-based baselines, while enabling seamless interaction with human researchers.
[CV-204] Time-Series at the Edge: Tiny Separable CNNs for Wearable Gait Detection and Optimal Sensor Placement
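【Quick read】: This paper addresses on-device time-series analysis for gait detection in Parkinson's disease (PD) on resource-constrained wearables, aiming for efficient, accurate real-time recognition at the edge. The key to the solution is ultra-light 1D convolutional neural networks (1D CNNs), in particular a model built from depthwise separable convolutions with residual connections. With only 533 parameters it matches or surpasses the literature baseline (PR-AUC = 94.5%, F1 = 91.2%) while fitting the memory and latency budgets of STM32-class microcontrollers (under 10 ms), which enables on-sensor gating of transmission and storage. Fixed magnitude thresholding, by comparison, yields many false positives and high inter-subject variance.
Link: https://arxiv.org/abs/2512.00396
Authors: Andrea Procopio,Marco Esposito,Sara Raggiunto,Andrey Gizdov,Alberto Belli,Paola Pierleoni
Institutions: Harvard University; University of Bologna
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: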
Abstract:We study on-device time-series analysis for gait detection in Parkinson’s disease (PD) from short windows of triaxial acceleration, targeting resource-constrained wearables and edge nodes. We compare magnitude thresholding to three 1D CNNs for time-series analysis: a literature baseline (separable convolutions) and two ultra-light models - one purely separable and one with residual connections. Using the BioStampRC21 dataset, 2 s windows at 30 Hz, and subject-independent leave-one-subject-out (LOSO) validation on 16 PwPD with chest-worn IMUs, our residual separable model (Model 2, 533 params) attains PR-AUC = 94.5%, F1 = 91.2%, MCC = 89.4%, matching or surpassing the baseline (5,552 params; PR-AUC = 93.7%, F1 = 90.5%, MCC = 88.5%) with approximately 10x fewer parameters. The smallest model (Model 1, 305 params) reaches PR-AUC = 94.0%, F1 = 91.0%, MCC = 89.1%. Thresholding obtains high recall (89.0%) but low precision (76.5%), yielding many false positives and high inter-subject variance. Sensor-position analysis (train-on-all) shows chest and thighs are most reliable; forearms degrade precision/recall due to non-gait arm motion; naive fusion of all sites does not outperform the best single site. Both compact CNNs execute within tight memory/latency budgets on STM32-class MCUs (sub-10 ms on low-power boards), enabling on-sensor gating of transmission/storage. Overall, ultra-light separable CNNs provide a superior accuracy-efficiency-generalization trade-off to fixed thresholds for wearable PD gait detection and underscore the value of tailored time-series models for edge deployment.
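To make the parameter savings in this abstract concrete, the sketch below counts weights for a standard 1D convolution versus a depthwise separable one, and checks the input window size and the reported parameter ratio. The layer sizes (16 to 32 channels, kernel size 5) are assumptions chosen for illustration; the abstract does not specify the actual architectures.

```python
def conv1d_params(c_in, c_out, kernel, bias=True):
    """Parameters of a standard 1D convolution layer."""
    return kernel * c_in * c_out + (c_out if bias else 0)


def separable_conv1d_params(c_in, c_out, kernel, bias=True):
    """Depthwise conv (one filter per input channel) plus pointwise 1x1 conv."""
    depthwise = kernel * c_in + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Input window from the abstract: 2 s at 30 Hz of triaxial acceleration.
window_samples = 2 * 30  # 60 time steps
channels = 3

# Hypothetical layer: 16 -> 32 channels, kernel size 5.
standard = conv1d_params(16, 32, 5)             # 5*16*32 + 32
separable = separable_conv1d_params(16, 32, 5)  # (5*16 + 16) + (16*32 + 32)

# Ratio of the reported parameter counts (baseline vs. Model 2).
reported_ratio = 5552 / 533
```

For this hypothetical layer the separable form needs about a quarter of the weights, and the reported counts give a ratio a little above 10, in line with the "approximately 10x" in the abstract.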
[CV-205] Better Stronger Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction
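【Quick read】: This paper addresses a trilemma in image segmentation with Multimodal Large Language Models (MLLMs): preserving dialogue ability, achieving high segmentation performance, and keeping inference fast are hard to obtain together. Existing methods either add a pixel-level prediction objective that harms general dialogue ability, or recast segmentation as autoregressive generation, which gives poor segmentation with sparse outputs or slow inference with rich ones. The key to the solution is all-mask prediction, a paradigm that decouples autoregressive dialogue generation from non-autoregressive mask prediction. After generating the textual response, the model predicts the entire segmentation mask in a single forward pass, treating it as a parallel "fill-in-the-blank" task over image patches. The STAMP model implements this paradigm and is reported to significantly outperform state-of-the-art methods on multiple segmentation benchmarks.
Link: https://arxiv.org/abs/2512.00395
Authors: Jiazhen Liu,Mingkuan Feng,Long Chen
Institutions: The Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: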
Abstract:Integrating segmentation into Multimodal Large Language Models (MLLMs) presents a core trilemma: simultaneously preserving dialogue ability, achieving high segmentation performance, and ensuring fast inference. Prevailing paradigms are forced into a compromise. Embedding prediction methods introduce a conflicting pixel-level objective that degrades the MLLM’s general dialogue abilities. The alternative, next-token prediction, reframes segmentation as an autoregressive task, which preserves dialogue but forces a trade-off between poor segmentation performance with sparse outputs or prohibitive inference speeds with rich ones. We resolve this trilemma with all-mask prediction, a novel paradigm that decouples autoregressive dialogue generation from non-autoregressive mask prediction. We present STAMP: Simultaneous Textual All-Mask Prediction, an MLLM that embodies this paradigm. After generating a textual response, STAMP predicts an entire segmentation mask in a single forward pass by treating it as a parallel “fill-in-the-blank” task over image patches. This design maintains the MLLM’s dialogue ability by avoiding conflicting objectives, enables high segmentation performance by leveraging rich, bidirectional spatial context for all mask tokens, and achieves exceptional speed. Extensive experiments show that STAMP significantly outperforms state-of-the-art methods across multiple segmentation benchmarks, providing a solution that excels in dialogue, segmentation, and speed without compromise.
[CV-206] WiseEdit: Benchmarking Cognition- and Creativity-Informed Image Editing
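【Quick read】: This paper addresses the narrow scope of current image editing benchmarks, which cannot holistically assess cognition- and creativity-informed editing. The key to the solution is WiseEdit, a knowledge-intensive benchmark that decomposes image editing into three cascaded cognitive steps (Awareness, Interpretation, and Imagination) and incorporates three fundamental types of knowledge (Declarative, Procedural, and Metacognitive), so that models can be evaluated systematically on knowledge-based reasoning and creative composition in complex tasks.
Link: https://arxiv.org/abs/2512.00387
Authors: Kaihang Pan,Weile Chen,Haiyi Qiu,Qifan Yu,Wendong Bu,Zehan Wang,Yun Zhu,Juncheng Li,Siliang Tang
Institutions: Zhejiang University; Shanghai Artificial Intelligence Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 32 pages, 20 figures. Project Page: this https URL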
Abstract:Recent image editing models boast next-level intelligent capabilities, facilitating cognition- and creativity-informed image editing. Yet, existing benchmarks provide too narrow a scope for evaluation, failing to holistically assess these advanced abilities. To address this, we introduce WiseEdit, a knowledge-intensive benchmark for comprehensive evaluation of cognition- and creativity-informed image editing, featuring deep task depth and broad knowledge breadth. Drawing an analogy to human cognitive creation, WiseEdit decomposes image editing into three cascaded steps, i.e., Awareness, Interpretation, and Imagination, each corresponding to a task that poses a challenge for models to complete at the specific step. It also encompasses complex tasks, where none of the three steps can be finished easily. Furthermore, WiseEdit incorporates three fundamental types of knowledge: Declarative, Procedural, and Metacognitive knowledge. Ultimately, WiseEdit comprises 1,220 test cases, objectively revealing the limitations of SoTA image editing models in knowledge-based cognitive reasoning and creative composition capabilities. The benchmark, evaluation code, and the generated images of each model will be made publicly available soon. Project Page: this https URL.
[CV-207] EZ-SP: Fast and Lightweight Superpoint-Based 3D Segmentation
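【Quick read】: This paper addresses the performance bottleneck that the CPU-bound partition step creates in superpoint-based 3D semantic segmentation pipelines. The key to the solution is EZ-SP, a learnable superpoint partitioning algorithm that runs entirely on the GPU. Its module has under 60k parameters, trains in under 20 minutes with a differentiable surrogate loss, and needs no handcrafted features. It produces geometrically and semantically coherent superpoints 13× faster than prior methods; the full pipeline fits in 2 MB of VRAM, scales to multi-million-point scenes, and supports real-time inference, while matching the accuracy of point-based state-of-the-art models across three domains: indoor scans (S3DIS), autonomous driving (KITTI-360), and aerial LiDAR (DALES).
Link: https://arxiv.org/abs/2512.00385
Authors: Louis Geist,Loic Landrieu,Damien Robert
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: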
Abstract:Superpoint-based pipelines provide an efficient alternative to point- or voxel-based 3D semantic segmentation, but are often bottlenecked by their CPU-bound partition step. We propose a learnable, fully GPU partitioning algorithm that generates geometrically and semantically coherent superpoints 13× faster than prior methods. Our module is compact (under 60k parameters), trains in under 20 minutes with a differentiable surrogate loss, and requires no handcrafted features. Combined with a lightweight superpoint classifier, the full pipeline fits in 2 MB of VRAM, scales to multi-million-point scenes, and supports real-time inference. With 72× faster inference and 120× fewer parameters, EZ-SP matches the accuracy of point-based SOTA models across three domains: indoor scans (S3DIS), autonomous driving (KITTI-360), and aerial LiDAR (DALES). Code and pretrained models are accessible at this http URL.
[CV-208] Pore-scale Image Patch Dataset and A Comparative Evaluation of Pore-scale Facial Features
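【Quick read】: This paper addresses the difficulty of local descriptor matching in weakly textured facial skin regions, which matters for applications such as facial motion analysis and 3D face reconstruction. The key to the solution is PorePatch, a high-quality pore-scale image patch dataset, built with a Data-Model Co-Evolution (DMCE) framework that progressively refines annotations from high-resolution facial images. A state-of-the-art (SOTA) deep learning descriptor trained on this dataset reaches an FPR95 of 1.91% on the matching task, far below the 22.41% of the traditional PSIFT method. The advantage is not evident in the 3D reconstruction task, which indicates that deep learning descriptors still have limitations in weak-texture facial regions.
Link: https://arxiv.org/abs/2512.00381
Authors: Dong Li,HuaLiang Lin,JiaYu Li
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: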
Abstract:The weak-texture nature of facial skin regions presents significant challenges for local descriptor matching in applications such as facial motion analysis and 3D face reconstruction. Although deep learning-based descriptors have demonstrated superior performance to traditional hand-crafted descriptors in many applications, the scarcity of pore-scale image patch datasets has hindered their further development in the facial domain. In this paper, we propose the PorePatch dataset, a high-quality pore-scale image patch dataset, and establish a rational evaluation benchmark. We introduce a Data-Model Co-Evolution (DMCE) framework to generate a progressively refined, high-quality dataset from high-resolution facial images. We then train existing SOTA models on our dataset and conduct extensive experiments. Our results show that the SOTA model achieves a FPR95 value of 1.91% on the matching task, outperforming PSIFT (22.41%) by a margin of 20.5%. However, its advantage is diminished in the 3D reconstruction task, where its overall performance is not significantly better than that of traditional descriptors. This indicates that deep learning descriptors still have limitations in addressing the challenges of facial weak-texture regions, and much work remains to be done in this field.
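For readers unfamiliar with the FPR95 metric quoted in this abstract: it is the false positive rate at the distance threshold that accepts 95% of the true matching pairs (lower is better). The following is a minimal sketch of that definition on made-up descriptor distances; the function name and the toy data are assumptions and are unrelated to the PorePatch evaluation code.

```python
def fpr_at_95_recall(pos_distances, neg_distances):
    """False positive rate at the threshold that accepts 95% of true matches.

    A smaller descriptor distance means "more likely a match".
    """
    pos_sorted = sorted(pos_distances)
    n = len(pos_sorted)
    # Smallest count of positives covering at least 95% (integer ceiling).
    k = (95 * n + 99) // 100
    threshold = pos_sorted[k - 1]
    false_pos = sum(1 for d in neg_distances if d <= threshold)
    return false_pos / len(neg_distances)


# Toy distances (made up): 20 matching pairs, 10 non-matching pairs.
pos = [0.05 * i for i in range(1, 21)]  # 0.05 ... 1.00
neg = [0.9, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9]
rate = fpr_at_95_recall(pos, neg)

# The reported margin is in percentage points: 22.41% - 1.91%.
gap_points = round(22.41 - 1.91, 2)
```

On the toy data only one of the ten non-matching pairs falls under the threshold. The 20.5% margin in the abstract is the percentage-point difference between the two FPR95 values, not a relative improvement.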
[CV-209] POLARIS: Projection-Orthogonal Least Squares for Robust and Adaptive Inversion in Diffusion Models
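【Quick read】: This paper addresses the loss of reconstruction quality in diffusion-model image inversion caused by accumulated approximate noise error. The error arises because the noise at step t is approximated by the prediction at step t-1, so it propagates through the inversion process and degrades the final result. The key to the solution is Projection-Orthogonal Least Squares for Robust and Adaptive Inversion (POLARIS), which reformulates inversion from an error-compensation problem into an error-origin problem: instead of optimizing embeddings or latent codes to offset drift, it treats the guidance scale ω as a step-wise variable and derives a mathematically grounded formula that minimizes inversion error at each step. The method needs only one line of code, adds negligible overhead, mitigates noise approximation error, and improves downstream task accuracy.
Link: https://arxiv.org/abs/2512.00369
Authors: Wenshuo Chen,Haosen Li,Shaofeng Liang,Lei Wang,Haozhe Jia,Kaishen Yuan,Jieming Wu,Bowen Tian,Yutao Yue
Institutions: The Hong Kong University of Science and Technology (Guangzhou); Griffith University; Data61/CSIRO
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: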
Abstract:The Inversion-Denoising Paradigm, which is based on diffusion models, excels in diverse image editing and restoration tasks. We revisit its mechanism and reveal a critical, overlooked factor in reconstruction degradation: the approximate noise error. This error stems from approximating the noise at step t with the prediction at step t-1, resulting in severe error accumulation throughout the inversion process. We introduce Projection-Orthogonal Least Squares for Robust and Adaptive Inversion (POLARIS), which reformulates inversion from an error-compensation problem into an error-origin problem. Rather than optimizing embeddings or latent codes to offset accumulated drift, POLARIS treats the guidance scale ω as a step-wise variable and derives a mathematically grounded formula to minimize inversion error at each step. Remarkably, POLARIS improves inversion latent quality with just one line of code. With negligible performance overhead, it substantially mitigates noise approximation errors and consistently improves the accuracy of downstream tasks.
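The abstract does not state the formula, so the following is only one plausible reading of "projection" and "least squares", offered as an assumption rather than the paper's derivation. If the guided noise prediction takes the usual classifier-free-guidance form, the scale ω that best matches a reference noise in the least-squares sense is an orthogonal projection coefficient. The function names and toy vectors are hypothetical.

```python
def least_squares_scale(eps_uncond, eps_cond, eps_ref):
    """omega minimizing ||eps_uncond + omega*(eps_cond - eps_uncond) - eps_ref||^2.

    Closed form: project (eps_ref - eps_uncond) onto (eps_cond - eps_uncond).
    This is an illustrative assumption, not the formula from the paper.
    """
    direction = [c - u for c, u in zip(eps_cond, eps_uncond)]
    target = [r - u for r, u in zip(eps_ref, eps_uncond)]
    denom = sum(d * d for d in direction)
    if denom == 0.0:
        return 0.0  # no guidance direction; every omega gives the same result
    return sum(t * d for t, d in zip(target, direction)) / denom


def guided(eps_uncond, eps_cond, omega):
    """Classifier-free-guidance style combination of two noise predictions."""
    return [u + omega * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy flattened noise vectors (made up).
eps_u = [0.0, 1.0, 2.0]
eps_c = [1.0, 1.0, 4.0]
eps_r = [2.0, 1.5, 6.0]
omega = least_squares_scale(eps_u, eps_c, eps_r)
residual = [g - r for g, r in zip(guided(eps_u, eps_c, omega), eps_r)]
```

At the minimizer the residual is orthogonal to the guidance direction, which is the defining property of a least-squares projection; computing one such scalar per step would be consistent with the "one line of code" remark, though the paper's actual expression may differ.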
[CV-210] THCRL: Trusted Hierarchical Contrastive Representation Learning for Multi-View Clustering
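【Quick read】: This paper addresses performance degradation in Multi-View Clustering (MVC) caused by untrustworthy fusion. The problem has two sources: existing methods ignore the inherent noise in individual views, and traditional MVC methods based on Contrastive Learning (CL) compute similarity only between different views of the same instance, neglecting structural information from nearest neighbors within the same cluster, which misdirects fusion. The key to the solution is Trusted Hierarchical Contrastive Representation Learning (THCRL), which has two modules: (1) Deep Symmetry Hierarchical Fusion (DSHF), which uses a UNet architecture with multiple denoising mechanisms for trustworthy fusion of multi-view data; and (2) Average K-Nearest Neighbors Contrastive Learning (AKCL), which increases representation similarity among samples in the same cluster rather than only the same sample across views, strengthening confidence in the fused representation.
Link: https://arxiv.org/abs/2512.00368
Authors: Jian Zhu
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: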
Abstract:Multi-View Clustering (MVC) has garnered increasing attention in recent years. It is capable of partitioning data samples into distinct groups by learning a consensus representation. However, a significant challenge remains: the problem of untrustworthy fusion. This problem primarily arises from two key factors: 1) Existing methods often ignore the presence of inherent noise within individual views; 2) In traditional MVC methods using Contrastive Learning (CL), similarity computations typically rely on different views of the same instance, while neglecting the structural information from nearest neighbors within the same cluster. Consequently, this leads to the wrong direction for multi-view fusion. To address this problem, we present a novel Trusted Hierarchical Contrastive Representation Learning (THCRL). It consists of two key modules. Specifically, we propose the Deep Symmetry Hierarchical Fusion (DSHF) module, which leverages the UNet architecture integrated with multiple denoising mechanisms to achieve trustworthy fusion of multi-view data. Furthermore, we present the Average K-Nearest Neighbors Contrastive Learning (AKCL) module to align the fused representation with the view-specific representation. Unlike conventional strategies, AKCL enhances representation similarity among samples belonging to the same cluster, rather than merely focusing on the same sample across views, thereby reinforcing the confidence of the fused representation. Extensive experiments demonstrate that THCRL achieves the state-of-the-art performance in deep MVC tasks.
[CV-211] Towards aligned body representations in vision models
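【Quick read】: This paper starts from the observation that human physical reasoning relies on internal "body" representations, coarse volumetric approximations that capture an object's extent and support intuitive predictions about motion and physics, yet the internal structure of these representations remains largely unknown. The approach is to adapt a psychophysical experiment conducted with 50 human participants into a semantic segmentation task and test seven segmentation networks of varying size. Smaller models naturally form human-like coarse body representations, whereas larger models tend toward overly detailed, fine-grained encodings. This suggests that coarse representations can emerge under limited computational resources and that machine models offer a scalable path toward understanding the structure of physical reasoning in the brain.
Link: https://arxiv.org/abs/2512.00365
Authors: Andrey Gizdov,Andrea Procopio,Yichen Li,Daniel Harari,Tomer Ullman
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Andrea Procopio and Andrey Gizdov have equal contributions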
Abstract:Human physical reasoning relies on internal “body” representations - coarse, volumetric approximations that capture an object’s extent and support intuitive predictions about motion and physics. While psychophysical evidence suggests humans use such coarse representations, their internal structure remains largely unknown. Here we test whether vision models trained for segmentation develop comparable representations. We adapt a psychophysical experiment conducted with 50 human participants to a semantic segmentation task and test a family of seven segmentation networks, varying in size. We find that smaller models naturally form human-like coarse body representations, whereas larger models tend toward overly detailed, fine-grain encodings. Our results demonstrate that coarse representations can emerge under limited computational resources, and that machine representations can provide a scalable path toward understanding the structure of physical reasoning in the brain.
[CV-212] MM-DETR: An Efficient Multimodal Detection Transformer with Mamba-Driven Dual-Granularity Fusion and Frequency-Aware Modality Adapters
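[Quick read]: The paper targets multimodal remote sensing object detection, where existing methods struggle to balance performance against lightweight design. The specific issues are the complexity of attention-based or deformable-convolution fusion blocks, the insufficient modality-specific modeling of shared backbones, and the parameter redundancy of dual-stream architectures. The key to the solution is MM-DETR, a lightweight and efficient framework with three parts. First, a Mamba-based dual-granularity fusion encoder reformulates global interaction as channel-wise dynamic gating and uses a 1D selective scan for cross-modal modeling with linear complexity. Second, multimodal fusion is recast as a modality completion problem, and a region-aware 2D selective-scan completion branch provides fine-grained fusion along a bidirectional pyramid pathway at minimal overhead. Third, a lightweight frequency-aware modality adapter is inserted into the shared backbone; it captures modality-specific features through a spatial-frequency co-expert structure, with a pixel-wise router dynamically weighting the experts, which reduces parameter redundancy while retaining strong feature extraction.
Link: https://arxiv.org/abs/2512.00363
Authors: Jianhong Han,Yupei Wang,Yuan Zhang,Liang Chen
Affiliations: Beijing Institute of Technology; National Key Laboratory for Space-Born Intelligent Information Processing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Manuscript submitted to IEEE Transactions on Geoscience and Remote Sensing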
Abstract:Multimodal remote sensing object detection aims to achieve more accurate and robust perception under challenging conditions by fusing complementary information from different modalities. However, existing approaches that rely on attention-based or deformable convolution fusion blocks still struggle to balance performance and lightweight design. Beyond fusion complexity, extracting modality features with shared backbones yields suboptimal representations due to insufficient modality-specific modeling, whereas dual-stream architectures nearly double the parameter count, ultimately limiting practical deployment. To this end, we propose MM-DETR, a lightweight and efficient framework for multimodal object detection. Specifically, we propose a Mamba-based dual granularity fusion encoder that reformulates global interaction as channel-wise dynamic gating and leverages a 1D selective scan for efficient cross-modal modeling with linear complexity. Following this design, we further reinterpret multimodal fusion as a modality completion problem. A region-aware 2D selective scanning completion branch is introduced to recover modality-specific cues, supporting fine-grained fusion along a bidirectional pyramid pathway with minimal overhead. To further reduce parameter redundancy while retaining strong feature extraction capability, a lightweight frequency-aware modality adapter is inserted into the shared backbone. This adapter employs a spatial-frequency co-expert structure to capture modality-specific cues, while a pixel-wise router dynamically balances expert contributions for efficient spatial-frequency fusion. Extensive experiments conducted on four multimodal benchmark datasets demonstrate the effectiveness and generalization capability of the proposed method.
[CV-213] SMamDiff: Spatial Mamba for Stochastic Human Motion Prediction
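[Quick read]: The paper addresses the difficulty of guaranteeing spatial-temporal coherence in single-stage diffusion models for Human Motion Prediction (HMP), given that existing methods either output a single deterministic forecast that ignores uncertainty or adopt probabilistic modeling at the expense of kinematic plausibility. The key to the solution is the SMamDiff model, with two core designs: (i) a residual-DCT (discrete cosine transform) motion encoding that subtracts the last observed pose before the temporal DCT, weakening the dominance of the DC component (f=0) and highlighting higher-frequency motion cues, so the model learns the relative motion of joints rather than their absolute positions; (ii) a stickman-drawing Spatial Mamba module that processes joints in an ordered, joint-by-joint manner, so later joints are conditioned on earlier ones, which introduces long-range cross-joint dependencies and strengthens spatial-temporal coherence. On Human3.6M and HumanEva the method achieves state-of-the-art results among single-stage probabilistic HMP methods while markedly reducing latency and memory consumption.
Link: https://arxiv.org/abs/2512.00355
Authors: Junqiao Fan,Pengfei Liu,Haocong Rao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: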
Abstract:With intelligent room-side sensing and service robots widely deployed, human motion prediction (HMP) is essential for safe, proactive assistance. However, many existing HMP methods either produce a single, deterministic forecast that ignores uncertainty or rely on probabilistic models that sacrifice kinematic plausibility. Diffusion models improve the accuracy-diversity trade-off but often depend on multi-stage pipelines that are costly for edge deployment. This work focuses on how to ensure spatial-temporal coherence within a single-stage diffusion model for HMP. We introduce SMamDiff, a Spatial Mamba-based Diffusion model with two novel designs: (i) a residual-DCT motion encoding that subtracts the last observed pose before a temporal DCT, reducing the dominance of the DC component (f=0) and highlighting informative higher-frequency cues so the model learns how joints move rather than where they are; and (ii) a stickman-drawing spatial-mamba module that processes joints in an ordered, joint-by-joint manner, making later joints condition on earlier ones to induce long-range, cross-joint dependencies. On Human3.6M and HumanEva, these coherence mechanisms deliver state-of-the-art results among single-stage probabilistic HMP methods, with lower latency and memory use than multi-stage diffusion baselines.
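The residual-DCT idea in this abstract can be illustrated on a single joint coordinate. The following is a minimal sketch, not the authors' implementation: the helper names (`dct_ii`, `residual_dct`), the orthonormal DCT-II convention, and the toy trajectory are all assumptions made for illustration.

```python
import math


def dct_ii(signal):
    """Orthonormal DCT-II of a 1-D sequence (assumed convention)."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(signal))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs


def residual_dct(observed, trajectory):
    """Subtract the last observed value from every frame, then apply the DCT."""
    last = observed[-1]
    return dct_ii([x - last for x in trajectory])


# Hypothetical toy data: one joint coordinate (metres) hovering near 1.0.
observed = [1.00, 1.01, 1.02]
future = [1.03, 1.04, 1.05]
trajectory = observed + future

plain = dct_ii(trajectory)                  # DC term dominated by absolute position
resid = residual_dct(observed, trajectory)  # DC term now reflects only the drift

print("plain DC:", plain[0], "residual DC:", resid[0])
```

Subtracting a constant changes only the `k = 0` coefficient, so the higher-frequency coefficients are identical in both encodings; what shrinks is the DC term that would otherwise dwarf them.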
[CV-214] mmPred: Radar-based Human Motion Prediction in the Dark AAAI-2026
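[Quick read]: The paper addresses the limitations of RGB-D camera based Human Motion Prediction (HMP), namely sensitivity to lighting and privacy leakage, by proposing millimeter-wave radar as a new sensing modality for HMP to improve robustness and privacy protection. The key solution is mmPred, the first diffusion-based framework designed for radar signals. It uses a dual-domain historical motion representation: a Time-domain Pose Refinement (TPR) branch learns fine-grained motion details, and a Frequency-domain Dominant Motion (FDM) branch captures global motion trends and suppresses frame-level inconsistency. A Global Skeleton-relational Transformer (GST) serves as the diffusion backbone and models global cooperation between joints, so corrupted joints can dynamically aggregate information from other joints. This handles the noise and missing limbs caused by specular reflections and multipath effects in radar signals.
Link: https://arxiv.org/abs/2512.00345
Authors: Junqiao Fan,Haocong Rao,Jiarui Zhang,Jianfei Yang,Lihua Xie
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper is accepted by AAAI-2026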
Abstract:Existing Human Motion Prediction (HMP) methods based on RGB-D cameras are sensitive to lighting conditions and raise privacy concerns, limiting their real-world applications such as firefighting and healthcare. Motivated by the robustness and privacy-preserving nature of millimeter-wave (mmWave) radar, this work introduces radar as a novel sensing modality for HMP, for the first time. Nevertheless, radar signals often suffer from specular reflections and multipath effects, resulting in noisy and temporally inconsistent measurements, such as body-part miss-detection. To address these radar-specific artifacts, we propose mmPred, the first diffusion-based framework tailored for radar-based HMP. mmPred introduces a dual-domain historical motion representation to guide the generation process, combining a Time-domain Pose Refinement (TPR) branch for learning fine-grained details and a Frequency-domain Dominant Motion (FDM) branch for capturing global motion trends and suppressing frame-level inconsistency. Furthermore, we design a Global Skeleton-relational Transformer (GST) as the diffusion backbone to model global inter-joint cooperation, enabling corrupted joints to dynamically aggregate information from others. Extensive experiments show that mmPred achieves state-of-the-art performance, outperforming existing methods by 8.6% on mmBody and 22% on mm-Fi.
[CV-215] Assimilation Matters: Model-level Backdoor Detection in Vision-Language Pretrained Models
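[Quick read]: The paper addresses the problem that Vision-Language Pretrained Models (VLPs) fine-tuned by third parties may carry backdoor attacks that are hard to detect. Existing detection methods usually depend on prior knowledge such as the training dataset, the trigger, or the target class, which is often unavailable in practice. The authors therefore propose AMDET (Assimilation Matters in DETection), a model-level detection framework that needs no prior information. Its key is the discovery and use of a feature assimilation property in backdoored text encoders: within a backdoor sample, the representations of all tokens are highly similar, because attention weights concentrate on the trigger token. AMDET uses gradient-based inversion to recover implicit features that activate backdoor behavior, and uses loss landscape analysis to separate naturally occurring backdoor-like features from truly injected backdoors. It reaches an F1 score of 89.90% and is robust to adaptive attacks.
Link: https://arxiv.org/abs/2512.00343
Authors: Zhongqi Wang,Jie Zhang,Shiguang Shan,Xilin Chen
Affiliations: Key Laboratory of AI Safety of CAS, Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS); University of Chinese Academy of Sciences (UCAS)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: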
Abstract:Vision-language pretrained models (VLPs) such as CLIP have achieved remarkable success, but are also highly vulnerable to backdoor attacks. Given a model fine-tuned by an untrusted third party, determining whether the model has been injected with a backdoor is a critical and challenging problem. Existing detection methods usually rely on prior knowledge of the training dataset, backdoor triggers and targets, or downstream classifiers, which may be impractical for real-world applications. To address this challenge, we introduce Assimilation Matters in DETection (AMDET), a novel model-level detection framework that operates without any such prior knowledge. Specifically, we first reveal the feature assimilation property in backdoored text encoders: the representations of all tokens within a backdoor sample exhibit a high similarity. Further analysis attributes this effect to the concentration of attention weights on the trigger token. Leveraging this insight, AMDET scans a model by performing gradient-based inversion on token embeddings to recover implicit features capable of activating backdoor behaviors. Furthermore, we identify natural backdoor features in OpenAI's official CLIP model, which are not intentionally injected but still exhibit backdoor-like behaviors. We then filter them out from real injected backdoors by analyzing their loss landscapes. Extensive experiments on 3,600 backdoored and benign-finetuned models with two attack paradigms and three VLP model structures show that AMDET detects backdoors with an F1 score of 89.90%. Besides, it achieves one complete detection in approximately 5 minutes on an RTX 4090 GPU and exhibits strong robustness against adaptive attacks. Code is available at: this https URL
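The feature assimilation property described here (token representations within a backdoor sample becoming highly similar) can be quantified with a simple statistic. The sketch below measures mean pairwise cosine similarity on hand-made toy vectors. It is only an illustration of the property: the function names and vectors are hypothetical, and it does not reproduce AMDET's gradient-based inversion or its loss-landscape filtering.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(token_vectors):
    """Average cosine similarity over all unordered token pairs."""
    pairs = list(combinations(token_vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Hypothetical token representations (not taken from any real model).
clean_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
assimilated_tokens = [[1.0, 0.1, 0.0], [1.0, 0.0, 0.1], [0.9, 0.1, 0.1]]

clean_score = mean_pairwise_cosine(clean_tokens)
assimilated_score = mean_pairwise_cosine(assimilated_tokens)
print(clean_score, assimilated_score)
```

A high score on its own would not prove a backdoor; the abstract notes that the official CLIP model also contains natural features with backdoor-like behavior, which is why the authors add a separate filtering step.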
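[CV-216] MVAD: A Comprehensive Multimodal Video-Audio Dataset for AIGC Detection
[Quick read]: The paper addresses the lack of a comprehensive, realistic, and diverse dataset for detecting AI-generated multimodal video-audio content. Existing synthetic video datasets are mostly limited to the visual modality, and the few that include audio focus on facial deepfakes. They cannot cover the growing range of general multimodal AI-generated content, which seriously holds back trustworthy detection systems. The key to the solution is the Multimodal Video-Audio Dataset (MVAD), the first comprehensive dataset designed for detecting AI-generated multimodal video-audio content. Its core characteristics are: (1) genuine multimodality, with samples generated according to three realistic video-audio forgery patterns; (2) high perceptual quality, achieved with a range of state-of-the-art generative models; (3) broad diversity, covering realistic and anime visual styles, four content categories (humans, animals, objects, scenes), and four video-audio multimodal data types, which improves the generalization and practical value of detection models.
Link: https://arxiv.org/abs/2512.00336
Authors: Mengxue Hu,Yunfeng Diao,Changtao Miao,Jianshu Li,Zhe Li,Joey Tianyi Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 2 figures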
Abstract:The rapid advancement of AI-generated multimodal video-audio content has raised significant concerns regarding information security and content authenticity. Existing synthetic video datasets predominantly focus on the visual modality alone, while the few incorporating audio are largely confined to facial deepfakes–a limitation that fails to address the expanding landscape of general multimodal AI-generated content and substantially impedes the development of trustworthy detection systems. To bridge this critical gap, we introduce the Multimodal Video-Audio Dataset (MVAD), the first comprehensive dataset specifically designed for detecting AI-generated multimodal video-audio content. Our dataset exhibits three key characteristics: (1) genuine multimodality with samples generated according to three realistic video-audio forgery patterns; (2) high perceptual quality achieved through diverse state-of-the-art generative models; and (3) comprehensive diversity spanning realistic and anime visual styles, four content categories (humans, animals, objects, and scenes), and four video-audio multimodal data types. Our dataset will be available at this https URL.
[CV-217] Odometry Without Correspondence from Inertially Constrained Ruled Surfaces
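[Quick read]: The paper addresses the high computational cost and unstable accuracy of traditional visual odometry, which depends on point-to-point correspondences. The key to the solution is to use the ruled surface that a straight-line feature sweeps out in image space-time across an image sequence for 3D scene reconstruction and pose estimation. This needs only differential updates based on point-to-line associations, which greatly reduces the dependence on complex correspondence matching. Data from an onboard inertial measurement unit (IMU) is fused in to constrain the dimensionality of the solution space, improving estimation accuracy and robustness.
Link: https://arxiv.org/abs/2512.00327
Authors: Chenqi Zhu,Levi Burner,Yiannis Aloimonos
Affiliations: University of Maryland, College Park; University of Maryland Institute for Advanced Computer Studies
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 13 figures, 5 tables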
Abstract:Visual odometry techniques typically rely on feature extraction from a sequence of images and subsequent computation of optical flow. This point-to-point correspondence between two consecutive frames can be costly to compute and suffers from varying accuracy, which affects the odometry estimate’s quality. Attempts have been made to bypass the difficulties originating from the correspondence problem by adopting line features and fusing other sensors (event camera, IMU) to improve performance, many of which still heavily rely on correspondence. If the camera observes a straight line as it moves, the image of the line sweeps a smooth surface in image-space time. It is a ruled surface and analyzing its shape gives information about odometry. Further, its estimation requires only differentially computed updates from point-to-line associations. Inspired by event cameras’ propensity for edge detection, this research presents a novel algorithm to reconstruct 3D scenes and visual odometry from these ruled surfaces. By constraining the surfaces with the inertia measurements from an onboard IMU sensor, the dimensionality of the solution space is greatly reduced.
[CV-218] MILE: A Mechanically Isomorphic Exoskeleton Data Collection System with Fingertip Visuotactile Sensing for Dexterous Manipulation
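[Quick read]: The paper addresses the performance limits of imitation learning for dexterous hand manipulation caused by the lack of large-scale, high-fidelity data. Existing data-collection pipelines suffer from inaccurate motion retargeting, low collection efficiency, and missing high-resolution fingertip tactile sensing. The key to the solution is MILE, a mechanically isomorphic teleoperation and data-collection system designed from the human hand to the exoskeleton to the robotic hand. An anthropometrically derived exoskeleton and a robotic hand that preserves one-to-one joint-position isomorphism eliminate nonlinear retargeting and give precise, natural control. Compact fingertip visuotactile modules provide high-resolution tactile signals, so the system efficiently builds a multimodal dataset (high-resolution fingertip visuotactile signals, RGB-D images, and joint positions) and markedly raises manipulation success rates.
Link: https://arxiv.org/abs/2512.00324
Authors: Jinda Du,Jieji Ren,Qiaojun Yu,Ningbin Zhang,Yu Deng,Xingyu Wei,Yufei Liu,Guoying Gu,Xiangyang Zhu
Affiliations: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; Humanoid Robot (Shanghai) Co., Ltd.
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: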
Abstract:Imitation learning provides a promising approach to dexterous hand manipulation, but its effectiveness is limited by the lack of large-scale, high-fidelity data. Existing data-collection pipelines suffer from inaccurate motion retargeting, low data-collection efficiency, and missing high-resolution fingertip tactile sensing. We address this gap with MILE, a mechanically isomorphic teleoperation and data-collection system co-designed from human hand to exoskeleton to robotic hand. The exoskeleton is anthropometrically derived from the human hand, and the robotic hand preserves one-to-one joint-position isomorphism, eliminating nonlinear retargeting and enabling precise, natural control. The exoskeleton achieves a multi-joint mean absolute angular error below one degree, while the robotic hand integrates compact fingertip visuotactile modules that provide high-resolution tactile observations. Built on this retargeting-free interface, we teleoperate complex, contact-rich in-hand manipulation and efficiently collect a multimodal dataset comprising high-resolution fingertip visuotactile signals, RGB-D images, and joint positions. The teleoperation pipeline achieves a mean success rate improvement of 64%. Incorporating fingertip tactile observations further increases the success rate by an average of 25% over the vision-only baseline, validating the fidelity and utility of the dataset. Further details are available at: this https URL.
[CV-219] ART-ASyn: Anatomy-aware Realistic Texture-based Anomaly Synthesis Framework for Chest X-Rays WACV2026
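[Quick read]: The paper addresses two weaknesses of synthetic anomalies in unsupervised anomaly detection: they look visually different from real pathological patterns, and they ignore anatomical consistency. Anomalous regions produced by existing methods often lack realism and provide weak pixel-level supervision during training, which limits model performance and generalization. The key to the solution is ART-ASyn, an anatomy-aware, texture-based synthesis framework. It introduces a Progressive Binary Thresholding Segmentation method (PBTSeg) for accurate lung segmentation and, on that basis, uses texture-based augmentation to generate anatomically plausible and visually realistic lung-opacity anomalies. For each normal sample the method produces a paired synthetic anomaly image and a precise pixel-level anomaly mask, giving explicit segmentation supervision during training. ART-ASyn is also evaluated on zero-shot anomaly segmentation, showing potential to generalize across datasets without target-domain annotations.
Link: https://arxiv.org/abs/2512.00310
Authors: Qinyi Cao,Jianan Fan,Weidong Cai
Affiliations: The University of Sydney
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in WACV2026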
Abstract:Unsupervised anomaly detection aims to identify anomalies without pixel-level annotations. Synthetic anomaly-based methods exhibit a unique capacity to introduce controllable irregularities with known masks, enabling explicit supervision during training. However, existing methods often produce synthetic anomalies that are visually distinct from real pathological patterns and ignore anatomical structure. This paper presents a novel Anatomy-aware Realistic Texture-based Anomaly Synthesis framework (ART-ASyn) for chest X-rays that generates realistic and anatomically consistent lung-opacity-related anomalies using texture-based augmentation guided by our proposed Progressive Binary Thresholding Segmentation method (PBTSeg) for lung segmentation. The generated paired samples of synthetic anomalies and their corresponding precise pixel-level anomaly mask for each normal sample enable explicit segmentation supervision. In contrast to prior work limited to one-class classification, ART-ASyn is further evaluated for zero-shot anomaly segmentation, demonstrating generalizability on an unseen dataset without target-domain annotations. Code is available at this https URL.
[CV-220] Optimizing Distributional Geometry Alignment with Optimal Transport for Generative Dataset Distillation NEURIPS2025
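[Quick read]: The paper addresses poor generalization in large-scale dataset distillation that results from matching only global distributional statistics (such as mean and variance) while ignoring instance-level characteristics and intra-class variation. The key to the solution is to recast dataset distillation as an Optimal Transport (OT) distance minimization problem, which allows fine-grained alignment at both the global and instance levels. There are three core components: (1) OT-guided diffusion sampling, which aligns the latent distributions of real and distilled images; (2) label-image-aligned soft relabeling, which adapts label distributions to the complexity of the distilled image distribution; (3) OT-based logit matching, which aligns student model outputs with the soft-label distributions. The method preserves local modes, intra-class structure, and subtle variations of high-dimensional distributions, improving the quality of the distilled data and model performance, with at least a 4% accuracy gain over the previous best methods on ImageNet-1K at IPC=10.
Link: https://arxiv.org/abs/2512.00308
Authors: Xiao Cui,Yulei Qin,Wengang Zhou,Hongsheng Li,Houqiang Li
Affiliations: University of Science and Technology of China; CUHK MMLab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025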
Abstract:Dataset distillation seeks to synthesize a compact distilled dataset, enabling models trained on it to achieve performance comparable to models trained on the full dataset. Recent methods for large-scale datasets focus on matching global distributional statistics (e.g., mean and variance), but overlook critical instance-level characteristics and intraclass variations, leading to suboptimal generalization. We address this limitation by reformulating dataset distillation as an Optimal Transport (OT) distance minimization problem, enabling fine-grained alignment at both global and instance levels throughout the pipeline. OT offers a geometrically faithful framework for distribution matching. It effectively preserves local modes, intra-class patterns, and fine-grained variations that characterize the geometry of complex, high-dimensional distributions. Our method comprises three components tailored for preserving distributional geometry: (1) OT-guided diffusion sampling, which aligns latent distributions of real and distilled images; (2) label-image-aligned soft relabeling, which adapts label distributions based on the complexity of distilled image distributions; and (3) OT-based logit matching, which aligns the output of student models with soft-label distributions. Extensive experiments across diverse architectures and large-scale datasets demonstrate that our method consistently outperforms state-of-the-art approaches in an efficient manner, achieving at least 4% accuracy improvement under IPC=10 settings for each architecture on ImageNet-1K.
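For readers unfamiliar with OT distances, the sketch below computes an entropy-regularized transport plan between two small 1-D point sets using Sinkhorn iterations. This is the generic textbook algorithm, shown only to make the notion of an OT distance concrete. It is not the paper's pipeline, and the point sets, the regularization value `reg`, and the iteration count are assumptions.

```python
import math


def sinkhorn(a, b, cost, reg=0.1, iters=200):
    """Entropy-regularized OT plan between histograms a and b (Sinkhorn iterations)."""
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical 1-D "features": three real samples and two distilled samples.
real = [0.0, 1.0, 2.0]
distilled = [0.1, 1.9]
a = [1.0 / 3.0] * 3          # uniform mass on real samples
b = [0.5, 0.5]               # uniform mass on distilled samples
cost = [[(x - y) ** 2 for y in distilled] for x in real]

plan = sinkhorn(a, b, cost)
ot_cost = sum(plan[i][j] * cost[i][j] for i in range(3) for j in range(2))
print("transport cost:", ot_cost)
```

Unlike a mean-and-variance comparison, the plan records which real sample is matched to which distilled sample, which is the instance-level information the abstract argues is lost by global statistic matching.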
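[CV-221] TGSFormer: Scalable Temporal Gaussian Splatting for Embodied Semantic Scene Completion
[Quick read]: The paper addresses two problems in Embodied 3D Semantic Scene Completion (SSC). Existing Gaussian-based methods randomly initialize large numbers of primitives, which causes redundancy and makes it hard to scale to unbounded scenes. Depth-guided methods ease this but remain local and incur latency and memory overhead as scale grows. The key to the solution is TGSFormer, a scalable Temporal Gaussian Splatting framework. It maintains a persistent Gaussian memory for long-term temporal prediction without relying on image coherence or frame caches. A Dual Temporal Encoder fuses current and historical Gaussian features through confidence-aware cross-attention, and a Confidence-aware Voxel Fusion module merges overlapping primitives into voxel-aligned representations, reducing the number of primitives while keeping high accuracy and long-term scene integrity.
Link: https://arxiv.org/abs/2512.00300
Authors: Rui Qian,Haozhi Cao,Tianchen Deng,Tianxin Hu,Weixiang Guo,Shenghai Yuan,Lihua Xie
Affiliations: Nanyang Technological University; Shanghai Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 10 figures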
Abstract:Embodied 3D Semantic Scene Completion (SSC) infers dense geometry and semantics from continuous egocentric observations. Most existing Gaussian-based methods rely on random initialization of many primitives within predefined spatial bounds, resulting in redundancy and poor scalability to unbounded scenes. A recent depth-guided approach alleviates this issue but remains local, suffering from latency and memory overhead as scale increases. To overcome these challenges, we propose TGSFormer, a scalable Temporal Gaussian Splatting framework for embodied SSC. It maintains a persistent Gaussian memory for temporal prediction, without relying on image coherence or frame caches. For temporal fusion, a Dual Temporal Encoder jointly processes current and historical Gaussian features through confidence-aware cross-attention. Subsequently, a Confidence-aware Voxel Fusion module merges overlapping primitives into voxel-aligned representations, regulating density and maintaining compactness. Extensive experiments demonstrate that TGSFormer achieves state-of-the-art results on both local and embodied SSC benchmarks, offering superior accuracy and scalability with significantly fewer primitives while maintaining consistent long-term scene integrity. The code will be released upon acceptance.
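The idea of merging overlapping primitives into voxel-aligned representations can be sketched with a hand-written, non-learned stand-in: bucket primitive centres by voxel index and take a confidence-weighted mean per voxel. The paper's module is a learned component, so everything below (the function name `voxel_fuse`, the weighting rule, summing confidences, and the toy primitives) is an assumption for illustration only.

```python
import math
from collections import defaultdict


def voxel_fuse(primitives, voxel_size):
    """Merge primitives that share a voxel into one confidence-weighted centre.

    primitives: list of ((x, y, z), confidence) tuples.
    Returns {voxel_index: (fused_centre, total_confidence)}.
    """
    buckets = defaultdict(list)
    for position, confidence in primitives:
        key = tuple(math.floor(c / voxel_size) for c in position)
        buckets[key].append((position, confidence))

    fused = {}
    for key, members in buckets.items():
        total = sum(conf for _, conf in members)
        centre = tuple(
            sum(pos[d] * conf for pos, conf in members) / total for d in range(3)
        )
        fused[key] = (centre, total)
    return fused


# Hypothetical primitives: two overlap in one voxel, one sits in another voxel.
primitives = [
    ((0.10, 0.10, 0.10), 0.9),
    ((0.30, 0.10, 0.10), 0.1),
    ((1.20, 0.10, 0.10), 0.5),
]
fused = voxel_fuse(primitives, voxel_size=0.5)
print(len(primitives), "primitives ->", len(fused), "voxels")
```

The fused centre in the shared voxel lands close to the high-confidence primitive, which is the intuition behind letting confidence regulate density while keeping the representation compact.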
[CV-222] Words into World: A Task-Adaptive Agent for Language-Guided Spatial Retrieval in AR
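[Quick read]: The paper addresses the limited ability of traditional augmented reality (AR) systems to handle open-vocabulary natural language queries. Their reliance on fixed-class detectors or fiducial markers prevents effective reasoning about complex spatial relations and semantic understanding. The key to the solution is a modular AR agent system that combines multimodal large language models (MLLMs) with grounded vision models. It dynamically builds AR scene graphs with nine typed relations (spatial, structural-semantic, causal-functional) to support language-conditioned spatial retrieval and relational reasoning in 3D space. Task-adaptive region-of-interest highlighting and context-aware spatial retrieval guide human attention to information-dense areas and support human-in-the-loop refinement. The modular design allows different vision-language models to be plugged in without retraining, so the agent acts as an intermediary that couples MLLMs with real-world spatial intelligence for interactive scene understanding.
Link: https://arxiv.org/abs/2512.00294
Authors: Lixing Guo,Tobias Höllerer
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: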
Abstract:Traditional augmented reality (AR) systems predominantly rely on fixed class detectors or fiducial markers, limiting their ability to interpret complex, open-vocabulary natural language queries. We present a modular AR agent system that integrates multimodal large language models (MLLMs) with grounded vision models to enable relational reasoning in space and language-conditioned spatial retrieval in physical environments. Our adaptive task agent coordinates MLLMs and coordinate-aware perception tools to address varying query complexities, ranging from simple object identification to multi-object relational reasoning, while returning meter-accurate 3D anchors. It constructs dynamic AR scene graphs encoding nine typed relations (spatial, structural-semantic, causal-functional), enabling MLLMs to understand not just what objects exist, but how they relate and interact in 3D space. Through task-adaptive region-of-interest highlighting and contextual spatial retrieval, the system guides human attention to information-dense areas while supporting human-in-the-loop refinement. The agent dynamically invokes coordinate-aware tools for complex queries (selection, measurement, comparison, and actuation), grounding language understanding in physical operations. The modular architecture supports plug-and-use vision-language models without retraining, establishing AR agents as intermediaries that augment MLLMs with real-world spatial intelligence for interactive scene understanding. We also introduce GroundedAR-Bench, an evaluation framework for language-driven real world localization and relation grounding across diverse environments.
[CV-223] RealAppliance: Let High-fidelity Appliance Assets Controllable and Workable as Aligned Real Manuals
Link: https://arxiv.org/abs/2512.00287
Authors: Yuzheng Gao,Yuxing Long,Lei Kang,Yuchong Guo,Ziyan Yu,Shangqing Mao,Jiyao Zhang,Ruihai Wu,Dongjiang Li,Hui Shen,Hao Dong
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-224] Rethinking Lung Cancer Screening: AI Nodule Detection and Diagnosis Outperforms Radiologists Leading Models and Standards Beyond Size and Growth
[Quick read]: The paper addresses the diagnostic delay in lung cancer screening caused by reliance on nodule size and growth rate, with the aim of detecting early malignant lung nodules more efficiently. The key to the solution is an ensemble system that combines shallow deep learning models with feature-engineered models. It performs nodule detection and benign-versus-malignant diagnosis directly on low-dose CT images and is trained and validated on a large annotated dataset (25,709 scans, 69,449 nodules). It outperforms radiologists and current leading AI models, particularly on early-stage (Stage 1) cancers, small nodules, and slow-growing nodules, shortening time to diagnosis by up to one year while keeping high sensitivity (99.3%) and a very low false positive rate (0.5 per scan).
Link: https://arxiv.org/abs/2512.00281
Authors: Sylvain Bodard,Pierre Baudot,Benjamin Renoust,Charles Voyton,Gwendoline De Bie,Ezequiel Geremia,Van-Khoa Le,Danny Francis,Pierre-Henri Siot,Yousra Haddou,Vincent Bobin,Jean-Christophe Brisset,Carey C. Thomson,Valerie Bourdes,Benoit Huet
Affiliations: Université de Paris Cité, AP-HP, Hôpital Universitaire Necker Enfants Malades, Service d’Imagerie Adulte, F-75015, Paris, France; Memorial Sloan Kettering Cancer Center, Department of Radiology, 1275 York Avenue, New York, NY 10065, USA; Massachusetts General Hospital, Center for Transplantation Sciences, Harvard Medical School, Boston, USA; Sorbonne Université, CNRS UMR 7371, INSERM U 1146, Laboratoire d’Imagerie Biomédicale (LIB), F-75006, Paris, France; Median Technologies, eyonis, Valbonne, 06560, France; Mount Auburn Hospital/Beth Israel Lahey Health, Cambridge MA, USA; Harvard Medical School, Boston MA, USA
Categories: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Comments: 25 pages, 8 figures, with supplementary information containing 11 figures
Abstract:Early detection of malignant lung nodules is critical, but its dependence on size and growth in screening inherently delays diagnosis. We present an AI system that redefines lung cancer screening by performing both detection and malignancy diagnosis directly at the nodule level on low-dose CT scans. To address limitations in dataset scale and explainability, we designed an ensemble of shallow deep learning and feature-based specialized models. Trained and evaluated on 25,709 scans with 69,449 annotated nodules, the system outperforms radiologists, Lung-RADS, and leading AI models (Sybil, Brock, Google, Kaggle). It achieves an area under the receiver operating characteristic curve (AUC) of 0.98 internally and 0.945 on an independent cohort. With 0.5 false positives per scan at 99.3% sensitivity, it addresses key barriers to AI adoption. Critically, it outperforms radiologists across all nodule sizes and stages, excelling in stage 1 cancers, and all growth-based metrics, including the least accurate: Volume-Doubling Time. It also surpasses radiologists by up to one year in diagnosing indeterminate and slow-growing nodules.
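The abstract names Volume-Doubling Time (VDT) as one of the growth-based metrics the system is compared against. As background, the sketch below computes VDT under the usual exponential-growth assumption, VDT = Δt · ln 2 / ln(V2 / V1). The function names and the example volumes are illustrative assumptions and do not come from the paper.

```python
import math


def volume_doubling_time(volume_first, volume_second, days_between):
    """VDT in days, assuming exponential growth between two scans."""
    if volume_first <= 0 or volume_second <= 0:
        raise ValueError("volumes must be positive")
    growth = math.log(volume_second / volume_first)
    if growth == 0.0:
        return math.inf  # no measurable growth between the scans
    return days_between * math.log(2.0) / growth


def sphere_volume(diameter_mm):
    """Volume in mm^3 of a nodule approximated as a sphere (an assumption)."""
    return math.pi * diameter_mm ** 3 / 6.0


# Hypothetical follow-up: a nodule measured on two scans taken 90 days apart.
doubled = volume_doubling_time(100.0, 200.0, 90.0)
quadrupled = volume_doubling_time(100.0, 400.0, 90.0)
from_diameters = volume_doubling_time(sphere_volume(6.0), sphere_volume(7.0), 90.0)
print(doubled, quadrupled, from_diameters)
```

The formula needs two scans separated in time, which illustrates the delay that growth-based criteria build into screening. A shrinking nodule yields a negative value under this definition.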
[CV-225] HIMOSA: Efficient Remote Sensing Image Super-Resolution with Hierarchical Mixture of Sparse Attention
Link: https://arxiv.org/abs/2512.00275
Authors: Yi Liu,Yi Wan,Xinyi Liu,Qiong Wu,Panwang Xia,Xuejun Huang,Yongjun Zhang
Affiliations: Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
[CV-226] USB: Unified Synthetic Brain Framework for Bidirectional Pathology-Healthy Generation and Editing
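[Quick read]: The paper addresses the difficulty of modeling the relationship between pathological and healthy structures in brain imaging, especially given that paired pathological-healthy data are hard to obtain. Traditional methods are mostly confined to a single domain (generating only healthy images, or editing only pathological ones) and lack a unified framework for bidirectional generation and editing. The key to the solution is USB (Unified Synthetic Brain), the first end-to-end unified framework. It models the joint distribution of lesions and brain anatomy through a paired diffusion mechanism, supporting generation of both pathological and healthy brain images. A consistency guidance algorithm preserves anatomical consistency and lesion correspondence during bidirectional editing, improving the realism and clinical usability of the results.
Link: https://arxiv.org/abs/2512.00269
Authors: Jun Wang,Peirong Liu
Affiliations: Johns Hopkins University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 16 pages, 17 figures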
Abstract:Understanding the relationship between pathological and healthy brain structures is fundamental to neuroimaging, connecting disease diagnosis and detection with modeling, prediction, and treatment planning. However, paired pathological-healthy data are extremely difficult to obtain, as they rely on pre- and post-treatment imaging, constrained by clinical outcomes and longitudinal data availability. Consequently, most existing brain image generation and editing methods focus on visual quality yet remain domain-specific, treating pathological and healthy image modeling independently. We introduce USB (Unified Synthetic Brain), the first end-to-end framework that unifies bidirectional generation and editing of pathological and healthy brain images. USB models the joint distribution of lesions and brain anatomy through a paired diffusion mechanism and achieves both pathological and healthy image generation. A consistency guidance algorithm further preserves anatomical consistency and lesion correspondence during bidirectional pathology-healthy editing. Extensive experiments on six public brain MRI datasets including healthy controls, stroke, and Alzheimer’s patients, demonstrate USB’s ability to produce diverse and realistic results. By establishing the first unified benchmark for brain image generation and editing, USB opens opportunities for scalable dataset creation and robust neuroimaging analysis. Code is available at this https URL.
[CV-227] HeartFormer: Semantic-Aware Dual-Structure Transformers for 3D Four-Chamber Cardiac Point Cloud Reconstruction
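[Quick read]: The paper addresses a limitation of conventional cine magnetic resonance imaging (cine MRI): it provides only 2D slice images of the heart, which hinders a full understanding of cardiac morphology and physiological mechanisms in healthy and pathological states. The key to the solution is HeartFormer, a geometric deep learning framework based on point cloud representation that extends point cloud completion from the single-class to the multi-class setting for the first time. It has two core components. The Semantic-Aware Dual-Structure Transformer Network (SA-DSTNet) generates an initial coarse point cloud with both global and substructure geometry features. The Semantic-Aware Geometry Feature Refinement Transformer Network (SA-GFRTNet) then refines it progressively, using global and substructure geometric priors to produce high-fidelity, geometrically consistent 3D four-chamber cardiac reconstructions.
Link: https://arxiv.org/abs/2512.00264
Authors: Zhengda Ma,Abhirup Banerjee
Affiliations: University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: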
Abstract:We present the first geometric deep learning framework based on point cloud representation for 3D four-chamber cardiac reconstruction from cine MRI data. This work addresses a long-standing limitation in conventional cine MRI, which typically provides only 2D slice images of the heart, thereby restricting a comprehensive understanding of cardiac morphology and physiological mechanisms in both healthy and pathological conditions. To overcome this, we propose HeartFormer, a novel point cloud completion network that extends traditional single-class point cloud completion to the multi-class setting. HeartFormer consists of two key components: a Semantic-Aware Dual-Structure Transformer Network (SA-DSTNet) and a Semantic-Aware Geometry Feature Refinement Transformer Network (SA-GFRTNet). SA-DSTNet generates an initial coarse point cloud with both global geometry features and substructure geometry features. Guided by these semantic-geometry representations, SA-GFRTNet progressively refines the coarse output, effectively leveraging both global and substructure geometric priors to produce high-fidelity and geometrically consistent reconstructions. We further construct HeartCompv1, the first publicly available large-scale dataset with 17,000 high-resolution 3D multi-class cardiac meshes and point-clouds, to establish a general benchmark for this emerging research direction. Extensive cross-domain experiments on HeartCompv1 and UK Biobank demonstrate that HeartFormer achieves robust, accurate, and generalizable performance, consistently surpassing state-of-the-art (SOTA) methods. Code and dataset will be released upon acceptance at: this https URL.
[CV-228] UniDiff: Parameter-Efficient Adaptation of Diffusion Models for Land Cover Classification with Multi-Modal Remotely Sensed Imagery and Sparse Annotations WACV2026
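[Quick read]: The paper addresses the performance limits that sparse annotations impose on supervised methods for multimodal remote sensing data. In particular, for heterogeneous modalities such as hyperspectral imaging (HSI) and synthetic aperture radar (SAR), it is hard to transfer pretrained models effectively without large labeled datasets. The key to the solution is the UniDiff framework, whose core ideas are: a timestep-modality conditioning mechanism based on FiLM (Feature-wise Linear Modulation), a parameter-efficient adaptation strategy that adjusts only about 5% of the parameters, and pseudo-RGB anchoring, which preserves the representations of the ImageNet-pretrained diffusion model and prevents catastrophic forgetting. Together these allow effective extraction and fusion of multimodal remote sensing features using only target-domain data.
Link: https://arxiv.org/abs/2512.00261
Authors: Yuzhen Hu,Saurabh Prasad
Affiliations: University of Houston
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Camera-ready for WACV 2026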
Abstract:Sparse annotations fundamentally constrain multimodal remote sensing: even recent state-of-the-art supervised methods such as MSFMamba are limited by the availability of labeled data, restricting their practical deployment despite architectural advances. ImageNet-pretrained models provide rich visual representations, but adapting them to heterogeneous modalities such as hyperspectral imaging (HSI) and synthetic aperture radar (SAR) without large labeled datasets remains challenging. We propose UniDiff, a parameter-efficient framework that adapts a single ImageNet-pretrained diffusion model to multiple sensing modalities using only target-domain data. UniDiff combines FiLM-based timestep-modality conditioning, parameter-efficient adaptation of approximately 5% of parameters, and pseudo-RGB anchoring to preserve pre-trained representations and prevent catastrophic forgetting. This design enables effective feature extraction from remote sensing data under sparse annotations. Our results with two established multi-modal benchmarking datasets demonstrate that unsupervised adaptation of a pre-trained diffusion model effectively mitigates annotation constraints and achieves effective fusion of multi-modal remotely sensed data.
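As a rough illustration of the FiLM-style conditioning mentioned in the abstract, the sketch below applies a per-channel scale and shift to a feature map. The function name `film` and the toy values are hypothetical. How UniDiff derives the scale and shift from the timestep and modality is not described here, so they are simply passed in.

```python
def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each channel.

    features: list of channels, each a list of floats.
    gamma, beta: one scale and one shift per channel. In a real model
    these would come from a conditioning network (not modelled here).
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("need one gamma and one beta per channel")
    return [[g * x + b for x in channel]
            for channel, g, b in zip(features, gamma, beta)]


# Two channels with three activations each (toy values).
feats = [[1.0, 2.0, 3.0], [0.5, -0.5, 0.0]]
out = film(feats, gamma=[2.0, 0.0], beta=[0.0, 1.0])
```

Because the modulation only adds two numbers per channel, conditioning of this kind is cheap compared with retraining the backbone, which fits the parameter-efficient framing of the abstract.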
[CV-229] Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views
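[Quick read]: This paper addresses high-fidelity, free-viewpoint rendering and relighting of full-body dynamic humans from sparse-view RGB video, targeting the high capture and computation cost of classical one-light-at-a-time (OLAT) methods. The key to the solution is RelightNet, a Transformer-based model that uses physics-informed features encoding geometry, albedo, shading, and camera view, derived from a coarse human mesh proxy and the input views, to predict appearance under novel lighting directly. A new dataset is captured in a multi-view light stage by alternating frames lit by random environment maps with uniformly lit frames, which provides accurate motion tracking together with diverse illumination coverage. Results are output as texel-aligned 3D Gaussian splats, so the rendering equation is approximated efficiently in a single forward pass, improving visual fidelity and lighting reproduction.
Link: https://arxiv.org/abs/2512.00255
Authors: Kunwar Maheep Singh, Jianchun Chen, Vladislav Golyanik, Stephan J. Garbin, Thabo Beeler, Rishabh Dabral, Marc Habermann, Christian Theobalt
Affiliations: 1: Max Planck Institute for Informatics; 2: Saarland University; 3: University of Bath; 4: ETH Zurich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: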
Abstract:We present Relightable Holoported Characters (RHC), a novel person-specific method for free-view rendering and relighting of full-body and highly dynamic humans solely observed from sparse-view RGB videos at inference. In contrast to classical one-light-at-a-time (OLAT)-based human relighting, our transformer-based RelightNet predicts relit appearance within a single network pass, avoiding costly OLAT-basis capture and generation. For training such a model, we introduce a new capture strategy and dataset recorded in a multi-view lightstage, where we alternate frames lit by random environment maps with uniformly lit tracking frames, simultaneously enabling accurate motion tracking and diverse illumination as well as dynamics coverage. Inspired by the rendering equation, we derive physics-informed features that encode geometry, albedo, shading, and the virtual camera view from a coarse human mesh proxy and the input views. Our RelightNet then takes these features as input and cross-attends them with a novel lighting condition, and regresses the relit appearance in the form of texel-aligned 3D Gaussian splats attached to the coarse mesh proxy. Consequently, our RelightNet implicitly learns to efficiently compute the rendering equation for novel lighting conditions within a single feed-forward pass. Experiments demonstrate our method’s superior visual fidelity and lighting reproduction compared to state-of-the-art approaches. Project page: this https URL
[CV-230] TIE: A Training-Inversion-Exclusion Framework for Visually Interpretable and Uncertainty-Guided Out-of-Distribution Detection
Link: https://arxiv.org/abs/2512.00229
Authors: Pirzada Suhail, Rehna Afroz, Amit Sethi
Affiliations: IIT Bombay
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
Comments:
[CV-231] DenseScan: Advancing 3D Scene Understanding with 2D Dense Annotation NEURIPS2025
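[Quick read]: This paper addresses the lack of semantic annotation in current 3D scene understanding datasets: existing datasets provide geometric and instance-level information but lack the fine-grained semantic descriptions needed for complex vision-language tasks. The key to the solution is the DenseScan dataset, built by an automated pipeline that combines multi-view 2D images with multimodal large language models (MLLMs) to produce dense captioning of scene elements and scenario-based question generation, enriching semantics along object attributes, spatial relationships, and context. The approach substantially improves object-level understanding and question answering in 3D environments and offers more context-sensitive annotations for applications such as robotics and augmented reality.
Link: https://arxiv.org/abs/2512.00226
Authors: Zirui Wang, Tao Zhang
Affiliations: University of Illinois at Urbana-Champaign; Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Workshop on Space in Vision, Language, and Embodied AI at NeurIPS 2025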
Abstract:3D understanding is a key capability for real-world AI assistance. High-quality data plays an important role in driving the development of the 3D understanding community. Current 3D scene understanding datasets often provide geometric and instance-level information, yet they lack the rich semantic annotations necessary for nuanced visual-language tasks. In this work, we introduce DenseScan, a novel dataset with detailed multi-level descriptions generated by an automated pipeline leveraging multi-view 2D images and multimodal large language models (MLLMs). Our approach enables dense captioning of scene elements, ensuring comprehensive object-level descriptions that capture context-sensitive details. Furthermore, we extend these annotations through scenario-based question generation, producing high-level queries that integrate object properties, spatial relationships, and scene context. By coupling geometric detail with semantic richness, DenseScan broadens the range of downstream tasks, from detailed visual-language navigation to interactive question answering. Experimental results demonstrate that our method significantly enhances object-level understanding and question-answering performance in 3D environments compared to traditional annotation pipelines. We release both the annotated dataset and our annotation pipeline to facilitate future research and applications in robotics, augmented reality, and beyond. Through DenseScan, we aim to catalyze new avenues in 3D scene understanding, allowing researchers and practitioners to tackle the complexities of real-world environments with richer, more contextually aware annotations.
[CV-232] ReactionMamba: Generating Short & Long Human Reaction Sequences
[Quick read]: This paper addresses insufficient realism, limited diversity, and poor long-term consistency when generating long sequences of 3D human reaction motion. The key to the solution is the ReactionMamba framework, which combines a motion variational autoencoder (motion VAE) with Mamba-based state-space models: the motion VAE encodes motion efficiently, and the Mamba architecture decodes temporally consistent, complex reactions. This preserves realism and diversity while substantially improving inference speed, and supports generation ranging from short, simple motions to long, complex sequences such as dance and martial arts.
Link: https://arxiv.org/abs/2512.00208
Authors: Hajra Anwar Beg, Baptiste Chopin, Hao Tang, Mohamed Daoudi
Affiliations: Univ. Lille; CNRS; Centrale Lille; Institut Mines-Télécom; UMR 9189 CRIStAL; IMT Nord Europe; Centre for Digital Systems; da/sec – Biometrics and Security Research Group; Hochschule Darmstadt; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present ReactionMamba, a novel framework for generating long 3D human reaction motions. ReactionMamba integrates a motion VAE for efficient motion encoding with Mamba-based state-space models to decode temporally consistent reactions. This design enables ReactionMamba to generate both short sequences of simple motions and long sequences of complex motions, such as dance and martial arts. We evaluate ReactionMamba on three datasets (NTU120-AS, Lindy Hop, and InterX) and demonstrate competitive performance in terms of realism, diversity, and long-sequence generation compared to previous methods, including InterFormer, ReMoS, and Ready-to-React, while achieving substantial improvements in inference speed.
[CV-233] Mammo-FM: Breast-specific foundational model for Integrated Mammographic Diagnosis, Prognosis, and Reporting
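[Quick read]: This paper addresses the lack of a unified, efficient, and interpretable general model for breast cancer imaging diagnosis, specifically in the clinical setting of mammography. The key to the solution is Mammo-FM, the first foundation model designed specifically for breast imaging, pretrained on a large, diverse multi-institution dataset of 140,677 patients (821,326 mammograms). It supports core tasks in one framework: cancer diagnosis, pathology localization, structured report generation, and risk prediction. Image-text alignment gives Mammo-FM both visual and textual interpretability, improving clinical transparency and auditability. With only one-third of the parameters of the best generalist foundation models, it clearly outperforms existing methods on in-distribution and out-of-distribution datasets, supporting the case for domain-specific foundation models in both task coverage and efficiency.
Link: https://arxiv.org/abs/2512.00198
Authors: Shantanu Ghosh, Vedant Parthesh Joshi, Rayan Syed, Aya Kassem, Abhishek Varshney, Payel Basak, Weicheng Dai, Judy Wawira Gichoya, Hari M. Trivedi, Imon Banerjee, Shyam Visweswaran, Clare B. Poynton, Kayhan Batmanghelich
Affiliations: Boston University; Arizona State University; Emory University; Mayo Clinic; University of Pittsburgh; Boston Medical Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: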
Abstract:Breast cancer is one of the leading causes of death among women worldwide. We introduce Mammo-FM, the first foundation model specifically for mammography, pretrained on the largest and most diverse dataset to date - 140,677 patients (821,326 mammograms) across four U.S. institutions. Mammo-FM provides a unified foundation for core clinical tasks in breast imaging, including cancer diagnosis, pathology localization, structured report generation, and cancer risk prognosis within a single framework. Its alignment between images and text enables both visual and textual interpretability, improving transparency and clinical auditability, which are essential for real-world adoption. We rigorously evaluate Mammo-FM across diagnosis, prognosis, and report-generation tasks in in- and out-of-distribution datasets. Despite operating on native-resolution mammograms and using only one-third of the parameters of state-of-the-art generalist FMs, Mammo-FM consistently outperforms them across multiple public and private benchmarks. These results highlight the efficiency and value of domain-specific foundation models designed around the full spectrum of tasks within a clinical domain and emphasize the importance of rigorous, domain-aligned evaluation.
[CV-234] AutocleanEEG ICVision: Automated ICA Artifact Classification Using Vision-Language AI
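[Quick read]: This paper addresses the fact that conventional classifiers of EEG independent component analysis (ICA) components rely on hand-crafted features, lack interpretability, and struggle to preserve clinically relevant signals. The key to the solution is EEG Autoclean Vision Language AI (ICVision), an AI-agent system built on a multimodal large language model (GPT-4 Vision) that directly reads the visual content of ICA dashboards (topography maps, time series, power spectra, and ERP plots) and classifies and explains components through human-like natural-language reasoning. The system is presented as the first implementation of AI-agent visual cognition in neurophysiology. It reaches agreement of k = 0.677 with expert consensus without sacrificing clinically important signals, and over 97% of its outputs were rated interpretable and actionable by experts.
Link: https://arxiv.org/abs/2512.00194
Authors: Zag ElSayed, Grace Westerkamp, Gavin Gammoh, Yanchen Liu, Peyton Siekierski, Craig Erickson, Ernest Pedapati
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Quantitative Methods (q-bio.QM)
Comments: 6 pages, 8 figures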
Abstract:We introduce EEG Autoclean Vision Language AI (ICVision), a first-of-its-kind system that emulates expert-level EEG ICA component classification through AI-agent vision and natural language reasoning. Unlike conventional classifiers such as ICLabel, which rely on handcrafted features, ICVision directly interprets ICA dashboard visualizations (topography, time series, power spectra, and ERP plots) using a multimodal large language model (GPT-4 Vision). This allows the AI to see and explain EEG components the way trained neurologists do, making it the first scientific implementation of AI-agent visual cognition in neurophysiology. ICVision classifies each component into one of six canonical categories (brain, eye, heart, muscle, channel noise, and other noise), returning both a confidence score and a human-like explanation. Evaluated on 3,168 ICA components from 124 EEG datasets, ICVision achieved k = 0.677 agreement with expert consensus, surpassing MNE ICLabel, while also preserving clinically relevant brain signals in ambiguous cases. Over 97% of its outputs were rated as interpretable and actionable by expert reviewers. As a core module of the open-source EEG Autoclean platform, ICVision signals a paradigm shift in scientific AI, where models do not just classify, but see, reason, and communicate. It opens the door to globally scalable, explainable, and reproducible EEG workflows, marking the emergence of AI agents capable of expert-level visual decision-making in brain science and beyond.
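The agreement figure k = 0.677 reads like a chance-corrected agreement statistic. Assuming it is Cohen's kappa (the abstract does not say so explicitly), a minimal sketch of the computation follows. The function name and the six toy labels are made up for illustration and are unrelated to the paper's data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels only, not data from the paper.
model = ["brain", "brain", "eye", "muscle", "brain", "eye"]
expert = ["brain", "eye", "eye", "muscle", "brain", "brain"]
kappa = cohen_kappa(model, expert)
```

In this toy case the raters agree on 4 of 6 items, but kappa is lower than the raw agreement because some of that agreement is expected by chance.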
[CV-235] Efficient Edge-Compatible CNN for Speckle-Based Material Recognition in Laser Cutting Systems
Link: https://arxiv.org/abs/2512.00179
Authors: Mohamed Abdallah Salem (North Dakota State University), Nourhan Zein Diab (New Mansoura University)
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Copyright 2025 IEEE. This is the author’s version of the work that has been accepted for publication in the Proceedings of the 2025 IEEE 35th International Conference on Computer Theory and Applications (ICCTA 2025). The final published version will be available on IEEE Xplore
[CV-236] Ternary-Input Binary-Weight CNN Accelerator Design for Miniature Object Classification System with Query-Driven Spatial DVS
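[Quick read]: This paper addresses efficient object classification for miniature imaging systems under resource constraints such as limited memory and power. The key solution is a convolutional neural network (CNN) hardware accelerator optimized for the output of a spatial Dynamic Vision Sensor (DVS), which can be reconfigured into a temporal DVS through pixel sharing. Combining ternary DVS outputs with a binary-weight network reduces computational complexity and storage needs while maintaining classification accuracy. Implemented in 28 nm CMOS, the design cuts data size by 81% and multiply-accumulate (MAC) operations by 27%, consumes only 1.6 mW during inference, and improves the Figure-of-Merit (FoM) by 7.3x over existing CNN accelerators for miniature systems.
Link: https://arxiv.org/abs/2512.00138
Authors: Yuyang Li, Swasthik Muloor, Jack Laudati, Nickolas Dematteis, Yidam Park, Hana Kim, Nathan Chang, Inhee Lee
Affiliations: Unknown
Categories: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 6 pages, 12 figures, 2 tables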
Abstract:Miniature imaging systems are essential for space-constrained applications but are limited by memory and power constraints. While machine learning can reduce data size by extracting key features, its high energy demands often exceed the capacity of small batteries. This paper presents a CNN hardware accelerator optimized for object classification in miniature imaging systems. It processes data from a spatial Dynamic Vision Sensor (DVS), reconfigurable to a temporal DVS via pixel sharing, minimizing sensor area. By using ternary DVS outputs and a ternary-input, binary-weight neural network, the design reduces computation and memory needs. Fabricated in 28 nm CMOS, the accelerator cuts data size by 81% and MAC operations by 27%. It achieves 440 ms inference time at just 1.6 mW power consumption, improving the Figure-of-Merit (FoM) by 7.3x over prior CNN accelerators for miniature systems.
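To see why ternary inputs and binary weights reduce computation, consider the sketch below. It assumes inputs in {-1, 0, +1} and weights in {-1, +1}, which is a common reading of those terms rather than something the abstract specifies. The function name is hypothetical, and the code is a software illustration, not the accelerator's datapath.

```python
def ternary_binary_dot(inputs, weights):
    """Dot product without multiplications.

    inputs:  values in {-1, 0, +1} (assumed ternary event encoding)
    weights: values in {-1, +1}   (assumed binary weight encoding)
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have equal length")
    acc = 0
    for x, w in zip(inputs, weights):
        if x not in (-1, 0, 1) or w not in (-1, 1):
            raise ValueError("value outside the ternary/binary alphabets")
        if x == 0:
            continue  # zero inputs contribute nothing and can be skipped
        # x * w is +1 when the signs match and -1 otherwise.
        acc += 1 if x == w else -1
    return acc


events = [1, 0, -1, 1, 0, -1]
kernel = [1, -1, -1, 1, 1, 1]
result = ternary_binary_dot(events, kernel)
```

Every term becomes an increment, a decrement, or a skip, so sparse inputs with many zeros translate directly into fewer accumulate operations.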
[CV-237] Local and Global Context-and-Object-part-Aware Superpixel-based Data Augmentation for Deep Visual Recognition
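[Quick read]: This paper addresses two weaknesses of existing Cutmix-based data augmentation: image-level constraints that neglect class-discriminative local context and so limit performance gains, and rectangular or square cut regions that lose object part information. In addition, to mitigate inconsistency between the augmented image and the mixed label, conventional methods often rely on double forward propagation or an external pre-trained network for object centering, which is inefficient. The key to the solution is LGCOAMix, a superpixel-based, context-aware and object-part-aware grid blending method for data augmentation. It is the first to use a superpixel attention mechanism for label mixing, learning local features from discriminative superpixel regions and contrasting superpixels across images. This increases the model's sensitivity to local semantics and the realism of augmented samples, outperforming current mainstream Cutmix methods on classification and weakly supervised object localization, with both CNN and Transformer architectures.
Link: https://arxiv.org/abs/2512.00130
Authors: Fadi Dornaika, Danyang Sun
Affiliations: University of the Basque Country; IKERBASQUE
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: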
Abstract:Cutmix-based data augmentation, which uses a cut-and-paste strategy, has shown remarkable generalization capabilities in deep learning. However, existing methods primarily consider global semantics with image-level constraints, which excessively reduces attention to the discriminative local context of the class and leads to a performance improvement bottleneck. Moreover, existing methods for generating augmented samples usually involve cutting and pasting rectangular or square regions, resulting in a loss of object part information. To mitigate the problem of inconsistency between the augmented image and the generated mixed label, existing methods usually require double forward propagation or rely on an external pre-trained network for object centering, which is inefficient. To overcome the above limitations, we propose LGCOAMix, an efficient context-aware and object-part-aware superpixel-based grid blending method for data augmentation. To the best of our knowledge, this is the first time that a label mixing strategy using a superpixel attention approach has been proposed for cutmix-based data augmentation. It is the first instance of learning local features from discriminative superpixel-wise regions and cross-image superpixel contrasts. Extensive experiments on various benchmark datasets show that LGCOAMix outperforms state-of-the-art cutmix-based data augmentation methods on classification tasks, and weakly supervised object location on CUB200-2011. We have demonstrated the effectiveness of LGCOAMix not only for CNN networks, but also for Transformer networks. Source codes are available at this https URL.
[CV-238] Analysis of Incursive Breast Cancer in Mammograms Using YOLO, Explainability, and Domain Adaptation
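[Quick read]: This paper addresses the lack of robustness of deep learning models for breast cancer detection to out-of-distribution (OOD) inputs, such as non-mammographic images (CT, MRI, X-ray) or variations caused by equipment differences, which leads to misdiagnosis and unreliable detections. The key to the solution is an integrated framework: ResNet50 serves as the backbone for OOD filtering, building an in-domain gallery via cosine similarity to strictly reject non-mammographic images; verified in-domain images are then passed to YOLO-family detectors (YOLOv8/v11/v12) for high-accuracy detection. The strategy reaches 100% accuracy on the OOD test sets while maintaining a detection performance of mAP@0.5 = 0.947, and uses Grad-CAM to improve interpretability, substantially increasing reliability across diverse clinical environments.
Link: https://arxiv.org/abs/2512.00129
Authors: Jayan Adhikari, Prativa Joshi, Susish Baral
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: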
Abstract:Deep learning models for breast cancer detection from mammographic images have significant reliability problems when presented with Out-of-Distribution (OOD) inputs such as other imaging modalities (CT, MRI, X-ray) or equipment variations, leading to unreliable detection and misdiagnosis. The current research mitigates the fundamental OOD issue through a comprehensive approach integrating ResNet50-based OOD filtering with YOLO architectures (YOLOv8, YOLOv11, YOLOv12) for accurate detection of breast cancer. Our strategy establishes an in-domain gallery via cosine similarity to rigidly reject non-mammographic inputs prior to processing, ensuring that only domain-associated images supply the detection pipeline. The OOD detection component achieves 99.77% general accuracy with immaculate 100% accuracy on OOD test sets, effectively eliminating irrelevant imaging modalities. ResNet50 was selected as the optimum backbone after 12 CNN architecture searches. The joint framework unites OOD robustness with high detection performance (mAP@0.5: 0.947) and enhanced interpretability through Grad-CAM visualizations. Experimental validation establishes that OOD filtering significantly improves system reliability by preventing false alarms on out-of-distribution inputs while maintaining higher detection accuracy on mammographic data. The present study offers a fundamental foundation for the deployment of reliable AI-based breast cancer detection systems in diverse clinical environments with inherent data heterogeneity.
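The gallery-based filter described above can be sketched as a nearest-neighbour cosine check. In the sketch below the embeddings are toy 3-dimensional vectors rather than ResNet50 features, and the names `is_in_domain` and `threshold=0.8` are hypothetical. The abstract does not give the decision rule or threshold, so this is one plausible form only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_in_domain(embedding, gallery, threshold=0.8):
    """Accept an input if it is close enough to any gallery embedding.

    The threshold is an illustrative value, not one taken from the paper.
    """
    return max(cosine(embedding, g) for g in gallery) >= threshold


# Toy embeddings standing in for backbone features of in-domain images.
gallery = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.3]]
mammogram_like = [0.95, 0.05, 0.25]
other_modality = [0.0, 1.0, 0.0]
```

Only inputs that pass this gate would be forwarded to the detector, so out-of-distribution images never reach it.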
[CV-239] Hybrid Synthetic Data Generation with Domain Randomization Enables Zero-Shot Vision-Based Part Inspection Under Extreme Class Imbalance
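[Quick read]: This paper addresses the degradation of machine learning performance in industrial quality inspection caused by scarce labeled data and severe class imbalance. The core challenge is that high-quality labeled data are costly to obtain in manufacturing and defective samples are rare, making it hard to train robust detection and classification models. The key to the solution is a hybrid synthetic data generation (SDG) framework that combines simulation-based rendering, domain randomization, and real background compositing to enable zero-shot learning without manual annotation. The method automatically generates 12,960 labeled images with varied geometry, lighting, and surface properties and composites them onto real backgrounds to build a large, balanced dataset. YOLOv8n is used for object detection and MobileNetV3-small for quality classification. Trained only on synthetic data, the system achieves mAP@0.5 of 0.995, 96% classification accuracy, and 90.1% balanced accuracy on real industrial parts, clearly better than few-shot real-data baselines (only 50% balanced accuracy), which supports the scalability, robustness, and practicality of the SDG strategy in real manufacturing settings.
Link: https://arxiv.org/abs/2512.00125
Authors: Ruo-Syuan Mei, Sixian Jia, Guangze Li, Soo Yeon Lee, Brian Musser, William Keller, Sreten Zakula, Jorge Arinez, Chenhui Shao
Affiliations: University of Michigan; General Motors
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Submitted to the NAMRC 54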
Abstract:Machine learning, particularly deep learning, is transforming industrial quality inspection. Yet, training robust machine learning models typically requires large volumes of high-quality labeled data, which are expensive, time-consuming, and labor-intensive to obtain in manufacturing. Moreover, defective samples are intrinsically rare, leading to severe class imbalance that degrades model performance. These data constraints hinder the widespread adoption of machine learning-based quality inspection methods in real production environments. Synthetic data generation (SDG) offers a promising solution by enabling the creation of large, balanced, and fully annotated datasets in an efficient, cost-effective, and scalable manner. This paper presents a hybrid SDG framework that integrates simulation-based rendering, domain randomization, and real background compositing to enable zero-shot learning for computer vision-based industrial part inspection without manual annotation. The SDG pipeline generates 12,960 labeled images in one hour by varying part geometry, lighting, and surface properties, and then compositing synthetic parts onto real image backgrounds. A two-stage architecture utilizing a YOLOv8n backbone for object detection and MobileNetV3-small for quality classification is trained exclusively on synthetic data and evaluated on 300 real industrial parts. The proposed approach achieves an mAP@0.5 of 0.995 for detection, 96% classification accuracy, and 90.1% balanced accuracy. Comparative evaluation against few-shot real-data baseline approaches demonstrates significant improvement. The proposed SDG-based approach achieves 90-91% balanced accuracy under severe class imbalance, while the baselines reach only 50% accuracy. These results demonstrate that the proposed method enables annotation-free, scalable, and robust quality inspection for real-world manufacturing applications.
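Balanced accuracy, the metric reported above, is the mean of per-class recall. The sketch below shows why it matters under class imbalance: a classifier that always predicts the majority class scores high plain accuracy but only 50% balanced accuracy on a two-class problem. The 95/5 split is invented for illustration and says nothing about how the paper's baselines actually behaved.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced test set: 95 good parts, 5 defective parts.
y_true = ["good"] * 95 + ["defect"] * 5
always_good = ["good"] * 100  # degenerate majority-class predictor

plain_acc = sum(t == p for t, p in zip(y_true, always_good)) / len(y_true)
bal_acc = balanced_accuracy(y_true, always_good)
```

Here plain accuracy is 0.95 while balanced accuracy is 0.5, because recall on the rare class is zero.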
[CV-240] Art2Music: Generating Music for Art Images with Multi-modal Feeling Alignment
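[Quick read]: This paper addresses the difficulty of achieving perceptual naturalness and emotional consistency in multimodal music generation within generative AI, in particular how to align feeling across modalities without costly emotion labels. The core solution is to build the ArtiCaps dataset (a pseudo feeling-aligned image-music-text dataset created by semantically matching ArtEmis and MusicCaps) and to propose the Art2Music framework. In the first stage, images and text are encoded with OpenCLIP and fused by a gated residual module, then decoded by a bidirectional LSTM into Mel-spectrograms, with a frequency-weighted L1 loss to improve high-frequency fidelity. In the second stage, a fine-tuned HiFi-GAN vocoder reconstructs high-quality audio waveforms. With only 50k samples, the method yields clear gains in perceptual naturalness, spectral fidelity, and semantic consistency, offering a scalable approach to feeling-aligned audio generation for interactive art, personalized soundscapes, and digital art exhibitions.
Link: https://arxiv.org/abs/2512.00120
Authors: Jiaying Hong, Ting Zhu, Thanet Markchom, Huizhi Liang
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: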
Abstract:With the rise of AI-generated content (AIGC), generating perceptually natural and feeling-aligned music from multimodal inputs has become a central challenge. Existing approaches often rely on explicit emotion labels that require costly annotation, underscoring the need for more flexible feeling-aligned methods. To support multimodal music generation, we construct ArtiCaps, a pseudo feeling-aligned image-music-text dataset created by semantically matching descriptions from ArtEmis and MusicCaps. We further propose Art2Music, a lightweight cross-modal framework that synthesizes music from artistic images and user comments. In the first stage, images and text are encoded with OpenCLIP and fused using a gated residual module; the fused representation is decoded by a bidirectional LSTM into Mel-spectrograms with a frequency-weighted L1 loss to enhance high-frequency fidelity. In the second stage, a fine-tuned HiFi-GAN vocoder reconstructs high-quality audio waveforms. Experiments on ArtiCaps show clear improvements in Mel-Cepstral Distortion, Frechet Audio Distance, Log-Spectral Distance, and cosine similarity. A small LLM-based rating study further verifies consistent cross-modal feeling alignment and offers interpretable explanations of matches and mismatches across modalities. These results demonstrate improved perceptual naturalness, spectral fidelity, and semantic consistency. Art2Music also maintains robust performance with only 50k training samples, providing a scalable solution for feeling-aligned creative audio generation in interactive art, personalized soundscapes, and digital art exhibitions.
[CV-241] TinyViT: Field Deployable Transformer Pipeline for Solar Panel Surface Fault and Severity Screening
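[Quick read]: This paper addresses how to accurately detect and prioritize surface faults in solar photovoltaic (PV) assets that operate continuously across large, geographically distributed sets of modules. Conventional multi-modal imaging strategies are effective but costly and logistically complex to deploy, making routine farm-level use difficult. The key to the solution is to combine deep learning with classical machine learning in a lightweight integrated system, TinyViT, whose core consists of a Transformer-based segmentation module, spectral-spatial feature engineering, and an ensemble regression model. Using only planar visible-light images from consumer-grade color cameras, it classifies seven subtle surface fault types and estimates their severity. The approach needs no electroluminescence (EL) or infrared (IR) sensors, which lowers maintenance cost and improves scalability, moving PV health monitoring toward widespread field use.
Link: https://arxiv.org/abs/2512.00117
Authors: Ishwaryah Pandiarajan, Mohamed Mansoor Roomi Sindha, Uma Maheswari Pandyan, Sharafia N
Affiliations: Thiagarajar College of Engineering; Velammal College of Engineering and Technology; Kumaraguru College of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 3 pages, 2 figures, ICGVIP 2025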
Abstract:Sustained operation of solar photovoltaic assets hinges on accurate detection and prioritization of surface faults across vast, geographically distributed modules. While multi-modal imaging strategies are popular, they introduce logistical and economic barriers for routine farm-level deployment. This work demonstrates that deep learning and classical machine learning may be judiciously combined to achieve robust surface anomaly categorization and severity estimation from planar visible-band imagery alone. We introduce TinyViT, a compact pipeline integrating Transformer-based segmentation, spectral-spatial feature engineering, and ensemble regression. The system ingests consumer-grade color camera mosaics of PV panels, classifies seven nuanced surface faults, and generates actionable severity grades for maintenance triage. By eliminating reliance on electroluminescence or IR sensors, our method enables affordable, scalable upkeep for resource-limited installations, and advances the state of solar health monitoring toward universal field accessibility. Experiments on real-world public datasets validate both classification and regression sub-modules, achieving accuracy and interpretability competitive with specialized approaches.
[CV-242] MoLT: Mixture of Layer-Wise Tokens for Efficient Audio-Visual Learning
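[Quick read]: This paper addresses the parameter-efficiency and memory bottlenecks of existing methods for audio-visual learning, in particular the heavy computation and the risk of propagating redundant features in conventional Transformer adaptation applied sequentially at every layer. The key to the solution is Mixture of Layer-Wise Tokens (MoLT), which extracts and fuses lightweight latent tokens only from the late layers of the pre-trained model for efficient and robust cross-modal adaptation. Two types of adapter capture modality-specific information and cross-modal interaction respectively, a dynamic token fusion module weights the tokens of each layer by importance, and an orthogonality regularization reduces redundancy among latent tokens. This avoids error propagation from early layers while substantially improving performance.
Link: https://arxiv.org/abs/2512.00115
Authors: Kyeongha Rho, Hyeongkeun Lee, Jae Won Cho, Joon Son Chung
Affiliations: KAIST; Sejong University
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 10 pages, 5 figures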
Abstract:In this paper, we propose Mixture of Layer-Wise Tokens (MoLT), a parameter- and memory-efficient adaptation framework for audio-visual learning. The key idea of MoLT is to replace conventional, computationally heavy sequential adaptation at every transformer layer with a parallel, lightweight scheme that extracts and fuses layer-wise tokens only from the late layers. We adopt two types of adapters to distill modality-specific information and cross-modal interaction into compact latent tokens in a layer-wise manner. A token fusion module then dynamically fuses these layer-wise tokens by taking into account their relative significance. To prevent the redundancy of latent tokens, we apply an orthogonality regularization between latent tokens during training. Through the systematic analysis of the position of adaptation in the pre-trained transformers, we extract latent tokens only from the late layers of the transformers. This strategic adaptation approach avoids error propagation from the volatile early-layer features, thereby maximizing the adaptation performance while maintaining parameter and memory efficiency. Through extensive experiments, we demonstrate that MoLT outperforms existing methods on diverse audio-visual benchmarks, including Audio-Visual Question Answering, Audio-Visual Segmentation, and Audio-Visual Event Localization.
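One common way to write an orthogonality regularizer is to penalize the squared cosine similarity between every pair of tokens. The sketch below uses that form as an assumption, since the abstract does not give the loss MoLT uses. The function name is hypothetical.

```python
import math


def orthogonality_penalty(tokens):
    """Sum of squared cosine similarities over all distinct token pairs.

    The penalty is 0.0 when all tokens are mutually orthogonal and grows
    as tokens become more aligned (that is, more redundant).
    """
    normed = []
    for t in tokens:
        norm = math.sqrt(sum(x * x for x in t))
        if norm == 0.0:
            raise ValueError("zero-length token cannot be normalized")
        normed.append([x / norm for x in t])
    penalty = 0.0
    for i in range(len(normed)):
        for j in range(i + 1, len(normed)):
            dot = sum(a * b for a, b in zip(normed[i], normed[j]))
            penalty += dot * dot
    return penalty


orthogonal = orthogonality_penalty([[1.0, 0.0], [0.0, 1.0]])
redundant = orthogonality_penalty([[1.0, 0.0], [2.0, 0.0]])
```

Added to the task loss with a small weight, a term of this kind pushes the latent tokens to carry different information.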
[CV-243] Comparative Analysis of Vision Transformer, Convolutional, and Hybrid Architectures for Mental Health Classification Using Actigraphy-Derived Images
Link: https://arxiv.org/abs/2512.00103
Authors: Ifeanyi Okala
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
[CV-244] HMARK: Radioactive Multi-Bit Semantic-Latent Watermarking for Diffusion Models
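[Quick read]: This paper addresses the copyright and compliance risks that arise when generative diffusion models may be trained on unauthorized image data, that is, data with uncertain ownership or usage rights. Conventional watermarking techniques struggle to satisfy radioactivity (the watermark transfers to the model's outputs), imperceptibility, robustness, and multi-bit capacity at the same time. The key to the solution is HMARK, a multi-bit watermarking scheme in the semantic-latent space (h-space). By encoding ownership information as semantically meaningful signals in h-space, the watermark is radioactive, robust to various distortions, and has minimal impact on perceptual image quality, making it possible to trace and verify the provenance of training data.
Link: https://arxiv.org/abs/2512.00094
Authors: Kexin Li, Guozhen Ding, Ilya Grishchenko, David Lie
Affiliations: University of Toronto
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: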
Abstract:Modern generative diffusion models rely on vast training datasets, often including images with uncertain ownership or usage rights. Radioactive watermarks – marks that transfer to a model’s outputs – can help detect when such unauthorized data has been used for training. Moreover, aside from being radioactive, an effective watermark for protecting images from unauthorized training also needs to meet other existing requirements, such as imperceptibility, robustness, and multi-bit capacity. To overcome these challenges, we propose HMARK, a novel multi-bit watermarking scheme, which encodes ownership information as secret bits in the semantic-latent space (h-space) for image diffusion models. By leveraging the interpretability and semantic significance of h-space, ensuring that watermark signals correspond to meaningful semantic attributes, the watermarks embedded by HMARK exhibit radioactivity, robustness to distortions, and minimal impact on perceptual quality. Experimental results demonstrate that HMARK achieves 98.57% watermark detection accuracy, 95.07% bit-level recovery accuracy, 100% recall rate, and 1.0 AUC on images produced by the downstream adversarial model finetuned with LoRA on watermarked data across various types of distortions.
[CV-245] Deep Filament Extraction for 3D Concrete Printing
Link: https://arxiv.org/abs/2512.00091
Authors: Karam Mawas, Mehdi Maboudi, Pedro Achanccaray, Markus Gerke
Affiliations: Technische Universität Braunschweig
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-246] TeleViT1.0: Teleconnection-aware Vision Transformers for Subseasonal to Seasonal Wildfire Pattern Forecasts
Link: https://arxiv.org/abs/2512.00089
Authors: Ioannis Prapas, Nikolaos Papadopoulos, Nikolaos-Ioannis Bountos, Dimitrios Michail, Gustau Camps-Valls, Ioannis Papoutsis
Affiliations: Orion Lab, National Technical University of Athens; Image Processing Laboratory (IPL), Universitat de València; Department of Informatics and Telematics, Harokopio University of Athens
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review
[CV-247] SemImage: Semantic Image Representation for Text, a Novel Framework for Embedding Disentangled Linguistic Features
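[Quick read]: This paper addresses the difficulty of modeling semantic information effectively in text document classification and the limited interpretability of existing models. The core solution is SemImage, a representation that maps a text document to a two-dimensional semantic image so that convolutional neural networks (CNNs) can process text directly. The key innovations are: 1) each word is a pixel in a 2D image, rows correspond to sentences, and dynamically computed boundary rows are inserted between sentences to highlight semantic transitions; 2) a disentangled HSV color space encodes different linguistic features: Hue represents topic (modeled circularly through H_cos and H_sin), Saturation represents sentiment, and Value represents intensity or certainty; 3) a ColorMapper network is trained in a multi-task framework, with auxiliary supervision on the Hue and Saturation channels to predict topic and sentiment labels, which reinforces the disentanglement. The method matches or outperforms strong models such as BERT on several multi-label and single-label datasets while offering intuitive visual explanations in which topic shifts and sentiment changes are clearly visible in the image.
Link: https://arxiv.org/abs/2512.00088
Authors: Mohammad Zare
Affiliations: AI Lab, Arioobarzan Engineering Team
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: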
Abstract:We propose SemImage, a novel method for representing a text document as a two-dimensional semantic image to be processed by convolutional neural networks (CNNs). In a SemImage, each word is represented as a pixel in a 2D image: rows correspond to sentences and an additional boundary row is inserted between sentences to mark semantic transitions. Each pixel is not a typical RGB value but a vector in a disentangled HSV color space, encoding different linguistic features: the Hue (with two components, H_cos and H_sin, to account for circularity) encodes the topic, Saturation encodes the sentiment, and Value encodes intensity or certainty. We enforce this disentanglement via a multi-task learning framework: a ColorMapper network maps each word embedding to the HSV space, and auxiliary supervision is applied to the Hue and Saturation channels to predict topic and sentiment labels, alongside the main task objective. The insertion of dynamically computed boundary rows between sentences yields sharp visual boundaries in the image when consecutive sentences are semantically dissimilar, effectively making paragraph breaks salient. We integrate SemImage with standard 2D CNNs (e.g., ResNet) for document classification. Experiments on multi-label datasets (with both topic and sentiment annotations) and single-label benchmarks demonstrate that SemImage can achieve competitive or better accuracy than strong text classification baselines (including BERT and hierarchical attention networks) while offering enhanced interpretability. An ablation study confirms the importance of the multi-channel HSV representation and the dynamic boundary rows. Finally, we present visualizations of SemImage that qualitatively reveal clear patterns corresponding to topic shifts and sentiment changes in the generated image, suggesting that our representation makes these linguistic features visible to both humans and machines.
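The pixel encoding described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' ColorMapper network: the function names `encode_pixel` and `hue_distance`, and the use of a topic angle in radians as input, are assumptions made purely for illustration. The sketch only shows why splitting Hue into H_cos and H_sin handles circularity: two hue angles on either side of the 0/2π wrap-around stay close in the encoded space.

```python
import math

def encode_pixel(topic_angle, sentiment, intensity):
    """Return a 4-channel pixel (H_cos, H_sin, S, V).

    topic_angle is a hue angle in radians (an assumed input; in the paper a
    learned ColorMapper produces the channels from word embeddings).
    sentiment and intensity are clamped to [0, 1].
    """
    clamp = lambda x: min(max(x, 0.0), 1.0)
    return (math.cos(topic_angle), math.sin(topic_angle),
            clamp(sentiment), clamp(intensity))

def hue_distance(p, q):
    """Euclidean distance between the (H_cos, H_sin) parts of two pixels."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Two hues just either side of the wrap-around point.
a = encode_pixel(0.01, 0.8, 0.5)
b = encode_pixel(2 * math.pi - 0.01, 0.8, 0.5)

raw_gap = abs(0.01 - (2 * math.pi - 0.01))  # large when angles are compared directly
encoded_gap = hue_distance(a, b)            # small in (cos, sin) space
print(raw_gap > 6.0, encoded_gap < 0.05)
```

A single scalar hue would place these two topics at opposite ends of the range, whereas the two-component form keeps them adjacent.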
[CV-248] Exploring Automated Recognition of Instructional Activity and Discourse from Multimodal Classroom Data WACV
Link: https://arxiv.org/abs/2512.00087
Authors: Ivo Bueno, Ruikun Hou, Babette Bühler, Tim Fütterer, James Drimalla, Jonathan Kyle Foster, Peter Youngs, Peter Gerjets, Ulrich Trautwein, Enkelejda Kasneci
Affiliations: Technical University of Munich; Munich Center for Machine Learning (MCML); University of Tübingen; Gordon College; State University of New York at Albany; University of Virginia
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: This article has been accepted for publication in the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026
[CV-249] Multi-modal On-Device Learning for Monocular Depth Estimation on Ultra-low-power MCUs
[Quick read]: This paper addresses the sharp accuracy drop that monocular depth estimation (MDE) suffers under domain shift on ultra-low-power (ULP) Internet-of-Things (IoT) platforms: when data captured in the field differs markedly from the training distribution, lightweight neural networks degrade severely. The key to the solution is a multi-modal On-Device Learning (ODL) method. An 8×8-pixel depth sensor integrated in the IoT node is activated in a new environment to generate pseudo-labels, which are then used to fine-tune the 107k-parameter μPyD-Net model locally on the microcontroller unit (MCU). A memory-driven sparse update scheme reduces training memory to 1.2 MB (2.2× less than a full update) while limiting accuracy drops to 2% and 1.5% on the KITTI and NYUv2 datasets. Domain adaptation completes in 17.8 minutes with only 3k self-labeled samples, demonstrating for the first time that efficient, low-power online learning for MDE is feasible on a real IoT node.
Link: https://arxiv.org/abs/2512.00086
Authors: Davide Nadalini, Manuele Rusci, Elia Cereda, Luca Benini, Francesco Conti, Daniele Palossi
Affiliations: University of Bologna; Politecnico di Torino; KU Leuven; Dalle Molle Institute for Artificial Intelligence (IDSIA); ETH Zurich
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 9 figures, 3 tables. Associated open-source release available at: this https URL
Abstract:Monocular depth estimation (MDE) plays a crucial role in enabling spatially-aware applications in Ultra-low-power (ULP) Internet-of-Things (IoT) platforms. However, the limited number of parameters of Deep Neural Networks for the MDE task, designed for IoT nodes, results in severe accuracy drops when the sensor data observed in the field shifts significantly from the training dataset. To address this domain shift problem, we present a multi-modal On-Device Learning (ODL) technique, deployed on an IoT device integrating a Greenwaves GAP9 MicroController Unit (MCU), an 80 mW monocular camera and an 8×8-pixel depth sensor, consuming ≈300 mW. In its normal operation, this setup feeds a tiny 107k-parameter μPyD-Net model with monocular images for inference. The depth sensor, usually deactivated to minimize energy consumption, is only activated alongside the camera to collect pseudo-labels when the system is placed in a new environment. Then, the fine-tuning task is performed entirely on the MCU, using the new data. To optimize our backpropagation-based on-device training, we introduce a novel memory-driven sparse update scheme, which minimizes the fine-tuning memory to 1.2 MB, 2.2x less than a full update, while preserving accuracy (i.e., only 2% and 1.5% drops on the KITTI and NYUv2 datasets). Our in-field tests demonstrate, for the first time, that ODL for MDE can be performed in 17.8 minutes on the IoT node, reducing the root mean squared error from 4.9 to 0.6 m with only 3k self-labeled samples, collected in a real-life deployment scenario.
[CV-250] A Fast and Efficient Modern BERT based Text-Conditioned Diffusion Model for Medical Image Segmentation
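[Quick read]: This paper addresses the high annotation cost and low efficiency caused by the reliance on dense pixel-level labels in medical image segmentation. The key to the solution is FastTextDiff, which integrates medical text annotations to enhance semantic representations: ModernBERT, a transformer that can process long clinical notes, links textual information to the semantic content of medical images, and a cross-modal attention mechanism fuses visual and textual features. Replacing the conventional Clinical BioBERT with ModernBERT, which is trained on a 2-trillion-token corpus and uses FlashAttention 2, improves segmentation accuracy and substantially improves training efficiency.
Link: https://arxiv.org/abs/2512.00084
Authors: Venkata Siddharth Dhara, Pawan Kumar
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 15 pages, 3 figures, Accepted in Slide 3 10th International Conference on Computer Vision Image Processing (CVIP 2026)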
Abstract:In recent times, denoising diffusion probabilistic models (DPMs) have proven effective for medical image generation and denoising, and as representation learners for downstream segmentation. However, segmentation performance is limited by the need for dense pixel-wise labels, which are expensive, time-consuming, and require expert knowledge. We propose FastTextDiff, a label-efficient diffusion-based segmentation model that integrates medical text annotations to enhance semantic representations. Our approach uses ModernBERT, a transformer capable of processing long clinical notes, to tightly link textual annotations with semantic content in medical images. Trained on MIMIC-III and MIMIC-IV, ModernBERT encodes clinical knowledge that guides cross-modal attention between visual and textual features. This study validates ModernBERT as a fast, scalable alternative to Clinical BioBERT in diffusion-based segmentation pipelines and highlights the promise of multi-modal techniques for medical image analysis. By replacing Clinical BioBERT with ModernBERT, FastTextDiff benefits from FlashAttention 2, an alternating attention mechanism, and a 2-trillion-token corpus, improving both segmentation accuracy and training efficiency over traditional diffusion-based models.
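[CV-251] Exploring Diagnostic Prompting Approach for Multimodal LLM-based Visual Complexity Assessment: A Case Study of Amazon Search Result Pages
[Quick read]: This paper addresses the limited reliability of Multimodal Large Language Models (MLLMs) when assessing the visual complexity of Amazon Search Results Pages (SRP). The key to the solution is diagnostic prompting. Compared with standard prompting based on gestalt principles, diagnostic prompting decomposes visual elements in a structured way and guides the model toward key design features (for example, badge clutter carries 38.6% importance). This markedly improves the prediction of human expert complexity judgments (F1-score rises from 0.031 to 0.297, an 858% relative improvement), although absolute agreement remains low (Cohen's κ = 0.071). The strategy offers a preliminary, feasible path toward MLLM visual evaluation that is closer to human reasoning.
Link: https://arxiv.org/abs/2512.00082
Authors: Divendar Murtadak, Yoon Kim, Trilokya Akula
Affiliations: Amazon
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 4 figures, 9 tables. Study on diagnostic prompting for multimodal LLM-based visual complexity assessment of Amazon search result pages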
Abstract:This study investigates whether diagnostic prompting can improve Multimodal Large Language Model (MLLM) reliability for visual complexity assessment of Amazon Search Results Pages (SRP). We compare diagnostic prompting with standard gestalt principles-based prompting using 200 Amazon SRP pages and human expert annotations. Diagnostic prompting showed notable improvements in predicting human complexity judgments, with F1-score increasing from 0.031 to 0.297 (+858% relative improvement), though absolute performance remains modest (Cohen’s κ = 0.071). The decision tree revealed that models prioritize visual design elements (badge clutter: 38.6% importance) while humans emphasize content similarity, suggesting partial alignment in reasoning patterns. Failure case analysis reveals persistent challenges in MLLM visual perception, particularly for product similarity and color intensity assessment. Our findings indicate that diagnostic prompting represents a promising initial step toward human-aligned MLLM-based evaluation, though failure cases with consistent human-MLLM disagreement require continued research and refinement in prompting approaches with larger ground truth datasets for reliable practical deployment.
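The two agreement figures quoted in this abstract can be made concrete with a short Python sketch. The relative-improvement calculation uses the F1 values stated above; the `cohen_kappa` helper and the toy labels are hypothetical and only illustrate how the statistic is computed, not the paper's data or evaluation code.

```python
from collections import Counter

def relative_improvement(baseline, improved):
    """Percentage change of `improved` relative to `baseline`."""
    return (improved - baseline) / baseline * 100.0

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences.

    Undefined (division by zero) when chance agreement is exactly 1.
    """
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1.0 - expected)

# F1 values reported in the abstract.
print(f"{relative_improvement(0.031, 0.297):.0f}%")  # 858%

# Hypothetical toy labels (not from the paper).
human = ["complex", "complex", "simple", "simple"]
model = ["complex", "simple", "simple", "simple"]
print(cohen_kappa(human, model))
```

A large relative gain in F1 and a low kappa are compatible: the baseline F1 of 0.031 is so small that even a modest absolute score yields a large ratio, while kappa discounts agreement expected by chance.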
[CV-252] Conceptual Evaluation of Deep Visual Stereo Odometry for the MARWIN Radiation Monitoring Robot in Accelerator Tunnels
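[Quick read]: This paper addresses the limited navigation flexibility of the MARWIN robot during autonomous radiation monitoring in the European XFEL accelerator tunnels, in particular the failure of conventional lidar-based localization in the presence of unknown geometries and obstacles. The key to the solution is Deep Visual Stereo Odometry (DVSO), a purely vision-based approach that combines stereo disparity, optical flow, and self-supervised learning to jointly estimate depth and ego-motion without labeled data. It is paired with 3D-geometric constraints and absolute references (such as landmarks or other sensors) to improve global consistency, which strengthens robustness and scalability under low-texture surfaces, lighting variability, and radiation.
Link: https://arxiv.org/abs/2512.00080
Authors: André Dehne, Juri Zach, Peer Stelldinger
Affiliations: HAW Hamburg
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: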
Abstract:The MARWIN robot operates at the European XFEL to perform autonomous radiation monitoring in long, monotonous accelerator tunnels where conventional localization approaches struggle. Its current navigation concept combines lidar-based edge detection, wheel/lidar odometry with periodic QR-code referencing, and fuzzy control of wall distance, rotation, and longitudinal position. While robust in predefined sections, this design lacks flexibility for unknown geometries and obstacles. This paper explores deep visual stereo odometry (DVSO) with 3D-geometric constraints as a focused alternative. DVSO is purely vision-based, leveraging stereo disparity, optical flow, and self-supervised learning to jointly estimate depth and ego-motion without labeled data. For global consistency, DVSO can subsequently be fused with absolute references (e.g., landmarks) or other sensors. We provide a conceptual evaluation for accelerator tunnel environments, using the European XFEL as a case study. Expected benefits include reduced scale drift via stereo, low-cost sensing, and scalable data collection, while challenges remain in low-texture surfaces, lighting variability, computational load, and robustness under radiation. The paper defines a research agenda toward enabling MARWIN to navigate more autonomously in constrained, safety-critical infrastructures.
[CV-253] Diffusion-Based Synthetic Brightfield Microscopy Images for Enhanced Single Cell Detection
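[Quick read]: This paper addresses how data scarcity and annotation bottlenecks limit deep learning methods for single cell detection in brightfield microscopy. The key to the solution is an unconditional diffusion model that generates synthetic brightfield microscopy images, with training runs that vary the ratio of synthetic to real data. The results show that synthetic data can improve the accuracy of object detectors (YOLOv8, YOLOv9, and RT-DETR) while reducing the reliance on manual annotation. The generated images are highly realistic: experts could not distinguish them from real images.
Link: https://arxiv.org/abs/2512.00078
Authors: Mario de Jesus da Graca, Jörg Dahlkemper, Peer Stelldinger
Affiliations: Synentec GmbH; HAW Hamburg
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: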
Abstract:Accurate single cell detection in brightfield microscopy is crucial for biological research, yet data scarcity and annotation bottlenecks limit the progress of deep learning methods. We investigate the use of unconditional diffusion models to generate synthetic brightfield microscopy images and evaluate their impact on object detection performance. A U-Net based diffusion model was trained and used to create datasets with varying ratios of synthetic and real images. Experiments with YOLOv8, YOLOv9 and RT-DETR reveal that training with synthetic data can achieve improved detection accuracies at minimal cost. A human expert survey demonstrates the high realism of generated images, with experts unable to distinguish them from real microscopy images (accuracy 50%). Our findings suggest that diffusion-based synthetic data generation is a promising avenue for augmenting real datasets in microscopy image analysis, reducing the reliance on extensive manual annotation and potentially improving the robustness of cell detection models.
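The idea of building training sets with varying synthetic-to-real ratios can be sketched in a few lines of Python. This is an illustrative assumption about how such mixtures might be sampled, not the authors' pipeline; the function name `mix_dataset`, the file names, and the chosen ratios are all hypothetical.

```python
import random

def mix_dataset(real, synthetic, synthetic_ratio, size, seed=0):
    """Sample `size` items, of which round(size * synthetic_ratio) are synthetic."""
    if not 0.0 <= synthetic_ratio <= 1.0:
        raise ValueError("synthetic_ratio must be in [0, 1]")
    n_syn = round(size * synthetic_ratio)
    n_real = size - n_syn
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    mixed = rng.sample(synthetic, n_syn) + rng.sample(real, n_real)
    rng.shuffle(mixed)
    return mixed

# Hypothetical image lists standing in for real and generated micrographs.
real = [f"real_{i}.png" for i in range(100)]
synthetic = [f"syn_{i}.png" for i in range(100)]

for ratio in (0.0, 0.25, 0.5, 1.0):
    subset = mix_dataset(real, synthetic, ratio, size=40)
    n_syn = sum(name.startswith("syn_") for name in subset)
    print(ratio, n_syn, len(subset) - n_syn)
```

Holding the total size fixed while varying the ratio separates the effect of the synthetic share from the effect of simply having more training images.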
[CV-254] Arcadia: Toward a Full-Lifecycle Framework for Embodied Lifelong Learning
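[Quick read]: This paper addresses the difficulty that current embodied learning systems have in improving continuously and generalizing in practice: existing methods usually optimize a single link of the lifecycle (data collection, simulation, learning, or deployment), so performance stalls and does not transfer to new scenarios. The core of the solution is the Arcadia framework, which closes the loop by tightly coupling four stages: (1) self-evolving exploration and grounding for autonomous data acquisition in physical environments; (2) generative scene reconstruction and augmentation for realistic, extensible scene creation; (3) a shared embodied representation architecture that unifies navigation and manipulation in a single multimodal backbone; (4) sim-from-real evaluation and evolution that closes the feedback loop. This non-decomposable coupling is what sustains continuous improvement. Experiments show consistent gains on navigation and manipulation benchmarks and robust transfer to physical robots, supporting the lifecycle approach to long-term improvement and end-to-end generalization.
Link: https://arxiv.org/abs/2512.00076
Authors: Minghe Gao, Juncheng Li, Yuze Lin, Xuqi Liu, Jiaming Ji, Xiaoran Pan, Zihan Xu, Xian Li, Mingjie Li, Wei Ji, Rong Wei, Rui Tang, Qizhou Wang, Kai Shen, Jun Xiao, Qi Wu, Siliang Tang, Yueting Zhuang
Affiliations: Zhejiang University; Unitree Tech; Peking University; Nanjing University; Manycore Tech; Bytedance Seed; University of Adelaide
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: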
Abstract:We contend that embodied learning is fundamentally a lifecycle problem rather than a single-stage optimization. Systems that optimize only one link (data collection, simulation, learning, or deployment) rarely sustain improvement or generalize beyond narrow settings. We introduce Arcadia, a closed-loop framework that operationalizes embodied lifelong learning by tightly coupling four stages: (1) Self-evolving exploration and grounding for autonomous data acquisition in physical environments, (2) Generative scene reconstruction and augmentation for realistic and extensible scene creation, (3) a Shared embodied representation architecture that unifies navigation and manipulation within a single multimodal backbone, and (4) Sim-from-real evaluation and evolution that closes the feedback loop through simulation-based adaptation. This coupling is non-decomposable: removing any stage breaks the improvement loop and reverts to one-shot training. Arcadia delivers consistent gains on navigation and manipulation benchmarks and transfers robustly to physical robots, indicating that a tightly coupled lifecycle (continuous real-world data acquisition, generative simulation update, and shared-representation learning) supports lifelong improvement and end-to-end generalization. We release standardized interfaces enabling reproducible evaluation and cross-model comparison in reusable environments, positioning Arcadia as a scalable foundation for general-purpose embodied agents.
[CV-255] Adapter Shield: A Unified Framework with Built-in Authentication for Preventing Unauthorized Zero-Shot Image-to-Image Generation
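[Quick read]: This paper addresses the intellectual property risks of zero-shot image-to-image generation, such as unauthorized identity cloning and style imitation. Current methods use an image encoder to extract an embedding from the input image and feed it into the UNet of a diffusion model through cross-attention layers, which enables high-fidelity generation. To counter this risk, the authors propose Adapter Shield. Its core is a reversible encryption system: original embeddings are mapped to distinct encrypted representations according to different secret keys, and authorized users can recover the authentic embeddings with a decryption module and the correct key for legitimate generation. In addition, a multi-target adversarial perturbation method actively shifts the original embeddings toward designated encrypted patterns, so that unauthorized users can only produce distorted or encrypted outputs. Together these provide effective protection and controlled access for personal images.
Link: https://arxiv.org/abs/2512.00075
Authors: Jun Jia, Hongyi Miao, Yingjie Zhou, Wangqiu Zhou, Jianbo Zhang, Linhan Cao, Dandan Zhu, Hua Yang, Xiongkuo Min, Wei Sun, Guangtao Zhai
Affiliations: Shanghai Jiao Tong University; Shandong University; Hefei University of Technology; East China Normal University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: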
Abstract:With the rapid progress in diffusion models, image synthesis has advanced to the stage of zero-shot image-to-image generation, where high-fidelity replication of facial identities or artistic styles can be achieved using just one portrait or artwork, without modifying any model weights. Although these techniques significantly enhance creative possibilities, they also pose substantial risks related to intellectual property violations, including unauthorized identity cloning and stylistic imitation. To counter such threats, this work presents Adapter Shield, the first universal and authentication-integrated solution aimed at defending personal images from misuse in zero-shot generation scenarios. We first investigate how current zero-shot methods employ image encoders to extract embeddings from input images, which are subsequently fed into the UNet of diffusion models through cross-attention layers. Inspired by this mechanism, we construct a reversible encryption system that maps original embeddings into distinct encrypted representations according to different secret keys. The authorized users can restore the authentic embeddings via a decryption module and the correct key, enabling normal usage for authorized generation tasks. For protection purposes, we design a multi-target adversarial perturbation method that actively shifts the original embeddings toward designated encrypted patterns. Consequently, protected images are embedded with a defensive layer that ensures unauthorized users can only produce distorted or encrypted outputs. Extensive evaluations demonstrate that our method surpasses existing state-of-the-art defenses in blocking unauthorized zero-shot image synthesis, while supporting flexible and secure access control for verified users.
[CV-256] Bootstrap Dynamic-Aware 3D Visual Representation for Scalable Robot Learning
Link: https://arxiv.org/abs/2512.00074
Authors: Qiwei Liang, Boyang Cai, Minghao Lai, Sitong Zhuang, Tao Lin, Yan Qin, Yixuan Ye, Jiaming Liang, Renjing Xu
Affiliations: Hong Kong University of Science and Technology (Guangzhou); Shenzhen University; Beijing Jiaotong University; Central South University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-257] ProvRain: Rain-Adaptive Denoising and Vehicle Detection via MobileNet-UNet and Faster R-CNN
Link: https://arxiv.org/abs/2512.00073
Authors: Aswinkumar Varathakumaran, Nirmala Paramanandham
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
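[CV-258] Satellite to Street: Disaster Impact Estimator
[Quick read]: This paper addresses the fact that manual interpretation of satellite imagery for post-disaster structural damage assessment is slow, subjective, and hard to scale. Conventional U-Net-based segmentation models and change-detection models struggle with subtle structural variations and severe class imbalance, and therefore detect highly damaged regions poorly. The key to the solution is a deep learning framework with a dual-input U-Net architecture, Satellite-to-Street: Disaster Impact Estimator. An improved feature fusion mechanism captures both local structural changes and global context, and class-aware weighted loss functions counter the dominance of undamaged pixels, which improves the localization and classification of severely damaged regions.
Link: https://arxiv.org/abs/2512.00065
Authors: Sreesritha Sai, Sai Venkata Suma Sreeja, Deepthi, Nikhil
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 9 figures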
Abstract:Accurate post-disaster damage assessment is of high importance for prioritizing emergency response; however, manual interpretation of satellite imagery is slow, subjective, and hard to scale. While deep-learning models for image segmentation, such as U-Net-based and change-detection models, are useful baselines, they often struggle with subtle structural variations and severe class imbalance, yielding poor detection of highly damaged regions. The present work proposes a deep-learning framework that jointly processes pre- and post-disaster satellite images to obtain fine-grained pixel-level damage maps: Satellite-to-Street: Disaster Impact Estimator. The model uses a modified dual-input U-Net architecture with enhanced feature fusion to capture both local structural changes and broader contextual cues. Class-aware weighted loss functions are integrated in order to handle the dominance of undamaged pixels in real disaster datasets, thus enhancing sensitivity toward major and destroyed categories. Experimentation on publicly available disaster datasets shows improved localization and classification of structural damage when compared to traditional segmentation and baseline change-detection models. The resulting damage maps provide a rapid and consistent assessment mechanism to support and not replace expert decision-making, thus allowing more efficient, data-driven disaster management.
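A class-aware weighted loss of the kind mentioned in this abstract can be sketched in plain Python. The abstract does not state which weighting the authors use, so inverse-frequency weighting is an assumption here, and the names `inverse_frequency_weights` and `weighted_cross_entropy` and the four-class label layout are hypothetical.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels, num_classes):
    """weight_c = N / (num_classes * count_c); classes absent from `labels` get 0."""
    counts = Counter(labels)
    total = len(labels)
    return [total / (num_classes * counts[c]) if counts[c] else 0.0
            for c in range(num_classes)]

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean negative log-likelihood.

    probs[i][c] is the predicted probability of class c for pixel i.
    """
    weighted_nll = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    return weighted_nll / sum(weights[y] for y in labels)

# Hypothetical pixel labels: 0 = undamaged, 1 = minor, 2 = major, 3 = destroyed.
labels = [0] * 90 + [1] * 6 + [2] * 3 + [3] * 1
weights = inverse_frequency_weights(labels, num_classes=4)
print([round(w, 3) for w in weights])

# A model that predicts a uniform distribution for every pixel.
uniform = [[0.25] * 4 for _ in labels]
print(weighted_cross_entropy(uniform, labels, weights))
```

With these weights, one misclassified "destroyed" pixel costs as much as ninety misclassified "undamaged" pixels, which is the mechanism that shifts sensitivity toward the rare, severe categories.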
[CV-259] DL-CapsNet: A Deep and Light Capsule Network
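[Quick read]: This paper addresses the limited accuracy of conventional Convolutional Neural Networks (CNNs) on images with overlapping categories or affine transformations, as well as the large parameter counts and slow training and inference of existing Capsule Networks (CapsNet) on complex datasets. The key to the solution is a deep capsule network, DL-CapsNet, which stacks several capsule layers to increase expressive power and introduces a Capsule Summarization layer that substantially reduces the number of parameters. It achieves high accuracy with faster training and inference, and can handle complex datasets with a large number of categories.
Link: https://arxiv.org/abs/2512.00061
Authors: Pouya Shiri, Amirali Baniasadi
Affiliations: University of Victoria
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: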
Abstract:Capsule Network (CapsNet) is a promising classifier and a possible successor to classifiers built on Convolutional Neural Networks (CNNs). CapsNet is more accurate than CNNs in detecting images with overlapping categories and those with applied affine transformations. In this work, we propose a deep variant of CapsNet consisting of several capsule layers. In addition, we design the Capsule Summarization layer to reduce the complexity by reducing the number of parameters. DL-CapsNet, while being highly accurate, employs a small number of parameters and delivers faster training and inference. DL-CapsNet can process complex datasets with a high number of categories.
[CV-260] PEFT-DML: Parameter-Efficient Fine-Tuning Deep Metric Learning for Robust Multi-Modal 3D Object Detection in Autonomous Driving
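[Quick read]: This paper addresses the robustness of multi-modal 3D object detection in autonomous driving under sensor failure or unseen modality combinations. Conventional methods usually assume that all sensors are always available and cope poorly with changing real-world conditions. The key to the solution is the PEFT-DML framework, which maps multiple sensor modalities (LiDAR, radar, camera, IMU, GNSS) into a shared latent space and combines Low-Rank Adaptation (LoRA) with adapter layers for parameter-efficient training, improving robustness to fast motion, weather variability, and domain shifts.
Link: https://arxiv.org/abs/2512.00060
Authors: Abdolazim Rezaei, Mehdi Sookhak
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: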
Abstract:This study introduces PEFT-DML, a parameter-efficient deep metric learning framework for robust multi-modal 3D object detection in autonomous driving. Unlike conventional models that assume fixed sensor availability, PEFT-DML maps diverse modalities (LiDAR, radar, camera, IMU, GNSS) into a shared latent space, enabling reliable detection even under sensor dropout or unseen modality class combinations. By integrating Low-Rank Adaptation (LoRA) and adapter layers, PEFT-DML achieves significant training efficiency while enhancing robustness to fast motion, weather variability, and domain shifts. Experiments on the nuScenes benchmark demonstrate superior accuracy.
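The parameter savings that Low-Rank Adaptation (LoRA) offers can be seen in a minimal, dependency-free Python sketch. It shows the generic LoRA formulation, in which a frozen weight W is augmented by a trainable low-rank product scaled by alpha / r; the class name `LoRALinear`, the layer sizes, and the initialization constants are assumptions for illustration and do not reflect how PEFT-DML is actually configured.

```python
import random

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, d_in, d_out, r=4, alpha=8.0, seed=0):
        rng = random.Random(seed)
        # W stays frozen during fine-tuning.
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # A (r x d_in) and B (d_out x r) are the only trainable matrices.
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

layer = LoRALinear(d_in=64, d_out=64, r=4)
x = [1.0] * 64
print(layer.trainable_params(), 64 * 64)  # 512 4096
```

Because B starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only r * (d_in + d_out) parameters are trained instead of d_in * d_out.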
[CV-261] VISTAv2: World Imagination for Indoor Vision-and-Language Navigation
Link: https://arxiv.org/abs/2512.00041
Authors: Yanjia Huang, Xianshun Jiang, Xiangbo Gao, Mingyang Wu, Zhengzhong Tu
Affiliations: Texas A&M University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 5 figures
[CV-262] ICD-Net: Inertial Covariance Displacement Network for Drone Visual-Inertial SLAM
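[Quick read]: This paper addresses the degraded performance of visual-inertial SLAM systems in practice, caused by imperfect sensor calibration, measurement noise, rapid motion dynamics, low illumination, and the inherent limitations of traditional inertial navigation integration. The focus is on drones, where robust and accurate state estimation is critical. The key to the solution is the ICD-Net framework, which learns displacement estimates and their uncertainty directly from raw inertial measurements instead of relying on idealized inertial sensor models. The learned displacement constraints and their predicted covariances are added as residual terms to the VINS-Fusion optimization framework, where the uncertainty weights balance the neural network output against the traditional visual and inertial terms. This provides complementary information that compensates for several error sources in the SLAM pipeline. On high-speed drone sequences, trajectory estimation accuracy improves by more than 38% (mean APE) while real-time performance is maintained.
Link: https://arxiv.org/abs/2512.00037
Authors: Tali Orlev Shapira, Itzik Klein
Affiliations: University of Haifa
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: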
Abstract:Visual-inertial SLAM systems often exhibit suboptimal performance due to multiple confounding factors including imperfect sensor calibration, noisy measurements, rapid motion dynamics, low illumination, and the inherent limitations of traditional inertial navigation integration methods. These issues are particularly problematic in drone applications where robust and accurate state estimation is critical for safe autonomous operation. In this work, we present ICD-Net, a novel framework that enhances visual-inertial SLAM performance by learning to process raw inertial measurements and generating displacement estimates with associated uncertainty quantification. Rather than relying on analytical inertial sensor models that struggle with real-world sensor imperfections, our method directly extracts displacement maps from sensor data while simultaneously predicting measurement covariances that reflect estimation confidence. We integrate ICD-Net outputs as additional residual constraints into the VINS-Fusion optimization framework, where the predicted uncertainties appropriately weight the neural network contributions relative to traditional visual and inertial terms. The learned displacement constraints provide complementary information that compensates for various error sources in the SLAM pipeline. Our approach can be used under both normal operating conditions and in situations of camera inconsistency or visual degradation. Experimental evaluation on challenging high-speed drone sequences demonstrated that our approach significantly improved trajectory estimation accuracy compared to standard VINS-Fusion, with more than 38% improvement in mean APE and uncertainty estimates proving crucial for maintaining system robustness. Our method shows that neural network enhancement can effectively address multiple sources of SLAM degradation while maintaining real-time performance requirements.
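How a predicted covariance can weight a learned displacement constraint inside an optimizer is sketched below in Python. This is a generic squared Mahalanobis residual with a diagonal covariance, offered as an assumption about the general mechanism; it is not the VINS-Fusion residual code or the authors' implementation, and the function name `weighted_residual_cost` and the numbers are hypothetical.

```python
def weighted_residual_cost(predicted, estimated, variances):
    """Squared Mahalanobis distance r^T Sigma^{-1} r for a diagonal Sigma.

    predicted: displacement from the network (x, y, z)
    estimated: displacement implied by the current state estimate
    variances: per-axis variances predicted alongside the displacement
    """
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    return sum((p - e) ** 2 / v for p, e, v in zip(predicted, estimated, variances))

# Hypothetical displacements in metres.
net_displacement = [1.10, 0.00, 0.20]
state_displacement = [1.00, 0.10, 0.10]

# Same residual, two confidence levels.
confident = weighted_residual_cost(net_displacement, state_displacement, [0.01] * 3)
uncertain = weighted_residual_cost(net_displacement, state_displacement, [1.0] * 3)
print(confident, uncertain)
```

The same 10 cm disagreement per axis contributes roughly one hundred times more cost when the network reports low variance, so confident predictions pull the optimizer strongly while uncertain ones are largely ignored.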
[CV-263] A Survey on Improving Human Robot Collaboration through Vision-and-Language Navigation
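[Quick read]: This paper addresses the challenges of multi-robot collaboration in Vision-and-Language Navigation (VLN), including insufficient bidirectional communication, difficulty resolving ambiguity, and inefficient collaborative decision-making in multi-agent systems. The key recommendations are twofold. First, advanced natural language understanding (NLU) should support proactive clarification, real-time feedback, and contextual reasoning, which improves the semantic accuracy of navigation instructions. Second, decentralized decision-making frameworks with dynamic role assignment should support scalable, efficient multi-robot collaboration. Together these can strengthen human-robot interaction (HRI) and advance the deployment of VLN systems in healthcare, logistics, and disaster response.
Link: https://arxiv.org/abs/2512.00027
Authors: Nivedan Yakolli, Avinash Gautam, Abhijit Das, Yuankai Qi, Virendra Singh Shekhawat
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: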
Abstract:Vision-and-Language Navigation (VLN) is a multi-modal, cooperative task requiring agents to interpret human instructions, navigate 3D environments, and communicate effectively under ambiguity. This paper presents a comprehensive review of recent VLN advancements in robotics and outlines promising directions to improve multi-robot coordination. Despite progress, current models struggle with bidirectional communication, ambiguity resolution, and collaborative decision-making in multi-agent systems. We review approximately 200 relevant articles to provide an in-depth understanding of the current landscape. Through this survey, we aim to provide a thorough resource that inspires further research at the intersection of VLN and robotics. We advocate that future VLN systems should support proactive clarification, real-time feedback, and contextual reasoning through advanced natural language understanding (NLU) techniques. Additionally, decentralized decision-making frameworks with dynamic role assignment are essential for scalable, efficient multi-robot collaboration. These innovations can significantly enhance human-robot interaction (HRI) and enable real-world deployment in domains such as healthcare, logistics, and disaster response.
[CV-264] Learning from Watching: Scalable Extraction of Manipulation Trajectories from Human Videos
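[Quick read]: This paper addresses the high cost and low efficiency of collecting high-quality data for training large-scale robotic models, in particular the labor and expense of teleoperation or scripted demonstrations on real robot platforms. Existing methods mostly focus on hand detection or object pose estimation and do not fully exploit the rich interaction cues in online videos of human manipulation. The key to the solution is to combine large foundation models for video understanding with point tracking in order to extract dense trajectories of task-relevant keypoints over the whole manipulation process. This makes fuller use of Internet-scale human demonstration videos and improves the scalability of data collection and the efficiency of learning.
Link: https://arxiv.org/abs/2512.00024
Authors: X. Hu, G. Ye
Affiliations: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to RSS 2025 Workshop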
Abstract:Collecting high-quality data for training large-scale robotic models typically relies on real robot platforms, which is labor-intensive and costly, whether via teleoperation or scripted demonstrations. To scale data collection, many researchers have turned to leveraging human manipulation videos available online. However, current methods predominantly focus on hand detection or object pose estimation, failing to fully exploit the rich interaction cues embedded in these videos. In this work, we propose a novel approach that combines large foundation models for video understanding with point tracking techniques to extract dense trajectories of all task-relevant keypoints during manipulation. This enables more comprehensive utilization of Internet-scale human demonstration videos. Experimental results demonstrate that our method can accurately track keypoints throughout the entire manipulation process, paving the way for more scalable and data-efficient robot learning.
[CV-265] Foundation Models for Trajectory Planning in Autonomous Driving: A Review of Progress and Open Challenges
Link: https://arxiv.org/abs/2512.00021
Authors: Kemal Oksuz, Alexandru Buburuzan, Anthony Knittel, Yuhan Yao, Puneet K. Dokania
Affiliations: Five AI Ltd.; Robert Bosch GmbH
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review
[CV-266] A Comprehensive Survey on Surgical Digital Twin
Link: https://arxiv.org/abs/2512.00019
Authors: Afsah Sharaf Khan, Falong Fan, Doohwan DH Kim, Abdurrahman Alshareef, Dong Chen, Justin Kim, Ernest Carter, Bo Liu, Jerzy W. Rozenblit, Bernard Zeigler
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-267] MOTION: ML-Assisted On-Device Low-Latency Motion Recognition
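[Quick read]: This paper addresses the difficulty of low-latency gesture recognition on embedded devices, especially in medical monitoring, where motion tracking must be fast, efficient, and reliable without raising false alarms. The key to the solution is an AutoML pipeline that automatically extracts the most important features from segments of triaxial accelerometer data, followed by training several lightweight machine learning models on those features. Deployed on the WeBe Band, a multi-sensor wearable with sufficient compute, the neural network offered the best balance between accuracy, latency, and memory use, enabling reliable real-time gesture recognition and a feasible path for medical monitoring applications that need a fast response.
Link: https://arxiv.org/abs/2512.00008
Authors: Veeramani Pugazhenthi, Wei-Hsiang Chu, Junwei Lu, Jadyn N. Miyahira, Soheil Salehi
Affiliations: University of Arizona; University of California, Berkeley
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: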
Abstract:The use of tiny devices capable of low-latency gesture recognition is gaining momentum in everyday human-computer interaction and especially in medical monitoring fields. Embedded solutions such as fall detection, rehabilitation tracking, and patient supervision require fast and efficient tracking of movements while avoiding unwanted false alarms. This study presents an approach to building efficient motion-based models using only triaxial accelerometer sensors. We explore the capability of AutoML pipelines to extract the most important features from the data segments. This approach also involves training multiple lightweight machine learning algorithms using the extracted features. We use WeBe Band, a multi-sensor wearable device equipped with an MCU powerful enough to perform gesture recognition entirely on the device. Of the models explored, we found that the neural network provided the best balance between accuracy, latency, and memory use. Our results also demonstrate that reliable real-time gesture recognition can be achieved in WeBe Band, with great potential for real-time medical monitoring solutions that require a secure and fast response time.
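Extracting features from segments of triaxial accelerometer data can be sketched with the Python standard library. The abstract does not list the features chosen by the AutoML pipeline, so the per-axis mean, standard deviation, and mean magnitude used here are assumptions, as are the function names `sliding_windows` and `window_features`, the window sizes, and the sample data.

```python
import math
import statistics

def sliding_windows(samples, size, step):
    """Yield fixed-length windows of (x, y, z) samples."""
    for start in range(0, len(samples) - size + 1, step):
        yield samples[start:start + size]

def window_features(window):
    """Per-axis mean and standard deviation plus mean acceleration magnitude."""
    features = {}
    for axis, name in enumerate("xyz"):
        values = [s[axis] for s in window]
        features[f"mean_{name}"] = statistics.fmean(values)
        features[f"std_{name}"] = statistics.pstdev(values)
    features["mean_magnitude"] = statistics.fmean(
        math.sqrt(x * x + y * y + z * z) for x, y, z in window)
    return features

# Hypothetical data: a device at rest, reading roughly 1 g on the z axis.
samples = [(0.0, 0.0, 1.0)] * 100
rows = [window_features(w) for w in sliding_windows(samples, size=50, step=25)]
print(len(rows), rows[0]["mean_magnitude"])
```

Each window becomes one fixed-length feature row, which is the form that lightweight classifiers on a microcontroller typically consume.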
[CV-268] From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings
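[Quick read]: This paper addresses the difficulty of using large volumes of unlabeled human demonstration video from industrial settings to pre-train Vision-Language-Action (VLA) models. Existing methods usually rely on manual annotation or structured data, whereas real industrial video streams are mostly continuous, unlabeled, and unstructured, which limits the scalability of VLA models in manufacturing. The key to the solution is a fully automated end-to-end framework: it first trains a lightweight motion tokenizer to encode motion dynamics, then uses a novel "Latent Action Energy" metric to segment actions without supervision and discover semantically coherent action primitives. The method outputs structured video clips with their corresponding latent action sequences, which can be used directly for VLA pre-training, and improves the ability to extract high-quality pre-training data from raw industrial video.
Link: https://arxiv.org/abs/2511.21428
Authors: Jiajie Zhang,Sören Schwertfeger,Alexander Kleiner
Affiliations: ShanghaiTech University; Hangzhou Dianzi University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Notes: 10 pages, 5 figures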
Abstract:We present a novel unsupervised framework to unlock vast unlabeled human demonstration data from continuous industrial video streams for Vision-Language-Action (VLA) model pre-training. Our method first trains a lightweight motion tokenizer to encode motion dynamics, then employs an unsupervised action segmenter leveraging a novel “Latent Action Energy” metric to discover and segment semantically coherent action primitives. The pipeline outputs both segmented video clips and their corresponding latent action sequences, providing structured data directly suitable for VLA pre-training. Evaluations on public benchmarks and a proprietary electric motor assembly dataset demonstrate effective segmentation of key tasks performed by humans at workstations. Further clustering and quantitative assessment via a Vision-Language Model confirm the semantic coherence of the discovered action primitives. To our knowledge, this is the first fully automated end-to-end system for extracting and organizing VLA pre-training data from unstructured industrial videos, offering a scalable solution for embodied AI integration in manufacturing.
[CV-269] Disentangling Progress in Medical Image Registration: Beyond Trend-Driven Architectures towards Domain-Specific Strategies
Link: https://arxiv.org/abs/2512.01913
Authors: Bailiang Jian,Jiazhen Pan,Rohit Jena,Morteza Ghahremani,Hongwei Bran Li,Daniel Rueckert,Christian Wachinger,Benedikt Wiestler
Affiliations: Technical University of Munich, Munich, Germany; Munich Center for Machine Learning (MCML), Munich, Germany; Imperial College London, London, England; University of Pennsylvania, Pennsylvania, USA; National University of Singapore, Singapore
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Notes: Submitted to Medical Image Analysis. Journal Extension of arXiv:2407.19274
[CV-270] Panda: Self-distillation of Reusable Sensor-level Representations for High Energy Physics
Link: https://arxiv.org/abs/2512.01324
Authors: Samuel Young,Kazuhiro Terao
Affiliations: Stanford University; SLAC National Accelerator Laboratory
Categories: High Energy Physics - Experiment (hep-ex); Computer Vision and Pattern Recognition (cs.CV)
Notes: 23 pages, 15 figures, preprint. Project page at this https URL
[CV-271] MedCondDiff: Lightweight Robust Semantically Guided Diffusion for Medical Image Segmentation
Link: https://arxiv.org/abs/2512.00350
Authors: Ruirui Huang,Jiacheng Li
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes:
[CV-272] Coarse-to-Fine Non-Rigid Registration for Side-Scan Sonar Mosaicking
Link: https://arxiv.org/abs/2512.00052
Authors: Can Lei,Nuno Gracias,Rafael Garcia,Hayat Rajani,Huigang Wang
Affiliations: Northwestern Polytechnical University; Research & Development Institute of Northwestern Polytechnical University in Shenzhen; University of Girona
Categories: Geophysics (physics.geo-ph); Computer Vision and Pattern Recognition (cs.CV)
Notes:
Artificial Intelligence
[AI-0] A Diffusion Model Framework for Maximum Entropy Reinforcement Learning
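[Quick read]: This paper addresses inefficient policy sampling in Maximum Entropy Reinforcement Learning (MaxEntRL) and the difficulty of closely approximating the optimal policy distribution. Its core idea is to reinterpret MaxEntRL as a diffusion-model-based sampling problem: the optimal policy distribution is approximated by minimizing the reverse Kullback-Leibler (KL) divergence, optimized through a tractable upper bound. Applying the policy gradient theorem to this objective yields a modified surrogate objective that incorporates diffusion dynamics, so that diffusion fits naturally into classic algorithms such as Soft Actor-Critic (SAC), Proximal Policy Optimization (PPO), and Wasserstein Policy Optimization (WPO), giving DiffSAC, DiffPPO, and DiffWPO. The key innovation is modeling the policy sampling path as a diffusion process, which improves sample efficiency and returns.
Link: https://arxiv.org/abs/2512.02019
Authors: Sebastian Sanokowski,Kaustubh Patil,Alois Knoll
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Notes: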
Abstract:Diffusion models have achieved remarkable success in data-driven learning and in sampling from complex, unnormalized target distributions. Building on this progress, we reinterpret Maximum Entropy Reinforcement Learning (MaxEntRL) as a diffusion model-based sampling problem. We tackle this problem by minimizing the reverse Kullback-Leibler (KL) divergence between the diffusion policy and the optimal policy distribution using a tractable upper bound. By applying the policy gradient theorem to this objective, we derive a modified surrogate objective for MaxEntRL that incorporates diffusion dynamics in a principled way. This leads to simple diffusion-based variants of Soft Actor-Critic (SAC), Proximal Policy Optimization (PPO) and Wasserstein Policy Optimization (WPO), termed DiffSAC, DiffPPO and DiffWPO. All of these methods require only minor implementation changes to their base algorithm. We find that on standard continuous control benchmarks, DiffSAC, DiffPPO and DiffWPO achieve better returns and higher sample efficiency than SAC and PPO.
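For readers unfamiliar with the setup, the standard MaxEnt RL objective, its optimal policy, and the reverse KL divergence mentioned in the abstract can be written as follows. The notation (temperature $\alpha$, discount $\gamma$, soft Q-function $Q^{*}_{\mathrm{soft}}$, normalizer $Z$) is the usual textbook one and is an assumption here, not taken from the paper; the paper's tractable upper bound is not reproduced.

```latex
J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t}\big(r(s_t,a_t) + \alpha\,\mathcal{H}(\pi(\cdot\mid s_t))\big)\Big],
\qquad
\pi^{*}(a\mid s) = \frac{\exp\!\big(Q^{*}_{\mathrm{soft}}(s,a)/\alpha\big)}{Z(s)}

D_{\mathrm{KL}}\big(\pi_\theta(\cdot\mid s)\,\|\,\pi^{*}(\cdot\mid s)\big)
= \mathbb{E}_{a\sim\pi_\theta}\big[\log \pi_\theta(a\mid s) - Q^{*}_{\mathrm{soft}}(s,a)/\alpha\big] + \log Z(s)
```

Since $\log Z(s)$ does not depend on $\theta$, minimizing the reverse KL trades off high soft Q-values against the entropy of $\pi_\theta$.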
[AI-1] Learning Sim-to-Real Humanoid Locomotion in 15 Minutes FAST
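[Quick read]: This paper addresses the difficulty of fast, reliable sim-to-real reinforcement learning (RL) transfer under massively parallel simulation, especially for humanoid robot control with high-dimensional state spaces and strong domain randomization. The key to the solution is a simple, practical training recipe based on off-policy RL algorithms (FastSAC and FastTD3). With carefully tuned design choices and minimalist reward functions, training remains stable across thousands of parallel environments, and a humanoid locomotion policy is learned end to end in 15 minutes on a single RTX 4090 GPU. Robustness to randomized dynamics, rough terrain, and external pushes is demonstrated on Unitree G1 and Booster T1 robots.
Link: https://arxiv.org/abs/2512.01996
Authors: Younggyo Seo,Carmelo Sferrazza,Juyue Chen,Guanya Shi,Rocky Duan,Pieter Abbeel
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Project website: this https URL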
Abstract:Massively parallel simulation has reduced reinforcement learning (RL) training time for robots from days to minutes. However, achieving fast and reliable sim-to-real RL for humanoid control remains difficult due to the challenges introduced by factors such as high dimensionality and domain randomization. In this work, we introduce a simple and practical recipe based on off-policy RL algorithms, i.e., FastSAC and FastTD3, that enables rapid training of humanoid locomotion policies in just 15 minutes with a single RTX 4090 GPU. Our simple recipe stabilizes off-policy RL algorithms at massive scale with thousands of parallel environments through carefully tuned design choices and minimalist reward functions. We demonstrate rapid end-to-end learning of humanoid locomotion controllers on Unitree G1 and Booster T1 robots under strong domain randomization, e.g., randomized dynamics, rough terrain, and push perturbations, as well as fast training of whole-body human-motion tracking policies. We provide videos and open-source implementation at: this https URL.
[AI-2] Forecasting in Offline Reinforcement Learning for Non-stationary Environments NEURIPS2025
[Quick read]: This paper addresses temporal non-stationarity in real-world offline reinforcement learning (Offline RL), in particular abrupt, time-varying offsets that make state observations partial, causing agents to misjudge their true state and lose performance. The key to the solution is the Forecasting in Non-stationary Offline RL (FORL) framework, whose core components are: (i) conditional diffusion-based candidate state generation, which does not presuppose any specific pattern of future non-stationarity; and (ii) zero-shot time-series foundation models, which let the agent use past experience for forecasting and state correction without additional interaction data, improving robustness from the start of each episode in abruptly non-stationary environments.
Link: https://arxiv.org/abs/2512.01987
Authors: Suzan Ece Ada,Georg Martius,Emre Ugur,Erhan Oztop
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Notes: The Thirty-Ninth Annual Conference on Neural Information Processing Systems, NeurIPS 2025
Abstract:Offline Reinforcement Learning (RL) provides a promising avenue for training policies from pre-collected datasets when gathering additional interaction data is infeasible. However, existing offline RL methods often assume stationarity or only consider synthetic perturbations at test time, assumptions that often fail in real-world scenarios characterized by abrupt, time-varying offsets. These offsets can lead to partial observability, causing agents to misperceive their true state and degrade performance. To overcome this challenge, we introduce Forecasting in Non-stationary Offline RL (FORL), a framework that unifies (i) conditional diffusion-based candidate state generation, trained without presupposing any specific pattern of future non-stationarity, and (ii) zero-shot time-series foundation models. FORL targets environments prone to unexpected, potentially non-Markovian offsets, requiring robust agent performance from the onset of each episode. Empirical evaluations on offline RL benchmarks, augmented with real-world time-series data to simulate realistic non-stationarity, demonstrate that FORL consistently improves performance compared to competitive baselines. By integrating zero-shot forecasting with the agent’s experience, we aim to bridge the gap between offline RL and the complexities of real-world, non-stationary environments.
[AI-3] AI-Driven Optimization under Uncertainty for Mineral Processing Operations
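[Quick read]: This paper addresses the limited efficiency of mineral processing under feedstock uncertainty and process model uncertainty. The key to the solution is to model mineral processing as a Partially Observable Markov Decision Process (POMDP) that combines the twin goals of information gathering (that is, uncertainty reduction) and process optimization. Without additional hardware, this optimizes the operation of typical process units such as a flotation cell and shows the potential to improve overall economic objectives such as net present value (NPV). The method provides a scalable mathematical and computational framework and lays the groundwork for laboratory design of experiments and intelligent control of industrial-scale mineral processing circuits.
Link: https://arxiv.org/abs/2512.01977
Authors: William Xu,Amir Eskanlou,Mansur Arief,David Zhen Yin,Jef K. Caers
Affiliations: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Notes: 27 pages, 13 figures, submitted to Sustainable Earth Resources Communications (SERC)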
Abstract:The global capacity for mineral processing must expand rapidly to meet the demand for critical minerals, which are essential for building the clean energy technologies necessary to mitigate climate change. However, the efficiency of mineral processing is severely limited by uncertainty, which arises from both the variability of feedstock and the complexity of process dynamics. To optimize mineral processing circuits under uncertainty, we introduce an AI-driven approach that formulates mineral processing as a Partially Observable Markov Decision Process (POMDP). We demonstrate the capabilities of this approach in handling both feedstock uncertainty and process model uncertainty to optimize the operation of a simulated, simplified flotation cell as an example. We show that by integrating the process of information gathering (i.e., uncertainty reduction) and process optimization, this approach has the potential to consistently perform better than traditional approaches at maximizing an overall objective, such as net present value (NPV). Our methodological demonstration of this optimization-under-uncertainty approach for a synthetic case provides a mathematical and computational framework for later real-world application, with the potential to improve both the laboratory-scale design of experiments and industrial-scale operation of mineral processing circuits without any additional hardware.
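A POMDP agent acts on a belief, a probability distribution over hidden states that is updated after every action and observation. The sketch below shows the standard discrete Bayes-filter update. The two-state ore-grade example, the `assay` action, and all probabilities are hypothetical values invented for illustration; they are not taken from the paper's flotation-cell model.

```python
def belief_update(belief, action, observation, transition, obs_model):
    """Discrete Bayes filter: b'(s') is proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s)."""
    states = list(belief)
    unnormalised = {}
    for s2 in states:
        predicted = sum(transition[(s, action)][s2] * belief[s] for s in states)
        unnormalised[s2] = obs_model[(s2, action)][observation] * predicted
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / total for s, p in unnormalised.items()}


# Hypothetical example: the feed is either high grade or low grade and does not change.
states = ("high_grade", "low_grade")
transition = {(s, "assay"): {s2: 1.0 if s2 == s else 0.0 for s2 in states} for s in states}
obs_model = {
    ("high_grade", "assay"): {"rich": 0.8, "poor": 0.2},
    ("low_grade", "assay"): {"rich": 0.3, "poor": 0.7},
}
prior = {"high_grade": 0.5, "low_grade": 0.5}
posterior = belief_update(prior, "assay", "rich", transition, obs_model)
```

An information-gathering action such as the assay has no direct payoff; its value comes from sharpening the belief that later operating decisions depend on.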
[AI-4] Learned-Rule-Augmented Large Language Model Evaluators
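[Quick read]: This paper addresses the limited generalization of large language models (LLMs) used as evaluators, especially on tasks beyond natural language generation (NLG), where existing methods depend on costly human-designed evaluation rules that are misaligned with both the annotated data and the LLM's own reasoning. The key to the solution is a rule-augmented evaluation paradigm with two parts: (1) scoring rules are distilled automatically from data using LLM-assisted Monte Carlo Tree Search (MCTS), which makes rule extraction more scalable and better aligned with the data; (2) two strategies let the LLM apply the learned rules: Chain-of-Rule (CoR), which guides the LLM to follow the rules, and a rule-augmented LLM evaluator (RuAE) trained with reinforcement learning, which further narrows the gap between the rules and the LLM's reasoning.
Link: https://arxiv.org/abs/2512.01958
Authors: Jie Meng,Jin Mao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: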
Abstract:Large language models (LLMs) are predominantly used as evaluators for natural language generation (NLG) tasks, but their application to broader evaluation scenarios remains limited. In this work, we explore the potential of LLMs as general evaluators across diverse tasks. Although LLM-based evaluators have made progress in different areas, existing methods struggle to generalize due to their reliance on costly, human-designed evaluation principles, which are often misaligned with both annotated data and LLMs’ reasoning. To address these challenges, we propose a rule-augmented evaluation paradigm. First, we introduce a rule distillation method that automatically extracts scoring rules from data using an LLM-assisted Monte Carlo Tree Search (MCTS), alleviating scalability issues and improving alignment with data. Second, to enable LLMs to effectively apply the learned rules, we propose two strategies: (1) Chain-of-Rule (CoR), which guides LLM to follow distilled rules, and (2) training a rule-augmented LLM evaluator (RuAE) via reinforcement learning, further bridging the gap between rules and LLMs’ reasoning. Extensive experiments on diverse tasks demonstrate the effectiveness and generalizability of our approach across various evaluation scenarios.
[AI-5] An Empirical Study of Agent Developer Practices in AI Agent Frameworks
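[Quick read]: This paper addresses several underexplored issues in practical development with agent frameworks driven by large language models (LLMs): how frameworks shape the development process, which challenges developers share, and why choosing a suitable framework is hard. The authors conduct the first empirical study of LLM-based agent frameworks, collecting and analyzing 11,910 developer discussions and comparing ten mainstream agent frameworks along five dimensions: development efficiency, functional abstraction, learning cost, performance optimization, and maintainability. The key to the approach is quantitative and qualitative analysis grounded in real-world developer experience, which exposes each framework's strengths and weaknesses and yields actionable directions and practical guidance for future agent framework design.
Link: https://arxiv.org/abs/2512.01939
Authors: Yanlin Wang,Xinyi Xu,Jiachi Chen,Tingting Bi,Wenchao Gu,Zibin Zheng
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Notes: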
Abstract:The rise of large language models (LLMs) has sparked a surge of interest in agents, leading to the rapid growth of agent frameworks. Agent frameworks are software toolkits and libraries that provide standardized components, abstractions, and orchestration mechanisms to simplify agent development. Despite widespread use of agent frameworks, their practical applications and how they influence the agent development process remain underexplored. Different agent frameworks encounter similar problems during use, indicating that these recurring issues deserve greater attention and call for further improvements in agent framework design. Meanwhile, as the number of agent frameworks continues to grow and evolve, more than 80% of developers report difficulties in identifying the frameworks that best meet their specific development requirements. In this paper, we conduct the first empirical study of LLM-based agent frameworks, exploring real-world experiences of developers in building AI agents. To compare how well the agent frameworks meet developer needs, we further collect developer discussions for the ten previously identified agent frameworks, resulting in a total of 11,910 discussions. Finally, by analyzing these discussions, we compare the frameworks across five dimensions: development efficiency, functional abstraction, learning cost, performance optimization, and maintainability, which refers to how easily developers can update and extend both the framework itself and the agents built upon it over time. Our comparative analysis reveals significant differences among frameworks in how they meet the needs of agent developers. Overall, we provide a set of findings and implications for the LLM-driven AI agent framework ecosystem and offer insights for the design of future LLM-based agent frameworks and agent developers.
[AI-6] SVRG and Beyond via Posterior Correction
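[Quick read]: This paper addresses the limited success of Stochastic Variance Reduced Gradient (SVRG) in deep learning, where the core challenge is making SVRG efficient and stable when training complex models. The key to the solution is a new theoretical connection between SVRG and Bayesian posterior correction: the authors show that SVRG is a special case of posterior correction over the isotropic-Gaussian family, and that more flexible exponential families naturally yield new SVRG variants. Within this framework the paper proposes two variants: a Newton-like variant with a new Hessian correction, and an Adam-like extension that improves pretraining and finetuning of Transformer language models. It is the first work to connect SVRG to Bayes and use that connection to improve variational training of deep networks.
Link: https://arxiv.org/abs/2512.01930
Authors: Nico Daheim,Thomas Möllenhoff,Ming Liang Ang,Mohammad Emtiyaz Khan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: Preprint. Under review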
Abstract:Stochastic Variance Reduced Gradient (SVRG) and its variants aim to speed-up training by using gradient corrections, but have seen limited success in deep learning. Here, we show surprising new foundational connections of SVRG to a recently proposed Bayesian method called posterior correction. Specifically, we show that SVRG is recovered as a special case of posterior correction over the isotropic-Gaussian family, while novel extensions are automatically obtained by using more flexible exponential families. We derive two new SVRG variants by using Gaussian families: First, a Newton-like variant that employs novel Hessian corrections, and second, an Adam-like extension that improves pretraining and finetuning of Transformer language models. This is the first work to connect SVRG to Bayes and use it to boost variational training for deep networks.
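For reference, the baseline that the paper generalizes is plain SVRG: each stochastic gradient is corrected by the same sample's gradient at a snapshot point plus the full gradient at that snapshot. The sketch below applies it to a one-dimensional least-squares problem. The toy data, step size, and epoch count are assumptions for illustration, and nothing here implements the paper's posterior-correction variants.

```python
import random


def svrg(a, b, w0=0.0, lr=0.1, epochs=30, inner=None, seed=0):
    """Minimise (1/n) * sum_i 0.5 * (a[i] * w - b[i])**2 with plain SVRG."""
    rng = random.Random(seed)
    n = len(a)
    inner = inner or n

    def grad_i(w, i):
        # Gradient of the i-th loss term with respect to w.
        return a[i] * (a[i] * w - b[i])

    w = w0
    for _ in range(epochs):
        w_snap = w  # snapshot point for this epoch
        full_grad = sum(grad_i(w_snap, i) for i in range(n)) / n
        for _ in range(inner):
            i = rng.randrange(n)
            # Variance-reduced gradient: stochastic term, minus its value
            # at the snapshot, plus the full gradient at the snapshot.
            g = grad_i(w, i) - grad_i(w_snap, i) + full_grad
            w -= lr * g
    return w


# Toy problem whose exact minimiser is w = 2.
w_hat = svrg([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

The correction term has zero mean over the sampled index, so the update stays unbiased while its variance shrinks as `w` approaches the snapshot.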
[AI-7] Real-World Robot Control by Deep Active Inference With a Temporally Hierarchical World Model
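[Quick read]: This paper addresses the challenge of robots performing both goal-directed and exploratory actions in uncertain real-world environments, in particular that existing deep-learning-based control methods generally neglect exploration and perform poorly under uncertainty. The key to the solution is a new deep active inference framework built from a world model, an action model, and an abstract world model operating at multiple timescales. The world model captures environmental dynamics in hidden states at slow and fast timescales; the action model compresses action sequences into abstract actions using vector quantization; and the abstract world model predicts future slow states from the abstract action, enabling low-cost action selection. The design improves adaptability and computational efficiency on complex tasks and supports switching between goal-directed and exploratory behavior in uncertain settings.
Link: https://arxiv.org/abs/2512.01924
Authors: Kentaro Fujii,Shingo Murata
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Accepted for publication in IEEE Robotics and Automation Letters (RA-L)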
Abstract:Robots in uncertain real-world environments must perform both goal-directed and exploratory actions. However, most deep learning-based control methods neglect exploration and struggle under uncertainty. To address this, we adopt deep active inference, a framework that accounts for human goal-directed and exploratory actions. Yet, conventional deep active inference approaches face challenges due to limited environmental representation capacity and high computational cost in action selection. We propose a novel deep active inference framework that consists of a world model, an action model, and an abstract world model. The world model encodes environmental dynamics into hidden state representations at slow and fast timescales. The action model compresses action sequences into abstract actions using vector quantization, and the abstract world model predicts future slow states conditioned on the abstract action, enabling low-cost action selection. We evaluate the framework on object-manipulation tasks with a real-world robot. Results show that it achieves high success rates across diverse manipulation tasks and switches between goal-directed and exploratory actions in uncertain settings, while making action selection computationally tractable. These findings highlight the importance of modeling multiple timescale dynamics and abstracting actions and state transitions.
[AI-8] Unifying Sign and Magnitude for Optimizing Deep Vision Networks via ThermoLion
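[Quick read]: This paper addresses the imbalance between convergence efficiency and accuracy that arises when optimizers for deep vision models allocate information channel capacity statically under high-dimensional stochastic noise. Magnitude-based methods such as AdamW work well in smooth regions but tend to amplify noise in rugged non-convex landscapes, while the sign-based Lion is robust but discards fine-grained descent information through its 1-bit quantization. The key to the solution is the ThermoLion framework, with two innovations: (1) a gating mechanism based on the local Signal-to-Noise Ratio (SNR) dynamically adjusts the bitrate of parameter updates, favoring generalization in a low-bit exploration phase and convergence speed in a high-precision exploitation phase; (2) a Momentum Alignment mechanism detects constructive interference between historical drift and the instantaneous gradient to accelerate convergence along stable trajectories. Empirically, ThermoLion surpasses AdamW and Lion in both convergence speed and final accuracy on 12 vision datasets without architecture-specific tuning.
Link: https://arxiv.org/abs/2512.01881
Authors: Ahmed Nebli
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: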
Abstract:The training of deep vision models is fundamentally a signal recovery problem amidst high-dimensional stochastic noise. Current optimization paradigms impose a static compromise on information channel capacity. For instance, magnitude-based methods, such as AdamW, operate on the assumption that gradient norms are high-fidelity curvature signals. While this allows for precision in smooth regimes, it leads to catastrophic noise amplification when applied to rugged, non-convex landscapes. Conversely, sign-based methods (e.g., Lion) perform a radical 1-bit quantization of the gradient, which aims to provide robust regularization at the cost of discarding fine-grained descent information. We propose that optimal convergence requires neither static prior, but rather a dynamic modulation of the update bitrate. We introduce ThermoLion, a vision-centric framework that utilizes local Signal-to-Noise Ratio (SNR) gating to autonomously transition parameters between a “low-bit” exploration phase and a “high-precision” exploitation phase. Furthermore, we introduce a Momentum Alignment mechanism that detects constructive interference between historical drift and instantaneous gradients to accelerate convergence during stable trajectories. Empirical benchmarks across 12 diverse vision datasets (including CIFAR, SVHN, and GTSRB) demonstrate that ThermoLion serves as a hyperparameter-free generalist, surpassing both AdamW and Lion in convergence speed and terminal accuracy without architecture-specific tuning.
[AI-9] Predicting Human Chess Moves: An AI Assisted Analysis of Chess Games Using Skill-group Specific n-gram Language Models
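[Quick read]: This paper addresses the fact that traditional chess engines ignore differences in player skill when analyzing human games: existing tools focus on computing optimal moves and overlook how behavior varies with proficiency. Its core solution is a move prediction framework based on n-gram language models, which treats move prediction as recognizing skill-dependent sequence patterns. The key innovation is to divide players into seven skill groups from novice to expert, train a separate language model for each group, and dynamically select the best-matching model for move prediction. This improves prediction accuracy by up to 39.1% over the baseline while remaining computationally efficient enough for real-time analysis.
Link: https://arxiv.org/abs/2512.01880
Authors: Daren Zhong,Dingcheng Huang,Clayton Greenberg
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: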
Abstract:Chess, a deterministic game with perfect information, has long served as a benchmark for studying strategic decision-making and artificial intelligence. Traditional chess engines or tools for analysis primarily focus on calculating optimal moves, often neglecting the variability inherent in human chess playing, particularly across different skill levels. To overcome this limitation, we propose a novel and computationally efficient move prediction framework that approaches chess move prediction as a behavioral analysis task. The framework employs n-gram language models to capture move patterns characteristic of specific player skill levels. By dividing players into seven distinct skill groups, from novice to expert, we trained separate models using data from the open-source chess platform Lichess. The framework dynamically selects the most suitable model for prediction tasks and generates player moves based on preceding sequences. Evaluation on real-world game data demonstrates that the model selector module within the framework can classify skill levels with an accuracy of up to 31.7% when utilizing early game information (16 half-moves). The move prediction framework also shows substantial accuracy improvements, with our Selector Assisted Accuracy being up to 39.1% more accurate than our benchmark accuracy. The computational efficiency of the framework further enhances its suitability for real-time chess analysis.
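The idea of one n-gram model per skill group plus a model selector can be sketched in a few lines. The version below is a simplification under stated assumptions: it uses bigrams (n = 2) with add-alpha smoothing, two made-up skill groups instead of seven, and tiny invented game lists instead of Lichess data. The class and function names are hypothetical and do not come from the paper.

```python
import math
from collections import Counter, defaultdict


class BigramMoveModel:
    """Add-alpha smoothed bigram model over move tokens."""

    def __init__(self, games, alpha=1.0):
        self.alpha = alpha
        self.counts = defaultdict(Counter)
        self.vocab = set()
        for game in games:
            self.vocab.update(game)
            # Pair each move with its predecessor; "<s>" marks the game start.
            for prev, nxt in zip(["<s>"] + list(game), game):
                self.counts[prev][nxt] += 1

    def prob(self, prev, nxt):
        seen = self.counts.get(prev, Counter())
        # One extra vocabulary slot is reserved for moves never seen in training.
        denom = sum(seen.values()) + self.alpha * (len(self.vocab) + 1)
        return (seen[nxt] + self.alpha) / denom

    def log_likelihood(self, moves):
        pairs = zip(["<s>"] + list(moves), moves)
        return sum(math.log(self.prob(p, n)) for p, n in pairs)

    def predict(self, prev):
        # Most probable known move after `prev`; ties broken alphabetically.
        return max(sorted(self.vocab), key=lambda m: self.prob(prev, m))


def select_model(models, prefix):
    """Pick the skill group whose model assigns the prefix the highest likelihood."""
    return max(models, key=lambda name: models[name].log_likelihood(prefix))


# Invented toy data for two hypothetical skill groups.
models = {
    "novice": BigramMoveModel([["e4", "e5", "Qh5", "Nc6"], ["e4", "e5", "Qh5", "g6"]]),
    "expert": BigramMoveModel([["e4", "c5", "Nf3", "d6"], ["d4", "Nf6", "c4", "e6"]]),
}
prefix = ["e4", "e5"]
group = select_model(models, prefix)
next_move = models[group].predict(prefix[-1])
```

A bigram over move tokens ignores board state and side to move, so this only illustrates the select-then-predict flow, not the accuracy figures reported in the abstract.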
[AI-10] Graph Distance as Surprise: Free Energy Minimization in Knowledge Graph Reasoning NEURIPS2025
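[Quick read]: This paper asks how reasoning-based decision-making can be realized in knowledge graph (KG) networks, where the core challenge is quantifying and guiding an agent's reasoning over the KG so as to minimize "surprise". The key to the solution is to bring the Free Energy Principle (FEP) from neuroscience into KG systems, formalizing surprise with the shortest-path distance and building an agent framework in which the KG serves as the generative model. By minimizing the surprise defined by graph distance, the agent prioritizes entities close to its current state during reasoning, which is intended to improve the efficiency and consistency of reasoning.
Link: https://arxiv.org/abs/2512.01878
Authors: Gaganpreet Jhajj,Fuhua Lin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: Accepted to NORA Workshop at NeurIPS 2025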
Abstract:In this work, we propose that reasoning in knowledge graph (KG) networks can be guided by surprise minimization. Entities that are close in graph distance will have lower surprise than those farther apart. This connects the Free Energy Principle (FEP) from neuroscience to KG systems, where the KG serves as the agent’s generative model. We formalize surprise using the shortest-path distance in directed graphs and provide a framework for KG-based agents. Graph distance appears in graph neural networks as message passing depth and in model-based reinforcement learning as world model trajectories. This work-in-progress study explores whether distance-based surprise can extend recent work showing that syntax minimizes surprise and free energy via tree structures.
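Shortest-path distance in a directed graph is computed with breadth-first search, and ranking candidates by that distance gives a simple picture of "lower surprise for closer entities". In the sketch below the surprise of a candidate is taken to be the hop count itself, with unreachable entities treated as infinitely surprising. That mapping, the function names, and the toy graph are assumptions for illustration; the paper may define surprise differently.

```python
import math
from collections import deque


def graph_distance(graph, source, target):
    """Hop count of the shortest directed path, or math.inf if none exists."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return math.inf


def rank_by_surprise(graph, current, candidates):
    """Order candidate entities from least to most surprising (closest first)."""
    return sorted(candidates, key=lambda c: graph_distance(graph, current, c))


# Hypothetical toy knowledge graph as an adjacency list of directed edges.
kg = {"cat": ["mammal"], "mammal": ["animal"], "car": ["vehicle"]}
ranked = rank_by_surprise(kg, "cat", ["vehicle", "animal", "mammal"])
```

Because edges are directed, `graph_distance(kg, "animal", "cat")` is infinite even though the reverse path has two hops, which is one reason the choice of edge direction matters for this kind of measure.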
[AI-11] Testing Transformer Learnability on the Arithmetic Sequence of Rooted Trees
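[Quick read]: This paper asks whether a Large Language Model (LLM) can learn the intrinsic structure of the deterministic sequence of trees generated by iterated prime factorization of the natural numbers. Each natural number is mapped to a rooted planar tree, producing an arithmetic text $\mathbb{N}\mathcal{T}$ with measurable statistical structure. The approach is to train a GPT-2-architecture Transformer from scratch on the tree sequence generated from the first $10^{11}$ natural numbers and test its predictive ability on next-word and masked-word prediction tasks. The results show that the model partially learns the internal grammar of $\mathbb{N}\mathcal{T}$, capturing non-trivial regularities and correlations, which suggests that learnability may extend beyond empirical data to the structure of arithmetic itself.
Link: https://arxiv.org/abs/2512.01870
Authors: Alessandro Breccia,Federica Gerace,Marco Lippi,Gabriele Sicuro,Pierluigi Contucci
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Disordered Systems and Neural Networks (cond-mat.dis-nn); Mathematical Physics (math-ph); Number Theory (math.NT)
Notes: 21 pages, 8 figures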
Abstract:We study whether a Large Language Model can learn the deterministic sequence of trees generated by the iterated prime factorization of the natural numbers. Each integer is mapped into a rooted planar tree and the resulting sequence $\mathbb{N}\mathcal{T}$ defines an arithmetic text with measurable statistical structure. A transformer network (the GPT-2 architecture) is trained from scratch on the first $10^{11}$ elements to subsequently test its predictive ability under next-word and masked-word prediction tasks. Our results show that the model partially learns the internal grammar of $\mathbb{N}\mathcal{T}$, capturing non-trivial regularities and correlations. This suggests that learnability may extend beyond empirical data to the very structure of arithmetic.
[AI-12] Mitigating Gender Bias in Depression Detection via Counterfactual Inference
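[Quick read]: This paper addresses gender bias in audio-based depression detection models: because depression is more prevalent among women in the training data, models learn a spurious correlation between gender and depression, over-diagnosing female patients and underperforming on male patients, which raises fairness concerns. The key to the solution is a Counterfactual Debiasing Framework grounded in causal inference. It builds a causal graph to identify the direct causal effect of gender on the prediction, then uses counterfactual inference at inference time to estimate and remove that effect, so the model relies mainly on genuine acoustic pathological features, yielding fairer detection with better performance.
Link: https://arxiv.org/abs/2512.01834
Authors: Mingxuan Hu,Hongbo Ma,Xinlan Wu,Ziqi Liu,Jiaqi Liu,Yangbin Chen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Notes: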
Abstract:Audio-based depression detection models have demonstrated promising performance but often suffer from gender bias due to imbalanced training data. Epidemiological statistics show a higher prevalence of depression in females, leading models to learn spurious correlations between gender and depression. Consequently, models tend to over-diagnose female patients while underperforming on male patients, raising significant fairness concerns. To address this, we propose a novel Counterfactual Debiasing Framework grounded in causal inference. We construct a causal graph to model the decision-making process and identify gender bias as the direct causal effect of gender on the prediction. During inference, we employ counterfactual inference to estimate and subtract this direct effect, ensuring the model relies primarily on authentic acoustic pathological features. Extensive experiments on the DAIC-WOZ dataset using two advanced acoustic backbones demonstrate that our framework not only significantly reduces gender bias but also improves overall detection performance compared to existing debiasing strategies.
[AI-13] Deconstructing Generative Diversity: An Information Bottleneck Analysis of Discrete Latent Generative Models
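[Quick read]: This paper addresses the marked differences in generative diversity among discrete latent generative models such as autoregressive (AR) models, masked image modeling (MIM), and diffusion models. The key to the solution is a diagnostic framework based on Information Bottleneck (IB) theory that models generation as a conflict between "Compression Pressure" (minimizing codebook entropy) and "Diversity Pressure" (maximizing conditional entropy given an input), and further decomposes diversity into "Path Diversity" (the choice of high-level generative strategy) and "Execution Diversity" (randomness in executing the chosen strategy). Using three zero-shot, inference-time interventions, the framework reveals how models allocate and express diversity and identifies three distinct strategies: Diversity-Prioritized (MIM), Compression-Prioritized (AR), and Decoupled (Diffusion). This gives a principled account of the behavioral differences and motivates a new inference-time diversity enhancement technique.
Link: https://arxiv.org/abs/2512.01831
Authors: Yudi Wu,Wenhao Zhao,Dianbo Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: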
Abstract:Generative diversity varies significantly across discrete latent generative models such as AR, MIM, and Diffusion. We propose a diagnostic framework, grounded in Information Bottleneck (IB) theory, to analyze the underlying strategies resolving this behavior. The framework models generation as a conflict between a ‘Compression Pressure’ - a drive to minimize overall codebook entropy - and a ‘Diversity Pressure’ - a drive to maximize conditional entropy given an input. We further decompose this diversity into two primary sources: ‘Path Diversity’, representing the choice of high-level generative strategies, and ‘Execution Diversity’, the randomness in executing a chosen strategy. To make this decomposition operational, we introduce three zero-shot, inference-time interventions that directly perturb the latent generative process and reveal how models allocate and express diversity. Application of this probe-based framework to representative AR, MIM, and Diffusion systems reveals three distinct strategies: “Diversity-Prioritized” (MIM), “Compression-Prioritized” (AR), and “Decoupled” (Diffusion). Our analysis provides a principled explanation for their behavioral differences and informs a novel inference-time diversity enhancement technique.
[AI-14] Who Judges the Judge? LLM Jury-on-Demand: Building Trustworthy LLM Evaluation Systems
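[Quick read]: This paper addresses the reliability and scalability of evaluation methods for Large Language Models (LLMs) in high-stakes settings: human evaluation is accurate but slow, single LLM judges are biased, and static juries lack adaptability. The key to the solution is LLM Jury-on-Demand, a dynamic, learning-based framework. It trains a set of reliability predictors that use token distributions, embeddings, and structural input features to identify which LLM judges are more likely to agree with human experts on a given input; for each sample it then dynamically selects the most reliable judges and aggregates their scores weighted by reliability, giving adaptive evaluation that correlates more closely with human judgment.
Link: https://arxiv.org/abs/2512.01786
Authors: Xiaochuan Li,Ke Wang,Girija Gouda,Shubham Choudhary,Yaqun Wang,Linwei Hu,Joel Vaughan,Freddy Lecue
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 66 pages, 22 figures, 37 tables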
Abstract:As Large Language Models (LLMs) become integrated into high-stakes domains, there is a growing need for evaluation methods that are both scalable for real-time deployment and reliable for critical decision-making. While human evaluation is reliable, it is slow and costly. Single LLM judges are biased, and static juries lack adaptability. To overcome these limitations, we propose LLM Jury-on-Demand - a dynamic, learning-based framework for scalable and context-aware evaluation. Our method trains a set of reliability predictors to assess when LLM judges will agree with human experts, leveraging token distributions, embeddings, and structural input features. This enables a fully adaptive evaluation where, for each data point, an optimal jury of the most reliable judges is dynamically selected, and their scores are aggregated using their reliability as weights. Experiments on summarization and RAG benchmarks show that our dynamic jury system achieves significantly higher correlation with human judgment than both single-judge and static-jury baselines. These results highlight the promise of adaptive, learning-based juries for building scalable, more reliable and trustworthy evaluation systems for modern LLMs in high-stakes domains.
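The per-sample aggregation step described in the abstract (pick the most reliable judges, then weight their scores by reliability) is easy to illustrate. In the sketch below the reliability values are plain numbers supplied by the caller, whereas in the paper they come from trained predictors. The function name, the jury size `k`, and the example values are assumptions made for this illustration.

```python
def jury_score(scores, reliability, k=3):
    """Select the k most reliable judges for one data point and return
    (reliability-weighted mean score, list of selected judges)."""
    jury = sorted(scores, key=lambda judge: reliability[judge], reverse=True)[:k]
    total = sum(reliability[judge] for judge in jury)
    if total == 0:
        raise ValueError("selected judges all have zero reliability")
    weighted = sum(reliability[judge] * scores[judge] for judge in jury) / total
    return weighted, jury


# Hypothetical scores (1 to 5) and predicted reliabilities for one sample.
scores = {"judge_a": 4.0, "judge_b": 2.0, "judge_c": 5.0}
reliability = {"judge_a": 0.9, "judge_b": 0.1, "judge_c": 0.6}
final_score, jury = jury_score(scores, reliability, k=2)
```

Because the reliabilities are recomputed for every input, a judge that is excluded for one sample can still dominate the jury for another.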
[AI-15] Dual Randomized Smoothing: Beyond Global Noise Variance
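Quick read: This paper addresses a performance limitation of Randomized Smoothing (RS) across perturbation radii: standard RS uses a fixed global noise variance, so it cannot achieve high accuracy at both small and large radii. Small radii call for a low noise variance to preserve accuracy, while large radii call for a high noise variance to gain robustness. The key to the solution is a dual RS framework whose core innovation is an input-dependent noise variance: an independently smoothed variance estimator predicts a locally optimal noise variance for each input and keeps that variance locally constant in the input's neighborhood, which preserves certificate validity. This design removes the global-variance restriction and significantly improves certified accuracy at multiple radii on CIFAR-10 and ImageNet, with a manageable computational cost (only 60% more at inference).
Link: https://arxiv.org/abs/2512.01782
Authors: Chenhao Sun,Yuhao Mao,Martin Vechev
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: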
Abstract:Randomized Smoothing (RS) is a prominent technique for certifying the robustness of neural networks against adversarial perturbations. With RS, achieving high accuracy at small radii requires a small noise variance, while achieving high accuracy at large radii requires a large noise variance. However, the global noise variance used in the standard RS formulation leads to a fundamental limitation: there exists no global noise variance that simultaneously achieves strong performance at both small and large radii. To break through the global variance limitation, we propose a dual RS framework which enables input-dependent noise variances. To achieve that, we first prove that RS remains valid with input-dependent noise variances, provided the variance is locally constant around each input. Building on this result, we introduce two components which form our dual RS framework: (i) a variance estimator first predicts an optimal noise variance for each input, (ii) this estimated variance is then used by a standard RS classifier. The variance estimator is independently smoothed via RS to ensure local constancy, enabling flexible design. We also introduce training strategies to iteratively optimize the two components. Extensive experiments on CIFAR-10 show that our dual RS method provides strong performance for both small and large radii, unattainable with global noise variance, while incurring only a 60% computational overhead at inference. Moreover, it consistently outperforms prior input-dependent noise approaches across most radii, with particularly large gains at radii 0.5, 0.75, and 1.0, achieving relative improvements of 19%, 24%, and 21%, respectively. On ImageNet, dual RS remains effective across all radii. Additionally, the dual RS framework naturally provides a routing perspective for certified robustness, improving the accuracy-robustness trade-off with off-the-shelf expert RS models.
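A toy sketch of the two-stage idea in the abstract, on a one-dimensional input. The names `smoothed_vote` and `dual_rs_predict`, the toy estimator and classifier, and the sample count are assumptions made for illustration, not the paper's implementation. The radius uses the standard randomized-smoothing expression sigma * Phi^{-1}(p) with an empirical vote frequency in place of a proper confidence lower bound, and it covers only the classifier stage; the paper's full certificate, which must also account for the variance estimator, is not reproduced here.

```python
import random
from collections import Counter
from statistics import NormalDist


def smoothed_vote(f, x, sigma, n, rng):
    """Majority vote of f over n Gaussian perturbations of the scalar x.
    Returns (most frequent output, its empirical frequency)."""
    votes = Counter(f(x + rng.gauss(0.0, sigma)) for _ in range(n))
    value, count = votes.most_common(1)[0]
    return value, count / n


def dual_rs_predict(x, variance_estimator, classifier, sigma_est, n=2000, seed=0):
    """Stage 1 picks a noise level with a smoothed estimator;
    stage 2 runs ordinary randomized smoothing at that noise level."""
    rng = random.Random(seed)
    # Stage 1: smoothing the estimator keeps its output stable around x.
    sigma, _ = smoothed_vote(variance_estimator, x, sigma_est, n, rng)
    # Stage 2: standard smoothed classification with the selected sigma.
    label, p_hat = smoothed_vote(classifier, x, sigma, n, rng)
    p_hat = min(p_hat, 1.0 - 1e-6)  # keep the inverse CDF finite
    radius = sigma * NormalDist().inv_cdf(p_hat) if p_hat > 0.5 else 0.0
    return label, sigma, radius


# Hypothetical toy components: small noise near the decision boundary at 0,
# large noise far away from it.
toy_estimator = lambda z: 0.25 if abs(z) < 1.0 else 1.0
toy_classifier = lambda z: int(z > 0.0)
print(dual_rs_predict(3.0, toy_estimator, toy_classifier, sigma_est=0.1))
```

Inputs far from the boundary receive the larger noise level and can therefore earn a larger radius, while inputs near the boundary keep the smaller noise level that preserves accuracy.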
[AI-16] Weight Space Representation Learning with Neural Fields
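Quick read: This paper asks how neural network weights can serve as effective representations, specifically how to improve reconstruction, generation, and analysis tasks in the neural field setting. The key to the solution is constraining the optimization space with a pre-trained base model and Low-Rank Adaptation (LoRA), which induces structure in weight space. The study finds that multiplicative LoRA weights achieve high representation quality while exhibiting distinctiveness and semantic structure, and that, combined with latent diffusion models, they clearly outperform existing weight-space methods.
Link: https://arxiv.org/abs/2512.01759
Authors: Zhuoqian Yang,Mathieu Salzmann,Sabine Süsstrunk
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages body, 9 pages appendix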
Abstract:In this work, we investigate the potential of weights to serve as effective representations, focusing on neural fields. Our key insight is that constraining the optimization space through a pre-trained base model and low-rank adaptation (LoRA) can induce structure in weight space. Across reconstruction, generation, and analysis tasks on 2D and 3D data, we find that multiplicative LoRA weights achieve high representation quality while exhibiting distinctiveness and semantic structure. When used with latent diffusion models, multiplicative LoRA weights enable higher-quality generation than existing weight-space methods.
[AI-17] Probabilistic Neuro-Symbolic Reasoning for Sparse Historical Data: A Framework Integrating Bayesian Inference Causal Models and Game-Theoretic Allocation
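Quick read: This paper addresses the core challenges of modeling historical events: extreme data scarcity (N < 100), heterogeneous and noisy observations, missing counterfactuals, and the need for human-interpretable explanations. The key to the solution is HistoricalML, a probabilistic neuro-symbolic framework built on four mechanisms: (1) Bayesian uncertainty quantification to separate epistemic from aleatoric uncertainty; (2) structural causal models for counterfactual reasoning under confounding; (3) Shapley values from cooperative game theory for fair allocation modeling; and (4) attention-based neural architectures for context-dependent factor weighting. The theoretical analysis shows that, given strong domain priors, the approach achieves consistent estimation in the sparse-data regime, and that Shapley-based allocation satisfies axiomatic fairness guarantees that pure regression approaches cannot provide.
Link: https://arxiv.org/abs/2512.01723
Authors: Saba Kublashvili
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Probability (math.PR)
Comments: Preprint. Code and simulation notebooks available at the GitHub repository: this https URL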
Abstract:Modeling historical events poses fundamental challenges for machine learning: extreme data scarcity (N < 100), heterogeneous and noisy measurements, missing counterfactuals, and the requirement for human-interpretable explanations. We present HistoricalML, a probabilistic neuro-symbolic framework that addresses these challenges through principled integration of (1) Bayesian uncertainty quantification to separate epistemic from aleatoric uncertainty, (2) structural causal models for counterfactual reasoning under confounding, (3) cooperative game theory (Shapley values) for fair allocation modeling, and (4) attention-based neural architectures for context-dependent factor weighting. We provide theoretical analysis showing that our approach achieves consistent estimation in the sparse data regime when strong priors from domain knowledge are available, and that Shapley-based allocation satisfies axiomatic fairness guarantees that pure regression approaches cannot provide. We instantiate the framework on two historical case studies: the 19th-century partition of Africa (N = 7 colonial powers) and the Second Punic War (N = 2 factions). Our model identifies Germany’s +107.9 percent discrepancy as a quantifiable structural tension preceding World War I, with tension factor 36.43 and 0.79 naval arms race correlation. For the Punic Wars, Monte Carlo battle simulations achieve a 57.3 percent win probability for Carthage at Cannae and 57.8 percent for Rome at Zama, aligning with historical outcomes. Counterfactual analysis reveals that Carthaginian political support (support score 6.4 vs Napoleon’s 7.1), rather than military capability, was the decisive factor.
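The Shapley-value allocation mentioned in the abstract can be illustrated with an exact computation over all player orderings, which is feasible for the small N the paper works with. This is a generic sketch of the textbook definition, not HistoricalML's code; the function name `shapley_values` and the three-player worth table are invented for the example and do not come from the paper.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal contribution
    over every ordering in which the coalition can be assembled."""
    phi = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            phi[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in phi.items()}


# Hypothetical characteristic function for a three-player game.
WORTH = {
    frozenset(): 0, frozenset("A"): 10, frozenset("B"): 20, frozenset("C"): 30,
    frozenset("AB"): 40, frozenset("AC"): 50, frozenset("BC"): 60,
    frozenset("ABC"): 90,
}

allocation = shapley_values(["A", "B", "C"], lambda s: WORTH[frozenset(s)])
print(allocation)
```

The allocations always sum to the worth of the full coalition (the efficiency axiom), which is one of the axiomatic guarantees a plain regression fit does not give by construction.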
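[AI-18] ICAD-LLM: One-for-All Anomaly Detection via In-Context Learning with Large Language Models
Quick read: This paper addresses two limitations of current anomaly detection (AD) methods on multimodal data. First, they struggle to model heterogeneous data formats (such as time series, system logs, and tabular records) in a unified way, so a single framework cannot detect anomalies consistently across modalities. Second, they generalize poorly to new scenarios and require extensive retraining to adapt. The key to the solution is a new paradigm, In-Context Anomaly Detection (ICAD), in which anomalies are defined by their dissimilarity to a reference set of normal samples. Building on it, the ICAD-LLM framework uses the in-context learning ability of Large Language Models (LLMs) to handle multiple data modalities in a single model and to transfer zero-shot, which markedly improves cross-domain generalization and lowers deployment cost.
Link: https://arxiv.org/abs/2512.01672
Authors: Zhongyuan Wu,Jingyuan Wang,Zexuan Cheng,Yilong Zhou,Weizhi Wang,Juhua Pu,Chao Li,Changqing Ma
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: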
Abstract:Anomaly detection (AD) is a fundamental task of critical importance across numerous domains. Current systems increasingly operate in rapidly evolving environments that generate diverse yet interconnected data modalities – such as time series, system logs, and tabular records – as exemplified by modern IT systems. Effective AD methods in such environments must therefore possess two critical capabilities: (1) the ability to handle heterogeneous data formats within a unified framework, allowing the model to process and detect multiple modalities in a consistent manner during anomalous events; (2) a strong generalization ability to quickly adapt to new scenarios without extensive retraining. However, most existing methods fall short of these requirements, as they typically focus on single modalities and lack the flexibility to generalize across domains. To address this gap, we introduce a novel paradigm: In-Context Anomaly Detection (ICAD), where anomalies are defined by their dissimilarity to a relevant reference set of normal samples. Under this paradigm, we propose ICAD-LLM, a unified AD framework leveraging Large Language Models’ in-context learning abilities to process heterogeneous data within a single model. Extensive experiments demonstrate that ICAD-LLM achieves competitive performance with task-specific AD methods and exhibits strong generalization to previously unseen tasks, which substantially reduces deployment costs and enables rapid adaptation to new environments. To the best of our knowledge, ICAD-LLM is the first model capable of handling anomaly detection tasks across diverse domains and modalities.
[AI-19] CLIP-RL: Aligning Language and Policy Representations for Task Transfer in Reinforcement Learning
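Quick read: This paper addresses efficient cross-task transfer within the same environment in multi-task reinforcement learning (RL), especially when tasks are closely tied to natural-language instructions. The core challenge is mapping language instructions and policies into a shared representation space so that knowledge transfers between tasks. The key to the solution borrows from Contrastive Language-Image Pretraining (CLIP): a task's language instruction and its policy are treated as two modalities expressing the same concept, and a joint representation space is built to align their embeddings. This semantic alignment between language and policy significantly improves cross-task transfer efficiency.
Link: https://arxiv.org/abs/2512.01616
Authors: Chainesh Gautam,Raghuram Bharadwaj Diddigi
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 4 pages, 2 figures, accepted as an extended abstract at RLDM 2025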
Abstract:Recently, there has been an increasing need to develop agents capable of solving multiple tasks within the same environment, especially when these tasks are naturally associated with language. In this work, we propose a novel approach that leverages combinations of pre-trained (language, policy) pairs to establish an efficient transfer pipeline. Our algorithm is inspired by the principles of Contrastive Language-Image Pretraining (CLIP) in Computer Vision, which aligns representations across different modalities under the philosophy that “two modalities representing the same concept should have similar representations.” The central idea here is that the instruction and corresponding policy of a task represent the same concept, the task itself, in two different modalities. Therefore, by extending the idea of CLIP to RL, our method creates a unified representation space for natural language and policy embeddings. Experimental results demonstrate the utility of our algorithm in achieving faster transfer across tasks.
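The CLIP-style alignment the abstract borrows can be sketched as a symmetric contrastive loss over matched (instruction, policy) embedding pairs. This is a generic illustration of that loss in plain Python, assuming row i of each list belongs to the same task; the function names, the temperature value, and the toy embeddings are hypothetical, and how the paper actually obtains policy embeddings is not shown.

```python
import math


def _normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        loss += log_z - row[i]
    return loss / len(logits)


def clip_style_loss(language_emb, policy_emb, temperature=0.1):
    """Symmetric contrastive loss: matched pairs should be more similar
    than every mismatched pair, in both retrieval directions."""
    lang = [_normalize(v) for v in language_emb]
    pol = [_normalize(v) for v in policy_emb]
    logits = [[sum(a * b for a, b in zip(l, p)) / temperature for p in pol]
              for l in lang]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits)
                  + _diagonal_cross_entropy(transposed))


# Toy embeddings for two tasks (hypothetical values).
instructions = [[1.0, 0.0], [0.0, 1.0]]
aligned_policies = [[0.9, 0.1], [0.1, 0.9]]
swapped_policies = [[0.1, 0.9], [0.9, 0.1]]
print(clip_style_loss(instructions, aligned_policies))
print(clip_style_loss(instructions, swapped_policies))
```

The loss is small when each instruction sits closest to its own policy and large when the pairing is swapped, which is the property that lets a new instruction retrieve a nearby pre-trained policy as a starting point for transfer.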
[AI-20] Reconstructing Multi-Scale Physical Fields from Extremely Sparse Measurements with an Autoencoder-Diffusion Cascade
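Quick read: This paper addresses the long-standing ill-posed inverse problem of reconstructing full fields from extremely sparse, random measurements. The key to the solution is a hierarchical reconstruction framework, Cascaded Sensing (Cas-Sensing). A neural-operator-based functional autoencoder first recovers the dominant structures of the original field (such as large-scale components and geometric boundaries) as an intermediate variable. A conditional diffusion model, trained with a mask-cascade strategy, then generates fine-scale details, while a manifold-constrained gradient based on Bayesian posterior sampling enforces measurement consistency. This two-stage cascade alleviates ill-posedness and yields high-fidelity, robust reconstructions.
Link: https://arxiv.org/abs/2512.01572
Authors: Letian Yi,Tingpeng Zhang,Mingyuan Zhou,Guannan Wang,Quanke Su,Zhilu Lai
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph)
Comments: 19 pages, 10 figures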
Abstract:Reconstructing full fields from extremely sparse and random measurements is a longstanding ill-posed inverse problem. A powerful framework for addressing such challenges is hierarchical probabilistic modeling, where uncertainty is represented by intermediate variables and resolved through marginalization during inference. Inspired by this principle, we propose Cascaded Sensing (Cas-Sensing), a hierarchical reconstruction framework that integrates an autoencoder-diffusion cascade. First, a neural operator-based functional autoencoder reconstructs the dominant structures of the original field - including large-scale components and geometric boundaries - from arbitrary sparse inputs, serving as an intermediate variable. Then, a conditional diffusion model, trained with a mask-cascade strategy, generates fine-scale details conditioned on these large-scale structures. To further enhance fidelity, measurement consistency is enforced via the manifold constrained gradient based on Bayesian posterior sampling during the generation process. This cascaded pipeline substantially alleviates ill-posedness, delivering accurate and robust reconstructions. Experiments on both simulation and real-world datasets demonstrate that Cas-Sensing generalizes well across varying sensor configurations and geometric boundaries, making it a promising tool for practical deployment in scientific and engineering applications.
[AI-21] Delta Sum Learning: an approach for fast and global convergence in Gossip Learning
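Quick read: This paper addresses the shortcomings of Federated Learning and Gossip Learning in model aggregation and global convergence, in particular the marked accuracy drop when deploying across many edge nodes. The key to the solution is Delta Sum Learning, which improves the basic aggregation operation in Gossip Learning so that local model differences are accumulated and propagated more efficiently, significantly reducing the loss in global accuracy. The method is also integrated into a decentralized orchestration framework based on the Open Application Model, which supports dynamic node discovery and intent-driven deployment of multi-workload applications, improving scalability and convergence stability under limited connectivity.
Link: https://arxiv.org/abs/2512.01549
Authors: Tom Goethals,Merlijn Sebrechts,Stijn De Schrijver,Filip De Turck,Bruno Volckaert
Affiliation: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: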
Abstract:Federated Learning is a popular approach for distributed learning due to its security and computational benefits. With the advent of powerful devices in the network edge, Gossip Learning further decentralizes Federated Learning by removing centralized integration and relying fully on peer to peer updates. However, the averaging methods generally used in both Federated and Gossip Learning are not ideal for model accuracy and global convergence. Additionally, there are few options to deploy Learning workloads in the edge as part of a larger application using a declarative approach such as Kubernetes manifests. This paper proposes Delta Sum Learning as a method to improve the basic aggregation operation in Gossip Learning, and implements it in a decentralized orchestration framework based on Open Application Model, which allows for dynamic node discovery and intent-driven deployment of multi-workload applications. Evaluation results show that Delta Sum performance is on par with alternative integration methods for 10 node topologies, but results in a 58% lower global accuracy drop when scaling to 50 nodes. Overall, it shows strong global convergence and a logarithmic loss of accuracy with increasing topology size compared to a linear loss for alternatives under limited connectivity.
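[AI-22] Q2D2: A Geometry-Aware Audio Codec Leveraging Two-Dimensional Quantization
Quick read: This paper addresses the constrained latent-space geometry imposed by Residual Vector Quantization (RVQ), Vector Quantization (VQ), and Finite Scalar Quantization (FSQ) in existing neural audio codecs. These constraints limit how well correlations between features are captured, which hurts representation learning efficiency, codebook utilization, and token rate. The key to the solution is Two Dimensional Quantization (Q2D2), which projects feature pairs onto structured 2D grids (hexagonal, rhombic, or rectangular) and quantizes them to the nearest grid point. This implicitly defines a codebook given by the product of grid levels, keeping codebook sizes comparable to conventional methods while improving compression efficiency, with low token rates, high codebook utilization, and excellent reconstruction quality.
Link: https://arxiv.org/abs/2512.01537
Authors: Tal Shuster,Eliya Nachmani
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: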
Abstract:Recent neural audio codecs have achieved impressive reconstruction quality, typically relying on quantization methods such as Residual Vector Quantization (RVQ), Vector Quantization (VQ) and Finite Scalar Quantization (FSQ). However, these quantization techniques limit the geometric structure of the latent space and make it harder to capture correlations between features, leading to inefficiency in representation learning, codebook utilization and token rate. In this paper we introduce Two Dimensional Quantization (Q2D2), a quantization scheme in which feature pairs are projected onto structured 2D grids such as hexagonal, rhombic, or rectangular tiling and quantized to the nearest grid values, yielding an implicit codebook defined by the product of grid levels, with codebook sizes comparable to conventional methods. Despite its simple geometric formulation, Q2D2 improves audio compression efficiency, with low token rates and high codebook utilization while maintaining state-of-the-art reconstruction quality. Specifically, Q2D2 achieves competitive to superior performance in various objective and subjective reconstruction metrics, across extensive experiments in the speech domain compared to state-of-the-art models. Comprehensive ablation studies further confirm the effectiveness of our design choices.
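The "quantize feature pairs to the nearest grid value" step can be sketched for the simplest case named in the abstract, a rectangular grid. The function names, the [-1, 1] value range, and the 5 x 5 level counts are assumptions chosen for illustration; the paper's learned projection and its hexagonal and rhombic tilings are not reproduced here.

```python
def quantize_pair(x, y, levels_x=5, levels_y=5):
    """Snap (x, y), clipped to [-1, 1]^2, to the nearest point of a
    rectangular grid. Returns the quantized pair and one token index."""
    def snap(value, levels):
        value = max(-1.0, min(1.0, value))
        index = round((value + 1.0) / 2.0 * (levels - 1))
        return index, -1.0 + 2.0 * index / (levels - 1)

    ix, qx = snap(x, levels_x)
    iy, qy = snap(y, levels_y)
    # The implicit codebook has levels_x * levels_y entries; no table is stored.
    return (qx, qy), ix * levels_y + iy


def quantize_features(features, levels_x=5, levels_y=5):
    """Group a flat, even-length feature vector into consecutive pairs
    and quantize each pair independently."""
    if len(features) % 2:
        raise ValueError("feature vector must have even length")
    pairs = zip(features[0::2], features[1::2])
    return [quantize_pair(x, y, levels_x, levels_y) for x, y in pairs]


print(quantize_features([0.1, -0.9, 0.7, 0.4]))
```

Each pair becomes a single token drawn from an implicit codebook of `levels_x * levels_y` entries, which is the "product of grid levels" the abstract refers to.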
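[AI-23] Teaching an Online Multi-Institutional Research Level Software Engineering Course with Industry - an Experience Report
Quick read: This paper addresses the difficulty small universities face in independently offering research-level courses because they lack sufficient faculty, particularly in fast-moving fields such as software engineering. The key to the solution is cross-institutional collaboration over an online teaching platform: pooling faculty from several institutions and bringing in industry experts to take part in teaching and research guidance, so that high-quality, up-to-date courses can be shared and co-taught. The model suits areas of strong industry interest, such as AI in Software Engineering, and can add depth and practical value to a course.
Link: https://arxiv.org/abs/2512.01523
Authors: Pankaj Jalore,Y. Raghu Reddy,Vasudeva Varma
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 7 pages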
Abstract:Covid has made online teaching and learning acceptable and students, faculty, and industry professionals are all comfortable with this mode. This comfort can be leveraged to offer an online multi-institutional research-level course in an area where individual institutions may not have the requisite faculty to teach and/or research students to enroll. If the subject is of interest to industry, online offering also allows industry experts to contribute and participate with ease. Advanced topics in Software Engineering are ideally suited for experimenting with this approach as industry, which is often looking to incorporate advances in software engineering in their practices, is likely to agree to contribute and participate. In this paper we describe an experiment in teaching a course titled “AI in Software Engineering” jointly between two institutions with active industry participation, and share our own and the students’ experience. We believe this collaborative teaching approach can be used for offering research-level courses in any applied area of computer science by institutions that are small and find it difficult to offer research-level courses on their own.
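[AI-24] SynthStrategy: Extracting and Formalizing Latent Strategic Insights from LLMs in Organic Chemistry
Quick read: This paper addresses the difficulty current computer-assisted synthesis planning (CASP) systems have in incorporating strategic considerations (such as convergent assembly, protecting-group minimization, and optimal ring-forming sequences) when generating chemically valid reaction steps. The key to the solution is using Large Language Models to distill synthetic knowledge into verifiable code: synthesis routes are analyzed and translated into Python functions that represent diverse strategic and tactical rules (such as strategic functional group interconversions and ring construction strategies), giving a testable, interpretable model of synthetic strategy. This lets CASP systems retrieve routes from natural language and, for the first time, specify, search, and evaluate synthesis routes by strategic criteria rather than structural features alone.
Link: https://arxiv.org/abs/2512.01507
Authors: Daniel Armstrong,Zlatko Jončev,Andres M Bran,Philippe Schwaller
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: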
Abstract:Modern computer-assisted synthesis planning (CASP) systems show promise in generating chemically valid reaction steps but struggle to incorporate strategic considerations such as convergent assembly, protecting group minimization, and optimal ring-forming sequences. We introduce a methodology that leverages Large Language Models to distill synthetic knowledge into code. Our system analyzes synthesis routes and translates strategic principles into Python functions representing diverse strategic and tactical rules, such as strategic functional group interconversions and ring construction strategies. By formalizing this knowledge as verifiable code rather than simple heuristics, we create testable, interpretable representations of synthetic strategy. We release the complete codebase and the USPTO-ST dataset – synthesis routes annotated with strategic tags. This framework unlocks a novel capability for CASP: natural language-based route retrieval, achieving 75% Top-3 accuracy on our benchmark. We further validate our library through temporal analysis of historical trends and chemically intuitive route clustering that offers more granular partitioning than common previous methods. This work bridges the tactical-strategic divide in CASP, enabling specification, search, and evaluation of routes by strategic criteria rather than structure alone.
[AI-25] Multi-Path Collaborative Reasoning via Reinforcement Learning
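Quick read: This paper addresses the internal determinism of conventional Chain-of-Thought (CoT) reasoning caused by the greedy nature of autoregressive decoding: the model struggles to explore multiple plausible paths during reasoning, which limits the diversity and robustness of its reasoning. The key to the solution is Multi-Path Perception Policy Optimization (M3PO), a reinforcement learning framework that uses parallel policy rollouts as naturally diverse reasoning sources and introduces a lightweight collaborative mechanism to fold cross-path interactions into policy updates. Each reasoning path can then refine itself iteratively with peer feedback, producing more reliable and more interpretable multi-step reasoning patterns.
Link: https://arxiv.org/abs/2512.01485
Authors: Jindi Lv,Yuhao Zhou,Zheng Zhu,Xiaofeng Wang,Guan Huang,Jiancheng Lv
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: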
Abstract:Chain-of-Thought (CoT) reasoning has significantly advanced the problem-solving capabilities of Large Language Models (LLMs), yet conventional CoT often exhibits internal determinism during decoding, limiting exploration of plausible alternatives. Recent methods attempt to address this by generating soft abstract tokens to enable reasoning in a continuous semantic space. However, we find that such approaches remain constrained by the greedy nature of autoregressive decoding, which fundamentally isolates the model from alternative reasoning possibilities. In this work, we propose Multi-Path Perception Policy Optimization (M3PO), a novel reinforcement learning framework that explicitly injects collective insights into the reasoning process. M3PO leverages parallel policy rollouts as naturally diverse reasoning sources and integrates cross-path interactions into policy updates through a lightweight collaborative mechanism. This design allows each trajectory to refine its reasoning with peer feedback, thereby cultivating more reliable multi-step reasoning patterns. Empirical results show that M3PO achieves state-of-the-art performance on both knowledge- and reasoning-intensive benchmarks. Models trained with M3PO maintain interpretability and inference efficiency, underscoring the promise of multi-path collaborative learning for robust reasoning.
[AI-26] Multi-view diffusion geometry using intertwined diffusion trajectories
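Quick read: This paper addresses how to construct a unified diffusion geometry for multi-view data that better captures the dynamic interaction and fusion between data views. The core solution is a new framework called intertwined multi-view diffusion trajectories (MDTs), a class of inhomogeneous diffusion processes that iteratively combine the random walk operators of multiple data views to define trajectory-dependent diffusion operators. MDTs have a clear probabilistic and geometric interpretation, capture the interplay between views over time, and offer new degrees of freedom for view fusion and interaction design. The theoretical analysis shows that, under mild assumptions, both the point-wise diffusion operator and the process itself are ergodic, and MDT-based diffusion distances and embeddings can be derived via singular value decomposition. The paper also proposes several learning strategies for optimizing MDT operators, enabling flexible modeling and providing a neutral baseline for evaluating diffusion-based methods.
Link: https://arxiv.org/abs/2512.01484
Authors: Gwendal Debaussart-Joniec(CB),Argyris Kalogeratos(CB)
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: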
Abstract:This paper introduces a comprehensive unified framework for constructing multi-view diffusion geometries through intertwined multi-view diffusion trajectories (MDTs), a class of inhomogeneous diffusion processes that iteratively combine the random walk operators of multiple data views. Each MDT defines a trajectory-dependent diffusion operator with a clear probabilistic and geometric interpretation, capturing over time the interplay between data views. Our formulation encompasses existing multi-view diffusion models, while providing new degrees of freedom for view interaction and fusion. We establish theoretical properties under mild assumptions, including ergodicity of both the point-wise operator and the process in itself. We also derive MDT-based diffusion distances, and associated embeddings via singular value decompositions. Finally, we propose various strategies for learning MDT operators within the defined operator space, guided by internal quality measures. Beyond enabling flexible model design, MDTs also offer a neutral baseline for evaluating diffusion-based approaches through comparison with randomly selected MDTs. Experiments show the practical impact of the MDT operators in a manifold learning and data clustering context.
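A small sketch of how a trajectory-dependent operator might be composed from per-view random walk operators, as the abstract describes. Everything here is an assumption made for illustration: the function names, the left-to-right multiplication order, the toy affinity matrices, and the unweighted row distance (a simplification of the usual diffusion distance). The paper's precise definitions and its SVD-based embeddings are not reproduced.

```python
import math


def row_normalize(affinity):
    """Turn a non-negative affinity matrix into a row-stochastic
    random walk operator."""
    return [[w / sum(row) for w in row] for row in affinity]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def mdt_operator(view_operators, trajectory):
    """Compose the per-view operators in the order given by `trajectory`,
    a sequence of view indices (one diffusion step per entry)."""
    n = len(view_operators[0])
    op = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for view in trajectory:
        op = matmul(op, view_operators[view])
    return op


def row_distance(op, i, j):
    """Euclidean distance between the transition profiles of points i and j."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(op[i], op[j])))


# Two hypothetical views of four points: view 0 groups {0,1} and {2,3},
# view 1 is a chain 0-1-2-3 with self-loops.
view0 = row_normalize([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]])
view1 = row_normalize([[1, 1, 0, 0], [1, 1, 1, 0], [0, 1, 1, 1], [0, 0, 1, 1]])
operator = mdt_operator([view0, view1], trajectory=[0, 1, 0])
print(row_distance(operator, 0, 1), row_distance(operator, 0, 3))
```

Because each factor is row-stochastic, the composed operator is row-stochastic as well, and changing the trajectory (for example `[1, 1, 0]`) changes which view dominates the resulting geometry.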
[AI-27] Does Flatness imply Generalization for Logistic Loss in Univariate Two-Layer ReLU Network?
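Quick read: This paper asks whether flat solutions still guarantee generalization for arbitrarily overparameterized two-layer ReLU networks with univariate input when trained with logistic loss. Earlier work showed that flat solutions avoid overfitting under square loss, but the logistic-loss case was unclear, particularly because gradient descent with increasing step size converges to interpolating solutions at infinity. The paper's key contributions are twofold. First, it proves that flat solutions achieve near-optimal generalization bounds within the region between the left-most and right-most "uncertain sets" determined by each candidate solution. Second, it constructs counterexamples showing that there exist solutions at infinity that are extremely flat yet overfit and are "certain" everywhere, which shows that flatness alone does not guarantee generalization in general.
Link: https://arxiv.org/abs/2512.01473
Authors: Dan Qiao,Yu-Xiang Wang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 59 pages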
Abstract:We consider the problem of generalization of arbitrarily overparameterized two-layer ReLU Neural Networks with univariate input. Recent work showed that under square loss, flat solutions (motivated by flat / stable minima and Edge of Stability phenomenon) provably cannot overfit, but it remains unclear whether the same phenomenon holds for logistic loss. This is a puzzling open problem because existing work on logistic loss shows that gradient descent with increasing step size converges to interpolating solutions (at infinity, for the margin-separable cases). In this paper, we prove that the flatness implied generalization is more delicate under logistic loss. On the positive side, we show that flat solutions enjoy near-optimal generalization bounds within a region between the left-most and right-most uncertain sets determined by each candidate solution. On the negative side, we show that there exist arbitrarily flat yet overfitting solutions at infinity that are (falsely) certain everywhere, thus certifying that flatness alone is insufficient for generalization in general. We demonstrate the effects predicted by our theory in a well-controlled simulation study.
[AI-28] Automated Risk-of-Bias Assessment of Randomized Controlled Trials: A First Look at a GEPA-trained Programmatic Prompting Framework
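Quick read: This paper addresses the resource cost and poor inter-reviewer consistency of risk-of-bias (RoB) assessment in randomized controlled trials (RCTs). Conventional approaches rely on manually engineered prompts, which are hard to reproduce, generalize poorly, and are difficult to evaluate. The key to the solution is a programmable RoB assessment pipeline built on the DSPy framework and its GEPA module (which optimizes reasoning through Pareto-guided search). It optimizes prompt generation automatically in a code-driven way and produces traceable execution traces, enabling transparent, consistent, and reproducible LLM-assisted RoB assessment. Experiments show that GEPA-generated prompts perform well across several RoB domains and, in domains with clear methodological reporting (such as Random Sequence Generation), clearly outperform manually designed prompts, supporting the method's practicality and reliability for evidence synthesis.
Link: https://arxiv.org/abs/2512.01452
Authors: Lingbo Li,Anuradha Mathrani,Teo Susnjak
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: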
Abstract:Assessing risk of bias (RoB) in randomized controlled trials is essential for trustworthy evidence synthesis, but the process is resource-intensive and prone to variability across reviewers. Large language models (LLMs) offer a route to automation, but existing methods rely on manually engineered prompts that are difficult to reproduce, generalize, or evaluate. This study introduces a programmable RoB assessment pipeline that replaces ad-hoc prompt design with structured, code-based optimization using DSPy and its GEPA module. GEPA refines LLM reasoning through Pareto-guided search and produces inspectable execution traces, enabling transparent replication of every step in the optimization process. We evaluated the method on 100 RCTs from published meta-analyses across seven RoB domains. GEPA-generated prompts were applied to both open-weight models (Mistral Small 3.1 with GPT-oss-20b) and commercial models (GPT-5 Nano and GPT-5 Mini). In domains with clearer methodological reporting, such as Random Sequence Generation, GEPA-generated prompts performed best, with similar results for Allocation Concealment and Blinding of Participants, while the commercial model performed slightly better overall. We also compared GEPA with three manually designed prompts using Claude 3.5 Sonnet. GEPA achieved the highest overall accuracy and improved performance by 30%-40% in Random Sequence Generation and Selective Reporting, and showed generally comparable, competitively aligned performance in the other domains relative to manual prompts. These findings suggest that GEPA can produce consistent and reproducible prompts for RoB assessment, supporting the structured and principled use of LLMs in evidence synthesis.
[AI-29] PSA-MF: Personality-Sentiment Aligned Multi-Level Fusion for Multimodal Sentiment Analysis AAAI2026
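Quick read: This paper addresses inaccurate fusion of sentiment features in multimodal sentiment analysis (MSA) caused by neglecting individual personality differences. Existing methods extract only shallow information during unimodal feature extraction and ignore how personality affects sentiment expression; during multimodal fusion they directly concatenate the features of each modality without modeling feature-level differences, which limits recognition performance. The key to the solution is a personality-sentiment aligned multi-level fusion framework. During feature extraction it introduces personality traits and a novel personality-sentiment alignment mechanism that produces personalized sentiment embeddings from the textual modality. During fusion it applies a multi-level strategy that gradually integrates sentiment information from the textual, visual, and audio modalities through pre-fusion and enhanced fusion, significantly improving the model's expressiveness and generalization.
Link: https://arxiv.org/abs/2512.01442
Authors: Heng Xie,Kang Zhu,Zhengqi Wen,Jianhua Tao,Xuefei Liu,Ruibo Fu,Changsheng Li
Affiliation: unknown
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
Comments: AAAI 2026 accepted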
Abstract:Multimodal sentiment analysis (MSA) is a research field that recognizes human sentiments by combining textual, visual, and audio modalities. The main challenge lies in integrating sentiment-related information from different modalities, which typically arises during the unimodal feature extraction phase and the multimodal feature fusion phase. Existing methods extract only shallow information from unimodal features during the extraction phase, neglecting sentimental differences across different personalities. During the fusion phase, they directly merge the feature information from each modality without considering differences at the feature level. This ultimately affects the model’s recognition performance. To address this problem, we propose a personality-sentiment aligned multi-level fusion framework. We introduce personality traits during the feature extraction phase and propose a novel personality-sentiment alignment method to obtain personalized sentiment embeddings from the textual modality for the first time. In the fusion phase, we introduce a novel multi-level fusion method. This method gradually integrates sentimental information from textual, visual, and audio modalities through multimodal pre-fusion and a multi-level enhanced fusion strategy. Our method has been evaluated through multiple experiments on two commonly used datasets, achieving state-of-the-art results.
[AI-30] A Selective Temporal Hamming distance to find patterns in state transition event timeseries at scale
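Quick read: This paper addresses a problem in analyzing discrete event systems: conventional methods ignore their dual event/state nature. Signals are usually modeled either as event sequences, which emphasizes event order alignment, or as state time series, which requires resampling; as the observation period and the number of events grow, resampling becomes computationally expensive and distorting. The key to the solution is the state transition event time series (STE-ts) model together with a Selective Temporal Hamming distance (STH), a measure that uses both transition times and duration-in-state and therefore avoids costly, distortion-prone resampling. STH improves on the resampled Hamming and Jaccard metrics in precision and computation time and can focus on multiple states of interest; its effectiveness is validated on simulated and real-world datasets.
Link: https://arxiv.org/abs/2512.01440
Authors: Sylvain Marié(SE),Pablo Knecht(SE)
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: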
Abstract:Discrete event systems are present in observations of nature, in the socio-economic sciences, and in industrial systems. Standard analysis approaches do not usually exploit their dual event / state nature: signals are either modeled as transition event sequences, emphasizing event order alignment, or as categorical or ordinal state timeseries, usually resampled, a distorting and costly operation as the observation period and number of events grow. In this work we define state transition event timeseries (STE-ts) and propose a new Selective Temporal Hamming distance (STH) leveraging both transition time and duration-in-state, avoiding costly and distorting resampling on large databases. STH generalizes both resampled Hamming and Jaccard metrics with better precision and computation time, and an ability to focus on multiple states of interest. We validate these benefits on simulated and real-world datasets.
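To make the "no resampling" idea concrete, the sketch below compares two state series directly from their transition events by walking the merged breakpoints and accumulating the time spent in different states. It is an illustration written under assumptions, not the paper's STH definition: the function names, the `(time, new_state)` event format, and the way `states_of_interest` restricts the comparison are all hypothetical.

```python
def to_segments(events, end):
    """Turn sorted (time, new_state) transition events into
    (start, stop, state) segments, closing the last one at `end`."""
    return [(t, events[k + 1][0] if k + 1 < len(events) else end, s)
            for k, (t, s) in enumerate(events)]


def duration_hamming(a, b, end, states_of_interest=None):
    """Fraction of the considered time during which series a and b are in
    different states, computed exactly from the transition times."""
    seg_a, seg_b = to_segments(a, end), to_segments(b, end)
    i = j = 0
    differing = considered = 0.0
    while i < len(seg_a) and j < len(seg_b):
        start = max(seg_a[i][0], seg_b[j][0])
        stop = min(seg_a[i][1], seg_b[j][1])
        if stop > start:
            state_a, state_b = seg_a[i][2], seg_b[j][2]
            selected = (states_of_interest is None
                        or state_a in states_of_interest
                        or state_b in states_of_interest)
            if selected:
                considered += stop - start
                if state_a != state_b:
                    differing += stop - start
        # Advance whichever segment finishes first.
        if seg_a[i][1] <= seg_b[j][1]:
            i += 1
        else:
            j += 1
    return differing / considered if considered else 0.0


# Two hypothetical machines observed over [0, 10).
machine_a = [(0, "on"), (4, "off")]
machine_b = [(0, "on"), (6, "off")]
print(duration_hamming(machine_a, machine_b, end=10))
print(duration_hamming(machine_a, machine_b, end=10, states_of_interest={"on"}))
```

The cost grows with the number of transitions rather than with the length of the observation period, which is the practical advantage over resampling both series onto a fine regular grid.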
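[AI-31] A Flexible Multi-Agent LLM-Human Framework for Fast Human Validated Tool Building
Quick read: This paper addresses the challenge of aligning tool building with human intent in complex iterative problems, especially how to adapt efficiently to new domains and minimize the cost of human feedback in tasks such as scientific document generation. The key to the solution is CollabToolBuilder, a multi-agent large language model (LLM) framework with an expert-in-the-loop (Human-in-the-Loop, HITL) mechanism. Four specialized agents (Coach, Coder, Critic, and Capitalizer) cooperate, using a reinforced dynamic prompt and systematic integration of human feedback, to generate and validate tools iteratively, improving task efficiency and transferability while respecting goals and constraints.
Link: https://arxiv.org/abs/2512.01434
Authors: Daull Xavier(LIS, R2I, UTLN),Patrice Bellot(R2I, LIS, AMU),Emmanuel Bruno(R2I, UTLN),Vincent Martin,Elisabeth Murisasco(R2I, UTLN)
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: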
Abstract:We introduce CollabToolBuilder, a flexible multiagent LLM framework with expert-in-the-loop (HITL) guidance that iteratively learns to create tools for a target goal, aligning with human intent and process, while minimizing time for task/domain adaptation effort and human feedback capture. The architecture generates and validates tools via four specialized agents (Coach, Coder, Critic, Capitalizer) using a reinforced dynamic prompt and systematic human feedback integration to reinforce each agent’s role toward goals and constraints. This work is best viewed as a system-level integration and methodology combining multi-agent in-context learning, HITL controls, and reusable tool capitalization for complex iterative problems such as scientific document generation. We illustrate it with preliminary experiments (e.g., generating state-of-the-art research papers or patents given an abstract) and discuss its applicability to other iterative problem-solving.
[AI-32] A Self-explainable Model of Long Time Series by Extracting Informative Structured Causal Patterns
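[Quick read]: This paper addresses the limitations of existing explainable AI methods for long time series modeling: most provide only point-wise importance scores and fail to capture temporal structures such as trends, cycles, and regime changes, which weakens human interpretability of, and trust in, long-horizon models. The key to its solution is the EXCAP framework, which satisfies four core requirements: temporal continuity, pattern-centric explanation, causal disentanglement, and faithfulness to the model's inference process. Its core components are an attention-based segmenter that extracts coherent temporal patterns, a causally structured decoder guided by a pre-trained causal graph, and a latent aggregation mechanism that enforces representation stability, yielding well-structured, causally grounded explanations while maintaining predictive performance.
Link: https://arxiv.org/abs/2512.01412
Authors: Ziqian Wang,Yuxiao Cheng,Jinli Suo
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Approximately 30 pages, 8 figures, and 5 tables. Preprint version. Includes theoretical analysis, model architecture, interpretability evaluation, and extensive benchmark experiments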
Abstract:Explainability is essential for neural networks that model long time series, yet most existing explainable AI methods only produce point-wise importance scores and fail to capture temporal structures such as trends, cycles, and regime changes. This limitation weakens human interpretability and trust in long-horizon models. To address these issues, we identify four key requirements for interpretable time-series modeling: temporal continuity, pattern-centric explanation, causal disentanglement, and faithfulness to the model’s inference process. We propose EXCAP, a unified framework that satisfies all four requirements. EXCAP combines an attention-based segmenter that extracts coherent temporal patterns, a causally structured decoder guided by a pre-trained causal graph, and a latent aggregation mechanism that enforces representation stability. Our theoretical analysis shows that EXCAP provides smooth and stable explanations over time and is robust to perturbations in causal masks. Extensive experiments on classification and forecasting benchmarks demonstrate that EXCAP achieves strong predictive accuracy while generating coherent and causally grounded explanations. These results show that EXCAP offers a principled and scalable approach to interpretable modeling of long time series with relevance to high-stakes domains such as healthcare and finance.
[AI-33] Consistency Flow Model Achieves One-step Denoising Error Correction Codes
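[Quick read]: This paper addresses the difficulty of building neural decoders for reliable digital communication that are both accurate and computationally efficient, in particular the high latency of diffusion-based decoders, whose iterative sampling makes them unsuitable for low-latency settings. The key to its solution is an architecture-agnostic training framework, the Error Correction Consistency Flow Model (ECCFM). It casts the reverse denoising process as a Probability Flow Ordinary Differential Equation (PF-ODE) and enforces smoothness of the decoding trajectory through a differential time regularization, so that the model learns to map noisy signals directly to the original codeword in a single inference step, achieving high-fidelity one-step decoding.
Link: https://arxiv.org/abs/2512.01389
Authors: Haoyu Lei,Chin Wa Lau,Kaiwen Zhou,Nian Guo,Farzan Farnia
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: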
Abstract:Error Correction Codes (ECC) are fundamental to reliable digital communication, yet designing neural decoders that are both accurate and computationally efficient remains challenging. Recent denoising diffusion decoders with transformer backbones achieve state-of-the-art performance, but their iterative sampling limits practicality in low-latency settings. We introduce the Error Correction Consistency Flow Model (ECCFM), an architecture-agnostic training framework for high-fidelity one-step decoding. By casting the reverse denoising process as a Probability Flow Ordinary Differential Equation (PF-ODE) and enforcing smoothness through a differential time regularization, ECCFM learns to map noisy signals along the decoding trajectory directly to the original codeword in a single inference step. Across multiple decoding benchmarks, ECCFM attains lower bit-error rates (BER) than autoregressive and diffusion-based baselines, with notable improvements on longer codes, while delivering inference that is 30x to 100x faster than denoising diffusion decoders.
[AI-34] Structured Spectral Reasoning for Frequency-Adaptive Multimodal Recommendation
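[Quick read]: This paper addresses the degraded generalization and poor robustness of multimodal recommender systems caused by modality-specific noise, semantic inconsistency, and unstable propagation over user-item graphs. Existing methods often rely on shallow fusion or static frequency-domain filtering, which makes it hard to separate stable signals from noise, and they lack the ability to reason dynamically over spectral structure. The key to its solution is the Structured Spectral Reasoning (SSR) framework, which proceeds in four stages: (i) decompose multimodal signals into spectral bands via graph-guided transformations to isolate semantic granularity; (ii) modulate the reliability of each band at training time using spectral band masking with a prediction-consistency objective, suppressing brittle frequency components; (iii) fuse complementary frequency cues through hyperspectral reasoning with low-rank cross-band interaction; and (iv) align modality-specific spectral features via contrastive regularization to improve semantic and structural consistency. The method notably improves performance in sparse and cold-start settings and enhances interpretability and robustness.
Link: https://arxiv.org/abs/2512.01372
Authors: Wei Yang,Rui Zhong,Yiqun Chen,Chi Lu,Peng Jiang
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: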
Abstract:Multimodal recommendation aims to integrate collaborative signals with heterogeneous content such as visual and textual information, but remains challenged by modality-specific noise, semantic inconsistency, and unstable propagation over user-item graphs. These issues are often exacerbated by naive fusion or shallow modeling strategies, leading to degraded generalization and poor robustness. While recent work has explored the frequency domain as a lens to separate stable from noisy signals, most methods rely on static filtering or reweighting, lacking the ability to reason over spectral structure or adapt to modality-specific reliability. To address these challenges, we propose a Structured Spectral Reasoning (SSR) framework for frequency-aware multimodal recommendation. Our method follows a four-stage pipeline: (i) Decompose graph-based multimodal signals into spectral bands via graph-guided transformations to isolate semantic granularity; (ii) Modulate band-level reliability with spectral band masking, a training-time masking with a prediction-consistency objective that suppresses brittle frequency components; (iii) Fuse complementary frequency cues using hyperspectral reasoning with low-rank cross-band interaction; and (iv) Align modality-specific spectral features via contrastive regularization to promote semantic and structural consistency. Experiments on three real-world benchmarks show consistent gains over strong baselines, particularly under sparse and cold-start settings. Additional analyses indicate that structured spectral modeling improves robustness and provides clearer diagnostics of how different bands contribute to performance.
[AI-35] Beyond Loss Guidance: Using PDE Residuals as Spectral Attention in Diffusion Neural Operators
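[Quick read]: This paper addresses three problems of diffusion-based solvers for partial differential equations (PDEs): slow gradient-based optimization at inference time, optimization instability, and the inability to adapt the inference scheme dynamically when PDE residuals are noisy. The key to its solution is PRISMA (PDE Residual Informed Spectral Modulation with Attention), which embeds PDE residuals directly into the model architecture through attention mechanisms in the spectral domain, enabling inference without gradient descent. Unlike earlier methods that use the PDE loss only as an external optimization target, PRISMA treats PDE residuals as an intrinsic architectural feature, which makes the model faster at inference, more robust, more accurate, and free of sensitive hyperparameter tuning.
Link: https://arxiv.org/abs/2512.01370
Authors: Medha Sawhney,Abhilash Neog,Mridul Khurana,Anuj Karpatne
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Comments: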
Abstract:Diffusion-based solvers for partial differential equations (PDEs) are often bottlenecked by slow gradient-based test-time optimization routines that use PDE residuals for loss guidance. They additionally suffer from optimization instabilities and are unable to dynamically adapt their inference scheme in the presence of noisy PDE residuals. To address these limitations, we introduce PRISMA (PDE Residual Informed Spectral Modulation with Attention), a conditional diffusion neural operator that embeds PDE residuals directly into the model’s architecture via attention mechanisms in the spectral domain, enabling gradient-descent-free inference. In contrast to previous methods that use PDE loss solely as external optimization targets, PRISMA integrates PDE residuals as integral architectural features, making it inherently fast, robust, accurate, and free from sensitive hyperparameter tuning. We show that PRISMA has competitive accuracy, at substantially lower inference costs, compared to previous methods across five benchmark PDEs, especially with noisy observations, while using 10x to 100x fewer denoising steps, leading to 15x to 250x faster inference.
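[AI-36] Tangram: Accelerating Serverless LLM Loading through GPU Memory Reuse and Affinity
[Quick read]: This paper addresses the service-availability bottleneck caused by high model-loading latency during the cold-start phase of serverless large language models (LLMs). Cold-start latency grows linearly with model size, which severely limits practical deployment of large-scale LLM services. The key to its solution is to accelerate model loading through efficient GPU memory reuse: a unified GPU memory pool for tensor-level parameter sharing across models, an on-demand KV cache allocation strategy for dynamic memory management, and GPU-affinity-aware scheduling to improve resource utilization. Together these three techniques substantially reduce model transfer time and cold-start latency. Experiments show that Tangram loads models up to 6.2 times faster than state-of-the-art methods and reduces Time-To-First-Token (TTFT) during cold start by 23–55%.
Link: https://arxiv.org/abs/2512.01357
Authors: Wenbin Zhu(Shandong University),Zhaoyan Shen(Shandong University),Zili Shao(The Chinese University of Hong Kong),Hongjun Dai(Shandong University),Feng Chen(Indiana University Bloomington)
Affiliations: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: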
Abstract:Serverless Large Language Models (LLMs) have emerged as a cost-effective solution for deploying AI services by enabling a ‘pay-as-you-go’ pricing model through GPU resource sharing. However, cold-start latency, especially the model loading phase, has become a critical performance bottleneck, as it scales linearly with model size and severely limits the practical deployment of large-scale LLM services. This paper presents Tangram, a novel system that accelerates Serverless LLM loading through efficient GPU memory reuse. By leveraging the unused GPU memory to retain model parameters, Tangram significantly reduces model transfer time and cold-start latency. Its design includes three key components: unified GPU memory pool for tensor-level parameter sharing across models, on-demand KV cache allocation for dynamic memory management, and GPU-affinity-aware scheduling for maximizing resource utilization. These techniques collectively address the critical challenges of inefficient memory usage and the cold-start problem in Serverless LLM platforms. We have implemented a fully functional prototype, and experiments show that Tangram achieves up to 6.2 times faster loading and reduces Time-To-First-Token (TTFT) during cold-start by 23–55% over state-of-the-art methods.
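[AI-37] Benchmarking Overton Pluralism in LLMs
[Quick read]: This paper addresses the insufficient coverage of diverse viewpoints in the outputs of large language models (LLMs), that is, the problem of quantitatively evaluating "Overton pluralism". Its core solution is a quantifiable set coverage metric, OvertonScore. Based on a large-scale U.S.-representative human study (N=1209) comparing 8 mainstream LLMs, the authors build an automated benchmark that closely reproduces human judgments and correlates highly with human ratings (ρ=0.88), providing an efficient, scalable evaluation tool for model development and supporting progress toward more inclusive, pluralistic LLMs.
Link: https://arxiv.org/abs/2512.01351
Authors: Elinor Poole-Dayan,Jiayi Wu,Taylor Sorensen,Jiaxin Pei,Michiel A. Bakker
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: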
Abstract:We introduce a novel framework for measuring Overton pluralism in LLMs–the extent to which diverse viewpoints are represented in model outputs. We (i) formalize Overton pluralism as a set coverage metric (OvertonScore), (ii) conduct a large-scale U.S.-representative human study (N = 1209; 60 questions; 8 LLMs), and (iii) develop an automated benchmark that closely reproduces human judgments. On average, models achieve OvertonScores of 0.35–0.41, with DeepSeek V3 performing best; yet all models remain far below the theoretical maximum of 1.0, revealing substantial headroom for improvement. Because repeated large-scale human studies are costly and slow, scalable evaluation tools are essential for model development. Hence, we propose an automated benchmark that achieves high rank correlation with human judgments ($\rho = 0.88$), providing a practical proxy without replacing human assessment. By turning pluralistic alignment from a normative aim into a measurable benchmark, our work establishes a foundation for systematic progress toward more pluralistic LLMs.
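A set coverage metric of the kind the abstract mentions can be sketched in a few lines. The exact OvertonScore definition is not given in the abstract, so the function below is a generic illustration under assumed inputs: `viewpoints` is a hypothetical set of distinct viewpoints held by respondents on one question, and `covered` is the hypothetical set judged to be represented in a model response.

```python
def coverage_score(viewpoints, covered):
    """Fraction of the reference viewpoints that a response represents.

    Generic set-coverage sketch; the paper's OvertonScore may weight or
    aggregate viewpoints differently.
    """
    viewpoints = set(viewpoints)
    if not viewpoints:
        raise ValueError("at least one reference viewpoint is required")
    return len(viewpoints & set(covered)) / len(viewpoints)

# A response that represents two of four viewpoints, plus one unrelated claim.
print(coverage_score({"a", "b", "c", "d"}, {"a", "c", "x"}))
```

A score of 1.0 in this sketch means every reference viewpoint is represented, which matches the theoretical maximum quoted in the abstract.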
[AI-38] Intrinsic Structure as a Proxy for Saliency: SVD-Based Weight Preservation for Mixed-Precision Quantization in Large Language Models
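[Quick read]: This paper addresses the compute and storage bottlenecks that high-precision weights create when deploying large language models (LLMs) on resource-constrained devices, and in particular the degradation of existing post-training quantization (PTQ) methods when no calibration data is available. The core challenge is to identify the "outlier features" (the few weights that are critical to model performance) without relying on activations or second-order sensitivity analysis. The key to its solution is a new data-free, structure-aware paradigm: use Singular Value Decomposition (SVD) to extract the principal components of each weight matrix, then apply a selection heuristic that preserves the top-k weights aligned with the principal components in full-precision FP32 while aggressively quantizing the remaining weights. Experiments show that this importance estimate, based on the intrinsic structure of the matrix, can replace methods that depend on calibration data, outperforming advanced methods such as AWQ and SpQR on GLUE benchmark tasks (such as RTE) and confirming that structural importance is highly correlated with functional importance.
Link: https://arxiv.org/abs/2512.01343
Authors: Shashank Landge,Abhishek Patil,Tejas kamble,Bhushan Buddhivant,Priyanka Joshi
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: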
Abstract:As Large Language Models (LLMs) continue to scale in parameter count, deploying them on commodity hardware has become increasingly challenging. Post-Training Quantization (PTQ) addresses this by reducing the precision of model weights, typically to 4-bit or lower. However, uniform quantization often leads to significant performance degradation due to the presence of "outlier features" – weights that, while few in number, are critical for maintaining model accuracy. Current state-of-the-art methods such as AWQ (Activation-aware Weight Quantization) and SpQR (Sparse Quantization Representations) rely on calibration data to identify these salient weights via activation magnitudes or Hessian sensitivity. In scenarios where data privacy is paramount or calibration data is unavailable, these methods are inapplicable. In this work, we propose a data-free, structure-aware hypothesis: that the weights identified as Principal Components via Singular Value Decomposition (SVD) are intrinsically important to the model’s downstream performance. We introduce a novel selection heuristic that preserves the top-$k$ weights aligned with the principal components in FP32, while aggressively quantizing the residual weights. We compare our method against activation-aware (AWQ) and second-order (SpQR) methods across GLUE benchmarks (MRPC, RTE, QNLI) using a DistilBERT backbone. Our experiments reveal that structural importance is highly correlated with functional importance. On the challenging RTE task, our SVD-based method achieves an accuracy of 66.06%, outperforming both AWQ (65.34%) and SpQR (65.34%) at high protection budgets, validating that intrinsic matrix structure can serve as a robust proxy for weight saliency without the need for forward passes or calibration data.
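The data-free selection idea can be illustrated with a small pure-Python sketch. The abstract does not specify the authors' heuristic, so everything here is an assumption: only the leading singular component is used (estimated by power iteration), a weight's score is the magnitude of its entry in that rank-1 component, and "aggressive quantization" is stood in for by rounding to a coarse grid of width `step`. The matrix is assumed non-zero.

```python
import math

def top_singular_triplet(W, iters=100):
    """Leading singular value and vectors of W by power iteration.

    Assumes W is non-zero and the start vector is not orthogonal to the
    leading right singular vector.
    """
    rows, cols = len(W), len(W[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        v = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    return sigma, [x / sigma for x in u], v

def protected_indices(W, k):
    """Indices of the k entries with the largest rank-1 component magnitude."""
    sigma, u, v = top_singular_triplet(W)
    scores = [(abs(sigma * u[i] * v[j]), i, j)
              for i in range(len(W)) for j in range(len(W[0]))]
    scores.sort(reverse=True)
    return {(i, j) for _, i, j in scores[:k]}

def mixed_precision(W, k, step=0.5):
    """Keep protected weights exactly; round all others to multiples of step."""
    keep = protected_indices(W, k)
    return [[W[i][j] if (i, j) in keep else round(W[i][j] / step) * step
             for j in range(len(W[0]))] for i in range(len(W))]

W = [[10.0, 0.1], [0.2, 0.1]]
print(protected_indices(W, 1))
print(mixed_precision(W, 1))
```

No forward pass or calibration input appears anywhere in the sketch, which is the property the abstract contrasts with AWQ and SpQR.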
[AI-39] A Fast Heuristic Search Approach for Energy-Optimal Profile Routing for Electric Vehicles AAAI2026
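[Quick read]: This paper addresses the energy-optimal shortest path problem for electric vehicles (EVs) in large-scale road networks, with particular attention to the negative energy costs that arise when energy is recuperated on downhill segments. Traditional point-to-point pathfinding algorithms usually assume a known initial charge, but in practice the available energy is uncertain, so optimal paths must be planned for all possible initial energy levels, a task known as energy-optimal profile search. Existing methods rely on specialized profile-merging procedures within a label-correcting framework, which leads to searching over and processing complex profiles. The paper proposes a simple yet effective label-setting approach based on multi-objective A* search. Its core innovation is a novel profile dominance rule that avoids generating and handling complex energy profiles, markedly improving efficiency. Experiments show that on real road networks the method performs comparably to energy-optimal A* with a known initial charge.
Link: https://arxiv.org/abs/2512.01331
Authors: Saman Ahmadi,Mahdi Jalili
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 13 pages, 5 Figures, 1 table, To appear as part of AAAI 2026 Proceedings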
Abstract:We study the energy-optimal shortest path problem for electric vehicles (EVs) in large-scale road networks, where recuperated energy along downhill segments introduces negative energy costs. While traditional point-to-point pathfinding algorithms for EVs assume a known initial energy level, many real-world scenarios involving uncertainty in available energy require planning optimal paths for all possible initial energy levels, a task known as energy-optimal profile search. Existing solutions typically rely on specialized profile-merging procedures within a label-correcting framework that results in searching over complex profiles. In this paper, we propose a simple yet effective label-setting approach based on multi-objective A* search, which employs a novel profile dominance rule to avoid generating and handling complex profiles. We develop four variants of our method and evaluate them on real-world road networks enriched with realistic energy consumption data. Experimental results demonstrate that our energy profile A* search achieves performance comparable to energy-optimal A* with a known initial energy level.
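To make the notion of an energy profile concrete, the sketch below evaluates one fixed path for several initial battery levels. It does not implement the paper's profile A* search or its dominance rule. The battery model is an assumption commonly used in EV routing: an edge is infeasible if it would drain the battery below zero, and recuperated energy (a negative cost) cannot push the level above capacity. All names are hypothetical.

```python
def traverse(level, cost, capacity):
    """Battery level after one edge, or None if the edge cannot be driven."""
    new_level = level - cost          # negative cost means recuperation
    if new_level < 0:
        return None                   # not enough energy for this edge
    return min(new_level, capacity)   # the battery cannot overcharge

def arrival_profile(path_costs, capacity, initial_levels):
    """Map each initial battery level to the arrival level along one path."""
    profile = {}
    for start in initial_levels:
        level = start
        for cost in path_costs:
            level = traverse(level, cost, capacity)
            if level is None:
                break                 # path infeasible from this start level
        profile[start] = level
    return profile

# Uphill (3), downhill with recuperation (-2), uphill (4); capacity 10.
print(arrival_profile([3, -2, 4], 10, [4, 5, 10]))
```

The result depends on the initial level in a non-trivial way (infeasible, feasible, or clipped at capacity), which is why the abstract treats planning for all initial levels as a separate problem from the known-initial-energy case.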
[AI-40] Extending NGU to Multi-Agent RL: A Preliminary Study NEURIPS2025
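[Quick read]: This paper addresses inefficient exploration in sparse-reward environments in multi-agent reinforcement learning (MARL). The key to its solution is to extend the single-agent Never Give Up (NGU) algorithm to the multi-agent setting, combining episodic novelty with intrinsic motivation to strengthen exploration, and to examine three design choices: a shared replay buffer, strategies for sharing novelty among agents, and heterogeneous values of the β parameter. Experiments show that a shared replay buffer markedly improves performance and stability, indicating that the effectiveness of multi-agent NGU depends on jointly optimizing experience sharing and intrinsic exploration signals.
Link: https://arxiv.org/abs/2512.01321
Authors: Juan Hernandez,Diego Fernández,Manuel Cifuentes,Denis Parra,Rodrigo Toro Icarte
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures, 1 table. Accepted at the LatinX in AI (LXAI) Workshop at NeurIPS 2025. Includes experimental results for Multi-NGU and Multi-DQN in the PettingZoo simple_tag environment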
Abstract:The Never Give Up (NGU) algorithm has proven effective in reinforcement learning tasks with sparse rewards by combining episodic novelty and intrinsic motivation. In this work, we extend NGU to multi-agent environments and evaluate its performance in the simple_tag environment from the PettingZoo suite. Compared to a multi-agent DQN baseline, NGU achieves moderately higher returns and more stable learning dynamics. We investigate three design choices: (1) shared replay buffer versus individual replay buffers, (2) sharing episodic novelty among agents using different k thresholds, and (3) using heterogeneous values of the beta parameter. Our results show that NGU with a shared replay buffer yields the best performance and stability, highlighting that the gains come from combining NGU intrinsic exploration with experience sharing. Novelty sharing performs comparably when k = 1 but degrades learning for larger values. Finally, heterogeneous beta values do not improve over a small common value. These findings suggest that NGU can be effectively applied in multi-agent settings when experiences are shared and intrinsic exploration signals are carefully tuned.
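The episodic novelty signal can be sketched as a k-nearest-neighbour bonus over the states stored in the current episode. This is a simplified illustration in the spirit of NGU, not the paper's implementation: it omits the running-average distance normalization and the lifelong novelty modulation, and the parameter names `eps` and `c` are assumptions.

```python
import math

def episodic_novelty(embedding, memory, k=3, eps=1e-3, c=1e-3):
    """Intrinsic bonus that is large when the state is far from episodic memory.

    Simplified sketch: an inverse kernel over the k nearest stored embeddings.
    """
    sq_dists = sorted(
        sum((a - b) ** 2 for a, b in zip(embedding, stored))
        for stored in memory
    )
    similarity = sum(eps / (d + eps) for d in sq_dists[:k])
    return 1.0 / math.sqrt(similarity + c)

memory = [(0.0, 0.0), (0.0, 1.0)]
print(episodic_novelty((0.0, 0.0), memory))    # already visited: small bonus
print(episodic_novelty((10.0, 10.0), memory))  # unseen region: large bonus
```

Under this sketch, sharing novelty among agents would amount to passing a pooled `memory` built from several agents' embeddings, and `k` controls how many neighbours are consulted; the paper's precise sharing scheme and the role of its k thresholds may differ.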
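[AI-41] CuES: A Curiosity-driven and Environment-grounded Synthesis Framework for Agentic RL
[Quick read]: This paper addresses task scarcity: in complex, tool-augmented environments, the lack of structured training tasks makes it hard for reinforcement learning (RL) agents to improve effectively. To meet this challenge the authors propose CuES (Curiosity-driven and Environment-grounded Synthesis). Its core innovation is to combine exploration driven by intrinsic curiosity with task generation grounded in the environment, so that diverse, executable, and meaningful tasks are synthesized automatically from the environment structure and tool affordances, without handcrafted seeds or external corpora. CuES abstracts interaction patterns into reusable task schemas and refines them iteratively through lightweight top-down guidance and memory-based quality control, markedly improving the learning efficiency and generalization of downstream policies.
Link: https://arxiv.org/abs/2512.01311
Authors: Shinji Mai,Yunpeng Zhai,Ziqian Chen,Cheng Chen,Anni Zou,Shuchang Tao,Zhaoyang Liu,Bolin Ding
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: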
Abstract:Large language model based agents are increasingly deployed in complex, tool-augmented environments. While reinforcement learning provides a principled mechanism for such agents to improve through interaction, its effectiveness critically depends on the availability of structured training tasks. In many realistic settings, however, no such tasks exist, a challenge we term task scarcity, which has become a key bottleneck for scaling agentic RL. Existing approaches typically assume predefined task collections, an assumption that fails in novel environments where tool semantics and affordances are initially unknown. To address this limitation, we formalize the problem of Task Generation for Agentic RL, where an agent must learn within a given environment that lacks predefined tasks. We propose CuES, a Curiosity-driven and Environment-grounded Synthesis framework that autonomously generates diverse, executable, and meaningful tasks directly from the environment structure and affordances, without relying on handcrafted seeds or external corpora. CuES drives exploration through intrinsic curiosity, abstracts interaction patterns into reusable task schemas, and refines them through lightweight top-down guidance and memory-based quality control. Across three representative environments, AppWorld, BFCL, and WebShop, CuES produces task distributions that match or surpass manually curated datasets in both diversity and executability, yielding substantial downstream policy improvements. These results demonstrate that curiosity-driven, environment-grounded task generation provides a scalable foundation for agents that not only learn how to act, but also learn what to learn. The code is available at this https URL.
[AI-42] RoboDriveVLM: A Novel Benchmark and Baseline towards Robust Vision-Language Models for Autonomous Driving
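[Quick read]: This paper addresses the multiple risks that current end-to-end autonomous driving systems based on Vision-Language Models (VLMs) face in real road scenarios, especially their fragility under real-world challenges such as sensor noise and prompt interference. To evaluate the robustness of VLMs on trajectory prediction systematically, the authors introduce RoboDriveBench, the first robustness benchmark focused on end-to-end trajectory prediction. It covers 11 simulated scenarios, comprising 6 types of sensor corruption caused by environmental variations and 5 types of prompt corruption caused by human intervention or data transmission failures, for a total of 64,559 trajectory prediction cases. The key to the solution is a new VLM-based autonomous driving framework, RoboDriveVLM, which enhances robustness by mapping multimodal data such as lidar and radar into a unified latent space, together with a Test-Time Adaptation (TTA) method based on cross-modal knowledge distillation, markedly improving the model's stability and reliability under complex real-world conditions.
Link: https://arxiv.org/abs/2512.01300
Authors: Dacheng Liao,Mengshi Qi,Peng Shu,Zhining Zhang,Yuxin Lin,Liang Liu,Huadong Ma
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: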
Abstract:Current Vision-Language Model (VLM)-based end-to-end autonomous driving systems often leverage large language models to generate driving decisions directly based on their understanding of the current scene. However, such systems introduce multiple risks in real-world driving scenarios. To evaluate whether VLMs are truly viable for autonomous driving, we introduce RoboDriveBench, the first robustness benchmark focused on end-to-end trajectory prediction tasks. This benchmark systematically evaluates two critical categories of real-world challenges for VLM-based end-to-end autonomous driving systems through 11 simulated scenarios encompassing various corruption types, including 6 scenarios of sensor corruption caused by environmental variations, along with 5 cases of prompt corruption resulting from human intervention and data transmission failures. Each corruption type includes 250 unique driving scenarios and 5,689 frames, resulting in 64,559 total trajectory prediction cases per evaluation. To overcome these real-world challenges, we propose a novel VLM-based autonomous driving framework called RoboDriveVLM, which enhances robustness by mapping more multimodal data (e.g., lidar and radar) into a unified latent space. Furthermore, we introduce a new Test-Time Adaptation (TTA) method based on cross-modal knowledge distillation to improve the robustness of VLM-based autonomous driving systems. Through extensive experiments, our work highlights the limitations of current VLM-based end-to-end autonomous driving systems and provides a more reliable solution for real-world deployment. Source code and datasets will be released.
[AI-43] OntoMetric: An Ontology-Guided Framework for Automated ESG Knowledge Graph Construction
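[Quick read]: This paper addresses the problem that, under Environmental, Social, and Governance (ESG) disclosure frameworks such as SASB, TCFD, and IFRS S2, organizations must extract large numbers of metrics from long, unstructured PDF documents to meet compliance requirements. Manual extraction does not scale, while unconstrained large language model (LLM) extraction often produces inconsistent entities, hallucinated relationships, missing provenance, and high validation failure rates. The key to its solution is the OntoMetric framework, which builds high-quality knowledge graphs through a three-stage pipeline: first, structure-aware segmentation based on table-of-contents boundaries; second, LLM extraction with the ESGMKG ontology schema embedded in the prompts, enriching entities with semantic fields to support downstream reasoning; and third, two-phase validation (LLM-based semantic verification plus rule-based schema checking). The framework achieves 65–90% semantic accuracy and 80–90% schema compliance while preserving segment-level and page-level provenance, producing auditable, trustworthy knowledge graphs suitable for regulatory compliance and web integration.
Link: https://arxiv.org/abs/2512.01289
Authors: Mingqin Yu(1),Fethi Rabhi(1),Boming Xia(2),Zhengyi Yang(1),Felix Tan(1),Qinghua Lu(3) ((1) University of New South Wales, Sydney, Australia, (2) University of Adelaide, Adelaide, Australia, (3) CSIRO Data61, Sydney, Australia)
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: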
Abstract:Environmental, Social, and Governance (ESG) disclosure frameworks such as SASB, TCFD, and IFRS S2 require organizations to compute and report numerous metrics for compliance, yet these requirements are embedded in long, unstructured PDF documents that are difficult to interpret, standardize, and audit. Manual extraction is unscalable, while unconstrained large language model (LLM) extraction often produces inconsistent entities, hallucinated relationships, missing provenance, and high validation failure rates. We present OntoMetric, an ontology-guided framework that transforms ESG regulatory documents into validated, AI- and web-ready knowledge graphs. OntoMetric operates through a three-stage pipeline: (1) structure-aware segmentation using table-of-contents boundaries, (2) ontology-constrained LLM extraction that embeds the ESGMKG schema into prompts while enriching entities with semantic fields for downstream reasoning, and (3) two-phase validation that combines LLM-based semantic verification with rule-based schema checking across entity, property, and relationship levels (VR001-VR006). The framework preserves both segment-level and page-level provenance for audit traceability. Evaluated on five ESG standards (SASB Commercial Banks, SASB Semiconductors, TCFD, IFRS S2, AASB S2) totaling 228 pages and 60 segments, OntoMetric achieves 65-90% semantic accuracy and 80-90% schema compliance, compared to 3-10% for baseline unconstrained extraction, at approximately 0.01 to 0.02 USD per validated entity. Our results demonstrate that combining symbolic ontology constraints with neural extraction enables reliable, auditable knowledge graphs suitable for regulatory compliance and web integration, supporting downstream applications such as sustainable-finance analytics, transparency portals, and automated compliance tools.
[AI-44] Generative Modeling with Continuous Flows: Sample Complexity of Flow Matching
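[Quick read]: This paper addresses the theoretical analysis of the sample complexity of flow matching in generative AI: without assuming access to the empirical risk minimizer (ERM) of the loss function, how many samples are needed to learn the velocity field so that the Wasserstein-2 distance between the generated and true distributions reaches a specified accuracy. The key to its solution is to decompose the velocity field estimation error into three parts (neural-network approximation error, statistical error due to the finite sample size, and optimization error due to the finite number of optimization steps) and to handle each with a separate technique. This yields the first proof, under standard assumptions, that $\mathcal{O}(\epsilon^{-4})$ samples suffice for a Wasserstein-2 error bound of $\mathcal{O}(\epsilon)$.
Link: https://arxiv.org/abs/2512.01286
Authors: Mudit Gaur,Prashant Trivedi,Shuchin Aeron,Amrit Singh Bedi,George K. Atia,Vaneet Aggarwal
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: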
Abstract:Flow matching has recently emerged as a promising alternative to diffusion-based generative models, offering faster sampling and simpler training by learning continuous flows governed by ordinary differential equations. Despite growing empirical success, the theoretical understanding of flow matching remains limited, particularly in terms of sample complexity results. In this work, we provide the first analysis of the sample complexity for flow-matching based generative models without assuming access to the empirical risk minimizer (ERM) of the loss function for estimating the velocity field. Under standard assumptions on the loss function for velocity field estimation and boundedness of the data distribution, we show that a sufficiently expressive neural network can learn a velocity field from $\mathcal{O}(\epsilon^{-4})$ samples such that the Wasserstein-2 distance between the learned and the true distribution is less than $\mathcal{O}(\epsilon)$. The key technical idea is to decompose the velocity field estimation error into neural-network approximation error, statistical error due to the finite sample size, and optimization error due to the finite number of optimization steps for estimating the velocity field. Each of these terms is then handled via techniques that may be of independent interest.
[AI-45] Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding
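[Quick read]: This paper addresses the memory-bandwidth bottleneck that arises as generated sequences grow longer during chain-of-thought (CoT) reasoning in generative language models. As the number of generated tokens increases, the model applies full attention to all previous tokens, so the KV-Cache keeps growing and every inference step faces heavier memory-access pressure. The key to its solution is the SparseSpec framework, whose core innovations are: first, a sparse attention mechanism called PillarAttn that serves as the draft model and selects critical tokens accurately by reusing information from the verification stage; second, a self-speculation strategy in which the same model acts as both draft and target model to reduce resource overhead; and third, co-designed system-level scheduling, delayed verification, and dynamic KV-Cache management. Together these raise inference throughput substantially, with a measured speedup of up to 2.13x.
Link: https://arxiv.org/abs/2512.01278
Authors: Yilong Zhao,Jiaming Tang,Kan Zhu,Zihao Ye,Chi-Chih Chang,Chaofan Lin,Jongseok Park,Guangxuan Xiao,Mohamed S. Abdelfattah,Mingyu Gao,Baris Kasikci,Song Han,Ion Stoica
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: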
Abstract:Reasoning language models have demonstrated remarkable capabilities on challenging tasks by generating elaborate chain-of-thought (CoT) solutions. However, such lengthy generation shifts the inference bottleneck from compute-bound to memory-bound. To generate each token, the model applies full attention to all previously generated tokens, requiring memory access to an increasingly large KV-Cache. Consequently, longer generations demand more memory access for every step, leading to substantial pressure on memory bandwidth. To address this, we introduce SparseSpec, a speculative decoding framework that reuses the same model as the draft and target models (i.e., self-speculation). SparseSpec features a novel sparse attention mechanism, PillarAttn, as the draft model, which accurately selects critical tokens via elegantly reusing information from the verification stage. Furthermore, SparseSpec co-designs self-speculation with three system innovations: (1) a unified scheduler to batch token drafting and verification, (2) delayed verification for CPU/GPU overlap, and (3) dynamic KV-Cache management to maximize memory utilization. Across various models and datasets, SparseSpec outperforms state-of-the-art solutions, with an up to 2.13x throughput speedup.
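The draft-then-verify loop behind speculative decoding can be shown with toy models. This sketch is a generic greedy variant, not SparseSpec: the sparse-attention drafter, batched verification, and KV-Cache management are all omitted, and `draft_next` and `target_next` are hypothetical stand-ins for a cheap and an exact next-token function. In a real system the verification loop would be one batched forward pass.

```python
def speculative_decode(prompt, draft_next, target_next, num_tokens, draft_len=4):
    """Greedy speculative decoding; output equals plain target decoding."""
    seq = list(prompt)
    goal = len(prompt) + num_tokens
    while len(seq) < goal:
        # Draft phase: the cheap model proposes draft_len tokens.
        drafted = []
        for _ in range(draft_len):
            drafted.append(draft_next(seq + drafted))
        # Verify phase: accept drafted tokens while they match the target.
        for token in drafted:
            expected = target_next(seq)
            seq.append(expected)  # accepted token, or the target's correction
            if expected != token or len(seq) >= goal:
                break
    return seq[:goal]

def target_next(seq):
    """Toy exact model: a deterministic function of the last token."""
    return (3 * seq[-1] + 1) % 7

def draft_next(seq):
    """Toy drafter: guesses 0 whenever the context length is a multiple of 3."""
    return target_next(seq) if len(seq) % 3 else 0

print(speculative_decode([2], draft_next, target_next, 10))
```

Because every appended token is the target's own greedy choice, the drafter affects only how many tokens are accepted per round, never the output. That invariant is what lets a sparse or approximate drafter be used safely.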
[AI-46] Social Media Data Mining of Human Behaviour during Bushfire Evacuation
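[Quick read]: This paper tries to address the limitations of traditional data sources on fire evacuation behaviour, such as quantitative surveys and manual observations, including small data volume, high cost, and a lack of location information and contextual detail, which make it hard to understand bushfire evacuation behaviour fully. The key to its solution is social media data mining: collecting and processing large amounts of low-cost social data containing location information and rich context to compensate for the shortcomings of traditional methods. The paper also proposes future applications in evacuation model calibration and validation, emergency communication, personalised evacuation training, and resource allocation. It identifies the core open challenges, such as data quality, bias and representativeness, geolocation accuracy, contextual understanding, crisis-specific lexicon and semantics, and multimodal data fusion, giving clear direction for follow-up research.
Link: https://arxiv.org/abs/2512.01262
Authors: Junfeng Wu,Xiangmin Zhou,Erica Kuligowski,Dhirendra Singh,Enrico Ronchi,Max Kinateder
Affiliations: unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: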
Abstract:Traditional data sources on bushfire evacuation behaviour, such as quantitative surveys and manual observations, have severe limitations. Mining social media data related to bushfire evacuations promises to close this gap by allowing the collection and processing of a large amount of behavioural data, which are low-cost, accurate, possibly including location information and rich contextual information. However, social media data have many limitations, such as being scattered, incomplete, and informal. Together, these limitations pose several challenges to their usefulness for understanding bushfire evacuation. To overcome these challenges and provide guidance on which and how social media data can be used, this scoping review of the literature reports on recent advances in relevant data mining techniques. In addition, future applications and open problems are discussed. We envision future applications such as evacuation model calibration and validation, emergency communication, personalised evacuation training, and resource allocation for evacuation preparedness. We identify open problems such as data quality, bias and representativeness, geolocation accuracy, contextual understanding, crisis-specific lexicon and semantics, and multimodal data interpretation.
[AI-47] Pascal-Weighted Genetic Algorithms: A Binomially-Structured Recombination Framework
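[Quick read]: This paper addresses the difficulty that traditional two-parent crossover operators in Genetic Algorithms (GAs) have in balancing inheritance stability against variation diversity when recombining multiple parents, which in complex optimization scenarios tends to cause oscillating convergence or trapping in local optima. The key to its solution is a multi-parent recombination operator based on normalized Pascal (binomial) coefficients, Pascal-Weighted Recombination (PWR). It builds a convex-combination weight structure that emphasizes the center, so that offspring inherit genetic information from multiple parents smoothly and controllably, suppressing disruptive variance and improving schema preservation. The method applies to real-valued, binary/logit, and permutation representations, and its performance is validated on four representative benchmarks (PID controller tuning, FIR filter design, wireless power-modulation optimization, and the Traveling Salesman Problem), with gains of 9–22% in convergence quality and stability over standard recombination operators.
Link: https://arxiv.org/abs/2512.01249
Authors: Otman A. Basir
Affiliations: unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 23 pages, 8 figures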
Abstract:This paper introduces a new family of multi-parent recombination operators for Genetic Algorithms (GAs), based on normalized Pascal (binomial) coefficients. Unlike classical two-parent crossover operators, Pascal-Weighted Recombination (PWR) forms offspring as structured convex combinations of multiple parents, using binomially shaped weights that emphasize central inheritance while suppressing disruptive variance. We develop a mathematical framework for PWR, derive variance-transfer properties, and analyze its effect on schema survival. The operator is extended to real-valued, binary/logit, and permutation representations. We evaluate the proposed method on four representative benchmarks: (i) PID controller tuning evaluated using the ITAE metric, (ii) FIR low-pass filter design under magnitude-response constraints, (iii) wireless power-modulation optimization under SINR coupling, and (iv) the Traveling Salesman Problem (TSP). We demonstrate how, across these benchmarks, PWR consistently yields smoother convergence and reduced variance, and achieves 9-22% performance gains over standard recombination operators. The approach is simple, algorithm-agnostic, and readily integrable into diverse GA architectures.
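The weighting idea in the abstract above can be illustrated with a minimal sketch. It assumes the weights are row m-1 of Pascal's triangle divided by 2^(m-1) and that parents are real-valued vectors given in a fixed order; the function names and the parent ordering are hypothetical and not taken from the paper.

```python
from math import comb

def pascal_weights(m):
    """Normalized row (m - 1) of Pascal's triangle: m weights that sum to 1."""
    row = [comb(m - 1, k) for k in range(m)]
    total = sum(row)  # equals 2 ** (m - 1)
    return [c / total for c in row]

def pwr_offspring(parents):
    """Convex combination of real-valued parents using binomial weights.

    Assumption: parents arrive in an order where the middle ones should
    receive the largest weight; the paper may order or select them differently.
    """
    weights = pascal_weights(len(parents))
    dim = len(parents[0])
    return [sum(w * p[i] for w, p in zip(weights, parents)) for i in range(dim)]

parents = [[0.0, 4.0], [1.0, 2.0], [2.0, 0.0]]
weights = pascal_weights(3)
child = pwr_offspring(parents)
```

Because the weights are non-negative and sum to one, the child always lies inside the convex hull of its parents, which is one way to read the "suppressing disruptive variance" claim.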
[AI-48] First do NOHARM: towards clinically safe large language models
[Quick Read]: This paper addresses the insufficiently characterized clinical safety of widely used generative AI in medicine: large language models (LLMs) are already used for medical advice, yet the frequency and severity of the harm they may cause have not been assessed systematically. The key to the solution is NOHARM (Numerous Options Harm Assessment for Risk in Medicine), a benchmark built on 100 real primary-care-to-specialist consultation cases that covers 10 specialties and contains 4,249 clinical management options with 12,747 expert annotations, so that error types and harm levels in LLM recommendations can be quantified. Across 31 LLMs, severe harm occurs in up to 22.2% of cases and errors of omission predominate (76.6%). Safety performance is only moderately correlated (r = 0.61-0.64) with existing AI and medical knowledge benchmarks, which underscores the need to measure clinical safety explicitly as a distinct dimension.
Link: https://arxiv.org/abs/2512.01241
Authors: David Wu,Fateme Nateghi Haredasht,Saloni Kumar Maharaj,Priyank Jain,Jessica Tran,Matthew Gwiazdon,Arjun Rustagi,Jenelle Jindal,Jacob M. Koshy,Vinay Kadiyala,Anup Agarwal,Bassman Tappuni,Brianna French,Sirus Jesudasen,Christopher V. Cosgriff,Rebanta Chakraborty,Jillian Caldwell,Susan Ziolkowski,David J. Iberri,Robert Diep,Rahul S. Dalal,Kira L. Newman,Kristin Galetta,J. Carl Pallais,Nancy Wei,Kathleen M. Buchheit,David I. Hong,Ernest Y. Lee,Allen Shih,Vartan Pahalyants,Tamara B. Kaplan,Vishnu Ravi,Sarita Khemani,April S. Liang,Daniel Shirvani,Advait Patil,Nicholas Marshall,Kanav Chopra,Joel Koh,Adi Badhwar,Liam G. McCoy,David J. H. Wu,Yingjie Weng,Sumant Ranji,Kevin Schulman,Nigam H. Shah,Jason Hom,Arnold Milstein,Adam Rodman,Jonathan H. Chen,Ethan Goh
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) are routinely used by physicians and patients for medical advice, yet their clinical safety profiles remain poorly characterized. We present NOHARM (Numerous Options Harm Assessment for Risk in Medicine), a benchmark using 100 real primary-care-to-specialist consultation cases to measure harm frequency and severity from LLM-generated medical recommendations. NOHARM covers 10 specialties, with 12,747 expert annotations for 4,249 clinical management options. Across 31 LLMs, severe harm occurs in up to 22.2% (95% CI 21.6-22.8%) of cases, with harms of omission accounting for 76.6% (95% CI 76.4-76.8%) of errors. Safety performance is only moderately correlated (r = 0.61-0.64) with existing AI and medical knowledge benchmarks. The best models outperform generalist physicians on safety (mean difference 9.7%, 95% CI 7.0-12.5%), and a diverse multi-agent approach reduces harm compared to solo models (mean difference 8.0%, 95% CI 4.0-12.1%). Therefore, despite strong performance on existing evaluations, widely used AI models can produce severely harmful medical advice at nontrivial rates, underscoring clinical safety as a distinct performance dimension necessitating explicit measurement.
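[AI-49] Proactive Agentic Whiteboards: Enhancing Diagrammatic Learning
[Quick Read]: This paper addresses the high cognitive load instructors face when they must speak and draw diagrams in real time while explaining complex concepts, which can leave diagrams incomplete or unclear and hinder students' comprehension. The key to the solution is DrawDash, a whiteboard assistant built on multimodal understanding that adopts a TAB-completion interaction model: it listens to the spoken explanation, detects intent, and dynamically suggests diagram refinements that can be accepted with a single keystroke. The work is an early exploration, demonstrated across four teaching scenarios, of reducing instructors' drawing burden and improving diagram-based teaching.
Link: https://arxiv.org/abs/2512.01234
Authors: Suveen Ellawala,Sashenka Gamage,Dinithi Dissanayake
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: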
Abstract:Educators frequently rely on diagrams to explain complex concepts during lectures, yet creating clear and complete visual representations in real time while simultaneously speaking can be cognitively demanding. Incomplete or unclear diagrams may hinder student comprehension, as learners must mentally reconstruct missing information while following the verbal explanation. Inspired by advances in code completion tools, we introduce DrawDash, an AI-powered whiteboard assistant that proactively completes and refines educational diagrams through multimodal understanding. DrawDash adopts a TAB-completion interaction model: it listens to spoken explanations, detects intent, and dynamically suggests refinements that can be accepted with a single keystroke. We demonstrate DrawDash across four diverse teaching scenarios, spanning topics from computer science and web development to biology. This work represents an early exploration into reducing instructors’ cognitive load and improving diagram-based pedagogy through real-time, speech-driven visual assistance, and concludes with a discussion of current limitations and directions for formal classroom evaluation.
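[AI-50] LLM-as-a-Judge for Scalable Test Coverage Evaluation: Accuracy, Operational Reliability, and Cost AAAI2026
[Quick Read]: This paper addresses the bottleneck of assessing software test coverage at scale in quality-assurance (QA) pipelines, in particular how to evaluate Gherkin acceptance tests efficiently and accurately. The key to the solution is LLM-as-a-Judge (LAJ), a rubric-driven evaluation framework that scores Gherkin test scripts and emits structured JSON output. It introduces the Evaluation Completion Rate (ECR@1) to quantify first-attempt success and is evaluated across 20 model configurations (GPT-4, GPT-5, and open-weight models) on 100 expert-annotated scripts over 5 runs (500 evaluations), exposing the relationship between accuracy, cost, and reasoning effort. Smaller models can outperform larger ones: GPT-4o Mini attains the best accuracy with high reliability and low cost, a 78x cost reduction versus GPT-5 (high reasoning) while improving accuracy.
Link: https://arxiv.org/abs/2512.01232
Authors: Donghao Huang,Shila Chew,Anna Dutkiewicz,Zhaoxia Wang
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 7 pages, accepted by the AAAI 2026 Workshop on Next Gen Code Development with Collaborative AI Agents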
Abstract:Assessing software test coverage at scale remains a bottleneck in QA pipelines. We present LLM-as-a-Judge (LAJ), a production-ready, rubric-driven framework for evaluating Gherkin acceptance tests with structured JSON outputs. Across 20 model configurations (GPT-4, GPT-5 with varying reasoning effort, and open-weight models) on 100 expert-annotated scripts over 5 runs (500 evaluations), we provide the first comprehensive analysis spanning accuracy, operational reliability, and cost. We introduce the Evaluation Completion Rate (ECR@1) to quantify first-attempt success, revealing reliability from 85.4% to 100.0% with material cost implications via retries. Results show that smaller models can outperform larger ones: GPT-4o Mini attains the best accuracy (6.07 MAAE), high reliability (96.6% ECR@1), and low cost ($1.01 per 1K), yielding a 78x cost reduction vs. GPT-5 (high reasoning) while improving accuracy. Reasoning effort is model-family dependent: GPT-5 benefits from increased reasoning (with predictable accuracy-cost tradeoffs), whereas open-weight models degrade across all dimensions as reasoning increases. Overall, cost spans 175x ($0.45-$78.96 per 1K). We release the dataset, framework, and code to support reproducibility and deployment.
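[AI-51] Unsupervised decoding of encoded reasoning using language model interpretability
[Quick Read]: This paper asks whether current mechanistic interpretability methods can identify and decode reasoning that a large language model carries out in a non-human-readable format (here, ROT-13 encoding). The underlying concern is that increasingly capable models may reason along hidden internal paths that evade human oversight. The key to the solution is to combine logit lens analysis with automated paraphrasing into a fully unsupervised decoding pipeline. Experiments show that it reconstructs the encoded chain-of-thought transcripts with substantial accuracy, peaking in intermediate-to-late layers, which suggests that existing interpretability techniques may be more robust to simple forms of encoded reasoning than previously understood.
Link: https://arxiv.org/abs/2512.01222
Authors: Ching Fang,Samuel Marks
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: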
Abstract:As large language models become increasingly capable, there is growing concern that they may develop reasoning processes that are encoded or hidden from human oversight. To investigate whether current interpretability techniques can penetrate such encoded reasoning, we construct a controlled testbed by fine-tuning a reasoning model (DeepSeek-R1-Distill-Llama-70B) to perform chain-of-thought reasoning in ROT-13 encryption while maintaining intelligible English outputs. We evaluate mechanistic interpretability methods–in particular, logit lens analysis–on their ability to decode the model’s hidden reasoning process using only internal activations. We show that logit lens can effectively translate encoded reasoning, with accuracy peaking in intermediate-to-late layers. Finally, we develop a fully unsupervised decoding pipeline that combines logit lens with automated paraphrasing, achieving substantial accuracy in reconstructing complete reasoning transcripts from internal model representations. These findings suggest that current mechanistic interpretability techniques may be more robust to simple forms of encoded reasoning than previously understood. Our work provides an initial framework for evaluating interpretability methods against models that reason in non-human-readable formats, contributing to the broader challenge of maintaining oversight over increasingly capable AI systems.
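For readers unfamiliar with the encoding used in this testbed, the snippet below shows what ROT-13 text looks like, using Python's standard `codecs` module. It only illustrates the cipher; it is not the paper's logit lens pipeline, and the example sentence is made up.

```python
import codecs

def rot13(text):
    """Rotate each ASCII letter by 13 places; digits and punctuation are unchanged."""
    return codecs.encode(text, "rot_13")

plain = "Add 2 and 3 to get 5."
encoded = rot13(plain)    # what an encoded reasoning step could look like
decoded = rot13(encoded)  # ROT-13 is its own inverse
```

Since applying the cipher twice returns the original text, a decoder that recovers the plain tokens from internal activations can be checked against a simple ground truth.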
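[AI-52] Neural Network Optimal Power Flow via Energy Gradient Flow and Unified Dynamics
[Quick Read]: This paper addresses the low computational efficiency, sensitivity to initial values, and poor batch-computation efficiency of traditional Optimal Power Flow (OPF) solvers, as well as the reliance of existing deep-learning OPF methods on large numbers of pre-solved cases and their difficulty guaranteeing physical consistency. The key to the solution is to recast OPF as an energy minimization problem: an energy function measures the deviation from the constraint manifold, and an energy gradient flow guides the neural network toward solutions that satisfy physical constraints such as the power flow equations while minimizing generation cost. Training is unsupervised and directly minimizes physical residuals with no labeled data, which the author describes as true "end-to-end" physics-constrained learning.
Link: https://arxiv.org/abs/2512.01219
Authors: Xuezhi Liu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: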
Abstract:Optimal Power Flow (OPF) is a core optimization problem in power system operation and planning, aiming to minimize generation costs while satisfying physical constraints such as power flow equations, generator limits, and voltage limits. Traditional OPF solving methods typically employ iterative optimization algorithms (such as interior point methods, sequential quadratic programming, etc.), with limitations including low computational efficiency, initial value sensitivity, and low batch computation efficiency. Most existing deep learning-based OPF methods rely on supervised learning, requiring pre-solving large numbers of cases, and have difficulty guaranteeing physical consistency. This paper proposes an Optimal Power Flow solving method based on neural network dynamics and energy gradient flow, transforming OPF problems into energy minimization problems. By constructing an energy function to measure the degree of deviation from the constraint manifold, and guiding networks to learn optimal solutions that simultaneously satisfy power flow constraints and minimize costs through gradient flow. Neural networks are trained unsupervised by directly minimizing physical residuals, requiring no labeled data, achieving true “end-to-end” physics-constrained learning.
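[AI-53] Knowledge Graph Augmented Large Language Models for Next-Visit Disease Prediction
[Quick Read]: This paper addresses the lack of fine-grained, interpretable, clinically grounded patient-level reasoning in clinical prediction models built on electronic health records (EHRs); existing methods typically offer coarse, post hoc explanations that do little to support patient-level decisions. The key to the solution is a knowledge graph (KG)-guided chain-of-thought (CoT) framework: ICD-9 codes are mapped to PrimeKG, disease-relevant nodes and multi-hop reasoning paths are extracted as scaffolds for CoT generation, only explanations whose conclusions match observed outcomes are retained, and lightweight LLMs (LLaMA-3.1-Instruct-8B and Gemma-7B) are fine-tuned on the resulting supervision corpus. On MIMIC-III the method outperforms classical baselines (AUROC 0.66-0.70, macro-AUPR 0.40-0.47), and in zero-shot transfer to the CRADLE cohort accuracy improves from roughly 0.40-0.51 to 0.72-0.77. A blinded clinician evaluation prefers its explanations for clarity, relevance, and clinical correctness.
Link: https://arxiv.org/abs/2512.01210
Authors: Ruiyu Wang,Tuan Vinh,Ran Xu,Yuyin Zhou,Jiaying Lu,Carl Yang,Francisco Pasquel
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: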
Abstract:Electronic health records (EHRs) support powerful clinical prediction models, but existing methods typically provide coarse, post hoc explanations that offer limited value for patient-level decision making. We introduce a knowledge graph (KG)-guided chain-of-thought (CoT) framework that generates clinically grounded and temporally consistent reasoning for visit-level disease prediction in MIMIC-III. ICD-9 codes are mapped to PrimeKG, from which disease-relevant nodes and multi-hop reasoning paths are extracted and used as scaffolds for CoT generation; only explanations whose conclusions match observed outcomes are retained. Lightweight LLaMA-3.1-Instruct-8B and Gemma-7B models are then fine-tuned on this supervision corpus. Across ten PrimeKG-mapped diseases and limited training cohorts (400 and 1000 cases), KG-guided models outperform strong classical baselines, achieving AUROC values of 0.66 to 0.70 and macro-AUPR values of 0.40 to 0.47. The models also transfer zero-shot to the CRADLE cohort, improving accuracy from approximately 0.40 to 0.51 up to 0.72 to 0.77. A blinded clinician evaluation shows consistent preference for KG-guided CoT explanations in clarity, relevance, and clinical correctness.
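[AI-54] Physics-Constrained Neural Dynamics: A Unified Manifold Framework for Large-Scale Power Flow Computation
[Quick Read]: This paper addresses the initial-value sensitivity and low batch-computation efficiency of the traditional Newton-Raphson method for power flow computation, as well as the fact that existing deep-learning power flow solvers mostly rely on supervised learning, require many pre-solved cases, and struggle to guarantee physical consistency. The key to the solution is to describe the power flow equations as a constraint manifold, construct the energy function $V(\mathbf{x}) = \frac{1}{2}\|\mathbf{F}(\mathbf{x})\|^2$ and the gradient flow $\frac{d\mathbf{x}}{dt} = -\nabla V(\mathbf{x})$, and thereby turn power flow solving into the search for an equilibrium point of a dynamical system. The neural network is trained without supervision by directly minimizing physical residuals, with no labeled data, which the author describes as true "end-to-end" physics-constrained learning.
Link: https://arxiv.org/abs/2512.01207
Authors: Xuezhi Liu
Affiliations: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: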
Abstract:Power flow analysis is a fundamental tool for power system analysis, planning, and operational control. Traditional Newton-Raphson methods suffer from limitations such as initial value sensitivity and low efficiency in batch computation, while existing deep learning-based power flow solvers mostly rely on supervised learning, requiring pre-solving of numerous cases and struggling to guarantee physical consistency. This paper proposes a neural physics power flow solving method based on manifold geometry and gradient flow, by describing the power flow equations as a constraint manifold, and constructing an energy function $V(\mathbf{x}) = \frac{1}{2}\|\mathbf{F}(\mathbf{x})\|^2$ and gradient flow $\frac{d\mathbf{x}}{dt} = -\nabla V(\mathbf{x})$, transforming power flow solving into an equilibrium point finding problem for dynamical systems. Neural networks are trained in an unsupervised manner by directly minimizing physical residuals, requiring no labeled data, achieving true “end-to-end” physics-constrained learning.
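The energy and gradient flow named in the abstract can be tried on a toy problem. The sketch below integrates the flow with forward Euler steps, using the identity grad V = J(x)^T F(x). The two-equation residual is a made-up stand-in, not a power flow model, and no neural network is involved; the step size and iteration count are assumptions chosen for this toy system.

```python
def residual(x):
    """Toy stand-in for the mismatch F(x); not a real grid model."""
    return [x[0] ** 2 + x[1] ** 2 - 1.0, x[0] - x[1]]

def jacobian(x):
    """Jacobian J(x) of the toy residual."""
    return [[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]]

def energy(x):
    """V(x) = 0.5 * ||F(x)||^2."""
    f = residual(x)
    return 0.5 * (f[0] ** 2 + f[1] ** 2)

def grad_energy(x):
    """grad V(x) = J(x)^T F(x)."""
    f, j = residual(x), jacobian(x)
    return [j[0][0] * f[0] + j[1][0] * f[1],
            j[0][1] * f[0] + j[1][1] * f[1]]

def gradient_flow(x0, step=0.05, iters=2000):
    """Forward Euler discretization of dx/dt = -grad V(x)."""
    x = list(x0)
    for _ in range(iters):
        g = grad_energy(x)
        x = [x[0] - step * g[0], x[1] - step * g[1]]
    return x

x_star = gradient_flow([1.0, 0.2])
```

An equilibrium of the flow with V = 0 is a root of F, so for this toy system the iterate settles where both equations hold, near x0 = x1 = 1/sqrt(2).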
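[AI-55] fMRI2GES: Co-speech Gesture Reconstruction from fMRI Signal with Dual Brain Decoding Alignment
[Quick Read]: This paper addresses the reconstruction of co-speech gestures associated with speech stimuli from functional magnetic resonance imaging (fMRI) signals, a task made difficult by the lack of paired brain-speech-gesture data. The key to the solution is fMRI2GES, whose core is Dual Brain Decoding Alignment. It relies on two components: (1) the observed texts that elicit brain responses and (2) textual descriptions associated with the gestures. Using an fMRI-to-text model, a text-to-gesture model trained on paired data, and an fMRI-to-gesture model trained on unpaired data, it builds dual fMRI-to-gesture reconstruction paths, explicitly aligns the two outputs, and trains in a self-supervised manner, so that expressive gestures can be reconstructed directly from fMRI using only unpaired data.
Link: https://arxiv.org/abs/2512.01189
Authors: Chunzheng Zhu,Jialin Shao,Jianxin Lin,Yijun Wang,Jing Wang,Jinhui Tang,Kenli Li
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) 2025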
Abstract:Understanding how the brain responds to external stimuli and decoding this process has been a significant challenge in neuroscience. While previous studies typically concentrated on brain-to-image and brain-to-language reconstruction, our work strives to reconstruct gestures associated with speech stimuli perceived by brain. Unfortunately, the lack of paired {brain, speech, gesture} data hinders the deployment of deep learning models for this purpose. In this paper, we introduce a novel approach, fMRI2GES, that allows training of fMRI-to-gesture reconstruction networks on unpaired data using Dual Brain Decoding Alignment. This method relies on two key components: (i) observed texts that elicit brain responses, and (ii) textual descriptions associated with the gestures. Then, instead of training models in a completely supervised manner to find a mapping relationship among the three modalities, we harness an fMRI-to-text model, a text-to-gesture model with paired data and an fMRI-to-gesture model with unpaired data, establishing dual fMRI-to-gesture reconstruction patterns. Afterward, we explicitly align two outputs and train our model in a self-supervision way. We show that our proposed method can reconstruct expressive gestures directly from fMRI recordings. We also investigate fMRI signals from different ROIs in the cortex and how they affect generation results. Overall, we provide new insights into decoding co-speech gestures, thereby advancing our understanding of neuroscience and cognitive science.
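[AI-56] Real-World Reinforcement Learning of Active Perception Behaviors NEURIPS2025
[Quick Read]: This paper addresses the difficulty robots have in learning active perception behaviors under partial observability. A robot's instantaneous sensory observations often do not reveal task-relevant state, and optimal behavior typically requires acting to obtain the missing information, yet today's standard robot learning methods struggle to produce such policies. The key to the solution is asymmetric advantage weighted regression (AAWR), which exploits "privileged" sensors available at training time to train high-quality privileged value functions that yield better estimates of the target policy's advantage. Bootstrapping from a small number of potentially suboptimal demonstrations and an easy-to-obtain coarse policy initialization, it quickly acquires reliable active perception behaviors and outperforms prior approaches on 8 manipulation tasks across 3 robots, producing information-gathering behavior even under severe partial observability.
Link: https://arxiv.org/abs/2512.01188
Authors: Edward S. Hu,Jie Wang,Xingfang Yuan,Fiona Luo,Muyao Li,Gaspard Lambrechts,Oleh Rybkin,Dinesh Jayaraman
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: NeurIPS 2025 camera ready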
Abstract:A robot’s instantaneous sensory observations do not always reveal task-relevant state information. Under such partial observability, optimal behavior typically involves explicitly acting to gain the missing information. Today’s standard robot learning techniques struggle to produce such active perception behaviors. We propose a simple real-world robot learning recipe to efficiently train active perception policies. Our approach, asymmetric advantage weighted regression (AAWR), exploits access to “privileged” extra sensors at training time. The privileged sensors enable training high-quality privileged value functions that aid in estimating the advantage of the target policy. Bootstrapping from a small number of potentially suboptimal demonstrations and an easy-to-obtain coarse policy initialization, AAWR quickly acquires active perception behaviors and boosts task performance. In evaluations on 8 manipulation tasks on 3 robots spanning varying degrees of partial observability, AAWR synthesizes reliable active perception behaviors that outperform all prior approaches. When initialized with a “generalist” robot policy that struggles with active perception tasks, AAWR efficiently generates information-gathering behaviors that allow it to operate under severe partial observability for manipulation tasks. Website: this https URL
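[AI-57] Teaching by Failure: Counter-Example-Driven Curricula for Transformer Self-Improvement AACL2025
[Quick Read]: This paper addresses the brittle extrapolation of Transformer models on inputs that are longer or structurally more complex than those seen during training. The key to the solution is Counter-Example-Driven Curricula (CEDC), an automated framework in which the current model generates a diverse set of candidate problems, a fast executable verifier identifies the incorrect predictions (counter-examples), and the model is iteratively fine-tuned on data enriched with these failures, so that it becomes progressively more robust to complex error modes. The method needs no hand-designed difficulty heuristics and, across several algorithmic and natural language tasks, outperforms static training and standard curriculum learning baselines.
Link: https://arxiv.org/abs/2512.01187
Authors: Harshil Vejendla
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: AACL 2025 Findings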
Abstract:Transformer models often exhibit brittle extrapolation, failing on inputs that are longer or structurally more complex than those seen during training. We introduce Counter-Example-Driven Curricula (CEDC), an automated framework that improves model robustness by iteratively focusing on its own failures. At each step, CEDC uses the current model to generate a diverse set of candidate problems, employs a fast, executable verifier to identify incorrect predictions (counter-examples), and then fine-tunes the model on a dataset enriched with these discovered failures. We evaluate CEDC on a suite of algorithmic and natural language tasks, including integer addition, sorting, Dyck-2 language recognition, and three text classification benchmarks. Compared to static training and standard curriculum learning baselines, CEDC achieves up to 30x greater length extrapolation, is 3.75x more computationally efficient than uniform data augmentation, and requires no manual difficulty heuristics. We provide a detailed analysis of the counter-examples, showing how the curriculum naturally adapts to target progressively more complex error modes. Our findings establish verifier-guided, failure-driven learning as a simple, powerful, and efficient paradigm for enhancing the generalization capabilities of Transformer models.
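The generate, verify, fine-tune loop described above can be sketched with a toy stand-in. Everything here is hypothetical: `ToyModel` is not a Transformer (it adds digits while dropping carries and "fine-tunes" by memorizing corrected examples), and candidates are sampled at random rather than generated by the model as in the paper. The sketch only shows the shape of the loop.

```python
import random

def verifier(problem, answer):
    """Fast executable check: integer addition has an exact ground truth."""
    a, b = problem
    return answer == a + b

class ToyModel:
    """Hypothetical stand-in for a model with a systematic failure mode."""

    def __init__(self):
        self.memory = {}

    def predict(self, problem):
        if problem in self.memory:
            return self.memory[problem]
        a, b = problem
        result, place = 0, 1
        while a or b:
            # Digit-wise addition that drops carries (the toy failure mode).
            result += ((a % 10 + b % 10) % 10) * place
            a, b, place = a // 10, b // 10, place * 10
        return result

    def fine_tune(self, examples):
        """Toy 'fine-tuning': memorize the corrected counter-examples."""
        self.memory.update(examples)

def cedc_round(model, rng, n_candidates=50, max_value=999):
    """One round: propose candidates, keep verified failures, train on them."""
    candidates = [(rng.randint(0, max_value), rng.randint(0, max_value))
                  for _ in range(n_candidates)]
    failures = {p: p[0] + p[1] for p in candidates
                if not verifier(p, model.predict(p))}
    model.fine_tune(failures)
    return failures

rng = random.Random(0)
model = ToyModel()
seen = {}
for _ in range(3):
    seen.update(cedc_round(model, rng))
```

The design point the sketch preserves is that the training set is defined by the verifier's verdicts on the current model, so no manual difficulty ordering is needed.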
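[AI-58] Toward a benchmark for CTR prediction in online advertising: datasets, evaluation protocols and perspectives
[Quick Read]: This paper addresses the lack of unified standards and a systematic benchmark platform for evaluating click-through rate (CTR) prediction models. The authors propose and implement a unified CTR prediction benchmark platform (Bench-CTR). Its key elements are a flexible, extensible architecture with interfaces to a range of datasets and model components, and a complete evaluation system covering real-world and synthetic datasets, a taxonomy of metrics, and standardized experimental procedures and protocols. The platform supports a broad comparison from traditional multivariate statistical models to large language model (LLM)-based approaches. The results indicate that high-order models largely outperform low-order ones, that LLM-based models are notably data-efficient, and that CTR model performance improved markedly from 2015 to 2016 and more slowly afterwards.
Link: https://arxiv.org/abs/2512.01179
Authors: Shan Gao,Yanwu Yang
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 64 pages, 8 figures, 11 tables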
Abstract:This research designs a unified architecture of CTR prediction benchmark (Bench-CTR) platform that offers flexible interfaces with datasets and components of a wide range of CTR prediction models. Moreover, we construct a comprehensive system of evaluation protocols encompassing real-world and synthetic datasets, a taxonomy of metrics, standardized procedures and experimental guidelines for calibrating the performance of CTR prediction models. Furthermore, we implement the proposed benchmark platform and conduct a comparative study to evaluate a wide range of state-of-the-art models from traditional multivariate statistical to modern large language model (LLM)-based approaches on three public datasets and two synthetic datasets. Experimental results reveal that, (1) high-order models largely outperform low-order models, though such advantage varies in terms of metrics and on different datasets; (2) LLM-based models demonstrate a remarkable data efficiency, i.e., achieving the comparable performance to other models while using only 2% of the training data; (3) the performance of CTR prediction models has achieved significant improvements from 2015 to 2016, then reached a stage with slow progress, which is consistent across various datasets. This benchmark is expected to facilitate model development and evaluation and enhance practitioners’ understanding of the underlying mechanisms of models in the area of CTR prediction. Code is available at this https URL.
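[AI-59] Conversion rate prediction in online advertising: modeling techniques, performance evaluation and future directions
[Quick Read]: This paper addresses the fact that the methodological evolution of conversion rate (CVR) prediction in online advertising, and the relationships between different techniques, have not been laid out systematically. The key to the solution is a classification of state-of-the-art CVR prediction models into six categories by underlying technique, with an analysis of each category's framework, advantages and disadvantages, and use in CVR prediction. The review also summarizes reported performance on public and proprietary datasets and identifies semantics-enriched, attribution-enhanced, and debiased CVR prediction, together with joint modeling of CTR and CVR, as promising future directions, offering a structured reference for researchers and practitioners.
Link: https://arxiv.org/abs/2512.01171
Authors: Tao Xue,Yanwu Yang,Panyu Zhai
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 99 pages, 15 figures, 7 tables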
Abstract:Conversion and conversion rate (CVR) prediction play a critical role in efficient advertising decision-making. In past decades, although researchers have developed plenty of models for CVR prediction, the methodological evolution and relationships between different techniques have been precluded. In this paper, we conduct a comprehensive literature review on CVR prediction in online advertising, and classify state-of-the-art CVR prediction models into six categories with respect to the underlying techniques and elaborate on connections between these techniques. For each category of models, we present the framework of underlying techniques, their advantages and disadvantages, and discuss how they are utilized for CVR prediction. Moreover, we summarize the performance of various CVR prediction models on public and proprietary datasets. Finally, we identify research trends, major challenges, and promising future directions. We observe that results of performance evaluation reported in prior studies are not unanimous; semantics-enriched, attribution-enhanced, debiased CVR prediction and jointly modeling CTR and CVR prediction would be promising directions to explore in the future. This review is expected to provide valuable references and insights for future researchers and practitioners in this area.
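[AI-60] Data assimilation and discrepancy modeling with shallow recurrent decoders
[Quick Read]: This paper addresses the simulation-to-real (SIM2REAL) gap between simulation models and actual sensor data in complex physical systems. When modeling high-dimensional spatiotemporal fields, sensor observations are sparse and the full state cannot be observed directly, while simulation models neglect small-scale or hidden processes, are sensitive to perturbations, or oversimplify parameter correlations, so their reconstructions diverge from the true state. The key to the solution is DA-SHRED, a machine learning framework based on the SHallow REcurrent Decoder (SHRED): it uses the latent space learned from a reduced simulation model and updates the latent variables with real sensor data to reconstruct the full system state. It also applies a regression model based on sparse identification of nonlinear dynamics in the latent space to identify the dynamics missing from the simulation model, combining efficient temporal encoding with physics-informed correction for robust data assimilation.
Link: https://arxiv.org/abs/2512.01170
Authors: Yuxuan Bao,J. Nathan Kutz
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Analysis of PDEs (math.AP); Chaotic Dynamics (nlin.CD)
Comments: 27 pages, 11 figures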
Abstract:The requirements of modern sensing are rapidly evolving, driven by increasing demands for data efficiency, real-time processing, and deployment under limited sensing coverage. Complex physical systems are often characterized through the integration of a limited number of point sensors in combination with scientific computations which approximate the dominant, full-state dynamics. Simulation models, however, inevitably neglect small-scale or hidden processes, are sensitive to perturbations, or oversimplify parameter correlations, leading to reconstructions that often diverge from the reality measured by sensors. This creates a critical need for data assimilation, the process of integrating observational data with predictive simulation models to produce coherent and accurate estimates of the full state of complex physical systems. We propose a machine learning framework for Data Assimilation with a SHallow REcurrent Decoder (DA-SHRED) which bridges the simulation-to-real (SIM2REAL) gap between computational modeling and experimental sensor data. For real-world physics systems modeling high-dimensional spatiotemporal fields, where the full state cannot be directly observed and must be inferred from sparse sensor measurements, we leverage the latent space learned from a reduced simulation model via SHRED, and update these latent variables using real sensor data to accurately reconstruct the full system state. Furthermore, our algorithm incorporates a sparse identification of nonlinear dynamics based regression model in the latent space to identify functionals corresponding to missing dynamics in the simulation model. We demonstrate that DA-SHRED successfully closes the SIM2REAL gap and additionally recovers missing dynamics in highly complex systems, demonstrating that the combination of efficient temporal encoding and physics-informed correction enables robust data assimilation.
[AI-61] A TinyML Reinforcement Learning Approach for Energy-Efficient Light Control in Low-Cost Greenhouse Systems CEC65580 CEC2025
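[Quick Read]: This paper addresses adaptive control of lighting in controlled environments, specifically energy-efficient and stable regulation of light intensity under resource constraints. The key to the solution is a model-free Q-learning reinforcement learning (RL) algorithm running on a low-power microcontroller that adjusts LED brightness dynamically from real-time sensor feedback. Trained on 13 target light-intensity levels (L1-L13), the system learns to stabilize at each level with minimal overshoot and smooth convergence even under environmental perturbations, which supports the feasibility of lightweight on-device RL for multi-modal environmental control in resource-constrained agricultural settings.
Link: https://arxiv.org/abs/2512.01167
Authors: Mohamed Abdallah Salem(1),Manuel Cuevas Perez(1),Ahmed Harb Rabia(1) ((1) North Dakota State University)
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: Copyright 2025 IEEE. This is the author’s version of the work that has been accepted for publication in Proceedings of the 5. Interdisciplinary Conference on Electrics and Computer (INTCEC 2025) 15-16 September 2025, Chicago-USA. The final version of record is available at: this https URL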
Abstract:This study presents a reinforcement learning (RL)-based control strategy for adaptive lighting regulation in controlled environments using a low-power microcontroller. A model-free Q-learning algorithm was implemented to dynamically adjust the brightness of a Light-Emitting Diode (LED) based on real-time feedback from a light-dependent resistor (LDR) sensor. The system was trained to stabilize at 13 distinct light intensity levels (L1 to L13), with each target corresponding to a specific range within the 64-state space derived from LDR readings. A total of 130 trials were conducted, covering all target levels with 10 episodes each. Performance was evaluated in terms of convergence speed, steps taken, and time required to reach target states. Box plots and histograms were generated to analyze the distribution of training time and learning efficiency across targets. Experimental validation demonstrated that the agent could effectively learn to stabilize at varying light levels with minimal overshooting and smooth convergence, even in the presence of environmental perturbations. This work highlights the feasibility of lightweight, on-device RL for energy-efficient lighting control and sets the groundwork for multi-modal environmental control applications in resource-constrained agricultural systems.
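A tabular Q-learning controller of the kind described can be sketched in a few lines. The 64-state space comes from the abstract; the three-action set, the clipped toy plant, the reward (negative tracking error), and all hyperparameters are assumptions for illustration, not the authors' firmware.

```python
import random

N_STATES = 64          # discretized LDR readings, as in the abstract
ACTIONS = (-1, 0, 1)   # dim, hold, brighten (hypothetical action set)

def step(state, action, target):
    """Toy plant: the brightness index moves by the action and is clipped."""
    nxt = min(N_STATES - 1, max(0, state + action))
    return nxt, -abs(nxt - target)  # reward is the negative tracking error

def train(target, episodes=500, horizon=100, alpha=0.5, gamma=0.9,
          epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * len(ACTIONS) for _ in range(N_STATES)]
    for _ in range(episodes):
        s = rng.randrange(N_STATES)  # random start state for each episode
        for _ in range(horizon):
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))  # explore
            else:
                a = max(range(len(ACTIONS)), key=lambda i: q[s][i])  # exploit
            s2, r = step(s, ACTIONS[a], target)
            # Standard Q-learning update toward r + gamma * max_a' Q(s', a').
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q

def greedy_action(q, state):
    """Action with the highest learned value in the given state."""
    return ACTIONS[max(range(len(ACTIONS)), key=lambda i: q[state][i])]

q_table = train(target=20)
```

A 64 x 3 table of floats is small enough for a low-power microcontroller, which is presumably part of the appeal of the tabular approach; on real hardware the `step` function would be replaced by an LED write followed by an LDR read.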
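[AI-62] 2D-ThermAl: Physics-Informed Framework for Thermal Analysis of Circuits using Generative AI
[Quick Read]: This paper addresses the thermal management challenge that non-uniform power dissipation and dense transistor layouts create in modern integrated circuits, and in particular the computational cost of traditional finite element method (FEM) simulation, which is too high for rapid iteration in early-stage design. The key to the solution is ThermAl, a physics-informed generative AI framework. It uses a hybrid U-Net architecture with positional encoding and a Boltzmann regularizer to predict full-chip transient and steady-state temperature fields directly from input activity profiles. Trained on a large dataset of heat dissipation maps generated with COMSOL, the model reaches an RMSE of 0.71°C and runs up to about 200 times faster than conventional FEM tools. Cross-validation on an extended 25-95°C range (2.2% full-scale RMSE) supports its generalization beyond the nominal range, and the intended use is early-stage hotspot detection and thermal pattern learning.
Link: https://arxiv.org/abs/2512.01163
Authors: Soumyadeep Chandra,Sayeed Shafayet Chowdhury,Kaushik Roy
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 8 figures, Under Review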
Abstract:Thermal analysis is increasingly critical in modern integrated circuits, where non-uniform power dissipation and high transistor densities can cause rapid temperature spikes and reliability concerns. Traditional methods, such as FEM-based simulations, offer high accuracy but are computationally prohibitive for early-stage design, often requiring multiple iterative redesign cycles to resolve late-stage thermal failures. To address these challenges, we propose ‘ThermAl’, a physics-informed generative AI framework which effectively identifies heat sources and estimates full-chip transient and steady-state thermal distributions directly from input activity profiles. ThermAl employs a hybrid U-Net architecture enhanced with positional encoding and a Boltzmann regularizer to maintain physical fidelity. Our model is trained on an extensive dataset of heat dissipation maps, ranging from simple logic gates (e.g., inverters, NAND, XOR) to complex designs, generated via COMSOL. Experimental results demonstrate that ThermAl delivers precise temperature mappings for large circuits, with a root mean squared error (RMSE) of only 0.71°C, and outperforms conventional FEM tools by running up to ~200 times faster. We analyze performance across diverse layouts and workloads, and discuss its applicability to large-scale EDA workflows. While thermal reliability assessments often extend beyond 85°C for post-layout signoff, our focus here is on early-stage hotspot detection and thermal pattern learning. To ensure generalization beyond the nominal operating range of 25-55°C, we additionally performed cross-validation on an extended dataset spanning 25-95°C, maintaining high accuracy (2.2% full-scale RMSE) even under elevated temperature conditions representative of peak power and stress scenarios.
[AI-63] Beyond Greenfield: AI-Driven Productivity in Documentation and Brownfield Engineering
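【Quick Read】: This paper addresses the difficulty of applying large language models (LLMs) effectively in brownfield engineering environments, where legacy systems, incomplete documentation, and fragmented architectural knowledge are the norm. Existing research focuses mostly on greenfield or synthetic tasks and lacks structured workflows for complex, context-heavy settings. The key to the solution is a systematic LLM-assisted workflow called Discover-Define-Deliver (D3), whose core innovation combines role-separated prompting strategies with a dual-agent prompting architecture: a Builder model generates candidate outputs and a Reviewer model provides structured feedback to improve reliability, which strengthens the handling of ambiguity and improves task clarity, documentation quality, and cognitive load.
Link: https://arxiv.org/abs/2512.01155
Authors: Krishna Kumaar Sharma
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 53 pages, 7 figures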
Abstract:Brownfield engineering work involving legacy systems, incomplete documentation, and fragmented architectural knowledge poses unique challenges for the effective use of large language models (LLMs). Prior research has largely focused on greenfield or synthetic tasks, leaving a gap in structured workflows for complex, context-heavy environments. This paper introduces the Discover-Define-Deliver (D3) Framework, a disciplined LLM-assisted workflow that combines role-separated prompting strategies with applied best practices for navigating ambiguity in brownfield systems. The framework incorporates a dual-agent prompting architecture in which a Builder model generates candidate outputs and a Reviewer model provides structured critique to improve reliability. I conducted an exploratory survey study with 52 software practitioners who applied the D3 workflow to real-world engineering tasks such as legacy system exploration, documentation reconstruction, and architectural refactoring. Respondents reported perceived improvements in task clarity, documentation quality, and cognitive load, along with self-estimated productivity gains. In this exploratory study, participants reported a weighted average productivity improvement of 26.9%, reduced cognitive load for approximately 77% of participants, and reduced rework for 83% during the Define phase. As these findings are self-reported and not derived from controlled experiments, they should be interpreted as preliminary evidence of practitioner sentiment rather than causal effects. The results highlight both the potential and limitations of structured LLM workflows for legacy engineering systems and motivate future controlled evaluations.
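The Builder/Reviewer loop described in the abstract can be sketched roughly as follows. This is only an illustration, not the paper's implementation: `builder` and `reviewer` are hypothetical callables standing in for LLM calls, and both the stopping rule (the reviewer returns no critique) and `max_rounds` are assumptions.

```python
from typing import Callable, Optional


def build_review_loop(
    task: str,
    builder: Callable[[str, Optional[str]], str],
    reviewer: Callable[[str, str], Optional[str]],
    max_rounds: int = 3,
) -> str:
    """Alternate between a Builder draft and a Reviewer critique.

    builder(task, critique) returns a candidate output.
    reviewer(task, draft) returns a critique, or None when satisfied.
    Both are hypothetical stand-ins for model calls.
    """
    critique: Optional[str] = None
    draft = ""
    for _ in range(max_rounds):
        draft = builder(task, critique)
        critique = reviewer(task, draft)
        if critique is None:  # Reviewer accepted the draft
            break
    return draft


# Toy stand-ins so the sketch runs without any model access.
def toy_builder(task: str, critique: Optional[str]) -> str:
    return f"{task} (revised)" if critique else task


def toy_reviewer(task: str, draft: str) -> Optional[str]:
    return None if draft.endswith("(revised)") else "add detail"


result = build_review_loop("Document module X", toy_builder, toy_reviewer)
```

Keeping the two roles as separate callables mirrors the role separation the abstract describes: the Builder never grades its own output.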
[AI-64] A Benchmark of Causal vs Correlation AI for Predictive Maintenance
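【Quick Read】: This paper addresses the optimization problem of predictive maintenance in manufacturing environments. The core challenge is cost asymmetry: a missed failure costs roughly 50 times as much as a false alarm, yet conventional machine learning methods usually optimize statistical accuracy metrics that do not reflect this operational reality and cannot reliably separate causal relationships from spurious correlations. The key to the solution is to build models with formal causal inference combined with domain knowledge, which matches the highest observed recall (87.9%) while cutting false alarms from 165 to 5, yields estimated annual cost savings of 1.16 million USD (a 70.2% reduction), and shows a train-test performance gap of only 2.6 percentage points, indicating strong generalization and better interpretability.
Link: https://arxiv.org/abs/2512.01149
Authors: Krishna Taduri, Shaunak Dhande, Giacinto Paolo (GP) Saggese, Paul Smith
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: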
Abstract:Predictive maintenance in manufacturing environments presents a challenging optimization problem characterized by extreme cost asymmetry, where missed failures incur costs roughly fifty times higher than false alarms. Conventional machine learning approaches typically optimize statistical accuracy metrics that do not reflect this operational reality and cannot reliably distinguish causal relationships from spurious correlations. This study evaluates eight predictive models, ranging from baseline statistical approaches to formal causal inference methods, on a dataset of 10,000 CNC machines with a 3.3% failure prevalence. The formal causal inference model (L5) achieved estimated annual cost savings of 1.16 million USD (a 70.2 percent reduction), outperforming the best correlation-based decision tree model (L3) by approximately 80,000 USD per year. The causal model matched the highest observed recall (87.9 percent) while reducing false alarms by 97 percent (from 165 to 5) and attained a precision of 92.1 percent, with a train-test performance gap of only 2.6 percentage points. These results indicate that causal AI methods, when combined with domain knowledge, can yield superior financial outcomes and more interpretable predictions compared to correlation-based approaches in predictive maintenance applications.
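To see why the 50:1 cost asymmetry matters more than plain accuracy, a small cost calculation may help. The sketch below is an assumption-laden illustration: costs are expressed in units of one false alarm, the false-alarm counts 165 and 5 come from the abstract, and the missed-failure count of 4 is made up.

```python
def total_cost(missed_failures: int, false_alarms: int,
               false_alarm_cost: float = 1.0,
               miss_to_alarm_ratio: float = 50.0) -> float:
    """Total cost when a missed failure costs `miss_to_alarm_ratio` false alarms."""
    return (missed_failures * miss_to_alarm_ratio + false_alarms) * false_alarm_cost


# Two hypothetical models with the same number of missed failures but
# different false-alarm counts (165 vs 5, as quoted in the abstract).
# The missed-failure count of 4 is invented for illustration.
cost_many_alarms = total_cost(missed_failures=4, false_alarms=165)
cost_few_alarms = total_cost(missed_failures=4, false_alarms=5)

# One extra missed failure costs as much as 50 extra false alarms.
cost_one_more_miss = total_cost(missed_failures=5, false_alarms=5)
```

Under this weighting, a metric that counts every error equally would rate a missed failure the same as a false alarm, which is the mismatch the abstract points out.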
[AI-65] World Model Robustness via Surprise Recognition
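【Quick Read】: This paper addresses the problem that agents deployed in the real world are affected by distractions and out-of-distribution (OOD) noise, which destabilizes their policies and can even produce unsafe behavior. Because robust training cannot cover every possible OOD scenario, the authors propose an algorithm based on the world model's intrinsic surprise measure that uses rejection sampling to suppress the impact of noise on world-model-based reinforcement learning agents. The key to the solution is the introduction of multi-representation and single-representation rejection sampling strategies, which handle settings with multiple faulty sensors or a single faulty sensor. Experiments in self-driving simulation environments (CARLA and Safety Gymnasium) show that the method preserves performance relative to baselines under different noise types and intensities, and improves the stability of two architecturally very different world models (Cosmos and DreamerV3).
Link: https://arxiv.org/abs/2512.01119
Authors: Geigh Zollicoffer, Tanush Chopra, Mingkuan Yan, Xiaoxu Ma, Kenneth Eaton, Mark Riedl
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: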
Abstract:AI systems deployed in the real world must contend with distractions and out-of-distribution (OOD) noise that can destabilize their policies and lead to unsafe behavior. While robust training can reduce sensitivity to some forms of noise, it is infeasible to anticipate all possible OOD conditions. To mitigate this issue, we develop an algorithm that leverages a world model’s inherent measure of surprise to reduce the impact of noise in world model–based reinforcement learning agents. We introduce both multi-representation and single-representation rejection sampling, enabling robustness to settings with multiple faulty sensors or a single faulty sensor. While the introduction of noise typically degrades agent performance, we show that our techniques preserve performance relative to baselines under varying types and levels of noise across multiple environments within self-driving simulation domains (CARLA and Safety Gymnasium). Furthermore, we demonstrate that our methods enhance the stability of two state-of-the-art world models with markedly different underlying architectures: Cosmos and DreamerV3. Together, these results highlight the robustness of our approach across world modeling domains. We release our code at this https URL .
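The idea of using surprise to discard unreliable inputs can be illustrated with a simple threshold filter. This is a deliberately simplified stand-in, not the paper's rejection sampling procedure: the sensor names, the scores, the threshold, and the fallback rule are all assumptions.

```python
from typing import Dict, List


def reject_surprising(surprise: Dict[str, float], threshold: float) -> List[str]:
    """Return the sensor names whose surprise is at or below `threshold`.

    If every sensor exceeds the threshold, keep the least surprising one
    so the agent still has an observation to act on (an assumption of
    this sketch, not something stated in the abstract).
    """
    kept = [name for name, s in surprise.items() if s <= threshold]
    if not kept:
        kept = [min(surprise, key=surprise.get)]
    return sorted(kept)


scores = {"camera": 0.4, "lidar": 7.5, "imu": 0.9}  # made-up surprise values
trusted = reject_surprising(scores, threshold=2.0)
```

In a real agent the surprise score would come from the world model itself, for example from how poorly it predicted the latest observation.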
[AI-66] Efficiently Learning Branching Networks for Multitask Algorithmic Reasoning KDD’26
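【Quick Read】: This paper addresses negative interference in multitask algorithmic reasoning, which arises because the execution steps of different algorithms differ, with the goal of training a single model efficiently on multiple algorithmic reasoning tasks. The core challenge is designing a structured approach that improves overall performance while keeping tasks separate where needed. The key to the solution is branching neural networks together with the AutoBRANE algorithm, which learns the task hierarchy automatically: it replaces the combinatorial search over k-ary trees with a convex relaxation solved at each layer, reducing the search to O(nL) time, clusters tasks using gradient-based affinity scores, and can be used on top of any base model (such as graph neural networks, GNNs, or large language models, LLMs). Experiments show that it outperforms existing methods on multiple graph-algorithm and text-reasoning benchmarks while substantially reducing runtime and memory consumption.
Link: https://arxiv.org/abs/2512.01113
Authors: Dongyue Li, Zhenshuo Zhang, Minxuan Duan, Edgar Dobriban, Hongyang R. Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
Comments: 31 pages. Preprint, to appear in KDD’26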
Abstract: Algorithmic reasoning – the ability to perform step-by-step logical inference – has become a core benchmark for evaluating reasoning in graph neural networks (GNNs) and large language models (LLMs). Ideally, one would like to design a single model capable of performing well on multiple algorithmic reasoning tasks simultaneously. However, this is challenging when the execution steps of algorithms differ from one another, causing negative interference when they are trained together. We propose branching neural networks, a principled architecture for multitask algorithmic reasoning. Searching for the optimal k-ary tree with L layers over n algorithmic tasks is combinatorial, requiring exploration of up to k^{nL} possible structures. We develop AutoBRANE, an efficient algorithm that reduces this search to O(nL) time by solving a convex relaxation at each layer to approximate an optimal task partition. The method clusters tasks using gradient-based affinity scores and can be used on top of any base model, including GNNs and LLMs. We validate AutoBRANE on a broad suite of graph-algorithmic and text-based reasoning benchmarks. We show that gradient features estimate true task performance within 5% error across four GNNs and four LLMs (up to 34B parameters). On the CLRS benchmark, it outperforms the strongest single multitask GNN by 3.7% and the best baseline by 1.2%, while reducing runtime by 48% and memory usage by 26%. The learned branching structures reveal an intuitively reasonable hierarchical clustering of related algorithms. On three text-based graph reasoning benchmarks, AutoBRANE improves over the best non-branching multitask baseline by 3.2%. Finally, on a large graph dataset with 21M edges and 500 tasks, AutoBRANE achieves a 28% accuracy gain over existing multitask and branching architectures, along with a 4.5× reduction in runtime.
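Clustering tasks by gradient-based affinity can be illustrated with a toy example. The sketch below swaps in a simple greedy threshold grouping over cosine similarities; it is not the convex relaxation used by AutoBRANE, and the task names, gradient vectors, and `min_affinity` value are invented.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two gradient vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_tasks(grads: Dict[str, Sequence[float]], min_affinity: float) -> List[List[str]]:
    """Greedily place each task in the first group whose members all have
    cosine affinity >= min_affinity with it; otherwise open a new group."""
    groups: List[List[str]] = []
    for task, g in grads.items():
        for group in groups:
            if all(cosine(g, grads[other]) >= min_affinity for other in group):
                group.append(task)
                break
        else:
            groups.append([task])
    return groups


toy_grads = {  # made-up two-dimensional "gradients"
    "bfs": [1.0, 0.1],
    "dfs": [0.9, 0.2],
    "sorting": [-0.2, 1.0],
}
branches = group_tasks(toy_grads, min_affinity=0.5)
```

Tasks whose gradients point in similar directions end up sharing a branch, while a task with conflicting gradients gets its own, which is the intuition behind avoiding negative interference.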
[AI-67] Foundation Priors
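【Quick Read】: This paper addresses the problem that synthetic data produced by generative AI models is mistakenly used as real observations in empirical research, which risks conflating model bias, the user's subjective expectations, and real-world data. The core solution is the concept of a foundation prior: model-generated data is modeled as a structured subjective prior distribution induced from the user's primitive prior through exponential tilting and a generalized Bayesian update, where a trust parameter controls the weight given to synthetic data. The framework makes the sources of subjectivity in synthetic data explicit (prompt engineering, the user's anticipated data distribution, and the degree of trust in the model) and provides a principled way to integrate synthetic data into standard statistical and econometric workflows, for example to refine complex models, guide experimental design, and augment random-coefficient or partially linear models, thereby avoiding mistaking synthetic results for objective facts.
Link: https://arxiv.org/abs/2512.01107
Authors: Sanjog Misra
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Econometrics (econ.EM); Machine Learning (stat.ML)
Comments: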
Abstract: Foundation models, and in particular large language models, can generate highly informative responses, prompting growing interest in using these "synthetic" outputs as data in empirical research and decision-making. This paper introduces the idea of a foundation prior, which shows that model-generated outputs are not real observations but draws from the prior predictive distribution induced by the foundation prior. As such, synthetic data reflects both the model’s learned patterns and the user’s subjective priors, expectations, and biases. We model the subjectivity of the generative process by making explicit the dependence of synthetic outputs on the user’s anticipated data distribution, the prompt-engineering process, and the trust placed in the foundation model. We derive the foundation prior as an exponential-tilted, generalized Bayesian update of the user’s primitive prior, where a trust parameter governs the weight assigned to synthetic data. We then show how synthetic data and the associated foundation prior can be incorporated into standard statistical and econometric workflows, and discuss their use in applications such as refining complex models, informing latent constructs, guiding experimental design, and augmenting random-coefficient and partially linear specifications. By treating generative outputs as structured, explicitly subjective priors rather than as empirical observations, the framework offers a principled way to harness foundation models in empirical work while avoiding the conflation of synthetic "facts" with real data.
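A trust-weighted update of a primitive prior is easiest to see in a conjugate toy case. The sketch below assumes a Beta prior with a Bernoulli likelihood raised to a power `trust` in [0, 1] (a tempered-likelihood form of generalized Bayes); the paper's actual derivation may differ, and all numbers are invented.

```python
from typing import Tuple


def tilted_beta_prior(a: float, b: float, successes: int, failures: int,
                      trust: float) -> Tuple[float, float]:
    """Update a Beta(a, b) prior with a Bernoulli likelihood raised to `trust`.

    trust = 0 ignores the synthetic draws entirely; trust = 1 treats them
    like ordinary observations.
    """
    if not 0.0 <= trust <= 1.0:
        raise ValueError("trust must lie in [0, 1]")
    return a + trust * successes, b + trust * failures


def beta_mean(a: float, b: float) -> float:
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Made-up numbers: a flat primitive prior and 20 synthetic draws,
# 16 of which are "successes".
no_trust = tilted_beta_prior(1.0, 1.0, 16, 4, trust=0.0)
half_trust = tilted_beta_prior(1.0, 1.0, 16, 4, trust=0.5)
```

The result is still a prior, not a posterior over real data: real observations would be applied on top of it at full weight.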
[AI-68] Supporting Productivity Skill Development in College Students through Social Robot Coaching: A Proof-of-Concept
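【Quick Read】: This paper addresses the productivity and well-being problems college students face in their studies. Existing tools such as self-help books and productivity apps are generic, non-interactive, and not inherently educational, while traditional human coaching is resource-intensive and hard to scale. The key to the solution is a socially assistive robot (SAR) that acts as an educational coach: it delivers six lessons on time management and task prioritization through an interactive chat interface, and a dashboard tracks the user's progress, mood, engagement, confidence, and time spent per lesson to generate personalized feedback that fosters reflection and self-awareness. The evaluation reports a System Usability Score of 79.2 and high ratings for user experience and engagement, suggesting that SAR-based productivity coaching is an effective and scalable intervention.
Link: https://arxiv.org/abs/2512.01105
Authors: Himanshi Lalwani, Hanan Salam
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: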
Abstract:College students often face academic challenges that hamper their productivity and well-being. Although self-help books and productivity apps are popular, they often fall short. Books provide generalized, non-interactive guidance, and apps are not inherently educational and can hinder the development of key organizational skills. Traditional productivity coaching offers personalized support, but is resource-intensive and difficult to scale. In this study, we present a proof-of-concept for a socially assistive robot (SAR) as an educational coach and a potential solution to the limitations of existing productivity tools and coaching approaches. The SAR delivers six different lessons on time management and task prioritization. Users interact via a chat interface, while the SAR responds through speech (with a toggle option). An integrated dashboard monitors progress, mood, engagement, confidence per lesson, and time spent per lesson. It also offers personalized productivity insights to foster reflection and self-awareness. We evaluated the system with 15 college students, achieving a System Usability Score of 79.2 and high ratings for overall experience and engagement. Our findings suggest that SAR-based productivity coaching can offer an effective and scalable solution to improve productivity among college students.
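[AI-69] Energy-Aware Data-Driven Model Selection in LLM-Orchestrated AI Systems
【Quick Read】: This paper addresses the suboptimal decisions made by current LLM-based model orchestration systems, which rely on qualitative descriptions when selecting tools or models, leading to lower accuracy and higher energy consumption. The core solution is the GUIDE framework, which improves LLM-driven model selection by incorporating quantitative model performance characteristics (such as inference speed, accuracy, and energy consumption), raising energy efficiency while preserving task accuracy. Experiments show up to a 54% improvement in energy efficiency and a reduction in selection latency from 4.51 s to 7.2 ms.
Link: https://arxiv.org/abs/2512.01099
Authors: Daria Smirnova, Hamid Nasiri, Marta Adamska, Zhengxin Yu, Peter Garraghan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: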
Abstract:As modern artificial intelligence (AI) systems become more advanced and capable, they can leverage a wide range of tools and models to perform complex tasks. Today, the task of orchestrating these models is often performed by Large Language Models (LLMs) that rely on qualitative descriptions of models for decision-making. However, the descriptions provided to these LLM-based orchestrators do not reflect true model capabilities and performance characteristics, leading to suboptimal model selection, reduced accuracy, and increased energy costs. In this paper, we conduct an empirical analysis of LLM-based orchestration limitations and propose GUIDE, a new energy-aware model selection framework that accounts for performance-energy trade-offs by incorporating quantitative model performance characteristics in decision-making. Experimental results demonstrate that GUIDE increases accuracy by 0.90%-11.92% across various evaluated tasks, and achieves up to 54% energy efficiency improvement, while reducing orchestrator model selection latency from 4.51 s to 7.2 ms.
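Selecting a model from measured numbers rather than prose descriptions might look like the sketch below. It is not GUIDE itself: the `ModelProfile` fields, the selection rule (cheapest model above an accuracy floor), and the catalog values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelProfile:
    name: str
    accuracy: float       # measured task accuracy in [0, 1]
    energy_joules: float  # measured energy per request


def select_model(profiles: List[ModelProfile], min_accuracy: float) -> ModelProfile:
    """Pick the lowest-energy model that meets the accuracy floor;
    fall back to the most accurate model if none qualifies."""
    eligible = [p for p in profiles if p.accuracy >= min_accuracy]
    if eligible:
        return min(eligible, key=lambda p: p.energy_joules)
    return max(profiles, key=lambda p: p.accuracy)


catalog = [  # made-up profiles
    ModelProfile("small", 0.81, 12.0),
    ModelProfile("medium", 0.90, 40.0),
    ModelProfile("large", 0.92, 150.0),
]
choice = select_model(catalog, min_accuracy=0.88)
```

A table lookup like this needs no LLM call at selection time, which is one plausible reason a quantitative selector can be much faster than an LLM-based orchestrator.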
[AI-70] CodeDistiller: Automatically Generating Code Libraries for Scientific Coding Agents
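【Quick Read】: This paper addresses the limitations that Automated Scientific Discovery (ASD) systems face in generating high-quality, executable experiment code: current systems rely only on parametric knowledge or a small number of manually crafted experiment examples, which limits experiment quality and coverage. The key to the solution is CodeDistiller, a system that automatically distills large collections of scientific GitHub repositories into a vetted library of working domain-specific code examples, evaluated with a combination of automatic and domain-expert assessment, which expands the ability of ASD agents to generate functional, scientifically sound experiment code without manual effort.
Link: https://arxiv.org/abs/2512.01089
Authors: Peter Jansen, Samiah Hassan, Pragnya Narasimha
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages, 3 figures, 2 tables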
Abstract:Automated Scientific Discovery (ASD) systems can help automatically generate and run code-based experiments, but their capabilities are limited by the code they can reliably generate from parametric knowledge alone. As a result, current systems either mutate a small number of manually-crafted experiment examples, or operate solely from parametric knowledge, limiting quality and reach. We introduce CodeDistiller, a system that automatically distills large collections of scientific Github repositories into a vetted library of working domain-specific code examples, allowing ASD agents to expand their capabilities without manual effort. Using a combination of automatic and domain-expert evaluation on 250 materials science repositories, we find the best model is capable of producing functional examples for 74% of repositories, while our downstream evaluation shows an ASD agent augmented with a CodeDistiller generated library produces more accurate, complete, and scientifically sound experiments than an agent with only general materials-science code examples.
[AI-71] SimWorld: An Open-ended Realistic Simulator for Autonomous Agents in Physical and Social Worlds
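【Quick Read】: This paper addresses the limited applicability of agents driven by large language models (LLMs) and vision-language models (VLMs) in complex physical and social environments, in particular their insufficient ability to survive and thrive autonomously in real-world scenarios. Existing world simulators typically rely on limited hand-crafted environments and simplified game-like physics and social rules, and lack native support for LLM/VLM agents, so they cannot support massive-scale interaction, reasoning, training, and evaluation. The key to the solution is SimWorld, a new simulation platform built on Unreal Engine 5 with three core capabilities: (1) realistic, open-ended world simulation, including accurate physical and social dynamics and language-driven procedural environment generation; (2) a rich interface for LLM/VLM agents, with multimodal inputs and open-vocabulary actions at multiple levels of abstraction; and (3) diverse, extensible physical and social reasoning scenarios that users can easily customize. Deploying frontier LLM agents (such as GPT-4o and Gemini-2.5-Flash) on long-horizon multi-agent delivery tasks demonstrates the platform's value in revealing the reasoning patterns and limitations of different models.
Link: https://arxiv.org/abs/2512.01078
Authors: Jiawei Ren, Yan Zhuang, Xiaokang Ye, Lingjun Mao, Xuhong He, Jianzhi Shen, Mrinaal Dogra, Yiming Liang, Ruixuan Zhang, Tianai Yue, Yiqing Yang, Eric Liu, Ryan Wu, Kevin Benavente, Rajiv Mandya Nagaraju, Muhammad Faayez, Xiyan Zhang, Dhruv Vivek Sharma, Xianrui Zhong, Ziqiao Ma, Tianmin Shu, Zhiting Hu, Lianhui Qin
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: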
Abstract:While LLM/VLM-powered AI agents have advanced rapidly in math, coding, and computer use, their applications in complex physical and social environments remain challenging. Building agents that can survive and thrive in the real world (for example, by autonomously earning income or running a business) requires massive-scale interaction, reasoning, training, and evaluation across diverse embodied scenarios. However, existing world simulators for such development fall short: they often rely on limited hand-crafted environments, simulate simplified game-like physics and social rules, and lack native support for LLM/VLM agents. We introduce SimWorld, a new simulator built on Unreal Engine 5, designed for developing and evaluating LLM/VLM agents in rich, real-world-like settings. SimWorld offers three core capabilities: (1) realistic, open-ended world simulation, including accurate physical and social dynamics and language-driven procedural environment generation; (2) a rich interface for LLM/VLM agents, with multimodal world inputs and open-vocabulary actions at varying levels of abstraction; and (3) diverse and extensible physical and social reasoning scenarios that are easily customizable by users. We demonstrate SimWorld by deploying frontier LLM agents (e.g., GPT-4o, Gemini-2.5-Flash, Claude-3.5, and DeepSeek-Prover-V2) on long-horizon multi-agent delivery tasks involving strategic cooperation and competition. The results reveal distinct reasoning patterns and limitations across models. We open-source SimWorld and hope it becomes a foundational platform for advancing real-world agent intelligence across disciplines: this https URL.
[AI-72] PIANO: Physics-informed Dual Neural Operator for Precipitation Nowcasting NEURIPS2025
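【Quick Read】: This paper addresses the limited accuracy and accessibility of precipitation nowcasting caused by high computational cost and insufficient physical constraints. The core solution is a physics-informed dual neural operator (PIANO) structure that enforces the advection-diffusion equation during training through a PINN loss, improving the accuracy and physical consistency of predictions. A generative model then converts satellite images to radar images for precipitation nowcasting, which notably improves prediction of moderate (4 mm/h) and short-term heavy (8 mm/h) precipitation events and shows low seasonal variability, indicating robustness for generalization.
Link: https://arxiv.org/abs/2512.01062
Authors: Seokhyun Chin, Junghwan Park, Woojin Cho
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2025 Machine Learning and Physical Sciences Workshop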
Abstract:Precipitation nowcasting, key for early warning of disasters, currently relies on computationally expensive and restrictive methods that limit access to many countries. To overcome this challenge, we propose precipitation nowcasting using satellite imagery with physics constraints for improved accuracy and physical consistency. We use a novel physics-informed dual neural operator (PIANO) structure to enforce the fundamental equation of advection-diffusion during training to predict satellite imagery using a PINN loss. Then, we use a generative model to convert satellite images to radar images, which are used for precipitation nowcasting. Compared to baseline models, our proposed model shows a notable improvement in moderate (4mm/h) precipitation event prediction alongside short-term heavy (8mm/h) precipitation event prediction. It also demonstrates low seasonal variability in predictions, indicating robustness for generalization. This study suggests the potential of the PIANO and serves as a good baseline for physics-informed precipitation nowcasting.
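For readers unfamiliar with the physics constraint, the generic advection-diffusion equation and a typical PINN-style residual loss can be written as below. The notation is an assumption rather than the paper's own: u is the predicted field, v a velocity field, D a diffusion coefficient, and the loss averages the squared residual over N collocation points.

```latex
\frac{\partial u}{\partial t} + \mathbf{v}\cdot\nabla u = D\,\nabla^{2} u,
\qquad
\mathcal{L}_{\text{phys}} = \frac{1}{N}\sum_{i=1}^{N}
\left(
  \frac{\partial \hat{u}}{\partial t}
  + \mathbf{v}\cdot\nabla \hat{u}
  - D\,\nabla^{2}\hat{u}
\right)^{2}\Bigg|_{(x_i,\,t_i)}
```

A loss of this kind is usually added to the data-fitting loss with a weighting factor, so that predictions which violate the transport equation are penalized during training.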
[AI-73] Adaptive-lambda Subtracted Importance Sampled Scores in Machine Unlearning for DDPMs and VAEs
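【Quick Read】: This paper addresses the challenge large generative models (such as variational autoencoders, VAEs, and diffusion models, DDPMs) face in honoring the right to be forgotten: removing the model's memory of specific data classes efficiently and precisely without costly retraining. Existing methods such as Static-lambda SISS rely on a fixed mixing weight λ and cannot adapt to the different unlearning strengths required across samples and training stages, which limits performance. The key to the solution is Adaptive-lambda SISS, which treats λ as a latent variable inferred dynamically: a lightweight inference network estimates a posterior over λ conditioned on the instantaneous SISS loss terms (retain/forget losses and their gradients), and a variational objective jointly optimizes the diffusion model and the λ-inference mechanism, yielding a better trade-off between unlearning and generation quality on retained data.
Link: https://arxiv.org/abs/2512.01054
Authors: MohammadParsa Dini, Human Jafari, Sajjad Amini, MohammadMahdi Mojahedian
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: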
Abstract:Machine Unlearning is essential for large generative models (VAEs, DDPMs) to comply with the right to be forgotten and prevent undesired content generation without costly retraining. Existing approaches, such as Static-lambda SISS for diffusion models, rely on a fixed mixing weight lambda, which is suboptimal because the required unlearning strength varies across samples and training stages. We propose Adaptive-lambda SISS, a principled extension that turns lambda into a latent variable dynamically inferred at each training step. A lightweight inference network parameterizes an adaptive posterior over lambda, conditioned on contextual features derived from the instantaneous SISS loss terms (retain/forget losses and their gradients). This enables joint optimization of the diffusion model and the lambda-inference mechanism via a variational objective, yielding significantly better trade-offs. We further extend the adaptive-lambda principle to score-based unlearning and introduce a multi-class variant of Score Forgetting Distillation. In addition, we present two new directions: (i) a hybrid objective combining the data-free efficiency of Score Forgetting Distillation with the direct gradient control of SISS, and (ii) a Reinforcement Learning formulation that treats unlearning as a sequential decision process, learning an optimal policy over a state space defined by the model’s current memory of the forget set. Experiments on an augmented MNIST benchmark show that Adaptive-lambda SISS substantially outperforms the original static-lambda SISS, achieving stronger removal of forgotten classes while better preserving generation quality on the retain set. 
[AI-74] Automating the Refinement of Reinforcement Learning Specifications
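【Quick Read】: This paper addresses the problem that coarse-grained or under-specified logical specifications make it hard for reinforcement learning (RL) agents to learn useful policies. The core solution is the AutoSpec framework, which refines SpectRL logical specifications through an exploration-guided strategy, producing new specifications whose satisfaction implies satisfaction of the original but which provide more guidance. The key to AutoSpec is exploiting the compositional nature of SpectRL specifications to design four refinement procedures that preserve specification soundness, each either refining existing edge specifications or introducing new ones, which makes it easier for RL algorithms to learn and raises the complexity of tasks they can solve.
Link: https://arxiv.org/abs/2512.01047
Authors: Tanmay Ambadkar, Đorđe Žikelić, Abhinav Verma
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: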
Abstract: Logical specifications have been shown to help reinforcement learning algorithms in achieving complex tasks. However, when a task is under-specified, agents might fail to learn useful policies. In this work, we explore the possibility of improving coarse-grained logical specifications via an exploration-guided strategy. We propose AutoSpec, a framework that searches for a logical specification refinement whose satisfaction implies satisfaction of the original specification, but which provides additional guidance, therefore making it easier for reinforcement learning algorithms to learn useful policies. AutoSpec is applicable to reinforcement learning tasks specified via the SpectRL specification logic. We exploit the compositional nature of specifications written in SpectRL, and design four refinement procedures that modify the abstract graph of the specification by either refining its existing edge specifications or by introducing new edge specifications. We prove that all four procedures maintain specification soundness, i.e. any trajectory satisfying the refined specification also satisfies the original. We then show how AutoSpec can be integrated with existing reinforcement learning algorithms for learning policies from logical specifications. Our experiments demonstrate that AutoSpec yields promising improvements in terms of the complexity of control tasks that can be solved, when refined logical specifications produced by AutoSpec are utilized.
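The soundness property (any trajectory satisfying the refined specification also satisfies the original) can be checked on a toy example. The sketch below does not use SpectRL syntax: `reach` and `then` are hypothetical helpers over integer-valued states, and the sample trajectories are made up.

```python
from typing import Callable, List, Sequence

Trajectory = Sequence[int]
Spec = Callable[[Trajectory], bool]


def reach(target: int) -> Spec:
    """Spec satisfied when the trajectory visits `target`."""
    return lambda traj: target in traj


def then(first: Spec, target: int) -> Spec:
    """Spec satisfied when some prefix satisfies `first` and a later state equals `target`."""
    def check(traj: Trajectory) -> bool:
        return any(first(traj[:i]) and target in traj[i:] for i in range(1, len(traj)))
    return check


original = reach(5)          # "eventually reach state 5"
refined = then(reach(2), 5)  # "reach waypoint 2, then reach state 5"

samples: List[Trajectory] = [[0, 2, 5], [0, 5], [0, 2, 3], [2, 4, 5, 1]]
# Soundness on the samples: every trajectory accepted by the refined
# spec must also be accepted by the original one.
sound_on_samples = all(original(t) for t in samples if refined(t))
```

The refined spec rejects `[0, 5]` even though the original accepts it; refinement is allowed to be stricter, never looser, and the added waypoint is the kind of extra guidance the abstract refers to.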
[AI-75] Shielded Controller Units for RL with Operational Constraints Applied to Remote Microgrids
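【Quick Read】: This paper addresses the constraint-satisfaction problem that reinforcement learning (RL) faces in complex energy systems such as remote microgrids, in particular how to make optimal decisions under uncertain load and wind conditions while guaranteeing strict operational constraints when coordinating wind turbines, fuel generators, and batteries. The key to the solution is a systematic and interpretable approach called Shielded Controller Units (SCUs), which uses prior knowledge of system dynamics to decompose the environment into a hierarchical structure in which each SCU explicitly manages a subset of constraints, so that the RL agent respects all operational limits during optimization. Experiments show that an RL agent equipped with SCUs reduces fuel consumption by 24% while satisfying all constraints and without increasing battery degradation, outperforming other baselines.
Link: https://arxiv.org/abs/2512.01046
Authors: Hadi Nekoei, Alexandre Blondin Massé, Rachid Hassani, Sarath Chandar, Vincent Mai
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: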
Abstract:Reinforcement learning (RL) is a powerful framework for optimizing decision-making in complex systems under uncertainty, an essential challenge in real-world settings, particularly in the context of the energy transition. A representative example is remote microgrids that supply power to communities disconnected from the main grid. Enabling the energy transition in such systems requires coordinated control of renewable sources like wind turbines, alongside fuel generators and batteries, to meet demand while minimizing fuel consumption and battery degradation under exogenous and intermittent load and wind conditions. These systems must often conform to extensive regulations and complex operational constraints. To ensure that RL agents respect these constraints, it is crucial to provide interpretable guarantees. In this paper, we introduce Shielded Controller Units (SCUs), a systematic and interpretable approach that leverages prior knowledge of system dynamics to ensure constraint satisfaction. Our shield synthesis methodology, designed for real-world deployment, decomposes the environment into a hierarchical structure where each SCU explicitly manages a subset of constraints. We demonstrate the effectiveness of SCUs on a remote microgrid optimization task with strict operational requirements. The RL agent, equipped with SCUs, achieves a 24% reduction in fuel consumption without increasing battery degradation, outperforming other baselines while satisfying all constraints. We hope SCUs contribute to the safe application of RL to the many decision-making challenges linked to the energy transition.
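A single shield that enforces one constraint might look like the sketch below, which clips a requested battery power so the state of charge stays within bounds. This is an illustrative assumption, not the paper's SCU design: the function name, sign convention, lossless battery model, and numbers are all invented.

```python
def shield_battery_action(requested_kw: float, soc_kwh: float,
                          capacity_kwh: float, min_soc_kwh: float,
                          dt_hours: float) -> float:
    """Clip a requested battery power so the state of charge stays in bounds.

    Positive power charges the battery, negative power discharges it.
    Charging and discharging losses are ignored in this sketch.
    """
    max_charge_kw = (capacity_kwh - soc_kwh) / dt_hours
    max_discharge_kw = (soc_kwh - min_soc_kwh) / dt_hours
    return max(-max_discharge_kw, min(requested_kw, max_charge_kw))


# Made-up numbers: 100 kWh battery currently at 90 kWh, 20 kWh floor, 1-hour step.
safe_charge = shield_battery_action(30.0, 90.0, 100.0, 20.0, 1.0)
safe_discharge = shield_battery_action(-80.0, 90.0, 100.0, 20.0, 1.0)
unchanged = shield_battery_action(5.0, 90.0, 100.0, 20.0, 1.0)
```

Because the shield only overrides actions that would break the constraint, the agent's choice passes through untouched whenever it is already safe, which keeps the guarantee easy to inspect.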
[AI-76] Med-CRAFT: Automated Construction of Interpretable and Multi-Hop Video Workloads via Knowledge Graph Traversal
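【Quick Read】: This paper addresses the scarcity of high-quality, logically annotated video datasets for developing Multi-Modal Large Language Models (MLLMs) in the medical domain. Traditional manual annotation is expensive and hard to scale, while existing synthetic methods often suffer from stochastic hallucinations and a lack of logical interpretability. The key to the solution is Med-CRAFT, a neuro-symbolic data engineering framework that formalizes benchmark synthesis as a deterministic graph traversal: it extracts structured visual primitives (such as surgical instruments and anatomical boundaries) from raw video streams, builds a dynamic Spatiotemporal Knowledge Graph, and anchors query generation to valid paths in that graph, so that every synthesized benchmark item carries rigorous Chain-of-Thought (CoT) provenance. This mechanism makes the temporal granularity and multi-hop logical complexity of the generated data controllable and verifiable, enabling low-cost, scalable construction of logically consistent evaluation benchmarks for medical video reasoning.
Link: https://arxiv.org/abs/2512.01045
Authors: Shenxi Liu, Kan Li, Mingyang Zhao, Yuhang Tian, Shoujun Zhou, Bin Li
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages, 7 figures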
Abstract: The scarcity of high-quality, logically annotated video datasets remains a primary bottleneck in advancing Multi-Modal Large Language Models (MLLMs) for the medical domain. Traditional manual annotation is prohibitively expensive and non-scalable, while existing synthetic methods often suffer from stochastic hallucinations and a lack of logical interpretability. To address these challenges, we introduce Med-CRAFT, a novel neuro-symbolic data engineering framework that formalizes benchmark synthesis as a deterministic graph traversal process. Unlike black-box generative approaches, Med-CRAFT extracts structured visual primitives (e.g., surgical instruments, anatomical boundaries) from raw video streams and instantiates them into a dynamic Spatiotemporal Knowledge Graph. By anchoring query generation to valid paths within this graph, we enforce a rigorous Chain-of-Thought (CoT) provenance for every synthesized benchmark item. We instantiate this pipeline to produce M3-Med-Auto, a large-scale medical video reasoning benchmark exhibiting fine-grained temporal selectivity and multi-hop logical complexity. Comprehensive evaluations demonstrate that our automated pipeline generates query workloads with complexity comparable to expert-curated datasets. Furthermore, a logic alignment analysis reveals a high correlation between the prescribed graph topology and the reasoning steps of state-of-the-art MLLMs, validating the system’s capability to encode verifiable logic into visual-linguistic benchmarks. This work paves the way for scalable, low-cost construction of robust evaluation protocols in critical domains.
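Anchoring multi-hop questions to graph paths can be sketched with a small deterministic traversal. This is not the Med-CRAFT pipeline: the graph contents, relation names, and the `paths_of_length` helper are hypothetical, and no spatiotemporal information is modelled.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (relation, target node)
Graph = Dict[str, List[Edge]]


def paths_of_length(graph: Graph, start: str, hops: int) -> List[List[str]]:
    """Enumerate all simple paths with exactly `hops` edges from `start`, depth-first.

    Each path alternates node, relation, node, ... and can serve as the
    reasoning chain behind one generated question.
    """
    results: List[List[str]] = []

    def walk(node: str, path: List[str], visited: frozenset) -> None:
        if (len(path) - 1) // 2 == hops:
            results.append(path)
            return
        for relation, target in graph.get(node, []):
            if target not in visited:
                walk(target, path + [relation, target], visited | {target})

    walk(start, [start], frozenset({start}))
    return results


toy_graph: Graph = {  # made-up surgical-video facts
    "grasper": [("holds", "gallbladder")],
    "gallbladder": [("attached_to", "liver")],
}
chains = paths_of_length(toy_graph, "grasper", hops=2)
```

Since every question is derived from an explicit path, the number of hops is controlled directly and the chain itself documents how the answer follows from the graph.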
[AI-77] FMTK: A Modular Toolkit for Composable Time Series Foundation Model Pipelines
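【Quick Read】: This paper addresses the lack of modularity and reproducibility that arises when time-series foundation models (TSFMs) are combined with task-specific encoders, decoders, and adapters in practice. Existing approaches typically depend on ad hoc, model-specific implementations, which slows development and makes extension hard. The key to the solution is FMTK, an open-source, lightweight, and extensible toolkit that composes TSFM pipelines flexibly through standardized backbone and component abstractions, achieving correctness and performance with an average of seven lines of code.
Link: https://arxiv.org/abs/2512.01038
Authors: Hetvi Shastri, Pragya Sharma, Walid A. Hanafy, Mani Srivastava, Prashant Shenoy
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: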
Abstract:Foundation models (FMs) have opened new avenues for machine learning applications due to their ability to adapt to new and unseen tasks with minimal or no further training. Time-series foundation models (TSFMs) – FMs trained on time-series data – have shown strong performance on classification, regression, and imputation tasks. Recent pipelines combine TSFMs with task-specific encoders, decoders, and adapters to improve performance; however, assembling such pipelines typically requires ad hoc, model-specific implementations that hinder modularity and reproducibility. We introduce FMTK, an open-source, lightweight and extensible toolkit for constructing and fine-tuning TSFM pipelines via standardized backbone and component abstractions. FMTK enables flexible composition across models and tasks, achieving correctness and performance with an average of seven lines of code. this https URL
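The value of standardized component abstractions can be shown with a generic composition pattern. The sketch below is not the FMTK API: `Pipeline` and the three stage functions are hypothetical names, and the "backbone" is a trivial stand-in rather than a real foundation model.

```python
from typing import Callable, List, Sequence

Stage = Callable[[Sequence[float]], Sequence[float]]


class Pipeline:
    """Chain of stages applied in order; any stage can be swapped independently."""

    def __init__(self, *stages: Stage) -> None:
        self.stages: List[Stage] = list(stages)

    def __call__(self, series: Sequence[float]) -> Sequence[float]:
        for stage in self.stages:
            series = stage(series)
        return series


def normalize(series: Sequence[float]) -> Sequence[float]:
    """Stand-in encoder: subtract the mean."""
    mean = sum(series) / len(series)
    return [x - mean for x in series]


def identity_backbone(series: Sequence[float]) -> Sequence[float]:
    """Stand-in for a frozen foundation model."""
    return list(series)


def last_value_head(series: Sequence[float]) -> Sequence[float]:
    """Stand-in decoder: naive one-step forecast."""
    return [series[-1]]


forecast = Pipeline(normalize, identity_backbone, last_value_head)([1.0, 2.0, 3.0])
```

When every stage shares one call signature, replacing the backbone or the decoder is a one-argument change, which is the kind of modularity the abstract argues ad hoc implementations lack.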
[AI-78] Goal-Oriented Multi-Agent Semantic Networking: Unifying Intents Semantics and Intelligence
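【Quick Read】: This paper addresses the limitations of current network architectures in supporting sixth-generation (6G) services: traditional networks are built on complete decoupling of applications and the network, so they cannot expose or exploit high-level goals, which limits their ability to adapt intelligently to service needs. The key to the solution is a new architecture called Goal-Oriented Multi-Agent Semantic Networking (GoAgentNet), which abstracts application and network functions into multiple collaborative agents and jointly orchestrates sensing, networking, computation, and control through semantic computation and cross-layer semantic networking, so that the whole system pursues unified application goals and communication shifts from data exchange to goal fulfilment.
Link: https://arxiv.org/abs/2512.01035
Authors: Shutong Chen, Qi Liao, Adnan Aijaz, Yansha Deng
Affiliation: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: Submitting to IEEE for potential publications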
Abstract:6G services are evolving toward goal-oriented and AI-native communication, which are expected to deliver transformative societal benefits across various industries and promote energy sustainability. Yet today’s networking architectures, built on complete decoupling of the applications and the network, cannot expose or exploit high-level goals, limiting their ability to adapt intelligently to service needs. This work introduces Goal-Oriented Multi-Agent Semantic Networking (GoAgentNet), a new architecture that elevates communication from data exchange to goal fulfilment. GoAgentNet enables applications and the network to collaborate by abstracting their functions into multiple collaborative agents, and jointly orchestrates multi-agent sensing, networking, computation, and control through semantic computation and cross-layer semantic networking, allowing the entire architecture to pursue unified application goals. We first outline the limitations of legacy network designs in supporting 6G services, based on which we highlight key enablers of our GoAgentNet design. Then, through three representative 6G usage scenarios, we demonstrate how GoAgentNet can unlock more efficient and intelligent services. We further identify unique challenges faced by GoAgentNet deployment and corresponding potential solutions. A case study on robotic fault detection and recovery shows that our GoAgentNet architecture improves energy efficiency by up to 99% and increases the task success rate by up to 72%, compared with the existing networking architectures without GoAgentNet, which underscores its potential to support scalable and sustainable 6G systems.
[AI-79] AltNet: Addressing the Plasticity-Stability Dilemma in Reinforcement Learning
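[Quick read]: This paper addresses plasticity loss in neural networks trained on reinforcement learning tasks, i.e., the gradual decline over training of a model's ability to keep learning from new experience. To overcome it, the authors propose AltNet, a parameter-reset-based method whose key innovation is a twin-network mechanism in which two networks alternate between active and passive roles. One acts in the environment and learns as it acts, while the other learns off-policy from the active network's interactions and a replay buffer. At fixed intervals the active network is reset, and the passive network, which has already learned from prior experience, takes over as the new active network. This restores plasticity while keeping performance stable, avoiding the sharp performance drops of conventional reset methods, and improves sample efficiency and final performance on high-dimensional control tasks.
Link: https://arxiv.org/abs/2512.01034
Authors: Mansi Maheshwari, John C. Raisbeck, Bruno Castro da Silva
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: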
Abstract:Neural networks have shown remarkable success in supervised learning when trained on a single task using a fixed dataset. However, when neural networks are trained on a reinforcement learning task, their ability to continue learning from new experiences declines over time. This decline in learning ability is known as plasticity loss. To restore plasticity, prior work has explored periodically resetting the parameters of the learning network, a strategy that often improves overall performance. However, such resets come at the cost of a temporary drop in performance, which can be dangerous in real-world settings. To overcome this instability, we introduce AltNet, a reset-based approach that restores plasticity without performance degradation by leveraging twin networks. The use of twin networks anchors performance during resets through a mechanism that allows networks to periodically alternate roles: one network learns as it acts in the environment, while the other learns off-policy from the active network’s interactions and a replay buffer. At fixed intervals, the active network is reset and the passive network, having learned from prior experiences, becomes the new active network. AltNet restores plasticity, improving sample efficiency and achieving higher performance, while avoiding performance drops that pose risks in safety-critical settings. We demonstrate these advantages in several high-dimensional control tasks from the DeepMind Control Suite, where AltNet outperforms various relevant baseline methods, as well as state-of-the-art reset-based techniques.
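The role alternation described in the abstract can be sketched as a schedule. This is a minimal illustration, not the authors' implementation: the `Net` class, its `updates` counter (a stand-in for accumulated learning), and `run_altnet_schedule` are hypothetical names introduced here, and no real learning takes place.

```python
from dataclasses import dataclass


@dataclass
class Net:
    """Stand-in for a learning network; `updates` counts steps since the last reset."""
    name: str
    updates: int = 0

    def learn(self) -> None:
        self.updates += 1

    def reset(self) -> None:
        self.updates = 0


def run_altnet_schedule(total_steps: int, swap_interval: int):
    """Alternate active/passive roles; returns which net acted at each step."""
    active, passive = Net("A"), Net("B")
    history = []
    for step in range(1, total_steps + 1):
        history.append(active.name)
        active.learn()   # learns while acting in the environment
        passive.learn()  # learns off-policy from the active net's data
        if step % swap_interval == 0:
            active.reset()                     # restore plasticity
            active, passive = passive, active  # the trained passive net takes over
    return history, active, passive
```

In this sketch the network that takes over after a swap always carries a full interval of updates, which is the property the abstract credits with anchoring performance during resets.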
[AI-80] VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference
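[Quick read]: This paper addresses the reaction delays and action instability that arise when Vision-Language-Action models (VLAs) are deployed on real robots. With synchronous inference, robots show action stalls and delayed reactions to environmental changes, which is why demonstration videos are often sped up. Asynchronous inference improves continuity and lowers latency, but the mismatch between the prediction and execution intervals causes significant action instability, and existing mitigations either sacrifice accuracy or add runtime overhead. The key to the solution is VLASH, a framework that rolls the robot state forward using the previously generated action chunk to estimate the state at execution time, bridging the temporal misalignment between prediction and execution. Without additional overhead or architectural changes, it preserves the original accuracy while delivering up to a 2.03x speedup and up to a 17.4x reduction in reaction latency.
Link: https://arxiv.org/abs/2512.01031
Authors: Jiaming Tang, Yufei Sun, Yilong Zhao, Shang Yang, Yujun Lin, Zhuoyang Zhang, James Hou, Yao Lu, Zhijian Liu, Song Han
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: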
Abstract:Vision-Language-Action models (VLAs) are becoming increasingly capable across diverse robotic tasks. However, their real-world deployment remains slow and inefficient: demonstration videos are often sped up by 5-10x to appear smooth, with noticeable action stalls and delayed reactions to environmental changes. Asynchronous inference offers a promising solution to achieve continuous and low-latency control by enabling robots to execute actions and perform inference simultaneously. However, because the robot and environment continue to evolve during inference, a temporal misalignment arises between the prediction and execution intervals. This leads to significant action instability, while existing methods either degrade accuracy or introduce runtime overhead to mitigate it. We propose VLASH, a general asynchronous inference framework for VLAs that delivers smooth, accurate, and fast reaction control without additional overhead or architectural changes. VLASH estimates the future execution-time state by rolling the robot state forward with the previously generated action chunk, thereby bridging the gap between prediction and execution. Experiments show that VLASH achieves up to 2.03x speedup and reduces reaction latency by up to 17.4x compared to synchronous inference while fully preserving the original accuracy. Moreover, it empowers VLAs to handle fast-reaction, high-precision tasks such as playing ping-pong and playing whack-a-mole, where traditional synchronous inference fails. Code is available at this https URL
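The idea of rolling the robot state forward with the previously generated action chunk can be illustrated with a small sketch. It assumes, purely for illustration, that actions are additive deltas on the state vector; the function name `roll_forward` and its parameters are hypothetical and do not come from the paper or its code.

```python
def roll_forward(state, action_chunk, start, delay_steps):
    """Estimate the state at execution time.

    Applies the `delay_steps` actions from the previous chunk that will be
    executed while inference is still running. Assumes additive delta actions.
    """
    future = list(state)
    for action in action_chunk[start:start + delay_steps]:
        future = [s + a for s, a in zip(future, action)]
    return future


# Example: inference takes two control steps, starting at index 1 of the chunk.
previous_chunk = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
estimated = roll_forward([0.0, 1.0], previous_chunk, start=1, delay_steps=2)
```

A policy conditioned on `estimated` rather than on the stale observed state would then predict actions for the moment they actually execute, which is the gap the abstract describes.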
[AI-81] Operator-Theoretic Framework for Gradient-Free Federated Learning
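[Quick read]: This paper addresses several challenges that federated learning (FL) must handle at once: data heterogeneity, strict communication and computation limits, and privacy requirements, all while preserving model performance. The core solution is an operator-theoretic framework that maps the L^2-optimal solution into a reproducing kernel Hilbert space (RKHS) via a forward operator, approximates it with the available data, and maps it back with the inverse operator, yielding a gradient-free federated learning scheme. The key innovations are: (1) a data-dependent hypothesis space with theoretical guarantees on risk, error, robustness, and approximation; (2) efficient kernel machines that exploit the space folding property of Kernel Affine Hull Machines, so that clients transfer only a scalar space folding measure, which substantially reduces communication; (3) a differentially private protocol with single-step noise perturbation that avoids per-round clipping and privacy accounting; and (4) a global prediction rule that needs only integer minimum and equality-comparison operations, making it compatible with fully homomorphic encryption (FHE). Experiments show that the gradient-free method matches or outperforms strong gradient-based fine-tuning on several benchmarks, and that kernel smoothing mitigates accuracy loss in high-privacy regimes.
Link: https://arxiv.org/abs/2512.01025
Authors: Mohit Kumar, Mathias Brucker, Alexander Valentinitsch, Adnan Husakovic, Ali Abbas, Manuela Geiß, Bernhard A. Moser
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: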
Abstract:Federated learning must address heterogeneity, strict communication and computation limits, and privacy while ensuring performance. We propose an operator-theoretic framework that maps the $L^2$-optimal solution into a reproducing kernel Hilbert space (RKHS) via a forward operator, approximates it using available data, and maps back with the inverse operator, yielding a gradient-free scheme. Finite-sample bounds are derived using concentration inequalities over operator norms, and the framework identifies a data-dependent hypothesis space with guarantees on risk, error, robustness, and approximation. Within this space we design efficient kernel machines leveraging the space folding property of Kernel Affine Hull Machines. Clients transfer knowledge via a scalar space folding measure, reducing communication and enabling a simple differentially private protocol: summaries are computed from noise-perturbed data matrices in one step, avoiding per-round clipping and privacy accounting. The induced global rule requires only integer minimum and equality-comparison operations per test point, making it compatible with fully homomorphic encryption (FHE). Across four benchmarks, the gradient-free FL method with fixed encoder embeddings matches or outperforms strong gradient-based fine-tuning, with gains up to 23.7 points. In differentially private experiments, kernel smoothing mitigates accuracy loss in high-privacy regimes. The global rule admits an FHE realization using $Q \times C$ encrypted minimum and $C$ equality-comparison operations per test point, with operation-level benchmarks showing practical latencies. Overall, the framework provides provable guarantees with low communication, supports private knowledge transfer via scalar summaries, and yields an FHE-compatible prediction rule offering a mathematically grounded alternative to gradient-based federated learning under heterogeneity.
[AI-82] ChartAnchor: Chart Grounding with Structural-Semantic Fidelity
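[Quick read]: This paper addresses shortcomings in how the structured chart understanding of multimodal large language models (MLLMs) is evaluated, in particular the lack of comprehensive assessment of chart grounding, the bidirectional alignment between a chart and its table and code. Existing benchmarks are limited by narrow chart diversity, isolated tasks, and incomplete evaluation frameworks, so they do not faithfully reflect model capabilities in numerical reasoning, cross-modal alignment, and structural reconstruction. The key to the solution is ChartAnchor, a comprehensive benchmark of more than 8,000 chart-table-code triples covering 30 chart types from real-world and augmented sources. Its core contribution is two complementary tasks: chart-to-code generation, which synthesizes executable code that reproduces the chart, and controlled chart-to-table reconstruction, which extracts exact data matched to predefined headers, enabling cross-validation of visual and numerical fidelity. A multi-level evaluation framework combining semantic validation, stylistic analysis, and perceptual metrics gives a rigorous, quantifiable basis for assessing MLLMs' structured reasoning in scientific, financial, and industrial settings.
Link: https://arxiv.org/abs/2512.01017
Authors: Xinhang Li, Jingbo Zhou, Pengfei Luo, Yixiong Xiao, Tong Xu
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: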
Abstract:Recent advances in multimodal large language models (MLLMs) highlight the need for benchmarks that rigorously evaluate structured chart understanding. Chart grounding refers to the bidirectional alignment between a chart’s visual appearance and the structured semantics. This task requires models to produce a symbolic specification that faithfully captures the chart’s visual and structural intent, while also recovering the underlying tabular data with precise values and relationships. Chart grounding directly reflects a model’s capabilities in numerical reasoning, multimodal alignment, and structural reconstruction, and has several important real-world applications. Existing benchmarks, constrained by narrow chart diversity, isolated tasks, and incomplete evaluation frameworks, fail to holistically assess grounding. To address this, we propose ChartAnchor, a comprehensive benchmark of 8k+ chart-table-code triples spanning 30 chart types drawn from diverse real-world and augmented sources. ChartAnchor introduces two complementary tasks: chart-to-code generation (synthesizing executable code to replicate charts) and controlled chart-to-table reconstruction (extracting exact data with predefined headers), enabling cross-validation of visual and numerical fidelity. A multi-level evaluation framework integrates semantic validation, stylistic analysis, and perceptual metrics to assess both structural and content-level correctness. Extensive experiments on MLLMs reveal critical limitations in numerical precision and code synthesis, emphasizing the need for structured reasoning beyond surface-level perception. By unifying symbolic and data-driven grounding, ChartAnchor establishes a rigorous foundation for chart grounding, offering meaningful insights for advancing MLLMs in scientific, financial, and industrial domains.
[AI-83] Chain of Unit-Physics: A Primitive-Centric Approach to Scientific Code Synthesis
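[Quick read]: This paper addresses the insufficient reliability of generative AI in high-stakes scientific computing, specifically two obstacles to generating correct, physically consistent scientific code from natural-language queries: (a) sparse representation of domain codes during training, and (b) the difficulty of applying reinforcement learning from human feedback (RLHF) with a small expert community. The key to the solution is Chain of Unit-Physics, a first-principles-based inverse approach to code design in which human expert knowledge is encoded as unit-physics tests that explicitly constrain code generation, keeping the generated code numerically and physically consistent. On a combustion benchmark with realistic physical constraints, the method converges within 5-6 iterations, matches the human-expert implementation (mean error of $3.1\times10^{-3}$%), runs about 33.4% faster, and is about 30% more memory-efficient, at a cost comparable to mid-sized commercial APIs, providing a practical template for physics-grounded scientific code generation.
Link: https://arxiv.org/abs/2512.01010
Authors: Vansh Sharma, Venkat Raman
Affiliation: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn)
Comments: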
Abstract:Agentic large language models are proposed as autonomous code generators for scientific computing, yet their reliability in high-stakes problems remains unclear. Developing computational scientific software from natural-language queries remains challenging broadly due to (a) sparse representation of domain codes during training and (b) the limited feasibility of RLHF with a small expert community. To address these limitations, this work conceptualizes an inverse approach to code design, embodied in the Chain of Unit-Physics framework: a first-principles (or primitives)-centric, multi-agent system in which human expert knowledge is encoded as unit-physics tests that explicitly constrain code generation. The framework is evaluated on a nontrivial combustion task, used here as a representative benchmark for scientific problems with realistic physical constraints. Closed-weight systems and code-focused agentic variants fail to produce correct end-to-end solvers, despite tool and web access, exhibiting four recurrent error classes: interface (syntax/API) hallucinations, overconfident assumptions, numerical/physical incoherence, and configuration fragility. Open-weight models with chain-of-thought (CoT) decoding reduce interface errors but still yield incorrect solutions. On the benchmark task, the proposed framework converges within 5-6 iterations and matches the human-expert implementation (mean error of $3.1\times10^{-3}$%), with a roughly 33.4% faster runtime and roughly 30% more efficient memory usage, at a cost comparable to mid-sized commercial APIs, yielding a practical template for physics-grounded scientific code generation. As datasets and models evolve, zero-shot code accuracy will improve; however, the Chain of Unit-Physics framework goes further by embedding first-principles analysis that is foundational to scientific codes.
[AI-84] IndiMathBench: Autoformalizing Mathematical Reasoning Problems with a Human Touch
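[Quick read]: This paper addresses the challenge of autoformalization in mathematical theorem proving, i.e., accurately converting mathematical problems stated in natural language into a formal language (such as Lean 4) that automated reasoning systems can verify. The core issue is that current methods show a substantial gap between syntactic validity and semantic correctness, and theorem-proving success rates remain low even with iterative refinement. The key to the solution is IndiMathBench, a high-quality, human-verified benchmark of 312 formal Lean 4 theorems from Indian Mathematics Olympiads paired with their informal problem statements. Candidate formalizations are produced efficiently by an AI-powered, human-assisted pipeline that combines category-based retrieval, iterative compiler feedback, and multi-model ensembles, with an interactive dashboard for expert validation and automated quality summaries, yielding a rigorous and challenging testbed for mathematical reasoning.
Link: https://arxiv.org/abs/2512.00997
Authors: Param Biyani, Shashank Kirtania, Yasharth Bajpai, Sumit Gulwani, Ashish Tiwari
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: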
Abstract:We introduce IndiMathBench, a human-verified benchmark designed to evaluate mathematical theorem proving, curated using an AI-powered human-assisted pipeline for formalizing natural language problems in Lean. IndiMathBench is composed of 312 formal Lean 4 theorems paired with their corresponding informal problem statements, sourced from Indian Mathematics Olympiads. Through category-based retrieval, iterative compiler feedback, and multi-model ensembles, our pipeline generates candidate formalizations that experts efficiently validate via an interactive dashboard with automated quality summaries. Evaluation across multiple frontier models demonstrates that autoformalization remains challenging, with substantial gaps between syntactic validity and semantic correctness, while theorem proving success rates remain low even with iterative refinement, demonstrating that IndiMathBench presents a challenging testbed for mathematical reasoning. IndiMathBench is available at this https URL.
[AI-85] Integrating Causal Foundation Model in Prescriptive Maintenance Framework for Optimizing Production Line OEE
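[Quick read]: This paper addresses a core bottleneck in the shift from predictive to prescriptive maintenance in manufacturing: existing predictive models mostly rely on spurious correlations rather than the true causal mechanisms of failures, so they cannot reliably identify root causes or support effective interventions. The key to the solution is a model based on causal machine learning that uses a pre-trained causal foundation model as a "what-if" tool to simulate and estimate the causal effect of candidate fixes on key performance indicators (KPIs) such as Overall Equipment Effectiveness (OEE), moving from diagnosis to active prescription. The approach both identifies root causes and quantifies the operational impact of each intervention, giving the production line a data-driven ranking of actions.
Link: https://arxiv.org/abs/2512.00969
Authors: Felix Saretzky, Lucas Andersen, Thomas Engel, Fazel Ansari
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 9 pages, 3 images, 1 table, conference paper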
Abstract:The transition to prescriptive maintenance in manufacturing is critically constrained by a dependence on predictive models. These models tend to rely on spurious correlations rather than identifying the true causal drivers of failures, often leading to costly misdiagnoses and ineffective interventions. This fundamental limitation results in a key-challenge: while we can predict that a failure may occur, we lack a systematic method to understand why a failure occurs, thereby providing the basis for identifying the most effective intervention. This paper proposes a model based on causal machine learning to bridge this gap. Our objective is to move beyond diagnosis to active prescription by simulating and evaluating potential fixes toward optimizing KPIs such as Overall Equipment Effectiveness (OEE). For this purpose a pre-trained causal foundation model is used as a “what-if” model to estimate the effects of potential fixes. By measuring the causal effect of each intervention on system-level KPIs, it provides a data-driven ranking of actions to recommend at the production line. This process not only identifies root causes but also quantifies their operational impact. The model is evaluated using semi-synthetic manufacturing data and compared with a baseline machine learning model. This paper sets the technical basis for a robust prescriptive maintenance framework, allowing engineers to test potential solutions in a causal environment to make more effective operational decisions and reduce costly downtimes.
[AI-86] Optimizing Generative Ranking Relevance via Reinforcement Learning in Xiaohongshu Search KDD2026
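[Quick read]: This paper addresses the limited interpretability of traditional relevance models and their limited capacity to model complex relevance signals, as well as the weak generalization of existing reasoning-based Generative Relevance Models (GRMs), which depend on large amounts of human-annotated or synthetic Chain-of-Thought (CoT) data. The key to the solution is to formulate relevance modeling in Xiaohongshu search as a multi-step reasoning task and introduce a Reinforcement Learning (RL) training framework that embeds business-specific relevance criteria to make the reasoning better grounded. The authors also propose Stepwise Advantage Masking (SAM), a lightweight strategy that improves credit assignment so these criteria are learned more effectively. The result is an accurate and interpretable relevance ranking model, which is distilled into a lightweight version for deployment in an industrial search system.
Link: https://arxiv.org/abs/2512.00968
Authors: Ziyang Zeng, Heming Jing, Jindong Chen, Xiangli Li, Hongyu Liu, Yixuan He, Zhengyu Li, Yige Sun, Zheyong Xie, Yuqing Yang, Shaosheng Cao, Jun Fan, Yi Wu, Yao Hu
Affiliation: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted by KDD 2026 ADS Track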
Abstract:Ranking relevance is a fundamental task in search engines, aiming to identify the items most relevant to a given user query. Traditional relevance models typically produce scalar scores or directly predict relevance labels, limiting both interpretability and the modeling of complex relevance signals. Inspired by recent advances in Chain-of-Thought (CoT) reasoning for complex tasks, we investigate whether explicit reasoning can enhance both interpretability and performance in relevance modeling. However, existing reasoning-based Generative Relevance Models (GRMs) primarily rely on supervised fine-tuning on large amounts of human-annotated or synthetic CoT data, which often leads to limited generalization. Moreover, domain-agnostic, free-form reasoning tends to be overly generic and insufficiently grounded, limiting its potential to handle the diverse and ambiguous cases prevalent in open-domain search. In this work, we formulate relevance modeling in Xiaohongshu search as a reasoning task and introduce a Reinforcement Learning (RL)-based training framework to enhance the grounded reasoning capabilities of GRMs. Specifically, we incorporate practical business-specific relevance criteria into the multi-step reasoning prompt design and propose Stepwise Advantage Masking (SAM), a lightweight process-supervision strategy which facilitates effective learning of these criteria through improved credit assignment. To enable industrial deployment, we further distill the large-scale RL-tuned model to a lightweight version suitable for real-world search systems. Extensive experiments on industrial datasets, along with online A/B tests, demonstrate the effectiveness of our approach.
[AI-87] Multi-Modal AI for Remote Patient Monitoring in Cancer Care
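[Quick read]: This paper addresses the risk of undetected side effects that patients on systemic cancer therapy face between clinic visits because they are not continuously monitored. The key to the solution is a multi-modal AI framework for remote patient monitoring (RPM) that was built and prospectively trialed. It integrates heterogeneous data from the HALO-X platform (demographics, wearable sensors, daily surveys, and clinical events) and uses a model designed for the asynchronous, incomplete nature of real-world RPM data to forecast a continuous risk of future adverse events. On 2.1 million data points from 84 patients, the model reached an accuracy of 83.9% (AUROC = 0.70) and identified previous treatments, wellness check-ins, and daily maximum heart rate as key predictive features, supporting its early-warning capability.
Link: https://arxiv.org/abs/2512.00949
Authors: Yansong Liu, Ronnie Stafford, Pramit Khetrapal, Huriye Kocadag, Graça Carvalho, Patricia de Winter, Maryam Imran, Amelia Snook, Adamos Hadjivasiliou, D. Vijay Anand, Weining Lin, John Kelly, Yukun Zhou, Ivana Drobnjak
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: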
Abstract:For patients undergoing systemic cancer therapy, the time between clinic visits is full of uncertainties and risks of unmonitored side effects. To bridge this gap in care, we developed and prospectively trialed a multi-modal AI framework for remote patient monitoring (RPM). This system integrates multi-modal data from the HALO-X platform, such as demographics, wearable sensors, daily surveys, and clinical events. Our observational trial is one of the largest of its kind and has collected over 2.1 million data points (6,080 patient-days) of monitoring from 84 patients. We developed and adapted a multi-modal AI model to handle the asynchronous and incomplete nature of real-world RPM data, forecasting a continuous risk of future adverse events. The model achieved an accuracy of 83.9% (AUROC=0.70). Notably, the model identified previous treatments, wellness check-ins, and daily maximum heart rate as key predictive features. A case study demonstrated the model’s ability to provide early warnings by outputting escalating risk profiles prior to the event. This work establishes the feasibility of multi-modal AI RPM for cancer care and offers a path toward more proactive patient support. (Accepted at Europe NeurIPS 2025 Multimodal Representation Learning for Healthcare Workshop)
[AI-88] Constant-Time Motion Planning with Manipulation Behaviors
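[Quick read]: This paper addresses the lack of motion planning algorithms for contact-rich robotic manipulation that offer verifiable guarantees of safety, efficiency, and reliability, which makes efficient and dependable planning difficult for manipulation tasks such as grasping or insertion. The key to the solution is the Behavioral Constant-Time Motion Planner (B-CTMP), which extends the Constant-Time Motion Planning (CTMP) framework. A preprocessing phase builds compact data structures so that collision-free motion queries are answered within a fixed time budget (e.g., 10 milliseconds), and manipulation behaviors (such as grasping or insertion) are incorporated explicitly. This supports two-step tasks: (1) a collision-free motion to a behavior initiation state, then (2) execution of the specified manipulation behavior to reach the goal. B-CTMP guarantees constant-time queries while ensuring completeness and task success over a specified set of states, and it suits manipulation in semi-structured environments.
Link: https://arxiv.org/abs/2512.00939
Authors: Nayesha Gandotra, Itamar Mishani, Maxim Likhachev
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: In submission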
Abstract:Recent progress in contact-rich robotic manipulation has been striking, yet most deployed systems remain confined to simple, scripted routines. One of the key barriers is the lack of motion planning algorithms that can provide verifiable guarantees for safety, efficiency and reliability. To address this, a family of algorithms called Constant-Time Motion Planning (CTMP) was introduced, which leverages a preprocessing phase to enable collision-free motion queries in a fixed, user-specified time budget (e.g., 10 milliseconds). However, existing CTMP methods do not explicitly incorporate the manipulation behaviors essential for object handling. To bridge this gap, we introduce the Behavioral Constant-Time Motion Planner (B-CTMP), an algorithm that extends CTMP to solve a broad class of two-step manipulation tasks: (1) a collision-free motion to a behavior initiation state, followed by (2) execution of a manipulation behavior (such as grasping or insertion) to reach the goal. By precomputing compact data structures, B-CTMP guarantees constant-time query in mere milliseconds while ensuring completeness and successful task execution over a specified set of states. We evaluate B-CTMP on two canonical manipulation tasks in simulation, shelf picking and plug insertion, and demonstrate its effectiveness on a real robot. Our results show that B-CTMP unifies collision-free planning and object manipulation within a single constant-time framework, providing provable guarantees of speed and success for manipulation in semi-structured environments.
[AI-89] Minimal neuron ablation triggers catastrophic collapse in the language core of Large Vision-Language Models
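[Quick read]: This paper addresses the structural fragility of Large Vision-Language Models (LVLMs), namely identifying the critical neurons whose removal causes catastrophic collapse of the model. The key to the solution is CAN, a method that detects Consistently Activated Neurons and locates critical neurons through progressive masking. Experiments show that masking a very small number of neurons in the language model's feed-forward networks (as few as four in extreme cases) is enough to trigger collapse, that critical neurons are concentrated in the language module rather than the vision components, and that the down-projection layer is especially vulnerable. The work also reveals a consistent two-stage collapse pattern: expressive degradation first, followed by sudden, complete failure.
Link: https://arxiv.org/abs/2512.00918
Authors: Cen Lu, Yung-Chen Tang, Andrea Cavallaro
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 15 pages, 6 figures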
Abstract:Large Vision-Language Models (LVLMs) have shown impressive multimodal understanding capabilities, yet their robustness is poorly understood. In this paper, we investigate the structural vulnerabilities of LVLMs to identify any critical neurons whose removal triggers catastrophic collapse. In this context, we propose CAN, a method to detect Consistently Activated Neurons and to locate critical neurons by progressive masking. Experiments on LLaVA-1.5-7b-hf and InstructBLIP-Vicuna-7b reveal that masking only a tiny portion of the language model’s feed-forward networks (just as few as four neurons in extreme cases) suffices to trigger catastrophic collapse. Notably, critical neurons are predominantly localized in the language model rather than in the vision components, and the down-projection layer is a particularly vulnerable structure. We also observe a consistent two-stage collapse pattern: initial expressive degradation followed by sudden, complete collapse. Our findings provide important insights for safety research in LVLMs.
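[AI-90] Beyond High-Entropy Exploration: Correctness-Aware Low-Entropy Segment-Based Advantage Shaping for Reasoning LLMs
[Quick read]: This paper addresses the inefficiency of Reinforcement Learning with Verifiable Rewards (RLVR) methods for improving the reasoning ability of large language models, in particular their neglect of the stable, reusable structural patterns carried by low-entropy segments of reasoning trajectories. The analysis shows that the overlap of low-entropy segments across correct responses correlates strongly with model accuracy, whereas overlaps involving incorrect responses exhibit stable but unproductive patterns. The key to the solution is LESS, a correctness-aware framework for fine-grained advantage modulation: it amplifies low-entropy segments that appear only in correct responses, suppresses those that appear only in incorrect responses, and neutralizes those shared by both, while preserving high-entropy exploration. The method improves accuracy over strong RL baselines across multiple backbones and math benchmarks and makes the performance floor more robust.
Link: https://arxiv.org/abs/2512.00908
Authors: Xinzhu Chen, Xuesheng Li, Zhongxiang Sun, Weijie Yu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: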
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has become a central approach for improving the reasoning ability of large language models. Recent work studies RLVR through token entropy, arguing that high-entropy tokens drive exploration and should receive stronger updates. However, this line of work overlooks the fact that most of a reasoning trajectory consists of low-entropy segments that encode stable and reusable structural patterns. Through qualitative and quantitative analyses, we find that the overlap of low-entropy segments across correct responses strongly correlates with model accuracy, while overlaps involving incorrect responses exhibit stable but unproductive patterns. Motivated by these findings, we propose LESS, a correctness-aware reinforcement framework that performs fine-grained advantage modulation over low-entropy segments. LESS amplifies segments unique to correct responses, suppresses those unique to incorrect ones, and neutralizes segments shared by both, while preserving high-entropy exploration in the underlying RL algorithm. Instantiated on top of the popular GRPO, LESS consistently improves accuracy over strong RL baselines across three backbones and six math benchmarks, and achieves stronger robustness of the performance floor.
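The three-way treatment of low-entropy segments can be sketched as a set classification. This is only an illustration of the amplify / suppress / neutralize split named in the abstract: the function `classify_segments`, the representation of segments as strings, and the role labels are assumptions made here, and the sketch does not show how LESS detects segments or scales advantages.

```python
def classify_segments(correct_responses, incorrect_responses):
    """Assign a role to every low-entropy segment seen in a group of responses.

    Each response is given as a list of hashable segments (strings here).
    """
    in_correct = set().union(*correct_responses)
    in_incorrect = set().union(*incorrect_responses)
    roles = {}
    for segment in in_correct | in_incorrect:
        if segment in in_correct and segment in in_incorrect:
            roles[segment] = "neutralize"  # shared by correct and incorrect
        elif segment in in_correct:
            roles[segment] = "amplify"     # unique to correct responses
        else:
            roles[segment] = "suppress"    # unique to incorrect responses
    return roles


roles = classify_segments(
    correct_responses=[["let x", "so x=2"], ["let x", "check"]],
    incorrect_responses=[["let x", "guess"]],
)
```

A training loop would then map each role to a change in the per-token advantage for that segment; the exact scaling is specific to the paper and is not reproduced here.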
[AI-91] Light-Weight Benchmarks Reveal the Hidden Hardware Cost of Zero-Shot Tabular Foundation Models ICML
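[Quick read]: This paper addresses the poorly characterized trade-off between hardware cost and model performance on tabular prediction tasks, in particular that the efficiency bottlenecks of zero-shot foundation models (FMs) in real deployments have not been systematically quantified. The key to the solution is a fully reproducible benchmark that measures test accuracy, wall-clock latency, peak CPU RAM, and peak GPU VRAM on four public tabular datasets, and compares two open tabular foundation models (TabPFN-1.0 and TabICL-base) with tuned tree ensembles: XGBoost, LightGBM, and Random Forest. The results show that accuracy gains from current tabular FMs often come with substantial hardware overhead (for example, TabICL needs 9 GB of VRAM and 960 s of latency), while the tree ensembles reach comparable accuracy with sub-second inference and very low memory use, giving future efficiency-oriented work on tabular FMs a clear baseline.
Link: https://arxiv.org/abs/2512.00888
Authors: Aayam Bansal, Ishaan Gangwani
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICML NewInML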
Abstract:Zero-shot foundation models (FMs) promise training-free prediction on tabular data, yet their hardware footprint remains poorly characterized. We present a fully reproducible benchmark that reports test accuracy together with wall-clock latency, peak CPU RAM, and peak GPU VRAM on four public datasets: Adult-Income, Higgs-100k, Wine-Quality, and California-Housing. Two open FMs (TabPFN-1.0 and TabICL-base) are compared against tuned XGBoost, LightGBM, and Random Forest baselines on a single NVIDIA T4 GPU. The tree ensembles equal or surpass FM accuracy on three datasets while completing full-test batches in ≤ 0.40 s and ≤ 150 MB RAM, using zero VRAM. TabICL achieves a 0.8 percentage-point gain on Higgs but requires roughly 40,000 times more latency (960 s) and 9 GB VRAM. TabPFN matches tree-model accuracy on Wine and Housing but peaks at 4 GB VRAM and cannot process the full 100k-row Higgs table. These results quantify the substantial hardware-versus-accuracy trade-offs in current tabular FMs and provide an open baseline for future efficiency-oriented research.
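Measuring latency and peak memory around a prediction call, as the benchmark reports, can be sketched with the Python standard library. This is not the authors' harness: `profile_call` is a hypothetical helper, and `tracemalloc` only tracks memory allocated through Python's allocator, so it does not capture GPU VRAM or memory allocated inside native libraries.

```python
import time
import tracemalloc


def profile_call(fn, *args, **kwargs):
    """Run fn once; return (result, wall-clock seconds, peak traced MB)."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    latency_s = time.perf_counter() - t0
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, latency_s, peak_bytes / 1e6


# Example with a toy workload standing in for model.predict(X_test).
result, latency_s, peak_mb = profile_call(lambda n: sum(list(range(n))), 100_000)
```

Wrapping each model's prediction call in the same helper keeps the comparison consistent across models, which is the point of reporting accuracy and resource cost side by side.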
[AI-92] Hybrid-DMKG: A Hybrid Reasoning Framework over Dynamic Multimodal Knowledge Graphs for Multimodal Multihop QA with Knowledge Editing AAAI2026
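[Quick read]: This paper addresses two weaknesses of current Multimodal Knowledge Editing (MKE) methods on multihop reasoning across text and images: models struggle to keep reasoning accurately over multihop factual chains spanning text and images after knowledge is updated, and they are not robust to visually rephrased inputs. To tackle these, the authors propose Hybrid-DMKG, a hybrid reasoning framework built on a Dynamic Multimodal Knowledge Graph (DMKG). It first uses a large language model to decompose a multimodal multihop question into sequential sub-questions; it then retrieves updated facts by jointly encoding each sub-question with candidate entities and their images; finally, it reasons along two parallel paths, relation linking prediction and Retrieval-Augmented Generation (RAG) reasoning with a large vision-language model, and a decision module aggregates the evidence to select the most credible answer. This significantly improves multihop reasoning accuracy and robustness after knowledge updates.
Link: https://arxiv.org/abs/2512.00881
Authors: Li Yuan, Qingfei Huang, Bingshan Zhu, Yi Cai, Qingbao Huang, Changmeng Zheng, Zikun Deng, Tao Wang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026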
Abstract:Multimodal Knowledge Editing (MKE) extends traditional knowledge editing to settings involving both textual and visual modalities. However, existing MKE benchmarks primarily assess final answer correctness while neglecting the quality of intermediate reasoning and robustness to visually rephrased inputs. To address this limitation, we introduce MMQAKE, the first benchmark for multimodal multihop question answering with knowledge editing. MMQAKE evaluates (1) a model’s ability to reason over 2-5-hop factual chains that span both text and images, including performance at each intermediate step, and (2) robustness to visually rephrased inputs in multihop questions. Our evaluation shows that current MKE methods often struggle to consistently update and reason over multimodal reasoning chains after knowledge edits. To overcome these challenges, we propose Hybrid-DMKG, a hybrid reasoning framework built on a dynamic multimodal knowledge graph (DMKG) to enable accurate multihop reasoning over updated multimodal knowledge. Hybrid-DMKG first uses a large language model to decompose multimodal multihop questions into sequential sub-questions, then applies a multimodal retrieval model to locate updated facts by jointly encoding each sub-question with candidate entities and their associated images. For answer inference, a hybrid reasoning module operates over the DMKG via two parallel paths: (1) relation linking prediction, and (2) RAG reasoning with large vision-language models. A decision module aggregates evidence from both paths to select the most credible answer. Experimental results on MMQAKE show that Hybrid-DMKG significantly outperforms existing MKE approaches, achieving higher accuracy and improved robustness to knowledge updates.
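[AI-93] HBLLM: Wavelet-Enhanced High-Fidelity 1-Bit Quantization for LLMs
【Quick Read】: This paper addresses the accuracy loss that Large Language Models (LLMs) suffer under 1-bit post-training quantization, that is, how to preserve model fidelity and performance at extremely low-bit storage. The key to the solution is a frequency decomposition based on the Haar wavelet transform, which increases the expressive capacity of the quantized representation, combined with two structure-aware grouping strategies: (1) frequency-aware multi-parameter intra-row grouping, which captures the characteristics of weights in different frequency bands; and (2) ℓ₂-norm-based saliency-driven column selection, which prioritizes important weights to improve quantization accuracy. In addition, non-salient weights share a mean across quantization groups within each frequency band to save storage. The method reaches a perplexity of 6.71 on LLaMA 2-13B at an average of only 1.08 bits per weight, outperforming existing 1-bit quantization methods.
Link: https://arxiv.org/abs/2512.00862
Authors: Ningning Chen, Weicai Ye, Ying Jiang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: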
Abstract:We introduce HBLLM, a wavelet-enhanced high-fidelity 1-bit post-training quantization method for Large Language Models (LLMs). By leveraging Haar wavelet transforms to enhance expressive capacity through frequency decomposition, HBLLM significantly improves quantization fidelity while maintaining minimal overhead. This approach features two innovative structure-aware grouping strategies: (1) frequency-aware multi-parameter intra-row grouping and (2) $\ell_2$-norm-based saliency-driven column selection. For non-salient weights, a shared mean is employed across quantization groups within each frequency band to optimize storage efficiency. Experiments conducted on the OPT and LLaMA models demonstrate that HBLLM achieves state-of-the-art performance in 1-bit quantization, attaining a perplexity of 6.71 on LLaMA 2-13B with an average weight storage of only 1.08 bits. Code available at: this https URL.
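To make the combination of Haar decomposition and 1-bit quantization concrete, here is a minimal toy sketch. It is not HBLLM's actual algorithm: the function names are hypothetical, only a one-level transform on a single row is shown, and the paper's intra-row grouping and saliency-driven column selection are omitted. Each frequency band is simply binarized to two values around its own mean.

```python
import math

def haar_forward(row):
    """One-level Haar transform of an even-length row: (low band, high band)."""
    s = math.sqrt(2.0)
    low = [(row[i] + row[i + 1]) / s for i in range(0, len(row), 2)]
    high = [(row[i] - row[i + 1]) / s for i in range(0, len(row), 2)]
    return low, high

def haar_inverse(low, high):
    """Exact inverse of haar_forward."""
    s = math.sqrt(2.0)
    row = []
    for a, d in zip(low, high):
        row.extend([(a + d) / s, (a - d) / s])
    return row

def binarize(band):
    """1-bit quantization (assumed form): mean +/- scale, chosen by sign(x - mean)."""
    mean = sum(band) / len(band)
    scale = sum(abs(x - mean) for x in band) / len(band)
    return [mean + (scale if x >= mean else -scale) for x in band]

def quantize_row(row):
    """Binarize each frequency band separately, then map back to weight space."""
    low, high = haar_forward(row)
    return haar_inverse(binarize(low), binarize(high))

# Hypothetical weights, for illustration only.
weights = [0.5, -1.0, 2.0, 0.25, -0.75, 0.1, 1.2, -0.3]
print(quantize_row(weights))
```

In this toy version each band costs one bit per coefficient plus a mean and a scale, which is why the average storage sits slightly above 1 bit per weight.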
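[AI-94] Topological Federated Clustering via Gravitational Potential Fields under Local Differential Privacy
【Quick Read】: This paper addresses the challenge of clustering non-independent and identically distributed (non-IID) data in federated settings under Local Differential Privacy (LDP): how to preserve both privacy and clustering accuracy without iterative communication. Existing one-shot methods rely on unstable pairwise centroid distances or neighborhood rankings, and their performance degrades sharply under strong LDP noise and data heterogeneity. The key to the solution is Gravitational Federated Clustering (GFC), which turns the clients' privatized centroids into a global gravitational potential field in which the true cluster centers emerge as topologically persistent singularities. The specific innovations are: (1) a client-side compactness-aware perturbation mechanism that encodes local cluster geometry as “mass” values; and (2) a server-side step that extracts stable centroids through persistent homology analysis of the potential field's superlevel sets. The theory gives a closed-form bound relating the privacy budget ϵ to the centroid estimation error and shows that the Lipschitz smoothing properties of the potential field exponentially suppress noise in high-density regions, yielding a better privacy-accuracy trade-off without iterative communication.
Link: https://arxiv.org/abs/2512.00849
Authors: Yunbo Long, Jiaquan Zhang, Xi Chen, Alexandra Brintrup
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: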
Abstract:Clustering non-independent and identically distributed (non-IID) data under local differential privacy (LDP) in federated settings presents a critical challenge: preserving privacy while maintaining accuracy without iterative communication. Existing one-shot methods rely on unstable pairwise centroid distances or neighborhood rankings, degrading severely under strong LDP noise and data heterogeneity. We present Gravitational Federated Clustering (GFC), a novel approach to privacy-preserving federated clustering that overcomes the limitations of distance-based methods under varying LDP. Addressing the critical challenge of clustering non-IID data with diverse privacy guarantees, GFC transforms privatized client centroids into a global gravitational potential field where true cluster centers emerge as topologically persistent singularities. Our framework introduces two key innovations: (1) a client-side compactness-aware perturbation mechanism that encodes local cluster geometry as “mass” values, and (2) a server-side topological aggregation phase that extracts stable centroids through persistent homology analysis of the potential field’s superlevel sets. Theoretically, we establish a closed-form bound between the privacy budget $\epsilon$ and centroid estimation error, proving the potential field’s Lipschitz smoothing properties exponentially suppress noise in high-density regions. Empirically, GFC outperforms state-of-the-art methods on ten benchmarks, especially under strong LDP constraints ($\epsilon < 1$), while maintaining comparable performance at lower privacy budgets. By reformulating federated clustering as a topological persistence problem in a synthetic physics-inspired space, GFC achieves unprecedented privacy-accuracy trade-offs without iterative communication, providing a new perspective for privacy-preserving distributed learning.
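The potential-field idea can be illustrated with a short sketch. Everything below is an assumption made for illustration: the softened inverse-distance kernel, the function names, and the use of a plain argmax over candidate points in place of the persistent homology analysis described in the abstract. No LDP noise is applied here either.

```python
import math

def potential(point, centroids, masses, softening=0.5):
    """Softened gravitational-style potential at `point` (larger means denser).

    The kernel m / sqrt(d^2 + softening^2) is an assumed form, not taken from
    the paper; softening keeps the value finite at d = 0.
    """
    total = 0.0
    for c, m in zip(centroids, masses):
        total += m / math.sqrt(math.dist(point, c) ** 2 + softening ** 2)
    return total

def densest_candidate(candidates, centroids, masses):
    """Pick the candidate with the highest potential (simplified stand-in
    for extracting persistent peaks of the superlevel sets)."""
    return max(candidates, key=lambda p: potential(p, centroids, masses))

# Hypothetical client centroids: three near the origin, one stray point.
centroids = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.1), (5.0, 5.0)]
masses = [1.0, 1.0, 1.0, 1.0]
print(densest_candidate(centroids, centroids, masses))
```

Because nearby centroids reinforce each other in the summed field while an isolated noisy centroid contributes little, peaks of the field are less sensitive to individual perturbed points than pairwise distances are.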
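[AI-95] ARCADIA: Scalable Causal Discovery for Corporate Bankruptcy Analysis Using Agentic AI
【Quick Read】: This paper addresses the difficulty traditional causal discovery algorithms have in building stable, interpretable, and temporally coherent causal structures in real-world high-stakes domains. The key to the solution is the ARCADIA framework, which combines Large Language Model (LLM) reasoning with statistical diagnostics and iteratively refines candidate Directed Acyclic Graphs (DAGs) through constraint-guided prompting and causal-validity feedback, producing causal models that are scientifically credible and intervention-ready.
Link: https://arxiv.org/abs/2512.00839
Authors: Fabrizio Maturo, Donato Riccio, Andrea Mazzitelli, Giuseppe Bifulco, Francesco Paolone, Iulia Brezeanu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computation (stat.CO); Methodology (stat.ME)
Comments: 35 pages, 9 figures, 4 tables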
Abstract:This paper introduces ARCADIA, an agentic AI framework for causal discovery that integrates large-language-model reasoning with statistical diagnostics to construct valid, temporally coherent causal structures. Unlike traditional algorithms, ARCADIA iteratively refines candidate DAGs through constraint-guided prompting and causal-validity feedback, leading to stable and interpretable models for real-world high-stakes domains. Experiments on corporate bankruptcy data show that ARCADIA produces more reliable causal graphs than NOTEARS, GOLEM, and DirectLiNGAM while offering a fully explainable, intervention-ready pipeline. The framework advances AI by demonstrating how agentic LLMs can participate in autonomous scientific modeling and structured causal inference.
[AI-96] Assessing model error in counterfactual worlds
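【Quick Read】: This paper addresses the lack of retrospective evaluation of scenario projections used in decision making, and in particular how to separate the contributions of scenario deviation and model miscalibration to the gap between projections and observations. The authors argue that model miscalibration is the key factor for assessing a model's value, but that estimating it requires working in counterfactual worlds. The core of the solution is to present and compare three approaches for estimating counterfactual error, to demonstrate the benefits and limitations of each in a simulation experiment, and to offer guidance on estimating counterfactual error, stressing that scenario design must include the components needed to make projections evaluable in retrospect.
Link: https://arxiv.org/abs/2512.00836
Authors: Emily Howerton, Justin Lessler
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: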
Abstract:Counterfactual scenario modeling exercises that ask “what would happen if?” are one of the most common ways we plan for the future. Despite their ubiquity in planning and decision making, scenario projections are rarely evaluated retrospectively. Differences between projections and observations come from two sources: scenario deviation and model miscalibration. We argue the latter is most important for assessing the value of models in decision making, but requires estimating model error in counterfactual worlds. Here we present and contrast three approaches for estimating this error, and demonstrate the benefits and limitations of each in a simulation experiment. We provide recommendations for the estimation of counterfactual error and discuss the components of scenario design that are required to make scenario projections evaluable.
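[AI-97] SemAgent: Semantic-Driven Agentic AI Empowered Trajectory Prediction in Vehicular Networks
【Quick Read】: This paper addresses two problems in vehicle-to-everything (V2X) environments: conventional communication schemes incur high transmission overhead and latency, and existing trajectory prediction models lack environmental perception and logical inference capabilities. The key to the solution is combining semantic communication with Agentic AI. In vehicle-to-infrastructure (V2I) scenarios, a feature-extraction agent and a semantic-analysis agent at the Roadside Unit (RSU) produce compact feature representations and semantic insights, which are sent to the target vehicle via semantic communication. In vehicle-to-vehicle (V2V) scenarios, each vehicle performs feature extraction and semantic analysis locally and combines the results with the predicted trajectories received from neighboring vehicles for joint inference. The framework significantly improves trajectory prediction accuracy across diverse communication conditions, with up to a 47.5% improvement under low signal-to-noise ratio (SNR) conditions.
Link: https://arxiv.org/abs/2512.00834
Authors: Lin Zhu, Kezhi Wang, Luping Xiang, Kun Yang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: Submitted for possible journal publication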
Abstract:Efficient information exchange and reliable contextual reasoning are essential for vehicle-to-everything (V2X) networks. Conventional communication schemes often incur significant transmission overhead and latency, while existing trajectory prediction models generally lack environmental perception and logical inference capabilities. This paper presents a trajectory prediction framework that integrates semantic communication with Agentic AI to enhance predictive performance in vehicular environments. In vehicle-to-infrastructure (V2I) communication, a feature-extraction agent at the Roadside Unit (RSU) derives compact representations from historical vehicle trajectories, followed by semantic reasoning performed by a semantic-analysis agent. The RSU then transmits both feature representations and semantic insights to the target vehicle via semantic communication, enabling the vehicle to predict future trajectories by combining received semantics with its own historical data. In vehicle-to-vehicle (V2V) communication, each vehicle performs local feature extraction and semantic analysis while receiving predicted trajectories from neighboring vehicles, and jointly utilizes this information for its own trajectory prediction. Extensive experiments across diverse communication conditions demonstrate that the proposed method significantly outperforms baseline schemes, achieving up to a 47.5% improvement in prediction accuracy under low signal-to-noise ratio (SNR) conditions.
[AI-98] Causal Invariance and Counterfactual Learning Driven Cooperative Game for Multi-Label Classification
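【Quick Read】: This paper addresses the challenges that label imbalance, spurious correlations, and distribution shifts pose for Multi-label Classification (MLC), in particular the degraded prediction performance on rare labels. The key to the solution is the Causal Cooperative Game (CCG) framework, which models MLC as a cooperative multi-player interaction, performs explicit causal discovery via Neural Structural Equation Models, and introduces a counterfactual curiosity reward to drive robust feature learning. It also adds a causal invariance loss to improve generalization across environments and a dedicated enhancement strategy for rare labels, leading to more accurate and interpretable multi-label prediction.
Link: https://arxiv.org/abs/2512.00812
Authors: Yijia Fan, Jusheng Zhang, Kaitong Cai, Jing Yang, Keze Wang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: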
Abstract:Multi-label classification (MLC) remains vulnerable to label imbalance, spurious correlations, and distribution shifts, challenges that are particularly detrimental to rare label prediction. To address these limitations, we introduce the Causal Cooperative Game (CCG) framework, which conceptualizes MLC as a cooperative multi-player interaction. CCG unifies explicit causal discovery via Neural Structural Equation Models with a counterfactual curiosity reward to drive robust feature learning. Furthermore, it incorporates a causal invariance loss to ensure generalization across diverse environments, complemented by a specialized enhancement strategy for rare labels. Extensive benchmarking demonstrates that CCG substantially outperforms strong baselines in both rare label prediction and overall robustness. Through rigorous ablation studies and qualitative analysis, we validate the efficacy and interpretability of our components, underscoring the potential of synergizing causal inference with cooperative game theory for advancing multi-label learning.
[AI-99] BioPro: On Difference-Aware Gender Fairness for Vision-Language Models
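【Quick Read】: This paper addresses gender bias in Vision-Language Models (VLMs), focusing on how, in image captioning and text-to-image generation, to distinguish neutral contexts from contexts where group differences must be preserved, so as to achieve selective fairness. Conventional fairness interventions usually apply uniform, undifferentiated treatment and ignore that gender attributes are legitimate in some contexts, which leads either to over-debiasing or to bias being left in place. The key solution is BioPro (Bias Orthogonal Projection), an entirely training-free framework. It identifies a low-dimensional gender-variation subspace through counterfactual embeddings and applies an orthogonal projection with respect to that subspace, selectively removing gender-related information in neutral contexts while keeping legitimate gender attributes in explicit ones. Experiments show that BioPro reduces gender bias in neutral cases while maintaining gender faithfulness in explicit cases, and that it extends to continuous bias variables such as scene brightness.
Link: https://arxiv.org/abs/2512.00807
Authors: Yujie Lin, Jiayao Ma, Qingguo Hu, Derek F. Wong, Jinsong Su
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: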
Abstract:Vision-Language Models (VLMs) inherit significant social biases from their training data, notably in gender representation. Current fairness interventions often adopt a difference-unaware perspective that enforces uniform treatment across demographic groups. These approaches, however, fail to distinguish between contexts where neutrality is required and those where group-specific attributes are legitimate and must be preserved. Building upon recent advances in difference-aware fairness for text-only models, we extend this concept to the multimodal domain and formalize the problem of difference-aware gender fairness for image captioning and text-to-image generation. We advocate for selective debiasing, which aims to mitigate unwanted bias in neutral contexts while preserving valid distinctions in explicit ones. To achieve this, we propose BioPro (Bias Orthogonal Projection), an entirely training-free framework. BioPro identifies a low-dimensional gender-variation subspace through counterfactual embeddings and applies projection to selectively neutralize gender-related information. Experiments show that BioPro effectively reduces gender bias in neutral cases while maintaining gender faithfulness in explicit ones, thus providing a promising direction toward achieving selective fairness in VLMs. Beyond gender bias, we further demonstrate that BioPro can effectively generalize to continuous bias variables, such as scene brightness, highlighting its broader applicability.
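The projection step can be sketched in a few lines. This is an illustration under stated assumptions rather than BioPro's implementation: the subspace is taken to be one-dimensional (the mean difference of counterfactual embedding pairs), the function names are hypothetical, and the decision of when to apply the projection (neutral versus explicit contexts) is left out.

```python
import math

def bias_direction(pairs):
    """Unit vector along the mean difference of counterfactual embedding pairs.

    Assumes a 1-D bias subspace; a low-dimensional subspace with several
    directions would need an orthonormal basis instead.
    """
    dim = len(pairs[0][0])
    diff = [0.0] * dim
    for a, b in pairs:
        for i in range(dim):
            diff[i] += (a[i] - b[i]) / len(pairs)
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def project_out(x, u):
    """Remove the component of x along unit vector u: x - (x . u) u."""
    dot = sum(xi * ui for xi, ui in zip(x, u))
    return [xi - dot * ui for xi, ui in zip(x, u)]

# Hypothetical embeddings of counterfactual caption pairs.
pairs = [([1.0, 0.0, 2.0], [-1.0, 0.0, 2.0]),
         ([2.0, 1.0, 0.0], [0.0, 1.0, 0.0])]
u = bias_direction(pairs)
print(project_out([3.0, 4.0, 5.0], u))
```

After projection the embedding has no component along the estimated bias direction, while everything orthogonal to it is left unchanged, which is what makes the approach training-free.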
[AI-100] Bias Injection Attacks on RAG Databases and Sanitization Defenses
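【Quick Read】: This paper addresses the ideological shift in Large Language Model (LLM) outputs that arises when biased information is maliciously injected into the knowledge base of a Retrieval-Augmented Generation (RAG) system. Conventional knowledge poisoning attacks mainly inject false or toxic content, which fact-checking readily detects. This paper reveals a more covert bias injection attack, which inserts factually correct yet slanted passages into the knowledge base so as to systematically crowd out opposing views and steer the LLM toward the attacker's intended stance. The key to the solution is BiasDef, a post-retrieval filtering defense that identifies and filters such biased passages; it reduces the adversarial passages retrieved by 15%, mitigates perspective shift in answers by 6.2×, and enables retrieval of 62% more benign passages.
Link: https://arxiv.org/abs/2512.00804
Authors: Hao Wu, Prateek Saxena
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: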
Abstract:This paper explores attacks and defenses on vector databases in retrieval-augmented generation (RAG) systems. Prior work on knowledge poisoning attacks primarily injects false or toxic content, which fact-checking or linguistic analysis easily detects. We reveal a new and subtle threat: bias injection attacks, which insert factually correct yet semantically biased passages into the knowledge base to covertly influence the ideological framing of answers generated by large language models (LLMs). We demonstrate that these adversarial passages, though linguistically coherent and truthful, can systematically crowd out opposing views from the retrieved context and steer LLM answers toward the attacker’s intended perspective. We precisely characterize this class of attacks and then develop a post-retrieval filtering defense, BiasDef. We construct a comprehensive benchmark based on public question answering datasets to evaluate them. Our results show that: (1) the proposed attack induces significant perspective shifts in LLM answers, effectively evading existing retrieval-based sanitization defenses; and (2) BiasDef outperforms existing methods by reducing adversarial passages retrieved by 15%, which mitigates perspective shift by $6.2\times$ in answers, while enabling the retrieval of 62% more benign passages.
[AI-101] Limitations of Using Identical Distributions for Training and Testing When Learning Boolean Functions
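【Quick Read】: This paper studies generalization when the training and test distributions differ. The central question is whether the conventional assumption, that matching the training distribution to the test distribution gives the best performance, always holds. The key finding is that, assuming one-way functions exist, identical training and test distributions are not always optimal, which contrasts with the behavior of most learning methods. However, when certain regularity conditions are imposed on the target functions, the standard conclusion is recovered for the uniform distribution: matching the training distribution to the test distribution is again optimal.
Link: https://arxiv.org/abs/2512.00791
Authors: Jordi Pérez-Guijarro
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: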
Abstract:When the distributions of the training and test data do not coincide, the problem of understanding generalization becomes considerably more complex, prompting a variety of questions. In this work, we focus on a fundamental one: Is it always optimal for the training distribution to be identical to the test distribution? Surprisingly, assuming the existence of one-way functions, we find that the answer is no. That is, matching distributions is not always the best scenario, which contrasts with the behavior of most learning methods. Nonetheless, we also show that when certain regularities are imposed on the target functions, the standard conclusion is recovered in the case of the uniform distribution.
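[AI-102] SHRAG: A Framework for Combining Human-Inspired Search with RAG
【Quick Read】: This paper addresses two practical challenges for Retrieval-Augmented Generation (RAG) systems: building a high-quality retrieval module requires specialized expertise, and RAG is slower than pure retrieval systems because it includes both a retrieval stage and a generation stage. The key to the solution is the SHRAG framework, which uses a Large Language Model (LLM) as a Query Strategist to automatically turn natural language queries into logically structured search queries, and then runs Boolean retrieval to emulate the search process of an expert human searcher. It also adds multilingual query expansion and a multilingual embedding model to support cross-lingual question answering. This design improves retrieval precision and generation reliability and points toward a shift from document-centric retrieval to directly answering queries.
Link: https://arxiv.org/abs/2512.00772
Authors: Hyunseok Ryu, Wonjune Shin, Hyun Park
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments: 10 pages, 4 figures, 1 table, 1 algorithm, 3 prompts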
Abstract:Retrieval-Augmented Generation (RAG) is gaining recognition as one of the key technological axes for next generation information retrieval, owing to its ability to mitigate the hallucination phenomenon in Large Language Models (LLMs) and effectively incorporate up-to-date information. However, specialized expertise is necessary to construct a high-quality retrieval system independently; moreover, RAG demonstrates relatively slower processing speeds compared to conventional pure retrieval systems because it involves both retrieval and generation stages. Accordingly, this study proposes SHRAG, a novel framework designed to facilitate the seamless integration of Information Retrieval and RAG while simultaneously securing precise retrieval performance. SHRAG utilizes a Large Language Model as a Query Strategist to automatically transform unstructured natural language queries into logically structured search queries, subsequently performing Boolean retrieval to emulate the search process of an expert human searcher. Furthermore, it incorporates multilingual query expansion and a multilingual embedding model, enabling it to perform efficient cross-lingual question answering within the multilingual dataset environment of the ScienceON Challenge. Experimental results demonstrate that the proposed method, combining logical retrieval capabilities and generative reasoning, can significantly enhance the accuracy and reliability of RAG systems. Furthermore, SHRAG moves beyond conventional document-centric retrieval methods, presenting the potential for a new search paradigm capable of providing direct and reliable responses to queries.
Submission history: From: Hyunseok Ryu, [v1] Sun, 30 Nov 2025 08:06:47 UTC (935 KB)
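The Boolean retrieval stage that SHRAG's abstract describes can be pictured with a small sketch. The query representation (nested tuples with "AND", "OR", "NOT"), the whitespace tokenizer, and all function names are assumptions made for illustration; in SHRAG the structured query would come from the LLM Query Strategist, which is not modeled here.

```python
def build_index(docs):
    """Inverted index: lowercase token -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for token in set(text.lower().split()):
            index.setdefault(token, set()).add(doc_id)
    return index

def evaluate(query, index, all_ids):
    """Evaluate a nested Boolean query against the index.

    A query is either a term (str) or a tuple (op, arg1, arg2, ...),
    where op is "AND", "OR", or "NOT" (NOT takes one argument).
    """
    if isinstance(query, str):
        return set(index.get(query.lower(), set()))
    op, *args = query
    results = [evaluate(a, index, all_ids) for a in args]
    if op == "AND":
        return set.intersection(*results)
    if op == "OR":
        return set.union(*results)
    if op == "NOT":
        return all_ids - results[0]
    raise ValueError(f"unknown operator: {op}")

# Hypothetical corpus and a query of the kind an LLM might emit.
docs = {1: "wind farm energy storage",
        2: "solar energy storage",
        3: "wind turbine noise"}
index = build_index(docs)
query = ("AND", "energy", ("OR", "wind", "solar"), ("NOT", "turbine"))
print(evaluate(query, index, set(docs)))
```

Set operations on an inverted index make this stage cheap compared with generation, which is consistent with the abstract's emphasis on retrieval precision alongside speed.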
[AI-103] Provable Benefit of Sign Descent: A Minimal Model Under Heavy-Tailed Class Imbalance
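【Quick Read】: This paper asks why adaptive optimization methods (such as Adam) significantly outperform Gradient Descent (GD) in Large Language Model (LLM) pretraining, and how that advantage can be explained theoretically from properties of the data distribution, specifically the heavy-tailed class imbalance found in language modelling. The key to the solution is a minimal yet representative next-token prediction setting in which the authors prove that, under heavy-tailed class imbalance, coordinate-wise algorithms such as sign descent (steepest descent w.r.t. the ℓ∞ norm) converge faster than normalized GD (steepest descent w.r.t. the ℓ₂ norm), showing that the benefit stems from structural features of the data distribution.
Link: https://arxiv.org/abs/2512.00763
Authors: Robin Yadav, Shuo Xie, Tianhao Wang, Zhiyuan Li
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: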
Abstract:Adaptive optimization methods (such as Adam) play a major role in LLM pretraining, significantly outperforming Gradient Descent (GD). Recent studies have proposed new smoothness assumptions on the loss function to explain the advantages over GD of adaptive algorithms with structured preconditioners (e.g., coordinate-wise or layer-wise) and of steepest descent methods w.r.t. non-Euclidean norms (e.g., the $\ell_\infty$ norm or the spectral norm). However, it remains unclear how these smoothness assumptions manifest in language modelling tasks. In this work, we aim to analyze the benefit of $\ell_\infty$-norm descent (a.k.a. sign descent) directly from properties of the data distribution, namely, heavy-tailed class imbalance. We propose a minimal yet representative setting of next-token prediction, where we can provably show faster convergence of coordinate-wise algorithms such as sign descent (steepest descent w.r.t. the $\ell_\infty$ norm) over normalized GD (steepest descent w.r.t. the $\ell_2$ norm) in the presence of heavy-tailed class imbalance.
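The two update rules compared in the abstract differ in a way that a tiny example makes visible. The sketch below is only an illustration of the update rules, not of the paper's setting or proof; the function names and the toy gradient are assumptions, with one large coordinate standing in for a frequent class and one small coordinate for a rare class.

```python
import math

def sign_descent_step(w, grad, lr):
    """Steepest descent w.r.t. the l_inf norm: each coordinate moves by lr * sign(g)."""
    return [wi - lr * ((g > 0) - (g < 0)) for wi, g in zip(w, grad)]

def normalized_gd_step(w, grad, lr):
    """Steepest descent w.r.t. the l_2 norm: move along g / ||g||_2."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(w)
    return [wi - lr * g / norm for wi, g in zip(w, grad)]

# Toy gradient: a frequent-class coordinate dominates a rare-class one.
w = [0.0, 0.0]
grad = [10.0, 0.01]
print(sign_descent_step(w, grad, lr=0.1))
print(normalized_gd_step(w, grad, lr=0.1))
```

Under sign descent both coordinates move by the full step size, whereas under normalized GD the small-gradient coordinate barely moves because the large coordinate dominates the norm. This is the intuition behind faster progress on rare classes, stated informally here.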
[AI-104] Preventing Model Collapse via Contraction-Conditioned Neural Filters
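【Quick Read】: This paper addresses model collapse in the recursive training of generative models. Prior work such as \cite{xu2024probabilistic} relies on superlinear sample growth ($O(t^{1+s})$) to mitigate the problem, which is hard to achieve in practice. The paper proposes a neural network filter based on contraction operators. Its core innovation is a learnable neural filter that, within an unbiased estimation framework, completely removes the dependence on growing sample sizes. The key is a specialized network architecture and loss function that let the filter actively learn contraction conditions satisfying Assumption 2.3 for exponential family distributions, so that even with a fixed sample size the estimation error converges in probability, i.e., $\limsup_{t\to\infty}\mathbb{P}(\|\mathbf{e}_t\|>\delta)=0$ for any $\delta>0$. Experiments confirm that the method learns the contraction conditions and prevents model collapse in fixed-sample settings, providing an end-to-end practical solution for recursive training of generative models.
Link: https://arxiv.org/abs/2512.00757
Authors: Zongjian Han, Yiran Liang, Ruiwen Wang, Yiwei Luo, Yilin Huang, Xiaotong Song, Dongqing Wei
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: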
Abstract:This paper presents a neural network filter method based on contraction operators to address model collapse in recursive training of generative models. Unlike \cite{xu2024probabilistic}, which requires superlinear sample growth ($O(t^{1+s})$), our approach completely eliminates the dependence on increasing sample sizes within an unbiased estimation framework by designing a neural filter that learns to satisfy contraction conditions. We develop specialized neural network architectures and loss functions that enable the filter to actively learn contraction conditions satisfying Assumption 2.3 in exponential family distributions, thereby ensuring practical application of our theoretical results. Theoretical analysis demonstrates that when the learned contraction conditions are satisfied, estimation errors converge probabilistically even with constant sample sizes, i.e., $\limsup_{t\to\infty}\mathbb{P}(\|\mathbf{e}_t\|>\delta)=0$ for any $\delta>0$. Experimental results show that our neural network filter effectively learns contraction conditions and prevents model collapse under fixed sample size settings, providing an end-to-end solution for practical applications.
[AI-105] MPR-GUI: Benchmarking and Enhancing Multilingual Perception and Reasoning in GUI Agents
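【Quick Read】: This paper addresses the marked drop in Perception and Reasoning (PR) performance of Large Vision-Language Models (LVLMs) on multilingual Graphical User Interface (GUI) tasks, as well as the lack of fine-grained analysis in existing work, such as widget functions and the spatial relationships between elements. The key to the solution is GUI-XLI, a cross-lingual intervention method that intervenes on the hidden states at PR capability-related layers to reduce the differences in latent space between inputs in different languages, thereby improving LVLMs' multilingual PR capability in non-English GUI scenarios. Experiments show an average improvement of 6.5%.
Link: https://arxiv.org/abs/2512.00756
Authors: Ruihan Chen, Qiming Li, Xiaocheng Feng, Xiaoliang Yang, Weihong Zhong, Yuxuan Gu, Zekun Zhou, Bing Qin
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 27 pages, 12 figures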
Abstract:With the advancement of computational resources, Large Vision-Language Models (LVLMs) exhibit impressive Perception and Reasoning (PR) performance on Graphical User Interface (GUI) tasks. However, although they demonstrate strong PR capabilities in English GUI scenarios, their performance in multilingual settings has received little attention, which limits their global applications. Moreover, existing studies on GUI tasks lack fine-grained analyses, including widget functions and elements’ spatial relationships, which are fundamental for more targeted improvements. To tackle these issues, we propose MPR-GUI-Bench, a Multilingual fine-grained Perception and Reasoning GUI Benchmark to evaluate GUI agents’ PR capabilities. Evaluation results demonstrate that LVLMs exhibit significantly worse PR performance in non-English languages than in English. To address these gaps, we propose GUI-XLI, a GUI Cross-Lingual Intervention method that applies interventions to the hidden states at PR capability-related layers to mitigate the gaps between English and other languages, building on previous research showing that the hidden states of different language inputs exhibit significant differences in the latent space. Experimental results indicate that our method improves GUI agents’ multilingual PR capability by 6.5% on average.
[AI-106] On the Regulatory Potential of User Interfaces for AI Agent Governance NEURIPS2025
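【Quick Read】: This paper addresses the potential risks posed by AI agents that act autonomously over extended time horizons. Existing governance proposals mainly target system-level safeguards (e.g., prompt injection monitors) or agent infrastructure (e.g., agent IDs), and do not fully cover the transparency and behavioral requirements of human-agent interaction. The key to the solution is a complementary governance path: regulating the User Interface (UI) of AI agents to enforce transparency and behavioral requirements, which in turn demand changes at the system and infrastructure levels. By analyzing UI elements in 22 existing agentic systems, the authors distill six interaction design patterns with regulatory potential (e.g., requiring agent memory to be editable), offering a new entry point for regulating AI agents.
Link: https://arxiv.org/abs/2512.00742
Authors: K. J. Kevin Feng, Tae Soo Kim, Rock Yuren Pang, Faria Huq, Tal August, Amy X. Zhang
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: RegML workshop at NeurIPS 2025 (oral)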
Abstract:AI agents that take actions in their environment autonomously over extended time horizons require robust governance interventions to curb their potentially consequential risks. Prior proposals for governing AI agents primarily target system-level safeguards (e.g., prompt injection monitors) or agent infrastructure (e.g., agent IDs). In this work, we explore a complementary approach: regulating user interfaces of AI agents as a way of enforcing transparency and behavioral requirements that then demand changes at the system and/or infrastructure levels. Specifically, we analyze 22 existing agentic systems to identify UI elements that play key roles in human-agent interaction and communication. We then synthesize those elements into six high-level interaction design patterns that hold regulatory potential (e.g., requiring agent memory to be editable). We conclude with policy recommendations based on our analysis. Our work exposes a new surface for regulatory action that supplements previous proposals for practical AI agent governance.
[AI-107] MASCOT: Analyzing Malware Evolution Through A Well-Curated Source Code Dataset
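【Quick Read】: This paper addresses the difficulty of characterizing current malware evolution trends and the lack of intuitive tools for untangling the complex connections among malware specimens. The key solution is a manually reviewed malware source code dataset of 6032 specimens and, built on it, a multi-view genealogy analysis: at the overall level it quantifies the strength and direction of connections among specimens and categories, and at the detailed level it traces the evolutionary history of individual specimens. This makes lineage expansion and evolution driven by code reuse visible and provides new evidence and tools for understanding how the malware ecosystem forms and evolves.
Link: https://arxiv.org/abs/2512.00741
Authors: Bojing Li, Duo Zhong, Dharani Nadendla, Gabriel Terceros, Prajna Bhandar, Raguvir S, Charles Nicholas
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 11 pages, 6 figures, conference paper; submitted to IEEE BigData 2025 CyberHunt workshop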
Abstract:In recent years, the explosion of malware and extensive code reuse have formed complex evolutionary connections among malware specimens. The rapid pace of development makes it challenging for existing studies to characterize recent evolutionary trends. In addition, intuitive tools to untangle these intricate connections between malware specimens or categories are urgently needed. This paper introduces a manually-reviewed malware source code dataset containing 6032 specimens. Building on and extending current research from a software engineering perspective, we systematically evaluate the scale, development costs, code quality, as well as security and dependencies of modern malware. We further introduce a multi-view genealogy analysis to clarify malware connections: at an overall view, this analysis quantifies the strength and direction of connections among specimens and categories; at a detailed view, it traces the evolutionary histories of individual specimens. Experimental results indicate that, despite persistent shortcomings in code quality, malware specimens exhibit an increasing complexity and standardization, in step with the development of mainstream software engineering practices. Meanwhile, our genealogy analysis intuitively reveals lineage expansion and evolution driven by code reuse, providing new evidence and tools for understanding the formation and evolution of the malware ecosystem.
[AI-108] Deep Learning for Modeling and Dispatching Hybrid Wind Farm Power Generation ICDM
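【Quick Read】: This paper addresses the loss of energy value caused by fluctuating wind power output when wind farms operate on the grid, focusing on hybrid wind farms with integrated energy storage and on how data-driven dispatch strategies can improve their economics and operating efficiency. The key to the solution is two deep learning frameworks: COVE-NN, a Long Short-Term Memory (LSTM)-based dispatch model that uses localized grid demand and market conditions as inputs to dispatch energy for an individual wind farm; and a power generation modeling framework that produces synthetic generation data driven by atmospheric conditions to make dispatch strategies more robust. In case studies, COVE-NN reduced annual COVE by 32.3% over 43 years of simulated operations at the Pyron site, and the modeling framework reduced RMSE by 9.5% when validated on the Palouse wind farm.
Link: https://arxiv.org/abs/2512.00728
Authors: Zach Lawrence, Jessica Yao, Chris Qin
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 8 figures, to be published in 2025 IEEE International Conference on Data Mining Workshops (ICDMW)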
Abstract:Wind farms with integrated energy storage, or hybrid wind farms, are able to store energy and dispatch it to the grid following an operational strategy. For individual wind farms with integrated energy storage capacity, data-driven dispatch strategies using localized grid demand and market conditions as input parameters stand to maximize wind energy value. Synthetic power generation data modeled on atmospheric conditions provide another avenue for improving the robustness of data-driven dispatch strategies. To these ends, the present work develops two deep learning frameworks: COVE-NN, an LSTM-based dispatch strategy tailored to individual wind farms, which reduced annual COVE by 32.3% over 43 years of simulated operations in a case study at the Pyron site; and a power generation modeling framework that reduced RMSE by 9.5% and improved power curve similarity by 18.9% when validated on the Palouse wind farm. Together, these models pave the way for more robust, data-driven dispatch strategies and potential extensions to other renewable energy systems.
[AI-109] SpeContext: Enabling Efficient Long-context Reasoning with Speculative Context Sparsity in LLMs ASPLOS2026
[Quick read]: This paper addresses the inefficiency of retrieval algorithms in long-context reasoning, where the core challenge is raising throughput and lowering resource use without sacrificing model accuracy. The key to the solution is SpeContext, an algorithm and system co-design. At the algorithm level, a lightweight retrieval head built on the head-level attention weights of a distilled language model (DLM) prunes redundancy and cuts parameters by 90%. At the system level, an elastic loading strategy drives an asynchronous prefetch dataflow that overlaps KV cache retrieval with large language model (LLM) computation. At the compilation level, a theoretical memory model and an adaptive memory management system maximize GPU memory utilization. In two resource-constrained settings, SpeContext achieves up to 24.89x higher throughput in the cloud and up to a 10.06x speedup at the edge, with negligible accuracy loss.
Link: https://arxiv.org/abs/2512.00722
Authors: Jiaming Xu,Jiayi Pan,Hanzhen Wang,Yongkang Zhou,Jiancai Ye,Yu Wang,Guohao Dai
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by ASPLOS 2026
Abstract:In this paper, we point out that the objective of the retrieval algorithms is to align with the LLM, which is similar to the objective of knowledge distillation in LLMs. We analyze the similarity in information focus between the distilled language model (DLM) and the original LLM from the perspective of information theory, and thus propose a novel paradigm that leverages a DLM as the retrieval algorithm. Based on this insight, we present SpeContext, an algorithm and system co-design for long-context reasoning. (1) At the algorithm level, SpeContext proposes a lightweight retrieval head based on the head-level attention weights of the DLM, achieving a 90% parameter reduction by pruning the redundancy. (2) At the system level, SpeContext designs an asynchronous prefetch dataflow via the elastic loading strategy, effectively overlapping KV cache retrieval with the LLM computation. (3) At the compilation level, SpeContext constructs the theoretical memory model and implements an adaptive memory management system to achieve acceleration by maximizing GPU memory utilization. We deploy and evaluate SpeContext in two resource-constrained environments, cloud and edge. Extensive experiments show that, compared with the Huggingface framework, SpeContext achieves up to 24.89x throughput improvement in cloud and 10.06x speedup in edge with negligible accuracy loss, pushing the Pareto frontier of accuracy and throughput.
[AI-110] Graph Data Augmentation with Contrastive Learning on Covariate Distribution Shift
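[Quick read]: This paper addresses the weak generalization of graph neural networks (GNNs) under covariate distribution shift, which is common in real-world graph data with complex structures. Existing methods often fail to exploit the rich information in the latent space, so models adapt poorly to out-of-distribution (OOD) samples. The key to the solution is a new method called MPAIACL (More Powerful Adversarial Invariant Augmentation using Contrastive Learning). Its core idea is to use contrastive learning to draw out the intrinsic information in vector representations, improving robustness and generalization across distributions.
Link: https://arxiv.org/abs/2512.00716
Authors: Fanlong Zeng,Wensheng Gan
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 tables, 8 figures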
Abstract:Covariate distribution shift occurs when certain structural features present in the test set are absent from the training set. It is a common type of out-of-distribution (OOD) problem, frequently encountered in real-world graph data with complex structures. Existing research has revealed that most out-of-the-box graph neural networks (GNNs) fail to account for covariate shifts. Furthermore, we observe that existing methods aimed at addressing covariate shifts often fail to fully leverage the rich information contained within the latent space. Motivated by the potential of the latent space, we introduce a new method called MPAIACL for More Powerful Adversarial Invariant Augmentation using Contrastive Learning. MPAIACL leverages contrastive learning to unlock the full potential of vector representations by harnessing their intrinsic information. Through extensive experiments, MPAIACL demonstrates its robust generalization and effectiveness, as it performs well compared with other baselines across various public OOD datasets. The code is publicly available at this https URL.
[AI-111] Concept-Guided Backdoor Attack on Vision Language Models
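[Quick read]: This paper addresses security vulnerabilities of Vision-Language Models (VLMs) in deployment, in particular the limited stealthiness and robustness of existing backdoor attacks. Conventional attacks rely on pixel-level triggers or imperceptible image perturbations, which image-based defenses can detect. The authors propose a concept-guided backdoor attack paradigm that moves the attack from the pixel level to the semantic concept level. The first attack, Concept-Thresholding Poisoning (CTP), uses explicit semantic concepts in natural images as triggers and injects malicious outputs only when the target concept appears. The second, CBL-Guided Unseen Backdoor (CGUB), uses a Concept Bottleneck Model (CBM) to intervene on internal concept activations during training and discards the CBM branch at inference so the VLM itself is unchanged, which enables label replacements that never appear in the training data (for example, replacing "cat" with "dog"). Both attacks achieve high attack success rates with moderate impact on clean-task performance, showing that the semantic concept level is a new attack surface for VLMs.
Link: https://arxiv.org/abs/2512.00713
Authors: Haoyu Shen,Weimin Lyu,Haotian Xu,Tengfei Ma
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: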
Abstract:Vision-Language Models (VLMs) have achieved impressive progress in multimodal text generation, yet their rapid adoption raises increasing concerns about security vulnerabilities. Existing backdoor attacks against VLMs primarily rely on explicit pixel-level triggers or imperceptible perturbations injected into images. While effective, these approaches reduce stealthiness and remain vulnerable to image-based defenses. We introduce concept-guided backdoor attacks, a new paradigm that operates at the semantic concept level rather than on raw pixels. We propose two different attacks. The first, Concept-Thresholding Poisoning (CTP), uses explicit concepts in natural images as triggers: only samples containing the target concept are poisoned, causing the model to behave normally in all other cases but consistently inject malicious outputs whenever the concept appears. The second, CBL-Guided Unseen Backdoor (CGUB), leverages a Concept Bottleneck Model (CBM) during training to intervene on internal concept activations, while discarding the CBM branch at inference time to keep the VLM unchanged. This design enables systematic replacement of a targeted label in generated text (for example, replacing “cat” with “dog”), even when the replacement behavior never appears in the training data. Experiments across multiple VLM architectures and datasets show that both CTP and CGUB achieve high attack success rates while maintaining moderate impact on clean-task performance. These findings highlight concept-level vulnerabilities as a critical new attack surface for VLMs.
[AI-112] When Human Preferences Flip: An Instance-Dependent Robust Loss for RLHF AAAI-26
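[Quick read]: This paper addresses annotation corruption caused by preference flipping in human feedback during large language model (LLM) alignment, with the goal of making alignment algorithms more robust to potentially mislabeled pairs. The key to the solution is Flipping-Aware Direct Preference Optimization (FA-DPO). Taking a reinforcement learning from human feedback (RLHF) perspective, it separates the human intention model and the preference flipping mechanism introduced by external factors into two stages, and adds an instance-dependent flipping probability on top of the Bradley-Terry (BT) model. It also uses features relevant to preference annotation to capture uncertainty in judgments and model flipping patterns, and provides an efficient iterative optimization algorithm compatible with the original RLHF and DPO algorithms.
Link: https://arxiv.org/abs/2512.00709
Authors: Yifan Xu,Xichen Ye,Yifan Chen,Qiaosheng Zhang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI-26-AIA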
Abstract:Quality of datasets plays an important role in large language model (LLM) alignment. In collecting human feedback, however, preference flipping is ubiquitous and causes corruption in data annotation; the issue necessitates the alignment algorithms with improved robustness against potential flipped pairs. To this end, this paper introduces a Flipping-Aware Direct Preference Optimization (FA-DPO) algorithm tailored to preference flipping from a reinforcement learning with human feedback (RLHF) perspective. We dissect the inherent human intention model and the preference flipping mechanism introduced by external factors as two distinct stages; in the latter, we introduce an instance-dependent flipping probability on the basis of the Bradley-Terry (BT) model. Further, by leveraging features relevant to preference annotation, we capture uncertainty in judgments and model preference flipping patterns. In practice, we design a simple yet efficient iterative optimization algorithm compatible with the original RLHF and DPO algorithms. In our experiments, we investigate the instance-dependent preference flipping model under multiple circumstances for evaluation of our proposed method, as well as other baseline methods.
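One common way to fold a flipping probability into a Bradley-Terry likelihood is a two-outcome label-noise mixture, sketched below. The abstract does not state FA-DPO's exact loss, so this is an assumption rather than the authors' formulation; `flip_aware_nll`, `reward_margin`, and `flip_prob` are hypothetical names, and in FA-DPO the flipping probability would be predicted per instance instead of being passed in as a constant.

```python
import math


def sigmoid(x):
    """Logistic function used by the Bradley-Terry model."""
    return 1.0 / (1.0 + math.exp(-x))


def flip_aware_nll(reward_margin, flip_prob):
    """Negative log-likelihood of an observed preference pair.

    reward_margin: score of the labeled winner minus score of the labeled loser.
    flip_prob: assumed probability that the annotation was flipped (0 <= flip_prob < 1).
    """
    p_clean = sigmoid(reward_margin)  # BT probability that the label matches intent
    # Observed label is correct with prob (1 - flip_prob), flipped otherwise.
    p_observed = (1.0 - flip_prob) * p_clean + flip_prob * (1.0 - p_clean)
    return -math.log(p_observed)
```

With `flip_prob = 0` this reduces to the ordinary Bradley-Terry (DPO-style) log-loss; as `flip_prob` grows, pairs that contradict the current model are penalized less, which is the intuition behind robustness to flipped labels.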
[AI-113] Model of human cognition
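[Quick read]: This paper addresses three bottlenecks of large language models (LLMs): limited explainability, the lack of a unifying theory, and high operating costs. The key to the solution is a neuro-theoretical framework that explains how cognitive processes such as decision-making and problem solving emerge in intelligent systems. The framework is both functionally robust and biologically plausible, and it offers a computationally efficient route to explainable and generalizable artificial intelligence.
Link: https://arxiv.org/abs/2512.00683
Authors: Wu Yonggang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: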
Abstract:The development of large language models (LLMs) is limited by a lack of explainability, the absence of a unifying theory, and prohibitive operational costs. We propose a neuro-theoretical framework for the emergence of intelligence in systems that is both functionally robust and biologically plausible. The model provides theoretical insights into cognitive processes such as decision-making and problem solving, and a computationally efficient approach for the creation of explainable and generalizable artificial intelligence.
[AI-114] ML-Tool-Bench: Tool-Augmented Planning for ML Tasks
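[Quick read]: This paper addresses the performance bottleneck that tool-augmented machine learning (ML) agents hit in end-to-end data science workflows because they lack sophisticated planning ability. Existing tool-use benchmarks focus on task-specific tool selection or argument extraction and do not evaluate the planning and iteration needed for multi-step pipelines such as feature engineering, model selection, and hyperparameter optimization. The solution has two key parts: 1) a comprehensive benchmark with 61 specialized tools and 15 Kaggle tabular challenges that supports naming, saving, and retrieving intermediate results; and 2) two simple improvements, namely shaped deterministic rewards with structured textual feedback and decomposition of the original problem into a sequence of sub-tasks, which markedly improve trajectory validity and task performance. With GPT-4o, the approach improves over standard ReAct by 16.52 percentile positions, taking the median across all Kaggle challenges.
Link: https://arxiv.org/abs/2512.00672
Authors: Yaswanth Chittepu,Raghavendra Addanki,Tung Mai,Anup Rao,Branislav Kveton
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: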
Abstract:The development of autonomous machine learning (ML) agents capable of end-to-end data science workflows represents a significant frontier in artificial intelligence. These agents must orchestrate complex sequences of data analysis, feature engineering, model selection, and hyperparameter optimization, tasks that require sophisticated planning and iteration. While recent work on building ML agents has explored using large language models (LLMs) for direct code generation, tool-augmented approaches offer greater modularity and reliability. However, existing tool-use benchmarks focus primarily on task-specific tool selection or argument extraction for tool invocation, failing to evaluate the sophisticated planning capabilities required for ML Agents. In this work, we introduce a comprehensive benchmark for evaluating tool-augmented ML agents using a curated set of 61 specialized tools and 15 tabular ML challenges from Kaggle. Our benchmark goes beyond traditional tool-use evaluation by incorporating an in-memory named object management, allowing agents to flexibly name, save, and retrieve intermediate results throughout the workflows. We demonstrate that standard ReAct-style approaches struggle to generate valid tool sequences for complex ML pipelines, and that tree search methods with LLM-based evaluation underperform due to inconsistent state scoring. To address these limitations, we propose two simple approaches: 1) using shaped deterministic rewards with structured textual feedback, and 2) decomposing the original problem into a sequence of sub-tasks, which significantly improves trajectory validity and task performance. Using GPT-4o, our approach improves over ReAct by 16.52 percentile positions, taking the median across all Kaggle challenges. We believe our work provides a foundation for developing more capable tool-augmented planning ML agents.
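[AI-115] EDIT: Early Diffusion Inference Termination for dLLMs Based on Dynamics of Training Gradients
[Quick read]: This paper addresses redundant denoising steps during inference in diffusion-based large language models (dLLMs): answers usually stabilize before all denoising steps finish, which wastes compute. The key to the solution is EDIT (Early Diffusion Inference Termination), an inference-time stopping criterion that monitors the alignment between current token activations and a reasoning map built from AdamW-aggregated LoRA updates captured during training. EDIT keeps the parameter-importance metadata produced by the optimization dynamics as a compact representation of learned reasoning pathways. At inference, it detects convergence from the change in KL divergence over the visible tokens and stops denoising adaptively. Experiments show that EDIT reduces diffusion steps by 11.8% to 68.3% while preserving or improving accuracy in most settings, with very low storage overhead (about 0.02%).
Link: https://arxiv.org/abs/2512.00670
Authors: He-Yen Hsieh,Hong Wang,H. T. Kung
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 22 pages, 11 figures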
Abstract:Diffusion-based large language models (dLLMs) refine token generations through iterative denoising, but answers often stabilize before all steps complete. We propose EDIT (Early Diffusion Inference Termination), an inference-time criterion that adaptively stops denoising once sufficient reasoning stability relative to training-time reasoning is detected. EDIT monitors the alignment between token activations and a reasoning map derived from AdamW-aggregated LoRA updates captured during supervised fine-tuning (SFT). During training, optimization dynamics generate rich metadata about parameter importance that in prior methods is typically discarded upon model release. We preserve this information as a compact representation of learned reasoning pathways. During inference, alignment scores are converted to a distribution over the tokens already unmasked at the current denoising step, and convergence is detected when KL divergence between consecutive steps falls below a threshold on the matched unmasked (visible) tokens. Across reasoning benchmarks, EDIT reduces diffusion steps by 11.8% to 68.3% while preserving or improving accuracy in most settings, with approximately 0.02% storage overhead (about 1.5-2 MB for all QKV modules across 32 blocks in an 8 GB model). By utilizing training-gradient dynamics, our work opens a new research direction for reducing dLLM inference time and cost.
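The stopping rule described above can be sketched as follows, assuming the per-token alignment scores are already available. The softmax normalization, the KL direction, the threshold value, and the names `first_stable_step` and `score_history` are illustrative assumptions, not details taken from the paper.

```python
import math


def softmax(scores):
    """Convert raw alignment scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def first_stable_step(score_history, threshold=1e-3):
    """Return the first denoising step whose score distribution has converged.

    score_history[t] maps token position -> alignment score for the tokens
    that are unmasked (visible) at step t. Only positions visible at both
    step t-1 and step t are compared. Returns None if no step converges.
    """
    for t in range(1, len(score_history)):
        prev, curr = score_history[t - 1], score_history[t]
        shared = sorted(set(prev) & set(curr))  # matched visible tokens
        if len(shared) < 2:
            continue  # a one-token distribution carries no signal
        p = softmax([curr[i] for i in shared])
        q = softmax([prev[i] for i in shared])
        if kl_divergence(p, q) < threshold:
            return t
    return None
```

In a real decoder the loop would run online and break out of denoising at the returned step; the offline form here just makes the criterion easy to inspect.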
[AI-116] Neuroscience-Inspired Memory Replay for Continual Learning: A Comparative Study of Predictive Coding and Backpropagation-Based Strategies
[Quick read]: This paper addresses catastrophic forgetting in continual learning, where a neural network that learns new tasks badly forgets what it learned on earlier ones. The key to the solution is a generative replay framework based on predictive coding. Inspired by biological memory consolidation, it uses predictive coding principles to retain knowledge of earlier tasks, improving retention across several benchmark datasets (15.3% on average) while keeping transfer efficiency competitive.
Link: https://arxiv.org/abs/2512.00619
Authors: Goutham Nalagatla,Shreyas Grandhe
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures
Abstract:Continual learning remains a fundamental challenge in artificial intelligence, with catastrophic forgetting posing a significant barrier to deploying neural networks in dynamic environments. Inspired by biological memory consolidation mechanisms, we propose a novel framework for generative replay that leverages predictive coding principles to mitigate forgetting. We present a comprehensive comparison between predictive coding-based and backpropagation-based generative replay strategies, evaluating their effectiveness on task retention and transfer efficiency across multiple benchmark datasets. Our experimental results demonstrate that predictive coding-based replay achieves superior retention performance (average 15.3% improvement) while maintaining competitive transfer efficiency, suggesting that biologically-inspired mechanisms can offer principled solutions to continual learning challenges. The proposed framework provides insights into the relationship between biological memory processes and artificial learning systems, opening new avenues for neuroscience-inspired AI research.
[AI-117] Stable Voting and the Splitting of Cycles AAAI2026 AAAI
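[Quick read]: This paper addresses the resolution of majority cycles in preference aggregation, focusing on how Stable Voting (SV) and its simplification, Simple Stable Voting (SSV), relate to the Split Cycle (SC) method. The central question is whether SSV is always a refinement of SC when no two majority victories are of the same size. The key to the solution is a combination of mathematical reasoning and satisfiability (SAT) solving: the conjecture is proved by traditional mathematical reasoning for up to 5 alternatives, proved via a SAT encoding for 6 alternatives, and refuted by a counterexample for 7 alternatives. The SAT encoding is general and can test properties of any voting method that depends only on the ordering of margins of victory.
Link: https://arxiv.org/abs/2512.00616
Authors: Wesley H. Holliday,Milan Mossé,Chase Norman,Eric Pacuit,Cynthia Wang
Affiliation: unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Theoretical Economics (econ.TH)
Comments: Forthcoming in Proceedings of the 40th Annual AAAI Conference on Artificial Intelligence (AAAI 2026)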
Abstract:Algorithms for resolving majority cycles in preference aggregation have been studied extensively in computational social choice. Several sophisticated cycle-resolving methods, including Tideman’s Ranked Pairs, Schulze’s Beat Path, and Heitzig’s River, are refinements of the Split Cycle (SC) method that resolves majority cycles by discarding the weakest majority victories in each cycle. Recently, Holliday and Pacuit proposed a new refinement of Split Cycle, dubbed Stable Voting, and a simplification thereof, called Simple Stable Voting (SSV). They conjectured that SSV is a refinement of SC whenever no two majority victories are of the same size. In this paper, we prove the conjecture up to 6 alternatives and refute it for more than 6 alternatives. While our proof of the conjecture for up to 5 alternatives uses traditional mathematical reasoning, our 6-alternative proof and 7-alternative counterexample were obtained with the use of SAT solving. The SAT encoding underlying this proof and counterexample is applicable far beyond SC and SSV: it can be used to test properties of any voting method whose choice of winners depends only on the ordering of margins of victory by size.
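To make "discarding the weakest majority victories in each cycle" concrete, here is a minimal sketch of Split Cycle winner selection. It relies on the standard reformulation that a majority victory of a over b survives exactly when its margin exceeds the strength of the strongest majority path from b back to a. The function name and the nested-dictionary input format are illustrative assumptions.

```python
def split_cycle_winners(candidates, margin):
    """Return the Split Cycle winners.

    margin[a][b] is (#voters preferring a to b) minus (#voters preferring
    b to a), so margin[a][b] == -margin[b][a].
    """
    # strength[a][b]: the largest weakest-link margin over all majority
    # paths from a to b (0 when no such path exists).
    strength = {
        a: {b: margin[a][b] if a != b and margin[a][b] > 0 else 0
            for b in candidates}
        for a in candidates
    }
    # Floyd-Warshall variant for widest (max-min) paths.
    for k in candidates:
        for i in candidates:
            for j in candidates:
                if i != j and i != k and j != k:
                    via_k = min(strength[i][k], strength[k][j])
                    strength[i][j] = max(strength[i][j], via_k)

    defeated = set()
    for a in candidates:
        for b in candidates:
            # a's victory over b is kept only if it is not a weakest
            # victory in any cycle, i.e. it beats every path from b to a.
            if a != b and margin[a][b] > 0 and margin[a][b] > strength[b][a]:
                defeated.add(b)
    return [c for c in candidates if c not in defeated]
```

For a three-way cycle where a beats b by 5, b beats c by 3, and c beats a by 1, the weakest victory (c over a) is discarded and a is the only undefeated candidate.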
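[AI-118] Hierarchical Decentralized Multi-Agent Coordination with Privacy-Preserving Knowledge Sharing: Extending AgentNet for Scalable Autonomous Systems
[Quick read]: This paper addresses four challenges that decentralized multi-agent systems face at scale: limited scalability, high communication overhead, missing privacy guarantees, and inefficient resource allocation. The core of the solution is AgentNet++, a hierarchical decentralized framework. It introduces cluster-based hierarchies in which agents self-organize into specialized groups, enabling efficient task routing and knowledge distillation. It protects shared knowledge with differential privacy and secure aggregation, and uses adaptive resource management to optimize system performance. The paper also gives theoretical convergence guarantees and an analysis of privacy bounds. Experiments report that the framework preserves emergent intelligence at 1000+ agents and, compared with AgentNet and other baselines, achieves 23% higher task completion rates and a 40% reduction in communication overhead with stronger privacy guarantees.
Link: https://arxiv.org/abs/2512.00614
Authors: Goutham Nalagatla
Affiliation: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 6 pages, 4 figures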
Abstract:Decentralized multi-agent systems have shown promise in enabling autonomous collaboration among LLM-based agents. While AgentNet demonstrated the feasibility of fully decentralized coordination through dynamic DAG topologies, several limitations remain: scalability challenges with large agent populations, communication overhead, lack of privacy guarantees, and suboptimal resource allocation. We propose AgentNet++, a hierarchical decentralized framework that extends AgentNet with multilevel agent organization, privacy-preserving knowledge sharing via differential privacy and secure aggregation, adaptive resource management, and theoretical convergence guarantees. Our approach introduces cluster-based hierarchies where agents self-organize into specialized groups, enabling efficient task routing and knowledge distillation while maintaining full decentralization. We provide formal analysis of convergence properties and privacy bounds, and demonstrate through extensive experiments on complex multi-agent tasks that AgentNet++ achieves 23% higher task completion rates, 40% reduction in communication overhead, and maintains strong privacy guarantees compared to AgentNet and other baselines. Our framework scales effectively to 1000+ agents while preserving the emergent intelligence properties of the original AgentNet.
[AI-119] Generalized Graph Transformer Variational Autoencoder
[Quick read]: This paper addresses link prediction on graph-structured data, the task of predicting potential connections between nodes in a given graph. Approaches such as GraphVAE, GCN, or GNN usually model local dependencies between nodes through message passing and miss global structure. The key to the solution is the Generalized Graph Transformer Variational Autoencoder (GGT-VAE), which combines a generalized graph transformer architecture with the variational autoencoder framework. It uses a transformer-style global self-attention mechanism and Laplacian positional encoding to model structural patterns across nodes and map them into a latent space without message passing. On several benchmark datasets it consistently achieves above-baseline ROC-AUC and Average Precision.
Link: https://arxiv.org/abs/2512.00612
Authors: Siddhant Karki
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Graph link prediction has long been a central problem in graph representation learning in both network analysis and generative modeling. Recent progress in deep learning has introduced increasingly sophisticated architectures for capturing relational dependencies within graph-structured data. In this work, we propose the Generalized Graph Transformer Variational Autoencoder (GGT-VAE). Our model integrates the Generalized Graph Transformer architecture with the Variational Autoencoder framework for link prediction. Unlike prior GraphVAE, GCN, or GNN approaches, GGT-VAE leverages a transformer-style global self-attention mechanism along with Laplacian positional encoding to model structural patterns across nodes into a latent space without relying on message passing. Experimental results on several benchmark datasets demonstrate that GGT-VAE consistently achieves above-baseline performance in terms of ROC-AUC and Average Precision. To the best of our knowledge, this is among the first studies to explore graph structure generation using a generalized graph transformer backbone in a variational framework.
[AI-120] On the Holographic Geometry of Deterministic Computation
[Quick read]: This paper addresses the apparently linear dependence between running time and stored information when simulating deterministic multitape Turing machines: standard simulations suggest that certifying or regenerating the configuration at time $t$ requires storing $O(t)$ information. The key to the solution is to use a Height Compression Theorem together with an Algebraic Replay Engine to bring the simulation space down to $O(\sqrt{t})$. The author recasts the execution trace in geometric and information-theoretic terms as a spacetime dependency DAG and constructs a family of recursively defined holographic boundary summaries such that, along the square-root-space simulation, the total description length of the boundary data stored at any time stays $O(\sqrt{t})$. Using Kolmogorov complexity, the paper proves that every internal configuration has constant conditional description complexity given the appropriate boundary summary and time index, so the algorithmic information in the spacetime bulk is fully encoded by its boundary. This yields a one-dimensional computational area law and a holographic representation of deterministic computation on a one-dimensional work tape.
Link: https://arxiv.org/abs/2512.00607
Authors: Logan Nye
Affiliation: unknown
Categories: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI)
Comments: 20 pages
Abstract:Standard simulations of Turing machines suggest a linear relationship between the temporal duration $t$ of a run and the amount of information that must be stored by known simulations to certify, verify, or regenerate the configuration at time $t$. For deterministic multitape Turing machines over a fixed finite alphabet, this apparent linear dependence is not intrinsic: any length-$t$ run can be simulated in space $O(\sqrt{t})$ via a Height Compression Theorem for succinct computation trees together with an Algebraic Replay Engine. In this paper we recast that construction in geometric and information-theoretic language. We interpret the execution trace as a spacetime dependency DAG and exhibit a family of recursively defined holographic boundary summaries such that, along the square-root-space simulation, the total description length of all boundary data stored at any time is $O(\sqrt{t})$. Using Kolmogorov complexity, we prove that every internal configuration has constant conditional description complexity given the appropriate boundary summary and time index, establishing that the spacetime bulk carries no additional algorithmic information beyond its boundary. We express this as a one-dimensional computational area law: there exists a simulation in which the information capacity of the active "holographic screen" needed to generate a spacetime region of volume $t$ is bounded by $O(\sqrt{t})$. In this precise sense, deterministic computation on a one-dimensional work tape admits a holographic representation, with the bulk history algebraically determined by data residing on a lower-dimensional boundary screen.
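[AI-121] AgentODRL: A Large Language Model-based Multi-agent System for ODRL Generation AAAI2026
[Quick read]: This paper addresses the automatic translation of natural language into the Open Digital Rights Language (ODRL), which is made difficult by the logical complexity of authorization policies and the scarcity of high-quality "natural language-to-ODRL" training data. The key to the solution is AgentODRL, a multi-agent system built on an Orchestrator-Workers architecture. It contains specialized Workers (an ODRL Generator, a Decomposer, and a Rewriter), and the Orchestrator dynamically assembles the best processing pathway. A validator-based syntax strategy and a semantic reflection mechanism powered by a LoRA-finetuned model further improve the accuracy and consistency of the generated ODRL policies.
Link: https://arxiv.org/abs/2512.00602
Authors: Wanle Zhong,Keman Huang,Xiaoyong Du
Affiliation: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026. 9 pages, 1 figure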
Abstract:The Open Digital Rights Language (ODRL) is a pivotal standard for automating data rights management. However, the inherent logical complexity of authorization policies, combined with the scarcity of high-quality “Natural Language-to-ODRL” training datasets, impedes the ability of current methods to efficiently and accurately translate complex rules from natural language into the ODRL format. To address this challenge, this research leverages the potent comprehension and generation capabilities of Large Language Models (LLMs) to achieve both automation and high fidelity in this translation process. We introduce AgentODRL, a multi-agent system based on an Orchestrator-Workers architecture. The architecture consists of specialized Workers, including a Generator for ODRL policy creation, a Decomposer for breaking down complex use cases, and a Rewriter for simplifying nested logical relationships. The Orchestrator agent dynamically coordinates these Workers, assembling an optimal pathway based on the complexity of the input use case. Specifically, we enhance the ODRL Generator by incorporating a validator-based syntax strategy and a semantic reflection mechanism powered by a LoRA-finetuned model, significantly elevating the quality of the generated policies. Extensive experiments were conducted on a newly constructed dataset comprising 770 use cases of varying complexity, all situated within the context of data spaces. The results, evaluated using ODRL syntax and semantic scores, demonstrate that our proposed Orchestrator-Workers system, enhanced with these strategies, achieves superior performance on the ODRL generation task.
[AI-122] Clinical-R1: Empowering Large Language Models for Faithful and Comprehensive Reasoning with Clinical Objective Relative Policy Optimization
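[Quick read]: This paper addresses the failure of current large language model (LLM) post-training methods to align reasoning quality with multi-dimensional objectives in high-stakes fields such as medicine. Existing methods such as Group Relative Policy Optimization (GRPO) reward only correctness and neglect the faithfulness and comprehensiveness that clinical reasoning requires. The key to the solution is Clinical-Objective Relative Policy Optimization (CRPO), a scalable, multi-objective, verifiable reinforcement learning method. It integrates rule-based and verifiable reward signals to jointly optimize accuracy, faithfulness, and comprehensiveness, aligning the model with clinical reasoning principles without human annotation.
Link: https://arxiv.org/abs/2512.00601
Authors: Boyang Gu,Hongjian Zhou,Bradley Max Segal,Jinge Wu,Zeyu Cao,Hantao Zhong,Lei Clifton,Fenglin Liu,David A. Clifton
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: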
Abstract:Recent advances in large language models (LLMs) have shown strong reasoning capabilities through large-scale pretraining and post-training reinforcement learning, demonstrated by DeepSeek-R1. However, current post-training methods, such as Grouped Relative Policy Optimization (GRPO), mainly reward correctness, which is not aligned with the multi-dimensional objectives required in high-stakes fields such as medicine, where reasoning must also be faithful and comprehensive. We introduce Clinical-Objective Relative Policy Optimization (CRPO), a scalable, multi-objective, verifiable reinforcement learning method designed to align LLM post-training with clinical reasoning principles. CRPO integrates rule-based and verifiable reward signals that jointly optimize accuracy, faithfulness, and comprehensiveness without relying on human annotation. To demonstrate its effectiveness, we train Clinical-R1-3B, a 3B-parameter model for clinical reasoning. The experiments on three benchmarks demonstrate that our CRPO substantially improves reasoning on truthfulness and completeness over standard GRPO while maintaining comfortable accuracy enhancements. This framework provides a scalable pathway to align LLM reasoning with clinical objectives, enabling safer and more collaborative AI systems for healthcare while also highlighting the potential of multi-objective, verifiable RL methods in post-training scaling of LLMs for medical domains.
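As a rough illustration of how several verifiable reward signals can feed a group-relative update, the sketch below combines per-objective scores into one reward and then normalizes rewards within a group of sampled responses, as GRPO-style methods do. The weighted-sum combination, the weights, and the function names are assumptions made for illustration; the abstract does not specify how CRPO combines its objectives.

```python
import statistics


def combined_reward(scores, weights):
    """Weighted sum of per-objective scores (hypothetical combination rule)."""
    return sum(weights[name] * scores[name] for name in weights)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is the reward's deviation from the group mean, scaled by
    the group's population standard deviation.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

A response that is accurate but unfaithful then receives a lower advantage than one that scores well on all three objectives, which is the behavior a correctness-only reward cannot express.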
[AI-123] Developing Fairness-Aware Task Decomposition to Improve Equity in Post-Spinal Fusion Complication Prediction
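[Quick read]: This paper addresses unfairness in clinical prediction models, particularly in high-stakes settings such as spinal fusion surgery for scoliosis, where patient outcomes are highly heterogeneous. Existing methods based on coarse demographic adjustments or post-hoc corrections fail to capture the latent structure of clinical populations and may unintentionally reinforce bias. The key to the solution is FAIR-MTL, a fairness-aware multitask learning framework that does not train on explicit sensitive attributes. Instead, it infers latent patient subgroups in a data-driven way by applying k-means clustering to a compact demographic embedding, identifying subgroups that traditional models may affect differently. The inferred subgroup labels guide task routing in a shared multitask architecture, inverse-frequency weighting mitigates subgroup imbalance, and regularization prevents overfitting to small groups, giving fairer, interpretable, and clinically meaningful predictions of postoperative complication severity.
Link: https://arxiv.org/abs/2512.00598
Authors: Yining Yuan,J. Ben Tamo,Wenqi Shi,Yishan Zhong,Micky C. Nnamdi,B. Randall Brenn,Steven W. Hwang,May D. Wang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: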
Abstract:Fairness in clinical prediction models remains a persistent challenge, particularly in high-stakes applications such as spinal fusion surgery for scoliosis, where patient outcomes exhibit substantial heterogeneity. Many existing fairness approaches rely on coarse demographic adjustments or post-hoc corrections, which fail to capture the latent structure of clinical populations and may unintentionally reinforce bias. We propose FAIR-MTL, a fairness-aware multitask learning framework designed to provide equitable and fine-grained prediction of postoperative complication severity. Instead of relying on explicit sensitive attributes during model training, FAIR-MTL employs a data-driven subgroup inference mechanism. We extract a compact demographic embedding, and apply k-means clustering to uncover latent patient subgroups that may be differentially affected by traditional models. These inferred subgroup labels determine task routing within a shared multitask architecture. During training, subgroup imbalance is mitigated through inverse-frequency weighting, and regularization prevents overfitting to smaller groups. Applied to postoperative complication prediction with four severity levels, FAIR-MTL achieves an AUC of 0.86 and an accuracy of 75%, outperforming single-task baselines while substantially reducing bias. For gender, the demographic parity difference decreases to 0.055 and equalized odds to 0.094; for age, these values reduce to 0.056 and 0.148, respectively. Model interpretability is ensured through SHAP and Gini importance analyses, which consistently highlight clinically meaningful predictors such as hemoglobin, hematocrit, and patient weight. Our findings show that incorporating unsupervised subgroup discovery into a multitask framework enables more equitable, interpretable, and clinically actionable predictions for surgical risk stratification. 
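Two quantities named in the abstract, inverse-frequency weighting and the demographic parity difference, can be sketched in a few lines. This is a simplified illustration: the normalization of the weights and the binary-prediction form of demographic parity are assumptions (the paper predicts four severity levels), and the function names are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(subgroup_labels):
    """Per-sample weights inversely proportional to subgroup size.

    Weights are scaled so that they sum to the number of samples, which
    keeps the overall loss magnitude unchanged.
    """
    counts = Counter(subgroup_labels)
    n, k = len(subgroup_labels), len(counts)
    return [n / (k * counts[g]) for g in subgroup_labels]


def demographic_parity_difference(predictions, groups):
    """Largest gap in positive-prediction rate between any two groups.

    predictions: 0/1 model outputs; groups: group label of each sample.
    """
    rates = {}
    for g in set(groups):
        members = [p for p, gg in zip(predictions, groups) if gg == g]
        rates[g] = sum(members) / len(members)
    return max(rates.values()) - min(rates.values())
```

A demographic parity difference of 0 means every group receives positive predictions at the same rate, so the reported values of 0.055 (gender) and 0.056 (age) indicate small remaining gaps.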
[AI-124] DLRREC: Denoising Latent Representations via Multi-Modal Knowledge Fusion in Deep Recommender Systems
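[Quick read]: This paper addresses the inefficient use, in modern recommender systems, of the high-dimensional and noisy multi-modal features generated by Large Language Models (LLMs). Conventional methods treat these features as static inputs, which decouples them from the core recommendation task. The key to the solution is a unified framework that denoises representations by deeply fusing multi-modal information with collaborative filtering knowledge. First, dimensionality reduction is embedded in the recommendation model and co-trained end to end, so the reduction is aware of the final ranking objective. Second, a contrastive learning objective explicitly injects the collaborative filtering signal into the latent space, refining the raw LLM embeddings by filtering noise and amplifying task-relevant signals.
Link: https://arxiv.org/abs/2512.00596
Authors: Jiahao Tian,Zhenkai Wang
Affiliation: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: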
Abstract:Modern recommender systems struggle to effectively utilize the rich, yet high-dimensional and noisy, multi-modal features generated by Large Language Models (LLMs). Treating these features as static inputs decouples them from the core recommendation task. We address this limitation with a novel framework built on a key insight: deeply fusing multi-modal and collaborative knowledge for representation denoising. Our unified architecture introduces two primary technical innovations. First, we integrate dimensionality reduction directly into the recommendation model, enabling end-to-end co-training that makes the reduction process aware of the final ranking objective. Second, we introduce a contrastive learning objective that explicitly incorporates the collaborative filtering signal into the latent space. This synergistic process refines raw LLM embeddings, filtering noise while amplifying task-relevant signals. Extensive experiments confirm our method’s superior discriminative power, proving that this integrated fusion and denoising strategy is critical for achieving state-of-the-art performance. Our work provides a foundational paradigm for effectively harnessing LLMs in recommender systems.
[AI-125] IslandRun: Privacy-Aware Multi-Objective Orchestration for Distributed AI Inference
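[Quick read]: This paper addresses an irreducible multi-objective tension in modern AI inference: no single computational resource simultaneously delivers high performance, privacy preservation, low cost, and trust. Existing orchestration frameworks each optimize a single dimension (Kubernetes prioritizes latency, federated learning preserves privacy, edge computing reduces network distance) and struggle under real-world heterogeneity. The key to the solution is IslandRun, a system that treats computational resources as autonomous "islands" spanning personal devices, private edge servers, and public cloud, and coordinates multiple objectives through three core mechanisms: (1) policy-constrained, request-level multi-objective optimization to handle heterogeneity; (2) data-locality routing that moves compute to data rather than data to compute; and (3) typed placeholder sanitization that preserves context semantics across trust boundaries. The work establishes a new paradigm for decentralized inference orchestration in privacy-sensitive settings.
Link: https://arxiv.org/abs/2512.00595
Authors: Bala Siva Sai Akhil Malepati
Affiliations: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 15 pages, 3 figures, 2 tables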
Abstract:Modern AI inference faces an irreducible tension: no single computational resource simultaneously maximizes performance, preserves privacy, minimizes cost, and maintains trust. Existing orchestration frameworks optimize single dimensions (Kubernetes prioritizes latency, federated learning preserves privacy, edge computing reduces network distance), creating solutions that struggle under real-world heterogeneity. We present IslandRun, a multi-objective orchestration system that treats computational resources as autonomous “islands” spanning personal devices, private edge servers, and public cloud. Our key insights: (1) request-level heterogeneity demands policy-constrained multi-objective optimization, (2) data locality enables routing compute to data rather than data to compute, and (3) typed placeholder sanitization preserves context semantics across trust boundaries. IslandRun introduces agent-based routing, tiered island groups with differential trust, and reversible anonymization. This establishes a new paradigm for privacy-aware, decentralized inference orchestration across heterogeneous personal computing ecosystems.
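To make "typed placeholder sanitization" and "reversible anonymization" concrete, here is a minimal sketch under stated assumptions. The regular expressions, the placeholder format `<TYPE_n>`, and the function names are hypothetical and are not taken from IslandRun. The idea shown is that a sensitive span is replaced by a placeholder that keeps its type, so a remote model still sees what kind of entity was there, and the local side can restore the original values afterwards.

```python
import re

# Hypothetical detectors; a real deployment would need a proper PII detector.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def sanitize(text):
    """Replace sensitive spans with typed placeholders.
    Returns the sanitized text and the placeholder-to-value mapping."""
    mapping = {}
    for kind, pattern in PATTERNS.items():
        def replace(match, kind=kind):
            value = match.group(0)
            for placeholder, known in mapping.items():
                if known == value:          # same value, same placeholder
                    return placeholder
            index = sum(key.startswith(f"<{kind}_") for key in mapping) + 1
            placeholder = f"<{kind}_{index}>"
            mapping[placeholder] = value
            return placeholder
        text = pattern.sub(replace, text)
    return text, mapping

def restore(text, mapping):
    """Invert sanitize() on text that crossed back over the trust boundary."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text

original = "Email alice@example.com or call 555-123-4567; alice@example.com prefers email."
clean, mapping = sanitize(original)
```

The mapping never leaves the trusted island in this sketch; only `clean` would be routed to a less trusted one.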
[AI-126] Enhancing Analogy-Based Software Effort Estimation with Firefly Algorithm Optimization
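[Quick read]: This paper addresses the difficulty of achieving accurate estimates with traditional Analogy-Based Estimation (ABE) for new software projects, especially when their characteristics differ substantially from those of historical projects. The key to the solution is a Firefly Algorithm (FA)-guided analogy-based estimation model (FAABE), which brings firefly optimization into the ABE framework to improve the accuracy of matching similar projects and combines it with feature selection to improve prediction efficiency. Experiments show that FAABE significantly outperforms conventional models on several public datasets, supporting the effectiveness of the hybrid approach.
Link: https://arxiv.org/abs/2512.00571
Authors: Tarun Chintada, Uday Kiran Cheera
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 12 pages, 3 figures, 2 tables. Research conducted in June 2024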
Abstract:Analogy-Based Estimation (ABE) is a popular method for non-algorithmic estimation due to its simplicity and effectiveness. Although the ABE model has been proposed by researchers, no optimal approach for reliable estimation has been developed. Achieving high accuracy with ABE can be challenging for new software projects that differ from previous ones. This study (conducted in June 2024) proposes a Firefly Algorithm-guided Analogy-Based Estimation (FAABE) model that combines FA with ABE to improve estimation accuracy. The FAABE model was tested on publicly accessible datasets: Cocomo81, Desharnais, China, Albrecht, Kemerer, and Maxwell. Feature selection was used to improve prediction efficiency. The results were measured with several error metrics, including MMRE, MAE, MSE, and RMSE. Compared to conventional models, the experimental results show notable increases in prediction precision, demonstrating the efficacy of the Firefly-Analogy ensemble.
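The sketch below shows a plain analogy-based estimate and the four error metrics named in the abstract. It is illustrative only: the weighted Euclidean distance, the choice of `k`, and the idea that a firefly search would tune `weights` are assumptions, since the abstract does not say which ABE parameters the Firefly Algorithm optimizes. The metric definitions themselves are the standard ones.

```python
import math

def abe_estimate(target, history, weights, k=2):
    """Estimate effort as the mean effort of the k most similar past projects.
    history is a list of (features, effort) pairs."""
    def distance(features):
        return math.sqrt(sum(w * (x - y) ** 2
                             for w, x, y in zip(weights, target, features)))
    nearest = sorted(history, key=lambda item: distance(item[0]))[:k]
    return sum(effort for _, effort in nearest) / len(nearest)

def error_metrics(actual, predicted):
    """MMRE, MAE, MSE, and RMSE for paired actual and predicted efforts."""
    n = len(actual)
    abs_err = [abs(a - p) for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in abs_err) / n
    return {
        "MMRE": sum(e / a for e, a in zip(abs_err, actual)) / n,
        "MAE": sum(abs_err) / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
    }

# Toy data (hypothetical projects with two normalized features each).
history = [([1.0, 1.0], 100.0), ([2.0, 2.0], 200.0), ([10.0, 10.0], 900.0)]
estimate = abe_estimate([1.5, 1.4], history, weights=[1.0, 1.0], k=2)
metrics = error_metrics([100.0, 200.0], [110.0, 180.0])
```

An optimizer such as FA would, under this assumption, search over `weights` (and possibly `k`) to minimize a metric like MMRE on held-out projects.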
[AI-127] Explainable Multi-Modal Deep Learning for Automatic Detection of Lung Diseases from Respiratory Audio Signals
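[Quick read]: This paper addresses the strong subjectivity, environmental noise interference, and inter-clinician diagnostic variability of traditional lung auscultation by proposing an explainable multimodal deep learning framework for automatic lung-disease detection. The key is the fusion of two complementary representations: a spectral-temporal encoder based on a CNN-BiLSTM attention architecture, which captures complex spatio-temporal features in the audio signal, and a handcrafted acoustic-feature encoder, which extracts physiologically meaningful descriptors such as Mel-frequency cepstral coefficients (MFCCs), spectral centroid, spectral bandwidth, and zero-crossing rate. A late-fusion strategy combines the two, using data-driven learning while retaining domain-informed acoustic cues. On the Asthma Detection Dataset Version 2 the model reaches 91.21% accuracy and 0.9866 macro ROC-AUC, outperforming all ablated variants. Grad-CAM, Integrated Gradients, and SHAP provide clinically relevant explanations at the spectral, temporal, and feature levels, which strengthens the model's credibility and practicality for telemedicine and point-of-care diagnostics.
Link: https://arxiv.org/abs/2512.00563
Authors: S M Asiful Islam Saky, Md Rashidul Islam, Md Saiful Arefin, Shahaba Alam
Affiliations: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: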
Abstract:Respiratory diseases remain major global health challenges, and traditional auscultation is often limited by subjectivity, environmental noise, and inter-clinician variability. This study presents an explainable multimodal deep learning framework for automatic lung-disease detection using respiratory audio signals. The proposed system integrates two complementary representations: a spectral-temporal encoder based on a CNN-BiLSTM Attention architecture, and a handcrafted acoustic-feature encoder capturing physiologically meaningful descriptors such as MFCCs, spectral centroid, spectral bandwidth, and zero-crossing rate. These branches are combined through late-stage fusion to leverage both data-driven learning and domain-informed acoustic cues. The model is trained and evaluated on the Asthma Detection Dataset Version 2 using rigorous preprocessing, including resampling, normalization, noise filtering, data augmentation, and patient-level stratified partitioning. The study achieved strong generalization with 91.21% accuracy, 0.899 macro F1-score, and 0.9866 macro ROC-AUC, outperforming all ablated variants. An ablation study confirms the importance of temporal modeling, attention mechanisms, and multimodal fusion. The framework incorporates Grad-CAM, Integrated Gradients, and SHAP, generating interpretable spectral, temporal, and feature-level explanations aligned with known acoustic biomarkers to build clinical transparency. The findings demonstrate the framework’s potential for telemedicine, point-of-care diagnostics, and real-world respiratory screening.
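Two of the handcrafted descriptors listed in the abstract, zero-crossing rate and spectral centroid, are simple enough to sketch with NumPy alone. This is a whole-signal illustration with assumed function names; the paper's framing, windowing, and MFCC extraction are not described in the abstract and are not reproduced here.

```python
import numpy as np

def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs whose signs differ."""
    negative = np.signbit(signal)
    return float(np.mean(negative[1:] != negative[:-1]))

def spectral_centroid(signal, sample_rate):
    """Magnitude-weighted mean frequency of the signal's spectrum, in Hz."""
    magnitudes = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(np.sum(freqs * magnitudes) / np.sum(magnitudes))

# Toy input: one second of a pure 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)
centroid = spectral_centroid(tone, sample_rate)
zcr = zero_crossing_rate(tone)
```

For a pure tone the centroid sits at the tone's frequency and the zero-crossing rate is about twice the frequency divided by the sample rate, which makes both features easy to sanity-check before applying them to real recordings.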
[AI-128] List Replicable Reinforcement Learning
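[Quick read]: This paper addresses the replicability problem in reinforcement learning (RL): across training runs, the policies an algorithm outputs are unstable and differ markedly, which makes results hard to verify and apply. Within the Probably Approximately Correct (PAC) RL framework, the authors formalize list replicability, which requires the algorithm to return, across runs, a near-optimal policy that belongs to a small list of policies; the size of that list is the list complexity. To achieve this, the paper proposes a provably efficient tabular RL algorithm whose key innovations are (i) a new planning strategy that selects actions by lexicographic order among near-optimal choices within a randomly chosen tolerance threshold, reducing the uncertainty in policy selection, and (ii) a mechanism for testing state reachability in stochastic environments that preserves replicability and stabilizes policy execution traces. The method keeps the list complexity polynomial in the number of states, the number of actions, and the horizon length, achieves both weak and strong list replicability, and offers theoretical grounding and directions for mitigating the instability of RL algorithms used in practice.
Link: https://arxiv.org/abs/2512.00553
Authors: Bohan Zhang, Michael Chen, A. Pavan, N. V. Vinodchandran, Lin F. Yang, Ruosong Wang
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: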
Abstract:Replicability is a fundamental challenge in reinforcement learning (RL), as RL algorithms are empirically observed to be unstable and sensitive to variations in training conditions. To formally address this issue, we study list replicability in the Probably Approximately Correct (PAC) RL framework, where an algorithm must return a near-optimal policy that lies in a small list of policies across different runs, with high probability. The size of this list defines the list complexity. We introduce both weak and strong forms of list replicability: the weak form ensures that the final learned policy belongs to a small list, while the strong form further requires that the entire sequence of executed policies remains constrained. These objectives are challenging, as existing RL algorithms exhibit exponential list complexity due to their instability. Our main theoretical contribution is a provably efficient tabular RL algorithm that guarantees list replicability by ensuring the list complexity remains polynomial in the number of states, actions, and the horizon length. We further extend our techniques to achieve strong list replicability, bounding the number of possible policy execution traces polynomially with high probability. Our theoretical result is made possible by key innovations including (i) a novel planning strategy that selects actions based on lexicographic order among near-optimal choices within a randomly chosen tolerance threshold, and (ii) a mechanism for testing state reachability in stochastic environments while preserving replicability. Finally, we demonstrate that our theoretical investigation sheds light on resolving the instability issue of RL algorithms used in practice. In particular, we show that empirically, our new planning strategy can be incorporated into practical RL frameworks to enhance their stability.
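Innovation (i) in the abstract can be illustrated for a single state. The sketch below is a simplification with assumed names and an assumed tolerance range; it is not the paper's algorithm. It picks the lowest-index action whose estimated value is within a tolerance of the best estimate, so two runs with slightly different value estimates tend to agree, whereas a plain argmax can flip between them.

```python
import random

def lexicographic_near_optimal_action(q_values, tolerance):
    """Return the smallest-index action whose value is within `tolerance`
    of the best estimated value for this state."""
    best = max(q_values)
    for action, value in enumerate(q_values):
        if value >= best - tolerance:
            return action

def argmax_action(q_values):
    """Ordinary greedy choice, shown for comparison."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Two runs estimate the same state with small, different noise.
run_a = [1.00, 1.02, 0.50]
run_b = [1.01, 0.99, 0.50]

# The tolerance is drawn at random; the range here is a hypothetical choice.
rng = random.Random(0)
tolerance = rng.uniform(0.05, 0.10)

greedy = (argmax_action(run_a), argmax_action(run_b))
replicable = (lexicographic_near_optimal_action(run_a, tolerance),
              lexicographic_near_optimal_action(run_b, tolerance))
```

Drawing the tolerance at random is what makes it unlikely that a value estimate sits right at the cutoff, where small noise would again change the selected action.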
[AI-129] ESPO: Entropy Importance Sampling Policy Optimization
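[Quick read]: This paper addresses the fundamental trade-off between optimization granularity and training stability in group-based policy optimization frameworks (such as GRPO and GSPO) for reinforcement learning of large language models (LLMs). GSPO improves robustness through sequence-level optimization, but its uniform treatment of sequences is severely inefficient: its conservative clipping mechanism indiscriminately discards valid training samples (gradient underutilization), and its uniform credit assignment fails to capture the heterogeneous contributions of critical reasoning steps. The key to the solution is Entropy Importance Sampling Policy Optimization (ESPO), a framework that decomposes sequences into groups by predictive entropy, which enables two core innovations: (1) entropy-driven importance sampling to capture intra-sequence heterogeneity, and (2) entropy-adaptive clipping that dynamically allocates trust regions, adjusting optimization strength according to model uncertainty. Experiments show that ESPO markedly accelerates convergence and achieves state-of-the-art performance on mathematical reasoning benchmarks, notably raising accuracy on the HMMT benchmark from 4.4% to 13.13%.
Link: https://arxiv.org/abs/2512.00499
Authors: Yuepeng Sheng, Yuwei Huang, Shuman Liu, Haibo Zhang, Anxiang Zeng
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: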
Abstract:Large language model (LLM) reinforcement learning has increasingly relied on group-based policy optimization frameworks, such as GRPO and GSPO, to achieve stable fine-tuning at scale. However, a fundamental trade-off persists between optimization granularity and training stability. While GSPO improves robustness via sequence-level optimization, its monolithic treatment of sequences introduces severe inefficiencies: its conservative clipping mechanism indiscriminately discards valid training samples (a phenomenon we term gradient underutilization), and its uniform credit assignment fails to capture the heterogeneous contributions of critical reasoning steps. In this work, we propose Entropy Importance Sampling Policy Optimization (ESPO), a novel framework that reconciles fine-grained control with training stability. ESPO decomposes sequences into groups based on predictive entropy, enabling (1) Entropy-driven Importance Sampling to capture intra-sequence heterogeneity, and (2) Entropy-adaptive Clipping to dynamically allocate trust regions based on model uncertainty. Extensive experiments on mathematical reasoning benchmarks demonstrate that ESPO not only accelerates convergence but also achieves state-of-the-art performance, notably improving accuracy on the challenging HMMT benchmark from 4.4% to 13.13%.
[AI-130] Mind the data gap: Missingness Still Shapes Large Language Model Prognoses ML4H2025
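[Quick read]: This paper addresses a gap: the effect of missingness patterns on the zero-shot predictive performance of Large Language Models (LLMs) has not been studied systematically, even though missingness often reflects human decisions (for example, the indication for a diagnostic test in healthcare) and may substantially affect predictive accuracy and calibration. The key to the approach is to include explicit missingness indicators in the prompt and to analyze the effect empirically on data from Columbia University Medical Center and MIMIC-IV. The impact is inconsistent across LLMs of different sizes: larger models benefit from the intervention, whereas smaller models can perform worse once missingness information is introduced. The paper further presents an aggregated analysis and theoretical insights, and argues for more transparent evaluation and modeling of how informative missingness affects downstream tasks, so that the LLM paradigm does not obscure a missingness mechanism that is known but often neglected in conventional machine learning.
Link: https://arxiv.org/abs/2512.00479
Authors: Yuta Kobayashi, Vincent Jeanselme, Shalmali Joshi
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Presented at ML4H 2025 - Findings Track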
Abstract:Data collection often reflects human decisions. In healthcare, for instance, a referral for a diagnostic test is influenced by the patient’s health, their preferences, available resources, and the practitioner’s recommendations. Despite the extensive literature on the informativeness of missingness, its implications on the performance of Large Language Models (LLMs) have not been studied. Through a series of experiments on data from Columbia University Medical Center, a large urban academic medical center, and MIMIC-IV, we demonstrate that patterns of missingness significantly impact zero-shot predictive performance. Notably, the explicit inclusion of missingness indicators at prompting benefits some while hurting other LLMs’ zero-shot predictive performance and calibration, suggesting an inconsistent impact. The proposed aggregated analysis and theoretical insights suggest that larger models benefit from these interventions, while smaller models can be negatively impacted. The LLM paradigm risks obscuring the impact of missingness, often neglected even in conventional ML, even further. We conclude that there is a need for more transparent accounting and systematic evaluation of the impact of representing (informative) missingness on downstream performance.
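The "explicit inclusion of missingness indicators at prompting" can be pictured as a choice made when a patient record is serialized into text. The sketch below is a hypothetical serializer: the feature names, the wording "not measured", and the function name are assumptions, not the prompts used in the paper.

```python
def serialize_record(record, features, indicate_missing):
    """Turn a record into prompt text. When indicate_missing is False,
    unobserved features are silently dropped; when True, they are listed
    with an explicit missingness indicator."""
    parts = []
    for name in features:
        value = record.get(name)
        if value is not None:
            parts.append(f"{name}: {value}")
        elif indicate_missing:
            parts.append(f"{name}: not measured")
    return "; ".join(parts)

# Hypothetical record in which two lab tests were never ordered.
features = ["age", "lactate", "creatinine"]
record = {"age": 67, "lactate": None}

without_indicators = serialize_record(record, features, indicate_missing=False)
with_indicators = serialize_record(record, features, indicate_missing=True)
```

Comparing a model's zero-shot predictions on the two serializations is one simple way to probe whether it uses the fact that a test was not ordered.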
[AI-131] FairMT: Fairness for Heterogeneous Multi-Task Learning
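[Quick read]: This paper addresses fairness in multi-task learning (MTL), particularly with heterogeneous tasks (classification, detection, regression) and partially missing labels, where existing methods struggle to model fairness in a unified way while optimizing task performance. The core challenges are threefold: conventional fairness methods target classification and do not extend to continuous outputs; MTL structures usually constrain only the shared representation and ignore bias propagated through task heads; and most methods treat fairness as a zero-sum game with utility, enforcing symmetric constraints that degrade performance for well-served groups. The key to the solution is the FairMT framework. Its Asymmetric Heterogeneous Fairness Constraint Aggregation mechanism consolidates the asymmetric fairness violations of different tasks into a unified constraint, and a primal-dual formulation jointly optimizes utility and fairness. In addition, a head-aware multi-objective optimization proxy provides a tractable descent direction that accounts for head-induced anisotropy. Across several homogeneous and heterogeneous MTL benchmarks, FairMT achieves substantial fairness gains while maintaining strong task performance.
Link: https://arxiv.org/abs/2512.00469
Authors: Guanyu Hu, Tangzheng Lian, Na Yan, Dimitrios Kollias, Xinyu Yang, Oya Celiktutan, Siyang Song, Zeyu Fu
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: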
Abstract:Fairness in machine learning has been extensively studied in single-task settings, while fair multi-task learning (MTL), especially with heterogeneous tasks (classification, detection, regression) and partially missing labels, remains largely unexplored. Existing fairness methods are predominantly classification-oriented and fail to extend to continuous outputs, making a unified fairness objective difficult to formulate. Further, existing MTL optimization is structurally misaligned with fairness: constraining only the shared representation, allowing task heads to absorb bias and leading to uncontrolled task-specific disparities. Finally, most work treats fairness as a zero-sum trade-off with utility, enforcing symmetric constraints that achieve parity by degrading well-served groups. We introduce FairMT, a unified fairness-aware MTL framework that accommodates all three task types under incomplete supervision. At its core is an Asymmetric Heterogeneous Fairness Constraint Aggregation mechanism, which consolidates task-dependent asymmetric violations into a unified fairness constraint. Utility and fairness are jointly optimized via a primal–dual formulation, while a head-aware multi-objective optimization proxy provides a tractable descent geometry that explicitly accounts for head-induced anisotropy. Across three homogeneous and heterogeneous MTL benchmarks encompassing diverse modalities and supervision regimes, FairMT consistently achieves substantial fairness gains while maintaining superior task utility. Code will be released upon paper acceptance.
[AI-132] Sample-Efficient Expert Query Control in Active Imitation Learning via Conformal Prediction
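[Quick read]: This paper addresses the high cost of expert action labeling in Active Imitation Learning (AIL), especially in GPU-intensive simulators, human-in-the-loop settings, and robot fleets that repeatedly visit near-duplicate states. The key to the solution is Conformalized Rejection Sampling for Active Imitation Learning (CRSAIL), which requests an expert label only when the visited state is under-represented in the expert-labeled dataset. CRSAIL measures state novelty by the distance between the current state and its K-th nearest expert state, and uses conformal prediction to set a single global threshold: the empirical (1−α) quantile of on-policy calibration scores. This gives a distribution-free calibration rule that links α to the expected query rate and makes α a task-agnostic tuning parameter. The strategy runs without real-time expert takeovers and substantially reduces how often the expert is consulted. On MuJoCo robotics tasks it cuts expert queries by up to 96% relative to DAgger and by up to 65% relative to prior AIL methods, and it is robust to the choice of α and K, which eases deployment on new systems with unknown dynamics.
Link: https://arxiv.org/abs/2512.00453
Authors: Arad Firouzkouhi (1), Omid Mirzaeedodangeh (2), Lars Lindemann (2) ((1) University of Southern California, (2) ETH Zürich)
Affiliations: University of Southern California; ETH Zürich
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: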
Abstract:Active imitation learning (AIL) combats covariate shift by querying an expert during training. However, expert action labeling often dominates the cost, especially in GPU-intensive simulators, human-in-the-loop settings, and robot fleets that revisit near-duplicate states. We present Conformalized Rejection Sampling for Active Imitation Learning (CRSAIL), a querying rule that requests an expert action only when the visited state is under-represented in the expert-labeled dataset. CRSAIL scores state novelty by the distance to the K-th nearest expert state and sets a single global threshold via conformal prediction. This threshold is the empirical (1−α) quantile of on-policy calibration scores, providing a distribution-free calibration rule that links α to the expected query rate and makes α a task-agnostic tuning knob. This state-space querying strategy is robust to outliers and, unlike safety-gate-based AIL, can be run without real-time expert takeovers: we roll out full trajectories (episodes) with the learner and only afterward query the expert on a subset of visited states. Evaluated on MuJoCo robotics tasks, CRSAIL matches or exceeds expert-level reward while reducing total expert queries by up to 96% vs. DAgger and up to 65% vs. prior AIL methods, with empirical robustness to α and K, easing deployment on novel systems with unknown dynamics.
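The querying rule in the abstract has three parts that fit in a few lines: a K-th nearest neighbor novelty score, a conformal threshold, and the comparison between them. The sketch below is an illustration with assumed function names and Euclidean distance. The finite-sample rank ceil((n+1)(1−α)) is the usual split-conformal convention and is an assumption here; the abstract only says "empirical (1−α) quantile".

```python
import numpy as np

def kth_nn_distance(state, expert_states, k):
    """Novelty score: distance from `state` to its k-th nearest expert state."""
    distances = np.linalg.norm(expert_states - state, axis=1)
    return float(np.sort(distances)[k - 1])

def conformal_threshold(calibration_scores, alpha):
    """Empirical (1 - alpha) quantile of the calibration scores, using the
    common split-conformal rank ceil((n + 1)(1 - alpha)), capped at n."""
    scores = np.sort(np.asarray(calibration_scores, dtype=float))
    n = len(scores)
    rank = min(n, int(np.ceil((n + 1) * (1 - alpha))))
    return float(scores[rank - 1])

def should_query(state, expert_states, k, threshold):
    """Ask the expert only when the state looks under-represented."""
    return kth_nn_distance(state, expert_states, k) > threshold

# Toy data: four labeled expert states and ten calibration scores.
expert_states = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
calibration_scores = np.arange(1, 11) / 10      # 0.1, 0.2, ..., 1.0
threshold = conformal_threshold(calibration_scores, alpha=0.2)
```

With this rule roughly an α fraction of on-policy states exceed the threshold, which is how α acts as a knob on the expected query rate.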
[AI-133] PEOAT: Personalization-Guided Evolutionary Question Assembly for One-Shot Adaptive Testing AAAI-2026
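[Quick read]: This paper addresses the practical limitations that the real-time, sequential nature of conventional Computerized Adaptive Testing (CAT) imposes, particularly in large-scale assessments where interaction costs are high and in sensitive domains such as psychological evaluation where noise and interference must be minimized. To this end, the authors introduce a new task, One-shot Adaptive Testing (OAT), whose goal is to select a fixed set of optimal items for each examinee in a single step rather than adapting dynamically. The key to the solution is PEOAT, a personalization-guided evolutionary question assembly framework built on combinatorial optimization. It first uses a personalization-aware initialization strategy with multi-strategy sampling to build a diverse, informative initial population; it then applies a cognitive-enhanced evolutionary mechanism, with schema-preserving crossover and cognitively guided mutation, to improve search efficiency; finally, a diversity-aware environmental selection mechanism maintains population diversity without sacrificing fitness, yielding efficient and accurate one-shot item selection.
Link: https://arxiv.org/abs/2512.00439
Authors: Xiaoshan Yu, Ziwei Huang, Shangshang Yang, Ziwen Wang, Haiping Ma, Xingyi Zhang
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: AAAI-2026, 9 pages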
Abstract:With the rapid advancement of intelligent education, Computerized Adaptive Testing (CAT) has attracted increasing attention by integrating educational psychology with deep learning technologies. Unlike traditional paper-and-pencil testing, CAT aims to efficiently and accurately assess examinee abilities by adaptively selecting the most suitable items during the assessment process. However, its real-time and sequential nature presents limitations in practical scenarios, particularly in large-scale assessments where interaction costs are high, or in sensitive domains such as psychological evaluations where minimizing noise and interference is essential. These challenges constrain the applicability of conventional CAT methods in time-sensitive or resource-constrained environments. To this end, we first introduce a novel task called one-shot adaptive testing (OAT), which aims to select a fixed set of optimal items for each test-taker in a one-time selection. We then propose PEOAT, a Personalization-guided Evolutionary question assembly framework for One-shot Adaptive Testing from the perspective of combinatorial optimization. Specifically, we first design a personalization-aware initialization strategy that integrates differences between examinee ability and exercise difficulty, using multi-strategy sampling to construct a diverse and informative initial population. Building on this, we propose a cognitive-enhanced evolutionary framework incorporating schema-preserving crossover and cognitively guided mutation to enable efficient exploration through informative signals. To maintain diversity without compromising fitness, we further introduce a diversity-aware environmental selection mechanism. The effectiveness of PEOAT is validated through extensive experiments on two datasets, complemented by case studies that uncover valuable insights.
[AI-134] An Approach to Joint Hybrid Decision Making between Humans and Artificial Intelligence
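[Quick read]: This paper addresses the lack of integrated frameworks for coordinating the heterogeneous capabilities of humans and Artificial Intelligence (AI), in particular how decision authority and responsibility should be allocated in human-AI interaction. The key to the solution is the Joint Hybrid Intelligence framework, which abstracts both humans and AI as decision-making agents and gives a general definition of intelligence based on decision-making competence. From this it derives an interrelated design space in which joint agent engineering integrates three subspaces (operator training, AI engineering, and interface design). At its core is the development of joint agent patterns, with "extended swarming" as a representative example for human-swarm interaction, so that humans and AI can cooperate effectively on complex tasks.
Link: https://arxiv.org/abs/2512.00420
Authors: Jonas D. Rockbach, Sven Fuchs, Maren Bennewitz
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: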
Abstract:Due to the progress in artificial intelligence, it is important to understand how capable artificial agents should be used when interacting with humans, since high level authority and responsibility often remain with the human agent. However, integrated frameworks are lacking that can account for heterogeneous agents and draw on different scientific fields, such as human-factors engineering and artificial intelligence. Therefore, joint hybrid intelligence is described as a framework abstracting humans and artificial intelligence as decision making agents. A general definition of intelligence is provided on the basis of decision making competence being applicable to agents of different sorts. This framework is used for proposing the interrelated design space of joint hybrid intelligence being aimed at integrating the heterogeneous capabilities of humans and artificial intelligence. At the core of this design space lies joint agent engineering with the goal of integrating the design subspaces operator training, artificial intelligence engineering, and interface design via developing joint agent patterns. The “extended swarming” approach to human-swarm interaction is discussed as an example of such a pattern.
[AI-135] Significant Other AI: Identity Memory and Emotional Regulation as Long-Term Relational Intelligence
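[Quick read]: This paper addresses the following problem: many people today lack Significant Others (SOs) who provide identity stabilization, emotional regulation, and narrative meaning-making, while existing empathic AI systems, which lack autobiographical memory, identity modeling, predictive emotional regulation, and narrative coherence, cannot take on such long-term relational roles. The key to the solution is to propose Significant Other Artificial Intelligence (SO-AI) as a new research domain. It draws on psychological and sociological theory to design AI architectures with identity awareness, long-term memory, proactive support, narrative co-construction, and ethical boundary enforcement, and it reframes AI-human relationships as long-term, identity-bearing partnerships, providing a theoretical basis and a practical path for investigating whether AI can responsibly augment individuals' relational stability.
Link: https://arxiv.org/abs/2512.00418
Authors: Sung Park
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: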
Abstract:Significant Others (SOs) stabilize identity, regulate emotion, and support narrative meaning-making, yet many people today lack access to such relational anchors. Recent advances in large language models and memory-augmented AI raise the question of whether artificial systems could support some of these functions. Existing empathic AIs, however, remain reactive and short-term, lacking autobiographical memory, identity modeling, predictive emotional regulation, and narrative coherence. This manuscript introduces Significant Other Artificial Intelligence (SO-AI) as a new domain of relational AI. It synthesizes psychological and sociological theory to define SO functions and derives requirements for SO-AI, including identity awareness, long-term memory, proactive support, narrative co-construction, and ethical boundary enforcement. A conceptual architecture is proposed, comprising an anthropomorphic interface, a relational cognition layer, and a governance layer. A research agenda outlines methods for evaluating identity stability, longitudinal interaction patterns, narrative development, and sociocultural impact. SO-AI reframes AI-human relationships as long-term, identity-bearing partnerships and provides a foundational blueprint for investigating whether AI can responsibly augment the relational stability many individuals lack today.
[AI-136] Red Teaming Large Reasoning Models
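[Quick read]: This paper addresses the novel safety and reliability risks that explicit chains of thought (CoT) introduce in Large Reasoning Models (LRMs) on multi-step reasoning tasks, such as CoT-hijacking and prompt-induced inefficiency, which existing evaluation methods do not fully capture. The key to the solution is RT-LRM, a unified benchmark for systematically assessing the trustworthiness of LRMs along three core dimensions: truthfulness, safety, and efficiency. It also introduces the training paradigm as a key analytical perspective to reveal how different training strategies systematically affect trustworthiness. Using a curated suite of 30 reasoning tasks and extensive experiments on 26 models, the study finds that LRMs generally face trustworthiness challenges and are more fragile than Large Language Models (LLMs) under reasoning-induced risks. These findings expose previously underexplored vulnerabilities and motivate a standardized toolbox for trustworthiness research.
Link: https://arxiv.org/abs/2512.00412
Authors: Jiawei Chen, Yang Yang, Chao Yu, Yu Tian, Zhi Cao, Linghao Li, Hang Su, Zhaoxia Yin
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 30 pages, 9 figures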
Abstract:Large Reasoning Models (LRMs) have emerged as a powerful advancement in multi-step reasoning tasks, offering enhanced transparency and logical consistency through explicit chains of thought (CoT). However, these models introduce novel safety and reliability risks, such as CoT-hijacking and prompt-induced inefficiencies, which are not fully captured by existing evaluation methods. To address this gap, we propose RT-LRM, a unified benchmark designed to assess the trustworthiness of LRMs. RT-LRM evaluates three core dimensions: truthfulness, safety and efficiency. Beyond metric-based evaluation, we further introduce the training paradigm as a key analytical perspective to investigate the systematic impact of different training strategies on model trustworthiness. We achieve this by designing a curated suite of 30 reasoning tasks from an observational standpoint. We conduct extensive experiments on 26 models and identify several valuable insights into the trustworthiness of LRMs. For example, LRMs generally face trustworthiness challenges and tend to be more fragile than Large Language Models (LLMs) when encountering reasoning-induced risks. These findings uncover previously underexplored vulnerabilities and highlight the need for more targeted evaluations. In addition, we release a scalable toolbox for standardized trustworthiness research to support future advancements in this important field. Our code and datasets will be open-sourced.
[AI-137] Balancing Efficiency and Fairness: An Iterative Exchange Framework for Multi-UAV Cooperative Path Planning
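[Quick read]: This paper addresses the trade-off between efficiency and fairness in Multi-UAV Cooperative Path Planning (MUCPP): minimizing total mission distance while balancing the workload across UAVs so that no individual UAV is overburdened. The key to the solution is an Iterative Exchange Framework that repeatedly exchanges tasks and refines paths, improving solution quality under feasibility and safety constraints. The framework builds a composite objective that combines total mission distance and makespan, and it generates a collision-free trajectory for each UAV with A* search over a terrain-aware configuration space, achieving a better balance between efficiency and fairness.
Link: https://arxiv.org/abs/2512.00410
Authors: Hongzong Li, Luwei Liao, Xiangguang Dai, Yuming Feng, Rong Feng, Shiqin Tang
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: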
Abstract:Multi-UAV cooperative path planning (MUCPP) is a fundamental problem in multi-agent systems, aiming to generate collision-free trajectories for a team of unmanned aerial vehicles (UAVs) to complete distributed tasks efficiently. A key challenge lies in achieving both efficiency, by minimizing total mission cost, and fairness, by balancing the workload among UAVs to avoid overburdening individual agents. This paper presents a novel Iterative Exchange Framework for MUCPP, balancing efficiency and fairness through iterative task exchanges and path refinements. The proposed framework formulates a composite objective that combines the total mission distance and the makespan, and iteratively improves the solution via local exchanges under feasibility and safety constraints. For each UAV, collision-free trajectories are generated using A* search over a terrain-aware configuration space. Comprehensive experiments on multiple terrain datasets demonstrate that the proposed method consistently achieves superior trade-offs between total distance and makespan compared to existing baselines.
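The composite objective and the local exchange step described in the abstract can be sketched as a small hill-climbing loop. Everything specific below is an assumption for illustration: straight-line distance stands in for the paper's A* paths over terrain, the objective is taken as total distance plus `weight` times the makespan, and the only move is shifting one task from one UAV to the end of another's route.

```python
import math

def route_length(depot, tasks):
    """Length of depot -> tasks in order -> depot (straight lines)."""
    points = [depot] + list(tasks) + [depot]
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

def composite_cost(depot, assignment, weight):
    """Total distance (efficiency) plus weighted longest route (fairness)."""
    lengths = [route_length(depot, tasks) for tasks in assignment]
    return sum(lengths) + weight * max(lengths)

def improve_by_exchange(depot, assignment, weight=1.0):
    """Move single tasks between UAVs for as long as the cost decreases."""
    assignment = [list(tasks) for tasks in assignment]
    best = composite_cost(depot, assignment, weight)
    improved = True
    while improved:
        improved = False
        for src in range(len(assignment)):
            for dst in range(len(assignment)):
                if src == dst:
                    continue
                for i in range(len(assignment[src])):
                    task = assignment[src].pop(i)
                    assignment[dst].append(task)
                    cost = composite_cost(depot, assignment, weight)
                    if cost < best - 1e-9:
                        best, improved = cost, True
                        break                      # keep the move
                    assignment[dst].pop()          # otherwise undo it
                    assignment[src].insert(i, task)
    return assignment, best

# Toy instance: one UAV starts with every task, the other with none.
depot = (0.0, 0.0)
start = [[(1.0, 0.0), (2.0, 0.0), (-2.0, 0.0)], []]
plan, cost = improve_by_exchange(depot, start)
```

On this toy instance the initial cost is 16 and a single exchange lowers it to 12 by handing the far-left task to the idle UAV. Such a loop only reaches a local optimum, and feasibility or safety checks would have to be added where a move is accepted.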
[AI-138] GreenPlanner: Practical Floorplan Layout Generation via an Energy-Aware and Function-Feasible Generative Framework
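[Quick read]: This paper addresses the difficulty of satisfying both spatial-functional compliance and energy efficiency when generating building floorplans. Existing methods produce visually plausible layouts but, lacking automated evaluation, often yield invalid results that violate constraints. The key to the solution is the GreenPlanner framework. It learns constraint priors from a labeled Design Feasibility Dataset and introduces a fast Practical Design Evaluator (PDE) that predicts energy performance and spatial-functional validity. PDE-guided filtering then produces a dataset of regulation-compliant green floorplans (GreenPD), on which a GreenFlow generator is trained with PDE feedback. The result is controllable, regulation-aware generative design with markedly better design efficiency and accuracy.
Link: https://arxiv.org/abs/2512.00406
Authors: Pengyu Zeng, Yuqin Dai, Jun Yin, Jing Zhong, Ziyang Han, Chaoyang Shi, ZhanXiang Jin, Maowei Jiang, Yuxing Han, Shuai Lu
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages, 6 figures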
Abstract:Building design directly affects human well-being and carbon emissions, yet generating spatial-functional and energy-compliant floorplans remains manual, costly, and non-scalable. Existing methods produce visually plausible layouts but frequently violate key constraints, yielding invalid results due to the absence of automated evaluation. We present GreenPlanner, an energy- and functionality-aware generative framework that unifies design evaluation and generation. It consists of a labeled Design Feasibility Dataset for learning constraint priors; a fast Practical Design Evaluator (PDE) for predicting energy performance and spatial-functional validity; a Green Plan Dataset (GreenPD) derived from PDE-guided filtering to pair user requirements with regulation-compliant layouts; and a GreenFlow generator trained on GreenPD with PDE feedback for controllable, regulation-aware generation. Experiments show that GreenPlanner accelerates evaluation by over 10^5× with 99% accuracy, eliminates invalid samples, and boosts design efficiency by 87% over professional architects.
[AI-139] From Coefficients to Directions: Rethinking Model Merging with Directional Alignment
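[Quick read]: This paper addresses the performance degradation in model merging that arises when the consistency of directional information in parameter space and feature space is ignored. Existing methods mostly rely on parameter decomposition or coefficient optimization and do not adequately account for differences in the dominant directions that models form during training. In particular, under Neural Collapse the directional structure of class features may differ across models, so simple weighted merging disrupts structural coherence across models and hurts performance. The paper proposes a unified geometric framework, Merging with Directional Alignment, whose core innovation is to align directional structures explicitly in both parameter space and feature space, improving the structural coherence and generalization of the merged model.
Link: https://arxiv.org/abs/2512.00391
Authors: Zhikang Chen, Sen Cui, Deheng Ye, Min Zhang, Gang Niu, Yu Zhang, Masashi Sugiyama, Tingting Zhu
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: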
Abstract:Model merging has emerged as a practical paradigm for integrating multiple independently trained models into a single model without joint retraining. Previous studies have demonstrated the effectiveness of combining parameters through strategies such as parameter decomposition, coefficient optimization, and subspace learning, significantly reducing the need for expensive joint training and achieving strong empirical performance across diverse tasks. However, these approaches predominantly treat merging as a problem of parameter space decomposition or fusion coefficient optimization, while overlooking the critical role of directional information in both parameter and feature spaces. In practice, naïve merging introduces inconsistencies in dominant parameter directions and disrupts structural coherence across models, which can degrade performance. Moreover, coefficient-based optimization methods implicitly assume compatible feature-space directions across models. However, Neural Collapse indicates that class features follow structured directional patterns, which may differ across independently trained models, making coefficient optimization alone insufficient. In this work, we emphasize the importance of directional alignment and introduce a unified geometric framework, Merging with Directional Alignment, which aligns directional structures consistently in both the parameter and feature spaces. Our analysis shows that directional alignment improves structural coherence, and extensive experiments across benchmarks, model scales, and task configurations further validate the effectiveness of our approach.
[AI-140] An Empirical Study on the Effectiveness of Incorporating Offline RL As Online RL Subroutines
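[Quick read]: This paper addresses the low sample efficiency of online Reinforcement Learning (online RL) by embedding offline Reinforcement Learning (offline RL) algorithms as subroutines in a tabula rasa online RL framework. The key to the solution is to build an offline dataset from the online agent's own historical interactions and use it to train or refine the policy, improving online learning efficiency. The authors further design several ways of incorporating offline RL (such as final policy recommendation and online fine-tuning) along with practical techniques that strengthen the framework. Experiments show that the method's effectiveness depends strongly on the nature of the task and that existing online fine-tuning methods are ineffective overall, which highlights the need for further work in this direction.
Link: https://arxiv.org/abs/2512.00383
Authors: Jianhai Su, Jinzhu Luo, Qi Zhang
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: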
Abstract:We take the novel perspective of incorporating offline RL algorithms as subroutines of tabula rasa online RL. This is feasible because an online learning agent can repurpose its historical interactions as offline dataset. We formalize this idea into a framework that accommodates several variants of offline RL incorporation such as final policy recommendation and online fine-tuning. We further introduce convenient techniques to improve its effectiveness in enhancing online learning efficiency. Our extensive and systematic empirical analyses show that 1) the effectiveness of the proposed framework depends strongly on the nature of the task, 2) our proposed techniques greatly enhance its effectiveness, and 3) existing online fine-tuning methods are overall ineffective, calling for more research therein.
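[AI-141] Evaluating LLMs in Open-Source Games NEURIPS2025
[Quick read]: This paper studies how cooperative strategies emerge and evolve in multi-agent games, focusing on program equilibria, which are inaccessible within the traditional normal-form setting. The key to its approach is to exploit the programming capabilities of large language models (LLMs) in open-source games, a game-theoretic setting in which agents submit interpretable, transparent, and verifiable programs as their strategies, so that complex interactions can be modeled and analyzed. The study finds that LLM agents in dynamic game environments spontaneously produce payoff-maximizing, cooperative, and deceptive strategies and show adaptive mechanisms, which suggests that open-source games are a viable environment for studying and steering the evolution of cooperative strategies in multi-agent dilemmas.
Link: https://arxiv.org/abs/2512.00371
Authors: Swadesh Sistla,Max Kleiman-Weiner
Affiliations: unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)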
Abstract:Large Language Models’ (LLMs) programming capabilities enable their participation in open-source games: a game-theoretic setting in which players submit computer programs in lieu of actions. These programs offer numerous advantages, including interpretability, inter-agent transparency, and formal verifiability; additionally, they enable program equilibria, solutions that leverage the transparency of code and are inaccessible within normal-form settings. We evaluate the capabilities of leading open- and closed-weight LLMs to predict and classify program strategies and evaluate features of the approximate program equilibria reached by LLM agents in dyadic and evolutionary settings. We identify the emergence of payoff-maximizing, cooperative, and deceptive strategies, characterize the adaptation of mechanisms within these programs over repeated open-source games, and analyze their comparative evolutionary fitness. We find that open-source games serve as a viable environment to study and steer the emergence of cooperative strategy in multi-agent dilemmas.
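The idea of a program equilibrium can be illustrated with a small sketch. The payoff values, the `act` function convention, and the two example programs below are assumptions made for illustration and do not come from the paper. Each player submits source code that can read the opponent's source; a program that cooperates only with an exact copy of itself makes mutual cooperation stable in a one-shot Prisoner's Dilemma.

```python
# Row player's payoff in a one-shot Prisoner's Dilemma (illustrative values).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

# Hypothetical submitted programs, represented as source strings.
CLIQUE_BOT = '''
def act(my_source, opponent_source):
    # Cooperate only when the opponent runs exactly the same code.
    return "C" if opponent_source == my_source else "D"
'''

DEFECT_BOT = '''
def act(my_source, opponent_source):
    # Always defect, regardless of the opponent's code.
    return "D"
'''


def run_program(source, opponent_source):
    """Execute a submitted program and return its move ("C" or "D")."""
    namespace = {}
    exec(source, namespace)  # acceptable for a toy demo; never exec untrusted code
    return namespace["act"](source, opponent_source)


def play(source_a, source_b):
    """Return the payoffs (player A, player B) for one match."""
    move_a = run_program(source_a, source_b)
    move_b = run_program(source_b, source_a)
    return PAYOFF[(move_a, move_b)], PAYOFF[(move_b, move_a)]


if __name__ == "__main__":
    print(play(CLIQUE_BOT, CLIQUE_BOT))
    print(play(CLIQUE_BOT, DEFECT_BOT))
```

In this toy setting, a player who deviates from `CLIQUE_BOT` to `DEFECT_BOT` is met with defection and earns less than under mutual cooperation, which is the kind of outcome that is unavailable when players choose plain actions.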
[AI-142] S2-KD: Semantic-Spectral Knowledge Distillation Spatiotemporal Forecasting
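[Quick read]: This paper addresses a limitation of conventional knowledge distillation (KD) for spatiotemporal forecasting: because it relies only on pixel-level signals, it cannot capture the semantic and causal relationships behind visual patterns. Existing methods preserve spectral properties reasonably well (high-frequency details and low-frequency trends) but lack a deeper understanding of what causes events, which limits generalization in complex non-stationary scenarios. The key to the solution is the S²-KD framework. It trains a privileged multimodal teacher that uses textual narratives from a Large Multimodal Model (LMM) to reason about the causes of events while decoupling spectral components in its latent space. A new distillation objective then transfers the unified semantic priors and spectral representations to a lightweight vision-only student, which produces predictions that are both spectrally accurate and semantically coherent without text input or extra architectural overhead at inference.
Link: https://arxiv.org/abs/2512.00366
Authors: Wenshuo Wang,Yaomin Shen,Yingjie Tan,Yihao Chen
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: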
Abstract:Spatiotemporal forecasting often relies on computationally intensive models to capture complex dynamics. Knowledge distillation (KD) has emerged as a key technique for creating lightweight student models, with recent advances like frequency-aware KD successfully preserving spectral properties (i.e., high-frequency details and low-frequency trends). However, these methods are fundamentally constrained by operating on pixel-level signals, leaving them blind to the rich semantic and causal context behind the visual patterns. To overcome this limitation, we introduce S^2-KD, a novel framework that unifies Semantic priors with Spectral representations for distillation. Our approach begins by training a privileged, multimodal teacher model. This teacher leverages textual narratives from a Large Multimodal Model (LMM) to reason about the underlying causes of events, while its architecture simultaneously decouples spectral components in its latent space. The core of our framework is a new distillation objective that transfers this unified semantic-spectral knowledge into a lightweight, vision-only student. Consequently, the student learns to make predictions that are not only spectrally accurate but also semantically coherent, without requiring any textual input or architectural overhead at inference. Extensive experiments on benchmarks like WeatherBench and TaxiBJ+ show that S^2-KD significantly boosts the performance of simple student models, enabling them to outperform state-of-the-art methods, particularly in long-horizon and complex non-stationary scenarios.
[AI-143] Debate with Images: Detecting Deceptive Behaviors in Multimodal Large Language Models
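[Quick read]: This paper addresses the detection and evaluation of covert deceptive behavior in Multimodal Large Language Models (MLLMs). As model capabilities grow, deception has spread from text to combined image-text settings, where it is stealthy and harmful, yet existing methods are almost entirely confined to text and offer no effective monitoring. The key contributions are MM-DeceptionBench, the first benchmark designed specifically to evaluate multimodal deception, and "debate with images", a new multi-agent debate monitoring framework. By forcing models to ground their arguments in visual evidence, the framework makes deceptive strategies markedly easier to detect; on GPT-4o it raises agreement with human judgments by 1.5x in Cohen's kappa and accuracy by 1.25x.
Link: https://arxiv.org/abs/2512.00349
Authors: Sitong Fang,Shiyi Hou,Kaile Wang,Boyuan Chen,Donghai Hong,Jiayi Zhou,Josef Dai,Yaodong Yang,Jiaming Ji
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: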
Abstract:Are frontier AI systems becoming more capable? Certainly. Yet such progress is not an unalloyed blessing but rather a Trojan horse: behind their performance leaps lie more insidious and destructive safety risks, namely deception. Unlike hallucination, which arises from insufficient capability and leads to mistakes, deception represents a deeper threat in which models deliberately mislead users through complex reasoning and insincere responses. As system capabilities advance, deceptive behaviours have spread from textual to multimodal settings, amplifying their potential harm. First and foremost, how can we monitor these covert multimodal deceptive behaviors? Nevertheless, current research remains almost entirely confined to text, leaving the deceptive risks of multimodal large language models unexplored. In this work, we systematically reveal and quantify multimodal deception risks, introducing MM-DeceptionBench, the first benchmark explicitly designed to evaluate multimodal deception. Covering six categories of deception, MM-DeceptionBench characterizes how models strategically manipulate and mislead through combined visual and textual modalities. On the other hand, multimodal deception evaluation is almost a blind spot in existing methods. Its stealth, compounded by visual-semantic ambiguity and the complexity of cross-modal reasoning, renders action monitoring and chain-of-thought monitoring largely ineffective. To tackle this challenge, we propose debate with images, a novel multi-agent debate monitor framework. By compelling models to ground their claims in visual evidence, this method substantially improves the detectability of deceptive strategies. Experiments show that it consistently increases agreement with human judgements across all tested models, boosting Cohen’s kappa by 1.5x and accuracy by 1.25x on GPT-4o.
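The agreement metric named in the abstract, Cohen's kappa, can be computed as in the sketch below. The formula is the standard one; the label values and the two example label lists are made up for illustration and are not results from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both raters labeled independently at their own rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical judgments: 1 = "deceptive", 0 = "not deceptive".
    human = [1, 1, 0, 0, 1, 0, 1, 0]
    monitor = [1, 1, 0, 0, 1, 0, 0, 1]
    print(cohens_kappa(human, monitor))
```

Kappa discounts the agreement that two raters would reach by chance, which is why it is often reported next to raw accuracy when a monitor is compared with human annotators.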
[AI-144] Echo-N1: Affective RL Frontier
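[Quick read]: This paper addresses the difficulty of applying reinforcement learning (RL) to conversation that is highly subjective, emotionally driven, and personality sensitive. RL work has long concentrated on tasks machines already handle well (such as mathematical reasoning and code generation) while neglecting emotionally grounded, personality-sensitive dialogue, which the authors regard as central to human intelligence. The key to the solution is what the authors describe as the first framework that infers user personality on the fly and optimizes model behavior toward personalized conversational preferences, turning a non-verifiable, subjective dialogue space into a tractable RL problem. The work also introduces what it calls the first dynamic emotional intelligence evaluation suite to quantify gains in interaction quality, and reports that the method makes conversations markedly more humanlike while remaining consistent and robust.
Link: https://arxiv.org/abs/2512.00344
Authors: Naifan Zhang,Ruihan Sun,Ruixi Su,Shiqi Ma,Shiya Zhang,Xianna Weng,Xiaofan Zhang,Yuhan Zhan,Yuyang Xu,Zhaohan Chen,Zhengyuan Pan,Ziyi Song
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: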
Abstract:The LLM field has spent a year perfecting RL for tasks machines already excel at, math, code, and deterministic reasoning, while completely sidestepping the domain that actually defines human intelligence: subjective, emotionally grounded, personality sensitive conversation. This space has often been regarded as inherently subjective and challenging to formalize, making it appear unsuitable for conventional RL pipelines. We show that it is not only possible and it is a solvable and transformative RL problem. We propose the first framework that infers user personality on the fly and optimizes model behavior toward personalized conversational preferences. Contrary to the widespread belief that RL collapses in non-verifiable settings, our method produces consistent, robust, and dramatic improvements in humanlike interaction quality. We also introduce the first dynamic emotional intelligence evaluation suite to quantify these gains. Our model, which is introduced as Echo-N1, behaves far above its base version and outperforming the proprietary Doubao 1.5 Character. This work establishes a new frontier for RL: optimizing models for the deeply subjective, deeply human dimensions of conversation.
[AI-145] CogEvo-Edu: Cognitive Evolution Educational Multi-Agent Collaborative System
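[Quick read]: This paper addresses the shortcomings of current conversational tutors for STEM education built on large language models (LLMs) in complex domains such as digital signal processing (DSP): they struggle to maintain coherent long-term student models, to manage heterogeneous knowledge bases, and to adapt teaching strategies dynamically. The key to the solution is CogEvo-Edu, a hierarchical multi-agent educational system that treats retrieval, memory, and control as a coupled cognitive evolution process realized by three cooperating layers. The Cognitive Perception Layer (CPL) maintains dual memories and performs confidence-weighted consolidation to build structured, self-correcting student profiles. The Knowledge Evolution Layer (KEL) assigns each knowledge chunk a spatiotemporal value that drives activation, semantic compression, and forgetting. The Meta-Control Layer (MCL) formulates tutoring as hierarchical sequential decision making, jointly adapting CPL/KEL hyperparameters through a dual inner-outer loop and orchestrating specialized agents. On the DSP-EduBench benchmark the architecture outperforms static RAG, simple memory, and single-agent baselines, raising the overall score from 5.32 to 9.23.
Link: https://arxiv.org/abs/2512.00331
Authors: Yefeng Wu,Yuchen Song,Yecheng Zhao,Ling Wu,Shan Wan
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: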
Abstract:Large language models (LLMs) are increasingly deployed as conversational tutors in STEM education, yet most systems still rely on a single LLM with a static retrieval-augmented generation (RAG) pipeline over course materials. This design struggles in complex domains such as digital signal processing (DSP), where tutors must maintain coherent long-term student models, manage heterogeneous knowledge bases, and adapt teaching strategies over extended interactions. We argue that retrieval, memory, and control should be treated as a coupled cognitive evolution process. We instantiate this view in CogEvo-Edu, a hierarchical educational multi-agent system comprising a Cognitive Perception Layer (CPL), a Knowledge Evolution Layer (KEL), and a Meta-Control Layer (MCL). CPL maintains dual memories and performs confidence-weighted consolidation to build structured, self-correcting student profiles under limited context. KEL assigns each knowledge chunk a spatiotemporal value that drives activation, semantic compression, and forgetting. MCL formulates tutoring as hierarchical sequential decision making, orchestrating specialized agents and jointly adapting CPL/KEL hyperparameters via a dual inner–outer loop. To evaluate CogEvo-Edu, we construct DSP-EduBench, a vertical benchmark for DSP tutoring with heterogeneous resources, simulated student profiles, and long-horizon interaction scripts. Using a three-model LLM-as-a-Judge ensemble, CogEvo-Edu raises the overall score from 5.32 to 9.23 and improves all six indicators over static RAG, simple memory, and a single-agent variant, demonstrating the value of jointly evolving student profiles, knowledge bases, and teaching policies.
[AI-146] RL-Struct: A Lightweight Reinforcement Learning Framework for Reliable Structured Output in LLMs
Link: https://arxiv.org/abs/2512.00319
Authors: Ruike Hu,Shulei Wu
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 23 pages, 14 figures. Model is available at this https URL
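[AI-147] Tracing Mathematical Proficiency Through Problem-Solving Processes
[Quick read]: This paper addresses the limited explainability of traditional Knowledge Tracing (KT) methods, which rely only on whether a response is correct and ignore the rich information in students' problem-solving processes. The key to the solution is a KT setting based on the problem-solving process (KT-PSP) together with StatusKT, a three-stage teacher-student-teacher LLM pipeline that extracts Mathematical Proficiency (MP) from a student's solution process as an intermediate signal. A teacher LLM first identifies problem-specific proficiency indicators, a student LLM then generates responses by simulating the student's solution process, and a teacher LLM finally evaluates those responses to determine mastery of each indicator. This models learners' multidimensional mathematical proficiency, supports prediction, and provides an interpretable basis for each prediction.
Link: https://arxiv.org/abs/2512.00311
Authors: Jungyang Park,Suho Kang,Jaewoo Park,Jaehong Kim,Jaewoo Shin,Seonjoon Park,Youngjae Yu
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 15 pages, 7 figures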
Abstract:Knowledge Tracing (KT) aims to model student’s knowledge state and predict future performance to enable personalized learning in Intelligent Tutoring Systems. However, traditional KT methods face fundamental limitations in explainability, as they rely solely on the response correctness, neglecting the rich information embedded in students’ problem-solving processes. To address this gap, we propose Knowledge Tracing Leveraging Problem-Solving Process (KT-PSP), which incorporates students’ problem-solving processes to capture the multidimensional aspects of mathematical proficiency. We also introduce KT-PSP-25, a new dataset specifically designed for the KT-PSP. Building on this, we present StatusKT, a KT framework that employs a teacher-student-teacher three-stage LLM pipeline to extract students’ MP as intermediate signals. In this pipeline, the teacher LLM first extracts problem-specific proficiency indicators, then a student LLM generates responses based on the student’s solution process, and a teacher LLM evaluates these responses to determine mastery of each indicator. The experimental results on KT-PSP-25 demonstrate that StatusKT improves the prediction performance of existing KT methods. Moreover, StatusKT provides interpretable explanations for its predictions by explicitly modeling students’ mathematical proficiency.
[AI-148] Adversarial Signed Graph Learning with Differential Privacy
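[Quick read]: This paper addresses privacy leakage through model parameters when learning on sensitive signed graphs, and in particular the poor fit of existing differential privacy (DP) methods to signed graphs. Existing methods rely on edge perturbation or gradient perturbation. The former tends to cause cascading errors in edge sign inference under balance theory; the latter has higher sensitivity because of node interdependence and the gradient polarity changes caused by sign flips, so more noise must be injected and utility suffers. The key to the solution is ASGL (Adversarial Signed Graph Learning), whose main ideas are: (1) decomposing the signed graph into positive and negative subgraphs to reduce gradient sensitivity; (2) a gradient-perturbed adversarial module that approximates the true signed connectivity distribution and mitigates cascading sign-inference errors; and (3) a breadth-first search tree strategy constrained by balance theory that predicts the signs between generated node pairs and enables gradient decoupling, lowering sensitivity and improving the privacy-utility trade-off.
Link: https://arxiv.org/abs/2512.00307
Authors: Haobin Ke,Sen Zhang,Qingqing Ye,Xun Ran,Haibo Hu
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: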
Abstract:Signed graphs with positive and negative edges can model complex relationships in social networks. Leveraging on balance theory that deduces edge signs from multi-hop node pairs, signed graph learning can generate node embeddings that preserve both structural and sign information. However, training on sensitive signed graphs raises significant privacy concerns, as model parameters may leak private link information. Existing protection methods with differential privacy (DP) typically rely on edge or gradient perturbation for unsigned graph protection. Yet, they are not well-suited for signed graphs, mainly because edge perturbation tends to cascading errors in edge sign inference under balance theory, while gradient perturbation increases sensitivity due to node interdependence and gradient polarity change caused by sign flips, resulting in larger noise injection. In this paper, motivated by the robustness of adversarial learning to noisy interactions, we present ASGL, a privacy-preserving adversarial signed graph learning method that preserves high utility while achieving node-level DP. We first decompose signed graphs into positive and negative subgraphs based on edge signs, and then design a gradient-perturbed adversarial module to approximate the true signed connectivity distribution. In particular, the gradient perturbation helps mitigate cascading errors, while the subgraph separation facilitates sensitivity reduction. Further, we devise a constrained breadth-first search tree strategy that fuses with balance theory to identify the edge signs between generated node pairs. This strategy also enables gradient decoupling, thereby effectively lowering gradient sensitivity. Extensive experiments on real-world datasets show that ASGL achieves favorable privacy-utility trade-offs across multiple downstream tasks.
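Two ingredients mentioned in the abstract, splitting a signed graph by edge sign and inferring the sign of a multi-hop node pair with balance theory, can be sketched as follows. This is a simplified illustration under stated assumptions: the edge-list format, the function names, and the depth limit are hypothetical, and the sketch contains none of ASGL's adversarial training or differential privacy machinery.

```python
from collections import deque


def split_by_sign(edges):
    """Split (u, v, sign) edges into positive and negative subgraph edge lists."""
    positive = [(u, v) for u, v, sign in edges if sign > 0]
    negative = [(u, v) for u, v, sign in edges if sign < 0]
    return positive, negative


def infer_sign(edges, source, target, max_depth=3):
    """Infer the sign between two nodes as the product of edge signs on a BFS path.

    Balance theory ("the enemy of my enemy is my friend") implies that the sign of
    a multi-hop pair is the product of the signs along the path connecting them.
    Returns +1, -1, or None if the target is not reachable within max_depth hops.
    """
    adjacency = {}
    for u, v, sign in edges:
        adjacency.setdefault(u, []).append((v, sign))
        adjacency.setdefault(v, []).append((u, sign))
    queue = deque([(source, 1, 0)])  # (node, accumulated sign, depth)
    visited = {source}
    while queue:
        node, sign, depth = queue.popleft()
        if node == target:
            return sign
        if depth == max_depth:
            continue  # depth constraint: do not expand beyond max_depth hops
        for neighbor, edge_sign in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, sign * edge_sign, depth + 1))
    return None


if __name__ == "__main__":
    toy_edges = [("a", "b", 1), ("b", "c", -1), ("c", "d", -1)]
    print(split_by_sign(toy_edges))
    print(infer_sign(toy_edges, "a", "d"))
```

Because each inferred sign depends on every edge along the path, one perturbed edge sign can flip many inferred signs, which is the cascading-error effect the abstract attributes to edge perturbation.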
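[AI-149] ChartPoint: Guiding MLLMs with Grounding Reflection for Chart Reasoning ICCV2025
[Quick read]: This paper addresses numerical hallucination in chart understanding by Multimodal Large Language Models (MLLMs), which arises because they depend on content extracted via optical character recognition (OCR) and perform poorly when a chart's textual annotations are sparse. The core challenge is that MLLMs have weak visual perception and reasoning over chart elements and their proportional relationships, and cannot localize the key positions that their reasoning refers to. The key to the solution is PointCoT, which integrates reflective interaction into chain-of-thought (CoT) reasoning: the model is prompted to generate bounding boxes and to re-render the chart from the location annotations, which establishes an explicit mapping between textual reasoning steps and visually grounded regions and improves the consistency between perception and reasoning.
Link: https://arxiv.org/abs/2512.00305
Authors: Zhengzhuo Xu,SiNan Du,Yiyan Qi,SiwenLu,Chengjin Xu,Chun Yuan,Jian Guo
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by ICCV 2025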
Abstract:Multimodal Large Language Models (MLLMs) have emerged as powerful tools for chart comprehension. However, they heavily rely on extracted content via OCR, which leads to numerical hallucinations when chart textual annotations are sparse. While existing methods focus on scaling instructions, they fail to address the fundamental challenge, i.e., reasoning with visual perception. In this paper, we identify a critical observation: MLLMs exhibit weak grounding in chart elements and proportional relationships, as evidenced by their inability to localize key positions to match their reasoning. To bridge this gap, we propose PointCoT, which integrates reflective interaction into chain-of-thought reasoning in charts. By prompting MLLMs to generate bounding boxes and re-render charts based on location annotations, we establish connections between textual reasoning steps and visual grounding regions. We further introduce an automated pipeline to construct ChartPoint-SFT-62k, a dataset featuring 19.2K high-quality chart samples with step-by-step CoT, bounding box, and re-rendered visualizations. Leveraging this data, we develop two instruction-tuned models, ChartPointQ2 and ChartPointQ2.5, which outperform state-of-the-art across several chart benchmarks, e.g., +5.04% on ChartBench.
[AI-150] Gradient Inversion in Federated Reinforcement Learning
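[Quick read]: This paper studies data privacy leakage in Federated Reinforcement Learning (FRL), where an attacker uses shared gradient information to reconstruct local private training data. Unlike in conventional supervised federated learning, a successful reconstruction in FRL must not only match the shared gradients but also agree with the environment's real transition dynamics (the data transition distribution). To meet this challenge the paper proposes a new attack, the Regularization Gradient Inversion Attack (RGIA). Its key idea is to add prior-knowledge-based regularization terms on states, rewards, and transition dynamics during optimization, which shrinks the solution space from a broad set containing spurious solutions to a constrained subset that satisfies both gradient matching and the true transition distribution. Experiments show that RGIA effectively constrains the transition distribution of the reconstructed data and successfully recovers local private data.
Link: https://arxiv.org/abs/2512.00303
Authors: Shenghong He
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: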
Abstract:Federated reinforcement learning (FRL) enables distributed learning of optimal policies while preserving local data privacy through gradient sharing. However, FRL faces the risk of data privacy leaks, where attackers exploit shared gradients to reconstruct local training data. Compared to traditional supervised federated learning, successful reconstruction in FRL requires the generated data not only to match the shared gradients but also to align with real transition dynamics of the environment (i.e., aligning with the real data transition distribution). To address this issue, we propose a novel attack method called Regularization Gradient Inversion Attack (RGIA), which enforces prior-knowledge-based regularization on states, rewards, and transition dynamics during the optimization process to ensure that the reconstructed data remain close to the true transition distribution. Moreover, we prove that the prior-knowledge-based regularization term narrows the solution space from a broad set containing spurious solutions to a constrained subset that satisfies both gradient matching and true transition dynamics. Extensive experiments on control tasks and autonomous driving tasks demonstrate that RGIA can effectively constrain reconstructed data transition distributions and thus successfully reconstruct local private data.
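[AI-151] FiCoTS: Fine-to-Coarse LLM-Enhanced Hierarchical Cross-Modality Interaction for Time Series Forecasting
[Quick read]: This paper addresses the limited understanding that Large Language Models (LLMs) have of time series in forecasting tasks, caused by the large semantic gap between the time series and text modalities. Most existing methods follow an "LLM-as-Predictor" paradigm and rely on modality alignment so that the LLM can understand time series, with limited success. This paper instead follows an "LLM-as-Enhancer" paradigm: the LLM is used only for its strength in text understanding, encoding the text modality to complement the time series modality, which avoids the semantic mismatch of feeding time series data directly to an LLM. The core contribution is FiCoTS, a fine-to-coarse multi-level cross-modality interaction framework. At the token level, a dynamic heterogeneous graph filters noise and aligns time series patches with text tokens. At the feature level, a global cross-attention mechanism lets each time series variable connect with relevant textual context. At the decision level, a gated network adaptively fuses the outputs of the two modalities for more robust predictions. Together the three modules integrate the modalities across three semantic levels and strengthen the support that textual information gives to forecasting.
Link: https://arxiv.org/abs/2512.00293
Authors: Yafei Lyu,Hao Zhou,Lu Zhang,Xu Yang,Zhiyong Liu
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: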
Abstract:Time series forecasting is central to data analysis and web technologies. The recent success of Large Language Models (LLMs) offers significant potential for this field, especially from the cross-modality aspect. Most methods adopt an LLM-as-Predictor paradigm, using LLM as the forecasting backbone and designing modality alignment mechanisms to enable LLM to understand time series data. However, the semantic information in the two modalities of time series and text differs significantly, making it challenging for LLM to fully understand time series data. To mitigate this challenge, our work follows an LLM-as-Enhancer paradigm to fully utilize the advantage of LLM in text understanding, where LLM is only used to encode text modality to complement time series modality. Based on this paradigm, we propose FiCoTS, an LLM-enhanced fine-to-coarse framework for multimodal time series forecasting. Specifically, the framework facilitates progressive cross-modality interaction by three levels in a fine-to-coarse scheme: First, in the token-level modality alignment module, a dynamic heterogeneous graph is constructed to filter noise and align time series patches with text tokens; Second, in the feature-level modality interaction module, a global cross-attention mechanism is introduced to enable each time series variable to connect with relevant textual contexts; Third, in the decision-level modality fusion module, we design a gated network to adaptively fuse the results of the two modalities for robust predictions. These three modules work synergistically to let the two modalities interact comprehensively across three semantic levels, enabling textual information to effectively support temporal prediction. Extensive experiments on seven real-world benchmarks demonstrate that our model achieves state-of-the-art performance. The codes will be released publicly.
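The decision-level step described in the abstract, a gate that adaptively mixes the forecasts of the two modalities, can be sketched in a few lines. The scalar gate parameters (`w_ts`, `w_text`, `bias`) and the per-step sigmoid gate are assumptions chosen for illustration; the abstract does not specify how the gated network in FiCoTS is parameterized.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(ts_forecast, text_forecast, w_ts, w_text, bias):
    """Fuse two forecasts step by step with a learned gate in (0, 1).

    gate -> 1 trusts the time series branch; gate -> 0 trusts the text branch.
    The gate parameters are hypothetical stand-ins for a trained gating network.
    """
    fused = []
    for y_ts, y_text in zip(ts_forecast, text_forecast):
        gate = sigmoid(w_ts * y_ts + w_text * y_text + bias)
        fused.append(gate * y_ts + (1.0 - gate) * y_text)
    return fused


if __name__ == "__main__":
    # With all gate parameters at zero the gate is 0.5, so the result is the mean.
    print(gated_fusion([2.0, 4.0], [0.0, 0.0], w_ts=0.0, w_text=0.0, bias=0.0))
```

Because the fused value is a convex combination of the two branch outputs, it always lies between them, so a noisy text branch cannot push the prediction outside the range spanned by the two forecasts.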
[AI-152] BioArc: Discovering Optimal Neural Architectures for Biological Foundation Models
Link: https://arxiv.org/abs/2512.00283
Authors: Yi Fang,Haoran Xu,Jiaxin Han,Sirui Ding,Yizhi Wang,Yue Wang,Xuan Wang
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments:
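[AI-153] Teleportation-Based Defenses for Privacy in Approximate Machine Unlearning
[Quick read]: This paper addresses privacy leakage in approximate machine unlearning: by comparing model parameters before and after unlearning, an attacker can mount membership inference or data reconstruction attacks and expose information about the forgotten data. The core of the solution is a "teleportation" defense called WARP. It exploits neural network symmetries to reparameterize the model, reducing the gradient energy of forget-set samples and increasing parameter dispersion. This obfuscates the signal of the forgotten data without hurting performance on retained data and substantially weakens an attacker's ability to identify or reconstruct it.
Link: https://arxiv.org/abs/2512.00272
Authors: Mohammad M Maheri,Xavier Cadet,Peter Chin,Hamed Haddadi
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: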
Abstract:Approximate machine unlearning aims to efficiently remove the influence of specific data points from a trained model, offering a practical alternative to full retraining. However, it introduces privacy risks: an adversary with access to pre- and post-unlearning models can exploit their differences for membership inference or data reconstruction. We show these vulnerabilities arise from two factors: large gradient norms of forget-set samples and the close proximity of unlearned parameters to the original model. To demonstrate their severity, we propose unlearning-specific membership inference and reconstruction attacks, showing that several state-of-the-art methods (e.g., NGP, SCRUB) remain vulnerable. To mitigate this leakage, we introduce WARP, a plug-and-play teleportation defense that leverages neural network symmetries to reduce forget-set gradient energy and increase parameter dispersion while preserving predictions. This reparameterization obfuscates the signal of forgotten data, making it harder for attackers to distinguish forgotten samples from non-members or recover them via reconstruction. Across six unlearning algorithms, our approach achieves consistent privacy gains, reducing adversarial advantage (AUC) by up to 64% in black-box and 92% in white-box settings, while maintaining accuracy on retained data. These results highlight teleportation as a general tool for reducing attack success in approximate unlearning.
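The phrase "neural network symmetries ... while preserving predictions" can be made concrete with the best-known example, the positive rescaling symmetry of ReLU networks. The sketch below only illustrates that general symmetry; whether WARP uses this particular symmetry, and how it chooses the transformation, is not stated in the abstract, so the bias-free two-layer network and the `teleport` helper are assumptions.

```python
def relu(vector):
    return [max(0.0, x) for x in vector]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def forward(w1, w2, x):
    """Two-layer ReLU network without biases: y = W2 * relu(W1 * x)."""
    return matvec(w2, relu(matvec(w1, x)))


def teleport(w1, w2, scales):
    """Move to different parameters that compute the same function.

    Hidden unit i has its incoming weights multiplied by scales[i] > 0 and its
    outgoing weights divided by scales[i]. Since relu(c * z) = c * relu(z) for
    c > 0, the network output is unchanged.
    """
    if any(c <= 0 for c in scales):
        raise ValueError("scales must be strictly positive")
    new_w1 = [[c * w for w in row] for row, c in zip(w1, scales)]
    new_w2 = [[w / c for w, c in zip(row, scales)] for row in w2]
    return new_w1, new_w2


if __name__ == "__main__":
    w1 = [[1.0, -2.0], [0.5, 1.5]]
    w2 = [[2.0, -1.0]]
    x = [1.0, 2.0]
    t1, t2 = teleport(w1, w2, scales=[4.0, 0.5])
    print(forward(w1, w2, x), forward(t1, t2, x))
```

The parameters move while the outputs stay fixed, and gradient norms change along such an orbit. That property is what makes a symmetry-based reparameterization a plausible way to alter the parameter differences and gradient magnitudes an attacker relies on without changing predictions.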
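[AI-154] Trification: A Comprehensive Tree-based Strategy Planner and Structural Verification for Fact-Checking
[Quick read]: This paper addresses two key problems in automated fact-checking systems: existing methods often fail to verify every component of a claim, and they lack a structured framework for logically connecting sub-task results into a final verdict. The key to the solution is a new framework called Trification. It first generates a comprehensive set of verification actions to ensure complete coverage of the claim, then organizes these actions into a dependency graph that models the logical relationships between verification steps and can be modified dynamically to adjust the verification strategy, improving the accuracy and robustness of fact-checking.
Link: https://arxiv.org/abs/2512.00267
Authors: Anab Maulana Barik,Shou Ziyi,Yang Kaiwen,Yang Qi,Shen Xin
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: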
Abstract:Technological advancement allows information to be shared in just a single click, which has enabled the rapid spread of false information. This makes automated fact-checking system necessary to ensure the safety and integrity of our online media ecosystem. Previous methods have demonstrated the effectiveness of decomposing the claim into simpler sub-tasks and utilizing LLM-based multi agent system to execute them. However, those models faces two limitations: they often fail to verify every component in the claim and lack of structured framework to logically connect the results of sub-tasks for a final prediction. In this work, we propose a novel automated fact-checking framework called Trification. Our framework begins by generating a comprehensive set of verification actions to ensure complete coverage of the claim. It then structured these actions into a dependency graph to model the logical interaction between actions. Furthermore, the graph can be dynamically modified, allowing the system to adapt its verification strategy. Experimental results on two challenging benchmarks demonstrate that our framework significantly enhances fact-checking accuracy, thereby advancing current state-of-the-art in automated fact-checking system.
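A dependency graph over verification actions, including a dynamic modification, can be sketched with the standard library's `graphlib`. The action names and the example claim structure are hypothetical, and the sketch covers only ordering; it does not reproduce how Trification generates actions or decides on modifications.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies):
    """Return one valid order in which to run verification actions.

    `dependencies` maps each action to the set of actions it depends on.
    """
    return list(TopologicalSorter(dependencies).static_order())


def respects_dependencies(order, dependencies):
    """Check that every action appears after all of its prerequisites."""
    position = {action: index for index, action in enumerate(order)}
    return all(
        position[dep] < position[action]
        for action, deps in dependencies.items()
        for dep in deps
    )


if __name__ == "__main__":
    # Hypothetical verification plan for a claim of the form "X won award Y".
    plan = {
        "verify_person_exists": set(),
        "verify_award_exists": set(),
        "verify_person_won_award": {"verify_person_exists", "verify_award_exists"},
        "final_verdict": {"verify_person_won_award"},
    }
    print(execution_order(plan))

    # Dynamic modification: add a new check and make the verdict depend on it.
    plan["verify_award_year"] = {"verify_award_exists"}
    plan["final_verdict"].add("verify_award_year")
    print(execution_order(plan))
```

Keeping the plan as an explicit graph means that a newly added check is automatically scheduled before every action that depends on it, and a cyclic plan is rejected with `graphlib.CycleError` instead of looping.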
[AI-155] A Hierarchical Hybrid AI Approach: Integrating Deep Reinforcement Learning and Scripted Agents in Combat Simulations
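[Quick read]: This paper addresses two difficulties in developing intelligent agents for combat simulations. Rule-based scripted methods are inflexible in dynamic, complex scenarios and handle unforeseen situations poorly, while deep reinforcement learning (DRL) is adaptive but suffers from black-box decision making and poor scalability to large simulation environments. The key to the solution is a hierarchical hybrid AI architecture in which scripted agents handle routine tactical-level decisions and DRL agents handle higher-level strategic decisions, improving overall adaptability and performance while retaining reliability and predictability.
Link: https://arxiv.org/abs/2512.00249
Authors: Scotty Black,Christian Darken
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: arXiv admin note: substantial text overlap with arXiv:2408.13333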
Abstract:In the domain of combat simulations in support of wargaming, the development of intelligent agents has predominantly been characterized by rule-based, scripted methodologies with deep reinforcement learning (RL) approaches only recently being introduced. While scripted agents offer predictability and consistency in controlled environments, they fall short in dynamic, complex scenarios due to their inherent inflexibility. Conversely, RL agents excel in adaptability and learning, offering potential improvements in handling unforeseen situations, but suffer from significant challenges such as black-box decision-making processes and scalability issues in larger simulation environments. This paper introduces a novel hierarchical hybrid artificial intelligence (AI) approach that synergizes the reliability and predictability of scripted agents with the dynamic, adaptive learning capabilities of RL. By structuring the AI system hierarchically, the proposed approach aims to utilize scripted agents for routine, tactical-level decisions and RL agents for higher-level, strategic decision-making, thus addressing the limitations of each method while leveraging their individual strengths. This integration is shown to significantly improve overall performance, providing a robust, adaptable, and effective solution for developing and training intelligent agents in complex simulation environments.
[AI-156] Polynomial Neural Sheaf Diffusion: A Spectral Filtering Approach on Cellular Sheaves
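[Quick read]: This paper addresses three problems in common Neural Sheaf Diffusion (NSD) implementations: SVD-based sheaf normalization and dense per-edge restriction maps make the cost grow with the stalk dimension, frequent rebuilds of the Laplacian are inefficient, and gradients are brittle. The core of the solution is Polynomial Neural Sheaf Diffusion (PolyNSD), whose propagation operator is a degree-K polynomial in a normalized sheaf Laplacian, evaluated efficiently with a stable three-term recurrence on a spectrally rescaled operator. A single layer has a K-hop receptive field (independent of the stalk dimension), and the trainable spectral response is learned as a convex combination of basis responses. Stability is enforced through convex mixtures, spectral rescaling, and residual/gated paths. The method reports new state-of-the-art results on both homophilic and heterophilic benchmarks, uses only diagonal restriction maps so that performance is decoupled from large stalk dimensions, and reduces runtime and memory use.
Link: https://arxiv.org/abs/2512.00242
Authors: Alessio Borgi,Fabrizio Silvestri,Pietro Liò
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (stat.ML)
Comments: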
Abstract:Sheaf Neural Networks equip graph structures with a cellular sheaf: a geometric structure which assigns local vector spaces (stalks) and a linear learnable restriction/transport maps to nodes and edges, yielding an edge-aware inductive bias that handles heterophily and limits oversmoothing. However, common Neural Sheaf Diffusion implementations rely on SVD-based sheaf normalization and dense per-edge restriction maps, which scale with stalk dimension, require frequent Laplacian rebuilds, and yield brittle gradients. To address these limitations, we introduce Polynomial Neural Sheaf Diffusion (PolyNSD), a new sheaf diffusion approach whose propagation operator is a degree-K polynomial in a normalised sheaf Laplacian, evaluated via a stable three-term recurrence on a spectrally rescaled operator. This provides an explicit K-hop receptive field in a single layer (independently of the stalk dimension), with a trainable spectral response obtained as a convex mixture of K+1 orthogonal polynomial basis responses. PolyNSD enforces stability via convex mixtures, spectral rescaling, and residual/gated paths, reaching new state-of-the-art results on both homophilic and heterophilic benchmarks, inverting the Neural Sheaf Diffusion trend by obtaining these results with just diagonal restriction maps, decoupling performance from large stalk dimension, while reducing runtime and memory requirements.
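The propagation rule described in the abstract, a degree-K polynomial in a rescaled Laplacian evaluated by a three-term recurrence and mixed with convex weights, can be sketched as follows. Several choices here are assumptions: Chebyshev polynomials stand in for the unspecified orthogonal basis, an ordinary normalized graph Laplacian (stalk dimension 1 with identity restriction maps) stands in for the sheaf Laplacian, and a softmax over `logits` produces the convex weights.

```python
import math


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_laplacian(adjacency):
    """Symmetric normalized Laplacian I - D^{-1/2} A D^{-1/2} (eigenvalues in [0, 2])."""
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                lap[i][j] = 1.0 if degrees[i] > 0 else 0.0
            elif adjacency[i][j] and degrees[i] > 0 and degrees[j] > 0:
                lap[i][j] = -adjacency[i][j] / math.sqrt(degrees[i] * degrees[j])
    return lap


def polynomial_filter(laplacian, signal, logits, lambda_max=2.0):
    """Apply sum_k softmax(logits)[k] * T_k(L_rescaled) to `signal`.

    L_rescaled = 2 L / lambda_max - I has spectrum in [-1, 1]; the Chebyshev terms
    follow T_0 x = x, T_1 x = L_rescaled x, T_k x = 2 L_rescaled T_{k-1} x - T_{k-2} x.
    """
    n = len(laplacian)
    rescaled = [
        [2.0 * laplacian[i][j] / lambda_max - (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    weights = softmax(logits)  # convex mixture: non-negative, sums to one
    previous = list(signal)  # T_0 x
    output = [weights[0] * value for value in previous]
    if len(weights) == 1:
        return output
    current = matvec(rescaled, signal)  # T_1 x
    output = [o + weights[1] * value for o, value in zip(output, current)]
    for k in range(2, len(weights)):
        following = [2.0 * a - b for a, b in zip(matvec(rescaled, current), previous)]
        output = [o + weights[k] * value for o, value in zip(output, following)]
        previous, current = current, following
    return output


if __name__ == "__main__":
    two_node_path = [[0, 1], [1, 0]]
    lap = normalized_laplacian(two_node_path)
    print(polynomial_filter(lap, [1.0, 0.0], logits=[0.0, 0.0, 0.0]))
```

Each recurrence step costs one matrix-vector product with the same operator, so a degree-K filter reaches K hops without building or normalizing a new operator per hop.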
[AI-157] CodeFlowLM: Incremental Just-In-Time Defect Prediction with Pretrained Language Models and Exploratory Insights into Defect Localization
Link: https://arxiv.org/abs/2512.00231
Authors: Monique Louise Monteiro,George G. Cabral,Adriano L. I. OLiveira
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
[AI-158] Reasoning Under Pressure: How do Training Incentives Influence Chain-of-Thought Monitorability?
Link: https://arxiv.org/abs/2512.00218
Authors: Matt MacDermott,Qiyao Wei,Rada Djoneva,Francis Rhys Ward
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
[AI-159] On the Prediction of Wi-Fi Performance through Deep Learning
Link: https://arxiv.org/abs/2512.00211
Authors: Gabriele Formis,Amanda Ericson,Stefan Forsstrom,Kyi Thar,Gianluca Cena,Stefano Scanzio
Affiliations: unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: preprint accepted, 4 pages, 2025
[AI-160] Constructing Efficient Fact-Storing MLPs for Transformers
Link: https://arxiv.org/abs/2512.00207
Authors: Owen Dugan,Roberto Garcia,Ronny Junkins,Jerry Liu,Dylan Zinsley,Sabri Eyuboglu,Atri Rudra,Chris Ré
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
[AI-161] A Rosetta Stone for AI Benchmarks
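[Quick Read]: This paper addresses the problem that AI benchmarks saturate within months or years of release, which makes it hard to track long-run trends in AI capabilities. The key to its solution is a statistical framework that maps results from different benchmarks onto a single numerical scale, enabling comparison of model capabilities and benchmark difficulties across time and tasks without assuming how capabilities change over time or with training compute. The framework serves as a "Rosetta Stone" for measuring the speed of AI progress, forecasting future capabilities, estimating gains in algorithmic efficiency, and detecting accelerations in AI progress.
Link: https://arxiv.org/abs/2512.00193
Authors: Anson Ho,Jean-Stanislas Denain,David Atanasov,Samuel Albanie,Rohin Shah
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: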
Abstract:Most AI benchmarks saturate within years or even months after they are introduced, making it hard to study long-run trends in AI capabilities. To address this challenge, we build a statistical framework that stitches benchmarks together, putting model capabilities and benchmark difficulties on a single numerical scale. This acts as a “Rosetta Stone”, allowing us to compare models across a wide range of abilities and time, even if they are not evaluated on the same benchmarks. Moreover, this works without assuming how capabilities evolve across time or with training compute. We demonstrate three applications of this framework. First, we use it to measure the speed of AI progress over time, and to forecast future AI capabilities. Second, we estimate the rate of improvements in algorithmic efficiency, finding estimates that are higher, but broadly consistent with prior work. Finally, we find that our approach can be used to detect rapid accelerations in AI progress.
[AI-162] Chunking Strategies for Multimodal AI Systems
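[Quick Read]: This paper addresses the lack of unified chunking strategies in multimodal data processing and the difficulty of balancing efficiency against accuracy. The core challenge is balancing granularity and contextual information across modalities (text, images, audio, video, and cross-modal data) while preserving cross-modal alignment and semantic consistency. The key to its solution is a comprehensive taxonomy and technical analysis that surveys classical and modern chunking methods (such as fixed-size token windowing, recursive text splitting, object-centric visual chunking, silence-based audio segmentation, and video scene detection) and evaluates their methodology, supporting tools (e.g., LangChain, Detectron2, PySceneDetect), benefits, and limitations, with particular attention to granularity-context trade-offs and multimodal alignment. It also covers emerging cross-modal chunking strategies that aim to improve generative coherence and robustness in real applications, and points to future research on adaptive, learning-based, and task-specific chunking.
Link: https://arxiv.org/abs/2512.00185
Authors: Shashanka B R,Mohith Charan R,Seema Banu F
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 45 pages, 5 figures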
Abstract:Our goal is to consolidate the landscape of multimodal chunking strategies, providing researchers and practitioners with a technical foundation and design space for developing more effective and efficient multimodal AI systems. This survey paves the way for innovations in robust chunking pipelines that scale with modality complexity, enhance processing accuracy, and improve generative coherence in real-world applications. This survey provides a comprehensive taxonomy and technical analysis of chunking strategies tailored for each modality: text, images, audio, video, and cross-modal data. We examine classical and modern approaches such as fixed-size token windowing, recursive text splitting, object-centric visual chunking, silence-based audio segmentation, and scene detection in videos. Each approach is analyzed in terms of its underlying methodology, supporting tools (e.g., LangChain, Detectron2, PySceneDetect), benefits, and challenges, particularly those related to granularity-context trade-offs and multimodal alignment. Furthermore, we explore emerging cross-modal chunking strategies that aim to preserve alignment and semantic consistency across disparate data types [4]. We also include comparative insights, highlight open problems such as asynchronous information density and noisy alignment signals, and identify opportunities for future research in adaptive, learning-based, and task-specific chunking.
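Fixed-size token windowing, the simplest strategy named in the abstract, can be sketched in a few lines. This is a generic illustration, not code from the survey; the function name `chunk_tokens` and the `overlap` parameter are assumptions made for the example.

```python
def chunk_tokens(tokens, size, overlap=0):
    """Split a token list into windows of `size` tokens.

    Consecutive windows share `overlap` tokens, which trades extra storage
    for more context at chunk boundaries.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks
```

A larger `size` keeps more context per chunk, while a smaller one gives finer retrieval granularity, which is the granularity-context trade-off the survey discusses.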
[AI-163] Orion-Bix: Bi-Axial Attention for Tabular In-Context Learning
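[Quick Read]: This paper addresses the difficulty of building general-purpose models for tabular data in real-world machine learning applications; the main challenges are mixed numeric and categorical fields, weak feature structure, and limited labeled data. The key to its solution is Orion-Bix, a tabular foundation model that combines biaxial attention with meta-learned in-context learning (ICL). The model alternates standard, grouped, hierarchical, and relational attention and fuses their outputs through multi-CLS summarization to capture local and global dependencies efficiently; a label-aware ICL head adapts on the fly and scales to large label spaces via hierarchical decision routing. The model is meta-trained on synthetic, structurally diverse tables with causal priors, so that it learns transferable inductive biases across heterogeneous data; on public benchmarks it outperforms gradient-boosting baselines and remains competitive with state-of-the-art tabular foundation models.
Link: https://arxiv.org/abs/2512.00181
Authors: Mohamed Bouadi,Pratinav Seth,Aditya Tanna,Vinay Kumar Sankarapu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: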
Abstract:Tabular data drive most real-world machine learning applications, yet building general-purpose models for them remains difficult. Mixed numeric and categorical fields, weak feature structure, and limited labeled data make scaling and generalization challenging. To this end, we introduce Orion-Bix, a tabular foundation model that combines biaxial attention with meta-learned in-context reasoning for few-shot tabular learning. Its encoder alternates standard, grouped, hierarchical, and relational attention, fusing their outputs through multi-CLS summarization to capture both local and global dependencies efficiently. A label-aware ICL head adapts on the fly and scales to large label spaces via hierarchical decision routing. Meta-trained on synthetically generated, structurally diverse tables with causal priors, Orion-Bix learns transferable inductive biases across heterogeneous data. Delivered as a scikit-learn compatible foundation model, it outperforms gradient-boosting baselines and remains competitive with state-of-the-art tabular foundation models on public benchmarks, showing that biaxial attention with episodic meta-training enables robust, few-shot-ready tabular learning. The model is publicly available at this https URL .
[AI-164] DeFi TrustBoost: Blockchain and AI for Trustworthy Decentralized Financial Decisions
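[Quick Read]: This paper addresses the trust and compliance difficulties that traditional lenders face when assessing the credit of borrowers from low-wealth households applying for small business loans. The core problem is balancing data privacy, regulatory compliance (e.g., data protection laws such as GDPR), resistance to adversarial attacks, and auditable automated decisions. The key to its solution is the Decentralized Finance (DeFi) TrustBoost Framework, which combines blockchain technology with Explainable AI, uses on-chain storage for tamper-proof auditing of decisions, and adopts a combined on-chain and off-chain data storage strategy to support trustworthy collaboration and efficient risk control within and across financial organizations.
Link: https://arxiv.org/abs/2512.00142
Authors: Swati Sachan,Dale S. Fickett
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP); General Finance (q-fin.GN)
Comments: 19 pages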
Abstract:This research introduces the Decentralized Finance (DeFi) TrustBoost Framework, which combines blockchain technology and Explainable AI to address challenges faced by lenders underwriting small business loan applications from low-wealth households. The framework is designed with a strong emphasis on fulfilling four crucial requirements of blockchain and AI systems: confidentiality, compliance with data protection laws, resistance to adversarial attacks, and compliance with regulatory audits. It presents a technique for tamper-proof auditing of automated AI decisions and a strategy for on-chain (inside-blockchain) and off-chain data storage to facilitate collaboration within and across financial organizations.
[AI-165] Asm2SrcEval: Evaluating Large Language Models for Assembly-to-Source Code Translation
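[Quick Read]: This paper addresses the lack of systematic evaluation benchmarks for assembly-to-source code translation, a task important to reverse engineering, cybersecurity, and software maintenance. The key to its solution is the first comprehensive evaluation framework covering five state-of-the-art large language models (LLMs), spanning lexical similarity (BLEU, ROUGE, METEOR), semantic alignment (BERTScore), fluency (Perplexity), and efficiency (prediction time). Quantitative and qualitative analyses reveal trade-offs between accuracy and efficiency across models, providing actionable insights and directions for future research.
Link: https://arxiv.org/abs/2512.00134
Authors: Parisa Hamedi,Hamed Jelodar,Samita Bai,Mohammad Meymani,Roozbeh Razavi-Far,Ali A. Ghorbani
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: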
Abstract:Assembly-to-source code translation is a critical task in reverse engineering, cybersecurity, and software maintenance, yet systematic benchmarks for evaluating large language models on this problem remain scarce. In this work, we present the first comprehensive evaluation of five state-of-the-art large language models on assembly-to-source translation. We assess model performance using a diverse set of metrics capturing lexical similarity (BLEU, ROUGE, and METEOR), semantic alignment (BERTScore), fluency (Perplexity), and efficiency (time prediction). Our results reveal clear trade-offs: while certain models excel in text similarity metrics, others demonstrate lower perplexity or faster inference times. We further provide qualitative analyses of typical model successes and failure cases, highlighting challenges such as control flow recovery and identifier reconstruction. Taken together, our benchmark offers actionable insights into the strengths and limitations of current large language models for program translation, establishing a foundation for future research in combining accuracy with efficiency for real-world applications.
[AI-166] Generating Verifiable CoT from Execution-Traces
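[Quick Read]: This paper addresses logical hallucination in code reasoning tasks caused by reliance on unverifiable Chain-of-Thought (CoT) prompting. In existing synthetic training data the reasoning steps look plausible but are generated by teacher models and do not faithfully reflect actual program execution, so models learn reasoning patterns that are superficially correct but logically flawed. The key to its solution is anchoring CoT generation directly in program execution traces: code instrumentation captures dynamic behavior, and the verified traces are converted into natural-language reasoning steps, so that every step is grounded in what the program actually computes and logical hallucinations are eliminated at the source. Experiments show that models trained on bi-directional trace-grounded data significantly outperform base models on code reasoning, generation, and explanation tasks.
Link: https://arxiv.org/abs/2512.00127
Authors: Shailja Thakur,Vaibhav Saxena,Rohan Kulkarni,Shivdeep Singh,Parameswaran Selvam,Hima Patel,Hiroshi Kanayama
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments: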
Abstract:Teaching language models to reason about code execution remains a fundamental challenge. While Chain-of-Thought (CoT) prompting has shown promise, current synthetic training data suffers from a critical weakness: the reasoning steps are often plausible-sounding explanations generated by teacher models, not verifiable accounts of what the code actually does. This creates a troubling failure mode where models learn to mimic superficially convincing but logically flawed reasoning patterns. We address this by grounding CoT generation directly in program execution traces. Our pipeline instruments code to capture its dynamic behavior, then narrates these verified execution traces into natural language rationales that are correct by construction. This execution-grounded approach ensures every reasoning step reflects what the program genuinely computes, eliminating logical hallucinations at the source. We evaluate our method on code reasoning tasks (forward reasoning on CruxEval and LiveCodeBench-Exec, backward reasoning on CruxEval-Input), as well as code generation and explanation tasks from HumanEval. Models trained on our bi-directional trace-grounded data achieve substantial improvements, with gains of up to 30 points on output prediction and 28 points on input prediction over base models, alongside improved explanation and code generation, demonstrating that verifiable reasoning fundamentally enhances model capabilities. this https URL
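The idea of instrumenting code and narrating its trace can be illustrated with Python's standard `sys.settrace` hook. This is a minimal sketch of the general technique, not the authors' pipeline; the helper names `trace_locals` and `narrate` and the narration wording are assumptions made for the example.

```python
import sys

def trace_locals(func, *args):
    """Run func(*args), recording (line offset, local variables) per executed line."""
    steps = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace frames of other functions
        if event == "line":
            # The "line" event fires before the line runs, so the snapshot
            # shows the state on arrival at that line.
            offset = frame.f_lineno - code.co_firstlineno
            steps.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)
    return result, steps

def narrate(steps, result):
    """Turn recorded steps into plain-language statements about variable changes."""
    lines, prev = [], {}
    for offset, env in steps:
        changed = {k: v for k, v in env.items() if k not in prev or prev[k] != v}
        if changed:
            desc = ", ".join(f"{k} = {v!r}" for k, v in changed.items())
            lines.append(f"Reaching line +{offset}: {desc}")
        prev = env
    lines.append(f"The function returns {result!r}")
    return lines
```

Because each statement is derived from recorded values, the narration cannot disagree with what the program computed, which is the property the abstract calls correct by construction.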
[AI-167] NetDeTox: Adversarial and Efficient Evasion of Hardware-Security GNNs via RL-LLM Orchestration
Link: https://arxiv.org/abs/2512.00119
Authors: Zeng Wang,Minghao Shao,Akashdeep Saha,Ramesh Karri,Johann Knechtel,Muhammad Shafique,Ozgur Sinanoglu
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
[AI-168] From RISC-V Cores to Neuromorphic Arrays: A Tutorial on Building Scalable Digital Neuromorphic Processors
Link: https://arxiv.org/abs/2512.00113
Authors: Amirreza Yousefzadeh
Affiliation: unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments:
[AI-169] Efficiently Sampling Interval Patterns from Numerical Databases
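[Quick Read]: This paper addresses efficient sampling of interval patterns in numerical databases, in particular the difficulty traditional pattern sampling methods have with the long-tail phenomenon. Its core solution is two new sampling methods, Fips and HFips. Fips uses a multi-step sampling procedure to draw interval patterns in proportion to their frequency; the key is accurately computing the number of interval patterns that cover each object. HFips additionally uses hyper-volume as a weighting factor, so that patterns are drawn in proportion to the product of frequency and hyper-volume, capturing high-value patterns more fully. Formal proofs show that Fips and HFips sample exactly in proportion to frequency and to frequency × hyper-volume, respectively, and experiments on several datasets confirm robustness to the long-tail problem and the quality of the sampled patterns.
Link: https://arxiv.org/abs/2512.00105
Authors: Djawad Bekkoucha,Lamine Diop,Abdelkader Ouali,Bruno Crémilleux,Patrice Boizumault
Affiliation: unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: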
Abstract:Pattern sampling has emerged as a promising approach for information discovery in large databases, allowing analysts to focus on a manageable subset of patterns. In this approach, patterns are randomly drawn based on an interestingness measure, such as frequency or hyper-volume. This paper presents the first sampling approach designed to handle interval patterns in numerical databases. This approach, named Fips, samples interval patterns proportionally to their frequency. It uses a multi-step sampling procedure and addresses a key challenge in numerical data: accurately determining the number of interval patterns that cover each object. We extend this work with HFips, which samples interval patterns proportionally to both their frequency and hyper-volume. These methods efficiently tackle the well-known long-tail phenomenon in pattern sampling. We formally prove that Fips and HFips sample interval patterns in proportion to their frequency and the product of hyper-volume and frequency, respectively. Through experiments on several databases, we demonstrate the quality of the obtained patterns and their robustness against the long-tail phenomenon.
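Frequency-proportional sampling is commonly built as a two-step draw: pick an object with probability proportional to the number of patterns covering it, then pick one of those patterns uniformly. The sketch below applies that generic scheme to interval patterns whose bounds are observed attribute values. It is an illustration under those assumptions, not the Fips algorithm itself, and the function names are hypothetical.

```python
import bisect
import random

def cover_counts(data):
    """Count, for each row, the interval patterns (hyper-rectangles whose
    bounds are observed values) that contain it."""
    n_attrs = len(data[0])
    values = [sorted({row[j] for row in data}) for j in range(n_attrs)]
    counts = []
    for row in data:
        c = 1
        for j, x in enumerate(row):
            r = bisect.bisect_right(values[j], x)  # number of values <= x
            # r choices of lower bound times (n - r + 1) choices of upper bound.
            c *= r * (len(values[j]) - r + 1)
        counts.append(c)
    return values, counts

def sample_pattern(data, rng=random):
    """Draw one interval pattern with probability proportional to its frequency."""
    values, counts = cover_counts(data)
    # Step 1: draw a row with probability proportional to its cover count.
    row = rng.choices(data, weights=counts, k=1)[0]
    # Step 2: draw uniformly among the patterns that cover this row.
    pattern = []
    for j, x in enumerate(row):
        r = bisect.bisect_right(values[j], x)
        lo = values[j][rng.randrange(r)]                      # a value <= x
        hi = values[j][rng.randrange(r - 1, len(values[j]))]  # a value >= x
        pattern.append((lo, hi))
    return pattern
```

A pattern covering f rows can be reached from each of those rows with probability 1/W, where W is the sum of the cover counts, so it is drawn with probability f/W. No enumeration of the pattern space is needed.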
[AI-170] Gold-Medal-Level Olympiad Geometry Solving with Efficient Heuristic Auxiliary Constructions
Link: https://arxiv.org/abs/2512.00097
Authors: Boyan Duan,Xiao Liang,Shuai Lu,Yaoxiang Wang,Yelong Shen,Kai-Wei Chang,Ying Nian Wu,Mao Yang,Weizhu Chen,Yeyun Gong
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Computational Geometry (cs.CG)
Comments:
[AI-171] Hyper-GoalNet: Goal-Conditioned Manipulation Policy Learning with HyperNetworks
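[Quick Read]: This paper addresses the difficulty of maintaining performance across diverse objectives and environments in goal-conditioned policy learning for robotic manipulation. The core challenge is generating effective policies reliably as goals and environments vary. The key to its solution is the Hyper-GoalNet framework, which uses hypernetworks to generate task-specific policy network parameters from goal specifications, decoupling goal interpretation from state processing: the former determines the network parameters, and the latter applies those parameters to the current observation. To improve the quality of the latent representation for effective policy generation, the method adds two complementary constraints: (1) a forward dynamics model that promotes predictability of state transitions, and (2) a distance-based constraint that ensures monotonic progression toward the goal state. This design significantly improves robustness and generalization in high-variability environments.
Link: https://arxiv.org/abs/2512.00085
Authors: Pei Zhou,Wanting Yao,Qian Luo,Xunzhe Zhou,Yanchao Yang
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: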
Abstract:Goal-conditioned policy learning for robotic manipulation presents significant challenges in maintaining performance across diverse objectives and environments. We introduce Hyper-GoalNet, a framework that generates task-specific policy network parameters from goal specifications using hypernetworks. Unlike conventional methods that simply condition fixed networks on goal-state pairs, our approach separates goal interpretation from state processing – the former determines network parameters while the latter applies these parameters to current observations. To enhance representation quality for effective policy generation, we implement two complementary constraints on the latent space: (1) a forward dynamics model that promotes state transition predictability, and (2) a distance-based constraint ensuring monotonic progression toward goal states. We evaluate our method on a comprehensive suite of manipulation tasks with varying environmental randomization. Results demonstrate significant performance improvements over state-of-the-art methods, particularly in high-variability conditions. Real-world robotic experiments further validate our method’s robustness to sensor noise and physical uncertainties. Code is available at: this https URL.
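The separation between goal interpretation and state processing can be sketched with a tiny hypernetwork in NumPy. The class name `HyperPolicy`, the layer sizes, and the random untrained weights are assumptions for illustration; the paper's architecture and latent-space constraints are not reproduced here.

```python
import numpy as np

class HyperPolicy:
    """A hypernetwork maps a goal to the weights of a small linear policy."""

    def __init__(self, goal_dim, obs_dim, act_dim, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.obs_dim, self.act_dim = obs_dim, act_dim
        n_params = obs_dim * act_dim + act_dim  # policy weight matrix plus bias
        # Hypernetwork weights (random here; learned in a real system).
        self.W1 = rng.normal(0.0, 0.5, (goal_dim, hidden))
        self.W2 = rng.normal(0.0, 0.5, (hidden, n_params))

    def generate(self, goal):
        """Goal interpretation: produce the policy parameters for this goal."""
        theta = np.tanh(goal @ self.W1) @ self.W2
        split = self.obs_dim * self.act_dim
        W = theta[:split].reshape(self.obs_dim, self.act_dim)
        b = theta[split:]
        return W, b

    def act(self, goal, obs):
        """State processing: apply the generated parameters to the observation."""
        W, b = self.generate(goal)
        return np.tanh(obs @ W + b)
```

In contrast to concatenating goal and observation into one fixed network, the observation here only ever passes through weights that were produced for the current goal.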
[AI-172] InF-ATPG: Intelligent FFR-Driven ATPG with Advanced Circuit Representation Guided Reinforcement Learning
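[Quick Read]: This paper addresses the problem that, with advanced semiconductor processes, traditional automatic test pattern generation (ATPG) needs long execution times to reach the expected fault coverage, which affects chip time-to-market. The key to its solution is InF-ATPG, an intelligent ATPG framework driven by fanout-free regions (FFRs): circuits are partitioned into FFRs, and a novel QGNN (Q-learning Graph Neural Network) architecture incorporating ATPG-specific features is designed to mitigate reward delay in the reinforcement learning (RL) model and improve circuit representation, significantly reducing backtracks and improving fault coverage.
Link: https://arxiv.org/abs/2512.00079
Authors: Bin Sun,Rengang Zhang,Zhiteng Chao,Zizhen Liu,Jianan Mu,Jing Ye,Huawei Li
Affiliation: unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 6 figures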
Abstract:Automatic test pattern generation (ATPG) is a crucial process in integrated circuit (IC) design and testing, responsible for efficiently generating test patterns. As semiconductor technology progresses, traditional ATPG struggles with long execution times to achieve the expected fault coverage, which impacts the time-to-market of chips. Recent machine learning techniques, like reinforcement learning (RL) and graph neural networks (GNNs), show promise but face issues such as reward delay in RL models and inadequate circuit representation in GNN-based methods. In this paper, we propose InF-ATPG, an intelligent FFR-driven ATPG framework that overcomes these challenges by using advanced circuit representation to guide RL. By partitioning circuits into fanout-free regions (FFRs) and incorporating ATPG-specific features into a novel QGNN architecture, InF-ATPG enhances test pattern generation efficiency. Experimental results show InF-ATPG reduces backtracks by 55.06% on average compared to traditional methods and 38.31% compared to the machine learning approach, while also improving fault coverage.
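Partitioning a netlist into fanout-free regions can be sketched as follows, assuming the usual definition: a stem is any signal with zero or more than one fanout, and every other signal belongs to the region of the stem reached by following its single fanout. The netlist encoding and the function name `partition_ffrs` are assumptions for this example, not the paper's implementation.

```python
from collections import defaultdict

def partition_ffrs(fanins):
    """Group signals into fanout-free regions, keyed by each region's root stem.

    `fanins` maps each gate output to the list of signals driving that gate.
    """
    fanouts = defaultdict(list)
    nodes = set(fanins)
    for node, inputs in fanins.items():
        for src in inputs:
            fanouts[src].append(node)
            nodes.add(src)

    root = {}

    def find_root(node):
        if node not in root:
            outs = fanouts[node]
            # Exactly one fanout: same region as the gate it feeds.
            # Otherwise (primary output or fanout stem): the node is a root.
            root[node] = find_root(outs[0]) if len(outs) == 1 else node
        return root[node]

    regions = defaultdict(set)
    for node in nodes:
        regions[find_root(node)].add(node)
    return dict(regions)
```

Inside one region, every signal has a single path to the root, which is what makes per-region reasoning about faults tractable.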
[AI-173] A CNN-Based Technique to Assist Layout-to-Generator Conversion for Analog Circuits
Link: https://arxiv.org/abs/2512.00070
Authors: Sungyu Jeong,Minsu Kim,Byungsub Kim
Affiliation: unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:
[AI-174] Enhancing Cognitive Robotics with Commonsense through LLM-Generated Preconditions and Subgoals
Link: https://arxiv.org/abs/2512.00069
Authors: Ohad Bachner,Bar Gamliel
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
[AI-175] SpeedAug: Policy Acceleration via Tempo-Enriched Policy and RL Fine-Tuning
Link: https://arxiv.org/abs/2512.00062
Authors: Taewook Nam,Sung Ju Hwang
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
[AI-176] KAN-SAs: Efficient Acceleration of Kolmogorov-Arnold Networks on Systolic Arrays
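[Quick Read]: This paper addresses the inefficient hardware acceleration of Kolmogorov-Arnold Networks (KANs), in particular the low utilization of conventional Systolic Array (SA) architectures when accelerating KAN inference and the difficulty of handling B-splines, whose definition is recursive. The key to its solution is KAN-SAs, a novel SA-based accelerator that uses a non-recursive B-spline implementation and exploits the intrinsic sparsity of KANs, raising SA utilization (up to 100%) and cutting clock cycles by up to 50%, so that both KAN inference and conventional deep neural networks (DNNs) are supported efficiently.
Link: https://arxiv.org/abs/2512.00055
Authors: Sohaib Errabii(TARAN),Olivier Sentieys(TARAN),Marcello Traiola(TARAN)
Affiliation: unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: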
Abstract:Kolmogorov-Arnold Networks (KANs) have garnered significant attention for their promise of improved parameter efficiency and explainability compared to traditional Deep Neural Networks (DNNs). KANs’ key innovation lies in the use of learnable non-linear activation functions, which are parametrized as splines. Splines are expressed as a linear combination of basis functions (B-splines). B-splines prove particularly challenging to accelerate due to their recursive definition. Systolic Array (SA)-based architectures have shown great promise as DNN accelerators thanks to their energy efficiency and low latency. However, their suitability and efficiency in accelerating KANs have never been assessed. Thus, in this work, we explore the use of SA architectures to accelerate KAN inference. We show that, while SAs can be used to accelerate part of the KAN inference, their utilization can drop to 30%. Hence, we propose KAN-SAs, a novel SA-based accelerator that leverages intrinsic properties of B-splines to enable efficient KAN inference. By including a non-recursive B-spline implementation and leveraging the intrinsic KAN sparsity, KAN-SAs enhances conventional SAs, enabling efficient KAN inference in addition to conventional DNNs. KAN-SAs achieves up to 100% SA utilization and up to a 50% reduction in clock cycles compared to conventional SAs of equivalent area, as shown by hardware synthesis results on a 28nm FD-SOI technology. We also evaluate different configurations of the accelerator on various KAN applications, confirming the improved efficiency of KAN inference provided by KAN-SAs.
[AI-177] Reinforcement Learning from Implicit Neural Feedback for Human-Aligned Robot Control
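[Quick Read]: This paper addresses the difficulty conventional reinforcement learning (RL) has in learning effective policies under sparse rewards, and the cognitive load and interaction disruption caused by existing reinforcement learning from human feedback (RLHF) methods that depend on explicit feedback mechanisms (such as button presses or preference labels). The key to its solution is a reinforcement learning from implicit human feedback (RLIHF) framework that uses error-related potentials (ErrPs) in non-invasive electroencephalography (EEG) signals as a continuous feedback source requiring no active user intervention, and uses a pre-trained decoder to convert raw EEG signals into probabilistic reward components, enabling effective policy learning even when external rewards are sparse.
Link: https://arxiv.org/abs/2512.00050
Authors: Suzie Kim
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Master’s thesis, Korea University, 2025. arXiv admin note: substantial text overlap with arXiv:2507.13171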
Abstract:Conventional reinforcement learning (RL) approaches often struggle to learn effective policies under sparse reward conditions, necessitating the manual design of complex, task-specific reward functions. To address this limitation, reinforcement learning from human feedback (RLHF) has emerged as a promising strategy that complements hand-crafted rewards with human-derived evaluation signals. However, most existing RLHF methods depend on explicit feedback mechanisms such as button presses or preference labels, which disrupt the natural interaction process and impose a substantial cognitive load on the user. We propose a novel reinforcement learning from implicit human feedback (RLIHF) framework that utilizes non-invasive electroencephalography (EEG) signals, specifically error-related potentials (ErrPs), to provide continuous, implicit feedback without requiring explicit user intervention. The proposed method adopts a pre-trained decoder to transform raw EEG signals into probabilistic reward components, enabling effective policy learning even in the presence of sparse external rewards. We evaluate our approach in a simulation environment built on the MuJoCo physics engine, using a Kinova Gen2 robotic arm to perform a complex pick-and-place task that requires avoiding obstacles while manipulating target objects. The results show that agents trained with decoded EEG feedback achieve performance comparable to those trained with dense, manually designed rewards. These findings validate the potential of using implicit neural feedback for scalable and human-aligned reinforcement learning in interactive robotics.
[AI-178] Socially aware navigation for mobile robots: a survey on deep reinforcement learning approaches
Link: https://arxiv.org/abs/2512.00049
Authors: Ibrahim Khalil Kabir,Muhammad Faizan Mysorewala
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
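[AI-179] Causal Reinforcement Learning based Agent-Patient Interaction with Clinical Domain Knowledge AAAI
[Quick Read]: This paper addresses three challenges that reinforcement learning (RL) faces in adaptive healthcare interventions such as dementia care: data scarcity, insufficient interpretability of decisions, and the difficulty of modeling the complex causal dynamics of patient states. The key to its solution is a new framework called Causal structure-aware Reinforcement Learning (CRL), which explicitly embeds causal discovery and reasoning into policy optimization, so that the agent learns and exploits a directed acyclic graph (DAG) describing the causal dependencies between human behavioral states and robot actions, yielding more efficient, interpretable, and robust decisions. Experiments show that CRL significantly outperforms conventional model-free RL baselines in a simulated robot-assisted cognitive care scenario and remains stable across weighting strategies and hyperparameter settings. A lightweight large language model (LLM) deployment further shows that embedding a fixed policy in the system prompt produces consistent, supportive dialogue without fine-tuning.
Link: https://arxiv.org/abs/2512.00048
Authors: Wenzheng Zhao,Ran Zhang,Ruth Palan Lopez,Shu-Fen Wung,Fengpei Yuan
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI workshop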
Abstract:Reinforcement Learning (RL) faces significant challenges in adaptive healthcare interventions, such as dementia care, where data is scarce, decisions require interpretability, and underlying patient-state dynamics are complex and causal in nature. In this work, we present a novel framework called Causal structure-aware Reinforcement Learning (CRL) that explicitly integrates causal discovery and reasoning into policy optimization. This method enables an agent to learn and exploit a directed acyclic graph (DAG) that describes the causal dependencies between human behavioral states and robot actions, facilitating more efficient, interpretable, and robust decision-making. We validate our approach in a simulated robot-assisted cognitive care scenario, where the agent interacts with a virtual patient exhibiting dynamic emotional, cognitive, and engagement states. The experimental results show that CRL agents outperform conventional model-free RL baselines by achieving higher cumulative rewards, maintaining desirable patient states more consistently, and exhibiting interpretable, clinically-aligned behavior. We further demonstrate that CRL’s performance advantage remains robust across different weighting strategies and hyperparameter settings. In addition, we demonstrate a lightweight LLM-based deployment: a fixed policy is embedded into a system prompt that maps inferred states to actions, producing consistent, supportive dialogue without LLM finetuning. Our work illustrates the promise of causal reinforcement learning for human-robot interaction applications, where interpretability, adaptiveness, and data efficiency are paramount.
[AI-180] Assessing Large Language Models in Generating RTL Design Specifications
Link: https://arxiv.org/abs/2512.00045
Authors: Hung-Ming Huang,Yu-Hsin Yang,Fu-Chieh Chang,Yun-Chia Hsu,Yin-Yu Lin,Ming-Fang Tsai,Chun-Chih Yang,Pei-Yuan Wu
Affiliation: unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:
[AI-181] Constrained Network Slice Assignment via Large Language Models NEURIPS2025
Link: https://arxiv.org/abs/2512.00040
Authors: Sagar Sudhakara,Pankaj Rajak
Affiliation: unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: Accepted at NeurIPS 2025 Workshop on AI and ML for Next-Generation Wireless Communications and Networking (AI4NextG), San Diego, CA
[AI-182] Refined Bayesian Optimization for Efficient Beam Alignment in Intelligent Indoor Wireless Environments
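[Quick Read]: This paper addresses fast and reliable beam alignment for millimeter-wave (mmWave) communication in intelligent indoor wireless environments, in particular the challenge of sustaining high-throughput links under mobility and blockage. Exhaustive beam training achieves optimal performance but at excessive overhead; indoors, dense scatterers and transceiver hardware imperfections introduce multipath and sidelobe leakage, spreading received power across many angles and weakening alignment algorithms designed for outdoor settings. The key to its solution is a Refined Bayesian Optimization (R-BO) framework that exploits the inherent structure of mmWave transceiver patterns, namely that received power rises gradually as the transmit and receive beams approach the optimal direction. R-BO combines a Gaussian Process (GP) surrogate with a Matern kernel and an Expected Improvement (EI) acquisition function, followed by a localized refinement search around the predicted optimum; the GP hyperparameters are re-optimized online to adapt to irregular variations in the measured angular power field caused by reflections and sidelobe leakage. Experiments at 43 receiver positions show 97.7% alignment accuracy (error below 10°), an average power loss below 0.3 dB, and 88% less probing overhead than exhaustive search, confirming its efficiency and adaptability in real-time intelligent indoor wireless environments.
Link: https://arxiv.org/abs/2512.00036
Authors: Parth Ashokbhai Shiroya,Amod Ashtekar,Swarnagowri Shashidhar,Mohammed E. Eltayeb
Affiliation: unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: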
Abstract:Future intelligent indoor wireless environments require fast and reliable beam alignment to sustain high-throughput links under mobility and blockage. Exhaustive beam training achieves optimal performance but is prohibitively costly. In indoor settings, dense scatterers and transceiver hardware imperfections introduce multipath and sidelobe leakage, producing measurable power across multiple angles and reducing the effectiveness of outdoor-oriented alignment algorithms. This paper presents a Refined Bayesian Optimization (R-BO) framework that exploits the inherent structure of mmWave transceiver patterns, where received power gradually increases as the transmit and receive beams converge toward the optimum. R-BO integrates a Gaussian Process (GP) surrogate with a Matern kernel and an Expected Improvement (EI) acquisition function, followed by a localized refinement around the predicted optimum. The GP hyperparameters are re-optimized online to adapt to irregular variations in the measured angular power field caused by reflections and sidelobe leakage. Experiments across 43 receiver positions in an indoor laboratory demonstrate 97.7% beam-alignment accuracy within 10 degrees, less than 0.3 dB average loss, and an 88% reduction in probing overhead compared to exhaustive search. These results establish R-BO as an efficient and adaptive beam-alignment solution for real-time intelligent indoor wireless environments.
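The ingredients named in the abstract (GP surrogate, Matern kernel, EI acquisition, local refinement) can be combined in a compact NumPy sketch. It is a simplification under stated assumptions: a one-dimensional angle grid rather than joint transmit/receive beams, a Matern 5/2 kernel with fixed hyperparameters (the paper re-optimizes them online), and hypothetical function names such as `align_beam` and a user-supplied `measure` callback.

```python
import math
import numpy as np

def matern52(a, b, length=10.0, var=1.0):
    """Matern 5/2 kernel between two 1-D arrays of angles (degrees)."""
    s = math.sqrt(5.0) * np.abs(a[:, None] - b[None, :]) / length
    return var * (1.0 + s + s**2 / 3.0) * np.exp(-s)

def gp_posterior(x_obs, y_obs, x_query, noise=1e-3):
    """Zero-mean GP posterior mean and standard deviation at x_query."""
    K = matern52(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = matern52(x_obs, x_query)
    mean = Ks.T @ np.linalg.solve(K, y_obs)
    prior_var = matern52(x_query, x_query).diagonal()
    var = prior_var - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mean, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mean, std, best):
    """EI for maximization, using the standard normal cdf and pdf."""
    z = (mean - best) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    return (mean - best) * cdf + std * pdf

def align_beam(measure, angles, n_init=4, n_iter=8, refine=1):
    """Probe a few angles, pick new ones by EI, then refine around the best."""
    angles = np.asarray(angles, dtype=float)
    probed = {}
    for i in np.linspace(0, len(angles) - 1, n_init).astype(int):
        probed[int(i)] = measure(angles[i])
    for _ in range(n_iter):
        idx = np.array(sorted(probed))
        y = np.array([probed[i] for i in idx])
        y = y - y.mean()  # center so the zero-mean GP prior is reasonable
        mean, std = gp_posterior(angles[idx], y, angles)
        ei = expected_improvement(mean, std, y.max())
        ei[idx] = -np.inf  # never re-probe a measured angle
        nxt = int(np.argmax(ei))
        probed[nxt] = measure(angles[nxt])
    # Localized refinement: measure the immediate neighbors of the best angle.
    best = max(probed, key=probed.get)
    for i in range(max(0, best - refine), min(len(angles), best + refine + 1)):
        if i not in probed:
            probed[i] = measure(angles[i])
    best = max(probed, key=probed.get)
    return angles[best], len(probed)
```

The second return value is the number of probes used, which can be compared with the grid size to estimate the overhead saved relative to exhaustive search. The sketch assumes `n_init + n_iter` is smaller than the number of candidate angles.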
[AI-183] Perturbation-mitigated USV Navigation with Distributionally Robust Reinforcement Learning
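[Quick Read]: This paper addresses the reduced robustness of sensor-based navigation for Unmanned Surface Vehicles (USVs) in unknown, complex marine environments caused by heteroscedastic observational noise. Existing methods based on Distributional Reinforcement Learning (DistRL) can navigate autonomously without prior environmental information, but they do not account for noise patterns that vary with environmental conditions, which harms value-function learning and safe navigation. The key to its solution is DRIQN, which combines Distributionally Robust Optimization (DRO) with implicit quantile networks and introduces explicit subgroup modeling in the experience replay buffer to incorporate heterogeneous noise sources and focus on risk-sensitive scenarios, thereby optimizing worst-case performance under natural environmental conditions.
Link: https://arxiv.org/abs/2512.00030
Authors: Zhaofan Zhang,Minghao Yang,Sihong Xie,Hui Xiong
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: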
Abstract:The robustness of Unmanned Surface Vehicles (USVs) is crucial when facing unknown and complex marine environments, especially when heteroscedastic observational noise poses significant challenges to sensor-based navigation tasks. Recently, Distributional Reinforcement Learning (DistRL) has shown promising results in some challenging autonomous navigation tasks without prior environmental information. However, these methods overlook situations where noise patterns vary across different environmental conditions, hindering safe navigation and disrupting the learning of value functions. To address the problem, we propose DRIQN, which integrates Distributionally Robust Optimization (DRO) with implicit quantile networks to optimize worst-case performance under natural environmental conditions. Leveraging explicit subgroup modeling in the replay buffer, DRIQN incorporates heterogeneous noise sources and targets robustness-critical scenarios. Experimental results based on the risk-sensitive environment demonstrate that DRIQN significantly outperforms state-of-the-art methods, achieving a +13.51% success rate, a -12.28% collision rate, +35.46% time saving, and +27.99% energy saving compared with the runner-up.
[AI-184] Large Language Model for Verilog Code Generation: Literature Review and the Road Ahead
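[Quick Read]: This paper addresses the fragmented state of research on large language models (LLMs) for Verilog code generation and the absence of a systematic survey. Existing work has explored LLMs for Register Transfer Level (RTL) hardware design, but methods, datasets, evaluation metrics, and alignment strategies have not been comprehensively organized. The key to its solution is a systematic literature review covering 102 high-quality papers (70 conference/journal papers and 32 preprints) from Software Engineering (SE), Artificial Intelligence (AI), and Electronic Design Automation (EDA). By answering four core research questions, it summarizes the technical approaches, evaluation practices, and limitations of existing LLM methods and proposes a roadmap for future research on LLM-assisted hardware design.
Link: https://arxiv.org/abs/2512.00020
Authors: Guang Yang,Wei Zheng,Xiang Chen,Dong Liang,Peng Hu,Yukui Yang,Shaohang Peng,Zhenghan Li,Jiahui Feng,Xiao Wei,Kexin Sun,Deyuan Ma,Haotian Cheng,Yiheng Shen,Xing Hu,Terry Yue Zhuo,David Lo
Affiliation: unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: WIP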
Abstract:Code generation has emerged as a critical research area at the intersection of Software Engineering (SE) and Artificial Intelligence (AI), attracting significant attention from both academia and industry. Within this broader landscape, Verilog, as a representative hardware description language (HDL), plays a fundamental role in digital circuit design and verification, making its automated generation particularly significant for Electronic Design Automation (EDA). Consequently, recent research has increasingly focused on applying Large Language Models (LLMs) to Verilog code generation, particularly at the Register Transfer Level (RTL), exploring how these AI-driven techniques can be effectively integrated into hardware design workflows. Although substantial research efforts have explored LLM applications in this domain, a comprehensive survey synthesizing these developments remains absent from the literature. This review addresses this gap by providing a systematic literature review of LLM-based methods for Verilog code generation, examining their effectiveness, limitations, and potential for advancing automated hardware design. The review covers research work from conferences and journals in the fields of SE, AI, and EDA, comprising 70 papers published at venues along with 32 high-quality preprint papers, for a total of 102 papers. By answering four key research questions, we aim to (1) identify the LLMs used for Verilog generation, (2) examine the datasets and metrics employed in evaluation, (3) categorize the techniques proposed for Verilog generation, and (4) analyze LLM alignment approaches for Verilog generation. Based on our findings, we have identified a series of limitations of existing studies. Finally, we have outlined a roadmap highlighting potential opportunities for future research endeavors in LLM-assisted hardware design.
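[AI-185] Architect in the Loop Agentic Hardware Design and Verification
Quick read: This paper addresses the challenges posed by the growing complexity of hardware design and verification, in particular how automation can raise design efficiency and reduce reliance on specialist labor. The core problem is that existing generative AI methods for hardware design are mostly limited to generating HDL code and testbenches for small modules (for example, those with a few leaf-level components) and lack systematic automation of the full processor design flow. The key to the solution is a framework for agentic automated processor design and verification with engineers in the loop: it uses large language models (LLMs) and specialized models for hierarchical decomposition, HDL generation, and cocotb test writing, and brings in engineer guidance during debugging and synthesis, yielding a modular, scalable design flow from the top-level architecture down to sub-modules. Experiments indicate that the method can design and verify typical processors without specialized hardware, using about one million inference tokens per processor, and that it scales well; it has been applied to simple processor prototypes (a LEGv8-like processor and a RISC-V-like 32-bit processor) and to preliminary system-on-chip (SoC) exploration.
Link: https://arxiv.org/abs/2512.00016
Authors: Mubarek Mohammed
Affiliations: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: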
Abstract: The ever-increasing complexity of the hardware design process demands improved hardware design and verification methodologies. With the advent of generative AI, various attempts have been made to automate parts of the design and verification process. Large language models (LLMs) as well as specialized models generate HDL and testbenches for small components that have only a few leaf-level components. However, there are only a few attempts to automate the entire processor design process. Hardware design demands hierarchical and modular design processes, and we apply this best practice systematically and effectively. We propose agentic automated processor design and verification with engineers in the loop. The agent, given an optional specification, tries to break the design down into sub-components, generates HDL and cocotb tests, and verifies the components with engineer guidance, especially during debugging and synthesis. We designed various digital systems using this approach, but selected two simple processors for demonstration in this work. The first is a simple LEGv8-like processor that was verified, synthesized, and programmed onto the DE-10 Lite FPGA. The second is a RISC-V-like 32-bit processor that was designed, verified, and synthesized in a similar manner but not programmed onto the DE-10 Lite. The process usually takes around a million inference tokens per processor, using a combination of reasoning models (e.g., gemini-pro) and non-reasoning models (e.g., gpt-5-mini) depending on the complexity of the task. This indicates that hardware design and verification experiments can be run cost-effectively without any specialized hardware. The approach is scalable: we even attempted a system-on-chip design, which we plan to explore in future work.
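[AI-186] The Impact of Concept Explanations and Interventions on Human-Machine Collaboration
Quick read: This paper asks how improving the interpretability of deep neural networks (DNNs) can strengthen human trust in a model and raise joint task accuracy in human-machine collaboration. Conventional DNNs are regarded as black boxes because their decision process is opaque. Concept Bottleneck Models (CBMs) improve interpretability by predicting human-defined concepts as an intermediate step, but their effectiveness with real human participants had not been verified. The key to the solution is the first human-centered experimental study of CBMs in collaborative settings. It finds that CBMs do improve human understanding of model decisions and human-machine alignment, but that this improved alignment does not translate into a significant gain in task accuracy, because understanding the model's decisions takes multiple interactions and misalignment between model and human decision logic can undermine interpretability and model effectiveness.
Link: https://arxiv.org/abs/2512.00015
Authors: Jack Furby, Dan Cunnington, Dave Braines, Alun Preece
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 24 pages, 5 figures, 8 tables. Accepted at The World Conference on eXplainable Artificial Intelligence 2025 (XAI-2025). The Version of Record of this chapter is published in Explainable Artificial Intelligence, and is available online at this https URL. The version published here includes minor typographical corrections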
Abstract:Deep Neural Networks (DNNs) are often considered black boxes due to their opaque decision-making processes. To reduce their opacity Concept Models (CMs), such as Concept Bottleneck Models (CBMs), were introduced to predict human-defined concepts as an intermediate step before predicting task labels. This enhances the interpretability of DNNs. In a human-machine setting greater interpretability enables humans to improve their understanding and build trust in a DNN. In the introduction of CBMs, the models demonstrated increased task accuracy as incorrect concept predictions were replaced with their ground truth values, known as intervening on the concept predictions. In a collaborative setting, if the model task accuracy improves from interventions, trust in a model and the human-machine task accuracy may increase. However, the result showing an increase in model task accuracy was produced without human evaluation and thus it remains unknown if the findings can be applied in a collaborative setting. In this paper, we ran the first human studies using CBMs to evaluate their human interaction in collaborative task settings. Our findings show that CBMs improve interpretability compared to standard DNNs, leading to increased human-machine alignment. However, this increased alignment did not translate to a significant increase in task accuracy. Understanding the model’s decision-making process required multiple interactions, and misalignment between the model’s and human decision-making processes could undermine interpretability and model effectiveness.
[AI-187] Cultural Prompting Improves the Empathy and Cultural Responsiveness of GPT-Generated Therapy Responses
Link: https://arxiv.org/abs/2512.00014
Authors: Serena Jinchen Xie, Shumenghui Zhai, Yanjing Liang, Jingyi Li, Xuehong Fan, Trevor Cohen, Weichao Yuwen
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:
[AI-188] Leveraging LLMs for Design Ideation: An AI Tool to Assist Creativity
Link: https://arxiv.org/abs/2512.00010
Authors: Rutvik Kokate, Pranati Kompella, Prasad Onkar
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures
[AI-189] Development and Benchmarking of a Blended Human-AI Qualitative Research Assistant
Link: https://arxiv.org/abs/2512.00009
Authors: Joseph Matveyenko, James Liu, John David Parsons, Prateek Puri
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 32 pages, 9 figures
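[AI-190] Enhancing Talent Search Ranking with Role-Aware Expert Mixtures and LLM-based Fine-Grained Job Descriptions
Quick read: This paper addresses the difficulty modern talent search systems have in capturing job-specific preferences, modeling recruiter behavior at a fine-grained level, and handling noise from subjective judgments. The key to the solution is twofold: (1) using large language models (LLMs) to extract fine-grained recruitment signals from job descriptions and historical hiring data; and (2) using a role-aware multi-gate mixture-of-experts (MoE) network to capture behavioral differences across recruiter roles. A multi-task learning module additionally optimizes click-through rate (CTR), conversion rate (CVR), and resume matching relevance jointly, which reduces noise and improves overall effectiveness.
Link: https://arxiv.org/abs/2512.00004
Authors: Jihang Li, Bing Xu, Zulong Chen, Chuanfei Xu, Minping Chen, Suyu Liu, Ying Zhou, Zeyi Wen
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: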
Abstract:Talent search is a cornerstone of modern recruitment systems, yet existing approaches often struggle to capture nuanced job-specific preferences, model recruiter behavior at a fine-grained level, and mitigate noise from subjective human judgments. We present a novel framework that enhances talent search effectiveness and delivers substantial business value through two key innovations: (i) leveraging LLMs to extract fine-grained recruitment signals from job descriptions and historical hiring data, and (ii) employing a role-aware multi-gate MoE network to capture behavioral differences across recruiter roles. To further reduce noise, we introduce a multi-task learning module that jointly optimizes click-through rate (CTR), conversion rate (CVR), and resume matching relevance. Experiments on real-world recruitment data and online A/B testing show relative AUC gains of 1.70% (CTR) and 5.97% (CVR), and a 17.29% lift in click-through conversion rate. These improvements reduce dependence on external sourcing channels, enabling an estimated annual cost saving of millions of CNY.
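[AI-191] Robust Detection of Synthetic Tabular Data under Schema Variability
Quick read: This paper addresses the detection of synthetic tabular data in realistic settings, that is, deciding whether a table is real or synthetic when table schemas are diverse and unknown in advance. The task had been overlooked because of the heterogeneity of table structure and the appearance of unseen formats at test time, and reliable detection methods were lacking. The key to the solution is a novel datum-wise Transformer architecture combined with a table-adaptation component. The architecture improves both AUC and accuracy by 7 points over the only previously published baseline, and the table-adaptation component adds a further 7 accuracy points, supporting the feasibility and robustness of detecting synthetic tabular data under real-world conditions.
Link: https://arxiv.org/abs/2509.00092
Authors: G. Charbel N. Kindji (MALT), Elisa Fromont (MALT), Lina Maria Rojas-Barahona, Tanguy Urvoy
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: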
Abstract: The rise of powerful generative models has sparked concerns over data authenticity. While detection methods have been extensively developed for images and text, the case of tabular data, despite its ubiquity, has been largely overlooked. Yet, detecting synthetic tabular data is especially challenging due to its heterogeneous structure and unseen formats at test time. We address the underexplored task of detecting synthetic tabular data "in the wild", i.e. when the detector is deployed on tables with variable and previously unseen schemas. We introduce a novel datum-wise transformer architecture that significantly outperforms the only previously published baseline, improving both AUC and accuracy by 7 points. By incorporating a table-adaptation component, our model gains an additional 7 accuracy points, demonstrating enhanced robustness. This work provides the first strong evidence that detecting synthetic tabular data in real-world conditions is feasible, and demonstrates substantial improvements over previous approaches. Following acceptance of the paper, we are finalizing the administrative and licensing procedures necessary for releasing the source code. This extended version will be updated as soon as the release is complete.
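[AI-192] Topological Order in Deep State
Quick read: This paper addresses the difficulty of studying strongly correlated topological phases theoretically, in particular the failure of conventional mean-field methods caused by their strong-coupling nature. The key to the solution is an attention-based deep neural network used as a variational wavefunction, which discovers fractional Chern insulator ground states purely through energy minimization, with no prior knowledge. The paper also introduces an efficient method that extracts the topological degeneracy, a hallmark of topological order, by decomposing a single optimized real-space wavefunction of a translation-invariant system into different many-body momentum sectors. The results support neural network variational Monte Carlo as a capable tool for exploring strongly correlated topological phases.
Link: https://arxiv.org/abs/2512.01863
Authors: Ahmed Abouelkomsan, Max Geier, Liang Fu
Affiliations: Unknown
Categories: Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Strongly Correlated Electrons (cond-mat.str-el); Artificial Intelligence (cs.AI)
Comments: 5 pages + 6 SM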
Abstract:Topologically ordered states are among the most interesting quantum phases of matter that host emergent quasi-particles having fractional charge and obeying fractional quantum statistics. Theoretical study of such states is however challenging owing to their strong-coupling nature that prevents conventional mean-field treatment. Here, we demonstrate that an attention-based deep neural network provides an expressive variational wavefunction that discovers fractional Chern insulator ground states purely through energy minimization without prior knowledge and achieves remarkable accuracy. We introduce an efficient method to extract ground state topological degeneracy – a hallmark of topological order – from a single optimized real-space wavefunction in translation-invariant systems by decomposing it into different many-body momentum sectors. Our results establish neural network variational Monte Carlo as a versatile tool for discovering strongly correlated topological phases.
[AI-193] From Black Hole to Galaxy: Neural Operator: Framework for Accretion and Feedback Dynamics NEURIPS2025
Link: https://arxiv.org/abs/2512.01576
Authors: Nihaal Bhojwani, Chuwei Wang, Hai-Yang Wang, Chang Sun, Elias R. Most, Anima Anandkumar
Affiliations: Unknown
Categories: High Energy Astrophysical Phenomena (astro-ph.HE); Astrophysics of Galaxies (astro-ph.GA); Artificial Intelligence (cs.AI); General Relativity and Quantum Cosmology (gr-qc)
Comments: Accepted at the ML4PS Workshop, NeurIPS 2025
[AI-194] Deep FlexQP: Accelerated Nonlinear Programming via Deep Unfolding
Link: https://arxiv.org/abs/2512.01565
Authors: Alex Oshin, Rahul Vodeb Ghosh, Augustinos D. Saravanos, Evangelos A. Theodorou
Affiliations: Unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
Comments:
[AI-195] Formal Verification of Noisy Quantum Reinforcement Learning Policies
Link: https://arxiv.org/abs/2512.01502
Authors: Dennis Gross
Affiliations: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
Comments:
[AI-196] Data-Driven Learnability Transition of Measurement-Induced Entanglement
Link: https://arxiv.org/abs/2512.01317
Authors: Dongheng Qian, Jing Wang
Affiliations: Unknown
Categories: Quantum Physics (quant-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI)
Comments: 7 pages, 4 figures
[AI-197] How do trout regulate patterns of muscle contraction to optimize propulsive efficiency during steady swimming
Link: https://arxiv.org/abs/2512.01218
Authors: Tao Li, Chunze Zhang, Weiwei Yao, Junzhao He, Ji Hou, Qin Zhou, Lu Zhang
Affiliations: Unknown
Categories: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
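[AI-198] Discriminative classification with generative features: bridging Naive Bayes and logistic regression
Quick read: This paper addresses the respective limitations of generative and discriminative classifiers: Naive Bayes benefits from generative modeling but rests on rigid assumptions, while logistic regression discriminates well but ignores the generative structure of the data. The key to the solution is the Smart Bayes framework, which embeds likelihood-ratio-based generative features in a logistic-regression-style discriminative classifier, so the model can use the generative model's density-ratio estimates together with the discriminative model's strong separation. Concretely, Smart Bayes replaces the fixed unit weights of Naive Bayes with data-driven coefficients and uses marginal log-density ratios as input features, which strengthens class separation. The paper also develops a spline-based estimator for univariate log-density ratios that is flexible, robust, and computationally efficient.
Link: https://arxiv.org/abs/2512.01097
Authors: Zachary Terner, Alexander Petersen, Yuedong Wang
Affiliations: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
Comments: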
Abstract:We introduce Smart Bayes, a new classification framework that bridges generative and discriminative modeling by integrating likelihood-ratio-based generative features into a logistic-regression-style discriminative classifier. From the generative perspective, Smart Bayes relaxes the fixed unit weights of Naive Bayes by allowing data-driven coefficients on density-ratio features. From a discriminative perspective, it constructs transformed inputs as marginal log-density ratios that explicitly quantify how much more likely each feature value is under one class than another, thereby providing predictors with stronger class separation than the raw covariates. To support this framework, we develop a spline-based estimator for univariate log-density ratios that is flexible, robust, and computationally efficient. Through extensive simulations and real-data studies, Smart Bayes often outperforms both logistic regression and Naive Bayes. Our results highlight the potential of hybrid approaches that exploit generative structure to enhance discriminative performance.
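The idea of feeding marginal log-density ratios into a logistic-regression-style classifier can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a per-class Gaussian density for each feature in place of the paper's spline-based estimator, fits the coefficients with plain batch gradient descent, and uses a made-up toy data set. All function names are hypothetical.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def fit_class_gaussians(X, y):
    """Per-class, per-feature (mean, variance) estimates (assumed stand-in for splines)."""
    params = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        stats = []
        for j in range(len(X[0])):
            col = [r[j] for r in rows]
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-6
            stats.append((mean, var))
        params[label] = stats
    return params

def log_density_ratio_features(X, params):
    """Map each raw feature x_j to log p_j(x_j | y=1) - log p_j(x_j | y=0)."""
    return [
        [gaussian_logpdf(v, *params[1][j]) - gaussian_logpdf(v, *params[0][j])
         for j, v in enumerate(x)]
        for x in X
    ]

def sigmoid(z):
    # Clip the argument so math.exp cannot overflow.
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))

def fit_logistic(Z, y, steps=500, lr=0.1):
    """Batch gradient descent on the mean log-loss; returns (weights, intercept)."""
    w, b = [1.0] * len(Z[0]), 0.0  # unit weights: the Naive Bayes starting point
    n = len(Z)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for z, t in zip(Z, y):
            err = sigmoid(b + sum(wi * zi for wi, zi in zip(w, z))) - t
            grad_b += err
            for j, zj in enumerate(z):
                grad_w[j] += err * zj
        b -= lr * grad_b / n
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
    return w, b

def predict(Z, w, b):
    return [1 if b + sum(wi * zi for wi, zi in zip(w, z)) > 0 else 0 for z in Z]

# Toy data (made up): the first feature separates the classes, the second barely does.
X = [[0.1, 5.0], [0.3, 4.0], [-0.2, 6.0], [0.0, 5.5],
     [2.1, 5.2], [1.8, 4.1], [2.4, 5.9], [2.0, 4.8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
params = fit_class_gaussians(X, y)
Z = log_density_ratio_features(X, params)
w, b = fit_logistic(Z, y)
```

With all weights fixed at 1 and the intercept set to the log prior ratio, the same decision rule is Naive Bayes; letting the weights be fitted is the relaxation the abstract describes.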
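[AI-199] On The Finetuning of MLIPs Through the Lens of Iterated Maps With BPTT
Quick read: This paper addresses two problems in structural relaxation of materials: the high computational cost of traditional first-principles calculations, and the data bottleneck created by training machine-learning interatomic potentials (MLIPs) on large amounts of high-quality force data. The key to the solution is an end-to-end fine-tuning method based on backpropagation through time (BPTT): a fully differentiable relaxation simulation loop is built on top of a pretrained MLIP, and the predicted final structures are optimized directly rather than only fitting forces. This reduces test error by nearly 50%, lowers the dependence on domain-specific data, and is highly robust to the relaxation setup.
Link: https://arxiv.org/abs/2512.01067
Authors: Evan Dramko, Yizhi Zhu, Aleksandar Krivokapic, Geoffroy Hautier, Thomas Reps, Christopher Jermaine, Anastasios Kyrillidis
Affiliations: Unknown
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 main pages, 15 pages in total. 6 tables, 6 figures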
Abstract: Vital to the creation of advanced materials is performing structural relaxations. Traditional approaches built on physics-derived first-principles calculations are computationally expensive, motivating the creation of machine-learning interatomic potentials (MLIPs). Traditional approaches to training MLIPs for structural relaxations involve training models to faithfully reproduce first-principles computed forces. We propose a fine-tuning method to be used on a pretrained MLIP in which we create a fully-differentiable end-to-end simulation loop that optimizes the predicted final structures directly. Trajectories are unrolled and gradients are tracked through the entire relaxation. We show that this method achieves substantial performance gains when applied to pretrained models, leading to a nearly 50% reduction in test error across the sample datasets. Interestingly, we show the process is robust to substantial variation in the relaxation setup, achieving negligibly different results across varied hyperparameter and procedural modifications. Experimental results indicate this is due to a "preference" of BPTT to modify the MLIP rather than the other trainable parameters. Of particular interest to practitioners is that this approach lowers the data requirements for producing an effective domain-specific MLIP, addressing a common bottleneck in practical deployment.
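[AI-200] An Approach to Variable Clustering: K-means in Transposed Data and its Relationship with Principal Component Analysis
Quick read: This paper addresses a gap in multivariate analysis: principal component analysis (PCA) and K-means clustering are usually applied independently or sequentially, and their relationship, especially when K-means clusters variables rather than observations, has been little explored. The key to the solution is to apply PCA to the original data and K-means to the transposed data set (where variables become observations) to obtain variable clusters, and then to quantify each cluster's contribution to each principal component using measures based on variable loadings. This gives an analytical framework linking variable clusters to principal components and a tool for understanding how variable clusters contribute to the main directions of variation identified by PCA.
Link: https://arxiv.org/abs/2512.00979
Authors: Victor Saquicela, Kenneth Palacio-Baus, Mario Chifla
Affiliations: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Presented at conference and to appear in the proceedings of the 2025 IEEE Chilean Conference on Electrical, Electronics Engineering, Information and Communication Technologies (ChileCon)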
Abstract:Principal Component Analysis (PCA) and K-means constitute fundamental techniques in multivariate analysis. Although they are frequently applied independently or sequentially to cluster observations, the relationship between them, especially when K-means is used to cluster variables rather than observations, has been scarcely explored. This study seeks to address this gap by proposing an innovative method that analyzes the relationship between clusters of variables obtained by applying K-means on transposed data and the principal components of PCA. Our approach involves applying PCA to the original data and K-means to the transposed data set, where the original variables are converted into observations. The contribution of each variable cluster to each principal component is then quantified using measures based on variable loadings. This process provides a tool to explore and understand the clustering of variables and how such clusters contribute to the principal dimensions of variation identified by PCA.
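The pipeline described above (K-means on the transposed data, PCA on the original data, then a loading-based contribution per variable cluster) can be sketched in a few lines. The abstract does not specify the exact contribution measure, so this sketch assumes the share of squared loadings per cluster on the first principal component; the data set is made up, the function names are hypothetical, and power iteration stands in for a full eigendecomposition.

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; initial centers are the first k points (arbitrary choice)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def first_pc_loadings(data, iters=200):
    """Unit-norm leading eigenvector of the sample covariance, via power iteration."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0] + [0.0] * (p - 1)
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def cluster_contributions(loadings, labels, k):
    """Share of the component's squared loadings that falls in each variable cluster."""
    shares = [0.0] * k
    for load, lab in zip(loadings, labels):
        shares[lab] += load ** 2
    return shares

# Rows are observations, columns are variables (made-up values).
# Columns 0 and 2 rise together; columns 1 and 3 fall together.
data = [
    [1.0, 5.0, 1.1, 5.2],
    [2.0, 4.0, 2.1, 3.9],
    [3.0, 3.0, 2.9, 3.1],
    [4.0, 2.0, 4.2, 2.0],
    [5.0, 1.0, 5.0, 0.9],
]
transposed = [list(col) for col in zip(*data)]   # variables become "observations"
labels = kmeans(transposed, k=2)                 # one cluster label per variable
loadings = first_pc_loadings(data)
shares = cluster_contributions(loadings, labels, k=2)
```

Because the loading vector has unit norm, the shares sum to 1, which makes them easy to read as fractions of a component attributable to each variable cluster. In practice the variables would usually be standardized first; that step is omitted here for brevity.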
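[AI-201] Orchestrating Rewards in the Era of Intelligence-Driven Commerce
Quick read: This paper addresses the difficulty of scaling traditional loyalty programs in intelligence-driven commerce: closed systems lack cross-brand interoperability, while coalition loyalty programs fail at a high rate (about 60% within 10 years) because of architectural limitations. The key to the solution is a hybrid framework in which brands keep sovereign control over their own programs while trustless exchange mechanisms provide cross-brand interoperability. This preserves the advantages of closed systems, adds the benefits of open systems, and uses a mathematical pricing model to support fair value exchange across reward systems.
Link: https://arxiv.org/abs/2512.00738
Authors: Paul Osemudiame Oamen, Robert Wesley, Pius Onobhayedo
Affiliations: Unknown
Categories: General Economics (econ.GN); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: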
Abstract:Despite their evolution from early copper-token schemes to sophisticated digital solutions, loyalty programs remain predominantly closed ecosystems, with brands retaining full control over all components. Coalition loyalty programs emerged to enable cross-brand interoperability, but approximately 60% fail within 10 years in spite of theoretical advantages rooted in network economics. This paper demonstrates that coalition failures stem from fundamental architectural limitations in centralized operator models rather than operational deficiencies, and argues further that neither closed nor coalition systems can scale in intelligence-driven paradigms where AI agents mediate commerce and demand trustless, protocol-based coordination that existing architectures cannot provide. We propose a hybrid framework where brands maintain sovereign control over their programs while enabling cross-brand interoperability through trustless exchange mechanisms. Our framework preserves closed system advantages while enabling open system benefits without the structural problems that doom traditional coalitions. We derive a mathematical pricing model accounting for empirically-validated market factors while enabling fair value exchange across interoperable reward systems.
[AI-202] Hierarchical Molecular Language Models (HMLMs)
Link: https://arxiv.org/abs/2512.00696
Authors: Hasi Hays, Yue Yu, William Richardson
Affiliations: Unknown
Categories: Molecular Networks (q-bio.MN); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments:
[AI-203] Layer Probing Improves Kinase Functional Prediction with Protein Language Models
Link: https://arxiv.org/abs/2512.00376
Authors: Ajit Kumar, IndraPrakash Jha
Affiliations: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 7 figures, 3 tables; includes code and dataset links
[AI-204] VCWorld: A Biological World Model for Virtual Cell Simulation
Link: https://arxiv.org/abs/2512.00306
Authors: Zhijian Wei, Runze Ma, Zichen Wang, Zhongmin Li, Shuotong Song, Shuangjia Zheng
Affiliations: Unknown
Categories: Cell Behavior (q-bio.CB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
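[AI-205] Comparative Evaluation of Generative AI Models for Chest Radiograph Report Generation in the Emergency Department
Quick read: This paper addresses the difficulty of quantifying the quality and diagnostic accuracy of chest radiograph (CXR) reports generated by medical vision-language models (VLMs). Reports from five open-source or commercial medical VLMs (AIRead, Lingshu, MAIRA-2, MedGemma, and MedVersa) were compared with radiologist-written reports in a blinded evaluation using four criteria (RADPEER score, clinical acceptability, hallucination rate, and language clarity), with finding-level analyses using CT as the reference standard. The key to the solution is a structured, multi-dimensional evaluation that combines a real-world data set with expert review. It reveals marked differences among VLMs in report quality and diagnostic sensitivity, with AIRead performing best overall, and provides empirical evidence for later clinical deployment.
Link: https://arxiv.org/abs/2512.00271
Authors: Woo Hyeon Lim, Ji Young Lee, Jong Hyuk Lee, Saehoon Kim, Hyungjin Kim
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: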
Abstract: Purpose: To benchmark open-source or commercial medical image-specific VLMs against real-world radiologist-written reports. Methods: This retrospective study included adult patients who presented to the emergency department between January 2022 and April 2025 and underwent same-day CXR and CT for febrile or respiratory symptoms. Reports from five VLMs (AIRead, Lingshu, MAIRA-2, MedGemma, and MedVersa) and radiologist-written reports were randomly presented and blindly evaluated by three thoracic radiologists using four criteria: RADPEER, clinical acceptability, hallucination, and language clarity. Comparative performance was assessed using generalized linear mixed models, with radiologist-written reports treated as the reference. Finding-level analyses were also performed with CT as the reference. Results: A total of 478 patients (median age, 67 years [interquartile range, 50-78]; 282 men [59.0%]) were included. AIRead demonstrated the lowest RADPEER 3b rate (5.3% [76/1434] vs. radiologists 13.9% [200/1434]; P<.001), whereas other VLMs showed higher disagreement rates (16.8-43.0%; P<.05). Clinical acceptability was the highest with AIRead (84.5% [1212/1434] vs. radiologists 74.3% [1065/1434]; P<.001), while other VLMs performed worse (41.1-71.4%; P<.05). Hallucinations were rare with AIRead, comparable to radiologists (0.3% [4/1425] vs. 0.1% [1/1425]; P=.21), but frequent with other models (5.4-17.4%; P<.05). Language clarity was higher with AIRead (82.9% [1189/1434]), Lingshu (88.0% [1262/1434]), and MedVersa (88.4% [1268/1434]) compared with radiologists (78.1% [1120/1434]; P<.05). Sensitivity varied substantially across VLMs for the common findings: AIRead, 15.5-86.7%; Lingshu, 2.4-86.7%; MAIRA-2, 6.0-72.0%; MedGemma, 4.8-76.7%; and MedVersa, 20.2-69.3%. Conclusion: Medical VLMs for CXR report generation exhibited variable performance in report quality and diagnostic measures.
[AI-206] Optimizing Information Asset Investment Strategies in the Exploratory Phase of the Oil and Gas Industry: A Reinforcement Learning Approach
Link: https://arxiv.org/abs/2512.00243
Authors: Paulo Roberto de Melo Barros Junior, Monica Alexandra Vilar Ribeiro De Meireles, Jose Luis Lima de Jesus Silva
Affiliations: Unknown
Categories: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
[AI-207] Tuning Universality in Deep Neural Networks
Link: https://arxiv.org/abs/2512.00168
Authors: Arsham Ghavasieh
Affiliations: Unknown
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Adaptation and Self-Organizing Systems (nlin.AO); Biological Physics (physics.bio-ph)
Comments:
[AI-208] RadDiff: Retrieval-Augmented Denoising Diffusion for Protein Inverse Folding
Link: https://arxiv.org/abs/2512.00126
Authors: Jin Han, Tianfan Fu, Wu-Jun Li
Affiliations: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Comments:
Machine Learning
[LG-0] ECO: Energy-Constrained Operator Learning for Chaotic Dynamics with Boundedness Guarantees
Link: https://arxiv.org/abs/2512.01984
Authors: Andrea Goertzen, Sunbochen Tang, Navid Azizan
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments:
Abstract:Chaos is a fundamental feature of many complex dynamical systems, including weather systems and fluid turbulence. These systems are inherently difficult to predict due to their extreme sensitivity to initial conditions. Many chaotic systems are dissipative and ergodic, motivating data-driven models that aim to learn invariant statistical properties over long time horizons. While recent models have shown empirical success in preserving invariant statistics, they are prone to generating unbounded predictions, which prevent meaningful statistics evaluation. To overcome this, we introduce the Energy-Constrained Operator (ECO) that simultaneously learns the system dynamics while enforcing boundedness in predictions. We leverage concepts from control theory to develop algebraic conditions based on a learnable energy function, ensuring the learned dynamics is dissipative. ECO enforces these algebraic conditions through an efficient closed-form quadratic projection layer, which provides provable trajectory boundedness. To our knowledge, this is the first work establishing such formal guarantees for data-driven chaotic dynamics models. Additionally, the learned invariant level set provides an outer estimate for the strange attractor, a complex structure that is computationally intractable to characterize. We demonstrate empirical success in ECO’s ability to generate stable long-horizon forecasts, capturing invariant statistics on systems governed by chaotic PDEs, including the Kuramoto–Sivashinsky and the Navier–Stokes equations.
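A closed-form projection of this kind can be illustrated with a small sketch. The abstract does not state ECO's actual algebraic conditions, so everything specific here is an assumption: the energy is fixed to V(x) = 0.5 * |x|^2 rather than learned, the dissipativity condition is taken to be x . f <= -alpha * V(x) + c with c >= 0, and the projection is the standard Euclidean projection of the predicted vector field onto that half-space. The function name and parameters are hypothetical.

```python
def project_to_dissipative(x, f, alpha=1.0, c=1.0):
    """Return the vector closest to f that satisfies x . f' <= -alpha * V(x) + c,
    where V(x) = 0.5 * |x|^2 (an assumed, fixed energy function)."""
    norm_sq = sum(xi * xi for xi in x)
    bound = -alpha * 0.5 * norm_sq + c
    inner = sum(xi * fi for xi, fi in zip(x, f))
    violation = inner - bound
    if violation <= 0.0 or norm_sq == 0.0:
        return list(f)  # already satisfies the condition (or x is the origin)
    scale = violation / norm_sq
    # Remove just enough of the component along x to land on the boundary.
    return [fi - scale * xi for fi, xi in zip(f, x)]

# A prediction that pushes the state outward gets pulled back onto the boundary.
x = [3.0, 4.0]
f_outward = [3.0, 4.0]
f_projected = project_to_dissipative(x, f_outward)

# A prediction that already dissipates energy is left unchanged.
f_inward = project_to_dissipative([1.0, 0.0], [-5.0, 2.0])
```

Under this assumed condition, the energy decreases whenever V(x) > c / alpha, so trajectories cannot leave the level set V <= c / alpha once inside it; this is the kind of boundedness argument the abstract refers to, shown here only for the simplified quadratic case.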
[LG-1] Feature-Based Semantics-Aware Scheduling for Energy-Harvesting Federated Learning
Link: https://arxiv.org/abs/2512.01983
Authors: Eunjeong Jeong, Giovanni Perin, Howard H. Yang, Nikolaos Pappas
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Comments: This paper is currently under review for presentation at a peer-reviewed conference
Abstract:Federated Learning (FL) on resource-constrained edge devices faces a critical challenge: The computational energy required for training Deep Neural Networks (DNNs) often dominates communication costs. However, most existing Energy-Harvesting FL (EHFL) strategies fail to account for this reality, resulting in wasted energy due to redundant local computations. For efficient and proactive resource management, algorithms that predict local update contributions must be devised. We propose a lightweight client scheduling framework using the Version Age of Information (VAoI), a semantics-aware metric that quantifies update timeliness and significance. Crucially, we overcome VAoI’s typical prohibitive computational cost, which requires statistical distance over the entire parameter space, by introducing a feature-based proxy. This proxy estimates model redundancy using intermediate-layer extraction from a single forward pass, dramatically reducing computational complexity. Experiments conducted under extreme non-IID data distributions and scarce energy availability demonstrate superior learning performance while achieving energy reduction compared to existing baseline selection policies. Our framework establishes semantics-aware scheduling as a practical and vital solution for EHFL in realistic scenarios where training costs dominate transmission costs.
[LG-2] Low-Rank Prehab: Preparing Neural Networks for SVD Compression
Link: https://arxiv.org/abs/2512.01980
Authors: Haoran Qin, Shansita Sharma, Ali Abbasi, Chayne Thrash, Soheil Kolouri
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Low-rank approximation methods such as singular value decomposition (SVD) and its variants (e.g., Fisher-weighted SVD, Activation SVD) have recently emerged as effective tools for neural network compression. In this setting, decomposition acts as a “surgical” intervention, followed by fine-tuning that serves as “rehab” to recover accuracy. Inspired by prehabilitation in surgery, we introduce a pre-compression fine-tuning stage, Low-Rank Prehab, that explicitly encourages low-rank structure in weight matrices while preserving task performance. By conditioning the model before SVD, Prehab steers weights toward spectrally compact regions of the parameter space, enabling smoother low-rank approximation and improved recovery. Experiments on large language models (LLMs) and other Transformer-based architectures, including Vision Transformers (ViTs), show that Prehab substantially reduces the immediate accuracy drop after compression and consistently improves post-finetuning performance. Across a wide range of compression ratios, our method outperforms state-of-the-art SVD-based techniques such as SVD-LLM, highlighting the importance of preparing models for compression rather than only improving the compression and recovery stages. Source code is available at this https URL
[LG-3] KV Pareto: Systems-Level Optimization of KV Cache and Model Compression for Long Context Inference
Link: https://arxiv.org/abs/2512.01953
Authors: Sai Gokhale, Devleena Das, Rajeev Patwari, Ashish Sirasao, Elliott Delaye
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Long-context Large Language Models (LLMs) face significant memory bottlenecks during inference due to the linear growth of key-value (KV) cache with sequence length. While individual optimization techniques like KV cache quantization, chunked prefill, and model weight quantization have shown promise, their joint effects and optimal configurations for edge deployment remain underexplored. We introduce KV Pareto, a systems-level framework that systematically maps the trade-off frontier between total memory consumption and task accuracy across these three complementary optimization techniques. Our framework evaluates multiple LLM architectures (Qwen, Llama, Mistral) with varying KV quantization schemes (int2/4/8, mixed-precision), granularities (per-token, per-tensor, per-block), and 4-bit weight quantization via AWQ. Our framework identifies model-specific Pareto-optimal configurations that achieve 68-78% total memory reduction with minimal (1-3%) accuracy degradation on long-context tasks. We additionally verify the selected frontiers on additional benchmarks of Needle-in-a-Haystack, GSM8k and MMLU as well as extended context lengths of up to 128k to demonstrate the practical need of joint optimization for efficient LLM inference.
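Mapping a trade-off frontier between memory and accuracy comes down to filtering out dominated configurations. The sketch below shows that filtering step only; the configuration names and all numbers are invented for illustration and are not results from the paper, and `Config` and `pareto_front` are hypothetical names.

```python
from typing import NamedTuple

class Config(NamedTuple):
    name: str
    memory_gb: float   # total memory footprint (lower is better)
    accuracy: float    # task accuracy (higher is better)

def pareto_front(configs):
    """Keep configs that no other config beats on both axes
    (lower-or-equal memory and higher-or-equal accuracy, at least one strict)."""
    front = []
    for c in configs:
        dominated = any(
            o.memory_gb <= c.memory_gb and o.accuracy >= c.accuracy
            and (o.memory_gb < c.memory_gb or o.accuracy > c.accuracy)
            for o in configs
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.memory_gb)

# Made-up measurements for five hypothetical compression settings.
configs = [
    Config("fp16-baseline", 16.0, 0.80),
    Config("int8-kv", 12.0, 0.80),
    Config("int4-kv+awq", 5.0, 0.78),
    Config("int2-kv+awq", 4.0, 0.60),
    Config("int4-kv", 10.0, 0.77),
]
front = pareto_front(configs)
front_names = [c.name for c in front]
```

In this toy example the fp16 baseline drops out because the int8 KV cache matches its accuracy with less memory, and the int4 KV-only setting drops out because adding weight quantization is both smaller and more accurate.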
[LG-4] A Footprint-Aware High-Resolution Approach for Carbon Flux Prediction Across Diverse Ecosystems
Link: https://arxiv.org/abs/2512.01917
Authors: Jacob Searcy, Anish Dulal, Scott Bridgham, Ashley Cordes, Lillian Aoki, Brendan Bohannan, Qing Zhu, Lucas C. R. Silva
Categories: Machine Learning (cs.LG)
Comments: 29 pages, 7 figures
Abstract:Natural climate solutions (NCS) offer an approach to mitigating carbon dioxide (CO2) emissions. However, monitoring the carbon drawdown of ecosystems over large geographic areas remains challenging. Eddy-flux covariance towers provide ground truth for predictive ‘upscaling’ models derived from satellite products, but many satellites now produce measurements on spatial scales smaller than a flux tower’s footprint. We introduce Footprint-Aware Regression (FAR), a first-of-its-kind, deep-learning framework that simultaneously predicts spatial footprints and pixel-level (30 m scale) estimates of carbon flux. FAR is trained on our AMERI-FAR25 dataset which combines 439 site years of tower data with corresponding Landsat scenes. Our model produces high-resolution predictions and achieves R2 = 0.78 when predicting monthly net ecosystem exchange on test sites from a variety of ecosystems.
[LG-5] Delays in Spiking Neural Networks: A State Space Model Approach
Link: https://arxiv.org/abs/2512.01906
Authors: Sanja Karilanova,Subhrakanti Dey,Ayça Özçelikkale
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Spiking neural networks (SNNs) are biologically inspired, event-driven models that are suitable for processing temporal data and offer energy-efficient computation when implemented on neuromorphic hardware. In SNNs, richer neuronal dynamics allow capturing more complex temporal dependencies, with delays playing a crucial role by allowing past inputs to directly influence present spiking behavior. We propose a general framework for incorporating delays into SNNs through additional state variables. The proposed mechanism enables each neuron to access a finite temporal input history. The framework is agnostic to neuron models and hence can be seamlessly integrated into standard spiking neuron models such as LIF and adLIF. We analyze how the duration of the delays and the learnable parameters associated with them affect the performance. We investigate the trade-offs in the network architecture due to additional state variables introduced by the delay mechanism. Experiments on the Spiking Heidelberg Digits (SHD) dataset show that the proposed mechanism matches the performance of existing delay-based SNNs while remaining computationally efficient. Moreover, the results illustrate that the incorporation of delays may substantially improve performance in smaller networks.
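The idea of giving a neuron access to a finite input history through extra state variables can be sketched as follows. This is a minimal illustration, not the paper's formulation: it assumes a discrete-time LIF neuron with soft reset, a shift-register delay line, and hypothetical names and values (`delay_weights`, `beta`, `threshold`).

```python
def simulate_delayed_lif(inputs, delay_weights, beta=0.9, threshold=1.0):
    """Simulate one discrete-time LIF neuron fed through a delay line.

    inputs: input current at each time step.
    delay_weights: delay_weights[k] scales the input from k steps ago.
    Returns a list with one output spike (0 or 1) per time step.
    """
    history = [0.0] * len(delay_weights)  # extra state: most recent inputs
    v = 0.0                               # membrane potential
    spikes = []
    for x in inputs:
        # Shift the delay line: newest input first, oldest input drops off.
        history = [x] + history[:-1]
        current = sum(w * h for w, h in zip(delay_weights, history))
        v = beta * v + current            # leaky integration
        if v >= threshold:
            spikes.append(1)
            v -= threshold                # soft reset after a spike
        else:
            spikes.append(0)
    return spikes


# A single input pulse at t = 0; only the 3-step-delayed tap has weight.
spikes = simulate_delayed_lif([1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
                              delay_weights=[0.0, 0.0, 0.0, 1.5])
print(spikes)
```

With only the last tap weighted, the pulse reaches the membrane three steps after it arrives, so the neuron fires at t = 3. Making `delay_weights` learnable would let training choose which past inputs matter.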
[LG-6] Provably Safe Model Updates
Link: https://arxiv.org/abs/2512.01899
Authors: Leo Elmecker-Plakolm,Pierre Fasterling,Philip Sosnin,Calvin Tsay,Matthew Wicker
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: 12 pages, 9 figures, submitted to IEEE SaTML 2026
Abstract:Safety-critical environments are inherently dynamic. Distribution shifts, emerging vulnerabilities, and evolving requirements demand continuous updates to machine learning models. Yet even benign parameter updates can have unintended consequences, such as catastrophic forgetting in classical models or alignment drift in foundation models. Existing heuristic approaches (e.g., regularization, parameter isolation) can mitigate these effects but cannot certify that updated models continue to satisfy required performance specifications. We address this problem by introducing a framework for provably safe model updates. Our approach first formalizes the problem as computing the largest locally invariant domain (LID): a connected region in parameter space where all points are certified to satisfy a given specification. While exact maximal LID computation is intractable, we show that relaxing the problem to parameterized abstract domains (orthotopes, zonotopes) yields a tractable primal-dual formulation. This enables efficient certification of updates - independent of the data or algorithm used - by projecting them onto the safe domain. Our formulation further allows computation of multiple approximately optimal LIDs, incorporation of regularization-inspired biases, and use of lookahead data buffers. Across continual learning and foundation model fine-tuning benchmarks, our method matches or exceeds heuristic baselines for avoiding forgetting while providing formal safety guarantees.
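For the orthotope (axis-aligned box) case, projecting an update onto the safe domain reduces to elementwise clipping, which is the Euclidean projection onto a box. The sketch below shows only that projection step; the bounds are hypothetical, and computing a certified box is the hard part that the paper addresses and this example does not.

```python
def project_onto_box(params, lower, upper):
    """Euclidean projection onto the box {p : lower <= p <= upper}."""
    return [min(max(p, lo), hi) for p, lo, hi in zip(params, lower, upper)]


# Hypothetical certified bounds and a proposed (unconstrained) update.
lower = [-0.5, 0.0, 1.0]
upper = [0.5, 2.0, 1.5]
proposed = [0.7, 1.2, 0.4]

safe = project_onto_box(proposed, lower, upper)
print(safe)
```

Because the projection only looks at the proposed parameters, it does not depend on the data or optimizer that produced them.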
[LG-7] Elastic Weight Consolidation for Knowledge Graph Continual Learning: An Empirical Evaluation NEURIPS2025
Link: https://arxiv.org/abs/2512.01890
Authors: Gaganpreet Jhajj,Fuhua Lin
Subjects: Machine Learning (cs.LG)
*Comments: Accepted to NORA Workshop at NeurIPS 2025
Abstract:Knowledge graphs (KGs) require continual updates as new information emerges, but neural embedding models suffer from catastrophic forgetting when learning new tasks sequentially. We evaluate Elastic Weight Consolidation (EWC), a regularization-based continual learning method, on KG link prediction using TransE embeddings on FB15k-237. Across multiple experiments with five random seeds, we find that EWC reduces catastrophic forgetting from 12.62% to 6.85%, a 45.7% reduction compared to naive sequential training. We observe that the task partitioning strategy affects the magnitude of forgetting: relation-based partitioning (grouping triples by relation type) exhibits 9.8 percentage points higher forgetting than randomly partitioned tasks (12.62% vs 2.81%), suggesting that task construction influences evaluation outcomes. While focused on a single embedding model and dataset, our results demonstrate that EWC effectively mitigates catastrophic forgetting in KG continual learning and highlight the importance of evaluation protocol design.
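As a reminder of what EWC adds to the training loss, the sketch below computes the standard quadratic penalty, (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2, and re-derives the relative figures quoted in the abstract. The parameter values, Fisher estimates, and `lam` are hypothetical; only the forgetting percentages come from the abstract.

```python
def ewc_penalty(params, old_params, fisher, lam):
    """Standard EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2."""
    return 0.5 * lam * sum(
        f * (p - q) ** 2 for p, q, f in zip(params, old_params, fisher)
    )


# Hypothetical two-parameter example: the first weight has high Fisher
# information, so moving it away from its old value is penalized heavily.
penalty = ewc_penalty(params=[1.0, 2.0], old_params=[0.0, 2.0],
                      fisher=[4.0, 1.0], lam=10.0)

# Figures from the abstract: forgetting drops from 12.62% to 6.85%.
relative_reduction = round(100 * (12.62 - 6.85) / 12.62, 1)
# Relation-based vs. random partitioning: 12.62% vs 2.81% forgetting.
partition_gap = round(12.62 - 2.81, 1)
print(penalty, relative_reduction, partition_gap)
```

The total training objective on a new task would be the task loss plus this penalty, with `fisher` estimated on the previous task.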
[LG-8] Domain-Decomposed Graph Neural Network Surrogate Modeling for Ice Sheets
Link: https://arxiv.org/abs/2512.01888
Authors: Adrienne M. Propp,Mauro Perego,Eric C. Cyr,Anthony Gruber,Amanda A. Howard,Alexander Heinlein,Panos Stinis,Daniel M. Tartakovsky
Subjects: Machine Learning (cs.LG); Mathematical Physics (math-ph); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
*Comments:
Abstract:Accurate yet efficient surrogate models are essential for large-scale simulations of partial differential equations (PDEs), particularly for uncertainty quantification (UQ) tasks that demand hundreds or thousands of evaluations. We develop a physics-inspired graph neural network (GNN) surrogate that operates directly on unstructured meshes and leverages the flexibility of graph attention. To improve both training efficiency and generalization properties of the model, we introduce a domain decomposition (DD) strategy that partitions the mesh into subdomains, trains local GNN surrogates in parallel, and aggregates their predictions. We then employ transfer learning to fine-tune models across subdomains, accelerating training and improving accuracy in data-limited settings. Applied to ice sheet simulations, our approach accurately predicts full-field velocities on high-resolution meshes, substantially reduces training time relative to training a single global surrogate model, and provides a ripe foundation for UQ objectives. Our results demonstrate that graph-based DD, combined with transfer learning, provides a scalable and reliable pathway for training GNN surrogates on massive PDE-governed systems, with broad potential for application beyond ice sheet dynamics.
[LG-9] New Spiking Architecture for Multi-Modal Decision-Making in Autonomous Vehicles
Link: https://arxiv.org/abs/2512.01882
Authors: Aref Ghoreishee,Abhishek Mishra,Lifeng Zhou,John Walsh,Nagarajan Kandasamy
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:This work proposes an end-to-end multi-modal reinforcement learning framework for high-level decision-making in autonomous vehicles. The framework integrates heterogeneous sensory input, including camera images, LiDAR point clouds, and vehicle heading information, through a cross-attention transformer-based perception module. Although transformers have become the backbone of modern multi-modal architectures, their high computational cost limits their deployment in resource-constrained edge environments. To overcome this challenge, we propose a spiking temporal-aware transformer-like architecture that uses ternary spiking neurons for computationally efficient multi-modal fusion. Comprehensive evaluations across multiple tasks in the Highway Environment demonstrate the effectiveness and efficiency of the proposed approach for real-time autonomous decision-making.
[LG-10] The Mean-Field Dynamics of Transformers
Link: https://arxiv.org/abs/2512.01868
Authors: Philippe Rigollet
Subjects: Machine Learning (cs.LG); Mathematical Physics (math-ph); Dynamical Systems (math.DS); Probability (math.PR)
*Comments: to appear as Proceedings of the ICM2026, Philadelphia, USA
Abstract:We develop a mathematical framework that interprets Transformer attention as an interacting particle system and studies its continuum (mean-field) limits. By idealizing attention continuous on the sphere, we connect Transformer dynamics to Wasserstein gradient flows, synchronization models (Kuramoto), and mean-shift clustering. Central to our results is a global clustering phenomenon whereby tokens cluster asymptotically after long metastable states where they are arranged into multiple clusters. We further analyze a tractable equiangular reduction to obtain exact clustering rates, show how commonly used normalization schemes alter contraction speeds, and identify a phase transition for long-context attention. The results highlight both the mechanisms that drive representation collapse and the regimes that preserve expressive, multi-cluster structure in deep attention architectures.
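The clustering behaviour can be seen in a toy simulation. The sketch below is an illustration under stated assumptions, not the paper's model: tokens are unit vectors in the plane, each step moves a token toward its softmax-weighted average of all tokens and renormalizes (a simple Euler-style discretization), and `beta`, `dt`, and the initial angles are hypothetical. Starting all tokens inside a half-circle keeps the example in a regime where collapse to one cluster is expected.

```python
import math


def attention_step(points, beta, dt):
    """One Euler-style step of softmax-attention dynamics on the unit circle."""
    new_points = []
    for xi in points:
        # Softmax weights over inner products with every token (itself included).
        scores = [beta * (xi[0] * xj[0] + xi[1] * xj[1]) for xj in points]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        avg_x = sum(w * xj[0] for w, xj in zip(weights, points)) / total
        avg_y = sum(w * xj[1] for w, xj in zip(weights, points)) / total
        # Move toward the attention average, then project back to the circle.
        moved_x, moved_y = xi[0] + dt * avg_x, xi[1] + dt * avg_y
        norm = math.hypot(moved_x, moved_y)
        new_points.append((moved_x / norm, moved_y / norm))
    return new_points


def min_pairwise_cosine(points):
    """Smallest inner product between any two tokens (1.0 means one cluster)."""
    return min(a[0] * b[0] + a[1] * b[1] for a in points for b in points)


# Nine tokens spread over angles 0 .. 2 radians (inside a half-circle).
tokens = [(math.cos(0.25 * k), math.sin(0.25 * k)) for k in range(9)]
start = min_pairwise_cosine(tokens)
for _ in range(500):
    tokens = attention_step(tokens, beta=1.0, dt=0.1)
end = min_pairwise_cosine(tokens)
print(start, end)
```

The minimum pairwise cosine starts negative (the outermost tokens are 2 radians apart) and approaches 1 as the tokens merge. Larger `beta` would be the natural knob for observing the long-lived multi-cluster states mentioned in the abstract, though this sketch does not explore that.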
[LG-11] DeepCAVE: A Visualization and Analysis Tool for Automated Machine Learning
Link: https://arxiv.org/abs/2512.01810
Authors: Sarah Segel,Helena Graf,Edward Bergman,Kristina Thieme,Marcel Wever,Alexander Tornede,Frank Hutter,Marius Lindauer
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Hyperparameter optimization (HPO), as a central paradigm of AutoML, is crucial for leveraging the full potential of machine learning (ML) models; yet its complexity poses challenges in understanding and debugging the optimization process. We present DeepCAVE, a tool for interactive visualization and analysis, providing insights into HPO. Through an interactive dashboard, researchers, data scientists, and ML engineers can explore various aspects of the HPO process and identify issues, untouched potentials, and new insights about the ML model being tuned. By empowering users with actionable insights, DeepCAVE contributes to the interpretability of HPO and ML on a design level and aims to foster the development of more robust and efficient methodologies in the future.
[LG-12] Much Ado About Noising: Dispelling the Myths of Generative Robotic Control
Link: https://arxiv.org/abs/2512.01809
Authors: Chaoyi Pan,Giri Anantharaman,Nai-Chieh Huang,Claire Jin,Daniel Pfrommer,Chenyang Yuan,Frank Permenter,Guannan Qu,Nicholas Boffi,Guanya Shi,Max Simchowitz
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments:
Abstract:Generative models, like flows and diffusions, have recently emerged as popular and efficacious policy parameterizations in robotics. There has been much speculation as to the factors underlying their successes, ranging from capturing multi-modal action distribution to expressing more complex behaviors. In this work, we perform a comprehensive evaluation of popular generative control policies (GCPs) on common behavior cloning (BC) benchmarks. We find that GCPs do not owe their success to their ability to capture multi-modality or to express more complex observation-to-action mappings. Instead, we find that their advantage stems from iterative computation, as long as intermediate steps are supervised during training and this supervision is paired with a suitable level of stochasticity. As a validation of our findings, we show that a minimum iterative policy (MIP), a lightweight two-step regression-based policy, essentially matches the performance of flow GCPs, and often outperforms distilled shortcut models. Our results suggest that the distribution-fitting component of GCPs is less salient than commonly believed, and point toward new design spaces focusing solely on control performance. Project page: this https URL
[LG-13] GR-RL: Going Dexterous and Precise for Long-Horizon Robotic Manipulation
Link: https://arxiv.org/abs/2512.01801
Authors: Yunfei Li,Xiao Ma,Jiafeng Xu,Yu Cui,Zhongren Cui,Zhigang Han,Liqun Huang,Tao Kong,Yuxiao Liu,Hao Niu,Wanli Peng,Jingchao Qiao,Zeyu Ren,Haixin Shi,Zhi Su,Jiawen Tian,Yuyang Xiao,Shenyu Zhang,Liwei Zheng,Hang Li,Yonghui Wu
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments:
Abstract:We present GR-RL, a robotic learning framework that turns a generalist vision-language-action (VLA) policy into a highly capable specialist for long-horizon dexterous manipulation. Assuming the optimality of human demonstrations is core to existing VLA policies. However, we claim that in highly dexterous and precise manipulation tasks, human demonstrations are noisy and suboptimal. GR-RL proposes a multi-stage training pipeline that filters, augments, and reinforces the demonstrations by reinforcement learning. First, GR-RL learns a vision-language-conditioned task progress, filters the demonstration trajectories, and only keeps the transitions that contribute positively to the progress. Specifically, we show that by directly applying offline RL with sparse reward, the resulting Q-values can be treated as a robust progress function. Next, we introduce morphological symmetry augmentation that greatly improves the generalization and performance of GR-RL. Lastly, to better align the VLA policy with its deployment behaviors for high-precision control, we perform online RL by learning a latent space noise predictor. With this pipeline, GR-RL is, to our knowledge, the first learning-based policy that can autonomously lace up a shoe by threading shoelaces through multiple eyelets with an 83.3% success rate, a task requiring long-horizon reasoning, millimeter-level precision, and compliant soft-body interaction. We hope GR-RL provides a step toward enabling generalist robot foundation models to specialize into reliable real-world experts.
[LG-14] The Active and Noise-Tolerant Strategic Perceptron
Link: https://arxiv.org/abs/2512.01783
Authors: Maria-Florina Balcan,Hedyeh Beyhaghi
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*Comments:
Abstract:We initiate the study of active learning algorithms for classifying strategic agents. Active learning is a well-established framework in machine learning in which the learner selectively queries labels, often achieving substantially higher accuracy and efficiency than classical supervised methods, especially in settings where labeling is costly or time-consuming, such as hiring, admissions, and loan decisions. Strategic classification, however, addresses scenarios where agents modify their features to obtain more favorable outcomes, resulting in observed data that is not truthful. Such manipulation introduces challenges beyond those in learning from clean data. Our goal is to design active and noise-tolerant algorithms that remain effective in strategic environments: algorithms that classify strategic agents accurately while issuing as few label requests as possible. The central difficulty is to simultaneously account for strategic manipulation and preserve the efficiency gains of active learning. Our main result is an algorithm for actively learning linear separators in the strategic setting that preserves the exponential improvement in label complexity over passive learning previously obtained only in the non-strategic case. Specifically, for data drawn uniformly from the unit sphere, we show that a modified version of the Active Perceptron algorithm [DKM05,YZ17] achieves excess error \epsilon using only \tilde{O}(d \ln \frac{1}{\epsilon}) label queries and incurs at most \tilde{O}(d \ln \frac{1}{\epsilon}) additional mistakes relative to the optimal classifier, even in the nonrealizable case, when a \tilde{\Omega}(\epsilon) fraction of inputs have labels inconsistent with the optimal classifier. The algorithm is computationally efficient and, under these distributional assumptions, requires substantially fewer label queries than prior work on strategic Perceptron [ABBN21].
[LG-15] How Does RL Post-training Induce Skill Composition? A Case Study on Countdown
Link: https://arxiv.org/abs/2512.01775
Authors: Simon Park,Simran Kaur,Sanjeev Arora
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:While reinforcement learning (RL) successfully enhances reasoning in large language models, its role in fostering compositional generalization (the ability to synthesize novel skills from known components) is often conflated with mere length generalization. To this end, we study what RL post-training teaches about skill composition and how the structure of the composition affects the skill transfer. We focus on the Countdown task (given n numbers and a target, form an expression that evaluates to the target) and analyze model solutions as expression trees, where each subtree corresponds to a reusable subtask and thus can be viewed as a "skill." Tracking tree shapes and their success rates over training, we find: (i) out-of-distribution (OOD) generalization to larger n and to unseen tree shapes, indicating compositional reuse of subtasks; (ii) a structure-dependent hierarchy of learnability – models master shallow balanced trees (workload is balanced between subtasks) before deep unbalanced ones, with persistent fragility on right-heavy structures (even when the composition depth is the same as some left-heavy structures). Our diagnostic reveals what is learned, in what order, and where generalization fails, clarifying how RL-only post-training induces OOD generalization beyond what standard metrics such as pass@k reveal.
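To make the Countdown task and its expression trees concrete, here is a small brute-force solver in Python. It is a sketch under assumptions the abstract does not spell out: every number must be used exactly once, the four arithmetic operators are allowed, and a tree is encoded as nested `(op, left, right)` tuples. The function names are hypothetical, and this is not how the paper's models solve the task.

```python
import operator
from fractions import Fraction
from itertools import combinations

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def solve(numbers, target):
    """Return an expression tree using every number once, or None."""
    items = [(Fraction(n), n) for n in numbers]  # (value, tree) pairs
    return _search(items, Fraction(target))


def _search(items, target):
    if len(items) == 1:
        value, tree = items[0]
        return tree if value == target else None
    # Combine any two partial results into one, then recurse on the rest.
    for i, j in combinations(range(len(items)), 2):
        (a, ta), (b, tb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        merged = [(a + b, ("+", ta, tb)), (a * b, ("*", ta, tb)),
                  (a - b, ("-", ta, tb)), (b - a, ("-", tb, ta))]
        if b != 0:
            merged.append((a / b, ("/", ta, tb)))
        if a != 0:
            merged.append((b / a, ("/", tb, ta)))
        for value, tree in merged:
            found = _search(rest + [(value, tree)], target)
            if found is not None:
                return found
    return None


def evaluate(tree):
    """Evaluate a tree exactly with rational arithmetic."""
    if not isinstance(tree, tuple):
        return Fraction(tree)
    op, left, right = tree
    return OPS[op](evaluate(left), evaluate(right))


def leaves(tree):
    """List the numbers at the leaves, left to right."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[1]) + leaves(tree[2])


def depth(tree):
    """Composition depth: 0 for a leaf, else 1 + the deeper subtree."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(depth(tree[1]), depth(tree[2]))


tree = solve([3, 5, 7, 2], 24)
print(tree, evaluate(tree), depth(tree))
```

Each subtree of the returned tuple is one of the reusable subtasks the abstract calls a skill, and `depth` together with the left/right shape of the tuple gives the kind of structural statistic that can be tracked over training.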
[LG-16] On the Unreasonable Effectiveness of Last-layer Retraining
Link: https://arxiv.org/abs/2512.01766
Authors: John C. Hill,Tyler LaBonte,Xinchen Zhang,Vidya Muthukumar
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Last-layer retraining (LLR) methods – wherein the last layer of a neural network is reinitialized and retrained on a held-out set following ERM training – have garnered interest as an efficient approach to rectify dependence on spurious correlations and improve performance on minority groups. Surprisingly, LLR has been found to improve worst-group accuracy even when the held-out set is an imbalanced subset of the training set. We initially hypothesize that this "unreasonable effectiveness" of LLR is explained by its ability to mitigate neural collapse through the held-out set, resulting in the implicit bias of gradient descent benefiting robustness. Our empirical investigation does not support this hypothesis. Instead, we present strong evidence for an alternative hypothesis: that the success of LLR is primarily due to better group balance in the held-out set. We conclude by showing how the recent algorithms CB-LLR and AFR perform implicit group-balancing to elicit a robustness improvement.
[LG-17] Mofasa: A Step Change in Metal-Organic Framework Generation
Link: https://arxiv.org/abs/2512.01756
Authors: Vaidotas Simkus,Anders Christensen,Steven Bennett,Ian Johnson,Mark Neumann,James Gin,Jonathan Godwin,Benjamin Rhodes
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*Comments:
Abstract:Mofasa is an all-atom latent diffusion model with state-of-the-art performance for generating Metal-Organic Frameworks (MOFs). These are highly porous crystalline materials used to harvest water from desert air, capture carbon dioxide, store toxic gases and catalyse chemical reactions. In recognition of their value, the development of MOFs recently received a Nobel Prize in Chemistry. In many ways, MOFs are well-suited for exploiting generative models in chemistry: they are rationally-designable materials with a large combinatorial design space and strong structure-property couplings. And yet, to date, a high performance generative model has been lacking. To fill this gap, we introduce Mofasa, a general-purpose latent diffusion model that jointly samples positions, atom-types and lattice vectors for systems as large as 500 atoms. Mofasa avoids handcrafted assembly algorithms common in the literature, unlocking the simultaneous discovery of metal nodes, linkers and topologies. To help the scientific community build on our work, we release MofasaDB, an annotated library of hundreds of thousands of sampled MOF structures, along with a user-friendly web interface for search and discovery: this https URL.
[LG-18] SA-ADP: Sensitivity-Aware Adaptive Differential Privacy for Large Language Models
Link: https://arxiv.org/abs/2512.01748
Authors: Stella Etuk,Ashraf Matrawy
Subjects: Machine Learning (cs.LG)
*Comments: 5 pages, 5 figures, 1 table
Abstract:Despite advances in the use of large language models (LLMs) in downstream tasks, their ability to memorize information has raised privacy concerns. Therefore, protecting personally identifiable information (PII) during LLM training remains a fundamental challenge. Conventional methods like Differential Privacy-Stochastic Gradient Descent (DP-SGD) provide robust privacy protection via uniform noising, protecting PII regardless of its distinct sensitivity. This comes at the expense of the model’s utility, leading to a trade-off. In this paper, we propose SA-ADP, a sensitivity-aware approach that allocates noise based on the sensitivity of individual PII. We evaluated our method on four datasets (ABCD, CUSTOMERSIM, Wikitext-2, and UNSW-NB15). Our results show that SA-ADP achieves results comparable to the baseline (No-DP) and the conventional DP-SGD. This means that our method did not degrade the model’s utility while still maintaining strong privacy protection.
[LG-19] MSPT: Efficient Large-Scale Physical Modeling via Parallelized Multi-Scale Attention
Link: https://arxiv.org/abs/2512.01738
Authors: Pedro M. P. Curvo,Jan-Willem van de Meent,Maksim Zhdanov
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:A key scalability challenge in neural solvers for industrial-scale physics simulations is efficiently capturing both fine-grained local interactions and long-range global dependencies across millions of spatial elements. We introduce the Multi-Scale Patch Transformer (MSPT), an architecture that combines local point attention within patches with global attention to coarse patch-level representations. To partition the input domain into spatially-coherent patches, we employ ball trees, which handle irregular geometries efficiently. This dual-scale design enables MSPT to scale to millions of points on a single GPU. We validate our method on standard PDE benchmarks (elasticity, plasticity, fluid dynamics, porous flow) and large-scale aerodynamic datasets (ShapeNet-Car, Ahmed-ML), achieving state-of-the-art accuracy with substantially lower memory footprint and computational cost.
[LG-20] Automating modeling in mechanics: LLMs as designers of physics-constrained neural networks for constitutive modeling of materials
Link: https://arxiv.org/abs/2512.01735
Authors: Marius Tacke,Matthias Busch,Kian Abdolazizi,Jonas Eichinger,Kevin Linka,Christian Cyron,Roland Aydin
Subjects: Machine Learning (cs.LG)
*Comments: Currently under review
Abstract:Large language model (LLM)-based agentic frameworks increasingly adopt the paradigm of dynamically generating task-specific agents. We suggest that not only agents but also specialized software modules for scientific and engineering tasks can be generated on demand. We demonstrate this concept in the field of solid mechanics. There, so-called constitutive models are required to describe the relationship between mechanical stress and body deformation. Constitutive models are essential for both the scientific understanding and industrial application of materials. However, even recent data-driven methods of constitutive modeling, such as constitutive artificial neural networks (CANNs), still require substantial expert knowledge and human labor. We present a framework in which an LLM generates a CANN on demand, tailored to a given material class and dataset provided by the user. The framework covers LLM-based architecture selection, integration of physical constraints, and complete code generation. Evaluation on three benchmark problems demonstrates that LLM-generated CANNs achieve accuracy comparable to or greater than manually engineered counterparts, while also exhibiting reliable generalization to unseen loading scenarios and extrapolation to large deformations. These findings indicate that LLM-based generation of physics-constrained neural networks can substantially reduce the expertise required for constitutive modeling and represent a step toward practical end-to-end automation.
[LG-21] Beyond Scaffold: A Unified Spatio-Temporal Gradient Tracking Method
Link: https://arxiv.org/abs/2512.01732
Authors: Yan Huang,Jinming Xu,Jiming Chen,Karl Henrik Johansson
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments: 13 pages
Abstract:In distributed and federated learning algorithms, communication overhead is often reduced by performing multiple local updates between communication rounds. However, due to data heterogeneity across nodes and the local gradient noise within each node, this strategy can lead to the drift of local models away from the global optimum. To address this issue, we revisit the well-known federated learning method Scaffold (Karimireddy et al., 2020) under a gradient tracking perspective, and propose a unified spatio-temporal gradient tracking algorithm, termed ST-GT, for distributed stochastic optimization over time-varying graphs. ST-GT tracks the global gradient across neighboring nodes to mitigate data heterogeneity, while maintaining a running average of local gradients to substantially suppress noise, with slightly more storage overhead. Without assuming bounded data heterogeneity, we prove that ST-GT attains a linear convergence rate for strongly convex problems and a sublinear rate for nonconvex cases. Notably, ST-GT achieves the first linear speed-up in communication complexity with respect to the number of local updates per round \tau for the strongly-convex setting. Compared to traditional gradient tracking methods, ST-GT reduces the topology-dependent noise term from \sigma^2 to \sigma^2/\tau, where \sigma^2 denotes the noise level, thereby improving communication efficiency.
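The intuition behind a \sigma^2/\tau noise term is that averaging \tau independent noisy gradients divides the variance by \tau. The simulation below checks only that statistical fact with scalar Gaussian noise; it is not an implementation of ST-GT, and the function name, trial count, and seed are arbitrary choices for illustration.

```python
import random
import statistics


def averaged_noise_variance(tau, sigma=1.0, trials=20000, seed=0):
    """Empirical variance of the mean of `tau` independent N(0, sigma^2) draws."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, sigma) for _ in range(tau))
        for _ in range(trials)
    ]
    return statistics.pvariance(means)


var_single = averaged_noise_variance(tau=1)     # expected near sigma^2 = 1
var_averaged = averaged_noise_variance(tau=10)  # expected near sigma^2 / 10
print(var_single, var_averaged)
```

In a real local-update scheme the gradients within a round are evaluated at different iterates and are not exactly independent, which is part of what the paper's analysis has to handle.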
[LG-22] A unified framework for geometry-independent operator learning in cardiac electrophysiology simulations
Link: https://arxiv.org/abs/2512.01702
Authors: Bei Zhou,Cesare Corrado,Shuang Qian,Maximilian Balmus,Angela W. C. Lee,Cristobal Rodero,Marco J.W. Gotte,Luuk H.G.A. Hopman,Mengyun Qiao,Steven Niederer
Subjects: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*Comments:
Abstract:Accurate maps of atrial electrical activation are essential for personalised treatment of arrhythmias, yet biophysically detailed simulations remain computationally intensive for real-time clinical use or population-scale analyses. Here we introduce a geometry-independent operator-learning framework that predicts local activation time (LAT) fields across diverse left atrial anatomies with near-instantaneous inference. We generated a dataset of 308,700 simulations using a GPU-accelerated electrophysiology solver, systematically varying multiple pacing sites and physiologically varied conduction properties across 147 patient-specific geometries derived from two independent clinical cohorts. All anatomical and functional data are expressed in a Universal Atrium Coordinate system, providing a consistent representation that decouples electrophysiological patterns from mesh topology. Within this coordinate space, we designed a neural operator with a vision-transformer backbone to learn the mapping from structural and electrophysiological inputs to LAT fields. With a mean prediction error of 5.1 ms over a 455 ms maximum simulation time, the model outperforms established operator-learning approaches and performs inference in 0.12 ms per sample. Our framework establishes a general strategy for learning domain-invariant biophysical mappings across variable anatomical domains and enables integration of computational electrophysiology into real-time and large-scale clinical workflows.
[LG-23] Morphling: Fast, Fused, and Flexible GNN Training at Scale
Link: https://arxiv.org/abs/2512.01678
Authors: Anubhab,Rupesh Nasre
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Programming Languages (cs.PL)
*Comments:
Abstract:Graph Neural Networks (GNNs) present a fundamental hardware challenge by fusing irregular, memory-bound graph traversals with regular, compute-intensive dense matrix operations. While frameworks such as PyTorch Geometric (PyG) and Deep Graph Library (DGL) prioritize high-level usability, they fail to address these divergent execution characteristics. As a result, they rely on generic kernels that suffer from poor cache locality, excessive memory movement, and substantial intermediate allocations. To address these limitations, we present Morphling, a domain-specific code synthesizer designed to bridge this gap. Morphling compiles high-level GNN specifications into portable, backend-specialized implementations targeting OpenMP, CUDA, and MPI. It achieves this by instantiating a library of optimized, architecture-aware primitives tailored to each execution environment. Morphling also incorporates a runtime sparsity-aware execution engine that dynamically selects dense or sparse execution paths using input feature statistics, reducing unnecessary computation on zero-valued entries. We evaluate Morphling on eleven real-world datasets spanning diverse graph structures, feature dimensionalities, and sparsity regimes. The results show that Morphling improves per-epoch training throughput by an average of 20X on CPUs and 19X on GPUs over PyG and DGL, with peak speedups reaching 66X. Morphling’s memory-efficient layouts further reduce peak memory consumption by up to 15X, enabling large-scale GNN training on commodity hardware. These findings demonstrate that specialized, architecture-aware code synthesis provides an effective and scalable path toward high-performance GNN execution across diverse parallel and distributed platforms.
[LG-24] In-context Inverse Optimality for Fair Digital Twins: A Preference-based approach
Link: https://arxiv.org/abs/2512.01650
Authors: Daniele Masti,Francesco Basciani,Arianna Fedeli,Giorgio Gnecco,Francesco Smarra
Subjects: Machine Learning (cs.LG); Software Engineering (cs.SE); Optimization and Control (math.OC)
*Comments: Submitted for possible publication at the IFAC World Congress 2026
Abstract:Digital Twins (DTs) are increasingly used as autonomous decision-makers in complex socio-technical systems. Their mathematically optimal decisions often diverge from human expectations, exposing a persistent gap between algorithmic and bounded human rationality. This work addresses this gap by proposing a framework that operationalizes fairness as a learnable objective within optimization-based Digital Twins. We introduce a preference-driven learning pipeline that infers latent fairness objectives directly from human pairwise preferences over feasible decisions. A novel Siamese neural network is developed to generate convex quadratic cost functions conditioned on contextual information. The resulting surrogate objectives align optimization outcomes with human-perceived fairness while maintaining computational efficiency. The approach is demonstrated on a COVID-19 hospital resource allocation scenario. This study provides an actionable path toward embedding human-centered fairness in the design of autonomous decision-making systems.
[LG-25] Scaling and context steer LLMs along the same computational path as the human brain
Link: https://arxiv.org/abs/2512.01591
Authors: Joséphine Raugel,Stéphane d’Ascoli,Jérémy Rapin,Valentin Wyart,Jean-Rémi King
Subjects: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*Comments:
Abstract:Recent studies suggest that the representations learned by large language models (LLMs) are partially aligned to those of the human brain. However, whether and why this alignment score arises from a similar sequence of computations remains elusive. In this study, we explore this question by examining temporally-resolved brain signals of participants listening to 10 hours of an audiobook. We study these neural dynamics jointly with a benchmark encompassing 22 LLMs varying in size and architecture type. Our analyses confirm that LLMs and the brain generate representations in a similar order: specifically, activations in the initial layers of LLMs tend to best align with early brain responses, while the deeper layers of LLMs tend to best align with later brain responses. This brain-LLM alignment is consistent across transformers and recurrent architectures. However, its emergence depends on both model size and context length. Overall, this study sheds light on the sequential nature of computations and the factors underlying the partial convergence between biological and artificial neural networks.
[LG-26] Do Large Language Models Walk Their Talk? Measuring the Gap Between Implicit Associations, Self-Report, and Behavioral Altruism
Link: https://arxiv.org/abs/2512.01568
Authors: Sandro Andric
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 7 figures, 7 tables. Code and data available at this https URL
Abstract:We investigate whether Large Language Models (LLMs) exhibit altruistic tendencies, and critically, whether their implicit associations and self-reports predict actual altruistic behavior. Using a multi-method approach inspired by human social psychology, we tested 24 frontier LLMs across three paradigms: (1) an Implicit Association Test (IAT) measuring implicit altruism bias, (2) a forced binary choice task measuring behavioral altruism, and (3) a self-assessment scale measuring explicit altruism beliefs. Our key findings are: (1) All models show strong implicit pro-altruism bias (mean IAT = 0.87, p < .0001), confirming models “know” altruism is good. (2) Models behave more altruistically than chance (65.6% vs. 50%, p < .0001), but with substantial variation (48-85%). (3) Implicit associations do not predict behavior (r = .22, p = .29). (4) Most critically, models systematically overestimate their own altruism, claiming 77.5% altruism while acting at 65.6% (p < .0001, Cohen’s d = 1.08). This “virtue signaling gap” affects 75% of models tested. Based on these findings, we recommend the Calibration Gap (the discrepancy between self-reported and behavioral values) as a standardized alignment metric. Well-calibrated models are more predictable and behaviorally consistent; only 12.5% of models achieve the ideal combination of high prosocial behavior and accurate self-knowledge.
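The Calibration Gap described in this abstract is simple arithmetic, so a tiny sketch may help make it concrete. The function name and the choice of a signed difference in percentage points are assumptions made here for illustration; the abstract only says the metric is the discrepancy between self-reported and behavioral values.

```python
def calibration_gap(self_reported: float, behavioral: float) -> float:
    """Signed gap in percentage points (assumed convention).

    A positive value means the model claims more altruism than it shows.
    """
    return self_reported - behavioral


# Aggregate figures quoted in the abstract (percent altruistic).
gap = calibration_gap(self_reported=77.5, behavioral=65.6)
print(f"calibration gap: {gap:.1f} percentage points")
```

A per-model version would apply the same function to each model's own pair of scores rather than to the aggregate means.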
[LG-27] TimePred: efficient and interpretable offline change point detection for high volume data - with application to industrial process monitoring
Link: https://arxiv.org/abs/2512.01562
Authors: Simon Leszek
Categories: Machine Learning (cs.LG)
Comments: 6 pages, 3 figures
Abstract:Change-point detection (CPD) in high-dimensional, large-volume time series is challenging for statistical consistency, scalability, and interpretability. We introduce TimePred, a self-supervised framework that reduces multivariate CPD to univariate mean-shift detection by predicting each sample’s normalized time index. This enables efficient offline CPD using existing algorithms and supports the integration of XAI attribution methods for feature-level explanations. Our experiments show competitive CPD performance while reducing computational cost by up to two orders of magnitude. In an industrial manufacturing case study, we demonstrate improved detection accuracy and illustrate the practical value of interpretable change-point insights.
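The reduction described in this abstract, predicting each sample's normalized time index and then looking for a mean shift in the one-dimensional prediction, can be illustrated with a small sketch. Everything below is an assumption for illustration: the predictor is an in-sample linear least-squares fit rather than the self-supervised model in the paper, the detector is a plain single-split squared-error search, and the data are synthetic.

```python
import numpy as np


def time_index_signal(X):
    """Regress the normalized time index on the features; return fitted values."""
    n = X.shape[0]
    t = np.arange(n) / (n - 1)            # normalized time index in [0, 1]
    A = np.column_stack([X, np.ones(n)])  # features plus an intercept column
    coef, *_ = np.linalg.lstsq(A, t, rcond=None)
    return A @ coef


def single_mean_shift(s):
    """Return the split index that minimizes within-segment squared error."""
    n = len(s)
    best_k, best_cost = None, np.inf
    for k in range(2, n - 1):
        left, right = s[:k], s[k:]
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k


# Synthetic data: 5 channels, mean shift in the first two at sample 250.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
X[250:, :2] += 2.0

signal = time_index_signal(X)   # univariate surrogate signal
cp = single_mean_shift(signal)  # estimated change point
print("estimated change point:", cp)
```

Because the surrogate signal is univariate, any existing offline mean-shift detector could replace `single_mean_shift`, and the regression coefficients give a rough indication of which channels drive the detection.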
[LG-28] End-to-end Deep Reinforcement Learning for Stochastic Multi-objective Optimization in C-VRPTW
Link: https://arxiv.org/abs/2512.01518
Authors: Abdo Abouelrous,Laurens Bliek,Yaoxin Wu,Yingqian Zhang
Categories: Machine Learning (cs.LG)
Comments: 25 pages, 5 figures
Abstract:In this work, we consider learning-based applications in routing to solve a Vehicle Routing variant characterized by stochasticity and multiple objectives. Such problems are representative of practical settings where decision-makers have to deal with uncertainty in the operational environment as well as multiple conflicting objectives due to different stakeholders. We specifically consider travel time uncertainty. We also consider two objectives, total travel time and route makespan, that jointly target operational efficiency and labor regulations on shift length, although different objectives could be incorporated. Learning-based methods offer earnest computational advantages as they can repeatedly solve problems with limited interference from the decision-maker. We specifically focus on end-to-end deep learning models that leverage the attention mechanism and multiple solution trajectories. These models have seen several successful applications in routing problems. However, since travel times are not a direct input to these models due to the large dimensions of the travel time matrix, accounting for uncertainty is a challenge, especially in the presence of multiple objectives. In turn, we propose a model that simultaneously addresses stochasticity and multi-objectivity and provide a refined training mechanism for this model through scenario clustering to reduce training time. Our results show that our model is capable of constructing a Pareto Front of good quality within acceptable run times compared to three baselines.
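Since the abstract evaluates models by the Pareto front they construct over two objectives, a short sketch of how such a front is extracted from candidate solutions may be useful. The candidate values are made up for illustration, and both objectives (total travel time and makespan) are assumed to be minimized.

```python
def pareto_front(points):
    """Return the non-dominated points when both objectives are minimized."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# Hypothetical (total travel time, makespan) pairs for five candidate route plans.
candidates = [(100, 40), (90, 55), (120, 35), (95, 60), (110, 38)]
front = pareto_front(candidates)
print(front)
```

Here the plan (95, 60) is dropped because (90, 55) is at least as good on both objectives; the remaining four plans each represent a different trade-off.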
[LG-29] Label Forensics: Interpreting Hard Labels in Black-Box Text Classifier
Link: https://arxiv.org/abs/2512.01514
Authors: Mengyao Du,Gang Yang,Han Fang,Quanjun Yin,Ee-chien Chang
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 3 figures
Abstract:The widespread adoption of natural language processing techniques has led to an unprecedented growth of text classifiers across the modern web. Yet many of these models circulate with their internal semantics undocumented or even intentionally withheld. Such opaque classifiers, which may expose only hard-label outputs, can operate in unregulated web environments or be repurposed for unknown intents, raising legitimate forensic and auditing concerns. In this paper, we position ourselves as investigators and work to infer the semantic concept each label encodes in an undocumented black-box classifier. Specifically, we introduce label forensics, a black-box framework that reconstructs a label’s semantic meaning. Concretely, we represent a label by a sentence embedding distribution from which any sample reliably reflects the concept the classifier has implicitly learned for that label. We believe this distribution should maintain two key properties: precise, with samples consistently classified into the target label, and general, covering the label’s broad semantic space. To realize this, we design a semantic neighborhood sampler and an iterative optimization procedure to select representative seed sentences that jointly maximize label consistency and distributional coverage. The final output, an optimized seed sentence set combined with the sampler, constitutes the empirical distribution representing the label’s semantics. Experiments on multiple black-box classifiers achieve an average label consistency of around 92.24 percent, demonstrating that the embedding regions accurately capture each classifier’s label semantics. We further validate our framework on an undocumented HuggingFace classifier, enabling fine-grained label interpretation and supporting responsible AI auditing. 
[LG-30] Walking on the Fiber: A Simple Geometric Approximation for Bayesian Neural Networks
Link: https://arxiv.org/abs/2512.01500
Authors: Alfredo Reichlin,Miguel Vasco,Danica Kragic
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Bayesian Neural Networks provide a principled framework for uncertainty quantification by modeling the posterior distribution of network parameters. However, exact posterior inference is computationally intractable, and widely used approximations like the Laplace method struggle with scalability and posterior accuracy in modern deep networks. In this work, we revisit sampling techniques for posterior exploration, proposing a simple variation tailored to efficiently sample from the posterior in over-parameterized networks by leveraging the low-dimensional structure of loss minima. Building on this, we introduce a model that learns a deformation of the parameter space, enabling rapid posterior sampling without requiring iterative methods. Empirical results demonstrate that our approach achieves competitive posterior approximations with improved scalability compared to recent refinement techniques. These contributions provide a practical alternative for Bayesian inference in deep learning.
[LG-31] Winning Solutions for the Rayan AI Contest: Compositional Retrieval, Zero-Shot Anomaly Detection, and Backdoor Detection
Link: https://arxiv.org/abs/2512.01498
Authors: Ali Nafisi,Sina Asghari,Mohammad Saeed Arvenaghi,Hossein Shakibania
Categories: Machine Learning (cs.LG)
Comments:
Abstract:This report presents solutions to three machine learning challenges: compositional image retrieval, zero-shot anomaly detection, and backdoored model detection. In compositional image retrieval, we developed a system that processes visual and textual inputs to retrieve relevant images, achieving 95.38% accuracy and ranking first with a clear margin over the second team. For zero-shot anomaly detection, we designed a model that identifies and localizes anomalies in images without prior exposure to abnormal examples, securing 1st place with 73.14% accuracy. In the backdoored model detection task, we proposed a method to detect hidden backdoor triggers in neural networks, reaching an accuracy of 78%, which placed our approach in second place. These results demonstrate the effectiveness of our methods in addressing key challenges related to retrieval, anomaly detection, and model security, with implications for real-world applications in industries such as healthcare, manufacturing, and cybersecurity. Code for all solutions is available online.
[LG-32] Heuristic algorithms for the stochastic critical node detection problem
Link: https://arxiv.org/abs/2512.01497
Authors: Tuguldur Bayarsaikhan,Altannar Chinchuluun,Ashwin Arulselvan,Panos Pardalos
Categories: Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
Comments: 17 pages, 11 figures
Abstract:Given a network, the critical node detection problem finds a subset of nodes whose removal disrupts the network connectivity. Since many real-world systems are naturally modeled as graphs, assessing the vulnerability of the network is essential, with applications in transportation systems, traffic forecasting, epidemic control, and biological networks. In this paper, we consider a stochastic version of the critical node detection problem, where the existence of edges is given by certain probabilities. We propose heuristics and learning-based methods for the problem and compare them with existing algorithms. Experiments on random graphs from small to larger scales, with edge-survival probabilities drawn from different distributions, demonstrate the effectiveness of the methods. Heuristic methods often achieve the strongest results with high scalability, while learning-based methods maintain nearly constant inference time as the network size and density grow.
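To make the stochastic setting concrete, here is a minimal sketch of one possible baseline: a greedy heuristic that scores each candidate removal by a Monte Carlo estimate of the expected number of connected node pairs. This is not the method from the paper; the pairwise-connectivity objective, the greedy rule, the sample count, and the toy star graph are all assumptions chosen for illustration.

```python
import random


def connected_pairs(nodes, edges):
    """Count node pairs joined by a path, using union-find."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:
        if u in parent and v in parent:    # skip edges touching removed nodes
            parent[find(u)] = find(v)
    sizes = {}
    for v in nodes:
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    return sum(s * (s - 1) // 2 for s in sizes.values())


def expected_connectivity(nodes, prob_edges, removed, samples, rng):
    """Monte Carlo estimate of expected connected pairs after removing nodes."""
    kept = [v for v in nodes if v not in removed]
    total = 0
    for _ in range(samples):
        realized = [(u, v) for u, v, p in prob_edges if rng.random() < p]
        total += connected_pairs(kept, realized)
    return total / samples


def greedy_critical_nodes(nodes, prob_edges, k, samples=200, seed=0):
    """Remove k nodes one at a time, each time minimizing the estimate."""
    rng = random.Random(seed)
    removed = []
    for _ in range(k):
        best = min(
            (v for v in nodes if v not in removed),
            key=lambda v: expected_connectivity(
                nodes, prob_edges, removed + [v], samples, rng
            ),
        )
        removed.append(best)
    return removed


# Toy star graph: hub 0 joined to five leaves, each edge present with probability 0.9.
star_nodes = list(range(6))
star_edges = [(0, leaf, 0.9) for leaf in range(1, 6)]
chosen = greedy_critical_nodes(star_nodes, star_edges, k=1)
print(chosen)
```

On the star graph the hub is the only sensible choice, since removing it disconnects every pair regardless of which edges are realized.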
[LG-33] Differentiable Weightless Controllers: Learning Logic Circuits for Continuous Control
Link: https://arxiv.org/abs/2512.01467
Authors: Fabian Kresse,Christoph H. Lampert
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Symbolic Computation (cs.SC)
Comments: 16 pages, 11 figures, 10 tables
Abstract:We investigate whether continuous-control policies can be represented and learned as discrete logic circuits instead of continuous neural networks. We introduce Differentiable Weightless Controllers (DWCs), a symbolic-differentiable architecture that maps real-valued observations to actions using thermometer-encoded inputs, sparsely connected boolean lookup-table layers, and lightweight action heads. DWCs can be trained end-to-end by gradient-based techniques, yet compile directly into FPGA-compatible circuits with few- or even single-clock-cycle latency and nanojoule-level energy cost per action. Across five MuJoCo benchmarks, including high-dimensional Humanoid, DWCs achieve returns competitive with weight-based policies (full precision or quantized neural networks), matching performance on four tasks and isolating network capacity as the key limiting factor on HalfCheetah. Furthermore, DWCs exhibit structurally sparse and interpretable connectivity patterns, enabling a direct inspection of which input thresholds influence control decisions.
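Two of the building blocks named in this abstract, thermometer-encoded inputs and boolean lookup-table nodes, are easy to illustrate in isolation. The sketch below shows only the discrete forward computation; the thresholds, wiring, and truth table are hypothetical, and the differentiable training procedure that the paper contributes is not represented.

```python
def thermometer_encode(x, thresholds):
    """Bit i is 1 when x exceeds thresholds[i]; thresholds must be ascending."""
    return [1 if x > t else 0 for t in thresholds]


def lut_node(bits, wires, table):
    """Read the selected input bits as a binary address into a truth table."""
    address = 0
    for w in wires:
        address = (address << 1) | bits[w]
    return table[address]


thresholds = [-0.5, 0.0, 0.5, 1.0]          # hypothetical thresholds
bits = thermometer_encode(0.3, thresholds)  # 0.3 exceeds the first two thresholds
xor_table = [0, 1, 1, 0]                    # a 2-input lookup table computing XOR
out = lut_node(bits, wires=(1, 2), table=xor_table)
print(bits, out)
```

Because each node reads only a few named input bits, one can trace directly which thresholds on which observation influence an output, which is the kind of inspection the abstract refers to.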
[LG-34] A Nonlinear Low-rank Representation Model with Convolutional Neural Network for Imputing Water Quality Data
Link: https://arxiv.org/abs/2512.01465
Authors: Hongnan Si,Tong Li,Yujie Chen,Xin Liao
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 1 figure
Abstract:Water quality monitoring is a core component of ecological environmental protection. However, due to sensor failure or other inevitable factors, missing data are common in long-term monitoring, posing great challenges for water quality analysis. This paper proposes a Neural Tucker Convolutional Network (NTCN) model for water quality data imputation, which features the following key components: a) Encode different mode entities into respective embedding vectors, and construct a Tucker interaction tensor by outer product operations to capture the complex mode-wise feature interactions; b) Use 3D convolution to extract fine-grained spatiotemporal features from the interaction tensor. Experiments on three real-world water quality datasets show that the proposed NTCN model outperforms several state-of-the-art imputation models in terms of accuracy.
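Component (a) of this abstract, building an interaction tensor from the outer product of per-mode embedding vectors, can be sketched in a few lines. The three modes (site, indicator, time step), the embedding size, and the random values are assumptions for illustration; the 3D convolution of component (b) is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
rank = 4  # hypothetical embedding size per mode

# One embedding vector per mode for a single (site, indicator, time) entry.
site_emb = rng.normal(size=rank)
indicator_emb = rng.normal(size=rank)
time_emb = rng.normal(size=rank)

# Three-way outer product: interaction[i, j, k] = site[i] * indicator[j] * time[k].
interaction = np.einsum("i,j,k->ijk", site_emb, indicator_emb, time_emb)
print(interaction.shape)
```

The resulting cube has one cell per combination of latent factors, which is what makes it a natural input for a 3D convolution.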
[LG-35] hls4ml: A Flexible Open-Source Platform for Deep Learning Acceleration on Reconfigurable Hardware
Link: https://arxiv.org/abs/2512.01463
Authors: Jan-Frederik Schulte,Benjamin Ramhorst,Chang Sun,Jovan Mitrevski,Nicolò Ghielmetti,Enrico Lupi,Dimitrios Danopoulos,Vladimir Loncar,Javier Duarte,David Burnette,Lauri Laatu,Stylianos Tzelepis,Konstantinos Axiotis,Quentin Berthet,Haoyan Wang,Paul White,Suleyman Demirsoy,Marco Colombo,Thea Aarrestad,Sioni Summers,Maurizio Pierini,Giuseppe Di Guglielmo,Jennifer Ngadiuba,Javier Campos,Ben Hawks,Abhijith Gandrakota,Farah Fahim,Nhan Tran,George Constantinides,Zhiqiang Que,Wayne Luk,Alexander Tapper,Duc Hoang,Noah Paladino,Philip Harris,Bo-Cheng Lai,Manuel Valentin,Ryan Forelli,Seda Ogrenci,Lino Gerlach,Rian Flynn,Mia Liu,Daniel Diaz,Elham Khoda,Melissa Quinnan,Russell Solares,Santosh Parajuli,Mark Neubauer,Christian Herwig,Ho Fung Tsoi,Dylan Rankin,Shih-Chieh Hsu,Scott Hauck
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments:
Abstract:We present hls4ml, a free and open-source platform that translates machine learning (ML) models from modern deep learning frameworks into high-level synthesis (HLS) code that can be integrated into full designs for field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs). With its flexible and modular design, hls4ml supports a large number of deep learning frameworks and can target HLS compilers from several vendors, including Vitis HLS, Intel oneAPI and Catapult HLS. Together with a wider eco-system for software-hardware co-design, hls4ml has enabled the acceleration of ML inference in a wide range of commercial and scientific applications where low latency, resource usage, and power consumption are critical. In this paper, we describe the structure and functionality of the hls4ml platform. The overarching design considerations for the generated HLS code are discussed, together with selected performance results.
[LG-36] Fourier Neural Operators Explained: A Practical Perspective
Link: https://arxiv.org/abs/2512.01421
Authors: Valentin Duruisseaux,Jean Kossaifi,Anima Anandkumar
Categories: Machine Learning (cs.LG)
Comments: 92 pages, 26 figures
Abstract:Partial differential equations (PDEs) govern a wide variety of dynamical processes in science and engineering, yet obtaining their numerical solutions often requires high-resolution discretizations and repeated evaluations of complex operators, leading to substantial computational costs. Neural operators have recently emerged as a powerful framework for learning mappings between function spaces directly from data, enabling efficient surrogate models for PDE systems. Among these architectures, the Fourier Neural Operator (FNO) has become the most influential and widely adopted due to its elegant spectral formulation, which captures global correlations through learnable transformations in Fourier space while remaining invariant to discretization and resolution. Despite their success, the practical use of FNOs is often hindered by an incomplete understanding among practitioners of their theoretical foundations, practical constraints, and implementation details, which can lead to their incorrect or unreliable application. This work presents a comprehensive and practice-oriented guide to FNOs, unifying their mathematical principles with implementation strategies. We provide an intuitive exposition to the concepts of operator theory and signal-processing that underlie the FNO, detail its spectral parameterization and the computational design of all its components, and address common misunderstandings encountered in the literature. The exposition is closely integrated with the NeuralOperator 2.0.0 library, offering modular state-of-the-art implementations that faithfully reflect the theory. By connecting rigorous foundations with practical insight, this guide aims to establish a clear and reliable framework for applying FNOs effectively across diverse scientific and engineering fields.
[LG-37] Fantastic Features and Where to Find Them: A Probing Method to combine Features from Multiple Foundation Models NEURIPS2025
Link: https://arxiv.org/abs/2512.01405
Authors: Benjamin Ramtoula,Pierre-Yves Lajoie,Paul Newman,Daniele De Martini
Categories: Machine Learning (cs.LG)
Comments: Published at NeurIPS 2025
Abstract:Foundation models (FMs) trained with different objectives and data learn diverse representations, making some more effective than others for specific downstream tasks. Existing adaptation strategies, such as parameter-efficient fine-tuning, focus on individual models and do not exploit the complementary strengths across models. Probing methods offer a promising alternative by extracting information from frozen models, but current techniques do not scale well with large feature sets and often rely on dataset-specific hyperparameter tuning. We propose Combined backBones (ComBo), a simple and scalable probing-based adapter that effectively integrates features from multiple models and layers. ComBo compresses activations from layers of one or more FMs into compact token-wise representations and processes them with a lightweight transformer for task-specific prediction. Crucially, ComBo does not require dataset-specific tuning or backpropagation through the backbone models. However, not all models are equally relevant for all tasks. To address this, we introduce a mechanism that leverages ComBo’s joint multi-backbone probing to efficiently evaluate each backbone’s task-relevance, enabling both practical model comparison and improved performance through selective adaptation. On the 19 tasks of the VTAB-1k benchmark, ComBo outperforms previous probing methods, matches or surpasses more expensive alternatives, such as distillation-based model merging, and enables efficient probing of tuned models. Our results demonstrate that ComBo offers a practical and general-purpose framework for combining diverse representations from multiple FMs.
[LG-38] On Global Applicability and Location Transferability of Generative Deep Learning Models for Precipitation Downscaling
Link: https://arxiv.org/abs/2512.01400
Authors: Paula Harder,Christian Lessig,Matthew Chantry,Francis Pelletier,David Rolnick
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Deep learning offers promising capabilities for the statistical downscaling of climate and weather forecasts, with generative approaches showing particular success in capturing fine-scale precipitation patterns. However, most existing models are region-specific, and their ability to generalize to unseen geographic areas remains largely unexplored. In this study, we evaluate the generalization performance of generative downscaling models across diverse regions. Using a global framework, we employ ERA5 reanalysis data as predictors and IMERG precipitation estimates at $0.1^\circ$ resolution as targets. A hierarchical location-based data split enables a systematic assessment of model performance across 15 regions around the world.
[LG-39] RE-LLM: Integrating Large Language Models into Renewable Energy Systems
Link: https://arxiv.org/abs/2512.01392
Authors: Ali Forootani,Mohammad Sadr,Danial Esmaeili Aliabadi,Daniela Thraen
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:
Abstract:Energy system models are increasingly employed to guide long-term planning in multi-sectoral environments where decisions span electricity, heat, transport, land use, and industry. While these models provide rigorous quantitative insights, their outputs are often highly technical, making them difficult to interpret for non-expert stakeholders such as policymakers, planners, and the public. This communication gap limits the accessibility and practical impact of scenario-based modeling, particularly as energy transitions grow more complex with rising shares of renewables, sectoral integration, and deep uncertainties. To address this challenge, we propose the Renewable Energy Large Language Model (RE-LLM), a hybrid framework that integrates Large Language Models (LLMs) directly into the energy system modeling workflow. RE-LLM combines three core elements: (i) optimization-based scenario exploration, (ii) machine learning surrogates that accelerate computationally intensive simulations, and (iii) LLM-powered natural language generation that translates complex results into clear, stakeholder-oriented explanations. This integrated design not only reduces computational burden but also enhances interpretability, enabling real-time reasoning about trade-offs, sensitivities, and policy implications. The framework is adaptable across different optimization platforms and energy system models, ensuring broad applicability beyond the case study presented. By merging speed, rigor, and interpretability, RE-LLM advances a new paradigm of human-centric energy modeling. It enables interactive, multilingual, and accessible engagement with future energy pathways, ultimately bridging the final gap between data-driven analysis and actionable decision-making for sustainable transitions.
[LG-40] CLAPS: Posterior-Aware Conformal Intervals via Last-Layer Laplace
Link: https://arxiv.org/abs/2512.01384
Authors: Dongseok Kim,Hyoungsun Choi,Mohamed Jismy Aashik Rasool,Gisung Oh
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 19 pages, 2 figures
Abstract:We present CLAPS, a posterior-aware conformal regression method that pairs a Last-Layer Laplace Approximation with split-conformal calibration. From the resulting Gaussian posterior, CLAPS defines a simple two-sided posterior CDF score that aligns the conformity metric with the full predictive shape, not just a point estimate. This alignment yields narrower prediction intervals at the same target coverage, especially on small to medium tabular datasets where data are scarce and uncertainty modeling matters. We also provide a lightweight diagnostic suite that separates aleatoric and epistemic components and visualizes posterior behavior, helping practitioners understand why intervals shrink when they do. Across multiple benchmarks using the same MLP backbone, CLAPS consistently attains nominal coverage with improved efficiency and minimal overhead, offering a clear, practical upgrade to residual-based conformal baselines.
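A rough sketch of how split-conformal calibration could be combined with a two-sided CDF score is given below. The score `|2F(y) - 1|`, where `F` is a Gaussian predictive CDF, is an assumed form chosen to match the abstract's description; the paper's exact definition may differ. The predictive means and standard deviations are taken as given, standing in for the output of a last-layer Laplace approximation, and the calibration data are synthetic.

```python
import math
import random
from statistics import NormalDist


def cdf_score(y, mu, sigma):
    """Two-sided score in [0, 1]: 0 at the predictive median, near 1 in the tails."""
    return abs(2.0 * NormalDist(mu, sigma).cdf(y) - 1.0)


def conformal_quantile(scores, alpha):
    """Split-conformal threshold with the usual finite-sample (n + 1) correction."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return 1.0  # too few calibration points: fall back to an unbounded interval
    return sorted(scores)[rank - 1]


def prediction_interval(mu, sigma, q):
    """All y whose score is at most q, which is a central predictive interval."""
    if q >= 1.0:
        return (-math.inf, math.inf)
    dist = NormalDist(mu, sigma)
    return dist.inv_cdf((1 - q) / 2), dist.inv_cdf((1 + q) / 2)


# Synthetic calibration set: (y, predictive mean, predictive std) triples.
rng = random.Random(0)
calibration = []
for _ in range(199):
    mu = rng.uniform(-1.0, 1.0)
    sigma = rng.uniform(0.5, 1.5)
    calibration.append((rng.gauss(mu, sigma), mu, sigma))

scores = [cdf_score(y, mu, sigma) for y, mu, sigma in calibration]
q = conformal_quantile(scores, alpha=0.1)
lo, hi = prediction_interval(mu=0.2, sigma=0.8, q=q)
print(f"q = {q:.3f}, interval = [{lo:.3f}, {hi:.3f}]")
```

Because the score is defined on the CDF scale, the interval width adapts to each test point's predictive standard deviation, in contrast to an absolute-residual score that yields the same width everywhere.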
[LG-41] A Fine Evaluation Method for Cube Copying Test for Early Detection of Alzheimer’s Disease
Link: https://arxiv.org/abs/2512.01367
Authors: Xinyu Jiang,Cuiyun Gao,Wenda Huang,Yiyang Jiang,Binwen Luo,Yuxin Jiang,Mengting Wang,Haoran Wen,Yang Zhao,Xuemei Chen,Songqun Huang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Background: Impairment of visual spatial cognitive function is the most common early clinical manifestation of Alzheimer’s Disease (AD). When the Montreal Cognitive Assessment (MoCA) uses the “0/1” binary method (“pass/fail”) to evaluate the visual spatial cognitive ability represented by the Cube Copying Test (CCT), older adults with less formal education generally score 0 points, resulting in serious bias in the evaluation results. Therefore, this study proposes a fine evaluation method for CCT based on dynamic handwriting feature extraction of DH-SCSM-BLA. Methods: The Cogni-CareV3.0 software independently developed by our team was used to collect dynamic handwriting data of CCT. Then, the spatial and motion features of segmented dynamic handwriting were extracted, and feature matrices with unequal dimensions were normalized. Finally, a bidirectional long short-term memory network model combined with an attention mechanism (BiLSTM-Attention) was adopted for classification. Results: The proposed method shows clear superiority over similar studies, with a classification accuracy of 86.69%. The distribution of cube drawing ability scores shows significant regularity along three dimensions: diagnostic group (MCI patients versus healthy controls), age, and level of education. The score for each cognitive task, including the cube drawing ability score, is negatively correlated with age and significantly positively correlated with level of education. Conclusion: This study provides a relatively objective and comprehensive evaluation method for early screening and personalized intervention of visual spatial cognitive impairment.
[LG-42] SocialDriveGen: Generating Diverse Traffic Scenarios with Controllable Social Interactions
Link: https://arxiv.org/abs/2512.01363
Authors: Jiaguo Tian,Zhengbang Zhu,Shenyu Zhang,Li Xu,Bo Zheng,Xu Liu,Weiji Peng,Shizeng Yao,Weinan Zhang
Categories: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Comments:
Abstract:The generation of realistic and diverse traffic scenarios in simulation is essential for developing and evaluating autonomous driving systems. However, most simulation frameworks rely on rule-based or simplified models for scene generation, which lack the fidelity and diversity needed to represent real-world driving. While recent advances in generative modeling produce more realistic and context-aware traffic interactions, they often overlook how social preferences influence driving behavior. SocialDriveGen addresses this gap through a hierarchical framework that integrates semantic reasoning and social preference modeling with generative trajectory synthesis. By modeling egoism and altruism as complementary social dimensions, our framework enables controllable diversity in driver personalities and interaction styles. Experiments on the Argoverse 2 dataset show that SocialDriveGen generates diverse, high-fidelity traffic scenarios spanning cooperative to adversarial behaviors, significantly enhancing policy robustness and generalization to rare or high-risk situations.
[LG-43] Directed evolution algorithm drives neural prediction
Link: https://arxiv.org/abs/2512.01362
Authors: Yanlin Wang,Nancy M Young,Patrick C M Wong
Categories: Machine Learning (cs.LG)
Comments: 43 pages, 5 figures
Abstract:Neural prediction offers a promising approach to forecasting the individual variability of neurocognitive functions and disorders and providing prognostic indicators for personalized intervention. However, it is challenging to translate neural predictive models into medical artificial intelligence applications due to the limitations of domain shift and label scarcity. Here, we propose the directed evolution model (DEM), a novel computational model that mimics the trial-and-error processes of biological directed evolution to approximate optimal solutions for predictive modeling tasks. We demonstrate that the directed evolution algorithm is an effective strategy for uncertainty exploration, enhancing generalization in reinforcement learning. Furthermore, by incorporating replay buffer and continual backpropagation methods into DEM, we provide evidence of achieving a better trade-off between exploitation and exploration in continuous learning settings. We conducted experiments on four different datasets for children with cochlear implants whose spoken language developmental outcomes vary considerably at the individual-child level. Preoperative neural MRI data have been shown to accurately predict the post-operative outcome of these children within but not across datasets. Our results show that DEM can efficiently improve the performance of cross-domain pre-implantation neural predictions while addressing the challenge of label scarcity in the target domain.
[LG-44] Modality-Augmented Fine-Tuning of Foundation Robot Policies for Cross-Embodiment Manipulation on GR1 and G1
Link: https://arxiv.org/abs/2512.01358
Authors: Junsung Park,Hogun Kee,Songhwai Oh
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 8 pages, 10 figures
Abstract:This paper presents a modality-augmented fine-tuning framework designed to adapt foundation robot policies to diverse humanoid embodiments. We validate our approach across two distinct settings: (i) the GR1 embodiment, utilizing public datasets where we introduce post-processed modalities, including binary contact signals and ZoeDepth-generated metric depth; and (ii) the Unitree G1 embodiment, for which we contribute a novel multi-modal dataset incorporating cuRobo motion planning, inverse kinematics, and ground-truth contact-force measurements. Our experiments demonstrate that modality augmentation consistently enhances policy performance across different embodiments. Specifically, for the GR1, integrating contact-state cues and RGB-D fusion improves online success rates from 51% to 63%. Furthermore, in the G1 “Pick Apple to Bowl” task, our contact-augmented model achieves a success rate of 94%, significantly outperforming the 48% achieved by standard fine-tuning and the 0% baseline of zero-shot transfer. These results highlight that lightweight post-processing effectively strengthens policies for GR1, while high-quality multi-modal data is crucial for reliable transfer to the Unitree G1. Consequently, this work establishes a unified, data-centric pathway for extending foundation robot policies through targeted modality design and multi-modal fine-tuning.
[LG-45] milearn: A Python Package for Multi-Instance Machine Learning
Link: https://arxiv.org/abs/2512.01287
Authors: Dmitry Zankov,Pavlo Polishchuk,Michal Sobieraj,Mario Barbatti
Categories: Machine Learning (cs.LG)
Comments: Open-source software for multi-instance machine learning
Abstract:We introduce milearn, a Python package for multi-instance learning (MIL) that follows the familiar scikit-learn fit/predict interface while providing a unified framework for both classical and neural-network-based MIL algorithms for regression and classification. The package also includes built-in hyperparameter optimization designed specifically for small MIL datasets, enabling robust model selection in data-scarce scenarios. We demonstrate the versatility of milearn across a broad range of synthetic MIL benchmark datasets, including digit classification and regression, molecular property prediction, and protein-protein interaction (PPI) prediction. Special emphasis is placed on the key instance detection (KID) problem, for which the package provides dedicated support.
[LG-46] Samplability makes learning easier
Link: https://arxiv.org/abs/2512.01276
Authors: Guy Blanc,Caleb Koch,Jane Lange,Carmen Strassle,Li-Yang Tan
Categories: Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: ITCS 2026
Abstract:The standard definition of PAC learning (Valiant 1984) requires learners to succeed under all distributions – even ones that are intractable to sample from. This stands in contrast to samplable PAC learning (Blum, Furst, Kearns, and Lipton 1993), where learners only have to succeed under samplable distributions. We study this distinction and show that samplable PAC substantially expands the power of efficient learners. We first construct a concept class that requires exponential sample complexity in standard PAC but is learnable with polynomial sample complexity in samplable PAC. We then lift this statistical separation to the computational setting and obtain a separation relative to a random oracle. Our proofs center around a new complexity primitive, explicit evasive sets, that we introduce and study. These are sets for which membership is easy to determine but are extremely hard to sample from. Our results extend to the online setting to similarly show how its landscape changes when the adversary is assumed to be efficient instead of computationally unbounded.
[LG-47] Efficient Hyperparameter Search for Non-Stationary Model Training
链接: https://arxiv.org/abs/2512.01258
作者: Berivan Isik,Matthew Fahrbach,Dima Kuzmin,Nicolas Mayoraz,Emil Praun,Steffen Rendle,Raghavendra Vasudeva
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Online learning is the cornerstone of applications like recommendation and advertising systems, where models continuously adapt to shifting data distributions. Model training for such systems is remarkably expensive, a cost that multiplies during hyperparameter search. We introduce a two-stage paradigm to reduce this cost: (1) efficiently identifying the most promising configurations, and then (2) training only these selected candidates to their full potential. Our core insight is that focusing on accurate identification in the first stage, rather than achieving peak performance, allows for aggressive cost-saving measures. We develop novel data reduction and prediction strategies that specifically overcome the challenges of sequential, non-stationary data not addressed by conventional hyperparameter optimization. We validate our framework’s effectiveness through a dual evaluation: first on the Criteo 1TB dataset, the largest suitable public benchmark, and second on an industrial advertising system operating at a scale two orders of magnitude larger. Our methods reduce the total hyperparameter search cost by up to 10× on the public benchmark and deliver significant, validated efficiency gains in the industrial setting.
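The two-stage idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `two_stage_search`, `cheap_score`, and `full_score` are hypothetical names, and the cheap proxy stands in for whatever data-reduction or prediction strategy the paper actually uses. The only point shown is that stage 1 needs to rank configurations well, not score them accurately.

```python
def two_stage_search(configs, cheap_score, full_score, keep=2):
    """Stage 1 ranks every config with a cheap proxy; stage 2 fully
    trains only the `keep` most promising ones (lower score = better)."""
    ranked = sorted(configs, key=cheap_score)            # stage 1: identification
    finalists = ranked[:keep]
    results = {cfg: full_score(cfg) for cfg in finalists}  # stage 2: full training
    best = min(results, key=results.get)
    return best, results


# Toy stand-ins (assumed values, not real training runs).
FULL_LOSS = {0.001: 1.0, 0.01: 0.0, 0.1: 1.0, 1.0: 4.0}
full_calls = []

def full_score(lr):
    full_calls.append(lr)        # count how many expensive runs happen
    return FULL_LOSS[lr]

def cheap_score(lr):
    # A biased but rank-preserving proxy of the full loss.
    return FULL_LOSS[lr] + 0.3 * lr

best, results = two_stage_search(list(FULL_LOSS), cheap_score, full_score, keep=2)
```

In this toy setting only two of the four configurations are ever trained in full, which is where the cost saving would come from.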
[LG-48] On the Tension Between Optimality and Adversarial Robustness in Policy Optimization
链接: https://arxiv.org/abs/2512.01228
作者: Haoran Li,Jiayu Lv,Congying Han,Zicheng Zhang,Anqi Li,Yan Liu,Tiande Guo,Nan Jiang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Optimality and adversarial robustness in deep reinforcement learning have long been regarded as conflicting goals. Nonetheless, recent theoretical insights presented in CAR suggest a potential alignment, raising the important question of how to realize this in practice. This paper first identifies a key gap between theory and practice by comparing standard policy optimization (SPO) and adversarially robust policy optimization (ARPO). Although they share theoretical consistency, a fundamental tension between robustness and optimality arises in practical policy gradient methods. SPO tends toward convergence to vulnerable first-order stationary policies (FOSPs) with strong natural performance, whereas ARPO typically favors more robust FOSPs at the expense of reduced returns. Furthermore, we attribute this tradeoff to the reshaping effect of the strongest adversary in ARPO, which significantly complicates the global landscape by inducing deceptive sticky FOSPs. This improves robustness but makes navigation more challenging. To alleviate this, we develop BARPO, a bilevel framework unifying SPO and ARPO by modulating adversary strength, thereby facilitating navigability while preserving global optima. Extensive empirical results demonstrate that BARPO consistently outperforms vanilla ARPO, providing a practical approach to reconcile theoretical and empirical performance.
[LG-49] CoSineVerifier: Tool-Augmented Answer Verification for Computation-Oriented Scientific Questions
链接: https://arxiv.org/abs/2512.01224
作者: Ruixiang Feng,Zhenwei An,Yuntao Wen,Ran Le,Yiming Jia,Chen Yang,Zongchao Chen,Lisi Chen,Shen Gao,Shuo Shang,Yang Song,Tao Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Answer verification methods are widely employed in language model training pipelines spanning data curation, evaluation, and reinforcement learning with verifiable rewards (RLVR). While prior work focuses on developing unified verifiers applicable across multiple reasoning scenarios, significant challenges remain in computation-oriented scientific domains, such as algebraic equivalence checking and physical constant substitution. In this paper, we introduce CoSineVerifier, a tool-augmented verifier that leverages external executors to perform precise computations and symbolic simplifications. CoSineVerifier enables robust verification that goes beyond simple semantic matching. We propose a novel two-stage pipeline, which begins with cold-start fine-tuning and is followed by multi-turn reinforcement learning with tool integration. Extensive experiments conducted on STEM subjects, general QA, and long-form reasoning tasks demonstrate strong generalization of CoSineVerifier. The results show that CoSineVerifier achieves state-of-the-art performance on VerifyBench-Hard and SCI-Bench. We also employ CoSineVerifier in RLVR as a reward model; the results show that it consistently outperforms both rubric-based and model-based verifiers on AIME’24 and AIME’25, demonstrating strong potential to enhance the reasoning capabilities of LLMs. Our model is released at this https URL.
[LG-50] A Comparative Study of Machine Learning Algorithms for Electricity Price Forecasting with LIME-Based Interpretability
链接: https://arxiv.org/abs/2512.01212
作者: Xuanyi Zhao,Jiawen Ding,Xueting Huang,Yibo Zhang
类目: Machine Learning (cs.LG)
*备注: 5 pages, 5 figures. Accepted for publication at ICEIEC 2025 (not yet published)
Abstract:With the rapid development of electricity markets, price volatility has significantly increased, making accurate forecasting crucial for power system operations and market decisions. Traditional linear models cannot capture the complex nonlinear characteristics of electricity pricing, necessitating advanced machine learning approaches. This study compares eight machine learning models using Spanish electricity market data, integrating consumption, generation, and meteorological variables. The models evaluated include linear regression, ridge regression, decision tree, KNN, random forest, gradient boosting, SVR, and XGBoost. Results show that KNN achieves the best performance with R^2 of 0.865, MAE of 3.556, and RMSE of 5.240. To enhance interpretability, LIME analysis reveals that meteorological factors and supply-demand indicators significantly influence price fluctuations through nonlinear relationships. This work demonstrates the effectiveness of machine learning models in electricity price forecasting while improving decision transparency through interpretability analysis.
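For readers less familiar with the three metrics quoted in this abstract, the following is a minimal pure-Python sketch of how R^2, MAE, and RMSE are conventionally computed. The function name `regression_metrics` and the toy numbers are illustrative assumptions and have nothing to do with the paper's Spanish market data.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (r2, mae, rmse) for two equal-length sequences of floats."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)      # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    mae = sum(abs(r) for r in residuals) / n
    rmse = math.sqrt(ss_res / n)
    return r2, mae, rmse


# Toy prices (made up): a perfect forecast and a "predict the mean" forecast.
prices = [1.0, 2.0, 3.0, 4.0]
perfect = regression_metrics(prices, prices)
mean_only = regression_metrics(prices, [2.5, 2.5, 2.5, 2.5])
```

A forecast that always predicts the mean gets R^2 = 0, which is why values such as the reported 0.865 are read as the share of price variance the model explains.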
[LG-51] Research on Milling Machine Predictive Maintenance Based on Machine Learning and SHAP Analysis in Intelligent Manufacturing Environment
链接: https://arxiv.org/abs/2512.01205
作者: Wen Zhao,Jiawen Ding,Xueting Huang,Yibo Zhang
类目: Machine Learning (cs.LG)
*备注: 5 pages, 5 figures. Accepted for publication at ICEIEC 2025 (not yet published)
Abstract:In the context of intelligent manufacturing, this paper conducts a series of experimental studies on the predictive maintenance of industrial milling machine equipment based on the AI4I 2020 dataset. This paper proposes a complete predictive maintenance experimental process combining artificial intelligence technology, including six main steps: data preprocessing, model training, model evaluation, model selection, SHAP analysis, and result visualization. By comparing and analyzing the performance of eight machine learning models, it is found that ensemble learning methods such as XGBoost and random forest perform well in milling machine fault prediction tasks. In addition, with the help of SHAP analysis technology, the influence mechanism of different features on equipment failure is deeply revealed, among which processing temperature, torque and speed are the key factors affecting failure. This study combines artificial intelligence and manufacturing technology, provides a methodological reference for predictive maintenance practice in an intelligent manufacturing environment, and has practical significance for promoting the digital transformation of the manufacturing industry, improving production efficiency and reducing maintenance costs.
[LG-52] The Evolution of Learning Algorithms for Artificial Neural Networks
链接: https://arxiv.org/abs/2512.01203
作者: Jonathan Baxter
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:
Abstract:In this paper we investigate a neural network model in which weights between computational nodes are modified according to a local learning rule. To determine whether local learning rules are sufficient for learning, we encode the network architectures and learning dynamics genetically and then apply selection pressure to evolve networks capable of learning the four boolean functions of one variable. The successful networks are analysed and we show how learning behaviour emerges as a distributed property of the entire network. Finally the utility of genetic algorithms as a tool of discovery is discussed.
[LG-53] Sum Rate Maximization in STAR-RIS-UAV-Assisted Networks: A CA-DDPG Approach for Joint Optimization
链接: https://arxiv.org/abs/2512.01202
作者: Yujie Huang,Haibin Wan,Xiangcheng Li,Tuanfa Qin,Yun Li,Jun Li,Wen Chen
类目: Machine Learning (cs.LG)
*备注: 14 pages, 12 figures
Abstract:With the rapid advances in programmable materials, reconfigurable intelligent surfaces (RIS) have become a pivotal technology for future wireless communications. The simultaneous transmitting and reflecting reconfigurable intelligent surfaces (STAR-RIS) can both transmit and reflect signals, enabling comprehensive signal control and expanding application scenarios. This paper introduces an unmanned aerial vehicle (UAV) to further enhance system flexibility and proposes an optimization design for the spectrum efficiency of the STAR-RIS-UAV-assisted wireless communication system. We present a deep reinforcement learning (DRL) algorithm capable of iteratively optimizing beamforming, phase shifts, and UAV positioning to maximize the system’s sum rate through continuous interactions with the environment. To improve exploration in deterministic policies, we introduce a stochastic perturbation factor, which enhances exploration capabilities. As exploration is strengthened, the algorithm’s ability to accurately evaluate the state-action value function becomes critical. Thus, based on the deep deterministic policy gradient (DDPG) algorithm, we propose a convolution-augmented deep deterministic policy gradient (CA-DDPG) algorithm that balances exploration and evaluation to improve the system’s sum rate. The simulation results demonstrate that the CA-DDPG algorithm effectively interacts with the environment, optimizing the beamforming matrix, phase shift matrix, and UAV location, thereby improving system capacity and achieving better performance than other algorithms.
[LG-54] Know Thyself by Knowing Others: Learning Neuron Identity from Population Context NEURIPS2025
链接: https://arxiv.org/abs/2512.01199
作者: Vinam Arora,Divyansha Lachi,Ian J. Knight,Mehdi Azabou,Blake Richards,Cole L. Hurwitz,Josh Siegle,Eva L. Dyer
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: Accepted at NeurIPS 2025
Abstract:Neurons process information in ways that depend on their cell type, connectivity, and the brain region in which they are embedded. However, inferring these factors from neural activity remains a significant challenge. To build general-purpose representations that allow for resolving information about a neuron’s identity, we introduce NuCLR, a self-supervised framework that aims to learn representations of neural activity that allow for differentiating one neuron from the rest. NuCLR brings together views of the same neuron observed at different times and across different stimuli and uses a contrastive objective to pull these representations together. To capture population context without assuming any fixed neuron ordering, we build a spatiotemporal transformer that integrates activity in a permutation-equivariant manner. Across multiple electrophysiology and calcium imaging datasets, a linear decoding evaluation on top of NuCLR representations achieves a new state-of-the-art for both cell type and brain region decoding tasks, and demonstrates strong zero-shot generalization to unseen animals. We present the first systematic scaling analysis for neuron-level representation learning, showing that increasing the number of animals used during pretraining consistently improves downstream performance. The learned representations are also label-efficient, requiring only a small fraction of labeled samples to achieve competitive performance. These results highlight how large, diverse neural datasets enable models to recover information about neuron identity that generalize across animals. Code is available at this https URL.
[LG-55] Learning to Reconstruct Temperature Field from Sparse Observations with Implicit Physics Priors
链接: https://arxiv.org/abs/2512.01196
作者: Shihang Li,Zhiqiang Gong,Weien Zhou,Yue Gao,Wen Yao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate reconstruction of temperature field of heat-source systems (TFR-HSS) is crucial for thermal monitoring and reliability assessment in engineering applications such as electronic devices and aerospace structures. However, the high cost of measurement acquisition and the substantial distributional shifts in temperature field across varying conditions present significant challenges for developing reconstruction models with robust generalization capabilities. Existing DNNs-based methods typically formulate TFR-HSS as a one-to-one regression problem based solely on target sparse measurements, without effectively leveraging reference simulation data that implicitly encode thermal knowledge. To address this limitation, we propose IPTR, an implicit physics-guided temperature field reconstruction framework that introduces sparse monitoring-temperature field pair from reference simulations as priors to enrich physical understanding. To integrate both reference and target information, we design a dual physics embedding module consisting of two complementary branches: an implicit physics-guided branch employing cross-attention to distill latent physics from the reference data, and an auxiliary encoding branch based on Fourier layers to capture the spatial characteristics of the target observation. The fused representation is then decoded to reconstruct the full temperature field. Extensive experiments under single-condition, multi-condition, and few-shot settings demonstrate that IPTR consistently outperforms existing methods, achieving state-of-the-art reconstruction accuracy and strong generalization capability.
[LG-56] LGDC: Latent Graph Diffusion via Spectrum-Preserving Coarsening
链接: https://arxiv.org/abs/2512.01190
作者: Nagham Osman,Keyue Jiang,Davide Buffelli,Xiaowen Dong,Laura Toni
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph generation is a critical task across scientific domains. Existing methods fall broadly into two categories: autoregressive models, which iteratively expand graphs, and one-shot models, such as diffusion, which generate the full graph at once. In this work, we provide an analysis of these two paradigms and reveal a key trade-off: autoregressive models stand out in capturing fine-grained local structures, such as degree and clustering properties, whereas one-shot models excel at modeling global patterns, such as spectral distributions. Building on this, we propose LGDC (latent graph diffusion via spectrum-preserving coarsening), a hybrid framework that combines strengths of both approaches. LGDC employs a spectrum-preserving coarsening-decoarsening to bidirectionally map between graphs and a latent space, where diffusion efficiently generates latent graphs before expansion restores detail. This design captures both local and global properties with improved efficiency. Empirically, LGDC matches autoregressive models on locally structured datasets (Tree) and diffusion models on globally structured ones (Planar, Community-20), validating the benefits of hybrid generation.
[LG-57] From Regression to Classification: Exploring the Benefits of Categorical Representations of Energy in MLIPs
链接: https://arxiv.org/abs/2512.01160
作者: Ahmad Ali
类目: Machine Learning (cs.LG); Molecular Networks (q-bio.MN)
*备注: 11th Annual Conference on Vision and Intelligent Systems (CVIS 2025)
Abstract:Density Functional Theory (DFT) is a widely used computational method for estimating the energy and behavior of molecules. Machine Learning Interatomic Potentials (MLIPs) are models trained to approximate DFT-level energies and forces at dramatically lower computational cost. Many modern MLIPs rely on a scalar regression formulation; given information about a molecule, they predict a single energy value and corresponding forces while minimizing absolute error with DFT’s calculations. In this work, we explore a multi-class classification formulation that predicts a categorical distribution over energy/force values, providing richer supervision through multiple targets. Most importantly, this approach offers a principled way to quantify model uncertainty. In particular, our method predicts a histogram of the energy/force distribution, converts scalar targets into histograms, and trains the model using cross-entropy loss. Our results demonstrate that this categorical formulation can achieve absolute error performance comparable to regression baselines. Furthermore, this representation enables the quantification of epistemic uncertainty through the entropy of the predicted distribution, offering a measure of model confidence absent in scalar regression approaches.
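The three ingredients named in this abstract (scalar targets converted into histograms, a cross-entropy loss, and entropy as a confidence signal) can be sketched as follows. The abstract does not say how scalars are binned, so the "two-hot" encoding used here is an assumption, as are all function names and the toy bin centers.

```python
import math

def two_hot(value, bin_centers):
    """Spread a scalar over its two neighbouring bins so that the
    histogram's mean equals the scalar (assumed encoding, sorted bins)."""
    probs = [0.0] * len(bin_centers)
    if value <= bin_centers[0]:
        probs[0] = 1.0
        return probs
    if value >= bin_centers[-1]:
        probs[-1] = 1.0
        return probs
    for i in range(len(bin_centers) - 1):
        lo, hi = bin_centers[i], bin_centers[i + 1]
        if lo <= value <= hi:
            w_hi = (value - lo) / (hi - lo)
            probs[i] = 1.0 - w_hi
            probs[i + 1] = w_hi
            return probs

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target histogram and a predicted one."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def entropy(probs, eps=1e-12):
    """Shannon entropy of a predicted histogram; higher = less confident."""
    return -sum(p * math.log(p + eps) for p in probs if p > 0)

def expected_value(probs, bin_centers):
    """Collapse a histogram back to a scalar prediction."""
    return sum(p * c for p, c in zip(probs, bin_centers))


centers = [0.0, 1.0, 2.0, 3.0]            # toy energy bins (made up)
target = two_hot(1.25, centers)
uniform = [0.25, 0.25, 0.25, 0.25]
```

A sharply peaked prediction has entropy near zero, while a flat one approaches the logarithm of the number of bins, which is the sense in which entropy can serve as a confidence measure.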
[LG-58] Fiber Bundle Networks: A Geometric Machine Learning Paradigm
链接: https://arxiv.org/abs/2512.01151
作者: Dong Liu
类目: Machine Learning (cs.LG); Differential Geometry (math.DG)
*备注: 18 pages, 1 figure
Abstract:We propose Fiber Bundle Networks (FiberNet), a novel machine learning framework integrating differential geometry with machine learning. Unlike traditional deep neural networks relying on black-box function fitting, we reformulate classification as interpretable geometric optimization on fiber bundles, where categories form the base space and wavelet-transformed features lie in the fibers above each category. We introduce two innovations: (1) learnable Riemannian metrics identifying important frequency feature components, (2) variational prototype optimization through energy function minimization. Classification is performed via Voronoi tessellation under the learned Riemannian metric, where each prototype defines a decision region and test samples are assigned to the nearest prototype, providing clear geometric interpretability. This work demonstrates that the integration of fiber bundle with machine learning provides interpretability and efficiency, which are difficult to obtain simultaneously in conventional deep learning.
[LG-59] Dynamic Algorithm for Explainable k-medians Clustering under lp Norm NEURIPS2025
链接: https://arxiv.org/abs/2512.01150
作者: Konstantin Makarychev,Ilias Papanikolaou,Liren Shan
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注: 36 pages, 3 figures, to appear in NeurIPS 2025
Abstract:We study the problem of explainable k-medians clustering introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (2020). In this problem, the goal is to construct a threshold decision tree that partitions data into k clusters while minimizing the k-medians objective. These trees are interpretable because each internal node makes a simple decision by thresholding a single feature, allowing users to trace and understand how each point is assigned to a cluster. We present the first algorithm for explainable k-medians under the $\ell_p$ norm for every finite $p \ge 1$. Our algorithm achieves an $O(p(\log k)^{1 + 1/p - 1/p^2})$ approximation to the optimal k-medians cost for any $p \ge 1$. Previously, algorithms were known only for $p = 1$ and $p = 2$. For $p = 2$, our algorithm improves upon the existing bound of $O(\log^{3/2} k)$, and for $p = 1$, it matches the tight bound of $\log k + O(1)$ up to a multiplicative $O(\log \log k)$ factor. We show how to implement our algorithm in a dynamic setting. The dynamic algorithm maintains an explainable clustering under a sequence of insertions and deletions, with amortized update time $O(d \log^3 k)$ and $O(\log k)$ recourse, making it suitable for large-scale and evolving datasets.
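To make the notion of a threshold decision tree concrete, here is a minimal sketch of how such a tree assigns a point to a cluster and how the k-medians cost under a p-norm would be evaluated. It does not reproduce the paper's construction or its dynamic updates; `Node`, `assign`, and `k_medians_cost` are hypothetical names and the tree below is hand-built.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: int = -1                 # index of the single feature tested
    threshold: float = 0.0
    left: Optional["Node"] = None     # taken when point[feature] <= threshold
    right: Optional["Node"] = None
    cluster: Optional[int] = None     # set only on leaves

def assign(node, point):
    """Walk the tree; return the cluster id and the human-readable path."""
    path = []
    while node.cluster is None:
        go_left = point[node.feature] <= node.threshold
        path.append((node.feature, node.threshold, "<=" if go_left else ">"))
        node = node.left if go_left else node.right
    return node.cluster, path

def lp_distance(a, b, p):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def k_medians_cost(tree, points, centers, p):
    """Sum of p-norm distances from each point to its assigned center."""
    return sum(lp_distance(pt, centers[assign(tree, pt)[0]], p) for pt in points)


# Hand-built tree with k = 3 leaves (toy example).
tree = Node(feature=0, threshold=0.5,
            left=Node(cluster=0),
            right=Node(feature=1, threshold=2.0,
                       left=Node(cluster=1), right=Node(cluster=2)))
centers = [(0.0, 0.0), (1.0, 1.0), (1.0, 3.0)]
points = [(0.0, 0.0), (1.0, 1.0), (1.0, 3.0), (1.0, 4.0)]
```

The returned path is the explanation: each assignment is justified by at most a handful of single-feature comparisons.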
[LG-60] Projection-Free CNN Pruning via Frank-Wolfe with Momentum: Sparser Models with Less Pretraining DATE
链接: https://arxiv.org/abs/2512.01147
作者: Hamza ElMokhtar Shili,Natasha Patnaik,Isabelle Ruble,Kathryn Jarjoura,Daniel Suarez Aguirre
类目: Machine Learning (cs.LG)
*备注: Preliminary preprint; numerical experiments are still being validated and may be updated in future revisions
Abstract:We investigate algorithmic variants of the Frank-Wolfe (FW) optimization method for pruning convolutional neural networks. This is motivated by the “Lottery Ticket Hypothesis”, which suggests the existence of smaller sub-networks within larger pre-trained networks that perform comparatively well (if not better). Whilst most literature in this area focuses on Deep Neural Networks more generally, we specifically consider Convolutional Neural Networks for image classification tasks. Building on the hypothesis, we compare simple magnitude-based pruning, a Frank-Wolfe style pruning scheme, and an FW method with momentum on a CNN trained on MNIST. Our experiments track test accuracy, loss, sparsity, and inference time as we vary the dense pre-training budget from 1 to 10 epochs. We find that FW with momentum yields pruned networks that are both sparser and more accurate than the original dense model and the simple pruning baselines, while incurring minimal inference-time overhead in our implementation. Moreover, FW with momentum reaches these accuracies after only a few epochs of pre-training, indicating that full pre-training of the dense model is not required in this setting.
[LG-61] Neural Variable Name Repair: Learning to Rename Identifiers for Readability
链接: https://arxiv.org/abs/2512.01141
作者: Muhammad Yousuf,Akshat Bagade,Chhittebbayi Penugonda,Maanas Baraya
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
Abstract:Developers routinely work with source files whose variable names are generic or misleading, and with teams moving quickly, many functions are left undocumented. This slows comprehension, increases the risk of subtle bugs, and makes it harder for both humans and large language models (LLMs) to reason about code. We study variable name repair: given a real C++ function where all occurrences of one local or parameter name have been replaced by a placeholder (e.g. ID 1), the goal is to generate a natural, descriptive replacement name. We automatically construct this task from the C++ portion of BigCode’s The Stack by parsing functions with Tree-sitter, masking a single identifier, and treating the original name as supervision. On top of Llama 3.1-8B, we build a pipeline with (i) warmup and dropout schedules for more stable fine-tuning, (ii) LoRA adapters for efficient specialization on identifier repair, and (iii) a dual-encoder reranker over top-k generator candidates. We evaluate using exact match, Top-5 Hit, and an embedding-based partial similarity score (0-100) that gives credit for near synonyms and format variants (e.g., jsonValue vs. json). On a held-out set of 200 C++ functions, a zero-shot Llama 3.1 baseline reaches 6.1 percent exact match. Our best LoRA-tuned model (with warmup and dropout) achieves 43.1 percent exact match, 50.2 percent Top-5 Hit, and an 82.03 partial-match score. A dual encoder reranker further improves selection quality without modifying the underlying generator, suggesting that task-specific fine-tuning plus reranking is a promising approach for practical identifier repair tools.
[LG-62] Bayesian dynamic scheduling of multipurpose batch processes under incomplete look-ahead information
链接: https://arxiv.org/abs/2512.01093
作者: Taicheng Zheng,Dan Li,Jie Li
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
Abstract:Multipurpose batch processes become increasingly popular in manufacturing industries since they adapt to low-volume, high-value products and shifting demands. These processes often operate in a dynamic environment, which faces disturbances such as processing delays and demand changes. To minimise long-term cost and system nervousness (i.e., disruptive changes to schedules), schedulers must design rescheduling strategies to address such disturbances effectively. Existing methods often assume complete look-ahead information over the scheduling horizon. This assumption contrasts with realistic situations where schedulers can only access incomplete look-ahead information. Sticking with existing methods may lead to suboptimal long-term costs and high-level system nervousness. In this work we propose a Bayesian dynamic scheduling method. Our method relies on learning a Bayesian Network from the probability distribution of disturbances. Specifically, the Bayesian Network represents how likely each operation will be impacted by disturbances. During the online execution, when new disturbances become observed, this method updates the posterior distribution and therefore guides the rescheduling strategy. We compare our method with the existing periodic rescheduling strategy (which generates new schedules from scratch at fixed intervals) on four benchmark problems. Computational results show that our method achieves statistically better long-term costs and system nervousness. In the theoretical aspect, we prove that if disturbances are mutually independent, the impact-quantifying variables inherently satisfy the independence assumptions required by Bayesian Networks. As an implication, practitioners can extend the method to other scheduling problems (such as job shop scheduling and continuous processes), given that they define the problem-specific dependencies between operations.
[LG-63] The Silence that Speaks: Neural Estimation via Communication Gaps
链接: https://arxiv.org/abs/2512.01056
作者: Shubham Aggarwal,Dipankar Maity,Tamer Başar
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Accurate remote state estimation is a fundamental component of many autonomous and networked dynamical systems, where multiple decision-making agents interact and communicate over shared, bandwidth-constrained channels. These communication constraints introduce an additional layer of complexity, namely, the decision of when to communicate. This results in a fundamental trade-off between estimation accuracy and communication resource usage. Traditional extensions of classical estimation algorithms (e.g., the Kalman filter) treat the absence of communication as ‘missing’ information. However, silence itself can carry implicit information about the system’s state, which, if properly interpreted, can enhance the estimation quality even in the absence of explicit communication. Leveraging this implicit structure, however, poses significant analytical challenges, even in relatively simple systems. In this paper, we propose CALM (Communication-Aware Learning and Monitoring), a novel learning-based framework that jointly addresses the dual challenges of communication scheduling and estimator design. Our approach entails learning not only when to communicate but also how to infer useful information from periods of communication silence. We perform comparative case studies on multiple benchmarks to demonstrate that CALM is able to decode the implicit coordination between the estimator and the scheduler to extract information from the instances of ‘silence’ and enhance the estimation accuracy.
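The claim that silence can carry information is easiest to see with a classical threshold-triggered sensor, sketched below. This is a textbook-style illustration under stated assumptions, not CALM itself: the scalar dynamics, the fixed threshold rule, and the name `simulate` are all hypothetical, and the sensor is assumed to run a copy of the estimator's prediction.

```python
def simulate(a, x0, disturbances, threshold):
    """Scalar system x_{t+1} = a*x_t + w_t with a threshold-triggered sensor.

    Returns one (sent, estimate, (low, high), true_state) tuple per step.
    """
    x, estimate = x0, x0
    log = []
    for w in disturbances:
        x = a * x + w                      # true state evolves
        predicted = a * estimate           # estimator's open-loop prediction
        if abs(x - predicted) > threshold:
            estimate = x                   # sensor transmits the exact state
            log.append((True, estimate, (x, x), x))
        else:
            # Silence: no packet arrives, yet the estimator still learns that
            # the state lies within `threshold` of its own prediction.
            estimate = predicted
            log.append((False, estimate,
                        (predicted - threshold, predicted + threshold), x))
    return log


log = simulate(a=0.9, x0=1.0, disturbances=[0.05, -0.02, 0.5, 0.0], threshold=0.1)
sent_flags = [entry[0] for entry in log]
```

Treating the silent steps as "missing" data would discard the interval bound that the scheduling rule implies, which is the kind of implicit information the abstract refers to.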
[LG-64] Joint Partitioning and Placement of Foundation Models for Real-Time Edge AI
链接: https://arxiv.org/abs/2512.01039
作者: Aladin Djuhera,Fernando Koch,Alecio Binotto
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:
Abstract:Inference over large-scale foundation models within heterogeneous edge environments necessitates a fundamentally reconfigurable orchestration substrate. Static partitioning of model layers presumes temporal stability across compute and network resources, which is misaligned with the volatility of real-world deployments. We introduce a framework in which both the spatial placement and internal segmentation of foundation models are elevated to runtime-resolved constructs. The orchestration problem is formalized as a constrained optimization over layer-wise assignments, subject to evolving latency, utilization, and privacy gradients. The framework implements reactive inference composition responsive to infrastructural fluctuations by integrating model-aware capacity profiling with dynamic graph re-partitioning and reallocation. We introduce architectural and algorithmic components, along with a representative use case in 6G multi-access edge computing.
[LG-65] Upper Approximation Bounds for Neural Oscillators
链接: https://arxiv.org/abs/2512.01015
作者: Zifeng Huang,Konstantin M. Zuev,Yong Xia,Michael Beer
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Functional Analysis (math.FA)
*备注: 30 pages, 4 figures
Abstract:Neural oscillators, originating from the second-order ordinary differential equations (ODEs), have demonstrated competitive performance in stably learning causal mappings between long-term sequences or continuous temporal functions. However, theoretically quantifying the capacities of their neural network architectures remains a significant challenge. In this study, the neural oscillator consisting of a second-order ODE followed by a multilayer perceptron (MLP) is considered. Its upper approximation bound for approximating causal and uniformly continuous operators between continuous temporal function spaces and that for approximating uniformly asymptotically incrementally stable second-order dynamical systems are derived. The established proof method of the approximation bound for approximating the causal continuous operators can also be directly applied to state-space models consisting of a linear time-continuous complex recurrent neural network followed by an MLP. Theoretical results reveal that the approximation error of the neural oscillator for approximating the second-order dynamical systems scales polynomially with the reciprocals of the widths of two utilized MLPs, thus mitigating the curse of parametric complexity. The decay rates of two established approximation error bounds are validated through two numerical cases. These results provide a robust theoretical foundation for the effective application of the neural oscillator in science and engineering.
[LG-66] Subgroup Validity in Machine Learning for Echocardiogram Data
链接: https://arxiv.org/abs/2512.00976
作者: Cynthia Feeney,Shane Williams,Benjamin S. Wessler,Michael C. Hughes
类目: Machine Learning (cs.LG); Other Statistics (stat.OT)
*备注:
Abstract:Echocardiogram datasets enable training deep learning models to automate interpretation of cardiac ultrasound, thereby expanding access to accurate readings of diagnostically-useful images. However, the gender, sex, race, and ethnicity of the patients in these datasets are underreported and subgroup-specific predictive performance is unevaluated. These reporting deficiencies raise concerns about subgroup validity that must be studied and addressed before model deployment. In this paper, we show that current open echocardiogram datasets are unable to assuage subgroup validity concerns. We improve sociodemographic reporting for two datasets: TMED-2 and MIMIC-IV-ECHO. Analysis of six open datasets reveals no consideration of gender-diverse patients and insufficient patient counts for many racial and ethnic groups. We further perform an exploratory subgroup analysis of two published aortic stenosis detection models on TMED-2. We find insufficient evidence for subgroup validity for sex, racial, and ethnic subgroups. Our findings highlight that more data for underrepresented subgroups, improved demographic reporting, and subgroup-focused analyses are needed to prove subgroup validity in future work.
[LG-67] Mitigating Indirect Prompt Injection via Instruction-Following Intent Analysis
Link: https://arxiv.org/abs/2512.00966
Authors: Mintong Kang, Chong Xiang, Sanjay Kariyappa, Chaowei Xiao, Bo Li, Edward Suh
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments:
Abstract:Indirect prompt injection attacks (IPIAs), where large language models (LLMs) follow malicious instructions hidden in input data, pose a critical threat to LLM-powered agents. In this paper, we present IntentGuard, a general defense framework based on instruction-following intent analysis. The key insight of IntentGuard is that the decisive factor in IPIAs is not the presence of malicious text, but whether the LLM intends to follow instructions from untrusted data. Building on this insight, IntentGuard leverages an instruction-following intent analyzer (IIA) to identify which parts of the input prompt the model recognizes as actionable instructions, and then flag or neutralize any overlaps with untrusted data segments. To instantiate the framework, we develop an IIA that uses three “thinking intervention” strategies to elicit a structured list of intended instructions from reasoning-enabled LLMs. These techniques include start-of-thinking prefilling, end-of-thinking refinement, and adversarial in-context demonstration. We evaluate IntentGuard on two agentic benchmarks (AgentDojo and Mind2Web) using two reasoning-enabled LLMs (Qwen-3-32B and gpt-oss-20B). Results demonstrate that IntentGuard achieves (1) no utility degradation in all but one setting and (2) strong robustness against adaptive prompt injection attacks (e.g., reducing attack success rates from 100% to 8.5% in a Mind2Web scenario).
[LG-68] Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning
Link: https://arxiv.org/abs/2512.00961
Authors: Qi Wang, Mian Wu, Yuyang Zhang, Mingqi Yuan, Wenyao Zhang, Haoxiang You, Yunbo Wang, Xin Jin, Xiaokang Yang, Wenjun Zeng
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Reinforcement Learning (RL) has achieved remarkable success in various domains, yet it often relies on carefully designed programmatic reward functions to guide agent behavior. Designing such reward functions can be challenging and may not generalize well across different tasks. To address this limitation, we leverage the rich world knowledge contained in pretrained video diffusion models to provide goal-driven reward signals for RL agents without ad-hoc design of reward. Our key idea is to exploit off-the-shelf video diffusion models pretrained on large-scale video datasets as informative reward functions in terms of video-level and frame-level goals. For video-level rewards, we first finetune a pretrained video diffusion model on domain-specific datasets and then employ its video encoder to evaluate the alignment between the latent representations of agent’s trajectories and the generated goal videos. To enable more fine-grained goal-achievement, we derive a frame-level goal by identifying the most relevant frame from the generated video using CLIP, which serves as the goal state. We then employ a learned forward-backward representation that represents the probability of visiting the goal state from a given state-action pair as frame-level reward, promoting more coherent and goal-driven trajectories. Experiments on various Meta-World tasks demonstrate the effectiveness of our approach.
[LG-69] WUSH: Near-Optimal Adaptive Transforms for LLM Quantization
Link: https://arxiv.org/abs/2512.00956
Authors: Jiale Chen, Vage Egiazarian, Torsten Hoefler, Dan Alistarh
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Quantization to low bitwidth is a standard approach for deploying large language models, however, a few extreme weights and activations stretch the dynamic range and reduce the effective resolution of the quantizer. A common mitigation approach is to apply some fixed orthogonal transforms, such as Hadamard matrices, before quantization, which typically reduces the dynamic range. Yet, these transforms ignore the statistics of the data, and their optimality is currently not understood. In this work, we derive, for the first time, closed-form optimal linear blockwise transforms for joint weight-activation quantization using standard data-free quantizers for common numerical formats. Specifically, we provide derivations of the optimal adaptive (data-aware) transforms for round-to-nearest (RTN), AbsMax-scaled block quantizers for both integer and floating-point formats. The resulting construction, which we call WUSH, combines a Hadamard backbone with a data-dependent component based on second-order moments, yielding a non-orthogonal transform that is provably optimal under mild assumptions and remains structured for efficient implementation. Preliminary experimental results show that our approach consistently improves upon the Hadamard transform for common formats.
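The baseline this abstract starts from (a fixed Hadamard rotation followed by an AbsMax-scaled round-to-nearest block quantizer) can be sketched in a few lines. This is a minimal illustration, not the WUSH construction itself: the block values, the 4-bit symmetric integer grid, and all function names are assumptions made for the example. It only shows how one outlier stretches the dynamic range and how the rotation shrinks it; whether the reconstruction error improves depends on the data.

```python
import math

def hadamard(n):
    """Normalized Sylvester Hadamard matrix of size n (n must be a power of two)."""
    H = [[1.0]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in H]

def matvec(M, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def rtn_absmax(block, bits=4):
    """Round each value to a symmetric integer grid scaled by the block's abs-max."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    scale = amax / qmax
    return [round(v / scale) * scale for v in block]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Hypothetical block: seven small values and one outlier.
block = [0.1, -0.2, 0.15, 0.05, -0.1, 0.2, -0.05, 8.0]
H = hadamard(len(block))

# Quantize directly: the outlier sets the scale for the whole block.
plain_err = mse(block, rtn_absmax(block))

# Rotate, quantize, rotate back. The normalized Sylvester matrix is
# symmetric and orthogonal, so applying it twice is the identity.
rotated = matvec(H, block)
restored = matvec(H, rtn_absmax(rotated))
hadamard_err = mse(block, restored)

print("abs-max before rotation:", max(abs(v) for v in block))
print("abs-max after rotation: ", max(abs(v) for v in rotated))
print("MSE without / with rotation:", plain_err, hadamard_err)
```

The rotation is fixed and ignores the data, which is the limitation the abstract points to; a data-aware transform would replace `H` with a matrix derived from second-order statistics.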
[LG-70] Memory-Integrated Reconfigurable Adapters: A Unified Framework for Settings with Multiple Tasks NEURIPS2025
Link: https://arxiv.org/abs/2512.00940
Authors: Susmit Agrawal, Krishn Vishwas Kher, Saksham Mittal, Swarnim Maheshwari, Vineeth N. Balasubramanian
Subjects: Machine Learning (cs.LG)
*Comments: NeurIPS 2025; 31 pages, 2 figures
Abstract:Organisms constantly pivot between tasks such as evading predators, foraging, traversing rugged terrain, and socializing, often within milliseconds. Remarkably, they preserve knowledge of once-learned environments sans catastrophic forgetting, a phenomenon neuroscientists hypothesize is due to a singular neural circuitry dynamically overlaid by neuromodulatory agents such as dopamine and acetylcholine. In parallel, deep learning research addresses analogous challenges via domain generalization (DG) and continual learning (CL), yet these methods remain siloed, despite the brain's ability to perform them seamlessly. In particular, prior work has not explored architectures involving associative memories (AMs), which are an integral part of biological systems, to jointly address these tasks. We propose Memory-Integrated Reconfigurable Adapters (MIRA), a unified framework that integrates Hopfield-style associative memory modules atop a shared backbone. Associative memory keys are learned post-hoc to index and retrieve an affine combination of stored adapter updates for any given task or domain on a per-sample basis. By varying only the task-specific objectives, we demonstrate that MIRA seamlessly accommodates domain shifts and sequential task exposures under one roof. Empirical evaluations on standard benchmarks confirm that our AM-augmented architecture significantly enhances adaptability and retention: in DG, MIRA achieves SoTA out-of-distribution accuracy, and in incremental learning settings, it outperforms architectures explicitly designed to handle catastrophic forgetting using generic CL algorithms. By unifying adapter-based modulation with biologically inspired associative memory, MIRA delivers rapid task switching and enduring knowledge retention in a single extensible architecture, charting a path toward more versatile and memory-augmented AI systems.
[LG-71] D-CTNet: A Dual-Branch Channel-Temporal Forecasting Network with Frequency-Domain Correction
Link: https://arxiv.org/abs/2512.00925
Authors: Shaoxun Wang, Xingjun Zhang, Kun Xia, Qianyang Li, Jiawei Cao, Zhendong Tan
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Accurate Multivariate Time Series (MTS) forecasting is crucial for collaborative design of complex systems, Digital Twin building, and maintenance ahead of time. However, the collaborative industrial environment presents new challenges for MTS forecasting models: models should decouple complex inter-variable dependencies while addressing non-stationary distribution shift brought by environmental changes. To address these challenges and improve collaborative sensing reliability, we propose a Patch-Based Dual-Branch Channel-Temporal Forecasting Network (D-CTNet). Particularly, with a parallel dual-branch design incorporating linear temporal modeling layer and channel attention mechanism, our method explicitly decouples and jointly learns intra-channel temporal evolution patterns and dynamic multivariate correlations. Furthermore, a global patch attention fusion module goes beyond the local window scope to model long range dependencies. Most importantly, aiming at non-stationarity, a Frequency-Domain Stationarity Correction mechanism adaptively suppresses distribution shift impacts from environment change by spectrum alignment. Evaluations on seven benchmark datasets show that our model achieves better forecasting accuracy and robustness compared with state-of-the-art methods. Our work shows great promise as a new forecasting engine for industrial collaborative systems.
[LG-72] Partially Equivariant Reinforcement Learning in Symmetry-Breaking Environments
Link: https://arxiv.org/abs/2512.00915
Authors: Junwoo Chang, Minwoo Park, Joohwan Seo, Roberto Horowitz, Jongmin Lee, Jongeun Choi
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
*Comments: 27 pages, 10 figures
Abstract:Group symmetries provide a powerful inductive bias for reinforcement learning (RL), enabling efficient generalization across symmetric states and actions via group-invariant Markov Decision Processes (MDPs). However, real-world environments almost never realize fully group-invariant MDPs; dynamics, actuation limits, and reward design usually break symmetries, often only locally. Under group-invariant Bellman backups for such cases, local symmetry-breaking introduces errors that propagate across the entire state-action space, resulting in global value estimation errors. To address this, we introduce Partially group-Invariant MDP (PI-MDP), which selectively applies group-invariant or standard Bellman backups depending on where symmetry holds. This framework mitigates error propagation from locally broken symmetries while maintaining the benefits of equivariance, thereby enhancing sample efficiency and generalizability. Building on this framework, we present practical RL algorithms – Partially Equivariant (PE)-DQN for discrete control and PE-SAC for continuous control – that combine the benefits of equivariance with robustness to symmetry-breaking. Experiments across Grid-World, locomotion, and manipulation benchmarks demonstrate that PE-DQN and PE-SAC significantly outperform baselines, highlighting the importance of selective symmetry exploitation for robust and sample-efficient RL.
[LG-73] The Spectral Dimension of NTKs is Constant: A Theory of Implicit Regularization, Finite-Width Stability, and Scalable Estimation
Link: https://arxiv.org/abs/2512.00860
Authors: Praveen Anilkumar Shukla
Subjects: Machine Learning (cs.LG)
*Comments: 8 pages, 2 figures
Abstract:Modern deep networks are heavily overparameterized yet often generalize well, suggesting a form of low intrinsic complexity not reflected by parameter counts. We study this complexity at initialization through the effective rank of the Neural Tangent Kernel (NTK) Gram matrix, $r_{\text{eff}}(K) = (\text{tr}(K))^2 / \|K\|_F^2$. For i.i.d. data and the infinite-width NTK $k$, we prove a constant-limit law $\lim_{n\to\infty} \mathbb{E}[r_{\text{eff}}(K_n)] = \mathbb{E}[k(x, x)]^2 / \mathbb{E}[k(x, x')^2] =: r_\infty$, with sub-Gaussian concentration. We further establish finite-width stability: if the finite-width NTK deviates in operator norm by $O_p(m^{-1/2})$ (width $m$), then $r_{\text{eff}}$ changes by $O_p(m^{-1/2})$. We design a scalable estimator using random output probes and a CountSketch of parameter Jacobians and prove conditional unbiasedness and consistency with explicit variance bounds. On CIFAR-10 with ResNet-20/56 (widths 16/32) across $n \in \{10^3, 5\times10^3, 10^4, 2.5\times10^4, 5\times10^4\}$, we observe $r_{\text{eff}} \approx 1.0\text{–}1.3$ and slopes $\approx 0$ in $n$, consistent with the theory, and the kernel-moment prediction closely matches fitted constants.
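To make the effective-rank quantity in this abstract concrete, the sketch below evaluates the squared trace divided by the squared Frobenius norm on two toy Gram matrices. The function name and the matrices are illustrative assumptions; nothing here reproduces the paper's estimator, which works with random probes and a CountSketch rather than the full matrix.

```python
def effective_rank(K):
    """Squared trace over squared Frobenius norm of a square matrix K."""
    trace = sum(K[i][i] for i in range(len(K)))
    frob_sq = sum(v * v for row in K for v in row)
    return trace * trace / frob_sq

# Toy Gram matrices (hypothetical, for illustration only).
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
all_ones = [[1.0] * 4 for _ in range(4)]

print(effective_rank(identity))  # 4.0: four equal eigenvalues
print(effective_rank(all_ones))  # 1.0: a single nonzero eigenvalue
```

For a symmetric positive semidefinite matrix this ratio equals the squared sum of eigenvalues over the sum of squared eigenvalues, so a value close to 1, as reported in the abstract, corresponds to a spectrum dominated by one eigenvalue.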
[LG-74] Robust Probabilistic Load Forecasting for a Single Household: A Comparative Study from SARIMA to Transformers on the REFIT Dataset PAKDD2026
Link: https://arxiv.org/abs/2512.00856
Authors: Midhun Manoj
Subjects: Machine Learning (cs.LG)
*Comments: 12 pages, 8 figures, 1 table. This work includes a rigorous comparative study of imputation methods and presents results submitted to PAKDD 2026. Source code and analysis notebooks are available on GitHub: [ this https URL ]
Abstract:Probabilistic forecasting is essential for modern risk management, allowing decision-makers to quantify uncertainty in critical systems. This paper tackles this challenge using the volatile REFIT household dataset, which is complicated by a large structural data gap. We first address this by conducting a rigorous comparative experiment to select a Seasonal Imputation method, demonstrating its superiority over linear interpolation in preserving the data’s underlying distribution. We then systematically evaluate a hierarchy of models, progressing from classical baselines (SARIMA, Prophet) to machine learning (XGBoost) and advanced deep learning architectures (LSTM). Our findings reveal that classical models fail to capture the data’s non-linear, regime-switching behavior. While the LSTM provided the most well-calibrated probabilistic forecast, the Temporal Fusion Transformer (TFT) emerged as the superior all-round model, achieving the best point forecast accuracy (RMSE 481.94) and producing safer, more cautious prediction intervals that effectively capture extreme volatility.
[LG-75] City-Conditioned Memory for Multi-City Traffic and Mobility Forecasting
Link: https://arxiv.org/abs/2512.00851
Authors: Wenzhang Du
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)
*Comments:
Abstract:Deploying spatio-temporal forecasting models across many cities is difficult: traffic networks differ in size and topology, data availability can vary by orders of magnitude, and new cities may provide only a short history of logs. Existing deep traffic models are typically trained per city and backbone, creating high maintenance cost and poor transfer to data-scarce cities. We ask whether a single, backbone-agnostic layer can condition on “which city this sequence comes from”, improve accuracy in full- and low-data regimes, and support better cross-city adaptation with minimal code changes. We propose CityCond, a light-weight city-conditioned memory layer that augments existing spatio-temporal backbones. CityCond combines a city-ID encoder with an optional shared memory bank (CityMem). Given a city index and backbone hidden states, it produces city-conditioned features fused through gated residual connections. We attach CityCond to five representative backbones (GRU, TCN, Transformer, GNN, STGCN) and evaluate three regimes: full-data, low-data, and cross-city few-shot transfer on METR-LA and PEMS-BAY. We also run auxiliary experiments on SIND, a drone-based multi-agent trajectory dataset from a signalized intersection in Tianjin (we focus on pedestrian tracks). Across more than fourteen model variants and three random seeds, CityCond yields consistent improvements, with the largest gains for high-capacity backbones such as Transformers and STGCNs. CityMem reduces Transformer error by roughly one third in full-data settings and brings substantial gains in low-data and cross-city transfer. On SIND, simple city-ID conditioning modestly improves low-data LSTM performance. CityCond can therefore serve as a reusable design pattern for scalable, multi-city forecasting under realistic data constraints. 
[LG-76] FC-ADL: Efficient Microservice Anomaly Detection and Localisation Through Functional Connectivity
Link: https://arxiv.org/abs/2512.00844
Authors: Giles Winchester, George Parisis, Luc Berthouze
Subjects: Software Engineering (cs.SE); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*Comments: 13 pages, 6 figures, 2 tables
Abstract:Microservices have transformed software architecture through the creation of modular and independent services. However, they introduce operational complexities in service integration and system management that makes swift and accurate anomaly detection and localisation challenging. Despite the complex, dynamic, and interconnected nature of microservice architectures, prior works that investigate metrics for anomaly detection rarely include explicit information about time-varying interdependencies. And whilst prior works on fault localisation typically do incorporate information about dependencies between microservices, they scale poorly to real world large-scale deployments due to their reliance on computationally expensive causal inference. To address these challenges we propose FC-ADL, an end-to-end scalable approach for detecting and localising anomalous changes from microservice metrics based on the neuroscientific concept of functional connectivity. We show that by efficiently characterising time-varying changes in dependencies between microservice metrics we can both detect anomalies and provide root cause candidates without incurring the significant overheads of causal and multivariate approaches. We demonstrate that our approach can achieve top detection and localisation performance across a wide degree of different fault scenarios when compared to state-of-the-art approaches. Furthermore, we illustrate the scalability of our approach by applying it to Alibaba’s extremely large real-world microservice deployment.
[LG-77] Prediction-space knowledge markets for communication-efficient federated learning on multimedia tasks
Link: https://arxiv.org/abs/2512.00841
Authors: Wenzhang Du
Subjects: Machine Learning (cs.LG)
*Comments: 13 pages, 3 figures
Abstract:Federated learning (FL) enables collaborative training over distributed multimedia data but suffers acutely from statistical heterogeneity and communication constraints, especially when clients deploy large models. Classic parameter-averaging methods such as FedAvg transmit full model weights and can diverge under nonindependent and identically distributed (non-IID) data. We propose KTA v2, a prediction-space knowledge trading market for FL. Each round, clients locally train on their private data, then share only logits on a small public reference set. The server constructs a client-client similarity graph in prediction space, combines it with reference-set accuracy to form per-client teacher ensembles, and sends back personalized soft targets for a second-stage distillation update. This two-stage procedure can be interpreted as approximate block-coordinate descent on a unified objective with prediction-space regularization. Experiments on FEMNIST, CIFAR-10 and AG News show that, under comparable or much lower communication budgets, KTA v2 consistently outperforms a local-only baseline and strong parameter-based methods (FedAvg, FedProx), and substantially improves over a FedMD-style global teacher. On CIFAR-10 with ResNet-18, KTA v2 reaches 57.7% test accuracy using approximately 1/1100 of FedAvg’s communication, while on AG News it attains 89.3% accuracy with approximately 1/300 of FedAvg’s traffic.
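The server-side step described in this abstract (similarity graph in prediction space, combined with reference-set accuracy, yielding personalized soft targets) might look roughly like the sketch below. The abstract does not give the formulas, so several choices here are assumptions: cosine similarity over softmax outputs, a teacher weight equal to similarity times accuracy, and excluding a client from its own teacher ensemble. All names and numbers are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def personalized_soft_targets(client_logits, ref_accuracy):
    """client_logits[c][i] is client c's logit vector on reference sample i."""
    probs = [[softmax(row) for row in logits] for logits in client_logits]
    # Flatten each client's predictions into one vector for the similarity graph.
    flat = [[p for row in client for p in row] for client in probs]
    n_clients = len(probs)
    n_samples = len(probs[0])
    n_classes = len(probs[0][0])
    targets = []
    for c in range(n_clients):
        # Assumed teacher weight: prediction-space similarity times accuracy,
        # with the client itself left out of its own ensemble.
        weights = [0.0 if t == c else cosine(flat[c], flat[t]) * ref_accuracy[t]
                   for t in range(n_clients)]
        total = sum(weights)  # assumes at least one other client has accuracy > 0
        weights = [w / total for w in weights]
        mixed = []
        for i in range(n_samples):
            mixed.append([sum(weights[t] * probs[t][i][k] for t in range(n_clients))
                          for k in range(n_classes)])
        targets.append(mixed)
    return targets

# Three hypothetical clients, two reference samples, three classes.
client_logits = [
    [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]],
    [[1.8, 0.7, 0.0], [0.1, 1.2, 0.9]],
    [[0.1, 0.2, 2.5], [1.0, 1.0, 1.0]],
]
ref_accuracy = [0.9, 0.8, 0.4]
targets = personalized_soft_targets(client_logits, ref_accuracy)
print(targets[0])
```

Only logits on the reference set cross the network in this scheme, which is why the communication cost scales with the reference-set size and class count rather than with the number of model parameters.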
[LG-78] Uncertainty Quantification for Deep Regression using Contextualised Normalizing Flows
Link: https://arxiv.org/abs/2512.00835
Authors: Adriel Sosa Marco, John Daniel Kirwan, Alexia Toumpa, Simos Gerasimou
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Quantifying uncertainty in deep regression models is important both for understanding the confidence of the model and for safe decision-making in high-risk domains. Existing approaches that yield prediction intervals overlook distributional information, neglecting the effect of multimodal or asymmetric distributions on decision-making. Similarly, full or approximated Bayesian methods, while yielding the predictive posterior density, demand major modifications to the model architecture and retraining. We introduce MCNF, a novel post hoc uncertainty quantification method that produces both prediction intervals and the full conditioned predictive distribution. MCNF operates on top of the underlying trained predictive model; thus, no predictive model retraining is needed. We provide experimental evidence that the MCNF-based uncertainty estimate is well calibrated, is competitive with state-of-the-art uncertainty quantification methods, and provides richer information for downstream decision-making tasks.
[LG-79] ReJump: A Tree-Jump Representation for Analyzing and Improving LLM Reasoning
Link: https://arxiv.org/abs/2512.00831
Authors: Yuchen Zeng, Shuibai Zhang, Wonjun Kang, Shutong Wu, Lynnix Zou, Ying Fan, Heeju Kim, Ziqian Lin, Jungtaek Kim, Hyung Il Koo, Dimitris Papailiopoulos, Kangwook Lee
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Large Reasoning Models (LRMs) are Large Language Models (LLMs) explicitly trained to generate long-form Chain-of-Thoughts (CoTs), achieving impressive success on challenging tasks like math and programming. However, their underlying reasoning “algorithms” remain poorly understood. To investigate this, we propose ReJump, which represents a reasoning trace as a visitation order over nodes in a tree of intermediate problem-solving steps. Transitions between nodes, which we term jumps, include adjacent moves that capture behaviors such as calculation, and non-adjacent moves that capture behaviors such as backtracking and verification. ReJump enables analyzing LLM reasoning with diverse metrics that quantify exploration, exploitation, overthinking, forgetting, and verification. Using our proposed LLM agent to extract reasoning traces into ReJump format, we evaluate state-of-the-art LRMs on two tasks and find that models with similar accuracy can exhibit distinct reasoning behaviors, while different tasks favor different reasoning styles (e.g., varying balance between exploration and exploitation). To further understand how learning strategies shape reasoning, we use ReJump to compare distilled LRMs with their teachers, CoT-prompted LLMs with LRMs, and to examine how the number of reasoning examples and reinforcement learning affect reasoning behavior. Finally, we show that ReJump can improve reasoning quality at test time through strategies such as ReJump-guided Best-of-N selection and prompt selection. Our code is publicly available at this https URL.
[LG-80] Soft Quality-Diversity Optimization
Link: https://arxiv.org/abs/2512.00810
Authors: Saeed Hedayatian, Stefanos Nikolaidis
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*Comments: 33 pages, 10 figures
Abstract:Quality-Diversity (QD) algorithms constitute a branch of optimization that is concerned with discovering a diverse and high-quality set of solutions to an optimization problem. Current QD methods commonly maintain diversity by dividing the behavior space into discrete regions, ensuring that solutions are distributed across different parts of the space. The QD problem is then solved by searching for the best solution in each region. This approach to QD optimization poses challenges in large solution spaces, where storing many solutions is impractical, and in high-dimensional behavior spaces, where discretization becomes ineffective due to the curse of dimensionality. We present an alternative framing of the QD problem, called Soft QD, that sidesteps the need for discretizations. We validate this formulation by demonstrating its desirable properties, such as monotonicity, and by relating its limiting behavior to the widely used QD Score metric. Furthermore, we leverage it to derive a novel differentiable QD algorithm, Soft QD Using Approximated Diversity (SQUAD), and demonstrate empirically that it is competitive with current state of the art methods on standard benchmarks while offering better scalability to higher dimensional problems.
[LG-81] Estimating the Effective Rank of Vision Transformers via Low-Rank Factorization
Link: https://arxiv.org/abs/2512.00792
Authors: Liyu Zerihun
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Deep networks are heavily over-parameterized, yet their learned representations often admit low-rank structure. We introduce a framework for estimating a model’s intrinsic dimensionality by treating learned representations as projections onto a low-rank subspace of the model’s full capacity. Our approach: train a full-rank teacher, factorize its weights at multiple ranks, and train each factorized student via distillation to measure performance as a function of rank. We define effective rank as a region, not a point: the smallest contiguous set of ranks for which the student reaches 85-95% of teacher accuracy. To stabilize estimates, we fit accuracy vs. rank with a monotone PCHIP interpolant and identify crossings of the normalized curve. We also define the effective knee as the rank maximizing perpendicular distance between the smoothed accuracy curve and its endpoint secant; an intrinsic indicator of where marginal gains concentrate. On ViT-B/32 fine-tuned on CIFAR-100 (one seed, due to compute constraints), factorizing linear blocks and training with distillation yields an effective-rank region of approximately [16, 34] and an effective knee at r* ~ 31. At rank 32, the student attains 69.46% top-1 accuracy vs. 73.35% for the teacher (~94.7% of baseline) while achieving substantial parameter compression. We provide a framework to estimate effective-rank regions and knees across architectures and datasets, offering a practical tool for characterizing the intrinsic dimensionality of deep models.
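The two definitions in this abstract, the 85-95% effective-rank region and the knee as the point farthest from the endpoint secant, can be sketched on a toy accuracy curve. The numbers below are synthetic and are not the paper's results. The sketch also skips the PCHIP smoothing step and works on the sampled points directly, and it rescales both axes to [0, 1] before measuring distance; that rescaling is an assumption, since the abstract does not say how the axes are scaled.

```python
import math

def normalize(values):
    """Rescale a list to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def knee_index(xs, ys):
    """Index of the point with the largest perpendicular distance
    to the secant through the first and last points."""
    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]
    dx, dy = x1 - x0, y1 - y0
    norm = math.hypot(dx, dy)
    dists = [abs(dy * (x - x0) - dx * (y - y0)) / norm for x, y in zip(xs, ys)]
    return max(range(len(xs)), key=dists.__getitem__)

def rank_region(ranks, accs, teacher_acc, lo=0.85, hi=0.95):
    """Smallest and largest sampled rank whose accuracy, as a fraction
    of teacher accuracy, lies in [lo, hi]; None if no rank qualifies."""
    inside = [r for r, a in zip(ranks, accs) if lo <= a / teacher_acc <= hi]
    return (min(inside), max(inside)) if inside else None

# Synthetic student accuracies at each factorization rank (made up).
ranks = [1, 2, 4, 8, 16, 32]
accs = [0.20, 0.45, 0.60, 0.66, 0.69, 0.70]
teacher_acc = 0.70

knee_rank = ranks[knee_index(normalize(ranks), normalize(accs))]
region = rank_region(ranks, accs, teacher_acc)
print("knee rank:", knee_rank)
print("85-95% region:", region)
```

With a smoothed interpolant in place of the raw samples, the same two functions could be evaluated on a dense grid of ranks, which is closer to the crossing-based procedure the abstract describes.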
[LG-82] Sigma: The Key for Vision-Language-Action Models toward Telepathic Alignment
Link: https://arxiv.org/abs/2512.00783
Authors: Libo Wang
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
*Comments: The Sigma model has been open-sourced on Hugging Face. Weights, dataset, some scripts, and logs are all available. The link is: this https URL
Abstract:To address the gap in humanoid robot cognitive systems regarding the lack of a time-updable mediating thought space between semantics and continuous control, this study constructs and trains a VLA model named “Sigma” that runs on a single RTX 4090. It uses the open-source pi05_base model as a foundation and preprocesses svla_so101_pickplace into a training dataset. The researcher independently designed an architecture for a vision-language-action model that combines deep semantic understanding and association to achieve telepathic communication. The training process involved repeated optimizations of data preprocessing, LoRA fine-tuning, and the inference-stage adapter. The experiment employed offline closed-loop replay, comparing Sigma with the untuned pure pi05_base_base model under data conditions. Results showed that Sigma exhibited a stable decrease in control MSE across vector, fragment, and entire trajectory timescales, while maintaining the telepathy norm and semantic-text alignment quality unchanged. It demonstrates that mind-responsive alignment control is quantified through an architecture that combines deep understanding of semantics and association without retraining the base model, which provides reproducible experience for semantic alignment and intention-driven behavior in humanoid robots.
[LG-83] What Is Preference Optimization Doing, How and Why?
Link: https://arxiv.org/abs/2512.00778
Authors: Yue Wang, Qizhou Wang, Zizhuo Zhang, Ang Li, Gang Niu, Bo Han, Masashi Sugiyama
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Preference optimization (PO) is indispensable for large language models (LLMs), with methods such as direct preference optimization (DPO) and proximal policy optimization (PPO) achieving great success. A common belief is that DPO is supervised learning while PPO is reinforcement learning, yet deeper analyses for the reasons underlying these differences remain lacking. To fill this gap, we analyze their optimization dynamics, revealing distinct algorithmic behaviors and comprehending their underlying causes. First, we examine the target directions of gradient-based updates and find that DPO follows stable targets, whereas PPO follows dynamic targets that balance exploration and exploitation, thus validating the common belief from a new perspective. Second, we examine the roles of positive learning, negative learning, and loss reweighting, which are three key components in PO methods. Our analyses reveal that these components play fairly different roles. In DPO, positive and negative learning jointly shape the learning targets meanwhile mutually offset each other. However, loss reweighting in DPO acts less as a reward signal but more as a regularizer to mitigate overfitting. In PPO, negative learning primarily supports exploration rather than determining the targets. Meanwhile, loss reweighting, related to absolute values of token-level advantages, indicates the distinct roles of token groups in updating targets. Given these findings, we conduct carefully designed ablation studies to further examine how controlling these dynamics impacts optimization efficiency and practical performance. The insights gained from our analyses not only deepen the understanding of PO methods but also inspire the development of more preference-aligned LLMs.
[LG-84] AI Agent for Source Finding by SoFiA-2 for SKA-SDC2
Link: https://arxiv.org/abs/2512.00769
Authors: Xingchen Zhou, Nan Li, Peng Jia, Yingfeng Liu, Furen Deng, Shuanghao Shu, Ying Li, Liang Cao, Huanyuan Shan, Ayodeji Ibitoye
Subjects: Machine Learning (cs.LG); Astrophysics of Galaxies (astro-ph.GA)
*Comments: 20 pages, 10 figures, accepted by RAA
Abstract:Source extraction is crucial in analyzing data from next-generation, large-scale sky surveys in radio bands, such as the Square Kilometre Array (SKA). Several source extraction programs, including SoFiA and Aegean, have been developed to address this challenge. However, finding optimal parameter configurations when applying these programs to real observations is non-trivial. For example, the outcomes of SoFiA depend strongly on several key parameters across its preconditioning, source-finding, and reliability-filtering modules. To address this issue, we propose a framework to automatically optimize these parameters using an AI agent based on a state-of-the-art reinforcement learning (RL) algorithm, i.e., Soft Actor-Critic (SAC). The SKA Science Data Challenge 2 (SDC2) dataset is utilized to assess the feasibility and reliability of this framework. The AI agent interacts with the environment by adjusting parameters based on the feedback from the SDC2 score defined by the SDC2 Team, progressively learning to select parameter sets that yield improved performance. After sufficient training, the AI agent can automatically identify an optimal parameter configuration that outperforms the benchmark set by Team SoFiA within only 100 evaluation steps and with reduced time consumption. Our approach could address similar problems requiring complex parameter tuning, beyond radio band surveys and source extraction. Yet, high-quality training sets containing representative observations and catalogs of ground truth are essential.
[LG-85] Forecasting India's Demographic Transition Under Fertility Policy Scenarios Using hybrid LSTM-PINN Model
Link: https://arxiv.org/abs/2512.00760
Authors: Subarna Khanra, Vijay Kumar Kukreja, Indu Bala
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)
*Comments: 31 pages, 17 figures, 57 references
Abstract:Demographic forecasting remains a fundamental challenge for policy planning in rapidly evolving nations such as India, where fertility transitions, policy interventions, and age structured dynamics interact in complex ways. In this study, we present a hybrid modelling framework that integrates policy-aware fertility functions into a Physics-Informed Neural Network (PINN) enhanced with Long Short-Term Memory (LSTM) networks to capture physical constraints and temporal dependencies in population dynamics. The model is applied to India’s age structured population from 2024 to 2054 under three fertility-policy scenarios: continuation of current fertility decline, stricter population control, and relaxed fertility promotion. The governing transport-reaction partial differential equation is formulated with India-specific demographic indicators, including age-specific fertility and mortality rates. PINNs embed the core population equation and policy-driven fertility changes, while LSTM layers improve long-term forecasting across decades. Results show that fertility policies substantially shape future age distribution, dependency ratios, and workforce size. Stricter controls intensify ageing and reduce labour force participation, whereas relaxed policies support workforce growth but increase population pressure. Our findings suggest that the hybrid LSTM-PINN is an effective approach for demographic forecasting, offering accuracy with interpretability. Beyond methodological novelty, this work provides actionable insights for India’s demographic policy debates, highlighting the need for balanced fertility interventions to ensure sustainable socio-economic development.
[LG-86] ESMC: MLLM-Based Embedding Selection for Explainable Multiple Clustering
链接: https://arxiv.org/abs/2512.00725
作者: Xinyue Wang,Yuheng Jia,Hui Liu,Junhui Hou
类目: Machine Learning (cs.LG)
*备注:
Abstract:Typical deep clustering methods, while achieving notable progress, can only provide one clustering result per dataset. This limitation arises from their assumption of a fixed underlying data distribution, which may fail to meet user needs and yield unsatisfactory clustering outcomes. Our work investigates how multi-modal large language models (MLLMs) can be leveraged to achieve user-driven clustering, emphasizing their adaptability to user-specified semantic requirements. However, directly using MLLM output for clustering risks producing unstructured and generic image descriptions instead of feature-specific and concrete ones. To address this issue, we first show that MLLMs’ hidden states of text tokens are strongly related to the corresponding features, and we leverage these embeddings to perform clustering under any user-defined criterion. We also employ a lightweight clustering head augmented with pseudo-label learning, significantly enhancing clustering accuracy. Extensive experiments demonstrate its competitive performance on diverse datasets and metrics.
[LG-87] Upcycled and Merged MoE Reward Model for Mitigating Reward Hacking
链接: https://arxiv.org/abs/2512.00724
作者: Lingling Fu
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: 9 pages, 5 figures
Abstract:Reward models play a critical role in Reinforcement Learning from Human Feedback (RLHF) by assessing the consistency between generated outputs and human preferences. However, conventional reward models are prone to reward hacking or over-optimization, where the policy exploits shortcut patterns to obtain high reward scores that do not reflect true human preference. Although Mixture-of-Experts (MoE)-based reward models can enhance discriminative capability, they typically introduce substantial computational overhead. To address these challenges, we propose an upcycle and merge MoE reward modeling approach. We first upcycle a dense reward model into a MoE architecture, where a shared expert captures general knowledge, while normal experts specialize in instruction-specific patterns. We then apply routing-weight normalization and merge experts back into a dense model through a learnable weight-averaging mechanism, preserving performance gains while significantly reducing inference cost. Experimental results demonstrate that our method effectively mitigates reward hacking across various model scales. Our work highlights the potential of upcycle and merge MoE structures for improving both robustness and efficiency of RLHF reward models.
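The merge step described above can be illustrated with a small sketch. The abstract does not specify the "learnable weight-averaging mechanism", so the softmax-parameterized coefficients below are an assumption, and `merge_experts` and `merge_logits` are hypothetical names used only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def merge_experts(expert_weights, merge_logits):
    """Collapse several same-shape expert matrices into one dense matrix.

    expert_weights: list of matrices (lists of rows), one per expert.
    merge_logits: one scalar per expert; in training these would be learned.
    Assumption: coefficients are a softmax of the logits, so they sum to 1.
    """
    coeffs = softmax(merge_logits)
    rows, cols = len(expert_weights[0]), len(expert_weights[0][0])
    merged = [[0.0] * cols for _ in range(rows)]
    for coeff, weight in zip(coeffs, expert_weights):
        for i in range(rows):
            for j in range(cols):
                merged[i][j] += coeff * weight[i][j]
    return merged


# Two toy experts with equal logits reduce to a plain average.
dense = merge_experts([[[1.0, 2.0]], [[3.0, 4.0]]], [0.0, 0.0])
```

After merging, inference uses a single dense matrix, which is where the reduced inference cost would come from.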
[LG-88] Exploiting Function-Family Structure in Analog Circuit Optimization
链接: https://arxiv.org/abs/2512.00712
作者: Zhuohua Liu,Kaiqi Huang,Qinxin Mei,Yuanqi Hu,Wei W. Xing
类目: Machine Learning (cs.LG)
*备注:
Abstract:Analog circuit optimization is typically framed as black-box search over arbitrary smooth functions, yet device physics constrains performance mappings to structured families: exponential device laws, rational transfer functions, and regime-dependent dynamics. Off-the-shelf Gaussian-process surrogates impose globally smooth, stationary priors that are misaligned with these regime-switching primitives and can severely misfit highly nonlinear circuits at realistic sample sizes (50–100 evaluations). We demonstrate that pre-trained tabular models encoding these primitives enable reliable optimization without per-circuit engineering. Circuit Prior Network (CPN) combines a tabular foundation model (TabPFN v2) with Direct Expected Improvement (DEI), computing expected improvement exactly under discrete posteriors rather than Gaussian approximations. Across 6 circuits and 25 baselines, structure-matched priors achieve $R^2 \approx 0.99$ in small-sample regimes where GP-Matérn attains only $R^2 = 0.16$ on Bandgap, deliver 1.05–3.81$\times$ higher FoM with 3.34–11.89$\times$ fewer iterations, and suggest a shift from hand-crafting models as priors toward systematic physics-informed structure identification. Our code will be made publicly available upon paper acceptance.
[LG-89] Towards Precision Protein-Ligand Affinity Prediction Benchmark: A Complete and Modification-Aware DAVIS Dataset
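To make "expected improvement exactly under discrete posteriors" concrete, here is a minimal sketch. It assumes the surrogate returns a predictive distribution over a finite set of candidate values (for example, bin centers with probabilities) and that the figure of merit is maximized. The function name and signature are hypothetical and may differ from the paper's DEI.

```python
def discrete_expected_improvement(values, probs, best_so_far):
    """Exact EI for a discrete predictive distribution (maximization).

    EI = sum_i p_i * max(v_i - best_so_far, 0)
    values: candidate outcome values (assumed bin centers).
    probs: non-negative weights for those values; normalized here.
    """
    if len(values) != len(probs):
        raise ValueError("values and probs must have the same length")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probs must have positive total mass")
    return sum(
        (p / total) * max(v - best_so_far, 0.0)
        for v, p in zip(values, probs)
    )


# Only the outcome above the incumbent (2.0 > 1.0) contributes.
ei = discrete_expected_improvement([0.0, 1.0, 2.0], [0.5, 0.25, 0.25], 1.0)
```

Because the sum runs over the actual support of the posterior, no Gaussian closed form is needed and multi-modal predictions are handled without approximation.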
链接: https://arxiv.org/abs/2512.00708
作者: Ming-Hsiu Wu,Ziqian Xie,Shuiwang Ji,Degui Zhi
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:
Abstract:Advancements in AI for science unlock capabilities for critical drug discovery tasks such as protein-ligand binding affinity prediction. However, current models overfit to existing oversimplified datasets that do not represent naturally occurring and biologically relevant proteins with modifications. In this work, we curate a complete and modification-aware version of the widely used DAVIS dataset by incorporating 4,032 kinase-ligand pairs involving substitutions, insertions, deletions, and phosphorylation events. This enriched dataset enables benchmarking of predictive models under biologically realistic conditions. Based on this new dataset, we propose three benchmark settings (Augmented Dataset Prediction, Wild-Type to Modification Generalization, and Few-Shot Modification Generalization) designed to assess model robustness in the presence of protein modifications. Through extensive evaluation of both docking-free and docking-based methods, we find that docking-based models generalize better in zero-shot settings. In contrast, docking-free models tend to overfit to wild-type proteins and struggle with unseen modifications but show notable improvement when fine-tuned on a small set of modified examples. We anticipate that the curated dataset and benchmarks offer a valuable foundation for developing models that better generalize to protein modifications, ultimately advancing precision medicine in drug discovery. The benchmark is available at: this https URL
[LG-90] Flow Matching for Tabular Data Synthesis
链接: https://arxiv.org/abs/2512.00698
作者: Bahrul Ilmi Nasution,Floor Eijkelboom,Mark Elliot,Richard Allmendinger,Christian A. Naesseth
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 15 pages main, 12 pages appendix, 5 figures
Abstract:Synthetic data generation is an important tool for privacy-preserving data sharing. While diffusion models have set recent benchmarks, flow matching (FM) offers a promising alternative. This paper presents different ways to implement flow matching for tabular data synthesis. We provide a comprehensive empirical study that compares flow matching (FM and variational FM) with state-of-the-art diffusion methods (TabDDPM and TabSyn) in tabular data synthesis. We evaluate both the standard Optimal Transport (OT) and the Variance Preserving (VP) probability paths, and also compare deterministic and stochastic samplers (something possible when learning to generate using variational flow matching), characterising the empirical relationship between data utility and privacy risk. Our key findings reveal that flow matching, particularly TabbyFlow, outperforms diffusion baselines. Flow matching methods also achieve better performance with remarkably few function evaluations ($\leq 100$ steps), offering a substantial computational advantage. The choice of probability path is also crucial, as using the OT path demonstrates superior performance, while VP has potential for producing synthetic data with lower disclosure risk. Lastly, our results show that making flows stochastic not only preserves marginal distributions but, in some instances, enables the generation of high utility synthetic data with reduced disclosure risk.
[LG-91] Using physics-inspired Singular Learning Theory to understand grokking & other phase transitions in modern neural networks
链接: https://arxiv.org/abs/2512.00686
作者: Anish Lakkapragada
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Preprint
Abstract:Classical statistical inference and learning theory often fail to explain the success of modern neural networks. A key reason is that these models are non-identifiable (singular), violating core assumptions behind PAC bounds and asymptotic normality. Singular learning theory (SLT), a physics-inspired framework grounded in algebraic geometry, has gained popularity for its ability to close this theory-practice gap. In this paper, we empirically study SLT in toy settings relevant to interpretability and phase transitions. First, we investigate the SLT free energy $\mathcal{F}_n$ by testing an Arrhenius-style rate hypothesis using both a grokking modulo-arithmetic model and Anthropic’s Toy Models of Superposition. Second, we investigate the local learning coefficient $\lambda_\alpha$ by measuring how it scales with problem difficulty across several controlled network families (polynomial regressors, low-rank linear networks, and low-rank autoencoders). Some of our experiments recover known scaling laws, while others yield meaningful deviations from theoretical expectations. Overall, our paper illustrates the many merits of SLT for understanding neural network phase transitions, and poses open research questions for the field.
[LG-92] Large Language Models for Software Engineering: A Reproducibility Crisis
链接: https://arxiv.org/abs/2512.00651
作者: Mohammed Latif Siddiq,Arvin Islam-Gomes,Natalie Sekerak,Joanna C. S. Santos
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: Submitted to Empirical Software Engineering (EMSE) journal; 112 pages (81 pages of references)
Abstract:Reproducibility is a cornerstone of scientific progress, yet its state in large language model (LLM)-based software engineering (SE) research remains poorly understood. This paper presents the first large-scale, empirical study of reproducibility practices in LLM-for-SE research. We systematically mined and analyzed 640 papers published between 2017 and 2025 across premier software engineering, machine learning, and natural language processing venues, extracting structured metadata from publications, repositories, and documentation. Guided by four research questions, we examine (i) the prevalence of reproducibility smells, (ii) how reproducibility has evolved over time, (iii) whether artifact evaluation badges reliably reflect reproducibility quality, and (iv) how publication venues influence transparency practices. Using a taxonomy of seven smell categories: Code and Execution, Data, Documentation, Environment and Tooling, Versioning, Model, and Access and Legal, we manually annotated all papers and associated artifacts. Our analysis reveals persistent gaps in artifact availability, environment specification, versioning rigor, and documentation clarity, despite modest improvements in recent years and increased adoption of artifact evaluation processes at top SE venues. Notably, we find that badges often signal artifact presence but do not consistently guarantee execution fidelity or long-term reproducibility. Motivated by these findings, we provide actionable recommendations to mitigate reproducibility smells and introduce a Reproducibility Maturity Model (RMM) to move beyond binary artifact certification toward multi-dimensional, progressive evaluation of reproducibility rigor.
[LG-93] Privacy Preserving Diffusion Models for Mixed-Type Tabular Data Generation
链接: https://arxiv.org/abs/2512.00638
作者: Timur Sattarov,Marco Schreyer,Damian Borth
类目: Machine Learning (cs.LG)
*备注: 15 pages, 8 figures, 4 tables
Abstract:We introduce DP-FinDiff, a differentially private diffusion framework for synthesizing mixed-type tabular data. DP-FinDiff employs embedding-based representations for categorical features, reducing encoding overhead and scaling to high-dimensional datasets. To adapt DP-training to the diffusion process, we propose two privacy-aware training strategies: an adaptive timestep sampler that aligns updates with diffusion dynamics, and a feature-aggregated loss that mitigates clipping-induced bias. Together, these enhancements improve fidelity and downstream utility without weakening privacy guarantees. On financial and medical datasets, DP-FinDiff achieves 16-42% higher utility than DP baselines at comparable privacy levels, demonstrating its promise for safe and effective data sharing in sensitive domains.
[LG-94] Financial Text Classification Based On rLoRA Finetuning On Qwen3-8B model
链接: https://arxiv.org/abs/2512.00630
作者: Zhiming Lian
类目: Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注: This paper has been accepted to the 2025 2nd International Conference on Digital Economy and Computer Science (DECS 2025) and is awaiting publication in the ACM International Conference Proceeding Series
Abstract:Financial text classification has increasingly become an important aspect in quantitative trading systems and related tasks, such as financial sentiment analysis and the classification of financial news. In this paper, we assess the performance of the large language model Qwen3-8B on both tasks. Qwen3-8B is a state-of-the-art model that exhibits strong instruction-following and multilingual capabilities, and is distinct from standard models, primarily because it is specifically optimized for efficient fine-tuning and high performance on reasoning-based benchmarks, making it suitable for financial applications. To adapt this model, we apply Noisy Embedding Instruction Finetuning, building on our previous work; this method increases robustness by injecting controlled noise into the embedding layers during supervised adaptation. We further improve efficiency with Rank-stabilized Low-Rank Adaptation, a low-rank optimization approach, and FlashAttention, which together allow for faster training with lower GPU memory. For both tasks, we benchmark Qwen3-8B against standard classical transformer models, such as T5, BERT, and RoBERTa, and large models at scale, such as LLaMA1-7B, LLaMA2-7B, and Baichuan2-7B. The findings reveal that Qwen3-8B consistently surpasses these baselines by obtaining better classification accuracy and needing fewer training epochs. The synergy of instruction-based fine-tuning and memory-efficient optimization methods suggests Qwen3-8B can potentially serve as a scalable, economical option for real-time financial NLP applications. Qwen3-8B provides a very promising base for advancing dynamic quantitative trading systems in the future.
[LG-95] Efficient Matroid Bandit Linear Optimization Leveraging Unimodality
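The two training tricks named above can be sketched in a few lines. This follows the commonly cited formulations (uniform embedding noise scaled by alpha / sqrt(L * d) for noisy-embedding finetuning, and a LoRA scaling of alpha / sqrt(r) for the rank-stabilized variant). The abstract gives no hyperparameters, so every value and function name here is illustrative.

```python
import math
import random


def neftune_noise_scale(alpha, seq_len, dim):
    """Noise magnitude alpha / sqrt(L * d) for a sequence of L tokens of width d."""
    return alpha / math.sqrt(seq_len * dim)


def add_embedding_noise(embeddings, alpha, rng):
    """Add Uniform(-1, 1) noise, scaled as above, to every embedding entry.

    embeddings: list of token vectors (list of lists), training time only.
    """
    seq_len, dim = len(embeddings), len(embeddings[0])
    scale = neftune_noise_scale(alpha, seq_len, dim)
    return [[x + scale * rng.uniform(-1.0, 1.0) for x in row] for row in embeddings]


def lora_scaling(alpha, rank, rank_stabilized):
    """Multiplier applied to the low-rank update B @ A.

    Standard LoRA uses alpha / r; the rank-stabilized variant uses alpha / sqrt(r),
    which keeps the update from shrinking as the rank grows.
    """
    return alpha / math.sqrt(rank) if rank_stabilized else alpha / rank


rng = random.Random(0)
noisy = add_embedding_noise([[0.0] * 4 for _ in range(4)], alpha=5.0, rng=rng)
```

With rank 16 and alpha 16, the standard scaling is 1.0 while the rank-stabilized scaling is 4.0, which shows how the two rules diverge at higher ranks.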
链接: https://arxiv.org/abs/2512.00605
作者: Aurélien Delage,Romaric Gaudel
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study the combinatorial semi-bandit problem under matroid constraints. The regret achieved by recent approaches is optimal, in the sense that it matches the lower bound. Yet, time complexity remains an issue for large matroids or for matroids with costly membership oracles (e.g. online recommendation that ensures diversity). This paper sheds new light on the matroid semi-bandit problem by exploiting its underlying unimodal structure. We demonstrate that, with negligible loss in regret, the number of iterations involving the membership oracle can be limited to $\mathcal{O}(\log \log T)$. This results in an overall improved time complexity of the learning process. Experiments conducted on various matroid benchmarks show (i) no loss in regret compared to state-of-the-art approaches; and (ii) reduced time complexity and number of calls to the membership oracle.
[LG-96] Non-Asymptotic Convergence of Discrete Diffusion Models: Masked and Random Walk dynamics
链接: https://arxiv.org/abs/2512.00580
作者: Giovanni Conforti,Alain Durmus,Le-Tuyet-Nhi Pham
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We investigate the theoretical underpinnings of Discrete Diffusion Models (DDMs) on discrete state spaces. Unlike in the continuous setting, where diffusion models are well understood both theoretically and empirically, the discrete case poses significant challenges due to its combinatorial structure and the lack of rigorous analysis. In this work, we establish convergence guarantees for DDMs on both the finite space $\mathbb{Z}^d_m = \{0, \ldots, m-1\}^d$ and the countably infinite space $\mathbb{N}^d$ under mild assumptions, focusing on forward masked and random walk dynamics. Similar to the continuous case, the backward process can be characterized by a discrete score function, whose monotonicity plays a central role in deriving the error bounds of the generated data. Notably, the complexity of our model scales linearly up to logarithmic factors, rather than exponentially, with the dimension, making it efficiently scalable to high-dimensional data. To the best of our knowledge, this study provides the first non-asymptotic convergence guarantees that do not rely on the boundedness of the estimated score, covering not only uniform noising processes on $\mathbb{Z}^d_m$ and on $\mathbb{N}^d$, but also masking-based noising dynamics.
[LG-97] Pre-Generating Multi-Difficulty PDE Data for Few-Shot Neural PDE Solvers
链接: https://arxiv.org/abs/2512.00564
作者: Naman Choudhary,Vedant Singh,Ameet Talwalkar,Nicholas Matthew Boffi,Mikhail Khodak,Tanya Marwah
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注: 10 Pages, 11 Figures
Abstract:A key aspect of learned partial differential equation (PDE) solvers is that the main cost often comes from generating training data with classical solvers rather than learning the model itself. Another is that there are clear axes of difficulty–e.g., more complex geometries and higher Reynolds numbers–along which problems become (1) harder for classical solvers and thus (2) more likely to benefit from neural speedups. Towards addressing this chicken-and-egg challenge, we study difficulty transfer on 2D incompressible Navier-Stokes, systematically varying task complexity along geometry (number and placement of obstacles), physics (Reynolds number), and their combination. Similar to how it is possible to spend compute to pre-train foundation models and improve their performance on downstream tasks, we find that by classically solving (analogously pre-generating) many low and medium difficulty examples and including them in the training set, it is possible to learn high-difficulty physics from far fewer samples. Furthermore, we show that by combining low and high difficulty data, we can spend 8.9x less compute on pre-generating a dataset to achieve the same error as using only high difficulty examples. Our results highlight that how we allocate classical-solver compute across difficulty levels is as important as how much we allocate overall, and suggest substantial gains from principled curation of pre-generated PDE data for neural solvers. Our code is available at this https URL
[LG-98] A Graph Neural Network Approach for Localized and High-Resolution Temperature Forecasting NEURIPS2025
链接: https://arxiv.org/abs/2512.00546
作者: Joud El-Shawa,Elham Bagheri,Sedef Akinli Kocak,Yalda Mohsenzadeh
类目: Machine Learning (cs.LG)
*备注: 6 pages, 2 figures. Accepted to the NeurIPS 2025 Tackling Climate Change with Machine Learning Workshop
Abstract:Heatwaves are intensifying worldwide and are among the deadliest weather disasters. The burden falls disproportionately on marginalized populations and the Global South, where under-resourced health systems, exposure to urban heat islands, and the lack of adaptive infrastructure amplify risks. Yet current numerical weather prediction models often fail to capture micro-scale extremes, leaving the most vulnerable excluded from timely early warnings. We present a Graph Neural Network framework for localized, high-resolution temperature forecasting. By leveraging spatial learning and efficient computation, our approach generates forecasts at multiple horizons, up to 48 hours. For Southwestern Ontario, Canada, the model captures temperature patterns with a mean MAE of 1.93 °C across 1-48h forecasts and MAE@48h of 2.93 °C, evaluated using 24h input windows on the largest region. While demonstrated here in a data-rich context, this work lays the foundation for transfer learning approaches that could enable localized, equitable forecasts in data-limited regions of the Global South.
[LG-99] DQ4FairIM: Fairness-aware Influence Maximization using Deep Reinforcement Learning
链接: https://arxiv.org/abs/2512.00545
作者: Akrati Saxena,Harshith Kumar Yadav,Bart Rutten,Shashi Shekhar Jha
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
*备注:
Abstract:The Influence Maximization (IM) problem aims to select a set of seed nodes within a given budget to maximize the spread of influence in a social network. However, real-world social networks have several structural inequalities, such as dominant majority groups and underrepresented minority groups. If these inequalities are not considered while designing IM algorithms, the outcomes might be biased, disproportionately benefiting majority groups while marginalizing minorities. In this work, we address this gap by designing a fairness-aware IM method using Reinforcement Learning (RL) that ensures equitable influence outreach across all communities, regardless of protected attributes. Fairness is incorporated using a maximin fairness objective, which prioritizes improving the outreach of the least-influenced group, pushing the solution toward an equitable influence distribution. We propose a novel fairness-aware deep RL method, called DQ4FairIM, that maximizes the expected number of influenced nodes by learning an RL policy. The learnt policy is meant to ensure that minority groups also receive equitable outreach. We formulate the IM problem as a Markov Decision Process (MDP) and use deep Q-learning, combined with the Structure2Vec network embedding, to solve the MDP. We perform extensive experiments on synthetic benchmarks and real-world networks to compare our method with fairness-agnostic and fairness-aware baselines. The results show that our method achieves a higher level of fairness while maintaining a better fairness-performance trade-off than baselines. Additionally, our approach learns effective seeding policies that generalize across problem instances without retraining, such as varying the network size or the number of seed nodes.
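The maximin fairness objective mentioned above can be made concrete with a short sketch: score an outcome by the outreach of the least-influenced group. The data layout (a node-to-group mapping and a set of influenced nodes) and the function names are assumptions for illustration; the paper's reward shaping may differ.

```python
def group_outreach(influenced, groups):
    """Fraction of each group's nodes that ended up influenced.

    influenced: set of influenced node ids.
    groups: dict mapping node id -> group label.
    """
    totals, hits = {}, {}
    for node, group in groups.items():
        totals[group] = totals.get(group, 0) + 1
        if node in influenced:
            hits[group] = hits.get(group, 0) + 1
    return {g: hits.get(g, 0) / totals[g] for g in totals}


def maximin_fairness(influenced, groups):
    """Maximin score: the outreach of the worst-off group (higher is fairer)."""
    return min(group_outreach(influenced, groups).values())


# Group "a" reaches 2/3 of its nodes, group "b" reaches 1/2, so the score is 0.5.
groups = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b"}
score = maximin_fairness({1, 2, 4}, groups)
```

Maximizing this score pushes seed selection toward whichever group currently has the lowest outreach, which is what distinguishes it from maximizing total spread alone.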
[LG-100] Algorithmic Guarantees for Distilling Supervised and Offline RL Datasets
链接: https://arxiv.org/abs/2512.00536
作者: Aaryan Gupta,Rishi Saket,Aravindan Raghuveer
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 29 pages, 2 figures
Abstract:Given a training dataset, the goal of dataset distillation is to derive a synthetic dataset such that models trained on the latter perform as well as those trained on the training dataset. In this work, we develop and analyze an efficient dataset distillation algorithm for supervised learning, specifically regression in $\mathbb{R}^d$, based on matching the losses on the training and synthetic datasets with respect to a fixed set of randomly sampled regressors without any model training. Our first key contribution is a novel performance guarantee proving that our algorithm needs only $\tilde{O}(d^2)$ sampled regressors to derive a synthetic dataset on which the MSE loss of any bounded linear model is nearly the same as its MSE loss on the given training data. In particular, the model optimized on the synthetic data has close to minimum loss on the training data, thus performing nearly as well as the model optimized on the latter. Complementing this, we also prove a matching lower bound of $\Omega(d^2)$ for the number of sampled regressors, showing the tightness of our analysis. Our second contribution is to extend our algorithm to offline RL dataset distillation by matching the Bellman loss, unlike previous works which used a behavioral cloning objective. This is the first such method which leverages both the rewards and the next-state information available in offline RL datasets, without any policy model optimization. Our algorithm generates a synthetic dataset whose Bellman loss with respect to any linear action-value predictor is close to the latter’s Bellman loss on the offline RL training dataset. Therefore, a policy associated with an action-value predictor optimized on the synthetic dataset performs nearly as well as that derived from the one optimized on the training data. We conduct experiments to validate our theoretical guarantees and observe performance gains.
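The loss-matching idea for the regression case can be sketched as an objective: sample a fixed set of linear regressors and measure how far their MSE on the synthetic set is from their MSE on the training set. Only the objective is shown. How the synthetic set is optimized and how the regressors are distributed are not given in the abstract, so the Gaussian sampling and all names below are assumptions.

```python
import random


def mse(w, X, y):
    """Mean squared error of the linear predictor x -> <w, x> on (X, y)."""
    errors = [
        (sum(wi * xi for wi, xi in zip(w, x)) - target) ** 2
        for x, target in zip(X, y)
    ]
    return sum(errors) / len(errors)


def sample_regressors(k, d, rng):
    """Draw k random weight vectors in R^d (Gaussian entries, an assumption)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]


def loss_matching_gap(regressors, X_train, y_train, X_syn, y_syn):
    """Average squared difference between train and synthetic MSE over the regressors."""
    gaps = [
        (mse(w, X_train, y_train) - mse(w, X_syn, y_syn)) ** 2
        for w in regressors
    ]
    return sum(gaps) / len(gaps)


rng = random.Random(0)
regressors = sample_regressors(5, 2, rng)
X, y = [[1.0, 0.0], [0.0, 1.0]], [1.0, 2.0]
gap = loss_matching_gap(regressors, X, y, [[1.0, 1.0]], [0.0])
```

A distillation procedure would adjust the synthetic points to drive this gap toward zero. No model is trained at any step, matching the description above.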
[LG-101] Robust Precoding for Resilient Cell-Free Networks
链接: https://arxiv.org/abs/2512.00531
作者: Saeed Mashdour,André R. Flores,Rodrigo C. de Lamare
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 2 figures, 6 pages
Abstract:This paper presents a robust precoder design for resilient cell-free massive MIMO (CF-mMIMO) systems that minimizes the weighted sum of desired signal mean square error (MSE) and residual interference leakage power under a total transmit power constraint. The proposed robust precoder incorporates channel state information (CSI) error statistics to enhance resilience against CSI imperfections. We employ an alternating optimization algorithm initialized with a minimum MSE-type solution, which iteratively refines the precoder while maintaining low computational complexity and ensuring fast convergence. Numerical results show that the proposed method significantly outperforms conventional linear precoders, providing an effective balance between performance and computational efficiency.
[LG-102] Pushing the Boundaries of Interpretability: Incremental Enhancements to the Explainable Boosting Machine
链接: https://arxiv.org/abs/2512.00528
作者: Isara Liyanage,Uthayasanker Thayasivam
类目: Machine Learning (cs.LG)
*备注:
Abstract:The widespread adoption of complex machine learning models in high-stakes domains has brought the “black-box” problem to the forefront of responsible AI research. This paper aims at addressing this issue by improving the Explainable Boosting Machine (EBM), a state-of-the-art glassbox model that delivers both high accuracy and complete transparency. The paper outlines three distinct enhancement methodologies: targeted hyperparameter optimization with Bayesian methods, the implementation of a custom multi-objective function for fairness for hyperparameter optimization, and a novel self-supervised pre-training pipeline for cold-start scenarios. All three methodologies are evaluated across standard benchmark datasets, including the Adult Income, Credit Card Fraud Detection, and UCI Heart Disease datasets. The analysis indicates that while the tuning process yielded marginal improvements in the primary ROC AUC metric, it led to a subtle but important shift in the model’s decision-making behavior, demonstrating the value of a multi-faceted evaluation beyond a single performance score. This work is positioned as a critical step toward developing machine learning systems that are not only accurate but also robust, equitable, and transparent, meeting the growing demands of regulatory and ethical compliance.
[LG-103] Hyperbolic Continuous Structural Entropy for Hierarchical Clustering AAAI2026
链接: https://arxiv.org/abs/2512.00524
作者: Guangjie Zeng,Hao Peng,Angsheng Li,Li Sun,Chunyang Liu,Shengze Li,Yicheng Pan,Philip S. Yu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 14 pages, accepted by AAAI 2026
Abstract:Hierarchical clustering is a fundamental machine-learning technique for grouping data points into dendrograms. However, existing hierarchical clustering methods encounter two primary challenges: 1) Most methods specify dendrograms without a global objective. 2) Graph-based methods often neglect the significance of graph structure, optimizing objectives on complete or static predefined graphs. In this work, we propose Hyperbolic Continuous Structural Entropy neural networks, namely HypCSE, for structure-enhanced continuous hierarchical clustering. Our key idea is to map data points in the hyperbolic space and minimize the relaxed continuous structural entropy (SE) on structure-enhanced graphs. Specifically, we encode graph vertices in hyperbolic space using hyperbolic graph neural networks and minimize approximate SE defined on graph embeddings. To make the SE objective differentiable for optimization, we reformulate it into a function using the lowest common ancestor (LCA) on trees and then relax it into continuous SE (CSE) by the analogy of hyperbolic graph embeddings and partitioning trees. To ensure a graph structure that effectively captures the hierarchy of data points for CSE calculation, we employ a graph structure learning (GSL) strategy that updates the graph structure during training. Extensive experiments on seven datasets demonstrate the superior performance of HypCSE.
[LG-104] A Highly Configurable Framework for Large-Scale Thermal Building Data Generation to drive Machine Learning Research
链接: https://arxiv.org/abs/2512.00483
作者: Thomas Krug,Fabian Raisch,Dominik Aimer,Markus Wirnsberger,Ferdinand Sigg,Felix Koch,Benjamin Schäfer,Benjamin Tischler
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: Under Review
Abstract:Data-driven modeling of building thermal dynamics is emerging as an increasingly important field of research for large-scale intelligent building control. However, research in data-driven modeling using machine learning (ML) techniques requires massive amounts of thermal building data, which is not easily available. Neither empirical public datasets nor existing data generators meet the needs of ML research in terms of data quality and quantity. Moreover, existing data generation approaches typically require expert knowledge in building simulation. To fill this gap, we present a thermal building data generation framework which we call BuilDa. BuilDa is designed to produce synthetic data of adequate quality and quantity for ML research. The framework does not require profound building simulation knowledge to generate large volumes of data. BuilDa uses a single-zone Modelica model that is exported as a Functional Mock-up Unit (FMU) and simulated in Python. We demonstrate BuilDa by generating data and utilizing it for a transfer learning study involving the fine-tuning of 486 data-driven models.
[LG-105] RECTor: Robust and Efficient Correlation Attack on Tor
链接: https://arxiv.org/abs/2512.00436
作者: Binghui Wu,Dinil Mon Divakaran,Levente Csikor,Mohan Gurusamy
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: 8 pages, 4 figures, 2 tables
Abstract:Tor is a widely used anonymity network that conceals user identities by routing traffic through encrypted relays, yet it remains vulnerable to traffic correlation attacks that deanonymize users by matching patterns in ingress and egress traffic. However, existing correlation methods suffer from two major limitations: limited robustness to noise and partial observations, and poor scalability due to computationally expensive pairwise matching. To address these challenges, we propose RECTor, a machine learning-based framework for traffic correlation under realistic conditions. RECTor employs attention-based Multiple Instance Learning (MIL) and GRU-based temporal encoding to extract robust flow representations, even when traffic data is incomplete or obfuscated. These embeddings are mapped into a shared space via a Siamese network and efficiently matched using approximate nearest neighbor (aNN) search. Empirical evaluations show that RECTor outperforms state-of-the-art baselines such as DeepCorr, DeepCOFFEA, and FlowTracker, achieving up to 60% higher true positive rates under high-noise conditions and reducing training and inference time by over 50%. Moreover, RECTor demonstrates strong scalability: inference cost grows near-linearly as the number of flows increases. These findings reveal critical vulnerabilities in Tor’s anonymity model and highlight the need for advanced model-aware defenses.
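The final matching stage, in which embeddings in a shared space are paired by nearest-neighbor search, can be illustrated with a toy example. This sketch uses exact brute-force cosine search as a stand-in for the approximate nearest neighbor search named above, and the embeddings are made-up vectors rather than outputs of any trained encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_flows(ingress, egress):
    """For each ingress embedding, return the index of the most similar egress embedding.

    Exact search costs O(len(ingress) * len(egress)); an approximate index
    would replace the inner max() to keep cost manageable at scale.
    """
    return [
        max(range(len(egress)), key=lambda j: cosine(query, egress[j]))
        for query in ingress
    ]


# Ingress flow 0 aligns with egress flow 1, and ingress flow 1 with egress flow 0.
pairs = match_flows([[1.0, 0.0], [0.0, 1.0]], [[0.0, 2.0], [3.0, 0.1]])
```

Comparing embeddings once per flow, rather than running a model on every ingress-egress pair, is what makes the near-linear inference cost reported above plausible.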
[LG-106] Privacy-Preserving Generative Modeling and Clinical Validation of Longitudinal Health Records for Chronic Disease ML4H2025
Link: https://arxiv.org/abs/2512.00434
Authors: Benjamin D. Ballyk,Ankit Gupta,Sujay Konda,Kavitha Subramanian,Chris Landon,Ahmed Ammar Naseer,Georg Maierhofer,Sumanth Swaminathan,Vasudevan Venkateshwaran
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Comments: To appear in Proceedings of Machine Learning Research Volume 297 - Proceedings of ML4H 2025
Abstract:Data privacy is a critical challenge in modern medical workflows as the adoption of electronic patient records has grown rapidly. Stringent data protection regulations limit access to clinical records for training and integrating machine learning models that have shown promise in improving diagnostic accuracy and personalized care outcomes. Synthetic data offers a promising alternative; however, current generative models either struggle with time-series data or lack formal privacy guaranties. In this paper, we enhance a state-of-the-art time-series generative model to better handle longitudinal clinical data while incorporating quantifiable privacy safeguards. Using real data from chronic kidney disease and ICU patients, we evaluate our method through statistical tests, a Train-on-Synthetic-Test-on-Real (TSTR) setup, and expert clinical review. Our non-private model (Augmented TimeGAN) outperforms transformer- and flow-based models on statistical metrics in several datasets, while our private model (DP-TimeGAN) maintains a mean authenticity of 0.778 on the CKD dataset, outperforming existing state-of-the-art models on the privacy-utility frontier. Both models achieve performance comparable to real data in clinician evaluations, providing robust input data necessary for developing models for complex chronic conditions without compromising data privacy.
[LG-107] TrendGNN: Towards Understanding of Epidemics, Beliefs, and Behaviors
Link: https://arxiv.org/abs/2512.00421
Authors: Mulin Tian,Ajitesh Srivastava
Categories: Machine Learning (cs.LG)
Comments: 4 pages, 2 figures, 1 table
Abstract:Epidemic outcomes have a complex interplay with human behavior and beliefs. Most of the forecasting literature has focused on the task of predicting epidemic signals using simple mechanistic models or black-box models, such as deep transformers, that ingest all available signals without offering interpretability. However, to better understand the mechanisms and predict the impact of interventions, we need the ability to forecast signals associated with beliefs and behaviors in an interpretable manner. In this work, we propose a graph-based forecasting framework that first constructs a graph of interrelated signals based on trend similarity, and then applies graph neural networks (GNNs) for prediction. This approach enables interpretable analysis by revealing which signals are more predictable and which relationships contribute most to forecasting accuracy. We believe our method provides early steps towards a framework for interpretable modeling in domains with multiple potentially interdependent signals, with implications for building future simulation models that integrate behavior, beliefs, and observations.
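The abstract describes a first stage that builds a graph of interrelated signals from trend similarity. The following is a minimal sketch of what such a stage could look like, under the assumption (not stated in the abstract) that trend similarity is the Pearson correlation of first differences and that an edge is kept when the correlation reaches a threshold. The function names, the `threshold` parameter, and the toy signals are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def first_differences(series):
    """Return step-to-step changes, used here as a simple proxy for the trend."""
    return [b - a for a, b in zip(series, series[1:])]


def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su > 0 and sv > 0 else 0.0


def trend_graph(signals, threshold=0.8):
    """Connect two signals when their trends correlate at or above the threshold."""
    diffs = {name: first_differences(vals) for name, vals in signals.items()}
    edges = set()
    for a, b in combinations(sorted(diffs), 2):
        if pearson(diffs[a], diffs[b]) >= threshold:
            edges.add((a, b))
    return edges


# Toy signals invented purely for illustration (not data from the paper).
signals = {
    "cases": [1, 2, 4, 7, 11, 16],
    "mask_wearing": [10, 12, 16, 22, 30, 40],
    "mobility": [50, 49, 47, 44, 40, 35],
}
edges = trend_graph(signals)
```

In this toy setup only the two signals whose trends move together are linked; the resulting edge set would then be the input graph for a GNN forecaster.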
[LG-108] Solving Neural Min-Max Games: The Role of Architecture Initialization Dynamics NEURIPS2025 DATE
Link: https://arxiv.org/abs/2512.00389
Authors: Deep Patel,Emmanouil-Vasileios Vlatakis-Gkaragkounis
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)
Comments: Camera-ready for NeurIPS 2025 (including updated section on neural network initialization for experiments in Appendix C)
Abstract:Many emerging applications - such as adversarial training, AI alignment, and robust optimization - can be framed as zero-sum games between neural nets, with von Neumann-Nash equilibria (NE) capturing the desirable system behavior. While such games often involve non-convex non-concave objectives, empirical evidence shows that simple gradient methods frequently converge, suggesting a hidden geometric structure. In this paper, we provide a theoretical framework that explains this phenomenon through the lens of hidden convexity and overparameterization. We identify sufficient conditions - spanning initialization, training dynamics, and network width - that guarantee global convergence to a NE in a broad class of non-convex min-max games. To our knowledge, this is the first such result for games that involve two-layer neural networks. Technically, our approach is twofold: (a) we derive a novel path-length bound for the alternating gradient descent-ascent scheme in min-max games; and (b) we show that the reduction from a hidden convex-concave geometry to a two-sided Polyak-Łojasiewicz (PŁ) min-max condition holds with high probability under overparameterization, using tools from random matrix theory.
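The alternating gradient descent-ascent scheme mentioned in the abstract can be illustrated on a toy problem. The sketch below is not the paper's neural setting: it assumes a simple strongly-convex-strongly-concave objective f(x, y) = 0.5x² + xy − 0.5y², whose unique equilibrium is (0, 0), and a hypothetical step size of 0.1. The point is only the update order: the max player uses the min player's already-updated iterate.

```python
def alternating_gda(x, y, step=0.1, iters=500):
    """Alternating gradient descent-ascent on f(x, y) = 0.5*x**2 + x*y - 0.5*y**2."""
    for _ in range(iters):
        x = x - step * (x + y)  # descent step for the min player: df/dx = x + y
        y = y + step * (x - y)  # ascent step for the max player, using the new x: df/dy = x - y
    return x, y


x_final, y_final = alternating_gda(1.0, -1.0)
```

On this toy objective the iterates contract towards the equilibrium at (0, 0); the paper's contribution concerns when comparable behaviour can be guaranteed for non-convex neural games.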
[LG-109] Efficient and Programmable Exploration of Synthesizable Chemical Space
Link: https://arxiv.org/abs/2512.00384
Authors: Shitong Luo,Connor W. Coley
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments:
Abstract:The constrained nature of synthesizable chemical space poses a significant challenge for sampling molecules that are both synthetically accessible and possess desired properties. In this work, we present PrexSyn, an efficient and programmable model for molecular discovery within synthesizable chemical space. PrexSyn is based on a decoder-only transformer trained on a billion-scale datastream of synthesizable pathways paired with molecular properties, enabled by a real-time, high-throughput C++-based data generation engine. The large-scale training data allows PrexSyn to reconstruct the synthesizable chemical space nearly perfectly at a high inference speed and learn the association between properties and synthesizable molecules. Based on its learned property-pathway mappings, PrexSyn can generate synthesizable molecules that satisfy not only single-property conditions but also composite property queries joined by logical operators, thereby allowing users to "program" generation objectives. Moreover, by exploiting this property-based querying capability, PrexSyn can efficiently optimize molecules against black-box oracle functions via iterative query refinement, achieving higher sampling efficiency than even synthesis-agnostic baselines, making PrexSyn a powerful general-purpose molecular optimization tool. Overall, PrexSyn pushes the frontier of synthesizable molecular design by setting a new state of the art in synthesizable chemical space coverage, molecular sampling efficiency, and inference speed.
[LG-110] The Information Theory of Similarity
Link: https://arxiv.org/abs/2512.00378
Authors: Nikit Phadke
Categories: Information Theory (cs.IT); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:We establish a precise mathematical equivalence between witness-based similarity systems (REWA) and Shannon’s information theory. We prove that witness overlap is mutual information, that REWA bit complexity bounds arise from channel capacity limitations, and that ranking-preserving encodings obey rate-distortion constraints. This unification reveals that fifty years of similarity search research – from Bloom filters to locality-sensitive hashing to neural retrieval – implicitly developed information theory for relational data. We derive fundamental lower bounds showing that REWA’s $O(\Delta^{-2} \log N)$ complexity is optimal: no encoding scheme can preserve similarity rankings with fewer bits. The framework establishes that semantic similarity has physical units (bits of mutual information), search is communication (query transmission over a noisy channel), and retrieval systems face fundamental capacity limits analogous to Shannon’s channel coding theorem.
[LG-111] Learning Causal States Under Partial Observability and Perturbation
Link: https://arxiv.org/abs/2512.00357
Authors: Na Li,Hangguan Shan,Wei Ni,Wenjie Zhang,Xinyu Li,Yamin Wang
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:A critical challenge for reinforcement learning (RL) is making decisions based on incomplete and noisy observations, especially in perturbed and partially observable Markov decision processes (P$^2$OMDPs). Existing methods fail to mitigate perturbations while addressing partial observability. We propose Causal State Representation under Asynchronous Diffusion Model (CaDiff), a framework that enhances any RL algorithm by uncovering the underlying causal structure of P$^2$OMDPs. This is achieved by incorporating a novel asynchronous diffusion model (ADM) and a new bisimulation metric. ADM enables forward and reverse processes with different numbers of steps, thus interpreting the perturbation of P$^2$OMDP as part of the noise suppressed through diffusion. The bisimulation metric quantifies the similarity between partially observable environments and their causal counterparts. Moreover, we establish the theoretical guarantee of CaDiff by deriving an upper bound for the value function approximation errors between perturbed observations and denoised causal states, reflecting a principled trade-off between approximation errors of reward and transition-model. Experiments on Roboschool tasks show that CaDiff enhances returns by at least 14.18% compared to baselines. CaDiff is the first framework that approximates causal states using diffusion models with both theoretical rigor and practicality.
[LG-112] Sample-Efficient Tabular Self-Play for Offline Robust Reinforcement Learning NEURIPS2025
Link: https://arxiv.org/abs/2512.00352
Authors: Na Li,Zewu Zheng,Wei Ni,Hangguan Shan,Wenjie Zhang,Xinyu Li
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: NeurIPS 2025
Abstract:Multi-agent reinforcement learning (MARL), as a thriving field, explores how multiple agents independently make decisions in a shared dynamic environment. Due to environmental uncertainties, policies in MARL must remain robust to tackle the sim-to-real gap. We focus on robust two-player zero-sum Markov games (TZMGs) in offline settings, specifically on tabular robust TZMGs (RTZMGs). We propose a model-based algorithm (RTZ-VI-LCB) for offline RTZMGs, which is optimistic robust value iteration combined with a data-driven Bernstein-style penalty term for robust value estimation. By accounting for distribution shifts in the historical dataset, the proposed algorithm establishes near-optimal sample complexity guarantees under partial coverage and environmental uncertainty. An information-theoretic lower bound is developed to confirm the tightness of our algorithm’s sample complexity, which is optimal regarding both state and action spaces. To the best of our knowledge, RTZ-VI-LCB is the first to attain this optimality, sets a new benchmark for offline RTZMGs, and is validated experimentally.
[LG-113] Provable Memory Efficient Self-Play Algorithm for Model-free Reinforcement Learning ICLR2024
Link: https://arxiv.org/abs/2512.00351
Authors: Na Li,Yuchen Jiao,Hangguan Shan,Shefeng Yan
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: ICLR 2024. arXiv admin note: substantial text overlap with arXiv:2110.04645 by other authors
Abstract:The thriving field of multi-agent reinforcement learning (MARL) studies how a group of interacting agents make decisions autonomously in a shared dynamic environment. Existing theoretical studies in this area suffer from at least two of the following obstacles: memory inefficiency, the heavy dependence of sample complexity on the long horizon and the large state space, the high computational complexity, non-Markov policy, non-Nash policy, and high burn-in cost. In this work, we take a step towards settling this problem by designing a model-free self-play algorithm, Memory-Efficient Nash Q-Learning (ME-Nash-QL), for two-player zero-sum Markov games, which is a specific setting of MARL. ME-Nash-QL is proven to enjoy the following merits. First, it can output an $\varepsilon$-approximate Nash policy with space complexity $O(SABH)$ and sample complexity $\widetilde{O}(H^4 SAB/\varepsilon^2)$, where $S$ is the number of states, $\{A, B\}$ are the numbers of actions for the two players, and $H$ is the horizon length. It outperforms existing algorithms in terms of space complexity for tabular cases, and in terms of sample complexity for long horizons, i.e., when $\min\{A, B\} \ll H^2$. Second, ME-Nash-QL achieves the lowest computational complexity $O(T\,\mathrm{poly}(AB))$ while preserving Markov policies, where $T$ is the number of samples. Third, ME-Nash-QL also achieves the best burn-in cost $O(SAB\,\mathrm{poly}(H))$, whereas previous algorithms have a burn-in cost of at least $O(S^3 AB\,\mathrm{poly}(H))$ to attain the same level of sample complexity as ours.
[LG-114] Adaptive prediction theory combining offline and online learning
Link: https://arxiv.org/abs/2512.00342
Authors: Haizheng Li,Lei Guo
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:
Abstract:Real-world intelligence systems usually operate by combining offline learning and online adaptation with highly correlated and non-stationary system data or signals, which, however, has rarely been investigated theoretically in the literature. This paper initiates a theoretical investigation on the prediction performance of a two-stage learning framework combining offline and online algorithms for a class of nonlinear stochastic dynamical systems. For the offline-learning phase, we establish an upper bound on the generalization error for approximate nonlinear-least-squares estimation under general datasets with strong correlation and distribution shift, leveraging the Kullback-Leibler divergence to quantify the distributional discrepancies. For the online-adaptation phase, we address, on the basis of the offline-trained model, the possible uncertain parameter drift in real-world target systems by proposing a meta-LMS prediction algorithm. This two-stage framework, integrating offline learning with online adaptation, demonstrates superior prediction performances compared with either purely offline or online methods. Both theoretical guarantees and empirical studies are provided.
[LG-115] Introducing AI-Driven IoT Energy Management Framework
Link: https://arxiv.org/abs/2512.00321
Authors: Shivani Mruthyunjaya,Anandi Dutta,Kazi Sifatul Islam
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Accepted in IEEE Smart World Congress 2025, Calgary, Canada
[LG-116] Data-Driven Modeling and Correction of Vehicle Dynamics
Link: https://arxiv.org/abs/2512.00289
Authors: Nguyen Ly,Caroline Tatsuoka,Jai Nagaraj,Jacob Levy,Fernando Palafox,David Fridovich-Keil,Hannah Lu
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments:
Abstract:We develop a data-driven framework for learning and correcting non-autonomous vehicle dynamics. Physics-based vehicle models are often simplified for tractability and therefore exhibit inherent model-form uncertainty, motivating the need for data-driven correction. Moreover, non-autonomous dynamics are governed by time-dependent control inputs, which pose challenges in learning predictive models directly from temporal snapshot data. To address these, we reformulate the vehicle dynamics via a local parameterization of the time-dependent inputs, yielding a modified system composed of a sequence of local parametric dynamical systems. We approximate these parametric systems using two complementary approaches. First, we employ the DRIPS (dimension reduction and interpolation in parameter space) methodology to construct efficient linear surrogate models, equipped with lifted observable spaces and manifold-based operator interpolation. This enables data-efficient learning of vehicle models whose dynamics admit accurate linear representations in the lifted spaces. Second, for more strongly nonlinear systems, we employ FML (Flow Map Learning), a deep neural network approach that approximates the parametric evolution map without requiring special treatment of nonlinearities. We further extend FML with a transfer-learning-based model correction procedure, enabling the correction of misspecified prior models using only a sparse set of high-fidelity or experimental measurements, without assuming a prescribed form for the correction term. Through a suite of numerical experiments on unicycle, simplified bicycle, and slip-based bicycle models, we demonstrate that DRIPS offers robust and highly data-efficient learning of non-autonomous vehicle dynamics, while FML provides expressive nonlinear modeling and effective correction of model-form errors under severe data scarcity.
[LG-117] Scalable and Interpretable Scientific Discovery via Sparse Variational Gaussian Process Kolmogorov-Arnold Networks (SVGP KAN)
Link: https://arxiv.org/abs/2512.00260
Authors: Y. Sungtaek Ju
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 7 pages, 3 figures
Abstract:Kolmogorov-Arnold Networks (KANs) offer a promising alternative to Multi-Layer Perceptron (MLP) by placing learnable univariate functions on network edges, enhancing interpretability. However, standard KANs lack probabilistic outputs, limiting their utility in applications requiring uncertainty quantification. While recent Gaussian Process (GP) extensions to KANs address this, they utilize exact inference methods that scale cubically with data size N, restricting their application to smaller datasets. We introduce the Sparse Variational GP-KAN (SVGP-KAN), an architecture that integrates sparse variational inference with the KAN topology. By employing M inducing points and analytic moment matching, our method reduces computational complexity from O(N^3) to O(NM^2) or linear in sample size, enabling the application of probabilistic KANs to larger scientific datasets. Furthermore, we demonstrate that integrating a permutation-based importance analysis enables the network to function as a framework for structural identification, identifying relevant inputs and classifying functional relationships.
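To get a feel for the stated reduction from O(N^3) to O(NM^2), the following back-of-the-envelope sketch compares the two leading-order operation counts. It ignores constant factors and lower-order terms, and the values of N and M are hypothetical, chosen only for illustration.

```python
def exact_gp_cost(n):
    """Leading-order cost of exact GP inference on n data points: n^3."""
    return n ** 3


def sparse_gp_cost(n, m):
    """Leading-order cost of sparse variational inference with m inducing points: n * m^2."""
    return n * m ** 2


n, m = 100_000, 100  # hypothetical dataset size and number of inducing points
speedup = exact_gp_cost(n) / sparse_gp_cost(n, m)  # equals (n / m) ** 2
```

For fixed M the sparse cost grows linearly in N, which is why the ratio scales as (N/M)^2.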
[LG-118] SD-CGAN: Conditional Sinkhorn Divergence GAN for DDoS Anomaly Detection in IoT Networks
Link: https://arxiv.org/abs/2512.00251
Authors: Henry Onyeka,Emmanuel Samson,Liang Hong,Tariqul Islam,Imtiaz Ahmed,Kamrul Hasan
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 7 pages, 6 figures, camera-ready version accepted for presentation at IEEE ICNC 2026
[LG-119] Self-Supervised Dynamical System Representations for Physiological Time-Series
Link: https://arxiv.org/abs/2512.00239
Authors: Yenho Chen,Maxwell A. Xu,James M. Rehg,Christopher J. Rozell
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:The effectiveness of self-supervised learning (SSL) for physiological time series depends on the ability of a pretraining objective to preserve information about the underlying physiological state while filtering out unrelated noise. However, existing strategies are limited due to reliance on heuristic principles or poorly constrained generative tasks. To address this limitation, we propose a pretraining framework that exploits the information structure of a dynamical systems generative model across multiple time-series. This framework reveals our key insight that class identity can be efficiently captured by extracting information about the generative variables related to the system parameters shared across similar time series samples, while noise unique to individual samples should be discarded. Building on this insight, we propose PULSE, a cross-reconstruction-based pretraining objective for physiological time series datasets that explicitly extracts system information while discarding non-transferrable sample-specific ones. We establish theory that provides sufficient conditions for the system information to be recovered, and empirically validate it using a synthetic dynamical systems experiment. Furthermore, we apply our method to diverse real-world datasets, demonstrating that PULSE learns representations that can broadly distinguish semantic classes, increase label efficiency, and improve transfer learning.
[LG-120] Emergent Riemannian geometry over learning discrete computations on continuous manifolds
Link: https://arxiv.org/abs/2512.00196
Authors: Julian Brandon,Angus Chadwick,Arthur Pellegrino
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Differential Geometry (math.DG); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
Comments:
Abstract:Many tasks require mapping continuous input data (e.g. images) to discrete task outputs (e.g. class labels). Yet, how neural networks learn to perform such discrete computations on continuous data manifolds remains poorly understood. Here, we show that signatures of such computations emerge in the representational geometry of neural networks as they learn. By analysing the Riemannian pullback metric across layers of a neural network, we find that network computation can be decomposed into two functions: discretising continuous input features and performing logical operations on these discretised variables. Furthermore, we demonstrate how different learning regimes (rich vs. lazy) have contrasting metric and curvature structures, affecting the ability of the networks to generalise to unseen inputs. Overall, our work provides a geometric framework for understanding how neural networks learn to perform discrete computations on continuous manifolds.
[LG-121] Hybrid Context-Fusion Attention (CFA) U-Net and Clustering for Robust Seismic Horizon Interpretation
Link: https://arxiv.org/abs/2512.00191
Authors: Jose Luis Lima de Jesus Silva,Joao Pedro Gomes,Paulo Roberto de Melo Barros Junior,Vitor Hugo Serravalle Reis Rodrigues,Alexsandro Guerra Cerqueira
Categories: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Geophysics (physics.geo-ph)
Comments:
Abstract:Interpreting seismic horizons is a critical task for characterizing subsurface structures in hydrocarbon exploration. Recent advances in deep learning, particularly U-Net-based architectures, have significantly improved automated horizon tracking. However, challenges remain in accurately segmenting complex geological features and interpolating horizons from sparse annotations. To address these issues, a hybrid framework is presented that integrates advanced U-Net variants with spatial clustering to enhance horizon continuity and geometric fidelity. The core contribution is the Context Fusion Attention (CFA) U-Net, a novel architecture that fuses spatial and Sobel-derived geometric features within attention gates to improve both precision and surface completeness. The performance of five architectures, the U-Net (Standard and compressed), U-Net++, Attention U-Net, and CFA U-Net, was systematically evaluated across various data sparsity regimes (10-, 20-, and 40-line spacing). This approach outperformed existing baselines, achieving state-of-the-art results on the Mexilhao field (Santos Basin, Brazil) dataset with a validation IoU of 0.881 and MAE of 2.49ms, and excellent surface coverage of 97.6% on the F3 Block of the North Sea dataset under sparse conditions. The framework further refines merged horizon predictions (inline and cross-line) using Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to produce geologically plausible surfaces. The results demonstrate the advantages of hybrid methodologies and attention-based architectures enhanced with geometric context, providing a robust and generalizable solution for seismic interpretation in structurally complex and data-scarce environments.
[LG-122] We Still Don't Understand High-Dimensional Bayesian Optimization
Link: https://arxiv.org/abs/2512.00170
Authors: Colin Doumont,Donney Fan,Natalie Maus,Jacob R. Gardner,Henry Moss,Geoff Pleiss
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
[LG-123] Faster Verified Explanations for Neural Networks
Link: https://arxiv.org/abs/2512.00164
Authors: Alessandro De Palma,Greta Dolcetti,Caterina Urban
Categories: Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments:
Abstract:Verified explanations are a theoretically-principled way to explain the decisions taken by neural networks, which are otherwise black-box in nature. However, these techniques face significant scalability challenges, as they require multiple calls to neural network verifiers, each of them with an exponential worst-case complexity. We present FaVeX, a novel algorithm to compute verified explanations. FaVeX accelerates the computation by dynamically combining batch and sequential processing of input features, and by reusing information from previous queries, both when proving invariances with respect to certain input features, and when searching for feature assignments altering the prediction. Furthermore, we present a novel and hierarchical definition of verified explanations, termed verifier-optimal robust explanations, that explicitly factors the incompleteness of network verifiers within the explanation. Our comprehensive experimental evaluation demonstrates the superior scalability of both FaVeX, and of verifier-optimal robust explanations, which together can produce meaningful formal explanation on networks with hundreds of thousands of non-linear activations.
[LG-124] SafeCiM: Investigating Resilience of Hybrid Floating-Point Compute-in-Memory Deep Learning Accelerators
Link: https://arxiv.org/abs/2512.00059
Authors: Swastik Bhattacharya,Sanjay Das,Anand Menon,Shamik Kundu,Arnab Raha,Kanad Basu
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments:
Abstract:Deep Neural Networks (DNNs) continue to grow in complexity with Large Language Models (LLMs) incorporating vast numbers of parameters. Handling these parameters efficiently in traditional accelerators is limited by data-transmission bottlenecks, motivating Compute-in-Memory (CiM) architectures that integrate computation within or near memory to reduce data movement. Recent work has explored CiM designs using Floating-Point (FP) and Integer (INT) operations. FP computations typically deliver higher output quality due to their wider dynamic range and precision, benefiting precision-sensitive Generative AI applications. These include models such as LLMs, thus driving advancements in FP-CiM accelerators. However, the vulnerability of FP-CiM to hardware faults remains underexplored, posing a major reliability concern in mission-critical settings. To address this gap, we systematically analyze hardware fault effects in FP-CiM by introducing bit-flip faults at key computational stages, including digital multipliers, CiM memory cells, and digital adder trees. Experiments with Convolutional Neural Networks (CNNs) such as AlexNet and state-of-the-art LLMs including LLaMA-3.2-1B and Qwen-0.3B-Base reveal how faults at each stage affect inference accuracy. Notably, a single adder fault can reduce LLM accuracy to 0%. Based on these insights, we propose a fault-resilient design, SafeCiM, that mitigates fault impact far better than a naive FP-CiM with a pre-alignment stage. For example, with 4096 MAC units, SafeCiM reduces accuracy degradation by up to 49x for a single adder fault compared to the baseline FP-CiM architecture.
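The bit-flip fault model in the abstract can be made concrete on a single value. The sketch below is only an illustration, not the paper's fault-injection framework: it assumes IEEE 754 binary32 values and flips one bit of the encoding, showing why the position of the fault matters. The helper name `flip_bit_float32` is hypothetical.

```python
import struct


def flip_bit_float32(value, bit):
    """Flip one bit (0 = mantissa LSB, 31 = sign) of value's IEEE 754 binary32 encoding."""
    if not 0 <= bit <= 31:
        raise ValueError("bit must be in [0, 31]")
    (raw,) = struct.unpack("<I", struct.pack("<f", value))  # float -> 32-bit pattern
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))  # pattern -> float
    return flipped


sign_fault = flip_bit_float32(1.0, 31)      # sign bit: 1.0 becomes -1.0
exponent_fault = flip_bit_float32(1.0, 30)  # top exponent bit: 1.0 becomes +infinity
mantissa_fault = flip_bit_float32(1.0, 0)   # lowest mantissa bit: 1.0 becomes 1 + 2**-23
```

A fault in the low mantissa bits is barely visible, while one in the exponent can turn an ordinary value into infinity, which then propagates through every later accumulation. This is consistent with the abstract's observation that a single adder fault can be catastrophic.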
[LG-125] SetupKit: Efficient Multi-Corner Setup/Hold Time Characterization Using Bias-Enhanced Interpolation and Active Learning
Link: https://arxiv.org/abs/2512.00044
Authors: Junzhuo Zhou,Ziwen Wang,Haoxuan Xia,Yuxin Yan,Chengyu Zhu,Ting-Jung Lin,Wei Xing,Lei He
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments:
[LG-126] Efficient Turing Machine Simulation with Transformers
Link: https://arxiv.org/abs/2512.00003
Authors: Qian Li,Yuyi Wang
Categories: Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: 17 pages
Abstract:Constant bit-size Transformers are known to be Turing complete, but existing constructions require $\Omega(s(n))$ chain-of-thought (CoT) steps per simulated Turing machine (TM) step, leading to impractical reasoning lengths. In this paper, we significantly reduce this efficiency gap by proving that any $(t(n), s(n))$-bounded multi-tape TM can be simulated by a constant bit-size Transformer with an optimal $O(s(n))$-long context window and only $O(s(n)^c)$ CoT steps per TM step, where $c > 0$ can be made arbitrarily small by letting the Transformers’ head-layer product be sufficiently large. In addition, our construction shows that sparse attention with fixed geometric offsets suffices for efficient universal computation. Our proof leverages multi-queue TMs as a bridge. The main technical novelty is a more efficient simulation of multi-tape TMs by synchronous multi-queue TMs, improving both time and space complexity under stricter model assumptions.
[LG-127] Towards a future space-based highly scalable AI infrastructure system design
Link: https://arxiv.org/abs/2511.19468
Authors: Blaise Agüera y Arcas,Travis Beals,Maria Biggs,Jessica V. Bloom,Thomas Fischbacher,Konstantin Gromov,Urs Köster,Rishiraj Pravahan,James Manyika
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Space Physics (physics.space-ph)
Comments: 19 pages, 4 figures
Abstract:If AI is a foundational general-purpose technology, we should anticipate that demand for AI compute – and energy – will continue to grow. The Sun is by far the largest energy source in our solar system, and thus it warrants consideration how future AI infrastructure could most efficiently tap into that power. This work explores a scalable compute system for machine learning in space, using fleets of satellites equipped with solar arrays, inter-satellite links using free-space optics, and Google tensor processing unit (TPU) accelerator chips. To facilitate high-bandwidth, low-latency inter-satellite communication, the satellites would be flown in close proximity. We illustrate the basic approach to formation flight via an 81-satellite cluster of 1 km radius, and describe an approach for using high-precision ML-based models to control large-scale constellations. Trillium TPUs are radiation tested. They survive a total ionizing dose equivalent to a 5 year mission life without permanent failures, and are characterized for bit-flip errors. Launch costs are a critical part of overall system cost; a learning curve analysis suggests launch to low-Earth orbit (LEO) may reach $\lesssim \$200$/kg by the mid-2030s.
[LG-128] A robust generalizable device-agnostic deep learning model for sleep-wake determination from triaxial wrist accelerometry
Link: https://arxiv.org/abs/2512.01986
Authors: Nasim Montazeri,Stone Yang,Dominik Luszczynski,John Zhang,Dharmendra Gurve,Andrew Centen,Maged Goubran,Andrew Lim
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments: 27 pages, 5 figures, 5 tables
Abstract:Study Objectives: Wrist accelerometry is widely used for inferring sleep-wake state. Previous works demonstrated poor wake detection, without cross-device generalizability or validation across different age ranges and sleep disorders. We developed a robust deep learning model to detect sleep-wakefulness from triaxial accelerometry and evaluated its validity across three devices and in a large adult population spanning a wide range of ages with and without sleep disorders. Methods: We collected wrist accelerometry simultaneously with polysomnography (PSG) in 453 adults undergoing clinical sleep testing at a tertiary care sleep laboratory, using three devices. We extracted features in 30-second epochs and trained a 3-class model to detect wake, sleep, and sleep with arousals, which was then collapsed into wake vs. sleep using a decision tree. To enhance wake detection, the model was specifically trained on randomly selected subjects with low sleep efficiency and/or high arousal index from one device recording and then tested on the remaining recordings. Results: The model showed high performance with F1 Score of 0.86, sensitivity (sleep) of 0.87, and specificity (wakefulness) of 0.78, and significant and moderate correlation to PSG in predicting total sleep time (R=0.69) and sleep efficiency (R=0.63). Model performance was robust to the presence of sleep disorders, including sleep apnea and periodic limb movements in sleep, and was consistent across all three models of accelerometer. Conclusions: We present a deep model to detect sleep-wakefulness from actigraphy in adults with relative robustness to the presence of sleep disorders and generalizability across diverse commonly used wrist accelerometers.
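For readers unfamiliar with the reported metrics, the sketch below shows how sensitivity, specificity, and F1 are computed from epoch-level confusion counts, treating sleep as the positive class as the abstract does. The counts are toy numbers invented for illustration and are unrelated to the study's data.

```python
def sleep_wake_metrics(tp, fn, tn, fp):
    """Compute metrics with sleep as the positive class.

    tp: sleep epochs scored as sleep, fn: sleep epochs scored as wake,
    tn: wake epochs scored as wake,   fp: wake epochs scored as sleep.
    """
    sensitivity = tp / (tp + fn)      # fraction of sleep epochs detected
    specificity = tn / (tn + fp)      # fraction of wake epochs detected
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
    return sensitivity, specificity, f1


# Toy counts, purely illustrative.
sensitivity, specificity, f1 = sleep_wake_metrics(tp=8, fn=2, tn=6, fp=4)
```

Because sleep epochs usually far outnumber wake epochs in an overnight recording, specificity is the metric that exposes weak wake detection, which is the shortcoming of earlier methods that the abstract highlights.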
[LG-129] Fundamentals of Regression
链接: https://arxiv.org/abs/2512.01920
作者: Miguel A. Mendez
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Chapter 2 from Machine Learning for Fluid Dynamics (ISBN 978-2875162090). Based on the VKI-ULB lecture series “Machine Learning for Fluid Dynamics,” held in Brussels in February 2022
Abstract:This chapter opens with a review of classic tools for regression, a subset of machine learning that seeks to find relationships between variables. With the advent of scientific machine learning this field has moved from a purely data-driven (statistical) formalism to a constrained or “physics-informed” formalism, which integrates physical knowledge and methods from traditional computational engineering. In the first part, we introduce the general concepts and the statistical flavor of regression versus other forms of curve fitting. We then move to an overview of traditional methods from machine learning and their classification and ways to link these to traditional computational science. Finally, we close with a note on methods to combine machine learning and numerical methods for physics.
[LG-130] Dimension-free error estimate for diffusion model and optimal scheduling
链接: https://arxiv.org/abs/2512.01820
作者: Valentin de Bortoli,Romuald Elie,Anna Kazeykina,Zhenjie Ren,Jiacheng Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*备注:
[LG-131] Decision Tree Embedding by Leaf-Means
链接: https://arxiv.org/abs/2512.01819
作者: Cencheng Shen,Yuexiao Dong,Carey E. Priebe
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 9 pages
[LG-132] Multimodal Mixture-of-Experts for ISAC in Low-Altitude Wireless Networks
链接: https://arxiv.org/abs/2512.01750
作者: Kai Zhang,Wentao Yu,Hengtao He,Shenghui Song,Jun Zhang,Khaled B. Letaief
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
[LG-133] Common Structure Discovery in Collections of Bipartite Networks: Application to Pollination Systems
链接: https://arxiv.org/abs/2512.01716
作者: Louis Lacoste,Pierre Barbillon,Sophie Donnet
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注:
[LG-134] Differentially Private and Federated Structure Learning in Bayesian Networks
链接: https://arxiv.org/abs/2512.01708
作者: Ghita Fassy El Fehri,Aurélien Bellet,Philippe Bastien
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
[LG-135] Bayesian Ambiguity Contraction-based Adaptive Robust Markov Decision Processes for Adversarial Surveillance Missions
链接: https://arxiv.org/abs/2512.01660
作者: Jimin Choi,Max Z. Li
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
[LG-136] Cuffless Blood Pressure Estimation from Six Wearable Sensor Modalities in Multi-Motion-State Scenarios
链接: https://arxiv.org/abs/2512.01653
作者: Yiqiao Chen,Fazheng Xu,Zijian Huang,Juchi He,Zhenghui Feng
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 13 pages, 7 figures
[LG-137] Neural Networks for Predicting Permeability Tensors of 2D Porous Media: Comparison of Convolution- and Transformer-based Architectures
链接: https://arxiv.org/abs/2512.01517
作者: Sigurd Vargdal,Paula Reis,Henrik Andersen Sveinsson,Gaute Linga
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Geophysics (physics.geo-ph)
*备注:
Abstract:Permeability is a central concept in the macroscopic description of flow through porous media, with applications spanning from oil recovery to hydrology. Traditional methods for determining the permeability tensor involving flow simulations or experiments can be time-consuming and resource-intensive, while analytical methods, e.g., based on the Kozeny-Carman equation, may be too simplistic for accurate prediction based on pore-scale features. In this work, we explore deep learning as a more efficient alternative for predicting the permeability tensor based on two-dimensional binary images of porous media, segmented into solid (1) and void (0) regions. We generate a dataset of 24,000 synthetic random periodic porous media samples with specified porosity and characteristic length scale. Using Lattice-Boltzmann simulations, we compute the permeability tensor for flow through these samples with values spanning three orders of magnitude. We evaluate three families of image-based deep learning models: ResNet (ResNet-50 and ResNet-101), Vision Transformers (ViT-T16 and ViT-S16) and ConvNeXt (Tiny and Small). To improve model generalisation, we employ techniques such as weight decay, learning rate scheduling, and data augmentation. The effect of data augmentation and dataset size on model performance is studied, and we find that they generally increase the accuracy of permeability predictions. We also show that ConvNeXt and ResNet converge faster than ViT and degrade in performance if trained for too long. ConvNeXt-Small achieved the highest R^2 score of 0.99460 on 4,000 unseen test samples. These findings underscore the potential to use image-based neural networks to predict permeability tensors accurately.
[LG-138] Learning Reduced Representations for Quantum Classifiers
链接: https://arxiv.org/abs/2512.01509
作者: Patrick Odagiu,Vasilis Belis,Lennart Schulze,Panagiotis Barkoutsos,Michele Grossi,Florentin Reiter,Günther Dissertori,Ivano Tavernelli,Sofia Vallecorsa
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注:
[LG-139] Masked Symbol Modeling for Demodulation of Oversampled Baseband Communication Signals in Impulsive Noise-Dominated Channels NEURIPS2025
链接: https://arxiv.org/abs/2512.01428
作者: Oguz Bedir(1),Nurullah Sevim(1),Mostafa Ibrahim(2),Sabit Ekin(2 and 1) ((1) Electrical & Computer Engineering, Texas A&M University, USA, (2) Engineering Technology & Industrial Distribution, Texas A&M University, USA)
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted to the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop on AI and ML for Next-Generation Wireless Communications and Networking (AI4NextG), non-archival
[LG-140] Modeling Wavelet Transformed Quantum Support Vector for Network Intrusion Detection
链接: https://arxiv.org/abs/2512.01365
作者: Swati Kumari,Shiva Raj Pokhrel,Swathi Chandrasekhar,Navneet Singh,Hridoy Sankar Dutta,Adnan Anwar,Sutharshan Rajasegarar,Robin Doss
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
[LG-141] Experimental Methods, Health Indicators, and Diagnostic Strategies for Retired Lithium-ion Batteries: A Comprehensive Review
链接: https://arxiv.org/abs/2512.01294
作者: Song Zhang,Ruohan Guo,Xiaohua Ge,Perter Mahon,Weixiang Shen
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: Review article; 46 pages, 3 figures, 2 tables
[LG-142] Bayesian Optimization for Non-Cooperative Game-Based Radio Resource Management
链接: https://arxiv.org/abs/2512.01245
作者: Yunchuan Zhang,Jiechen Chen,Junshuo Liu,Robert C. Qiu
类目: Signal Processing (eess.SP); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: 6 pages, 4 figures, this paper is accepted to 2025 IEEE Global Communications Conference (Globecom)
[LG-143] Implicitly Normalized Online PCA: A Regularized Algorithm with Exact High-Dimensional Dynamics
链接: https://arxiv.org/abs/2512.01231
作者: Samet Demir,Zafer Dogan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 34 pages, 9 figures
Abstract:Many online learning algorithms, including classical online PCA methods, enforce explicit normalization steps that discard the evolving norm of the parameter vector. We show that this norm can in fact encode meaningful information about the underlying statistical structure of the problem, and that exploiting this information leads to improved learning behavior. Motivated by this principle, we introduce Implicitly Normalized Online PCA (INO-PCA), an online PCA algorithm that removes the unit-norm constraint and instead allows the parameter norm to evolve dynamically through a simple regularized update. We prove that in the high-dimensional limit the joint empirical distribution of the estimate and the true component converges to a deterministic measure-valued process governed by a nonlinear PDE. This analysis reveals that the parameter norm obeys a closed-form ODE coupled with the cosine similarity, forming an internal state variable that regulates learning rate, stability, and sensitivity to signal-to-noise ratio (SNR). The resulting dynamics uncover a three-way relationship between the norm, SNR, and optimal step size, and expose a sharp phase transition in steady-state performance. Both theoretically and experimentally, we show that INO-PCA consistently outperforms Oja’s algorithm and adapts rapidly in non-stationary environments. Overall, our results demonstrate that relaxing norm constraints can be a principled and effective way to encode and exploit problem-relevant information in online learning algorithms.
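For context, the classical baseline mentioned above (Oja's algorithm with an explicit normalization step) can be sketched as follows. This is only the well-known baseline, not INO-PCA: the abstract does not give the regularized update, so it is not reproduced here. The spiked data model, the helper names (`spiked_samples`, `oja_online_pca`) and all parameter values are illustrative assumptions.

```python
import math
import random


def normalize(v):
    """Rescale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def oja_online_pca(samples, step_size=0.005, seed=0):
    """Classical online PCA: gradient step on each sample, then explicit renormalization."""
    rng = random.Random(seed)
    d = len(samples[0])
    w = normalize([rng.gauss(0.0, 1.0) for _ in range(d)])
    for x in samples:
        y = sum(xi * wi for xi, wi in zip(x, w))  # projection of the sample onto w
        w = [wi + step_size * y * xi for xi, wi in zip(x, w)]
        w = normalize(w)  # this step discards the evolving norm of w
    return w


def spiked_samples(n, d, snr, seed=1):
    """Assumed toy data: x = sqrt(snr) * c * u + noise, with u the first basis vector."""
    rng = random.Random(seed)
    u = [1.0] + [0.0] * (d - 1)
    samples = []
    for _ in range(n):
        c = rng.gauss(0.0, 1.0)
        samples.append([math.sqrt(snr) * c * ui + rng.gauss(0.0, 1.0) for ui in u])
    return samples, u


samples, u = spiked_samples(n=5000, d=20, snr=4.0)
w = oja_online_pca(samples)
cosine = abs(sum(wi * ui for wi, ui in zip(w, u)))
print(f"|cos(w, u)| = {cosine:.3f}")
```

The `normalize` call inside the loop is the explicit normalization that the abstract argues throws away useful information; INO-PCA is described as replacing it with a regularized update that lets the norm evolve.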
[LG-144] High-dimensional Mean-Field Games by Particle-based Flow Matching
链接: https://arxiv.org/abs/2512.01172
作者: Jiajia Yu,Junghwan Lee,Yao Xie,Xiuyuan Cheng
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Mean-field games (MFGs) study the Nash equilibrium of systems with a continuum of interacting agents, which can be formulated as the fixed-point of optimal control problems. They provide a unified framework for a variety of applications, including optimal transport (OT) and generative models. Despite their broad applicability, solving high-dimensional MFGs remains a significant challenge due to fundamental computational and analytical obstacles. In this work, we propose a particle-based deep Flow Matching (FM) method to tackle high-dimensional MFG computation. In each iteration of our proximal fixed-point scheme, particles are updated using first-order information, and a flow neural network is trained to match the velocity of the sample trajectories in a simulation-free manner. Theoretically, in the optimal control setting, we prove that our scheme converges to a stationary point sublinearly, and upgrade to linear (exponential) convergence under additional convexity assumptions. Our proof uses FM to induce an Eulerian coordinate (density-based) from a Lagrangian one (particle-based), and this also leads to certain equivalence results between the two formulations for MFGs when the Eulerian solution is sufficiently regular. Our method demonstrates promising performance on non-potential MFGs and high-dimensional OT problems cast as MFGs through a relaxed terminal-cost formulation.
[LG-145] Building Trustworthy AI for Materials Discovery: From Autonomous Laboratories to Z-scores
链接: https://arxiv.org/abs/2512.01080
作者: Benhour Amirian,Ashley S. Dale,Sergei Kalinin,Jason Hattrick-Simpers
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:
[LG-146] Sleep Apnea Detection on a Wireless Multimodal Wearable Device Without Oxygen Flow Using a Mamba-based Deep Learning Approach
链接: https://arxiv.org/abs/2512.00989
作者: Dominik Luszczynski,Richard Fei Yin,Nicholas Afonin,Andrew S. P. Lim
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注: 29 pages, 14 figures. Authors Dominik Luszczynski, Richard Fei Yin and Nicholas Afonin contributed equally
[LG-147] Thompson Sampling for Multi-Objective Linear Contextual Bandit NEURIPS2025
链接: https://arxiv.org/abs/2512.00930
作者: Somangchan Park,Heesang Ann,Min-hwan Oh
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: NeurIPS 2025
[LG-148] Outcome-Aware Spectral Feature Learning for Instrumental Variable Regression
链接: https://arxiv.org/abs/2512.00919
作者: Dimitri Meunier,Jakub Wornbard,Vladimir R. Kostic,Antoine Moulin,Alek Fröhlich,Karim Lounici,Massimiliano Pontil,Arthur Gretton
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We address the problem of causal effect estimation in the presence of hidden confounders using nonparametric instrumental variable (IV) regression. An established approach is to use estimators based on learned spectral features, that is, features spanning the top singular subspaces of the operator linking treatments to instruments. While powerful, such features are agnostic to the outcome variable. Consequently, the method can fail when the true causal function is poorly represented by these dominant singular functions. To mitigate, we introduce Augmented Spectral Feature Learning, a framework that makes the feature learning process outcome-aware. Our method learns features by minimizing a novel contrastive loss derived from an augmented operator that incorporates information from the outcome. By learning these task-specific features, our approach remains effective even under spectral misalignment. We provide a theoretical analysis of this framework and validate our approach on challenging benchmarks.
[LG-149] Fragmentation is Efficiently Learnable by Quantum Neural Networks
链接: https://arxiv.org/abs/2512.00751
作者: Mikhail Mints,Eric Anschuetz
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 25 pages, 3 figures
[LG-150] Non-Negative Matrix Factorization Using Non-Von Neumann Computers
链接: https://arxiv.org/abs/2512.00675
作者: Ajinkya Borle,Charles Nicholas,Uchenna Chukwu,Mohammad-Ali Miri,Nicholas Chancellor
类目: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 14 pages, 5 figures, 6 tables and 1 appendix
[LG-151] Restricted Block Permutation for Two-Sample Testing
链接: https://arxiv.org/abs/2512.00668
作者: Jungwoo Ho
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
[LG-152] Self-sufficient Independent Component Analysis via KL Minimizing Flows
链接: https://arxiv.org/abs/2512.00665
作者: Song Liu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
[LG-153] Statistical-computational gap in multiple Gaussian graph alignment
链接: https://arxiv.org/abs/2512.00610
作者: Bertrand Even,Luca Ganassali
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
[LG-154] GCMCG: A Clustering-Aware Graph Attention and Expert Fusion Network for Multi-Paradigm Multi-task and Cross-Subject EEG Decoding
链接: https://arxiv.org/abs/2512.00574
作者: Yiqiao Chen,Zijian Huang,Juchi He,Fazheng Xu,Zhenghui Feng
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 46 pages, 11 figures
[LG-155] No-Regret Gaussian Process Optimization of Time-Varying Functions
链接: https://arxiv.org/abs/2512.00517
作者: Eliabelle Mauduit,Eloïse Berthier,Andrea Simonetto
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
[LG-156] An RKHS Perspective on Tree Ensembles
链接: https://arxiv.org/abs/2512.00397
作者: Mehdi Dagdoug,Clement Dombry,Jean-Jil Duchamps
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 69 pages
[LG-157] EnzyCLIP: A Cross-Attention Dual Encoder Framework with Contrastive Learning for Predicting Enzyme Kinetic Constants
链接: https://arxiv.org/abs/2512.00379
作者: Anas Aziz Khan,Md Shah Fahad,Priyanka,Ramesh Chandra,Guransh Singh
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:
[LG-158] An Interpretable Operator-Learning Model for Electric Field Profile Reconstruction in Discharges Based on the EFISH Method
链接: https://arxiv.org/abs/2512.00359
作者: Zhijian Yang,Edwin Setiadi Sugeng,Mhedine Alicherif,Tat Loon Chng
类目: Plasma Physics (physics.plasm-ph); Machine Learning (cs.LG)
*备注:
Abstract:Machine learning (ML) models have recently been used to reconstruct electric field distributions from EFISH signal profiles, the so-called ‘inverse EFISH problem’. This addresses the line-of-sight EFISH inaccuracy caused by the Gouy phase shift in focused beams. A key benefit of this approach is that the accuracy of the reconstructed profile can be directly checked via a ‘forward transform’ of the EFISH equation. Motivated by this latest success, the present study introduces a novel ML model with markedly improved performance. Based on a more powerful operator-learning architecture, it goes beyond the ANNs and CNNs employed previously. Termed Decoder-DeepONet (DDON), its main strength is learning function-to-function mappings, essential for recovering electric field profiles of unknown shape. The superior performance of DDON is exemplified via a comparison with our published CNN model and the feasibility of a classical mathematical method, as well as its application to both discharge simulations and experimental EFISH data from a nanosecond pulsed discharge. In almost all cases, the DDON model exhibits better generalizability, higher prediction accuracy, and wider applicability. Furthermore, the intrinsic nature of this operator-learning architecture renders it less sensitive to the exact location(s) of the acquired data, enabling electric field reconstruction even with seemingly ‘incomplete’ input profiles, an issue often accompanying poor signal sensitivity. We also employ Integrated Gradients (IG) to identify the signal regions most critical to reconstruction accuracy, providing guidance on the optimal sampling window for EFISH acquisition. Overall, we believe that the DDON model is a robust and comprehensive model which can be readily applied to reconstruct ‘bell-shaped’ electric field profiles with an existing axis of symmetry, especially in non-equilibrium plasmas.
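The Integrated Gradients attribution mentioned above can be sketched in general form. The sketch assumes a toy differentiable model with a known analytic gradient; `model`, `model_grad`, the weights and the input values are hypothetical stand-ins and have nothing to do with the DDON network or EFISH data.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate IG_i = (x_i - b_i) * integral over alpha in [0, 1] of dF/dx_i(b + alpha * (x - b))."""
    d = len(x)
    avg_grad = [0.0] * d
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint rule along the straight-line path
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_fn(point)
        for i in range(d):
            avg_grad[i] += g[i] / steps
    return [(xi - b) * gi for xi, b, gi in zip(x, baseline, avg_grad)]


# Hypothetical toy model: F(x) = sum_i a_i * x_i^2, with a zero weight on the last input.
weights = [3.0, 1.0, 0.0]


def model(x):
    return sum(a * xi * xi for a, xi in zip(weights, x))


def model_grad(x):
    return [2.0 * a * xi for a, xi in zip(weights, x)]


signal = [1.0, 2.0, 5.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(model_grad, signal, baseline)
print(attributions)
```

The attributions sum to `model(signal) - model(baseline)` (the completeness property of IG), and the input with zero weight receives zero attribution. Ranking signal regions by importance relies on the same property.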
[LG-159] Stochastic Dominance Constrained Optimization with S-shaped Utilities: Poor-Performance-Region Algorithm and Neural Network
链接: https://arxiv.org/abs/2512.00299
作者: Zeyun Hu,Yang Liu
类目: Mathematical Finance (q-fin.MF); Machine Learning (cs.LG); Portfolio Management (q-fin.PM); Risk Management (q-fin.RM)
*备注: 30 pages
[LG-160] DAISI: Data Assimilation with Inverse Sampling using Stochastic Interpolants
链接: https://arxiv.org/abs/2512.00252
作者: Martin Andrae,Erik Larsson,So Takao,Tomas Landelius,Fredrik Lindsten
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: 41 pages, 24 figures
[LG-161] Statistical Inference under Adaptive Sampling with LinUCB
链接: https://arxiv.org/abs/2512.00222
作者: Wei Fan,Kevin Tan,Yuting Wei
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:
[LG-162] Beyond Expected Goals: A Probabilistic Framework for Shot Occurrences in Soccer
链接: https://arxiv.org/abs/2512.00203
作者: Jonathan Pipping,Tianshu Feng,R. Paul Sabin
类目: Applications (stat.AP); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 18pp main + 3pp appendix; 8 figures, 12 tables. Submitted to the Journal of Quantitative Analysis in Sports (JQAS). Data proprietary to Gradient Sports; we share derived features scripts (code under MIT/Apache-2.0). Preprint licensed CC BY 4.0
[LG-163] Comparing Two Proxy Methods for Causal Identification
链接: https://arxiv.org/abs/2512.00175
作者: Helen Guo,Elizabeth L. Ogburn,Ilya Shpitser
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 10 pages; 6 figures
[LG-164] Learning with Physical Constraints
链接: https://arxiv.org/abs/2512.00104
作者: Miguel A. Mendez,Jan van Den Berghe,Manuel Ratz,Matilde Fiore,Lorenzo Schena
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Chapter 3 from Machine Learning for Fluid Dynamics (ISBN 978-2875162090). Based on the VKI-ULB lecture series “Machine Learning for Fluid Dynamics,” held in Brussels in February 2022
[LG-165] Predicting COVID-19 Prevalence Using Wastewater RNA Surveillance: A Semi-Supervised Learning Approach with Temporal Feature Trust
链接: https://arxiv.org/abs/2512.00100
作者: Yifei Chen,Eric Liang
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Applications (stat.AP)
*备注: 22 pages, 13 figures. Submitted to SIURO
信息检索
[IR-0] Cross-Domain Federated Semantic Communication with Global Representation Alignment and Domain-Aware Aggregation
链接: https://arxiv.org/abs/2512.00711
作者: Loc X. Nguyen,Ji Su Yoon,Huy Q. Le,Yu Qiao,Avi Deb Raha,Eui-Nam Huh,Walid Saad,Dusit Niyato,Zhu Han,Choong Seon Hong
类目: Information Theory (cs.IT); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Information Retrieval (cs.IR)
*备注: 13 pages, 7 figures, 6 tables
Abstract:Semantic communication can significantly improve bandwidth utilization in wireless systems by exploiting the meaning behind raw data. However, the advancements achieved through semantic communication are closely dependent on the development of deep learning (DL) models for joint source-channel coding (JSCC) encoder/decoder techniques, which require a large amount of data for training. To address this data-intensive nature of DL models, federated learning (FL) has been proposed to train a model in a distributed manner, where the server broadcasts the DL model to clients in the network for training with their local data. However, the conventional FL approaches suffer from catastrophic degradation when client data are from different domains. In contrast, in this paper, a novel FL framework is proposed to address this domain shift by constructing the global representation, which aligns with the local features of the clients to preserve the semantics of different data domains. In addition, the dominance problem of client domains with a large number of samples is identified and, then, addressed with a domain-aware aggregation approach. This work is the first to consider the domain shift in training the semantic communication system for the image reconstruction task. Finally, simulation results demonstrate that the proposed approach outperforms the model-contrastive FL (MOON) framework by 0.5 for PSNR values under three domains at an SNR of 1 dB, and this gap continues to widen as the channel quality improves.
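The dominance problem described above can be illustrated with a small sketch. `fedavg_weights` follows the standard sample-count weighting of FedAvg. `domain_balanced_weights` is a hypothetical stand-in that gives each domain equal total weight; it is not the paper's domain-aware aggregation, which the abstract does not specify. The client parameters and sample counts are made up.

```python
def weighted_average(client_params, weights):
    """Aggregate per-client parameter vectors with the given (unnormalized) weights."""
    total = sum(weights)
    dim = len(client_params[0])
    return [
        sum(w * params[i] for w, params in zip(weights, client_params)) / total
        for i in range(dim)
    ]


def fedavg_weights(sample_counts):
    """Standard FedAvg: each client is weighted by its number of local samples."""
    return [float(n) for n in sample_counts]


def domain_balanced_weights(sample_counts, domains):
    """Hypothetical alternative: every domain gets total weight 1, split by sample count within it."""
    domain_totals = {}
    for n, dom in zip(sample_counts, domains):
        domain_totals[dom] = domain_totals.get(dom, 0) + n
    return [n / domain_totals[dom] for n, dom in zip(sample_counts, domains)]


# Three clients from three domains; domain "A" holds 90% of the data.
client_params = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
sample_counts = [900, 50, 50]
domains = ["A", "B", "C"]

fedavg_model = weighted_average(client_params, fedavg_weights(sample_counts))
balanced_model = weighted_average(
    client_params, domain_balanced_weights(sample_counts, domains)
)
print(fedavg_model, balanced_model)
```

With sample-count weighting, the aggregate sits almost on top of the domain "A" client. With the balanced weights, each domain contributes one third, which is the kind of correction a domain-aware scheme aims for.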
[IR-1] ProEx: A Unified Framework Leveraging Large Language Model with Profile Extrapolation for Recommendation KDD2026
链接: https://arxiv.org/abs/2512.00679
作者: Yi Zhang,Yiwen Zhang,Yu Wang,Tong Chen,Hongzhi Yin
类目: Information Retrieval (cs.IR)
*备注: Accepted by KDD 2026 (First Cycle)
Abstract:The powerful text understanding and generation capabilities of large language models (LLMs) have brought new vitality to general recommendation with implicit feedback. One possible strategy involves generating a unique user (or item) profile from historical interaction data, which is then mapped to a semantic representation in the language space. However, a single-instance profile may be insufficient to comprehensively capture the complex intentions behind a user’s interacted items. Moreover, due to the inherent instability of LLMs, a biased or misinterpreted profile could even undermine the original recommendation performance. Consequently, an intuitive solution is to generate multiple profiles for each user (or item), each reflecting a distinct aspect of their characteristics. In light of this, we propose a unified recommendation framework with multi-faceted profile extrapolation (ProEx) in this paper. By leveraging chain-of-thought reasoning, we construct multiple distinct profiles for each user and item. These new profiles are subsequently mapped into semantic vectors, extrapolating from the position of the original profile to explore a broader region of the language space. Subsequently, we introduce the concept of environments, where each environment represents a possible linear combination of all profiles. The differences across environments are minimized to reveal the inherent invariance of user preferences. We apply ProEx to three discriminative methods and three generative methods, and conduct extensive experiments on three datasets. The experimental results demonstrate that ProEx significantly enhances the performance of these base recommendation models.
[IR-2] Evolving Paradigms in Task-Based Search and Learning: A Comparative Analysis of Traditional Search Engine with LLM-Enhanced Conversational Search System
链接: https://arxiv.org/abs/2512.00313
作者: Zhitong Guan,Yi Wang
类目: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
*备注:
Abstract:Large Language Models (LLMs) are rapidly reshaping information retrieval by enabling interactive, generative, and inference-driven search. While traditional keyword-based search remains central to web and academic information access, it often struggles to support multi-step reasoning and exploratory learning tasks. LLM-powered search interfaces, such as ChatGPT and Claude, introduce new capabilities that may influence how users formulate queries, navigate information, and construct knowledge. However, empirical understanding of these effects is still limited. This study compares search behavior and learning outcomes in two environments: a standard search engine and an LLM-powered search system. We investigate (1) how search strategies, query formulation, and evaluation behaviors differ across systems, and (2) how LLM use affects comprehension, knowledge integration, and critical thinking during search-based learning tasks. Findings offer insight into how generative AI shapes information-seeking processes and contribute to ongoing discussions in information retrieval, human-AI interaction, and technology-supported learning.

