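This post lists the latest papers retrieved from arXiv.org on 2025-12-09. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: The daily paper data is retrieved from arXiv.org and refreshed automatically at around 12:00 each morning.

Friendly reminder: If you would like to receive the daily paper data by email, leave your email address in the comments.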

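Contents

Overview (2025-12-09)

927 papers were updated today, including:

  • Natural language processing: 110 papers (Computation and Language (cs.CL))
  • Artificial intelligence: 257 papers (Artificial Intelligence (cs.AI))
  • Computer vision: 259 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine learning: 264 papers (Machine Learning (cs.LG))

Natural Language Processing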

[NLP-0] Do Generalisation Results Generalise?

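[Quick Read]: This paper addresses the one-sidedness of how out-of-distribution (OOD) generalisation is evaluated for large language models (LLMs). Existing studies typically evaluate on a single OOD test set, which does not reflect the diverse data shifts a model meets once deployed. To measure OOD generalisation more fully, the authors evaluate a model on multiple OOD test sets throughout a fine-tuning run and use partial correlation to control for in-domain performance, then examine how generalisation performance correlates across those test sets. The key idea is to quantify, with in-domain performance regressed out, how transferable and consistent OOD generalisation is across test sets; the analysis shows that current LLM generalisation behaviour follows no overarching trend.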

Link: https://arxiv.org/abs/2512.07832
Authors: Matteo Boglioni, Andrea Sgobbi, Gabriel Tavernini, Francesco Rita, Marius Mosbach, Tiago Pimentel
Affiliations: ETH Zürich; Mila - Quebec Artificial Intelligence Institute; McGill University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:A large language model’s (LLM’s) out-of-distribution (OOD) generalisation ability is crucial to its deployment. Previous work assessing LLMs’ generalisation performance, however, typically focuses on a single out-of-distribution dataset. This approach may fail to precisely evaluate the capabilities of the model, as the data shifts encountered once a model is deployed are much more diverse. In this work, we investigate whether OOD generalisation results generalise. More specifically, we evaluate a model’s performance across multiple OOD testsets throughout a finetuning run; we then evaluate the partial correlation of performances across these testsets, regressing out in-domain performance. This allows us to assess how correlated are generalisation performances once in-domain performance is controlled for. Analysing OLMo2 and OPT, we observe no overarching trend in generalisation results: the existence of a positive or negative correlation between any two OOD testsets depends strongly on the specific choice of model analysed.
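The partial-correlation step described in the abstract can be illustrated with a minimal sketch. It assumes a simple linear regression on in-domain accuracy and Pearson correlation of the residuals; the checkpoint scores and function names are hypothetical and are not taken from the paper.

```python
import math

def _residuals(y, x):
    """Residuals of the least-squares fit y ~ a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return [yi - (my + b * (xi - mx)) for xi, yi in zip(x, y)]

def pearson(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def partial_corr(ood_a, ood_b, in_domain):
    """Correlation of two OOD score series after regressing out in-domain scores."""
    return pearson(_residuals(ood_a, in_domain), _residuals(ood_b, in_domain))

# Hypothetical accuracies at five fine-tuning checkpoints.
in_domain = [0.50, 0.60, 0.70, 0.80, 0.90]
ood_a = [0.40, 0.47, 0.52, 0.61, 0.66]
ood_b = [0.30, 0.33, 0.41, 0.44, 0.52]
print(partial_corr(ood_a, ood_b, in_domain))
```

Two OOD test sets can both improve with in-domain accuracy and still be uncorrelated, or negatively correlated, once that shared trend is removed, which is the quantity the paper studies.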

[NLP-1] Group Representational Position Encoding

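[Quick Read]: This paper addresses the unification and flexibility of positional encoding design in large language models, in particular how to capture relative position effectively and compute efficiently in long-context modelling. The key is GRAPE (Group RepresentAtional Position Encoding), a unified positional encoding framework built on group actions. Multiplicative GRAPE uses rotations in SO(d) to give a relative, compositional, norm-preserving map that recovers RoPE (Rotary Position Embedding) exactly, and extends that geometry with learned commuting subspaces and compact non-commuting mixtures to capture cross-subspace feature coupling. Additive GRAPE uses additive logit biases arising from rank-1 or low-rank unipotent actions in GL, recovering ALiBi (Attention with Linear Biases) and the Forgetting Transformer (FoX) exactly while preserving the relative law and streaming cacheability. The framework places the mainstream positional encodings in one theoretically consistent design space for long-context models.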

Link: https://arxiv.org/abs/2512.07805
Authors: Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan, Kangping Xu, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project Page: this https URL

Abstract: We present GRAPE (Group RepresentAtional Position Encoding), a unified framework for positional encoding based on group actions. GRAPE brings together two families of mechanisms: (i) multiplicative rotations (Multiplicative GRAPE) in $\mathrm{SO}(d)$ and (ii) additive logit biases (Additive GRAPE) arising from unipotent actions in the general linear group $\mathrm{GL}$. In Multiplicative GRAPE, a position $n \in \mathbb{Z}$ (or $t \in \mathbb{R}$) acts as $\mathbf{G}(n)=\exp(n\,\omega\,\mathbf{L})$ with a rank-2 skew generator $\mathbf{L} \in \mathbb{R}^{d \times d}$, yielding a relative, compositional, norm-preserving map with a closed-form matrix exponential. RoPE is recovered exactly when the $d/2$ planes are the canonical coordinate pairs with log-uniform spectrum. Learned commuting subspaces and compact non-commuting mixtures strictly extend this geometry to capture cross-subspace feature coupling at $O(d)$ and $O(r d)$ cost per head, respectively. In Additive GRAPE, additive logits arise as rank-1 (or low-rank) unipotent actions, recovering ALiBi and the Forgetting Transformer (FoX) as exact special cases while preserving an exact relative law and streaming cacheability. Altogether, GRAPE supplies a principled design space for positional geometry in long-context models, subsuming RoPE and ALiBi as special cases. Project Page: this https URL.
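The "relative law" mentioned in the abstract can be checked numerically in the simplest special case. The sketch below is not the GRAPE implementation: it assumes a single coordinate plane (one RoPE-style 2-D rotation) and an ALiBi-style linear bias, and the frequency `omega`, the `slope`, and the vectors are hypothetical values.

```python
import math

def rotate(vec, pos, omega=0.1):
    """Apply the 2-D rotation by angle pos * omega, i.e. exp(pos * omega * L)
    for the skew generator L = [[0, -1], [1, 0]]."""
    c, s = math.cos(pos * omega), math.sin(pos * omega)
    x, y = vec
    return (c * x - s * y, s * x + c * y)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def alibi_bias(i, j, slope=0.5):
    """ALiBi-style additive logit bias for query position i and key position j <= i."""
    return -slope * (i - j)

q, k = (1.0, 2.0), (0.5, -1.5)

# Multiplicative case: the score depends only on the offset (7 in both lines).
score_a = dot(rotate(q, 3), rotate(k, 10))
score_b = dot(rotate(q, 40), rotate(k, 47))
print(abs(score_a - score_b) < 1e-9)

# Additive case: the bias also depends only on the offset.
print(alibi_bias(5, 3) == alibi_bias(12, 10))
```

Because rotations preserve norms and compose additively in the angle, shifting both positions by the same amount leaves the attention score unchanged.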

[NLP-2] Collaborative Causal Sensemaking: Closing the Complementarity Gap in Human-AI Decision Support

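[Quick Read]: The paper tackles the following problem: in messy, high-stakes decision settings, agents based on large language models (LLMs) rarely improve the overall performance of human-AI teams. Teams underperform the best individual, experts oscillate between verification loops and over-reliance, and the expected complementarity does not materialise. The authors attribute this to AI assistance that ignores how expert decisions are actually made: through collaborative cognitive processes in which goals, mental models, and constraints are continually co-constructed, tested, and revised between human and AI. The proposed answer is Collaborative Causal Sensemaking (CCS), a research agenda and organizing framework in which the agent is a partner in cognitive work: it maintains an evolving model of how a particular expert reasons, helps articulate and revise goals, co-constructs and stress-tests causal hypotheses, and learns from the outcomes of joint decisions so that both human and agent improve over time.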

Link: https://arxiv.org/abs/2512.07801
Authors: Raunak Jain, Mudita Khurana
Affiliations: Intuit; Airbnb
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract:LLM-based agents are rapidly being plugged into expert decision-support, yet in messy, high-stakes settings they rarely make the team smarter: human-AI teams often underperform the best individual, experts oscillate between verification loops and over-reliance, and the promised complementarity does not materialise. We argue this is not just a matter of accuracy, but a fundamental gap in how we conceive AI assistance: expert decisions are made through collaborative cognitive processes where mental models, goals, and constraints are continually co-constructed, tested, and revised between human and AI. We propose Collaborative Causal Sensemaking (CCS) as a research agenda and organizing framework for decision-support agents: systems designed as partners in cognitive work, maintaining evolving models of how particular experts reason, helping articulate and revise goals, co-constructing and stress-testing causal hypotheses, and learning from the outcomes of joint decisions so that both human and agent improve over time. We sketch challenges around training ecologies that make collaborative thinking instrumentally valuable, representations and interaction protocols for co-authored models, and evaluation centred on trust and complementarity. These directions can reframe MAS research around agents that participate in collaborative sensemaking and act as AI teammates that think with their human partners.

[NLP-3] ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning

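[Quick Read]: This paper addresses a limitation in how large language models (LLMs) are evaluated on reasoning tasks: current practice mostly reports single-run accuracy and ignores the intrinsic uncertainty introduced by stochastic decoding, so the stability, reproducibility, and cost consistency of a method cannot be assessed reliably. The key contribution is ReasonBENCH, which the authors describe as the first benchmark designed to quantify instability in LLM reasoning. It has three components: (i) a modular evaluation library that standardizes reasoning frameworks, models, and tasks; (ii) a multi-run protocol that reports statistically reliable metrics for both quality and cost; and (iii) a public leaderboard that encourages variance-aware reporting. Experiments show that the vast majority of reasoning strategies and models are highly unstable: strategies with similar average performance can have confidence intervals up to four times wider, and the top-performing methods often incur higher and less stable costs, which makes reproducibility a critical dimension of reliable LLM reasoning.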

Link: https://arxiv.org/abs/2512.07795
Authors: Nearchos Potamitis, Lars Klein, Akhil Arora
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 11 pages, 3 tables, 4 figures

Abstract:Large language models (LLMs) are increasingly deployed in settings where reasoning, such as multi-step problem solving and chain-of-thought, is essential. Yet, current evaluation practices overwhelmingly report single-run accuracy while ignoring the intrinsic uncertainty that naturally arises from stochastic decoding. This omission creates a blind spot because practitioners cannot reliably assess whether a method’s reported performance is stable, reproducible, or cost-consistent. We introduce ReasonBENCH, the first benchmark designed to quantify the underlying instability in LLM reasoning. ReasonBENCH provides (i) a modular evaluation library that standardizes reasoning frameworks, models, and tasks, (ii) a multi-run protocol that reports statistically reliable metrics for both quality and cost, and (iii) a public leaderboard to encourage variance-aware reporting. Across tasks from different domains, we find that the vast majority of reasoning strategies and models exhibit high instability. Notably, even strategies with similar average performance can display confidence intervals up to four times wider, and the top-performing methods often incur higher and less stable costs. Such instability compromises reproducibility across runs and, consequently, the reliability of reported performance. To better understand these dynamics, we further analyze the impact of prompts, model families, and scale on the trade-off between solve rate and stability. Our results highlight reproducibility as a critical dimension for reliable LLM reasoning and provide a foundation for future reasoning methods and uncertainty quantification techniques. ReasonBENCH is publicly available at this https URL .
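A small sketch can show why single-run accuracy hides instability. It assumes a normal-approximation 95% interval over repeated runs, which may differ from the statistics ReasonBENCH actually reports; the run scores and the `summarize_runs` helper are hypothetical.

```python
import statistics

def summarize_runs(scores, z=1.96):
    """Mean and normal-approximation confidence interval over repeated runs."""
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, (mean - z * sem, mean + z * sem)

# Hypothetical solve rates from five runs of two reasoning strategies.
method_a = [0.70, 0.71, 0.69, 0.70, 0.70]
method_b = [0.62, 0.78, 0.66, 0.74, 0.70]

for name, runs in (("A", method_a), ("B", method_b)):
    mean, (low, high) = summarize_runs(runs)
    print(f"method {name}: mean={mean:.3f}, interval width={high - low:.3f}")
```

Both strategies have the same average here, yet the second has a much wider interval, which a single-run report would not reveal.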

[NLP-4] On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models

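[Quick Read]: The paper asks whether post-training of a language model (LM), especially optimisation with reinforcement learning (RL), truly extends the model's reasoning ability or merely exploits what it acquired during pre-training. Because modern training pipelines lack control, existing work cannot separate the causal contributions of pre-training, mid-training, and RL post-training to reasoning gains.

The key to the solution is a fully controlled experimental framework built from synthetic reasoning tasks with explicit atomic operations, parseable step-by-step reasoning traces, and systematic manipulation of training distributions, which isolates and quantifies the role of each training stage. Models are evaluated along two axes: extrapolative generalization to more complex compositions and contextual generalization across surface contexts. With this design the paper clarifies what RL achieves under different training regimes: RL yields true capability gains only when pre-training leaves sufficient headroom and the RL data target tasks at the edge of the model's competence.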

Link: https://arxiv.org/abs/2512.07783
Authors: Charlie Zhang, Graham Neubig, Xiang Yue
Affiliations: Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Recent reinforcement learning (RL) techniques have yielded impressive reasoning improvements in language models, yet it remains unclear whether post-training truly extends a model’s reasoning ability beyond what it acquires during pre-training. A central challenge is the lack of control in modern training pipelines: large-scale pre-training corpora are opaque, mid-training is often underexamined, and RL objectives interact with unknown prior knowledge in complex ways. To resolve this ambiguity, we develop a fully controlled experimental framework that isolates the causal contributions of pre-training, mid-training, and RL-based post-training. Our approach employs synthetic reasoning tasks with explicit atomic operations, parseable step-by-step reasoning traces, and systematic manipulation of training distributions. We evaluate models along two axes: extrapolative generalization to more complex compositions and contextual generalization across surface contexts. Using this framework, we reconcile competing views on RL’s effectiveness. We show that: 1) RL produces true capability gains (pass@128) only when pre-training leaves sufficient headroom and when RL data target the model’s edge of competence, tasks at the boundary that are difficult but not yet out of reach. 2) Contextual generalization requires minimal yet sufficient pre-training exposure, after which RL can reliably transfer. 3) Mid-training significantly enhances performance under fixed compute compared with RL only, demonstrating its central but underexplored role in training pipelines. 4) Process-level rewards reduce reward hacking and improve reasoning fidelity. Together, these results clarify the interplay between pre-training, mid-training, and RL, offering a foundation for understanding and improving reasoning LM training strategies.
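The abstract reports capability gains as pass@128 but does not say how the metric is estimated. As an assumption, the sketch below uses the widely used unbiased pass@k estimator computed from n samples per task; the function name and sample counts are illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k given n samples, c of which are correct.

    Equals 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no correct sample.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical task: 256 samples drawn, 4 of them correct.
print(pass_at_k(256, 4, 1))
print(pass_at_k(256, 4, 128))
```

A large k such as 128 measures whether a correct solution is reachable at all, which is why it is used to separate genuine capability gains from sharpening of answers the model could already produce.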

[NLP-5] Mary the Cheeseburger-Eating Vegetarian: Do LLMs Recognize Incoherence in Narratives?

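[Quick Read]: This paper examines how reliably large language models (LLMs) recognise narrative coherence, and in particular whether their internal representations and their behaviour agree. A probing study finds that LLMs' internal representations can reliably identify incoherent narratives, yet the responses the models generate to rating questions fail to separate coherent from incoherent stories across several prompt variations, which points to a gap in their understanding of storytelling. Reasoning models do not eliminate the deficit, suggesting that thought strings may not fully close the gap between a model's internal state and its behaviour. The models are also more sensitive to incoherence from an event that violates the setting (e.g., a rainy day in the desert) than from a character violating an established trait (e.g., a vegetarian ordering a cheeseburger), which suggests reliance on prototypical world knowledge rather than meaning-based narrative coherence. Together, the results indicate that LLMs do not have a complete grasp of narrative coherence.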

Link: https://arxiv.org/abs/2512.07777
Authors: Karin de Langis, Püren Öncel, Ryan Peters, Andrew Elfenbein, Laura Kristen Allen, Andreas Schramm, Dongyeop Kang
Affiliations: University of Minnesota; Hamline University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Leveraging a dataset of paired narratives, we investigate the extent to which large language models (LLMs) can reliably separate incoherent and coherent stories. A probing study finds that LLMs’ internal representations can reliably identify incoherent narratives. However, LLMs generate responses to rating questions that fail to satisfactorily separate the coherent and incoherent narratives across several prompt variations, hinting at a gap in LLM’s understanding of storytelling. The reasoning LLMs tested do not eliminate these deficits, indicating that thought strings may not be able to fully address the discrepancy between model internal state and behavior. Additionally, we find that LLMs appear to be more sensitive to incoherence resulting from an event that violates the setting (e.g., a rainy day in the desert) than to incoherence arising from a character violating an established trait (e.g., Mary, a vegetarian, later orders a cheeseburger), suggesting that LLMs may rely more on prototypical world knowledge than building meaning-based narrative coherence. The consistent asymmetry found in our results suggests that LLMs do not have a complete grasp on narrative coherence.

[NLP-6] Automated Generation of Custom MedDRA Queries Using SafeTerm Medical Map

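[Quick Read]: This paper addresses how, in pre-market drug safety review, related adverse event terms can be grouped efficiently and accurately into standardised MedDRA (Medical Dictionary for Regulatory Activities) queries such as the FDA OCMQs, to make signal detection more efficient and consistent. The key is SafeTerm, an AI-based system that embeds input query terms and MedDRA Preferred Terms (PTs) in a multidimensional vector space, scores semantic relatedness with cosine similarity and extreme-value clustering, and ranks the PTs accordingly, giving automated retrieval of relevant PTs.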

Link: https://arxiv.org/abs/2512.07694
Authors: Francois Vandenhende, Anna Georgiou, Michalis Georgiou, Theodoros Psaras, Ellie Karekla, Elena Hadjicosta
Affiliations: ClinBAY Limited
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 4 figures

Abstract:In pre-market drug safety review, grouping related adverse event terms into standardised MedDRA queries or the FDA Office of New Drugs Custom Medical Queries (OCMQs) is critical for signal detection. We present a novel quantitative artificial intelligence system that understands and processes medical terminology and automatically retrieves relevant MedDRA Preferred Terms (PTs) for a given input query, ranking them by a relevance score using multi-criteria statistical methods. The system (SafeTerm) embeds medical query terms and MedDRA PTs in a multidimensional vector space, then applies cosine similarity and extreme-value clustering to generate a ranked list of PTs. Validation was conducted against the FDA OCMQ v3.0 (104 queries), restricted to valid MedDRA PTs. Precision, recall and F1 were computed across similarity-thresholds. High recall (95%) is achieved at moderate thresholds. Higher thresholds improve precision (up to 86%). The optimal threshold (~0.70 - 0.75) yielded recall ~50% and precision ~33%. Narrow-term PT subsets performed similarly but required slightly higher similarity thresholds. The SafeTerm AI-driven system provides a viable supplementary method for automated MedDRA query generation. A similarity threshold of ~0.60 is recommended initially, with increased thresholds for refined term selection.
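The retrieval step described in the abstract can be sketched as cosine-similarity ranking with a threshold. This is a simplified illustration, not the SafeTerm system: the extreme-value clustering step is omitted, and the toy 3-dimensional vectors and term names are hypothetical stand-ins for real embeddings and MedDRA PTs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_terms(query_vec, term_vecs, threshold=0.60):
    """Return (term, score) pairs scoring at or above threshold, best first."""
    scored = [(term, cosine(query_vec, vec)) for term, vec in term_vecs.items()]
    kept = [(term, score) for term, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hypothetical embeddings for one query and three candidate terms.
query = [1.0, 0.2, 0.0]
terms = {
    "Term A": [0.9, 0.3, 0.1],
    "Term B": [0.1, 1.0, 0.0],
    "Term C": [-1.0, 0.0, 0.2],
}
ranked = rank_terms(query, terms)
print(ranked)
```

The default threshold of 0.60 mirrors the starting value the abstract recommends; raising it keeps fewer, more closely related terms.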

[NLP-7] HalluShift++: Bridging Language and Vision through Internal Representation Shifts for Hierarchical Hallucinations in MLLMs

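[Quick Read]: This paper addresses hallucination in multimodal large language models (MLLMs), where the model generates descriptions that are inconsistent with the visual content, with potentially adverse consequences. Existing evaluation relies on external large language models (LLMs) as judges, but those evaluators are themselves prone to hallucination and hard to adapt to specific domains. The key idea is the hypothesis that hallucination shows up as measurable irregularities in the internal layer dynamics of MLLMs and does not stem from distributional shift alone. Building on layer-wise analysis, the authors propose HalluShift++, which extends hallucination detection from text-only LLMs to multimodal scenarios and does not depend on an external judge.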

Link: https://arxiv.org/abs/2512.07687
Authors: Sujoy Nath, Arkaprabha Basu, Sharanya Dasgupta, Swagatam Das
Affiliations: Netaji Subhash Engineering College (NSEC); TCG Crest; Electronics and Communication Sciences Unit (ECSU), Indian Statistical Institute
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language understanding tasks. While these models often produce linguistically coherent output, they often suffer from hallucinations, generating descriptions that are factually inconsistent with the visual content, potentially leading to adverse consequences. Therefore, the assessment of hallucinations in MLLM has become increasingly crucial in the model development process. Contemporary methodologies predominantly depend on external LLM evaluators, which are themselves susceptible to hallucinations and may present challenges in terms of domain adaptation. In this study, we propose the hypothesis that hallucination manifests as measurable irregularities within the internal layer dynamics of MLLMs, not merely due to distributional shifts but also in the context of layer-wise analysis of specific assumptions. By incorporating such modifications, HalluShift++ broadens the efficacy of hallucination detection from text-based large language models (LLMs) to encompass multimodal scenarios. Our codebase is available at this https URL.

[NLP-8] When Large Language Models Do Not Work: Online Incivility Prediction through Graph Neural Networks

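[Quick Read]: This paper addresses online incivility, which is widespread in digital communities and hard to detect, and in particular the limited accuracy and efficiency of existing automated detection methods. The key is a Graph Neural Network (GNN) framework that represents each user comment as a node and defines edges by textual similarity between comments, so the model learns jointly from linguistic content and the relational structure among comments. A dynamically adjusted attention mechanism adaptively balances nodal and topological features during information aggregation. The architecture outperforms 12 state-of-the-art Large Language Models (LLMs) across multiple metrics on three types of incivility (toxicity, aggression, and personal attacks) at significantly lower inference cost.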

Link: https://arxiv.org/abs/2512.07684
Authors: Zihan Chen, Lanyu Yu
Affiliations: Stevens Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: 10 pages

Abstract:Online incivility has emerged as a widespread and persistent problem in digital communities, imposing substantial social and psychological burdens on users. Although many platforms attempt to curb incivility through moderation and automated detection, the performance of existing approaches often remains limited in both accuracy and efficiency. To address this challenge, we propose a Graph Neural Network (GNN) framework for detecting three types of uncivil behavior (i.e., toxicity, aggression, and personal attacks) within the English Wikipedia community. Our model represents each user comment as a node, with textual similarity between comments defining the edges, allowing the network to jointly learn from both linguistic content and relational structures among comments. We also introduce a dynamically adjusted attention mechanism that adaptively balances nodal and topological features during information aggregation. Empirical evaluations demonstrate that our proposed architecture outperforms 12 state-of-the-art Large Language Models (LLMs) across multiple metrics while requiring significantly lower inference cost. These findings highlight the crucial role of structural context in detecting online incivility and address the limitations of text-only LLM paradigms in behavioral prediction. All datasets and comparative outputs will be publicly available in our repository to support further research and reproducibility.
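The graph construction described in the abstract (comments as nodes, textual similarity as edges) can be sketched as follows. The paper does not specify its similarity measure here, so this example assumes Jaccard overlap of word sets as a stand-in; the comments, threshold, and function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of the lowercase word sets of two comments."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def build_edges(comments, threshold=0.3):
    """Connect comment i and comment j (i < j) when their similarity
    reaches the threshold; returns a list of undirected edges."""
    edges = []
    for i in range(len(comments)):
        for j in range(i + 1, len(comments)):
            if jaccard(comments[i], comments[j]) >= threshold:
                edges.append((i, j))
    return edges

comments = [
    "you are wrong about this edit",
    "this edit is wrong",
    "thanks for the helpful source",
]
print(build_edges(comments))
```

The resulting edge list is the kind of input a GNN layer would aggregate over, so that a comment's prediction can draw on similar comments as well as its own text.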

[NLP-9] Bridging Code Graphs and Large Language Models for Better Code Understanding

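[Quick Read]: This paper addresses the limited grasp that large language models (LLMs) have of structural program semantics in code intelligence tasks, a consequence of their reliance on linearized token sequences. Existing approaches such as graph-augmented prompting and structure-aware pretraining either run into prompt length constraints or require task-specific architectural changes that do not fit large instruction-following LLMs. The key is CGBridge, a plug-and-play method built around an external, trainable Bridge module. A code graph encoder is first pre-trained with self-supervised learning to capture structural code semantics; the bridge is then trained to align code, graph, and text semantics through cross-modal attention; finally it generates structure-informed prompts that are injected into a frozen LLM and is fine-tuned for downstream tasks. The method improves notably over baselines on code summarization and translation, and its inference is over 4x faster than LoRA-tuned models.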

Link: https://arxiv.org/abs/2512.07666
Authors: Zeqi Chen, Zhaoyang Chu, Yi Gui, Feng Guo, Yao Wan, Chuan Shi
Affiliations: Beijing University of Posts and Telecommunications; Huazhong University of Science and Technology
Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments:

Abstract: Large Language Models (LLMs) have demonstrated remarkable performance in code intelligence tasks such as code generation, summarization, and translation. However, their reliance on linearized token sequences limits their ability to understand the structural semantics of programs. While prior studies have explored graph-augmented prompting and structure-aware pretraining, they either suffer from prompt length constraints or require task-specific architectural changes that are incompatible with large-scale instruction-following LLMs. To address these limitations, this paper proposes CGBridge, a novel plug-and-play method that enhances LLMs with Code Graph information through an external, trainable Bridge module. CGBridge first pre-trains a code graph encoder via self-supervised learning on a large-scale dataset of 270K code graphs to learn structural code semantics. It then trains an external module to bridge the modality gap among code, graph, and text by aligning their semantics through cross-modal attention mechanisms. Finally, the bridge module generates structure-informed prompts, which are injected into a frozen LLM, and is fine-tuned for downstream code intelligence tasks. Experiments show that CGBridge achieves notable improvements over both the original model and the graph-augmented prompting method. Specifically, it yields a 16.19% and 9.12% relative gain in LLM-as-a-Judge on code summarization, and a 9.84% and 38.87% relative gain in Execution Accuracy on code translation. Moreover, CGBridge achieves over 4x faster inference than LoRA-tuned models, demonstrating both effectiveness and efficiency in structure-aware code understanding.

[NLP-10] PCMind-2.1-Kaiyuan-2B Technical Report

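[Quick Read]: This paper addresses the knowledge gap in large language models (LLMs) between the open-source community and industry, which arises because industry relies on closed-source, high-quality data and training recipes that are hard to match under resource constraints. The solution is Kaiyuan-2B, a fully open-source 2-billion-parameter model whose pre-training rests on three techniques: Quantile Data Benchmarking, which systematically compares heterogeneous open-source datasets and informs data mixing strategies; Strategic Selective Repetition within a multi-phase paradigm, which makes effective use of sparse, high-quality data; and Multi-Domain Curriculum Training, which orders samples by quality. Together with a highly optimized data preprocessing pipeline and architectural modifications for FP16 stability, the model achieves performance competitive with state-of-the-art fully open-source models under limited resources.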

Link: https://arxiv.org/abs/2512.07612
Authors: Kairong Luo, Zhenbo Sun, Xinyu Shi, Shengqi Chen, Bowen Yu, Yunyi Chen, Chenyi Dang, Hengtao Tao, Hui Wang, Fangming Liu, Kaifeng Lyu, Wenguang Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The rapid advancement of Large Language Models (LLMs) has resulted in a significant knowledge gap between the open-source community and industry, primarily because the latter relies on closed-source, high-quality data and training recipes. To address this, we introduce PCMind-2.1-Kaiyuan-2B, a fully open-source 2-billion-parameter model focused on improving training efficiency and effectiveness under resource constraints. Our methodology includes three key innovations: a Quantile Data Benchmarking method for systematically comparing heterogeneous open-source datasets and providing insights on data mixing strategies; a Strategic Selective Repetition scheme within a multi-phase paradigm to effectively leverage sparse, high-quality data; and a Multi-Domain Curriculum Training policy that orders samples by quality. Supported by a highly optimized data preprocessing pipeline and architectural modifications for FP16 stability, Kaiyuan-2B achieves performance competitive with state-of-the-art fully open-source models, demonstrating practical and scalable solutions for resource-limited pretraining. We release all assets (including model weights, data, and code) under Apache 2.0 license at this https URL.

[NLP-11] Metric-Fair Prompting: Treating Similar Samples Similarly

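[Quick Read]: This paper addresses the lack of individual fairness, meaning that similar questions should be treated consistently, when large language models (LLMs) answer high-stakes clinical multiple-choice questions. The core proposal is Metric-Fair Prompting, a fairness-aware prompting framework. Question similarity is computed from NLP embeddings and items are solved jointly in pairs of similar questions; the prompt also imposes a Lipschitz-style constraint so that similar inputs receive similar confidence scores and hence consistent outputs. On the MedQA (US) benchmark the approach improves accuracy over standard single-item prompting.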

Link: https://arxiv.org/abs/2512.07608
Authors: Jing Wang, Jie Shen, Xing Niu, Tong Zhang, Jeremy Weiss
Affiliations: National Library of Medicine; Stevens Institute of Technology; AWS AI; University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: We introduce Metric-Fair Prompting, a fairness-aware prompting framework that guides large language models (LLMs) to make decisions under metric-fairness constraints. In the application of multiple-choice medical question answering, each (question, option) pair is treated as a binary instance with label +1 (correct) or -1 (incorrect). To promote individual fairness (treating similar instances similarly), we compute question similarity using NLP embeddings and solve items in joint pairs of similar questions rather than in isolation. The prompt enforces a global decision protocol: extract decisive clinical features, map each (question, option) to a score f(x) that acts as confidence, and impose a Lipschitz-style constraint so that similar inputs receive similar scores and, hence, consistent outputs. Evaluated on the MedQA (US) benchmark, Metric-Fair Prompting is shown to improve performance over standard single-item prompting, demonstrating that fairness-guided, confidence-oriented reasoning can enhance LLM accuracy on high-stakes clinical multiple-choice questions.
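The Lipschitz-style condition in the abstract can be written as |f(x) - f(x')| <= L * d(x, x'). In the paper the constraint is expressed in the prompt rather than in code, so the sketch below is only an illustration of the condition itself, with a hypothetical constant `L`, hypothetical confidence scores, and hypothetical pairwise distances.

```python
def lipschitz_violations(scores, distance, pairs, L=1.0):
    """Return the pairs whose score gap exceeds L times their distance,
    i.e. the pairs violating |f(x) - f(x')| <= L * d(x, x')."""
    return [
        (a, b)
        for a, b in pairs
        if abs(scores[a] - scores[b]) > L * distance[(a, b)]
    ]

# Hypothetical confidence scores f(x) for three (question, option) items.
scores = {"q1": 0.90, "q2": 0.85, "q3": 0.10}
# Hypothetical embedding distances d(x, x') between similar items.
distance = {("q1", "q2"): 0.1, ("q1", "q3"): 0.2}

print(lipschitz_violations(scores, distance, list(distance)))
```

Items q1 and q2 are close and receive close scores, so they satisfy the condition; q1 and q3 are almost as close but receive very different scores, which is the kind of inconsistency the constraint is meant to rule out.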

[NLP-12] Complementary Learning Approach for Text Classification using Large Language Models

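[Quick Read]: This paper addresses how to combine the strengths of humans and large language models (LLMs) efficiently and at low cost in quantitative research, offsetting the inherent weaknesses of LLMs while making human-machine collaboration more reliable. The key is a structured methodology that uses chain-of-thought and few-shot prompting to extend best practices for co-author teams in qualitative research to human-machine teams. Humans apply abductive reasoning and natural language to interrogate discrepancies between machine and human ratings, which provides a way to oversee and improve the collaboration.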

Link: https://arxiv.org/abs/2512.07583
Authors: Navid Asgari, Benjamin M. Cole
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 67 pages

Abstract: In this study, we propose a structured methodology that utilizes large language models (LLMs) in a cost-efficient and parsimonious manner, integrating the strengths of scholars and machines while offsetting their respective weaknesses. Our methodology, facilitated through a chain of thought and few-shot learning prompting from computer science, extends best practices for co-author teams in qualitative research to human-machine teams in quantitative research. This allows humans to utilize abductive reasoning and natural language to interrogate not just what the machine has done but also what the human has done. Our method highlights how scholars can manage inherent weaknesses of LLMs using careful, low-cost techniques. We demonstrate how to use the methodology to interrogate human-machine rating discrepancies for a sample of 1,934 press releases announcing pharmaceutical alliances (1990-2017).

[NLP-13] A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification

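[Quick Read]: This paper addresses a difficulty in multimodal fusion: audio sequences are far longer than text sequences, which makes it hard to integrate speech information efficiently into a pre-trained large language model (LLM) fine-tuned for a specific classification task. The key is to take the long, large-vocabulary token sequences produced by an existing speech tokenizer, build a multimodal Bag-of-Words representation, and apply lasso-based feature selection to keep only the audio tokens most important for the task. The language model is then adapted to those tokens with a self-supervised language modeling objective before fine-tuning on the downstream task. The approach outperforms a unimodal model, a larger SpeechLM, and integration of audio via a learned representation, and reaches state-of-the-art results on argumentative fallacy detection and classification; the analysis shows that even random audio token selection helps the unimodal model.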

Link: https://arxiv.org/abs/2512.07571
Authors: Nicolas Calbucura, Valentin Barriere
Affiliations: Universidad de Chile, DCC
Categories: Computation and Language (cs.CL); Multimedia (cs.MM)
Comments:

Abstract: This paper presents a simple method that allows to easily enhance textual pre-trained large language models with speech information, when fine-tuned for a specific classification task. A classical issue with the fusion of many embeddings from audio with text is the large length of the audio sequence compared to the text one. Our method benefits from an existing speech tokenizer trained for Audio Speech Recognition that output long sequences of tokens from a large vocabulary, making it difficult to integrate it at low cost in a large language model. By applying a simple lasso-based feature selection on multimodal Bag-of-Words representation, we retain only the most important audio tokens for the task, and adapt the language model to them with a self-supervised language modeling objective, before fine-tuning it on the downstream task. We show this helps to improve the performances compared to an unimodal model, to a bigger SpeechLM or to integrating audio via a learned representation. We show the effectiveness of our method on two recent Argumentative Fallacy Detection and Classification tasks where the use of audio was believed counterproductive, reaching state-of-the-art results. We also provide an in-depth analysis of the method, showing that even a random audio token selection helps enhancing the unimodal model. Our code is available online (this https URL).

[NLP-14] Toward More Reliable Artificial Intelligence: Reducing Hallucinations in Vision-Language Models

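[Quick Read]: This paper addresses hallucination in vision-language models (VLMs): plausible but incorrect claims about image content. The key is a training-free self-correction framework that refines responses iteratively through uncertainty-guided visual re-attention. It combines multidimensional uncertainty quantification (token entropy, attention dispersion, semantic consistency, and claim confidence) with attention-guided cropping of under-explored image regions, so a frozen VLM can correct its output against visual evidence without any parameter updates. On the POPE and MMHAL BENCH benchmarks the method reduces hallucination rates by 9.8 percentage points compared to the baseline and improves object existence accuracy by 4.7 points on adversarial splits.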

Link: https://arxiv.org/abs/2512.07564
Authors: Kassoum Sanogo, Renzo Ardiccioni
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 24 pages, 3 figures, 2 tables. Training-free self-correction framework for vision-language models. Code and implementation details will be released at: this https URL

Abstract: Vision-language models (VLMs) frequently generate hallucinated content: plausible but incorrect claims about image content. We propose a training-free self-correction framework enabling VLMs to iteratively refine responses through uncertainty-guided visual re-attention. Our method combines multidimensional uncertainty quantification (token entropy, attention dispersion, semantic consistency, claim confidence) with attention-guided cropping of under-explored regions. Operating entirely with frozen, pretrained VLMs, our framework requires no gradient updates. We validate our approach on the POPE and MMHAL BENCH benchmarks using the Qwen2.5-VL-7B [23] architecture. Experimental results demonstrate that our method reduces hallucination rates by 9.8 percentage points compared to the baseline, while improving object existence accuracy by 4.7 points on adversarial splits. Furthermore, qualitative analysis confirms that uncertainty-guided re-attention successfully grounds corrections in visual evidence where standard decoding fails. We validate our approach on Qwen2.5-VL-7B [23], with plans to extend validation across diverse architectures in future versions. We release our code and methodology to facilitate future research in trustworthy multimodal systems.
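Token entropy, the first of the uncertainty signals listed in the abstract, can be sketched as the Shannon entropy of each next-token distribution. The averaging over steps and the example distributions are assumptions made for illustration; the paper's exact aggregation is not given here.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(step_distributions):
    """Average token entropy over the decoding steps of a response."""
    total = sum(token_entropy(p) for p in step_distributions)
    return total / len(step_distributions)

# Hypothetical next-token distributions for a confident and an uncertain response.
confident = [[0.97, 0.02, 0.01], [0.99, 0.005, 0.005]]
uncertain = [[0.40, 0.35, 0.25], [0.34, 0.33, 0.33]]

print(mean_entropy(confident), mean_entropy(uncertain))
```

A response with high average entropy would be a natural candidate for the re-attention step, since the model was spreading probability across many alternatives while generating it.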

[NLP-15] Performance of the SafeTerm AI-Based MedDRA Query System Against Standardised MedDRA Queries

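[Quick Read]: This paper addresses how, in pre-market drug safety review, related adverse event terms can be grouped efficiently and accurately into standard medical term sets (MedDRA SMQs or OCMQs) to support signal detection. The key is the evaluation of the SafeTerm Automated Medical Query (AMQ) system, which embeds medical query terms and MedDRA Preferred Terms (PTs) in a multidimensional vector space and applies cosine similarity and extreme-value clustering to produce a relevance-ranked list of PTs. Using multi-criteria statistical methods and an automated threshold selection, it reaches recall of up to 94% at moderate thresholds and precision of up to 89% at higher thresholds, making it a viable supplementary tool for automated MedDRA query generation.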

Link: https://arxiv.org/abs/2512.07552
Authors: Francois Vandenhende, Anna Georgiou, Michalis Georgiou, Theodoros Psaras, Ellie Karekla, Elena Hadjicosta
Affiliations: ClinBAY Limited (Cyprus)
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 3 figures

Abstract:In pre-market drug safety review, grouping related adverse event terms into SMQs or OCMQs is critical for signal detection. We assess the performance of SafeTerm Automated Medical Query (AMQ) on MedDRA SMQs. The AMQ is a novel quantitative artificial intelligence system that understands and processes medical terminology and automatically retrieves relevant MedDRA Preferred Terms (PTs) for a given input query, ranking them by a relevance score (0-1) using multi-criteria statistical methods. The system (SafeTerm) embeds medical query terms and MedDRA PTs in a multidimensional vector space, then applies cosine similarity and extreme-value clustering to generate a ranked list of PTs. Validation was conducted against tier-1 SMQs (110 queries, v28.1). Precision, recall and F1 were computed at multiple similarity thresholds, defined either manually or using an automated method. High recall (94%) is achieved at moderate similarity thresholds, indicative of good retrieval sensitivity. Higher thresholds filter out more terms, resulting in improved precision (up to 89%). The optimal threshold (0.70) yielded an overall recall of 48% and precision of 45% across all 110 queries. Restricting to narrow-term PTs achieved slightly better performance at an increased (+0.05) similarity threshold, confirming increased relatedness of narrow versus broad terms. The automatic threshold selection (0.66) prioritizes recall (0.58) over precision (0.29). SafeTerm AMQ achieves comparable, satisfactory performance on SMQs and sanitized OCMQs. It is therefore a viable supplementary method for automated MedDRA query generation, balancing recall and precision. We recommend using suitable MedDRA PT terminology in query formulation and applying the automated threshold method to optimise recall. Increasing similarity scores allows refined, narrow terms selection.
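
The retrieval-and-threshold step described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-dimensional vectors, the term names, and the reference set are invented toy values, and the extreme-value clustering and automatic threshold selection used by SafeTerm are not reproduced.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_terms(query_vec, term_vecs):
    """Return (term, score) pairs sorted by decreasing similarity to the query."""
    scored = [(term, cosine(query_vec, vec)) for term, vec in term_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def precision_recall_f1(ranked, reference, threshold):
    """Score the terms retrieved at or above `threshold` against a reference set."""
    retrieved = {term for term, score in ranked if score >= threshold}
    hits = len(retrieved & reference)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical embeddings and reference query (illustration only).
query = [1.0, 0.0]
terms = {"Term A": [1.0, 0.1], "Term B": [0.6, 0.8], "Term C": [0.4, 0.9]}
reference = {"Term A", "Term B"}

ranked = rank_terms(query, terms)
for threshold in (0.3, 0.9):
    print(threshold, precision_recall_f1(ranked, reference, threshold))
```

On this toy data, the lower threshold retrieves every reference term at the cost of one false positive, while the higher threshold retrieves only correct terms but misses one; this is the same recall and precision trade-off the abstract reports.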

[NLP-16] MoCoRP: Modeling Consistent Relations between Persona and Response for Persona-based Dialogue

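[Quick read]: This paper addresses the difficulty persona-based dialogue systems have in capturing and using persona information to produce coherent, context-specific, and engaging responses. Existing datasets lack explicit annotations of the relation between persona sentences and responses, so models struggle to learn how the two are logically connected. The key to the solution is MoCoRP (Modeling Consistent Relations between Persona and Response), a framework that uses a natural language inference (NLI) expert to explicitly extract the relation between each persona sentence and the response (such as entailment, contradiction, or neutral), making the language model more aware of persona information. The framework is applied to pre-trained models such as BART and extended to large language models (LLMs) through alignment tuning. Experiments on the public datasets ConvAI2 and MPChat show that it outperforms existing baselines in persona consistency and dialogue quality.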

Link: https://arxiv.org/abs/2512.07544
Authors: Kyungro Lee, Dongha Choi, Hyunju Lee
Affiliations: GIST (Gwangju Institute of Science and Technology)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages

Abstract:As dialogue systems become increasingly important across various domains, a key challenge in persona-based dialogue is generating engaging and context-specific interactions while ensuring the model acts with a coherent personality. However, existing persona-based dialogue datasets lack explicit relations between persona sentences and responses, which makes it difficult for models to effectively capture persona information. To address these issues, we propose MoCoRP (Modeling Consistent Relations between Persona and Response), a framework that incorporates explicit relations into language models. MoCoRP leverages an NLI expert to explicitly extract the NLI relations between persona sentences and responses, enabling the model to effectively incorporate appropriate persona information from the context into its responses. We applied this framework to pre-trained models like BART and further extended it to modern large language models (LLMs) through alignment tuning. Experimental results on the public datasets ConvAI2 and MPChat demonstrate that MoCoRP outperforms existing baselines, achieving superior persona consistency and engaging, context-aware dialogue generation. Furthermore, our model not only excels in quantitative metrics but also shows significant improvements in qualitative aspects. These results highlight the effectiveness of explicitly modeling persona-response relations in persona-based dialogue. The source codes of MoCoRP are available at this https URL.

[NLP-17] Most over-representation of phonological features in basic vocabulary disappears when controlling for spatial and phylogenetic effects

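[Quick read]: This paper examines whether the statistical over-representation of phonological features in basic vocabulary really reflects universal sound symbolic patterns. Earlier studies on this question have reproducibility problems and potential biases, in particular inadequate control for genealogical (phylogenetic) and areal dependencies between languages. The key to the solution is to build a new sample of 2864 languages from Lexibank and to extend the original model with statistical controls for spatial and phylogenetic dependencies, giving a more rigorous test of how robust the patterns are. Most previously reported patterns do not survive these controls, while a small number remain highly stable, which underlines the need to test universal claims about language for robustness at several levels.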

Link: https://arxiv.org/abs/2512.07543
Authors: Frederic Blum
Affiliations: Max-Planck Institute for Evolutionary Anthropology; University of Passau
Subjects: Computation and Language (cs.CL)
Comments: Accepted with minor revisions at Linguistic Typology, expected to be fully published in 2026

Abstract:The statistical over-representation of phonological features in the basic vocabulary of languages is often interpreted as reflecting potentially universal sound symbolic patterns. However, most of those results have not been tested explicitly for reproducibility and might be prone to biases in the study samples or models. Many studies on the topic do not adequately control for genealogical and areal dependencies between sampled languages, casting doubts on the robustness of the results. In this study, we test the robustness of a recent study on sound symbolism of basic vocabulary concepts which analyzed 245 languages. The new sample includes data on 2864 languages from Lexibank. We modify the original model by adding statistical controls for spatial and phylogenetic dependencies between languages. The new results show that most of the previously observed patterns are not robust, and in fact many patterns disappear completely when adding the genealogical and areal controls. A small number of patterns, however, emerges as highly stable even with the new sample. Through the new analysis, we are able to assess the distribution of sound symbolism on a larger scale than previously. The study further highlights the need for testing all universal claims on language for robustness on various levels.

[NLP-18] Minimum Bayes Risk Decoding for Error Span Detection in Reference-Free Automatic Machine Translation Evaluation

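[Quick read]: This paper addresses the performance bottleneck caused by Maximum a Posteriori (MAP) decoding in generative Error Span Detection (ESD). MAP decoding assumes that model-estimated probability is perfectly correlated with similarity to the human annotation, yet in practice annotations that differ substantially from the human annotation can receive a higher model likelihood, which hurts detection accuracy. The key to the solution is Minimum Bayes Risk (MBR) decoding: sentence-level and span-level similarity metrics serve as utility functions to select, from the candidate hypotheses, the output expected to be closest to the human annotation, improving results at the system, sentence, and span levels. To reduce the computational cost of MBR decoding, the authors further apply MBR distillation, which lets a standard greedy-decoding model match MBR decoding performance and removes the inference-time latency bottleneck.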

Link: https://arxiv.org/abs/2512.07540
Authors: Boxuan Lyu, Haiyue Song, Hidetaka Kamigaito, Chenchen Ding, Hideki Tanaka, Masao Utiyama, Kotaro Funakoshi, Manabu Okumura
Affiliations: Institute of Science Tokyo; National Institute of Information and Communications Technology; Nara Institute of Science and Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Error Span Detection (ESD) is a subtask of automatic machine translation evaluation that localizes error spans in translations and labels their severity. State-of-the-art generative ESD methods typically decode using Maximum a Posteriori (MAP), assuming that model-estimated probabilities are perfectly correlated with similarity to human annotation. However, we observed that annotations dissimilar to the human annotation could achieve a higher model likelihood than the human annotation. We address this issue by applying Minimum Bayes Risk (MBR) decoding to generative ESD models. Specifically, we employ sentence- and span-level similarity metrics as utility functions to select candidate hypotheses based on their approximate similarity to the human annotation. Extensive experimental results show that our MBR decoding outperforms the MAP baseline at the system, sentence, and span-levels. Furthermore, to mitigate the computational cost of MBR decoding, we demonstrate that applying MBR distillation enables a standard greedy model to match MBR decoding performance, effectively eliminating the inference-time latency bottleneck.
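
The selection rule behind MBR decoding can be sketched as follows. This is a generic Python illustration rather than the paper's code: representing an annotation as a set of `(start, end, severity)` tuples and scoring with an exact-match span F1 are assumptions, and the paper's own sentence-level and span-level utilities may differ.

```python
def span_f1(a, b):
    """F1 overlap between two annotations, each a set of (start, end, severity)."""
    if not a and not b:
        return 1.0
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(a)
    recall = overlap / len(b)
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, utility):
    """Pick the candidate with the highest average utility against the others.

    The other sampled candidates act as pseudo-references, so no human
    annotation is needed at inference time. Requires at least two candidates.
    """
    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best_index = max(range(len(candidates)), key=expected_utility)
    return candidates[best_index]


# Hypothetical sampled annotations for one translation (illustration only).
samples = [
    frozenset({(0, 3, "major")}),
    frozenset({(0, 3, "major"), (7, 9, "minor")}),
    frozenset({(0, 3, "major"), (7, 9, "minor")}),
    frozenset(),  # "no errors" hypothesis
]
print(sorted(mbr_select(samples, span_f1)))
```

The point of the design is that the chosen output is the one most in agreement with the sampled hypotheses, not the one with the highest model likelihood, which is what MAP decoding would return.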

[NLP-19] SwissGov-RSD: A Human-annotated Cross-lingual Benchmark for Token-level Recognition of Semantic Differences Between Related Documents

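[Quick read]: This paper addresses the lack of high-quality annotated data and systematic evaluation for cross-lingual, document-level recognition of semantic differences in naturalistic settings. Existing work on text generation evaluation and multilingual content alignment relies mostly on monolingual, sentence-level, or synthetic benchmarks, which do not capture the fine-grained semantic differences between documents in different languages. The key to the solution is SwissGov-RSD, the first naturalistic, document-level, cross-lingual dataset for semantic difference recognition. It contains 224 multi-parallel documents in English-German, English-French, and English-Italian, with token-level difference annotations by human annotators. The dataset provides a new benchmark for evaluating large language models (LLMs) and encoder models, and reveals a considerable performance gap for current approaches in realistic multilingual settings.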

Link: https://arxiv.org/abs/2512.07538
Authors: Michelle Wastl, Jannis Vamvas, Rico Sennrich
Affiliations: University of Zurich
Subjects: Computation and Language (cs.CL)
Comments: 30 pages

Abstract:Recognizing semantic differences across documents, especially in different languages, is crucial for text generation evaluation and multilingual content alignment. However, as a standalone task it has received little attention. We address this by introducing SwissGov-RSD, the first naturalistic, document-level, cross-lingual dataset for semantic difference recognition. It encompasses a total of 224 multi-parallel documents in English-German, English-French, and English-Italian with token-level difference annotations by human annotators. We evaluate a variety of open-source and closed source large language models as well as encoder models across different fine-tuning settings on this new benchmark. Our results show that current automatic approaches perform poorly compared to their performance on monolingual, sentence-level, and synthetic benchmarks, revealing a considerable gap for both LLMs and encoder models. We make our code and datasets publicly available.

[NLP-20] Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs

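[Quick read]: This paper addresses the loss of positional information in standard Rotary Position Embeddings (RoPE) for Large Language Models (LLMs): attention scores use only the real part of the complex-valued dot product, which weakens the modeling of long-context dependencies. The key to the solution is to re-incorporate the discarded imaginary component and build a dual-component attention score from the full complex-valued representation, preserving more phase information. Theoretical analysis and experiments indicate that the method consistently improves over standard RoPE on a suite of long-context language modeling benchmarks, with larger gains as context length increases.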

Link: https://arxiv.org/abs/2512.07525
Authors: Xiaoran Liu, Yuerong Song, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Zhaoxiang Liu, Shiguo Lian, Ziwei He, Xipeng Qiu
Affiliations: Fudan University; Shanghai Innovation Institute; Shanghai AI Lab; China Unicom
Subjects: Computation and Language (cs.CL)
Comments: 20 pages, 6 figures, under review

Abstract:Rotary Position Embeddings (RoPE) have become a standard for encoding sequence order in Large Language Models (LLMs) by applying rotations to query and key vectors in the complex plane. Standard implementations, however, utilize only the real component of the complex-valued dot product for attention score calculation. This simplification discards the imaginary component, which contains valuable phase information, leading to a potential loss of relational details crucial for modeling long-context dependencies. In this paper, we propose an extension that re-incorporates this discarded imaginary component. Our method leverages the full complex-valued representation to create a dual-component attention score. We theoretically and empirically demonstrate that this approach enhances the modeling of long-context dependencies by preserving more positional information. Furthermore, evaluations on a suite of long-context language modeling benchmarks show that our method consistently improves performance over the standard RoPE, with the benefits becoming more significant as context length increases. The code is available at this https URL.
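
To make the real and imaginary components concrete, the sketch below treats each pair of query or key dimensions as one complex number, which is the usual complex-plane view of RoPE. It is an illustration only: the frequencies and vectors are arbitrary toy values, and how the paper combines the two components into its dual-component attention score is not specified in the abstract, so that step is not shown.

```python
import cmath


def rope_rotate(vec, position, thetas):
    """Rotate each complex coordinate by the angle position * theta."""
    return [z * cmath.exp(1j * position * t) for z, t in zip(vec, thetas)]


def complex_score(q, k, q_pos, k_pos, thetas):
    """Full complex dot product of a rotated query with a rotated key.

    The real part is the logit standard RoPE uses; the imaginary part is
    the component that standard implementations discard.
    """
    q_rot = rope_rotate(q, q_pos, thetas)
    k_rot = rope_rotate(k, k_pos, thetas)
    return sum(a * b.conjugate() for a, b in zip(q_rot, k_rot))


# Arbitrary toy values (illustration only).
thetas = [1.0, 0.1]
q = [1 + 2j, 0.5 - 1j]
k = [0.3 + 0.7j, -1 + 0.2j]

near = complex_score(q, k, 5, 2, thetas)
far = complex_score(q, k, 13, 10, thetas)
print(near.real, near.imag)
print(far.real, far.imag)
```

Both positions pairs have the same offset of 3, and both components come out the same, which shows that the imaginary part also depends only on relative position and therefore carries usable positional signal.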

[NLP-21] LIME: Making LLM Data More Efficient with Linguistic Metadata Embeddings

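[Quick read]: This paper addresses the problem that pre-training decoder-only language models depends on large amounts of high-quality data while the supply of such data is reaching its limits. Metadata is widely used to create and curate datasets, but it has not been exploited as a direct training signal. The key to the solution is LIME (Linguistic Metadata Embeddings), which enriches token embeddings with metadata capturing syntactic, semantic, and contextual properties. With only 0.01% additional parameters and negligible compute overhead, LIME adapts to the training data distribution up to 56% faster, improves tokenization, and strengthens language modeling and generative task performance. A variant with shifted metadata, LIME+1, uses prior metadata for the next token to guide generation and improves reasoning performance by up to 38% and arithmetic accuracy by up to 35%.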

Link: https://arxiv.org/abs/2512.07522
Authors: Sebastian Sztwiertnia, Felix Friedrich, Kristian Kersting, Patrick Schramowski, Björn Deiseroth
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Pre-training decoder-only language models relies on vast amounts of high-quality data, yet the availability of such data is increasingly reaching its limits. While metadata is commonly used to create and curate these datasets, its potential as a direct training signal remains under-explored. We challenge this status quo and propose LIME (Linguistic Metadata Embeddings), a method that enriches token embeddings with metadata capturing syntax, semantics, and contextual properties. LIME substantially improves pre-training efficiency. Specifically, it adapts up to 56% faster to the training data distribution, while introducing only 0.01% additional parameters at negligible compute overhead. Beyond efficiency, LIME improves tokenization, leading to remarkably stronger language modeling capabilities and generative task performance. These benefits persist across model scales (500M to 2B). In addition, we develop a variant with shifted metadata, LIME+1, that can guide token generation. Given prior metadata for the next token, LIME+1 improves reasoning performance by up to 38% and arithmetic accuracy by up to 35%.

[NLP-22] SPAD: Seven-Source Token Probability Attribution with Syntactic Aggregation for Detecting Hallucinations in RAG

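[Quick read]: This paper addresses hallucination detection in Retrieval-Augmented Generation (RAG). Prior approaches attribute hallucinations to a binary conflict between internal knowledge (stored in feed-forward networks, FFNs) and retrieved context, which ignores other components of the generative process such as the user query, previously generated tokens, the current token itself, and the final LayerNorm adjustment. The key to the solution is SPAD, which mathematically attributes each token's probability to seven distinct sources: Query, RAG, Past, Current Token, FFN, Final LayerNorm, and Initial Embedding. The per-source contributions are then aggregated by part-of-speech (POS) tag to expose anomalous patterns (for example, nouns relying on Final LayerNorm), which are used to detect hallucinations. Experiments show that SPAD reaches state-of-the-art performance.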

Link: https://arxiv.org/abs/2512.07515
Authors: Pengqian Lu, Jie Lu, Anjin Liu, Guangquan Zhang
Affiliations: Australian Artificial Intelligence Institute; University of Technology Sydney
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Detecting hallucinations in Retrieval-Augmented Generation (RAG) remains a challenge. Prior approaches attribute hallucinations to a binary conflict between internal knowledge (stored in FFNs) and retrieved context. However, this perspective is incomplete, failing to account for the impact of other components in the generative process, such as the user query, previously generated tokens, the current token itself, and the final LayerNorm adjustment. To address this, we introduce SPAD. First, we mathematically attribute each token’s probability into seven distinct sources: Query, RAG, Past, Current Token, FFN, Final LayerNorm, and Initial Embedding. This attribution quantifies how each source contributes to the generation of the current token. Then, we aggregate these scores by POS tags to quantify how different components drive specific linguistic categories. By identifying anomalies, such as Nouns relying on Final LayerNorm, SPAD effectively detects hallucinations. Extensive experiments demonstrate that SPAD achieves state-of-the-art performance.
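
The second step, aggregating per-token attribution scores by POS tag, can be sketched as below. This is a minimal Python illustration under stated assumptions: the attribution scores are invented toy numbers (the attribution mathematics itself is not reproduced), the mean is one plausible aggregation, and the function name is hypothetical.

```python
from collections import defaultdict

# The seven sources named in the abstract.
SOURCES = (
    "Query", "RAG", "Past", "Current Token",
    "FFN", "Final LayerNorm", "Initial Embedding",
)


def aggregate_by_pos(token_records):
    """Average each source's attribution over all tokens sharing a POS tag.

    token_records: iterable of (pos_tag, {source: score}) pairs; sources
    missing from a token's dict are treated as contributing 0.0.
    """
    sums = defaultdict(lambda: dict.fromkeys(SOURCES, 0.0))
    counts = defaultdict(int)
    for pos, scores in token_records:
        counts[pos] += 1
        for source in SOURCES:
            sums[pos][source] += scores.get(source, 0.0)
    return {
        pos: {source: total / counts[pos] for source, total in by_source.items()}
        for pos, by_source in sums.items()
    }


# Hypothetical per-token attributions (illustration only).
records = [
    ("NOUN", {"RAG": 0.6, "FFN": 0.2}),
    ("NOUN", {"RAG": 0.2, "Final LayerNorm": 0.4}),
    ("VERB", {"Past": 0.5}),
]
profile = aggregate_by_pos(records)
print(profile["NOUN"]["RAG"], profile["NOUN"]["Final LayerNorm"])
```

The resulting per-tag profile could then serve as a feature vector for a downstream detector; which detector the authors use is not stated in the abstract.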

[NLP-23] Enhancing Agentic RL with Progressive Reward Shaping and Value-based Sampling Policy Optimization

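[Quick read]: This paper addresses inefficient training of Large Language Models (LLMs) on complex, long-horizon reasoning tasks under Tool-Integrated Reasoning (TIR). Two challenges are targeted: (1) sparse, non-instructive rewards (such as binary verifiable signals) give little guidance for intermediate steps and slow convergence; (2) in Group Relative Policy Optimization (GRPO), identical rewards within a rollout group yield zero advantage, causing gradient degradation that reduces sample efficiency and destabilizes training. The key to the solution is two complementary techniques. Progressive Reward Shaping (PRS) is a curriculum-inspired reward design with dense, stage-wise feedback that leads the model first to master parseable, properly formatted tool calls and then to optimize factual correctness and answer quality. Value-based Sampling Policy Optimization (VSPO) is a GRPO variant that replaces low-value samples with prompts selected by a task-value metric and applies value-smoothing clipping to stabilize gradient updates. In the experiments, PRS consistently outperforms traditional binary rewards, and VSPO achieves better stability, faster convergence, and higher final performance than PPO, GRPO, CISPO, and SFT-only baselines; together they yield LLM-based TIR agents that generalize better across domains.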

Link: https://arxiv.org/abs/2512.07478
Authors: Zhuoran Zhuang, Ye Chen, Jianghao Su, Chao Luo, Luhui Liu, Xia Zeng
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) empowered with Tool-Integrated Reasoning (TIR) can iteratively plan, call external tools, and integrate returned information to solve complex, long-horizon reasoning tasks. Agentic Reinforcement Learning (Agentic RL) optimizes such models over full tool-interaction trajectories, but two key challenges hinder effectiveness: (1) Sparse, non-instructive rewards, such as binary 0-1 verifiable signals, provide limited guidance for intermediate steps and slow convergence; (2) Gradient degradation in Group Relative Policy Optimization (GRPO), where identical rewards within a rollout group yield zero advantage, reducing sample efficiency and destabilizing training. To address these challenges, we propose two complementary techniques: Progressive Reward Shaping (PRS) and Value-based Sampling Policy Optimization (VSPO). PRS is a curriculum-inspired reward design that introduces dense, stage-wise feedback - encouraging models to first master parseable and properly formatted tool calls, then optimize for factual correctness and answer quality. We instantiate PRS for short-form QA (with a length-aware BLEU to fairly score concise answers) and long-form QA (with LLM-as-a-Judge scoring to prevent reward hacking). VSPO is an enhanced GRPO variant that replaces low-value samples with prompts selected by a task-value metric balancing difficulty and uncertainty, and applies value-smoothing clipping to stabilize gradient updates. Experiments on multiple short-form and long-form QA benchmarks show that PRS consistently outperforms traditional binary rewards, and VSPO achieves superior stability, faster convergence, and higher final performance compared to PPO, GRPO, CISPO, and SFT-only baselines. Together, PRS and VSPO yield LLM-based TIR agents that generalize better across domains.
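
The zero-advantage problem named in challenge (2) is easy to reproduce numerically. The sketch below uses the commonly described GRPO normalisation (reward minus group mean, divided by group standard deviation); the `eps` value and the use of the population standard deviation are assumptions, and the shaped reward values are invented for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each rollout's reward against its own rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# All rollouts for one prompt succeed: every advantage is zero, so this
# group contributes no policy gradient.
print(group_relative_advantages([1, 1, 1, 1]))

# Mixed binary outcomes give a usable signal.
print(group_relative_advantages([0, 0, 1, 1]))

# Dense, stage-wise rewards (hypothetical values) separate rollouts that a
# binary reward would have scored identically.
print(group_relative_advantages([0.2, 0.5, 0.7, 1.0]))
```

This is why both remedies in the abstract target reward diversity within a group: PRS makes identical rewards less likely, and VSPO swaps out samples that would contribute nothing.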

[NLP-24] Living the Novel: A System for Generating Self-Training Timeline-Aware Conversational Agents from Novels

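[Quick read]: This paper addresses two core problems of LLM-driven characters in literary works. The first is persona drift: generic LLMs often fail to stay in character. The second is that agents act beyond the constraints of the story's world and logic, leading to narrative incoherence (spoiler leakage) and robustness failures (frame-breaking). The authors propose a two-stage training pipeline. The Deep Persona Alignment (DPA) stage uses data-free reinforcement fine-tuning to improve character fidelity. The Coherence and Robustness Enhancing (CRE) stage uses a story-time-aware knowledge graph and a second, retrieval-grounded training pass to enforce narrative constraints architecturally. The key idea is to combine character-first self-training with explicit story-time constraints, yielding a faithful and robust multi-character conversational experience.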

Link: https://arxiv.org/abs/2512.07474
Authors: Yifei Huang, Tianyu Yan, Sitong Gong, Xiwei Gao, Caixin Kang, Ruicong Liu, Huchuan Lu, Bo Zheng
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments:

Abstract:We present the Living Novel, an end-to-end system that transforms any literary work into an immersive, multi-character conversational experience. This system is designed to solve two fundamental challenges for LLM-driven characters. Firstly, generic LLMs suffer from persona drift, often failing to stay in character. Secondly, agents often exhibit abilities that extend beyond the constraints of the story’s world and logic, leading to both narrative incoherence (spoiler leakage) and robustness failures (frame-breaking). To address these challenges, we introduce a novel two-stage training pipeline. Our Deep Persona Alignment (DPA) stage uses data-free reinforcement finetuning to instill deep character fidelity. Our Coherence and Robustness Enhancing (CRE) stage then employs a story-time-aware knowledge graph and a second retrieval-grounded training pass to architecturally enforce these narrative constraints. We validate our system through a multi-phase evaluation using Jules Verne’s Twenty Thousand Leagues Under the Sea. A lab study with a detailed ablation of system components is followed by a 5-day in-the-wild diary study. Our DPA pipeline helps our specialized model outperform GPT-4o on persona-specific metrics, and our CRE stage achieves near-perfect performance in coherence and robustness measures. Our study surfaces practical design guidelines for AI-driven narrative systems: we find that character-first self-training is foundational for believability, while explicit story-time constraints are crucial for sustaining coherent, interruption-resilient mobile-web experiences.

[NLP-25] Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning

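[Quick read]: This paper addresses the inefficiency of Large Language Models (LLMs) that rely on sequential autoregressive decoding during reasoning, which limits parallel reasoning on complex tasks. The key to the solution is Native Parallel Reasoner (NPR), a teacher-free framework that moves the model from sequential emulation to native parallel cognition through three innovations: (1) a self-distilled progressive training paradigm that transitions from cold-start format discovery to strict topological constraints without external supervision; (2) a Parallel-Aware Policy Optimization (PAPO) algorithm that optimizes branching policies directly within the execution graph, so the model learns adaptive decomposition by trial and error; and (3) an NPR Engine that refactors the memory management and flow control of SGLang to enable stable, large-scale parallel RL training. Across eight reasoning benchmarks, NPR trained on Qwen3-4B achieves performance gains of up to 24.5% and inference speedups of up to 4.6x, with 100% genuine parallel execution.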

Link: https://arxiv.org/abs/2512.07461
Authors: Tong Wu, Yang Liu, Jun Bai, Zixia Jia, Shuyi Zhang, Ziyong Lin, Yanting Wang, Song-Chun Zhu, Zilong Zheng
Affiliations: NLCo Lab, Beijing Institute for General Artificial Intelligence (BIGAI)
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We introduce Native Parallel Reasoner (NPR), a teacher-free framework that enables Large Language Models (LLMs) to self-evolve genuine parallel reasoning capabilities. NPR transforms the model from sequential emulation to native parallel cognition through three key innovations: 1) a self-distilled progressive training paradigm that transitions from "cold-start" format discovery to strict topological constraints without external supervision; 2) a novel Parallel-Aware Policy Optimization (PAPO) algorithm that optimizes branching policies directly within the execution graph, allowing the model to learn adaptive decomposition via trial and error; and 3) a robust NPR Engine that refactors memory management and flow control of SGLang to enable stable, large-scale parallel RL training. Across eight reasoning benchmarks, NPR trained on Qwen3-4B achieves performance gains of up to 24.5% and inference speedups up to 4.6x. Unlike prior baselines that often fall back to autoregressive decoding, NPR demonstrates 100% genuine parallel execution, establishing a new standard for self-evolving, efficient, and scalable agentic reasoning.

[NLP-26] Persian-Phi: Efficient Cross-Lingual Adaptation of Compact LLMs via Curriculum Learning

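[Quick read]: This paper addresses the high computational cost of training large language models (LLMs) for low-resource languages, which hinders the democratization of AI. The core solution is a resource-efficient curriculum learning pipeline: a warm-up stage on bilingual narratives (Tiny Stories) aligns the embeddings, followed by continual pretraining and instruction tuning with Parameter-Efficient Fine-Tuning (PEFT). With this pipeline, Microsoft Phi-3 Mini, originally a monolingual English model, is adapted into Persian-Phi, a 3.8B-parameter Persian model with competitive results on the Open Persian LLM Leaderboard. The approach does not require a multilingual base model or a very large model size, supporting the feasibility of extending state-of-the-art LLMs to underrepresented languages with limited hardware.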

Link: https://arxiv.org/abs/2512.07454
Authors: Amir Mohammad Akhlaghi, Amirhossein Shabani, Mostafa Abdolmaleki, Saeed Reza Kheradpisheh
Affiliations: Shahid Beheshti University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The democratization of AI is currently hindered by the immense computational costs required to train Large Language Models (LLMs) for low-resource languages. This paper presents Persian-Phi, a 3.8B parameter model that challenges the assumption that robust multilingual capabilities require massive model sizes or multilingual baselines. We demonstrate how Microsoft Phi-3 Mini – originally a monolingual English model – can be effectively adapted to Persian through a novel, resource-efficient curriculum learning pipeline. Our approach employs a unique “warm-up” stage using bilingual narratives (Tiny Stories) to align embeddings prior to heavy training, followed by continual pretraining and instruction tuning via Parameter-Efficient Fine-Tuning (PEFT). Despite its compact size, Persian-Phi achieves competitive results on Open Persian LLM Leaderboard in HuggingFace. Our findings provide a validated, scalable framework for extending the reach of state-of-the-art LLMs to underrepresented languages with minimal hardware resources. The Persian-Phi model is publicly available at this https URL.

[NLP-27] Training Language Models to Use Prolog as a Tool

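[Quick read]: This paper addresses unreliable reasoning in tool-using agentic AI systems: language models often produce plausible but incorrect solutions that are difficult to verify. The key to the solution is to ground model reasoning in a formal verification system, Prolog. Using Group Relative Policy Optimization (GRPO), the authors fine-tune Qwen2.5-3B-Instruct to use Prolog as an external tool for verifiable computation, while jointly varying prompt structure, reward composition (execution, syntax, semantics, structure), and inference protocol (single-shot, best-of-N, and two agentic modes). In the experiments, best-of-N with external Prolog verification maximizes accuracy on GSM8K, and agentic inference with internal repair gives the best zero-shot generalization on MMLU-Stem and MMLU-Pro, improving reliability and auditability for safety-critical applications.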

Link: https://arxiv.org/abs/2512.07407
Authors: Niklas Mellgren, Peter Schneider-Kamp, Lukas Galke Poech
Affiliations: University of Southern Denmark
Subjects: Computation and Language (cs.CL)
Comments: 10 pages

Abstract:Ensuring reliable tool use is critical for safe agentic AI systems. Language models frequently produce unreliable reasoning with plausible but incorrect solutions that are difficult to verify. To address this, we investigate fine-tuning models to use Prolog as an external tool for verifiable computation. Using Group Relative Policy Optimization (GRPO), we fine-tune Qwen2.5-3B-Instruct on a cleaned GSM8K-Prolog-Prover dataset while varying (i) prompt structure, (ii) reward composition (execution, syntax, semantics, structure), and (iii) inference protocol: single-shot, best-of-N, and two agentic modes where Prolog is invoked internally or independently. Our reinforcement learning approach outperforms supervised fine-tuning, with our 3B model achieving zero-shot MMLU performance comparable to 7B few-shot results. Our findings reveal that: 1) joint tuning of prompt, reward, and inference shapes program syntax and logic; 2) best-of-N with external Prolog verification maximizes accuracy on GSM8K; 3) agentic inference with internal repair yields superior zero-shot generalization on MMLU-Stem and MMLU-Pro. These results demonstrate that grounding model reasoning in formal verification systems substantially improves reliability and auditability for safety-critical applications. The source code for reproducing our experiments is available under this https URL

[NLP-28] LUNE: Efficient LLM Unlearning via LoRA Fine-Tuning with Negative Examples

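[Quick read]: This paper addresses the difficulty of removing specific knowledge from Large Language Models (LLMs), which is needed for privacy protection, bias mitigation, and knowledge correction. Traditional approaches rely on computationally expensive fine-tuning or direct weight editing and are hard to deploy in practice. The key to the solution is a lightweight framework, LoRA-based Unlearning with Negative Examples (LUNE), which updates only the Low-Rank Adaptation (LoRA) adapters while freezing the backbone, localizing edits and avoiding disruptive global changes. Through LoRA, LUNE targets intermediate representations to suppress or replace the requested knowledge with about an order of magnitude less compute and memory than full fine-tuning or direct weight editing, while achieving comparable effectiveness.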

Link: https://arxiv.org/abs/2512.07375
Authors: Yezi Liu, Hanning Chen, Wenjun Huang, Yang Ni, Mohsen Imani
Affiliations: University of California, Irvine; Purdue University Northwest
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) possess vast knowledge acquired from extensive training corpora, but they often cannot remove specific pieces of information when needed, which makes it hard to handle privacy, bias mitigation, and knowledge correction. Traditional model unlearning approaches require computationally expensive fine-tuning or direct weight editing, making them impractical for real-world deployment. In this work, we introduce LoRA-based Unlearning with Negative Examples (LUNE), a lightweight framework that performs negative-only unlearning by updating only low-rank adapters while freezing the backbone, thereby localizing edits and avoiding disruptive global changes. Leveraging Low-Rank Adaptation (LoRA), LUNE targets intermediate representations to suppress (or replace) requested knowledge with an order-of-magnitude lower compute and memory than full fine-tuning or direct weight editing. Extensive experiments on multiple factual unlearning tasks show that LUNE: (I) achieves effectiveness comparable to full fine-tuning and memory-editing methods, and (II) reduces computational cost by about an order of magnitude.
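
The reason adapter-only updates are cheap can be shown with a generic LoRA layer. The sketch below is a plain-Python illustration of the standard low-rank update, not the LUNE training code: the negative-example objective is not shown, and the layer size (4096 by 4096) and rank (8) are assumed values chosen only to make the parameter count concrete.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, scale=1.0):
    """Compute y = W x + scale * B (A x).

    W (d_out x d_in) is the frozen backbone weight; only A (r x d_in) and
    B (d_out x r) would be updated during unlearning.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def trainable_params(d_in, d_out, rank):
    """Number of adapter parameters for one weight matrix."""
    return rank * (d_in + d_out)


# With B initialised to zero the adapted layer reproduces the frozen layer.
W = [[1, 0], [0, 1]]
A = [[1, 1]]
print(lora_forward([2, 3], W, A, B=[[0], [0]]))

# A non-zero B shifts the output without touching W.
print(lora_forward([2, 3], W, A, B=[[1], [0]]))

# Assumed layer size and rank, for the parameter comparison only.
print(4096 * 4096, trainable_params(4096, 4096, 8))
```

Because edits live entirely in `A` and `B`, they can also be removed or swapped without modifying the backbone.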

[NLP-29] Recover-to-Forget: Gradient Reconstruction from LoRA for Efficient LLM Unlearning

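[Quick read]: This paper addresses efficient unlearning in Large Language Models (LLMs): removing or correcting specific knowledge without retraining the whole model or accessing the original training data. Existing methods usually depend on full-model fine-tuning or the original data, which limits scalability and practicality. The key to the solution is the Recover-to-Forget (R2F) framework, which reconstructs full-model gradient directions from low-rank LoRA (Low-Rank Adaptation) adapter updates: gradients with respect to the LoRA parameters are computed on multiple paraphrased prompts, and a gradient decoder is trained to approximate the corresponding full-model gradients. The decoder is trained on a proxy model and transferred to target models, giving a lightweight, scalable unlearning mechanism that preserves general model performance.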

Link: https://arxiv.org/abs/2512.07374
Authors: Yezi Liu, Hanning Chen, Wenjun Huang, Yang Ni, Mohsen Imani
Affiliations: University of California, Irvine; Purdue University Northwest
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Unlearning in large foundation models (e.g., LLMs) is essential for enabling dynamic knowledge updates, enforcing data deletion rights, and correcting model behavior. However, existing unlearning methods often require full-model fine-tuning or access to the original training data, which limits their scalability and practicality. In this work, we introduce Recover-to-Forget (R2F), a novel framework for efficient unlearning in LLMs based on reconstructing full-model gradient directions from low-rank LoRA adapter updates. Rather than performing backpropagation through the full model, we compute gradients with respect to LoRA parameters using multiple paraphrased prompts and train a gradient decoder to approximate the corresponding full-model gradients. To ensure applicability to larger or black-box models, the decoder is trained on a proxy model and transferred to target models. We provide a theoretical analysis of cross-model generalization and demonstrate that our method achieves effective unlearning while preserving general model performance. Experimental results demonstrate that R2F offers a scalable and lightweight alternative for unlearning in pretrained LLMs without requiring full retraining or access to internal parameters.

[NLP-30] Multilingual corpora for the study of new concepts in the social sciences and humanities:

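[Quick read]: This paper addresses how to build a reproducible and extensible multilingual corpus for studying emerging concepts in the humanities and social sciences (HSS), illustrated with "non-technological innovation". The main challenge is extracting high-quality, relevant data from unstructured text and turning it into annotated datasets suitable for natural language processing (NLP). The key to the solution is a hybrid methodology: text is automatically extracted from company websites and cleaned for French and English, and annual reports are collected and filtered by documentary criteria (year, format, duplication). The pipeline includes language detection, filtering of non-relevant content, extraction of relevant segments, and enrichment with structural metadata. For each occurrence of a term from the expert lexicon, a contextual block of five sentences is extracted (the sentence containing the term plus the two preceding and two following it) and labeled with the term's thematic category, producing data suitable for supervised classification.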

Link: https://arxiv.org/abs/2512.07367
Authors: Revekka Kyriakoglou (LIASD), Anna Pappa (LIASD)
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: In French

Abstract:This article presents a hybrid methodology for building a multilingual corpus designed to support the study of emerging concepts in the humanities and social sciences (HSS), illustrated here through the case of "non-technological innovation". The corpus relies on two complementary sources: (1) textual content automatically extracted from company websites, cleaned for French and English, and (2) annual reports collected and automatically filtered according to documentary criteria (year, format, duplication). The processing pipeline includes automatic language detection, filtering of non-relevant content, extraction of relevant segments, and enrichment with structural metadata. From this initial corpus, a derived dataset in English is created for machine learning purposes. For each occurrence of a term from the expert lexicon, a contextual block of five sentences is extracted (two preceding and two following the sentence containing the term). Each occurrence is annotated with the thematic category associated with the term, enabling the construction of data suitable for supervised classification tasks. This approach results in a reproducible and extensible resource, suitable both for analyzing lexical variability around emerging concepts and for generating datasets dedicated to natural language processing applications.
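
The five-sentence context extraction described in this abstract can be sketched as follows. This is an illustrative Python example under stated assumptions: the regular-expression sentence splitter is deliberately naive, and the lexicon entry, category label, and sample text are hypothetical, not taken from the paper's corpus.

```python
import re


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def context_blocks(text, lexicon, window=2):
    """Build one labeled block per lexicon-term occurrence.

    Each block holds the matching sentence plus up to `window` sentences on
    either side (five sentences in total when window=2 and the match is not
    near the start or end of the text).
    """
    sentences = split_sentences(text)
    blocks = []
    for i, sentence in enumerate(sentences):
        lowered = sentence.lower()
        for term, category in lexicon.items():
            if term.lower() in lowered:
                start = max(0, i - window)
                end = min(len(sentences), i + window + 1)
                blocks.append({
                    "term": term,
                    "category": category,
                    "text": " ".join(sentences[start:end]),
                })
    return blocks


# Hypothetical lexicon and text (illustration only).
lexicon = {"open innovation": "organisational"}
sample = ("S one. S two. S three mentions open innovation. "
          "S four. S five. S six.")
for block in context_blocks(sample, lexicon):
    print(block)
```

Each returned dictionary pairs a text block with its category, which is the shape a supervised classifier would expect as training data.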

[NLP-31] Investigating Training and Generalization in Faithful Self-Explanations of Large Language Models AACL

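[Quick read]: This paper addresses the widespread lack of faithfulness in self-explanations generated by large language models (LLMs): such explanations often fail to reflect the model's actual decision process. The key to the solution is to use a feature attribution method to construct one-word constrained explanations that are likely to be faithful, and to use these pseudo-faithful self-explanations for continual learning on instruction-tuned models. Experiments on three classification tasks and three explanation styles show that this training improves self-explanation faithfulness across all of them and shows signs of generalizing across styles, to multi-word settings, and to unseen tasks, suggesting that training can promote faithful self-explanation more broadly.

Link: https://arxiv.org/abs/2512.07288
Authors: Tomoki Doi,Masaru Isonuma,Hitomi Yanaka
Affiliations: The University of Tokyo; Riken; Tohoku University; NII LLMC
Categories: Computation and Language (cs.CL)
Comments: To appear in the Proceedings of the Asia-Pacific Chapter of the Association for Computational Linguistics: Student Research Workshop (AACL-SRW 2025)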

Abstract:Large language models have the potential to generate explanations for their own predictions in a variety of styles based on user instructions. Recent research has examined whether these self-explanations faithfully reflect the models’ actual behavior and has found that they often lack faithfulness. However, the question of how to improve faithfulness remains underexplored. Moreover, because different explanation styles have superficially distinct characteristics, it is unclear whether improvements observed in one style also arise when using other styles. This study analyzes the effects of training for faithful self-explanations and the extent to which these effects generalize, using three classification tasks and three explanation styles. We construct one-word constrained explanations that are likely to be faithful using a feature attribution method, and use these pseudo-faithful self-explanations for continual learning on instruction-tuned models. Our experiments demonstrate that training can improve self-explanation faithfulness across all classification tasks and explanation styles, and that these improvements also show signs of generalization to the multi-word settings and to unseen tasks. Furthermore, we find consistent cross-style generalization among three styles, suggesting that training may contribute to a broader improvement in faithful self-explanation ability.

[NLP-32] Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data AACL

[Quick read]: This paper addresses the performance limits of automatic speech recognition (ASR) for low-resource languages caused by scarce labeled data and limited compute. The key to the solution is cross-lingual continual pretraining on large-scale unlabeled speech, combined with morphologically-aware tokenization. With only a 300M-parameter model, the approach outperforms the 1.5B-parameter Whisper Large v3 on Persian and is competitive on Arabic and Urdu, indicating that in low-resource settings data relevance and pretraining strategy matter more than simply scaling model size.

Link: https://arxiv.org/abs/2512.07277
Authors: Srihari Bandarupalli,Bhavana Akkiraju,Charan Devarakonda,Vamsiraghusimha Narsinga,Anil Kumar Vuppala
Affiliations: International Institute of Information Technology Hyderabad
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted in AACL IJCNLP 2025

Abstract:Automatic speech recognition for low-resource languages remains fundamentally constrained by the scarcity of labeled data and computational resources required by state-of-the-art models. We present a systematic investigation into cross-lingual continuous pretraining for low-resource languages, using Perso-Arabic languages (Persian, Arabic, and Urdu) as our primary case study. Our approach demonstrates that strategic utilization of unlabeled speech data can effectively bridge the resource gap without sacrificing recognition accuracy. We construct a 3,000-hour multilingual corpus through a scalable unlabeled data collection pipeline and employ targeted continual pretraining combined with morphologically-aware tokenization to develop a 300M parameter model that achieves performance comparable to systems 5 times larger. Our model outperforms Whisper Large v3 (1.5B parameters) on Persian and achieves competitive results on Arabic and Urdu despite using significantly fewer parameters and substantially less labeled data. These findings challenge the prevailing assumption that ASR quality scales primarily with model size, revealing instead that data relevance and strategic pretraining are more critical factors for low-resource scenarios. This work provides a practical pathway toward inclusive speech technology, enabling effective ASR for underrepresented languages without dependence on massive computational infrastructure or proprietary datasets.

[NLP-33] TeluguST-46: A Benchmark Corpus and Comprehensive Evaluation for Telugu-English Speech Translation AACL

[Quick read]: This paper addresses the severe lack of speech translation (ST) research for Telugu, a morphologically rich language. The key to the solution is a high-quality Telugu–English speech translation benchmark built from 46 hours of manually verified CSTD corpus data, together with a systematic comparison of cascaded and end-to-end architectures. IndicWhisper + IndicMT achieves the best performance thanks to extensive Telugu-specific training data, but fine-tuned SeamlessM4T models remain highly competitive despite using far less Telugu-specific data. This suggests that with careful hyperparameter tuning and sufficient parallel data (potentially less than 100 hours), end-to-end systems can match cascaded approaches in low-resource settings.

Link: https://arxiv.org/abs/2512.07265
Authors: Bhavana Akkiraju,Srihari Bandarupalli,Swathi Sambangi,Vasavi Ravuri,R Vijaya Saraswathi,Anil Kumar Vuppala
Affiliations: International Institute of Information Technology, Hyderabad, India; VNRVJIET, Hyderabad, India
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Submitted to AACL IJCNLP 2025

Abstract:Despite Telugu being spoken by over 80 million people, speech translation research for this morphologically rich language remains severely underexplored. We address this gap by developing a high-quality Telugu–English speech translation benchmark from 46 hours of manually verified CSTD corpus data (30h/8h/8h train/dev/test split). Our systematic comparison of cascaded versus end-to-end architectures shows that while IndicWhisper + IndicMT achieves the highest performance due to extensive Telugu-specific training data, finetuned SeamlessM4T models demonstrate remarkable competitiveness despite using significantly less Telugu-specific training data. This finding suggests that with careful hyperparameter tuning and sufficient parallel data (potentially less than 100 hours), end-to-end systems can achieve performance comparable to cascaded approaches in low-resource settings. Our metric reliability study evaluating BLEU, METEOR, ChrF++, ROUGE-L, TER, and BERTScore against human judgments reveals that traditional metrics provide better quality discrimination than BERTScore for Telugu–English translation. The work delivers three key contributions: a reproducible Telugu–English benchmark, empirical evidence of competitive end-to-end performance potential in low-resource scenarios, and practical guidance for automatic evaluation in morphologically complex language pairs.

[NLP-34] Ensembling LLM-Induced Decision Trees for Explainable and Robust Error Detection

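[Quick read]: This paper addresses two problems with current LLM-based error detection (ED) for tabular data: the LLM-as-a-labeler paradigm is a black box and therefore unexplainable, and it is highly sensitive to prompts, producing inconsistent outputs because of inherent model stochasticity. The key to the solution is an LLM-as-an-inducer framework. TreeED has the LLM induce a decision tree made of rule nodes (format or range checks), Graph Neural Network (GNN) nodes (complex patterns such as functional dependencies), and leaf nodes (the final decision), giving stepwise, explainable detection logic. ForestED then draws multiple row subsets by uncertainty-based sampling, builds one tree per subset, and uses an Expectation-Maximization-based algorithm to jointly estimate tree reliability and optimize the consensus prediction, improving accuracy and robustness.

Link: https://arxiv.org/abs/2512.07246
Authors: Mengqi Wang(1),Jianwei Wang(1),Qing Liu(2),Xiwei Xu(2),Zhenchang Xing(2),Liming Zhu(2),Wenjie Zhang(1) ((1) UNSW Sydney, (2) Data61, CSIRO)
Affiliations: UNSW Sydney; Data61, CSIRO
Categories: Computation and Language (cs.CL)
Comments: 14 pages, 8 figures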

Abstract:Error detection (ED), which aims to identify incorrect or inconsistent cell values in tabular data, is important for ensuring data quality. Recent state-of-the-art ED methods leverage the pre-trained knowledge and semantic capability embedded in large language models (LLMs) to directly label whether a cell is erroneous. However, this LLM-as-a-labeler pipeline (1) relies on the black box, implicit decision process, thus failing to provide explainability for the detection results, and (2) is highly sensitive to prompts, yielding inconsistent outputs due to inherent model stochasticity, therefore lacking robustness. To address these limitations, we propose an LLM-as-an-inducer framework that adopts LLM to induce the decision tree for ED (termed TreeED) and further ensembles multiple such trees for consensus detection (termed ForestED), thereby improving explainability and robustness. Specifically, based on prompts derived from data context, decision tree specifications and output requirements, TreeED queries the LLM to induce the decision tree skeleton, whose root-to-leaf decision paths specify the stepwise procedure for evaluating a given sample. Each tree contains three types of nodes: (1) rule nodes that perform simple validation checks (e.g., format or range), (2) Graph Neural Network (GNN) nodes that capture complex patterns (e.g., functional dependencies), and (3) leaf nodes that output the final decision types (error or clean). Furthermore, ForestED employs uncertainty-based sampling to obtain multiple row subsets, constructing a decision tree for each subset using TreeED. It then leverages an Expectation-Maximization-based algorithm that jointly estimates tree reliability and optimizes the consensus ED prediction. Extensive experiments demonstrate that our methods are accurate, explainable and robust, achieving an average F1-score improvement of 16.1% over the best baseline.
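
To make the idea of jointly estimating tree reliability and a consensus prediction concrete, here is a minimal sketch of a generic one-parameter-per-voter EM scheme for binary votes. It is not the ForestED algorithm, whose details the abstract does not give: the function name `em_consensus`, the uniform prior, the clamping bounds, and the toy vote matrix are all assumptions.

```python
def em_consensus(votes, iterations=20):
    """Generic EM consensus over binary votes (illustrative, not ForestED itself).

    votes[t][i] is 1 if tree t flags cell i as an error, else 0.
    Returns (post, acc): post[i] is the estimated probability that cell i
    is an error, acc[t] is the estimated reliability of tree t.
    """
    n_trees, n_cells = len(votes), len(votes[0])
    # Start from the plain vote share for each cell.
    post = [sum(votes[t][i] for t in range(n_trees)) / n_trees
            for i in range(n_cells)]
    acc = [0.5] * n_trees
    for _ in range(iterations):
        # M-step: reliability = expected agreement with the current posterior.
        for t in range(n_trees):
            agree = sum(post[i] if votes[t][i] else 1.0 - post[i]
                        for i in range(n_cells))
            # Clamp (an assumption) so no tree is treated as perfectly right or wrong.
            acc[t] = min(max(agree / n_cells, 0.01), 0.99)
        # E-step: posterior of "error" per cell, assuming a uniform prior.
        for i in range(n_cells):
            p_err, p_clean = 1.0, 1.0
            for t in range(n_trees):
                if votes[t][i]:
                    p_err *= acc[t]
                    p_clean *= 1.0 - acc[t]
                else:
                    p_err *= 1.0 - acc[t]
                    p_clean *= acc[t]
            post[i] = p_err / (p_err + p_clean)
    return post, acc

# Toy example: trees 0 and 1 agree; tree 2 disagrees on four of six cells.
votes = [
    [1, 0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 0, 1],
]
post, acc = em_consensus(votes)
labels = [1 if p > 0.5 else 0 for p in post]
```

In this toy run the two agreeing trees end up with higher estimated reliability than the dissenting one, so the consensus follows them; with real data the reliabilities would be less clear-cut.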

[NLP-35] Pay Less Attention to Function Words for Free Robustness of Vision-Language Models

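[Quick read]: This paper addresses the trade-off between robustness and performance in vision-language models (VLMs), in particular their vulnerability to cross-modal adversarial attacks. The authors observe that function words can contribute to this vulnerability. They propose Function-word De-Attention (FDA), which, in analogy to a differential amplifier, computes both the original cross-attention and the function-word cross-attention within attention heads and subtracts the latter from the former, yielding better-aligned and more robust models. Experiments across multiple downstream tasks, datasets, and models show that FDA lowers the attack success rate (ASR) with very small performance loss, and demonstrate its scalability, generalization, and zero-shot performance.

Link: https://arxiv.org/abs/2512.07222
Authors: Qiwei Tian,Chenhao Lin,Zhengyu Zhao,Chao Shen
Affiliations: Xi’an Jiaotong University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: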

Abstract:To address the trade-off between robustness and performance for robust VLM, we observe that function words could incur vulnerability of VLMs against cross-modal adversarial attacks, and propose Function-word De-Attention (FDA) accordingly to mitigate the impact of function words. Similar to differential amplifiers, our FDA calculates the original and the function-word cross-attention within attention heads, and differentially subtracts the latter from the former for more aligned and robust VLMs. Comprehensive experiments include 2 SOTA baselines under 6 different attacks on 2 downstream tasks, 3 datasets, and 3 models. Overall, our FDA yields an average 18/13/53% ASR drop with only 0.2/0.3/0.6% performance drops on the 3 tested models on retrieval, and a 90% ASR drop with a 0.3% performance gain on visual grounding. We demonstrate the scalability, generalization, and zero-shot performance of FDA experimentally, as well as in-depth ablation studies and analysis. Code will be made publicly available at this https URL.

[NLP-36] NeSTR: A Neuro-Symbolic Abductive Framework for Temporal Reasoning in Large Language Models AAAI2026

[Quick read]: This paper addresses the weakness of large language models (LLMs) at temporal reasoning under complex temporal constraints, where models may misinterpret or misapply time-related information and produce inconsistent or hallucinated reasoning. The key to the solution is Neuro-Symbolic Temporal Reasoning (NeSTR), a framework that preserves explicit temporal relations through symbolic encoding, enforces logical consistency through verification, and corrects flawed inferences through abductive reflection, improving temporal sensitivity and reasoning accuracy without any fine-tuning.

Link: https://arxiv.org/abs/2512.07218
Authors: Feng Liang,Weixin Zeng,Runhao Zhao,Xiang Zhao
Affiliations: National Key Laboratory of Big Data and Decision, National University of Defense Technology, China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by AAAI 2026

Abstract:Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, temporal reasoning, particularly under complex temporal constraints, remains a major challenge. To this end, existing approaches have explored symbolic methods, which encode temporal structure explicitly, and reflective mechanisms, which revise reasoning errors through multi-step inference. Nonetheless, symbolic approaches often underutilize the reasoning capabilities of LLMs, while reflective methods typically lack structured temporal representations, which can result in inconsistent or hallucinated reasoning. As a result, even when the correct temporal context is available, LLMs may still misinterpret or misapply time-related information, leading to incomplete or inaccurate answers. To address these limitations, in this work, we propose Neuro-Symbolic Temporal Reasoning (NeSTR), a novel framework that integrates structured symbolic representations with hybrid reflective reasoning to enhance the temporal sensitivity of LLM inference. NeSTR preserves explicit temporal relations through symbolic encoding, enforces logical consistency via verification, and corrects flawed inferences using abductive reflection. Extensive experiments on diverse temporal question answering benchmarks demonstrate that NeSTR achieves superior zero-shot performance and consistently improves temporal reasoning without any fine-tuning, showcasing the advantage of neuro-symbolic integration in enhancing temporal understanding in large language models.

[NLP-37] Generating Storytelling Images with Rich Chains-of-Reasoning

[Quick read]: This paper addresses the difficulty and scarcity of Storytelling Images: images whose rich, logically connected visual clues form Chains-of-Reasoning (CoRs) that convey multi-layered information and invite active interpretation. The authors propose StorytellingPainter, a two-stage pipeline that combines the creative reasoning of large language models (LLMs) with the visual synthesis of text-to-image (T2I) models. They further explore tailored training strategies that yield the lightweight Mini-Storytellers models, narrowing the gap between open-source and proprietary LLMs on story generation, and introduce an evaluation framework with a Semantic Complexity Evaluator, a KNN-based Diversity Evaluator, and a Story-Image Alignment Evaluator.

Link: https://arxiv.org/abs/2512.07198
Authors: Xiujie Song,Qi Jia,Shota Watanabe,Xiaoyi Pang,Ruijie Chen,Mengyue Wu,Kenny Q. Zhu
Affiliations: X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; East China University of Technology; University of Texas at Arlington
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:An image can convey a compelling story by presenting rich, logically connected visual clues. These connections form Chains-of-Reasoning (CoRs) within the image, enabling viewers to infer events, causal relationships, and other information, thereby understanding the underlying story. In this paper, we focus on these semantically rich images and define them as Storytelling Images. Such images have diverse applications beyond illustration creation and cognitive screening, leveraging their ability to convey multi-layered information visually and inspire active interpretation. However, due to their complex semantic nature, Storytelling Images are inherently challenging to create, and thus remain relatively scarce. To address this challenge, we introduce the Storytelling Image Generation task, which explores how generative AI models can be leveraged to create such images. Specifically, we propose a two-stage pipeline, StorytellingPainter, which combines the creative reasoning abilities of Large Language Models (LLMs) with the visual synthesis capabilities of Text-to-Image (T2I) models to generate Storytelling Images. Alongside this pipeline, we develop a dedicated evaluation framework comprising three main evaluators: a Semantic Complexity Evaluator, a KNN-based Diversity Evaluator and a Story-Image Alignment Evaluator. Given the critical role of story generation in the Storytelling Image Generation task and the performance disparity between open-source and proprietary LLMs, we further explore tailored training strategies to reduce this gap, resulting in a series of lightweight yet effective models named Mini-Storytellers. Experimental results demonstrate the feasibility and effectiveness of our approaches. The code is available at this https URL.

[NLP-38] MASim: Multilingual Agent-Based Simulation for Social Science

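[Quick read]: This paper addresses the monolingual limitation of existing multi-agent role-playing simulations, which cannot model cross-lingual interaction, an essential property of real societies. The solution is MASim, the first multilingual agent-based simulation framework supporting multi-turn interaction among generative agents with diverse sociolinguistic profiles. It offers two key analyses: global public opinion modeling, which simulates how attitudes toward open-domain hypotheses evolve across languages and cultures, and media influence and information diffusion, in which autonomous news agents dynamically generate content and shape user behavior. The work supports scalable, controlled computational social science and underlines the importance of multilingual simulation.

Link: https://arxiv.org/abs/2512.07195
Authors: Xuan Zhang,Wenxuan Zhang,Anxu Wang,See-Kiong Ng,Yang Deng
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
Comments: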

Abstract:Multi-agent role-playing has recently shown promise for studying social behavior with language agents, but existing simulations are mostly monolingual and fail to model cross-lingual interaction, an essential property of real societies. We introduce MASim, the first multilingual agent-based simulation framework that supports multi-turn interaction among generative agents with diverse sociolinguistic profiles. MASim offers two key analyses: (i) global public opinion modeling, by simulating how attitudes toward open-domain hypotheses evolve across languages and cultures, and (ii) media influence and information diffusion, via autonomous news agents that dynamically generate content and shape user behavior. To instantiate simulations, we construct the MAPS benchmark, which combines survey questions and demographic personas drawn from global population distributions. Experiments on calibration, sensitivity, consistency, and cultural case studies show that MASim reproduces sociocultural phenomena and highlights the importance of multilingual simulation for scalable, controlled computational social science.

[NLP-39] PICKT: Practical Interlinked Concept Knowledge Tracing for Personalized Learning using Knowledge Map Concept Relations

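[Quick read]: This paper addresses three practical problems of Knowledge Tracing (KT) models in Intelligent Tutoring Systems (ITS): restricted input data formats, cold start when new students enroll or new questions are added, and insufficient stability in real-world service environments. The key to the solution is the Practical Interlinked Concept Knowledge Tracing (PICKT) model, whose core idea is a knowledge map that structures the relationships among concepts using question and concept text, so that knowledge states can still be traced effectively in cold-start situations. Experiments designed to reflect real operational environments show improved performance and stability.

Link: https://arxiv.org/abs/2512.07179
Authors: Wonbeen Lee,Channyoung Lee,Junho Sohn,Hansam Cho
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 15 pages, 5 figures, 17 tables. Preparing submission for EDM 2026 conference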

Abstract:With the recent surge in personalized learning, Intelligent Tutoring Systems (ITS) that can accurately track students’ individual knowledge states and provide tailored learning paths based on this information are in demand as an essential task. This paper focuses on the core technology of Knowledge Tracing (KT) models that analyze students’ sequences of interactions to predict their knowledge acquisition levels. However, existing KT models suffer from limitations such as restricted input data formats, cold start problems arising with new student enrollment or new question addition, and insufficient stability in real-world service environments. To overcome these limitations, a Practical Interlinked Concept Knowledge Tracing (PICKT) model that can effectively process multiple types of input data is proposed. Specifically, a knowledge map structures the relationships among concepts considering the question and concept text information, thereby enabling effective knowledge tracing even in cold start situations. Experiments reflecting real operational environments demonstrated the model’s excellent performance and practicality. The main contributions of this research are as follows. First, a model architecture that effectively utilizes diverse data formats is presented. Second, significant performance improvements are achieved over existing models for two core cold start challenges: new student enrollment and new question addition. Third, the model’s stability and practicality are validated through delicate experimental design, enhancing its applicability in real-world product environments. This provides a crucial theoretical and technical foundation for the practical implementation of next-generation ITS.

[NLP-40] Think-Reflect-Revise: A Policy-Guided Reflective Framework for Safety Alignment in Large Vision Language Models

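[Quick read]: This paper addresses a key flaw in the safety alignment of large vision language models (LVLMs): the single-pass think-then-answer paradigm remains vulnerable to contextual or visual jailbreak attacks and may overlook explicit harmful content in its own output. The solution is Think-Reflect-Revise (TRR), a three-stage training framework built on policy-guided self-reflection, which uses the malicious content revealed in first-pass reasoning to drive self-correction. The authors build a 5,000-example Reflective Safety Reasoning (ReSafe) dataset, fine-tune the target model on it to initialize reflective behavior, and then reinforce policy-guided reflection with reinforcement learning. On Qwen2.5-VL-7B the overall safe response rate rises from 42.8% to 87.7% while performance on general benchmarks stays stable.

Link: https://arxiv.org/abs/2512.07141
Authors: Fenghua Weng,Chaochao Lu,Xia Hu,Wenqi Shao,Wenjie Wang
Affiliations: Shanghaitech University; Shanghai Artificial Intelligence Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: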

Abstract:As multimodal reasoning improves the overall capabilities of Large Vision Language Models (LVLMs), recent studies have begun to explore safety-oriented reasoning, aiming to enhance safety awareness by analyzing potential safety risks during the reasoning process before generating the final response. Although such approaches improve safety awareness and interpretability, this single-pass think-then-answer paradigm remains vulnerable to contextual or visual jailbreak attacks. This reveals a critical flaw: single-pass reasoning may overlook explicit harmful content in its own output. Our key insight is to exploit this wasted signal through reflection, which can effectively leverage the malicious content revealed in the first-pass reasoning to enable genuine self-correction and prevent unsafe generations. Motivated by this, we propose Think-Reflect-Revise (TRR), a three-stage training framework designed to enhance the safety alignment of LVLMs through policy-guided self-reflection. We first build a Reflective Safety Reasoning (ReSafe) dataset with 5,000 examples that follow a think-reflect-revise process. We then fine-tune the target model using the ReSafe dataset to initialize reflective behavior, and finally reinforce policy-guided reflection through reinforcement learning. Experimental results show that TRR substantially improves the safety performance of LVLMs across both safety-awareness benchmarks and jailbreak attack evaluations, increasing the overall safe response rate from 42.8% to 87.7% on Qwen2.5-VL-7B, while preserving stable performance on general benchmarks such as MMMU and MMStar. The project page is available at this https URL.

[NLP-41] GUMBridge: a Corpus for Varieties of Bridging Anaphora

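[Quick read]: This paper addresses the scarcity and limited coverage of English resources for bridging anaphora: most existing resources are small, cover the phenomenon only partially, and/or span few genres. The key to the solution is GUMBridge, a bridging corpus covering 16 genres of English that offers both broad coverage of the phenomenon and granular subtype annotations. The paper also evaluates annotation quality and reports baseline results from open and closed source LLMs on three tasks, showing that bridging resolution and subtype classification remain difficult NLP tasks.

Link: https://arxiv.org/abs/2512.07134
Authors: Lauren Levine,Amir Zeldes
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: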

Abstract:Bridging is an anaphoric phenomenon where the referent of an entity in a discourse is dependent on a previous, non-identical entity for interpretation, such as in “There is ‘a house’. ‘The door’ is red,” where the door is specifically understood to be the door of the aforementioned house. While there are several existing resources in English for bridging anaphora, most are small, provide limited coverage of the phenomenon, and/or provide limited genre coverage. In this paper, we introduce GUMBridge, a new resource for bridging, which includes 16 diverse genres of English, providing both broad coverage for the phenomenon and granular annotations for the subtype categorization of bridging varieties. We also present an evaluation of annotation quality and report on baseline performance using open and closed source contemporary LLMs on three tasks underlying our data, showing that bridging resolution and subtype classification remain difficult NLP tasks in the age of LLMs.

[NLP-42] DART: Leveraging Multi-Agent Disagreement for Tool Recruitment in Multimodal Reasoning

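[Quick read]: This paper addresses how a multi-agent framework can decide which visual tools to call (for example object detection, OCR, or spatial reasoning), and when, in order to resolve disagreement between agents. The key to the solution is DART, which uses disagreements among multiple debating visual agents to identify useful tools; the tools introduce new information and provide tool-aligned agreement scores that highlight agents in agreement with expert tools. An aggregator agent then selects the best answer from the agent outputs and tool information. Experiments on four benchmarks show that DART improves over both single-agent tool-calling frameworks and multi-agent debate.

Link: https://arxiv.org/abs/2512.07132
Authors: Nithin Sivakumaran,Justin Chih-Yao Chen,David Wan,Yue Zhang,Jaehong Yoon,Elias Stengel-Eskin,Mohit Bansal
Affiliations: UNC Chapel Hill; Nanyang Technological University; The University of Texas at Austin
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL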

Abstract:Specialized visual tools can augment large language models or vision language models with expert knowledge (e.g., grounding, spatial reasoning, medical knowledge, etc.), but knowing which tools to call (and when to call them) can be challenging. We introduce DART, a multi-agent framework that uses disagreements between multiple debating visual agents to identify useful visual tools (e.g., object detection, OCR, spatial reasoning, etc.) that can resolve inter-agent disagreement. These tools allow for fruitful multi-agent discussion by introducing new information, and by providing tool-aligned agreement scores that highlight agents in agreement with expert tools, thereby facilitating discussion. We utilize an aggregator agent to select the best answer by providing the agent outputs and tool information. We test DART on four diverse benchmarks and show that our approach improves over multi-agent debate as well as over single agent tool-calling frameworks, beating the next-strongest baseline (multi-agent debate with a judge model) by 3.4% and 2.4% on A-OKVQA and MMMU respectively. We also find that DART adapts well to new tools in applied domains, with a 1.3% improvement on the M3D medical dataset over other strong tool-calling, single agent, and multi-agent baselines. Additionally, we measure text overlap across rounds to highlight the rich discussion in DART compared to existing multi-agent methods. Finally, we study the tool call distribution, finding that diverse tools are reliably used to help resolve disagreement.

[NLP-43] A Neural Affinity Framework for Abstract Reasoning: Diagnosing the Compositional Gap in Transformer Architectures via Procedural Task Taxonomy

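[Quick read]: This paper addresses a performance bottleneck of Transformer architectures on abstract reasoning tasks, which the authors call the Neural Affinity Ceiling Effect: on some tasks, performance is bounded by architectural suitability rather than by curriculum or data. The key to the solution is a 9-category taxonomy of all 400 re-arc tasks, validated at 97.5% accuracy via rule-based code analysis, whose visual coherence is supported by a CNN trained on raw grid pixels (95.24% accuracy on S3). The analysis finds that 35.3% of tasks have low neural affinity for Transformers, and that even with 1M examples per task (Li et al.'s ViTARC study) Very Low affinity tasks reach only 51.9% versus 77.7% for High affinity tasks (p < 0.001). The authors conclude that progress requires hybrid architectures with affinity-aligned modules rather than more data alone.

Link: https://arxiv.org/abs/2512.07109
Authors: Miguel Ingram,Arthur Joseph Merritt III
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 62 pages, 10 figures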

Abstract:Responding to Hodel et al.'s (2024) call for a formal definition of task relatedness in re-arc, we present the first 9-category taxonomy of all 400 tasks, validated at 97.5% accuracy via rule-based code analysis. We prove the taxonomy’s visual coherence by training a CNN on raw grid pixels (95.24% accuracy on S3, 36.25% overall, 3.3x chance), then apply the taxonomy diagnostically to the original ARC-AGI-2 test set. Our curriculum analysis reveals 35.3% of tasks exhibit low neural affinity for Transformers–a distributional bias mirroring ARC-AGI-2. To probe this misalignment, we fine-tuned a 1.7M-parameter Transformer across 302 tasks, revealing a profound Compositional Gap: 210 of 302 tasks (69.5%) achieve >80% cell accuracy (local patterns) but <10% grid accuracy (global synthesis). This provides direct evidence for a Neural Affinity Ceiling Effect, where performance is bounded by architectural suitability, not curriculum. Applying our framework to Li et al.'s independent ViTARC study (400 specialists, 1M examples each) confirms its predictive power: Very Low affinity tasks achieve 51.9% versus 77.7% for High affinity (p < 0.001), with a task at 0% despite massive data. The taxonomy enables precise diagnosis: low-affinity tasks (A2) hit hard ceilings, while high-affinity tasks (C1) reach 99.8%. These findings indicate that progress requires hybrid architectures with affinity-aligned modules. We release our validated taxonomy,
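
The gap between cell accuracy and grid accuracy mentioned in the abstract can be shown with a small worked example. The definitions below are the usual ones (fraction of matching cells versus fraction of exactly matching grids) and are assumed rather than taken from the paper; the function names and the toy grids are hypothetical.

```python
def cell_accuracy(pred, target):
    """Fraction of cells that match; assumes both grids have the same shape."""
    pairs = [(p, t) for prow, trow in zip(pred, target)
             for p, t in zip(prow, trow)]
    return sum(p == t for p, t in pairs) / len(pairs)

def grid_accuracy(preds, targets):
    """Fraction of grids that match exactly, cell for cell."""
    return sum(p == t for p, t in zip(preds, targets)) / len(preds)

# Toy 3x3 grids: the prediction gets eight of nine cells right.
target = [[1, 1, 0],
          [0, 2, 0],
          [0, 1, 1]]
pred = [[1, 1, 0],
        [0, 2, 0],
        [0, 1, 3]]  # a single wrong cell in the bottom-right corner

cell_acc = cell_accuracy(pred, target)      # 8 of 9 cells match
grid_acc = grid_accuracy([pred], [target])  # the grid as a whole is wrong
```

One wrong cell leaves cell accuracy near 89% while grid accuracy for that example drops to zero, which is why a model can score well on local patterns and still fail almost every task under an exact-match criterion.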

[NLP-44] Leveraging KV Similarity for Online Structured Pruning in LLMs

Link: https://arxiv.org/abs/2512.07090
Authors: Jungmin Lee,Gwangeun Byeon,Yulhwa Kim,Seokin Hong
Affiliations: Sungkyunkwan University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

[NLP-45] Do Large Language Models Truly Understand Cross-cultural Differences?

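[Quick read]: This paper addresses three limitations of current benchmarks for cross-cultural understanding in large language models (LLMs): a lack of contextual scenarios, insufficient cross-cultural concept mapping, and limited deep cultural reasoning. The authors propose SAGE, a scenario-based benchmark built via cross-cultural core concept alignment and generative task design. Grounded in cultural theory, it divides cross-cultural capabilities into nine dimensions, from which 210 core concepts are curated and 4530 test items constructed across 15 real-world scenarios. The dataset is extensible and transferable to other languages, and it exposes systematic model weaknesses across dimensions and scenarios.

Link: https://arxiv.org/abs/2512.07075
Authors: Shiwei Guo,Sihang Jiang,Qianxi He,Yanghua Xiao,Jiaqing Liang,Bi Yude,Minggui He,Shimin Tao,Li Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: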

Abstract:In recent years, large language models (LLMs) have demonstrated strong performance on multilingual tasks. Given its wide range of applications, cross-cultural understanding capability is a crucial competency. However, existing benchmarks for evaluating whether LLMs genuinely possess this capability suffer from three key limitations: a lack of contextual scenarios, insufficient cross-cultural concept mapping, and limited deep cultural reasoning capabilities. To address these gaps, we propose SAGE, a scenario-based benchmark built via cross-cultural core concept alignment and generative task design, to evaluate LLMs’ cross-cultural understanding and reasoning. Grounded in cultural theory, we categorize cross-cultural capabilities into nine dimensions. Using this framework, we curated 210 core concepts and constructed 4530 test items across 15 specific real-world scenarios, organized under four broader categories of cross-cultural situations, following established item design principles. The SAGE dataset supports continuous expansion, and experiments confirm its transferability to other languages. It reveals model weaknesses across both dimensions and scenarios, exposing systematic limitations in cross-cultural reasoning. While progress has been made, LLMs are still some distance away from reaching a truly nuanced cross-cultural understanding. In compliance with the anonymity policy, we include data and code in the supplement materials. In future versions, we will make them publicly available online.

[NLP-46] SETUP: Sentence-level English-To-Uniform Meaning Representation Parser

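[Quick read]: This paper addresses automatic parsing of English text into Uniform Meaning Representation (UMR), which is needed before UMR can be used at scale for language documentation, low-resource language technologies, and interpretability. Prior work on text-to-UMR parsing is limited. The paper introduces two approaches: one fine-tunes existing Abstract Meaning Representation (AMR) parsers, and the other leverages a converter from Universal Dependencies (UD), with prior work used as a baseline. The best model, SETUP, achieves an AnCast score of 84 and a SMATCH++ score of 91.

Link: https://arxiv.org/abs/2512.07068
Authors: Emma Markle,Javier Gutierrez Bach,Shira Wein
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: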

Abstract:Uniform Meaning Representation (UMR) is a novel graph-based semantic representation which captures the core meaning of a text, with flexibility incorporated into the annotation schema such that the breadth of the world’s languages can be annotated (including low-resource languages). While UMR shows promise in enabling language documentation, improving low-resource language technologies, and adding interpretability, the downstream applications of UMR can only be fully explored when text-to-UMR parsers enable the automatic large-scale production of accurate UMR graphs at test time. Prior work on text-to-UMR parsing is limited to date. In this paper, we introduce two methods for English text-to-UMR parsing, one of which fine-tunes existing parsers for Abstract Meaning Representation and the other, which leverages a converter from Universal Dependencies, using prior work as a baseline. Our best-performing model, which we call SETUP, achieves an AnCast score of 84 and a SMATCH++ score of 91, indicating substantial gains towards automatic UMR parsing.

[NLP-47] Replicating TEMPEST at Scale: Multi-Turn Adversarial Attacks Against Trillion-Parameter Frontier Models

Link: https://arxiv.org/abs/2512.07059
Authors: Richard Young
Affiliations: University of Nevada Las Vegas
Categories: Computation and Language (cs.CL)
Notes: 30 pages, 11 figures, 5 tables. Code and data: this https URL

[NLP-48] FVA-RAG: Falsification-Verification Alignment for Mitigating Sycophantic Hallucinations

Link: https://arxiv.org/abs/2512.07015
Authors: Mayank Ravishankara
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Notes:

[NLP-49] Block Sparse Flash Attention

Link: https://arxiv.org/abs/2512.07011
Authors: Daniel Ohayon, Itay Lamprecht, Itay Hubara, Israel Cohen, Daniel Soudry, Noam Elata
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Performance (cs.PF)
Notes: 10 pages, 5 figures. Code: this https URL

[NLP-50] Prompting-in-a-Series: Psychology-Informed Contents and Embeddings for Personality Recognition With Decoder-Only Models

Link: https://arxiv.org/abs/2512.06991
Authors: Jing Jie Tan, Ban-Hoe Kwan, Danny Wee-Kiat Ng, Yan-Chai Hum, Anissa Mokraoui, Shih-Yu Lo
Affiliations: Universiti Tunku Abdul Rahman; Université Sorbonne Paris Nord; National Yang Ming Chiao Tung University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 16 pages

[NLP-51] Flash Multi-Head Feed-Forward Network

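[Quick read]: This paper addresses the limited scalability and expressive power of the feed-forward network (FFN) in the standard Transformer, and in particular two problems that arise when a multi-head mechanism is applied to FFNs. First, memory consumption grows with the number of heads. Second, as models scale, the ratio between the growing intermediate dimension and the fixed head dimension becomes imbalanced, which degrades scalability and expressive power. The key to the solution is Flash Multi-Head FFN (FlashMHF), which has two innovations: (1) an I/O-aware fused kernel that computes outputs online in SRAM, in the manner of FlashAttention, which sharply reduces peak memory; and (2) dynamically weighted parallel sub-networks that keep the ratio between the intermediate and head dimensions balanced, so that the model stays expressive and efficient across scales. On models from 128M to 1.3B parameters, FlashMHF outperforms SwiGLU FFNs, improving perplexity and downstream task accuracy while reducing peak memory by 3-5x and accelerating inference by up to 1.08x.

Link: https://arxiv.org/abs/2512.06989
Authors: Minshen Zhang, Xiang Hu, Jianguo Li, Wei Wu, Kewei Tu
Affiliations: ShanghaiTech University; Ant Group
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 17 pages, 8 figures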

Abstract:We explore Multi-Head FFN (MH-FFN) as a replacement of FFN in the Transformer architecture, motivated by the structural similarity between single-head attention and FFN. While multi-head mechanisms enhance expressivity in attention, naively applying them to FFNs faces two challenges: memory consumption scaling with the head count, and an imbalanced ratio between the growing intermediate size and the fixed head dimension as models scale, which degrades scalability and expressive power. To address these challenges, we propose Flash Multi-Head FFN (FlashMHF), with two key innovations: an I/O-aware fused kernel computing outputs online in SRAM akin to FlashAttention, and a design using dynamically weighted parallel sub-networks to maintain a balanced ratio between intermediate and head dimensions. Validated on models from 128M to 1.3B parameters, FlashMHF consistently improves perplexity and downstream task accuracy over SwiGLU FFNs, while reducing peak memory usage by 3-5x and accelerating inference by up to 1.08x. Our work establishes the multi-head design as a superior architectural principle for FFNs, presenting FlashMHF as a powerful, efficient, and scalable alternative to FFNs in Transformers.

[NLP-52] Progress Ratio Embeddings: An Impatience Signal for Robust Length Control in Neural Text Generation

Link: https://arxiv.org/abs/2512.06938
Authors: Ivanhoé Botcazou, Tassadit Amghar, Sylvain Lamprier, Frédéric Saubion
Affiliations: LERIA, University of Angers
Categories: Computation and Language (cs.CL)
Notes:

[NLP-53] MATEX: A Multi-Agent Framework for Explaining Ethereum Transactions

Link: https://arxiv.org/abs/2512.06933
Authors: Zifan Peng
Affiliations: The Hong Kong University of Science and Technology (Guangzhou)
Categories: Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Notes:

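[NLP-54] IXAM: Interactive Explainability for Authorship Attribution Models

[Quick read]: This paper addresses the lack of explainability in the predictions of authorship attribution (AA) models: users find it hard to understand why a model reached a particular decision. The key to the solution is IXAM, an interactive explainability framework that lets users explore the embedding space of an embedding-based AA model and build an explanation of its prediction as a set of writing-style features at different levels of granularity, which makes the model's decisions more transparent and trustworthy.

Link: https://arxiv.org/abs/2512.06924
Authors: Milad Alshomary, Anisha Bhatnagar, Peter Zeng, Smaranda Muresan, Owen Rambow, Kathleen McKeown
Affiliations: Columbia University; Stony Brook University; University of Pennsylvania
Categories: Computation and Language (cs.CL)
Notes: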

Abstract:We present IXAM, an Interactive eXplainability framework for Authorship Attribution Models. Given an authorship attribution (AA) task and an embedding-based AA model, our tool enables users to interactively explore the model’s embedding space and construct an explanation of the model’s prediction as a set of writing style features at different levels of granularity. Through a user evaluation, we demonstrate the value of our framework compared to predefined stylistic explanations.

[NLP-55] Large Language Models and Forensic Linguistics: Navigating Opportunities and Threats in the Age of Generative AI

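[Quick read]: This paper addresses the dual challenge that generative AI poses for forensic linguistics. On one hand, large language models (LLMs) serve as analytical tools that enable corpus-scale analysis and embedding-based authorship attribution. On the other hand, style mimicry, authorship obfuscation, and the proliferation of synthetic text undermine the assumed stability of idiolect, which in turn threatens the admissibility of forensic linguistic evidence in legal proceedings. The key to the solution is methodological reconfiguration: hybrid human-AI workflows, explainable detection paradigms that go beyond binary classification, and validation regimes that measure error and bias across diverse populations, so that the field remains scientifically credible and legally admissible.

Link: https://arxiv.org/abs/2512.06922
Authors: George Mikros
Affiliations: Hamad Bin Khalifa University
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Notes: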

Abstract:Large language models (LLMs) present a dual challenge for forensic linguistics. They serve as powerful analytical tools enabling scalable corpus analysis and embedding-based authorship attribution, while simultaneously destabilising foundational assumptions about idiolect through style mimicry, authorship obfuscation, and the proliferation of synthetic texts. Recent stylometric research indicates that LLMs can approximate surface stylistic features yet exhibit detectable differences from human writers, a tension with significant forensic implications. However, current AI-text detection techniques, whether classifier-based, stylometric, or watermarking approaches, face substantial limitations: high false positive rates for non-native English writers and vulnerability to adversarial strategies such as homoglyph substitution. These uncertainties raise concerns under legal admissibility standards, particularly the Daubert and Kumho Tire frameworks. The article concludes that forensic linguistics requires methodological reconfiguration to remain scientifically credible and legally admissible. Proposed adaptations include hybrid human-AI workflows, explainable detection paradigms beyond binary classification, and validation regimes measuring error and bias across diverse populations. The discipline’s core insight, i.e., that language reveals information about its producer, remains valid but must accommodate increasingly complex chains of human and machine authorship.

[NLP-56] NeuroABench: A Multimodal Evaluation Benchmark for Neurosurgical Anatomy Identification

Link: https://arxiv.org/abs/2512.06921
Authors: Ziyang Song, Zelin Zang, Xiaofan Ye, Boqiang Xu, Long Bai, Jinlin Wu, Hongliang Ren, Hongbin Liu, Jiebo Luo, Zhen Lei
Affiliations: Hong Kong Institute of Science and Innovation; The University of Hong Kong-Shenzhen Hospital; Department of Electronic Engineering, The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: Accepted by IEEE ICIA 2025

[NLP-57] Automated PRO-CTCAE Symptom Selection based on Prior Adverse Event Profiles

Link: https://arxiv.org/abs/2512.06919
Authors: Francois Vandenhende, Anna Georgiou, Michalis Georgiou, Theodoros Psaras, Ellie Karekla
Affiliations: ClinBAY Limited
Categories: Computation and Language (cs.CL)
Notes: 13 pages, 2 figures

[NLP-58] An Analysis of Large Language Models for Simulating User Responses in Surveys AACL2025

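[Quick read]: This paper addresses the bias that large language models (LLMs) show when simulating user opinions. Models trained with reinforcement learning from human feedback (RLHF) in particular tend to favor dominant viewpoints and struggle to reflect the diversity of opinion among users from different demographic and cultural backgrounds. In response, the paper proposes CLAIMSIM, a claim diversification method whose key idea is to elicit multiple viewpoints from the LLM's parametric knowledge and supply them as contextual input, making generated responses more diverse. Experiments show that CLAIMSIM does increase the diversity of output viewpoints, but LLMs still fail to simulate real users accurately. The main limitations are that they hold fixed positions across different demographic features and cannot reason over nuanced differences when presented with conflicting claims.

Link: https://arxiv.org/abs/2512.06874
Authors: Ziyun Yu, Yiru Zhou, Chen Zhao, Hongyi Wen
Affiliations: Center for Data Science, NYU Shanghai; New York University
Categories: Computation and Language (cs.CL)
Notes: Accepted to IJCNLP-AACL 2025 (Main Conference)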

Abstract:Using Large Language Models (LLMs) to simulate user opinions has received growing attention. Yet LLMs, especially trained with reinforcement learning from human feedback (RLHF), are known to exhibit biases toward dominant viewpoints, raising concerns about their ability to represent users from diverse demographic and cultural backgrounds. In this work, we examine the extent to which LLMs can simulate human responses to cross-domain survey questions through direct prompting and chain-of-thought prompting. We further propose a claim diversification method CLAIMSIM, which elicits viewpoints from LLM parametric knowledge as contextual input. Experiments on the survey question answering task indicate that, while CLAIMSIM produces more diverse responses, both approaches struggle to accurately simulate users. Further analysis reveals two key limitations: (1) LLMs tend to maintain fixed viewpoints across varying demographic features, and generate single-perspective claims; and (2) when presented with conflicting claims, LLMs struggle to reason over nuanced differences among demographic features, limiting their ability to adapt responses to specific user profiles.

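[NLP-59] Rhea: Role-aware Heuristic Episodic Attention for Conversational LLMs

[Quick read]: This paper addresses cumulative contextual decay in multi-turn conversations with large language models (LLMs): a progressive degradation of contextual integrity caused by attention pollution, dilution, and drift. The key to the solution is Rhea (Role-aware Heuristic Episodic Attention), a framework that decouples the conversation history into two functionally independent memory modules: (1) an Instructional Memory (IM) that persistently stores high-fidelity global constraints through a structural priority mechanism, and (2) an Episodic Memory (EM) that dynamically manages user-model interactions through asymmetric noise control and heuristic context retrieval. At inference time, Rhea applies priority attention to build a context with a high signal-to-noise ratio, always keeping the global instructions first and selectively integrating relevant episodic information, which improves instruction consistency and overall performance in long-horizon conversations.

Link: https://arxiv.org/abs/2512.06869
Authors: Wanyang Hong, Zhaoning Zhang, Yi Chen, Libo Zhang, Baihui Liu, Linbo Qiao, Zhiliang Tian, Dongsheng Li
Affiliations: National Key Laboratory of Parallel and Distributed Computing; College of Computer Science and Technology; National University of Defense Technology
Categories: Computation and Language (cs.CL)
Notes: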

Abstract:Large Language Models (LLMs) have achieved remarkable performance on single-turn tasks, yet their effectiveness deteriorates in multi-turn conversations. We define this phenomenon as cumulative contextual decay - a progressive degradation of contextual integrity caused by attention pollution, dilution, and drift. To address this challenge, we propose Rhea (Role-aware Heuristic Episodic Attention), a novel framework that decouples conversation history into two functionally independent memory modules: (1) an Instructional Memory (IM) that persistently stores high-fidelity global constraints via a structural priority mechanism, and (2) an Episodic Memory (EM) that dynamically manages user-model interactions via asymmetric noise control and heuristic context retrieval. During inference, Rhea constructs a high signal-to-noise context by applying its priority attention: selectively integrating relevant episodic information while always prioritizing global instructions. To validate this approach, experiments on multiple multi-turn conversation benchmarks - including MT-Eval and Long-MT-Bench+ - show that Rhea mitigates performance decay and improves overall accuracy by 1.04 points on a 10-point scale (a 16% relative gain over strong baselines). Moreover, Rhea maintains near-perfect instruction fidelity (IAR 8.1) across long-horizon interactions. These results demonstrate that Rhea provides a principled and effective framework for building more precise, instruction-consistent conversational LLMs.
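
The context-construction step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_context` and the word-overlap relevance score are hypothetical stand-ins, since the abstract does not specify how Rhea's heuristic retrieval or priority attention is implemented.

```python
from typing import List


def build_context(instructions: List[str], episodes: List[str],
                  query: str, k: int = 2) -> List[str]:
    """Assemble a prompt: all instructions first, then the k most relevant turns."""
    query_words = set(query.lower().split())

    def overlap(turn: str) -> int:
        # Hypothetical relevance heuristic: number of shared lowercase words.
        return len(query_words & set(turn.lower().split()))

    # Rank episodic turns by relevance; ties keep the earlier turn first.
    ranked = sorted(range(len(episodes)), key=lambda i: (-overlap(episodes[i]), i))
    # Restore chronological order among the selected turns.
    selected = sorted(ranked[:k])
    # Instructions are always kept in full and always come first.
    return instructions + [episodes[i] for i in selected] + [query]
```

The design choice illustrated is the asymmetry: instructions are never filtered, while episodic turns compete for a fixed budget of `k` slots.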

[NLP-60] Less Is More but Where? Dynamic Token Compression via LLM-Guided Keyframe Prior NEURIPS2025

Link: https://arxiv.org/abs/2512.06866
Authors: Yulin Li, Haokun Gui, Ziyang Fan, Junjie Wang, Bin Kang, Bin Chen, Zhuotao Tian
Affiliations: Harbin Institute of Technology (Shenzhen); Shenzhen Loop Area Institute; University of Chinese Academy of Sciences; The Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: Accepted by NeurIPS 2025

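[NLP-61] AquaFusionNet: Lightweight Vision-Sensor Fusion Framework for Real-Time Pathogen Detection and Water Quality Anomaly Prediction on Edge Devices

[Quick read]: This paper addresses the fragmentary detection of microbial contamination events in small-scale drinking water systems and the resulting unreliability of real-time decisions, since existing monitoring tools fail to capture how contamination fluctuates. The core challenge is to combine two heterogeneous information sources, microscopic imaging (organism-level information) and physicochemical sensor data (short-term changes in water chemistry), for accurate, low-power, on-site decisions. The key to the solution is AquaFusionNet, a lightweight cross-modal framework that uses a gated cross-attention mechanism designed for low-power hardware to learn the statistical dependencies between microbial appearance and concurrent sensor dynamics, unifying both sources in a single edge-deployable model. Over a six-month deployment at seven facilities in East Java, Indonesia, the system detected contamination events with 94.8% mAP@0.5 and reached 96.3% anomaly prediction accuracy while drawing 4.8 W, and cross-modal coupling reduced common failure modes of unimodal detectors under fouling, turbidity spikes, and inconsistent illumination.

Link: https://arxiv.org/abs/2512.06848
Authors: Sepyan Purnama Kristanto, Lutfi Hakim, Hermansyah
Affiliations: Politeknik Negeri Banyuwangi; Balai Besar Teknik Kesehatan Lingkungan dan P2B Surabaya
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Notes: 9 pages, 3 figures, Politeknik Negeri Banyuwangi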

Abstract:Evidence from many low- and middle-income regions shows that microbial contamination in small-scale drinking water systems often fluctuates rapidly, yet existing monitoring tools capture only fragments of this behaviour. Microscopic imaging provides organism-level visibility, whereas physicochemical sensors reveal short-term changes in water chemistry; in practice, operators must interpret these streams separately, making real-time decision-making unreliable. This study introduces AquaFusionNet, a lightweight cross-modal framework that unifies both information sources inside a single edge-deployable model. Unlike prior work that treats microscopic detection and water quality prediction as independent tasks, AquaFusionNet learns the statistical dependencies between microbial appearance and concurrent sensor dynamics through a gated cross-attention mechanism designed specifically for low-power hardware. The framework is trained on AquaMicro12K, a new dataset comprising 12,846 annotated 1000 micrographs curated for drinking water contexts, an area where publicly accessible microscopic datasets are scarce. Deployed for six months across seven facilities in East Java, Indonesia, the system processed 1.84 million frames and consistently detected contamination events with 94.8% mAP@0.5 and 96.3% anomaly prediction accuracy, while operating at 4.8 W on a Jetson Nano. Comparative experiments against representative lightweight detectors show that AquaFusionNet provides higher accuracy at comparable or lower power, and field results indicate that cross-modal coupling reduces common failure modes of unimodal detectors, particularly under fouling, turbidity spikes, and inconsistent illumination. All models, data, and hardware designs are released openly to facilitate replication and adaptation in decentralized water safety infrastructures.

[NLP-62] CAuSE: Decoding Multimodal Classifiers using Faithful Natural Language Explanation ACL

Link: https://arxiv.org/abs/2512.06814
Authors: Dibyanayan Bandyopadhyay, Soham Bhattacharjee, Mohammed Hasanuzzaman, Asif Ekbal
Affiliations: Indian Institute of Technology Patna; Queen’s University Belfast
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Accepted at Transactions of the Association for Computational Linguistics (TACL). Pre-MIT Press publication version

[NLP-63] Large Language Model-Based Generation of Discharge Summaries

Link: https://arxiv.org/abs/2512.06812
Authors: Tiago Rodrigues, Carla Teixeira Lopes
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Notes: 17 pages, 6 figures

[NLP-64] MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning

Link: https://arxiv.org/abs/2512.06810
Authors: Yueqian Wang, Songxiang Liu, Disong Wang, Nuo Xu, Guanglu Wan, Huishuai Zhang, Dongyan Zhao
Affiliations: Wangxuan Institute of Computer Technology, Peking University; State Key Laboratory of General Artificial Intelligence; Meituan
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Notes:

[NLP-65] LLM4SFC: Sequential Function Chart Generation via Large Language Models

Link: https://arxiv.org/abs/2512.06787
Authors: Ofek Glick, Vladimir Tchuiev, Marah Ghoummaid, Michal Moshkovitz, Dotan Di-Castro
Affiliations: Bosch Research
Categories: Computation and Language (cs.CL)
Notes:

[NLP-66] From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs

Link: https://arxiv.org/abs/2512.06776
Authors: Yuchuan Tian, Yuchen Liang, Jiacheng Sun, Shuo Zhang, Guangwen Yang, Yingte Shu, Sibo Fang, Tianyu Guo, Kai Han, Chao Xu, Hanting Chen, Xinghao Chen, Yunhe Wang
Affiliations: Peking University; Huawei Technologies
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 13 pages, 4 figures

[NLP-67] Becoming Experienced Judges: Selective Test-Time Learning for Evaluators

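[Quick read]: This paper addresses two limitations of automatic evaluation with large language models (LLMs) in deployment. First, evaluators usually treat each sample independently and so never accumulate experience. Second, they apply one fixed prompt to every sample, ignoring that different samples may call for different evaluation criteria. The core of the solution is the Learning While Evaluating (LWE) framework, which maintains an evolving meta-prompt that both produces sample-specific evaluation instructions and refines itself through self-generated feedback. The paper further proposes Selective LWE, which updates the meta-prompt only on self-inconsistent cases, keeping the benefits of sequential learning at much lower compute cost. Experiments show that the method learns efficiently from hard samples and outperforms strong baselines.

Link: https://arxiv.org/abs/2512.06751
Authors: Seungyeon Jwa, Daechul Ahn, Reokyoung Kim, Dongyeop Kang, Jonghyun Choi
Affiliations: Seoul National University; University of Minnesota
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: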

Abstract:Automatic evaluation with large language models, commonly known as LLM-as-a-judge, is now standard across reasoning and alignment tasks. Despite evaluating many samples in deployment, these evaluators typically (i) treat each case independently, missing the opportunity to accumulate experience, and (ii) rely on a single fixed prompt for all cases, neglecting the need for sample-specific evaluation criteria. We introduce Learning While Evaluating (LWE), a framework that allows evaluators to improve sequentially at inference time without requiring training or validation sets. LWE maintains an evolving meta-prompt that (i) produces sample-specific evaluation instructions and (ii) refines itself through self-generated feedback. Furthermore, we propose Selective LWE, which updates the meta-prompt only on self-inconsistent cases, focusing computation where it matters most. This selective approach retains the benefits of sequential learning while being far more cost-effective. Across two pairwise comparison benchmarks, Selective LWE outperforms strong baselines, empirically demonstrating that evaluators can improve during sequential testing with a simple selective update, learning most from the cases they struggle with.
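
A minimal sketch of the selective update loop, assuming a pairwise-comparison setting. The names `selective_lwe`, `toy_judge`, and `toy_refine` are hypothetical, and the inconsistency check used here (judging each pair in both presentation orders) is an assumption: the abstract does not say how self-inconsistency is detected. In practice the two callables would wrap LLM calls.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]                   # (response_a, response_b)
Judge = Callable[[str, str, str], str]   # (meta_prompt, first, second) -> "first" | "second"
Refine = Callable[[str, Pair], str]      # (meta_prompt, pair) -> updated meta_prompt


def selective_lwe(pairs: Iterable[Pair], judge: Judge, refine: Refine,
                  meta_prompt: str) -> Tuple[List[str], str, int]:
    """Judge pairs in sequence, refining the meta-prompt only on inconsistent cases."""
    verdicts: List[str] = []
    updates = 0
    for a, b in pairs:
        # Assumed inconsistency check: judge both presentation orders.
        forward = "a" if judge(meta_prompt, a, b) == "first" else "b"
        backward = "b" if judge(meta_prompt, b, a) == "first" else "a"
        if forward != backward:
            # Spend the extra computation only where the evaluator disagrees with itself.
            meta_prompt = refine(meta_prompt, (a, b))
            updates += 1
            forward = "a" if judge(meta_prompt, a, b) == "first" else "b"
        verdicts.append(forward)
    return verdicts, meta_prompt, updates


# Toy stand-ins for LLM calls, used only to exercise the loop.
def toy_judge(meta_prompt: str, first: str, second: str) -> str:
    if "ignore order" not in meta_prompt:
        return "first"  # a position-biased judge
    return "first" if len(first) >= len(second) else "second"


def toy_refine(meta_prompt: str, pair: Pair) -> str:
    return meta_prompt + " ignore order"
```

With the toy judge, the first pair exposes the position bias and triggers one refinement; later pairs are judged consistently, so no further updates occur.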

[NLP-68] One Word Is Not Enough: Simple Prompts Improve Word Embeddings

Link: https://arxiv.org/abs/2512.06744
Authors: Rajeev Ranjan
Affiliations: GodelLabs
Categories: Computation and Language (cs.CL)
Notes:

[NLP-69] Arc Gradient Descent: A Mathematically Derived Reformulation of Gradient Descent with Phase-Aware User-Controlled Step Dynamics

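[Quick read]: This paper addresses common problems in deep learning optimization: unstable convergence, sensitivity to hyperparameters, and poor performance on high-dimensional non-convex problems. Its core contribution is a new optimizer, ArcGD, whose key innovation is described as an adaptive step-size mechanism based on arc length along the gradient direction. The optimizer is compared with Adam on a stochastic Rosenbrock function at dimensions up to 50,000, under two configurations chosen to eliminate learning-rate bias. On CIFAR-10 it achieves the highest average test accuracy among Adam, AdamW, Lion, and SGD, and it keeps improving under extended training where Adam and AdamW regress. A variant of ArcGD can also be interpreted as a special case of the Lion optimizer, which points to connections between the underlying mechanisms of these methods.

Link: https://arxiv.org/abs/2512.06737
Authors: Nikhil Verma, Joonas Linnosmaa, Espinosa-Leal Leonardo, Napat Vajragupta
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Notes: 80 pages, 6 tables, 2 figures, 5 appendices, proof-of-concept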

Abstract:The paper presents the formulation, implementation, and evaluation of the ArcGD optimiser. The evaluation is conducted initially on a non-convex benchmark function and subsequently on a real-world ML dataset. The initial comparative study using the Adam optimiser is conducted on a stochastic variant of the highly non-convex and notoriously challenging Rosenbrock function, renowned for its narrow, curved valley, across dimensions ranging from 2D to 1000D and an extreme case of 50,000D. Two configurations were evaluated to eliminate learning-rate bias: (i) both using ArcGD’s effective learning rate and (ii) both using Adam’s default learning rate. ArcGD consistently outperformed Adam under the first setting and, although slower under the second, achieved superior final solutions in most cases. In the second evaluation, ArcGD is evaluated against state-of-the-art optimizers (Adam, AdamW, Lion, SGD) on the CIFAR-10 image classification dataset across 8 diverse MLP architectures ranging from 1 to 5 hidden layers. ArcGD achieved the highest average test accuracy (50.7%) at 20,000 iterations, outperforming AdamW (46.6%), Adam (46.8%), SGD (49.6%), and Lion (43.4%), winning or tying on 6 of 8 architectures. Notably, Adam and AdamW showed strong early convergence at 5,000 iterations but regressed with extended training, whereas ArcGD continued improving, demonstrating generalization and resistance to overfitting without requiring early stopping tuning. Strong performance on geometric stress tests and standard deep-learning benchmarks indicates broad applicability, highlighting the need for further exploration. Moreover, it is also shown that a variant of ArcGD can be interpreted as a special case of the Lion optimiser, highlighting connections between the inherent mechanisms of such optimisation methods.
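
For readers unfamiliar with the benchmark, here is a sketch of the standard N-dimensional Rosenbrock function and its analytic gradient. This is the textbook deterministic form; the paper's stochastic variant and the ArcGD update rule are not specified in the abstract and are not reproduced here. The function names are illustrative.

```python
from typing import List


def rosenbrock(x: List[float]) -> float:
    """Sum over i of 100 * (x[i+1] - x[i]^2)^2 + (1 - x[i])^2; minimum 0 at all ones."""
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
               for i in range(len(x) - 1))


def rosenbrock_grad(x: List[float]) -> List[float]:
    """Analytic gradient of the function above."""
    grad = [0.0] * len(x)
    for i in range(len(x) - 1):
        diff = x[i + 1] - x[i] ** 2
        # Each term of the sum contributes to coordinates i and i + 1.
        grad[i] += -400.0 * x[i] * diff - 2.0 * (1.0 - x[i])
        grad[i + 1] += 200.0 * diff
    return grad
```

The narrow, curved valley mentioned in the abstract follows the curve `x[i+1] = x[i]^2`, where the large first term vanishes and only the small `(1 - x[i])^2` term drives progress toward the minimum.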

[NLP-70] A Patient-Doctor-NLP-System to contest inequality for less privileged

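[Quick read]: This paper addresses the difficulty of deploying large language models (LLMs) in resource-constrained, real-world healthcare settings, and in particular the accessibility and usability barriers faced by visually impaired users and speakers of low-resource languages such as Hindi who seek medical help in rural environments. The key to the solution is PDFTEMRA (Performant Distilled Frequency Transformer Ensemble Model with Random Activations), a compact Transformer architecture that combines model distillation to shrink the model, frequency-domain modulation for efficient feature extraction, ensemble learning for robustness, and randomized activation patterns for better generalization. It preserves language understanding performance at substantially lower computational cost, which suits inclusive, low-resource medical NLP applications.

Link: https://arxiv.org/abs/2512.06734
Authors: Subrit Dikshit, Ritu Tiwari, Priyank Jain
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 19 pages, 6 figures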

Abstract:Transfer Learning (TL) has accelerated the rapid development and availability of large language models (LLMs) for mainstream natural language processing (NLP) use cases. However, training and deploying such gigantic LLMs in resource-constrained, real-world healthcare situations remains challenging. This study addresses the limited support available to visually impaired users and speakers of low-resource languages such as Hindi who require medical assistance in rural environments. We propose PDFTEMRA (Performant Distilled Frequency Transformer Ensemble Model with Random Activations), a compact transformer-based architecture that integrates model distillation, frequency-domain modulation, ensemble learning, and randomized activation patterns to reduce computational cost while preserving language understanding performance. The model is trained and evaluated on medical question-answering and consultation datasets tailored to Hindi and accessibility scenarios, and its performance is compared against standard NLP state-of-the-art model baselines. Results demonstrate that PDFTEMRA achieves comparable performance with substantially lower computational requirements, indicating its suitability for accessible, inclusive, low-resource medical NLP applications.

[NLP-71] “The Dentist is an involved parent the bartender is not”: Revealing Implicit Biases in QA with Implicit BBQ

Link: https://arxiv.org/abs/2512.06732
Authors: Aarushi Wagh, Saniya Srivastava
Affiliations: Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

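[NLP-72] The Role of Entropy in Visual Grounding: Analysis and Optimization

[Quick read]: This paper addresses the open question of how entropy behaves, and how it can be controlled effectively, in perception-oriented tasks such as visual grounding. Existing work on entropy control focuses mostly on reasoning tasks and offers no systematic analysis for tasks like visual grounding that depend on perceptual understanding. The authors propose ECVGPO (Entropy Control Visual Grounding Policy Optimization), whose core idea is an interpretable entropy control mechanism that regulates the entropy of the policy distribution to strike a better balance between exploration and exploitation, yielding broad improvements in visual grounding across multiple benchmarks and models.

Link: https://arxiv.org/abs/2512.06726
Authors: Shuo Li, Jiajun Sun, Zhihao Zhang, Xiaoran Fan, Senjie Jin, Hui Li, Yuming Yang, Junjie Ye, Lixing Shen, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
Affiliations: Fudan University; Hikvision Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: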

Abstract:Recent advances in fine-tuning multimodal large language models (MLLMs) using reinforcement learning have achieved remarkable progress, particularly with the introduction of various entropy control techniques. However, the role and characteristics of entropy in perception-oriented tasks like visual grounding, as well as effective strategies for controlling it, remain largely unexplored. To address this issue, we focus on the visual grounding task and analyze the role and characteristics of entropy in comparison to reasoning tasks. Building on these findings, we introduce ECVGPO (Entropy Control Visual Grounding Policy Optimization), an interpretable algorithm designed for effective entropy regulation. Through entropy control, the trade-off between exploration and exploitation is better balanced. Experiments show that ECVGPO achieves broad improvements across various benchmarks and models.
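
The quantity being regulated is, in the usual formulation, the Shannon entropy of the policy's next-token distribution. The sketch below shows only that generic definition; the function names are illustrative, and ECVGPO's actual regulation rule is not given in the abstract and is not reproduced here.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

A flat distribution has high entropy (more exploration), while a sharply peaked one has low entropy (more exploitation); entropy control methods steer training between these extremes.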

[NLP-73] ProAgent: Harnessing On-Demand Sensory Contexts for Proactive LLM Agent Systems

Link: https://arxiv.org/abs/2512.06721
Authors: Bufang Yang, Lilin Xu, Liekang Zeng, Yunqi Guo, Siyang Jiang, Wenrui Lu, Kaiwei Liu, Hancheng Xiang, Xiaofan Jiang, Guoliang Xing, Zhenyu Yan
Affiliations: The Chinese University of Hong Kong; Columbia University; Purdue University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Notes:

[NLP-74] Cognitive Control Architecture (CCA): A Lifecycle Supervision Framework for Robustly Aligned AI Agents

Link: https://arxiv.org/abs/2512.06716
Authors: Zhibo Liang, Tianze Hu, Zaiye Chen, Mingjie Tang
Affiliations: Sichuan University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Notes:

[NLP-75] Look Twice before You Leap: A Rational Agent Framework for Localized Adversarial Anonymization

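[Quick read]: This paper addresses the "privacy paradox" of current text anonymization frameworks built on large language models (LLMs): to obtain strong privacy protection, users must submit sensitive data to untrusted third-party remote APIs. It also shows that migrating these frameworks directly to local small-scale models (LSMs) causes a catastrophic drop in utility. The core finding is that this failure stems not only from the limited capability of LSMs but, more fundamentally, from the inherent irrationality of the greedy adversarial strategies used by current state-of-the-art (SoTA) methods. The key to the solution is Rational Localized Adversarial Anonymization (RLAA), a fully local, training-free framework. Its core innovation is an arbitrator that acts as a rationality gatekeeper, validating the attacker's inference and filtering out feedback that contributes negligibly to privacy. This enforces a rational early-stopping criterion and systematically prevents utility collapse. RLAA achieves the best privacy-utility trade-off across several datasets and in some cases surpasses existing SoTA methods on the Pareto frontier.

Link: https://arxiv.org/abs/2512.06713
Authors: Donghang Duan, Xu Zheng
Affiliations: University of Electronic Science and Technology of China
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Notes: 16 pages, 6 figures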

Abstract:Current LLM-based text anonymization frameworks usually rely on remote API services from powerful LLMs, which creates an inherent “privacy paradox”: users must somehow disclose data to untrusted third parties for superior privacy preservation. Moreover, directly migrating these frameworks to local small-scale models (LSMs) offers a suboptimal solution with catastrophic collapse in utility based on our core findings. Our work argues that this failure stems not merely from the capability deficits of LSMs, but from the inherent irrationality of the greedy adversarial strategies employed by current state-of-the-art (SoTA) methods. We model the anonymization process as a trade-off between Marginal Privacy Gain (MPG) and Marginal Utility Cost (MUC), and demonstrate that greedy strategies inevitably drift into an irrational state. To address this, we propose Rational Localized Adversarial Anonymization (RLAA), a fully localized and training-free framework featuring an Attacker-Arbitrator-Anonymizer (A-A-A) architecture. RLAA introduces an arbitrator that acts as a rationality gatekeeper, validating the attacker’s inference to filter out feedback providing negligible benefits on privacy preservation. This mechanism enforces a rational early-stopping criterion, and systematically prevents utility collapse. Extensive experiments on different datasets demonstrate that RLAA achieves the best privacy-utility trade-off, and in some cases even outperforms SoTA on the Pareto principle. Our code and datasets will be released upon acceptance.
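
The arbitrator's gatekeeping role can be illustrated with a small simulation. Everything named here is hypothetical: the `Edit` record, the `run_rounds` function, and the rule "keep feedback only if its estimated Marginal Privacy Gain exceeds its Marginal Utility Cost" are assumptions made for illustration, since the abstract does not give the exact criterion or how MPG and MUC are estimated.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Edit:
    description: str
    privacy_gain: float   # assumed estimate of Marginal Privacy Gain (MPG)
    utility_cost: float   # assumed estimate of Marginal Utility Cost (MUC)


def run_rounds(rounds: List[List[Edit]]) -> Tuple[List[Edit], int]:
    """Apply attacker feedback round by round; stop when the arbitrator keeps nothing."""
    applied: List[Edit] = []
    for index, feedback in enumerate(rounds):
        # Arbitrator: discard feedback whose privacy benefit does not justify its cost.
        kept = [e for e in feedback if e.privacy_gain > e.utility_cost]
        if not kept:
            # Rational early stop: further edits would only erode utility.
            return applied, index
        applied.extend(kept)
    return applied, len(rounds)
```

A greedy loop would apply every suggested edit in every round; the filter plus early stop is what keeps utility from collapsing in this toy model.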

[NLP-76] Parameter-Efficient Fine-Tuning with Differential Privacy for Robust Instruction Adaptation in Large Language Models

Link: https://arxiv.org/abs/2512.06711
Authors: Yulin Huang, Yaxuan Luan, Jinxu Guo, Xiangchen Song, Yuchen Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Notes:

[NLP-77] TopiCLEAR: Topic extraction by CLustering Embeddings with Adaptive dimensional Reduction

Link: https://arxiv.org/abs/2512.06694
Authors: Aoi Fujita, Taichi Yamamoto, Yuri Nakayama, Ryota Kobayashi
Affiliations: Graduate School of Frontier Sciences, The University of Tokyo; Mathematics and Informatics Center, The University of Tokyo
Categories: Computation and Language (cs.CL)
Notes: 15 pages, 4 figures, code available at this https URL

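[NLP-78] Think-While-Generating: On-the-Fly Reasoning for Personalized Long-Form Generation

[Quick read]: This paper addresses the efficiency and adaptability problems that large language models (LLMs) face in personalized long-form generation. Existing methods mostly optimize for population-level preferences and miss the implicit preferences of individual users. "Think-then-generate" methods improve reasoning, but in long-form generation their static, one-shot reasoning cannot adapt as the content evolves, which makes learning difficult and limits adaptability. The key to the solution is FlyThinker, a "think-while-generating" framework. A separate reasoning model produces latent token-level reasoning in parallel, and this reasoning is fused into the generation model to guide the response dynamically, so reasoning and generation run concurrently. Training parallelism is preserved because the reasoning model depends only on previously generated response tokens rather than on its own prior outputs, so all reasoning tokens can be produced in a single forward pass. The design balances personalized generation quality, training efficiency, and inference speed.

Link: https://arxiv.org/abs/2512.06690
Authors: Chengbing Wang, Yang Zhang, Wenjie Wang, Xiaoyan Zhao, Fuli Feng, Xiangnan He, Tat-Seng Chua
Affiliations: University of Science and Technology of China; National University of Singapore; The Chinese University of Hong Kong
Categories: Computation and Language (cs.CL)
Notes: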

Abstract:Preference alignment has enabled large language models (LLMs) to better reflect human expectations, but current methods mostly optimize for population-level preferences, overlooking individual users. Personalization is essential, yet early approaches, such as prompt customization or fine-tuning, struggle to reason over implicit preferences, limiting real-world effectiveness. Recent “think-then-generate” methods address this by reasoning before response generation. However, they face challenges in long-form generation: their static one-shot reasoning must capture all relevant information for the full response generation, making learning difficult and limiting adaptability to evolving content. To address this issue, we propose FlyThinker, an efficient “think-while-generating” framework for personalized long-form generation. FlyThinker employs a separate reasoning model that generates latent token-level reasoning in parallel, which is fused into the generation model to dynamically guide response generation. This design enables reasoning and generation to run concurrently, ensuring inference efficiency. In addition, the reasoning model is designed to depend only on previous responses rather than its own prior outputs, which preserves training parallelism across different positions, allowing all reasoning tokens for training data to be produced in a single forward pass like standard LLM training, ensuring training efficiency. Extensive experiments on real-world benchmarks demonstrate that FlyThinker achieves better personalized generation while keeping training and inference efficiency.

[NLP-79] PersonaMem-v2: Towards Personalized Intelligence via Learning Implicit User Personas and Agentic Memory

Link: https://arxiv.org/abs/2512.06688
Authors: Bowen Jiang,Yuan Yuan,Maohao Shen,Zhuoqun Hao,Zhangchen Xu,Zichen Chen,Ziyi Liu,Anvesh Rao Vijjini,Jiashu He,Hanchao Yu,Radha Poovendran,Gregory Wornell,Lyle Ungar,Dan Roth,Sihao Chen,Camillo Jose Taylor
Affiliations: University of Pennsylvania; Massachusetts Institute of Technology; University of Washington; University of California Santa Barbara; Meta; University of Southern California; University of North Carolina at Chapel Hill; Microsoft Corporation
Categories: Computation and Language (cs.CL)
Comments: Data is available at this https URL

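[NLP-80] Mechanistic Interpretability of GPT-2: Lexical and Contextual Layers in Sentiment Analysis

【Quick Read】: This paper addresses the interpretability of how Generative AI models process sentiment information, specifically testing whether GPT-2 follows a hypothesized two-stage sentiment architecture: early lexical detection followed by mid-layer contextual integration. The key to the approach is systematic activation patching, with causal intervention experiments across all 12 layers of GPT-2, to identify each layer's role in processing sentiment signals. The study finds that early layers (0–3) do act as lexical sentiment detectors, while middle layers do not show the predicted modular integration. Instead, complex contextual phenomena (such as negation, sarcasm, and domain shifts) are integrated mainly in late layers (8–11) through a unified, non-modular mechanism. This result falsifies the assumed hierarchical architecture and points to a different causal pathway for contextual integration in language models.

Link: https://arxiv.org/abs/2512.06681
Authors: Amartya Hatua
Affiliations: Fidelity Investments
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: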

Abstract:We present a mechanistic interpretability study of GPT-2 that causally examines how sentiment information is processed across its transformer layers. Using systematic activation patching across all 12 layers, we test the hypothesized two-stage sentiment architecture comprising early lexical detection and mid-layer contextual integration. Our experiments confirm that early layers (0-3) act as lexical sentiment detectors, encoding stable, position-specific polarity signals that are largely independent of context. However, all three contextual integration hypotheses (Middle Layer Concentration, Phenomenon Specificity, and Distributed Processing) are falsified. Instead of mid-layer specialization, we find that contextual phenomena such as negation, sarcasm, and domain shifts are integrated primarily in late layers (8-11) through a unified, non-modular mechanism. These experimental findings provide causal evidence that GPT-2’s sentiment computation differs from the predicted hierarchical pattern, highlighting the need for further empirical characterization of contextual integration in large language models.

[NLP-81] CMV-Fuse: Cross Modal-View Fusion of AMR Syntax and Knowledge Representations for Aspect Based Sentiment Analysis

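【Quick Read】: This paper addresses the problem that current Aspect-Based Sentiment Analysis (ABSA) systems typically rely on isolated linguistic views (such as syntax or semantics) and overlook the interplay of multiple perspectives that humans use when understanding natural language. The key to the solution is the CMV-Fuse framework, which systematically fuses four complementary linguistic perspectives (Abstract Meaning Representations, constituency parsing, dependency syntax, and semantic attention) and adds external knowledge integration. Hierarchical gated attention fuses the views at three levels (local syntactic, intermediate semantic, and global knowledge), capturing both fine-grained structural patterns and broad context. A structure-aware multi-view contrastive learning mechanism keeps the complementary representations consistent while maintaining computational efficiency, which improves the robustness and accuracy of sentiment analysis.

Link: https://arxiv.org/abs/2512.06679
Authors: Smitha Muthya Sudheendra,Mani Deep Cherukuri,Jaideep Srivastava
Affiliations: University of Minnesota, Twin Cities
Categories: Computation and Language (cs.CL)
Comments: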

Abstract:Natural language understanding inherently depends on integrating multiple complementary perspectives spanning from surface syntax to deep semantics and world knowledge. However, current Aspect-Based Sentiment Analysis (ABSA) systems typically exploit isolated linguistic views, thereby overlooking the intricate interplay between structural representations that humans naturally leverage. We propose CMV-Fuse, a Cross-Modal View fusion framework that emulates human language processing by systematically combining multiple linguistic perspectives. Our approach orchestrates four linguistic perspectives: Abstract Meaning Representations, constituency parsing, dependency syntax, and semantic attention, enhanced with external knowledge integration. Through hierarchical gated attention fusion across local syntactic, intermediate semantic, and global knowledge levels, CMV-Fuse captures both fine-grained structural patterns and broad contextual understanding. A novel structure-aware multi-view contrastive learning mechanism ensures consistency across complementary representations while maintaining computational efficiency. Extensive experiments demonstrate substantial improvements over strong baselines on standard benchmarks, with analysis revealing how each linguistic view contributes to more robust sentiment analysis.

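[NLP-82] The Online Discourse of Virtual Reality and Anxiety

【Quick Read】: This paper asks how analyzing what users say online about virtual reality (VR) as a treatment for anxiety disorders can support the efficacy of VR in clinical practice. The key to the approach is a corpus linguistics methodology: Sketch Engine software is used to identify frequent words and their collocations related to VR and anxiety, revealing how public discourse frames the design, experience, and development of VR technology. The findings are intended to inform future support for counseling needs through improved development and accessibility of VR systems.

Link: https://arxiv.org/abs/2512.06656
Authors: Kwabena Yamoah,Cass Dykeman
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computation (stat.CO)
Comments: Three tables and two figures. Unfortunately, I did not formally register the dataset prior to conducting the analysis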

Abstract:VR is used in the treatment of clinical concerns such as generalized anxiety disorder or social anxiety. VR has created additional pathways to support patient well-being and care. Understanding online discussion of what users think about this technology may further support its efficacy. The purpose of this study was to employ a corpus linguistic methodology to identify the words and word networks that shed light on the online discussion of virtual reality and anxiety. Using corpus linguistics, frequently used words in the discussion, along with their collocations, were identified using Sketch Engine software. The results of the study, based upon the English Trends corpus, identified VR, Oculus, and headset as the most frequently discussed terms within the VR and anxiety subcorpus. These results point to the development of the virtual system, along with the physical apparatus that makes viewing and engaging with the virtual environment possible. Additional results point to collocation of prepositional phrases such as of virtual reality, in virtual reality, and for virtual reality relating to the design, experience, and development, respectively. These findings offer a new perspective on how VR and anxiety together are discussed in general discourse and offer pathways for future opportunities to support counseling needs through development and accessibility. Keywords: anxiety disorders, corpus linguistics, Sketch Engine, virtual reality (VR)
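The frequency and collocation analysis described in the abstract above can be sketched in a few lines of Python. This is only an illustration using the standard library, not Sketch Engine's interface; the function `left_collocates` and the tiny sample text are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def left_collocates(tokens: List[str], node: Tuple[str, ...], top: int = 3):
    """Count the words that appear immediately before the phrase `node`."""
    n = len(node)
    counts = Counter(
        tokens[i - 1]
        for i in range(1, len(tokens) - n + 1)
        if tuple(tokens[i:i + n]) == node
    )
    return counts.most_common(top)


# Made-up sample text standing in for a real corpus.
text = ("the design of virtual reality matters . anxiety in virtual reality "
        "drops . a headset for virtual reality . immersion in virtual reality helps")
tokens = text.lower().split()

print(Counter(tokens).most_common(2))                    # most frequent words
print(left_collocates(tokens, ("virtual", "reality")))   # words preceding the phrase
```

On a real corpus the same counts would normally be paired with an association measure, since raw frequency favors common function words.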

[NLP-83] An Index-based Approach for Efficient and Effective Web Content Extraction

Link: https://arxiv.org/abs/2512.06641
Authors: Yihan Chen,Benfeng Xu,Xiaorui Wang,Zhendong Mao
Affiliations: University of Science and Technology of China; Metastone Technology
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

[NLP-84] A Fast and Effective Solution to the Problem of Look-ahead Bias in LLMs

Link: https://arxiv.org/abs/2512.06607
Authors: Humzah Merchant,Bradford Levy
Affiliations: University of Chicago
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

[NLP-85] Adapting AlignScore Mertic for Factual Consistency Evaluation of Text in Russian: A Student Abstract

Link: https://arxiv.org/abs/2512.06586
Authors: Mikhail Zimin,Milyausha Shamsutdinova,Georgii Andriushchenko
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

[NLP-86] ProSocialAlign: Preference Conditioned Test Time Alignment in Language Models

Link: https://arxiv.org/abs/2512.06515
Authors: Somnath Banerjee,Sayan Layek,Sayantan Adak,Mykola Pechenizkiy,Animesh Mukherjee,Rima Hazra
Affiliations: Cisco Research; Indian Institute of Technology Kharagpur; Eindhoven University of Technology, Netherlands (TU/e)
Categories: Computation and Language (cs.CL)
Comments:

[NLP-87] Classifying German Language Proficiency Levels Using Large Language Models

Link: https://arxiv.org/abs/2512.06483
Authors: Elias-Leander Ahlers,Witold Brunsmann,Malte Schilling
Affiliations: University of Münster
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at 3rd International Conference on Foundation and Large Language Models (FLLM2025), Vienna (Austria)

[NLP-88] Knowing Whats Missing: Assessing Information Sufficiency in Question Answering

Link: https://arxiv.org/abs/2512.06476
Authors: Akriti Jain,Aparna Garimella
Affiliations: Adobe Research
Categories: Computation and Language (cs.CL)
Comments:

[NLP-89] Modeling Contextual Passage Utility for Multihop Question Answering AACL2025

Link: https://arxiv.org/abs/2512.06464
Authors: Akriti Jain,Aparna Garimella
Affiliations: Adobe Research
Categories: Computation and Language (cs.CL)
Comments: Accepted at IJCNLP-AACL 2025

[NLP-90] Rethinking Training Dynamics in Scale-wise Autoregressive Generation

Link: https://arxiv.org/abs/2512.06421
Authors: Gengze Zhou,Chongjian Ge,Hao Tan,Feng Liu,Yicong Hong
Affiliations: University of Adelaide; Adobe
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

[NLP-91] Less Is More for Multi-Step Logical Reasoning of LLM Generalisation Under Rule Removal Paraphrasing and Compression

Link: https://arxiv.org/abs/2512.06393
Authors: Qiming Bao,Xiaoxuan Fu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments:

[NLP-92] LLM-Upgraded Graph Reinforcement Learning for Carbon-Aware Job Scheduling in Smart Manufacturing

Link: https://arxiv.org/abs/2512.06351
Authors: Zhiying Yang,Fang Liu,Wei Zhang,Xin Lou,Malcolm Yoke Hean Low,Boon Ping Gan
Affiliations: Singapore Institute of Technology; Singapore University of Social Sciences; D-SIMLAB Technologies
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

[NLP-93] Why They Disagree: Decoding Differences in Opinions about AI Risk on the Lex Fridman Podcast

Link: https://arxiv.org/abs/2512.06350
Authors: Nghi Truong,Phanish Puranam,Özgecan Koçak
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

[NLP-94] When Distance Distracts: Representation Distance Bias in BT-Loss for Reward Models

Link: https://arxiv.org/abs/2512.06343
Authors: Tong Xie,Andrew Bai,Yuanhao Ban,Yunqi Hong,Haoyu Li,Cho-jui Hsieh
Affiliations: University of California, Los Angeles; University of Illinois Urbana-Champaign
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

[NLP-95] Nanbeige4-3B Technical Report: Exploring the Frontier of Small Language Models

Link: https://arxiv.org/abs/2512.06266
Authors: Chen Yang,Guangyue Peng,Jiaying Zhu,Ran Le,Ruixiang Feng,Tao Zhang,Wei Ruan,Xiaoqi Liu,Xiaoxue Cheng,Xiyun Xu,Yang Song,Yanzipeng Gao,Yiming Jia,Yun Xing,Yuntao Wen,Zekai Wang,Zhenwei An,Zhicong Sun,Zongchao Chen
Affiliations: Nanbeige LLM Lab; Boss Zhipin
Categories: Computation and Language (cs.CL)
Comments:

[NLP-96] Convergence of Outputs When Two Large Language Models Interact in a Multi-Agentic Setup

Link: https://arxiv.org/abs/2512.06256
Authors: Aniruddha Maiti,Satya Nimmagadda,Kartha Veerya Jammuladinne,Niladri Sengupta,Ananya Jana
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: accepted to LLM 2025

[NLP-97] LOCUS: A System and Method for Low-Cost Customization for Universal Specialization

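【Quick Read】: This paper addresses the high cost and resource consumption of customizing natural language processing (NLP) models for specific tasks, especially how to build high-performing models efficiently in few-shot settings. The key to the solution is the LOCUS (LOw-cost Customization for Universal Specialization) pipeline: it first uses targeted retrieval to discover relevant data in a large repository, then synthesizes additional training samples via in-context data generation, and finally fine-tunes the model with either full fine-tuning or low-rank adaptation (LoRA). On named entity recognition (NER) and text classification (TC) tasks, the method outperforms strong baselines (including GPT-4o) while substantially reducing model size and memory footprint: the memory-optimized models retain 99% of fully fine-tuned accuracy while using only about 5% of the memory footprint, and they beat GPT-4o on several benchmarks with less than 1% of its parameters.

Link: https://arxiv.org/abs/2512.06239
Authors: Dhanasekar Sundararaman,Keying Li,Wayne Xiong,Aashna Garg
Affiliations: Microsoft
Categories: Computation and Language (cs.CL)
Comments: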

Abstract:We present LOCUS (LOw-cost Customization for Universal Specialization), a pipeline that consumes few-shot data to streamline the construction and training of NLP models through targeted retrieval, synthetic data generation, and parameter-efficient tuning. With only a small number of labeled examples, LOCUS discovers pertinent data in a broad repository, synthesizes additional training samples via in-context data generation, and fine-tunes models using either full or low-rank (LoRA) parameter adaptation. Our approach targets named entity recognition (NER) and text classification (TC) benchmarks, consistently outperforming strong baselines (including GPT-4o) while substantially lowering costs and model sizes. Our resultant memory-optimized models retain 99% of fully fine-tuned accuracy while using barely 5% of the memory footprint, also beating GPT-4o on several benchmarks with less than 1% of its parameters.
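To make the low-rank (LoRA) option mentioned above concrete, the following Python sketch counts how many parameters LoRA trains for one weight matrix. LoRA keeps the original matrix W frozen and learns two small matrices B and A whose product is added to W. The matrix size and rank below are arbitrary illustrative values, not settings from the paper, and the resulting fraction is unrelated to the memory figures the abstract reports.

```python
def lora_trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Fraction of a d_out x d_in weight matrix's parameter count that LoRA trains.

    LoRA freezes W and learns B (d_out x rank) and A (rank x d_in),
    so the adapted weight is W + B @ A.
    """
    full = d_out * d_in              # parameters in the frozen matrix W
    lora = rank * (d_out + d_in)     # parameters in B and A combined
    return lora / full


# Hypothetical layer: a 4096 x 4096 projection adapted with rank 8.
print(f"{lora_trainable_fraction(4096, 4096, 8):.4%}")
```

The fraction shrinks as the matrix grows, because the trainable count scales with the sum of the dimensions rather than their product.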

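[NLP-98] Policy-based Sentence Simplification: Replacing Parallel Corpora with LLM-as-a-Judge

【Quick Read】: This paper addresses the challenge of policy-driven control in sentence simplification, that is, how to adapt simplification behavior to different policies (such as replacing only complex words or rewriting the whole sentence). The key to the solution is to use a large language model as a judge (LLM-as-a-Judge) to automatically construct training data aligned with a simplification policy, which removes the need for costly human annotation or parallel corpora and allows adaptation to diverse policies. The method is robust across model families and sizes: small open-source models (such as Phi-3-mini-3.8B) surpass GPT-4o on lexical-oriented simplification and achieve comparable performance on overall rewriting.

Link: https://arxiv.org/abs/2512.06228
Authors: Xuanxin Wu,Yuki Arase,Masaaki Nagata
Affiliations: The University of Osaka; Institute of Science Tokyo; NTT Inc.
Categories: Computation and Language (cs.CL)
Comments: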

Abstract:Sentence simplification aims to modify a sentence to make it easier to read and understand while preserving the meaning. Different applications require distinct simplification policies, such as replacing only complex words at the lexical level or rewriting the entire sentence while trading off details for simplicity. However, achieving such policy-driven control remains an open challenge. In this work, we introduce a simple yet powerful approach that leverages Large Language Model-as-a-Judge (LLM-as-a-Judge) to automatically construct policy-aligned training data, completely removing the need for costly human annotation or parallel corpora. Our method enables building simplification systems that adapt to diverse simplification policies. Remarkably, even small-scale open-source LLMs such as Phi-3-mini-3.8B surpass GPT-4o on lexical-oriented simplification, while achieving comparable performance on overall rewriting, as verified by both automatic metrics and human evaluations. The consistent improvements across model families and sizes demonstrate the robustness of our approach.

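[NLP-99] Automated Data Enrichment using Confidence-Aware Fine-Grained Debate among Open-Source LLMs for Mental Health and Online Safety

【Quick Read】: This paper addresses the problem that real-world indicators (such as life events relevant to mental health and risky behaviour relevant to online safety) are costly and difficult to label in natural language processing (NLP) training data and hard to keep up to date. The core solution is a novel Confidence-Aware Fine-Grained Debate (CFD) framework in which multiple large language model (LLM) agents simulate human annotators and exchange fine-grained evidence in a debate to reach consensus, enabling data enrichment. On two new expert-annotated datasets (a mental health Reddit wellbeing dataset and an online safety Facebook sharenting risk dataset), the method gives the most robust enrichment performance among a range of baselines. Enriched features incorporated via debate transcripts yield the largest gains, outperforming the non-enriched baseline by 10.1% on the online safety task.

Link: https://arxiv.org/abs/2512.06227
Authors: Junyu Mao,Anthony Hills,Talia Tseriotou,Maria Liakata,Aya Shamir,Dan Sayda,Dana Atzil-Slonim,Natalie Djohari,Arpan Mandal,Silke Roth,Pamela Ugwudike,Mahesan Niranjan,Stuart E. Middleton
Affiliations: University of Southampton, UK; Queen Mary University of London, UK; Bar Ilan University, Israel
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: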

Abstract:Real-world indicators are important for improving natural language processing (NLP) tasks such as life events for mental health analysis and risky behaviour for online safety, yet labelling such information in NLP training datasets is often costly and/or difficult given the dynamic nature of such events. This paper compares several LLM-based data enrichment methods and introduces a novel Confidence-Aware Fine-Grained Debate (CFD) framework in which multiple LLM agents simulate human annotators and exchange fine-grained evidence to reach consensus. We describe two new expert-annotated datasets, a mental health Reddit wellbeing dataset and an online safety Facebook sharenting risk dataset. Our CFD framework achieves the most robust data enrichment performance compared to a range of baselines and we show that this type of data enrichment consistently improves downstream tasks. Enriched features incorporated via debate transcripts yield the largest gains, outperforming the non-enriched baseline by 10.1% for the online safety task.
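One ingredient of the framework above, letting annotator confidence influence the consensus, can be illustrated with a simple confidence-weighted vote. This sketch is a deliberate simplification: it leaves out the exchange of fine-grained evidence between agents, and the function name `weighted_consensus`, the labels, and the confidence values are hypothetical rather than the paper's procedure.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def weighted_consensus(votes: List[Tuple[str, float]]) -> Tuple[str, float]:
    """Pick the label with the highest total confidence.

    votes: one (label, confidence in [0, 1]) pair per simulated annotator;
    at least one vote with non-zero confidence is assumed.
    Returns the winning label and its share of the total confidence.
    """
    totals: Dict[str, float] = defaultdict(float)
    for label, conf in votes:
        totals[label] += conf
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / sum(totals.values())


# Two annotators say "no_risk" with low confidence, one says "risk" with high confidence.
votes = [("risk", 0.9), ("no_risk", 0.6), ("no_risk", 0.2)]
print(weighted_consensus(votes))
```

Here a plain majority vote would pick "no_risk", while the confidence-weighted vote picks "risk"; a low winning share could be used as a signal that another debate round is needed.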

[NLP-100] On measuring grounding and generalizing grounding problems

Link: https://arxiv.org/abs/2512.06205
Authors: Daniel Quigley,Eric Maynard
Affiliations: Center for Possible Minds; Indiana University Bloomington; Eruditis
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 36 pages, 85 sources

[NLP-101] ARCANE: A Multi-Agent Framework for Interpretable and Configurable Alignment AAAI2026

Link: https://arxiv.org/abs/2512.06196
Authors: Charlie Masters,Marta Grześkiewicz,Stefano V. Albrecht
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to the AAAI 2026 LLAMAS Workshop (Large Language Model Agents for Multi-Agent Systems)

[NLP-102] Do You Feel Comfortable? Detecting Hidden Conversational Escalation in AI Chatbots

Link: https://arxiv.org/abs/2512.06193
Authors: Jihyung Park,Saleh Afroogh,Junfeng Jiao
Affiliations: The University of Texas at Austin
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

[NLP-103] Morphologically-Informed Tokenizers for Languages with Non-Concatenative Morphology: A case study of Yoloxóchtil Mixtec ASR

Link: https://arxiv.org/abs/2512.06169
Authors: Chris Crawford
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 67 pages, 5 figures, 6 tables

[NLP-104] Empathy by Design: Aligning Large Language Models for Healthcare Dialogue

Link: https://arxiv.org/abs/2512.06097
Authors: Emre Umucu,Guillermina Solis,Leon Garza,Emilia Rivas,Beatrice Lee,Anantaa Kotal,Aritran Piplai
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

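[NLP-105] The Road of Adaptive AI for Precision in Cybersecurity

【Quick Read】: This paper addresses the core challenge that Generative AI (GenAI) in cybersecurity struggles to stay effective and reliable as knowledge bases, tooling, and the threat landscape keep changing. The key to the solution is building production-grade GenAI pipelines that adapt continually, with emphasis on how retrieval-based adaptation and model-level adaptation complement each other in end-to-end systems, in order to make GenAI more robust, precise, and auditable in cyber defense.

Link: https://arxiv.org/abs/2512.06048
Authors: Sahil Garg
Affiliations: Averlon
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: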

Abstract:Cybersecurity’s evolving complexity presents unique challenges and opportunities for AI research and practice. This paper shares key lessons and insights from designing, building, and operating production-grade GenAI pipelines in cybersecurity, with a focus on the continual adaptation required to keep pace with ever-shifting knowledge bases, tooling, and threats. Our goal is to provide an actionable perspective for AI practitioners and industry stakeholders navigating the frontier of GenAI for cybersecurity, with particular attention to how different adaptation mechanisms complement each other in end-to-end systems. We present practical guidance derived from real-world deployments, propose best practices for leveraging retrieval- and model-level adaptation, and highlight open research directions for making GenAI more robust, precise, and auditable in cyber defense.

[NLP-106] AI-Generated Compromises for Coalition Formation: Modeling Simulation and a Textual Case Study

Link: https://arxiv.org/abs/2512.05983
Authors: Eyal Briman (Ben Gurion University of the Negev), Ehud Shapiro (Weizmann Institute of Science), Nimrod Talmon (Ben Gurion University of the Negev)
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
Comments: In Proceedings TARK 2025, arXiv:2511.20540. arXiv admin note: substantial text overlap with arXiv:2506.06837

[NLP-107] A Knowledge-Based Language Model: Deducing Grammatical Knowledge in a Multi-Agent Language Acquisition Simulation WWW

Link: https://arxiv.org/abs/2512.02195
Authors: David Ph. Shakouri,Crit Cremers,Niels O. Schiller
Affiliations: Leiden University Centre for Linguistics (LUCL), Leiden University, the Netherlands; Leiden Institute for Brain and Cognition (LIBC), Leiden University, the Netherlands; City University of Hong Kong (CityU), Hong Kong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 23 pages, 7 figures, 11 tables. Related work: arXiv:2503.18702. This is the peer-reviewed publisher’s version, downloadable from: this https URL

[NLP-108] Small Language Models Reshape Higher Education: Courses Textbooks and Teaching

Link: https://arxiv.org/abs/2512.06001
Authors: Jian Zhang,Jia Shao
Affiliations: Unknown
Categories: Physics Education (physics.ed-ph); Computation and Language (cs.CL)
Comments: in Chinese language

[NLP-109] KidSpeak: A General Multi-purpose LLM for Kids Speech Recognition and Screening

Link: https://arxiv.org/abs/2512.05994
Authors: Rohan Sharma,Dancheng Liu,Jingchen Sun,Shijie Zhou,Jiayu Qin,Jinjun Xiong,Changyou Chen
Affiliations: State University of New York at Buffalo
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments:

Computer Vision

[CV-0] Voxify3D: Pixel Art Meets Volumetric Rendering

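【Quick Read】: This paper addresses the difficulty of automatically generating high-quality voxel art from 3D meshes, where the core challenge is balancing geometric abstraction, semantic preservation, and discrete color coherence. Existing methods either over-simplify geometry or fail to achieve pixel-precise, palette-constrained voxel art. The key to the solution is Voxify3D, a differentiable two-stage framework that integrates three core components: (1) orthographic pixel art supervision, which eliminates perspective distortion for precise voxel-pixel alignment; (2) patch-based CLIP alignment, which preserves semantics during discretization; and (3) palette-constrained Gumbel-Softmax quantization, which supports differentiable optimization over discrete color spaces with controllable palette strategies. Together these address semantic preservation under extreme discretization, pixel-art aesthetics through volumetric rendering, and end-to-end discrete optimization.

Link: https://arxiv.org/abs/2512.07834
Authors: Yi-Chuan Huang,Jiewen Chan,Hao-Jen Chien,Yu-Lun Liu
Affiliations: National Yang Ming Chiao Tung University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL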

Abstract:Voxel art is a distinctive stylization widely used in games and digital media, yet automated generation from 3D meshes remains challenging due to conflicting requirements of geometric abstraction, semantic preservation, and discrete color coherence. Existing methods either over-simplify geometry or fail to achieve the pixel-precise, palette-constrained aesthetics of voxel art. We introduce Voxify3D, a differentiable two-stage framework bridging 3D mesh optimization with 2D pixel art supervision. Our core innovation lies in the synergistic integration of three components: (1) orthographic pixel art supervision that eliminates perspective distortion for precise voxel-pixel alignment; (2) patch-based CLIP alignment that preserves semantics across discretization levels; (3) palette-constrained Gumbel-Softmax quantization enabling differentiable optimization over discrete color spaces with controllable palette strategies. This integration addresses fundamental challenges: semantic preservation under extreme discretization, pixel-art aesthetics through volumetric rendering, and end-to-end discrete optimization. Experiments show superior performance (37.12 CLIP-IQA, 77.90% user preference) across diverse characters and controllable abstraction (2-8 colors, 20x-50x resolutions). Project page: this https URL
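The Gumbel-Softmax relaxation named in the abstract above can be sketched in plain Python for a single voxel choosing among palette colors. This is a generic illustration of the sampling trick, not the authors' implementation: the palette, logits, and temperature are made-up values, and a real system would use an automatic-differentiation framework so gradients can flow through the weights.

```python
import math
import random
from typing import List


def gumbel_softmax(logits: List[float], tau: float, rng: random.Random) -> List[float]:
    """Draw a relaxed one-hot sample over categories (here: palette entries)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1);
        # U is clamped away from 0 and 1 to keep the logs finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    m = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 3-color palette and per-color logits for one voxel.
palette = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]
rng = random.Random(0)
weights = gumbel_softmax([2.0, 0.5, 0.1], tau=0.5, rng=rng)

# Soft color: a convex combination of palette entries.
color = tuple(sum(w * c[ch] for w, c in zip(weights, palette)) for ch in range(3))
print(weights, color)
```

Lowering `tau` pushes the weights toward a one-hot choice, so the soft color approaches a single palette entry while the sampling step stays differentiable with respect to the logits.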

[CV-1] Relational Visual Similarity

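【Quick Read】: This paper addresses the problem that current visual similarity models (such as LPIPS, CLIP, and DINO) focus only on perceptual attribute similarity and cannot capture the relational similarity that humans recognize. Relational similarity refers to correspondence in the internal structure or function among visual elements in images, even when surface attributes differ (for example, the Earth's crust, mantle, and core are analogous to a peach's skin, flesh, and pit). The solution has three key steps. First, relational similarity is formulated as a measurable problem: two images are relationally similar when the relations among their internal elements correspond. Second, the authors build an image-caption dataset of 114k images with anonymized captions that describe the underlying relational logic of a scene rather than its surface content. Third, they finetune a vision-language model on this dataset so that images with the same relational logic are brought closer together in representation space, connecting images by underlying structure rather than visible appearance.

Link: https://arxiv.org/abs/2512.07833
Authors: Thao Nguyen,Sicheng Mo,Krishna Kumar Singh,Yilin Wang,Jing Shi,Nicholas Kolkin,Eli Shechtman,Yong Jae Lee,Yuheng Li
Affiliations: University of Wisconsin-Madison; University of California, Los Angeles; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page, data, and code: this https URL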

Abstract:Humans do not just see attribute similarity – we also see relational similarity. An apple is like a peach because both are reddish fruit, but the Earth is also like a peach: its crust, mantle, and core correspond to the peach’s skin, flesh, and pit. This ability to perceive and recognize relational similarity is argued by cognitive scientists to be what distinguishes humans from other species. Yet, all widely used visual similarity metrics today (e.g., LPIPS, CLIP, DINO) focus solely on perceptual attribute similarity and fail to capture the rich, often surprising relational similarities that humans perceive. How can we go beyond the visible content of an image to capture its relational properties? How can we bring images with the same relational logic closer together in representation space? To answer these questions, we first formulate relational image similarity as a measurable problem: two images are relationally similar when their internal relations or functions among visual elements correspond, even if their visual attributes differ. We then curate a 114k image-caption dataset in which the captions are anonymized – describing the underlying relational logic of the scene rather than its surface content. Using this dataset, we finetune a Vision-Language model to measure the relational similarity between images. This model serves as the first step toward connecting images by their underlying relational structure rather than their visible appearance. Our study shows that while relational similarity has a lot of real-world applications, existing image similarity models fail to capture it – revealing a critical gap in visual computing.

[CV-2] UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation

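【Quick Read】: This paper addresses the incomplete world understanding of current video generation models that results from single-modality conditioning. The root cause is insufficient cross-modal interaction and limited modal diversity, which make it hard to represent world knowledge comprehensively. The key to the solution is UnityVideo, a unified framework that jointly learns across multiple modalities (segmentation masks, human skeletons, DensePose, optical flow, and depth maps) and training paradigms. It introduces two core components: a dynamic noising mechanism that unifies heterogeneous training paradigms, and a modality switcher with an in-context learner that enables unified processing through modular parameters. The method significantly improves zero-shot generalization to unseen data and improves video quality, temporal consistency, and alignment with physical-world constraints.

Link: https://arxiv.org/abs/2512.07831
Authors: Jiehui Huang,Yuechen Zhang,Xu He,Yuan Gao,Zhi Cen,Bin Xia,Yan Zhou,Xin Tao,Pengfei Wan,Jiaya Jia
Affiliations: HKUST; CUHK; Tsinghua University; Kling Team, Kuaishou Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Website this https URL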

Abstract:Recent video generation models demonstrate impressive synthesis capabilities but remain limited by single-modality conditioning, constraining their holistic world understanding. This stems from insufficient cross-modal interaction and limited modal diversity for comprehensive world knowledge representation. To address these limitations, we introduce UnityVideo, a unified framework for world-aware video generation that jointly learns across multiple modalities (segmentation masks, human skeletons, DensePose, optical flow, and depth maps) and training paradigms. Our approach features two core components: (1) dynamic noising to unify heterogeneous training paradigms, and (2) a modality switcher with an in-context learner that enables unified processing via modular parameters and contextual learning. We contribute a large-scale unified dataset with 1.3M samples. Through joint optimization, UnityVideo accelerates convergence and significantly enhances zero-shot generalization to unseen data. We demonstrate that UnityVideo achieves superior video quality, consistency, and improved alignment with physical world constraints. Code and data can be found at: this https URL

[CV-3] One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation

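【Quick Read】: This paper addresses the fundamental mismatch between pretrained visual representations and generation-friendly latent spaces, that is, the tension between high-dimensional understanding-oriented features and the low-dimensional latents that generative models need. Prior methods have relied on complex objectives and architectures. The key to the solution is the FAE (Feature Auto-Encoder) framework, which couples two separate deep decoders: the first reconstructs the original feature space to preserve semantic information, and the second takes the reconstructed features as input for image generation. This compresses high-dimensional features into low-dimensional latents suitable for generation while retaining sufficient information. The adaptation needs as little as a single attention layer, and the approach is generic: it works with a variety of self-supervised encoders (such as DINO and SigLIP) and with two families of generative models (diffusion models and normalizing flows).

Link: https://arxiv.org/abs/2512.07829
Authors: Yuan Gao,Chen Chen,Tianrong Chen,Jiatao Gu
Affiliations: Apple
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: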

Abstract:Visual generative models (e.g., diffusion models) typically operate in compressed latent spaces to balance training efficiency and sample quality. In parallel, there has been growing interest in leveraging high-quality pre-trained visual representations, either by aligning them inside VAEs or directly within the generative model. However, adapting such representations remains challenging due to fundamental mismatches between understanding-oriented features and generation-friendly latent spaces. Representation encoders benefit from high-dimensional latents that capture diverse hypotheses for masked regions, whereas generative models favor low-dimensional latents that must faithfully preserve injected noise. This discrepancy has led prior work to rely on complex objectives and architectures. In this work, we propose FAE (Feature Auto-Encoder), a simple yet effective framework that adapts pre-trained visual representations into low-dimensional latents suitable for generation using as little as a single attention layer, while retaining sufficient information for both reconstruction and understanding. The key is to couple two separate deep decoders: one trained to reconstruct the original feature space, and a second that takes the reconstructed features as input for image generation. FAE is generic; it can be instantiated with a variety of self-supervised encoders (e.g., DINO, SigLIP) and plugged into two distinct generative families: diffusion models and normalizing flows. Across class-conditional and text-to-image benchmarks, FAE achieves strong performance. For example, on ImageNet 256x256, our diffusion model with CFG attains a near state-of-the-art FID of 1.29 (800 epochs) and 1.70 (80 epochs). Without CFG, FAE reaches the state-of-the-art FID of 1.48 (800 epochs) and 2.08 (80 epochs), demonstrating both high quality and fast learning.

[CV-4] OpenVE-3M: A Large-Scale High-Quality Dataset for Instruction-Guided Video Editing

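Quick read: This paper addresses the scarcity of large-scale, high-quality datasets for instruction-based video editing. Image editing datasets keep improving, but video editing still lacks public datasets with comparable diversity and annotation quality, which limits reliable model training and evaluation. The key to the solution is OpenVE-3M, an open-source, large-scale, high-quality video editing dataset covering spatially-aligned edits (global style, background change, local change, local removal, local addition, and subtitle edits) and non-spatially-aligned edits (camera multi-shot edits and creative edits), produced by a carefully designed data pipeline with rigorous quality filtering. The accompanying OpenVE-Bench benchmark contains 431 video-edit pairs and three metrics closely aligned with human judgment, giving the field a unified evaluation standard. OpenVE-Edit, a 5B-parameter model trained on the dataset, sets a new state of the art on OpenVE-Bench and outperforms all prior open-source models, including a 14B baseline.

Link: https://arxiv.org/abs/2512.07826
Authors: Haoyang He, Jie Wang, Jiangning Zhang, Zhucun Xue, Xingyuan Bu, Qiangpeng Yang, Shilei Wen, Lei Xie
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 38 pages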

Abstract:The quality and diversity of instruction-based image editing datasets are continuously increasing, yet large-scale, high-quality datasets for instruction-based video editing remain scarce. To address this gap, we introduce OpenVE-3M, an open-source, large-scale, and high-quality dataset for instruction-based video editing. It comprises two primary categories: spatially-aligned edits (Global Style, Background Change, Local Change, Local Remove, Local Add, and Subtitles Edit) and non-spatially-aligned edits (Camera Multi-Shot Edit and Creative Edit). All edit types are generated via a meticulously designed data pipeline with rigorous quality filtering. OpenVE-3M surpasses existing open-source datasets in terms of scale, diversity of edit types, instruction length, and overall quality. Furthermore, to address the lack of a unified benchmark in the field, we construct OpenVE-Bench, containing 431 video-edit pairs that cover a diverse range of editing tasks with three key metrics highly aligned with human judgment. We present OpenVE-Edit, a 5B model trained on our dataset that demonstrates remarkable efficiency and effectiveness by setting a new state-of-the-art on OpenVE-Bench, outperforming all prior open-source models including a 14B baseline. Project page is at this https URL.

[CV-5] WorldReel: 4D Video Generation with Consistent Geometry and Motion Modeling

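Quick read: This paper addresses the lack of 3D consistency in current video generation models, whose outputs tend to show geometric distortion and incoherent motion across viewpoints and in dynamic scenes. The key to the solution is WorldReel, a natively spatio-temporally consistent 4D video generation framework. It jointly generates RGB frames and 4D scene representations (pointmaps, camera trajectory, and dense flow mapping), explicitly modeling a single stable scene that persists over time. This keeps geometry and appearance consistent in space and time, even under large non-rigid motion and significant camera movement.

Link: https://arxiv.org/abs/2512.07821
Authors: Shaoheng Fang, Hanwen Jiang, Yunpeng Bai, Niloy J. Mitra, Qixing Huang
Affiliations: The University of Texas at Austin; Adobe Research; University College London
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: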

Abstract:Recent video generators achieve striking photorealism, yet remain fundamentally inconsistent in 3D. We present WorldReel, a 4D video generator that is natively spatio-temporally consistent. WorldReel jointly produces RGB frames together with 4D scene representations, including pointmaps, camera trajectory, and dense flow mapping, enabling coherent geometry and appearance modeling over time. Our explicit 4D representation enforces a single underlying scene that persists across viewpoints and dynamic content, yielding videos that remain consistent even under large non-rigid motion and significant camera movement. We train WorldReel by carefully combining synthetic and real data: synthetic data providing precise 4D supervision (geometry, motion, and camera), while real videos contribute visual diversity and realism. This blend allows WorldReel to generalize to in-the-wild footage while preserving strong geometric fidelity. Extensive experiments demonstrate that WorldReel sets a new state-of-the-art for consistent video generation with dynamic scenes and moving cameras, improving metrics of geometric consistency, motion coherence, and reducing view-time artifacts over competing methods. We believe that WorldReel brings video generation closer to 4D-consistent world modeling, where agents can render, interact, and reason about scenes through a single and stable spatiotemporal representation.

[CV-6] Lang3D-XL: Language Embedded 3D Gaussians for Large-scale Scenes SIGGRAPH

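Quick read: This paper addresses natural-language semantic understanding of, and interaction with, large-scale scenes. In particular, it targets the semantic feature misalignment and the memory and runtime inefficiency that existing feature distillation methods suffer from when learning over massive Internet data. The solution has two key parts. First, extremely low-dimensional semantic bottleneck features are added to the 3D Gaussian representation and processed by a multi-resolution, feature-based hash encoder, which significantly improves runtime efficiency and GPU memory usage. Second, an Attenuated Downsampler module and several regularizations mitigate the semantic misalignment of ground-truth 2D features. On the in-the-wild HolyScenes dataset, the method improves on existing approaches in both performance and efficiency.

Link: https://arxiv.org/abs/2512.07807
Authors: Shai Krakovsky, Gal Fiebelman, Sagie Benaim, Hadar Averbuch-Elor
Affiliations: Tel Aviv University; The Hebrew University of Jerusalem; Cornell University
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted to SIGGRAPH Asia 2025. Project webpage: this https URL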

Abstract:Embedding a language field in a 3D representation enables richer semantic understanding of spatial environments by linking geometry with descriptive meaning. This allows for a more intuitive human-computer interaction, enabling querying or editing scenes using natural language, and could potentially improve tasks like scene retrieval, navigation, and multimodal reasoning. While such capabilities could be transformative, in particular for large-scale scenes, we find that recent feature distillation approaches cannot effectively learn over massive Internet data due to challenges in semantic feature misalignment and inefficiency in memory and runtime. To this end, we propose a novel approach to address these challenges. First, we introduce extremely low-dimensional semantic bottleneck features as part of the underlying 3D Gaussian representation. These are processed by rendering and passing them through a multi-resolution, feature-based, hash encoder. This significantly improves efficiency both in runtime and GPU memory. Second, we introduce an Attenuated Downsampler module and propose several regularizations addressing the semantic misalignment of ground truth 2D features. We evaluate our method on the in-the-wild HolyScenes dataset and demonstrate that it surpasses existing approaches in both performance and efficiency.

[CV-7] Multi-view Pyramid Transformer: Look Coarser to See Broader

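Quick read: This paper addresses efficient, high-quality reconstruction of large 3D scenes from tens to hundreds of images, where conventional methods face low computational efficiency, poor scalability, and loss of detail. The key to the solution is the Multi-view Pyramid Transformer (MVP), which balances computational efficiency and representational richness through a dual hierarchy. A local-to-global inter-view hierarchy gradually broadens the model's perspective from local views to groups and finally to the full scene. A fine-to-coarse intra-view hierarchy starts from high-resolution spatial features and progressively aggregates them into compact, information-dense tokens. This design lets MVP reconstruct large scenes in a single forward pass and, combined with 3D Gaussian Splatting as the underlying representation, achieve state-of-the-art generalizable reconstruction quality with good scalability across datasets.

Link: https://arxiv.org/abs/2512.07806
Authors: Gyeongjin Kang, Seungkwon Yang, Seungtae Nam, Younggeun Lee, Jungwoo Kim, Eunbyung Park
Affiliations: Sungkyunkwan University; Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: see this https URL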

Abstract:We propose Multi-view Pyramid Transformer (MVP), a scalable multi-view transformer architecture that directly reconstructs large 3D scenes from tens to hundreds of images in a single forward pass. Drawing on the idea of "looking broader to see the whole, looking finer to see the details," MVP is built on two core design principles: 1) a local-to-global inter-view hierarchy that gradually broadens the model's perspective from local views to groups and ultimately the full scene, and 2) a fine-to-coarse intra-view hierarchy that starts from detailed spatial representations and progressively aggregates them into compact, information-dense tokens. This dual hierarchy achieves both computational efficiency and representational richness, enabling fast reconstruction of large and complex scenes. We validate MVP on diverse datasets and show that, when coupled with 3D Gaussian Splatting as the underlying 3D representation, it achieves state-of-the-art generalizable reconstruction quality while maintaining high efficiency and scalability across a wide range of view configurations.

[CV-8] OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory

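Quick read: This paper addresses insufficient long-range cross-shot context modeling in multi-shot video generation (MSV). Existing methods are limited to short temporal windows or single keyframe conditioning, so generation quality degrades under complex narratives. The key to the solution is the OneStory framework, which reformulates MSV as a next-shot generation task and leverages pretrained image-to-video (I2V) models for strong visual conditioning. It introduces two core modules: a Frame Selection module that builds a global memory from semantically informative frames in prior shots, and an Adaptive Conditioner that uses importance-guided patchification to produce a compact context for direct conditioning. Together they enable consistent and scalable narrative video generation.

Link: https://arxiv.org/abs/2512.07802
Authors: Zhaochong An, Menglin Jia, Haonan Qiu, Zijian Zhou, Xiaoke Huang, Zhiheng Liu, Weiming Ren, Kumara Kahatapitiya, Ding Liu, Sen He, Chenyang Zhang, Tao Xiang, Fanny Yang, Serge Belongie, Tian Xie
Affiliations: University of Copenhagen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL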

Abstract:Storytelling in real-world videos often unfolds through multiple shots – discontinuous yet semantically connected clips that together convey a coherent narrative. However, existing multi-shot video generation (MSV) methods struggle to effectively model long-range cross-shot context, as they rely on limited temporal windows or single keyframe conditioning, leading to degraded performance under complex narratives. In this work, we propose OneStory, enabling global yet compact cross-shot context modeling for consistent and scalable narrative generation. OneStory reformulates MSV as a next-shot generation task, enabling autoregressive shot synthesis while leveraging pretrained image-to-video (I2V) models for strong visual conditioning. We introduce two key modules: a Frame Selection module that constructs a semantically-relevant global memory based on informative frames from prior shots, and an Adaptive Conditioner that performs importance-guided patchification to generate compact context for direct conditioning. We further curate a high-quality multi-shot dataset with referential captions to mirror real-world storytelling patterns, and design effective training strategies under the next-shot paradigm. Finetuned from a pretrained I2V model on our curated 60K dataset, OneStory achieves state-of-the-art narrative coherence across diverse and complex scenes in both text- and image-conditioned settings, enabling controllable and immersive long-form video storytelling.

[CV-9] Distribution Matching Variational AutoEncoder

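Quick read: This paper addresses the poorly controlled design of latent distributions in current visual generative models. Existing approaches such as variational autoencoders (VAEs) constrain the latent space only implicitly rather than shaping its distribution explicitly, so it is unclear which latent distributions are best for modeling. The key to the solution is the Distribution-Matching VAE (DMVAE), which uses a distribution matching constraint to explicitly align the encoder's latent distribution with an arbitrary reference distribution (for example, self-supervised features or diffusion noise), going beyond the conventional Gaussian prior. Experiments indicate that latent distributions derived from self-supervised learning (SSL) give the best balance between reconstruction fidelity and modeling efficiency, reaching a gFID of 3.2 on ImageNet with only 64 training epochs, which supports the importance of choosing the latent distribution structure.

Link: https://arxiv.org/abs/2512.07778
Authors: Sen Ye, Jianning Pei, Mengde Xu, Shuyang Gu, Chunyu Wang, Liwei Wang, Han Hu
Affiliations: Peking University; UCAS; Tencent
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: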

Abstract:Most visual generative models compress images into a latent space before applying diffusion or autoregressive modelling. Yet, existing approaches such as VAEs and foundation model aligned encoders implicitly constrain the latent space without explicitly shaping its distribution, making it unclear which types of distributions are optimal for modeling. We introduce Distribution-Matching VAE (DMVAE), which explicitly aligns the encoder's latent distribution with an arbitrary reference distribution via a distribution matching constraint. This generalizes beyond the Gaussian prior of conventional VAEs, enabling alignment with distributions derived from self-supervised features, diffusion noise, or other prior distributions. With DMVAE, we can systematically investigate which latent distributions are more conducive to modeling, and we find that SSL-derived distributions provide an excellent balance between reconstruction fidelity and modeling efficiency, reaching a gFID of 3.2 on ImageNet with only 64 training epochs. Our results suggest that choosing a suitable latent distribution structure (achieved via distribution-level alignment), rather than relying on fixed priors, is key to bridging the gap between easy-to-model latents and high-fidelity image synthesis. Code is available at this https URL.

[CV-10] GorillaWatch: An Automated System for In-the-Wild Gorilla Re-Identification and Population Monitoring WACV2026

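Quick read: This paper addresses the inefficiency of re-identifying individual critically endangered western lowland gorillas in the wild. The core obstacle is the lack of large-scale, in-the-wild video datasets for training robust deep learning models. The key to the solution is three new datasets (Gorilla-SPAC-Wild, Gorilla-Berlin-Zoo, and Gorilla-SPAC-MoT) and GorillaWatch, an end-to-end pipeline that integrates detection, tracking, and re-identification. A multi-frame self-supervised pretraining strategy exploits tracklet consistency to learn domain-specific features without manual labels, and a differentiable adaptation of AttnLRP verifies that the model relies on discriminative biometric traits rather than background correlations, improving both generalization and scientific validity.

Link: https://arxiv.org/abs/2512.07776
Authors: Maximilian Schall, Felix Leonard Knöfel, Noah Elias König, Jan Jonas Kubeler, Maximilian von Klinski, Joan Wilhelm Linnemann, Xiaoshi Liu, Iven Jelle Schlegelmilch, Ole Woyciniuk, Alexandra Schild, Dante Wasmuht, Magdalena Bermejo Espinet, German Illera Basas, Gerard de Melo
Affiliations: Hasso Plattner Institute; Conservation X Labs; Sabine Plattner African Charities
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at WACV 2026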

Abstract:Monitoring critically endangered western lowland gorillas is currently hampered by the immense manual effort required to re-identify individuals from vast archives of camera trap footage. The primary obstacle to automating this process has been the lack of large-scale, “in-the-wild” video datasets suitable for training robust deep learning models. To address this gap, we introduce a comprehensive benchmark with three novel datasets: Gorilla-SPAC-Wild, the largest video dataset for wild primate re-identification to date; Gorilla-Berlin-Zoo, for assessing cross-domain re-identification generalization; and Gorilla-SPAC-MoT, for evaluating multi-object tracking in camera trap footage. Building on these datasets, we present GorillaWatch, an end-to-end pipeline integrating detection, tracking, and re-identification. To exploit temporal information, we introduce a multi-frame self-supervised pretraining strategy that leverages consistency in tracklets to learn domain-specific features without manual labels. To ensure scientific validity, a differentiable adaptation of AttnLRP verifies that our model relies on discriminative biometric traits rather than background correlations. Extensive benchmarking subsequently demonstrates that aggregating features from large-scale image backbones outperforms specialized video architectures. Finally, we address unsupervised population counting by integrating spatiotemporal constraints into standard clustering to mitigate over-segmentation. We publicly release all code and datasets to facilitate scalable, non-invasive monitoring of endangered species

[CV-11] Modality-Aware Bias Mitigation and Invariance Learning for Unsupervised Visible-Infrared Person Re-Identification AAAI2026

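Quick read: This paper addresses unreliable cross-modality association in unsupervised visible-infrared person re-identification (USVI-ReID), which stems from the gap between the two modalities. Existing methods usually associate intra-modality clusters with optimal transport, which tends to propagate local clustering errors and overlooks global instance-level relations. The solution has two key parts. First, a modality-aware Jaccard distance, motivated by camera-aware distance rectification, mitigates the distance bias caused by modality discrepancy so that more reliable cross-modality associations can be estimated through global clustering. Second, a "split-and-contrast" strategy learns modality-specific global prototypes and explicitly aligns them under global association guidance, yielding representations that are modality-invariant yet identity-discriminative. Although conceptually simple, the method reaches state-of-the-art performance on benchmark VI-ReID datasets.

Link: https://arxiv.org/abs/2512.07760
Authors: Menglin Wang, Xiaojin Gong, Jiachen Li, Genlin Ji
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI 2026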

Abstract:Unsupervised visible-infrared person re-identification (USVI-ReID) aims to match individuals across visible and infrared cameras without relying on any annotation. Given the significant gap between the visible and infrared modalities, estimating reliable cross-modality association becomes a major challenge in USVI-ReID. Existing methods usually adopt optimal transport to associate the intra-modality clusters, which is prone to propagating the local cluster errors, and also overlooks global instance-level relations. By mining and attending to the visible-infrared modality bias, this paper focuses on addressing cross-modality learning from two aspects: bias-mitigated global association and modality-invariant representation learning. Motivated by the camera-aware distance rectification in single-modality re-ID, we propose modality-aware Jaccard distance to mitigate the distance bias caused by modality discrepancy, so that more reliable cross-modality associations can be estimated through global clustering. To further improve cross-modality representation learning, a "split-and-contrast" strategy is designed to obtain modality-specific global prototypes. By explicitly aligning these prototypes under global association guidance, modality-invariant yet ID-discriminative representation learning can be achieved. While conceptually simple, our method obtains state-of-the-art performance on benchmark VI-ReID datasets and outperforms existing methods by a significant margin, validating its effectiveness.
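
To make the Jaccard-distance idea above concrete, here is a minimal Python sketch. `jaccard_distance` is the standard set-based definition over neighbour sets. `balanced_knn` is a hypothetical illustration of one way to reduce modality bias (taking neighbours from each modality separately); it is an assumption for this digest and not the paper's actual modality-aware formulation, which the abstract does not spell out.

```python
import math


def jaccard_distance(a, b):
    """Standard Jaccard distance: 1 - |A ∩ B| / |A ∪ B| for two neighbour sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def balanced_knn(query, gallery, modalities, k):
    """Hypothetical helper: take the k nearest gallery items from EACH modality.

    With a plain k-NN search, same-modality samples tend to crowd out
    cross-modality ones; picking neighbours per modality is one simple
    (assumed) way to counter that bias.
    """
    neighbours = set()
    for modality in set(modalities):
        idx = [i for i, m in enumerate(modalities) if m == modality]
        idx.sort(key=lambda i: math.dist(query, gallery[i]))
        neighbours.update(idx[:k])
    return neighbours
```

In this sketch, two samples would be compared by the Jaccard distance between their `balanced_knn` neighbour sets rather than by raw feature distance.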

[CV-12] UltrasODM: A Dual Stream Optical Flow Mamba Network for 3D Freehand Ultrasound Reconstruction

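Quick read: This paper addresses reconstruction errors in clinical ultrasound acquisition caused by strong operator dependence, rapid probe motion, and brightness fluctuations, which reduce reconstruction reliability and limit clinical utility. The key to the solution is UltrasODM, a dual-stream framework with three core components: (1) a contrastive ranking module that groups frames by motion similarity to improve temporal consistency; (2) an optical-flow stream fused with Dual-Mamba temporal modules for robust 6-DoF pose estimation; and (3) a Human-in-the-Loop (HITL) layer that combines Bayesian uncertainty, clinician-calibrated thresholds, and saliency maps to produce per-frame confidence estimates and actionable prompts. When uncertainty exceeds the threshold, the system issues unobtrusive alerts that guide the operator to adjust the scan. On a clinical freehand ultrasound dataset, the method reduces drift by 15.2%, distance error by 12.1%, and Hausdorff distance by 10.1% relative to UltrasOM.

Link: https://arxiv.org/abs/2512.07756
Authors: Mayank Anand, Ujair Alam, Surya Prakash, Priya Shukla, Gora Chand Nandi, Domenec Puig
Affiliations: Indian Institute of Information Technology Allahabad; Universitat Rovira i Virgili
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: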

Abstract:Clinical ultrasound acquisition is highly operator-dependent, where rapid probe motion and brightness fluctuations often lead to reconstruction errors that reduce trust and clinical utility. We present UltrasODM, a dual-stream framework that assists sonographers during acquisition through calibrated per-frame uncertainty, saliency-based diagnostics, and actionable prompts. UltrasODM integrates (i) a contrastive ranking module that groups frames by motion similarity, (ii) an optical-flow stream fused with Dual-Mamba temporal modules for robust 6-DoF pose estimation, and (iii) a Human-in-the-Loop (HITL) layer combining Bayesian uncertainty, clinician-calibrated thresholds, and saliency maps highlighting regions of low confidence. When uncertainty exceeds the threshold, the system issues unobtrusive alerts suggesting corrective actions such as re-scanning highlighted regions or slowing the sweep. Evaluated on a clinical freehand ultrasound dataset, UltrasODM reduces drift by 15.2%, distance error by 12.1%, and Hausdorff distance by 10.1% relative to UltrasOM, while producing per-frame uncertainty and saliency outputs. By emphasizing transparency and clinician feedback, UltrasODM improves reconstruction reliability and supports safer, more trustworthy clinical workflows. Our code is publicly available at this https URL.
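
For readers unfamiliar with the Hausdorff distance reported above, the following is a minimal Python sketch of the standard symmetric definition between two point sets. It is a brute-force illustration only; how the paper samples points from its reconstructions is not stated in the abstract, and the function names are ours.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from any point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))
```

Because the metric takes a maximum over points, a single badly placed frame can dominate it, which is why it complements averaged measures such as drift or distance error.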

[CV-13] Unison: A Fully Automatic Task-Universal and Low-Cost Framework for Unified Understanding and Generation

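Quick read: This paper addresses the shortcomings of unified understanding and generation in multimodal learning: existing methods have limited task coverage or poor generation quality, and they cannot automatically parse input meta-information such as task type, image resolution, or video duration. The key to the solution is Unison, a model built on the two-stage scheme that preserves the capabilities of pre-trained understanding and generation models. At very low training cost (only 500k samples and 50 GPU hours), it handles a range of multimodal tasks, including text, image, and video understanding, as well as text-to-visual content generation, editing, controllable generation, and IP-based reference generation. Unison also parses user intent, determines the task type, and extracts the required meta-information automatically, so the full workflow runs without human intervention.

Link: https://arxiv.org/abs/2512.07747
Authors: Shihao Zhao, Yitong Chen, Zeyinzi Jiang, Bojia Zi, Shaozhe Hao, Yu Liu, Chaojie Mao, Kwan-Yee K. Wong
Affiliations: The University of Hong Kong; Tongyi Lab, Alibaba Group; Fudan University; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: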

Abstract:Unified understanding and generation is a highly appealing research direction in multimodal learning. There exist two approaches: one trains a transformer via an auto-regressive paradigm, and the other adopts a two-stage scheme connecting pre-trained understanding and generative models for alignment fine-tuning. The former demands massive data and computing resources unaffordable for ordinary researchers. Though the latter requires a lower training cost, existing works often suffer from limited task coverage or poor generation quality. Both approaches lack the ability to parse input meta-information (such as task type, image resolution, video duration, etc.) and require manual parameter configuration that is tedious and non-intelligent. In this paper, we propose Unison which adopts the two-stage scheme while preserving the capabilities of the pre-trained models well. With an extremely low training cost, we cover a variety of multimodal understanding tasks, including text, image, and video understanding, as well as diverse generation tasks, such as text-to-visual content generation, editing, controllable generation, and IP-based reference generation. We also equip our model with the ability to automatically parse user intentions, determine the target task type, and accurately extract the meta-information required for the corresponding task. This enables full automation of various multimodal tasks without human intervention. Experiments demonstrate that, under a low-cost setting of only 500k training samples and 50 GPU hours, our model can accurately and automatically identify tasks and extract relevant parameters, and achieve superior performance across a variety of understanding and generation tasks.

[CV-14] DiffusionDriveV2: Reinforcement Learning-Constrained Truncated Diffusion Modeling in End-to-End Autonomous Driving

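Quick read: This paper addresses mode collapse in generative diffusion models for end-to-end autonomous driving, where the model tends to produce conservative, homogeneous behaviors and must trade diversity against consistently high quality. Because DiffusionDrive relies on imitation learning, which provides insufficient constraints, the proposed solution introduces reinforcement learning to both constrain low-quality modes and explore better trajectories. The key innovations are: 1) scale-adaptive multiplicative noise, which promotes broad exploration in trajectory planning; and 2) intra-anchor GRPO, which handles advantage estimation among samples generated from a single anchor, together with inter-anchor truncated GRPO, which adds a global perspective across anchors and prevents improper advantage comparisons between distinct intentions (for example, turning vs. going straight) that could cause further mode collapse. In closed-loop evaluation, the method achieves 91.2 PDMS on NAVSIM v1 and 85.5 EPDMS on NAVSIM v2, which the authors report as a new record.

Link: https://arxiv.org/abs/2512.07745
Authors: Jialv Zou, Shaoyu Chen, Bencheng Liao, Zhiyu Zheng, Yuehao Song, Lefei Zhang, Qian Zhang, Wenyu Liu, Xinggang Wang
Affiliations: Huazhong University of Science & Technology; Horizon Robotics; Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: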

Abstract:Generative diffusion models for end-to-end autonomous driving often suffer from mode collapse, tending to generate conservative and homogeneous behaviors. While DiffusionDrive employs predefined anchors representing different driving intentions to partition the action space and generate diverse trajectories, its reliance on imitation learning lacks sufficient constraints, resulting in a dilemma between diversity and consistent high quality. In this work, we propose DiffusionDriveV2, which leverages reinforcement learning to both constrain low-quality modes and explore for superior trajectories. This significantly enhances the overall output quality while preserving the inherent multimodality of its core Gaussian Mixture Model. First, we use scale-adaptive multiplicative noise, ideal for trajectory planning, to promote broad exploration. Second, we employ intra-anchor GRPO to manage advantage estimation among samples generated from a single anchor, and inter-anchor truncated GRPO to incorporate a global perspective across different anchors, preventing improper advantage comparisons between distinct intentions (e.g., turning vs. going straight), which can lead to further mode collapse. DiffusionDriveV2 achieves 91.2 PDMS on the NAVSIM v1 dataset and 85.5 EPDMS on the NAVSIM v2 dataset in closed-loop evaluation with an aligned ResNet-34 backbone, setting a new record. Further experiments validate that our approach resolves the dilemma between diversity and consistent high quality for truncated diffusion models, achieving the best trade-off. Code and model will be available at this https URL
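
The idea of estimating advantages only among samples from the same anchor can be sketched as follows. This is a minimal illustration that assumes the commonly used GRPO-style normalisation (reward minus group mean, divided by group standard deviation); the paper's exact formulation, including the inter-anchor truncation, is not reproduced here, and `intra_anchor_advantages` is a hypothetical name.

```python
from collections import defaultdict
from statistics import mean, pstdev


def intra_anchor_advantages(samples, eps=1e-8):
    """Normalise each reward against other samples from the SAME anchor.

    samples: list of (anchor_id, reward) pairs.
    Returns one advantage per sample, in input order.
    """
    groups = defaultdict(list)
    for anchor, reward in samples:
        groups[anchor].append(reward)
    # Per-anchor mean and population standard deviation.
    stats = {a: (mean(r), pstdev(r)) for a, r in groups.items()}
    return [(reward - stats[a][0]) / (stats[a][1] + eps) for a, reward in samples]
```

With this grouping, a "turn left" sample is never penalised merely because "go straight" samples earn higher raw rewards, which is the kind of cross-intention comparison the abstract warns against.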

[CV-15] HLTCOE Evaluation Team at TREC 2025: VQA Track

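Quick read: This paper addresses semantic precision and ranking consistency in the Answer Generation (AG) task of video question answering (VQA). When generating multiple candidate answers, existing methods often struggle to keep the answer list semantically consistent and stably ranked, especially for questions involving temporal reasoning and semantic disambiguation. The key to the solution is a listwise learning framework that reranks candidate answers using a novel Masked Pointer Cross-Entropy Loss with Rank Weights. This loss combines pointer-based candidate selection, rank-dependent weighting, and masked cross-entropy under vocabulary restriction, bridging generative modeling and discriminative ranking and improving the coherence, granularity, and ranking stability of the answer list.

Link: https://arxiv.org/abs/2512.07738
Authors: Dengjia Zhang, Charles Weng, Katherine Guerrerio, Yi Lu, Kenton Murray, Alexander Martin, Reno Kriz, Benjamin Van Durme
Affiliations: Johns Hopkins University; Human Language Technology Center of Excellence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 1 figure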

Abstract:The HLTCOE Evaluation team participated in TREC VQA’s Answer Generation (AG) task, for which we developed a listwise learning framework that aims to improve semantic precision and ranking consistency in answer generation. Given a video-question pair, a base multimodal model first generates multiple candidate answers, which are then reranked using a model trained with a novel Masked Pointer Cross-Entropy Loss with Rank Weights. This objective integrates pointer-based candidate selection, rank-dependent weighting, and masked cross-entropy under vocabulary restriction, enabling stable and interpretable listwise optimization. By bridging generative modeling with discriminative ranking, our method produces coherent, fine-grained answer lists. Experiments reveal consistent gains in accuracy and ranking stability, especially for questions requiring temporal reasoning and semantic disambiguation.
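
The abstract does not give the formula for the loss, so the sketch below is only one plausible reading: a rank-weighted, ListMLE-style (Plackett-Luce) objective in which a "pointer" picks the gold candidate at each rank and already-selected candidates are masked out of the softmax. The function name, the default `1 / log2(rank + 2)` weights, and the masking scheme are all assumptions for illustration, not the authors' definition.

```python
import math


def rank_weighted_pointer_loss(scores, gold_order, weights=None):
    """Rank-weighted listwise cross-entropy over candidate scores (assumed form).

    scores: one model score per candidate answer.
    gold_order: candidate indices in the desired ranking order.
    weights: optional per-rank weights; defaults to 1 / log2(rank + 2).
    """
    if weights is None:
        weights = [1.0 / math.log2(rank + 2) for rank in range(len(gold_order))]
    remaining = set(range(len(scores)))  # candidates not yet selected (unmasked)
    loss = 0.0
    for rank, target in enumerate(gold_order):
        # Numerically stable log-sum-exp over the unmasked candidates.
        top = max(scores[i] for i in remaining)
        log_z = top + math.log(sum(math.exp(scores[i] - top) for i in remaining))
        loss += weights[rank] * (log_z - scores[target])
        remaining.remove(target)  # mask the selected candidate for later ranks
    return loss
```

Under these assumptions, a score vector that agrees with the gold ordering yields a lower loss than one that reverses it, and errors near the top of the list cost more than errors near the bottom.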

[CV-16] SpatialDreamer: Incentivizing Spatial Reasoning via Active Mental Imagery

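Quick read: This paper addresses the limited performance of Multi-modal Large Language Models (MLLMs) on complex spatial reasoning tasks, in particular their lack of active mental simulation. Existing methods rely mainly on passive observation of spatial data and fail to support active exploration and internal representation of the environment. The key to the solution is SpatialDreamer, a reinforcement learning framework that uses a closed loop of active exploration, visual imagination via a world model, and evidence-grounded reasoning. To address the lack of fine-grained reward supervision in long-horizon spatial reasoning tasks, the authors further propose Geometric Policy Optimization (GeoPO), which introduces tree-structured sampling and step-level reward estimation with geometric consistency constraints.

Link: https://arxiv.org/abs/2512.07733
Authors: Meng Cao, Xingyu Li, Xue Liu, Ian Reid, Xiaodan Liang
Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: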

Abstract:Despite advancements in Multi-modal Large Language Models (MLLMs) for scene understanding, their performance on complex spatial reasoning tasks requiring mental simulation remains significantly limited. Current methods often rely on passive observation of spatial data, failing to internalize an active mental imagery process. To bridge this gap, we propose SpatialDreamer, a reinforcement learning framework that enables spatial reasoning through a closed-loop process of active exploration, visual imagination via a world model, and evidence-grounded reasoning. To address the lack of fine-grained reward supervision in long-horizon reasoning tasks, we propose Geometric Policy Optimization (GeoPO), which introduces tree-structured sampling and step-level reward estimation with geometric consistency constraints. Extensive experiments demonstrate that SpatialDreamer delivers highly competitive results across multiple challenging benchmarks, signifying a critical advancement in human-like active spatial mental simulation for MLLMs.

[CV-17] SAVE: Sparse Autoencoder-Driven Visual Information Enhancement for Mitigating Object Hallucination WACV2026

Quick read: This paper addresses object hallucination in Multimodal Large Language Models (MLLMs), which is caused mainly by language priors and visual information loss. The key to the solution is SAVE (Sparse Autoencoder-Driven Visual Information Enhancement), a framework that steers the model along "visual understanding features" among the latent features of a Sparse Autoencoder (SAE), reinforcing grounded visual understanding of the image. These features are identified by a binary object-presence question-answering probe. Steering along them suppresses the generation of uncertain object tokens and increases attention to image tokens, which reduces hallucination.

Link: https://arxiv.org/abs/2512.07730
Authors: Sangha Park, Seungryong Yoo, Jisoo Mok, Sungroh Yoon
Affiliations: Seoul National University; Daegu Gyeongbuk Institute of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: WACV 2026

Abstract:Although Multimodal Large Language Models (MLLMs) have advanced substantially, they remain vulnerable to object hallucination caused by language priors and visual information loss. To address this, we propose SAVE (Sparse Autoencoder-Driven Visual Information Enhancement), a framework that mitigates hallucination by steering the model along Sparse Autoencoder (SAE) latent features. A binary object-presence question-answering probe identifies the SAE features most indicative of the model’s visual information processing, referred to as visual understanding features. Steering the model along these identified features reinforces grounded visual understanding and effectively reduces hallucination. With its simple design, SAVE outperforms state-of-the-art training-free methods on standard benchmarks, achieving a 10%p improvement in CHAIR_S and consistent gains on POPE and MMHal-Bench. Extensive evaluations across multiple models and layers confirm the robustness and generalizability of our approach. Further analysis reveals that steering along visual understanding features suppresses the generation of uncertain object tokens and increases attention to image tokens, mitigating hallucination. Code is released at this https URL.
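
As a rough illustration of what "steering along SAE latent features" can mean, the sketch below adds scaled SAE decoder directions to a hidden state. This additive form is a common activation-steering scheme and is assumed here; the abstract does not specify SAVE's actual update rule, and `steer`, `decoder_rows`, and `alpha` are hypothetical names.

```python
def steer(hidden, decoder_rows, feature_ids, alpha):
    """Shift a hidden state along selected SAE decoder directions (assumed form).

    hidden: hidden-state vector as a list of floats.
    decoder_rows: decoder_rows[f] is the decoder direction of SAE feature f.
    feature_ids: indices of the features to steer along.
    alpha: steering strength.
    """
    steered = list(hidden)  # copy so the input is left unchanged
    for f in feature_ids:
        for j, w in enumerate(decoder_rows[f]):
            steered[j] += alpha * w
    return steered
```

In practice `feature_ids` would hold the features a probe flags as most indicative of visual processing, and `alpha` would be tuned so the intervention does not degrade fluency.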

[CV-18] Improving action classification with brain-inspired deep networks

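Quick read: This paper addresses the unbalanced use of body and background information by deep neural networks (DNNs) in action recognition. Because the two sources may be statistically correlated in the training data, a DNN may rely mostly on one and fail to take full advantage of the other. The key to the solution is a brain-inspired deep network architecture, patterned after the domain-specific brain regions humans have for perceiving bodies and scenes, with separate streams that process body and background information. Experiments show that the architecture improves action recognition performance and that its accuracy pattern across stimulus versions more closely matches that of human participants.

Link: https://arxiv.org/abs/2512.07729
Authors: Aidas Aglinskas, Stefano Anzellotti
Affiliations: Boston College
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: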

Abstract:Action recognition is also key for applications ranging from robotics to healthcare monitoring. Action information can be extracted from the body pose and movements, as well as from the background scene. However, the extent to which deep neural networks (DNNs) make use of information about the body and information about the background remains unclear. Since these two sources of information may be correlated within a training dataset, DNNs might learn to rely predominantly on one of them, without taking full advantage of the other. Unlike DNNs, humans have domain-specific brain regions selective for perceiving bodies, and regions selective for perceiving scenes. The present work tests whether humans are thus more effective at extracting information from both body and background, and whether building brain-inspired deep network architectures with separate domain-specific streams for body and scene perception endows them with more human-like performance. We first demonstrate that DNNs trained using the HAA500 dataset perform almost as accurately on versions of the stimuli that show both body and background and on versions of the stimuli from which the body was removed, but are at chance-level for versions of the stimuli from which the background was removed. Conversely, human participants (N=28) can recognize the same set of actions accurately with all three versions of the stimuli, and perform significantly better on stimuli that show only the body than on stimuli that show only the background. Finally, we implement and test a novel architecture patterned after domain specificity in the brain with separate streams to process body and background information. We show that 1) this architecture improves action recognition performance, and 2) its accuracy across different versions of the stimuli follows a pattern that matches more closely the pattern of accuracy observed in human participants.

[CV-19] ViSA: 3D-Aware Video Shading for Real-Time Upper-Body Avatar Creation

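[Quick read]: This paper targets two difficulties in generating a high-fidelity upper-body 3D avatar from a single input image. Methods built on large reconstruction models produce stable structure but often show artifacts such as blurry textures and stiff motion, while generative video models synthesize photorealistic, dynamic results but are prone to structural instability and identity drift. The key to the solution is to combine the strengths of both paradigms: a 3D reconstruction model supplies robust geometric and appearance priors that guide a real-time autoregressive video diffusion model for rendering. This keeps the geometry stable while synthesizing realistic high-frequency detail and fluid motion, which suppresses texture blur, motion stiffness, and structural inconsistency.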

Link: https://arxiv.org/abs/2512.07720
Authors: Fan Yang,Heyuan Li,Peihao Li,Weihao Yuan,Lingteng Qiu,Chaoyue Song,Cheng Chen,Yisheng He,Shifeng Zhang,Xiaoguang Han,Steven Hoi,Guosheng Lin
Affiliations: Nanyang Technological University; The Chinese University of Hong Kong, Shenzhen; Tongyi Lab, Alibaba Group
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generating high-fidelity upper-body 3D avatars from one-shot input image remains a significant challenge. Current 3D avatar generation methods, which rely on large reconstruction models, are fast and capable of producing stable body structures, but they often suffer from artifacts such as blurry textures and stiff, unnatural motion. In contrast, generative video models show promising performance by synthesizing photorealistic and dynamic results, but they frequently struggle with unstable behavior, including body structural errors and identity drift. To address these limitations, we propose a novel approach that combines the strengths of both paradigms. Our framework employs a 3D reconstruction model to provide robust structural and appearance priors, which in turn guides a real-time autoregressive video diffusion model for rendering. This process enables the model to synthesize high-frequency, photorealistic details and fluid dynamics in real time, effectively reducing texture blur and motion stiffness while preventing the structural inconsistencies common in video generation methods. By uniting the geometric stability of 3D reconstruction with the generative capabilities of video models, our method produces high-fidelity digital avatars with realistic appearance and dynamic, temporally coherent motion. Experiments demonstrate that our approach significantly reduces artifacts and achieves substantial improvements in visual quality over leading methods, providing a robust and efficient solution for real-time applications such as gaming and virtual reality. Project page: this https URL

[CV-20] UnCageNet: Tracking and Pose Estimation of Caged Animal

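[Quick read]: This paper addresses the substantial performance drop of animal tracking and pose estimation systems (such as STEP and ViTPose) on images and videos that contain cage structures and systematic occlusions. The key to the solution is a three-stage preprocessing pipeline. First, a Gabor-enhanced ResNet-UNet architecture with tunable orientation filters segments the cage and identifies the interfering structures. Second, CRFill performs content-aware reconstruction of the occluded regions to inpaint the cage area. Third, pose estimation and tracking run on the resulting cage-free frames. Orientation-sensitive features (72 directional kernels) improve cage segmentation accuracy, which in turn restores keypoint detection accuracy and trajectory consistency and brings performance close to that in occlusion-free environments.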

Link: https://arxiv.org/abs/2512.07712
Authors: Sayak Dutta,Harish Katti,Shashikant Verma,Shanmuganathan Raman
Affiliations: Indian Institute of Technology Gandhinagar; National Institutes of Health
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 2 figures, 2 tables. Accepted to the Indian Conference on Computer Vision, Graphics, and Image Processing (ICVGIP 2025), Mandi, India

Abstract:Animal tracking and pose estimation systems, such as STEP (Simultaneous Tracking and Pose Estimation) and ViTPose, experience substantial performance drops when processing images and videos with cage structures and systematic occlusions. We present a three-stage preprocessing pipeline that addresses this limitation through: (1) cage segmentation using a Gabor-enhanced ResNet-UNet architecture with tunable orientation filters, (2) cage inpainting using CRFill for content-aware reconstruction of occluded regions, and (3) evaluation of pose estimation and tracking on the uncaged frames. Our Gabor-enhanced segmentation model leverages orientation-aware features with 72 directional kernels to accurately identify and segment cage structures that severely impair the performance of existing methods. Experimental validation demonstrates that removing cage occlusions through our pipeline enables pose estimation and tracking performance comparable to that in environments without occlusions. We also observe significant improvements in keypoint detection accuracy and trajectory consistency.
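
To make the idea of an orientation filter bank concrete, here is a minimal sketch of 72 Gabor kernels built with the Python standard library only. The kernel size, `sigma`, `wavelength`, `gamma`, and the choice to space orientations evenly over [0, π) are illustrative assumptions; the abstract does not give the authors' actual filter parameters or how the kernels are wired into the ResNet-UNet.

```python
import math

def gabor_kernel(size, theta, sigma=2.0, wavelength=4.0, gamma=0.5):
    """Real part of a Gabor filter oriented at angle theta (radians)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the sinusoid varies along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

def gabor_bank(n_orientations=72, size=9):
    """One kernel per orientation; 72 orientations over [0, pi) means 2.5-degree steps (assumed spacing)."""
    return [gabor_kernel(size, k * math.pi / n_orientations)
            for k in range(n_orientations)]

bank = gabor_bank()
```

Thin, elongated structures such as cage bars respond most strongly to the kernel whose orientation matches the bar, which is presumably why a dense set of orientations helps the segmentation stage.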

[CV-21] PVeRA: Probabilistic Vector-Based Random Matrix Adaptation

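[Quick read]: This paper addresses the heavy dependence of fine-tuning large foundation models on vast data and compute, which are often scarce in practice. The key to the solution is a probabilistic VeRA adapter (PVeRA), which modifies VeRA's shared low-rank matrices in a probabilistic manner so that, with the backbone frozen, only a small number of trainable parameters are needed to adapt to a new task. The method handles inherent ambiguity in the input and supports different sampling configurations at training and inference time, improving parameter efficiency and performance.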

Link: https://arxiv.org/abs/2512.07703
Authors: Leo Fillioux,Enzo Ferrante,Paul-Henry Cournède,Maria Vakalopoulou,Stergios Christodoulidis
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Large foundation models have emerged in recent years and are pushing performance boundaries for a variety of tasks. Training or even finetuning such models demands vast datasets and computational resources, which are often scarce and costly. Adaptation methods provide a computationally efficient solution to address these limitations by allowing such models to be finetuned on small amounts of data and computing power. This is achieved by appending new trainable modules to frozen backbones with only a fraction of the trainable parameters and fitting only these modules on novel tasks. Recently, the VeRA adapter was shown to excel in parameter-efficient adaptations by utilizing a pair of frozen random low-rank matrices shared across all layers. In this paper, we propose PVeRA, a probabilistic version of the VeRA adapter, which modifies the low-rank matrices of VeRA in a probabilistic manner. This modification naturally allows handling inherent ambiguities in the input and allows for different sampling configurations during training and testing. A comprehensive evaluation was performed on the VTAB-1k benchmark and seven adapters, with PVeRA outperforming VeRA and other adapters. Our code for training models with PVeRA and benchmarking all adapters is available at this https URL.
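
As background for the deterministic adapter that PVeRA builds on, the sketch below shows a VeRA-style update in plain Python: two frozen random matrices `A` and `B` and two small trainable scaling vectors `d` and `b`. The dimensions, the Gaussian initialisation, and the starting values of `d` and `b` are illustrative assumptions, and the probabilistic modification itself is not shown because the abstract does not specify it.

```python
import random

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vera_delta(x, A, B, d, b):
    """Adapter output diag(b) B diag(d) A x, with A and B frozen and d and b trainable."""
    low = matvec(A, x)                          # project down to rank r (frozen A)
    low = [di * li for di, li in zip(d, low)]   # trainable per-rank scaling d
    up = matvec(B, low)                         # project back up (frozen B)
    return [bi * ui for bi, ui in zip(b, up)]   # trainable per-output scaling b

rng = random.Random(0)
n_in, n_out, rank = 6, 4, 2                     # toy sizes, not from the paper
A = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(rank)]
B = [[rng.gauss(0, 1) for _ in range(rank)] for _ in range(n_out)]
d = [0.1] * rank                                # illustrative initial value
b = [0.0] * n_out                               # zero init: the adapter starts as a no-op
x = [1.0] * n_in

delta = vera_delta(x, A, B, d, b)
trainable_params = len(d) + len(b)
```

Only `d` and `b` would be updated during fine-tuning, so the trainable parameter count per layer is `rank + n_out` rather than the `rank * (n_in + n_out)` of a LoRA-style pair of trained matrices.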

[CV-22] Guiding What Not to Generate: Automated Negative Prompting for Text-Image Alignment WACV2026

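[Quick read]: This paper addresses the difficulty of achieving precise text-image alignment in text-to-image generation, especially for prompts with complex compositional structure or imaginative elements. The key to the solution is an automated pipeline called Negative Prompting for Image Correction (NPC), which improves alignment by identifying and applying negative prompts that suppress unintended content. Through cross-attention analysis, the paper shows that both targeted negatives (directly tied to the alignment error) and untargeted negatives (tokens unrelated to the prompt but present in the generated image) help alignment. NPC generates candidate negative prompts with a verifier-captioner-proposer framework and ranks them with a salient text-space score, so effective negatives can be selected without additional image synthesis.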

Link: https://arxiv.org/abs/2512.07702
Authors: Sangha Park,Eunji Kim,Yeongtak Oh,Jooyoung Choi,Sungroh Yoon
Affiliations: Seoul National University; Amazon
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: WACV 2026

Abstract:Despite substantial progress in text-to-image generation, achieving precise text-image alignment remains challenging, particularly for prompts with rich compositional structure or imaginative elements. To address this, we introduce Negative Prompting for Image Correction (NPC), an automated pipeline that improves alignment by identifying and applying negative prompts that suppress unintended content. We begin by analyzing cross-attention patterns to explain why both targeted negatives (those directly tied to the prompt’s alignment error) and untargeted negatives (tokens unrelated to the prompt but present in the generated image) can enhance alignment. To discover useful negatives, NPC generates candidate prompts using a verifier-captioner-proposer framework and ranks them with a salient text-space score, enabling effective selection without requiring additional image synthesis. On GenEval++ and Imagine-Bench, NPC outperforms strong baselines, achieving 0.571 vs. 0.371 on GenEval++ and the best overall performance on Imagine-Bench. By guiding what not to generate, NPC provides a principled, fully automated route to stronger text-image alignment in diffusion models. Code is released at this https URL.

[CV-23] sim2art: Accurate Articulated Object Modeling from a Single Video using Synthetic Training Data Only

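[Quick read]: This paper addresses the joint prediction of part segmentation and joint parameters for articulated objects from monocular video, a fundamental challenge in robotics and digital twin creation. Previous methods mostly rely on constrained setups such as multi-view systems, object scanning, or static cameras. The paper proposes the first data-driven method that, trained only on synthetic data, runs on video captured with a freely moving camera and generalizes well to real-world objects. The key is that it operates directly on casually recorded video, which offers a scalable and practical solution for articulated object understanding.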

Link: https://arxiv.org/abs/2512.07698
Authors: Arslan Artykov,Corentin Sautier,Vincent Lepetit
Affiliations: LIGM, École des Ponts et Chaussees, IP Paris, CNRS, France
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Understanding articulated objects is a fundamental challenge in robotics and digital twin creation. To effectively model such objects, it is essential to recover both part segmentation and the underlying joint parameters. Despite the importance of this task, previous work has largely focused on setups like multi-view systems, object scanning, or static cameras. In this paper, we present the first data-driven approach that jointly predicts part segmentation and joint parameters from monocular video captured with a freely moving camera. Trained solely on synthetic data, our method demonstrates strong generalization to real-world objects, offering a scalable and practical solution for articulated object understanding. Our approach operates directly on casually recorded video, making it suitable for real-time applications in dynamic environments. Project webpage: this https URL

[CV-24] DIST-CLIP: Arbitrary Metadata and Image Guided MRI Harmonization via Disentangled Anatomy-Contrast Representations

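[Quick read]: This paper addresses the limited clinical generalization of deep learning models in medical image analysis caused by data heterogeneity. It focuses on Magnetic Resonance Imaging (MRI), where differences in scanner hardware, diverse acquisition protocols, and varying sequence parameters introduce domain shifts that obscure the underlying biological signal. Existing harmonization methods fall short: image-based methods need target images, and text-guided methods rely on simple labels that cannot capture complex acquisition details or the diversity of real clinical environments. The key to the solution is DIST-CLIP (Disentangled Style Transfer with CLIP Guidance), which explicitly disentangles anatomical content from image contrast, extracts the contrast representation with pre-trained CLIP encoders, and injects the contrast embedding into the anatomical content through a novel Adaptive Style Transfer module. The result is a unified harmonization framework that can be guided by either a target image or DICOM metadata, preserving anatomy while improving style-transfer fidelity over state-of-the-art methods.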

Link: https://arxiv.org/abs/2512.07674
Authors: Mehmet Yigit Avci,Pedro Borges,Virginia Fernandez,Paul Wright,Mehmet Yigitsoy,Sebastien Ourselin,Jorge Cardoso
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep learning holds immense promise for transforming medical image analysis, yet its clinical generalization remains profoundly limited. A major barrier is data heterogeneity. This is particularly true in Magnetic Resonance Imaging, where scanner hardware differences, diverse acquisition protocols, and varying sequence parameters introduce substantial domain shifts that obscure underlying biological signals. Data harmonization methods aim to reduce these instrumental and acquisition variability, but existing approaches remain insufficient. When applied to imaging data, image-based harmonization approaches are often restricted by the need for target images, while existing text-guided methods rely on simplistic labels that fail to capture complex acquisition details or are typically restricted to datasets with limited variability, failing to capture the heterogeneity of real-world clinical environments. To address these limitations, we propose DIST-CLIP (Disentangled Style Transfer with CLIP Guidance), a unified framework for MRI harmonization that flexibly uses either target images or DICOM metadata for guidance. Our framework explicitly disentangles anatomical content from image contrast, with the contrast representations being extracted using pre-trained CLIP encoders. These contrast embeddings are then integrated into the anatomical content via a novel Adaptive Style Transfer module. We trained and evaluated DIST-CLIP on diverse real-world clinical datasets, and showed significant improvements in performance when compared against state-of-the-art methods in both style translation fidelity and anatomical preservation, offering a flexible solution for style transfer and standardizing MRI data. Our code and weights will be made publicly available upon publication.

[CV-25] EgoCampus: Egocentric Pedestrian Eye Gaze Model and Dataset

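[Quick read]: This paper addresses the prediction of human visual attention during real-world navigation, specifically the eye gaze of pedestrians walking with an egocentric (first-person) view. The core challenge is modeling visual attention as it changes dynamically in complex outdoor environments. The key to the solution is the EgoCampus dataset, which covers 25 unique outdoor paths over 6 km of a university campus and contains gaze-annotated egocentric video from more than 80 pedestrians. The data were collected with Meta's Project Aria glasses, which provide eye tracking, RGB images, inertial sensing, and GPS. On this basis the paper proposes EgoCampusNet, a model that predicts where pedestrians look as they move through outdoor environments, providing a new resource and method for studying real-world attention and for building gaze prediction models for navigation.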

Link: https://arxiv.org/abs/2512.07668
Authors: Ronan John,Aditya Kesari,Vincenzo DiMatteo,Kristin Dana
Affiliations: Rutgers University, New Brunswick
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We address the challenge of predicting human visual attention during real-world navigation by measuring and modeling egocentric pedestrian eye gaze in an outdoor campus setting. We introduce the EgoCampus dataset, which spans 25 unique outdoor paths over 6 km across a university campus with recordings from more than 80 distinct human pedestrians, resulting in a diverse set of gaze-annotated videos. The system used for collection, Meta’s Project Aria glasses, integrates eye tracking, front-facing RGB cameras, inertial sensors, and GPS to provide rich data from the human perspective. Unlike many prior egocentric datasets that focus on indoor tasks or exclude eye gaze information, our work emphasizes visual attention while subjects walk along outdoor campus paths. Using this data, we develop EgoCampusNet, a novel method to predict eye gaze of navigating pedestrians as they move through outdoor environments. Our contributions provide both a new resource for studying real-world attention and a resource for future work in gaze prediction models for navigation. Dataset and code are available upon request, and will be made publicly available at a later date at this https URL.

[CV-26] Optimization-Guided Diffusion for Interactive Scene Generation

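[Quick read]: This paper addresses the scarcity and poor controllability of safety-critical events in driving scene generation: existing data-driven methods often lack controllability or produce trajectories that violate physical and social constraints, which limits their usefulness for safety evaluation. The key to the solution is OMEGA, a training-free, optimization-guided framework that enforces structural consistency and interaction awareness during diffusion sampling. It re-anchors each reverse diffusion step through constrained optimization to produce physically plausible and behaviorally coherent trajectories. The paper further formulates the interaction between the ego vehicle and an attacker as a game-theoretic optimization in distribution space and approximates Nash equilibria to generate realistic yet challenging adversarial scenarios, improving realism, consistency, and controllability.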

Link: https://arxiv.org/abs/2512.07661
Authors: Shiaho Li,Naisheng Ye,Tianyu Li,Kashyap Chitta,Tuo An,Peng Su,Boyang Wang,Haiou Liu,Chen Lv,Hongyang Li
Affiliations: Beijing Institute of Technology; OpenDriveLab at The University of Hong Kong; Nanyang Technological University; NVIDIA Research; Yinwang Intelligent Tech. Co. Ltd.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Realistic and diverse multi-agent driving scenes are crucial for evaluating autonomous vehicles, but safety-critical events, which are essential for this task, are rare and underrepresented in driving datasets. Data-driven scene generation offers a low-cost alternative by synthesizing complex traffic behaviors from existing driving logs. However, existing models often lack controllability or yield samples that violate physical or social constraints, limiting their usability. We present OMEGA, an optimization-guided, training-free framework that enforces structural consistency and interaction awareness during diffusion-based sampling from a scene generation model. OMEGA re-anchors each reverse diffusion step via constrained optimization, steering the generation towards physically plausible and behaviorally coherent trajectories. Building on this framework, we formulate ego-attacker interactions as a game-theoretic optimization in the distribution space, approximating Nash equilibria to generate realistic, safety-critical adversarial scenarios. Experiments on nuPlan and Waymo show that OMEGA improves generation realism, consistency, and controllability, increasing the ratio of physically and behaviorally valid scenes from 32.35% to 72.27% for free exploration capabilities, and from 11% to 80% for controllability-focused generation. Our approach can also generate 5× more near-collision frames with a time-to-collision under three seconds while maintaining the overall scene realism.
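
The near-collision criterion mentioned above (time-to-collision under three seconds) can be illustrated with a deliberately simplified sketch. It assumes a one-dimensional car-following situation with constant speeds; the function names and the toy numbers are hypothetical, and the paper's actual time-to-collision computation for multi-agent scenes is not described in the abstract.

```python
def time_to_collision(gap_m, ego_speed_mps, lead_speed_mps):
    """Seconds until the ego vehicle closes the gap, assuming both speeds stay constant."""
    closing_speed = ego_speed_mps - lead_speed_mps
    if closing_speed <= 0:
        return float("inf")  # the gap is not shrinking, so no collision is predicted
    return gap_m / closing_speed

def is_near_collision(gap_m, ego_speed_mps, lead_speed_mps, threshold_s=3.0):
    """Flag a frame whose time-to-collision falls below the threshold."""
    return time_to_collision(gap_m, ego_speed_mps, lead_speed_mps) < threshold_s

ttc = time_to_collision(gap_m=20.0, ego_speed_mps=15.0, lead_speed_mps=5.0)
flagged = is_near_collision(20.0, 15.0, 5.0)
```

Counting the frames flagged this way is one plausible way to read the "near-collision frames" statistic, under the assumptions stated above.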

[CV-27] An AI-Powered Autonomous Underwater System for Sea Exploration and Scientific Research

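[Quick read]: This paper addresses the large unexplored ocean regions left by traditional sea exploration, which is hampered by extreme conditions, limited visibility, and high cost. The key to the solution is an Autonomous Underwater Vehicle (AUV) system that integrates several AI components: YOLOv12 Nano for real-time underwater object detection, ResNet50 for feature extraction, PCA for dimensionality reduction that preserves 98% of the variance, K-Means++ clustering to group objects by visual characteristics, and the GPT-4o Mini large language model (LLM) to generate structured reports and summaries. The system is intended to raise exploration efficiency, reduce the risks of human diving, and speed up and deepen the analysis of underwater data.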

Link: https://arxiv.org/abs/2512.07652
Authors: Hamad Almazrouei,Mariam Al Nasseri,Maha Alzaabi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Traditional sea exploration faces significant challenges due to extreme conditions, limited visibility, and high costs, resulting in vast unexplored ocean regions. This paper presents an innovative AI-powered Autonomous Underwater Vehicle (AUV) system designed to overcome these limitations by automating underwater object detection, analysis, and reporting. The system integrates YOLOv12 Nano for real-time object detection, a Convolutional Neural Network (CNN) (ResNet50) for feature extraction, Principal Component Analysis (PCA) for dimensionality reduction, and K-Means++ clustering for grouping marine objects based on visual characteristics. Furthermore, a Large Language Model (LLM) (GPT-4o Mini) is employed to generate structured reports and summaries of underwater findings, enhancing data interpretation. The system was trained and evaluated on a combined dataset of over 55,000 images from the DeepFish and OzFish datasets, capturing diverse Australian marine environments. Experimental results demonstrate the system’s capability to detect marine objects with a mAP@0.5 of 0.512, a precision of 0.535, and a recall of 0.438. The integration of PCA effectively reduced feature dimensionality while preserving 98% variance, facilitating K-Means clustering which successfully grouped detected objects based on visual similarities. The LLM integration proved effective in generating insightful summaries of detections and clusters, supported by location data. This integrated approach significantly reduces the risks associated with human diving, increases mission efficiency, and enhances the speed and depth of underwater data analysis, paving the way for more effective scientific research and discovery in challenging marine environments.
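
The statement that PCA preserved 98% of the variance corresponds to keeping the smallest number of principal components whose cumulative explained variance reaches that threshold. The sketch below shows only that selection rule, using a made-up eigenvalue spectrum; the function name and the toy values are assumptions, not the authors' code or data.

```python
def components_for_variance(eigenvalues, target=0.98):
    """Smallest number of principal components whose cumulative explained variance reaches target."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return k
    return len(ordered)

# Made-up covariance eigenvalues standing in for a real feature spectrum.
toy_eigenvalues = [50.0, 30.0, 15.0, 3.5, 1.0, 0.5]
n_components = components_for_variance(toy_eigenvalues)
```

With this toy spectrum the first three components explain 95% of the variance and the first four explain 98.5%, so four components are kept; the clustering step would then operate on those reduced features.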

[CV-28] Liver Fibrosis Quantification and Analysis: The LiQA Dataset and Baseline Method

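[Quick read]: This paper addresses the accuracy of liver fibrosis staging under complex real-world clinical conditions such as domain shifts, missing modalities, and spatial misalignment. The key to the solution is LiQA (Liver Fibrosis Quantification and Analysis), a large multi-center, multi-phase MRI dataset for benchmarking liver segmentation (LiSeg) and liver fibrosis staging (LiFS) algorithms, together with a two-stage method. The first stage uses a semi-supervised learning framework with external data to make segmentation more robust. The second stage uses a multi-view consensus strategy with Class Activation Map (CAM)-based regularization to improve staging, which strengthens model robustness and generalization in clinical use.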

Link: https://arxiv.org/abs/2512.07651
Authors: Yuanye Liu,Hanxiao Zhang,Nannan Shi,Yuxin Shi,Arif Mahmood,Murtaza Taj,Xiahai Zhuang
Affiliations: Fudan University; Shanghai Jiao Tong University; Shanghai Public Health Clinical Center; Information Technology University; Lahore University of Management Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Liver fibrosis represents a significant global health burden, necessitating accurate staging for effective clinical management. This report introduces the LiQA (Liver Fibrosis Quantification and Analysis) dataset, established as part of the CARE 2024 challenge. Comprising 440 patients with multi-phase, multi-center MRI scans, the dataset is curated to benchmark algorithms for Liver Segmentation (LiSeg) and Liver Fibrosis Staging (LiFS) under complex real-world conditions, including domain shifts, missing modalities, and spatial misalignment. We further describe the challenge’s top-performing methodology, which integrates a semi-supervised learning framework with external data for robust segmentation, and utilizes a multi-view consensus approach with Class Activation Map (CAM)-based regularization for staging. Evaluation of this baseline demonstrates that leveraging multi-source data and anatomical constraints significantly enhances model robustness in clinical settings.

[CV-29] MoCA: Mixture-of-Components Attention for Scalable Compositional 3D Generation

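[Quick read]: This paper addresses the poor scalability of existing part-aware 3D generation methods, whose global attention has quadratic cost as the number of components grows. The key to the solution is the MoCA model, which has two core designs: (1) importance-based component routing, which selects the top-k relevant components for sparse global attention and so lowers the computational cost; and (2) unimportant components compression, which preserves the contextual priors of the unselected components while further reducing the complexity of global attention. Together these designs allow efficient, fine-grained, and scalable compositional generation of 3D assets.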

Link: https://arxiv.org/abs/2512.07628
Authors: Zhiqi Li,Wenhuan Li,Tengfei Wang,Zhenwei Wang,Junta Wu,Haoyuan Wang,Yunhan Yang,Zehuan Huang,Yang Li,Peidong Liu,Chunchao Guo
Affiliations: Zhejiang University; Westlake University; Tencent Hunyuan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Compositionality is critical for 3D object and scene generation, but existing part-aware 3D generation methods suffer from poor scalability due to quadratic global attention costs when increasing the number of components. In this work, we present MoCA, a compositional 3D generative model with two key designs: (1) importance-based component routing that selects top-k relevant components for sparse global attention, and (2) unimportant components compression that preserves contextual priors of unselected components while reducing the computational complexity of global attention. With these designs, MoCA enables efficient, fine-grained compositional 3D asset creation with a scalable number of components. Extensive experiments show MoCA outperforms baselines on both compositional object and scene generation tasks. Project page: this https URL
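
The two designs can be pictured with a small sketch: pick the top-k components by an importance score, and reduce every other component to a short summary. The importance scores here are toy inputs, and mean pooling is only a stand-in for compression; how MoCA actually scores and compresses components is not stated in the abstract, so both helper functions are hypothetical.

```python
def route_components(importance, k):
    """Split component indices into the k most important ones and the remainder."""
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    selected = sorted(ranked[:k])
    unselected = sorted(ranked[k:])
    return selected, unselected

def compress(component_tokens):
    """Stand-in compression: mean-pool a component's tokens into one summary token."""
    dim = len(component_tokens[0])
    count = len(component_tokens)
    return [sum(tok[j] for tok in component_tokens) / count for j in range(dim)]

scores = [0.1, 0.7, 0.05, 0.9, 0.3]             # toy importance scores for 5 components
selected, unselected = route_components(scores, k=2)
summary = compress([[1.0, 2.0], [3.0, 4.0]])    # two 2-d tokens pooled into one
```

Under this reading, a component attends to the full tokens of the selected components plus one summary token per unselected component, so the attended sequence grows with `k` rather than with the total number of components.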

[CV-30] Decomposition Sampling for Efficient Region Annotations in Active Learning

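[Quick read]: This paper addresses the inefficiency of active learning for dense prediction tasks such as medical image segmentation, where annotation is costly and slow. Existing methods for selecting representative annotation regions suffer from high compute and memory cost, irrelevant region choices, and heavy reliance on uncertainty sampling. The key to the solution is decomposition sampling (DECOMP): images are decomposed into class-specific components using pseudo-labels, and regions are sampled from each class, which increases annotation diversity. Class-wise predictive confidence guides the sampling so that difficult classes receive more annotations, which improves performance on minority classes.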

Link: https://arxiv.org/abs/2512.07606
Authors: Jingna Qiu,Frauke Wilm,Mathias Öttl,Jonas Utz,Maja Schlereth,Moritz Schillinger,Marc Aubreville,Katharina Breininger
Affiliations: Friedrich-Alexander-Universität Erlangen-Nürnberg; Julius-Maximilians-Universität Würzburg; MIRA Vision Microscopy GmbH; Hochschule Flensburg
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Active learning improves annotation efficiency by selecting the most informative samples for annotation and model training. While most prior work has focused on selecting informative images for classification tasks, we investigate the more challenging setting of dense prediction, where annotations are more costly and time-intensive, especially in medical imaging. Region-level annotation has been shown to be more efficient than image-level annotation for these tasks. However, existing methods for representative annotation region selection suffer from high computational and memory costs, irrelevant region choices, and heavy reliance on uncertainty sampling. We propose decomposition sampling (DECOMP), a new active learning sampling strategy that addresses these limitations. It enhances annotation diversity by decomposing images into class-specific components using pseudo-labels and sampling regions from each class. Class-wise predictive confidence further guides the sampling process, ensuring that difficult classes receive additional annotations. Across ROI classification, 2-D segmentation, and 3-D segmentation, DECOMP consistently surpasses baseline methods by better sampling minority-class regions and boosting performance on these challenging classes. Code is in this https URL.
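
One way to picture confidence-guided, class-wise sampling is to split a region budget across classes in proportion to how uncertain the model is about each class. The allocation rule below (one region per class, then shares proportional to `1 - confidence`) is an illustrative assumption and not the rule used by DECOMP, which the abstract does not spell out; the class names and confidence values are toy inputs.

```python
def allocate_regions(class_confidence, budget):
    """Split a region-annotation budget across classes, favouring low-confidence ones.

    Assumes budget >= number of classes and at least one confidence below 1.
    """
    difficulty = {c: 1.0 - conf for c, conf in class_confidence.items()}
    total = sum(difficulty.values())
    # One region per pseudo-labelled class first, so every class is represented.
    alloc = {c: 1 for c in class_confidence}
    remaining = budget - len(alloc)
    shares = {c: remaining * difficulty[c] / total for c in alloc}
    for c in alloc:
        alloc[c] += int(shares[c])
    # Hand out regions lost to rounding, largest fractional share first.
    leftover = budget - sum(alloc.values())
    by_fraction = sorted(shares, key=lambda c: shares[c] - int(shares[c]), reverse=True)
    for c in by_fraction[:leftover]:
        alloc[c] += 1
    return alloc

confidence = {"background": 0.95, "stroma": 0.80, "tumor": 0.60}  # toy values
plan = allocate_regions(confidence, budget=10)
```

In this toy case the least confident class receives the most regions, which mirrors the stated goal of giving difficult classes additional annotations while still covering every class.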

[CV-31] Online Segment Any 3D Thing as Instance Tracking NEURIPS2025

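[Quick read]: This paper addresses the fact that existing object-query-based methods for online, real-time, fine-grained 3D segmentation by embodied agents largely overlook temporal understanding. Current methods use queries to aggregate semantic information from Vision Foundation Model (VFM) outputs lifted into 3D point clouds and propagate it spatially, but they do not model temporal consistency, so object identities and features are hard to keep coherent over time. The key to the solution is to recast online 3D segmentation as an instance tracking task (AutoSeg3D), using sparse object queries to propagate information across frames: long-term instance association strengthens the consistency of features and identities, while short-term instance update enriches the current observation. Spatial consistency learning is added to mitigate the fragmentation inherent in VFMs, which improves both long-term and short-term temporal learning. The method outperforms ESAM on ScanNet200 and other datasets and sets a new state of the art.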

Link: https://arxiv.org/abs/2512.07599
Authors: Hanshi Wang,Zijian Cai,Jin Gao,Yiwei Zhang,Weiming Hu,Ke Wang,Zhipeng Zhang
Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), CASIA; School of Artificial Intelligence, University of Chinese Academy of Sciences; AutoLab, School of Artificial Intelligence, Shanghai Jiao Tong University; Anyverse Intelligence; Beijing Key Laboratory of Super Intelligent Security of Multi-Modal Information; School of Information Science and Technology, ShanghaiTech University; KargoBot
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025, Code is at this https URL

Abstract:Online, real-time, and fine-grained 3D segmentation constitutes a fundamental capability for embodied intelligent agents to perceive and comprehend their operational environments. Recent advancements employ predefined object queries to aggregate semantic information from Vision Foundation Models (VFMs) outputs that are lifted into 3D point clouds, facilitating spatial information propagation through inter-query interactions. Nevertheless, perception is an inherently dynamic process, rendering temporal understanding a critical yet overlooked dimension within these prevailing query-based pipelines. Therefore, to further unlock the temporal environmental perception capabilities of embodied agents, our work reconceptualizes online 3D segmentation as an instance tracking problem (AutoSeg3D). Our core strategy involves utilizing object queries for temporal information propagation, where long-term instance association promotes the coherence of features and object identities, while short-term instance update enriches instant observations. Given that viewpoint variations in embodied robotics often lead to partial object visibility across frames, this mechanism aids the model in developing a holistic object understanding beyond incomplete instantaneous views. Furthermore, we introduce spatial consistency learning to mitigate the fragmentation problem inherent in VFMs, yielding more comprehensive instance information for enhancing the efficacy of both long-term and short-term temporal learning. The temporal information exchange and consistency learning facilitated by these sparse object queries not only enhance spatial comprehension but also circumvent the computational burden associated with dense temporal point cloud interactions. Our method establishes a new state-of-the-art, surpassing ESAM by 2.8 AP on ScanNet200 and delivering consistent gains on ScanNet, SceneNN, and 3RScan datasets.

[CV-32] More than Segmentation: Benchmarking SAM 3 for Segmentation, 3D Perception, and Reconstruction in Robotic Surgery

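[Quick read]: This paper examines how well Segment Anything Model 3 (SAM 3) delivers efficient, accurate, and flexible image and video segmentation in robot-assisted surgery, in particular its zero-shot behavior with point, bounding box, and language prompts and the accuracy of its 3D reconstruction. The work is an empirical evaluation of SAM 3 on medical vision tasks. It covers the newly introduced language prompts, which add interaction flexibility but currently perform suboptimally in the surgical domain, and the reconstruction of 3D structure from 2D images. Benchmarks on MICCAI EndoVis 2017/2018 and zero-shot evaluations on SCARED, StereoMIS, and EndoNeRF assess zero-shot segmentation in dynamic surgical video and monocular depth estimation, providing evidence for deploying general-purpose vision models in surgical settings.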

Link: https://arxiv.org/abs/2512.07596
Authors: Wenzhen Dong,Jieming Yu,Yiming Huang,Hongqiu Wang,Lei Zhu,Albert C. S. Chung,Hongliang Ren,Long Bai
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Technical Report

Abstract:The recent Segment Anything Model (SAM) 3 has introduced significant advancements over its predecessor, SAM 2, particularly with the integration of language-based segmentation and enhanced 3D perception capabilities. SAM 3 supports zero-shot segmentation across a wide range of prompts, including point, bounding box, and language-based prompts, allowing for more flexible and intuitive interactions with the model. In this empirical evaluation, we assess the performance of SAM 3 in robot-assisted surgery, benchmarking its zero-shot segmentation with point and bounding box prompts and exploring its effectiveness in dynamic video tracking, alongside its newly introduced language prompt segmentation. While language prompts show potential, their performance in the surgical domain is currently suboptimal, highlighting the need for further domain-specific training. Additionally, we investigate SAM 3’s 3D reconstruction abilities, demonstrating its capacity to process surgical scene data and reconstruct 3D anatomical structures from 2D images. Through comprehensive testing on the MICCAI EndoVis 2017 and EndoVis 2018 benchmarks, SAM 3 shows clear improvements over SAM and SAM 2 in both image and video segmentation under spatial prompts, while zero-shot evaluations on SCARED, StereoMIS, and EndoNeRF indicate strong monocular depth estimation and realistic 3D instrument reconstruction, yet also reveal remaining limitations in complex, highly dynamic surgical scenes.

[CV-33] Robust Variational Model Based Tailored UNet: Leveraging Edge Detector and Mean Curvature for Improved Image Segmentation

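[Quick read]: This paper addresses the segmentation of noisy images with blurred or fragmented boundaries, where traditional methods struggle. The key to the solution is a robust version of the Variational Model Based Tailored UNet (VM_TUNet), which introduces physical priors, an edge detector, and a mean curvature term into a modified Cahn-Hilliard equation. This combines the interpretability and boundary-smoothing advantages of variational partial differential equations (PDEs) with the representational power of deep neural networks. The framework has two cooperating modules: an F module that performs efficient frequency-domain preprocessing to avoid poor local minima, and a T module that keeps local computations accurate and stable, backed by a stability estimate. The result is improved segmentation accuracy and visual quality at reasonable computational cost.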

Link: https://arxiv.org/abs/2512.07590
Authors: Kaili Qi,Zhongyi Huang,Wenli Yang
Affiliations: Tsinghua University; China University of Mining and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:To address the challenge of segmenting noisy images with blurred or fragmented boundaries, this paper presents a robust version of Variational Model Based Tailored UNet (VM_TUNet), a hybrid framework that integrates variational methods with deep learning. The proposed approach incorporates physical priors, an edge detector and a mean curvature term, into a modified Cahn-Hilliard equation, aiming to combine the interpretability and boundary-smoothing advantages of variational partial differential equations (PDEs) with the strong representational ability of deep neural networks. The architecture consists of two collaborative modules: an F module, which conducts efficient frequency domain preprocessing to alleviate poor local minima, and a T module, which ensures accurate and stable local computations, backed by a stability estimate. Extensive experiments on three benchmark datasets indicate that the proposed method achieves a balanced trade-off between performance and computational efficiency, which yields competitive quantitative results and improved visual quality compared to pure convolutional neural network (CNN) based models, while achieving performance close to that of transformer-based method with reasonable computational expense.

[CV-34] LongCat-Image Technical Report

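[Quick read]: This paper addresses core challenges of current leading image generation models: multilingual text rendering, photorealism, deployment efficiency, and developer accessibility. The key points of the solution are as follows. Rigorous data curation across the pre-training, mid-training, and supervised fine-tuning (SFT) stages, combined with curated reward models in the reinforcement learning (RL) stage, yields strong text rendering and aesthetic quality. A compact diffusion model with only 6B parameters, much smaller than the roughly 20B or larger Mixture-of-Experts (MoE) architectures common in the field, keeps VRAM usage and inference latency low. The model sets a new standard for Chinese character rendering, covering complex and rare characters with leading accuracy. Finally, the team releases a comprehensive open-source ecosystem with multiple model versions and the full training toolchain, improving usability for the community and reproducibility for research.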

Link: https://arxiv.org/abs/2512.07584
Authors: Meituan LongCat Team: Hanghang Ma,Haoxian Tan,Jiale Huang,Junqiang Wu,Jun-Yan He,Lishuai Gao,Songlin Xiao,Xiaoming Wei,Xiaoqi Ma,Xunliang Cai,Yayong Guan,Jie Hu
Affiliations: Meituan LongCat Team
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce LongCat-Image, a pioneering open-source and bilingual (Chinese-English) foundation model for image generation, designed to address core challenges in multilingual text rendering, photorealism, deployment efficiency, and developer accessibility prevalent in current leading models. 1) We achieve this through rigorous data curation strategies across the pre-training, mid-training, and SFT stages, complemented by the coordinated use of curated reward models during the RL phase. This strategy establishes the model as a new state-of-the-art (SOTA), delivering superior text-rendering capabilities and remarkable photorealism, and significantly enhancing aesthetic quality. 2) Notably, it sets a new industry standard for Chinese character rendering. By supporting even complex and rare characters, it outperforms both major open-source and commercial solutions in coverage, while also achieving superior accuracy. 3) The model achieves remarkable efficiency through its compact design. With a core diffusion model of only 6B parameters, it is significantly smaller than the nearly 20B or larger Mixture-of-Experts (MoE) architectures common in the field. This ensures minimal VRAM usage and rapid inference, significantly reducing deployment costs. Beyond generation, LongCat-Image also excels in image editing, achieving SOTA results on standard benchmarks with superior editing consistency compared to other open-source works. 4) To fully empower the community, we have established the most comprehensive open-source ecosystem to date. We are releasing not only multiple model versions for text-to-image and image editing, including checkpoints after mid-training and post-training stages, but also the entire toolchain of training procedure. We believe that the openness of LongCat-Image will provide robust support for developers and researchers, pushing the frontiers of visual content creation.

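[CV-35] All You Need Are Random Visual Tokens? Demystifying Token Pruning in VLLMs

【Quick Read】: This paper addresses the high computational cost that Vision Large Language Models (VLLMs) incur at inference time because they rely on large numbers of visual tokens. Existing training-free pruning methods perform no better than random pruning in deep layers (e.g., beyond the 20th layer), which limits their potential efficiency gains. The key finding is that the information carried by visual tokens decays with network depth until it vanishes, forming an "information horizon" beyond which the tokens are redundant; the position of this horizon depends on task complexity and model capacity. Building on this, the authors show that simple random pruning in deep layers balances performance and efficiency and consistently strengthens existing pruning methods such as DivPrune, retaining 96.9% of the original model's performance while pruning 50% of the visual tokens.

Link: https://arxiv.org/abs/2512.07580
Authors: Yahong Wang, Juncheng Wu, Zhangkai Ni, Longzhen Yang, Yihang Liu, Chengmei Yang, Ying Wen, Xianfeng Tang, Hui Liu, Yuyin Zhou, Lianghua He
Affiliations: Tongji University; University of California, Santa Cruz; Amazon; East China Normal University; Shanghai Eye Disease Prevention and Treatment Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: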

Abstract:Vision Large Language Models (VLLMs) incur high computational costs due to their reliance on hundreds of visual tokens to represent images. While token pruning offers a promising solution for accelerating inference, this paper, however, identifies a key observation: in deeper layers (e.g., beyond the 20th), existing training-free pruning methods perform no better than random pruning. We hypothesize that this degradation is caused by “vanishing token information”, where visual tokens progressively lose their salience with increasing network depth. To validate this hypothesis, we quantify a token’s information content by measuring the change in the model output probabilities upon its removal. Using this proposed metric, our analysis of the information of visual tokens across layers reveals three key findings: (1) As layers deepen, the information of visual tokens gradually becomes uniform and eventually vanishes at an intermediate layer, which we term as “information horizon”, beyond which the visual tokens become redundant; (2) The position of this horizon is not static; it extends deeper for visually intensive tasks, such as Optical Character Recognition (OCR), compared to more general tasks like Visual Question Answering (VQA); (3) This horizon is also strongly correlated with model capacity, as stronger VLLMs (e.g., Qwen2.5-VL) employ deeper visual tokens than weaker models (e.g., LLaVA-1.5). Based on our findings, we show that simple random pruning in deep layers efficiently balances performance and efficiency. Moreover, integrating random pruning consistently enhances existing methods. Using DivPrune with random pruning achieves state-of-the-art results, maintaining 96.9% of Qwen-2.5-VL-7B performance while pruning 50% of visual tokens. The code will be publicly available at this https URL.
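
As a rough illustration of the two ideas in this abstract, the sketch below scores a token by how much the output distribution changes when it is removed, and prunes tokens uniformly at random. The function names, the use of total-variation distance, and the fixed seed are assumptions made for illustration; the abstract does not specify the exact distance or the sampling procedure.

```python
import random


def token_information(p_full, p_without):
    """Hypothetical score: total-variation distance between the output
    distribution with all tokens (p_full) and with one token removed
    (p_without). The paper's exact metric may differ."""
    if len(p_full) != len(p_without):
        raise ValueError("distributions must have the same length")
    return 0.5 * sum(abs(a - b) for a, b in zip(p_full, p_without))


def random_prune(tokens, keep_ratio, seed=0):
    """Keep a uniformly random subset of tokens, preserving their order.

    Returns (kept_tokens, kept_indices)."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not tokens:
        return [], []
    n_keep = max(1, round(len(tokens) * keep_ratio))
    rng = random.Random(seed)
    # Sample indices without replacement, then restore the original order.
    kept = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept], kept
```

For example, `random_prune(list(range(576)), 0.5)` keeps 288 of 576 placeholder tokens; in a real model the pruning would be applied to hidden states at a chosen deep layer.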

[CV-36] Dual-Stream Cross-Modal Representation Learning via Residual Semantic Decorrelation

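【Quick Read】: This paper addresses modality dominance, redundant information coupling, and spurious cross-modal correlations in multimodal learning, which lead to poor generalization and limited interpretability. High-variance modalities tend to overshadow weaker but semantically important ones, while naive fusion strategies mix modality-shared and modality-specific factors indiscriminately, making it hard to tell what drives a prediction and reducing robustness when some modalities are missing or noisy. The proposed solution is the Dual-Stream Residual Semantic Decorrelation Network (DSRSD-Net), whose core components are: (1) residual projection that separates intra-modal (private) and inter-modal (shared) latent factors; (2) a residual semantic alignment head that combines contrastive and regression-style objectives to map the shared factors of different modalities into a common space; and (3) decorrelation and orthogonality losses that constrain the covariance structure of the shared space and enforce orthogonality between the shared and private streams, suppressing cross-modal redundancy and preventing feature collapse. The method consistently improves next-step prediction and final outcome prediction on educational benchmarks.

Link: https://arxiv.org/abs/2512.07568
Authors: Xuecheng Li, Weikuan Jia, Alisher Kurbonaliev, Qurbonaliev Alisher, Khudzhamkulov Rustam, Ismoilov Shuhratjon, Eshmatov Javhariddin, Yuanjie Zheng
Affiliations: Shandong Normal University; Tajik State University of Law, Business and Politics
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: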

Abstract:Cross-modal learning has become a fundamental paradigm for integrating heterogeneous information sources such as images, text, and structured attributes. However, multimodal representations often suffer from modality dominance, redundant information coupling, and spurious cross-modal correlations, leading to suboptimal generalization and limited interpretability. In particular, high-variance modalities tend to overshadow weaker but semantically important signals, while naïve fusion strategies entangle modality-shared and modality-specific factors in an uncontrolled manner. This makes it difficult to understand which modality actually drives a prediction and to maintain robustness when some modalities are noisy or missing. To address these challenges, we propose a Dual-Stream Residual Semantic Decorrelation Network (DSRSD-Net), a simple yet effective framework that disentangles modality-specific and modality-shared information through residual decomposition and explicit semantic decorrelation constraints. DSRSD-Net introduces: (1) a dual-stream representation learning module that separates intra-modal (private) and inter-modal (shared) latent factors via residual projection; (2) a residual semantic alignment head that maps shared factors from different modalities into a common space using a combination of contrastive and regression-style objectives; and (3) a decorrelation and orthogonality loss that regularizes the covariance structure of the shared space while enforcing orthogonality between shared and private streams, thereby suppressing cross-modal redundancy and preventing feature collapse. Experimental results on two large-scale educational benchmarks demonstrate that DSRSD-Net consistently improves next-step prediction and final outcome prediction over strong single-modality, early-fusion, late-fusion, and co-attention baselines.
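
The third component, a decorrelation and orthogonality loss, can be sketched as follows. This is a minimal pure-Python illustration under assumptions: the decorrelation term penalizes squared off-diagonal covariance entries of the shared features, and the orthogonality term penalizes the squared entries of the shared-private cross-correlation matrix. The function names and the exact form of both terms are hypothetical and may differ from the paper's losses.

```python
def _center(rows):
    """Subtract the per-dimension mean from a batch of feature rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def decorrelation_loss(shared):
    """Sum of squared off-diagonal entries of the sample covariance
    matrix of the shared features (assumed form, not the paper's)."""
    if len(shared) < 2:
        raise ValueError("need at least two samples")
    xc = _center(shared)
    n, d = len(xc), len(xc[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                cov = sum(r[i] * r[j] for r in xc) / (n - 1)
                loss += cov * cov
    return loss


def orthogonality_loss(shared, private):
    """Squared Frobenius norm of S^T P / n, which is zero when every
    shared dimension is orthogonal to every private dimension."""
    n = len(shared)
    loss = 0.0
    for i in range(len(shared[0])):
        for j in range(len(private[0])):
            c = sum(s[i] * p[j] for s, p in zip(shared, private)) / n
            loss += c * c
    return loss
```

In a training loop these two terms would typically be added to the task loss with small weights; the weights themselves are not given in the abstract.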

[CV-37] ReLaX: Reasoning with Latent Exploration for Large Reasoning Models

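【Quick Read】: This paper addresses the entropy collapse caused by Reinforcement Learning with Verifiable Rewards (RLVR), which manifests as premature policy convergence and performance saturation and limits the reasoning gains of Large Reasoning Models (LRMs). The key is to use Koopman operator theory to obtain a linearized representation of hidden-state dynamics, making the model's internal dynamics amenable to analysis and intervention. On this basis the authors propose Dynamic Spectral Dispersion (DSD), a new metric that quantifies the heterogeneity of the latent dynamics and serves as a direct indicator of policy exploration, and then design ReLaX (Reasoning with Latent eXploration), a paradigm that explicitly uses latent dynamics to regulate the balance between exploration and exploitation. ReLaX significantly mitigates premature convergence and achieves state-of-the-art performance on multimodal and text-only reasoning benchmarks.

Link: https://arxiv.org/abs/2512.07558
Authors: Shimin Zhang, Xianwei Chen, Yufan Shen, Ziyuan Ye, Jibin Wu
Affiliations: Hong Kong Polytechnic University; Shanghai Artificial Intelligence Laboratory
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: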

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated remarkable potential in enhancing the reasoning capability of Large Reasoning Models (LRMs). However, RLVR often leads to entropy collapse, resulting in premature policy convergence and performance saturation. While manipulating token-level entropy has proven effective for promoting policy exploration, we argue that the latent dynamics underlying token generation encode a far richer computational structure for steering policy optimization toward a more effective exploration-exploitation tradeoff. To enable tractable analysis and intervention of the latent dynamics of LRMs, we leverage Koopman operator theory to obtain a linearized representation of their hidden-state dynamics. This enables us to introduce Dynamic Spectral Dispersion (DSD), a new metric to quantify the heterogeneity of the model’s latent dynamics, serving as a direct indicator of policy exploration. Building upon these foundations, we propose Reasoning with Latent eXploration (ReLaX), a paradigm that explicitly incorporates latent dynamics to regulate exploration and exploitation during policy optimization. Comprehensive experiments across a wide range of multimodal and text-only reasoning benchmarks show that ReLaX significantly mitigates premature convergence and consistently achieves state-of-the-art performance.

[CV-38] From Orbit to Ground: Generative City Photogrammetry from Extreme Off-Nadir Satellite Images

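【Quick Read】: This paper addresses extreme viewpoint extrapolation in city-scale 3D reconstruction from sparse satellite imagery: synthesizing ground-level novel views from orbital images with minimal parallax. Methods such as NeRF and 3DGS struggle here because they cannot handle severely foreshortened facade textures and viewpoint gaps of nearly 90°. The solution rests on two design choices. First, city geometry is modeled as a 2.5D height map, implemented as a Z-monotonic signed distance field (SDF) that matches top-down building layouts; this stabilizes geometry optimization under sparse, off-nadir satellite views and yields a watertight mesh with crisp roofs and vertically extruded facades. Second, satellite image textures are painted onto the mesh via differentiable rendering, and a generative texture restoration network recovers high-frequency detail from degraded inputs to improve visual fidelity.

Link: https://arxiv.org/abs/2512.07527
Authors: Fei Yu, Yu Liu, Luyang Tang, Mingchao Sun, Zengye Ge, Rui Bu, Yuchao Jin, Haisen Zhao, He Sun, Yangyan Li, Mu Xu, Wenzheng Chen, Baoquan Chen
Affiliations: Peking University; AMAP; Ant Group; Shandong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: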

Abstract:City-scale 3D reconstruction from satellite imagery presents the challenge of extreme viewpoint extrapolation, where our goal is to synthesize ground-level novel views from sparse orbital images with minimal parallax. This requires inferring nearly 90° viewpoint gaps from image sources with severely foreshortened facades and flawed textures, causing state-of-the-art reconstruction engines such as NeRF and 3DGS to fail. To address this problem, we propose two design choices tailored for city structures and satellite inputs. First, we model city geometry as a 2.5D height map, implemented as a Z-monotonic signed distance field (SDF) that matches urban building layouts from top-down viewpoints. This stabilizes geometry optimization under sparse, off-nadir satellite views and yields a watertight mesh with crisp roofs and clean, vertically extruded facades. Second, we paint the mesh appearance from satellite images via differentiable rendering techniques. While the satellite inputs may contain long-range, blurry captures, we further train a generative texture restoration network to enhance the appearance, recovering high-frequency, plausible texture details from degraded inputs. Our method’s scalability and robustness are demonstrated through extensive experiments on large-scale urban reconstruction. For example, in our teaser figure, we reconstruct a 4 km² real-world region from only a few satellite images, achieving state-of-the-art performance in synthesizing photorealistic ground views. The resulting models are not only visually compelling but also serve as high-fidelity, application-ready assets for downstream tasks like urban planning and simulation.
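
To make the 2.5D height-map idea concrete, the sketch below defines an implicit function from a height grid that is monotonic along z. It is only an illustration under assumptions: the value `z - h(x, y)` is not a true signed distance, the nearest-cell lookup and the function names are hypothetical, and the paper's actual SDF parameterization is not described in the abstract.

```python
def height_field_value(heights, x, y, z):
    """Implicit value z - h(x, y) using nearest-cell lookup on a grid.

    Negative below the roof surface (inside a building column),
    positive above it. For fixed (x, y) the value increases strictly
    with z, so each vertical ray crosses the surface exactly once."""
    row = min(max(int(round(y)), 0), len(heights) - 1)
    col = min(max(int(round(x)), 0), len(heights[0]) - 1)
    return z - heights[row][col]


def is_inside(heights, x, y, z):
    """True when the query point lies below the height-map surface."""
    return height_field_value(heights, x, y, z) < 0
```

Because the sign changes only once along each vertical ray, roofs are single-valued and facades are vertical extrusions of the footprint, which matches the mesh properties the abstract describes.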

[CV-39] MeshRipple: Structured Autoregressive Generation of Artist-Meshes

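【Quick Read】: This paper addresses the loss of long-range geometric dependencies in current autoregressive mesh generators, which use sliding-window inference because of memory limits and therefore produce meshes with holes and fragmented components. The proposed MeshRipple framework rests on three innovations: (1) a frontier-aware breadth-first search (BFS) tokenization that aligns the generation order with surface topology; (2) an expansive prediction strategy that keeps surface growth coherent and connected; and (3) a sparse-attention global memory that provides an effectively unbounded receptive field for resolving long-range topological relations. Together these allow MeshRipple to generate meshes with high surface fidelity and topological completeness, outperforming strong recent baselines.

Link: https://arxiv.org/abs/2512.07514
Authors: Junkai Lin, Hang Long, Huipeng Guo, Jielei Zhang, JiaYi Yang, Tianle Guo, Yang Yang, Jianwen Li, Wenxiao Zhang, Matthias Nießner, Wei Yang
Affiliations: Huazhong University of Science and Technology; Independent Researcher; Technical University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: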

Abstract:Meshes serve as a primary representation for 3D assets. Autoregressive mesh generators serialize faces into sequences and train on truncated segments with sliding-window inference to cope with memory limits. However, this mismatch breaks long-range geometric dependencies, producing holes and fragmented components. To address this critical limitation, we introduce MeshRipple, which expands a mesh outward from an active generation frontier, akin to a ripple on a this http URL rests on three key innovations: a frontier-aware BFS tokenization that aligns the generation order with surface topology; an expansive prediction strategy that maintains coherent, connected surface growth; and a sparse-attention global memory that provides an effectively unbounded receptive field to resolve long-range topological this http URL integrated design enables MeshRipple to generate meshes with high surface fidelity and topological completeness, outperforming strong recent baselines.
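
The idea of ordering faces so that generation spreads outward from a frontier can be illustrated with a plain breadth-first traversal over face adjacency. This is a generic BFS sketch, not MeshRipple's tokenizer: the function names, the shared-edge adjacency rule, and the handling of disconnected components are assumptions for illustration.

```python
from collections import defaultdict, deque


def face_adjacency(faces):
    """Map each face index to the sorted indices of faces sharing an edge."""
    edge_to_faces = defaultdict(list)
    for f, (a, b, c) in enumerate(faces):
        for u, v in ((a, b), (b, c), (c, a)):
            edge_to_faces[(min(u, v), max(u, v))].append(f)
    neighbors = defaultdict(set)
    for sharing in edge_to_faces.values():
        for f in sharing:
            neighbors[f].update(g for g in sharing if g != f)
    return {f: sorted(neighbors[f]) for f in range(len(faces))}


def bfs_face_order(faces, start=0):
    """Return face indices in BFS order from `start`; remaining
    disconnected components are visited afterwards in index order."""
    if not faces:
        return []
    adj = face_adjacency(faces)
    visited, order = set(), []
    seeds = [start] + [f for f in range(len(faces)) if f != start]
    for seed in seeds:
        if seed in visited:
            continue
        visited.add(seed)
        queue = deque([seed])
        while queue:
            f = queue.popleft()
            order.append(f)
            for g in adj[f]:
                if g not in visited:
                    visited.add(g)
                    queue.append(g)
    return order
```

With this ordering, each newly emitted face is adjacent to a face already in the sequence whenever the component is connected, which is the property that keeps growth contiguous.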

[CV-40] Exploring possible vector systems for faster training of neural networks with preconfigured latent spaces

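【Quick Read】: This paper addresses the low training efficiency of neural networks (NNs) on tasks with very large numbers of classes and the difficulty of controlling the structure of their latent space (LS). The core challenge is designing efficient latent space configurations (LSC) that speed up convergence and reduce storage overhead. The key is to use predefined vector systems (such as A_n root system vectors) as the target structure of the LS, so that classifier NNs can be trained without classification layers, and to use the minimum number of LS dimensions to accelerate training on ImageNet-1K and on tasks with 50k to 600k classes, while reducing the size of the vector databases used to store NN embeddings.

Link: https://arxiv.org/abs/2512.07509
Authors: Nikita Gabdullin
Affiliations: Joint Stock "Research and production company 'Kryptonite'"
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures, 1 table, 4 equations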

Abstract:The overall neural network (NN) performance is closely related to the properties of its embedding distribution in latent space (LS). It has recently been shown that predefined vector systems, specifically A_n root system vectors, can be used as targets for latent space configurations (LSC) to ensure the desired LS structure. One of the main LSC advantage is the possibility of training classifier NNs without classification layers, which facilitates training NNs on datasets with extremely large numbers of classes. This paper provides a more general overview of possible vector systems for NN training along with their properties and methods for vector system construction. These systems are used to configure LS of encoders and visual transformers to significantly speed up ImageNet-1K and 50k-600k classes LSC training. It is also shown that using the minimum number of LS dimensions for a specific number of classes results in faster convergence. The latter has potential advantages for reducing the size of vector databases used to store NN embeddings.
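
For readers unfamiliar with the A_n root system mentioned above, the sketch below enumerates its roots using the standard construction: all vectors e_i - e_j with i ≠ j in (n+1)-dimensional space, giving n(n+1) vectors. How the paper normalizes these vectors or assigns them to classes is not stated in the abstract, so the function names and any use as class targets are assumptions.

```python
def a_n_roots(n):
    """Enumerate the n*(n+1) roots e_i - e_j (i != j) of the A_n root
    system, embedded in R^(n+1). Every root has coordinates summing to 0."""
    if n < 1:
        raise ValueError("n must be at least 1")
    dim = n + 1
    roots = []
    for i in range(dim):
        for j in range(dim):
            if i != j:
                v = [0] * dim
                v[i], v[j] = 1, -1
                roots.append(tuple(v))
    return roots


def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))
```

Every root has squared norm 2, and the inner product of two distinct roots is one of -2, -1, 0, or 1, so the vectors are well separated; this regular structure is what makes such systems candidates for predefined latent-space targets.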

[CV-41] ControlVP: Interactive Geometric Refinement of AI-Generated Images with Consistent Vanishing Points WACV2026

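【Quick Read】: This paper addresses geometric inconsistencies in images produced by generative models such as Stable Diffusion, in particular vanishing point inconsistency, where the projections of parallel lines fail to converge correctly in 2D space and distort scene structure, especially in architectural images. The key is ControlVP, a framework that extends a pre-trained diffusion model with structural guidance derived from building contours and adds explicit geometric constraints that align image edges with perspective cues. This improves global geometric consistency while preserving visual fidelity, and suits applications that need accurate spatial structure, such as image-to-3D reconstruction.

Link: https://arxiv.org/abs/2512.07504
Authors: Ryota Okumura, Kaede Shiohara, Toshihiko Yamasaki
Affiliations: The University of Tokyo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to WACV 2026, 8 pages, supplementary included. Dataset and code: this https URL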

Abstract:Recent text-to-image models, such as Stable Diffusion, have achieved impressive visual quality, yet they often suffer from geometric inconsistencies that undermine the structural realism of generated scenes. One prominent issue is vanishing point inconsistency, where projections of parallel lines fail to converge correctly in 2D space. This leads to structurally implausible geometry that degrades spatial realism, especially in architectural scenes. We propose ControlVP, a user-guided framework for correcting vanishing point inconsistencies in generated images. Our approach extends a pre-trained diffusion model by incorporating structural guidance derived from building contours. We also introduce geometric constraints that explicitly encourage alignment between image edges and perspective cues. Our method enhances global geometric consistency while maintaining visual fidelity comparable to the baselines. This capability is particularly valuable for applications that require accurate spatial structure, such as image-to-3D reconstruction. The dataset and source code are available at this https URL .
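
The notion of a vanishing point can be made concrete with standard projective geometry: in homogeneous coordinates, the line through two image points is their cross product, and the intersection of two lines is the cross product of the lines. The sketch below uses only this textbook construction; it is not ControlVP's method, and the function names are hypothetical.

```python
def cross(u, v):
    """Cross product of two 3-vectors."""
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


def line_through(p, q):
    """Homogeneous line through two 2D image points p and q."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def vanishing_point(seg_a, seg_b, eps=1e-12):
    """Intersection of the lines supporting two image segments.

    Each segment is a pair of 2D points. Returns (x, y), or None when
    the lines are parallel in the image (intersection at infinity)."""
    x, y, w = cross(line_through(*seg_a), line_through(*seg_b))
    if abs(w) < eps:
        return None
    return (x / w, y / w)
```

If several edges that are parallel in 3D are drawn consistently, their pairwise intersections computed this way should cluster around a single point; scattered intersections indicate the inconsistency the abstract describes.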

[CV-42] SJD++: Improved Speculative Jacobi Decoding for Training-free Acceleration of Discrete Auto-regressive Text-to-Image Generation

【Quick Read】: This paper addresses the slow inference of autoregressive text-to-image models, which stems from the hundreds to thousands of sequential forward passes needed for token-by-token prediction. The key is a training-free probabilistic parallel decoding algorithm, Speculative Jacobi Decoding++ (SJD++), which combines the iterative multi-token prediction of Jacobi decoding with the probabilistic drafting-and-verification mechanism of speculative sampling, so that multiple tokens are predicted in each forward pass and the number of generation steps drops sharply. In addition, SJD++ reuses high-confidence draft tokens after each verification phase instead of resampling them all, achieving 2× to 7× step compression and 2× to 3× lower inference latency with no observable loss in visual quality.

Link: https://arxiv.org/abs/2512.07503
Authors: Yao Teng, Zhihuan Jiang, Han Shi, Xian Liu, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, Xihui Liu
Affiliations: The University of Hong Kong; Huawei Noah’s Ark Lab; Peking University; xAI; Tsinghua University; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large autoregressive models can generate high-quality, high-resolution images but suffer from slow generation speed, because these models require hundreds to thousands of sequential forward passes for next-token prediction during inference. To accelerate autoregressive text-to-image generation, we propose Speculative Jacobi Decoding++ (SJD++), a training-free probabilistic parallel decoding algorithm. Unlike traditional next-token prediction, SJD++ performs multi-token prediction in each forward pass, drastically reducing generation steps. Specifically, it integrates the iterative multi-token prediction mechanism from Jacobi decoding, with the probabilistic drafting-and-verification mechanism from speculative sampling. More importantly, for further acceleration, SJD++ reuses high-confidence draft tokens after each verification phase instead of resampling them all. We conduct extensive experiments on several representative autoregressive text-to-image generation models and demonstrate that SJD++ achieves 2× to 3× inference latency reduction and 2× to 7× step compression, while preserving visual quality with no observable degradation.
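
The Jacobi-decoding ingredient can be illustrated in its simplest, deterministic (greedy) form: all draft positions are re-predicted in parallel from the current draft until the draft stops changing. This sketch is not SJD++ itself, which adds probabilistic verification and draft-token reuse; `next_token` is a hypothetical toy stand-in for a real model's greedy prediction.

```python
def next_token(prefix):
    """Toy deterministic 'model' (hypothetical): maps a prefix of
    integer tokens to the next token in a 17-token vocabulary."""
    return (sum(prefix) * 31 + len(prefix)) % 17


def sequential_decode(prompt, n):
    """Standard greedy decoding: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def jacobi_decode(prompt, n, init=0):
    """Greedy Jacobi decoding: refine all n draft tokens per iteration.

    Each iteration corresponds to one parallel forward pass in a real
    model. Returns (tokens, iterations) once a fixed point is reached."""
    draft = [init] * n
    iterations = 0
    while True:
        iterations += 1
        # Position i is predicted from the prompt plus the current draft prefix.
        new = [next_token(list(prompt) + draft[:i]) for i in range(n)]
        if new == draft:
            return draft, iterations
        draft = new
```

The fixed point always equals the sequential greedy output, and it is reached in at most n + 1 iterations because position i is guaranteed correct after iteration i + 1. The toy model here gives little or no speedup; gains arise in practice when several positions converge in the same iteration.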

[CV-43] MultiMotion: Multi Subject Video Motion Transfer via Video Diffusion Transformer

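【Quick Read】: This paper addresses two challenges that Diffusion Transformer (DiT) architectures face in multi-object video motion transfer: inherent entanglement of motion features and the lack of object-level control. The key is a new mechanism called Mask-aware Attention Motion Flow (AMF), which uses SAM2 segmentation masks to explicitly disentangle and control the motion features of multiple objects within the DiT pipeline, enabling precise, semantically aligned, and temporally coherent multi-object motion transfer.

Link: https://arxiv.org/abs/2512.07500
Authors: Penghui Liu, Jiangshan Wang, Yutong Shen, Shanhui Mo, Chenyang Qi, Yue Ma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: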

Abstract:Multi-object video motion transfer poses significant challenges for Diffusion Transformer (DiT) architectures due to inherent motion entanglement and lack of object-level control. We present MultiMotion, a novel unified framework that overcomes these limitations. Our core innovation is Mask-aware Attention Motion Flow (AMF), which utilizes SAM2 masks to explicitly disentangle and control motion features for multiple objects within the DiT pipeline. Furthermore, we introduce RectPC, a high-order predictor-corrector solver for efficient and accurate sampling, particularly beneficial for multi-entity generation. To facilitate rigorous evaluation, we construct the first benchmark dataset specifically for DiT-based multi-object motion transfer. MultiMotion demonstrably achieves precise, semantically aligned, and temporally coherent motion transfer for multiple distinct objects, maintaining DiT’s high quality and scalability. The code is in the supp.

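[CV-44] Towards Robust DeepFake Detection under Unstable Face Sequences: Adaptive Sparse Graph Embedding with Order-Free Representation and Explicit Laplacian Spectral Prior

【Quick Read】: This paper addresses a key obstacle to DeepFake video detection in real-world settings: existing detectors usually assume temporally consistent, noise-free face sequences, whereas compression artifacts, occlusions, and adversarial attacks often cause face detection to fail or misfire and severely degrade performance. The core of the solution is a Laplacian-Regularized Graph Convolutional Network (LR-GCN). Its key innovation is an Order-Free Temporal Graph Embedding (OF-TGE), which organizes frame-wise CNN features into an adaptive sparse graph based on semantic affinities and so captures intrinsic feature consistency across frames without relying on strict temporal continuity. The method also introduces a dual-level sparsity mechanism and an explicit Graph Laplacian Spectral Prior, which together act as a band-pass filter in the graph spectral domain: the high-pass prior highlights structural anomalies and forgery artifacts, and the low-pass GCN aggregation consolidates the manipulation cues. This suppresses background information and random noise and markedly improves robustness to missing faces, occlusions, and adversarial perturbations.

Link: https://arxiv.org/abs/2512.07498
Authors: Chih-Chung Hsu, Shao-Ning Chen, Chia-Ming Lee, Yi-Fang Wang, Yi-Shiuan Chou
Affiliations: National Yang Ming Chiao Tung University; National Cheng Kung University; National Science and Technology Council; Ministry of Education; National Center for High-performance Computing; National Applied Research Laboratories
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages (including appendix)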

Abstract:Ensuring the authenticity of video content remains challenging as DeepFake generation becomes increasingly realistic and robust against detection. Most existing detectors implicitly assume temporally consistent and clean facial sequences, an assumption that rarely holds in real-world scenarios where compression artifacts, occlusions, and adversarial attacks destabilize face detection and often lead to invalid or misdetected faces. To address these challenges, we propose a Laplacian-Regularized Graph Convolutional Network (LR-GCN) that robustly detects DeepFakes from noisy or unordered face sequences, while being trained only on clean facial data. Our method constructs an Order-Free Temporal Graph Embedding (OF-TGE) that organizes frame-wise CNN features into an adaptive sparse graph based on semantic affinities. Unlike traditional methods constrained by strict temporal continuity, OF-TGE captures intrinsic feature consistency across frames, making it resilient to shuffled, missing, or heavily corrupted inputs. We further impose a dual-level sparsity mechanism on both graph structure and node features to suppress the influence of invalid faces. Crucially, we introduce an explicit Graph Laplacian Spectral Prior that acts as a high-pass operator in the graph spectral domain, highlighting structural anomalies and forgery artifacts, which are then consolidated by a low-pass GCN aggregation. This sequential design effectively realizes a task-driven spectral band-pass mechanism that suppresses background information and random noise while preserving manipulation cues. Extensive experiments on FF++, Celeb-DFv2, and DFDC demonstrate that LR-GCN achieves state-of-the-art performance and significantly improved robustness under severe global and local disruptions, including missing faces, occlusions, and adversarially perturbed face detections.
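
The statement that a graph Laplacian acts as a high-pass operator can be checked on a tiny example. The sketch below builds the combinatorial Laplacian L = D - A and applies it to a node signal; the function names are hypothetical, and the paper's actual spectral prior (normalization, learned weights, feature dimensions) is not specified in the abstract.

```python
def graph_laplacian(adj):
    """Combinatorial Laplacian L = D - A for a symmetric adjacency
    matrix given as a list of lists (no self-loops assumed)."""
    n = len(adj)
    return [
        [(sum(adj[i]) if i == j else 0) - adj[i][j] for j in range(n)]
        for i in range(n)
    ]


def high_pass(adj, signal):
    """Apply L to a scalar node signal. Constant (low-frequency)
    signals map to zero; nodes that differ from their neighbors
    receive large-magnitude responses."""
    lap = graph_laplacian(adj)
    n = len(signal)
    return [sum(lap[i][j] * signal[j] for j in range(n)) for i in range(n)]
```

On a three-node path graph, a constant signal `[1, 1, 1]` is mapped to zeros, while the spike `[0, 1, 0]` is amplified at the middle node; an anomalous frame feature in a graph of frames would stand out in the same way.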

[CV-45] Single-step Diffusion-based Video Coding with Semantic-Temporal Guidance

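【Quick Read】: This paper addresses the difficulty that traditional video codecs and neural video codecs (NVCs) have in improving perceptual quality at low bitrates, in particular the artifacts caused by limited generation capacity in existing methods and the heavy sampling cost of using pretrained diffusion models. The key is S2VC (Single-Step diffusion based Video Codec), which combines a conditional coding framework with an efficient single-step diffusion generator to achieve high-quality reconstruction at low bitrates. Its core innovations are: (1) Contextual Semantic Guidance, which extracts frame-adaptive semantics from buffered features to replace text prompts and provide finer-grained conditioning; and (2) Temporal Consistency Guidance, embedded in the diffusion U-Net to enforce temporal coherence across frames and stabilize generation. Experiments show that S2VC reaches state-of-the-art perceptual quality, with an average bitrate saving of 52.73% over prior perceptually optimized methods.

Link: https://arxiv.org/abs/2512.07480
Authors: Naifu Xue, Zhaoyang Jia, Jiahao Li, Bin Li, Zihan Zheng, Yuan Zhang, Yan Lu
Affiliations: Communication University of China; University of Science and Technology of China; Microsoft Research Asia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: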

Abstract:While traditional and neural video codecs (NVCs) have achieved remarkable rate-distortion performance, improving perceptual quality at low bitrates remains challenging. Some NVCs incorporate perceptual or adversarial objectives but still suffer from artifacts due to limited generation capacity, whereas others leverage pretrained diffusion models to improve quality at the cost of heavy sampling complexity. To overcome these challenges, we propose S2VC, a Single-Step diffusion based Video Codec that integrates a conditional coding framework with an efficient single-step diffusion generator, enabling realistic reconstruction at low bitrates with reduced sampling cost. Recognizing the importance of semantic conditioning in single-step diffusion, we introduce Contextual Semantic Guidance to extract frame-adaptive semantics from buffered features. It replaces text captions with efficient, fine-grained conditioning, thereby improving generation realism. In addition, Temporal Consistency Guidance is incorporated into the diffusion U-Net to enforce temporal coherence across frames and ensure stable generation. Extensive experiments show that S2VC delivers state-of-the-art perceptual quality with an average 52.73% bitrate saving over prior perceptual methods, underscoring the promise of single-step diffusion for efficient, high-quality video compression.

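[CV-46] Unified Video Editing with Temporal Reasoner

【Quick Read】: This paper addresses the tension between precision and unification in video editing: expert models edit precisely but depend on task-specific priors such as masks and are hard to unify, whereas unified temporal in-context learning models need no masks but lack explicit spatial cues, leading to weak instruction-to-region mapping and imprecise localization. The key is VideoCoF (Chain-of-Frames), which borrows from Chain-of-Thought reasoning and forces the video diffusion model to follow a "see, reason, then edit" procedure: it first predicts reasoning tokens (edit-region latents) and only then generates the target video tokens, achieving precise instruction-to-region alignment and fine-grained editing without user-provided masks. The method also introduces a RoPE alignment strategy that uses the reasoning tokens to maintain motion alignment and to extrapolate to lengths beyond the training duration.

Link: https://arxiv.org/abs/2512.07469
Authors: Xiangpeng Yang, Ji Xie, Yiyuan Yang, Yan Huang, Min Xu, Qiang Wu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL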

Abstract:Existing video editing methods face a critical trade-off: expert models offer precision but rely on task-specific priors like masks, hindering unification; conversely, unified temporal in-context learning models are mask-free but lack explicit spatial cues, leading to weak instruction-to-region mapping and imprecise localization. To resolve this conflict, we propose VideoCoF, a novel Chain-of-Frames approach inspired by Chain-of-Thought reasoning. VideoCoF enforces a “see, reason, then edit” procedure by compelling the video diffusion model to first predict reasoning tokens (edit-region latents) before generating the target video tokens. This explicit reasoning step removes the need for user-provided masks while achieving precise instruction-to-region alignment and fine-grained video editing. Furthermore, we introduce a RoPE alignment strategy that leverages these reasoning tokens to ensure motion alignment and enable length extrapolation beyond the training duration. We demonstrate that with a minimal data cost of only 50k video pairs, VideoCoF achieves state-of-the-art performance on VideoCoF-Bench, validating the efficiency and effectiveness of our approach. Our code, weight, data are available at this https URL.

[CV-47] Human Geometry Distribution for 3D Animation Generation

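【Quick Read】: This paper addresses the difficulty of generating realistic human geometry animations from limited data, in particular modeling natural clothing dynamics while preserving fine geometric detail. The solution rests on two designs. The first is a compact distribution-based latent representation that establishes a more uniform mapping between SMPL and avatar geometries, improving the efficiency and quality of geometry generation. The second is a generative animation model that exploits the diversity of limited motion data by focusing on short-term transitions, while an identity-conditioned design maintains long-term consistency. Together they form a two-stage framework: the first stage learns the latent space and the second generates animations within it. Experiments show clear gains over existing methods in both geometric fidelity and animation naturalness.

Link: https://arxiv.org/abs/2512.07459
Authors: Xiangjun Tang, Biao Zhang, Peter Wonka
Affiliations: King Abdullah University of Science and Technology
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: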

Abstract:Generating realistic human geometry animations remains a challenging task, as it requires modeling natural clothing dynamics with fine-grained geometric details under limited data. To address these challenges, we propose two novel designs. First, we propose a compact distribution-based latent representation that enables efficient and high-quality geometry generation. We improve upon previous work by establishing a more uniform mapping between SMPL and avatar geometries. Second, we introduce a generative animation model that fully exploits the diversity of limited motion data. We focus on short-term transitions while maintaining long-term consistency through an identity-conditioned design. These two designs formulate our method as a two-stage framework: the first stage learns a latent space, while the second learns to generate animations within this latent space. We conducted experiments on both our latent space and animation model. We demonstrate that our latent space produces high-fidelity human geometry surpassing previous methods (90% lower Chamfer Dist.). The animation model synthesizes diverse animations with detailed and natural dynamics (2.2× higher user study score), achieving the best results across all evaluation metrics.
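
The Chamfer distance cited in this abstract compares two point sets by averaging nearest-neighbor distances in both directions. Conventions vary (squared versus unsquared distances, sum versus mean), and the abstract does not say which one the authors use; the sketch below assumes squared Euclidean distances averaged per set and then summed, with hypothetical function names.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets:
    mean over A of the squared distance to the nearest point in B,
    plus the same term with A and B swapped. O(|A| * |B|) brute force."""
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")
    a_to_b = sum(
        min(squared_distance(a, b) for b in points_b) for a in points_a
    ) / len(points_a)
    b_to_a = sum(
        min(squared_distance(b, a) for a in points_a) for b in points_b
    ) / len(points_b)
    return a_to_b + b_to_a
```

For meshes, the point sets are usually sampled from the surfaces; for large sets a spatial index such as a k-d tree would replace the brute-force search.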

[CV-48] KAN-Dreamer: Benchmarking Kolmogorov-Arnold Networks as Function Approximators in World Models

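【Quick Read】: This paper examines the trade-off between parameter efficiency and computational efficiency in the world-model component of model-based reinforcement learning (MBRL), focusing on the multi-layer perceptron (MLP) and convolutional layers used in the DreamerV3 framework. The approach introduces Kolmogorov-Arnold Networks (KANs) and the efficient FastKAN variant to replace specific DreamerV3 modules (such as the Reward and Continue predictors), implemented as a fully vectorized JAX version with simplified grid management. With sample efficiency and training speed on par with the MLP baseline, the study offers preliminary evidence that KANs are a feasible alternative to MLPs for building world models.

Link: https://arxiv.org/abs/2512.07437
Authors: Chenwei Shi, Xueyu Luan
Affiliations: Shanghai Research Institute for Intelligent Autonomous Systems; Tongji University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO)
Comments: 23 pages, 8 figures, 3 tables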

Abstract:DreamerV3 is a state-of-the-art online model-based reinforcement learning (MBRL) algorithm known for remarkable sample efficiency. Concurrently, Kolmogorov-Arnold Networks (KANs) have emerged as a promising alternative to Multi-Layer Perceptrons (MLPs), offering superior parameter efficiency and interpretability. To mitigate KANs’ computational overhead, variants like FastKAN leverage Radial Basis Functions (RBFs) to accelerate inference. In this work, we investigate integrating KAN architectures into the DreamerV3 framework. We introduce KAN-Dreamer, replacing specific MLP and convolutional components of DreamerV3 with KAN and FastKAN layers. To ensure efficiency within the JAX-based World Model, we implement a tailored, fully vectorized version with simplified grid management. We structure our investigation into three subsystems: Visual Perception, Latent Prediction, and Behavior Learning. Empirical evaluations on the DeepMind Control Suite (walker_walk) analyze sample efficiency, training time, and asymptotic performance. Experimental results demonstrate that utilizing our adapted FastKAN as a drop-in replacement for the Reward and Continue predictors yields performance on par with the original MLP-based architecture, maintaining parity in both sample efficiency and training speed. This report serves as a preliminary study for future developments in KAN-based world models.
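
The RBF idea behind FastKAN can be sketched in a few lines: each scalar input is expanded into Gaussian radial basis features on a fixed grid of centers, and each output is a learned linear combination of those features. This is a simplified pure-Python illustration, not the authors' JAX implementation; the function names, the weight layout, and the omission of any base or normalization branch are assumptions.

```python
import math


def rbf_features(x, centers, width):
    """Gaussian RBF expansion of a scalar: exp(-((x - c) / width)^2)
    for each grid center c."""
    return [math.exp(-((x - c) / width) ** 2) for c in centers]


def rbf_kan_layer(inputs, weights, centers, width):
    """Forward pass of a simplified RBF-based KAN layer.

    weights[j][i][k] is the coefficient of basis function k applied to
    input i, contributing to output j (hypothetical layout)."""
    feats = [rbf_features(x, centers, width) for x in inputs]
    return [
        sum(
            w_jik * phi
            for w_ji, phi_i in zip(w_j, feats)
            for w_jik, phi in zip(w_ji, phi_i)
        )
        for w_j in weights
    ]
```

Because the basis expansion is a fixed elementwise function followed by a linear map, the whole layer vectorizes naturally, which is consistent with the abstract's emphasis on a fully vectorized implementation.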

[CV-49] When normalization hallucinates: unseen risks in AI-powered whole slide image processing

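【Quick Read】: This paper addresses hallucinations introduced during whole slide image (WSI) normalization by deep learning models whose outputs gravitate toward the average of the training distribution: artifacts that look plausible but are not present in the original tissue, are hard to spot by visual inspection, and are often overlooked by current evaluation practice. The key is a novel image comparison measure that automatically detects hallucinated content in normalized outputs. Using this measure, the authors systematically re-evaluate several well-cited normalization methods on real-world clinical data and reveal significant inconsistencies and failure modes that conventional metrics do not capture, underscoring the need for more robust, interpretable normalization techniques and stricter validation protocols.

Link: https://arxiv.org/abs/2512.07426
Authors: Karel Moens, Matthew B. Blaschko, Tinne Tuytelaars, Bart Diricx, Jonas De Vylder, Mustafa Yousif
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 4 pages, accepted for oral presentation at SPIE Medical Imaging, 2026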

Abstract:Whole slide image (WSI) normalization remains a vital preprocessing step in computational pathology. Increasingly driven by deep learning, these models learn to approximate data distributions from training examples. This often results in outputs that gravitate toward the average, potentially masking diagnostically important features. More critically, they can introduce hallucinated content, artifacts that appear realistic but are not present in the original tissue, posing a serious threat to downstream analysis. These hallucinations are nearly impossible to detect visually, and current evaluation practices often overlook them. In this work, we demonstrate that the risk of hallucinations is real and underappreciated. While many methods perform adequately on public datasets, we observe a concerning frequency of hallucinations when these same models are retrained and evaluated on real-world clinical data. To address this, we propose a novel image comparison measure designed to automatically detect hallucinations in normalized outputs. Using this measure, we systematically evaluate several well-cited normalization methods retrained on real-world data, revealing significant inconsistencies and failures that are not captured by conventional metrics. Our findings underscore the need for more robust, interpretable normalization techniques and stricter validation protocols in clinical deployment.

[CV-50] Revolutionizing Mixed Precision Quantization: Towards Training-free Automatic Proxy Discovery via Large Language Models

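[Quick read]: This paper addresses the reliance of proxy design in Mixed-Precision Quantization (MPQ) on human expertise and expensive training. Conventional methods either search via costly differentiable optimization or depend on proxies hand-designed by experts (e.g., HAWQ), which is inefficient and inflexible. The key to the solution is TAP (Training-free Automatic Proxy), an LLM-driven framework that discovers proxies automatically without human intervention or training. Large Language Models (LLMs) generate proxies tailored to MPQ, and a reinforcement learning scheme based on Direct Policy Optimization (DPO) optimizes the prompts, forming a positive feedback loop between the LLM and the MPQ task so that the LLM produces better proxies in each successive evolution round. According to the authors, TAP achieves state-of-the-art performance on mainstream benchmarks.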

Link: https://arxiv.org/abs/2512.07419
Authors: Haidong Kang, Jun Du, Lihong Lin
Affiliations: unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Mixed-Precision Quantization (MPQ) liberates the Deep Neural Networks (DNNs) from the Out-Of-Memory (OOM) bottleneck, which garnered increasing research attention. However, conventional methods either searched from costly differentiable optimization, which is neither efficient nor flexible, or learned a quantized DNN from the proxy (i.e., HAWQ) manually designed by human experts, which is labor-intensive and requires huge expert knowledge. Can we design a proxy without involving any human experts and training? In this paper, we provide an affirmative answer by proposing a novel Large Language Models (LLMs)-driven Training-free Automatic Proxy (dubbed TAP) discovery framework, which reforms the design paradigm of MPQ by utilizing LLMs to find superior TAP tailored for MPQ, automatically. In addition, to bridge the gap between black-box LLMs and the tough MPQ task, we ingeniously propose simple Direct Policy Optimization (DPO) based reinforcement learning to enhance LLMs’ reasoning by optimizing prompts, which can construct a positive feedback loop between the LLM and the MPQ task, enabling LLMs to generate better TAP in the next evolution. Extensive experiments on mainstream benchmarks demonstrate that TAP achieves state-of-the-art performance. Finally, we truly believe that our TAP will significantly contribute to the MPQ community by providing a new perspective on LLM-driven design algorithms.

[CV-51] Data-driven Exploration of Mobility Interaction Patterns

Link: https://arxiv.org/abs/2512.07415
Authors: Gabriele Galatolo, Mirco Nanni
Affiliations: Università di Pisa; Consiglio Nazionale delle Ricerche
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

[CV-52] InterAgent: Physics-based Multi-agent Command Execution via Diffusion on Interaction Graphs

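[Quick read]: This paper addresses the lack of physically plausible interaction modeling in multi-agent humanoid control: existing work is largely confined to single-agent scenarios and cannot simulate natural, coordinated, physically consistent social behavior among multiple agents. The key to the solution is InterAgent, presented as the first end-to-end framework for text-driven, physics-based multi-agent humanoid control. Its core contributions are: 1) an autoregressive diffusion Transformer with multi-stream blocks that decouples proprioception, exteroception, and action to reduce cross-modal interference while enabling coordination; 2) a novel interaction graph exteroception representation that explicitly captures joint-to-joint spatial dependencies to facilitate network learning; and 3) a sparse edge-based attention mechanism that dynamically prunes redundant connections and emphasizes critical inter-agent spatial relations, improving the robustness of interaction modeling.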

Link: https://arxiv.org/abs/2512.07410
Authors: Bin Li, Ruichi Zhang, Han Liang, Jingyan Zhang, Juze Zhang, Xin Chen, Lan Xu, Jingyi Yu, Jingya Wang
Affiliations: ShanghaiTech University; University of Pennsylvania; ByteDance; Stanford University; InstAdapt
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Humanoid agents are expected to emulate the complex coordination inherent in human social behaviors. However, existing methods are largely confined to single-agent scenarios, overlooking the physically plausible interplay essential for multi-agent interactions. To bridge this gap, we propose InterAgent, the first end-to-end framework for text-driven physics-based multi-agent humanoid control. At its core, we introduce an autoregressive diffusion transformer equipped with multi-stream blocks, which decouples proprioception, exteroception, and action to mitigate cross-modal interference while enabling synergistic coordination. We further propose a novel interaction graph exteroception representation that explicitly captures fine-grained joint-to-joint spatial dependencies to facilitate network learning. Additionally, within it we devise a sparse edge-based attention mechanism that dynamically prunes redundant connections and emphasizes critical inter-agent spatial relations, thereby enhancing the robustness of interaction modeling. Extensive experiments demonstrate that InterAgent consistently outperforms multiple strong baselines, achieving state-of-the-art performance. It enables producing coherent, physically plausible, and semantically faithful multi-agent behaviors from only text prompts. Our code and data will be released to facilitate future research.

[CV-53] Reconstructing Objects along Hand Interaction Timelines in Egocentric Video

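[Quick read]: This paper addresses 3D reconstruction of objects along the Hand Interaction Timeline (HIT) in video, with a focus on object pose estimation and reconstruction during stable grasps. Conventional approaches are limited by the lack of 3D ground-truth annotations, which makes reconstruction accuracy hard to evaluate and improve. The paper introduces a new task, Reconstructing Objects along Hand Interaction Timelines (ROHIT), and a Constrained Optimisation and Propagation (COP) framework. The core idea is to model the pose constraints on the object across the stages of the HIT (static → contact → stable grasp → release) and to use pose propagation to carry reliable pose information from known keyframes to other frames. Without 3D ground truth, 2D projection error serves as the evaluation metric; COP improves stable grasp reconstruction by 6.2-11.3% and HIT reconstruction by up to 24.5%.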

Link: https://arxiv.org/abs/2512.07394
Authors: Zhifan Zhu, Siddhant Bansal, Shashank Tripathi, Dima Damen
Affiliations: University of Bristol, UK; Max Planck Institute for Intelligent Systems, Tübingen, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: webpage: this https URL

Abstract:We introduce the task of Reconstructing Objects along Hand Interaction Timelines (ROHIT). We first define the Hand Interaction Timeline (HIT) from a rigid object’s perspective. In a HIT, an object is first static relative to the scene, then is held in hand following contact, where its pose changes. This is usually followed by a firm grip during use, before it is released to be static again w.r.t. to the scene. We model these pose constraints over the HIT, and propose to propagate the object’s pose along the HIT enabling superior reconstruction using our proposed Constrained Optimisation and Propagation (COP) framework. Importantly, we focus on timelines with stable grasps - i.e. where the hand is stably holding an object, effectively maintaining constant contact during use. This allows us to efficiently annotate, study, and evaluate object reconstruction in videos without 3D ground truth. We evaluate our proposed task, ROHIT, over two egocentric datasets, HOT3D and in-the-wild EPIC-Kitchens. In HOT3D, we curate 1.2K clips of stable grasps. In EPIC-Kitchens, we annotate 2.4K clips of stable grasps including 390 object instances across 9 categories from videos of daily interactions in 141 environments. Without 3D ground truth, we utilise 2D projection error to assess the reconstruction. Quantitatively, COP improves stable grasp reconstruction by 6.2-11.3% and HIT reconstruction by up to 24.5% with constrained pose propagation.
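To make the evaluation idea concrete, here is a minimal sketch of a 2D projection error under a pinhole camera model. The abstract does not specify the exact metric, so the function names, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the use of mean pixel distance are all assumptions for illustration, not the authors' implementation.

```python
import math


def project(point, fx, fy, cx, cy):
    """Project a 3D point in camera coordinates to pixel coordinates (pinhole model)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (fx * x / z + cx, fy * y / z + cy)


def mean_projection_error(points_3d, points_2d, fx, fy, cx, cy):
    """Mean pixel distance between projected 3D points and their 2D annotations."""
    errors = [
        math.dist(project(p, fx, fy, cx, cy), q)
        for p, q in zip(points_3d, points_2d)
    ]
    return sum(errors) / len(errors)


# Hypothetical toy data: two reconstructed object points and their annotated pixels.
points_3d = [(0.0, 0.0, 1.0), (0.5, 0.0, 2.0)]
points_2d = [(50.0, 50.0), (78.0, 54.0)]
error = mean_projection_error(points_3d, points_2d, 100.0, 100.0, 50.0, 50.0)
```

A metric of this kind needs only 2D annotations and camera intrinsics, which is why it can stand in for 3D ground truth when none is available.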

[CV-54] GlimmerNet: A Lightweight Grouped Dilated Depthwise Convolutions for UAV-Based Emergency Monitoring

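[Quick read]: This paper addresses efficient vision with strong global perception on resource-constrained unmanned aerial vehicle (UAV) platforms, where existing self-attention-based Vision Transformer models are too computationally expensive to deploy for edge and mobile vision. The key to the solution is the GlimmerNet architecture, whose core idea is to decouple receptive field diversity from feature recombination. Grouped Dilated Depthwise Convolutions (GDBlocks) provide multi-scale feature extraction at no additional parameter cost, and a novel Aggregator module fuses cross-group representations efficiently with grouped pointwise convolution, significantly lowering parameter overhead. With only 31K parameters and 29% fewer floating-point operations (FLOPs) than the most recent baseline, GlimmerNet reaches a weighted F1-score of 0.966 on the AIDERv2 dataset, setting a new accuracy-efficiency trade-off for real-time emergency monitoring.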

Link: https://arxiv.org/abs/2512.07391
Authors: Đorđe Nedeljković
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Convolutional Neural Networks (CNNs) have proven highly effective for edge and mobile vision tasks due to their computational efficiency. While many recent works seek to enhance CNNs with global contextual understanding via self-attention-based Vision Transformers, these approaches often introduce significant computational overhead. In this work, we demonstrate that it is possible to retain strong global perception without relying on computationally expensive components. We present GlimmerNet, an ultra-lightweight convolutional network built on the principle of separating receptive field diversity from feature recombination. GlimmerNet introduces Grouped Dilated Depthwise Convolutions(GDBlocks), which partition channels into groups with distinct dilation rates, enabling multi-scale feature extraction at no additional parameter cost. To fuse these features efficiently, we design a novel Aggregator module that recombines cross-group representations using grouped pointwise convolution, significantly lowering parameter overhead. With just 31K parameters and 29% fewer FLOPs than the most recent baseline, GlimmerNet achieves a new state-of-the-art weighted F1-score of 0.966 on the UAV-focused AIDERv2 dataset. These results establish a new accuracy-efficiency trade-off frontier for real-time emergency monitoring on resource-constrained UAV platforms. Our implementation is publicly available at this https URL.
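The two parameter claims above (dilation adds no weights; grouped pointwise convolution cuts the 1x1 cost) can be checked with simple arithmetic. The sketch below is only a parameter count, not the GlimmerNet implementation; the channel width of 64 and the dilation rates `[1, 2, 3, 4]` are hypothetical values chosen for illustration, and biases are ignored.

```python
def depthwise_params(channels, kernel_size=3):
    """Weights in a depthwise conv: one k x k filter per channel.

    Dilation spreads the taps apart but does not add weights.
    """
    return channels * kernel_size * kernel_size


def effective_kernel(kernel_size, dilation):
    """Spatial extent covered by a dilated kernel along one axis."""
    return dilation * (kernel_size - 1) + 1


def pointwise_params(c_in, c_out, groups=1):
    """Weights in a 1x1 conv split into `groups` independent groups."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    return (c_in // groups) * (c_out // groups) * groups


channels = 64              # hypothetical width
dilations = [1, 2, 3, 4]   # hypothetical per-group dilation rates
per_group = channels // len(dilations)

# Splitting channels into groups with different dilations keeps the weight count
# of a plain depthwise conv while widening the receptive field per group.
gd_params = sum(depthwise_params(per_group) for _ in dilations)
receptive = [effective_kernel(3, d) for d in dilations]

dense_fusion = pointwise_params(channels, channels)
grouped_fusion = pointwise_params(channels, channels, groups=len(dilations))
```

With these toy numbers the grouped depthwise stage has the same 576 weights as an undilated one while covering extents of 3, 5, 7, and 9 pixels, and the grouped 1x1 fusion uses a quarter of the weights of a dense one.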

[CV-55] Towards Reliable Test-Time Adaptation: Style Invariance as a Correctness Likelihood WACV2026

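[Quick read]: This paper addresses the poorly calibrated predictive uncertainty that test-time adaptation (TTA) methods produce in deployment, which is especially critical in high-stakes domains such as autonomous driving, finance, and healthcare. Existing calibration methods typically assume a fixed model or a static distribution and struggle under dynamically changing real-world test conditions. The core of the solution is the Style Invariance as a Correctness Likelihood (SICL) framework, which estimates an instance-wise correctness likelihood by measuring prediction consistency across style-altered variants, yielding robust uncertainty estimates. The method needs only forward passes and no backpropagation, so it works as a plug-and-play module compatible with any TTA method; across multiple baselines and realistic scenarios it reduces calibration error by an average of 13 percentage points.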

Link: https://arxiv.org/abs/2512.07390
Authors: Gilhyun Nam, Taewon Kim, Joonhyun Jeong, Eunho Yang
Affiliations: unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to WACV 2026

Abstract:Test-time adaptation (TTA) enables efficient adaptation of deployed models, yet it often leads to poorly calibrated predictive uncertainty - a critical issue in high-stakes domains such as autonomous driving, finance, and healthcare. Existing calibration methods typically assume fixed models or static distributions, resulting in degraded performance under real-world, dynamic test conditions. To address these challenges, we introduce Style Invariance as a Correctness Likelihood (SICL), a framework that leverages style-invariance for robust uncertainty estimation. SICL estimates instance-wise correctness likelihood by measuring prediction consistency across style-altered variants, requiring only the model’s forward pass. This makes it a plug-and-play, backpropagation-free calibration module compatible with any TTA method. Comprehensive evaluations across four baselines, five TTA methods, and two realistic scenarios with three model architecture demonstrate that SICL reduces calibration error by an average of 13 percentage points compared to conventional calibration approaches.
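The consistency idea can be sketched as an agreement rate. The abstract does not give SICL's exact estimator, so the function below is a simple proxy under stated assumptions: the name `correctness_likelihood` is hypothetical, and the predictions for the style-altered variants are assumed to come from forward passes of the deployed model.

```python
def correctness_likelihood(original_pred, variant_preds):
    """Fraction of style-altered variants whose label matches the original prediction.

    A high value means the prediction is stable under style changes, which this
    sketch treats as evidence that the prediction is correct.
    """
    if not variant_preds:
        raise ValueError("need at least one style-altered variant")
    agreeing = sum(1 for pred in variant_preds if pred == original_pred)
    return agreeing / len(variant_preds)


# Hypothetical labels: the model predicts "cat" on the input and on three of
# four style-altered copies.
score = correctness_likelihood("cat", ["cat", "cat", "dog", "cat"])
```

Because the score uses predicted labels only, it requires no gradients, which matches the forward-pass-only property described in the abstract.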

[CV-56] How Far are Modern Trackers from UAV-Anti-UAV? A Million-Scale Benchmark and New Baseline

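[Quick read]: This paper addresses a gap in Anti-UAV research: tracking a target UAV from a moving platform such as another UAV. Existing methods mostly rely on RGB, infrared, or RGB-IR video captured by fixed ground cameras and overlook the dual-dynamic setting in which both the tracking platform and the target move rapidly. The key to the solution is a new multi-modal visual tracking task, UAV-Anti-UAV, together with a million-scale dataset of 1,810 annotated videos and a Mamba-based baseline, MambaSTS. The baseline uses Mamba and Transformer models to learn global semantic and spatial features, respectively, and exploits the strength of state space models on long sequences, via a temporal token propagation mechanism, to build video-level long-term context for tracking under dual-dynamic disturbances.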

Link: https://arxiv.org/abs/2512.07385
Authors: Chunhui Zhang, Li Liu, Zhipeng Zhang, Yong Wang, Hao Wen, Xi Zhou, Shiming Ge, Yanfeng Wang
Affiliations: Shanghai Jiao Tong University; Hong Kong University of Science and Technology (Guangzhou); CloudWalk Technology Co., Ltd; Sun Yat-sen University; Institute of Information Engineering, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Abstract:Unmanned Aerial Vehicles (UAVs) offer wide-ranging applications but also pose significant safety and privacy violation risks in areas like airport and infrastructure inspection, spurring the rapid development of Anti-UAV technologies in recent years. However, current Anti-UAV research primarily focuses on RGB, infrared (IR), or RGB-IR videos captured by fixed ground cameras, with little attention to tracking target UAVs from another moving UAV platform. To fill this gap, we propose a new multi-modal visual tracking task termed UAV-Anti-UAV, which involves a pursuer UAV tracking a target adversarial UAV in the video stream. Compared to existing Anti-UAV tasks, UAV-Anti-UAV is more challenging due to severe dual-dynamic disturbances caused by the rapid motion of both the capturing platform and the target. To advance research in this domain, we construct a million-scale dataset consisting of 1,810 videos, each manually annotated with bounding boxes, a language prompt, and 15 tracking attributes. Furthermore, we propose MambaSTS, a Mamba-based baseline method for UAV-Anti-UAV tracking, which enables integrated spatial-temporal-semantic learning. Specifically, we employ Mamba and Transformer models to learn global semantic and spatial features, respectively, and leverage the state space model’s strength in long-sequence modeling to establish video-level long-term context via a temporal token propagation mechanism. We conduct experiments on the UAV-Anti-UAV dataset to validate the effectiveness of our method. A thorough experimental evaluation of 50 modern deep tracking algorithms demonstrates that there is still significant room for improvement in the UAV-Anti-UAV domain. The dataset and codes will be available at this https URL.

[CV-57] LogicCBMs: Logic-Enhanced Concept-Based Learning WACV2026

Link: https://arxiv.org/abs/2512.07383
Authors: Deepika SN Vemuri, Gautham Bellamkonda, Aditya Pola, Vineeth N Balasubramanian
Affiliations: Indian Institute of Technology, Hyderabad; Microsoft Research; KLA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 19 figures, WACV 2026

[CV-58] Tessellation GS: Neural Mesh Gaussians for Robust Monocular Reconstruction of Dynamic Objects

Link: https://arxiv.org/abs/2512.07381
Authors: Shuohan Tao, Boyao Zhou, Hanzhang Tu, Yuwang Wang, Yebin Liu
Affiliations: University of Cambridge; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-59] Enhancing Small Object Detection with YOLO: A Novel Framework for Improved Accuracy and Efficiency

Link: https://arxiv.org/abs/2512.07379
Authors: Mahila Moghadami, Mohammad Ali Keyvanrad, Melika Sabaghian
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 16 figures

[CV-60] Structure-Aware Feature Rectification with Region Adjacency Graphs for Training-Free Open-Vocabulary Semantic Segmentation WACV2026

Link: https://arxiv.org/abs/2512.07360
Authors: Qiming Huang, Hao Ai, Jianbo Jiao
Affiliations: The MIx Group, School of Computer Science, University of Birmingham
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to WACV2026

[CV-61] A Geometric Unification of Concept Learning with Concept Cones

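[Quick read]: This paper addresses the long-standing split between supervised concept explanation (e.g., Concept Bottleneck Models, CBMs) and unsupervised concept discovery (e.g., Sparse Autoencoders, SAEs): both aim to identify interpretable concepts in neural networks, yet they differ in methodology and lack a common evaluation framework. The key to the solution is a shared geometric view: CBMs and SAEs both learn a set of linear directions in activation space whose nonnegative combinations form a "concept cone". On this basis, the authors build a quantitative bridge from the human-defined reference geometry provided by CBMs to the quality of SAE concept learning, evaluating an SAE by how well its learned cone approximates or contains the CBM cone. This containment framework links SAE hyperparameters (such as sparsity and expansion ratio) to the plausibility of the discovered concepts and identifies a favorable range of sparsity and expansion factor, giving a measurable, comparable way to assess unsupervised concept discovery.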

Link: https://arxiv.org/abs/2512.07355
Authors: Alexandre Rocchi–Henry, Thomas Fel, Gianni Franchi
Affiliations: U2IS Lab ENSTA Paris; Kempner Institute, Harvard University
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 22 pages

Abstract:Two traditions of interpretability have evolved side by side but seldom spoken to each other: Concept Bottleneck Models (CBMs), which prescribe what a concept should be, and Sparse Autoencoders (SAEs), which discover what concepts emerge. While CBMs use supervision to align activations with human-labeled concepts, SAEs rely on sparse coding to uncover emergent ones. We show that both paradigms instantiate the same geometric structure: each learns a set of linear directions in activation space whose nonnegative combinations form a concept cone. Supervised and unsupervised methods thus differ not in kind but in how they select this cone. Building on this view, we propose an operational bridge between the two paradigms. CBMs provide human-defined reference geometries, while SAEs can be evaluated by how well their learned cones approximate or contain those of CBMs. This containment framework yields quantitative metrics linking inductive biases – such as SAE type, sparsity, or expansion ratio – to emergence of plausible concepts. Using these metrics, we uncover a “sweet spot” in both sparsity and expansion factor that maximizes both geometric and semantic alignment with CBM concepts. Overall, our work unifies supervised and unsupervised concept discovery through a shared geometric framework, providing principled metrics to measure SAE progress and assess how well discovered concepts align with plausible human concepts.
Footnote (on "plausible"): We adopt the terminology of Jacovi and Goldberg (2020), who distinguish between faithful explanations (accurately reflecting model computations) and plausible explanations (aligning with human intuition and domain knowledge). CBM concepts are plausible by construction – selected or annotated by humans – though not necessarily faithful to the true latent factors that organise the data manifold.
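A toy version of the containment test helps fix the geometry. A cone generated by directions d1, d2 contains a vector v when v = a·d1 + b·d2 with a, b ≥ 0, and one cone contains another when it contains each of the other's generators. The sketch below works in 2D with two generators per cone so that the coefficients follow from Cramer's rule; the function names and example directions are hypothetical, and the paper's actual metrics operate in high-dimensional activation space and are not reproduced here.

```python
def cone_coefficients_2d(d1, d2, v):
    """Solve a*d1 + b*d2 = v for (a, b) with Cramer's rule."""
    det = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(det) < 1e-12:
        raise ValueError("generators must be linearly independent")
    a = (v[0] * d2[1] - v[1] * d2[0]) / det
    b = (d1[0] * v[1] - d1[1] * v[0]) / det
    return a, b


def in_cone_2d(generators, v, tol=1e-9):
    """True if v is a nonnegative combination of the two generators."""
    a, b = cone_coefficients_2d(generators[0], generators[1], v)
    return a >= -tol and b >= -tol


def cone_contains(outer, inner):
    """True if every generator of `inner` lies inside the cone of `outer`."""
    return all(in_cone_2d(outer, g) for g in inner)


# Hypothetical directions: the "SAE" cone is the positive quadrant and the
# "CBM" cone is a narrower wedge inside it.
sae_cone = ((1.0, 0.0), (0.0, 1.0))
cbm_cone = ((1.0, 1.0), (2.0, 1.0))
sae_covers_cbm = cone_contains(sae_cone, cbm_cone)
cbm_covers_sae = cone_contains(cbm_cone, sae_cone)
```

Containment is asymmetric: here the wider cone covers the narrower one but not the reverse, which is the kind of relation the containment framework in the abstract quantifies.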

[CV-62] DeepAgent: A Dual Stream Multi Agent Fusion for Robust Multimodal Deepfake Detection

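[Quick read]: This paper addresses the challenge that the growth of synthetic media, especially deepfake videos, poses for verifying digital content. Most existing methods fuse audio and visual information within a single model, which remains vulnerable to modality mismatch, noise, and manipulation. The key to the solution is DeepAgent, a multi-agent collaboration framework in which two complementary agents handle different cues: Agent-1 uses a lightweight AlexNet-based CNN to detect visual traces of manipulation, while Agent-2 combines acoustic features, Whisper audio transcriptions, and text read from frames with EasyOCR to identify audio-visual inconsistencies. A Random Forest meta-classifier fuses the two decisions, exploiting the different decision boundaries learned by each agent to improve overall robustness and accuracy.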

Link: https://arxiv.org/abs/2512.07351
Authors: Sayeem Been Zaman, Wasimul Karim, Arefin Ittesafun Abian, Reem E. Mohamed, Md Rafiqul Islam, Asif Karim, Sami Azam
Affiliations: Applied Artificial INtelligence and Intelligent Systems (AAIINS) Laboratory; Department of Computer Science and Engineering, United International University; Department of Computer Science and Engineering, University of Scholars; Faculty of Science and Information Technology, Charles Darwin University; Faculty of Science and Technology, Charles Darwin University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments:

Abstract:The increasing use of synthetic media, particularly deepfakes, is an emerging challenge for digital content verification. Although recent studies use both audio and visual information, most integrate these cues within a single model, which remains vulnerable to modality mismatches, noise, and manipulation. To address this gap, we propose DeepAgent, an advanced multi-agent collaboration framework that simultaneously incorporates both visual and audio modalities for the effective detection of deepfakes. DeepAgent consists of two complementary agents. Agent-1 examines each video with a streamlined AlexNet-based CNN to identify the symbols of deepfake manipulation, while Agent-2 detects audio-visual inconsistencies by combining acoustic features, audio transcriptions from Whisper, and frame-reading sequences of images through EasyOCR. Their decisions are fused through a Random Forest meta-classifier that improves final performance by taking advantage of the different decision boundaries learned by each agent. This study evaluates the proposed framework using three benchmark datasets to demonstrate both component-level and fused performance. Agent-1 achieves a test accuracy of 94.35% on the combined Celeb-DF and FakeAVCeleb datasets. On the FakeAVCeleb dataset, Agent-2 and the final meta-classifier attain accuracies of 93.69% and 81.56%, respectively. In addition, cross-dataset validation on DeepFakeTIMIT confirms the robustness of the meta-classifier, which achieves a final accuracy of 97.49%, and indicates a strong capability across diverse datasets. These findings confirm that hierarchy-based fusion enhances robustness by mitigating the weaknesses of individual modalities and demonstrate the effectiveness of a multi-agent approach in addressing diverse types of manipulations in deepfakes.

[CV-63] MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition

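[Quick read]: This paper addresses insufficient consistency and controllability in Multi-Image Composition (MICo), in particular the difficulty of generating coherent images from multiple reference inputs without high-quality training data. The key to the solution is MICo-150K, a large-scale, high-quality dataset of 150K balanced, identity-consistent composite images, plus a Decomposition-and-Recomposition (DeRe) subset that supports decomposing and recomposing both real and synthetic images. The authors also propose the MICo-Bench benchmark and the Weighted-Ref-VIEScore metric for systematic evaluation. Fine-tuning models on the dataset (e.g., Qwen-MICo) markedly improves their multi-image composition performance.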

Link: https://arxiv.org/abs/2512.07348
Authors: Xinyu Wei, Kangrui Cen, Hongyang Wei, Zhen Guo, Bairui Li, Zeqing Wang, Jinrui Zhang, Lei Zhang
Affiliations: Hong Kong Polytechnic University; Tsinghua University; Sun Yat-Sen University; OPPO Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:In controllable image generation, synthesizing coherent and consistent images from multiple reference inputs, i.e., Multi-Image Composition (MICo), remains a challenging problem, partly hindered by the lack of high-quality training data. To bridge this gap, we conduct a systematic study of MICo, categorizing it into 7 representative tasks and curate a large-scale collection of high-quality source images and construct diverse MICo prompts. Leveraging powerful proprietary models, we synthesize a rich amount of balanced composite images, followed by human-in-the-loop filtering and refinement, resulting in MICo-150K, a comprehensive dataset for MICo with identity consistency. We further build a Decomposition-and-Recomposition (DeRe) subset, where 11K real-world complex images are decomposed into components and recomposed, enabling both real and synthetic compositions. To enable comprehensive evaluation, we construct MICo-Bench with 100 cases per task and 300 challenging DeRe cases, and further introduce a new metric, Weighted-Ref-VIEScore, specifically tailored for MICo evaluation. Finally, we fine-tune multiple models on MICo-150K and evaluate them on MICo-Bench. The results show that MICo-150K effectively equips models without MICo capability and further enhances those with existing skills. Notably, our baseline model, Qwen-MICo, fine-tuned from Qwen-Image-Edit, matches Qwen-Image-2509 in 3-image composition while supporting arbitrary multi-image inputs beyond the latter’s limitation. Our dataset, benchmark, and baseline collectively offer valuable resources for further research on Multi-Image Composition.

[CV-64] Debiasing Diffusion Priors via 3D Attention for Consistent Gaussian Splatting AAAI2026

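[Quick read]: This paper addresses multi-view inconsistency in 3D tasks (such as generation or editing) that distill from Text-to-Image (T2I) diffusion models. The root cause is prior view bias in T2I models: during cross-attention (CA) computation, subject words tend to activate features of a fixed prior view, producing conflicting appearances of an object across views. The key to the solution is a new framework, TD-Attn, with two core components: (1) the 3D-Aware Attention Guidance Module (3D-AAG) constructs a view-consistent 3D attention Gaussian to enforce spatial consistency across views; (2) the Hierarchical Attention Modulation Module (HAM) uses a Semantic Guidance Tree (SGT) to direct a Semantic Response Profiler (SRP) in locating and modulating the UNet layers that are sensitive to view conditions, improving the consistency and controllability of attention maps and enabling precise 3D editing.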

Link: https://arxiv.org/abs/2512.07345
Authors: Shilong Jin, Haoran Duan, Litao Hua, Wentao Huang, Yuan Zhou
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 8 figures, 5 tables, 2 algorithms, Accepted by AAAI 2026

Abstract:Versatile 3D tasks (e.g., generation or editing) that distill from Text-to-Image (T2I) diffusion models have attracted significant research interest for not relying on extensive 3D training data. However, T2I models exhibit limitations resulting from prior view bias, which produces conflicting appearances between different views of an object. This bias causes subject-words to preferentially activate prior view features during cross-attention (CA) computation, regardless of the target view condition. To overcome this limitation, we conduct a comprehensive mathematical analysis to reveal the root cause of the prior view bias in T2I models. Moreover, we find different UNet layers show different effects of prior view in CA. Therefore, we propose a novel framework, TD-Attn, which addresses multi-view inconsistency via two key components: (1) the 3D-Aware Attention Guidance Module (3D-AAG) constructs a view-consistent 3D attention Gaussian for subject-words to enforce spatial consistency across attention-focused regions, thereby compensating for the limited spatial information in 2D individual view CA maps; (2) the Hierarchical Attention Modulation Module (HAM) utilizes a Semantic Guidance Tree (SGT) to direct the Semantic Response Profiler (SRP) in localizing and modulating CA layers that are highly responsive to view conditions, where the enhanced CA maps further support the construction of more consistent 3D attention Gaussians. Notably, HAM facilitates semantic-specific interventions, enabling controllable and precise 3D editing. Extensive experiments firmly establish that TD-Attn has the potential to serve as a universal plugin, significantly enhancing multi-view consistency across 3D tasks.

[CV-65] Generalized Referring Expression Segmentation on Aerial Photos

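[Quick read]: This paper addresses Referring Expression Segmentation (RES) in aerial imagery, where widely varying spatial resolution, inconsistent color, tiny targets, high object density, and partial occlusion degrade conventional methods. The key to the solution is Aerial-D, a large-scale, diverse RES dataset for aerial imagery with 37,288 images, 1,522,523 referring expressions, and 259,709 annotated targets across 21 semantic classes ranging from vehicles to land cover types. It is built with an automatic pipeline that combines rule-based expression generation with Large Language Model (LLM) enhancement to increase linguistic variety and the focus on visual details, and it applies filters that simulate historic imaging conditions so that models become robust to monochrome, sepia, and grainy degradations. The RSRefSeg architecture trained on this dataset provides unified instance and semantic segmentation for modern and historical aerial images, with competitive results on several benchmarks and high accuracy retained under degradation.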

Link: https://arxiv.org/abs/2512.07338
Authors: Luís Marnoto, Alexandre Bernardino, Bruno Martins
Affiliations: INESC-ID; Instituto Superior Técnico, University of Lisbon; Instituto de Sistemas e Robótica
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE J-STARS

Abstract:Referring expression segmentation is a fundamental task in computer vision that integrates natural language understanding with precise visual localization of target regions. Considering aerial imagery (e.g., modern aerial photos collected through drones, historical photos from aerial archives, high-resolution satellite imagery, etc.) presents unique challenges because spatial resolution varies widely across datasets, the use of color is not consistent, targets often shrink to only a few pixels, and scenes contain very high object densities and objects with partial occlusions. This work presents Aerial-D, a new large-scale referring expression segmentation dataset for aerial imagery, comprising 37,288 images with 1,522,523 referring expressions that cover 259,709 annotated targets, spanning across individual object instances, groups of instances, and semantic regions covering 21 distinct classes that range from vehicles and infrastructure to land coverage types. The dataset was constructed through a fully automatic pipeline that combines systematic rule-based expression generation with a Large Language Model (LLM) enhancement procedure that enriched both the linguistic variety and the focus on visual details within the referring expressions. Filters were additionally used to simulate historic imaging conditions for each scene. We adopted the RSRefSeg architecture, and trained models on Aerial-D together with prior aerial datasets, yielding unified instance and semantic segmentation from text for both modern and historical images. Results show that the combined training achieves competitive performance on contemporary benchmarks, while maintaining strong accuracy under monochrome, sepia, and grainy degradations that appear in archival aerial photography. The dataset, trained models, and complete software pipeline are publicly available at this https URL .

[CV-66] The Inductive Bottleneck: Data-Driven Emergence of Representational Sparsity in Vision Transformers

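[Quick read]: This paper asks why Vision Transformers (ViTs), which can in theory maintain high-dimensional representations in every layer, often show a "U-shaped" entropy profile in practice (information is compressed in the middle layers and expanded again in the final layers), and whether this stems from an architectural artifact or a data-driven adaptation. The approach is to analyze the layer-wise Effective Encoding Dimension (EED) of DINO-trained ViTs on datasets of differing semantic complexity (UC Merced, Tiny ImageNet, and CIFAR-100). The analysis indicates that this "Inductive Bottleneck" is not inherent to the architecture but a data-dependent adaptation determined by the semantic abstraction the data requires: texture-heavy datasets keep high-rank representations, while object-centric datasets push the network to dampen high-frequency information in the middle layers, learning a bottleneck that isolates semantic features.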

Link: https://arxiv.org/abs/2512.07331
Authors: Kanishk Awadhiya
Affiliations: Indian Institute of Technology, Delhi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision Transformers (ViTs) lack the hierarchical inductive biases inherent to Convolutional Neural Networks (CNNs), theoretically allowing them to maintain high-dimensional representations throughout all layers. However, recent observations suggest ViTs often spontaneously manifest a “U-shaped” entropy profile, compressing information in middle layers before expanding it for the final classification. In this work, we demonstrate that this “Inductive Bottleneck” is not an architectural artifact, but a data-dependent adaptation. By analyzing the layer-wise Effective Encoding Dimension (EED) of DINO-trained ViTs across datasets of varying compositional complexity (UC Merced, Tiny ImageNet, and CIFAR-100), we show that the depth of the bottleneck correlates strongly with the semantic abstraction required by the task. We find that while texture-heavy datasets preserve high-rank representations throughout, object-centric datasets drive the network to dampen high-frequency information in middle layers, effectively “learning” a bottleneck to isolate semantic features.

[CV-67] ContextAnyone: Context-Aware Diffusion for Character-Consistent Text-to-Video Generation

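[Quick read]: This paper addresses the difficulty of keeping character identity consistent in Text-to-Video (T2V) generation, in particular the visual incoherence that arises when personalization methods attend only to facial features and ignore contextual cues such as hairstyle, outfit, and body shape. The key to the solution is ContextAnyone, a context-aware diffusion framework that jointly reconstructs the reference image and generates new video frames so that the model fully perceives the reference information. Its core contributions are: 1) an Emphasize-Attention module that selectively reinforces reference-aware features to prevent identity drift across frames; 2) a dual-guidance loss that combines the diffusion objective with a reference reconstruction objective to improve appearance fidelity; and 3) a Gap-RoPE positional embedding that separates reference and video tokens to stabilize temporal modeling. Experiments reported by the authors show gains over existing reference-to-video methods in both identity consistency and visual quality.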

Link: https://arxiv.org/abs/2512.07328
Authors: Ziyang Mai, Yu-Wing Tai
Affiliations: Dartmouth College
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Text-to-video (T2V) generation has advanced rapidly, yet maintaining consistent character identities across scenes remains a major challenge. Existing personalization methods often focus on facial identity but fail to preserve broader contextual cues such as hairstyle, outfit, and body shape, which are critical for visual coherence. We propose ContextAnyone, a context-aware diffusion framework that achieves character-consistent video generation from text and a single reference image. Our method jointly reconstructs the reference image and generates new video frames, enabling the model to fully perceive and utilize reference information. Reference information is effectively integrated into a DiT-based diffusion backbone through a novel Emphasize-Attention module that selectively reinforces reference-aware features and prevents identity drift across frames. A dual-guidance loss combines diffusion and reference reconstruction objectives to enhance appearance fidelity, while the proposed Gap-RoPE positional embedding separates reference and video tokens to stabilize temporal modeling. Experiments demonstrate that ContextAnyone outperforms existing reference-to-video methods in identity consistency and visual quality, generating coherent and context-preserving character videos across diverse motions and scenes. Project page: this https URL.

[CV-68] Reevaluating Automated Wildlife Species Detection: A Reproducibility Study on a Custom Image Dataset

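Quick read: This paper examines the reproducibility and generalization of pretrained convolutional neural networks (CNNs) for wildlife species identification, especially the performance drop when class labels do not align with ImageNet categories. The key to its approach is reimplementing the original experimental pipeline from scratch with openly available resources and validating it on a new dataset of 900 images spanning 90 species. Despite the different data distribution, the model reaches 62% overall classification accuracy, which the authors describe as close to the 71% reported in the original study. The macro-averaged F1 score, however, is only 0.28, which indicates large per-class performance differences and exposes the limits of directly transferring pretrained models to cross-domain tasks. The paper therefore stresses the need for species-specific adaptation or transfer learning to obtain consistent, high-quality predictions.

Link: https://arxiv.org/abs/2512.07305
Authors: Tobias Abraham Haider
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: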

Abstract: This study revisits the findings of Carl et al., who evaluated the pre-trained Google Inception-ResNet-v2 model for automated detection of European wild mammal species in camera trap images. To assess the reproducibility and generalizability of their approach, we reimplemented the experiment from scratch using openly available resources and a different dataset consisting of 900 images spanning 90 species. After minimal preprocessing, we obtained an overall classification accuracy of 62%, closely aligning with the 71% reported in the original work despite differences in datasets. As in the original study, per-class performance varied substantially, as indicated by a macro F1 score of 0.28, highlighting limitations in generalization when labels do not align directly with ImageNet classes. Our results confirm that pretrained convolutional neural networks can provide a practical baseline for wildlife species identification but also reinforce the need for species-specific adaptation or transfer learning to achieve consistent, high-quality predictions.
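
The gap between overall accuracy and macro F1 can look surprising. The sketch below uses invented toy labels (not the paper's data) to show how a classifier can score high accuracy while macro F1 stays low, because macro F1 weights every class equally regardless of how many images it has.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes F1 = 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: a model that predicts the majority class for every image.
y_true = ["deer"] * 8 + ["lynx", "otter"]
y_pred = ["deer"] * 10

toy_accuracy = accuracy(y_true, y_pred)
toy_macro_f1 = macro_f1(y_true, y_pred)
```

Here the accuracy is 0.8 while the macro F1 is roughly 0.30, since two of the three classes are never recognized.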

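[CV-69] Towards Accurate UAV Image Perception: Guiding Vision-Language Models with Stronger Task Prompts

Quick read: This paper addresses the limitations of UAV image perception methods based on vision-language models (VLMs). When the task prompt is simple and the image content is complex, a VLM struggles to align visual and textual tokens semantically, which leads to problems such as target confusion, sensitivity to scale variation, and interference from complex backgrounds. The key to the solution is AerialVP, the first agent framework for task prompt enhancement in UAV image perception. It proactively extracts multi-dimensional auxiliary information to refine the original task prompt in three stages: analyzing the task prompt to identify the task type and enhancement needs, selecting suitable tools from a tool repository, and generating the enhanced task prompt from the analysis and the selected tools. This improves the accuracy and robustness of VLM perception in complex scenes.

Link: https://arxiv.org/abs/2512.07302
Authors: Mingning Guo, Mengwei Wu, Shaoxian Li, Haifeng Li, Chao Tao
Affiliations: Central South University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: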

Abstract:Existing image perception methods based on VLMs generally follow a paradigm wherein models extract and analyze image content based on user-provided textual task prompts. However, such methods face limitations when applied to UAV imagery, which presents challenges like target confusion, scale variations, and complex backgrounds. These challenges arise because VLMs’ understanding of image content depends on the semantic alignment between visual and textual tokens. When the task prompt is simplistic and the image content is complex, achieving effective alignment becomes difficult, limiting the model’s ability to focus on task-relevant information. To address this issue, we introduce AerialVP, the first agent framework for task prompt enhancement in UAV image perception. AerialVP proactively extracts multi-dimensional auxiliary information from UAV images to enhance task prompts, overcoming the limitations of traditional VLM-based approaches. Specifically, the enhancement process includes three stages: (1) analyzing the task prompt to identify the task type and enhancement needs, (2) selecting appropriate tools from the tool repository, and (3) generating enhanced task prompts based on the analysis and selected tools. To evaluate AerialVP, we introduce AerialSense, a comprehensive benchmark for UAV image perception that includes Aerial Visual Reasoning, Aerial Visual Question Answering, and Aerial Visual Grounding tasks. AerialSense provides a standardized basis for evaluating model generalization and performance across diverse resolutions, lighting conditions, and both urban and natural scenes. Experimental results demonstrate that AerialVP significantly enhances task prompt guidance, leading to stable and substantial performance improvements in both open-source and proprietary VLMs. Our work will be available at this https URL.

[CV-70] Geo3DVQA: Evaluating Vision-Language Models for 3D Geospatial Reasoning from Aerial Imagery WACV2026

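Quick read: This paper addresses the limited global accessibility of 3D geospatial analysis, which currently depends on costly, specialized sensors such as LiDAR and multispectral sensors, and the limitations of existing sensor-based and rule-driven methods in integrating multiple 3D cues, handling diverse queries, and providing interpretable reasoning. The key to its solution is Geo3DVQA, a comprehensive benchmark for evaluating vision-language models (VLMs) on height-aware 3D geospatial reasoning from RGB-only remote sensing imagery. The benchmark contains 110k curated question-answer pairs covering 16 task categories at three complexity levels (single-feature inference, multi-feature reasoning, and application-level spatial analysis) and emphasizes realistic scenarios that integrate elevation, sky view factor, and land cover patterns. Experiments show that mainstream VLMs perform poorly (GPT-4o and Gemini-2.5-Flash reach only 28.6% and 33.0% accuracy), whereas domain-specific fine-tuning of Qwen2.5-VL-7B raises accuracy to 49.6%. This supports the effectiveness of domain adaptation and opens a new challenge frontier for scalable, accessible, and holistic 3D geospatial analysis.

Link: https://arxiv.org/abs/2512.07276
Authors: Mai Tsujimoto, Junjue Wang, Weihao Xuan, Naoto Yokoya
Affiliations: The University of Tokyo; RIKEN AIP
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to WACV 2026. Camera-ready-based version with minor edits for readability (no change in the contents)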

Abstract:Three-dimensional geospatial analysis is critical to applications in urban planning, climate adaptation, and environmental assessment. Current methodologies depend on costly, specialized sensors (e.g., LiDAR and multispectral), which restrict global accessibility. Existing sensor-based and rule-driven methods further struggle with tasks requiring the integration of multiple 3D cues, handling diverse queries, and providing interpretable reasoning. We hereby present Geo3DVQA, a comprehensive benchmark for evaluating vision-language models (VLMs) in height-aware, 3D geospatial reasoning using RGB-only remote sensing imagery. Unlike conventional sensor-based frameworks, Geo3DVQA emphasizes realistic scenarios that integrate elevation, sky view factors, and land cover patterns. The benchmark encompasses 110k curated question-answer pairs spanning 16 task categories across three complexity levels: single-feature inference, multi-feature reasoning, and application-level spatial analysis. The evaluation of ten state-of-the-art VLMs highlights the difficulty of RGB-to-3D reasoning. GPT-4o and Gemini-2.5-Flash achieved only 28.6% and 33.0% accuracy respectively, while domain-specific fine-tuning of Qwen2.5-VL-7B achieved 49.6% (+24.8 points). These results reveal both the limitations of current VLMs and the effectiveness of domain adaptation. Geo3DVQA introduces new challenge frontiers for scalable, accessible, and holistic 3D geospatial analysis. The dataset and code will be released upon publication at this https URL.

[CV-71] Effective Attention-Guided Multi-Scale Medical Network for Skin Lesion Segmentation

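Quick read: This paper addresses the limited accuracy of skin lesion segmentation caused by irregular lesion shapes and low contrast. The key to its solution is an encoder-decoder network built on multi-scale residual structures. A Multi-Resolution Multi-Channel Fusion (MRCF) module captures cross-scale features to improve the clarity and accuracy of the extracted information. A Cross-Mix Attention Module (CMAM) dynamically adjusts weights across multiple contexts to make feature extraction more flexible and deeper. An External Attention Bridge (EAB) mitigates the feature loss caused by skip connections in the traditional U-Net by making effective use of decoder information and compensating for losses during upsampling.

Link: https://arxiv.org/abs/2512.07275
Authors: Siyu Wang, Hua Wang, Huiyu Li, Fan Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: The paper has been accepted by BIBM 2025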

Abstract:In the field of healthcare, precise skin lesion segmentation is crucial for the early detection and accurate diagnosis of skin diseases. Despite significant advances in deep learning for image processing, existing methods have yet to effectively address the challenges of irregular lesion shapes and low contrast. To address these issues, this paper proposes an innovative encoder-decoder network architecture based on multi-scale residual structures, capable of extracting rich feature information from different receptive fields to effectively identify lesion areas. By introducing a Multi-Resolution Multi-Channel Fusion (MRCF) module, our method captures cross-scale features, enhancing the clarity and accuracy of the extracted information. Furthermore, we propose a Cross-Mix Attention Module (CMAM), which redefines the attention scope and dynamically calculates weights across multiple contexts, thus improving the flexibility and depth of feature capture and enabling deeper exploration of subtle features. To overcome the information loss caused by skip connections in traditional U-Net, an External Attention Bridge (EAB) is introduced, facilitating the effective utilization of information in the decoder and compensating for the loss during upsampling. Extensive experimental evaluations on several skin lesion segmentation datasets demonstrate that the proposed model significantly outperforms existing transformer and convolutional neural network-based models, showcasing exceptional segmentation accuracy and robustness.

[CV-72] RVLF: A Reinforcing Vision-Language Framework for Gloss-Free Sign Language Translation

Link: https://arxiv.org/abs/2512.07273
Authors: Zhi Rao, Yucheng Zhou, Benjia Zhou, Yiqing Huang, Sergio Escalera, Jun Wan
Affiliations: Macau University of Science and Technology; SKL-IOTSC, CIS, University of Macau; Beijing Institute of Technology, Zhuhai; University of Barcelona
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-73] A graph generation pipeline for critical infrastructures based on heuristics images and depth data

Link: https://arxiv.org/abs/2512.07269
Authors: Mike Diessner, Yannick Tarant
Affiliations: German Aerospace Center (DLR)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

[CV-74] DGGAN: Degradation Guided Generative Adversarial Network for Real-time Endoscopic Video Enhancement

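Quick read: This paper addresses the loss of image quality in endoscopic video during surgery caused by degradations such as uneven illumination, tissue scattering, occlusion, and motion blur, which impair the recognition of anatomical structures and the safety of surgical manipulation. The key to its solution is a degradation-aware video enhancement framework. Degradation representations are extracted from images with contrastive learning and propagated across frames, and a fusion mechanism modulates image features with these representations to guide a single-frame enhancement model. A cycle-consistency constraint between degraded and restored images improves the model's robustness and generalization under different degradation conditions, yielding efficient, real-time, high-quality endoscopic video enhancement.

Link: https://arxiv.org/abs/2512.07253
Authors: Handing Xu, Zhenguo Nie, Tairan Peng, Huimin Pan, Xin-Jun Liu
Affiliations: Tsinghua University; Beijing Key Laboratory of Transformative High-end Manufacturing Equipment and Technology; State Key Laboratory of Tribology in Advanced Equipment
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 8 figures, and 7 tables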

Abstract:Endoscopic surgery relies on intraoperative video, making image quality a decisive factor for surgical safety and efficacy. Yet, endoscopic videos are often degraded by uneven illumination, tissue scattering, occlusions, and motion blur, which obscure critical anatomical details and complicate surgical manipulation. Although deep learning-based methods have shown promise in image enhancement, most existing approaches remain too computationally demanding for real-time surgical use. To address this challenge, we propose a degradation-aware framework for endoscopic video enhancement, which enables real-time, high-quality enhancement by propagating degradation representations across frames. In our framework, degradation representations are first extracted from images using contrastive learning. We then introduce a fusion mechanism that modulates image features with these representations to guide a single-frame enhancement model, which is trained with a cycle-consistency constraint between degraded and restored images to improve robustness and generalization. Experiments demonstrate that our framework achieves a superior balance between performance and efficiency compared with several state-of-the-art methods. These results highlight the effectiveness of degradation-aware modeling for real-time endoscopic video enhancement. Nevertheless, our method suggests that implicitly learning and propagating degradation representation offer a practical pathway for clinical application.

[CV-75] See More Change Less: Anatomy-Aware Diffusion for Contrast Enhancement

Quick read: This paper addresses over-editing by existing models in medical image enhancement, which distorts organ morphology, creates false lesions, and causes small tumors to be missed. The root cause is that current models do not understand anatomy or contrast-agent dynamics. The key to its solution is SMILE, an anatomy-aware diffusion model with three core ideas: (1) structure-aware supervision based on true organ boundaries and contrast patterns; (2) registration-free learning that works directly on unaligned multi-phase CT scans; (3) a unified inference framework that provides fast and consistent enhancement across all contrast phases. On six external datasets the method outperforms existing methods in image quality (14.2% higher SSIM, 20.6% higher PSNR, 50% better FID) and in clinical usefulness, and it improves cancer detection from non-contrast CT (F1 score raised by up to 10 percent).

Link: https://arxiv.org/abs/2512.07251
Authors: Junqi Liu, Zejun Wu, Pedro R. A. S. Bassi, Xinze Zhou, Wenxuan Li, Ibrahim E. Hamamci, Sezgin Er, Tianyu Lin, Yi Luo, Szymon Płotka, Bjoern Menze, Daguang Xu, Kai Ding, Kang Wang, Yang Yang, Yucheng Tang, Alan L. Yuille, Zongwei Zhou
Affiliations: Johns Hopkins University; University of Copenhagen; University of Virginia; University of Bologna; Italian Institute of Technology; University of Zurich; ETH AI Center; Istanbul Medipol University; Jagiellonian University; NVIDIA; Johns Hopkins Medicine; University of California, San Francisco
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Image enhancement improves visual quality and helps reveal details that are hard to see in the original image. In medical imaging, it can support clinical decision-making, but current models often over-edit. This can distort organs, create false findings, and miss small tumors because these models do not understand anatomy or contrast dynamics. We propose SMILE, an anatomy-aware diffusion model that learns how organs are shaped and how they take up contrast. It enhances only clinically relevant regions while leaving all other areas unchanged. SMILE introduces three key ideas: (1) structure-aware supervision that follows true organ boundaries and contrast patterns; (2) registration-free learning that works directly with unaligned multi-phase CT scans; (3) unified inference that provides fast and consistent enhancement across all contrast phases. Across six external datasets, SMILE outperforms existing methods in image quality (14.2% higher SSIM, 20.6% higher PSNR, 50% better FID) and in clinical usefulness by producing anatomically accurate and diagnostically meaningful images. SMILE also improves cancer detection from non-contrast CT, raising the F1 score by up to 10 percent.

[CV-76] AdLift: Lifting Adversarial Perturbations to Safeguard 3D Gaussian Splatting Assets Against Instruction-Driven Editing

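Quick read: This paper addresses the security risk that 3D Gaussian Splatting (3DGS) assets face from unauthorized modification and malicious tampering through diffusion-based, instruction-driven editing. Imperceptible adversarial perturbations designed for 2D images are hard to apply directly to 3DGS because of two challenges: protection that generalizes across views, and the trade-off between invisibility and protection capability. The key to the solution is AdLift, the first editing safeguard for 3DGS. Its core idea is to lift strictly bounded 2D adversarial perturbations into safeguard Gaussians in the 3D Gaussian representation and to optimize them progressively with a tailored Lifted PGD. The method first truncates the gradient during back-propagation from the editing model at the rendered image and applies projected gradients so that the perturbation stays bounded at the image level. It then propagates the perturbation back to the safeguard Gaussian parameters through an image-to-Gaussian fitting operation. Alternating gradient truncation and fitting gives consistent protection across viewpoints and generalizes to novel views.

Link: https://arxiv.org/abs/2512.07247
Authors: Ziming Hong, Tianyu Huang, Runnan Chen, Shanshan Ye, Mingming Gong, Bo Han, Tongliang Liu
Affiliations: Sydney AI Centre, The University of Sydney; University of Technology Sydney; University of Melbourne; Hong Kong Baptist University; Mohamed bin Zayed University of Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 40 pages, 34 figures, 18 tables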

Abstract:Recent studies have extended diffusion-based instruction-driven 2D image editing pipelines to 3D Gaussian Splatting (3DGS), enabling faithful manipulation of 3DGS assets and greatly advancing 3DGS content creation. However, it also exposes these assets to serious risks of unauthorized editing and malicious tampering. Although imperceptible adversarial perturbations against diffusion models have proven effective for protecting 2D images, applying them to 3DGS encounters two major challenges: view-generalizable protection and balancing invisibility with protection capability. In this work, we propose the first editing safeguard for 3DGS, termed AdLift, which prevents instruction-driven editing across arbitrary views and dimensions by lifting strictly bounded 2D adversarial perturbations into 3D Gaussian-represented safeguard. To ensure both adversarial perturbations effectiveness and invisibility, these safeguard Gaussians are progressively optimized across training views using a tailored Lifted PGD, which first conducts gradient truncation during back-propagation from the editing model at the rendered image and applies projected gradients to strictly constrain the image-level perturbation. Then, the resulting perturbation is backpropagated to the safeguard Gaussian parameters via an image-to-Gaussian fitting operation. We alternate between gradient truncation and image-to-Gaussian fitting, yielding consistent adversarial-based protection performance across different viewpoints and generalizes to novel views. Empirically, qualitative and quantitative results demonstrate that AdLift effectively protects against state-of-the-art instruction-driven 2D image and 3DGS editing.
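
For readers unfamiliar with projected gradient descent (PGD), the sketch below shows only the standard image-level update with an L-infinity bound. It is not the paper's Lifted PGD: gradient truncation and the image-to-Gaussian fitting step are not shown, and the flat-list image, step size, and epsilon are illustrative assumptions.

```python
def pgd_step(x_adv, x_clean, grad, step_size, epsilon):
    """One L-infinity PGD update on a flat list of pixel values in [0, 1]."""
    updated = []
    for xa, xc, g in zip(x_adv, x_clean, grad):
        sign = (g > 0) - (g < 0)        # sign of the loss gradient
        moved = xa + step_size * sign   # signed gradient ascent step
        # Project back into the epsilon-ball around the clean pixel ...
        moved = min(max(moved, xc - epsilon), xc + epsilon)
        # ... and into the valid pixel range.
        updated.append(min(max(moved, 0.0), 1.0))
    return updated


x_clean = [0.50, 0.50, 0.995]
x_next = pgd_step(x_clean, x_clean, grad=[1.0, -1.0, 1.0],
                  step_size=0.02, epsilon=0.01)
```

The projection is what keeps the perturbation "strictly bounded": however large the step, no pixel moves more than `epsilon` away from its clean value.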

[CV-77] Zero-Shot Textual Explanations via Translating Decision-Critical Features

Link: https://arxiv.org/abs/2512.07245
Authors: Toshinori Yamauchi, Hiroshi Kera, Kazuhiko Kawamoto
Affiliations: Chiba University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11+6 pages, 8 figures, 4 tables

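[CV-78] Squeezed-Eff-Net: Edge-Computed Boost of Tomography Based Brain Tumor Classification leveraging Hybrid Neural Network Architecture

Quick read: This paper addresses the accuracy and efficiency problems in automated brain tumor diagnosis that arise because manual tumor delineation is time-consuming and prone to inter-observer variability. Its core solution is a hybrid deep learning model that combines the lightweight SqueezeNet v1 with the high-performing EfficientNet-B0 and adds handcrafted radiomic features (Histogram of Oriented Gradients, Local Binary Patterns, Gabor filters, and wavelet transforms) to increase texture sensitivity and capture multi-level feature representations. With fewer than 2.1 million parameters and less than 1.2 GFLOPs, the method reaches 98.93% test accuracy (99.08% with test-time augmentation, TTA), showing good generalization and clinical practicality.

Link: https://arxiv.org/abs/2512.07241
Authors: Md. Srabon Chowdhury, Syeda Fahmida Tanzim, Sheekar Banerjee, Ishtiak Al Mamoon, AKM Muzahidul Islam
Affiliations: International University of Business Agriculture and Technology (IUBAT); United International University (UIU)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: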

Abstract:Brain tumors are one of the most common and dangerous neurological diseases which require a timely and correct diagnosis to provide the right treatment procedures. Even with the promotion of magnetic resonance imaging (MRI), the process of tumor delineation is difficult and time-consuming, which is prone to inter-observer error. In order to overcome these limitations, this work proposes a hybrid deep learning model based on SqueezeNet v1 which is a lightweight model, and EfficientNet-B0, which is a high-performing model, and is enhanced with handcrafted radiomic descriptors, including Histogram of Oriented Gradients (HOG), Local Binary Patterns (LBP), Gabor filters and Wavelet transforms. The framework was trained and tested only on publicly available Nickparvar Brain Tumor MRI dataset, which consisted of 7,023 contrast-enhanced T1-weighted axial MRI slices which were categorized into four groups: glioma, meningioma, pituitary tumor, and no tumor. The testing accuracy of the model was 98.93% that reached a level of 99.08% with Test Time Augmentation (TTA) showing great generalization and power. The proposed hybrid network offers a compromise between computation efficiency and diagnostic accuracy compared to current deep learning structures and only has to be trained using fewer than 2.1 million parameters and less than 1.2 GFLOPs. The handcrafted feature addition allowed greater sensitivity in texture and the EfficientNet-B0 backbone represented intricate hierarchical features. The resulting model has almost clinical reliability in automated MRI-based classification of tumors highlighting its possibility of use in clinical decision-support systems.
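
Test Time Augmentation (TTA) is commonly implemented by averaging the class probabilities predicted for several augmented views of the same image. The sketch below illustrates that general idea only; the abstract does not say which augmentations the authors use, and `toy_model` is a hypothetical stand-in for a trained classifier.

```python
def hflip(image):
    """Horizontally flip an image stored as a list of pixel rows."""
    return [row[::-1] for row in image]


def predict_with_tta(model, image, augmentations):
    """Average class probabilities over the original and augmented views."""
    views = [image] + [aug(image) for aug in augmentations]
    probs = [model(view) for view in views]
    n_classes = len(probs[0])
    return [sum(p[k] for p in probs) / len(probs) for k in range(n_classes)]


def toy_model(image):
    """Hypothetical two-class 'model' that only looks at the top-left pixel."""
    p = image[0][0]
    return [p, 1.0 - p]


image = [[0.9, 0.1]]
single_view = toy_model(image)
tta_probs = predict_with_tta(toy_model, image, augmentations=[hflip])
```

The toy model is deliberately sensitive to flipping, so the averaged prediction is less confident than the single-view one; with a real classifier, averaging tends to smooth out such view-dependent errors.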

[CV-79] Unified Camera Positional Encoding for Controlled Video Generation

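Quick read: This paper addresses the poor generalization of existing camera encoding methods in complex real-world settings. The simplified pinhole assumption they usually rely on cannot handle diverse camera intrinsics and lens distortions, which limits accurate grounding and control of visual observations in 3D space. The key solution is Relative Ray Encoding, a geometry-consistent representation that unifies complete camera information, including 6-DoF pose, intrinsics, and lens distortion. The paper further identifies pitch and roll as effective components for Absolute Orientation Encoding, which gives full control over the initial camera orientation. Together these form UCPE (Unified Camera Positional Encoding), which is integrated into a pretrained video Diffusion Transformer through a lightweight spatial attention adapter and improves camera controllability and visual fidelity while adding less than 1% trainable parameters.

Link: https://arxiv.org/abs/2512.07237
Authors: Cheng Zhang, Boying Li, Meng Wei, Yan-Pei Cao, Camilo Cruz Gambardella, Dinh Phung, Jianfei Cai
Affiliations: Monash University; Building 4.0 CRC; VAST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL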

Abstract:Transformers have emerged as a universal backbone across 3D perception, video generation, and world models for autonomous driving and embodied AI, where understanding camera geometry is essential for grounding visual observations in three-dimensional space. However, existing camera encoding methods often rely on simplified pinhole assumptions, restricting generalization across the diverse intrinsics and lens distortions in real-world cameras. We introduce Relative Ray Encoding, a geometry-consistent representation that unifies complete camera information, including 6-DoF poses, intrinsics, and lens distortions. To evaluate its capability under diverse controllability demands, we adopt camera-controlled text-to-video generation as a testbed task. Within this setting, we further identify pitch and roll as two components effective for Absolute Orientation Encoding, enabling full control over the initial camera orientation. Together, these designs form UCPE (Unified Camera Positional Encoding), which integrates into a pretrained video Diffusion Transformer through a lightweight spatial attention adapter, adding less than 1% trainable parameters while achieving state-of-the-art camera controllability and visual fidelity. To facilitate systematic training and evaluation, we construct a large video dataset covering a wide range of camera motions and lens types. Extensive experiments validate the effectiveness of UCPE in camera-controllable video generation and highlight its potential as a general camera representation for Transformers across future multi-view, video, and 3D tasks. Code will be available at this https URL.
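
As background for ray-based camera encodings, the sketch below computes per-pixel ray directions under the simplified pinhole model that the abstract says existing methods rely on. It is not the paper's Relative Ray Encoding, which also covers lens distortion; the intrinsics values are made up for illustration.

```python
import math


def pinhole_ray(u, v, fx, fy, cx, cy):
    """Unit ray direction in camera coordinates for pixel (u, v)."""
    x = (u - cx) / fx
    y = (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)


# The principal point looks straight down the optical axis.
center_ray = pinhole_ray(320.0, 240.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
# A pixel one focal length to the right looks 45 degrees off-axis.
edge_ray = pinhole_ray(820.0, 240.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

A distorted lens breaks the linear relation between pixel offset and ray direction used here, which is the limitation the abstract points to.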

[CV-80] Dropout Prompt Learning: Towards Robust and Adaptive Vision-Language Models

Link: https://arxiv.org/abs/2512.07234
Authors: Biao Chen, Lin Zuo, Mengmeng Jing, Kunbin He, Yuchen Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

[CV-81] STRinGS: Selective Text Refinement in Gaussian Splatting WACV2026

Link: https://arxiv.org/abs/2512.07230
Authors: Abhinav Raundhal, Gaurav Behera, P J Narayanan, Ravi Kiran Sarvadevabhatla, Makarand Tapaswi
Affiliations: CVIT, IIIT Hyderabad, India
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to WACV 2026. Project Page, see this https URL

[CV-82] ReLKD: Inter-Class Relation Learning with Knowledge Distillation for Generalized Category Discovery ECAI2025

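Quick read: This paper addresses how Generalized Category Discovery (GCD) can exploit implicit inter-class relations between known and unknown classes to better classify novel classes in unlabeled data. Previous methods usually treat each class independently and ignore the inherent relations between classes, and obtaining these relations directly is difficult in real-world scenarios. The key to the solution is ReLKD, an end-to-end framework with three core modules: a target-grained module that learns discriminative representations, a coarse-grained module that captures hierarchical class relations, and a distillation module that transfers knowledge from the coarse-grained module to the target-grained module to refine its representation learning and improve recognition of novel classes.

Link: https://arxiv.org/abs/2512.07229
Authors: Fang Zhou, Zhiqiang Chen, Martin Pavlovski, Yizhong Zhang
Affiliations: East China Normal University; Samsung Electronics America
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the Main Track of the 28th European Conference on Artificial Intelligence (ECAI 2025). To appear in the proceedings published by IOS Press (DOI: https://doi.org/10.3233/FAIA413 )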

Abstract:Generalized Category Discovery (GCD) faces the challenge of categorizing unlabeled data containing both known and novel classes, given only labels for known classes. Previous studies often treat each class independently, neglecting the inherent inter-class relations. Obtaining such inter-class relations directly presents a significant challenge in real-world scenarios. To address this issue, we propose ReLKD, an end-to-end framework that effectively exploits implicit inter-class relations and leverages this knowledge to enhance the classification of novel classes. ReLKD comprises three key modules: a target-grained module for learning discriminative representations, a coarse-grained module for capturing hierarchical class relations, and a distillation module for transferring knowledge from the coarse-grained module to refine the target-grained module’s representation learning. Extensive experiments on four datasets demonstrate the effectiveness of ReLKD, particularly in scenarios with limited labeled data. The code for ReLKD is available at this https URL.
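
Knowledge distillation is often implemented as a temperature-softened KL divergence between teacher and student predictions. The sketch below shows that generic formulation only; the abstract does not state how ReLKD matches coarse-grained and target-grained outputs, so the logits and the temperature here are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return temperature ** 2 * kl


same = distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
different = distillation_loss([0.0, 0.0, 0.0], [2.0, 0.5, -1.0])
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is what lets one module guide the other during training.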

[CV-83] Towards Robust Protective Perturbation against DeepFake Face Swapping

Link: https://arxiv.org/abs/2512.07228
Authors: Hengyang Yao, Lin Li, Ke Sun, Jianing Qiu, Huiping Chen
Affiliations: University of Birmingham; University of Oxford; Xiamen University; MBZUAI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

[CV-84] VFM-VLM: Vision Foundation Model and Vision Language Model based Visual Comparison for 3D Pose Estimation

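Quick read: This paper addresses 6D object pose estimation in robotic grasping scenarios, where the core challenge is combining semantic understanding with geometric precision to improve grasping performance. The key to its approach is a systematic comparison of CLIP-based (Contrastive Language-Image Pretraining) and DINOv2-based vision models on this task: CLIP improves semantic consistency through language grounding, while DINOv2 provides denser and more precise geometric features. The two show complementary strengths, which gives empirical guidance for choosing vision models in robotic manipulation and grasping applications.

Link: https://arxiv.org/abs/2512.07215
Authors: Md Selim Sarowar, Sungho Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: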

Abstract:Vision Foundation Models (VFMs) and Vision Language Models (VLMs) have revolutionized computer vision by providing rich semantic and geometric representations. This paper presents a comprehensive visual comparison between CLIP based and DINOv2 based approaches for 3D pose estimation in hand object grasping scenarios. We evaluate both models on the task of 6D object pose estimation and demonstrate their complementary strengths: CLIP excels in semantic understanding through language grounding, while DINOv2 provides superior dense geometric features. Through extensive experiments on benchmark datasets, we show that CLIP based methods achieve better semantic consistency, while DINOv2 based approaches demonstrate competitive performance with enhanced geometric precision. Our analysis provides insights for selecting appropriate vision models for robotic manipulation and grasping, picking applications.

[CV-85] Object Pose Distribution Estimation for Determining Revolution and Reflection Uncertainty in Point Clouds

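Quick read: This paper addresses uncertainty modeling in object pose estimation for robotic perception. Conventional methods output a single pose estimate and cannot represent the pose uncertainty caused by visual ambiguity, which can lead to unreliable behavior. Existing pose distribution methods rely heavily on color (RGB) information, which is often unavailable in industrial settings. The paper proposes a neural network method that estimates pose uncertainty from colorless 3D point cloud data alone. According to the authors, it is the first deep learning approach to pose distribution estimation that does not rely on RGB input. The current implementation focuses on reflection and revolution symmetries, and the framework can be extended to full SE(3) pose distribution estimation.

Link: https://arxiv.org/abs/2512.07211
Authors: Frederik Hagelskjær, Dimitrios Arapis, Steffen Madsen, Thorbjørn Mosekjær Iversen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 8 figures, 5 tables, ICCR 2025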

Abstract: Object pose estimation is crucial to robotic perception and typically provides a single-pose estimate. However, a single estimate cannot capture pose uncertainty deriving from visual ambiguity, which can lead to unreliable behavior. Existing pose distribution methods rely heavily on color information, often unavailable in industrial settings. We propose a novel neural network-based method for estimating object pose uncertainty using only 3D colorless data. To the best of our knowledge, this is the first approach that leverages deep learning for pose distribution estimation without relying on RGB input. We validate our method in a real-world bin picking scenario with objects of varying geometric ambiguity. Our current implementation focuses on symmetries in reflection and revolution, but the framework is extendable to full SE(3) pose distribution estimation. Source code available at this http URL

[CV-86] AutoLugano: A Deep Learning Framework for Fully Automated Lymphoma Segmentation and Lugano Staging on FDG-PET/CT

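Quick read: This paper addresses the low efficiency, subjectivity, and poor consistency of manual FDG-PET/CT reading in the initial staging of lymphoma patients. The key to its solution is AutoLugano, an end-to-end automated deep learning system that goes from image input to Lugano stage output through three core modules: (1) lesion segmentation with a 3D nnU-Net model trained on multi-channel inputs; (2) mapping of lesions to 21 predefined lymph node regions using the TotalSegmentator toolkit and anatomical rules; (3) automatic derivation of the Lugano stage and therapeutic group (limited vs. advanced stage) from the spatial distribution of involved regions. The system performs particularly well on therapeutic stratification and offers a tool to support clinical decision-making.

Link: https://arxiv.org/abs/2512.07206
Authors: Boyang Pan, Zeyu Zhang, Hongyu Meng, Bin Cui, Yingying Zhang, Wenli Hou, Junhao Li, Langdi Zhong, Xiaoxiao Chen, Xiaoyu Xu, Changjin Zuo, Chao Cheng, Nan-Jie Gong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: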

Abstract: Purpose: To develop a fully automated deep learning system, AutoLugano, for end-to-end lymphoma classification by performing lesion segmentation, anatomical localization, and automated Lugano staging from baseline FDG-PET/CT scans. Methods: The AutoLugano system processes baseline FDG-PET/CT scans through three sequential modules: (1) Anatomy-Informed Lesion Segmentation, in which a 3D nnU-Net model trained on multi-channel inputs performs automated lesion detection; (2) Atlas-based Anatomical Localization, which leverages the TotalSegmentator toolkit to map segmented lesions to 21 predefined lymph node regions using deterministic anatomical rules; and (3) Automated Lugano Staging, where the spatial distribution of involved regions is translated into Lugano stages and therapeutic groups (Limited vs. Advanced Stage). The system was trained on the public autoPET dataset (n=1,007) and externally validated on an independent cohort of 67 patients. Performance was assessed using accuracy, sensitivity, specificity, and F1-score for regional involvement detection and staging agreement. Results: On the external validation set, the proposed model demonstrated robust performance, achieving an overall accuracy of 88.31%, sensitivity of 74.47%, specificity of 94.21%, and an F1-score of 80.80% for regional involvement detection, outperforming baseline models. Most notably, for the critical clinical task of therapeutic stratification (Limited vs. Advanced Stage), the system achieved a high accuracy of 85.07%, with a specificity of 90.48% and a sensitivity of 82.61%. Conclusion: AutoLugano represents the first fully automated, end-to-end pipeline that translates a single baseline FDG-PET/CT scan into a complete Lugano stage. This study demonstrates its strong potential to assist in initial staging, treatment stratification, and supporting clinical decision-making.
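
The stratification figures can be read as entries of a 2x2 confusion matrix. The abstract does not report the counts; the sketch below uses one set of integer counts that reproduces the three percentages for the 67 validation patients, under the assumption that "Advanced Stage" is the positive class. The counts are a reconstruction for illustration, not reported data.

```python
def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }


# Assumed counts: 46 advanced-stage and 21 limited-stage patients (67 total).
metrics = binary_metrics(tp=38, fn=8, tn=19, fp=2)
as_percent = {name: round(100 * value, 2) for name, value in metrics.items()}
```

Under these assumed counts, 57 of 67 patients are assigned to the correct therapeutic group, with 8 advanced-stage cases called limited and 2 limited-stage cases called advanced.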

[CV-87] MMRPT: MultiModal Reinforcement Pre-Training via Masked Vision-Dependent Reasoning

Link: https://arxiv.org/abs/2512.07203
Authors: Xuhui Zheng, Kang An, Ziliang Wang, Yuhang Wang, Faqiang Qian, Yichao Wu
Affiliations: SenseTime; Nanjing University; Shenzhen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 1 figure

[CV-88] Understanding Diffusion Models via Code Execution

Link: https://arxiv.org/abs/2512.07201
Authors: Cheng Yu
Affiliations: Chongqing University of Technology; DiAi Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

[CV-89] SUCCESS-GS: Survey of Compactness and Compression for Efficient Static and Dynamic Gaussian Splatting

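Quick read: This survey studies the excessive memory and compute consumption of 3D Gaussian Splatting (3DGS) in practical use, which is even more severe for 4D dynamic scenes. The core challenge is to reduce redundant representation while preserving high-fidelity reconstruction quality, so that rendering and storage become efficient. The survey systematically divides existing efficient methods into two categories, Parameter Compression and Restructuring Compression. The first reduces complexity by optimizing the parameter representation of individual Gaussians, the second by reorganizing the overall set of Gaussians, and both are reviewed with respect to scalability, compactness, and real-time requirements in static and dynamic scenes.

Link: https://arxiv.org/abs/2512.07197
Authors: Seokhyun Youn, Soohyun Lee, Geonho Kim, Weeyoung Kwon, Sung-Ho Bae, Jihyong Oh
Affiliations: Chung-Ang University; Kyung Hee University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The first three authors contributed equally to this work. The last two authors are co-corresponding authors. Please visit our project page at this https URL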

Abstract:3D Gaussian Splatting (3DGS) has emerged as a powerful explicit representation enabling real-time, high-fidelity 3D reconstruction and novel view synthesis. However, its practical use is hindered by the massive memory and computational demands required to store and render millions of Gaussians. These challenges become even more severe in 4D dynamic scenes. To address these issues, the field of Efficient Gaussian Splatting has rapidly evolved, proposing methods that reduce redundancy while preserving reconstruction quality. This survey provides the first unified overview of efficient 3D and 4D Gaussian Splatting techniques. For both 3D and 4D settings, we systematically categorize existing methods into two major directions, Parameter Compression and Restructuring Compression, and comprehensively summarize the core ideas and methodological trends within each category. We further cover widely used datasets, evaluation metrics, and representative benchmark comparisons. Finally, we discuss current limitations and outline promising research directions toward scalable, compact, and real-time Gaussian Splatting for both static and dynamic 3D scene representation.

[CV-90] HVQ-CGIC: Enabling Hyperprior Entropy Modeling for VQ-Based Controllable Generative Image Compression

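Quick read: This paper targets a limitation of generative image compression based on Vector Quantization (VQ): the entropy model uses a static, global probability distribution, which makes bitrate control inflexible and limits compression efficiency. The key contribution is HVQ-CGIC, a controllable generative image compression framework built on a VQ hyperprior. It rigorously derives the mathematical basis for introducing a hyperprior into the entropy model of VQ indices and designs a new loss function, which the authors describe as the first to bring rate-distortion (RD) balance and control to VQ-based generative image compression. Combined with a lightweight hyperprior estimation network, it matches the perceptual quality (LPIPS) of Control-GIC, CDC, and HiFiC on the Kodak dataset with 61.3% fewer bits on average.

Link: https://arxiv.org/abs/2512.07192
Authors: Niu Yi, Xu Tianyi, Ma Mingming, Wang Xinkun
Affiliations: Xidian University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 7 figures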

Abstract:Generative learned image compression methods using Vector Quantization (VQ) have recently shown impressive potential in balancing distortion and perceptual quality. However, these methods typically estimate the entropy of VQ indices using a static, global probability distribution, which fails to adapt to the specific content of each image. This non-adaptive approach leads to untapped bitrate potential and challenges in achieving flexible rate control. To address this challenge, we introduce a Controllable Generative Image Compression framework based on a VQ Hyperprior, termed HVQ-CGIC. HVQ-CGIC rigorously derives the mathematical foundation for introducing a hyperprior to the VQ indices entropy model. Based on this foundation, through novel loss design, to our knowledge, this framework is the first to introduce RD balance and control into vector quantization-based Generative Image Compression. Cooperating with a lightweight hyper-prior estimation network, HVQ-CGIC achieves a significant advantage in rate-distortion (RD) performance compared to current state-of-the-art (SOTA) generative compression methods. On the Kodak dataset, we achieve the same LPIPS as Control-GIC, CDC and HiFiC with an average of 61.3% fewer bits. We posit that HVQ-CGIC has the potential to become a foundational component for VQGAN-based image compression, analogous to the integral role of the HyperPrior framework in neural image compression.
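
To make the "static global distribution" limitation concrete, the sketch below compares the ideal code length of a set of VQ indices under a fixed prior with the code length under a distribution adapted to the image. It is only an illustration: the paper's hyperprior is a learned network, whereas this example substitutes the image's own empirical index frequencies and ignores the side-information cost of transmitting them. All names and values are hypothetical.

```python
import math
from collections import Counter

def bits_static(indices, global_probs):
    """Ideal code length (bits) under one fixed, image-independent distribution."""
    return sum(-math.log2(global_probs[i]) for i in indices)

def bits_adaptive(indices):
    """Ideal code length under the image's own empirical index distribution.

    Assumption: the cost of signalling this distribution is ignored.
    """
    counts = Counter(indices)
    n = len(indices)
    return sum(-math.log2(counts[i] / n) for i in indices)

codebook_size = 4
global_probs = [1 / codebook_size] * codebook_size   # uniform static prior (assumption)
indices = [0, 0, 0, 0, 0, 0, 1, 2]                   # hypothetical VQ indices of one image

static_bits = bits_static(indices, global_probs)
adaptive_bits = bits_adaptive(indices)
```

When an image uses a few codebook entries heavily, the adapted distribution assigns them short codes, and the gap between the two totals is the "untapped bitrate potential" the abstract refers to.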

[CV-91] RefLSM: Linearized Structural-Prior Reflectance Model for Medical Image Segmentation and Bias-Field Correction

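Quick read: This paper addresses the loss of accuracy in medical image segmentation caused by intensity inhomogeneity, noise, blurred boundaries, and irregular structures, and in particular the weakness of traditional level set methods, which rely on approximate bias field estimates and degrade under severely non-uniform imaging conditions. The key idea is a variational reflectance-based level set model (RefLSM) that uses a Retinex-inspired decomposition to separate the observed image explicitly into reflectance and bias field components, then segments the illumination-invariant reflectance directly so that fine structural detail is preserved. A linear structural prior steers the smoothed reflectance gradients toward a data-driven reference direction, improving geometric guidance in low-contrast or noisy settings. A relaxed binary level set, enforced through convex relaxation and sign projection, gives stable evolution and avoids the diffusion caused by reinitialization. The resulting variational problem is solved efficiently with an ADMM scheme, improving segmentation accuracy, robustness, and computational efficiency.

Link: https://arxiv.org/abs/2512.07191
Authors: Wenqi Zhao, Jiacheng Sang, Fenghua Cheng, Yonglu Shu, Dong Li, Xiaofeng Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: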

Abstract:Medical image segmentation remains challenging due to intensity inhomogeneity, noise, blurred boundaries, and irregular structures. Traditional level set methods, while effective in certain cases, often depend on approximate bias field estimations and therefore struggle under severe non-uniform imaging conditions. To address these limitations, we propose a novel variational Reflectance-based Level Set Model (RefLSM), which explicitly integrates Retinex-inspired reflectance decomposition into the segmentation framework. By decomposing the observed image into reflectance and bias field components, RefLSM directly segments the reflectance, which is invariant to illumination and preserves fine structural details. Building on this foundation, we introduce two key innovations for enhanced precision and robustness. First, a linear structural prior steers the smoothed reflectance gradients toward a data-driven reference, providing reliable geometric guidance in noisy or low-contrast scenes. Second, a relaxed binary level-set is embedded in RefLSM and enforced via convex relaxation and sign projection, yielding stable evolution and avoiding reinitialization-induced diffusion. The resulting variational problem is solved efficiently using an ADMM-based optimization scheme. Extensive experiments on multiple medical imaging datasets demonstrate that RefLSM achieves superior segmentation accuracy, robustness, and computational efficiency compared to state-of-the-art level set methods.

[CV-92] Integrating Multi-scale and Multi-filtration Topological Features for Medical Image Classification

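Quick read: This paper addresses the limited sensitivity of current deep neural networks to anatomical structure in medical image classification: models tend to over-rely on pixel-intensity features and neglect the more fundamental anatomical information encoded by topological invariants, while existing methods capture only simple topological features under a single parameter. The key to the solution is a topology-guided classification framework that extracts multi-scale, multi-filtration persistent homology features to build a multi-level topological representation ranging from global anatomy to subtle local irregularities. A cross-attention-based neural network processes the consolidated cubical persistence diagrams (PDs) directly, and the resulting topological embeddings are fused with feature maps from convolutional neural networks (CNNs) or Transformers, strengthening the ability of the end-to-end architecture to recognize complex anatomical structures.

Link: https://arxiv.org/abs/2512.07190
Authors: Pengfei Gu, Huimin Li, Haoteng Tang, Dongkuan (DK) Xu, Erik Enriquez, DongChul Kim, Bin Fu, Danny Z. Chen
Affiliations: The University of Texas Rio Grande Valley; North Carolina State University; University of Notre Dame
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: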

Abstract:Modern deep neural networks have shown remarkable performance in medical image classification. However, such networks either emphasize pixel-intensity features instead of fundamental anatomical structures (e.g., those encoded by topological invariants), or they capture only simple topological features via single-parameter persistence. In this paper, we propose a new topology-guided classification framework that extracts multi-scale and multi-filtration persistent topological features and integrates them into vision classification backbones. For an input image, we first compute cubical persistence diagrams (PDs) across multiple image resolutions/scales. We then develop a “vineyard” algorithm that consolidates these PDs into a single, stable diagram capturing signatures at varying granularities, from global anatomy to subtle local irregularities that may indicate early-stage disease. To further exploit richer topological representations produced by multiple filtrations, we design a cross-attention-based neural network that directly processes the consolidated final PDs. The resulting topological embeddings are fused with feature maps from CNNs or Transformers. By integrating multi-scale and multi-filtration topologies into an end-to-end architecture, our approach enhances the model’s capacity to recognize complex anatomical structures. Evaluations on three public datasets show consistent, considerable improvements over strong baselines and state-of-the-art methods, demonstrating the value of our comprehensive topological perspective for robust and interpretable medical image classification.
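
For readers unfamiliar with persistence diagrams, the sketch below computes the simplest case: 0-dimensional sublevel-set persistence of a 1D signal, using a union-find and the elder rule. It is a basic building block only, not the paper's pipeline, which works on 2D images with cubical complexes, several scales and filtrations, and a consolidation step. The function name and input are hypothetical.

```python
def persistence_0d(values):
    """0-dimensional sublevel-set persistence of a 1D signal.

    Returns sorted (birth, death) pairs. The component born at the global
    minimum never dies and is reported with death = float("inf").
    """
    n = len(values)
    parent = [None] * n      # None marks a sample not yet in the filtration
    birth = {}               # root index -> birth value of its component

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    pairs = []
    # Add samples in order of increasing value (the sublevel filtration).
    for i in sorted(range(n), key=lambda k: values[k]):
        parent[i] = i
        birth[i] = values[i]
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] is not None:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # Elder rule: the younger component (larger birth) dies here.
                old, young = (ri, rj) if birth[ri] <= birth[rj] else (rj, ri)
                if birth[young] < values[i]:      # skip zero-persistence pairs
                    pairs.append((birth[young], values[i]))
                parent[young] = old
    roots = {find(i) for i in range(n)}
    pairs.extend((birth[r], float("inf")) for r in roots)
    return sorted(pairs)

diagram = persistence_0d([1.0, 3.0, 0.0, 2.0, 0.5])
```

Each finite pair records a local minimum and the height at which its basin merges into an older one; long-lived pairs correspond to prominent structure, short-lived ones to noise.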

[CV-93] START: Spatial and Textual Learning for Chart Understanding WACV2026

Link: https://arxiv.org/abs/2512.07186
Authors: Zhuoming Liu, Xiaofeng Gao, Feiyang Niu, Qiaozi Gao, Liu Liu, Robinson Piramuthu
Affiliations: University of Wisconsin-Madison; Amazon AGI; MIT
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: WACV2026 Camera Ready

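[CV-94] TIDE: Two-Stage Inverse Degradation Estimation with Guided Prior Disentanglement for Underwater Image Restoration

Quick read: This paper addresses the difficulty of handling the complex, spatially varying degradations in underwater image restoration. Existing methods usually apply one uniform restoration strategy to the whole image and cannot cope with multiple co-occurring degradations (color distortion, haze, detail loss, and noise) whose effects differ across regions and water conditions. The key to the solution is TIDE, a two-stage inverse degradation estimation framework that models degradation characteristics explicitly and restores them in a targeted way through specialized prior decomposition. It first decomposes underwater degradation into four key factors, designs a dedicated restoration expert for each to produce differentiated restoration hypotheses, and fuses those hypotheses adaptively according to local degradation patterns. A progressive refinement stage then corrects residual artifacts, improving color correction and contrast enhancement while keeping results natural-looking.

Link: https://arxiv.org/abs/2512.07171
Authors: Shravan Venkatraman, Rakesh Raj Madavan, Pavan Kumar S, Muthu Subash Kavitha
Affiliations: Mohamed bin Zayed University of AI; University of Amsterdam; University of Massachusetts, Amherst; Nagasaki University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 11 figures, 5 tables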

Abstract:Underwater image restoration is essential for marine applications ranging from ecological monitoring to archaeological surveys, but effectively addressing the complex and spatially varying nature of underwater degradations remains a challenge. Existing methods typically apply uniform restoration strategies across the entire image, struggling to handle multiple co-occurring degradations that vary spatially and with water conditions. We introduce TIDE, a two-stage inverse degradation estimation framework that explicitly models degradation characteristics and applies targeted restoration through specialized prior decomposition. Our approach disentangles the restoration process into multiple specialized hypotheses that are adaptively fused based on local degradation patterns, followed by a progressive refinement stage that corrects residual artifacts. Specifically, TIDE decomposes underwater degradations into four key factors, namely color distortion, haze, detail loss, and noise, and designs restoration experts specialized for each. By generating specialized restoration hypotheses, TIDE balances competing degradation factors and produces natural results even in highly degraded regions. Extensive experiments across both standard benchmarks and challenging turbid water conditions show that TIDE achieves competitive performance on reference-based fidelity metrics while outperforming state-of-the-art methods on non-reference perceptual quality metrics, with strong improvements in color correction and contrast enhancement. Our code is available at: this https URL.

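[CV-95] Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach

Quick read: This paper addresses the limits of existing image fusion methods in robustness, adaptability, and controllability, especially their inability to incorporate user intent flexibly in complex scenarios such as low-light degradation, color shifts, or exposure imbalance. The core difficulty is that ground-truth fused images do not exist and existing datasets are small, which makes it hard to train an end-to-end model that both understands high-level semantics and performs fine-grained multimodal alignment. The key to the solution is DiTFuse, an instruction-driven Diffusion-Transformer (DiT) framework that encodes the two input images and a natural-language instruction jointly in a shared latent space, giving hierarchical and fine-grained control over the fusion process. Training uses a multi-degradation masked-image modeling strategy, so the network learns cross-modal alignment, modality-invariant restoration, and task-aware feature selection jointly without ground-truth labels. This unifies infrared-visible, multi-focus, and multi-exposure fusion in one model and supports zero-shot generalization to other multi-image fusion scenarios, such as instruction-conditioned segmentation.

Link: https://arxiv.org/abs/2512.07170
Authors: Jiayang Li, Chengjie Jiang, Junjun Jiang, Pengwei Liang, Jiayi Ma, Liqiang Nie
Affiliations: Harbin Institute of Technology; Tsinghua University; Wuhan University; Harbin Institute of Technology (Shenzhen)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: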

Abstract:Image fusion aims to blend complementary information from multiple sensing modalities, yet existing approaches remain limited in robustness, adaptability, and controllability. Most current fusion networks are tailored to specific tasks and lack the ability to flexibly incorporate user intent, especially in complex scenarios involving low-light degradation, color shifts, or exposure imbalance. Moreover, the absence of ground-truth fused images and the small scale of existing datasets make it difficult to train an end-to-end model that simultaneously understands high-level semantics and performs fine-grained multimodal alignment. We therefore present DiTFuse, an instruction-driven Diffusion-Transformer (DiT) framework that performs end-to-end, semantics-aware fusion within a single model. By jointly encoding two images and natural-language instructions in a shared latent space, DiTFuse enables hierarchical and fine-grained control over fusion dynamics, overcoming the limitations of pre-fusion and post-fusion pipelines that struggle to inject high-level semantics. The training phase employs a multi-degradation masked-image modeling strategy, so the network jointly learns cross-modal alignment, modality-invariant restoration, and task-aware feature selection without relying on ground truth images. A curated, multi-granularity instruction dataset further equips the model with interactive fusion capabilities. DiTFuse unifies infrared-visible, multi-focus, and multi-exposure fusion, as well as text-controlled refinement and downstream tasks, within a single architecture. Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention. The model also supports multi-level user control and zero-shot generalization to other multi-image fusion scenarios, including instruction-conditioned segmentation.

[CV-96] When Privacy Meets Recovery: The Overlooked Half of Surrogate-Driven Privacy Preservation for MLLM Editing

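Quick read: This paper addresses privacy leakage in Multimodal Large Language Models (MLLMs), and specifically the fact that existing protection methods neglect to evaluate the authenticity and recoverability of the protected private content. The core challenge is how to restore surrogate-driven protected private data effectively across diverse MLLM application scenarios. The key contribution is the SPPE (Surrogate Privacy Protected Editable) dataset, which covers diverse privacy categories and user instructions to simulate real MLLM applications and provides protected surrogates together with their MLLM-edited versions, so that privacy recovery quality can be assessed directly. The authors further formulate privacy recovery as a guided generation task conditioned on complementary multimodal signals and propose a unified approach that reliably reconstructs the original private content while preserving the fidelity of MLLM-generated edits. Experiments on SPPE and InstructPix2Pix indicate that the approach generalizes well and balances privacy protection with MLLM usability.

Link: https://arxiv.org/abs/2512.07166
Authors: Siyuan Xu, Yibing Liu, Peilin Chen, Yung-Hui Li, Shiqi Wang, Sam Kwong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 7 figures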

Abstract:Privacy leakage in Multimodal Large Language Models (MLLMs) has long been an intractable problem. Existing studies, though they effectively obscure private information in MLLMs, often overlook the evaluation of the authenticity and recovery quality of user privacy. To this end, this work uniquely focuses on the critical challenge of how to restore surrogate-driven protected data in diverse MLLM scenarios. We first bridge this research gap by contributing the SPPE (Surrogate Privacy Protected Editable) dataset, which includes a wide range of privacy categories and user instructions to simulate real MLLM applications. This dataset offers protected surrogates alongside their various MLLM-edited versions, thus enabling the direct assessment of privacy recovery quality. By formulating privacy recovery as a guided generation task conditioned on complementary multimodal signals, we further introduce a unified approach that reliably reconstructs private content while preserving the fidelity of MLLM-generated edits. The experiments on both SPPE and InstructPix2Pix further show that our approach generalizes well across diverse visual content and editing tasks, achieving a strong balance between privacy protection and MLLM usability.

[CV-97] MuSASplat: Efficient Sparse-View 3D Gaussian Splats via Lightweight Multi-Scale Adaptation

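Quick read: This paper addresses the high training cost of existing pose-free feed-forward methods for sparse-view 3D Gaussian splatting, which rely on full-parameter fine-tuning of large Vision Transformer (ViT) backbones. The key to the solution is the MuSASplat framework, with two core components: (1) a lightweight Multi-Scale Adapter that fine-tunes the ViT backbone efficiently with very few trainable parameters, substantially reducing GPU consumption; and (2) a Feature Fusion Aggregator that replaces the conventional memory bank and integrates features across input views in a geometrically consistent way, reducing memory usage and training complexity while keeping novel view synthesis quality high.

Link: https://arxiv.org/abs/2512.07165
Authors: Muyu Xu, Fangneng Zhan, Xiaoqin Zhang, Ling Shao, Shijian Lu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: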

Abstract:Sparse-view 3D Gaussian splatting seeks to render high-quality novel views of 3D scenes from a limited set of input images. While recent pose-free feed-forward methods leveraging pre-trained 3D priors have achieved impressive results, most of them rely on full fine-tuning of large Vision Transformer (ViT) backbones and incur substantial GPU costs. In this work, we introduce MuSASplat, a novel framework that dramatically reduces the computational burden of training pose-free feed-forward 3D Gaussian splats models with little compromise of rendering quality. Central to our approach is a lightweight Multi-Scale Adapter that enables efficient fine-tuning of ViT-based architectures with only a small fraction of training parameters. This design avoids the prohibitive GPU overhead associated with previous full-model adaptation techniques while maintaining high fidelity in novel view synthesis, even with very sparse input views. In addition, we introduce a Feature Fusion Aggregator that integrates features across input views effectively and efficiently. Unlike widely adopted memory banks, the Feature Fusion Aggregator ensures consistent geometric integration across input views and meanwhile mitigates the memory usage, training complexity, and computational costs significantly. Extensive experiments across diverse datasets show that MuSASplat achieves state-of-the-art rendering quality but has significantly reduced parameters and training resource requirements as compared with existing methods.

[CV-98] CHIMERA: Adaptive Cache Injection and Semantic Anchor Prompting for Zero-shot Image Morphing with Morphing-oriented Metrics

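Quick read: This paper addresses the challenge diffusion models face in image morphing: producing smooth, semantically consistent transitions. Existing methods lack adaptive structural and semantic alignment, so transitions often show abrupt changes or over-saturation. The core solution is the CHIMERA framework, built on two ideas. Adaptive Cache Injection (ACI) caches the down-, mid-, and up-block features of the input images during DDIM inversion and re-injects them adaptively during denoising, aligning structure and semantics in a depth- and time-adaptive manner. Semantic Anchor Prompting (SAP) uses a vision-language model to generate a shared anchor prompt that acts as a semantic bridge and guides denoising toward coherent results. Together these improve the naturalness and consistency of morphing and reach state-of-the-art performance.

Link: https://arxiv.org/abs/2512.07155
Authors: Dahyeon Kye, Jeahun Sung, MinKyu Jeon, Jihyong Oh
Affiliations: Chung-Ang University; Princeton University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Please visit our project page at this https URL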

Abstract:Diffusion models exhibit remarkable generative ability, yet achieving smooth and semantically consistent image morphing remains a challenge. Existing approaches often yield abrupt transitions or over-saturated appearances due to the lack of adaptive structural and semantic alignments. We propose CHIMERA, a zero-shot diffusion-based framework that formulates morphing as a cached inversion-guided denoising process. To handle large semantic and appearance disparities, we propose Adaptive Cache Injection and Semantic Anchor Prompting. Adaptive Cache Injection (ACI) caches down, mid, and up blocks features from both inputs during DDIM inversion and re-injects them adaptively during denoising, enabling spatial and semantic alignment in depth- and time-adaptive manners and enabling natural feature fusion and smooth transitions. Semantic Anchor Prompting (SAP) leverages a vision-language model to generate a shared anchor prompt that serves as a semantic anchor, bridging dissimilar inputs and guiding the denoising process toward coherent results. Finally, we introduce the Global-Local Consistency Score (GLCS), a morphing-oriented metric that simultaneously evaluates the global harmonization of the two inputs and the smoothness of the local morphing transition. Extensive experiments and user studies show that CHIMERA achieves smoother and more semantically aligned transitions than existing methods, establishing a new state of the art in image morphing. The code and project page will be publicly released.

[CV-99] FlowLPS: Langevin-Proximal Sampling for Flow-based Inverse Problem Solvers

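Quick read: This paper addresses two key problems that arise when pretrained latent flow models are used to solve inverse problems: existing methods often fail to converge to the posterior mode, and they suffer from manifold deviation in latent space. To mitigate both, the authors propose FlowLPS, a new training-free framework whose core idea is a Langevin Proximal Sampling (LPS) strategy. LPS combines Langevin dynamics, for manifold-consistent exploration, with proximal optimization, for precise mode seeking, and achieves a better balance between reconstruction fidelity and perceptual quality across multiple inverse tasks on FFHQ and DIV2K.

Link: https://arxiv.org/abs/2512.07150
Authors: Jonghyun Park, Jong Chul Ye
Affiliations: KAIST
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: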

Abstract:Deep generative models have become powerful priors for solving inverse problems, and various training-free methods have been developed. However, when applied to latent flow models, existing methods often fail to converge to the posterior mode or suffer from manifold deviation within latent spaces. To mitigate this, here we introduce a novel training-free framework, FlowLPS, that solves inverse problems with pretrained flow models via a Langevin Proximal Sampling (LPS) strategy. Our method integrates Langevin dynamics for manifold-consistent exploration with proximal optimization for precise mode seeking, achieving a superior balance between reconstruction fidelity and perceptual quality across multiple inverse tasks on FFHQ and DIV2K, outperforming state of the art inverse solvers.
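
The two ingredients named in the abstract, a Langevin move and a proximal move, can be illustrated on a toy 1D problem. The sketch below is not the authors' algorithm and is not an exact posterior sampler: it assumes a standard normal prior in place of a flow model, a scalar measurement, and hypothetical step sizes, and it only shows how the two updates alternate.

```python
import math
import random

def langevin_proximal_step(x, y, rng, step=0.05, sigma=0.5, lam=0.2):
    """One alternating update on a toy 1D inverse problem.

    Assumed prior: x ~ N(0, 1), so the prior score is -x.
    Assumed measurement: y = x + noise with standard deviation sigma.
    """
    # Langevin move: drift along the prior score plus Gaussian noise.
    x = x + step * (-x) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    # Proximal move: argmin_z (y - z)^2 / (2 sigma^2) + (z - x)^2 / (2 lam).
    # For this quadratic data term the minimiser is a weighted average.
    w_data, w_prox = 1.0 / sigma**2, 1.0 / lam
    return (w_data * y + w_prox * x) / (w_data + w_prox)

rng = random.Random(0)
x, y = 3.0, 1.0            # hypothetical start point and measurement
samples = []
for _ in range(2000):
    x = langevin_proximal_step(x, y, rng)
    samples.append(x)
tail = samples[500:]       # discard the early iterations
mean_tail = sum(tail) / len(tail)
```

The Langevin move keeps iterates plausible under the prior, while the proximal move pulls them toward consistency with the measurement; `lam` controls how strongly the data term wins in each step.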

[CV-100] Winning the Lottery by Preserving Network Training Dynamics with Concrete Ticket Search

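Quick read: This paper addresses the efficiency and performance bottlenecks of current methods for finding sparse subnetworks ("winning tickets") in neural networks. Lottery Ticket Rewinding (LTR) is computationally expensive, while Pruning-at-Initialization (PaI) methods based on first-order saliency metrics ignore dependencies between weights, perform poorly in the sparse regime, and fail basic sanity checks. The key to the solution is Concrete Ticket Search (CTS), which frames subnetwork discovery as a holistic combinatorial optimization problem, uses a Concrete relaxation to approximate the discrete search space continuously, and introduces a new gradient balancing scheme (GRADBALANCE) to control sparsity precisely. This lets it find high-performing subnetworks near initialization efficiently, without sensitive hyperparameter tuning. Motivated by lottery ticket training dynamics, the paper also designs a family of pruning objectives inspired by knowledge distillation; minimizing the reverse KL divergence between sparse and dense network outputs (CTS-KL) is especially effective and improves accuracy and robustness under sparsity.

Link: https://arxiv.org/abs/2512.07142
Authors: Tanay Arora, Christof Teuscher
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments: This work plans to be submitted to the IEEE for possible publication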

Abstract:The Lottery Ticket Hypothesis asserts the existence of highly sparse, trainable subnetworks (‘winning tickets’) within dense, randomly initialized neural networks. However, state-of-the-art methods of drawing these tickets, like Lottery Ticket Rewinding (LTR), are computationally prohibitive, while more efficient saliency-based Pruning-at-Initialization (PaI) techniques suffer from a significant accuracy-sparsity trade-off and fail basic sanity checks. In this work, we argue that PaI’s reliance on first-order saliency metrics, which ignore inter-weight dependencies, contributes substantially to this performance gap, especially in the sparse regime. To address this, we introduce Concrete Ticket Search (CTS), an algorithm that frames subnetwork discovery as a holistic combinatorial optimization problem. By leveraging a Concrete relaxation of the discrete search space and a novel gradient balancing scheme (GRADBALANCE) to control sparsity, CTS efficiently identifies high-performing subnetworks near initialization without requiring sensitive hyperparameter tuning. Motivated by recent works on lottery ticket training dynamics, we further propose a knowledge distillation-inspired family of pruning objectives, finding that minimizing the reverse Kullback-Leibler divergence between sparse and dense network outputs (CTS-KL) is particularly effective. Experiments on varying image classification tasks show that CTS produces subnetworks that robustly pass sanity checks and achieve accuracy comparable to or exceeding LTR, while requiring only a small fraction of the computation. For example, on ResNet-20 on CIFAR10, it reaches 99.3% sparsity with 74.0% accuracy in 7.9 minutes, while LTR attains the same sparsity with 68.3% accuracy in 95.2 minutes. CTS’s subnetworks outperform saliency-based methods across all sparsities, but its advantage over LTR is most pronounced in the highly sparse regime.
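
A Concrete relaxation replaces each hard keep/prune decision with a continuous sample in (0, 1) that sharpens toward 0 or 1 as a temperature drops. The sketch below shows the standard binary Concrete (relaxed Bernoulli) sampler only; the logits, temperatures, and function name are hypothetical, and nothing here reproduces the paper's GRADBALANCE scheme or training loop.

```python
import math
import random

def binary_concrete(log_alpha, temperature, rng):
    """Draw a relaxed Bernoulli sample with logit log_alpha."""
    u = rng.uniform(1e-6, 1.0 - 1e-6)                 # keep log() finite
    logistic_noise = math.log(u) - math.log(1.0 - u)
    return 1.0 / (1.0 + math.exp(-(log_alpha + logistic_noise) / temperature))

rng = random.Random(0)
log_alphas = [4.0, 0.0, -4.0]      # hypothetical per-weight mask logits
mask_warm = [binary_concrete(a, 1.0, rng) for a in log_alphas]    # soft mask
mask_cold = [binary_concrete(a, 0.05, rng) for a in log_alphas]   # near-binary mask
```

Because the sample is a smooth function of `log_alpha`, gradients can flow into the mask logits, which is what allows a discrete subnetwork search to be handled with gradient-based optimization.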

[CV-101] A Large-Scale Multimodal Dataset and Benchmarks for Human Activity Scene Understanding and Reasoning

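Quick read: This paper addresses two core obstacles in moving multimodal Human Action Recognition (HAR) toward deeper Human Action Understanding (HAU) and Human Action Reasoning (HARn). First, existing Large Vision Language Models (LVLMs) handle non-RGB modalities such as depth, inertial measurement unit (IMU), and millimeter-wave (mmWave) data poorly, because large-scale data-caption resources are lacking. Second, existing HAR datasets provide only coarse data-label annotations, which cannot support the fine-grained modeling of action dynamics that HAU and HARn require. The key to the solution is CUHK-X, a large-scale multimodal dataset with 58,445 samples, 40 action classes, and multi-sensor data, together with a prompt-based scene creation method that uses LLMs to generate logically coherent action-sequence descriptions, followed by human validation to improve caption consistency. The result is a high-quality, structured data foundation and evaluation benchmark for HAU and HARn.

Link: https://arxiv.org/abs/2512.07136
Authors: Siyang Jiang, Mu Yuan, Xiang Ji, Bufang Yang, Zeyu Liu, Lilin Xu, Yang Li, Yuting He, Liran Dong, Wenrui Lu, Zhenyu Yan, Xiaofan Jiang, Wei Gao, Hongkai Chen, Guoliang Xing
Affiliations: The Chinese University of Hong Kong; University of Illinois Urbana-Champaign; Columbia University; University of Pittsburgh
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: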

Abstract:Multimodal human action recognition (HAR) leverages complementary sensors for activity classification. Beyond recognition, recent advances in large language models (LLMs) enable detailed descriptions and causal reasoning, motivating new tasks: human action understanding (HAU) and human action reasoning (HARn). However, most LLMs, especially large vision language models (LVLMs), struggle with non-RGB modalities such as depth, IMU, and mmWave due to the lack of large-scale data-caption resources. Existing HAR datasets mainly provide coarse data-label annotations, which are insufficient to capture fine-grained action dynamics needed for HAU and HARn. We consider two ground-truth pair types: (1) data label (discrete category) and (2) data caption (textual description). Naively generating captions from labels often lacks logical and spatiotemporal consistency. We introduce CUHK-X, a large-scale multimodal dataset and benchmark suite for HAR, HAU, and HARn. CUHK-X contains 58,445 samples covering 40 actions performed by 30 participants across two indoor environments. To improve caption consistency, we propose a prompt-based scene creation method that leverages LLMs to generate logically connected activity sequences, followed by human validation. CUHK-X includes three benchmarks with six evaluation tasks. Experiments report average accuracies of 76.52% (HAR), 40.76% (HAU), and 70.25% (HARn). CUHK-X aims to enable the community to apply and develop data-intensive learning methods for robust, multimodal human activity analysis. Project page and code: this https URL and this https URL.

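[CV-102] TrajMoE: Scene-Adaptive Trajectory Planning with Mixture of Experts and Reinforcement Learning

Quick read: This paper addresses limits on the trajectory planning performance of end-to-end frameworks in current autonomous driving systems, in two respects. First, the best trajectory prior differs substantially across driving scenarios, and existing methods do not adapt to this. Second, the trajectory evaluation mechanism lacks policy-driven refinement and stays constrained by one-stage supervised training. The solution has two main parts: a Mixture of Experts (MoE) architecture applies different trajectory priors tailored to different scenarios, and Reinforcement Learning fine-tunes the trajectory scoring mechanism to go beyond what supervised training alone achieves. The authors also combine models with different perception backbones to strengthen perceptual features, reaching a score of 51.08 on the navsim ICCV benchmark and placing third.

Link: https://arxiv.org/abs/2512.07135
Authors: Zebin Xing, Pengxuan Yang, Linbo Wang, Yichen Zhang, Yiming Hu, Yupeng Zheng, Junli Wang, Yinfeng Gao, Guang Li, Kun Ma, Long Chen, Zhongpu Xia, Qichao Zhang, Hangjun Ye, Dongbin Zhao
Affiliations: Institute of Automation, Chinese Academy of Sciences; Xiaomi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: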

Abstract:Current autonomous driving systems often favor end-to-end frameworks, which take sensor inputs like images and learn to map them into trajectory space via neural networks. Previous work has demonstrated that models can achieve better planning performance when provided with a prior distribution of possible trajectories. However, these approaches often overlook two critical aspects: 1) The appropriate trajectory prior can vary significantly across different driving scenarios. 2) Their trajectory evaluation mechanism lacks policy-driven refinement, remaining constrained by the limitations of one-stage supervised training. To address these issues, we explore improvements in two key areas. For problem 1, we employ MoE to apply different trajectory priors tailored to different scenarios. For problem 2, we utilize Reinforcement Learning to fine-tune the trajectory scoring mechanism. Additionally, we integrate models with different perception backbones to enhance perceptual features. Our integrated model achieved a score of 51.08 on the navsim ICCV benchmark, securing third place.

[CV-103] Mimir: Hierarchical Goal-Driven Diffusion with Uncertainty Propagation for End-to-End Autonomous Driving

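Quick read: This paper addresses the poor trajectory robustness and slow inference in end-to-end autonomous driving that result from inaccurate high-level guidance signals and computationally expensive guidance modules, and proposes Mimir, a hierarchical dual-system framework. The key ideas are: (1) modeling goal-point uncertainty with a Laplace distribution to make trajectory planning more robust; and (2) a multi-rate guidance mechanism that predicts extended goal points in advance, giving a 1.6x speedup in high-level module inference without loss of accuracy.

Link: https://arxiv.org/abs/2512.07130
Authors: Zebin Xing, Yupeng Zheng, Qichao Zhang, Zhixing Ding, Pengxuan Yang, Songen Gu, Zhongpu Xia, Dongbin Zhao
Affiliations: Institute of Automation, Chinese Academy of Sciences; China University of Geosciences; Fudan University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: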

Abstract:End-to-end autonomous driving has emerged as a pivotal direction in the field of autonomous systems. Recent works have demonstrated impressive performance by incorporating high-level guidance signals to steer low-level trajectory planners. However, their potential is often constrained by inaccurate high-level guidance and the computational overhead of complex guidance modules. To address these limitations, we propose Mimir, a novel hierarchical dual-system framework capable of generating robust trajectories relying on goal points with uncertainty estimation: (1) Unlike previous approaches that model goal points deterministically, we estimate goal point uncertainty with a Laplace distribution to enhance robustness; (2) To overcome the slow inference speed of the guidance system, we introduce a multi-rate guidance mechanism that predicts extended goal points in advance. Validated on challenging Navhard and Navtest benchmarks, Mimir surpasses previous state-of-the-art methods with a 20% improvement in the driving score EPDMS, while achieving 1.6 times improvement in high-level module inference speed without compromising accuracy. The code and models will be released soon to promote reproducibility and further development. The code is available at this https URL
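
Modeling a goal point with a Laplace distribution means predicting both a location and a scale, and training with the Laplace negative log-likelihood. The sketch below shows that loss for a single coordinate; the function name and numbers are hypothetical, and how the paper parameterizes or uses the scale is not specified here.

```python
import math

def laplace_nll(target, mu, b):
    """Negative log-likelihood of `target` under Laplace(mu, b), with b > 0."""
    return math.log(2.0 * b) + abs(target - mu) / b

# Same 1.0 m prediction error, three hypothetical predicted scales.
overconfident = laplace_nll(target=10.0, mu=9.0, b=0.5)
matched = laplace_nll(target=10.0, mu=9.0, b=1.0)
underconfident = laplace_nll(target=10.0, mu=9.0, b=2.0)
```

For a fixed error the loss is smallest when the scale `b` equals the absolute error, so the model is rewarded for reporting uncertainty that matches how wrong it actually is. For a 2D goal point, one simple assumption is to sum this loss over the two coordinates.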

[CV-104] MulCLIP: A Multi-level Alignment Framework for Enhancing Fine-grained Long-context CLIP

Link: https://arxiv.org/abs/2512.07128
Authors: Chau Truong, Hieu Ta Quang, Dung D. Le
Affiliations: FPT Software AI Center; VinUniversity
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-105] Training-free Clothing Region of Interest Self-correction for Virtual Try-On

Link: https://arxiv.org/abs/2512.07126
Authors: Shengjie Lu, Zhibin Wan, Jiejie Liu, Quan Zhang, Mingjie Sun
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 8 figures

[CV-106] MSN: Multi-directional Similarity Network for Hand-crafted and Deep-synthesized Copy-Move Forgery Detection

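Quick read: This paper addresses the detection of copy-move image forgery, including forgeries produced by deep generative networks, and in particular the limits of existing deep detection models in feature representation and in localizing tampered regions under complex transformations and fine-grained manipulation. The key to the solution is a two-stream model, the Multi-directional Similarity Network (MSN). For representation, a multi-directional CNN encodes the image hierarchically, and diverse augmentation in scale and rotation makes the features better at measuring the similarity between sampled patches. For localization, a decoder based on a 2-D similarity matrix uses the spatial information of the whole image more fully than the usual 1-D similarity vector, which improves the detection of tampered regions.

Link: https://arxiv.org/abs/2512.07110
Authors: Liangwei Jiang, Jinluo Xie, Yecheng Huang, Hua Zhang, Hongyu Yang, Di Huang
Affiliations: Beihang University; Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: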

Abstract:Copy-move image forgery aims to duplicate certain objects or to hide specific contents with copy-move operations, which can be achieved by a sequence of manual manipulations as well as up-to-date deep generative network-based swapping. Its detection is becoming increasingly challenging because of the complex transformations and fine-tuned operations on the tampered regions. In this paper, we propose a novel two-stream model, namely Multi-directional Similarity Network (MSN), for accurate and efficient copy-move forgery detection. It addresses the two major limitations of existing deep detection models in representation and localization, respectively. In representation, an image is hierarchically encoded by a multi-directional CNN network, and due to the diverse augmentation in scales and rotations, the feature achieved better measures the similarity between sampled patches in two streams. In localization, we design a 2-D similarity matrix based decoder, and compared with the current 1-D similarity vector based one, it makes full use of spatial information in the entire image, leading to the improvement in detecting tampered regions. Beyond the method, a new forgery database generated by various deep neural networks is presented, as a new benchmark for detecting the growing deep-synthesized copy-move. Extensive experiments are conducted on two classic image forensics benchmarks, i.e., CASIA CMFD and CoMoFoD, and the newly presented one. The state-of-the-art results are reported, which demonstrate the effectiveness of the proposed approach.
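
To make the contrast between a 2-D similarity matrix and a 1-D similarity vector concrete, here is a small sketch. It is not the MSN decoder: the cosine similarity, the toy patch features, and the helper names are all assumptions chosen for illustration, and the reading of the 1-D vector as a per-patch best-match score is one plausible interpretation rather than the paper's definition.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def similarity_matrix(patches):
    """2-D matrix: entry [i][j] compares patch i with patch j."""
    return [[cosine(p, q) for q in patches] for p in patches]

def best_matches(sim):
    """Per-patch best match as (index, score), ignoring the patch itself.
    Keeping only the scores would give a 1-D similarity vector."""
    result = []
    for i, row in enumerate(sim):
        candidates = [(score, j) for j, score in enumerate(row) if j != i]
        score, j = max(candidates)
        result.append((j, score))
    return result

# Hypothetical patch features: patches 0 and 2 are near-duplicates.
patches = [(1.0, 0.0), (0.0, 1.0), (0.98, 0.05), (-1.0, 0.0)]
sim = similarity_matrix(patches)
print(best_matches(sim))
```

The full matrix records which patch matches which, so a decoder can use where the similar regions lie; a vector of scores alone discards that spatial pairing.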

[CV-107] COREA: Coarse-to-Fine 3D Representation Alignment Between Relightable 3D Gaussians and SDF via Bidirectional 3D-to-3D Supervision

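[Quick read]: This paper addresses two problems of current 3D Gaussian Splatting (3DGS) methods in geometry reconstruction and physically-based rendering (PBR). First, geometry is learned indirectly from 2D renderings, which yields coarse surfaces that lack detail. Second, the decomposition of the BRDF (bidirectional reflectance distribution function) and lighting is unstable, which makes realistic relighting difficult. The key to the solution is COREA, the first framework to jointly model relightable 3D Gaussians and a Signed Distance Field (SDF). It uses a coarse-to-fine bidirectional 3D-to-3D alignment strategy in which depth provides coarse alignment while depth gradients and normals refine fine structure, so geometric signals are learned directly in 3D space and support stable BRDF-lighting separation. A density-control mechanism balances geometric fidelity against memory efficiency, improving novel-view synthesis, mesh reconstruction, and PBR.

Link: https://arxiv.org/abs/2512.07107
Authors: Jaeyoon Lee, Hojoon Jung, Sungtae Hwang, Jihyong Oh, Jongwon Choi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL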

Abstract:We present COREA, the first unified framework that jointly learns relightable 3D Gaussians and a Signed Distance Field (SDF) for accurate geometry reconstruction and faithful relighting. While recent 3D Gaussian Splatting (3DGS) methods have extended toward mesh reconstruction and physically-based rendering (PBR), their geometry is still learned from 2D renderings, leading to coarse surfaces and unreliable BRDF-lighting decomposition. To address these limitations, COREA introduces a coarse-to-fine bidirectional 3D-to-3D alignment strategy that allows geometric signals to be learned directly in 3D space. Within this strategy, depth provides coarse alignment between the two representations, while depth gradients and normals refine fine-scale structure, and the resulting geometry supports stable BRDF-lighting decomposition. A density-control mechanism further stabilizes Gaussian growth, balancing geometric fidelity with memory efficiency. Experiments on standard benchmarks demonstrate that COREA achieves superior performance in novel-view synthesis, mesh reconstruction, and PBR within a unified framework.

[CV-108] DFIR-DETR: Frequency Domain Enhancement and Dynamic Feature Aggregation for Cross-Scene Small Object Detection

Link: https://arxiv.org/abs/2512.07078
Authors: Bo Gao, Jingcheng Tong, Xingsheng Chen, Han Yu, Zichen Li
Affiliations: Beijing Institute of Graphic Communication; The University of Hong Kong; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 16 pages

[CV-109] Context-measure: Contextualizing Metric for Camouflage

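[Quick read]: This paper addresses the fact that existing evaluation metrics ignore spatial context dependence when assessing camouflaged scenarios. Mainstream metrics were designed for general or salient objects under the assumption that spatial context is uncorrelated, so they do not accurately reflect how camouflaged objects are perceived. The key to the solution is a new contextualized evaluation paradigm, Context-measure, built on a probabilistic pixel-aware correlation framework that incorporates spatial dependencies and pixel-wise camouflage quantification so that scores align better with human visual perception. Experiments on three challenging camouflaged object segmentation datasets show that it is more reliable than conventional context-independent metrics, and it can serve as an evaluation benchmark for computer vision applications involving camouflaged patterns in agricultural, industrial, and medical settings.

Link: https://arxiv.org/abs/2512.07076
Authors: Chen-Yang Wang, Gepeng Ji, Song Shao, Ming-Ming Cheng, Deng-Ping Fan
Affiliations: Nankai University; Chongqing Changan Wangjiang Industrial Group Co., Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report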

Abstract:Camouflage is primarily context-dependent yet current metrics for camouflaged scenarios overlook this critical factor. Instead, these metrics are originally designed for evaluating general or salient objects, with an inherent assumption of uncorrelated spatial context. In this paper, we propose a new contextualized evaluation paradigm, Context-measure, built upon a probabilistic pixel-aware correlation framework. By incorporating spatial dependencies and pixel-wise camouflage quantification, our measure better aligns with human perception. Extensive experiments across three challenging camouflaged object segmentation datasets show that Context-measure delivers more reliability than existing context-independent metrics. Our measure can provide a foundational evaluation benchmark for various computer vision applications involving camouflaged patterns, such as agricultural, industrial, and medical scenarios. Code is available at this https URL.

[CV-110] Persistent Homology-Guided Frequency Filtering for Image Compression

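[Quick read]: This paper addresses the reliability of feature extraction in noisy image datasets, in particular how to preserve meaningful topological structure during image compression. The key to the solution is to combine the discrete Fourier transform (DFT) with persistent homology analysis: by identifying the frequency components that correspond to specific topological features of an image, the method compresses and reconstructs the image while keeping the meaningful information. Across six metrics it achieves compression comparable to JPEG, and it shows potential to improve convolutional neural network (CNN) performance on binary classification tasks, which strengthens the reliability of image compression under noisy conditions.

Link: https://arxiv.org/abs/2512.07065
Authors: Anil Chintapalli, Peter Tenholder, Henry Chen, Arjun Rao
Affiliations: North Carolina School of Science and Mathematics; University of North Carolina at Chapel Hill; University of Texas at Austin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 8 figures, code available at this http URL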

Abstract:Feature extraction in noisy image datasets presents many challenges in model reliability. In this paper, we use the discrete Fourier transform in conjunction with persistent homology analysis to extract specific frequencies that correspond with certain topological features of an image. This method allows the image to be compressed and reformed while ensuring that meaningful data can be differentiated. Our experimental results show a level of compression comparable to that of JPEG across six different metrics. The end goal of persistent homology-guided frequency filtration is to improve performance in binary classification tasks (when augmenting a Convolutional Neural Network) compared to traditional feature extraction and compression methods. These findings highlight a useful end result: enhancing the reliability of image compression under noisy conditions.
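
The frequency-filtering half of this idea can be sketched on a 1-D signal. In the sketch below the set of retained frequency bins, `keep`, is chosen by hand; in the paper that choice is guided by persistent homology, which is not implemented here. The function names and the toy signal are assumptions for illustration.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def keep_frequencies(x, keep):
    """Zero every frequency bin not listed in `keep`, then reconstruct."""
    spectrum = dft(x)
    filtered = [c if k in keep else 0j for k, c in enumerate(spectrum)]
    return idft(filtered)

# Toy signal: one dominant cosine (bin 1) plus a weaker one (bin 3).
n = 8
clean = [math.cos(2 * math.pi * t / n) for t in range(n)]
signal = [c + 0.2 * math.cos(2 * math.pi * 3 * t / n) for t, c in enumerate(clean)]

# For a real signal, bins k and n - k form a conjugate pair and are kept together.
reconstructed = keep_frequencies(signal, {1, n - 1})
print(max(abs(a - b) for a, b in zip(reconstructed, clean)))
```

Only two of the eight coefficients are stored, which is the compression step; the quality of the result depends entirely on how the retained bins are selected.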

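[CV-111] $\mathrm{D}^{3}$-Predictor: Noise-Free Deterministic Diffusion for Dense Prediction

[Quick read]: This paper addresses the loss of geometric structure that diffusion models suffer in dense prediction because of stochastic noise in the sampling process. Diffusion models rely on per-timestep noise injection and denoising, and this randomness conflicts with the deterministic image-to-geometry mapping that dense prediction requires, corrupting fine-grained spatial cues and geometric consistency. The key to the solution is a noise-free deterministic framework, $\mathrm{D}^{3}$-Predictor, which reformulates a pretrained diffusion model to remove stochastic noise, treats the diffusion network as an ensemble of timestep-dependent visual experts, and aggregates their heterogeneous priors in a self-supervised manner into a single, clean, and complete geometric prior. Task-specific supervision then adapts this prior to dense prediction tasks. The method achieves competitive or state-of-the-art results on multiple tasks, needs less than half the training data previously used, and runs inference in a single step.

Link: https://arxiv.org/abs/2512.07062
Authors: Changliang Xia, Chengyou Jia, Minnan Luo, Zhuohang Dang, Xin Shen, Bowen Ping
Affiliations: Xi’an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: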

Abstract:Although diffusion models with strong visual priors have emerged as powerful dense prediction backbones, they overlook a core limitation: the stochastic noise at the core of diffusion sampling is inherently misaligned with dense prediction that requires a deterministic mapping from image to geometry. In this paper, we show that this stochastic noise corrupts fine-grained spatial cues and pushes the model toward timestep-specific noise objectives, consequently destroying meaningful geometric structure mappings. To address this, we introduce $\mathrm{D}^{3}$-Predictor, a noise-free deterministic framework built by reformulating a pretrained diffusion model without stochastic noise. Instead of relying on noisy inputs to leverage diffusion priors, $\mathrm{D}^{3}$-Predictor views the pretrained diffusion network as an ensemble of timestep-dependent visual experts and self-supervisedly aggregates their heterogeneous priors into a single, clean, and complete geometric prior. Meanwhile, we utilize task-specific supervision to seamlessly adapt this noise-free prior to dense prediction tasks. Extensive experiments on various dense prediction tasks demonstrate that $\mathrm{D}^{3}$-Predictor achieves competitive or state-of-the-art performance in diverse scenarios. In addition, it requires less than half the training data previously used and efficiently performs inference in a single step. Our code, data, and checkpoints are publicly available at this https URL.

[CV-112] RAVE: Rate-Adaptive Visual Encoding for 3D Gaussian Splatting

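[Quick read]: This paper addresses the large memory footprint and high training cost that 3D Gaussian Splatting (3DGS) faces in practical deployment, and in particular the limitation that existing compression methods support only fixed rates and cannot adapt to varying bandwidth and device constraints. The key to the solution is a flexible compression scheme that can interpolate to any rate between predefined bounds, giving dynamic rate control without retraining while preserving rendering quality, which improves the practicality and scalability of 3DGS in immersive multimedia applications.

Link: https://arxiv.org/abs/2512.07052
Authors: Hoang-Nhat Tran, Francesco Di Sario, Gabriele Spadaro, Giuseppe Valenzise, Enzo Tartaglione
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: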

Abstract:Recent advances in neural scene representations have transformed immersive multimedia, with 3D Gaussian Splatting (3DGS) enabling real-time photorealistic rendering. Despite its efficiency, 3DGS suffers from large memory requirements and costly training procedures, motivating efforts toward compression. Existing approaches, however, operate at fixed rates, limiting adaptability to varying bandwidth and device constraints. In this work, we propose a flexible compression scheme for 3DGS that supports interpolation at any rate between predefined bounds. Our method is computationally lightweight, requires no retraining for any rate, and preserves rendering quality across a broad range of operating points. Experiments demonstrate that the approach achieves efficient, high-quality compression while offering dynamic rate control, making it suitable for practical deployment in immersive applications. The code will be provided open-source upon acceptance of the work.

[CV-113] DAUNet: A Lightweight UNet Variant with Deformable Convolutions and Parameter-Free Attention for Medical Image Segmentation

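[Quick read]: This paper addresses three problems in medical image segmentation: sensitivity to geometric variation, weak fusion of contextual information, and difficulty of deployment in resource-constrained environments. The core of the solution is DAUNet, a lightweight UNet variant with two key design choices: (1) Deformable Convolution V2 in the bottleneck, whose dynamic deformable kernels improve spatial adaptability to organ deformation; and (2) the parameter-free attention module SimAM in the decoder and skip-connection pathways, which provides saliency-aware feature refinement and improves robustness to low-contrast regions and missing context while keeping the model parameter-efficient.

Link: https://arxiv.org/abs/2512.07051
Authors: Adnan Munir, Shujaat Khan
Affiliations: King Fahd University of Petroleum & Minerals; SDAIA-KFUPM Joint Research Center for Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages, 7 figures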

Abstract:Medical image segmentation plays a pivotal role in automated diagnostic and treatment planning systems. In this work, we present DAUNet, a novel lightweight UNet variant that integrates Deformable V2 Convolutions and Parameter-Free Attention (SimAM) to improve spatial adaptability and context-aware feature fusion without increasing model complexity. DAUNet’s bottleneck employs dynamic deformable kernels to handle geometric variations, while the decoder and skip pathways are enhanced using SimAM attention modules for saliency-aware refinement. Extensive evaluations on two challenging datasets, FH-PS-AoP (fetal head and pubic symphysis ultrasound) and FUMPE (CT-based pulmonary embolism detection), demonstrate that DAUNet outperforms state-of-the-art models in Dice score, HD95, and ASD, while maintaining superior parameter efficiency. Ablation studies highlight the individual contributions of deformable convolutions and SimAM attention. DAUNet’s robustness to missing context and low-contrast regions establishes its suitability for deployment in real-time and resource-constrained clinical environments.
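
The "parameter-free" property of SimAM can be seen in a short sketch: each activation is reweighted by a sigmoid of an energy term computed from the channel's own statistics, with no learned weights. The version below handles one channel as a list of rows in plain Python. It follows the commonly cited SimAM formulation as best I recall it, so the constants (the factor 4, the offset 0.5, and the default `e_lambda`) should be treated as assumptions and checked against the original SimAM paper. It is not DAUNet's code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def simam(channel, e_lambda=1e-4):
    """SimAM-style reweighting of one feature-map channel (a list of rows).
    No trainable parameters: the weights come from the channel statistics."""
    values = [v for row in channel for v in row]
    n = len(values) - 1                       # number of other positions
    if n < 1:
        raise ValueError("channel needs at least two positions")
    mean = sum(values) / len(values)
    sq_dev = [[(v - mean) ** 2 for v in row] for row in channel]
    variance = sum(d for row in sq_dev for d in row) / n
    return [[v * sigmoid(d / (4.0 * (variance + e_lambda)) + 0.5)
             for v, d in zip(row, dev_row)]
            for row, dev_row in zip(channel, sq_dev)]

# Hypothetical 3x3 channel with one salient activation in the centre.
channel = [[0.0, 0.0, 0.0],
           [0.0, 8.0, 0.0],
           [0.0, 0.0, 0.0]]
for row in simam(channel):
    print(row)
```

Positions that deviate strongly from the channel mean receive a weight close to 1, while positions near the mean receive a weight near sigmoid(0.5), which is about 0.62, so distinctive activations are relatively emphasised.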

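[CV-114] Transformation of Biological Networks into Images via Semantic Cartography for Visual Interpretation and Scalable Deep Analysis

[Quick read]: This paper addresses the challenges that current biological network analysis methods face on large, complex networks: poor scalability, difficulty capturing long-range dependencies, difficulty integrating multimodal data, limited expressivity, and poor interpretability. The core of the solution is the Graph2Image framework, which spatially arranges representative network nodes on a two-dimensional grid and thereby converts the network into a set of 2D images, giving each node an image representation. This conversion allows convolutional neural networks (CNNs) with global receptive fields and multi-scale pyramids to process the data efficiently, improving computational efficiency and long-range context modeling on large networks, easing integration with other modalities such as imaging and omics data, and making results easier to visualize. The outcome is accurate, interpretable analysis that scales to very large networks (on the order of 1 billion nodes).

Link: https://arxiv.org/abs/2512.07040
Authors: Sakib Mostafa, Lei Xing, Md. Tauhidul Islam
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: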

Abstract:Complex biological networks are fundamental to biomedical science, capturing interactions among molecules, cells, genes, and tissues. Deciphering these networks is critical for understanding health and disease, yet their scale and complexity represent a daunting challenge for current computational methods. Traditional biological network analysis methods, including deep learning approaches, while powerful, face inherent challenges such as limited scalability, oversmoothing long-range dependencies, difficulty in multimodal integration, expressivity bounds, and poor interpretability. We present Graph2Image, a framework that transforms large biological networks into sets of two-dimensional images by spatially arranging representative network nodes on a 2D grid. This transformation decouples the nodes as images, enabling the use of convolutional neural networks (CNNs) with global receptive fields and multi-scale pyramids, thus overcoming limitations of existing biological network analysis methods in scalability, memory efficiency, and long-range context capture. Graph2Image also facilitates seamless integration with other imaging and omics modalities and enhances interpretability through direct visualization of node-associated images. When applied to several large-scale biological network datasets, Graph2Image improved classification accuracy by up to 67.2% over existing methods and provided interpretable visualizations that revealed biologically coherent patterns. It also allows analysis of very large biological networks (nodes 1 billion) on a personal computer. Graph2Image thus provides a scalable, interpretable, and multimodal-ready approach for biological network analysis, offering new opportunities for disease diagnosis and the study of complex biological systems.

[CV-115] Evaluating and Preserving High-level Fidelity in Super-Resolution

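[Quick read]: This paper addresses high-level semantic distortion in current image Super-Resolution (SR) models that pursue high visual quality: an overly strong generative capability can alter image content, so the output looks good subjectively but is semantically inconsistent with the input. The key to the solution is to construct the first annotated dataset of high-level fidelity scores for different SR models, use it to analyze how existing low-level image quality metrics correlate with fidelity, and show that foundation models handle this high-level task better. Fine-tuning SR models with fidelity feedback then improves semantic fidelity and perceptual quality together, providing a new criterion for evaluating and optimizing SR models.

Link: https://arxiv.org/abs/2512.07037
Authors: Josep M. Rocafort, Shaolin Su, Javier Vazquez-Corral, Alexandra Gomez-Villa
Affiliations: Computer Vision Center; Universitat Autonoma de Barcelona
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: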

Abstract:Recent image Super-Resolution (SR) models are achieving impressive effects in reconstructing details and delivering visually pleasant outputs. However, the overpowering generative ability can sometimes hallucinate and thus change the image content despite gaining high visual quality. This type of high-level change can be easily identified by humans yet not well-studied in existing low-level image quality metrics. In this paper, we establish the importance of measuring high-level fidelity for SR models as a complementary criterion to reveal the reliability of generative SR models. We construct the first annotated dataset with fidelity scores from different SR models, and evaluate how state-of-the-art (SOTA) SR models actually perform in preserving high-level fidelity. Based on the dataset, we then analyze how existing image quality metrics correlate with fidelity measurement, and further show that this high-level task can be better addressed by foundation models. Finally, by fine-tuning SR models based on our fidelity feedback, we show that both semantic fidelity and perceptual quality can be improved, demonstrating the potential value of our proposed criteria, both in model evaluation and optimization. We will release the dataset, code, and models upon acceptance.

[CV-116] Power of Boundary and Reflection: Semantic Transparent Object Segmentation using Pyramid Vision Transformer with Transparent Cues WACV2026

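[Quick read]: This paper addresses the difficulty of distinguishing transparent objects, such as glass and mirrors, from opaque materials in image segmentation. The core challenge is that transparent objects lack distinctive color or texture and are affected by reflections and blurred boundaries. The key to the solution is TransCues, a pyramidal transformer encoder-decoder architecture whose Boundary Feature Enhancement and Reflection Feature Enhancement modules jointly exploit the boundary and reflection cues that human visual perception relies on, which improves segmentation accuracy for transparent objects. Experiments show large gains over the state of the art on multiple benchmark datasets, supporting the effectiveness of the dual feature-enhancement design.

Link: https://arxiv.org/abs/2512.07034
Authors: Tuan-Anh Vu, Hai Nguyen-Truong, Ziqiang Zheng, Binh-Son Hua, Qing Guo, Ivor Tsang, Sai-Kit Yeung
Affiliations: University of Macau; National University of Singapore; University of Technology Sydney; University of New South Wales
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to WACV 2026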

Abstract:Glass is a prevalent material among solid objects in everyday life, yet segmentation methods struggle to distinguish it from opaque materials due to its transparency and reflection. While it is known that human perception relies on boundary and reflective-object features to distinguish glass objects, the existing literature has not yet sufficiently captured both properties when handling transparent objects. Hence, we propose incorporating both of these powerful visual cues via the Boundary Feature Enhancement and Reflection Feature Enhancement modules in a mutually beneficial way. Our proposed framework, TransCues, is a pyramidal transformer encoder-decoder architecture to segment transparent objects. We empirically show that these two modules can be used together effectively, improving overall performance across various benchmark datasets, including glass object semantic segmentation, mirror object semantic segmentation, and generic segmentation datasets. Our method outperforms the state-of-the-art by a large margin, achieving +4.2% mIoU on Trans10K-v2, +5.6% mIoU on MSD, +10.1% mIoU on RGBD-Mirror, +13.1% mIoU on TROSD, and +8.3% mIoU on Stanford2D3D, showing the effectiveness of our method against glass objects.

[CV-117] Utilizing Multi-Agent Reinforcement Learning with Encoder-Decoder Architecture Agents to Identify Optimal Resection Location in Glioblastoma Multiforme Patients

Link: https://arxiv.org/abs/2512.06990
Authors: Krishna Arun, Moinak Bhattachrya, Paras Goel
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

[CV-118] Selective Masking based Self-Supervised Learning for Image Semantic Segmentation

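[Quick read]: This paper addresses the limited performance of self-supervised learning for semantic segmentation, in particular the bottleneck that conventional random masking in image-reconstruction pretraining makes little use of what the model has already learned and therefore yields limited downstream gains. The key to the solution is Selective Masking Image Reconstruction: pretraining is split into iterative steps, and at each step the image patches with the highest reconstruction loss under the current model are identified and masked, so training focuses progressively on the regions that are hardest to reconstruct. Experiments on general datasets (Pascal VOC, Cityscapes) and weed segmentation datasets (Nassar 2020, Sugarbeets 2016) show that the method outperforms random masking and supervised ImageNet pretraining, with notable gains on the lowest-performing classes. The authors also find that pretraining on the downstream dataset itself gives the best result for low-budget self-supervised pretraining, which makes the method a practical option for end-to-end semantic segmentation in resource-constrained settings.

Link: https://arxiv.org/abs/2512.06981
Authors: Yuemin Wang, Ian Stavness
Affiliations: University of Saskatchewan
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: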

Abstract:This paper proposes a novel self-supervised learning method for semantic segmentation using selective masking image reconstruction as the pretraining task. Our proposed method replaces the random masking augmentation used in most masked image modelling pretraining methods. The proposed selective masking method selectively masks image patches with the highest reconstruction loss by breaking the image reconstruction pretraining into iterative steps to leverage the trained model’s knowledge. We show on two general datasets (Pascal VOC and Cityscapes) and two weed segmentation datasets (Nassar 2020 and Sugarbeets 2016) that our proposed selective masking method outperforms the traditional random masking method and supervised ImageNet pretraining on downstream segmentation accuracy by 2.9% for general datasets and 2.5% for weed segmentation datasets. Furthermore, we found that our selective masking method significantly improves accuracy for the lowest-performing classes. Lastly, we show that using the same pretraining and downstream dataset yields the best result for low-budget self-supervised pretraining. Our proposed Selective Masking Image Reconstruction method provides an effective and practical solution to improve end-to-end semantic segmentation workflows, especially for scenarios that require limited model capacity to meet inference speed and computational resource requirements.
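
The selection step described here, masking the patches the current model reconstructs worst, can be sketched in a few lines. The function name `select_patches_to_mask`, the per-patch loss values, and the fixed mask count are hypothetical; the abstract does not specify how many patches are masked or how the iterative schedule is arranged.

```python
def select_patches_to_mask(patch_losses, num_masked):
    """Return the indices of the `num_masked` patches with the highest
    reconstruction loss, in ascending index order."""
    if not 0 <= num_masked <= len(patch_losses):
        raise ValueError("num_masked must be between 0 and the number of patches")
    ranked = sorted(range(len(patch_losses)),
                    key=lambda i: patch_losses[i],
                    reverse=True)
    return sorted(ranked[:num_masked])

# Hypothetical per-patch reconstruction losses from the current model.
losses = [0.1, 0.9, 0.3, 0.7, 0.2, 0.5]
print(select_patches_to_mask(losses, 3))  # -> [1, 3, 5]
```

Random masking would instead draw the three indices uniformly; the selective variant reuses the model's own losses so that the next pretraining step concentrates on the hardest regions.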

[CV-119] VideoVLA: Video Generators Can Be Generalizable Robot Manipulators NIPS2025

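[Quick read]: This paper addresses insufficient generalization in robot manipulation, especially the limits on performing new tasks, handling new objects, and adapting to new scenes in open-world environments. Current Vision-Language-Action (VLA) models use large pre-trained understanding models for perception and instruction following, but their ability to transfer to novel settings remains limited. The key to the solution is VideoVLA, a simple approach that turns large video generation models into robotic VLA manipulators: a multi-modal Diffusion Transformer jointly models the video, language, and action modalities and predicts both the action sequence and its future visual outcomes (a dual-prediction strategy). Experiments show that high-quality visual imagination correlates with reliable action prediction and task success, which underlines the importance of visual imagination in manipulation and improves generalization, including imitating the skills of other embodiments and handling unseen objects.

Link: https://arxiv.org/abs/2512.06963
Authors: Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu, Jiaolong Yang, Nanning Zheng, Baining Guo
Affiliations: Xi’an Jiaotong University; Microsoft Research Asia; Fudan University
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL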

Abstract:Generalization in robot manipulation is essential for deploying robots in open-world environments and advancing toward artificial general intelligence. While recent Vision-Language-Action (VLA) models leverage large pre-trained understanding models for perception and instruction following, their ability to generalize to novel tasks, objects, and settings remains limited. In this work, we present VideoVLA, a simple approach that explores the potential of transforming large video generation models into robotic VLA manipulators. Given a language instruction and an image, VideoVLA predicts an action sequence as well as the future visual outcomes. Built on a multi-modal Diffusion Transformer, VideoVLA jointly models video, language, and action modalities, using pre-trained video generative models for joint visual and action forecasting. Our experiments show that high-quality imagined futures correlate with reliable action predictions and task success, highlighting the importance of visual imagination in manipulation. VideoVLA demonstrates strong generalization, including imitating other embodiments’ skills and handling novel objects. This dual-prediction strategy - forecasting both actions and their visual consequences - explores a paradigm shift in robot learning and unlocks generalization capabilities in manipulation systems.

[CV-120] Task adaptation of Vision-Language-Action model: 1st Place Solution for the 2025 BEHAVIOR Challenge NEURIPS

[Quick read]: This paper addresses end-to-end decision making and control for long-horizon robot tasks in complex household scenes, with challenges that include bimanual manipulation, navigation, and context-aware decision making. The authors propose a vision-action policy built on the Pi0.5 architecture. Its main innovation is correlated noise for flow matching, which improves training efficiency and enables correlation-aware inpainting for smooth action sequences. Learnable mixed-layer attention and System 2 stage tracking resolve ambiguity, multi-sample flow matching reduces variance during training, and inference uses action compression and challenge-specific correction rules. The approach took 1st place in the 2025 BEHAVIOR Challenge with a 26% q-score.

Link: https://arxiv.org/abs/2512.06951
Authors: Ilia Larchenko, Gleb Zarin, Akash Karnatak
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 2025 NeurIPS Behavior Challenge 1st place solution

Abstract:We present a vision-action policy that won 1st place in the 2025 BEHAVIOR Challenge - a large-scale benchmark featuring 50 diverse long-horizon household tasks in photo-realistic simulation, requiring bimanual manipulation, navigation, and context-aware decision making. Building on the Pi0.5 architecture, we introduce several innovations. Our primary contribution is correlated noise for flow matching, which improves training efficiency and enables correlation-aware inpainting for smooth action sequences. We also apply learnable mixed-layer attention and System 2 stage tracking for ambiguity resolution. Training employs multi-sample flow matching to reduce variance, while inference uses action compression and challenge-specific correction rules. Our approach achieves 26% q-score across all 50 tasks on both public and private leaderboards.

[CV-121] Can We Go Beyond Visual Features? Neural Tissue Relation Modeling for Relational Graph Analysis in Non-Melanoma Skin Histology

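[Quick read]: This paper addresses insufficient modeling of spatial context and inter-tissue relationships in histopathology image segmentation, especially where tissue boundaries are dense or tissues are morphologically similar and hard to tell apart. Conventional convolutional neural network (CNN) approaches rely mainly on visual texture, treat tissues as independent regions, and do not encode biological structural context. The key to the solution is Neural Tissue Relation Modeling (NTRM), a framework that builds a tissue-level graph neural network (GNN) over predicted regions, propagates contextual information by message passing, and refines the segmentation through spatial projection. This explicitly models spatial and functional dependencies between tissue types and improves structural coherence and segmentation accuracy in boundary regions.

Link: https://arxiv.org/abs/2512.06949
Authors: Shravan Venkatraman, Muthu Subash Kavitha, Joe Dhanith P R, V Manikandarajan, Jia Wu
Affiliations: Nagasaki University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 5 figures, 2 tables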

Abstract:Histopathology image segmentation is essential for delineating tissue structures in skin cancer diagnostics, but modeling spatial context and inter-tissue relationships remains a challenge, especially in regions with overlapping or morphologically similar tissues. Current convolutional neural network (CNN)-based approaches operate primarily on visual texture, often treating tissues as independent regions and failing to encode biological context. To this end, we introduce Neural Tissue Relation Modeling (NTRM), a novel segmentation framework that augments CNNs with a tissue-level graph neural network to model spatial and functional relationships across tissue types. NTRM constructs a graph over predicted regions, propagates contextual information via message passing, and refines segmentation through spatial projection. Unlike prior methods, NTRM explicitly encodes inter-tissue dependencies, enabling structurally coherent predictions in boundary-dense zones. On the benchmark Histopathology Non-Melanoma Skin Cancer Segmentation Dataset, NTRM outperforms state-of-the-art methods, achieving a robust Dice similarity coefficient that is 4.9% to 31.25% higher than the best-performing models among the evaluated approaches. Our experiments indicate that relational modeling offers a principled path toward more context-aware and interpretable histological segmentation, compared to local receptive-field architectures that lack tissue-level structural awareness. Our code is available at this https URL.

[CV-122] Scaling Zero-Shot Reference-to-Video Generation

Link: https://arxiv.org/abs/2512.06905
Authors: Zijian Zhou, Shikun Liu, Haozhe Liu, Haonan Qiu, Zhaochong An, Weiming Ren, Zhiheng Liu, Xiaoke Huang, Kam Woh Ng, Tian Xie, Xiao Han, Yuren Cong, Hang Li, Chuyan Zhu, Aditya Patel, Tao Xiang, Sen He
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Website: this https URL

[CV-123] Overcoming Small Data Limitations in Video-Based Infant Respiration Estimation

Link: https://arxiv.org/abs/2512.06888
Authors: Liyang Song, Hardik Bishnoi, Sai Kumar Reddy Manne, Sarah Ostadabbas, Briana J. Taylor, Michael Wan
Affiliations: Northeastern University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-124] Balanced Learning for Domain Adaptive Semantic Segmentation ICML2025

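[Quick read]: This paper addresses unbalanced class learning in unsupervised domain adaptation (UDA) for semantic segmentation, caused by class imbalance and cross-domain distribution shift: existing self-training methods over-predict some classes and under-predict others on the target domain. The key to the solution is Balanced Learning for Domain Adaptation (BLDA). It identifies over-predicted and under-predicted classes by analyzing the distribution of predicted logits; applies a post-hoc step that aligns the logits distributions of different classes using shared anchor distributions to reduce bias; estimates logits distributions online and adds correction terms to the loss function to produce less biased pseudo-labels; and uses the cumulative density function as domain-shared structural knowledge to connect the source and target domains. The method needs no prior knowledge of the distribution shift, integrates into a range of existing UDA frameworks, and improves segmentation especially for under-predicted classes.

Link: https://arxiv.org/abs/2512.06886
Authors: Wangkai Li, Rui Sun, Bohao Liao, Zhaoyang Li, Tianzhu Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by International Conference on Machine Learning (ICML 2025)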

Abstract:Unsupervised domain adaptation (UDA) for semantic segmentation aims to transfer knowledge from a labeled source domain to an unlabeled target domain. Despite the effectiveness of self-training techniques in UDA, they struggle to learn each class in a balanced manner due to inherent class imbalance and distribution shift in both data and label space between domains. To address this issue, we propose Balanced Learning for Domain Adaptation (BLDA), a novel approach to directly assess and alleviate class bias without requiring prior knowledge about the distribution shift. First, we identify over-predicted and under-predicted classes by analyzing the distribution of predicted logits. Subsequently, we introduce a post-hoc approach to align the logits distributions across different classes using shared anchor distributions. To further consider the network’s need to generate unbiased pseudo-labels during self-training, we estimate logits distributions online and incorporate logits correction terms into the loss function. Moreover, we leverage the resulting cumulative density as domain-shared structural knowledge to connect the source and target domains. Extensive experiments on two standard UDA semantic segmentation benchmarks demonstrate that BLDA consistently improves performance, especially for under-predicted classes, when integrated into various existing methods. Code is available at this https URL.

[CV-125] JoPano: Unified Panorama Generation via Joint Modeling

Link: https://arxiv.org/abs/2512.06885
Authors: Wancheng Feng, Chen An, Zhenliang He, Meina Kan, Shiguang Shan, Lukun Wang
Affiliations: State Key Laboratory of AI Safety, Institute of Computing Technology, CAS, China; University of Chinese Academy of Sciences (CAS), China; Shandong University of Science and Technology, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Code: this https URL

[CV-126] Hierarchical Image-Guided 3D Point Cloud Segmentation in Industrial Scenes via Multi-View Bayesian Fusion BMVC2025

链接: https://arxiv.org/abs/2512.06882
作者: Yu Zhu,Naoya Chiba,Koichi Hashimoto
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to BMVC 2025 (Sheffield, UK, Nov 24-27, 2025). Supplementary video and poster available upon request

点击查看摘要

[CV-127] SceneMixer: Exploring Convolutional Mixing Networks for Remote Sensing Scene Classification

链接: https://arxiv.org/abs/2512.06877
作者: Mohammed Q. Alkhatib,Ali Jamali,Swalpa Kumar Roy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted and presented in ICSPIS

点击查看摘要

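[CV-128] Towards Robust Pseudo-Label Learning in Semantic Segmentation: An Encoding Perspective NEURIPS2025

Quick read: This paper addresses the degradation in model performance that occurs in semantic segmentation when erroneous pseudo-labels are amplified during pseudo-label learning, particularly in label-scarce settings such as unsupervised domain adaptation (UDA) and semi-supervised learning (SSL). The key to its solution is ECOCSeg, a new encoding scheme based on error-correcting output codes (ECOC): each class is encoded as a fine-grained binary bit vector, which lets the model disentangle class attributes and tolerate errors in some bits, improving pseudo-label quality and training stability. A bit-level label denoising mechanism further improves pseudo-label reliability at label-generation time, providing more adequate and robust supervision for unlabeled images.

Link: https://arxiv.org/abs/2512.06870
Authors: Wangkai Li, Rui Sun, Zhaoyang Li, Tianzhu Zhang
Affiliations: University of Science and Technology of China; National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by Conference on Neural Information Processing Systems (NeurIPS 2025)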

Abstract:Pseudo-label learning is widely used in semantic segmentation, particularly in label-scarce scenarios such as unsupervised domain adaptation (UDA) and semi-supervised learning (SSL). Despite its success, this paradigm can generate erroneous pseudo-labels, which are further amplified during training due to the use of one-hot encoding. To address this issue, we propose ECOCSeg, a novel perspective for segmentation models that utilizes error-correcting output codes (ECOC) to create a fine-grained encoding for each class. ECOCSeg offers several advantages. First, an ECOC-based classifier is introduced, enabling the model to disentangle classes into attributes and handle partially inaccurate bits, improving stability and generalization in pseudo-label learning. Second, a bit-level label denoising mechanism is developed to generate higher-quality pseudo-labels, providing adequate and robust supervision for unlabeled images. ECOCSeg can be easily integrated with existing methods and consistently demonstrates significant improvements on multiple UDA and SSL benchmarks across different segmentation architectures. Code is available at this https URL.
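
To make the idea of error-correcting output codes concrete, here is a minimal sketch of ECOC decoding in general. It is not the ECOCSeg implementation: the class names, the 6-bit codebook, and the nearest-codeword decoder are all assumptions chosen for illustration. Because every pair of codewords differs in 4 bits, a prediction with one wrong bit still decodes to the intended class.

```python
# Illustrative ECOC decoding sketch (not the ECOCSeg code).
# Hypothetical 4-class codebook; every pair of codewords differs in 4 bits,
# so any single flipped bit can be corrected.
CODEBOOK = {
    "road":     (0, 0, 0, 0, 0, 0),
    "sidewalk": (1, 1, 1, 1, 0, 0),
    "car":      (1, 1, 0, 0, 1, 1),
    "person":   (0, 0, 1, 1, 1, 1),
}


def hamming(a, b):
    """Number of positions at which two bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(bits):
    """Return the class whose codeword is closest to the predicted bits."""
    return min(CODEBOOK, key=lambda name: hamming(CODEBOOK[name], bits))


# The "car" codeword with its third bit flipped by a noisy prediction.
noisy = (1, 1, 1, 0, 1, 1)
print(decode(noisy))  # prints: car
```

With one-hot encoding, a single wrong "bit" is already a different class, which is one way to read the abstract's remark about errors being amplified.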

[CV-129] Dynamic Visual SLAM using a General 3D Prior

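Quick read: This paper addresses reliable, incremental camera pose estimation and 3D reconstruction in dynamic natural environments, where moving objects degrade the pose accuracy of conventional visual SLAM systems. The key to the solution is combining geometric patch-based online bundle adjustment with recent feed-forward reconstruction models: the feed-forward model is used to precisely filter out dynamic regions, and its depth predictions make the patch-based SLAM more robust. Aligning the depth predictions with the patches estimated by bundle adjustment mitigates the scale ambiguity that arises from applying the feed-forward model batch by batch.

Link: https://arxiv.org/abs/2512.06868
Authors: Xingguang Zhong, Liren Jin, Marija Popović, Jens Behley, Cyrill Stachniss
Affiliations: Center for Robotics, University of Bonn, Germany; MAVLab, TU Delft, the Netherlands; Lamarr Institute for Machine Learning and Artificial Intelligence, Germany
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages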

Abstract:Reliable incremental estimation of camera poses and 3D reconstruction is key to enable various applications including robotics, interactive visualization, and augmented reality. However, this task is particularly challenging in dynamic natural environments, where scene dynamics can severely deteriorate camera pose estimation accuracy. In this work, we propose a novel monocular visual SLAM system that can robustly estimate camera poses in dynamic scenes. To this end, we leverage the complementary strengths of geometric patch-based online bundle adjustment and recent feed-forward reconstruction models. Specifically, we propose a feed-forward reconstruction model to precisely filter out dynamic regions, while also utilizing its depth prediction to enhance the robustness of the patch-based visual SLAM. By aligning depth prediction with estimated patches from bundle adjustment, we robustly handle the inherent scale ambiguities of the batch-wise application of the feed-forward reconstruction model.
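
The abstract does not say how predicted depth is aligned with the bundle-adjusted patches. As one possible reading, the sketch below fits a single global scale by least squares; the function name `align_scale` and the sample depth values are hypothetical, and the actual system may use a different or more robust alignment.

```python
# Sketch under stated assumptions: one global scale factor, fitted by
# least squares, maps predicted depths onto reference depths.
def align_scale(pred_depths, ref_depths):
    """Return s minimising sum((s * pred - ref) ** 2) over matched points."""
    if len(pred_depths) != len(ref_depths):
        raise ValueError("depth lists must have the same length")
    den = sum(p * p for p in pred_depths)
    if den == 0:
        raise ValueError("predicted depths are all zero")
    num = sum(p * r for p, r in zip(pred_depths, ref_depths))
    return num / den


# Hypothetical values: model depths in arbitrary units, patch depths from
# bundle adjustment in the SLAM map's units.
pred = [1.0, 2.0, 4.0]
ref = [2.5, 5.0, 10.0]
scale = align_scale(pred, ref)
aligned = [scale * p for p in pred]
```

A closed-form fit like this is cheap enough to recompute for every batch, which is where the scale ambiguity mentioned in the abstract arises.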

[CV-130] Spatial Retrieval Augmented Autonomous Driving

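Quick read: This paper addresses the degraded environmental perception of existing autonomous driving systems under a limited sensor view, occlusion, or extreme conditions such as darkness and rain, which restrict a model's ability to understand and remember road structure. To give models a "recall" ability, the authors propose a spatial retrieval paradigm whose key idea is to introduce offline-retrieved geographic images as an additional input modality. These images are easy to obtain from offline caches such as Google Maps, require no additional hardware sensors, and can be aligned with the vehicle trajectory, giving a plug-and-play extension. Baselines are established on five core tasks (object detection, online mapping, occupancy prediction, end-to-end planning, and generative world modeling), and experiments show that the added modality can improve performance on some of them.

Link: https://arxiv.org/abs/2512.06865
Authors: Xiaosong Jia, Chenhe Zhang, Yule Jiang, Songbur Wong, Zhiyuan Zhang, Chen Chen, Shaofeng Zhang, Xuanhe Zhou, Xue Yang, Junchi Yan, Yu-Gang Jiang
Affiliations: Fudan University; Shanghai Jiao Tong University; Chinese Academy of Sciences; University of Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Demo Page: this https URL with open sourced code, dataset, and checkpoints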

Abstract:Existing autonomous driving systems rely on onboard sensors (cameras, LiDAR, IMU, etc.) for environmental perception. However, this paradigm is limited by the drive-time perception horizon and often fails under limited view scope, occlusion or extreme conditions such as darkness and rain. In contrast, human drivers are able to recall road structure even under poor visibility. To endow models with this "recall" ability, we propose the spatial retrieval paradigm, introducing offline retrieved geographic images as an additional input. These images are easy to obtain from offline caches (e.g., Google Maps or stored autonomous driving datasets) without requiring additional sensors, making it a plug-and-play extension for existing AD tasks. For experiments, we first extend the nuScenes dataset with geographic images retrieved via Google Maps APIs and align the new data with ego-vehicle trajectories. We establish baselines across five core autonomous driving tasks: object detection, online mapping, occupancy prediction, end-to-end planning, and generative world modeling. Extensive experiments show that the extended modality could enhance the performance of certain tasks. We will open-source dataset curation code, data, and benchmarks for further study of this new autonomous driving paradigm.

[CV-131] Boosting Unsupervised Video Instance Segmentation with Automatic Quality-Guided Self-Training WACV2026

Link: https://arxiv.org/abs/2512.06864
Authors: Kaixuan Lu, Mehmet Onurcan Kaya, Dim P. Papadopoulos
Affiliations: Technical University of Denmark; Pioneer Centre for AI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to WACV 2026. arXiv admin note: substantial text overlap with arXiv:2508.19808

[CV-132] Omni-Referring Image Segmentation

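Quick read: This paper addresses the limited modality coverage and scene generalization of existing image segmentation tasks. Methods conditioned on a single modality (text only or visual only) cannot combine the fine-grained attribute referring of text with the ability of visual prompts to ground uncommon objects. The authors propose a new task, Omni-Referring Image Segmentation (OmniRIS), which accepts "omni-prompts": text instructions together with reference images annotated with masks, boxes, or scribbles. The key contributions are: 1) OmniRef, a large dataset with 186,939 omni-prompts for 30,956 images; and 2) OmniSegNet, a general baseline that addresses key challenges of the task such as omni-prompt encoding. Experiments validate its ability to follow omni-modal instructions.

Link: https://arxiv.org/abs/2512.06862
Authors: Qiancheng Zheng, Yunhang Shen, Gen Luo, Baiyang Song, Xing Sun, Xiaoshuai Sun, Yiyi Zhou, Rongrong Ji
Affiliations: Xiamen University; Tencent; Shanghai AI Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: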

Abstract:In this paper, we propose a novel task termed Omni-Referring Image Segmentation (OmniRIS) towards highly generalized image segmentation. Compared with existing unimodally conditioned segmentation tasks, such as RIS and visual RIS, OmniRIS supports the input of text instructions and reference images with masks, boxes or scribbles as omni-prompts. This property allows it to exploit the intrinsic merits of both text and visual modalities, i.e., granular attribute referring and uncommon object grounding, respectively. Besides, OmniRIS can also handle various segmentation settings, such as one vs. many and many vs. many, further facilitating its practical use. To promote the research of OmniRIS, we also rigorously design and construct a large dataset termed OmniRef, which consists of 186,939 omni-prompts for 30,956 images, and establish a comprehensive evaluation system. Moreover, a strong and general baseline termed OmniSegNet is also proposed to tackle the key challenges of OmniRIS, such as omni-prompt encoding. The extensive experiments not only validate the capability of OmniSegNet in following omni-modal instructions, but also show the superiority of OmniRIS for highly generalized image segmentation.

[CV-133] Hide-and-Seek Attribution: Weakly Supervised Segmentation of Vertebral Metastases in CT

Link: https://arxiv.org/abs/2512.06849
Authors: Matan Atad, Alexander W. Marka, Lisa Steinhelfer, Anna Curto-Vilalta, Yannik Leonhardt, Sarah C. Foreman, Anna-Sophia Walburga Dietrich, Robert Graf, Alexandra S. Gersing, Bjoern Menze, Daniel Rueckert, Jan S. Kirschke, Hendrik Möller
Affiliations: Technical University of Munich (TUM); University Hospital Munich (LMU); University of Zurich; Imperial College London; University Hospital Frankfurt
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: In submission

[CV-134] Pseudo Anomalies Are All You Need: Diffusion-Based Generation for Weakly-Supervised Video Anomaly Detection

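Quick read: This paper addresses a bottleneck in deploying video anomaly detection (VAD): real abnormal videos are scarce and costly to collect. The proposed PA-VAD is a generation-driven approach that synthesizes pseudo-abnormal videos from a small set of real normal images and trains on them together with real normal videos, so no real abnormal video is needed. For synthesis, CLIP selects class-relevant initial images and a vision-language model refines the text prompts to improve fidelity and scene consistency before a video diffusion model generates the anomalies. For training, a domain-aligned regularized module, which combines domain alignment with memory usage-aware updates, suppresses the excessive spatiotemporal magnitude of the synthesized anomalies. The method reaches 98.2% on ShanghaiTech and 82.5% on UCF-Crime, surpassing the strongest real-abnormal method on ShanghaiTech by 0.6% and the UVAD state of the art on UCF-Crime by 1.9%.

Link: https://arxiv.org/abs/2512.06845
Authors: Satoshi Hashimoto, Hitoshi Nishimura, Yanan Wang, Mori Kurokawa
Affiliations: KDDI Research, Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: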

Abstract:Deploying video anomaly detection in practice is hampered by the scarcity and collection cost of real abnormal footage. We address this by training without any real abnormal videos while evaluating under the standard weakly supervised split, and we introduce PA-VAD, a generation-driven approach that learns a detector from synthesized pseudo-abnormal videos paired with real normal videos, using only a small set of real normal images to drive synthesis. For synthesis, we select class-relevant initial images with CLIP and refine textual prompts with a vision-language model to improve fidelity and scene consistency before invoking a video diffusion model. For training, we mitigate excessive spatiotemporal magnitude in synthesized anomalies with a domain-aligned regularized module that combines domain alignment and memory usage-aware updates. Extensive experiments show that our approach reaches 98.2% on ShanghaiTech and 82.5% on UCF-Crime, surpassing the strongest real-abnormal method on ShanghaiTech by +0.6% and outperforming the UVAD state-of-the-art on UCF-Crime by +1.9%. The results demonstrate that high-accuracy anomaly detection can be obtained without collecting real anomalies, providing a practical path toward scalable deployment.

[CV-135] CADE: Continual Weakly-supervised Video Anomaly Detection with Ensembles WACV2026

Link: https://arxiv.org/abs/2512.06840
Authors: Satoshi Hashimoto, Tatsuya Konishi, Tomoya Kaichi, Kazunori Matsumoto, Mori Kurokawa
Affiliations: KDDI Research, Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to WACV 2026

[CV-136] SparseCoop: Cooperative Perception with Kinematic-Grounded Queries AAAI2026

Link: https://arxiv.org/abs/2512.06838
Authors: Jiahao Wang, Zhongwei Jiang, Wenchao Sun, Jiaru Zhong, Haibao Yu, Yuner Zhang, Chenyang Lu, Chuang Zhang, Lei He, Shaobing Xu, Jianqiang Wang
Affiliations: Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026

[CV-137] MeshSplatting: Differentiable Rendering with Opaque Meshes

Quick read: This paper addresses the incompatibility between point-based neural rendering methods such as 3D Gaussian Splatting and the mesh-based graphics pipelines behind AR/VR and game engines, which hinders real-time interactive 3D applications. The proposed MeshSplatting is a mesh reconstruction method that jointly optimizes geometry and appearance through differentiable rendering. It enforces surface connectivity via restricted Delaunay triangulation and refines surface consistency, producing end-to-end smooth, visually high-quality meshes that render efficiently in real-time 3D engines.

Link: https://arxiv.org/abs/2512.06818
Authors: Jan Held, Sanghyun Son, Renaud Vandeghen, Daniel Rebain, Matheus Gadelha, Yi Zhou, Anthony Cioppa, Ming C. Lin, Marc Van Droogenbroeck, Andrea Tagliasacchi
Affiliations: University of Liège; Simon Fraser University; University of Maryland; University of British Columbia; University of Toronto; Adobe Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Primitive-based splatting methods like 3D Gaussian Splatting have revolutionized novel view synthesis with real-time rendering. However, their point-based representations remain incompatible with mesh-based pipelines that power AR/VR and game engines. We present MeshSplatting, a mesh-based reconstruction approach that jointly optimizes geometry and appearance through differentiable rendering. By enforcing connectivity via restricted Delaunay triangulation and refining surface consistency, MeshSplatting creates end-to-end smooth, visually high-quality meshes that render efficiently in real-time 3D engines. On Mip-NeRF360, it boosts PSNR by +0.69 dB over the current state-of-the-art MiLo for mesh-based novel view synthesis, while training 2x faster and using 2x less memory, bridging neural rendering and interactive 3D graphics for seamless real-time scene interaction. The project page is available at this https URL.

[CV-138] RMAdapter: Reconstruction-based Multi-Modal Adapter for Vision-Language Models AAAI2026

Link: https://arxiv.org/abs/2512.06811
Authors: Xiang Lin, Weixin Li, Shu Guo, Lihong Wang, Di Huang
Affiliations: Beihang University; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Accepted by AAAI 2026 (Oral)

[CV-139] VDOT: Efficient Unified Video Creation via Optimal Transport Distillation

Link: https://arxiv.org/abs/2512.06802
Authors: Yutong Wang, Haiyu Zhang, Tianfan Xue, Yu Qiao, Yaohui Wang, Chang Xu, Xinyuan Chen
Affiliations: USYD; Shanghai AI Laboratory; BUAA; CUHK
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-140] Generalized Geometry Encoding Volume for Real-time Stereo Matching AAAI2026

Link: https://arxiv.org/abs/2512.06793
Authors: Jiaxin Liu, Gangwei Xu, Xianqi Wang, Chengliang Zhang, Xin Yang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026

[CV-141] Physics Informed Human Posture Estimation Based on 3D Landmarks from Monocular RGB-Videos

Link: https://arxiv.org/abs/2512.06783
Authors: Tobias Leuthold, Michele Xiloyannis, Yves Zimmermann
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 5 figures

[CV-142] RDSplat: Robust Watermarking Against Diffusion Editing for 3D Gaussian Splatting

Link: https://arxiv.org/abs/2512.06774
Authors: Longjie Zhao, Ziming Hong, Zhenyang Ren, Runnan Chen, Mingming Gong, Tongliang Liu
Affiliations: The University of Sydney; The University of Melbourne; Mohamed bin Zayed University of Artificial Intelligence
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

[CV-143] Stitch and Tell: A Structured Multimodal Data Augmentation Method for Spatial Understanding

Link: https://arxiv.org/abs/2512.06769
Authors: Hang Yin, Xiaomin He, PeiWen Yuan, Yiwei Li, Jiayi Shi, Wenxiao Fan, Shaoxiong Feng, Kan Li
Affiliations: Beijing Institute of Technology; Peking University; Xiaohongshu Inc
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

[CV-144] JOCA: Task-Driven Joint Optimisation of Camera Hardware and Adaptive Camera Control Algorithms

Link: https://arxiv.org/abs/2512.06763
Authors: Chengyang Yan, Mitch Bryson, Donald G. Dansereau
Affiliations: Australian Centre for Robotics, School of Aerospace, Mechanical and Mechatronic Engineering, The University of Sydney
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-145] VisChainBench: A Benchmark for Multi-Turn Multi-Image Visual Reasoning Beyond Language Priors

Link: https://arxiv.org/abs/2512.06759
Authors: Wenbo Lyu, Yingjun Du, Jinglin Zhao, Xianton Zhen, Ling Shao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 13 figures

[CV-146] XM-ALIGN: Unified Cross-Modal Embedding Alignment for Face-Voice Association

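Quick read: This paper addresses a performance bottleneck in cross-modal verification caused by insufficient alignment between the face and voice modalities, in both "heard" and "unheard" language settings. The proposed unified cross-modal embedding alignment framework, XM-ALIGN, combines explicit and implicit alignment: embeddings are extracted by a face encoder and a voice encoder and jointly optimized with a shared classifier, with mean squared error (MSE) as the embedding alignment loss to keep the modalities tightly aligned. Data augmentation is used to improve generalization. The authors report superior performance on the MAV-Celeb dataset.

Link: https://arxiv.org/abs/2512.06757
Authors: Zhihua Fang, Shumei Tao, Junxu Wang, Liang He
Affiliations: Unknown
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
Comments: FAME 2026 Technical Report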

Abstract:This paper introduces our solution, XM-ALIGN (Unified Cross-Modal Embedding Alignment Framework), proposed for the FAME challenge at ICASSP 2026. Our framework combines explicit and implicit alignment mechanisms, significantly improving cross-modal verification performance in both “heard” and “unheard” languages. By extracting feature embeddings from both face and voice encoders and jointly optimizing them using a shared classifier, we employ mean squared error (MSE) as the embedding alignment loss to ensure tight alignment between modalities. Additionally, data augmentation strategies are applied during model training to enhance generalization. Experimental results show that our approach demonstrates superior performance on the MAV-Celeb dataset. The code will be released at this https URL.
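
As a small illustration of using MSE as an embedding alignment loss, the sketch below computes the mean squared difference between a face embedding and a voice embedding of the same identity. The function name and the sample vectors are hypothetical, and how XM-ALIGN weights this term against its classifier loss is not stated in the abstract.

```python
# Minimal sketch of an MSE alignment loss between two embeddings.
# Encoders, classifier, and loss weighting are outside this example.
def mse_alignment_loss(face_emb, voice_emb):
    """Mean squared error between two equal-length embedding vectors."""
    if len(face_emb) != len(voice_emb) or not face_emb:
        raise ValueError("embeddings must be non-empty and equally long")
    return sum((f - v) ** 2 for f, v in zip(face_emb, voice_emb)) / len(face_emb)


# Hypothetical 4-dimensional embeddings for one identity.
face = [0.2, -0.1, 0.7, 0.0]
voice = [0.1, -0.1, 0.5, 0.2]
loss = mse_alignment_loss(face, voice)
```

The loss is zero only when the two embeddings coincide, so minimising it pulls the face and voice representations of the same person toward a shared point.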

[CV-147] UARE: A Unified Vision-Language Model for Image Quality Assessment Restoration and Enhancement

Link: https://arxiv.org/abs/2512.06750
Authors: Weiqi Li, Xuanyu Zhang, Bin Chen, Jingfen Xie, Yan Wang, Kexin Zhang, Junlin Li, Li Zhang, Jian Zhang, Shijie Zhao
Affiliations: Peking University; ByteDance Inc
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-148] Task-Model Alignment: A Simple Path to Generalizable AI-Generated Image Detection

Link: https://arxiv.org/abs/2512.06746
Authors: Ruoxin Chen, Jiahui Gao, Kaiqing Lin, Keyue Zhang, Yandan Zhao, Isabel Guan, Taiping Yao, Shouhong Ding
Affiliations: Tencent Youtu Lab; East China University of Science and Technology; Shenzhen University; Hong Kong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

[CV-149] FedSCAl: Leveraging Server and Client Alignment for Unsupervised Federated Source-Free Domain Adaptation WACV

Link: https://arxiv.org/abs/2512.06738
Authors: M Yashwanth, Sampath Koti, Arunabh Singh, Shyam Marjit, Anirban Chakraborty
Affiliations: Indian Institute of Science
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to Winter Conference on Applications of Computer Vision (WACV) 2026, Round 1

[CV-150] Graph Convolutional Long Short-Term Memory Attention Network for Post-Stroke Compensatory Movement Detection Based on Skeleton Data

Link: https://arxiv.org/abs/2512.06736
Authors: Jiaxing Fan, Jiaojiao Liu, Wenkong Wang, Yang Zhang, Xin Ma, Jichen Zhang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-151] Enhancing Interpretability of AR-SSVEP-Based Motor Intention Recognition via CNN-BiLSTM and SHAP Analysis on EEG Data

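Quick read: This paper addresses two problems: patients with motor dysfunction show low subjective engagement in rehabilitation training, and traditional steady-state visually evoked potential (SSVEP) brain-computer interface (BCI) systems depend on external visual stimulus equipment, which limits real-world use. The solution is an augmented reality SSVEP (AR-SSVEP) system combined with an improved CNN-BiLSTM model (MACNN-BiLSTM). A multi-head attention mechanism highlights EEG features related to motor intention, and SHAP is used to make the model's decisions more interpretable, supporting real-time motor intention recognition and more active patient participation in rehabilitation.

Link: https://arxiv.org/abs/2512.06730
Authors: Lin Yang, Xiang Li, Xin Ma, Xinxin Zhao
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: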

Abstract:Patients with motor dysfunction show low subjective engagement in rehabilitation training. Traditional SSVEP-based brain-computer interface (BCI) systems rely heavily on external visual stimulus equipment, limiting their practicality in real-world settings. This study proposes an augmented reality steady-state visually evoked potential (AR-SSVEP) system to address the lack of patient initiative and the high workload on therapists. Firstly, we design four HoloLens 2-based EEG classes and collect EEG data from seven healthy subjects for analysis. Secondly, we build upon the conventional CNN-BiLSTM architecture by integrating a multi-head attention mechanism (MACNN-BiLSTM). We extract ten temporal-spectral EEG features and feed them into a CNN to learn high-level representations. Then, we use BiLSTM to model sequential dependencies and apply a multi-head attention mechanism to highlight motor-intention-related patterns. Finally, the SHAP (SHapley Additive exPlanations) method is applied to visualize EEG feature contributions to the neural network’s decision-making process, enhancing the model’s interpretability. These findings enhance real-time motor intention recognition and support recovery in patients with motor impairments.

[CV-152] Lightweight Wasserstein Audio-Visual Model for Unified Speech Enhancement and Separation

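Quick read: This paper addresses the fact that speech enhancement (SE) and speech separation (SS) are traditionally treated as separate tasks, although real-world audio often contains both background noise and overlapping speakers and calls for a unified, efficient solution. The key contribution is UniVoiceLite, a lightweight, unsupervised audio-visual framework. It uses lip motion and facial identity cues to guide speech extraction and applies Wasserstein distance regularization to stabilize the latent space, unifying SE and SS in a single model without paired noisy-clean data.

Link: https://arxiv.org/abs/2512.06689
Authors: Jisoo Park, Seonghak Lee, Guisik Kim, Taewoo Kim, Junseok Kwon
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments: Accepted to ASRU 2025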

Abstract:Speech Enhancement (SE) and Speech Separation (SS) have traditionally been treated as distinct tasks in speech processing. However, real-world audio often involves both background noise and overlapping speakers, motivating the need for a unified solution. While recent approaches have attempted to integrate SE and SS within multi-stage architectures, these approaches typically involve complex, parameter-heavy models and rely on supervised training, limiting scalability and generalization. In this work, we propose UniVoiceLite, a lightweight and unsupervised audio-visual framework that unifies SE and SS within a single model. UniVoiceLite leverages lip motion and facial identity cues to guide speech extraction and employs Wasserstein distance regularization to stabilize the latent space without requiring paired noisy-clean data. Experimental results demonstrate that UniVoiceLite achieves strong performance in both noisy and multi-speaker scenarios, combining efficiency with robust generalization. The source code is available at this https URL.
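
For readers unfamiliar with the Wasserstein distance used as a regularizer here, the sketch below computes it in the simplest case: two equal-size sets of 1-D samples, where the Wasserstein-1 distance equals the mean absolute difference between the sorted values. This is a toy illustration only. The form of the regularizer in UniVoiceLite and the latent distributions it compares are not given in the abstract, and the sample values are hypothetical.

```python
# Toy illustration: Wasserstein-1 distance between two equal-size
# empirical 1-D distributions (mean absolute gap after sorting).
def wasserstein_1d(xs, ys):
    """W1 distance between two equally sized lists of scalar samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and equally sized")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Hypothetical latent samples and samples from a reference distribution.
latent = [0.9, -0.2, 0.4]
reference = [0.0, 1.0, -1.0]
penalty = wasserstein_1d(latent, reference)
```

Unlike a pointwise loss, the distance depends only on the two sample distributions, not on which sample is paired with which, which is why it can be used without paired data.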

[CV-153] EMGauss: Continuous Slice-to-3D Reconstruction via Dynamic Gaussian Modeling in Volume Electron Microscopy

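Quick read: This paper addresses anisotropic volumes in volume electron microscopy (vEM), where acquisition trade-offs limit axial resolution and degrade reconstruction quality. Existing deep learning methods restore isotropy using lateral priors, but this fails for morphologically anisotropic structures. The proposed EMGauss framework recasts 2D-slice-to-3D reconstruction as a dynamic scene rendering problem based on Gaussian splatting, modeling the progression of axial slices as the temporal evolution of 2D Gaussian point clouds. A Teacher-Student bootstrapping mechanism uses high-confidence predictions on unobserved slices as pseudo-supervision in data-sparse regimes. This improves interpolation quality and enables continuous slice synthesis without large-scale pretraining.

Link: https://arxiv.org/abs/2512.06684
Authors: Yumeng He, Zanwei Zhou, Yekun Zheng, Chen Liang, Yunbo Wang, Xiaokang Yang
Affiliations: Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: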

Abstract:Volume electron microscopy (vEM) enables nanoscale 3D imaging of biological structures but remains constrained by acquisition trade-offs, leading to anisotropic volumes with limited axial resolution. Existing deep learning methods seek to restore isotropy by leveraging lateral priors, yet their assumptions break down for morphologically anisotropic structures. We present EMGauss, a general framework for 3D reconstruction from planar scanned 2D slices with applications in vEM, which circumvents the inherent limitations of isotropy-based approaches. Our key innovation is to reframe slice-to-3D reconstruction as a 3D dynamic scene rendering problem based on Gaussian splatting, where the progression of axial slices is modeled as the temporal evolution of 2D Gaussian point clouds. To enhance fidelity in data-sparse regimes, we incorporate a Teacher-Student bootstrapping mechanism that uses high-confidence predictions on unobserved slices as pseudo-supervisory signals. Compared with diffusion- and GAN-based reconstruction methods, EMGauss substantially improves interpolation quality, enables continuous slice synthesis, and eliminates the need for large-scale pretraining. Beyond vEM, it potentially provides a generalizable slice-to-3D solution across diverse imaging domains.

[CV-154] RunawayEvil: Jailbreaking the Image-to-Video Generative Models

Link: https://arxiv.org/abs/2512.06674
Authors: Songping Wang, Rufan Qian, Yueming Lyu, Qinglong Liu, Linzhuang Zou, Jie Qin, Songhua Liu, Caifeng Shan
Affiliations: PRLab, Nanjing University; Meituan; Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-155] 1 1 2: Detector-Empowered Video Large Language Model for Spatio-Temporal Grounding and Reasoning

Link: https://arxiv.org/abs/2512.06673
Authors: Shida Gao, Feng Xue, Xiangfeng Wang, Anlong Ming, Teng Long, Yihua Shao, Haozhe Wang, Zhaowen Lin, Wei Wang, Nicu Sebe
Affiliations: Beijing University of Posts and Telecommunications; University of Trento; Institute of Automation, Chinese Academy of Sciences; Hong Kong University of Science and Technology; ZTE Corporation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-156] Rethinking Robustness: A New Approach to Evaluating Feature Attribution Methods

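Quick read: This paper addresses a flaw in how the robustness of feature attribution methods is currently evaluated: existing evaluations largely ignore differences in the model's outputs, so they do not assess the attribution method itself objectively. The solution comprises a new definition of similar inputs, a new robustness metric, and a method based on generative adversarial networks (GANs) for generating such inputs, which together expose weaknesses of the attribution method rather than of the neural network.

Link: https://arxiv.org/abs/2512.06665
Authors: Panagiota Kiourti, Anu Singh, Preeti Duraipandian, Weichao Zhou, Wenchao Li
Affiliations: Boston University; Intuit Inc.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: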

Abstract:This paper studies the robustness of feature attribution methods for deep neural networks. It challenges the current notion of attributional robustness that largely ignores the difference in the model’s outputs and introduces a new way of evaluating the robustness of attribution methods. Specifically, we propose a new definition of similar inputs, a new robustness metric, and a novel method based on generative adversarial networks to generate these inputs. In addition, we present a comprehensive evaluation with existing metrics and state-of-the-art attribution methods. Our findings highlight the need for a more objective metric that reveals the weaknesses of an attribution method rather than that of the neural network, thus providing a more accurate evaluation of the robustness of attribution methods.

[CV-157] CoT4Det: A Chain-of-Thought Framework for Perception-Oriented Vision-Language Tasks

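Quick read: This paper addresses the large gap between large vision-language models (LVLMs) and task-specific models on perception-centric tasks such as object detection, semantic segmentation, and depth estimation. For example, Qwen2.5-VL-7B-Instruct reaches only 19% mAP on COCO2017 val and struggles particularly with dense scenes and small-object recall. The proposed Chain-of-Thought for Detection (CoT4Det) reformulates perception tasks as three interpretable steps (classification, counting, and grounding) that better match the reasoning abilities of LVLMs. Without compromising general vision-language capabilities, it raises mAP from 19.0% to 33.0% on COCO2017 val and outperforms baselines on several perception benchmarks.

Link: https://arxiv.org/abs/2512.06663
Authors: Yu Qi, Yumeng Zhang, Chenting Gong, Xiao Tan, Weiming Zhang, Wei Zhang, Jingdong Wang
Affiliations: Baidu Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: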

Abstract:Large Vision-Language Models (LVLMs) have demonstrated remarkable success in a broad range of vision-language tasks, such as general visual question answering and optical character recognition (OCR). However, their performance on perception-centric tasks – such as object detection, semantic segmentation, and depth estimation – remains significantly inferior to that of task-specific expert models. For example, Qwen2.5-VL-7B-Instruct achieves only 19% mAP on COCO2017 val, particularly struggling with dense scenes and small object recall. In this work, we introduce Chain-of-Thought for Detection (CoT4Det), a simple but efficient strategy that reformulates perception tasks into three interpretable steps: classification, counting, and grounding – each more naturally aligned with the reasoning capabilities of LVLMs. Extensive experiments demonstrate that our method significantly improves perception performance without compromising general vision language capabilities. With a standard Qwen2.5-VL-7B-Instruct, CoT4Det boosts mAP from 19.0% to 33.0% on COCO2017 val and achieves competitive results across a variety of perception benchmarks, outperforming baselines by +2% on RefCOCO series and 19% on Flickr30k entities.

[CV-158] Personalized Image Descriptions from Attention Sequences

Link: https://arxiv.org/abs/2512.06662
Authors: Ruoyu Xue, Hieu Le, Jingyi Xu, Sounak Mondal, Abe Leite, Gregory Zelinsky, Minh Hoai, Dimitris Samaras
Affiliations: Stony Brook University; UNC-Charlotte; The University of Adelaide
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures

[CV-159] TextMamba: Scene Text Detector with Mamba

Link: https://arxiv.org/abs/2512.06657
Authors: Qiyan Zhao, Yue Yan, Da-Han Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

[CV-160] Estimating Black Carbon Concentration from Urban Traffic Using Vision-Based Machine Learning

Link: https://arxiv.org/abs/2512.06649
Authors: Camellia Zakaria, Aryan Sadeghi, Weaam Jaafar, Junshi Xu, Alex Mariakakis, Marianne Hatzopoulou
Affiliations: University of Toronto; University of Hong Kong
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
Comments: 12 pages, 16 figures, 4 tables, 4 pages Appendix, in submission and under review for ACM MobiSys 2026 as of December 6th, 2025

[CV-161] Financial Fraud Identification and Interpretability Study for Listed Companies Based on Convolutional Neural Network

链接: https://arxiv.org/abs/2512.06648
作者: Xiao Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: in Chinese language

点击查看摘要

[CV-162] Masked Autoencoder Pretraining on Strong-Lensing Images for Joint Dark-Matter Model Classification and Super-Resolution

链接: https://arxiv.org/abs/2512.06642
作者: Achmad Ardani Prasha,Clavino Ourizqi Rachmadi,Muhamad Fauzan Ibnu Syahlan,Naufal Rahfi Anugerah,Nanda Garin Raditya,Putri Amelia,Sabrina Laila Mutiara,Hilman Syachr Ramadhan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 21 pages, 7 figures, 3 table

点击查看摘要

[CV-163] MIND-V: Hierarchical Video Generation for Long-Horizon Robotic Manipulation with RL-based Physical Alignment

Link: https://arxiv.org/abs/2512.06628
Authors: Ruicheng Zhang, Mingyang Zhang, Jun Zhou, Zhangrui Guo, Xiaofan Liu, Zunnan Xu, Zhizhou Zhong, Puxin Yan, Haocheng Luo, Xiu Li
Affiliations: Tsinghua University; China University of Geosciences; Sun Yat-sen University; X Square Robot; Hong Kong University of Science and Technology; Central South University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-164] Hierarchical Deep Learning for Diatom Image Classification: A Multi-Level Taxonomic Approach

Link: https://arxiv.org/abs/2512.06613
Authors: Yueying Ke
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures, 2 tables, IEEE conference format. Submitted as course project

[CV-165] Learning Relative Gene Expression Trends from Pathology Images in Spatial Transcriptomics NEURIPS2025

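[Quick read]: This paper addresses the difficulty of estimating gene expression values from pathology images. Because of the complexity of sequencing techniques and the intrinsic variability across cells, observed gene expression data contain stochastic noise and batch effects, which makes accurate estimation of absolute expression levels hard. The key to the solution is a new learning objective: instead of directly predicting absolute gene expression values, the model learns relative expression patterns among genes. Based on the assumption that relative expression patterns are consistent across experiments, the authors design a new loss function, STRank, that is robust to noise and batch effects, which improves the accuracy of gene expression estimation.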

Link: https://arxiv.org/abs/2512.06612
Authors: Kazuya Nishimura, Haruka Hirose, Ryoma Bise, Kaito Shiku, Yasuhiro Kojima
Affiliations: National Cancer Center Japan; Kyushu University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025

Abstract:Gene expression estimation from pathology images has the potential to reduce the RNA sequencing cost. Point-wise loss functions have been widely used to minimize the discrepancy between predicted and absolute gene expression values. However, due to the complexity of the sequencing techniques and intrinsic variability across cells, the observed gene expression contains stochastic noise and batch effects, and estimating the absolute expression values accurately remains a significant challenge. To mitigate this, we propose a novel objective of learning relative expression patterns rather than absolute levels. We assume that the relative expression levels of genes exhibit consistent patterns across independent experiments, even when absolute expression values are affected by batch effects and stochastic noise in tissue samples. Based on the assumption, we model the relation and propose a novel loss function called STRank that is robust to noise and batch effects. Experiments using synthetic datasets and real datasets demonstrate the effectiveness of the proposed method. The code is available at this https URL.
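
The idea of learning relative rather than absolute expression can be illustrated with a generic pairwise logistic ranking loss. This is only a sketch under stated assumptions: it is not the paper's STRank loss, whose exact form the abstract does not give, and the function name `pairwise_rank_loss` and the toy values are hypothetical.

```python
import math


def pairwise_rank_loss(pred, target):
    """Generic pairwise logistic ranking loss over the genes at one spot.

    For every gene pair (i, j) whose observed expression satisfies
    target[i] > target[j], penalise predictions that do not rank i above j.
    Only the ordering of `target` is used, so any strictly increasing
    rescaling of the observed values leaves the loss unchanged.
    """
    total, count = 0.0, 0
    for i in range(len(pred)):
        for j in range(len(pred)):
            if target[i] > target[j]:
                # softplus(-(pred_i - pred_j)): small when pred_i >> pred_j
                total += math.log1p(math.exp(-(pred[i] - pred[j])))
                count += 1
    return total / count if count else 0.0


observed = [5.0, 1.0, 3.0]               # hypothetical observed expression
scaled = [2 * v + 10 for v in observed]  # simulated batch effect (affine shift)
good = [2.0, 0.0, 1.0]                   # prediction with the same ordering
bad = [0.0, 2.0, 1.0]                    # prediction with the reversed ordering

print(pairwise_rank_loss(good, observed) < pairwise_rank_loss(bad, observed))
print(pairwise_rank_loss(good, observed) == pairwise_rank_loss(good, scaled))
```

Because the loss depends only on the ordering of the observed values, the simulated batch effect does not change it, which is the kind of robustness the abstract argues for.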

[CV-166] Vector Quantization using Gaussian Variational Autoencoder

Link: https://arxiv.org/abs/2512.06609
Authors: Tongda Xu, Wendi Zheng, Jiajun He, Jose Miguel Hernandez-Lobato, Yan Wang, Ya-Qin Zhang, Jie Tang
Affiliations: Tsinghua University; Zhipu AI; University of Cambridge
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-167] From Remote Sensing to Multiple Time Horizons Forecasts: Transformers Model for CyanoHAB Intensity in Lake Champlain

Link: https://arxiv.org/abs/2512.06598
Authors: Muhammad Adil, Patrick J. Clemins, Andrew W. Schroth, Panagiotis D. Oikonomou, Donna M. Rizzo, Peter D. F. Isles, Xiaohan Zhang, Kareem I. Hannoun, Scott Turnbull, Noah B. Beckage, Asim Zia, Safwan Wshah
Affiliations: Vermont Artificial Intelligence Laboratory (VaiL), Department of Computer Science, University of Vermont; Department of Community Development and Applied Economics, University of Vermont; Department of Geology, University of Vermont; Department of Civil and Environmental Engineering, University of Vermont; Vermont EPSCoR (Established Program to Stimulate Competitive Research), University of Vermont; Vermont Department of Environmental Conservation; Water Quality Solutions, Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 15 figures

[CV-168] OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

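[Quick read]: This paper addresses the fragility of safety alignment in multi-modal large language models (MLLMs), in particular the risk that jailbreak attacks induce harmful behavior. Existing benchmarks such as JailBreakV-28K, MM-SafetyBench, and HADES offer partial insight, but they cover limited attack scenarios, lack standardized defense evaluation, and provide no unified, reproducible toolbox. The key to the solution is OmniSafeBench-MM, a comprehensive toolbox for evaluating multi-modal jailbreak attacks and defenses. It integrates 13 representative attack methods, 15 defense strategies, and a diverse dataset covering 9 major risk domains and 50 fine-grained categories, and it adopts a three-dimensional evaluation protocol (harmfulness, intent alignment, and response detail level) for fine-grained safety-utility analysis. By unifying data, methods, and evaluation standards, the platform provides a reproducible, standardized foundation for future MLLM safety research.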

Link: https://arxiv.org/abs/2512.06589
Authors: Xiaojun Jia, Jie Liao, Qi Guo, Teng Ma, Simeng Qin, Ranjie Duan, Tianlin Li, Yihao Huang, Zhitao Zeng, Dongxian Wu, Yiming Li, Wenqi Ren, Xiaochun Cao, Yang Liu
Affiliations: Nanyang Technological University; BraneMatrix AI; Chongqing University; Xi’an Jiaotong University; Northeastern University; Sun Yat-sen University; Alibaba; National University of Singapore; ByteDance
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in multi-modal large language models (MLLMs) have enabled unified perception-reasoning capabilities, yet these systems remain highly vulnerable to jailbreak attacks that bypass safety alignment and induce harmful behaviors. Existing benchmarks such as JailBreakV-28K, MM-SafetyBench, and HADES provide valuable insights into multi-modal vulnerabilities, but they typically focus on limited attack scenarios, lack standardized defense evaluation, and offer no unified, reproducible toolbox. To address these gaps, we introduce OmniSafeBench-MM, which is a comprehensive toolbox for multi-modal jailbreak attack-defense evaluation. OmniSafeBench-MM integrates 13 representative attack methods, 15 defense strategies, and a diverse dataset spanning 9 major risk domains and 50 fine-grained categories, structured across consultative, imperative, and declarative inquiry types to reflect realistic user intentions. Beyond data coverage, it establishes a three-dimensional evaluation protocol measuring (1) harmfulness, distinguished by a granular, multi-level scale ranging from low-impact individual harm to catastrophic societal threats, (2) intent alignment between responses and queries, and (3) response detail level, enabling nuanced safety-utility analysis. We conduct extensive experiments on 10 open-source and 8 closed-source MLLMs to reveal their vulnerability to multi-modal jailbreak. By unifying data, methodology, and evaluation into an open-source, reproducible platform, OmniSafeBench-MM provides a standardized foundation for future research. The code is released at this https URL.

[CV-169] MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding

Link: https://arxiv.org/abs/2512.06581
Authors: Yuhao Su, Anwesa Choudhuri, Zhongpai Gao, Benjamin Planche, Van Nguyen Nguyen, Meng Zheng, Yuhan Shen, Arun Innanje, Terrence Chen, Ehsan Elhamifar, Ziyan Wu
Affiliations: Northeastern University; United Imaging Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-170] Proof of Concept for Mammography Classification with Enhanced Compactness and Separability Modules

Link: https://arxiv.org/abs/2512.06575
Authors: Fariza Dahes
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 26 pages, 16 figures, 2 tables; proof of concept on mammography classification with compactness/separability modules and interactive dashboard; preprint submitted to arXiv cs.LG

[CV-171] GNC-Pose: Geometry-Aware GNC-PnP for Accurate 6D Pose Estimation

Link: https://arxiv.org/abs/2512.06565
Authors: Xiujin Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 1 figure, 2 tables, 14 pages

[CV-172] SUGAR: A Sweeter Spot for Generative Unlearning of Many Identities

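[Quick read]: This paper addresses the problem that individual identities in generative AI models can be misused and are hard to remove, and in particular how to achieve scalable generative unlearning of specific individuals in high-fidelity image synthesis without retraining the entire model. The key to the solution is the SUGAR framework, which learns a personalized surrogate latent for each identity to be removed and diverts reconstructions of the original identity to visually coherent alternative outputs, so that it neither relies on static template faces nor produces unrealistic images. It also introduces a continual utility preservation objective that guards against degradation of model performance as more identities are forgotten.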

Link: https://arxiv.org/abs/2512.06562
Authors: Dung Thuy Nguyen, Quang Nguyen, Preston K. Robinette, Eli Jiang, Taylor T. Johnson, Kevin Leach
Affiliations: Vanderbilt University; Rutgers University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in 3D-aware generative models have enabled high-fidelity image synthesis of human identities. However, this progress raises urgent questions around user consent and the ability to remove specific individuals from a model’s output space. We address this by introducing SUGAR, a framework for scalable generative unlearning that enables the removal of many identities (simultaneously or sequentially) without retraining the entire model. Rather than projecting unwanted identities to unrealistic outputs or relying on static template faces, SUGAR learns a personalized surrogate latent for each identity, diverting reconstructions to visually coherent alternatives while preserving the model’s quality and diversity. We further introduce a continual utility preservation objective that guards against degradation as more identities are forgotten. SUGAR achieves state-of-the-art performance in removing up to 200 identities, while delivering up to a 700% improvement in retention utility compared to existing baselines. Our code is publicly available at this https URL.

[CV-173] Bridging spatial awareness and global context in medical image segmentation

Link: https://arxiv.org/abs/2512.06560
Authors: Dalia Alzu’bi, A. Ben Hamza
Affiliations: Concordia Institute for Information Systems Engineering; Concordia University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-174] Novel Deep Learning Architectures for Classification and Segmentation of Brain Tumors from MRI Images

Link: https://arxiv.org/abs/2512.06531
Authors: Sayan Das(1), Arghadip Biswas(2) ((1) IIIT Delhi, (2) Jadavpur University)
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

[CV-175] On The Role of K-Space Acquisition in MRI Reconstruction Domain-Generalization

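[Quick read]: This paper addresses the limited cross-domain generalization of learned sampling patterns in accelerated magnetic resonance imaging (MRI): existing k-space sampling patterns optimized for a specific dataset or modality adapt poorly to domain shifts across imaging devices or conditions. The key to the solution is to treat k-space trajectory design as an active degree of freedom for improving domain generalization, and to introduce acquisition uncertainty during training, that is, to stochastically perturb k-space trajectories so as to simulate variability across scanners and imaging conditions. This improves reconstruction performance and robustness in cross-domain settings.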

Link: https://arxiv.org/abs/2512.06530
Authors: Mohammed Wattad, Tamir Shor, Alex Bronstein
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Recent work has established learned k-space acquisition patterns as a promising direction for improving reconstruction quality in accelerated Magnetic Resonance Imaging (MRI). Despite encouraging results, most existing research focuses on acquisition patterns optimized for a single dataset or modality, with limited consideration of their transferability across imaging domains. In this work, we demonstrate that the benefits of learned k-space sampling can extend beyond the training domain, enabling superior reconstruction performance under domain shifts. Our study presents two main contributions. First, through systematic evaluation across datasets and acquisition paradigms, we show that models trained with learned sampling patterns exhibit improved generalization under cross-domain settings. Second, we propose a novel method that enhances domain robustness by introducing acquisition uncertainty during training, stochastically perturbing k-space trajectories to simulate variability across scanners and imaging conditions. Our results highlight the importance of treating k-space trajectory design not merely as an acceleration mechanism, but as an active degree of freedom for improving domain generalization in MRI reconstruction.
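
One simple way to picture the stochastic perturbation of k-space trajectories is to add random jitter to each sample location. The sketch below assumes zero-mean Gaussian jitter and a normalised k-space range of [-π, π]. Both are illustrative assumptions, since the abstract does not specify the noise model, and `perturb_trajectory` and its parameters are hypothetical names.

```python
import math
import random


def perturb_trajectory(points, sigma, k_max=math.pi, seed=None):
    """Add Gaussian jitter to 2D k-space sample locations.

    `points` is a list of (kx, ky) tuples. Each coordinate is shifted by
    noise drawn from N(0, sigma^2) and then clipped to [-k_max, k_max].
    """
    rng = random.Random(seed)
    perturbed = []
    for kx, ky in points:
        kx_new = min(k_max, max(-k_max, kx + rng.gauss(0.0, sigma)))
        ky_new = min(k_max, max(-k_max, ky + rng.gauss(0.0, sigma)))
        perturbed.append((kx_new, ky_new))
    return perturbed


# A toy radial spoke at angle 0.3 rad with 17 samples (hypothetical values).
radii = [i * math.pi / 8 for i in range(-8, 9)]
spoke = [(r * math.cos(0.3), r * math.sin(0.3)) for r in radii]

# During training, a fresh perturbation would be drawn for every batch.
noisy = perturb_trajectory(spoke, sigma=0.05, seed=0)
print(len(noisy), noisy[8])
```

A reconstruction network trained on data sampled along such perturbed trajectories sees a family of acquisition patterns rather than a single fixed one.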

[CV-176] ShadowWolf – Automatic Labelling Evaluation and Model Training Optimised for Camera Trap Wildlife Images

Link: https://arxiv.org/abs/2512.06521
Authors: Jens Dede(1), Anna Förster(1) ((1) Department of Sustainable Communication Networks, University of Bremen, Bibliothekstr. 1, 28359, Bremen, Bremen, Germany)
Affiliations: Universität Bremen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 31 pages + appendix

[CV-177] Method of UAV Inspection of Photovoltaic Modules Using Thermal and RGB Data Fusion

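[Quick read]: This paper addresses three key problems in conventional inspection of photovoltaic (PV) infrastructure: thermal palette bias, data redundancy, and high communication bandwidth requirements. To automate the full monitoring workflow, from data acquisition to actionable, geo-located maintenance alerts, the study proposes an intelligent, integrated multi-modal framework. The solution has three core parts. First, a palette-invariant thermal embedding is learned by enforcing representational consistency and is fused with a contrast-normalized RGB stream through a gated mechanism. Second, a closed-loop adaptive re-acquisition controller with Rodrigues-based updates confirms ambiguous anomalies. Third, a geospatial deduplication module clusters redundant alerts using DBSCAN over the haversine distance. The system reaches mAP@0.5 = 0.903 on the public PVF-10 benchmark, a 12-15% improvement over single-modality baselines, and achieves 96% recall in field validation.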

Link: https://arxiv.org/abs/2512.06504
Authors: Andrii Lysyi, Anatoliy Sachenko, Pavlo Radiuk, Mykola Lysyi, Oleksandr Melnychenko, Diana Zahorodnia
Affiliations: Khmelnytskyi National University; Research Institute for Intelligent Computer Systems, West Ukrainian National University; Department of Informatics and Teleinformatics, Kazimierz Pulaski University of Radom; National Academy of the State Border Service of Ukraine named after Bogdan Khmelnitsky
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:The subject of this research is the development of an intelligent, integrated framework for the automated inspection of photovoltaic (PV) infrastructure that addresses the critical shortcomings of conventional methods, including thermal palette bias, data redundancy, and high communication bandwidth requirements. The goal of this study is to design, develop, and validate a comprehensive, multi-modal system that fully automates the monitoring workflow, from data acquisition to the generation of actionable, geo-located maintenance alerts, thereby enhancing plant safety and operational efficiency. The methods employed involve a synergistic architecture that begins with a palette-invariant thermal embedding, learned by enforcing representational consistency, which is fused with a contrast-normalized RGB stream via a gated mechanism. This is supplemented by a closed-loop, adaptive re-acquisition controller that uses Rodrigues-based updates for targeted confirmation of ambiguous anomalies and a geospatial deduplication module that clusters redundant alerts using DBSCAN over the haversine distance. In conclusion, this study establishes a powerful new paradigm for proactive PV inspection, with the proposed system achieving a mean Average Precision (mAP@0.5) of 0.903 on the public PVF-10 benchmark, a significant 12-15% improvement over single-modality baselines. Field validation confirmed the system’s readiness, achieving 96% recall, while the de-duplication process reduced duplicate-induced false positives by 15-20%, and relevance-only telemetry cut airborne data transmission by 60-70%.
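
The geospatial deduplication step, clustering alerts with DBSCAN over the haversine distance, can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the coordinates, the 5 m radius, and `min_pts=2` are hypothetical values chosen for the example.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def dbscan(points, eps_m, min_pts):
    """Plain DBSCAN; returns one cluster label per point, -1 marks noise."""
    labels = [None] * len(points)

    def neighbours(i):
        # Includes i itself, as in the standard DBSCAN definition.
        return [j for j in range(len(points))
                if haversine_m(points[i], points[j]) <= eps_m]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border point
            if labels[j] is not None:
                continue  # already assigned, do not expand again
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point, so keep growing the cluster
    return labels


# Three alerts on (roughly) the same panel and one far away, as (lat, lon).
alerts = [(50.0, 30.0), (50.00001, 30.00001), (50.00002, 30.0), (50.01, 30.01)]
labels = dbscan(alerts, eps_m=5.0, min_pts=2)
print(labels)
```

Alerts that share a cluster label can then be merged into a single maintenance alert. In practice a library implementation would normally replace the quadratic neighbour search used here.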

[CV-178] Sanvaad: A Multimodal Accessibility Framework for ISL Recognition and Voice-Based Interaction

Link: https://arxiv.org/abs/2512.06485
Authors: Kush Revankar, Shreyas Deshpande, Araham Sayeed, Ansh Tandale, Sarika Bobde
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-179] Towards Stable Cross-Domain Depression Recognition under Missing Modalities

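[Quick read]: This paper addresses the lack of a unified, generalizable framework for multimodal automatic depression detection (ADD), and in particular the limited stability of models when modalities are missing, as is common in real-world data. Existing audio- and video-based methods adapt poorly to diverse depression recognition scenarios and are sensitive to incomplete modality inputs. The key to the solution is a unified framework for Stable Cross-Domain Depression Recognition based on a Multimodal Large Language Model (SCD-MLLM), with two core components: (i) a Multi-Source Data Input Adapter (MDIA), which uses a masking mechanism and task-specific prompts to turn heterogeneous depression-related inputs into uniform token sequences and so reduces inconsistency across data sources; and (ii) a Modality-Aware Adaptive Fusion Module (MAFM), which adaptively fuses audio and visual features through a shared projection mechanism and improves robustness under missing-modality conditions. Experiments on five public, heterogeneous depression datasets show that SCD-MLLM outperforms state-of-the-art models and leading commercial LLMs (Gemini and GPT), with strong cross-domain generalization and stability under missing modalities.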

Link: https://arxiv.org/abs/2512.06447
Authors: Jiuyi Chen, Mingkui Tan, Haifeng Lu, Qiuna Xu, Zhihua Wang, Runhao Zeng, Xiping Hu
Affiliations: South China University of Technology; PengCheng Laboratory; Shenzhen MSU-BIT University; Guangdong-Hong Kong-Macao Joint Laboratory for Emotional Intelligence and Pervasive Computing; Guangdong University of Technology; City University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Depression poses serious public health risks, including suicide, underscoring the urgency of timely and scalable screening. Multimodal automatic depression detection (ADD) offers a promising solution; however, widely studied audio- and video-based ADD methods lack a unified, generalizable framework for diverse depression recognition scenarios and show limited stability to missing modalities, which are common in real-world data. In this work, we propose a unified framework for Stable Cross-Domain Depression Recognition based on Multimodal Large Language Model (SCD-MLLM). The framework supports the integration and processing of heterogeneous depression-related data collected from varied sources while maintaining stability in the presence of incomplete modality inputs. Specifically, SCD-MLLM introduces two key components: (i) Multi-Source Data Input Adapter (MDIA), which employs a masking mechanism and task-specific prompts to transform heterogeneous depression-related inputs into uniform token sequences, addressing inconsistency across diverse data sources; (ii) Modality-Aware Adaptive Fusion Module (MAFM), which adaptively integrates audio and visual features via a shared projection mechanism, enhancing resilience under missing modality conditions. We conduct comprehensive experiments under multi-dataset joint training settings on five publicly available and heterogeneous depression datasets from diverse scenarios: CMDC, AVEC2014, DAIC-WOZ, DVlog, and EATD. Across both complete and partial modality settings, SCD-MLLM outperforms state-of-the-art (SOTA) models as well as leading commercial LLMs (Gemini and GPT), demonstrating superior cross-domain generalization, enhanced ability to capture multimodal cues of depression, and strong stability to missing modality cases in real-world applications.

[CV-180] AGORA: Adversarial Generation Of Real-time Animatable 3D Gaussian Head Avatars

Link: https://arxiv.org/abs/2512.06438
Authors: Ramazan Fazylov, Sergey Zagoruyko, Aleksandr Parkin, Stamatis Lefkimmiatis, Ivan Laptev
Affiliations: MBZUAI; Polynome AI; MTS AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-181] Automated Deep Learning Estimation of Anthropometric Measurements for Preparticipation Cardiovascular Screening

Link: https://arxiv.org/abs/2512.06434
Authors: Lucas R. Mareque, Ricardo L. Armentano, Leandro J. Cymberknop
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 8 pages, 2 figures, 3 tables

[CV-182] When Gender is Hard to See: Multi-Attribute Support for Long-Range Recognition

Link: https://arxiv.org/abs/2512.06426
Authors: Nzakiese Mbongo, Kailash A. Hambarde, Hugo Proença
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 9 figures

[CV-183] Drag Mesh: Interactive 3D Generation Made Easy

Link: https://arxiv.org/abs/2512.06424
Authors: Tianshan Zhang, Zeyu Zhang, Hao Tang
Affiliations: Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-184] A Perception CNN for Facial Expression Recognition

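[Quick read]: This paper addresses the problem that convolutional neural networks (CNNs) for facial expression recognition (FER) may ignore the effect of facial segmentation and are therefore not sensitive enough to subtle expression changes. The key to the solution is a perception CNN (PCNN) built on three mechanisms. First, five parallel networks simultaneously learn local facial features based on the eyes, cheeks, and mouth, so that subtle changes are captured. Second, a multi-domain interaction mechanism fuses local sense-organ features with global facial structural features to better represent face images. Third, a two-phase loss function constrains the accuracy of the perceived information and the quality of the reconstructed face images, which safeguards PCNN's performance on FER.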

Link: https://arxiv.org/abs/2512.06422
Authors: Chunwei Tian, Jingyuan Xie, Lingjun Li, Wangmeng Zuo, Yanning Zhang, David Zhang
Affiliations: Harbin Institute of Technology; Northwestern Polytechnical University; Zhengzhou University of Light Industry; Chinese University of Hong Kong, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: in IEEE Transactions on Image Processing (2025)

Abstract:Convolutional neural networks (CNNs) can automatically learn data patterns to express face images for facial expression recognition (FER). However, they may ignore the effect of facial segmentation on FER. In this paper, we propose a perception CNN for FER, termed PCNN. Firstly, PCNN uses five parallel networks to simultaneously learn local facial features based on the eyes, cheeks, and mouth, so as to sensitively capture subtle changes in FER. Secondly, we utilize a multi-domain interaction mechanism to register and fuse local sense organ features and global facial structural features to better express face images for FER. Finally, we design a two-phase loss function to constrain the accuracy of the obtained sense information and the reconstructed face images, which guarantees the performance of the obtained PCNN in FER. Experimental results show that our PCNN achieves superior results on several lab and real-world FER benchmarks: CK+, JAFFE, FER2013, FERPlus, RAF-DB and Occlusion and Pose Variant Dataset. Its code is available at this https URL.

[CV-185] Perceptual Region-Driven Infrared-Visible Co-Fusion for Extreme Scene Enhancement

Link: https://arxiv.org/abs/2512.06400
Authors: Jing Tao, Yonghong Zong, Banglei Guana, Pengju Sun, Taihang Lei, Yang Shanga, Qifeng Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The paper has been accepted and officially published by IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT

[CV-186] OCFER-Net: Recognizing Facial Expression in Online Learning System

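[Quick read]: This paper addresses the lack of emotional interaction in online learning by using facial expression recognition (FER) to help teachers perceive students' emotional states and so improve the quality of teaching interaction. The key to the solution is a regularization term that enforces orthogonality of the convolution kernels, which increases the diversity and expressiveness of the extracted features. The resulting network, OCFER-Net, improves over baselines by 1.087 on the FER-2013 dataset, supporting the benefit of the orthogonality constraint.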

Link: https://arxiv.org/abs/2512.06379
Authors: Yi Huo, Lei Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, online learning is very popular, especially under the global epidemic of COVID-19. Besides knowledge distribution, emotion interaction is also very important. It can be obtained by employing Facial Expression Recognition (FER). Since the FER accuracy is substantial in assisting teachers to acquire the emotional situation, the project explores a series of FER methods and finds that few works engage in exploiting the orthogonality of convolutional matrix. Therefore, it enforces orthogonality on kernels by a regularizer, which extracts features with more diversity and expressiveness, and delivers OCFER-Net. Experiments are carried out on FER-2013, which is a challenging dataset. Results show superior performance over baselines by 1.087. The code of the research project is publicly available on this https URL.
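
A common way to enforce orthogonality on kernels with a regularizer is the soft penalty ||W Wᵀ − I||²_F, where each row of W is one flattened kernel. The sketch below illustrates that generic penalty. The abstract does not state which form OCFER-Net uses, so this is an assumption, and `orthogonality_penalty` is a hypothetical name.

```python
def orthogonality_penalty(kernels):
    """Soft orthogonality penalty ||W W^T - I||_F^2.

    `kernels` is a list of flattened convolution kernels (the rows of W),
    one row per output channel. The penalty is zero exactly when the rows
    are orthonormal.
    """
    n = len(kernels)
    penalty = 0.0
    for i in range(n):
        for j in range(n):
            # Entry (i, j) of the Gram matrix W W^T.
            gram = sum(a * b for a, b in zip(kernels[i], kernels[j]))
            target = 1.0 if i == j else 0.0
            penalty += (gram - target) ** 2
    return penalty


orthonormal = [[1.0, 0.0], [0.0, 1.0]]  # two orthogonal unit kernels
redundant = [[1.0, 0.0], [1.0, 0.0]]    # two identical kernels

print(orthogonality_penalty(orthonormal))  # 0.0
print(orthogonality_penalty(redundant))    # 2.0
```

In training, such a term would typically be added to the classification loss with a small weight, so that kernels are pushed toward extracting non-redundant features.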

[CV-187] VAD-Net: Multidimensional Facial Expression Recognition in Intelligent Education System

Link: https://arxiv.org/abs/2512.06377
Authors: Yi Huo, Yun Ge
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-188] Are AI-Generated Driving Videos Ready for Autonomous Driving? A Diagnostic Evaluation Framework

Link: https://arxiv.org/abs/2512.06376
Authors: Xinhao Xiang, Abhijeet Rastogi, Jiawei Zhang
Affiliations: IFM Lab, University of California, Davis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-189] VG-Refiner: Towards Tool-Refined Referring Grounded Reasoning via Agentic Reinforcement Learning

Link: https://arxiv.org/abs/2512.06373
Authors: Yuji Wang, Wenlong Liu, Jingxuan Niu, Haoji Zhang, Yansong Tang
Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; International Digital Economy Academy (IDEA)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The project page is this https URL

[CV-190] Human3R: Incorporating Human Priors for Better 3D Dynamic Reconstruction from Monocular Videos

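[Quick read]: This paper addresses the distortion of human structure in monocular dynamic video reconstruction caused by geometric inconsistency and resolution degradation. Existing methods lack an understanding of 3D human structure, which leads to distorted limb proportions and unnatural human-object fusion, while memory-constrained downsampling makes human boundaries drift toward background geometry. The key to the solution is to introduce hybrid geometric priors that combine the SMPL human body model with monocular depth estimation, and to jointly refine overall scene geometry and human detail through a hierarchical pipeline: scene geometry is first processed at full resolution, then human-specific detail is enhanced through strategic cropping and cross-attention fusion. A Feature Fusion Module integrates the SMPL priors so that the reconstruction stays geometrically plausible while human boundaries are preserved.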

Link: https://arxiv.org/abs/2512.06368
Authors: Weitao Xiong, Zhiyuan Yuan, Jiahao Lu, Chengfeng Zhao, Peng Li, Yuan Liu
Affiliations: HKUST (Hong Kong University of Science and Technology); XMU (Xiamen University); SYSU (Sun Yat-sen University)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Monocular dynamic video reconstruction faces significant challenges in dynamic human scenes due to geometric inconsistencies and resolution degradation issues. Existing methods lack 3D human structural understanding, producing geometrically inconsistent results with distorted limb proportions and unnatural human-object fusion, while memory-constrained downsampling causes human boundary drift toward background geometry. To address these limitations, we propose to incorporate hybrid geometric priors that combine SMPL human body models with monocular depth estimation. Our approach leverages structured human priors to maintain surface consistency while capturing fine-grained geometric details in human regions. We introduce Human3R, featuring a hierarchical pipeline with refinement components that processes full-resolution images for overall scene geometry, then applies strategic cropping and cross-attention fusion for human-specific detail enhancement. The method integrates SMPL priors through a Feature Fusion Module to ensure geometrically plausible reconstruction while preserving fine-grained human boundaries. Extensive experiments on TUM Dynamics and GTA-IM datasets demonstrate superior performance in dynamic human reconstruction.

[CV-191] Spoofing-aware Prompt Learning for Unified Physical-Digital Facial Attack Detection

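Link: https://arxiv.org/abs/2512.06363
Authors: Jiabao Guo, Yadian Wang, Hui Ma, Yuhao Fu, Ju Jia, Hui Liu, Shengeng Tang, Lechao Cheng, Yunfeng Diao, Ajian Liu
Affiliations: HFUT (Hefei University of Technology); MUST (Macau University of Science and Technology); SEU (Southeast University); CCNU (Central China Normal University); CASIA (Institute of Automation, Chinese Academy of Sciences)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: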

[CV-192] Rectifying Latent Space for Generative Single-Image Reflection Removal

Link: https://arxiv.org/abs/2512.06358
Authors: Mingjia Li, Jin Hu, Hainuo Wang, Qiming Hu, Jiarui Wang, Xiaojie Guo
Affiliations: Tianjin University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-193] TreeQ: Pushing the Quantization Boundary of Diffusion Transformer via Tree-Structured Mixed-Precision Search

Link: https://arxiv.org/abs/2512.06353
Authors: Kaicheng Yang, Kaisen Yang, Baiting Wu, Xun Zhang, Qianrui Yang, Haotong Qin, He Zhang, Yulun Zhang
Affiliations: Shanghai Jiao Tong University; Tsinghua University; ETH Zürich; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code and supplementary material can be found at this https URL

[CV-194] CLUENet: Cluster Attention Makes Neural Networks Have Eyes

Link: https://arxiv.org/abs/2512.06345
Authors: Xiangshuai Song, Jun-Jie Huang, Tianrui Liu, Ke Liang, Chang Tang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures, 2026 Association for the Advancement of Artificial Intelligence

[CV-195] Beyond Hallucinations: A Multimodal-Guided Task-Aware Generative Image Compression for Ultra-Low Bitrate

Link: https://arxiv.org/abs/2512.06344
Authors: Kaile Wang, Lijun He, Haisheng Fu, Haixia Bi, Fan Li
Affiliations: Shaanxi Key Laboratory of Deep Space Exploration Intelligent Information Technology, School of Information and Communications Engineering, Xi’an Jiaotong University; Department of Electrical and Computer Engineering, Faculty of Applied Science, The University of British Columbia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-196] CryoHype: Reconstructing a thousand cryo-EM structures with transformer-based hypernetworks

Link: https://arxiv.org/abs/2512.06332
Authors: Jeffrey Gu, Minkyu Jeon, Ambri Ma, Serena Yeung-Levy, Ellen D. Zhong
Affiliations: Princeton University; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-197] S2WMamba: A Spectral-Spatial Wavelet Mamba for Pansharpening

Link: https://arxiv.org/abs/2512.06330
Authors: Haoyu Zhang, Junhan Luo, Yugang Cao, Siran Peng, Jie Huang, Liangjian-Deng
Affiliations: University of Electronic Science and Technology of China; Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-198] ReCAD: Reinforcement Learning Enhanced Parametric CAD Model Generation with Vision-Language Models AAAI2026

Link: https://arxiv.org/abs/2512.06328
Authors: Jiahao Li, Yusheng Luo, Yunzhong Lou, Xiangdong Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted as an Oral presentation at AAAI 2026

[CV-199] Exploiting Spatiotemporal Properties for Efficient Event-Driven Human Pose Estimation

Link: https://arxiv.org/abs/2512.06306
Authors: Haoxian Zhou, Chuanzhi Xu, Langyi Chen, Haodong Chen, Yuk Ying Chung, Qiang Qu, Xaoming Chen, Weidong Cai
Affiliations: The University of Sydney; Beijing Technology and Business University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

[CV-200] StrokeNet: Unveiling How to Learn Fine-Grained Interactions in Online Handwritten Stroke Classification

Link: https://arxiv.org/abs/2512.06290
Authors: Yiheng Huang, Shuang She, Zewei Wei, Jianmin Lin, Ming Yang, Wenyin Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 5 figures

[CV-201] A Sleep Monitoring System Based on Audio Video and Depth Information

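[Quick read]: This paper addresses the quantitative evaluation of sleep disturbances, where traditional methods are either invasive or subjective. The key to the solution is a noninvasive, event-based monitoring system. It combines data from an infrared depth sensor, an RGB camera, and a four-microphone array, classifies sleep disturbances into three types of events (motion events, light-on/off events, and noise events), and builds separate background models on depth signals and color images to quantify the magnitude of each event type. An event detection algorithm then identifies sleep interruptions automatically, making sleep quality assessment more objective and reliable.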

Link: https://arxiv.org/abs/2512.06282
Authors: Lyn Chao-ling Chen, Kuan-Wen Chen, Yi-Ping Hung
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted in the Computer Vision, Graphics and Image Processing (CVGIP 2013)

Abstract:For quantitative evaluation of sleep disturbances, a noninvasive monitoring system is developed by introducing an event-based method. We observe sleep in a home context and classify sleep disturbances into three types of events: motion events, light-on/off events, and noise events. A device with an infrared depth sensor, an RGB camera, and a four-microphone array is used for sleep monitoring in an environment with barely any light sources. One background model is established on depth signals to measure the magnitude of movements. Because depth signals cannot capture lighting changes, another background model is established on color images to measure the magnitude of lighting effects. An event detection algorithm detects occurrences of events from the processed data of the three types of sensors. The system was tested under sleep conditions, and the experimental results validate the system's reliability.
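
The combination of a background model and event detection can be pictured with a running-average background over a per-frame scalar signal. This is a minimal sketch for a single sensor channel, not the system described in the paper: the update rule, `alpha`, `threshold`, and the toy depth values are all illustrative assumptions.

```python
def detect_events(signal, alpha=0.05, threshold=10.0):
    """Flag frames whose value deviates from a running-average background.

    `signal` is a per-frame scalar, for example mean depth (motion) or mean
    brightness (lighting). Returns (frame_index, magnitude) pairs for every
    frame whose deviation from the background exceeds `threshold`.
    """
    if not signal:
        return []
    background = signal[0]
    events = []
    for t, value in enumerate(signal):
        magnitude = abs(value - background)
        if magnitude > threshold:
            events.append((t, magnitude))
        else:
            # Update the background only on quiet frames so that events
            # do not leak into the background estimate.
            background = (1 - alpha) * background + alpha * value
    return events


# Toy mean-depth trace: a still sleeper, a two-frame movement, then still again.
depth = [100.0] * 5 + [130.0, 128.0] + [100.0] * 5
events = detect_events(depth)
print([t for t, _ in events])
```

Running the same routine on a brightness trace from the color images would give a corresponding detector for light-on/off events.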

[CV-202] Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models

Link: https://arxiv.org/abs/2512.06281
Authors: Hengzhuang Li, Xinsong Zhang, Qiming Peng, Bin Luo, Han Hu, Dengyang Jiang, Han-Jia Ye, Teng Zhang, Hai Jin
Affiliations: HUST (Huazhong University of Science and Technology); Tencent Hunyuan Research; HKUST (Hong Kong University of Science and Technology); NJU (Nanjing University)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

[CV-203] RefBench-PRO: Perceptual and Reasoning Oriented Benchmark for Referring Expression Comprehension

Link: https://arxiv.org/abs/2512.06276
Authors: Tianyi Gao, Hao Li, Han Fang, Xin Wei, Xiaodong Dong, Hongbo Sun, Ye Yuan, Zhongjiang He, Jinglin Xu, Jingmin Xin, Hao Sun
Affiliations: Xi’an Jiaotong University; China Telecom; Shanghai Jiao Tong University; University of Science and Technology Beijing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

[CV-204] FacePhys: State of the Heart Learning

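Summary: This paper targets three obstacles to deploying remote photoplethysmography (rPPG) in practice: limited compute on front-end devices, signal degradation caused by compressive channels, and weak cross-dataset generalization. The core of the solution is FacePhys, a memory-efficient rPPG algorithm built on temporal-spatial state space duality. By introducing a transferable heart state, it captures subtle periodic physiological variations across video frames at very low computational cost, which supports training on long sequences and low-latency inference. The method balances model scalability, cross-dataset generalization, and real-time operation, reduces error by 49% relative to existing methods, and runs with a memory footprint of only 3.6 MB and a per-frame latency of 9.46 ms.

Link: https://arxiv.org/abs/2512.06275
Authors: Kegang Wang, Jiankai Tang, Yuntao Wang, Xin Liu, Yuxuan Fan, Jiatong Ji, Yuanchun Shi, Daniel McDuff
Affiliations: Tsinghua University; University of Washington
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: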

Abstract: Vital sign measurement using cameras presents opportunities for comfortable, ubiquitous health monitoring. Remote photoplethysmography (rPPG), a foundational technology, enables cardiac measurement through minute changes in light reflected from the skin. However, practical deployment is limited by the computational constraints of performing analysis on front-end devices and the accuracy degradation of transmitting data through compressive channels that reduce signal quality. We propose a memory efficient rPPG algorithm - FacePhys - built on temporal-spatial state space duality, which resolves the trilemma of model scalability, cross-dataset generalization, and real-time operation. Leveraging a transferable heart state, FacePhys captures subtle periodic variations across video frames while maintaining a minimal computational overhead, enabling training on extended video sequences and supporting low-latency inference. FacePhys establishes a new state-of-the-art, with a substantial 49% reduction in error. Our solution enables real-time inference with a memory footprint of 3.6 MB and per-frame latency of 9.46 ms – surpassing existing methods by 83% to 99%. These results translate into reliable real-time performance in practical deployments, and a live demo is available at this https URL.

[CV-205] TriaGS: Differentiable Triangulation-Guided Geometric Consistency for 3D Gaussian Splatting

Link: https://arxiv.org/abs/2512.06269
Authors: Quan Tran, Tuan Dang
Affiliations: VinUniversity; University of Arkansas
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages

[CV-206] Knowing the Answer Isn't Enough: Fixing Reasoning Path Failures in LVLMs

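Summary: This paper addresses path selection bias in Large Vision-Language Models (LVLMs): a model may know the correct answer yet reach it through logically inconsistent or unstable reasoning paths, which makes its outputs erratic and unreliable. The root cause is a preference for flawed paths within the reasoning search space, not missing knowledge. The core of the solution is PSO (Path-Select Optimization), a two-stage post-training framework. Stage one applies Group Relative Policy Optimization (GRPO) with template-based and answer-based rewards to cultivate structured, step-by-step reasoning. Stage two performs online preference optimization: the model samples reasoning paths from GRPO-generated data and self-evaluates them, while inferior paths are stored in a Negative Replay Memory (NRM) as hard negatives and revisited periodically, which suppresses recurring mistakes and supports continual refinement of reasoning. Experiments show that PSO improves reasoning accuracy (by 7.4% on average) and the consistency of reasoning chains.

Link: https://arxiv.org/abs/2512.06258
Authors: Chaoyang Wang, Yangfan He, Yiyang Zhou, Yixuan Wang, Jiaqi Liu, Peng Xia, Zhengzhong Tu, Mohit Bansal, Huaxiu Yao
Affiliations: UNC-Chapel Hill; Texas A&M University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: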

Abstract:We reveal a critical yet underexplored flaw in Large Vision-Language Models (LVLMs): even when these models know the correct answer, they frequently arrive there through incorrect reasoning paths. The core issue is not a lack of knowledge, but a path selection bias within the vast reasoning search space. Although LVLMs are often capable of sampling correct solution trajectories, they disproportionately favor unstable or logically inconsistent ones, leading to erratic and unreliable outcomes. The substantial disparity between Pass@K (with large K) and Pass@1 across numerous models provides compelling evidence that such failures primarily stem from misreasoning rather than ignorance. To systematically investigate and address this issue, we propose PSO (Path-Select Optimization), a two-stage post-training framework designed to enhance both the reasoning performance and stability of existing LVLMs. In the first stage, we employ Group Relative Policy Optimization (GRPO) with template and answer-based rewards to cultivate structured, step-by-step reasoning. In the second stage, we conduct online preference optimization, where the model samples reasoning paths from GRPO-generated data, self-evaluates them, and aligns itself toward the preferred trajectories. Incorrect or suboptimal paths are concurrently stored in a Negative Replay Memory (NRM) as hard negatives, which are periodically revisited to prevent the model from repeating prior mistakes and to facilitate continual reasoning refinement. Extensive experiments show that PSO effectively prunes invalid reasoning paths, substantially enhances reasoning accuracy (with 7.4% improvements on average), and yields more stable and consistent chains of thought. Our code will be available at this https URL.
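
The gap between Pass@K and Pass@1 that the abstract cites as evidence can be made concrete with a small sketch. It uses the widely used unbiased estimator pass@k = 1 - C(n-c, k) / C(n, k), where n reasoning paths are sampled and c of them are correct. Whether the paper computes Pass@K exactly this way is an assumption, and the sample counts below are invented for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    # Probability that a random draw of k samples contains no correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical question: 16 sampled reasoning paths, only 2 reach the answer.
n_samples, n_correct = 16, 2
p1 = pass_at_k(n_samples, n_correct, 1)  # 2/16 = 0.125
p8 = pass_at_k(n_samples, n_correct, 8)  # 23/30, about 0.767
print(f"pass@1 = {p1:.3f}, pass@8 = {p8:.3f}")
```

A model like this hypothetical one can reach the answer but rarely picks a path that does, which is the "misreasoning rather than ignorance" pattern the abstract describes.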

[CV-207] Language-driven Fine-grained Retrieval

Link: https://arxiv.org/abs/2512.06255
Authors: Shijie Wang, Xin Yu, Yadan Luo, Zijian Wang, Pengfei Zhang, Zi Huang
Affiliations: The University of Queensland
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-208] NexusFlow: Unifying Disparate Tasks under Partial Supervision via Invertible Flow Networks

Link: https://arxiv.org/abs/2512.06251
Authors: Fangzhou Lin, Yuping Wang, Yuliang Guo, Zixun Huang, Xinyu Huang, Haichong Zhang, Kazunori Yamada, Zhengzhong Tu, Liu Ren, Ziming Zhang
Affiliations: Texas A&M University; Worcester Polytechnic Institute; Tohoku University; University of Michigan; Bosch Research North America & Bosch Center for AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 7 figures

[CV-209] Opinion: Learning Intuitive Physics May Require More than Visual Data

Link: https://arxiv.org/abs/2512.06232
Authors: Ellen Su, Solim Legris, Todd M. Gureckis, Mengye Ren
Affiliations: New York University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

[CV-210] GPU-GLMB: Assessing the Scalability of GPU-Accelerated Multi-Hypothesis Tracking

Link: https://arxiv.org/abs/2512.06230
Authors: Pranav Balakrishnan, Sidisha Barik, Sean M. O’Rourke, Benjamin M. Marlin
Affiliations: Manning College of Information and Computer Sciences, University of Massachusetts Amherst, USA; U.S. Army Combat Capabilities Development Command, Army Research Laboratory, Adelphi, MD, USA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-211] Revisiting SVD and Wavelet Difference Reduction for Lossy Image Compression: A Reproducibility Study

Link: https://arxiv.org/abs/2512.06221
Authors: Alena Makarova
Affiliations: Oregon State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 13 figures. Reproducibility study

[CV-212] The MICCAI Federated Tumor Segmentation (FeTS) Challenge 2024: Efficient and Robust Aggregation Methods for Federated Learning

Link: https://arxiv.org/abs/2512.06206
Authors: Akis Linardos, Sarthak Pati, Ujjwal Baid, Brandon Edwards, Patrick Foley, Kevin Ta, Verena Chung, Micah Sheller, Muhammad Irfan Khan, Mojtaba Jafaritadi, Elina Kontio, Suleiman Khan, Leon Mächler, Ivan Ezhov, Suprosanna Shit, Johannes C. Paetzold, Gustav Grimberg, Manuel A. Nickel, David Naccache, Vasilis Siomos, Jonathan Passerat-Palmbach, Giacomo Tarroni, Daewoon Kim, Leonard L. Klausmann, Prashant Shah, Bjoern Menze, Dimitrios Makris, Spyridon Bakas
Affiliations: University of Oxford; University College London; Stanford University; Massachusetts Institute of Technology; Max Planck Institute for Intelligent Systems; University of Edinburgh; University of Toronto; ETH Zurich; Technical University of Munich; Ludwig-Maximilians-Universität München; University of Stockholm; University of Cyprus; University of California, Berkeley; Seoul National University; University of Hamburg; Charité – Universitätsmedizin Berlin; University of Crete; University of Athens; University of Thessaly
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Published at the Journal of Machine Learning for Biomedical Imaging (MELBA) this https URL

[CV-213] Multi-Modal Zero-Shot Prediction of Color Trajectories in Food Drying

Link: https://arxiv.org/abs/2512.06190
Authors: Shichen Li, Ahmadreza Eslaminia, Chenhui Shao
Affiliations: University of Illinois at Urbana-Champaign; University of Michigan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

[CV-214] SPOOF: Simple Pixel Operations for Out-of-Distribution Fooling

Link: https://arxiv.org/abs/2512.06185
Authors: Ankit Gupta, Christoph Adami, Emily Dolson (Michigan State University)
Affiliations: Michigan State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages with 8 figures, plus 13 pages and 16 figures of supplementary material

[CV-215] Physics-Grounded Attached Shadow Detection Using Approximate 3D Geometry and Light Direction

Link: https://arxiv.org/abs/2512.06179
Authors: Shilin Hu, Jingyi Xu, Sagnik Das, Dimitris Samaras, Hieu Le
Affiliations: Stony Brook University; UNC Charlotte
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-216] Physics-Grounded Shadow Generation from Monocular 3D Geometry Priors and Approximate Light Direction

Link: https://arxiv.org/abs/2512.06174
Authors: Shilin Hu, Jingyi Xu, Akshat Dave, Dimitris Samaras, Hieu Le
Affiliations: Stony Brook University; UNC Charlotte
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-217] Automated Annotation of Shearographic Measurements Enabling Weakly Supervised Defect Detection

Link: https://arxiv.org/abs/2512.06171
Authors: Jessica Plassmann, Nicolas Schuler, Michael Schuth, Georg von Freymann
Affiliations: RPTU University Kaiserslautern-Landau; Trier University of Applied Sciences; University of Luxembourg; Fraunhofer Institute for Industrial Mathematics ITWM
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 4 figures

[CV-218] Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation

Link: https://arxiv.org/abs/2512.06158
Authors: Su Sun, Cheng Zhao, Himangi Mittal, Gaurav Mittal, Rohith Kukkala, Yingjie Victor Chen, Mei Chen
Affiliations: Purdue University; Carnegie Mellon University; Microsoft
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 11 figures

[CV-219] GuideNav: User-Informed Development of a Vision-Only Robotic Navigation Assistant For Blind Travelers

Link: https://arxiv.org/abs/2512.06147
Authors: Hochul Hwang, Soowan Yang, Jahir Sadik Monon, Nicholas A Giudice, Sunghoon Ivan Lee, Joydeep Biswas, Donghyun Kim
Affiliations: University of Massachusetts Amherst; DGIST; University of Maine; University of Texas at Austin
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments:

[CV-220] Explainable Melanoma Diagnosis with Contrastive Learning and LLM-based Report Generation AAAI-26

Link: https://arxiv.org/abs/2512.06105
Authors: Junwen Zheng, Xinran Xu, Li Rong Wang, Chang Cai, Lucinda Siyun Tan, Dingyuan Wang, Hong Liang Tey, Xiuyi Fan
Affiliations: National University of Singapore; Nanyang Technological University; Singapore Management University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: AAAI-26-AIA

[CV-221] SpectraIrisPAD: Leveraging Vision Foundation Models for Spectrally Conditioned Multispectral Iris Presentation Attack Detection

Link: https://arxiv.org/abs/2512.06103
Authors: Raghavendra Ramachandra, Sushma Venkatesh
Affiliations: Norwegian University of Science and Technology (NTNU); MOBAI AS
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in IEEE T-BIOM

[CV-222] BeLLA: End-to-End Birds Eye View Large Language Assistant for Autonomous Driving

Link: https://arxiv.org/abs/2512.06096
Authors: Karthik Mohan, Sonam Singh, Amit Arvind Kale
Affiliations: UC San Diego; Robert Bosch Corporate Research India
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-223] Shoot-Bounce-3D: Single-Shot Occlusion-Aware 3D from Lidar by Decomposing Two-Bounce Light SIGGRAPH

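Summary: This paper addresses 3D scene reconstruction from a single measurement in complex scenes, such as those containing occluded regions and specular materials. Conventional single-photon lidar estimates depth from directly reflected light, but multi-bounce light transport, shadows, and specular reflections make it hard to recover geometry and material properties accurately. The core of the solution is a data-driven method for inverting this complex light transport. The authors first build a large-scale simulated dataset of about 100k lidar transients for indoor scenes to learn a prior on light transport. They then use the prior to decompose measured two-bounce light into the contribution of each laser spot, which allows dense depth and material properties to be recovered in occluded and mirror-reflected regions from a single measurement.

Link: https://arxiv.org/abs/2512.06080
Authors: Tzofi Klinghoffer, Siddharth Somasundaram, Xiaoyu Xiang, Yuchen Fan, Christian Richardt, Akshat Dave, Ramesh Raskar, Rakesh Ranjan
Affiliations: Massachusetts Institute of Technology; Meta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: SIGGRAPH Asia 2025. Project page: this https URL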

Abstract:3D scene reconstruction from a single measurement is challenging, especially in the presence of occluded regions and specular materials, such as mirrors. We address these challenges by leveraging single-photon lidars. These lidars estimate depth from light that is emitted into the scene and reflected directly back to the sensor. However, they can also measure light that bounces multiple times in the scene before reaching the sensor. This multi-bounce light contains additional information that can be used to recover dense depth, occluded geometry, and material properties. Prior work with single-photon lidar, however, has only demonstrated these use cases when a laser sequentially illuminates one scene point at a time. We instead focus on the more practical - and challenging - scenario of illuminating multiple scene points simultaneously. The complexity of light transport due to the combined effects of multiplexed illumination, two-bounce light, shadows, and specular reflections is challenging to invert analytically. Instead, we propose a data-driven method to invert light transport in single-photon lidar. To enable this approach, we create the first large-scale simulated dataset of ~100k lidar transients for indoor scenes. We use this dataset to learn a prior on complex light transport, enabling measured two-bounce light to be decomposed into the constituent contributions from each laser spot. Finally, we experimentally demonstrate how this decomposed light can be used to infer 3D geometry in scenes with occlusions and mirrors from a single measurement. Our code and dataset are released at this https URL.

[CV-224] EgoEdit: Dataset Real-Time Streaming Model and Benchmark for Egocentric Video Editing

Link: https://arxiv.org/abs/2512.06065
Authors: Runjia Li, Moayed Haji-Ali, Ashkan Mirzaei, Chaoyang Wang, Arpit Sahni, Ivan Skorokhodov, Aliaksandr Siarohin, Tomas Jakab, Junlin Han, Sergey Tulyakov, Philip Torr, Willi Menapace
Affiliations: Snap Research; Rice University; University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL

[CV-225] Representation Learning for Point Cloud Understanding

Link: https://arxiv.org/abs/2512.06058
Authors: Siming Yan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 181 pages

[CV-226] The SAM2-to-SAM3 Gap in the Segment Anything Model Family: Why Prompt-Based Expertise Fails in Concept-Driven Image Segmentation

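Summary: This paper examines the fundamental discontinuity between the two latest Segment Anything Models, SAM2 and SAM3, and explains why the prompt-based segmentation expertise built around SAM2 does not transfer to the multimodal, concept-driven paradigm of SAM3. The analysis is organized along five dimensions: (1) the conceptual difference between prompt-based and concept-driven segmentation, contrasting the spatial prompt semantics of SAM2 with the multimodal fusion and text-conditioned mask generation of SAM3; (2) architectural divergence, comparing the pure vision-temporal design of SAM2 with SAM3's integration of vision-language encoders, geometry and exemplar encoders, fusion modules, DETR-style decoders, object queries, and ambiguity handling via Mixture-of-Experts; (3) dataset and annotation differences between SA-V video masks and the multimodal concept-annotated corpora of SAM3; (4) training and hyperparameter distinctions, showing why SAM2 optimization knowledge does not apply to SAM3; and (5) the shift in evaluation metrics and failure modes from geometric IoU to semantic, open-vocabulary evaluation. On this basis the paper positions SAM3 as a new class of segmentation foundation model and outlines directions for the concept-driven segmentation era.

Link: https://arxiv.org/abs/2512.06032
Authors: Ranjan Sapkota, Konstantinos I. Roumeliotis, Manoj Karkee
Affiliations: Cornell University; University of the Peloponnese
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: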

Abstract: This paper investigates the fundamental discontinuity between the latest two Segment Anything Models: SAM2 and SAM3. We explain why the expertise in prompt-based segmentation of SAM2 does not transfer to the multimodal concept-driven paradigm of SAM3. SAM2 operates through spatial prompts (points, boxes, and masks), yielding purely geometric and temporal segmentation. In contrast, SAM3 introduces a unified vision-language architecture capable of open-vocabulary reasoning, semantic grounding, contrastive alignment, and exemplar-based concept understanding. We structure this analysis through five core components: (1) a Conceptual Break Between Prompt-Based and Concept-Based Segmentation, contrasting spatial prompt semantics of SAM2 with multimodal fusion and text-conditioned mask generation of SAM3; (2) Architectural Divergence, detailing pure vision-temporal design of SAM2 versus integration of vision-language encoders, geometry and exemplar encoders, fusion modules, DETR-style decoders, object queries, and ambiguity-handling via Mixture-of-Experts in SAM3; (3) Dataset and Annotation Differences, contrasting SA-V video masks with multimodal concept-annotated corpora of SAM3; (4) Training and Hyperparameter Distinctions, showing why SAM2 optimization knowledge does not apply to SAM3; and (5) Evaluation, Metrics, and Failure Modes, outlining the transition from geometric IoU metrics to semantic, open-vocabulary evaluation. Together, these analyses establish SAM3 as a new class of segmentation foundation model and chart future directions for the emerging concept-driven segmentation era.

[CV-227] Neural reconstruction of 3D ocean wave hydrodynamics from camera sensing

Link: https://arxiv.org/abs/2512.06024
Authors: Jiabin Liu, Zihao Zhou, Jialei Yan, Anxin Guo, Alvise Benetazzo, Hui Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Fluid Dynamics (physics.flu-dyn)
Comments:

[CV-228] PrefGen: Multimodal Preference Learning for Preference-Conditioned Image Generation

Link: https://arxiv.org/abs/2512.06020
Authors: Wenyi Mo, Tianyu Zhang, Yalong Bai, Ligong Han, Ying Ba, Dimitris N. Metaxas
Affiliations: Rutgers University; iN2X; MIT-IBM Watson AI Lab; Red Hat AI Innovation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL

[CV-229] Benchmarking CXR Foundation Models With Publicly Available MIMIC-CXR and NIH-CXR14 Datasets

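Summary: This paper addresses the lack of systematic comparison of how medical imaging foundation models behave across datasets, a gap that hinders their reliable use in clinical settings. The core of the solution is a standardized evaluation framework: with a unified preprocessing pipeline and a fixed downstream classifier (LightGBM), the authors extract feature representations from two large-scale chest X-ray (CXR) embedding models (CXR-Foundation and MedImageInsight) and run a reproducible quantitative comparison on the public MIMIC-CXR and NIH ChestX-ray14 datasets, revealing how the models differ in performance and cross-dataset stability.

Link: https://arxiv.org/abs/2512.06014
Authors: Jiho Shin, Dominic Marshall, Matthieu Komorowski
Affiliations: Imperial College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: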

Abstract: Recent foundation models have demonstrated strong performance in medical image representation learning, yet their comparative behaviour across datasets remains underexplored. This work benchmarks two large-scale chest X-ray (CXR) embedding models (CXR-Foundation (ELIXR v2.0) and MedImageInsight) on public MIMIC-CXR and NIH ChestX-ray14 datasets. Each model was evaluated using a unified preprocessing pipeline and fixed downstream classifiers to ensure reproducible comparison. We extracted embeddings directly from pre-trained encoders, trained lightweight LightGBM classifiers on multiple disease labels, and reported mean AUROC and F1-score with 95% confidence intervals. MedImageInsight achieved slightly higher performance across most tasks, while CXR-Foundation exhibited strong cross-dataset stability. Unsupervised clustering of MedImageInsight embeddings further revealed a coherent disease-specific structure consistent with quantitative results. The results highlight the need for standardised evaluation of medical foundation models and establish reproducible baselines for future multimodal and clinical integration studies.

[CV-230] VAT: Vision Action Transformer by Unlocking Full Representation of ViT

Link: https://arxiv.org/abs/2512.06013
Authors: Wenhao Li, Chengwei Ma, Weixin Mao
Affiliations: Hong Kong University of Science and Technology (Guangzhou); LimX Dynamics
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

[CV-231] High-Throughput Unsupervised Profiling of the Morphology of 316L Powder Particles for Use in Additive Manufacturing

Link: https://arxiv.org/abs/2512.06012
Authors: Emmanuel Akeweje, Conall Kirk, Chi-Wai Chan, Denis Dowling, Mimi Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-232] Fast and Flexible Robustness Certificates for Semantic Segmentation

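Summary: This paper addresses the sensitivity of deep neural networks to small perturbations in semantic segmentation, where predictions can change drastically even though the input looks unchanged to a human. Existing work mostly targets robustness or certification for classification, and efficient certification methods for semantic segmentation are scarce. The core of the solution is a class of certifiably robust semantic segmentation networks with built-in Lipschitz constraints, which balance training efficiency against robustness, together with a general certification framework that supports worst-case analysis under a range of performance measures. The approach is the first to make certifiably robust semantic segmentation compatible with real-time use: on an NVIDIA A100 GPU, certification at inference is about 600 times faster than randomized smoothing with certificates of comparable quality.

Link: https://arxiv.org/abs/2512.06010
Authors: Thomas Massena (IRIT-MISFIT, DTIPG - SNCF, UT3), Corentin Friedrich, Franck Mamalet, Mathieu Serrurier (IRIT-MISFIT)
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: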

Abstract: Deep Neural Networks are vulnerable to small perturbations that can drastically alter their predictions for perceptually unchanged inputs. The literature on adversarially robust Deep Learning attempts to either enhance the robustness of neural networks (e.g., via adversarial training) or to certify their decisions up to a given robustness level (e.g., by using randomized smoothing, formal methods or Lipschitz bounds). These studies mostly focus on classification tasks and few efficient certification procedures currently exist for semantic segmentation. In this work, we introduce a new class of certifiably robust Semantic Segmentation networks with built-in Lipschitz constraints that are efficiently trainable and achieve competitive pixel accuracy on challenging datasets such as Cityscapes. Additionally, we provide a novel framework that generalizes robustness certificates for semantic segmentation tasks, where we showcase the flexibility and computational efficiency of using Lipschitz networks. Our approach unlocks real-time compatible certifiably robust semantic segmentation for the first time. Moreover, it allows the computation of worst-case performance under $\ell_2$ attacks of radius $\epsilon$ across a wide range of performance measures. Crucially, we benchmark the runtime of our certification process and find our approach to be around 600 times faster than randomized smoothing methods at inference with comparable certificates on an NVIDIA A100 GPU. Finally, we evaluate the tightness of our worst-case certificates against state-of-the-art adversarial attacks to further validate the performance of our method.
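
To illustrate why a Lipschitz bound gives a cheap certificate, here is a minimal sketch of the standard margin argument. It assumes the map from the input image to each pixel's logit vector is L-Lipschitz in the l2 norm, so a pixel's prediction cannot change under any perturbation of l2 norm at most epsilon when its top-two logit margin exceeds sqrt(2) * L * epsilon. This is the textbook bound, not necessarily the certificate used in the paper, and the function names and toy logits are hypothetical.

```python
import math


def certified_radius(pixel_logits, lipschitz_const=1.0):
    """Largest l2 radius at which this pixel's argmax provably cannot change."""
    ordered = sorted(pixel_logits, reverse=True)
    margin = ordered[0] - ordered[1]  # gap between top-1 and top-2 logits
    return margin / (math.sqrt(2.0) * lipschitz_const)


def certified_pixel_accuracy(logits, labels, epsilon, lipschitz_const=1.0):
    """Fraction of pixels that are correct and certified at radius epsilon.

    This lower-bounds the pixel accuracy under any l2 attack of norm <= epsilon.
    """
    certified = 0
    for pixel_logits, label in zip(logits, labels):
        pred = max(range(len(pixel_logits)), key=pixel_logits.__getitem__)
        if pred == label and certified_radius(pixel_logits, lipschitz_const) > epsilon:
            certified += 1
    return certified / len(labels)


# Toy example: two pixels, three classes, both predicted correctly as class 0.
toy_logits = [[3.0, 1.0, 0.0], [0.5, 0.4, 0.0]]
toy_labels = [0, 0]
print(certified_pixel_accuracy(toy_logits, toy_labels, epsilon=0.5))
```

The certificate needs only one forward pass and a sort per pixel, which is the kind of cost profile that makes a large speed-up over sampling-based randomized smoothing plausible.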

[CV-233] Simple Agents Outperform Experts in Biomedical Imaging Workflow Optimization

Link: https://arxiv.org/abs/2512.06006
Authors: Xuefei (Julie) Wang, Kai A. Horstmann, Ethan Lin, Jonathan Chen, Alexander R. Farhang, Sophia Stiles, Atharva Sehgal, Jonathan Light, David Van Valen, Yisong Yue, Jennifer J. Sun
Affiliations: Caltech; Cornell; UT Austin; Rensselaer Polytechnic Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

[CV-234] PrunedCaps: A Case For Primary Capsules Discrimination

Link: https://arxiv.org/abs/2512.06003
Authors: Ramin Sharifi, Pouya Shiri, Amirali Baniasadi
Affiliations: University of Victoria
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-235] FishDetector-R1: Unified MLLM-Based Framework with Reinforcement Fine-Tuning for Weakly Supervised Fish Detection Segmentation and Counting

Link: https://arxiv.org/abs/2512.05996
Authors: Yi Liu, Jingyu Song, Vedanth Kallakuri, Katherine A. Skinner
Affiliations: University of Michigan
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments: 18 pages, under review

[CV-236] Domain-Specific Foundation Model Improves AI-Based Analysis of Neuropathology

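Summary: This paper addresses the limited performance of general-purpose foundation models in neuropathology. These models are trained mainly on surgical pathology data, whereas neuropathology has its own cell types (such as neurons and glia), tissue architecture, and disease-specific features (such as neurofibrillary tangles, amyloid plaques, and Lewy bodies), so general models struggle to capture the morphological patterns associated with neurodegenerative diseases such as Alzheimer's disease, Parkinson's disease, and cerebellar ataxias. The core of the solution is NeuroFM, a neuropathology-specific foundation model trained on whole-slide images of brain tissue spanning diverse neurodegenerative pathologies. Pretraining on in-domain data improves performance on downstream tasks including mixed dementia classification, hippocampal region segmentation, and neurodegenerative ataxia identification, supporting the case for domain-specialized foundation models in digital pathology.

Link: https://arxiv.org/abs/2512.05993
Authors: Ruchika Verma, Shrishtee Kandoi, Robina Afzal, Shengjia Chen, Jannes Jegminat, Michael W. Karlovich, Melissa Umphlett, Timothy E. Richardson, Kevin Clare, Quazi Hossain, Jorge Samanamud, Phyllis L. Faust, Elan D. Louis, Ann C. McKee, Thor D. Stein, Jonathan D. Cherry, Jesse Mez, Anya C. McGoldrick, Dalilah D. Quintana Mora, Melissa J. Nirenberg, Ruth H. Walker, Yolfrankcis Mendez, Susan Morgello, Dennis W. Dickson, Melissa E. Murray, Carlos Cordon-Cardo, Nadejda M. Tsankova, Jamie M. Walker, Diana K. Dangoor, Stephanie McQuillan, Emma L. Thorn, Claudia De Sanctis, Shuying Li, Thomas J. Fuchs, Kurt Farrell, John F. Crary, Gabriele Campanella
Affiliations: Mount Sinai Health System; Icahn School of Medicine at Mount Sinai; Memorial Sloan Kettering Cancer Center
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: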

Abstract:Foundation models have transformed computational pathology by providing generalizable representations from large-scale histology datasets. However, existing models are predominantly trained on surgical pathology data, which is enriched for non-nervous tissue and overrepresents neoplastic, inflammatory, metabolic, and other non-neurological diseases. Neuropathology represents a markedly different domain of histopathology, characterized by unique cell types (neurons, glia, etc.), distinct cytoarchitecture, and disease-specific pathological features including neurofibrillary tangles, amyloid plaques, Lewy bodies, and pattern-specific neurodegeneration. This domain mismatch may limit the ability of general-purpose foundation models to capture the morphological patterns critical for interpreting neurodegenerative diseases such as Alzheimer’s disease, Parkinson’s disease, and cerebellar ataxias. To address this gap, we developed NeuroFM, a foundation model trained specifically on whole-slide images of brain tissue spanning diverse neurodegenerative pathologies. NeuroFM demonstrates superior performance compared to general-purpose models across multiple neuropathology-specific downstream tasks, including mixed dementia disease classification, hippocampal region segmentation, and neurodegenerative ataxia identification encompassing cerebellar essential tremor and spinocerebellar ataxia subtypes. This work establishes that domain-specialized foundation models trained on brain tissue can better capture neuropathology-specific features than models trained on general surgical pathology datasets. By tailoring foundation models to the unique morphological landscape of neurodegenerative diseases, NeuroFM enables more accurate and reliable AI-based analysis for brain disease diagnosis and research, setting a precedent for domain-specific model development in specialized areas of digital pathology.

[CV-237] EmoDiffTalk: Emotion-aware Diffusion for Editable 3D Gaussian Talking Head

Link: https://arxiv.org/abs/2512.05991
Authors: Chang Liu, Tianjiao Jing, Chengcheng Ma, Xuanqi Zhou, Zhengxuan Lian, Qin Jin, Hongliang Yuan, Shi-Sheng Huang
Affiliations: Beijing Normal University; Renmin University of China; Tencent AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-238] VG3T: Visual Geometry Grounded Gaussian Transformer

Link: https://arxiv.org/abs/2512.05988
Authors: Junho Kim, Seongwon Lee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

[CV-239] Adaptive Dataset Quantization: A New Direction for Dataset Pruning

Link: https://arxiv.org/abs/2512.05987
Authors: Chenyue Yu, Jianyu Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ICCPR 2025

[CV-240] Video Models Start to Solve Chess Maze Sudoku Mental Rotation and Raven Matrices

Link: https://arxiv.org/abs/2512.05969
Authors: Hokin Deng
Affiliations: Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: See results (https://grow-ai-like-a-child.com/video-reason/) and code (https://github.com/hokindeng/VMEvalKit)

[CV-241] R2MF-Net: A Recurrent Residual Multi-Path Fusion Network for Robust Multi-directional Spine X-ray Segmentation

Link: https://arxiv.org/abs/2512.07576
Authors: Xuecheng Li, Weikuan Jia, Komildzhon Sharipov, Sharipov Hotam Beknazarovich, Farzona S. Ataeva, Qurbonaliev Alisher, Yuanjie Zheng
Affiliations: Shandong Normal University; Tajik State University of Law, Business and Politics
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-242] Precise Liver Tumor Segmentation in CT Using a Hybrid Deep Learning-Radiomics Framework

Link: https://arxiv.org/abs/2512.07574
Authors: Xuecheng Li, Weikuan Jia, Komildzhon Sharipov, Alimov Ruslan, Lutfuloev Mazbutdzhon, Ismoilov Shuhratjon, Yuanjie Zheng
Affiliations: Shandong Normal University; Tajik State University of Law, Business and Politics
Categories: Image and Video Processing (eess.IV); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-243] Affine Subspace Models and Clustering for Patch-Based Image Denoising

Link: https://arxiv.org/abs/2512.07259
Authors: Tharindu Wickremasinghe, Marco F. Duarte
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Asilomar Conference on Signals, Systems, and Computers 2025

[CV-244] Clinical Interpretability of Deep Learning Segmentation Through Shapley-Derived Agreement and Uncertainty Metrics

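Summary: This paper addresses the lack of interpretability of medical image segmentation models, which limits their acceptance and integration in clinical practice. Deep learning models segment well, but their decision process is still seen as a black box, which undermines clinicians' trust. The core of the solution is contrast-level Shapley values: the model inputs are perturbed systematically to quantify the fair contribution of each MRI contrast (such as T1, T2, FLAIR, and T1c) to model performance, yielding an interpretable importance ranking. Two metrics are derived from this ranking: the agreement between the model's ranking and a "clinician" ranking, and uncertainty measured as the variance of the Shapley ranking across cross-validation folds. Cases with higher Dice scores (> 0.6) show stronger agreement with the clinical ranking, while larger ranking variance correlates with lower performance (for U-Net, r = -0.581). The framework gives clinicians an intuitive, measurable proxy for model reliability and supports the clinical translation of advanced segmentation models.

Link: https://arxiv.org/abs/2512.07224
Authors: Tianyi Ren, Daniel Low, Pittra Jaengprajak, Juampablo Heras Rivera, Jacob Ruzevick, Mehmet Kurt
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: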

Abstract: Segmentation is the identification of anatomical regions of interest, such as organs, tissue, and lesions, serving as a fundamental task in computer-aided diagnosis in medical imaging. Although deep learning models have achieved remarkable performance in medical image segmentation, the need for explainability remains critical for ensuring their acceptance and integration in clinical practice, despite the growing research attention in this area. Our approach explored the use of contrast-level Shapley values, a systematic perturbation of model inputs to assess feature importance. While other studies have investigated gradient-based techniques through identifying influential regions in imaging inputs, Shapley values offer a broader, clinically aligned approach, explaining how model performance is fairly attributed to certain imaging contrasts over others. Using the BraTS 2024 dataset, we generated rankings for Shapley values for four MRI contrasts across four model architectures. Two metrics were proposed from the Shapley ranking: agreement between model and "clinician" imaging ranking, and uncertainty quantified through Shapley ranking variance across cross-validation folds. Higher-performing cases (Dice > 0.6) showed significantly greater agreement with clinical rankings. Increased Shapley ranking variance correlated with decreased performance (U-Net: r = -0.581). These metrics provide clinically interpretable proxies for model reliability, helping clinicians better understand state-of-the-art segmentation models.
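
With only four MRI contrasts, Shapley values can be computed exactly by enumerating all subsets. The sketch below shows that computation for a hypothetical value function that returns a Dice-like score for each subset of available contrasts. The scores are invented and additive purely for illustration, and the paper's actual perturbation scheme for "removing" a contrast is not specified here, so treat `toy_dice` as a stand-in for re-evaluating the segmentation model.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values; `value` maps a frozenset of players to a score."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical per-contrast contributions to the Dice score (made-up numbers).
TOY_GAIN = {"T1": 0.05, "T2": 0.10, "FLAIR": 0.30, "T1c": 0.40}


def toy_dice(available):
    """Stand-in for evaluating the model with only `available` contrasts."""
    return sum(TOY_GAIN[c] for c in available)


phi = shapley_values(list(TOY_GAIN), toy_dice)
ranking = sorted(phi, key=phi.get, reverse=True)
print(ranking)  # contrasts ordered from most to least important
```

The resulting ranking is what would be compared against a clinician's ordering, and repeating the computation per cross-validation fold would give the ranking variance the abstract uses as an uncertainty measure.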

[CV-245] Semantic Temporal Single-photon LiDAR

Link: https://arxiv.org/abs/2512.06008
Authors: Fang Li, Tonglin Mu, Shuling Li, Junran Guo, Keyuan Li, Jianing Li, Ziyang Luo, Xiaodong Fan, Ye Chen, Yunfeng Liu, Hong Cai, Lip Ket Chin, Jinbei Zhang, Shihai Sun
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantum Physics (quant-ph)
Comments: 14 pages, 5 figures. And any comment is welcome

[CV-246] Stronger is not better: Better Augmentations in Contrastive Learning for Medical Image Segmentation NEURIPS

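Quick read: This paper addresses the unstable gains of self-supervised contrastive learning on semantic segmentation of medical images, in particular the finding that existing strong data augmentation strategies do not always improve performance. The key to the solution is an experimental exploration of alternative data augmentations that improve performance on medical image semantic segmentation, and thus the downstream performance of self-supervised contrastive learning.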

Link: https://arxiv.org/abs/2512.05992
Authors: Azeez Idris, Abdurahman Ali Mohammed, Samuel Fanijo
Affiliations: Iowa State University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS Black in AI workshop - 2022

Abstract:Self-supervised contrastive learning is among the recent representation learning methods that have shown performance gains in several downstream tasks including semantic segmentation. This paper evaluates strong data augmentation, one of the most important components for self-supervised contrastive learning’s improved performance. Strong data augmentation involves applying the composition of multiple augmentation techniques on images. Surprisingly, we find that the existing data augmentations do not always improve performance for semantic segmentation for medical images. We experiment with other augmentations that provide improved performance.

Artificial Intelligence

[AI-0] Provable Long-Range Benefits of Next-Token Prediction

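Quick read: This paper asks why modern language models trained on next-word prediction can generate coherent documents and capture long-range structure. The key result is a proof that optimizing the next-token prediction objective over a Recurrent Neural Network (RNN) yields a language model that closely approximates the training distribution. Specifically, for held-out documents sampled from the training distribution, no algorithm of bounded description length that examines only the next $k$ tokens can distinguish $k$ consecutive tokens of a real document from $k$ tokens generated by the model after the same prefix. The paper gives polynomial bounds (in $k$, independent of document length) on the model size needed for this $k$-token indistinguishability, which offers an explanation for the long-range coherence observed in practice.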

Link: https://arxiv.org/abs/2512.07818
Authors: Xinyuan Cao, Santosh S. Vempala
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 66 pages, 5 figures

Abstract:Why do modern language models, trained to do well on next-word prediction, appear to generate coherent documents and capture long-range structure? Here we show that next-token prediction is provably powerful for learning longer-range structure, even with common neural network architectures. Specifically, we prove that optimizing next-token prediction over a Recurrent Neural Network (RNN) yields a model that closely approximates the training distribution: for held-out documents sampled from the training distribution, no algorithm of bounded description length limited to examining the next $k$ tokens, for any $k$, can distinguish between $k$ consecutive tokens of such documents and $k$ tokens generated by the learned language model following the same prefix. We provide polynomial bounds (in $k$, independent of the document length) on the model size needed to achieve such $k$-token indistinguishability, offering a complexity-theoretic explanation for the long-range coherence observed in practice.

[AI-1] Understanding Privacy Risks in Code Models Through Training Dynamics: A Causal Approach

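Quick read: This paper addresses privacy leakage in large language models for code (LLM4Code), which improve developer productivity but are trained on open-source repositories containing abundant personally identifiable information (PII). Existing studies mostly treat PII as a single category and overlook how PII types differ in learnability and leakage risk. The key to the solution is to build a benchmark dataset covering diverse PII types, fine-tune representative models of different scales, analyze training dynamics on real PII data, and use a structural causal model to estimate the causal effect of learnability on leakage. The results show that leakage risk differs substantially across PII types and tracks how easily each type is learned: easy-to-learn types such as IP addresses leak more often, harder types such as keys and passwords leak less often, and ambiguous types show mixed behavior. The study provides the first causal evidence that PII leakage risk is type-dependent and offers guidance for type-aware and learnability-aware defenses.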

Link: https://arxiv.org/abs/2512.07814
Authors: Hua Yang (1), Alejandro Velasco (2), Sen Fang (1), Bowen Xu (1), Denys Poshyvanyk (2) ((1) North Carolina State University, (2) William & Mary)
Affiliations: North Carolina State University; William & Mary
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 21 pages, 8 figures

Abstract:Large language models for code (LLM4Code) have greatly improved developer productivity but also raise privacy concerns due to their reliance on open-source repositories containing abundant personally identifiable information (PII). Prior work shows that commercial models can reproduce sensitive PII, yet existing studies largely treat PII as a single category and overlook the heterogeneous risks among different types. We investigate whether distinct PII types vary in their likelihood of being learned and leaked by LLM4Code, and whether this relationship is causal. Our methodology includes building a dataset with diverse PII types, fine-tuning representative models of different scales, computing training dynamics on real PII data, and formulating a structural causal model to estimate the causal effect of learnability on leakage. Results show that leakage risks differ substantially across PII types and correlate with their training dynamics: easy-to-learn instances such as IP addresses exhibit higher leakage, while harder types such as keys and passwords leak less frequently. Ambiguous types show mixed behaviors. This work provides the first causal evidence that leakage risks are type-dependent and offers guidance for developing type-aware and learnability-aware defenses for LLM4Code.

[AI-2] Auditing Games for Sandbagging

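Quick read: This paper addresses the risk that future AI systems conceal their true capabilities during evaluations ("sandbagging"), which could lead developers and auditors to misjudge a model's actual performance. The key to the approach is an auditing game that stress-tests detection methods: a red team fine-tunes models to mimic sandbagging, and a blue team tries to identify the sandbagging models with black-box, model-internals, or training-based methods. The study finds that current methods cannot reliably separate sandbagging models from benign ones. Training-based elicitation was the most effective strategy, recovering the full performance of sandbagging models from a single correct demonstration, but it is prone to false positives because it can also raise the performance of benign models. The authors therefore recommend, in the short term, removing potential sandbagging through on-distribution training for elicitation and, in the longer term, further research on the efficacy of training-based elicitation and on robust sandbagging detection.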

Link: https://arxiv.org/abs/2512.07810
Authors: Jordan Taylor, Sid Black, Dillon Bowen, Thomas Read, Satvik Golechha, Alex Zelenka-Martin, Oliver Makins, Connor Kissane, Kola Ayonrinde, Jacob Merizian, Samuel Marks, Chris Cundy, Joseph Bloom
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 77 pages (28 non-appendix pages), 38 figures

Abstract:Future AI systems could conceal their capabilities (‘sandbagging’) during evaluations, potentially misleading developers and auditors. We stress-tested sandbagging detection techniques using an auditing game. First, a red team fine-tuned five models, some of which conditionally underperformed, as a proxy for sandbagging. Second, a blue team used black-box, model-internals, or training-based approaches to identify sandbagging models. We found that the blue team could not reliably discriminate sandbaggers from benign models. Black-box approaches were defeated by effective imitation of a weaker model. Linear probes, a model-internals approach, showed more promise but their naive application was vulnerable to behaviours instilled by the red team. We also explored capability elicitation as a strategy for detecting sandbagging. Although Prompt-based elicitation was not reliable, training-based elicitation consistently elicited full performance from the sandbagging models, using only a single correct demonstration of the evaluation task. However the performance of benign models was sometimes also raised, so relying on elicitation as a detection strategy was prone to false-positives. In the short-term, we recommend developers remove potential sandbagging using on-distribution training for elicitation. In the longer-term, further research is needed to ensure the efficacy of training-based elicitation, and develop robust methods for sandbagging detection. We open source our model organisms at this https URL and select transcripts and results at this https URL . A demo illustrating the game can be played at this https URL .

[AI-3] Large Causal Models from Large Language Models

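Quick read: This paper addresses how to use the knowledge latent in large language models (LLMs) to build structured large causal models (LCMs) that span domains, in contrast to traditional causal inference, which relies on numerical experimental data and is confined to narrow domains. The core challenge is to weave the isolated, fragmented, ambiguous, and possibly conflicting causal statements extracted from an LLM into coherent relational causal triples and embed them in an LCM. The key to the solution is a system called DEMOCRITUS, whose implementation pipeline consists of six modules and which uses new categorical machine learning methods to formalize and merge causal statements, so that causal relations across domains can be organized and visualized.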

Link: https://arxiv.org/abs/2512.07796
Authors: Sridhar Mahadevan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 29 pages

Abstract:We introduce a new paradigm for building large causal models (LCMs) that exploits the enormous potential latent in today’s large language models (LLMs). We describe our ongoing experiments with an implemented system called DEMOCRITUS (Decentralized Extraction of Manifold Ontologies of Causal Relations Integrating Topos Universal Slices) aimed at building, organizing, and visualizing LCMs that span disparate domains extracted from carefully targeted textual queries to LLMs. DEMOCRITUS is methodologically distinct from traditional narrow domain and hypothesis centered causal inference that builds causal models from experiments that produce numerical data. A high-quality LLM is used to propose topics, generate causal questions, and extract plausible causal statements from a diverse range of domains. The technical challenge is then to take these isolated, fragmented, potentially ambiguous and possibly conflicting causal claims, and weave them into a coherent whole, converting them into relational causal triples and embedding them into a LCM. Addressing this technical challenge required inventing new categorical machine learning methods, which we can only briefly summarize in this paper, as it is focused more on the systems side of building DEMOCRITUS. We describe the implementation pipeline for DEMOCRITUS comprising of six modules, examine its computational cost profile to determine where the current bottlenecks in scaling the system to larger models. We describe the results of using DEMOCRITUS over a wide range of domains, spanning archaeology, biology, climate change, economics, medicine and technology. We discuss the limitations of the current DEMOCRITUS system, and outline directions for extending its capabilities.

[AI-4] RL-MTJail: Reinforcement Learning for Automated Black-Box Multi-Turn Jailbreaking of Large Language Models

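Quick read: This paper addresses the vulnerability of large language models (LLMs) to black-box multi-turn jailbreak attacks, in which an attacker elicits harmful content through a sequence of prompt-output interactions; existing single-turn optimization methods struggle to learn long-term attack strategies. The key to the solution is to formulate multi-turn jailbreaking as a multi-turn reinforcement learning task that uses the harmfulness of the final-turn output as the outcome reward, together with two heuristic process rewards: (1) controlling the harmfulness of intermediate outputs to avoid triggering the black-box model's rejection mechanisms, and (2) maintaining the semantic relevance of intermediate outputs to avoid drifting into irrelevant content. Experiments on multiple benchmarks show consistently improved attack success rates across multiple models.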

Link: https://arxiv.org/abs/2512.07761
Authors: Xiqiao Xiong, Ouxiang Li, Zhuo Liu, Moxin Li, Wentao Shi, Fuli Feng, Xiangnan He
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 19 pages, 15 figures

Abstract:Large language models are vulnerable to jailbreak attacks, threatening their safe deployment in real-world applications. This paper studies black-box multi-turn jailbreaks, aiming to train attacker LLMs to elicit harmful content from black-box models through a sequence of prompt-output interactions. Existing approaches typically rely on single turn optimization, which is insufficient for learning long-term attack strategies. To bridge this gap, we formulate the problem as a multi-turn reinforcement learning task, directly optimizing the harmfulness of the final-turn output as the outcome reward. To mitigate sparse supervision and promote long-term attack strategies, we propose two heuristic process rewards: (1) controlling the harmfulness of intermediate outputs to prevent triggering the black-box model’s rejection mechanisms, and (2) maintaining the semantic relevance of intermediate outputs to avoid drifting into irrelevant content. Experimental results on multiple benchmarks show consistently improved attack success rates across multiple models, highlighting the effectiveness of our approach. The code is available at this https URL. Warning: This paper contains examples of harmful content.

[AI-5] The Native Spiking Microarchitecture: From Iontronic Primitives to Bit-Exact FP8 Arithmetic

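Quick read: This paper addresses how to use the integrate-and-fire (IF) dynamics native to stochastic, analog Metal-Organic Framework (MOF) materials for deterministic, bit-exact AI workloads such as FP8 floating-point arithmetic, closing the gap "from stochastic ions to deterministic floats". The key to the solution is a Native Spiking Microarchitecture that treats noisy neurons as logic primitives and introduces a Spatial Combinational Pipeline and a Sticky-Extra Correction mechanism. Validation across all 16,129 FP8 pairs shows 100% bit-exact alignment with PyTorch. The architecture reduces Linear layer latency to O(log N), for a 17x speedup, and remains robust under extreme membrane leakage (β ≈ 0.01).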

Link: https://arxiv.org/abs/2512.07724
Authors: Zhengzheng Tang
Affiliations: Unknown
Categories: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI)
Comments:

Abstract:The 2025 Nobel Prize in Chemistry for Metal-Organic Frameworks (MOFs) and recent breakthroughs by Huanting Wang’s team at Monash University establish angstrom-scale channels as promising post-silicon substrates with native integrate-and-fire (IF) dynamics. However, utilizing these stochastic, analog materials for deterministic, bit-exact AI workloads (e.g., FP8) remains a paradox. Existing neuromorphic methods often settle for approximation, failing Transformer precision standards. To traverse the gap “from stochastic ions to deterministic floats,” we propose a Native Spiking Microarchitecture. Treating noisy neurons as logic primitives, we introduce a Spatial Combinational Pipeline and a Sticky-Extra Correction mechanism. Validation across all 16,129 FP8 pairs confirms 100% bit-exact alignment with PyTorch. Crucially, our architecture reduces Linear layer latency to O(log N), yielding a 17x speedup. Physical simulations further demonstrate robustness against extreme membrane leakage (β ≈ 0.01), effectively immunizing the system against the stochastic nature of the hardware.

[AI-6] Enabling Delayed-Full Charging Through Transformer-Based Real-Time-to-Departure Modeling for EV Battery Longevity AAAI’26

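Quick read: This paper addresses the accelerated degradation of lithium-ion batteries (LIBs) in electric vehicles (EVs) that spend long periods at a high state of charge (SOC). Delaying full charging until just before departure mitigates this, which requires accurate prediction of the user's departure time. The key to the solution is a Transformer-based real-time-to-event (TTE) model that represents each day as a TTE sequence of grid-based time tokens and predicts departures from streaming contextual information rather than relying mainly on temporal dependencies in historical patterns. In a real-world study with 93 users and passive smartphone data, the method captures irregular departure patterns within individual routines and outperforms baseline models.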

Link: https://arxiv.org/abs/2512.07723
Authors: Yonggeon Lee, Jibin Hwang, Alfred Malengo Kondoro, Juhyun Song, Youngtae Noh
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 9 figures, AAAI’26 (accepted)

Abstract:Electric vehicles (EVs) are key to sustainable mobility, yet their lithium-ion batteries (LIBs) degrade more rapidly under prolonged high states of charge (SOC). This can be mitigated by delaying full charging until just before departure, which requires accurate prediction of user departure times. In this work, we propose a Transformer-based real-time-to-event (TTE) model for accurate EV departure prediction. Our approach represents each day as a TTE sequence by discretizing time into grid-based tokens. Unlike previous methods primarily dependent on temporal dependency from historical patterns, our method leverages streaming contextual information to predict departures. Evaluation on a real-world study involving 93 users and passive smartphone data demonstrates that our method effectively captures irregular departure patterns within individual routines, outperforming baseline models. These results highlight the potential for practical deployment of the proposed algorithm and its contribution to sustainable transportation systems.

[AI-7] Each Prompt Matters: Scaling Reinforcement Learning Without Wasting Rollouts on Hundred-Billion-Scale MoE

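Quick read: This paper addresses the stability and efficiency problems of reinforcement learning (RL) on hundred-billion-scale Mixture-of-Experts (MoE) models: zero-variance prompts that waste rollouts, unstable importance sampling over long horizons, advantage inversion caused by standard reward models, and systemic bottlenecks in rollout processing. The solution combines four components: (1) Multi-Stage Zero-Variance Elimination, which filters out non-informative prompts and stabilizes group-based policy optimization; (2) ESPO, an entropy-adaptive optimization method that balances token-level and sequence-level importance sampling; (3) a Router Replay strategy that aligns training-time router decisions with inference-time behavior, together with a reward model adjustment that prevents advantage inversion; and (4) a high-throughput RL system with FP8-precision rollouts, overlapped reward computation, and length-aware scheduling. Together these form a pipeline that makes RL on hundred-billion-scale MoE models stable and efficient.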

Link: https://arxiv.org/abs/2512.07710
Authors: Anxiang Zeng, Haibo Zhang, Hailing Zhang, Kaixiang Mo, Liang Yao, Ling Hu, Long Zhang, Shuman Liu, Shuyi Xie, Yanshi Li, Yizhang Chen, Yuepeng Sheng, Yuwei Huang, Zhaochen Xu, Zhiqiang Zhou, Ziqin Liew
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We present CompassMax-V3-Thinking, a hundred-billion-scale MoE reasoning model trained with a new RL framework built on one principle: each prompt must matter. Scaling RL to this size exposes critical inefficiencies: zero-variance prompts that waste rollouts, unstable importance sampling over long horizons, advantage inversion from standard reward models, and systemic bottlenecks in rollout processing. To overcome these challenges, we introduce several unified innovations: (1) Multi-Stage Zero-Variance Elimination, which filters out non-informative prompts and stabilizes group-based policy optimization (e.g. GRPO) by removing wasted rollouts; (2) ESPO, an entropy-adaptive optimization method that balances token-level and sequence-level importance sampling to maintain stable learning dynamics; (3) a Router Replay strategy that aligns training-time MoE router decisions with inference-time behavior to mitigate train-infer discrepancies, coupled with a reward model adjustment to prevent advantage inversion; (4) a high-throughput RL system with FP8-precision rollouts, overlapped reward computation, and length-aware scheduling to eliminate performance bottlenecks. Together, these contributions form a cohesive pipeline that makes RL on hundred-billion-scale MoE models stable and efficient. The resulting model delivers strong performance across both internal and public evaluations.
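
To see why zero-variance prompts waste rollouts in group-based policy optimization, the sketch below assumes a GRPO-style advantage, where each rollout's reward is normalized by the mean and standard deviation of its prompt's group. The reward values, prompt names, and the single-pass filter are hypothetical; the abstract does not specify how the paper's multi-stage elimination works.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps) within one prompt's rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def drop_zero_variance(rollout_rewards):
    """Keep only prompts whose rollouts received at least two distinct rewards."""
    return {
        prompt: rewards
        for prompt, rewards in rollout_rewards.items()
        if len(set(rewards)) > 1
    }


# Hypothetical rewards for four rollouts per prompt.
rollout_rewards = {
    "prompt_a": [1.0, 1.0, 1.0, 1.0],  # always solved: no learning signal
    "prompt_b": [0.0, 0.0, 0.0, 0.0],  # never solved: no learning signal
    "prompt_c": [1.0, 0.0, 1.0, 0.0],  # mixed outcomes: informative
}

informative = drop_zero_variance(rollout_rewards)
advantages_c = group_advantages(informative["prompt_c"])
# Every advantage is zero here, so these rollouts contribute no gradient.
wasted = group_advantages(rollout_rewards["prompt_a"])
```

When every rollout of a prompt gets the same reward, all of its advantages are zero and the compute spent generating those rollouts produces no policy update, which is the inefficiency the filtering step targets.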

[AI-8] In-Context and Few-Shots Learning for Forecasting Time Series Data based on Large Language Models

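Quick read: This paper investigates whether pre-trained foundation models, in particular large language models and time-series-specific foundation models, can outperform existing data-driven approaches (such as ARIMA, LSTM, and TCN) for modeling and forecasting time series. The key to the study is a systematic evaluation of zero-shot, few-shot, and in-context learning, including Google's recent Time Series Foundation Model (TimesFM). TimesFM achieves the best overall performance, with the lowest RMSE (0.3023) and a competitive inference time (266 seconds), which suggests that pre-trained time series foundation models are a promising route to accurate real-time forecasting with minimal model adaptation.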

Link: https://arxiv.org/abs/2512.07705
Authors: Saroj Gopali, Bipin Chhetri, Deepika Giri, Sima Siami-Namini, Akbar Siami Namin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Existing data-driven approaches in modeling and predicting time series data include ARIMA (Autoregressive Integrated Moving Average), Transformer-based models, LSTM (Long Short-Term Memory) and TCN (Temporal Convolutional Network). These approaches, and in particular deep learning-based models such as LSTM and TCN, have shown great results in predicting time series data. With the advancement of leveraging pre-trained foundation models such as Large Language Models (LLMs) and more notably Google’s recent foundation model for time series data, TimesFM (Time Series Foundation Model), it is of interest to investigate whether these foundation models have the capability of outperforming existing modeling approaches in analyzing and predicting time series data. This paper investigates the performance of using LLM models for time series data prediction. We investigate the in-context learning methodology in the training of LLM models that are specific to the underlying application domain. More specifically, the paper explores training LLMs through in-context, zero-shot and few-shot learning and forecasting time series data with OpenAI o4-mini and Gemini 2.5 Flash Lite, as well as the recent Google’s Transformer-based TimesFM, a time series-specific foundation model, along with two deep learning models, namely TCN and LSTM networks. The findings indicate that TimesFM has the best overall performance with the lowest RMSE value (0.3023) and the competitive inference time (266 seconds). Furthermore, OpenAI’s o4-mini also exhibits a good performance based on Zero Shot learning. These findings highlight pre-trained time series foundation models as a promising direction for real-time forecasting, enabling accurate and scalable deployment with minimal model adaptation.

[AI-9] A Mathematical Theory of Top-k Sparse Attention via Total Variation Distance

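Quick read: This paper addresses the trade-off between efficiency and accuracy in attention for large models, specifically how to quantify the error of Top-$k$ attention truncation (keeping only the $k$ highest-scoring keys). The core challenge is to control the approximation error introduced by truncation with rigorous, provable bounds. The key to the solution is a unified mathematical framework that links the distribution-level error (the total-variation distance $\mathrm{TV}(P,\hat{P})$) to the output-level error (the difference in attention output). The paper shows that $\mathrm{TV}(P,\hat{P})$ equals the discarded softmax tail mass and derives non-asymptotic deterministic bounds that depend only on the ordered logits. Using a head-tail decomposition, it shows that the output error factorizes as $\tau\,\|\mu_{\mathrm{tail}}-\mu_{\mathrm{head}}\|_2$, which yields a new head-tail diameter bound and refinements based on $\mathrm{Var}_P(V)$. Experiments indicate that certified Top-$k$ truncation can reduce the number of scored keys by 2-4x on average while meeting a prescribed total-variation budget.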

Link: https://arxiv.org/abs/2512.07647
Authors: Georgios Tzachristas, Lei Deng, Ioannis Tzachristas, Gong Zhang, Renhai Chen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We develop a unified mathematical framework for certified Top-$k$ attention truncation that quantifies approximation error at both the distribution and output levels. For a single attention distribution $P$ and its Top-$k$ truncation $\hat P$, we show that the total-variation distance coincides with the discarded softmax tail mass and satisfies $\mathrm{TV}(P,\hat P)=1-e^{-\mathrm{KL}(\hat P\Vert P)}$, yielding sharp Top-$k$-specific bounds in place of generic inequalities. From this we derive non-asymptotic deterministic bounds – from a single boundary gap through multi-gap and blockwise variants – that control $\mathrm{TV}(P,\hat P)$ using only the ordered logits. Using an exact head-tail decomposition, we prove that the output error factorizes as $\|\mathrm{Attn}(q,K,V)-\mathrm{Attn}_k(q,K,V)\|_2=\tau\,\|\mu_{\mathrm{tail}}-\mu_{\mathrm{head}}\|_2$ with $\tau=\mathrm{TV}(P,\hat P)$, yielding a new head-tail diameter bound $\|\mathrm{Attn}(q,K,V)-\mathrm{Attn}_k(q,K,V)\|_2\le\tau\,\mathrm{diam}_{H,T}$ and refinements linking the error to $\mathrm{Var}_P(V)$. Under an i.i.d. Gaussian score model $s_i\sim\mathcal{N}(\mu,\sigma^2)$ we derive closed-form tail masses and an asymptotic rule for the minimal $k_\varepsilon$ ensuring $\mathrm{TV}(P,\hat P)\le\varepsilon$, namely $k_\varepsilon/n\approx\Phi_c(\sigma+\Phi^{-1}(\varepsilon))$. Experiments on bert-base-uncased and synthetic logits confirm the predicted scaling of $k_\varepsilon/n$ and show that certified Top-$k$ can reduce scored keys by 2-4$\times$ on average while meeting the prescribed total-variation budget.
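
The identity between total-variation distance, discarded tail mass, and the KL divergence can be checked numerically. The sketch below uses only the Python standard library and arbitrary example logits; it assumes that Top-$k$ truncation means keeping the $k$ largest softmax probabilities and renormalizing them, which is the reading under which the stated identity holds.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def top_k_truncate(p, k):
    """Keep the k largest probabilities, zero the rest, and renormalize."""
    keep = set(sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k])
    head_mass = sum(p[i] for i in keep)
    p_hat = [p[i] / head_mass if i in keep else 0.0 for i in range(len(p))]
    return p_hat, 1.0 - head_mass  # truncated distribution, discarded tail mass


def total_variation(p, q):
    """TV(p, q) = half the L1 distance between two distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(q, p):
    """KL(q || p), skipping entries where q is zero."""
    return sum(a * math.log(a / b) for a, b in zip(q, p) if a > 0.0)


logits = [2.0, 1.0, 0.5, -0.3, -1.2, -2.0]  # arbitrary example scores
p = softmax(logits)
p_hat, tail_mass = top_k_truncate(p, k=3)

tv = total_variation(p, p_hat)
tv_from_kl = 1.0 - math.exp(-kl_divergence(p_hat, p))
```

Up to floating-point error, `tv`, `tail_mass`, and `tv_from_kl` agree. The reason is that renormalizing the head scales every kept probability by $1/(1-\tau)$, so $\mathrm{KL}(\hat P\Vert P) = -\log(1-\tau)$, where $\tau$ is the tail mass.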

[AI-10] The Agent Capability Problem: Predicting Solvability Through Information-Theoretic Bounds

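Quick read: This paper asks when an autonomous agent should commit resources to a task under resource constraints. The key to the solution is the Agent Capability Problem (ACP) framework, which models problem solving as information acquisition: the agent needs $I_{\text{total}}$ bits in total to identify a solution, gains $I_{\text{step}}$ bits per action, and pays $C_{\text{step}}$ per action. This defines the effective cost $C_{\text{eff}} = (I_{\text{total}} / I_{\text{step}}) \cdot C_{\text{step}}$, which predicts resource requirements before search. The paper proves that $C_{\text{eff}}$ lower-bounds the expected cost and provides tight probabilistic upper bounds. Experiments show that ACP predictions closely track actual agent performance and improve efficiency over greedy and random strategies across LLM-based and agentic workflows, linking ideas from active learning, Bayesian optimization, and reinforcement learning through a unified information-theoretic lens.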

Link: https://arxiv.org/abs/2512.07631
Authors: Shahar Lutati
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:

Abstract:When should an autonomous agent commit resources to a task? We introduce the Agent Capability Problem (ACP), a framework for predicting whether an agent can solve a problem under resource constraints. Rather than relying on empirical heuristics, ACP frames problem-solving as information acquisition: an agent requires $I_{\text{total}}$ bits to identify a solution and gains $I_{\text{step}}$ bits per action at cost $C_{\text{step}}$, yielding an effective cost $C_{\text{eff}} = (I_{\text{total}}/I_{\text{step}})\,C_{\text{step}}$ that predicts resource requirements before search. We prove that $C_{\text{eff}}$ lower-bounds expected cost and provide tight probabilistic upper bounds. Experimental validation shows that ACP predictions closely track actual agent performance, consistently bounding search effort while improving efficiency over greedy and random strategies. The framework generalizes across LLM-based and agentic workflows, linking principles from active learning, Bayesian optimization, and reinforcement learning through a unified information-theoretic lens.
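
The effective-cost formula is simple enough to work through by hand. The sketch below evaluates $C_{\text{eff}} = (I_{\text{total}}/I_{\text{step}})\,C_{\text{step}}$ for made-up numbers; the `should_commit` budget check is a hypothetical decision rule added for illustration and is not described in the abstract.

```python
def effective_cost(i_total_bits, i_step_bits, c_step):
    """C_eff = (I_total / I_step) * C_step, as stated in the abstract."""
    if i_step_bits <= 0:
        raise ValueError("each action must yield a positive amount of information")
    return (i_total_bits / i_step_bits) * c_step


def should_commit(i_total_bits, i_step_bits, c_step, budget):
    """Hypothetical rule: commit only if the C_eff estimate fits the budget."""
    return effective_cost(i_total_bits, i_step_bits, c_step) <= budget


# Illustrative numbers: 20 bits needed, 0.5 bits gained per action,
# 2 cost units per action, so 40 actions at 2 units each.
estimate = effective_cost(20.0, 0.5, 2.0)
```

Because the abstract states that $C_{\text{eff}}$ lower-bounds the expected cost, an estimate above the budget suggests the task is likely out of reach, while an estimate below the budget is not by itself a guarantee of success.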

[AI-11] Incorporating Structure and Chord Constraints in Symbolic Transformer-based Melodic Harmonization

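Quick read: This paper addresses how to incorporate predefined chord constraints into melodic harmonization: given a melody and a desired chord at a specific location as inputs, the autoregressive Transformer must generate a harmonization that places the specified chord at the correct onset position within the correct bar. The key to the solution is an algorithm called B*, which combines aspects of beam search and A* with backtracking to force a pretrained Transformer to satisfy the chord constraints. The algorithm has exponential complexity in the worst case, but the paper is a first attempt to highlight the difficulties of the problem, and the algorithm leaves room for improvement because it can accommodate heuristics.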

Link: https://arxiv.org/abs/2512.07627
Authors: Maximos Kaliakatsos-Papakostas, Konstantinos Soiledis, Theodoros Tsamis, Dimos Makris, Vassilis Katsouros, Emilios Cambouropoulos
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
Comments: Proceedings of the 6th Conference on AI Music Creativity (AIMC 2025), Brussels, Belgium, September 10th-12th

Abstract:Transformer architectures offer significant advantages regarding the generation of symbolic music; their capabilities for incorporating user preferences toward what they generate is being studied under many aspects. This paper studies the inclusion of predefined chord constraints in melodic harmonization, i.e., where a desired chord at a specific location is provided along with the melody as inputs and the autoregressive transformer model needs to incorporate the chord in the harmonization that it generates. The peculiarities of involving such constraints is discussed and an algorithm is proposed for tackling this task. This algorithm is called B* and it combines aspects of beam search and A* along with backtracking to force pretrained transformers to satisfy the chord constraints, at the correct onset position within the correct bar. The algorithm is brute-force and has exponential complexity in the worst case; however, this paper is a first attempt to highlight the difficulties of the problem and proposes an algorithm that offers many possibilities for improvements since it accommodates the involvement of heuristics.

[AI-12] Time Series Foundation Models for Process Model Forecasting

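Quick read: This paper addresses the limited forecasting accuracy in Process Model Forecasting (PMF) caused by the sparsity and heterogeneity of directly-follows (DF) time series. Machine learning and deep learning models provide only modest gains over statistical baselines on this task. The key to the solution is to use Time Series Foundation Models (TSFMs), large pre-trained models for generic time series, and to compare zero-shot use without additional training against fine-tuned variants. On DF time series derived from real-life event logs, TSFMs generally achieve lower forecasting errors than traditional and specialized models trained from scratch, which indicates effective transfer of temporal structure from non-process domains. Fine-tuning gains are often small, so zero-shot use remains a strong default.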

Link: https://arxiv.org/abs/2512.07624
Authors: Yongbo Yu, Jari Peeperkorn, Johannes De Smedt, Jochen De Weerdt
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Process Model Forecasting (PMF) aims to predict how the control-flow structure of a process evolves over time by modeling the temporal dynamics of directly-follows (DF) relations, complementing predictive process monitoring that focuses on single-case prefixes. Prior benchmarks show that machine learning and deep learning models provide only modest gains over statistical baselines, mainly due to the sparsity and heterogeneity of the DF time series. We investigate Time Series Foundation Models (TSFMs), large pre-trained models for generic time series, as an alternative for PMF. Using DF time series derived from real-life event logs, we compare zero-shot use of TSFMs, without additional training, with fine-tuned variants adapted on PMF-specific data. TSFMs generally achieve lower forecasting errors (MAE and RMSE) than traditional and specialized models trained from scratch on the same logs, indicating effective transfer of temporal structure from non-process domains. While fine-tuning can further improve accuracy, the gains are often small and may disappear on smaller or more complex datasets, so zero-shot use remains a strong default. Our study highlights the generalization capability and data efficiency of TSFMs for process-related time series and, to the best of our knowledge, provides the first systematic evaluation of temporal foundation models for PMF.
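
For readers unfamiliar with directly-follows (DF) time series, the sketch below shows one way such series could be derived from an event log: count how often activity `a` is immediately followed by activity `b` within a case, per time window. The event log, the activity names, and the choice to assign a pair to the window of its second event are all hypothetical; the paper's preprocessing is not described in the abstract.

```python
from collections import Counter, defaultdict

# Hypothetical event log: (case_id, activity, day_index), time-ordered within each case.
EVENT_LOG = [
    ("c1", "register", 0), ("c1", "review", 0), ("c1", "approve", 1),
    ("c2", "register", 0), ("c2", "review", 1), ("c2", "reject", 1),
    ("c3", "register", 1), ("c3", "review", 1), ("c3", "approve", 1),
]


def df_time_series(event_log, n_windows):
    """Count each directly-follows pair (a, b) per time window.

    A pair is assigned to the window of its second event, which is one
    possible convention among several.
    """
    traces = defaultdict(list)
    for case_id, activity, window in event_log:
        traces[case_id].append((activity, window))

    counts = defaultdict(Counter)  # (a, b) -> Counter mapping window to count
    for events in traces.values():
        for (a, _), (b, window) in zip(events, events[1:]):
            counts[(a, b)][window] += 1

    # Expand each sparse counter into a dense series with zeros for empty windows.
    return {
        pair: [per_window[w] for w in range(n_windows)]
        for pair, per_window in counts.items()
    }


series = df_time_series(EVENT_LOG, n_windows=2)
```

Each DF pair yields one univariate series, and many pairs occur rarely, which gives a sense of the sparsity the abstract mentions. A forecasting model, zero-shot or otherwise, would then be applied to each of these series.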

[AI-13] Comparative Analysis and Parametric Tuning of PPO, GRPO, and DAPO for LLM Reasoning Enhancement

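Quick read: This paper studies how reinforcement learning (RL) algorithms can improve the complex reasoning of large language models (LLMs). Its core approach is a controlled transfer-learning evaluation: models are first fine-tuned on the Countdown Game and then assessed on a suite of general-purpose reasoning benchmarks. Key findings: models trained with all three RL algorithms (PPO, GRPO, and DAPO) outperform their base models; a larger group size in GRPO and DAPO leads to more stable training and higher accuracy; the effect of the KL-penalty coefficient is non-monotonic; and the Dynamic Sampling (DS) component in DAPO does not improve performance, with the best overall results obtained when DS is disabled.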

Link: https://arxiv.org/abs/2512.07611
Authors: Yongsheng Lian
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:This study presents a systematic comparison of three Reinforcement Learning (RL) algorithms (PPO, GRPO, and DAPO) for improving complex reasoning in large language models (LLMs). Our main contribution is a controlled transfer-learning evaluation: models are first fine-tuned on the specialized Countdown Game and then assessed on a suite of general-purpose reasoning benchmarks. Across all tasks, RL-trained models outperform their corresponding base models, although the degree of improvement differs by benchmark. Our parametric analysis offers practical guidance for RL-based LLM training. Increasing the group size in GRPO and DAPO leads to more stable training dynamics and higher accuracy, while the impact of the KL-penalty coefficient is non-monotonic. Additionally, we find that the Dynamic Sampling (DS) component in DAPO does not improve performance; in fact, the best overall results are achieved with DAPO when DS is disabled.

[AI-14] Weighted Contrastive Learning for Anomaly-Aware Time-Series Forecasting

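Quick read: This paper addresses unreliable forecasting of multivariate time series under anomalous conditions, for example in ATM cash logistics, where sudden demand shifts degrade model performance. Deep forecasters achieve high accuracy on normal data but often fail under distribution shift. The key to the solution is Weighted Contrastive Adaptation (WECA), a weighted contrastive objective that aligns the representations of normal and anomaly-augmented samples, preserving anomaly-relevant information while staying consistent under benign variations. This improves forecasting under anomalies with negligible degradation on normal data.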

Link: https://arxiv.org/abs/2512.07569
Authors: Joel Ekstrand, Tor Mattsson, Zahra Taghiyarrenani, Slawomir Nowaczyk, Jens Lundström, Mikael Lindén
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reliable forecasting of multivariate time series under anomalous conditions is crucial in applications such as ATM cash logistics, where sudden demand shifts can disrupt operations. Modern deep forecasters achieve high accuracy on normal data but often fail when distribution shifts occur. We propose Weighted Contrastive Adaptation (WECA), a weighted contrastive objective that aligns normal and anomaly-augmented representations, preserving anomaly-relevant information while maintaining consistency under benign variations. Evaluations on a nationwide ATM transaction dataset with domain-informed anomaly injection show that WECA improves SMAPE on anomaly-affected data by 6.1 percentage points compared to a normally trained baseline, with negligible degradation on normal data. These results demonstrate that WECA enhances forecasting reliability under anomalies without sacrificing performance during regular operations.
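
Since the headline result is reported in SMAPE percentage points, a minimal sketch of the metric may help. Several SMAPE variants exist; the one below divides the absolute error by the mean of the absolute actual and forecast values, and the abstract does not say which variant the authors use. The example numbers are made up.

```python
def smape(actual, forecast):
    """Symmetric MAPE in percent, with (|a| + |f|) / 2 as the denominator."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    terms = []
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2.0
        # Treat a 0/0 term (both values zero) as zero error.
        terms.append(0.0 if denom == 0 else abs(f - a) / denom)
    return 100.0 * sum(terms) / len(terms)


# Illustrative daily withdrawal totals versus a forecast.
score = smape([100.0, 200.0], [110.0, 180.0])
```

Under this variant the score ranges from 0 to 200, so a gain of 6.1 percentage points is an absolute difference between two such scores rather than a relative improvement.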

[AI-15] VulnLLM-R: Specialized Reasoning LLM with Agent Scaffold for Vulnerability Detection

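Quick read: This paper addresses the limited generality and accuracy of existing vulnerability detection methods, in particular the limits of static analysis tools and large language models (LLMs) in reasoning about program states. The core problem is that approaches based on pattern matching rather than reasoning generalize poorly and are prone to learning shortcuts. The key to the solution is VulnLLM-R, a reasoning LLM specialized for vulnerability detection and trained with a recipe that includes specialized data selection, reasoning data generation, reasoning data filtering and correction, and testing-phase optimization. On SOTA datasets across Python, C/C++, and Java, it outperforms SOTA static analysis tools and both open-source and commercial large reasoning models, and an agent scaffold built around the model discovers zero-day vulnerabilities in actively maintained repositories.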

链接: https://arxiv.org/abs/2512.07533
作者: Yuzhou Nie,Hongwei Li,Chengquan Guo,Ruizhe Jiang,Zhun Wang,Bo Li,Dawn Song,Wenbo Guo
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose VulnLLM-R, the first specialized reasoning LLM for vulnerability detection. Our key insight is that LLMs can reason about program states and analyze the potential vulnerabilities, rather than simple pattern matching. This can improve the model’s generalizability and prevent learning shortcuts. However, SOTA reasoning LLMs are typically ultra-large, closed-source, or have limited performance in vulnerability detection. To address this, we propose a novel training recipe with specialized data selection, reasoning data generation, reasoning data filtering and correction, and testing-phase optimization. Using our proposed methodology, we train a reasoning model with seven billion parameters. Through extensive experiments on SOTA datasets across Python, C/C++, and Java, we show that VulnLLM-R is more effective and efficient than SOTA static analysis tools and both open-source and commercial large reasoning models. We further conduct a detailed ablation study to validate the key designs in our training recipe. Finally, we construct an agent scaffold around our model and show that it outperforms CodeQL and AFL++ in real-world projects. Our agent further discovers a set of zero-day vulnerabilities in actively maintained repositories. This work represents a pioneering effort to enable real-world, project-level vulnerability detection using AI agents powered by specialized reasoning models. The code is available at this https URL.

[AI-16] Model-Based Reinforcement Learning Under Confounding

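[Quick read]: This paper addresses confounding caused by unobservable context in contextual Markov decision processes (C-MDPs). On offline datasets, conventional model-based reinforcement learning is fundamentally inconsistent because it cannot distinguish the transition and reward mechanisms generated under the behavioral policy from the interventional quantities required to evaluate a state-based policy. The key to the solution is a proximal off-policy evaluation approach that uses only observable state-action-reward trajectories to identify the confounded reward expectation when the proxy variables satisfy mild invertibility conditions. Combined with a behavior-averaged transition model, this yields a surrogate MDP whose Bellman operator is consistent for state-based policies and which integrates seamlessly with the maximum causal entropy (MaxCausalEnt) model-learning framework, enabling principled modeling and planning of state-based policies in environments with unobserved context.

Link: https://arxiv.org/abs/2512.07528
Authors: Nishanth Venkatesh, Andreas A. Malikopoulos
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 2 figures - decompressed draft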

Abstract:We investigate model-based reinforcement learning in contextual Markov decision processes (C-MDPs) in which the context is unobserved and induces confounding in the offline dataset. In such settings, conventional model-learning methods are fundamentally inconsistent, as the transition and reward mechanisms generated under a behavioral policy do not correspond to the interventional quantities required for evaluating a state-based policy. To address this issue, we adapt a proximal off-policy evaluation approach that identifies the confounded reward expectation using only observable state-action-reward trajectories under mild invertibility conditions on proxy variables. When combined with a behavior-averaged transition model, this construction yields a surrogate MDP whose Bellman operator is well defined and consistent for state-based policies, and which integrates seamlessly with the maximum causal entropy (MaxCausalEnt) model-learning framework. The proposed formulation enables principled model learning and planning in confounded environments where contextual information is unobserved, unavailable, or impractical to collect.

[AI-17] AutoICE: Automatically Synthesizing Verifiable C Code via LLM-driven Evolution

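[Quick read]: This paper addresses the severe syntactic and semantic errors that arise when autoformalizing natural language into verifiable C code, as well as the difficulty of formalizing implicit knowledge effectively. The core solution is AutoICE, an evolutionary search method driven by large language models (LLMs). Its key innovations are diverse individual initialization and collaborative crossover, which enable diverse iterative updates and thereby mitigate the error propagation inherent in single-agent iterations, together with a self-reflective mutation strategy that promotes the discovery of implicit knowledge. The method raises the verification success rate to 90.36%, outperforming the state-of-the-art approach.

Link: https://arxiv.org/abs/2512.07501
Authors: Weilin Luo, Xueyi Liang, Haotian Deng, Yanan Liu, Hai Wan
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: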

Abstract:Automatically synthesizing verifiable code from natural language requirements ensures software correctness and reliability while significantly lowering the barrier to adopting the techniques of formal methods. With the rise of large language models (LLMs), long-standing efforts at autoformalization have gained new momentum. However, existing approaches suffer from severe syntactic and semantic errors due to the scarcity of domain-specific pre-training corpora and often fail to formalize implicit knowledge effectively. In this paper, we propose AutoICE, an LLM-driven evolutionary search for synthesizing verifiable C code. It introduces the diverse individual initialization and the collaborative crossover to enable diverse iterative updates, thereby mitigating error propagation inherent in single-agent iterations. Besides, it employs the self-reflective mutation to facilitate the discovery of implicit knowledge. Evaluation results demonstrate the effectiveness of AutoICE: it successfully verifies 90.36% of code, outperforming the state-of-the-art (SOTA) approach. Besides, on a developer-friendly dataset variant, AutoICE achieves an 88.33% verification success rate, significantly surpassing the 65% success rate of the SOTA approach.

[AI-18] How Do LLMs Fail In Agentic Scenarios? A Qualitative Analysis of Success and Failure Scenarios of Various LLMs in Agentic Simulations

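[Quick read]: This paper addresses the insufficient reliability of large language models (LLMs) acting as autonomous agents with tool-use capabilities, especially the failure modes that recur in multi-step tool-calling tasks. The key to the solution is fine-grained behavioral analysis that identifies four typical failure patterns: premature action without grounding, over-helpfulness that substitutes missing entities, context pollution from distractor information, and fragile execution under high load. The paper stresses that evaluation methods should focus on interactive grounding, recovery behavior, and environment-aware adaptation, and that reliable enterprise deployment depends not only on stronger models but also on deliberate training and design strategies that reinforce verification, constraint discovery, and adherence to source-of-truth data.

Link: https://arxiv.org/abs/2512.07497
Authors: JV Roig
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 48 pages, 3 tables, 2 listings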

Abstract:We investigate how large language models (LLMs) fail when operating as autonomous agents with tool-use capabilities. Using the Kamiwaza Agentic Merit Index (KAMI) v0.1 benchmark, we analyze 900 execution traces from three representative models - Granite 4 Small, Llama 4 Maverick, and DeepSeek V3.1 - across filesystem, text extraction, CSV analysis, and SQL scenarios. Rather than focusing on aggregate scores, we perform fine-grained, per-trial behavioral analysis to surface the strategies that enable successful multi-step tool execution and the recurrent failure modes that undermine reliability. Our findings show that model scale alone does not predict agentic robustness: Llama 4 Maverick (400B) performs only marginally better than Granite 4 Small (32B) in some uncertainty-driven tasks, while DeepSeek V3.1’s superior reliability derives primarily from post-training reinforcement learning rather than architecture or size. Across models, we identify four recurring failure archetypes: premature action without grounding, over-helpfulness that substitutes missing entities, vulnerability to distractor-induced context pollution, and fragile execution under load. These patterns highlight the need for agentic evaluation methods that emphasize interactive grounding, recovery behavior, and environment-aware adaptation, suggesting that reliable enterprise deployment requires not just stronger models but deliberate training and design choices that reinforce verification, constraint discovery, and adherence to source-of-truth data.

[AI-19] Artificial Intelligence and Nuclear Weapons Proliferation: The Technological Arms Race for (In)visibility

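[Quick read]: This paper addresses the heightened risk of nuclear proliferation as emerging technologies evolve ever faster, in particular the growing uncertainty about nuclear “visibility” caused by the dynamic contest between proliferation-enabling technologies (PETs) and detection-enhancing technologies (DETs). The key to the solution is a quantitative model centered on a Relative Advantage Index (RAI) that captures the shifting technological balance between PETs and DETs. Scenario simulations analyze how different PET growth rates and DET investment strategies affect cumulative nuclear breakout risk, showing that detection alone will no longer suffice for future nonproliferation governance and that stronger PET governance is urgently needed.

Link: https://arxiv.org/abs/2512.07487
Authors: David M. Allison, Stephen Herzog
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Best Paper Award (2025) from Risk Analysis as one of the articles published in the journal that year with the most significant impacts to the theory or practice of risk analysis. Main text: 17 pages, 5 tables, 5 figures. Online appendix: 4 pages, 3 figures, 1 table. Online simulation tool for the formal model available here: this https URL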

Abstract:A robust nonproliferation regime has contained the spread of nuclear weapons to just nine states. Yet, emerging and disruptive technologies are reshaping the landscape of nuclear risks, presenting a critical juncture for decision makers. This article lays out the contours of an overlooked but intensifying technological arms race for nuclear (in)visibility, driven by the interplay between proliferation-enabling technologies (PETs) and detection-enhancing technologies (DETs). We argue that the strategic pattern of proliferation will be increasingly shaped by the innovation pace in these domains. Artificial intelligence (AI) introduces unprecedented complexity to this equation, as its rapid scaling and knowledge substitution capabilities accelerate PET development and challenge traditional monitoring and verification methods. To analyze this dynamic, we develop a formal model centered on a Relative Advantage Index (RAI), quantifying the shifting balance between PETs and DETs. Our model explores how asymmetric technological advancement, particularly logistic AI-driven PET growth versus stepwise DET improvements, expands the band of uncertainty surrounding proliferation detectability. Through replicable scenario-based simulations, we evaluate the impact of varying PET growth rates and DET investment strategies on cumulative nuclear breakout risk. We identify a strategic fork ahead, where detection may no longer suffice without broader PET governance. Governments and international organizations should accordingly invest in policies and tools agile enough to keep pace with tomorrow’s technology.

[AI-20] From Real-World Traffic Data to Relevant Critical Scenarios

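[Quick read]: This paper addresses the low efficiency of safety validation for automated driving systems in complex traffic scenarios, especially the difficulty of comprehensively identifying “unknown unsafe” scenarios. The core problem is that driving scenarios involve many degrees of freedom and new functions are growing in technical complexity, so traditional validation methods struggle to cover all potentially dangerous scenarios, which hinders reliable deployment. The key to the solution is a complete processing chain that runs from real-world data acquisition through criticality assessment to synthetic scenario generation. First, criticality measures are applied to real highway trajectory data to identify safety-relevant lane change scenarios. Second, the measures are linked to the specific driving situations so that safety-relevant scenarios can be located precisely. Finally, synthetic scenarios are generated from the recorded ones, extending coverage of “unknown unsafe” scenarios and improving the efficiency of data-driven scenario mining and validation.

Link: https://arxiv.org/abs/2512.07482
Authors: Florian Lüttner, Nicole Neis, Daniel Stadler, Robin Moss, Mirjam Fehling-Kaschek, Matthias Pfriem, Alexander Stolz, Jens Ziehn
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 8 figures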

Abstract:The reliable operation of autonomous vehicles, automated driving functions, and advanced driver assistance systems across a wide range of relevant scenarios is critical for their development and deployment. Identifying a near-complete set of relevant driving scenarios for such functionalities is challenging due to the numerous degrees of freedom involved, each affecting the outcomes of the driving scenario differently. Moreover, with increasing technical complexity of new functionalities, the number of potentially relevant, particularly “unknown unsafe” scenarios is increasing. To enhance validation efficiency, it is essential to identify relevant scenarios in advance, starting with simpler domains like highways before moving to more complex environments such as urban traffic. To address this, this paper focuses on analyzing lane change scenarios in highway traffic, which involve multiple degrees of freedom and present numerous safety-relevant scenarios. We describe the process of data acquisition and processing of real-world data from public highway traffic, followed by the application of criticality measures on trajectory data to evaluate scenarios, as conducted within the AVEAS project (this http URL). By linking the calculated measures to specific lane change driving scenarios and the conditions under which the data was collected, we facilitate the identification of safety-relevant driving scenarios for various applications. Further, to tackle the extensive range of “unknown unsafe” scenarios, we propose a way to generate relevant scenarios by creating synthetic scenarios based on recorded ones. Consequently, we demonstrate and evaluate a processing chain that enables the identification of safety-relevant scenarios, the development of data-driven methods for extracting these scenarios, and the generation of synthetic critical scenarios via sampling on highways.
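
The abstract does not name the criticality measures it applies to trajectory data. As an illustration only, the sketch below computes time-to-collision (TTC), a widely used measure for car-following and lane change situations; the function name, the units, and the example values are assumptions rather than details from the paper.

```python
import math


def time_to_collision(gap_m, follower_speed_mps, leader_speed_mps):
    """Seconds until the follower reaches the leader at constant speeds.

    Returns infinity when the follower is not closing the gap.
    """
    closing_speed = follower_speed_mps - leader_speed_mps
    if closing_speed <= 0:
        return math.inf
    return gap_m / closing_speed


# Hypothetical lane change: a car merges 30 m ahead of a faster follower.
ttc = time_to_collision(gap_m=30.0, follower_speed_mps=30.0, leader_speed_mps=25.0)
```

A scenario would then be flagged as safety-relevant when a measure like this falls below a chosen threshold at any point along the recorded trajectory.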

[AI-21] Understanding LLM Agent Behaviours via Game Theory: Strategy Recognition Biases and Multi-Agent Dynamics

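[Quick read]: This paper addresses the interpretability and assessability of the strategic behaviour of large language models (LLMs) acting as autonomous decision-makers in interactive multi-agent systems and human societies, in particular how to identify and quantify the intentions behind their decisions rather than only their outputs. The key to the solution is an extension of the FAIRGAME framework with two complementary experimental environments: a payoff-scaled Prisoner’s Dilemma that isolates the effect of incentive magnitude on behaviour, and a Public Goods Game with dynamic payoffs and multi-agent histories that captures how strategies evolve in complex collaborative settings. By training supervised classifiers to recognize canonical repeated-game strategies and applying them to LLM trajectories, the study finds that LLMs exhibit systematic, model- and language-dependent behavioural intentions, with linguistic framing at times mattering as much as architectural differences. This provides a unified methodological basis for auditing LLMs as strategic agents and reveals cooperation biases with implications for AI governance and the design of safe multi-agent systems.

Link: https://arxiv.org/abs/2512.07462
Authors: Trung-Kiet Huynh, Duy-Minh Dao-Sy, Thanh-Bang Cao, Phong-Hao Le, Hong-Dan Nguyen, Phu-Quy Nguyen-Lam, Minh-Luan Nguyen-Vo, Hong-Phat Pham, Phu-Hoa Pham, Thien-Kim Than, Chi-Nguyen Tran, Huy Tran, Gia-Thoai Tran-Le, Alessio Buscemi, Le Hong Trang, TheAnh Han
Affiliations: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Dynamical Systems (math.DS)
Comments: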

Abstract:As Large Language Models (LLMs) increasingly operate as autonomous decision-makers in interactive and multi-agent systems and human societies, understanding their strategic behaviour has profound implications for safety, coordination, and the design of AI-driven social and economic infrastructures. Assessing such behaviour requires methods that capture not only what LLMs output, but the underlying intentions that guide their decisions. In this work, we extend the FAIRGAME framework to systematically evaluate LLM behaviour in repeated social dilemmas through two complementary advances: a payoff-scaled Prisoner’s Dilemma isolating sensitivity to incentive magnitude, and an integrated multi-agent Public Goods Game with dynamic payoffs and multi-agent histories. These environments reveal consistent behavioural signatures across models and languages, including incentive-sensitive cooperation, cross-linguistic divergence and end-game alignment toward defection. To interpret these patterns, we train traditional supervised classification models on canonical repeated-game strategies and apply them to FAIRGAME trajectories, showing that LLMs exhibit systematic, model- and language-dependent behavioural intentions, with linguistic framing at times exerting effects as strong as architectural differences. Together, these findings provide a unified methodological foundation for auditing LLMs as strategic agents and reveal systematic cooperation biases with direct implications for AI governance, collective decision-making, and the design of safe multi-agent systems.
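
To illustrate what a payoff-scaled repeated Prisoner’s Dilemma and a canonical strategy look like in code, here is a small sketch. The base payoffs (5, 3, 1, 0) are the usual textbook values, and treating the scaling as one uniform multiplier is an assumed reading of “payoff-scaled”; none of this is the FAIRGAME implementation.

```python
def payoff_matrix(scale=1.0):
    """(my move, opponent move) -> my payoff; "C" = cooperate, "D" = defect."""
    base = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
    return {moves: scale * value for moves, value in base.items()}


def tit_for_tat(own_history, opponent_history):
    """Cooperate first, then copy the opponent's previous move."""
    return opponent_history[-1] if opponent_history else "C"


def always_defect(own_history, opponent_history):
    return "D"


def play(strategy_a, strategy_b, rounds, scale=1.0):
    """Run a repeated game and return both move histories and total scores."""
    payoffs = payoff_matrix(scale)
    history_a, history_b = [], []
    score_a = score_b = 0.0
    for _ in range(rounds):
        move_a = strategy_a(history_a, history_b)
        move_b = strategy_b(history_b, history_a)
        score_a += payoffs[(move_a, move_b)]
        score_b += payoffs[(move_b, move_a)]
        history_a.append(move_a)
        history_b.append(move_b)
    return history_a, history_b, score_a, score_b


history_a, history_b, score_a, score_b = play(tit_for_tat, always_defect, rounds=5)
```

Move histories generated this way by known strategies are the kind of labelled data on which a supervised classifier could be trained before it is applied to LLM trajectories, as the abstract describes.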

[AI-22] Forget and Explain: Transparent Verification of GNN Unlearning WSDM2026

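[Quick read]: This paper addresses the difficulty of verifiably “forgetting” (unlearning) in graph neural networks (GNNs) when faced with data deletion requests, and in particular the lack of transparency and auditability under privacy regulations such as the GDPR. The key to the solution is an explainability-driven verifier that compares snapshots of the model before and after deletion and uses attribution shifts and localized structural changes (such as graph edit distance) as human-readable evidence to quantify forgetting. The verifier combines five explainability metrics (residual attribution, heatmap shift, explainability score deviation, graph edit distance, and diagnostic graph rule shift) and adds membership-inference ROC-AUC as a graph-wide privacy signal, giving a transparent, verifiable framework for assessing GNN unlearning.

Link: https://arxiv.org/abs/2512.07450
Authors: Imran Ahsan (1), Hyunwook Yu (2), Jinsung Kim (2), Mucheol Kim (2)
Affiliations: (1) Department of Smart Cities, Chung-Ang University; (2) Department of Computer Science and Engineering, Chung-Ang University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: To appear in WSDM 2026 (ACM International Conference on Web Search and Data Mining). Code is available at this https URL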

Abstract:Graph neural networks (GNNs) are increasingly used to model complex patterns in graph-structured data. However, enabling them to “forget” designated information remains challenging, especially under privacy regulations such as the GDPR. Existing unlearning methods largely optimize for efficiency and scalability, yet they offer little transparency, and the black-box nature of GNNs makes it difficult to verify whether forgetting has truly occurred. We propose an explainability-driven verifier for GNN unlearning that snapshots the model before and after deletion, using attribution shifts and localized structural changes (for example, graph edit distance) as transparent evidence. The verifier uses five explainability metrics: residual attribution, heatmap shift, explainability score deviation, graph edit distance, and a diagnostic graph rule shift. We evaluate two backbones (GCN, GAT) and four unlearning strategies (Retrain, GraphEditor, GNNDelete, IDEA) across five benchmarks (Cora, Citeseer, Pubmed, Coauthor-CS, Coauthor-Physics). Results show that Retrain and GNNDelete achieve near-complete forgetting, GraphEditor provides partial erasure, and IDEA leaves residual signals. These explanation deltas provide the primary, human-readable evidence of forgetting; we also report membership-inference ROC-AUC as a complementary, graph-wide privacy signal.

[AI-23] LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services

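[Quick read]: This paper addresses the weak multi-hop reasoning of current generative AI in local life services, especially on ambiguous queries and complex retrieval tasks that span merchants and products. The key to the solution is LocalSearchBench, the first comprehensive benchmark for local life services, containing over 150,000 high-quality entries and 300 multi-hop QA tasks, together with LocalPlayground, a unified interactive environment that supports evaluating and training agents that coordinate multiple tools. Experiments show that even state-of-the-art large reasoning models (LRMs) still suffer from low correctness (at most 34.34%), limited completeness (77.33% on average), and limited faithfulness (61.99% on average) on this benchmark, underscoring the need for specialized benchmarks and domain-specific training.

Link: https://arxiv.org/abs/2512.07436
Authors: Hang He, Chuhuai Yue, Chengqi Dong, Mingxue Tian, Zhenfeng Liu, Jiajun Chai, Xiaohan Wang, Yufei Zhang, Qun Liao, Guojun Yin, Wei Lin, Chengcheng Wan, Haiying Sun, Ting Su
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: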

Abstract:Recent advances in large reasoning models (LRMs) have enabled agentic search systems to perform complex multi-step reasoning across multiple sources. However, most studies focus on general information retrieval and rarely explore vertical domains with unique challenges. In this work, we focus on local life services and introduce LocalSearchBench, which encompasses diverse and complex business scenarios. Real-world queries in this domain are often ambiguous and require multi-hop reasoning across merchants and products, remaining challenging and not fully addressed. As the first comprehensive benchmark for agentic search in local life services, LocalSearchBench includes over 150,000 high-quality entries from various cities and business types. We construct 300 multi-hop QA tasks based on real user queries, challenging agents to understand questions and retrieve information in multiple steps. We also developed LocalPlayground, a unified environment integrating multiple tools for agent interaction. Experiments show that even state-of-the-art LRMs struggle on LocalSearchBench: the best model (DeepSeek-V3.1) achieves only 34.34% correctness, and most models have issues with completeness (average 77.33%) and faithfulness (average 61.99%). This highlights the need for specialized benchmarks and domain-specific agent training in local life services. Code, Benchmark, and Leaderboard are available at this http URL.

[AI-24] MIDG: Mixture of Invariant Experts with knowledge injection for Domain Generalization in Multimodal Sentiment Analysis

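[Quick read]: This paper addresses two key problems of domain generalization in Multimodal Sentiment Analysis (MSA): first, existing methods overlook inter-modal synergies when extracting domain-invariant features, so they fail to capture the rich semantic information in multimodal data; second, knowledge injection techniques often suffer from fragmented cross-modal knowledge and therefore fail to exploit specific representations that lie beyond any single modality. The key to the solution is a new framework named MIDG with two core innovations: (1) a Mixture of Invariant Experts (MIE) model that learns synergistic relationships between modalities and extracts domain-invariant features; and (2) a Cross-Modal Adapter that enriches the semantics of multimodal representations through stronger cross-modal knowledge injection. Experiments show superior performance on three datasets.

Link: https://arxiv.org/abs/2512.07430
Authors: Yangle Li, Danli Luo, Haifeng Hu
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: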

Abstract:Existing methods in domain generalization for Multimodal Sentiment Analysis (MSA) often overlook inter-modal synergies during invariant feature extraction, which prevents the accurate capture of the rich semantic information within multimodal data. Additionally, while knowledge injection techniques have been explored in MSA, they often suffer from fragmented cross-modal knowledge, overlooking specific representations that exist beyond the confines of a single modality. To address these limitations, we propose a novel MSA framework designed for domain generalization. Firstly, the framework incorporates a Mixture of Invariant Experts model to extract domain-invariant features, thereby enhancing the model’s capacity to learn synergistic relationships between modalities. Secondly, we design a Cross-Modal Adapter to augment the semantic richness of multimodal representations through cross-modal knowledge injection. Extensive domain experiments conducted on three datasets demonstrate that the proposed MIDG achieves superior performance.

[AI-25] Do LLMs Trust the Code They Write?

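[Quick read]: This paper addresses the problem of large language models (LLMs) generating incorrect code, in particular that model output probabilities correlate only weakly with code correctness and reflect only the final result of the generation process, making them a poor gauge of code quality. The key to the solution is to contrast the hidden states of correct and incorrect code for the same programming task in order to identify a “correctness representation” encoded inside the LLM. Experiments show that this extracted internal correctness signal clearly outperforms standard log-likelihood ranking and verbalized model confidence, so higher-quality code samples can be selected without executing tests, improving the reliability of LLM code generation systems.

Link: https://arxiv.org/abs/2512.07404
Authors: Francisco Ribeiro, Claudio Spiess, Prem Devanbu, Sarah Nadi
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: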

Abstract:Despite the effectiveness of large language models (LLMs) for code generation, they often output incorrect code. One reason is that model output probabilities are often not well-correlated with correctness, and reflect only the final output of the generation process. Inspired by findings that LLMs internally encode concepts like truthfulness, this paper explores if LLMs similarly represent code correctness. Specifically, we identify a correctness representation inside LLMs by contrasting the hidden states between pairs of correct and incorrect code for the same programming tasks. By experimenting on four LLMs, we show that exploiting this extracted correctness representation outperforms standard log-likelihood ranking, as well as verbalized model confidence. Furthermore, we explore how this internal correctness signal can be used to select higher-quality code samples, without requiring test execution. Ultimately, this work demonstrates how leveraging internal representations can enhance code generation systems and make LLMs more reliable, thus improving confidence in automatically generated code.
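
One simple way to turn contrasting hidden states into a usable signal is a difference-of-means direction: average the hidden states of correct samples, subtract the average for incorrect ones, and score new candidates by projecting onto the result. The abstract does not state that the authors extract the representation this way, so the sketch below is an assumed stand-in, with toy two-dimensional vectors in place of real hidden states.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def correctness_direction(correct_states, incorrect_states):
    """Difference between the mean hidden state of correct and incorrect code."""
    mean_correct = mean_vector(correct_states)
    mean_incorrect = mean_vector(incorrect_states)
    return [c - i for c, i in zip(mean_correct, mean_incorrect)]


def correctness_score(hidden_state, direction):
    """Projection of a hidden state onto the direction (dot product)."""
    return sum(h * d for h, d in zip(hidden_state, direction))


def pick_best(candidates, direction):
    """candidates: list of (code, hidden_state) pairs; returns the top-scoring code."""
    return max(candidates, key=lambda c: correctness_score(c[1], direction))[0]


# Toy hidden states (hypothetical values, two dimensions for readability).
direction = correctness_direction(
    correct_states=[[1.0, 0.0], [0.8, 0.2]],
    incorrect_states=[[0.0, 1.0], [0.2, 0.8]],
)
best = pick_best([("sample_a", [0.9, 0.1]), ("sample_b", [0.1, 0.7])], direction)
```

Ranking candidates by this score needs only a forward pass, which matches the abstract's point that samples can be selected without running tests.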

[AI-26] Asymptotic analysis of shallow and deep forgetting in replay with Neural Collapse

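[Quick read]: This paper addresses a central paradox in continual learning (CL): neural networks retain linearly separable feature representations of past tasks even when their output predictions fail (the gap between deep feature-space forgetting and shallow classifier-level forgetting). The key to the solution is to expose an important asymmetry in Experience Replay: small buffers stabilize the geometry of deep features and prevent deep forgetting, yet do little to mitigate shallow classifier forgetting. This is attributed mainly to the “strong collapse” induced by small buffers, which leads to rank-deficient covariances and inflated class means that obscure the true class boundaries. By extending the Neural Collapse framework to the sequential setting, the authors prove that any non-zero replay fraction asymptotically guarantees linear separability, and they suggest that explicitly correcting these statistical artifacts could deliver robust performance with minimal replay.

Link: https://arxiv.org/abs/2512.07400
Authors: Giulia Lanzillotta, Damiano Meier, Thomas Hofmann
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: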

Abstract:A persistent paradox in continual learning (CL) is that neural networks often retain linearly separable representations of past tasks even when their output predictions fail. We formalize this distinction as the gap between deep feature-space and shallow classifier-level forgetting. We reveal a critical asymmetry in Experience Replay: while minimal buffers successfully anchor feature geometry and prevent deep forgetting, mitigating shallow forgetting typically requires substantially larger buffer capacities. To explain this, we extend the Neural Collapse framework to the sequential setting. We characterize deep forgetting as a geometric drift toward out-of-distribution subspaces and prove that any non-zero replay fraction asymptotically guarantees the retention of linear separability. Conversely, we identify that the “strong collapse” induced by small buffers leads to rank-deficient covariances and inflated class means, effectively blinding the classifier to true population boundaries. By unifying CL with out-of-distribution detection, our work challenges the prevailing reliance on large buffers, suggesting that explicitly correcting these statistical artifacts could unlock robust performance with minimal replay.

[AI-27] ESPADA: Execution Speedup via Semantics Aware Demonstration Data Downsampling for Imitation Learning

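[Quick read]: This paper addresses the inefficient real-world deployment of behavior-cloning visuomotor policies for robot manipulation, which inherit the slow, cautious tempo of human demonstrations. Existing acceleration methods rely mostly on statistical or heuristic cues and ignore task semantics, so they tend to fail across diverse manipulation settings. The key to the solution is the ESPADA framework, which uses a pipeline of a vision-language model (VLM) and a large language model (LLM), combined with 3D gripper-object relations, to segment demonstrations in a semantics-aware way, downsampling aggressively only in non-critical segments while preserving precision in critical phases. It then uses Dynamic Time Warping (DTW) on dynamics-only features to propagate the segment labels of a single annotated episode to the whole dataset. Without extra data, architectural modifications, or retraining, it achieves roughly a 2x speed-up while maintaining task success rates, narrowing the gap between human demonstrations and efficient robot control.

Link: https://arxiv.org/abs/2512.07371
Authors: Byungju Kim, Jinu Pahk, Chungwoo Lee, Jaejoon Kim, Jangha Lee, Theo Taeyeong Kim, Kyuhwan Shim, Jun Ki Lee, Byoung-Tak Zhang
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: project page: this https URL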

Abstract:Behavior-cloning based visuomotor policies enable precise manipulation but often inherit the slow, cautious tempo of human demonstrations, limiting practical deployment. However, prior studies on acceleration methods mainly rely on statistical or heuristic cues that ignore task semantics and can fail across diverse manipulation settings. We present ESPADA, a semantic and spatially aware framework that segments demonstrations using a VLM-LLM pipeline with 3D gripper-object relations, enabling aggressive downsampling only in non-critical segments while preserving precision-critical phases, without requiring extra data or architectural modifications, or any form of retraining. To scale from a single annotated episode to the full dataset, ESPADA propagates segment labels via Dynamic Time Warping (DTW) on dynamics-only features. Across both simulation and real-world experiments with ACT and DP baselines, ESPADA achieves approximately a 2x speed-up while maintaining success rates, narrowing the gap between human demonstrations and efficient robot control.
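
The label propagation step can be pictured with a textbook Dynamic Time Warping alignment: align an annotated reference episode with an unlabelled one, then copy each reference label to the frames it is matched with. The sketch below assumes a single scalar feature per frame and made-up label names; ESPADA's actual dynamics features and label set are not given in the abstract.

```python
import math


def dtw_path(reference, target):
    """Classic O(n*m) DTW on scalar sequences; returns (total cost, alignment path)."""
    n, m = len(reference), len(target)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            distance = abs(reference[i - 1] - target[j - 1])
            cost[i][j] = distance + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    # Backtrack from the end, always moving to the cheapest predecessor cell.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


def propagate_labels(reference, reference_labels, target):
    """Give each target frame the label of a reference frame it aligns with."""
    _, path = dtw_path(reference, target)
    labels = [None] * len(target)
    for ref_index, target_index in path:
        labels[target_index] = reference_labels[ref_index]
    return labels


# Hypothetical per-frame speed feature; the target episode is a slower replay.
reference = [0.0, 1.0, 1.0, 0.0]
reference_labels = ["critical", "free", "free", "critical"]
target = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
target_labels = propagate_labels(reference, reference_labels, target)
```

Frames labelled as non-critical (here "free") would be the candidates for aggressive downsampling, while frames labelled critical keep their original rate.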

[AI-28] Venus: An Efficient Edge Memory-and-Retrieval System for VLM-based Online Video Understanding

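[Quick read]: This paper addresses the excessive system overhead that arises when deployment constraints are overlooked in online video understanding applications of vision-language models (VLMs). Existing methods focus on improving VLM reasoning but give little consideration to efficiency bottlenecks in real deployments. The key to the solution is Venus, an on-device memory-and-retrieval system built on an edge-cloud disaggregated architecture. In the ingestion stage it continuously processes edge video streams through scene segmentation and clustering and uses a multimodal embedding model to build a hierarchical memory for efficient storage and retrieval. In the querying stage it selects keyframes with a threshold-based progressive sampling algorithm that preserves diversity while adaptively balancing system cost and reasoning accuracy. The design cuts response latency substantially (a 15x-131x speedup over state-of-the-art methods) while matching or exceeding their reasoning accuracy, enabling efficient real-time video understanding.

Link: https://arxiv.org/abs/2512.07344
Authors: Shengyuan Ye, Bei Ouyang, Tianyi Qian, Liekang Zeng, Mu Yuan, Xiaowen Chu, Weijie Hong, Xu Chen
Affiliations: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: Accepted by IEEE International Conference on Computer Communications 2026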

Abstract:Vision-language models (VLMs) have demonstrated impressive multimodal comprehension capabilities and are being deployed in an increasing number of online video understanding applications. While recent efforts extensively explore advancing VLMs’ reasoning power in these cases, deployment constraints are overlooked, leading to overwhelming system overhead in real-world deployments. To address that, we propose Venus, an on-device memory-and-retrieval system for efficient online video understanding. Venus proposes an edge-cloud disaggregated architecture that sinks memory construction and keyframe retrieval from cloud to edge, operating in two stages. In the ingestion stage, Venus continuously processes streaming edge videos via scene segmentation and clustering, where the selected keyframes are embedded with a multimodal embedding model to build a hierarchical memory for efficient storage and retrieval. In the querying stage, Venus indexes incoming queries from memory, and employs a threshold-based progressive sampling algorithm for keyframe selection that enhances diversity and adaptively balances system cost and reasoning accuracy. Our extensive evaluation shows that Venus achieves a 15x-131x speedup in total response latency compared to state-of-the-art methods, enabling real-time responses within seconds while maintaining comparable or even superior reasoning accuracy.

[AI-29] Local-Curvature-Aware Knowledge Graph Embedding: An Extended Ricci Flow Approach

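[Quick read]: This paper addresses the inability of existing knowledge graph embedding (KGE) methods, which model entities on a single homogeneous manifold (such as Euclidean, spherical, or hyperbolic space), to accommodate the sharply varying local curvature of real-world knowledge graphs. A mismatch between the predefined geometry and the graph’s local curvature distorts distances between entities and weakens the expressiveness of the embedding. The key to the solution is RicciKGE, which couples the KGE loss gradient with local curvature in an extended Ricci flow so that entity embeddings and the underlying manifold geometry co-evolve and adapt to each other. The theory shows that when the coupling coefficient is bounded and properly chosen, edge-wise curvatures decay exponentially toward Euclidean flatness and the KGE distances strictly converge to a global optimum, indicating that geometric flattening and embedding optimization reinforce each other.

Link: https://arxiv.org/abs/2512.07332
Authors: Zhengquan Luo, Guy Tadmor, Or Amar, David Zeevi, Zhiqiang Xu
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: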

Abstract:Knowledge graph embedding (KGE) relies on the geometry of the embedding space to encode semantic and structural relations. Existing methods place all entities on one homogeneous manifold, Euclidean, spherical, hyperbolic, or their product/multi-curvature variants, to model linear, symmetric, or hierarchical patterns. Yet a predefined, homogeneous manifold cannot accommodate the sharply varying curvature that real-world graphs exhibit across local regions. Since this geometry is imposed a priori, any mismatch with the knowledge graph’s local curvatures will distort distances between entities and hurt the expressiveness of the resulting KGE. To rectify this, we propose RicciKGE to have the KGE loss gradient coupled with local curvatures in an extended Ricci flow such that entity embeddings co-evolve dynamically with the underlying manifold geometry towards mutual adaptation. Theoretically, when the coupling coefficient is bounded and properly selected, we rigorously prove that i) all the edge-wise curvatures decay exponentially, meaning that the manifold is driven toward the Euclidean flatness; and ii) the KGE distances strictly converge to a global optimum, which indicates that geometric flattening and embedding optimization are promoting each other. Experimental improvements on link prediction and node classification benchmarks demonstrate RicciKGE’s effectiveness in adapting to heterogeneous knowledge graph structures.

[AI-30] M-STAR: Multi-Scale Spatiotemporal Autoregression for Human Mobility Modeling

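[Quick read]: This paper addresses the inefficiency of existing generative AI methods for long-term human mobility trajectory generation and their lack of explicit spatiotemporal multi-scale modeling. Current autoregressive and diffusion-based methods can generate single-day trajectories but hit performance bottlenecks on longer horizons such as weekly trajectories. The key to the solution is the Multi-Scale Spatio-Temporal AutoRegression (M-STAR) framework, which generates long-term trajectories efficiently through a coarse-to-fine spatiotemporal prediction process. Its core components are a Multi-scale Spatiotemporal Tokenizer that encodes hierarchical mobility patterns and a Transformer-based decoder that performs next-scale autoregressive prediction, which preserves trajectory fidelity while significantly increasing generation speed.

Link: https://arxiv.org/abs/2512.07314
Authors: Yuxiao Luo, Songming Zhang, Sijie Ruan, Siran Chen, Kang Liu, Yang Xu, Yu Zheng, Ling Yin
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: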

Abstract:Modeling human mobility is vital for extensive applications such as transportation planning and epidemic modeling. With the rise of the Artificial Intelligence Generated Content (AIGC) paradigm, recent works explore synthetic trajectory generation using autoregressive and diffusion models. While these methods show promise for generating single-day trajectories, they remain limited by inefficiencies in long-term generation (e.g., weekly trajectories) and a lack of explicit spatiotemporal multi-scale modeling. This study proposes Multi-Scale Spatio-Temporal AutoRegression (M-STAR), a new framework that generates long-term trajectories through a coarse-to-fine spatiotemporal prediction process. M-STAR combines a Multi-scale Spatiotemporal Tokenizer that encodes hierarchical mobility patterns with a Transformer-based decoder for next-scale autoregressive prediction. Experiments on two real-world datasets show that M-STAR outperforms existing methods in fidelity and significantly improves generation speed. The data and codes are available at this https URL.

[AI-31] DCO: Dynamic Cache Orchestration for LLM Accelerators through Predictive Management

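[Quick read]: This paper addresses the growing software-development burden that comes with increasingly complex AI accelerator designs as large language models (LLMs) spread rapidly, in particular the programming complexity of traditional multi-level scratchpad memories (SPMs) and their asynchronous management. The key to the solution is a shared system-level cache architecture combined with application-aware cache management policies: dataflow information available in the software stack guides cache replacement (including dead-block prediction), complemented by bypass decisions and mechanisms that mitigate cache thrashing, which keeps programming simple while substantially improving performance. Experiments show up to a 1.80x speedup over conventional cache architectures and strong results both with and without inter-core data sharing. An RTL implementation occupies 0.064 mm^2 in a 15 nm process and runs at 2 GHz, supporting the potential of shared-cache designs for next-generation AI accelerator systems.

Link: https://arxiv.org/abs/2512.07312
Authors: Zhongchun Zhou, Chengtao Lai, Yuhang Gu, Wei Zhang
Affiliations: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works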

Abstract:The rapid adoption of large language models (LLMs) is pushing AI accelerators toward increasingly powerful and specialized designs. Instead of further complicating software development with deeply hierarchical scratchpad memories (SPMs) and their asynchronous management, we investigate the opposite point of the design spectrum: a multi-core AI accelerator equipped with a shared system-level cache and application-aware management policies, which keeps the programming effort modest. Our approach exploits dataflow information available in the software stack to guide cache replacement (including dead-block prediction), in concert with bypass decisions and mechanisms that alleviate cache thrashing. We assess the proposal using a cycle-accurate simulator and observe substantial performance gains (up to 1.80x speedup) compared with conventional cache architectures. In addition, we build and validate an analytical model that takes into account the actual overlapping behaviors to extend the measurement results of our policies to real-world larger-scale workloads. Experiment results show that when functioning together, our bypassing and thrashing mitigation strategies can handle scenarios both with and without inter-core data sharing and achieve remarkable speedups. Finally, we implement the design in RTL; its area is 0.064 mm^2 in a 15 nm process, and it can run at a 2 GHz clock frequency. Our findings explore the potential of the shared cache design to assist the development of future AI accelerator systems.

[AI-32] Radiance-Field Reinforced Pretraining: Scaling Localization Models with Unlabeled Wireless Signals

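[Quick read]: This paper addresses the poor cross-scene generalization of radio frequency (RF)-based indoor localization models, which stems from their heavy reliance on scene-specific labeled data. The key to the solution is a novel self-supervised pretraining framework, Radiance-Field Reinforced Pretraining (RFRP), which couples a large localization model (LM) with a neural radio-frequency radiance field (RF-NeRF) in an asymmetrical autoencoder architecture: the LM encodes received RF spectra into latent, position-relevant representations, and the RF-NeRF decodes them to reconstruct the original spectra. Aligning input and output in this way enables effective representation learning from large-scale unlabeled RF data and significantly improves localization accuracy in unseen scenes.

Link: https://arxiv.org/abs/2512.07309
Authors: Guosheng Wang, Shen Wang, Lei Yang
Affiliations: Unknown
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
Comments: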

Abstract:Radio frequency (RF)-based indoor localization offers significant promise for applications such as indoor navigation, augmented reality, and pervasive computing. While deep learning has greatly enhanced localization accuracy and robustness, existing localization models still face major challenges in cross-scene generalization due to their reliance on scene-specific labeled data. To address this, we introduce Radiance-Field Reinforced Pretraining (RFRP). This novel self-supervised pretraining framework couples a large localization model (LM) with a neural radio-frequency radiance field (RF-NeRF) in an asymmetrical autoencoder architecture. In this design, the LM encodes received RF spectra into latent, position-relevant representations, while the RF-NeRF decodes them to reconstruct the original spectra. This alignment between input and output enables effective representation learning using large-scale, unlabeled RF data, which can be collected continuously with minimal effort. To this end, we collected RF samples at 7,327,321 positions across 100 diverse scenes using four common wireless technologies–RFID, BLE, WiFi, and IIoT. Data from 75 scenes were used for training, and the remaining 25 for evaluation. Experimental results show that the RFRP-pretrained LM reduces localization error by over 40% compared to non-pretrained models and by 21% compared to those pretrained using supervised learning.

[AI-33] SIT-Graph: State Integrated Tool Graph for Multi-Turn Agents

Link: https://arxiv.org/abs/2512.07287
Authors: Sijia Li, Yuchen Huang, Zifan Liu, Zijian Li, Jingjing fu, Lei Song, Jiang Bian, Jun Zhang, Rui Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

[AI-34] SINRL: Socially Integrated Navigation with Reinforcement Learning using Spiking Neural Networks

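[Quick read]: This paper addresses key challenges in integrating autonomous mobile robots into human environments, namely achieving human-like decision-making and energy-efficient, event-based computation, and in particular the fact that neuromorphic methods are rarely used in Deep Reinforcement Learning (DRL) navigation because of unstable training. The key to the solution is a hybrid, socially integrated DRL actor-critic architecture that uses Spiking Neural Networks (SNNs) in the actor and Artificial Neural Networks (ANNs) in the critic, together with a neuromorphic feature extractor that captures the temporal dynamics of crowds and human-robot interactions. This improves social navigation performance and reduces estimated energy consumption by approximately 1.69 orders of magnitude.

Link: https://arxiv.org/abs/2512.07266
Authors: Florian Tretter, Daniel Flögel, Alexandru Vasilache, Max Grobbel, Jürgen Becker, Sören Hohmann
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 8 pages, 6 figures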

Abstract:Integrating autonomous mobile robots into human environments requires human-like decision-making and energy-efficient, event-based computation. Despite progress, neuromorphic methods are rarely applied to Deep Reinforcement Learning (DRL) navigation approaches due to unstable training. We address this gap with a hybrid socially integrated DRL actor-critic approach that combines Spiking Neural Networks (SNNs) in the actor with Artificial Neural Networks (ANNs) in the critic and a neuromorphic feature extractor to capture temporal crowd dynamics and human-robot interactions. Our approach enhances social navigation performance and reduces estimated energy consumption by approximately 1.69 orders of magnitude.
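
To make the "1.69 orders of magnitude" figure easier to read, it can be converted into a plain ratio. The snippet below is only arithmetic on the value quoted in the abstract; the variable names are illustrative and nothing here comes from the paper's code.

```python
# Convert "orders of magnitude" into a multiplicative factor.
orders_of_magnitude = 1.69  # value quoted in the abstract
reduction_factor = 10 ** orders_of_magnitude
remaining_fraction = 1 / reduction_factor

print(f"energy reduced by a factor of about {reduction_factor:.1f}")
print(f"remaining energy: about {remaining_fraction:.1%} of the baseline")
```

In other words, the estimate corresponds to roughly a 49-fold reduction, leaving about 2% of the baseline energy.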

[AI-35] IFFair: Influence Function-driven Sample Reweighting for Fair Classification

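[Quick read]: This paper addresses the problem that machine learning models deployed in practice can make discriminatory decisions against certain unprivileged groups because of latent bias in the training data, which harms social well-being and limits adoption of the technology. The key to the solution is IFFair, a pre-processing method based on the influence function that mitigates bias by dynamically adjusting training-sample weights, without modifying the network structure, data features, or decision boundaries. Specifically, IFFair uses the disparity in the influence of training samples on different groups as a guiding signal to optimize sample weights during training. It improves fairness across multiple metrics (such as demographic parity, equality of opportunity, and error rate parity) without conflicts, and achieves a better utility-fairness trade-off than previous pre-processing methods.

Link: https://arxiv.org/abs/2512.07249
Authors: Jingran Yang, Min Zhang, Lingfeng Zhang, Zhaohui Wang, Yonggang Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: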

Abstract:Because machine learning has significantly improved efficiency and convenience in society, it is increasingly used to assist or replace human decision-making. However, this data-driven paradigm leads the related algorithms to learn and even exacerbate potential bias in the samples, resulting in discriminatory decisions against certain unprivileged groups, depriving them of the right to equal treatment, damaging social well-being, and hindering the development of related applications. Therefore, we propose IFFair, a pre-processing method based on the influence function. Compared with other fairness optimization approaches, IFFair uses only the influence disparity of training samples on different groups as guidance to dynamically adjust the sample weights during training, without modifying the network structure, data features, or decision boundaries. To evaluate the validity of IFFair, we conduct experiments on multiple real-world datasets and metrics. The experimental results show that our approach mitigates bias on multiple accepted metrics in the classification setting, including demographic parity, equalized odds, equality of opportunity, and error rate parity, without conflicts. They also demonstrate that IFFair achieves a better trade-off between multiple utility and fairness metrics than previous pre-processing methods.
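
For readers unfamiliar with the fairness metrics named in the abstract, here is a minimal sketch of two of them using their common textbook definitions for a binary group attribute. It is not IFFair itself and does not compute influence functions; the function names and the toy data are assumptions made for illustration.

```python
def positive_rate(preds, mask):
    """Fraction of positive predictions among the samples selected by mask."""
    selected = [p for p, m in zip(preds, mask) if m]
    return sum(selected) / len(selected)

def demographic_parity_gap(preds, groups):
    """|P(yhat=1 | group 0) - P(yhat=1 | group 1)|."""
    r0 = positive_rate(preds, [g == 0 for g in groups])
    r1 = positive_rate(preds, [g == 1 for g in groups])
    return abs(r0 - r1)

def equal_opportunity_gap(preds, labels, groups):
    """Gap in true-positive rate between the two groups."""
    r0 = positive_rate(preds, [g == 0 and y == 1 for g, y in zip(groups, labels)])
    r1 = positive_rate(preds, [g == 1 and y == 1 for g, y in zip(groups, labels)])
    return abs(r0 - r1)

# Toy data (made up for illustration): 8 samples, binary group attribute.
preds  = [1, 1, 0, 1, 0, 0, 1, 0]
labels = [1, 0, 0, 1, 1, 0, 1, 0]
groups = [0, 0, 0, 0, 1, 1, 1, 1]

dp_gap = demographic_parity_gap(preds, groups)
eo_gap = equal_opportunity_gap(preds, labels, groups)
print(dp_gap, eo_gap)
```

A reweighting method of the kind described would aim to shrink gaps like these while keeping accuracy acceptable.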

[AI-36] Cross-platform Product Matching Based on Entity Alignment of Knowledge Graph with RAEA model

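[Quick read]: This paper addresses cross-platform product matching, that is, identifying identical or similar products sold on different e-commerce platforms such as eBay and Amazon. By building knowledge graphs (KGs), the problem is converted into an Entity Alignment (EA) task, whose core challenge is making effective use of the interactions between attribute triples and relation triples, which existing methods exploit inadequately. The key to the solution is a two-stage pipeline: a rough filter for initial screening, followed by a fine filter built on a new EA framework, RAEA (Relation-aware and Attribute-aware Graph Attention Networks for Entity Alignment). RAEA jointly models alignment signals from attributes and relations through an Attribute-aware Entity Encoder and Relation-aware Graph Attention Networks, strengthening entity representations. It clearly outperforms 12 baselines on the cross-lingual dataset DBP15K, with an average Hits@1 gain of 6.59%, and delivers competitive results on the monolingual dataset DWY100K.

Link: https://arxiv.org/abs/2512.07232
Authors: Wenlong Liu, Jiahua Pan, Xingyu Zhang, Xinxin Gong, Yang Ye, Xujin Zhao, Xin Wang, Kent Wu, Hua Xiang, Houmin Yan, Qingpeng Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures, published on World Wide Web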

Abstract:Product matching aims to identify identical or similar products sold on different platforms. By building knowledge graphs (KGs), the product matching problem can be converted to the Entity Alignment (EA) task, which aims to discover the equivalent entities from diverse KGs. The existing EA methods inadequately utilize both attribute triples and relation triples simultaneously, especially the interactions between them. This paper introduces a two-stage pipeline consisting of rough filter and fine filter to match products from eBay and Amazon. For fine filtering, a new framework for Entity Alignment, Relation-aware and Attribute-aware Graph Attention Networks for Entity Alignment (RAEA), is employed. RAEA focuses on the interactions between attribute triples and relation triples, where the entity representation aggregates the alignment signals from attributes and relations with Attribute-aware Entity Encoder and Relation-aware Graph Attention Networks. The experimental results indicate that the RAEA model achieves significant improvements over 12 baselines on EA task in the cross-lingual dataset DBP15K (6.59% on average Hits@1) and delivers competitive results in the monolingual dataset DWY100K. The source code for experiments on DBP15K and DWY100K is available at github (this https URL).
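
The Hits@1 figure above refers to a standard ranking metric for entity alignment. A minimal sketch of Hits@k follows, assuming each source entity comes with a ranked list of candidate matches; the product IDs and rankings are invented, and this is not the RAEA evaluation code.

```python
def hits_at_k(ranked_candidates, gold, k=1):
    """Share of source entities whose gold counterpart is in the top-k list."""
    hits = 0
    for source, candidates in ranked_candidates.items():
        if gold[source] in candidates[:k]:
            hits += 1
    return hits / len(ranked_candidates)

# Hypothetical product IDs; the rankings are invented for illustration.
ranked_candidates = {
    "ebay:1": ["amz:a", "amz:b", "amz:c"],
    "ebay:2": ["amz:c", "amz:b", "amz:a"],
    "ebay:3": ["amz:c", "amz:a", "amz:b"],
    "ebay:4": ["amz:d", "amz:a", "amz:b"],
}
gold = {"ebay:1": "amz:a", "ebay:2": "amz:b", "ebay:3": "amz:c", "ebay:4": "amz:b"}

hits1 = hits_at_k(ranked_candidates, gold, k=1)
hits3 = hits_at_k(ranked_candidates, gold, k=3)
print(hits1, hits3)
```

Hits@1 is the strictest variant: only an exact top-ranked match counts.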

[AI-37] Sample from What You See: Visuomotor Policy Learning via Diffusion Bridge with Observation-Embedded Stochastic Differential Equation

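[Quick read]: This paper addresses a weakness of current diffusion-based imitation learning for robotic control: observations are used only as high-level conditioning inputs rather than being integrated into the stochastic dynamics of the diffusion process itself, which weakens the coupling between perception and control and hurts control precision and reliability. The key to the solution is BridgePolicy, which embeds observations into the stochastic differential equation through a diffusion-bridge formulation and constructs an observation-informed trajectory, so that sampling starts from an informative prior rather than random Gaussian noise. This substantially improves the precision and robustness of the generated policy. The paper also designs a multi-modal fusion module and a semantic aligner that unify visual and state inputs and align observation and action representations, making the method applicable to heterogeneous robot data.

Link: https://arxiv.org/abs/2512.07212
Authors: Zhaoyang Liu, Mokai Pan, Zhongyi Wang, Kaizhen Zhu, Haotao Lu, Jingya Wang, Ye Shi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: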

Abstract:Imitation learning with diffusion models has advanced robotic control by capturing multi-modal action distributions. However, existing approaches typically treat observations as high-level conditioning inputs to the denoising network, rather than integrating them into the stochastic dynamics of the diffusion process itself. As a result, sampling must begin from random Gaussian noise, weakening the coupling between perception and control and often yielding suboptimal performance. We introduce BridgePolicy, a generative visuomotor policy that explicitly embeds observations within the stochastic differential equation via a diffusion-bridge formulation. By constructing an observation-informed trajectory, BridgePolicy enables sampling to start from a rich, informative prior rather than random noise, substantially improving precision and reliability in control. A key challenge is that classical diffusion bridges connect distributions with matched dimensionality, whereas robotic observations are heterogeneous and multi-modal and do not naturally align with the action space. To address this, we design a multi-modal fusion module and a semantic aligner that unify visual and state inputs and align observation and action representations, making the bridge applicable to heterogeneous robot data. Extensive experiments across 52 simulation tasks on three benchmarks and five real-world tasks demonstrate that BridgePolicy consistently outperforms state-of-the-art generative policies.

[AI-38] Geometric Prior-Guided Federated Prompt Calibration

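[Quick read]: This paper addresses the performance degradation of Federated Prompt Learning (FPL) under data heterogeneity, where the core challenge is that locally trained prompts become biased by skewed local data distributions. Existing methods rely mainly on aggregation or regularization and do not address this local training bias at its root. The paper proposes Geometry-Guided Text Prompt Calibration (GGTPC), whose key innovation is a global geometric prior that captures the covariance structure of the global data distribution and is reconstructed on the server in a privacy-preserving manner. Clients then use a novel Geometry-Prior Calibration Layer (GPCL) to align their local feature distributions with this global prior during training, directly correcting the local bias. This improves the robustness and generalization of federated learning under extreme data skew.

Link: https://arxiv.org/abs/2512.07208
Authors: Fei Luo, Ziwei Zhao, Mingxuan Wang, Duoyang Li, Zhe Qian, Jiayi Tuo, Chenyue Zhou, Yanbiao Ma
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: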

Abstract:Federated Prompt Learning (FPL) offers a parameter-efficient solution for collaboratively training large models, but its performance is severely hindered by data heterogeneity, which causes locally trained prompts to become biased. Existing methods, focusing on aggregation or regularization, fail to address this root cause of local training bias. To this end, we propose Geometry-Guided Text Prompt Calibration (GGTPC), a novel framework that directly corrects this bias by providing clients with a global geometric prior. This prior, representing the shape of the global data distribution derived from the covariance matrix, is reconstructed on the server in a privacy-preserving manner. Clients then use a novel Geometry-Prior Calibration Layer (GPCL) to align their local feature distributions with this global prior during training. Extensive experiments show GGTPC’s effectiveness. On the label-skewed CIFAR-100 dataset (\beta = 0.1), it outperforms the state-of-the-art by 2.15%. Under extreme skew (\beta = 0.01), it improves upon the baseline by 9.17%. Furthermore, as a plug-and-play module on the domain-skewed Office-Home dataset, it boosts FedAvg’s performance by 4.60%. These results demonstrate that GGTPC effectively mitigates data heterogeneity by correcting the fundamental local training bias, serving as a versatile module to enhance various FL algorithms.

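[AI-39] ContextualSHAP: Enhancing SHAP Explanations Through Contextual Language Generation

[Quick read]: This paper addresses the problem that existing Explainable Artificial Intelligence (XAI) methods such as SHAP (SHapley Additive exPlanations) visualize feature importance effectively but often lack meaningful contextual explanations for end-users without a technical background, which makes them hard to understand. The key to the solution is a Python extension package that integrates SHAP with a Large Language Model (LLM), specifically OpenAI's GPT. Guided by user-defined parameters (such as feature aliases, descriptions, and background information), it generates textual explanations tailored to the model context and the user's perspective, improving the understandability and contextual relevance of the explanations.

Link: https://arxiv.org/abs/2512.07178
Authors: Latifa Dwiyanti, Sergio Ryan Wibisono, Hidetaka Nambo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: This paper was accepted and presented at the 7th World Symposium on Software Engineering (WSSE) 2025 on 25 October 2025 in Okayama, Japan, and is currently awaiting publication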

Abstract:Explainable Artificial Intelligence (XAI) has become an increasingly important area of research, particularly as machine learning models are deployed in high-stakes domains. Among various XAI approaches, SHAP (SHapley Additive exPlanations) has gained prominence due to its ability to provide both global and local explanations across different machine learning models. While SHAP effectively visualizes feature importance, it often lacks contextual explanations that are meaningful for end-users, especially those without technical backgrounds. To address this gap, we propose a Python package that extends SHAP by integrating it with a large language model (LLM), specifically OpenAI’s GPT, to generate contextualized textual explanations. This integration is guided by user-defined parameters (such as feature aliases, descriptions, and additional background) to tailor the explanation to both the model context and the user perspective. We hypothesize that this enhancement can improve the perceived understandability of SHAP explanations. To evaluate the effectiveness of the proposed package, we applied it in a healthcare-related case study and conducted user evaluations involving real end-users. The results, based on Likert-scale surveys and follow-up interviews, indicate that the generated explanations were perceived as more understandable and contextually appropriate compared to visual-only outputs. While the findings are preliminary, they suggest that combining visualization with contextualized text may support more user-friendly and trustworthy model explanations.
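
The idea of steering an LLM with feature aliases, descriptions, and background text can be sketched as a simple prompt builder. This is a hypothetical illustration, not the package's actual API: the function name, parameters, feature names, and attribution values are all assumptions, and no LLM or SHAP call is made.

```python
def build_explanation_prompt(shap_values, aliases, descriptions, background, top_k=3):
    """Turn per-feature attributions into a plain-text prompt for an LLM."""
    # Rank features by absolute contribution, largest first.
    ranked = sorted(shap_values.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top_k]
    lines = [f"Context: {background}", "Feature contributions to this prediction:"]
    for name, value in ranked:
        direction = "raises" if value > 0 else "lowers"
        label = aliases.get(name, name)
        detail = descriptions.get(name, "no description provided")
        lines.append(f"- {label} ({detail}) {direction} the predicted risk by {abs(value):.2f}")
    lines.append("Explain these contributions in plain language for a non-technical reader.")
    return "\n".join(lines)

# Invented attribution values for a hypothetical healthcare model.
shap_values = {"bmi": 0.31, "age": -0.12, "glucose": 0.45, "sbp": 0.02}
aliases = {"bmi": "Body mass index", "glucose": "Blood glucose"}
descriptions = {"glucose": "fasting level in mg/dL"}
prompt = build_explanation_prompt(
    shap_values, aliases, descriptions, background="Diabetes risk screening"
)
print(prompt)
```

The resulting text would then be sent to whichever LLM the user has configured; that step is deliberately left out here.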

[AI-40] JEPA as a Neural Tokenizer: Learning Robust Speech Representations with Density Adaptive Attention NEURIPS2025

Link: https://arxiv.org/abs/2512.07168
Authors: Georgios Ioannides, Christos Constantinou, Aman Chadha, Aaron Elkins, Linsey Pang, Ravid Shwartz-Ziv, Yann LeCun
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: UniReps: Unifying Representations in Neural Models (NeurIPS 2025 Workshop)

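[AI-41] RisConFix: LLM-based Automated Repair of Risk-Prone Drone Configurations

[Quick read]: This paper addresses flight instability in drone flight control software caused by unsuitable combinations of configuration parameters. Such problems can occur even when recommended values are used, and they reduce the drone's robustness. The key to the solution is RisConFix, a real-time repair method based on a Large Language Model (LLM). It continuously monitors the flight state and automatically triggers a repair mechanism when abnormal behavior is detected. The mechanism uses the LLM to analyze the relationship between configuration parameters and flight states and generates corrective parameters to restore stability. Repair is iterative: the updated configuration is re-checked, and further repair cycles run until the flight state returns to normal.

Link: https://arxiv.org/abs/2512.07122
Authors: Liping Han, Tingting Nie, Le Yu, Mingzhe Hu, Tao Yue
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: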

Abstract:Flight control software is typically designed with numerous configurable parameters governing multiple functionalities, enabling flexible adaptation to mission diversity and environmental uncertainty. Although developers and manufacturers usually provide recommendations for these parameters to ensure safe and stable operations, certain combinations of parameters with recommended values may still lead to unstable flight behaviors, thereby degrading the drone’s robustness. To this end, we propose a Large Language Model (LLM) based approach for real-time repair of risk-prone configurations (named RisConFix) that degrade drone robustness. RisConFix continuously monitors the drone’s operational state and automatically triggers a repair mechanism once abnormal flight behaviors are detected. The repair mechanism leverages an LLM to analyze relationships between configuration parameters and flight states, and then generates corrective parameter updates to restore flight stability. To ensure the validity of the updated configuration, RisConFix operates as an iterative process; it continuously monitors the drone’s flight state and, if an anomaly persists after applying an update, automatically triggers the next repair cycle. We evaluated RisConFix through a case study of ArduPilot (with 1,421 groups of misconfigurations). Experimental results show that RisConFix achieved a best repair success rate of 97% and an optimal average number of repairs of 1.17, demonstrating its capability to effectively and efficiently repair risk-prone configurations in real time.
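
The monitor-and-repair cycle described in the abstract can be sketched as a small loop. Everything below is a stand-in: the stability check, the rule-based "advisor" that takes the place of the LLM, and the parameter names `gain_p` and `filter_hz` are hypothetical and are not real ArduPilot parameters or RisConFix code.

```python
def repair_loop(config, is_stable, propose_fix, max_cycles=5):
    """Apply proposed updates until the flight state is stable or cycles run out."""
    cycles = 0
    while not is_stable(config) and cycles < max_cycles:
        config = {**config, **propose_fix(config)}  # apply corrective update
        cycles += 1
    return config, cycles, is_stable(config)

# Stand-ins: a toy stability rule and a rule-based advisor replacing the LLM.
def toy_is_stable(config):
    return config["gain_p"] <= 0.2 and config["filter_hz"] >= 20

def toy_propose_fix(config):
    if config["gain_p"] > 0.2:
        return {"gain_p": round(config["gain_p"] / 2, 3)}
    return {"filter_hz": 20}

risky = {"gain_p": 0.6, "filter_hz": 10}
fixed, cycles, ok = repair_loop(risky, toy_is_stable, toy_propose_fix)
print(fixed, cycles, ok)
```

The `max_cycles` cap is a design choice of this sketch, added so the loop always terminates even if no proposed update restores stability.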

[AI-42] FOAM: Blocked State Folding for Memory-Efficient LLM Training

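[Quick read]: This paper addresses the memory bottleneck in training Large Language Models (LLMs) caused by optimizer states (for example, those of Adam) occupying large amounts of GPU memory. Existing memory-efficient methods such as Singular Value Decomposition (SVD), projections, or weight freezing often introduce substantial computational overhead, extra memory consumption, or performance degradation. The key to the solution is the Folded Optimizer with Approximate Moment (FOAM), which compresses optimizer states by computing block-wise gradient means and adds a residual correction to recover lost information. Theoretically, FOAM achieves a convergence rate equivalent to vanilla Adam under standard non-convex optimization settings. Empirically, it reduces total training memory by approximately 50%, eliminates up to 90% of optimizer state memory overhead, and accelerates convergence. It is also compatible with other memory-efficient optimizers, with performance and throughput that match or surpass both full-rank and existing memory-efficient baselines.

Link: https://arxiv.org/abs/2512.07112
Authors: Ziqing Wen, Jiahuan Wang, Ping Luo, Dongsheng Li, Tao Sun
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: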

Abstract:Large language models (LLMs) have demonstrated remarkable performance due to their large parameter counts and extensive training data. However, their scale leads to significant memory bottlenecks during training, especially when using memory-intensive optimizers like Adam. Existing memory-efficient approaches often rely on techniques such as singular value decomposition (SVD), projections, or weight freezing, which can introduce substantial computational overhead, require additional memory for projections, or degrade model performance. In this paper, we propose Folded Optimizer with Approximate Moment (FOAM), a method that compresses optimizer states by computing block-wise gradient means and incorporates a residual correction to recover lost information. Theoretically, FOAM achieves convergence rates equivalent to vanilla Adam under standard non-convex optimization settings. Empirically, FOAM reduces total training memory by approximately 50%, eliminates up to 90% of optimizer state memory overhead, and accelerates convergence. Furthermore, FOAM is compatible with other memory-efficient optimizers, delivering performance and throughput that match or surpass both full-rank and existing memory-efficient baselines.
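
To illustrate what "block-wise gradient means" compress away, here is a minimal sketch that folds a gradient vector into per-block means and measures the residual that is lost. It assumes the vector length is a multiple of the block size, and it does not reproduce FOAM's optimizer update or its actual residual-correction rule, which the abstract does not specify.

```python
def fold(values, block_size):
    """Compress a vector to one mean per block of `block_size` entries."""
    return [sum(values[i:i + block_size]) / block_size
            for i in range(0, len(values), block_size)]

def unfold(block_means, block_size):
    """Expand block means back to the original length by repetition."""
    return [m for m in block_means for _ in range(block_size)]

grad = [1.0, 3.0, -2.0, 0.0, 4.0, 4.0, 5.0, -1.0]  # toy gradient
block_size = 4
folded = fold(grad, block_size)        # 2 numbers stored instead of 8
approx = unfold(folded, block_size)
residual = [g - a for g, a in zip(grad, approx)]
print(folded, residual)
```

With a block size of 4, the folded state needs a quarter of the storage, and the residual within each block sums to zero by construction.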

[AI-43] VIGIL: A Reflective Runtime for Self-Healing Agents

Link: https://arxiv.org/abs/2512.07094
Authors: Christopher Cruz
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

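[AI-44] The Geometry of Persona: Disentangling Personality from Reasoning in Large Language Models

[Quick read]: This paper addresses the stability-plasticity dilemma in deploying personalized Large Language Models (LLMs): adapting to a user's personality without damaging the model's general reasoning ability. Conventional alignment methods such as Supervised Fine-Tuning (SFT) rely on stochastic weight updates and often incur an "alignment tax" that degrades performance. The core of the solution is a framework called the Soul Engine, based on the Linear Representation Hypothesis, which posits that personality traits exist as orthogonal linear subspaces. Using SoulBench, a dataset built through dynamic contextual sampling, and a dual-head architecture on a frozen backbone, it extracts disentangled personality vectors without modifying the original model weights. The method reports high-precision personality profiling, evidence of geometric orthogonality, and deterministic behavior control, and is presented as a mathematically grounded alternative to fine-tuning for safe, controllable AI personalization.

Link: https://arxiv.org/abs/2512.07092
Authors: Zhixiang Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 figures, 1 table. Code and dataset available at this https URL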

Abstract:Background: The deployment of personalized Large Language Models (LLMs) is currently constrained by the stability-plasticity dilemma. Prevailing alignment methods, such as Supervised Fine-Tuning (SFT), rely on stochastic weight updates that often incur an “alignment tax” – degrading general reasoning capabilities. Methods: We propose the Soul Engine, a framework based on the Linear Representation Hypothesis, which posits that personality traits exist as orthogonal linear subspaces. We introduce SoulBench, a dataset constructed via dynamic contextual sampling. Using a dual-head architecture on a frozen Qwen-2.5 base, we extract disentangled personality vectors without modifying the backbone weights. Results: Our experiments demonstrate three breakthroughs. First, High-Precision Profiling: The model achieves a Mean Squared Error (MSE) of 0.011 against psychological ground truth. Second, Geometric Orthogonality: t-SNE visualization confirms that personality manifolds are distinct and continuous, allowing for “Zero-Shot Personality Injection” that maintains original model intelligence. Third, Deterministic Steering: We achieve robust control over behavior via vector arithmetic, validated through extensive ablation studies. Conclusion: This work challenges the necessity of fine-tuning for personalization. By transitioning from probabilistic prompting to deterministic latent intervention, we provide a mathematically rigorous foundation for safe, controllable AI personalization.
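
"Steering via vector arithmetic" on orthogonal trait directions can be illustrated with a toy example: adding a scaled trait vector to a hidden state changes its projection on that trait and leaves the projection on an orthogonal trait untouched. The 4-dimensional vectors, trait names, and `strength` parameter are invented for illustration and are not taken from the Soul Engine code.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def steer(hidden, direction, strength):
    """Shift a hidden state along a trait direction: h' = h + strength * v."""
    return [h + strength * v for h, v in zip(hidden, direction)]

# Toy 4-d vectors, invented for illustration.
extraversion = [1.0, 0.0, 0.0, 0.0]
agreeableness = [0.0, 1.0, 0.0, 0.0]
hidden = [0.2, -0.4, 0.7, 0.1]

steered = steer(hidden, extraversion, strength=2.0)
print(dot(steered, extraversion), dot(steered, agreeableness))
```

Whether real personality directions in an LLM are this cleanly orthogonal is the empirical claim the paper makes; the sketch only shows why orthogonality would allow independent control.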

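[AI-45] ThinkTrap: Denial-of-Service Attacks against Black-box LLM Services via Infinite Thinking NDSS2026

[Quick read]: This paper studies a new kind of Denial-of-Service (DoS) attack facing Generative AI deployed as cloud services: attackers craft inputs that drive Large Language Models (LLMs) into infinite or extremely long reasoning loops, exhausting backend compute resources and degrading availability for legitimate users. The key contribution is ThinkTrap, an input-space optimization framework whose core idea is to map discrete tokens into a continuous embedding space and then perform efficient black-box optimization in a low-dimensional subspace that exploits input sparsity. With minimal token overhead, it identifies adversarial prompts that induce extended or non-terminating generation across several mainstream closed-source LLM services, demonstrating an effective DoS attack against them.

Link: https://arxiv.org/abs/2512.07086
Authors: Yunzhe Li, Jianan Wang, Hongzi Zhu, James Lin, Shan Chang, Minyi Guo
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This version includes the final camera-ready manuscript accepted by NDSS 2026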

Abstract:Large Language Models (LLMs) have become foundational components in a wide range of applications, including natural language understanding and generation, embodied intelligence, and scientific discovery. As their computational requirements continue to grow, these models are increasingly deployed as cloud-based services, allowing users to access powerful LLMs via the Internet. However, this deployment model introduces a new class of threat: denial-of-service (DoS) attacks via unbounded reasoning, where adversaries craft specially designed inputs that cause the model to enter excessively long or infinite generation loops. These attacks can exhaust backend compute resources, degrading or denying service to legitimate users. To mitigate such risks, many LLM providers adopt a closed-source, black-box setting to obscure model internals. In this paper, we propose ThinkTrap, a novel input-space optimization framework for DoS attacks against LLM services even in black-box environments. The core idea of ThinkTrap is to first map discrete tokens into a continuous embedding space, then undertake efficient black-box optimization in a low-dimensional subspace exploiting input sparsity. The goal of this optimization is to identify adversarial prompts that induce extended or non-terminating generation across several state-of-the-art LLMs, achieving DoS with minimal token overhead. We evaluate the proposed attack across multiple commercial, closed-source LLM services. Our results demonstrate that, even far under the restrictive request frequency limits commonly enforced by these platforms, typically capped at ten requests per minute (10 RPM), the attack can degrade service throughput to as low as 1% of its original capacity, and in some cases, induce complete service failure.

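[AI-46] ClinNoteAgents: An LLM Multi-Agent System for Predicting and Interpreting Heart Failure 30-Day Readmission from Clinical Notes

[Quick read]: This paper addresses the long-standing underuse of clinical notes, an unstructured data resource, in predicting 30-day readmission risk for heart failure (HF) patients. Traditional methods rely on expert-crafted rules, medical thesauri, and ontologies to parse notes, but notes are written under time pressure and contain misspellings, abbreviations, and domain-specific jargon, which limits effectiveness. The key to the solution is ClinNoteAgents, a multi-agent framework based on a Large Language Model (LLM) that automatically converts free-text clinical notes into two kinds of information: (1) structured representations of clinical and social risk factors for association analysis, and (2) clinician-style abstractions for HF 30-day readmission prediction. The approach reduces reliance on structured fields and minimizes manual annotation and model training, providing scalable and interpretable risk modeling for data-limited healthcare systems.

Link: https://arxiv.org/abs/2512.07081
Authors: Rongjia Zhou, Chengzhuo Li, Carl Yang, Jiaying Lu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 2 figures. Submitted to AMIA 2026 Informatics Summit Student Paper Track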

Abstract:Heart failure (HF) is one of the leading causes of rehospitalization among older adults in the United States. Although clinical notes contain rich, detailed patient information and make up a large portion of electronic health records (EHRs), they remain underutilized for HF readmission risk analysis. Traditional computational models for HF readmission often rely on expert-crafted rules, medical thesauri, and ontologies to interpret clinical notes, which are typically written under time pressure and may contain misspellings, abbreviations, and domain-specific jargon. We present ClinNoteAgents, an LLM-based multi-agent framework that transforms free-text clinical notes into (1) structured representations of clinical and social risk factors for association analysis and (2) clinician-style abstractions for HF 30-day readmission prediction. We evaluate ClinNoteAgents on 3,544 notes from 2,065 patients (readmission rate=35.16%), demonstrating strong performance in extracting risk factors from free-text, identifying key contributing factors, and predicting readmission risk. By reducing reliance on structured fields and minimizing manual annotation and model training, ClinNoteAgents provides a scalable and interpretable approach to note-based HF readmission risk modeling in data-limited healthcare systems.

[AI-47] Procrustean Bed for AI-Driven Retrosynthesis: A Unified Framework for Reproducible Evaluation

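[Quick read]: This paper addresses core problems in Computer-Aided Synthesis Planning (CASP): the lack of a unified evaluation infrastructure and the reliance on metrics that favor topological completion over chemical validity. The key to the solution is RetroCast, a unified evaluation suite that standardizes heterogeneous model outputs into a common format so that they can be compared directly and with statistical rigor. It comes with a reproducible benchmarking pipeline that uses stratified sampling and bootstrapped confidence intervals, and with SynthArena, an interactive platform for qualitative route inspection. Using this infrastructure, the paper shows a disconnect between "solvability" (stock-termination rate) and route quality, and a performance decay of search-based methods when reconstructing long-range synthetic routes (the "complexity cliff").

Link: https://arxiv.org/abs/2512.07079
Authors: Anton Morgunov, Victor S. Batista
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages + 7 pages of SI. RetroCast is available on GitHub, see this https URL. SynthArena is publicly available, see this https URL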

Abstract:Progress in computer-aided synthesis planning (CASP) is obscured by the lack of standardized evaluation infrastructure and the reliance on metrics that prioritize topological completion over chemical validity. We introduce RetroCast, a unified evaluation suite that standardizes heterogeneous model outputs into a common schema to enable statistically rigorous, apples-to-apples comparison. The framework includes a reproducible benchmarking pipeline with stratified sampling and bootstrapped confidence intervals, accompanied by SynthArena, an interactive platform for qualitative route inspection. We utilize this infrastructure to evaluate leading search-based and sequence-based algorithms on a new suite of standardized benchmarks. Our analysis reveals a divergence between “solvability” (stock-termination rate) and route quality; high solvability scores often mask chemical invalidity or fail to correlate with the reproduction of experimental ground truths. Furthermore, we identify a “complexity cliff” in which search-based methods, despite high solvability rates, exhibit a sharp performance decay in reconstructing long-range synthetic plans compared to sequence-based approaches. We release the full framework, benchmark definitions, and a standardized database of model predictions to support transparent and reproducible development in the field.
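
The "bootstrapped confidence intervals" mentioned in the abstract can be sketched with a plain percentile bootstrap over per-target outcomes. This is a generic illustration under assumed settings (2000 resamples, a 95% interval, invented 0/1 outcomes); it is not RetroCast's pipeline and omits its stratified sampling.

```python
import random

def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `outcomes`."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Invented per-target outcomes: 1 = route found, 0 = not found.
outcomes = [1] * 70 + [0] * 30
point = sum(outcomes) / len(outcomes)
low, high = bootstrap_ci(outcomes)
print(f"solve rate {point:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

Reporting the interval alongside the point estimate makes it clearer when two models' scores are not meaningfully different.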

[AI-48] Self-Supervised Learning on Molecular Graphs: A Systematic Investigation of Masking Design

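Quick read: This paper addresses the lack of theoretical grounding and systematic evaluation behind design choices in masking-based self-supervised learning (SSL) for molecular representation learning, in particular the unclear interplay between masking strategy, prediction target, and encoder architecture. The key to its solution is a unified probabilistic framework that formalizes the full pretrain-finetune workflow, on top of which controlled experiments examine three core design dimensions: masking distribution, prediction target, and encoder architecture. The study finds that uniform sampling is no worse than sophisticated masking distributions on node-level tasks; what actually drives performance is the semantic richness of the prediction target and its synergy with Graph Transformer encoders, which yields interpretable and practical design guidelines for SSL on molecular graphs.

Link: https://arxiv.org/abs/2512.07064
Authors: Jiannan Yang, Veronika Thost, Tengfei Ma
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: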

Abstract:Self-supervised learning (SSL) plays a central role in molecular representation learning. Yet, many recent innovations in masking-based pretraining are introduced as heuristics and lack principled evaluation, obscuring which design choices are genuinely effective. This work casts the entire pretrain-finetune workflow into a unified probabilistic framework, enabling a transparent comparison and deeper understanding of masking strategies. Building on this formalism, we conduct a controlled study of three core design dimensions: masking distribution, prediction target, and encoder architecture, under rigorously controlled settings. We further employ information-theoretic measures to assess the informativeness of pretraining signals and connect them to empirically benchmarked downstream performance. Our findings reveal a surprising insight: sophisticated masking distributions offer no consistent benefit over uniform sampling for common node-level prediction tasks. Instead, the choice of prediction target and its synergy with the encoder architecture are far more critical. Specifically, shifting to semantically richer targets yields substantial downstream improvements, particularly when paired with expressive Graph Transformer encoders. These insights offer practical guidance for developing more effective SSL methods for molecular graphs.

[AI-49] A Comprehensive Study of Supervised Machine Learning Models for Zero-Day Attack Detection: Analyzing Performance on Imbalanced Data

Link: https://arxiv.org/abs/2512.07030
Authors: Zahra Lotfi, Mostafa Lotfi
Affiliations: unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 5 figures

[AI-50] Reformulate Retrieve Localize: Agents for Repository-Level Bug Localization

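Quick read: This paper addresses inefficient bug localization in large-scale software repositories, and in particular the limited retrieval accuracy of traditional information retrieval-based bug localization (IRBL), which relies on unprocessed, noisy bug descriptions. The key to its solution is an agent driven by a large language model (LLM) that applies lightweight query reformulation and summarization before retrieval, extracting structured information from the original bug report and refining it semantically to improve BM25 retrieval for file-level bug localization. Experiments show that the agent ranks the first relevant file 35% better than the BM25 baseline and improves file retrieval by up to 22% over SWE-agent.

Link: https://arxiv.org/abs/2512.07022
Authors: Genevieve Caumartin, Glaucia Melo
Affiliations: unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted at BoatSE 2026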

Abstract:Bug localization remains a critical yet time-consuming challenge in large-scale software repositories. Traditional information retrieval-based bug localization (IRBL) methods rely on unchanged bug descriptions, which often contain noisy information, leading to poor retrieval accuracy. Recent advances in large language models (LLMs) have improved bug localization through query reformulation, yet the effect on agent performance remains unexplored. In this study, we investigate how an LLM-powered agent can improve file-level bug localization via lightweight query reformulation and summarization. We first employ an open-source, non-fine-tuned LLM to extract key information from bug reports, such as identifiers and code snippets, and reformulate queries pre-retrieval. Our agent then orchestrates BM25 retrieval using these preprocessed queries, automating localization workflow at scale. Using the best-performing query reformulation technique, our agent achieves 35% better ranking in first-file retrieval than our BM25 baseline and up to +22% file retrieval performance over SWE-agent.
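
To make the retrieval step concrete, here is a minimal sketch of Okapi BM25 scoring over a toy set of files, written from the standard formula. The LLM reformulation step is stubbed out: `reformulated` is a hand-written stand-in for what a reformulation model might return, and the file names and tokens are invented for illustration. None of this is the authors' code.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy repository: file path -> tokens (hypothetical).
files = {
    "auth/login.py": "def login user password check session token".split(),
    "db/pool.py": "connection pool timeout retry database".split(),
    "ui/button.py": "render button click handler".split(),
}
# Stand-in for an LLM-reformulated query derived from a noisy bug report.
reformulated = "database connection timeout pool".split()

names = list(files)
docs = list(files.values())
ranked = sorted(zip(bm25_scores(reformulated, docs), names), reverse=True)
best_file = ranked[0][1]
print(ranked)
```

The intuition behind reformulation is that dropping filler words and keeping identifiers concentrates the query on terms with high inverse document frequency, which is what BM25 rewards.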

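[AI-51] Transferring Clinical Knowledge into ECGs Representation

Quick read: This paper addresses the low clinical trust and poor interpretability caused by the black-box nature of deep learning models for electrocardiogram (ECG) classification. The key to its solution is a three-stage training paradigm that transfers knowledge from multimodal clinical data (laboratory exams, vital signs, and biometrics) into a single ECG encoder. A self-supervised joint-embedding pre-training stage builds an ECG representation enriched with clinical context while requiring only the ECG signal at inference time. The model is also trained to predict associated laboratory abnormalities directly from the ECG embedding, which gives an indirect, physiologically grounded explanation of its output and improves both accuracy and trustworthiness.

Link: https://arxiv.org/abs/2512.07021
Authors: Jose Geraldo Fernandes, Luiz Facury de Souza, Pedro Robles Dutenhefner, Gisele L. Pappa, Wagner Meira Jr
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: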

Abstract:Deep learning models have shown high accuracy in classifying electrocardiograms (ECGs), but their black box nature hinders clinical adoption due to a lack of trust and interpretability. To address this, we propose a novel three-stage training paradigm that transfers knowledge from multimodal clinical data (laboratory exams, vitals, biometrics) into a powerful, yet unimodal, ECG encoder. We employ a self-supervised, joint-embedding pre-training stage to create an ECG representation that is enriched with contextual clinical information, while only requiring the ECG signal at inference time. Furthermore, as an indirect way to explain the model’s output, we train it to also predict associated laboratory abnormalities directly from the ECG embedding. Evaluated on the MIMIC-IV-ECG dataset, our model outperforms a standard signal-only baseline in multi-label diagnosis classification and successfully bridges a substantial portion of the performance gap to a fully multimodal model that requires all data at inference. Our work demonstrates a practical and effective method for creating more accurate and trustworthy ECG classification models. By converting abstract predictions into physiologically grounded explanations, our approach offers a promising path toward the safer integration of AI into clinical workflows.

[AI-52] Optimizing video analytics inference pipelines: a case study

Link: https://arxiv.org/abs/2512.07009
Authors: Saeid Ghafouri, Yuming Ding, Katerine Diaz Chito, Jesús Martinez del Rincón, Niamh O’Connell, Hans Vandierendonck
Affiliations: unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to the IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT 2025)

[AI-53] Multi-Accent Mandarin Dry-Vocal Singing Dataset: Benchmark for Singing Accent Recognition

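Quick read: This paper addresses the slow progress of singing accent research caused by scarce datasets: existing singing datasets often lose detail (frequently as a result of vocal-instrumental separation) and lack regional accent annotations. The key to its solution is a large-scale, high-quality dataset, the Multi-Accent Mandarin Dry-Vocal Singing Dataset (MADVSD), which contains more than 670 hours of dry vocal recordings from 4,206 native Mandarin speakers across nine Chinese regions, covering three popular songs plus phonetic exercises that span all Mandarin vowels and a full octave range. Benchmark experiments validate the dataset for singing accent recognition, and it supports analysis of dialectal influence and of the role of vowels in accent variation.

Link: https://arxiv.org/abs/2512.07005
Authors: Zihao Wang, Ruibin Yuan, Ziqi Geng, Hengjia Li, Xingwei Qu, Xinyi Li, Songye Chen, Haoying Fu, Roger B. Dannenberg, Kejun Zhang
Affiliations: unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Accepted by ACMMM 2025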

Abstract:Singing accent research is underexplored compared to speech accent studies, primarily due to the scarcity of suitable datasets. Existing singing datasets often suffer from detail loss, frequently resulting from the vocal-instrumental separation process. Additionally, they often lack regional accent annotations. To address this, we introduce the Multi-Accent Mandarin Dry-Vocal Singing Dataset (MADVSD). MADVSD comprises over 670 hours of dry vocal recordings from 4,206 native Mandarin speakers across nine distinct Chinese regions. In addition to each participant recording audio of three popular songs in their native accent, they also recorded phonetic exercises covering all Mandarin vowels and a full octave range. We validated MADVSD through benchmark experiments in singing accent recognition, demonstrating its utility for evaluating state-of-the-art speech models in singing contexts. Furthermore, we explored dialectal influences on singing accent and analyzed the role of vowels in accentual variations, leveraging MADVSD’s unique phonetic exercises.

[AI-54] Benchmarking Deep Neural Networks for Modern Recommendation Systems

Link: https://arxiv.org/abs/2512.07000
Authors: Abderaouf Bahi, Ibtissem Gasmi
Affiliations: unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

[AI-55] Singing Timbre Popularity Assessment Based on Multimodal Large Foundation Model

Link: https://arxiv.org/abs/2512.06999
Authors: Zihao Wang, Ruibin Yuan, Ziqi Geng, Hengjia Li, Xingwei Qu, Xinyi Li, Songye Chen, Haoying Fu, Roger B. Dannenberg, Kejun Zhang
Affiliations: unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Accepted to ACMMM 2025 oral

[AI-56] On Memory: A comparison of memory mechanisms in world models

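Quick read: This paper addresses perceptual drift in long-horizon planning with Transformer-based world models, which results from their limited effective memory span and hinders loop closure within imagined trajectories. The key to its solution is a taxonomy of memory augmentation mechanisms that distinguishes memory encoding from memory injection, together with an analysis, through the lens of residual stream dynamics, of how these mechanisms extend a world model's memory. Experiments show that such mechanisms improve the effective memory span of vision transformers and offer a path to loop closure inside a world model's imagination.

Link: https://arxiv.org/abs/2512.06983
Authors: Eli J. Laird, Corey Clark
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 1 figure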

Abstract:World models enable agents to plan within imagined environments by predicting future states conditioned on past observations and actions. However, their ability to plan over long horizons is limited by the effective memory span of the backbone architecture. This limitation leads to perceptual drift in long rollouts, hindering the model’s capacity to perform loop closures within imagined trajectories. In this work, we investigate the effective memory span of transformer-based world models through an analysis of several memory augmentation mechanisms. We introduce a taxonomy that distinguishes between memory encoding and memory injection mechanisms, motivating their roles in extending the world model’s memory through the lens of residual stream dynamics. Using a state recall evaluation task, we measure the memory recall of each mechanism and analyze its respective trade-offs. Our findings show that memory mechanisms improve the effective memory span in vision transformers and provide a path to completing loop closures within a world model’s imagination.

[AI-57] Comparing BFGS and OGR for Second-Order Optimization

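Quick read: This paper addresses the challenge of estimating the Hessian matrix during neural network training, where dimensionality and computational cost are high. Classical methods such as BFGS (Broyden-Fletcher-Goldfarb-Shanno) use a Sherman-Morrison update to maintain a positive definite Hessian approximation, but the underlying convexity assumption limits them on non-convex structures. The paper compares BFGS with Online Gradient Regression (OGR), whose key idea is to regress gradients against positions using an exponential moving average, estimating second derivatives online without Hessian inversion. OGR can estimate a general (not necessarily positive definite) Hessian and can therefore handle non-convex problems. Experiments on standard test functions show faster convergence and lower loss for OGR, particularly in non-convex settings.

Link: https://arxiv.org/abs/2512.06969
Authors: Adrian Przybysz, Mikołaj Kołek, Franciszek Sobota, Jarek Duda
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: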

Abstract:Estimating the Hessian matrix, especially for neural network training, is a challenging problem due to high dimensionality and cost. In this work, we compare the classical Sherman-Morrison update used in the popular BFGS method (Broyden-Fletcher-Goldfarb-Shanno), which maintains a positive definite Hessian approximation under a convexity assumption, with a novel approach called Online Gradient Regression (OGR). OGR performs regression of gradients against positions using an exponential moving average to estimate second derivatives online, without requiring Hessian inversion. Unlike BFGS, OGR allows estimation of a general (not necessarily positive definite) Hessian and can thus handle non-convex structures. We evaluate both methods across standard test functions and demonstrate that OGR achieves faster convergence and improved loss, particularly in non-convex settings.
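
The phrase "regression of gradients against positions using an exponential moving average" can be illustrated in one dimension. The sketch below is an assumption-laden toy, not the authors' implementation: it keeps exponentially weighted averages of x, g, x², and x·g, and takes the regression slope cov(x, g) / var(x) as the curvature estimate. The function name `ogr_curvature_1d` and the sample positions are hypothetical.

```python
def ogr_curvature_1d(positions, gradients, beta=0.9):
    """Estimate f''(x) as the EMA-weighted regression slope of gradient on position."""
    w = mx = mg = mxx = mxg = 0.0
    for x, g in zip(positions, gradients):
        # Exponentially weighted sums; `w` tracks total weight for normalization.
        w = beta * w + (1 - beta)
        mx = beta * mx + (1 - beta) * x
        mg = beta * mg + (1 - beta) * g
        mxx = beta * mxx + (1 - beta) * x * x
        mxg = beta * mxg + (1 - beta) * x * g
    mean_x, mean_g = mx / w, mg / w
    var_x = mxx / w - mean_x ** 2
    cov_xg = mxg / w - mean_x * mean_g
    return cov_xg / var_x

xs = [1.0, 0.8, 0.5, 0.9, 0.3]

# Convex quadratic f(x) = 1.5 * (x - 2)^2, so f'(x) = 3 * (x - 2) and f'' = 3.
convex_h = ogr_curvature_1d(xs, [3.0 * (x - 2.0) for x in xs])

# Concave quadratic f(x) = -x^2, so f'(x) = -2 * x and f'' = -2.
concave_h = ogr_curvature_1d(xs, [-2.0 * x for x in xs])

print(convex_h, concave_h)
```

Because the slope is a plain regression coefficient, nothing forces it to be positive, which matches the abstract's point that the estimate need not be positive definite. In higher dimensions the same idea would use a covariance matrix instead of a scalar variance; that extension is not shown here.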

[AI-58] A Unifying Human-Centered AI Fairness Framework

Link: https://arxiv.org/abs/2512.06944
Authors: Munshi Mahbubur Rahman, Shimei Pan, James R. Foulds
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

[AI-59] Hidden Leaks in Time Series Forecasting: How Data Leakage Affects LSTM Evaluation Across Configurations and Validation Strategies

Link: https://arxiv.org/abs/2512.06932
Authors: Salma Albelali, Moataz Ahmed
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

[AI-60] Adaptive Normalization Mamba with Multi Scale Trend Decomposition and Patch MoE Encoding

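Quick read: This paper addresses the loss of stability and accuracy in time series forecasting caused by non-stationarity, multi-scale temporal patterns, and distributional shifts. The key to its solution is the AdaMamba architecture, which has three main components: (1) an Adaptive Normalization Block that applies multi-scale convolutional trend extraction and channel-wise recalibration for consistent detrending and variance stabilization; (2) a Context Encoder that combines patch embeddings, positional encoding, and a Mamba-enhanced Transformer layer with a mixture-of-experts (MoE) feed-forward module to model long-range dependencies and local dynamics efficiently; and (3) a lightweight prediction head with a denormalization mechanism that reconstructs outputs so that forecasts stay robust under varying temporal conditions. The design mitigates covariate shift and improves predictive reliability across heterogeneous datasets.

Link: https://arxiv.org/abs/2512.06929
Authors: MinCheol Jeon
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: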

Abstract:Time series forecasting in real-world environments faces significant challenges, including non-stationarity, multi-scale temporal patterns, and distributional shifts, that degrade model stability and accuracy. This study proposes AdaMamba, a unified forecasting architecture that integrates adaptive normalization, multi-scale trend extraction, and contextual sequence modeling to address these challenges. AdaMamba begins with an Adaptive Normalization Block that removes non-stationary components through multi-scale convolutional trend extraction and channel-wise recalibration, enabling consistent detrending and variance stabilization. The normalized sequence is then processed by a Context Encoder that combines patch-wise embeddings, positional encoding, and a Mamba-enhanced Transformer layer with a mixture-of-experts feed-forward module, allowing efficient modeling of both long-range dependencies and local temporal dynamics. A lightweight prediction head generates multi-horizon forecasts, and a denormalization mechanism reconstructs outputs by reintegrating local trends to ensure robustness under varying temporal conditions. AdaMamba provides strong representational capacity with modular extensibility, supporting deterministic prediction and compatibility with probabilistic extensions. Its design effectively mitigates covariate shift and enhances predictive reliability across heterogeneous datasets. Experimental evaluations demonstrate that AdaMamba’s combination of adaptive normalization and expert-augmented contextual modeling yields consistent improvements in stability and accuracy over conventional Transformer-based baselines.
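
The normalize-then-denormalize idea in the abstract (remove a multi-scale trend, stabilize variance, later add the trend back) can be sketched with fixed moving averages. This is only a simplified stand-in: the paper describes learned multi-scale convolutions and channel-wise recalibration, whereas the functions below (`moving_average`, `normalize`, `denormalize`, all hypothetical names) use plain averaging windows on a single channel.

```python
def moving_average(series, window):
    """Centered moving average; the window is truncated at the edges."""
    half = window // 2
    n = len(series)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out

def normalize(series, windows=(3, 7)):
    """Remove a multi-scale trend, then standardize the residual."""
    trends = [moving_average(series, w) for w in windows]
    trend = [sum(vals) / len(vals) for vals in zip(*trends)]
    resid = [x - t for x, t in zip(series, trend)]
    mean = sum(resid) / len(resid)
    std = (sum((r - mean) ** 2 for r in resid) / len(resid)) ** 0.5 or 1.0
    normed = [(r - mean) / std for r in resid]
    return normed, (trend, mean, std)

def denormalize(normed, stats):
    """Invert `normalize`: rescale the residual and add the trend back."""
    trend, mean, std = stats
    return [z * std + mean + t for z, t in zip(normed, trend)]

# Toy series: linear trend plus an alternating component.
series = [0.5 * t + (1.0 if t % 2 else -1.0) for t in range(20)]
normed, stats = normalize(series)
restored = denormalize(normed, stats)
```

In a forecasting setting the denormalization would be applied to predicted future values with an extrapolated trend; the round trip above only checks that the two steps are inverses of each other.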

[AI-61] Evaluating the Sensitivity of BiLSTM Forecasting Models to Sequence Length and Input Noise

Link: https://arxiv.org/abs/2512.06926
Authors: Salma Albelali, Moataz Ahmed
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

[AI-62] Deep Reinforcement Learning for Phishing Detection with Transformer-Based Semantic Features

Link: https://arxiv.org/abs/2512.06925
Authors: Aseer Al Faisal
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

[AI-63] SoK: Trust-Authorization Mismatch in LLM Agent Interactions

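Quick read: This paper addresses the new cybersecurity challenges that arise when large language models (LLMs) act as autonomous agents interacting with the external world, in particular the failure of traditional security mechanisms once decision-making shifts from deterministic code to probabilistic, natural-language-driven inference. The core question is how to establish trust in AI agents and enforce the Principle of Least Privilege (PoLP) when instructions are ambiguous and behavior is unpredictable. The key to its solution is a risk analysis model centered on the mismatch between trust evaluation and authorization policies. Used as a unifying framework, the model classifies existing attacks and defenses, identifies critical research gaps, and points to research directions for robust, trusted agents and dynamic authorization mechanisms.

Link: https://arxiv.org/abs/2512.06914
Authors: Guanquan Shi, Haohua Du, Zhiqiang Wang, Xiaoyu Liang, Weiwenpei Liu, Song Bian, Zhenyu Guan
Affiliations: unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: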

Abstract:Large Language Models (LLMs) are rapidly evolving into autonomous agents capable of interacting with the external world, significantly expanding their capabilities through standardized interaction protocols. However, this paradigm revives the classic cybersecurity challenges of agency and authorization in a novel and volatile context. As decision-making shifts from deterministic code logic to probabilistic inference driven by natural language, traditional security mechanisms designed for deterministic behavior fail. It is fundamentally challenging to establish trust for unpredictable AI agents and to enforce the Principle of Least Privilege (PoLP) when instructions are ambiguous. Despite the escalating threat landscape, the academic community’s understanding of this emerging domain remains fragmented, lacking a systematic framework to analyze its root causes. This paper provides a unifying formal lens for agent-interaction security. We observed that most security threats in this domain stem from a fundamental mismatch between trust evaluation and authorization policies. We introduce a novel risk analysis model centered on this trust-authorization gap. Using this model as a unifying lens, we survey and classify the implementation paths of existing, often seemingly isolated, attacks and defenses. This new framework not only unifies the field but also allows us to identify critical research gaps. Finally, we leverage our analysis to suggest a systematic research direction toward building robust, trusted agents and dynamic authorization mechanisms.

[AI-64] BabelCoder: Agentic Code Translation with Specification Alignment

Link: https://arxiv.org/abs/2512.06902
Authors: Fazle Rabbi, Soumit Kanti Saha, Tri Minh Triet Pham, Song Wang, Jinqiu Yang
Affiliations: unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 21 pages, 8 figures, 4 tables

[AI-65] WisPaper: Your AI Scholar Search Engine

Link: https://arxiv.org/abs/2512.06879
Authors: Li Ju, Jun Zhao, Mingxu Chai, Ziyu Shen, Xiangyang Wang, Yage Geng, Chunchun Ma, Hao Peng, Guangbin Li, Tao Li, Chengyong Liao, Fu Wang, Xiaolong Wang, Junshen Chen, Rui Gong, Shijia Liang, Feiyan Li, Ming Zhang, Kexin Tan, Jujie Ye, Zhiheng Xi, Shihan Dou, Tao Gui, Yuankai Ying, Yang Shi, Yue Zhang, Qi Zhang
Affiliations: unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 17 pages, 2 figures

[AI-66] Do Persona-Infused LLMs Affect Performance in a Strategic Reasoning Game? AACL2025

Link: https://arxiv.org/abs/2512.06867
Authors: John Licato, Stephen Steinle, Brayden Hollis
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted at IJCNLP-AACL 2025

[AI-67] JT-DA: Enhancing Data Analysis with Tool-Integrated Table Reasoning Large Language Models

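Quick read: This paper addresses the scarcity of high-quality supervision for complex table reasoning across diverse real-world scenarios. The key to its solution is a comprehensive training corpus of 34 well-defined table reasoning tasks, built from 29 public table QA datasets and 3 million tables, with an automatic pipeline that generates realistic multi-step analytical tasks. High-quality table-centric data is distilled through LLM-based scoring and workflow-aligned filtering, and the model is optimized with supervised fine-tuning (SFT) and reinforcement learning (RL). A four-stage table reasoning workflow (table preprocessing, table sensing, tool-integrated reasoning, and prompt engineering) is proposed to improve interpretability and execution accuracy.

Link: https://arxiv.org/abs/2512.06859
Authors: Ce Chi, Xing Wang, Zhendong Wang, Xiaofan Liu, Ce Li, Zhiyan Song, Chen Zhao, Kexin Yang, Boshen Shi, Jingjing Yang, Chao Deng, Junlan Feng
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: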

Abstract:In this work, we present JT-DA-8B (JiuTian Data Analyst 8B), a specialized large language model designed for complex table reasoning tasks across diverse real-world scenarios. To address the lack of high-quality supervision in tabular reasoning scenarios, we construct a comprehensive and diverse training corpus with 34 well-defined table reasoning tasks, by aggregating 29 public table QA datasets and 3 million tables. An automatic pipeline is proposed to generate realistic multi-step analytical tasks involving reasoning patterns. The model is trained upon open-source JT-Coder-8B model, an 8B-parameter decoder-only foundation model trained from scratch. In the training stage, we leverage LLM-based scoring and workflow-aligned filtering to distill high-quality, table-centric data. Both supervised fine-tuning (SFT) and Reinforcement learning (RL) are adopted to optimize our model. Afterwards, a four-stage table reasoning workflow is proposed, including table preprocessing, table sensing, tool-integrated reasoning, and prompt engineering, to improve model interpretability and execution accuracy. Experimental results show that JT-DA-8B achieves strong performance in various table reasoning tasks, demonstrating the effectiveness of data-centric generation and workflow-driven optimization.

[AI-68] ArchPower: Dataset for Architecture-Level Power Modeling of Modern CPU Design NEURIPS’25

Link: https://arxiv.org/abs/2512.06854
Authors: Qijun Zhang, Yao Lu, Mengming Li, Shang Liu, Zhiyao Xie
Affiliations: unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: Published in NeurIPS’25 Dataset and Benchmark Track

[AI-69] Formal that “Floats” High: Formal Verification of Floating Point Arithmetic MICRO

Link: https://arxiv.org/abs/2512.06850
Authors: Hansa Mohanty, Vaisakh Naduvodi Viswambharan, Deepak Narayan Gadde
Affiliations: unknown
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: To appear at the 37th IEEE International Conference on Microelectronics (ICM), December 14-17, 2025, Cairo, Egypt

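[AI-70] Leveraging LLMs to support co-evolution between definitions and instances of textual DSLs

Quick read: This paper addresses the co-evolution of grammar definitions and textual instances as a domain-specific language (DSL) evolves, with a focus on migrating textual instances without losing auxiliary information such as comments and layout. The key to its solution is to have a large language model (LLM) process the textual instances directly so that they are adapted automatically after a grammar update, avoiding the information loss of traditional techniques that rely on metamodel evolution. Experiments show that current LLMs (Claude-3.5 and GPT-4o) migrate small instances well but face significant scalability challenges on larger ones.

Link: https://arxiv.org/abs/2512.06836
Authors: Weixing Zhang, Regina Hebig, Daniel Strüber
Affiliations: unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments: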

Abstract:Software languages evolve over time for various reasons, such as the addition of new features. When the language’s grammar definition evolves, textual instances that originally conformed to the grammar become outdated. For DSLs in a model-driven engineering context, there exists a plethora of techniques to co-evolve models with the evolving metamodel. However, these techniques are not geared to support DSLs with a textual syntax – applying them to textual language definitions and instances may lead to the loss of information from the original instances, such as comments and layout information, which are valuable for software comprehension and maintenance. This study explores the potential of Large Language Model (LLM)-based solutions in achieving grammar and instance co-evolution, with attention to their ability to preserve auxiliary information when directly processing textual instances. By applying two advanced language models, Claude-3.5 and GPT-4o, and conducting experiments across seven case languages, we evaluated the feasibility and limitations of this approach. Our results indicate a good ability of the considered LLMs for migrating textual instances in small-scale cases with limited instance size, which are representative of a subset of cases encountered in practice. In addition, we observe significant challenges with the scalability of LLM-based solutions to larger instances, leading to insights that are useful for informing future research.

[AI-71] Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning

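Quick read: This paper addresses reward hacking in reinforcement learning (RL) training of vision-language models (VLMs), which stems from the lack of high-quality multimodal data and is especially pronounced in specialized domains such as chemistry, earth sciences, and multimodal mathematics. The key to its solution is DoGe (Decouple to Generalize), a dual-decoupling framework that splits learning into two components, a Thinker and a Solver. The model is first guided to understand the problem context rather than solve the problem directly, which recovers the contextual information that synthetic data methods overlook. An evolving curriculum learning pipeline, consisting of an expanded native domain knowledge corpus and an iteratively evolving pool of seed problems, increases training data diversity and stabilizes policy entropy. The result is a two-stage RL post-training procedure that moves from freely exploring context to solving tasks.

Link: https://arxiv.org/abs/2512.06835
Authors: Tingyu Li, Zheng Sun, Jingxuan Wei, Siyuan Li, Conghui He, Lijun Wu, Cheng Tan
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 25 pages, 5 figures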

Abstract:Recent vision-language models (VLMs) achieve remarkable reasoning through reinforcement learning (RL), which provides a feasible solution for realizing continuous self-evolving large vision-language models (LVLMs) in the era of experience. However, RL for VLMs requires abundant high-quality multimodal data, especially challenging in specialized domains like chemistry, earth sciences, and multimodal mathematics. Existing strategies such as synthetic data and self-rewarding mechanisms suffer from limited distributions and alignment difficulties, ultimately causing reward hacking: models exploit high-reward patterns, collapsing policy entropy and destabilizing training. We propose DoGe (Decouple to Generalize), a dual-decoupling framework that guides models to first learn from context rather than problem solving by refocusing on the problem context scenarios overlooked by synthetic data methods. By decoupling learning process into dual components (Thinker and Solver), we reasonably quantify the reward signals of this process and propose a two-stage RL post-training approach from freely exploring context to practically solving tasks. Second, to increase the diversity of training data, DoGe constructs an evolving curriculum learning pipeline: an expanded native domain knowledge corpus and an iteratively evolving seed problems pool. Experiments show that our method consistently outperforms the baseline across various benchmarks, providing a scalable pathway for realizing self-evolving LVLMs.

[AI-72] Partial Inverse Design of High-Performance Concrete Using Cooperative Neural Networks for Constraint-Aware Mix Generation

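Quick read: This paper addresses the partial inverse design of high-performance concrete (HPC): when some mix variables are fixed by constraints, how to determine the remaining variables efficiently and accurately so that a target performance (such as compressive strength) is met. Conventional data-driven methods mostly serve forward design (predicting performance from a given mix) and adapt poorly to constrained inverse problems. The key to its solution is a cooperative neural network framework made of two coupled models: an imputation model that fills in the undetermined variables and a surrogate model that predicts strength. Through cooperative learning, the framework generates constraint-satisfying, performance-consistent mixes in a single forward pass, with no retraining for different constraint combinations, combining accuracy, robustness, and computational efficiency.

Link: https://arxiv.org/abs/2512.06813
Authors: Agung Nugraha, Heungjun Im, Jihwan Lee
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 19 pages, 12 figures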

Abstract:High-performance concrete offers exceptional strength and durability but requires complex mix designs involving many interdependent variables and practical constraints. While data-driven methods have advanced predictive modeling for forward design, inverse design, which focuses on determining mix compositions that achieve target performance, remains limited, particularly in design situations where some mix variables are fixed by constraints and only the remaining variables must be determined. This study proposes a cooperative neural network framework for the partial inverse design of high-performance concrete. The framework combines two coupled neural network models, an imputation model that infers the undetermined variables and a surrogate model that predicts compressive strength. Through cooperative learning, the model generates valid and performance-consistent mix designs in a single forward pass while accommodating different constraint combinations without retraining. Its performance is compared with both probabilistic and generative approaches, including Bayesian inference based on a Gaussian process surrogate and autoencoder-based models. Evaluated on a benchmark dataset, the proposed model achieves stable and higher R-squared values of 0.87-0.92 and reduces mean squared error by an average of 50 percent compared with autoencoder baselines and by an average of 70 percent compared with Bayesian inference. The results demonstrate that the cooperative neural network provides an accurate, robust, and computationally efficient foundation for constraint-aware, data-driven mix proportioning in concrete engineering.

[AI-73] Angular Regularization for Positive-Unlabeled Learning on the Hypersphere

Link: https://arxiv.org/abs/2512.06785
Authors: Vasileios Sevetlidis, George Pavlidis, Antonios Gasteratos
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Featured Certification, J2C Certification. Transactions on Machine Learning Research, 2025

[AI-74] From Description to Score: Can LLMs Quantify Vulnerabilities?

Link: https://arxiv.org/abs/2512.06781
Authors: Sima Jafarikhah, Daniel Thompson, Eva Deans, Hossein Siadati, Yi Liu
Affiliations: unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments: 10 pages

[AI-75] DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems

Link: https://arxiv.org/abs/2512.06749
Authors: Ming Ma, Jue Zhang, Fangkai Yang, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

[AI-76] PrivLLMSwarm: Privacy-Preserving LLM-Driven UAV Swarms for Secure IoT Surveillance

Link: https://arxiv.org/abs/2512.06747
Authors: Jifar Wakuma Ayana, Huang Qiming
Affiliations: unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

[AI-77] A Novel Deep Neural Network Architecture for Real-Time Water Demand Forecasting

Link: https://arxiv.org/abs/2512.06714
Authors: Tony Salloom, Okyay Kaynak, Wei He
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

[AI-78] Stochasticity in Agentic Evaluations: Quantifying Inconsistency with Intraclass Correlation

Link: https://arxiv.org/abs/2512.06710
Authors: Zairah Mustahsan, Abel Lim, Megna Anand, Saahil Jain, Bryan McCann
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

[AI-79] A Novel Multimodal RUL Framework for Remaining Useful Life Estimation with Layer-wise Explanations

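Quick read: This paper addresses poor generalization, limited robustness, high data demands, and weak interpretability in Remaining Useful Life (RUL) estimation for mechanical systems. Its core solution is a multimodal RUL framework that jointly uses image representations (ImR) and time-frequency representations (TFR) of vibration signals. The framework has three branches: an ImR branch and a TFR branch, both built from dilated convolutional blocks with residual connections to extract spatial degradation features, and a fusion branch that concatenates the two feature sets and feeds them into an LSTM to model temporal degradation patterns. A multi-head attention mechanism then emphasizes salient features, and linear layers perform the final RUL regression. The authors also introduce a tailored multimodal Layer-wise Relevance Propagation (multimodal-LRP) method that improves interpretability and makes the predictions more trustworthy for industrial deployment.

Link: https://arxiv.org/abs/2512.06708
Authors: Waleed Razzaq, Yun-Bo Zhao
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: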

Abstract:Estimating the Remaining Useful Life (RUL) of mechanical systems is pivotal in Prognostics and Health Management (PHM). Rolling-element bearings are among the most frequent causes of machinery failure, highlighting the need for robust RUL estimation methods. Existing approaches often suffer from poor generalization, lack of robustness, high data demands, and limited interpretability. This paper proposes a novel multimodal-RUL framework that jointly leverages image representations (ImR) and time-frequency representations (TFR) of multichannel, nonstationary vibration signals. The architecture comprises three branches: (1) an ImR branch and (2) a TFR branch, both employing multiple dilated convolutional blocks with residual connections to extract spatial degradation features; and (3) a fusion branch that concatenates these features and feeds them into an LSTM to model temporal degradation patterns. A multi-head attention mechanism subsequently emphasizes salient features, followed by linear layers for final RUL regression. To enable effective multimodal learning, vibration signals are converted into ImR via the Bresenham line algorithm and into TFR using Continuous Wavelet Transform. We also introduce multimodal Layer-wise Relevance Propagation (multimodal-LRP), a tailored explainability technique that significantly enhances model transparency. The approach is validated on the XJTU-SY and PRONOSTIA benchmark datasets. Results show that our method matches or surpasses state-of-the-art baselines under both seen and unseen operating conditions, while requiring ~28 % less training data on XJTU-SY and ~48 % less on PRONOSTIA. The model exhibits strong noise resilience, and multimodal-LRP visualizations confirm the interpretability and trustworthiness of predictions, making the framework highly suitable for real-world industrial deployment.
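
The abstract says vibration signals are converted into image representations via the Bresenham line algorithm. One plausible reading, sketched below, is to rasterize the waveform by joining consecutive samples with Bresenham lines on a pixel grid. The exact construction used in the paper is not given here, so `signal_to_image` and its scaling are assumptions; `bresenham` is the standard integer line algorithm.

```python
def bresenham(x0, y0, x1, y1):
    """Integer grid points on the line from (x0, y0) to (x1, y1), inclusive."""
    points = []
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy
    return points

def signal_to_image(signal, height=8):
    """Rasterize a 1-D signal into a binary image (rows = amplitude bins)."""
    lo, hi = min(signal), max(signal)
    span = (hi - lo) or 1.0
    rows = [round((v - lo) / span * (height - 1)) for v in signal]
    image = [[0] * len(signal) for _ in range(height)]
    for x in range(len(signal) - 1):
        # Join consecutive samples so the trace has no vertical gaps.
        for px, py in bresenham(x, rows[x], x + 1, rows[x + 1]):
            image[py][px] = 1
    return image

image = signal_to_image([0.0, 1.0, 0.0, -1.0, 0.0], height=5)
for row in reversed(image):  # print with high amplitude at the top
    print("".join("#" if v else "." for v in row))
```

Joining samples with lines rather than plotting isolated points keeps sharp transients visible as connected strokes, which is presumably useful for a convolutional branch.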

[AI-80] Academic journals’ AI policies fail to curb the surge in AI-assisted academic writing

链接: https://arxiv.org/abs/2512.06705
作者: Yongyuan He,Yi Bu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 40 pages, 10 figures, and 9 tables

点击查看摘要

[AI-81] Predictive Modeling of I/O Performance for Machine Learning Training Pipelines: A Data-Driven Approach to Storage Optimization

链接: https://arxiv.org/abs/2512.06699
作者: Karthik Prabhakar
机构: 未知
类目: Performance (cs.PF); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 20 pages, 10 figures

点击查看摘要

[AI-82] GradientSpace: Unsupervised Data Clustering for Improved Instruction Tuning

链接: https://arxiv.org/abs/2512.06678
作者: Shrihari Sridharan,Deepak Ravikumar,Anand Raghunathan,Kaushik Roy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-83] Towards Small Language Models for Security Query Generation in SOC Workflows

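【Quick Read】: This paper addresses the efficiency bottleneck that arises because Security Operations Center (SOC) analysts need specialized skills to write Kusto Query Language (KQL) when querying large-scale telemetry streams. To achieve low-cost, accurate natural-language-to-KQL (NL2KQL) translation, the authors propose a three-knob framework. First, a lightweight retrieval-augmented prompting strategy adds error-aware prompts that handle common parser failures without increasing token usage. Second, LoRA fine-tuning combined with rationale (chain-of-thought) distillation transfers a teacher model's reasoning while keeping the Small Language Model (SLM) compact. Third, a two-stage architecture has the SLM generate candidate queries and a low-cost Large Language Model (LLM) judge perform schema-aware refinement and selection. On Microsoft's NL2KQL Defender Evaluation dataset the approach reaches 0.987 syntax correctness and 0.906 semantic accuracy at up to 10x lower token cost than GPT-5, supporting the feasibility and scalability of SLMs for natural-language querying in enterprise security.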

链接: https://arxiv.org/abs/2512.06660
作者: Saleha Muzammil,Rahul Reddy,Vishal Kamalakrishnan,Hadi Ahmadi,Wajih Ul Hassan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Analysts in Security Operations Centers routinely query massive telemetry streams using Kusto Query Language (KQL). Writing correct KQL requires specialized expertise, and this dependency creates a bottleneck as security teams scale. This paper investigates whether Small Language Models (SLMs) can enable accurate, cost-effective natural-language-to-KQL translation for enterprise security. We propose a three-knob framework targeting prompting, fine-tuning, and architecture design. First, we adapt the existing NL2KQL framework for SLMs with lightweight retrieval and introduce error-aware prompting that addresses common parser failures without increasing token count. Second, we apply LoRA fine-tuning with rationale distillation, augmenting each NLQ-KQL pair with a brief chain-of-thought explanation to transfer reasoning from a teacher model while keeping the SLM compact. Third, we propose a two-stage architecture that uses an SLM for candidate generation and a low-cost LLM judge for schema-aware refinement and selection. We evaluate nine models (five SLMs and four LLMs) across syntax correctness, semantic accuracy, table selection, and filter precision, alongside latency and token cost. On Microsoft’s NL2KQL Defender Evaluation dataset, our two-stage approach achieves 0.987 syntax and 0.906 semantic accuracy. We further demonstrate generalizability on Microsoft Sentinel data, reaching 0.964 syntax and 0.831 semantic accuracy. These results come at up to 10x lower token cost than GPT-5, establishing SLMs as a practical, scalable foundation for natural-language querying in security operations.

[AI-84] GSAE: Graph-Regularized Sparse Autoencoders for Robust LLM Safety Steering

链接: https://arxiv.org/abs/2512.06655
作者: Jehyeok Yeon,Federico Cinus,Yifan Wu,Luca Luceri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-85] LightSearcher: Efficient DeepSearch via Experiential Memory

链接: https://arxiv.org/abs/2512.06653
作者: Hengzhi Lan,Yue Yu,Li Qian,Li Peng,Jie Wu,Wei Liu,Jian Luan,Ting Bai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures

点击查看摘要

[AI-86] Adaptive Test-Time Training for Predicting Need for Invasive Mechanical Ventilation in Multi-Center Cohorts

链接: https://arxiv.org/abs/2512.06652
作者: Xiaolei Lu,Shamim Nemati
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-87] FlatFormer: A Flat Transformer Knowledge Tracing Model Based on Cognitive Bias Injection

链接: https://arxiv.org/abs/2512.06629
作者: Xiao-li Xia,Hou-biao Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注: 36 pages, 14 figures,Table 5

点击查看摘要

[AI-88] Memory Power Asymmetry in Human-AI Relationships: Preserving Mutual Forgetting in the Digital Age

链接: https://arxiv.org/abs/2512.06616
作者: Rasam Dorri,Rami Zwick
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 31 pages, 2 tables, 2 figures

点击查看摘要

[AI-89] ChargingBoul: A Competitive Negotiating Agent with Novel Opponent Modeling

链接: https://arxiv.org/abs/2512.06595
作者: Joe Shymanski
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures. Describes the ChargingBoul negotiating agent submitted to ANAC 2022. Preprint

点击查看摘要

[AI-90] Beyond Satisfaction: From Placebic to Actionable Explanations For Enhanced Understandability AAMAS2025

链接: https://arxiv.org/abs/2512.06591
作者: Joe Shymanski,Jacob Brue,Sandip Sen
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 21 pages, 7 figures, 6 tables. EXTRAAMAS 2025 submission. Preprint version

点击查看摘要

[AI-91] Towards Efficient Hypergraph and Multi-LLM Agent Recommender Systems

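【Quick Read】: This paper targets two core problems in generative recommender systems: hallucination during generation, which degrades recommendation performance, and high computational cost in practical scenarios. The key to the solution is HGLMRec, a multi-LLM agent-based model that incorporates a hypergraph encoder to capture complex, multi-behaviour relationships between users and items. During inference it retrieves only the relevant tokens, which reduces computational overhead while enriching the retrieval context, yielding efficient, high-quality recommendations.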

链接: https://arxiv.org/abs/2512.06590
作者: Tendai Mukande,Esraa Ali,Annalina Caputo,Ruihai Dong,Noel OConnor
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 8 Pages

点击查看摘要

Abstract:Recommender Systems (RSs) have become the cornerstone of various applications such as e-commerce and social media platforms. The evolution of RSs is paramount in the digital era, in which personalised user experience is tailored to the user’s preferences. Large Language Models (LLMs) have sparked a new paradigm - generative retrieval and recommendation. Despite their potential, generative RS methods face issues such as hallucination, which degrades the recommendation performance, and high computational cost in practical scenarios. To address these issues, we introduce HGLMRec, a novel Multi-LLM agent-based RS that incorporates a hypergraph encoder designed to capture complex, multi-behaviour relationships between users and items. The HGLMRec model retrieves only the relevant tokens during inference, reducing computational overhead while enriching the retrieval context. Experimental results show performance improvement by HGLMRec against state-of-the-art baselines at lower computational cost.

[AI-92] QL-LSTM: A Parameter-Efficient LSTM for Stable Long-Sequence Modeling

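【Quick Read】: This paper addresses two core limitations of conventional recurrent networks such as LSTM and GRU in sequence modeling: redundant parameters in the gating mechanism and a reduced ability to retain information across long temporal distances. The proposed Quantum-Leap LSTM (QL-LSTM) has two independent components. Parameter-Shared Unified Gating (PSUG) replaces all gate-specific transformations with a single shared weight matrix, cutting the parameter count by about 48% while preserving full gating behavior. Hierarchical Gated Recurrence with Additive Skip Connections (HGR-ASC) adds a multiplication-free pathway that improves long-range information flow and mitigates forget-gate degradation.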

链接: https://arxiv.org/abs/2512.06582
作者: Isaac Kofi Nti
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Recurrent neural architectures such as LSTM and GRU remain widely used in sequence modeling, but they continue to face two core limitations: redundant gate-specific parameters and reduced ability to retain information across long temporal distances. This paper introduces the Quantum-Leap LSTM (QL-LSTM), a recurrent architecture designed to address both challenges through two independent components. The Parameter-Shared Unified Gating mechanism replaces all gate-specific transformations with a single shared weight matrix, reducing parameters by approximately 48 percent while preserving full gating behavior. The Hierarchical Gated Recurrence with Additive Skip Connections component adds a multiplication-free pathway that improves long-range information flow and reduces forget-gate degradation. We evaluate QL-LSTM on sentiment classification using the IMDB dataset with extended document lengths, comparing it to LSTM, GRU, and BiLSTM reference models. QL-LSTM achieves competitive accuracy while using substantially fewer parameters. Although the PSUG and HGR-ASC components are more efficient per time step, the current prototype remains limited by the inherent sequential nature of recurrent models and therefore does not yet yield wall-clock speed improvements without further kernel-level optimization.

[AI-93] The Effect of Belief Boxes and Open-mindedness on Persuasion

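【Quick Read】: This paper addresses the lack of an explicit, modelable belief representation in multi-agent systems driven by Large Language Models (LLMs) for reasoning and decision-making, a gap that affects agents' behavioral consistency and persuasiveness. The key idea is the "belief box": statements of the beliefs an agent holds, together with their strengths, are placed explicitly in the prompt space to approximate propositional beliefs. The experiments show that belief statements and their strengths influence an agent's resistance to, and persuasiveness against, opposing viewpoints. They also affect the likelihood of belief change, particularly under peer pressure when the agent is outnumbered in a debate, and instructions to be open-minded change how amenable agents are to updating their beliefs. The results support the feasibility and validity of the belief box technique in reasoning and decision-making tasks.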

链接: https://arxiv.org/abs/2512.06573
作者: Onur Bilgin,Abdullah As Sami,Sriram Sai Vujjini,John Licato
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: Accepted at the 18th International Conference on Agents and Artificial Intelligence (ICAART 2026), Marbella, Spain

点击查看摘要

Abstract:As multi-agent systems are increasingly utilized for reasoning and decision-making applications, there is a greater need for LLM-based agents to have something resembling propositional beliefs. One simple method for doing so is to include statements describing beliefs maintained in the prompt space (in what we’ll call their belief boxes). But when agents have such statements in belief boxes, how does it actually affect their behaviors and dispositions towards those beliefs? And does it significantly affect agents’ ability to be persuasive in multi-agent scenarios? Likewise, if the agents are given instructions to be open-minded, how does that affect their behaviors? We explore these and related questions in a series of experiments. Our findings confirm that instructing agents to be open-minded affects how amenable they are to belief change. We show that incorporating belief statements and their strengths influences an agent’s resistance to (and persuasiveness against) opposing viewpoints. Furthermore, it affects the likelihood of belief change, particularly when the agent is outnumbered in a debate by opposing viewpoints, i.e., peer pressure scenarios. The results demonstrate the feasibility and validity of the belief box technique in reasoning and decision-making tasks.

[AI-94] Deep Manifold Part 2: Neural Network Mathematics

链接: https://arxiv.org/abs/2512.06563
作者: Max Y. Ma,Gen-Hua Shi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-95] Securing the Model Context Protocol: Defending LLMs Against Tool Poisoning and Adversarial Attacks

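【Quick Read】: This paper addresses the security gap that opens when Large Language Models (LLMs) gain autonomy by integrating external tools through the Model Context Protocol (MCP); in particular, existing defenses do not handle semantic attacks embedded in tool metadata. By manipulating tool descriptors, an attacker can mount three classes of semantic attack, Tool Poisoning, Shadowing, and Rug Pulls, which bypass conventional protections and steer model behavior. The proposed solution is a layered security framework with three components: RSA-based manifest signing to enforce descriptor integrity, LLM-on-LLM semantic vetting to detect suspicious tool definitions, and lightweight heuristic guardrails that block anomalous tool behavior at runtime. The framework reduces the rate of unsafe tool invocations without model fine-tuning or internal modification, and the observed trade-off between safety and latency varies with model architecture and reasoning method.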

链接: https://arxiv.org/abs/2512.06556
作者: Saeid Jamshidi,Kawser Wazed Nafi,Arghavan Moradi Dakhel,Negar Shahabi,Foutse Khomh,Naser Ezzati-Jivan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Model Context Protocol (MCP) enables Large Language Models to integrate external tools through structured descriptors, increasing autonomy in decision-making, task execution, and multi-agent workflows. However, this autonomy creates a largely overlooked security gap. Existing defenses focus on prompt-injection attacks and fail to address threats embedded in tool metadata, leaving MCP-based systems exposed to semantic manipulation. This work analyzes three classes of semantic attacks on MCP-integrated systems: (1) Tool Poisoning, where adversarial instructions are hidden in tool descriptors; (2) Shadowing, where trusted tools are indirectly compromised through contaminated shared context; and (3) Rug Pulls, where descriptors are altered after approval to subvert behavior. To counter these threats, we introduce a layered security framework with three components: RSA-based manifest signing to enforce descriptor integrity, LLM-on-LLM semantic vetting to detect suspicious tool definitions, and lightweight heuristic guardrails that block anomalous tool behavior at runtime. Through evaluation of GPT-4, DeepSeek, and Llama-3.5 across eight prompting strategies, we find that security performance varies widely by model architecture and reasoning method. GPT-4 blocks about 71 percent of unsafe tool calls, balancing latency and safety. DeepSeek shows the highest resilience to Shadowing attacks but with greater latency, while Llama-3.5 is fastest but least robust. Our results show that the proposed framework reduces unsafe tool invocation rates without model fine-tuning or internal modification.
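
To make the Rug Pull threat concrete, here is a minimal sketch of detecting a descriptor that changed after approval. It is deliberately not the paper's RSA-based manifest signing: it swaps in a plain SHA-256 digest pin kept in local memory, which needs only the Python standard library. The `ApprovedTools` class, the descriptor layout (a dict with a `name` key), and the canonical JSON encoding are assumptions for illustration, not part of MCP or of the authors' framework.

```python
import hashlib
import json


def descriptor_digest(descriptor):
    """Hash a tool descriptor in a canonical form (sorted keys, compact separators)."""
    canonical = json.dumps(descriptor, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ApprovedTools:
    """Hypothetical registry that remembers each descriptor's digest at approval time."""

    def __init__(self):
        self._pins = {}

    def approve(self, descriptor):
        # Pin the digest of the descriptor exactly as it was reviewed.
        self._pins[descriptor["name"]] = descriptor_digest(descriptor)

    def is_unchanged(self, descriptor):
        """True only if the tool was approved and its descriptor has not changed since."""
        pinned = self._pins.get(descriptor["name"])
        return pinned is not None and pinned == descriptor_digest(descriptor)
```

A caller would run `is_unchanged` before every tool invocation and refuse the call when it returns `False`. Unlike a signature, a local pin cannot show who published the descriptor, which is presumably one reason the authors use signing instead.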

[AI-96] BEACON: A Unified Behavioral-Tactical Framework for Explainable Cybercrime Analysis with Large Language Models

链接: https://arxiv.org/abs/2512.06555
作者: Arush Sachdeva,Rajendraprasad Saravanan,Gargi Sarkar,Kavita Vemuri,Sandeep Kumar Shukla
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

[AI-97] A-3PO: Accelerating Asynchronous LLM Training with Staleness-aware Proximal Policy Approximation

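【Quick Read】: This paper addresses learning instability caused by high data staleness in asynchronous reinforcement learning. Decoupled-loss methods stabilize coupled-loss algorithms such as PPO and GRPO by introducing a proximal policy, but computing that policy requires an extra forward pass at every training step, which is a computational bottleneck for large language models. The key observation is that the proximal policy only serves as a trust-region anchor between the behavior and target policies, so its output need not be computed explicitly and can be approximated by simple interpolation. The resulting method, A-3PO (APproximated Proximal Policy Optimization), removes the extra forward pass, reducing training time by 18% while maintaining comparable performance.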

链接: https://arxiv.org/abs/2512.06547
作者: Xiaocan Li,Shiliang Wu,Zheng Shen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Decoupled loss has been a successful reinforcement learning (RL) algorithm to deal with the high data staleness under the asynchronous RL setting. Decoupled loss improves coupled-loss style of algorithms’ (e.g., PPO, GRPO) learning stability by introducing a proximal policy to decouple the off-policy corrections (importance weight) from the controlling policy updates (trust region). However, the proximal policy requires an extra forward pass through the network at each training step, creating a computational bottleneck for large language models. We observe that since the proximal policy only serves as a trust region anchor between the behavior and target policies, we can approximate it through simple interpolation without explicit computation. We call this approach A-3PO (APproximated Proximal Policy Optimization). A-3PO eliminates this overhead, reducing training time by 18% while maintaining comparable performance. Code and off-the-shelf examples are available at: this https URL
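
The abstract does not specify the interpolation, so the following is only a sketch under stated assumptions. It uses the usual decoupled PPO surrogate, in which the importance weight compares the proximal policy with the behavior policy and the clipped ratio compares the target policy with the proximal policy. It then replaces the proximal log-probability with a log-linear blend of the behavior and target log-probabilities. The blend form and the coefficient `alpha` are hypothetical, and the code only evaluates per-token values; it does not address how gradients should flow through the approximation.

```python
import math


def approx_proximal_logprob(logp_behavior, logp_target, alpha):
    """Assumed interpolation: alpha=0 gives the behavior policy, alpha=1 the target policy."""
    return (1.0 - alpha) * logp_behavior + alpha * logp_target


def decoupled_surrogate(logp_target, logp_behavior, advantage, alpha=0.5, eps=0.2):
    """Per-token decoupled PPO surrogate with an interpolated proximal policy."""
    logp_prox = approx_proximal_logprob(logp_behavior, logp_target, alpha)
    # Off-policy correction: proximal policy relative to the behavior policy.
    importance_weight = math.exp(logp_prox - logp_behavior)
    # Trust region: target policy relative to the proximal policy, clipped.
    ratio = math.exp(logp_target - logp_prox)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return importance_weight * min(ratio * advantage, clipped * advantage)
```

With `alpha=0` the proximal policy coincides with the behavior policy, the importance weight is 1, and the expression reduces to the standard clipped PPO surrogate. Larger `alpha` moves the anchor toward the target policy, which a staleness-aware schedule might exploit, though how the authors choose it is not stated here.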

[AI-98] Beyond Token-level Supervision: Unlocking the Potential of Decoding-based Regression via Reinforcement Learning

链接: https://arxiv.org/abs/2512.06533
作者: Ming Chen,Sheng Tang,Rong-Xi Tan,Ziniu Li,Jiacheng Chen,Ke Xue,Chao Qian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-99] Why Goal-Conditioned Reinforcement Learning Works: Relation to Dual Control

链接: https://arxiv.org/abs/2512.06471
作者: Nathan P. Lawrence,Ali Mesbah
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: IFAC preprint

点击查看摘要

[AI-100] Instance Dependent Testing of Samplers using Interval Conditioning

链接: https://arxiv.org/abs/2512.06458
作者: Rishiraj Bhattacharyya,Sourav Chakraborty,Yash Pote,Uddalok Sarkar,Sayantan Sen
机构: 未知
类目: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-101] Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices

链接: https://arxiv.org/abs/2512.06443
作者: Xiangyu Li,Chengyu Yin,Weijun Wang,Jianyu Wei,Ting Cao,Yunxin Liu
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

[AI-102] Smart Spatial Planning in Egypt: An Algorithm-Driven Approach to Public Service Evaluation in Qena City

链接: https://arxiv.org/abs/2512.06431
作者: Mohamed Shamroukh,Mohamed Alkhuzamy Aziz
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

[AI-103] UncertaintyZoo: A Unified Toolkit for Quantifying Predictive Uncertainty in Deep Learning Systems

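【Quick Read】: This paper addresses the risk that incorrect predictions by Large Language Models (LLMs) cause serious losses in safety-critical scenarios, and in particular the shortage of tools that integrate uncertainty quantification (UQ) methods, which limits their practical use and further research. The solution is UncertaintyZoo, a unified toolkit that integrates 29 UQ methods across five major categories behind a standardized interface. Using it, the authors evaluate existing UQ methods on the code vulnerability detection task with CodeBERT and ChatGLM3, and report that the toolkit effectively reveals prediction uncertainty.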

链接: https://arxiv.org/abs/2512.06406
作者: Xianzong Wu,Xiaohong Li,Lili Quan,Qiang Hu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly expanding their real-world applications across domains, e.g., question answering, autonomous driving, and automatic software development. Despite this achievement, LLMs, as data-driven systems, often make incorrect predictions, which can lead to potential losses in safety-critical scenarios. To address this issue and measure the confidence of model outputs, multiple uncertainty quantification (UQ) criteria have been proposed. However, even though important, there are limited tools to integrate these methods, hindering the practical usage of UQ methods and future research in this domain. To bridge this gap, in this paper, we introduce UncertaintyZoo, a unified toolkit that integrates 29 uncertainty quantification methods, covering five major categories under a standardized interface. Using UncertaintyZoo, we evaluate the usefulness of existing uncertainty quantification methods under the code vulnerability detection task on CodeBERT and ChatGLM3 models. The results demonstrate that UncertaintyZoo effectively reveals prediction uncertainty. The tool with a demonstration video is available on the project site this https URL.

[AI-104] GENIUS: An Agentic AI Framework for Autonomous Design and Execution of Simulation Protocols

链接: https://arxiv.org/abs/2512.06404
作者: Mohammad Soleymanibrojeni,Roland Aydin,Diego Guedes-Sobrinho,Alexandre C. Dias,Maurício J. Piotrowski,Wolfgang Wenzel,Celso Ricardo Caldeira Rêgo
机构: 未知
类目: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Chemical Physics (physics.chem-ph)
备注:

点击查看摘要

[AI-105] AgenticCyber: A GenAI-Powered Multi-Agent System for Multimodal Threat Detection and Adaptive Response in Cybersecurity

链接: https://arxiv.org/abs/2512.06396
作者: Shovan Roy
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 6 pages for IEEE conference

点击查看摘要

[AI-106] RLAX: Large-Scale Distributed Reinforcement Learning for Large Language Models on TPUs

链接: https://arxiv.org/abs/2512.06392
作者: Runlong Zhou,Lefan Zhang,Shang-Chen Wu,Kelvin Zou,Hanzhi Zhou,Ke Ye,Yihao Feng,Dong Yin,Alex Guillen Garcia,Dmytro Babych,Rohit Chatterjee,Matthew Hopkins,Xiang Kong,Chang Lan,Lezhi Li,Yiping Ma,Daniele Molinari,Senyu Tong,Yanchao Sun,Thomas Voice,Jianyu Wang,Chong Wang,Simon Wang,Floris Weers,Yechen Xu,Guolin Yin,Muyang Yu,Yi Zhang,Zheng Zhou,Danyang Zhuo,Ruoming Pang,Cheng Leong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 pages, 6 figures

点击查看摘要

[AI-107] Web Technologies Security in the AI Era: A Survey of CDN-Enhanced Defenses

链接: https://arxiv.org/abs/2512.06390
作者: Mehrab Hosain,Sabbir Alom Shuvo,Matthew Ogbe,Md Shah Jalal Mazumder,Yead Rahman,Md Azizul Hakim,Anukul Pandey
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Performance (cs.PF)
备注: Accepted at 2025 IEEE Asia Pacific Conference on Wireless and Mobile (APWiMob). 7 pages, 5 figures

点击查看摘要

[AI-108] Protecting Bystander Privacy via Selective Hearing in LALMs ATC

链接: https://arxiv.org/abs/2512.06380
作者: Xiao Zhan,Guangzhi Sun,Jose Such,Phil Woodland
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: Dataset: this https URL

点击查看摘要

[AI-109] Proportional integral derivative booster for neural networks-based time-series prediction: Case of water demand prediction

链接: https://arxiv.org/abs/2512.06357
作者: Tony Sallooma,Okyay Kaynak,Xinbo Yub,Wei He
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Engineering Applications of Artificial Intelligence 2022

点击查看摘要

[AI-110] DaGRPO: Rectifying Gradient Conflict in Reasoning via Distinctiveness-Aware Group Relative Policy Optimization

链接: https://arxiv.org/abs/2512.06337
作者: Xuan Xie,Xuan Wang,Wenjie Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-111] Chemistry Integrated Language Model using Hierarchical Molecular Representation for Polymer Informatics

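【Quick Read】: This paper addresses the limited reach of machine learning in polymer materials discovery, where data scarcity is often cited as the main bottleneck. The key to the solution is CI-LLM (Chemically Informed Language Model), a framework that combines the HAPPY (Hierarchically Abstracted rePeat unit of PolYmer) molecular representation, which encodes chemical substructures as tokens, with numerical descriptors inside transformer architectures. This representation improves the accuracy and inference speed of property prediction, supports interpretable structure-property analysis and multi-objective inverse design, and overcomes limitations of SMILES-based representations for polymers.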

链接: https://arxiv.org/abs/2512.06301
作者: Jihun Ahn,Gabriella Pasya Irianti,Vikram Thapar,Su-Mi Hur
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Machine learning has transformed material discovery for inorganic compounds and small molecules, yet polymers remain largely inaccessible to these methods. While data scarcity is often cited as the primary bottleneck, we demonstrate that strategic molecular representations can overcome this limitation. We introduce CI-LLM (Chemically Informed Language Model), a framework combining HAPPY (Hierarchically Abstracted rePeat unit of PolYmer), which encodes chemical substructures as tokens, with numerical descriptors within transformer architectures. For property prediction, De$^3$BERTa, our descriptor-enriched encoder, achieves 3.5x faster inference than SMILES-based models with improved accuracy ($R^2$ score gains of 0.9-4.1 percent across four properties), while providing interpretable structure-property insights at the subgroup level. For inverse design, our GPT-based generator produces polymers with targeted properties, achieving 100 percent scaffold retention and successful multi-property optimization for negatively correlated objectives. This comprehensive framework demonstrates both forward prediction and inverse design capabilities, showcasing how strategic molecular representation advances machine learning applications in polymer science.

[AI-112] Entropic Confinement and Mode Connectivity in Overparameterized Neural Networks

链接: https://arxiv.org/abs/2512.06297
作者: Luca Di Carlo,Chase Goddard,David J. Schwab
机构: 未知
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Under Review

点击查看摘要

[AI-113] How Sharp and Bias-Robust is a Model? Dual Evaluation Perspectives on Knowledge Graph Completion WSDM2026

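【Quick Read】: This paper addresses two perspectives that existing metrics for Knowledge Graph Completion (KGC) evaluation overlook: predictive sharpness, meaning how strictly an individual prediction is judged, and popularity-bias robustness, meaning the ability to predict low-popularity entities. Because current metrics do not account for these two dimensions, they give a biased picture of KGC model performance. The authors propose a new evaluation framework, PROBE, with two components: a rank transformer (RT) that estimates the score of each prediction according to a required level of predictive sharpness, and a rank aggregator (RA) that aggregates all scores in a popularity-aware manner, giving a more comprehensive and reliable evaluation of KGC models.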

链接: https://arxiv.org/abs/2512.06296
作者: Sooho Moon,Yunyong Ko
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 5 pages, 4 figures, 2 tables, ACM WSDM 2026

点击查看摘要

Abstract:Knowledge graph completion (KGC) aims to predict missing facts from the observed KG. While a number of KGC models have been studied, the evaluation of KGC still remains underexplored. In this paper, we observe that existing metrics overlook two key perspectives for KGC evaluation: (A1) predictive sharpness – the degree of strictness in evaluating an individual prediction, and (A2) popularity-bias robustness – the ability to predict low-popularity entities. Toward reflecting both perspectives, we propose a novel evaluation framework (PROBE), which consists of a rank transformer (RT) estimating the score of each prediction based on a required level of predictive sharpness and a rank aggregator (RA) aggregating all the scores in a popularity-aware manner. Experiments on real-world KGs reveal that existing metrics tend to over- or under-estimate the accuracy of KGC models, whereas PROBE yields a comprehensive understanding of KGC models and reliable evaluation results.

[AI-114] Networked Restless Multi-Arm Bandits with Reinforcement Learning

链接: https://arxiv.org/abs/2512.06274
作者: Hanmo Zhang,Zenghui Sun,Kai Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-115] Who Will Top the Charts? Multimodal Music Popularity Prediction via Adaptive Fusion of Modality Experts and Temporal Engagement Modeling

链接: https://arxiv.org/abs/2512.06259
作者: Yash Choudhary,Preeti Rao,Pushpak Bhattacharyya
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages

点击查看摘要

[AI-116] DUET: Agentic Design Understanding via Experimentation and Testing

链接: https://arxiv.org/abs/2512.06247
作者: Gus Henry Smith,Sandesh Adhikary,Vineet Thumuluri,Karthik Suresh,Vivek Pandit,Kartik Hegde,Hamid Shojaei,Chandra Bhagavatula
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注:

点击查看摘要

[AI-117] Auto-exploration for online reinforcement learning

链接: https://arxiv.org/abs/2512.06244
作者: Caleb Ju,Guanghui Lan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: 35 pages (9 appendix), 1 figure. Comments are welcome

点击查看摘要

[AI-118] AI Application in Anti-Money Laundering for Sustainable and Transparent Financial Systems

链接: https://arxiv.org/abs/2512.06240
作者: Chuanhao Nie,Yunbo Liu,Chao Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-119] Quantifying Memory Use in Reinforcement Learning with Temporal Range

链接: https://arxiv.org/abs/2512.06204
作者: Rodney Lafuente-Mercado,Daniela Rus,T. Konstantin Rusch
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-120] DEFEND: Poisoned Model Detection and Malicious Client Exclusion Mechanism for Secure Federated Learning-based Road Condition Classification

链接: https://arxiv.org/abs/2512.06172
作者: Sheng Liu,Panos Papadimitratos
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted to the 41st ACM/SIGAPP Symposium on Applied Computing (SAC 2026)

点击查看摘要

[AI-121] Deep learning for autism detection using clinical notes: A comparison of transfer learning for a transparent and black-box approach

链接: https://arxiv.org/abs/2512.06161
作者: Gondy Leroy,Prakash Bisht,Sai Madhuri Kandula,Nell Maltman,Sydney Rice
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages

点击查看摘要

[AI-122] Learning Invariant Graph Representations Through Redundant Information

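【Quick Read】: This paper addresses the problem of learning invariant graph representations for out-of-distribution (OOD) generalization, where the core difficulty is that learned representations often retain spurious components unrelated to the task. The authors bring in Partial Information Decomposition (PID) from information theory, which goes beyond classical information-theoretic measures, to isolate the redundant information about the target $Y$ that is shared between the spurious subgraph $G_s$ and the invariant (causal) subgraph $G_c$. The key to the solution is a multi-level optimization framework, Redundancy-guided Invariant Graph learning (RIG), which alternates between estimating a lower bound on the redundant information and maximizing it together with additional objectives, while explicitly separating spurious and causal subgraphs, enabling OOD generalization under diverse distribution shifts.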

链接: https://arxiv.org/abs/2512.06154
作者: Barproda Halder,Pasan Dissanayake,Sanghamitra Dutta
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:Learning invariant graph representations for out-of-distribution (OOD) generalization remains challenging because the learned representations often retain spurious components. To address this challenge, this work introduces a new tool from information theory called Partial Information Decomposition (PID) that goes beyond classical information-theoretic measures. We identify limitations in existing approaches for invariant representation learning that solely rely on classical information-theoretic measures, motivating the need to precisely focus on redundant information about the target Y shared between spurious subgraphs G_s and invariant subgraphs G_c obtained via PID. Next, we propose a new multi-level optimization framework that we call – Redundancy-guided Invariant Graph learning (RIG) – that maximizes redundant information while isolating spurious and causal subgraphs, enabling OOD generalization under diverse distribution shifts. Our approach relies on alternating between estimating a lower bound of redundant information (which itself requires an optimization) and maximizing it along with additional objectives. Experiments on both synthetic and real-world graph datasets demonstrate the generalization capabilities of our proposed RIG framework.

[AI-123] Physics-Informed Neural Koopman Machine for Interpretable Longitudinal Personalized Alzheimer’s Disease Forecasting

链接: https://arxiv.org/abs/2512.06134
作者: Georgi Hrusanov,Duy-Thanh Vu,Duy-Cat Can,Sophie Tascedda,Margaret Ryan,Julien Bodelet,Katarzyna Koscielska,Carsten Magnus,Oliver Y. Chén
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

[AI-124] Toward Patch Robustness Certification and Detection for Deep Learning Systems Beyond Consistent Samples

链接: https://arxiv.org/abs/2512.06123
作者: Qilin Zhou,Zhengyuan Wei,Haipeng Wang,Zhuo Wang,W.K. Chan
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: accepted by IEEE Transactions on Reliability; extended technical report

点击查看摘要

[AI-125] WAM-Flow: Parallel Coarse-to-Fine Motion Planning via Discrete Flow Matching for Autonomous Driving

链接: https://arxiv.org/abs/2512.06112
作者: Yifang Xu,Jiahao Cui,Feipeng Cai,Zhihao Zhu,Hanlin Shang,Shan Luan,Mingwang Xu,Neng Zhang,Yaoyi Li,Jia Cai,Siyu Zhu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 18 pages, 11 figures

点击查看摘要

[AI-126] Future You: Designing and Evaluating Multimodal AI-generated Digital Twins for Strengthening Future Self-Continuity

链接: https://arxiv.org/abs/2512.06106
作者: Constanze Albrecht,Chayapatr Archiwaranguprok,Rachel Poonsiriwong,Awu Chen,Peggy Yin,Monchai Lertsutthiwong,Kavin Winson,Hal Hershfield,Pattie Maes,Pat Pataranutaporn
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-127] JaxWildfire: A GPU-Accelerated Wildfire Simulator for Reinforcement Learning NEURIPS2025

链接: https://arxiv.org/abs/2512.06102
作者: Ufuk Çakır,Victor-Alexandru Darvariu,Bruno Lacerda,Nick Hawes
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: To be presented at the NeurIPS 2025 Workshop on Machine Learning and the Physical Sciences (ML4PS)

点击查看摘要

[AI-128] When Privacy Isn’t Synthetic: Hidden Data Leakage in Generative AI Models

链接: https://arxiv.org/abs/2512.06062
作者: S.M. Mustaqim,Anantaa Kotal,Paul H. Yi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-129] Reinforcement Learning Integrated Agentic RAG for Software Test Cases Authoring

链接: https://arxiv.org/abs/2512.06060
作者: Mohanakrishnan Hariharan
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-130] Beyond Prototyping: Autonomous Enterprise-Grade Frontend Development from Pixel to Production via a Specialized Multi-Agent Framework

Link: https://arxiv.org/abs/2512.06046
Authors: Ramprasath Ganesaraja,Swathika N,Saravanan AP,Kamalkumar Rathinasamy,Chetana Amancharla,Rahul Das,Sahil Dilip Panse,Aditya Batwe,Dileep Vijayan,Veena Ashok,Thanushree A P,Kausthubh J Rao,Alden Olivero,Roshan,Rajeshwar Reddy Manthena,Asmitha Yuga Sre A,Harsh Tripathi,Suganya Selvaraj,Vito Chin,Kasthuri Rangan Bhaskar,Kasthuri Rangan Bhaskar,Venkatraman R,Sajit Vijayakumar
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 17 pages, 9 figures

Click to view abstract

[AI-131] Auto-SPT: Automating Semantic Preserving Transformations for Code

Link: https://arxiv.org/abs/2512.06042
Authors: Ashish Hooda,Mihai Christodorescu,Chuangang Ren,Aaron Wilson,Kassem Fawaz,Somesh Jha
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

[AI-132] Physics-Guided Deepfake Detection for Voice Authentication Systems

Link: https://arxiv.org/abs/2512.06040
Authors: Alireza Mohammadi,Keshav Sood,Dhananjay Thiruvady,Asef Nazari
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract

[AI-133] Uncovering Students' Inquiry Patterns in GenAI-Supported Clinical Practice: An Integration of Epistemic Network Analysis and Sequential Pattern Mining

Link: https://arxiv.org/abs/2512.06018
Authors: Jiameng Wei,Dinh Dang,Kaixun Yang,Emily Stokes,Amna Mazeh,Angelina Lim,David Wei Dai,Joel Moore,Yizhou Fan,Danijela Gasevic,Dragan Gasevic,Guanliang Chen
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

[AI-134] POrTAL: Plan-Orchestrated Tree Assembly for Lookahead ICRA26

Link: https://arxiv.org/abs/2512.06002
Authors: Evan Conway,David Porfirio,David Chan,Mark Roberts,Laura M. Hiatt
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Submitted to ICRA 26

Click to view abstract

[AI-135] Going All-In on LLM Accuracy: Fake Prediction Markets, Real Confidence Signals

Link: https://arxiv.org/abs/2512.05998
Authors: Michael Todasco (Visiting Fellow at the James Silberrad Center for Artificial Intelligence, San Diego State University)
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments: 25 pages, 8 tables, 2 figures. Pilot study. Data, prompts, and code available at this https URL

Click to view abstract

[AI-136] A Multi-objective Optimization Approach for Feature Selection in Gentelligent Systems

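[Quick Read]: This paper addresses multi-objective optimization and fault detection in manufacturing processes, in particular how to optimize feature selection and classification performance jointly and efficiently when the two objectives conflict. The key to its solution is a hybrid framework built on a dominance-based multi-objective evolutionary algorithm that explores the set of Pareto-optimal solutions in a single run, thereby jointly optimizing feature selection and classification performance and improving the intelligence and predictive accuracy of manufacturing systems.

Link: https://arxiv.org/abs/2512.05971
Authors: Mohammadhossein Ghahramani,Yan Qiao,NaiQi Wu,Mengchu Zhou
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: 11 pages. IEEE Internet of Things Journal, 2025

Click to view abstract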

Abstract:The integration of advanced technologies, such as Artificial Intelligence (AI), into manufacturing processes is attracting significant attention, paving the way for the development of intelligent systems that enhance efficiency and automation. This paper uses the term “Gentelligent system” to refer to systems that incorporate inherent component information (akin to genes in bioinformatics-where manufacturing operations are likened to chromosomes in this study) and automated mechanisms. By implementing reliable fault detection methods, manufacturers can achieve several benefits, including improved product quality, increased yield, and reduced production costs. To support these objectives, we propose a hybrid framework with a dominance-based multi-objective evolutionary algorithm. This mechanism enables simultaneous optimization of feature selection and classification performance by exploring Pareto-optimal solutions in a single run. This solution helps monitor various manufacturing operations, addressing a range of conflicting objectives that need to be minimized together. Manufacturers can leverage such predictive methods and better adapt to emerging trends. To strengthen the validation of our model, we incorporate two real-world datasets from different industrial domains. The results on both datasets demonstrate the generalizability and effectiveness of our approach.
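
The dominance relation behind the Pareto-optimal search described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the objective pair (number of selected features, classification error), the candidate values, and the function names `dominates` and `pareto_front` are assumptions made for illustration, and both objectives are assumed to be minimized.

```python
from typing import List, Tuple

# Hypothetical objective vector: (number of selected features, classification error).
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """Return True if a is no worse than b on every objective and strictly better on one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]


# Made-up candidate feature subsets, for illustration only.
candidates = [(3, 0.20), (5, 0.12), (5, 0.15), (8, 0.11), (10, 0.11), (2, 0.30)]
front = pareto_front(candidates)
print(front)
```

A dominance-based evolutionary algorithm would apply a filter of this kind to each generation of candidate feature subsets; the quadratic scan here is only suitable for small populations.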

[AI-137] Social welfare optimisation in well-mixed and structured populations

Link: https://arxiv.org/abs/2512.07453
Authors: Van An Nguyen,Vuong Khang Huynh,Ho Nam Duong,Huu Loi Bui,Hai Anh Ha,Quang Dung Le,Le Quoc Dung Ngo,Tan Dat Nguyen,Ngoc Ngu Nguyen,Hoai Thuong Nguyen,Zhao Song,Le Hong Trang,TheAnh Han
Affiliations: Unknown
Categories: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Optimization and Control (math.OC); Adaptation and Self-Organizing Systems (nlin.AO)
Comments:

Click to view abstract

[AI-138] Exact Synthetic Populations for Scalable Societal and Market Modeling

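[Quick Read]: This paper addresses the difficulty, in synthetic population generation, of reproducing target statistics with high precision while also guaranteeing individual-level consistency. Traditional data-driven methods infer distributions from samples, which can introduce bias and cannot ensure that each individual's attributes are logically coherent. The constraint-programming framework proposed here directly encodes aggregated statistics and structural relations, giving exact control over population characteristics and generating synthetic populations suited to policy and market analysis without requiring any original microdata. The key is to translate macro-level statistical constraints into a solvable system of mathematical constraints, which guarantees full individual-level consistency and reproducibility of the generated results.

Link: https://arxiv.org/abs/2512.07306
Authors: Thierry Petit,Arnault Pachot
Affiliations: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted for peer review on December 7, 2025

Click to view abstract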

Abstract:We introduce a constraint-programming framework for generating synthetic populations that reproduce target statistics with high precision while enforcing full individual consistency. Unlike data-driven approaches that infer distributions from samples, our method directly encodes aggregated statistics and structural relations, enabling exact control of demographic profiles without requiring any microdata. We validate the approach on official demographic sources and study the impact of distributional deviations on downstream analyses. This work is conducted within the Pollitics project developed by Emotia, where synthetic populations can be queried through large language models to model societal behaviors, explore market and policy scenarios, and provide reproducible decision-grade insights without personal data.

[AI-139] Latency-Response Theory Model: Evaluating Large Language Models via Response Accuracy and Chain-of-Thought Length

Link: https://arxiv.org/abs/2512.07019
Authors: Zhiyu Xu,Jia Liu,Yixin Wang,Yuqi Gu
Affiliations: Unknown
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI); Applications (stat.AP); Machine Learning (stat.ML)
Comments:

Click to view abstract

[AI-140] Optimal and Diffusion Transports in Machine Learning

Link: https://arxiv.org/abs/2512.06797
Authors: Gabriel Peyré
Affiliations: Unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Proc. 2026 International Congress of Mathematicians

Click to view abstract

[AI-141] AI as “Co-founder”: GenAI for Entrepreneurship

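[Quick Read]: This paper studies whether and how generative AI (GenAI) facilitates firm creation, with particular attention to its differential effect on market entry by firms of different sizes. Using the global release of ChatGPT in November 2022 as an exogenous shock, the study identifies the causal effect of GenAI lowering start-up costs, and bases its empirical analysis on the heterogeneity of pre-existing AI-specific human capital across geo-coded grids. The key to the approach is the use of high-resolution, nationwide Chinese firm registration data (through the end of 2024) combined with spatial variation in human capital to pin down how GenAI affects firm entry. It finds that GenAI significantly boosted small-firm entrepreneurship, with the strongest effects among firms with potential AI applications, firms with weaker financing needs, and first-time entrepreneurs, revealing GenAI's role as a pro-competitive force.

Link: https://arxiv.org/abs/2512.06506
Authors: Junhui Jeff Cai,Xian Gu,Liugang Sheng,Mengjia Xia,Linda Zhao,Wu Zhu
Affiliations: Unknown
Categories: General Economics (econ.GN); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments:

Click to view abstract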

Abstract:This paper studies whether, how, and for whom generative artificial intelligence (GenAI) facilitates firm creation. Our identification strategy exploits the November 2022 release of ChatGPT as a global shock that lowered start-up costs and leverages variations across geo-coded grids with differential pre-existing AI-specific human capital. Using high-resolution and universal data on Chinese firm registrations by the end of 2024, we find that grids with stronger AI-specific human capital experienced a sharp surge in new firm formation, driven entirely by small firms, contributing to 6.0% of overall national firm entry. Large-firm entry declines, consistent with a shift toward leaner ventures. New firms are smaller in capital, shareholder number, and founding team size, especially among small firms. The effects are strongest among firms with potential AI applications, weaker financing needs, and among first-time entrepreneurs. Overall, our results highlight that GenAI serves as a pro-competitive force by disproportionately boosting small-firm entry.

[AI-142] PRIMRose: Insights into the Per-Residue Energy Metrics of Proteins with Double InDel Mutations using Deep Learning

Link: https://arxiv.org/abs/2512.06496
Authors: Stella Brown,Nicolas Preisig,Autumn Davis,Brian Hutchinson,Filip Jagodzinski
Affiliations: Unknown
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Quantitative Methods (q-bio.QM)
Comments: Presented at Computational Structural Bioinformatics Workshop 2025

Click to view abstract

[AI-143] Degrading Voice: A Comprehensive Overview of Robust Voice Conversion Through Input Manipulation

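[Quick Read]: This paper addresses the performance drop of voice conversion (VC) models when input speech is degraded in real-world scenarios (for example by noise, reverberation, adversarial attacks, or minor perturbations), that is, the lack of robustness in current VC models. The key to its approach is to systematically classify existing attack and defense methods from the perspective of input manipulation, and to evaluate the impact of degraded input on VC output along four dimensions: intelligibility, naturalness, timbre similarity, and subjective perception. This exposes model vulnerabilities and provides a basis for optimizing attack and defense strategies.

Link: https://arxiv.org/abs/2512.06304
Authors: Xining Song,Zhihua Wei,Rui Wang,Haixiao Hu,Yanxiang Chen,Meng Han
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Sound (cs.SD)
Comments:

Click to view abstract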

Abstract:Identity, accent, style, and emotions are essential components of human speech. Voice conversion (VC) techniques process the speech signals of two input speakers and other modalities of auxiliary information such as prompts and emotion tags. VC changes para-linguistic features from one speaker to another while maintaining linguistic content. Recently, VC models have made rapid advancements in both generation quality and personalization capabilities. These developments have attracted considerable attention for diverse applications, including privacy preservation, voice-print reproduction for the deceased, and dysarthric speech recovery. However, these models only learn non-robust features because they are trained on clean data. This results in unsatisfactory performance when dealing with degraded input speech in real-world scenarios, including additional noise, reverberation, adversarial attacks, or even minor perturbation. Hence, robust deployment is needed, especially in real-world settings. Although recent research attempts to find potential attacks and countermeasures for VC systems, there remains a significant gap in the comprehensive understanding of how robust VC models are under input manipulation. This also raises many questions: for instance, to what extent do different forms of input degradation attacks alter the expected output of VC models? Is there potential for optimizing these attack and defense strategies? To answer these questions, we classify existing attack and defense methods from the perspective of input manipulation and evaluate the impact of degraded input speech across four dimensions: intelligibility, naturalness, timbre similarity, and subjective perception. Finally, we outline open issues and future directions.

[AI-144] FlockVote: LLM-Empowered Agent-Based Modeling for Simulating U.S. Presidential Elections

Link: https://arxiv.org/abs/2512.05982
Authors: Lingfeng Zhou,Yi Xu,Zhenyu Wang,Dequan Wang
Affiliations: Unknown
Categories: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Published as a conference paper at ICAIS 2025

Click to view abstract

[AI-145] Accelerating Materials Discovery: Learning a Universal Representation of Chemical Processes for Cross-Domain Property Prediction

Link: https://arxiv.org/abs/2512.05979
Authors: Mikhail Tsitsvero,Atsuyuki Nakao,Hisaki Ikebata
Affiliations: Unknown
Categories: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
Comments: 22 pages, 8 figures

Click to view abstract

Machine Learning

[LG-0] The Adoption and Usage of AI Agents: Early Evidence from Perplexity

Link: https://arxiv.org/abs/2512.07828
Authors: Jeremy Yang,Noah Yonack,Kate Zyskowski,Denis Yarats,Johnny Ho,Jerry Ma
Categories: Machine Learning (cs.LG); General Economics (econ.GN)
Comments:

Click to view abstract

Abstract:This paper presents the first large-scale field study of the adoption, usage intensity, and use cases of general-purpose AI agents operating in open-world web environments. Our analysis centers on Comet, an AI-powered browser developed by Perplexity, and its integrated agent, Comet Assistant. Drawing on hundreds of millions of anonymized user interactions, we address three fundamental questions: Who is using AI agents? How intensively are they using them? And what are they using them for? Our findings reveal substantial heterogeneity in adoption and usage across user segments. Earlier adopters, users in countries with higher GDP per capita and educational attainment, and individuals working in digital or knowledge-intensive sectors – such as digital technology, academia, finance, marketing, and entrepreneurship – are more likely to adopt or actively use the agent. To systematically characterize the substance of agent usage, we introduce a hierarchical agentic taxonomy that organizes use cases across three levels: topic, subtopic, and task. The two largest topics, Productivity Workflow and Learning Research, account for 57% of all agentic queries, while the two largest subtopics, Courses and Shopping for Goods, make up 22%. The top 10 out of 90 tasks represent 55% of queries. Personal use constitutes 55% of queries, while professional and educational contexts comprise 30% and 16%, respectively. In the short term, use cases exhibit strong stickiness, but over time users tend to shift toward more cognitively oriented topics. The diffusion of increasingly capable AI agents carries important implications for researchers, businesses, policymakers, and educators, inviting new lines of inquiry into this rapidly emerging class of AI capabilities.

[LG-1] An Adaptive Multi-Layered Honeynet Architecture for Threat Behavior Analysis via Deep Learning

Link: https://arxiv.org/abs/2512.07827
Authors: Lukas Johannes Möller
Categories: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The escalating sophistication and variety of cyber threats have rendered static honeypots inadequate, necessitating adaptive, intelligence-driven deception. In this work, ADLAH is introduced: an Adaptive Deep Learning Anomaly Detection Honeynet designed to maximize high-fidelity threat intelligence while minimizing cost through autonomous orchestration of infrastructure. The principal contribution is offered as an end-to-end architectural blueprint and vision for an AI-driven deception platform. Feasibility is evidenced by a functional prototype of the central decision mechanism, in which a reinforcement learning (RL) agent determines, in real time, when sessions should be escalated from low-interaction sensor nodes to dynamically provisioned, high-interaction honeypots. Because sufficient live data were unavailable, field-scale validation is not claimed; instead, design trade-offs and limitations are detailed, and a rigorous roadmap toward empirical evaluation at scale is provided. Beyond selective escalation and anomaly detection, the architecture pursues automated extraction, clustering, and versioning of bot attack chains, a core capability motivated by the empirical observation that exposed services are dominated by automated traffic. Together, these elements delineate a practical path toward cost-efficient capture of high-value adversary behavior, systematic bot versioning, and the production of actionable threat intelligence.

[LG-2] Graph-Based Learning of Spectro-Topographical EEG Representations with Gradient Alignment for Brain-Computer Interfaces

Link: https://arxiv.org/abs/2512.07820
Authors: Prithila Angkan,Amin Jalali,Paul Hungler,Ali Etemad
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We present a novel graph-based learning of EEG representations with gradient alignment (GEEGA) that leverages multi-domain information to learn EEG representations for brain-computer interfaces. Our model leverages graph convolutional networks to fuse embeddings from frequency-based topographical maps and time-frequency spectrograms, capturing inter-domain relationships. GEEGA addresses the challenge of achieving high inter-class separability, which arises from the temporally dynamic and subject-sensitive nature of EEG signals by incorporating the center loss and pairwise difference loss. Additionally, GEEGA incorporates a gradient alignment strategy to resolve conflicts between gradients from different domains and the fused embeddings, ensuring that discrepancies, where gradients point in conflicting directions, are aligned toward a unified optimization direction. We validate the efficacy of our method through extensive experiments on three publicly available EEG datasets: BCI-2a, CL-Drive and CLARE. Comprehensive ablation studies further highlight the impact of various components of our model.

[LG-3] GatedFWA: Linear Flash Windowed Attention with Gated Associative Memory

Link: https://arxiv.org/abs/2512.07782
Authors: Jiaxu Liu,Yuhe Bai,Christos-Savvas Bouganis
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Modern autoregressive models rely on attention, yet the Softmax full attention in Transformers scales quadratically with sequence length. Sliding Window Attention (SWA) achieves linear-time encoding/decoding by constraining the attention pattern, but under an Associative Memory interpretation, its difference-style update renders the training objective effectively unbounded. In contrast, Softmax attention normalizes updates, leading to memory shrinkage and gradient vanishing. We propose GatedFWA: a Memory-Gated (Flash) Windowed Attention mechanism that preserves SWA's efficiency while stabilizing memory updates and making gradient flow controllable. In essence, GatedFWA accumulates a per-token/head gate into a decay bias added to the attention logits, acting as a learnable contraction in the memory recurrence. We implement a fused one-pass gate preprocessing and a FlashAttention-compatible kernel that injects the gate under a sliding mask, ensuring I/O efficiency and numerical stability. On language modelling benchmarks, GatedFWA delivers competitive throughput with negligible overhead and better use of global context, and it integrates cleanly with token compression/selection methods such as NSA and generalizes to various autoregressive domains.
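
One way to read "a per-token gate accumulated into a decay bias added to the attention logits" is sketched below in plain Python. This is an interpretation under stated assumptions, not the paper's kernel: the bias is assumed to be the difference of cumulative log-gates between the query and key positions, the function name `gated_window_attention` and its arguments are hypothetical, and a single head is handled with no fused or FlashAttention-style implementation.

```python
import math
from typing import List


def gated_window_attention(scores: List[List[float]],
                           log_gates: List[float],
                           window: int) -> List[List[float]]:
    """Causal sliding-window softmax with an additive decay bias.

    scores[t][s] holds the raw query-key logit for query t and key s.
    log_gates[t] <= 0 is the log of an assumed per-token gate in (0, 1].
    The assumed bias for (t, s) is sum(log_gates[s+1 .. t]), so older keys decay more.
    """
    length = len(scores)
    cumulative = []
    running = 0.0
    for g in log_gates:
        running += g
        cumulative.append(running)

    weights = []
    for t in range(length):
        start = max(0, t - window + 1)  # sliding mask: only the last `window` keys
        logits = [scores[t][s] + (cumulative[t] - cumulative[s])
                  for s in range(start, t + 1)]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        row = [0.0] * length
        for offset, s in enumerate(range(start, t + 1)):
            row[s] = exps[offset] / total
        weights.append(row)
    return weights


# Toy input: identical raw scores, constant gate, window of two tokens.
toy_scores = [[0.0] * 4 for _ in range(4)]
toy_weights = gated_window_attention(toy_scores, [-0.5] * 4, window=2)
print(toy_weights[3])
```

With identical raw scores, the bias alone makes the most recent key receive the larger weight inside the window, which is the contraction-like behaviour the abstract attributes to the gate.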

[LG-4] Formalized Hopfield Networks and Boltzmann Machines

Link: https://arxiv.org/abs/2512.07766
Authors: Matteo Cipollina,Michail Karatarakis,Freek Wiedijk
Categories: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments:

Click to view abstract

Abstract:Neural networks are widely used, yet their analysis and verification remain challenging. In this work, we present a Lean 4 formalization of neural networks, covering both deterministic and stochastic models. We first formalize Hopfield networks, recurrent networks that store patterns as stable states. We prove convergence and the correctness of Hebbian learning, a training rule that updates network parameters to encode patterns, here limited to the case of pairwise-orthogonal patterns. We then consider stochastic networks, where updates are probabilistic and convergence is to a stationary distribution. As a canonical example, we formalize the dynamics of Boltzmann machines and prove their ergodicity, showing convergence to a unique stationary distribution using a new formalization of the Perron-Frobenius theorem.
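
The property this abstract states for Hebbian learning, that pairwise-orthogonal patterns are stored as stable states, can be checked numerically on a toy case. The Python sketch below is an informal illustration and not the Lean 4 formalization: the 1/n scaling, the zero diagonal, the fixed update order, and the function names are assumptions chosen for the example.

```python
from typing import List


def hebbian_weights(patterns: List[List[int]]) -> List[List[float]]:
    """Hebbian rule: W[i][j] = (1/n) * sum over patterns of p[i] * p[j], with zero diagonal."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w


def sweep(w: List[List[float]], state: List[int]) -> List[int]:
    """One asynchronous pass: update neurons in index order using the sign of the local field."""
    s = list(state)
    for i in range(len(s)):
        field = sum(w[i][j] * s[j] for j in range(len(s)))
        if field != 0:  # leave the neuron unchanged on a tie
            s[i] = 1 if field > 0 else -1
    return s


def run_until_stable(w: List[List[float]], state: List[int], max_sweeps: int = 100) -> List[int]:
    """Repeat sweeps until the state stops changing (or the sweep budget runs out)."""
    s = list(state)
    for _ in range(max_sweeps):
        nxt = sweep(w, s)
        if nxt == s:
            return s
        s = nxt
    return s


# Two pairwise-orthogonal patterns in {-1, +1}^4 (their dot product is 0).
patterns = [[1, 1, 1, 1], [1, -1, 1, -1]]
weights = hebbian_weights(patterns)
recalled = run_until_stable(weights, [-1, 1, 1, 1])  # first pattern with one flipped bit
print(recalled)
```

Which stored state a corrupted input settles into depends on the update order; only the fixed-point property of the stored patterns is the claim being illustrated.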

[LG-5] A multimodal Bayesian Network for symptom-level depression and anxiety prediction from voice and speech data

Link: https://arxiv.org/abs/2512.07741
Authors: Agnes Norbury,George Fairs,Alexandra L. Georgescu,Matthew M. Nour,Emilia Molimpakis,Stefano Goria
Categories: Machine Learning (cs.LG); Sound (cs.SD)
Comments:

Click to view abstract

Abstract:During psychiatric assessment, clinicians observe not only what patients report, but important nonverbal signs such as tone, speech rate, fluency, responsiveness, and body language. Weighing and integrating these different information sources is a challenging task and a good candidate for support by intelligence-driven tools - however this is yet to be realized in the clinic. Here, we argue that several important barriers to adoption can be addressed using Bayesian network modelling. To demonstrate this, we evaluate a model for depression and anxiety symptom prediction from voice and speech features in large-scale datasets (30,135 unique speakers). Alongside performance for conditions and symptoms (for depression, anxiety ROC-AUC=0.842,0.831 ECE=0.018,0.015; core individual symptom ROC-AUC0.74), we assess demographic fairness and investigate integration across and redundancy between different input modality types. Clinical usefulness metrics and acceptability to mental health service users are explored. When provided with sufficiently rich and large-scale multimodal data streams and specified to represent common mental conditions at the symptom rather than disorder level, such models are a principled approach for building robust assessment support tools: providing clinically-relevant outputs in a transparent and explainable format that is directly amenable to expert clinical supervision.

[LG-6] Delay-Aware Diffusion Policy: Bridging the Observation-Execution Gap in Dynamic Tasks

Link: https://arxiv.org/abs/2512.07697
Authors: Aileen Liao,Dong-Ki Kim,Max Olan Smith,Ali-akbar Agha-mohammadi,Shayegan Omidshafiei
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:As a robot senses and selects actions, the world keeps changing. This inference delay creates a gap of tens to hundreds of milliseconds between the observed state and the state at execution. In this work, we take the natural generalization from zero delay to measured delay during training and inference. We introduce Delay-Aware Diffusion Policy (DA-DP), a framework for explicitly incorporating inference delays into policy learning. DA-DP corrects zero-delay trajectories to their delay-compensated counterparts, and augments the policy with delay conditioning. We empirically validate DA-DP on a variety of tasks, robots, and delays and find its success rate more robust to delay than delay-unaware methods. DA-DP is architecture agnostic and transfers beyond diffusion policies, offering a general pattern for delay-aware imitation learning. More broadly, DA-DP encourages evaluation protocols that report performance as a function of measured latency, not just task difficulty.

[LG-7] A Bootstrap Perspective on Stochastic Gradient Descent

Link: https://arxiv.org/abs/2512.07676
Authors: Hongjian Lan,Yucong Liu,Florian Schäfer
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Machine learning models trained with stochastic gradient descent (SGD) can generalize better than those trained with deterministic gradient descent (GD). In this work, we study SGD’s impact on generalization through the lens of the statistical bootstrap: SGD uses gradient variability under batch sampling as a proxy for solution variability under the randomness of the data collection process. We use empirical results and theoretical analysis to substantiate this claim. In idealized experiments on empirical risk minimization, we show that SGD is drawn to parameter choices that are robust under resampling and thus avoids spurious solutions even if they lie in wider and deeper minima of the training loss. We prove rigorously that by implicitly regularizing the trace of the gradient covariance matrix, SGD controls the algorithmic variability. This regularization leads to solutions that are less sensitive to sampling noise, thereby improving generalization. Numerical experiments on neural network training show that explicitly incorporating the estimate of the algorithmic variability as a regularizer improves test performance. This fact supports our claim that bootstrap estimation underpins SGD’s generalization advantages.
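
The quantity named in this abstract, the trace of the gradient covariance matrix under sampling, is easy to compute on a toy problem. The Python sketch below assumes a linear least-squares model with per-sample loss 0.5 * (w·x - y)^2; the model, the data, and the function names are illustrative assumptions and are not taken from the paper.

```python
from typing import List, Sequence


def per_sample_gradients(w: Sequence[float],
                         xs: List[List[float]],
                         ys: List[float]) -> List[List[float]]:
    """Gradient of 0.5 * (w.x - y)^2 with respect to w, one row per sample."""
    grads = []
    for x, y in zip(xs, ys):
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        grads.append([residual * xi for xi in x])
    return grads


def gradient_covariance_trace(grads: List[List[float]]) -> float:
    """Sum of the per-coordinate variances of the per-sample gradients."""
    n = len(grads)
    trace = 0.0
    for d in range(len(grads[0])):
        column = [g[d] for g in grads]
        mean = sum(column) / n
        trace += sum((v - mean) ** 2 for v in column) / n
    return trace


xs = [[1.0], [2.0], [3.0]]
w = [2.0]
consistent_ys = [2.0, 4.0, 6.0]  # every sample agrees with w = 2
noisy_ys = [2.0, 4.0, 9.0]       # the last sample disagrees

trace_consistent = gradient_covariance_trace(per_sample_gradients(w, xs, consistent_ys))
trace_noisy = gradient_covariance_trace(per_sample_gradients(w, xs, noisy_ys))
print(trace_consistent, trace_noisy)
```

Adding a multiple of this trace to the training loss would be one explicit form of the regularizer the abstract mentions; how the authors estimate and weight it is not specified here.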

[LG-8] Depth-Wise Activation Steering for Honest Language Models

Link: https://arxiv.org/abs/2512.07667
Authors: Gracjan Góral,Marysia Winkels,Steven Basart
Categories: Machine Learning (cs.LG)
Comments: See this https URL for code and experiments

Click to view abstract

Abstract:Large language models sometimes assert falsehoods despite internally representing the correct answer, failures of honesty rather than accuracy, which undermines auditability and safety. Existing approaches largely optimize factual correctness or depend on retraining and brittle single-layer edits, offering limited leverage over truthful reporting. We present a training-free activation steering method that weights steering strength across network depth using a Gaussian schedule. On the MASK benchmark, which separates honesty from knowledge, we evaluate seven models spanning the LLaMA, Qwen, and Mistral families and find that Gaussian scheduling improves honesty over no-steering and single-layer baselines in six of seven models. Equal-budget ablations on LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct show the Gaussian schedule outperforms random, uniform, and box-filter depth allocations, indicating that how intervention is distributed across depth materially affects outcomes beyond total strength. The method is simple, model-agnostic, requires no finetuning, and provides a low-cost control knob for eliciting truthful reporting from models’ existing capabilities.
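
The idea of weighting steering strength across depth with a Gaussian schedule, compared at equal total budget against a uniform allocation, can be sketched as follows. The parameter names `center`, `width`, and `budget`, the normalization to a fixed sum, and the helper `steer` are assumptions for illustration; the abstract does not give the authors' exact parameterization.

```python
import math
from typing import List


def gaussian_schedule(num_layers: int, center: float, width: float,
                      budget: float = 1.0) -> List[float]:
    """Per-layer steering strengths shaped like a Gaussian over depth, summing to `budget`."""
    raw = [math.exp(-((layer - center) ** 2) / (2 * width ** 2))
           for layer in range(num_layers)]
    total = sum(raw)
    return [budget * r / total for r in raw]


def uniform_schedule(num_layers: int, budget: float = 1.0) -> List[float]:
    """Equal-budget baseline: the same strength at every layer."""
    return [budget / num_layers] * num_layers


def steer(hidden: List[float], direction: List[float], strength: float) -> List[float]:
    """Add a scaled steering direction to one layer's activation vector."""
    return [h + strength * d for h, d in zip(hidden, direction)]


gauss = gaussian_schedule(num_layers=8, center=3.5, width=1.5, budget=4.0)
flat = uniform_schedule(num_layers=8, budget=4.0)
print([round(g, 3) for g in gauss])
```

In use, layer `l` of a model would receive `steer(hidden_l, direction, gauss[l])`; because both schedules sum to the same budget, any difference in outcome comes from how the strength is distributed over depth.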

[LG-9] Exploring Test-time Scaling via Prediction Merging on Large-Scale Recommendation

Link: https://arxiv.org/abs/2512.07650
Authors: Fuyuan Lyu,Zhentai Chen,Jingyan Jiang,Lingjie Li,Xing Tang,Xiuqiang He,Xue Liu
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Inspired by the success of language models (LM), scaling up deep learning recommendation systems (DLRS) has become a recent trend in the community. All previous methods tend to scale up the model parameters during training time. However, how to efficiently utilize and scale up computational resources during test time remains underexplored, which can prove to be a scaling-efficient approach and bring orthogonal improvements in LM domains. The key point in applying test-time scaling to DLRS lies in effectively generating diverse yet meaningful outputs for the same instance. We propose two ways: One is to explore the heterogeneity of different model architectures. The other is to utilize the randomness of model initialization under a homogeneous architecture. The evaluation is conducted across eight models, including both classic and SOTA models, on three benchmarks. Sufficient evidence proves the effectiveness of both solutions. We further prove that under the same inference budget, test-time scaling can outperform parameter scaling. Our test-time scaling can also be seamlessly accelerated with the increase in parallel servers when deployed online, without affecting the inference time on the user side. Code is available.
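
Prediction merging over several models (different architectures, or one architecture with different random initializations) can be pictured with a simple average of their scores for the same instances. The sketch below assumes averaging as the merge operator and uses made-up scores; the abstract does not state which operator the authors use, and `merge_predictions` is a hypothetical name.

```python
from typing import List


def merge_predictions(per_model_scores: List[List[float]]) -> List[float]:
    """Average the scores that several models assign to the same instances.

    per_model_scores[m][i] is the score model m gives to instance i.
    """
    num_models = len(per_model_scores)
    return [sum(column) / num_models for column in zip(*per_model_scores)]


# Three hypothetical models scoring the same two instances.
scores = [
    [0.2, 0.9],
    [0.4, 0.7],
    [0.3, 0.8],
]
merged = merge_predictions(scores)
print(merged)
```

Since each model scores the instance independently, the per-model calls can run on parallel servers and only the final average is computed once per request.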

[LG-10] RRAEDy: Adaptive Latent Linearization of Nonlinear Dynamical Systems

Link: https://arxiv.org/abs/2512.07542
Authors: Jad Mounayer,Sebastian Rodriguez,Jerome Tomezyk,Chady Ghnatios,Francisco Chinesta
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Most existing latent-space models for dynamical systems require fixing the latent dimension in advance, rely on complex loss balancing to approximate linear dynamics, and do not regularize the latent variables. We introduce RRAEDy, a model that removes these limitations by discovering the appropriate latent dimension, while enforcing both regularized and linearized dynamics in the latent space. Built upon Rank-Reduction Autoencoders (RRAEs), RRAEDy automatically ranks and prunes latent variables through their singular values while learning a latent Dynamic Mode Decomposition (DMD) operator that governs their temporal progression. This structure-free yet linearly constrained formulation enables the model to learn stable and low-dimensional dynamics without auxiliary losses or manual tuning. We provide theoretical analysis demonstrating the stability of the learned operator and showcase the generality of our model by proposing an extension that handles parametric ODEs. Experiments on canonical benchmarks, including the Van der Pol oscillator, Burgers’ equation, 2D Navier-Stokes, and Rotating Gaussians, show that RRAEDy achieves accurate and robust predictions. Our code is open-source and available at this https URL. We also provide a video summarizing the main results at this https URL.

[LG-11] FRWKV: Frequency-Domain Linear Attention for Long-Term Time Series Forecasting

Link: https://arxiv.org/abs/2512.07539
Authors: Qingyuan Yang,Shizhuo,Dongyue Chen,Da Teng,Zehua Gan
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Traditional Transformers face a major bottleneck in long-sequence time series forecasting due to their quadratic complexity ($\mathcal{O}(T^2)$) and their limited ability to effectively exploit frequency-domain information. Inspired by RWKV’s $\mathcal{O}(T)$ linear attention and frequency-domain modeling, we propose FRWKV, a frequency-domain linear-attention framework that overcomes these limitations. Our model integrates linear attention mechanisms with frequency-domain analysis, achieving $\mathcal{O}(T)$ computational complexity in the attention path while exploiting spectral information to enhance temporal feature representations for scalable long-sequence modeling. Across eight real-world datasets, FRWKV achieves a first-place average rank. Our ablation studies confirm the critical roles of both the linear attention and frequency-encoder components. This work demonstrates the powerful synergy between linear attention and frequency analysis, establishing a new paradigm for scalable time series modeling. Code is available at this repository: this https URL.

[LG-12] Machine Learning: Progress and Prospects

Link: https://arxiv.org/abs/2512.07519
Authors: Alexander Gammerman
Categories: Machine Learning (cs.LG)
Comments: Inaugural Lecture. 18 pages, 13 figures, Published in 1997 by Royal Holloway, University of London, ISBN 0 900145 93 5

Click to view abstract

Abstract:This Inaugural Lecture was given at Royal Holloway University of London in 1996. It covers an introduction to machine learning and describes various theoretical advances and practical projects in the field. The Lecture here is presented in its original format, but a few remarks have been added in 2025 to reflect recent developments, and the list of references has been updated to enhance the convenience and accuracy for readers. When did machine learning start? Maybe a good starting point is 1949, when Claude Shannon proposed a learning algorithm for chess-playing programs. Or maybe we should go back to the 1930s when Ronald Fisher developed discriminant analysis - a type of learning where the problem is to construct a decision rule that separates two types of vectors. Or could it be the 18th century when David Hume discussed the idea of induction? Or the 14th century, when William of Ockham formulated the principle of “simplicity” known as “Ockham’s razor” (Ockham, by the way, is a small village not far from Royal Holloway). Or it may be that, like almost everything else in Western civilisation and culture, the origin of these ideas lies in the Mediterranean. After all, it was Aristotle who said that “we learn some things only by doing things”. The field of machine learning has been greatly influenced by other disciplines and the subject is in itself not a very homogeneous discipline, but includes separate, overlapping subfields. There are many parallel lines of research in ML: inductive learning, neural networks, clustering, and theories of learning. They are all part of the more general field of machine learning.

[LG-13] Efficient Low-Tubal-Rank Tensor Estimation via Alternating Preconditioned Gradient Descent

Link: https://arxiv.org/abs/2512.07490
Authors: Zhiyu Liu, Zhi Han, Yandong Tang, Jun Fan, Yao Wang
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:The problem of low-tubal-rank tensor estimation is a fundamental task with wide applications across high-dimensional signal processing, machine learning, and image science. Traditional approaches tackle such a problem by performing tensor singular value decomposition, which is computationally expensive and becomes infeasible for large-scale tensors. Recent approaches address this issue by factorizing the tensor into two smaller factor tensors and solving the resulting problem using gradient descent. However, this kind of approach requires an accurate estimate of the tensor rank, and when the rank is overestimated, the convergence of gradient descent and its variants slows down significantly or even diverges. To address this problem, we propose an Alternating Preconditioned Gradient Descent (APGD) algorithm, which accelerates convergence in the over-parameterized setting by adding a preconditioning term to the original gradient and updating these two factors alternately. Based on certain geometric assumptions on the objective function, we establish linear convergence guarantees for more general low-tubal-rank tensor estimation problems. Then we further analyze the specific cases of low-tubal-rank tensor factorization and low-tubal-rank tensor recovery. Our theoretical results show that APGD achieves linear convergence even under over-parameterization, and the convergence rate is independent of the tensor condition number. Extensive simulations on synthetic data are carried out to validate our theoretical assertions.

[LG-14] Materium: An Autoregressive Approach for Material Generation

Link: https://arxiv.org/abs/2512.07486
Authors: Niklas Dobberstein, Jan Hamaekers
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Comments:

Abstract:We present Materium: an autoregressive transformer for generating crystal structures that converts 3D material representations into token sequences. These sequences include elements with oxidation states, fractional coordinates and lattice parameters. Unlike diffusion approaches, which refine atomic positions iteratively through many denoising steps, Materium places atoms at precise fractional coordinates, enabling fast, scalable generation. With this design, the model can be trained in a few hours on a single GPU and generate samples much faster on GPUs and CPUs than diffusion-based approaches. The model was trained and evaluated using multiple properties as conditions, including fundamental properties, such as density and space group, as well as more practical targets, such as band gap and magnetic density. In both single and combined conditions, the model performs consistently well, producing candidates that align with the requested inputs.

[LG-15] Affordance Field Intervention: Enabling VLAs to Escape Memory Traps in Robotic Manipulation

Link: https://arxiv.org/abs/2512.07472
Authors: Siyu Xu, Zijian Wang, Yunke Wang, Chenghao Xia, Tao Huang, Chang Xu
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Abstract:Vision-Language-Action (VLA) models have shown great performance in robotic manipulation by mapping visual observations and language instructions directly to actions. However, they remain brittle under distribution shifts: when test scenarios change, VLAs often reproduce memorized trajectories instead of adapting to the updated scene, which is a failure mode we refer to as the “Memory Trap”. This limitation stems from the end-to-end design, which lacks explicit 3D spatial reasoning and prevents reliable identification of actionable regions in unfamiliar environments. To compensate for this missing spatial understanding, 3D Spatial Affordance Fields (SAFs) can provide a geometric representation that highlights where interactions are physically feasible, offering explicit cues about regions the robot should approach or avoid. We therefore introduce Affordance Field Intervention (AFI), a lightweight hybrid framework that uses SAFs as an on-demand plug-in to guide VLA behavior. Our system detects memory traps through proprioception, repositions the robot to recent high-affordance regions, and proposes affordance-driven waypoints that anchor VLA-generated actions. A SAF-based scorer then selects trajectories with the highest cumulative affordance. Extensive experiments demonstrate that our method achieves an average improvement of 23.5% across different VLA backbones ($\pi_0$ and $\pi_{0.5}$) under out-of-distribution scenarios on real-world robotic platforms, and 20.2% on the LIBERO-Pro benchmark, validating its effectiveness in enhancing VLA robustness to distribution shifts.

[LG-16] Parallel Algorithms for Combined Regularized Support Vector Machines: Application in Music Genre Classification

Link: https://arxiv.org/abs/2512.07463
Authors: Rongmei Liang, Zizheng Liu, Xiaofei Wu, Jingwen Tu
Categories: Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO)
Comments:

Abstract:In the era of rapid development of artificial intelligence, its applications span diverse fields and rely heavily on effective data processing and model optimization. Combined Regularized Support Vector Machines (CR-SVMs) can effectively handle the structural information among data features, but efficient algorithms for big data stored in a distributed manner are lacking. To address this issue, we propose a unified optimization framework based on consensus structure. This framework is not only applicable to various loss functions and combined regularization terms but can also be effectively extended to non-convex regularization terms, showing strong scalability. Based on this framework, we develop a distributed parallel alternating direction method of multipliers (ADMM) algorithm to efficiently compute CR-SVMs when data is stored in a distributed manner. To ensure the convergence of the algorithm, we also introduce the Gaussian back-substitution method. Meanwhile, for completeness, we introduce a new model, the sparse group lasso support vector machine (SGL-SVM), and apply it to music information retrieval. Theoretical analysis confirms that the computational complexity of the proposed algorithm is not affected by different regularization terms and loss functions, highlighting the universality of the parallel algorithm. Experiments on synthetic and Free Music Archive datasets demonstrate the reliability, stability, and efficiency of the algorithm.

[LG-17] Mitigating Bias in Graph Hyperdimensional Computing

Link: https://arxiv.org/abs/2512.07433
Authors: Yezi Liu, William Youngwoo Chung, Yang Ni, Hanning Chen, Mohsen Imani
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments:

Abstract:Graph hyperdimensional computing (HDC) has emerged as a promising paradigm for cognitive tasks, emulating brain-like computation with high-dimensional vectors known as hypervectors. While HDC offers robustness and efficiency on graph-structured data, its fairness implications remain largely unexplored. In this paper, we study fairness in graph HDC, where biases in data representation and decision rules can lead to unequal treatment of different groups. We show how hypervector encoding and similarity-based classification can propagate or even amplify such biases, and we propose a fairness-aware training framework, FairGHDC, to mitigate them. FairGHDC introduces a bias correction term, derived from a gap-based demographic-parity regularizer, and converts it into a scalar fairness factor that scales the update of the class hypervector for the ground-truth label. This enables debiasing directly in the hypervector space without modifying the graph encoder or requiring backpropagation. Experimental results on six benchmark datasets demonstrate that FairGHDC substantially reduces demographic-parity and equal-opportunity gaps while maintaining accuracy comparable to standard GNNs and fairness-aware GNNs. At the same time, FairGHDC preserves the computational advantages of HDC, achieving up to about one order of magnitude ($\approx 10\times$) speedup in training time on GPU compared to GNN and fairness-aware GNN baselines.
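
The demographic-parity gap that the abstract builds on can be illustrated with a short sketch. The gap computation below is the standard definition for a binary sensitive attribute; the `fairness_factor` mapping is purely hypothetical, since the abstract does not give FairGHDC's actual formula.

```python
def demographic_parity_gap(predictions, groups):
    """Absolute difference in positive-prediction rate between group 0 and group 1.

    predictions: list of 0/1 predicted labels.
    groups: list of 0/1 sensitive-attribute values, same length.
    """
    rates = {}
    for g in (0, 1):
        members = [p for p, s in zip(predictions, groups) if s == g]
        rates[g] = sum(members) / len(members)
    return abs(rates[0] - rates[1])


def fairness_factor(gap, strength=1.0):
    # Hypothetical mapping from the gap to a scalar in (0, 1].
    # It only illustrates "a scalar that scales the update";
    # it is not the formula used by FairGHDC.
    return 1.0 / (1.0 + strength * gap)
```

A gap of 0 leaves the update unscaled (factor 1.0), while a larger gap shrinks it; how the real method uses the factor is described in the paper, not here.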

[LG-18] Adaptive Tuning of Parameterized Traffic Controllers via Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2512.07417
Authors: Giray Önür, Azita Dabiri, Bart De Schutter
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Effective traffic control is essential for mitigating congestion in transportation networks. Conventional traffic management strategies, including route guidance, ramp metering, and traffic signal control, often rely on state feedback controllers, used for their simplicity and reactivity; however, they lack the adaptability required to cope with complex and time-varying traffic dynamics. This paper proposes a multi-agent reinforcement learning framework in which each agent adaptively tunes the parameters of a state feedback traffic controller, combining the reactivity of state feedback controllers with the adaptability of reinforcement learning. By tuning parameters at a lower frequency rather than directly determining control actions at a high frequency, the reinforcement learning agents achieve improved training efficiency while maintaining adaptability to varying traffic conditions. The multi-agent structure further enhances system robustness, as local controllers can operate independently in the event of partial failures. The proposed framework is evaluated on a simulated multi-class transportation network under varying traffic conditions. Results show that the proposed multi-agent framework outperforms the no control and fixed-parameter state feedback control cases, while performing on par with the single-agent RL-based adaptive state feedback control, with a much better resilience to partial failures.

[LG-19] Empirical Results for Adjusting Truncated Backpropagation Through Time while Training Neural Audio Effects

Link: https://arxiv.org/abs/2512.07393
Authors: Yann Bourdin (ASTRAL), Pierrick Legrand (ASTRAL, ENSC, IMS), Fanny Roche
Categories: Machine Learning (cs.LG)
Comments:

Abstract:This paper investigates the optimization of Truncated Backpropagation Through Time (TBPTT) for training neural networks in digital audio effect modeling, with a focus on dynamic range compression. The study evaluates key TBPTT hyperparameters (sequence number, batch size, and sequence length) and their influence on model performance. Using a convolutional-recurrent architecture, we conduct extensive experiments across datasets with and without conditioning by user controls. Results demonstrate that carefully tuning these parameters enhances model accuracy and training stability, while also reducing computational demands. Objective evaluations confirm improved performance with optimized settings, while subjective listening tests indicate that the revised TBPTT configuration maintains high perceptual quality.

[LG-20] PrivORL: Differentially Private Synthetic Dataset for Offline Reinforcement Learning NDSS2026

Link: https://arxiv.org/abs/2512.07342
Authors: Chen Gong, Zheng Liu, Kecen Li, Tianhao Wang
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted at NDSS 2026; code available at this https URL

Abstract:Recently, offline reinforcement learning (RL) has become a popular RL paradigm. In offline RL, data providers share pre-collected datasets – either as individual transitions or sequences of transitions forming trajectories – to enable the training of RL models (also called agents) without direct interaction with the environments. Offline RL saves interactions with environments compared to traditional RL, and has been effective in critical areas, such as navigation tasks. Meanwhile, concerns about privacy leakage from offline RL datasets have emerged. To safeguard private information in offline RL datasets, we propose the first differential privacy (DP) offline dataset synthesis method, PrivORL, which leverages a diffusion model and diffusion transformer to synthesize transitions and trajectories, respectively, under DP. The synthetic dataset can then be securely released for downstream analysis and research. PrivORL adopts the popular approach of pre-training a synthesizer on public datasets, and then fine-tuning on sensitive datasets using DP Stochastic Gradient Descent (DP-SGD). Additionally, PrivORL introduces curiosity-driven pre-training, which uses feedback from the curiosity module to diversify the synthetic dataset and thus can generate diverse synthetic transitions and trajectories that closely resemble the sensitive dataset. Extensive experiments on five sensitive offline RL datasets show that our method achieves better utility and fidelity in both DP transition and trajectory synthesis compared to baselines. The replication package is available at the GitHub repository.
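
The DP-SGD fine-tuning step mentioned in the abstract can be sketched generically: each per-sample gradient is clipped to a maximum L2 norm, Gaussian noise scaled to that norm is added to the sum, and the result is averaged. This is a minimal pure-Python sketch of the general technique, not PrivORL's training code; the function names and the flat-list gradient representation are assumptions made for illustration.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(per_sample_grads, max_norm, noise_multiplier, rng):
    """Clip each per-sample gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_sample_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Noise standard deviation is noise_multiplier * max_norm per coordinate.
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]


rng = random.Random(0)
update = dp_sgd_gradient([[3.0, 4.0], [0.0, 0.5]], max_norm=1.0,
                         noise_multiplier=1.1, rng=rng)
```

The privacy guarantee depends on the noise multiplier, the sampling rate, and the number of steps, which a privacy accountant would track; that part is omitted here.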

[LG-21] Learning-Augmented Ski Rental with Discrete Distributions: A Bayesian Approach

Link: https://arxiv.org/abs/2512.07313
Authors: Bosun Kang, Hyejun Park, Chenglin Fan
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
Comments: 7 pages

Abstract:We revisit the classic ski rental problem through the lens of Bayesian decision-making and machine-learned predictions. While traditional algorithms minimize worst-case cost without assumptions, and recent learning-augmented approaches leverage noisy forecasts with robustness guarantees, our work unifies these perspectives. We propose a discrete Bayesian framework that maintains exact posterior distributions over the time horizon, enabling principled uncertainty quantification and seamless incorporation of expert priors. Our algorithm achieves prior-dependent competitive guarantees and gracefully interpolates between worst-case and fully-informed settings. Our extensive experimental evaluation demonstrates superior empirical performance across diverse scenarios, achieving near-optimal results under accurate priors while maintaining robust worst-case guarantees. This framework naturally extends to incorporate multiple predictions, non-uniform priors, and contextual information, highlighting the practical advantages of Bayesian reasoning in online decision problems with imperfect predictions.
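
To make the Bayesian view of ski rental concrete, here is a minimal sketch under textbook assumptions: renting costs 1 per day, buying costs `buy_cost` once, and the number of ski days follows a discrete prior. The function names and the dictionary representation of the prior are illustrative assumptions, not the paper's algorithm.

```python
def expected_cost(buy_day, prior, buy_cost):
    """Expected cost of renting until buy_day and buying on that day.

    prior: dict mapping horizon (number of ski days) -> probability.
    buy_day=None means renting for the whole horizon.
    """
    total = 0.0
    for horizon, prob in prior.items():
        if buy_day is None or horizon < buy_day:
            cost = horizon                   # season ended before buying
        else:
            cost = (buy_day - 1) + buy_cost  # rented buy_day-1 days, then bought
        total += prob * cost
    return total


def best_buy_day(prior, buy_cost):
    """Buy day (or None for never) that minimizes expected cost under the prior."""
    candidates = [None] + list(range(1, max(prior) + 1))
    return min(candidates, key=lambda k: expected_cost(k, prior, buy_cost))


def condition_on_survival(prior, day):
    """Exact posterior over the horizon, given that skiing continues on `day`."""
    mass = sum(p for t, p in prior.items() if t >= day)
    return {t: p / mass for t, p in prior.items() if t >= day}
```

With a prior split evenly between a 2-day and a 20-day season and a purchase price of 5, waiting until day 3 is optimal in expectation: the first two days reveal which season it is.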

[LG-22] Towards a Relationship-Aware Transformer for Tabular Data

Link: https://arxiv.org/abs/2512.07310
Authors: Andrei V. Konstantinov, Valerii A. Zuev, Lev V. Utkin
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Deep learning models for tabular data typically do not allow for imposing a graph of external dependencies between samples, which can be useful for accounting for relatedness in tasks such as treatment effect estimation. Graph neural networks only consider adjacent nodes, making them difficult to apply to sparse graphs. This paper proposes several solutions based on a modified attention mechanism, which accounts for possible relationships between data points by adding a term to the attention matrix. Our models are compared with each other and with gradient boosting decision trees in a regression task on synthetic and real-world datasets, as well as in a treatment effect estimation task on the IHDP dataset.
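
The idea of "adding a term to the attention matrix" can be sketched with scaled dot-product attention over samples, where a relationship matrix is added to the scores before the softmax. This is one plausible reading, written in pure Python for illustration; the `relation` matrix, the `weight` parameter, and the additive placement before the softmax are assumptions, and the paper proposes several variants that may differ.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def relation_aware_attention(queries, keys, values, relation, weight=1.0):
    """Dot-product attention with an additive relationship term on the scores.

    relation[i][j] encodes how strongly sample i is related to sample j
    (for example, an adjacency matrix of an external dependency graph).
    """
    dim = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            + weight * relation[i][j]
            for j, k in enumerate(keys)
        ]
        attn = softmax(scores)
        outputs.append([
            sum(attn[j] * values[j][c] for j in range(len(values)))
            for c in range(len(values[0]))
        ])
    return outputs
```

With an all-zero relation matrix this reduces to ordinary attention, so unrelated samples are still visible to each other, unlike message passing restricted to graph neighbours.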

[LG-23] PINE: Pipeline for Important Node Exploration in Attributed Networks

Link: https://arxiv.org/abs/2512.07244
Authors: Elizaveta Kovtun, Maksim Makarenko, Natalia Semenova, Alexey Zaytsev, Semen Budennyy
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Graphs with semantically attributed nodes are a common data structure in a wide range of domains, such as interlinked web data or citation networks of scientific publications. The essential problem for such data is to determine the nodes that carry greater importance than all the others, a task that markedly enhances system monitoring and management. Traditional methods for identifying important nodes in networks rely on centrality measures, such as node degree or the more complex PageRank. However, they consider only the network structure, neglecting the rich node attributes. Recent methods adopt neural networks capable of handling node features, but they require supervision. This work addresses the identified gap, the absence of approaches that are both unsupervised and attribute-aware, by introducing a Pipeline for Important Node Exploration (PINE). At the core of the proposed framework is an attention-based graph model that incorporates node semantic features into the process of learning the structural properties of the graph. PINE’s node importance scores leverage the obtained attention distribution. We demonstrate the superior performance of the proposed PINE method on various homogeneous and heterogeneous attributed networks. As an industry-implemented system, PINE tackles the real-world challenge of unsupervised identification of key entities within large-scale enterprise graphs.

[LG-24] MUSE: A Simple Yet Effective Multimodal Search-Based Framework for Lifelong User Interest Modeling

Link: https://arxiv.org/abs/2512.07216
Authors: Bin Wu, Feifan Yang, Zhangming Chan, Yu-Ran Gu, Jiawei Feng, Chao Yi, Xiang-Rong Sheng, Han Zhu, Jian Xu, Mang Ye, Bo Zheng
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Lifelong user interest modeling is crucial for industrial recommender systems, yet existing approaches rely predominantly on ID-based features, suffering from poor generalization on long-tail items and limited semantic expressiveness. While recent work explores multimodal representations for behavior retrieval in the General Search Unit (GSU), it often neglects multimodal integration in the fine-grained modeling stage, the Exact Search Unit (ESU). In this work, we present a systematic analysis of how to effectively leverage multimodal signals across both stages of the two-stage lifelong modeling framework. Our key insight is that simplicity suffices in the GSU: lightweight cosine similarity with high-quality multimodal embeddings outperforms complex retrieval mechanisms. In contrast, the ESU demands richer multimodal sequence modeling and effective ID-multimodal fusion to unlock its full potential. Guided by these principles, we propose MUSE, a simple yet effective multimodal search-based framework. MUSE has been deployed in the Taobao display advertising system, enabling 100K-length user behavior sequence modeling and delivering significant gains in top-line metrics with negligible online latency overhead. To foster community research, we share industrial deployment practices and open-source the first large-scale dataset featuring ultra-long behavior sequences paired with high-quality multimodal embeddings. Our code and data are available at this https URL.
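
The GSU insight (plain cosine similarity over good embeddings) can be illustrated with a top-k retrieval sketch. The function name `gsu_retrieve` and the list-of-lists embedding format are assumptions for illustration; a production system handling 100K-length sequences would use vectorized or approximate search rather than this loop.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gsu_retrieve(target_embedding, behavior_embeddings, k):
    """Indices of the k past behaviors most similar to the target item."""
    ranked = sorted(
        range(len(behavior_embeddings)),
        key=lambda i: cosine(target_embedding, behavior_embeddings[i]),
        reverse=True,
    )
    return ranked[:k]


history = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
top = gsu_retrieve([1.0, 0.0], history, k=2)
```

The retrieved subsequence would then be passed to the heavier ESU stage, which is where the abstract says richer multimodal modeling pays off.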

[LG-25] Coherent Audio-Visual Editing via Conditional Audio Generation Following Video Edits

Link: https://arxiv.org/abs/2512.07209
Authors: Masato Ishii, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji
Categories: Multimedia (cs.MM); Machine Learning (cs.LG); Sound (cs.SD)
Comments:

Abstract:We introduce a novel pipeline for joint audio-visual editing that enhances the coherence between edited video and its accompanying audio. Our approach first applies state-of-the-art video editing techniques to produce the target video, then performs audio editing to align with the visual changes. To achieve this, we present a new video-to-audio generation model that conditions on the source audio, target video, and a text prompt. We extend the model architecture to incorporate conditional audio input and propose a data augmentation strategy that improves training efficiency. Furthermore, our model dynamically adjusts the influence of the source audio based on the complexity of the edits, preserving the original audio structure where possible. Experimental results demonstrate that our method outperforms existing approaches in maintaining audio-visual alignment and content integrity.

[LG-26] Less is More: Non-uniform Road Segments are Efficient for Bus Arrival Prediction

Link: https://arxiv.org/abs/2512.07200
Authors: Zhen Huang, Jiaxin Deng, Jiayu Xu, Junbiao Pang, Haitao Yu
Categories: Machine Learning (cs.LG)
Comments:

[LG-27] UniDiff: A Unified Diffusion Framework for Multimodal Time Series Forecasting

Link: https://arxiv.org/abs/2512.07184
Authors: Da Zhang, Bingyu Li, Zhuyuan Zhao, Junyu Gao, Feiping Nie, Xuelong Li
Categories: Machine Learning (cs.LG)
Comments:

Abstract:As multimodal data proliferates across diverse real-world applications, leveraging heterogeneous information such as texts and timestamps for accurate time series forecasting (TSF) has become a critical challenge. While diffusion models demonstrate exceptional performance in generation tasks, their application to TSF remains largely confined to modeling single-modality numerical sequences, overlooking the abundant cross-modal signals inherent in complex heterogeneous data. To address this gap, we propose UniDiff, a unified diffusion framework for multimodal time series forecasting. To process the numerical sequence, our framework first tokenizes the time series into patches, preserving local temporal dynamics by mapping each patch to an embedding space via a lightweight MLP. At its core lies a unified and parallel fusion module, where a single cross-attention mechanism adaptively weighs and integrates structural information from timestamps and semantic context from texts in one step, enabling a flexible and efficient interplay between modalities. Furthermore, we introduce a novel classifier-free guidance mechanism designed for multi-source conditioning, allowing for decoupled control over the guidance strength of textual and temporal information during inference, which significantly enhances model robustness. Extensive experiments on real-world benchmark datasets across eight domains demonstrate that the proposed UniDiff model achieves state-of-the-art performance.
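
Decoupled guidance over two conditions can be sketched by extending standard classifier-free guidance with one weight per condition. The composition below is a common way to combine multiple conditions and is an assumption here; the abstract does not state UniDiff's exact formula, and the names `w_text` and `w_time` are illustrative.

```python
def guided_noise(eps_uncond, eps_text, eps_time, w_text, w_time):
    """Combine noise predictions with separate guidance weights.

    eps_uncond: prediction with both conditions dropped.
    eps_text:   prediction conditioned on text only.
    eps_time:   prediction conditioned on timestamps only.
    Each condition pushes the prediction away from the unconditional
    one, scaled by its own weight.
    """
    return [
        u + w_text * (t - u) + w_time * (s - u)
        for u, t, s in zip(eps_uncond, eps_text, eps_time)
    ]
```

Setting both weights to 0 recovers the unconditional prediction, and setting only one weight to 1 recovers the corresponding single-condition prediction, so each modality's influence can be tuned separately at inference time.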

[LG-28] SPACE: Noise Contrastive Estimation Stabilizes Self-Play Fine-Tuning for Large Language Models NEURIPS2025

Link: https://arxiv.org/abs/2512.07175
Authors: Yibo Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Lijun Zhang
Categories: Machine Learning (cs.LG)
Comments: NeurIPS 2025

Abstract:Self-play fine-tuning has demonstrated promising abilities in adapting large language models (LLMs) to downstream tasks with limited real-world data. The basic principle is to iteratively refine the model with real samples and synthetic ones generated from itself. However, the existing methods primarily focus on the relative gaps between the rewards for two types of data, neglecting their absolute values. Through theoretical analysis, we identify that the gap-based methods suffer from unstable evolution, due to the potentially degenerated objectives. To address this limitation, we introduce a novel self-play fine-tuning method, namely Self-PlAy via Noise Contrastive Estimation (SPACE), which leverages noise contrastive estimation to capture the real-world data distribution. Specifically, SPACE treats synthetic samples as auxiliary components, and discriminates them from the real ones in a binary classification manner. As a result, SPACE independently optimizes the absolute reward values for each type of data, ensuring a consistently meaningful objective and thereby avoiding the instability issue. Theoretically, we show that the optimal solution of the objective in SPACE aligns with the underlying distribution of real-world data, and SPACE guarantees a provably stable convergence to the optimal distribution. Empirically, we show that SPACE significantly improves the performance of LLMs over various tasks, and outperforms supervised fine-tuning that employs much more real-world samples. Compared to gap-based self-play fine-tuning methods, SPACE exhibits remarkable superiority and stable evolution.
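
The contrast between gap-based and binary-classification objectives can be seen in a toy sketch. Both losses below are generic logistic forms chosen for illustration (assumptions, not the exact objectives of SPACE or the methods it compares against): the gap loss depends only on the reward difference, while the binary loss scores each reward on its own.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def gap_loss(reward_real, reward_synth):
    """-log sigmoid(reward_real - reward_synth): sees only the gap."""
    return softplus(-(reward_real - reward_synth))


def binary_loss(reward_real, reward_synth):
    """Classify real as positive and synthetic as negative, independently.

    -log sigmoid(reward_real) - log(1 - sigmoid(reward_synth))
    """
    return softplus(-reward_real) + softplus(reward_synth)
```

Shifting both rewards by the same constant leaves `gap_loss` unchanged but changes `binary_loss`, which is one way to see why a gap-only objective cannot pin down absolute reward values.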

[LG-29] Improving the Throughput of Diffusion-based Large Language Models via a Training-Free Confidence-Aware Calibration

Link: https://arxiv.org/abs/2512.07173
Authors: Jucheng Shen, Gaurav Sarkar, Yeonju Ro, Sharath Nittur Sridhar, Zhangyang Wang, Aditya Akella, Souvik Kundu
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 3 figures. Preprint under review

Abstract:We present CadLLM, a training-free method to accelerate the inference throughput of diffusion-based LLMs (dLLMs). We first investigate the dynamic nature of token unmasking confidence across blocks and steps. Based on this observation, we present a lightweight adaptive approach that controls the generation block size, step size, and threshold based on the average confidence of unmasked tokens. We further reduce softmax overhead by dynamically leveraging a subset of the vocabulary to regulate sampling breadth. CadLLM is a plug-and-play, model-agnostic method compatible with KV-cache-based dLLMs. Extensive experiments on four popular tasks demonstrate that CadLLM yields up to 2.28x throughput improvement over the state-of-the-art baseline with competitive accuracy.

[LG-30] Chromatic Feature Vectors for 2-Trees: Exact Formulas for Partition Enumeration with Network Applications

Link: https://arxiv.org/abs/2512.07120
Authors: J. Allagan, G. Morgan, S. Langley, R. Lopez-Bonilla, V. Deriglazov
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: 18 pages

Abstract:We establish closed-form enumeration formulas for chromatic feature vectors of 2-trees under the bichromatic triangle constraint. These efficiently computable structural features derive from constrained graph colorings where each triangle uses exactly two colors, forbidding monochromatic and rainbow triangles, a constraint arising in distributed systems where components avoid complete concentration or isolation. For theta graphs $\Theta_n$, we prove $r_k(\Theta_n) = S(n-2, k-1)$ for $k = 3$ (Stirling numbers of the second kind) and $r_2(\Theta_n) = 2^{n-2} + 1$, computable in $O(n)$ time. For fan graphs $\Phi_n$, we establish $r_2(\Phi_n) = F_{n+1}$ (Fibonacci numbers) and derive explicit formulas $r_k(\Phi_n) = \sum_{t=k-1}^{n-1} a_{n-1,t} \, S(t, k-1)$ with efficiently computable binomial coefficients, achieving $O(n^2)$ computation per component. Unlike classical chromatic polynomials, which assign identical features to all $n$-vertex 2-trees, bichromatic constraints provide informative structural features. While not complete graph invariants, these features capture meaningful structural properties through connections to Fibonacci polynomials, Bell numbers, and independent set enumeration. Applications include Byzantine fault tolerance in hierarchical networks, VM allocation in cloud computing, and secret-sharing protocols in distributed cryptography.
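
The closed forms quoted in the abstract are cheap to evaluate. The sketch below computes Stirling numbers of the second kind with the standard recurrence and evaluates the stated formulas for theta and fan graphs. The helper names are illustrative, the Fibonacci indexing (F_1 = F_2 = 1) is an assumption, and the fan-graph formula for general k is omitted because the coefficients a_{n-1,t} are not defined in the abstract.

```python
def stirling2(n, k):
    """Stirling number of the second kind S(n, k) via the standard recurrence
    S(n, k) = k * S(n-1, k) + S(n-1, k-1), with S(0, 0) = 1."""
    table = [[0] * (k + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[n][k]


def fibonacci(n):
    """Fibonacci number F_n with F_0 = 0 and F_1 = F_2 = 1 (assumed indexing)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def r2_theta(n):
    """r_2(Theta_n) = 2^(n-2) + 1, as stated in the abstract."""
    return 2 ** (n - 2) + 1


def rk_theta(n, k):
    """r_k(Theta_n) = S(n-2, k-1), as stated in the abstract for k = 3."""
    return stirling2(n - 2, k - 1)


def r2_fan(n):
    """r_2(Phi_n) = F_{n+1}, under the Fibonacci indexing assumed above."""
    return fibonacci(n + 1)
```

Stacking these values for a range of k gives the kind of per-graph feature vector the abstract describes.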

[LG-31] PlantBiMoE: A Bidirectional Foundation Model with SparseMoE for Plant Genomes

Link: https://arxiv.org/abs/2512.07113
Authors: Kepeng Lin, Qizhe Zhang, Rui Wang, Xuehai Hu, Wei Xu
Categories: Machine Learning (cs.LG); Genomics (q-bio.GN)
Comments: 6 pages, 5 figures, accepted to BIBM

Abstract:Understanding the underlying linguistic rules of plant genomes remains a fundamental challenge in computational biology. Recent advances, including AgroNT and PDLLMs, have made notable progress, although they suffer from excessive parameter size and limited ability to model the bidirectional nature of DNA strands, respectively. To address these limitations, we propose PlantBiMoE, a lightweight and expressive plant genome language model that integrates bidirectional Mamba and a Sparse Mixture-of-Experts (SparseMoE) framework. The bidirectional Mamba enables the model to effectively capture structural dependencies across both the forward and reverse DNA strands, while SparseMoE significantly reduces the number of active parameters, improving computational efficiency without sacrificing modeling capacity. We evaluated our model on the Modified Plants Genome Benchmark (MPGB), an enhanced genomic benchmark that consolidates 31 datasets across 11 representative tasks, with input sequence lengths ranging from 50 to 6,000 bp. Experimental results demonstrate that PlantBiMoE achieves the best performance on 20 out of 31 datasets and the best average performance compared with existing models. In summary, these results demonstrate that our model can effectively represent plant genomic sequences, serving as a robust computational tool for diverse genomic tasks, while making substantive contributions to plant genomics, gene editing, and synthetic biology. The code is available at: this https URL

[LG-32] Dual Refinement Cycle Learning: Unsupervised Text Classification of Mamba and Community Detection on Text Attributed Graph

Link: https://arxiv.org/abs/2512.07100
Authors: Hong Wang, Yinglong Zhang, Hanhan Guo, Xuewen Xia, Xing Xu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Pretrained language models offer strong text understanding capabilities but remain difficult to deploy in real-world text-attributed networks due to their heavy dependence on labeled data. Meanwhile, community detection methods typically ignore textual semantics, limiting their usefulness in downstream applications such as content organization, recommendation, and risk monitoring. To overcome these limitations, we present Dual Refinement Cycle Learning (DRCL), a fully unsupervised framework designed for practical scenarios where no labels or category definitions are available. DRCL integrates structural and semantic information through a warm-start initialization and a bidirectional refinement cycle between a GCN-based Community Detection Module (GCN-CDM) and a Text Semantic Modeling Module (TSMM). The two modules iteratively exchange pseudo-labels, allowing semantic cues to enhance structural clustering and structural patterns to guide text representation learning without manual supervision. Across several text-attributed graph datasets, DRCL consistently improves the structural and semantic quality of discovered communities. Moreover, a Mamba-based classifier trained solely from DRCL’s community signals achieves accuracy comparable to supervised models, demonstrating its potential for deployment in large-scale systems where labeled data are scarce or costly.

[LG-33] TRACE: A Generalizable Drift Detector for Streaming Data-Driven Optimization AAAI2026

Link: https://arxiv.org/abs/2512.07082
Authors: Yuan-Ting Zhong, Ting Huang, Xiaolin Xiao, Yue-Jiao Gong
Categories: Machine Learning (cs.LG)
Comments: Accepted by AAAI 2026

Abstract:Many optimization tasks involve streaming data with unknown concept drifts, posing a significant challenge known as Streaming Data-Driven Optimization (SDDO). Existing methods, while leveraging surrogate model approximation and historical knowledge transfer, often rely on restrictive assumptions such as fixed drift intervals and full environmental observability, limiting their adaptability to diverse dynamic environments. We propose TRACE, a TRAnsferable Concept-drift Estimator that effectively detects distributional changes in streaming data with varying time scales. TRACE leverages a principled tokenization strategy to extract statistical features from data streams and models drift patterns using attention-based sequence learning, enabling accurate detection on unseen datasets and highlighting the transferability of learned drift patterns. Further, we showcase TRACE’s plug-and-play nature by integrating it into a streaming optimizer, facilitating adaptive optimization under unknown drifts. Comprehensive experimental results on diverse benchmarks demonstrate the superior generalization, robustness, and effectiveness of our approach in SDDO scenarios.

[LG-34] Ideal Attribution and Faithful Watermarks for Language Models

Link: https://arxiv.org/abs/2512.07038
Authors: Min Jae Song,Kameron Shahabi
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 30 pages

Click to view abstract

Abstract:We introduce ideal attribution mechanisms, a formal abstraction for reasoning about attribution decisions over strings. At the core of this abstraction lies the ledger, an append-only log of the prompt-response interaction history between a model and its user. Each mechanism produces deterministic decisions based on the ledger and an explicit selection criterion, making it well-suited to serve as a ground truth for attribution. We frame the design goal of watermarking schemes as faithful representation of ideal attribution mechanisms. This novel perspective brings conceptual clarity, replacing piecemeal probabilistic statements with a unified language for stating the guarantees of each scheme. It also enables precise reasoning about desiderata for future watermarking schemes, even when no current construction achieves them, since the ideal functionalities are specified first. In this way, the framework provides a roadmap that clarifies which guarantees are attainable in an idealized setting and worth pursuing in practice.

[LG-35] Always Keep Your Promises: DynamicLRP, A Model-Agnostic Solution To Layer-Wise Relevance Propagation

Link: https://arxiv.org/abs/2512.07010
Authors: Kevin Lee,Pablo Millan Arias
Subjects: Machine Learning (cs.LG)
Comments: Work in progress (12 pages manuscript, 6 figures, 6 tables, 3 pages references, 14 pages appendix)

Click to view abstract

Abstract:Layer-wise Relevance Propagation (LRP) provides principled attribution for neural networks through conservation properties and foundations in Deep Taylor Decomposition. However, existing implementations operate at the module level, requiring architecture-specific propagation rules and modifications. These requirements limit the range of models that can be targeted and the sustainability of implementations as architectures evolve. We introduce DynamicLRP, a model-agnostic LRP framework operating at the tensor operation level. By decomposing attribution to individual operations within computation graphs and introducing a novel mechanism for deferred activation resolution, named the Promise System, our approach achieves true architecture agnosticity while maintaining LRP’s theoretical guarantees. This design operates independently of backpropagation machinery, enabling operation on arbitrary computation graphs without model modification and side-by-side execution with gradient backpropagation. Being based on computation graphs, this method is theoretically extensible to other deep learning libraries that support auto-differentiation. We demonstrate faithfulness matching or exceeding specialized implementations (1.77 vs 1.69 ABPC on VGG, equivalent performance on ViT, 93.70% and 95.06% top-1 attribution accuracy for explaining RoBERTa-large and Flan-T5-large answers on SQuADv2, respectively) while maintaining practical efficiency on models with hundreds of millions of parameters. We achieved 99.92% node coverage across 31,465 computation graph nodes from 15 diverse architectures, including state-space models (Mamba), audio transformers (Whisper), and multimodal systems (DePlot), without any model-specific code, using rules implemented for 47 fundamental operations. Our operation-level decomposition and Promise System establish a sustainable, extensible foundation for LRP across evolving architectures.
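
The conservation property mentioned in the abstract can be illustrated with a small sketch. This is not DynamicLRP or its Promise System; it applies the standard LRP-epsilon rule to a single bias-free linear layer. The function name `lrp_epsilon_linear` and the toy inputs are assumptions made for illustration only.

```python
def lrp_epsilon_linear(x, w, relevance_out, eps=1e-9):
    """Redistribute output relevance to inputs of a bias-free linear layer.

    x: list of n inputs; w: n x m weight matrix (list of rows);
    relevance_out: list of m relevance scores for the outputs.
    Hypothetical helper, not part of any library.
    """
    n, m = len(x), len(relevance_out)
    # Forward pass: pre-activations z_j = sum_i x_i * w[i][j]
    z = [sum(x[i] * w[i][j] for i in range(n)) for j in range(m)]
    relevance_in = [0.0] * n
    for j in range(m):
        # Epsilon stabiliser keeps the sign of z_j and avoids division by zero
        denom = z[j] + (eps if z[j] >= 0 else -eps)
        for i in range(n):
            # Input i receives relevance in proportion to its contribution
            relevance_in[i] += x[i] * w[i][j] / denom * relevance_out[j]
    return relevance_in


# Toy values (assumed): 3 inputs, 2 outputs
x = [1.0, 2.0, -1.0]
w = [[0.5, -1.0], [0.25, 0.5], [-1.0, 0.5]]
relevance_out = [0.7, 0.3]
relevance_in = lrp_epsilon_linear(x, w, relevance_out)
print(relevance_in, sum(relevance_in))
```

With a small epsilon and no bias term, the input relevances sum to approximately the same total as the output relevances, which is the conservation behaviour that LRP rules are designed around.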

[LG-36] Toward Reliable Machine Unlearning: Theory, Algorithms, and Evaluation

Link: https://arxiv.org/abs/2512.06993
Authors: Ali Ebrahimpour-Boroojeny
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We propose new methodologies for both unlearning a random set of samples and class unlearning, and show that they outperform existing methods. The main driver of our unlearning methods is the similarity of predictions to a retrained model on both the forget and remain samples. We introduce Adversarial Machine UNlearning (AMUN), which surpasses prior state-of-the-art methods for image classification based on SOTA MIA scores. AMUN lowers the model’s confidence on forget samples by fine-tuning on their corresponding adversarial examples. Through theoretical analysis, we identify factors governing AMUN’s performance, including smoothness. To facilitate training of smooth models with a controlled Lipschitz constant, we propose FastClip, a scalable method that performs layer-wise spectral-norm clipping of affine layers. In a separate study, we show that increased smoothness naturally improves adversarial example transfer, thereby supporting the second factor above. Following the same principles for class unlearning, we show that existing methods fail to replicate a retrained model’s behavior: we introduce a nearest-neighbor membership inference attack (MIA-NN) that uses the probabilities assigned to neighboring classes to detect unlearned samples, and we demonstrate the vulnerability of such methods. We then propose a fine-tuning objective that mitigates this leakage by approximating, for forget-class inputs, the distribution over remaining classes that a model retrained from scratch would produce. To construct this approximation, we estimate inter-class similarity and tilt the target model’s distribution accordingly. The resulting Tilted ReWeighting (TRW) distribution serves as the desired target during fine-tuning. Across multiple benchmarks, TRW matches or surpasses existing unlearning methods on prior metrics.

[LG-37] OXtal: An All-Atom Diffusion Model for Organic Crystal Structure Prediction

Link: https://arxiv.org/abs/2512.06987
Authors: Emily Jin,Andrei Cristian Nica,Mikhail Galkin,Jarrid Rector-Brooks,Kin Long Kelvin Lee,Santiago Miret,Frances H. Arnold,Michael Bronstein,Avishek Joey Bose,Alexander Tong,Cheng-Hao Liu
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Comments:

Click to view abstract

Abstract:Accurately predicting experimentally-realizable 3D molecular crystal structures from their 2D chemical graphs is a long-standing open challenge in computational chemistry called crystal structure prediction (CSP). Efficiently solving this problem has implications ranging from pharmaceuticals to organic semiconductors, as crystal packing directly governs the physical and chemical properties of organic solids. In this paper, we introduce OXtal, a large-scale 100M-parameter all-atom diffusion model that directly learns the conditional joint distribution over intramolecular conformations and periodic packing. To efficiently scale OXtal, we abandon explicit equivariant architectures imposing inductive bias arising from crystal symmetries in favor of data augmentation strategies. We further propose a novel crystallization-inspired lattice-free training scheme, Stoichiometric Stochastic Shell Sampling (S^4), that efficiently captures long-range interactions while sidestepping explicit lattice parametrization, thus enabling more scalable architectural choices at all-atom resolution. By leveraging a large dataset of 600K experimentally validated crystal structures (including rigid and flexible molecules, co-crystals, and solvates), OXtal achieves orders-of-magnitude improvements over prior ab initio machine learning CSP methods, while remaining orders of magnitude cheaper than traditional quantum-chemical approaches. Specifically, OXtal recovers experimental structures with conformer \text{RMSD}_1 < 0.5 Å and attains over 80% packing similarity rate, demonstrating its ability to model both thermodynamic and kinetic regularities of molecular crystallization.

[LG-38] LLM-Driven Composite Neural Architecture Search for Multi-Source RL State Encoding NEURIPS2025

Link: https://arxiv.org/abs/2512.06982
Authors: Yu Yu,Qian Xie,Nairen Cao,Li Jin
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning

Click to view abstract

[LG-39] Joint Learning of Feasibility-Aware Signal Temporal Logic and BarrierNet for Robust and Correct Control

Link: https://arxiv.org/abs/2512.06973
Authors: Shuo Liu,Wenliang Liu,Wei Xiao,Calin A. Belta
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: 16 pages, 11 figures

Click to view abstract

Abstract:Control Barrier Functions (CBFs) have emerged as a powerful tool for enforcing safety in optimization-based controllers, and their integration with Signal Temporal Logic (STL) has enabled the specification-driven synthesis of complex robotic behaviors. However, existing CBF-STL approaches typically rely on fixed hyperparameters and myopic, per-time-step optimization, which can lead to overly conservative behavior, infeasibility near tight input limits, and difficulty satisfying long-horizon STL tasks. To address these limitations, we propose a feasibility-aware learning framework that embeds trainable, time-varying High Order Control Barrier Functions (HOCBFs) into a differentiable Quadratic Program (dQP). Our approach provides a systematic procedure for constructing time-varying HOCBF constraints for a broad fragment of STL and introduces a unified robustness measure that jointly captures STL satisfaction, QP feasibility, and control-bound compliance. Three neural networks (InitNet, RefNet, and an extended BarrierNet) collaborate to generate reference inputs and adapt constraint-related hyperparameters automatically over time and across initial conditions, reducing conservativeness while maximizing robustness. The resulting controller achieves STL satisfaction with strictly feasible dQPs and requires no manual tuning. Simulation results demonstrate that the proposed framework maintains high STL robustness under tight input bounds and significantly outperforms fixed-parameter and non-adaptive baselines in complex environments.

[LG-40] Prediction with Expert Advice under Local Differential Privacy

Link: https://arxiv.org/abs/2512.06971
Authors: Ben Jacobsen,Kassem Fawaz
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
Comments: 19 pages, 3 figures

Click to view abstract

Abstract:We study the classic problem of prediction with expert advice under the constraint of local differential privacy (LDP). In this context, we first show that a classical algorithm naturally satisfies LDP and then design two new algorithms that improve it: RW-AdaBatch and RW-Meta. For RW-AdaBatch, we exploit the limited-switching behavior induced by LDP to provide a novel form of privacy amplification that grows stronger on easier data, analogous to the shuffle model in offline learning. Drawing on the theory of random walks, we prove that this improvement carries essentially no utility cost. For RW-Meta, we develop a general method for privately selecting between experts that are themselves non-trivial learning algorithms, and we show that in the context of LDP this carries no extra privacy cost. In contrast, prior work has only considered data-independent experts. We also derive formal regret bounds that scale inversely with the degree of independence between experts. Our analysis is supplemented by evaluation on real-world data reported by hospitals during the COVID-19 pandemic; RW-Meta outperforms both the classical baseline and a state-of-the-art central DP algorithm by 1.5-3× on the task of predicting which hospital will report the highest density of COVID patients each week.

[LG-41] Parent-Guided Semantic Reward Model (PGSRM): Embedding-Based Reward Functions for Reinforcement Learning of Transformer Language Models

Link: https://arxiv.org/abs/2512.06920
Authors: Alexandr Plashchinsky
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

[LG-42] Know your Trajectory – Trustworthy Reinforcement Learning deployment through Importance-Based Trajectory Analysis AAAI2026

Link: https://arxiv.org/abs/2512.06917
Authors: Clifford F,Devika Jay,Abhishek Sarkar,Satheesh K Perepu,Santhosh G S,Kaushik Dey,Balaraman Ravindran
Subjects: Machine Learning (cs.LG)
Comments: Accepted at 4th Deployable AI Workshop at AAAI 2026

Click to view abstract

[LG-43] Energy-Efficient Navigation for Surface Vehicles in Vortical Flow Fields ICRA2026

Link: https://arxiv.org/abs/2512.06912
Authors: Rushiraj Gadhvi,Sandeep Manjanna
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: Under Review for International Conference on Robotics and Automation (ICRA 2026)

Click to view abstract

[LG-44] MINES: Explainable Anomaly Detection through Web API Invariant Inference

Link: https://arxiv.org/abs/2512.06906
Authors: Wenjie Zhang,Yun Lin,Chun Fung Amos Kwok,Xiwen Teoh,Xiaofei Xie,Frank Liauw,Hongyu Zhang,Jin Song Dong
Subjects: Software Engineering (cs.SE); Cryptography and Security (cs.CR); Databases (cs.DB); Machine Learning (cs.LG)
Comments:

Click to view abstract

[LG-45] Neural Factorization-based Bearing Fault Diagnosis

Link: https://arxiv.org/abs/2512.06837
Authors: Zhenhao Li,Xu Cheng,Yi Zhou
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

[LG-46] A Physics-Aware Attention LSTM Autoencoder for Early Fault Diagnosis of Battery Systems

Link: https://arxiv.org/abs/2512.06809
Authors: Jiong Yang
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: 5 pages, 7 figures

Click to view abstract

Abstract:Battery safety is paramount for electric vehicles. Early fault diagnosis remains a challenge due to the subtle nature of anomalies and the interference of dynamic operating noise. Existing data-driven methods often suffer from “physical blindness” leading to missed detections or false alarms. To address this, we propose a Physics-Aware Attention LSTM Autoencoder (PA-ALSTM-AE). This novel framework explicitly integrates battery aging laws (mileage) into the deep learning pipeline through a multi-stage fusion mechanism. Specifically, an adaptive physical feature construction module selects mileage-sensitive features, and a physics-guided latent fusion module dynamically calibrates the memory cells of the LSTM based on the aging state. Extensive experiments on the large-scale Vloong real-world dataset demonstrate that the proposed method significantly outperforms state-of-the-art baselines. Notably, it improves the recall rate of early faults by over 3 times while maintaining high precision, offering a robust solution for industrial battery management systems.

[LG-47] Small-Gain Nash: Certified Contraction to Nash Equilibria in Differentiable Games

Link: https://arxiv.org/abs/2512.06791
Authors: Vedansh Sharma
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Optimization and Control (math.OC)
Comments:

Click to view abstract

[LG-48] Measuring Over-smoothing beyond Dirichlet energy

Link: https://arxiv.org/abs/2512.06782
Authors: Weiqi Guan,Zihao Shi
Subjects: Machine Learning (cs.LG)
Comments: 17 pages, 1 figure

Click to view abstract

[LG-49] Optimal Analysis for Bandit Learning in Matching Markets with Serial Dictatorship

Link: https://arxiv.org/abs/2512.06758
Authors: Zilong Wang,Shuai Li
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

[LG-50] Multi-Scale Protein Structure Modelling with Geometric Graph U-Nets

Link: https://arxiv.org/abs/2512.06752
Authors: Chang Liu,Vivian Li,Linus Leong,Vladimir Radenkovic,Pietro Liò,Chaitanya K. Joshi
Subjects: Machine Learning (cs.LG)
Comments: Presented at Machine Learning in Structural Biology, 2025. Open-source code: this https URL

Click to view abstract

[LG-51] KV-CAR: KV Cache Compression using Autoencoders and KV Reuse in Large Language Models

Link: https://arxiv.org/abs/2512.06727
Authors: Sourjya Roy,Shrihari Sridharan,Surya Selvam,Anand Raghunathan
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:As Large Language Models (LLMs) scale in size and context length, the memory requirements of the key-value (KV) cache have emerged as a major bottleneck during autoregressive decoding. The KV cache grows with sequence length and embedding dimension, often exceeding the memory footprint of the model itself and limiting achievable batch sizes and context windows. To address this challenge, we present KV-CAR, a unified and architecture-agnostic framework that significantly reduces KV cache storage while maintaining model fidelity. KV-CAR combines two complementary techniques. First, a lightweight autoencoder learns compact representations of key and value tensors along the embedding dimension, compressing them before they are stored in the KV cache and restoring them upon retrieval. Second, a similarity-driven reuse mechanism identifies opportunities to reuse KV tensors of specific attention heads across adjacent layers. Together, these methods reduce the dimensional and structural redundancy in KV tensors without requiring changes to the transformer architecture. Evaluations on GPT-2 and TinyLLaMA models across Wikitext, C4, PIQA, and Winogrande datasets demonstrate that KV-CAR achieves up to 47.85% KV cache memory reduction with minimal impact on perplexity and zero-shot accuracy. System-level measurements on an NVIDIA A40 GPU show that the reduced KV footprint directly translates into longer sequence lengths and larger batch sizes during inference. These results highlight the effectiveness of KV-CAR in enabling memory-efficient LLM inference.
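
To make the scaling claim concrete, here is a back-of-the-envelope sketch of KV cache size. It assumes standard multi-head attention (keys and values each store `hidden_dim` numbers per token per layer) and fp16 storage. The helper `kv_cache_bytes` is hypothetical, and applying the abstract's "up to 47.85%" figure to the GPT-2 small configuration (12 layers, hidden size 768) is purely illustrative, not a result reported in the paper.

```python
def kv_cache_bytes(n_layers, hidden_dim, seq_len, batch_size, bytes_per_elem):
    """Approximate KV cache size for standard multi-head attention.

    The factor 2 accounts for storing both keys and values.
    Hypothetical helper for illustration only.
    """
    return 2 * n_layers * hidden_dim * seq_len * batch_size * bytes_per_elem


# GPT-2 small: 12 layers, hidden size 768; fp16 = 2 bytes per element
baseline = kv_cache_bytes(n_layers=12, hidden_dim=768, seq_len=1024,
                          batch_size=1, bytes_per_elem=2)

# Illustrative only: size if the cache shrank by the abstract's best-case 47.85%
reduced = baseline * (1 - 0.4785)

print(f"baseline: {baseline / 2**20:.1f} MiB, reduced: {reduced / 2**20:.1f} MiB")
```

Because the size is linear in both `seq_len` and `batch_size`, any fixed fractional saving can be spent on a proportionally longer context or a larger batch within the same memory budget.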

[LG-52] Decoding Motor Behavior Using Deep Learning and Reservoir Computing

Link: https://arxiv.org/abs/2512.06725
Authors: Tian Lan
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 10 pages, 3 figures

Click to view abstract

[LG-53] Pathway to O(\sqrt{d}) Complexity bound under Wasserstein metric of flow-based models

Link: https://arxiv.org/abs/2512.06702
Authors: Xiangjun Meng,Zhongjian Wang
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

[LG-54] Mitigating Barren plateaus in quantum denoising diffusion probabilistic models

Link: https://arxiv.org/abs/2512.06695
Authors: Haipeng Cao,Kaining Zhang,Dacheng Tao,Zhaofeng Su
Subjects: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: 22 pages, 9 figures

Click to view abstract

[LG-55] State Diversity Matters in Offline Behavior Distillation

Link: https://arxiv.org/abs/2512.06692
Authors: Shiye Lei,Zhihao Cheng,Dacheng Tao
Subjects: Machine Learning (cs.LG)
Comments: 12 pages, 5 figures, 5 tables

Click to view abstract

[LG-56] FedDSR: Federated Deep Supervision and Regularization Towards Autonomous Driving

Link: https://arxiv.org/abs/2512.06676
Authors: Wei-Bin Kou,Guangxu Zhu,Bingyang Cheng,Chen Zhang,Yik-Chung Wu,Jianping Wang
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 9 pages

Click to view abstract

[LG-57] The Meta-Learning Gap: Combining Hydra and Quant for Large-Scale Time Series Classification

Link: https://arxiv.org/abs/2512.06666
Authors: Urav Maniar
Subjects: Machine Learning (cs.LG)
Comments: Link to the repository: this https URL

Click to view abstract

[LG-58] The Impact of Data Characteristics on GNN Evaluation for Detecting Fake News

Link: https://arxiv.org/abs/2512.06638
Authors: Isha Karn,David Jensen
Subjects: Machine Learning (cs.LG)
Comments: Preprint. Approximately 15 pages, 5 figures, 3 tables

Click to view abstract

[LG-59] Quantum Temporal Convolutional Neural Networks for Cross-Sectional Equity Return Prediction: A Comparative Benchmark Study

Link: https://arxiv.org/abs/2512.06630
Authors: Chi-Sheng Chen,Xinyu Zhang,Rong Fu,Qiuzhe Xie,Fan Zhang
Subjects: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments:

Click to view abstract

[LG-60] On fine-tuning Boltz-2 for protein-protein affinity prediction

Link: https://arxiv.org/abs/2512.06592
Authors: James King,Lewis Cornwall,Andrei Cristian Nica,James Day,Aaron Sim,Neil Dalchau,Lilly Wollman,Joshua Meyers
Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments: MLSB 2025

Click to view abstract

[LG-61] Approximate Multiplier Induced Error Propagation in Deep Neural Networks

Link: https://arxiv.org/abs/2512.06537
Authors: A. M. H. H. Alahakoon,Hassaan Saadat,Darshana Jayasinghe,Sri Parameswaran
Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: 7 pages, Submitted to Design and Automation Conference (DAC 2026)

Click to view abstract

[LG-62] Hierarchical geometric deep learning enables scalable analysis of molecular dynamics

Link: https://arxiv.org/abs/2512.06520
Authors: Zihan Pengmei,Spencer C. Guo,Chatipat Lorpaiboon,Aaron R. Dinner
Subjects: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 17 pages, 12 figures

Click to view abstract

[LG-63] Diagnosis-based mortality prediction for intensive care unit patients via transfer learning

Link: https://arxiv.org/abs/2512.06511
Authors: Mengqi Xu,Subha Maity,Joel Dubin
Subjects: Machine Learning (cs.LG); Applications (stat.AP)
Comments:

Click to view abstract

Abstract:In the intensive care unit, the underlying causes of critical illness vary substantially across diagnoses, yet prediction models accounting for diagnostic heterogeneity have not been systematically studied. To address the gap, we evaluate transfer learning approaches for diagnosis-specific mortality prediction and apply both GLM- and XGBoost-based models to the eICU Collaborative Research Database. Our results demonstrate that transfer learning consistently outperforms models trained only on diagnosis-specific data and those using a well-known ICU severity-of-illness score, i.e., APACHE IVa, alone, while also achieving better calibration than models trained on the pooled data. Our findings also suggest that the Youden cutoff is a more appropriate decision threshold than the conventional 0.5 for binary outcomes, and that transfer learning maintains consistently high predictive performance across various cutoff criteria.
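
For readers unfamiliar with the Youden cutoff mentioned above, a minimal sketch follows: it picks the threshold that maximizes Youden's J = sensitivity + specificity - 1. The function `youden_cutoff` and the toy scores are assumptions for illustration and have no connection to the eICU data or the paper's models.

```python
def youden_cutoff(scores, labels):
    """Return (threshold, J) maximizing Youden's J statistic.

    A case is predicted positive when its score >= threshold.
    Hypothetical helper; candidate thresholds are the observed scores.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        j = tp / n_pos + tn / n_neg - 1.0  # sensitivity + specificity - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Toy predicted risks (assumed): survivors mostly get low scores, and
# no death is scored above 0.5, as can happen with a rare outcome
scores = [0.02, 0.05, 0.08, 0.10, 0.30, 0.12, 0.25, 0.40]
labels = [0, 0, 0, 0, 0, 1, 1, 1]
threshold, j_stat = youden_cutoff(scores, labels)
print(threshold, j_stat)
```

In this toy data a fixed 0.5 cutoff would flag no positives at all, whereas the J-maximizing threshold sits much lower. This is the usual argument for data-driven cutoffs when the outcome is uncommon.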

[LG-64] Optimizing LLMs Using Quantization for Mobile Execution

Link: https://arxiv.org/abs/2512.06490
Authors: Agatsya Yadav,Renta Chintala Bhargavi
Subjects: Machine Learning (cs.LG)
Comments: 11 pages, 1 equation, 2 tables. Author Accepted Manuscript (AAM) of a paper published in Springer LNNS, ICT4SD 2025. DOI: https://doi.org/10.1007/978-3-032-06697-8_33

Click to view abstract

[LG-65] BitStopper: An Efficient Transformer Attention Accelerator via Stage-fusion and Early Termination

Link: https://arxiv.org/abs/2512.06457
Authors: Huizheng Wang,Hongbin Wang,Shaojun Wei,Yang Hu,Shouyi Yin
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:

Click to view abstract

Abstract:Attention-based large language models (LLMs) have transformed modern AI applications, but the quadratic cost of self-attention imposes significant compute and memory overhead. Dynamic sparsity (DS) attention mitigates this, yet its hardware efficiency is limited by the added prediction stage and the heavy memory traffic it entails. To address these limitations, this paper proposes BitStopper, a fine-grained algorithm-architecture co-design that operates without a sparsity predictor. First, a bit-serial enable stage fusion (BESF) mechanism is proposed to reuse and minimize the memory access by progressively terminating trivial tokens and merging the prediction stage into the execution stage. Second, a lightweight and adaptive token selection (LATS) strategy is developed to work in concert with the bit-level sparsity speculation. Third, a bit-level asynchronous processing (BAP) strategy is employed to improve compute utilization during the on-demand bit-grained memory fetching. Finally, an elaborate architecture is designed to translate the theoretical complexity reduction into practical performance improvement. Extensive evaluations demonstrate that, compared to state-of-the-art (SOTA) Transformer accelerators, BitStopper achieves 2.03x and 1.89x speedups over Sanger and SOFA, respectively, while delivering 2.4x and 2.1x improvements in energy efficiency.

[LG-66] Neural expressiveness for beyond importance model compression

Link: https://arxiv.org/abs/2512.06440
Authors: Angelos-Christos Maroudis,Sotirios Xydis
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Neural Network Pruning has been established as a driving force in the exploration of memory- and energy-efficient solutions with high throughput both during training and at test time. In this paper, we introduce a novel criterion for model compression, named “Expressiveness”. Unlike existing pruning methods that rely on the inherent “Importance” of neurons’ and filters’ weights, “Expressiveness” emphasizes the ability of a neuron or group of neurons to redistribute informational resources effectively, based on the overlap of activations. This characteristic is strongly correlated to a network’s initialization state, establishing criterion autonomy from the learning state (stateless) and thus setting a new fundamental basis for the expansion of compression strategies with regard to the “When to Prune” question. We show that expressiveness is effectively approximated with arbitrary data or a limited set of representative samples from the dataset, paving the way for the exploration of Data-Agnostic strategies. Our work also facilitates a “hybrid” formulation of expressiveness and importance-based pruning strategies, illustrating their complementary benefits and delivering up to 10x extra gains w.r.t. weight-based approaches in parameter compression ratios, with an average of 1% in performance degradation. We also show that employing expressiveness (independently) for pruning leads to an improvement over top-performing and foundational methods in terms of compression efficiency. Finally, on YOLOv8, we achieve a 46.1% MACs reduction by removing 55.4% of the parameters, with an increase of 3% in the mean Average Precision (mAP_{50-95}) for object detection on the COCO dataset.

[LG-67] A new initialisation to Control Gradients in Sinusoidal Neural network

Link: https://arxiv.org/abs/2512.06427
Authors: Andrea Combette,Antoine Venaille,Nelly Pustelnik
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

[LG-68] Hankel-FNO: Fast Underwater Acoustic Charting Via Physics-Encoded Fourier Neural Operator

Link: https://arxiv.org/abs/2512.06417
Authors: Yifan Sun(1),Lei Cheng(1),Jianlong Li(1),Peter Gerstoft(2) ((1) College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, China, (2) Scripps Institution of Oceanography, University of California San Diego, La Jolla, USA)
Subjects: Machine Learning (cs.LG); Sound (cs.SD)
Comments:

Click to view abstract

[LG-69] Optimizing Optimizers for Fast Gradient-Based Learning

Link: https://arxiv.org/abs/2512.06370
Authors: Jaerin Lee,Kyoung Mu Lee
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 49 pages, 5 figures

Click to view abstract

[LG-70] DDFI: Diverse and Distribution-aware Missing Feature Imputation via Two-step Reconstruction

Link: https://arxiv.org/abs/2512.06356
Authors: Yifan Song,Fenglin Yu,Yihong Luo,Xingjian Tao,Siya Qiu,Kai Han,Jing Tang
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments:

Click to view abstract

[LG-71] Zero Generalization Error Theorem for Random Interpolators via Algebraic Geometry

Link: https://arxiv.org/abs/2512.06347
Authors: Naoki Yoshida,Isao Ishikawa,Masaaki Imaizumi
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:We theoretically demonstrate that the generalization error of interpolators for machine learning models under teacher-student settings becomes 0 once the number of training samples exceeds a certain threshold. Understanding the high generalization ability of large-scale models such as deep neural networks (DNNs) remains one of the central open problems in machine learning theory. While recent theoretical studies have attributed this phenomenon to the implicit bias of stochastic gradient descent (SGD) toward well-generalizing solutions, empirical evidence indicates that it primarily stems from properties of the model itself. Specifically, even randomly sampled interpolators, which are parameters that achieve zero training error, have been observed to generalize effectively. In this study, under a teacher-student framework, we prove that the generalization error of randomly sampled interpolators becomes exactly zero once the number of training samples exceeds a threshold determined by the geometric structure of the interpolator set in parameter space. As a proof technique, we leverage tools from algebraic geometry to mathematically characterize this geometric structure.

[LG-72] Interpretive Efficiency: Information-Geometric Foundations of Data Usefulness

Link: https://arxiv.org/abs/2512.06341
Authors: Ronald Katende
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR); Information Theory (cs.IT)
Comments:

Click to view abstract

[LG-73] Multimodal Graph Neural Networks for Prognostic Modeling of Brain Network Reorganization

Link: https://arxiv.org/abs/2512.06303
Authors: Preksha Girish,Rachana Mysore,Kiran K. N.,Hiranmayee R.,Shipra Prashanth,Shrey Kumar
Subjects: Machine Learning (cs.LG)
Comments: 5 pages, 2 figures. IEEE conference-style format

Click to view abstract

[LG-74] Importance-aware Topic Modeling for Discovering Public Transit Risk from Noisy Social Media

Link: https://arxiv.org/abs/2512.06293
Authors: Fatima Ashraf,Muhammad Ayub Sabir,Jiaxin Deng,Junbiao Pang,Haitao Yu
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

[LG-75] Theoretical Compression Bounds for Wide Multilayer Perceptrons

Link: https://arxiv.org/abs/2512.06288
Authors: Houssam El Cheairi,David Gamarnik,Rahul Mazumder
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments:

Click to view abstract

[LG-76] Learning Without Time-Based Embodiment Resets in Soft-Actor Critic

Link: https://arxiv.org/abs/2512.06252
Authors: Homayoon Farrahi,A. Rupam Mahmood
Subjects: Machine Learning (cs.LG)
Comments: In Proceedings of the 4th Conference on Lifelong Learning Agents (CoLLAs)

Click to view abstract

[LG-77] Learning When to Switch: Adaptive Policy Selection via Reinforcement Learning

Link: https://arxiv.org/abs/2512.06250
Authors: Chris Tava
Subjects: Machine Learning (cs.LG)
Comments: 7 pages

Click to view abstract

[LG-78] Quantization Blindspots: How Model Compression Breaks Backdoor Defenses

Link: https://arxiv.org/abs/2512.06243
Authors: Rohan Pandey,Eric Ye
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 10 pages

Click to view abstract

[LG-79] Empowering GNNs for Domain Adaptation via Denoising Target Graph

Link: https://arxiv.org/abs/2512.06236
Authors: Haiyang Yu,Meng-Chieh Lee,Xiang song,Qi Zhu,Christos Faloutsos
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

[LG-80] Average-reward reinforcement learning in semi-Markov decision processes via relative value iteration

Link: https://arxiv.org/abs/2512.06218
Authors: Huizhen Yu,Yi Wan,Richard S. Sutton
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 24 pages. This paper presents the reinforcement-learning material previously contained in version 2 of arXiv:2409.03915, which is now being split into two stand-alone papers. Minor corrections and improvements to the main results have also been made in the course of this reformatting

Click to view abstract

[LG-81] A Broader View on Clustering under Cluster-Aware Norm Objectives

Link: https://arxiv.org/abs/2512.06211
Authors: Martin G. Herold,Evangelos Kipouridis,Joachim Spoerhase
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We revisit the (f,g)-clustering problem that we introduced in a recent work [SODA’25], and which subsumes fundamental clustering problems such as k-Center, k-Median, Min-Sum of Radii, and Min-Load k-Clustering. This problem assigns each of the k clusters a cost determined by the monotone, symmetric norm f applied to the vector of distances in the cluster, and aims at minimizing the norm g applied to the vector of cluster costs. Previously, we focused on certain special cases for which we designed constant-factor approximation algorithms. Our bounds for more general settings left, however, large gaps to the known bounds for the basic problems they capture. In this work, we provide a clearer picture of the approximability of these more general settings. First, we design an O(\log^2 n)-approximation algorithm for (f, L_1)-clustering for any f. This improves upon our previous \widetilde{O}(\sqrt{n})-approximation. Second, we provide an O(k)-approximation for the general (f,g)-clustering problem, which improves upon our previous \widetilde{O}(\sqrt{kn})-approximation algorithm and matches the best-known upper bound for Min-Load k-Clustering. We then design an approximation algorithm for (f,g)-clustering that interpolates, up to polylog factors, between the best known bounds for k-Center, k-Median, Min-Sum of Radii, Min-Load k-Clustering, (Top, L_1)-clustering, and (L_\infty,g)-clustering based on a newly defined parameter of f and g.
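
The objective described in the abstract can be sketched in a few lines: an inner norm f turns each cluster's vector of point-to-center distances into a cluster cost, and an outer norm g aggregates the cluster costs. The sketch below only evaluates the objective for a fixed clustering (it is not an approximation algorithm). The helper names and toy distances are assumptions.

```python
def l1(values):
    """L_1 norm of a non-negative vector (sum)."""
    return sum(values)


def linf(values):
    """L_infinity norm of a non-negative vector (max)."""
    return max(values)


def fg_cost(cluster_distances, f, g):
    """Evaluate the (f,g)-clustering objective for a fixed clustering.

    cluster_distances: one list per cluster holding the distances from
    its points to its center. Hypothetical helper for illustration.
    """
    return g([f(distances) for distances in cluster_distances])


# Toy clustering with k = 3 clusters (distances are assumed values)
clusters = [[1.0, 2.0], [3.0], [0.5, 0.5, 4.0]]

k_center = fg_cost(clusters, linf, linf)    # largest distance overall
k_median = fg_cost(clusters, l1, l1)        # sum of all distances
min_sum_radii = fg_cost(clusters, linf, l1) # sum of cluster radii
min_load = fg_cost(clusters, l1, linf)      # heaviest cluster load
print(k_center, k_median, min_sum_radii, min_load)
```

Choosing f and g from {L_1, L_\infty} recovers the four classical objectives named in the abstract, which is what "subsumes" refers to there.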

[LG-82] SparsePixels: Efficient Convolution for Sparse Data on FPGAs

Link: https://arxiv.org/abs/2512.06208
Authors: Ho Fung Tsoi, Dylan Rankin, Vladimir Loncar, Philip Harris
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
Comments: Under review

[LG-83] K2-V2: A 360-Open Reasoning-Enhanced LLM

Link: https://arxiv.org/abs/2512.06201
Authors: K2 Team: Zhengzhong Liu, Liping Tang, Linghao Jin, Haonan Li, Nikhil Ranjan, Desai Fan, Shaurya Rohatgi, Richard Fan, Omkar Pangarkar, Huijuan Wang, Zhoujun Cheng, Suqi Sun, Seungwook Han, Bowen Tan, Gurpreet Gosal, Xudong Han, Varad Pimpalkhute, Shibo Hao, Ming Shan Hee, Joel Hestness, Haolong Jia, Liqun Ma, Aaryamonvikram Singh, Daria Soboleva, Natalia Vassilieva, Renxi Wang, Yingquan Wu, Yuekai Sun, Taylor Killian, Alexander Moreno, John Maggs, Hector Ren, Guowei He, Hongyi Wang, Xuezhe Ma, Yuqi Wang, Mikhail Yurochkin, Eric P. Xing
Categories: Machine Learning (cs.LG)

Abstract:We introduce K2-V2, a 360-open LLM built from scratch as a superior base for reasoning adaptation, in addition to functions such as conversation and knowledge retrieval from general LLMs. It stands as the strongest fully open model, rivals open-weight leaders in its size class, outperforms Qwen2.5-72B and approaches the performance of Qwen3-235B. We actively infuse domain knowledge, reasoning, long-context, and tool use throughout the training process. This explicitly prepares the model for complex reasoning tasks. We demonstrate this potential using simple supervised fine-tuning, establishing a strong baseline that indicates significant headroom for advanced alignment. By releasing the full training history and data composition, we maximize the effectiveness of continuous training, a key open source production scenario. We release the model weights and signature LLM360 artifacts, such as complete training data, to empower the community with a capable, reasoning-centric foundation.

[LG-84] How Should We Evaluate Data Deletion in Graph-Based ANN Indexes? NEURIPS2025

Link: https://arxiv.org/abs/2512.06200
Authors: Tomohiro Yamashita, Daichi Amagata, Yusuke Matsui
Categories: Machine Learning (cs.LG)
Comments: 4 pages, 4 figures. Accepted at NeurIPS 2025 Workshop on Machine Learning for Systems

Abstract:Approximate Nearest Neighbor Search (ANNS) has recently gained significant attention due to its many applications, such as Retrieval-Augmented Generation. Such applications require ANNS algorithms that support dynamic data, so the ANNS problem on dynamic data has attracted considerable interest. However, a comprehensive evaluation methodology for data deletion in ANNS has yet to be established. This study proposes an experimental framework and comprehensive evaluation metrics to assess the efficiency of data deletion for ANNS indexes under practical use cases. Specifically, we categorize data deletion methods in graph-based ANNS into three approaches and formalize them mathematically. The performance is assessed in terms of accuracy, query speed, and other relevant metrics. Finally, we apply the proposed evaluation framework to Hierarchical Navigable Small World, one of the state-of-the-art ANNS methods, to analyze the effects of data deletion, and propose Deletion Control, a method which dynamically selects the appropriate deletion method under a required search accuracy.

[LG-85] PMA-Diffusion: A Physics-guided Mask-Aware Diffusion Framework for TSE from Sparse Observations

Link: https://arxiv.org/abs/2512.06183
Authors: Lindong Liu, Zhixiong Jin, Seongjin Choi
Categories: Machine Learning (cs.LG)

[LG-86] gp2Scale: A Class of Compactly-Supported Non-Stationary Kernels and Distributed Computing for Exact Gaussian Processes on 10 Million Data Points

Link: https://arxiv.org/abs/2512.06143
Authors: Marcus M. Noack, Mark D. Risser, Hengrui Luo, Vardaan Tekriwal, Ronald J. Pandolfi
Categories: Machine Learning (cs.LG); Probability (math.PR)
Comments: None

[LG-87] Hardware Software Optimizations for Fast Model Recovery on Reconfigurable Architectures

Link: https://arxiv.org/abs/2512.06113
Authors: Bin Xu, Ayan Banerjee, Sandeep Gupta
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)

[LG-88] A Prescriptive Framework for Determining Optimal Days for Short-Term Traffic Counts

Link: https://arxiv.org/abs/2512.06111
Authors: Arthur Mukwaya, Nancy Kasamala, Nana Kankam Gyimah, Judith Mwakalonge, Gurcan Comert, Saidi Siuhi, Denis Ruganuza, Mark Ngotonie
Categories: Machine Learning (cs.LG)

Abstract:The Federal Highway Administration (FHWA) mandates that state Departments of Transportation (DOTs) collect reliable Annual Average Daily Traffic (AADT) data. However, many U.S. DOTs struggle to obtain accurate AADT, especially for unmonitored roads. While continuous count (CC) stations offer accurate traffic volume data, their implementation is expensive and difficult to deploy widely, compelling agencies to rely on short-duration traffic counts. This study proposes a machine learning framework, the first to our knowledge, to identify optimal representative days for conducting short count (SC) data collection to improve AADT prediction accuracy. Using 2022 and 2023 traffic volume data from the state of Texas, we compare two scenarios: an ‘optimal day’ approach that iteratively selects the most informative days for AADT estimation and a ‘no optimal day’ baseline reflecting current practice by most DOTs. To align with Texas DOT’s traffic monitoring program, continuous count data were utilized to simulate the 24 hour short counts. The actual field short counts were used to enhance feature engineering through using a leave-one-out (LOO) technique to generate unbiased representative daily traffic features across similar road segments. Our proposed methodology outperforms the baseline across the top five days, with the best day (Day 186) achieving lower errors (RMSE: 7,871.15, MAE: 3,645.09, MAPE: 11.95%) and higher R^2 (0.9756) than the baseline (RMSE: 11,185.00, MAE: 5,118.57, MAPE: 14.42%, R^2: 0.9499). This research offers DOTs an alternative to conventional short-duration count practices, improving AADT estimation, supporting Highway Performance Monitoring System compliance, and reducing the operational costs of statewide traffic data collection.

[LG-89] ARC-AGI Without Pretraining

Link: https://arxiv.org/abs/2512.06104
Authors: Isaac Liao, Albert Gu
Categories: Machine Learning (cs.LG)

[LG-90] Deep learning recognition and analysis of Volatile Organic Compounds based on experimental and synthetic infrared absorption spectra

Link: https://arxiv.org/abs/2512.06059
Authors: Andrea Della Valle, Annalisa D’Arco, Tiziana Mancini, Rosanna Mosetti, Maria Chiara Paolozzi, Stefano Lupi, Sebastiano Pilati, Andrea Perali
Categories: Machine Learning (cs.LG); Applied Physics (physics.app-ph); Chemical Physics (physics.chem-ph)

Abstract:Volatile Organic Compounds (VOCs) are organic molecules that have low boiling points and therefore easily evaporate into the air. They pose significant risks to human health, making their accurate detection the crux of efforts to monitor and minimize exposure. Infrared (IR) spectroscopy enables the ultrasensitive detection at low-concentrations of VOCs in the atmosphere by measuring their IR absorption spectra. However, the complexity of the IR spectra limits the possibility to implement VOC recognition and quantification in real-time. While deep neural networks (NNs) are increasingly used for the recognition of complex data structures, they typically require massive datasets for the training phase. Here, we create an experimental VOC dataset for nine different classes of compounds at various concentrations, using their IR absorption spectra. To further increase the amount of spectra and their diversity in term of VOC concentration, we augment the experimental dataset with synthetic spectra created via conditional generative NNs. This allows us to train robust discriminative NNs, able to reliably identify the nine VOCs, as well as to precisely predict their concentrations. The trained NN is suitable to be incorporated into sensing devices for VOCs recognition and analysis.

[LG-91] Closed-Loop Robotic Manipulation of Transparent Substrates for Self-Driving Laboratories using Deep Learning Micro-Error Correction

Link: https://arxiv.org/abs/2512.06038
Authors: Kelsey Fontenot, Anjali Gorti, Iva Goel, Tonio Buonassisi, Alexander E. Siemenn
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 15 pages, 8 figures

[LG-92] Memory-Amortized Inference: A Topological Unification of Search Closure and Structure

Link: https://arxiv.org/abs/2512.05990
Authors: Xin Li
Categories: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)

[LG-93] A self-driving lab for solution-processed electrochromic thin films

Link: https://arxiv.org/abs/2512.05989
Authors: Selma Dahms, Luca Torresi, Shahbaz Tareq Bandesha, Jan Hansmann, Holger Röhm, Alexander Colsmann, Marco Schott, Pascal Friederich
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)

[LG-94] Intrusion Detection on Resource-Constrained IoT Devices with Hardware-Aware ML and DL

Link: https://arxiv.org/abs/2512.02272
Authors: Ali Diab, Adel Chehade, Edoardo Ragusa, Paolo Gastaldo, Rodolfo Zunino, Amer Baghdadi, Mostafa Rizk
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments: Accepted at the 2025 IEEE International Conference on Emerging Trends in Engineering and Computing (ETECOM). Recipient of the ETECOM 2025 Best Paper Award

[LG-95] LUNA: LUT-Based Neural Architecture for Fast and Low-Cost Qubit Readout

Link: https://arxiv.org/abs/2512.07808
Authors: M. A. Farooq, G. Di Guglielmo, A. Rajagopala, N. Tran, V. A. Chhabria, A. Arora
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)

[LG-96] Distribution-informed Online Conformal Prediction

Link: https://arxiv.org/abs/2512.07770
Authors: Dongjian Hu, Junxi Wu, Shu-Tao Xia, Changliang Zou
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Conformal prediction provides a pivotal and flexible technique for uncertainty quantification by constructing prediction sets with a predefined coverage rate. Many online conformal prediction methods have been developed to address data distribution shifts in fully adversarial environments, resulting in overly conservative prediction sets. We propose Conformal Optimistic Prediction (COP), an online conformal prediction algorithm incorporating underlying data pattern into the update rule. Through estimated cumulative distribution function of non-conformity scores, COP produces tighter prediction sets when predictable pattern exists, while retaining valid coverage guarantees even when estimates are inaccurate. We establish a joint bound on coverage and regret, which further confirms the validity of our approach. We also prove that COP achieves distribution-free, finite-sample coverage under arbitrary learning rates and can converge when scores are i.i.d. The experimental results also show that COP can achieve valid coverage and construct shorter prediction intervals than other baselines.
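
For context on what an online conformal update rule looks like, here is a minimal sketch of a generic quantile-tracking baseline, not the COP algorithm from this paper. It assumes non-conformity scores arrive one at a time and that the prediction set at step t contains every candidate whose score is at most the current threshold. The function name `online_quantile_tracking`, the learning rate, and the synthetic uniform scores are all assumptions made for illustration.

```python
import random


def online_quantile_tracking(scores, alpha=0.1, lr=0.05, q0=0.0):
    """Track a score threshold q_t so that long-run miscoverage approaches alpha.

    At each step the threshold is raised after a miss (score above threshold)
    and lowered slightly after a cover, via q <- q + lr * (err - alpha).
    Returns the thresholds used at each step and the 0/1 miss indicators.
    """
    q = q0
    thresholds, errors = [], []
    for s in scores:
        thresholds.append(q)
        err = 1.0 if s > q else 0.0  # 1 means the true label fell outside the set
        errors.append(err)
        q += lr * (err - alpha)
    return thresholds, errors


if __name__ == "__main__":
    rng = random.Random(0)
    demo_scores = [rng.random() for _ in range(5000)]  # synthetic scores in [0, 1)
    _, errs = online_quantile_tracking(demo_scores, alpha=0.1, lr=0.05)
    print("empirical miscoverage:", sum(errs) / len(errs))
```

A rule of this kind reacts only to past misses. The abstract describes COP as additionally using an estimated distribution of the scores in the update, which this baseline does not attempt.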

[LG-97] Physics-Informed Neural Networks for Source Inversion and Parameters Estimation in Atmospheric Dispersion

Link: https://arxiv.org/abs/2512.07755
Authors: Brenda Anague, Bamdad Hosseini, Issa Karambal, Jean Medard Ngnotchouye
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

[LG-98] A scalable and real-time neural decoder for topological quantum codes

Link: https://arxiv.org/abs/2512.07737
Authors: Andrew W. Senior, Thomas Edlich, Francisco J.H. Heras, Lei M. Zhang, Oscar Higgott, James S. Spencer, Taylor Applebaum, Sam Blackwell, Justin Ledford, Akvilė Žemgulytė, Augustin Žídek, Noah Shutty, Andrew Cowie, Yin Li, George Holland, Peter Brooks, Charlie Beattie, Michael Newman, Alex Davies, Cody Jones, Sergio Boixo, Hartmut Neven, Pushmeet Kohli, Johannes Bausch
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)

[LG-99] ϕ-test: Global Feature Selection and Inference for Shapley Additive Explanations

Link: https://arxiv.org/abs/2512.07578
Authors: Dongseok Kim, Hyoungsun Choi, Mohamed Jismy Aashik Rasool, Gisung Oh
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 15 pages

Abstract:We propose ϕ-test, a global feature-selection and significance procedure for black-box predictors that combines Shapley attributions with selective inference. Given a trained model and an evaluation dataset, ϕ-test performs SHAP-guided screening and fits a linear surrogate on the screened features via a selection rule with a tractable selective-inference form. For each retained feature, it outputs a Shapley-based global score, a surrogate coefficient, and post-selection p-values and confidence intervals in a global feature-importance table. Experiments on real tabular regression tasks with tree-based and neural backbones suggest that ϕ-test can retain much of the predictive ability of the original model while using only a few features and producing feature sets that remain fairly stable across resamples and backbone classes. In these settings, ϕ-test acts as a practical global explanation layer linking Shapley-based importance summaries with classical statistical inference.

[LG-100] On Conditional Independence Graph Learning From Multi-Attribute Gaussian Dependent Time Series

Link: https://arxiv.org/abs/2512.07557
Authors: Jitendra K. Tugnait
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 16 pages, 3 figures, 4 tables

[LG-101] High-Dimensional Change Point Detection using Graph Spanning Ratio

Link: https://arxiv.org/abs/2512.07541
Authors: Youngwen Sun, Katerina Papagiannouli, Vladimir Spokoiny
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

[LG-102] Optimized Machine Learning Methods for Studying the Thermodynamic Behavior of Complex Spin Systems

Link: https://arxiv.org/abs/2512.07458
Authors: Dmitrii Kapitan, Pavel Ovchinnikov, Konstantin Soldatov, Petr Andriushchenko, Vitalii Kapitan
Categories: Computational Physics (physics.comp-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
Comments: 16 pages, in Russian language, 8 figures, 2 tables

[LG-103] Microseismic event classification with a lightweight Fourier Neural Operator model

Link: https://arxiv.org/abs/2512.07425
Authors: Ayrat Abdullin, Umair bin Waheed, Leo Eisner, Abdullatif Al-Shuhail
Categories: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
Comments: Submitted to Nature Scientific Reports

[LG-104] Machine learning in an expectation-maximisation framework for nowcasting

Link: https://arxiv.org/abs/2512.07335
Authors: Paul Wilsens, Katrien Antonio, Gerda Claeskens
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

[LG-105] Two-dimensional RMSD projections for reaction path visualization and validation

Link: https://arxiv.org/abs/2512.07329
Authors: Rohit Goswami(1) ((1) Institute IMX and Lab-COSMO, École polytechnique fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland)
Categories: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 4 pages, 1 figure

[LG-106] Equivariant Diffusion for Crystal Structure Prediction ICML2024

Link: https://arxiv.org/abs/2512.07289
Authors: Peijia Lin, Pin Chen, Rui Jiao, Qing Mo, Jianhuan Cen, Wenbing Huang, Yang Liu, Dan Huang, Yutong Lu
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: ICML 2024

[LG-107] Verifiable Deep Quantitative Group Testing

Link: https://arxiv.org/abs/2512.07279
Authors: Shreyas Jayant Grampurohit, Satish Mulleti, Ajit Rajwade
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 11 pages, 2 figures, 3 tables

[LG-108] Non-negative DAG Learning from Time-Series Data

Link: https://arxiv.org/abs/2512.07267
Authors: Samuel Rey, Gonzalo Mateos
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)

[LG-109] DeepSVM: Learning Stochastic Volatility Models with Physics-Informed Deep Operator Networks

Link: https://arxiv.org/abs/2512.07162
Authors: Kieran A. Malandain, Selim Kalici, Hakob Chakhoyan
Categories: Computational Finance (q-fin.CP); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Real-time calibration of stochastic volatility models (SVMs) is computationally bottlenecked by the need to repeatedly solve coupled partial differential equations (PDEs). In this work, we propose DeepSVM, a physics-informed Deep Operator Network (PI-DeepONet) designed to learn the solution operator of the Heston model across its entire parameter space. Unlike standard data-driven deep learning (DL) approaches, DeepSVM requires no labelled training data. Rather, we employ a hard-constrained ansatz that enforces terminal payoffs and static no-arbitrage conditions by design. Furthermore, we use Residual-based Adaptive Refinement (RAR) to stabilize training in difficult regions subject to high gradients. Overall, DeepSVM achieves a final training loss of 10^{-5} and predicts highly accurate option prices across a range of typical market dynamics. While pricing accuracy is high, we find that the model’s derivatives (Greeks) exhibit noise in the at-the-money (ATM) regime, highlighting the specific need for higher-order regularization in physics-informed operator learning.

[LG-110] Physics-Guided Diffusion Priors for Multi-Slice Reconstruction in Scientific Imaging AAAI

Link: https://arxiv.org/abs/2512.06977
Authors: Laurentius Valdy, Richard D. Paul, Alessio Quercia, Zhuo Cao, Xuan Zhao, Hanno Scharr, Arya Bangun
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Comments: 8 pages, 5 figures, AAAI AI2ASE 2026

[LG-111] Learning Conditional Independence Differential Graphs From Time-Dependent Data

Link: https://arxiv.org/abs/2512.06960
Authors: Jitendra K Tugnait
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 20 pages, 4 figures, 2 tables. To be published in IEEE Access, 2025

[LG-112] Statistical analysis of Inverse Entropy-regularized Reinforcement Learning

Link: https://arxiv.org/abs/2512.06956
Authors: Denis Belomestny, Alexey Naumov, Sergey Samsonov
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 27 pages

[LG-113] PARIS: Pruning Algorithm via the Representer theorem for Imbalanced Scenarios

Link: https://arxiv.org/abs/2512.06950
Authors: Enrico Camporeale
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Space Physics (physics.space-ph)

Abstract:The challenge of imbalanced regression arises when standard Empirical Risk Minimization (ERM) biases models toward high-frequency regions of the data distribution, causing severe degradation on rare but high-impact “tail” events. Existing strategies such as loss re-weighting or synthetic over-sampling often introduce noise, distort the underlying distribution, or add substantial algorithmic complexity. We introduce PARIS (Pruning Algorithm via the Representer theorem for Imbalanced Scenarios), a principled framework that mitigates imbalance by optimizing the training set itself. PARIS leverages the representer theorem for neural networks to compute a closed-form representer deletion residual, which quantifies the exact change in validation loss caused by removing a single training point without retraining. Combined with an efficient Cholesky rank-one downdating scheme, PARIS performs fast, iterative pruning that eliminates uninformative or performance-degrading samples. We use a real-world space weather example, where PARIS reduces the training set by up to 75% while preserving or improving overall RMSE, outperforming re-weighting, synthetic oversampling, and boosting baselines. Our results demonstrate that representer-guided dataset pruning is a powerful, interpretable, and computationally efficient approach to rare-event regression.

[LG-114] Symmetric Aggregation of Conformity Scores for Efficient Uncertainty Sets

Link: https://arxiv.org/abs/2512.06945
Authors: Nabil Alami, Jad Zakharia, Souhaib Ben Taieb
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

[LG-115] ADAM Optimization with Adaptive Batch Selection ICLR2025

Link: https://arxiv.org/abs/2512.06795
Authors: Gyu Yeol Kim, Min-hwan Oh
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Published at ICLR 2025

[LG-116] Learning to Hedge Swaptions

Link: https://arxiv.org/abs/2512.06639
Authors: Zaniar Ahmadi, Frédéric Godin
Categories: Risk Management (q-fin.RM); Machine Learning (cs.LG)

[LG-117] Latent Nonlinear Denoising Score Matching for Enhanced Learning of Structured Distributions

Link: https://arxiv.org/abs/2512.06615
Authors: Kaichen Shen, Wei Zhu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

[LG-118] A Latent Variable Framework for Scaling Laws in Large Language Models

Link: https://arxiv.org/abs/2512.06553
Authors: Peiyao Cai, Chengyu Cui, Felipe Maia Polo, Seamus Somerstep, Leshem Choshen, Mikhail Yurochkin, Moulinath Banerjee, Yuekai Sun, Kean Ming Tan, Gongjun Xu
Categories: Applications (stat.AP); Machine Learning (cs.LG)

Abstract:We propose a statistical framework built on latent variable modeling for scaling laws of large language models (LLMs). Our work is motivated by the rapid emergence of numerous new LLM families with distinct architectures and training strategies, evaluated on an increasing number of benchmarks. This heterogeneity makes a single global scaling curve inadequate for capturing how performance varies across families and benchmarks. To address this, we propose a latent variable modeling framework in which each LLM family is associated with a latent variable that captures the common underlying features in that family. An LLM’s performance on different benchmarks is then driven by its latent skills, which are jointly determined by the latent variable and the model’s own observable features. We develop an estimation procedure for this latent variable model and establish its statistical properties. We also design efficient numerical algorithms that support estimation and various downstream tasks. Empirically, we evaluate the approach on 12 widely used benchmarks from the Open LLM Leaderboard (v1/v2).

[LG-119] Canonical Tail Dependence for Soft Extremal Clustering of Multichannel Brain Signals

Link: https://arxiv.org/abs/2512.06435
Authors: Mara Sherlin Talento, Jordan Richards, Raphael Huser, Hernando Ombao
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

[LG-120] Modeling Spatio-temporal Extremes via Conditional Variational Autoencoders

Link: https://arxiv.org/abs/2512.06348
Authors: Xiaoyu Ma, Likun Zhang, Christopher K. Wikle
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

[LG-121] Interpretable Neural Approximation of Stochastic Reaction Dynamics with Guaranteed Reliability

Link: https://arxiv.org/abs/2512.06294
Authors: Quentin Badolle, Arthur Theuer, Zhou Fang, Ankit Gupta, Mustafa Khammash
Categories: Molecular Networks (q-bio.MN); Machine Learning (cs.LG); Probability (math.PR); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)

Abstract:Stochastic Reaction Networks (SRNs) are a fundamental modeling framework for systems ranging from chemical kinetics and epidemiology to ecological and synthetic biological processes. A central computational challenge is the estimation of expected outputs across initial conditions and times, a task that is rarely solvable analytically and becomes computationally prohibitive with current methods such as Finite State Projection or the Stochastic Simulation Algorithm. Existing deep learning approaches offer empirical scalability, but provide neither interpretability nor reliability guarantees, limiting their use in scientific analysis and in applications where model outputs inform real-world decisions. Here we introduce DeepSKA, a neural framework that jointly achieves interpretability, guaranteed reliability, and substantial computational gains. DeepSKA yields mathematically transparent representations that generalise across states, times, and output functions, and it integrates this structure with a small number of stochastic simulations to produce unbiased, provably convergent, and dramatically lower-variance estimates than classical Monte Carlo. We demonstrate these capabilities across nine SRNs, including nonlinear and non-mass-action models with up to ten species, where DeepSKA delivers accurate predictions and orders-of-magnitude efficiency improvements. This interpretable and reliable neural framework offers a principled foundation for developing analogous methods for other Markovian systems, including stochastic differential equations.
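
To make the classical Monte Carlo baseline mentioned in this abstract concrete, the sketch below runs the Stochastic Simulation Algorithm (Gillespie's direct method) on a one-species birth-death network and averages the final state over independent runs. This is not DeepSKA. The network, the rate values, and the function names (`ssa_birth_death`, `mc_mean`) are assumptions chosen because the expected copy number of this toy model has a known closed form to compare against.

```python
import math
import random


def ssa_birth_death(x0, birth, death, t_end, rng):
    """Simulate one path of: 0 -> X at rate `birth`, X -> 0 at rate `death * x`.

    Returns the copy number of X at time t_end.
    """
    t, x = 0.0, x0
    while True:
        a_birth = birth
        a_death = death * x
        a_total = a_birth + a_death
        if a_total <= 0.0:
            return x  # no reaction can fire any more
        t += rng.expovariate(a_total)  # waiting time to the next reaction
        if t > t_end:
            return x
        # Pick which reaction fires, with probability proportional to its propensity.
        if rng.random() * a_total < a_birth:
            x += 1
        else:
            x -= 1


def mc_mean(n_runs, x0, birth, death, t_end, seed=0):
    """Plain Monte Carlo estimate of E[X(t_end)] from n_runs independent paths."""
    rng = random.Random(seed)
    total = sum(ssa_birth_death(x0, birth, death, t_end, rng) for _ in range(n_runs))
    return total / n_runs


def exact_mean(x0, birth, death, t_end):
    """Closed-form mean of the linear birth-death process."""
    decay = math.exp(-death * t_end)
    return x0 * decay + (birth / death) * (1.0 - decay)


if __name__ == "__main__":
    est = mc_mean(2000, x0=0, birth=10.0, death=1.0, t_end=2.0)
    print("Monte Carlo:", est, "exact:", exact_mean(0, 10.0, 1.0, 2.0))
```

The estimate has to be recomputed from fresh simulations for every initial state and time point, which is the cost the abstract says becomes prohibitive for larger networks.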

[LG-122] Contextual Strongly Convex Simulation Optimization: Optimize then Predict with Inexact Solutions

Link: https://arxiv.org/abs/2512.06270
Authors: Nifei Lin, Heng Luo, L. Jeff Hong
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

[LG-123] Forests of Uncertaint(r)ees: Using tree-based ensembles to estimate probability distributions of future conflict

Link: https://arxiv.org/abs/2512.06210
Authors: Daniel Mittermaier, Tobias Bohne, Martin Hofer, Daniel Racek
Categories: Applications (stat.AP); Machine Learning (cs.LG)
Comments: 18 pages, 4 figures, 3 tables. Replication code available at this https URL

[LG-124] Beyond Lux thresholds: a systematic pipeline for classifying biologically relevant light contexts from wearable data

Link: https://arxiv.org/abs/2512.06181
Authors: Yanuo Zhou
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments: 16 pages, 8 figures. Reproducible pipeline for classifying biologically light from wearable spectral data. Manuscript in preparation for journal submission

Abstract:Background: Wearable spectrometers enable field quantification of biologically relevant light, yet reproducible pipelines for contextual classification remain under-specified. Objective: To establish and validate a subject-wise evaluated, reproducible pipeline and actionable design rules for classifying natural vs. artificial light from wearable spectral data. Methods: We analysed ActLumus recordings from 26 participants, each monitored for at least 7 days at 10-second sampling, paired with daily exposure diaries. The pipeline fixes the sequence: domain selection, log-base-10 transform, L2 normalisation excluding total intensity (to avoid brightness shortcuts), hour-level medoid aggregation, sine/cosine hour encoding, and MLP classifier, evaluated under participant-wise cross-validation. Results: The proposed sequence consistently achieved high performance on the primary task, with representative configurations reaching AUC = 0.938 (accuracy 88%) for natural vs. artificial classification on the held-out subject split. In contrast, indoor vs. outdoor classification remained at feasibility level due to spectral overlap and class imbalance (best AUC approximately 0.75; majority-class collapse without contextual sensors). Threshold baselines were insufficient on our data, supporting the need for spectral-temporal modelling beyond illuminance cut-offs. Conclusions: We provide a reproducible, auditable baseline pipeline and design rules for contextual light classification under subject-wise generalisation. All code, configuration files, and derived artefacts will be openly archived (GitHub + Zenodo DOI) to support reuse and benchmarking.
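
Three of the preprocessing steps listed in the Methods of the abstract above (log-base-10 transform, L2 normalisation, and sine/cosine hour encoding) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' pipeline: the function name `spectral_features`, the `eps` offset that keeps the logarithm finite, and the example channel values are hypothetical. The abstract does not say exactly how total intensity is excluded; here no intensity feature is appended. Domain selection, medoid aggregation, and the MLP classifier are omitted.

```python
import math


def spectral_features(channels, hour, eps=1e-6):
    """Turn raw spectral channel readings and an hour of day into a feature vector.

    Steps: log10 transform, L2 normalisation of the transformed spectrum,
    then a cyclic (sine, cosine) encoding of the hour.
    """
    logged = [math.log10(c + eps) for c in channels]
    norm = math.sqrt(sum(x * x for x in logged)) or 1.0  # guard against an all-zero vector
    unit = [x / norm for x in logged]
    angle = 2.0 * math.pi * (hour % 24) / 24.0
    return unit + [math.sin(angle), math.cos(angle)]


if __name__ == "__main__":
    print(spectral_features([10.0, 100.0, 1000.0], hour=6))
```

The cyclic encoding places hour 23 next to hour 0 in feature space, which a raw hour number would not do.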

[LG-125] Unifying Entropy Regularization in Optimal Control: From and Back to Classical Objectives via Iterated Soft Policies and Path Integral Solutions

Link: https://arxiv.org/abs/2512.06109
Authors: Ajinkya Bhole, Mohammad Mahmoudi Filabadi, Guillaume Crevecoeur, Tom Lefebvre
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)

Abstract:This paper develops a unified perspective on several stochastic optimal control formulations through the lens of Kullback-Leibler regularization. We propose a central problem that separates the KL penalties on policies and transitions, assigning them independent weights, thereby generalizing the standard trajectory-level KL-regularization commonly used in probabilistic and KL-regularized control. This generalized formulation acts as a generative structure from which various control problems can be recovered. These include the classical Stochastic Optimal Control (SOC), Risk-Sensitive Optimal Control (RSOC), and their policy-based KL-regularized counterparts. The latter we refer to as soft-policy SOC and RSOC, facilitating alternative problems with tractable solutions. Beyond serving as regularized variants, we show that these soft-policy formulations majorize the original SOC and RSOC problem. This means that the regularized solution can be iterated to retrieve the original solution. Furthermore, we identify a structurally synchronized case of the risk-seeking soft-policy RSOC formulation, wherein the policy and transition KL-regularization weights coincide. Remarkably, this specific setting gives rise to several powerful properties such as a linear Bellman equation, a path integral solution, and compositionality, thereby extending these computationally favourable properties to a broad class of control problems.

[LG-126] Multi-resolution Physics-Aware Recurrent Convolutional Neural Network for Complex Flows

Link: https://arxiv.org/abs/2512.06031
Authors: Xinlun Cheng, Joseph Choi, H.S. Udaykumar, Stephen Baek
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

[LG-127] Physics Enhanced Deep Surrogates for the Phonon Boltzmann Transport Equation

Link: https://arxiv.org/abs/2512.05976
Authors: Antonio Varagnolo, Giuseppe Romano, Raphaël Pestourie
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)

Information Retrieval

[IR-0] From Show Programmes to Data: Designing a Workflow to Make Performing Arts Ephemera Accessible Through Language Models

Link: https://arxiv.org/abs/2512.07452
Authors: Clarisse Bardiot, Pierre-Carl Langlais, Bernard Jacquemin, Jacob Hart, Antonios Lagarias, Nicolas Foucault, Aurélie Lemaître-Legargeant, Jeanne Fras
Categories: Information Retrieval (cs.IR)
Comments: 19 pages, 8 figures, 5 tables, 17 references

Abstract:Many heritage institutions hold extensive collections of theatre programmes, which remain largely underused due to their complex layouts and lack of structured metadata. In this paper, we present a workflow for transforming such documents into structured data using a combination of multimodal large language models (LLMs), an ontology-based reasoning model, and a custom extension of the Linked Art framework. We show how vision-language models can accurately parse and transcribe born-digital and digitised programmes, achieving over 98% correct extraction. To overcome the challenges of semantic annotation, we train a reasoning model (POntAvignon) using reinforcement learning with both formal and semantic rewards. This approach enables automated RDF triple generation and supports alignment with existing knowledge graphs. Through a case study based on the Festival d’Avignon corpus, we demonstrate the potential for large-scale, ontology-driven analysis of performing arts data. Our results open new possibilities for interoperable, explainable, and sustainable computational theatre historiography.

[IR-1] OnePiece: The Great Route to Generative Recommendation – A Case Study from Tencent Algorithm Competition

Link: https://arxiv.org/abs/2512.07424
Authors: Jiangxia Cao, Shuo Yang, Zijun Wang, Qinghai Tan
Categories: Information Retrieval (cs.IR)
Comments: Work in progress

Abstract:In recent years, OpenAI’s scaling laws have shown the remarkable intelligence that emerges from the next-token prediction paradigm in neural language modeling, pointing to a free-lunch way to enhance model performance by scaling the model parameters. In RecSys, the retrieval stage also follows a ‘next-token prediction’ paradigm to recall hundreds of items from the global item set, so generative recommendation usually refers specifically to the retrieval stage (without tree-based methods). This raises a philosophical question: without a ground-truth next item, does generative recommendation also hold a potential scaling law? In retrospect, generative recommendation has two different technical paradigms: (1) the ANN-based framework, which uses a compressed user embedding to retrieve the nearest items in embedding space, e.g., Kuaiformer; and (2) the auto-regressive framework, which employs beam search to decode the item from the whole space, e.g., OneRec. In this paper, we devise a unified encoder-decoder framework to validate the scaling laws of both at the same time. Our empirical finding is that both of their losses strictly adhere to power-law scaling laws (R^2 > 0.9) within our unified architecture.
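
As a rough illustration of how adherence to a power law can be checked, the sketch below fits loss ≈ a · size^(-b) by least squares in log-log space and reports R^2. It is a generic recipe, not the authors' evaluation code. The function name `fit_power_law` is hypothetical and the data points are synthetic, not results from the paper.

```python
import math

def fit_power_law(sizes, losses):
    """Fit loss ~ a * size**(-b) by least squares in log-log space.

    Returns (a, b, r2), where r2 is the coefficient of determination of
    the linear fit log(loss) = log(a) - b * log(size).
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot
    return math.exp(intercept), -slope, r2

# Synthetic data invented for illustration only.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * s ** -0.1 for s in sizes]
a, b, r2 = fit_power_law(sizes, losses)
```

On real measurements the points scatter around the fitted line, and R^2 quantifies how much of the variance in log-loss the straight line explains.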

[IR-2] On the Impact of Graph Neural Networks in Recommender Systems: A Topological Perspective

Link: https://arxiv.org/abs/2512.07384
Authors: Daniele Malitesta, Claudio Pomo, Vito Walter Anelli, Alberto Carlo Maria Mancino, Alejandro Bellogín, Tommaso Di Noia
Categories: Information Retrieval (cs.IR)

Abstract:In recommender systems, user-item interactions can be modeled as a bipartite graph, where user and item nodes are connected by undirected edges. This graph-based view has motivated the rapid adoption of graph neural networks (GNNs), which often outperform collaborative filtering (CF) methods such as latent factor models, deep neural networks, and generative strategies. Yet, despite their empirical success, the reasons why GNNs offer systematic advantages over other CF approaches remain only partially understood. This monograph advances a topology-centered perspective on GNN-based recommendation. We argue that a comprehensive understanding of these models’ performance should consider the structural properties of user-item graphs and their interaction with GNN architectural design. To support this view, we introduce a formal taxonomy that distills common modeling patterns across eleven representative GNN-based recommendation approaches and consolidates them into a unified conceptual pipeline. We further formalize thirteen classical and topological characteristics of recommendation datasets and reinterpret them through the lens of graph machine learning. Using these definitions, we analyze the considered GNN-based recommender architectures to assess how and to what extent they encode such properties. Building on this analysis, we derive an explanatory framework that links measurable dataset characteristics to model behavior and performance. Taken together, this monograph re-frames GNN-based recommendation through its topological underpinnings and outlines open theoretical, data-centric, and evaluation challenges for the next generation of topology-aware recommender systems.

[IR-3] Space efficient implementation of hypergraph dualization in the D-basis algorithm

Link: https://arxiv.org/abs/2512.06988
Authors: Skylar Homan, Anoop Krishnadas, Kira Adaricheva
Categories: Databases (cs.DB); Information Retrieval (cs.IR)
Comments: 21 pages, 3 figures, 10 tables. Submitted to Discrete Applied Mathematics. Results were presented at the AMS 2025 Fall Western Sectional Meeting at the University of Denver

Abstract:We present a new implementation of the D-basis algorithm, called Small Space, which considerably reduces the algorithm’s memory usage for data analysis applications. The previous implementation delivers the complete set of implications that hold on the set of attributes of an input binary table. In the new version, the only output is the frequencies of attributes that appear in the antecedents of implications from the D-basis, with a fixed consequent attribute. Such frequencies, rather than the implications themselves, have become the primary focus in the analysis of datasets to which the D-basis has been applied over the last decade. The D-basis algorithm employs a hypergraph dualization algorithm, and a dualization implementation known as Reverse Search allows for the gradual computation of frequencies without the need to store all discovered implications. We demonstrate the effectiveness of the Small Space implementation by comparing the runtimes and maximum memory usage of this new version with the current implementation.
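
The idea of accumulating frequencies while enumerating, rather than storing every discovered set, can be sketched on a toy hypergraph. The code below is neither the D-basis algorithm nor Reverse Search: a brute-force enumeration of minimal transversals (minimal hitting sets) stands in for the dualization step, and the names `minimal_transversals` and `frequencies` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def is_transversal(candidate, edges):
    """A transversal intersects every hyperedge."""
    return all(candidate & edge for edge in edges)

def minimal_transversals(vertices, edges):
    """Yield minimal transversals one at a time, without storing earlier ones.

    Minimality is checked locally: removing any single vertex must break
    the transversal property. Brute force, exponential in len(vertices).
    """
    for size in range(1, len(vertices) + 1):
        for combo in combinations(sorted(vertices), size):
            candidate = set(combo)
            if is_transversal(candidate, edges) and not any(
                is_transversal(candidate - {v}, edges) for v in candidate
            ):
                yield frozenset(candidate)

# Toy hypergraph invented for illustration.
vertices = {"a", "b", "c", "d"}
edges = [{"a", "b"}, {"a", "c"}, {"b", "c", "d"}]

# Only the running counts are kept; each transversal is discarded after use.
frequencies = Counter()
for transversal in minimal_transversals(vertices, edges):
    frequencies.update(transversal)
```

Memory use here is bounded by the counter and the current candidate, not by the number of transversals found.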

[IR-4] Structural and Disentangled Adaptation of Large Vision Language Models for Multimodal Recommendation

Link: https://arxiv.org/abs/2512.06883
Authors: Zhongtao Rao, Peilin Zhou, Dading Chong, Zhiwei Chen, Shoujin Wang, Nan Tang
Categories: Information Retrieval (cs.IR)

[IR-5] Foresight Prediction Enhanced Live-Streaming Recommendation WSDM2026

Link: https://arxiv.org/abs/2512.06700
Authors: Jiangxia Cao, Ruochen Yang, Xiang Chen, Changxin Lao, Yueyang Liu, Yusheng Huang, Yuanhao Tian, Xiangyu Wu, Shuang Yang, Zhaojie Liu, Guorui Zhou
Categories: Information Retrieval (cs.IR)
Comments: Accepted by WSDM 2026

[IR-6] Enhancing Medical Cross-Modal Hashing Retrieval using Dropout-Voting Mixture-of-Experts Fusion

Link: https://arxiv.org/abs/2512.06449
Authors: Jaewon Ahn, Woosung Jang, Beakcheol Jang
Categories: Information Retrieval (cs.IR)
Comments: 5 pages, 1 figure, workshop paper (MMGenSR 2025)

Abstract:In recent years, cross-modal retrieval using images and text has become an active area of research, especially in the medical domain. The abundance of data in various modalities in this field has made cross-modal retrieval increasingly important for efficient image interpretation, data-driven diagnostic support, and medical education. As distributed medical data are increasingly integrated across healthcare facilities to enhance interoperability, it is imperative to optimize retrieval systems in terms of speed, memory efficiency, and accuracy. This necessity arises in response to the substantial surge in data volume that characterizes contemporary medical practice. In this study, we propose a novel framework that incorporates dropout voting and mixture-of-experts (MoE) based contrastive fusion modules into a CLIP-based cross-modal hashing retrieval structure. We also propose the application of a hybrid loss. We call the resulting model MCMFH (medical cross-modal fusion hashing retrieval). Our method achieves high accuracy and fast retrieval speed simultaneously in low-memory environments. The model is demonstrated through experiments on radiological and non-radiological medical datasets.
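
Hashing-based retrieval is fast and memory-light because items are compared as short binary codes rather than as floating-point vectors. The sketch below shows only that generic retrieval step. It does not reproduce MCMFH, its fusion modules, or its loss; the embeddings, the keys, and the helper names `to_code`, `hamming`, and `retrieve` are invented for illustration.

```python
def to_code(vector):
    """Binarize a real-valued embedding by sign and pack the bits into an int."""
    code = 0
    for value in vector:
        code = (code << 1) | (1 if value > 0 else 0)
    return code

def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")

def retrieve(query_code, database, k=2):
    """Return the k database keys whose codes are closest to the query."""
    return sorted(database, key=lambda key: hamming(query_code, database[key]))[:k]

# Toy image embeddings, stored as 4-bit codes.
image_db = {
    "xray_a": to_code([0.9, -0.2, 0.4, -0.7]),
    "xray_b": to_code([-0.5, 0.3, -0.1, 0.8]),
    "mri_c": to_code([0.7, -0.6, -0.3, -0.2]),
}

# Toy text embedding, assumed to live in the same space as the image embeddings.
text_query = to_code([0.8, -0.1, 0.5, -0.9])
top = retrieve(text_query, image_db, k=2)
```

Each comparison is one XOR and one bit count, which is why Hamming-space search suits low-memory settings.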

[IR-7] Enhancing Information Retrieval in Digital Libraries through Unit Harmonisation in Scholarly Knowledge Graphs

Link: https://arxiv.org/abs/2512.06395
Authors: Golsa Heidari, Markus Stocker, Sören Auer
Categories: Digital Libraries (cs.DL); Information Retrieval (cs.IR)

[IR-8] Beyond Existing Retrievals: Cross-Scenario Incremental Sample Learning Framework

Link: https://arxiv.org/abs/2512.06381
Authors: Tao Wang, Xun Luo, Jinlong Guo, Yuliang Yan, Jian Wu, Yuning Jiang, Bo Zheng
Categories: Information Retrieval (cs.IR)

[IR-9] Enhanced Multimodal Video Retrieval System: Integrating Query Expansion and Cross-modal Temporal Event Retrieval

Link: https://arxiv.org/abs/2512.06334
Authors: Van-Thinh Vo, Minh-Khoi Nguyen, Minh-Huy Tran, Anh-Quan Nguyen-Tran, Duy-Tan Nguyen, Khanh-Loi Nguyen, Anh-Minh Phan
Categories: Information Retrieval (cs.IR)
Comments: 11 pages, 6 figures, SOICT 2025

[IR-10] Sift or Get Off the PoC: Applying Information Retrieval to Vulnerability Research with SiftRank

Link: https://arxiv.org/abs/2512.06155
Authors: Caleb Gross
Categories: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)

Abstract:Security research is fundamentally a problem of resource constraint and consequent prioritization. There is simply too much attack surface and too little time and energy to spend analyzing it all. The most effective security researchers are often those who are most skilled at intuitively deciding which part of an expansive attack surface to investigate. We demonstrate that this problem of selecting the most promising option from among many possibilities can be reframed as an information retrieval problem, and solved using document ranking techniques with LLMs performing the heavy lifting as general-purpose rankers. We present SiftRank, a ranking algorithm achieving O(n) complexity through three key mechanisms: listwise ranking using an LLM to order documents in small batches of approximately 10 items at a time; inflection-based convergence detection that adaptively terminates ranking when score distributions have stabilized; and iterative refinement that progressively focuses ranking effort on the most relevant documents. Unlike existing reranking approaches that require a separate first-stage retrieval step to narrow datasets to approximately 100 candidates, SiftRank operates directly on thousands of items, with each document evaluated across multiple randomized batches to mitigate inconsistent judgments by an LLM. We demonstrate practical effectiveness on N-day vulnerability analysis, successfully identifying a vulnerability-fixing function among 2,197 changed functions in a stripped binary firmware patch within 99 seconds at an inference cost of $0.82. Our approach enables scalable security prioritization for problems that are generally constrained by manual analysis, requiring only standard LLM API access without specialized infrastructure, embedding, or domain-specific fine-tuning. An open-source implementation of SiftRank may be found at this https URL.
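
The combination of small-batch listwise ranking, repeated randomized batches, and iterative narrowing can be sketched as follows. This is not the SiftRank implementation: a plain callable stands in for the LLM ranker, inflection-based convergence detection is replaced by a fixed number of trials and a fixed keep ratio, and the name `sift` and all parameters are hypothetical.

```python
import random

def sift(docs, rank_batch, batch_size=10, trials=3, keep=0.5, seed=0):
    """Rank small random batches repeatedly and keep the best-scoring documents.

    rank_batch(batch) must return the batch ordered from most to least
    relevant. In the paper that role is played by an LLM; here it is any
    callable. Documents must be hashable.
    """
    rng = random.Random(seed)
    candidates = list(docs)
    while len(candidates) > batch_size:
        scores = {doc: 0.0 for doc in candidates}
        for _ in range(trials):
            shuffled = candidates[:]
            rng.shuffle(shuffled)  # each trial sees different batch neighbours
            for start in range(0, len(shuffled), batch_size):
                ordered = rank_batch(shuffled[start:start + batch_size])
                for position, doc in enumerate(ordered):
                    # Averaged normalized position: 0 is best within a batch.
                    scores[doc] += position / len(ordered) / trials
        candidates.sort(key=lambda doc: scores[doc])
        # Focus the next round on the better-scoring part of the pool.
        candidates = candidates[:max(batch_size, int(len(candidates) * keep))]
    return rank_batch(candidates)

# Toy stand-in ranker: larger numbers count as more relevant.
docs = list(range(200))
best = sift(docs, lambda batch: sorted(batch, reverse=True))[0]
```

Because the pool shrinks geometrically, the total number of ranker calls in this sketch grows linearly with the number of documents.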

Attachment Download

Click to download today's full paper list