This blog post presents the latest paper list retrieved from Arxiv.org on 2025-12-24. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.
Note: The daily paper data is fetched from Arxiv.org and updated automatically at around 12:00 each day.
Table of Contents
Overview (2025-12-24)
A total of 412 papers were updated today, including:
- Natural Language Processing: 36 papers (Computation and Language, cs.CL)
- Artificial Intelligence: 113 papers (Artificial Intelligence, cs.AI)
- Computer Vision: 86 papers (Computer Vision and Pattern Recognition, cs.CV)
- Machine Learning: 124 papers (Machine Learning, cs.LG)
Natural Language Processing
[NLP-0] Making Large Language Models Efficient Dense Retrievers
[Quick Read]: This paper addresses the computational inefficiency of large language models (LLMs) as dense retrievers caused by their massive parameter counts. Although prior work has shown substantial layer redundancy in LLMs on generative tasks, the redundancy profile under retrieval remained unclear, since retrieval requires encoding an entire sequence into a fixed-dimensional vector representation rather than generating tokens iteratively. The study finds that in the retrieval setting MLP layers are far more prunable than attention layers, which remain critical for semantic aggregation. Building on this insight, the authors propose the EffiR framework, whose key idea is to compress the MLP layers at scale through a combination of coarse-grained depth compression and fine-grained width compression, complemented by retrieval-specific fine-tuning. Across multiple BEIR datasets and LLM backbones, EffiR substantially reduces model size and inference cost while preserving retrieval performance comparable to full-size models.
Link: https://arxiv.org/abs/2512.20612
Authors: Yibin Lei,Shwai He,Ang Li,Andrew Yates
Institutions: University of Amsterdam; University of Maryland, College Park; Johns Hopkins University, HLTCOE
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:
Abstract:Recent work has shown that directly fine-tuning large language models (LLMs) for dense retrieval yields strong performance, but their substantial parameter counts make them computationally inefficient. While prior studies have revealed significant layer redundancy in LLMs for generative tasks, it remains unclear whether similar redundancy exists when these models are adapted for retrieval tasks, which require encoding entire sequences into fixed representations rather than generating tokens iteratively. To this end, we conduct a comprehensive analysis of layer redundancy in LLM-based dense retrievers. We find that, in contrast to generative settings, MLP layers are substantially more prunable, while attention layers remain critical for semantic aggregation. Building on this insight, we propose EffiR, a framework for developing efficient retrievers that performs large-scale MLP compression through a coarse-to-fine strategy (coarse-grained depth reduction followed by fine-grained width reduction), combined with retrieval-specific fine-tuning. Across diverse BEIR datasets and LLM backbones, EffiR achieves substantial reductions in model size and inference cost while preserving the performance of full-size models.
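The abstract does not spell out the compression steps, but the coarse-to-fine idea is easy to illustrate. Below is a minimal sketch, assuming a PyTorch model whose blocks expose an `mlp` submodule of the form Linear -> activation -> Linear; the norm-based importance scores and the drop/keep ratios are illustrative placeholders, not the authors' actual recipe:

```python
import torch
import torch.nn as nn

def depth_compress(blocks, drop_ratio=0.25):
    """Coarse step: remove whole MLP sub-layers, keeping attention intact.
    Layers are scored here by a cheap weight-norm proxy for importance."""
    scores = [b.mlp[0].weight.norm().item() for b in blocks]
    k = int(len(blocks) * drop_ratio)
    for i in sorted(range(len(blocks)), key=scores.__getitem__)[:k]:
        blocks[i].mlp = nn.Identity()
    return blocks

def width_compress(mlp, keep_ratio=0.5):
    """Fine step: prune intermediate neurons with the smallest combined
    input/output weight magnitude, shrinking the MLP's hidden width."""
    up, act, down = mlp[0], mlp[1], mlp[2]
    importance = up.weight.abs().sum(dim=1) * down.weight.abs().sum(dim=0)
    keep = importance.topk(int(len(importance) * keep_ratio)).indices
    new_up = nn.Linear(up.in_features, len(keep))
    new_down = nn.Linear(len(keep), down.out_features)
    new_up.weight.data = up.weight.data[keep].clone()
    new_up.bias.data = up.bias.data[keep].clone()
    new_down.weight.data = down.weight.data[:, keep].clone()
    new_down.bias.data = down.bias.data.clone()
    return nn.Sequential(new_up, act, new_down)
```

In EffiR, this kind of compression is followed by retrieval-specific fine-tuning to recover accuracy.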
[NLP-1] MoE-DiffuSeq: Enhancing Long-Document Diffusion Models with Sparse Attention and Mixture of Experts
[Quick Read]: This paper tackles the high computational and memory cost of diffusion-based long-text generation, which makes existing models inefficient on long documents. The key solution is MoE-DiffuSeq, a framework that combines a Mixture of Experts (MoE) architecture with a customized sparse attention mechanism, reducing computational complexity while preserving text quality and coherence. In addition, a soft absorbing state is introduced into the diffusion process to accelerate sequence reconstruction and improve generation precision, yielding notable gains in training efficiency and sampling speed, especially for long-text settings such as scientific article generation, code repository modeling, and long-form dialogue.
Link: https://arxiv.org/abs/2512.20604
Authors: Alexandros Christoforos,Chadbourne Davis
Institutions: Suffolk University
Categories: Computation and Language (cs.CL)
Comments: Under submission
Abstract:We present MoE-DiffuSeq, a mixture of experts based framework for enhancing diffusion models in long document generation. Existing diffusion based text generation models, such as DiffuSeq, suffer from high computational cost and memory overhead when applied to extended sequences. To address these challenges, MoE-DiffuSeq integrates sparse attention with a mixture of experts architecture, enabling efficient and scalable long sequence modeling. Our approach introduces a customized sparse attention mechanism designed to reduce computational complexity while preserving text quality and coherence. In addition, we incorporate a soft absorbing state within the diffusion process to accelerate sequence reconstruction and improve generation precision. Extensive experiments demonstrate that MoE-DiffuSeq significantly improves training efficiency and sampling speed compared to existing diffusion models. These advantages are particularly effective for long document scenarios, including scientific article generation, code repository modeling, and long form dialogue generation. Benchmark results further show that MoE-DiffuSeq improves efficiency, speed, accuracy, and expressiveness, advancing the practical applicability of diffusion models for high quality long form text generation.
[NLP-2] Cube Bench: A Benchmark for Spatial Visual Reasoning in MLLMs
[Quick Read]: This paper addresses the lack of a standardized, reproducible benchmark for evaluating spatial reasoning and sequential decision-making in multimodal large language models (MLLMs). The authors propose Cube Bench, a Rubik's Cube-based benchmark whose key idea is to decompose performance into five concrete skills: (i) reconstructing cube-face states from images and text, (ii) choosing the optimal next move, (iii) predicting the outcome of a candidate move without applying it, (iv) executing multi-step plans while recovering from mistakes, and (v) detecting and revising one's own errors. With a shared set of scrambled states, identical prompts and parsers, and a single distance-to-solved metric, the benchmark enables side-by-side comparison of MLLMs, revealing that closed-source models clearly outperform open-source ones on harder settings and that even the best models degrade at higher cube complexity, providing a compact, reproducible probe of spatial-sequential reasoning in MLLMs.
Link: https://arxiv.org/abs/2512.20595
Authors: Dhruv Anand,Ehsan Shareghi
Institutions: Monash University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages, 5 figures, 9 tables. Cube available at this https URL
Abstract:We introduce Cube Bench, a Rubik’s-cube benchmark for evaluating spatial and sequential reasoning in multimodal large language models (MLLMs). The benchmark decomposes performance into five skills: (i) reconstructing cube faces from images and text, (ii) choosing the optimal next move, (iii) predicting the outcome of a candidate move without applying it, (iv) executing multi-step plans while recovering from mistakes, and (v) detecting and revising one’s own errors. Using a shared set of scrambled cube states, identical prompts and parsers, and a single distance-to-solved metric, we compare recent MLLMs side by side as a function of scramble depth. Across seven MLLMs, accuracy drops sharply with depth; once a trajectory stalls or diverges, models rarely recover, and high face-reconstruction accuracy does not guarantee competent action selection or multi-step execution. A pronounced closed- vs open-source gap emerges: the strongest closed model leads on both single-step perception tasks and multi-step control tasks, while open-weight models cluster near chance on the hardest settings; yet even the best MLLM degrades at higher cube complexity. A simple self-correction via reflective thinking yields modest gains but can also introduce overthinking. Cube Bench offers a compact, reproducible probe of sequential spatial reasoning in MLLMs.
[NLP-3] Automated stereotactic radiosurgery planning using a human-in-the-loop reasoning large language model agent
[Quick Read]: This paper addresses the limited clinical adoption of automated treatment planning in stereotactic radiosurgery (SRS) caused by the opacity of black-box AI systems. The key to the solution is SAGE (Secure Agent for Generative Dose Expertise), an LLM-based planning agent that uses chain-of-thought reasoning to make the planning process auditable, improving transparency and trustworthiness. Experiments show the reasoning variant matches human planners on key dosimetric endpoints (all p > 0.21) while significantly reducing cochlear dose (p = 0.022), and it exhibits systematic planning behaviors such as prospective constraint verification (457 instances) and trade-off deliberation (609 instances) that the standard model lacks, demonstrating that reasoning capability is central to the quality and interpretability of automated plans.
Link: https://arxiv.org/abs/2512.20586
Authors: Humza Nusrat,Luke Francisco,Bing Luo,Hassan Bagher-Ebadian,Joshua Kim,Karen Chin-Snyder,Salim Siddiqui,Mira Shah,Eric Mellon,Mohammad Ghassemi,Anthony Doemer,Benjamin Movsas,Kundan Thind
Institutions: unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Stereotactic radiosurgery (SRS) demands precise dose shaping around critical structures, yet black-box AI systems have limited clinical adoption due to opacity concerns. We tested whether chain-of-thought reasoning improves agentic planning in a retrospective cohort of 41 patients with brain metastases treated with 18 Gy single-fraction SRS. We developed SAGE (Secure Agent for Generative Dose Expertise), an LLM-based planning agent for automated SRS treatment planning. Two variants generated plans for each case: one using a non-reasoning model, one using a reasoning model. The reasoning variant showed comparable plan dosimetry relative to human planners on primary endpoints (PTV coverage, maximum dose, conformity index, gradient index; all p > 0.21) while reducing cochlear dose below human baselines (p = 0.022). When prompted to improve conformity, the reasoning model demonstrated systematic planning behaviors including prospective constraint verification (457 instances) and trade-off deliberation (609 instances), while the standard model exhibited none of these deliberative processes (0 and 7 instances, respectively). Content analysis revealed that constraint verification and causal explanation concentrated in the reasoning agent. The optimization traces serve as auditable logs, offering a path toward transparent automated planning.
[NLP-4] Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits
[Quick Read]: This paper tackles the difficulty LLMs have in recognizing their own mistakes and hallucinations during generation; conventional remedies such as external judges, multi-sample consistency, or text-based self-critique incur heavy compute or correlate weakly with true correctness. The key to the solution is Gnosis, a lightweight intrinsic self-awareness mechanism that passively observes hidden states and attention patterns during inference, compresses them into fixed-budget descriptors, and predicts output correctness at negligible extra cost, adding only about 5M parameters and operating independently of sequence length. The method extracts reliable correctness signals from the generation process itself without external supervision, enabling efficient and accurate intrinsic self-verification.
Link: https://arxiv.org/abs/2512.20578
Authors: Amirhosein Ghasemabadi,Di Niu
Institutions: University of Alberta
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) generate fluent and complex outputs but often fail to recognize their own mistakes and hallucinations. Existing approaches typically rely on external judges, multi-sample consistency, or text-based self-critique, which incur additional compute or correlate weakly with true correctness. We ask: can LLMs predict their own failures by inspecting internal states during inference? We introduce Gnosis, a lightweight self-awareness mechanism that enables frozen LLMs to perform intrinsic self-verification by decoding signals from hidden states and attention patterns. Gnosis passively observes internal traces, compresses them into fixed-budget descriptors, and predicts correctness with negligible inference cost, adding only ~5M parameters and operating independently of sequence length. Across math reasoning, open-domain question answering, and academic knowledge benchmarks, and over frozen backbones ranging from 1.7B to 20B parameters, Gnosis consistently outperforms strong internal baselines and large external judges in both accuracy and calibration. Moreover, it generalizes zero-shot to partial generations, enabling early detection of failing trajectories and compute-aware control. These results show that reliable correctness cues are intrinsic to generation process and can be extracted efficiently without external supervision.
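As a toy illustration of the core idea (predicting correctness from internal signals rather than from text), one can fit a small probe on pooled hidden states. The real Gnosis compresses hidden states and attention patterns into fixed-budget descriptors with a ~5M-parameter module, so the sketch below, with random placeholder data, only conveys the flavor:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Placeholder data: one mean-pooled hidden-state vector per generation,
# labeled by whether the model's final answer turned out to be correct.
hidden = rng.normal(size=(1000, 4096))      # [n_generations, d_model]
correct = rng.integers(0, 2, size=1000)     # 1 = correct, 0 = wrong

probe = LogisticRegression(max_iter=1000).fit(hidden[:800], correct[:800])
p_correct = probe.predict_proba(hidden[800:])[:, 1]   # estimated P(correct)
print("held-out accuracy:", probe.score(hidden[800:], correct[800:]))
```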
[NLP-5] Distilling to Hybrid Attention Models via KL-Guided Layer Selection
[Quick Read]: This paper aims to improve the inference efficiency of LLMs by distilling pretrained softmax-attention Transformers into a hybrid architecture that interleaves softmax and linear attention layers, avoiding expensive pretraining from scratch. The key contribution is a simple, efficient layer-selection recipe that decides which layers to convert to linear-attention variants based on importance scores derived from a small amount of training on generic text data; a recent distillation pipeline (RADLADS) is then applied, comprising attention weight transfer, hidden-state alignment, KL-based distribution matching, and a small amount of finetuning. Experiments show the approach outperforms existing heuristics that uniformly interleave linear attention at a fixed ratio, as well as more involved methods that rely on specialized diagnostic datasets.
Link: https://arxiv.org/abs/2512.20569
Authors: Yanhong Li,Songlin Yang,Shawn Tan,Mayank Mishra,Rameswar Panda,Jiawei Zhou,Yoon Kim
Institutions: Allen Institute for AI; MIT; MIT-IBM Watson AI Lab; Stony Brook University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Distilling pretrained softmax attention Transformers into more efficient hybrid architectures that interleave softmax and linear attention layers is a promising approach for improving the inference efficiency of LLMs without requiring expensive pretraining from scratch. A critical factor in the conversion process is layer selection, i.e., deciding on which layers to convert to linear attention variants. This paper describes a simple and efficient recipe for layer selection that uses layer importance scores derived from a small amount of training on generic text data. Once the layers have been selected we use a recent pipeline for the distillation process itself, RADLADS (Goldstein et al., 2025), which consists of attention weight transfer, hidden state alignment, KL-based distribution matching, followed by a small amount of finetuning. We find that this approach is more effective than existing approaches for layer selection, including heuristics that uniformly interleave linear attentions based on a fixed ratio, as well as more involved approaches that rely on specialized diagnostic datasets.
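The paper derives importance scores from a small amount of training rather than from pure ablation, but a KL-based scoring loop gives the general flavor. A minimal sketch; `forward_with_layer_skipped` is a hypothetical helper, not an API from the paper, and the model is assumed to return Hugging Face-style outputs with a `.logits` field:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def layer_importance(model, batch, n_layers):
    """Score each layer by how much removing it perturbs the output
    distribution; low-scoring layers are the cheapest to linearize."""
    ref_logp = F.log_softmax(model(batch).logits, dim=-1)
    scores = []
    for i in range(n_layers):
        logits = forward_with_layer_skipped(model, batch, skip=i)  # hypothetical
        logp = F.log_softmax(logits, dim=-1)
        # KL(reference || ablated), averaged over batch and positions
        kl = F.kl_div(logp, ref_logp, log_target=True, reduction="batchmean")
        scores.append(kl.item())
    return scores  # convert the lowest-KL layers to linear attention
```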
[NLP-6] Step-DeepResearch Technical Report
[Quick Read]: This paper addresses the shortcomings of existing academic benchmarks (such as BrowseComp) for open-ended deep-research tasks as LLMs evolve into autonomous agents, in particular their weak coverage of intent recognition, long-horizon decision-making, and cross-source verification. The core of the solution is Step-DeepResearch, a cost-effective end-to-end agent whose key techniques include a data synthesis strategy based on atomic capabilities to strengthen planning and report writing, and a progressive training path from agentic mid-training through supervised fine-tuning (SFT) to reinforcement learning (RL). A checklist-style judger improves evaluation robustness, and a Chinese-oriented benchmark, ADR-Bench, is built to close the evaluation gap. Experiments show the approach lifts a 32B model to 61.4% on Scale AI Research Rubrics and, on ADR-Bench, significantly above comparable open-source models, approaching the capabilities of closed-source SOTA systems such as OpenAI and Gemini DeepResearch.
Link: https://arxiv.org/abs/2512.20491
Authors: Chen Hu,Haikuo Du,Heng Wang,Lin Lin,Mingrui Chen,Peng Liu,Ruihang Miao,Tianchi Yue,Wang You,Wei Ji,Wei Yuan,Wenjin Deng,Xiaojian Yuan,Xiaoyun Zhang,Xiangyu Liu,Xikai Liu,Yanming Xu,Yicheng Cao,Yifei Zhang,Yongyao Wang,Yubo Shu,Yurong Zhang,Yuxiang Zhang,Zheng Gong,Zhichao Chang,Binyan Li,Dan Ma,Furong Jia,Hongyuan Wang,Jiayu Liu,Jing Bai,Junlan Liu,Manjiao Liu,Na Wang,Qiuping Wu,Qinxin Du,Shiwei Li,Wen Sun,Yifeng Gong,Yonglin Chen,Yuling Zhao,Yuxuan Lin,Ziqi Ren,Zixuan Wang,Aihu Zhang,Brian Li,Buyun Ma,Kang An,Li Xie,Mingliang Li,Pan Li,Shidong Yang,Xi Chen,Xiaojia Liu,Yuchu Luo,Yuan Song,YuanHao Ding,Yuanwei Liang,Zexi Li,Zhaoning Zhang,Zixin Zhang,Binxing Jiao,Daxin Jiang,Jiansheng Chen,Jing Li,Xiangyu Zhang,Yibo Zhu
Institutions: unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:As LLMs shift toward autonomous agents, Deep Research has emerged as a pivotal metric. However, existing academic benchmarks like BrowseComp often fail to meet real-world demands for open-ended research, which requires robust skills in intent recognition, long-horizon decision-making, and cross-source verification. To address this, we introduce Step-DeepResearch, a cost-effective, end-to-end agent. We propose a Data Synthesis Strategy Based on Atomic Capabilities to reinforce planning and report writing, combined with a progressive training path from agentic mid-training to SFT and RL. Enhanced by a Checklist-style Judger, this approach significantly improves robustness. Furthermore, to bridge the evaluation gap in the Chinese domain, we establish ADR-Bench for realistic deep research scenarios. Experimental results show that Step-DeepResearch (32B) scores 61.4% on Scale AI Research Rubrics. On ADR-Bench, it significantly outperforms comparable models and rivals SOTA closed-source models like OpenAI and Gemini DeepResearch. These findings prove that refined training enables medium-sized models to achieve expert-level capabilities at industry-leading cost-efficiency.
[NLP-7] Sentiment-Aware Extractive and Abstractive Summarization for Unstructured Text Mining
[Quick Read]: This paper addresses the difficulty of existing summarization methods in capturing emotional cues while maintaining topical relevance for short user-generated content (UGC) on platforms such as social media. Traditional summarizers are mostly optimized for structured news text and cannot cope with the noise, informality, and emotional richness typical of UGC. The key to the solution is a sentiment-aware framework that embeds sentiment signals into both extractive (TextRank) and abstractive (UniLM) summarization: in the extractive branch, sentence ranking is adjusted to emphasize sentiment salience, while in the abstractive branch the model is guided to generate sentiment-enriched summaries. This improves the modeling of emotional nuance and thematic coherence, producing concise, sentiment-rich summaries that support timely interventions and strategic decision-making in Information Systems applications such as brand monitoring and market analysis.
Link: https://arxiv.org/abs/2512.20404
Authors: Junyi Liu,Stanley Kok
Institutions: unknown
Categories: Computation and Language (cs.CL)
Comments: WITS 2025 (Workshop on Information Technologies and Systems 2025)
Abstract:With the rapid growth of unstructured data from social media, reviews, and forums, text mining has become essential in Information Systems (IS) for extracting actionable insights. Summarization can condense fragmented, emotion-rich posts, but existing methods, optimized for structured news, struggle with noisy, informal content. Emotional cues are critical for IS tasks such as brand monitoring and market analysis, yet few studies integrate sentiment modeling into summarization of short user-generated texts. We propose a sentiment-aware framework extending extractive (TextRank) and abstractive (UniLM) approaches by embedding sentiment signals into ranking and generation processes. This dual design improves the capture of emotional nuances and thematic relevance, producing concise, sentiment-enriched summaries that enhance timely interventions and strategic decision-making in dynamic online environments.
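As a rough sketch of how sentiment can be embedded into the extractive ranking step, one can bias TextRank's random walk with a sentiment-based personalization vector; the weighting below is an assumption for illustration, not the paper's exact formulation:

```python
import networkx as nx

def sentiment_textrank(sentences, similarity, sentiment, alpha=0.5):
    """Rank sentences by TextRank, biased toward emotionally salient ones.

    similarity[i][j]: pairwise sentence similarity; sentiment[i]: |polarity|.
    """
    g = nx.Graph()
    n = len(sentences)
    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] > 0:
                g.add_edge(i, j, weight=similarity[i][j])
    # Bias the random walk toward sentences with strong sentiment.
    bias = {i: alpha * sentiment[i] + (1 - alpha) for i in range(n)}
    scores = nx.pagerank(g, weight="weight", personalization=bias)
    return sorted(range(n), key=lambda i: scores.get(i, 0), reverse=True)

ranked = sentiment_textrank(
    ["Great battery life!", "Shipping was slow.", "Overall it works."],
    similarity=[[0, .2, .5], [.2, 0, .3], [.5, .3, 0]],
    sentiment=[0.9, 0.6, 0.1],
)
```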
[NLP-8] Generative Digital Twins: Vision-Language Simulation Models for Executable Industrial Systems
[Quick Read]: This paper addresses the problem of fusing multimodal information and automatically generating executable code for industrial simulation systems, i.e., jointly understanding the visual and textual semantics of layout sketches and natural-language prompts to produce directly executable FlexScript. The key to the solution is a Vision-Language Simulation Model (VLSM), supported by the first large-scale generative digital-twin dataset (over 120K prompt-sketch-code triplets) that enables multimodal learning across visual structure, textual description, and simulation logic. Three dedicated evaluation metrics, Structural Validity Rate (SVR), Parameter Match Rate (PMR), and Execution Success Rate (ESR), are introduced to systematically assess structural integrity, parameter fidelity, and simulator executability.
Link: https://arxiv.org/abs/2512.20387
Authors: YuChe Hsu,AnJui Wang,TsaiChing Ni,YuanFu Yang
Institutions: National Yang Ming Chiao Tung University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 9 figures
Abstract:We propose a Vision-Language Simulation Model (VLSM) that unifies visual and textual understanding to synthesize executable FlexScript from layout sketches and natural-language prompts, enabling cross-modal reasoning for industrial simulation systems. To support this new paradigm, the study constructs the first large-scale dataset for generative digital twins, comprising over 120,000 prompt-sketch-code triplets that enable multimodal learning between textual descriptions, spatial structures, and simulation logic. In parallel, three novel evaluation metrics, Structural Validity Rate (SVR), Parameter Match Rate (PMR), and Execution Success Rate (ESR), are proposed specifically for this task to comprehensively evaluate structural integrity, parameter fidelity, and simulator executability. Through systematic ablation across vision encoders, connectors, and code-pretrained language backbones, the proposed models achieve near-perfect structural accuracy and high execution robustness. This work establishes a foundation for generative digital twins that integrate visual reasoning and language understanding into executable industrial simulation systems.
[NLP-9] Multi-LLM Thematic Analysis with Dual Reliability Metrics: Combining Cohen's Kappa and Semantic Similarity for Qualitative Research Validation
[Quick Read]: This paper addresses the reliability challenge in qualitative research arising from dependence on multiple human coders, where traditional inter-rater agreement methods are time-consuming and often reach only moderate consistency. The key to the solution is a multi-perspective validation framework combining ensemble validation with dual reliability metrics: Cohen's Kappa (κ) for inter-rater agreement and cosine similarity for semantic consistency. The framework supports configurable parameters (1-6 seeds, temperature 0.0-2.0), custom prompt structures, and consensus-theme extraction over arbitrary JSON formats, enabling reliable and reproducible LLM-based thematic analysis.
Link: https://arxiv.org/abs/2512.20352
Authors: Nilesh Jain,Seyi Adeyinka,Leor Roseman,Aza Allsop
Institutions: Yale School of Medicine; University of Exeter; Center for Collective Healing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 1 figure, 3 tables
Abstract:Qualitative research faces a critical reliability challenge: traditional inter-rater agreement methods require multiple human coders, are time-intensive, and often yield moderate consistency. We present a multi-perspective validation framework for LLM-based thematic analysis that combines ensemble validation with dual reliability metrics: Cohen's Kappa (κ) for inter-rater agreement and cosine similarity for semantic consistency. Our framework enables configurable analysis parameters (1-6 seeds, temperature 0.0-2.0), supports custom prompt structures with variable substitution, and provides consensus theme extraction across any JSON format. As proof-of-concept, we evaluate three leading LLMs (Gemini 2.5 Pro, GPT-4o, Claude 3.5 Sonnet) on a psychedelic art therapy interview transcript, conducting six independent runs per model. Results demonstrate Gemini achieves highest reliability (κ = 0.907, cosine = 95.3%), followed by GPT-4o (κ = 0.853, cosine = 92.6%) and Claude (κ = 0.842, cosine = 92.1%). All three models achieve high agreement (κ > 0.80), validating the multi-run ensemble approach. The framework successfully extracts consensus themes across runs, with Gemini identifying 6 consensus themes (50-83% consistency), GPT-4o identifying 5 themes, and Claude 4 themes. Our open-source implementation provides researchers with transparent reliability metrics, flexible configuration, and structure-agnostic consensus extraction, establishing methodological foundations for reliable AI-assisted qualitative research.
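Both reliability metrics are standard and straightforward to reproduce. A minimal sketch with scikit-learn, using placeholder codes and random embeddings; a real pipeline would embed the extracted themes with a sentence encoder:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics.pairwise import cosine_similarity

# Inter-rater agreement: binary theme codes assigned by two independent runs.
run_a = [1, 0, 1, 1, 0, 1, 0, 1]
run_b = [1, 0, 1, 0, 0, 1, 0, 1]
kappa = cohen_kappa_score(run_a, run_b)

# Semantic consistency: cosine similarity between theme embeddings.
emb_a = np.random.rand(6, 384)   # placeholder for real theme embeddings
emb_b = np.random.rand(6, 384)
sims = cosine_similarity(emb_a, emb_b)
consistency = sims.max(axis=1).mean()   # best-match similarity per theme

print(f"kappa={kappa:.3f}, semantic consistency={consistency:.1%}")
```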
[NLP-10] Can LLMs Solve My Grandma's Riddle? Evaluating Multilingual Large Language Models on Reasoning Traditional Bangla Tricky Riddles
Link: https://arxiv.org/abs/2512.20324
Authors: Nurul Labib Sayeedi,Md. Faiyaz Abdullah Sayeedi,Khushnur Binte Jahangir,Swakkhar Shatabda,Sarah Masud Preum
Institutions: unknown
Categories: Computation and Language (cs.CL)
Comments:
[NLP-11] SpidR: Learning Fast and Stable Linguistic Units for Spoken Language Models Without Supervision
[Quick Read]: This paper studies how to learn semantic representations directly from speech to support textless spoken language modeling, where the core challenges are extracting semantic information effectively and improving pretraining efficiency. The key to the solution is SpidR, a self-supervised speech representation model trained on raw waveforms with a masked prediction objective combined with self-distillation and online clustering, which learns representations rich in phonetic information. The intermediate layers of the student model are trained to predict cluster assignments derived from the teacher's intermediate layers, a mechanism that markedly stabilizes online clustering and improves codebook quality. Pretraining takes only one day on 16 GPUs, far less than models such as HuBERT, thanks to the pretraining method and an efficient codebase.
Link: https://arxiv.org/abs/2512.20308
Authors: Maxime Poli,Mahi Luthra,Youssef Benchekroun,Yosuke Higuchi,Martin Gleize,Jiayi Shen,Robin Algayres,Yu-An Chung,Mido Assran,Juan Pino,Emmanuel Dupoux
Institutions: ENS-PSL, EHESS, CNRS; FAIR at Meta
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 30 pages, 16 figures
Abstract:The parallel advances in language modeling and speech representation learning have raised the prospect of learning language directly from speech without textual intermediates. This requires extracting semantic representations directly from speech. Our contributions are threefold. First, we introduce SpidR, a self-supervised speech representation model that efficiently learns representations with highly accessible phonetic information, which makes it particularly suited for textless spoken language modeling. It is trained on raw waveforms using a masked prediction objective combined with self-distillation and online clustering. The intermediate layers of the student model learn to predict assignments derived from the teacher’s intermediate layers. This learning objective stabilizes the online clustering procedure compared to previous approaches, resulting in higher quality codebooks. SpidR outperforms wav2vec 2.0, HuBERT, WavLM, and DinoSR on downstream language modeling benchmarks (sWUGGY, sBLIMP, tSC). Second, we systematically evaluate across models and layers the correlation between speech unit quality (ABX, PNMI) and language modeling performance, validating these metrics as reliable proxies. Finally, SpidR significantly reduces pretraining time compared to HuBERT, requiring only one day of pretraining on 16 GPUs, instead of a week. This speedup is enabled by the pretraining method and an efficient codebase, which allows faster iteration and easier experimentation. We open-source the training code and model checkpoints at this https URL.
[NLP-12] Patterns vs. Patients: Evaluating LLMs against Mental Health Professionals on Personality Disorder Diagnosis through First-Person Narratives
[Quick Read]: This paper evaluates the ability of generative AI to diagnose personality disorders (Borderline, BPD, and Narcissistic, NPD) in mental-health self-assessment settings, focusing on how state-of-the-art LLMs compare with clinical professionals on Polish-language first-person narratives. The key to the solution is a direct comparison: real patients' first-person autobiographical accounts are diagnosed by both leading LLMs and mental health professionals, quantifying diagnostic accuracy, F1 scores, and explanation strategies. The study finds that although the top models outperform human experts on BPD recognition, they severely underdiagnose NPD (F1 of 6.7 vs. 50.0), reflecting a reluctance toward the value-laden term "narcissism" and exposing critical reliability and bias risks.
Link: https://arxiv.org/abs/2512.20298
Authors: Karolina Drożdż,Kacper Dudzic,Anna Sterna,Marcin Moskalewicz
Institutions: IDEAS Research Institute; Adam Mickiewicz University; AMU Center for Artificial Intelligence; Poznań University of Medical Sciences; Maria Curie-Skłodowska University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Growing reliance on LLMs for psychiatric self-assessment raises questions about their ability to interpret qualitative patient narratives. We present the first direct comparison between state-of-the-art LLMs and mental health professionals in diagnosing Borderline (BPD) and Narcissistic (NPD) Personality Disorders utilizing Polish-language first-person autobiographical accounts. We show that the top-performing Gemini Pro models surpassed human professionals in overall diagnostic accuracy by 21.91 percentage points (65.48% vs. 43.57%). While both models and human experts excelled at identifying BPD (F1 = 83.4 and F1 = 80.0, respectively), models severely underdiagnosed NPD (F1 = 6.7 vs. 50.0), showing a reluctance toward the value-laden term "narcissism." Qualitatively, models provided confident, elaborate justifications focused on patterns and formal categories, while human experts remained concise and cautious, emphasizing the patient's sense of self and temporal experience. Our findings demonstrate that while LLMs are highly competent at interpreting complex first-person clinical data, they remain subject to critical reliability and bias issues.
[NLP-13] AprielGuard
[Quick Read]: This paper addresses the dual safety challenge facing LLMs in conversational and agentic settings: detecting harmful content (e.g., toxicity, bias) on one hand and resisting adversarial attacks (e.g., prompt injection, jailbreaks) on the other. Existing moderation tools typically treat the two risk classes as separate problems, limiting robustness and generalizability. The key to the solution is AprielGuard, an 8B-parameter unified safeguard model built on a single taxonomy and learning framework covering both risk classes, trained on a diverse dataset spanning standalone prompts, multi-turn conversations, and agentic workflows, and augmented with structured reasoning traces for interpretability. Experiments show AprielGuard outperforms existing open-source guardrails such as Llama-Guard and Granite Guardian across public and proprietary benchmarks, particularly on multi-step, reasoning-intensive tasks.
Link: https://arxiv.org/abs/2512.20293
Authors: Jaykumar Kasundra,Anjaneya Praharaj,Sourabh Surana,Lakshmi Sirisha Chodisetty,Sourav Sharma,Abhigya Verma,Abhishek Bhardwaj,Debasish Kanhar,Aakash Bhagat,Khalil Slimi,Seganrasan Subramanian,Sathwik Tejaswi Madhusudhan,Ranga Prasad Chenna,Srinivas Sunkara
Institutions: unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Safeguarding large language models (LLMs) against unsafe or adversarial behavior is critical as they are increasingly deployed in conversational and agentic settings. Existing moderation tools often treat safety risks (e.g. toxicity, bias) and adversarial threats (e.g. prompt injections, jailbreaks) as separate problems, limiting their robustness and generalizability. We introduce AprielGuard, an 8B parameter safeguard model that unifies these dimensions within a single taxonomy and learning framework. AprielGuard is trained on a diverse mix of open and synthetic data covering standalone prompts, multi-turn conversations, and agentic workflows, augmented with structured reasoning traces to improve interpretability. Across multiple public and proprietary benchmarks, AprielGuard achieves strong performance in detecting harmful content and adversarial manipulations, outperforming existing open-source guardrails such as Llama-Guard and Granite Guardian, particularly in multi-step and reasoning-intensive scenarios. By releasing the model, we aim to advance transparent and reproducible research on reliable safeguards for LLMs.
[NLP-14] SlideTailor: Personalized Presentation Slide Generation for Scientific Papers AAAI2026
[Quick Read]: This paper addresses the mismatch between automatically generated slides and individual user preferences, since existing under-specified formulations struggle to produce satisfying slides without explicit descriptions of user needs. The key to the solution is SlideTailor, a human behavior-inspired agentic framework that implicitly captures and generalizes user preferences from a user-provided paper-slides example pair and a visual template (no detailed textual description required), enabling personalization of both content and visual style. A chain-of-speech mechanism further aligns slide content with a planned oral narration, significantly improving slide quality and enabling downstream applications such as video presentations.
Link: https://arxiv.org/abs/2512.20292
Authors: Wenzheng Zeng,Mingyu Ouyang,Langyuan Cui,Hwee Tou Ng
Institutions: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: AAAI 2026 (with appendix)
Abstract:Automatic presentation slide generation can greatly streamline content creation. However, since preferences of each user may vary, existing under-specified formulations often lead to suboptimal results that fail to align with individual user needs. We introduce a novel task that conditions paper-to-slides generation on user-specified preferences. We propose a human behavior-inspired agentic framework, SlideTailor, that progressively generates editable slides in a user-aligned manner. Instead of requiring users to write their preferences in detailed textual form, our system only asks for a paper-slides example pair and a visual template - natural and easy-to-provide artifacts that implicitly encode rich user preferences across content and visual style. Despite the implicit and unlabeled nature of these inputs, our framework effectively distills and generalizes the preferences to guide customized slide generation. We also introduce a novel chain-of-speech mechanism to align slide content with planned oral narration. Such a design significantly enhances the quality of generated slides and enables downstream applications like video presentations. To support this new task, we construct a benchmark dataset that captures diverse user preferences, with carefully designed interpretable metrics for robust evaluation. Extensive experiments demonstrate the effectiveness of our framework.
[NLP-15] Corpus of Cross-lingual Dialogues with Minutes and Detection of Misunderstandings
[Quick Read]: This paper addresses the lack of realistic, diverse corpora for evaluating automatic speech translation in cross-lingual meetings, as well as the detection of misunderstandings in cross-lingual dialogue. The core of the solution is a corpus of 5 hours of multilingual speech recordings with ASR and manually corrected transcripts in 12 original languages, plus automatic and human-corrected English translations, together with written meeting summaries (minutes) for cross-lingual summarization research. The authors also propose and implement LLM-based automatic detection of misunderstandings, validated against manual annotations: the Gemini model identifies text spans containing misunderstandings with 77% recall and 47% precision, providing both a quantitative tool and a data foundation for assessing cross-lingual communication quality.
Link: https://arxiv.org/abs/2512.20204
Authors: Marko Čechovič,Natália Komorníková,Dominik Macháček,Ondřej Bojar
Institutions: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 2 figures, 6 tables, published as a conference paper in Text, Speech, and Dialogue 28th International Conference, TSD 2025, Erlangen, Germany, August 25-28, 2025, Proceedings, Part II. This version published here on arXiv.org is before review comments and seedings of the TSD conference staff
Abstract:Speech processing and translation technology have the potential to facilitate meetings of individuals who do not share any common language. To evaluate automatic systems for such a task, a versatile and realistic evaluation corpus is needed. Therefore, we create and present a corpus of cross-lingual dialogues between individuals without a common language who were facilitated by automatic simultaneous speech translation. The corpus consists of 5 hours of speech recordings with ASR and gold transcripts in 12 original languages and automatic and corrected translations into English. For the purposes of research into cross-lingual summarization, our corpus also includes written summaries (minutes) of the meetings. Moreover, we propose automatic detection of misunderstandings. For an overview of this task and its complexity, we attempt to quantify misunderstandings in cross-lingual meetings. We annotate misunderstandings manually and also test the ability of current large language models to detect them automatically. The results show that the Gemini model is able to identify text spans with misunderstandings with recall of 77% and precision of 47%.
Journal reference: Text, Speech, and Dialogue 28th International Conference, TSD 2025, Erlangen, Germany, August 25-28, 2025, Proceedings, Part II, pp. 301-312. DOI: https://doi.org/10.1007/978-3-032-02551-7_26
[NLP-16] FaithLens: Detecting and Explaining Faithfulness Hallucination
[Quick Read]: This paper addresses faithfulness hallucination in LLM outputs, which undermines the reliability of real-world applications such as retrieval-augmented generation and summarization. The key to the solution is FaithLens, an efficient detection model built on two ideas: first, training data with explanations is synthesized via advanced LLMs and passed through a strict filtering strategy to ensure label correctness, explanation quality, and data diversity; second, a two-stage optimization is applied, fine-tuning on the curated data as a cold start and then refining with rule-based reinforcement learning whose rewards cover both prediction correctness and explanation quality. Experiments on 12 tasks show the 8B-parameter FaithLens outperforms advanced models such as GPT-4.1 and o3 while producing high-quality explanations, striking a strong balance among trustworthiness, efficiency, and effectiveness.
Link: https://arxiv.org/abs/2512.20182
Authors: Shuzheng Si,Qingyi Wang,Haozhe Zhao,Yuzhuo Bai,Guanqiao Chen,Kangyang Luo,Gang Chen,Fanchao Qi,Minjia Zhang,Baobao Chang,Maosong Sun
Institutions: Tsinghua University; DeepLang AI; Fudan University; University of Illinois Urbana-Champaign; Peking University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recognizing whether outputs from large language models (LLMs) contain faithfulness hallucination is crucial for real-world applications, e.g., retrieval-augmented generation and summarization. In this paper, we introduce FaithLens, a cost-efficient and effective faithfulness hallucination detection model that can jointly provide binary predictions and corresponding explanations to improve trustworthiness. To achieve this, we first synthesize training data with explanations via advanced LLMs and apply a well-defined data filtering strategy to ensure label correctness, explanation quality, and data diversity. Subsequently, we fine-tune the model on these well-curated training data as a cold start and further optimize it with rule-based reinforcement learning, using rewards for both prediction correctness and explanation quality. Results on 12 diverse tasks show that the 8B-parameter FaithLens outperforms advanced models such as GPT-4.1 and o3. Also, FaithLens can produce high-quality explanations, delivering a distinctive balance of trustworthiness, efficiency, and effectiveness.
[NLP-17] Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark CVPR2025
[Quick Read]: This paper addresses the inability of existing document image retrieval (DIR) methods to handle fine-grained semantic text queries in realistic scenarios. Mainstream DIR relies on image queries and retrieves within coarse semantic categories (e.g., newspapers or receipts), which does not match real-world use where users describe the desired document in natural language. The key to the solution is a new Natural Language-based Document Image Retrieval (NL-DIR) benchmark of 41K authentic document images, each paired with five high-quality, fine-grained semantic queries generated by large language models and manually verified. The paper further evaluates mainstream contrastive vision-language models and OCR-free visual document understanding (VDU) models under zero-shot and fine-tuned settings, and investigates a two-stage retrieval method that balances efficiency and accuracy, aiming to advance natural language-driven document image retrieval in the VDU community.
Link: https://arxiv.org/abs/2512.20174
Authors: Hao Guo,Xugong Qin,Jun Jie Ou Yang,Peng Zhang,Gangyan Zeng,Yubo Li,Hailun Lin
Institutions: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Science and Engineering, Nanjing University of Science and Technology; State Key Laboratory of Cyberspace Security Defense; School of Cyber Security, University of Chinese Academy of Sciences; University of Southern California; Laboratory for Advanced Computing and Intelligence Engineering
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: CVPR 2025
Abstract:Document image retrieval (DIR) aims to retrieve document images from a gallery according to a given query. Existing DIR methods are primarily based on image queries that retrieve documents within the same coarse semantic category, e.g., newspapers or receipts. However, these methods struggle to effectively retrieve document images in real-world scenarios where textual queries with fine-grained semantics are usually provided. To bridge this gap, we introduce a new Natural Language-based Document Image Retrieval (NL-DIR) benchmark with corresponding evaluation metrics. In this work, natural language descriptions serve as semantically rich queries for the DIR task. The NL-DIR dataset contains 41K authentic document images, each paired with five high-quality, fine-grained semantic queries generated and evaluated through large language models in conjunction with manual verification. We perform zero-shot and fine-tuning evaluations of existing mainstream contrastive vision-language models and OCR-free visual document understanding (VDU) models. A two-stage retrieval method is further investigated for performance improvement while achieving both time and space efficiency. We hope the proposed NL-DIR benchmark can bring new opportunities and facilitate research for the VDU community. Datasets and codes will be publicly available at this http URL.
[NLP-18] Learning to Reason in LLMs by Expectation Maximization
[Quick Read]: This paper addresses the core question of how LLMs can effectively learn to generate rationales that improve answer accuracy on reasoning tasks. The key to the solution is to formalize reasoning as a latent variable model and derive an expectation-maximization (EM) objective for learning to reason, connecting classical EM with modern reward-based optimization. The central challenge is designing a sampling distribution that produces rationales justifying correct answers; the authors compare several schemes (rejection sampling with a budget, the self-taught reasoner STaR, and prompt posterior sampling PPS, which keeps only the rationalization stage of STaR) and find that PPS, despite its simplicity, clearly outperforms the alternatives, indicating that the sampling strategy for high-quality rationales is key to improving reasoning.
Link: https://arxiv.org/abs/2512.20169
Authors: Junghyun Lee,Branislav Kveton,Sunav Choudhary,Subhojyoti Mukherjee,Anup Rao,Ryan A. Rossi,Alexa Siu
Institutions: KAIST; Adobe Research
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 12 pages, 3 figures, 1 table
Abstract:Large language models (LLMs) solve reasoning problems by first generating a rationale and then answering. We formalize reasoning as a latent variable model and derive an expectation-maximization (EM) objective for learning to reason. This view connects EM and modern reward-based optimization, and shows that the main challenge lies in designing a sampling distribution that generates rationales that justify correct answers. We instantiate and compare several sampling schemes: rejection sampling with a budget, self-taught reasoner (STaR), and prompt posterior sampling (PPS), which only keeps the rationalization stage of STaR. Our experiments on the ARC, MMLU, and OpenBookQA datasets with the Llama and Qwen models show that the sampling scheme can significantly affect the accuracy of learned reasoning models. Despite its simplicity, we observe that PPS outperforms the other sampling schemes.
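The latent-variable view admits a compact statement. A sketch of the standard lower bound under the paper's setup, writing x for the question, z for the rationale, and y for the answer (notation assumed here, not copied from the paper):

```latex
\log p_\theta(y \mid x)
  = \log \sum_{z} p_\theta(z \mid x)\, p_\theta(y \mid x, z)
  \;\geq\; \mathbb{E}_{q(z)}\big[\log p_\theta(z \mid x) + \log p_\theta(y \mid x, z) - \log q(z)\big].
```

In this reading, the E-step chooses q(z) to approximate the posterior over rationales given a correct answer, which is what the compared sampling schemes (rejection sampling, STaR, PPS) do, and the M-step fine-tunes the model on the sampled rationales.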
[NLP-19] AI Security Beyond Core Domains: Resume Screening as a Case Study of Adversarial Vulnerabilities in Specialized LLM Applications
[Quick Read]: This paper addresses the vulnerability of LLMs in automated tasks to "adversarial instructions" hidden in input data (such as resumes or code) that cause them to deviate from the intended task, focusing on applications like resume screening where mature defenses are lacking. The key to the solution is FIDS (Foreign Instruction Detection through Separation), a training-time defense based on LoRA fine-tuning that improves detection of covert attacks by separating foreign instruction signals. Experiments show FIDS achieves a better balance between attack mitigation (15.4%) and false-rejection increase (10.4%) than prompt-based inference-time defenses, and combining both yields a 26.3% attack reduction, demonstrating the advantage of training-time defenses in both security and utility preservation.
Link: https://arxiv.org/abs/2512.20164
Authors: Honglin Mu,Jinghao Liu,Kaiyang Wan,Rui Xing,Xiuying Chen,Timothy Baldwin,Wanxiang Che
Institutions: Harbin Institute of Technology; University of Washington; Beijing University of Posts and Telecommunications; University of Melbourne; Mohamed bin Zayed University of Artificial Intelligence
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) excel at text comprehension and generation, making them ideal for automated tasks like code review and content moderation. However, our research identifies a vulnerability: LLMs can be manipulated by “adversarial instructions” hidden in input data, such as resumes or code, causing them to deviate from their intended task. Notably, while defenses may exist for mature domains such as code review, they are often absent in other common applications such as resume screening and peer review. This paper introduces a benchmark to assess this vulnerability in resume screening, revealing attack success rates exceeding 80% for certain attack types. We evaluate two defense mechanisms: prompt-based defenses achieve 10.1% attack reduction with 12.5% false rejection increase, while our proposed FIDS (Foreign Instruction Detection through Separation) using LoRA adaptation achieves 15.4% attack reduction with 10.4% false rejection increase. The combined approach provides 26.3% attack reduction, demonstrating that training-time defenses outperform inference-time mitigations in both security and utility preservation.
[NLP-20] Fun-Audio-Chat Technical Report
[Quick Read]: This paper tackles three core problems facing joint speech-text models on the path to seamless voice interaction: the temporal-resolution mismatch between speech tokens (25Hz) and text tokens (~3Hz), which dilutes semantics and inflates compute; catastrophic forgetting of the text LLM's knowledge during training; and insufficient speech understanding, reasoning, and generation. The key lies in two innovations: Dual-Resolution Speech Representations (DRSR), where the shared LLM processes audio at an efficient 5Hz via token grouping while a Speech Refined Head outputs high-quality 25Hz speech tokens, cutting GPU consumption by roughly 50% without sacrificing performance; and Core-Cocktail Training, a two-stage fine-tuning scheme with intermediate merging that mitigates forgetting of text-LLM knowledge. Multi-task DPO training then further strengthens audio understanding, instruction following, and voice empathy, achieving complementary retention and enhancement of knowledge across the text and speech modalities.
Link: https://arxiv.org/abs/2512.20156
Authors: Qian Chen,Luyao Cheng,Chong Deng,Xiangang Li,Jiaqing Liu,Chao-Hong Tan,Wen Wang,Junhao Xu,Jieping Ye,Qinglin Zhang,Qiquan Zhang,Jingren Zhou
Institutions: Alibaba Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 21 pages, this https URL
Abstract:Recent advancements in joint speech-text models show great potential for seamless voice interactions. However, existing models face critical challenges: temporal resolution mismatch between speech tokens (25Hz) and text tokens (~3Hz) dilutes semantic information, incurs high computational costs, and causes catastrophic forgetting of text LLM knowledge. We introduce Fun-Audio-Chat, a Large Audio Language Model addressing these limitations via two innovations from our previous work DrVoice. First, Dual-Resolution Speech Representations (DRSR): the Shared LLM processes audio at efficient 5Hz (via token grouping), while the Speech Refined Head generates high-quality tokens at 25Hz, balancing efficiency (~50% GPU reduction) and quality. Second, Core-Cocktail Training, a two-stage fine-tuning with intermediate merging that mitigates catastrophic forgetting. We then apply Multi-Task DPO Training to enhance robustness, audio understanding, instruction-following and voice empathy. This multi-stage post-training enables Fun-Audio-Chat to retain text LLM knowledge while gaining powerful audio understanding, reasoning, and generation. Unlike recent LALMs requiring large-scale audio-text pre-training, Fun-Audio-Chat leverages pre-trained models and extensive post-training. Fun-Audio-Chat 8B and MoE 30B-A3B achieve competitive performance on Speech-to-Text and Speech-to-Speech tasks, ranking top among similar-scale models on Spoken QA benchmarks. They also achieve competitive to superior performance on Audio Understanding, Speech Function Calling, Instruction-Following and Voice Empathy. We develop Fun-Audio-Chat-Duplex, a full-duplex variant with strong performance on Spoken QA and full-duplex interactions. We open-source Fun-Audio-Chat-8B with training and inference code, and provide an interactive demo.
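The 25Hz-to-5Hz token grouping in DRSR amounts to pooling non-overlapping groups of five frames. A minimal sketch assuming mean-pooling over embedded speech tokens; the actual grouping operator in Fun-Audio-Chat may differ:

```python
import torch

def group_tokens(speech_tokens, group=5):
    """Reduce a 25Hz speech-token embedding sequence to 5Hz for the
    shared LLM by pooling non-overlapping groups of `group` frames."""
    b, t, d = speech_tokens.shape
    t_trim = (t // group) * group          # drop a ragged tail, if any
    grouped = speech_tokens[:, :t_trim].reshape(b, t_trim // group, group, d)
    return grouped.mean(dim=2)             # [b, t/5, d] at 5Hz

x = torch.randn(2, 250, 1024)              # 10s of audio at 25Hz
print(group_tokens(x).shape)               # torch.Size([2, 50, 1024])
```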
[NLP-21] Retrieval-augmented Prompt Learning for Pre-trained Foundation Models
[Quick Read]: This paper addresses the generalization-stability problem of prompt learning for pre-trained foundation models (PFMs) under a purely parametric learning paradigm, which in few-shot settings tends to overfit shallow patterns and underuse atypical instances. The key to the solution is RetroPrompt, which decouples knowledge from rote memorization: a publicly accessible knowledge base is built from the training data, and a retrieval mechanism is integrated into the input, training, and inference stages, letting the model actively retrieve relevant contextual information from the corpus to enrich the available cues, thereby reducing reliance on memorization and improving generalization.
Link: https://arxiv.org/abs/2512.20145
Authors: Xiang Chen,Yixin Ou,Quan Feng,Lei Li,Piji Li,Haibo Ye,Sheng-Jun Huang,Shuofei Qiao,Shumin Deng,Huajun Chen,Ningyu Zhang
Institutions: Nanjing University of Aeronautics and Astronautics; Zhejiang University; Hunan Vanguard Group Corporation Limited; National University of Singapore
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: IEEE/ACM Transactions on Audio, Speech and Language Processing
Abstract:Pre-trained foundation models (PFMs) have become essential for facilitating large-scale multimodal learning. Researchers have effectively employed the "pre-train, prompt, and predict" paradigm through prompt learning to induce improved few-shot performance. However, prompt learning approaches for PFMs still follow a parametric learning paradigm. As such, the stability of generalization in memorization and rote learning can be compromised. More specifically, conventional prompt learning might face difficulties in fully utilizing atypical instances and avoiding overfitting to shallow patterns with limited data during the process of fully-supervised training. To overcome these constraints, we present our approach, named RetroPrompt, which aims to achieve a balance between memorization and generalization by decoupling knowledge from mere memorization. Unlike traditional prompting methods, RetroPrompt leverages a publicly accessible knowledge base generated from the training data and incorporates a retrieval mechanism throughout the input, training, and inference stages. This enables the model to actively retrieve relevant contextual information from the corpus, thereby enhancing the available cues. We conduct comprehensive experiments on a variety of datasets across natural language processing and computer vision tasks to demonstrate the superior performance of our proposed approach, RetroPrompt, in both zero-shot and few-shot scenarios. Through detailed analysis of memorization patterns, we observe that RetroPrompt effectively reduces the reliance on rote memorization, leading to enhanced generalization.
[NLP-22] Multi-hop Reasoning via Early Knowledge Alignment
[Quick Read]: This paper addresses the inefficiency and cascading reasoning-chain errors of iterative retrieval-augmented generation (RAG) systems on complex multi-hop questions, in particular the failure of existing methods to account for the available retrieval corpus during planning, which leads to wasted retrievals and inefficient exploration. The key to the solution is Early Knowledge Alignment (EKA), a module that aligns LLMs with the retrieval collection before the planning stage of an iterative RAG system by injecting contextually relevant retrieved knowledge, establishing a firmer reasoning foundation. EKA is a training-free, inference-time strategy that markedly improves retrieval precision and curbs error propagation; an entropy-based analysis shows it suppresses unnecessary exploration and focuses the model on relevant knowledge subsets, improving both performance and efficiency.
Link: https://arxiv.org/abs/2512.20144
Authors: Yuxin Wang,Shicheng Fang,Bo Wang,Qi Luo,Xuanjing Huang,Yining Zheng,Xipeng Qiu
Institutions: Fudan University; Shanghai SII
Categories: Computation and Language (cs.CL)
Comments: 16 pages
Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for Large Language Models (LLMs) to address knowledge-intensive queries requiring domain-specific or up-to-date information. To handle complex multi-hop questions that are challenging for single-step retrieval, iterative RAG approaches incorporating reinforcement learning have been proposed. However, existing iterative RAG systems typically plan to decompose questions without leveraging information about the available retrieval corpus, leading to inefficient retrieval and reasoning chains that cascade into suboptimal performance. In this paper, we introduce Early Knowledge Alignment (EKA), a simple but effective module that aligns LLMs with the retrieval set before planning in iterative RAG systems with contextually relevant retrieved knowledge. Extensive experiments on six standard RAG datasets demonstrate that by establishing a stronger reasoning foundation, EKA significantly improves retrieval precision, reduces cascading errors, and enhances both performance and efficiency. Our analysis from an entropy perspective demonstrates that incorporating early knowledge reduces unnecessary exploration during the reasoning process, enabling the model to focus more effectively on relevant information subsets. Moreover, EKA proves effective as a versatile, training-free inference strategy that scales seamlessly to large models. Generalization tests across diverse datasets and retrieval corpora confirm the robustness of our approach. Overall, EKA advances the state-of-the-art in iterative RAG systems while illuminating the critical interplay between structured reasoning and efficient exploration in reinforcement learning-augmented frameworks. The code is released on GitHub at this https URL.
[NLP-23] M3KG-RAG : Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation
[Quick Read]: This paper addresses two challenges facing multimodal retrieval-augmented generation (RAG) in the audio-visual domain: existing multimodal knowledge graphs (MMKGs) offer limited modality coverage and lack the multi-hop connectivity needed for deep reasoning; and retrieval based purely on similarity in a shared multimodal embedding space cannot filter out query-irrelevant or redundant knowledge, hurting answer accuracy and faithfulness. The key to the solution is the M³KG-RAG framework, with two core innovations: a lightweight multi-agent pipeline that builds a multi-hop multimodal knowledge graph (M³KG) of context-enriched triplets supporting modality-wise retrieval, and GRASP (Grounded Retrieval And Selective Pruning), which grounds entities precisely to the query, evaluates answer-supporting relevance, and prunes redundant content, retaining only the knowledge essential for response generation. Experiments show substantial gains in MLLMs' multimodal reasoning and grounding.
Link: https://arxiv.org/abs/2512.20136
Authors: Hyeongcheol Park,Jiyoung Seo,Jaewon Mun,Hogun Park,Wonmin Byeon,Sung June Kim,Hyeonsoo Im,JeungSub Lee,Sangpil Kim
Institutions: Korea University; Sungkyunkwan University; NVIDIA; Hanhwa Systems
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Retrieval-Augmented Generation (RAG) has recently been extended to multimodal settings, connecting multimodal large language models (MLLMs) with vast corpora of external knowledge such as multimodal knowledge graphs (MMKGs). Despite their recent success, multimodal RAG in the audio-visual domain remains challenging due to 1) limited modality coverage and multi-hop connectivity of existing MMKGs, and 2) retrieval based solely on similarity in a shared multimodal embedding space, which fails to filter out off-topic or redundant knowledge. To address these limitations, we propose M³KG-RAG, a Multi-hop Multimodal Knowledge Graph-enhanced RAG that retrieves query-aligned audio-visual knowledge from MMKGs, improving reasoning depth and answer faithfulness in MLLMs. Specifically, we devise a lightweight multi-agent pipeline to construct a multi-hop MMKG (M³KG), which contains context-enriched triplets of multimodal entities, enabling modality-wise retrieval based on input queries. Furthermore, we introduce GRASP (Grounded Retrieval And Selective Pruning), which ensures precise entity grounding to the query, evaluates answer-supporting relevance, and prunes redundant context to retain only knowledge essential for response generation. Extensive experiments across diverse multimodal benchmarks demonstrate that M³KG-RAG significantly enhances MLLMs' multimodal reasoning and grounding over existing approaches.
[NLP-24] ABBEL: LLM Agents Acting through Belief Bottlenecks Expressed in Language
[Quick Read]: This paper addresses the computational impracticality of keeping full interaction histories in context for long sequential decision-making tasks, i.e., how to maintain efficient inference while retaining task-critical information. The core of the solution is a general framework, Acting through Belief Bottlenecks Expressed in Language (ABBEL), which compresses the multi-step interaction history into a natural-language belief state representing the agent's knowledge of task-relevant unknowns. Under ABBEL, at each step the agent first updates its prior belief with the latest observation to form a posterior, then selects an action based only on that posterior, achieving near-constant memory use and interpretable beliefs. To improve performance further, reinforcement learning (RL) post-training is applied with belief grading and length penalties to optimize belief quality and compression; experiments show the approach can surpass the full-context setting while using less memory.
Link: https://arxiv.org/abs/2512.20111
Authors: Aly Lidayan,Jakob Bjorner,Satvik Golechha,Kartik Goyal,Alane Suhr
Institutions: UC Berkeley; Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:As the length of sequential decision-making tasks increases, it becomes computationally impractical to keep full interaction histories in context. We introduce a general framework for LLM agents to maintain concise contexts through multi-step interaction: Acting through Belief Bottlenecks Expressed in Language (ABBEL), and methods to further improve ABBEL agents with RL post-training. ABBEL replaces long multi-step interaction history by a belief state, i.e., a natural language summary of what has been discovered about task-relevant unknowns. Under ABBEL, at each step the agent first updates a prior belief with the most recent observation from the environment to form a posterior belief, then uses only the posterior to select an action. We systematically evaluate frontier models under ABBEL across six diverse multi-step environments, finding that ABBEL supports generating interpretable beliefs while maintaining near-constant memory use over interaction steps. However, bottleneck approaches are generally prone to error propagation, which we observe causing inferior performance when compared to the full context setting due to errors in belief updating. Therefore, we train LLMs to generate and act on beliefs within the ABBEL framework via reinforcement learning (RL). We experiment with belief grading, to reward higher quality beliefs, as well as belief length penalties to reward more compressed beliefs. Our experiments demonstrate the ability of RL to improve ABBEL’s performance beyond the full context setting, while using less memory than contemporaneous approaches.
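The belief-bottleneck loop itself is simple to write down. A minimal sketch where `llm` is a hypothetical text-completion function and `env` follows a reset/step interface returning (observation, done); the prompts are paraphrases, not the paper's:

```python
def abbel_episode(llm, env, max_steps=50):
    """Run one ABBEL episode: the agent carries only a natural-language
    belief state between steps, never the full interaction history."""
    belief = "Nothing is known yet."
    obs = env.reset()
    for _ in range(max_steps):
        # Update step: fold the newest observation into the prior belief.
        belief = llm(f"Prior belief: {belief}\nNew observation: {obs}\n"
                     "Write an updated summary of what is known:")
        # Action step: condition only on the posterior belief.
        action = llm(f"Belief: {belief}\nChoose the next action:")
        obs, done = env.step(action)
        if done:
            break
    return belief
```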
[NLP-25] A Novel Graph-Sequence Learning Model for Inductive Text Classification
[Quick Read]: This paper addresses two key limitations of GNN-based text classification: failure to fully capture the diverse structural information between word pairs (co-occurrence, syntactic, and semantic relations), and neglect of sequential information in text-graph learning, which prevents handling texts containing new words and relations. The key to the solution is a novel graph-sequence learning model (TextGSL) whose core innovations are: constructing a single text-level graph per text with edge types defined by multiple relation types, and designing an adaptive multi-edge message-passing paradigm to aggregate diverse word-pair structural information. Transformer layers are additionally incorporated to capture sequential dependencies, strengthening generalization to new words and relations and yielding more discriminative text representations.
Link: https://arxiv.org/abs/2512.20097
Authors: Zuo Wang,Ye Yuan
Institutions: unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Text classification plays an important role in various downstream text-related tasks, such as sentiment analysis, fake news detection, and public opinion analysis. Recently, text classification based on Graph Neural Networks (GNNs) has made significant progress due to their strong capabilities of structural relationship learning. However, these approaches still face two major limitations. First, these approaches fail to fully consider the diverse structural information across word pairs, e.g., co-occurrence, syntax, and semantics. Furthermore, they neglect sequence information in the text graph structure information learning module and can not classify texts with new words and relations. In this paper, we propose a Novel Graph-Sequence Learning Model for Inductive Text Classification (TextGSL) to address the previously mentioned issues. More specifically, we construct a single text-level graph for all words in each text and establish different edge types based on the diverse relationships between word pairs. Building upon this, we design an adaptive multi-edge message-passing paradigm to aggregate diverse structural information between word pairs. Additionally, sequential information among text data can be captured by the proposed TextGSL through the incorporation of Transformer layers. Therefore, TextGSL can learn more discriminative text representations. TextGSL has been comprehensively compared with several strong baselines. The experimental results on diverse benchmarking datasets demonstrate that TextGSL outperforms these baselines in terms of accuracy.
[NLP-26] Memory-T1: Reinforcement Learning for Temporal Reasoning in Multi-session Agents
[Quick Read]: This paper addresses the weakness of temporal reasoning over long multi-session dialogues: as histories grow and accumulate noise, existing long-context models struggle to pinpoint temporally pertinent information, severely degrading reasoning performance. The key to the solution is Memory-T1, a framework that learns a time-aware memory selection policy via reinforcement learning (RL): a coarse-to-fine strategy first prunes the dialogue history into a candidate set using temporal and relevance filters, after which an RL agent pinpoints the critical evidence sessions. The multi-level reward design is central, especially the temporal consistency reward, which provides a dense signal at both the session level (chronological proximity) and the utterance level (chronological fidelity), effectively resolving subtle temporal ambiguities and markedly improving temporal reasoning.
Link: https://arxiv.org/abs/2512.20092
Authors: Yiming Du,Baojun Wang,Yifan Xiang,Zhaowei Wang,Wenyu Huang,Boyang Xue,Bin Liang,Xingshan Zeng,Fei Mi,Haoli Bai,Lifeng Shang,Jeff Z. Pan,Yuxin Jiang,Kam-Fai Wong
Institutions: The Chinese University of Hong Kong; Huawei Technologies Co., Ltd; HKUST; The University of Edinburgh
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Temporal reasoning over long, multi-session dialogues is a critical capability for conversational agents. However, existing works and our pilot study have shown that as dialogue histories grow in length and accumulate noise, current long-context models struggle to accurately identify temporally pertinent information, significantly impairing reasoning performance. To address this, we introduce Memory-T1, a framework that learns a time-aware memory selection policy using reinforcement learning (RL). It employs a coarse-to-fine strategy, first pruning the dialogue history into a candidate set using temporal and relevance filters, followed by an RL agent that selects the precise evidence sessions. The RL training is guided by a multi-level reward function optimizing (i) answer accuracy, (ii) evidence grounding, and (iii) temporal consistency. In particular, the temporal consistency reward provides a dense signal by evaluating alignment with the query time scope at both the session-level (chronological proximity) and the utterance-level (chronological fidelity), enabling the agent to resolve subtle chronological ambiguities. On the Time-Dialog benchmark, Memory-T1 boosts a 7B model to an overall score of 67.0%, establishing a new state-of-the-art performance for open-source models and outperforming a 14B baseline by 10.2%. Ablation studies show temporal consistency and evidence grounding rewards jointly contribute to a 15.0% performance gain. Moreover, Memory-T1 maintains robustness up to 128k tokens, where baseline models collapse, proving effectiveness against noise in extensive dialogue histories. The code and datasets are publicly available at this https URL
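A minimal sketch of how the three reward levels might combine into a scalar RL signal; the weights and the session- and utterance-level temporal scores are placeholders standing in for the paper's actual definitions:

```python
def memory_reward(pred, gold, selected, evidence, t_session, t_utterance,
                  w=(1.0, 0.5, 0.5)):
    """Combine (i) answer accuracy, (ii) evidence grounding, and
    (iii) temporal consistency into a single scalar reward."""
    accuracy = float(pred == gold)
    # Grounding: F1 between selected sessions and gold evidence sessions.
    hits = len(set(selected) & set(evidence))
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(evidence) if evidence else 0.0
    grounding = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    # Temporal consistency: session-level proximity + utterance-level fidelity.
    temporal = 0.5 * t_session + 0.5 * t_utterance
    return w[0] * accuracy + w[1] * grounding + w[2] * temporal
```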
zh
[NLP-27] Reason2Decide: Rationale-Driven Multi-Task Learning
【Quick Read】: This paper addresses the difficulty of balancing predictive accuracy and explanation consistency in clinical decision support with generative AI, in particular the explanation-prediction misalignment caused by exposure bias. The key is Reason2Decide, a two-stage training framework: Stage-1 pretrains on rationale generation; Stage-2 jointly optimizes label prediction and rationale generation with scheduled sampling, gradually shifting from conditioning on gold labels to the model's own predictions, which mitigates exposure bias and fuses the two tasks. Validated on several medical datasets, the method outperforms other fine-tuning baselines even when Stage-1 uses only LLM-generated rationales, sharply reducing reliance on human annotation, and it retains high performance with models 40x smaller than contemporary foundation models, improving the deployability and explainability of clinical reasoning in resource-constrained settings.
Link: https://arxiv.org/abs/2512.20074
Authors: H M Quamran Hasan,Housam Khalifa Bashier,Jiayi Dai,Mi-Young Kim,Randy Goebel
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Despite the wide adoption of Large Language Models (LLMs), clinical decision support systems face a critical challenge: achieving high predictive accuracy while generating explanations aligned with the predictions. Current approaches suffer from exposure bias leading to misaligned explanations. We propose Reason2Decide, a two-stage training framework that addresses key challenges in self-rationalization, including exposure bias and task separation. In Stage-1, our model is trained on rationale generation, while in Stage-2, we jointly train on label prediction and rationale generation, applying scheduled sampling to gradually transition from conditioning on gold labels to model predictions. We evaluate Reason2Decide on three medical datasets, including a proprietary triage dataset and public biomedical QA datasets. Across model sizes, Reason2Decide outperforms other fine-tuning baselines and some zero-shot LLMs in prediction (F1) and rationale fidelity (BERTScore, BLEU, LLM-as-a-Judge). In triage, Reason2Decide is rationale source-robust across LLM-generated, nurse-authored, and nurse-post-processed rationales. In our experiments, while using only LLM-generated rationales in Stage-1, Reason2Decide outperforms other fine-tuning variants. This indicates that LLM-generated rationales are suitable for pretraining models, reducing reliance on human annotations. Remarkably, Reason2Decide achieves these gains with models 40x smaller than contemporary foundation models, making clinical reasoning more accessible for resource-constrained deployments while still providing explainable decision support.
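Scheduled sampling is the mechanism doing the anti-exposure-bias work in Stage-2. A minimal sketch, assuming a linear decay schedule (the paper does not pin the schedule down here):

```python
import random

def label_for_rationale(gold_label, model_label, step, total_steps):
    # Early in training, condition rationale generation on the gold label;
    # later, increasingly on the model's own prediction.
    p_gold = max(0.0, 1.0 - step / total_steps)  # assumed linear ramp
    return gold_label if random.random() < p_gold else model_label
```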
zh
[NLP-28] Schoenfeld's Anatomy of Mathematical Reasoning by Language Models
【Quick Read】: This paper addresses the difficulty of identifying and analyzing the underlying cognitive structure and steps in the reasoning traces now exposed by large language models (LLMs), which token-level statistics alone cannot reveal. The key is to adopt Schoenfeld's Episode Theory as an inductive, intermediate-scale lens and to introduce ThinkARM (Anatomy of Reasoning in Models), a framework that explicitly abstracts reasoning traces into functional reasoning steps (Analysis, Explore, Implement, Verify, etc.), enabling structured modeling and systematic analysis of the reasoning process. The abstraction reveals reproducible thinking dynamics and structural differences between reasoning and non-reasoning models, and two diagnostic case studies confirm that exploration acts as a critical branching step associated with correctness, and that efficiency-oriented methods selectively suppress evaluative feedback steps rather than merely shortening responses.
Link: https://arxiv.org/abs/2512.19995
Authors: Ming Li,Chenrui Fan,Yize Cheng,Soheil Feizi,Tianyi Zhou
Affiliations: University of Maryland, College Park
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large language models increasingly expose reasoning traces, yet their underlying cognitive structure and steps remain difficult to identify and analyze beyond surface-level statistics. We adopt Schoenfeld’s Episode Theory as an inductive, intermediate-scale lens and introduce ThinkARM (Anatomy of Reasoning in Models), a scalable framework that explicitly abstracts reasoning traces into functional reasoning steps such as Analysis, Explore, Implement, Verify, etc. When applied to mathematical problem solving by diverse models, this abstraction reveals reproducible thinking dynamics and structural differences between reasoning and non-reasoning models, which are not apparent from token-level views. We further present two diagnostic case studies showing that exploration functions as a critical branching step associated with correctness, and that efficiency-oriented methods selectively suppress evaluative feedback steps rather than uniformly shortening responses. Together, our results demonstrate that episode-level representations make reasoning steps explicit, enabling systematic analysis of how reasoning is structured, stabilized, and altered in modern language models.
zh
[NLP-29] Bias Beneath the Tone: Empirical Characterisation of Tone Bias in LLM-Driven UX Systems
【Quick Read】: This paper studies implicit tone bias in LLM-driven dialogue systems, i.e., models sounding overly polite, cheerful, or cautious even when neutrality is expected, which shapes users' perceptions of trust, empathy, and fairness. The key is to combine controllable LLM-based dialogue synthesis with tone classification models: synthetic dialogues are weakly labeled with a pretrained DistilBERT model, and ensemble classifiers are trained to systematically detect and quantify the bias. Experiments show that even dialogues generated from neutral prompts carry a consistent tonal skew, and the approach reaches macro F1 scores up to 0.92, confirming that tone bias is measurable and matters for designing fair, trustworthy conversational AI.
Link: https://arxiv.org/abs/2512.19950
Authors: Heet Bodara,Md Masum Mushfiq,Isma Farah Siddiqui
Affiliations: Monash University
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Large Language Models are increasingly used in conversational systems such as digital personal assistants, shaping how people interact with technology through language. While their responses often sound fluent and natural, they can also carry subtle tone biases such as sounding overly polite, cheerful, or cautious even when neutrality is expected. These tendencies can influence how users perceive trust, empathy, and fairness in dialogue. In this study, we explore tone bias as a hidden behavioral trait of large language models. The novelty of this research lies in the integration of controllable large language model based dialogue synthesis with tone classification models, enabling robust and ethical emotion recognition in personal assistant interactions. We created two synthetic dialogue datasets, one generated from neutral prompts and another explicitly guided to produce positive or negative tones. Surprisingly, even the neutral set showed consistent tonal skew, suggesting that bias may stem from the model’s underlying conversational style. Using weak supervision through a pretrained DistilBERT model, we labeled tones and trained several classifiers to detect these patterns. Ensemble models achieved macro F1 scores up to 0.92, showing that tone bias is systematic, measurable, and relevant to designing fair and trustworthy conversational AI.
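The weak-supervision step can be reproduced in a few lines, with an off-the-shelf DistilBERT sentiment checkpoint standing in for the tone labeler (the paper's exact model and label set may differ):

```python
from collections import Counter
from transformers import pipeline

# Publicly available DistilBERT sentiment model, used here as a stand-in tone labeler.
clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english")
replies = [
    "Sure! I'd be absolutely delighted to help with that!",
    "The file was deleted.",
    "Unfortunately that may not work, but we can try something else.",
]
labels = [out["label"] for out in clf(replies)]
print(Counter(labels))  # a persistent skew on neutral prompts signals tone bias
```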
zh
[NLP-30] PRISM: A Personality-Driven Multi-Agent Framework for Social Media Simulation
【Quick Read】: This paper addresses the limitation of traditional agent-based models (ABMs) of opinion dynamics, whose assumption of psychological homogeneity prevents them from capturing the psychological heterogeneity behind online polarization and obscures the critical interplay between individual cognitive biases and information propagation. The key is the Personality-Refracted Intelligent Simulation Model (PRISM), a hybrid framework that couples stochastic differential equations (SDE) for continuous emotional evolution with a personality-conditional partially observable Markov decision process (PC-POMDP) for discrete decision-making. By assigning MBTI-based cognitive policies to multimodal large language model (MLLM) agents and initializing them with data-driven priors from large-scale social media datasets, PRISM achieves markedly better personality consistency, faithfully reproduces emergent phenomena such as rational suppression and affective resonance, and offers a robust tool for analyzing complex social media ecosystems.
Link: https://arxiv.org/abs/2512.19933
Authors: Zhixiang Lu,Xueyuan Deng,Yiran Liu,Yulong Li,Qiang Yan,Imran Razzak,Jionglong Su
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Traditional agent-based models (ABMs) of opinion dynamics often fail to capture the psychological heterogeneity driving online polarization due to simplistic homogeneity assumptions. This limitation obscures the critical interplay between individual cognitive biases and information propagation, thereby hindering a mechanistic understanding of how ideological divides are amplified. To address this challenge, we introduce the Personality-Refracted Intelligent Simulation Model (PRISM), a hybrid framework coupling stochastic differential equations (SDE) for continuous emotional evolution with a personality-conditional partially observable Markov decision process (PC-POMDP) for discrete decision-making. In contrast to continuous trait approaches, PRISM assigns distinct Myers-Briggs Type Indicator (MBTI) based cognitive policies to multimodal large language model (MLLM) agents, initialized via data-driven priors from large-scale social media datasets. PRISM achieves superior personality consistency aligned with human ground truth, significantly outperforming standard homogeneous and Big Five benchmarks. This framework effectively replicates emergent phenomena such as rational suppression and affective resonance, offering a robust tool for analyzing complex social media ecosystems.
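The continuous half of PRISM is an SDE over agent emotion. Below is a minimal Euler-Maruyama step for a mean-reverting (Ornstein-Uhlenbeck-style) emotion process; the functional form and parameters are illustrative, not the paper's calibrated dynamics.

```python
import numpy as np

def emotion_sde_step(e, dt=0.01, theta=0.5, mu=0.0, sigma=0.3, rng=None):
    # de = theta * (mu - e) dt + sigma dW, discretized as one step.
    if rng is None:
        rng = np.random.default_rng()
    return e + theta * (mu - e) * dt + sigma * np.sqrt(dt) * rng.normal()
```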
zh
[NLP-31] Counterfactual LLM -based Framework for Measuring Rhetorical Style
【Quick Read】: This paper aims to quantify "hype" in machine learning (ML) papers, and in particular to distinguish linguistic exaggeration from genuine research substance. The key is a counterfactual LLM-based framework: multiple LLM personas generate texts with different rhetorical styles from the same substantive content, an LLM judge performs pairwise comparisons, and the judgments are aggregated with a Bradley-Terry model, enabling large-scale, objective quantification of rhetorical style. By disentangling rhetorical style from actual research contributions, the framework provides a new measurement instrument for scientific evaluation.
Link: https://arxiv.org/abs/2512.19908
Authors: Jingyi Qiu,Hong Chen,Zongyi Li
Affiliations: University of Michigan, Ann Arbor; CSAIL; MIT
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
Abstract:The rise of AI has fueled growing concerns about "hype" in machine learning papers, yet a reliable way to quantify rhetorical style independently of substantive content has remained elusive. Because bold language can stem from either strong empirical results or mere rhetorical style, it is often difficult to distinguish between the two. To disentangle rhetorical style from substantive content, we introduce a counterfactual, LLM-based framework: multiple LLM rhetorical personas generate counterfactual writings from the same substantive content, an LLM judge compares them through pairwise evaluations, and the outcomes are aggregated using a Bradley-Terry model. Applying this method to 8,485 ICLR submissions sampled from 2017 to 2025, we generate more than 250,000 counterfactual writings and provide a large-scale quantification of rhetorical style in ML papers. We find that visionary framing significantly predicts downstream attention, including citations and media attention, even after controlling for peer-review evaluations. We also observe a sharp rise in rhetorical strength after 2023, and provide empirical evidence showing that this increase is largely driven by the adoption of LLM-based writing assistance. The reliability of our framework is validated by its robustness to the choice of personas and the high correlation between LLM judgments and human annotations. Our work demonstrates that LLMs can serve as instruments to measure and improve scientific evaluation.
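The aggregation step is a standard Bradley-Terry fit over pairwise outcomes. A self-contained sketch using the classic minorization-maximization update (the iteration count is arbitrary, and every item is assumed to win at least one comparison):

```python
import numpy as np

def bradley_terry(wins, iters=200):
    # wins[i, j] = number of pairwise judgments item i won against item j.
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        n_ij = wins + wins.T                       # comparisons per pair
        denom = n_ij / (p[:, None] + p[None, :])   # n_ij / (p_i + p_j)
        np.fill_diagonal(denom, 0.0)
        p = wins.sum(axis=1) / denom.sum(axis=1)   # MM update
        p /= p.sum()                               # fix the arbitrary scale
    return p  # larger p = stronger rhetorical style under the judge
```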
zh
[NLP-32] How well do Large Language Models Recognize Instructional Moves? Establishing Baselines for Foundation Models in Educational Discourse
【Quick Read】: This paper asks how well large language models (LLMs) recognize instructional moves in authentic classroom settings out of the box, i.e., without significant customization. The key result, from comparing six mainstream LLMs under zero-shot, one-shot, and few-shot prompting, is that few-shot prompting with comprehensive examples significantly improves recognition, with the strongest configuration reaching Cohen's Kappa = 0.58 against expert annotations; however, gains are neither uniform nor complete, performance varies considerably across instructional moves, and higher recall often comes at the cost of more false positives, indicating that prompt design can surface capability but cannot eliminate fundamental reliability constraints.
Link: https://arxiv.org/abs/2512.19903
Authors: Kirk Vanacore,Rene F. Kizilcec
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) are increasingly adopted in educational technologies for a variety of tasks, from generating instructional materials and assisting with assessment design to tutoring. While prior work has investigated how models can be adapted or optimized for specific tasks, far less is known about how well LLMs perform at interpreting authentic educational scenarios without significant customization. As LLM-based systems become widely adopted by learners and educators in everyday academic contexts, understanding their out-of-the-box capabilities is increasingly important for setting expectations and benchmarking. We compared six LLMs to estimate their baseline performance on a simple but important task: classifying instructional moves in authentic classroom transcripts. We evaluated typical prompting methods: zero-shot, one-shot, and few-shot prompting. We found that while zero-shot performance was moderate, providing comprehensive examples (few-shot prompting) significantly improved performance for state-of-the-art models, with the strongest configuration reaching Cohen’s Kappa = 0.58 against expert-coded annotations. At the same time, improvements were neither uniform nor complete: performance varied considerably by instructional move, and higher recall frequently came at the cost of increased false positives. Overall, these findings indicate that foundation models demonstrate meaningful yet limited capacity to interpret instructional discourse, with prompt design helping to surface capability but not eliminating fundamental reliability constraints.
zh
[NLP-33] HARMON-E: Hierarchical Agentic Reasoning for Multimodal Oncology Notes to Extract Structured Data
【Quick Read】: This paper addresses the difficulty of efficiently and accurately extracting structured oncology data from unstructured clinical notes in the electronic health record (EHR), a task complicated by specialized terminology, inconsistent document formats, and conflicting information across documents; existing automated approaches handle only narrow scenarios or variables and cannot synthesize data at the patient level. The key is an agentic framework built on large language models (LLMs) that decomposes the extraction into modular, adaptive tasks, combining context-sensitive retrieval with iterative synthesis. Across more than 400,000 clinical documents from 2,250 cancer patients, it achieves an average F1 of 0.93 over 103 oncology-specific variables, with critical variables such as biomarkers and medications exceeding 0.95, and raises the direct manual approval rate to 0.94, demonstrating effectiveness and scalability in real-world settings.
Link: https://arxiv.org/abs/2512.19864
Authors: Shashi Kant Gupta,Arijeet Pramanik,Jerrin John Thomas,Regina Schwind,Lauren Wiener,Avi Raju,Jeremy Kornbluth,Yanshan Wang,Zhaohui Su,Hrituraj Singh
Affiliations: Triomics; McKesson; University of Pittsburgh
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 39 Pages, Supplementary Included
Abstract:Unstructured notes within the electronic health record (EHR) contain rich clinical information vital for cancer treatment decision making and research, yet reliably extracting structured oncology data remains challenging due to extensive variability, specialized terminology, and inconsistent document formats. Manual abstraction, although accurate, is prohibitively costly and unscalable. Existing automated approaches typically address narrow scenarios - either using synthetic datasets, restricting focus to document-level extraction, or isolating specific clinical variables (e.g., staging, biomarkers, histology) - and do not adequately handle patient-level synthesis across the large number of clinical documents containing contradictory information. In this study, we propose an agentic framework that systematically decomposes complex oncology data extraction into modular, adaptive tasks. Specifically, we use large language models (LLMs) as reasoning agents, equipped with context-sensitive retrieval and iterative synthesis capabilities, to exhaustively and comprehensively extract structured clinical variables from real-world oncology notes. Evaluated on a large-scale dataset of over 400,000 unstructured clinical notes and scanned PDF reports spanning 2,250 cancer patients, our method achieves an average F1-score of 0.93, with 100 out of 103 oncology-specific clinical variables exceeding 0.85, and critical variables (e.g., biomarkers and medications) surpassing 0.95. Moreover, integration of the agentic system into a data curation workflow resulted in a 0.94 direct manual approval rate, significantly reducing annotation costs. To our knowledge, this constitutes the first exhaustive, end-to-end application of LLM-based agents for structured oncology data extraction at scale.
zh
[NLP-34] Brain-Grounded Axes for Reading and Steering LLM States
【Quick Read】: This paper addresses the lack of external grounding in LLM interpretability methods that derive directions from textual supervision. The key is to use human brain activity (phase-locking value, PLV, patterns from MEG) as a coordinate system, rather than a training signal, for reading and steering the LLM state space. A word-level brain atlas is constructed and latent axes are extracted with independent component analysis (ICA); lightweight adapters then map LLM hidden states onto these brain-derived axes, enabling controlled interventions without fine-tuning the model. Experiments show the axes are stable and interpretable, can steer model behavior along lexical-frequency and semantic-content dimensions, and remain consistent across several architectures, providing a neurophysiologically grounded, interpretable interface to LLMs.
Link: https://arxiv.org/abs/2512.19399
Authors: Sandro Andric
Affiliations: New York University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 4 figures. Code: this https URL
Abstract:Interpretability methods for large language models (LLMs) typically derive directions from textual supervision, which can lack external grounding. We propose using human brain activity not as a training signal but as a coordinate system for reading and steering LLM states. Using the SMN4Lang MEG dataset, we construct a word-level brain atlas of phase-locking value (PLV) patterns and extract latent axes via ICA. We validate axes with independent lexica and NER-based labels (POS/log-frequency used as sanity checks), then train lightweight adapters that map LLM hidden states to these brain axes without fine-tuning the LLM. Steering along the resulting brain-derived directions yields a robust lexical (frequency-linked) axis in a mid TinyLlama layer, surviving perplexity-matched controls, and a brain-vs-text probe comparison shows larger log-frequency shifts (relative to the text probe) with lower perplexity for the brain axis. A function/content axis (axis 13) shows consistent steering in TinyLlama, Qwen2-0.5B, and GPT-2, with PPL-matched text-level corroboration. Layer-4 effects in TinyLlama are large but inconsistent, so we treat them as secondary (Appendix). Axis structure is stable when the atlas is rebuilt without GPT embedding-change features or with word2vec embeddings (|r|=0.64-0.95 across matched axes), reducing circularity concerns. Exploratory fMRI anchoring suggests potential alignment for embedding change and log frequency, but effects are sensitive to hemodynamic modeling assumptions and are treated as population-level evidence only. These results support a new interface: neurophysiology-grounded axes provide interpretable and controllable handles for LLM behavior.
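Steering along a brain-derived axis reduces to the usual activation-steering operation; a minimal sketch, where alpha, the layer choice, and the normalization are illustrative:

```python
import torch

def steer_hidden_state(hidden, axis, alpha=4.0):
    # hidden: (..., d) hidden states at some layer; axis: (d,) ICA direction.
    v = axis / axis.norm()
    return hidden + alpha * v  # shift activations along the brain-derived axis
```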
zh
[NLP-35] Coherence in the brain unfolds across separable temporal regimes
【Quick Read】: This paper investigates the neural mechanisms of semantic coherence during naturalistic language comprehension, focusing on how the brain gradually accumulates meaning over extended context while rapidly reconfiguring representations at event boundaries, and how these two competing temporal demands dissociate across cortical networks. The key is to derive annotation-free "drift" and "shift" signals from a large language model (LLM), record high-precision voxelwise BOLD responses with ultra-high-field (7T) fMRI, and fit feature-informed hemodynamic response models within a regularized encoding framework. Default-mode regions respond chiefly to the drift signal of gradual semantic accumulation, whereas primary auditory and language association cortex respond to event-driven shift signals, yielding dissociable mechanisms underlying the neural basis of linguistic coherence.
Link: https://arxiv.org/abs/2512.20481
Authors: Davide Stauba,Finn Rabe,Akhil Misra,Yves Pauli,Roya Hüppi,Nils Lang,Lars Michels,Victoria Edkins,Sascha Frühholz,Iris Sommer,Wolfram Hinzen,Philipp Homan
Affiliations: University of Zurich; University Hospital Zurich; University of Oslo; University of Groningen; University Pompeu Fabra; ETH Zurich
Subjects: Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL)
Comments:
Abstract:Coherence in language requires the brain to satisfy two competing temporal demands: gradual accumulation of meaning across extended context and rapid reconfiguration of representations at event boundaries. Despite their centrality to language and thought, how these processes are implemented in the human brain during naturalistic listening remains unclear. Here, we tested whether these two processes can be captured by annotation-free drift and shift signals and whether their neural expression dissociates across large-scale cortical systems. These signals were derived from a large language model (LLM) and formalized contextual drift and event shifts directly from the narrative input. To enable high-precision voxelwise encoding models with stable parameter estimates, we densely sampled one healthy adult across more than 7 hours of listening to thirteen crime stories while collecting ultra high-field (7T) BOLD data. We then modeled the feature-informed hemodynamic response using a regularized encoding framework validated on independent stories. Drift predictions were prevalent in default-mode network hubs, whereas shift predictions were evident bilaterally in the primary auditory cortex and language association cortex. Furthermore, activity in default-mode and parietal networks was best explained by a signal capturing how meaning accumulates and gradually fades over the course of the narrative. Together, these findings show that coherence during language comprehension is implemented through dissociable neural regimes of slow contextual integration and rapid event-driven reconfiguration, offering a mechanistic entry point for understanding disturbances of language coherence in psychiatric disorders.
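At its core, the regularized encoding framework is a ridge regression of voxel responses on the drift and shift regressors. A toy, self-contained fit with random stand-in signals (HRF convolution and the real LLM-derived features are omitted for brevity):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
T = 500                                          # time points (TRs)
drift = np.cumsum(rng.normal(size=T)) / 10.0     # slow contextual drift (stand-in)
shift = (rng.random(T) < 0.05).astype(float)     # sparse event boundaries (stand-in)
X = np.column_stack([drift, shift])
y = X @ np.array([0.8, 1.5]) + rng.normal(size=T)   # one synthetic voxel
model = RidgeCV(alphas=np.logspace(-2, 3, 20)).fit(X, y)
print(model.alpha_, model.coef_)                 # per-regressor encoding weights
```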
zh
Computer Vision
[CV-0] SemanticGen: Video Generation in Semantic Space
【Quick Read】: This paper addresses the slow convergence and high computational cost of current video generation models that directly model low-level video tokens in the VAE space, a problem that is especially pronounced for long videos. The key is a two-stage framework, SemanticGen: a diffusion model first generates the video's global layout features in a compact, high-level semantic space for macro-level planning, and a second diffusion model then generates VAE latents conditioned on these semantic features to restore high-fidelity detail. This staged, semantics-to-pixels strategy significantly improves training efficiency and generation quality, particularly for long videos.
Link: https://arxiv.org/abs/2512.20619
Authors: Jianhong Bai,Xiaoshi Wu,Xintao Wang,Fu Xiao,Yuanxing Zhang,Qinghe Wang,Xiaoyu Shi,Menghan Xia,Zuozhu Liu,Haoji Hu,Pengfei Wan,Kun Gai
Affiliations: Zhejiang University; Kling Team, Kuaishou Technology; CUHK; DLUT; HUST
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While this approach can generate high-quality videos, it suffers from slow convergence and is computationally expensive when generating long videos. In this paper, we introduce SemanticGen, a novel solution to address these limitations by generating videos in the semantic space. Our main insight is that, due to the inherent redundancy in videos, the generation process should begin in a compact, high-level semantic space for global planning, followed by the addition of high-frequency details, rather than directly modeling a vast set of low-level video tokens using bi-directional attention. SemanticGen adopts a two-stage generation process. In the first stage, a diffusion model generates compact semantic video features, which define the global layout of the video. In the second stage, another diffusion model generates VAE latents conditioned on these semantic features to produce the final output. We observe that generation in the semantic space leads to faster convergence compared to the VAE latent space. Our method is also effective and computationally efficient when extended to long video generation. Extensive experiments demonstrate that SemanticGen produces high-quality videos and outperforms state-of-the-art approaches and strong baselines.
zh
[CV-1] LongVideoAgent: Multi-Agent Reasoning with Long Videos
【Quick Read】: This paper addresses inaccurate temporal grounding and the loss of fine-grained cues in long-video question answering, problems typically caused by lossy summarization or limited toolsets in existing methods. The key is a multi-agent framework in which a master LLM coordinates two specialized agents: a grounding agent localizes question-relevant video segments, and a vision agent extracts targeted textual observations from them; the master agent plans under a step limit and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation, strengthening understanding of and reasoning over the temporal structure of videos.
Link: https://arxiv.org/abs/2512.20618
Authors: Runtao Liu,Ziyi Liu,Jiaqi Tang,Yue Ma,Renjie Pi,Jipeng Zhang,Qifeng Chen
Affiliations: Hong Kong University of Science and Technology
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:
Abstract:Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans with a step limit, and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed LongTVQA and LongTVQA+ which are episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show reinforcement learning further strengthens reasoning and planning for the trained agent. Code and data will be shared at this https URL.
zh
[CV-2] SpatialTree: How Spatial Abilities Branch Out in MLLMs
【Quick Read】: This paper addresses the unclear developmental hierarchy of spatial abilities in multimodal large language models (MLLMs) and the absence of a cognitively grounded, hierarchical framework for evaluating how spatial ability progresses from low-level perception to high-level agentic behavior. The key is SpatialTree, a cognitive-science-inspired four-level hierarchy of spatial abilities (L1: low-level perception; L2: mental mapping; L3: simulation; L4: agentic competence), on which the first capability-centric hierarchical benchmark is built to evaluate mainstream MLLMs across 27 sub-abilities. The framework reveals differing dependency structures across levels, and targeted supervised fine-tuning together with an improved reinforcement learning strategy (auto-think) enables cross-level transfer and consistent gains across the whole hierarchy, providing a scalable methodological basis for systematically strengthening spatial cognition in MLLMs.
Link: https://arxiv.org/abs/2512.20617
Authors: Yuxi Xiao,Longfei Li,Shen Yan,Xinhang Liu,Sida Peng,Yunchao Wei,Xiaowei Zhou,Bingyi Kang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: webpage: this https URL
Abstract:Cognitive science suggests that spatial ability develops progressively-from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we construct the first capability-centric hierarchical benchmark, thoroughly evaluating mainstream MLLMs across 27 sub-abilities. The evaluation results reveal a clear structure: L1 skills are largely orthogonal, whereas higher-level skills are strongly correlated, indicating increasing interdependency. Through targeted supervised fine-tuning, we uncover a surprising transfer dynamic-negative transfer within L1, but strong cross-level transfer from low- to high-level abilities with notable synergy. Finally, we explore how to improve the entire hierarchy. We find that naive RL that encourages extensive “thinking” is unreliable: it helps complex reasoning but hurts intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.
zh
[CV-3] Active Intelligence in Video Avatars via Closed-loop World Modeling
【Quick Read】: This paper addresses the lack of long-horizon, goal-directed behavior in current video avatar generation: existing methods preserve identity and align motion well but lack autonomy and adaptivity in environmental interaction, and cannot complete multi-step tasks through strategic planning. The key is ORCA (Online Reasoning and Cognitive Architecture), which realizes Internal World Model (IWM) capabilities via a closed-loop OTAR cycle (Observe-Think-Act-Reflect) that continuously verifies predicted outcomes against actual generations for robust state tracking under generative uncertainty, together with a hierarchical dual-system design: System 1 translates abstract plans into concrete action captions, while System 2 performs state prediction and strategic reasoning. Formulating avatar control as a partially observable Markov decision process (POMDP) with continuous belief updating and outcome verification substantially improves task success rate and behavioral coherence, advancing video avatars from passive animation toward active, goal-driven behavior.
Link: https://arxiv.org/abs/2512.20615
Authors: Xuanhua He,Tianyu Yang,Ke Cao,Ruiqi Wu,Cheng Meng,Yong Zhang,Zhuoliang Kang,Xiaoming Wei,Qifeng Chen
Affiliations: The Hong Kong University of Science and Technology; Meituan; University of Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:Current video avatar generation methods excel at identity preservation and motion alignment but lack genuine agency, they cannot autonomously pursue long-term goals through adaptive environmental interaction. We address this by introducing L-IVA (Long-horizon Interactive Visual Avatar), a task and benchmark for evaluating goal-directed planning in stochastic generative environments, and ORCA (Online Reasoning and Cognitive Architecture), the first framework enabling active intelligence in video avatars. ORCA embodies Internal World Model (IWM) capabilities through two key innovations: (1) a closed-loop OTAR cycle (Observe-Think-Act-Reflect) that maintains robust state tracking under generative uncertainty by continuously verifying predicted outcomes against actual generations, and (2) a hierarchical dual-system architecture where System 2 performs strategic reasoning with state prediction while System 1 translates abstract plans into precise, model-specific action captions. By formulating avatar control as a POMDP and implementing continuous belief updating with outcome verification, ORCA enables autonomous multi-step task completion in open-domain scenarios. Extensive experiments demonstrate that ORCA significantly outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, validating our IWM-inspired design for advancing video avatar intelligence from passive animation to active, goal-oriented behavior.
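The OTAR cycle is easiest to read as a control loop. Below is a skeleton with placeholder callables for the observer, the System-2 planner, the System-1 actor, and the reflector; all names are illustrative, not ORCA's actual interfaces.

```python
def otar_loop(goal, observe, think, act, reflect, max_steps=8):
    belief = {}                                     # tracked world/avatar state
    for _ in range(max_steps):
        obs = observe()                             # Observe the current generation
        plan, predicted = think(goal, belief, obs)  # Think: plan and predict outcome
        outcome = act(plan)                         # Act: render the next clip
        belief, done = reflect(belief, predicted, outcome)  # Reflect: verify prediction
        if done:
            break
    return belief
```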
zh
[CV-4] FedPOD: the deployable units of training for federated learning
【Quick Read】: This paper targets training efficiency, communication cost, and performance degradation under skewed data distributions in federated learning. FedPIDAvg weights clients with a differential term of the loss reduction and uses a PID controller to cut communication, but it excludes participants flagged as outliers, limiting data utilization, and depends on learning information from previous rounds, restricting flexibility. The key of the proposed FedPOD (Proportionally Orchestrated Derivative) is to retain participants identified as outliers under a Poisson model of the data distribution, avoiding wasted data; to drop the dependence on earlier rounds by computing validation loss independently in each round, improving robustness; and to abstract the round-wise task of federated learning into a minimal computing unit analogous to a Kubernetes Pod, enabling auto-scaling-like elastic scale-out. FedPOD matches FedPIDAvg in Dice score (0.74 on average) while markedly improving flexibility and resource utilization.
Link: https://arxiv.org/abs/2512.20610
Authors: Daewoon Kim,Si Young Yie,Jae Sung Lee
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 12 pages, 12 figures, MICCAI
Abstract:This paper proposes FedPOD (Proportionally Orchestrated Derivative) for optimizing learning efficiency and communication cost in federated learning among multiple clients. Inspired by FedPIDAvg, we define a round-wise task for FedPOD to enhance training efficiency. FedPIDAvg achieved performance improvements by incorporating the reduction in training loss for prediction entropy as weights, using differential terms. Furthermore, by modeling the data distribution with a Poisson distribution and using a PID controller, it reduced communication costs even under skewed data distributions. However, excluding participants classified as outliers based on the Poisson distribution can limit data utilization. Additionally, a PID controller requires the same participants to be maintained throughout the federated learning process, as it uses learning information from previous rounds in the current round. In our approach, FedPOD addresses these issues by including participants excluded as outliers, eliminating the dependency on previous rounds' learning information, and applying a method for calculating validation loss at each round. In this challenge, FedPOD achieves performance comparable to FedPIDAvg, with average Dice scores of 0.78, 0.71, and 0.72 for WT, ET, and TC, and an average projected convergence score of 0.74. Furthermore, the concept of FedPOD draws inspiration from Kubernetes' smallest computing unit, the Pod, and is designed to be compatible with Kubernetes auto-scaling. Extending the round-wise tasks of FedPOD to Pod units allows flexible design by applying scale-out similar to Kubernetes' auto-scaling. This work demonstrates the potential of FedPOD to enhance federated learning by improving efficiency, flexibility, and performance.
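As a rough illustration of round-wise, derivative-weighted aggregation in the FedPIDAvg/FedPOD family, the sketch below mixes each client's data share with the loss drop it achieved within the current round, so nothing from earlier rounds is needed; the exact weighting used in the paper may differ.

```python
import numpy as np

def aggregate_round(states, n_samples, loss_start, loss_end, k=1.0):
    # states: list of {param_name: np.ndarray} from each client, this round only.
    drop = np.maximum(np.asarray(loss_start) - np.asarray(loss_end), 0.0)
    w = np.asarray(n_samples, float) + k * drop * sum(n_samples)  # size + derivative term
    w /= w.sum()
    return {key: sum(wi * s[key] for wi, s in zip(w, states))
            for key in states[0]}
```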
zh
[CV-5] Repurposing Video Diffusion Transformers for Robust Point Tracking
【Quick Read】: This paper addresses the temporal inconsistency and unreliable matching costs of existing point-tracking methods, which rely on shallow convolutional backbones (e.g., ResNet) that process frames independently and struggle under dynamic motion and frequent occlusion. The key is to adapt video Diffusion Transformers (DiTs) pretrained on large-scale real-world videos for efficient and robust point tracking through three components: (1) query-key attention matching, (2) lightweight LoRA tuning, and (3) cost fusion with a ResNet backbone. Despite training with an 8x smaller batch size, DiTracker achieves state-of-the-art performance on the challenging ITTO benchmark and matches or surpasses the best models on the TAP-Vid benchmarks, validating video DiT features as an effective and efficient foundation for point tracking.
Link: https://arxiv.org/abs/2512.20606
Authors: Soowon Son,Honggyu An,Chaehyun Kim,Hyunah Ko,Jisu Nam,Dahyun Chung,Siyoon Jin,Jung Yi,Jaewon Min,Junhwa Hur,Seungryong Kim
Affiliations: KAIST AI; Google DeepMind
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:Point tracking aims to localize corresponding points across video frames, serving as a fundamental task for 4D reconstruction, robotics, and video editing. Existing methods commonly rely on shallow convolutional backbones such as ResNet that process frames independently, lacking temporal coherence and producing unreliable matching costs under challenging conditions. Through systematic analysis, we find that video Diffusion Transformers (DiTs), pre-trained on large-scale real-world videos with spatio-temporal attention, inherently exhibit strong point tracking capability and robustly handle dynamic motions and frequent occlusions. We propose DiTracker, which adapts video DiTs through: (1) query-key attention matching, (2) lightweight LoRA tuning, and (3) cost fusion with a ResNet backbone. Despite training with 8 times smaller batch size, DiTracker achieves state-of-the-art performance on challenging ITTO benchmark and matches or outperforms state-of-the-art models on TAP-Vid benchmarks. Our work validates video DiT features as an effective and efficient foundation for point tracking.
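The "lightweight LoRA tuning" component is standard low-rank adaptation; here is a minimal PyTorch wrapper, with rank and scaling as illustrative defaults:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # frozen backbone weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # Base projection plus the trainable low-rank update (alpha/r) * B @ A.
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```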
zh
[CV-6] LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving
【Quick Read】: This paper addresses why imitation-learned policies from simulation fall short in real closed-loop driving, tracing the problem to asymmetries between privileged expert demonstrations and sensor-based student observations: the expert enjoys superior perception (e.g., ignoring occlusions, knowing other vehicles' actions) and far lower uncertainty, while the student sees only sensor inputs; navigational intent is also under-specified at test time via a single target point. The key is to systematically narrow this expert-student gap through improved data generation, perception supervision, and better navigational-intent modeling; the resulting TransFuser v6 (TFv6) sets a new state of the art on several CARLA benchmarks and delivers consistent gains on the NAVSIM and Waymo vision-based end-to-end driving benchmarks.
Link: https://arxiv.org/abs/2512.20563
Authors: Long Nguyen,Micha Fauth,Bernhard Jaeger,Daniel Dauner,Maximilian Igl,Andreas Geiger,Kashyap Chitta
Affiliations: University of Tübingen, Tübingen AI Center; NVIDIA Research
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:
Abstract:Simulators can generate virtually unlimited driving data, yet imitation learning policies in simulation still struggle to achieve robust closed-loop performance. Motivated by this gap, we empirically study how misalignment between privileged expert demonstrations and sensor-based student observations can limit the effectiveness of imitation learning. More precisely, experts have significantly higher visibility (e.g., ignoring occlusions) and far lower uncertainty (e.g., knowing other vehicles’ actions), making them difficult to imitate reliably. Furthermore, navigational intent (i.e., the route to follow) is under-specified in student models at test time via only a single target point. We demonstrate that these asymmetries can measurably limit driving performance in CARLA and offer practical interventions to address them. After careful modifications to narrow the gaps between expert and student, our TransFuser v6 (TFv6) student policy achieves a new state of the art on all major publicly available CARLA closed-loop benchmarks, reaching 95 DS on Bench2Drive and more than doubling prior performances on Longest6~v2 and Town13. Additionally, by integrating perception supervision from our dataset into a shared sim-to-real pipeline, we show consistent gains on the NAVSIM and Waymo Vision-Based End-to-End driving benchmarks. Our code, data, and models are publicly available at this https URL.
zh
[CV-7] FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models
【Quick Read】: This paper tackles the quadratic attention cost and heavy redundancy incurred when large vision-language models (VLMs) process hundreds to thousands of visual tokens per image or video frame; existing token-reduction methods either ignore the textual query or rely on deep attention maps, which are unstable under aggressive pruning and degrade semantic alignment. The key is FlashVLM, a text-guided visual token selection framework that computes an explicit cross-modal similarity between projected image tokens and normalized text embeddings in the language-model space, fuses this extrinsic relevance with intrinsic visual saliency via log-domain weighting and temperature-controlled sharpening, and thereby adapts the visual input to the query; a diversity-preserving partition additionally retains a minimal yet representative set of background tokens to preserve global context. Under identical token budgets and evaluation protocols, FlashVLM achieves beyond-lossless compression, pruning up to 77.8% of visual tokens on LLaVA 1.5 while slightly surpassing the unpruned baseline, and maintaining 92.8% accuracy even at 94.4% compression, clearly outperforming existing methods.
Link: https://arxiv.org/abs/2512.20561
Authors: Kaitong Cai,Jusheng Zhang,Jing Yang,Yijia Fan,Pengtao Xie,Jian Wang,Keze Wang
Affiliations: Sun Yat-sen University; University of California, San Diego; Snap Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under submission
Abstract:Large vision-language models (VLMs) typically process hundreds or thousands of visual tokens per image or video frame, incurring quadratic attention cost and substantial redundancy. Existing token reduction methods often ignore the textual query or rely on deep attention maps, whose instability under aggressive pruning leads to degraded semantic alignment. We propose FlashVLM, a text-guided visual token selection framework that dynamically adapts visual inputs to the query. Instead of relying on noisy attention weights, FlashVLM computes an explicit cross-modal similarity between projected image tokens and normalized text embeddings in the language model space. This extrinsic relevance is fused with intrinsic visual saliency using log-domain weighting and temperature-controlled sharpening. In addition, a diversity-preserving partition retains a minimal yet representative set of background tokens to maintain global context. Under identical token budgets and evaluation protocols, FlashVLM achieves beyond-lossless compression, slightly surpassing the unpruned baseline while pruning up to 77.8 percent of visual tokens on LLaVA 1.5, and maintaining 92.8 percent accuracy even under 94.4 percent compression. Extensive experiments on 14 image and video benchmarks demonstrate that FlashVLM delivers state-of-the-art efficiency-performance trade-offs while maintaining strong robustness and generalization across mainstream VLMs.
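A sketch of the selection rule described above: cross-modal relevance and intrinsic saliency are fused in the log domain, sharpened by a temperature, and the top-k tokens are kept. The fusion weight w, temperature tau, and the max-over-text-tokens relevance are assumptions, not FlashVLM's published formula.

```python
import torch

def select_tokens(img_tokens, txt_emb, saliency, keep, w=0.5, tau=0.1):
    img = img_tokens / img_tokens.norm(dim=-1, keepdim=True)   # (N, d)
    txt = txt_emb / txt_emb.norm(dim=-1, keepdim=True)         # (M, d)
    rel = (img @ txt.t()).max(dim=-1).values                   # extrinsic relevance
    rel = torch.softmax(rel / tau, dim=0)                      # temperature sharpening
    sal = torch.softmax(saliency, dim=0)                       # intrinsic saliency
    score = w * torch.log(rel + 1e-8) + (1 - w) * torch.log(sal + 1e-8)
    idx = score.topk(keep).indices                             # keep top-k tokens
    return img_tokens[idx], idx
```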
zh
[CV-8] Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
【Quick Read】: This paper addresses the weakness of vision-language models (VLMs) at dynamic spatial reasoning (DSR), i.e., reasoning about how object geometry and relations evolve in 3D space over time, a gap rooted in the scarcity of scalable 4D-aware training resources. The key is the DSR Suite, with contributions on three fronts: an automated pipeline that extracts rich geometric and motion information (camera poses, local point clouds, object masks, orientations, and 3D trajectories) from in-the-wild videos to build the DSR-Train training set and the human-refined DSR-Bench benchmark; and a lightweight Geometry Selection Module (GSM) that condenses question semantics and distills pretrained 4D reconstruction priors into a compact set of geometry tokens, extracting only question-relevant geometric knowledge and avoiding overwhelming the model with irrelevant information. Experiments show that combining DSR-Train and GSM substantially enhances the dynamic spatial reasoning of Qwen2.5-VL-7B while leaving general video understanding intact.
Link: https://arxiv.org/abs/2512.20557
Authors: Shengchao Zhou,Yuxin Chen,Yuying Ge,Wei Huang,Jiehong Lin,Ying Shan,Xiaojuan Qi
Affiliations: The University of Hong Kong; ARC Lab, Tencent PCG
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Vision-language models (VLM) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about the evolvement of object geometry and relationship in 3D space over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap across aspects of dataset, benchmark and model, we introduce DSR Suite. First, we propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of DSR-Train for learning and further human-refined DSR-Bench for evaluation. Compared with previous works, our data emphasize (i) in-the-wild video sources, (ii) object- and scene-level 3D requirements, (iii) viewpoint transformations, (iv) multi-object interactions, and (v) fine-grained, procedural answers. Beyond data, we propose a lightweight Geometry Selection Module (GSM) to seamlessly integrate geometric priors into VLMs, which condenses question semantics and extracts question-relevant knowledge from pretrained 4D reconstruction priors into a compact set of geometry tokens. This targeted extraction avoids overwhelming the model with irrelevant knowledge. Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability, while maintaining accuracy on general video understanding benchmarks.
zh
[CV-9] Multi-Grained Text-Guided Image Fusion for Multi-Exposure and Multi-Focus Scenarios
【Quick Read】: This paper addresses degraded fusion quality in multi-exposure and multi-focus image fusion caused by disparities in dynamic range and focal depth, and in particular the limitation of existing methods that rely on coarse-grained text descriptions and therefore struggle with fine-grained detail understanding and precise cross-modal alignment. The key is Multi-grained Text-guided Image Fusion (MTIF), with three designs: textual descriptions at three granularities (fine detail, structure, and semantics) that guide fusion through a hierarchical cross-modal modulation module; supervision at every granularity to align visual and textual features and improve the utility of the auxiliary text; and a saliency-driven enrichment module that injects dense semantic content during training to strengthen cross-modal modulation and alignment.
Link: https://arxiv.org/abs/2512.20556
Authors: Mingwei Tang,Jiahao Nie,Guang Yang,Ziqing Cui,Jie Li
Affiliations: Xidian University; Nanyang Technological University; Xi’an University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to WACV 2026
Abstract:Image fusion aims to synthesize a single high-quality image from a pair of inputs captured under challenging conditions, such as differing exposure levels or focal depths. A core challenge lies in effectively handling disparities in dynamic range and focus depth between the inputs. With the advent of vision-language models, recent methods incorporate textual descriptions as auxiliary guidance to enhance fusion quality. However, simply incorporating coarse-grained descriptions hampers the understanding of fine-grained details and poses challenges for precise cross-modal alignment. To address these limitations, we propose Multi-grained Text-guided Image Fusion (MTIF), a novel fusion paradigm with three key designs. First, it introduces multi-grained textual descriptions that separately capture fine details, structural cues, and semantic content, guiding image fusion through a hierarchical cross-modal modulation module. Second, it involves supervision signals at each granularity to facilitate alignment between visual and textual features and enhance the utility of auxiliary text. Third, it adopts a saliency-driven enrichment module to augment training data with dense semantic content, further strengthening the cross-modal modulation and alignment. Extensive experiments show that MTIF consistently outperforms previous methods on both multi-exposure and multi-focus image fusion tasks.
zh
[CV-10] AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment
【Quick Read】: This paper addresses the limitations of single-view RGB model-based object pose estimation under depth ambiguity, clutter, and occlusion, as well as the drawbacks of multi-view methods that depend on accurate single-view estimates or fail to generalize to unseen objects. The key is AlignPose, whose core is a multi-view feature-metric refinement designed specifically for object pose: it optimizes a single, consistent world-frame object pose by minimizing the discrepancy between on-the-fly rendered object features and observed image features across all views simultaneously, achieving accurate multi-view pose estimation without object-specific training or symmetry annotations.
Link: https://arxiv.org/abs/2512.20538
Authors: Anna Šárová Mikeštíková,Médéric Fourmy,Martin Cífka,Josef Sivic,Vladimir Petrik
Affiliations: Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague; Faculty of Electrical Engineering, Czech Technical University in Prague
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 9 figures
Abstract:Single-view RGB model-based object pose estimation methods achieve strong generalization but are fundamentally limited by depth ambiguity, clutter, and occlusions. Multi-view pose estimation methods have the potential to solve these issues, but existing works rely on precise single-view pose estimates or lack generalization to unseen objects. We address these challenges via the following three contributions. First, we introduce AlignPose, a 6D object pose estimation method that aggregates information from multiple extrinsically calibrated RGB views and does not require any object-specific training or symmetry annotation. Second, the key component of this approach is a new multi-view feature-metric refinement specifically designed for object pose. It optimizes a single, consistent world-frame object pose minimizing the feature discrepancy between on-the-fly rendered object features and observed image features across all views simultaneously. Third, we report extensive experiments on four datasets (YCB-V, T-LESS, ITODD-MV, HouseCat6D) using the BOP benchmark evaluation and show that AlignPose outperforms other published methods, especially on challenging industrial datasets where multiple views are readily available in practice.
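The multi-view feature-metric objective can be written as one optimization over a single world-frame pose. A sketch with torch autograd, where `render_feats(pose, view)` stands in for the differentiable feature rendering the paper performs (it is an assumed user-supplied callable, not AlignPose's actual renderer):

```python
import torch

def refine_pose(pose0, observed_feats, render_feats, steps=100, lr=1e-2):
    pose = pose0.clone().requires_grad_(True)     # e.g. a 6-vector (rotation + translation)
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        # One consistent pose must explain the features in *all* views at once.
        loss = sum((render_feats(pose, v) - f).square().mean()
                   for v, f in enumerate(observed_feats))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return pose.detach()
```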
zh
[CV-11] SirenPose: Dynamic Scene Reconstruction via Geometric Supervision
【Quick Read】: This paper addresses motion fidelity and spatiotemporal consistency in dynamic 3D scene reconstruction from monocular video, where existing methods suffer geometric distortion and temporal discontinuity under fast motion, multi-object interaction, occlusion, and rapid scene changes. The key is SirenPose, a geometry-aware loss that combines the periodic activations of sinusoidal representation networks with keypoint-based geometric supervision, imposing physics-inspired constraints for coherent keypoint predictions across space and time while exploiting high-frequency signal modeling to capture fine geometric detail; the UniKPT dataset is further expanded to 600,000 annotated instances, and graph neural networks model keypoint relationships and structural correlations, markedly improving reconstruction accuracy and motion coherence.
Link: https://arxiv.org/abs/2512.20531
Authors: Kaitong Cai,Jensen Zhang,Jing Yang,Keze Wang
Affiliations: Sun Yat-sen University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under submission
Abstract:We introduce SirenPose, a geometry-aware loss formulation that integrates the periodic activation properties of sinusoidal representation networks with keypoint-based geometric supervision, enabling accurate and temporally consistent reconstruction of dynamic 3D scenes from monocular videos. Existing approaches often struggle with motion fidelity and spatiotemporal coherence in challenging settings involving fast motion, multi-object interaction, occlusion, and rapid scene changes. SirenPose incorporates physics inspired constraints to enforce coherent keypoint predictions across both spatial and temporal dimensions, while leveraging high frequency signal modeling to capture fine grained geometric details. We further expand the UniKPT dataset to 600,000 annotated instances and integrate graph neural networks to model keypoint relationships and structural correlations. Extensive experiments on benchmarks including Sintel, Bonn, and DAVIS demonstrate that SirenPose consistently outperforms state-of-the-art methods. On DAVIS, SirenPose achieves a 17.8 percent reduction in FVD, a 28.7 percent reduction in FID, and a 6.0 percent improvement in LPIPS compared to MoSCA. It also improves temporal consistency, geometric accuracy, user score, and motion smoothness. In pose estimation, SirenPose outperforms Monst3R with lower absolute trajectory error as well as reduced translational and rotational relative pose error, highlighting its effectiveness in handling rapid motion, complex dynamics, and physically plausible reconstruction.
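The "sinusoidal representation network" ingredient is a SIREN-style layer with periodic activations; below is a standard implementation with the usual initialization from Sitzmann et al. (SirenPose's exact use of such layers may differ):

```python
import math
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    def __init__(self, in_f, out_f, w0=30.0, is_first=False):
        super().__init__()
        self.w0 = w0
        self.linear = nn.Linear(in_f, out_f)
        # SIREN init: wider uniform range for the first layer, scaled by w0 after.
        bound = 1.0 / in_f if is_first else math.sqrt(6.0 / in_f) / w0
        nn.init.uniform_(self.linear.weight, -bound, bound)

    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))  # periodic activation
```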
zh
[CV-12] Bridging Modalities and Transferring Knowledge: Enhanced Multimodal Understanding and Recognition
【Quick Read】: This manuscript addresses core challenges of cross-modal alignment, translation, fusion, and knowledge transfer in multimodal understanding and recognition. Its key contributions are: (1) a Spatial-Reasoning Bert model that maps spatial relations in text to 2D arrangements of clip-arts, effectively decoding spatial language into visual representations; (2) a loss function based on spatial co-occurrences of medical terms that yields interpretable mappings from medical text to specific 3D locations in an anatomical atlas, improving text navigability; (3) a benchmark for linking structured text to canonical facts in knowledge graphs, mitigating ambiguity in natural-language extraction; (4) a fusion method for video frames and object-detection features that improves the robustness and accuracy of compositional action recognition; and (5) multimodal knowledge distillation that lets RGB-only models mimic multimodal fusion models, preserving performance at lower computational cost. Together, these advances improve spatial language understanding, medical text interpretation, knowledge-graph enrichment, and action recognition.
Link: https://arxiv.org/abs/2512.20501
Authors: Gorjan Radevski
Affiliations: Kasteelpark Arenberg 10 box 2441, B-3001 Leuven
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Ph.D. manuscript; Supervisors/Mentors: Marie-Francine Moens and Tinne Tuytelaars
Abstract:This manuscript explores multimodal alignment, translation, fusion, and transference to enhance machine understanding of complex inputs. We organize the work into five chapters, each addressing unique challenges in multimodal machine learning. Chapter 3 introduces Spatial-Reasoning Bert for translating text-based spatial relations into 2D arrangements between clip-arts. This enables effective decoding of spatial language into visual representations, paving the way for automated scene generation aligned with human spatial understanding. Chapter 4 presents a method for translating medical texts into specific 3D locations within an anatomical atlas. We introduce a loss function leveraging spatial co-occurrences of medical terms to create interpretable mappings, significantly enhancing medical text navigability. Chapter 5 tackles translating structured text into canonical facts within knowledge graphs. We develop a benchmark for linking natural language to entities and predicates, addressing ambiguities in text extraction to provide clearer, actionable insights. Chapter 6 explores multimodal fusion methods for compositional action recognition. We propose a method fusing video frames and object detection representations, improving recognition robustness and accuracy. Chapter 7 investigates multimodal knowledge transference for egocentric action recognition. We demonstrate how multimodal knowledge distillation enables RGB-only models to mimic multimodal fusion-based capabilities, reducing computational requirements while maintaining performance. These contributions advance methodologies for spatial language understanding, medical text interpretation, knowledge graph enrichment, and action recognition, enhancing computational systems’ ability to process complex, multimodal inputs across diverse applications.
zh
[CV-13] Multi-temporal Adaptive Red-Green-Blue and Long-Wave Infrared Fusion for You Only Look Once-Based Landmine Detection from Unmanned Aerial Systems
Link: https://arxiv.org/abs/2512.20487
Authors: James E. Gallagher,Edward J. Oughton,Jana Kosecka
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages with 6 figures
[CV-14] UTDesign: A Unified Framework for Stylized Text Editing and Generation in Graphic Design Images
【Quick Read】: This paper addresses the limited text-rendering ability of generative AI in graphic design, especially for small-scale typography and non-Latin scripts such as Chinese, with respect to style consistency and text accuracy. The key is UTDesign, a unified framework with three core components: (1) a DiT-based text style transfer model trained from scratch that produces transparent RGBA text foregrounds preserving the style of reference glyphs; (2) a multi-modal condition encoder trained on a richly annotated dataset that conditions on background images, text prompts, and layout specifications for style-consistent, accurate conditional text generation; and (3) integration with pretrained text-to-image (T2I) models and an MLLM-based layout planner to form a fully automated end-to-end text-to-design (T2D) pipeline. UTDesign achieves state-of-the-art style consistency and text accuracy among open-source methods and shows distinct advantages over proprietary commercial approaches.
Link: https://arxiv.org/abs/2512.20479
Authors: Yiming Zhao,Yuanpeng Gao,Yuxuan Luo,Jiwei Duan,Shisong Lin,Longfei Xiong,Zhouhui Lian
Affiliations: Wangxuan Institute of Computer Technology, Peking University; State Key Laboratory of General Artificial Intelligence, Peking University; Kingsoft Office
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 25 figures, SIGGRAPH Asia 2025, Conference Paper
Abstract:AI-assisted graphic design has emerged as a powerful tool for automating the creation and editing of design elements such as posters, banners, and advertisements. While diffusion-based text-to-image models have demonstrated strong capabilities in visual content generation, their text rendering performance, particularly for small-scale typography and non-Latin scripts, remains limited. In this paper, we propose UTDesign, a unified framework for high-precision stylized text editing and conditional text generation in design images, supporting both English and Chinese scripts. Our framework introduces a novel DiT-based text style transfer model trained from scratch on a synthetic dataset, capable of generating transparent RGBA text foregrounds that preserve the style of reference glyphs. We further extend this model into a conditional text generation framework by training a multi-modal condition encoder on a curated dataset with detailed text annotations, enabling accurate, style-consistent text synthesis conditioned on background images, prompts, and layout specifications. Finally, we integrate our approach into a fully automated text-to-design (T2D) pipeline by incorporating pre-trained text-to-image (T2I) models and an MLLM-based layout planner. Extensive experiments demonstrate that UTDesign achieves state-of-the-art performance among open-source methods in terms of stylistic consistency and text accuracy, and also exhibits unique advantages compared to proprietary commercial approaches. Code and data for this paper are available at this https URL.
zh
[CV-15] Beyond Motion Pattern: An Empirical Study of Physical Forces for Human Motion Understanding
【Quick Read】: This paper asks when physical cues, such as the joint actuation forces that are fundamental in biomechanics, improve human motion understanding, cues that most existing methods overlook. The key is to integrate physically inferred force features into established motion-understanding pipelines and systematically evaluate their impact on three mainstream tasks: gait recognition, action recognition, and fine-grained video captioning. Across 8 benchmarks, incorporating force information yields consistent gains, with the largest improvements under challenging conditions such as dynamic motion, occlusion, and appearance changes, demonstrating that force cues effectively complement visual and kinematic features and strengthen model understanding of difficult scenes.
Link: https://arxiv.org/abs/2512.20451
Authors: Anh Dao,Manh Tran,Yufei Zhang,Xiaoming Liu,Zijun Cui
Affiliations: Michigan State University; Independent Researcher
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Human motion understanding has advanced rapidly through vision-based progress in recognition, tracking, and captioning. However, most existing methods overlook physical cues such as joint actuation forces that are fundamental in biomechanics. This gap motivates our study: if and when do physically inferred forces enhance motion understanding? By incorporating forces into established motion understanding pipelines, we systematically evaluate their impact across baseline models on 3 major tasks: gait recognition, action recognition, and fine-grained video captioning. Across 8 benchmarks, incorporating forces yields consistent performance gains; for example, on CASIA-B, Rank-1 gait recognition accuracy improved from 89.52% to 90.39% (+0.87), with larger gain observed under challenging conditions: +2.7% when wearing a coat and +3.0% at the side view. On Gait3D, performance also increases from 46.0% to 47.3% (+1.3). In action recognition, CTR-GCN achieved +2.00% on Penn Action, while high-exertion classes like punching/slapping improved by +6.96%. Even in video captioning, Qwen2.5-VL’s ROUGE-L score rose from 0.310 to 0.339 (+0.029), indicating that physics-inferred forces enhance temporal grounding and semantic richness. These results demonstrate that force cues can substantially complement visual and kinematic features under dynamic, occluded, or appearance-varying conditions.
zh
[CV-16] High Dimensional Data Decomposition for Anomaly Detection of Textured Images
Link: https://arxiv.org/abs/2512.20432
Authors: Ji Song,Xing Wang,Jianguo Wu,Xiaowei Yue
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments:
[CV-17] Skin Lesion Classification Using a Soft Voting Ensemble of Convolutional Neural Networks
【速读】: This paper addresses the low classification accuracy of early skin cancer diagnosis caused by subtle image features and heavy background interference. The key to the solution is a soft-voting ensemble of convolutional neural networks (CNNs) combined with an improved image preprocessing pipeline (rebalancing, augmentation, and filtering) and a transfer-learning-based hybrid dual-encoder segmentation module that precisely localizes and segments lesions, focusing the classifiers on clinically critical features and suppressing background noise; the final ensemble of MobileNetV2, VGG19, and InceptionV3 balances high accuracy (up to 96.32%) with inference speed, meeting the deployment needs of real medical settings.
链接: https://arxiv.org/abs/2512.20431
作者: Abdullah Al Shafi,Abdul Muntakim,Pintu Chandra Shill,Rowzatul Zannat,Abdullah Al-Amin
机构: Khulna University of Engineering & Technology; Chittagong University of Engineering & Technology; Daffodil International University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Authors’ version of the paper published in proceedings of ECCE, DOI: this https URL
Abstract:Skin cancer can be identified by dermoscopic examination and ocular inspection, but early detection significantly increases survival chances. Artificial intelligence (AI), using annotated skin images and Convolutional Neural Networks (CNNs), improves diagnostic accuracy. This paper presents an early skin cancer classification method using a soft voting ensemble of CNNs. In this investigation, three benchmark datasets, namely HAM10000, ISIC 2016, and ISIC 2019, were used. The process involved rebalancing, image augmentation, and filtering techniques, followed by a hybrid dual encoder for segmentation via transfer learning. Accurate segmentation focused classification models on clinically significant features, reducing background artifacts and improving accuracy. Classification was performed through an ensemble of MobileNetV2, VGG19, and InceptionV3, balancing accuracy and speed for real-world deployment. The method achieved lesion recognition accuracies of 96.32%, 90.86%, and 93.92% for the three datasets. The system performance was evaluated using established skin lesion detection metrics, yielding impressive results.
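As a concrete illustration of the soft-voting step, the sketch below averages per-class probabilities from the three backbones and takes the argmax; the probability values and the three-class setup are illustrative, not taken from the paper.

```python
import numpy as np

def soft_voting(prob_maps):
    """Average per-class probabilities from several classifiers."""
    return np.mean(np.stack(prob_maps, axis=0), axis=0)

# Hypothetical class probabilities for one lesion image from three
# fine-tuned backbones (MobileNetV2, VGG19, InceptionV3).
p_mobilenet = np.array([0.70, 0.20, 0.10])
p_vgg19 = np.array([0.55, 0.35, 0.10])
p_inception = np.array([0.60, 0.25, 0.15])

fused = soft_voting([p_mobilenet, p_vgg19, p_inception])
print(fused, fused.argmax())  # the class with the highest mean probability wins
```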
[CV-18] Simplifying Multi-Task Architectures Through Task-Specific Normalization
【速读】: This paper tackles resource allocation and interference mitigation in multi-task learning (MTL), in particular the computational overhead that existing architectural solutions incur by introducing elaborate task-specific modules or routing mechanisms. The key finding is that simply replacing shared normalization layers with task-specific variants already yields significant gains, calling the necessity of such complex designs into question; building on this, the authors propose lightweight Task-Specific Sigmoid Batch Normalization (TSσBN), which lets each task softly allocate network capacity while fully sharing the feature extractor, improving stability while remaining parameter-efficient and matching or exceeding existing methods on multiple vision benchmarks.
链接: https://arxiv.org/abs/2512.20420
作者: Mihai Suteu,Ovidiu Serban
机构: Imperial College London
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multi-task learning (MTL) aims to leverage shared knowledge across tasks to improve generalization and parameter efficiency, yet balancing resources and mitigating interference remain open challenges. Architectural solutions often introduce elaborate task-specific modules or routing schemes, increasing complexity and overhead. In this work, we show that normalization layers alone are sufficient to address many of these challenges. Simply replacing shared normalization with task-specific variants already yields competitive performance, questioning the need for complex designs. Building on this insight, we propose Task-Specific Sigmoid Batch Normalization (TS \sigma BN), a lightweight mechanism that enables tasks to softly allocate network capacity while fully sharing feature extractors. TS \sigma BN improves stability across CNNs and Transformers, matching or exceeding performance on NYUv2, Cityscapes, CelebA, and PascalContext, while remaining highly parameter-efficient. Moreover, its learned gates provide a natural framework for analyzing MTL dynamics, offering interpretable insights into capacity allocation, filter specialization, and task relationships. Our findings suggest that complex MTL architectures may be unnecessary and that task-specific normalization offers a simple, interpretable, and efficient alternative.
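The following PyTorch sketch shows one plausible reading of task-specific sigmoid-gated batch normalization: each task owns its own BatchNorm statistics plus a sigmoid gate vector over channels, while the backbone stays fully shared. Module and parameter names here are ours, not the authors'.

```python
import torch
import torch.nn as nn

class TaskSpecificSigmoidBN(nn.Module):
    """Per-task BatchNorm whose sigmoid gates softly allocate channels (a sketch)."""

    def __init__(self, num_channels: int, num_tasks: int):
        super().__init__()
        self.bns = nn.ModuleList([nn.BatchNorm2d(num_channels) for _ in range(num_tasks)])
        # sigmoid(0) = 0.5: every task starts with an equal soft claim on each channel
        self.gates = nn.Parameter(torch.zeros(num_tasks, num_channels))

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        g = torch.sigmoid(self.gates[task_id]).view(1, -1, 1, 1)
        return g * self.bns[task_id](x)

feats = torch.randn(8, 64, 32, 32)            # shared-backbone features
tsbn = TaskSpecificSigmoidBN(64, num_tasks=3)
out = tsbn(feats, task_id=1)                  # route through task 1's norm and gates
```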
[CV-19] Chain-of-Anomaly Thoughts with Large Vision-Language Models
【速读】: This paper addresses the normality bias inherent in Large Vision-Language Models used for automated video surveillance, which weakens anomaly detection and, in particular, crime recognition. The key to the solution is Chain-of-Anomaly-Thoughts (CoAT), a multi-agent reasoning framework that injects an inductive criminal bias into the reasoning process through a final, anomaly-focused classification layer, effectively steering the model toward anomaly-aware reasoning and substantially improving Anomaly Detection (F1-score +11.8 p.p.) and Anomaly Classification (+3.78 p.p.).
链接: https://arxiv.org/abs/2512.20417
作者: Pedro Domingos,João Pereira,Vasco Lopes,João Neves,David Semedo
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
备注: 2 pages, 3 figures, 1 table. Accepted for RECPAD 2025
Abstract:Automated video surveillance with Large Vision-Language Models is limited by their inherent bias towards normality, often failing to detect crimes. While Chain-of-Thought reasoning strategies show significant potential for improving performance in language tasks, the lack of inductive anomaly biases in their reasoning further steers the models towards normal interpretations. To address this, we propose Chain-of-Anomaly-Thoughts (CoAT), a multi-agent reasoning framework that introduces inductive criminal bias in the reasoning process through a final, anomaly-focused classification layer. Our method significantly improves Anomaly Detection, boosting F1-score by 11.8 p.p. on challenging low-resolution footage and Anomaly Classification by 3.78 p.p. in high-resolution videos.
[CV-20] DETACH: Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning
【速读】: This paper addresses the practical limitations of egocentric video with wearable sensors for human action recognition (user discomfort, privacy concerns, poor scalability) by exploring exocentric video with ambient sensors as a non-intrusive, scalable alternative. Existing methods, which rely mainly on Global Alignment, fail in this new setting due to two problems: (P1) inability to preserve local details such as subtle motions, and (P2) over-reliance on modality-invariant temporal patterns, which misaligns actions that share similar temporal patterns but differ in spatio-semantic context. The key to the solution is DETACH, a decomposed spatio-temporal framework that explicitly separates spatial and temporal features to preserve local details and introduces novel sensor-spatial features discovered via online clustering to provide semantic grounding for context-aware alignment; a two-stage alignment strategy first establishes spatial correspondence through mutual supervision and then applies a spatial-temporal weighted contrastive loss that adaptively handles easy, hard, and false negatives, substantially improving cross-modal alignment.
链接: https://arxiv.org/abs/2512.20409
作者: Junho Yoon,Jaemo Jung,Hyunju Kim,Dongman Lee
机构: KAIST
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Aligning egocentric video with wearable sensors have shown promise for human action recognition, but face practical limitations in user discomfort, privacy concerns, and scalability. We explore exocentric video with ambient sensors as a non-intrusive, scalable alternative. While prior egocentric-wearable works predominantly adopt Global Alignment by encoding entire sequences into unified representations, this approach fails in exocentric-ambient settings due to two problems: (P1) inability to capture local details such as subtle motions, and (P2) over-reliance on modality-invariant temporal patterns, causing misalignment between actions sharing similar temporal patterns with different spatio-semantic contexts. To resolve these problems, we propose DETACH, a decomposed spatio-temporal framework. This explicit decomposition preserves local details, while our novel sensor-spatial features discovered via online clustering provide semantic grounding for context-aware alignment. To align the decomposed features, our two-stage approach establishes spatial correspondence through mutual supervision, then performs temporal alignment via a spatial-temporal weighted contrastive loss that adaptively handles easy negatives, hard negatives, and false negatives. Comprehensive experiments with downstream tasks on Opportunity++ and HWU-USP datasets demonstrate substantial improvements over adapted egocentric-wearable baselines.
[CV-21] SmartSplat: Feature-Smart Gaussians for Scalable Compression of Ultra-High-Resolution Images AAAI2026
【速读】: This paper addresses the difficulty, in the era of generative AI, of compressing ultra-high-resolution images with both a high compression ratio and high reconstruction fidelity. Existing 2D Gaussian image models inspired by 3D Gaussian Splatting improve representation efficiency, yet still struggle to balance compression ratio and reconstruction quality at extreme resolutions. The key to the solution is the SmartSplat framework, which introduces a Gradient-Color Guided Variational Sampling strategy and an Exclusion-based Uniform Sampling scheme to optimize the non-overlapping coverage of Gaussian primitives in pixel space, together with a Scale-Adaptive Gaussian Color Sampling method for color initialization; by jointly optimizing spatial layout, scale, and color initialization, a limited number of Gaussians efficiently capture both local structure and global texture, enabling high-quality reconstruction under strong compression.
链接: https://arxiv.org/abs/2512.20377
作者: Linfei Li,Lin Zhang,Zhong Wang,Ying Shen
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2026
Abstract:Recent advances in generative AI have accelerated the production of ultra-high-resolution visual content, posing significant challenges for efficient compression and real-time decoding on end-user devices. Inspired by 3D Gaussian Splatting, recent 2D Gaussian image models improve representation efficiency, yet existing methods struggle to balance compression ratio and reconstruction fidelity in ultra-high-resolution scenarios. To address this issue, we propose SmartSplat, a highly adaptive and feature-aware GS-based image compression framework that supports arbitrary image resolutions and compression ratios. SmartSplat leverages image-aware features such as gradients and color variances, introducing a Gradient-Color Guided Variational Sampling strategy together with an Exclusion-based Uniform Sampling scheme to improve the non-overlapping coverage of Gaussian primitives in pixel space. In addition, we propose a Scale-Adaptive Gaussian Color Sampling method to enhance color initialization across scales. Through joint optimization of spatial layout, scale, and color initialization, SmartSplat efficiently captures both local structures and global textures using a limited number of Gaussians, achieving high reconstruction quality under strong compression. Extensive experiments on DIV8K and a newly constructed 16K dataset demonstrate that SmartSplat consistently outperforms state-of-the-art methods at comparable compression ratios and exceeds their compression limits, showing strong scalability and practical applicability. The code is publicly available at this https URL.
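To make the sampling idea concrete, here is a minimal NumPy sketch that draws Gaussian center locations with probability proportional to a mix of gradient magnitude and per-pixel color variance; the weighting, the normalization, and the omission of the exclusion-based scheme are simplifying assumptions, not the paper's exact procedure.

```python
import numpy as np

def gradient_color_guided_sampling(image, n_points, alpha=0.5, rng=None):
    """Sample pixel coordinates where gradients and color variance are high."""
    rng = rng or np.random.default_rng(0)
    gray = image.mean(axis=-1)
    gy, gx = np.gradient(gray)                 # image gradients along rows/cols
    grad = np.hypot(gx, gy)
    color_var = image.var(axis=-1)             # per-pixel variance across channels
    score = (alpha * grad / (grad.max() + 1e-8)
             + (1 - alpha) * color_var / (color_var.max() + 1e-8))
    probs = (score / score.sum()).ravel()
    idx = rng.choice(probs.size, size=n_points, replace=False, p=probs)
    return np.stack(np.unravel_index(idx, gray.shape), axis=1)  # (n, 2) coords

centers = gradient_color_guided_sampling(np.random.rand(64, 64, 3), n_points=100)
```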
[CV-22] Linking Faces and Voices Across Languages: Insights from the FAME 2026 Challenge ICASSP2026
【速读】: This paper addresses face-voice association in cross-lingual settings, i.e., how to improve generalization when the language at test time differs from the training language. The key to the solution is designing feature extraction and matching mechanisms that are robust to language changes, enabling accurate face-voice pairing across languages.
链接: https://arxiv.org/abs/2512.20376
作者: Marta Moscati,Ahmed Abdullah,Muhammad Saad Saeed,Shah Nawaz,Rohan Kumar Das,Muhammad Zaigham Zaheer,Junaid Mir,Muhammad Haroon Yousaf,Khalid Mahmood Malik,Markus Schedl
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICASSP 2026
Abstract:Over half of the world’s population is bilingual and people often communicate under multilingual scenarios. The Face-Voice Association in Multilingual Environments (FAME) 2026 Challenge, held at ICASSP 2026, focuses on developing methods for face-voice association that are effective when the language at test-time is different than the training one. This report provides a brief summary of the challenge.
[CV-23] CRAFT: Continuous Reasoning and Agentic Feedback Tuning for Multimodal Text-to-Image Generation
【速读】: This paper addresses the lack of interpretability, controllability, and stability in inference-time reasoning for text-to-image generation: existing methods rely on implicit, holistic critiques or unconstrained prompt rewrites, making their behavior hard to predict or intervene in. The key to the solution is CRAFT (Continuous Reasoning and Agentic Feedback Tuning), a training-free, model-agnostic framework that decomposes a prompt into dependency-structured visual questions, explicitly verifies generated images with a vision-language model, and lets an LLM agent apply targeted prompt edits only where constraints fail, with an explicit stopping criterion that terminates the iterative loop once all constraints are satisfied. This structured reasoning markedly improves compositional accuracy, text rendering, and preference-based evaluations at negligible inference overhead, letting lightweight generators approach the quality of far more expensive models.
链接: https://arxiv.org/abs/2512.20362
作者: V. Kovalev,A. Kuvshinov,A. Buzovkin,D. Pokidov,D. Timonin
机构: flymy.ai
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 37 pages, 42 figures
Abstract:Recent work has shown that inference-time reasoning and reflection can improve text-to-image generation without retraining. However, existing approaches often rely on implicit, holistic critiques or unconstrained prompt rewrites, making their behavior difficult to interpret, control, or stop reliably. In contrast, large language models have benefited from explicit, structured forms of thinking based on verification, targeted correction, and early stopping. We introduce CRAFT (Continuous Reasoning and Agentic Feedback Tuning), a training-free, model-agnostic framework that brings this structured reasoning paradigm to multimodal image generation. CRAFT decomposes a prompt into dependency-structured visual questions, verifies generated images using a vision-language model, and applies targeted prompt edits through an LLM agent only where constraints fail. The process iterates with an explicit stopping criterion once all constraints are satisfied, yielding an interpretable and controllable inference-time refinement loop. Across multiple model families and challenging benchmarks, CRAFT consistently improves compositional accuracy, text rendering, and preference-based evaluations, with particularly strong gains for lightweight generators. Importantly, these improvements incur only a negligible inference-time overhead, allowing smaller or cheaper models to approach the quality of substantially more expensive systems. Our results suggest that explicitly structured, constraint-driven inference-time reasoning is a key ingredient for improving the reliability of multimodal generative models.
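A skeleton of the verify-then-edit loop reads roughly as follows; `generate`, `decompose`, `verify`, and `edit` stand in for the T2I model, the question-decomposing LLM, the VLM checker, and the LLM prompt editor, and are placeholders rather than the paper's API.

```python
def craft_refine(prompt, generate, decompose, verify, edit, max_iters=5):
    """Training-free verify-and-edit loop in the spirit of CRAFT (a sketch)."""
    questions = decompose(prompt)              # dependency-structured visual questions
    for _ in range(max_iters):
        image = generate(prompt)
        failed = [q for q in questions if not verify(image, q)]
        if not failed:                         # explicit stopping criterion
            return image, prompt
        prompt = edit(prompt, failed)          # targeted edits only where checks fail
    return image, prompt
```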
[CV-24] Field-Space Attention for Structure-Preserving Earth System Transformers
【速读】: This paper addresses the inability of machine-learning architectures for Earth system dynamics to operate directly on continuous geophysical fields while preserving their geometric structure, which undermines physical consistency, interpretability, and optimization stability. The key to the solution is Field-Space Attention, a mechanism that computes attention in the physical domain rather than in a learned latent space; by keeping all intermediate representations as continuous fields on the sphere, it learns structure-preserving deformations of the input field, and its fixed, non-learned multiscale decomposition lets coarse and fine scales be integrated coherently while avoiding the optimization instabilities typical of standard single-scale Vision Transformers. This design speeds up and stabilizes convergence and allows physical and statistical priors to be embedded directly into the architecture, markedly improving the fidelity and reliability of data-driven Earth system modeling.
链接: https://arxiv.org/abs/2512.20350
作者: Maximilian Witte,Johannes Meuer,Étienne Plésiat,Christopher Kadow
机构: Deutsches Klimarechenzentrum (German Climate Computing Center)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Mathematical Physics (math-ph)
备注:
Abstract:Accurate and physically consistent modeling of Earth system dynamics requires machine-learning architectures that operate directly on continuous geophysical fields and preserve their underlying geometric structure. Here we introduce Field-Space attention, a mechanism for Earth system Transformers that computes attention in the physical domain rather than in a learned latent space. By maintaining all intermediate representations as continuous fields on the sphere, the architecture enables interpretable internal states and facilitates the enforcement of scientific constraints. The model employs a fixed, non-learned multiscale decomposition and learns structure-preserving deformations of the input field, allowing coherent integration of coarse and fine-scale information while avoiding the optimization instabilities characteristic of standard single-scale Vision Transformers. Applied to global temperature super-resolution on a HEALPix grid, Field-Space Transformers converge more rapidly and stably than conventional Vision Transformers and U-Net baselines, while requiring substantially fewer parameters. The explicit preservation of field structure throughout the network allows physical and statistical priors to be embedded directly into the architecture, yielding improved fidelity and reliability in data-driven Earth system modeling. These results position Field-Space Attention as a compact, interpretable, and physically grounded building block for next-generation Earth system prediction and generative modeling frameworks.
[CV-25] The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection
【速读】: This paper addresses two weaknesses of diffusion transformer (DiT)-based video virtual try-on (VVT), namely capturing fine-grained garment dynamics and preserving background integrity across frames, while also tackling the high computational cost of extra interaction modules and the limited scale and quality of public datasets that restrict generalization. The key to the solution is a keyframe-driven details injection strategy, motivated by the fact that keyframes naturally contain both foreground dynamics and background consistency: an instruction-guided keyframe sampling mechanism selects informative frames, and two dedicated modules, a garment details enhancement module and a collaborative background optimization module, distill garment dynamics into garment-related latents and optimize the integrity of background latents; the enriched information is then injected into standard DiT blocks together with pose, mask, and noise latents, enabling efficient and realistic try-on video synthesis that ensures consistency without modifying the DiT architecture or adding extra complexity.
链接: https://arxiv.org/abs/2512.20340
作者: Qingdong He,Xueqin Chen,Yanjie Pan,Peng Tang,Pengcheng Xu,Zhenye Gan,Chengjie Wang,Xiaobin Hu,Jiangning Zhang,Yabiao Wang
机构: Tencent YouTu Lab; Sichuan University; Fudan University; Western University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Although diffusion transformer (DiT)-based video virtual try-on (VVT) has made significant progress in synthesizing realistic videos, existing methods still struggle to capture fine-grained garment dynamics and preserve background integrity across video frames. They also incur high computational costs due to additional interaction modules introduced into DiTs, while the limited scale and quality of existing public datasets also restrict model generalization and effective training. To address these challenges, we propose a novel framework, KeyTailor, along with a large-scale, high-definition dataset, ViT-HD. The core idea of KeyTailor is a keyframe-driven details injection strategy, motivated by the fact that keyframes inherently contain both foreground dynamics and background consistency. Specifically, KeyTailor adopts an instruction-guided keyframe sampling strategy to filter informative frames from the input video. Subsequently, two tailored keyframe-driven modules, the garment details enhancement module and the collaborative background optimization module, are employed to distill garment dynamics into garment-related latents and to optimize the integrity of background latents, both guided by the sampled keyframes. The enriched details are then injected into standard DiT blocks together with pose, mask, and noise latents, enabling efficient and realistic try-on video synthesis. This design ensures consistency without explicitly modifying the DiT architecture, while simultaneously avoiding additional complexity. In addition, our dataset ViT-HD comprises 15,070 high-quality video samples at a resolution of 810×1080, covering diverse garments. Extensive experiments demonstrate that KeyTailor outperforms state-of-the-art baselines in terms of garment fidelity and background integrity across both dynamic and static scenarios.
[CV-26] KnowVal: A Knowledge-Augmented and Value-Guided Autonomous Driving System
【速读】: This paper addresses the limits of current autonomous driving systems in handling complex decision logic: existing data-driven methods struggle to capture deeper knowledge such as traffic rules, defensive driving principles, and ethical norms, which constrains planning performance and leaves systems without interpretability or value alignment. The key to the solution is the KnowVal system, which couples open-world perception with knowledge retrieval: it builds a comprehensive driving knowledge graph covering traffic laws, defensive driving, and ethical guidelines, pairs it with an LLM-based retrieval module optimized for driving scenarios, and trains a Value Model on human-preference data to enable interpretable, value-aligned trajectory assessment and decision optimization.
链接: https://arxiv.org/abs/2512.20299
作者: Zhongyu Xia,Wenhao Chen,Yongtao Wang,Ming-Hsuan Yang
机构: Peking University; University of California, Merced
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual-language reasoning, driving knowledge, and value alignment are essential for advanced autonomous driving systems. However, existing approaches largely rely on data-driven learning, making it difficult to capture the complex logic underlying decision-making through imitation or limited reinforcement rewards. To address this, we propose KnowVal, a new autonomous driving system that enables visual-language reasoning through the synergistic integration of open-world perception and knowledge retrieval. Specifically, we construct a comprehensive driving knowledge graph that encodes traffic laws, defensive driving principles, and ethical norms, complemented by an efficient LLM-based retrieval mechanism tailored for driving scenarios. Furthermore, we develop a human-preference dataset and train a Value Model to guide interpretable, value-aligned trajectory assessment. Experimental results show that our method substantially improves planning performance while remaining compatible with existing architectures. Notably, KnowVal achieves the lowest collision rate on nuScenes and state-of-the-art results on Bench2Drive.
[CV-27] TAVID: Text-Driven Audio-Visual Interactive Dialogue Generation KR
【速读】: This paper addresses the fact that interactive video generation and conversational speech generation are usually studied in isolation, overlooking the tightly coupled audio-visual interactions of human conversation. Toward more human-like conversational systems, the authors propose TAVID, whose core innovation is a pair of cross-modal mappers (a motion mapper and a speaker mapper) that exchange information bidirectionally to synchronize facial motion and speech generation, yielding significant gains along four dimensions: realism, responsiveness, interaction fluency, and speech quality.
链接: https://arxiv.org/abs/2512.20296
作者: Ji-Hoon Kim,Junseok Ahn,Doyeop Kwak,Joon Son Chung,Shinji Watanabe
机构: Korea Advanced Institute of Science and Technology; Carnegie Mellon University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
备注: Project page: this https URL
Abstract:The objective of this paper is to jointly synthesize interactive videos and conversational speech from text and reference images. With the ultimate goal of building human-like conversational systems, recent studies have explored talking or listening head generation as well as conversational speech generation. However, these works are typically studied in isolation, overlooking the multimodal nature of human conversation, which involves tightly coupled audio-visual interactions. In this paper, we introduce TAVID, a unified framework that generates both interactive faces and conversational speech in a synchronized manner. TAVID integrates face and speech generation pipelines through two cross-modal mappers (i.e., a motion mapper and a speaker mapper), which enable bidirectional exchange of complementary information between the audio and visual modalities. We evaluate our system across four dimensions: talking face realism, listening head responsiveness, dyadic interaction fluency, and speech quality. Extensive experiments demonstrate the effectiveness of our approach across all these aspects.
[CV-28] UbiQVision: Quantifying Uncertainty in XAI for Image Recognition
【速读】: This paper addresses the instability and unreliability of SHAP explanations for deep learning models in medical imaging, caused by epistemic and aleatoric uncertainty. The key to the solution is combining Dirichlet posterior sampling with Dempster-Shafer theory: a belief, plausible, and fusion map framework, together with quantitative statistical analysis, quantifies the uncertainty in SHAP explanations and thereby improves the trustworthiness and robustness of model interpretations.
链接: https://arxiv.org/abs/2512.20288
作者: Akshat Dubey,Aleksandar Anžel,Bahar İlgen,Georges Hattab
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in deep learning have led to its widespread adoption across diverse domains, including medical imaging. This progress is driven by increasingly sophisticated model architectures, such as ResNets, Vision Transformers, and Hybrid Convolutional Neural Networks, that offer enhanced performance at the cost of greater complexity. This complexity often compromises model explainability and interpretability. SHAP has emerged as a prominent method for providing interpretable visualizations that aid domain experts in understanding model predictions. However, SHAP explanations can be unstable and unreliable in the presence of epistemic and aleatoric uncertainty. In this study, we address this challenge by using Dirichlet posterior sampling and Dempster-Shafer theory to quantify the uncertainty that arises from these unstable explanations in medical imaging applications. The framework uses a belief, plausible, and fusion map approach alongside statistical quantitative analysis to produce a quantification of uncertainty in SHAP. Furthermore, we evaluated our framework on three medical imaging datasets with varying class distributions, image qualities, and modality types, covering examples from pathology, ophthalmology, and radiology; the varying image resolutions and modality-specific characteristics introduce noise and significant epistemic uncertainty.
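A minimal sketch of the Dirichlet-sampling idea, under the assumption that several SHAP maps for the same image are aggregated with Dirichlet-distributed weights: the mean of the fused maps serves as a point estimate and the per-pixel variance as an uncertainty score. The array shapes and the concentration parameter are illustrative.

```python
import numpy as np

def shap_uncertainty(shap_maps, n_samples=200, alpha=1.0, rng=None):
    """Dirichlet-weighted aggregation of an (R, H, W) stack of SHAP maps."""
    rng = rng or np.random.default_rng(0)
    R = shap_maps.shape[0]
    weights = rng.dirichlet(alpha * np.ones(R), size=n_samples)   # (S, R)
    fused = np.einsum("sr,rhw->shw", weights, shap_maps)          # (S, H, W)
    # Mean acts as a belief-style estimate; variance quantifies instability.
    return fused.mean(axis=0), fused.var(axis=0)

mean_map, var_map = shap_uncertainty(np.random.randn(8, 28, 28))
```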
[CV-29] D3ETOR: Debate-Enhanced Pseudo Labeling and Frequency-Aware Progressive Debiasing for Weakly-Supervised Camouflaged Object Detection with Scribble Annotations
【速读】: This paper addresses two problems in weakly-supervised camouflaged object detection (WSCOD): first, the pseudo labels produced by general-purpose segmentation models such as SAM are unreliable because those models lack the task-specific semantic understanding needed for camouflaged object detection (COD); second, the annotation bias inherent in scribble annotations is ignored, hindering the model from capturing the global structure of camouflaged objects. The key to the solution is the two-stage D³ETOR framework: the first stage strengthens SAM's pseudo labeling for COD through adaptive entropy-driven point sampling and a multi-agent debate mechanism, improving the interpretability and precision of pseudo masks; the second stage introduces FADeNet, which progressively removes scribble bias via frequency-aware multi-level feature fusion and dynamic regional reweighting while balancing global semantics against local detail, substantially narrowing the gap between weakly and fully supervised COD.
链接: https://arxiv.org/abs/2512.20260
作者: Jiawei Ge,Jiuxin Cao,Xinyi Li,Xuelin Zhu,Chang Liu,Bo Liu,Chen Feng,Ioannis Patras
机构: Southeast University; The Hong Kong Polytechnic University; Queen’s University Belfast; Queen Mary University of London
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Weakly-Supervised Camouflaged Object Detection (WSCOD) aims to locate and segment objects that are visually concealed within their surrounding scenes, relying solely on sparse supervision such as scribble annotations. Despite recent progress, existing WSCOD methods still lag far behind fully supervised ones due to two major limitations: (1) the pseudo masks generated by general-purpose segmentation models (e.g., SAM) and filtered via rules are often unreliable, as these models lack the task-specific semantic understanding required for effective pseudo labeling in COD; and (2) the neglect of inherent annotation bias in scribbles, which hinders the model from capturing the global structure of camouflaged objects. To overcome these challenges, we propose D^3 ETOR, a two-stage WSCOD framework consisting of Debate-Enhanced Pseudo Labeling and Frequency-Aware Progressive Debiasing. In the first stage, we introduce an adaptive entropy-driven point sampling method and a multi-agent debate mechanism to enhance the capability of SAM for COD, improving the interpretability and precision of pseudo masks. In the second stage, we design FADeNet, which progressively fuses multi-level frequency-aware features to balance global semantic understanding with local detail modeling, while dynamically reweighting supervision strength across regions to alleviate scribble bias. By jointly exploiting the supervision signals from both the pseudo masks and scribble semantics, D^3 ETOR significantly narrows the gap between weakly and fully supervised COD, achieving state-of-the-art performance on multiple benchmarks.
[CV-30] LADLE-MM: Limited Annotation based Detector with Learned Ensembles for Multimodal Misinformation
【速读】: This paper addresses the performance degradation of multimodal misinformation detectors under scarce annotations and constrained compute. The core challenge is building an efficient image-text detector that generalizes well with limited labels. The key to the solution is LADLE-MM (Limited Annotation based Detector with Learned Ensembles for Multimodal Misinformation), a lightweight, model-soup-initialized architecture with two unimodal branches and a third branch that enriches image and text representations with fixed multimodal embeddings extracted by BLIP; with 60.3% fewer trainable parameters it remains competitive on the DGM4 benchmark, outperforms existing methods when trained without grounding annotations, and surpasses more complex architectures built on Large Vision-Language Models in open-set evaluation on VERITE, demonstrating strong robustness and cross-scenario generalization.
链接: https://arxiv.org/abs/2512.20257
作者: Daniele Cardullo,Simone Teglia,Irene Amerini
机构: Sapienza University of Rome
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With the rise of easily accessible tools for generating and manipulating multimedia content, realistic synthetic alterations to digital media have become a widespread threat, often involving manipulations across multiple modalities simultaneously. Recently, such techniques have been increasingly employed to distort narratives of important events and to spread misinformation on social media, prompting the development of misinformation detectors. In the context of misinformation conveyed through image-text pairs, several detection methods have been proposed. However, these approaches typically rely on computationally intensive architectures or require large amounts of annotated data. In this work we introduce LADLE-MM: Limited Annotation based Detector with Learned Ensembles for Multimodal Misinformation, a model-soup initialized multimodal misinformation detector designed to operate under a limited annotation setup and constrained training resources. LADLE-MM is composed of two unimodal branches and a third multimodal one that enhances image and text representations with additional multimodal embeddings extracted from BLIP, serving as fixed reference space. Despite using 60.3% fewer trainable parameters than previous state-of-the-art models, LADLE-MM achieves competitive performance on both binary and multi-label classification tasks on the DGM4 benchmark, outperforming existing methods when trained without grounding annotations. Moreover, when evaluated on the VERITE dataset, LADLE-MM outperforms current state-of-the-art approaches that utilize more complex architectures involving Large Vision-Language-Models, demonstrating the effective generalization ability in an open-set setting and strong robustness to unimodal bias.
[CV-31] BiCoR-Seg: Bidirectional Co-Refinement Framework for High-Resolution Remote Sensing Image Segmentation
【速读】: This paper addresses blurred boundaries and class confusion in high-resolution remote sensing semantic segmentation (HRSS), caused by high inter-class similarity and large intra-class variability; existing methods struggle to inject abstract yet strongly discriminative semantic knowledge into pixel-level feature learning, limiting accuracy. The key to the solution is the Bidirectional Co-Refinement framework (BiCoR-Seg), built around a Heatmap-driven Bidirectional Information Synergy module (HBIS) that establishes a bidirectional information flow between feature maps and class embeddings via class-level heatmaps; a hierarchical supervision strategy uses the interpretable heatmaps from each HBIS module as low-resolution segmentation predictions to supervise and sharpen shallow features, and a cross-layer class-embedding Fisher Discriminative Loss further enforces intra-class compactness and inter-class separability, significantly improving both segmentation performance and interpretability.
链接: https://arxiv.org/abs/2512.20255
作者: Jinghao Shi,Jianing Song
机构: China University of Geosciences, Wuhan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:High-resolution remote sensing image semantic segmentation (HRSS) is a fundamental yet critical task in the field of Earth observation. However, it has long faced the challenges of high inter-class similarity and large intra-class variability. Existing approaches often struggle to effectively inject abstract yet strongly discriminative semantic knowledge into pixel-level feature learning, leading to blurred boundaries and class confusion in complex scenes. To address these challenges, we propose Bidirectional Co-Refinement Framework for HRSS (BiCoR-Seg). Specifically, we design a Heatmap-driven Bidirectional Information Synergy Module (HBIS), which establishes a bidirectional information flow between feature maps and class embeddings by generating class-level heatmaps. Based on HBIS, we further introduce a hierarchical supervision strategy, where the interpretable heatmaps generated by each HBIS module are directly utilized as low-resolution segmentation predictions for supervision, thereby enhancing the discriminative capacity of shallow features. In addition, to further improve the discriminability of the embedding representations, we propose a cross-layer class embedding Fisher Discriminative Loss to enforce intra-class compactness and enlarge inter-class separability. Extensive experiments on the LoveDA, Vaihingen, and Potsdam datasets demonstrate that BiCoR-Seg achieves outstanding segmentation performance while offering stronger interpretability. The released code is available at this https URL.
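A Fisher-style discriminative loss of the kind described, minimizing intra-class scatter relative to the separation of class-mean embeddings, can be sketched as below; this is a generic formulation, not the paper's exact cross-layer loss.

```python
import torch

def fisher_discriminative_loss(embeddings, labels, eps=1e-6):
    """Ratio of intra-class scatter to inter-class separation (a sketch)."""
    classes = labels.unique()
    means = torch.stack([embeddings[labels == c].mean(0) for c in classes])
    intra = torch.stack([
        ((embeddings[labels == c] - means[i]) ** 2).sum(-1).mean()
        for i, c in enumerate(classes)
    ]).mean()
    inter = torch.pdist(means).pow(2).mean()   # mean squared distance between class means
    return intra / (inter + eps)

emb = torch.randn(32, 128)                     # toy embeddings
lab = torch.randint(0, 6, (32,))               # toy class labels
loss = fisher_discriminative_loss(emb, lab)
```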
[CV-32] Degradation-Aware Metric Prompting for Hyperspectral Image Restoration
【速读】: This paper addresses the difficulty of obtaining explicit degradation priors (e.g., degradation labels) for unified hyperspectral image (HSI) restoration, since real-world degradations are complex and mixed; existing methods depend on such priors as prompts to guide restoration, which limits them under diverse, uncertain degradation patterns. The key to the solution is the Degradation-Aware Metric Prompting (DAMP) framework: spatial-spectral degradation metrics continuously quantify multi-dimensional degradations to form Degradation Prompts (DP); a Spatial-Spectral Adaptive Module (SSAM) dynamically modulates spatial and spectral feature extraction via learnable parameters; finally, SSAM instances serve as experts in a Mixture-of-Experts architecture with DP acting as the gating router, enabling adaptive, efficient, and robust restoration under diverse, mixed, or unseen degradations.
链接: https://arxiv.org/abs/2512.20251
作者: Binfeng Wang,Di Wang,Haonan Guo,Ying Fu,Jing Zhang
机构: Beijing Institute of Technology; Wuhan University; Zhongguancun Academy
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Unified hyperspectral image (HSI) restoration aims to recover various degraded HSIs using a single model, offering great practical value. However, existing methods often depend on explicit degradation priors (e.g., degradation labels) as prompts to guide restoration, which are difficult to obtain due to complex and mixed degradations in real-world scenarios. To address this challenge, we propose a Degradation-Aware Metric Prompting (DAMP) framework. Instead of relying on predefined degradation priors, we design spatial-spectral degradation metrics to continuously quantify multi-dimensional degradations, serving as Degradation Prompts (DP). These DP enable the model to capture cross-task similarities in degradation distributions and enhance shared feature learning. Furthermore, we introduce a Spatial-Spectral Adaptive Module (SSAM) that dynamically modulates spatial and spectral feature extraction through learnable parameters. By integrating SSAM as experts within a Mixture-of-Experts architecture, and using DP as the gating router, the framework enables adaptive, efficient, and robust restoration under diverse, mixed, or unseen degradations. Extensive experiments on natural and remote sensing HSI datasets show that DAMP achieves state-of-the-art performance and demonstrates exceptional generalization capability. Code is publicly available at this https URL.
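The gating idea, routing a Mixture-of-Experts with a continuous degradation prompt instead of a degradation label, can be sketched as follows; the linear experts and the three-dimensional prompt are stand-ins for the SSAM experts and the paper's spatial-spectral metrics.

```python
import torch
import torch.nn as nn

class DegradationGatedMoE(nn.Module):
    """MoE whose router is fed a vector of degradation metrics (a sketch)."""

    def __init__(self, dim: int, prompt_dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.gate = nn.Sequential(nn.Linear(prompt_dim, num_experts), nn.Softmax(dim=-1))

    def forward(self, x: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
        w = self.gate(prompt)                                  # (B, E) expert weights
        outs = torch.stack([e(x) for e in self.experts], 1)    # (B, E, D)
        return (w.unsqueeze(-1) * outs).sum(1)                 # weighted expert blend

moe = DegradationGatedMoE(dim=64, prompt_dim=3)
y = moe(torch.randn(2, 64), torch.randn(2, 3))   # prompt = degradation metrics
```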
[CV-33] Unified Multimodal Brain Decoding via Cross-Subject Soft-ROI Fusion ICPR2026
【速读】: This paper addresses two core problems in multimodal brain decoding: weak cross-subject generalization and poor interpretability. To handle the heterogeneity of functional brain topology across subjects, the key solution is a new fMRI encoder that builds a shared space from multi-atlas soft functional parcellations (soft-ROI), introduces a voxel-wise gated fusion mechanism (Voxel-gate) for finer feature integration, and enforces consistent ROI mapping through global label alignment, improving cross-subject transfer. To overcome the instability and opacity of manual or black-box prompting, it further proposes an interpretable prompt optimization procedure that iteratively generates and filters human-readable prompts with a locally deployed Qwen model in a small-sample closed loop, keeping prompt design stable and the optimization trajectory auditable. Finally, parameterized decoding constraints at inference further improve the stability and quality of the generated descriptions.
链接: https://arxiv.org/abs/2512.20249
作者: Xuanyu Hu
机构: Unknown
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 15 pages, 2 figures, 4 tables. Submitted to ICPR 2026
Abstract:Multimodal brain decoding aims to reconstruct semantic information that is consistent with visual stimuli from brain activity signals such as fMRI, and then generate readable natural language descriptions. However, multimodal brain decoding still faces key challenges in cross-subject generalization and interpretability. We propose a BrainROI model and achieve leading-level results in brain-captioning evaluation on the NSD dataset. Under the cross-subject setting, compared with recent state-of-the-art methods and representative baselines, metrics such as BLEU-4 and CIDEr show clear improvements. Firstly, to address the heterogeneity of functional brain topology across subjects, we design a new fMRI encoder. We use multi-atlas soft functional parcellations (soft-ROI) as a shared space. We extend the discrete ROI Concatenation strategy in MINDLLM to a voxel-wise gated fusion mechanism (Voxel-gate). We also ensure consistent ROI mapping through global label alignment, which enhances cross-subject transferability. Secondly, to overcome the limitations of manual and black-box prompting methods in stability and transparency, we introduce an interpretable prompt optimization process. In a small-sample closed loop, we use a locally deployed Qwen model to iteratively generate and select human-readable prompts. This process improves the stability of prompt design and preserves an auditable optimization trajectory. Finally, we impose parameterized decoding constraints during inference to further improve the stability and quality of the generated descriptions.
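One plausible form of the voxel-wise gated fusion (Voxel-gate) is sketched below: per-voxel sigmoid gates mix features projected through several atlas parcellations, generalizing hard ROI concatenation. The tensor shapes and the normalization choice are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class VoxelGateFusion(nn.Module):
    """Voxel-wise gated fusion of multi-atlas soft-ROI features (a sketch)."""

    def __init__(self, num_atlases: int, num_voxels: int):
        super().__init__()
        self.gate_logits = nn.Parameter(torch.zeros(num_atlases, num_voxels))

    def forward(self, atlas_feats: torch.Tensor) -> torch.Tensor:
        # atlas_feats: (B, A, V, D) -- batch, atlases, voxels, feature dim
        gates = torch.sigmoid(self.gate_logits)            # (A, V) soft gates
        gates = gates / gates.sum(0, keepdim=True)         # normalize over atlases
        return (atlas_feats * gates[None, :, :, None]).sum(dim=1)   # (B, V, D)

fuse = VoxelGateFusion(num_atlases=3, num_voxels=100)
fused = fuse(torch.randn(2, 3, 100, 16))
```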
[CV-34] IndicDLP: A Foundational Dataset for Multi-Lingual and Multi-Domain Document Layout Parsing ICDAR2025
【速读】: This paper addresses the shortcomings of current document layout analysis datasets in scale, diversity, and annotation granularity, especially the underrepresentation of Indic documents. Large datasets such as PubLayNet and DocBank lack fine-grained region labels and multilingual coverage, while human-annotated datasets such as M6Doc and D4LA offer rich labels but are too small and insufficiently multilingual to train robust models. To meet this challenge, the paper introduces IndicDLP, a large-scale foundational layout analysis dataset spanning 11 representative Indic languages plus English across 12 common document domains, together with UED-mini to strengthen pretraining. The key is a large-scale, multilingual, finely annotated dataset design that substantially improves model performance on Indic documents and enables cross-lingual generalization, advancing more inclusive document digitization.
链接: https://arxiv.org/abs/2512.20236
作者: Oikantik Nath,Sahithi Kukkala,Mitesh Khapra,Ravi Kiran Sarvadevabhatla
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in ICDAR 2025 (Oral Presentation) - Best Student Paper Runner-Up Award
Abstract:Document layout analysis is essential for downstream tasks such as information retrieval, extraction, OCR, and digitization. However, existing large-scale datasets like PubLayNet and DocBank lack fine-grained region labels and multilingual diversity, making them insufficient for representing complex document layouts. In contrast, human-annotated datasets such as M6Doc and D4LA offer richer labels and greater domain diversity, but are too small to train robust models and lack adequate multilingual coverage. This gap is especially pronounced for Indic documents, which encompass diverse scripts yet remain underrepresented in current datasets, further limiting progress in this space. To address these shortcomings, we introduce IndicDLP, a large-scale foundational document layout dataset spanning 11 representative Indic languages alongside English and 12 common document domains. Additionally, we curate UED-mini, a dataset derived from DocLayNet and M6Doc, to enhance pretraining and provide a solid foundation for Indic layout models. Our experiments demonstrate that fine-tuning existing English models on IndicDLP significantly boosts performance, validating its effectiveness. Moreover, models trained on IndicDLP generalize well beyond Indic layouts, making it a valuable resource for document digitization. This work bridges gaps in scale, diversity, and annotation granularity, driving inclusive and efficient document understanding.
[CV-35] How I Met Your Bias: Investigating Bias Amplification in Diffusion Models
【速读】: This paper addresses how diffusion models replicate and amplify dataset biases during image generation, a phenomenon previously treated as an inherent model property and lacking systematic analysis. The key contribution is the first demonstration that sampling algorithms and their hyperparameters have a significant, measurable effect on bias amplification: controlled experiments on Biased MNIST, Multi-Color MNIST, and BFFHQ show that, even with the trained model fixed, adjusting sampling hyperparameters can either reduce or amplify bias, opening a new dimension of intervention for mitigating fairness issues in diffusion models.
链接: https://arxiv.org/abs/2512.20233
作者: Nathan Roos,Ekaterina Iakovleva,Ani Gjergji,Vito Paolo Pastore,Enzo Tartaglione
机构: LTCI, Télécom Paris, Institut Polytechnique de Paris, France; MaLGa-DIBRIS, University of Genova, Italy; AIGO, Istituto Italiano di Tecnologia, Italy
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion-based generative models demonstrate state-of-the-art performance across various image synthesis tasks, yet their tendency to replicate and amplify dataset biases remains poorly understood. Although previous research has viewed bias amplification as an inherent characteristic of diffusion models, this work provides the first analysis of how sampling algorithms and their hyperparameters influence bias amplification. We empirically demonstrate that samplers for diffusion models – commonly optimized for sample quality and speed – have a significant and measurable effect on bias amplification. Through controlled studies with models trained on Biased MNIST, Multi-Color MNIST and BFFHQ, and with Stable Diffusion, we show that sampling hyperparameters can induce both bias reduction and amplification, even when the trained model is fixed. Source code is available at this https URL.
[CV-36] LiteFusion: Taming 3D Object Detectors from Vision-Based to Multi-Modal with Minimal Adaptation
【速读】: This paper addresses the fragility and deployment difficulties caused by the heavy LiDAR dependence of current multi-modal 3D object detectors. Existing methods rely on complex architectures and training strategies and on 3D sparse convolutions, so performance drops sharply when LiDAR is absent and efficient deployment on non-GPU hardware (e.g., NPUs, FPGAs) is difficult. The key to the solution is redefining the role of LiDAR in the camera-LiDAR fusion paradigm: rather than treating point clouds as an independent modality with a dedicated LiDAR encoder, LiteFusion uses them as a complementary source of geometric information, fusing LiDAR features into image features in a quaternion space where orthogonality constraints are preserved during training to model domain-specific cross-modal relations, yielding a compact cross-modal embedding. This frees the model from any 3D backbone, making it deployment-friendly, while improving mAP by 20.4% and NDS by 19.7% on nuScenes with only a 1.1% parameter increase, and maintaining strong detection performance even without LiDAR input.
链接: https://arxiv.org/abs/2512.20217
作者: Xiangxuan Ren,Zhongdao Wang,Pin Tang,Guoqing Wang,Jilai Zheng,Chao Ma
机构: Shanghai Jiao Tong University; Huawei Noah’s Ark Lab
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 9 figures, 8 tables
Abstract:3D object detection is fundamental for safe and robust intelligent transportation systems. Current multi-modal 3D object detectors often rely on complex architectures and training strategies to achieve higher detection accuracy. However, these methods heavily rely on the LiDAR sensor and suffer large performance drops when LiDAR is absent, which compromises the robustness and safety of autonomous systems in practical scenarios. Moreover, existing multi-modal detectors face difficulties in deployment on diverse hardware platforms, such as NPUs and FPGAs, due to their reliance on 3D sparse convolution operators, which are primarily optimized for NVIDIA GPUs. To address these challenges, we reconsider the role of LiDAR in the camera-LiDAR fusion paradigm and introduce a novel multi-modal 3D detector, LiteFusion. Instead of treating LiDAR point clouds as an independent modality with a separate feature extraction backbone, LiteFusion utilizes LiDAR data as a complementary source of geometric information to enhance camera-based detection. This straightforward approach completely eliminates the reliance on a 3D backbone, making the method highly deployment-friendly. Specifically, LiteFusion integrates complementary features from LiDAR points into image features within a quaternion space, where the orthogonal constraints are well-preserved during network training. This helps model domain-specific relations across modalities, yielding a compact cross-modal embedding. Experiments on the nuScenes dataset show that LiteFusion improves the baseline vision-based detector by +20.4% mAP and +19.7% NDS with a minimal increase in parameters (1.1%) without using dedicated LiDAR encoders. Notably, even in the absence of LiDAR input, LiteFusion maintains strong results, highlighting its favorable robustness and effectiveness across diverse fusion paradigms and deployment scenarios.
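Fusing two feature maps in quaternion space can be illustrated with a Hamilton product over channel groups, as in the sketch below; this is one generic way to realize quaternion-space fusion and not necessarily LiteFusion's exact formulation.

```python
import torch

def hamilton_product(q, r):
    """Hamilton product of two feature quaternions, channels split as
    (real, i, j, k); a generic quaternion-space fusion primitive."""
    a1, b1, c1, d1 = q.chunk(4, dim=1)
    a2, b2, c2, d2 = r.chunk(4, dim=1)
    return torch.cat([
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    ], dim=1)

img_feat = torch.randn(2, 64, 32, 32)     # channel count divisible by 4
lidar_feat = torch.randn(2, 64, 32, 32)   # LiDAR-derived geometric features
fused = hamilton_product(img_feat, lidar_feat)
```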
[CV-37] JDPNet: A Network Based on Joint Degradation Processing for Underwater Image Enhancement
【速读】: This paper addresses the nonlinear coupling among the multiple degradations in underwater images (e.g., color distortion, loss of clarity, and reduced contrast): existing methods design modules for single degradations in isolation and ignore the information shared across degradations, so they struggle to model and process these coupled degradations from the bottom up. The key to the solution is the JDPNet architecture, whose core innovations are: (1) a joint feature-mining module that extracts and integrates the latent features of coupled degradations; (2) a probabilistic bootstrap distribution strategy for unified adjustment of coupled degradation features; and (3) an AquaBalanceLoss that balances the optimization objectives for color, clarity, and contrast, allowing coupled multi-degradation processing within a unified framework and markedly improving underwater image restoration.
链接: https://arxiv.org/abs/2512.20213
作者: Tao Ye,Hongbin Ren,Chongbing Zhang,Haoran Chen,Xiaosong Li
机构: China University of Mining and Technology; China Shipbuilding Science Research Center; Foshan University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Given the complexity of underwater environments and the variability of water as a medium, underwater images are inevitably subject to various types of degradation. The degradations present nonlinear coupling rather than simple superposition, which renders the effective processing of such coupled degradations particularly challenging. Most existing methods focus on designing specific branches, modules, or strategies for specific degradations, with little attention paid to the potential information embedded in their coupling. Consequently, they struggle to effectively capture and process the nonlinear interactions of multiple degradations from a bottom-up perspective. To address this issue, we propose JDPNet, a joint degradation processing network that mines and unifies the potential information inherent in coupled degradations within a unified framework. Specifically, we introduce a joint feature-mining module, along with a probabilistic bootstrap distribution strategy, to facilitate effective mining and unified adjustment of coupled degradation features. Furthermore, to balance color, clarity, and contrast, we design a novel AquaBalanceLoss to guide the network in learning from multiple coupled degradation losses. Experiments on six publicly available underwater datasets, as well as two new datasets constructed in this study, show that JDPNet exhibits state-of-the-art performance while offering a better tradeoff between performance, parameter size, and computational cost.
[CV-38] Generative Latent Coding for Ultra-Low Bitrate Image Compression CVPR2024
【速读】: This paper addresses the difficulty of achieving both high fidelity and high realism at low bitrates with conventional image compression, which stems from the mismatch between pixel-space distortion and human perception. The key to the solution is a Generative Latent Coding (GLC) architecture that moves transform coding from the pixel space into the latent space of a generative vector-quantized variational auto-encoder (VQ-VAE). That latent space is sparser, semantically richer, and better aligned with human perception, improving compression efficiency and visual quality; in addition, a categorical hyper module reduces the bit cost of hyper-information and code-prediction-based supervision strengthens semantic consistency, further improving compression performance.
链接: https://arxiv.org/abs/2512.20194
作者: Zhaoyang Jia,Jiahao Li,Bin Li,Houqiang Li,Yan Lu
机构: University of Science and Technology of China; Microsoft Research Asia
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted at CVPR 2024
Abstract:Most existing image compression approaches perform transform coding in the pixel space to reduce its spatial redundancy. However, they encounter difficulties in achieving both high-realism and high-fidelity at low bitrate, as the pixel-space distortion may not align with human perception. To address this issue, we introduce a Generative Latent Coding (GLC) architecture, which performs transform coding in the latent space of a generative vector-quantized variational auto-encoder (VQ-VAE), instead of in the pixel space. The generative latent space is characterized by greater sparsity, richer semantic and better alignment with human perception, rendering it advantageous for achieving high-realism and high-fidelity compression. Additionally, we introduce a categorical hyper module to reduce the bit cost of hyper-information, and a code-prediction-based supervision to enhance the semantic consistency. Experiments demonstrate that our GLC maintains high visual quality with less than 0.04 bpp on natural images and less than 0.01 bpp on facial images. On the CLIC2020 test set, we achieve the same FID as MS-ILLM with 45% fewer bits. Furthermore, the powerful generative latent space enables various applications built on our GLC pipeline, such as image restoration and style transfer. The code is available at this https URL.
[CV-39] AMoE: Agglomerative Mixture-of-Experts Vision Foundation Model
【速读】: This paper addresses the poorly understood learning dynamics and low data efficiency of multi-teacher distillation for vision foundation models. The core solution is the Agglomerative Mixture-of-Experts (AMoE) architecture, which enables efficient training through three key techniques: (1) an Asymmetric Relation-Knowledge Distillation loss that transfers knowledge effectively while preserving the geometric properties of each teacher; (2) token-balanced batching, which stabilizes representation learning across image resolutions under a uniform token budget; and (3) hierarchical clustering and sampling of training data, which substantially improves sample efficiency over random sampling. Together these yield OpenLVD200M, a high-quality corpus of 200 million images, and demonstrate that multi-teacher distillation can cut computational cost while retaining strong performance.
链接: https://arxiv.org/abs/2512.20157
作者: Sofian Chaybouti,Sanath Narayan,Yasser Dahou,Phúc H. Lê Khac,Ankit Singh,Ngoc Dung Huynh,Wamiq Reyaz Para,Hilde Kuehne,Hakim Hacid
机构: Technology Innovation Institute; University of Tuebingen; MIT-IBM Watson AI Lab
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 8 figures, 11 tables
Abstract:Vision foundation models trained via multi-teacher distillation offer a promising path toward unified visual representations, yet the learning dynamics and data efficiency of such approaches remain underexplored. In this paper, we systematically study multi-teacher distillation for vision foundation models and identify key factors that enable training at lower computational cost. We introduce Agglomerative Mixture-of-Experts Vision Foundation Models (AMoE), which distill knowledge from SigLIP2 and DINOv3 simultaneously into a Mixture-of-Experts student. We show that (1) our Asymmetric Relation-Knowledge Distillation loss preserves the geometric properties of each teacher while enabling effective knowledge transfer, (2) token-balanced batching that packs varying-resolution images into sequences with uniform token budgets stabilizes representation learning across resolutions without sacrificing performance, and (3) hierarchical clustering and sampling of training data (typically reserved for self-supervised learning) substantially improves sample efficiency over random sampling for multi-teacher distillation. By combining these findings, we curate OpenLVD200M, a 200M-image corpus that demonstrates superior efficiency for multi-teacher distillation into a Mixture-of-Experts student. We release OpenLVD200M and the distilled models.
[CV-40] CoDi – an exemplar-conditioned diffusion model for low-shot counting
【速读】: This paper addresses low-shot object counting, in particular counting and localizing small objects in dense regions. Among existing methods, density-map counters estimate totals well but localize poorly, while point-detection counters localize better yet, because of their limited number of pre-trained queries, underperform on very dense images and resort to heuristics such as upsampling or tiling. This paper proposes CoDi, the first latent diffusion-based low-shot counter; its core innovation is a new exemplar-based conditioning module that extracts object prototypes and adapts them into the intermediate layers of the denoising network, enabling accurate object location estimation. This lets CoDi produce high-quality density maps from which object locations are recovered by non-maxima suppression, outperforming the state of the art by 15% MAE on FSC and 44% MAE on MCAC.
链接: https://arxiv.org/abs/2512.20153
作者: Grega Šuštar,Jer Pelhan,Alan Lukežič,Matej Kristan
机构: University of Ljubljana
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Low-shot object counting addresses estimating the number of previously unobserved objects in an image using only few or no annotated test-time exemplars. A considerable challenge for modern low-shot counters are dense regions with small objects. While total counts in such situations are typically well addressed by density-based counters, their usefulness is limited by poor localization capabilities. This is better addressed by point-detection-based counters, which are based on query-based detectors. However, due to limited number of pre-trained queries, they underperform on images with very large numbers of objects, and resort to ad-hoc techniques like upsampling and tiling. We propose CoDi, the first latent diffusion-based low-shot counter that produces high-quality density maps on which object locations can be determined by non-maxima suppression. Our core contribution is the new exemplar-based conditioning module that extracts and adjusts the object prototypes to the intermediate layers of the denoising network, leading to accurate object location estimation. On FSC benchmark, CoDi outperforms state-of-the-art by 15% MAE, 13% MAE and 10% MAE in the few-shot, one-shot, and reference-less scenarios, respectively, and sets a new state-of-the-art on MCAC benchmark by outperforming the top method by 44% MAE. The code is available at this https URL.
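Recovering point locations from a predicted density map via non-maxima suppression can be done with a max-pooling trick, as sketched below; the kernel size and threshold are illustrative values.

```python
import torch
import torch.nn.functional as F

def density_peaks(density, kernel=3, threshold=0.2):
    """Keep pixels that equal their local maximum and exceed a threshold,
    returning (y, x) object locations from a density map."""
    d = density[None, None]                                   # (1, 1, H, W)
    local_max = F.max_pool2d(d, kernel, stride=1, padding=kernel // 2)
    keep = (d == local_max) & (d > threshold)
    ys, xs = torch.nonzero(keep[0, 0], as_tuple=True)
    return torch.stack([ys, xs], dim=1)

points = density_peaks(torch.rand(64, 64))
print(points.shape[0], "detected objects")
```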
[CV-41] Enhancing annotations for 5D apple pose estimation through 3D Gaussian Splatting (3DGS)
【速读】: This paper addresses apple pose estimation in orchards, in particular the difficulty of annotating key points such as the calyx under occlusion. Conventional methods depend on such key points for annotation and estimation, but occlusion causes conflicting or missing annotations of the same fruit across images, making manual labeling costly and inefficient. The key to the solution is a novel pipeline: reconstruct the orchard scene with 3D Gaussian Splatting, create simplified annotations on the reconstruction, and automatically project them into the images, drastically reducing manual effort. Experiments show that only 105 manual annotations yield 28,191 training labels, a 99.6% reduction, while achieving a neutral F1 score of 0.927 on the original images, confirming the pipeline's efficiency and effectiveness.
链接: https://arxiv.org/abs/2512.20148
作者: Robert van de Ven,Trim Bresilla,Bram Nelissen,Ard Nieuwenhuizen,Eldert J. van Henten,Gert Kootstra
机构: Wageningen University and Research
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 33 pages, excluding appendices. 17 figures
Abstract:Automating tasks in orchards is challenging because of the large amount of variation in the environment and occlusions. One of the challenges is apple pose estimation, where key points, such as the calyx, are often occluded. Recently developed pose estimation methods no longer rely on these key points, but still require them for annotations, making annotating challenging and time-consuming. Due to the abovementioned occlusions, there can be conflicting and missing annotations of the same fruit between different images. Novel 3D reconstruction methods can be used to simplify annotating and enlarge datasets. We propose a novel pipeline consisting of 3D Gaussian Splatting to reconstruct an orchard scene, simplified annotations, automated projection of the annotations to images, and the training and evaluation of a pose estimation method. Using our pipeline, 105 manual annotations were required to obtain 28,191 training labels, a reduction of 99.6%. Experimental results indicated that training with labels of fruits that are \leq95% occluded resulted in the best performance, with a neutral F1 score of 0.927 on the original images and 0.970 on the rendered images. Adjusting the size of the training dataset had small effects on the model performance in terms of F1 score and pose estimation accuracy. It was found that the least occluded fruits had the best position estimation, which worsened as the fruits became more occluded. It was also found that the tested pose estimation method was unable to correctly learn the orientation estimation of apples.
[CV-42] Dreamcrafter: Immersive Editing of 3D Radiance Fields Through Flexible Generative Inputs and Outputs
【速读】: This paper addresses the barriers to authoring 3D scenes in spatial computing: how to fold in generative AI capabilities for efficiency and flexibility while preserving real-time interactivity. The core tension is that existing approaches either offer immersive direct manipulation without intelligent assistance, or use AI-captured real scenes (e.g., NeRFs or 3D Gaussian Splatting) whose editing suffers from high latency. The key to the solution is the Dreamcrafter system: a modular architecture that integrates multiple generative AI algorithms, supports several levels of control combining natural language with direct manipulation, and introduces proxy representations that keep interaction fluid during high-latency operations, effectively uniting the strengths of intuitive manipulation and high-level semantic editing.
链接: https://arxiv.org/abs/2512.20129
作者: Cyrus Vachha,Yixiao Kang,Zach Dive,Ashwat Chidambaram,Anik Gupta,Eunice Jun,Bjoern Hartmann
机构: UC Berkeley; UCLA
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注: CHI 2025, Project page: this https URL
Abstract:Authoring 3D scenes is a central task for spatial computing applications. Competing visions for lowering existing barriers are (1) focus on immersive, direct manipulation of 3D content or (2) leverage AI techniques that capture real scenes (3D Radiance Fields such as, NeRFs, 3D Gaussian Splatting) and modify them at a higher level of abstraction, at the cost of high latency. We unify the complementary strengths of these approaches and investigate how to integrate generative AI advances into real-time, immersive 3D Radiance Field editing. We introduce Dreamcrafter, a VR-based 3D scene editing system that: (1) provides a modular architecture to integrate generative AI algorithms; (2) combines different levels of control for creating objects, including natural language and direct manipulation; and (3) introduces proxy representations that support interaction during high-latency operations. We contribute empirical findings on control preferences and discuss how generative AI interfaces beyond text input enhance creativity in scene editing and world building.
[CV-43] milliMamba: Specular-Aware Human Pose Estimation via Dual mmWave Radar with Multi-Frame Mamba Fusion WACV2026
【速读】: This paper addresses the sparsity of millimeter-wave radar signals in human pose estimation (HPE), where specular reflection makes robust feature extraction difficult. The key to the solution is milliMamba, an end-to-end 2D human pose estimation framework whose core innovation is a Cross-View Fusion Mamba encoder that extracts spatio-temporal features from long sequences efficiently (with linear complexity), combined with a Spatio-Temporal-Cross Attention decoder that predicts joint coordinates over multiple frames, so that contextual cues from neighboring frames and joints compensate for joints lost to missing signal; a velocity loss further enforces motion smoothness, lifting performance on the TransHuPR and HuPR datasets by 11.0 AP and 14.6 AP over the baselines, respectively.
链接: https://arxiv.org/abs/2512.20128
作者: Niraj Prakash Kini,Shiau-Rung Tsai,Guan-Hsun Lin,Wen-Hsiao Peng,Ching-Wen Ma,Jenq-Neng Hwang
机构: National Yang Ming Chiao Tung University; University of Washington
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at WACV 2026
Abstract:Millimeter-wave radar offers a privacy-preserving and lighting-invariant alternative to RGB sensors for Human Pose Estimation (HPE) task. However, the radar signals are often sparse due to specular reflection, making the extraction of robust features from radar signals highly challenging. To address this, we present milliMamba, a radar-based 2D human pose estimation framework that jointly models spatio-temporal dependencies across both the feature extraction and decoding stages. Specifically, given the high dimensionality of radar inputs, we adopt a Cross-View Fusion Mamba encoder to efficiently extract spatio-temporal features from longer sequences with linear complexity. A Spatio-Temporal-Cross Attention decoder then predicts joint coordinates across multiple frames. Together, this spatio-temporal modeling pipeline enables the model to leverage contextual cues from neighboring frames and joints to infer missing joints caused by specular reflections. To reinforce motion smoothness, we incorporate a velocity loss alongside the standard keypoint loss during training. Experiments on the TransHuPR and HuPR datasets demonstrate that our method achieves significant performance improvements, exceeding the baselines by 11.0 AP and 14.6 AP, respectively, while maintaining reasonable complexity. Code: this https URL
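The velocity loss is straightforward: penalize the error in frame-to-frame joint displacement on top of the usual keypoint loss. A sketch follows; the loss weight is illustrative, not the paper's value.

```python
import torch

def velocity_loss(pred_joints, gt_joints):
    """L1 error on frame-to-frame joint velocities, encouraging smooth motion.

    pred_joints, gt_joints: (B, T, J, 2) multi-frame 2D joint coordinates.
    """
    pred_vel = pred_joints[:, 1:] - pred_joints[:, :-1]
    gt_vel = gt_joints[:, 1:] - gt_joints[:, :-1]
    return (pred_vel - gt_vel).abs().mean()

pred = torch.randn(4, 8, 14, 2)
gt = torch.randn(4, 8, 14, 2)
# Combined objective: standard keypoint loss plus weighted velocity term.
total = torch.nn.functional.l1_loss(pred, gt) + 0.5 * velocity_loss(pred, gt)
```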
[CV-44] HEART-VIT: Hessian-Guided Efficient Dynamic Attention and Token Pruning in Vision Transformer
【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)在延迟和资源受限平台部署时面临的两大挑战:一是自注意力机制带来的二次方计算复杂度(quadratic attention cost),二是冗余计算导致的效率低下。现有剪枝方法通常孤立地处理token或注意力头(attention head),依赖启发式规则或一阶信号,往往牺牲精度或无法跨输入泛化。其解决方案的关键在于提出HEART-ViT——一个基于海森矩阵(Hessian)引导的、统一的、二阶的、输入自适应动态注意力与token剪枝框架,首次实现了对ViT中token和head的联合优化。该方法通过高效的海森向量乘积估算token与注意力头的曲率加权敏感性,从而做出更合理的剪枝决策;实验表明,token剪枝主导计算量节省,head剪枝实现细粒度冗余消除,二者结合可获得最优精度-效率权衡,在ImageNet数据集上实现最高达49.4% FLOPs减少、36%延迟降低和46%吞吐量提升,且经微调后精度不降反升,同时在AGX Orin等边缘设备上验证了实际推理速度与能效的显著提升。
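摘要的核心工具是高效的海森-向量乘积(Hessian-vector product, HVP),用于估计 token 与注意力头的曲率加权敏感度。下面用 PyTorch 的两次反向传播给出 HVP 的标准写法(Pearlmutter 技巧);如何由 HVP 构造具体的剪枝打分属于论文细节,注释中的用法仅为假设性示意。

```python
import torch

def hvp(loss, params, vec):
    """海森-向量乘积 H @ vec(Pearlmutter 技巧):
    先求梯度并保留计算图,再对 <grad, vec> 求一次梯度。
    params、vec 为同形状张量列表。"""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    dot = sum((g * v).sum() for g, v in zip(grads, vec))
    return torch.autograd.grad(dot, params)

# 示意用法:以二阶项 v^T H v 估计方向 v 上的曲率敏感度(打分方式为假设)
# Hv = hvp(loss, [p], [v])
# curvature = sum((h * vi).sum() for h, vi in zip(Hv, [v]))
```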
链接: https://arxiv.org/abs/2512.20120
作者: Mohammad Helal Uddin,Liam Seymour,Sabur Baidya
机构: University of Louisville (路易斯维尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision Transformers (ViTs) deliver state-of-the-art accuracy but their quadratic attention cost and redundant computations severely hinder deployment on latency and resource-constrained platforms. Existing pruning approaches treat either tokens or heads in isolation, relying on heuristics or first-order signals, which often sacrifice accuracy or fail to generalize across inputs. We introduce HEART-ViT, a Hessian-guided efficient dynamic attention and token pruning framework for vision transformers, which to the best of our knowledge is the first unified, second-order, input-adaptive framework for ViT optimization. HEART-ViT estimates curvature-weighted sensitivities of both tokens and attention heads using efficient Hessian-vector products, enabling principled pruning decisions under explicit loss constraints. This dual-view sensitivity reveals an important structural insight: token pruning dominates computational savings, while head pruning provides fine-grained redundancy removal, and their combination achieves a superior trade-off. On ImageNet-100 and ImageNet-1K with ViT-B/16 and DeiT-B/16, HEART-ViT achieves up to 49.4 percent FLOPs reduction, 36 percent lower latency, and 46 percent higher throughput, while consistently matching or even surpassing baseline accuracy after fine-tuning, for example 4.7 percent recovery at 40 percent token pruning. Beyond theoretical benchmarks, we deploy HEART-ViT on different edge devices such as AGX Orin, demonstrating that our reductions in FLOPs and latency translate directly into real-world gains in inference speed and energy efficiency. HEART-ViT bridges the gap between theory and practice, delivering the first unified, curvature-driven pruning framework that is both accuracy-preserving and edge-efficient.
zh
[CV-45] DDAVS: Disentangled Audio Semantics and Delayed Bidirectional Alignment for Audio-Visual Segmentation
【速读】:该论文旨在解决音频-视觉分割(Audio-Visual Segmentation, AVS)中普遍存在的多源混淆(multi-source entanglement)和音视频错位(audio-visual misalignment)问题,这些问题会导致模型偏向于识别 louder 或 larger 的声源,而忽略较弱、较小或共现的声源。解决方案的关键在于提出 DDAVS 框架,其核心包括两个创新:一是通过可学习查询(learnable queries)从音频中提取语义信息,并将其锚定在由音频原型记忆库(audio prototype memory bank)构建的结构化语义空间中,结合对比学习增强语义区分度与鲁棒性;二是引入延迟双向对齐机制(delayed bidirectional alignment),通过双交叉注意力(dual cross-attention)实现模态间更稳定的对齐,从而缓解音视频错位问题。实验表明,DDAVS 在 AVS-Objects 和 VPO 基准上均优于现有方法,尤其在单源、多源及多实例场景下展现出卓越的泛化能力。
链接: https://arxiv.org/abs/2512.20117
作者: Jingqi Tian,Yiheng Du,Haoji Zhang,Yuji Wang,Isaac Ning Lee,Xulong Bai,Tianrui Zhu,Jingxuan Niu,Yansong Tang
机构: Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院,清华大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: this https URL
Abstract:Audio-Visual Segmentation (AVS) aims to localize sound-producing objects at the pixel level by jointly leveraging auditory and visual information. However, existing methods often suffer from multi-source entanglement and audio-visual misalignment, which lead to biases toward louder or larger objects while overlooking weaker, smaller, or co-occurring sources. To address these challenges, we propose DDAVS, a Disentangled Audio Semantics and Delayed Bidirectional Alignment framework. To mitigate multi-source entanglement, DDAVS employs learnable queries to extract audio semantics and anchor them within a structured semantic space derived from an audio prototype memory bank. This is further optimized through contrastive learning to enhance discriminability and robustness. To alleviate audio-visual misalignment, DDAVS introduces dual cross-attention with delayed modality interaction, improving the robustness of multimodal alignment. Extensive experiments on the AVS-Objects and VPO benchmarks demonstrate that DDAVS consistently outperforms existing approaches, exhibiting strong performance across single-source, multi-source, and multi-instance scenarios. These results validate the effectiveness and generalization ability of our framework under challenging real-world audio-visual segmentation conditions. Project page: this https URL
zh
[CV-46] Multi Modal Attention Networks with Uncertainty Quantification for Automated Concrete Bridge Deck Delamination Detection
【速读】:该论文旨在解决桥梁铺装层脱粘缺陷检测中单一模态传感器(如探地雷达和红外热成像)存在的局限性问题,例如雷达对湿度敏感且难以探测浅层缺陷,而热成像则受天气影响大且探测深度有限。解决方案的关键在于提出一种多模态注意力网络,通过融合雷达的时间模式与热成像的空间特征,利用时间注意力机制处理雷达数据、空间注意力机制提取热图像特征,并引入可学习嵌入的跨模态融合策略以挖掘单个传感器无法识别的互补缺陷模式。此外,系统还结合蒙特卡洛Dropout和可学习方差估计实现不确定性量化,将不确定性分解为认知不确定性(epistemic)与随机不确定性(aleatoric),从而提升决策安全性并支持选择性预测,显著优于传统单模态方法及简单特征拼接融合方式。
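下面是不确定性分解思路的一个最小 PyTorch 示意:认知不确定性由蒙特卡洛 Dropout 多次前向的预测方差估计,随机不确定性由网络输出的可学习方差给出。其中模型同时输出 logits 与 log_var 的双头接口、采样次数等均为此处的假设。

```python
import torch

def mc_uncertainty(model, x, n_samples=30):
    """蒙特卡洛 Dropout 不确定性分解示意。假设模型为双头输出:
    logits 与 log_var(可学习方差,对应随机不确定性)。"""
    model.train()  # 使 Dropout 保持激活;实际使用时应只切换 Dropout 层
    probs, ale = [], []
    with torch.no_grad():
        for _ in range(n_samples):
            logits, log_var = model(x)            # 假设的双头接口
            probs.append(torch.softmax(logits, dim=-1))
            ale.append(log_var.exp())
    probs = torch.stack(probs)                    # (S, B, C)
    epistemic = probs.var(dim=0).sum(dim=-1)      # 认知:采样间方差
    aleatoric = torch.stack(ale).mean(dim=0)      # 随机:学习方差的均值
    return probs.mean(dim=0), epistemic, aleatoric
```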
链接: https://arxiv.org/abs/2512.20113
作者: Alireza Moayedikia,Sattar Dorafshan
机构: Swinburne University of Technology (斯威本科技大学); University of North Dakota (北达科他大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Deteriorating civil infrastructure requires automated inspection techniques overcoming limitations of visual assessment. While Ground Penetrating Radar and Infrared Thermography enable subsurface defect detection, single modal approaches face complementary constraints: radar struggles with moisture and shallow defects, while thermography exhibits weather dependency and limited depth. This paper presents a multi modal attention network fusing radar temporal patterns with thermal spatial signatures for bridge deck delamination detection. Our architecture introduces temporal attention for radar processing, spatial attention for thermal features, and cross modal fusion with learnable embeddings discovering complementary defect patterns invisible to individual sensors. We incorporate uncertainty quantification through Monte Carlo dropout and learned variance estimation, decomposing uncertainty into epistemic and aleatoric components for safety critical decisions. Experiments on five bridge datasets reveal that on balanced to moderately imbalanced data, our approach substantially outperforms baselines in accuracy and AUC representing meaningful improvements over single modal and concatenation based fusion. Ablation studies demonstrate cross modal attention provides critical gains beyond within modality attention, while multi head mechanisms achieve improved calibration. Uncertainty quantification reduces calibration error, enabling selective prediction by rejecting uncertain cases. However, under extreme class imbalance, attention mechanisms show vulnerability to majority class collapse. These findings provide actionable guidance: attention based architecture performs well across typical scenarios, while extreme imbalance requires specialized techniques. Our system maintains deployment efficiency, enabling real time inspection with characterized capabilities and limitations.
zh
[CV-47] UMAMI: Unifying Masked Autoregressive Models and Deterministic Rendering for View Synthesis NEURIPS2025
【速读】:该论文旨在解决新颖视图合成(Novel View Synthesis, NVS)中长期存在的权衡问题:确定性网络虽能快速渲染已观测区域,但对未观测区域会产生模糊;而基于扩散的随机方法虽能合理生成缺失内容,却带来高昂的训练与推理成本。解决方案的关键在于提出一种混合框架,通过双向Transformer联合编码多视角图像标记(image tokens)与Plucker-ray嵌入,构建共享潜在表示;在此基础上设计两个轻量级分支:一是前馈回归头用于高置信度几何约束区域的快速像素渲染,二是掩码自回归扩散头用于填补遮挡或未见区域。整个模型端到端训练,结合光度损失与扩散损失,无需人工设计的3D归纳偏置,从而在保持先进图像质量的同时,将渲染时间降低一个数量级。
链接: https://arxiv.org/abs/2512.20107
作者: Thanh-Tung Le,Tuan Pham,Tung Nguyen,Deying Kong,Xiaohui Xie,Stephan Mandt
机构: UCI(加州大学欧文分校); UCLA(加州大学洛杉矶分校); Google(谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to NeurIPS 2025. The first two authors contributed equally
Abstract:Novel view synthesis (NVS) seeks to render photorealistic, 3D-consistent images of a scene from unseen camera poses given only a sparse set of posed views. Existing deterministic networks render observed regions quickly but blur unobserved areas, whereas stochastic diffusion-based methods hallucinate plausible content yet incur heavy training- and inference-time costs. In this paper, we propose a hybrid framework that unifies the strengths of both paradigms. A bidirectional transformer encodes multi-view image tokens and Plucker-ray embeddings, producing a shared latent representation. Two lightweight heads then act on this representation: (i) a feed-forward regression head that renders pixels where geometry is well constrained, and (ii) a masked autoregressive diffusion head that completes occluded or unseen regions. The entire model is trained end-to-end with joint photometric and diffusion losses, without handcrafted 3D inductive biases, enabling scalability across diverse scenes. Experiments demonstrate that our method attains state-of-the-art image quality while reducing rendering time by an order of magnitude compared with fully generative baselines.
zh
[CV-48] LiDARDraft: Generating LiDAR Point Cloud from Versatile Inputs
【速读】:该论文旨在解决当前LiDAR点云生成方法在控制灵活性与生成质量之间难以平衡的问题,即复杂分布的LiDAR点云与简单控制信号之间的不匹配。其解决方案的关键在于提出LiDARDraft框架,通过引入3D布局(3D layout)作为桥梁,将文本、图像等多样化输入统一转化为语义和深度控制信号,并借助基于rangemap的ControlNet实现像素级对齐控制,从而显著提升生成点云的质量与可控性,支持从任意文本描述、图像或草图中“从零开始”构建自动驾驶仿真环境。
链接: https://arxiv.org/abs/2512.20105
作者: Haiyun Wei,Fan Lu,Yunwei Zhu,Zehan Zheng,Weiyi Xue,Lin Shao,Xudong Zhang,Ya Wu,Rong Fu,Guang Chen
机构: Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generating realistic and diverse LiDAR point clouds is crucial for autonomous driving simulation. Although previous methods achieve LiDAR point cloud generation from user inputs, they struggle to attain high-quality results while enabling versatile controllability, due to the imbalance between the complex distribution of LiDAR point clouds and the simple control signals. To address the limitation, we propose LiDARDraft, which utilizes the 3D layout to build a bridge between versatile conditional signals and LiDAR point clouds. The 3D layout can be trivially generated from various user inputs such as textual descriptions and images. Specifically, we represent text, images, and point clouds as unified 3D layouts, which are further transformed into semantic and depth control signals. Then, we employ a rangemap-based ControlNet to guide LiDAR point cloud generation. This pixel-level alignment approach demonstrates excellent performance in controllable LiDAR point cloud generation, enabling “simulation from scratch”, allowing self-driving environments to be created from arbitrary textual descriptions, images and sketches.
zh
[CV-49] Effect of Activation Function and Model Optimizer on the Performance of Human Activity Recognition System Using Various Deep Learning Models
【速读】:该论文旨在解决深度学习中激活函数(Activation Functions, AFs)与模型优化器(Model Optimizers, MOs)组合对人类活动识别(Human Activity Recognition, HAR)性能影响不明确的问题,尤其关注其在实际医疗场景下的表现差异。解决方案的关键在于系统性地评估三种常见激活函数(ReLU、Sigmoid、Tanh)与四种优化算法(SGD、Adam、RMSprop、Adagrad)在两种递归神经网络架构(BiLSTM 和 ConvLSTM)中的交互效应,并基于 HMDB51 和 UCF101 数据集上六个医学相关活动类别进行实验验证。结果表明,ConvLSTM 结合 Adam 或 RMSprop 时表现出最优性能(最高准确率达 99.00%),凸显其在时空特征建模上的优势及对 AF-MO 组合变化的稳定性,为面向医疗应用的 HAR 系统设计提供了可落地的优化策略。
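该研究的实验设计本质上是对激活函数与优化器做网格组合评估。下面的 PyTorch 草图演示如何系统遍历 3 种激活函数与 4 种优化器共 12 种组合;占位网络与注释中的 train_and_eval 均为假设,实际应替换为论文中的 BiLSTM / ConvLSTM 及对应的视频训练流程。

```python
import itertools
import torch
import torch.nn as nn

ACTS = {"ReLU": nn.ReLU, "Sigmoid": nn.Sigmoid, "Tanh": nn.Tanh}
OPTS = {
    "SGD":     lambda p: torch.optim.SGD(p, lr=1e-2),
    "Adam":    lambda p: torch.optim.Adam(p, lr=1e-3),
    "RMSprop": lambda p: torch.optim.RMSprop(p, lr=1e-3),
    "Adagrad": lambda p: torch.optim.Adagrad(p, lr=1e-2),
}

def build_model(act_cls):
    # 占位网络:实际应替换为论文中的 BiLSTM / ConvLSTM 视频分类器
    return nn.Sequential(nn.Flatten(), nn.Linear(64, 32), act_cls(), nn.Linear(32, 6))

for (a_name, act_cls), (o_name, make_opt) in itertools.product(ACTS.items(), OPTS.items()):
    model = build_model(act_cls)
    optimizer = make_opt(model.parameters())
    # acc = train_and_eval(model, optimizer, ...)  # 假设的训练/评估函数
    print(f"组合: {a_name} + {o_name}")
```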
链接: https://arxiv.org/abs/2512.20104
作者: Subrata Kumer Paul,Dewan Nafiul Islam Noor,Rakhi Rani Paul,Md. Ekramul Hamid,Fahmid Al Farid,Hezerul Abdul Karim,Md. Maruf Al Hossain Prince,Abu Saleh Musa Miah
机构: Bangladesh Army University of Engineering & Technology (BAUET); University of Rajshahi; Berlin School of Business and Innovation (柏林商学院); Multimedia University; Bangladesh Army University of Science and Technology (BAUST)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Human Activity Recognition (HAR) plays a vital role in healthcare, surveillance, and innovative environments, where reliable action recognition supports timely decision-making and automation. Although deep learning-based HAR systems are widely adopted, the impact of Activation Functions (AFs) and Model Optimizers (MOs) on performance has not been sufficiently analyzed, particularly regarding how their combinations influence model behavior in practical scenarios. Most existing studies focus on architecture design, while the interaction between AF and MO choices remains relatively unexplored. In this work, we investigate the effect of three commonly used activation functions (ReLU, Sigmoid, and Tanh) combined with four optimization algorithms (SGD, Adam, RMSprop, and Adagrad) using two recurrent deep learning architectures, namely BiLSTM and ConvLSTM. Experiments are conducted on six medically relevant activity classes selected from the HMDB51 and UCF101 datasets, considering their suitability for healthcare-oriented HAR applications. Our experimental results show that ConvLSTM consistently outperforms BiLSTM across both datasets. ConvLSTM, combined with Adam or RMSprop, achieves an accuracy of up to 99.00%, demonstrating strong spatio-temporal learning capabilities and stable performance. While BiLSTM performs reasonably well on UCF101, with accuracy approaching 98.00%, its performance drops to approximately 60.00% on HMDB51, indicating limited robustness across datasets and weaker sensitivity to AF and MO variations. This study provides practical insights for optimizing HAR systems, particularly for real-world healthcare environments where fast and precise activity detection is critical.
zh
[CV-50] Item Region-based Style Classification Network (IRSN): A Fashion Style Classifier Based on Domain Knowledge of Fashion Experts
【速读】:该论文旨在解决时尚风格分类(Fashion Style Classification)中的关键挑战,即同一风格内部存在较大的视觉差异,且不同风格之间可能具有高度相似的外观特征。传统方法通常依赖全局图像特征进行分类,难以捕捉个体服饰物品及其组合属性对风格的贡献。为此,作者提出了一种基于物品区域的时尚风格分类网络(Item Region-based Style Classification Network, IRSN),其核心创新在于引入了两个关键技术:一是通过物品区域池化(Item Region Pooling, IRP)提取每个服饰物品的局部特征并独立分析,二是采用门控特征融合(Gated Feature Fusion, GFF)机制整合这些局部特征与全局特征。此外,研究还构建了一个双骨干架构(dual-backbone architecture),融合领域特定特征提取器与在大规模图文数据集上预训练的通用特征提取器,从而显著提升模型对细微风格差异的感知能力。实验表明,该方法在FashionStyle14和ShowniqV3数据集上平均准确率提升达6.9%–7.6%,最大提升分别达14.5%和15.1%。
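门控特征融合(GFF)的思想是用可学习门控在全局特征与物品区域特征之间做加权组合。下面给出一个极简 PyTorch 示意:物品区域特征假设已由 IRP(物品区域池化)得到,门控取 sigmoid 凸组合、区域特征用均值聚合均为常见做法而非论文原文细节。

```python
import torch
import torch.nn as nn

class GatedFeatureFusion(nn.Module):
    """门控特征融合(GFF)示意:学习一个 sigmoid 门控,
    在全局特征与物品区域特征之间做逐维加权组合。"""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim * 2, dim), nn.Sigmoid())

    def forward(self, global_feat, item_feats):
        # global_feat: (B, D);item_feats: (B, N, D),N 为 IRP 得到的物品区域数
        pooled = item_feats.mean(dim=1)                       # 区域特征聚合方式为假设
        g = self.gate(torch.cat([global_feat, pooled], dim=-1))
        return g * global_feat + (1 - g) * pooled             # 门控凸组合
```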
链接: https://arxiv.org/abs/2512.20088
作者: Jinyoung Choi,Youngchae Kwon,Injung Kim
机构: Handong Global University (韩东大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This is a pre-print of an article published in Applied Intelligence. The final authenticated version is available online at: this https URL
Abstract:Fashion style classification is a challenging task because of the large visual variation within the same style and the existence of visually similar styles. Styles are expressed not only by the global appearance, but also by the attributes of individual items and their combinations. In this study, we propose an item region-based fashion style classification network (IRSN) to effectively classify fashion styles by analyzing item-specific features and their combinations in addition to global features. IRSN extracts features of each item region using item region pooling (IRP), analyzes them separately, and combines them using gated feature fusion (GFF). In addition, we improve the feature extractor by applying a dual-backbone architecture that combines a domain-specific feature extractor and a general feature extractor pre-trained with a large-scale image-text dataset. In experiments, applying IRSN to six widely-used backbones, including EfficientNet, ConvNeXt, and Swin Transformer, improved style classification accuracy by an average of 6.9% and a maximum of 14.5% on the FashionStyle14 dataset and by an average of 7.6% and a maximum of 15.1% on the ShowniqV3 dataset. Visualization analysis also supports that the IRSN models are better than the baseline models at capturing differences between similar style classes.
zh
[CV-51] Progressive Learned Image Compression for Machine Perception
【速读】:该论文旨在解决机器感知导向的图像压缩中缺乏细粒度可扩展性(Fine Granular Scalability, FGS)的问题,即如何在单一比特流中实现多质量层级的渐进式解码以适应下游机器任务的需求。其解决方案的关键在于提出一种基于三元平面(trit-plane)编码的新型渐进式学习图像压缩网络PICM-Net,并设计了一个自适应解码控制器,该控制器在推理阶段动态确定最优解码层级,从而在保证下游分类任务性能的同时,实现高效且灵活的渐进传输。
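自适应解码控制器的核心控制流是:按三元平面逐级解码,一旦下游分类器置信度超过阈值就提前停止。下面的草图只为说明这一逻辑,decode_to_level、classifier 等接口与阈值 tau 均为假设。

```python
import torch

def adaptive_decode(decode_to_level, classifier, bitstream, max_level, tau=0.9):
    """自适应解码控制器示意:三元平面逐级解码,
    下游分类置信度达到阈值 tau 即提前停止。所有接口均为假设。"""
    img = None
    for level in range(1, max_level + 1):
        img = decode_to_level(bitstream, level)        # 解码至第 level 个三元平面
        probs = torch.softmax(classifier(img), dim=-1)
        if probs.max().item() >= tau:                  # 置信度足够,停止接收码流
            return img, level
    return img, max_level
```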
链接: https://arxiv.org/abs/2512.20070
作者: Jungwoo Kim,Jun-Hyuk Kim,Jong-Seok Lee
机构: Yonsei University (延世大学); Chung-Ang University (中央大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Recent advances in learned image codecs have been extended from human perception toward machine perception. However, progressive image compression with fine granular scalability (FGS)-which enables decoding a single bitstream at multiple quality levels-remains unexplored for machine-oriented codecs. In this work, we propose a novel progressive learned image compression codec for machine perception, PICM-Net, based on trit-plane coding. By analyzing the difference between human- and machine-oriented rate-distortion priorities, we systematically examine the latent prioritization strategies in terms of machine-oriented codecs. To further enhance real-world adaptability, we design an adaptive decoding controller, which dynamically determines the necessary decoding level during inference time to maintain the desired confidence of downstream machine prediction. Extensive experiments demonstrate that our approach enables efficient and adaptive progressive transmission while maintaining high performance in the downstream classification task, establishing a new paradigm for machine-aware progressive image compression.
zh
[CV-52] owards Generative Location Awareness for Disaster Response: A Probabilistic Cross-view Geolocalization Approach
【速读】:该论文旨在解决灾害响应中快速准确识别灾情地点的问题,以支持决策制定和资源调配。其核心挑战在于如何在多类灾害(如飓风、野火、洪水和龙卷风)背景下实现跨视角地理定位(cross-view geolocalization)的高精度与可解释性。解决方案的关键在于提出一种概率性跨视角地理定位方法(Probabilistic Cross-view Geolocalization, ProbGLC),通过将概率模型与确定性模型融合到统一框架中,同时提升地理定位性能(如Acc@1km达0.86,Acc@25km达0.97)并提供不确定性量化和局部可定位性评分(localizability score),从而增强模型的可解释性与实用性,为生成式AI (Generative AI) 在灾害响应中的应用开辟新路径。
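摘要中报告的 Acc@1km / Acc@25km 指标衡量预测坐标落入真值给定半径内的样本比例。下面用 NumPy 给出基于 haversine 大圆距离的一种计算示意(指标的这一定义方式是常规做法,论文细节未必完全一致)。

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """两组经纬度坐标间的大圆距离(公里)。"""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def acc_at_k(pred, gt, k_km):
    """Acc@K:预测坐标落在真值 K 公里范围内的样本比例。
    pred / gt: (N, 2) 的 (纬度, 经度) 数组。"""
    d = haversine_km(pred[:, 0], pred[:, 1], gt[:, 0], gt[:, 1])
    return float((d <= k_km).mean())

# 例:acc_at_k(pred, gt, 1.0) 对应 Acc@1km,acc_at_k(pred, gt, 25.0) 对应 Acc@25km
```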
链接: https://arxiv.org/abs/2512.20056
作者: Hao Li,Fabian Deuser,Wenping Yin,Steffen Knoblauch,Wufan Zhao,Filip Biljecki,Yong Xue,Wei Huang
机构: National University of Singapore (新加坡国立大学); Technical University of Munich (慕尼黑工业大学); China University of Mining and Technology (中国矿业大学); Heidelberg University (海德堡大学); Hong Kong University of Science and Technology (广州) (香港科技大学(广州) ); Tongji University (同济大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:As Earth’s climate changes, it is impacting disasters and extreme weather events across the planet. Record-breaking heat waves, drenching rainfalls, extreme wildfires, and widespread flooding during hurricanes are all becoming more frequent and more intense. Rapid and efficient response to disaster events is essential for climate resilience and sustainability. A key challenge in disaster response is to accurately and quickly identify disaster locations to support decision-making and resources allocation. In this paper, we propose a Probabilistic Cross-view Geolocalization approach, called ProbGLC, exploring new pathways towards generative location awareness for rapid disaster response. Herein, we combine probabilistic and deterministic geolocalization models into a unified framework to simultaneously enhance model explainability (via uncertainty quantification) and achieve state-of-the-art geolocalization performance. Designed for rapid diaster response, the ProbGLC is able to address cross-view geolocalization across multiple disaster events as well as to offer unique features of probabilistic distribution and localizability score. To evaluate the ProbGLC, we conduct extensive experiments on two cross-view disaster datasets (i.e., MultiIAN and SAGAINDisaster), consisting of diverse cross-view imagery pairs of multiple disaster types (e.g., hurricanes, wildfires, floods, and tornadoes). Preliminary results confirm the superior geolocalization accuracy (i.e., 0.86 in Acc@1km and 0.97 in Acc@25km) and model explainability (i.e., via probabilistic distributions and localizability scores) of the proposed ProbGLC approach, highlighting the great potential of leveraging generative cross-view approach to facilitate location awareness for better and faster disaster response. The data and code are publicly available at this https URL
zh
[CV-53] Beyond Vision: Contextually Enriched Image Captioning with Multi-Modal Retrieval ACM-MM’25
【速读】:该论文旨在解决真实世界图像描述(image captions)缺乏上下文深度的问题,即现有方法通常无法包含事件背景、时间线索、结果以及视觉不可见的命名实体等关键信息,从而限制了图像理解在新闻报道、教育和数字档案等领域中的应用效果。解决方案的关键在于构建一个多模态管道:首先利用BEIT-3和SigLIP模型检索语义相似图像,再通过ORB和SIFT特征进行几何对齐重排序,随后借助语义搜索从相关文章中提取上下文信息,并最终使用微调后的Qwen3模型(采用QLoRA参数高效微调技术)融合这些外部知识与Instruct BLIP生成的基础描述,从而产出富含事件背景的上下文感知型图像描述。
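检索后的重排序一步以局部特征匹配衡量候选图像与查询的几何一致性。下面用 OpenCV 给出仅含 ORB 的简化示意,以暴力匹配的匹配点数作为重排序分数;论文还结合了 SIFT,且具体打分方式未在摘要中给出,此处均属假设。

```python
import cv2

def orb_rerank(query_path, candidate_paths, top_k=5):
    """以 ORB 特征匹配数作为几何一致性分数,对检索候选重排序。"""
    orb = cv2.ORB_create(nfeatures=1000)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    query = cv2.imread(query_path, cv2.IMREAD_GRAYSCALE)
    _, q_des = orb.detectAndCompute(query, None)
    scored = []
    for path in candidate_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, des = orb.detectAndCompute(img, None)
        n_match = 0 if des is None or q_des is None else len(matcher.match(q_des, des))
        scored.append((path, n_match))         # 匹配点越多,几何对齐越好
    return sorted(scored, key=lambda t: -t[1])[:top_k]
```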
链接: https://arxiv.org/abs/2512.20042
作者: Nguyen Lam Phu Quy,Pham Phu Hoa,Tran Chi Nguyen,Dao Sy Duy Minh,Nguyen Hoang Minh Ngoc,Huynh Trung Kiet
机构: University of Science - VNUHCM (胡志明市国家大学所属科学大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 7 pages, 5 figures. System description for the EVENTA Grand Challenge (Track 1) at ACM MM’25
Abstract:Real-world image captions often lack contextual depth, omitting crucial details such as event background, temporal cues, outcomes, and named entities that are not visually discernible. This gap limits the effectiveness of image understanding in domains like journalism, education, and digital archives, where richer, more informative descriptions are essential. To address this, we propose a multimodal pipeline that augments visual input with external textual knowledge. Our system retrieves semantically similar images using BEIT-3 (Flickr30k-384 and COCO-384) and SigLIP So-384, reranks them using ORB and SIFT for geometric alignment, and extracts contextual information from related articles via semantic search. A fine-tuned Qwen3 model with QLoRA then integrates this context with base captions generated by Instruct BLIP (Vicuna-7B) to produce event-enriched, context-aware descriptions. Evaluated on the OpenEvents v1 dataset, our approach generates significantly more informative captions compared to traditional methods, showing strong potential for real-world applications requiring deeper visual-textual understanding.
zh
[CV-54] FlashLips: 100-FPS Mask-Free Latent Lip-Sync using Reconstruction Instead of Diffusion or GANs
【速读】:该论文旨在解决现有唇形同步(lip-sync)系统中依赖显式掩码(mask)以及难以同时实现高视觉质量和实时性能的问题。解决方案的关键在于提出一个两阶段、无需掩码的端到端框架 FlashLips:第一阶段通过一个紧凑的单步潜在空间编辑器,在仅使用参考身份、目标帧和低维唇部姿态向量的情况下重建图像,训练过程仅依赖重建损失(无生成对抗网络 GAN 或扩散模型);为消除推理时的显式掩码,采用自监督策略,生成口部修改的伪真值样本用于微调,使网络能精准定位到唇部区域并保持其余部分不变;第二阶段则是一个基于流匹配(flow-matching)目标训练的音频到姿态变换器,从语音中预测唇部姿态向量。整体方案实现了确定性重建与鲁棒音频控制的结合,兼具高感知质量与超过 100 FPS 的实时性能。
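第二阶段的音频到姿态 Transformer 以流匹配(flow matching)为训练目标。下面给出该目标最常见的线性插值(rectified flow)形式的 PyTorch 示意:在噪声与目标唇部姿态向量之间插值并回归恒定速度场;model 以音频特征为条件的接口为假设。

```python
import torch

def flow_matching_loss(model, pose, audio_feat):
    """流匹配训练目标示意(线性插值 / rectified flow 形式):
    在噪声与目标唇部姿态向量间插值,回归恒定速度场 pose - noise。
    model(x_t, t, audio_feat) 的条件接口为假设。"""
    noise = torch.randn_like(pose)
    t = torch.rand(pose.size(0), 1, device=pose.device)  # 每个样本采样一个时间步
    x_t = (1 - t) * noise + t * pose                     # 插值路径上的点
    v_target = pose - noise                              # 对应的真实速度
    v_pred = model(x_t, t, audio_feat)
    return torch.mean((v_pred - v_target) ** 2)
```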
链接: https://arxiv.org/abs/2512.20033
作者: Andreas Zinonos,Michał Stypułkowski,Antoni Bigata,Stavros Petridis,Maja Pantic,Nikita Drobyshev
机构: Imperial College London (帝国理工学院); Cantina Labs; NatWest
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present FlashLips, a two-stage, mask-free lip-sync system that decouples lips control from rendering and achieves real-time performance running at over 100 FPS on a single GPU, while matching the visual quality of larger state-of-the-art models. Stage 1 is a compact, one-step latent-space editor that reconstructs an image using a reference identity, a masked target frame, and a low-dimensional lips-pose vector, trained purely with reconstruction losses - no GANs or diffusion. To remove explicit masks at inference, we use self-supervision: we generate mouth-altered variants of the target image, that serve as pseudo ground truth for fine-tuning, teaching the network to localize edits to the lips while preserving the rest. Stage 2 is an audio-to-pose transformer trained with a flow-matching objective to predict lips-poses vectors from speech. Together, these stages form a simple and stable pipeline that combines deterministic reconstruction with robust audio control, delivering high perceptual quality and faster-than-real-time speed.
zh
[CV-55] VALLR-Pin: Dual-Decoding Visual Speech Recognition for Mandarin with Pinyin-Guided LLM Refinement
【速读】:该论文旨在解决中文唇读(Visual Speech Recognition)中因音位(viseme)高度模糊和同音字(homophone)普遍存在的挑战。解决方案的关键在于提出一个两阶段框架VALLR-Pin,其核心创新包括:第一阶段通过共享视频编码器与双解码器结构联合预测汉字序列及其标准拼音(Pinyin),利用多任务学习增强视觉-语义表征的鲁棒性;第二阶段在推理时生成多个候选文本,并将拼音输出与候选汉字序列拼接为提示(prompt)输入大语言模型(LLM),以显式语音上下文纠正同音错误;同时,通过合成带噪拼音-文本对微调LLM,使其掌握模型特定的错误模式,从而实现视觉特征与语音及语言上下文的协同优化,显著提升中文唇读准确率。
链接: https://arxiv.org/abs/2512.20032
作者: Chang Sun,Dongliang Xie,Bo Qin,Hong Yang
机构: Beijing University of Posts and Telecommunications (北京邮电大学); First Research Institute of the Ministry of Public Security of the People’s Republic of China (中华人民共和国公安部第一研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual Speech Recognition aims to transcribe spoken words from silent lip-motion videos. This task is particularly challenging for Mandarin, as visemes are highly ambiguous and homophones are prevalent. We propose VALLR-Pin, a novel two-stage framework that extends the recent VALLR architecture from English to Mandarin. First, a shared video encoder feeds into dual decoders, which jointly predict both Chinese character sequences and their standard Pinyin romanization. The multi-task learning of character and phonetic outputs fosters robust visual-semantic representations. During inference, the text decoder generates multiple candidate transcripts. We construct a prompt by concatenating the Pinyin output with these candidate Chinese sequences and feed it to a large language model to resolve ambiguities and refine the transcription. This provides the LLM with explicit phonetic context to correct homophone-induced errors. Finally, we fine-tune the LLM on synthetic noisy examples: we generate imperfect Pinyin-text pairs from intermediate VALLR-Pin checkpoints using the training data, creating instruction-response pairs for error correction. This endows the LLM with awareness of our model’s specific error patterns. In summary, VALLR-Pin synergizes visual features with phonetic and linguistic context to improve Mandarin lip-reading performance.
zh
[CV-56] H2em: Learning Hierarchical Hyperbolic Embeddings for Compositional Zero-Shot Learning
【速读】:该论文针对组合零样本学习(Compositional Zero-Shot Learning, CZSL)中现有方法忽视语义层级结构的问题,提出了一种基于双层层次化双曲嵌入(Hierarchical Hyperbolic EMbeddings, H2em)的新框架。其核心挑战在于:传统方法在欧几里得空间中建模语义层级(如“苹果”属于“水果”类别,以及“切片苹果”是“苹果”的具体组合)时,受限于平坦几何空间的多项式体积增长特性,难以匹配真实世界中指数级扩展的分类体系,从而削弱泛化能力。解决方案的关键在于利用双曲几何(hyperbolic geometry)天然适合低失真嵌入树状结构的特性,并设计两个创新目标函数:一是双层层次蕴含损失(Dual-Hierarchical Entailment Loss),通过双曲蕴含锥(entailment cones)强制执行预定义的语义层级关系;二是判别性对齐损失(Discriminative Alignment Loss)结合困难负样本挖掘,最大化语义相近组合间的测地距离(geodesic distance)。此外,引入双曲跨模态注意力机制(Hyperbolic Cross-Modal Attention)实现实例感知的跨模态信息融合,显著提升了模型在封闭和开放世界场景下的性能表现。
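理解该方法的一个入门点是双曲空间中的测地距离。下面给出庞加莱球(Poincaré ball)模型中测地距离的标准计算(判别性对齐损失正是要在此类距离上拉开相近组合);蕴含锥约束的实现此处略去,代码仅为通用公式而非论文官方实现。

```python
import torch

def poincare_distance(u, v, eps=1e-6):
    """庞加莱球模型中两点的双曲测地距离:
    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))。"""
    sq_dist = torch.sum((u - v) ** 2, dim=-1)
    norm_u = torch.clamp(1 - torch.sum(u * u, dim=-1), min=eps)
    norm_v = torch.clamp(1 - torch.sum(v * v, dim=-1), min=eps)
    x = 1 + 2 * sq_dist / (norm_u * norm_v)
    return torch.acosh(torch.clamp(x, min=1 + eps))
```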
链接: https://arxiv.org/abs/2512.20029
作者: Lin Li,Jiahui Li,Jiaming Lei,Jun Xiao,Feifei Shao,Long Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Compositional zero-shot learning (CZSL) aims to recognize unseen state-object compositions by generalizing from a training set of their primitives (state and object). Current methods often overlook the rich hierarchical structures, such as the semantic hierarchy of primitives (e.g., apple ≺ fruit) and the conceptual hierarchy between primitives and compositions (e.g., sliced apple ≺ apple). A few recent efforts have shown effectiveness in modeling these hierarchies through loss regularization within Euclidean space. In this paper, we argue that they fail to scale to the large-scale taxonomies required for real-world CZSL: the space’s polynomial volume growth in flat geometry cannot match the exponential structure, impairing generalization capacity. To this end, we propose H2em, a new framework that learns Hierarchical Hyperbolic EMbeddings for CZSL. H2em leverages the unique properties of hyperbolic geometry, a space naturally suited for embedding tree-like structures with low distortion. However, a naive hyperbolic mapping may suffer from hierarchical collapse and poor fine-grained discrimination. We further design two learning objectives to structure this space: a Dual-Hierarchical Entailment Loss that uses hyperbolic entailment cones to enforce the predefined hierarchies, and a Discriminative Alignment Loss with hard negative mining to establish a large geodesic distance between semantically similar compositions. Furthermore, we devise Hyperbolic Cross-Modal Attention to realize instance-aware cross-modal infusion within hyperbolic geometry. Extensive ablations on three benchmarks demonstrate that H2em establishes a new state-of-the-art in both closed-world and open-world scenarios. Our codes will be released.
zh
[CV-57] MAPI-GNN: Multi-Activation Plane Interaction Graph Neural Network for Multimodal Medical Diagnosis AAAI AAAI-26
【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在多模态医学诊断中因依赖单一静态图结构而导致的患者特异性病理关系建模能力不足的问题。其解决方案的关键在于提出多激活平面交互图神经网络(Multi-Activation Plane Interaction Graph Neural Network, MAPI-GNN),通过从语义解耦的特征子空间中学习多维图谱特征,利用多维判别器挖掘潜在的图感知模式,动态构建一系列激活图,并由关系融合引擎对这些多面图谱进行聚合与上下文建模,从而实现更鲁棒的诊断性能。
链接: https://arxiv.org/abs/2512.20026
作者: Ziwei Qin,Xuhui Song,Deqing Huang,Na Qin,Jun Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by Proceedings of the AAAI Conference on Artificial Intelligence 40 (AAAI-26)
Abstract:Graph neural networks are increasingly applied to multimodal medical diagnosis for their inherent relational modeling capabilities. However, their efficacy is often compromised by the prevailing reliance on a single, static graph built from indiscriminate features, hindering the ability to model patient-specific pathological relationships. To this end, the proposed Multi-Activation Plane Interaction Graph Neural Network (MAPI-GNN) reconstructs this single-graph paradigm by learning a multifaceted graph profile from semantically disentangled feature subspaces. The framework first uncovers latent graph-aware patterns via a multi-dimensional discriminator; these patterns then guide the dynamic construction of a stack of activation graphs; and this multifaceted profile is finally aggregated and contextualized by a relational fusion engine for a robust diagnosis. Extensive experiments on two diverse tasks, comprising over 1300 patient samples, demonstrate that MAPI-GNN significantly outperforms state-of-the-art methods.
zh
[CV-58] A Contextual Analysis of Driver-Facing and Dual-View Video Inputs for Distraction Detection in Naturalistic Driving Environments
【速读】:该论文旨在解决当前基于计算机视觉的分心驾驶检测模型普遍仅依赖驾驶员面向视角(driver-facing views),而忽视对驾驶行为具有重要影响的道路面向视角(road-facing views)这一局限性问题。其解决方案的关键在于系统性地评估不同时空动作识别架构在引入双视角输入(driver-only 与 stacked dual-view)时的表现差异,发现单纯增加环境视觉上下文并不能保证性能提升,反而可能因模型结构不支持多视角融合而导致性能下降;因此,实现高精度分心检测的核心在于采用具备融合感知设计(fusion-aware design)的网络架构,以有效整合多视角信息并避免表征冲突。
链接: https://arxiv.org/abs/2512.20025
作者: Anthony Dontoh,Stephanie Ivey,Armstrong Aboah
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite increasing interest in computer vision-based distracted driving detection, most existing models rely exclusively on driver-facing views and overlook crucial environmental context that influences driving behavior. This study investigates whether incorporating road-facing views alongside driver-facing footage improves distraction detection accuracy in naturalistic driving conditions. Using synchronized dual-camera recordings from real-world driving, we benchmark three leading spatiotemporal action recognition architectures: SlowFast-R50, X3D-M, and SlowOnly-R50. Each model is evaluated under two input configurations: driver-only and stacked dual-view. Results show that while contextual inputs can improve detection in certain models, performance gains depend strongly on the underlying architecture. The single-pathway SlowOnly model achieved a 9.8 percent improvement with dual-view inputs, while the dual-pathway SlowFast model experienced a 7.2 percent drop in accuracy due to representational conflicts. These findings suggest that simply adding visual context is not sufficient and may lead to interference unless the architecture is specifically designed to support multi-view integration. This study presents one of the first systematic comparisons of single- and dual-view distraction detection models using naturalistic driving data and underscores the importance of fusion-aware design for future multimodal driver monitoring systems.
zh
[CV-59] SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing Images
【速读】:该论文旨在解决遥感(Remote Sensing, RS)图像中复杂语言指令与像素间精准对齐的问题,尤其针对现有模型在处理多目标、多层次粒度、隐含语义推理及语言多样性等复杂地理空间场景时的失效问题。其解决方案的核心在于提出两个关键改进:一是设计了空间注意力监督机制,有效提升小目标及其组成部分的定位精度;二是引入灵活高效的分割查询机制,统一支持单目标与多目标场景下的语言引导分割任务。通过构建首个涵盖四个核心维度的大规模数据集LaSeRS,并基于此训练SegEarth-R2模型,实现了对复杂地理空间推理能力的显著增强。
链接: https://arxiv.org/abs/2512.20013
作者: Zepeng Xin,Kaiyu Li,Luodi Chen,Wanchen Li,Yuchen Xiao,Hui Qiao,Weizhan Zhang,Deyu Meng,Xiangyong Cao
机构: Xi’an Jiaotong University (西安交通大学); China Telecom Shaanxi Branch (中国电信陕西分公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Effectively grounding complex language to pixels in remote sensing (RS) images is a critical challenge for applications like disaster response and environmental monitoring. Current models can parse simple, single-target commands but fail when presented with complex geospatial scenarios, e.g., segmenting objects at various granularities, executing multi-target instructions, and interpreting implicit user intent. To drive progress against these failures, we present LaSeRS, the first large-scale dataset built for comprehensive training and evaluation across four critical dimensions of language-guided segmentation: hierarchical granularity, target multiplicity, reasoning requirements, and linguistic variability. By capturing these dimensions, LaSeRS moves beyond simple commands, providing a benchmark for complex geospatial reasoning. This addresses a critical gap: existing datasets oversimplify, leading to sensitivity-prone real-world models. We also propose SegEarth-R2, an MLLM architecture designed for comprehensive language-guided segmentation in RS, which directly confronts these challenges. The model’s effectiveness stems from two key improvements: (1) a spatial attention supervision mechanism specifically handles the localization of small objects and their components, and (2) a flexible and efficient segmentation query mechanism that handles both single-target and multi-target scenarios. Experimental results demonstrate that our SegEarth-R2 achieves outstanding performance on LaSeRS and other benchmarks, establishing a powerful baseline for the next generation of geospatial segmentation. All data and code will be released at this https URL.
zh
[CV-60] PaveSync: A Unified and Comprehensive Dataset for Pavement Distress Analysis and Classification
【速读】:该论文旨在解决当前自动化路面缺陷检测模型在不同现实场景中泛化能力不足的问题,其根源在于现有数据集在标注风格、病害类型定义和格式上缺乏统一标准,导致难以整合用于联合训练。解决方案的关键在于构建一个标准化的基准数据集,该数据集整合了来自7个国家的52,747张图像和135,277个边界框标注,覆盖13类路面病害,并涵盖广泛的图像质量、分辨率、视角和天气条件差异,从而为模型的统一训练与评估提供可靠资源。通过引入标准化的类别定义和标注格式,该数据集首次实现了全球代表性,支持模型间的公平比较及零样本迁移至新环境。
链接: https://arxiv.org/abs/2512.20011
作者: Blessing Agyei Kyem,Joshua Kofi Asamoah,Anthony Dontoh,Andrews Danyo,Eugene Denteh,Armstrong Aboah
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Automated pavement defect detection often struggles to generalize across diverse real-world conditions due to the lack of standardized datasets. Existing datasets differ in annotation styles, distress type definitions, and formats, limiting their integration for unified training. To address this gap, we introduce a comprehensive benchmark dataset that consolidates multiple publicly available sources into a standardized collection of 52,747 images from seven countries, with 135,277 bounding box annotations covering 13 distinct distress types. The dataset captures broad real-world variation in image quality, resolution, viewing angles, and weather conditions, offering a unique resource for consistent training and evaluation. Its effectiveness was demonstrated through benchmarking with state-of-the-art object detection models including YOLOv8-YOLOv12, Faster R-CNN, and DETR, which achieved competitive performance across diverse scenarios. By standardizing class definitions and annotation formats, this dataset provides the first globally representative benchmark for pavement defect detection and enables fair comparison of models, including zero-shot transfer to new environments.
zh
[CV-61] Few-Shot-Based Modular Image-to-Video Adapter for Diffusion Models
【速读】:该论文旨在解决扩散模型(Diffusion Models, DMs)在图像动画生成中面临的两大挑战:一是视频信号高维度导致训练数据稀缺,使模型倾向于记忆而非遵循提示生成运动;二是模型难以泛化到训练集中未出现的新运动模式,且在小样本下微调以学习新运动模式的研究仍不充分。解决方案的关键在于提出模块化图像到视频适配器(Modular Image-to-Video Adapter, MIVA),这是一种轻量级子网络,可附加于预训练扩散模型上,每个MIVA专门捕获单一运动模式,并通过并行化实现可扩展性。MIVA仅需约十张样本即可在单个消费级GPU上高效训练,推理时用户通过选择一个或多个MIVA即可指定运动,无需提示工程,从而实现精确的运动控制,同时保持甚至超越大规模数据训练模型的生成质量。
链接: https://arxiv.org/abs/2512.20000
作者: Zhenhao Li,Shaohan Yi,Zheng Liu,Leonartinus Gao,Minh Ngoc Le,Ambrose Ling,Zhuoran Wang,Md Amirul Islam,Zhixiang Chi,Yuanhao Yu
机构: Huawei Technologies Canada (华为技术加拿大公司); University of Waterloo (滑铁卢大学); University of British Columbia (不列颠哥伦比亚大学); University of Toronto (多伦多大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion models (DMs) have recently achieved impressive photorealism in image and video generation. However, their application to image animation remains limited, even when trained on large-scale datasets. Two primary challenges contribute to this: the high dimensionality of video signals leads to a scarcity of training data, causing DMs to favor memorization over prompt compliance when generating motion; moreover, DMs struggle to generalize to novel motion patterns not present in the training set, and fine-tuning them to learn such patterns, especially using limited training data, is still under-explored. To address these limitations, we propose Modular Image-to-Video Adapter (MIVA), a lightweight sub-network attachable to a pre-trained DM, each designed to capture a single motion pattern and scalable via parallelization. MIVAs can be efficiently trained on approximately ten samples using a single consumer-grade GPU. At inference time, users can specify motion by selecting one or multiple MIVAs, eliminating the need for prompt engineering. Extensive experiments demonstrate that MIVA enables more precise motion control while maintaining, or even surpassing, the generation quality of models trained on significantly larger datasets.
zh
[CV-62] A Dual-Branch Local-Global Framework for Cross-Resolution Land Cover Mapping
【速读】:该论文旨在解决跨分辨率地表覆盖制图(cross-resolution land cover mapping)中因分辨率差异导致的监督信号不匹配问题,即如何在粗粒度标签(coarse supervision)条件下有效学习并生成高分辨率语义预测。现有弱监督方法难以对齐细粒度空间结构与粗标签,从而引入噪声并降低精度。其解决方案的关键在于提出DDTM框架,该框架采用双分支架构:一是基于扩散模型(diffusion-based)的分支,用于在粗标签约束下逐步精细化局部语义;二是基于Transformer的分支,确保大范围空间内的全局上下文一致性。此外,设计了伪标签置信度评估模块以识别并利用可靠的监督信号,从而缓解跨分辨率不一致带来的噪声干扰。
链接: https://arxiv.org/abs/2512.19990
作者: Peng Gao,Ke Li,Di Wang,Yongshan Zhu,Yiming Zhang,Xuemei Luo,Yifeng Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Cross-resolution land cover mapping aims to produce high-resolution semantic predictions from coarse or low-resolution supervision, yet the severe resolution mismatch makes effective learning highly challenging. Existing weakly supervised approaches often struggle to align fine-grained spatial structures with coarse labels, leading to noisy supervision and degraded mapping accuracy. To tackle this problem, we propose DDTM, a dual-branch weakly supervised framework that explicitly decouples local semantic refinement from global contextual reasoning. Specifically, DDTM introduces a diffusion-based branch to progressively refine fine-scale local semantics under coarse supervision, while a transformer-based branch enforces long-range contextual consistency across large spatial extents. In addition, we design a pseudo-label confidence evaluation module to mitigate noise induced by cross-resolution inconsistencies and to selectively exploit reliable supervisory signals. Extensive experiments demonstrate that DDTM establishes a new state-of-the-art on the Chesapeake Bay benchmark, achieving 66.52% mIoU and substantially outperforming prior weakly supervised methods. The code is available at this https URL.
zh
[CV-63] A Novel CNN Gradient Boosting Ensemble for Guava Disease Detection
【速读】:该论文旨在解决孟加拉国本地栽培番石榴(Psidium guajava)因炭疽病(Anthracnose)和果实蝇(Fruit Fly)感染导致的产量下降与品质劣化问题,从而提升农业经济收益。解决方案的关键在于构建一种基于卷积神经网络(CNN)与传统机器学习(ML)相结合的集成模型(Ensemble Model),通过CNN提取图像特征并融合梯度提升机(Gradient Boosting Machine, GBM)进行分类决策,最终实现对健康、果实蝇侵害及炭疽病三种类别的高精度识别(准确率约99.99%),具备在实时农业监测系统中部署的能力。
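“CNN 提特征 + 梯度提升机分类”的级联集成可以用现成组件快速搭出骨架。下面的草图用预训练 ResNet-18 充当特征提取器(论文所用 CNN 结构未知,属假设),再用 scikit-learn 的 GradientBoostingClassifier 在特征上训练,仅示意流程。

```python
import torch
import torch.nn as nn
import torchvision.models as models
from sklearn.ensemble import GradientBoostingClassifier

# 用预训练 ResNet-18 充当 CNN 特征提取器(论文具体 CNN 结构未知,属假设)
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = nn.Identity()          # 去掉分类头,输出 512 维特征
backbone.eval()

@torch.no_grad()
def extract_features(images):        # images: (N, 3, 224, 224) 的归一化张量
    return backbone(images).numpy()

# 假设 train_imgs / y_train 等数据已准备好:
# gbm = GradientBoostingClassifier(n_estimators=200)
# gbm.fit(extract_features(train_imgs), y_train)
# y_pred = gbm.predict(extract_features(test_imgs))
```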
链接: https://arxiv.org/abs/2512.19989
作者: Tamim Ahasan Rijon,Yeasin Arafath
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at IEEE ICCIT 2025. This is the author accepted manuscript
Abstract:As a significant agricultural country, Bangladesh utilizes its fertile land for guava cultivation and dedicated labor to boost its economic development. In a nation like Bangladesh, enhancing guava production and agricultural practices plays a crucial role in its economy. Anthracnose and fruit fly infection can lower the quality and productivity of guava, a crucial tropical fruit. Expert systems that detect diseases early can reduce losses and safeguard the harvest. Images of guava fruits classified into the Healthy, Fruit Flies, and Anthracnose classes are included in the Guava Fruit Disease Dataset 2024 (GFDD24), which comes from plantations in Rajshahi and Pabna, Bangladesh. This study aims to create models using CNN alongside traditional machine learning techniques that can effectively identify guava diseases in locally cultivated varieties in Bangladesh. In order to achieve the highest classification accuracy of approximately 99.99% for the guava dataset, we propose utilizing ensemble models that combine CNN-ML with a Gradient Boosting Machine. In general, the CNN-ML cascade framework exhibits strong, high-accuracy guava disease detection that is appropriate for real-time agricultural monitoring systems.
zh
[CV-64] WSD-MIL: Window Scale Decay Multiple Instance Learning for Whole Slide Image Classification
【速读】:该论文旨在解决现有基于Transformer的多实例学习(Multiple Instance Learning, MIL)方法在处理全切片图像(Whole Slide Image, WSI)时面临的两大问题:一是由于注意力机制的二次计算复杂度导致难以扩展到大规模WSI;二是固定尺度的注意力机制无法适应不同WSI中肿瘤区域尺度差异,且忽略了patch间相关性随距离衰减的特性。解决方案的关键在于提出窗口尺度衰减多实例学习(Window Scale Decay MIL, WSD-MIL),其核心创新包括:1)基于窗口尺度衰减的注意力模块,通过聚类采样策略降低计算成本,并逐步缩小注意力窗口以捕捉不同尺度下的局部实例关系;2)基于挤压-激励机制的区域门控模块,动态调整窗口权重以增强全局信息建模能力。实验表明,WSD-MIL在CAMELYON16和TCGA-BRCA数据集上达到最优性能,同时减少62%的内存消耗。
链接: https://arxiv.org/abs/2512.19982
作者: Le Feng,Li Xiao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In recent years, the integration of pre-trained foundational models with multiple instance learning (MIL) has improved diagnostic accuracy in computational pathology. However, existing MIL methods focus on optimizing feature extractors and aggregation strategies while overlooking the complex semantic relationships among instances within whole slide image (WSI). Although Transformer-based MIL approaches aim to model instance dependencies, their quadratic computational complexity limits scalability to large-scale WSIs. Moreover, due to the pronounced variations in tumor region scales across different WSIs, existing Transformer-based methods employing fixed-scale attention mechanisms face significant challenges in precisely capturing local instance correlations and fail to account for the distance-based decay effect of patch relevance. To address these challenges, we propose window scale decay MIL (WSD-MIL), designed to enhance the capacity to model tumor regions of varying scales while improving computational efficiency. WSD-MIL comprises: 1) a window scale decay based attention module, which employs a cluster-based sampling strategy to reduce computational costs while progressively decaying attention window-scale to capture local instance relationships at varying scales; and 2) a squeeze-and-excitation based region gate module, which dynamically adjusts window weights to enhance global information modeling. Experimental results demonstrate that WSD-MIL achieves state-of-the-art performance on the CAMELYON16 and TCGA-BRCA datasets while reducing computational memory by 62%. The code will be publicly available.
zh
[CV-65] HistoWAS: A Pathomics Framework for Large-Scale Feature-Wide Association Studies of Tissue Topology and Patient Outcomes
【速读】:该论文旨在解决组织病理学中微环境与宏环境特征的空间交互关系及其与临床参数关联性不足的问题,从而提升组织特征在精准医学中的临床相关性。其解决方案的关键在于提出HistoWAS(Histology-Wide Association Study)计算框架,该框架包含两个核心组件:一是通过引入30个源自地理信息系统(Geographic Information Systems, GIS)点模式分析的拓扑与空间特征,扩展传统基于对象的指标以量化组织微架构;二是借鉴表型组关联研究(Phenome-Wide Association Study, PheWAS)思想,构建大规模单变量回归分析引擎,并进行统计校正,实现组织空间特征与临床结局的系统关联分析。
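关联分析引擎的做法与 PheWAS 类似:对每个空间特征单独回归临床结局,再做多重检验校正。下面用 statsmodels 给出连续型结局下的 OLS 版本示意;回归类型与 Benjamini-Hochberg FDR 校正是此处假设的具体化,论文摘要未指明这些细节。

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.multitest import multipletests

def histowas_scan(features, outcome, alpha=0.05):
    """类 PheWAS 的大规模单变量回归示意:逐特征 OLS 回归 + FDR 校正。
    features: (N, F) 空间特征矩阵;outcome: 长度 N 的连续型临床结局。"""
    pvals = []
    for j in range(features.shape[1]):
        X = sm.add_constant(features[:, j])           # 加截距项
        pvals.append(sm.OLS(outcome, X).fit().pvalues[1])
    reject, p_adj, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    return np.asarray(pvals), p_adj, reject
```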
链接: https://arxiv.org/abs/2512.19954
作者: Yuechen Yang,Junlin Guo,Yanfan Zhu,Jialin Yue,Junchao Zhu,Yu Wang,Shilin Zhao,Haichun Yang,Xingyi Guo,Jovan Tanevski,Laura Barisoni,Avi Z. Rosenberg,Yuankai Huo
机构: Vanderbilt University (范德堡大学); Washington University in St. Louis (圣路易斯华盛顿大学); Vanderbilt University Medical Center (范德堡大学医学中心); Heidelberg University (海德堡大学); Heidelberg University Hospital (海德堡大学医院); Duke University (杜克大学); Johns Hopkins University School of Medicine (约翰霍普金斯大学医学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:High-throughput “pathomic” analysis of Whole Slide Images (WSIs) offers new opportunities to study tissue characteristics and for biomarker discovery. However, the clinical relevance of the tissue characteristics at the micro- and macro-environment level is limited by the lack of tools that facilitate the measurement of the spatial interaction of individual structure characteristics and their association with clinical parameters. To address these challenges, we introduce HistoWAS (Histology-Wide Association Study), a computational framework designed to link tissue spatial organization to clinical outcomes. Specifically, HistoWAS implements (1) a feature space that augments conventional metrics with 30 topological and spatial features, adapted from Geographic Information Systems (GIS) point pattern analysis, to quantify tissue micro-architecture; and (2) an association study engine, inspired by Phenome-Wide Association Studies (PheWAS), that performs mass univariate regression for each feature with statistical correction. As a proof of concept, we applied HistoWAS to analyze a total of 102 features (72 conventional object-level features and our 30 spatial features) using 385 PAS-stained WSIs from 206 participants in the Kidney Precision Medicine Project (KPMP). The code and data have been released to this https URL.
zh
[CV-66] How Much 3D Do Video Foundation Models Encode?
【速读】:该论文旨在解决视频基础模型(Video Foundation Models, VidFMs)在大规模视频数据预训练后是否能自然涌现出对3D世界的理解这一关键问题。其解决方案的核心在于提出首个与模型无关的评估框架,通过浅层读出(shallow read-outs)从VidFMs提取的特征中估计多种3D属性,从而量化不同模型的3D感知能力。该方法无需依赖特定模型结构或任务设计,实现了对现有VidFMs在多个维度上的3D意识水平的系统性测量,并揭示了当前最先进的视频生成模型虽未接触任何3D数据,却展现出强于专门训练用于3D任务的专家模型的3D理解能力。
链接: https://arxiv.org/abs/2512.19949
作者: Zixuan Huang,Xiang Li,Zhaoyang Lv,James M. Rehg
机构: University of Illinois at Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Impossible, Inc (不可能公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project Page: this https URL
Abstract:Videos are continuous 2D projections of 3D worlds. After training on large video data, will global 3D understanding naturally emerge? We study this by quantifying the 3D understanding of existing Video Foundation Models (VidFMs) pretrained on vast video data. We propose the first model-agnostic framework that measures the 3D awareness of various VidFMs by estimating multiple 3D properties from their features via shallow read-outs. Our study presents meaningful findings regarding the 3D awareness of VidFMs on multiple axes. In particular, we show that state-of-the-art video generation models exhibit a strong understanding of 3D objects and scenes, despite not being trained on any 3D data. Such understanding can even surpass that of large expert models specifically trained for 3D tasks. Our findings, together with the 3D benchmarking of major VidFMs, provide valuable observations for building scalable 3D models.
zh
[CV-67] SE360: Semantic Edit in 360° Panoramas via Hierarchical Data Construction
【速读】:该论文旨在解决360°全景图像中基于指令的图像编辑任务中存在的不合理结果问题,尤其是在等距圆柱投影(equirectangular projection, ERP)和透视视图下现有方法难以保持语义一致性与几何合理性的问题。其解决方案的关键在于提出SE360框架,核心创新包括:1)一种无需人工干预的粗粒度到细粒度自主数据生成管道,利用视觉-语言模型(Vision-Language Model, VLM)和自适应投影调整实现层级化分析,确保对象及其物理上下文的整体分割;2)一种低成本的两阶段数据精炼策略,提升数据真实感并缓解模型对擦除伪影的过拟合;最终基于构建的数据集训练了一个基于Transformer的扩散模型,支持文本、掩码或参考图像引导下的灵活对象编辑,在视觉质量和语义准确性上均优于现有方法。
链接: https://arxiv.org/abs/2512.19943
作者: Haoyi Zhong,Fang-Lue Zhang,Andrew Chalmers,Taehyun Rhee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While instruction-based image editing is emerging, extending it to 360° panoramas introduces additional challenges. Existing methods often produce implausible results in both equirectangular projections (ERP) and perspective views. To address these limitations, we propose SE360, a novel framework for multi-condition guided object editing in 360° panoramas. At its core is a novel coarse-to-fine autonomous data generation pipeline without manual intervention. This pipeline leverages a Vision-Language Model (VLM) and adaptive projection adjustment for hierarchical analysis, ensuring the holistic segmentation of objects and their physical context. The resulting data pairs are both semantically meaningful and geometrically consistent, even when sourced from unlabeled panoramas. Furthermore, we introduce a cost-effective, two-stage data refinement strategy to improve data realism and mitigate model overfitting to erase artifacts. Based on the constructed dataset, we train a Transformer-based diffusion model to allow flexible object editing guided by text, mask, or reference image in 360° panoramas. Our experiments demonstrate that our method outperforms existing methods in both visual quality and semantic accuracy.
zh
[CV-68] Block-Recurrent Dynamics in Vision Transformers
【速读】:该论文试图解决视觉 Transformer (Vision Transformer, ViT) 中深度结构的计算机制不明确问题,即缺乏一个能将 Transformer 深度视为具有良好表征的动态流(well-characterized flow)的统一框架。其解决方案的关键在于提出块循环假设(Block-Recurrent Hypothesis, BRH),认为训练好的 ViT 可以被重写为仅使用少量(k ≪ L)不同块并重复应用的循环结构,从而揭示出 ViT 内部存在少数连续阶段(contiguous phases)的可复用计算模式。基于此假设,作者进一步构建了相位结构 Transformer 的循环近似模型(Raptor),并通过实验证明:在等效计算成本下,仅用 2 个块即可恢复 DINOv2 ImageNet-1k 线性探针准确率的 96%,验证了 BRH 的有效性,并由此发展出一套基于动力学系统的可解释性研究范式(Dynamical Interpretability)。
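块循环假设(BRH)的可操作形式是:只保留 k 个不同 block,按各相位的长度循环展开以替代原有 L 层。下面的 PyTorch 草图给出这种相位结构循环模型(对应文中 Raptor)的结构骨架;blocks 的来源与 schedule 中的重复次数需由对预训练 ViT 的相位分析确定,此处仅为结构示意。

```python
import torch.nn as nn

class Raptor(nn.Module):
    """块循环近似(BRH 的可操作形式)示意:仅用 k 个可复用 block,
    按各相位的重复次数循环展开,替代原有 L 层。"""
    def __init__(self, blocks, schedule):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)  # k 个不同的 Transformer block
        self.schedule = schedule             # 如 [(0, 6), (1, 18)]:block 0 重复 6 次,block 1 重复 18 次

    def forward(self, x):
        for block_idx, repeats in self.schedule:
            for _ in range(repeats):         # 同一 block 循环应用
                x = self.blocks[block_idx](x)
        return x
```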
链接: https://arxiv.org/abs/2512.19941
作者: Mozes Jacobs,Thomas Fel,Richard Hakim,Alessandra Brondetta,Demba Ba,T. Andy Keller
机构: Kempner Institute, Harvard University (哈佛大学肯普纳研究所); University of Osnabrück (奥斯纳布吕克大学); Harvard University (哈佛大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 25 pages, 15 figures
Abstract:As Vision Transformers (ViTs) become standard vision backbones, a mechanistic account of their computational phenomenology is essential. Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow. In this work, we introduce the Block-Recurrent Hypothesis (BRH), arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original L blocks can be accurately rewritten using only k ≪ L distinct blocks applied recurrently. Across diverse ViTs, between-layer representational similarity matrices suggest few contiguous phases. To determine whether these phases reflect genuinely reusable computation, we train block-recurrent surrogates of pretrained ViTs: Recurrent Approximations to Phase-structured TransfORmers (Raptor). In small-scale, we demonstrate that stochastic depth and training promote recurrent structure and subsequently correlate with our ability to accurately fit Raptor. We then provide an empirical existence proof for BRH by training a Raptor model to recover 96% of DINOv2 ImageNet-1k linear probe accuracy in only 2 blocks at equivalent computational cost. Finally, we leverage our hypothesis to develop a program of Dynamical Interpretability. We find i) directional convergence into class-dependent angular basins with self-correcting trajectories under small perturbations, ii) token-specific dynamics, where cls executes sharp late reorientations while patch tokens exhibit strong late-stage coherence toward their mean direction, and iii) a collapse to low rank updates in late depth, consistent with convergence to low-dimensional attractors. Altogether, we find a compact recurrent program emerges along ViT depth, pointing to a low-complexity normative solution that enables these models to be studied through principled dynamical systems analysis.
zh
[CV-69] Vehicle-centric Perception via Multimodal Structured Pre-training AAAI2024
【速读】:该论文旨在解决现有车辆感知(vehicle-centric perception)预训练方法在预训练阶段缺乏对车辆相关知识的有效学习,导致模型难以建模通用的车辆感知表征的问题。解决方案的关键在于提出VehicleMAE-V2,一种面向车辆中心的预训练大模型,其通过引入三种结构化先验(structured priors)指导掩码标记重建过程:对称性引导的掩码模块(Symmetry-guided Mask Module, SMM)利用车辆对称约束选择高质量掩码图像块以减少冗余;轮廓引导的表示模块(Contour-guided Representation Module, CRM)最小化轮廓特征与重建特征之间的概率分布差异,保留车辆整体结构信息;语义引导的表示模块(Semantics-guided Representation Module, SRM)通过对比学习和跨模态蒸馏对齐图文特征,缓解因语义理解不足导致的特征混淆问题。
链接: https://arxiv.org/abs/2512.19934
作者: Wentao Wu,Xiao Wang,Chenglong Li,Jin Tang,Bin Luo
机构: Anhui University (安徽大学); Information Materials and Intelligent Sensing Laboratory of Anhui Province (安徽省信息材料与智能感知重点实验室); Anhui Provincial Key Laboratory of Multimodal Cognitive Computation (安徽省多模态认知计算重点实验室); School of Artificial Intelligence (人工智能学院); School of Computer Science and Technology (计算机科学与技术学院); Institute of Artificial Intelligence (人工智能研究院); Hefei Comprehensive National Science Center (合肥综合性国家科学中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Journal extension of VehicleMAE (AAAI 2024)
Abstract:Vehicle-centric perception plays a crucial role in many intelligent systems, including large-scale surveillance systems, intelligent transportation, and autonomous driving. Existing approaches lack effective learning of vehicle-related knowledge during pre-training, resulting in poor capability for modeling general vehicle perception representations. To handle this problem, we propose VehicleMAE-V2, a novel vehicle-centric pre-trained large model. By exploring and exploiting vehicle-related multimodal structured priors to guide the masked token reconstruction process, our approach can significantly enhance the model’s capability to learn generalizable representations for vehicle-centric perception. Specifically, we design the Symmetry-guided Mask Module (SMM), Contour-guided Representation Module (CRM) and Semantics-guided Representation Module (SRM) to incorporate three kinds of structured priors into token reconstruction including symmetry, contour and semantics of vehicles respectively. SMM utilizes the vehicle symmetry constraints to avoid retaining symmetric patches and can thus select high-quality masked image patches and reduce information redundancy. CRM minimizes the probability distribution divergence between contour features and reconstructed features and can thus preserve holistic vehicle structure information during pixel-level reconstruction. SRM aligns image-text features through contrastive learning and cross-modal distillation to address the feature confusion caused by insufficient semantic understanding during masked reconstruction. To support the pre-training of VehicleMAE-V2, we construct Autobot4M, a large-scale dataset comprising approximately 4 million vehicle images and 12,693 text descriptions. Extensive experiments on five downstream tasks demonstrate the superior performance of VehicleMAE-V2.
zh
[CV-70] Unified Brain Surface and Volume Registration
【速读】:该论文旨在解决脑部MRI图像中皮层与皮层下区域在跨被试分析时因传统方法分别处理体积和表面配准而导致的不一致性问题(inconsistencies),从而影响下游分析的准确性。其解决方案的关键在于提出一种名为NeurAlign的深度学习框架,通过统一的体积-表面表示联合对齐大脑皮层和皮层下结构;该框架利用中间球面坐标空间将解剖表面拓扑与体积解剖相连接,确保体积域与表面域之间的几何一致性,同时显著提升配准精度(Dice分数最高提升7点)和推理速度(比标准方法快数个数量级),且无需额外输入,仅需原始MRI扫描即可完成高质量配准。
链接: https://arxiv.org/abs/2512.19928
作者: S. Mazdak Abulnaga,Andrew Hoopes,Malte Hoffmann,Robin Magnet,Maks Ovsjanikov,Lilla Zöllei,John Guttag,Bruce Fischl,Adrian Dalca
机构: MIT Computer Science and Artificial Intelligence Laboratory (麻省理工学院计算机科学与人工智能实验室); Massachusetts General Hospital, Harvard Medical School (马萨诸塞州总医院,哈佛医学院); Université Paris Cité, INRIA (巴黎城市大学,法国国家信息与自动化研究院); LIX, CNRS, École Polytechnique (LIX,法国国家科学研究中心,巴黎综合理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate registration of brain MRI scans is fundamental for cross-subject analysis in neuroscientific studies. This involves aligning both the cortical surface of the brain and the interior volume. Traditional methods treat volumetric and surface-based registration separately, which often leads to inconsistencies that limit downstream analyses. We propose a deep learning framework, NeurAlign, that registers 3D brain MRI images by jointly aligning both cortical and subcortical regions through a unified volume-and-surface-based representation. Our approach leverages an intermediate spherical coordinate space to bridge anatomical surface topology with volumetric anatomy, enabling consistent and anatomically accurate alignment. By integrating spherical registration into the learning, our method ensures geometric coherence between volume and surface domains. In a series of experiments on both in-domain and out-of-domain datasets, our method consistently outperforms both classical and machine learning-based registration methods – improving the Dice score by up to 7 points while maintaining regular deformation fields. Additionally, it is orders of magnitude faster than the standard method for this task, and is simpler to use because it requires no additional inputs beyond an MRI scan. With its superior accuracy, fast inference, and ease of use, NeurAlign sets a new standard for joint cortical and subcortical registration.
zh
[CV-71] Widget2Code: From Visual Widgets to UI Code via Multimodal LLM s
【速读】:该论文旨在解决应用界面(UI)到代码(UI2Code)生成中长期被忽视的“组件化小部件”(widget)场景下的代码生成问题。与网页或移动界面不同,小部件具有紧凑、无上下文依赖的特点,且其设计通常为专有格式,缺乏可访问的标记数据,导致现有方法难以有效建模其结构与语义。解决方案的关键在于提出一个图像驱动的小部件到代码(Widget2Code)基准测试框架,并构建名为WidgetFactory的端到端系统:在感知层面,基于小部件设计原则将原子组件组合成完整布局,集成图标检索与可重用可视化模块;在系统层面,设计了一个与框架无关的小部件领域特定语言(WidgetDSL)及其编译器,可生成多种前端实现(如React、HTML/CSS),并引入自适应渲染模块以满足空间紧凑性约束,从而显著提升视觉保真度和代码可靠性。
链接: https://arxiv.org/abs/2512.19918
作者: Houston H. Zhang,Tao Zhang,Baoze Lin,Yuanqi Xue,Yincheng Zhu,Huan Liu,Li Gu,Linfeng Ye,Ziqiang Wang,Xinxin Zuo,Yang Wang,Yuanhao Yu,Zhixiang Chi
机构: McMaster University (麦克马斯特大学); University of Toronto (多伦多大学); University of Waterloo (滑铁卢大学); Concordia University (康考迪亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL
Abstract:User interface to code (UI2Code) aims to generate executable code that can faithfully reconstruct a given input UI. Prior work focuses largely on web pages and mobile screens, leaving app widgets underexplored. Unlike web or mobile UIs with rich hierarchical context, widgets are compact, context-free micro-interfaces that summarize key information through dense layouts and iconography under strict spatial constraints. Moreover, while (image, code) pairs are widely available for web or mobile UIs, widget designs are proprietary and lack accessible markup. We formalize this setting as the Widget-to-Code (Widget2Code) and introduce an image-only widget benchmark with fine-grained, multi-dimensional evaluation metrics. Benchmarking shows that although generalized multimodal large language models (MLLMs) outperform specialized UI2Code methods, they still produce unreliable and visually inconsistent code. To address these limitations, we develop a baseline that jointly advances perceptual understanding and structured code generation. At the perceptual level, we follow widget design principles to assemble atomic components into complete layouts, equipped with icon retrieval and reusable visualization modules. At the system level, we design an end-to-end infrastructure, WidgetFactory, which includes a framework-agnostic widget-tailored domain-specific language (WidgetDSL) and a compiler that translates it into multiple front-end implementations (e.g., React, HTML/CSS). An adaptive rendering module further refines spatial dimensions to satisfy compactness constraints. Together, these contributions substantially enhance visual fidelity, establishing a strong baseline and unified infrastructure for future Widget2Code research.
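The WidgetDSL itself is not specified in the abstract, but the compile-to-multiple-frontends idea can be illustrated with a toy example. The dict schema and node types below are invented for illustration only; the paper's DSL and compiler are far more expressive (React and HTML/CSS backends, layout constraints, icon retrieval).

```python
# Toy illustration of compiling a framework-agnostic widget spec into HTML.
# The schema here is hypothetical, not the paper's WidgetDSL.
widget = {
    "type": "column",
    "children": [
        {"type": "text", "value": "Today", "style": "font-size:12px"},
        {"type": "text", "value": "21°C", "style": "font-size:28px;font-weight:bold"},
    ],
}

def to_html(node):
    """Recursively translate one widget node into an HTML fragment."""
    if node["type"] == "text":
        return f'<span style="{node.get("style", "")}">{node["value"]}</span>'
    if node["type"] == "column":
        inner = "".join(to_html(c) for c in node["children"])
        return f'<div style="display:flex;flex-direction:column">{inner}</div>'
    raise ValueError(f"unknown node type: {node['type']}")

print(to_html(widget))
```

A second backend (e.g., React/JSX) would add another branch per node type, which is the sense in which a single DSL tree can be compiled into multiple front-end implementations.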
zh
[CV-72] HyGE-Occ: Hybrid View-Transformation with 3D Gaussian and Edge Priors for 3D Panoptic Occupancy Prediction
【速读】:该论文针对3D全景占用预测(3D Panoptic Occupancy Prediction)中几何精度不足与实例空间范围界定不准确的问题,提出了解决方案。其关键在于引入HyGE-Occ框架,通过融合基于连续高斯的深度表示与离散深度bin的公式化方法,构建具有更好几何一致性与结构一致性的BEV特征;同时,从BEV特征中提取边缘图作为辅助信息,以增强边界感知能力,从而提升复杂环境下的精细3D场景理解与实例分离性能。
链接: https://arxiv.org/abs/2512.19871
作者: Jong Wook Kim,Wonseok Roh,Ha Dam Baek,Pilhyeon Lee,Jonghyun Choi,Sangpil Kim
机构: Korea University (韩国大学); Hyundai Motor Company (现代汽车公司); Inha University (仁川大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 6 figures
Abstract:3D Panoptic Occupancy Prediction aims to reconstruct a dense volumetric scene map by predicting the semantic class and instance identity of every occupied region in 3D space. Achieving such fine-grained 3D understanding requires precise geometric reasoning and spatially consistent scene representation across complex environments. However, existing approaches often struggle to maintain precise geometry and capture the precise spatial range of 3D instances critical for robust panoptic separation. To overcome these limitations, we introduce HyGE-Occ, a novel framework that leverages a hybrid view-transformation branch with 3D Gaussian and edge priors to enhance both geometric consistency and boundary awareness in 3D panoptic occupancy prediction. HyGE-Occ employs a hybrid view-transformation branch that fuses a continuous Gaussian-based depth representation with a discretized depth-bin formulation, producing BEV features with improved geometric consistency and structural coherence. In parallel, we extract edge maps from BEV features and use them as auxiliary information to learn edge cues. In our extensive experiments on the Occ3D-nuScenes dataset, HyGE-Occ outperforms existing work, demonstrating superior 3D geometric reasoning.
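As a rough sketch of the hybrid view-transformation idea, a continuous Gaussian depth estimate and a discrete depth-bin distribution can be expressed over the same bins and blended. The parameter names and the simple convex blending rule below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def gaussian_over_bins(mu, sigma, bin_centers):
    """Evaluate a continuous Gaussian depth estimate on discrete bins,
    normalized so it is comparable to a categorical depth-bin head."""
    w = np.exp(-0.5 * ((bin_centers - mu) / sigma) ** 2)
    return w / w.sum()

bin_centers = np.linspace(1.0, 60.0, 60)           # depth bins in meters
p_gauss = gaussian_over_bins(mu=12.3, sigma=1.5, bin_centers=bin_centers)
p_bins = np.random.dirichlet(np.ones(60))           # stand-in for a bin-classifier output

alpha = 0.5                                         # assumed blending weight
p_hybrid = alpha * p_gauss + (1 - alpha) * p_bins   # fused depth distribution
expected_depth = (p_hybrid * bin_centers).sum()     # depth used for BEV lifting
print(round(expected_depth, 2))
```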
zh
[CV-73] RANSAC Scoring Functions: Analysis and Reality Check
【速读】:该论文旨在解决RANSAC(Random Sample Consensus)中候选几何模型评分函数的设计问题,即如何为几何模型拟合结果赋予一个合理的质量度量(score),以提升鲁棒几何估计的性能。其关键解决方案在于:首先将传统基于高斯噪声的几何误差扩展至球面噪声场景,其次在存在异常值的鲁棒设定下,通过混合均匀分布的异常值模型,提出统一框架来理解基于似然和M-估计的评分函数及其局部优化方法;进一步分析表明,当前表现最优的MAGSAC++算法所采用的评分函数,在理论上并不符合严谨推导原则,实际上等价于一个简单的高斯-均匀混合似然模型,且实验验证显示该评分函数在性能上并未优于基础模型,也未展现出对阈值超参数更低的敏感性。这一发现揭示了现有先进方法的理论局限性,并提出了基于大规模或随机验证集的标准化评估方法,为未来鲁棒几何估计研究提供了更清晰的方向。
链接: https://arxiv.org/abs/2512.19850
作者: A. Shekhovtsov
机构: Czech Technical University in Prague (捷克技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
备注:
Abstract:We revisit the problem of assigning a score (a quality of fit) to candidate geometric models – one of the key components of RANSAC for robust geometric fitting. In a non-robust setting, the "gold standard" scoring function, known as the geometric error, follows from a probabilistic model with Gaussian noises. We extend it to spherical noises. In a robust setting, we consider a mixture with uniformly distributed outliers and show that a threshold-based parameterization leads to a unified view of likelihood-based and robust M-estimators and associated local optimization schemes. Next we analyze MAGSAC++, which stands out for two reasons. First, it achieves the best results according to existing benchmarks. Second, it makes quite different modeling assumptions and derivation steps. We discovered, however, that the derivation does not correspond to sound principles and the resulting score function is in fact numerically equivalent to a simple Gaussian-uniform likelihood, a basic model within the proposed framework. Finally, we propose an experimental methodology for evaluating scoring functions: assuming either a large validation set, or a small random validation set in expectation. We find that all scoring functions, including using a learned inlier distribution, perform identically. In particular, the MAGSAC++ score is found to be neither better performing than simple contenders nor less sensitive to the choice of the threshold hyperparameter. Our theoretical and experimental analysis thus comprehensively revisits the state-of-the-art, which is critical for any future research seeking to improve the methods or apply them to other robust fitting problems.
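The Gaussian-uniform mixture likelihood that the paper identifies as the effective MAGSAC++ score is simple to write down. Below is a minimal sketch, with illustrative values for the inlier scale σ, outlier fraction ε, and residual range, alongside the classic inlier-count consensus score for comparison.

```python
import numpy as np

def mixture_score(residuals, sigma=1.0, eps=0.4, residual_range=100.0):
    """Log-likelihood of residuals under a Gaussian-uniform mixture:
    inliers ~ N(0, sigma^2), outliers ~ Uniform(0, residual_range)."""
    gauss = np.exp(-0.5 * (residuals / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    uniform = 1.0 / residual_range
    return float(np.sum(np.log((1 - eps) * gauss + eps * uniform)))

def inlier_count_score(residuals, threshold=3.0):
    """Classic RANSAC consensus: number of residuals under a threshold."""
    return int(np.sum(residuals < threshold))

rng = np.random.default_rng(0)
residuals = np.abs(np.concatenate([rng.normal(0, 1.0, 80),        # inliers
                                   rng.uniform(0, 100.0, 20)]))   # outliers
print(mixture_score(residuals), inlier_count_score(residuals))
```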
zh
[CV-74] Learning to Refocus with Video Diffusion Models SIGGRAPH WWW
【速读】:该论文旨在解决传统自动对焦系统在拍摄时难以准确捕捉预期主体,以及用户在拍摄后无法灵活调整焦点的问题。其核心挑战在于如何从单张模糊图像中重建出具有真实感的焦距堆栈(focal stack),从而实现拍摄后的交互式重新对焦。解决方案的关键在于利用视频扩散模型(video diffusion models)从单一模糊图像中生成感知上准确的焦距序列,该序列以视频形式表示,不仅支持实时交互式调焦,还为后续多种下游应用提供了基础。研究团队同时发布了大规模真实场景下智能手机采集的焦距堆栈数据集,显著提升了方法的泛化能力和实用性。
链接: https://arxiv.org/abs/2512.19823
作者: SaiKiran Tedla,Zhoutong Zhang,Xuaner Zhang,Shumian Xin
机构: Adobe; York University (约克大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code and data are available at this https URL . SIGGRAPH Asia 2025, Dec. 2025
Abstract:Focus is a cornerstone of photography, yet autofocus systems often fail to capture the intended subject, and users frequently wish to adjust focus after capture. We introduce a novel method for realistic post-capture refocusing using video diffusion models. From a single defocused image, our approach generates a perceptually accurate focal stack, represented as a video sequence, enabling interactive refocusing and unlocking a range of downstream applications. We release a large-scale focal stack dataset acquired under diverse real-world smartphone conditions to support this work and future research. Our method consistently outperforms existing approaches in both perceptual quality and robustness across challenging scenarios, paving the way for more advanced focus-editing capabilities in everyday photography. Code and data are available at this http URL
zh
[CV-75] Generating the Past Present and Future from a Motion-Blurred Image
【速读】:该论文旨在解决从运动模糊图像中恢复场景在曝光时刻的复杂动态信息,包括过去和未来的状态,而不仅限于复原清晰图像或预测单一帧。传统方法依赖手工设计先验或特定网络结构来处理该逆问题中的歧义,且缺乏大规模数据驱动的图像与视频先验,导致难以还原复杂的场景动态。本文的关键解决方案是利用一个在互联网规模数据集上预训练的视频扩散模型(video diffusion model),通过其强大的泛化能力重构出包含曝光瞬间前后动态的视频序列,从而实现对场景运动、相机轨迹及动态三维结构的恢复,显著优于现有方法并适用于真实世界复杂场景。
链接: https://arxiv.org/abs/2512.19817
作者: SaiKiran Tedla,Kelly Zhu,Trevor Canham,Felix Taubner,Michael S. Brown,Kiriakos N. Kutulakos,David B. Lindell
机构: York University (约克大学); University of Toronto (多伦多大学); Vector Institute (向量研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Code and data are available at this https URL
Abstract:We seek to answer the question: what can a motion-blurred image reveal about a scene’s past, present, and future? Although motion blur obscures image details and degrades visual quality, it also encodes information about scene and camera motion during an exposure. Previous techniques leverage this information to estimate a sharp image from an input blurry one, or to predict a sequence of video frames showing what might have occurred at the moment of image capture. However, they rely on handcrafted priors or network architectures to resolve ambiguities in this inverse problem, and do not incorporate image and video priors on large-scale datasets. As such, existing methods struggle to reproduce complex scene dynamics and do not attempt to recover what occurred before or after an image was taken. Here, we introduce a new technique that repurposes a pre-trained video diffusion model trained on internet-scale datasets to recover videos revealing complex scene dynamics during the moment of capture and what might have occurred immediately into the past or future. Our approach is robust and versatile; it outperforms previous methods for this task, generalizes to challenging in-the-wild images, and supports downstream tasks such as recovering camera trajectories, object motion, and dynamic 3D scene structure. Code and data are available at this https URL
zh
[CV-76] Exploring Deep-to-Shallow Transformable Neural Networks for Intelligent Embedded Systems
【速读】:该论文旨在解决嵌入式智能系统中深度神经网络(Deep Neural Networks, DNNs)在资源受限环境下因网络深度增加导致硬件效率下降的问题,即准确性与硬件效率之间的权衡困境。解决方案的关键在于提出一种可变结构的神经架构搜索(Neural Architecture Search, NAS)范式——Double-Win NAS,其核心思想是先通过自动搜索获得高准确性的深层网络,再将其等效转换为浅层结构以提升硬件执行效率,从而实现“双胜”:既赢得强准确性,又赢得高硬件效率。此外,论文还引入了混合可变换训练和任意分辨率弹性训练两项增强技术,进一步优化训练精度并支持跨输入分辨率的自然网络弹性。
链接: https://arxiv.org/abs/2512.19731
作者: Xiangzhong Luo,Weichen Liu
机构: Southeast University (东南大学); Nanyang Technological University (南洋理工大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Abstract:Thanks to the evolving network depth, convolutional neural networks (CNNs) have achieved remarkable success across various embedded scenarios, paving the way for ubiquitous embedded intelligence. Despite its promise, the evolving network depth comes at the cost of degraded hardware efficiency. In contrast to deep networks, shallow networks can deliver superior hardware efficiency but often suffer from inferior accuracy. To address this dilemma, we propose Double-Win NAS, a novel deep-to-shallow transformable neural architecture search (NAS) paradigm tailored for resource-constrained intelligent embedded systems. Specifically, Double-Win NAS strives to automatically explore deep networks to first win strong accuracy, which are then equivalently transformed into their shallow counterparts to further win strong hardware efficiency. In addition to search, we also propose two enhanced training techniques, including hybrid transformable training towards better training accuracy and arbitrary-resolution elastic training towards enabling natural network elasticity across arbitrary input resolutions. Extensive experimental results on two popular intelligent embedded systems (i.e., NVIDIA Jetson AGX Xavier and NVIDIA Jetson Nano) and two representative large-scale datasets (i.e., ImageNet and ImageNet-100) clearly demonstrate the superiority of Double-Win NAS over previous state-of-the-art NAS approaches.
zh
[CV-77] PHANTOM: PHysical ANamorphic Threats Obstructing Connected Vehicle Mobility
【速读】:该论文旨在解决连接式自动驾驶车辆(Connected Autonomous Vehicles, CAVs)在感知与通信层面对物理空间中对抗性攻击的脆弱性问题。现有系统依赖于基于视觉的深度神经网络(Deep Neural Networks, DNNs)和低延迟车联万物(Vehicle-to-Everything, V2X)通信,但易受物理域中的对抗样本干扰。解决方案的关键在于提出PHANTOM框架,其利用**变形艺术(anamorphic art)**生成视角依赖的对抗样本:这些样本在人类观察下呈现自然外观,却能以高置信度误导主流目标检测器(YOLOv5、SSD、Faster R-CNN、RetinaNet),且无需模型访问即可实现黑盒攻击,并具备跨架构强迁移性。实验表明,PHANTOM在CARLA仿真中可在6–10米内触发误判,导致安全响应时间不足;更严重的是,它通过V2X网络传播虚假紧急信息,使信息时效性指标(Peak Age of Information)恶化68–89%,揭示了CAV生态系统的多层级协同风险。
链接: https://arxiv.org/abs/2512.19711
作者: Md Nahid Hasan Shuvo,Moinul Hossain
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:
Abstract:Connected autonomous vehicles (CAVs) rely on vision-based deep neural networks (DNNs) and low-latency Vehicle-to-Everything (V2X) communication to navigate safely and efficiently. Despite their advances, these systems remain vulnerable to physical adversarial attacks. In this paper, we introduce PHANTOM (PHysical ANamorphic Threats Obstructing connected vehicle Mobility), a novel framework for crafting and deploying perspective-dependent adversarial examples using anamorphic art. PHANTOM exploits geometric distortions that appear natural to humans but are misclassified with high confidence by state-of-the-art object detectors. Unlike conventional attacks, PHANTOM operates in black-box settings without model access and demonstrates strong transferability across four diverse detector architectures (YOLOv5, SSD, Faster R-CNN, and RetinaNet). Comprehensive evaluation in CARLA across varying speeds, weather conditions, and lighting scenarios shows that PHANTOM achieves over 90% attack success rate under optimal conditions and maintains 60-80% effectiveness even in degraded environments. The attack activates within 6-10 meters of the target, providing insufficient time for safe maneuvering. Beyond individual vehicle deception, PHANTOM triggers network-wide disruption in CAV systems: SUMO-OMNeT++ co-simulation demonstrates that false emergency messages propagate through V2X links, increasing Peak Age of Information by 68-89% and degrading safety-critical communication. These findings expose critical vulnerabilities in both perception and communication layers of CAV ecosystems.
zh
[CV-78] Snapshot 3D image projection using a diffractive decoder
链接: https://arxiv.org/abs/2512.20464
作者: Cagatay Isil,Alexander Chen,Yuhang Li,F. Onuralp Ardic,Shiqi Chen,Che-Yung Shen,Aydogan Ozcan
机构: 未知
类目: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Applied Physics (physics.app-ph)
备注: 22 Pages, 8 Figures
[CV-79] Dual-Encoder Transformer-Based Multimodal Learning for Ischemic Stroke Lesion Segmentation Using Diffusion MRI
【速读】:该论文旨在解决急性缺血性脑卒中病灶在扩散磁共振成像(Diffusion MRI)中的自动分割难题,该任务对临床决策和预后评估至关重要。由于病灶在DWI(弥散加权成像)和ADC(表观弥散系数)图像中呈现高度变异性,传统方法难以实现稳定准确的分割。解决方案的关键在于提出一种双编码器Transformer架构(dual-encoder TransUNet),通过分别提取DWI与ADC模态的特异性特征,并结合相邻切片的空间上下文信息(采用三切片输入配置),从而提升多模态扩散MRI数据下的病灶分割精度。实验表明,该方法在ISLES 2022数据集上达到85.4%的Dice相似系数,显著优于基于卷积神经网络的基线模型,为自动化脑卒中病灶识别提供了鲁棒且高效的框架。
链接: https://arxiv.org/abs/2512.20436
作者: Muhammad Usman,Azka Rehman,Muhammad Mutti Ur Rehman,Abd Ur Rehman,Muhammad Umar Farooq
机构: Stanford University (斯坦福大学); Seoul National University (首尔国立大学); National University of Sciences and Technology (NUST) (巴基斯坦国立科技大学); The University of Alabama (阿拉巴马大学); Hanyang University (汉阳大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate segmentation of ischemic stroke lesions from diffusion magnetic resonance imaging (MRI) is essential for clinical decision-making and outcome assessment. Diffusion-Weighted Imaging (DWI) and Apparent Diffusion Coefficient (ADC) scans provide complementary information on acute and sub-acute ischemic changes; however, automated lesion delineation remains challenging due to variability in lesion appearance. In this work, we study ischemic stroke lesion segmentation using multimodal diffusion MRI from the ISLES 2022 dataset. Several state-of-the-art convolutional and transformer-based architectures, including U-Net variants, Swin-UNet, and TransUNet, are benchmarked. Based on performance, a dual-encoder TransUNet architecture is proposed to learn modality-specific representations from DWI and ADC inputs. To incorporate spatial context, adjacent slice information is integrated using a three-slice input configuration. All models are trained under a unified framework and evaluated using the Dice Similarity Coefficient (DSC). Results show that transformer-based models outperform convolutional baselines, and the proposed dual-encoder TransUNet achieves the best performance, reaching a Dice score of 85.4% on the test set. The proposed framework offers a robust solution for automated ischemic stroke lesion segmentation from diffusion MRI.
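A toy version of the dual-encoder idea — separate branches for DWI and ADC, each taking three adjacent slices, fused before the segmentation head — might look as follows. Plain CNN stacks stand in for the transformer encoders of the paper, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

def conv_encoder(in_ch, feat):
    # Stand-in for a transformer encoder branch; 3 input channels = 3 adjacent slices.
    return nn.Sequential(
        nn.Conv2d(in_ch, feat, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(),
    )

class DualEncoderSeg(nn.Module):
    def __init__(self, feat=32):
        super().__init__()
        self.enc_dwi = conv_encoder(3, feat)   # modality-specific DWI branch
        self.enc_adc = conv_encoder(3, feat)   # modality-specific ADC branch
        self.head = nn.Sequential(             # fuse and upsample to a lesion mask
            nn.Conv2d(2 * feat, feat, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(feat, 1, 1),
        )

    def forward(self, dwi, adc):
        fused = torch.cat([self.enc_dwi(dwi), self.enc_adc(adc)], dim=1)
        return torch.sigmoid(self.head(fused))

model = DualEncoderSeg()
dwi = torch.randn(2, 3, 128, 128)   # (batch, slices, H, W)
adc = torch.randn(2, 3, 128, 128)
print(model(dwi, adc).shape)        # torch.Size([2, 1, 128, 128])
```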
zh
[CV-80] CLIP Based Region-Aware Feature Fusion for Automated BBPS Scoring in Colonoscopy Images BMVC2025
【速读】:该论文旨在解决肠镜检查中肠道清洁度评估的主观性与观察者间差异问题,传统使用的波士顿肠道准备评分系统(Boston Bowel Preparation Scale, BBPS)依赖人工判读,存在一致性差的局限。其解决方案的关键在于提出一种基于CLIP模型的自动化BBPS评分框架,通过适配器(adapter-based)迁移学习和专门设计的粪便特征提取分支,融合全局视觉特征与与粪便相关的文本先验信息,从而在无需显式分割的情况下提升评分准确性,显著优于现有基线方法。
链接: https://arxiv.org/abs/2512.20374
作者: Yujia Fu,Zhiyu Dong,Tianwen Qian,Chenye Zheng,Danian Ji,Linhai Zhuo
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 9 figures, BMVC 2025 submission
Abstract:Accurate assessment of bowel cleanliness is essential for effective colonoscopy procedures. The Boston Bowel Preparation Scale (BBPS) offers a standardized scoring system but suffers from subjectivity and inter-observer variability when performed manually. In this paper, to support robust training and evaluation, we construct a high-quality colonoscopy dataset comprising 2,240 images from 517 subjects, annotated with expert-agreed BBPS scores. We propose a novel automated BBPS scoring framework that leverages the CLIP model with adapter-based transfer learning and a dedicated fecal-feature extraction branch. Our method fuses global visual features with stool-related textual priors to improve the accuracy of bowel cleanliness evaluation without requiring explicit segmentation. Extensive experiments on both our dataset and the public NERTHU dataset demonstrate the superiority of our approach over existing baselines, highlighting its potential for clinical deployment in computer-aided colonoscopy analysis.
zh
[CV-81] SAM Audio: Segment Anything in Audio
【速读】:该论文旨在解决通用音频源分离(General Audio Source Separation)问题,即如何构建一个能够灵活处理多种类型声音(如语音、音乐和通用声响)并支持多模态控制指令(文本、视觉掩码和时间跨度)的统一模型。现有方法通常局限于特定领域或单一控制模态,难以满足多模态AI系统对声音感知与推理的需求。解决方案的关键在于提出SAM Audio,这是一个基于扩散Transformer架构(Diffusion Transformer Architecture)的基础模型,通过流匹配(Flow Matching)在大规模跨域音频数据上训练,实现了文本、视觉和时间跨度提示的统一建模与灵活控制,从而在多个基准测试中达到最先进性能,并引入了新的真实世界评估基准与无参考评价模型以更贴近人类判断。
链接: https://arxiv.org/abs/2512.18099
作者: Bowen Shi,Andros Tjandra,John Hoffman,Helin Wang,Yi-Chiao Wu,Luya Gao,Julius Richter,Matt Le,Apoorv Vyas,Sanyuan Chen,Christoph Feichtenhofer,Piotr Dollár,Wei-Ning Hsu,Ann Lee
机构: Meta Superintelligence Labs
类目: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:General audio source separation is a key capability for multimodal AI systems that can perceive and reason about sound. Despite substantial progress in recent years, existing separation models are either domain-specific, designed for fixed categories such as speech or music, or limited in controllability, supporting only a single prompting modality such as text. In this work, we present SAM Audio, a foundation model for general audio separation that unifies text, visual, and temporal span prompting within a single framework. Built on a diffusion transformer architecture, SAM Audio is trained with flow matching on large-scale audio data spanning speech, music, and general sounds, and can flexibly separate target sources described by language, visual masks, or temporal spans. The model achieves state-of-the-art performance across a diverse suite of benchmarks, including general sound, speech, music, and musical instrument separation in both in-the-wild and professionally produced audio, substantially outperforming prior general-purpose and specialized systems. Furthermore, we introduce a new real-world separation benchmark with human-labeled multimodal prompts and a reference-free evaluation model that correlates strongly with human judgment.
zh
人工智能
[AI-0] Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning
【速读】:该论文旨在解决基于自回归模型(autoregressive model)的强化学习(reinforcement learning, RL)在稀疏奖励场景下因逐token采样导致的学习效率低下问题。其核心挑战在于,传统RL通过逐token生成输出进行探索时,在需要长期规划或复杂行为序列的任务中难以有效积累奖励信号。解决方案的关键在于引入一种高阶、非因果的序列模型,该模型作用于基础自回归模型的残差流(residual stream)激活状态,从而在潜在空间(latent space)中生成抽象动作(latent actions),即“内部控制器”(internal controllers)。这些控制器能够执行一系列有意义的行为,并附带可学习的终止条件,实现跨时间尺度的层次化决策;并通过直接对内部控制器进行强化学习(称为“内部强化学习”,internal RL),显著提升稀疏奖励环境下的探索效率与任务完成能力。
链接: https://arxiv.org/abs/2512.20605
作者: Seijin Kobayashi,Yanick Schimpf,Maximilian Schlegel,Angelika Steger,Maciej Wolczyk,Johannes von Oswald,Nino Scherre,Kaitlin Maile,Guillaume Lajoie,Blake A. Richards,Rif A. Saurous,James Manyika,Blaise Agüera y Arcas,Alexander Meulemans,João Sacramento
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL) have achieved unprecedented success on many problem domains. During RL, these models explore by generating new outputs, one token at a time. However, sampling actions token-by-token can result in highly inefficient learning, particularly when rewards are sparse. Here, we show that it is possible to overcome this problem by acting and exploring within the internal representations of an autoregressive model. Specifically, to discover temporally-abstract actions, we introduce a higher-order, non-causal sequence model whose outputs control the residual stream activations of a base autoregressive model. On grid world and MuJoCo-based tasks with hierarchical structure, we find that the higher-order model learns to compress long activation sequence chunks onto internal controllers. Critically, each controller executes a sequence of behaviorally meaningful actions that unfold over long timescales and are accompanied with a learned termination condition, such that composing multiple controllers over time leads to efficient exploration on novel tasks. We show that direct internal controller reinforcement, a process we term “internal RL”, enables learning from sparse rewards in cases where standard RL finetuning fails. Our results demonstrate the benefits of latent action generation and reinforcement in autoregressive models, suggesting internal RL as a promising avenue for realizing hierarchical RL within foundation models.
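One way to picture the internal-controller interface: a small network writes an additive offset into the base model's residual stream and predicts a termination probability that ends the abstract action. The sketch below is a loose illustration of that interface only, not the paper's higher-order non-causal architecture; all names and the halting threshold are assumptions.

```python
import torch
import torch.nn as nn

class InternalController(nn.Module):
    """Emits (a) an additive offset for the residual stream and
    (b) a termination probability for ending the abstract action."""
    def __init__(self, dim):
        super().__init__()
        self.offset = nn.Linear(dim, dim)
        self.halt = nn.Linear(dim, 1)

    def forward(self, h):
        return self.offset(h), torch.sigmoid(self.halt(h)).mean()

dim = 64
controller = InternalController(dim)
h = torch.randn(1, 10, dim)          # residual-stream activations (batch, seq, dim)
for step in range(8):                # roll the controller until it terminates
    delta, p_halt = controller(h)
    h = h + delta                    # latent action: steer the base model's stream
    if p_halt > 0.5:                 # learned termination condition
        break
print(step, float(p_halt))
```

In "internal RL", reinforcement would be applied to controllers at this level rather than to token-by-token sampling, which is what makes exploration temporally abstract.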
zh
[AI-1] Leverag ing High-Fidelity Digital Models and Reinforcement Learning for Mission Engineering: A Case Study of Aerial Firefighting Under Perfect Information
【速读】:该论文旨在解决复杂系统(System of Systems, SoS)背景下任务分配与重构的动态适应性问题,尤其是在不确定、动态的任务环境中,传统静态架构难以实现高效任务执行的问题。其解决方案的关键在于构建一个基于数字工程(Digital Engineering, DE)的智能任务协调方法,通过将高保真数字任务模型与强化学习(Reinforcement Learning, RL)相结合,将任务战术管理建模为马尔可夫决策过程(Markov Decision Process, MDP),并利用近端策略优化(Proximal Policy Optimization, PPO)训练RL代理,在仿真沙盒中不断根据实际任务结果优化策略,从而实现自适应的任务分配与系统重构能力。
链接: https://arxiv.org/abs/2512.20589
作者: İbrahim Oğuz Çetinkaya,Sajad Khodadadian,Taylan G. Topçu
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
备注:
Abstract:As systems engineering (SE) objectives evolve from design and operation of monolithic systems to complex System of Systems (SoS), the discipline of Mission Engineering (ME) has emerged, which is increasingly being accepted as a new line of thinking for the SE community. Moreover, mission environments are uncertain and dynamic, and mission outcomes are a direct function of how the mission assets will interact with this environment. This renders static architectures brittle and calls for analytically rigorous approaches for ME. To that end, this paper proposes an intelligent mission coordination methodology that integrates digital mission models with Reinforcement Learning (RL) and specifically addresses the need for adaptive task allocation and reconfiguration. More specifically, we leverage a Digital Engineering (DE) based infrastructure composed of a high-fidelity digital mission model and agent-based simulation; we then formulate the mission tactics management problem as a Markov Decision Process (MDP) and employ an RL agent trained via Proximal Policy Optimization. By leveraging the simulation as a sandbox, we map the system states to actions, refining the policy based on realized mission outcomes. The utility of the RL-based intelligent mission coordinator is demonstrated through an aerial firefighting case study. Our findings indicate that the RL-based intelligent mission coordinator not only surpasses baseline performance but also significantly reduces the variability in mission performance. Thus, this study serves as a proof of concept demonstrating that DE-enabled mission simulations combined with advanced analytical tools offer a mission-agnostic framework for improving ME practice, which can be extended to more complicated fleet design and selection problems in the future from a mission-first perspective.
zh
[AI-2] Performative Policy Gradient: Optimality in Performative Reinforcement Learning
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在部署后因智能体行为改变环境分布而导致的性能下降问题,即“表现性”(performative)效应——传统RL方法忽略这种由策略自身引起的分布漂移,从而无法获得最优策略。解决方案的关键在于提出首个考虑表现性的策略梯度算法:Performative Policy Gradient(PePG),其理论基础是推导出表现性版本的性能差分引理(performance difference lemma)和策略梯度定理(policy gradient theorem)。通过在Softmax参数化下结合熵正则化与不加正则化的设定,作者证明PePG能收敛至表现性最优策略(performatively optimal policies),即在自身引发的分布变化下依然保持最优的策略,从而突破了此前仅实现表现性稳定(performative stability)而无法保证最优性的局限。
链接: https://arxiv.org/abs/2512.20576
作者: Debabrota Basu,Udvas Das,Brahim Driss,Uddalak Mukherjee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:
Abstract:Post-deployment machine learning algorithms often influence the environments they act in, and thus shift the underlying dynamics that the standard reinforcement learning (RL) methods ignore. While designing optimal algorithms in this performative setting has recently been studied in supervised learning, the RL counterpart remains under-explored. In this paper, we prove the performative counterparts of the performance difference lemma and the policy gradient theorem in RL, and further introduce the Performative Policy Gradient algorithm (PePG). PePG is the first policy gradient algorithm designed to account for performativity in RL. Under softmax parametrisation, and also with and without entropy regularisation, we prove that PePG converges to performatively optimal policies, i.e. policies that remain optimal under the distribution shifts induced by themselves. Thus, PePG significantly extends the prior works in Performative RL that achieves performative stability but not optimality. Furthermore, our empirical analysis on standard performative RL environments validate that PePG outperforms standard policy gradient algorithms and the existing performative RL algorithms aiming for stability.
zh
[AI-3] Fail Fast Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLM s
【速读】:该论文旨在解决自回归大语言模型(Autoregressive Large Language Models, AR LLMs)在生成过程中存在的效率与质量权衡问题,尤其是传统推测解码(speculative decoding)中因draft长度受限导致的加速效果不显著的问题。解决方案的关键在于利用扩散大语言模型(Diffusion Large Language Models, dLLMs)的并行生成能力,设计了一个名为FailFast的动态推测框架:该框架通过在易推测区域大幅扩展draft长度以降低验证延迟,在难推测区域快速失败(fail fast)以减少计算浪费,从而实现高效且无损的加速。此方法无需任何微调即可在多种模型和负载下实现最高达4.9倍的推理速度提升。
链接: https://arxiv.org/abs/2512.20573
作者: Rui Pan,Zhuofu Chen,Ravi Netravali
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:Diffusion Large Language Models (dLLMs) offer fast, parallel token generation, but their standalone use is plagued by an inherent efficiency-quality tradeoff. We show that, if carefully applied, the attributes of dLLMs can actually be a strength for drafters in speculative decoding with autoregressive (AR) verifiers. Our core insight is that dLLM's speed from parallel decoding drastically lowers the risk of costly rejections, providing a practical mechanism to effectively realize the (elusive) lengthy drafts that lead to large speedups with speculative decoding. We present FailFast, a dLLM-based speculative decoding framework that realizes this approach by dynamically adapting its speculation length. It “fails fast” by spending minimal compute in hard-to-speculate regions to shrink speculation latency and “wins big” by aggressively extending draft lengths in easier regions to reduce verification latency (in many cases, speculating and accepting 70 tokens at a time!). Without any fine-tuning, FailFast delivers lossless acceleration of AR LLMs and achieves up to 4.9× speedup over vanilla decoding, 1.7× over the best naive dLLM drafter, and 1.4× over EAGLE-3 across diverse models and workloads. We open-source FailFast at this https URL.
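The "fail fast, win big" control loop can be sketched abstractly: shrink the draft length after a rejection and grow it aggressively after a full acceptance. The multiplicative update rule, the bounds, and the toy verifier below are all assumptions for illustration; FailFast's actual draft-length adaptation comes from the dLLM/verifier interaction.

```python
import random

def verify(draft):
    """Toy stand-in for the AR verifier: accepts a prefix of the draft."""
    accepted = 0
    for _ in draft:
        if random.random() < 0.9:   # assumed per-token acceptance probability
            accepted += 1
        else:
            break
    return accepted

def speculative_loop(total_tokens=200, k_min=4, k_max=70):
    k, produced, rounds = 8, 0, 0
    while produced < total_tokens:
        draft = ["tok"] * min(k, total_tokens - produced)
        accepted = verify(draft)
        produced += accepted + 1            # verifier also emits one token itself
        rounds += 1
        if accepted == len(draft):          # "win big": extend the draft window
            k = min(k * 2, k_max)
        else:                               # "fail fast": back off in hard regions
            k = max(k // 2, k_min)
    return rounds

random.seed(0)
print(speculative_loop())                   # fewer rounds = fewer verifier calls
```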
zh
[AI-4] Advancing Multimodal Teacher Sentiment Analysis:The Large-Scale T-MED Dataset The Effective AAM-TSA Model
【速读】:该论文旨在解决教师情绪状态在教育场景中难以准确捕捉的问题,尤其针对现有研究忽视教学信息对情绪影响及教师情绪表达具表演性(performative nature)的局限。为应对这一挑战,作者构建了首个大规模教师多模态情感分析数据集T-MED,涵盖14,938条来自250间真实课堂、覆盖K-12至高等教育共11个学科的文本、音频、视频与教学信息融合的数据,并提出一种基于非对称注意力机制的多模态教师情感分析模型(AAM-TSA)。该模型的关键创新在于引入非对称注意力机制和分层门控单元,实现跨模态特征差异化融合与精准情感分类,显著优于现有最先进方法,在准确性和可解释性方面均取得提升。
链接: https://arxiv.org/abs/2512.20548
作者: Zhiyi Duan,Xiangren Wang,Hongyu Yuan,Qianli Xing
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Teachers’ emotional states are critical in educational scenarios, profoundly impacting teaching efficacy, student engagement, and learning achievements. However, existing studies often fail to accurately capture teachers’ emotions due to their performative nature and overlook the critical impact of instructional information on emotional states. In this paper, we systematically investigate teacher sentiment analysis by building both the dataset and the model accordingly. We construct the first large-scale teacher multimodal sentiment analysis dataset, T-MED. To ensure labeling accuracy and efficiency, we employ a human-machine collaborative labeling approach. The T-MED dataset includes 14,938 instances of teacher emotional data from 250 real classrooms across 11 subjects ranging from K-12 to higher education, integrating multimodal text, audio, video, and instructional information. Furthermore, we propose a novel asymmetric attention-based multimodal teacher sentiment analysis model, AAM-TSA. AAM-TSA introduces an asymmetric attention mechanism and hierarchical gating unit to enable differentiated cross-modal feature fusion and precise emotional classification. Experimental results demonstrate that AAM-TSA significantly outperforms existing state-of-the-art methods in terms of accuracy and interpretability on the T-MED dataset.
zh
[AI-5] Benchmarking LLM s for Predictive Applications in the Intensive Care Units
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在预测重症患者休克(Shock)等临床事件中的有效性问题,尤其是在与传统小型语言模型(Small Language Models, SLMs)相比时的表现差异。其关键解决方案在于:利用MIMIC-III数据库中17,294例ICU患者的文本数据,通过设定24小时住院时长和休克指数(Shock Index, SI)阈值筛选出正常与异常SI患者,采用焦点损失(focal loss)和交叉熵损失(cross-entropy loss)对多种LLMs(如GatorTron-Base、Llama 8B、Mistral 7B)与SLMs(如BioBERT、DocBERT、Word2Vec等)进行微调,并系统评估其预测性能。结果表明,尽管GatorTron-Base在加权召回率上达到80.5%,但整体性能与SLMs相当,提示LLMs并非天然优于SLMs用于预测未来临床事件,未来训练应聚焦于预测临床轨迹而非简化任务如命名实体识别或表型分类。
链接: https://arxiv.org/abs/2512.20520
作者: Chehak Malhotra,Mehak Gopal,Akshaya Devadiga,Pradeep Singh,Ridam Pal,Ritwik Kashyap,Tavpritesh Sethi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:With the advent of LLMs, various tasks across the natural language processing domain have been transformed. However, their application in predictive tasks remains less researched. This study compares large language models, including GatorTron-Base (trained on clinical data), Llama 8B, and Mistral 7B, against models like BioBERT, DocBERT, BioClinicalBERT, Word2Vec, and Doc2Vec, setting benchmarks for predicting Shock in critically ill patients. Timely prediction of shock can enable early interventions, thus improving patient outcomes. Text data from 17,294 ICU stays of patients in the MIMIC III database were scored for length of stay > 24 hours and shock index (SI) > 0.7 to yield 355 and 87 patients with normal and abnormal SI, respectively. Both focal and cross-entropy losses were used during finetuning to address class imbalances. Our findings indicate that while GatorTron-Base achieved the highest weighted recall of 80.5%, the overall performance metrics were comparable between SLMs and LLMs. This suggests that LLMs are not inherently superior to SLMs in predicting future clinical events despite their strong performance on text-based tasks. To achieve meaningful clinical outcomes, future efforts in training LLMs should prioritize developing models capable of predicting clinical trajectories rather than focusing on simpler tasks such as named entity recognition or phenotyping.
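For reference, the shock index used to define the cohorts is the ratio of heart rate to systolic blood pressure, with SI > 0.7 commonly treated as abnormal. A minimal helper, with column names assumed for illustration:

```python
import pandas as pd

def label_shock_index(df, hr_col="heart_rate", sbp_col="systolic_bp", threshold=0.7):
    """Shock index SI = HR / SBP; SI > threshold is flagged as abnormal."""
    out = df.copy()
    out["shock_index"] = out[hr_col] / out[sbp_col]
    out["abnormal_si"] = out["shock_index"] > threshold
    return out

vitals = pd.DataFrame({"heart_rate": [72, 110, 95], "systolic_bp": [120, 100, 150]})
print(label_shock_index(vitals))
```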
zh
[AI-6] SweRank: Multilingual Multi-Turn Code Ranking for Software Issue Localization
【速读】:该论文旨在解决大规模多语言代码库中错误定位(issue localization)的准确性问题,即如何将自然语言描述的错误信息精准映射到需要修改的相关函数。现有方法通常局限于Python语言且仅进行单次遍历搜索,难以有效处理跨语言场景和复杂推理需求。解决方案的关键在于提出SweRank+框架,其核心由两部分组成:一是SweRankMulti,一个基于代码嵌入检索器与列表级大语言模型(LLM)重排序器的跨语言代码排名工具,通过精心构建的大规模多语言问题定位数据集进行训练;二是SweRankAgent,一种采用代理式搜索循环(agentic search loop)的迭代式多轮推理机制,利用记忆缓冲区积累并优化候选函数,从而超越传统单次定位方法,在多种编程语言上的基准测试中达到新的最先进性能。
链接: https://arxiv.org/abs/2512.20482
作者: Revanth Gangi Reddy,Ye Liu,Wenting Zhao,JaeHyeok Doo,Tarun Suresh,Daniel Lee,Caiming Xiong,Yingbo Zhou,Semih Yavuz,Shafiq Joty
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Maintaining large-scale, multilingual codebases hinges on accurately localizing issues, which requires mapping natural-language error descriptions to the relevant functions that need to be modified. However, existing ranking approaches are often Python-centric and perform a single-pass search over the codebase. This work introduces SweRank+, a framework that couples SweRankMulti, a cross-lingual code ranking tool, with SweRankAgent, an agentic search setup, for iterative, multi-turn reasoning over the code repository. SweRankMulti comprises a code embedding retriever and a listwise LLM reranker, and is trained using a carefully curated large-scale issue localization dataset spanning multiple popular programming languages. SweRankAgent adopts an agentic search loop that moves beyond single-shot localization with a memory buffer to reason and accumulate relevant localization candidates over multiple turns. Our experiments on issue localization benchmarks spanning various languages demonstrate new state-of-the-art performance with SweRankMulti, while SweRankAgent further improves localization over single-pass ranking.
zh
[AI-7] Bohrium SciMaster: Building the Infrastructure and Ecosystem for Agent ic Science at Scale
【速读】:该论文旨在解决当前科学工作中因AI加速而面临的可复现性、可追踪性和系统化协作难题,尤其是在多步骤科学工作流中实现生成式AI(Generative AI)与工具调用、验证机制的深度融合。其核心挑战在于:现有流程难以观测和复现;大量实验设备与软件未适配为“代理就绪”(agent-ready)状态;执行过程缺乏透明度与管控能力;且早期AI科学家系统多为定制开发,限制了知识积累与迭代优化。解决方案的关键在于构建一个分层基础设施与生态系统——Bohrium+SciMaster架构:Bohrium作为托管型、可追溯的AI for Science(AI4S)资产枢纽,将多样化的数据、计算资源与实验室系统转化为标准化的代理可用能力;SciMaster则负责编排这些能力以执行长周期科学工作流;两者之间通过“科学智能底座”整合可重用模型、知识与组件,形成支持推理、行动与持续改进的执行单元。实证表明,该体系在11个真实场景中实现了端到端科研周期时间数量级缩短,并从大规模实际负载中提取出执行驱动的反馈信号,为规模化智能科学提供了可行路径。
链接: https://arxiv.org/abs/2512.20469
作者: Linfeng Zhang,Siheng Chen,Yuzhu Cai,Jingyi Chai,Junhan Chang,Kun Chen,Zhi X. Chen,Zhaohan Ding,Yuwen Du,Yuanpeng Gao,Yuan Gao,Jing Gao,Zhifeng Gao,Qiangqiang Gu,Yanhui Hong,Yuan Huang,Xi Fang,Xiaohong Ji,Guolin Ke,Zixing Lei,Xinyu Li,Yongge Li,Ruoxue Liao,Hang Lin,Xiaolu Lin,Yuxiang Liu,Xinzijian Liu,Zexi Liu,Jintan Lu,Tingjia Miao,Haohui Que,Weijie Sun,Yanfeng Wang,Bingyang Wu,Tianju Xue,Rui Ye,Jinzhe Zeng,Duo Zhang,Jiahui Zhang,Linfeng Zhang,Tianhan Zhang,Wenchang Zhang,Yuzhi Zhang,Zezhong Zhang,Hang Zheng,Hui Zhou,Tong Zhu,Xinyu Zhu,Qingguo Zhou,Weinan E
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:AI agents are emerging as a practical way to run multi-step scientific workflows that interleave reasoning with tool use and verification, pointing to a shift from isolated AI-assisted steps toward agentic science at scale. This shift is increasingly feasible, as scientific tools and models can be invoked through stable interfaces and verified with recorded execution traces, and increasingly necessary, as AI accelerates scientific output and stresses the peer-review and publication pipeline, raising the bar for traceability and credible evaluation. However, scaling agentic science remains difficult: workflows are hard to observe and reproduce; many tools and laboratory systems are not agent-ready; execution is hard to trace and govern; and prototype AI Scientist systems are often bespoke, limiting reuse and systematic improvement from real workflow signals. We argue that scaling agentic science requires an infrastructure-and-ecosystem approach, instantiated in Bohrium+SciMaster. Bohrium acts as a managed, traceable hub for AI4S assets – akin to a HuggingFace of AI for Science – that turns diverse scientific data, software, compute, and laboratory systems into agent-ready capabilities. SciMaster orchestrates these capabilities into long-horizon scientific workflows, on which scientific agents can be composed and executed. Between infrastructure and orchestration, a scientific intelligence substrate organizes reusable models, knowledge, and components into executable building blocks for workflow reasoning and action, enabling composition, auditability, and improvement through use. We demonstrate this stack with eleven representative master agents in real workflows, achieving orders-of-magnitude reductions in end-to-end scientific cycle time and generating execution-grounded signals from real workloads at multi-million scale.
zh
[AI-8] Evasion-Resilient Detection of DNS-over-HTTPS Data Exfiltration: A Practical Evaluation and Toolkit
【速读】:该论文旨在解决防御方对DNS-over-HTTPS(DoH)文件外泄行为的检测能力不足,以及攻击者如何利用多种规避策略绕过现有检测机制的问题。其核心解决方案是一个端到端、容器化的可复现数据处理流水线,能够配置化地生成DoH外泄流量(支持分块、编码、填充和解析器轮换等参数),并在解析器端实现文件重构;同时通过改进版DoHLyzer提取流级特征,并集成机器学习模型(如随机森林、梯度提升和逻辑回归)与阈值检测方法进行对比评估,在对抗场景下量化不同检测策略的有效性。该方案的关键在于将攻击模拟、特征工程、模型训练与检测评估统一于Docker容器中,确保跨平台可复现性,从而为研究DoH外泄的隐蔽性边界及经济可行性提供基础工具与实证依据。
链接: https://arxiv.org/abs/2512.20423
作者: Adam Elaoumari
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注: 61 pages Advisor : Dr Darren Hurley-Smith
Abstract:The purpose of this project is to assess how well defenders can detect DNS-over-HTTPS (DoH) file exfiltration and which evasion strategies can be used by attackers, while providing a reproducible toolkit to generate, intercept, and analyze DoH exfiltration and comparing machine-learning and threshold-based detection under adversarial scenarios. The originality of this project is the introduction of an end-to-end, containerized pipeline that generates configurable file exfiltration over DoH using several parameters (e.g., chunking, encoding, padding, resolver rotation). It allows for file reconstruction at the resolver side, while extracting flow-level features using a fork of DoHLyzer. The pipeline contains a prediction side, which allows the training of machine learning models based on public labelled datasets and then evaluates them side-by-side with threshold-based detection methods against malicious and evasive DNS-over-HTTPS traffic. We train Random Forest, Gradient Boosting and Logistic Regression classifiers on a public DoH dataset and benchmark them against evasive DoH exfiltration scenarios. The toolkit orchestrates traffic generation, file capture, feature extraction, model training and analysis. The toolkit is then encapsulated into several Docker containers for easy setup and full reproducibility regardless of the platform it is run on. Future research regarding this project is directed at validating the results on mixed enterprise traffic, extending the protocol coverage to HTTP/3/QUIC requests, adding benign traffic generation, and working on real-time traffic evaluation. A key objective is to quantify when stealth constraints make DoH exfiltration uneconomical and unworthy for the attacker.
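A minimal version of the detection side — a RandomForest on flow-level features benchmarked against a single-feature threshold rule — might look as follows. The feature names and synthetic flows below are placeholders for the DoHLyzer output the toolkit actually produces.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Placeholder flow features: [mean packet size, bytes/s, inter-arrival std]
benign = rng.normal([120, 2e3, 0.05], [30, 500, 0.02], size=(500, 3))
exfil = rng.normal([400, 2e4, 0.01], [80, 5e3, 0.005], size=(500, 3))
X = np.vstack([benign, exfil])
y = np.array([0] * 500 + [1] * 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Threshold baseline: flag flows whose byte rate exceeds a fixed cutoff.
thresh_pred = (X_te[:, 1] > 8e3).astype(int)
print("RF F1:", f1_score(y_te, clf.predict(X_te)))
print("Threshold F1:", f1_score(y_te, thresh_pred))
```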
zh
[AI-9] AUDRON: A Deep Learning Framework with Fused Acoustic Signatures for Drone Type Recognition
【速读】:该论文旨在解决无人机(Unmanned Aerial Vehicles, UAVs)滥用带来的安全与隐私风险问题,提出一种基于音频信号的可靠检测方法。其解决方案的关键在于设计了一个混合深度学习框架AUDRON(AUdio-based Drone Recognition Network),通过融合多种声学特征表示(包括梅尔频率倒谱系数Mel-Frequency Cepstral Coefficients, MFCC和短时傅里叶变换Short-Time Fourier Transform, STFT频谱图),并结合卷积神经网络(Convolutional Neural Networks, CNNs)、循环神经网络(Recurrent Neural Networks, RNNs)进行空间与时间建模,以及自编码器(Autoencoder)提取高层语义特征,在特征层面实现信息互补与融合,从而显著提升对复杂背景噪声中无人机声纹的识别准确率。实验表明,该方法在二分类和多分类任务中分别达到98.51%和97.11%的精度,验证了其在安防监控等场景下的有效性与泛化能力。
链接: https://arxiv.org/abs/2512.20407
作者: Rajdeep Chatterjee,Sudip Chakrabarty,Trishaani Acharjee,Deepanjali Mishra
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Presented at the 2025 IEEE 22nd India Council International Conference (INDICON). 6 pages, 3 figures
Abstract:Unmanned aerial vehicles (UAVs), commonly known as drones, are increasingly used across diverse domains, including logistics, agriculture, surveillance, and defense. While these systems provide numerous benefits, their misuse raises safety and security concerns, making effective detection mechanisms essential. Acoustic sensing offers a low-cost and non-intrusive alternative to vision or radar-based detection, as drone propellers generate distinctive sound patterns. This study introduces AUDRON (AUdio-based Drone Recognition Network), a hybrid deep learning framework for drone sound detection, employing a combination of Mel-Frequency Cepstral Coefficients (MFCC), Short-Time Fourier Transform (STFT) spectrograms processed with convolutional neural networks (CNNs), recurrent layers for temporal modeling, and autoencoder-based representations. Feature-level fusion integrates complementary information before classification. Experimental evaluation demonstrates that AUDRON effectively differentiates drone acoustic signatures from background noise, achieving high accuracy while maintaining generalizability across varying conditions. AUDRON achieves 98.51 percent and 97.11 percent accuracy in binary and multiclass classification. The results highlight the advantage of combining multiple feature representations with deep learning for reliable acoustic drone detection, suggesting the framework’s potential for deployment in security and surveillance applications where visual or radar sensing may be limited.
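Feature-level fusion of MFCC and STFT representations can be sketched with librosa; the statistics pooling below is a simplified stand-in for AUDRON's CNN, recurrent, and autoencoder branches, and the FFT size is an arbitrary choice.

```python
import numpy as np
import librosa

def fused_acoustic_features(y, sr, n_mfcc=13):
    """Feature-level fusion of MFCC and STFT-spectrogram statistics."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    spec = np.abs(librosa.stft(y, n_fft=512))                # (257, frames)
    mfcc_stats = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
    spec_stats = np.concatenate([spec.mean(axis=1), spec.std(axis=1)])
    return np.concatenate([mfcc_stats, spec_stats])          # fused feature vector

sr = 22050
y = np.random.randn(sr * 2).astype(np.float32)  # 2 s of noise as a stand-in clip
print(fused_acoustic_features(y, sr).shape)
```

A downstream classifier (binary drone/no-drone or multiclass drone type) would then consume the fused vector, which is where the complementary nature of the two representations pays off.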
zh
[AI-10] Identifying Appropriately-Sized Services with Deep Reinforcement Learning
【速读】:该论文旨在解决服务化架构(Service-based Architecture, SBA)中服务划分粒度难以合理确定的问题,尤其是在缺乏文档、人员访问权限或预设服务数量的情况下,传统方法难以有效分解遗留系统。其解决方案的关键在于提出一种基于深度强化学习的自动化服务分解技术 Rake,该方法直接从实现层面的方法级代码和现有文档中学习服务边界,无需依赖特定文档格式或项目人员介入,且具备语言无关性;同时支持可定制的目标函数,在模块化质量与业务能力对齐之间实现平衡优化,从而在实际应用中显著提升服务划分的效果。
链接: https://arxiv.org/abs/2512.20381
作者: Syeda Tasnim Fabiha,Saad Shafiq,Wesley Klewerton Guez Assunção,Nenad Medvidović
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 22 pages, 6 figures
Abstract:Service-based architecture (SBA) has gained attention in industry and academia as a means to modernize legacy systems. It refers to a design style that enables systems to be developed as suites of small, loosely coupled, and autonomous components (services) that encapsulate functionality and communicate via language-agnostic APIs. However, defining appropriately sized services that capture cohesive subsets of system functionality remains challenging. Existing work often relies on the availability of documentation, access to project personnel, or a priori knowledge of the target number of services, assumptions that do not hold in many real-world scenarios. Our work addresses these limitations using a deep reinforcement learning-based approach to identify appropriately sized services directly from implementation artifacts. We present Rake, a reinforcement learning-based technique that leverages available system documentation and source code to guide service decomposition at the level of implementation methods. Rake does not require specific documentation or access to project personnel and is language-agnostic. It also supports a customizable objective function that balances modularization quality and business capability alignment, i.e., the degree to which a service covers the targeted business capability. We applied Rake to four open-source legacy projects and compared it with two state-of-the-art techniques. On average, Rake achieved 7-14 percent higher modularization quality and 18-22 percent stronger business capability alignment. Our results further show that optimizing solely for business context can degrade decomposition quality in tightly coupled systems, highlighting the need for balanced objectives.
zh
[AI-11] Clust-PSI-PFL: A Population Stability Index Approach for Clustered Non-IID Personalized Federated Learning
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中因客户端数据非独立同分布(non-IID)导致的模型更新偏差与性能下降问题,尤其在标签偏斜(label skew)场景下影响显著。其解决方案的关键在于提出一种基于聚类的个性化联邦学习框架 Clust-PSI-PFL,核心创新是利用 Population Stability Index (PSI) 构建加权度量 WPSI^L 来更精准刻画数据分布差异,并通过 K-means++ 聚类将客户端划分为分布同质组,从而实现更有效的个性化模型训练。该方法在多个数据模态和非IID划分策略下均显著提升全局准确率(最高达18%)并改善客户端公平性(相对提升37%),验证了 PSI 指导聚类作为轻量且鲁棒的个性化联邦学习机制的有效性。
链接: https://arxiv.org/abs/2512.20363
作者: Daniel M. Jimenez-Gutierrez,Mehrdad Hassanzadeh,Aris Anagnostopoulos,Ioannis Chatzigiannakis,Andrea Vitaletti
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Applications (stat.AP); Machine Learning (stat.ML)
备注: Accepted for publication to the 40th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2026)
Abstract:Federated learning (FL) supports privacy-preserving, decentralized machine learning (ML) model training by keeping data on client devices. However, non-independent and identically distributed (non-IID) data across clients biases updates and degrades performance. To alleviate these issues, we propose Clust-PSI-PFL, a clustering-based personalized FL framework that uses the Population Stability Index (PSI) to quantify the level of non-IID data. We compute a weighted PSI metric, WPSI^L, which we show to be more informative than common non-IID metrics (Hellinger, Jensen-Shannon, and Earth Mover's distance). Using PSI features, we form distributionally homogeneous groups of clients via K-means++; the number of optimal clusters is chosen by a systematic silhouette-based procedure, typically yielding few clusters with modest overhead. Across six datasets (tabular, image, and text modalities), two partition protocols (Dirichlet with parameter α and Similarity with parameter S), and multiple client sizes, Clust-PSI-PFL delivers up to 18% higher global accuracy than state-of-the-art baselines and markedly improves client fairness by a relative improvement of 37% under severe non-IID data. These results establish PSI-guided clustering as a principled, lightweight mechanism for robust PFL under label skew.
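The Population Stability Index is a standard quantity, PSI = Σᵢ (pᵢ − qᵢ)·ln(pᵢ/qᵢ) over distribution bins. The sketch below computes each client's PSI against a global reference label distribution and clusters clients with K-means (sklearn's default k-means++ initialization); using a single PSI scalar per client is a simplification of the paper's weighted WPSI^L features.

```python
import numpy as np
from sklearn.cluster import KMeans

def psi(p, q, eps=1e-6):
    """Population Stability Index between two label distributions:
    PSI = sum_i (p_i - q_i) * ln(p_i / q_i)."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum((p - q) * np.log(p / q)))

# Per-client label histograms (rows: clients, columns: classes) -- synthetic.
rng = np.random.default_rng(0)
clients = rng.dirichlet(np.ones(10) * 0.3, size=20)
reference = clients.mean(axis=0)  # global label distribution as the reference

# PSI of each client vs. the reference, used as a clustering feature.
features = np.array([[psi(c, reference)] for c in clients])
groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
print(groups)
```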
zh
[AI-12] A DeepSeek -Powered AI System for Automated Chest Radiograph Interpretation in Clinical Practice
【速读】:该论文旨在解决全球放射科医生短缺与胸部X线(chest X-ray)工作量激增之间的矛盾,尤其是在基层医疗场景中,AI辅助诊断系统的临床有效性亟待验证。其解决方案的关键在于开发并前瞻性验证了一个基于DeepSeek Janus-Pro模型的胸部X线智能解读系统——Janus-Pro-CXR(1B),该系统通过轻量化架构和领域特定优化,在自动化报告生成、关键影像学异常检测(如六类临床危急征象)方面优于现有主流模型(包括ChatGPT 4o),并在多中心前瞻性临床试验中显著提升报告质量评分、缩短18.3%的阅片时间(P < 0.001),且在54.3%的病例中获得专家偏好,从而在资源受限环境中实现了诊断可靠性与工作效率的双重提升。
链接: https://arxiv.org/abs/2512.20344
作者: Yaowei Bai,Ruiheng Zhang,Yu Lei,Xuhua Duan,Jingfeng Yao,Shuguang Ju,Chaoyang Wang,Wei Yao,Yiwan Guo,Guilin Zhang,Chao Wan,Qian Yuan,Lei Chen,Wenjuan Tang,Biqiang Zhu,Xinggang Wang,Tao Sun,Wei Zhou,Dacheng Tao,Yongchao Xu,Chuansheng Zheng,Huangxuan Zhao,Bo Du
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: arXiv admin note: substantial text overlap with arXiv:2507.19493
Abstract:A global shortage of radiologists has been exacerbated by the significant volume of chest X-ray workloads, particularly in primary care. Although multimodal large language models show promise, existing evaluations predominantly rely on automated metrics or retrospective analyses, lacking rigorous prospective clinical validation. Janus-Pro-CXR (1B), a chest X-ray interpretation system based on the DeepSeek Janus-Pro model, was developed and rigorously validated through a multicenter prospective trial (NCT07117266). Our system outperforms state-of-the-art X-ray report generation models in automated report generation, surpassing even larger-scale models including ChatGPT 4o (200B parameters), while demonstrating reliable detection of six clinically critical radiographic findings. Retrospective evaluation confirms significantly higher report accuracy than Janus-Pro and ChatGPT 4o. In prospective clinical deployment, AI assistance significantly improved report quality scores, reduced interpretation time by 18.3% (P < 0.001), and was preferred by a majority of experts in 54.3% of cases. Through lightweight architecture and domain-specific optimization, Janus-Pro-CXR improves diagnostic reliability and workflow efficiency, particularly in resource-constrained settings. The model architecture and implementation framework will be open-sourced to facilitate the clinical translation of AI-assisted radiology solutions.
zh
[AI-13] SynCraft: Guiding Large Language Models to Predict Edit Sequences for Molecular Synthesizability Optimization
Quick read: This paper tackles the synthetic-accessibility bottleneck of generative AI in chemical space exploration: a large fraction of generated molecules cannot actually be synthesized. Existing remedies such as post-hoc filtering or projection-based methods often sacrifice structural novelty or disrupt key pharmacophores by forcing molecules into predefined synthetic templates. The key to the solution is the SynCraft framework, which reframes synthesizability optimization as an atom-level structural editing problem rather than a sequence translation task: by leveraging the reasoning capabilities of Large Language Models (LLMs) to predict executable sequences of atom-level modifications, it sidesteps the syntactic fragility of LLMs on SMILES strings while retaining their chemical intuition, markedly improving synthetic feasibility and structural fidelity.
Link: https://arxiv.org/abs/2512.20333
Authors: Junren Li,Luhua Lai
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: 28 pages, 4 figures, 1 table
Abstract:Generative artificial intelligence has revolutionized the exploration of chemical space, yet a critical bottleneck remains that a substantial fraction of generated molecules is synthetically inaccessible. Current solutions, such as post-hoc filtering or projection-based methods, often compromise structural novelty or disrupt key pharmacophores by forcing molecules into pre-defined synthetic templates. Herein, we introduce SynCraft, a reasoning-based framework that reframes synthesizability optimization not as a sequence translation task, but as a precise structural editing problem. Leveraging the emergent reasoning capabilities of Large Language Models, SynCraft navigates the "synthesis cliff" where minimal structural modifications yield significant gains in synthetic feasibility. By predicting executable sequences of atom-level edits rather than generating SMILES strings directly, SynCraft circumvents the syntactic fragility of LLMs while harnessing their chemical intuition. Extensive benchmarks demonstrate that SynCraft outperforms state-of-the-art baselines in generating synthesizable analogs with high structural fidelity. Furthermore, through interaction-aware prompting, SynCraft successfully replicates expert medicinal chemistry intuition in editing PLK1 inhibitors and rescuing high-scoring but previously discarded RIPK1 candidates from the prior molecular generation literature.
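To make the "predict edits, not SMILES" interface concrete, here is a minimal sketch of applying an atom-level edit sequence with RDKit. This is an illustration, not the authors' tooling: the edit vocabulary (`add_atom`, `add_bond`, `replace_atom`) and the example edit are hypothetical.

```python
# Minimal sketch: applying a predicted atom-level edit sequence with RDKit.
# The edit vocabulary below is hypothetical; SynCraft's actual action space
# and tooling are not specified in the abstract.
from rdkit import Chem

def apply_edits(smiles: str, edits: list) -> str:
    """Apply (op, *args) edits to a molecule and return canonical SMILES."""
    rw = Chem.RWMol(Chem.MolFromSmiles(smiles))
    for op, *args in edits:
        if op == "add_atom":            # args: element symbol
            rw.AddAtom(Chem.Atom(args[0]))
        elif op == "add_bond":          # args: begin idx, end idx, bond type
            rw.AddBond(args[0], args[1], args[2])
        elif op == "replace_atom":      # args: atom idx, new element symbol
            rw.ReplaceAtom(args[0], Chem.Atom(args[1]))
    Chem.SanitizeMol(rw)                # validity check, as an LLM tool would run
    return Chem.MolToSmiles(rw)

# Toluene -> aniline in one edit: swap the methyl carbon for nitrogen.
print(apply_edits("Cc1ccccc1", [("replace_atom", 0, "N")]))  # Nc1ccccc1
```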
[AI-14] Toward Explaining Large Language Models in Software Engineering Tasks
Quick read: This paper addresses the lack of explainability of Large Language Models (LLMs) in software engineering (SE) tasks due to their black-box nature, especially in high-stakes and safety-critical settings where domain-specific explanations aligned with how developers actually reason are missing. The key to the solution is FeatureSHAP, the first fully automated, model-agnostic explainability framework tailored to SE: based on Shapley values, it attributes model outputs to high-level input features through systematic input perturbation and task-specific similarity comparisons, and it covers bi-modal SE tasks such as code generation and code summarization. The method works with both open-source and proprietary LLMs, empirically outperforms baselines in explanation fidelity and relevance, and a practitioner survey confirms that it helps developers better interpret model outputs and make more informed decisions.
Link: https://arxiv.org/abs/2512.20328
Authors: Antonio Vitale,Khai-Nguyen Nguyen,Denys Poshyvanyk,Rocco Oliveto,Simone Scalabrino,Antonio Mastropaolo
Institutions: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Recent progress in Large Language Models (LLMs) has substantially advanced the automation of software engineering (SE) tasks, enabling complex activities such as code generation and code summarization. However, the black-box nature of LLMs remains a major barrier to their adoption in high-stakes and safety-critical domains, where explainability and transparency are vital for trust, accountability, and effective human supervision. Despite increasing interest in explainable AI for software engineering, existing methods lack domain-specific explanations aligned with how practitioners reason about SE artifacts. To address this gap, we introduce FeatureSHAP, the first fully automated, model-agnostic explainability framework tailored to software engineering tasks. Based on Shapley values, FeatureSHAP attributes model outputs to high-level input features through systematic input perturbation and task-specific similarity comparisons, while remaining compatible with both open-source and proprietary LLMs. We evaluate FeatureSHAP on two bi-modal SE tasks: code generation and code summarization. The results show that FeatureSHAP assigns less importance to irrelevant input features and produces explanations with higher fidelity than baseline methods. A practitioner survey involving 37 participants shows that FeatureSHAP helps practitioners better interpret model outputs and make more informed decisions. Collectively, FeatureSHAP represents a meaningful step toward practical explainable AI in software engineering. FeatureSHAP is available at this https URL.
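As a minimal sketch of the Shapley attribution that FeatureSHAP builds on (not the authors' implementation), the snippet below computes exact Shapley values over a few high-level features by scoring perturbed feature subsets; the feature names and the `score` function are hypothetical stand-ins for the paper's perturbation and similarity machinery.

```python
# Exact Shapley values over high-level input features via subset perturbation.
# score_fn(kept) returns the output quality when only the features whose
# indices are in `kept` remain in the input; here it is a toy stand-in.
from itertools import combinations
from math import factorial

def shapley(features, score_fn):
    n = len(features)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for subset in combinations(others, r):
                # Standard Shapley weight |S|! (n - |S| - 1)! / n!
                w = factorial(r) * factorial(n - r - 1) / factorial(n)
                gain = score_fn(frozenset(subset) | {i}) - score_fn(frozenset(subset))
                values[i] += w * gain
    return dict(zip(features, values))

# Toy scoring: feature 0 matters a lot, feature 2 a little, feature 1 not at all.
weights = {0: 0.7, 1: 0.0, 2: 0.3}
score = lambda kept: sum(weights[i] for i in kept)
print(shapley(["docstring", "identifiers", "types"], score))
```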
[AI-15] TableGPT-R1: Advancing Tabular Reasoning Through Reinforcement Learning
Quick read: This paper targets the limitations of Large Language Models (LLMs) on complex tabular tasks, in particular their weakness in multi-step reasoning and robust code execution, and three obstacles to applying reinforcement learning (RL) in this domain: the scarcity of high-quality agentic trajectories, the extreme heterogeneity of feedback signals (from strict SQL execution to open-ended data interpretation), and catastrophic forgetting of general knowledge during vertical specialization. The key to the solution is TableGPT-R1, a specialized tabular model built on a systematic RL framework whose core components are: a data-engineering pipeline that synthesizes difficulty-stratified agentic trajectories for supervised alignment and RL rollouts; a task-adaptive reward system that combines rule-based verification with a criteria-injected reward model, plus process-level step reward shaping with behavioral regularization; and a multi-stage training strategy that stabilizes reasoning before specializing on table-specific tasks. The approach substantially improves performance on complex tabular tasks while retaining strong general capabilities.
Link: https://arxiv.org/abs/2512.20312
Authors: Saisai Yang,Qingyi Huang,Jing Yuan,Liangyu Zha,Kai Tang,Yuhang Yang,Ning Wang,Yucheng Wei,Liyao Li,Wentao Ye,Hao Chen,Tao Zhang,Junlin Zhou,Haobo Wang,Gang Chen,Junbo Zhao
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Tabular data serves as the backbone of modern data analysis and scientific research. While Large Language Models (LLMs) fine-tuned via Supervised Fine-Tuning (SFT) have significantly improved natural language interaction with such structured data, they often fall short in handling the complex, multi-step reasoning and robust code execution required for real-world table tasks. Reinforcement Learning (RL) offers a promising avenue to enhance these capabilities, yet its application in the tabular domain faces three critical hurdles: the scarcity of high-quality agentic trajectories with closed-loop code execution and environment feedback on diverse table structures, the extreme heterogeneity of feedback signals ranging from rigid SQL execution to open-ended data interpretation, and the risk of catastrophic forgetting of general knowledge during vertical specialization. To overcome these challenges and unlock advanced reasoning on complex tables, we introduce TableGPT-R1, a specialized tabular model built on a systematic RL framework. Our approach integrates a comprehensive data engineering pipeline that synthesizes difficulty-stratified agentic trajectories for both supervised alignment and RL rollouts, a task-adaptive reward system that combines rule-based verification with a criteria-injected reward model and incorporates process-level step reward shaping with behavioral regularization, and a multi-stage training framework that progressively stabilizes reasoning before specializing in table-specific tasks. Extensive evaluations demonstrate that TableGPT-R1 achieves state-of-the-art performance on authoritative benchmarks, significantly outperforming baseline models while retaining robust general capabilities. Our model is available at this https URL.
[AI-16] Synthesizing Procedural Memory: Challenges and Architectures in Automated Workflow Generation
Quick read: This paper studies how Large Language Models (LLMs) can move from passive tool-users to active workflow architects, i.e., how an agent can autonomously synthesize procedural memory as executable code from a blank slate. The key to the solution is a "hypothesize, probe, and code" scientific methodology that systematically identifies and overcomes four structural bottlenecks: the Discovery Gap (navigating large tool registries), the Verification Gap (grounding tool response structures), the Decomposition Gap (replacing inefficient search with Linear State Anchoring), and the Scaling Gap (concurrency and persistence). With this approach, agents can autonomously write robust, production-grade code skills for cross-service orchestration tasks.
Link: https://arxiv.org/abs/2512.20278
Authors: Nishant Gaurav,Adit Akarsh,Ankit Ranjan,Manoj Bajaj
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 7 pages
Abstract:While CodeMem establishes executable code as the optimal representation for agentic procedural memory, the mechanism for autonomously synthesizing this memory from a blank slate remains underexplored. This paper operationalizes the transition of Large Language Models from passive tool-users to active workflow architects. Through a high-fidelity case study of a cross-service orchestration task involving Outlook and OneDrive, we identify and address four structural bottlenecks in automated skill generation: the Discovery Gap involving navigation of large tool registries, the Verification Gap regarding grounding tool response structures, the Decomposition Gap which replaces inefficient search with Linear State Anchoring, and the Scaling Gap focused on concurrency and persistence. We demonstrate that by enforcing a scientific methodology of hypothesize, probe, and code, agents can autonomously write robust, production-grade code skills.
[AI-17] ActionFlow: A Pipelined Action Acceleration for Vision Language Models on Edge
Quick read: This paper addresses the excessive inference latency that prevents Vision-Language-Action (VLA) models from delivering real-time robotic control on resource-constrained edge devices: current VLA models typically run at only 3-5 Hz on the edge, far below the 20-30 Hz needed for smooth robotic interaction, mainly because autoregressive decoding is memory-bound. The key to the solution is ActionFlow, a system-level inference framework whose core innovation is a Cross-Request Pipelining strategy: VLA inference across consecutive time steps is recast as a macro-pipeline of micro-requests, and memory-bound Decode phases are intelligently batched with compute-bound Prefill phases across requests to maximize hardware utilization. It is supported by a Cross-Request State Packed Forward operator and a Unified KV Ring Buffer that fuse fragmented memory operations into dense computation. Experiments show a 2.55x FPS improvement on OpenVLA-7B without retraining, enabling real-time dynamic manipulation on edge hardware.
Link: https://arxiv.org/abs/2512.20276
Authors: Yuntao Dai,Hang Gu,Teng Wang,Qianyu Cheng,Yifei Zheng,Zhiyong Qiu,Lei Gong,Wenqi Lou,Xuehai Zhou
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
Abstract:Vision-Language-Action (VLA) models have emerged as a unified paradigm for robotic perception and control, enabling emergent generalization and long-horizon task execution. However, their deployment in dynamic, real-world environments is severely hindered by high inference latency. While smooth robotic interaction requires control frequencies of 20 to 30 Hz, current VLA models typically operate at only 3-5 Hz on edge devices due to the memory-bound nature of autoregressive decoding. Existing optimizations often require extensive retraining or compromise model accuracy. To bridge this gap, we introduce ActionFlow, a system-level inference framework tailored for resource-constrained edge platforms. At the core of ActionFlow is a Cross-Request Pipelining strategy, a novel scheduler that redefines VLA inference as a macro-pipeline of micro-requests. The strategy intelligently batches memory-bound Decode phases with compute-bound Prefill phases across continuous time steps to maximize hardware utilization. Furthermore, to support this scheduling, we propose a Cross-Request State Packed Forward operator and a Unified KV Ring Buffer, which fuse fragmented memory operations into efficient dense computations. Experimental results demonstrate that ActionFlow achieves a 2.55x improvement in FPS on the OpenVLA-7B model without retraining, enabling real-time dynamic manipulation on edge hardware. Our work is available at this https URL.
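A toy simulation can convey why cross-request pipelining helps when decode is memory-bound: a fused tick runs one request's prefill while other requests' decode steps ride along at little extra cost. All timings and the request model below are invented for illustration; this is not the ActionFlow scheduler.

```python
# Toy model: each request needs 1 prefill tick plus D decode ticks. A fused
# tick overlaps one prefill with the batched decode of in-flight requests.
from collections import deque

PREFILL_MS, DECODE_MS, FUSED_MS = 40.0, 12.0, 44.0   # assumed per-tick costs

def serial_time(n_requests, decode_steps):
    # One request at a time: prefill, then every decode step on its own.
    return n_requests * (PREFILL_MS + decode_steps * DECODE_MS)

def pipelined_time(n_requests, decode_steps):
    waiting = deque([decode_steps] * n_requests)      # decode steps still owed
    inflight, elapsed = [], 0.0
    while waiting or inflight:
        if waiting:                                   # fused prefill + decode tick
            elapsed += FUSED_MS if inflight else PREFILL_MS
            inflight = [s - 1 for s in inflight if s > 1]
            inflight.append(waiting.popleft())        # newly prefilled request
        else:                                         # drain: one batched decode tick
            elapsed += DECODE_MS
            inflight = [s - 1 for s in inflight if s > 1]
    return elapsed

print(f"serial: {serial_time(8, 16):.0f} ms, pipelined: {pipelined_time(8, 16):.0f} ms")
# With these made-up costs: 1856 ms vs 540 ms for 8 requests of 16 decode steps.
```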
[AI-18] Graph-Symbolic Policy Enforcement and Control (G-SPEC): A Neuro-Symbolic Framework for Safe Agentic AI in 5G Autonomous Networks
Quick read: This paper addresses the limits of static automation and Deep Reinforcement Learning for orchestration in 5G Standalone and future 6G networks, together with the stochastic risks that Large Language Model (LLM) agents introduce into intent-based networking, such as topology hallucinations and policy non-compliance. The key to the solution is Graph-Symbolic Policy Enforcement and Control (G-SPEC), a neuro-symbolic framework that constrains probabilistic planning with deterministic verification. Its core is a Governance Triad: a telecom-adapted agent (TSLAM-4B), a Network Knowledge Graph (NKG), and SHACL constraints. On a simulated 450-node 5G Core, G-SPEC achieves zero safety violations and a 94.1% remediation success rate, clearly outperforming the 82.4% baseline; ablations show that NKG validation contributes the majority of the safety gains (68%), making it the decisive mechanism.
Link: https://arxiv.org/abs/2512.20275
Authors: Divya Vijay,Vignesh Ethiraj
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: 15 pages, 3 figures, 3 tables
Abstract:As networks evolve toward 5G Standalone and 6G, operators face orchestration challenges that exceed the limits of static automation and Deep Reinforcement Learning. Although Large Language Model (LLM) agents offer a path toward intent-based networking, they introduce stochastic risks, including topology hallucinations and policy non-compliance. To mitigate this, we propose Graph-Symbolic Policy Enforcement and Control (G-SPEC), a neuro-symbolic framework that constrains probabilistic planning with deterministic verification. The architecture relies on a Governance Triad - a telecom-adapted agent (TSLAM-4B), a Network Knowledge Graph (NKG), and SHACL constraints. We evaluated G-SPEC on a simulated 450-node 5G Core, achieving zero safety violations and a 94.1% remediation success rate, significantly outperforming the 82.4% baseline. Ablation analysis indicates that NKG validation drives the majority of safety gains (68%), followed by SHACL policies (24%). Scalability tests on topologies ranging from 10K to 100K nodes demonstrate that validation latency scales as O(k^1.2) where k is subgraph size. With a processing overhead of 142ms, G-SPEC is viable for SMO-layer operations.
[AI-19] Memory as Resonance: A Biomimetic Architecture for Infinite Context Memory on Ergodic Phonetic Manifolds
Quick read: This paper tackles the memory bottleneck of Large Language Models (LLMs) on long contexts: Key-Value states accumulate linearly (O(N)), eventually forcing a destructive choice between amnesia and latency. The key to the solution is Phonetic Trajectory Memory (PTM), a neuro-symbolic architecture that encodes language not as a sequence of static tensors but as a continuous path on an ergodic manifold governed by irrational rotation matrices. By decoupling navigation (an invariant O(1) geometric signal) from reconstruction (a probabilistic generative act), PTM achieves more than 3,000x compression relative to dense caches, turns retrieval into a resonance process, and uses a "Signal Consensus" mechanism to stabilize the model against hallucination, securing roughly 92% factual accuracy while keeping access latency constant at about 34 ms regardless of depth. This suggests that infinite context does not require infinite silicon, only treating memory as a reconstructive process acting on a conserved physical signal.
Link: https://arxiv.org/abs/2512.20245
Authors: Tarik Houichime,Abdelghani Souhar,Younes El Amrani
Institutions: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Symbolic Computation (cs.SC); Software Engineering (cs.SE)
Comments:
Abstract:The memory of contemporary Large Language Models is bound by a physical paradox: as they learn, they fill up. The linear accumulation (O(N)) of Key-Value states treats context as a warehouse of static artifacts, eventually forcing a destructive choice between amnesia and latency. We challenge this discrete orthodoxy, proposing that long-term memory is not the storage of items, but the persistence of a trajectory. We introduce Phonetic Trajectory Memory (PTM), a neuro-symbolic architecture that encodes language not as a sequence of tensors, but as a continuous path on an ergodic manifold governed by irrational rotation matrices. By decoupling the navigation (an invariant O(1) geometric signal) from the reconstruction (a probabilistic generative act), PTM achieves a compression magnitude of greater than 3,000x relative to dense caches. We demonstrate that retrieval becomes a process of resonance: the phonetic trace stabilizes the model against hallucination via “Signal Consensus” mechanism, securing up to approximately 92% factual accuracy. While this aggressive abstraction alters generative texture, it unlocks immediate access latency (approximately 34ms) independent of depth. Our results suggest that infinite context does not require infinite silicon; it requires treating memory not as data to be stored, but as a reconstructive process acting on a conserved, undying physical signal.
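A minimal sketch of the core geometric idea, under our reading of the abstract: an irrational rotation gives every sequence position a unique, never-repeating "address" on the circle that can be computed in O(1) arithmetic without storing per-token state. The 2-D setup and golden-ratio angle are illustrative simplifications, not the paper's manifolds.

```python
# Ergodic trajectory addressing in 2-D: step t maps to a circle point via an
# irrational rotation, so any position is located by direct arithmetic.
import numpy as np

THETA = 2 * np.pi * (np.sqrt(5) - 1) / 2   # irrational fraction of a full turn

def address(t: int) -> np.ndarray:
    """The rotation applied t times to a fixed start vector, in closed form."""
    a = t * THETA
    rot = np.array([[np.cos(a), -np.sin(a)],
                    [np.sin(a),  np.cos(a)]])
    return rot @ np.array([1.0, 0.0])

# Because the angle is irrational, addresses never repeat and fill the circle
# densely: distant steps remain individually recoverable, with no O(N) cache.
for t in (0, 1, 2, 10_000, 10_001):
    print(t, np.round(address(t), 4))
```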
[AI-20] MemR3: Memory Retrieval via Reflective Reasoning for LLM Agents
Quick read: This paper observes that memory systems for Large Language Model (LLM) agents over-optimize compression and storage while lacking explicit, closed-loop control over memory retrieval. The key to the solution is MemR^3, an autonomous, accurate, and compatible agent system with two core mechanisms: a router that selects among retrieve, reflect, and answer actions to optimize answer quality, and a global evidence-gap tracker that makes the answering process transparent and tracks the evidence-collection state. This closed-loop control departs from the standard retrieve-then-answer pipeline and enables autonomous decision-making. On the LoCoMo benchmark, MemR^3 surpasses strong baselines on LLM-as-a-Judge score and improves existing retrievers across four categories, with overall gains on RAG (+7.29%) and Zep (+1.94%) using a GPT-4.1-mini backend, acting as a plug-and-play controller for existing memory stores.
Link: https://arxiv.org/abs/2512.20237
Authors: Xingbo Du,Loka Li,Duzhen Zhang,Le Song
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 16 pages, 6 figures
Abstract:Memory systems have been designed to leverage past experiences in Large Language Model (LLM) agents. However, many deployed memory systems primarily optimize compression and storage, with comparatively less emphasis on explicit, closed-loop control of memory retrieval. From this observation, we build memory retrieval as an autonomous, accurate, and compatible agent system, named MemR^3, which has two core mechanisms: 1) a router that selects among retrieve, reflect, and answer actions to optimize answer quality; 2) a global evidence-gap tracker that explicitly renders the answering process transparent and tracks the evidence collection process. This design departs from the standard retrieve-then-answer pipeline by introducing a closed-loop control mechanism that enables autonomous decision-making. Empirical results on the LoCoMo benchmark demonstrate that MemR^3 surpasses strong baselines on LLM-as-a-Judge score, and particularly, it improves existing retrievers across four categories with an overall improvement on RAG (+7.29%) and Zep (+1.94%) using GPT-4.1-mini backend, offering a plug-and-play controller for existing memory stores.
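A schematic of the closed retrieve/reflect/answer loop as we read the abstract; `retriever` and `llm` are placeholder callables (the reflect call is assumed to return a list of outstanding gap descriptions), not the MemR^3 implementation.

```python
# Schematic closed-loop controller: a router alternates retrieve and reflect
# actions until the evidence-gap tracker is empty, then answers.
def answer_with_memory(question, retriever, llm, max_turns=6):
    evidence, gaps = [], ["initial question"]       # global evidence-gap tracker
    for _ in range(max_turns):
        evidence += retriever(question, gaps)       # action: retrieve
        gaps = llm(f"List unresolved evidence gaps for {question!r} "
                   f"given evidence {evidence}")    # action: reflect
        if not gaps:                                # all gaps closed
            break
    return llm(f"Answer {question!r} using only evidence {evidence}")  # answer
```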
[AI-21] TongSIM: A General Platform for Simulating Intelligent Machines
Quick read: This paper addresses the lack of a general-purpose, high-fidelity simulation platform for training and evaluating embodied intelligence in the era of multimodal large language models (MLLMs): most existing simulators are narrowly task-specific and cannot span everything from low-level navigation to high-level composite activities such as multi-agent social simulation and human-AI collaboration. The key to the solution is TongSIM, a general-purpose platform for embodied agents that offers over 100 diverse multi-room indoor scenes and an open-ended, interaction-rich outdoor town, with customizable scenes, task-adaptive fidelity, diverse agent types, and dynamic environment simulation, together with a comprehensive evaluation framework for assessing perception, cognition, decision-making, human-robot cooperation, and spatial and social reasoning, providing a unified, scalable platform for advancing general embodied intelligence.
Link: https://arxiv.org/abs/2512.20206
Authors: Zhe Sun,Kunlun Wu,Chuanjian Fu,Zeming Song,Langyong Shi,Zihe Xue,Bohan Jing,Ying Yang,Xiaomeng Gao,Aijia Li,Tianyu Guo,Huiying Li,Xueyuan Yang,Rongkai Liu,Xinyi He,Yuxi Wang,Yue Li,Mingyuan Liu,Yujie Lu,Hongzhao Xie,Shiyun Zhao,Bo Dai,Wei Wang,Tao Yuan,Song-Chun Zhu,Yujia Peng,Zhenliang Zhang
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:As artificial intelligence (AI) rapidly advances, especially in multimodal large language models (MLLMs), research focus is shifting from single-modality text processing to the more complex domains of multimodal and embodied AI. Embodied intelligence focuses on training agents within realistic simulated environments, leveraging physical interaction and action feedback rather than conventionally labeled datasets. Yet, most existing simulation platforms remain narrowly designed, each tailored to specific tasks. A versatile, general-purpose training environment that can support everything from low-level embodied navigation to high-level composite activities, such as multi-agent social simulation and human-AI collaboration, remains largely unavailable. To bridge this gap, we introduce TongSIM, a high-fidelity, general-purpose platform for training and evaluating embodied agents. TongSIM offers practical advantages by providing over 100 diverse, multi-room indoor scenarios as well as an open-ended, interaction-rich outdoor town simulation, ensuring broad applicability across research needs. Its comprehensive evaluation framework and benchmarks enable precise assessment of agent capabilities, such as perception, cognition, decision-making, human-robot cooperation, and spatial and social reasoning. With features like customized scenes, task-adaptive fidelity, diverse agent types, and dynamic environmental simulation, TongSIM delivers flexibility and scalability for researchers, serving as a unified platform that accelerates training, evaluation, and advancement toward general embodied intelligence.
[AI-22] Asynchronous Fast-Slow Vision-Language-Action Policies for Whole-Body Robotic Manipulation
Quick read: This paper addresses the poor control stability and real-time performance of current Vision-Language-Action (VLA) systems in whole-body robotic manipulation, caused by slow Vision-Language Model (VLM) inference: existing methods force the VLM and the action expert to run synchronously at a single frequency, capping overall performance. The key to the solution is DuoCore-FS, a truly asynchronous Fast-Slow VLA framework that decouples execution into a fast pathway for high-frequency action generation and a slow pathway for VLM semantic reasoning. Its core designs are: a latent representation buffer that bridges the two pathways, storing instruction semantics and action-reasoning representations aligned with the scene-instruction context to provide high-level guidance to the fast pathway; and a whole-body action tokenizer that represents multi-joint actions compactly and uniformly. While retaining end-to-end joint training, the framework lets a 3B-parameter VLM generate whole-body action chunks at about 30 Hz, clearly outperforming synchronous Fast-Slow baselines.
Link: https://arxiv.org/abs/2512.20188
Authors: Teqiang Zou,Hongliang Zeng,Yuxuan Nong,Yifan Li,Kehui Liu,Haotian Yang,Xinyang Ling,Xin Li,Lianyang Ma
Institutions: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Most Vision-Language-Action (VLA) systems integrate a Vision-Language Model (VLM) for semantic reasoning with an action expert generating continuous action signals, yet both typically run at a single unified frequency. As a result, policy performance is constrained by the low inference speed of large VLMs. This mandatory synchronous execution severely limits control stability and real-time performance in whole-body robotic manipulation, which involves more joints, larger motion spaces, and dynamically changing views. We introduce a truly asynchronous Fast-Slow VLA framework (DuoCore-FS), organizing the system into a fast pathway for high-frequency action generation and a slow pathway for rich VLM reasoning. The system is characterized by two key features. First, a latent representation buffer bridges the slow and fast systems. It stores instruction semantics and action-reasoning representation aligned with the scene-instruction context, providing high-level guidance to the fast pathway. Second, a whole-body action tokenizer provides a compact, unified representation of whole-body actions. Importantly, the VLM and action expert are still jointly trained end-to-end, preserving unified policy learning while enabling asynchronous execution. DuoCore-FS supports a 3B-parameter VLM while achieving 30 Hz whole-body action-chunk generation, approximately three times as fast as prior VLA models with comparable model sizes. Real-world whole-body manipulation experiments demonstrate improved task success rates and significantly enhanced responsiveness compared to synchronous Fast-Slow VLA baselines. The implementation of DuoCore-FS, including training, inference, and deployment, is provided to commercial users by Astribot as part of the Astribot robotic platform.
[AI-23] Offline Safe Policy Optimization From Heterogeneous Feedback AAMAS2026
Quick read: This paper addresses insufficient safety guarantees in offline preference-based reinforcement learning (PbRL), especially in long-horizon continuous control, where methods that first learn reward and cost models degrade as those models' errors accumulate. The key to the solution is PreSa (Preference and Safety Alignment), a framework that does not explicitly learn reward and cost models: instead, it optimizes a policy directly from pairwise behavior preferences and binary safety labels on trajectory segments, solving a constrained optimization problem in a Lagrangian paradigm to learn reward-maximizing safe policies without invoking constrained RL, and improving safety and performance under both real and synthetic human feedback.
Link: https://arxiv.org/abs/2512.20173
Authors: Ze Gong,Pradeep Varakantham,Akshat Kumar
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted at AAMAS 2026 (Extended Abstract)
Abstract:Offline Preference-based Reinforcement Learning (PbRL) learns rewards and policies aligned with human preferences without the need for extensive reward engineering and direct interaction with human annotators. However, ensuring safety remains a critical challenge across many domains and tasks. Previous works on safe RL from human feedback (RLHF) first learn reward and cost models from offline data, then use constrained RL to optimize a safe policy. While such an approach works in the contextual bandits settings (LLMs), in long horizon continuous control tasks, errors in rewards and costs accumulate, leading to impairment in performance when used with constrained RL methods. To address these challenges, (a) instead of indirectly learning policies (from rewards and costs), we introduce a framework that learns a policy directly based on pairwise preferences regarding the agent's behavior in terms of rewards, as well as binary labels indicating the safety of trajectory segments; (b) we propose PreSa (Preference and Safety Alignment), a method that combines a preference learning module with safety alignment in a constrained optimization problem. This optimization problem is solved within a Lagrangian paradigm that directly learns a reward-maximizing safe policy without explicitly learning reward and cost models, avoiding the need for constrained RL; (c) we evaluate our approach on continuous control tasks with both synthetic and real human feedback. Empirically, our method successfully learns safe policies with high rewards, outperforming state-of-the-art baselines and offline safe RL approaches with ground-truth reward and cost.
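A minimal PyTorch sketch of the kind of Lagrangian objective described in (b), under our reading: a Bradley-Terry preference loss plus a multiplier-weighted safety constraint, with dual ascent on the multiplier. Scores, safety probabilities, and the budget below are synthetic stand-ins, not the authors' formulation.

```python
# Preference + safety alignment in one objective: prefer higher scores for
# preferred segments, keep predicted unsafety under a budget via a multiplier.
import torch

def presa_style_loss(score_pref, score_rej, p_unsafe, lam, safety_budget=0.05):
    # Bradley-Terry preference term: preferred segment should out-score rejected.
    pref_loss = -torch.log(torch.sigmoid(score_pref - score_rej)).mean()
    # Safety constraint term: average unsafe probability minus the budget.
    constraint = p_unsafe.mean() - safety_budget
    return pref_loss + lam * constraint, constraint

lam = torch.tensor(1.0)                       # Lagrange multiplier
score_pref, score_rej = torch.randn(32), torch.randn(32)
p_unsafe = torch.rand(32) * 0.2               # stand-in safety predictions
loss, c = presa_style_loss(score_pref, score_rej, p_unsafe, lam)
lam = torch.clamp(lam + 0.1 * c.detach(), min=0.0)   # dual ascent on lambda
print(float(loss), float(lam))
```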
[AI-24] Odysseus: Jailbreaking Commercial Multimodal LLM-integrated Systems via Dual Steganography NDSS
Quick read: This paper studies how jailbreak attacks can covertly circumvent the alignment safeguards of multimodal large language models (MLLMs). Mainstream defenses rely on detecting explicitly malicious content in inputs or outputs, but in MLLM-integrated systems this assumption breaks down: attackers can hide malicious intent in perceptual modalities such as images and thereby bypass text-based filters. The key finding is Odysseus, a new jailbreak paradigm that introduces dual steganography to covertly embed both malicious queries and responses into benign-looking images, achieving attack success rates of up to 99% on several deployed MLLM-integrated systems and exposing a fundamental blind spot of existing defenses in cross-modal settings.
Link: https://arxiv.org/abs/2512.20168
Authors: Songze Li,Jiameng Cheng,Yiming Li,Xiaojun Jia,Dacheng Tao
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This paper is accepted by Network and Distributed System Security Symposium (NDSS) 2026
Abstract:By integrating language understanding with perceptual modalities such as images, multimodal large language models (MLLMs) constitute a critical substrate for modern AI systems, particularly intelligent agents operating in open and interactive environments. However, their increasing accessibility also raises heightened risks of misuse, such as generating harmful or unsafe content. To mitigate these risks, alignment techniques are commonly applied to align model behavior with human values. Despite these efforts, recent studies have shown that jailbreak attacks can circumvent alignment and elicit unsafe outputs. Currently, most existing jailbreak methods are tailored for open-source models and exhibit limited effectiveness against commercial MLLM-integrated systems, which often employ additional filters. These filters can detect and prevent malicious input and output content, significantly reducing jailbreak threats. In this paper, we reveal that the success of these safety filters heavily relies on a critical assumption that malicious content must be explicitly visible in either the input or the output. This assumption, while often valid for traditional LLM-integrated systems, breaks down in MLLM-integrated systems, where attackers can leverage multiple modalities to conceal adversarial intent, leading to a false sense of security in existing MLLM-integrated systems. To challenge this assumption, we propose Odysseus, a novel jailbreak paradigm that introduces dual steganography to covertly embed malicious queries and responses into benign-looking images. Extensive experiments on benchmark datasets demonstrate that our Odysseus successfully jailbreaks several pioneering and realistic MLLM-integrated systems, achieving up to 99% attack success rate. It exposes a fundamental blind spot in existing defenses, and calls for rethinking cross-modal security in MLLM-integrated systems.
[AI-25] Concept Generalization in Humans and Large Language Models: Insights from the Number Game
Quick read: This paper examines how human and large language model (LLM) generalization differ in the number game, a concept-inference task. The key to the analysis is using a Bayesian model as a common framework, which reveals essential differences in inductive biases and inference strategies: humans flexibly infer both rule-based and similarity-based concepts and can generalize few-shot, even from a single example, whereas LLMs rely more heavily on mathematical rules and need more samples to generalize. The framework quantifies and compares the reasoning of the two kinds of agents, highlighting the efficiency and adaptivity of human concept learning.
Link: https://arxiv.org/abs/2512.20162
Authors: Arghavan Bazigaran,Hansem Sohn
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:We compare human and large language model (LLM) generalization in the number game, a concept inference task. Using a Bayesian model as an analytical framework, we examined the inductive biases and inference strategies of humans and LLMs. The Bayesian model captured human behavior better than LLMs in that humans flexibly infer rule-based and similarity-based concepts, whereas LLMs rely more on mathematical rules. Humans also demonstrated a few-shot generalization, even from a single example, while LLMs required more samples to generalize. These contrasts highlight the fundamental differences in how humans and LLMs infer and generalize mathematical concepts.
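For readers unfamiliar with the number game, here is a minimal Bayesian model of the kind used as the analysis framework (the hypothesis set and the strong-sampling likelihood are illustrative, not the paper's exact model): the size principle makes small hypotheses such as "powers of 2" dominate after a single consistent example, mirroring human few-shot generalization.

```python
# Minimal Bayesian number-game model: posterior over candidate concepts given
# example numbers, with likelihood (1/|h|)^n per hypothesis (size principle).
hypotheses = {
    "even":            set(range(2, 101, 2)),
    "odd":             set(range(1, 101, 2)),
    "powers of 2":     {2, 4, 8, 16, 32, 64},
    "multiples of 10": set(range(10, 101, 10)),
}

def posterior(examples):
    post = {}
    for name, h in hypotheses.items():
        consistent = all(x in h for x in examples)
        post[name] = (1.0 / len(h)) ** len(examples) if consistent else 0.0
    z = sum(post.values())                     # assumes at least one match
    return {k: v / z for k, v in post.items()}

# One example (16) already favors the small "powers of 2" hypothesis.
print(posterior([16]))
print(posterior([16, 8, 2]))
```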
[AI-26] A Bidirectional Gated Recurrent Unit Model for PUE Prediction in Data Centers IJCNN
Quick read: This paper targets energy-efficiency optimization in data centers (DCs), specifically accurate prediction of Power Usage Effectiveness (PUE) so that the key drivers of energy consumption can be identified and selectively improved. The key to the solution is a Bidirectional Gated Recurrent Unit (BiGRU) based PUE prediction model combined with the Recursive Feature Elimination with Cross-Validation (RFECV) algorithm, which selects the most relevant subsets of the 117 available features, yielding accurate PUE prediction. Model quality is assessed with mean squared error (MSE), mean absolute error (MAE), and R-squared, providing an interpretable basis for data-driven energy-saving strategies and greener data centers.
Link: https://arxiv.org/abs/2512.20161
Authors: Dhivya Dharshini Kannan,Anupam Trivedi,Dipti Srinivasan
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 2025 International Joint Conference on Neural Networks (IJCNN), Rome, Italy, 2025, this https URL
Abstract:Data centers account for significant global energy consumption and a carbon footprint. The recent increasing demand for edge computing and AI advancements drives the growth of data center storage capacity. Energy efficiency is a cost-effective way to combat climate change, cut energy costs, improve business competitiveness, and promote IT and environmental sustainability. Thus, optimizing data center energy management is the most important factor in the sustainability of the world. Power Usage Effectiveness (PUE) is used to represent the operational efficiency of the data center. Predicting PUE using Neural Networks provides an understanding of the effect of each feature on energy consumption, thus enabling targeted modifications of those key features to improve energy efficiency. In this paper, we have developed Bidirectional Gated Recurrent Unit (BiGRU) based PUE prediction model and compared the model performance with GRU. The data set comprises 52,560 samples with 117 features using EnergyPlus, simulating a DC in Singapore. Sets of the most relevant features are selected using the Recursive Feature Elimination with Cross-Validation (RFECV) algorithm for different parameter settings. These feature sets are used to find the optimal hyperparameter configuration and train the BiGRU model. The performance of the optimized BiGRU-based PUE prediction model is then compared with that of GRU using mean squared error (MSE), mean absolute error (MAE), and R-squared metrics.
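A minimal PyTorch version of a BiGRU regressor of the kind described; layer sizes, the window length, and the use of the last time step are illustrative choices, not the paper's tuned configuration.

```python
# Bidirectional GRU over a window of 117 features, regressing a scalar PUE.
import torch
import torch.nn as nn

class BiGRURegressor(nn.Module):
    def __init__(self, n_features=117, hidden=64):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True,
                          bidirectional=True)     # forward + backward passes
        self.head = nn.Linear(2 * hidden, 1)      # concat of both directions

    def forward(self, x):                         # x: (batch, time, features)
        out, _ = self.gru(x)
        return self.head(out[:, -1])              # predict PUE at the last step

model = BiGRURegressor()
x = torch.randn(8, 24, 117)                       # 24-step window, 117 features
print(model(x).shape)                             # torch.Size([8, 1])
```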
[AI-27] AXIOM: Benchmarking LLM-as-a-Judge for Code via Rule-Based Perturbation and Multisource Quality Calibration
Quick read: This paper addresses three limitations of current code-evaluation benchmarks for measuring the quality of LLM-generated code: coarse-grained binary labels that miss subtle errors; subjective, vaguely defined evaluation criteria that make manually annotated ground-truth scores unreliable; and uncontrolled data synthesis that yields unbalanced score distributions unrepresentative of real-world code generation. The key to the solution is AXIOM, a perturbation-based framework for synthesizing code-evaluation benchmarks at scale, built on two stages: (1) rule-guided perturbation, where LLMs apply sequences of predefined perturbation rules to high-quality programs to modify their functionality and code quality step by step, precisely controlling each program's target score to achieve balanced distributions; and (2) multisource quality calibration, which calibrates program quality using information from multiple sources to improve annotation consistency and reliability.
Link: https://arxiv.org/abs/2512.20159
Authors: Ruiqi Wang,Xinchen Wang,Cuiyun Gao,Chun Yong Chong,Xin Xia,Qing Liao
Institutions: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) have been increasingly deployed in real-world software engineering, fostering the development of code evaluation metrics to study the quality of LLM-generated code. Conventional rule-based metrics merely score programs based on their surface-level similarities with reference programs instead of analyzing functionality and code quality in depth. To address this limitation, researchers have developed LLM-as-a-judge metrics, prompting LLMs to evaluate and score code, and curated various code evaluation benchmarks to validate their effectiveness. However, these benchmarks suffer from critical limitations, hindering reliable assessments of evaluation capability: Some feature coarse-grained binary labels, which reduce rich code behavior to a single bit of information, obscuring subtle errors. Others propose fine-grained but subjective, vaguely-defined evaluation criteria, introducing unreliability in manually-annotated scores, which is the ground-truth they rely on. Furthermore, they often use uncontrolled data synthesis methods, leading to unbalanced score distributions that poorly represent real-world code generation scenarios. To curate a diverse benchmark with programs of well-balanced distributions across various quality levels and streamline the manual annotation procedure, we propose AXIOM, a novel perturbation-based framework for synthesizing code evaluation benchmarks at scale. It reframes program scores as the refinement effort needed for deployment, consisting of two stages: (1) Rule-guided perturbation, which prompts LLMs to apply sequences of predefined perturbation rules to existing high-quality programs to modify their functionality and code quality, enabling us to precisely control each program's target score to achieve balanced score distributions. (2) Multisource quality calibration, which first selects a subset of…
[AI-28] Enhancing Zero-Shot Time Series Forecasting in Off-the-Shelf LLMs via Noise Injection
Quick read: This paper asks how truly off-the-shelf Large Language Models (LLMs) can perform effective zero-shot time-series (TS) forecasting without any fine-tuning. The core challenge is tokenizing numerical TS data into textual representations aligned with the LLM's pre-trained knowledge; prior work typically bridges this gap by fine-tuning specialized modules. The key to the solution is injecting noise into the raw time series before tokenization, a parameter-free inference-time augmentation that compels the frozen LLM to extrapolate from robust underlying temporal patterns rather than superficial numerical artifacts, improving robustness and performance across diverse benchmarks.
Link: https://arxiv.org/abs/2512.20140
Authors: Xingyou Yin,Ceyao Zhang,Min Hu,Kai Chen
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures
Abstract:Large Language Models (LLMs) have demonstrated effectiveness as zero-shot time series (TS) forecasters. The key challenge lies in tokenizing TS data into textual representations that align with LLMs’ pre-trained knowledge. While existing work often relies on fine-tuning specialized modules to bridge this gap, a distinct, yet challenging, paradigm aims to leverage truly off-the-shelf LLMs without any fine-tuning whatsoever, relying solely on strategic tokenization of numerical sequences. The performance of these fully frozen models is acutely sensitive to the textual representation of the input data, as their parameters cannot adapt to distribution shifts. In this paper, we introduce a simple yet highly effective strategy to overcome this brittleness: injecting noise into the raw time series before tokenization. This non-invasive intervention acts as a form of inference-time augmentation, compelling the frozen LLM to extrapolate based on robust underlying temporal patterns rather than superficial numerical artifacts. We theoretically analyze this phenomenon and empirically validate its effectiveness across diverse benchmarks. Notably, to fully eliminate potential biases from data contamination during LLM pre-training, we introduce two novel TS datasets that fall outside all utilized LLMs’ pre-training scopes, and consistently observe improved performance. This study provides a further step in directly leveraging off-the-shelf LLMs for time series forecasting.
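The intervention itself fits in a few lines. Below is a minimal sketch of noise injection before tokenization; the noise scale, seed, digit formatting, and prompt wording are illustrative choices rather than the paper's exact recipe.

```python
# Perturb the raw series slightly, then serialize it as text for a frozen LLM.
import numpy as np

def tokenize_with_noise(series, noise_std=0.05, seed=0, decimals=2):
    rng = np.random.default_rng(seed)
    scale = noise_std * np.std(series)          # noise scaled to the data spread
    noisy = np.asarray(series, dtype=float) + rng.normal(0.0, scale, len(series))
    return ", ".join(f"{v:.{decimals}f}" for v in noisy)

history = [112.0, 118.5, 121.2, 119.8, 125.3]
prompt = ("Continue this series with the next value only:\n"
          + tokenize_with_noise(history))
print(prompt)   # fed verbatim to an off-the-shelf LLM, no parameter updates
```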
[AI-29] MolAct: An Agentic RL Framework for Molecular Editing and Property Optimization
Quick read: This paper addresses the multi-step, iterative nature of molecular editing and optimization: target properties must be improved step by step while molecules stay chemically valid and structurally similar. The key to the solution is framing molecular design as an Agentic Reinforcement Learning problem through the MolAct framework, trained in two stages: first building editing capability, then optimizing properties while reusing the learned editing behaviors. The framework lets the agent interact over multiple turns, invoking chemical tools for validity checking, property assessment, and similarity control, and using their feedback to refine subsequent edits, which yields reliable and interpretable molecular improvements.
Link: https://arxiv.org/abs/2512.20135
Authors: Zhuo Yang,Yeyun Chen,Jiaqing Xie,Ben Gao,Shuaike Shen,Wanhao Liu,Liujia Yang,Beilun Wang,Tianfan Fu,Yuqiang Li
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Molecular editing and optimization are multi-step problems that require iteratively improving properties while keeping molecules chemically valid and structurally similar. We frame both tasks as sequential, tool-guided decisions and introduce MolAct, an agentic reinforcement learning framework that employs a two-stage training paradigm: first building editing capability, then optimizing properties while reusing the learned editing behaviors. To the best of our knowledge, this is the first work to formalize molecular design as an Agentic Reinforcement Learning problem, where an LLM agent learns to interleave reasoning, tool-use, and molecular optimization. The framework enables agents to interact in multiple turns, invoking chemical tools for validity checking, property assessment, and similarity control, and leverages their feedback to refine subsequent edits. We instantiate the MolAct framework to train two model families: MolEditAgent for molecular editing tasks and MolOptAgent for molecular optimization tasks. In molecular editing, MolEditAgent-7B delivers 100, 95, and 98 valid add, delete, and substitute edits, outperforming strong closed “thinking” baselines such as DeepSeek-R1; MolEditAgent-3B approaches the performance of much larger open “thinking” models like Qwen3-32B-think. In molecular optimization, MolOptAgent-7B (trained on MolEditAgent-7B) surpasses the best closed “thinking” baseline (e.g., Claude 3.7) on LogP and remains competitive on solubility, while maintaining balanced performance across other objectives. These results highlight that treating molecular design as a multi-step, tool-augmented process is key to reliable and interpretable improvements.
[AI-30] Evolutionary Neural Architecture Search with Dual Contrastive Learning
Quick read: This paper addresses the limited accuracy of neural predictors in Evolutionary Neural Architecture Search (ENAS), caused by the high cost of gathering training data: every label requires fully training a candidate architecture, so building a high-precision predictor under a capped compute budget (a limited number of trained architecture-label pairs) is the key bottleneck for ENAS. The key to the solution is Dual Contrastive Learning (DCL-ENAS): the first stage uses contrastive self-supervised learning to extract meaningful representations from unlabeled architectures; the second stage fine-tunes with contrastive learning so the predictor learns the relative rather than absolute performance of architectures, which is sufficient to guide the evolutionary search and improves search efficiency and final accuracy under limited compute.
Link: https://arxiv.org/abs/2512.20112
Authors: Xian-Rong Zhang,Yue-Jiao Gong,Wei-Neng Chen,Jun Zhang
Institutions: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: 26 pages
Abstract:Evolutionary Neural Architecture Search (ENAS) has gained attention for automatically designing neural network architectures. Recent studies use a neural predictor to guide the process, but the high computational costs of gathering training data – since each label requires fully training an architecture – make achieving a high-precision predictor with limited compute budget (i.e., a capped number of fully trained architecture-label pairs) crucial for ENAS success. This paper introduces ENAS with Dual Contrastive Learning (DCL-ENAS), a novel method that employs two stages of contrastive learning to train the neural predictor. In the first stage, contrastive self-supervised learning is used to learn meaningful representations from neural architectures without requiring labels. In the second stage, fine-tuning with contrastive learning is performed to accurately predict the relative performance of different architectures rather than their absolute performance, which is sufficient to guide the evolutionary search. Across NASBench-101 and NASBench-201, DCL-ENAS achieves the highest validation accuracy, surpassing the strongest published baselines by 0.05% (ImageNet16-120) to 0.39% (NASBench-101). On a real-world ECG arrhythmia classification task, DCL-ENAS improves performance by approximately 2.5 percentage points over a manually designed, non-NAS model obtained via random search, while requiring only 7.7 GPU-days.
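A minimal sketch of what a relative-performance (ranking) objective for the second stage can look like, under our reading of the abstract; the hinge-margin formulation and all values are illustrative, not the authors' loss.

```python
# Pairwise ranking fine-tuning: the predictor only needs to order two
# architectures correctly, not reproduce their absolute accuracies.
import torch
import torch.nn.functional as F

def pairwise_rank_loss(pred_a, pred_b, acc_a, acc_b, margin=0.1):
    """Hinge loss on predicted scores, supervised by the true ordering."""
    sign = torch.sign(acc_a - acc_b)          # +1 if a is better, -1 otherwise
    return F.relu(margin - sign * (pred_a - pred_b)).mean()

pred = torch.randn(16, requires_grad=True)    # predictor scores for 16 archs
acc = torch.rand(16)                          # their measured accuracies
i = torch.randint(0, 16, (64,))               # random pairs of architectures
j = torch.randint(0, 16, (64,))
loss = pairwise_rank_loss(pred[i], pred[j], acc[i], acc[j])
loss.backward()
print(float(loss))
```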
[AI-31] Spatio-Temporal Graphs Beyond Grids: Benchmark for Maritime Anomaly Detection NEURIPS2025
Quick read: This paper addresses anomaly detection in non-grid spatio-temporal systems such as maritime traffic, which lack fixed reference points, have sparse and irregular trajectories, and exhibit anomalies at multiple granularities. The key to the solution is a benchmark dataset for graph neural network (GNN) based anomaly detection that extends the Open Maritime Traffic Analysis Dataset (OMTAD) and supports systematic evaluation of node-level, edge-level, and graph-level anomalies, together with two planned LLM-based agents, a Trajectory Synthesizer and an Anomaly Injector, that construct semantically plausible interaction contexts and inject anomalies at multiple granularities, promoting reproducibility and methodological advances for non-grid spatio-temporal anomaly detection.
Link: https://arxiv.org/abs/2512.20086
Authors: Jeehong Kim,Youngseok Hwang,Minchan Kim,Sungho Bae,Hyunwoo Park
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at NeurIPS 2025 Workshop in AI for Science: The Reach and Limits of AI for Scientific Discovery
Abstract:Spatio-temporal graph neural networks (ST-GNNs) have achieved notable success in structured domains such as road traffic and public transportation, where spatial entities can be naturally represented as fixed nodes. In contrast, many real-world systems including maritime traffic lack such fixed anchors, making the construction of spatio-temporal graphs a fundamental challenge. Anomaly detection in these non-grid environments is particularly difficult due to the absence of canonical reference points, the sparsity and irregularity of trajectories, and the fact that anomalies may manifest at multiple granularities. In this work, we introduce a novel benchmark dataset for anomaly detection in the maritime domain, extending the Open Maritime Traffic Analysis Dataset (OMTAD) into a benchmark tailored for graph-based anomaly detection. Our dataset enables systematic evaluation across three different granularities: node-level, edge-level, and graph-level anomalies. We plan to employ two specialized LLM-based agents, a Trajectory Synthesizer and an Anomaly Injector, to construct richer interaction contexts and generate semantically meaningful anomalies. We expect this benchmark to promote reproducibility and to foster methodological advances in anomaly detection for non-grid spatio-temporal systems.
[AI-32] QE-Catalytic: A Graph-Language Multimodal Base Model for Relaxed-Energy Prediction in Catalytic Adsorption
Quick read: This paper addresses insufficient accuracy in catalytic adsorption-energy prediction caused by the diversity of structural configurations, and in particular the weakness of existing language-model approaches at distinguishing "the same system with different configurations." The key to the solution is QE-Catalytic, a multimodal framework that deeply couples an E(3)-equivariant graph Transformer (Equiformer-V2) with the large language model Qwen, jointly modeling three-dimensional atomic coordinates and structured textual descriptions. A graph-text alignment mechanism injects 3D geometric information into the language channel, so the model remains an accurate predictor even when precise coordinates are unavailable, while also supporting target-energy-driven autoregressive structure generation and information completion.
Link: https://arxiv.org/abs/2512.20084
Authors: Yanjie Li,Jian Xu,Xueqing Chen,Lina Yu,Shiming Xiang,Weijun Li,Cheng-lin Liu
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 25 pages
Abstract:Adsorption energy is a key descriptor of catalytic reactivity. It is fundamentally defined as the difference between the relaxed total energy of the adsorbate-surface system and that of an appropriate reference state; therefore, the accuracy of relaxed-energy prediction directly determines the reliability of machine-learning-driven catalyst screening. E(3)-equivariant graph neural networks (GNNs) can natively operate on three-dimensional atomic coordinates under periodic boundary conditions and have demonstrated strong performance on such tasks. In contrast, language-model-based approaches, while enabling human-readable textual descriptions and reducing reliance on explicit graphs (thereby broadening applicability), remain insufficient in both adsorption-configuration energy prediction accuracy and in distinguishing "the same system with different configurations," even with graph-assisted pretraining in the style of GAP-CATBERTa. To this end, we propose QE-Catalytic, a multimodal framework that deeply couples a large language model (Qwen) with an E(3)-equivariant graph Transformer (Equiformer-V2), enabling unified support for adsorption-configuration property prediction and inverse design on complex catalytic surfaces. During prediction, QE-Catalytic jointly leverages three-dimensional structures and structured configuration text, and injects "3D geometric information" into the language channel via graph-text alignment, allowing it to function as a high-performance text-based predictor when precise coordinates are unavailable, while also autoregressively generating CIF files for target-energy-driven structure design and information completion. On OC20, QE-Catalytic reduces the MAE of relaxed adsorption energy from 0.713 eV to 0.486 eV, and consistently outperforms baseline models such as CatBERTa and GAP-CATBERTa across multiple evaluation protocols.
[AI-33] Adaptive Financial Sentiment Analysis for NIFTY 50 via Instruction-Tuned LLMs, RAG and Reinforcement Learning Approaches
Quick read: This paper addresses the failure of existing financial sentiment analysis to account for the influence of stock prices and market feedback on sentiment classification, aiming to improve adaptivity and predictive accuracy under real market conditions. The key to the solution is an adaptive framework that couples large language models (LLMs) with real-time stock-market feedback: the LLaMA 3.2 3B model is instruction-tuned on the SentiFin dataset; a retrieval-augmented generation (RAG) pipeline dynamically selects multi-source context via cosine similarity of sentence embeddings; a feedback-driven module adjusts source reliability by comparing predicted sentiment with actual next-day stock returns, letting the system iteratively adapt to market behavior; and, to generalize this mechanism over temporal data, a reinforcement learning agent trained with Proximal Policy Optimization (PPO) learns source-weighting policies from cumulative sentiment-return alignment rewards. Experiments on NIFTY 50 news headlines show clear gains in classification accuracy, F1-score, and market alignment over baseline models and static retrieval methods.
Link: https://arxiv.org/abs/2512.20082
Authors: Chaithra,Kamesh Kadimisetty,Biju R Mohan
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted in CODS 2025
Abstract:Financial sentiment analysis plays a crucial role in informing investment decisions, assessing market risk, and predicting stock price trends. Existing works in financial sentiment analysis have not considered the impact of stock prices or market feedback on sentiment analysis. In this paper, we propose an adaptive framework that integrates large language models (LLMs) with real-world stock market feedback to improve sentiment classification in the context of the Indian stock market. The proposed methodology fine-tunes the LLaMA 3.2 3B model using instruction-based learning on the SentiFin dataset. To enhance sentiment predictions, a retrieval-augmented generation (RAG) pipeline is employed that dynamically selects multi-source contextual information based on the cosine similarity of the sentence embeddings. Furthermore, a feedback-driven module is introduced that adjusts the reliability of the source by comparing predicted sentiment with actual next-day stock returns, allowing the system to iteratively adapt to market behavior. To generalize this adaptive mechanism across temporal data, a reinforcement learning agent trained using proximal policy optimization (PPO) is incorporated. The PPO agent learns to optimize source weighting policies based on cumulative reward signals from sentiment-return alignment. Experimental results on NIFTY 50 news headlines collected from 2024 to 2025 demonstrate that the proposed system significantly improves classification accuracy, F1-score, and market alignment over baseline models and static retrieval methods. The results validate the potential of combining instruction-tuned LLMs with dynamic feedback and reinforcement learning for robust, market-aware financial sentiment modeling.
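The retrieval step reduces to cosine similarity over sentence embeddings. A minimal version follows; the embedding dimension and the random vectors stand in for any sentence encoder and cached snippet store.

```python
# Rank candidate context snippets by cosine similarity to the query embedding
# and keep the top-k, as in a standard RAG selection step.
import numpy as np

def top_k_context(query_vec, doc_vecs, k=3):
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                                    # cosine similarity per doc
    order = np.argsort(sims)[::-1][:k]
    return order, sims[order]

rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 384))                  # 100 cached snippet embeddings
query = rng.normal(size=384)                        # embedded news headline
idx, scores = top_k_context(query, docs)
print(idx, np.round(scores, 3))
```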
[AI-34] CBA: Communication-Bound-Aware Cross-Domain Resource Assignment for Pipeline-Parallel Distributed LLM Training in Dynamic Multi-DC Optical Networks
Quick read: This paper addresses the communication bottleneck of pipeline-parallel distributed training over multi-datacenter optical networks by optimizing the cross-domain resource assignment strategy to lower iteration time and reduce blocked requests. The key to the solution is a communication-bound-aware cross-domain resource assignment framework that adapts allocation to the communication load; experiments show it lowers iteration time by 31.25% and reduces blocked requests by 13.20% compared to baseline schemes.
Link: https://arxiv.org/abs/2512.20080
Authors: Dianxuan Fu,Xiaomin Liu,Yihao Zhang,Shikui Shen,Weisheng Hu,Qunbi Zhuge
Institutions: Unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Optics (physics.optics)
Comments:
Abstract:We propose a communication-bound-aware cross-domain resource assignment framework for pipeline-parallel distributed training over multi-datacenter optical networks, which lowers iteration time by 31.25% and reduces 13.20% blocking requests compared to baselines.
[AI-35] On the Effectiveness of Instruction-Tuning Local LLMs for Identifying Software Vulnerabilities
Quick read: This paper addresses two limitations of current Large Language Model (LLM) based automated vulnerability analysis: reliance on online API services, which requires disclosing source code to third parties and poses security risks; and framing the task as binary classification (vulnerable or not), which limits practical utility. The key to the solution is reformulating the task as Software Vulnerability Identification (SVI), where the model outputs the weakness type as a Common Weakness Enumeration (CWE) ID, providing finer-grained and more useful analysis, and demonstrating that instruction-tuning small, locally deployable LLMs achieves a better performance-cost trade-off than large cloud API-based models, yielding a more secure, effective, and practical approach that integrates into real-world vulnerability management workflows.
Link: https://arxiv.org/abs/2512.20062
Authors: Sangryu Park,Gihyuk Ko,Homook Cho
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: The 9th International Conference on Mobile Internet Security (MobiSec 2025)
Abstract:Large Language Models (LLMs) show significant promise in automating software vulnerability analysis, a critical task given the impact of security failure of modern software systems. However, current approaches in using LLMs to automate vulnerability analysis mostly rely on using online API-based LLM services, requiring the user to disclose the source code in development. Moreover, they predominantly frame the task as a binary classification(vulnerable or not vulnerable), limiting potential practical utility. This paper addresses these limitations by reformulating the problem as Software Vulnerability Identification (SVI), where LLMs are asked to output the type of weakness in Common Weakness Enumeration (CWE) IDs rather than simply indicating the presence or absence of a vulnerability. We also tackle the reliance on large, API-based LLMs by demonstrating that instruction-tuning smaller, locally deployable LLMs can achieve superior identification performance. In our analysis, instruct-tuning a local LLM showed better overall performance and cost trade-off than online API-based LLMs. Our findings indicate that instruct-tuned local models represent a more effective, secure, and practical approach for leveraging LLMs in real-world vulnerability management workflows.
[AI-36] Scaling Reinforcement Learning for Content Moderation with Large Language Models
Quick read: This paper addresses the challenge of content moderation at scale: training AI systems to expert-level accuracy in realistic regimes with sparse labels, evolving policy definitions, and the need for nuanced reasoning beyond shallow pattern matching. The key to the solution is a systematic study of scaling reinforcement learning (RL) for content classification, evaluating multiple RL training recipes and reward-shaping strategies (including verifiable rewards and LLM-as-judge frameworks) to turn general-purpose language models into specialized, policy-aligned classifiers across three real-world moderation tasks. The findings show that RL exhibits sigmoid-like scaling, improving smoothly with more training data, rollouts, and optimization steps before gradually saturating, and that it substantially outperforms supervised fine-tuning on tasks requiring complex policy-grounded reasoning, with up to 100x higher data efficiency, making it especially effective where expert annotations are scarce or costly.
Link: https://arxiv.org/abs/2512.20061
Authors: Hamed Firooz,Rui Liu,Yuchen Lu,Zhenyu Hou,Fangzhou Xiong,Xiaoyang Zhang,Changshu Jian,Zhicheng Zhu,Jiayuan Ma,Jacob Tao,Chaitali Gupta,Xiaochang Peng,Shike Mei,Hang Cui,Yang Qin,Shuo Tang,Jason Gaedtke,Arpit Mittal
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Content moderation at scale remains one of the most pressing challenges in today’s digital ecosystem, where billions of user- and AI-generated artifacts must be continuously evaluated for policy violations. Although recent advances in large language models (LLMs) have demonstrated strong potential for policy-grounded moderation, the practical challenges of training these systems to achieve expert-level accuracy in real-world settings remain largely unexplored, particularly in regimes characterized by label sparsity, evolving policy definitions, and the need for nuanced reasoning beyond shallow pattern matching. In this work, we present a comprehensive empirical investigation of scaling reinforcement learning (RL) for content classification, systematically evaluating multiple RL training recipes and reward-shaping strategies-including verifiable rewards and LLM-as-judge frameworks-to transform general-purpose language models into specialized, policy-aligned classifiers across three real-world content moderation tasks. Our findings provide actionable insights for industrial-scale moderation systems, demonstrating that RL exhibits sigmoid-like scaling behavior in which performance improves smoothly with increased training data, rollouts, and optimization steps before gradually saturating. Moreover, we show that RL substantially improves performance on tasks requiring complex policy-grounded reasoning while achieving up to 100x higher data efficiency than supervised fine-tuning, making it particularly effective in domains where expert annotations are scarce or costly.
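The reported sigmoid-like scaling is easy to probe on one's own runs by fitting a logistic curve to score versus log training examples; the data points below are synthetic placeholders, not the paper's measurements.

```python
# Fit a logistic (sigmoid) scaling curve: accuracy vs. log10(training examples).
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, lo, hi, mid, slope):
    return lo + (hi - lo) / (1.0 + np.exp(-slope * (x - mid)))

n = np.array([1e2, 3e2, 1e3, 3e3, 1e4, 3e4, 1e5])           # training examples
acc = np.array([0.52, 0.55, 0.63, 0.74, 0.82, 0.85, 0.86])  # made-up scores
params, _ = curve_fit(logistic, np.log10(n), acc, p0=[0.5, 0.9, 3.0, 1.0])
print(dict(zip(["lo", "hi", "mid", "slope"], np.round(params, 3))))
# "hi" estimates the saturation ceiling; "mid" is where gains are steepest.
```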
[AI-37] An Optimal Policy for Learning Controllable Dynamics by Exploration
Quick read: This paper studies optimal exploration for learning the dynamics of controllable Markov chains in an unknown environment: how to design a policy that maximizes information gain when exploring over a limited time horizon. The key to the solution is a non-stationary policy whose constraint set of admissible controls changes over time, allowing the agent to "learn by exploring" as it greedily selects the control that maximizes its information gain at each step. The analysis shows that states which restrict controllability of the dynamics, such as transient, absorbing, and non-backtracking states, force the optimal policy to be non-stationary, and optimality is demonstrated via counting arguments, comparisons with suboptimal policies, and a sequential improvement property from dynamic programming.
Link: https://arxiv.org/abs/2512.20053
Authors: Peter N. Loxley
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:
Abstract:Controllable Markov chains describe the dynamics of sequential decision making tasks and are the central component in optimal control and reinforcement learning. In this work, we give the general form of an optimal policy for learning controllable dynamics in an unknown environment by exploring over a limited time horizon. This policy is simple to implement and efficient to compute, and allows an agent to "learn by exploring" as it maximizes its information gain in a greedy fashion by selecting controls from a constraint set that changes over time during exploration. We give a simple parameterization for the set of controls, and present an algorithm for finding an optimal policy. This form of policy is necessitated by the existence of certain types of states that restrict control of the dynamics, such as transient states, absorbing states, and non-backtracking states. We show why the occurrence of these states makes a non-stationary policy essential for achieving optimal exploration. Six interesting examples of controllable dynamics are treated in detail. Policy optimality is demonstrated using counting arguments, comparing with suboptimal policies, and by making use of a sequential improvement property from dynamic programming.
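A toy rendering of the "learn by exploring" rule as we read it: at each step, pick the admissible control whose predicted next-state distribution is most uncertain (highest entropy, a simple information-gain proxy), with the admissible set allowed to change over time. The transition estimates and the constraint schedule below are invented for illustration.

```python
# Greedy exploration over a controllable Markov chain with a time-varying
# admissible control set.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def greedy_control(p_hat, state, admissible):
    """p_hat[u][s] is the estimated next-state distribution under control u."""
    gains = {u: entropy(p_hat[u][state]) for u in admissible}
    return max(gains, key=gains.get)

rng = np.random.default_rng(1)
p_hat = rng.dirichlet(np.ones(5), size=(3, 5))     # 3 controls, 5 states
schedule = [{0, 1, 2}, {0, 1}, {1, 2}]             # non-stationary constraint sets
state = 0
for t, admissible in enumerate(schedule):
    u = greedy_control(p_hat, state, admissible)
    state = rng.choice(5, p=p_hat[u][state])
    print(f"t={t}: control {u} -> state {state}")
```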
[AI-38] Learning Skills from Action-Free Videos
Quick read: This paper addresses the gap between video generative models, whose impressive visual predictions are hard to translate into low-level robot actions, and latent-action models, which align videos with actions but operate at the single-step level and lack high-level planning. The key to the solution is Skill Abstraction from Optical Flow (SOF), a framework that learns latent skills from large collections of action-free videos by using optical flow as an intermediate representation that captures motion aligned with both video dynamics and robot actions; learning skills in this flow-based latent space enables high-level planning over video-derived skills and easier translation of those skills into actions, improving performance in multitask and long-horizon settings.
Link: https://arxiv.org/abs/2512.20052
Authors: Hung-Chieh Fang,Kuo-Han Hung,Chu-Rong Chen,Po-Jung Chou,Chun-Kai Yang,Po-Chen Ko,Yu-Chiang Wang,Yueh-Hua Wu,Min-Hung Chen,Shao-Hua Sun
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
Abstract:Learning from videos offers a promising path toward generalist robots by providing rich visual and temporal priors beyond what real robot datasets contain. While existing video generative models produce impressive visual predictions, they are difficult to translate into low-level actions. Conversely, latent-action models better align videos with actions, but they typically operate at the single-step level and lack high-level planning capabilities. We bridge this gap by introducing Skill Abstraction from Optical Flow (SOF), a framework that learns latent skills from large collections of action-free videos. Our key idea is to learn a latent skill space through an intermediate representation based on optical flow that captures motion information aligned with both video dynamics and robot actions. By learning skills in this flow-based latent space, SOF enables high-level planning over video-derived skills and allows for easier translation of these skills into actions. Experiments show that our approach consistently improves performance in both multitask and long-horizon settings, demonstrating the ability to acquire and compose skills directly from raw visual data.
[AI-39] Discovering Lie Groups with Flow Matching
Quick read: This paper addresses how to learn the symmetries underlying physical systems and machine-learning data directly from the data itself, both for scientific understanding and for improving model performance and sample efficiency. The key to the solution is LieFlow, which casts symmetry discovery as flow matching on Lie groups: it learns a distribution over a larger hypothesis group such that the learned distribution matches the symmetries observed in data, allowing discrete groups (including reflections) to be discovered with fewer assumptions than prior methods. Experiments on 2D and 3D point clouds recover various symmetry structures, and the authors identify a "last-minute convergence" phenomenon, where samples remain stationary until late in the flow, and introduce a novel interpolation scheme to address it.
Link: https://arxiv.org/abs/2512.20043
Authors: Jung Yeon Park,Yuxuan Chen,Floor Eijkelboom,Jan-Willem van de Meent,Lawson L.S. Wong,Robin Walters
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Symmetry is fundamental to understanding physical systems, and at the same time, can improve performance and sample efficiency in machine learning. Both pursuits require knowledge of the underlying symmetries in data. To address this, we propose learning symmetries directly from data via flow matching on Lie groups. We formulate symmetry discovery as learning a distribution over a larger hypothesis group, such that the learned distribution matches the symmetries observed in data. Relative to previous works, our method, LieFlow, is more flexible in terms of the types of groups it can discover and requires fewer assumptions. Experiments on 2D and 3D point clouds demonstrate the successful discovery of discrete groups, including reflections by flow matching over the complex domain. We identify a key challenge where the symmetric arrangement of the target modes causes "last-minute convergence," where samples remain stationary until relatively late in the flow, and introduce a novel interpolation scheme for flow matching for symmetry discovery.
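For orientation, here is bare-bones flow matching on the circle, arguably the simplest Lie-group setting: a velocity field is regressed onto straight-line interpolants from a base distribution to a set of target modes. The target modes and network are illustrative, and this sketch uses plain linear interpolation rather than the paper's novel scheme.

```python
# Minimal flow matching over angles (elements of SO(2) parameterized in radians).
import torch
import torch.nn as nn

target_modes = torch.tensor([0.0, torch.pi / 2, torch.pi, 3 * torch.pi / 2])
net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    x1 = target_modes[torch.randint(0, 4, (256,))]   # group elements in the data
    x0 = torch.rand(256) * 2 * torch.pi              # uniform base samples
    t = torch.rand(256)
    xt = (1 - t) * x0 + t * x1                       # straight-line path
    v_target = x1 - x0                               # its constant velocity
    v_pred = net(torch.stack([xt, t], dim=1)).squeeze(1)
    loss = ((v_pred - v_target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(f"final flow-matching loss: {loss.item():.3f}")
```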
[AI-40] DecoKAN: Interpretable Decomposition for Forecasting Cryptocurrency Market Dynamics
Quick read: This paper addresses two problems of current deep-learning forecasters for cryptocurrency time series: they fail to decouple the composite structure of long-term socio-economic trends and short-term high-frequency speculative oscillations, and they lack the interpretability needed for trustworthy financial decision-making. The key to the solution is the DecoKAN framework, which uses a multi-level Discrete Wavelet Transform (DWT) to decompose the signal into distinct frequency components for targeted analysis, and spline-based Kolmogorov-Arnold Network (KAN) mixers to build intrinsically interpretable nonlinear mappings within each decomposed subseries; a symbolic analysis pipeline of sparsification, pruning, and symbolization then produces concise analytical expressions that expose the learned patterns, improving the balance between predictive accuracy and model transparency.
Link: https://arxiv.org/abs/2512.20028
Authors: Yuan Gao,Zhenguo Dong,Xuelong Wang,Zhiqiang Wang,Yong Zhang,Shaofan Wang
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Accurate and interpretable forecasting of multivariate time series is crucial for understanding the complex dynamics of cryptocurrency markets in digital asset systems. Advanced deep learning methodologies, particularly Transformer-based and MLP-based architectures, have achieved competitive predictive performance in cryptocurrency forecasting tasks. However, cryptocurrency data is inherently composed of long-term socio-economic trends and local high-frequency speculative oscillations. Existing deep learning-based ‘black-box’ models fail to effectively decouple these composite dynamics or provide the interpretability needed for trustworthy financial decision-making. To overcome these limitations, we propose DecoKAN, an interpretable forecasting framework that integrates multi-level Discrete Wavelet Transform (DWT) for decoupling and hierarchical signal decomposition with Kolmogorov-Arnold Network (KAN) mixers for transparent and interpretable nonlinear modeling. The DWT component decomposes complex cryptocurrency time series into distinct frequency components, enabling frequency-specific analysis, while KAN mixers provide intrinsically interpretable spline-based mappings within each decomposed subseries. Furthermore, interpretability is enhanced through a symbolic analysis pipeline involving sparsification, pruning, and symbolization, which produces concise analytical expressions offering symbolic representations of the learned patterns. Extensive experiments demonstrate that DecoKAN achieves the lowest average Mean Squared Error on all tested real-world cryptocurrency datasets (BTC, ETH, XMR), consistently outperforming a comprehensive suite of competitive state-of-the-art baselines. These results validate DecoKAN’s potential to bridge the gap between predictive accuracy and model transparency, advancing trustworthy decision support within complex cryptocurrency markets.
zh
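下面用 PyWavelets 给出“多级 DWT 分频解耦”这一步的示意:把一条合成价格序列拆成低频趋势与多层高频细节,再仅用低频部分重构趋势。小波基 db4 与层数 3 为假设取值,且不包含论文中的 KAN 混合器与符号化流程:

```python
import numpy as np
import pywt  # PyWavelets,可通过 pip install PyWavelets 安装

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 512)
# 合成序列:随机游走(长期趋势)叠加正弦振荡(高频波动)
price = np.cumsum(rng.normal(size=512)) + 5 * np.sin(2 * np.pi * 4 * t)

# 三级离散小波分解:coeffs[0] 为低频近似系数,其余为各层高频细节系数
coeffs = pywt.wavedec(price, wavelet="db4", level=3)
approx, details = coeffs[0], coeffs[1:]
print("低频趋势系数长度:", len(approx))
for i, d in enumerate(details, start=1):
    print(f"第 {i} 层高频细节系数长度: {len(d)}")

# 将细节系数置零后重构,得到剔除高频投机波动的趋势序列
trend_only = pywt.waverec([approx] + [np.zeros_like(d) for d in details], "db4")
print("重构趋势序列长度:", len(trend_only))
```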
[AI-41] Bring My Cup! Personalizing Vision-Language-Action Models with Visual Attentive Prompting
【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在处理个性化指令时的局限性问题,即当用户发出如“拿我的杯子”这类命令时,模型需从外观相似的多个物体中识别并操作特定个体实例,而此类任务在训练中未见。为应对这一挑战,论文提出一种无需训练的感知适配器——视觉注意提示(Visual Attentive Prompting, VAP),其核心在于通过参考图像构建非参数化视觉记忆,利用开放词汇检测与基于嵌入的匹配实现对个人物品的场景定位,并将该定位结果以视觉提示形式注入VLA模型,具体表现为突出目标物体并改写指令文本,从而增强模型对实例级控制的能力。
链接: https://arxiv.org/abs/2512.20014
作者: Sangoh Lee,Sangwoo Mo,Wook-Shin Han
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:While Vision-Language-Action (VLA) models generalize well to generic instructions, they struggle with personalized commands such as “bring my cup”, where the robot must act on one specific instance among visually similar objects. We study this setting of manipulating personal objects, in which a VLA must identify and control a user-specific object unseen during training using only a few reference images. To address this challenge, we propose Visual Attentive Prompting (VAP), a simple-yet-effective training-free perceptual adapter that equips frozen VLAs with top-down selective attention. VAP treats the reference images as a non-parametric visual memory, grounds the personal object in the scene through open-vocabulary detection and embedding-based matching, and then injects this grounding as a visual prompt by highlighting the object and rewriting the instruction. We construct two simulation benchmarks, Personalized-SIMPLER and Personalized-VLABench, and a real-world tabletop benchmark to evaluate personalized manipulation across multiple robots and tasks. Experiments show that VAP consistently outperforms generic policies and token-learning baselines in both success rate and correct-object manipulation, helping to bridge the gap between semantic understanding and instance-level control.
zh
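VAP 的核心步骤之一是把参考图嵌入当作非参数化视觉记忆并与场景候选框做匹配。下面是一个用随机向量代替真实编码器与开放词汇检测器的假设性示意,仅演示余弦匹配与实例选取的逻辑:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

reference_embs = normalize(rng.normal(size=(3, 256)))   # 3 张参考图的嵌入(视觉记忆)
candidate_embs = normalize(rng.normal(size=(5, 256)))   # 场景中 5 个候选检测框的嵌入

# 每个候选框对参考记忆取最大余弦相似度作为匹配分数
scores = (candidate_embs @ reference_embs.T).max(axis=1)
target_idx = int(scores.argmax())
print(f"匹配到的实例编号: {target_idx}, 分数: {scores[target_idx]:.3f}")

# 后续可将该框高亮为视觉提示,并把指令改写为指向该实例的描述
```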
[AI-42] IoT-based Android Malware Detection Using Graph Neural Network With Adversarial Defense
【速读】:该论文旨在解决基于图神经网络(Graph Neural Network, GNN)的安卓恶意软件检测模型在面对对抗攻击时的脆弱性问题。其核心挑战在于,攻击者可通过添加虚假API调用关系生成对抗样本,从而规避GNN分类器的检测。解决方案的关键在于提出一种基于变分图自编码器(Variational Graph Autoencoder, VGAE)的生成对抗网络(Generative Adversarial Network, GAN)攻击算法——VGAE-MalGAN,该算法通过生成具有欺骗性的恶意API图结构来干扰目标检测模型,并进一步验证了通过引入对抗样本进行再训练可有效提升模型鲁棒性,从而缓解此类攻击的影响。
链接: https://arxiv.org/abs/2512.20004
作者: Rahul Yumlembam,Biju Issac,Seibu Mary Jacob,Longzhi Yang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages
Abstract:Since the Internet of Things (IoT) is widely adopted using Android applications, detecting malicious Android apps is essential. In recent years, Android graph-based deep learning research has proposed many approaches to extract relationships from applications as graphs to generate graph embeddings. First, we demonstrate the effectiveness of graph-based classification using a Graph Neural Network (GNN)-based classifier to generate API graph embeddings. The graph embeddings are combined with Permission and Intent features to train multiple machine learning and deep learning models for Android malware detection. The proposed classification approach achieves an accuracy of 98.33 percent on the CICMaldroid dataset and 98.68 percent on the Drebin dataset. However, graph-based deep learning models are vulnerable, as attackers can add fake relationships to evade detection by the classifier. Second, we propose a Generative Adversarial Network (GAN)-based attack algorithm named VGAE-MalGAN targeting graph-based GNN Android malware classifiers. The VGAE-MalGAN generator produces adversarial malware API graphs, while the VGAE-MalGAN substitute detector attempts to mimic the target detector. Experimental results show that VGAE-MalGAN can significantly reduce the detection rate of GNN-based malware classifiers. Although the model initially fails to detect adversarial malware, retraining with generated adversarial samples improves robustness and helps mitigate adversarial attacks.
zh
[AI-43] S³IT: A Benchmark for Spatially Situated Social Intelligence Test
【速读】:该论文旨在解决当前智能体评估体系中缺乏对具身社会智能(embodied social intelligence)的综合衡量问题,即现有方法要么局限于非具身的社会推理(如文本层面),要么忽视社会因素的物理任务,无法有效评估智能体在真实具身场景下协调物理约束与社会规范的能力。其解决方案的关键在于提出Spatially Situated Social Intelligence Test (S³IT) 基准测试,该基准围绕一个新颖且复杂的“座位排序任务”构建,要求智能体在三维环境中为一群由大语言模型(LLM-driven)驱动的NPC安排座位,这些NPC具有多样身份、偏好及复杂人际关系。通过可程序化扩展的框架生成可控难度的多样化场景,迫使智能体通过主动对话获取偏好、自主探索环境并进行多目标优化,从而系统性地评估其在空间感知、社交理解与决策权衡方面的综合能力。
链接: https://arxiv.org/abs/2512.19992
作者: Zhe Sun,Xueyuan Yang,Yujie Lu,Zhenliang Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 10 pages, 9 figures
Abstract:The integration of embodied agents into human environments demands embodied social intelligence: reasoning over both social norms and physical constraints. However, existing evaluations fail to address this integration, as they are limited to either disembodied social reasoning (e.g., in text) or socially-agnostic physical tasks. Both approaches fail to assess an agent's ability to integrate and trade off both physical and social constraints within a realistic, embodied context. To address this challenge, we introduce the Spatially Situated Social Intelligence Test (S³IT), a benchmark specifically designed to evaluate embodied social intelligence. It is centered on a novel and challenging seat-ordering task, requiring an agent to arrange seating in a 3D environment for a group of large language model-driven (LLM-driven) NPCs with diverse identities, preferences, and intricate interpersonal relationships. Our procedurally extensible framework generates a vast and diverse scenario space with controllable difficulty, compelling the agent to acquire preferences through active dialogue, perceive the environment via autonomous exploration, and perform multi-objective optimization within a complex constraint network. We evaluate state-of-the-art LLMs on S³IT and find that they still struggle with this problem, showing an obvious gap compared with the human baseline. Results imply that LLMs have deficiencies in spatial intelligence, yet simultaneously demonstrate their ability to achieve near human-level competence in resolving conflicts that possess explicit textual cues.
zh
[AI-44] Neuron-Guided Interpretation of Code LLMs: Where, Why, and How?
【速读】:该论文旨在解决代码大语言模型(Code LLMs)在内部机制上的可解释性不足问题,特别是现有自然语言处理(NLP)领域的神经元可解释性方法难以适配编程语言的语法结构、层级性和可执行特性。其关键解决方案在于从神经元层面实证分析代码LLMs,识别出两类核心组件:一是语言特异性神经元(language-specific neurons),它们对特定编程语言具有选择性响应;二是概念层(concept layers),即前馈层中编码跨语言通用代码语义表示的层次结构。研究发现,低层主要捕捉语言特异性的语法特征,而中层则涌现出共享语义抽象的概念层,这些发现为多语言场景下的代码生成优化、克隆检测和摘要迁移等任务提供了可解释且有效的干预路径。
链接: https://arxiv.org/abs/2512.19980
作者: Zhe Yin,Xiaodong Gu,Beijun Shen
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted by FSE2026
Abstract:Code language models excel on code intelligence tasks, yet their internal interpretability is underexplored. Existing neuron interpretability techniques from NLP are suboptimal for source code due to programming languages' formal, hierarchical, and executable nature. We empirically investigate code LLMs at the neuron level, localizing language-specific neurons (selectively responsive to one language) and concept layers (feed-forward layers encoding language-agnostic code representations). We analyze Llama-3.1-8B and Qwen2.5-Coder-32B on multilingual inputs in C++, Java, Python, Go, and JavaScript, measuring neuron selectivity and layerwise contributions during generation. We find (1) neurons specialized for individual languages alongside a universal subset supporting general-purpose generation; and (2) lower layers mainly encode language-specific syntax, while middle layers capture semantic abstractions shared across languages, emerging as concept layers. We demonstrate utility on three tasks: neuron-guided fine-tuning for code generation, clone detection via concept-layer embeddings, and concept-layer-guided transfer for code summarization, each yielding consistent gains in multilingual settings.
zh
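“语言特异性神经元”的定位可以理解为:比较同一神经元在不同语言输入下的平均激活差异。下面用随机激活数据给出一个简化示意(真实流程需从 Llama/Qwen 的逐层前向中抽取激活;选择性阈值 1.0 为假设取值):

```python
import numpy as np

rng = np.random.default_rng(42)
langs = ["cpp", "java", "python", "go", "js"]
# 假设的某层激活: (样本数, 神经元数),真实场景需从模型前向中抽取
acts = {l: rng.normal(size=(200, 1024)) for l in langs}
acts["python"][:, :10] += 2.0  # 人为植入 10 个“Python 特异”神经元作为正例

mean_act = np.stack([acts[l].mean(axis=0) for l in langs])  # (5, 1024)
top2 = np.sort(mean_act, axis=0)[-2:]       # 每个神经元的前两名语言均值
selectivity = top2[1] - top2[0]             # 最高与次高之差衡量选择性

specific = np.where(selectivity > 1.0)[0]   # 阈值 1.0 为假设取值
print("检测到的语言特异性神经元编号:", specific)
```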
[AI-45] FGDCC: Fine-Grained Deep Cluster Categorization – A Framework for Intra-Class Variability Problems in Plant Classification
【速读】:该论文旨在解决细粒度视觉分类(Fine-Grained Visual Categorization, FGVC)任务中因类内变异性(intra-class variability)导致的深度学习模型性能下降问题,尤其在类别样本稀缺时更为显著。其解决方案的关键在于:对每个类别单独进行聚类,以发现能够编码图像间潜在相似性的伪标签(pseudo-labels),并基于这些标签构建分层分类框架,从而学习更精细的视觉特征,有效缓解类内差异带来的负面影响。
链接: https://arxiv.org/abs/2512.19960
作者: Luciano Araujo Dourado Filho,Rodrigo Tripodi Calumby
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Intra-class variability refers to the degree of dissimilarity between images within a class. Depending on its intensity, intra-class variability can hinder the learning process for DL models, especially when such classes are also underrepresented, which is a very common scenario in Fine-Grained Visual Categorization (FGVC) tasks. This paper proposes a novel method that aims at improving classification performance in FGVC tasks by learning fine-grained features via classification of class-wise cluster assignments. Our goal is to apply clustering over each class individually, which allows discovering pseudo-labels that encode a latent degree of similarity between images. In turn, those labels can be employed in a hierarchical classification process that allows learning more fine-grained visual features, thereby mitigating intra-class variability issues. Initial experiments over PlantNet300k shed light on several key points that future work will have to address in order to find more conclusive evidence regarding the effectiveness of our method. Our method still achieves state-of-the-art performance on the PlantNet300k dataset even though some of its components have not been fully optimized. Our code is available at this https URL.
zh
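“对每个类别单独聚类、再用 (类别, 簇) 层级伪标签训练”这一思路可以用几行 scikit-learn 代码示意如下;特征用随机向量代替,类别数与簇数均为假设取值:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(300, 128))       # 假设已由骨干网络抽取的特征
labels = rng.integers(0, 3, size=300)        # 3 个类别

hierarchical_labels = np.zeros((300, 2), dtype=int)
for c in range(3):
    idx = np.where(labels == c)[0]
    # 仅对类别 c 内的样本聚类,得到编码类内相似结构的伪标签
    km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(features[idx])
    hierarchical_labels[idx, 0] = c           # 粗粒度:原类别
    hierarchical_labels[idx, 1] = km.labels_  # 细粒度:类内簇伪标签

print(hierarchical_labels[:5])  # 每行为 (类别, 类内簇) 的层级标签
```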
[AI-46] Zero-Shot Segmentation through Prototype-Guidance for Multi-Label Plant Species Identification
【速读】:该论文旨在解决PlantClef 2025挑战中高分辨率图像下的细粒度多标签物种识别问题(fine-grained multi-label species identification)。其核心解决方案在于利用训练集中的类别原型(class prototypes)作为指导,通过在测试集图像上训练一个分割型视觉Transformer(segmentation Vision Transformer, ViT),实现跨域适应。关键创新点包括:首先,基于训练图像特征进行K-Means聚类生成与类别数相等的原型;其次,构建一个定制化的窄ViT模型,用冻结的DinoV2替换原始patch嵌入层以提取预训练特征,并训练该模型从测试图像中重建训练集的类别原型;最终,借助模型产生的注意力分数定位感兴趣区域,从而引导多标签分类过程。此方法成功将单物种分类任务转化为高分辨率植被图中的多标签分类任务,取得第五名成绩(F1=0.33331),性能接近最优方案。
链接: https://arxiv.org/abs/2512.19957
作者: Luciano Araujo Dourado Filho,Almir Moreira da Silva Neto,Rodrigo Pereira David,Rodrigo Tripodi Calumby
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This paper presents an approach developed to address the PlantCLEF 2025 challenge, which consists of fine-grained multi-label species identification over high-resolution images. Our solution focuses on employing class prototypes obtained from the training dataset as proxy guidance for training a segmentation Vision Transformer (ViT) on the test set images. To obtain these representations, the proposed method extracts features from training dataset images and creates clusters by applying K-Means, with K equal to the number of classes in the dataset. The segmentation model is a customized narrow ViT, built by replacing the patch embedding layer with a frozen DinoV2 pre-trained on the training dataset for individual species classification. This model is trained to reconstruct the class prototypes of the training dataset from the test dataset images. We then use this model to obtain attention scores that enable identifying and localizing areas of interest and consequently guide the classification process. The proposed approach enabled a domain adaptation from multi-class identification of individual species to multi-label classification over high-resolution vegetation plots. Our method achieved fifth place in the PlantCLEF 2025 challenge on the private leaderboard, with an F1 score of 0.33331. In absolute terms, our method scored 0.03 lower than the top-performing submission, suggesting that it achieved competitive performance in the benchmark task. Our code is available at this https URL.
zh
[AI-47] Interpolative Decoding: Exploring the Spectrum of Personality Traits in LLMs
【速读】:该论文旨在解决在利用大语言模型(Large Language Models, LLMs)模拟人类行为时,因需为每个个性特征组合单独设计提示(prompt)而导致的实验开销高和可重复性差的问题。其解决方案的关键在于采用插值解码(interpolative decoding)技术,将人格的每一个维度表示为一对对立提示,并通过插值参数控制模型行为沿该维度的变化,从而实现对人格连续谱的高效建模与行为模拟。这种方法不仅可靠地调节了大五人格(Big Five)各维度得分,还能使LLM在经济博弈中复现人类决策行为,为个体化人机交互提供了新路径。
链接: https://arxiv.org/abs/2512.19937
作者: Eric Yeh,John Cadigan,Ran Chen,Dick Crouch,Melinda Gervasio,Dayne Freitag
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 20 pages, 5 figures
Abstract:Recent research has explored using very large language models (LLMs) as proxies for humans in tasks such as simulation, surveys, and studies. While LLMs do not possess a human psychology, they often can emulate human behaviors with sufficiently high fidelity to drive simulations to test human behavioral hypotheses, exhibiting more nuance and range than the rule-based agents often employed in behavioral economics. One key area of interest is the effect of personality on decision making, but the requirement that a prompt must be created for every tested personality profile introduces experimental overhead and degrades replicability. To address this issue, we leverage interpolative decoding, representing each dimension of personality as a pair of opposed prompts and employing an interpolation parameter to simulate behavior along the dimension. We show that interpolative decoding reliably modulates scores along each of the Big Five dimensions. We then show how interpolative decoding causes LLMs to mimic human decision-making behavior in economic games, replicating results from human psychological research. Finally, we present preliminary results of our efforts to "twin" individual human players in a collaborative game through systematic search for points in interpolation space that cause the system to replicate actions taken by the human subject.
zh
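插值解码的机制可以概括为:对一对对立提示下的下一词 logits 做线性插值,再按温度采样。下面的示意中 get_logits 用确定性的随机张量代替真实 LLM 前向,属假设性简化,仅演示插值与采样这一步:

```python
import torch

vocab_size = 32000

def get_logits(prompt_ids):
    # 占位函数:真实场景应为 LLM 在给定提示下的下一词 logits
    torch.manual_seed(hash(tuple(prompt_ids)) % (2**31))
    return torch.randn(vocab_size)

def interpolative_next_token(ids_low, ids_high, alpha, temperature=1.0):
    """alpha=0 对应维度低端提示,alpha=1 对应高端,中间值模拟人格连续谱。"""
    logits = (1 - alpha) * get_logits(ids_low) + alpha * get_logits(ids_high)
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, 1))

# 例:两套对立提示(如“极内向”/“极外向”)编码后的 token 序列,alpha=0.7 偏向高端
token = interpolative_next_token([1, 2, 3], [4, 5, 6], alpha=0.7)
print("采样到的下一词 id:", token)
```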
[AI-48] Conditional Adversarial Fragility in Financial Machine Learning under Macroeconomic Stress
【速读】:该论文旨在解决金融决策系统中机器学习模型在非平稳经济环境下缺乏有效对抗鲁棒性评估的问题。传统方法通常基于静态假设进行鲁棒性测试,忽略了宏观经济波动对模型脆弱性的动态影响。其解决方案的关键在于提出条件对抗脆弱性(Conditional Adversarial Fragility)的概念,并构建一个基于外部经济压力指标的制度感知评估框架,通过波动率驱动的经济状态划分,在保持模型架构、攻击方法和评估协议不变的前提下,量化模型在平静期与压力期下的对抗脆弱性差异。研究发现,尽管基准预测性能在不同经济制度下一致,但对抗扰动下压力期模型的准确率、决策阈值和风险敏感指标显著劣化,且错误阴性率上升,凸显了将经济制度信息纳入鲁棒性评估的重要性。
链接: https://arxiv.org/abs/2512.19935
作者: Samruddhi Baviskar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Machine learning models used in financial decision systems operate in nonstationary economic environments, yet adversarial robustness is typically evaluated under static assumptions. This work introduces Conditional Adversarial Fragility, a regime dependent phenomenon in which adversarial vulnerability is systematically amplified during periods of macroeconomic stress. We propose a regime aware evaluation framework for time indexed tabular financial classification tasks that conditions robustness assessment on external indicators of economic stress. Using volatility based regime segmentation as a proxy for macroeconomic conditions, we evaluate model behavior across calm and stress periods while holding model architecture, attack methodology, and evaluation protocols constant. Baseline predictive performance remains comparable across regimes, indicating that economic stress alone does not induce inherent performance degradation. Under adversarial perturbations, however, models operating during stress regimes exhibit substantially greater degradation across predictive accuracy, operational decision thresholds, and risk sensitive outcomes. We further demonstrate that this amplification propagates to increased false negative rates, elevating the risk of missed high risk cases during adverse conditions. To complement numerical robustness metrics, we introduce an interpretive governance layer based on semantic auditing of model explanations using large language models. Together, these results demonstrate that adversarial robustness in financial machine learning is a regime dependent property and motivate stress aware approaches to model risk assessment in high stakes financial deployments.
zh
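论文采用的“基于波动率的制度划分”可以用一个滚动波动率阈值来示意:高波动时段记为压力制度,其余为平静制度,随后在两个子集上分别执行相同的对抗评估。窗口长度 20 与 80% 分位阈值均为假设取值:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# 合成收益序列:后段波动放大 3 倍,模拟宏观压力期
returns = pd.Series(rng.normal(0, 1, 1000) * (1 + (np.arange(1000) > 600) * 2.0))

vol = returns.rolling(window=20).std()        # 滚动波动率作为经济压力代理
threshold = vol.quantile(0.8)                 # 高波动的 20% 时段视为压力制度
regime = np.where(vol > threshold, "stress", "calm")

print(pd.Series(regime).value_counts())
# 随后:在 calm / stress 两个子集上分别做相同的对抗攻击与指标评估,
# 并比较两种制度下的准确率、误报/漏报等退化幅度
```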
[AI-49] Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在关键领域部署中因持续幻觉(hallucination)导致的可靠性问题——即模型生成看似合理但事实错误的陈述。研究表明,幻觉并非单纯的随机误差,而是训练目标偏向于拟合数据分布而非追求认知诚实(epistemic honesty)的统计结果。传统强化学习与人类反馈(Reinforcement Learning with Human Feedback, RLHF)范式使用二元奖励信号,反而激励模型成为“高分应试者”而非“诚实沟通者”,从而在置信度低于100%时仍强行输出答案。解决方案的关键在于引入行为校准(behavioral calibration),通过优化严格合适的评分规则(strictly proper scoring rules),促使模型在不确定时选择弃权(abstain)或标记具体不确信的命题,使输出概率与实际正确性保持一致。实验表明,基于Qwen3-4B-Instruct的小模型经此方法训练后,在不确定性量化能力上超越前沿模型,且该能力可脱离原始预测准确性独立提升,具备跨任务迁移潜力。
链接: https://arxiv.org/abs/2512.19920
作者: Jiayun Wu,Jiashuo Liu,Zhiyuan Zeng,Tianyang Zhan,Wenhao Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:LLM deployment in critical domains is currently impeded by persistent hallucinations–generating plausible but factually incorrect assertions. While scaling laws drove significant improvements in general capabilities, theoretical frameworks suggest hallucination is not merely stochastic error but a predictable statistical consequence of training objectives prioritizing mimicking data distribution over epistemic honesty. Standard RLVR paradigms, utilizing binary reward signals, inadvertently incentivize models as good test-takers rather than honest communicators, encouraging guessing whenever correctness probability exceeds zero. This paper presents an exhaustive investigation into behavioral calibration, which incentivizes models to stochastically admit uncertainty by abstaining when not confident, aligning model behavior with accuracy. Synthesizing recent advances, we propose and evaluate training interventions optimizing strictly proper scoring rules for models to output a calibrated probability of correctness. Our methods enable models to either abstain from producing a complete response or flag individual claims where uncertainty remains. Utilizing Qwen3-4B-Instruct, empirical analysis reveals behavior-calibrated reinforcement learning allows smaller models to surpass frontier models in uncertainty quantification–a transferable meta-skill decouplable from raw predictive accuracy. Trained on math reasoning tasks, our model’s log-scale Accuracy-to-Hallucination Ratio gain (0.806) exceeds GPT-5’s (0.207) in a challenging in-domain evaluation (BeyondAIME). Moreover, in cross-domain factual QA (SimpleQA), our 4B LLM achieves zero-shot calibration error on par with frontier models including Grok-4 and Gemini-2.5-Pro, even though its factual accuracy is much lower.
zh
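“严格合适评分规则 + 弃权”这一奖励设计的要点是:只有如实报告正确概率才能最大化期望奖励。下面用 Brier 型评分给出一个玩具示意(弃权阈值与弃权奖励为假设取值,并非论文的确切奖励函数):

```python
def calibrated_reward(p: float, correct: bool, abstain_reward: float = 0.0,
                      abstain_threshold: float = 0.3) -> float:
    """模型作答并自报正确概率 p;置信不足时弃权,否则按 Brier 型评分计奖。"""
    if p < abstain_threshold:          # 置信不足:弃权,避免幻觉性作答
        return abstain_reward
    outcome = 1.0 if correct else 0.0
    return 1.0 - (p - outcome) ** 2    # 严格合适评分(越高越好)

# 设真实正确率为 0.7:期望奖励在自报 p = 0.7 处最大,
# 因此模型没有“虚报高置信”或“硬答”的动机
for p in (0.2, 0.5, 0.7, 0.9):
    expected = 0.7 * calibrated_reward(p, True) + 0.3 * calibrated_reward(p, False)
    print(f"自报 p={p}: 期望奖励 {expected:.3f}")
```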
[AI-50] A Time-efficient Prioritised Scheduling Algorithm to Optimise Initial Flock Formation of Drones
【速读】:该论文旨在解决无人机蜂群(drone flocking)在初始编队阶段因潜在碰撞导致路径次优的问题,现有算法在效率和可扩展性方面存在不足。解决方案的关键在于提出一种基于优先级调度的时间高效算法:通过计算每架无人机的潜在碰撞数量及其到达目标位置时对其他无人机造成永久阻碍的概率,为其分配优先级;随后,各无人机根据优先级确定适当的延迟时间,从而确保无碰撞路径的生成。该方法显著提升了大规模蜂群(最多5000架无人机)的编队效率,并优于基于耦合度的启发式优先规划方法(CDH-PP)。
链接: https://arxiv.org/abs/2512.19914
作者: Sujan Warnakulasooriya,Andreas Willig,Xiaobing Wu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 35 pages
Abstract:Drone applications continue to expand across various domains, with flocking offering enhanced cooperative capabilities but introducing significant challenges during initial formation. Existing flocking algorithms often struggle with efficiency and scalability, particularly when potential collisions force drones into suboptimal trajectories. This paper presents a time-efficient prioritised scheduling algorithm that improves the initial formation process of drone flocks. The method assigns each drone a priority based on its number of potential collisions and its likelihood of reaching its target position without permanently obstructing other drones. Using this hierarchy, each drone computes an appropriate delay to ensure a collision-free path. Simulation results show that the proposed algorithm successfully generates collision-free trajectories for flocks of up to 5000 drones and outperforms the coupling-degree-based heuristic prioritised planning method (CDH-PP) in both performance and computational efficiency.
zh
[AI-51] Modeling Non-Ergodic Path Effects Using Conditional Generative Model for Fourier Amplitude Spectra
【速读】:该论文旨在解决传统非遍历性地震动模型(non-ergodic ground-motion models, GMMs)在大规模应用中因依赖高斯过程(Gaussian Process, GP)方法而面临计算效率低下的问题,尤其是在处理多频段、大空间范围的非遍历路径效应建模时。解决方案的关键在于提出一种基于深度学习的条件生成建模方法——条件变分自编码器用于傅里叶振幅谱(Conditional Generative Modeling for Fourier Amplitude Spectra, CGM-FAS),该方法通过将地震和台站的地理坐标作为条件变量,直接从数据中学习空间模式和频间相关性,无需预设相关函数,从而显著提升预测速度与可扩展性,同时保持与GP基线模型一致的物理合理性。
链接: https://arxiv.org/abs/2512.19909
作者: Maxime Lacour,Pu Ren,Rie Nakata,Nori Nakata,Michael Mahoney
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent developments in non-ergodic ground-motion models (GMMs) explicitly model systematic spatial variations in source, site, and path effects, reducing standard deviation to 30-40% of ergodic models and enabling more accurate site-specific seismic hazard analysis. Current non-ergodic GMMs rely on Gaussian Process (GP) methods with prescribed correlation functions and thus have computational limitations for large-scale predictions. This study proposes a deep-learning approach called Conditional Generative Modeling for Fourier Amplitude Spectra (CGM-FAS) as an alternative to GP-based methods for modeling non-ergodic path effects in Fourier Amplitude Spectra (FAS). CGM-FAS uses a Conditional Variational Autoencoder architecture to learn spatial patterns and interfrequency correlation directly from data by using geographical coordinates of earthquakes and stations as conditional variables. Using San Francisco Bay Area earthquake data, we compare CGM-FAS against a recent GP-based GMM for the region and demonstrate consistent predictions of non-ergodic path effects. Additionally, CGM-FAS offers advantages compared to GP-based approaches in learning spatial patterns without prescribed correlation functions, capturing interfrequency correlations, and enabling rapid predictions, generating maps for 10,000 sites across 1,000 frequencies within 10 seconds using a few GB of memory. CGM-FAS hyperparameters can be tuned to ensure generated path effects exhibit variability consistent with the GP-based empirical GMM. This work demonstrates a promising direction for efficient non-ergodic ground-motion prediction across multiple frequencies and large spatial domains.
zh
[AI-52] Demystifying LLM-as-a-Judge: Analytically Tractable Model for Inference-Time Scaling
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理阶段计算资源分配的理论机制不明确的问题,尤其是如何通过增加推理时间的采样次数(inference-time scaling)来优化模型泛化性能。其解决方案的关键在于构建一个可解析处理的贝叶斯线性回归模型,结合奖励加权采样器(reward-weighted sampler),其中奖励由线性模型定义,模拟LLM-as-a-judge场景;在此高维设定下,推导出后验预测均值和方差的确定性等价表达式,并分析训练数据来自教师模型时的泛化误差行为。研究发现:当奖励与教师模型接近时,随着推理采样数 $ k $ 增加,泛化误差单调下降;但最优奖励通常不同于教师模型,且存在因奖励误设导致的有限最优 $ k $,超过该点进一步采样反而恶化性能;此外,固定 $ k $ 下存在最优采样温度,且在“best-of-k”极限中,泛化误差以 Θ(1/k²) 速率衰减,其系数可通过极值理论确定,从而明确界定推理计算扩展优于数据收集的适用范围。
链接: https://arxiv.org/abs/2512.19905
作者: Indranil Halder,Cengiz Pehlevan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 27 pages
Abstract:Recent developments in large language models have shown advantages in reallocating a notable share of computational resources from training time to inference time. However, the principles behind inference-time scaling are not well understood. In this paper, we introduce an analytically tractable model of inference-time scaling: Bayesian linear regression with a reward-weighted sampler, where the reward is determined from a linear model, modeling the LLM-as-a-judge scenario. We study this problem in the high-dimensional regime, where the deterministic equivalents dictate a closed-form expression for the posterior predictive mean and variance. We analyze the generalization error when training data are sampled from a teacher model. We draw k inference-time samples and select via softmax at a temperature applied to a quadratic reward. When the reward is not too different from the teacher, the generalization error decreases monotonically with an increasing number of inference-time samples k. However, the specific reward that optimizes inference-time selection generally differs from the teacher. In contrast, substantial reward misspecification induces a finite optimal k beyond which more sampling can increase the generalization error. For fixed k, there exists an optimal sampling temperature. We experimentally verify these facts in large language model inference with an additional large language model as a judge. In the "best-of-k" limit with the teacher as reward, we theoretically show that the generalization error decays as Θ(1/k²) and determine the leading coefficient via extreme value theory. These formulas delineate domains where scaling inference-time computation is provably preferable to collecting more data. Finally, we demonstrate that when task difficulty increases, the previously mentioned advantage of inference-time compute degrades.
zh
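“抽取 k 个候选、按(可能误设的)奖励做温度 softmax 选择”的机制可用下面的数值玩具复现:随着 k 增大,选中候选的平均真实质量上升,而奖励噪声决定了提升的上限。奖励噪声强度与温度均为假设取值:

```python
import numpy as np

rng = np.random.default_rng(1)

def select(k, temperature, reward_noise=0.5):
    quality = rng.normal(size=k)                           # k 个候选的真实质量
    reward = quality + reward_noise * rng.normal(size=k)   # 误设奖励(加噪的质量)
    w = np.exp((reward - reward.max()) / temperature)      # 温度 softmax 选择权重
    chosen = rng.choice(k, p=w / w.sum())
    return quality[chosen]

# 温度趋于 0 即 best-of-k;此处取 0.1 近似贪心选择
for k in (1, 4, 16, 64):
    avg = np.mean([select(k, temperature=0.1) for _ in range(2000)])
    print(f"k={k:3d}: 平均选中质量 {avg:.3f}")
```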
[AI-53] A Branch-and-Price Algorithm for Fast and Equitable Last-Mile Relief Aid Distribution
【速读】:该论文旨在解决重大灾害发生后,受限于预置救援物资数量不足的情况下,如何从配送中心向避难所规划车辆路径并分配有限的救援物资,以在效率(总行驶时间最短)与公平性(未满足需求的基尼指数最小化)之间取得平衡的问题。其解决方案的关键在于构建一个双目标混合整数规划(Mixed Integer Programming, MIP)模型,并采用ε-约束法处理多目标优化;通过推导最优解的数学性质引入有效不等式,设计了基于可行车辆路径的最优分配算法;同时开发了一种分支定价(Branch-and-Price, BP)算法,显著优于商业MIP求解器,在实际地震数据(土耳其凡省)和预测数据(伊斯坦布尔卡特尔区)测试中,使援助分配不公平性降低34%而不牺牲效率,且根据时间约束松紧程度选择不同的优化策略(如字典序优化优先覆盖需求或均衡权衡)。
链接: https://arxiv.org/abs/2512.19882
作者: Mahdi Mostajabdaveh,F. Sibel Salman,Walter J. Gutjahr
机构: 未知
类目: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:
Abstract:The distribution of relief supplies to shelters is a critical aspect of post-disaster humanitarian logistics. In major disasters, prepositioned supplies often fall short of meeting all demands. We address the problem of planning vehicle routes from a distribution center to shelters while allocating limited relief supplies. To balance efficiency and equity, we formulate a bi-objective problem: minimizing a Gini-index-based measure of inequity in unsatisfied demand for fair distribution and minimizing total travel time for timely delivery. We propose a Mixed Integer Programming (MIP) model and use the ε-constraint method to handle the bi-objective nature. By deriving mathematical properties of the optimal solution, we introduce valid inequalities and design an algorithm for optimal delivery allocations given feasible vehicle routes. A branch-and-price (BP) algorithm is developed to solve the problem efficiently. Computational tests on realistic datasets from a past earthquake in Van, Turkey, and predicted data for Istanbul's Kartal region show that the BP algorithm significantly outperforms commercial MIP solvers. Our bi-objective approach reduces aid distribution inequity by 34% without compromising efficiency. Results indicate that when time constraints are very loose or tight, lexicographic optimization prioritizing demand coverage over fairness is effective. For moderately restrictive time constraints, a balanced approach is essential to avoid inequitable outcomes.
zh
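论文公平性目标中的基尼指数可按如下方式对“未满足需求比例”计算;数值为虚构示例,仅演示该指标如何区分均匀与偏斜的分配方案:

```python
import numpy as np

def gini(x):
    """对非负向量计算基尼指数:0 表示完全均匀,越大越不公平。"""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    if x.sum() == 0:
        return 0.0
    cum = np.cumsum(x)
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

demand = np.array([100, 80, 60, 40])
delivered_fair = np.array([70, 56, 42, 28])      # 等比例满足(总量 196)
delivered_skewed = np.array([100, 80, 16, 0])    # 总量相同但偏向部分避难所

for d in (delivered_fair, delivered_skewed):
    unmet_ratio = (demand - d) / demand          # 各避难所未满足需求比例
    print(f"未满足比例 {np.round(unmet_ratio, 2)} -> Gini = {gini(unmet_ratio):.3f}")
```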
[AI-54] Fine-Tuned In-Context Learners for Efficient Adaptation
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在下游任务适配中两种主流方法——提示工程(prompt engineering)与微调(fine-tuning)之间的权衡问题:提示工程在少量数据下表现优异,但性能随数据量增加趋于饱和;而微调虽能随着数据增多持续提升性能,但在训练样本稀缺时效果不佳。解决方案的关键在于提出一种统一框架,将上下文学习(in-context learning)机制直接嵌入微调过程,即在任务特定数据中引入类似k-shot提示结构的示例进行增强,从而在保持提示工程样本高效性的同时,获得微调带来的性能增益。此外,为应对低数据场景下的超参数选择难题,作者采用预序评估(prequential evaluation)策略,避免昂贵的交叉验证并充分利用全部可用数据进行训练与验证。
链接: https://arxiv.org/abs/2512.19879
作者: Jorg Bornschein,Clare Lyle,Yazhe Li,Amal Rannen-Triki,Xu Owen He,Razvan Pascanu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:When adapting large language models (LLMs) to a specific downstream task, two primary approaches are commonly employed: (1) prompt engineering, often with in-context few-shot learning, leveraging the model's inherent generalization abilities, and (2) fine-tuning on task-specific data, directly optimizing the model's parameters. While prompt-based methods excel in few-shot scenarios, their effectiveness often plateaus as more data becomes available. Conversely, fine-tuning scales well with data but may underperform when training examples are scarce. We investigate a unified approach that bridges these two paradigms by incorporating in-context learning directly into the fine-tuning process. Specifically, we fine-tune the model on task-specific data augmented with in-context examples, mimicking the structure of k-shot prompts. This approach, while requiring per-task fine-tuning, combines the sample efficiency of in-context learning with the performance gains of fine-tuning, leading to a method that consistently matches and often significantly exceeds both of these baselines. To perform hyperparameter selection in the low-data regime, we propose to use prequential evaluation, which eliminates the need for expensive cross-validation and leverages all available data for training while simultaneously providing a robust validation signal. We conduct an extensive empirical study to determine which adaptation paradigm (fine-tuning, in-context learning, or our proposed unified approach) offers the best predictive performance on concrete downstream tasks.
zh
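“把微调样本构造成 k-shot 提示”的数据组织方式可用下面的假设性模板示意:每条训练样本前拼接同任务的若干示例,监督目标仍是该样本的答案(提示格式为示意写法,非论文原始模板):

```python
import random

def build_kshot_example(dataset, target_idx, k=3, seed=0):
    """将第 target_idx 条样本包装成 k-shot 提示,返回 (模型输入, 监督目标)。"""
    rng = random.Random(seed + target_idx)
    pool = [i for i in range(len(dataset)) if i != target_idx]
    shots = [dataset[i] for i in rng.sample(pool, k)]   # 随机抽取 k 个上下文示例
    context = "".join(f"输入: {x}\n输出: {y}\n\n" for x, y in shots)
    x, y = dataset[target_idx]
    return context + f"输入: {x}\n输出:", y

data = [("1+1", "2"), ("2+3", "5"), ("4+4", "8"), ("9-2", "7"), ("6+5", "11")]
prompt, target = build_kshot_example(data, target_idx=4, k=3)
print(prompt)
print("监督目标:", target)
```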
[AI-55] UCCL-EP: Portable Expert-Parallel Communication
【速读】:该论文旨在解决混合专家(Mixture-of-Experts, MoE)工作负载中专家并行(Expert Parallelism, EP)通信系统在异构GPU与网卡(NIC)平台上的可移植性差的问题。现有方案如DeepEP虽性能优异,但其依赖GPU发起的基于RDMA的逐令牌通信机制需GPU与NIC之间紧密集成(如直接写入NIC驱动或MMIO接口),导致难以适配不同硬件架构。解决方案的关键在于提出UCCL-EP,通过引入高吞吐量的GPU-CPU控制通道替代GPU主导的RDMA操作:将紧凑的令牌路由指令交由多线程CPU代理处理,并由CPU代为发起GPUDirect RDMA请求;同时利用RDMA立即数据(immediate data)模拟多种排序语义,从而在缺乏原生有序性的网卡(如AWS EFA)上保证正确性。此设计实现了跨异构平台的高性能与高可移植性,实测在EFA平台上比最优现有方案提升达2.1倍,且在多个训练场景下显著提升吞吐量。
链接: https://arxiv.org/abs/2512.19849
作者: Ziming Mao,Yihan Zhang,Chihan Cui,Kaichao You,Zhongjie Chen,Zhiying Xu,Scott Shenker,Costin Raiciu,Yang Zhou,Ion Stoica
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
备注:
Abstract:Mixture-of-Experts (MoE) workloads rely on expert parallelism (EP) to achieve high GPU efficiency. State-of-the-art EP communication systems such as DeepEP demonstrate strong performance but exhibit poor portability across heterogeneous GPU and NIC platforms. The poor portability is rooted in architecture: GPU-initiated token-level RDMA communication requires tight vertical integration between GPUs and NICs, e.g., GPU writes to NIC driver/MMIO interfaces. We present UCCL-EP, a portable EP communication system that delivers DeepEP-level performance across heterogeneous GPU and NIC hardware. UCCL-EP replaces GPU-initiated RDMA with a high-throughput GPU-CPU control channel: compact token-routing commands are transferred to multithreaded CPU proxies, which then issue GPUDirect RDMA operations on behalf of GPUs. UCCL-EP further emulates various ordering semantics required by specialized EP communication modes using RDMA immediate data, enabling correctness on NICs that lack such ordering, e.g., AWS EFA. We implement UCCL-EP on NVIDIA and AMD GPUs with EFA and Broadcom NICs. On EFA, it outperforms the best existing EP solution by up to 2.1× for dispatch and combine throughput. On an NVIDIA-only platform, UCCL-EP achieves comparable performance to the original DeepEP. UCCL-EP also improves token throughput on SGLang by up to 40% on the NVIDIA+EFA platform, and improves DeepSeek-V3 training throughput over the AMD Primus/Megatron-LM framework by up to 45% on a 16-node AMD+Broadcom platform.
zh
[AI-56] PhysMaster: Building an Autonomous AI Physicist for Theoretical and Computational Physics Research
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在开放科学场景中端到端问题求解能力不足的问题,尤其是在物理学这一高度抽象、数学密集且需融合分析推理与代码计算的领域。其解决方案的关键在于提出PhysMaster——一个作为自主理论与计算物理学家的LLM代理,通过将抽象推理与数值计算相结合,并利用LANDAU(Layered Academic Data Universe)结构化存储检索文献、精炼先验知识和验证的方法学轨迹,从而提升决策的可靠性与稳定性;同时引入自适应探索策略,在效率与开放式探索之间取得平衡,使系统能够在超长任务周期内保持鲁棒性能。
链接: https://arxiv.org/abs/2512.19799
作者: Tingjia Miao(1 and 2 and 5),Jiawen Dai(2),Jingkun Liu(2),Jinxin Tan(2 and 3 and 4),Muhua Zhang(2 and 3 and 4),Wenkai Jin(1),Yuwen Du(1),Tian Jin(1),Xianghe Pang(1),Zexi Liu(1),Tu Guo(2 and 4),Zhengliang Zhang(2 and 4 and 5),Yunjie Huang(1),Shuo Chen(6),Rui Ye(1),Yuzhi Zhang(7),Linfeng Zhang(7),Kun Chen(6),Wei Wang(2 and 3 and 4),Weinan E(1),Siheng Chen(1) ((1) School of Artificial Intelligence, Shanghai Jiao Tong University, (2) School of Physics and Astronomy, Shanghai Jiao Tong University, (3) State Key Laboratory of Dark Matter Physics, Shanghai Jiao Tong University, (4) Tsung-Dao Lee Institute, Shanghai Jiao Tong University, (5) Zhiyuan College, Shanghai Jiao Tong University, (6) Institute of Theoretical Physics, Chinese Academy of Sciences, (7) DP Technology)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 32 pages, 10 figures
Abstract:Advances in LLMs have produced agents with knowledge and operational capabilities comparable to human scientists, suggesting potential to assist, accelerate, and automate research. However, existing studies mainly evaluate such systems on well-defined benchmarks or general tasks like literature retrieval, limiting their end-to-end problem-solving ability in open scientific scenarios. This is particularly true in physics, which is abstract, mathematically intensive, and requires integrating analytical reasoning with code-based computation. To address this, we propose PhysMaster, an LLM-based agent functioning as an autonomous theoretical and computational physicist. PhysMaster couples abstract reasoning with numerical computation and leverages LANDAU, the Layered Academic Data Universe, which preserves retrieved literature, curated prior knowledge, and validated methodological traces, enhancing decision reliability and stability. It also employs an adaptive exploration strategy balancing efficiency and open-ended exploration, enabling robust performance in ultra-long-horizon tasks. We evaluate PhysMaster on problems from high-energy theory and condensed matter theory to astrophysics, including: (i) acceleration, compressing labor-intensive research from months to hours; (ii) automation, autonomously executing hypothesis-driven loops; and (iii) autonomous discovery, independently exploring open problems.
zh
[AI-57] Learned Digital Codes for Over-the-Air Computation in Federated Edge Learning
【速读】:该论文旨在解决联邦边缘学习(Federated Edge Learning, FEEL)中因频繁上行传输模型更新而导致的通信瓶颈问题,尤其在低信噪比(SNR)条件下现有数字过空气(Digital Over-the-Air, OTA)聚合方案性能受限的问题。解决方案的关键在于提出一种基于学习的数字OTA框架,其核心创新包括:集成无源随机接入(Unsourced Random Access, URA)码本与向量量化技术,并设计一个端到端训练的AMP-DA-Net解码器——该解码器基于未展开的近似消息传递(Approximate Message Passing, AMP)结构,联合优化数字码本和参数服务器本地训练统计信息。此设计不仅将可靠数字OTA操作的适用SNR范围扩展超过10 dB,还实现了对广义对称函数(如截断均值和多数投票规则)的支持,显著提升了恢复精度、收敛性和鲁棒性,同时保持与先进方法相同的上行开销。
链接: https://arxiv.org/abs/2512.19777
作者: Antonio Tarizzo,Mohammad Kazemi,Deniz Gündüz
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:Federated edge learning (FEEL) enables wireless devices to collaboratively train a centralised model without sharing raw data, but repeated uplink transmission of model updates makes communication the dominant bottleneck. Over-the-air (OTA) aggregation alleviates this by exploiting the superposition property of the wireless channel, enabling simultaneous transmission and merging communication with computation. Digital OTA schemes extend this principle by incorporating the robustness of conventional digital communication, but current designs remain limited in low signal-to-noise ratio (SNR) regimes. This work proposes a learned digital OTA framework that improves recovery accuracy, convergence behaviour, and robustness to challenging SNR conditions while maintaining the same uplink overhead as state-of-the-art methods. The design integrates an unsourced random access (URA) codebook with vector quantisation and AMP-DA-Net, an unrolled approximate message passing (AMP)-style decoder trained end-to-end with the digital codebook and parameter server local training statistics. The proposed design extends OTA aggregation beyond averaging to a broad class of symmetric functions, including trimmed means and majority-based rules. Experiments on highly heterogeneous device datasets and varying numbers of active devices show that the proposed design extends reliable digital OTA operation by more than 10 dB into low SNR regimes while matching or improving performance across the full SNR range. The learned decoder remains effective under message corruption and nonlinear aggregation, highlighting the broader potential of end-to-end learned design for digital OTA communication in FEEL.
zh
[AI-58] A K-Means Ward and DBSCAN repeatability study
【速读】:该论文旨在解决机器学习中可复现性(reproducibility)问题,特别是针对聚类算法(如K-Means、DBSCAN和Ward)在不同运行环境下结果不一致的问题。其解决方案的关键在于将这些算法分解为基本步骤,并明确每个阶段实现比特级重复性(bitwise repeatability)所需的条件;通过使用scikit-learn库的实现示例,研究发现当OpenMP线程数超过两个时,K-Means会出现非一致性结果,从而揭示了并行计算环境对算法可复现性的关键影响,旨在提升用户和开发者对该问题的认识并推动进一步优化。
链接: https://arxiv.org/abs/2512.19772
作者: Anthony Bertrand(LIMOS),Engelbert Mephu Nguifo(LIMOS),Violaine Antoine(LIMOS),David Hill(LIMOS)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reproducibility is essential in machine learning because it ensures that a model or experiment yields the same scientific conclusion. For specific algorithms, repeatability with bitwise-identical results is also key for scientific integrity because it enables debugging. We decompose several very popular clustering algorithms, K-Means, DBSCAN and Ward, into their fundamental steps, and we identify the conditions required to achieve repeatability at each stage. We use an implementation example with the Python library scikit-learn to examine the repeatable aspects of each method. Our experiments reveal inconsistent results with K-Means when the number of OpenMP threads exceeds two. This work aims to raise awareness of this issue among both users and developers, encouraging further investigation and potential fixes.
zh
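论文检查的“逐位可重复性”可以用如下脚手架自行复现:固定数据与随机种子,只改变线程数上限,逐位比较两次运行的聚类中心。此处借助 threadpoolctl(额外依赖)限制 OpenMP 线程数,结果会随 scikit-learn 版本与硬件而异:

```python
import numpy as np
from sklearn.cluster import KMeans
from threadpoolctl import threadpool_limits

X = np.random.default_rng(0).normal(size=(10000, 16))  # 固定数据

def run_kmeans(n_threads):
    with threadpool_limits(limits=n_threads):          # 限制 OpenMP/BLAS 线程数
        km = KMeans(n_clusters=8, n_init=1, random_state=0).fit(X)
    return km.cluster_centers_

for threads in (1, 2, 4):
    c1, c2 = run_kmeans(threads), run_kmeans(threads)
    identical = np.array_equal(c1, c2)                 # 逐位比较,而非近似比较
    print(f"{threads} 线程: 两次运行逐位一致 = {identical}")
```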
[AI-59] A Declarative Language for Building And Orchestrating LLM-Powered Agent Workflows
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)代理在实际部署中面临的复杂性问题,即现有系统将代理逻辑与特定编程语言和部署模型紧密耦合,导致开发效率低、维护困难且难以跨环境迁移。解决方案的关键在于提出一种声明式(declarative)系统,通过将代理工作流的规范与实现分离,使同一管道定义可在多种后端语言(如 Java、Python、Go)和部署环境(云原生、本地部署)中运行。其核心思想是识别出大多数代理工作流由通用模式(如数据序列化、过滤、检索增强生成 RAG、API 协调)构成,并用统一的领域特定语言(DSL)表达这些模式,从而将代理开发从编码转变为配置,显著降低变更成本并提升可扩展性。实证表明,该方法在 PayPal 的真实电商场景中实现了 60% 的开发时间减少和 3 倍的部署速度提升。
链接: https://arxiv.org/abs/2512.19769
作者: Ivan Daunis
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注:
Abstract:Building deployment-ready LLM agents requires complex orchestration of tools, data sources, and control flow logic, yet existing systems tightly couple agent logic to specific programming languages and deployment models. We present a declarative system that separates agent workflow specification from implementation, enabling the same pipeline definition to execute across multiple backend languages (Java, Python, Go) and deployment environments (cloud-native, on-premises). Our key insight is that most agent workflows consist of common patterns – data serialization, filtering, RAG retrieval, API orchestration – that can be expressed through a unified DSL rather than imperative code. This approach transforms agent development from application programming to configuration, where adding new tools or fine-tuning agent behaviors requires only pipeline specification changes, not code deployment. Our system natively supports A/B testing of agent strategies, allowing multiple pipeline variants to run on the same backend infrastructure with automatic metric collection and comparison. We evaluate our approach on real-world e-commerce workflows at PayPal, processing millions of daily interactions. Our results demonstrate 60% reduction in development time, and 3x improvement in deployment velocity compared to imperative implementations. The language's declarative approach enables non-engineers to modify agent behaviors safely, while maintaining sub-100ms orchestration overhead. We show that complex workflows involving product search, personalization, and cart management can be expressed in under 50 lines of DSL compared to 500+ lines of imperative code.
zh
[AI-60] How Many Experts Are Enough? Towards Optimal Semantic Specialization for Mixture-of-Experts AAAI2026
【速读】:该论文旨在解决稀疏混合专家(Sparse Mixture-of-Experts, SMoE)架构中专家语义分化不足的问题,即如何在适应不同任务需求时动态调整专家池规模并提升各专家的语义特异性。现有方法要么依赖繁琐的超参数调优,要么忽视专家间语义角色的多样性。其解决方案的关键在于提出Mixture-of-Experts for Adaptive Semantic Specialization (MASS),包含两项核心创新:一是基于梯度的语义漂移检测器(gradient-based semantic drift detector),用于识别当前专家池是否无法覆盖数据的全部语义多样性,并触发针对性的专家扩展;二是自适应路由策略(adaptive routing strategy),根据token级别的路由置信度分布动态调整专家使用频率,从而实现更高效的专家分工与语义专业化。
链接: https://arxiv.org/abs/2512.19765
作者: Sumin Park,Noseong Park
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to AAAI 2026 (Main Track)
Abstract:Finding the optimal configuration of Sparse Mixture-of-Experts (SMoE) that maximizes semantic differentiation among experts is essential for exploiting the full potential of MoE architectures. However, existing SMoE frameworks either heavily rely on hyperparameter tuning or overlook the importance of diversifying semantic roles across experts when adapting the expert pool size. We propose Mixture-of-Experts for Adaptive Semantic Specialization (MASS), a semantic-aware MoE framework for adaptive expert expansion and dynamic routing. MASS introduces two key advancements: (i) a gradient-based semantic drift detector that prompts targeted expert expansion when the existing expert pool lacks capacity to capture the full semantic diversity of the data, and (ii) an integration of an adaptive routing strategy that dynamically adjusts expert usage based on token-level routing confidence mass. We first demonstrate that MASS reliably converges to the point of optimal balance in the cost-performance trade-off with notably improved semantic specialization in a highly controlled synthetic setup. Further empirical results on real-world datasets across language and vision domains show that MASS consistently outperforms a range of strong MoE baselines, demonstrating its domain robustness and enhanced expert specialization.
zh
[AI-61] Attention Distance: A Novel Metric for Directed Fuzzing with Large Language Models ICSE2026
【速读】:该论文旨在解决当前定向灰盒模糊测试(Directed Grey-Box Fuzzing, DGF)在复杂二进制程序中因仅依赖种子执行路径与目标位置之间的物理距离而忽略代码片段间逻辑关系的问题,这导致引导信息冗余或误导,削弱了实际检测效率。其解决方案的关键在于提出一种新的“注意力距离”(attention distance)度量方法,该方法利用大语言模型(Large Language Model, LLM)对代码元素进行上下文分析,计算注意力分数以揭示代码段间的内在关联,并在不修改任何模糊测试组件的前提下,将传统物理距离替换为注意力距离,在38个真实漏洞复现实验中平均提升测试效率3.43倍,显著优于DAFL和WindRanger等先进方法。
链接: https://arxiv.org/abs/2512.19758
作者: Wang Bin,Ao Yang,Kedan Li,Aofan Liu,Hui Li,Guibo Luo,Weixiang Huang,Yan Zhuang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted to ICSE 2026 Research Track
Abstract:In the domain of software security testing, Directed Grey-Box Fuzzing (DGF) has garnered widespread attention for its efficient target localization and excellent detection performance. However, existing approaches measure only the physical distance between seed execution paths and target locations, overlooking logical relationships among code segments. This omission can yield redundant or misleading guidance in complex binaries, weakening DGF's real-world effectiveness. To address this, we introduce attention distance, a novel metric that leverages a large language model's contextual analysis to compute attention scores between code elements and reveal their intrinsic connections. Under the same AFLGo configuration – without altering any fuzzing components other than the distance metric – replacing physical distances with attention distances across 38 real vulnerability reproduction experiments delivers a 3.43× average increase in testing efficiency over the traditional method. Compared to state-of-the-art directed fuzzers DAFL and WindRanger, our approach achieves 2.89× and 7.13× improvements, respectively. To further validate the generalizability of attention distance, we integrate it into DAFL and WindRanger, where it also consistently enhances their original performance. All related code and datasets are publicly available at this https URL.
zh
[AI-62] From Theory to Throughput: CUDA-Optimized APML for Large-Batch 3D Learning
【速读】:该论文旨在解决3D点云学习中损失函数在几何保真度与计算成本之间的权衡问题。现有方法如Chamfer Distance虽计算高效但允许多对一对应关系,而Earth Mover Distance虽能更好反映一对一匹配但计算开销大;APML通过可微分Sinkhorn迭代和解析推导的温度参数近似最优传输,但其密集实现内存复杂度为二次方,限制了大规模应用。解决方案的关键在于提出CUDA-APML——一种稀疏GPU实现,通过阈值过滤低权重分配、直接在COO(Coordinate Format)格式下执行自适应softmax、双向对称化及Sinkhorn归一化,实现了近线性内存增长,并保持梯度传播能力,同时在ShapeNet和MM-Fi数据集上性能接近密集APML,峰值GPU内存降低99.9%。
链接: https://arxiv.org/abs/2512.19743
作者: Sasan Sharifipour,Constantino Álvarez Casado,Manuel Lage Cañellas,Miguel Bordallo López
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG)
备注: 5 pages, 2 figures, 2 tables, 5 formulas, 34 references, journal paper
Abstract:Loss functions are fundamental to learning accurate 3D point cloud models, yet common choices trade geometric fidelity for computational cost. Chamfer Distance is efficient but permits many-to-one correspondences, while Earth Mover Distance better reflects one-to-one transport at high computational cost. APML approximates transport with differentiable Sinkhorn iterations and an analytically derived temperature, but its dense formulation scales quadratically in memory. We present CUDA-APML, a sparse GPU implementation that thresholds negligible assignments and runs adaptive softmax, bidirectional symmetrization, and Sinkhorn normalization directly in COO form. This yields near-linear memory scaling and preserves gradients on the stored support, while pairwise distance evaluation remains quadratic in the current implementation. On ShapeNet and MM-Fi, CUDA-APML matches dense APML within a small tolerance while reducing peak GPU memory by 99.9%. Code available at: this https URL
zh
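CUDA-APML 的稀疏化策略(阈值过滤低权重分配,仅在保留的支撑集上做 Sinkhorn 归一化,再以 COO 格式存储)可以用纯 PyTorch 粗略示意如下;这只是原理草图,并非其 CUDA 内核实现,温度、阈值与迭代次数均为假设取值:

```python
import torch

def sparse_sinkhorn(x, y, temperature=0.1, threshold=1e-3, iters=10):
    cost = torch.cdist(x, y)                          # (n, m) 成对欧氏距离
    P = torch.softmax(-cost / temperature, dim=1)     # 按行做温度 softmax
    P = torch.where(P > threshold, P, torch.zeros_like(P))  # 丢弃低权重分配
    mask = (P > 0).float()
    for _ in range(iters):                            # 仅在支撑集上交替归一化
        P = P / (P.sum(dim=1, keepdim=True) + 1e-9)   # 行归一化
        P = P / (P.sum(dim=0, keepdim=True) + 1e-9)   # 列归一化
        P = P * mask                                  # 保持稀疏支撑集不变
    return P.to_sparse()                              # 转为 COO 稀疏格式

x, y = torch.randn(128, 3), torch.randn(256, 3)
plan = sparse_sinkhorn(x, y)
print("非零分配数:", plan.values().numel(), "/", 128 * 256)
```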
[AI-63] Simulation-Driven Railway Delay Prediction: An Imitation Learning Approach AAAI2026
【速读】:该论文旨在解决铁路运输系统中列车延误预测的可靠性问题,以提升系统的鲁棒性和运行效率。其核心挑战在于如何准确建模延迟在大规模网络中的传播过程,尤其是捕捉其序列依赖性和不确定性。解决方案的关键在于将延误预测重构为一个随机模拟任务,并提出一种名为漂移校正模仿学习(Drift-Corrected Imitation Learning, DCIL)的自监督算法,该方法通过引入基于距离的漂移校正机制扩展了DAgger框架,在无需外部专家标签或对抗训练的情况下有效缓解了滚动过程中出现的协变量偏移(covariate shift),从而实现了高保真动态建模与数据驱动表示能力的融合,最终支持基于蒙特卡洛模拟的不确定性感知预测。
链接: https://arxiv.org/abs/2512.19737
作者: Clément Elliker,Jesse Read,Sonia Vanier,Albert Bifet
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 body pages, 3 appendix pages, 8 tables, 2 figures, accepted at AAAI2026 Main Track
Abstract:Reliable prediction of train delays is essential for enhancing the robustness and efficiency of railway transportation systems. In this work, we reframe delay forecasting as a stochastic simulation task, modeling state-transition dynamics through imitation learning. We introduce Drift-Corrected Imitation Learning (DCIL), a novel self-supervised algorithm that extends DAgger by incorporating distance-based drift correction, thereby mitigating covariate shift during rollouts without requiring access to an external oracle or adversarial schemes. Our approach synthesizes the dynamical fidelity of event-driven models with the representational capacity of data-driven methods, enabling uncertainty-aware forecasting via Monte Carlo simulation. We evaluate DCIL using a comprehensive real-world dataset from Infrabel, the Belgian railway infrastructure manager, which encompasses over three million train movements. Our results, focused on predictions up to 30 minutes ahead, demonstrate superior predictive performance of DCIL over traditional regression models and behavioral cloning on deep learning architectures, highlighting its effectiveness in capturing the sequential and uncertain nature of delay propagation in large-scale networks.
zh
[AI-64] CoPHo: Classifier-guided Conditional Topology Generation with Persistent Homology KDD2026
【速读】:该论文旨在解决现有生成式图模型在拓扑结构控制方面的局限性问题,即如何在不重新训练模型的前提下,高效生成具有特定拓扑属性的合成图。传统基于扩散的方法要么需为每种属性重新训练模型(嵌入条件),要么采用分类器引导策略但无法考虑图规模和实际约束。其解决方案的关键在于从离散视角出发,将预训练图级分类器的梯度引入离散反向扩散后验中,从而在去噪过程中实现对目标拓扑特性的动态引导;具体而言,提出Classifier-guided Conditional Topology Generation with Persistent Homology (CoPHo),通过在中间图上构建持久同调滤过(persistent homology filtration)并将特征作为引导信号,在每个去噪步骤中精确调控生成过程以逼近指定结构属性。
链接: https://arxiv.org/abs/2512.19736
作者: Gongli Xi,Ye Tian,Mengyu Yang,Zhenyu Zhao,Yuchao Zhang,Xiangyang Gong,Xirong Que,Wendong Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by KDD 2026. 12 pages, 5 figures
Abstract:The structure of topology underpins much of the research on performance and robustness, yet available topology data are typically scarce, necessitating the generation of synthetic graphs with desired properties for testing or release. Prior diffusion-based approaches either embed conditions into the diffusion model, requiring retraining for each attribute and hindering real-time applicability, or use classifier-based guidance post-training, which does not account for topology scale and practical constraints. In this paper, we show from a discrete perspective that gradients from a pre-trained graph-level classifier can be incorporated into the discrete reverse diffusion posterior to steer generation toward specified structural properties. Based on this insight, we propose Classifier-guided Conditional Topology Generation with Persistent Homology (CoPHo), which builds a persistent homology filtration over intermediate graphs and interprets features as guidance signals that steer generation toward the desired properties at each denoising step. Experiments on four generic/network datasets demonstrate that CoPHo outperforms existing methods at matching target metrics, and we further validate its transferability on the QM9 molecular dataset.
zh
[AI-65] High-Performance Self-Supervised Learning by Joint Training of Flow Matching
【速读】:该论文旨在解决扩散模型(Diffusion Models)在自监督学习(Self-Supervised Learning, SSL)中面临的两大挑战:一是生成质量与判别性能之间的权衡,二是迭代采样带来的高计算和能源开销,限制了其在工业和边缘人工智能(Edge AI)场景中的应用。解决方案的关键在于提出基于流匹配(Flow Matching)的基础模型(FlowFM),通过解耦设计联合训练表示编码器和条件流匹配生成器,利用流匹配学习更简单的速度场(velocity field)以加速并稳定训练过程,从而在保持高保真度生成的同时显著提升表征学习效率。实验表明,FlowFM在可穿戴传感器数据上相较扩散方法训练时间减少50.4%,且在下游任务中超越现有最优SSL方法(SSL-Wearables),推理速度最高提升51.0倍。
链接: https://arxiv.org/abs/2512.19729
作者: Kosuke Ukita,Tsuyoshi Okita
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures
Abstract:Diffusion models can learn rich representations during data generation, showing potential for Self-Supervised Learning (SSL), but they face a trade-off between generative quality and discriminative performance. Their iterative sampling also incurs substantial computational and energy costs, hindering industrial and edge AI applications. To address these issues, we propose the Flow Matching-based Foundation Model (FlowFM), which jointly trains a representation encoder and a conditional flow matching generator. This decoupled design achieves both high-fidelity generation and effective recognition. By using flow matching to learn a simpler velocity field, FlowFM accelerates and stabilizes training, improving its efficiency for representation learning. Experiments on wearable sensor data show FlowFM reduces training time by 50.4% compared to a diffusion-based approach. On downstream tasks, FlowFM surpassed the state-of-the-art SSL method (SSL-Wearables) on all five datasets while achieving up to a 51.0x inference speedup and maintaining high generative quality. The implementation code is available at this https URL.
zh
[AI-66] Tiny On-Device Decision Makers with the MiniConv Library
【速读】:该论文旨在解决在资源受限的边缘设备上部署视觉策略(visual policies)时面临的计算成本高和通信延迟大的问题,尤其是在强化学习(Reinforcement Learning, RL)场景中,传统方案通常将策略推理任务卸载至远程服务器,导致网络往返开销显著,并需传输高维观测数据。其解决方案的关键在于提出一种分层式策略架构(split-policy architecture),其中在设备端部署一个轻量级编码器(由OpenGL片段着色器实现以兼容广泛嵌入式GPU),将原始观测压缩为紧凑的特征张量进行传输,而远程的策略头(policy head)则基于这些低维特征做出决策。该方法不仅减少了传输数据量、降低了带宽受限环境下的闭环决策延迟,还减少了服务器每请求的计算负载,同时在单次运行基准测试中实现了与完整策略相当的学习性能(以最终100轮平均回报衡量),仅带来适度的回报损失。
链接: https://arxiv.org/abs/2512.19726
作者: Carlos Purves
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 Pages, 5 Figures
Abstract:Reinforcement learning (RL) has achieved strong results, but deploying visual policies on resource-constrained edge devices remains challenging due to computational cost and communication latency. Many deployments therefore offload policy inference to a remote server, incurring network round trips and requiring transmission of high-dimensional observations. We introduce a split-policy architecture in which a small on-device encoder, implemented as OpenGL fragment-shader passes for broad embedded GPU support, transforms each observation into a compact feature tensor that is transmitted to a remote policy head. In RL, this communication overhead manifests as closed-loop decision latency rather than only per-request inference latency. The proposed approach reduces transmitted data, lowers decision latency in bandwidth-limited settings, and reduces server-side compute per request, whilst achieving broadly comparable learning performance by final return (mean over the final 100 episodes) in single-run benchmarks, with modest trade-offs in mean return. We evaluate across an NVIDIA Jetson Nano, a Raspberry Pi 4B, and a Raspberry Pi Zero 2 W, reporting learning results, on-device execution behaviour under sustained load, and end-to-end decision latency and scalability measurements under bandwidth shaping. Code for training, deployment, and measurement is released as open source.
zh
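针对上文的拆分式策略,下面用 PyTorch 勾勒“设备端编码器 + 远端策略头”的数据流(张量形状与网络均为演示假设;论文中设备端编码器实际以 OpenGL 片段着色器实现):

```python
import torch
import torch.nn as nn

# 设备端轻量编码器:把高维观测压缩为紧凑特征张量。
class OnDeviceEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 8, 4, stride=4), nn.ReLU(),
            nn.Conv2d(8, 4, 4, stride=4), nn.ReLU(),
        )

    def forward(self, obs):       # obs: (B, 3, 64, 64),共 12288 个数
        return self.conv(obs)     # (B, 4, 4, 4),仅 64 个数需要传输

# 远端策略头:接收特征张量并输出动作分布。
class RemotePolicyHead(nn.Module):
    def __init__(self, n_actions=4):
        super().__init__()
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(4 * 4 * 4, n_actions))

    def forward(self, feat):
        return torch.distributions.Categorical(logits=self.fc(feat))

obs = torch.rand(1, 3, 64, 64)
feat = OnDeviceEncoder()(obs)                # 在边缘设备上计算
action = RemotePolicyHead()(feat).sample()   # 在远端服务器上计算
```

由此也能看出闭环决策延迟的来源:每个时间步只需上行传输 feat,而非整帧观测。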
[AI-67] Multiscale Dual-path Feature Aggregation Network for Remaining Useful Life Prediction of Lithium-Ion Batteries
【速读】:该论文旨在解决当前电池退化序列的局部与全局相关性建模效率低下、难以满足实际应用需求的问题。其解决方案的关键在于提出一种新颖的深度学习架构——多尺度双路径特征聚合网络(Multiscale Dual-Path Feature Aggregation Network, MDFA-Net),该网络包含两条并行路径:第一条为多尺度特征网络(MF-Net),用于保留浅层信息并避免信息丢失;第二条为编码器网络(EC-Net),用于捕捉序列的连续趋势并保留深层细节。通过有效融合深层与浅层特征,MDFA-Net能够同时精准建模电池容量退化的局部细微变化与全局演化趋势,从而在两个公开锂离子电池数据集上的剩余使用寿命(RUL)预测任务中显著优于现有先进方法。
链接: https://arxiv.org/abs/2512.19719
作者: Zihao Lv,Siqi Ai,Yanbin Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Remaining useful life (RUL) prediction enables targeted maintenance strategies, ensuring the dependability and safety of industrial machinery. However, current techniques for modeling both local and global correlations in battery degradation sequences are inefficient and struggle to meet the needs of real-life applications. For this reason, we propose a novel deep learning architecture, the multiscale dual-path feature aggregation network (MDFA-Net), for RUL prediction. MDFA-Net consists of two parallel paths: the first, a multiscale feature network (MF-Net), preserves shallow information and avoids information loss, while the second, an encoder network (EC-Net), captures the continuous trend of the sequences and retains deep details. Integrating both deep and shallow attributes effectively captures both local and global patterns. Testing on two publicly available lithium-ion battery datasets reveals that our approach surpasses existing top-tier methods in RUL forecasting, accurately tracking the capacity degradation trajectory.
zh
[AI-68] Thermodynamic Focusing for Inference-Time Search: Practical Methods for Target-Conditioned Sampling and Prompted Inference
【速读】:该论文旨在解决在极大候选空间中寻找稀有但有用的解这一普遍性难题,常见于语言生成、规划和强化学习等领域。其核心解决方案是提出一种实用框架——逆因果聚焦算法(Inverted Causality Focusing Algorithm, ICFA),该方法将搜索过程建模为一个目标条件下的重加权采样过程;ICFA 利用现有提议采样器(proposal sampler)和任务特定的相似性函数构造聚焦采样分布,并自适应地控制聚焦强度以避免退化问题,从而显著降低对样本数量的需求。
链接: https://arxiv.org/abs/2512.19717
作者: Zhan Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 4 pages
Abstract:Finding rare but useful solutions in very large candidate spaces is a recurring practical challenge across language generation, planning, and reinforcement learning. We present a practical framework, the Inverted Causality Focusing Algorithm (ICFA), that treats search as a target-conditioned reweighting process. ICFA reuses an available proposal sampler and a task-specific similarity function to form a focused sampling distribution, while adaptively controlling focusing strength to avoid degeneracy. We provide a clear recipe, a stability diagnostic based on effective sample size, a compact theoretical sketch explaining when ICFA can reduce sample needs, and two reproducible experiments: constrained language generation and sparse-reward navigation. We further show how structured prompts instantiate an approximate, language-level form of ICFA and describe a hybrid architecture combining prompted inference with algorithmic reweighting.
zh
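按摘要的描述,ICFA 的主干是“按相似度做目标条件重加权 + 用有效样本量(ESS)诊断退化并自适应聚焦强度”。下面是一个 numpy 示意(相似度函数、阈值与自适应规则均为演示假设):

```python
import numpy as np

def icfa_reweight(samples, similarity, beta):
    """按相似度做指数重加权;beta 为聚焦强度(假设的参数化方式)。"""
    logw = beta * similarity(samples)
    w = np.exp(logw - logw.max())          # 数值稳定
    return w / w.sum()

def effective_sample_size(w):
    return 1.0 / np.sum(w ** 2)            # 归一化权重下的标准 ESS

# 示意:一维提议样本,目标是接近 target=3 的稀有解
rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=10000)        # 提议采样器
sim = lambda x: -np.abs(x - 3.0)                  # 任务特定相似度(假设)

beta = 1.0
while True:
    w = icfa_reweight(samples, sim, beta)
    if effective_sample_size(w) < 100:            # 退化诊断阈值(假设)
        beta *= 0.8                               # 自适应降低聚焦强度
    else:
        break
focused = rng.choice(samples, size=100, p=w)      # 聚焦后的采样分布
```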
[AI-69] Development and external validation of a multimodal artificial intelligence mortality prediction model of critically ill patients using multicenter data
【速读】:该论文旨在解决重症患者住院死亡风险早期预测的问题,以辅助临床医生优化治疗决策。其解决方案的关键在于构建一个融合结构化与非结构化临床数据的多模态深度学习模型,该模型利用患者入院后24小时内的时间序列变量、静态特征、临床文书(clinical notes)及胸部X光图像等多源信息进行联合建模,并通过在MIMIC-III、MIMIC-IV、eICU和HiRID等多个独立数据库中的外部验证,证明了模型具有良好的泛化能力与预测性能(如AUROC达0.92),尤其当整合临床笔记和影像数据时,模型性能显著提升(AUROC从0.87提升至0.89)。
链接: https://arxiv.org/abs/2512.19716
作者: Behrooz Mamandipoor,Chun-Nan Hsu,Martin Krause,Ulrich H. Schmidt,Rodney A. Gabriel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 75 pages (33 main text + references, 35 supplementary materials), 5 figures, 2 tables
Abstract:Early prediction of in-hospital mortality in critically ill patients can aid clinicians in optimizing treatment. The objective was to develop a multimodal deep learning model, using structured and unstructured clinical data, to predict in-hospital mortality risk among critically ill patients after their initial 24-hour intensive care unit (ICU) admission. We used data from MIMIC-III, MIMIC-IV, eICU, and HiRID. A multimodal model was developed on the MIMIC datasets, featuring time series components occurring within the first 24 hours of ICU admission and predicting risk of subsequent inpatient mortality. Inputs included time-invariant variables, time-variant variables, clinical notes, and chest X-ray images. External validation occurred in a temporally separated MIMIC population, HiRID, and eICU datasets. A total of 203,434 ICU admissions from more than 200 hospitals between 2001 and 2022 were included, in which the mortality rate ranged from 5.2% to 7.9% across the four datasets. The model integrating structured data points had AUROC, AUPRC, and Brier scores of 0.92, 0.53, and 0.19, respectively. We externally validated the model on eight different institutions within the eICU dataset, demonstrating AUROCs ranging from 0.84-0.92. Among patients with available clinical notes and imaging data, incorporating notes and imaging into the model improved the AUROC, AUPRC, and Brier score from 0.87 to 0.89, 0.43 to 0.48, and 0.37 to 0.17, respectively. Our findings highlight the importance of incorporating multiple sources of patient information for mortality prediction and the importance of external validation.
zh
[AI-70] Bidirectional human-AI collaboration in brain tumour assessments improves both expert human and AI agent performance
【速读】:该论文旨在解决人工智能(AI)在医疗领域中如何与人类专家协同提升临床决策质量的问题,特别是聚焦于磁共振成像(MRI)引导下脑肿瘤患者表征中的诊断准确性与元认知能力。其解决方案的关键在于提出并验证两种人机协作范式:一是放射科医生借助AI辅助提升诊断表现,二是AI代理通过人类专家支持实现性能优化。研究发现,无论哪种模式,人机协作均能显著提高准确率、元认知能力和评分者间一致性;尤其当AI由人类专家支持时,患者获益最大,表明AI并非替代人类智能,而是通过持续利用和放大人类专业知识来增强自身临床能力,从而实现人机协同的最优价值。
链接: https://arxiv.org/abs/2512.19707
作者: James K Ruffle,Samia Mohinta,Guilherme Pombo,Asthik Biswas,Alan Campbell,Indran Davagnanam,David Doig,Ahmed Hamman,Harpreet Hyare,Farrah Jabeen,Emma Lim,Dermot Mallon,Stephanie Owen,Sophie Wilkinson,Sebastian Brandner,Parashkev Nachev
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 38 pages, 6 figures, 7 supplementary figures
Abstract:The benefits of artificial intelligence (AI)-human partnerships, i.e., how AI agents enhance expert human performance, are increasingly studied. Though rarely evaluated in healthcare, an inverse approach is possible: AI benefiting from the support of an expert human agent. Here, we investigate both human-AI clinical partnership paradigms in the magnetic resonance imaging-guided characterisation of patients with brain tumours. We reveal that human-AI partnerships improve accuracy and metacognitive ability not only for radiologists supported by AI, but also for AI agents supported by radiologists. Moreover, the greatest patient benefit was evident with an AI agent supported by a human one. Synergistic improvements in agent accuracy, metacognitive performance, and inter-rater agreement suggest that AI can create more capable, confident, and consistent clinical agents, whether human or model-based. Our work suggests that the maximal value of AI in healthcare could emerge not from replacing human intelligence, but from AI agents that routinely leverage and amplify it.
zh
[AI-71] Large Language Models for EDA Cloud Job Resource and Lifetime Prediction
【速读】:该论文旨在解决电子设计自动化(Electronic Design Automation, EDA)领域中云资源调度面临的资源与作业生命周期预测难题,传统机器学习方法因EDA工作负载的复杂性和异构性而表现不佳,且依赖大量特征工程和领域知识。其解决方案的关键在于提出一种基于大语言模型(Large Language Models, LLMs)的文本到文本回归框架,通过引入科学计数法表示和前缀填充(prefix filling)机制约束输出格式,显著提升预测结果的可靠性;同时发现全注意力(full-attention)微调与推理可有效提升滑动窗口注意力(sliding-window-attention)LLMs的预测精度,从而在真实云数据集上建立了EDA领域性能预测的新基准。
链接: https://arxiv.org/abs/2512.19701
作者: Yuxuan Yin,Shengke Zhou,Yunjie Zhang,Ajay Mohindra,Boxun Xu,Peng Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 6 figures
Abstract:The rapid growth of cloud computing in the Electronic Design Automation (EDA) industry has created a critical need for resource and job lifetime prediction to achieve optimal scheduling. Traditional machine learning methods often struggle with the complexity and heterogeneity of EDA workloads, requiring extensive feature engineering and domain expertise. We propose a novel framework that fine-tunes Large Language Models (LLMs) to address this challenge through text-to-text regression. We introduce the scientific notation and prefix filling to constrain the LLM, significantly improving output format reliability. Moreover, we found that full-attention finetuning and inference improves the prediction accuracy of sliding-window-attention LLMs. We demonstrate the effectiveness of our proposed framework on real-world cloud datasets, setting a new baseline for performance prediction in the EDA domain.
zh
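“科学计数法表示 + 前缀填充”可以这样直观理解(纯演示代码,提示词格式、字段与生成接口均为假设,不代表论文的具体实现):

```python
def to_sci_string(value, digits=3):
    """把数值编码为固定格式的科学计数法字符串,如 1.235e+05。"""
    return f"{value:.{digits}e}"

def parse_sci_string(text):
    return float(text.strip())

# 训练目标:文本到文本回归,例如预测作业峰值内存(字段与单位均为假设)
job_desc = "tool=synthesis design=cpu_top threads=16"
target = 123456.0
train_example = {"input": job_desc, "output": to_sci_string(target)}
print(train_example)   # {'input': ..., 'output': '1.235e+05'}

# 推理:前缀填充(prefix filling)把输出的固定骨架直接写入解码前缀,
# 只让模型补全剩余的数字字符,从而降低输出格式出错的概率。
prompt = f"Job: {job_desc}\nPredicted peak memory: "
completion = "1.302e+05"   # llm.generate(prompt) 的假想返回值(占位)
print(parse_sci_string(completion))
```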
[AI-72] Automated Fault Detection in 5G Core Networks Using Large Language Models
【速读】:该论文旨在解决现代电信网络中因数据量激增和规模扩大而导致的传统故障诊断方法效率低下、难以满足高可靠性和快速响应需求的问题。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)实现网络故障的自动化检测与分类:研究团队在基于Kubernetes的测试网络中注入多种典型故障(如Pod故障、网络延迟、丢包等),收集健康与故障状态下的多源数据(包括日志、系统描述、RTT测试结果及Pod状态),并基于此数据集对GPT-4.1 nano模型进行微调,显著提升了故障识别准确率,验证了LLM驱动的闭环、无需人工干预的故障管理机制在提高网络可靠性与降低运维成本方面的潜力。
链接: https://arxiv.org/abs/2512.19697
作者: Parsa Hatami,Ahmadreza Majlesara,Ali Majlesi,Babak Hossein Khalaj
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:
Abstract:With the rapid growth of data volume in modern telecommunication networks and the continuous expansion of their scale, maintaining high reliability has become a critical requirement. These networks support a wide range of applications and services, including highly sensitive and mission-critical ones, which demand rapid and accurate detection and resolution of network errors. Traditional fault-diagnosis methods are no longer efficient for such complex environments. In this study, we leverage Large Language Models (LLMs) to automate network fault detection and classification. Various types of network errors were intentionally injected into a Kubernetes-based test network, and data were collected under both healthy and faulty conditions. The dataset includes logs from different network components (pods), along with complementary data such as system descriptions, events, Round Trip Time (RTT) tests, and pod status information. The dataset covers common fault types such as pod failure, pod kill, network delay, network loss, and disk I/O failures. We fine-tuned the GPT-4.1 nano model via its API on this dataset, resulting in a significant improvement in fault-detection accuracy compared to the base model. These findings highlight the potential of LLM-based approaches for achieving closed-loop, operator-free fault management, which can enhance network reliability and reduce downtime-related operational costs for service providers.
zh
[AI-73] QoS-Aware Dynamic CU Selection in O-RAN with Graph-Based Reinforcement Learning
【速读】:该论文旨在解决传统无线接入网(RAN)中逻辑功能与物理位置绑定静态化所导致的资源利用效率低下问题,尤其在时变流量和资源条件下难以实现灵活调度。其核心解决方案是通过动态服务功能链(SFC)编排,结合在线O-CU(Open Centralized Unit)选择机制,以最小化网络能耗并满足服务质量(QoS)约束。关键创新在于将问题建模为马尔可夫决策过程(Markov Decision Process),并提出GRLDyP方法——一种由图神经网络(GNN)辅助的深度强化学习(DRL)框架,使代理能够联合决策路由路径与O-CU部署位置,从而在实时网络状态(如CPU负载与带宽利用率)下优化能效、延迟和服务等级之间的权衡。
链接: https://arxiv.org/abs/2512.19696
作者: Sebastian Racedo,Brigitte Jaumard,Oscar Delgado,Meysam Masoudi
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Open Radio Access Network (O-RAN) disaggregates conventional RAN into interoperable components, enabling flexible resource allocation, energy savings, and agile architectural design. In legacy deployments, the binding between logical functions and physical locations is static, which leads to inefficiencies under time-varying traffic and resource conditions. We address this limitation by relaxing the fixed mapping and performing dynamic service function chain (SFC) provisioning with on-the-fly O-CU selection. We formulate the problem as a Markov decision process and solve it using GRLDyP, a graph neural network (GNN)-assisted deep reinforcement learning (DRL) method. The proposed agent jointly selects routes and the O-CU location (from candidate sites) for each incoming service flow to minimize network energy consumption while satisfying quality of service (QoS) constraints. The GNN encodes the instantaneous network topology and resource utilization (e.g., CPU and bandwidth), and the DRL policy learns to balance grade of service, latency, and energy. We evaluate GRLDyP on a dataset with 24-hour traffic traces from the city of Montreal, showing that dynamic O-CU selection and routing significantly reduce energy consumption compared to a static mapping baseline, without violating QoS. The results highlight DRL-based SFC provisioning as a practical control primitive for energy-aware, resource-adaptive O-RAN deployments.
zh
[AI-74] Deep Learning Classification of EEG Responses to Multi-Dimensional Transcranial Electrical Stimulation
链接: https://arxiv.org/abs/2512.20319
作者: Alexis Pomares Pastor,Ines Ribeiro Violante,Gregory Scott
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注: For open-sourced datasets and source code, see: this https URL
[AI-75] Regression of Functions by Quantum Neural Networks Circuits
【速读】:该论文旨在解决量子神经网络(Quantum Neural Network, QNN)在回归任务中架构设计的难题,特别是如何自动化地构建高效且紧凑的量子电路结构。传统方法依赖人工经验选择电路深度、参数门配置及数据编码策略,而这一过程复杂且缺乏系统性指导。其解决方案的关键在于提出一种基于遗传算法(Genetic Algorithm)的框架,将量子回归器(Reduced Regressor QNN)的构造转化为一个优化问题,通过探索电路深度、可变参数门布局以及灵活的数据重上传(data re-uploading)模式来自动发现最优架构。该方法不仅显著减少了模型参数数量,还在多个非线性基准函数和解析函数上展现出与经典回归模型相当甚至更优的性能,同时利用十二种结构描述符对数据集复杂度进行量化分析,证明这些指标能可靠预测最佳量子架构,从而为元学习驱动的量子架构设计提供了理论基础与实践路径。
链接: https://arxiv.org/abs/2512.19978
作者: Fernando M. de Paula Neto,Lucas dos Reis Silva,Paulo S. G. de Mattos Neto,Felipe F. Fanchini
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:The performance of quantum neural network models depends strongly on architectural decisions, including circuit depth, placement of parametrized operations, and data-encoding strategies. Selecting an effective architecture is challenging and closely related to the classical difficulty of choosing suitable neural-network topologies, which is computationally hard. This work investigates automated quantum-circuit construction for regression tasks and introduces a genetic-algorithm framework that discovers Reduced Regressor QNN architectures. The approach explores depth, parametrized gate configurations, and flexible data re-uploading patterns, formulating the construction of quantum regressors as an optimization process. The discovered circuits are evaluated against seventeen classical regression models on twenty-two nonlinear benchmark functions and four analytical functions. Although classical methods often achieve comparable results, they typically require far more parameters, whereas the evolved quantum models remain compact while providing competitive performance. We further analyze dataset complexity using twelve structural descriptors and show, across five increasingly challenging meta-learning scenarios, that these measures can reliably predict which quantum architecture will perform best. The results demonstrate perfect or near-perfect predictive accuracy in several scenarios, indicating that complexity metrics offer powerful and compact representations of dataset structure and can effectively guide automated model selection. Overall, this study provides a principled basis for meta-learning-driven quantum architecture design and advances the understanding of how quantum models behave in regression settings–a topic that has received limited exploration in prior work. These findings pave the way for more systematic and theoretically grounded approaches to quantum regression.
zh
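摘要中的遗传搜索可以抽象为对“电路基因”的选择、交叉与变异;下面是一个骨架示意(基因编码与适应度均为假设的简化,实际应构建量子电路并以回归误差评估):

```python
import random

# 基因:每层一个 (门类型, 是否数据重上传) 的简化编码(假设)
GATES = ["RX", "RY", "RZ", "CNOT"]

def random_genome(depth=6):
    return [(random.choice(GATES), random.random() < 0.5) for _ in range(depth)]

def mutate(genome, rate=0.2):
    # 以概率 rate 替换某层的门类型,保留其重上传标记
    return [(random.choice(GATES), reup) if random.random() < rate else (g, reup)
            for g, reup in genome]

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def fitness(genome):
    # 占位适应度(假设):实际应在回归任务上评估验证误差,
    # 这里粗略地“偏好重上传、惩罚昂贵的双比特门”以演示流程
    return sum(r for _, r in genome) - sum(1 for g, _ in genome if g == "CNOT")

pop = [random_genome() for _ in range(20)]
for gen in range(30):
    pop.sort(key=fitness, reverse=True)
    elite = pop[:5]
    children = [mutate(crossover(*random.sample(elite, 2))) for _ in range(15)]
    pop = elite + children
print(pop[0])   # 搜索到的最优“电路基因”
```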
[AI-76] QMBench: A Research Level Benchmark for Quantum Materials Research
【速读】:该论文旨在解决当前大型语言模型代理(Large Language Model Agents)在量子材料研究中缺乏系统性评估工具的问题,从而阻碍其在该领域实现科学创新。解决方案的关键在于提出QMBench——一个专门针对量子材料研究的综合性基准测试平台,该平台通过标准化评估框架,全面衡量模型在凝聚态物理知识与计算技术(如密度 functional theory)应用中的能力,涵盖结构特性、电子性质、热力学行为、对称性原理及计算方法等多个维度,以推动具备创造性贡献潜力的“AI科学家”在量子材料领域的开发与演进。
链接: https://arxiv.org/abs/2512.19753
作者: Yanzhen Wang,Yiyang Jiang,Diana Golovanova,Kamal Das,Hyeonhu Bae,Yufei Zhao,Huu-Thong Le,Abhinava Chatterjee,Yunzhe Liu,Chao-Xing Liu,Felipe H. da Jornada,Binghai Yan,Xiao-Liang Qi
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注: 20 pages, 1 figure
Abstract:We introduce QMBench, a comprehensive benchmark designed to evaluate the capability of large language model agents in quantum materials research. This specialized benchmark assesses the model’s ability to apply condensed matter physics knowledge and computational techniques such as density functional theory to solve research problems in quantum materials science. QMBench encompasses different domains of the quantum material research, including structural properties, electronic properties, thermodynamic and other properties, symmetry principle and computational methodologies. By providing a standardized evaluation framework, QMBench aims to accelerate the development of an AI scientist capable of making creative contributions to quantum materials research. We expect QMBench to be developed and constantly improved by the research community.
zh
[AI-77] Generative AI for Analysts
【速读】:该论文旨在解决生成式 AI(Generative AI)在金融分析师工作中的实际影响问题,特别是其对报告质量、信息丰富度及预测准确性的作用机制。研究通过将FactSet于2023年推出的AI平台部署视为自然实验,发现AI显著提升了报告的信息多样性(增加40%不同信息源)、覆盖广度(扩大34%主题范围)和分析深度(提升25%高级方法使用),同时改善了时效性;但同时也导致预测误差上升59%,因AI呈现更均衡的正负信息组合,增加了认知负荷,尤其对高任务复杂度的分析师更为明显。解决方案的关键在于利用真实世界平台部署作为外生冲击,结合安慰剂检验排除其他数据供应商干扰,从而识别出生成式AI在金融信息生产中带来的双重效应:既带来生产力提升,也揭示了认知能力的边界。
链接: https://arxiv.org/abs/2512.19705
作者: Jian Xue,Qian Zhang,Wu Zhu
机构: 未知
类目: atistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); General Economics (econ.GN); General Finance (q-fin.GN)
备注:
Abstract:We study how generative artificial intelligence (AI) transforms the work of financial analysts. Using the 2023 launch of FactSet’s AI platform as a natural experiment, we find that adoption produces markedly richer and more comprehensive reports – featuring 40% more distinct information sources, 34% broader topical coverage, and 25% greater use of advanced analytical methods – while also improving timeliness. However, forecast errors rise by 59% as AI-assisted reports convey a more balanced mix of positive and negative information that is harder to synthesize, particularly for analysts facing heavier cognitive demands. Placebo tests using other data vendors confirm that these effects are unique to FactSet’s AI integration. Overall, our findings reveal both the productivity gains and cognitive limits of generative AI in financial information production.
zh
机器学习
[LG-0] Saddle-to-Saddle Dynamics Explains A Simplicity Bias Across Neural Network Architectures
链接: https://arxiv.org/abs/2512.20607
作者: Yedi Zhang,Andrew Saxe,Peter E. Latham
类目: Machine Learning (cs.LG)
*备注:
Abstract:Neural networks trained with gradient descent often learn solutions of increasing complexity over time, a phenomenon known as simplicity bias. Despite being widely observed across architectures, existing theoretical treatments lack a unifying framework. We present a theoretical framework that explains a simplicity bias arising from saddle-to-saddle learning dynamics for a general class of neural networks, incorporating fully-connected, convolutional, and attention-based architectures. Here, simple means expressible with few hidden units, i.e., hidden neurons, convolutional kernels, or attention heads. Specifically, we show that linear networks learn solutions of increasing rank, ReLU networks learn solutions with an increasing number of kinks, convolutional networks learn solutions with an increasing number of convolutional kernels, and self-attention models learn solutions with an increasing number of attention heads. By analyzing fixed points, invariant manifolds, and dynamics of gradient descent learning, we show that saddle-to-saddle dynamics operates by iteratively evolving near an invariant manifold, approaching a saddle, and switching to another invariant manifold. Our analysis also illuminates the effects of data distribution and weight initialization on the duration and number of plateaus in learning, dissociating previously confounding factors. Overall, our theory offers a framework for understanding when and why gradient descent progressively learns increasingly complex solutions.
[LG-1] ReLU and softplus neural nets as zero-sum turn-based games
链接: https://arxiv.org/abs/2512.20582
作者: Stephane Gaubert,Yiannis Vlassopoulos
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Optimization and Control (math.OC)
*备注: 24 pages, 2 figures
Abstract:We show that the output of a ReLU neural network can be interpreted as the value of a zero-sum, turn-based, stopping game, which we call the ReLU net game. The game runs in the direction opposite to that of the network, and the input of the network serves as the terminal reward of the game. In fact, evaluating the network is the same as running the Shapley-Bellman backward recursion for the value of the game. Using the expression of the value of the game as an expected total payoff with respect to the path measure induced by the transition probabilities and a pair of optimal policies, we derive a discrete Feynman-Kac-type path-integral formula for the network output. This game-theoretic representation can be used to derive bounds on the output from bounds on the input, leveraging the monotonicity of Shapley operators, and to verify robustness properties using policies as certificates. Moreover, training the neural network becomes an inverse game problem: given pairs of terminal rewards and corresponding values, one seeks transition probabilities and rewards of a game that reproduces them. Finally, we show that a similar approach applies to neural networks with Softplus activation functions, where the ReLU net game is replaced by its entropic regularization.
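单个激活函数已经体现了这种博弈解读:ReLU 是“停(得 0)或走(得 x)”两选一的最大化,而 softplus 是它的熵正则化。下列恒等式为标准事实,列出以便对照摘要中的结论:

```latex
\mathrm{ReLU}(x) = \max(0, x) = \max_{a \in \{0,1\}} a\,x,
\qquad
\mathrm{softplus}_{\beta}(x) = \tfrac{1}{\beta}\log\bigl(1 + e^{\beta x}\bigr)
\ \longrightarrow\ \mathrm{ReLU}(x) \quad (\beta \to \infty).
```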
[LG-2] Improving ML Training Data with Gold-Standard Quality Metrics
链接: https://arxiv.org/abs/2512.20577
作者: Leslie Barrett,Michael W. Sherman
类目: Machine Learning (cs.LG)
*备注:
Abstract:Hand-tagged training data is essential to many machine learning tasks. However, training data quality control has received little attention in the literature, despite data quality varying considerably with the tagging exercise. We propose methods to evaluate and enhance the quality of hand-tagged training data using statistical approaches to measure tagging consistency and agreement. We show that agreement metrics give more reliable results if recorded over multiple iterations of tagging, where declining variance in such recordings is an indicator of increasing data quality. We also show one way a tagging project can collect high-quality training data without requiring multiple tags for every work item, and that a tagger burn-in period may not be sufficient for minimizing tagger errors.
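摘要建议在多轮标注中记录一致性指标并观察其方差下降;以 Cohen's kappa 为例的最小示意如下(轮次、噪声水平与数据均为演示假设):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# 模拟两名标注者在若干轮标注中的标签;真实场景来自多轮标注记录
rng = np.random.default_rng(0)
truth = rng.integers(0, 2, size=200)

def noisy_tags(noise):
    flip = rng.random(200) < noise
    return np.where(flip, 1 - truth, truth)

kappas = []
for noise in [0.3, 0.2, 0.15, 0.1, 0.1]:   # 标注者随训练逐轮趋于一致(假设)
    a, b = noisy_tags(noise), noisy_tags(noise)
    kappas.append(cohen_kappa_score(a, b))

# 关键诊断:多轮一致性指标的方差下降,提示数据质量在提升
print("kappa per round:", np.round(kappas, 3))
print("variance of recent rounds:", np.var(kappas[-3:]))
```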
[LG-3] Explainable time-series forecasting with sampling-free SHAP for Transformers
链接: https://arxiv.org/abs/2512.20514
作者: Matthias Hertel,Sebastian Pütz,Ralf Mikut,Veit Hagenmeyer,Benjamin Schäfer
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time-series forecasts are essential for planning and decision-making in many domains. Explainability is key to building user trust and meeting transparency requirements. Shapley Additive Explanations (SHAP) is a popular explainable AI framework, but it lacks efficient implementations for time series and often assumes feature independence when sampling counterfactuals. We introduce SHAPformer, an accurate, fast and sampling-free explainable time-series forecasting model based on the Transformer architecture. It leverages attention manipulation to make predictions based on feature subsets. SHAPformer generates explanations in under one second, several orders of magnitude faster than the SHAP Permutation Explainer. On synthetic data with ground truth explanations, SHAPformer provides explanations that are true to the data. Applied to real-world electrical load data, it achieves competitive predictive performance and delivers meaningful local and global insights, such as identifying the past load as the key predictor and revealing a distinct model behavior during the Christmas period.
[LG-4] Recurrent Off-Policy Deep Reinforcement Learning Doesnt Have to be Slow
链接: https://arxiv.org/abs/2512.20513
作者: Tyler Clark,Christine Evers,Jonathon Hare
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recurrent off-policy deep reinforcement learning models achieve state-of-the-art performance but are often sidelined due to their high computational demands. In response, we introduce RISE (Recurrent Integration via Simplified Encodings), a novel approach that can leverage recurrent networks in any image-based off-policy RL setting without significant computational overheads by using both learnable and non-learnable encoder layers. When integrating RISE into leading non-recurrent off-policy RL algorithms, we observe a 35.6% human-normalized interquartile mean (IQM) performance improvement across the Atari benchmark. We analyze various implementation strategies to highlight the versatility and potential of our proposed framework.
[LG-5] Machine Learning to Predict Digital Frustration from Clickstream Data
链接: https://arxiv.org/abs/2512.20438
作者: Jibin Joseph
类目: Machine Learning (cs.LG)
*备注: 17 pages, 5 figures
Abstract:Many businesses depend on their mobile apps and websites, so user frustration while trying to complete a task on these channels can cause lost sales and complaints. In this research, I use clickstream data from a real e-commerce site to predict whether a session is frustrated or not. Frustration is defined using rules based on rage bursts, back-and-forth navigation (U-turns), cart churn, search struggle, and long wandering sessions; these rules are applied to 5.4 million raw clickstream events (304,881 sessions). From each session, I build tabular features and train standard classifier models. I also use the full event sequence to train a discriminative LSTM classifier. XGBoost reaches about 90% accuracy with a ROC AUC of 0.9579, while the LSTM performs best with about 91% accuracy and a ROC AUC of 0.9705. Finally, the research shows that with only the first 20 to 30 interactions, the LSTM already predicts frustration reliably.
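前述挫败标签本质上是一组可执行的会话规则;下面是一个最小示意(字段名与阈值均为假设,并非论文的精确定义):

```python
def is_frustrated(session):
    """session: 依时间排序的事件列表,每个事件含 page、action、ts(假设字段)。"""
    pages = [e["page"] for e in session]
    # 规则 1:rage burst,短时间内对同一元素连续点击
    rage = any(e.get("clicks_in_2s", 0) >= 3 for e in session)
    # 规则 2:U 形往返导航(A -> B -> A)
    u_turns = sum(1 for i in range(2, len(pages))
                  if pages[i] == pages[i - 2] != pages[i - 1])
    # 规则 3:加购后放弃(cart churn)
    actions = [e["action"] for e in session]
    cart_churn = "add_to_cart" in actions and "checkout" not in actions
    # 规则 4:长时间游走(阈值为假设,单位秒)
    too_long = session[-1]["ts"] - session[0]["ts"] > 1800
    return rage or u_turns >= 3 or cart_churn or too_long

session = [{"page": "p1", "action": "view", "ts": 0},
           {"page": "p2", "action": "view", "ts": 30},
           {"page": "p1", "action": "view", "ts": 60},
           {"page": "p2", "action": "add_to_cart", "ts": 90, "clicks_in_2s": 3}]
print(is_frustrated(session))  # True(rage burst 与 cart churn 均触发)
```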
[LG-6] BRIDGE: Budget-aware Reasoning via Intermediate Distillation with Guided Examples
链接: https://arxiv.org/abs/2512.20403
作者: Xuan-An Le,Minh-Nam Tran,Son Nguyen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Distilling knowledge from large proprietary models (e.g., GPT-4) to tiny deployable models (less than 1B parameters) faces a critical capacity-budget trap: the 1000x capacity gap between teachers and students prevents effective direct transfer, while API costs prohibit extensive data collection. We introduce BRIDGE (Budget-Aware Reasoning via Intermediate Distillation), a two-phase framework that resolves these constraints through strategic intermediation and budget asymmetry. In Phase 1, a mid-sized Teacher Assistant (TA; e.g., about 7B) learns from the black-box teacher on a strictly limited subset of data (e.g., 3-5%), selected via a zero-API-cost pipeline that balances entropic difficulty and semantic diversity using only local TA inference. In Phase 2, we exploit this asymmetry (teacher queries are expensive, whereas TA inference is free) to amplify supervision: the refined TA generates synthetic rationales for the full dataset to train the tiny student. Crucially, we apply an instruction-tuning curriculum to establish behavioral alignment in the tiny student before transferring reasoning. Our theoretical analysis shows that BRIDGE yields tighter generalization bounds than direct distillation when data is abundant. Experiments across medical, legal, and financial benchmarks demonstrate consistent improvements: BRIDGE delivers student performance gains of 28-41%, closing the capability gap with proprietary teachers by 12-16% while using 10x fewer teacher queries. Notably, BRIDGE defies the conventional cost-performance frontier, surpassing direct distillation baselines that use 100% of the budget while consuming only 5% of the resources.
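第一阶段“零 API 成本”的子集筛选,可理解为只用本地 TA 推理在“熵难度”与“语义多样性”之间做贪心折中;示意如下(打分方式与等权组合均为假设的简化):

```python
import numpy as np

def select_subset(entropies, embeddings, budget):
    """entropies: TA 对每条样本的预测熵(难度);embeddings: 语义向量。
    贪心地在高熵与“远离已选样本”(多样性)之间折中,打分为示意。"""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    chosen = [int(np.argmax(entropies))]
    while len(chosen) < budget:
        sim_to_chosen = emb @ emb[chosen].T            # (n, |chosen|)
        diversity = 1.0 - sim_to_chosen.max(axis=1)    # 离已选集合越远越好
        score = entropies + diversity                  # 等权折中(假设)
        score[chosen] = -np.inf                        # 不重复选取
        chosen.append(int(np.argmax(score)))
    return chosen

rng = np.random.default_rng(0)
idx = select_subset(rng.random(1000), rng.normal(size=(1000, 32)), budget=50)
print(len(idx))   # 50,即向黑盒教师付费查询的样本预算
```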
[LG-7] GeoTransolver: Learning Physics on Irregular Domains Using a Multi-scale Geometry-Aware Physics Attention Transformer
链接: https://arxiv.org/abs/2512.20399
作者: Corey Adams,Rishikesh Ranade,Ram Cherukuri,Sanjay Choudhry
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:We present GeoTransolver, a Multiscale Geometry-Aware Physics Attention Transformer for CAE that replaces standard attention with GALE, coupling physics-aware self-attention on learned state slices with cross-attention to a shared geometry/global/boundary-condition context computed from multi-scale ball queries (inspired by DoMINO) and reused in every block. Implemented and released in NVIDIA PhysicsNeMo, GeoTransolver persistently projects geometry, global and boundary condition parameters into physical state spaces to anchor latent computations to domain structure and operating regimes. We benchmark GeoTransolver on DrivAerML, Luminary SHIFT-SUV, and Luminary SHIFT-Wing, comparing against Domino, Transolver (as released in PhysicsNeMo), and literature-reported AB-UPT, and evaluate drag/lift R2 and Relative L1 errors for field variables. GeoTransolver delivers better accuracy, improved robustness to geometry/regime shifts, and favorable data efficiency; we include ablations on DrivAerML and qualitative results such as contour plots and design trends for the best GeoTransolver models. By unifying multiscale geometry-aware context with physics-based attention in a scalable transformer, GeoTransolver advances operator learning for high-fidelity surrogate modeling across complex, irregular domains and non-linear physical regimes.
[LG-8] Physics-guided Neural Network-based Shaft Power Prediction for Vessels
链接: https://arxiv.org/abs/2512.20348
作者: Dogan Altan,Hamza Haruna Mohammed,Glenn Terje Lines,Dusica Marijan,Arnbjørn Maressa
类目: Machine Learning (cs.LG)
*备注: This work has been accepted for publication in the 11th Special Session on Intelligent Data Mining at IEEE BigData 2025. The final published version of this work will be available through IEEE
Abstract:Optimizing maritime operations, particularly fuel consumption for vessels, is crucial, considering its significant share in global trade. As fuel consumption is closely related to the shaft power of a vessel, predicting shaft power accurately is a crucial problem that requires careful consideration to minimize costs and emissions. Traditional approaches, which incorporate empirical formulas, often struggle to model dynamic conditions, such as sea conditions or fouling on vessels. In this paper, we present a hybrid, physics-guided neural network-based approach that utilizes empirical formulas within the network to combine the advantages of both neural networks and traditional techniques. We evaluate the presented method using data obtained from four similar-sized cargo vessels and compare the results with those of a baseline neural network and a traditional approach that employs empirical formulas. The experimental results demonstrate that the physics-guided neural network approach achieves lower mean absolute error, root mean square error, and mean absolute percentage error for all tested vessels compared to both the empirical formula-based method and the base neural network.
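“在网络内部嵌入经验公式”的一种常见做法,是让网络学习经验基线之上的有界修正因子;下面是一个示意(经验项取功率与航速立方成正比的粗略近似,变量与结构均为假设,摘要未给出论文的具体公式):

```python
import torch
import torch.nn as nn

class PhysicsGuidedShaftPower(nn.Module):
    """示意:P_pred = P_empirical(v) * (1 + 有界的神经网络修正)。
    经验项 P ∝ v^3 为粗略近似(假设),log_k 为可学习的标定系数。"""
    def __init__(self, n_feats=6):
        super().__init__()
        self.log_k = nn.Parameter(torch.zeros(1))
        self.correction = nn.Sequential(
            nn.Linear(n_feats, 32), nn.ReLU(), nn.Linear(32, 1), nn.Tanh(),
        )

    def forward(self, speed, feats):
        # feats: 风浪、吃水、污底代理变量等动态条件(字段为假设)
        p_emp = torch.exp(self.log_k) * speed.clamp(min=0.1) ** 3
        return p_emp * (1 + 0.5 * self.correction(feats).squeeze(-1))

model = PhysicsGuidedShaftPower()
speed = torch.rand(16) * 20        # 航速(单位为假设)
feats = torch.randn(16, 6)
power = model(speed, feats)        # 修正因子被限制在 [0.5, 1.5] 区间内
```

把修正因子限制在经验基线附近,正是混合方法在外推工况下比纯神经网络更稳健的直观原因。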
[LG-9] Inverse Autoregressive Flows for Zero Degree Calorimeter fast simulation NEURIPS
链接: https://arxiv.org/abs/2512.20346
作者: Emilia Majerz,Witold Dzwinel,Jacek Kitowski
类目: Machine Learning (cs.LG)
*备注: Presented as a poster at the Machine Learning and the Physical Sciences Workshop, 39th Conference on Neural Information Processing Systems (NeurIPS), 2025
Abstract:Physics-based machine learning blends traditional science with modern data-driven techniques. Rather than relying exclusively on empirical data or predefined equations, this methodology embeds domain knowledge directly into the learning process, resulting in models that are both more accurate and robust. We leverage this paradigm to accelerate simulations of the Zero Degree Calorimeter (ZDC) of the ALICE experiment at CERN. Our method introduces a novel loss function and an output variability-based scaling mechanism, which enhance the model’s capability to accurately represent the spatial distribution and morphology of particle showers in detector outputs while mitigating the influence of rare artefacts on the training. Leveraging Normalizing Flows (NFs) in a teacher-student generative framework, we demonstrate that our approach not only outperforms classic data-driven model assimilation but also yields models that are 421 times faster than existing NF implementations in ZDC simulation literature.
[LG-10] FedDPC : Handling Data Heterogeneity and Partial Client Participation in Federated Learning
链接: https://arxiv.org/abs/2512.20329
作者: Mrinmay Sen,Subhrajit Nag
类目: Machine Learning (cs.LG)
*备注: 10 pages, 7 figures
Abstract:Data heterogeneity is a significant challenge in modern federated learning (FL) as it creates variance in local model updates, causing the aggregated global model to shift away from the true global optimum. Partial client participation in FL further exacerbates this issue by skewing the aggregation of local models towards the data distribution of participating clients. This creates additional variance in the global model updates, causing the global model to converge away from the optima of the global objective. These variances lead to instability in FL training, which degrades global model performance and slows down FL training. While existing literature primarily focuses on addressing data heterogeneity, the impact of partial client participation has received less attention. In this paper, we propose FedDPC, a novel FL method, designed to improve FL training and global model performance by mitigating both data heterogeneity and partial client participation. FedDPC addresses these issues by projecting each local update onto the previous global update, thereby controlling variance in both local and global updates. To further accelerate FL training, FedDPC employs adaptive scaling for each local update before aggregation. Extensive experiments on image classification tasks with multiple heterogeneously partitioned datasets validate the effectiveness of FedDPC. The results demonstrate that FedDPC outperforms state-of-the-art FL algorithms by achieving faster reduction in training loss and improved test accuracy across communication rounds.
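FedDPC 的两个关键操作,“把本地更新投影到上一轮全局更新上”与“聚合前自适应缩放”,可用如下最小示意表达(缩放与聚合细节为假设的简化):

```python
import torch

def project_onto(update, global_dir, eps=1e-12):
    """把本地更新向量投影到上一轮全局更新方向上(参数已向量化)。"""
    coef = torch.dot(update, global_dir) / (global_dir.norm() ** 2 + eps)
    return coef * global_dir

# 演示:三个异质客户端的本地更新
g_prev = torch.tensor([1.0, 0.0, 0.0])               # 上一轮全局更新
locals_ = [torch.tensor([0.9, 0.5, -0.2]),
           torch.tensor([1.2, -0.7, 0.1]),
           torch.tensor([0.3, 1.0, 0.8])]

projected = [project_onto(u, g_prev) for u in locals_]
# 自适应缩放(示意):把各投影更新的范数对齐到平均范数
avg_norm = torch.stack([p.norm() for p in projected]).mean()
scaled = [p * (avg_norm / (p.norm() + 1e-12)) for p in projected]
g_new = torch.stack(scaled).mean(0)                   # 聚合为新的全局更新
print(g_new)
```

投影去掉了与全局方向正交的分量,因而同时抑制了数据异构与部分参与带来的更新方差。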
[LG-11] Top-K Exterior Power Persistent Homology: Algorithm, Structure, and Stability
链接: https://arxiv.org/abs/2512.20325
作者: Yoshihiro Maruyama
类目: Computational Geometry (cs.CG); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
*备注:
Abstract:Exterior powers play important roles in persistent homology in computational geometry. In the present paper we study the problem of extracting the K longest intervals of the exterior-power layers of a tame persistence module. We prove a structural decomposition theorem that organizes the exterior-power layers into monotone per-anchor streams with explicit multiplicities, enabling a best-first algorithm. We also show that the Top-K length vector is 2-Lipschitz under bottleneck perturbations of the input barcode, and prove a comparison-model lower bound. Our experiments confirm the theory, showing speedups over full enumeration in high-overlap cases. By enabling efficient extraction of the most prominent features, our approach makes higher-order persistence feasible for large datasets and thus broadly applicable to machine learning, data science, and scientific computing.
[LG-12] Algorithm for Interpretable Graph Features via Motivic Persistent Cohomology
链接: https://arxiv.org/abs/2512.20311
作者: Yoshihiro Maruyama
类目: Computational Geometry (cs.CG); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
*备注:
Abstract:We present the Chromatic Persistence Algorithm (CPA), an event-driven method for computing persistent cohomological features of weighted graphs via graphic arrangements, a classical object in computational geometry. We establish rigorous complexity results: CPA is exponential in the worst case, fixed-parameter tractable in treewidth, and nearly linear for common graph families such as trees, cycles, and series-parallel graphs. Finally, we demonstrate its practical applicability through a controlled experiment on molecular-like graph structures.
[LG-13] Mixture-of-Experts with Gradient Conflict-Driven Subspace Topology Pruning for Emergent Modularity
链接: https://arxiv.org/abs/2512.20291
作者: Yuxing Gan,Ziyu Lei
类目: Machine Learning (cs.LG)
*备注:
Abstract:Mixture-of-Experts (MoE) architectures achieve parameter efficiency through conditional computation, yet contemporary designs suffer from two fundamental limitations: structural parameter isolation that causes catastrophic forgetting, and instruction-overfitting that degrades performance in instruction-free scenarios. We propose CDSP-MoE (Conflict-Driven Subspace Pruning MoE), a framework that addresses these issues through a paradigm shift from isolated expert containers to dynamic expert instantiation within a shared physical subspace. Grounded in the Universal Weight Subspace Hypothesis, CDSP-MoE maintains a super-complete parameter backbone where logical experts are carved out via learnable topology masks. Unlike prior work that uses gradient conflict for token reassignment or optimization surgery, we leverage it as a structural supervisory signal: a Lagged Gradient Game penalizes interfering connections in the shared manifold, enabling the topology to spontaneously prune conflicting pathways and evolve interpretable modular structures. Experimental results demonstrate that CDSP-MoE achieves robust content-driven routing without human-defined task labels, maintaining semantic specialization even under strict blind inference protocols where explicit instructions are absent. Code is available at: this https URL
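“梯度冲突作为结构监督信号”的最小示意:两条专家路径在共享参数上的梯度若方向相反(余弦相似度为负),则产生正的惩罚,可加入拓扑掩码的训练目标(掩码参数化与惩罚形式均为假设的简化):

```python
import torch
import torch.nn.functional as F

def conflict_penalty(grad_a, grad_b):
    """梯度余弦相似度为负即存在冲突,返回正的惩罚值。"""
    cos = F.cosine_similarity(grad_a.flatten(), grad_b.flatten(), dim=0)
    return F.relu(-cos)

# 演示:共享权重 w 上,两个“逻辑专家”任务各自的梯度
w = torch.randn(4, 4, requires_grad=True)
x = torch.randn(2, 4)
loss_a = (x @ w).sum()
loss_b = -(x @ w).pow(2).sum()
g_a = torch.autograd.grad(loss_a, w, retain_graph=True)[0]
g_b = torch.autograd.grad(loss_b, w)[0]

penalty = conflict_penalty(g_a, g_b)   # 加入掩码的训练目标后,
print(penalty)                          # 可促使拓扑剪除相互干扰的连接
```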
[LG-14] HGAN-SDEs: Learning Neural Stochastic Differential Equations with Hermite-Guided Adversarial Training
链接: https://arxiv.org/abs/2512.20272
作者: Yuanjian Xu,Yuan Shuai,Jianing Hao,Guang Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Neural Stochastic Differential Equations (Neural SDEs) provide a principled framework for modeling continuous-time stochastic processes and have been widely adopted in fields ranging from physics to finance. Recent advances suggest that Generative Adversarial Networks (GANs) offer a promising solution for learning the complex path distributions induced by SDEs. However, a critical bottleneck lies in designing a discriminator that faithfully captures temporal dependencies while remaining computationally efficient. Prior works have explored Neural Controlled Differential Equations (CDEs) as discriminators due to their ability to model continuous-time dynamics, but such architectures suffer from high computational costs and exacerbate the instability of adversarial training. To address these limitations, we introduce HGAN-SDEs, a novel GAN-based framework that leverages Neural Hermite functions to construct a structured and efficient discriminator. Hermite functions provide an expressive yet lightweight basis for approximating path-level dynamics, enabling both reduced runtime complexity and improved training stability. We establish the universal approximation property of our framework for a broad class of SDE-driven distributions and theoretically characterize its convergence behavior. Extensive empirical evaluations on synthetic and real-world systems demonstrate that HGAN-SDEs achieve superior sample quality and learning efficiency compared to existing generative models for SDEs.
[LG-15] DeepONet-accelerated Bayesian inversion for moving boundary problems
链接: https://arxiv.org/abs/2512.20268
作者: Marco A. Iglesias,Michael. E. Causon,Mikhail Y. Matveev,Andreas Endruweit,Michael .V. Tretyakov
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:This work demonstrates that neural operator learning provides a powerful and flexible framework for building fast, accurate emulators of moving boundary systems, enabling their integration into digital twin platforms. To this end, a Deep Operator Network (DeepONet) architecture is employed to construct an efficient surrogate model for moving boundary problems in single-phase Darcy flow through porous media. The surrogate enables rapid and accurate approximation of complex flow dynamics and is coupled with an Ensemble Kalman Inversion (EKI) algorithm to solve Bayesian inverse problems. The proposed inversion framework is demonstrated by estimating the permeability and porosity of fibre reinforcements for composite materials manufactured via the Resin Transfer Moulding (RTM) process. Using both synthetic and experimental in-process data, the DeepONet surrogate accelerates inversion by several orders of magnitude compared with full-model EKI. This computational efficiency enables real-time, accurate, high-resolution estimation of local variations in permeability, porosity, and other parameters, thereby supporting effective monitoring and control of RTM processes, as well as other applications involving moving boundary flows. Unlike prior approaches for RTM inversion that learn mesh-dependent mappings, the proposed neural operator generalises across spatial and temporal domains, enabling evaluation at arbitrary sensor configurations without retraining, and represents a significant step toward practical industrial deployment of digital twins.
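作为参照,标准的集合卡尔曼反演(EKI)对每个集合成员 u_j 的更新可写成下式(通用记法:C 为集合经验(交叉)协方差,𝒢 为前向模型,Γ 为观测噪声协方差;论文的具体变体可能有所不同)。加速的来源正是用 DeepONet 代理替换昂贵的前向模型 𝒢:

```latex
u_j^{(n+1)} = u_j^{(n)}
  + C_n^{u\mathcal{G}} \bigl( C_n^{\mathcal{G}\mathcal{G}} + \Gamma \bigr)^{-1}
    \bigl( y - \mathcal{G}(u_j^{(n)}) \bigr),
\qquad j = 1, \dots, J.
```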
[LG-16] Adaptive Multi-task Learning for Probabilistic Load Forecasting
链接: https://arxiv.org/abs/2512.20232
作者: Onintze Zaballa,Verónica Álvarez,Santiago Mazuelas
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:
Abstract:Simultaneous load forecasting across multiple entities (e.g., regions, buildings) is crucial for the efficient, reliable, and cost-effective operation of power systems. Accurate load forecasting is a challenging problem due to the inherent uncertainties in load demand, dynamic changes in consumption patterns, and correlations among entities. Multi-task learning has emerged as a powerful machine learning approach that enables simultaneous learning across multiple related problems. However, its application to load forecasting remains underexplored and is limited to offline learning-based methods, which cannot capture changes in consumption patterns. This paper presents an adaptive multi-task learning method for probabilistic load forecasting. The proposed method can dynamically adapt to changes in consumption patterns and correlations among entities. In addition, the techniques presented provide reliable probabilistic predictions for the loads of multiple entities and assess load uncertainties. Specifically, the method is based on vector-valued hidden Markov models and uses a recursive process to update the model parameters and provide predictions with the most recent parameters. The performance of the proposed method is evaluated using datasets that contain the load demand of multiple entities and exhibit diverse and dynamic consumption patterns. The experimental results show that the presented techniques outperform existing methods both in terms of forecasting performance and uncertainty assessment.
[LG-17] Generalisation in Multitask Fitted Q-Iteration and Offline Q-learning
链接: https://arxiv.org/abs/2512.20220
作者: Kausthubh Manda,Raghuram Bharadwaj Diddigi
类目: Machine Learning (cs.LG)
*备注: 18 pages (9 pages + Appendix and references), this is version 1
Abstract:We study offline multitask reinforcement learning in settings where multiple tasks share a low-rank representation of their action-value functions. In this regime, a learner is provided with fixed datasets collected from several related tasks, without access to further online interaction, and seeks to exploit shared structure to improve statistical efficiency and generalization. We analyze a multitask variant of fitted Q-iteration that jointly learns a shared representation and task-specific value functions via Bellman error minimization on offline data. Under standard realizability and coverage assumptions commonly used in offline reinforcement learning, we establish finite-sample generalization guarantees for the learned value functions. Our analysis explicitly characterizes how pooling data across tasks improves estimation accuracy, yielding a 1/√(nT) dependence on the total number of samples across tasks, while retaining the usual dependence on the horizon and concentrability coefficients arising from distribution shift. In addition, we consider a downstream offline setting in which a new task shares the same underlying representation as the upstream tasks. We study how reusing the representation learned during the multitask phase affects value estimation for this new task, and show that it can reduce the effective complexity of downstream learning relative to learning from scratch. Together, our results clarify the role of shared representations in multitask offline Q-learning and provide theoretical insight into when and how multitask structure can improve generalization in model-free, value-based reinforcement learning.
[LG-18] Cost-TrustFL: Cost-Aware Hierarchical Federated Learning with Lightweight Reputation Evaluation across Multi-Cloud
链接: https://arxiv.org/abs/2512.20218
作者: Jixiao Yang,Jinyu Chen,Zixiao Huang,Chengda Xu,Chi Zhang,Sijia Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Federated learning across multi-cloud environments faces critical challenges, including non-IID data distributions, malicious participant detection, and substantial cross-cloud communication costs (egress fees). Existing Byzantine-robust methods focus primarily on model accuracy while overlooking the economic implications of data transfer across cloud providers. This paper presents Cost-TrustFL, a hierarchical federated learning framework that jointly optimizes model performance and communication costs while providing robust defense against poisoning attacks. We propose a gradient-based approximate Shapley value computation method that reduces the complexity from exponential to linear, enabling lightweight reputation evaluation. Our cost-aware aggregation strategy prioritizes intra-cloud communication to minimize expensive cross-cloud data transfers. Experiments on CIFAR-10 and FEMNIST datasets demonstrate that Cost-TrustFL achieves 86.7% accuracy under 30% malicious clients while reducing communication costs by 32% compared to baseline methods. The framework maintains stable performance across varying non-IID degrees and attack intensities, making it practical for real-world multi-cloud deployments.
[LG-19] NeuralCrop: Combining physics and machine learning for improved crop yield predictions
链接: https://arxiv.org/abs/2512.20177
作者: Yunan Lin,Sebastian Bathiany,Maha Badri,Maximilian Gelbrecht,Philipp Hess,Brian Groenke,Jens Heinke,Christoph Müller,Niklas Boers
类目: Machine Learning (cs.LG)
*备注:
Abstract:Global gridded crop models (GGCMs) simulate daily crop growth by explicitly representing key biophysical processes and project end-of-season yield time series. They are a primary tool to quantify the impacts of climate change on agricultural productivity and assess associated risks for food security. Despite decades of development, state-of-the-art GGCMs still have substantial uncertainties in simulating complex biophysical processes due to limited process understanding. Recently, machine learning approaches trained on observational data have shown great potential in crop yield predictions. However, these models have not demonstrated improved performance over classical GGCMs and are not suitable for simulating crop yields under changing climate conditions due to problems in generalizing outside their training distributions. Here we introduce NeuralCrop, a hybrid GGCM that combines the strengths of an advanced process-based GGCM, resolving important processes explicitly, with data-driven machine learning components. The model is first trained to emulate a competitive GGCM before it is fine-tuned on observational data. We show that NeuralCrop outperforms state-of-the-art GGCMs across site-level and large-scale cropping regions. Across moisture conditions, NeuralCrop reproduces the interannual yield anomalies in European wheat regions and the US Corn Belt more accurately during the period from 2000 to 2019 with particularly strong improvements under drought extremes. When generalizing to conditions unseen during training, NeuralCrop continues to make robust projections, while pure machine learning models exhibit substantial performance degradation. Our results show that our hybrid crop modelling approach offers overall improved crop modeling and more reliable yield projections under climate change and intensifying extreme weather conditions.
[LG-20] Sample-Efficient Policy Constraint Offline Deep Reinforcement Learning based on Sample Filtering
链接: https://arxiv.org/abs/2512.20115
作者: Yuanhao Chen,Qi Liu,Pengbin Chen,Zhongjian Qiao,Yanjie Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Offline reinforcement learning (RL) aims to learn a policy that maximizes the expected return using a given static dataset of transitions. However, offline RL faces the distribution shift problem, and policy constraint methods have been proposed to address it. During policy constraint offline RL training, the difference between the learned policy and the behavior policy must stay within a given threshold, so the learned policy relies heavily on the quality of the behavior policy. A problem therefore exists in existing policy constraint methods: if the dataset contains many low-reward transitions, the learned policy is constrained toward a suboptimal reference policy, leading to slow learning, low sample efficiency, and inferior performance. This paper shows that the common practice of sampling all transitions in the dataset can be improved. A simple but efficient sample filtering method is proposed to improve sample efficiency and final performance. First, we score transitions by the average reward and average discounted reward of the episodes they belong to, and extract the high-scoring transitions. Second, these high-scoring transitions are used to train the offline RL algorithms. We verify the proposed method on a series of offline RL algorithms and benchmark tasks, and experimental results show that it outperforms the baselines.
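摘要的两步流程可直接落成代码:先按回合的平均奖励与平均折扣回报打分,再只保留高分回合的转移;示意如下(打分组合、保留比例与折扣率均为假设):

```python
import numpy as np

def filter_transitions(episodes, gamma=0.99, keep_frac=0.5):
    """episodes: 列表,每项为该回合的转移列表 (s, a, r, s', done)。
    按平均奖励与平均折扣回报的打分筛选高质量回合的转移。"""
    def score(ep):
        rewards = np.array([t[2] for t in ep], dtype=float)
        disc = np.sum(rewards * gamma ** np.arange(len(rewards)))
        return rewards.mean() + disc / len(rewards)   # 打分组合为示意

    ranked = sorted(episodes, key=score, reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_frac))]
    return [t for ep in kept for t in ep]             # 展平为转移集合

# 演示数据:平均奖励高低不一的回合
rng = np.random.default_rng(0)
episodes = [[(None, None, r, None, False) for r in rng.normal(m, 1, 50)]
            for m in [0.0, 0.5, 1.0, -0.5]]
buffer = filter_transitions(episodes)   # 仅用该 buffer 训练离线 RL 算法
print(len(buffer))                       # 100,即保留了 2 个高分回合
```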
[LG-21] Information-directed sampling for bandits: a primer
链接: https://arxiv.org/abs/2512.20096
作者: Annika Hirling,Giorgio Nicoletti,Antonio Celani
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:
Abstract:The Multi-Armed Bandit problem provides a fundamental framework for analyzing the tension between exploration and exploitation in sequential learning. This paper explores Information Directed Sampling (IDS) policies, a class of heuristics that balance immediate regret against information gain. We focus on the tractable environment of two-state Bernoulli bandits as a minimal model to rigorously compare heuristic strategies against the optimal policy. We extend the IDS framework to the discounted infinite-horizon setting by introducing a modified information measure and a tuning parameter to modulate the decision-making behavior. We examine two specific problem classes: symmetric bandits and the scenario involving one fair coin. In the symmetric case we show that IDS achieves bounded cumulative regret, whereas in the one-fair-coin scenario the IDS policy yields a regret that scales logarithmically with the horizon, in agreement with classical asymptotic lower bounds. This work serves as a pedagogical synthesis, aiming to bridge concepts from reinforcement learning and information theory for an audience of statistical physicists.
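IDS 每步选取最小化“信息比率”(瞬时遗憾的平方除以信息增益)的动作;下面是两臂伯努利情形的骨架示意(用后验采样估计遗憾,信息增益取后验方差这一简化代理,均为演示假设,并非严格的 IDS 实现):

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = [0.4, 0.6]
alpha, beta = np.ones(2), np.ones(2)     # 两臂收益的 Beta 后验

for t in range(1000):
    samples = rng.beta(alpha, beta, size=(1000, 2))   # 后验采样
    best = samples.max(axis=1)
    regret = best.mean() - samples.mean(axis=0)       # 各臂的期望瞬时遗憾
    info = samples.var(axis=0) + 1e-9                 # 信息增益的简化代理(假设)
    a = int(np.argmin(regret ** 2 / info))            # 信息比率最小的臂
    r = rng.random() < p_true[a]                      # 拉臂并观测伯努利奖励
    alpha[a] += r
    beta[a] += 1 - r

print(alpha / (alpha + beta))   # 后验均值应接近 p_true
```

严格的 IDS 是在随机化策略上最小化信息比率,这里为简洁起见取了逐臂的确定性版本。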
[LG-22] Jensen-Shannon Divergence Message-Passing for Rich-Text Graph Representation Learning
链接: https://arxiv.org/abs/2512.20094
作者: Zuo Wang,Ye Yuan
类目: Machine Learning (cs.LG)
*备注:
Abstract:In this paper, we investigate how the widely existing contextual and structural divergence may influence representation learning in rich-text graphs. To this end, we propose Jensen-Shannon Divergence Message-Passing (JSDMP), a new learning paradigm for rich-text graph representation learning. Besides considering similarity regarding structure and text, JSDMP further captures their corresponding dissimilarity via the Jensen-Shannon divergence. Similarity and dissimilarity are then jointly used to compute new message weights among text nodes, thus enabling representations to be learned with contextual and structural information from truly correlated text nodes. With JSDMP, we propose two novel graph neural networks, namely the Divergent Message-Passing Graph Convolutional Network (DMPGCN) and the Divergent Message-Passing PageRank Graph Neural Network (DMPPRG), for learning representations in rich-text graphs. DMPGCN and DMPPRG have been extensively tested on well-established rich-text datasets and compared with several state-of-the-art baselines. The experimental results show that DMPGCN and DMPPRG can outperform other baselines, demonstrating the effectiveness of the proposed Jensen-Shannon Divergence Message-Passing paradigm.
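Jensen-Shannon 散度本身的计算很直接;下面给出标准定义,并示意一种把相似度与相异度折中为消息权重的组合方式(该组合仅为演示,论文的具体加权形式见原文):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)   # 标准 Jensen-Shannon 散度

def message_weight(p, q, lam=0.5):
    """示意:相似度 (1 - JSD) 与相异度按 lam 折中后作为消息权重。"""
    d = jsd(p, q) / np.log(2)                # 自然对数下 JSD ≤ ln 2,归一化到 [0, 1]
    return (1 - lam) * (1 - d) + lam * d

# 两个文本节点的词(主题)分布
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])
print(jsd(p, q), message_weight(p, q))
```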
[LG-23] PairFlow: Closed-Form Source-Target Coupling for Few-Step Generation in Discrete Flow Models
链接: https://arxiv.org/abs/2512.20063
作者: Mingue Park,Jisung Hwang,Seungwoo Yoo,Kyeongmin Yeo,Minhyuk Sung
类目: Machine Learning (cs.LG)
*备注:
Abstract:We introduce PairFlow, a lightweight preprocessing step for training Discrete Flow Models (DFMs) to achieve few-step sampling without requiring a pretrained teacher. DFMs have recently emerged as a new class of generative models for discrete data, offering strong performance. However, they suffer from slow sampling due to their iterative nature. Existing acceleration methods largely depend on finetuning, which introduces substantial additional training overhead. PairFlow addresses this issue with a lightweight preprocessing step. Inspired by ReFlow and its extension to DFMs, we train DFMs from coupled samples of source and target distributions, without requiring any pretrained teacher. At the core of our approach is a closed-form inversion for DFMs, which allows efficient construction of paired source-target samples. Despite its extremely low cost, taking only up to 1.7% of the compute needed for full model training, PairFlow matches or even surpasses the performance of two-stage training involving finetuning. Furthermore, models trained with our framework provide stronger base models for subsequent distillation, yielding further acceleration after finetuning. Experiments on molecular data as well as binary and RGB images demonstrate the broad applicability and effectiveness of our approach.
[LG-24] DS-HGCN: A Dual-Stream Hypergraph Convolutional Network for Predicting Student Engagement via Social Contagion
链接: https://arxiv.org/abs/2512.20059
作者: Ziyang Fan,Li Tao,Yi Wang,Jingwei Qu,Ying Wang,Fei Jiang
类目: Multimedia (cs.MM); Machine Learning (cs.LG)
*备注: 14 pages, Accepted by MMM2026
Abstract:Student engagement is a critical factor influencing academic success and learning outcomes. Accurately predicting student engagement is essential for optimizing teaching strategies and providing personalized interventions. However, most approaches focus on single-dimensional feature analysis and assess engagement based on individual student factors alone. In this work, we propose a dual-stream multi-feature fusion model based on hypergraph convolutional networks (DS-HGCN), incorporating the social contagion of student engagement. DS-HGCN enables accurate prediction of student engagement states by modeling multi-dimensional features and their propagation mechanisms between students. The framework constructs a hypergraph structure to encode engagement contagion among students and captures emotional and behavioral differences and commonalities via multi-frequency signals. Furthermore, we introduce a hypergraph attention mechanism to dynamically weigh the influence of each student, accounting for individual differences in the propagation process. Extensive experiments on public benchmark datasets demonstrate that our proposed method achieves superior performance and significantly outperforms existing state-of-the-art approaches.
[LG-25] Deep Eigenspace Network and Its Application to Parametric Non-selfadjoint Eigenvalue Problems
链接: https://arxiv.org/abs/2512.20058
作者: H. Li,J. Sun,Z. Zhang
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
Abstract:We consider operator learning for efficiently solving parametric non-selfadjoint eigenvalue problems. To overcome the spectral instability and mode switching inherent in non-selfadjoint operators, we introduce a hybrid framework that learns the stable invariant eigensubspace mapping rather than individual eigenfunctions. We propose a Deep Eigenspace Network (DEN) architecture integrating Fourier Neural Operators, geometry-adaptive POD bases, and explicit banded cross-mode mixing mechanisms to capture complex spectral dependencies on unstructured meshes. We apply DEN to the parametric non-selfadjoint Steklov eigenvalue problem and provide theoretical proofs of the Lipschitz continuity of the eigensubspace with respect to the parameters. In addition, we derive error bounds for the reconstruction of the eigenspace. Numerical experiments validate DEN’s high accuracy and zero-shot generalization capabilities across different discretizations.
[LG-26] Orthogonal Activation with Implicit Group-Aware Bias Learning for Class Imbalance
链接: https://arxiv.org/abs/2512.20006
作者: Sukumar Kishanthan,Asela Hevapathige
类目: Machine Learning (cs.LG)
*备注:
Abstract:Class imbalance is a common challenge in machine learning and data mining, often leading to suboptimal performance in classifiers. While deep learning excels in feature extraction, its performance still deteriorates under imbalanced data. In this work, we propose a novel activation function, named OGAB, designed to alleviate class imbalance in deep learning classifiers. OGAB incorporates orthogonality and group-aware bias learning to enhance feature distinguishability in imbalanced scenarios without explicitly requiring label information. Our key insight is that activation functions can be used to introduce strong inductive biases that can address complex data challenges beyond traditional non-linearity. Our work demonstrates that orthogonal transformations can preserve information about minority classes by maintaining feature independence, thereby preventing the dominance of majority classes in the embedding space. Further, the proposed group-aware bias mechanism automatically identifies data clusters and adjusts embeddings to enhance class separability without the need for explicit supervision. Unlike existing approaches that address class imbalance through preprocessing data modifications or post-processing corrections, our proposed approach tackles class imbalance during the training phase at the embedding learning level, enabling direct integration with the learning process. We demonstrate the effectiveness of our solution on both real-world and synthetic imbalanced datasets, showing consistent performance improvements over both traditional and learnable activation functions.
[LG-27] Control Variate Score Matching for Diffusion Models
链接: https://arxiv.org/abs/2512.20003
作者: Khaled Kahouli,Romuald Elie,Klaus-Robert Müller,Quentin Berthet,Oliver T. Unke,Arnaud Doucet
类目: Machine Learning (cs.LG)
*备注:
Abstract:Diffusion models offer a robust framework for sampling from unnormalized probability densities, which requires accurately estimating the score of the noise-perturbed target distribution. While the standard Denoising Score Identity (DSI) relies on data samples, access to the target energy function enables an alternative formulation via the Target Score Identity (TSI). However, these estimators face a fundamental variance trade-off: DSI exhibits high variance in low-noise regimes, whereas TSI suffers from high variance at high noise levels. In this work, we reconcile these approaches by unifying both estimators within the principled framework of control variates. We introduce the Control Variate Score Identity (CVSI), deriving an optimal, time-dependent control coefficient that theoretically guarantees variance minimization across the entire noise spectrum. We demonstrate that CVSI serves as a robust, low-variance plug-in estimator that significantly enhances sample efficiency in both data-free sampler learning and inference-time diffusion sampling.
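The abstract leaves the estimator's exact form unstated; as a worked illustration of the underlying control-variate idea, the classical variance-minimizing combination of two unbiased score estimators with a time-dependent coefficient reads:

```latex
\hat{s}_{\lambda_t}(x_t)
  = \lambda_t\,\hat{s}_{\mathrm{DSI}}(x_t) + (1-\lambda_t)\,\hat{s}_{\mathrm{TSI}}(x_t),
\qquad
\lambda_t^{*}
  = \frac{\operatorname{Var}(\hat{s}_{\mathrm{TSI}}) - \operatorname{Cov}(\hat{s}_{\mathrm{DSI}},\hat{s}_{\mathrm{TSI}})}
         {\operatorname{Var}(\hat{s}_{\mathrm{DSI}}) + \operatorname{Var}(\hat{s}_{\mathrm{TSI}}) - 2\operatorname{Cov}(\hat{s}_{\mathrm{DSI}},\hat{s}_{\mathrm{TSI}})}.
```

With this textbook choice, the coefficient shrinks toward the TSI side in low-noise regimes (where DSI is high-variance) and shifts toward DSI at high noise levels (where TSI degrades), matching the trade-off the abstract describes; CVSI's actual coefficient is derived in the paper itself.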
[LG-28] LoFT-LLM: Low-Frequency Time-Series Forecasting with Large Language Models KDD2026
链接: https://arxiv.org/abs/2512.20002
作者: Jiacheng You,Jingcheng Yang,Yuhang Xie,Zhongxuan Wu,Xiucheng Li,Feng Li,Pengjie Wang,Jian Xu,Bo Zheng,Xinyang Chen
类目: Machine Learning (cs.LG)
*备注: Accepted at KDD 2026. 9 pages
Abstract:Time-series forecasting in real-world applications such as finance and energy often faces challenges due to limited training data and complex, noisy temporal dynamics. Existing deep forecasting models typically supervise predictions using full-length temporal windows, which include substantial high-frequency noise and obscure long-term trends. Moreover, auxiliary variables containing rich domain-specific information are often underutilized, especially in few-shot settings. To address these challenges, we propose LoFT-LLM, a frequency-aware forecasting pipeline that integrates low-frequency learning with semantic calibration via a large language model (LLM). First, a Patch Low-Frequency forecasting Module (PLFM) extracts stable low-frequency trends from localized spectral patches. Second, a residual learner models high-frequency variations. Finally, a fine-tuned LLM refines the predictions by incorporating auxiliary context and domain knowledge through structured natural language prompts. Extensive experiments on financial and energy datasets demonstrate that LoFT-LLM significantly outperforms strong baselines under both full-data and few-shot regimes, delivering superior accuracy, robustness, and interpretability.
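The details of PLFM are not given in the abstract, but its first stage can be sketched as a patch-wise low-pass filter. The following minimal Python sketch splits a series into patches, keeps only the lowest rFFT bins of each patch, and treats the remainder as the residual fed to the high-frequency learner; the patch length and cutoff fraction are illustrative assumptions.

```python
import numpy as np

def lowpass_patch_trend(series, patch_len=32, keep_frac=0.2):
    """Illustrative low-frequency trend extraction: per patch, keep only the
    lowest `keep_frac` fraction of rFFT coefficients and inverse-transform."""
    n = len(series)
    trend = np.empty(n)
    for start in range(0, n, patch_len):
        patch = series[start:start + patch_len]
        spec = np.fft.rfft(patch)
        keep = max(1, int(len(spec) * keep_frac))
        spec[keep:] = 0.0  # zero out high-frequency bins
        trend[start:start + len(patch)] = np.fft.irfft(spec, n=len(patch))
    return trend

# Toy series: linear trend + seasonal cycle + noise.
t = np.arange(256)
y = 0.05 * t + np.sin(2 * np.pi * t / 64)
y += 0.3 * np.random.default_rng(1).normal(size=256)

trend = lowpass_patch_trend(y)
residual = y - trend  # would be handed to the residual learner
print(trend[:5].round(3), residual.std().round(3))
```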
[LG-29] Bloom Filter Encoding for Machine Learning
链接: https://arxiv.org/abs/2512.19991
作者: John Cartmell,Mihaela Cardei,Ionut Cardei
类目: Machine Learning (cs.LG)
*备注: 14 pages, 7 figures
Abstract:We present a method that uses the Bloom filter transform to preprocess data for machine learning. Each sample is encoded into a compact, privacy-preserving bit array. This reduces memory use and protects the original data while keeping enough structure for accurate classification. We test the method on six datasets: SMS Spam Collection, ECG200, Adult 50K, CDC Diabetes, MNIST, and Fashion MNIST. Four classifiers are used: Extreme Gradient Boosting, Deep Neural Networks, Convolutional Neural Networks, and Logistic Regression. Results show that models trained on Bloom filter encodings achieve accuracy similar to that of models trained on raw data or other transforms. At the same time, the method provides memory savings while enhancing privacy. These results suggest that the Bloom filter transform is an efficient preprocessing approach for diverse machine learning tasks.
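As a concrete illustration of the encoding step, the sketch below maps each sample (a bag of tokens) into an m-bit Bloom filter using k salted hashes and trains an off-the-shelf classifier on the bit arrays. The filter size, hash count, and salting scheme are illustrative assumptions, not the paper's exact configuration.

```python
import hashlib
import numpy as np
from sklearn.linear_model import LogisticRegression

def bloom_encode(tokens, m=256, k=4):
    """Encode a sample (a set of string tokens) into an m-bit Bloom filter
    using k salted SHA-256 hashes. m and k are illustrative choices."""
    bits = np.zeros(m, dtype=np.float32)
    for tok in tokens:
        for seed in range(k):
            h = hashlib.sha256(f"{seed}:{tok}".encode()).digest()
            bits[int.from_bytes(h[:8], "big") % m] = 1.0
    return bits

# Toy spam-style dataset: each sample is a bag of word tokens.
ham = [["meeting", "at", "noon"], ["see", "you", "tomorrow"]] * 20
spam = [["win", "free", "prize"], ["claim", "cash", "now"]] * 20
X = np.stack([bloom_encode(s) for s in ham + spam])
y = np.array([0] * len(ham) + [1] * len(spam))

clf = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", clf.score(X, y))
```

Because the bit array is a lossy, one-way transform, the original tokens cannot be read back from it, which is where the privacy and memory benefits come from.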
[LG-30] Spatio-Temporal Graph Neural Networks for Dairy Farm Sustainability Forecasting and Counterfactual Policy Analysis
链接: https://arxiv.org/abs/2512.19970
作者: Surya Jayakumar,Kieran Sullivan,John McLaughlin,Christine O’Meara,Indrakshi Dey
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:This study introduces a novel data-driven framework and the first-ever county-scale application of Spatio-Temporal Graph Neural Networks (STGNN) to forecast composite sustainability indices from herd-level operational records. The methodology employs a novel, end-to-end pipeline utilizing a Variational Autoencoder (VAE) to augment Irish Cattle Breeding Federation (ICBF) datasets, preserving joint distributions while mitigating sparsity. A first-ever pillar-based scoring formulation is derived via Principal Component Analysis, identifying four pillars (Reproductive Efficiency, Genetic Management, Herd Health, and Herd Management) used to construct weighted composite indices. These indices are modelled using a novel STGNN architecture that explicitly encodes geographic dependencies and non-linear temporal dynamics to generate multi-year forecasts for 2026-2030.
[LG-31] The Seismic Wavefield Common Task Framework
链接: https://arxiv.org/abs/2512.19927
作者: Alexey Yermakov,Yue Zhao,Marine Denolle,Yiyu Ni,Philippe M. Wyder,Judah Goldfeder,Stefano Riva,Jan Williams,David Zoro,Amy Sara Rude,Matteo Tomasetto,Joe Germany,Joseph Bakarji,Georg Maierhofer,Miles Cranmer,J. Nathan Kutz
类目: Machine Learning (cs.LG)
*备注: 35 pages, 7 figures
Abstract:Seismology faces fundamental challenges in state forecasting and reconstruction (e.g., earthquake early warning and ground motion prediction) and managing the parametric variability of source locations, mechanisms, and Earth models (e.g., subsurface structure and topography effects). Addressing these with simulations is hindered by their massive scale, both in synthetic data volumes and numerical complexity, while real-data efforts are constrained by models that inadequately reflect the Earth’s complexity and by sparse sensor measurements from the field. Recent machine learning (ML) efforts offer promise, but progress is obscured by a lack of proper characterization, fair reporting, and rigorous comparisons. To address this, we introduce a Common Task Framework (CTF) for ML for seismic wavefields, starting with three distinct wavefield datasets. Our CTF features a curated set of datasets at various scales (global, crustal, and local) and task-specific metrics spanning forecasting, reconstruction, and generalization under realistic constraints such as noise and limited data. Inspired by CTFs in fields like natural language processing, this framework provides a structured and rigorous foundation for head-to-head algorithm evaluation. We illustrate the evaluation procedure with scores reported for two of the datasets, showcasing the performance of various methods and foundation models for reconstructing seismic wavefields from both simulated and real-world sensor measurements. The CTF scores reveal the strengths, limitations, and suitability for specific problem classes. Our vision is to replace ad hoc comparisons with standardized evaluations on hidden test sets, raising the bar for rigor and reproducibility in scientific ML.
[LG-32] Detecting cyberbullying in Spanish texts through deep learning techniques
链接: https://arxiv.org/abs/2512.19899
作者: Paúl Cumba-Armijos,Diego Riofrío-Luzcando,Verónica Rodríguez-Arboleda,Joe Carrión-Jumbo
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Preprint (Author’s Original Manuscript, AOM). Published version: this https URL
Abstract:Recently collected data suggest that it is possible to automatically detect events that may negatively affect the most vulnerable members of our society through any communication technology, such as social networks or messaging applications. This research consolidates and prepares a corpus of Spanish bullying expressions taken from Twitter in order to use it as input for training a convolutional neural network through deep learning techniques. As a result of this training, a predictive model was created that can identify Spanish cyberbullying expressions such as insults, racism, homophobic attacks, and so on.
[LG-33] Guardrailed Uplift Targeting: A Causal Optimization Playbook for Marketing Strategy
链接: https://arxiv.org/abs/2512.19805
作者: Deepit Sapru
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:This paper introduces a marketing decision framework that converts heterogeneous-treatment uplift into constrained targeting strategies to maximize revenue and retention while honoring business guardrails. The approach estimates Conditional Average Treatment Effects (CATE) with uplift learners and then solves a constrained allocation to decide who to target and which offer to deploy under limits such as budget or acceptable sales deterioration. Applied to retention messaging, event rewards, and spend-threshold assignment, the framework consistently outperforms propensity and static baselines in offline evaluations using uplift AUC, Inverse Propensity Scoring (IPS), and Self-Normalized IPS (SNIPS). A production-scale online A/B test further validates strategic lift on revenue and completion while preserving customer-experience constraints. The result is a reusable playbook for marketers to operationalize causal targeting at scale, set guardrails, and align campaigns with strategic KPIs.
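The abstract describes a two-step recipe: estimate CATE with uplift learners, then solve a constrained allocation. A minimal sketch of that pipeline, using a T-learner on synthetic data and a simple top-k budget constraint in place of the paper's full constrained solver (the learners, data, and budget rule are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))
treat = rng.integers(0, 2, n)
# Synthetic outcome: heterogeneous uplift driven by the first feature.
y = X[:, 0] * treat + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

# T-learner CATE: fit one outcome model per arm, take the difference.
m1 = GradientBoostingRegressor().fit(X[treat == 1], y[treat == 1])
m0 = GradientBoostingRegressor().fit(X[treat == 0], y[treat == 0])
cate = m1.predict(X) - m0.predict(X)

# Guardrailed allocation: target the highest-uplift customers subject to a
# budget of B contacts (a stand-in for the paper's constrained optimization).
B = 300
targeted = np.argsort(-cate)[:B]
print(f"mean predicted uplift of targeted set: {cate[targeted].mean():.3f}")
```

Further guardrails (e.g., a cap on acceptable sales deterioration) would enter as additional constraints on this allocation step rather than as changes to the CATE model.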
[LG-34] Reduced Order Modeling for Tsunami Forecasting with Bayesian Hierarchical Pooling
链接: https://arxiv.org/abs/2512.19804
作者: Shane X. Coffing,John Tipton,Arvind T. Mohan,Darren Engwirda
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:Reduced order models (ROMs) can represent spatiotemporal processes in significantly fewer dimensions and can be solved many orders of magnitude faster than their governing partial differential equations (PDEs). For example, a proper orthogonal decomposition produces a ROM that is a small linear combination of fixed features and weights, but that is constrained to the given process it models. In this work, we explore a new type of ROM that is not constrained to fixed weights, based on neural Galerkin projections: an initial value problem that encodes the physics of the governing PDEs, calibrated via neural networks to accurately model the trajectory of these weights. Then, using a statistical hierarchical pooling technique to learn a distribution on the initial values of the temporal weights, we can create new, statistically interpretable and physically justified weights that generalize to many similar problems. When recombined with the spatial features, we form a complete physics surrogate, called a randPROM, for generating simulations that are consistent in distribution with a neighborhood of initial conditions close to those used to construct the ROM. We apply the randPROM technique to the study of tsunamis, which are unpredictable, catastrophic, and highly detailed non-linear problems, modeling both a synthetic case of tsunamis near Fiji and the real-world Tohoku 2011 disaster. We demonstrate that randPROMs may enable us to significantly reduce the number of simulations needed to generate a statistically calibrated and physically defensible prediction model for the arrival time and height of tsunami waves.
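Everything in the randPROM pipeline builds on the classical POD decomposition of a snapshot matrix into fixed spatial modes and temporal weights. The sketch below shows only that foundation on toy data; the neural Galerkin dynamics and the hierarchical pooling over initial weights described in the abstract are not implemented here.

```python
import numpy as np

# Snapshot matrix: columns are spatial states at successive times (toy data).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
snapshots = np.stack(
    [np.sin(2 * np.pi * (x - 0.01 * t)) + 0.1 * rng.normal(size=x.size)
     for t in range(100)], axis=1)

# POD basis via truncated SVD: fixed spatial features Phi, temporal weights a(t).
U, S, Vt = np.linalg.svd(snapshots, full_matrices=False)
r = 5                   # number of retained modes (illustrative choice)
Phi = U[:, :r]          # spatial modes
a = Phi.T @ snapshots   # temporal weights, shape (r, n_times)

# A randPROM-style model would (i) learn the dynamics of a(t) with a neural
# Galerkin projection and (ii) place a learned distribution on a(0); here we
# only verify the low-rank reconstruction that those steps rely on.
recon = Phi @ a
rel_err = np.linalg.norm(recon - snapshots) / np.linalg.norm(snapshots)
print(f"relative reconstruction error with r={r}: {rel_err:.3e}")
```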
[LG-35] Learning to Design City-scale Transit Routes
链接: https://arxiv.org/abs/2512.19767
作者: Bibek Poudel,Weizi Li
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:
Abstract:Designing efficient transit route networks is an NP-hard problem with exponentially large solution spaces that traditionally relies on manual planning processes. We present an end-to-end reinforcement learning (RL) framework based on graph attention networks for sequential transit network construction. To address the long-horizon credit assignment challenge, we introduce a two-level reward structure combining incremental topological feedback with simulation-based terminal rewards. We evaluate our approach on a new real-world dataset from Bloomington, Indiana with topologically accurate road networks, census-derived demand, and existing transit routes. Our learned policies substantially outperform existing designs and traditional heuristics across two initialization schemes and two modal-split scenarios. Under high transit adoption with transit center initialization, our approach achieves 25.6% higher service rates, 30.9% shorter wait times, and 21.0% better bus utilization compared to the real-world network. Under mixed-mode conditions with random initialization, it delivers 68.8% higher route efficiency than demand coverage heuristics and 5.9% lower travel times than shortest path construction. These results demonstrate that end-to-end RL can design transit networks that substantially outperform both human-designed systems and hand-crafted heuristics on realistic city-scale benchmarks.
[LG-36] DeepBridge: A Unified and Production-Ready Framework for Multi-Dimensional Machine Learning Validation
链接: https://arxiv.org/abs/2512.19744
作者: Gustavo Coelho Haase,Paulo Henrique Dourado da Silva
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: 8 pages and 4 tables
Abstract:We present DeepBridge, an 80K-line Python library that unifies multi-dimensional validation, automatic compliance verification, knowledge distillation, and synthetic data generation. DeepBridge offers: (i) 5 validation suites (fairness with 15 metrics, robustness with weakness detection, uncertainty via conformal prediction, resilience with 5 drift types, hyperparameter sensitivity), (ii) automatic EEOC/ECOA/GDPR verification, (iii) a multi-format reporting system (interactive/static HTML, PDF, JSON), (iv) the HPM-KD framework for knowledge distillation with meta-learning, and (v) scalable synthetic data generation via Dask. Through 6 case studies (credit scoring, hiring, healthcare, mortgage, insurance, fraud) we demonstrate that DeepBridge reduces validation time by 89% (17 min vs. 150 min with fragmented tools), automatically detects fairness violations with complete coverage (10/10 features vs. 2/10 from existing tools), and generates audit-ready reports in minutes. HPM-KD demonstrates consistent superiority across compression ratios 2.3–7x (CIFAR100): +1.00–2.04pp vs. Direct Training (p < 0.05), confirming that knowledge distillation is effective at larger teacher-student gaps. A usability study with 20 participants shows a SUS score of 87.5 (top 10%, “excellent”), a 95% success rate, and low cognitive load (NASA-TLX 28/100). DeepBridge is open-source under the MIT license at this https URL, with complete documentation at this https URL
[LG-37] On-device Large Multi-modal Agent for Human Activity Recognition
链接: https://arxiv.org/abs/2512.19742
作者: Md Shakhrul Iman Siam,Ishtiaque Ahmed Showmik,Guanqun Song,Ting Zhu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Human Activity Recognition (HAR) has been an active area of research, with applications ranging from healthcare to smart environments. The recent advancements in Large Language Models (LLMs) have opened new possibilities to leverage their capabilities in HAR, enabling not just activity classification but also interpretability and human-like interaction. In this paper, we present a Large Multi-Modal Agent designed for HAR, which integrates the power of LLMs to enhance both performance and user engagement. The proposed framework not only delivers activity classification but also bridges the gap between technical outputs and user-friendly insights through its reasoning and question-answering capabilities. We conduct extensive evaluations using widely adopted HAR datasets, including HHAR, Shoaib, Motionsense to assess the performance of our framework. The results demonstrate that our model achieves high classification accuracy comparable to state-of-the-art methods while significantly improving interpretability through its reasoning and QA capabilities.
[LG-38] EdgeFlex-Transformer: Transformer Inference for Edge Devices
链接: https://arxiv.org/abs/2512.19741
作者: Shoaib Mohammad,Guanqun Song,Ting Zhu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Deploying large-scale transformer models on edge devices presents significant challenges due to strict constraints on memory, compute, and latency. In this work, we propose a lightweight yet effective multi-stage optimization pipeline designed to compress and accelerate Vision Transformers (ViTs) for deployment in resource-constrained environments. Our methodology combines activation profiling, memory-aware pruning, selective mixed-precision execution, and activation-aware quantization (AWQ) to reduce the model’s memory footprint without requiring costly retraining or task-specific fine-tuning. Starting from a ViT-Huge backbone with 632 million parameters, we first identify low-importance channels using activation statistics collected via forward hooks, followed by structured pruning to shrink the MLP layers under a target memory budget. We further apply FP16 conversion to selected components and leverage AWQ to quantize the remaining model weights and activations to INT8 with minimal accuracy degradation. Our experiments on CIFAR-10 demonstrate that the fully optimized model achieves a 76% reduction in peak memory usage and over 6x lower latency, while retaining or even improving accuracy compared to the original FP32 baseline. This framework offers a practical path toward efficient transformer inference on edge platforms, and opens future avenues for integrating dynamic sparsity and Mixture-of-Experts (MoE) architectures to further scale performance across diverse tasks.
[LG-39] Asia Cup 2025: A Structured T20 Match-Level Dataset and Exploratory Analysis for Cricket Analytics
链接: https://arxiv.org/abs/2512.19740
作者: Kousar Raza,Faizan Ali
类目: Machine Learning (cs.LG); Databases (cs.DB); Other Statistics (stat.OT)
*备注: Dataset available via Zenodo: this https URL. Source code and analysis scripts are publicly available at: this https URL
Abstract:This paper presents a structured and comprehensive dataset corresponding to the 2025 Asia Cup T20 cricket tournament, designed to facilitate data-driven research in sports analytics. The dataset comprises records from all 19 matches of the tournament and includes 61 variables covering team scores, wickets, powerplay statistics, boundary counts, toss decisions, venues, and player-specific highlights. To demonstrate its analytical value, we conduct an exploratory data analysis focusing on team performance indicators, boundary distributions, and scoring patterns. The dataset is publicly released through Zenodo under a CC-BY 4.0 license to support reproducibility and further research in cricket analytics, predictive modeling, and strategic decision-making. This work contributes an open, machine-readable benchmark dataset for advancing cricket analytics research.
[LG-40] OASI: Objective-Aware Surrogate Initialization for Multi-Objective Bayesian Optimization in TinyML Keyword Spotting
链接: https://arxiv.org/abs/2512.19739
作者: Soumen Garai,Suman Samui
类目: Machine Learning (cs.LG); Sound (cs.SD)
*备注: Baseline version
Abstract:Voice assistants utilize Keyword Spotting (KWS) to enable efficient, privacy-friendly activation. However, realizing accurate KWS models on ultra-low-power TinyML devices (often with less than 2 MB of flash memory) necessitates a delicate balance between accuracy and strict resource constraints. Multi-objective Bayesian Optimization (MOBO) is an ideal candidate for managing such a trade-off but is highly initialization-dependent, especially in the budgeted black-box setting. Existing methods typically fall back to naive, ad-hoc sampling routines (e.g., Latin Hypercube Sampling (LHS), Sobol sequences, or Random search) that are neither adapted to the Pareto front nor subjected to rigorous statistical comparison. To address this, we propose Objective-Aware Surrogate Initialization (OASI), a novel initialization strategy that leverages Multi-Objective Simulated Annealing (MOSA) to generate a seed Pareto set of high-performing and diverse configurations that explicitly balance accuracy and model size. Evaluated in a TinyML KWS setting, OASI outperforms LHS, Sobol, and Random initialization, achieving the highest hypervolume (0.0627) and the lowest generational distance (0.0) across multiple runs, with only a modest increase in computation time (1934 s vs. \sim 1500 s). A non-parametric statistical analysis using the Kruskal-Wallis test (H = 5.40, p = 0.144, \eta^2 = 0.0007) and Dunn’s post-hoc test confirms OASI’s superior consistency despite the non-significant overall difference with respect to the \alpha = 0.05 threshold.
[LG-41] OpComm: A Reinforcement Learning Framework for Adaptive Buffer Control in Warehouse Volume Forecasting
链接: https://arxiv.org/abs/2512.19738
作者: Wilson Fung,Lu Guo,Drake Hilliard,Alessandro Casadei,Raj Ratan,Sreyoshi Bhaduri,Adi Surve,Nikhil Agarwal,Rohit Malshe,Pavan Mullapudi,Hungjen Wang,Saurabh Doodhwala,Ankush Pole,Arkajit Rakshit
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate forecasting of package volumes at delivery stations is critical for last-mile logistics, where errors lead to inefficient resource allocation, higher costs, and delivery delays. We propose OpComm, a forecasting and decision-support framework that combines supervised learning with reinforcement learning-based buffer control and a generative AI-driven communication module. A LightGBM regression model generates station-level demand forecasts, which serve as context for a Proximal Policy Optimization (PPO) agent that selects buffer levels from a discrete action set. The reward function penalizes under-buffering more heavily than over-buffering, reflecting real-world trade-offs between unmet demand risks and resource inefficiency. Station outcomes are fed back through a Monte Carlo update mechanism, enabling continual policy adaptation. To enhance interpretability, a generative AI layer produces executive-level summaries and scenario analyses grounded in SHAP-based feature attributions. Across 400+ stations, OpComm reduced Weighted Absolute Percentage Error (WAPE) by 21.65% compared to manual forecasts, while lowering under-buffering incidents and improving transparency for decision-makers. This work shows how contextual reinforcement learning, coupled with predictive modeling, can address operational forecasting challenges and bridge statistical rigor with practical decision-making in high-stakes logistics environments.
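The exact penalty coefficients are not reported in the abstract, but the shape of the reward is clear: under-buffering must cost more than over-buffering. A minimal sketch of such an asymmetric reward, with illustrative penalty values:

```python
def buffer_reward(demand, forecast, buffer, c_under=3.0, c_over=1.0):
    """Illustrative asymmetric reward for the PPO agent: under-buffering
    (capacity below realized demand) is penalized c_under/c_over times more
    heavily than over-buffering. The exact penalties are assumptions."""
    capacity = forecast + buffer
    if demand > capacity:
        return -c_under * (demand - capacity)   # unmet demand
    return -c_over * (capacity - demand)        # idle resources

# The agent picks `buffer` from a discrete action set given the forecast;
# here demand overshoots the forecast by 80 packages.
for buffer in (0, 50, 100, 150):
    print(buffer, round(buffer_reward(demand=1080, forecast=1000, buffer=buffer), 1))
```

Under this reward, a buffer of 100 (capacity 1100) scores better than a buffer of 50 (capacity 1050) even though both "waste" resources in opposite directions, which is exactly the asymmetry the framework exploits.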
[LG-42] Case Prompting to Mitigate Large Language Model Bias for ICU Mortality Prediction
链接: https://arxiv.org/abs/2512.19735
作者: Gangxiong Zhang,Yongchao Long
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate mortality risk prediction for intensive care unit (ICU) patients is essential for clinical decision-making. Although large language models (LLMs) show promise in predicting outcomes from structured medical data, their predictions may exhibit demographic biases related to sex, age, and race, limiting their trustworthy use in clinical practice. Existing debiasing methods often reduce predictive performance, making it difficult to jointly optimize fairness and accuracy. In this study, we systematically examine bias in LLM-based ICU mortality prediction and propose a training-free, clinically adaptive prompting framework to simultaneously improve fairness and performance. We first develop a multi-dimensional bias assessment scheme for comprehensive model diagnosis. Building on this analysis, we introduce CAse Prompting (CAP), a novel prompting framework that integrates conventional debiasing prompts with case-based reasoning. CAP guides the model to learn from similar historical misprediction cases and their correct outcomes, enabling correction of biased reasoning patterns. Experiments on the MIMIC-IV dataset show that CAP substantially improves both predictive accuracy and fairness. CAP increases AUROC from 0.806 to 0.873 and AUPRC from 0.497 to 0.694, while reducing sex- and race-related disparities by over 90%. Feature reliance analysis further indicates highly consistent attention patterns across demographic groups, with similarity scores exceeding 0.98. These results demonstrate that LLMs exhibit measurable bias in ICU mortality prediction, and that a carefully designed prompting framework can effectively co-optimize fairness and performance without retraining, offering a transferable paradigm for equitable clinical decision support.
[LG-43] The Deleuzian Representation Hypothesis
链接: https://arxiv.org/abs/2512.19734
作者: Clément Cornet,Romaric Besançon,Hervé Le Borgne
类目: Machine Learning (cs.LG)
*备注:
Abstract:We propose an alternative to sparse autoencoders (SAEs) as a simple and effective unsupervised method for extracting interpretable concepts from neural networks. The core idea is to cluster differences in activations, which we formally justify within a discriminant analysis framework. To enhance the diversity of extracted concepts, we refine the approach by weighting the clustering using the skewness of activations. The method aligns with Deleuze’s modern view of concepts as differences. We evaluate the approach across five models and three modalities (vision, language, and audio), measuring concept quality, diversity, and consistency. Our results show that the proposed method achieves concept quality surpassing prior unsupervised SAE variants while approaching supervised baselines, and that the extracted concepts enable steering of a model’s inner representations, demonstrating their causal influence on downstream behavior.
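The core recipe (cluster differences in activations, weighted by skewness) can be sketched directly. In this minimal Python illustration the pairing of samples, the skewness-based weighting formula, and the cluster count are all assumptions; the paper's exact construction may differ.

```python
import numpy as np
from scipy.stats import skew
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
acts = rng.normal(size=(500, 64))   # activations for 500 inputs
acts[:250, :8] += 2.0               # plant a latent "concept" direction

# Core idea: cluster *differences* of activations rather than activations.
idx = rng.permutation(len(acts))
diffs = acts - acts[idx]            # random pairwise activation differences

# Skewness-based weighting of dimensions to diversify extracted concepts
# (the exact weighting used in the paper is not specified in the abstract).
w = 1.0 + np.abs(skew(acts, axis=0))
concepts = KMeans(n_clusters=10, n_init=10, random_state=0).fit(diffs * w)
print("cluster sizes:", np.bincount(concepts.labels_))
```

Each cluster centroid then serves as a candidate concept direction, playing the role that dictionary atoms play in a sparse autoencoder.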
[LG-44] Leakage-Aware Bandgap Prediction on the JARVIS-DFT Dataset: A Phase-Wise Feature Analysis
链接: https://arxiv.org/abs/2512.19732
作者: Gaurav Kumar Sharma
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: 21 pages, 11 figures
Abstract:In this study, we perform a systematic analysis of the JARVIS-DFT bandgap dataset and identify and remove descriptors that may inadvertently encode band-structure information, such as effective masses. This process yields a curated, leakage-controlled subset of 2280 materials. Using this dataset, a three-phase modeling framework is implemented that incrementally incorporates basic physical descriptors, engineered features, and compositional attributes. The results show that tree-based models achieve R2 values of approximately 0.88 to 0.90 across all phases, indicating that expanding the descriptor space does not substantially improve predictive accuracy when leakage is controlled. SHAP analysis consistently identifies the dielectric tensor components as the dominant contributors. This work provides a curated dataset and baseline performance metrics for future leakage-aware bandgap prediction studies.
[LG-45] ArcGen: Generalizing Neural Backdoor Detection Across Diverse Architectures
链接: https://arxiv.org/abs/2512.19730
作者: Zhonghao Yang,Cheng Luo,Daojing He,Yiming Li,Yu Li
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 16 pages, 8 figures. This article was accepted by IEEE Transactions on Information Forensics and Security. DOI: https://doi.org/10.1109/TIFS.2025.3610254
Abstract:Backdoor attacks pose a significant threat to the security and reliability of deep learning models. To mitigate such attacks, one promising approach is to learn to extract features from the target model and use these features for backdoor detection. However, we discover that existing learning-based neural backdoor detection methods do not generalize well to new architectures not seen during the learning phase. In this paper, we analyze the root cause of this issue and propose a novel black-box neural backdoor detection method called ArcGen. Our method aims to obtain architecture-invariant model features, i.e., aligned features, for effective backdoor detection. Specifically, in contrast to existing methods directly using model outputs as model features, we introduce an additional alignment layer in the feature extraction function to further process these features. This reduces the direct influence of architecture information on the features. Then, we design two alignment losses to train the feature extraction function. These losses explicitly require that features from models with similar backdoor behaviors but different architectures are aligned at both the distribution and sample levels. With these techniques, our method demonstrates up to 42.5% improvements in detection performance (e.g., AUC) on unseen model architectures. This is based on a large-scale evaluation involving 16,896 models trained on diverse datasets, subjected to various backdoor attacks, and utilizing different model architectures. Our code is available at this https URL.
[LG-46] Hard Negative Sample-Augmented DPO Post-Training for Small Language Models
链接: https://arxiv.org/abs/2512.19728
作者: Haocheng Lu,Minjun Zhu,Henry Yu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models (LLMs) continue to struggle with mathematical reasoning, and common post-training pipelines often reduce each generated solution to a binary outcome: correct or incorrect. This perspective is limiting in practice, as failures in chain-of-thought (CoT) reasoning are frequently structured; solutions may appear convincing while containing subtle logical, algebraic, or numerical flaws. Meanwhile, reinforcement learning from human feedback (RLHF) variants that rely on large reward models or LLM-as-a-judge signals are often expensive, difficult to scale, and unstable to iterate. We propose a lightweight and pragmatic post-training pipeline that targets such structured errors under realistic compute budgets. Starting from supervised fine-tuning (SFT) on MetaMathQA-style CoT data, we introduce a compact MathVerifier that decomposes a candidate solution into a six-dimensional error profile and aggregates it into interpretable wrongness and absurdity scores. These verifier signals serve two roles: (i) mining hard negatives that are near-correct yet structurally flawed, and (ii) defining per-sample importance weights that emphasize the most informative preference pairs. We integrate both into an offline Direct Preference Optimization (DPO) objective via a verifier-guided weighted formulation. Experiments on a 1.5B-parameter Qwen2.5 model show that verifier-guided, weighted DPO yields more targeted improvements than vanilla SFT and unweighted DPO, particularly on problems where solutions are numerically close to correct but logically inconsistent, while avoiding the overhead of training large reward models or relying on external judges.
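The weighted DPO objective described here has a compact form: the standard DPO log-sigmoid term is scaled by a per-pair importance weight derived from the verifier. A minimal PyTorch sketch, where the plain multiplicative weighting and the toy log-probabilities are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def weighted_dpo_loss(logp_w_policy, logp_l_policy,
                      logp_w_ref, logp_l_ref,
                      weights, beta=0.1):
    """Verifier-weighted offline DPO: the DPO logit (policy-vs-reference
    log-ratio of chosen minus rejected) is scaled by a per-pair weight.
    The weighting scheme (plain multiplication) is an assumption."""
    logits = (logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref)
    return -(weights * F.logsigmoid(beta * logits)).mean()

# Toy batch of 4 preference pairs with verifier-derived weights.
lw_p = torch.tensor([-10.0, -12.0, -9.0, -11.0])   # chosen, policy
ll_p = torch.tensor([-11.0, -11.5, -9.5, -13.0])   # rejected, policy
lw_r = torch.tensor([-10.5, -12.0, -9.2, -11.3])   # chosen, reference
ll_r = torch.tensor([-10.8, -11.6, -9.4, -12.5])   # rejected, reference
w = torch.tensor([1.5, 0.7, 1.0, 2.0])             # e.g., higher on hard negatives
print(weighted_dpo_loss(lw_p, ll_p, lw_r, ll_r, w))
```

Setting larger weights on mined hard negatives concentrates the gradient on near-correct but structurally flawed solutions, which is the behavior the paper targets.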
[LG-47] Trend Extrapolation for Technology Forecasting: Leveraging LSTM Neural Networks for Trend Analysis of Space Exploration Vessels
链接: https://arxiv.org/abs/2512.19727
作者: Peng-Hung Tsai,Daniel Berleant
类目: Machine Learning (cs.LG)
*备注:
Abstract:Forecasting technological advancement in complex domains such as space exploration presents significant challenges due to the intricate interaction of technical, economic, and policy-related factors. The field of technology forecasting has long relied on quantitative trend extrapolation techniques, such as growth curves (e.g., Moore’s law) and time series models, to project technological progress. To assess the current state of these methods, we conducted an updated systematic literature review (SLR) that incorporates recent advances. This review highlights a growing trend toward machine learning-based hybrid models. Motivated by this review, we developed a forecasting model that combines long short-term memory (LSTM) neural networks with an augmentation of Moore’s law to predict spacecraft lifetimes. Operational lifetime is an important engineering characteristic of spacecraft and a potential proxy for technological progress in space exploration. Lifetimes were modeled as depending on launch date and additional predictors. Our modeling analysis introduces a novel advance in the recently introduced Start Time End Time Integration (STETI) approach. STETI addresses a critical right censoring problem known to bias lifetime analyses: the more recent the launch dates, the shorter the lifetimes of the spacecraft that have failed and can thus contribute lifetime data. Longer-lived spacecraft are still operating and therefore do not contribute data. This systematically distorts putative lifetime versus launch date curves by biasing lifetime estimates for recent launch dates downward. STETI mitigates this distortion by interconverting between expressing lifetimes as functions of launch time and modeling them as functions of failure time. The results provide insights relevant to space mission planning and policy decision-making.
[LG-48] Out-of-Distribution Detection for Continual Learning: Design Principles and Benchmarking
链接: https://arxiv.org/abs/2512.19725
作者: Srishti Gupta,Riccardo Balia,Daniele Angioni,Fabio Brau,Maura Pintor,Ambra Demontis,Alessandro Sebastian,Salvatore Mario Carta,Fabio Roli,Battista Biggio
类目: Machine Learning (cs.LG)
*备注: International Journal of Computer Vision
Abstract:Recent years have witnessed significant progress in the development of machine learning models across a wide range of fields, fueled by increased computational resources, large-scale datasets, and the rise of deep learning architectures. From malware detection to enabling autonomous navigation, modern machine learning systems have demonstrated remarkable capabilities. However, as these models are deployed in ever-changing real-world scenarios, their ability to remain reliable and adaptive over time becomes increasingly important. For example, in the real world, new malware families are continuously developed, whereas autonomous driving cars are employed in many different cities and weather conditions. Models trained in fixed settings can not respond effectively to novel conditions encountered post-deployment. In fact, most machine learning models are still developed under the assumption that training and test data are independent and identically distributed (i.i.d.), i.e., sampled from the same underlying (unknown) distribution. While this assumption simplifies model development and evaluation, it does not hold in many real-world applications, where data changes over time and unexpected inputs frequently occur. Retraining models from scratch whenever new data appears is computationally expensive, time-consuming, and impractical in resource-constrained environments. These limitations underscore the need for Continual Learning (CL), which enables models to incrementally learn from evolving data streams without forgetting past knowledge, and Out-of-Distribution (OOD) detection, which allows systems to identify and respond to novel or anomalous inputs. Jointly addressing both challenges is critical to developing robust, efficient, and adaptive AI systems.
[LG-49] End-to-End Data Quality-Driven Framework for Machine Learning in Production Environment
链接: https://arxiv.org/abs/2512.19723
作者: Firas Bayram,Bestoun S. Ahmed,Erik Hallin
类目: Machine Learning (cs.LG)
*备注: 38 Pages
Abstract:This paper introduces a novel end-to-end framework that efficiently integrates data quality assessment with machine learning (ML) model operations in real-time production environments. While existing approaches treat data quality assessment and ML systems as isolated processes, our framework addresses the critical gap between theoretical methods and practical implementation by combining dynamic drift detection, adaptive data quality metrics, and MLOps into a cohesive, lightweight system. The key innovation lies in its operational efficiency, enabling real-time, quality-driven ML decision-making with minimal computational overhead. We validate the framework in a steel manufacturing company’s Electroslag Remelting (ESR) vacuum pumping process, demonstrating a 12% improvement in model performance (R2 = 94%) and a fourfold reduction in prediction latency. By exploring the impact of data quality acceptability thresholds, we provide actionable insights into balancing data quality standards and predictive performance in industrial applications. This framework represents a significant advancement in MLOps, offering a robust solution for time-sensitive, data-driven decision-making in dynamic industrial environments.
[LG-50] Node-Level Financial Optimization in Demand Forecasting Through Dynamic Cost Asymmetry and Feedback Mechanism
链接: https://arxiv.org/abs/2512.19722
作者: Alessandro Casadei,Clemens Grupp,Sreyoshi Bhaduri,Lu Guo,Wilson Fung,Rohit Malshe,Raj Ratan,Ankush Pole,Arkajit Rakshit
类目: Machine Learning (cs.LG)
*备注: Accepted to an Amazon internal conference (AFSS); now being shared with the general public. This submission replaces a previous submission with the same title: the main paper is now submitted, whereas previously we submitted a summary
Abstract:This work introduces a methodology to adjust forecasts based on node-specific cost function asymmetry. The proposed model generates savings by dynamically incorporating the cost asymmetry into the forecasting error probability distribution to favor the least expensive scenario. Savings are calculated, and a self-regulation mechanism modulates the adjustment magnitude based on the observed savings, enabling the model to adapt to station-specific conditions and unmodeled factors such as calibration errors or shifting macroeconomic dynamics. Finally, empirical results demonstrate the model’s ability to achieve $5.1M in annual savings.
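The abstract does not give the adjustment rule, but the classical newsvendor result is a useful reference point for how cost asymmetry should shift a forecast: with per-unit under-forecast cost $c_u$ and over-forecast cost $c_o$, the expected-cost-minimizing point on the forecast-error distribution $F$ is the quantile

```latex
q^{*} = F^{-1}\!\left(\frac{c_u}{c_u + c_o}\right).
```

For example, a node where under-forecasting is three times as costly as over-forecasting ($c_u = 3c_o$) would be served the 75th-percentile forecast rather than the median; the paper's feedback mechanism can be read as adapting the effective asymmetry per node from observed savings, rather than fixing it in advance.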
[LG-51] Sign-Aware Multistate Jaccard Kernels and Geometry for Real and Complex-Valued Signals
链接: https://arxiv.org/abs/2512.19721
作者: Vineet Yadav
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:
Abstract:We introduce a sign-aware, multistate Jaccard/Tanimoto framework that extends overlap-based distances from nonnegative vectors and measures to arbitrary real- and complex-valued signals while retaining bounded metric and positive-semidefinite kernel structure. Formally, the construction is a set- and measure-theoretic geometry: signals are represented as atomic measures on a signed state space, and similarity is given by a generalized Jaccard overlap of these measures. Each signal is embedded into a nonnegative multistate representation, using positive/negative splits for real signals, Cartesian and polar decompositions for complex signals, and user-defined state partitions for refined regime analysis. Applying the Tanimoto construction to these embeddings yields a family of [0,1] distances that satisfy the triangle inequality and define positive-semidefinite kernels usable directly in kernel methods and graph-based learning. Beyond pairwise distances, we develop coalition analysis via Möbius inversion, which decomposes signal magnitude into nonnegative, additive contributions with exact budget closure across coalitions of signals. Normalizing the same embeddings produces probability measures on coordinate-state configurations, so that the distance becomes a monotone transform of total variation and admits a regime-intensity decomposition. The resulting construction yields a single, mechanistically interpretable distance that simultaneously provides bounded metric structure, positive-semidefinite kernels, probabilistic semantics, and transparent budget accounting within one sign-aware framework, supporting correlograms, feature engineering, similarity graphs, and other analytical tools in scientific and financial applications.
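The real-valued case is simple enough to sketch end to end: embed each signal via its positive/negative split into a nonnegative two-state vector, then apply the generalized (weighted) Jaccard distance. Complex signals would use a four-state Cartesian split instead; that extension is omitted here.

```python
import numpy as np

def multistate_embed(x):
    """Embed a real signal into a nonnegative two-state representation
    (positive part, negative part), as in the sign-aware construction."""
    return np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)])

def jaccard_distance(x, y):
    """Generalized (weighted) Jaccard distance on the multistate embeddings:
    1 - sum(min)/sum(max). Returns a value in [0, 1]."""
    a, b = multistate_embed(x), multistate_embed(y)
    denom = np.maximum(a, b).sum()
    return 0.0 if denom == 0 else 1.0 - np.minimum(a, b).sum() / denom

x = np.array([1.0, -2.0, 0.5])
y = np.array([0.8, -1.5, -0.5])
print(jaccard_distance(x, y))   # sign disagreement on the last entry
print(jaccard_distance(x, x))   # 0.0: identical signals
print(jaccard_distance(x, -x))  # 1.0: fully opposed signals
```

Because a coordinate with opposite signs in the two signals lands in disjoint states, it contributes zero overlap, which is how sign information enters the distance.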
[LG-52] Per-Axis Weight Deltas for Frequent Model Updates DATE NEURIPS2025
链接: https://arxiv.org/abs/2512.19720
作者: Stefan Kuyumdzhiev,Radostin Cholakov
类目: Machine Learning (cs.LG)
*备注: 10 pages, 2 figures, AI That Keeps Up: Workshop on Continual and Compatible Foundation Model Updates (CCFM), Neurips 2025
Abstract:Serving many task-specialized LLM variants is often limited by the large size of fine-tuned checkpoints and the resulting cold-start latency. Since fine-tuned weights differ from their base model by relatively small structured residuals, a natural approach is to represent them as compressed deltas. We propose a simple 1-bit delta scheme that stores only the sign of the weight difference together with lightweight per-axis (row/column) FP16 scaling factors, learned from a small calibration set. This design preserves the compactness of 1-bit deltas while more accurately capturing variation across weight dimensions, leading to improved reconstruction quality over scalar alternatives. From a systems perspective, a streamlined loader that transfers packed deltas in a single operation per module reduces cold-start latency and storage overhead, with artifacts several times smaller than a full FP16 checkpoint. The method is drop-in, requires minimal calibration data, and maintains inference efficiency by avoiding dense reconstruction. Our experimental setup and source code are available at this https URL.
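The compression scheme described here (sign bits plus per-axis scales) is easy to sketch. In the version below the scales are simply set to the mean absolute delta along one axis, which is the closed-form L2-optimal choice for fixed signs; the paper instead learns the scales from a small calibration set, and would bit-pack the signs rather than store int8.

```python
import numpy as np

def compress_delta(w_base, w_ft, axis=0):
    """Sketch of the 1-bit delta: store sign(w_ft - w_base) plus per-axis
    FP16 scales (mean |delta| along `axis`; the paper learns these)."""
    delta = w_ft - w_base
    signs = np.sign(delta).astype(np.int8)  # would be bit-packed in practice
    scales = np.abs(delta).mean(axis=axis, keepdims=True).astype(np.float16)
    return signs, scales

def apply_delta(w_base, signs, scales):
    """Reconstruct the fine-tuned weights from base + compressed delta."""
    return w_base + signs * scales.astype(np.float32)

rng = np.random.default_rng(0)
w_base = rng.normal(size=(256, 256)).astype(np.float32)
w_ft = w_base + 0.01 * rng.normal(size=(256, 256)).astype(np.float32)

signs, scales = compress_delta(w_base, w_ft, axis=0)  # per-column scales
w_rec = apply_delta(w_base, signs, scales)
err = np.abs(w_rec - w_ft).mean() / np.abs(w_ft - w_base).mean()
print(f"relative delta reconstruction error: {err:.3f}")
```

Storage-wise, the artifact is one bit per weight plus one FP16 scale per row or column, which is where the several-fold savings over a full FP16 checkpoint come from.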
[LG-53] Synthetic Data Blueprint (SDB): A modular framework for the statistical structural and graph-based evaluation of synthetic tabular data
链接: https://arxiv.org/abs/2512.19718
作者: Vasileios C. Pezoulas,Nikolaos S. Tachos,Eleni Georga,Kostas Marias,Manolis Tsiknakis,Dimitrios I. Fotiadis
类目: Machine Learning (cs.LG)
*备注: 25 pages main body, 28 pages Appendix, 4 Figures, 4 Tables
Abstract:In the rapidly evolving era of Artificial Intelligence (AI), synthetic data are widely used to accelerate innovation while preserving privacy and enabling broader data accessibility. However, the evaluation of synthetic data remains fragmented across heterogeneous metrics, ad-hoc scripts, and incomplete reporting practices. To address this gap, we introduce Synthetic Data Blueprint (SDB), a modular Python library to quantitatively and visually assess the fidelity of synthetic tabular data. SDB supports: (i) automated feature-type detection, (ii) distributional and dependency-level fidelity metrics, (iii) graph- and embedding-based structure preservation scores, and (iv) a rich suite of data visualization schemas. To demonstrate the breadth, robustness, and domain-agnostic applicability of SDB, we evaluated the framework across three real-world use cases that differ substantially in scale, feature composition, statistical complexity, and downstream analytical requirements. These include: (i) healthcare diagnostics, (ii) socioeconomic and financial modelling, and (iii) cybersecurity and network traffic analysis. These use cases reveal how SDB can address diverse data fidelity assessment challenges, varying from mixed-type clinical variables to high-cardinality categorical attributes and high-dimensional telemetry signals, while at the same time offering consistent, transparent, and reproducible benchmarking across heterogeneous domains.
[LG-54] Reducing Label Dependency in Human Activity Recognition with Wearables: From Supervised Learning to Novel Weakly Self-Supervised Approaches
链接: https://arxiv.org/abs/2512.19713
作者: Taoran Sheng,Manfred Huber
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注:
Abstract:Human activity recognition (HAR) using wearable sensors has advanced through various machine learning paradigms, each with inherent trade-offs between performance and labeling requirements. While fully supervised techniques achieve high accuracy, they demand extensive labeled datasets that are costly to obtain. Conversely, unsupervised methods eliminate labeling needs but often deliver suboptimal performance. This paper presents a comprehensive investigation across the supervision spectrum for wearable-based HAR, with particular focus on novel approaches that minimize labeling requirements while maintaining competitive accuracy. We develop and empirically compare: (1) traditional fully supervised learning, (2) basic unsupervised learning, (3) a weakly supervised learning approach with constraints, (4) a multi-task learning approach with knowledge sharing, (5) a self-supervised approach based on domain expertise, and (6) a novel weakly self-supervised learning framework that leverages domain knowledge and minimal labeled data. Experiments across benchmark datasets demonstrate that: (i) our weakly supervised methods achieve performance comparable to fully supervised approaches while significantly reducing supervision requirements; (ii) the proposed multi-task framework enhances performance through knowledge sharing between related tasks; (iii) our weakly self-supervised approach demonstrates remarkable efficiency with just 10% of labeled data. These results not only highlight the complementary strengths of different learning paradigms, offering insights into tailoring HAR solutions based on the availability of labeled data, but also establish that our novel weakly self-supervised framework offers a promising solution for practical HAR applications where labeled data are limited.
[LG-55] Shallow Neural Networks Learn Low-Degree Spherical Polynomials with Learnable Channel Attention
链接: https://arxiv.org/abs/2512.20562
作者: Yingzhen Yang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:We study the problem of learning a low-degree spherical polynomial of degree \ell_0 = \Theta(1) \ge 1 defined on the unit sphere in \mathbb{R}^d by training an over-parameterized two-layer neural network (NN) with channel attention. Our main result is a significantly improved sample complexity for learning such low-degree polynomials. We show that, for any regression risk \epsilon \in (0,1), a carefully designed two-layer NN with channel attention and finite width m \ge \Theta(n^4 \log(2n/\delta)/d^{2\ell_0}) trained by vanilla gradient descent (GD) requires the lowest sample complexity of n \asymp \Theta(d^{\ell_0}/\epsilon) with probability 1-\delta for every \delta \in (0,1), in contrast with the representative sample complexity \Theta(d^{\ell_0} \max\{\epsilon^{-2}, \log d\}), where n is the training data size. Moreover, such sample complexity is not improvable, since the trained network attains a sharp nonparametric regression risk of order \Theta(d^{\ell_0}/n) with probability at least 1-\delta. On the other hand, the minimax optimal rate for the regression risk with a kernel of rank \Theta(d^{\ell_0}) is \Theta(d^{\ell_0}/n), so the nonparametric regression risk of the network trained by GD is minimax optimal. The training of the two-layer NN with channel attention consists of two stages. In Stage 1, a provably learnable channel selection algorithm identifies the ground-truth channel number \ell_0 from the initial L \ge \ell_0 channels in the first-layer activation, with high probability. This learnable selection is achieved by an efficient one-step GD update on both layers, enabling feature learning for low-degree polynomial targets. In Stage 2, the second layer is trained by standard GD using the activation function with the selected channels.
[LG-56] Over-the-Air Goal-Oriented Communications
链接: https://arxiv.org/abs/2512.20533
作者: Kyriakos Stylianopoulos,Paolo Di Lorenzo,George C. Alexandropoulos
类目: ignal Processing (eess.SP); Emerging Technologies (cs.ET); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 35 pages, 9 figures. Book chapter
Abstract:Goal-oriented communications offer an attractive alternative to the Shannon-based communication paradigm, where the data is never reconstructed at the Receiver (RX) side. Rather, focusing on the case of edge inference, the Transmitter (TX) and the RX cooperate to exchange features of the input data that will be used to predict an unseen attribute of them, leveraging information from collected data sets. This chapter demonstrates that the wireless channel can be used to perform computations over the data when equipped with programmable metasurfaces (MSs). The end-to-end system of the TX, the RX, and the MS-based channel is treated as a single deep neural network, which is trained through backpropagation to perform inference on unseen data. Using Stacked Intelligent Metasurfaces (SIM), it is shown that this Metasurfaces-Integrated Neural Network (MINN) can achieve performance comparable to fully digital neural networks under various system parameters and data sets. By offloading computations onto the channel itself, important benefits may be achieved in terms of energy consumption, arising from reduced computations at the transceivers and the smaller transmission power required for successful inference.
[LG-57] ScoreMatchingRiesz: Auto-DML with Infinitesimal Classification
链接: https://arxiv.org/abs/2512.20523
作者: Masahiro Kato
类目: Econometrics (econ.EM); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:
Abstract:This study proposes Riesz representer estimation methods based on score matching. The Riesz representer is a key component in debiased machine learning for constructing \sqrt{n}-consistent and efficient estimators in causal inference and structural parameter estimation. To estimate the Riesz representer, direct approaches have garnered attention, such as Riesz regression and the covariate balancing propensity score. These approaches can also be interpreted as variants of direct density ratio estimation (DRE) in several applications such as average treatment effect estimation. In DRE, it is well known that flexible models can easily overfit the observed data due to the estimand and the form of the loss function. To address this issue, recent work has proposed modeling the density ratio as a product of multiple intermediate density ratios and estimating it using score-matching techniques, which are often used in the diffusion model literature. We extend score-matching-based DRE methods to Riesz representer estimation. Our proposed method not only mitigates overfitting but also provides insights for causal inference by bridging marginal effects and average policy effects through time score functions.
[LG-58] The Aligned Economic Index & The State Switching Model
链接: https://arxiv.org/abs/2512.20460
作者: Ilias Aarab
类目: atistical Finance (q-fin.ST); Machine Learning (cs.LG); Econometrics (econ.EM); Portfolio Management (q-fin.PM); Applications (stat.AP)
*备注:
Abstract:A growing empirical literature suggests that equity-premium predictability is state dependent, with much of the forecasting power concentrated around recessionary periods \parencite{Henkel2011, DanglHalling2012, Devpura2018}. I study U.S. stock return predictability across economic regimes and document strong evidence of time-varying expected returns across both expansionary and contractionary states. I contribute in two ways. First, I introduce a state-switching predictive regression in which the market state is defined in real time using the slope of the yield curve. Relative to the standard one-state predictive regression, the state-switching specification increases both in-sample and out-of-sample performance for the set of popular predictors considered by \textcite{WelchGoyal2008}, improving the out-of-sample performance of most predictors in economically meaningful ways. Second, I propose a new aggregate predictor, the Aligned Economic Index, constructed via partial least squares (PLS). Under the state-switching model, the Aligned Economic Index exhibits statistically and economically significant predictive power in sample and out of sample, and it outperforms widely used benchmark predictors and alternative predictor-combination methods.
[LG-59] Avoiding the Price of Adaptivity: Inference in Linear Contextual Bandits via Stability
链接: https://arxiv.org/abs/2512.20368
作者: Samya Praharaj,Koulik Khamaru
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:Statistical inference in contextual bandits is complicated by the adaptive, non-i.i.d. nature of the data. A growing body of work has shown that classical least-squares inference may fail under adaptive sampling, and that constructing valid confidence intervals for linear functionals of the model parameter typically requires paying an unavoidable inflation of order \sqrt{d \log T}. This phenomenon – often referred to as the price of adaptivity – highlights the inherent difficulty of reliable inference under general contextual bandit policies. A key structural property that circumvents this limitation is the \emph{stability} condition of Lai and Wei, which requires the empirical feature covariance to concentrate around a deterministic limit. When stability holds, the ordinary least-squares estimator satisfies a central limit theorem, and classical Wald-type confidence intervals – designed for i.i.d. data – become asymptotically valid even under adaptation, \emph{without} incurring the \sqrt{d \log T} price of adaptivity. In this paper, we propose and analyze a penalized EXP4 algorithm for linear contextual bandits. Our first main result shows that this procedure satisfies the Lai–Wei stability condition and therefore admits valid Wald-type confidence intervals for linear functionals. Our second result establishes that the same algorithm achieves regret guarantees that are minimax optimal up to logarithmic factors, demonstrating that stability and statistical efficiency can coexist within a single contextual bandit method. Finally, we complement our theory with simulations illustrating the empirical normality of the resulting estimators and the sharpness of the corresponding confidence intervals.
[LG-60] KAN-AFT: An Interpretable Nonlinear Survival Model Integrating Kolmogorov-Arnold Networks with Accelerated Failure Time Analysis
链接: https://arxiv.org/abs/2512.20305
作者: Mebin Jose,Jisha Francis,Sudheesh Kumar Kattumannil
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: A new development in Survival Analysis based on the celebrated Kolmogorov-Arnold Networks (KANs)
Abstract:Survival analysis relies fundamentally on the semi-parametric Cox Proportional Hazards (CoxPH) model and the parametric Accelerated Failure Time (AFT) model. CoxPH assumes constant hazard ratios, often failing to capture real-world dynamics, while traditional AFT models are limited by rigid distributional assumptions. Although deep learning models like DeepAFT address these constraints by improving predictive accuracy and handling censoring, they inherit the significant challenge of black-box interpretability. The recent introduction of CoxKAN demonstrated the successful integration of Kolmogorov-Arnold Networks (KANs), a novel architecture that yields highly accurate and interpretable symbolic representations, within the CoxPH framework. Motivated by the interpretability gains of CoxKAN, we introduce KAN-AFT (Kolmogorov-Arnold Network-based AFT), the first framework to apply KANs to the AFT model. KAN-AFT effectively models complex nonlinear relationships within the AFT framework. Our primary contributions include: (i) a principled KAN-AFT formulation, (ii) robust optimization strategies for right-censored observations (e.g., Buckley-James and IPCW), and (iii) an interpretability pipeline that converts the learned spline functions into closed-form symbolic equations for survival time. Empirical results on multiple datasets confirm that KAN-AFT achieves performance comparable to or better than DeepAFT, while uniquely providing transparent, symbolic models of the survival process.
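A minimal sketch of the two standard ingredients the abstract names, assuming the usual AFT formulation and Buckley-James imputation; KAN-AFT replaces the regression function f with a Kolmogorov-Arnold network (sums of learned univariate splines) that can later be read off symbolically.

```latex
% AFT skeleton: log failure time as a (possibly nonlinear) function of covariates
\begin{align}
  \log T_i &= f(x_i) + \sigma\,\varepsilon_i
    && \text{classical AFT: } f(x) = \beta^{\top} x \text{ with a fixed law for } \varepsilon \\
  Y_i^{*} &= \mathbb{E}\left[\log T_i \,\middle|\, \log T_i > \log c_i,\; x_i\right]
    && \text{Buckley--James imputation for a right-censored observation } (c_i, x_i)
\end{align}
```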
[LG-61] Optimality-Informed Neural Networks for Solving Parametric Optimization Problems
链接: https://arxiv.org/abs/2512.20270
作者: Matthias K. Hoffmann,Amine Othmane,Kathrin Flaßkamp
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: Under review, 24 pages, 10 figures
Abstract:Many engineering tasks require solving families of nonlinear constrained optimization problems, parametrized by setting-specific variables. This is computationally demanding, particularly if solutions have to be computed across strongly varying parameter values, e.g., in real-time control or for model-based design. Thus, we propose to learn the mapping from parameters to the primal optimal solutions and their corresponding duals using neural networks, yielding a dense estimate in contrast to gridded approaches. Our approach, Optimality-informed Neural Networks (OptINNs), combines (i) a KKT-residual loss that penalizes violations of the first-order optimality conditions under standard constraint qualification assumptions, and (ii) problem-specific output activations that enforce simple inequality constraints (e.g., box-type/positivity) by construction. This design reduces data requirements, allows the prediction of dual variables, and improves feasibility and closeness to optimality compared to penalty-only training. Taking quadratic penalties as a baseline, since this approach has been previously proposed for the considered problem class in the literature, our method simplifies hyperparameter tuning and attains tighter adherence to optimality conditions. We evaluate OptINNs on different nonlinear optimization problems ranging from low to high dimensions. On small problems, OptINNs match a quadratic-penalty baseline in primal accuracy while additionally predicting dual variables with low error. On larger problems, OptINNs achieve lower constraint violations and lower primal error compared to neural networks based on the quadratic-penalty method. These results suggest that embedding feasibility and optimality into the network architecture and loss can make learning-based surrogates more accurate, feasible, and data-efficient for parametric optimization.
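A minimal sketch of a KKT-residual loss of the kind described, written for a toy inequality-constrained quadratic program; the paper's problem class, network parameterization, and exact weighting of the residual terms are not reproduced here.

```python
import numpy as np

def kkt_residual(x, lam, Q, q, A, b):
    """Squared KKT residual for  min 0.5 x'Qx + q'x  s.t.  Ax <= b.

    A loss of this shape, evaluated on network outputs (x(p), lam(p)) over
    sampled problem parameters p, is the kind of optimality-informed
    objective the abstract describes (details here are illustrative).
    """
    stationarity = Q @ x + q + A.T @ lam
    primal = np.maximum(A @ x - b, 0.0)      # inequality-constraint violation
    dual = np.maximum(-lam, 0.0)             # multiplier nonnegativity violation
    comp = lam * (A @ x - b)                 # complementary slackness
    return sum(np.sum(t**2) for t in (stationarity, primal, dual, comp))

# toy QP instance and a candidate primal-dual pair
Q = np.array([[2.0, 0.0], [0.0, 2.0]])
q = np.array([-2.0, -5.0])
A = np.array([[1.0, -2.0], [-1.0, -2.0], [-1.0, 2.0]])
b = np.array([-2.0, -6.0, 2.0])
x = np.array([1.4, 1.7]); lam = np.array([0.8, 0.0, 0.0])
print(kkt_residual(x, lam, Q, q, A, b))
```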
[LG-62] Optimal Anytime-Valid Tests for Composite Nulls
链接: https://arxiv.org/abs/2512.20039
作者: Shubhanshu Shekhar
类目: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 24 pages, 1 figure
Abstract:We consider the problem of designing optimal level-\alpha power-one tests for composite nulls. Given a parameter \alpha \in (0,1) and a stream of \mathcal{X}-valued observations \{X_n : n \geq 1\} \overset{\text{i.i.d.}}{\sim} P, the goal is to design a level-\alpha power-one test \tau_\alpha for the null H_0: P \in \mathcal{P}_0 \subset \mathcal{P}(\mathcal{X}). Prior works have shown that any such \tau_\alpha must satisfy \mathbb{E}_P[\tau_\alpha] \geq \tfrac{\log(1/\alpha)}{\gamma^*(P, \mathcal{P}_0)}, where \gamma^*(P, \mathcal{P}_0) is the so-called \mathrm{KL}_{\inf}, or minimum divergence of P to the null class. In this paper, our objective is to develop and analyze constructive schemes that match this lower bound as \alpha \downarrow 0. We first consider the finite-alphabet case (|\mathcal{X}| = m < \infty), and show that a test based on a universal e-process (formed by the ratio of a universal predictor and the running null MLE) is optimal in the above sense. The proof relies on a Donsker-Varadhan (DV) based saddle-point representation of \mathrm{KL}_{\inf}, and an application of Sion's minimax theorem. This characterization motivates a general method for arbitrary \mathcal{X}: construct an e-process based on the empirical solutions to the saddle-point representation over a sufficiently rich class of test functions. We give sufficient conditions for the optimality of this test for compact convex nulls, and verify them for Hölder smooth density models. We end the paper with a discussion on the computational aspects of implementing our proposed tests in some practical settings.
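As a hedged illustration of the ratio construction (universal predictor over running null MLE), the sketch below builds an e-process for the binary-alphabet null H_0: p <= p_0, using a Krichevsky-Trofimov mixture as the universal predictor. This is one illustrative instance, not the paper's general recipe.

```python
import numpy as np

def e_process(xs, p0=0.5):
    """Universal e-process for H0: X_i ~ Bernoulli(p), p <= p0 (binary alphabet).

    Numerator: Krichevsky-Trofimov mixture (a universal predictor);
    denominator: running maximum likelihood over the null class.
    """
    log_num, k, vals = 0.0, 0, []
    for n, x in enumerate(xs):
        p_kt = (k + 0.5) / (n + 1.0)                # KT predictive probability of a 1
        log_num += np.log(p_kt if x == 1 else 1 - p_kt)
        k += x
        p_mle = min(k / (n + 1.0), p0)              # null-constrained MLE
        log_den = (k * np.log(max(p_mle, 1e-12))
                   + (n + 1 - k) * np.log(max(1 - p_mle, 1e-12)))
        vals.append(log_num - log_den)
    return np.exp(vals)

rng = np.random.default_rng(2)
xs = rng.binomial(1, 0.7, size=400)                  # truth lies outside the null
E = e_process(xs, p0=0.5)
alpha = 0.05
hits = np.nonzero(E >= 1 / alpha)[0]                 # power-one test: stop when E_n >= 1/alpha
print("first rejection at n =", hits[0] + 1 if hits.size else None)
```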
[LG-63] Gaussian Process Assisted Meta-learning for Image Classification and Object Detection Models
链接: https://arxiv.org/abs/2512.20021
作者: Anna R. Flowers,Christopher T. Franck,Robert B. Gramacy,Justin A. Krometis
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 15 pages, 8 figures
Abstract:Collecting operationally realistic data to inform machine learning models can be costly. Before collecting new data, it is helpful to understand where a model is deficient. For example, object detectors trained on images of rare objects may not be good at identification in poorly represented conditions. We offer a way of informing subsequent data acquisition to maximize model performance by leveraging the toolkit of computer experiments and metadata describing the circumstances under which the training data was collected (e.g., season, time of day, location). We do this by evaluating the learner as the training data is varied according to its metadata. A Gaussian process (GP) surrogate fit to that response surface can inform new data acquisitions. This meta-learning approach offers improvements to learner performance as compared to data with randomly selected metadata, which we illustrate on both classic learning examples, and on a motivating application involving the collection of aerial images in search of airplanes.
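A minimal sketch of the surrogate-guided acquisition loop, assuming synthetic metadata (season, hour) and a simple lower-confidence-bound rule; the paper's metadata, response surface, and acquisition criterion may differ.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# metadata settings already evaluated: (season, hour) -> learner accuracy
# (purely synthetic stand-ins for the paper's metadata-response surface)
X_seen = np.array([[0, 8], [0, 14], [1, 10], [2, 18], [3, 12]], dtype=float)
y_seen = np.array([0.61, 0.74, 0.70, 0.52, 0.66])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_seen, y_seen)

# candidate metadata settings for the next data-collection campaign
cand = np.array([[s, h] for s in range(4) for h in range(6, 20)], dtype=float)
mu, sd = gp.predict(cand, return_std=True)

# acquire where the surrogate suspects a weak spot: low predicted accuracy
# combined with high uncertainty (a simple LCB-style rule)
score = mu - 1.0 * sd
print("next metadata setting to collect:", cand[np.argmin(score)])
```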
[LG-64] Reliable LLM-Based Edge-Cloud-Expert Cascades for Telecom Knowledge Systems
链接: https://arxiv.org/abs/2512.20012
作者: Qiushuo Hou,Sangwoo Park,Matteo Zecchin,Yunlong Cai,Guanding Yu,Osvaldo Simeone,Tommaso Melodia
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: This paper has been submitted to a journal
Abstract:Large language models (LLMs) are emerging as key enablers of automation in domains such as telecommunications, assisting with tasks including troubleshooting, standards interpretation, and network optimization. However, their deployment in practice must balance inference cost, latency, and reliability. In this work, we study an edge-cloud-expert cascaded LLM-based knowledge system that supports decision-making through a question-and-answer pipeline. In it, an efficient edge model handles routine queries, a more capable cloud model addresses complex cases, and human experts are involved only when necessary. We define a misalignment-cost constrained optimization problem, aiming to minimize average processing cost, while guaranteeing alignment of automated answers with expert judgments. We propose a statistically rigorous threshold selection method based on multiple hypothesis testing (MHT) for a query processing mechanism based on knowledge and confidence tests. The approach provides finite-sample guarantees on misalignment risk. Experiments on the TeleQnA dataset – a telecom-specific benchmark – demonstrate that the proposed method achieves superior cost-efficiency compared to conventional cascaded baselines, while ensuring reliability at prescribed confidence levels.
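A minimal sketch of the routing logic in such a cascade, with hypothetical threshold constants standing in for the MHT-calibrated thresholds the paper derives.

```python
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    confidence: float   # model's self-estimated probability of being correct

def cascade(query, edge_model, cloud_model, t_edge, t_cloud):
    """Route a query through an edge-cloud-expert cascade.

    t_edge and t_cloud play the role of the knowledge/confidence-test
    thresholds the paper calibrates via multiple hypothesis testing;
    here they are simply given constants (illustrative only).
    """
    a = edge_model(query)
    if a.confidence >= t_edge:
        return a.text, "edge"
    a = cloud_model(query)
    if a.confidence >= t_cloud:
        return a.text, "cloud"
    return None, "expert"   # defer to a human expert

# toy stand-in models
edge = lambda q: Answer("edge answer", 0.55)
cloud = lambda q: Answer("cloud answer", 0.90)
print(cascade("What does RSRP measure?", edge, cloud, t_edge=0.8, t_cloud=0.85))
```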
[LG-65] Semiparametric KSD test: unifying score and distance-based approaches for goodness-of-fit testing
链接: https://arxiv.org/abs/2512.20007
作者: Zhihan Huang,Ziang Niu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:Goodness-of-fit (GoF) tests are fundamental for assessing model adequacy. Score-based tests are appealing because they require fitting the model only once under the null. However, extending them to powerful nonparametric alternatives is difficult due to the lack of suitable score functions. Through a class of exponentially tilted models, we show that the resulting score-based GoF tests are equivalent to tests based on integral probability metrics (IPMs) indexed by a function class. When the class is rich, the test is universally consistent. This simple yet insightful perspective enables reinterpretation of classical distance-based testing procedures, including those based on the Kolmogorov-Smirnov distance, the Wasserstein-1 distance, and maximum mean discrepancy, as arising from score-based constructions. Building on this insight, we propose a new nonparametric score-based GoF test through a special class of IPM induced by the kernelized Stein function class, called the semiparametric kernelized Stein discrepancy (SKSD) test. Compared with other nonparametric score-based tests, the SKSD test is computationally efficient and accommodates general nuisance-parameter estimators, supported by a generic parametric bootstrap procedure. The SKSD test is universally consistent and attains Pitman efficiency. Moreover, the SKSD test provides simple GoF tests for models with intractable likelihoods but tractable scores with the help of Stein's identity, and we use two popular models, the kernel exponential family and conditional Gaussian models, to illustrate the power of our method. Our method achieves power comparable to task-specific normality tests such as Anderson-Darling and Lilliefors, despite being designed for general nonparametric alternatives.
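For context, the sketch below computes the vanilla kernelized Stein discrepancy in one dimension with an RBF kernel, the basic ingredient the SKSD test builds on; the paper's semiparametric extensions (nuisance estimation, parametric bootstrap) are not shown.

```python
import numpy as np

def ksd2(xs, score, h=1.0):
    """V-statistic estimate of the squared kernel Stein discrepancy (1-D, RBF).

    `score` is the model's score function d/dx log p(x); only scores are
    needed, not normalized densities.
    """
    x = np.asarray(xs, dtype=float)
    d = x[:, None] - x[None, :]
    k = np.exp(-d**2 / (2 * h**2))
    dkx = -d / h**2 * k                       # d/dx k(x, y)
    dky = d / h**2 * k                        # d/dy k(x, y)
    dkxy = (1.0 / h**2 - d**2 / h**4) * k     # d2/(dx dy) k(x, y)
    s = score(x)
    u = s[:, None] * s[None, :] * k + s[:, None] * dky + s[None, :] * dkx + dkxy
    return u.mean()

rng = np.random.default_rng(3)
std_normal_score = lambda x: -x                          # score of N(0, 1)
print(ksd2(rng.normal(0, 1, 300), std_normal_score))     # near 0 under the null
print(ksd2(rng.normal(1, 1, 300), std_normal_score))     # larger under a mean shift
```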
[LG-66] Covariance-Aware Simplex Projection for Cardinality-Constrained Portfolio Optimization
链接: https://arxiv.org/abs/2512.19986
作者: Nikolaos Iliopoulos
类目: Portfolio Management (q-fin.PM); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Computational Finance (q-fin.CP)
*备注: 9 pages, 3 figures, 5 tables
Abstract:Metaheuristic algorithms for cardinality-constrained portfolio optimization require repair operators to map infeasible candidates onto the feasible region. Standard Euclidean projection treats assets as independent and can ignore the covariance structure that governs portfolio risk, potentially producing less diversified portfolios. This paper introduces Covariance-Aware Simplex Projection (CASP), a two-stage repair operator that (i) selects a target number of assets using volatility-normalized scores and (ii) projects the candidate weights using a covariance-aware geometry aligned with tracking-error risk. This provides a portfolio-theoretic foundation for using a covariance-induced distance in repair operators. On S&P 500 data (2020-2024), CASP-Basic delivers materially lower portfolio variance than standard Euclidean repair without relying on return estimates, with improvements that are robust across assets and statistically significant. Ablation results indicate that volatility-normalized selection drives most of the variance reduction, while the covariance-aware projection provides an additional, consistent improvement. We further show that optional return-aware extensions can improve Sharpe ratios, and out-of-sample tests confirm that gains transfer to realized performance. CASP integrates as a drop-in replacement for Euclidean projection in metaheuristic portfolio optimizers.
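A minimal sketch of a two-stage repair operator in the spirit of CASP, assuming a volatility-normalized selection score and an SLSQP solve of the Sigma-weighted simplex projection; the paper's exact scoring rule and solver may differ.

```python
import numpy as np
from scipy.optimize import minimize

def casp_repair(v, Sigma, k):
    """Covariance-aware repair of an infeasible weight vector (illustrative).

    Stage 1: keep the k assets with the largest volatility-normalized scores.
    Stage 2: project onto the simplex in the Sigma-induced geometry, i.e.
    argmin (w - v)' Sigma (w - v)  s.t.  w >= 0, sum(w) = 1, support fixed.
    """
    vol = np.sqrt(np.diag(Sigma))
    keep = np.argsort(-np.abs(v) / vol)[:k]          # stage 1: selection
    S = Sigma[np.ix_(keep, keep)]
    v_k = v[keep]
    obj = lambda w: (w - v_k) @ S @ (w - v_k)
    cons = [{"type": "eq", "fun": lambda w: w.sum() - 1.0}]
    res = minimize(obj, np.full(k, 1.0 / k), bounds=[(0.0, None)] * k,
                   constraints=cons, method="SLSQP")
    w = np.zeros_like(v)
    w[keep] = res.x
    return w

rng = np.random.default_rng(4)
A = rng.normal(size=(8, 8))
Sigma = A @ A.T / 8 + 0.1 * np.eye(8)                # toy covariance matrix
v = rng.normal(size=8)                               # infeasible candidate weights
print(casp_repair(v, Sigma, k=4).round(3))
```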
[LG-67] GIMLET: Generalizable and Interpretable Model Learning through Embedded Thermodynamics
链接: https://arxiv.org/abs/2512.19936
作者: Suguru Shiratori,Elham Kiyani,Khemraj Shukla,George Em Karniadakis
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:
Abstract:We develop a data-driven framework for discovering constitutive relations in models of fluid flow and scalar transport. Our approach infers unknown closure terms in the governing equations (gray-box discovery) under the assumption that the temporal derivative, convective transport, and pressure-gradient contributions are known. The formulation is rooted in a variational principle from nonequilibrium thermodynamics, where the dynamics is defined by a free-energy functional and a dissipation functional. The unknown constitutive terms arise as functional derivatives of these functionals with respect to the state variables. To enable a flexible and structured model discovery, the free-energy and dissipation functionals are parameterized using neural networks, while their functional derivatives are obtained via automatic differentiation. This construction enforces thermodynamic consistency by design, ensuring monotonic decay of the total free energy and non-negative entropy production. The resulting method, termed GIMLET (Generalizable and Interpretable Model Learning through Embedded Thermodynamics), avoids reliance on a predefined library of candidate functions, unlike sparse regression or symbolic identification approaches. The learned models are generalizable in that functionals identified from one dataset can be transferred to distinct datasets governed by the same underlying equations. Moreover, the inferred free-energy and dissipation functions provide direct physical interpretability of the learned dynamics. The framework is demonstrated on several benchmark systems, including the viscous Burgers equation, the Kuramoto–Sivashinsky equation, and the incompressible Navier–Stokes equations for both Newtonian and non-Newtonian fluids.
[LG-68] Quasiprobabilistic Density Ratio Estimation with a Reverse Engineered Classification Loss Function
链接: https://arxiv.org/abs/2512.19913
作者: Matthew Drnevich,Stephen Jiggins,Kyle Cranmer
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注: 25 pages, 7 figures
Abstract:We consider a generalization of the classifier-based density-ratio estimation task to a quasiprobabilistic setting where probability densities can be negative. The problem with most loss functions used for this task is that they implicitly define a relationship between the optimal classifier and the target quasiprobabilistic density ratio which is discontinuous or not surjective. We address these problems by introducing a convex loss function that is well-suited for both probabilistic and quasiprobabilistic density ratio estimation. To quantify performance, an extended version of the Sliced-Wasserstein distance is introduced which is compatible with quasiprobability distributions. We demonstrate our approach on a real-world example from particle physics, of di-Higgs production in association with jets via gluon-gluon fusion, and achieve state-of-the-art results.
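For orientation, the standard probabilistic version of classifier-based density-ratio estimation looks like the sketch below (logistic regression plus the class-posterior identity); the paper's contribution, a convex loss extending this recipe to quasiprobabilistic ratios, is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# classifier-based density-ratio estimation of r(x) = p(x) / q(x):
# label samples from p as 1 and samples from q as 0, fit a probabilistic
# classifier, and invert the class-posterior identity
rng = np.random.default_rng(5)
xp = rng.normal(1.0, 1.0, (2000, 1))    # samples from p = N(1, 1)
xq = rng.normal(0.0, 1.0, (2000, 1))    # samples from q = N(0, 1)
X = np.vstack([xp, xq])
y = np.r_[np.ones(len(xp)), np.zeros(len(xq))]

clf = LogisticRegression().fit(X, y)
x0 = np.array([[0.5]])
s = clf.predict_proba(x0)[0, 1]
r_hat = s / (1 - s) * (len(xq) / len(xp))   # estimated p(x0)/q(x0)
r_true = np.exp(-(0.5 - 1.0) ** 2 / 2) / np.exp(-0.5 ** 2 / 2)
print(f"r_hat={r_hat:.3f}  r_true={r_true:.3f}")
```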
[LG-69] Efficient Learning of Lattice Gauge Theories with Fermions
链接: https://arxiv.org/abs/2512.19891
作者: Shreya Shukla,Yukari Yamauchi,Andrey Y. Lokhov,Scott Lawrence,Abhijith Jayakumar
类目: High Energy Physics - Lattice (hep-lat); Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 12 pages, 2 figures
Abstract:We introduce a learning method for recovering action parameters in lattice field theories. Our method is based on the minimization of a convex loss function constructed using the Schwinger-Dyson relations. We show that score matching, a popular learning method, is a special case of our construction of an infinite family of valid loss functions. Importantly, our general Schwinger-Dyson-based construction applies to gauge theories and models with Grassmann-valued fields used to represent dynamical fermions. In particular, we extend our method to realistic lattice field theories including quantum chromodynamics.
[LG-70] Fundamentals of quantum Boltzmann machine learning with visible and hidden units
链接: https://arxiv.org/abs/2512.19819
作者: Mark M. Wilde
类目: Quantum Physics (quant-ph); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注: 61 pages, 1 figure
Abstract:One of the primary applications of classical Boltzmann machines is generative modeling, wherein the goal is to tune the parameters of a model distribution so that it closely approximates a target distribution. Training relies on estimating the gradient of the relative entropy between the target and model distributions, a task that is well understood when the classical Boltzmann machine has both visible and hidden units. For some years now, it has been an obstacle to generalize this finding to quantum state learning with quantum Boltzmann machines that have both visible and hidden units. In this paper, I derive an analytical expression for the gradient of the quantum relative entropy between a target quantum state and the reduced state of the visible units of a quantum Boltzmann machine. Crucially, this expression is amenable to estimation on a quantum computer, as it involves modular-flow-generated unitary rotations reminiscent of those appearing in my prior work on rotated Petz recovery maps. This leads to a quantum algorithm for gradient estimation in this setting. I then specialize the setting to quantum visible units and classical hidden units, and vice versa, and provide analytical expressions for the gradients, along with quantum algorithms for estimating them. Finally, I replace the quantum relative entropy objective function with the Petz-Tsallis relative entropy; here I develop an analytical expression for the gradient and sketch a quantum algorithm for estimating it, as an application of a novel formula for the derivative of the matrix power function, which also involves modular-flow-generated unitary rotations. Ultimately, this paper demarcates progress in training quantum Boltzmann machines with visible and hidden units for generative modeling and quantum state learning.
[LG-71] Robust Causal Directionality Inference in Quantum Inference under MNAR Observation and High-Dimensional Noise
链接: https://arxiv.org/abs/2512.19746
作者: Joonsung Kang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:In quantum mechanics, observation actively shapes the system, paralleling the statistical notion of Missing Not At Random (MNAR). This study introduces a unified framework for robust causal directionality inference in quantum engineering, determining whether relations are system \to observation, observation \to system, or bidirectional. The method integrates CVAE-based latent constraints, MNAR-aware selection models, GEE-stabilized regression, penalized empirical likelihood, and Bayesian optimization. It jointly addresses quantum and classical noise while uncovering causal directionality, with theoretical guarantees for double robustness, perturbation stability, and oracle inequalities. Simulation and real-data analyses (TCGA gene expression, proteomics) show that the proposed MNAR-stabilized CVAE+GEE+AIPW+PEL framework achieves lower bias and variance, near-nominal coverage, and superior quantum-specific diagnostics. This establishes robust causal directionality inference as a key methodological advance for reliable quantum engineering.
[LG-72] NMIRacle: Multi-modal Generative Molecular Elucidation from IR and NMR Spectra
链接: https://arxiv.org/abs/2512.19733
作者: Federico Ottomano,Yingzhen Li,Alex M. Ganose
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:
Abstract:Molecular structure elucidation from spectroscopic data is a long-standing challenge in Chemistry, traditionally requiring expert interpretation. We introduce NMIRacle, a two-stage generative framework that builds upon recent paradigms in AI-driven spectroscopy with minimal assumptions. In the first stage, NMIRacle learns to reconstruct molecular structures from count-aware fragment encodings, which capture both fragment identities and their occurrences. In the second stage, a spectral encoder maps input spectroscopic measurements (IR, 1H-NMR, 13C-NMR) into a latent embedding that conditions the pre-trained generator. This formulation bridges fragment-level chemical modeling with spectral evidence, yielding accurate molecular predictions. Empirical results show that NMIRacle outperforms existing baselines on molecular elucidation, while maintaining robust performance across increasing levels of molecular complexity.
[LG-73] Chemically-Informed Machine Learning Approach for Prediction of Reactivity Ratios in Radical Copolymerization
链接: https://arxiv.org/abs/2512.19715
作者: Habibollah Safari,Mona Bavarian
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:
Abstract:Predicting monomer reactivity ratios is crucial for controlling monomer sequence distribution in copolymers and their properties. Traditional experimental methods of determining reactivity ratios are time-consuming and resource-intensive, while existing computational methods often struggle with accuracy or scalability. Here, we present a method that combines unsupervised learning with artificial neural networks to predict reactivity ratios in radical copolymerization. By applying spectral clustering to physicochemical features of monomers, we identified three distinct monomer groups with characteristic reactivity patterns. This computationally efficient clustering approach revealed specific monomer group interactions leading to different sequence arrangements, including alternating, random, block, and gradient copolymers, providing chemical insights for initial exploration. Building upon these insights, we trained artificial neural networks to achieve quantitative reactivity ratio predictions. We explored two integration strategies including direct feature concatenation, and cluster-specific training, which demonstrated performance enhancements for targeted chemical domains compared to general training with equivalent sample sizes. However, models utilizing complete datasets outperformed specialized models trained on focused subsets, revealing a fundamental trade-off between chemical specificity and data availability. This work demonstrates that unsupervised learning offers rapid chemical insight for exploratory analysis, while supervised learning provides the accuracy necessary for final design predictions, with optimal strategies depending on data availability and application requirements.
[LG-74] ASK: Adaptive Self-improving Knowledge Framework for Audio Text Retrieval
链接: https://arxiv.org/abs/2512.19703
作者: Siyuan Fu,Xuchen Guo,Mingjun Liu,Hongxiang Li,Boyin Tan,Gongxi Zhu,Xianwei Zhuang,Jinghan Ru,Yuxin Xie,Yuguo Yin
类目: Audio and Speech Processing (eess.AS); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
*备注:
Abstract:The dominant paradigm for Audio-Text Retrieval (ATR) relies on mini-batch-based contrastive learning. This process, however, is inherently limited by what we formalize as the Gradient Locality Bottleneck (GLB), which structurally prevents models from leveraging out-of-batch knowledge and thus impairs fine-grained and long-tail learning. While external knowledge-enhanced methods can alleviate the GLB, we identify a critical, unaddressed side effect: the Representation-Drift Mismatch (RDM), where a static knowledge base becomes progressively misaligned with the evolving model, turning guidance into noise. To address this dual challenge, we propose the Adaptive Self-improving Knowledge (ASK) framework, a model-agnostic, plug-and-play solution. ASK breaks the GLB via multi-grained knowledge injection, systematically mitigates RDM through dynamic knowledge refinement, and introduces a novel adaptive reliability weighting scheme to ensure consistent knowledge contributes to optimization. Experimental results on two benchmark datasets with superior, state-of-the-art performance justify the efficacy of our proposed ASK framework.
信息检索
[IR-0] Laser: Governing Long-Horizon Agentic Search via Structured Protocol and Context Register
链接: https://arxiv.org/abs/2512.20458
作者: Shuting Wang,Qiaolin Xia,Hao Wang,Yu Lu,Bobsimons,Zhicheng Dou
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Recent advances in Large Language Models (LLMs) and Large Reasoning Models (LRMs) have enabled agentic search systems that interleave multi-step reasoning with external tool use. However, existing frameworks largely rely on unstructured natural-language reasoning and accumulate raw intermediate traces in the context, which often leads to unstable reasoning trajectories, context overflow, and degraded performance on complex multi-hop queries. In this study, we introduce Laser, a general framework for stabilizing and scaling agentic search. Laser defines a symbolic action protocol that organizes agent behaviors into three spaces: planning, task-solving, and retrospection. Each action is specified with explicit semantics and a deterministic execution format, enabling structured and logical reasoning processes and reliable action parsing. This design makes intermediate decisions interpretable and traceable, enhancing explicit retrospection and fine-grained control over reasoning trajectories. In coordination with parsable actions, Laser further maintains a compact context register that stores only essential states of the reasoning process, allowing the agent to reason over long horizons without uncontrolled context expansion. Experiments on Qwen2.5/3-series models across challenging multi-hop QA datasets show that Laser consistently outperforms existing agentic search baselines under both prompting-only and fine-tuning settings, demonstrating that Laser provides a principled and effective foundation for robust, scalable agentic search.
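A hypothetical rendering of what a symbolic action protocol with a compact context register could look like; the action names, payload format, and register fields below are invented for illustration and are not taken from Laser.

```python
import json
from dataclasses import dataclass, field

# hypothetical action vocabulary mapped to Laser's three spaces
ACTION_SPACES = {"PLAN": "planning", "SEARCH": "task-solving",
                 "ANSWER": "task-solving", "REFLECT": "retrospection"}

@dataclass
class ContextRegister:
    plan: list = field(default_factory=list)       # current sub-goal list
    evidence: dict = field(default_factory=dict)   # sub-goal -> distilled facts
    notes: list = field(default_factory=list)      # retrospection summaries

def parse_action(text):
    """Deterministically parse one protocol line, e.g. SEARCH {"query": "..."}."""
    name, _, payload = text.partition(" ")
    if name not in ACTION_SPACES:
        raise ValueError(f"unknown action: {name}")
    return name, json.loads(payload or "{}")

reg = ContextRegister()
name, args = parse_action('SEARCH {"query": "who directed Oldboy"}')
reg.evidence[args["query"]] = "Park Chan-wook"   # keep only distilled state, not raw traces
print(name, ACTION_SPACES[name], reg.evidence)
```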
[IR-1] Collaborative Group-Aware Hashing for Fast Recommender Systems
链接: https://arxiv.org/abs/2512.20172
作者: Yan Zhang,Li Deng,Lixin Duan,Ivor W. Tsang,Guowu Yang
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Fast online recommendation is critical for applications with large-scale databases; meanwhile, it is challenging to provide accurate recommendations in sparse scenarios. Hashing techniques have shown their superiority for speeding up online recommendation through bit operations for Hamming distance computation. However, existing hashing-based recommendations suffer from low accuracy, especially in sparse settings, due to the limited representation capability of each bit and neglected inherent relations among users and items. To this end, this paper proposes a Collaborative Group-Aware Hashing (CGAH) method for both collaborative filtering (namely CGAH-CF) and content-aware recommendation (namely CGAH) that integrates inherent group information to alleviate the sparsity issue. First, we extract inherent group affinities of users and items by classifying their latent vectors into different groups. Then, the preference is formulated as the inner product of the group affinity and the similarity of hash codes. By learning hash codes with the inherent group information, CGAH obtains more effective hash codes than other discrete methods on sparse interactive data. Extensive experiments on three public datasets show the superior performance of our proposed CGAH and CGAH-CF over state-of-the-art discrete collaborative filtering methods and discrete content-aware recommendations under different sparse settings.
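The bit-operation speedup and the group-aware scoring rule can be illustrated as below; the soft group affinities and the exact similarity normalization are assumptions made for illustration, not CGAH's precise formulation.

```python
import numpy as np

def hamming_sim(a, b, n_bits):
    """Similarity of two hash codes packed as Python ints, via popcount."""
    dist = bin(a ^ b).count("1")          # XOR + popcount: the fast bit trick
    return 1.0 - 2.0 * dist / n_bits      # maps Hamming distance to [-1, 1]

# preference(u, i) ~ <group affinity of u, group affinity of i> * sim(codes)
# (an illustrative reading of the scoring rule described in the abstract)
n_bits = 64
rng = np.random.default_rng(6)
user_code = int.from_bytes(rng.bytes(8), "little")   # random 64-bit hash code
item_code = int.from_bytes(rng.bytes(8), "little")
user_aff = np.array([0.7, 0.2, 0.1])      # user's soft membership over 3 groups
item_aff = np.array([0.6, 0.3, 0.1])      # item's soft membership over 3 groups
score = float(user_aff @ item_aff) * hamming_sim(user_code, item_code, n_bits)
print(f"preference score: {score:.3f}")
```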
[IR-2] VSA: Visual-Structural Alignment for UI-to-Code
链接: https://arxiv.org/abs/2512.20034
作者: Xian Wu,Ming Zhang,Zhiyu Fang,Fei Li,Bin Wang,Yong Jiang,Hao Zhou
类目: Information Retrieval (cs.IR)
*备注:
Abstract:The automation of user interface development has the potential to accelerate software delivery by mitigating intensive manual implementation. Despite the advancements in Large Multimodal Models for design-to-code translation, existing methodologies predominantly yield unstructured, flat codebases that lack compatibility with component-oriented libraries such as React or Angular. Such outputs typically exhibit low cohesion and high coupling, complicating long-term maintenance. In this paper, we propose Visual-Structural Alignment (VSA), a multi-stage paradigm designed to synthesize organized frontend assets through visual-structural alignment. Our approach first employs a spatial-aware transformer to reconstruct the visual input into a hierarchical tree representation. Moving beyond basic layout extraction, we integrate an algorithmic pattern-matching layer to identify recurring UI motifs and encapsulate them into modular templates. These templates are then processed via a schema-driven synthesis engine, ensuring the Large Language Model generates type-safe, prop-drilled components suitable for production environments. Experimental results indicate that our framework yields a substantial improvement in code modularity and architectural consistency over state-of-the-art benchmarks, effectively bridging the gap between raw pixels and scalable software engineering.
[IR-3] LLM-Assisted Abstract Screening with OLIVER: Evaluating Calibration and Single-Model vs. Actor-Critic Configurations in Literature Reviews
链接: https://arxiv.org/abs/2512.20022
作者: Kian Godhwani,David Benrimoh
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Introduction: Recent work suggests large language models (LLMs) can accelerate screening, but prior evaluations focus on earlier LLMs, standardized Cochrane reviews, single-model setups, and accuracy as the primary metric, leaving generalizability, configuration effects, and calibration largely unexamined. Methods: We developed OLIVER (Optimized LLM-based Inclusion and Vetting Engine for Reviews), an open-source pipeline for LLM-assisted abstract screening. We evaluated multiple contemporary LLMs across two non-Cochrane systematic reviews, assessing performance at both the full-text screening and final inclusion stages using accuracy, AUC, and calibration metrics. We further tested an actor-critic screening framework combining two lightweight models under three aggregation rules. Results: Across individual models, performance varied widely. In the smaller Review 1 (821 abstracts, 63 final includes), several models achieved high sensitivity for final includes but at the cost of substantial false positives and poor calibration. In the larger Review 2 (7741 abstracts, 71 final includes), most models were highly specific but struggled to recover true includes, with prompt design influencing recall. Calibration was consistently weak across single-model configurations despite high overall accuracy. Actor-critic screening improved discrimination and markedly reduced calibration error in both reviews, yielding higher AUCs. Discussion: LLMs may eventually accelerate abstract screening, but single-model performance is highly sensitive to review characteristics and prompting, and calibration is limited. An actor-critic framework improves classification quality and confidence reliability while remaining computationally efficient, enabling large-scale screening at low cost.
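A minimal sketch of actor-critic aggregation over two models' inclusion probabilities, with three simple rules standing in for the paper's configurations (the paper's exact rules and thresholds are not reproduced here).

```python
def aggregate(p_actor, p_critic, rule="mean", t=0.5):
    """Combine actor and critic inclusion probabilities for one abstract."""
    if rule == "or":                        # include if either model says include
        return p_actor >= t or p_critic >= t
    if rule == "and":                       # include only on agreement
        return p_actor >= t and p_critic >= t
    return (p_actor + p_critic) / 2 >= t    # average-probability rule

for rule in ("or", "and", "mean"):
    print(rule, aggregate(0.62, 0.41, rule))
```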
[IR-4] IGDMRec: Behavior Conditioned Item Graph Diffusion for Multimodal Recommendation
链接: https://arxiv.org/abs/2512.19983
作者: Ziyuan Guo,Jie Guo,Zhenghao Chen,Bin Song,Fei Richard Yu
类目: Information Retrieval (cs.IR)
*备注: 12 pages, 6 figures. This paper has been accepted for publication in IEEE Transactions on Multimedia. The final published version will be available via IEEE Xplore
Abstract:Multimodal recommender systems (MRSs) are critical for various online platforms, offering users more accurate personalized recommendations by incorporating multimodal information of items. Structure-based MRSs have achieved state-of-the-art performance by constructing semantic item graphs, which explicitly model relationships between items based on modality feature similarity. However, such semantic item graphs are often noisy due to 1) inherent noise in multimodal information and 2) misalignment between item semantics and user-item co-occurrence relationships, which introduces false links and leads to suboptimal recommendations. To address this challenge, we propose Item Graph Diffusion for Multimodal Recommendation (IGDMRec), a novel method that leverages a diffusion model with classifier-free guidance to denoise the semantic item graph by integrating user behavioral information. Specifically, IGDMRec introduces a Behavior-conditioned Graph Diffusion (BGD) module, incorporating interaction data as conditioning information to guide the denoising of the semantic item graph. Additionally, a Conditional Denoising Network (CD-Net) is designed to implement the denoising process with manageable complexity. Finally, we propose a contrastive representation augmentation scheme that leverages both the denoised item graph and the original item graph to enhance item representations. Extensive experiments on four real-world datasets demonstrate the superiority of IGDMRec over competitive baselines, with robustness analysis validating its denoising capability and ablation studies verifying the effectiveness of its key components.
[IR-5] Towards Analysing Invoices and Receipts with Amazon Textract
链接: https://arxiv.org/abs/2512.19958
作者: Sneha Oommen,Gabby Sanchez,Cassandra T. Britto,Di Wang,Jordan Chiou,Maria Spichkova
类目: Information Retrieval (cs.IR); Software Engineering (cs.SE)
*备注:
Abstract:This paper presents an evaluation of AWS Textract in the context of extracting data from receipts. We analyse Textract's functionality using a dataset that includes receipts of varied formats and conditions. Our analysis provides a qualitative view of Textract's strengths and limitations. While receipt totals were consistently detected, we also observed typical issues and irregularities, often influenced by image quality and layout. Based on these observations, we propose mitigation strategies.
[IR-6] Towards a point-to-point CV-QKD system: Implementation challenges and perspectives
链接: https://arxiv.org/abs/2512.19834
作者: Davi Juvêncio Gomes de Sousa,Nelson Alves Ferreira Neto,Christiano M. S. Nascimento,Lucas Q. Galvão,Mauro Queiroz Nooblath Neto,Micael Andrade Dias,Cássio de Castro Silva,Braian Pinheiro da Silva,Alexandre B. Tacla,Valéria Loureiro da Silva
类目: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Information Retrieval (cs.IR); Information Theory (cs.IT); Signal Processing (eess.SP)
*备注: 33 pages with 8 figures
Abstract:This article presents an analysis of the practical challenges and implementation perspectives of point-to-point continuous-variable quantum key distribution (CV-QKD) systems over optical fiber. The study addresses the physical layer, including the design of transmitters, quantum channels, and receivers, with emphasis on impairments such as attenuation, chromatic dispersion, polarization fluctuations, and coexistence with classical channels. We further examine the role of digital signal processing (DSP) as the bridge between quantum state transmission and classical post-processing, highlighting its impact on excess noise mitigation, covariance matrix estimation, and reconciliation efficiency. The post-processing pipeline is detailed with a focus on parameter estimation in the finite-size regime, information reconciliation using LDPC-based codes optimized for low-SNR conditions, and privacy amplification employing large-block universal hashing. From a hardware perspective, we discuss modular digital architectures that integrate dedicated accelerators with programmable processors, supported by a reference software framework (CV-QKD-ModSim) for algorithm validation and hardware co-design. Finally, we outline perspectives for the deployment of CV-QKD in Brazil, starting from metropolitan testbeds and extending toward hybrid fiber/FSO and space-based infrastructures. The work establishes the foundations for the first point-to-point CV-QKD system in Brazil, while providing a roadmap for scalable and interoperable quantum communication networks.

