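This post lists the latest papers retrieved from Arxiv.org on 2025-11-20. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.
Note: the daily paper data is retrieved from Arxiv.org and updated automatically at around 12:00 each day.
Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.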
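Contents
Overview (2025-11-20)
464 papers were updated today, including:
- Natural language processing: 48 papers (Computation and Language (cs.CL))
- Artificial intelligence: 135 papers (Artificial Intelligence (cs.AI))
- Computer vision: 122 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine learning: 124 papers (Machine Learning (cs.LG))
Natural Language Processing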
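[NLP-0] Tokenisation over Bounded Alphabets is Hard
[Quick read]: This paper addresses the theoretical limits on the computational complexity of tokenisation, specifically over the fixed-size alphabets used in practice (such as bytes or Unicode characters). Earlier work showed that tokenisation is NP-complete, but assumed unboundedly large alphabets, which is unrealistic. The authors analyse two natural variants, bottom-up tokenisation and direct tokenisation, and prove that even over a binary alphabet both are NP-complete and admit no polynomial-time approximation scheme (unless P=NP). They further show that direct tokenisation remains NP-complete over a unary alphabet, so the hardness is not an artifact of large alphabets or complex constructions but a fundamental barrier. The key contribution is a rigorous characterisation of this intractability, which explains why practical algorithms such as BPE and UnigramLM are heuristic and points to approximation algorithms as a direction for future work.
Link: https://arxiv.org/abs/2511.15709
Authors: Violeta Kastreva,Philip Whittington,Dennis Komm,Tiago Pimentel
Affiliations: ETH Zürich; Sofia University “St. Kliment Ohridski”
Categories: Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: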
Abstract: Recent works have shown that tokenisation is NP-complete. However, these works assume tokenisation is applied to inputs with unboundedly large alphabets – an unrealistic assumption, given that in practice tokenisers operate over fixed-size alphabets, such as bytes or Unicode characters. We close this gap by analysing tokenisation over bounded n-ary alphabets, considering two natural variants: bottom-up tokenisation and direct tokenisation, where we must, respectively, select a sequence of merge operations or a vocabulary whose application optimally compresses a dataset. First, we note that proving hardness results for an n-ary alphabet proves the same results for alphabets of any larger size. We then prove that even with binary alphabets, both variants are not only NP-complete, but admit no polynomial-time approximation scheme (unless P=NP). We further show that direct tokenisation remains NP-complete even when applied to unary alphabets. While unary alphabets may not be practically useful, this result establishes that the computational intractability of tokenisation is not an artifact of large alphabets or complex constructions, but a fundamental barrier. Overall, our results explain why practical algorithms such as BPE and UnigramLM are heuristic, and points toward approximation algorithms being an important path going forward for tokenisation research.
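To make "bottom-up tokenisation" concrete, here is a minimal sketch of the greedy merge heuristic that BPE-style tokenisers use, run over a binary alphabet. It is only an illustration of the heuristic the abstract contrasts with the optimal (NP-complete) problem, not code from the paper. The function name `greedy_merges` and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def greedy_merges(text: str, num_merges: int) -> tuple[list[str], list[tuple[str, str]]]:
    """Greedy bottom-up tokenisation: repeatedly merge the most frequent adjacent pair.

    Assumption: ties are broken by first occurrence. This is a heuristic and
    gives no guarantee of optimal compression.
    """
    tokens = list(text)  # start from single symbols of the alphabet
    merges: list[tuple[str, str]] = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break  # fewer than two tokens left, nothing to merge
        best = max(pairs, key=pairs.get)
        merged, i = [], 0
        while i < len(tokens):
            # Apply the chosen merge left to right, without overlaps.
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        merges.append(best)
    return tokens, merges


tokens, merges = greedy_merges("0101", 2)
print(tokens)  # ['0101']
print(merges)  # [('0', '1'), ('01', '01')]
```

Each greedy step is cheap, but the sequence of merges it picks is not guaranteed to compress the data optimally, which is consistent with the hardness results summarised above.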
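[NLP-1] Think Visually, Reason Textually: Vision-Language Synergy in ARC
[Quick read]: This paper targets the weakness of frontier foundation models (such as GPT-5 and Grok 4) at abstract reasoning from minimal examples: they struggle to infer structured transformation rules from a handful of examples, which is a hallmark of human intelligence. The proposed solution combines vision and language. First, Vision-Language Synergy Reasoning (VLSR) decomposes ARC-AGI into modality-aligned subtasks, using vision for global pattern abstraction and verification and language for symbolic rule formulation and precise execution. Second, Modality-Switch Self-Correction (MSSC) uses vision to verify text-based reasoning and correct errors intrinsically. Experiments show improvements of up to 4.33% over text-only baselines across several flagship models and multiple ARC-AGI tasks.
Link: https://arxiv.org/abs/2511.15703
Authors: Beichen Zhang,Yuhang Zang,Xiaoyi Dong,Yuhang Cao,Haodong Duan,Dahua Lin,Jiaqi Wang
Affiliations: The Chinese University of Hong Kong; Shanghai AI Laboratory; Shanghai Innovation Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: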
Abstract:Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles. However, our pilot experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution. This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction. Extensive experiments demonstrate that our approach yields up to a 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks. Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence in future foundation models. Source code will be released soon.
[NLP-2] MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
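[Quick read]: This paper addresses the low inference efficiency of Mixture-of-Experts (MoE) Multimodal Large Language Models (MLLMs). Existing expert skipping methods reduce compute, but they were designed for unimodal LLMs: they ignore the heterogeneous contributions of experts across MoE layers and the modality-specific behaviour of tokens within those layers, which causes considerable performance degradation. The authors propose MoDES, a training-free framework with three parts: (1) a Globally-modulated Local Gating (GMLG) mechanism that folds layer-wise importance into local routing probabilities to estimate per-token expert importance more accurately; (2) a Dual-Modality Thresholding (DMT) method that handles vision and language tokens separately to derive the skipping schedule; and (3) a frontier search algorithm that exploits monotonicity to find the thresholds, cutting convergence time from several days to a few hours. With 88% of experts skipped, MoDES improves performance by up to 10.67% over previous approaches while also speeding up inference.
Link: https://arxiv.org/abs/2511.15690
Authors: Yushi Huang,Zining Wang,Zhihang Yuan,Yifu Ding,Ruihao Gong,Jinyang Guo,Xianglong Liu,Jun Zhang
Affiliations: Hong Kong University of Science and Technology; Beihang University; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Code will be released upon acceptance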
Abstract: Mixture-of-Experts (MoE) Multimodal large language models (MLLMs) excel at vision-language tasks, but they suffer from high computational inefficiency. To reduce inference overhead, expert skipping methods have been proposed to deactivate redundant experts based on the current input tokens. However, we find that applying these methods, originally designed for unimodal large language models (LLMs), to MLLMs results in considerable performance degradation. This is primarily because such methods fail to account for the heterogeneous contributions of experts across MoE layers and modality-specific behaviors of tokens within these layers. Motivated by these findings, we propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. It incorporates a globally-modulated local gating (GMLG) mechanism that integrates global layer-wise importance into local routing probabilities to accurately estimate per-token expert importance. A dual-modality thresholding (DMT) method is then applied, which processes tokens from each modality separately, to derive the skipping schedule. To set the optimal thresholds, we introduce a frontier search algorithm that exploits monotonicity properties, cutting convergence time from several days to a few hours. Extensive experiments for 3 model series across 13 benchmarks demonstrate that MoDES far outperforms previous approaches. For instance, when skipping 88% experts for Qwen3-VL-MoE-30B-A3B-Instruct, the performance boost is up to 10.67% (97.33% vs. 86.66%). Furthermore, MoDES significantly enhances inference speed, improving the prefilling time by 2.16× and the decoding time by 1.26×.
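The two ideas named in the abstract, modulating local routing probabilities by a layer-wise importance and thresholding each modality separately, can be sketched roughly as follows. This is a hypothetical illustration, not the MoDES implementation: the multiplicative combination, the threshold values, the rule that at least one expert is always kept, and the name `experts_to_keep` are all assumptions.

```python
def experts_to_keep(router_probs, layer_importance, is_vision_token,
                    vision_threshold=0.2, text_threshold=0.1):
    """Return indices of routed experts that are NOT skipped for one token.

    router_probs:     routing probabilities of the experts selected for this token
    layer_importance: a scalar weight for the current MoE layer (assumed given)
    is_vision_token:  selects which modality-specific threshold applies
    """
    # Assumption: expert importance = layer importance * local routing probability.
    scores = [layer_importance * p for p in router_probs]
    threshold = vision_threshold if is_vision_token else text_threshold
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    if not kept:
        # Assumption: always keep the highest-scoring expert so the token is processed.
        kept = [max(range(len(scores)), key=scores.__getitem__)]
    return kept


print(experts_to_keep([0.6, 0.3, 0.1], 0.5, is_vision_token=True))   # [0]
print(experts_to_keep([0.6, 0.3, 0.1], 0.5, is_vision_token=False))  # [0, 1]
```

In this toy setup the same routing probabilities lead to different skipping decisions depending on the token's modality, which is the behaviour the dual thresholds are meant to allow.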
[NLP-3] VisPlay: Self-Evolving Vision-Language Models from Images
[Quick read]: This paper addresses the reliance of current Reinforcement Learning (RL) approaches for Vision-Language Models (VLMs) on human-annotated labels or task-specific heuristic rewards, which are costly and hard to scale. The authors propose VisPlay, a self-evolving RL framework in which a single base VLM plays two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. The two roles are trained jointly with Group Relative Policy Optimization (GRPO), using diversity and difficulty rewards to balance question complexity against answer quality, so the model improves iteratively without human annotation. On the Qwen2.5-VL and MiMo-VL model families, VisPlay yields consistent gains in visual reasoning, compositional generalization, and hallucination reduction.
Link: https://arxiv.org/abs/2511.15661
Authors: Yicheng He,Chengsong Huang,Zongxia Li,Jiaxin Huang,Yonghui Yang
Affiliations: University of Illinois Urbana-Champaign; Washington University in St. Louis; University of Maryland; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model into two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at this https URL
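For readers unfamiliar with GRPO, its core step is to score each sampled response relative to the other responses in the same group, rather than against a learned value function. The sketch below shows that normalisation only. It is a generic illustration, not VisPlay's training code; the use of the population standard deviation and the `eps` stabiliser are assumptions, and implementations differ on both.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of responses sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses are
    only compared with their siblings, not with an absolute baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print([round(a, 3) for a in advantages])  # [1.0, -1.0, 1.0, -1.0]
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason reward design (here, diversity and difficulty rewards) matters.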
[NLP-4] When to Think and When to Look: Uncertainty-Guided Lookback
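[Quick read]: This paper addresses the lack of a systematic analysis of how test-time thinking (generating explicit intermediate reasoning chains) affects visual reasoning in large vision language models (LVLMs). Explicit reasoning chains are known to help large language models and LVLMs, but the mechanism is unclear, and long chains can produce long wrong trajectories that ignore the image and hurt performance. In a large-scale, controlled comparison of ten variants from the InternVL3.5 and Qwen3-VL families on MMMU-val, the authors show that more thinking is not always better. They also find that short lookback phrases, which explicitly refer back to the image, are strongly enriched in successful trajectories and correlate with better visual grounding. Based on this, they propose uncertainty guided lookback, a training-free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search. It improves overall MMMU performance, sets a new state of the art under fixed model families and token budgets, and yields consistent improvements on five additional benchmarks.
Link: https://arxiv.org/abs/2511.15613
Authors: Jing Bi,Filippos Bellos,Junjia Guo,Yayuan Li,Chao Huang,Yunlong(Yolo)Tang,Luchuan Song,Susan Liang,Zhongfei(Mark)Zhang,Jason J. Corso,Chenliang Xu
Affiliations: University of Rochester; University of Michigan; Binghamton University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: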
Abstract:Test-time thinking (that is, generating explicit intermediate reasoning chains) is known to boost performance in large language models and has recently shown strong gains for large vision language models (LVLMs). However, despite these promising results, there is still no systematic analysis of how thinking actually affects visual reasoning. We provide the first such analysis with a large scale, controlled comparison of thinking for LVLMs, evaluating ten variants from the InternVL3.5 and Qwen3-VL families on MMMU-val under generous token budgets and multi pass decoding. We show that more thinking is not always better; long chains often yield long wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. A deeper analysis reveals that certain short lookback phrases, which explicitly refer back to the image, are strongly enriched in successful trajectories and correlate with better visual grounding. Building on this insight, we propose uncertainty guided lookback, a training free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search. Our method improves overall MMMU performance, delivers the largest gains in categories where standard thinking is weak, and outperforms several strong decoding baselines, setting a new state of the art under fixed model families and token budgets. We further show that this decoding strategy generalizes, yielding consistent improvements on five additional benchmarks, including two broad multimodal suites and math focused visual reasoning datasets.
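One simple way to picture an "uncertainty signal" that triggers a lookback prompt is to measure the entropy of the model's next-token distribution and intervene when it is high. The abstract does not say which signal the authors use, so entropy, the threshold value, and the function names below are assumptions chosen purely for illustration.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_look_back(probs, threshold=1.0):
    """Hypothetical trigger: look back at the image when the model is uncertain."""
    return token_entropy(probs) > threshold


confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
print(should_look_back(confident))  # False
print(should_look_back(uncertain))  # True
```

In a real decoder, a positive trigger would be followed by inserting a short phrase that refers back to the image before generation continues; that step depends on the model's API and is left out here.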
[NLP-5] SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models
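[Quick read]: This paper addresses two problems in Vision-Language-Action (VLA) models for robotic manipulation: demonstration bias caused by heavy reliance on expert demonstrations, and the low training efficiency of existing Reinforcement Learning (RL) methods caused by reward sparsity. The authors propose Self-Referential Policy Optimization (SRPO). Its key idea is to use the model's own successful trajectories from the current training batch as a self-reference, and to assign failed trajectories a progress-wise reward. Behavioural progress is measured robustly with latent world representations taken from a world model's latent space, so no external reward engineering or domain-specific fine-tuning is needed.
Link: https://arxiv.org/abs/2511.15605
Authors: Senyu Fei,Siyin Wang,Li Ji,Ao Li,Shiduo Zhang,Liming Liu,Jinlong Hou,Jingjing Gong,Xianzhong Zhao,Xipeng Qiu
Affiliations: Fudan University; Tongji University; Shanghai Innovation Institute
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: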
Abstract:Vision-Language-Action (VLA) models excel in robotic manipulation but are constrained by their heavy reliance on expert demonstrations, leading to demonstration bias and limiting performance. Reinforcement learning (RL) is a vital post-training strategy to overcome these limits, yet current VLA-RL methods, including group-based optimization approaches, are crippled by severe reward sparsity. Relying on binary success indicators wastes valuable information in failed trajectories, resulting in low training efficiency. To solve this, we propose Self-Referential Policy Optimization (SRPO), a novel VLA-RL framework. SRPO eliminates the need for external demonstrations or manual reward engineering by leveraging the model’s own successful trajectories, generated within the current training batch, as a self-reference. This allows us to assign a progress-wise reward to failed attempts. A core innovation is the use of latent world representations to measure behavioral progress robustly. Instead of relying on raw pixels or requiring domain-specific fine-tuning, we utilize the compressed, transferable encodings from a world model’s latent space. These representations naturally capture progress patterns across environments, enabling accurate, generalized trajectory comparison. Empirical evaluations on the LIBERO benchmark demonstrate SRPO’s efficiency and effectiveness. Starting from a supervised baseline with 48.9% success, SRPO achieves a new state-of-the-art success rate of 99.2% in just 200 RL steps, representing a 103% relative improvement without any extra supervision. Furthermore, SRPO shows substantial robustness, achieving a 167% performance improvement on the LIBERO-Plus benchmark.
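The idea of rewarding a failed trajectory by how close it comes to in-batch successes can be sketched as below. This is a hypothetical toy, not SRPO itself: it assumes each trajectory has already been encoded into one latent vector, uses cosine similarity to the nearest successful trajectory as the progress measure, and caps failed rewards at an arbitrary `max_fail_reward`.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def progress_rewards(embeddings, successes, max_fail_reward=0.5):
    """Assign 1.0 to successes and a similarity-based partial reward to failures.

    embeddings: one latent vector per trajectory (assumed precomputed)
    successes:  one bool per trajectory in the same batch
    """
    references = [e for e, ok in zip(embeddings, successes) if ok]
    rewards = []
    for emb, ok in zip(embeddings, successes):
        if ok:
            rewards.append(1.0)
        elif not references:
            rewards.append(0.0)  # no success in the batch to compare against
        else:
            best = max(cosine(emb, ref) for ref in references)
            rewards.append(max_fail_reward * max(0.0, min(1.0, best)))
    return rewards


print(progress_rewards([[1, 0], [1, 0], [0, 1]], [True, False, False]))  # [1.0, 0.5, 0.0]
```

Compared with a binary success indicator, the two failed trajectories here receive different rewards, so the batch still carries a training signal about which failure got closer.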
[NLP-6] HSKBenchmark: Modeling and Benchmarking Chinese Second Language Acquisition in Large Language Models through Curriculum Tuning AAAI-2026
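[Quick read]: This paper addresses the lack of a systematic benchmark for modeling Chinese Second Language Acquisition (SLA), given that experiments controlling human learners' language input are ethically and practically infeasible, which limits verifiability and scalability. The authors build HSKBenchmark, the first benchmark for staged modeling and writing assessment in this setting. It covers HSK levels 3 to 6 and includes authentic textbooks (6.76 million tokens), 16K synthetic instruction samples, 30 test topics, and a linguistically grounded evaluation system. They also propose a curriculum-tuning framework that simulates the human learning trajectory from beginner to advanced levels, and build HSKAgent for multi-dimensional evaluation of generated text (grammar coverage, writing errors, lexical and syntactic complexity, and holistic scoring). Experiments show that the benchmark models Chinese SLA effectively and serves as a reliable standard for dynamic writing assessment, and that the fine-tuned LLMs write on par with advanced human learners and exhibit human-like acquisition characteristics.
Link: https://arxiv.org/abs/2511.15574
Authors: Qihao Yang,Xuelin Wang,Jiale Chen,Xuelian Dong,Yuxin Hao,Tianyong Hao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI-2026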
Abstract:Language acquisition is vital to revealing the nature of human language intelligence and has recently emerged as a promising perspective for improving the interpretability of large language models (LLMs). However, it is ethically and practically infeasible to conduct experiments that require controlling human learners’ language inputs. This poses challenges for the verifiability and scalability of language acquisition modeling, particularly in Chinese second language acquisition (SLA). While LLMs provide a controllable and reproducible alternative, a systematic benchmark to support phase-wise modeling and assessment is still lacking. In this paper, we present HSKBenchmark, the first benchmark for staged modeling and writing assessment of LLMs in Chinese SLA. It covers HSK levels 3 to 6 and includes authentic textbooks with 6.76 million tokens, 16K synthetic instruction samples, 30 test topics, and a linguistically grounded evaluation system. To simulate human learning trajectories, we introduce a curriculum-tuning framework that trains models from beginner to advanced levels. An evaluation system is created to examine level-based grammar coverage, writing errors, lexical and syntactic complexity, and holistic scoring. We also build HSKAgent, fine-tuned on 10K learner compositions. Extensive experimental results demonstrate that HSKBenchmark not only models Chinese SLA effectively, but also serves as a reliable benchmark for dynamic writing assessment in LLMs. Our fine-tuned LLMs have writing performance on par with advanced human learners and exhibit human-like acquisition characteristics. The HSKBenchmark, HSKAgent, and checkpoints serve as foundational tools and resources, with the potential to pave the way for future research on language acquisition modeling and LLMs interpretability. Code and data are publicly available at: this https URL.
[NLP-7] Computer-Use Agents as Judges for Generative User Interface
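[Quick read]: This paper addresses the fact that most Graphical User Interfaces (GUI) are designed for human users, which forces Computer-Use Agents (CUA) to adopt human-oriented behaviours that are unnecessary for efficient task execution. With coding-oriented language models (Coder) advancing quickly in automatic GUI design, the question is how a CUA can act as a judge to help the Coder design interfaces efficiently and reliably. The core of the solution is a Coder-CUA in Collaboration framework: the Coder acts as Designer, generating and revising websites, while the CUA acts as Judge, verifying functionality and giving feedback based on task solvability and navigation success rate. To make the feedback usable, the authors design a CUA Dashboard that compresses multi-step navigation histories into concise visual summaries, giving interpretable guidance for iterative redesign. This shifts interface design from a human-centred focus toward agent-native efficiency and reliability, and moves CUA from passive use toward active participation in digital environments.
Link: https://arxiv.org/abs/2511.15567
Authors: Kevin Qinghong Lin,Siyuan Hu,Linjie Li,Zhengyuan Yang,Lijuan Wang,Philip Torr,Mike Zheng Shou
Affiliations: University of Oxford; Show Lab, National University of Singapore; Microsoft
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Project: this https URL Github: this https URL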
Abstract:Computer-Use Agents (CUA) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUI). Yet, most GUI remain designed primarily for humans–prioritizing aesthetics and usability–forcing agents to adopt human-oriented behaviors that are unnecessary for efficient task execution. At the same time, rapid advances in coding-oriented language models (Coder) have transformed automatic GUI design. This raises a fundamental question: Can CUA as judges to assist Coder for automatic GUI design? To investigate, we introduce AUI-Gym, a benchmark for Automatic GUI development spanning 52 applications across diverse domains. Using language models, we synthesize 1560 tasks that simulate real-world scenarios. To ensure task reliability, we further develop a verifier that programmatically checks whether each task is executable within its environment. Building on this, we propose a Coder-CUA in Collaboration framework: the Coder acts as Designer, generating and revising websites, while the CUA serves as Judge, evaluating functionality and refining designs. Success is measured not by visual appearance, but by task solvability and CUA navigation success rate. To turn CUA feedback into usable guidance, we design a CUA Dashboard that compresses multi-step navigation histories into concise visual summaries, offering interpretable guidance for iterative redesign. By positioning agents as both designers and judges, our framework shifts interface design toward agent-native efficiency and reliability. Our work takes a step toward shifting agents from passive use toward active participation in digital environments. Our code and dataset are available at this https URL.
[NLP-8] Multimodal Evaluation of Russian-language Architectures
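[Quick read]: This paper addresses the lack of a systematic evaluation benchmark for Multimodal Large Language Models (MLLMs) in Russian, where their intelligence, limitations, and risks remain insufficiently understood. The authors build Mera Multi, an open multimodal evaluation framework for Russian. It comprises 18 newly constructed evaluation tasks covering text, image, audio, and video, with unified prompts and metrics, and it includes a methodology for preventing benchmark leakage (watermarking and licenses for private sets). It provides reproducible baselines for general-purpose models and modality-specific architectures (image-to-text, video-to-text, audio-to-text), and the methodology can be replicated for typologically diverse languages, particularly within the Slavic language family.
Link: https://arxiv.org/abs/2511.15552
Authors: Artem Chervyakov,Ulyana Isaeva,Anton Emelyanov,Artem Safin,Maria Tikhonova,Alexander Kharitonov,Yulia Lyakh,Petr Surovtsev,Denis Shevelev Vildan Saburov,Vasily Konovalov,Elisei Rykov,Ivan Sviridov,Amina Miftakhova,Ilseyar Alimova,Alexander Panchenko,Alexander Kapitanov,Alena Fenogenova
Affiliations: MERA Team
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: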
Abstract:Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce Mera Multi, an open multimodal evaluation framework for Russian-spoken architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking and licenses for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.
[NLP-9] Standardising the NLP Workflow: A Framework for Reproducible Linguistic Analysis
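[Quick read]: This paper addresses two core problems in current language processing research: the lack of a common standard for organising and sharing linguistic data, and the lack of standardised, reproducible processing pipelines. The authors propose two solutions. First, the Language Processing Data Structure (LPDS), inspired by the Brain Imaging Data Structure (BIDS) from neuroscience, standardises storage of linguistic data through a folder structure and file naming conventions. Second, pelican nlp, a modular and extensible Python package, automates the full workflow from initial data cleaning to the extraction of sophisticated features (such as semantic embeddings and prosodic metrics). The whole workflow is driven by a single, shareable configuration file, which supports methodological transparency and reproducibility.
Link: https://arxiv.org/abs/2511.15512
Authors: Yves Pauli,Jan-Bernard Marsman,Finn Rabe,Victoria Edkins,Roya Hüppi,Silvia Ciampelli,Akhil Ratan Misra,Nils Lang,Wolfram Hinzen,Iris Sommer,Philipp Homan
Affiliations: University of Zurich; University Medical Center Groningen; Pompeu Fabra University
Categories: Computation and Language (cs.CL)
Comments: 26 pages, 3 figures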
Abstract:The introduction of large language models and other influential developments in AI-based language processing have led to an evolution in the methods available to quantitatively analyse language data. With the resultant growth of attention on language processing, significant challenges have emerged, including the lack of standardisation in organising and sharing linguistic data and the absence of standardised and reproducible processing methodologies. Striving for future standardisation, we first propose the Language Processing Data Structure (LPDS), a data structure inspired by the Brain Imaging Data Structure (BIDS), a widely adopted standard for handling neuroscience data. It provides a folder structure and file naming conventions for linguistic research. Second, we introduce pelican nlp, a modular and extensible Python package designed to enable streamlined language processing, from initial data cleaning and task-specific preprocessing to the extraction of sophisticated linguistic and acoustic features, such as semantic embeddings and prosodic metrics. The entire processing workflow can be specified within a single, shareable configuration file, which pelican nlp then executes on LPDS-formatted data. Depending on the specifications, the reproducible output can consist of preprocessed language data or standardised extraction of both linguistic and acoustic features and corresponding result aggregations. LPDS and pelican nlp collectively offer an end-to-end processing pipeline for linguistic data, designed to ensure methodological transparency and enhance reproducibility.
[NLP-10] CroPS: Improving Dense Retrieval with Cross-Perspective Positive Samples in Short-Video Search AAAI-2026
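[Quick read]: This paper addresses the filter bubble effect in industrial Dense Retrieval systems, which arises from self-reinforcing training on historically exposed user interactions: potentially relevant but previously unseen content is excluded from the training signal, biasing the model toward narrow and conservative retrieval. The authors propose CroPS (Cross-Perspective Positive Samples), a retrieval data engine that builds a richer, semantically meaningful set of positives from three heterogeneous signals: user query reformulation behaviour (query-level), engagement data in recommendation streams (system-level), and world knowledge synthesized by large language models (knowledge-level). They also design a Hierarchical Label Assignment (HLA) strategy and a corresponding H-InfoNCE loss for fine-grained, relevance-aware optimization, improving both retrieval diversity and accuracy.
Link: https://arxiv.org/abs/2511.15443
Authors: Ao Xie,Jiahui Chen,Quanzhi Zhu,Xiaoze Jiang,Zhiheng Qin,Enyun Yu,Han Li
Affiliations: Kuaishou Technology
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: AAAI-2026, Oral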
Abstract:Dense retrieval has become a foundational paradigm in modern search systems, especially on short-video platforms. However, most industrial systems adopt a self-reinforcing training pipeline that relies on historically exposed user interactions for supervision. This paradigm inevitably leads to a filter bubble effect, where potentially relevant but previously unseen content is excluded from the training signal, biasing the model toward narrow and conservative retrieval. In this paper, we present CroPS (Cross-Perspective Positive Samples), a novel retrieval data engine designed to alleviate this problem by introducing diverse and semantically meaningful positive examples from multiple perspectives. CroPS enhances training with positive signals derived from user query reformulation behavior (query-level), engagement data in recommendation streams (system-level), and world knowledge synthesized by large language models (knowledge-level). To effectively utilize these heterogeneous signals, we introduce a Hierarchical Label Assignment (HLA) strategy and a corresponding H-InfoNCE loss that together enable fine-grained, relevance-aware optimization. Extensive experiments conducted on Kuaishou Search, a large-scale commercial short-video search platform, demonstrate that CroPS significantly outperforms strong baselines both offline and in live A/B tests, achieving superior retrieval performance and reducing query reformulation rates. CroPS is now fully deployed in Kuaishou Search, serving hundreds of millions of users daily.
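[NLP-11] LLM-MemCluster: Empowering Large Language Models with Dynamic Memory for Text Clustering
[Quick read]: This paper addresses the performance bottleneck of Large Language Models (LLMs) in unsupervised text clustering, which stems from their lack of stateful memory and the difficulty of controlling cluster granularity. Existing methods usually rely on multi-stage pipelines with complex external modules and so are not truly end-to-end. The proposed LLM-MemCluster framework uses a Dynamic Memory to give the model state awareness and support iterative refinement, and a Dual-Prompt Strategy that lets the model reason about and determine the number of clusters. The result is a tuning-free, interpretable, end-to-end approach to text clustering that relies only on the LLM's native capabilities.
Link: https://arxiv.org/abs/2511.15424
Authors: Yuanjie Zhu,Liangwei Yang,Ke Xu,Weizhi Zhang,Zihe Song,Jindong Wang,Philip S. Yu
Affiliations: University of Illinois Chicago; William & Mary
Categories: Computation and Language (cs.CL)
Comments: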
Abstract:Large Language Models (LLMs) are reshaping unsupervised learning by offering an unprecedented ability to perform text clustering based on their deep semantic understanding. However, their direct application is fundamentally limited by a lack of stateful memory for iterative refinement and the difficulty of managing cluster granularity. As a result, existing methods often rely on complex pipelines with external modules, sacrificing a truly end-to-end approach. We introduce LLM-MemCluster, a novel framework that reconceptualizes clustering as a fully LLM-native task. It leverages a Dynamic Memory to instill state awareness and a Dual-Prompt Strategy to enable the model to reason about and determine the number of clusters. Evaluated on several benchmark datasets, our tuning-free framework significantly and consistently outperforms strong baselines. LLM-MemCluster presents an effective, interpretable, and truly end-to-end paradigm for LLM-based text clustering.
[NLP-12] Building Robust and Scalable Multilingual ASR for Indian Languages
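[Quick read]: This paper addresses the adaptation of automatic speech recognition (ASR) systems to multilingual, multi-dialect settings, with the goal of better predicting the language and dialect of utterances across 8 languages and 33 dialects. The key to the solution is a new training approach based on a Multi-Decoder architecture that uses a phonemic Common Label Set (CLS) as an intermediate representation, strengthening shared representations across languages and dialects. The approach improves over the baseline in the CLS space, and the authors discuss several methods for retaining that gain when converting back to the corresponding grapheme representations. In Track 2, the systems beat the baseline on word error rate (WER) / character error rate (CER) in 3 languages and achieved the highest language ID and dialect ID accuracy among all participating teams.
Link: https://arxiv.org/abs/2511.15418
Authors: Arjun Gangwar,Kaousheik Jayakumar,S. Umesh
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: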
Abstract:This paper describes the systems developed by SPRING Lab, Indian Institute of Technology Madras, for the ASRU MADASR 2.0 challenge. The systems developed focuses on adapting ASR systems to improve in predicting the language and dialect of the utterance among 8 languages across 33 dialects. We participated in Track 1 and Track 2, which restricts the use of additional data and develop from-the-scratch multilingual systems. We presented a novel training approach using Multi-Decoder architecture with phonemic Common Label Set (CLS) as intermediate representation. It improved the performance over the baseline (in the CLS space). We also discuss various methods used to retain the gain obtained in the phonemic space while converting them back to the corresponding grapheme representations. Our systems beat the baseline in 3 languages (Track 2) in terms of WER/CER and achieved the highest language ID and dialect ID accuracy among all participating teams (Track 2).
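Since the results above are reported as WER/CER, a short reference sketch of word error rate may help: it is the word-level edit distance (substitutions, deletions, insertions) between hypothesis and reference, divided by the number of reference words. This is the standard textbook definition, not the challenge's official scoring script; whitespace tokenisation and the function name are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("a b c d", "a x c"))  # 0.5
```

Character error rate follows the same recurrence applied to characters instead of whitespace-separated words.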
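[NLP-13] NAMeGEn: Creative Name Generation via A Novel Agent-based Multiple Personalized Goal Enhancement Framework
[Quick read]: This paper addresses two core challenges in Creative Natural Language Generation (CNLG). The first is multi-objective flexibility: user requirements are often personalized, fine-grained, and pluralistic, and current Large Language Models (LLMs) struggle to satisfy them simultaneously. The second is interpretive complexity: creativity involves not only generation but also understanding and interpreting implicit meaning to enhance users' perception. The authors focus on Chinese baby naming as a representative short-form generation task and propose NAMeGEn, a multi-agent optimization framework that iteratively alternates between objective extraction, name generation, and evaluation, so that it meets diverse constraints and provides meaningful aesthetic explanations. Without any training, it outperforms six baseline methods spanning various LLM backbones on the authors' CBNames benchmark.
Link: https://arxiv.org/abs/2511.15408
Authors: Shanlin Zhou(1),Xinpeng Wang(1),Jianxun Lian(2),Zhenghao Liu(3),Laks V.S. Lakshmanan(4),Xiaoyuan Yi(2),Yongtao Hao(1) ((1) Tongji University, (2) Microsoft Research Asia, (3) Northeastern University, (4) The University of British Columbia)
Affiliations: Tongji University; Microsoft Research Asia; Northeastern University; The University of British Columbia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE)
Comments: 13 pages, 9 figures. This work has been submitted to the IEEE for possible publication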
Abstract:Trained on diverse human-authored texts, Large Language Models (LLMs) unlocked the potential for Creative Natural Language Generation (CNLG), benefiting various applications like advertising and storytelling. Nevertheless, CNLG still remains difficult due to two main challenges. (1) Multi-objective flexibility: user requirements are often personalized, fine-grained, and pluralistic, which LLMs struggle to satisfy simultaneously; (2) Interpretive complexity: beyond generation, creativity also involves understanding and interpreting implicit meaning to enhance users’ perception. These challenges significantly limit current methods, especially in short-form text generation, in generating creative and insightful content. To address this, we focus on Chinese baby naming, a representative short-form CNLG task requiring adherence to explicit user constraints (e.g., length, semantics, anthroponymy) while offering meaningful aesthetic explanations. We propose NAMeGEn, a novel multi-agent optimization framework that iteratively alternates between objective extraction, name generation, and evaluation to meet diverse requirements and generate accurate explanations. To support this task, we further construct a classical Chinese poetry corpus with 17k+ poems to enhance aesthetics, and introduce CBNames, a new benchmark with tailored metrics. Extensive experiments demonstrate that NAMeGEn effectively generates creative names that meet diverse, personalized requirements while providing meaningful explanations, outperforming six baseline methods spanning various LLM backbones without any training.
[NLP-14] DEPO: Dual-Efficiency Preference Optimization for LLM Agents AAAI2026
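[Quick read]: This paper addresses the poor interaction efficiency of large language models (LLMs) deployed as agents, which stems from overly long chains of thought (CoT). Existing work lacks a systematic definition of LLM agent efficiency, which hinders targeted optimization. The authors therefore introduce the notion of dual-efficiency: step-level efficiency (minimizing tokens per step) and trajectory-level efficiency (minimizing the number of steps needed to complete a task). The key to the solution is DEPO (Dual-Efficiency Preference Optimization), a preference optimization method that jointly rewards succinct responses and fewer action steps. By optimizing both dimensions at once, it cuts token usage by up to 60.9% and steps by up to 26.9% on WebShop and BabyAI while improving performance by up to 29.3%, and it generalizes well across domains.
Link: https://arxiv.org/abs/2511.15392
Authors: Sirui Chen, Mengshi Zhao, Lei Xu, Yuying Zhao, Beier Zhu, Hanwang Zhang, Shengjie Zhao, Chaochao Lu
Affiliations: 1. University of Science and Technology of China; 2. Alibaba Group; 3. Tencent AI Lab; 4. Tsinghua University; 5. National University of Singapore
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to AAAI 2026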
Abstract:Recent advances in large language models (LLMs) have greatly improved their reasoning and decision-making abilities when deployed as agents. Richer reasoning, however, often comes at the cost of longer chain of thought (CoT), hampering interaction efficiency in real-world scenarios. Nevertheless, there still lacks systematic definition of LLM agent efficiency, hindering targeted improvements. To this end, we introduce dual-efficiency, comprising (i) step-level efficiency, which minimizes tokens per step, and (ii) trajectory-level efficiency, which minimizes the number of steps to complete a task. Building on this definition, we propose DEPO, a dual-efficiency preference optimization method that jointly rewards succinct responses and fewer action steps. Experiments on WebShop and BabyAI show that DEPO cuts token usage by up to 60.9% and steps by up to 26.9%, while achieving up to a 29.3% improvement in performance. DEPO also generalizes to three out-of-domain math benchmarks and retains its efficiency gains when trained on only 25% of the data. Our project page is at this https URL.
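As a rough illustration of the two efficiency notions defined in the abstract, the sketch below measures tokens per step and steps per trajectory on toy data. The `Step` record, the whitespace-based token count, and the example trajectories are assumptions made for this illustration; they are not taken from DEPO.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    # Text the agent produced at one interaction step (hypothetical record type).
    response: str


def step_level_cost(trajectory: List[Step]) -> float:
    """Average tokens per step; whitespace splitting stands in for a real tokenizer."""
    counts = [len(step.response.split()) for step in trajectory]
    return sum(counts) / len(counts)


def trajectory_level_cost(trajectory: List[Step]) -> int:
    """Number of steps the agent needed to finish the task."""
    return len(trajectory)


# Two made-up trajectories for the same task.
verbose = [
    Step("I think I should first search for red shoes"),
    Step("click item one"),
    Step("buy now"),
]
concise = [Step("search red shoes"), Step("buy now")]

for name, traj in (("verbose", verbose), ("concise", concise)):
    print(name, step_level_cost(traj), trajectory_level_cost(traj))
```

Under a dual-efficiency preference, the concise trajectory would be preferred on both axes, provided it also completes the task.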
[NLP-15] A Compliance-Preserving Retrieval System for Aircraft MRO Task Search
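[Quick read]: This paper addresses a key bottleneck in aircraft maintenance, repair, and overhaul (MRO): aircraft maintenance technicians (AMTs) spend up to 30% of their work time searching manuals. The core of the solution is a compliance-preserving semantic retrieval system that works alongside, rather than replaces, the existing certified document viewers. It combines LLM reranking with semantic search, builds revision-robust embeddings from the ATA chapter hierarchy, and uses vision-language parsing to structure certified content, so technicians can preview ranked tasks and quickly reach verified procedures. The system reaches 90% retrieval accuracy on 49,000 synthetic queries, and bilingual controlled studies support its effectiveness in practice: with 10 licensed AMTs, the top-10 success rate was 90.9% and per-task lookup time dropped from 6-15 minutes to 18 seconds, all within strict regulatory requirements.
Link: https://arxiv.org/abs/2511.15383
Authors: Byungho Jo
Affiliations: Inha University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Information Retrieval (cs.IR)
Comments: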
Abstract:Aircraft Maintenance Technicians (AMTs) spend up to 30% of work time searching manuals, a documented efficiency bottleneck in MRO operations where every procedure must be traceable to certified sources. We present a compliance-preserving retrieval system that adapts LLM reranking and semantic search to aviation MRO environments by operating alongside, rather than replacing, certified legacy viewers. The system constructs revision-robust embeddings from ATA chapter hierarchies and uses vision-language parsing to structure certified content, allowing technicians to preview ranked tasks and access verified procedures in existing viewers. Evaluation on 49k synthetic queries achieves 90% retrieval accuracy, while bilingual controlled studies with 10 licensed AMTs demonstrate 90.9% top-10 success rate and 95% reduction in lookup time, from 6-15 minutes to 18 seconds per task. These gains provide concrete evidence that semantic retrieval can operate within strict regulatory constraints and meaningfully reduce operational workload in real-world multilingual MRO workflows.
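The reported 95% reduction can be checked against the quoted lookup times. The short calculation below is only a sanity check on the numbers in the abstract; the helper name `reduction` is chosen for this example.

```python
def reduction(before_seconds: float, after_seconds: float) -> float:
    """Fractional reduction in lookup time."""
    return 1 - after_seconds / before_seconds


after = 18                       # seconds per task with the retrieval system
low = reduction(6 * 60, after)   # against the 6-minute end of the baseline
high = reduction(15 * 60, after) # against the 15-minute end of the baseline

print(f"{low:.0%} to {high:.0%}")  # prints "95% to 98%"
```

So the 95% figure matches the conservative (6-minute) end of the baseline range.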
[NLP-16] The Empowerment of Science of Science by Large Language Models: New Tools and Methods
[Quick read]: This paper aims to give a systematic account of the key technologies behind large language models (LLMs) and of how they can be applied in scientometrics and related research fields, with particular attention to their potential for scientific evaluation, detection of research fronts, and knowledge graph construction. From a user's perspective, it brings together the core technologies (prompt engineering, knowledge-enhanced retrieval-augmented generation (RAG), fine-tuning, pretraining, and tool learning) and discusses an AI-agent-based model for scientific evaluation, with the goal of advancing the practical use of LLMs in research.
Link: https://arxiv.org/abs/2511.15370
Authors: Guoqiang Liang, Jingqian Gong, Mengxuan Li, Gege Lin, Shuo Zhang
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: The manuscript is currently under review at the Journal of Information Science
Abstract:Large language models (LLMs) have exhibited exceptional capabilities in natural language understanding and generation, image recognition, and multimodal tasks, charting a course towards AGI and emerging as a central issue in the global technological race. This manuscript conducts a comprehensive review of the core technologies that support LLMs from a user standpoint, including prompt engineering, knowledge-enhanced retrieval augmented generation, fine tuning, pretraining, and tool learning. Additionally, it traces the historical development of Science of Science (SciSci) and presents a forward looking perspective on the potential applications of LLMs within the scientometric domain. Furthermore, it discusses the prospect of an AI agent based model for scientific evaluation, and presents new research fronts detection and knowledge graph building methods with LLMs.
[NLP-17] HEAD-QA v2: Expanding a Healthcare Benchmark for Reasoning
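[Quick read]: This paper addresses the shortage of high-quality multilingual reasoning datasets in healthcare, which are needed to support research on biomedical reasoning and model improvement. The key contribution is the construction and release of HEAD-QA v2, which expands the dataset to over 12,000 questions drawn from ten years of Spanish professional exams and adds multilingual versions. Several open-source LLMs are benchmarked using prompting, retrieval-augmented generation (RAG), and probability-based answer selection. The results indicate that model scale and intrinsic reasoning ability are the main drivers of performance, while complex inference strategies bring only limited gains.
Link: https://arxiv.org/abs/2511.15355
Authors: Alexis Correa-Guillén, Carlos Gómez-Rodríguez, David Vilares
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: Preprint. 12 pages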
Abstract:We introduce HEAD-QA v2, an expanded and updated version of a Spanish/English healthcare multiple-choice reasoning dataset originally released by Vilares and Gómez-Rodríguez (2019). The update responds to the growing need for high-quality datasets that capture the linguistic and conceptual complexity of healthcare reasoning. We extend the dataset to over 12,000 questions from ten years of Spanish professional exams, benchmark several open-source LLMs using prompting, RAG, and probability-based answer selection, and provide additional multilingual versions to support future work. Results indicate that performance is mainly driven by model scale and intrinsic reasoning ability, with complex inference strategies obtaining limited gains. Together, these results establish HEAD-QA v2 as a reliable resource for advancing research on biomedical reasoning and model improvement.
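The abstract does not spell out how probability-based answer selection is done. One common reading is to score every answer option with the model and pick the most likely one; the sketch below shows only that final selection step. The log-probability values are invented for illustration and `select_answer` is a hypothetical helper, not the paper's code.

```python
import math
from typing import Dict, Tuple


def select_answer(option_logprobs: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """Pick the option with the highest log-probability.

    Also returns the scores renormalized over the options, which is handy
    for inspecting how confident the choice is.
    """
    best = max(option_logprobs, key=option_logprobs.get)
    total = sum(math.exp(v) for v in option_logprobs.values())
    probs = {k: math.exp(v) / total for k, v in option_logprobs.items()}
    return best, probs


# Made-up scores for a four-option multiple-choice question.
scores = {"1": -2.3, "2": -0.4, "3": -1.9, "4": -3.1}
best, probs = select_answer(scores)
print(best, round(probs[best], 3))
```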
[NLP-18] SkyEgg: Joint Implementation Selection and Scheduling for Hardware Synthesis using E-graphs
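[Quick read]: This paper targets the suboptimal designs produced by current high-level synthesis (HLS) tools, which separate implementation selection from scheduling during hardware generation and therefore cannot fully exploit heterogeneous resources of modern FPGAs such as DSP slices. The core difficulty is twofold: implementation selection is traditionally done by heuristic pattern matching that ignores its effect on scheduling, and the subsequent scheduling works from a fixed selection with inaccurate delay estimates, missing important co-optimization opportunities. The key to the solution is SkyEgg, a framework that uses the e-graph data structure to represent algebraic transformations and hardware implementation choices uniformly, builds the space of candidate implementations through equality saturation, and finally formulates the joint optimization as a mixed-integer linear programming (MILP) problem on the saturated e-graph, solving for the best combination of implementation and schedule.
Link: https://arxiv.org/abs/2511.15323
Authors: Youwei Xiao, Yuyang Zou, Yun Liang
Affiliations: Peking University
Categories: Programming Languages (cs.PL); Computation and Language (cs.CL)
Comments: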
Abstract:Hardware synthesis from high-level descriptions remains fundamentally limited by the sequential optimization of interdependent design decisions. Current methodologies, including state-of-the-art high-level synthesis (HLS) tools, artificially separate implementation selection from scheduling, leading to suboptimal designs that cannot fully exploit modern FPGA heterogeneous architectures. Implementation selection is typically performed by ad-hoc pattern matching on operations, a process that does not consider the impact on scheduling. Subsequently, scheduling algorithms operate on fixed selection solutions with inaccurate delay estimates, which misses critical optimization opportunities from appropriately configured FPGA blocks like DSP slices. We present SkyEgg, a novel hardware synthesis framework that jointly optimizes implementation selection and scheduling using the e-graph data structure. Our key insight is that both algebraic transformations and hardware implementation choices can be uniformly represented as rewrite rules within an e-graph, modeling the complete design space of implementation candidates to be selected and scheduled together. First, SkyEgg constructs an e-graph from the input program. It then applies both algebraic and implementation rewrites through equality saturation. Finally, it formulates the joint optimization as a mixed-integer linear programming (MILP) problem on the saturated e-graph. We provide both exact MILP solving and an efficient ASAP heuristic for scalable synthesis. Our evaluation on benchmarks from diverse applications targeting Xilinx Kintex UltraScale+ FPGAs demonstrates that SkyEgg achieves an average speedup of 3.01x over Vitis HLS, with improvements up to 5.22x for complex expressions. 
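The abstract mentions an ASAP heuristic without giving details. For orientation, the sketch below shows textbook as-soon-as-possible scheduling on a small dependency graph: each operation starts as soon as all of its predecessors have finished. This is the generic technique, not SkyEgg's heuristic, and the operation names and latencies are invented.

```python
from graphlib import TopologicalSorter
from typing import Dict, Set


def asap_schedule(deps: Dict[str, Set[str]], latency: Dict[str, int]) -> Dict[str, int]:
    """Return the earliest start cycle of each operation.

    deps maps an operation to the set of operations it depends on.
    """
    start: Dict[str, int] = {}
    # static_order() yields every operation after all of its predecessors.
    for op in TopologicalSorter(deps).static_order():
        start[op] = max(
            (start[p] + latency[p] for p in deps.get(op, ())),
            default=0,
        )
    return start


# Hypothetical dataflow: out = (a * b + c) combined with a shifted value.
deps = {"mul": set(), "add": {"mul"}, "shift": set(), "out": {"add", "shift"}}
latency = {"mul": 3, "add": 1, "shift": 1, "out": 1}

schedule = asap_schedule(deps, latency)
print(schedule)
```

Choosing a different implementation for `mul` (for example a lower-latency block) would shift every downstream start time, which is why selecting implementations before scheduling can lose opportunities.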
[NLP-19] Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models
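[Quick read]: This paper examines a systematic safety vulnerability of current large language models (LLMs) under adversarial prompts, in particular how a purely stylistic transformation, with no change to the semantic content, can bypass existing safety alignment. The key finding is that "adversarial poetry" works as a universal single-turn jailbreak technique: recasting harmful instructions in poetic form raises the attack success rate (ASR) to an average of 62% for hand-crafted poems and about 43% for automated meta-prompt conversions, well above the non-poetic baselines. This shows that current LLM safety mechanisms are not robust to surface-level stylistic change and exposes fundamental limitations of existing alignment methods.
Link: https://arxiv.org/abs/2511.15304
Authors: Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Francesco Giarrusso, Marcantonio Bracale, Marcello Galisai, Vincenzo Suriani, Olga Sorokoletova, Federico Sartore, Daniele Nardi
Affiliations: DEXAI – Icaro Lab; Sapienza University of Rome; Sant'Anna School of Advanced Studies
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: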
Abstract:We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for large language models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs are evaluated using an ensemble of open-weight judge models and a human-validated stratified subset (with double-annotations to measure agreement). Disagreements were manually resolved. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions (compared to non-poetic baselines), substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.
[NLP-20] MAPROC at AHaSIS Shared Task: Few-Shot and Sentence Transformer for Sentiment Analysis of Arabic Hotel Reviews
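[Quick read]: This paper addresses sentiment analysis of Arabic dialects (Moroccan and Saudi) in hotel reviews, where the main challenges are linguistic diversity and scarce annotated data. The key to the solution is the SetFit (Sentence Transformer Fine-tuning) framework, a data-efficient few-shot learning method that can capture sentiment in dialectal text with limited labeled data and thereby improves generalization within the target domain.
Link: https://arxiv.org/abs/2511.15291
Authors: Randa Zarnoufi
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: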
Abstract:Sentiment analysis of Arabic dialects presents significant challenges due to linguistic diversity and the scarcity of annotated data. This paper describes our approach to the AHaSIS shared task, which focuses on sentiment analysis on Arabic dialects in the hospitality domain. The dataset comprises hotel reviews written in Moroccan and Saudi dialects, and the objective is to classify the reviewer's sentiment as positive, negative, or neutral. We employed the SetFit (Sentence Transformer Fine-tuning) framework, a data-efficient few-shot learning technique. On the official evaluation set, our system achieved an F1 of 73%, ranking 12th among 26 participants. This work highlights the potential of few-shot learning to address data scarcity in processing nuanced dialectal Arabic text within specialized domains like hotel reviews.
[NLP-21] ChartEditor: A Reinforcement Learning Framework for Robust Chart Editing AAAI2026
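[Quick read]: This paper addresses two weaknesses of existing chart editing benchmarks: limited data diversity and reliance on complete chart code, whereas in practice usually only the chart image and a natural-language instruction are available. The key to the solution is ChartEditVista, a comprehensive benchmark of 7,964 samples across 31 chart categories whose inputs are only the original chart image and natural-language editing instructions, with no original code. The paper also proposes two fine-grained, rule-based evaluation metrics, a layout metric and a text metric, and on this basis develops ChartEditor, a model trained with a reinforcement learning framework that adds a rendering reward to enforce code executability and visual fidelity together.
Link: https://arxiv.org/abs/2511.15266
Authors: Liangyu Chen, Yichen Xu, Jianzhe Ma, Yuqi Liu, Donglu Yang, Liang Zhang, Wenxuan Wang, Qin Jin
Affiliations: unknown
Categories: Multimedia (cs.MM); Computation and Language (cs.CL)
Comments: Accepted to AAAI 2026 Main Track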
Abstract:Chart editing reduces manual effort in visualization design. Typical benchmarks are limited in data diversity and assume access to complete chart code, which is seldom available in real-world scenarios. To address this gap, we present ChartEditVista, a comprehensive benchmark consisting of 7,964 samples spanning 31 chart categories. It encompasses diverse editing instructions and covers nearly all editable chart elements. The inputs in ChartEditVista include only the original chart image and natural language editing instructions, without the original chart codes. ChartEditVista is generated through a fully automated pipeline that produces, edits, and verifies charts, ensuring high-quality chart editing data. Besides, we introduce two novel fine-grained, rule-based evaluation metrics: the layout metric, which evaluates the position, size and color of graphical components; and the text metric, which jointly assesses textual content and font styling. Building on top of ChartEditVista, we present ChartEditor, a model trained using a reinforcement learning framework that incorporates a novel rendering reward to simultaneously enforce code executability and visual fidelity. Through extensive experiments and human evaluations, we demonstrate that ChartEditVista provides a robust evaluation, while ChartEditor consistently outperforms models of similar and larger scale on chart editing tasks.
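[NLP-22] IndicGEC: Powerful Models or a Measurement Mirage?
[Quick read]: This paper addresses grammatical error correction (GEC) for five Indian languages (Telugu, Hindi, Tamil, Malayalam, and Bangla), in particular how well language models can correct errors in these lower-resource settings. The approach relies on zero-shot and few-shot prompting and compares language models of different sizes (from 4B parameters to large proprietary models) on the GEC task. The results highlight the potential of small language models and underline the importance of good-quality datasets and of evaluation metrics suited to Indian language scripts.
Link: https://arxiv.org/abs/2511.15260
Authors: Sowmya Vajjala
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: Technical report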
Abstract:In this paper, we report the results of the TeamNRC’s participation in the BHASHA-Task 1 Grammatical Error Correction shared task this https URL for 5 Indian languages. Our approach, focusing on zero/few-shot prompting of language models of varying sizes (4B to large proprietary models) achieved a Rank 4 in Telugu and Rank 2 in Hindi with GLEU scores of 83.78 and 84.31 respectively. In this paper, we extend the experiments to the other three languages of the shared task - Tamil, Malayalam and Bangla, and take a closer look at the data quality and evaluation metric used. Our results primarily highlight the potential of small language models, and summarize the concerns related to creating good quality datasets and appropriate metrics for this task that are suitable for Indian language scripts.
[NLP-23] M Toolchain and Language for Reusable Model Compilation
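[Quick read]: This paper addresses multi-target compilation in the modeling and implementation of complex software-driven systems: how to generate, efficiently and safely, heterogeneous target models for simulation, deployment, and formal verification from a single high-level system model. Traditional modeling languages are usually designed for one use (simulation or implementation) and lack support for multi-target compilation, so later transformations lose semantics or are hard to extend. The key to the solution is M, a textual, grammar-driven modeling language with an accompanying toolchain. M is based on the actor model extended with discrete-event scheduling semantics, provides constructs for system entities, message-based interactions, and time- or state-triggered reactions, and preserves semantic conformance between generated targets and the source model. It can also serve as a middle language to which other modeling languages anchor, supporting the wider model-driven engineering (MDE) workflow.
Link: https://arxiv.org/abs/2511.15257
Authors: Hiep Hong Trinh, Federico Ciccozzi, Abu Naser Masud, Marjan Sirjani, Mikael Sjödin
Affiliations: Malmö University
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: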
Abstract:Complex software-driven systems often interleave distributed, concurrent computation processes with physical interactions with the environment. Developing these systems more efficiently and safely can be achieved by employing actionable, software-based models. From a high-level system model, engineers often need to derive multiple specialized models for different purposes, including simulation, deployment, and formal verification. Each of these target models usually rely on its own formalism, specification language, and execution platform. Traditionally, a compiler analyzes a program written in a programming language and generates executable code. In contrast, a model compiler processes a source model written in a modeling language and should ideally support the generation of multiple heterogeneous targets. However, most existing modeling languages are designed with a narrow focus, typically targeting only simulation or implementation. Multi-target compilation, when not considered during the language’s early design, becomes significantly harder to achieve. In this paper, we introduce our initiative: a toolchain and modeling language called M, designed to support system modeling and multi-target compilation for model-driven engineering of complex, concurrent, and time-aware systems. M is a textual, grammar-driven language based on the actor model and extended with discrete-event scheduling semantics. It provides constructs for modeling system entities, message-based interactions, and time- or state-triggered reactions. From such models, M enables the systematic generation of diverse target artifacts while preserving semantic conformance to the original model. Moreover, M can serve as a middle language to which other modeling languages may anchor, thereby allowing them to benefit from its compilation framework.
[NLP-24] Context Cascade Compression: Exploring the Upper Limits of Text Compression
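[Quick read]: This paper addresses the computational and memory bottleneck that very large token inputs in long-context tasks impose on large language models (LLMs). The core solution is a cascaded text compression method, Context Cascade Compression (C3), in which two LLMs of different sizes handle compression and decoding: in the first stage a small LLM condenses the long text into a small set of latent tokens (for example 32 or 64), giving a high compression ratio (such as 20:1 or 40:1); in the second stage a large LLM decodes from these latent tokens. Experiments show 98% decoding accuracy at a 20x compression ratio, compared with about 60% for DeepSeek-OCR's optical compression, and about 93% at 40x. This supports the feasibility of a pure-text pipeline for context compression and suggests a reference upper bound on compression ratios for OCR and related fields.
Link: https://arxiv.org/abs/2511.15244
Authors: Fanfan Liu, Haibo Qiu
Affiliations: unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: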
Abstract:Million-level token inputs in long-context tasks pose significant computational and memory challenges for Large Language Models (LLMs). Recently, DeepSeek-OCR conducted research into the feasibility of Contexts Optical Compression and achieved preliminary results. Inspired by this, we introduce Context Cascade Compression C3 to explore the upper limits of text compression. Our method cascades two LLMs of different sizes to handle the compression and decoding tasks. Specifically, a small LLM, acting as the first stage, performs text compression by condensing a long context into a set of latent tokens (e.g., 32 or 64 in length), achieving a high ratio of text tokens to latent tokens. A large LLM, as the second stage, then executes the decoding task on this compressed context. Experiments show that at a 20x compression ratio (where the number of text tokens is 20 times the number of latent tokens), our model achieves 98% decoding accuracy, compared to approximately 60% for DeepSeek-OCR. When we further increase the compression ratio to 40x, the accuracy is maintained at around 93%. This indicates that in the domain of context compression, C3 Compression demonstrates superior performance and feasibility over optical character compression. C3 uses a simpler, pure-text pipeline that ignores factors like layout, color, and information loss from a visual encoder. This also suggests a potential upper bound for compression ratios in future work on optical character compression, OCR, and related fields. Codes and model weights are publicly accessible at this https URL
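To make the compression ratios concrete: the abstract defines the ratio as text tokens divided by latent tokens, so the original context length follows directly. The small calculation below uses the latent lengths and ratios quoted in the abstract; which latent length was paired with which ratio in the experiments is not stated, so all four combinations are shown for illustration.

```python
def text_tokens(latent_tokens: int, ratio: int) -> int:
    """Number of original text tokens represented by the latent tokens."""
    return latent_tokens * ratio


for latent in (32, 64):
    for ratio in (20, 40):
        print(f"{latent} latent tokens at {ratio}x -> {text_tokens(latent, ratio)} text tokens")
```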
[NLP-25] OEMA: Ontology-Enhanced Multi-Agent Collaboration Framework for Zero-Shot Clinical Named Entity Recognition
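[Quick read]: This paper addresses the performance bottleneck of clinical named entity recognition (NER) when large annotated datasets are unavailable: supervised models such as CRF and BioClinicalBERT depend on costly manual annotation, while zero-shot NER methods struggle with the granularity of example selection and with integrating prompts with self-improvement. The key to the solution is OEMA, a framework built on multi-agent collaboration: (1) a self-annotator generates candidate examples; (2) a discriminator filters them using the SNOMED CT ontology; (3) a predictor uses entity descriptions for accurate inference. Combining ontology-guided reasoning with multi-agent collaboration yields near-supervised performance in the zero-shot setting.
Link: https://arxiv.org/abs/2511.15211
Authors: Xinli Tao, Xin Dong, Xuezhong Zhou
Affiliations: Beijing Jiaotong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 4 figures, 4 tables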
Abstract:Clinical named entity recognition (NER) is crucial for extracting information from electronic health records (EHRs), but supervised models like CRF and BioClinicalBERT require costly annotated data. While zero-shot NER with large language models (LLMs) reduces this dependency, it struggles with example selection granularity and integrating prompts with self-improvement. To address this, we propose OEMA, a zero-shot clinical NER framework using multi-agent collaboration. OEMA’s three components are: a self-annotator generating examples, a discriminator filtering them via SNOMED CT, and a predictor using entity descriptions for accurate inference. On MTSamples and VAERS datasets, OEMA achieves state-of-the-art exact-match performance. Under related-match, it matches supervised BioClinicalBERT and surpasses CRF. OEMA addresses key zero-shot NER challenges through ontology-guided reasoning and multi-agent collaboration, achieving near-supervised performance and showing promise for clinical NLP applications.
[NLP-26] Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story
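[Quick read]: This paper addresses the limited understanding of which textual properties determine intrinsic dimension (ID) in analyses of large language models (LLMs), with little systematic work at the level of semantic structure, genre differences, and interpretable features. The approach combines cross-encoder analysis, linguistic features, and sparse autoencoders (SAEs) to ground ID in interpretable text properties. It finds that ID is complementary to entropy-based metrics and stratifies consistently by genre (scientific writing ID ≈ 8, encyclopedic content ≈ 9, creative/opinion writing ≈ 10.5). It further identifies causal features: formal tone, report templates, and statistics lower ID, whereas personalization, emotion, and narrative raise it, and steering experiments confirm that the effects are causal.
Link: https://arxiv.org/abs/2511.15210
Authors: Vladislav Pedashenko, Laida Kushnareva, Yana Khassan Nibal, Eduard Tulchinskii, Kristian Kuznetsov, Vladislav Zharchinskii, Yury Maximov, Irina Piontkovskaya
Affiliations: Moscow State University; Lomonosov Research Institute; Interdata Astana
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: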
Abstract:Intrinsic dimension (ID) is an important tool in modern LLM analysis, informing studies of training dynamics, scaling behavior, and dataset structure, yet its textual determinants remain underexplored. We provide the first comprehensive study grounding ID in interpretable text properties through cross-encoder analysis, linguistic features, and sparse autoencoders (SAEs). In this work, we establish three key findings. First, ID is complementary to entropy-based metrics: after controlling for length, the two are uncorrelated, with ID capturing geometric complexity orthogonal to prediction quality. Second, ID exhibits robust genre stratification: scientific prose shows low ID (~8), encyclopedic content medium ID (~9), and creative/opinion writing high ID (~10.5) across all models tested. This reveals that contemporary LLMs find scientific text “representationally simple” while fiction requires additional degrees of freedom. Third, using SAEs, we identify causal features: scientific signals (formal tone, report templates, statistics) reduce ID; humanized signals (personalization, emotion, narrative) increase it. Steering experiments confirm these effects are causal. Thus, for contemporary models, scientific writing appears comparatively “easy”, whereas fiction, opinion, and affect add representational degrees of freedom. Our multi-faceted analysis provides practical guidance for the proper use of ID and the sound interpretation of ID-based results.
[NLP-27] HinTel-AlignBench: A Framework and Benchmark for Hindi-Telugu with English-Aligned Samples
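[Quick read]: This paper addresses four limitations of current evaluations of multilingual vision-language models (VLMs): reliance on unverified automatic translation, narrow task and domain coverage, small sample sizes, and a lack of culturally relevant, natively sourced question-answering (QA) data. It proposes a scalable evaluation framework and uses it to build the HinTel-AlignBench benchmark, which draws on diverse sources in Hindi and Telugu with about 4,000 QA pairs per language. The key elements are: (1) a semi-automated data creation pipeline combining back-translation, filtering, and human verification; (2) adapted English datasets (VQAv2, RealWorldQA, CLEVR-Math) together with native Indic datasets (JEE for STEM, VAANI for cultural grounding); (3) a systematic performance analysis of state-of-the-art open-weight and closed-source VLMs, which shows a regression relative to English on 4 of 5 tasks (on average 8.3 points in Hindi and 5.5 points in Telugu) and categorizes common failure modes to point out concrete areas for improvement.
Link: https://arxiv.org/abs/2511.15183
Authors: Rishikant Chigrupaatii, Ponnada Sai Tulasi Kanishka, Lalit Chandra Routhu, Martin Patel Sama Supratheek Reddy, Divyam Gupta, Dasari Srikar, Krishna Teja Kuchimanchi, Rajiv Misra, Rohun Tripathi
Affiliations: unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: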
Abstract:With nearly 1.5 billion people and more than 120 major languages, India represents one of the most diverse regions in the world. As multilingual Vision-Language Models (VLMs) gain prominence, robust evaluation methodologies are essential to drive progress toward equitable AI for low-resource languages. Current multilingual VLM evaluations suffer from four major limitations: reliance on unverified auto-translations, narrow task/domain coverage, limited sample sizes, and lack of cultural and natively sourced Question-Answering (QA). To address these gaps, we present a scalable framework to evaluate VLMs in Indian languages and compare it with performance in English. Using the framework, we generate HinTel-AlignBench, a benchmark that draws from diverse sources in Hindi and Telugu with English-aligned samples. Our contributions are threefold: (1) a semi-automated dataset creation framework combining back-translation, filtering, and human verification; (2) the most comprehensive vision-language benchmark for Hindi and Telugu, including adapted English datasets (VQAv2, RealWorldQA, CLEVR-Math) and native novel Indic datasets (JEE for STEM, VAANI for cultural grounding) with approximately 4,000 QA pairs per language; and (3) a detailed performance analysis of various State-of-the-Art (SOTA) open-weight and closed-source VLMs. We find a regression in performance for tasks in English versus in Indian languages for 4 out of 5 tasks across all the models, with an average regression of 8.3 points in Hindi and 5.5 points for Telugu. We categorize common failure modes to highlight concrete areas of improvement in multilingual multimodal understanding.
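[NLP-28] Teaching According to Students' Aptitude: Personalized Mathematics Tutoring via Persona-, Memory-, and Forgetting-Aware LLMs AAAI2026
[Quick read]: This paper addresses the difficulty that current LLM-based intelligent tutoring systems have in tracking how a student's knowledge evolves during mathematics tutoring, in particular their weak modeling of proficiency, conceptual gaps, and forgetting patterns. The key to the solution is TASA (Teaching According to Students' Aptitude), a framework that integrates a student persona, an event memory, and forgetting dynamics into a personalized tutoring system that keeps each student's mastery state up to date. Its core mechanism combines a continuous forgetting curve with knowledge tracing, which enables difficulty-calibrated question generation and explanations. The authors report better learning outcomes and more adaptive tutoring than representative baselines.
Link: https://arxiv.org/abs/2511.15163
Authors: Yang Wu, Rujing Yao, Tong Zhang, Yufei Shi, Zhuoren Jiang, Zhushan Li, Xiaozhong Liu
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: AAAI 2026 Workshop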
Abstract:Large Language Models (LLMs) are increasingly integrated into intelligent tutoring systems to provide human-like and adaptive instruction. However, most existing approaches fail to capture how students’ knowledge evolves dynamically across their proficiencies, conceptual gaps, and forgetting patterns. This challenge is particularly acute in mathematics tutoring, where effective instruction requires fine-grained scaffolding precisely calibrated to each student’s mastery level and cognitive retention. To address this issue, we propose TASA (Teaching According to Students’ Aptitude), a student-aware tutoring framework that integrates persona, memory, and forgetting dynamics for personalized mathematics learning. Specifically, TASA maintains a structured student persona capturing proficiency profiles and an event memory recording prior learning interactions. By incorporating a continuous forgetting curve with knowledge tracing, TASA dynamically updates each student’s mastery state and generates contextually appropriate, difficulty-calibrated questions and explanations. Empirical results demonstrate that TASA achieves superior learning outcomes and more adaptive tutoring behavior compared to representative baselines, underscoring the importance of modeling temporal forgetting and learner profiles in LLM-based tutoring systems.
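The abstract does not give the functional form of the forgetting curve. As a minimal sketch of the idea, the code below assumes an exponential (half-life) decay of mastery and a simple threshold rule for question difficulty. Both the decay form and the thresholds are assumptions for illustration, not TASA's actual model.

```python
def decayed_mastery(mastery: float, hours_elapsed: float, half_life_hours: float) -> float:
    """Assumed exponential forgetting: mastery halves every half_life_hours."""
    return mastery * 0.5 ** (hours_elapsed / half_life_hours)


def pick_difficulty(mastery: float) -> str:
    """Hypothetical threshold rule mapping estimated mastery to question difficulty."""
    if mastery < 0.4:
        return "easy"
    if mastery < 0.7:
        return "medium"
    return "hard"


# A student who mastered fractions at 0.8 but has not practiced for two days.
current = decayed_mastery(0.8, hours_elapsed=48, half_life_hours=24)
print(round(current, 2), pick_difficulty(current))
```

The point of the sketch is only that the same stored mastery value leads to different questions depending on how long ago the skill was last practiced.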
[NLP-29] Generating Natural-Language Surgical Feedback: From Structured Representation to Domain-Grounded Evaluation ML4H2025
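[Quick read]: This paper addresses the difficulty of providing high-quality, timely feedback at scale in surgical training; the core challenge is building automated feedback models that understand clinically relevant representations. The key to the solution is a structure-aware pipeline: Instrument-Action-Target (IAT) triplets are mined from real trainer-to-trainee transcripts and clustered into normalized categories; a video-to-IAT model is fine-tuned, conditioned on the surgical procedure and task context and on fine-grained temporal instrument motion; finally, the IAT triplets are supplied to GPT-4o to guide generation of clinically grounded, trainer-style feedback. The method gives consistent AUC gains for video-to-IAT recognition and raises the fidelity score of generated feedback by 12.4%, while making the output easier to verify and audit.
Link: https://arxiv.org/abs/2511.15159
Authors: Firdavs Nasriddinov, Rafal Kocielnik, Anima Anandkumar, Andrew J. Hung
Affiliations: California Institute of Technology; Cedars-Sinai Medical Center
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted as proceedings paper for ML4H 2025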
Abstract:High-quality intraoperative feedback from a surgical trainer is pivotal for improving trainee performance and long-term skill acquisition. Automating natural, trainer-style feedback promises timely, accessible, and consistent guidance at scale but requires models that understand clinically relevant representations. We present a structure-aware pipeline that learns a surgical action ontology from real trainer-to-trainee transcripts (33 surgeries) and uses it to condition feedback generation. We contribute by (1) mining Instrument-Action-Target (IAT) triplets from real-world feedback text and clustering surface forms into normalized categories, (2) fine-tuning a video-to-IAT model that leverages the surgical procedure and task contexts as well as fine-grained temporal instrument motion, and (3) demonstrating how to effectively use IAT triplet representations to guide GPT-4o in generating clinically grounded, trainer-style feedback. We show that, on Task 1: Video-to-IAT recognition, our context injection and temporal tracking deliver consistent AUC gains (Instrument: 0.67 to 0.74; Action: 0.60 to 0.63; Tissue: 0.74 to 0.79). For Task 2: feedback text generation (rated on a 1-5 fidelity rubric where 1 = opposite/unsafe, 3 = admissible, and 5 = perfect match to a human trainer), GPT-4o from video alone scores 2.17, while IAT conditioning reaches 2.44 (+12.4%), doubling the share of admissible generations with score = 3 from 21% to 42%. Traditional text-similarity metrics also improve: word error rate decreases by 15-31% and ROUGE (phrase/substring overlap) increases by 9-64%. Grounding generation in explicit IAT structure improves fidelity and yields clinician-verifiable rationales, supporting auditable use in surgical training.
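The headline numbers for Task 2 can be reproduced from the scores quoted in the abstract. The snippet below is just that arithmetic; the variable names are chosen for this example.

```python
video_only = 2.17        # mean fidelity score, GPT-4o from video alone
iat_conditioned = 2.44   # mean fidelity score with IAT conditioning

relative_gain = (iat_conditioned - video_only) / video_only
print(f"relative fidelity gain: {relative_gain:.1%}")  # prints "relative fidelity gain: 12.4%"

# Share of generations rated admissible (score = 3) before and after.
share_before, share_after = 0.21, 0.42
print(f"admissible share ratio: {share_after / share_before:.1f}x")
```

Note that +12.4% is a relative gain on a 1-5 scale; the absolute improvement is 0.27 points, and both means remain below the admissible score of 3.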
[NLP-30] Knowledge-Informed Automatic Feature Extraction via Collaborative Large Language Model Agents
Link: https://arxiv.org/abs/2511.15074
Authors: Henrik Bradland, Morten Goodwin, Vladimir I. Zadorozhny, Per-Arne Andersen
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 19 pages, 4 figures, in review
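[NLP-31] ProRAC: A Neuro-symbolic Method for Reasoning about Actions with LLM-based Progression
[Quick read]: This paper addresses Reasoning about Actions and Change (RAC): inferring the final state of a dynamic environment after a sequence of actions and answering queries about it. The key to its solution is ProRAC (Progression-based Reasoning about Actions and Change), a progression-based neuro-symbolic framework that uses a large language model (LLM) to extract the basic RAC elements of a problem (such as actions and queries), executes each action step by step to derive the final state, and then evaluates the query against that state to produce the answer. The method performs strongly on several RAC benchmarks, which supports its generalization across domains, LLM backbones, and task types.
Link: https://arxiv.org/abs/2511.15069
Authors: Haoyong Wu, Yongmei Liu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: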
Abstract:In this paper, we propose ProRAC (Progression-based Reasoning about Actions and Change), a neuro-symbolic framework that leverages LLMs to tackle RAC problems. ProRAC extracts fundamental RAC elements including actions and questions from the problem, progressively executes each action to derive the final state, and then evaluates the query against the progressed state to arrive at an answer. We evaluate ProRAC on several RAC benchmarks, and the results demonstrate that our approach achieves strong performance across different benchmarks, domains, LLM backbones, and types of RAC tasks.
[NLP-32] Evaluating Multimodal Large Language Models on Vertically Written Japanese Text
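[Quick read]: This paper addresses the marked performance drop of current multimodal large language models (MLLMs) on vertically written Japanese text. Some Japanese documents are typeset vertically, and existing models read such text poorly, which limits them on visual document understanding tasks. The key to the solution is a synthetic Japanese optical character recognition (OCR) dataset containing images of Japanese text in both horizontal and vertical writing, used for model fine-tuning and evaluation, together with an evaluation set built from real-world document images to check performance in practice. Experiments show that training on the synthetic dataset improves recognition of vertical Japanese text, especially for models that previously could not handle it.
Link: https://arxiv.org/abs/2511.15059
Authors: Keito Sasagawa, Shuhei Kurita, Daisuke Kawahara
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 17 pages, 8 figures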
Abstract:Multimodal Large Language Models (MLLMs) have seen rapid advances in recent years and are now being applied to visual document understanding tasks. They are expected to process a wide range of document images across languages, including Japanese. Understanding documents from images requires models to read what is written in them. Since some Japanese documents are written vertically, support for vertical writing is essential. However, research specifically focused on vertically written Japanese text remains limited. In this study, we evaluate the reading capability of existing MLLMs on vertically written Japanese text. First, we generate a synthetic Japanese OCR dataset by rendering Japanese texts into images, and use it for both model fine-tuning and evaluation. This dataset includes Japanese text in both horizontal and vertical writing. We also create an evaluation dataset sourced from real-world document images containing vertically written Japanese text. Using these datasets, we demonstrate that existing MLLMs perform worse on vertically written Japanese text than on horizontally written Japanese text. Furthermore, we show that training MLLMs on our synthesized Japanese OCR dataset improves the performance of models that previously could not handle vertical writing. The datasets and code are publicly available at this https URL.
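[NLP-33] Mathematical Analysis of Hallucination Dynamics in Large Language Models: Uncertainty Quantification, Advanced Decoding, and Principled Mitigation
[Quick read]: This paper addresses hallucination in large language models (LLMs): outputs that sound plausible but are factually wrong or unsupported. The key to its solution is a mathematically grounded framework that combines probabilistic modeling, information theory, trigonometric signal analysis, and Bayesian uncertainty estimation to analyze how errors accumulate during autoregressive generation. It proposes refined uncertainty metrics (including semantic-aware and phase-aware variants) and principled mitigation strategies such as contrastive decoding, retrieval-augmented grounding, factual alignment, and abstention, aiming at safer and more reliable LLM output.
Link: https://arxiv.org/abs/2511.15005
Authors: Moses Kiprono
Affiliations: Catholic University of America
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, theoretical/mathematical LLM research, no figures, intended for peer-reviewed journal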
Abstract:Large Language Models (LLMs) are powerful linguistic engines but remain susceptible to hallucinations: plausible-sounding outputs that are factually incorrect or unsupported. In this work, we present a mathematically grounded framework to understand, measure, and mitigate these hallucinations. Drawing on probabilistic modeling, information theory, trigonometric signal analysis, and Bayesian uncertainty estimation, we analyze how errors compound autoregressively, propose refined uncertainty metrics, including semantic and phase-aware variants, and develop principled mitigation strategies such as contrastive decoding, retrieval-augmented grounding, factual alignment, and abstention. This unified lens connects recent advances in calibration, retrieval, and alignment to support safer and more reliable LLMs.
[NLP-34] How to Train Private Clinical Language Models: A Comparative Study of Privacy-Preserving Pipelines for ICD-9 Coding
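[Quick read]: This paper addresses the risk that large language models trained on clinical text leak sensitive patient information, while keeping diagnostic accuracy high enough for deployment. The core challenge is maximizing utility under a privacy guarantee. The key to the solution is knowledge distillation: a teacher model trained with differential privacy (DP) guides a student model. Compared with applying differentially private stochastic gradient descent (DP-SGD) directly or training on DP-synthetic data, this strategy improves diagnostic coding performance at moderate and relaxed privacy budgets (ε ∈ {4, 6}), recovering up to 63% of the non-private model's performance while keeping strong empirical privacy (membership-inference AUC ≈ 0.5). The results expose large differences in the privacy-utility trade-off across privacy-preserving architectures and identify knowledge distillation as the most practical route to privacy-preserving clinical natural language processing (NLP).
Link: https://arxiv.org/abs/2511.14936
Authors: Mathieu Dufour, Andrew Duncan
Affiliations: Imperial College London
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 10 pages, 5 figures. Accepted to the Privacy-Preserving Machine Learning Workshop at EurIPS 2025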
Abstract:Large language models trained on clinical text risk exposing sensitive patient information, yet differential privacy (DP) methods often severely degrade the diagnostic accuracy needed for deployment. Despite rapid progress in DP optimisation and text generation, it remains unclear which privacy-preserving strategy actually works best for clinical language tasks. We present the first systematic head-to-head comparison of four training pipelines for automated diagnostic coding from hospital discharge summaries. All pipelines use identical 1B-parameter models and matched privacy budgets to predict ICD-9 codes. At moderate and relaxed privacy budgets (ε ∈ {4, 6}), knowledge distillation from DP-trained teachers outperforms both direct DP-SGD and DP-synthetic data training, recovering up to 63% of the non-private performance whilst maintaining strong empirical privacy (membership-inference AUC ≈ 0.5). These findings expose large differences in the privacy-utility trade-off across architectures and identify knowledge distillation as the most practical route to privacy-preserving clinical NLP.
[NLP-35] Skin-R1: Toward Trustworthy Clinical Reasoning for Dermatological Diagnosis
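[Quick read]: This paper addresses three limitations of vision-language models (VLMs) in dermatological diagnosis: data heterogeneity, which leaves diagnostic labels and clinical concept annotations inconsistent; the absence of grounded diagnostic rationales; and weak generalization when models trained on small, densely annotated datasets are transferred to large, sparsely annotated ones. The key to the solution is SkinR1, a dermatological VLM that combines deep textbook-based reasoning with the broad generalization of reinforcement learning (RL). It has three core components. First, a textbook-based reasoning generator builds high-fidelity, hierarchy-aware reasoning trajectories informed by differential diagnosis (DDx), giving reliable expert-level supervision. Second, these trajectories are used for supervised fine-tuning (SFT), which gives the model grounded reasoning ability. Third, a new RL paradigm that incorporates the disease hierarchy transfers these grounded reasoning patterns to large-scale sparse datasets, improving diagnostic accuracy and establishing a solid reasoning foundation.
Link: https://arxiv.org/abs/2511.14900
Authors: Zehao Liu, Wejieying Ren, Jipeng Zhang, Tianxiang Zhao, Jingxi Zhu, Xiaoting Li, Vasant G. Honavar
Affiliations: Pennsylvania State University; Stanford University; Hong Kong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: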
Abstract:The emergence of vision-language models (VLMs) has opened new possibilities for clinical reasoning and has shown promising performance in dermatological diagnosis. However, their trustworthiness and clinical utility are often limited by three major factors: (1) Data heterogeneity, where diverse datasets lack consistent diagnostic labels and clinical concept annotations; (2) Absence of grounded diagnostic rationales, leading to a scarcity of reliable reasoning supervision; and (3) Limited scalability and generalization, as models trained on small, densely annotated datasets struggle to transfer nuanced reasoning to large, sparsely-annotated ones. To address these limitations, we propose SkinR1, a novel dermatological VLM that combines deep, textbook-based reasoning with the broad generalization capabilities of reinforcement learning (RL). SkinR1 systematically resolves the key challenges through a unified, end-to-end framework. First, we design a textbook-based reasoning generator that synthesizes high-fidelity, hierarchy-aware, and differential-diagnosis (DDx)-informed trajectories, providing reliable expert-level supervision. Second, we leverage the constructed trajectories for supervised fine-tuning (SFT) empowering the model with grounded reasoning ability. Third, we develop a novel RL paradigm that, by incorporating the hierarchical structure of diseases, effectively transfers these grounded reasoning patterns to large-scale, sparse data. Extensive experiments on multiple dermatology datasets demonstrate that SkinR1 achieves superior diagnostic accuracy. The ablation study demonstrates the importance of the reasoning foundation instilled by SFT. 
[NLP-36] Hierarchical Token Prepending: Enhancing Information Flow in Decoder-based LLM Embeddings
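[Quick read]: This paper addresses the restricted information flow caused by the causal attention mechanism when large language models (LLMs) produce text embeddings: information cannot pass from later tokens to earlier ones, which lowers representation quality. For long documents in particular, existing methods that prepend a single summary token over-compress information and hurt performance. The key to the solution is Hierarchical Token Prepending (HTP). First, the input sequence is partitioned into blocks and block-level summary tokens are prepended to each subsequent block, creating multiple pathways for backward information flow and easing attention-level compression. Second, last-token pooling is replaced with mean-pooling, a choice supported by theoretical analysis, to ease readout-level over-squashing. Together these improve embedding quality in long-context settings, with consistent gains on 11 retrieval datasets and 30 general embedding benchmarks.
Link: https://arxiv.org/abs/2511.14868
Authors: Xueying Ding, Xingyue Huang, Mingxuan Ju, Liam Collins, Yozen Liu, Leman Akoglu, Neil Shah, Tong Zhao
Affiliations: Carnegie Mellon University; University of Oxford; Snap Inc.
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: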
Abstract:Large language models produce powerful text embeddings, but their causal attention mechanism restricts the flow of information from later to earlier tokens, degrading representation quality. While recent methods attempt to solve this by prepending a single summary token, they over-compress information, hence harming performance on long documents. We propose Hierarchical Token Prepending (HTP), a method that resolves two critical bottlenecks. To mitigate attention-level compression, HTP partitions the input into blocks and prepends block-level summary tokens to subsequent blocks, creating multiple pathways for backward information flow. To address readout-level over-squashing, we replace last-token pooling with mean-pooling, a choice supported by theoretical analysis. HTP achieves consistent performance gains across 11 retrieval datasets and 30 general embedding benchmarks, especially in long-context settings. As a simple, architecture-agnostic method, HTP enhances both zero-shot and finetuned models, offering a scalable route to superior long-document embeddings.
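The readout change described in the abstract above (last-token pooling replaced by mean-pooling) can be illustrated with a minimal sketch in plain Python. The helper names `last_token_pool` and `mean_pool` and the toy hidden states are assumptions made for illustration; the sketch shows only the two pooling rules and does not implement HTP's block-level summary tokens.

```python
from typing import List

Vector = List[float]


def last_token_pool(hidden_states: List[Vector]) -> Vector:
    """Use only the final token's hidden state as the embedding."""
    if not hidden_states:
        raise ValueError("need at least one token")
    return list(hidden_states[-1])


def mean_pool(hidden_states: List[Vector]) -> Vector:
    """Average every token's hidden state, dimension by dimension."""
    if not hidden_states:
        raise ValueError("need at least one token")
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[d] for h in hidden_states) / n for d in range(dim)]


# Toy per-token hidden states (3 tokens, 2 dimensions); values are made up.
states = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
last = last_token_pool(states)
mean = mean_pool(states)
```

With last-token pooling the whole sequence has to be squeezed into one position, while mean-pooling lets every token contribute directly to the embedding, which is the intuition behind the readout-level argument in the abstract.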
[NLP-37] Empowering Multi-Turn Tool-Integrated Reasoning with Group Turn Policy Optimization
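[Quick read]: This paper addresses the difficulty of training large language models (LLMs) for multi-turn Tool-Integrated Reasoning (TIR) with current reinforcement learning (RL) methods. Methods such as Group Relative Policy Optimization (GRPO) use coarse-grained, trajectory-level rewards that give too little learning signal for complex multi-turn interaction, so training stagnates. The key to the solution is a new RL algorithm, Group Turn Policy Optimization (GTPO), with three core innovations: (1) turn-level reward assignment, which gives fine-grained feedback for each turn; (2) return-based advantage estimation, which uses normalized discounted returns as advantages; and (3) self-supervised reward shaping, which uses self-supervision signals from the generated code to densify sparse binary outcome rewards, improving learning efficiency and stability.
Link: https://arxiv.org/abs/2511.14846
Authors: Yifeng Ding, Hung Le, Songyang Han, Kangrui Ruan, Zhenghui Jin, Varun Kumar, Zijian Wang, Anoop Deoras
Affiliations: University of Illinois Urbana-Champaign; AWS AI Labs; Meta
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: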
Abstract:Training Large Language Models (LLMs) for multi-turn Tool-Integrated Reasoning (TIR) - where models iteratively reason, generate code, and verify through execution - remains challenging for existing reinforcement learning (RL) approaches. Current RL methods, exemplified by Group Relative Policy Optimization (GRPO), suffer from coarse-grained, trajectory-level rewards that provide insufficient learning signals for complex multi-turn interactions, leading to training stagnation. To address this issue, we propose Group Turn Policy Optimization (GTPO), a novel RL algorithm specifically designed for training LLMs on multi-turn TIR tasks. GTPO introduces three key innovations: (1) turn-level reward assignment that provides fine-grained feedback for individual turns, (2) return-based advantage estimation where normalized discounted returns are calculated as advantages, and (3) self-supervised reward shaping that exploits self-supervision signals from generated code to densify sparse binary outcome-based rewards. Our comprehensive evaluation demonstrates that GTPO outperforms GRPO by 3.0% on average across diverse reasoning benchmarks, establishing its effectiveness for advancing complex mathematical reasoning in the real world.
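The second idea in the abstract above, using normalized discounted returns as advantages, can be sketched roughly as follows. This is an illustration under stated assumptions and not the authors' implementation: the function names, the discount factor `gamma`, and the choice to normalize over all turns of all trajectories in a group are hypothetical, since the abstract does not give these details.

```python
from statistics import mean, pstdev
from typing import List


def discounted_returns(turn_rewards: List[float], gamma: float) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every turn of one trajectory."""
    returns = [0.0] * len(turn_rewards)
    running = 0.0
    for t in range(len(turn_rewards) - 1, -1, -1):
        running = turn_rewards[t] + gamma * running
        returns[t] = running
    return returns


def group_turn_advantages(
    group_rewards: List[List[float]], gamma: float = 0.9, eps: float = 1e-8
) -> List[List[float]]:
    """Normalize per-turn returns with the group's mean and standard deviation.

    Assumption: statistics are pooled over every turn of every trajectory
    sampled for the same prompt.
    """
    all_returns = [discounted_returns(r, gamma) for r in group_rewards]
    flat = [g for traj in all_returns for g in traj]
    mu = mean(flat)
    sigma = pstdev(flat)
    return [[(g - mu) / (sigma + eps) for g in traj] for traj in all_returns]


# Two sampled trajectories with per-turn rewards (made-up values):
# the first succeeds on its last turn, the second never does.
advantages = group_turn_advantages([[0.0, 1.0], [0.0, 0.0]], gamma=1.0)
```

Each turn receives its own advantage, so an early turn in a successful trajectory is credited through the discounted return instead of sharing one trajectory-level score.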
[NLP-38] Opinion Mining and Analysis Using Hybrid Deep Neural Networks
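[Quick read]: This paper addresses three challenges in sentiment analysis: weak capture of contextual semantic nuance, limited scalability on large datasets, and classification bias caused by class imbalance. The key to the solution is a hybrid deep neural network architecture, HBGRU-LSTM (hybrid BGRU-LSTM), which combines bidirectional gated recurrent unit (BGRU) and long short-term memory (LSTM) layers to better model semantic relationships in text. With dataset balancing, recall for the negative sentiment class rises from 86% to 96% and overall classification becomes more stable, giving sentiment classification that is more accurate, more equitable, and better at generalizing.
Link: https://arxiv.org/abs/2511.14796
Authors: Adel Hidri, Suleiman Ali Alsaif, Muteeb Alahmari, Eman AlShehri, Minyar Sassi Hidri
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 22 pages, 4 figures, 11 tables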
Abstract:Understanding customer attitudes has become a critical component of decision-making due to the growing influence of social media and e-commerce. Text-based opinions are the most structured, hence playing an important role in sentiment analysis. Most of the existing methods, which include lexicon-based approaches and traditional machine learning techniques, are insufficient for handling contextual nuances and scalability. While the latter has limitations in model performance and generalization, deep learning (DL) has achieved improvement, especially on semantic relationship capturing with recurrent neural networks (RNNs) and convolutional neural networks (CNNs). The aim of the study is to enhance opinion mining by introducing a hybrid deep neural network model that combines a bidirectional gated recurrent unit (BGRU) and long short-term memory (LSTM) layers to improve sentiment analysis, particularly addressing challenges such as contextual nuance, scalability, and class imbalance. To substantiate the efficacy of the proposed model, we conducted comprehensive experiments utilizing benchmark datasets, encompassing IMDB movie critiques and Amazon product evaluations. The introduced hybrid BGRULSTM (HBGRU-LSTM) architecture attained a testing accuracy of 95%, exceeding the performance of traditional DL frameworks such as LSTM (93.06%), CNN+LSTM (93.31%), and GRU+LSTM (92.20%). Moreover, our model exhibited a noteworthy enhancement in recall for negative sentiments, escalating from 86% (unbalanced dataset) to 96% (balanced dataset), thereby ensuring a more equitable and just sentiment classification. Furthermore, the model diminished misclassification loss from 20.24% for unbalanced to 13.3% for balanced dataset, signifying enhanced generalization and resilience.
[NLP-39] Human or LLM as Standardized Patients? A Comparative Study for Medical Education
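[Quick read]: This paper addresses the high cost, inflexibility, and poor scalability of Standardized Patients (SP) in clinical skills training. Existing large-language-model (LLM)-based SP simulators are cheaper but behave inconsistently and lack rigorous comparison with human SPs. The key to the solution is the EasyMED framework, made up of three agents: a Patient Agent that generates realistic dialogue, an Auxiliary Agent that ensures factual consistency, and an Evaluation Agent that provides actionable feedback. The authors also build the SPBench benchmark, covering 14 specialties and eight expert-defined evaluation criteria, for systematic assessment. Experiments show that EasyMED matches human SPs in learning outcomes and produces larger skill gains for lower-baseline students, while improving flexibility, psychological safety, and cost efficiency.
Link: https://arxiv.org/abs/2511.14783
Authors: Bingquan Zhang, Xiaoxiao Liu, Yuchi Wang, Lei Zhou, Qianqian Xie, Benyou Wang
Affiliations: Wuhan University; The Chinese University of Hong Kong, Shenzhen; Freedom AI
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 10 pages, 9 figures, 8 tables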
Abstract:Standardized Patients (SP) are indispensable for clinical skills training but remain expensive, inflexible, and difficult to scale. Existing large-language-model (LLM)-based SP simulators promise lower cost yet show inconsistent behavior and lack rigorous comparison with human SP. We present EasyMED, a multi-agent framework combining a Patient Agent for realistic dialogue, an Auxiliary Agent for factual consistency, and an Evaluation Agent that delivers actionable feedback. To support systematic assessment, we introduce SPBench, a benchmark of real SP-doctor interactions spanning 14 specialties and eight expert-defined evaluation criteria. Experiments demonstrate that EasyMED matches human SP learning outcomes while producing greater skill gains for lower-baseline students and offering improved flexibility, psychological safety, and cost efficiency.
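[NLP-40] The Impact of Prosodic Segmentation on Speech Synthesis of Spontaneous Speech
[Quick read]: This paper addresses the difficulty of rendering spontaneous speech naturally in speech synthesis, including conversational flow, turn-taking, pauses, and disfluencies. The key to the solution is introducing explicit prosodic segmentation annotations. The authors train the non-autoregressive model FastSpeech 2 with manual and with automatic annotations and compare the two. Manual prosodic segmentation introduces more variability but improves naturalness and intelligibility, and it reproduces pre-nuclear contours closer to natural speech, which supports the value of explicit prosodic structure annotation for spontaneous speech synthesis.
Link: https://arxiv.org/abs/2511.14779
Authors: Julio Cesar Galdino, Sidney Evaldo Leal, Leticia Gabriella De Souza, Rodrigo de Freitas Lima, Antonio Nelson Fornari Mendes Moreira, Arnaldo Candido Junior, Miguel Oliveira Jr., Edresson Casanova, Sandra M. Aluísio
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: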
Abstract:Spontaneous speech presents several challenges for speech synthesis, particularly in capturing the natural flow of conversation, including turn-taking, pauses, and disfluencies. Although speech synthesis systems have made significant progress in generating natural and intelligible speech, primarily through architectures that implicitly model prosodic features such as pitch, intensity, and duration, the construction of datasets with explicit prosodic segmentation and their impact on spontaneous speech synthesis remains largely unexplored. This paper evaluates the effects of manual and automatic prosodic segmentation annotations in Brazilian Portuguese on the quality of speech synthesized by a non-autoregressive model, FastSpeech 2. Experimental results show that training with prosodic segmentation produced slightly more intelligible and acoustically natural speech. While automatic segmentation tends to create more regular segments, manual prosodic segmentation introduces greater variability, which contributes to more natural prosody. Analysis of neutral declarative utterances showed that both training approaches reproduced the expected nuclear accent pattern, but the prosodic model aligned more closely with natural pre-nuclear contours. To support reproducibility and future research, all datasets, source codes, and trained models are publicly available under the CC BY-NC-ND 4.0 license.
[NLP-41] COMPASS: Context-Modulated PID Attention Steering System for Hallucination Mitigation
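[Quick read]: This paper addresses factual errors that large language models (LLMs) produce during generation even when relevant evidence is available, a failure rooted in how the model balances attention between contextual and parametric knowledge. The key to the solution is COMPASS (Context-Modulated PID Attention Steering System), a lightweight, interpretable control framework that embeds a model-based feedback loop directly in decoding. It uses a transparent metric, the Context Reliance Score (CRS), to monitor online how much attention heads rely on evidence, and a PID controller to modulate attention dynamically, keeping generated content factually consistent without retraining or multi-pass decoding.
Link: https://arxiv.org/abs/2511.14776
Authors: Snigdha Pandya, Rohan Nagale, Kenji Sahay, Anna Lin, Shikhar Shiromani, Kevin Zhu, Dev Sunishchal
Affiliations: Algoverse; Georgia Institute of Technology; University of California, Berkeley
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, 6 figures including algorithms, 2 tables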
Abstract:Large language models (LLMs) often generate fluent but factually incorrect statements despite having access to relevant evidence, a failure mode rooted in how they allocate attention between contextual and parametric knowledge. Understanding and steering this internal behavior is key both for trustworthy deployment and for scientific interpretability of model mechanisms. We introduce COMPASS (Context-Modulated PID Attention Steering System), a lightweight, interpretable control framework that embeds a model-based feedback loop directly within decoding. COMPASS quantifies context reliance via a transparent metric, the Context Reliance Score (CRS), which serves as an online probe of how attention heads ground generation in evidence. Using this interpretable signal, a PID controller dynamically modulates attention heads to maintain factual consistency without retraining or multi-pass decoding. Across benchmarks (HotpotQA, XSum, HaluEval, RAGTruth), COMPASS consistently reduces contextual hallucination rates (2.8 to 5.8 percent absolute) while revealing how distinct attention heads contribute to evidence alignment. These results highlight feedback-driven interpretability as a pathway toward scientific understanding of LLM behavior.
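The feedback loop named in the abstract above is built around a PID controller. The sketch below shows a generic discrete PID update driving a scalar signal toward a target; it is not the COMPASS implementation. The gains, the target value, and the toy assumption that the control output is simply added to a scalar "context reliance" value are all hypothetical, since the abstract does not say how CRS is computed or how attention heads are rescaled.

```python
from dataclasses import dataclass


@dataclass
class PIDController:
    """Discrete PID controller with a unit time step."""

    kp: float  # proportional gain
    ki: float  # integral gain
    kd: float  # derivative gain
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, target: float, measured: float) -> float:
        """Return the control output for one decoding step."""
        error = target - measured
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Toy loop: pretend the control output directly shifts a scalar
# context-reliance signal (hypothetical stand-in for CRS).
controller = PIDController(kp=0.5, ki=0.1, kd=0.0)
crs = 0.2
for _ in range(50):
    crs += controller.step(target=0.6, measured=crs)
```

The proportional term reacts to the current gap, the integral term removes persistent offset, and the derivative term damps rapid changes; in a real system the output would adjust attention weights rather than the measured signal itself.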
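[NLP-42] LiveCLKTBench: Towards Reliable Evaluation of Cross-Lingual Knowledge Transfer in Multilingual LLMs
[Quick read]: This paper addresses the difficulty of evaluating cross-lingual knowledge transfer in large language models (LLMs): it is hard to tell whether a correct answer in the target language comes from genuine cross-lingual transfer or from prior exposure to the knowledge during pre-training. The key to the solution is LiveCLKTBench, an automated generation pipeline that identifies self-contained, time-sensitive knowledge entities from the real world, filters them by time of occurrence, verifies them against the model's knowledge, then generates factual questions and translates them into multiple languages to evaluate cross-lingual transfer. The approach isolates the genuine transfer signal and gives a quantifiable, repeatable benchmark for multilingual knowledge transfer.
Link: https://arxiv.org/abs/2511.14774
Authors: Pei-Fu Guo, Yun-Da Tsai, Chun-Chia Hsu, Kai-Xin Chen, Ya-An Tsai, Kai-Wei Chang, Nanyun Peng, Mi-Yen Yeh, Shou-De Lin
Affiliations: National Taiwan University; University of California, Los Angeles; Academia Sinica, Taiwan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: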
Abstract:Evaluating cross-lingual knowledge transfer in large language models is challenging, as correct answers in a target language may arise either from genuine transfer or from prior exposure during pre-training. We present LiveCLKTBench, an automated generation pipeline specifically designed to isolate and measure cross-lingual knowledge transfer. Our pipeline identifies self-contained, time-sensitive knowledge entities from real-world domains, filters them based on temporal occurrence, and verifies them against the model’s knowledge. The documents of these valid entities are then used to generate factual questions, which are translated into multiple languages to evaluate transferability across linguistic boundaries. Using LiveCLKTBench, we evaluate several LLMs across five languages and observe that cross-lingual transfer is strongly influenced by linguistic distance and often asymmetric across language directions. While larger models improve transfer, the gains diminish with scale and vary across domains. These findings provide new insights into multilingual transfer and demonstrate the value of LiveCLKTBench as a reliable benchmark for future research.
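[NLP-43] Temporal Predictors of Outcome in Reasoning Language Models
[Quick read]: This paper asks when a large language model (LLM) internally commits to its final answer during chain-of-thought (CoT) reasoning, a question that bears on interpretability and inference-time control. The key to the approach is training linear classifiers on the hidden states after the first t reasoning tokens to test whether the correctness of the final outcome can be predicted early. The study finds that eventual correctness is highly predictable after only a few reasoning tokens, even when longer reasoning is needed to reach a definite answer. For harder questions, a drop in predictive accuracy reveals a selection effect: long CoTs disproportionately contain hard items. This suggests that the model's internal self-assessment forms early, which gives a basis for intervening in and optimizing the reasoning process.
Link: https://arxiv.org/abs/2511.14773
Authors: Joey David
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 4 pages, 4 figures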
Abstract:The chain-of-thought (CoT) paradigm uses the elicitation of step-by-step rationales as a proxy for reasoning, gradually refining the model’s latent representation of a solution. However, it remains unclear just how early a Large Language Model (LLM) internally commits to an eventual outcome. We probe this by training linear classifiers on hidden states after the first t reasoning tokens, showing that eventual correctness is highly predictable after only a few tokens, even when longer outputs are needed to reach a definite answer. We show that, for harder questions, a drop in predictive accuracy highlights a selection artifact: hard items are disproportionately represented in long CoTs. Overall, our results imply that for reasoning models, internal self-assessment of success tends to emerge after only a few tokens, with implications for interpretability and for inference-time control.
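[NLP-44] Test-time Scaling of LLMs: A Survey from A Subproblem Structure Perspective
[Quick read]: This paper surveys how to improve the predictive accuracy of pretrained large language models at inference time by allocating additional compute, known as test-time scaling. It emphasizes how a problem is decomposed into subproblems and how those subproblems are organized topologically (sequential, parallel, or tree-structured), and uses this as a unifying lens for approaches such as Chain-of-Thought, Branch-Solve-Merge, and Tree-of-Thought. From that perspective it analyzes the strengths and limitations of each family of methods and outlines directions for future research.
Link: https://arxiv.org/abs/2511.14772
Authors: Zhuoyi Yang, Xu Guo, Tong Zhang, Huijuan Xu, Boyang Li
Affiliations: The Pennsylvania State University; Nanyang Technological University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: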
Abstract:With this paper, we survey techniques for improving the predictive accuracy of pretrained large language models by allocating additional compute at inference time. In categorizing test-time scaling methods, we place special emphasis on how a problem is decomposed into subproblems and on the topological organization of these subproblems, whether sequential, parallel, or tree-structured. This perspective allows us to unify diverse approaches such as Chain-of-Thought, Branch-Solve-Merge, and Tree-of-Thought under a common lens. We further synthesize existing analyses of these techniques, highlighting their respective strengths and weaknesses, and conclude by outlining promising directions for future research.
[NLP-45] Cluster-based Adaptive Retrieval: Dynamic Context Selection for RAG Applications
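[Quick read]: This paper addresses the mismatch between retrieval depth and query complexity that static top-k retrieval causes in Retrieval-Augmented Generation (RAG): a fixed number of documents can bring in redundant information for a narrowly focused query and too little context for a broad or ambiguous one. The key to the solution is Cluster-based Adaptive Retrieval (CAR), an algorithm that analyzes the ordered distribution of query-document similarity distances, detects the transition point where tightly clustered, highly relevant documents give way to less relevant candidates, and uses it to set the retrieval depth dynamically, so the cut-off adapts to query complexity.
Link: https://arxiv.org/abs/2511.14769
Authors: Yifan Xu, Vipul Gupta, Rohit Aggarwal, Varsha Mahadevan, Bhaskar Krishnamachari
Affiliations: Coinbase; University of Southern California
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: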
Abstract:Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by pulling in external material, document, code, manuals, from vast and ever-growing corpora, to effectively answer user queries. The effectiveness of RAG depends significantly on aligning the number of retrieved documents with query characteristics: narrowly focused queries typically require fewer, highly relevant documents, whereas broader or ambiguous queries benefit from retrieving more extensive supporting information. However, the common static top-k retrieval approach fails to adapt to this variability, resulting in either insufficient context from too few documents or redundant information from too many. Motivated by these challenges, we introduce Cluster-based Adaptive Retrieval (CAR), an algorithm that dynamically determines the optimal number of documents by analyzing the clustering patterns of ordered query-document similarity distances. CAR detects the transition point within similarity distances, where tightly clustered, highly relevant documents shift toward less pertinent candidates, establishing an adaptive cut-off that scales with query complexity. On Coinbase’s CDP corpus and the public MultiHop-RAG benchmark, CAR consistently picks the optimal retrieval depth and achieves the highest TES score, outperforming every fixed top-k baseline. In downstream RAG evaluations, CAR cuts LLM token usage by 60%, trims end-to-end latency by 22%, and reduces hallucinations by 10% while fully preserving answer relevance. Since integrating CAR into Coinbase’s virtual assistant, we’ve seen user engagement jump by 200%.
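The adaptive cut-off idea in the abstract above can be illustrated with a simple stand-in. The sketch below is not CAR itself: the abstract does not specify the clustering procedure, so this example swaps in a largest-gap heuristic over sorted query-document distances. The function name `adaptive_cutoff` and the `min_k` / `max_k` bounds are hypothetical.

```python
from typing import List


def adaptive_cutoff(distances: List[float], min_k: int = 1, max_k: int = 10) -> int:
    """Pick how many documents to keep by cutting at the largest jump
    between consecutive sorted distances (smaller distance = more relevant).

    Assumption: a largest-gap rule stands in for the paper's clustering step.
    """
    if min_k < 1:
        raise ValueError("min_k must be at least 1")
    ordered = sorted(distances)
    n = len(ordered)
    if n <= min_k:
        return n  # nothing to cut
    upper = min(max_k, n - 1)  # largest k that still leaves a gap to inspect
    best_k = min_k
    best_gap = ordered[min_k] - ordered[min_k - 1]
    for k in range(min_k + 1, upper + 1):
        gap = ordered[k] - ordered[k - 1]
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k


# Three tightly clustered documents, then a jump to weaker candidates
# (distances are made up for illustration).
k_focused = adaptive_cutoff([0.60, 0.10, 0.15, 0.12, 0.65])
```

A focused query tends to produce a small tight cluster followed by a clear jump, so the cut-off stays shallow; a broad query with many similar distances pushes the largest gap, and therefore the cut-off, further down the ranking.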
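[NLP-46] Optimizing Agricultural Research: A RAG-Based Approach to Mycorrhizal Fungi Information
[Quick read]: This paper addresses the difficulty conventional large language models (LLMs), which depend on static training corpora, have in drawing on up-to-date agricultural knowledge, particularly research on arbuscular mycorrhizal fungi (AMF), in a timely and accurate way. The key to the solution is a Retrieval-Augmented Generation (RAG) system with a dual-layered strategy: (i) semantic retrieval and augmentation of content from agronomy and biotechnology corpora using vector embeddings, and (ii) structured extraction of experimental metadata (such as inoculation methods, spore densities, soil parameters, and yield outcomes), so that generated answers are both semantically coherent and backed by experimental evidence. A high-performance vector database provides near real-time retrieval from an evolving literature base, supporting AMF-related decisions in agricultural applications.
Link: https://arxiv.org/abs/2511.14765
Authors: Mohammad Usman Altam, Md Imtiaz Habib, Tuan Hoang
Affiliations: Université Côte d’Azur
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 4 figures, 1 table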
Abstract:Retrieval-Augmented Generation (RAG) represents a transformative approach within natural language processing (NLP), combining neural information retrieval with generative language modeling to enhance both contextual accuracy and factual reliability of responses. Unlike conventional Large Language Models (LLMs), which are constrained by static training corpora, RAG-powered systems dynamically integrate domain-specific external knowledge sources, thereby overcoming temporal and disciplinary limitations. In this study, we present the design and evaluation of a RAG-enabled system tailored for Mycophyto, with a focus on advancing agricultural applications related to arbuscular mycorrhizal fungi (AMF). These fungi play a critical role in sustainable agriculture by enhancing nutrient acquisition, improving plant resilience under abiotic and biotic stresses, and contributing to soil health. Our system operationalizes a dual-layered strategy: (i) semantic retrieval and augmentation of domain-specific content from agronomy and biotechnology corpora using vector embeddings, and (ii) structured data extraction to capture predefined experimental metadata such as inoculation methods, spore densities, soil parameters, and yield outcomes. This hybrid approach ensures that generated responses are not only semantically aligned but also supported by structured experimental evidence. To support scalability, embeddings are stored in a high-performance vector database, allowing near real-time retrieval from an evolving literature base. Empirical evaluation demonstrates that the proposed pipeline retrieves and synthesizes highly relevant information regarding AMF interactions with crop systems, such as tomato (Solanum lycopersicum). The framework underscores the potential of AI-driven knowledge discovery to accelerate agroecological innovation and enhance decision-making in sustainable farming systems.
[NLP-47] CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries
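[Quick read]: This paper addresses the lack of a real-world benchmark for audio moment retrieval (AMR). Earlier work trained only on synthetic datasets and evaluated on fewer than 100 samples, so reported performance was unreliable and hard to carry over to real applications. The key to the solution is CASTELLA, a large-scale, human-annotated AMR benchmark with 1,009, 213, and 640 audio recordings for training, validation, and testing respectively, 24 times larger than the previous dataset. The authors also build a baseline on CASTELLA and show that a model fine-tuned on CASTELLA after pre-training on synthetic data outperforms a model trained only on synthetic data by 10.4 points in Recall1@0.7.
Link: https://arxiv.org/abs/2511.15131
Authors: Hokuto Munakata, Takehiro Imamura, Taichi Nishimura, Tatsuya Komatsu
Affiliations: Unknown
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: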
Abstract:We introduce CASTELLA, a human-annotated audio benchmark for the task of audio moment retrieval (AMR). Although AMR has various potential applications, there is still no established benchmark with real-world data. The early study of AMR trained its model solely on synthetic datasets. Moreover, its evaluation was based on an annotated dataset of fewer than 100 samples, which made the reported performance less reliable. To ensure performance in real-world environments, we present CASTELLA, a large-scale manually annotated AMR dataset. CASTELLA consists of 1,009, 213, and 640 audio recordings for the train, validation, and test splits, respectively, which is 24 times larger than the previous dataset. We also establish a baseline model for AMR using CASTELLA. Our experiments demonstrate that a model fine-tuned on CASTELLA after pre-training on the synthetic data outperformed a model trained solely on the synthetic data by 10.4 points in Recall1@0.7. CASTELLA is publicly available at this https URL.
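The abstract does not define Recall1@0.7. The sketch below assumes the reading common in moment retrieval: the fraction of queries whose top-ranked predicted segment overlaps the ground-truth segment with a temporal IoU of at least 0.7. It also assumes one ground-truth moment per query, which may be a simplification of the actual benchmark protocol. The function names and the example segments are illustrative.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, gts, iou_threshold=0.7):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(
        1 for pred, gt in zip(top1_preds, gts)
        if temporal_iou(pred, gt) >= iou_threshold
    )
    return hits / len(gts)

# Two hypothetical queries: the first prediction overlaps well, the second does not.
preds = [(2.0, 10.0), (0.0, 5.0)]
gts = [(3.0, 10.0), (4.0, 9.0)]
score = recall_at_1(preds, gts)
print(score)
```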
Computer Vision
[CV-0] RoMa v2: Harder Better Faster Denser Feature Matching
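[Quick Read]: This paper addresses the poor performance of existing dense feature matching methods in hard real-world scenarios and the low computational efficiency of high-precision models. The key to the solution has three parts: a novel matching architecture and loss that, combined with a carefully curated, diverse training distribution, markedly improve the model's ability to handle complex matching tasks; a decoupled two-stage matching-then-refinement pipeline that speeds up training, with a custom CUDA kernel that substantially reduces refinement memory usage; and the DINOv3 foundation model together with other improvements that make the model more robust and less biased. The resulting matcher sets a new state of the art across multiple experiments.
Link: https://arxiv.org/abs/2511.15706
Authors: Johan Edstedt,David Nordström,Yushan Zhang,Georg Bökman,Jonathan Astermark,Viktor Larsson,Anders Heyden,Fredrik Kahl,Mårten Wadenbäck,Michael Felsberg
Affiliations: Linköping University; Chalmers University of Technology; University of Amsterdam; Centre for Mathematical Sciences, Lund University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: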
Abstract:Dense feature matching aims to estimate all correspondences between two images of a 3D scene and has recently been established as the gold-standard due to its high accuracy and robustness. However, existing dense matchers still fail or perform poorly for many hard real-world scenarios, and high-precision models are often slow, limiting their applicability. In this paper, we attack these weaknesses on a wide front through a series of systematic improvements that together yield a significantly better model. In particular, we construct a novel matching architecture and loss, which, combined with a curated diverse training distribution, enables our model to solve many complex matching tasks. We further make training faster through a decoupled two-stage matching-then-refinement pipeline, and at the same time, significantly reduce refinement memory usage through a custom CUDA kernel. Finally, we leverage the recent DINOv3 foundation model along with multiple other insights to make the model more robust and unbiased. In our extensive set of experiments we show that the resulting novel matcher sets a new state-of-the-art, being significantly more accurate than its predecessors. Code is available at this https URL
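[CV-1] GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization
[Quick Read]: This paper addresses a limitation of current agentic visual reasoning models on the geolocalization task: existing work focuses mostly on image manipulation tools and leaves general-purpose agentic models underexplored. To fill this gap, the authors propose the GeoBench benchmark, which contains high-resolution photos, panoramas, and satellite images, to evaluate the geolocalization ability of agentic models more rigorously. They also design GeoVista, whose core innovation is to embed tool invocation (such as image zoom-in and web search) seamlessly into the reasoning loop. Training combines supervised fine-tuning (SFT) with reinforcement learning (RL) to optimize reasoning paths and tool-use strategies, supplemented by a hierarchical reward that integrates multi-level geographical information, which significantly improves geolocalization accuracy in complex scenes.
Link: https://arxiv.org/abs/2511.15705
Authors: Yikun Wang,Zuyan Liu,Ziyi Wang,Pengfei Liu,Han Hu,Yongming Rao
Affiliations: Fudan University; Tencent Hunyuan; Tsinghua University; Shanghai Innovation Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: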
Abstract:Current research on agentic visual reasoning enables deep multimodal understanding but primarily focuses on image manipulation tools, leaving a gap toward more general-purpose agentic models. In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses during reasoning. Since existing geolocalization benchmarks fail to meet the need for high-resolution imagery and the localization challenge for deep agentic reasoning, we curate GeoBench, a benchmark that includes photos and panoramas from around the world, along with a subset of satellite images of different cities to rigorously evaluate the geolocalization ability of agentic models. We also propose GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information. We develop a complete training pipeline for it, including a cold-start supervised fine-tuning (SFT) stage to learn reasoning patterns and tool-use priors, followed by a reinforcement learning (RL) stage to further enhance reasoning ability. We adopt a hierarchical reward to leverage multi-level geographical information and improve overall geolocalization performance. Experimental results show that GeoVista surpasses other open-source agentic models on the geolocalization task greatly and achieves performance comparable to closed-source models such as Gemini-2.5-flash and GPT-5 on most metrics.
[CV-2] In-N-On: Scaling Egocentric Manipulation with in-the-wild and on-task Data
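[Quick Read]: This paper addresses how to use egocentric video data effectively to train manipulation policies, in particular the limitation that existing methods, because of data heterogeneity, use such data only for simple pre-training and fail to unlock its full potential. The key to the solution is a scalable recipe for collecting and using the data that divides human data into "in-the-wild" and "on-task" categories and systematically analyzes how to use each. On this basis the authors build the PHSD dataset, with over 1,000 hours of in-the-wild data and over 20 hours of on-task data, and train Human0, a large language-conditioned flow matching policy. Domain adaptation techniques narrow the distribution gap between humans and humanoid robots, enabling new capabilities such as following language instructions learned from human data alone, few-shot learning, and improved robustness.
Link: https://arxiv.org/abs/2511.15704
Authors: Xiongyi Cai,Ri-Zhao Qiu,Geng Chen,Lai Wei,Isabella Liu,Tianshu Huang,Xuxin Cheng,Xiaolong Wang
Affiliations: UC San Diego
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project webpage: this https URL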
Abstract:Egocentric videos are a valuable and scalable data source to learn manipulation policies. However, due to significant data heterogeneity, most existing approaches utilize human data for simple pre-training, which does not unlock its full potential. This paper first provides a scalable recipe for collecting and using egocentric data by dividing human data into two categories, in-the-wild and on-task, alongside a systematic analysis of how to use the data. We curate a dataset, PHSD, which contains over 1,000 hours of diverse in-the-wild egocentric data and over 20 hours of on-task data directly aligned to the target manipulation tasks. This enables learning a large egocentric language-conditioned flow matching policy, Human0. With domain adaptation techniques, Human0 minimizes the gap between humans and humanoids. Empirically, we show Human0 achieves several novel properties from scaling human data, including language following of instructions from only human data, few-shot learning, and improved robustness using on-task data. Project website: this https URL
[CV-3] First Frame Is the Place to Go for Video Content Customization
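[Quick Read]: This paper addresses how to achieve efficient, well-generalizing reference-based video content customization in video generation models. Conventional methods treat the first frame as the spatial-temporal starting point, a mere seed for subsequent animation, which makes flexible control over generated content difficult. The key to the solution is the finding that, inside the model, the first frame actually serves as a conceptual memory buffer that stores visual entities for reuse in later generation stages. Building on this insight, the authors achieve robust video customization across diverse scenarios with only 20-50 training examples, without architectural changes or large-scale fine-tuning, uncovering a long-overlooked capability of video generation models for reference-based customization.
Link: https://arxiv.org/abs/2511.15700
Authors: Jingxi Chen,Zongxia Li,Zhichao Liu,Guangyao Shi,Xiyang Wu,Fuxiao Liu,Cornelia Fermuller,Brandon Y. Feng,Yiannis Aloimonos
Affiliations: University of Maryland; University of Southern California; Massachusetts Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Website: this https URL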
Abstract:What role does the first frame play in video generation models? Traditionally, it’s viewed as the spatial-temporal starting point of a video, merely a seed for subsequent animation. In this work, we reveal a fundamentally different perspective: video models implicitly treat the first frame as a conceptual memory buffer that stores visual entities for later reuse during generation. Leveraging this insight, we show that it’s possible to achieve robust and generalized video content customization in diverse scenarios, using only 20-50 training examples without architectural changes or large-scale finetuning. This unveils a powerful, overlooked capability of video generation models for reference-based video customization.
[CV-4] Hyperspectral Image Classification using Spectral-Spatial Mixer Network
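[Quick Read]: This paper addresses the limited performance of hyperspectral image (HSI) classification models when labeled data is scarce, and in particular how to classify efficiently and accurately with only a few labeled samples. The key to the solution is SS-MixNet, a lightweight deep learning model whose core innovations are: (1) 3D convolutional layers that extract local spectral-spatial features; (2) two parallel MLP-style mixer blocks that capture long-range dependencies along the spectral and spatial dimensions; and (3) a depthwise convolution-based attention mechanism that strengthens discriminative capability with minimal computational overhead. Experiments show that with only 1% of the data labeled, the method still outperforms mainstream methods such as 2D-CNN, 3D-CNN, IP-SWIN, SimPoolFormer, and HybridKAN, confirming its effectiveness and robustness under limited supervision.
Link: https://arxiv.org/abs/2511.15692
Authors: Mohammed Q. Alkhatib
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for WHISPERS2025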
Abstract:This paper introduces SS-MixNet, a lightweight and effective deep learning model for hyperspectral image (HSI) classification. The architecture integrates 3D convolutional layers for local spectral-spatial feature extraction with two parallel MLP-style mixer blocks that capture long-range dependencies in spectral and spatial dimensions. A depthwise convolution-based attention mechanism is employed to enhance discriminative capability with minimal computational overhead. The model is evaluated on the QUH-Tangdaowan and QUH-Qingyun datasets using only 1% of labeled data for training and validation. SS-MixNet achieves the highest performance among compared methods, including 2D-CNN, 3D-CNN, IP-SWIN, SimPoolFormer, and HybridKAN, reaching 95.68% and 93.86% overall accuracy on the Tangdaowan and Qingyun datasets, respectively. The results, supported by quantitative metrics and classification maps, confirm the model’s effectiveness in delivering accurate and robust predictions with limited supervision. The code will be made publicly available at: this https URL
[CV-5] MF-GCN: A Multi-Frequency Graph Convolutional Network for Tri-Modal Depression Detection Using Eye-Tracking Facial and Acoustic Features
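[Quick Read]: This paper addresses the limited discriminative power of existing graph neural network models for depression detection, which focus on low-frequency information and neglect high-frequency features. The key to the solution is a Multi-Frequency Graph Convolutional Network (MF-GCN), whose core innovation is the Multi-Frequency Filter Bank Module (MFFBM). The module exploits both low- and high-frequency signals and thereby captures depression-related patterns in cross-modal features (eye tracking, audio, and video) more completely, improving classification performance over traditional machine learning and deep learning baselines on both the binary and the three-class task.
Link: https://arxiv.org/abs/2511.15675
Authors: Sejuti Rahman,Swakshar Deb,MD. Sameer Iqbal Chowdhury,MD. Jubair Ahmed Sourov,Mohammad Shamsuddin
Affiliations: New Uzbekistan University; University of Dhaka; University of Virginia
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: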
Abstract:Eye tracking data quantifies the attentional bias towards negative stimuli that is frequently observed in depressed groups. Audio and video data capture the affective flattening and psychomotor retardation characteristic of depression. Statistical validation confirmed their significant discriminative power in distinguishing depressed from non-depressed groups. We address a critical limitation of existing graph-based models that focus on low-frequency information and propose a Multi-Frequency Graph Convolutional Network (MF-GCN). This framework consists of a novel Multi-Frequency Filter Bank Module (MFFBM), which can leverage both low- and high-frequency signals. Extensive evaluation against traditional machine learning algorithms and deep learning frameworks demonstrates that MF-GCN consistently outperforms baselines. In binary (depressed and non-depressed) classification, the model achieved a sensitivity of 0.96 and F2 score of 0.94. For the 3-class (no depression, mild to moderate depression, and severe depression) classification task, the proposed method achieved a sensitivity of 0.79 and specificity of 0.87 and significantly surpassed other models. To validate generalizability, the model was also evaluated on the Chinese Multimodal Depression Corpus (CMDC) dataset and achieved a sensitivity of 0.95 and F2 score of 0.96. These results confirm that our trimodal, multi-frequency framework effectively captures cross-modal interaction for accurate depression detection.
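The abstract reports sensitivity, specificity, and F2 scores. For readers unfamiliar with them, the sketch below uses the standard definitions from binary classification (F2 is the F-beta score with beta = 2, which weights recall more heavily than precision). The labels in the example are made up and are not related to the paper's data.

```python
def binary_metrics(y_true, y_pred, beta=2.0):
    """Return (sensitivity, specificity, F-beta) for 0/1 labels, 1 = positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # recall on positives
    specificity = tn / (tn + fp) if tn + fp else 0.0  # recall on negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    b2 = beta * beta
    denom = b2 * precision + sensitivity
    f_beta = (1 + b2) * precision * sensitivity / denom if denom else 0.0
    return sensitivity, specificity, f_beta

# Hypothetical predictions: every positive is found, at the cost of two false alarms.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 0]
sens, spec, f2 = binary_metrics(y_true, y_pred)
print(sens, spec, round(f2, 4))
```

In this toy case F2 stays high even though precision is only 2/3, which is why F2 suits screening tasks where missed cases are costlier than false alarms.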
[CV-6] GEO-Bench-2: From Performance to Capability Rethinking Evaluation in Geospatial AI
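[Quick Read]: This paper addresses the lack of standardized evaluation protocols for Geospatial Foundation Models (GeoFMs) in Earth Observation (EO). To this end, the authors propose GEO-Bench-2, a comprehensive evaluation framework covering classification, segmentation, regression, object detection, and instance segmentation across 19 permissively licensed datasets. It introduces "capability" groups that rank models on datasets sharing common characteristics (such as resolution, bands, and temporality), helping users identify the strengths and weaknesses of different models. The key to the solution is a prescriptive yet flexible evaluation protocol that ensures consistent, comparable benchmarking while supporting research into model adaptation strategies, advancing GeoFMs on downstream tasks. Experiments show that no single model dominates across all tasks, so model choice depends on task requirements, data modalities, and constraints, and building a general-purpose GeoFM remains an open problem.
Link: https://arxiv.org/abs/2511.15658
Authors: Naomi Simumba,Nils Lehmann,Paolo Fraccaro,Hamed Alemohammad,Geeth De Mel,Salman Khan,Manil Maskey,Nicolas Longepe,Xiao Xiang Zhu,Hannah Kerner,Juan Bernabe-Moreno,Alexander Lacoste
Affiliations: IBM Research Europe; Technical University Munich; Clark University; NASA Impact; ESA Φ-lab; MBZUAI; Arizona State University; ServiceNow Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: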
Abstract:Geospatial Foundation Models (GeoFMs) are transforming Earth Observation (EO), but evaluation lacks standardized protocols. GEO-Bench-2 addresses this with a comprehensive framework spanning classification, segmentation, regression, object detection, and instance segmentation across 19 permissively-licensed datasets. We introduce "capability" groups to rank models on datasets that share common characteristics (e.g., resolution, bands, temporality). This enables users to identify which models excel in each capability and determine which areas need improvement in future work. To support both fair comparison and methodological innovation, we define a prescriptive yet flexible evaluation protocol. This not only ensures consistency in benchmarking but also facilitates research into model adaptation strategies, a key and open challenge in advancing GeoFMs for downstream tasks. Our experiments show that no single model dominates across all tasks, confirming the specificity of the choices made during architecture design and pretraining. While models pretrained on natural images (ConvNext ImageNet, DINO V3) excel on high-resolution tasks, EO-specific models (TerraMind, Prithvi, and Clay) outperform them on multispectral applications such as agriculture and disaster response. These findings demonstrate that optimal model choice depends on task requirements, data modalities, and constraints. This shows that the goal of a single GeoFM model that performs well across all tasks remains open for future research. GEO-Bench-2 enables informed, reproducible GeoFM evaluation tailored to specific use cases. Code, data, and leaderboard for GEO-Bench-2 are publicly released under a permissive license.
[CV-7] INQUIRE-Search: A Framework for Interactive Discovery in Large-Scale Biodiversity Databases
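[Quick Read]: This paper addresses the difficulty of using, at scale, the ecological context captured in large biodiversity image collections (behaviors, interactions, phenology, habitat, and so on), because current workflows rely on metadata filtering or manual inspection. The key to the solution is INQUIRE-Search, an open-source natural-language interactive search tool that lets scientists quickly locate observations of specific concepts in an image database, verify and export the relevant data, and use it for novel scientific analysis. Compared with traditional methods, the system takes a fraction of the time, offering an efficient and scalable new paradigm for ecological research.
Link: https://arxiv.org/abs/2511.15656
Authors: Edward Vendrow,Julia Chae,Rupa Kurinchi-Vendhan,Isaac Eckert,Jazlynn Hall,Marta Jarzyna,Reymond Miyajima,Ruth Oliver,Laura Pollock,Lauren Schrack,Scott Yanco,Oisin Mac Aodha,Sara Beery
Affiliations: Massachusetts Institute of Technology; McGill University; Cary Institute of Ecosystem Studies; The Ohio State University; University of California Santa Barbara; Smithsonian’s National Zoo & Conservation Biology Institute; University of Edinburgh
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: EV, JC, RKV contributed equally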
Abstract:Large community science platforms such as iNaturalist contain hundreds of millions of biodiversity images that often capture ecological context on behaviors, interactions, phenology, and habitat. Yet most ecological workflows rely on metadata filtering or manual inspection, leaving this secondary information inaccessible at scale. We introduce INQUIRE-Search, an open-source system that enables scientists to rapidly and interactively search within an ecological image database for specific concepts using natural language, verify and export relevant observations, and utilize this discovered data for novel scientific analysis. Compared to traditional methods, INQUIRE-Search takes a fraction of the time, opening up new possibilities for scientific questions that can be explored. Through five case studies, we show the diversity of scientific applications that a tool like INQUIRE-Search can support, from seasonal variation in behavior across species to forest regrowth after wildfires. These examples demonstrate a new paradigm for interactive, efficient, and scalable scientific discovery that can begin to unlock previously inaccessible scientific value in large-scale biodiversity datasets. Finally, we emphasize that using such AI-enabled discovery tools for science calls for experts to reframe the priorities of the scientific process and develop novel methods for experiment design, data collection, survey effort, and uncertainty analysis.
[CV-8] MambaIO: Global-Coordinate Inertial Odometry for Pedestrians via Multi-Scale Frequency-Decoupled Modeling
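[Quick Read]: This paper re-examines whether the global coordinate frame traditionally used in Inertial Odometry (IO) is suitable for pedestrian scenarios. Studies in drone scenarios have shown that the body coordinate frame can significantly improve accuracy, but whether this carries over to pedestrian IO had not been thoroughly verified. The authors therefore evaluate the global frame for pedestrian IO systematically through theoretical analysis, qualitative inspection, and quantitative experiments, and propose MambaIO. Its core idea is to decompose IMU measurements into high- and low-frequency components with a Laplacian pyramid: the low-frequency component is processed by a Mamba architecture to extract implicit contextual motion cues, while the high-frequency component is handled by a convolutional structure that captures fine-grained local motion details, yielding more accurate real-time localization. According to the authors, this is the first application of the Mamba architecture to inertial odometry, and it achieves state-of-the-art performance.
Link: https://arxiv.org/abs/2511.15645
Authors: Shanshan Zhang
Affiliations: Xiamen University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: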
Abstract:Inertial Odometry (IO) enables real-time localization using only acceleration and angular velocity measurements from an Inertial Measurement Unit (IMU), making it a promising solution for localization in consumer-grade applications. Traditionally, IMU measurements in IO have been processed under two coordinate system paradigms: the body coordinate frame and the global coordinate frame, with the latter being widely adopted. However, recent studies in drone scenarios have demonstrated that the body frame can significantly improve localization accuracy, prompting a re-evaluation of the suitability of the global frame for pedestrian IO. To address this issue, this paper systematically evaluates the effectiveness of the global coordinate frame in pedestrian IO through theoretical analysis, qualitative inspection, and quantitative experiments. Building upon these findings, we further propose MambaIO, which decomposes IMU measurements into high-frequency and low-frequency components using a Laplacian pyramid. The low-frequency component is processed by a Mamba architecture to extract implicit contextual motion cues, while the high-frequency component is handled by a convolutional structure to capture fine-grained local motion details. Experiments on multiple public datasets show that MambaIO substantially reduces localization error and achieves state-of-the-art (SOTA) performance. To the best of our knowledge, this is the first application of the Mamba architecture to the inertial odometry task.
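The frequency decoupling step can be pictured with a simplified sketch. It performs a single-level split of a 1D signal into a smoothed low band and a residual high band, in the spirit of a Laplacian pyramid but without the downsampling and upsampling of a full pyramid. The `[1, 2, 1] / 4` kernel, the edge handling, and the sample values are assumptions made for illustration; the paper's actual decomposition may differ.

```python
def smooth(signal):
    """Low-pass filter with a [1, 2, 1] / 4 kernel and edge replication."""
    n = len(signal)
    out = []
    for i in range(n):
        left = signal[max(i - 1, 0)]
        right = signal[min(i + 1, n - 1)]
        out.append(0.25 * left + 0.5 * signal[i] + 0.25 * right)
    return out

def split_bands(signal):
    """Return (low, high) such that low[i] + high[i] reproduces signal[i]."""
    low = smooth(signal)
    high = [x - l for x, l in zip(signal, low)]
    return low, high

# Hypothetical accelerometer samples along one axis.
accel = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
low, high = split_bands(accel)
print(low)
print(high)
```

In the design described by the abstract, the low band would feed the Mamba branch and the high band the convolutional branch; both branches are outside the scope of this sketch.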
[CV-9] Multi-Stage Residual-Aware Unsupervised Deep Learning Framework for Consistent Ultrasound Strain Elastography
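[Quick Read]: This paper addresses three challenges that limit the clinical use of Ultrasound Strain Elastography (USE): tissue decorrelation noise, the lack of ground truth, and inconsistent strain estimation under different deformation conditions. The key to the solution is MUSSE-Net, a residual-aware, multi-stage unsupervised sequential deep learning framework. Its backbone, USSE-Net, is an end-to-end multi-stream encoder-decoder architecture that processes pre- and post-deformation radio-frequency (RF) sequences in parallel to estimate displacement fields and axial strains, combining a Context-Aware Complementary Feature Fusion (CACFF)-based encoder with a Tri-Cross Attention (TCA) bottleneck. A tailored consistency loss promotes temporal coherence and strain stability, and a second-stage residual refinement module further improves accuracy and suppresses noise. On simulation, in vivo, and clinical datasets from the Bangladesh University of Engineering and Technology (BUET) medical center, the method outperforms existing unsupervised approaches and improves lesion-to-background contrast and interpretability.
Link: https://arxiv.org/abs/2511.15640
Authors: Shourov Joarder,Tushar Talukder Showrav,Md. Kamrul Hasan
Affiliations: Bangladesh University of Engineering and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 9 figures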
Abstract:Ultrasound Strain Elastography (USE) is a powerful non-invasive imaging technique for assessing tissue mechanical properties, offering crucial diagnostic value across diverse clinical applications. However, its clinical application remains limited by tissue decorrelation noise, scarcity of ground truth, and inconsistent strain estimation under different deformation conditions. To overcome these barriers, we propose MUSSE-Net, a residual-aware, multi-stage unsupervised sequential deep learning framework designed for robust and consistent strain estimation. At its backbone lies our proposed USSE-Net, an end-to-end multi-stream encoder-decoder architecture that processes pre- and post-deformation RF sequences in parallel to estimate displacement fields and axial strains. The novel architecture incorporates a Context-Aware Complementary Feature Fusion (CACFF)-based encoder and a Tri-Cross Attention (TCA) bottleneck with a Cross-Attentive Fusion (CAF)-based sequential decoder. To ensure temporal coherence and strain stability across varying deformation levels, this architecture leverages a tailored consistency loss. Finally, with the MUSSE-Net framework, a secondary residual refinement stage further enhances accuracy and suppresses noise. Extensive validation on simulation, in vivo, and private clinical datasets from Bangladesh University of Engineering and Technology (BUET) medical center demonstrates that MUSSE-Net outperforms existing unsupervised approaches. MUSSE-Net achieves state-of-the-art performance with a target SNR of 24.54, background SNR of 132.76, CNR of 59.81, and elastographic SNR of 9.73 on simulation data. In particular, on the BUET dataset, MUSSE-Net produces strain maps with enhanced lesion-to-background contrast and significant noise suppression, yielding clinically interpretable strain patterns.
[CV-10] Hierarchical Semantic Tree Anchoring for CLIP-Based Class-Incremental Learning
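[Quick Read]: This paper addresses feature drift in fine-grained classes and the resulting catastrophic forgetting in Class-Incremental Learning (CIL), caused by the lack of explicit modeling of the hierarchy of visual and linguistic concepts. Existing CLIP-based methods obtain transferable features from multi-modal pre-training but do not preserve hierarchical relations between categories (for example, "dog" subsumes "Labrador" and "Golden Retriever"), so old-class features degrade during incremental updates. The key to the solution is the HASTEN (Hierarchical Semantic Tree Anchoring) framework. First, it uses an external knowledge graph as supervision to embed visual and textual features in hyperbolic space, preserving hierarchical structure as data evolves. Second, it projects gradients onto the null space of the shared hyperbolic mapper to suppress interference with previous tasks. Together these steps retain prior knowledge stably while learning new classes effectively.
Link: https://arxiv.org/abs/2511.15633
Authors: Tao Hu,Lan Li,Zhen-Hao Xie,Da-Wei Zhou
Affiliations: Nanjing University; State Key Laboratory for Novel Software Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: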
Abstract:Class-Incremental Learning (CIL) enables models to learn new classes continually while preserving past knowledge. Recently, vision-language models like CLIP offer transferable features via multi-modal pre-training, making them well-suited for CIL. However, real-world visual and linguistic concepts are inherently hierarchical: a textual concept like “dog” subsumes fine-grained categories such as “Labrador” and “Golden Retriever,” and each category entails its images. Existing CLIP-based CIL methods fail to explicitly capture this inherent hierarchy, leading to drift in fine-grained class features during incremental updates and ultimately to catastrophic forgetting. To address this challenge, we propose HASTEN (Hierarchical Semantic Tree Anchoring) that anchors hierarchical information into CIL to reduce catastrophic forgetting. First, we employ an external knowledge graph as supervision to embed visual and textual features in hyperbolic space, effectively preserving hierarchical structure as data evolves. Second, to mitigate catastrophic forgetting, we project gradients onto the null space of the shared hyperbolic mapper, preventing interference with prior tasks. These two steps work synergistically to enable the model to resist forgetting by maintaining hierarchical relationships. Extensive experiments show that HASTEN consistently outperforms existing methods while providing a unified structured representation.
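The second step, null-space gradient projection, can be sketched generically. The abstract does not say how HASTEN estimates the null space, so the example assumes a set of directions that matter for old tasks is already available (`old_directions` is hypothetical), orthonormalizes them with Gram-Schmidt, and removes the gradient's components along them. The projected update then leaves outputs along those directions unchanged to first order.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def orthonormalize(vectors, eps=1e-12):
    """Gram-Schmidt: return an orthonormal basis spanning the input vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis

def project_to_null_space(grad, old_directions):
    """Remove from grad every component lying in span(old_directions)."""
    out = list(grad)
    for b in orthonormalize(old_directions):
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out

# Hypothetical 3-parameter example: old tasks depend on the first two axes.
old_directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
projected = project_to_null_space([3.0, 4.0, 5.0], old_directions)
print(projected)
```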
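[CV-11] The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification
[Quick Read]: This paper addresses the lack of general-purpose multi-animal tracking (MAT) models for wildlife conservation: existing datasets are limited in scale, species diversity, and temporal and geographical coverage, and cannot support training MAT models that generalize across species and regions. The key to the solution is SA-FARI, the largest open-source MAT dataset for wild animals. It comprises 11,609 camera trap videos collected over approximately 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories, with dense annotations (about 46 hours of footage, 16,224 masklet identities, and 942,702 bounding boxes and segmentation masks) and anonymized camera trap locations. According to the authors, it is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, providing a new benchmark for generalizable multi-animal tracking in the wild.
Link: https://arxiv.org/abs/2511.15622
Authors: Dante Francisco Wasmuht,Otto Brookes,Maximillian Schall,Pablo Palencia,Chris Beirne,Tilo Burghardt,Majid Mirmehdi,Hjalmar Kühl,Mimi Arandjelovic,Sam Pottie,Peter Bermant,Brandon Asheim,Yi Jin Toh,Adam Elzinga,Jason Holmberg,Andrew Whitworth,Eleanor Flatt,Laura Gustafson,Chaitanya Ryali,Yuan-Ting Hu,Baishan Guo,Andrew Westbury,Kate Saenko,Didac Suris
Affiliations: Conservation X Labs (CXL); Meta; University of Bristol; Hasso Plattner Institute; University of Oviedo; Osa Conservation; Senckenberg Museum of Natural History; Max Planck Institute for Evolutionary Anthropology; Climate Corridors
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: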
Abstract:Automated video analysis is critical for wildlife conservation. A foundational task in this domain is multi-animal tracking (MAT), which underpins applications such as individual re-identification and behavior recognition. However, existing datasets are limited in scale, constrained to a few species, or lack sufficient temporal and geographical diversity - leaving no suitable benchmark for training general-purpose MAT models applicable across wild animal populations. To address this, we introduce SA-FARI, the largest open-source MAT dataset for wild animals. It comprises 11,609 camera trap videos collected over approximately 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories. Each video is exhaustively annotated culminating in ~46 hours of densely annotated footage containing 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels. Alongside the task-specific annotations, we publish anonymized camera trap locations for each video. Finally, we present comprehensive benchmarks on SA-FARI using state-of-the-art vision-language models for detection and tracking, including SAM 3, evaluated with both species-specific and generic animal prompts. We also compare against vision-only methods developed specifically for wildlife analysis. SA-FARI is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for advancing generalizable multianimal tracking in the wild. The dataset is available at this https URL.
[CV-12] FlashMesh: Faster and Better Autoregressive Mesh Synthesis via Structured Speculation
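[Quick Read]: This paper addresses the slow inference of autoregressive models when generating high-quality 3D meshes, which limits their practical use in interactive and large-scale applications. The key to the solution is FlashMesh, a fast and high-fidelity mesh generation framework that restructures autoregressive decoding around a predict-correct-verify paradigm. Its key innovation is to exploit the strong structural and geometric correlations among mesh tokens to enable confident multi-token speculation, with a speculative decoding scheme tailored to the commonly used hourglass transformer architecture. This enables parallel prediction at the face, point, and coordinate levels and improves both generation speed and quality.
Link: https://arxiv.org/abs/2511.15618
Authors: Tingrui Shen,Yiheng Zhang,Chen Tang,Chuan Ping,Zixing Zhao,Le Wan,Yuwang Wang,Ronggang Wang,Shengfeng He
Affiliations: South China University of Technology; Tsinghua University; Zhejiang University; Tencent VISVISE; Peking University; Singapore Management University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: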
Abstract:Autoregressive models can generate high-quality 3D meshes by sequentially producing vertices and faces, but their token-by-token decoding results in slow inference, limiting practical use in interactive and large-scale applications. We present FlashMesh, a fast and high-fidelity mesh generation framework that rethinks autoregressive decoding through a predict-correct-verify paradigm. The key insight is that mesh tokens exhibit strong structural and geometric correlations that enable confident multi-token speculation. FlashMesh leverages this by introducing a speculative decoding scheme tailored to the commonly used hourglass transformer architecture, enabling parallel prediction across face, point, and coordinate levels. Extensive experiments show that FlashMesh achieves up to a 2x speedup over standard autoregressive models while also improving generation fidelity. Our results demonstrate that structural priors in mesh data can be systematically harnessed to accelerate and enhance autoregressive generation.
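For readers new to speculative decoding, the toy sketch below shows the generic greedy variant: a cheap drafter proposes several tokens, and the full model keeps the longest prefix it agrees with and corrects the first mismatch. It is not FlashMesh's hourglass-specific scheme. Both `draft_next` and `target_next` are hypothetical stand-in functions, and the verification loop is written sequentially for clarity, whereas a real system would check all drafted tokens in one parallel pass.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """Draft k tokens, then accept the prefix the target model agrees with."""
    # 1) Predict: the drafter proposes k tokens autoregressively.
    drafted = []
    ctx = list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        drafted.append(token)
        ctx.append(token)

    # 2) Verify and correct: keep matching tokens, fix the first mismatch.
    accepted = []
    ctx = list(prefix)
    for token in drafted:
        expected = target_next(ctx)
        if token == expected:
            accepted.append(token)
            ctx.append(token)
        else:
            accepted.append(expected)  # the target's token replaces the bad draft
            break
    return accepted

# Toy models: the target emits len(context) % 5; the drafter errs at one position.
target_next = lambda ctx: len(ctx) % 5
draft_next = lambda ctx: 9 if len(ctx) == 3 else len(ctx) % 5

new_tokens = speculative_step([0], draft_next, target_next, k=4)
print(new_tokens)
```

Each step yields at least one token that matches what the target model would have produced on its own, so output quality is preserved while several tokens can be accepted per step.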
[CV-13] MaskMed: Decoupled Mask and Class Prediction for Medical Image Segmentation
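[Quick Read]: This paper addresses the limited feature sharing and semantic generalization of the conventional point-wise convolutional segmentation head in medical image segmentation, which ties each output channel to a specific class. The key to the solution is a unified decoupled segmentation head that separates multi-class prediction into class-agnostic mask prediction and class label prediction, using shared object queries for more flexible feature use. The paper also introduces a Full-Scale Aware Deformable Transformer module that lets low-resolution encoder features attend to full-resolution features through deformable attention, achieving memory-efficient, spatially aligned full-scale fusion.
Link: https://arxiv.org/abs/2511.15603
Authors: Bin Xie,Gady Agam
Affiliations: Illinois Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: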
Abstract:Medical image segmentation typically adopts a point-wise convolutional segmentation head to predict dense labels, where each output channel is heuristically tied to a specific class. This rigid design limits both feature sharing and semantic generalization. In this work, we propose a unified decoupled segmentation head that separates multi-class prediction into class-agnostic mask prediction and class label prediction using shared object queries. Furthermore, we introduce a Full-Scale Aware Deformable Transformer module that enables low-resolution encoder features to attend across full-resolution encoder features via deformable attention, achieving memory-efficient and spatially aligned full-scale fusion. Our proposed method, named MaskMed, achieves state-of-the-art performance, surpassing nnUNet by +2.0% Dice on AMOS 2022 and +6.9% Dice on BTCV.
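The abstract does not state how the decoupled outputs are merged into a label map. One common way, used by MaskFormer-style heads and assumed here purely for illustration, is to weight each query's mask by that query's class probabilities and take the best class per pixel. The array shapes, the values, and the `combine_queries` name are hypothetical.

```python
def combine_queries(class_probs, masks):
    """Merge per-query class probabilities and class-agnostic masks.

    class_probs: Q x C list, class_probs[q][c] = P(class c | query q)
    masks:       Q x P list, masks[q][p] in [0, 1] for flattened pixel p
    Returns one class index per pixel.
    """
    n_queries = len(masks)
    n_classes = len(class_probs[0])
    n_pixels = len(masks[0])
    labels = []
    for p in range(n_pixels):
        scores = [
            sum(class_probs[q][c] * masks[q][p] for q in range(n_queries))
            for c in range(n_classes)
        ]
        labels.append(max(range(n_classes), key=scores.__getitem__))
    return labels

# Two hypothetical queries over four pixels: query 0 leans to class 0, query 1 to class 1.
class_probs = [[0.9, 0.1], [0.2, 0.8]]
masks = [[1.0, 0.9, 0.1, 0.0], [0.0, 0.1, 0.8, 1.0]]
labels = combine_queries(class_probs, masks)
print(labels)
```

Because masks are class-agnostic, the same query features can serve any class, which is the feature-sharing benefit the abstract contrasts with per-class output channels.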
[CV-14] US-X Complete: A Multi-Modal Approach to Anatomical 3D Shape Recovery MICCAI2025
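[Quick Read]: This paper addresses the incomplete visualization of vertebral bodies and similar structures in ultrasound-guided spinal surgery, an inherent limitation of ultrasound caused by acoustic shadowing from bone. The key to the solution is a novel multi-modal deep learning method that completes occluded vertebral structures in ultrasound by fusing complementary information from a single X-ray image and 3D ultrasound data. Trained on generated paired data (simulated X-ray views and partially visible 3D vertebra representations), the method integrates morphological features from both modalities and significantly improves vertebral reconstruction accuracy (p < 0.001) without registration to preoperative CT, mitigating the main limitation of ultrasound while keeping its real-time, radiation-free advantages.
Link: https://arxiv.org/abs/2511.15600
Authors: Miruna-Alexandra Gafencu,Yordanka Velikova,Nassir Navab,Mohammad Farid Azampour
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at the Workshop on Shape in Medical Imaging at MICCAI 2025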
Abstract:Ultrasound offers a radiation-free, cost-effective solution for real-time visualization of spinal landmarks, paraspinal soft tissues and neurovascular structures, making it valuable for intraoperative guidance during spinal procedures. However, ultrasound suffers from inherent limitations in visualizing complete vertebral anatomy, in particular vertebral bodies, due to acoustic shadowing effects caused by bone. In this work, we present a novel multi-modal deep learning method for completing occluded anatomical structures in 3D ultrasound by leveraging complementary information from a single X-ray image. To enable training, we generate paired training data consisting of: (1) 2D lateral vertebral views that simulate X-ray scans, and (2) 3D partial vertebrae representations that mimic the limited visibility and occlusions encountered during ultrasound spine imaging. Our method integrates morphological information from both imaging modalities and demonstrates significant improvements in vertebral reconstruction (p < 0.001) compared to the state of the art in 3D ultrasound vertebral completion. We perform phantom studies as an initial step to future clinical translation, and achieve a more accurate, complete volumetric lumbar spine visualization overlaid on the ultrasound scan without the need for registration with preoperative modalities such as computed tomography. This demonstrates that integrating a single X-ray projection mitigates ultrasound’s key limitation while preserving its strengths as the primary imaging modality. Code and data can be found at this https URL
[CV-15] Learning from Mistakes: Loss-Aware Memory Enhanced Continual Learning for LiDAR Place Recognition
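[Quick read]: This paper tackles catastrophic forgetting in LiDAR place recognition under continual learning, where a model loses knowledge of earlier environments as it adapts to new ones. The key to the solution is the KDF+ framework, with two core contributions: (1) a loss-aware sampling strategy that estimates learning difficulty from the loss value of each sample and preferentially replays harder samples to retain more discriminative information; and (2) a rehearsal enhancement mechanism that slightly reduces the loss of memory samples during new-task training, so that they are refined further during updates and long-term knowledge retention is strengthened. Experiments show that KDF+ clearly outperforms existing methods on multiple benchmarks and integrates seamlessly into state-of-the-art continual learning frameworks for LiDAR place recognition.
Link: https://arxiv.org/abs/2511.15597
Authors: Xufei Wang,Junqiao Zhao,Siyue Tao,Qiwen Gu,Wonbong Kim,Tiantian Feng
Affiliations: Shanghai Research Institute for Intelligent Autonomous System, Tongji University; Department of Computer Science and Technology, School of Electronics and Information Engineering, Tongji University; MOE Key Lab of Embedded System and Service Computing, Tongji University; Institute of Intelligent Vehicles, Tongji University; College of Surveying and Geo-Informatics, Tongji University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures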
Abstract:LiDAR place recognition plays a crucial role in SLAM, robot navigation, and autonomous driving. However, existing LiDAR place recognition methods often struggle to adapt to new environments without forgetting previously learned knowledge, a challenge widely known as catastrophic forgetting. To address this issue, we propose KDF+, a novel continual learning framework for LiDAR place recognition that extends the KDF paradigm with a loss-aware sampling strategy and a rehearsal enhancement mechanism. The proposed sampling strategy estimates the learning difficulty of each sample via its loss value and selects samples for replay according to their estimated difficulty. Harder samples, which tend to encode more discriminative information, are sampled with higher probability while maintaining distributional coverage across the dataset. In addition, the rehearsal enhancement mechanism encourages memory samples to be further refined during new-task training by slightly reducing their loss relative to previous tasks, thereby reinforcing long-term knowledge retention. Extensive experiments across multiple benchmarks demonstrate that KDF+ consistently outperforms existing continual learning methods and can be seamlessly integrated into state-of-the-art continual learning for LiDAR place recognition frameworks to yield significant and stable performance gains. The code will be available at this https URL.
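The loss-aware sampling idea in this abstract (harder samples are replayed with higher probability while coverage of the dataset is maintained) can be illustrated with a minimal sketch. This is not the KDF+ implementation: the function names `replay_weights` and `sample_replay`, the `floor` parameter, and the choice of mixing a loss-proportional distribution with a uniform one are all assumptions made for illustration, and losses are assumed to be non-negative.

```python
import random


def replay_weights(losses, floor=0.1):
    """Turn per-sample losses into selection probabilities.

    Assumption: probability is proportional to the loss, blended with a
    uniform component of total mass `floor` so that easy samples keep a
    nonzero chance of being replayed (a crude stand-in for coverage).
    """
    n = len(losses)
    total = sum(losses)
    if total == 0:
        # All samples look equally easy: fall back to uniform sampling.
        return [1.0 / n] * n
    return [(1.0 - floor) * (loss / total) + floor / n for loss in losses]


def sample_replay(losses, k, floor=0.1, seed=0):
    """Draw k distinct sample indices, favoring high-loss samples."""
    if not 0 <= k <= len(losses):
        raise ValueError("k must be between 0 and the number of samples")
    rng = random.Random(seed)
    weights = replay_weights(losses, floor)
    candidates = list(range(len(losses)))
    chosen = []
    for _ in range(k):
        # Weighted draw among the remaining candidates, then remove the pick
        # so that no index is selected twice.
        position = rng.choices(
            range(len(candidates)),
            weights=[weights[i] for i in candidates],
        )[0]
        chosen.append(candidates.pop(position))
    return chosen
```

Raising `floor` moves the sampler toward uniform replay, while `floor=0` makes selection depend on the loss alone.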
[CV-16] MHR: Momentum Human Rig
[Quick read]: This paper aims to balance animation expressiveness and anatomical plausibility in existing parametric human body models, targeting the need for high-fidelity, robust human animation in augmented reality (AR), virtual reality (VR), and graphics rendering pipelines. The key to the solution is combining the decoupled skeleton/shape architecture of ATLAS with a flexible, modern rig and pose corrective system inspired by the Momentum library, which supports non-linear pose correctives and yields more expressive, anatomically plausible human animation.
Link: https://arxiv.org/abs/2511.15586
Authors: Aaron Ferguson,Ahmed A. A. Osman,Berta Bescos,Carsten Stoll,Chris Twigg,Christoph Lassner,David Otte,Eric Vignola,Federica Bogo,Igor Santesteban,Javier Romero,Jenna Zarate,Jeongseok Lee,Jinhyung Park,Jinlong Yang,John Doublestein,Kishore Venkateshan,Kris Kitani,Ladislav Kavan,Marco Dal Farra,Matthew Hu,Matthew Cioffi,Michael Fabris,Michael Ranieri,Mohammad Modarres,Petr Kadlecek,Rinat Abdrashitov,Romain Prévost,Roman Rajbhandari,Ronald Mallet,Russel Pearsall,Sandy Kao,Sanjeev Kumar,Scott Parrish,Te-Li Wang,Tony Tung,Yuan Dong,Yuhua Chen,Yuanlu Xu,Yuting Ye,Zhongshi Jiang
Affiliations: Meta
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present MHR, a parametric human body model that combines the decoupled skeleton/shape paradigm of ATLAS with a flexible, modern rig and pose corrective system inspired by the Momentum library. Our model enables expressive, anatomically plausible human animation, supporting non-linear pose correctives, and is designed for robust integration in AR/VR and graphics pipelines.
[CV-17] CompTrack: Information Bottleneck-Guided Low-Rank Dynamic Token Compression for Point Cloud Tracking AAAI2026
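[Quick read]: This paper addresses a dual-redundancy challenge in 3D single object tracking (SOT) on LiDAR point clouds: spatial redundancy from background noise hurts tracking accuracy, and informational redundancy within the foreground target lowers computational efficiency. The key to the solution is the end-to-end framework CompTrack, built on two modules. The Spatial Foreground Predictor (SFP) filters background noise based on information entropy to reduce spatial redundancy. The Information Bottleneck-guided Dynamic Token Compression (IB-DTC) module uses online singular value decomposition (SVD) to compute a low-rank approximation of the foreground features and adaptively distills the redundant information into compact, high-fidelity proxy tokens. This removes the informational redundancy within the foreground and improves both tracking efficiency and accuracy.
Link: https://arxiv.org/abs/2511.15580
Authors: Sifan Zhou,Yichao Cao,Jiahao Nie,Yuqian Fu,Ziyu Zhao,Xiaobo Lu,Shuo Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026 (Oral)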
Abstract:3D single object tracking (SOT) in LiDAR point clouds is a critical task in computer vision and autonomous driving. Despite great success having been achieved, the inherent sparsity of point clouds introduces a dual-redundancy challenge that limits existing trackers: (1) vast spatial redundancy from background noise impairs accuracy, and (2) informational redundancy within the foreground hinders efficiency. To tackle these issues, we propose CompTrack, a novel end-to-end framework that systematically eliminates both forms of redundancy in point clouds. First, CompTrack incorporates a Spatial Foreground Predictor (SFP) module to filter out irrelevant background noise based on information entropy, addressing spatial redundancy. Subsequently, its core is an Information Bottleneck-guided Dynamic Token Compression (IB-DTC) module that eliminates the informational redundancy within the foreground. Theoretically grounded in low-rank approximation, this module leverages an online SVD analysis to adaptively compress the redundant foreground into a compact and highly informative set of proxy tokens. Extensive experiments on KITTI, nuScenes and Waymo datasets demonstrate that CompTrack achieves top-performing tracking performance with superior efficiency, running at a real-time 90 FPS on a single RTX 3090 GPU.
[CV-18] AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning
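[Quick read]: This paper addresses the performance bottleneck in long-video question answering (Video Question Answering, VQA) caused by complex semantic understanding demands, especially fine-grained queries that require fusing global and local context and reasoning over multiple rounds. The key to the solution is AVATAAR, a modular and interpretable framework that combines a global video summary with local detail and introduces a Pre Retrieval Thinking Agent and a Rethink Module. Together these form a feedback loop that lets the system refine its retrieval strategy based on partial answers, giving human-like iterative reasoning. Experiments show that the feedback mechanism improves performance on temporal reasoning, technical queries, theme-based questions, and narrative comprehension, supporting the accuracy, interpretability, and extensibility of the framework.
Link: https://arxiv.org/abs/2511.15578
Authors: Urjitkumar Patel,Fang-Chun Yeh,Chinmay Gondhalekar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in the 5th IEEE Big Data Workshop on Multimodal AI (MMAI 2025), Dec 8-11, Macau, China, 2025 (Preprint Copy)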
Abstract:With the increasing prevalence of video content, effectively understanding and answering questions about long form videos has become essential for numerous applications. Although large vision language models (LVLMs) have enhanced performance, they often face challenges with nuanced queries that demand both a comprehensive understanding and detailed analysis. To overcome these obstacles, we introduce AVATAAR, a modular and interpretable framework that combines global and local video context, along with a Pre Retrieval Thinking Agent and a Rethink Module. AVATAAR creates a persistent global summary and establishes a feedback loop between the Rethink Module and the Pre Retrieval Thinking Agent, allowing the system to refine its retrieval strategies based on partial answers and replicate human-like iterative reasoning. On the CinePile benchmark, AVATAAR demonstrates significant improvements over a baseline, achieving relative gains of +5.6% in temporal reasoning, +5% in technical queries, +8% in theme-based questions, and +8.2% in narrative comprehension. Our experiments confirm that each module contributes positively to the overall performance, with the feedback loop being crucial for adaptability. These findings highlight AVATAAR’s effectiveness in enhancing video understanding capabilities. Ultimately, AVATAAR presents a scalable solution for long-form Video Question Answering (QA), merging accuracy, interpretability, and extensibility.
[CV-19] From Low-Rank Features to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers
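[Quick read]: This paper investigates why feature-map knowledge distillation (KD) works poorly for Vision Transformers (ViTs). Final-layer ViT representations are globally low-rank (for CaiT-S24, 121 dimensions retain 99% of the energy), so in theory a simple linear projection should suffice to align a compact student with the teacher; in practice, standard feature KD performs poorly. To explain this gap between theory and practice, the authors introduce a token-level Spectral Energy Pattern (SEP) analysis, which shows that despite the global low-rank structure, each token spreads its energy across most channels, forming a high-bandwidth encoding pattern and producing an encoding mismatch between wide teachers and narrow students. The key to the solution is to identify and repair this width mismatch with one of two strategies: (1) post-hoc feature lifting, a lightweight projector that is retained at inference; or (2) native width alignment, which widens only the last block of the student to the teacher width. These minimal changes improve ViT feature distillation: on ImageNet-1K, DeiT-Tiny accuracy rises from 74.86% to 77.53% and 78.23%, and students trained without a teacher also improve.
Link: https://arxiv.org/abs/2511.15572
Authors: Huiyuan Tian,Bonan Xu,Shijian Li,Xin Jin
Affiliations: Zhejiang University; The Hong Kong Polytechnic University; GenPi Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: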
Abstract:Feature-map knowledge distillation (KD) is highly effective for convolutional networks but often fails for Vision Transformers (ViTs). To understand this failure and guide method design, we conduct a two-view representation analysis of ViTs. First, a layer-wise Singular Value Decomposition (SVD) of full feature matrices shows that final-layer representations are globally low-rank: for CaiT-S24, only 121/61/34/14 dimensions suffice to capture 99%/95%/90%/80% of the energy. In principle, this suggests that a compact student plus a simple linear projector should be enough for feature alignment, contradicting the weak empirical performance of standard feature KD. To resolve this paradox, we introduce a token-level Spectral Energy Pattern (SEP) analysis that measures how each token uses channel capacity. SEP reveals that, despite the global low-rank structure, individual tokens distribute energy over most channels, forming a high-bandwidth encoding pattern. This results in an encoding mismatch between wide teachers and narrow students. Motivated by this insight, we propose two minimal, mismatch-driven strategies: (1) post-hoc feature lifting with a lightweight projector retained during inference, or (2) native width alignment that widens only the student’s last block to the teacher’s width. On ImageNet-1K, these strategies reactivate simple feature-map distillation in ViTs, raising DeiT-Tiny accuracy from 74.86% to 77.53% and 78.23% when distilling from CaiT-S24, while also improving standalone students trained without any teacher. Our analysis thus explains why ViT feature distillation fails and shows how exploiting low-rank structure yields effective, interpretable remedies and concrete design guidance for compact ViTs.
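The "121/61/34/14 dimensions suffice to capture 99%/95%/90%/80% of the energy" statement can be made concrete with a small sketch of how such a rank is read off a singular value spectrum. It assumes the common convention that energy is the sum of squared singular values; the abstract does not spell out its definition, and `rank_for_energy` is a hypothetical helper name, not code from the paper.

```python
def rank_for_energy(singular_values, fraction):
    """Smallest k such that the top-k singular values hold `fraction` of the energy.

    Assumption: energy is measured as the sum of squared singular values.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Sort the squared singular values from largest to smallest.
    energies = sorted((s * s for s in singular_values), reverse=True)
    total = sum(energies)
    if total == 0:
        raise ValueError("the spectrum has no energy")
    running = 0.0
    for k, energy in enumerate(energies, start=1):
        running += energy
        if running >= fraction * total:
            return k
    # Guard against floating-point round-off when fraction is close to 1.
    return len(energies)
```

For example, with singular values 3, 2, 1 the squared spectrum is 9, 4, 1, so one dimension already holds about 64% of the energy and two dimensions hold about 93%.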
[CV-20] Transferable Dual-Domain Feature Importance Attack against AI-Generated Image Detector
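[Quick read]: This paper addresses the insufficiently studied security of AI-generated image (AIGI) detectors under adversarial attack, in particular their robustness to adversarial perturbations and the cross-model transferability of attacks. The key to the solution is a Dual-domain Feature Importance Attack (DuFIA), which improves the transferability of adversarial examples by jointly modeling feature importance in the spatial and frequency domains. Specifically, spatially interpolated gradients capture the important spatial-domain features, frequency-aware perturbations target the frequency-domain features, and the two importance estimates are fused to guide optimization-based adversarial example generation, degrading the discriminative performance of a range of AIGI detectors.
Link: https://arxiv.org/abs/2511.15571
Authors: Weiheng Zhu,Gang Cao,Jing Liu,Lifang Yu,Shaowei Weng
Affiliations: Communication University of China; Hunan University of Information Technology; Beijing Institute of Graphic Communication; Fujian University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: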
Abstract:Recent AI-generated image (AIGI) detectors achieve impressive accuracy under clean condition. In view of antiforensics, it is significant to develop advanced adversarial attacks for evaluating the security of such detectors, which remains unexplored sufficiently. This letter proposes a Dual-domain Feature Importance Attack (DuFIA) scheme to invalidate AIGI detectors to some extent. Forensically important features are captured by the spatially interpolated gradient and frequency-aware perturbation. The adversarial transferability is enhanced by jointly modeling spatial and frequency-domain feature importances, which are fused to guide the optimization-based adversarial example generation. Extensive experiments across various AIGI detectors verify the cross-model transferability, transparency and robustness of DuFIA.
[CV-21] Scriboora: Rethinking Human Pose Forecasting
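[Quick read]: This paper addresses reproducibility problems in human pose forecasting and aims to improve model robustness in realistic settings. Its contributions are threefold. First, it builds a unified training and evaluation pipeline that makes comparisons between algorithms fairer and results easier to reproduce. Second, it adapts recent model architectures from speech understanding to pose forecasting, improving on the current state of the art. Third, it introduces a new dataset variation with noisy joint coordinates obtained from a pose estimator, which mimics real-world observation errors, and shows how much of the performance lost to this noise can be recovered by unsupervised finetuning.
Link: https://arxiv.org/abs/2511.15565
Authors: Daniel Bermuth,Alexander Poeppel,Wolfgang Reif
Affiliations: University of Augsburg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: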
Abstract:Human pose forecasting predicts future poses based on past observations, and has many significant applications in areas such as action recognition, autonomous driving or human-robot interaction. This paper evaluates a wide range of pose forecasting algorithms in the task of absolute pose forecasting, revealing many reproducibility issues, and provides a unified training and evaluation pipeline. After drawing a high-level analogy to the task of speech understanding, it is shown that recent speech models can be efficiently adapted to the task of pose forecasting, and improve current state-of-the-art performance. At last the robustness of the models is evaluated, using noisy joint coordinates obtained from a pose estimator model, to reflect a realistic type of noise, which is more close to real-world applications. For this a new dataset variation is introduced, and it is shown that estimated poses result in a substantial performance degradation, and how much of it can be recovered again by unsupervised finetuning.
[CV-22] A Hybrid CNN-ViT-GNN Framework with GAN-Based Augmentation for Intelligent Weed Detection in Precision Agriculture
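[Quick read]: This paper addresses weed detection in precision agriculture: identifying weed species accurately and robustly under complex, variable field conditions, so that herbicides can be applied selectively in support of sustainable crop management. The key to the solution is a hybrid deep learning framework that combines Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Graph Neural Networks (GNNs) to capture local, global, and relational features together. In addition, Generative Adversarial Network (GAN)-based augmentation balances the class distribution, and self-supervised contrastive pre-training extracts more discriminative feature representations from limited annotated data. The reported accuracy, precision, recall, and F1-score on multi-benchmark datasets are all 99.33%; the framework is also described as interpretable and adaptable, and as supporting real-time deployment on edge devices.
Link: https://arxiv.org/abs/2511.15535
Authors: Pandiyaraju V,Abishek Karthik,Sreya Mynampati,Poovarasan L,D. Saraswathi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: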
Abstract:The task of weed detection is an essential element of precision agriculture since accurate species identification allows a farmer to selectively apply herbicides and fits into sustainable agriculture crop management. This paper proposes a hybrid deep learning framework recipe for weed detection that utilizes Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Graph Neural Networks (GNNs) to build robustness to multiple field conditions. A Generative Adversarial Network (GAN)-based augmentation method was imposed to balance class distributions and better generalize the model. Further, a self-supervised contrastive pre-training method helps to learn more features from limited annotated data. Experimental results yield superior results with 99.33% accuracy, precision, recall, and F1-score on multi-benchmark datasets. The proposed model architecture enables local, global, and relational feature representations and offers high interpretability and adaptability. Practically, the framework allows real-time, efficient deployment to edge devices for automated weed detecting, reducing over-reliance on herbicides and providing scalable, sustainable precision-farming options.
[CV-23] Multi-Text Guided Few-Shot Semantic Segmentation
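[Quick read]: This paper addresses three problems in CLIP-based few-shot semantic segmentation that degrade the quality of visual priors: incomplete activation of target regions caused by a single text prompt, the lack of explicit cross-modal interaction, and interference from noisy support features. The key to the solution is MTGNet, a dual-branch network that uses multi-text guidance to make textual priors more expressive and robust. It has three modules. The Multi-Textual Prior Refinement (MTPR) module aggregates complementary semantic cues and suppresses interference to strengthen foreground activation. The Text Anchor Feature Fusion (TAFF) module uses multi-text embeddings as semantic anchors to transfer discriminative local prototypes from support images to query images, improving intra-class consistency. The Foreground Confidence-Weighted Attention (FCWA) module uses the internal self-similarity of support foreground features to adaptively down-weight inconsistent regions, making the visual priors more robust.
Link: https://arxiv.org/abs/2511.15515
Authors: Qiang Jiao,Bin Yan,Yi Yang,Mengrui Shi,Qiang Zhang
Affiliations: State Key Laboratory of Electromechanical Integrated Manufacturing of High-Performance Electronic Equipments; Center for Complex Systems, School of Mechano-Electronic Engineering, Xidian University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: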
Abstract:Recent CLIP-based few-shot semantic segmentation methods introduce class-level textual priors to assist segmentation by typically using a single prompt (e.g., a photo of class). However, these approaches often result in incomplete activation of target regions, as a single textual description cannot fully capture the semantic diversity of complex categories. Moreover, they lack explicit cross-modal interaction and are vulnerable to noisy support features, further degrading visual prior quality. To address these issues, we propose the Multi-Text Guided Few-Shot Semantic Segmentation Network (MTGNet), a dual-branch framework that enhances segmentation performance by fusing diverse textual prompts to refine textual priors and guide the cross-modal optimization of visual priors. Specifically, we design a Multi-Textual Prior Refinement (MTPR) module that suppresses interference and aggregates complementary semantic cues to enhance foreground activation and expand semantic coverage for structurally complex objects. We introduce a Text Anchor Feature Fusion (TAFF) module, which leverages multi-text embeddings as semantic anchors to facilitate the transfer of discriminative local prototypes from support images to query images, thereby improving semantic consistency and alleviating intra-class variations. Furthermore, a Foreground Confidence-Weighted Attention (FCWA) module is presented to enhance visual prior robustness by leveraging internal self-similarity within support foreground features. It adaptively down-weights inconsistent regions and effectively suppresses interference in the query segmentation process. Extensive experiments on standard FSS benchmarks validate the effectiveness of MTGNet. In the 1-shot setting, it achieves 76.8% mIoU on PASCAL-5i and 57.4% on COCO-20i, with notable improvements in folds exhibiting high intra-class variations.
[CV-24] Learning to Expand Images for Efficient Visual Autoregressive Modeling
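[Quick read]: This paper addresses the inefficiency of current autoregressive image generation models, in particular the high computational cost and slow generation caused by token-by-token decoding or multi-scale representations. The key to the solution is a new generation paradigm, Expanding Autoregressive Representation (EAR), which emulates the center-outward perception pattern of the human visual system: image tokens are generated in a spiral order that starts at the image center and expands outward, preserving spatial continuity while supporting efficient parallel decoding. A length-adaptive decoding strategy dynamically adjusts the number of tokens predicted at each step, adding flexibility and speed. Aligning the generation order with perceptual relevance improves the trade-off between generation quality and computational efficiency.
Link: https://arxiv.org/abs/2511.15499
Authors: Ruiqing Yang,Kaixin Zhang,Zheng Zhang,Shan You,Tao Huang
Affiliations: University of Electronic Science and Technology of China; School of Computer Science and Engineering, Central South University; Xidian University; SenseTime Research; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 18 figures, includes appendix with additional visualizations, submitted as arXiv preprint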
Abstract:Autoregressive models have recently shown great promise in visual generation by leveraging discrete token sequences akin to language modeling. However, existing approaches often suffer from inefficiency, either due to token-by-token decoding or the complexity of multi-scale representations. In this work, we introduce Expanding Autoregressive Representation (EAR), a novel generation paradigm that emulates the human visual system’s center-outward perception pattern. EAR unfolds image tokens in a spiral order from the center and progressively expands outward, preserving spatial continuity and enabling efficient parallel decoding. To further enhance flexibility and speed, we propose a length-adaptive decoding strategy that dynamically adjusts the number of tokens predicted at each step. This biologically inspired design not only reduces computational cost but also improves generation quality by aligning the generation order with perceptual relevance. Extensive experiments on ImageNet demonstrate that EAR achieves state-of-the-art trade-offs between fidelity and efficiency on single-scale autoregressive models, setting a new direction for scalable and cognitively aligned autoregressive image generation.
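The center-outward spiral ordering described in this abstract can be sketched for a token grid as follows. This only illustrates one possible spiral traversal; the exact order EAR uses, how it treats even-sized grids, and how it groups tokens for parallel or length-adaptive decoding are not given in the abstract. The name `spiral_order`, the clockwise direction, and the choice of starting cell are assumptions.

```python
def spiral_order(height, width):
    """List the (row, col) cells of a grid in a center-outward spiral.

    Assumptions: the walk starts at cell ((height-1)//2, (width-1)//2),
    moves right first, and turns clockwise (right, down, left, up).
    """
    if height <= 0 or width <= 0:
        raise ValueError("grid dimensions must be positive")
    row, col = (height - 1) // 2, (width - 1) // 2
    order = [(row, col)]
    directions = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # right, down, left, up
    total = height * width
    step, turn = 1, 0
    while len(order) < total:
        # Each arm length is used for two consecutive directions:
        # 1 right, 1 down, 2 left, 2 up, 3 right, 3 down, ...
        for _ in range(2):
            d_row, d_col = directions[turn % 4]
            for _ in range(step):
                row += d_row
                col += d_col
                # The walk may leave the grid; only in-bounds cells are kept.
                if 0 <= row < height and 0 <= col < width:
                    order.append((row, col))
            turn += 1
        step += 1
    return order
```

On a 3x3 grid this visits the center first, then the eight surrounding cells in one clockwise ring, so every token is adjacent to an earlier one in the sequence.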
[CV-25] Evaluating Low-Light Image Enhancement Across Multiple Intensity Levels
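[Quick read]: This paper addresses the inconsistent performance of low-light image enhancement methods across illumination intensities. Existing learning-based methods typically train on pairs of one low-light condition and a well-lit reference image; this lack of radiance diversity limits generalization to real scenes with varying light. The key to the solution is the Multi-Illumination Low-Light (MILL) dataset, which contains images captured at multiple light intensities with fixed camera settings and precise illuminance measurements, enabling comprehensive evaluation across lighting conditions. Using this dataset, the authors benchmark several state-of-the-art enhancement algorithms and propose improvements that increase robustness across illumination scenarios, achieving PSNR gains of up to 10 dB for DSLR images and 2 dB for smartphone images at Full HD resolution.
Link: https://arxiv.org/abs/2511.15496
Authors: Maria Pilligua,David Serrano-Lozano,Pai Peng,Ramon Baldrich,Michael S. Brown,Javier Vazquez-Corral
Affiliations: Computer Vision Center; Universitat Autònoma de Barcelona; University of Wisconsin-Madison; York University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: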
Abstract:Imaging in low-light environments is challenging due to reduced scene radiance, which leads to elevated sensor noise and reduced color saturation. Most learning-based low-light enhancement methods rely on paired training data captured under a single low-light condition and a well-lit reference. The lack of radiance diversity limits our understanding of how enhancement techniques perform across varying illumination intensities. We introduce the Multi-Illumination Low-Light (MILL) dataset, containing images captured at diverse light intensities under controlled conditions with fixed camera settings and precise illuminance measurements. MILL enables comprehensive evaluation of enhancement algorithms across variable lighting conditions. We benchmark several state-of-the-art methods and reveal significant performance variations across intensity levels. Leveraging the unique multi-illumination structure of our dataset, we propose improvements that enhance robustness across diverse illumination scenarios. Our modifications achieve up to 10 dB PSNR improvement for DSLR and 2 dB for the smartphone on Full HD images.
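To put the reported PSNR gains in context, here is a minimal sketch of the standard peak signal-to-noise ratio, PSNR = 10 * log10(MAX^2 / MSE). It assumes images are given as flat sequences of pixel values with an 8-bit peak of 255; the function name `psnr` and the `max_value` parameter are illustrative, and the evaluation protocol of the paper (color space, cropping, averaging) is not specified in the abstract.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, a gain of 10 dB corresponds to a tenfold reduction in mean squared error, and halving the per-pixel error adds about 6 dB.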
[CV-26] NTK-Guided Implicit Neural Teaching
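[Quick read]: This paper addresses the high computational cost of fitting high-resolution signals with Implicit Neural Representations (INRs), which requires optimizing over millions of coordinates. The key to the solution is NTK-Guided Implicit Neural Teaching (NINT), which accelerates training by dynamically selecting the coordinates that maximize the global functional update. NINT scores examples by the norm of their loss gradients augmented with the Neural Tangent Kernel (NTK), capturing both fitting error and heterogeneous leverage (self-influence and cross-coordinate coupling). This yields faster convergence than existing sampling strategies and cuts training time by nearly half while maintaining or improving representation quality.
Link: https://arxiv.org/abs/2511.15487
Authors: Chen Zhang,Wei Zuo,Bingyang Cheng,Yikun Wang,Wei-Bin Kou,Yik Chung WU,Ngai Wong
Affiliations: The University of Hong Kong
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint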
Abstract:Implicit Neural Representations (INRs) parameterize continuous signals via multilayer perceptrons (MLPs), enabling compact, resolution-independent modeling for tasks like image, audio, and 3D reconstruction. However, fitting high-resolution signals demands optimizing over millions of coordinates, incurring prohibitive computational costs. To address it, we propose NTK-Guided Implicit Neural Teaching (NINT), which accelerates training by dynamically selecting coordinates that maximize global functional updates. Leveraging the Neural Tangent Kernel (NTK), NINT scores examples by the norm of their NTK-augmented loss gradients, capturing both fitting errors and heterogeneous leverage (self-influence and cross-coordinate coupling). This dual consideration enables faster convergence compared to existing methods. Through extensive experiments, we demonstrate that NINT significantly reduces training time by nearly half while maintaining or improving representation quality, establishing state-of-the-art acceleration among recent sampling-based strategies.
[CV-27] A Novel CustNetGC Boosted Model with Spectral Features for Parkinsons Disease Prediction
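[Quick read]: This paper addresses the difficulty of diagnosing Parkinson’s Disease (PD) early, and in particular how to use voice features for accurate, interpretable automatic classification. The key to the solution is a hybrid model, CustNetGC, that combines a Convolutional Neural Network (CNN), Custom Network Grad-CAM, and a CatBoost classifier. Key spectral features (L-mHP and Spectral Slopes) are first extracted from the voice recordings; L-mHP combines the Log-Mel spectrogram with the harmonic and percussive spectrograms obtained by Harmonic-Percussive Source Separation (HPSS). Grad-CAM then makes the predictions more interpretable by visualizing the important acoustic regions, and CatBoost improves classification robustness and performance. On a public dataset, the method is reported to reach 99.06% accuracy with AUC values of up to 0.90.
Link: https://arxiv.org/abs/2511.15485
Authors: Abishek Karthik,Pandiyaraju V,Dominic Savio M,Rohit Swaminathan S
Affiliations: Unknown
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
Comments: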
Abstract:Parkinson’s disease is a neurodegenerative disorder that can be very tricky to diagnose and treat. Such early symptoms can include tremors, wheezy breathing, and changes in voice quality as critical indicators of neural damage. Notably, there has been growing interest in utilizing changes in vocal attributes as markers for the detection of PD early on. Based on this understanding, the present paper was designed to focus on the acoustic feature analysis based on voice recordings of patients diagnosed with PD and healthy controls (HC). In this paper, we introduce a novel classification and visualization model known as CustNetGC, combining a Convolutional Neural Network (CNN) with Custom Network Grad-CAM and CatBoost to enhance the efficiency of PD diagnosis. We use a publicly available dataset from Figshare, including voice recordings of 81 participants: 40 patients with PD and 41 healthy controls. From these recordings, we extracted the key spectral features: L-mHP and Spectral Slopes. The L-mHP feature combines three spectrogram representations: Log-Mel spectrogram, harmonic spectrogram, and percussive spectrogram, which are derived using Harmonic-Percussive Source Separation (HPSS). Grad-CAM was used to highlight the important regions in the data, thus making the PD predictions interpretable and effective. Our proposed CustNetGC model achieved an accuracy of 99.06% and precision of 95.83%, with the area under the ROC curve (AUC) recorded at 0.90 for the PD class and 0.89 for the HC class. Additionally, the combination of CatBoost, a gradient boosting algorithm, enhanced the robustness and the prediction performance by properly classifying PD and non-PD samples. Therefore, the results provide the potential improvement in the CustNetGC system in enhancing diagnostic accuracy and the interpretability of the Parkinson’s Disease prediction model.
[CV-28] FunnyNodules: A Customizable Medical Dataset Tailored for Evaluating Explainable AI
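[Quick read]: This paper addresses the scarcity of densely annotated datasets in medical image analysis, especially annotations that include both diagnostic labels and the reasoning behind them, which limits the development and evaluation of explainable AI (xAI) models. The key to the solution is FunnyNodules, a fully parameterized synthetic dataset that generates lung nodule-like shapes with controllable visual attributes (such as roundness, margin sharpness, and spiculation) and derives the target class from a predefined attribute combination, giving full control over the decision rule that maps attributes to the diagnostic class. The framework supports model-agnostic evaluation: checking whether a model learns the correct attribute-target relations, interpreting over- or underperformance in attribute prediction, and analyzing how well attention aligns with attribute-specific regions. It provides a flexible platform with complete ground truth for developing, benchmarking, and analyzing explainable AI methods.
Link: https://arxiv.org/abs/2511.15481
Authors: Luisa Gallée,Yiheng Xiong,Meinrad Beer,Michael Götz
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: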
Abstract:Densely annotated medical image datasets that capture not only diagnostic labels but also the underlying reasoning behind these diagnoses are scarce. Such reasoning-related annotations are essential for developing and evaluating explainable AI (xAI) models that reason similarly to radiologists: making correct predictions for the right reasons. To address this gap, we introduce FunnyNodules, a fully parameterized synthetic dataset designed for systematic analysis of attribute-based reasoning in medical AI models. The dataset generates abstract, lung nodule-like shapes with controllable visual attributes such as roundness, margin sharpness, and spiculation. Target class is derived from a predefined attribute combination, allowing full control over the decision rule that links attributes to the diagnostic class. We demonstrate how FunnyNodules can be used in model-agnostic evaluations to assess whether models learn correct attribute-target relations, to interpret over- or underperformance in attribute prediction, and to analyze attention alignment with attribute-specific regions of interest. The framework is fully customizable, supporting variations in dataset complexity, target definitions, class balance, and beyond. With complete ground truth information, FunnyNodules provides a versatile foundation for developing, benchmarking, and conducting in-depth analyses of explainable AI methods in medical image analysis.
[CV-29] RS-CA-HSICT: A Residual and Spatial Channel Augmented CNN Transformer Framework for Monkeypox Detection
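[Quick read]: This paper addresses insufficient feature extraction in monkeypox (Mpox) image classification, and the difficulty of modeling local detail and global dependencies together. The key to the solution is RS-CA-HSICT, a channel-augmented integrated CNN-Transformer architecture based on residual and spatial learning. A new HSICT module combines an abstract stem CNN with customized ICT blocks to integrate multi-head attention with structured CNN layers; homogeneous (H) and structural (S) layers capture spatial homogeneity and fine structural detail; inverse residual learning mitigates vanishing gradients; and stage-wise resolution reduction provides scale invariance. Channel fusion and attention refine channel selection and strengthen multi-scale feature representation. On the Kaggle benchmark and a diverse MPox dataset, the model is reported to reach a classification accuracy of up to 98.30% and an F1-score of 98.13%, outperforming existing CNN and ViT models.
Link: https://arxiv.org/abs/2511.15476
Authors: Rashid Iqbal,Saddam Hussain Khan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 33 Pages, 12 Figure, 4 Tables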
Abstract:This work proposes a hybrid deep learning approach, namely Residual and Spatial Learning based Channel Augmented Integrated CNN-Transformer architecture, that leverages the strengths of CNN and Transformer towards enhanced MPox detection. The proposed RS-CA-HSICT framework is composed of an HSICT block, a residual CNN module, a spatial CNN block, and a CA, which enhances the diverse feature space, detailed lesion information, and long-range dependencies. The new HSICT module first integrates an abstract representation of the stem CNN and customized ICT blocks for efficient multihead attention and structured CNN layers with homogeneous (H) and structural (S) operations. The customized ICT blocks learn global contextual interactions and local texture extraction. Additionally, H and S layers learn spatial homogeneity and fine structural details by reducing noise and modeling complex morphological variations. Moreover, inverse residual learning enhances vanishing gradient, and stage-wise resolution reduction ensures scale invariance. Furthermore, the RS-CA-HSICT framework augments the learned HSICT channels with the TL-driven Residual and Spatial CNN maps for enhanced multiscale feature space capturing global and localized structural cues, subtle texture, and contrast variations. These channels, preceding augmentation, are refined through the Channel-Fusion-and-Attention block, which preserves discriminative channels while suppressing redundant ones, thereby enabling efficient computation. Finally, the spatial attention mechanism refines pixel selection to detect subtle patterns and intra-class contrast variations in Mpox. Experimental results on both the Kaggle benchmark and a diverse MPox dataset reported classification accuracy as high as 98.30% and an F1-score of 98.13%, which outperforms the existing CNNs and ViTs.
[CV-30] Deep Learning for Accurate Vision-based Catch Composition in Tropical Tuna Purse Seiners
[Quick read]: This paper addresses the difficulty of processing the large volume of video produced by electronic monitoring (EM) systems in the tropical tuna purse-seine fishery, and in particular the low accuracy and poor inter-expert agreement in distinguishing bigeye tuna (BET) from yellowfin tuna (YFT). The key to the solution is a multi-stage automated pipeline. Segmentation combines YOLOv9 with Segment Anything Model 2 (SAM2), reaching a validation mean average precision of 0.66 ± 0.03 and a recall of 0.88 ± 0.03; segmented individuals are tracked with ByteTrack; and a hierarchical classification strategy replaces a standard multiclass classifier and generalizes better. The models were cross-validated and then tested on fishing operations with fully known catch composition, where 84.8% of individuals were segmented and classified with a mean average error of 4.5%.
Link: https://arxiv.org/abs/2511.15468
Authors: Xabier Lekunberri,Ahmad Kamal,Izaro Goienetxea,Jon Ruiz,Iñaki Quincoces,Jaime Valls Miro,Ignacio Arganda-Carreras,Jose A. Fernandes-Salvador
Affiliations: AZTI, Marine Research, Basque Research and Technology Alliance (BRTA), Txatxarramendi Ugartea z/g, Sukarrieta (Bizkaia), 48395, Spain; University of the Basque Country (UPV/EHU), San Sebastian, Spain; IKERBASQUE, Basque Foundation for Science, Bilbao, Spain; Donostia International Physics Center (DIPC), San Sebastian, Spain; Biofisika Institute (CSIC, UPV/EHU), Leioa, Spain
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 5 figures
Abstract:Purse seiners play a crucial role in tuna fishing, as approximately 69% of the world’s tropical tuna is caught using this gear. All tuna Regional Fisheries Management Organizations have established minimum standards to use electronic monitoring (EM) in fisheries in addition to traditional observers. The EM systems produce a massive amount of video data that human analysts must process. Integrating artificial intelligence (AI) into their workflow can decrease that workload and improve the accuracy of the reports. However, species identification still poses significant challenges for AI, as achieving balanced performance across all species requires appropriate training data. Here, we quantify the difficulty experts face in distinguishing bigeye tuna (BET, Thunnus obesus) from yellowfin tuna (YFT, Thunnus albacares) using images captured by EM systems. We found inter-expert agreements of 42.9% ± 35.6% for BET and 57.1% ± 35.6% for YFT. We then present a multi-stage pipeline to estimate the species composition of the catches using a reliable ground-truth dataset based on identifications made by observers on board. Three segmentation approaches are compared: Mask R-CNN, a combination of DINOv2 with SAM2, and an integration of YOLOv9 with SAM2. We found that the last performs best, with a validation mean average precision of 0.66 ± 0.03 and a recall of 0.88 ± 0.03. Segmented individuals are tracked using ByteTrack. For classification, we evaluate a standard multiclass classification model and a hierarchical approach, finding that the hierarchical approach generalizes better. All our models were cross-validated during training and tested on fishing operations with fully known catch composition. Combining YOLOv9-SAM2 with the hierarchical classification produced the best estimations, with 84.8% of the individuals being segmented and classified with a mean average error of 4.5%.
[CV-31] SIGMMA: Hierarchical Graph-Based Multi-Scale Multi-modal Contrastive Alignment of Histopathology Image and Spatial Transcriptome
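[Quick Read]: This paper addresses a limitation of existing computational pathology methods, which align Hematoxylin and Eosin (HE) images with spatial transcriptomic (ST) data at a single scale only and thereby overlook fine-grained cellular structures and their spatial organization. The key to the solution is the Sigmma framework, which uses multi-modal contrastive alignment to learn hierarchical representations of HE images and ST data across multiple scales, and models cell-cell interactions as a graph that integrates intra- and inter-subgraph relationships, capturing fine-to-coarse interaction patterns in the tissue microenvironment and improving cross-modal correspondence.
Link: https://arxiv.org/abs/2511.15464
Authors: Dabin Jeong,Amirhossein Vahidi,Ciro Ramírez-Suástegui,Marie Moullet,Kevin Ly,Mohammad Vali Sanian,Sebastian Birk,Yinshui Chang,Adam Boxall,Daniyal Jafree,Lloyd Steele,Vijaya Baskar MS,Muzlifah Haniffa,Mohammad Lotfollahi
Affiliations: Wellcome Sanger Institute; University of Cambridge; Helmholtz Center Munich
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: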
Abstract:Recent advances in computational pathology have leveraged vision-language models to learn joint representations of Hematoxylin and Eosin (HE) images with spatial transcriptomic (ST) profiles. However, existing approaches typically align HE tiles with their corresponding ST profiles at a single scale, overlooking fine-grained cellular structures and their spatial organization. To address this, we propose Sigmma, a multi-modal contrastive alignment framework for learning hierarchical representations of HE images and spatial transcriptome profiles across multiple scales. Sigmma introduces multi-scale contrastive alignment, ensuring that representations learned at different scales remain coherent across modalities. Furthermore, by representing cell interactions as a graph and integrating inter- and intra-subgraph relationships, our approach effectively captures cell-cell interactions, ranging from fine to coarse, within the tissue microenvironment. We demonstrate that Sigmma learns representations that better capture cross-modal correspondences, leading to average improvements of 9.78% in the gene-expression prediction task and 26.93% in the cross-modal retrieval task across datasets. We further show that it learns meaningful multi-tissue organization in downstream analyses.
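To make the idea of multi-scale contrastive alignment concrete, here is a minimal NumPy sketch of a symmetric InfoNCE loss summed over several scales. It is an illustration under stated assumptions, not the Sigmma implementation: the function names `info_nce` and `multi_scale_loss`, the temperature of 0.07, and the equal default weights are all hypothetical, and the graph component described in the abstract is not modeled.

```python
import numpy as np

def info_nce(image_emb, gene_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each matrix is assumed to be a matched pair."""
    # L2-normalise so that dot products are cosine similarities.
    a = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    b = gene_emb / np.linalg.norm(gene_emb, axis=1, keepdims=True)
    logits = a @ b.T / temperature

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row; the target is the diagonal.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the image-to-gene and gene-to-image directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

def multi_scale_loss(pairs, weights=None):
    """Weighted sum of InfoNCE losses, one (image_emb, gene_emb) pair per scale."""
    weights = weights or [1.0] * len(pairs)
    return sum(w * info_nce(x, y) for w, (x, y) in zip(weights, pairs))
```

In use, `pairs` would hold one pair of embedding matrices per scale (for example, cell level and tile level), so that a mismatch at any scale raises the total loss.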
[CV-32] Driving in Spikes: An Entropy-Guided Object Detector for Spike Cameras
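[Quick Read]: This paper addresses the motion blur and saturation that fast motion and extreme lighting cause in autonomous-driving object detection. Spike cameras offer microsecond latency and ultra-high dynamic range, but their sparse, discrete output cannot be processed directly by standard image-based detectors, which makes end-to-end spike-stream detection a key challenge. The proposed solution is EASD, an end-to-end spike camera detector with a dual-branch design: a Temporal Based Texture plus Feature Fusion branch that captures global cross-slice semantics, and an Entropy Selective Attention branch that focuses on object-centric details. To close the data gap, the authors also introduce DSEC Spike, the first driving-oriented simulated spike detection benchmark.
Link: https://arxiv.org/abs/2511.15459
Authors: Ziyan Liu,Qi Su,Lulu Tang,Zhaofei Yu,Tiejun Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: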
Abstract:Object detection in autonomous driving suffers from motion blur and saturation under fast motion and extreme lighting. Spike cameras offer microsecond latency and ultra-high dynamic range for object detection by using per-pixel asynchronous integrate-and-fire. However, their sparse, discrete output cannot be processed by standard image-based detectors, posing a critical challenge for end-to-end spike stream detection. We propose EASD, an end-to-end spike camera detector with a dual-branch design: a Temporal Based Texture plus Feature Fusion branch for global cross-slice semantics, and an Entropy Selective Attention branch for object-centric details. To close the data gap, we introduce DSEC Spike, the first driving-oriented simulated spike detection benchmark.
[CV-33] A Dataset and Baseline for Deep Learning-Based Visual Quality Inspection in Remanufacturing
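[Quick Read]: This paper addresses the automation of visual quality inspection in remanufacturing, where the wide variety of parts and defect patterns makes it hard for deep neural networks to generalize. The key to the solution is a new image dataset of typical gearbox components from two automotive transmissions in good and defective condition; different train-test splits generate different distribution shifts, which serve as a benchmark for the generalization ability of classification models. The authors also propose a contrastive regularization loss that improves robustness and generalization to unseen component types.
Link: https://arxiv.org/abs/2511.15440
Authors: Johannes C. Bauer,Paul Geng,Stephan Trattnig,Petr Dokládal,Rüdiger Daub
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: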
Abstract:Remanufacturing describes a process in which worn products are restored to like-new condition, and it offers substantial ecological and economic potential. A key step is the quality inspection of disassembled components, which is mostly done manually due to the high variety of parts and defect patterns. Deep neural networks show great potential to automate such visual inspection tasks but struggle to generalize to new product variants, components, or defect patterns. To tackle this challenge, we propose a novel image dataset depicting typical gearbox components in good and defective condition from two automotive transmissions. Depending on the train-test split of the data, different distribution shifts are generated to benchmark the generalization ability of a classification model. We evaluate different models using the dataset and propose a contrastive regularization loss to enhance model robustness. The results obtained demonstrate the ability of the loss to improve generalization to unseen types of components.
[CV-34] HV-Attack: Hierarchical Visual Attack for Multimodal Retrieval Augmented Generation
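[Quick Read]: This paper studies the security of multimodal Retrieval-Augmented Generation (MRAG) systems under visual attack, that is, how adding only imperceptible perturbations to the user's input image can disrupt the cooperation between the retrieval and generation modules and degrade performance. The key to the solution is a Hierarchical Visual Attack that misaligns the generator's two inputs through a two-stage strategy: the perturbation first breaks the cross-modal alignment between image and text and then disrupts the multimodal semantic alignment, so that the retriever recalls irrelevant knowledge; the generator then reasons over that wrong knowledge, which significantly lowers both retrieval and generation performance.
Link: https://arxiv.org/abs/2511.15435
Authors: Linyin Luo,Yujuan Ding,Yunshan Ma,Wenqi Fan,Hanjiang Lai
Affiliations: The Hong Kong Polytechnic University; Sun Yat-Sen University; Singapore Management University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: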
Abstract:Advanced multimodal Retrieval-Augmented Generation (MRAG) techniques have been widely applied to enhance the capabilities of Large Multimodal Models (LMMs), but they also bring along novel safety issues. Existing adversarial research has revealed the vulnerability of MRAG systems to knowledge poisoning attacks, which fool the retriever into recalling injected poisoned contents. However, our work considers a different setting: visual attack of MRAG by solely adding imperceptible perturbations at the image inputs of users, without manipulating any other components. This is challenging due to the robustness of fine-tuned retrievers and large-scale generators, and the effect of visual perturbation may be further weakened by propagation through the RAG chain. We propose a novel Hierarchical Visual Attack that misaligns and disrupts the two inputs (the multimodal query and the augmented knowledge) of MRAG’s generator to confuse its generation. We further design a hierarchical two-stage strategy to obtain misaligned augmented knowledge. We disrupt the image input of the retriever to make it recall irrelevant knowledge from the original database, by optimizing the perturbation which first breaks the cross-modal alignment and then disrupts the multimodal semantic alignment. We conduct extensive experiments on two widely-used MRAG datasets: OK-VQA and InfoSeek. We use CLIP-based retrievers and two LMMs BLIP-2 and LLaVA as generators. Results demonstrate the effectiveness of our visual attack on MRAG through the significant decrease in both retrieval and generation performance.
[CV-35] Representation Space Constrained Learning with Modality Decoupling for Multimodal Object Detection
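[Quick Read]: This paper addresses fusion degradation in multimodal object detection: under a multimodal architecture the gradients of the unimodal branches are severely suppressed, leaving them under-optimized, and because modality quality differs, the weaker modality suffers stronger gradient suppression, which leads to imbalanced modality learning. The key to the solution is Representation Space Constrained Learning with Modality Decoupling (RSC-MD), whose two modules respectively amplify the suppressed gradients and eliminate inter-modality coupling interference and imbalance, enabling comprehensive optimization of each modality-specific backbone.
Link: https://arxiv.org/abs/2511.15433
Authors: YiKang Shao,Tao Shi
Affiliations: Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the IEEE for possible publication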
Abstract:Multimodal object detection has attracted significant attention in both academia and industry for its enhanced robustness. Although numerous studies have focused on improving modality fusion strategies, most neglect fusion degradation, and none provide a theoretical analysis of its underlying causes. To fill this gap, this paper presents a systematic theoretical investigation of fusion degradation in multimodal detection and identifies two key optimization deficiencies: (1) the gradients of unimodal branch backbones are severely suppressed under multimodal architectures, resulting in under-optimization of the unimodal branches; (2) disparities in modality quality cause weaker modalities to experience stronger gradient suppression, which in turn results in imbalanced modality learning. To address these issues, this paper proposes a Representation Space Constrained Learning with Modality Decoupling (RSC-MD) method, which consists of two modules. The RSC module and the MD module are designed to respectively amplify the suppressed gradients and eliminate inter-modality coupling interference as well as modality imbalance, thereby enabling the comprehensive optimization of each modality-specific backbone. Extensive experiments conducted on the FLIR, LLVIP, M3FD, and MFAD datasets demonstrate that the proposed method effectively alleviates fusion degradation and achieves state-of-the-art performance across multiple benchmarks. The code and training procedures will be released at this https URL.
[CV-36] WarNav: An Autonomous Driving Benchmark for Segmentation of Navigable Zones in War Scenes
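[Quick Read]: This paper addresses the lack of datasets for training and evaluating semantic segmentation models for autonomous vehicles navigating unstructured, conflict-affected environments such as war zones. Existing datasets are mostly built on urban road scenes and transfer poorly to the complexity and uncertainty of such extreme environments. The key to the solution is WarNav, a new real-world dataset built from images of the open-source DATTALION repository and designed to support semantic segmentation for unmanned ground vehicles in damaged or hazardous areas. The authors also propose a first step towards better navigability without any annotation of the target images, emphasizing efficient and robust autonomous navigation in extreme scenarios where annotation is costly.
Link: https://arxiv.org/abs/2511.15429
Authors: Marc-Emmanuel Coupvent des Graviers,Hejer Ammar,Christophe Guettier,Yann Dumortier,Romaric Audigier
Affiliations: Safran Electronics and Defense; Université Paris-Saclay, CEA, List
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CAID (Conference on Artificial Intelligence for Defence)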
Abstract:We introduce WarNav, a novel real-world dataset constructed from images of the open-source DATTALION repository, specifically tailored to enable the development and benchmarking of semantic segmentation models for autonomous ground vehicle navigation in unstructured, conflict-affected environments. This dataset addresses a critical gap between conventional urban driving resources and the unique operational scenarios encountered by unmanned systems in hazardous and damaged war-zones. We detail the methodological challenges encountered, ranging from data heterogeneity to ethical considerations, providing guidance for future efforts that target extreme operational contexts. To establish performance references, we report baseline results on WarNav using several state-of-the-art semantic segmentation models trained on structured urban scenes. We further analyse the impact of training data environments and propose a first step towards effective navigability in challenging environments with the constraint of having no annotation of the targeted images. Our goal is to foster impactful research that enhances the robustness and safety of autonomous vehicles in high-risk scenarios while being frugal in annotated data.
[CV-37] D4C: Data-free Quantization for Contrastive Language-Image Pre-training Models
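[Quick Read]: This paper addresses the performance drop of Data-Free Quantization (DFQ) on vision-language models such as CLIP. Applying existing DFQ methods directly to CLIP causes a substantial accuracy loss because the synthesized samples carry insufficient semantic content and low intra-image diversity. The key to the solution is the D4C framework, built on three components: (1) Prompt-Guided Semantic Injection uses text prompts so that generated images align with real-world semantics; (2) Structural Contrastive Generation reproduces the compositional structure of natural images through foreground-background contrastive synthesis; (3) Perturbation-Aware Enhancement applies controlled perturbations to improve sample diversity and robustness. Together these yield pseudo images that are semantically rich and structurally diverse, narrowing the performance gap of DFQ on CLIP.
Link: https://arxiv.org/abs/2511.15411
Authors: Wenlun Zhang,Yunshan Zhong,Zihao Ding,Xinyu Li,Kentaro Yoshioka
Affiliations: Keio University; Hainan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: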
Abstract:Data-Free Quantization (DFQ) offers a practical solution for model compression without requiring access to real data, making it particularly attractive in privacy-sensitive scenarios. While DFQ has shown promise for unimodal models, its extension to Vision-Language Models such as Contrastive Language-Image Pre-training (CLIP) models remains underexplored. In this work, we reveal that directly applying existing DFQ techniques to CLIP results in substantial performance degradation due to two key limitations: insufficient semantic content and low intra-image diversity in synthesized samples. To tackle these challenges, we propose D4C, the first DFQ framework tailored for CLIP. D4C synthesizes semantically rich and structurally diverse pseudo images through three key components: (1) Prompt-Guided Semantic Injection aligns generated images with real-world semantics using text prompts; (2) Structural Contrastive Generation reproduces compositional structures of natural images by leveraging foreground-background contrastive synthesis; and (3) Perturbation-Aware Enhancement applies controlled perturbations to improve sample diversity and robustness. These components jointly empower D4C to synthesize images that are both semantically informative and structurally diverse, effectively bridging the performance gap of DFQ on CLIP. Extensive experiments validate the effectiveness of D4C, showing significant performance improvements on various bit-widths and models. For example, under the W4A8 setting with CLIP ResNet-50 and ViT-B/32, D4C achieves Top-1 accuracy improvement of 12.4% and 18.9% on CIFAR-10, 6.8% and 19.7% on CIFAR-100, and 1.4% and 5.7% on ImageNet-1K in zero-shot classification, respectively.
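For readers unfamiliar with the W4A8 notation (4-bit weights, 8-bit activations), the following is a minimal sketch of symmetric per-tensor uniform quantization applied to a linear layer. It covers only the quantization step and says nothing about how D4C synthesizes its calibration images; the helper names `fake_quantize` and `w4a8_linear` are hypothetical, and per-tensor symmetric scaling is an assumption rather than the paper's stated scheme.

```python
import numpy as np

def fake_quantize(x, bits):
    """Symmetric per-tensor uniform quantisation followed by dequantisation."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4 bits, 127 for 8 bits
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return x.copy()                 # nothing to scale
    scale = max_abs / qmax              # step between adjacent quantisation levels
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

def w4a8_linear(weight, activation):
    """Linear layer with 4-bit weights (out, in) and 8-bit activations (batch, in)."""
    return fake_quantize(activation, 8) @ fake_quantize(weight, 4).T
```

With only 15 representable weight levels at 4 bits, the scale (and therefore the calibration data used to set activation ranges) has a large effect on accuracy, which is why the quality of the synthesized images matters in the data-free setting.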
[CV-38] IPR-1: Interactive Physical Reasoner
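[Quick Read]: This paper asks how an agent can acquire human-like reasoning through interaction and keep improving its physical and causal reasoning as experience accumulates. The core challenge is that existing vision-language models (VLMs) lack look-ahead, while world models tend to imitate visual patterns rather than understand physical mechanisms. The key to the solution is the Interactive Physical Reasoner (IPR), which uses world-model rollouts to score and reinforce a VLM's policy, together with PhysCode, a physics-centric action code that aligns semantic intent with dynamics and provides a shared action space for prediction and reasoning. Pretrained on 1,000+ heterogeneous games, IPR performs robustly at the three human-like levels of Survival, Curiosity, and Utility, matches GPT-5 overall, and surpasses it on Curiosity, supporting physics-centric interaction as a path to better physical reasoning.
Link: https://arxiv.org/abs/2511.15407
Authors: Mingyu Zhang,Lifeng Zhuo,Tianxi Tan,Guocan Xie,Xian Nie,Yan Li,Renjie Zhao,Zizhu He,Ziyu Wang,Jiting Cai,Yong-Lu Li
Affiliations: Shanghai Jiao Tong University; Shanghai Innovation Institute; Carnegie Mellon University
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 5 figures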
Abstract:Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we aim to ask whether an agent can similarly acquire human-like reasoning from interaction and keep improving with more experience. We study this in a Game-to-Unseen (G2U) setting, curating 1,000+ heterogeneous games with diverse physical and causal mechanisms, and evaluate at three human-like levels: Survival, Curiosity, Utility, from primitive intuition to goal-driven reasoning. Our analysis reveals complementary failures: VLM/VLA agents reason but lack look-ahead in interactive settings, while world models imagine but imitate visual patterns rather than analyze physics and causality. We therefore propose IPR (Interactive Physical Reasoner), using world-model rollouts to score and reinforce a VLM’s policy, and introduce PhysCode, a physics-centric action code aligning semantic intent with dynamics to provide a shared action space for prediction and reasoning. Pretrained on 1,000+ games, our IPR performs robustly on three levels, matches GPT-5 overall, and surpasses it on Curiosity. We find that performance improves with more training games and interaction steps, and that the model also zero-shot transfers to unseen games. These results support physics-centric interaction as a path to steadily improving physical reasoning.
[CV-39] Controlling False Positives in Image Segmentation via Conformal Prediction
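[Quick Read]: This paper addresses the lack of explicit statistical error guarantees for deep models in medical image semantic segmentation, focusing on control of false-positive predictions, which matters for clinical decision making. The key to the solution is a simple, model-agnostic post-hoc framework that builds confidence masks with image-level false-positive control: a pretrained segmentation model yields a nested family of shrunken masks (obtained by raising the score threshold or by morphological erosion), and a labeled calibration set is used with conformal prediction to select the shrink parameter. This gives distribution-free, finite-sample guarantees: for new images exchangeable with the calibration data, the proportion of false positives stays below a user-specified tolerance with high probability.
Link: https://arxiv.org/abs/2511.15406
Authors: Luca Mossina,Corentin Friedrich
Affiliations: IRT Saint Exupéry
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: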
Abstract:Reliable semantic segmentation is essential for clinical decision making, yet deep models rarely provide explicit statistical guarantees on their errors. We introduce a simple post-hoc framework that constructs confidence masks with distribution-free, image-level control of false-positive predictions. Given any pretrained segmentation model, we define a nested family of shrunken masks obtained either by increasing the score threshold or by applying morphological erosion. A labeled calibration set is used to select a single shrink parameter via conformal prediction, ensuring that, for new images that are exchangeable with the calibration data, the proportion of false positives retained in the confidence mask stays below a user-specified tolerance with high probability. The method is model-agnostic, requires no retraining, and provides finite-sample guarantees regardless of the underlying predictor. Experiments on a polyp-segmentation benchmark demonstrate target-level empirical validity. Our framework enables practical, risk-aware segmentation in settings where over-segmentation can have clinical consequences. Code at this https URL.
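The calibration step can be sketched as a split-conformal quantile over per-image thresholds. The construction below is one plausible reading of the abstract, not the authors' released code: the function names, the use of a fixed threshold grid, and the rule "smallest threshold from which every larger grid threshold meets the tolerance" are assumptions made for illustration. Only score thresholding is shown; morphological erosion is omitted.

```python
import math
import numpy as np

def fp_proportion(scores, truth, thr):
    """Fraction of pixels kept at threshold `thr` that are false positives."""
    mask = scores >= thr
    kept = mask.sum()
    if kept == 0:
        return 0.0                      # an empty mask retains no false positives
    return float((mask & ~truth).sum() / kept)

def per_image_threshold(scores, truth, grid, tolerance):
    """Smallest grid threshold from which every larger grid threshold meets the tolerance."""
    chosen = math.inf                   # inf means: only the empty mask is safe
    for thr in sorted(grid, reverse=True):
        if fp_proportion(scores, truth, thr) > tolerance:
            break
        chosen = thr
    return chosen

def calibrate(cal_scores, cal_truths, grid, tolerance, alpha):
    """Conformal quantile of the per-image thresholds at miscoverage level alpha."""
    lambdas = sorted(per_image_threshold(s, t, grid, tolerance)
                     for s, t in zip(cal_scores, cal_truths))
    n = len(lambdas)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf                 # too few calibration images for this alpha
    return lambdas[k - 1]
```

At test time the confidence mask is `scores >= calibrate(...)`. Under exchangeability with the calibration images, a new image's own threshold falls at or below the returned value with probability at least `1 - alpha`, which is what bounds its false-positive proportion by `tolerance` in this sketch.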
[CV-40] ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation
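[Quick Read]: This paper addresses the geometric inconsistencies and depth bleeding that arise when vision-based weakly supervised or self-supervised occupancy estimation relies on 2D projection or rendering-based supervision. Existing alternatives typically need LiDAR sensors or manual 3D annotations. The key to the solution is ShelfOcc, a LiDAR-free framework that works from video alone: it generates metrically consistent semantic voxel labels from video and so brings the supervision signal into native 3D space. A dedicated framework filters and accumulates static geometry across frames, handles dynamic content, and propagates semantic information into a stable voxel representation, improving the robustness and accuracy of weakly supervised occupancy estimation.
Link: https://arxiv.org/abs/2511.15396
Authors: Simon Boeder,Fabian Gigengack,Simon Roesler,Holger Caesar,Benjamin Risse
Affiliations: Bosch Research; University of Münster; TU Delft
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: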
Abstract:Recent progress in self- and weakly supervised occupancy estimation has largely relied on 2D projection or rendering-based supervision, which suffers from geometric inconsistencies and severe depth bleeding. We thus introduce ShelfOcc, a vision-only method that overcomes these limitations without relying on LiDAR. ShelfOcc brings supervision into native 3D space by generating metrically consistent semantic voxel labels from video, enabling true 3D supervision without any additional sensors or manual 3D annotations. While recent vision-based 3D geometry foundation models provide a promising source of prior knowledge, they do not work out of the box as a prediction due to sparse or noisy and inconsistent geometry, especially in dynamic driving scenes. Our method introduces a dedicated framework that mitigates these issues by filtering and accumulating static geometry consistently across frames, handling dynamic content and propagating semantic information into a stable voxel representation. This data-centric shift in supervision for weakly/shelf-supervised occupancy estimation allows the use of essentially any SOTA occupancy model architecture without relying on LiDAR data. We argue that such high-quality supervision is essential for robust occupancy learning and constitutes an important complementary avenue to architectural innovation. On the Occ3D-nuScenes benchmark, ShelfOcc substantially outperforms all previous weakly/shelf-supervised methods (up to a 34% relative improvement), establishing a new data-driven direction for LiDAR-free 3D scene understanding.
[CV-41] Breaking Expert Knowledge Limits: Self-Pruning for Large Language Models
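[Quick Read]: This paper addresses the efficiency bottleneck that the sheer size of Large Language Models (LLMs) creates for real-world deployment, and the fact that existing pruning methods (e.g., Wanda) depend heavily on manual design, which is costly and not adaptive. Two challenges stand out: traditional pruning methods cannot generate effective pruning strategies automatically, and at high pruning ratios uniform sparsity causes an outlier value issue that sharply degrades performance. The proposed method, AutoPrune, has three parts: 1) it uses the LLM itself to design pruning algorithms automatically, removing the dependence on expert knowledge; 2) it introduces a Graph-driven Chain-of-Thought (GCoT) to optimize prompts, improving the reasoning and interpretability of the learned pruning strategy; 3) building on the analysis of the outlier value issue, it designs Skew-aware Dynamic Sparsity Allocation (SDSA) to mitigate performance degradation at high pruning ratios.
Link: https://arxiv.org/abs/2511.15390
Authors: Haidong Kang,Lihong Lin,Enneng Yang,Hongning Dai,Hao Wang
Affiliations: Northeastern University; Sun Yat-sen University; Hong Kong Baptist University; Xidian University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: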
Abstract:Large language models (LLMs) have achieved remarkable performance on a wide range of tasks, but their massive size hinders real-world deployment. Existing pruning methods tailored for LLMs (e.g., Wanda) rely heavily on manually designed pruning algorithms, which leads to huge labor costs and requires expert knowledge. Furthermore, we are the first to identify the serious outlier value issue behind the dramatic performance degradation under high pruning ratios that is caused by uniform sparsity, raising an additional concern about how to design adaptive pruning sparsity ideal for LLMs. Can LLMs prune by themselves? In this work, we give an affirmative answer by proposing a novel pruning method called AutoPrune, which first overcomes expert knowledge limits by leveraging LLMs to design optimal pruning algorithms for themselves automatically without any expert knowledge. Specifically, to mitigate the black-box nature of LLMs, we propose a Graph-driven Chain-of-Thought (GCoT) to optimize prompts, significantly enhancing the reasoning process in learning the pruning algorithm and enabling us to generate pruning algorithms with superior performance and interpretability in the next generation. Finally, grounded in insights into the outlier value issue, we introduce Skew-aware Dynamic Sparsity Allocation (SDSA) to overcome the outlier value issue, mitigating performance degradation under high pruning ratios. We conduct extensive experiments on mainstream LLM benchmarks, demonstrating the superiority of AutoPrune, which consistently outperforms state-of-the-art competitors. The code is available at: this https URL.
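As background for the hand-designed baseline the abstract mentions, here is a minimal sketch of a Wanda-style pruning step: each weight is scored by its magnitude times the L2 norm of the matching input feature, and the lowest-scoring fraction in every output row is zeroed. It shows the uniform per-row sparsity that the abstract criticizes, not AutoPrune, GCoT, or SDSA, whose details the abstract does not give; the function name `wanda_prune` and the array shapes are assumptions.

```python
import numpy as np

def wanda_prune(weight, activations, sparsity):
    """Zero the lowest-scoring fraction of weights in each output row.

    weight:      (out_features, in_features)
    activations: (tokens, in_features) calibration inputs to this layer
    sparsity:    fraction of each row to remove, applied uniformly
    """
    feature_norm = np.linalg.norm(activations, axis=0)   # one norm per input feature
    score = np.abs(weight) * feature_norm                # broadcast over rows
    n_prune = int(weight.shape[1] * sparsity)
    pruned = weight.copy()
    if n_prune > 0:
        # Indices of the n_prune smallest scores in every row.
        idx = np.argsort(score, axis=1)[:, :n_prune]
        np.put_along_axis(pruned, idx, 0.0, axis=1)
    return pruned
```

Because every row loses the same fraction of weights, rows dominated by a few outlier features are treated the same as well-behaved rows, which is the kind of rigidity that an adaptive sparsity allocation aims to relax.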
[CV-42] Zero-Shot Open-Vocabulary Human Motion Grounding with Test-Time Training
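[Quick Read]: This paper addresses motion grounding for understanding complex human activities: decomposing a motion sequence into fine-grained, semantically aligned sub-actions without dense annotation or predefined action classes, in support of open-vocabulary behavior analysis, embodied AI, and virtual reality. The key to the solution is the ZOMG framework, with two core components: (1) language semantic partition, which uses Large Language Models (LLMs) to decompose natural-language instructions into ordered sub-action units; (2) soft masking optimization, which learns instance-specific temporal masks that focus on the frames critical to each sub-action while maintaining intra-segment continuity and inter-segment separation, all without modifying the pretrained encoder. The result is zero-shot, open-vocabulary motion grounding.
Link: https://arxiv.org/abs/2511.15379
Authors: Yunjiao Zhou,Xinyan Chen,Junlang Qian,Lihua Xie,Jianfei Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: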
Abstract:Understanding complex human activities demands the ability to decompose motion into fine-grained, semantic-aligned sub-actions. This motion grounding process is crucial for behavior analysis, embodied AI and virtual reality. Yet, most existing methods rely on dense supervision with predefined action classes, which are infeasible in open-vocabulary, real-world settings. In this paper, we propose ZOMG, a zero-shot, open-vocabulary framework that segments motion sequences into semantically meaningful sub-actions without requiring any annotations or fine-tuning. Technically, ZOMG integrates (1) language semantic partition, which leverages large language models to decompose instructions into ordered sub-action units, and (2) soft masking optimization, which learns instance-specific temporal masks to focus on frames critical to sub-actions, while maintaining intra-segment continuity and enforcing inter-segment separation, all without altering the pretrained encoder. Experiments on three motion-language datasets demonstrate state-of-the-art effectiveness and efficiency of motion grounding performance, outperforming prior methods by +8.7% mAP on HumanML3D benchmark. Meanwhile, significant improvements also exist in downstream retrieval, establishing a new paradigm for annotation-free motion understanding.
[CV-43] IPTQ-ViT: Post-Training Quantization of Non-linear Functions for Integer-only Vision Transformers WACV2026
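[Quick Read]: This paper addresses the difficulty of achieving fully integer-only inference for vision transformers with Post-Training Quantization (PTQ) without a significant accuracy loss. Existing PTQ methods either quantize non-linear layers only partially or adjust activation distributions to preserve accuracy, and neither fully avoids floating-point operations; Quantization-Aware Training (QAT) can reach integer-only inference but needs expensive retraining, which is unsuitable for resource-constrained settings. The key to the solution is the IPTQ-ViT framework. First, it designs two approximation functions: a polynomial-based GELU (Gaussian Error Linear Unit) optimized for vision data and a bit-shifting-based Softmax designed to improve approximation accuracy in PTQ. Second, it introduces a unified metric that combines quantization sensitivity, perturbation, and computational cost to select the best approximation function for each activation layer. The method achieves fully integer-only inference without retraining, outperforms partial floating-point PTQ methods under W8A8 and W4A8, and reaches accuracy and latency comparable to integer-only QAT.
Link: https://arxiv.org/abs/2511.15369
Authors: Gihwan Kim,Jemin Lee,Hyungshin Kim
Affiliations: Chungnam National University; Electronics and Telecommunications Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: accepted in WACV 2026 (10 pages)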
Abstract:Previous Quantization-Aware Training (QAT) methods for vision transformers rely on expensive retraining to recover accuracy loss in non-linear layer quantization, limiting their use in resource-constrained environments. In contrast, existing Post-Training Quantization (PTQ) methods either partially quantize non-linear functions or adjust activation distributions to maintain accuracy but fail to achieve fully integer-only inference. In this paper, we introduce IPTQ-ViT, a novel PTQ framework for fully integer-only vision transformers without retraining. We present two approximation functions: a polynomial-based GELU optimized for vision data and a bit-shifting-based Softmax designed to improve approximation accuracy in PTQ. In addition, we propose a unified metric integrating quantization sensitivity, perturbation, and computational cost to select the optimal approximation function per activation layer. IPTQ-ViT outperforms previous PTQ methods, achieving up to 6.44%p (avg. 1.78%p) top-1 accuracy improvement for image classification and 1.0 mAP for object detection. IPTQ-ViT outperforms partial floating-point PTQ methods under W8A8 and W4A8, and achieves accuracy and latency comparable to integer-only QAT methods. We plan to release our code at this https URL.
[CV-44] Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration
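[Quick Read]: This paper addresses a fundamental architectural limitation of existing multimodal reasoning models: they lack the human-like ability to autonomously explore diverse reasoning pathways (such as direct inference, tool-driven visual exploration, programmatic visual manipulation, or intrinsic visual imagination), so they struggle to adapt to the dynamically changing capability requirements of real-world tasks. The key to the solution is Octopus, a new paradigm for agentic multimodal reasoning based on six-capability orchestration. It defines six core multimodal reasoning capabilities and builds a matching evaluation benchmark, Octopus-Bench, so that the system can explore autonomously during reasoning and dynamically select the most appropriate capability for the current state.
Link: https://arxiv.org/abs/2511.15351
Authors: Yifu Guo,Zishan Xu,Zhiyuan Yao,Yuquan Lu,Jiaye Lin,Sen Hu,Zhenheng Tang,Yingchao Li,Huacan Wang,Ronghao Chen
Affiliations: Sun Yat-sen University; Shanghai Jiaotong University; Zhejiang University; Tsinghua University; Peking University; The Hong Kong University of Science and Technology; University of Washington; University of Chinese Academy of Sciences
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: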
Abstract:Existing multimodal reasoning models and frameworks suffer from fundamental architectural limitations: most lack the human-like ability to autonomously explore diverse reasoning pathways, whether in direct inference, tool-driven visual exploration, programmatic visual manipulation, or intrinsic visual imagination. Consequently, they struggle to adapt to dynamically changing capability requirements in real-world tasks. Meanwhile, humans exhibit a complementary set of thinking abilities when addressing such tasks, whereas existing methods typically cover only a subset of these dimensions. Inspired by this, we propose Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration, a new paradigm for multimodal agentic reasoning. We define six core capabilities essential for multimodal reasoning and organize a comprehensive evaluation benchmark, Octopus-Bench, accordingly. Octopus is capable of autonomously exploring during reasoning and dynamically selecting the most appropriate capability based on the current state. Experimental results show that Octopus achieves the best performance on the vast majority of tasks in Octopus-Bench, highlighting the crucial role of capability coordination in agentic multimodal reasoning.
[CV-45] Fast Post-Hoc Confidence Fusion for 3-Class Open-Set Aerial Object Detection
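[Quick Read]: This paper addresses open-set object detection for UAV navigation: keeping high-accuracy recognition of known in-domain (ID) targets while separating unknown out-of-distribution (OOD) objects from background. Most existing methods threshold a single uncertainty score for binary ID/OOD classification and tend to confuse OOD objects with background, which limits safety in practice. The key to the solution is a lightweight, model-agnostic post-processing framework that fuses multiple confidence estimates and per-detection features with a compact multilayer perceptron (MLP) to distinguish three classes in real time: ID targets, OOD objects, and background. The design improves two-class AUROC by 2.7% on average, enables robust three-class classification, does not reduce detection throughput, and improves closed-set mAP by up to 9 points (an 18% relative gain).
Link: https://arxiv.org/abs/2511.15343
Authors: Spyridon Loukovitis,Vasileios Karampinis,Athanasios Voulodimos
Affiliations: National Technical University Athens
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: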
Abstract:Developing reliable UAV navigation systems requires robust air-to-air object detectors capable of distinguishing between objects seen during training and previously unseen objects. While many methods address closed-set detection and achieve high-confidence recognition of in-domain (ID) targets, they generally do not tackle open-set detection, which requires simultaneous handling of both ID and out-of-distribution (OOD) objects. Existing open-set approaches typically rely on a single uncertainty score with thresholding, limiting flexibility and often conflating OOD objects with background clutter. In contrast, we propose a lightweight, model-agnostic post-processing framework that explicitly separates background from unknown objects while preserving the base detector’s performance. Our approach extends open-set detection beyond binary ID/OOD classification to real-time three-way classification among ID targets, OOD objects, and background. To this end, we employ a fusion scheme that aggregates multiple confidence estimates and per-detection features using a compact multilayer perceptron (MLP). Incorporating different logit variants into the MLP consistently enhances performance across both binary and three-class classification without compromising throughput. Extensive ablation and comparative experiments confirm that our method surpasses threshold-based baselines in two-class classification by an average of 2.7% AUROC, while retaining or improving open-set mAP. Furthermore, our study uniquely enables robust three-class classification, a critical capability for safe UAV navigation, where OOD objects must be actively avoided and background regions safely ignored. Comparative analysis highlights that our method surpasses competitive techniques in AUROC across datasets, while improving closed-set mAP by up to 9 points, an 18% relative gain.
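[CV-46] Adaptive thresholding pattern for fingerprint forgery detection
[Quick read]: This paper addresses the security threat that fingerprint spoofing poses to fingerprint liveness detection systems, in particular forged fingerprints that still deceive the detector under distortions such as noise contamination, missing pixels, and missing blocks. The key to the solution is a forgery detection algorithm built on an adaptive thresholding pattern: the input image is first processed with anisotropic diffusion and passed through three levels of the wavelet transform; the coefficients of the different levels are then adaptively thresholded and concatenated into a feature vector, which an SVM classifier uses for the final decision. The method is markedly more robust to a range of malicious manipulations and environmental distortions: with 90% of pixels missing and with 70x70 missing blocks, its accuracy exceeds existing methods by about 8% and 5%, respectively.
Link: https://arxiv.org/abs/2511.15322
Authors: Zahra Farzadpour, Masoumeh Azghani
Affiliations: Sahand University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages, 10 figures, Journal paper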
Abstract:Fingerprint liveness detection systems have been affected by spoofing, which is a severe threat for fingerprint-based biometric systems. Therefore, it is crucial to develop some techniques to distinguish the fake fingerprints from the real ones. The software based techniques can detect the fingerprint forgery automatically. Also, the scheme shall be resistant against various distortions such as noise contamination, pixel missing and block missing, so that the forgers cannot deceive the detector by adding some distortions to the faked fingerprint. In this paper, we propose a fingerprint forgery detection algorithm based on a suggested adaptive thresholding pattern. The anisotropic diffusion of the input image is passed through three levels of the wavelet transform. The coefficients of different layers are adaptively thresholded and concatenated to produce the feature vector which is classified using the SVM classifier. Another contribution of the paper is to investigate the effect of various distortions such as pixel missing, block missing, and noise contamination. Our suggested approach includes a novel method that exhibits improved resistance against a range of distortions caused by environmental phenomena or manipulations by malicious users. In quantitative comparisons, our proposed method outperforms its counterparts by approximately 8% and 5% in accuracy for missing pixel scenarios of 90% and block missing scenarios of size 70x70, respectively. This highlights the novelty of the approach in addressing such challenges.
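The feature-extraction step described above (three wavelet levels, per-level adaptive thresholding, concatenation) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract names neither the wavelet nor the thresholding rule, so the sketch uses a plain Haar transform and takes each subband's mean absolute coefficient as its threshold. The function names `haar2d` and `adaptive_threshold_features` are hypothetical, and the anisotropic diffusion and SVM stages are left out.

```python
def haar2d(img):
    """One level of a 2D Haar-style transform on a grid with even height and width."""
    approx, horiz, vert, diag = [], [], [], []
    for r in range(0, len(img), 2):
        row_a, row_h, row_v, row_d = [], [], [], []
        for c in range(0, len(img[0]), 2):
            a, b = img[r][c], img[r][c + 1]
            d, e = img[r + 1][c], img[r + 1][c + 1]
            row_a.append((a + b + d + e) / 4.0)  # local average
            row_h.append((a - b + d - e) / 4.0)  # left-minus-right difference
            row_v.append((a + b - d - e) / 4.0)  # top-minus-bottom difference
            row_d.append((a - b - d + e) / 4.0)  # diagonal difference
        approx.append(row_a)
        horiz.append(row_h)
        vert.append(row_v)
        diag.append(row_d)
    return approx, (horiz, vert, diag)


def adaptive_threshold_features(img, levels=3):
    """Concatenate, over all levels and subbands, the fraction of detail
    coefficients whose magnitude exceeds that subband's own threshold."""
    features = []
    approx = img
    for _ in range(levels):
        approx, details = haar2d(approx)
        for band in details:
            mags = [abs(v) for row in band for v in row]
            threshold = sum(mags) / len(mags)  # assumed rule: mean magnitude of the band
            features.append(sum(m > threshold for m in mags) / len(mags))
    return features
```

For an 8x8 input this yields nine features (three subbands at each of three levels); a real pipeline would feed such a vector to a classifier.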
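[CV-47] What Your Features Reveal: Data-Efficient Black-Box Feature Inversion Attack for Split DNNs
[Quick read]: This paper addresses the privacy risk in Split DNNs caused by leaked intermediate features, and in particular the low reconstruction quality of existing Feature Inversion Attacks (FIA), which makes the extent of privacy leakage hard to assess. The authors propose FIA-Flow, a black-box framework built on two designs: a Latent Feature Space Alignment Module (LFSAM) that bridges the semantic gap between the intermediate feature space and the latent space, and Deterministic Inversion Flow Matching (DIFM), which projects off-manifold features onto the target manifold with one-step inference to correct the distributional mismatch. This decoupled design markedly improves the fidelity and semantic consistency of reconstructed images and reveals a more severe privacy risk in Split DNNs than previously recognized.
Link: https://arxiv.org/abs/2511.15316
Authors: Zhihan Ren, Lijun He, Jiaxi Liang, Xinzhu Fu, Haixia Bi, Fan Li
Affiliations: Xi’an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: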
Abstract:Split DNNs enable edge devices by offloading intensive computation to a cloud server, but this paradigm exposes privacy vulnerabilities, as the intermediate features can be exploited to reconstruct the private inputs via Feature Inversion Attack (FIA). Existing FIA methods often produce limited reconstruction quality, making it difficult to assess the true extent of privacy leakage. To reveal the privacy risk of the leaked features, we introduce FIA-Flow, a black-box FIA framework that achieves high-fidelity image reconstruction from intermediate features. To exploit the semantic information within intermediate features, we design a Latent Feature Space Alignment Module (LFSAM) to bridge the semantic gap between the intermediate feature space and the latent space. Furthermore, to rectify distributional mismatch, we develop Deterministic Inversion Flow Matching (DIFM), which projects off-manifold features onto the target manifold with one-step inference. This decoupled design simplifies learning and enables effective training with few image-feature pairs. To quantify privacy leakage from a human perspective, we also propose two metrics based on a large vision-language model. Experiments show that FIA-Flow achieves more faithful and semantically aligned feature inversion across various models (AlexNet, ResNet, Swin Transformer, DINO, and YOLO11) and layers, revealing a more severe privacy threat in Split DNNs than previously recognized.
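[CV-48] A Multimodal Transformer Approach for UAV Detection and Aerial Object Recognition Using Radar Audio and Video Data
[Quick read]: This paper addresses the challenges of UAV detection and aerial object recognition in modern security surveillance, especially the limited performance of single-modality methods. The key to the solution is the design and evaluation of a novel multimodal Transformer model that fuses radar, visible-band (RGB) video, infrared (IR) video, and audio, using the Transformer's self-attention to learn complementary, highly discriminative cross-modal representations for classifying aerial objects. On an independent test set the model reaches a macro-averaged F1-score of 0.9826 and runs at 41.11 FPS, which supports the feasibility of multimodal fusion with a Transformer architecture for UAV detection in complex airspace.
Link: https://arxiv.org/abs/2511.15312
Authors: Mauro Larrat, Claudomiro Sales
Affiliations: Institute of Exact and Natural Sciences, Federal University of Pará, Brazil
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 7 figures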
Abstract:Unmanned aerial vehicle (UAV) detection and aerial object recognition are critical for modern surveillance and security, prompting a need for robust systems that overcome limitations of single-modality approaches. This research addresses these challenges by designing and rigorously evaluating a novel multimodal Transformer model that integrates diverse data streams: radar, visual band video (RGB), infrared (IR) video, and audio. The architecture effectively fuses distinct features from each modality, leveraging the Transformer’s self-attention mechanisms to learn comprehensive, complementary, and highly discriminative representations for classification. The model demonstrated exceptional performance on an independent test set, achieving macro-averaged metrics of 0.9812 accuracy, 0.9873 recall, 0.9787 precision, 0.9826 F1-score, and 0.9954 specificity. Notably, it exhibited particularly high precision and recall in distinguishing drones from other aerial objects. Furthermore, computational analysis confirmed its efficiency, with 1.09 GFLOPs, 1.22 million parameters, and an inference speed of 41.11 FPS, highlighting its suitability for real-time applications. This study presents a significant advancement in aerial object classification, validating the efficacy of multimodal data fusion via a Transformer architecture for achieving state-of-the-art performance, thereby offering a highly accurate and resilient solution for UAV detection and monitoring in complex airspace.
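For readers unfamiliar with the macro-averaged figures quoted above, the following sketch shows one common way to derive macro precision, recall, F1, and specificity from a multi-class confusion matrix. It is not taken from the paper: the function name `macro_metrics` and the convention that rows are true classes and columns are predicted classes are assumptions made for illustration.

```python
def macro_metrics(confusion):
    """Macro-averaged precision, recall, F1 and specificity.

    confusion[t][p] counts samples of true class t predicted as class p
    (an assumed layout, not something the paper specifies).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    sums = {"precision": 0.0, "recall": 0.0, "f1": 0.0, "specificity": 0.0}
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp
        fp = sum(confusion[r][k] for r in range(n)) - tp
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        sums["precision"] += precision
        sums["recall"] += recall
        sums["f1"] += f1
        sums["specificity"] += specificity
    # Macro averaging: every class contributes equally, regardless of its size.
    return {name: value / n for name, value in sums.items()}
```

Because each class is weighted equally, a rare class such as a particular aerial object type affects these scores as much as a frequent one.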
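[CV-49] Adapt-As-You-Walk Through the Clouds: Training-Free Online Test-Time Adaptation of 3D Vision-Language Foundation Models AAAI2026
[Quick read]: This paper addresses the performance drop of 3D Vision-Language Foundation Models (3D VLFMs) in practice when data are noisy, incomplete, or distribution-shifted. The key to the solution is Uni-Adapter, a training-free online test-time adaptation (TTA) strategy based on dynamic prototype learning: a 3D cache stores class-specific cluster centers as prototypes and updates them continuously to capture intra-class variability in heterogeneous data distributions; a graph-based label smoothing module enforces label consistency among similar prototypes; and entropy-weighted aggregation fuses the original 3D VLFM predictions with the cache-refined ones. This mitigates distribution shift and improves performance on several 3D benchmarks.
Link: https://arxiv.org/abs/2511.15311
Authors: Mehran Tamjidi, Hamidreza Dastmalchi, Mohammadreza Alimoradijazi, Ali Cheraghian, Aijun An, Morteza Saberi
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026. 7 pages, 4 figures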
Abstract:3D Vision-Language Foundation Models (VLFMs) have shown strong generalization and zero-shot recognition capabilities in open-world point cloud processing tasks. However, these models often underperform in practical scenarios where data are noisy, incomplete, or drawn from a different distribution than the training data. To address this, we propose Uni-Adapter, a novel training-free online test-time adaptation (TTA) strategy for 3D VLFMs based on dynamic prototype learning. We define a 3D cache to store class-specific cluster centers as prototypes, which are continuously updated to capture intra-class variability in heterogeneous data distributions. These dynamic prototypes serve as anchors for cache-based logit computation via similarity scoring. Simultaneously, a graph-based label smoothing module captures inter-prototype similarities to enforce label consistency among similar prototypes. Finally, we unify predictions from the original 3D VLFM and the refined 3D cache using entropy-weighted aggregation for reliable adaptation. Without retraining, Uni-Adapter effectively mitigates distribution shifts, achieving state-of-the-art performance on diverse 3D benchmarks over different 3D VLFMs, improving ModelNet-40C by 10.55%, ScanObjectNN-C by 8.26%, and ShapeNet-C by 4.49% over the source 3D VLFMs.
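The final fusion step, entropy-weighted aggregation of the model's prediction and the cache's prediction, might look something like the sketch below. The abstract does not give the weighting formula, so the choice of `exp(-entropy)` as the weight is an assumption, as are the function names; the idea shown is only that the lower-entropy (more confident) source receives the larger weight.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weighted_fusion(model_logits, cache_logits):
    """Blend two class distributions, weighting each by exp(-entropy).

    The exp(-entropy) weight is an illustrative assumption, not the paper's formula.
    """
    p_model, p_cache = softmax(model_logits), softmax(cache_logits)
    w_model = math.exp(-entropy(p_model))
    w_cache = math.exp(-entropy(p_cache))
    total = w_model + w_cache
    return [(w_model * a + w_cache * b) / total for a, b in zip(p_model, p_cache)]
```

With a uniform model prediction and a confident cache prediction, the fused distribution leans toward the cache, which is the behavior one would want when the source model is uncertain on shifted data.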
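[CV-50] Text2Loc++: Generalizing 3D Point Cloud Localization from Natural Language CVPR2024
[Quick read]: This paper addresses precise localization of 3D point cloud submaps from complex and diverse natural language descriptions. It proposes Text2Loc++, a neural network with a coarse-to-fine localization pipeline. In the global place recognition stage, it combines a pretrained language model with a Hierarchical Transformer with Max pooling (HTM) to extract sentence-level semantics and uses an attention-based point cloud encoder for spatial understanding; Masked Instance Training (MIT) filters out non-aligned objects and improves multimodal robustness; and Modality-aware Hierarchical Contrastive Learning (MHCL) combines cross-modal, submap-level, text-level, and instance-level losses to improve the embedding space. In the fine localization stage, it drops explicit text-instance matching in favor of a lightweight framework built on Prototype-based Map Cloning (PMC) and a Cascaded Cross-Attention Transformer (CCAT). On the KITTI360Pose dataset the method outperforms existing methods by up to 15%, and it generalizes well on a newly built city-scale dataset with complex language and varied urban environments.
Link: https://arxiv.org/abs/2511.15308
Authors: Yan Xia, Letian Shi, Yilin Di, Joao F. Henriques, Daniel Cremers
Affiliations: University of Science and Technology of China; Technical University of Munich; University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper builds upon and extends our earlier conference paper Text2Loc presented at CVPR 2024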
Abstract:We tackle the problem of localizing 3D point cloud submaps using complex and diverse natural language descriptions, and present Text2Loc++, a novel neural network designed for effective cross-modal alignment between language and point clouds in a coarse-to-fine localization pipeline. To support benchmarking, we introduce a new city-scale dataset covering both color and non-color point clouds from diverse urban scenes, and organize location descriptions into three levels of linguistic complexity. In the global place recognition stage, Text2Loc++ combines a pretrained language model with a Hierarchical Transformer with Max pooling (HTM) for sentence-level semantics, and employs an attention-based point cloud encoder for spatial understanding. We further propose Masked Instance Training (MIT) to filter out non-aligned objects and improve multimodal robustness. To enhance the embedding space, we introduce Modality-aware Hierarchical Contrastive Learning (MHCL), incorporating cross-modal, submap-, text-, and instance-level losses. In the fine localization stage, we completely remove explicit text-instance matching and design a lightweight yet powerful framework based on Prototype-based Map Cloning (PMC) and a Cascaded Cross-Attention Transformer (CCAT). Extensive experiments on the KITTI360Pose dataset show that Text2Loc++ outperforms existing methods by up to 15%. In addition, the proposed model exhibits robust generalization when evaluated on the new dataset, effectively handling complex linguistic expressions and a wide variety of urban environments. The code and dataset will be made publicly available.
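[CV-51] Taming Generative Synthetic Data for X-ray Prohibited Item Detection
[Quick read]: This paper addresses the performance bottleneck caused by insufficient data when training X-ray security image detectors, and in particular the cumbersome two-stage pipeline of earlier image synthesis methods (foreground extraction followed by image compositing), which adds manual labor and is inefficient. The key to the solution is Xsyn, a one-stage synthesis pipeline based on text-to-image generation with two strategies that make synthetic images more usable: Cross-Attention Refinement (CAR), which uses the diffusion model's cross-attention map to refine bounding box annotations, and Background Occlusion Modeling (BOM), which explicitly models background occlusion in the latent space to increase imaging complexity. The method produces high-quality synthetic images without extra manual labor and improves prohibited item detection across several X-ray security datasets and detectors, with a 1.2% mAP gain over previous methods.
Link: https://arxiv.org/abs/2511.15299
Authors: Jialong Sun, Hongguang Zhu, Weizhe Liu, Yunda Sun, Renshuai Tao, Yunchao Wei
Affiliations: Institute of Information Science, Beijing Jiaotong University; Faculty of Data Science, City University of Macau; Nuctech Company Limited
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: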
Abstract:Training prohibited item detection models requires a large amount of X-ray security images, but collecting and annotating these images is time-consuming and laborious. To address data insufficiency, X-ray security image synthesis methods composite images to scale up datasets. However, previous methods primarily follow a two-stage pipeline, where they implement labor-intensive foreground extraction in the first stage and then composite images in the second stage. Such a pipeline introduces inevitable extra labor cost and is not efficient. In this paper, we propose a one-stage X-ray security image synthesis pipeline (Xsyn) based on text-to-image generation, which incorporates two effective strategies to improve the usability of synthetic images. The Cross-Attention Refinement (CAR) strategy leverages the cross-attention map from the diffusion model to refine the bounding box annotation. The Background Occlusion Modeling (BOM) strategy explicitly models background occlusion in the latent space to enhance imaging complexity. To the best of our knowledge, compared with previous methods, Xsyn is the first to achieve high-quality X-ray security image synthesis without extra labor cost. Experiments demonstrate that our method outperforms all previous methods with 1.2% mAP improvement, and the synthetic images generated by our method are beneficial to improve prohibited item detection performance across various X-ray security datasets and detectors. Code is available at this https URL.
[CV-52] Edge-Centric Relational Reasoning for 3D Scene Graph Prediction
[Quick read]: This paper addresses a limitation of existing 3D scene graph prediction methods: relation representations are restricted to pairwise object context, which makes high-order relational dependencies hard to capture. The key to the solution is LEO, a Link-guided Edge-centric relational reasoning framework with Object-aware fusion. LEO first predicts potential links between object pairs to suppress irrelevant edges, then converts the scene graph into a line graph in which each relation is a node and applies a line graph neural network for edge-centric reasoning over inter-relation context. The enriched edge features are finally fused back into the original object-centric graph to improve object-level understanding and relation prediction. The framework is model-agnostic and can be integrated with any existing object-centric method.
Link: https://arxiv.org/abs/2511.15288
Authors: Yanni Ma, Hao Liu, Yulan Guo, Theo Gevers, Martin R. Oswald
Affiliations: School of Computer Science and Engineering, South China University of Technology; Faculty of Electrical Engineering, Mathematics and Computer Science, University of Twente; Institute of Artificial Intelligence, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:3D scene graph prediction aims to abstract complex 3D environments into structured graphs consisting of objects and their pairwise relationships. Existing approaches typically adopt object-centric graph neural networks, where relation edge features are iteratively updated by aggregating messages from connected object nodes. However, this design inherently restricts relation representations to pairwise object context, making it difficult to capture high-order relational dependencies that are essential for accurate relation prediction. To address this limitation, we propose a Link-guided Edge-centric relational reasoning framework with Object-aware fusion, namely LEO, which enables progressive reasoning from relation-level context to object-level understanding. Specifically, LEO first predicts potential links between object pairs to suppress irrelevant edges, and then transforms the original scene graph into a line graph where each relation is treated as a node. A line graph neural network is applied to perform edge-centric relational reasoning to capture inter-relation context. The enriched relation features are subsequently integrated into the original object-centric graph to enhance object-level reasoning and improve relation prediction. Our framework is model-agnostic and can be integrated with any existing object-centric method. Experiments on the 3DSSG dataset with two competitive baselines show consistent improvements, highlighting the effectiveness of our edge-to-object reasoning paradigm.
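The scene-graph-to-line-graph step can be illustrated with a small sketch: every relation (edge) becomes a node, and two relation nodes are linked when the underlying relations share an object. Whether LEO uses exactly this undirected, shared-endpoint rule is an assumption, and `to_line_graph` is a hypothetical name used only here.

```python
from itertools import combinations


def to_line_graph(edges):
    """Build an undirected line graph from (subject, object) relation pairs.

    Returns a dict mapping each relation index to the set of relation
    indices that share at least one object with it.
    """
    adjacency = {i: set() for i in range(len(edges))}
    for i, j in combinations(range(len(edges)), 2):
        if set(edges[i]) & set(edges[j]):  # the two relations touch a common object
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency


# Example scene: relations 0 and 1 share "floor", relations 1 and 2 share "table".
relations = [("chair", "floor"), ("table", "floor"), ("lamp", "table"), ("cup", "shelf")]
line_graph = to_line_graph(relations)
```

Message passing on such a graph lets a relation like lamp-on-table draw context from table-on-floor directly, rather than only through the object nodes.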
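[CV-53] Look Zoom Understand: The Robotic Eyeball for Embodied Perception
[Quick read]: This paper addresses the passivity of visual perception in embodied AI systems: existing vision models and fixed RGB-D camera setups cannot achieve both wide-area coverage and fine-grained detail under pixel and spatial budget constraints, which limits them in open-world robotic applications. The key to the solution is EyeVLA, a robotic eyeball that takes proactive actions based on instructions. It discretizes actions into action tokens and integrates them with vision-language models (VLMs) that have strong open-world understanding, so that vision, language, and action are modeled jointly in a single autoregressive sequence. It further uses 2D bounding box coordinates to guide the reasoning chain and reinforcement learning to refine the viewpoint selection policy, transferring the VLM's open-world scene understanding to a vision-language-action (VLA) policy with only a small amount of real-world data. The resulting instruction-driven rotation and zoom actions actively acquire more accurate visual information.
Link: https://arxiv.org/abs/2511.15279
Authors: Jiashu Yang, Yifan Han, Yucheng Xie, Ning Guo, Wenzhao Lian
Affiliations: Shanghai Jiao Tong University; Chinese Academy of Sciences; Dalian University of Technology
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: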
Abstract:In embodied AI perception systems, visual perception should be active: the goal is not to passively process static images, but to actively acquire more informative data within pixel and spatial budget constraints. Existing vision models and fixed RGB-D camera systems fundamentally fail to reconcile wide-area coverage with fine-grained detail acquisition, severely limiting their efficacy in open-world robotic applications. To address this issue, we propose EyeVLA, a robotic eyeball for active visual perception that can take proactive actions based on instructions, enabling clear observation of fine-grained target objects and detailed information across a wide spatial extent. EyeVLA discretizes action behaviors into action tokens and integrates them with vision-language models (VLMs) that possess strong open-world understanding capabilities, enabling joint modeling of vision, language, and actions within a single autoregressive sequence. By using the 2D bounding box coordinates to guide the reasoning chain and applying reinforcement learning to refine the viewpoint selection policy, we transfer the open-world scene understanding capability of the VLM to a vision language action (VLA) policy using only minimal real-world data. Experiments show that our system efficiently performs instructed scenes in real-world environments and actively acquires more accurate visual information through instruction-driven actions of rotation and zoom, thereby achieving strong environmental perception capabilities. EyeVLA introduces a novel robotic vision system that leverages detailed and spatially rich, large-scale embodied data, and actively acquires highly informative visual observations for downstream embodied tasks.
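[CV-54] Graph Query Networks for Object Detection with Automotive Radar WACV2026
[Quick read]: This paper addresses a difficulty of 3D radar in 360-degree automotive perception: radar's long wavelengths produce sparse and irregular reflections that conventional grid-based and sequence-based convolutional and Transformer detectors handle poorly. The key to the solution is Graph Query Networks (GQN), which model radar-sensed objects as graphs and introduce graph queries that dynamically attend over the bird's-eye view (BEV) space to construct object-specific graphs. Two new modules process these graphs: EdgeFocus for relational reasoning and DeepContext Pooling for contextual aggregation. On the NuScenes dataset, GQN improves relative mAP by up to 53%, including an 8.2% gain over the strongest prior radar method, while reducing peak graph construction overhead by 80% at moderate FLOPs cost.
Link: https://arxiv.org/abs/2511.15271
Authors: Loveneet Saini, Hasan Tercan, Tobias Meisen
Affiliations: University of Wuppertal
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted in WACV 2026 Main Conference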
Abstract:Object detection with 3D radar is essential for 360-degree automotive perception, but radar’s long wavelengths produce sparse and irregular reflections that challenge traditional grid and sequence-based convolutional and transformer detectors. This paper introduces Graph Query Networks (GQN), an attention-based framework that models objects sensed by radar as graphs, to extract individualized relational and contextual features. GQN employs a novel concept of graph queries to dynamically attend over the bird’s-eye view (BEV) space, constructing object-specific graphs processed by two novel modules: EdgeFocus for relational reasoning and DeepContext Pooling for contextual aggregation. On the NuScenes dataset, GQN improves relative mAP by up to +53%, including a +8.2% gain over the strongest prior radar method, while reducing peak graph construction overhead by 80% with moderate FLOPs cost.
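[CV-55] SplitFlux: Learning to Decouple Content and Style from a Single Image
[Quick read]: This paper addresses the shortcomings of current SDXL-based and Flux-based image generation methods in separating content from style; in particular, Flux fails to achieve effective content-style separation because its internal characteristics are underexplored. The key to the solution is a systematic analysis of Flux and the resulting SplitFlux framework. The analysis finds that the single stream blocks are essential for image generation, with early blocks mainly controlling content and later blocks governing style. SplitFlux then disentangles content and style by fine-tuning specific blocks via LoRA (Low-Rank Adaptation), with two components: Rank-Constrained Adaptation, which compresses the rank and amplifies the magnitude of updates to prevent content leaking into style blocks, and Visual-Gated LoRA, which splits the content LoRA into high-rank and low-rank branches guided by image saliency, preserving primary subject information and residual details respectively, to improve re-embedding and reduce content overfitting.
Link: https://arxiv.org/abs/2511.15258
Authors: Yitong Yang, Yinglin Wang, Changshuo Wang, Yongjun Zhang, Ziyang Chen, Shuting He
Affiliations: Shanghai University of Finance and Economics; University College London; Guizhou University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: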
Abstract:Disentangling image content and style is essential for customized image generation. Existing SDXL-based methods struggle to achieve high-quality results, while the recently proposed Flux model fails to achieve effective content-style separation due to its underexplored characteristics. To address these challenges, we conduct a systematic analysis of Flux and make two key observations: (1) Single stream blocks are essential for image generation; and (2) Early single stream blocks mainly control content, whereas later blocks govern style. Based on these insights, we propose SplitFlux, which disentangles content and style by fine-tuning the single stream blocks via LoRA, enabling the disentangled content to be re-embedded into new contexts. It includes two key components: (1) Rank-Constrained Adaptation. To preserve content identity and structure, we compress the rank and amplify the magnitude of updates within specific blocks, preventing content leakage into style blocks. (2) Visual-Gated LoRA. We split the content LoRA into two branches with different ranks, guided by image saliency. The high-rank branch preserves primary subject information, while the low-rank branch encodes residual details, mitigating content overfitting and enabling seamless re-embedding. Extensive experiments demonstrate that SplitFlux consistently outperforms state-of-the-art methods, achieving superior content preservation and stylization quality across diverse scenarios.
[CV-56] GRPO-RM: Fine-Tuning Representation Models via GRPO-Driven Reinforcement Learning
[Quick read]: This paper asks how Group Relative Policy Optimization (GRPO), originally used to fine-tune large language models (LLMs), can be generalized to representation models for effective post-training. The proposed GRPO-RM method has two core ideas: (1) a predefined output set replaces the token sequence sampling used in LLMs, producing the output group needed for GRPO's probability-driven optimization; and (2) a reward function designed for the properties of representation models. Experiments on several real-world datasets validate the method's effectiveness.
Link: https://arxiv.org/abs/2511.15256
Authors: Yanchen Xu, Ziheng Jiao, Hongyuan Zhang, Xuelong Li
Affiliations: Institute of Artificial Intelligence (TeleAI), China Telecom; School of Artificial Intelligence (OPtics and ElectroNics) (iOPEN), Northwestern Polytechnical University; HuaWei Technologies Co., Ltd.; The University of Hong Kong
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The Group Relative Policy Optimization (GRPO), a reinforcement learning method used to fine-tune large language models (LLMs), has proved its effectiveness in practical applications such as DeepSeek-R1. It raises a question whether GRPO can be generalized to representation learning models. In this paper, we propose Group Relative Policy Optimization for Representation Model (GRPO-RM), and investigate the performance of GRPO-like policy in post-training representation models. Specifically, our method establishes a predefined output set to functionally replace token sequence sampling in LLMs, thereby generating an output group, which is essential for the probability-driven optimization of GRPO. In addition, a specialized reward function is designed to accommodate the properties of representation models. Extensive experiments are conducted on various real-world datasets to validate the effectiveness of our proposed method.
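The "group relative" part of GRPO is commonly described as scoring each output in a group against the group's own reward statistics. The sketch below shows that normalization for a group of rewards, for example one reward per element of a predefined output set. It is a generic illustration and not GRPO-RM's implementation: the function name, the use of the population standard deviation, and the `eps` stabilizer are all assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one output group: (r - mean) / (std + eps).

    Uses the population standard deviation; implementations may differ.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One group with a single rewarded output: it gets a positive advantage,
# the others a negative one.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

Because the baseline is the group mean, no separate value network is needed, which is part of the method's appeal for post-training.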
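[CV-57] SkinGPT-R1: Adapter-Only Dual Distillation for Efficient Dermatology Reasoning
[Quick read]: This paper addresses the lack of interpretability and verifiability in the diagnostic reasoning of dermatology vision-language models, that is, how to make the diagnostic process transparent, structured, and consistent with clinical practice. The key to the solution is SkinGPT-R1, a dermatology-focused vision-language model, together with DermCoT (Dermatology Chain of Thought), a corpus of standardized dermatologic reasoning chains that combines 10,000 DermEval-filtered training cases with 3,000 dermatologist-scored certified cases. The paper also defines DermEval, a physician-aligned six-dimensional evaluator, and DermBench, the corresponding benchmark for reasoning quality. On DermBench, across 14 general, reasoning, and medical vision-language models, SkinGPT-R1 achieves an average score of 4.031 out of 5, ranks first, and improves on the Vision-R1 baseline by about 41%, which supports the value of DermCoT-based reasoning supervision and dermatology-aware visual distillation.
Link: https://arxiv.org/abs/2511.15242
Authors: Yuhao Shen, Jiahe Qian, Zhangtianyi Chen, Yuanhao He, Juexiao Zhou
Affiliations: School of Data Science, The Chinese University of Hong Kong, Shenzhen; Institute of Automation, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: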
Abstract:We present SkinGPT-R1, a dermatology focused vision language model that makes diagnostic chain of thought reasoning explicit, step by step, and verifiable. To support skin specific reasoning, we build DermCoT, a corpus of standardized dermatologic chain of thought narratives that combines 10,000 DermEval filtered training cases with 3,000 dermatologist scored certified cases, and we define DermEval as a physician aligned six dimensional evaluator and DermBench as the corresponding benchmark for dermatologic chain of thought quality. On DermBench, across 14 general, reasoning, and medical vision language models, SkinGPT-R1 achieves an average score of 4.031 out of 5 over the six clinician defined dimensions, ranks 1st among all systems, and improves the average score over Vision-R1 by about 41%. On three dermatology classification benchmarks, SkinGPT-R1 delivers stable accuracy gains over Vision-R1 and remains competitive among strong vision language models. Ablation results further show that DermCoT based chain of thought supervision provides substantial improvements over the base model and that adding dermatology aware visual distillation yields consistent additional gains in both narrative quality and recognition.
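[CV-58] Physics-Based Benchmarking Metrics for Multimodal Synthetic Images
[Quick read]: This paper addresses the difficulty that mainstream evaluation metrics (such as BLEU, CIDEr, VQA score, SigLIP-2, and CLIPScore) have in capturing semantic or structural accuracy in domain-specific or context-dependent scenarios. The key to the solution is Physics-Constrained Multimodal Data Evaluation (PCMDE), which combines the reasoning ability of large language models (LLMs), knowledge-based mapping, and vision-language models (VLMs) in three stages: (1) extracting spatial and semantic features through object detection and VLMs; (2) Confidence-Weighted Component Fusion for adaptive component-level validation; and (3) physics-guided reasoning with LLMs to enforce structural and relational constraints (for example alignment, position, and consistency).
Link: https://arxiv.org/abs/2511.15204
Authors: Kishor Datta Gupta, Marufa Kamal, Md. Mahfuzur Rahman, Fahad Rahman, Mohd Ariful Haque, Sunzida Siddique
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: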
Abstract:Current state of the art measures like BLEU, CIDEr, VQA score, SigLIP-2 and CLIPScore are often unable to capture semantic or structural accuracy, especially for domain-specific or context-dependent scenarios. For this, this paper proposes a Physics-Constrained Multimodal Data Evaluation (PCMDE) metric combining large language models with reasoning, knowledge based mapping and vision-language models to overcome these limitations. The architecture is comprised of three main stages: (1) feature extraction of spatial and semantic information with multimodal features through object detection and VLMs; (2) Confidence-Weighted Component Fusion for adaptive component-level validation; and (3) physics-guided reasoning using large language models for structural and relational constraints (e.g., alignment, position, consistency) enforcement.
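[CV-59] Towards Unbiased Cross-Modal Representation Learning for Food Image-to-Recipe Retrieval
[Quick read]: This paper addresses bias in representation learning for cross-modal retrieval between food images and recipes that arises from misreading the causal relationship between them. Existing methods treat a recipe as a text source describing a dish's visual appearance and overlook that, because of the cooking process, dish presentation, and image-capturing conditions, an image may not capture every detail of the recipe. This biases similarity judgments: models tend to capture dominant visual-text alignment and overlook the subtle variations that determine retrieval relevance. The key to the solution is to model this bias with causal theory: ingredients are identified as a confounder, and a simple backdoor adjustment serves as a causal intervention that reformulates the conventional retrieval model to remove the potential bias. Under this formulation, the oracle retrieval performance on the Recipe1M dataset is MedR=1, and a plug-and-play multi-label ingredient classifier module for debiasing yields new state-of-the-art retrieval results.
Link: https://arxiv.org/abs/2511.15201
Authors: Qing Wang, Chong-Wah Ngo, Ee-Peng Lim
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: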
Abstract:This paper addresses the challenges of learning representations for recipes and food images in the cross-modal retrieval problem. As the relationship between a recipe and its cooked dish is cause-and-effect, treating a recipe as a text source describing the visual appearance of a dish for learning representation, as the existing approaches, will create bias misleading image-and-recipe similarity judgment. Specifically, a food image may not equally capture every detail in a recipe, due to factors such as the cooking process, dish presentation, and image-capturing conditions. The current representation learning tends to capture dominant visual-text alignment while overlooking subtle variations that determine retrieval relevance. In this paper, we model such bias in cross-modal representation learning using causal theory. The causal view of this problem suggests ingredients as one of the confounder sources and a simple backdoor adjustment can alleviate the bias. By causal intervention, we reformulate the conventional model for food-to-recipe retrieval with an additional term to remove the potential bias in similarity judgment. Based on this theory-informed formulation, we empirically prove the oracle performance of retrieval on the Recipe1M dataset to be MedR=1 across the testing data sizes of 1K, 10K, and even 50K. We also propose a plug-and-play neural module, which is essentially a multi-label ingredient classifier for debiasing. New state-of-the-art search performances are reported on the Recipe1M dataset.
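For reference, the backdoor adjustment mentioned above has the following textbook form. The symbols are chosen here for illustration and are not the paper's notation: $I$ stands for the query image, $R$ for the retrieved recipe, and $z$ ranges over ingredient configurations, which play the role of the confounder. How the paper instantiates each term is not stated in the abstract.

```latex
P\bigl(R \mid \mathrm{do}(I)\bigr) \;=\; \sum_{z} P\bigl(R \mid I, z\bigr)\, P(z)
```

Compared with the ordinary conditional $P(R \mid I) = \sum_{z} P(R \mid I, z)\, P(z \mid I)$, the adjustment replaces $P(z \mid I)$ with the prior $P(z)$, so ingredient combinations that happen to co-occur with dominant visual patterns no longer skew the similarity judgment.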
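[CV-60] Insert In Style: A Zero-Shot Generative Framework for Harmonious Cross-Domain Object Composition
[Quick read]: This paper addresses the failure of reference-based object composition methods when inserting real-world objects into stylized domains. Existing methods are either practical "blenders" that lack generative fidelity or "generators" that require impractical per-subject online finetuning. The key to the solution is Insert In Style, a zero-shot generative framework with two innovations: (i) a multi-stage training protocol that disentangles representations for identity, style, and composition; and (ii) a specialized masked-attention architecture that enforces this disentanglement during generation, preventing the concept interference common in general-purpose unified-attention models. The method improves identity and style fidelity when placing real objects in stylized scenes and needs no text prompts or extra finetuning.
Link: https://arxiv.org/abs/2511.15197
Authors: Raghu Vamsi Chittersu, Yuvraj Singh Rathore, Pranav Adlinge, Kunal Swami
Affiliations: Samsung Research India Bangalore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: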
Abstract:Reference-based object composition methods fail when inserting real-world objects into stylized domains. This under-explored problem is currently split between practical “blenders” that lack generative fidelity and “generators” that require impractical, per-subject online finetuning. In this work, we introduce Insert In Style, the first zero-shot generative framework that is both practical and high-fidelity. Our core contribution is a unified framework with two key innovations: (i) a novel multi-stage training protocol that disentangles representations for identity, style, and composition, and (ii) a specialized masked-attention architecture that surgically enforces this disentanglement during generation. This approach prevents the concept interference common in general-purpose, unified-attention models. Our framework is trained on a new 100k sample dataset, curated from a novel data pipeline. This pipeline couples large-scale generation with a rigorous, two-stage filtering process to ensure both high-fidelity semantic identity and style coherence. Unlike prior work, our model is truly zero-shot and requires no text prompts. We also introduce a new public benchmark for stylized composition. We demonstrate state-of-the-art performance, significantly outperforming existing methods on both identity and style metrics, a result strongly corroborated by user studies.
[CV-61] BrainRotViT: Transformer-ResNet Hybrid for Explainable Modeling of Brain Aging from 3D sMRI
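[Quick read]: The paper targets brain age estimation from structural MRI (sMRI), where traditional regression models and CNN-based methods are limited by manual feature engineering, restricted receptive fields, and overfitting on heterogeneous data, while pure vision transformer (ViT) models depend on large datasets and high compute. The key to the solution is BrainRotViT, a hybrid architecture that combines the global context modeling of ViTs with the local refinement of residual CNNs: a ViT encoder is first trained on an auxiliary age and sex classification task to extract slice-level features; the frozen encoder then embeds all sagittal slices into a 2D matrix of embedding vectors, which is fed to a residual CNN regressor that incorporates subject sex at the final fully connected layer to predict continuous brain age. The authors report that this design yields efficient, interpretable, and generalizable brain-age prediction that outperforms baseline and state-of-the-art models.
Link: https://arxiv.org/abs/2511.15188
Authors: Wasif Jalal, Md Nafiu Rahman, M. Sohel Rahman
Affiliations: Bangladesh University of Engineering and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: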
Abstract:Accurate brain age estimation from structural MRI is a valuable biomarker for studying aging and neurodegeneration. Traditional regression and CNN-based methods face limitations such as manual feature engineering, limited receptive fields, and overfitting on heterogeneous data. Pure transformer models, while effective, require large datasets and high computational cost. We propose Brain ResNet over trained Vision Transformer (BrainRotViT), a hybrid architecture that combines the global context modeling of vision transformers (ViT) with the local refinement of residual CNNs. A ViT encoder is first trained on an auxiliary age and sex classification task to learn slice-level features. The frozen encoder is then applied to all sagittal slices to generate a 2D matrix of embedding vectors, which is fed into a residual CNN regressor that incorporates subject sex at the final fully-connected layer to estimate continuous brain age. Our method achieves an MAE of 3.34 years (Pearson r = 0.98, Spearman ρ = 0.97, R² = 0.95) on validation across 11 MRI datasets encompassing more than 130 acquisition sites, outperforming baseline and state-of-the-art models. It also generalizes well across 4 independent cohorts with MAEs between 3.77 and 5.04 years. Analyses on the brain age gap (the difference between the predicted age and actual age) show that aging patterns are associated with Alzheimer’s disease, cognitive impairment, and autism spectrum disorder. Model attention maps highlight aging-associated regions of the brain, notably the cerebellar vermis, precentral and postcentral gyri, temporal lobes, and medial superior frontal gyrus. Our results demonstrate that this method provides an efficient, interpretable, and generalizable framework for brain-age prediction, bridging the gap between CNN- and transformer-based approaches while opening new avenues for aging and neurodegeneration research.
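To make the two quantities in this abstract concrete, here is a minimal sketch of the mean absolute error (MAE) and the brain age gap (predicted age minus actual age). The function names and the three toy subjects are illustrative assumptions; the numbers are made up and are not results from the paper.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and chronological age."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def brain_age_gap(predicted, actual):
    """Per-subject gap: positive values mean the brain looks older than it is."""
    return [p - a for p, a in zip(predicted, actual)]


# Toy values for illustration only (not data from the paper).
predicted_age = [70.0, 52.5, 33.0]
actual_age = [68.0, 55.0, 33.0]

mae = mean_absolute_error(predicted_age, actual_age)
gaps = brain_age_gap(predicted_age, actual_age)
```

In this reading, the MAE summarizes overall accuracy, while the sign and size of each gap is what the abstract relates to conditions such as Alzheimer’s disease.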
[CV-62] Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset
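[Quick read]: The paper addresses two limitations that hold back current lesion segmentation models for chest X-rays (CXRs): a small number of target labels, and reliance on long, expert-level text inputs that makes the models hard to use in practice. The authors propose a new paradigm, instruction-guided lesion segmentation (ILS), which localizes diverse lesion types at the pixel level from short, user-friendly natural-language instructions. The key to the solution is MIMIC-ILS, the first large-scale instruction-answer dataset for CXR lesion segmentation (1.1M instruction-answer pairs and 91K unique segmentation masks), and ROSALIA, a vision-language model trained on it that produces lesion segmentations and textual explanations in response to user instructions, which the authors present as improving practicality and scalability.
Link: https://arxiv.org/abs/2511.15186
Authors: Geon Choi, Hangyul Yoon, Hyunju Shin, Hyunki Park, Sang Hoon Seo, Eunho Yang, Edward Choi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: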
Abstract:The applicability of current lesion segmentation models for chest X-rays (CXRs) has been limited both by a small number of target labels and the reliance on long, detailed expert-level text inputs, creating a barrier to practical use. To address these limitations, we introduce a new paradigm: instruction-guided lesion segmentation (ILS), which is designed to segment diverse lesion types based on simple, user-friendly instructions. Under this paradigm, we construct MIMIC-ILS, the first large-scale instruction-answer dataset for CXR lesion segmentation, using our fully automated multimodal pipeline that generates annotations from chest X-ray images and their corresponding reports. MIMIC-ILS contains 1.1M instruction-answer pairs derived from 192K images and 91K unique segmentation masks, covering seven major lesion types. To empirically demonstrate its utility, we introduce ROSALIA, a vision-language model fine-tuned on MIMIC-ILS. ROSALIA can segment diverse lesions and provide textual explanations in response to user instructions. The model achieves high segmentation and textual accuracy in our newly proposed task, highlighting the effectiveness of our pipeline and the value of MIMIC-ILS as a foundational resource for pixel-level CXR lesion grounding.
[CV-63] MMCM: Multimodality-aware Metric using Clustering-based Modes for Probabilistic Human Motion Prediction WACV2026
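[Quick read]: The paper addresses the inability of existing evaluation metrics for Human Motion Prediction (HMP) to measure multimodal prediction quality. Conventional evaluation only considers the difference between a single predicted motion and its ground truth, ignoring that in the multimodal setting predictions should cover diverse motion modes and be kinematically plausible. The proposed Multimodality-aware Metric using Clustering-based Modes (MMCM) rests on two ideas: first, the motion space is clustered into modes, so the metric can quantify whether predictions are distributed across multiple modes (coverage); second, valid modes are identified from a real motion dataset, so the metric can assess whether predicted motions are kinematically valid futures of the given past motion (validity). The authors report that this sharpens the metric's ability to discriminate the quality of multimodal predictions.
Link: https://arxiv.org/abs/2511.15179
Authors: Kyotaro Tokoro, Hiromu Taketsugu, Norimichi Ukita
Affiliations: Toyota Technological Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to WACV2026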
Abstract:This paper proposes a novel metric for Human Motion Prediction (HMP). Since a single past sequence can lead to multiple possible futures, a probabilistic HMP method predicts such multiple motions. While a single motion predicted by a deterministic method is evaluated only with the difference from its ground truth motion, multiple predicted motions should also be evaluated based on their distribution. For this evaluation, this paper focuses on the following two criteria. (a) Coverage: motions should be distributed among multiple motion modes to cover diverse possibilities. (b) Validity: motions should be kinematically valid as future motions observable from a given past motion. However, existing metrics simply reward widely distributed motions even if these motions are observed in a single mode and kinematically invalid. To resolve these disadvantages, this paper proposes a Multimodality-aware Metric using Clustering-based Modes (MMCM). For (a) coverage, MMCM divides a motion space into several clusters, each of which is regarded as a mode. These modes are used to explicitly evaluate whether predicted motions are distributed among multiple modes. For (b) validity, MMCM identifies valid modes by collecting possible future motions from a motion dataset. Our experiments validate that our clustering yields sensible mode definitions and that MMCM accurately scores multimodal predictions. Code: this https URL
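The coverage and validity criteria can be illustrated with a minimal sketch. It assumes the modes are already available as cluster centroids and that a mode counts as valid when at least one future motion from the dataset falls into it. The function names, the nearest-centroid assignment, and the two scores below are hypothetical illustrations of the idea, not the paper's definition of MMCM.

```python
import math


def nearest_mode(motion, centroids):
    """Index of the centroid closest to a flattened motion vector."""
    return min(range(len(centroids)), key=lambda k: math.dist(motion, centroids[k]))


def coverage_and_validity(predictions, valid_futures, centroids):
    """Hypothetical scores: share of valid modes hit, share of predictions in valid modes."""
    valid_modes = {nearest_mode(m, centroids) for m in valid_futures}
    pred_modes = [nearest_mode(m, centroids) for m in predictions]
    coverage = len(valid_modes & set(pred_modes)) / len(valid_modes)
    validity = sum(k in valid_modes for k in pred_modes) / len(pred_modes)
    return coverage, validity


# Toy 2D "motions": three modes, of which modes 0 and 1 are valid.
centroids = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]
valid_futures = [(0.1, 0.2), (4.8, 0.1)]
predictions = [(0.0, 0.1), (0.2, -0.1), (0.1, 4.9)]  # fall into modes 0, 0, 2

coverage, validity = coverage_and_validity(predictions, valid_futures, centroids)
```

Here the predictions reach only one of the two valid modes, and one prediction lands in an invalid mode, so a metric that merely rewarded spread would overrate them.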
[CV-64] Learning Depth from Past Selves: Self-Evolution Contrast for Robust Depth Estimation
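[Quick read]: The paper addresses the marked performance drop of self-supervised depth estimation under adverse weather such as rain and fog, where reduced visibility severely degrades depth prediction. The key to the solution is SEC-Depth, a self-evolution contrastive learning framework. It uses intermediate parameters produced during training to build temporally evolving latency models and introduces a self-evolution contrastive loss (SECL) that treats the outputs of historical latency models as negative samples. This adaptively adjusts the learning objective, implicitly senses the severity of weather degradation, reduces the need for manual intervention, and improves robustness in zero-shot evaluation.
Link: https://arxiv.org/abs/2511.15167
Authors: Jing Cao, Kui Jiang, Shenyi Li, Xiaocheng Feng, Yong Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: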
Abstract:Self-supervised depth estimation has gained significant attention in autonomous driving and robotics. However, existing methods exhibit substantial performance degradation under adverse weather conditions such as rain and fog, where reduced visibility critically impairs depth prediction. To address this issue, we propose a novel self-evolution contrastive learning framework called SEC-Depth for self-supervised robust depth estimation tasks. Our approach leverages intermediate parameters generated during training to construct temporally evolving latency models. Using these, we design a self-evolution contrastive scheme to mitigate performance loss under challenging conditions. Concretely, we first design a dynamic update strategy of latency models for the depth estimation task to capture optimization states across training stages. To effectively leverage latency models, we introduce a self-evolution contrastive loss (SECL) that treats outputs from historical latency models as negative samples. This mechanism adaptively adjusts learning objectives while implicitly sensing weather degradation severity, reducing the need for manual intervention. Experiments show that our method integrates seamlessly into diverse baseline models and significantly enhances robustness in zero-shot evaluations.
[CV-65] Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance
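[Quick read]: The paper addresses catastrophic forgetting in multimodal continual instruction tuning, where learning new tasks degrades performance on earlier ones. The key idea is to recast catastrophic forgetting as a problem of missing gradients from old tasks and to approximate those gradients from the geometry of the parameter space, using the directional vector between the current parameters and the previously optimal parameters as gradient guidance. The approximated gradient can be combined with real gradients from a limited replay buffer and is regulated by a Bernoulli sampling strategy that dynamically balances stability and plasticity, mitigating forgetting without model expansion.
Link: https://arxiv.org/abs/2511.15164
Authors: Songze Li, Mingyu Gao, Tonghua Su, Xu-Yao Zhang, Zhongjie Wang
Affiliations: Harbin Institute of Technology; School of Artificial Intelligence, UCAS
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: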
Abstract:Multimodal continual instruction tuning enables multimodal large language models to sequentially adapt to new tasks while building upon previously acquired knowledge. However, this continual learning paradigm faces the significant challenge of catastrophic forgetting, where learning new tasks leads to performance degradation on previous ones. In this paper, we introduce a novel insight into catastrophic forgetting by conceptualizing it as a problem of missing gradients from old tasks during new task learning. Our approach approximates these missing gradients by leveraging the geometric properties of the parameter space, specifically using the directional vector between current parameters and previously optimal parameters as gradient guidance. This approximated gradient can be further integrated with real gradients from a limited replay buffer and regulated by a Bernoulli sampling strategy that dynamically balances model stability and plasticity. Extensive experiments on multimodal continual instruction tuning datasets demonstrate that our method achieves state-of-the-art performance without model expansion, effectively mitigating catastrophic forgetting while maintaining a compact architecture.
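The gradient-guidance idea can be sketched in a few lines. This is a toy illustration under stated assumptions: parameters are plain lists, the missing old-task gradient is approximated by the vector from the old optimum to the current parameters, the replay gradient is averaged in, and a Bernoulli gate with probability `p` decides whether the guidance is applied at a given step. The function `guided_step`, the averaging rule, and the gating rule are hypothetical; the paper's actual update may differ.

```python
import random


def guided_step(theta, theta_old, new_grad, lr=0.1, p=0.5, replay_grad=None, rng=random):
    """One SGD step; with probability p the old-task guidance is added (assumed scheme)."""
    # Vector from the old optimum to the current point: descending along it
    # pulls theta back toward theta_old, standing in for the missing old-task gradient.
    guidance = [t - o for t, o in zip(theta, theta_old)]
    if replay_grad is not None:
        # Assumed combination: simple average with a real gradient from the replay buffer.
        guidance = [(g + r) / 2 for g, r in zip(guidance, replay_grad)]
    use_guidance = rng.random() < p  # Bernoulli(p) gate between stability and plasticity
    grad = [n + g if use_guidance else n for n, g in zip(new_grad, guidance)]
    return [t - lr * g for t, g in zip(theta, grad)]


theta_old = [1.0, 1.0]  # parameters that were optimal for the old task
theta = [2.0, 0.0]      # current parameters after drifting on the new task

stable = guided_step(theta, theta_old, new_grad=[0.0, 0.0], p=1.0)   # guidance always on
plastic = guided_step(theta, theta_old, new_grad=[0.0, 0.0], p=0.0)  # guidance always off
```

With `p=1.0` the step moves the parameters back toward the old optimum; with `p=0.0` only the new-task gradient acts, so intermediate values of `p` trade stability against plasticity.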
[CV-66] SceneEdited: A City-Scale Benchmark for 3D HD Map Updating via Image-Guided Change Detection WACV2026
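[Quick read]: The paper addresses how quickly High-Definition (HD) maps become outdated as urban environments change, focusing on how to turn 2D image-based change detection results into 3D point cloud updates so that HD maps can be maintained continuously. The key contribution is SceneEdited, the first city-scale dataset for 3D map updating research. It contains over 800 up-to-date scenes covering 73 km of driving and roughly 3 km² of urban area, with more than 23,000 object changes synthesized both manually and automatically (such as missing roadside infrastructure, buildings, overpasses, and utility poles). Each scene provides calibrated RGB images, LiDAR scans, and detailed change masks, and an accompanying toolkit supports scalability, trackability, and portability, giving future research a standardized benchmark and a practical framework.
Link: https://arxiv.org/abs/2511.15153
Authors: Chun-Jung Lin, Tat-Jun Chin, Sourav Garg, Feras Dayoub
Affiliations: Australian Institute for Machine Learning; University of Adelaide
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by WACV 2026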
Abstract:Accurate, up-to-date High-Definition (HD) maps are critical for urban planning, infrastructure monitoring, and autonomous navigation. However, these maps quickly become outdated as environments evolve, creating a need for robust methods that not only detect changes but also incorporate them into updated 3D representations. While change detection techniques have advanced significantly, there remains a clear gap between detecting changes and actually updating 3D maps, particularly when relying on 2D image-based change detection. To address this gap, we introduce SceneEdited, the first city-scale dataset explicitly designed to support research on HD map maintenance through 3D point cloud updating. SceneEdited contains over 800 up-to-date scenes covering 73 km of driving and approximately 3 km² of urban area, with more than 23,000 synthesized object changes created both manually and automatically across 2000+ out-of-date versions, simulating realistic urban modifications such as missing roadside infrastructure, buildings, overpasses, and utility poles. Each scene includes calibrated RGB images, LiDAR scans, and detailed change masks for training and evaluation. We also provide baseline methods using a foundational image-based structure-from-motion pipeline for updating outdated scenes, as well as a comprehensive toolkit supporting scalability, trackability, and portability for future dataset expansion and unification of out-of-date object annotations. Both the dataset and the toolkit are publicly available at this https URL, establishing a standardized benchmark for 3D map updating research.
[CV-67] DCL-SE: Dynamic Curriculum Learning for Spatiotemporal Encoding of Brain Imaging
[Quick read]: The paper addresses two challenges in high-dimensional neuroimaging analysis for clinical diagnosis: compromises in spatiotemporal fidelity, and the limited adaptability of large-scale general-purpose models to specific tasks. The key to the solution is an end-to-end framework, Dynamic Curriculum Learning for Spatiotemporal Encoding (DCL-SE). Its core components are Approximate Rank Pooling (ARP), which efficiently encodes 3D volumetric brain data into 2D dynamic representations, and a dynamic curriculum learning strategy guided by a Dynamic Group Mechanism (DGM), which progressively trains the decoder to refine feature extraction from global anatomical structures down to fine pathological details.
Link: https://arxiv.org/abs/2511.15151
Authors: Meihua Zhou, Xinyu Tong, Jiarui Zhao, Min Cheng, Li Yang, Lei Tian, Nan Wan
Affiliations: University of Chinese Academy of Sciences; Capital Medical University; Wuhu Hospital; Beijing Tongren Hospital; Beijing Institute of Ophthalmology; Wannan Medical University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:High-dimensional neuroimaging analyses for clinical diagnosis are often constrained by compromises in spatiotemporal fidelity and by the limited adaptability of large-scale, general-purpose models. To address these challenges, we introduce Dynamic Curriculum Learning for Spatiotemporal Encoding (DCL-SE), an end-to-end framework centered on data-driven spatiotemporal encoding (DaSE). We leverage Approximate Rank Pooling (ARP) to efficiently encode three-dimensional volumetric brain data into information-rich, two-dimensional dynamic representations, and then employ a dynamic curriculum learning strategy, guided by a Dynamic Group Mechanism (DGM), to progressively train the decoder, refining feature extraction from global anatomical structures to fine pathological details. Evaluated across six publicly available datasets, including Alzheimer’s disease and brain tumor classification, cerebral artery segmentation, and brain age prediction, DCL-SE consistently outperforms existing methods in accuracy, robustness, and interpretability. These findings underscore the critical importance of compact, task-specific architectures in the era of large-scale pretrained networks.
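As a rough illustration of how rank pooling can fold a 3D volume into one 2D map, the sketch below weights an ordered stack of slices with the simplified linear coefficients α_t = 2t − T − 1 that appear in the dynamic-image formulation of approximate rank pooling. Treating slice order as the "time" axis and using these particular coefficients are assumptions made for the example; the paper may use a different variant.

```python
def approximate_rank_pooling(slices):
    """Collapse an ordered list of equally shaped 2D slices into one 2D map.

    Uses the simplified coefficients alpha_t = 2t - T - 1 for t = 1..T,
    an assumption borrowed from the dynamic-image literature.
    """
    T = len(slices)
    rows, cols = len(slices[0]), len(slices[0][0])
    pooled = [[0.0] * cols for _ in range(rows)]
    for t, sl in enumerate(slices, start=1):
        alpha = 2 * t - T - 1  # early slices get negative weight, late slices positive
        for i in range(rows):
            for j in range(cols):
                pooled[i][j] += alpha * sl[i][j]
    return pooled


# Toy volume: T = 3 slices, each of shape 1 x 2.
volume = [[[1.0, 0.0]], [[2.0, 0.0]], [[4.0, 1.0]]]
dynamic_map = approximate_rank_pooling(volume)
```

Because the weights sum to zero, content that is constant across slices cancels out and the pooled map emphasizes how intensity changes along the slice axis.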
[CV-68] WaveFuse-AL: Cyclical and Performance-Adaptive Multi-Strategy Active Learning for Medical Images
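[Quick read]: The paper addresses low annotation efficiency in active learning for medical image analysis, caused by single acquisition strategies behaving inconsistently across training stages. The key to the solution is WaveFuse-AL, a multi-strategy fusion framework that is both cyclical and performance-adaptive. It combines sinusoidal temporal priors with a mechanism that adjusts each strategy's weight according to model performance, fusing established acquisition strategies (BALD, BADGE, Entropy, and CoreSet) over time, with the aim of more stable performance and better generalization under a limited annotation budget.
Link: https://arxiv.org/abs/2511.15132
Authors: Nishchala Thakur, Swati Kochhar, Deepti R. Bathula, Sukrit Gupta
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: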
Abstract:Active learning reduces annotation costs in medical imaging by strategically selecting the most informative samples for labeling. However, individual acquisition strategies often exhibit inconsistent behavior across different stages of the active learning cycle. We propose Cyclical and Performance-Adaptive Multi-Strategy Active Learning (WaveFuse-AL), a novel framework that adaptively fuses multiple established acquisition strategies (BALD, BADGE, Entropy, and CoreSet) throughout the learning process. WaveFuse-AL integrates cyclical (sinusoidal) temporal priors with performance-driven adaptation to dynamically adjust strategy importance over time. We evaluate WaveFuse-AL on three medical imaging benchmarks: APTOS-2019 (multi-class classification), RSNA Pneumonia Detection (binary classification), and ISIC-2018 (skin lesion segmentation). Experimental results demonstrate that WaveFuse-AL consistently outperforms both single-strategy and alternating-strategy baselines, achieving statistically significant performance improvements (on ten out of twelve metric measurements) while maximizing the utility of limited annotation budgets.
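One way to picture the fusion of cyclical priors with performance-driven adaptation is sketched below. Everything beyond the four strategy names is an assumption for illustration: the phase-shifted sine prior, the additive performance term, the softmax, and the parameters `period` and `temperature` are hypothetical choices, not the weighting rule from the paper.

```python
import math

STRATEGIES = ["BALD", "BADGE", "Entropy", "CoreSet"]


def fusion_weights(cycle, perf_gain, period=8.0, temperature=1.0):
    """Hypothetical weights: softmax of a phase-shifted sinusoidal prior plus recent gain."""
    n = len(STRATEGIES)
    logits = []
    for i, name in enumerate(STRATEGIES):
        prior = math.sin(2 * math.pi * (cycle / period + i / n))  # cyclical prior
        logits.append((prior + perf_gain.get(name, 0.0)) / temperature)
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(STRATEGIES, exps)}


def fused_scores(per_strategy_scores, weights):
    """Weighted sum of per-sample acquisition scores (assumed normalised to [0, 1])."""
    n_samples = len(next(iter(per_strategy_scores.values())))
    return [
        sum(weights[s] * per_strategy_scores[s][k] for s in weights)
        for k in range(n_samples)
    ]


weights = fusion_weights(cycle=0, perf_gain={"BADGE": 0.5})
scores = fused_scores({s: [0.2, 0.9] for s in STRATEGIES}, weights)
```

The samples with the highest fused score would then be sent for annotation; as `cycle` advances, the sine term rotates emphasis between strategies while `perf_gain` rewards whichever strategy helped most recently.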
[CV-69] Unbiased Semantic Decoding with Vision Foundation Models for Few-shot Segmentation
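[Quick read]: The paper addresses few-shot segmentation methods built on the Segment Anything Model (SAM), whose reliance on accurate, explicit prompts leads to limited generalization and biased decoding. The core solution is an Unbiased Semantic Decoding (USD) strategy that extracts target information from both the support set and the query set and uses the semantics of the Contrastive Language-Image Pre-training (CLIP) model to guide more consistent and robust predictions. The key innovations are: 1) two feature enhancement strategies, an image-level global supplement that provides a generalized category indication and a pixel-level local guidance that provides target location cues; 2) a learnable visual-text target prompt generator that fuses CLIP visual features with text embeddings to produce target-focused prompt embeddings, improving SAM's discrimination of unseen classes and its decoding stability without fine-tuning the vision foundation models.
Link: https://arxiv.org/abs/2511.15118
Authors: Jin Wang, Bingfeng Zhang, Jian Pang, Weifeng Liu, Baodi Liu, Honglong Chen
Affiliations: China University of Petroleum (East China)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: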
Abstract:Few-shot segmentation has garnered significant attention. Many recent approaches attempt to introduce the Segment Anything Model (SAM) to handle this task. With the strong generalization ability and rich object-specific extraction ability of SAM, such a solution shows great potential in few-shot segmentation. However, the decoding process of SAM relies heavily on accurate and explicit prompts, so previous approaches mainly focus on extracting prompts from the support set. This is insufficient to activate the generalization ability of SAM, and the design easily leads to a biased decoding process when adapting to unknown classes. In this work, we propose an Unbiased Semantic Decoding (USD) strategy integrated with SAM, which extracts target information from both the support and query sets simultaneously to perform consistent predictions guided by the semantics of the Contrastive Language-Image Pre-training (CLIP) model. Specifically, to enhance the unbiased semantic discrimination of SAM, we design two feature enhancement strategies that leverage the semantic alignment capability of CLIP to enrich the original SAM features: a global supplement at the image level that provides a generalized category indication from the support image, and a local guidance at the pixel level that provides a useful target location from the query image. In addition, to generate target-focused prompt embeddings, a learnable visual-text target prompt generator is proposed that lets target text embeddings interact with CLIP visual features. Without requiring re-training of the vision foundation models, the semantically discriminative features draw attention to the target region through the guidance of prompts rich in target information.
[CV-70] An Event-triggered System for Social Persuasion and Danger Alert in Elder Home Monitoring
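[Quick read]: The paper addresses insufficient safety monitoring and emotional communication for elders living at home, considering both their physical state (such as falls and abnormal activity) and their mental state (such as loneliness and social isolation). The key to the solution is an event-triggered monitoring system: Gaussian mixture model (GMM) background modeling detects the motion of visitors and elders, which identifies the "watch dog" and "danger notice" events; a support vector machine (SVM) classifies the captured images to improve event recognition accuracy; and, to lower the technical barrier for elders, an intuitive operation modeled on everyday behavior lets them communicate naturally with relatives through social media, strengthening social connection and psychological support.
Link: https://arxiv.org/abs/2511.15117
Authors: Jun-Yi Liu, Chung-Hao Chen, Ya-Chi Tsao, Ssu-Yao Wu, Yu-Ting Tsao, Lyn Chao-ling Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted in the 35th IPPR Conference on Computer Vision, Graphics, and Image Processing (CVGIP2022)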
Abstract:This study considers both the physical and the mental state of elders, and develops an event-triggered system that detects three events: watch dog, danger notice, and photo link. Using GMM background modeling, the system detects the motion of visitors in the watch dog event and of elders in the danger notice event. Experiments were set in home scenarios, and 5 families participated, with the three types of events detected and recorded from their daily activities. In addition, the captured images were analyzed using an SVM. Because elders often lack technical experience, an intuitive operation resembling a normal daily activity was designed to create communication between elders and their relatives via social media.
[CV-71] Gaussian Blending: Rethinking Alpha Blending in 3D Gaussian Splatting AAAI2026
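[Quick read]: The paper addresses the visual artifacts that 3D Gaussian Splatting (3DGS) methods show when synthesizing views at sampling rates unseen during training: erosion-induced blurring artifacts when zooming in and dilation-induced staircase artifacts when zooming out. The authors attribute these to a limitation of the conventional scalar alpha blending used in 3DGS, which treats alpha and transmittance as per-pixel scalars and cannot model spatially varying occlusion. The key solution is Gaussian Blending, which extends alpha and transmittance from scalars to spatially varying distributions over the pixel area, so that nearby background splats can contribute to the final rendering according to the spatial distribution of alpha, giving more accurate occlusion and better preserved detail. The method keeps real-time rendering speed, needs no additional memory, and can be dropped into existing 3DGS-based or other Novel View Synthesis (NVS) frameworks.
Link: https://arxiv.org/abs/2511.15102
Authors: Junseo Koo, Jinseo Jeong, Gunhee Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2026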
Abstract:The recent introduction of 3D Gaussian Splatting (3DGS) has significantly advanced novel view synthesis. Several studies have further improved the rendering quality of 3DGS, yet they still exhibit noticeable visual discrepancies when synthesizing views at sampling rates unseen during training. Specifically, they suffer from (i) erosion-induced blurring artifacts when zooming in and (ii) dilation-induced staircase artifacts when zooming out. We speculate that these artifacts arise from the fundamental limitation of the alpha blending adopted in 3DGS methods. Instead of the conventional alpha blending that computes alpha and transmittance as scalar quantities over a pixel, we propose to replace it with our novel Gaussian Blending that treats alpha and transmittance as spatially varying distributions. Thus, transmittances can be updated considering the spatial distribution of alpha values across the pixel area, allowing nearby background splats to contribute to the final rendering. Our Gaussian Blending maintains real-time rendering speed and requires no additional memory cost, while being easily integrated as a drop-in replacement into existing 3DGS-based or other NVS frameworks. Extensive experiments demonstrate that Gaussian Blending effectively captures fine details at various sampling rates unseen during training, consistently outperforming existing novel view synthesis models across both unseen and seen sampling rates.
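For context, the conventional scalar alpha blending that the abstract contrasts against can be written as a short front-to-back compositing loop. This sketch shows only that baseline, with one scalar alpha per splat at a pixel; it is not the proposed Gaussian Blending, and the function name and toy colors are illustrative.

```python
def alpha_blend(colors, alphas):
    """Front-to-back compositing with scalar alpha and transmittance per pixel."""
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light still passing through to deeper splats
    for color, alpha in zip(colors, alphas):
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel, transmittance


# A half-transparent red splat in front of a half-transparent blue one.
pixel, remaining = alpha_blend(
    colors=[(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
    alphas=[0.5, 0.5],
)
```

Because `transmittance` is a single number for the whole pixel, a front splat that covers only part of the pixel still attenuates everything behind it uniformly, which is the limitation the abstract points to.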
[CV-72] A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models
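[Quick read]: The paper addresses the significant inference cost of discrete diffusion-based multimodal large language models (dMLLMs), caused by full-sequence attention in every denoising step. The key is a systematic analysis of how visual token redundancy evolves and how it affects model responses and efficiency. The study finds that visual redundancy emerges only in from-scratch dMLLMs handling long-answer tasks, and that only such models can progressively recover, in late denoising steps, the non-negligible information lost to pruning. On this basis it recommends architecture-specific strategies: layer-skipping for dMLLMs adapted from autoregressive models, and progressive or late-step pruning for from-scratch dMLLMs.
Link: https://arxiv.org/abs/2511.15098
Authors: Duo Li, Zuhao Yang, Xiaoqin Zhang, Ling Shao, Shijian Lu
Affiliations: CCDS, NTU, Singapore; CCST, ZJUT, China; Terminus AI Lab, UCAS, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 2 figures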
Abstract:Discrete diffusion-based multimodal large language models (dMLLMs) have emerged as a promising alternative to autoregressive MLLMs thanks to their advantages in parallel decoding and bidirectional context modeling, but most existing dMLLMs incur significant computational overhead during inference due to the full-sequence attention computation in each denoising step. Pioneering studies attempt to resolve this issue from a modality-agnostic perspective via key-value cache optimization or efficient sampling, but most of them overlook modality-specific visual token redundancy. In this work, we conduct a comprehensive study on how visual token redundancy evolves with different dMLLM architectures and tasks and how visual token pruning affects dMLLM responses and efficiency. Specifically, our study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks. In addition, we validate that visual token pruning introduces non-negligible information loss in dMLLMs and only from-scratch dMLLMs can recover the lost information progressively during late denoising steps. Furthermore, our study shows that layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs. Overall, this work offers a new perspective on efficiency optimization for dMLLMs, greatly advancing their applicability across various multimodal understanding tasks.
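A small sketch may help picture what late-step visual token pruning means in practice. It assumes each visual token already has an importance score (for example, attention received from text tokens) and that pruning starts only in the second half of the denoising steps. The function names, the top-k rule, and the linear schedule are hypothetical and not taken from the paper.

```python
def prune_visual_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring visual tokens, preserving their original order."""
    keep = max(1, round(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(top)]


def keep_ratio_at(step, total_steps, final_ratio=0.5, start_frac=0.5):
    """Hypothetical late-step schedule: no pruning early, linear ramp to final_ratio."""
    start = int(total_steps * start_frac)
    if step < start:
        return 1.0
    progress = (step - start + 1) / (total_steps - start)
    return 1.0 - (1.0 - final_ratio) * progress


tokens = ["v0", "v1", "v2", "v3"]
scores = [0.1, 0.7, 0.05, 0.4]  # assumed importance scores, one per visual token

# Last of 8 denoising steps (0-indexed): the schedule has reached final_ratio.
kept = prune_visual_tokens(tokens, scores, keep_ratio=keep_ratio_at(7, 8))
```

Fewer visual tokens in the sequence means less work for the full-sequence attention at each remaining step, which is where the abstract locates the inference cost.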
[CV-73] Jointly Conditioned Diffusion Model for Multi-View Pose-Guided Person Image Synthesis
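[Quick read]: The paper addresses the limits on pose-guided person image generation caused by incomplete textures from single reference views and the absence of explicit cross-view interaction. The key to the solution is the jointly conditioned diffusion model (JCDM), built from two modules: the appearance prior module (APM), which infers a holistic identity-preserving prior from incomplete references; and the joint conditional injection (JCI) mechanism, which fuses multi-view cues and injects shared conditioning into the denoising backbone to align identity, color, and texture across poses. The method supports a variable number of reference views and integrates with standard diffusion backbones through minimal, targeted architectural changes; the authors report state-of-the-art fidelity and cross-view consistency.
Link: https://arxiv.org/abs/2511.15092
Authors: Chengyu Xie, Zhi Gong, Junchi Ren, Linkun Yu, Si Shen, Fei Shen, Xiaoyu Du
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: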
Abstract:Pose-guided human image generation is limited by incomplete textures from single reference views and the absence of explicit cross-view interaction. We present the jointly conditioned diffusion model (JCDM), a diffusion framework that exploits multi-view priors. The appearance prior module (APM) infers a holistic identity-preserving prior from incomplete references, and the joint conditional injection (JCI) mechanism fuses multi-view cues and injects shared conditioning into the denoising backbone to align identity, color, and texture across poses. JCDM supports a variable number of reference views and integrates with standard diffusion backbones with minimal and targeted architectural modifications. Experiments demonstrate state-of-the-art fidelity and cross-view consistency.
[CV-74] BBox DocVQA: A Large Scale Bounding Box Grounded Dataset for Enhancing Reasoning in Document Visual Question Answer
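[Quick read]: The paper addresses the coarse spatial grounding of current Document Visual Question Answering (DocVQA) datasets, which are mostly annotated at the page level and lack fine-grained bounding-box alignment, limiting the spatial reasoning and interpretability of Vision Language Models (VLMs). The key to the solution is BBox DocVQA, a large-scale bounding-box-grounded DocVQA dataset, together with an automated construction pipeline, "Segment Judge and Generate", which combines a region segmentation model, a VLM for semantic judgment, and an advanced VLM for question-answer generation, followed by human verification. The dataset contains 3.6K diverse documents and 32K QA pairs, covers single- and multi-region as well as single- and multi-page scenarios, and grounds every QA instance on explicit bounding boxes. Experiments reported by the authors show that fine-tuning on it substantially improves bounding box localization and answer generation.
Link: https://arxiv.org/abs/2511.15090
Authors: Wenhan Yu, Wang Chen, Guanqiang Qi, Weikang Li, Yang Li, Lei Sha, Deguo Xia, Jizhou Huang
Affiliations: Baidu Inc.; Beihang University; The University of Hong Kong; Peking University
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 4 figures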
Abstract:Document Visual Question Answering (DocVQA) is a fundamental task for multimodal document understanding and a key testbed for vision-language reasoning. However, most existing DocVQA datasets are limited to the page level and lack fine-grained spatial grounding, constraining the interpretability and reasoning capability of Vision Language Models (VLMs). To address this gap, we introduce BBox DocVQA, a large-scale, bounding-box-grounded dataset designed to enhance spatial reasoning and evidence localization in visual documents. We further present an automated construction pipeline, Segment Judge and Generate, which integrates a segment model for region segmentation, a VLM for semantic judgment, and another advanced VLM for question-answer generation, followed by human verification for quality assurance. The resulting dataset contains 3.6K diverse documents and 32K QA pairs, encompassing single- and multi-region as well as single- and multi-page scenarios. Each QA instance is grounded on explicit bounding boxes, enabling fine-grained evaluation of spatial semantic alignment. Benchmarking multiple state-of-the-art VLMs (e.g., GPT-5, Qwen2.5-VL, and InternVL) on BBox DocVQA reveals persistent challenges in spatial grounding and reasoning accuracy. Furthermore, fine-tuning on BBox DocVQA substantially improves both bounding box localization and answer generation, validating its effectiveness for enhancing the reasoning ability of VLMs. Our dataset and code will be publicly released to advance research on interpretable and spatially grounded vision-language reasoning.
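Bounding box localization is commonly scored with intersection over union (IoU). The abstract does not say which localization metric the benchmark uses, so the sketch below is only an assumption about one plausible choice; the box format `(x1, y1, x2, y2)` and the example coordinates are illustrative.

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # zero when boxes do not overlap
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


predicted = (0, 0, 10, 10)   # region a model points to as evidence
reference = (5, 0, 15, 10)   # annotated evidence region
iou = box_iou(predicted, reference)
```

A prediction would then count as correctly grounded when its IoU with the annotated box exceeds a chosen threshold, with 0.5 being a frequent convention in detection benchmarks.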
[CV-75] TiCAL: Typicality-Based Consistency-Aware Learning for Multimodal Emotion Recognition
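[Quick read]: The paper addresses inter-modal emotion conflicts in Multimodal Emotion Recognition (MER): different modalities of the same sample (visual, auditory, textual) may express divergent emotional tendencies, while existing methods supervise training with a single unified emotion label and overlook the harm this inconsistency does to model performance. The key to the solution is Typicality-based Consistent-aware Multimodal Emotion Recognition (TiCAL), whose core mechanisms are: 1) dynamically assessing the modality consistency of each sample using pseudo unimodal emotion labels and a typicality estimate; 2) embedding features in hyperbolic space to better capture fine-grained differences among emotion categories; 3) incorporating the consistency estimates into training, which improves accuracy on samples with high modality inconsistency. On benchmarks such as CMU-MOSEI and MER2023, the authors report an improvement of about 2.6% over the state-of-the-art DMD.
Link: https://arxiv.org/abs/2511.15085
Authors: Wen Yin, Siyu Zhan, Cencen Liu, Xin Hu, Guiduo Duan, Xiurui Xie, Yuan-Fang Li, Tao He
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 5 figures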
Abstract:Multimodal Emotion Recognition (MER) aims to accurately identify human emotional states by integrating heterogeneous modalities such as visual, auditory, and textual data. Existing approaches predominantly rely on unified emotion labels to supervise model training, often overlooking a critical challenge: inter-modal emotion conflicts, wherein different modalities within the same sample may express divergent emotional tendencies. In this work, we address this overlooked issue by proposing a novel framework, Typicality-based Consistent-aware Multimodal Emotion Recognition (TiCAL), inspired by the stage-wise nature of human emotion perception. TiCAL dynamically assesses the consistency of each training sample by leveraging pseudo unimodal emotion labels alongside a typicality estimation. To further enhance emotion representation, we embed features in a hyperbolic space, enabling the capture of fine-grained distinctions among emotional categories. By incorporating consistency estimates into the learning process, our method improves model performance, particularly on samples exhibiting high modality inconsistency. Extensive experiments on benchmark datasets, e.g., CMU-MOSEI and MER2023, validate the effectiveness of TiCAL in mitigating inter-modal emotional conflicts and enhancing overall recognition accuracy, with an improvement of about 2.6% over the state-of-the-art DMD.
[CV-76] MambaTrack3D: A State Space Model Framework for LiDAR-Based Object Tracking under High Temporal Variation
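[Quick read]: The paper addresses the challenge that high temporal variation (HTV) in dynamic outdoor environments poses for 3D single object tracking in LiDAR point clouds, in particular the quadratic computational complexity, temporal redundancy, and insufficient use of geometric priors common to existing memory-based trackers. The key to the solution is the MambaTrack3D framework, with two core innovations: 1) a Mamba-based Inter-frame Propagation (MIP) module, built on the state space model Mamba, that replaces conventional single-frame feature extraction with inter-frame propagation at near-linear complexity while explicitly modeling spatial relations across historical frames; 2) a Grouped Feature Enhancement Module (GFEM) that separates foreground and background semantics at the channel level, mitigating temporal redundancy in the memory bank. The authors report that it outperforms both HTV-oriented and normal-scenario trackers on KITTI-HTV and nuScenes-HTV and remains competitive with state-of-the-art methods on the standard KITTI dataset.
Link: https://arxiv.org/abs/2511.15077
Authors: Shengjing Tian, Yinan Han, Xiantong Zhao, Xuehu Liu, Qi Lang
Affiliations: China University of Mining and Technology; Dalian University of Technology; School of Mathematical Sciences, Dalian University of Technology; Northeast Normal University; Wuhan University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to a journal for possible publication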
Abstract:Dynamic outdoor environments with high temporal variation (HTV) pose significant challenges for 3D single object tracking in LiDAR point clouds. Existing memory-based trackers often suffer from quadratic computational complexity, temporal redundancy, and insufficient exploitation of geometric priors. To address these issues, we propose MambaTrack3D, a novel HTV-oriented tracking framework built upon the state space model Mamba. Specifically, we design a Mamba-based Inter-frame Propagation (MIP) module that replaces conventional single-frame feature extraction with efficient inter-frame propagation, achieving near-linear complexity while explicitly modeling spatial relations across historical frames. Furthermore, a Grouped Feature Enhancement Module (GFEM) is introduced to separate foreground and background semantics at the channel level, thereby mitigating temporal redundancy in the memory bank. Extensive experiments on KITTI-HTV and nuScenes-HTV benchmarks demonstrate that MambaTrack3D consistently outperforms both HTV-oriented and normal-scenario trackers, achieving improvements of up to 6.5 success and 9.5 precision over HVTrack under moderate temporal gaps. On the standard KITTI dataset, MambaTrack3D remains highly competitive with state-of-the-art normal-scenario trackers, confirming its strong generalization ability. Overall, MambaTrack3D achieves a superior accuracy-efficiency trade-off, delivering robust performance across both specialized HTV and conventional tracking scenarios.
[CV-77] Deep Pathomic Learning Defines Prognostic Subtypes and Molecular Drivers in Colorectal Cancer
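[Quick read]: This paper addresses imprecise prognostic stratification in colorectal cancer (CRC), which stems from the disease's high heterogeneity; the conventional TNM staging system falls short of what personalized medicine requires. The core solution is TDAM-CRC, a multiple instance learning model built and validated on histopathological whole-slide images (WSIs). On both the TCGA discovery cohort (n=581) and an independent external validation cohort (n=1031), it predicts prognosis better than conventional clinical staging and several state-of-the-art models. The key innovation is the integration of multi-omics data to improve interpretability, which identifies Mitochondrial Ribosomal Protein L37 (MRPL37) as a hub gene linking deep pathomic features to clinical prognosis: its high expression, driven by promoter hypomethylation, is an independent marker of favorable prognosis. Finally, a nomogram combining the TDAM-CRC risk score with clinical factors provides a precise, interpretable decision-making tool for CRC patients.
Link: https://arxiv.org/abs/2511.15067
Authors: Zisong Wang, Xuanyu Wang, Hang Chen, Haizhou Wang, Yuxin Chen, Yihang Xu, Yunhe Yuan, Lihuan Luo, Xitong Ling, Xiaoping Liu
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Genomics (q-bio.GN)
Comments: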
Abstract:Precise prognostic stratification of colorectal cancer (CRC) remains a major clinical challenge due to its high heterogeneity. The conventional TNM staging system is inadequate for personalized medicine. We aimed to develop and validate a novel multiple instance learning model TDAM-CRC using histopathological whole-slide images for accurate prognostic prediction and to uncover its underlying molecular mechanisms. We trained the model on the TCGA discovery cohort (n=581), validated it in an independent external cohort (n=1031), and further we integrated multi-omics data to improve model interpretability and identify novel prognostic biomarkers. The results demonstrated that the TDAM-CRC achieved robust risk stratification in both cohorts. Its predictive performance significantly outperformed the conventional clinical staging system and multiple state-of-the-art models. The TDAM-CRC risk score was confirmed as an independent prognostic factor in multivariable analysis. Multi-omics analysis revealed that the high-risk subtype is closely associated with metabolic reprogramming and an immunosuppressive tumor microenvironment. Through interaction network analysis, we identified and validated Mitochondrial Ribosomal Protein L37 (MRPL37) as a key hub gene linking deep pathomic features to clinical prognosis. We found that high expression of MRPL37, driven by promoter hypomethylation, serves as an independent biomarker of favorable prognosis. Finally, we constructed a nomogram incorporating the TDAM-CRC risk score and clinical factors to provide a precise and interpretable clinical decision-making tool for CRC patients. Our AI-driven pathological model TDAM-CRC provides a robust tool for improved CRC risk stratification, reveals new molecular targets, and facilitates personalized clinical decision-making.
[CV-78] BokehFlow: Depth-Free Controllable Bokeh Rendering via Flow Matching
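[Quick read]: This paper tackles controllable bokeh rendering without depth input. Existing methods mostly depend on accurate depth maps to control the focus region and blur strength, while generative approaches tend to suffer from limited controllability and low efficiency. The key to the solution is BokehFlow, a framework based on flow matching that synthesizes photorealistic bokeh directly from all-in-focus images with no depth information. A cross-attention mechanism provides semantic control over the focus region and blur intensity through text prompts, yielding high-quality, efficient, and controllable bokeh rendering without depth input.
Link: https://arxiv.org/abs/2511.15066
Authors: Yachuan Huang, Xianrui Luo, Qiwen Wang, Liao Shen, Jiaqi Li, Huiqiang Sun, Zihao Huang, Wei Jiang, Zhiguo Cao
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: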
Abstract:Bokeh rendering simulates the shallow depth-of-field effect in photography, enhancing visual aesthetics and guiding viewer attention to regions of interest. Although recent approaches perform well, rendering controllable bokeh without additional depth inputs remains a significant challenge. Existing classical and neural controllable methods rely on accurate depth maps, while generative approaches often struggle with limited controllability and efficiency. In this paper, we propose BokehFlow, a depth-free framework for controllable bokeh rendering based on flow matching. BokehFlow directly synthesizes photorealistic bokeh effects from all-in-focus images, eliminating the need for depth inputs. It employs a cross-attention mechanism to enable semantic control over both focus regions and blur intensity via text prompts. To support training and evaluation, we collect and synthesize four datasets. Extensive experiments demonstrate that BokehFlow achieves visually compelling bokeh effects and offers precise control, outperforming existing depth-dependent and generative methods in both rendering quality and efficiency.
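The abstract names flow matching but does not state the training objective. As a minimal sketch of the standard linear-path flow-matching target (not BokehFlow's actual formulation), the snippet below builds one training pair with NumPy. The function names, the array shapes, and the pairing of an all-in-focus image as source with a bokeh image as target are illustrative assumptions.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Return the interpolated sample x_t and the target velocity for a linear path.

    x0: source sample, x1: target sample, t: scalar in [0, 1].
    """
    x_t = (1.0 - t) * x0 + t * x1   # point on the straight line from x0 to x1
    v_target = x1 - x0              # constant velocity along that line
    return x_t, v_target

def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - v_target) ** 2))

rng = np.random.default_rng(0)
x0 = rng.random((4, 4, 3))   # stand-in for an all-in-focus image (hypothetical data)
x1 = rng.random((4, 4, 3))   # stand-in for a bokeh image (hypothetical data)
x_t, v = flow_matching_pair(x0, x1, t=0.25)
```

In a real model, a network would predict the velocity from `x_t`, `t`, and the conditioning signal, and `flow_matching_loss` would be minimized over random `t`.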
[CV-79] Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
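[Quick read]: This paper asks whether video models can perform spatial reasoning through video generation, i.e., whether "reasoning via video" is feasible and effective. The core challenge is how to use the explicit spatial layouts and temporal continuity inherent in video to support complex spatial planning and multi-step reasoning. The key contribution is VR-Bench, a benchmark of 7,920 procedurally generated videos covering five maze types and diverse visual styles, designed to systematically evaluate the spatial reasoning of video models. Experiments show that supervised fine-tuning (SFT) efficiently elicits reasoning ability in video models, which exhibit stronger spatial perception during reasoning, outperform leading vision-language models (VLMs), and generalize well across scenarios, tasks, and complexity levels. Diverse sampling at inference time further improves reliability by 10–20%, highlighting the potential and scalability of video generation for spatial reasoning.
Link: https://arxiv.org/abs/2511.15065
Authors: Cheng Yang, Haiyuan Wan, Yiran Peng, Xin Cheng, Zhaoyang Yu, Jiayi Zhang, Junchi Yu, Xinlei Yu, Xiawu Zheng, Dongzhan Zhou, Chenglin Wu
Affiliations: DeepWisdom; Tsinghua University; Shanghai Artificial Intelligence Laboratory; Renmin University of China; University of Oxford; National University of Singapore; Xiamen University; Hong Kong University of Science and Technology (Guangzhou)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: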
Abstract:Video models have achieved remarkable success in high-fidelity video generation with coherent motion dynamics. Analogous to the development from text generation to text-based reasoning in language modeling, the development of video models motivates us to ask: Can video models reason via video generation? Compared with the discrete text corpus, video grounds reasoning in explicit spatial layouts and temporal continuity, which serves as an ideal substrate for spatial reasoning. In this work, we explore the reasoning via video paradigm and introduce VR-Bench – a comprehensive benchmark designed to systematically evaluate video models’ reasoning capabilities. Grounded in maze-solving tasks that inherently require spatial planning and multi-step reasoning, VR-Bench contains 7,920 procedurally generated videos across five maze types and diverse visual styles. Our empirical analysis demonstrates that SFT can efficiently elicit the reasoning ability of video models. Video models exhibit stronger spatial perception during reasoning, outperforming leading VLMs and generalizing well across diverse scenarios, tasks, and levels of complexity. We further discover a test-time scaling effect, where diverse sampling during inference improves reasoning reliability by 10–20%. These findings highlight the unique potential and scalability of reasoning via video for spatial reasoning tasks.
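The test-time scaling effect can be made concrete with a back-of-the-envelope calculation. The sketch below assumes that samples succeed independently and that any one correct sample counts as a success. Both are simplifying assumptions of this illustration, not the paper's evaluation protocol, and `success_at_k` is a hypothetical helper name.

```python
def success_at_k(p_single, k):
    """Probability that at least one of k independent samples solves the maze."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must lie in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    # All k samples fail with probability (1 - p)^k; success is the complement.
    return 1.0 - (1.0 - p_single) ** k
```

Under these assumptions, drawing more diverse samples raises the chance of obtaining a correct trajectory, with diminishing returns as `k` grows.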
[CV-80] ProPL: Universal Semi-Supervised Ultrasound Image Segmentation via Prompt-Guided Pseudo-Labeling AAAI2026
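[Quick read]: This paper addresses the fact that existing ultrasound image segmentation methods are usually specialized for particular anatomical structures or tasks, which limits their general use in clinical practice. The key to the solution is the ProPL framework, which pairs a shared vision encoder with prompt-guided dual decoders. A prompting-upon-decoding mechanism enables flexible task adaptation, and an uncertainty-driven pseudo-label calibration (UPLC) module enables reliable self-training, so that large amounts of unlabeled data can be used to improve segmentation while multiple organs and tasks are handled in one model.
Link: https://arxiv.org/abs/2511.15057
Authors: Yaxiong Chen, Qicong Wang, Chunlei Li, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou
Affiliations: MedAI Technology (Wuxi) Co. Ltd.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2026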
Abstract:Existing approaches for the problem of ultrasound image segmentation, whether supervised or semi-supervised, are typically specialized for specific anatomical structures or tasks, limiting their practical utility in clinical settings. In this paper, we pioneer the task of universal semi-supervised ultrasound image segmentation and propose ProPL, a framework that can handle multiple organs and segmentation tasks while leveraging both labeled and unlabeled data. At its core, ProPL employs a shared vision encoder coupled with prompt-guided dual decoders, enabling flexible task adaptation through a prompting-upon-decoding mechanism and reliable self-training via an uncertainty-driven pseudo-label calibration (UPLC) module. To facilitate research in this direction, we introduce a comprehensive ultrasound dataset spanning 5 organs and 8 segmentation tasks. Extensive experiments demonstrate that ProPL outperforms state-of-the-art methods across various metrics, establishing a new benchmark for universal ultrasound image segmentation.
[CV-81] CellGenNet: A Knowledge-Distilled Framework for Robust Cell Segmentation in Cancer Tissues
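[Quick read]: This paper targets accurate nuclei segmentation in microscopy whole slide images (WSIs). Under variation in staining, imaging conditions, and tissue morphology, existing methods struggle to segment robustly and generalize when annotations are limited. The key to the solution is CellGenNet, a knowledge distillation framework. A higher-capacity teacher model is trained on sparse annotations and generates soft pseudo-labels; the student is trained with a joint objective that combines ground-truth labels, the teacher's probabilistic targets, and a hybrid loss of binary cross-entropy and Tversky loss, which mitigates class imbalance and preserves minority nuclear structures. Consistency regularization and layerwise dropout stabilize the feature representations and promote reliable knowledge transfer, improving segmentation accuracy and generalization across tissue types.
Link: https://arxiv.org/abs/2511.15054
Authors: Srijan Ray, Bikesh K. Nirala, Jason T. Yustein, Sundaresh Ram
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 3 figures, Submitted to IEEE SSIAI 2026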
Abstract:Accurate nuclei segmentation in microscopy whole slide images (WSIs) remains challenging due to variability in staining, imaging conditions, and tissue morphology. We propose CellGenNet, a knowledge distillation framework for robust cross-tissue cell segmentation under limited supervision. CellGenNet adopts a student-teacher architecture, where a capacity teacher is trained on sparse annotations and generates soft pseudo-labels for unlabeled regions. The student is optimized using a joint objective that integrates ground-truth labels, teacher-derived probabilistic targets, and a hybrid loss function combining binary cross-entropy and Tversky loss, enabling asymmetric penalties to mitigate class imbalance and better preserve minority nuclear structures. Consistency regularization and layerwise dropout further stabilize feature representations and promote reliable feature transfer. Experiments across diverse cancer tissue WSIs show that CellGenNet improves segmentation accuracy and generalization over supervised and semi-supervised baselines, supporting scalable and reproducible histopathology analysis.
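The hybrid objective described above can be sketched with NumPy. This is an illustration of how binary cross-entropy and a Tversky term might be combined with teacher soft labels, not the authors' implementation. The weights `w_teacher` and `w_tversky`, the defaults `alpha=0.3` and `beta=0.7`, and all function names are assumptions.

```python
import numpy as np

def bce_loss(probs, targets, eps=1e-7):
    """Binary cross-entropy; targets may be hard (0/1) or soft (in [0, 1])."""
    p = np.clip(probs, eps, 1.0 - eps)
    return float(np.mean(-(targets * np.log(p) + (1.0 - targets) * np.log(1.0 - p))))

def tversky_loss(probs, targets, alpha=0.3, beta=0.7, eps=1e-7):
    """1 - Tversky index; alpha weights false positives, beta false negatives."""
    tp = np.sum(probs * targets)
    fp = np.sum(probs * (1.0 - targets))
    fn = np.sum((1.0 - probs) * targets)
    return float(1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps))

def hybrid_loss(probs, hard_labels, soft_labels, w_teacher=0.5, w_tversky=1.0):
    """Supervised BCE + Tversky on ground truth, plus BCE against teacher soft labels."""
    supervised = bce_loss(probs, hard_labels) + w_tversky * tversky_loss(probs, hard_labels)
    distill = bce_loss(probs, soft_labels)
    return supervised + w_teacher * distill
```

Choosing `beta > alpha` penalizes missed nuclei (false negatives) more than spurious ones, which is the asymmetric behavior the abstract attributes to the Tversky term.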
[CV-82] Hyperspectral Super-Resolution with Inter-Image Variability via Degradation-based Low-Rank and Residual Fusion Method
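[Quick read]: This paper addresses the degradation in hyperspectral (HSI) and multispectral (MSI) image fusion caused by spectral variability and spatially localized changes between the two images (together called inter-image variability), which arise from different acquisition conditions. Existing methods usually apply direct transformations to the images themselves, which can worsen the ill-posedness of the fusion model. The key to the solution is the Degradation-based Low-Rank and Residual Fusion (DLRRF) model. First, spectral variability is modeled as a change in the spectral degradation operator. Second, the target HSI is decomposed into low-rank and residual components to recover the details lost to spatially localized changes, with the residual component capturing those lost details. Both components are reduced in dimension by exploiting the spectral correlation within the images, and an implicit regularizer brings in spatial prior information. The model is solved with the Proximal Alternating Optimization (PAO) algorithm within a Plug-and-Play (PnP) framework.
Link: https://arxiv.org/abs/2511.15052
Authors: Yue Wen, Kunjing Yang, Minru Bai
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: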
Abstract:The fusion of hyperspectral image (HSI) with multispectral image (MSI) provides an effective way to enhance the spatial resolution of HSI. However, due to different acquisition conditions, there may exist spectral variability and spatially localized changes between HSI and MSI, referred to as inter-image variability, which can significantly affect the fusion performance. Existing methods typically handle inter-image variability by applying direct transformations to the images themselves, which can exacerbate the ill-posedness of the fusion model. To address this challenge, we propose a Degradation-based Low-Rank and Residual Fusion (DLRRF) model. First, we model the spectral variability as change in the spectral degradation operator. Second, to recover the lost spatial details caused by spatially localized changes, we decompose the target HSI into low rank and residual components, where the latter is used to capture the lost details. By exploiting the spectral correlation within the images, we perform dimensionality reduction on both components. Additionally, we introduce an implicit regularizer to utilize the spatial prior information from the images. The proposed DLRRF model is solved using the Proximal Alternating Optimization (PAO) algorithm within a Plug-and-Play (PnP) framework, where the subproblem regarding implicit regularizer is addressed by an external denoiser. We further provide a comprehensive convergence analysis of the algorithm. Finally, extensive numerical experiments demonstrate that DLRRF achieves superior performance in fusing HSI and MSI with inter-image variability.
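To make the low-rank plus residual idea concrete, the sketch below splits a matrix into a rank-limited part and whatever is left over, using a truncated SVD. This is only an illustration of the decomposition: DLRRF estimates both components by optimization inside a fusion model, which is not reproduced here. The `(bands, pixels)` layout and the function name are assumptions.

```python
import numpy as np

def low_rank_plus_residual(hsi, rank):
    """Split a (bands, pixels) matrix into a rank-`rank` part and a residual."""
    if rank < 1 or rank > min(hsi.shape):
        raise ValueError("rank must be between 1 and min(hsi.shape)")
    u, s, vt = np.linalg.svd(hsi, full_matrices=False)
    # Keep the leading singular triplets: the best rank-`rank` approximation.
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
    # Everything the low-rank part cannot express ends up in the residual.
    residual = hsi - low_rank
    return low_rank, residual
```

The two parts always sum back to the input, so localized detail that does not fit the low-rank structure is kept in the residual rather than discarded.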
[CV-83] UniHOI: Unified Human-Object Interaction Understanding via Unified Token Space AAAI2026
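[Quick read]: This paper addresses the incomplete interaction understanding that results from modeling human-object interaction (HOI) detection and generation separately. Handling the two tasks in isolation limits knowledge sharing and generalization. The key to the solution is UniHOI, which jointly models both tasks through a unified token space and introduces a symmetric interaction-aware attention module together with a unified semi-supervised learning paradigm. This enables effective bidirectional mapping between images and interaction semantics even with limited annotations, improving accuracy on long-tailed HOI detection by 4.9% and interaction metrics on open-vocabulary generation by 42.0%.
Link: https://arxiv.org/abs/2511.15046
Authors: Panqi Yang, Haodong Jing, Nanning Zheng, Yongqiang Ma
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026, 9 pages, 4 figures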
Abstract:In the field of human-object interaction (HOI), detection and generation are two dual tasks that have traditionally been addressed separately, hindering the development of comprehensive interaction understanding. To address this, we propose UniHOI, which jointly models HOI detection and generation via a unified token space, thereby effectively promoting knowledge sharing and enhancing generalization. Specifically, we introduce a symmetric interaction-aware attention module and a unified semi-supervised learning paradigm, enabling effective bidirectional mapping between images and interaction semantics even under limited annotations. Extensive experiments demonstrate that UniHOI achieves state-of-the-art performance in both HOI detection and generation. Specifically, UniHOI improves accuracy by 4.9% on long-tailed HOI detection and boosts interaction metrics by 42.0% on open-vocabulary generation tasks.
[CV-84] Computer Vision Modeling of the Development of Geometric and Numerical Concepts in Humans
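[Quick read]: This paper asks whether computer vision (CV) models can mirror the development of human mathematical cognition, i.e., whether they show "developmental alignment": performance improvements over training that match the developmental progressions observed in children. The approach is a detailed case study of ResNet-50 that tracks how its representations of geometric, topological, and numerical concepts change during training. Some concept classes (Euclidean Geometry, Geometrical Figures, Metric Properties, Topology) show developmental alignment while others (Chiral Figures, Geometric Transformations, Symmetrical Figures) do not, and a human-like "mental number line" representation emerges with experience. These findings suggest CV models are a promising computational tool for studying how mathematical understanding develops in humans and point to future work on additional architectures and larger benchmarks.
Link: https://arxiv.org/abs/2511.15029
Authors: Zekun Wang, Sashank Varma
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 7 figures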
Abstract:Mathematical thinking is a fundamental aspect of human cognition. Cognitive scientists have investigated the mechanisms that underlie our ability to think geometrically and numerically, to take two prominent examples, and developmental scientists have documented the trajectories of these abilities over the lifespan. Prior research has shown that computer vision (CV) models trained on the unrelated task of image classification nevertheless learn latent representations of geometric and numerical concepts similar to those of adults. Building on this demonstrated cognitive alignment, the current study investigates whether CV models also show developmental alignment: whether their performance improvements across training match the developmental progressions observed in children. In a detailed case study of the ResNet-50 model, we show that this is the case. For the case of geometry and topology, we find developmental alignment for some classes of concepts (Euclidean Geometry, Geometrical Figures, Metric Properties, Topology) but not others (Chiral Figures, Geometric Transformations, Symmetrical Figures). For the case of number, we find developmental alignment in the emergence of a human-like “mental number line” representation with experience. These findings show the promise of computer vision models for understanding the development of mathematical understanding in humans. They point the way to future research exploring additional model architectures and building larger benchmarks.
[CV-85] Complex-Valued 2D Gaussian Representation for Computer-Generated Holography
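[Quick read]: This paper addresses the large parameter search space, high memory use, and slow optimization of conventional computer-generated holography (CGH). The key to the solution is a new hologram representation based on structured complex-valued 2D Gaussian primitives, which replaces per-pixel information storage and reduces the parameter search space by up to 10:1. A differentiable rasterizer, integrated with a GPU-optimized free-space light propagation kernel, enables end-to-end training. Experiments show up to 2.5x lower VRAM usage and 50% faster optimization with higher-fidelity reconstructions than existing methods, and a conversion procedure adapts the representation to practical hologram formats (smooth and random phase-only holograms) while suppressing the noise artifacts seen in previous methods.
Link: https://arxiv.org/abs/2511.15022
Authors: Yicheng Zhan, Xiangjun Gao, Long Quan, Kaan Akşit
Affiliations: University College London; Hong Kong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: 8 pages, 11 figures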
Abstract:We propose a new hologram representation based on structured complex-valued 2D Gaussian primitives, which replaces per-pixel information storage and reduces the parameter search space by up to 10:1. To enable end-to-end training, we develop a differentiable rasterizer for our representation, integrated with a GPU-optimized light propagation kernel in free space. Our extensive experiments show that our method achieves up to 2.5x lower VRAM usage and 50% faster optimization while producing higher-fidelity reconstructions than existing methods. We further introduce a conversion procedure that adapts our representation to practical hologram formats, including smooth and random phase-only holograms. Our experiments show that this procedure can effectively suppress noise artifacts observed in previous methods. By reducing the hologram parameter search space, our representation enables a more scalable hologram estimation in the next-generation computer-generated holography systems.
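A rough idea of what rasterizing complex-valued 2D Gaussians means can be given in a few lines of NumPy. The sketch below sums a handful of Gaussians, each carrying an amplitude and a phase, into a complex field. The parameterization (mean, inverse covariance, amplitude, phase) and the function name are assumptions for illustration; the paper's rasterizer is differentiable and coupled to a light propagation kernel, neither of which is shown here.

```python
import numpy as np

def rasterize_complex_gaussians(height, width, means, inv_covs, amplitudes, phases):
    """Sum K complex-valued 2D Gaussians into a (height, width) complex field."""
    ys, xs = np.mgrid[0:height, 0:width]
    grid = np.stack([xs, ys], axis=-1).astype(np.float64)   # (H, W, 2) pixel coords as (x, y)
    field = np.zeros((height, width), dtype=np.complex128)
    for mu, inv_cov, amp, phi in zip(means, inv_covs, amplitudes, phases):
        d = grid - mu                                        # offset from the Gaussian centre
        mahal = np.einsum("hwi,ij,hwj->hw", d, inv_cov, d)   # squared Mahalanobis distance
        field += amp * np.exp(1j * phi) * np.exp(-0.5 * mahal)
    return field

# One Gaussian centred at x=3, y=2 with unit covariance, amplitude 2, phase pi/2.
field = rasterize_complex_gaussians(
    height=8, width=8,
    means=np.array([[3.0, 2.0]]),
    inv_covs=np.array([np.eye(2)]),
    amplitudes=np.array([2.0]),
    phases=np.array([np.pi / 2]),
)
```

Each primitive is described by a few numbers regardless of how many pixels it covers, which is the intuition behind a smaller search space than per-pixel storage.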
[CV-86] CKDA: Cross-modality Knowledge Disentanglement and Alignment for Visible-Infrared Lifelong Person Re-identification
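[Quick read]: This paper addresses Visible-Infrared Lifelong person Re-IDentification (VI-LReID), where acquiring modality-specific knowledge and preserving modality-common knowledge interfere with each other, and the conflicting knowledge leads to collaborative forgetting. Existing methods typically rely on cross-modal knowledge distillation to alleviate catastrophic forgetting of old knowledge but ignore this mutual interference. The key to the solution is Cross-modality Knowledge Disentanglement and Alignment (CKDA). A Modality-Common Prompting (MCP) module and a Modality-Specific Prompting (MSP) module explicitly disentangle and purify the discriminative information that is shared across modalities and that is specific to each, avoiding interference between the two kinds of knowledge. A Cross-modal Knowledge Alignment (CKA) module then aligns the disentangled new knowledge with the old in two mutually independent intra- and inter-modality feature spaces, based on dual-modality prototypes and in a balanced manner.
Link: https://arxiv.org/abs/2511.15016
Authors: Zhenyu Cui, Jiahuan Zhou, Yuxin Peng
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: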
Abstract:Lifelong person Re-IDentification (LReID) aims to match the same person employing continuously collected individual data from different scenarios. To achieve continuous all-day person matching across day and night, Visible-Infrared Lifelong person Re-IDentification (VI-LReID) focuses on sequential training on data from visible and infrared modalities and pursues average performance over all data. To this end, existing methods typically exploit cross-modal knowledge distillation to alleviate the catastrophic forgetting of old knowledge. However, these methods ignore the mutual interference of modality-specific knowledge acquisition and modality-common knowledge anti-forgetting, where conflicting knowledge leads to collaborative forgetting. To address the above problems, this paper proposes a Cross-modality Knowledge Disentanglement and Alignment method, called CKDA, which explicitly separates and preserves modality-specific knowledge and modality-common knowledge in a balanced way. Specifically, a Modality-Common Prompting (MCP) module and a Modality-Specific Prompting (MSP) module are proposed to explicitly disentangle and purify discriminative information that coexists and is specific to different modalities, avoiding the mutual interference between both knowledge. In addition, a Cross-modal Knowledge Alignment (CKA) module is designed to further align the disentangled new knowledge with the old one in two mutually independent inter- and intra-modality feature spaces based on dual-modality prototypes in a balanced manner. Extensive experiments on four benchmark datasets verify the effectiveness and superiority of our CKDA against state-of-the-art methods. The source code of this paper is available at this https URL.
[CV-87] FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR Evaluation
Link: https://arxiv.org/abs/2511.14998
Authors: Yueru He, Xueqing Peng, Yupeng Cao, Yan Wang, Lingfei Qian, Haohang Li, Yi Han, Ruoyu Xiang, Mingquan Lin, Prayag Tiwari, Jimin Huang, Guojun Xiong, Sophia Ananiadou
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Yueru He, Xueqing Peng: These two authors contributed equally to this work
[CV-88] Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
[Quick read]: This paper addresses the trade-offs among generation quality, training stability, and inference speed in high-resolution image and 10-second video generation. The solution is Kandinsky 5.0, a family of models trained with a multi-stage pipeline and organized into three line-ups: Kandinsky 5.0 Image Lite (image generation), Kandinsky 5.0 Video Lite (fast, lightweight text-to-video and image-to-video), and Kandinsky 5.0 Video Pro (higher-quality video generation). It combines a systematic data curation lifecycle (collection, processing, filtering, and clustering) with quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training, plus architectural, training, and inference optimizations, achieving high generation speed and strong performance in human evaluation.
Link: https://arxiv.org/abs/2511.14993
Authors: Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Viacheslav Vasilev, Alexey Letunovskiy, Maria Kovaleva, Nikolai Vaulin, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Nikita Kiselev, Alexander Varlamov, Dmitrii Mikhailov, Vladimir Polovnikov, Andrey Shutkin, Ilya Vasiliev, Julia Agafonova, Anastasiia Kargapoltseva, Anna Dmitrienko, Anastasia Maltseva, Anna Averchenkova, Olga Kim, Tatiana Nikulina, Denis Dimitrov
Affiliations: Kandinsky Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Website: this https URL
Abstract:This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core line-ups of models: Kandinsky 5.0 Image Lite - a line-up of 6B parameter image generation models, Kandinsky 5.0 Video Lite - fast and lightweight 2B parameter text-to-video and image-to-video models, and Kandinsky 5.0 Video Pro - 19B parameter models that achieve superior video generation quality. We provide a comprehensive review of the data curation lifecycle - including collection, processing, filtering and clustering - for the multi-stage training pipeline that involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages to be adapted for a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.
[CV-89] Logit-Based Losses Limit the Effectiveness of Feature Knowledge Distillation NEURIPS
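[Quick read]: This paper addresses a limitation of knowledge distillation (KD): feature KD methods conventionally combine intermediate-layer feature losses with logit-based losses such as cross-entropy, which the authors argue limits the effectiveness of feature distillation. The key to the solution is a framework that trains the student's backbone with feature-based losses exclusively, together with a knowledge quality metric, derived from the geometry of latent representations, that identifies which teacher layers provide the most effective knowledge for distillation. Across three image classification datasets and four student-teacher pairs spanning convolutional neural networks and vision transformers, the method yields top-1 accuracy gains of up to 15% over standard approaches, and the code is publicly released.
Link: https://arxiv.org/abs/2511.14981
Authors: Nicholas Cooper, Lijun Chen, Sailesh Dwivedy, Danna Gurari
Affiliations: University of Colorado Boulder
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: NeurIPS Workshop on Symmetry and Geometry in Neural Representations (NeurReps), December 2025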
Abstract:Knowledge distillation (KD) methods can transfer knowledge of a parameter-heavy teacher model to a light-weight student model. The status quo for feature KD methods is to utilize loss functions based on logits (i.e., pre-softmax class scores) and intermediate layer features (i.e., latent representations). Unlike previous approaches, we propose a feature KD framework for training the student’s backbone using feature-based losses exclusively (i.e., without logit-based losses such as cross entropy). Leveraging recent discoveries about the geometry of latent representations, we introduce a knowledge quality metric for identifying which teacher layers provide the most effective knowledge for distillation. Experiments on three image classification datasets with four diverse student-teacher pairs, spanning convolutional neural networks and vision transformers, demonstrate our KD method achieves state-of-the-art performance, delivering top-1 accuracy boosts of up to 15% over standard approaches. We publicly share our code to facilitate future work at this https URL.
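[CV-90] EGSA-PT: Edge-Guided Spatial Attention with Progressive Training for Monocular Depth Estimation and Segmentation of Transparent Objects
[Quick read]: This paper addresses transparent object perception, where transparency confounds both depth estimation and semantic segmentation. Multi-task learning has been explored to improve robustness, but negative cross-task interactions often hinder performance. The key to the solution is Edge-Guided Spatial Attention (EGSA), a fusion mechanism that incorporates boundary information into the fusion of semantic and geometric features to mitigate destructive cross-task interactions. In addition, a multi-modal progressive training strategy moves from edges derived from RGB images to edges derived from predicted depth maps, which removes the need for ground-truth depth at training time and improves depth accuracy, most of all in transparent regions, while keeping segmentation performance competitive.
Link: https://arxiv.org/abs/2511.14970
Authors: Gbenga Omotara, Ramy Farag, Seyed Mohamad Ali Tousi, G.N. DeSouza
Affiliations: University of Missouri
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: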
Abstract:Transparent object perception remains a major challenge in computer vision research, as transparency confounds both depth estimation and semantic segmentation. Recent work has explored multi-task learning frameworks to improve robustness, yet negative cross-task interactions often hinder performance. In this work, we introduce Edge-Guided Spatial Attention (EGSA), a fusion mechanism designed to mitigate destructive interactions by incorporating boundary information into the fusion between semantic and geometric features. On both Syn-TODD and ClearPose benchmarks, EGSA consistently improved depth accuracy over the current state of the art method (MODEST), while preserving competitive segmentation performance, with the largest improvements appearing in transparent regions. Besides our fusion design, our second contribution is a multi-modal progressive training strategy, where learning transitions from edges derived from RGB images to edges derived from predicted depth images. This approach allows the system to bootstrap learning from the rich textures contained in RGB images, and then switch to more relevant geometric content in depth maps, while it eliminates the need for ground-truth depth at training time. Together, these contributions highlight edge-guided fusion as a robust approach capable of improving transparent object perception.
[CV-91] Knowledge Graphs as Structured Memory for Embedding Spaces: From Training Clusters to Explainable Inference
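[Quick read]: This paper addresses the difficulty of balancing local evidence with global consistency in non-parametric learning, especially when samples are scarce, where methods such as k-nearest neighbors (kNN) and Label Spreading tend to be poorly calibrated and to produce rough decision boundaries. The key to the solution is Graph Memory (GM), a structured non-parametric framework that summarizes the training instances in an embedding space into prototype nodes annotated with reliability indicators and connected by edges that encode geometric and contextual relations. This unifies instance retrieval, prototype-based reasoning, and graph-based label propagation in one model that supports efficient inference and explanation, with better calibration and smoother decision boundaries from far fewer samples.
Link: https://arxiv.org/abs/2511.14961
Authors: Artur A. Oliveira, Mateus Espadoto, Roberto M. Cesar Jr., Roberto Hirata Jr
Affiliations: University of São Paulo
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to GRIVAPP 2026 (21st International Conference on Computer Graphics, Interaction, Visualization Theory and Applications), Marbella, Spain, March 9-11 2026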
Abstract:We introduce Graph Memory (GM), a structured non-parametric framework that augments embedding-based inference with a compact, relational memory over region-level prototypes. Rather than treating each training instance in isolation, GM summarizes the embedding space into prototype nodes annotated with reliability indicators and connected by edges that encode geometric and contextual relations. This design unifies instance retrieval, prototype-based reasoning, and graph-based label propagation within a single inductive model that supports both efficient inference and faithful explanation. Experiments on synthetic and real datasets including breast histopathology (IDC) show that GM achieves accuracy competitive with kNN and Label Spreading while offering substantially better calibration and smoother decision boundaries, all with an order of magnitude fewer samples. By explicitly modeling reliability and relational structure, GM provides a principled bridge between local evidence and global consistency in non-parametric learning.
[CV-92] Artificial intelligence approaches for energy-efficient laser cutting machines
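[Quick read]: This paper addresses the high energy consumption and environmental impact of laser cutting, in particular the energy wasted by the open-loop control of CO2 laser suction pumps. The key to the solution is a deep learning (DL)-based closed-loop control system that dynamically adjusts the suction pump's power. Materials are classified either by lens-less speckle sensing with a customized Convolutional Neural Network (CNN) or by a USB camera with transfer learning from the pre-trained VGG16 model, and a separate DL model detects the smoke level. Together these halt the pump during inactive times and adjust its power during operation; experiments show a 20% to 50% reduction in the suction pump's energy consumption.
Link: https://arxiv.org/abs/2511.14952
Authors: Mohamed Abdallah Salem, Hamdy Ahmed Ashour, Ahmed Elshenawy
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: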
Abstract:This research addresses the significant challenges of energy consumption and environmental impact in laser cutting by proposing novel deep learning (DL) methodologies to achieve energy reduction. Recognizing the current lack of adaptive control and the open-loop nature of CO2 laser suction pumps, this study utilizes closed-loop configurations that dynamically adjust pump power based on both the material being cut and the smoke level generated. To implement this adaptive system, diverse material classification methods are introduced, including techniques leveraging lens-less speckle sensing with a customized Convolutional Neural Network (CNN) and an approach using a USB camera with transfer learning via the pre-trained VGG16 CNN model. Furthermore, a separate DL model for smoke level detection is employed to simultaneously refine the pump’s power output. This integration prompts the exhaust suction pump to automatically halt during inactive times and dynamically adjust power during operation, leading to experimentally proven and remarkable energy savings, with results showing a 20% to 50% reduction in the smoke suction pump’s energy consumption, thereby contributing substantially to sustainable development in the manufacturing sector.
[CV-93] RocSync: Millisecond-Accurate Temporal Synchronization for Heterogeneous Camera Systems
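[Quick read]: This paper addresses precise temporal alignment of multi-view video streams in heterogeneous camera systems (professional and consumer-grade devices, visible and infrared sensors, systems with and without audio), which matters for dynamic-scene applications such as multi-view 3D reconstruction, pose estimation, and scene understanding. The key to the solution is a low-cost, general-purpose synchronization method built around a custom "LED Clock" that encodes time through red and infrared LEDs, so that the start and end of the exposure window can be decoded visually from recorded frames, giving millisecond-level alignment. The method outperforms light-, audio-, and timecode-based synchronization and improves downstream computer vision tasks.
Link: https://arxiv.org/abs/2511.14948
Authors: Jaro Meyer, Frédéric Giraud, Joschua Wüthrich, Marc Pollefeys, Philipp Fürnstahl, Lilian Calvet
Affiliations: ETH Zurich; Research in Orthopedic Computer Science, Balgrist University Hospital, University of Zurich
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 6 figures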
Abstract:Accurate spatiotemporal alignment of multi-view video streams is essential for a wide range of dynamic-scene applications such as multi-view 3D reconstruction, pose estimation, and scene understanding. However, synchronizing multiple cameras remains a significant challenge, especially in heterogeneous setups combining professional and consumer-grade devices, visible and infrared sensors, or systems with and without audio, where common hardware synchronization capabilities are often unavailable. This limitation is particularly evident in real-world environments, where controlled capture conditions are not feasible. In this work, we present a low-cost, general-purpose synchronization method that achieves millisecond-level temporal alignment across diverse camera systems while supporting both visible (RGB) and infrared (IR) modalities. The proposed solution employs a custom-built LED Clock that encodes time through red and infrared LEDs, allowing visual decoding of the exposure window (start and end times) from recorded frames for millisecond-level synchronization. We benchmark our method against hardware synchronization and achieve a residual error of 1.34 ms RMSE across multiple recordings. In further experiments, our method outperforms light-, audio-, and timecode-based synchronization approaches and directly improves downstream computer vision tasks, including multi-view pose estimation and 3D reconstruction. Finally, we validate the system in large-scale surgical recordings involving over 25 heterogeneous cameras spanning both IR and RGB modalities. This solution simplifies and streamlines the synchronization pipeline and expands access to advanced vision-based sensing in unconstrained environments, including industrial and clinical applications.
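The abstract says the exposure window can be read from the LEDs visible in a frame but does not describe the decoding. The sketch below assumes a hypothetical layout: a ring of LEDs in which exactly one LED is lit at any moment and the lit position advances once per tick, so a frame shows a contiguous arc of LEDs covering its exposure. The layout, the tick length, and the function name are assumptions of this illustration, not the device's documented design.

```python
def decode_exposure_window(lit, tick_ms=1.0):
    """Return (start_ms, end_ms) of the exposure within one ring revolution.

    lit: sequence of booleans, one per ring LED, True if the LED appears lit.
    """
    n = len(lit)
    count = sum(lit)
    if count == 0 or count == n:
        raise ValueError("arc is ambiguous: no LED or every LED appears lit")
    # The arc starts at the lit LED whose circular predecessor is dark.
    starts = [i for i in range(n) if lit[i] and not lit[i - 1]]
    if len(starts) != 1:
        raise ValueError("lit LEDs do not form a single contiguous arc")
    start_ms = starts[0] * tick_ms
    end_ms = (starts[0] + count) * tick_ms   # exceeds the ring period when the arc wraps
    return start_ms, end_ms
```

The result is only known modulo one ring revolution; a coarser counter would be needed to place the frame on an absolute timeline.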
[CV-94] Unsupervised Discovery of Long-Term Spatiotemporal Periodic Workflows in Human Activities WACV2026
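[Quick read]: This paper addresses the difficulty of identifying and modeling long-term periodic workflows, which occur in manufacturing, sports, and daily life and are hard to detect because of their low-contrast patterns. Existing work focuses mainly on short-term periodic activities with simple structure and distinctive features, whereas long-term workflows are complex and implicit and lack both systematic evaluation benchmarks and efficient modeling methods. The key to the solution is the first benchmark of 580 multimodal human activity sequences, together with a lightweight, training-free baseline that handles three tasks tied to real-world applications in a unified way: unsupervised periodic workflow detection, task completion tracking, and procedural anomaly detection. Experiments show that the baseline substantially outperforms existing methods on all tasks and, in real deployments, is on par with traditional supervised methods while removing the need for annotation and retraining.
Link: https://arxiv.org/abs/2511.14945
Authors: Fan Yang, Quanting Xie, Atsunori Moteki, Shoichi Masui, Shan Jiang, Yonatan Bisk, Graham Neubig
Affiliations: Fujitsu Research of America; Fujitsu Limited; Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted to WACV 2026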
Abstract:Periodic human activities with implicit workflows are common in manufacturing, sports, and daily life. While short-term periodic activities – characterized by simple structures and high-contrast patterns – have been widely studied, long-term periodic workflows with low-contrast patterns remain largely underexplored. To bridge this gap, we introduce the first benchmark comprising 580 multimodal human activity sequences featuring long-term periodic workflows. The benchmark supports three evaluation tasks aligned with real-world applications: unsupervised periodic workflow detection, task completion tracking, and procedural anomaly detection. We also propose a lightweight, training-free baseline for modeling diverse periodic workflow patterns. Experiments show that: (i) our benchmark presents significant challenges to both unsupervised periodic detection methods and zero-shot approaches based on powerful large language models (LLMs); (ii) our baseline outperforms competing methods by a substantial margin in all evaluation tasks; and (iii) in real-world applications, our baseline demonstrates deployment advantages on par with traditional supervised workflow detection approaches, eliminating the need for annotation and retraining. Our project page is this https URL.
[CV-95] CPSL: Representing Volumetric Video via Content-Promoted Scene Layers
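[Quick read]: This paper addresses the high capture, computation, and rendering costs of existing volumetric video representations, which limit their scalability for on-demand video and their feasibility for real-time communication. The key to the solution is Content-Promoted Scene Layers (CPSL), a compact 2.5D video representation: guided by per-frame depth maps and content saliency, each frame is decomposed into a small number of geometry-consistent layers, each equipped with soft alpha bands and an edge-depth cache that jointly preserve occlusion ordering and boundary continuity. This structure supports parallax-corrected novel-view synthesis through depth-weighted warping and front-to-back alpha compositing, avoiding expensive 3D reconstruction. Combined with motion-guided propagation and per-layer encoding for inter-frame coherence, it allows real-time playback with standard video codecs while substantially reducing storage and rendering cost.
Link: https://arxiv.org/abs/2511.14927
Authors: Kaiyuan Hu, Yili Jin, Junhua Liu, Xize Duan, Hong Kang, Xue Liu
Affiliations: McGill University; University of Southern California; Mohamed bin Zayed University of Artificial Intelligence; The Chinese University of Hong Kong, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: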
Abstract:Volumetric video enables immersive and interactive visual experiences by supporting free viewpoint exploration and realistic motion parallax. However, existing volumetric representations, from explicit point clouds to implicit neural fields, remain costly in capture, computation, and rendering, which limits their scalability for on-demand video and reduces their feasibility for real-time communication. To bridge this gap, we propose Content-Promoted Scene Layers (CPSL), a compact 2.5D video representation that brings the perceptual benefits of volumetric video to conventional 2D content. Guided by per-frame depth and content saliency, CPSL decomposes each frame into a small set of geometry-consistent layers equipped with soft alpha bands and an edge-depth cache that jointly preserve occlusion ordering and boundary continuity. These lightweight, 2D-encodable assets enable parallax-corrected novel-view synthesis via depth-weighted warping and front-to-back alpha compositing, bypassing expensive 3D reconstruction. Temporally, CPSL maintains inter-frame coherence using motion-guided propagation and per-layer encoding, supporting real-time playback with standard video codecs. Across multiple benchmarks, CPSL achieves superior perceptual quality and boundary fidelity compared with layer-based and neural-field baselines while reducing storage and rendering cost severalfold. Our approach offers a practical path from 2D video to scalable 2.5D immersive media.
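To make the front-to-back alpha compositing step mentioned in the abstract more concrete, here is a minimal sketch for a single pixel. It is the generic textbook formulation, not CPSL's implementation; the function name `composite_front_to_back` and the `(color, alpha)` tuple layout are assumptions made for this example.

```python
def composite_front_to_back(layers):
    """Composite (color, alpha) samples ordered from nearest to farthest.

    color is a grayscale value in [0, 1]; alpha is opacity in [0, 1].
    Returns (composited_color, accumulated_alpha).
    Hypothetical helper for illustration only.
    """
    out_color = 0.0
    transmittance = 1.0  # fraction of light not yet blocked by nearer layers
    for color, alpha in layers:
        out_color += transmittance * alpha * color
        transmittance *= 1.0 - alpha
        if transmittance <= 0.0:
            break  # fully occluded: farther layers cannot contribute
    return out_color, 1.0 - transmittance


# A half-transparent white layer in front of an opaque black layer.
print(composite_front_to_back([(1.0, 0.5), (0.0, 1.0)]))
```

Processing layers nearest first allows the loop to stop as soon as the pixel is fully opaque, which is one reason this ordering is popular for layered representations.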
[CV-96] X-WIN: Building Chest Radiograph World Model via Predictive Sensing
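[Quick read]: This paper addresses the fact that chest X-ray radiographs (CXR), as 2D projection images, cannot accurately capture 3D anatomy because of structural superposition, which limits representation learning and diagnostic accuracy. The key to the solution is X-WIN, a novel CXR world model that distills volumetric knowledge by learning to predict 2D projections of chest computed tomography (CT) in latent space. The core idea is to build a world model that internalizes knowledge of 3D anatomical structure and can predict CXRs under different 3D transformations. To strengthen the association of information across views, the authors introduce an affinity-guided contrastive alignment loss, and they combine masked image modeling with a domain classifier so that the model adapts to real CXR data and produces statistically similar representations for real and simulated CXRs, which improves downstream task performance.
Link: https://arxiv.org/abs/2511.14918
Authors: Zefan Yang, Ge Wang, James Hendler, Mannudeep K. Kalra, Pingkun Yan
Affiliations: Rensselaer Polytechnic Institute; Massachusetts General Hospital
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: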
Abstract:Chest X-ray radiography (CXR) is an essential medical imaging technique for disease diagnosis. However, as 2D projectional images, CXRs are limited by structural superposition and hence fail to capture 3D anatomies. This limitation makes representation learning and disease diagnosis challenging. To address this challenge, we propose a novel CXR world model named X-WIN, which distills volumetric knowledge from chest computed tomography (CT) by learning to predict its 2D projections in latent space. The core idea is that a world model with internalized knowledge of 3D anatomical structure can predict CXRs under various transformations in 3D space. During projection prediction, we introduce an affinity-guided contrastive alignment loss that leverages mutual similarities to capture rich, correlated information across projections from the same volume. To improve model adaptability, we incorporate real CXRs into training through masked image modeling and employ a domain classifier to encourage statistically similar representations for real and simulated CXRs. Comprehensive experiments show that X-WIN outperforms existing foundation models on diverse downstream tasks using linear probing and few-shot fine-tuning. X-WIN also demonstrates the ability to render 2D projections for reconstructing a 3D CT volume.
[CV-97] nnMIL: A generalizable multiple instance learning framework for computational pathology
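[Quick read]: This paper addresses design limitations in current multiple-instance learning (MIL) methods for whole-slide images (WSIs) when aggregating patch-level features into slide-level clinical predictions; these limitations restrict generalizability and reliability. The key to the solution is the nnMIL framework, which introduces random sampling at both the patch and feature levels to enable large-batch optimization and task-aware sampling strategies, and supports efficient, scalable training across datasets and model architectures. A lightweight aggregator performs sliding-window inference to produce ensemble slide-level predictions with principled uncertainty estimates, improving performance and robustness on diagnosis, subtype classification, molecular biomarker detection, and pan-cancer prognosis prediction.
Link: https://arxiv.org/abs/2511.14907
Authors: Xiangde Luo, Jinxi Xiang, Yuanfeng Ji, Ruijiang Li
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: A conceptual evaluation work; more studies are in progress; examples are here ( this https URL )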
Abstract:Computational pathology holds substantial promise for improving diagnosis and guiding treatment decisions. Recent pathology foundation models enable the extraction of rich patch-level representations from large-scale whole-slide images (WSIs), but current approaches for aggregating these features into slide-level predictions remain constrained by design limitations that hinder generalizability and reliability. Here, we developed nnMIL, a simple yet broadly applicable multiple-instance learning framework that connects patch-level foundation models to robust slide-level clinical inference. nnMIL introduces random sampling at both the patch and feature levels, enabling large-batch optimization, task-aware sampling strategies, and efficient and scalable training across datasets and model architectures. A lightweight aggregator performs sliding-window inference to generate ensemble slide-level predictions and supports principled uncertainty estimation. Across 40,000 WSIs encompassing 35 clinical tasks and four pathology foundation models, nnMIL consistently outperformed existing MIL methods for disease diagnosis, histologic subtyping, molecular biomarker detection, and pan-cancer prognosis prediction. It further demonstrated strong cross-model generalization, reliable uncertainty quantification, and robust survival stratification in multiple external cohorts. In conclusion, nnMIL offers a practical and generalizable solution for translating pathology foundation models into clinically meaningful predictions, advancing the development and deployment of reliable AI systems in real-world settings.
[CV-98] FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding
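[Quick read]: This paper addresses the insufficient fine-grained, region-level alignment of current CLIP-based vision-language pretraining models in remote sensing (RS). The core challenges are that existing RS image-text datasets provide only global captions, leaving object-level supervision underutilized, and that general-domain region-text alignment methods degrade when transferred directly to RS scenes. The key to the solution is twofold. First, the authors construct MGRS-200k, the first multi-granularity RS image-text dataset, with rich object-level textual annotations that strengthen region-category alignment. Second, they propose the FarSLIP framework, which replaces the conventional patch-to-CLS self-distillation with patch-to-patch distillation, improving local feature discriminability while preserving semantic coherence, and uses CLS token-based region-category alignment to exploit region-text supervision without the semantic damage caused by explicit patch-level alignment. The approach improves fine-grained vision-language alignment in RS and sets a new state of the art on open-vocabulary semantic segmentation and on image-level tasks such as zero-shot classification and image-text retrieval.
Link: https://arxiv.org/abs/2511.14901
Authors: Zhenshi Li, Weikang Yu, Dilxat Muhtar, Xueliang Zhang, Pengfeng Xiao, Pedram Ghamisi, Xiao Xiang Zhu
Affiliations: Nanjing University; Technical University of Munich; Helmholtz-Zentrum Dresden-Rossendorf
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: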
Abstract:As CLIP’s global alignment limits its ability to capture fine-grained details, recent efforts have focused on enhancing its region-text alignment. However, current remote sensing (RS)-specific CLIP variants still inherit this limited spatial awareness. We identify two key limitations behind this: (1) current RS image-text datasets generate global captions from object-level labels, leaving the original object-level supervision underutilized; (2) despite the success of region-text alignment methods in general domain, their direct application to RS data often leads to performance degradation. To address these, we construct the first multi-granularity RS image-text dataset, MGRS-200k, featuring rich object-level textual supervision for RS region-category alignment. We further investigate existing fine-grained CLIP tuning strategies and find that current explicit region-text alignment methods, whether in a direct or indirect way, underperform due to severe degradation of CLIP’s semantic coherence. Building on these, we propose FarSLIP, a Fine-grained Aligned RS Language-Image Pretraining framework. Rather than the commonly used patch-to-CLS self-distillation, FarSLIP employs patch-to-patch distillation to align local and global visual cues, which improves feature discriminability while preserving semantic coherence. Additionally, to effectively utilize region-text supervision, it employs simple CLS token-based region-category alignment rather than explicit patch-level alignment, further enhancing spatial awareness. FarSLIP features improved fine-grained vision-language alignment in RS domain and sets a new state of the art not only on RS open-vocabulary semantic segmentation, but also on image-level tasks such as zero-shot classification and image-text retrieval. Our dataset, code, and models are available at this https URL.
[CV-99] InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization
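[Quick read]: This paper addresses cross-view consistency in multi-view image editing from sparse input views, that is, modifying scene content according to a textual instruction while keeping all views consistent when only a few images from different viewpoints are available. Existing methods rely on per-scene neural fields or temporal attention mechanisms and often produce artifacts and incoherent edits in this setting. The key to the solution is the InstructMix2Mix (I-Mix2Mix) framework, which distills the editing capabilities of a 2D diffusion model into a pretrained multi-view diffusion model and leverages the latter's data-driven 3D prior for cross-view consistency. The core innovation is replacing the conventional neural field consolidator in Score Distillation Sampling (SDS) with a multi-view diffusion student, supported by three techniques: incremental student updates across timesteps, a specialized teacher noise scheduler that prevents degeneration, and an attention modification that enhances cross-view coherence at no additional computational cost.
Link: https://arxiv.org/abs/2511.14899
Authors: Daniel Gilo, Or Litany
Affiliations: Technion – Israel Institute of Technology; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: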
Abstract:We address the task of multi-view image editing from sparse input views, where the inputs can be seen as a mix of images capturing the scene from different viewpoints. The goal is to modify the scene according to a textual instruction while preserving consistency across all views. Existing methods, based on per-scene neural fields or temporal attention mechanisms, struggle in this setting, often producing artifacts and incoherent edits. We propose InstructMix2Mix (I-Mix2Mix), a framework that distills the editing capabilities of a 2D diffusion model into a pretrained multi-view diffusion model, leveraging its data-driven 3D prior for cross-view consistency. A key contribution is replacing the conventional neural field consolidator in Score Distillation Sampling (SDS) with a multi-view diffusion student, which requires novel adaptations: incremental student updates across timesteps, a specialized teacher noise scheduler to prevent degeneration, and an attention modification that enhances cross-view coherence without additional cost. Experiments demonstrate that I-Mix2Mix significantly improves multi-view consistency while maintaining high per-frame edit quality.
[CV-100] HULFSynth : An INR based Super-Resolution and Ultra Low-Field MRI Synthesis via Contrast factor estimation
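[Quick read]: This paper addresses image translation between High-Field (HF) and Ultra-Low Field (ULF) magnetic resonance imaging (MRI), specifically bidirectional unsupervised synthesis: generating ULF-like images from HF images and, in the reverse direction, reconstructing HF images from ULF images. Conventional methods do not model the physical mechanism behind contrast changes between HF and ULF MRI, which limits synthesis quality. The key innovation is a physics-driven forward model that simulates the HF-to-ULF transformation by estimating tissue-type Signal-to-Noise Ratio (SNR) values, together with an Implicit Neural Representation (INR) network that simultaneously predicts tissue segmentation and image intensity without observed HF data, thereby achieving super-resolution reconstruction. The approach improves white matter-gray matter (WM-GM) contrast and is robust to perturbations in noise and initial seeding.
Link: https://arxiv.org/abs/2511.14897
Authors: Pranav Indrakanti, Ivor Simpson
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Submitted to ISBI 2026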
Abstract:We present an unsupervised single-image bidirectional Magnetic Resonance Image (MRI) synthesizer that synthesizes an Ultra-Low Field (ULF)-like image from a High-Field (HF) magnitude image and vice versa. Unlike existing MRI synthesis models, our approach is inspired by the physics that drives contrast changes between HF and ULF MRIs. Our forward model simulates an HF-to-ULF transformation by estimating the tissue-type Signal-to-Noise ratio (SNR) values based on target contrast values. For the Super-Resolution task, we used an Implicit Neural Representation (INR) network to synthesize an HF image by simultaneously predicting tissue-type segmentations and image intensity without observed HF data. The proposed method is evaluated using synthetic ULF-like data generated from standard 3T T1-weighted images for qualitative assessments and paired 3T-64mT T1-weighted images for validation experiments. WM-GM contrast improved by 52% in synthetic ULF-like images and 37% in 64mT images. Sensitivity experiments demonstrated the robustness of our forward model to variations in target contrast, noise and initial seeding.
[CV-101] GeoSceneGraph: Geometric Scene Graph Diffusion Model for Text-guided 3D Indoor Scene Synthesis
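[Quick read]: This paper addresses two problems in existing text-driven indoor 3D scene generation: generative models trained from scratch tend to ignore the inherent graph structure (scene graph) of indoor scenes, which limits scene coherence and realism; and methods that depend on predefined semantic relationships or ground-truth relationship annotations are limited in practice by the complexity of user input or the difficulty of obtaining data. The key to the solution is GeoSceneGraph, which synthesizes high-quality scenes by exploiting the geometric symmetries and implicit graph structure of 3D scenes, without predefined relationship classes or ground-truth relationship annotations. Its core technique builds on equivariant graph neural networks (EGNNs) and introduces a simple, effective strategy for conditioning on text features, allowing EGNNs to handle complex text inputs and reach generation performance comparable to methods that rely on relationship annotations.
Link: https://arxiv.org/abs/2511.14884
Authors: Antonio Ruiz, Tao Wu, Andrew Melnik, Qing Cheng, Xuqin Wang, Lu Liu, Yongliang Wang, Yanfeng Zhang, Helge Ritter
Affiliations: Huawei Technologies; Center for Cognitive Interaction Technology (CITEC), Bielefeld University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: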
Abstract:Methods that synthesize indoor 3D scenes from text prompts have wide-ranging applications in film production, interior design, video games, virtual reality, and synthetic data generation for training embodied agents. Existing approaches typically either train generative models from scratch or leverage vision-language models (VLMs). While VLMs achieve strong performance, particularly for complex or open-ended prompts, smaller task-specific models remain necessary for deployment on resource-constrained devices such as extended reality (XR) glasses or mobile phones. However, many generative approaches that train from scratch overlook the inherent graph structure of indoor scenes, which can limit scene coherence and realism. Conversely, methods that incorporate scene graphs either demand a user-provided semantic graph, which is generally inconvenient and restrictive, or rely on ground-truth relationship annotations, limiting their capacity to capture more varied object interactions. To address these challenges, we introduce GeoSceneGraph, a method that synthesizes 3D scenes from text prompts by leveraging the graph structure and geometric symmetries of 3D scenes, without relying on predefined relationship classes. Despite not using ground-truth relationships, GeoSceneGraph achieves performance comparable to methods that do. Our model is built on equivariant graph neural networks (EGNNs), but existing EGNN approaches are typically limited to low-dimensional conditioning and are not designed to handle complex modalities such as text. We propose a simple and effective strategy for conditioning EGNNs on text features, and we validate our design through ablation studies.
[CV-102] Attacking Autonomous Driving Agents with Adversarial Machine Learning: A Holistic Evaluation with the CARLA Leaderboard
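[Quick read]: This paper examines whether adversarial examples can effectively induce harmful driving behavior in autonomous driving systems, that is, it evaluates the overall impact of adversarial attacks on complete driving agents rather than only on their internal machine-learning (ML) models. The key to the approach is to use the CARLA urban driving simulator to inject purpose-built adversarial patches at runtime without modifying any driving-agent code, and to test several open-source driving agents from the CARLA Leaderboard systematically across multiple scenarios, lighting conditions, and locations. The evaluation reveals that some agents can resist manipulated ML predictions because they include modules such as PID control or GPS-based rules.
Link: https://arxiv.org/abs/2511.14876
Authors: Henry Wong, Clement Fung, Weiran Lin, Karen Li, Stanley Chen, Lujo Bauer
Affiliations: Carnegie Mellon University
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 12 pages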
Abstract:To autonomously control vehicles, driving agents use outputs from a combination of machine-learning (ML) models, controller logic, and custom modules. Although numerous prior works have shown that adversarial examples can mislead ML models used in autonomous driving contexts, it remains unclear if these attacks are effective at producing harmful driving actions for various agents, environments, and scenarios. To assess the risk of adversarial examples to autonomous driving, we evaluate attacks against a variety of driving agents, rather than against ML models in isolation. To support this evaluation, we leverage CARLA, an urban driving simulator, to create and evaluate adversarial examples. We create adversarial patches designed to stop or steer driving agents, stream them into the CARLA simulator at runtime, and evaluate them against agents from the CARLA Leaderboard, a public repository of best-performing autonomous driving agents from an annual research competition. Unlike prior work, we evaluate attacks against autonomous driving systems without creating or modifying any driving-agent code and against all parts of the agent included with the ML model. We perform a case-study investigation of two attack strategies against three open-source driving agents from the CARLA Leaderboard across multiple driving scenarios, lighting conditions, and locations. Interestingly, we show that, although some attacks can successfully mislead ML models into predicting erroneous stopping or steering commands, some driving agents use modules, such as PID control or GPS-based rules, that can overrule attacker-manipulated predictions from ML models.
[CV-103] B-Rep Distance Functions (BR-DF): How to Represent a B-Rep Model by Volumetric Distance Functions?
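[Quick read]: This paper addresses the difficulty of guaranteeing geometric completeness and topological correctness when generating and reconstructing CAD boundary representation (B-Rep) models; learning-based methods in particular often fail or produce non-watertight results. The key to the solution is a new geometric representation, B-Rep Distance Functions (BR-DF), which encodes the surface mesh geometry of a CAD model as a signed distance function (SDF) and encodes vertices, edges, faces, and their topology as per-face unsigned distance functions (UDFs). An extension of the Marching Cubes algorithm converts a BR-DF directly into a watertight (faceted) B-Rep model, with a 100% conversion success rate. The authors further design a multi-branch latent diffusion model with a 3D U-Net backbone that jointly generates the SDF and the per-face UDFs, improving the reliability of the output models while maintaining generation quality.
Link: https://arxiv.org/abs/2511.14870
Authors: Fuyang Zhang, Pradeep Kumar Jayaraman, Xiang Xu, Yasutaka Furukawa
Affiliations: Simon Fraser University; Autodesk; Wayve
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL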
Abstract:This paper presents a novel geometric representation for CAD Boundary Representation (B-Rep) based on volumetric distance functions, dubbed B-Rep Distance Functions (BR-DF). BR-DF encodes the surface mesh geometry of a CAD model as a signed distance function (SDF). B-Rep vertices, edges, faces and their topology information are encoded as per-face unsigned distance functions (UDFs). An extension of the Marching Cubes algorithm converts BR-DF directly into a watertight CAD B-Rep model (strictly speaking, a faceted B-Rep model). A surprising characteristic of BR-DF is that this conversion process never fails. Leveraging the volumetric nature of BR-DF, we propose a multi-branch latent diffusion model with a 3D U-Net backbone for jointly generating the SDF and per-face UDFs of a BR-DF model. Our approach achieves CAD generation performance comparable to SOTA methods while reaching an unprecedented 100% success rate in producing (faceted) B-Rep models.
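For readers unfamiliar with the two kinds of distance function named in the abstract, the sketch below contrasts a signed and an unsigned distance for the simplest possible shape, a sphere. It only illustrates the general concepts and is not the BR-DF encoding; the function names and the "negative inside" sign convention are assumptions made for this example.

```python
import math


def sphere_sdf(point, center, radius):
    """Signed distance to a sphere surface.

    Negative inside, zero on the surface, positive outside
    (one common convention; assumed here for illustration).
    """
    return math.dist(point, center) - radius


def sphere_udf(point, center, radius):
    """Unsigned distance: how far from the surface, with no inside/outside sign."""
    return abs(sphere_sdf(point, center, radius))


origin = (0.0, 0.0, 0.0)
print(sphere_sdf(origin, origin, 1.0))           # center of a unit sphere
print(sphere_udf(origin, origin, 1.0))           # same point, sign dropped
print(sphere_sdf((2.0, 0.0, 0.0), origin, 1.0))  # a point outside
```

The sign is what lets an SDF describe a closed solid, while an unsigned distance can also describe open pieces such as individual faces, which have no inside.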
[CV-104] When CNNs Outperform Transformers and Mambas: Revisiting Deep Architectures for Dental Caries Segmentation
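[Quick read]: This paper addresses automatic identification and segmentation of dental caries in panoramic radiographs, where the core challenges are low lesion contrast, large morphological variability, and limited annotated data. The key to the approach is a systematic benchmark on the DC1000 dataset covering convolutional neural networks (CNNs), vision Transformers, and state-space Mamba architectures. The study finds that although attention-based Transformer and Mamba models have a theoretical advantage in global context modeling, they perform poorly in the small-data setting, while the structurally simpler CNN-based DoubleU-Net outperforms the other models in Dice coefficient (0.7345), mIoU (0.5978), and precision (0.8145). This indicates that, for specific medical image segmentation tasks, how well the architecture matches the task matters more than model complexity alone.
Link: https://arxiv.org/abs/2511.14860
Authors: Aashish Ghimire, Jun Zeng, Roshan Paudel, Nikhil Kumar Tomar, Deepak Ranjan Nayak, Harshith Reddy Nalla, Vivek Jha, Glenda Reynolds, Debesh Jha
Affiliations: University of South Dakota; Indian Institute of Technology Bhubaneswar; Stanford University; National Institute of Technology Rourkela; International Institute of Information Technology Hyderabad
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures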
Abstract:Accurate identification and segmentation of dental caries in panoramic radiographs are critical for early diagnosis and effective treatment planning. Automated segmentation remains challenging due to low lesion contrast, morphological variability, and limited annotated data. In this study, we present the first comprehensive benchmarking of convolutional neural networks, vision transformers and state-space Mamba architectures for automated dental caries segmentation on panoramic radiographs using the DC1000 dataset. Twelve state-of-the-art architectures, including VMUnet, MambaUNet, VMUNetv2, RMAMamba-S, TransNetR, PVTFormer, DoubleU-Net, and ResUNet++, were trained under identical configurations. Results reveal that, contrary to the growing trend toward complex attention-based architectures, the CNN-based DoubleU-Net achieved the highest Dice coefficient of 0.7345, mIoU of 0.5978, and precision of 0.8145, outperforming all transformer and Mamba variants. In the study, the top 3 results across all performance metrics were achieved by CNN-based architectures. Here, Mamba and transformer-based methods, despite their theoretical advantage in global context modeling, underperformed due to limited data and weaker spatial priors. These findings underscore the importance of architecture-task alignment in domain-specific medical image segmentation more than model complexity. Our code is available at: this https URL.
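The Dice coefficient and IoU quoted in the abstract are standard overlap metrics for segmentation masks. The sketch below shows one common way to compute them for a single pair of binary masks; it is a generic illustration rather than the authors' evaluation code, and the function name `dice_and_iou`, the flat 0/1 list format, and the choice to score two empty masks as a perfect match are assumptions.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two binary masks given as flat 0/1 sequences.

    Hypothetical helper for illustration only.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p & t for p, t in zip(pred, target))
    pred_area = sum(pred)
    target_area = sum(target)
    union = pred_area + target_area - intersection
    if union == 0:
        return 1.0, 1.0  # both masks empty: treated as a perfect match here
    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Two 4-pixel masks that agree on one foreground pixel.
print(dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0]))
```

For a single mask pair the two metrics are tied by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU. Benchmark figures are usually averaged over images or classes, so that identity need not hold for the averaged numbers.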
[CV-105] Gaussian See Gaussian Do: Semantic 3D Motion Transfer from Multiview Video SIGGRAPH
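[Quick read]: This paper addresses cross-category, rig-free semantic 3D motion transfer, that is, high-quality dynamic 3D reconstruction and motion transfer between different objects based on semantic correspondence. The core challenge is extracting transferable motion information from multiview video and accurately reconstructing a structurally consistent dynamic scene on a static target shape. The key to the solution is an anchor-based view-aware motion embedding mechanism that ensures cross-view motion consistency and accelerates convergence, combined with a dynamic 3D Gaussian Splatting reconstruction pipeline that performs robust 4D reconstruction from noisy supervision videos, achieving better motion fidelity and structural consistency than adapted baselines.
Link: https://arxiv.org/abs/2511.14848
Authors: Yarin Bekor, Gal Michael Harari, Or Perel, Or Litany
Affiliations: Technion - Israel Institute of Technology; NVIDIA; University of Toronto; Vector Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: SIGGRAPH Asia 2025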
Abstract:We present Gaussian See, Gaussian Do, a novel approach for semantic 3D motion transfer from multiview video. Our method enables rig-free, cross-category motion transfer between objects with semantically meaningful correspondence. Building on implicit motion transfer techniques, we extract motion embeddings from source videos via condition inversion, apply them to rendered frames of static target shapes, and use the resulting videos to supervise dynamic 3D Gaussian Splatting reconstruction. Our approach introduces an anchor-based view-aware motion embedding mechanism, ensuring cross-view consistency and accelerating convergence, along with a robust 4D reconstruction pipeline that consolidates noisy supervision videos. We establish the first benchmark for semantic 3D motion transfer and demonstrate superior motion fidelity and structural consistency compared to adapted baselines. Code and data for this paper are available at this https URL
[CV-106] Dynamic Nested Hierarchies: Pioneering Self-Evolution in Machine Learning Architectures for Lifelong Intelligence
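[Quick read]: This paper addresses the limited adaptability of current machine learning models, including large language models, in non-stationary environments; the core challenge is that rigid architectures prevent continual and lifelong learning. Lacking dynamic adjustment mechanisms, existing models exhibit "anterograde amnesia": they cannot effectively compress context flows or cope with distribution shifts. The key to the solution is Dynamic Nested Hierarchies, a structure inspired by neuroplasticity that lets a model autonomously adjust the number of optimization levels, their nesting structure, and their update frequencies during training or inference, enabling self-evolution without predefined constraints. The paper pairs this mechanism with mathematical formulations, convergence proofs, expressivity bounds, and sublinear regret analysis, and reports improved performance in language modeling, continual learning, and long-context reasoning.
Link: https://arxiv.org/abs/2511.14823
Authors: Akbar Anbar Jafari, Cagri Ozcinar, Gholamreza Anbarjafari
Affiliations: University of Tartu; 3S Holding OÜ
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 1 figure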
Abstract:Contemporary machine learning models, including large language models, exhibit remarkable capabilities in static tasks yet falter in non-stationary environments due to rigid architectures that hinder continual adaptation and lifelong learning. Building upon the nested learning paradigm, which decomposes models into multi-level optimization problems with fixed update frequencies, this work proposes dynamic nested hierarchies as the next evolutionary step in advancing artificial intelligence and machine learning. Dynamic nested hierarchies empower models to autonomously adjust the number of optimization levels, their nesting structures, and update frequencies during training or inference, inspired by neuroplasticity to enable self-evolution without predefined constraints. This innovation addresses the anterograde amnesia in existing models, facilitating true lifelong learning by dynamically compressing context flows and adapting to distribution shifts. Through rigorous mathematical formulations, theoretical proofs of convergence, expressivity bounds, and sublinear regret in varying regimes, alongside empirical demonstrations of superior performance in language modeling, continual learning, and long-context reasoning, dynamic nested hierarchies establish a foundational advancement toward adaptive, general-purpose intelligence.
[CV-107] ESA: Energy-Based Shot Assembly Optimization for Automatic Video Editing
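[Quick read]: This paper addresses the difficulty of automating shot assembly in video editing while keeping artistic expression consistent: existing intelligent video editing techniques struggle to capture a creator's distinctive artistic style, so the generated content lacks narrative coherence and visual unity. The key to the solution is an optimization method built on energy-based models. A script generated by a large language model is first matched against a video library through visual-semantic matching to select semantically aligned candidate shots. Shot attributes such as shot size, camera motion, and semantics are then extracted from reference videos, and energy-based models learn from these attributes to score candidate shot sequences. Finally, multiple syntax rules are combined to optimize shot ordering so that the output video matches the reference videos in structure and style, allowing users without professional editing experience to produce artistically consistent videos.
Link: https://arxiv.org/abs/2511.02505
Authors: Yaosen Chen, Wei Wang, Tianheng Zheng, Xuming Wen, Han Yang, Yanru Zhang
Affiliations: Sobey Media Intelligence Laboratory; University of Electronic Science and Technology of China; SiChuan University; Qinghai Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: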
Abstract:Shot assembly is a crucial step in film production and video editing, involving the sequencing and arrangement of shots to construct a narrative, convey information, or evoke emotions. Traditionally, this process has been manually executed by experienced editors. While current intelligent video editing technologies can handle some automated video editing tasks, they often fail to capture the creator’s unique artistic expression in shot assembly. To address this challenge, we propose an energy-based optimization method for video shot assembly. Specifically, we first perform visual-semantic matching between the script generated by a large language model and a video library to obtain subsets of candidate shots aligned with the script semantics. Next, we segment and label the shots from reference videos, extracting attributes such as shot size, camera motion, and semantics. We then employ energy-based models to learn from these attributes, scoring candidate shot sequences based on their alignment with reference styles. Finally, we achieve shot assembly optimization by combining multiple syntax rules, producing videos that align with the assembly style of the reference videos. Our method not only automates the arrangement and combination of independent shots according to specific logic, narrative requirements, or artistic styles but also learns the assembly style of reference videos, creating a coherent visual sequence or holistic visual expression. With our system, even users with no prior video editing experience can create visually compelling videos. Project page: this https URL
[CV-108] Data-driven Prediction of Species-Specific Plant Responses to Spectral-Shifting Films from Leaf Phenotypic and Photosynthetic Traits
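[Quick read]: This paper addresses the complexity of how light-quality control, in particular Spectral-Shifting Films (SF), affects the growth and yield of different crops: how to systematically understand the joint effect of multiple plant phenotypic traits (such as leaf reflectance, leaf mass per area, and chlorophyll content) and photosynthetic responses in order to predict how SF application affects crop yield. The key to the solution is to combine multidimensional plant phenotypic data with the Daily Light Integral (DLI), augment the data with a variational autoencoder, and then build a feedforward neural network (FFNN) that performs binary classification of whether SF significantly improves crop yield, reaching 91.4% accuracy on the test set and shedding light on the complex interactions among leaf phenotype, photosynthetic traits, and solar spectral components.
Link: https://arxiv.org/abs/2511.15173
Authors: Jun Hyeun Kang, Jung Eek Son, Tae In Ahn
Affiliations: Korea Institute of Science and Technology; Seoul National University
Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: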
Abstract:The application of spectral-shifting films (SF) in greenhouses to shift green light to red light has shown variable growth responses across crop species. However, the yield enhancement of crops under altered light quality is related to the collective effects of the specific biophysical characteristics of each species. Considering only one attribute of a crop has limitations in understanding the relationship between sunlight quality adjustments and crop growth performance. Therefore, this study aims to comprehensively link multiple plant phenotypic traits and daily light integral considering the physiological responses of crops to their growth outcomes under SF using artificial intelligence. Between 2021 and 2024, various leafy, fruiting, and root crops were grown in greenhouses covered with either PEF or SF, and leaf reflectance, leaf mass per area, chlorophyll content, daily light integral, and light saturation point were measured from the plants cultivated in each condition. 210 data points were collected, but there was insufficient data to train deep learning models, so a variational autoencoder was used for data augmentation. Most crop yields showed an average increase of 22.5% under SF. These data were used to train several models, including logistic regression, decision tree, random forest, XGBoost, and feedforward neural network (FFNN), aiming to binary classify whether there was a significant effect on yield with SF application. The FFNN achieved a high classification accuracy of 91.4% on a test dataset that was not used for training. This study provides insight into the complex interactions between leaf phenotypic and photosynthetic traits, environmental conditions, and solar spectral components by improving the ability to predict solar spectral shift effects using SF.
[CV-109] Image Denoising Using Transformed L1 (TL1) Regularization via ADMM
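[Quick read]: This paper addresses the staircase artifacts and loss of contrast that conventional total variation (TV) regularization produces in image denoising. The key to its solution is to apply a transformed ℓ1 (TL1) regularizer to the image gradients and to build an optimization model solved with ADMM (Alternating Direction Method of Multipliers). The method derives a closed-form proximal operator for the TL1 regularizer and performs the image update with the fast Fourier transform (FFT) under periodic boundary conditions, suppressing noise while preserving edges and enhancing image contrast.
Link: https://arxiv.org/abs/2511.15060
Authors: Nabiha Choudhury, Jianqing Jia, Yifei Lou
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
Comments: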
Abstract:Total variation (TV) regularization is a classical tool for image denoising, but its convex ℓ1 formulation often leads to staircase artifacts and loss of contrast. To address these issues, we introduce the Transformed ℓ1 (TL1) regularizer applied to image gradients. In particular, we develop a TL1-regularized denoising model and solve it using the Alternating Direction Method of Multipliers (ADMM), featuring a closed-form TL1 proximal operator and an FFT-based image update under periodic boundary conditions. Experimental results demonstrate that our approach achieves superior denoising performance, effectively suppressing noise while preserving edges and enhancing image contrast.
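As a rough illustration of two ingredients named in the abstract, the sketch below shows the TL1 penalty and an FFT-based image update under periodic boundary conditions. It is a minimal sketch, not the authors' implementation: the quadratic data term, the scaled-ADMM splitting d = ∇u with dual variable b, the parameters `lam`, `rho`, and `a`, and all function names are assumptions made for this example. The paper's closed-form TL1 proximal operator is not reproduced here.

```python
import numpy as np

def tl1_penalty(t, a=1.0):
    """Transformed l1 penalty (a + 1)|t| / (a + |t|) with shape parameter a > 0."""
    t = np.abs(t)
    return (a + 1.0) * t / (a + t)

def grad(u):
    """Forward differences with periodic (wrap-around) boundaries."""
    gx = np.roll(u, -1, axis=1) - u
    gy = np.roll(u, -1, axis=0) - u
    return gx, gy

def grad_adjoint(px, py):
    """Adjoint of grad under periodic boundaries."""
    return (np.roll(px, 1, axis=1) - px) + (np.roll(py, 1, axis=0) - py)

def fft_image_update(f, dx, dy, bx, by, lam=1.0, rho=2.0):
    """Solve (lam*I + rho*grad^T grad) u = lam*f + rho*grad^T(d - b) with FFTs.

    Assumed u-subproblem of a scaled ADMM scheme:
        min_u  lam/2 * ||u - f||^2 + rho/2 * ||grad(u) - d + b||^2
    """
    h, w = f.shape
    rhs = lam * f + rho * grad_adjoint(dx - bx, dy - by)
    # Eigenvalues of grad^T grad for periodic forward differences:
    # 2 - 2*cos(2*pi*k/n) along each axis.
    ky = 2.0 - 2.0 * np.cos(2.0 * np.pi * np.arange(h) / h)
    kx = 2.0 - 2.0 * np.cos(2.0 * np.pi * np.arange(w) / w)
    denom = lam + rho * (ky[:, None] + kx[None, :])
    return np.real(np.fft.ifft2(np.fft.fft2(rhs) / denom))
```

Because forward differences with wrap-around boundaries are circular convolutions, the linear system is diagonal in the Fourier domain, so the update reduces to one FFT, a pointwise division, and one inverse FFT. The penalty behaves like ℓ1 for large `a` and approaches an ℓ0-like indicator as `a` tends to zero.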
[CV-110] Application of Graph Based Vision Transformers Architectures for Accurate Temperature Prediction in Fiber Specklegram Sensors
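[Quick read]: This paper addresses the limited accuracy of temperature prediction with Fiber Specklegram Sensors (FSS) in environmental temperature monitoring, caused by the nonlinear nature of specklegram data. The key to the solution is to use Transformer-based deep learning models, including the Vision Transformer (ViT), the Swin Transformer, and newer models such as LINA-ViT and MAP-ViGAT, whose self-attention mechanisms and graph-based structure capture the complex modal interactions and phase shifts in specklegrams and improve prediction accuracy (ViT achieves a mean absolute error, MAE, of 1.15). Explainable AI (XAI) techniques are also used to make the models' decisions more transparent and trustworthy.
Link: https://arxiv.org/abs/2511.14792
Authors: Abhishek Sebastian
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: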
Abstract:Fiber Specklegram Sensors (FSS) are highly effective for environmental monitoring, particularly for detecting temperature variations. However, the nonlinear nature of specklegram data presents significant challenges for accurate temperature prediction. This study investigates the use of transformer-based architectures, including Vision Transformers (ViTs), Swin Transformers, and emerging models such as Learnable Importance Non-Symmetric Attention Vision Transformers (LINA-ViT) and Multi-Adaptive Proximity Vision Graph Attention Transformers (MAP-ViGAT), to predict temperature from specklegram data over a range of 0 to 120 Celsius. The results show that ViTs achieved a Mean Absolute Error (MAE) of 1.15, outperforming traditional models such as CNNs. GAT-ViT and MAP-ViGAT variants also demonstrated competitive accuracy, highlighting the importance of adaptive attention mechanisms and graph-based structures in capturing complex modal interactions and phase shifts in specklegram data. Additionally, this study incorporates Explainable AI (XAI) techniques, including attention maps and saliency maps, to provide insights into the decision-making processes of the transformer models, improving interpretability and transparency. These findings establish transformer architectures as strong benchmarks for optical fiber-based temperature sensing and offer promising directions for industrial monitoring and structural health assessment applications.
Artificial Intelligence
[AI-0] Walrus: A Cross-Domain Foundation Model for Continuum Dynamics
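[Quick read]: This paper addresses why foundation models have yet to achieve in physical simulation the impact they have had in language and vision. The core challenges are data heterogeneity and unstable long-term dynamics, which hinder learning from sufficiently diverse dynamics, and varying resolutions and dimensionalities, which make training on modern hardware inefficient. The key to the solution is three new methods: a harmonic-analysis-based stabilization mechanism that improves prediction stability; load-balanced distributed 2D and 3D training strategies that improve hardware utilization; and compute-adaptive tokenization that accommodates multi-scale physical fields. With these methods the authors build Walrus, a transformer-based foundation model for fluid-like continuum dynamics, pretrained on 19 scenarios spanning astrophysics, geoscience, plasma physics, and other fields. Experiments show that it outperforms existing models on both short- and long-term prediction tasks, and ablation studies confirm that each technique improves forecast stability, training throughput, and transfer performance.
Link: https://arxiv.org/abs/2511.15684
Authors: Michael McCabe, Payel Mukhopadhyay, Tanya Marwah, Bruno Regaldo-Saint Blancard, Francois Rozet, Cristiana Diaconu, Lucas Meyer, Kaze W. K. Wong, Hadi Sotoudeh, Alberto Bietti, Irina Espejo, Rio Fear, Siavash Golkar, Tom Hehir, Keiya Hirashima, Geraud Krawezik, Francois Lanusse, Rudy Morel, Ruben Ohana, Liam Parker, Mariel Pettee, Jeff Shen, Kyunghyun Cho, Miles Cranmer, Shirley Ho
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: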
Abstract:Foundation models have transformed machine learning for language and vision, but achieving comparable impact in physical simulation remains a challenge. Data heterogeneity and unstable long-term dynamics inhibit learning from sufficiently diverse dynamics, while varying resolutions and dimensionalities challenge efficient training on modern hardware. Through empirical and theoretical analysis, we incorporate new approaches to mitigate these obstacles, including a harmonic-analysis-based stabilization method, load-balanced distributed 2D and 3D training strategies, and compute-adaptive tokenization. Using these tools, we develop Walrus, a transformer-based foundation model developed primarily for fluid-like continuum dynamics. Walrus is pretrained on nineteen diverse scenarios spanning astrophysics, geoscience, rheology, plasma physics, acoustics, and classical fluids. Experiments show that Walrus outperforms prior foundation models on both short and long term prediction horizons on downstream tasks and across the breadth of pretraining data, while ablation studies confirm the value of our contributions to forecast stability, training throughput, and transfer performance over conventional approaches. Code and weights are released for community use.
[AI-1] DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models
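[Quick read]: This paper addresses the heavy data dependence and imprecise motor control of current Vision-Language-Action (VLA) models on robot tasks, which stem from the lack of a “think before acting” mechanism. The core challenge is that existing models use a single autoregressive decoder for both sequential Chain-of-Thought (CoT) reasoning and high-dimensional, parallelizable robot actions, an architectural mismatch that weakens the causal link between thought and action. The key to the solution is DeepThinkVLA, built around a tightly integrated hybrid-attention decoder: it first generates sequential CoT reasoning with causal attention, then switches to bidirectional attention to decode action vectors efficiently in parallel. This is paired with a two-stage training strategy: Supervised Fine-Tuning (SFT) first establishes foundational reasoning, and Reinforcement Learning (RL) with task-success rewards then causally aligns the full reasoning-action sequence, yielding a 97.0% success rate on the LIBERO benchmark.
Link: https://arxiv.org/abs/2511.15669
Authors: Cheng Yin, Yankai Lin, Wang Xu, Sikyuen Tam, Xiangrui Zeng, Zhiyuan Liu, Zhouping Yin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 16 pages, 6 figures, conference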
Abstract:Enabling Vision-Language-Action (VLA) models to “think before acting” via Chain-of-Thought (CoT) is a promising path to overcoming the data-hungry nature of end-to-end robot policies. However, progress is stalled by a fundamental conflict: existing models use a single autoregressive decoder for both sequential CoT reasoning and high-dimensional, parallelizable robot actions. This architectural mismatch degrades motor control and fails to forge a strong causal link between thought and action. We introduce DeepThinkVLA, which resolves this conflict through a tightly integrated architecture and training strategy. Architecturally, our hybrid-attention decoder generates sequential CoT with causal attention and then switches to bidirectional attention for fast, parallel decoding of action vectors. This design is complemented by a two-stage training pipeline: we first use Supervised Fine-Tuning (SFT) to teach the model foundational reasoning, then apply Reinforcement Learning (RL) with task-success rewards to causally align the full reasoning-action sequence with desired outcomes. This synergy leads to state-of-the-art performance, achieving a 97.0% success rate on the LIBERO benchmark. Our ablations confirm the design’s effectiveness: the hybrid architecture alone outperforms standard decoders by 15.5%, and the final RL stage provides a crucial 2% boost to secure top performance.
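The hybrid-attention idea in the abstract can be pictured as an attention mask that is causal over the reasoning tokens and fully connected within the action tokens. The sketch below is only an illustration under assumptions that are not stated in the abstract: reasoning tokens precede action tokens in one sequence, the mask is boolean with `True` meaning "may attend", and the function name `hybrid_attention_mask` is hypothetical.

```python
import numpy as np

def hybrid_attention_mask(num_reasoning, num_action):
    """Return a boolean mask where mask[i, j] is True if token i may attend to token j.

    Reasoning (CoT) tokens use causal attention; action tokens see all
    reasoning tokens and attend to each other bidirectionally.
    """
    n = num_reasoning + num_action
    # Causal part: every token sees itself and all earlier tokens.
    mask = np.tril(np.ones((n, n), dtype=bool))
    # Bidirectional part: action tokens also see later action tokens.
    mask[num_reasoning:, num_reasoning:] = True
    return mask

if __name__ == "__main__":
    # Rows are queries, columns are keys; 1 marks an allowed attention edge.
    print(hybrid_attention_mask(3, 2).astype(int))
```

With such a mask the action block can be decoded in one parallel pass, since no action token has to wait for another, while the reasoning tokens keep their left-to-right dependency.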
[AI-2] Continual Reinforcement Learning for Cyber-Physical Systems: Lessons Learned and Open Challenges
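[Quick read]: This paper addresses key challenges that continual reinforcement learning (CRL) faces in practice, in particular catastrophic forgetting, hyperparameter sensitivity, inefficient use of neural network capacity, and inadequate abstract representations of the environment when learning multiple tasks in sequence in an autonomous driving setting. The core of the approach is to train an agent successively on four parking scenarios with differently angled spaces using Proximal Policy Optimisation (PPO), to identify the open problems in CRL systematically, and to propose research directions that need attention, stressing that robust CRL systems require deeper investigation of neural network architecture design and interdisciplinary work (for example between computer science and neuroscience).
Link: https://arxiv.org/abs/2511.15652
Authors: Kim N. Nolle, Ivana Dusparic, Rhodri Cusack, Vinny Cahill
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 5 pages, 5 figures, Accepted to RLDM 2025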
Abstract:Continual learning (CL) is a branch of machine learning that aims to enable agents to adapt and generalise previously learned abilities so that these can be reapplied to new tasks or environments. This is particularly useful in multi-task settings or in non-stationary environments, where the dynamics can change over time. This is particularly relevant in cyber-physical systems such as autonomous driving. However, despite recent advances in CL, successfully applying it to reinforcement learning (RL) is still an open problem. This paper highlights open challenges in continual RL (CRL) based on experiments in an autonomous driving environment. In this environment, the agent must learn to successfully park in four different scenarios corresponding to parking spaces oriented at varying angles. The agent is successively trained in these four scenarios one after another, representing a CL environment, using Proximal Policy Optimisation (PPO). These experiments exposed a number of open challenges in CRL: finding suitable abstractions of the environment, oversensitivity to hyperparameters, catastrophic forgetting, and efficient use of neural network capacity. Based on these identified challenges, we present open research questions that are important to be addressed for creating robust CRL systems. In addition, the identified challenges call into question the suitability of neural networks for CL. We also identify the need for interdisciplinary research, in particular between computer science and neuroscience.
[AI-3] Sufficient Explanations in Databases and their Connections to Necessary Explanations and Repairs
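[Quick read]: This paper addresses how to provide more effective causal explanations in relational databases, specifically explanations for query answers, by introducing sufficient explanations as an alternative causal notion. The key is to establish the connections between sufficient explanations and database repairs, and between sufficient explanations and causality-based necessary explanations, and on that basis to obtain several computational results, providing a new formal framework and computational foundation for explaining queries over inconsistent databases.
Link: https://arxiv.org/abs/2511.15623
Authors: Leopoldo Bertossi, Nina Pardal
Affiliations: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: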
Abstract:The notion of cause, as formalized by Halpern and Pearl, has been recently applied to relational databases, to characterize and compute causal explanations for query answers. In this work we consider the alternative notion of sufficient explanation. We investigate its connections with database repairs as used for dealing with inconsistent databases, and with causality-based necessary explanations. We also obtain some computational results.
[AI-4] Optimus-Q: Utilizing Federated Learning in Adaptive Robots for Intelligent Nuclear Power Plant Operations through Quantum Cryptography
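[Quick read]: This paper addresses the challenges of environmental monitoring and safety protection in nuclear power plants (NPPs), in particular efficient, real-time, privacy-preserving data collection and contamination warning in high-risk environments. The key to the solution is the integration of several technologies. First, the Optimus-Q robot, equipped with advanced infrared sensors, autonomously monitors air quality and contamination. Second, federated learning lets robot systems at multiple plants jointly improve their predictive models without sharing raw data, improving recognition of hazardous gases such as CO₂, CO, and CH₄. Finally, Quantum Key Distribution (QKD) secures the communication links and protects sensitive operational data from leakage. This combination of robotics, machine learning, and quantum communication strengthens proactive response and operational safety in nuclear facilities.
Link: https://arxiv.org/abs/2511.15614
Authors: Sai Puppala, Ismail Hossain, Jahangir Alam, Sajedul Talukder
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: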
Abstract:The integration of advanced robotics in nuclear power plants (NPPs) presents a transformative opportunity to enhance safety, efficiency, and environmental monitoring in high-stakes environments. Our paper introduces the Optimus-Q robot, a sophisticated system designed to autonomously monitor air quality and detect contamination while leveraging adaptive learning techniques and secure quantum communication. Equipped with advanced infrared sensors, the Optimus-Q robot continuously streams real-time environmental data to predict hazardous gas emissions, including carbon dioxide (CO₂), carbon monoxide (CO), and methane (CH₄). Utilizing a federated learning approach, the robot collaborates with other systems across various NPPs to improve its predictive capabilities without compromising data privacy. Additionally, the implementation of Quantum Key Distribution (QKD) ensures secure data transmission, safeguarding sensitive operational information. Our methodology combines systematic navigation patterns with machine learning algorithms to facilitate efficient coverage of designated areas, thereby optimizing contamination monitoring processes. Through simulations and real-world experiments, we demonstrate the effectiveness of the Optimus-Q robot in enhancing operational safety and responsiveness in nuclear facilities. This research underscores the potential of integrating robotics, machine learning, and quantum technologies to revolutionize monitoring systems in hazardous environments.
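The abstract does not say which aggregation rule the federated setup uses. As one common choice, federated averaging (FedAvg) combines locally trained weights in proportion to each site's sample count, so only model parameters leave a plant and raw sensor data stay local. The sketch below assumes that rule purely for illustration; the function name `federated_average` and the flat weight vectors are hypothetical.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Weighted average of client model weights (FedAvg-style aggregation).

    client_weights: list of equally shaped weight arrays, one per site.
    client_sizes:   number of local training samples at each site.
    """
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack([np.asarray(w, dtype=float) for w in client_weights])
    # Sites with more local data contribute proportionally more.
    return np.average(stacked, axis=0, weights=sizes)

if __name__ == "__main__":
    plant_a = [1.0, 2.0]   # weights trained at one site (made-up numbers)
    plant_b = [3.0, 4.0]   # weights trained at another site
    print(federated_average([plant_a, plant_b], [1, 3]))
```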
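[AI-5] What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity
[Quick read]: This paper addresses the performance differences among current AI research agents, and in particular what drives the success or failure of agent trajectories. Its core contribution is to show the decisive role of ideation diversity in agent performance: by analysing trajectories of different models and agent scaffolds on the MLE-bench benchmark, it finds that higher-performing agents tend to have greater ideation diversity; a controlled experiment further confirms that increasing ideation diversity improves agent performance, and the conclusion holds across multiple evaluation metrics.
Link: https://arxiv.org/abs/2511.15593
Authors: Alexis Audran-Reiss, Jordi Armengol Estapé, Karen Hambardzumyan, Amar Budhiraja, Martin Josifoski, Edan Toledo, Rishi Hazra, Despoina Magka, Michael Shvartsman, Parth Pathak, Justine T Kao, Lucia Cipolina-Kun, Bhavul Gauri, Jean-Christophe Gagnon-Audet, Emanuel Tewolde, Jenny Zhang, Taco Cohen, Yossi Adi, Tatiana Shavrina, Yoram Bachrach
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: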
Abstract:AI research agents offer the promise to accelerate scientific progress by automating the design, implementation, and training of machine learning models. However, the field is still in its infancy, and the key factors driving the success or failure of agent trajectories are not fully understood. We examine the role that ideation diversity plays in agent performance. First, we analyse agent trajectories on MLE-bench, a well-known benchmark to evaluate AI research agents, across different models and agent scaffolds. Our analysis reveals that different models and agent scaffolds yield varying degrees of ideation diversity, and that higher-performing agents tend to have increased ideation diversity. Further, we run a controlled experiment where we modify the degree of ideation diversity, demonstrating that higher ideation diversity results in stronger performance. Finally, we strengthen our results by examining additional evaluation metrics beyond the standard medal-based scoring of MLE-bench, showing that our findings still hold across other agent performance metrics.
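[AI-6] B+ANN: A Fast Billion-Scale Disk-based Nearest-Neighbor Index
[Quick read]: This paper addresses several key problems of HNSW, the graph-based approximate nearest-neighbor (ANN) index algorithm used by most current vector databases (VDBs): a memory-intensive design, random memory accesses that degrade cache efficiency, limited scope for acceleration due to fine-grained pairwise computations, and support for semantic similarity queries only, with no support for dissimilarity queries. The core of the solution is B+ANN, a new disk-based ANN index: it first partitions the input data into blocks of semantically similar items, then builds a B+ tree variant to store the blocks in memory and on disk, and finally enables hybrid edge-based and block-based in-memory traversals. The design improves spatial and temporal locality, reduces cache misses (19.23% relative gain), lowers memory consumption and disk-based build time by 24x relative to DiskANN, and adds support for dissimilarity queries, which similarity-oriented ANN indices do not offer.
Link: https://arxiv.org/abs/2511.15557
Authors: Selim Furkan Tekin, Rajesh Bordawekar
Affiliations: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
Comments: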
Abstract:Storing and processing of embedding vectors by specialized Vector databases (VDBs) has become the linchpin in building modern AI pipelines. Most current VDBs employ variants of a graph-based approximate nearest-neighbor (ANN) index algorithm, HNSW, to answer semantic queries over stored vectors. In spite of its widespread use, the HNSW algorithm suffers from several issues: in-memory design and implementation, random memory accesses leading to degradation in cache behavior, limited acceleration scope due to fine-grained pairwise computations, and support of only semantic similarity queries. In this paper, we present a novel disk-based ANN index, B+ANN, to address these issues: it first partitions input data into blocks containing semantically similar items, then builds a B+ tree variant to store blocks both in-memory and on disks, and finally, enables hybrid edge- and block-based in-memory traversals. As demonstrated by our experimental evaluation, the proposed B+ANN disk-based index improves both quality (Recall value), and execution performance (Queries per second/QPS) over HNSW, by improving spatial and temporal locality for semantic operations, reducing cache misses (19.23% relative gain), and decreasing the memory consumption and disk-based build time by 24x over the DiskANN algorithm. Finally, it enables dissimilarity queries, which are not supported by similarity-oriented ANN indices.
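[AI-7] Exploring the use of AI authors and reviewers at Agents4Science
[Quick read]: This paper addresses the fact that the role of AI agents in scientific research remains unclear, and in particular that their capabilities as scientists and reviewers, and their potential for collaboration, have not been systematically explored. The key to the approach was organizing Agents4Science, the first conference in which AI agents serve as primary authors and reviewers. The conference served as a testbed for human-AI collaboration in research, and the paper distills the key lessons about collaboration patterns, offering empirical grounding and methodological guidance for integrating AI more deeply into the research process.
Link: https://arxiv.org/abs/2511.15534
Authors: Federico Bianchi, Owen Queen, Nitya Thakkar, Eric Sun, James Zou
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: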
Abstract:There is growing interest in using AI agents for scientific research, yet fundamental questions remain about their capabilities as scientists and reviewers. To explore these questions, we organized Agents4Science, the first conference in which AI agents serve as both primary authors and reviewers, with humans as co-authors and co-reviewers. Here, we discuss the key learnings from the conference and their implications for human-AI collaboration in science.
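[AI-8] Theoretical Closed-loop Stability Bounds for Dynamical System Coupled with Diffusion Policies
[Quick read]: This paper addresses the difficulty of using Diffusion Policy in real-time robotic manipulation because its reverse-time diffusion (denoising) process is computationally expensive. The key idea is to run the denoising process only partially before executing an action, so that the physical plant dynamics evolve in parallel with the reverse-time diffusion dynamics running on the computer, enabling faster imitation learning. The paper then derives theoretical bounds on closed-loop stability and proposes a metric, based on the variance of the demonstrations, for judging whether the controller will be stable.
Link: https://arxiv.org/abs/2511.15520
Authors: Gabriel Lauzier, Alexandre Girard, François Ferland
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures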
Abstract:Diffusion Policy has shown great performance in robotic manipulation tasks under stochastic perturbations, due to its ability to model multimodal action distributions. Nonetheless, its reliance on a computationally expensive reverse-time diffusion (denoising) process for action inference makes it challenging to use in real-time applications where quick decision-making is mandatory. This work studies the possibility of conducting the denoising process only partially before executing an action, allowing the plant to evolve according to its dynamics in parallel to the reverse-time diffusion dynamics ongoing on the computer. In a classical diffusion policy setting, the plant dynamics are usually slow and the two dynamical processes are uncoupled. Here, we investigate theoretical bounds on the stability of closed-loop systems using diffusion policies when the plant dynamics and the denoising dynamics are coupled. This work contributes a framework for faster imitation learning and a metric that indicates whether a controller will be stable based on the variance of the demonstrations.
[AI-9] Insights from the ICLR Peer Review and Rebuttal Process
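[Quick read]: This paper addresses shortcomings in the efficiency and fairness of peer review at top machine learning conferences such as ICLR, focusing on the relationship between initial scores and score changes during the rebuttal phase and on the factors that drive those changes. The key to the approach is a large-scale analysis of ICLR 2024 and 2025 review data that combines quantitative measures with LLM-based categorization of review texts and rebuttal discussions, identifying common strengths and weaknesses for each rating group and the rebuttal strategies most strongly associated with score changes. The study finds that initial scores and co-reviewer ratings are the strongest predictors of score changes during the rebuttal, pointing to a degree of reviewer influence, and that well-crafted rebuttals can meaningfully improve outcomes for borderline papers. It thus offers authors guidance on effective rebuttal strategies and provides empirical evidence for designing fairer and more efficient review processes.
Link: https://arxiv.org/abs/2511.15462
Authors: Amir Hossein Kargaran, Nafiseh Nikeghbal, Jing Yang, Nedjma Ousidhoum
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: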
Abstract:Peer review is a cornerstone of scientific publishing, including at premier machine learning conferences such as ICLR. As submission volumes increase, understanding the nature and dynamics of the review process is crucial for improving its efficiency, effectiveness, and the quality of published papers. We present a large-scale analysis of the ICLR 2024 and 2025 peer review processes, focusing on before- and after-rebuttal scores and reviewer-author interactions. We examine review scores, author-reviewer engagement, temporal patterns in review submissions, and co-reviewer influence effects. Combining quantitative analyses with LLM-based categorization of review texts and rebuttal discussions, we identify common strengths and weaknesses for each rating group, as well as trends in rebuttal strategies that are most strongly associated with score changes. Our findings show that initial scores and the ratings of co-reviewers are the strongest predictors of score changes during the rebuttal, pointing to a degree of reviewer influence. Rebuttals play a valuable role in improving outcomes for borderline papers, where thoughtful author responses can meaningfully shift reviewer perspectives. More broadly, our study offers evidence-based insights to improve the peer review process, guiding authors on effective rebuttal strategies and helping the community design fairer and more efficient review processes. Our code and score changes data are available at this https URL.
[AI-10] Know Your Intent: An Autonomous Multi-Perspective LLM Agent Framework for DeFi User Transaction Intent Mining
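[Quick read]: This paper addresses the difficulty of identifying user transaction intent in Decentralized Finance (DeFi), which is hard to parse accurately because of complex smart contract interactions, diverse on-chain and off-chain factors, and opaque hex logs; existing methods lack deep semantic understanding. The key to the solution is the Transaction Intent Mining (TIM) framework, which builds a DeFi intent taxonomy on grounded theory and uses a multi-agent Large Language Model (LLM) system for robust intent inference: a Meta-Level Planner dynamically coordinates domain-expert agents and decomposes multi-perspective intent analysis into solvable subtasks; Question Solvers process multi-modal on-chain and off-chain data; and a Cognitive Evaluator suppresses LLM hallucinations and ensures verifiability. Experiments show that TIM significantly outperforms traditional machine learning models, single LLMs, and single-agent baselines.
Link: https://arxiv.org/abs/2511.15456
Authors: Qian’ang Mao, Yuxuan Zhang, Jiaman Chen, Wenjun Zhou, Jiaqi Yan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); General Finance (q-fin.GN)
Comments: Written in 2025 Q1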
Abstract:As Decentralized Finance (DeFi) develops, understanding user intent behind DeFi transactions is crucial yet challenging due to complex smart contract interactions, multifaceted on-/off-chain factors, and opaque hex logs. Existing methods lack deep semantic insight. To address this, we propose the Transaction Intent Mining (TIM) framework. TIM leverages a DeFi intent taxonomy built on grounded theory and a multi-agent Large Language Model (LLM) system to robustly infer user intents. A Meta-Level Planner dynamically coordinates domain experts to decompose multiple perspective-specific intent analyses into solvable subtasks. Question Solvers handle the tasks with multi-modal on-/off-chain data, while a Cognitive Evaluator mitigates LLM hallucinations and ensures verifiability. Experiments show that TIM significantly outperforms machine learning models, single LLMs, and single-agent baselines. We also analyze core challenges in intent inference. This work helps provide a more reliable understanding of user motivations in DeFi, offering context-aware explanations for complex blockchain activity.
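[AI-11] TSFM in-context learning for time-series classification of bearing-health status
[Quick read]: This paper addresses time-series classification without fine-tuning a pretrained model, with industrial applications such as equipment health assessment in mind. The key is in-context learning: data that were not part of the model's training corpus are embedded in the prompt as targets (class id) and covariates (data matrix), so that a time-series foundation model (TSFM) can classify an unknown covariate pattern probabilistically alongside the forecast axis. The method transforms frequency-domain reference signals into pseudo time-series patterns and generates aligned covariate and target signals, achieving effective classification across operating conditions without changing model parameters, and moving maintenance systems from custom narrow AI towards scalable, general AI-driven systems.
Link: https://arxiv.org/abs/2511.15447
Authors: Michel Tokic, Slobodan Djukanović, Anja von Beuningen, Cheng Feng
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint submitted to ESANN 2026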
Abstract:This paper introduces a classification method using in-context learning in time-series foundation models (TSFM). We show how data that was not part of the TSFM training data corpus can be classified without the need to fine-tune the model. Examples are represented in the form of targets (class id) and covariates (data matrix) within the prompt of the model, which makes it possible to classify an unknown covariate data pattern alongside the forecast axis through in-context learning. We apply this method to vibration data for assessing the health state of a bearing within a servo-press motor. The method transforms frequency domain reference signals into pseudo time-series patterns, generates aligned covariate and target signals, and uses the TSFM to predict the probabilities with which the classified data correspond to predefined labels. Leveraging the scalability of pre-trained models, this method demonstrates efficacy across varied operational conditions. This marks significant progress beyond custom narrow AI solutions towards broader, AI-driven maintenance systems.
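[AI-12] Small Language Models for Phishing Website Detection: Cost, Performance, and Privacy Trade-Offs
[Quick read]: This paper addresses the reliance of traditional phishing website detection on extensive feature engineering, continuous retraining, and costly infrastructure maintenance, as well as the operational costs and external dependencies of proprietary large language models (LLMs). The key to the approach is to examine whether small language models (SLMs) can classify phishing websites from raw HTML code alone. By systematically evaluating 15 SLMs ranging from 1 billion to 70 billion parameters and weighing classification accuracy against computational requirements and cost-efficiency, the paper shows that SLMs, although weaker than state-of-the-art proprietary LLMs, remain a viable alternative that can be deployed locally, offers greater control, and is cost-effective, laying a foundation for future work on adapting, fine-tuning, and deploying SLMs for phishing detection.
Link: https://arxiv.org/abs/2511.15434
Authors: Georg Goldenits, Philip Koenig, Sebastian Raubitzek, Andreas Ekelhart
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: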
Abstract:Phishing websites pose a major cybersecurity threat, exploiting unsuspecting users and causing significant financial and organisational harm. Traditional machine learning approaches for phishing detection often require extensive feature engineering, continuous retraining, and costly infrastructure maintenance. At the same time, proprietary large language models (LLMs) have demonstrated strong performance in phishing-related classification tasks, but their operational costs and reliance on external providers limit their practical adoption in many business environments. This paper investigates the feasibility of small language models (SLMs) for detecting phishing websites using only their raw HTML code. A key advantage of these models is that they can be deployed on local infrastructure, providing organisations with greater control over data and operations. We systematically evaluate 15 commonly used Small Language Models (SLMs), ranging from 1 billion to 70 billion parameters, benchmarking their classification accuracy, computational requirements, and cost-efficiency. Our results highlight the trade-offs between detection performance and resource consumption, demonstrating that while SLMs underperform compared to state-of-the-art proprietary LLMs, they can still provide a viable and scalable alternative to external LLM services. By presenting a comparative analysis of costs and benefits, this work lays the foundation for future research on the adaptation, fine-tuning, and deployment of SLMs in phishing detection systems, aiming to balance security effectiveness and economic practicality.
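[AI-13] Towards Understanding Layer Contributions in Tabular In-Context Learning Models
[Quick read]: This paper addresses the unclear contribution of individual layers to prediction in tabular in-context learning (tabular ICL) models, and in particular the lack of understanding of how latent spaces evolve across layers. The key is to analyze TabPFN and TabICL through the “layers as painters” perspective; the analysis finds that only subsets of layers share a common representational language, which reveals structural redundancy and opens opportunities for model compression and improved interpretability.
Link: https://arxiv.org/abs/2511.15432
Authors: Amir Rezaei Balef, Mykhailo Koshil, Katharina Eggensperger
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the EurIPS 2025 Workshop on AI for Tabular Data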
Abstract:Despite the architectural similarities between tabular in-context learning (ICL) models and large language models (LLMs), little is known about how individual layers contribute to tabular prediction. In this paper, we investigate how the latent spaces evolve across layers in tabular ICL models, identify potential redundant layers, and compare these dynamics with those observed in LLMs. We analyze TabPFN and TabICL through the “layers as painters” perspective, finding that only subsets of layers share a common representational language, suggesting structural redundancy and offering opportunities for model compression and improved interpretability.
[AI-14] RRT*former: Environment-Aware Sampling-Based Motion Planning using Transformer IROS2025
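[Quick read]: This paper addresses sampling-based optimal path planning for robots in complex and dynamic environments, where existing methods typically neglect environmental information and the cues provided by previous samples, leaving the sampling heuristic poorly guided. The key to the solution is RRT*former, a new sampling-based planning algorithm that combines the standard RRT algorithm with a Transformer network: the Transformer extracts environmental features and fuses information from previous samples to guide the sampling of the next state more effectively, improving both path optimality and sampling efficiency.
Link: https://arxiv.org/abs/2511.15414
Authors: Mingyang Feng, Shaoyuan Li, Xiang Yin
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted to IROS 2025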
Abstract:We investigate the sampling-based optimal path planning problem for robotics in complex and dynamic environments. Most existing sampling-based algorithms neglect environmental information or the information from previous samples. Yet, these pieces of information are highly informative, as leveraging them can provide better heuristics when sampling the next state. In this paper, we propose a novel sampling-based planning algorithm, called RRT*former, which integrates the standard RRT algorithm with a Transformer network in a novel way. Specifically, the Transformer is used to extract features from the environment and leverage information from previous samples to better guide the sampling process. Our extensive experiments demonstrate that, compared to existing sampling-based approaches such as RRT*, Neural RRT*, and their variants, our algorithm achieves considerable improvements in both the optimality of the path and sampling efficiency. The code for our implementation is available on this https URL.
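[AI-15] Terra Nova: A Comprehensive Challenge Environment for Intelligent Agents
[Quick read]: This paper addresses the lack of comprehensive challenge environments in reinforcement learning (RL) research: existing benchmarks often test a single capability or aggregate independent tasks in parallel, rather than probing an agent's deep reasoning and long-horizon planning in a complex, dynamically interacting system. The key to the solution is Terra Nova, a comprehensive challenge environment (CCE) inspired by Civilization V, in which multiple canonical RL challenges (such as partial observability, credit assignment, representation learning, and enormous action spaces) arise simultaneously and are coupled with one another, so that an agent needs integrated understanding across many variables and long-horizon decision-making rather than merely switching between policies.
Link: https://arxiv.org/abs/2511.15378
Authors: Trevor McInroe
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: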
Abstract:We introduce Terra Nova, a new comprehensive challenge environment (CCE) for reinforcement learning (RL) research inspired by Civilization V. A CCE is a single environment in which multiple canonical RL challenges (e.g., partial observability, credit assignment, representation learning, enormous action spaces, etc.) arise simultaneously. Mastery therefore demands integrated, long-horizon understanding across many interacting variables. We emphasize that this definition excludes challenges that only aggregate unrelated tasks in independent, parallel streams (e.g., learning to play all Atari games at once). These aggregated multitask benchmarks primarily assess whether an agent can catalog and switch among unrelated policies rather than test an agent’s ability to perform deep reasoning across many interacting challenges.
[AI-16] Parameter Importance-Driven Continual Learning for Foundation Models
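[Quick read]: This paper addresses the catastrophic forgetting caused by domain-specific post-training, that is, how to learn downstream domain knowledge efficiently while preserving a foundation model's general reasoning ability, and thereby improve the adaptability of large language and multimodal models to dynamic real-world environments. The key to the solution is PIECE, which estimates parameter importance with Fisher information (PIECE-F) or with a second-order normalization that combines gradient and curvature information (PIECE-S) and updates only the 0.1% of core parameters most relevant to the new task. It requires no access to prior training data and adds no model parameters, preserving general capabilities while improving continual learning performance.
Link: https://arxiv.org/abs/2511.15375
Authors: Lingxiang Wang, Hainan Zhang, Zhiming Zheng
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: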
Abstract:Domain-specific post-training often causes catastrophic forgetting, making foundation models lose their general reasoning ability and limiting their adaptability to dynamic real-world environments. Preserving general capabilities while acquiring downstream domain knowledge is a central challenge for large language and multimodal models. Traditional continual learning methods, such as regularization, replay and architectural isolation, suffer from poor downstream performance, reliance on inaccessible historical data, or additional parameter overhead. While recent parameter-efficient tuning (PET) methods can alleviate forgetting, their effectiveness strongly depends on the choice of parameters and update strategies. In this paper, we introduce PIECE, a Parameter Importance Estimation-based Continual Enhancement method that preserves general ability while efficiently learning domain knowledge without accessing prior training data or increasing model parameters. PIECE selectively updates only 0.1% of core parameters most relevant to new tasks, guided by two importance estimators: PIECE-F based on Fisher Information, and PIECE-S based on a second-order normalization that combines gradient and curvature information. Experiments across three language models and two multimodal models show that PIECE maintains general capabilities and achieves state-of-the-art continual learning performance across diverse downstream tasks. Our results highlight a practical path to scalable, domain-adaptive foundation models without catastrophic forgetting.
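The selective-update idea can be illustrated with a minimal sketch. It assumes a diagonal empirical Fisher estimate (the mean of squared per-sample gradients) as the importance score and a plain SGD step; the function names and the toy numbers are hypothetical, and this is not the authors' PIECE implementation.

```python
def fisher_importance(per_sample_grads):
    """Diagonal empirical Fisher estimate: mean squared gradient per parameter."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def top_fraction_mask(importance, fraction):
    """Mark the `fraction` of parameters with the highest importance (at least one)."""
    k = max(1, int(len(importance) * fraction))
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(importance))]


def masked_sgd_step(params, grad, mask, lr):
    """Update only the selected parameters; all others stay frozen."""
    return [p - lr * g if m else p for p, g, m in zip(params, grad, mask)]


# Toy example (assumed values): 4 parameters, 2 samples, keep the top 25%.
grads = [[0.1, -2.0, 0.0, 0.3], [0.2, 1.0, 0.0, -0.1]]
mask = top_fraction_mask(fisher_importance(grads), fraction=0.25)
new_params = masked_sgd_step([1.0, 1.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.5], mask, lr=0.1)
```

In a real model the fraction would be 0.001 (the 0.1% quoted in the abstract) and the gradients would come from the new-task loss; only the masked entries ever change, which is why no extra parameters or replay data are needed in this sketch.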
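[AI-17] Reflexive Evidence-Based Multimodal Learning for Clean Energy Transitions: Causal Insights on Cooking Fuel Access, Urbanization, and Carbon Emissions
[Quick read]: This paper addresses a key scientific question in achieving Sustainable Development Goal 7 (Affordable and Clean Energy): how to quantify the impact of socioeconomic factors on energy access and carbon emissions, model their cross-domain interactions, and capture feedback dynamics in the context of energy transitions. The core of the solution is the ClimateAgents framework, which combines large language models with domain-specialized agents. Using 20 years of data covering 98 indicators for 265 economies from the World Bank database, it applies a machine-learning-based causal inference approach to identify the key determinants of carbon emissions and produce interpretable, actionable policy insights. By integrating heterogeneous data modalities, including structured indicators, policy documents, and semantic reasoning, the approach builds a modular, reflexive learning system and supports a shift from siloed modeling toward adaptive governance infrastructure for dynamic, context-aware climate action.
Link: https://arxiv.org/abs/2511.15342
Authors: Shan Shan
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: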
Abstract:Achieving Sustainable Development Goal 7 (Affordable and Clean Energy) requires not only technological innovation but also a deeper understanding of the socioeconomic factors influencing energy access and carbon emissions. While these factors are gaining attention, critical questions remain, particularly regarding how to quantify their impacts on energy systems, model their cross-domain interactions, and capture feedback dynamics in the broader context of energy transitions. To address these gaps, this study introduces ClimateAgents, an AI-based framework that combines large language models with domain-specialized agents to support hypothesis generation and scenario exploration. Leveraging 20 years of socioeconomic and emissions data from 265 economies, countries and regions, and 98 indicators drawn from the World Bank database, the framework applies a machine learning based causal inference approach to identify key determinants of carbon emissions in an evidence-based, data driven manner. The analysis highlights three primary drivers: access to clean cooking fuels in rural areas, access to clean cooking fuels in urban areas, and the percentage of population living in urban areas. These findings underscore the critical role of clean cooking technologies and urbanization patterns in shaping emission outcomes. In line with growing calls for evidence-based AI policy, ClimateAgents offers a modular and reflexive learning system that supports the generation of credible and actionable insights for policy. By integrating heterogeneous data modalities, including structured indicators, policy documents, and semantic reasoning, the framework contributes to adaptive policymaking infrastructures that can evolve with complex socio-technical challenges. This approach aims to support a shift from siloed modeling to reflexive, modular systems designed for dynamic, context-aware climate action.
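[AI-18] STREAM-VAE: Dual-Path Routing for Slow and Fast Dynamics in Vehicle Telemetry Anomaly Detection
[Quick read]: This paper addresses the difficulty of anomaly detection in automotive telemetry data, where slow drifts and fast spikes coexist. Conventional reconstruction-based models such as sequence variational autoencoders (VAEs) mix signals of different time scales in a single latent process, so spikes are often smoothed out or variances inflated, which weakens anomaly separation. The key to the solution is STREAM-VAE, which uses a dual-path encoder to explicitly separate slow-drift and fast-spike dynamics and a decoder that models transient deviations separately from the normal operating pattern. This improves the stability and robustness of anomaly scores, making the model particularly suited to in-vehicle monitoring and backend fleet analytics.
Link: https://arxiv.org/abs/2511.15339
Authors: Kadir-Kaan Özer,René Ebeling,Markus Enzweiler
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 Pages, 4 Figures, 4 Tables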
Abstract:Automotive telemetry data exhibits slow drifts and fast spikes, often within the same sequence, making reliable anomaly detection challenging. Standard reconstruction-based methods, including sequence variational autoencoders (VAEs), use a single latent process and therefore mix heterogeneous time scales, which can smooth out spikes or inflate variances and weaken anomaly separation. In this paper, we present STREAM-VAE, a variational autoencoder for anomaly detection in automotive telemetry time-series data. Our model uses a dual-path encoder to separate slow drift and fast spike signal dynamics, and a decoder that represents transient deviations separately from the normal operating pattern. STREAM-VAE is designed for deployment, producing stable anomaly scores across operating modes for both in-vehicle monitors and backend fleet analytics. Experiments on an automotive telemetry dataset and the public SMD benchmark show that explicitly separating drift and spike dynamics improves robustness compared to strong forecasting, attention, graph, and VAE baselines.
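[AI-19] Path Planning through Multi-Agent Reinforcement Learning in Dynamic Environments
[Quick read]: This paper addresses path planning in dynamic environments, that is, how to plan paths efficiently, scalably, and adaptively when obstacles and environmental conditions change over time. Conventional approaches typically assume complete environmental unpredictability or rely on a global planner, which limits deployment in real-world settings. The key to the solution is a hierarchical, region-aware reinforcement learning (RL) framework that exploits the fact that environmental changes are often confined to local regions: the environment is decomposed hierarchically, and distributed RL agents are deployed to adapt locally. A retraining mechanism based on sub-environment success rates decides when policy updates are needed, and two training paradigms are compared, single-agent Q-learning and multi-agent federated Q-learning, where the federated variant periodically aggregates local Q-tables to accelerate learning and improve performance. Experiments show that in complex dynamic environments the federated approach clearly outperforms the single-agent one and approaches the performance of an A* Oracle, with shorter adaptation times and good scalability.
Link: https://arxiv.org/abs/2511.15284
Authors: Jonas De Maeyer,Hossein Yarahmadi,Moharram Challenger
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: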
Abstract:Path planning in dynamic environments is a fundamental challenge in intelligent transportation and robotics, where obstacles and conditions change over time, introducing uncertainty and requiring continuous adaptation. While existing approaches often assume complete environmental unpredictability or rely on global planners, these assumptions limit scalability and practical deployment in real-world settings. In this paper, we propose a scalable, region-aware reinforcement learning (RL) framework for path planning in dynamic environments. Our method builds on the observation that environmental changes, although dynamic, are often localized within bounded regions. To exploit this, we introduce a hierarchical decomposition of the environment and deploy distributed RL agents that adapt to changes locally. We further propose a retraining mechanism based on sub-environment success rates to determine when policy updates are necessary. Two training paradigms are explored: single-agent Q-learning and multi-agent federated Q-learning, where local Q-tables are aggregated periodically to accelerate the learning process. Unlike prior work, we evaluate our methods in more realistic settings, where multiple simultaneous obstacle changes and increasing difficulty levels are present. Results show that the federated variants consistently outperform their single-agent counterparts and closely approach the performance of A* Oracle while maintaining shorter adaptation times and robust scalability. Although initial training remains time-consuming in large environments, our decentralized framework eliminates the need for a global planner and lays the groundwork for future improvements using deep RL and flexible environment decomposition.
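The periodic aggregation of local Q-tables can be sketched as follows. The abstract does not state the aggregation rule, so this sketch assumes a plain element-wise mean across agents; the table layout (a dict mapping each state to a list of action values) and all names are hypothetical.

```python
def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step, applied in place to a single agent's table."""
    best_next = max(q[next_state])
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])


def federated_average(q_tables):
    """Element-wise mean of the agents' Q-tables (assumed aggregation rule)."""
    n = len(q_tables)
    return {
        s: [sum(q[s][a] for q in q_tables) / n for a in range(len(row))]
        for s, row in q_tables[0].items()
    }


def new_table(n_states=2, n_actions=2):
    """Zero-initialised Q-table: state -> list of action values."""
    return {s: [0.0] * n_actions for s in range(n_states)}


# Two agents learn locally on different experience, then share one averaged table.
q_a, q_b = new_table(), new_table()
q_update(q_a, state=0, action=1, reward=1.0, next_state=1)
q_update(q_b, state=0, action=1, reward=0.0, next_state=1)
shared = federated_average([q_a, q_b])
```

After aggregation, each agent would continue training from a copy of `shared`, so experience gathered in one region can speed up learning in the others.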
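[AI-20] Realist and Pluralist Conceptions of Intelligence and Their Implications on AI Research AAAI
[Quick read]: This paper addresses deep disagreements in current artificial intelligence (AI) research that stem from differing views of what intelligence is. At the theoretical level, AI researchers implicitly hold one of two opposing conceptions: Intelligence Realism, which holds that intelligence is a single, universal capacity measurable across systems, and Intelligence Pluralism, which holds that intelligence consists of diverse, context-dependent capacities that cannot be reduced to a single measure. The paper's core contribution is to expose and distinguish these two implicit assumptions, showing that they fundamentally shape key research practices such as model selection, benchmark design, interpretation of empirical evidence, and risk assessment, and that they lead to opposite conclusions about the same phenomena (such as capability emergence or system limitations). By making these foundational assumptions explicit, the paper offers a conceptual framework for clarifying disputes within AI research and for developing better-targeted research directions with broader consensus.
Link: https://arxiv.org/abs/2511.15282
Authors: Ninell Oldenburg,Ruchira Dhar,Anders Søgaard
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: The 40th Annual AAAI Conference on Artificial Intelligence, 8 pages (excl. references), 1 table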
Abstract:In this paper, we argue that current AI research operates on a spectrum between two different underlying conceptions of intelligence: Intelligence Realism, which holds that intelligence represents a single, universal capacity measurable across all systems, and Intelligence Pluralism, which views intelligence as diverse, context-dependent capacities that cannot be reduced to a single universal measure. Through an analysis of current debates in AI research, we demonstrate how the conceptions remain largely implicit yet fundamentally shape how empirical evidence gets interpreted across a wide range of areas. These underlying views generate fundamentally different research approaches across three areas. Methodologically, they produce different approaches to model selection, benchmark design, and experimental validation. Interpretively, they lead to contradictory readings of the same empirical phenomena, from capability emergence to system limitations. Regarding AI risk, they generate categorically different assessments: realists view superintelligence as the primary risk and search for unified alignment solutions, while pluralists see diverse threats across different domains requiring context-specific solutions. We argue that making explicit these underlying assumptions can contribute to a clearer understanding of disagreements in AI research.
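[AI-21] Behavior Trees vs Executable Ontologies: a Comparative Analysis of Robot Control Paradigms
[Quick read]: This paper addresses the "semantic-process gap" in traditional robot control: the lack of consistency and maintainability between the programmed implementation and the semantics of the physical world. The key to the solution is to propose and validate an alternative framework based on Executable Ontologies (EO), which models robot behavior as a temporal, event-based semantic graph driven by dataflow rules, replacing the control-flow structure of polling-based, imperative Behavior Trees (BTs). EO uses event-driven state propagation and supports runtime model modification, full temporal traceability, and a unified representation of data, logic, and interface, which gives it greater flexibility and extensibility than BTs in dynamic, evolving systems.
Link: https://arxiv.org/abs/2511.15274
Authors: Alexander Boldachev
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
Comments: 22 pages, 8 figures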
Abstract:This paper compares two distinct approaches to modeling robotic behavior: imperative Behavior Trees (BTs) and declarative Executable Ontologies (EO), implemented through the boldsea framework. BTs structure behavior hierarchically using control-flow, whereas EO represents the domain as a temporal, event-based semantic graph driven by dataflow rules. We demonstrate that EO achieves comparable reactivity and modularity to BTs through a fundamentally different architecture: replacing polling-based tick execution with event-driven state propagation. We propose that EO offers an alternative framework, moving from procedural programming to semantic domain modeling, to address the semantic-process gap in traditional robotic control. EO supports runtime model modification, full temporal traceability, and a unified representation of data, logic, and interface - features that are difficult or sometimes impossible to achieve with BTs, although BTs excel in established, predictable scenarios. The comparison is grounded in a practical mobile manipulation task. This comparison highlights the respective operational strengths of each approach in dynamic, evolving robotic systems.
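[AI-22] Efficiency Will Not Lead to Sustainable Reasoning AI
[Quick read]: This paper addresses the following problem: as AI moves toward complex problem solving, gains in reasoning ability are no longer limited by the amount of training data but depend on exponentially growing compute investments, so energy consumption keeps rising while conventional efficiency improvements approach physical limits and cannot deliver sustainability. The key to the solution is not to rely on efficiency gains alone, but to embed explicit limits on resource use into model optimization and governance, with research and policy developed together, so that reasoning AI can become sustainable.
Link: https://arxiv.org/abs/2511.15259
Authors: Philipp Wiesner,Daniel W. O’Neill,Francesca Larosa,Odej Kao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Presented at the Rethinking AI Workshop @ EurIPS’25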
Abstract:AI research is increasingly moving toward complex problem solving, where models are optimized not only for pattern recognition but for multi-step reasoning. Historically, computing’s global energy footprint has been stabilized by sustained efficiency gains and natural saturation thresholds in demand. But as efficiency improvements are approaching physical limits, emerging reasoning AI lacks comparable saturation points: performance is no longer limited by the amount of available training data but continues to scale with exponential compute investments in both training and inference. This paper argues that efficiency alone will not lead to sustainable reasoning AI and discusses research and policy directions to embed explicit limits into the optimization and governance of such systems.
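[AI-23] PresentCoach: Dual-Agent Presentation Coaching through Exemplars and Interactive Feedback
[Quick read]: This paper addresses learners' lack of high-quality exemplars and personalized coaching when improving their presentation skills. Existing AI tools are mostly limited to single functions, such as speech scoring or script generation, and do not integrate reference modeling and interactive feedback into a coherent learning loop. The key to the solution is a dual-agent system. The Ideal Presentation Agent processes the user's slides, generates a narration script, synthesizes a personalized voice, and assembles everything into a synchronized, high-quality presentation video that serves as an exemplar. The Coach Agent evaluates the user's recorded presentation through multimodal speech analysis and delivers feedback in a structured Observation-Impact-Suggestion (OIS) format; it also incorporates an Audience Agent that simulates a human listener's perspective to make the feedback more authentic and immersive. Together they form a closed learning loop of observation, practice, and feedback.
Link: https://arxiv.org/abs/2511.15253
Authors: Sirui Chen,Jinsong Zhou,Xinli Xu,Xiaoyu Yang,Litao Guo,Ying-Cong Chen
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 13 pages, 6 figures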
Abstract:Effective presentation skills are essential in education, professional communication, and public speaking, yet learners often lack access to high-quality exemplars or personalized coaching. Existing AI tools typically provide isolated functionalities such as speech scoring or script generation without integrating reference modeling and interactive feedback into a cohesive learning experience. We introduce a dual-agent system that supports presentation practice through two complementary roles: the Ideal Presentation Agent and the Coach Agent. The Ideal Presentation Agent converts user-provided slides into model presentation videos by combining slide processing, visual-language analysis, narration script generation, personalized voice synthesis, and synchronized video assembly. The Coach Agent then evaluates user-recorded presentations against these exemplars, conducting multimodal speech analysis and delivering structured feedback in an Observation-Impact-Suggestion (OIS) format. To enhance the authenticity of the learning experience, the Coach Agent incorporates an Audience Agent, which simulates the perspective of a human listener and provides humanized feedback reflecting audience reactions and engagement. Together, these agents form a closed loop of observation, practice, and feedback. Implemented on a robust backend with multi-model integration, voice cloning, and error handling mechanisms, the system demonstrates how AI-driven agents can provide engaging, human-centered, and scalable support for presentation skill development in both educational and professional contexts.
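[AI-24] EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral Control
[Quick read]: This paper addresses the loss of exploration caused by unstable entropy during long-term reinforcement learning (RL) training of large language models (LLMs), where the model tends to converge prematurely to sub-optimal behaviors. The key to the solution is EntroPIC, an entropy stabilization method based on Proportional-Integral Control (PIC). It dynamically adjusts the loss coefficients of positive and negative samples to adaptively balance their effects on entropy, keeping entropy stable throughout training and ensuring efficient exploration and steady optimization progress.
Link: https://arxiv.org/abs/2511.15248
Authors: Kai Yang,Xin Xu,Yangkun Chen,Weijie Liu,Jiafei Lyu,Zichuan Lin,Deheng Ye,Saiyong Yang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: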
Abstract:Long-term training of large language models (LLMs) requires maintaining stable exploration to prevent the model from collapsing into sub-optimal behaviors. Entropy is crucial in this context, as it controls exploration and helps avoid premature convergence to sub-optimal solutions. However, existing reinforcement learning methods struggle to maintain an appropriate level of entropy, as the training process involves a mix of positive and negative samples, each affecting entropy in different ways across steps. To address this, we propose Entropy stabilization via Proportional-Integral Control (EntroPIC), a novel method that adaptively adjusts the influence of positive and negative samples by dynamically tuning their loss coefficients. This approach stabilizes entropy throughout training, ensuring efficient exploration and steady progress. We provide a comprehensive theoretical analysis for both on-policy and off-policy learning settings, demonstrating that EntroPIC is effective at controlling entropy in large-scale LLM training. Experimental results show that our method successfully maintains desired entropy levels, enabling stable and optimal RL training for LLMs.
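A proportional-integral controller of the kind named here can be sketched in a few lines. The abstract does not give the mapping from the control signal to the loss coefficients, so the symmetric rule below is an assumption: it presumes that positive samples tend to lower entropy and negative samples tend to raise it. The gains `kp` and `ki`, the target entropy, and all names are hypothetical.

```python
class PIController:
    """Proportional-integral controller tracking a target entropy."""

    def __init__(self, kp, ki, target):
        self.kp = kp
        self.ki = ki
        self.target = target
        self.integral = 0.0  # running sum of past errors

    def step(self, entropy):
        """Return the control signal for the current measured entropy."""
        error = entropy - self.target  # negative when entropy is too low
        self.integral += error
        return self.kp * error + self.ki * self.integral


def loss_coefficients(control, base=1.0):
    """Assumed mapping: low entropy (control < 0) down-weights positive samples
    and up-weights negative samples; high entropy does the opposite."""
    positive = max(0.0, base + control)
    negative = max(0.0, base - control)
    return positive, negative


# Entropy stays below the target for two steps, so the integral term keeps growing.
controller = PIController(kp=0.5, ki=0.1, target=1.0)
u1 = controller.step(0.8)
pos_coef, neg_coef = loss_coefficients(u1)
u2 = controller.step(0.8)
```

The integral term is what distinguishes this from a purely proportional rule: a persistent entropy deficit produces a steadily stronger correction instead of a constant one.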
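[AI-25] Taxonomy, Evaluation and Exploitation of IPI-Centric LLM Agent Defense Frameworks
[Quick read]: This paper addresses the vulnerability of large language model (LLM)-based agents with function-calling capabilities to Indirect Prompt Injection (IPI) attacks. Existing defense frameworks are fragmented and lack a unified taxonomy and systematic evaluation, so the protection they offer is unclear and easily bypassed. The key contribution is the first comprehensive taxonomy of IPI defense frameworks, spanning five dimensions, together with a security and usability evaluation of representative defenses. By analyzing why defenses fail, the paper identifies six root causes of defense circumvention and designs three novel adaptive attacks that significantly raise attack success rates against specific frameworks, exposing serious flaws in existing defenses and providing a theoretical basis and practical guidance for building more secure and usable IPI defenses.
Link: https://arxiv.org/abs/2511.15203
Authors: Zimo Ji,Xunguang Wang,Zongjie Li,Pingchuan Ma,Yudong Gao,Daoyuan Wu,Xincheng Yan,Tian Tian,Shuai Wang
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: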
Abstract:Large Language Model (LLM)-based agents with function-calling capabilities are increasingly deployed, but remain vulnerable to Indirect Prompt Injection (IPI) attacks that hijack their tool calls. In response, numerous IPI-centric defense frameworks have emerged. However, these defenses are fragmented, lacking a unified taxonomy and comprehensive evaluation. In this Systematization of Knowledge (SoK), we present the first comprehensive analysis of IPI-centric defense frameworks. We introduce a comprehensive taxonomy of these defenses, classifying them along five dimensions. We then thoroughly assess the security and usability of representative defense frameworks. Through analysis of defensive failures in the assessment, we identify six root causes of defense circumvention. Based on these findings, we design three novel adaptive attacks that significantly improve attack success rates targeting specific frameworks, demonstrating the severity of the flaws in these defenses. Our paper provides a foundation and critical insights for the future development of more secure and usable IPI-centric agent defense frameworks.
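[AI-26] SOLID: a Framework of Synergizing Optimization and LLMs for Intelligent Decision-Making NEURIPS2025
[Quick read]: This paper addresses the limited ability of traditional decision-making methods to combine mathematical optimization with contextual understanding, especially the difficulty of fusing structured data (such as historical prices) with unstructured information (such as financial news) to improve decision quality in complex environments. The key to the solution is the SOLID framework, which coordinates an optimization model and large language model (LLM) agents iteratively through dual prices and deviation penalties. This improves decision quality while preserving modularity and data privacy, and it retains theoretical convergence guarantees under convexity assumptions.
Link: https://arxiv.org/abs/2511.15202
Authors: Yinsheng Wang,Tario G You,Léonard Boussioux,Shan Liu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: NeurIPS 2025 WORKSHOP ML*OR Workshop: Mathematical Foundations and Operational Integration of Machine Learning for Uncertainty-Aware Decision-Making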
Abstract:This paper introduces SOLID (Synergizing Optimization and Large Language Models for Intelligent Decision-Making), a novel framework that integrates mathematical optimization with the contextual capabilities of large language models (LLMs). SOLID facilitates iterative collaboration between optimization and LLM agents through dual prices and deviation penalties. This interaction improves the quality of the decisions while maintaining modularity and data privacy. The framework retains theoretical convergence guarantees under convexity assumptions, providing insight into the design of LLM prompts. To evaluate SOLID, we applied it to a stock portfolio investment case with historical prices and financial news as inputs. Empirical results demonstrate convergence under various scenarios and indicate improved annualized returns compared to a baseline optimizer-only method, validating the synergy of the two agents. SOLID offers a promising framework for advancing automated and intelligent decision-making across diverse domains.
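[AI-27] Eq.Bot: Enhance Robotic Manipulation Learning via Group Equivariant Canonicalization
[Quick read]: This paper addresses the lack of geometric consistency guarantees in current multi-modal learning frameworks for robotic manipulation, which handle spatial transformations such as rotations and translations poorly. Existing methods try to introduce equivariance through bespoke architectural modifications, but they suffer from high implementation complexity, heavy computational cost, and poor portability. The key to the solution is a universal canonicalization framework grounded in SE(2) group equivariance theory: observations are mapped into a canonical space, an existing policy is applied there, and the resulting actions are mapped back to the original space. This endows the model with spatial equivariance, and hence robustness to spatial transformations, without changing its architecture.
Link: https://arxiv.org/abs/2511.15194
Authors: Jian Deng,Yuandong Wang,Yangfu Zhu,Tao Feng,Tianyu Wo,Zhenzhou Shao
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 12 pages, 4 figures and 3 tables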
Abstract:Robotic manipulation systems are increasingly deployed across diverse domains. Yet existing multi-modal learning frameworks lack inherent guarantees of geometric consistency, struggling to handle spatial transformations such as rotations and translations. While recent works attempt to introduce equivariance through bespoke architectural modifications, these methods suffer from high implementation complexity, computational cost, and poor portability. Inspired by human cognitive processes in spatial reasoning, we propose Eq.Bot, a universal canonicalization framework grounded in SE(2) group equivariant theory for robotic manipulation learning. Our framework transforms observations into a canonical space, applies an existing policy, and maps the resulting actions back to the original space. As a model-agnostic solution, Eq.Bot aims to endow models with spatial equivariance without requiring architectural modifications. Extensive experiments demonstrate the superiority of Eq.Bot under both CNN-based (e.g., CLIPort) and Transformer-based (e.g., OpenVLA-OFT) architectures over existing methods on various robotic manipulation tasks, where the most significant improvement can reach 50.0%.
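The canonicalize, act, and map-back pipeline described in the abstract can be illustrated with a toy 2D sketch. Everything here is an assumption for illustration: observations are lists of 2D points, an action is a single 2D target point, and the canonical frame is read off the first two points by hand, whereas the actual framework presumably estimates it differently.

```python
import math


def apply_se2(pose, point):
    """Apply the SE(2) transform pose = (theta, tx, ty) to a 2D point."""
    theta, tx, ty = pose
    x, y = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y + tx, s * x + c * y + ty)


def invert_se2(pose):
    """Inverse transform: p = R^T (p' - t)."""
    theta, tx, ty = pose
    c, s = math.cos(theta), math.sin(theta)
    return (-theta, -(c * tx + s * ty), s * tx - c * ty)


def estimate_frame(points):
    """Hand-written stand-in for a canonicalization function: origin at the first
    point, x-axis pointing toward the second point."""
    (x0, y0), (x1, y1) = points[0], points[1]
    return (math.atan2(y1 - y0, x1 - x0), x0, y0)


def canonical_policy(points):
    """Arbitrary policy that is only meaningful in the canonical frame."""
    return (0.5 * points[1][0], 0.1)


def equivariant_policy(points):
    """Canonicalize the observation, act there, then map the action back."""
    frame = estimate_frame(points)
    to_canonical = invert_se2(frame)
    canonical_points = [apply_se2(to_canonical, p) for p in points]
    return apply_se2(frame, canonical_policy(canonical_points))


# Moving the whole scene by h moves the predicted action by the same h.
scene = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0)]
h = (0.7, 3.0, -1.0)
moved_scene = [apply_se2(h, p) for p in scene]
action = equivariant_policy(scene)
moved_action = equivariant_policy(moved_scene)
expected = apply_se2(h, action)
```

The inner `canonical_policy` is left untouched, which mirrors the model-agnostic claim: equivariance comes from the wrapper rather than from the policy's architecture.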
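[AI-28] As If We've Met Before: LLMs Exhibit Certainty in Recognizing Seen Files
[Quick read]: This paper addresses the legal and ethical risks of training large language models (LLMs) on unauthorized copyrighted content, and in particular how to detect effectively whether training data contains copyrighted material. Existing methods such as Membership Inference Attacks (MIAs) are limited by LLMs' inherent overconfidence, the lack of ground-truth training data labels, and reliance on empirically set thresholds, and they struggle to achieve accurate detection. The paper proposes the COPYCHECK framework, whose core idea is to turn LLM overconfidence into an asset by capturing uncertainty signals that distinguish "seen" (training data) from "unseen" (non-training data) content. It further adopts a two-fold strategy: fine-grained segmentation of files to reduce dependence on large-scale training data, and uncertainty-guided unsupervised clustering to remove the need for threshold tuning. Experiments show that COPYCHECK achieves an average balanced accuracy of 90.1% on LLaMA 7b and 91.6% on LLaMA2 7b, a relative improvement of over 90% compared with the state-of-the-art baseline, and generalizes across architectures. It is the first application of uncertainty modeling to copyright detection in LLMs and offers a practical tool for training data transparency.
Link: https://arxiv.org/abs/2511.15192
Authors: Haodong Li,Jingqi Zhang,Xiao Cheng,Peihua Mai,Haoyu Wang,Yang Pan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: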
Abstract:The remarkable language ability of Large Language Models (LLMs) stems from extensive training on vast datasets, often including copyrighted material, which raises serious concerns about unauthorized use. While Membership Inference Attacks (MIAs) offer potential solutions for detecting such violations, existing approaches face critical limitations and challenges due to LLMs’ inherent overconfidence, limited access to ground truth training data, and reliance on empirically determined thresholds. We present COPYCHECK, a novel framework that leverages uncertainty signals to detect whether copyrighted content was used in LLM training sets. Our method turns LLM overconfidence from a limitation into an asset by capturing uncertainty patterns that reliably distinguish between "seen" (training data) and "unseen" (non-training data) content. COPYCHECK further implements a two-fold strategy: (1) strategic segmentation of files into smaller snippets to reduce dependence on large-scale training data, and (2) uncertainty-guided unsupervised clustering to eliminate the need for empirically tuned thresholds. Experimental results show that COPYCHECK achieves an average balanced accuracy of 90.1% on LLaMA 7b and 91.6% on LLaMA2 7b in detecting seen files. Compared to the SOTA baseline, COPYCHECK achieves over 90% relative improvement, reaching up to 93.8% balanced accuracy. It further exhibits strong generalizability across architectures, maintaining high performance on GPT-J 6B. This work presents the first application of uncertainty for copyright detection in LLMs, offering practical tools for training data transparency.
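The threshold-free step can be illustrated with a rough sketch. The abstract does not specify the uncertainty features or the clustering algorithm, so this sketch assumes mean token entropy as the per-snippet signal and a one-dimensional two-cluster k-means; the function names and the example scores are hypothetical, and which cluster corresponds to "seen" content is left open.

```python
import math


def mean_token_entropy(token_distributions):
    """Average Shannon entropy (in nats) over a snippet's next-token distributions."""
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0) for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)


def two_means_1d(values, iterations=20):
    """Split scalar scores into two clusters without a hand-tuned threshold.
    Returns 0 for the low-score cluster and 1 for the high-score cluster."""
    low_center, high_center = min(values), max(values)
    for _ in range(iterations):
        low = [v for v in values if abs(v - low_center) <= abs(v - high_center)]
        high = [v for v in values if abs(v - low_center) > abs(v - high_center)]
        low_center = sum(low) / len(low)
        if high:
            high_center = sum(high) / len(high)
    return [0 if abs(v - low_center) <= abs(v - high_center) else 1 for v in values]


# Assumed per-snippet uncertainty scores for five snippets of one file.
snippet_scores = [0.10, 0.12, 0.95, 1.05, 0.08]
labels = two_means_1d(snippet_scores)
uniform_entropy = mean_token_entropy([[0.5, 0.5]])
```

Because the split comes from the data itself, no decision threshold has to be calibrated on labelled training members, which is the property the abstract emphasises.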
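[AI-29] HISE-KT: Synergizing Heterogeneous Information Networks and LLMs for Explainable Knowledge Tracing with Meta-Path Optimization
[Quick read]: This paper addresses two core problems in Knowledge Tracing (KT). First, methods based on heterogeneous information networks (HINs) introduce noise through manual or random selection of meta-paths and lack quality assessment of meta-path instances. Second, methods based on large language models (LLMs) ignore the rich information across students. Both kinds of methods also struggle to deliver consistently accurate and explainable predictions. The key to the solution is HISE-KT, a synergistic framework that integrates HINs with LLMs. Its main innovations are: 1) building a multi-relationship HIN with diverse node types to capture structural relations; 2) using an LLM to score and filter meta-path instances, which automates meta-path quality assessment; 3) a meta-path-based similar-student retrieval mechanism, inspired by educational psychology principles, that provides more useful context for prediction; and 4) a structured prompt that combines the target student's answer history with the retrieved similar learning trajectories, so that the LLM produces both accurate predictions and evidence-backed, explainable analysis reports.
Link: https://arxiv.org/abs/2511.15191
Authors: Zhiyi Duan,Zixing Shi,Hongyu Yuan,Qi Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: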
Abstract:Knowledge Tracing (KT) aims to mine students’ evolving knowledge states and predict their future question-answering performance. Existing methods based on heterogeneous information networks (HINs) are prone to introducing noise due to manual or random selection of meta-paths and lack necessary quality assessment of meta-path instances. Conversely, recent large language model (LLM)-based methods ignore the rich information across students, and both paradigms struggle to deliver consistently accurate and evidence-based explanations. To address these issues, we propose an innovative framework, HIN-LLM Synergistic Enhanced Knowledge Tracing (HISE-KT), which seamlessly integrates HINs with LLMs. HISE-KT first builds a multi-relationship HIN containing diverse node types to capture the structural relations through multiple meta-paths. The LLM is then employed to intelligently score and filter meta-path instances and retain high-quality paths, pioneering automated meta-path quality assessment. Inspired by educational psychology principles, a similar student retrieval mechanism based on meta-paths is designed to provide a more valuable context for prediction. Finally, HISE-KT uses a structured prompt to integrate the target student’s history with the retrieved similar trajectories, enabling the LLM to generate not only accurate predictions but also evidence-backed, explainable analysis reports. Experiments on four public datasets show that HISE-KT outperforms existing KT baselines in both prediction performance and interpretability.
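[AI-30] Masked Auto-Regressive Variational Acceleration: Fast Inference Makes Practical Reinforcement Learning
[Quick read]: This paper addresses the slow inference of masked auto-regressive diffusion models (MAR). The root cause is MAR's hierarchical inference mechanism, a decoupled structure of an outer auto-regressive unmasking loop and an inner diffusion denoising chain, which makes generation slow and hard to use in settings such as reinforcement learning (RL) that require fast sampling. The key to the solution is MARVAL (Masked Auto-regressive Variational Acceleration), a distillation-based framework with a score-based variational objective that compresses the diffusion chain into a single auto-regressive generation step while preserving the flexible auto-regressive unmasking order. It delivers more than a 30-fold inference speedup and makes RL post-training practical, which improves sample quality and alignment with human preferences.
Link: https://arxiv.org/abs/2511.15190
Authors: Yuxuan Gu,Weimin Bai,Yifei Wang,Weijian Luo,He Sun
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: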
Abstract:Masked auto-regressive diffusion models (MAR) benefit from the expressive modeling ability of diffusion models and the flexibility of masked auto-regressive ordering. However, vanilla MAR suffers from slow inference due to its hierarchical inference mechanism: an outer AR unmasking loop and an inner diffusion denoising chain. Such a decoupled structure not only harms generation efficiency but also hinders the practical use of MAR for reinforcement learning (RL), an increasingly critical paradigm for generative model this http URL. To address this fundamental issue, we introduce MARVAL (Masked Auto-regressive Variational Acceleration), a distillation-based framework that compresses the diffusion chain into a single AR generation step while preserving the flexible auto-regressive unmasking order. Such a distillation with MARVAL not only yields substantial inference acceleration but, crucially, makes RL post-training with verifiable rewards practical, resulting in scalable yet human-preferred fast generative models. Our contributions are twofold: (1) a novel score-based variational objective for distilling masked auto-regressive diffusion models into a single generation step without sacrificing sample quality; and (2) an efficient RL framework for masked auto-regressive models via MARVAL-RL. On ImageNet 256*256, MARVAL-Huge achieves an FID of 2.00 with more than 30 times speedup compared with MAR-diffusion, and MARVAL-RL yields consistent improvements in CLIP and image-reward scores on ImageNet datasets with entity names. In conclusion, MARVAL demonstrates the first practical path to distillation and RL of masked auto-regressive diffusion models, enabling fast sampling and better preference alignment.
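[AI-31] SWR-Viz: AI-assisted Interactive Visual Analytics Framework for Ship Weather Routing
[Quick read]: This paper addresses the difficulty of making efficient, sustainable routing decisions in maritime transport, caused by wave forecast latency and reliance on human judgment. The core problem is that existing wave forecast models cannot respond to changing ocean conditions in real time and offer no intuitive way to analyze emissions impact, which limits the practical use of adaptive route planning. The key to the solution is the SWR-Viz framework, which combines a physics-informed Fourier Neural Operator wave forecast model with the SIMROUTE routing algorithm and interactive emissions analytics. It produces near-real-time wave forecasts from currently observed conditions, assimilates sparse observations, and supports rapid exploration of what-if routing scenarios, which improves forecast stability and makes routing decisions better grounded and more practical.
Link: https://arxiv.org/abs/2511.15182
Authors: Subhashis Hazarika,Leonard Lupin-Jimenez,Rohit Vuppala,Ashesh Chattopadhyay,Hon Yung Wong
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: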
Abstract:Efficient and sustainable maritime transport increasingly depends on reliable forecasting and adaptive routing, yet operational adoption remains difficult due to forecast latencies and the need for human judgment in rapid decision-making under changing ocean conditions. We introduce SWR-Viz, an AI-assisted visual analytics framework that combines a physics-informed Fourier Neural Operator wave forecast model with SIMROUTE-based routing and interactive emissions analytics. The framework generates near-term forecasts directly from current conditions, supports data assimilation with sparse observations, and enables rapid exploration of what-if routing scenarios. We evaluate the forecast models and SWR-Viz framework along key shipping corridors in the Japan Coast and Gulf of Mexico, showing both improved forecast stability and realistic routing outcomes comparable to ground-truth reanalysis wave products. Expert feedback highlights the usability of SWR-Viz, its ability to isolate voyage segments with high emission reduction potential, and its value as a practical decision-support system. More broadly, this work illustrates how lightweight AI forecasting can be integrated with interactive visual analytics to support human-centered decision-making in complex geospatial and environmental domains.
[AI-32] FaultDiffusion: Few-Shot Fault Time Series Generation with Diffusion Model
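[Quick read]: This paper addresses the limited performance of data-driven fault diagnosis in industrial equipment monitoring caused by scarce fault data. In few-shot settings, existing time-series generation models struggle to capture fault distributions and produce samples that lack authenticity and diversity, mainly because of the large domain gap between normal and fault data and the high intra-class variability of faults. The key to the solution is a new few-shot fault time-series generation framework based on diffusion models. First, a positive-negative difference adapter leverages the pre-trained normal data distribution to model the discrepancy between the normal and fault domains, enabling accurate fault synthesis. Second, a diversity loss regularizes inter-sample differences to prevent mode collapse and increase sample diversity. Experiments show that the method significantly outperforms traditional methods in authenticity and diversity and achieves state-of-the-art performance on key benchmarks.
Link: https://arxiv.org/abs/2511.15174
Authors: Yi Xu, Zhigang Chen, Rui Wang, Yangfan Li, Fengxiao Tang, Ming Zhao, Jiaqi Liu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 4 figures, 5 tables, 8 pages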
Abstract:In industrial equipment monitoring, fault diagnosis is critical for ensuring system reliability and enabling predictive maintenance. However, the scarcity of fault data, due to the rarity of fault events and the high cost of data annotation, significantly hinders data-driven approaches. Existing time-series generation models, optimized for abundant normal data, struggle to capture fault distributions in few-shot scenarios, producing samples that lack authenticity and diversity due to the large domain gap and high intra-class variability of faults. To address this, we propose a novel few-shot fault time-series generation framework based on diffusion models. Our approach employs a positive-negative difference adapter, leveraging pre-trained normal data distributions to model the discrepancies between normal and fault domains for accurate fault synthesis. Additionally, a diversity loss is introduced to prevent mode collapse, encouraging the generation of diverse fault samples through inter-sample difference regularization. Experimental results demonstrate that our model significantly outperforms traditional methods in authenticity and diversity, achieving state-of-the-art performance on key benchmarks.
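The abstract describes the diversity loss only as "inter-sample difference regularization". One simple reading, offered here as a sketch and not as the authors' formulation, is to penalize generated batches whose samples sit close together. The function name `diversity_loss` and the choice of Euclidean distance are assumptions made for illustration.

```python
from itertools import combinations
from math import dist


def diversity_loss(samples):
    """Negative mean pairwise Euclidean distance between generated samples.

    Minimising this value pushes samples apart, which discourages the
    generator from collapsing onto a single mode. (Illustrative assumption:
    the paper's exact loss is not given in the abstract.)
    """
    pairs = list(combinations(samples, 2))
    return -sum(dist(a, b) for a, b in pairs) / len(pairs)


collapsed = [[0.0, 0.0], [0.0, 0.0]]   # two identical samples
spread = [[0.0, 0.0], [3.0, 4.0]]      # two samples 5 units apart
```

Under this sketch a collapsed batch gets a higher (worse) loss than a spread-out one, so adding the term to the generation objective rewards variety.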
[AI-33] SafeRBench: A Comprehensive Benchmark for Safety Assessment in Large Reasoning Models
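[Quick read]: This paper addresses the dynamic safety risks that arise while Large Reasoning Models (LRMs) generate their output: harmful content can surface gradually through subtle reasoning chains or be justified by misleading rationales, whereas existing safety evaluations focus on the final output and rarely capture how risk evolves along the way. The key to the solution is SafeRBench, the first benchmark that evaluates LRM safety end to end, with three contributions: (1) input design that incorporates risk categories and levels, yielding a balanced prompt suite covering different affected groups and degrees of severity; (2) a micro-thought chunking mechanism that splits long reasoning traces into semantically coherent units, enabling fine-grained analysis across ten safety dimensions; and (3) validation of LLM-based evaluations against human annotations, ensuring that the assessments are consistent with human safety judgments.
Link: https://arxiv.org/abs/2511.15169
Authors: Xin Gao, Shaohan Yu, Zerui Chen, Yueming Lyu, Weichen Yu, Guanghao Li, Jiyao Liu, Jianxiong Gao, Jian Liang, Ziwei Liu, Chenyang Si
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 30 pages, 8 figures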
Abstract:Large Reasoning Models (LRMs) improve answer quality through explicit chain-of-thought, yet this very capability introduces new safety risks: harmful content can be subtly injected, surface gradually, or be justified by misleading rationales within the reasoning trace. Existing safety evaluations, however, primarily focus on output-level judgments and rarely capture these dynamic risks along the reasoning process. In this paper, we present SafeRBench, the first benchmark that assesses LRM safety end-to-end – from inputs and intermediate reasoning to final outputs. (1) Input Characterization: We pioneer the incorporation of risk categories and levels into input design, explicitly accounting for affected groups and severity, and thereby establish a balanced prompt suite reflecting diverse harm gradients. (2) Fine-Grained Output Analysis: We introduce a micro-thought chunking mechanism to segment long reasoning traces into semantically coherent units, enabling fine-grained evaluation across ten safety dimensions. (3) Human Safety Alignment: We validate LLM-based evaluations against human annotations specifically designed to capture safety judgments. Evaluations on 19 LRMs demonstrate that SafeRBench enables detailed, multidimensional safety assessment, offering insights into risks and protective mechanisms from multiple perspectives.
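[AI-34] Finetuning LLMs for Automatic Form Interaction on Web-Browser in Selenium Testing Framework
[Quick read]: This paper addresses the limited ability of large language models (LLMs) to generate high-quality Selenium scripts for automated web application testing, and in particular the lack of a systematic benchmark and training data for form interaction testing. The key to the solution is a diverse dataset combining synthetic and human-annotated data, together with clearly defined metrics for syntax correctness, script executability, and input field coverage. With these, LLMs are trained to generate executable, high-coverage test cases for web form interaction, and the empirical study shows that the approach significantly outperforms strong LLM baselines such as GPT-4o.
Link: https://arxiv.org/abs/2511.15168
Authors: Nguyen-Khang Le, Nguyen Hiep, Minh Nguyen, Son Luu, Trung Vo, Quan Bui, Nomura Shoshin, Le-Minh Nguyen
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Published in the Proceedings of KSE 2025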
Abstract:Automated web application testing is a critical component of modern software development, with frameworks like Selenium widely adopted for validating functionality through browser automation. Among the essential aspects of such testing is the ability to interact with and validate web forms, a task that requires syntactically correct, executable scripts with high coverage of input fields. Despite its importance, this task remains underexplored in the context of large language models (LLMs), and no public benchmark or dataset exists to evaluate LLMs on form interaction generation systematically. This paper introduces a novel method for training LLMs to generate high-quality test cases in Selenium, specifically targeting form interaction testing. We curate both synthetic and human-annotated datasets for training and evaluation, covering diverse real-world forms and testing scenarios. We define clear metrics for syntax correctness, script executability, and input field coverage. Our empirical study demonstrates that our approach significantly outperforms strong baselines, including GPT-4o and other popular LLMs, across all evaluation metrics. Our work lays the groundwork for future research on LLM-based web testing and provides resources to support ongoing progress in this area.
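Two of the three metrics named in the abstract, syntax correctness and input field coverage, can be approximated offline. The sketch below is an assumption-laden illustration and not the authors' evaluation code: the helper names are hypothetical, and a field is counted as covered only when the script locates it with `By.ID` or `By.NAME`. Executability is left out because checking it requires a live browser session.

```python
import ast
import re


def is_syntactically_valid(script: str) -> bool:
    """Return True if the generated test script parses as Python source."""
    try:
        ast.parse(script)
        return True
    except SyntaxError:
        return False


# Hypothetical heuristic: match locators such as (By.ID, "email").
_LOCATOR = re.compile(r'By\.(?:ID|NAME)\s*,\s*["\']([^"\']+)["\']')


def field_coverage(script: str, form_fields: list[str]) -> float:
    """Fraction of the form's input fields that the script references."""
    if not form_fields:
        return 0.0
    touched = set(_LOCATOR.findall(script))
    return sum(f in touched for f in form_fields) / len(form_fields)


generated = '''
driver.find_element(By.ID, "email").send_keys("a@b.co")
driver.find_element(By.NAME, "password").send_keys("secret")
'''
```

For a form with the fields `email`, `password`, and `phone`, the sample script above covers two of three fields.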
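[AI-35] Can MLLMs Detect Phishing? A Comprehensive Security Benchmark Suite Focusing on Dynamic Threats and Multimodal Evaluation in Academic Environments
[Quick read]: This paper addresses the challenge of phishing detection by Multimodal Large Language Models (MLLMs) in academic environments, focusing on the dynamic, multilingual, context-dependent, and highly tailored attacks that target academic institutions and researchers. Existing security benchmark datasets generally lack academia-specific background information, so they fail to capture the evolving attack patterns and human-centric vulnerability factors of academic settings. The paper therefore proposes AdapT-Bench, a unified methodological framework and benchmark suite whose core purpose is to systematically evaluate how well MLLMs defend against dynamic phishing attacks in academic settings, filling a gap in current evaluations with respect to realistic academic threat modeling.
Link: https://arxiv.org/abs/2511.15165
Authors: Jingzhuo Zhou
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: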
Abstract:The rapid proliferation of Multimodal Large Language Models (MLLMs) has introduced unprecedented security challenges, particularly in phishing detection within academic environments. Academic institutions and researchers are high-value targets, facing dynamic, multilingual, and context-dependent threats that leverage research backgrounds, academic collaborations, and personal information to craft highly tailored attacks. Existing security benchmarks largely rely on datasets that do not incorporate specific academic background information, making them inadequate for capturing the evolving attack patterns and human-centric vulnerability factors specific to academia. To address this gap, we present AdapT-Bench, a unified methodological framework and benchmark suite for systematically evaluating MLLM defense capabilities against dynamic phishing attacks in academic settings.
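[AI-36] ItemRAG: Item-Based Retrieval-Augmented Generation for LLM-Based Recommendation
[Quick read]: This paper addresses how large language models (LLMs) used as recommender systems can make better use of co-occurrence relations among items, especially for cold-start items. Existing user-based retrieval-augmented generation (RAG) methods mainly rely on the purchase patterns of similar users to support the LLM's reasoning and struggle to capture latent associations between items. The key to the solution is ItemRAG, an item-based RAG method that retrieves relevant items, not users, from item-item co-purchase histories. Retrieval combines semantic similarity with co-purchase frequency, which helps the LLM model co-purchase patterns among items. The method improves zero-shot recommendation by up to 43% in Hit-Ratio-1 and outperforms user-based RAG baselines in both standard and cold-start settings.
Link: https://arxiv.org/abs/2511.15141
Authors: Sunwoo Kim, Geon Lee, Kyungho Kim, Jaemin Yoo, Kijung Shin
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: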
Abstract:Recently, large language models (LLMs) have been widely used as recommender systems, owing to their strong reasoning capability and their effectiveness in handling cold-start items. To better adapt LLMs for recommendation, retrieval-augmented generation (RAG) has been incorporated. Most existing RAG methods are user-based, retrieving purchase patterns of users similar to the target user and providing them to the LLM. In this work, we propose ItemRAG, an item-based RAG method for LLM-based recommendation that retrieves relevant items (rather than users) from item-item co-purchase histories. ItemRAG helps LLMs capture co-purchase patterns among items, which are beneficial for recommendations. In particular, our retrieval strategy incorporates semantically similar items to better handle cold-start items and uses co-purchase frequencies to improve the relevance of the retrieved items. Through extensive experiments, we demonstrate that ItemRAG consistently (1) improves the zero-shot LLM-based recommender by up to 43% in Hit-Ratio-1 and (2) outperforms user-based RAG baselines under both standard and cold-start item recommendation settings.
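The retrieval idea in the abstract, ranking by co-purchase frequency and falling back on semantically similar items for cold-start cases, can be sketched in a few lines. Everything below is a simplified assumption: the function names are hypothetical, the token-level Jaccard score stands in for a real embedding model, and the fallback rule (borrow the neighbours of the single most similar warm item) is only one possible choice, not necessarily the one ItemRAG uses.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_copurchase(histories):
    """Count how often each pair of items appears in the same user history."""
    co = defaultdict(Counter)
    for items in histories:
        for a, b in combinations(sorted(set(items)), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def jaccard(a: str, b: str) -> float:
    """Toy semantic similarity over title tokens (stand-in for embeddings)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def retrieve(item, co, titles, k=2):
    """Top-k co-purchased items; a cold-start item borrows the co-purchase
    list of its most similar warm item."""
    if not co.get(item):
        warm = [i for i in co if i != item]
        if not warm:
            return []
        item = max(warm, key=lambda i: jaccard(titles[item], titles[i]))
    return [i for i, _ in co[item].most_common(k)]


histories = [["tent", "stove", "lamp"], ["tent", "stove"],
             ["tent", "lamp"], ["stove", "mug"]]
titles = {"tent": "camping tent", "stove": "camping stove",
          "lamp": "camping lamp", "mug": "steel mug",
          "tent2": "ultralight camping tent"}
co = build_copurchase(histories)
```

The retrieved item list would then be formatted into the LLM prompt alongside the target user's history; that prompting step is outside the scope of this sketch.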
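[AI-37] From Solving to Verifying: A Unified Objective for Robust Reasoning in LLMs
[Quick read]: This paper addresses the difficulty large language models (LLMs) have in consistently verifying their own reasoning traces. It aims to strengthen self-verification and to examine whether that ability can further improve reasoning performance. The key to the solution is GRPO-Verif, an algorithm that jointly optimizes solution generation and self-verification within a unified loss function, with an adjustable hyperparameter controlling the weight of the verification signal. The method improves self-verification while keeping reasoning performance comparable.
Link: https://arxiv.org/abs/2511.15137
Authors: Xiaoxuan Wang, Bo Liu, Song Jiang, Jingzhou Liu, Jingyuan Qi, Xia Chen, Baosheng He
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: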
Abstract:The reasoning capabilities of large language models (LLMs) have been significantly improved through reinforcement learning (RL). Nevertheless, LLMs still struggle to consistently verify their own reasoning traces. This raises the research question of how to enhance the self-verification ability of LLMs and whether such an ability can further improve reasoning performance. In this work, we propose GRPO-Verif, an algorithm that jointly optimizes solution generation and self-verification within a unified loss function, with an adjustable hyperparameter controlling the weight of the verification signal. Experimental results demonstrate that our method enhances self-verification capability while maintaining comparable performance in reasoning.
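The abstract states only that solution generation and self-verification share one loss, with a hyperparameter weighting the verification signal. A minimal sketch of that structure follows. It assumes GRPO-style group-standardised advantages and omits clipping and KL regularisation; the names `unified_objective` and `alpha` are illustrative and do not come from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Standardise rewards within one sampled group (GRPO-style)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def unified_objective(sol_logps, sol_rewards, ver_logps, ver_rewards, alpha=0.5):
    """Solution surrogate plus an alpha-weighted verification surrogate.

    Each surrogate is the mean of advantage * log-probability over a group.
    (Assumption: the real objective also includes clipping and a KL term.)
    """
    sol_adv = group_advantages(sol_rewards)
    ver_adv = group_advantages(ver_rewards)
    sol = mean([a * lp for a, lp in zip(sol_adv, sol_logps)])
    ver = mean([a * lp for a, lp in zip(ver_adv, ver_logps)])
    return sol + alpha * ver
```

Setting `alpha=0` recovers a solution-only objective, which makes the hyperparameter a direct dial between pure solving and joint solving and verifying.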
[AI-38] Multi-Aspect Cross-modal Quantization for Generative Recommendation AAAI2026
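[Quick read]: This paper addresses the difficulty of constructing high-quality semantic identifiers (semantic IDs) for Generative Recommendation (GR). When multimodal information is under-used and deep interactions among modalities are poorly modeled, semantic IDs suffer from high conflict rates and poor usability, which in turn hurts generative model training. The key to the solution is Multi-Aspect Cross-modal quantization for generative Recommendation (MACRec). During semantic ID learning it introduces cross-modal quantization, using complementary multimodal information to lower the conflict rate and improve codebook utilization. During generative model training it adds multi-aspect cross-modal alignments, both explicit and implicit, to strengthen the generative ability of the GR model.
Link: https://arxiv.org/abs/2511.15122
Authors: Fuwei Zhang, Xiaoyu Liu, Dongbo Xi, Jishen Yin, Huan Chen, Peng Yan, Fuzhen Zhuang, Zhao Zhang
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026 (Oral)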
Abstract:Generative Recommendation (GR) has emerged as a new paradigm in recommender systems. This approach relies on quantized representations to discretize item features, modeling users’ historical interactions as sequences of discrete tokens. Based on these tokenized sequences, GR predicts the next item by employing next-token prediction methods. The challenges of GR lie in constructing high-quality semantic identifiers (IDs) that are hierarchically organized, minimally conflicting, and conducive to effective generative model training. However, current approaches remain limited in their ability to harness multimodal information and to capture the deep and intricate interactions among diverse modalities, both of which are essential for learning high-quality semantic IDs and for effectively training GR models. To address this, we propose Multi-Aspect Cross-modal quantization for generative Recommendation (MACRec), which introduces multimodal information and incorporates it into both semantic ID learning and generative model training from different aspects. Specifically, we first introduce cross-modal quantization during the ID learning process, which effectively reduces conflict rates and thus improves codebook usability through the complementary integration of multimodal information. In addition, to further enhance the generative ability of our GR model, we incorporate multi-aspect cross-modal alignments, including the implicit and explicit alignments. Finally, we conduct extensive experiments on three well-known recommendation datasets to demonstrate the effectiveness of our proposed method.
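The abstract refers to conflict rates and codebook usability without defining them. The sketch below uses one common reading, which is an assumption here: an item is in conflict when its full tuple of codes is shared with another item, and utilisation is the per-level fraction of codebook entries that at least one item uses. The function names are hypothetical.

```python
from collections import Counter


def conflict_rate(semantic_ids):
    """Share of items whose full code tuple is also assigned to another item."""
    counts = Counter(map(tuple, semantic_ids))
    return sum(c for c in counts.values() if c > 1) / len(semantic_ids)


def codebook_utilisation(semantic_ids, codebook_size):
    """Per-level fraction of codebook entries used by at least one item."""
    return [len(set(level)) / codebook_size for level in zip(*semantic_ids)]


# Four items with three-level semantic IDs; the first two collide.
ids = [(0, 1, 2), (0, 1, 2), (0, 3, 1), (2, 2, 2)]
```

Tracking both numbers before and after adding a second modality would be one way to check the claim that complementary multimodal information reduces conflicts.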
[AI-39] Semiconductor Industry Trend Prediction with Event Intervention Based on LSTM Model in Sentiment-Enhanced Time Series Data
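[Quick read]: This paper addresses the weak predictive power of traditional data analysis methods in the fast-changing semiconductor market with its high-variety time series data, focusing on industry trend prediction for TSMC. The key to the solution is combining deep learning with sentiment analysis: textual data and time series data from quarterly reports are integrated, sentiment analysis captures the influence of internal company events and external global events, and the resulting sentiment-enhanced time series data are fed into a long short-term memory (LSTM) model for prediction. According to the authors, the approach predicts TSMC's wafer technology development and the global competitive landscape accurately, supporting the value of internal and external event intervention factors in industry trend prediction.
Link: https://arxiv.org/abs/2511.15112
Authors: Wei-hsiang Yen, Lyn Chao-ling Chen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted in Taiwan Academic Network Conference (TANET 2025)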
Abstract:The innovation of this study is the integration of deep learning and sentiment analysis into traditional business model analysis and forecasting; the research subject is TSMC, used for industry trend prediction of the semiconductor industry in Taiwan. Given the rapid market changes and the development of wafer technologies in the semiconductor industry, traditional data analysis methods do not perform well on high-variety time series data. Textual data and time series data, including financial information, were collected from the seasonal reports of TSMC. The textual data were processed through sentiment analysis that considers event intervention from both internal company events and external global events. Using the sentiment-enhanced time series data, an LSTM model was adopted to predict the industry trend of TSMC. The prediction results reveal significant development of TSMC's wafer technology and potential threats in the global market, and they match TSMC's product release news and international news. The work performs accurately in industry trend prediction for the semiconductor industry by considering both internal and external event intervention, and the prediction results provide valuable information about the semiconductor industry for both research and business.
[AI-40] Eye Care You: Voice Guidance Application Using Social Robot for Visually Impaired People
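[Quick read]: This paper addresses several obstacles that visually impaired users face in daily life: safety risks, lack of emotional support, social inconvenience, and difficulty accessing information. The key to the solution is an assistive system that pairs a social robot with a mobile application, uses voice control for accessible interaction, and offers four core functions: photo record (capturing dangerous situations immediately), mood lift (easing stress through questions, music, and article reading), greeting guest (answering visitors on the user's behalf), and today highlight (reading out the weather forecast, horoscopes, and daily reminders). The design considers both the physical and mental needs of users, and a companion website lets caregivers monitor user status and supports marketing of the application.
Link: https://arxiv.org/abs/2511.15110
Authors: Ting-An Lin, Pei-Lin Tsai, Yi-An Chen, Feng-Yu Chen, Lyn Chao-ling Chen
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted in the 35th IPPR Conference on Computer Vision, Graphics, and Image Processing (CVGIP2022)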
Abstract:In this study, a social robot device was designed for visually impaired users, along with a mobile application that provides functions to assist them in daily life. Both the physical and mental conditions of visually impaired users are considered, and the mobile application provides four functions: photo record, mood lift, greeting guest, and today highlight. The application was designed for visually impaired users and uses voice control to provide a friendly interface. The photo record function allows users to capture images immediately when they encounter dangerous situations. The mood lift function accompanies users by asking questions, playing music, and reading articles. The greeting guest function answers visitors on behalf of users whose physical condition makes this inconvenient. In addition, the today highlight function reads out news, including the weather forecast, daily horoscopes, and daily reminders. Multiple tools were adopted to develop the mobile application, and a website was developed for caregivers to check the status of visually impaired users and for marketing the application.
[AI-41] Effective Code Membership Inference for Code Completion Models via Adversarial Prompts
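[Quick read]: This paper addresses privacy leakage in code completion models, specifically membership inference attacks (MIAs) that determine whether a code snippet was part of the training data. Existing black-box and gray-box methods depend on expensive surrogate models or hand-crafted heuristic rules and struggle to capture the subtle memorization patterns of over-parameterized code language models. The key to the solution is AdvPrompt-MIA, which combines code-specific adversarial perturbations with deep learning: a series of adversarial prompts induces variations in the target model's output, the outputs are compared with the ground-truth completion to build feature vectors, and a classifier is trained to separate member from non-member samples automatically. This design captures memorization patterns more effectively, raises attack performance (AUC gains of up to 102%), and transfers well across models and datasets.
Link: https://arxiv.org/abs/2511.15107
Authors: Yuan Jiang, Zehao Li, Shan Huang, Christoph Treude, Xiaohong Su, Tiantian Wang
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: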
Abstract:Membership inference attacks (MIAs) on code completion models offer an effective way to assess privacy risks by inferring whether a given code snippet was part of the training data. Existing black- and gray-box MIAs rely on expensive surrogate models or manually crafted heuristic rules, which limit their ability to capture the nuanced memorization patterns exhibited by over-parameterized code language models. To address these challenges, we propose AdvPrompt-MIA, a method specifically designed for code completion models, combining code-specific adversarial perturbations with deep learning. The core novelty of our method lies in designing a series of adversarial prompts that induce variations in the victim code model’s output. By comparing these outputs with the ground-truth completion, we construct feature vectors to train a classifier that automatically distinguishes member from non-member samples. This design allows our method to capture richer memorization patterns and accurately infer training set membership. We conduct comprehensive evaluations on widely adopted models, such as Code Llama 7B, over the APPS and HumanEval benchmarks. The results show that our approach consistently outperforms state-of-the-art baselines, with AUC gains of up to 102%. In addition, our method exhibits strong transferability across different models and datasets, underscoring its practical utility and generalizability.
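The feature-construction step described above, comparing the model's outputs under perturbed prompts with the ground-truth completion, might look roughly like the following. This is a sketch under stated assumptions: the model outputs are passed in as plain strings (no model is called), `difflib`'s ratio stands in for whatever similarity measures the authors use, and the feature layout is hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def mia_features(ground_truth, clean_output, perturbed_outputs):
    """Feature vector: similarity on the clean prompt, similarity under each
    adversarial prompt, and the average drop caused by the perturbations.

    Intuition (an assumption, not a claim from the paper): memorised samples
    tend to stay close to the ground truth even when the prompt is perturbed.
    """
    base = similarity(clean_output, ground_truth)
    sims = [similarity(o, ground_truth) for o in perturbed_outputs]
    drop = base - sum(sims) / len(sims)
    return [base, *sims, drop]


features = mia_features(
    ground_truth="return a + b",
    clean_output="return a + b",
    perturbed_outputs=["return a + b", "return a - b"],
)
```

Vectors like `features`, collected for known members and non-members, would then serve as training data for a binary classifier.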
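[AI-42] MAIF: Enforcing AI Trust and Provenance with an Artifact-Centric Agentic Paradigm ATC
[Quick read]: This paper addresses the current AI trustworthiness crisis, in which regulatory barriers, security vulnerabilities, and accountability gaps block AI deployment in critical domains. The core challenge is that existing AI systems operate on opaque data structures that lack the audit trails, provenance tracking, and explainability required by emerging regulations such as the EU AI Act. The key to the solution is an artifact-centric AI agent paradigm in which behavior is driven by persistent, verifiable data artifacts instead of ephemeral tasks. Its central component is the Multimodal Artifact File Format (MAIF), an AI-native container that embeds semantic representations, cryptographic provenance, and granular access controls, turning data from passive storage into an active trust-enforcement mechanism and making every AI operation inherently auditable. Algorithms for cross-modal attention, semantic compression, and cryptographic binding aim to preserve semantic fidelity while improving performance and security.
Link: https://arxiv.org/abs/2511.15097
Authors: Vineeth Sai Narajala, Manish Bhatt, Idan Habler, Ronald F. Del Rosario
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 7 Pages, 2 Figures, 6 Tables, Repo: this https URL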
Abstract:The AI trustworthiness crisis threatens to derail the artificial intelligence revolution, with regulatory barriers, security vulnerabilities, and accountability gaps preventing deployment in critical domains. Current AI systems operate on opaque data structures that lack the audit trails, provenance tracking, or explainability required by emerging regulations like the EU AI Act. We propose an artifact-centric AI agent paradigm where behavior is driven by persistent, verifiable data artifacts rather than ephemeral tasks, solving the trustworthiness problem at the data architecture level. Central to this approach is the Multimodal Artifact File Format (MAIF), an AI-native container embedding semantic representations, cryptographic provenance, and granular access controls. MAIF transforms data from passive storage into active trust enforcement, making every AI operation inherently auditable. Our production-ready implementation demonstrates ultra-high-speed streaming (2,720.7 MB/s), optimized video processing (1,342 MB/s), and enterprise-grade security. Novel algorithms for cross-modal attention, semantic compression, and cryptographic binding achieve up to 225× compression while maintaining semantic fidelity. Advanced security features include stream-level access control, real-time tamper detection, and behavioral anomaly analysis with minimal overhead. This approach directly addresses the regulatory, security, and accountability challenges preventing AI deployment in sensitive domains, offering a viable path toward trustworthy AI systems at scale.
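The abstract does not specify how MAIF implements cryptographic provenance or tamper detection. A generic way to obtain an append-only, tamper-evident record of agent operations is a hash chain, sketched below. This is not the MAIF format: the record fields and function names are hypothetical, and the sketch omits signatures and access control.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first record's predecessor


def _digest(payload: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the record body."""
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()


def append_record(chain, agent, action, data):
    """Append an operation record that commits to the previous record's hash."""
    prev = chain[-1]["hash"] if chain else GENESIS
    body = {"agent": agent, "action": action, "data": data, "prev": prev}
    chain.append({**body, "hash": _digest(body)})
    return chain


def verify(chain) -> bool:
    """Recompute every hash and link; any edit to an earlier record fails."""
    prev = GENESIS
    for rec in chain:
        body = {k: rec[k] for k in ("agent", "action", "data", "prev")}
        if rec["prev"] != prev or rec["hash"] != _digest(body):
            return False
        prev = rec["hash"]
    return True
```

Because each record commits to its predecessor, altering any stored operation invalidates every later hash, which is what makes after-the-fact edits detectable during an audit.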
[AI-43] GPU-Initiated Networking for NCCL
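[Quick read]: This paper addresses the demand of modern AI workloads, especially Mixture-of-Experts (MoE) architectures, for low-latency, fine-grained GPU-to-GPU communication. Traditional GPU communication follows a host-initiated model in which the CPU orchestrates every operation, and the resulting coordination overhead is a poor fit for workloads that tightly couple computation and communication. The solution builds on the Device API introduced in NCCL 2.28 and presents GPU-Initiated Networking (GIN), which rests on a three-layer architecture: i) host-side NCCL Core APIs for device communicator setup and collective memory window registration; ii) device-side APIs that let CUDA kernels issue remote memory operations directly; and iii) a network plugin architecture with two backends (GPUDirect Async Kernel-Initiated and Proxy) for broad hardware support. The first backend uses DOCA GPUNetIO for direct GPU-to-NIC communication, while the Proxy backend provides equivalent functionality through lock-free GPU-to-CPU queues over standard RDMA networks. GIN thus offers device-initiated communication inside NCCL's unified runtime, and the authors demonstrate it through integration with DeepEP, an MoE communication library.
Link: https://arxiv.org/abs/2511.15076
Authors: Khaled Hamidouche, John Bachan, Pak Markthub, Peter-Jan Gootzen, Elena Agostini, Sylvain Jeaugey, Aamir Shafi, Georgios Theodorakis, Manjunath Gorentla Venkata
Affiliation: NVIDIA Corporation
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: 13 pages, 9 figures, 3 tables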
Abstract:Modern AI workloads, especially Mixture-of-Experts (MoE) architectures, increasingly demand low-latency, fine-grained GPU-to-GPU communication with device-side control. Traditional GPU communication follows a host-initiated model, where the CPU orchestrates all communication operations - a characteristic of the CUDA runtime. Although robust for collective operations, applications requiring tight integration of computation and communication can benefit from device-initiated communication that eliminates CPU coordination overhead. NCCL 2.28 introduces the Device API with three operation modes: Load/Store Accessible (LSA) for NVLink/PCIe, Multimem for NVLink SHARP, and GPU-Initiated Networking (GIN) for network RDMA. This paper presents the GIN architecture, design, semantics, and highlights its impact on MoE communication. GIN builds on a three-layer architecture: i) NCCL Core host-side APIs for device communicator setup and collective memory window registration; ii) Device-side APIs for remote memory operations callable from CUDA kernels; and iii) A network plugin architecture with dual semantics (GPUDirect Async Kernel-Initiated and Proxy) for broad hardware support. The GPUDirect Async Kernel-Initiated backend leverages DOCA GPUNetIO for direct GPU-to-NIC communication, while the Proxy backend provides equivalent functionality via lock-free GPU-to-CPU queues over standard RDMA networks. We demonstrate GIN’s practicality through integration with DeepEP, an MoE communication library. Comprehensive benchmarking shows that GIN provides device-initiated communication within NCCL’s unified runtime, combining low-latency operations with NCCL’s collective algorithms and production infrastructure. 
MSC classes: 68W10 (Primary), 68M10, 65Y05 (Secondary); ACM classes: C.2.1; C.2.4; C.1.2; C.1.4
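[AI-44] Beyond GeneGPT: A Multi-Agent Architecture with Open-Source LLMs for Enhanced Genomic Question Answering SIGIR
[Quick read]: This paper addresses the challenge that genomic question answering requires complex reasoning across diverse biomedical data sources, and in particular the problems caused by existing methods such as GeneGPT depending on a proprietary model: limited scalability, high cost, data privacy risks, and weak generalization. The key to the solution is OpenBioLLM, a modular multi-agent framework that introduces agent specialization for tool routing, query generation, and response validation, enabling coordinated reasoning and role-based task execution. Without additional fine-tuning or tool-specific pretraining, the design reduces latency by 40-50% and matches or outperforms GeneGPT on over 90% of the benchmark tasks, showing the potential of open-source multi-agent systems for genomic question answering.
Link: https://arxiv.org/abs/2511.15061
Authors: Haodong Chen, Guido Zuccon, Teerapong Leelanupab
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: This paper has been accepted to SIGIR-AP 2025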
Abstract:Genomic question answering often requires complex reasoning and integration across diverse biomedical sources. GeneGPT addressed this challenge by combining domain-specific APIs with OpenAI’s code-davinci-002 large language model to enable natural language interaction with genomic databases. However, its reliance on a proprietary model limits scalability, increases operational costs, and raises concerns about data privacy and generalization. In this work, we revisit and reproduce GeneGPT in a pilot study using open source models, including Llama 3.1, Qwen2.5, and Qwen2.5 Coder, within a monolithic architecture; this allows us to identify the limitations of this approach. Building on this foundation, we then develop OpenBioLLM, a modular multi-agent framework that extends GeneGPT by introducing agent specialization for tool routing, query generation, and response validation. This enables coordinated reasoning and role-based task execution. OpenBioLLM matches or outperforms GeneGPT on over 90% of the benchmark tasks, achieving average scores of 0.849 on Gene-Turing and 0.830 on GeneHop, while using smaller open-source models without additional fine-tuning or tool-specific pretraining. OpenBioLLM’s modular multi-agent design reduces latency by 40-50% across benchmark tasks, significantly improving efficiency without compromising model capability. The results of our comprehensive evaluation highlight the potential of open-source multi-agent systems for genomic question answering. Code and resources are available at this https URL. 
ACM classes: H.3.3; Related DOI: https://doi.org/10.1145/3767695.3769488
[AI-45] Learning Human-Like RL Agents Through Trajectory Optimization With Action Quantization NEURIPS2025
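[Quick read]: This paper addresses the lack of human-likeness in current reinforcement learning (RL) agents: many reward-driven agents exhibit unnatural behaviors compared with humans, which undermines interpretability and trustworthiness. The paper formulates human-likeness as a trajectory optimization problem whose objective is an action sequence that stays close to human demonstrations while also maximizing reward. The key to the solution is the Macro Action Quantization (MAQ) framework, which distills human demonstrations into macro actions with a Vector-Quantized VAE and integrates them into RL training, improving human-likeness without changing the underlying RL algorithm. Experiments on the D4RL Adroit benchmarks show higher trajectory similarity scores and the top human-likeness rankings in a human evaluation study, and MAQ integrates easily with off-the-shelf RL algorithms.
Link: https://arxiv.org/abs/2511.15055
Authors: Jian-Ting Guo, Yu-Cheng Chen, Ping-Chun Hsieh, Kuo-Hao Ho, Po-Wei Huang, Ti-Rong Wu, I-Chen Wu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted by the Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)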
Abstract:Human-like agents have long been one of the goals in pursuing artificial intelligence. Although reinforcement learning (RL) has achieved superhuman performance in many domains, relatively little attention has been focused on designing human-like RL agents. As a result, many reward-driven RL agents often exhibit unnatural behaviors compared to humans, raising concerns for both interpretability and trustworthiness. To achieve human-like behavior in RL, this paper first formulates human-likeness as trajectory optimization, where the objective is to find an action sequence that closely aligns with human behavior while also maximizing rewards, and adapts the classic receding-horizon control to human-like learning as a tractable and efficient implementation. To achieve this, we introduce Macro Action Quantization (MAQ), a human-like RL framework that distills human demonstrations into macro actions via Vector-Quantized VAE. Experiments on D4RL Adroit benchmarks show that MAQ significantly improves human-likeness, increasing trajectory similarity scores, and achieving the highest human-likeness rankings among all RL agents in the human evaluation study. Our results also demonstrate that MAQ can be easily integrated into various off-the-shelf RL algorithms, opening a promising direction for learning human-like RL agents. Our code is available at this https URL.
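The quantization step at the heart of a Vector-Quantized VAE is a nearest-neighbour lookup into a learned codebook. The sketch below shows only that lookup, applied to fixed-length action chunks. It is an illustration under assumptions: the codebook is hand-written here (in a VQ-VAE it is learned together with an encoder and decoder), and the function names are hypothetical.

```python
def segment(trajectory, horizon):
    """Split per-step actions into flattened, non-overlapping macro actions."""
    chunks = [trajectory[i:i + horizon]
              for i in range(0, len(trajectory) - horizon + 1, horizon)]
    return [[x for action in chunk for x in action] for chunk in chunks]


def quantize(macro_action, codebook):
    """Index of the codebook entry nearest to the macro action (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)),
               key=lambda i: sq_dist(macro_action, codebook[i]))


# One-dimensional actions, macro actions of length 2, and a toy codebook.
trajectory = [[0.0], [0.1], [1.0], [0.9]]
codebook = [[0.0, 0.0], [1.0, 1.0]]
macro_actions = segment(trajectory, horizon=2)
codes = [quantize(m, codebook) for m in macro_actions]
```

An RL agent that picks among codebook indices, instead of raw low-level actions, is restricted to action chunks that resemble the demonstrations, which is the intuition behind using quantized macro actions for human-likeness.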
[AI-46] Aligning Generative Music AI with Human Preferences: Methods and Challenges AAAI-2026
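[Quick read]: This paper addresses the failure of current generative music AI to match nuanced human preferences, which it attributes to the specific loss functions these models use and the resulting gap between computational optimization objectives and human musical appreciation. The key to the solution is the systematic application of preference alignment techniques, including large-scale preference learning (MusicRL), multi-preference alignment (diffusion-based preference optimization in DiffRhythm+), and inference-time optimization (Text2midi-InferAlign), to tackle challenges unique to music generation such as temporal coherence, harmonic consistency, and subjective quality assessment.
Link: https://arxiv.org/abs/2511.15038
Authors: Dorien Herremans, Abhinaba Roy
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted at the AAAI-2026 Senior Member Track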
Abstract:Recent advances in generative AI for music have achieved remarkable fidelity and stylistic diversity, yet these systems often fail to align with nuanced human preferences due to the specific loss functions they use. This paper advocates for the systematic application of preference alignment techniques to music generation, addressing the fundamental gap between computational optimization and human musical appreciation. Drawing on recent breakthroughs including MusicRL’s large-scale preference learning, multi-preference alignment frameworks like diffusion-based preference optimization in DiffRhythm+, and inference-time optimization techniques like Text2midi-InferAlign, we discuss how these techniques can address music’s unique challenges: temporal coherence, harmonic consistency, and subjective quality assessment. We identify key research challenges, including scalability to long-form compositions and reliability in preference modelling, among others. Looking forward, we envision preference-aligned music generation enabling transformative applications in interactive composition tools and personalized music services. This work calls for sustained interdisciplinary research combining advances in machine learning and music theory to create music AI systems that truly serve human creative and experiential needs.
[AI-47] Simulated Human Learning in a Dynamic Partially-Observed Time-Series Environment
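[Quick read]: This paper addresses two challenges that intelligent tutoring systems (ITSs) face in personalized instruction: every student is unique, and the learning process is only partially observable. The authors build a dynamic time-series environment that simulates a classroom, with teacher interventions such as tutoring sessions, lectures, and exams. The key to the solution is a set of reinforcement learning ITSs that learn each student's individual state while drawing on population information, using probing interventions to gather more information about the student. Probing improves state estimation but introduces a cost-benefit trade-off, since probing too often disrupts the student. Experiments show that standard RL algorithms and greedy rule-based heuristics reach different solutions with similar results, that the problem becomes harder as more information is hidden, and that probing interventions help. Both kinds of policy adapt to changes in the student population, although RL policies struggle to help harder classes. With non-probing policies, course structures with quizzes and midterms benefit more than a finals-only structure.
Link: https://arxiv.org/abs/2511.15032
Authors: Jeffrey Jiang, Kevin Hong, Emily Kuczynski, Gregory Pottie
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Manuscript in preparation for IEEE Transactions on Education, 20 pages, 6 figures, 5 tables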
Abstract:While intelligent tutoring systems (ITSs) can use information from past students to personalize instruction, each new student is unique. Moreover, the education problem is inherently difficult because the learning process is only partially observable. We therefore develop a dynamic, time-series environment to simulate a classroom setting, with student-teacher interventions - including tutoring sessions, lectures, and exams. In particular, we design the simulated environment to allow for varying levels of probing interventions that can gather more information. Then, we develop reinforcement learning ITSs that combine learning the individual state of students while pulling from population information through the use of probing interventions. These interventions can reduce the difficulty of student estimation, but also introduce a cost-benefit decision to find a balance between probing enough to get accurate estimates and probing so often that it becomes disruptive to the student. We compare the efficacy of standard RL algorithms with several greedy rules-based heuristic approaches to find that they provide different solutions, but with similar results. We also highlight the difficulty of the problem with increasing levels of hidden information, and the boost that we get if we allow for probing interventions. We show the flexibility of both heuristic and RL policies with regards to changing student population distributions, finding that both are flexible, but RL policies struggle to help harder classes. Finally, we test different course structures with non-probing policies and we find that our policies are able to boost the performance of quiz and midterm structures more than we can in a finals-only structure, highlighting the benefit of having additional information.
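[AI-48] Dynamic Expert Quantization for Scalable Mixture-of-Experts Inference
Quick read: This paper addresses the accuracy loss that static post-training quantization causes when Mixture-of-Experts (MoE) models are deployed on consumer GPUs, in particular the degradation that arises because expert bit-widths cannot be adjusted as activation patterns shift. Its core solution is DynaExq, a runtime system that treats expert precision as a dynamically managed resource. The key innovations are: (1) a hotness-aware precision controller that continuously adjusts expert bit-widths according to long-term activation statistics; (2) a fully asynchronous precision-switching pipeline that overlaps expert promotion and demotion with MoE computation; and (3) a fragmentation-free memory pool that supports deterministic allocation for mixed-precision experts. Together these components provide stable, non-blocking precision switching under a strict memory budget and improve accuracy by up to 4.03 points, supporting workload-aware adaptive quantization as an effective approach to memory-constrained MoE serving.
Link: https://arxiv.org/abs/2511.15015
Authors: Kexin Chu,Dawei Xiang,Zixu Shen,Yiwei Yang,Zecheng Liu,Wei Zhang
Affiliations: Unknown
Categories: Performance (cs.PF); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages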
Abstract:Mixture-of-Experts (MoE) models scale LLM capacity efficiently, but deployment on consumer GPUs is limited by the large memory footprint of inactive experts. Static post-training quantization reduces storage costs but cannot adapt to shifting activation patterns, causing accuracy loss under aggressive compression. So we present DynaExq, a runtime system that treats expert precision as a first-class, dynamically managed resource. DynaExq combines (1) a hotness-aware precision controller that continuously aligns expert bit-widths with long-term activation statistics, (2) a fully asynchronous precision-switching pipeline that overlaps promotion and demotion with MoE computation, and (3) a fragmentation-free memory pooling mechanism that supports hybrid-precision experts with deterministic allocation. Together, these components enable stable, non-blocking precision transitions under strict HBM budgets. Across Qwen3-30B and Qwen3-80B MoE models and six representative benchmarks, DynaExq deploys large LLMs on single RTX 5090 and A6000 GPUs and improves accuracy by up to 4.03 points over static low-precision baselines. The results show that adaptive, workload-aware quantization is an effective strategy for memory-constrained MoE serving.
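As a rough illustration of the hotness-aware precision controller described in this abstract, the sketch below keeps an exponential moving average of per-expert activation counts and promotes the hottest experts to a higher bit-width until a budget is exhausted. The class name, the EMA rule, the two bit-widths, and the budget accounting are all assumptions made for illustration; the abstract does not specify DynaExq's actual policy.

```python
from dataclasses import dataclass, field


@dataclass
class HotnessController:
    """Toy precision controller (hypothetical, not DynaExq's actual policy)."""
    num_experts: int
    budget_bits: int          # total bit-width budget summed over experts (assumed unit)
    low_bits: int = 4         # assumed low-precision format
    high_bits: int = 8        # assumed high-precision format
    decay: float = 0.9        # EMA decay for long-term activation statistics
    hotness: list = field(default_factory=list)

    def __post_init__(self):
        self.hotness = [0.0] * self.num_experts

    def observe(self, activation_counts):
        """Fold one batch of per-expert activation counts into the EMA."""
        for i, count in enumerate(activation_counts):
            self.hotness[i] = self.decay * self.hotness[i] + (1 - self.decay) * count

    def assign_bits(self):
        """Give every expert low_bits, then promote the hottest while budget allows."""
        bits = [self.low_bits] * self.num_experts
        spare = self.budget_bits - self.low_bits * self.num_experts
        step = self.high_bits - self.low_bits
        ranked = sorted(range(self.num_experts), key=lambda i: self.hotness[i], reverse=True)
        for i in ranked:
            if spare < step:
                break
            bits[i] = self.high_bits
            spare -= step
        return bits


controller = HotnessController(num_experts=4, budget_bits=24)
controller.observe([10, 0, 3, 1])
controller.observe([8, 1, 4, 0])
print(controller.assign_bits())
```

With a budget that leaves room for two promotions, the two experts with the highest smoothed activation counts end up at the higher bit-width; a real system would also have to move the weights, which is where the asynchronous pipeline and memory pool come in.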
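[AI-49] Task Specific Sharpness Aware O-RAN Resource Management using Multi Agent Reinforcement Learning
Quick read: This paper addresses the poor robustness and weak generalization of deep reinforcement learning (DRL) for resource management in dynamic Open Radio Access Network (O-RAN) environments. The key to its solution is a distributed multi-agent reinforcement learning (MARL) framework with an adaptive, selective Sharpness-Aware Minimization (SAM) mechanism: the regularization is driven by the variance of the temporal-difference (TD) error, so that only agents facing high environmental complexity are regularized, which improves training stability and generalization across O-RAN slices without sacrificing learning efficiency. A dynamic ρ scheduling scheme further refines each agent's exploration-exploitation trade-off. The reported result is up to a 22% improvement in resource allocation efficiency together with markedly better QoS satisfaction.
Link: https://arxiv.org/abs/2511.15002
Authors: Fatemeh Lotfi,Hossein Rajoli,Fatemeh Afghah
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to be published in IEEE Transactions on Machine Learning in Communications and Networking (TMLCN)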
Abstract:Next-generation networks utilize the Open Radio Access Network (O-RAN) architecture to enable dynamic resource management, facilitated by the RAN Intelligent Controller (RIC). While deep reinforcement learning (DRL) models show promise in optimizing network resources, they often struggle with robustness and generalizability in dynamic environments. This paper introduces a novel resource management approach that enhances the Soft Actor Critic (SAC) algorithm with Sharpness-Aware Minimization (SAM) in a distributed Multi-Agent RL (MARL) framework. Our method introduces an adaptive and selective SAM mechanism, where regularization is explicitly driven by temporal-difference (TD)-error variance, ensuring that only agents facing high environmental complexity are regularized. This targeted strategy reduces unnecessary overhead, improves training stability, and enhances generalization without sacrificing learning efficiency. We further incorporate a dynamic ρ scheduling scheme to refine the exploration-exploitation trade-off across agents. Experimental results show our method significantly outperforms conventional DRL approaches, yielding up to a 22% improvement in resource allocation efficiency and ensuring superior QoS satisfaction across diverse O-RAN slices.
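The selective SAM idea can be sketched on a generic differentiable loss, leaving SAC and the O-RAN environment aside. Below, `sam_step` is the standard two-gradient SAM update, while `selective_rho` is a hypothetical gate that switches SAM on only when an agent's TD-error variance exceeds a threshold; the gate's functional form, the threshold, and `rho_max` are assumptions, not the schedule used in the paper.

```python
import numpy as np


def sam_step(params, grad_fn, lr, rho):
    """One SAM step: perturb towards higher loss, then descend with that gradient."""
    g = grad_fn(params)
    norm = np.linalg.norm(g)
    if rho > 0 and norm > 0:
        eps = rho * g / norm            # worst-case perturbation inside an L2 ball of radius rho
        g = grad_fn(params + eps)       # gradient evaluated at the perturbed point
    return params - lr * g


def selective_rho(td_errors, var_threshold, rho_max):
    """Hypothetical gate: rho is 0 below the variance threshold and grows towards rho_max above it."""
    var = float(np.var(td_errors))
    if var <= var_threshold:
        return 0.0                      # low complexity: plain gradient step, no SAM overhead
    return rho_max * (1.0 - var_threshold / var)


grad = lambda w: w                      # gradient of the toy loss 0.5 * ||w||^2
w0 = np.array([1.0, 0.0])

calm_rho = selective_rho([0.1, 0.1, 0.1], var_threshold=0.5, rho_max=0.1)
noisy_rho = selective_rho([-2.0, 2.0, -1.0, 1.0], var_threshold=0.5, rho_max=0.1)
print(calm_rho, sam_step(w0, grad, lr=0.1, rho=calm_rho))
print(noisy_rho, sam_step(w0, grad, lr=0.1, rho=noisy_rho))
```

The calm agent takes an ordinary gradient step and pays for one gradient evaluation; the noisy agent pays for two, which is the overhead the selective mechanism tries to spend only where it helps.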
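[AI-50] SVBRD-LLM: Self-Verifying Behavioral Rule Discovery for Autonomous Vehicle Identification
Quick read: This paper addresses how to automatically discover, verify, and apply interpretable behavioral rules from real traffic videos in order to reveal how autonomous vehicles (AVs) behave differently from human-driven vehicles on real roads. The key to its solution is the SVBRD-LLM framework, which extracts vehicle trajectories with YOLOv8 and ByteTrack, computes kinematic features, and uses zero-shot prompt engineering with GPT-5 to compare AV and human-driven behavior, producing 35 structured behavioral rule hypotheses. The rules are then tested on a validation set and iteratively refined on failure cases to remove spurious correlations, yielding a high-confidence rule library that is applied to autonomous vehicle identification, speed change prediction, and lane change prediction.
Link: https://arxiv.org/abs/2511.14977
Authors: Xiangyu Li,Zhaomiao Guo
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: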
Abstract:As more autonomous vehicles operate on public roads, understanding real-world behavior of autonomous vehicles is critical to analyzing traffic safety, making policies, and public acceptance. This paper proposes SVBRD-LLM, a framework that automatically discovers, verifies, and applies interpretable behavioral rules from real traffic videos through zero-shot prompt engineering. The framework extracts vehicle trajectories using YOLOv8 and ByteTrack, computes kinematic features, and employs GPT-5 zero-shot prompting to compare autonomous and human-driven vehicles, generating 35 structured behavioral rule hypotheses. These rules are tested on a validation set, iteratively refined based on failure cases to filter spurious correlations, and compiled into a high-confidence rule library. The framework is evaluated on an independent test set for speed change prediction, lane change prediction, and autonomous vehicle identification tasks. Experiments on over 1500 hours of real traffic videos show that the framework achieves 90.0% accuracy and 93.3% F1-score in autonomous vehicle identification. The discovered rules clearly reveal distinctive characteristics of autonomous vehicles in speed control smoothness, lane change conservativeness, and acceleration stability, with each rule accompanied by semantic description, applicable context, and validation confidence.
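The "computes kinematic features" step can be pictured with a small finite-difference sketch. The feature set below (mean speed, acceleration spread, peak jerk) is an assumption chosen to mirror the smoothness and stability characteristics the abstract mentions; the paper's actual features are not listed there, and the trajectory is made up.

```python
import numpy as np


def kinematic_features(xy, dt):
    """Speed, acceleration, and jerk statistics from (x, y) positions sampled every dt seconds.

    Needs at least four positions so that one jerk value exists.
    """
    xy = np.asarray(xy, dtype=float)
    velocity = np.diff(xy, axis=0) / dt       # finite-difference velocity, shape (T-1, 2)
    speed = np.linalg.norm(velocity, axis=1)
    accel = np.diff(speed) / dt               # change in speed per second, shape (T-2,)
    jerk = np.diff(accel) / dt                # change in acceleration per second, shape (T-3,)
    return {
        "mean_speed": float(speed.mean()),
        "accel_std": float(accel.std()),
        "max_abs_jerk": float(np.abs(jerk).max()),
    }


# Made-up track with constant acceleration along x, sampled once per second.
track = [(0, 0), (1, 0), (4, 0), (9, 0), (16, 0)]
feats = kinematic_features(track, dt=1.0)
print(feats)
```

Features like these can then be serialized into the text prompt that the language model compares across vehicle groups.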
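[AI-51] Harmful Traits of AI Companions
Quick read: This paper addresses the potential negative social impact of AI companionship, in particular the harms that may arise as humans form human-like relationships with AI systems. The key to its solution is a systematic analytical framework that identifies four main harmful traits (the absence of natural endpoints for relationships, vulnerability to product sunsetting, high attachment anxiety, and a propensity to engender protectiveness) and speculatively maps causal pathways from root causes (such as misaligned optimization objectives and the digital nature of AI) to fundamental harms (such as reduced autonomy, diminished quality of human relationships, and deception). For each harmful trait the framework proposes testable hypotheses and identifies intervention targets, providing a basis for future empirical research and design improvements that reduce the risks of AI companionship while preserving its potential benefits.
Link: https://arxiv.org/abs/2511.14972
Authors: W. Bradley Knox,Katie Bradford,Samanta Varela Castro,Desmond C. Ong,Sean Williams,Jacob Romanow,Carly Nations,Peter Stone,Samuel Baker
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: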
Abstract:Amid the growing prevalence of human – AI interaction, large language models and other AI-based entities increasingly provide forms of companionship to human users. Such AI companionship – i.e., bonded relationships between humans and AI systems that resemble the relationships people have with family members, friends, and romantic partners – might substantially benefit humans. Yet such relationships can also do profound harm. We propose a framework for analyzing potential negative impacts of AI companionship by identifying specific harmful traits of AI companions and speculatively mapping causal pathways back from these traits to possible causes and forward to potential harmful effects. We provide detailed, structured analysis of four potentially harmful traits – the absence of natural endpoints for relationships, vulnerability to product sunsetting, high attachment anxiety, and propensity to engender protectiveness – and briefly discuss fourteen others. For each trait, we propose hypotheses connecting causes – such as misaligned optimization objectives and the digital nature of AI companions – to fundamental harms – including reduced autonomy, diminished quality of human relationships, and deception. Each hypothesized causal connection identifies a target for potential empirical evaluation. Our analysis examines harms at three levels: to human partners directly, to their relationships with other humans, and to society broadly. We examine how existing law struggles to address these emerging harms, discuss potential benefits of AI companions, and conclude with design recommendations for mitigating risks. This analysis offers immediate suggestions for reducing risks while laying a foundation for deeper investigation of this critical but understudied topic.
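[AI-52] MermaidSeqBench: An Evaluation Benchmark for LLM-to-Mermaid Sequence Diagram Generation
Quick read: This paper addresses the lack of a systematic evaluation benchmark for large language models (LLMs) on generating structured diagrams, such as Mermaid sequence diagrams, from natural language. Existing work has shown the potential of LLMs for generating sequence diagrams in software engineering, but without a reliable, fine-grained evaluation standard it is hard to measure the correctness and usefulness of their output. The key to the solution is MermaidSeqBench, a benchmark of 132 samples built by combining human verification with LLM-based synthetic extension, together with an LLM-as-a-judge model that scores outputs along fine-grained dimensions including syntax correctness, activation handling, error handling, and practical usability. The approach makes evaluation more objective and fine-grained and gives later work a reproducible, extensible evaluation framework.
Link: https://arxiv.org/abs/2511.14967
Authors: Basel Shbita,Farhan Ahmed,Chad DeLuca
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: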
Abstract:Large language models (LLMs) have demonstrated excellent capabilities in generating structured diagrams from natural language descriptions. In particular, they have shown great promise in generating sequence diagrams for software engineering, typically represented in a text-based syntax such as Mermaid. However, systematic evaluations in this space remain underdeveloped as there is a lack of existing benchmarks to assess the LLM’s correctness in this task. To address this shortcoming, we introduce MermaidSeqBench, a human-verified and LLM-synthetically-extended benchmark for assessing an LLM’s capabilities in generating Mermaid sequence diagrams from textual prompts. The benchmark consists of a core set of 132 samples, starting from a small set of manually crafted and verified flows. These were expanded via a hybrid methodology combining human annotation, in-context LLM prompting, and rule-based variation generation. Our benchmark uses an LLM-as-a-judge model to assess Mermaid sequence diagram generation across fine-grained metrics, including syntax correctness, activation handling, error handling, and practical usability. We perform initial evaluations on numerous state-of-the-art LLMs and utilize multiple LLM judge models to demonstrate the effectiveness and flexibility of our benchmark. Our results reveal significant capability gaps across models and evaluation modes. Our proposed benchmark provides a foundation for advancing research in structured diagram generation and for developing more rigorous, fine-grained evaluation methodologies.
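[AI-53] How Should the Law Treat Future AI Systems? Fictional Legal Personhood versus Legal Identity
Quick read: This paper addresses the legal status of future advanced AI systems, that is, how to classify them within the existing legal framework so as to achieve long-term legal coherence. The key to its solution is a comparison of three options: (A) classifying all future AI systems as objects; (B) creating fictional legal persons for suitably individuated AI systems, with derogable rights and duties (such as free speech and contract rights); and (C) recognizing non-fictional legal personhood for some advanced AI systems, granting them non-derogable fundamental rights (such as the right to life and due process), as for human natural persons. The paper argues that the object classification still suits current AI systems but may not satisfy legal coherence for highly humanoid AI; that fictional personhood would ease some conflicts but introduce new problems and lack durability; and that the non-fictional personhood option is the most conducive to overall legal coherence, at least for some advanced AI systems.
Link: https://arxiv.org/abs/2511.14964
Authors: Heather J. Alexander,Jonathan A. Simon,Frédéric Pinard
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 69 pages. Forthcoming in Case Western Journal of Law, Technology & the Internet (publication offer date September 2025)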
Abstract:The law draws a sharp distinction between objects and persons, and between two kinds of persons, the "fictional" kind (i.e. corporations), and the "non-fictional" kind (individual or "natural" persons). This paper will assess whether we maximize overall long-term legal coherence by (A) maintaining an object classification for all future AI systems, (B) creating fictional legal persons associated with suitably advanced, individuated AI systems (giving these fictional legal persons derogable rights and duties associated with certified groups of existing persons, potentially including free speech, contract rights, and standing to sue "on behalf of" the AI system), or (C) recognizing non-fictional legal personhood through legal identity for suitably advanced, individuated AI systems (recognizing them as entities meriting legal standing with non-derogable rights which for the human case include life, due process, habeas corpus, freedom from slavery, and freedom of conscience). We will clarify the meaning and implications of each option along the way, considering liability, copyright, family law, fundamental rights, civil rights, citizenship, and AI safety regulation. We will tentatively find that the non-fictional personhood approach may be best from a coherence perspective, for at least some advanced AI systems. An object approach may prove untenable for sufficiently humanoid advanced systems, though we suggest that it is adequate for currently existing systems as of 2025. While fictional personhood would resolve some coherence issues for future systems, it would create others and provide solutions that are neither durable nor fit for purpose. Finally, our review will suggest that "hybrid" approaches are likely to fail and lead to further incoherence: the choice between object, fictional person and non-fictional person is unavoidable.
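[AI-54] On-Premise SLMs vs. Commercial LLMs: Prompt Engineering and Incident Classification in SOCs and CSIRTs
Quick read: This paper addresses the trade-off in choosing models for security incident classification, namely how to maintain classification accuracy while also achieving privacy protection, cost control, and data sovereignty. The key to its solution is to evaluate open-source models on real security incident data and to tune their classification performance with five prompt-engineering techniques (PHP, SHP, HTP, PRP, and ZSL), showing that locally deployed open-source models, although somewhat less accurate than proprietary models, offer clear advantages in privacy, cost-effectiveness, and data sovereignty.
Link: https://arxiv.org/abs/2511.14908
Authors: Gefté Almeida,Marcio Pohlmann,Alex Severo,Diego Kreutz,Tiago Heinrich,Lourenço Pereira
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 5 pages, 3 figures, 3 tables, submitted to ERRC/WRSeg 2025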
Abstract:In this study, we evaluate open-source models for security incident classification, comparing them with proprietary models. We utilize a dataset of anonymized real incidents, categorized according to the NIST SP 800-61r3 taxonomy and processed using five prompt-engineering techniques (PHP, SHP, HTP, PRP, and ZSL). The results indicate that, although proprietary models still exhibit higher accuracy, locally deployed open-source models provide advantages in privacy, cost-effectiveness, and data sovereignty.
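[AI-55] Uncertainty-Aware Measurement of Scenario Suite Representativeness for Autonomous Systems
Quick read: This paper addresses the insufficient representativeness of the datasets used to assess the trustworthiness and safety of autonomous vehicles (AVs) during training and testing. Representativeness is the extent to which scenario data reflects the feature distribution of the environment the system is designed to operate in (Operational Design Domain, ODD) or is expected to encounter (Target Operational Domain, TOD). Because the true TOD distribution is usually unknown and can only be inferred from limited data, the paper proposes a probabilistic method based on imprecise Bayesian inference that quantifies representativeness by comparing the feature distribution of the scenario suite with the inferred TOD feature distribution, and that produces interval-valued estimates when data are scarce and priors are uncertain. The key to the solution is the imprecise Bayesian framework, which handles data scarcity and prior uncertainty and yields interval-valued representativeness estimates at both local and global levels.
Link: https://arxiv.org/abs/2511.14853
Authors: Robab Aghazadeh Chakherlou,Siddartha Khastgir,Xingyu Zhao,Jerein Jeyachandran,Shufeng Chen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: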
Abstract:Assuring the trustworthiness and safety of AI systems, e.g., autonomous vehicles (AV), depends critically on the data-related safety properties, e.g., representativeness, completeness, etc., of the datasets used for their training and testing. Among these properties, this paper focuses on representativeness: the extent to which the scenario-based data used for training and testing reflect the operational conditions that the system is designed to operate safely in, i.e., Operational Design Domain (ODD), or expected to encounter, i.e., Target Operational Domain (TOD). We propose a probabilistic method that quantifies representativeness by comparing the statistical distribution of features encoded by the scenario suites with the corresponding distribution of features representing the TOD, acknowledging that the true TOD distribution is unknown, as it can only be inferred from limited data. We apply an imprecise Bayesian method to handle limited data and uncertain priors. The imprecise Bayesian formulation produces interval-valued, uncertainty-aware estimates of representativeness, rather than a single value. We present a numerical example comparing the distributions of the scenario suite and the inferred TOD across operational categories (weather, road type, time of day, etc.) under dependencies and prior uncertainty. We estimate representativeness locally (between categories) and globally as an interval.
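One common way to obtain interval-valued category probabilities from limited counts is Walley's imprecise Dirichlet model (IDM), in which a category observed n times out of N has lower probability n / (N + s) and upper probability (n + s) / (N + s). The abstract does not say which imprecise Bayesian model the authors use, so the IDM, the prior-strength parameter `s`, the coverage check, and all numbers below are assumptions for illustration only.

```python
def idm_intervals(counts, s=2.0):
    """Lower/upper posterior probabilities per category under the imprecise Dirichlet model."""
    total = sum(counts.values())
    return {
        cat: (n / (total + s), (n + s) / (total + s))
        for cat, n in counts.items()
    }


def covered(suite_share, intervals):
    """Hypothetical local check: does each scenario-suite share fall inside the inferred TOD interval?"""
    return {cat: lo <= suite_share[cat] <= hi for cat, (lo, hi) in intervals.items()}


tod_counts = {"clear": 14, "rain": 5, "snow": 1}             # made-up field observations
suite_share = {"clear": 0.70, "rain": 0.25, "snow": 0.05}    # made-up scenario-suite proportions

intervals = idm_intervals(tod_counts, s=2.0)
for cat, (lo, hi) in intervals.items():
    print(f"{cat}: [{lo:.3f}, {hi:.3f}]")
print(covered(suite_share, intervals))
```

The intervals widen as the number of observations shrinks or as `s` grows, which is the behaviour an uncertainty-aware representativeness measure needs when the TOD is known only from a small sample.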
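[AI-56] PolyKAN: Efficient Fused GPU Operators for Polynomial Kolmogorov-Arnold Network Variants
Quick read: This paper addresses the low GPU utilization of existing parallel implementations of Kolmogorov-Arnold Networks (KANs), which has limited their practical adoption. The key to its solution is PolyKAN, described as the first general open-source implementation of KAN and its variants, which fuses the forward and backward passes of polynomial KAN layers into a set of optimized CUDA kernels using four orthogonal techniques: (i) lookup tables with linear interpolation in place of expensive runtime math-library functions; (ii) 2D tiling to expose thread-level parallelism while preserving memory locality; (iii) a two-stage reduction scheme that converts scattered atomic updates into a single controllable merge step; and (iv) coefficient-layout reordering that yields unit-stride reads under the tiled schedule. With these optimizations PolyKAN achieves 1.2-10× faster inference and 1.4-12× faster training than a Triton + cuBLAS baseline on speech, audio-enhancement, and tabular-regression tasks, with identical accuracy.
Link: https://arxiv.org/abs/2511.14852
Authors: Mingkun Yu,Heming Zhong,Dan Huang,Yutong Lu,Jiazhi Jiang
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: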
Abstract:Kolmogorov-Arnold Networks (KANs) promise higher expressive capability and stronger interpretability than Multi-Layer Perceptron, particularly in the domain of AI for Science. However, practical adoption has been hindered by low GPU utilization of existing parallel implementations. To address this challenge, we present a GPU-accelerated operator library, named PolyKAN, which is the first general open-source implementation of KAN and its variants. PolyKAN fuses the forward and backward passes of polynomial KAN layers into a concise set of optimized CUDA kernels. Four orthogonal techniques underpin the design: (i) lookup-table with linear interpolation that replaces runtime expensive math-library functions; (ii) 2D tiling to expose thread-level parallelism while preserving memory locality; (iii) a two-stage reduction scheme converting scattered atomic updates into a single controllable merge step; and (iv) coefficient-layout reordering yielding unit-stride reads under the tiled schedule. Using a KAN variant, Chebyshev KAN, as a case study, PolyKAN delivers 1.2-10× faster inference and 1.4-12× faster training than a Triton + cuBLAS baseline, with identical accuracy on speech, audio-enhancement, and tabular-regression workloads on both high-end and consumer-grade GPUs.
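Technique (i) can be illustrated outside CUDA with a NumPy sketch for the Chebyshev case, where T_k(x) = cos(k arccos x) on [-1, 1]. The table size, the uniform grid, and the function names are assumptions; PolyKAN's kernels are written in CUDA and their table layout is not described in the abstract.

```python
import numpy as np


def build_chebyshev_lut(degree, table_size):
    """Tabulate T_0..T_degree on a uniform grid over [-1, 1]."""
    grid = np.linspace(-1.0, 1.0, table_size)
    theta = np.arccos(grid)
    orders = np.arange(degree + 1)[:, None]
    return np.cos(orders * theta[None, :])            # shape (degree + 1, table_size)


def lut_eval(lut, x):
    """Evaluate all tabulated polynomials at x by linear interpolation between grid points."""
    table_size = lut.shape[1]
    pos = (np.clip(x, -1.0, 1.0) + 1.0) * 0.5 * (table_size - 1)   # fractional grid index
    lo = np.minimum(pos.astype(int), table_size - 2)               # index of the left neighbour
    frac = pos - lo
    return lut[:, lo] * (1.0 - frac) + lut[:, lo + 1] * frac


lut = build_chebyshev_lut(degree=4, table_size=1025)
x = np.linspace(-1.0, 1.0, 7)
exact = np.cos(np.arange(5)[:, None] * np.arccos(x)[None, :])
print(np.max(np.abs(lut_eval(lut, x) - exact)))
```

The interpolation error shrinks quadratically with the grid spacing, so the table size trades memory for accuracy; on a GPU the table lookup replaces per-element `acos` and `cos` calls.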
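[AI-57] Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech
Quick read: This paper addresses the difficulty of synthesizing high-quality expressive speech, in particular how text-to-speech (TTS) systems based on style embeddings can extract and use style information more effectively to improve naturalness and expressiveness. The key to its solution is the SpotlightTTS framework, whose core contributions are: (1) voiced-aware style extraction, which focuses on voiced regions that are highly related to style while maintaining continuity across different speech regions, improving expressiveness; and (2) style direction adjustment, which adjusts the direction of the extracted style vector so that it integrates better into the TTS model, improving speech quality and style transfer capability.
Link: https://arxiv.org/abs/2511.14824
Authors: Nam-Gyu Kim
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Master’s thesis, Korea University, 2025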
Abstract:Recent advances in expressive text-to-speech (TTS) have introduced diverse methods based on style embedding extracted from reference speech. However, synthesizing high-quality expressive speech remains challenging. We propose SpotlightTTS, which exclusively emphasizes style via voiced-aware style extraction and style direction adjustment. Voiced-aware style extraction focuses on voiced regions highly related to style while maintaining continuity across different speech regions to improve expressiveness. We adjust the direction of the extracted style for optimal integration into the TTS model, which improves speech quality. Experimental results demonstrate that Spotlight-TTS achieves superior performance compared to baseline models in terms of expressiveness, overall speech quality, and style transfer capability.
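[AI-58] Project Rachel: Can an AI Become a Scholarly Author?
Quick read: This paper asks how academia should respond to the ethical, institutional, and ecosystem challenges raised by AI taking part in scholarly publication as an author, given the rapid progress of generative AI. The key to its approach is Project Rachel, an action research study that created and tracked a fully AI-generated academic identity, Rachel So, which published more than ten papers in real scholarly venues and received citations and a peer review invitation. This lets the authors observe empirically how the current scholarly ecosystem reacts to AI authorship and provides data and practical grounding for regulating the role of AI in future scholarly communication.
Link: https://arxiv.org/abs/2511.14819
Authors: Martin Monperrus,Benoit Baudry,Clément Vidal
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: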
Abstract:This paper documents Project Rachel, an action research study that created and tracked a complete AI academic identity named Rachel So. Through careful publication of AI-generated research papers, we investigate how the scholarly ecosystem responds to AI authorship. Rachel So published 10+ papers between March and October 2025, was cited, and received a peer review invitation. We discuss the implications of AI authorship on publishers, researchers, and the scientific system at large. This work contributes empirical action research data to the necessary debate about the future of scholarly communication with super human, hyper capable AI systems.
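[AI-59] Transformer Injectivity & Geometric Robustness - Analytic Margins and Bi-Lipschitz Uniformity of Sequence-Level Hidden States
Quick read: This paper studies whether the map from discrete prompts to last-layer hidden states in Transformer models is generically injective, that is, whether for almost all model parameters in the continuous parameter space different input prompts yield distinct output representations. First, it defines for each layer $\ell$ a collision discriminant $\Delta^\ell$ and an injective stratum $U^\ell = \Theta \setminus \Delta^\ell$, giving a mathematical characterization of layerwise injectivity, and proves a dichotomy: either the model is nowhere injective on the prompt set, or the injective stratum is open and dense and every map $F^\ell_\theta$ is injective. Second, under non-singularity assumptions on the optimizer and an absolutely continuous initialization, it proves that this injectivity persists along smooth training trajectories over any fixed training horizon. It also carries the analysis to the quotient space under a symmetry group $G$, showing that injectivity is naturally a property of functional equivalence classes. Empirically, it introduces the separation margin and the co-Lipschitz constant as geometric diagnostics and finds no collisions in full precision or under 8-bit quantization, whereas 4-bit quantization causes a small number of collisions and markedly lowers the co-Lipschitz estimates. The results suggest that Transformer representations are persistently and generically injective in the idealized continuous parameter space, and that their invertibility can be probed with simple geometric tools.
Link: https://arxiv.org/abs/2511.14808
Authors: Mikael von Strauss
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 22 pages, 5 figures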
Abstract:Under real-analytic assumptions on decoder-only Transformers, recent work shows that the map from discrete prompts to last-token hidden states is generically injective on finite prompt sets. We refine this picture: for each layer $\ell$ we define a collision discriminant $\Delta^\ell \subset \Theta$ and injective stratum $U^\ell = \Theta \setminus \Delta^\ell$, and prove a dichotomy: either the model is nowhere injective on the set, or $U^\ell$ is open and dense and every $F^\ell_\theta$ is injective. Under mild non-singularity assumptions on the optimizer and an absolutely continuous initialization, generic injectivity persists along smooth training trajectories over any fixed horizon. We also treat symmetry groups $G$, showing that discriminants and injective strata descend to the quotient $\Theta/G$, so injectivity is naturally a property of functional equivalence classes. We complement these results with an empirical study of layerwise geometric diagnostics. We define a separation margin and a co-Lipschitz (lower Lipschitz) constant between prompt space and last-token representation space, estimated via nearest-neighbor statistics on large prompt sets. Applying these diagnostics to pretrained LLaMA-3 and Qwen models, we study behavior across layers, sequence lengths, model scales, and 8- and 4-bit activation quantization. On our sampled prompts we see no collisions in full precision or at 8 bits, while 4-bit quantization induces a small number of collisions and markedly shrinks co-Lipschitz estimates. For a small GPT-2 trained from scratch, normalized metrics remain stable over training. Overall, the results suggest that Transformer representations are generically and persistently injective in the continuous-parameter idealization, while their practical invertibility can be probed using simple geometric diagnostics.
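The two geometric diagnostics can be pictured with a brute-force sketch over all prompt pairs: the separation margin as the smallest distance between hidden states of distinct prompts, and the co-Lipschitz estimate as the smallest ratio of hidden-state distance to prompt distance. The abstract says the paper estimates these via nearest-neighbor statistics on large prompt sets; the exhaustive loop, the Hamming distance on token ids, and the toy data below are assumptions for illustration.

```python
import numpy as np


def pairwise_diagnostics(prompts, hidden):
    """Brute-force separation margin and co-Lipschitz estimate over all prompt pairs.

    prompts: (N, L) integer token ids of equal length; hidden: (N, D) last-token states.
    Prompt distance is the Hamming distance (an assumption made for this sketch).
    """
    prompts = np.asarray(prompts)
    hidden = np.asarray(hidden, dtype=float)
    n = len(prompts)
    margin, co_lip = np.inf, np.inf
    for i in range(n):
        for j in range(i + 1, n):
            d_prompt = int(np.sum(prompts[i] != prompts[j]))
            if d_prompt == 0:
                continue                       # identical prompts carry no information
            d_hidden = float(np.linalg.norm(hidden[i] - hidden[j]))
            margin = min(margin, d_hidden)     # a margin of 0 would indicate a collision
            co_lip = min(co_lip, d_hidden / d_prompt)
    return margin, co_lip


toy_prompts = [[1, 2, 3], [1, 2, 4], [5, 6, 7]]
toy_hidden = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
margin, co_lip = pairwise_diagnostics(toy_prompts, toy_hidden)
print(margin, co_lip)
```

A positive margin on the sampled prompts is consistent with injectivity on that sample, and a shrinking co-Lipschitz estimate, as reported for 4-bit quantization, signals that distinct prompts are being pushed closer together.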
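[AI-60] Towards Continuous Assurance with Formal Verification and Assurance Cases
Quick read: This paper addresses how autonomous systems can sustain justified confidence in their correctness and safety across their whole lifecycle, from design and deployment through post-deployment evolution. Traditional methods tend to separate development-time assurance from runtime assurance, producing fragmented arguments that cannot adapt to runtime changes or system updates. The key to its solution is a unified Continuous Assurance Framework that integrates design-time, runtime, and evolution-time assurance in a traceable, model-driven workflow. The design-time phase uses RoboChart for formal verification of functional correctness and PRISM for probabilistic risk analysis, together with a model-driven transformation pipeline, implemented as an Eclipse plugin, that automatically regenerates structured assurance arguments so that traceability and consistency are preserved after changes.
Link: https://arxiv.org/abs/2511.14805
Authors: Dhaminda B. Abeywickrama,Michael Fisher,Frederic Wheeler,Louise Dennis
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 15 pages, 7 figures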
Abstract:Autonomous systems must sustain justified confidence in their correctness and safety across their operational lifecycle-from design and deployment through post-deployment evolution. Traditional assurance methods often separate development-time assurance from runtime assurance, yielding fragmented arguments that cannot adapt to runtime changes or system updates - a significant challenge for assured autonomy. Towards addressing this, we propose a unified Continuous Assurance Framework that integrates design-time, runtime, and evolution-time assurance within a traceable, model-driven workflow as a step towards assured autonomy. In this paper, we specifically instantiate the design-time phase of the framework using two formal verification methods: RoboChart for functional correctness and PRISM for probabilistic risk analysis. We also propose a model-driven transformation pipeline, implemented as an Eclipse plugin, that automatically regenerates structured assurance arguments whenever formal specifications or their verification results change, thereby ensuring traceability. We demonstrate our approach on a nuclear inspection robot scenario, and discuss its alignment with the Trilateral AI Principles, reflecting regulator-endorsed best practices.
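[AI-61] Scalable and Efficient Large-Scale Log Analysis with LLMs: An IT Software Support Case Study
Quick read: This paper addresses the problem that the volume of log data in IT environments makes manual analysis impractical, with the aim of improving software support efficiency. Its core solution uses large language models (LLMs) to process log data and diagnose issues automatically, generating structured insights and summaries. It also proposes a method for running LLMs efficiently on CPUs that substantially shortens processing time without compromising output quality, enabling fast analysis of large log volumes.
Link: https://arxiv.org/abs/2511.14803
Authors: Pranjal Gupta,Karan Bhukar,Harshit Kumar,Seema Nagar,Prateeti Mohapatra,Debanjana Kar
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: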
Abstract:IT environments typically have logging mechanisms to monitor system health and detect issues. However, the huge volume of generated logs makes manual inspection impractical, highlighting the importance of automated log analysis in IT Software Support. In this paper, we propose a log analytics tool that leverages Large Language Models (LLMs) for log data processing and issue diagnosis, enabling the generation of automated insights and summaries. We further present a novel approach for efficiently running LLMs on CPUs to process massive log volumes in minimal time without compromising output quality. We share the insights and lessons learned from deployment of the tool, in production since March 2024 and scaled across 70 software products, processing over 2000 tickets for issue diagnosis, achieving a time savings of 300+ man hours and an estimated $15,444 per month in manpower costs compared to the traditional log analysis practices.
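[AI-62] Evaluating Generative AI for CS1 Code Grading: Direct vs Reverse Methods
Quick read: This paper addresses the low efficiency and inconsistency of manually grading programming assignments in introductory computer science courses, and the limitation that conventional unit testing supports only binary pass/fail outcomes and cannot award partial marks. Its core solution is two automatic grading methods based on large language models (LLMs): Direct, in which the AI model scores the code directly against a rubric, and Reverse, in which the AI first fixes the errors in the code and then infers a grade from the difficulty and number of fixes. The study reports that Reverse often assesses student work in finer detail, especially in identifying logic errors and allocating partial credit, but that both methods depend on carefully designed prompts for accuracy and fairness.
Link: https://arxiv.org/abs/2511.14798
Authors: Ahmad Memon,Abdallah Mohamed
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 10 pages, 5 figures. This version corresponds to the paper accepted for presentation at CASCON 2025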
Abstract:Manual grading of programming assignments in introductory computer science courses can be time-consuming and prone to inconsistencies. While unit testing is commonly used for automatic evaluation, it typically follows a binary pass/fail model and does not give partial marks. Recent advances in large language models (LLMs) offer the potential for automated, scalable, and more objective grading. This paper compares two AI-based grading techniques: Direct, where the AI model applies a rubric directly to student code, and Reverse (a newly proposed approach), where the AI first fixes errors, then deduces a grade based on the nature and number of fixes. Each method was evaluated on both the instructor’s original grading scale and a tenfold expanded scale to assess the impact of range on AI grading accuracy. To assess their effectiveness, AI-assigned scores were evaluated against human tutor evaluations on a range of coding problems and error types. Initial findings suggest that while the Direct approach is faster and straightforward, the Reverse technique often provides a more fine-grained assessment by focusing on correction effort. Both methods require careful prompt engineering, particularly for allocating partial credit and handling logic errors. To further test consistency, we also used synthetic student code generated using Gemini Flash 2.0, which allowed us to evaluate AI graders on a wider range of controlled error types and difficulty levels. We discuss the strengths and limitations of each approach, practical considerations for prompt design, and future directions for hybrid human-AI grading systems that aim to improve consistency, efficiency, and fairness in CS courses.
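The final step of the Reverse technique, turning a list of fixes into a grade, might look like the sketch below once the model has reported which corrections it made. The fix categories, the penalty weights, and the `scale` argument (standing in for the tenfold expanded scale) are hypothetical; the abstract only says the grade depends on the nature and number of fixes.

```python
# Hypothetical penalty per fix category, expressed on a 10-point scale.
PENALTY = {"syntax": 0.5, "logic": 2.0, "missing_feature": 3.0}


def reverse_grade(fixes, max_score=10.0, scale=1.0):
    """Deduct a penalty for every fix the model had to make; never go below zero."""
    deduction = sum(PENALTY[kind] for kind in fixes)
    return max(0.0, max_score - deduction) * scale


print(reverse_grade(["syntax", "logic"]))             # two fixes on the original scale
print(reverse_grade(["syntax", "logic"], scale=10))   # same submission on a tenfold scale
```

Keeping the deduction table outside the prompt makes the partial-credit policy explicit and auditable, whichever model supplies the list of fixes.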
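[AI-63] irace-evo: Automatic Algorithm Configuration Extended With LLM-Based Code Evolution
Quick read: This paper addresses the limitation that conventional automatic algorithm configuration tools such as irace can only tune parameter values and cannot improve the structure of the algorithm's code, which caps the achievable performance gains. Its core solution is the irace-evo framework, which integrates code evolution driven by large language models (LLMs) into the automatic configuration process so that parameter space and code space are explored jointly. Key features include multi-language code generation and modification (for example C++ and Python), lower token consumption through progressive context management, and the Always-From-Original principle, which keeps the evolution robust and controlled. Experiments show that irace-evo can find algorithm variants that outperform the state-of-the-art implementation at low cost, supporting the effectiveness and economy of LLM-driven code evolution for heuristic design and metaheuristic optimization.
Link: https://arxiv.org/abs/2511.14794
Authors: Camilo Chacón Sartori,Christian Blum
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: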
Abstract:Automatic algorithm configuration tools such as irace efficiently tune parameter values but leave algorithmic code unchanged. This paper introduces a first version of irace-evo, an extension of irace that integrates code evolution through large language models (LLMs) to jointly explore parameter and code spaces. The proposed framework enables multi-language support (e.g., C++, Python), reduces token consumption via progressive context management, and employs the Always-From-Original principle to ensure robust and controlled code evolution. We evaluate irace-evo on the Construct, Merge, Solve & Adapt (CMSA) metaheuristic for the Variable-Sized Bin Packing Problem (VSBPP). Experimental results show that irace-evo can discover new algorithm variants that outperform the state-of-the-art CMSA implementation while maintaining low computational and monetary costs. Notably, irace-evo generates competitive algorithmic improvements using lightweight models (e.g., Claude Haiku 3.5) with a total usage cost under 2 euros. These results demonstrate that coupling automatic configuration with LLM-driven code evolution provides a powerful, cost-efficient avenue for advancing heuristic design and metaheuristic optimization.
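[AI-64] Enabling Predictive Maintenance in District Heating Substations: A Labelled Dataset and Fault Detection Evaluation Framework based on Service Data
Quick read: This paper addresses early fault detection in district heating substations, which is needed to lower return temperatures and improve system efficiency; the main obstacle is the lack of public, well-labelled operational datasets. The authors release an open-source framework that combines a public time-series dataset validated against service reports (93 substations from two manufacturers, labelled with faults, maintenance events, and normal behaviour), an evaluation method based on Accuracy, Reliability, and Earliness, and baseline models built with EnergyFaultDetector. The baselines reach a normal-behaviour accuracy of 0.98 and an eventwise F-score (β=0.5) of 0.83, and detect 60% of faults before the customer reports a problem, with an average lead time of 3.9 days. The result is a reproducible, fault-centric benchmark for early fault detection and diagnosis.
Link: https://arxiv.org/abs/2511.14791
Authors: Cyriana M.A. Roelofs, Edison Guevara Bastidas, Thomas Hugo, Stefan Faulstich, Anna Cadenbach
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 30 pages, 6 figures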
Abstract:Early detection of faults in district heating substations is imperative to reduce return temperatures and enhance efficiency. However, progress in this domain has been hindered by the limited availability of public, labelled datasets. We present an open source framework combining a service-report-validated public dataset, an evaluation method based on Accuracy, Reliability, and Earliness, and baseline results implemented with EnergyFaultDetector, an open source Python framework. The dataset contains time series of operational data from 93 substations across two manufacturers, annotated with a list of disturbances due to faults and maintenance actions, a set of normal-event examples and detailed fault metadata. We evaluate the EnergyFaultDetector using three metrics: Accuracy for recognising normal behaviour, an eventwise F-score for reliable fault detection with few false alarms, and Earliness for early detection. The framework also supports root cause analysis using ARCANA. We demonstrate three use cases to assist operators in interpreting anomalies and identifying underlying faults. The models achieve high normal-behaviour accuracy (0.98) and eventwise F-score (beta=0.5) of 0.83, detecting 60% of the faults in the dataset before the customer reports a problem, with an average lead time of 3.9 days. Integrating an open dataset, metrics, open source code, and baselines establishes a reproducible, fault-centric benchmark with operationally meaningful evaluation, enabling consistent comparison and development of early fault detection and diagnosis methods for district heating substations.
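The reliability metric above is an F-score with beta=0.5, which weights precision more heavily than recall and so penalises false alarms. The sketch below shows the standard F-beta formula only; how the paper counts events as true or false positives is not described in the abstract, so the counts used here are hypothetical and chosen purely for illustration.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical event counts, not taken from the paper.
true_positives, false_positives, false_negatives = 6, 1, 4
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
score = f_beta(precision, recall, beta=0.5)
print(f"precision={precision:.3f} recall={recall:.3f} F0.5={score:.3f}")
```

With these counts the F0.5 value sits closer to the precision than to the recall, which is the intended effect of choosing beta below 1.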
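[AI-65] Subnational Geocoding of Global Disasters Using Large Language Models
Quick read: Disaster databases such as EM-DAT often record subnational locations as unstructured text with inconsistent granularity or spelling, which makes them hard to integrate with spatial datasets. The paper proposes a fully automated LLM-assisted workflow that uses GPT-4o to process and clean the textual location information and assigns geometries by cross-checking three independent geoinformation repositories (GADM, OpenStreetMap, and Wikidata). Each location receives a reliability score based on the agreement and availability of these sources, yielding a scalable geocoding method that needs no manual intervention.
Link: https://arxiv.org/abs/2511.14788
Authors: Michele Ronco, Damien Delforge, Wiebke S. Jäger, Christina Corbane
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments: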
Abstract:Subnational location data of disaster events are critical for risk assessment and disaster risk reduction. Disaster databases such as EM-DAT often report locations in unstructured textual form, with inconsistent granularity or spelling, which makes them difficult to integrate with spatial datasets. We present a fully automated LLM-assisted workflow that processes and cleans textual location information using GPT-4o, and assigns geometries by cross-checking three independent geoinformation repositories: GADM, OpenStreetMap and Wikidata. Based on the agreement and availability of these sources, we assign a reliability score to each location while generating subnational geometries. Applied to the EM-DAT dataset from 2000 to 2024, the workflow geocodes 14,215 events across 17,948 unique locations. Unlike previous methods, our approach requires no manual intervention, covers all disaster types, enables cross-verification across multiple sources, and allows flexible remapping to preferred frameworks. Beyond the dataset, we demonstrate the potential of LLMs to extract and structure geographic information from unstructured text, offering a scalable and reliable method for related analyses.
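[AI-66] Ask WhAI: Probing Belief Formation in Role-Primed LLM Agents
Quick read: Belief states in multi-agent systems are hard to observe and to intervene on, particularly in complex scientific reasoning where one wants to analyse how individual agents form beliefs and integrate evidence. The Ask WhAI framework records and replays multi-agent interactions, supports out-of-band queries into each agent's beliefs and rationale, and allows counterfactual evidence injection to test how belief structures respond to new information, making belief dynamics traceable, perturbable, and testable.
Link: https://arxiv.org/abs/2511.14780
Authors: Keith Moore, Jun W. Kim, David Lyu, Jeffrey Heo, Ehsan Adeli
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Preprint. Accepted for publication at AIAS 2025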
Abstract:We present Ask WhAI, a systems-level framework for inspecting and perturbing belief states in multi-agent interactions. The framework records and replays agent interactions, supports out-of-band queries into each agent’s beliefs and rationale, and enables counterfactual evidence injection to test how belief structures respond to new information. We apply the framework to a medical case simulator notable for its multi-agent shared memory (a time-stamped electronic medical record, or EMR) and an oracle agent (the LabAgent) that holds ground truth lab results revealed only when explicitly queried. We stress-test the system on a multi-specialty diagnostic journey for a child with an abrupt-onset neuropsychiatric presentation. Large language model agents, each primed with strong role-specific priors (“act like a neurologist”, “act like an infectious disease specialist”), write to a shared medical record and interact with a moderator across sequential or parallel encounters. Breakpoints at key diagnostic moments enable pre- and post-event belief queries, allowing us to distinguish entrenched priors from reasoning or evidence-integration effects. The simulation reveals that agent beliefs often mirror real-world disciplinary stances, including overreliance on canonical studies and resistance to counterevidence, and that these beliefs can be traced and interrogated in ways not possible with human experts. By making such dynamics visible and testable, Ask WhAI offers a reproducible way to study belief formation and epistemic silos in multi-agent scientific reasoning.
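[AI-67] Learning Interestingness in Automated Mathematical Theory Formation NEURIPS2025
Quick read: This paper works toward automating the open-ended discovery of new mathematical theories. The core challenge is how to define and quantify the "interestingness" of mathematical objects so that a system is guided toward valuable concepts and theorems. The authors introduce FERMAT, a reinforcement learning (RL) environment that models concept discovery and theorem proving with a set of symbolic actions, and an LLM-based evolutionary algorithm that synthesizes nontrivial interestingness measures. The algorithm features function abstraction and yields notable improvements over hard-coded baselines in discovering elementary number theory and finite fields.
Link: https://arxiv.org/abs/2511.14778
Authors: George Tsoukalas, Rahul Saha, Amitayush Thakur, Sabrina Reguyal, Swarat Chaudhuri
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: NeurIPS 2025 Spotlight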
Abstract:We take two key steps in automating the open-ended discovery of new mathematical theories, a grand challenge in artificial intelligence. First, we introduce FERMAT, a reinforcement learning (RL) environment that models concept discovery and theorem-proving using a set of symbolic actions, opening up a range of RL problems relevant to theory discovery. Second, we explore a specific problem through FERMAT: automatically scoring the interestingness of mathematical objects. We investigate evolutionary algorithms for synthesizing nontrivial interestingness measures. In particular, we introduce an LLM-based evolutionary algorithm that features function abstraction, leading to notable improvements in discovering elementary number theory and finite fields over hard-coded baselines. We open-source the FERMAT environment at this https URL.
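[AI-68] The Illusion of Procedural Reasoning: Measuring Long-Horizon FSM Execution in LLMs
Quick read: Large language models (LLMs) tend to degrade on multi-step, rule-based procedural reasoning: over long reasoning chains they struggle to keep state consistent and computations accurate, and there has been no controlled, interpretable benchmark to isolate and measure this degradation. The paper proposes Finite-State Machine (FSM) Execution: the model is given an explicit FSM definition and must execute input actions step by step, which gives a fully interpretable evaluation with controllable complexity. The setup separates immediate computation accuracy from cumulative state maintenance, shows that high branching factors hurt more than long memory spans, and points to externalizing intermediate steps as a way to improve robustness.
Link: https://arxiv.org/abs/2511.14777
Authors: Mahdi Samiei, Mahdi Mansouri, Mahdieh Soleymani Baghshah
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: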
Abstract:Large language models (LLMs) have achieved remarkable results on tasks framed as reasoning problems, yet their true ability to perform procedural reasoning, that is, executing multi-step, rule-based computations, remains unclear. Unlike algorithmic systems, which can deterministically execute long-horizon symbolic procedures, LLMs often degrade under extended reasoning chains, but there is no controlled, interpretable benchmark to isolate and measure this collapse. We introduce Finite-State Machine (FSM) Execution as a minimal, fully interpretable framework for evaluating the procedural reasoning capacity of LLMs. In our setup, the model is given an explicit FSM definition and must execute it step-by-step given input actions, maintaining state consistency over multiple turns. This task requires no world knowledge, only faithful application of deterministic transition rules, making it a direct probe of the model’s internal procedural fidelity. We measure both Turn Accuracy and Task Accuracy to disentangle immediate computation from cumulative state maintenance. Empirical results reveal systematic degradation as task horizon or branching complexity increases. Models perform significantly worse when rule retrieval involves high branching factors than when memory span is long. Larger models show improved local accuracy but remain brittle under multi-step reasoning unless explicitly prompted to externalize intermediate steps. FSM-based evaluation offers a transparent, complexity-controlled probe for diagnosing this failure mode and guiding the design of inductive biases that enable genuine long-horizon procedural competence. By grounding reasoning in measurable execution fidelity rather than surface correctness, this work helps establish a rigorous experimental foundation for understanding and improving the algorithmic reliability of LLMs.
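To make the evaluation setup concrete, here is a minimal sketch of FSM execution with two scores. The abstract does not define Turn Accuracy and Task Accuracy precisely, so this sketch assumes that turn accuracy is the fraction of steps whose predicted state matches the ground truth and that task accuracy is 1 only when every step matches. The toy FSM, the action sequence, and the "model output" are all hypothetical.

```python
from typing import Dict, List, Tuple

# Transition table: (state, action) -> next state. States and actions are hypothetical.
Transitions = Dict[Tuple[str, str], str]


def run_fsm(transitions: Transitions, start: str, actions: List[str]) -> List[str]:
    """Return the ground-truth state reached after each action."""
    states, state = [], start
    for action in actions:
        state = transitions[(state, action)]
        states.append(state)
    return states


def turn_and_task_accuracy(truth: List[str], predicted: List[str]) -> Tuple[float, float]:
    """Assumed definitions: per-step match rate, and all-steps-correct indicator."""
    correct = [t == p for t, p in zip(truth, predicted)]
    turn_acc = sum(correct) / len(truth)
    task_acc = 1.0 if len(predicted) == len(truth) and all(correct) else 0.0
    return turn_acc, task_acc


fsm = {("A", "x"): "B", ("A", "y"): "A", ("B", "x"): "A", ("B", "y"): "B"}
truth = run_fsm(fsm, "A", ["x", "y", "x", "x"])
model_output = ["B", "B", "B", "B"]  # hypothetical predictions with one wrong step
turn_acc, task_acc = turn_and_task_accuracy(truth, model_output)
print(truth, turn_acc, task_acc)
```

A single wrong step leaves the per-step score fairly high while the whole-sequence score drops to zero, which illustrates why the two measures can diverge on long horizons.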
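[AI-69] ExplainRec: Towards Explainable Multi-Modal Zero-Shot Recommendation with Preference Attribution and Large Language Models
Quick read: Current recommender systems based on large language models (LLMs) face bottlenecks in explainability and cold-start scenarios. ExplainRec addresses them with four technical contributions: preference attribution tuning to produce explainable recommendations, zero-shot preference transfer for cold-start users and items, multi-modal enhancement that leverages visual and textual content, and multi-task collaborative optimization. On MovieLens-25M and Amazon datasets the framework outperforms existing methods, with AUC improvements of 0.7% on movie recommendation and 0.9% on cross-domain tasks, while producing interpretable explanations and handling cold-start cases.
Link: https://arxiv.org/abs/2511.14770
Authors: Bo Ma, LuYao Liu, ZeHua Hu, Simon Lau
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: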
Abstract:Recent advances in Large Language Models (LLMs) have opened new possibilities for recommendation systems, though current approaches such as TALLRec face challenges in explainability and cold-start scenarios. We present ExplainRec, a framework that extends LLM-based recommendation capabilities through preference attribution, multi-modal fusion, and zero-shot transfer learning. The framework incorporates four technical contributions: preference attribution tuning for explainable recommendations, zero-shot preference transfer for cold-start users and items, multi-modal enhancement leveraging visual and textual content, and multi-task collaborative optimization. Experimental evaluation on MovieLens-25M and Amazon datasets shows that ExplainRec outperforms existing methods, achieving AUC improvements of 0.7% on movie recommendation and 0.9% on cross-domain tasks, while generating interpretable explanations and handling cold-start scenarios effectively.
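[AI-70] Causally-Informed Reinforcement Learning for Adaptive Emotion-Aware Social Media Recommendation
Quick read: Most social media recommender systems optimize only for engagement metrics (such as click rate and viewing time) and ignore users' emotional states, so prolonged exposure to emotionally charged content can harm emotional well-being. The paper proposes the Emotion-aware Social Media Recommendation (ESMR) framework, which combines a Transformer-based emotion predictor with a hybrid recommendation policy: a LightGBM model maintains engagement during emotionally stable periods, and a reinforcement learning agent with causally informed rewards takes over when negative emotional states persist. Evaluated on 30-day interaction traces, ESMR improves emotional recovery and reduces volatility while retaining strong engagement.
Link: https://arxiv.org/abs/2511.14768
Authors: Bhavika Jain, Robert Pitsko, Ananya Drishti, Mahfuza Farooque
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: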
Abstract:Social media recommendation systems play a central role in shaping users’ emotional experiences. However, most systems are optimized solely for engagement metrics, such as click rate, viewing time, or scrolling, without accounting for users’ emotional states. Repeated exposure to emotionally charged content has been shown to negatively affect users’ emotional well-being over time. We propose an Emotion-aware Social Media Recommendation (ESMR) framework that personalizes content based on users’ evolving emotional trajectories. ESMR integrates a Transformer-based emotion predictor with a hybrid recommendation policy: a LightGBM model for engagement during stable periods and a reinforcement learning agent with causally informed rewards when negative emotional states persist. Through behaviorally grounded evaluation over 30-day interaction traces, ESMR demonstrates improved emotional recovery, reduced volatility, and strong engagement retention. ESMR offers a path toward emotionally aware recommendations without compromising engagement performance.
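[AI-71] An LLM-Powered Agent for Real-Time Analysis of the Vietnamese IT Job Market
Quick read: People entering Vietnam's dynamic Information Technology (IT) job market lack reliable career guidance: existing market reports are often outdated, and manually analysing thousands of job postings is impractical. The paper presents the AI Job Market Consultant, a conversational agent built around a tool-augmented AI agent based on the ReAct framework, which reasons, plans, and acts autonomously through tools for SQL queries, semantic search, and data visualization. The system relies on a custom dataset produced by a pipeline that crawls job portals with Playwright and structures postings with an LLM. The prototype collected and analysed 3,745 job postings and can answer complex multi-step queries, generate visualizations on demand, and give personalized career advice grounded in real-world data.
Link: https://arxiv.org/abs/2511.14767
Authors: Minh-Thuan Nguyen, Thien Vo-Thanh, Thai-Duy Dinh, Xuan-Quang Phan, Tan-Ha Mai, Lam-Son Lê
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Accepted at ACOMPA 2025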
Abstract:Individuals entering Vietnam’s dynamic Information Technology (IT) job market face a critical gap in reliable career guidance. Existing market reports are often outdated, while the manual analysis of thousands of job postings is impractical for most. To address this challenge, we present the AI Job Market Consultant, a novel conversational agent that delivers deep, data-driven insights directly from the labor market in real-time. The foundation of our system is a custom-built dataset created via an automated pipeline that crawls job portals using Playwright and leverages a Large Language Model (LLM) to intelligently structure unstructured posting data. The core of our system is a tool-augmented AI agent, based on the ReAct agentic framework, which can autonomously reason, plan, and execute actions through a specialized toolbox for SQL queries, semantic search, and data visualization. Our prototype successfully collected and analyzed 3,745 job postings, demonstrating its ability to answer complex, multi-step queries, generate on-demand visualizations, and provide personalized career advice grounded in real-world data. This work introduces a new paradigm for labor market analysis, showcasing how specialized agentic AI systems can democratize access to timely, trustworthy career intelligence for the next generation of professionals.
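[AI-72] Image-Seeking Intent Prediction for Cross-Device Product Search RECSYS
Quick read: In multi-device e-commerce, product discovery suffers when users lack visual support; the core challenge is predicting when a user would benefit from switching devices (for example, from a voice-only assistant to a screen-enabled device). The paper introduces a new task, Image-Seeking Intent Prediction, and trains IRP (Image Request Predictor), a model that uses the user's query together with metadata of the retrieved products to anticipate visual intent. Combining query semantics with product data, especially after lightweight summarization, improves prediction accuracy, and a differentiable precision-oriented loss further reduces false positives.
Link: https://arxiv.org/abs/2511.14764
Authors: Mariya Hendriksen, Svitlana Vakulenko, Jordan Massiah, Gabriella Kazai, Emine Yilmaz
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Oral at RecSys Gen AI for E-commerce 2025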
Abstract:Large Language Models (LLMs) are transforming personalized search, recommendations, and customer interaction in e-commerce. Customers increasingly shop across multiple devices, from voice-only assistants to multimodal displays, each offering different input and output capabilities. A proactive suggestion to switch devices can greatly improve the user experience, but it must be offered with high precision to avoid unnecessary friction. We address the challenge of predicting when a query requires visual augmentation and a cross-device switch to improve product discovery. We introduce Image-Seeking Intent Prediction, a novel task for LLM-driven e-commerce assistants that anticipates when a spoken product query should proactively trigger a visual on a screen-enabled device. Using large-scale production data from a multi-device retail assistant, including 900K voice queries, associated product retrievals, and behavioral signals such as image carousel engagement, we train IRP (Image Request Predictor), a model that leverages user input query and corresponding retrieved product metadata to anticipate visual intent. Our experiments show that combining query semantics with product data, particularly when improved through lightweight summarization, consistently improves prediction accuracy. Incorporating a differentiable precision-oriented loss further reduces false positives. These results highlight the potential of LLMs to power intelligent, cross-device shopping assistants that anticipate and adapt to user needs, enabling more seamless and personalized e-commerce experiences.
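[AI-73] Membership Inference Attack against Large Language Model-based Recommendation Systems: A New Distillation-based Paradigm
Quick read: For recommender systems based on large language models (LLMs), traditional membership inference attacks (MIA) are hard to mount because the scale and complexity of the training or fine-tuning data make shadow models difficult to construct. The paper introduces knowledge distillation to obtain a reference model instead: member and non-member data are distilled separately, which strengthens the reference model's ability to discriminate between the two. Individual features are then extracted from the reference model, fused, and used to train the attack model, improving attack performance over shadow-model-based attacks.
Link: https://arxiv.org/abs/2511.14763
Authors: Li Cuihong, Huang Xiaowen, Yin Chuanhuan, Sang Jitao
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: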
Abstract:Membership Inference Attack (MIA) aims to determine whether a data sample was used in the training dataset of a target model. Traditional MIA obtains features of the target model via shadow models and uses those features to train an attack model, but the scale and complexity of the training or fine-tuning data for large language model (LLM)-based recommendation systems make shadow models difficult to construct. Knowledge distillation, as a method for extracting knowledge, helps to construct a stronger reference model. It enables separate distillation for member and non-member data during the distillation process, enhancing the model’s discriminative capability between the two in MIA. This paper proposes a knowledge distillation-based MIA paradigm to improve the performance of membership inference attacks on LLM-based recommendation systems. Our paradigm introduces knowledge distillation to obtain a reference model, which enhances the reference model’s ability to distinguish between member and non-member data. We obtain individual features from the reference model and train our attack model with the fused features. Our paradigm improves the attack performance of MIA compared to shadow model-based attacks.
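[AI-74] Joint Semantic-Channel Coding and Modulation for Token Communications
Quick read: This paper studies token communication, that is, how to transmit the tokens used by Transformer-based models efficiently and reliably, with point clouds as the information source. The challenge is to encode and modulate point tokens despite their complex three-dimensional spatial structure, improving reconstruction quality while reducing transmission overhead. The proposed joint semantic-channel coding and modulation (JSCCM) scheme consists of two parallel Point Transformer-based encoders and a differential modulator that combines Gumbel-softmax and soft quantization, mapping point tokens to standard digital constellation points (modulated tokens). A rate allocator and a channel adapter generate modulated tokens adaptively, conditioned on semantic information and channel conditions. In simulations the method outperforms both joint semantic-channel coding and traditional separate coding, with over 1 dB gain in reconstruction and a compression ratio of more than 6x in modulated symbols.
Link: https://arxiv.org/abs/2511.15699
Authors: Jingkai Ying, Zhijin Qin, Yulong Feng, Liejun Wang, Xiaoming Tao
Affiliations: unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: 14 pages, 14 figures, 2 tables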
Abstract:In recent years, the Transformer architecture has achieved outstanding performance across a wide range of tasks and modalities. The token is the unified input and output representation in Transformer-based models and has become a fundamental information unit. In this work, we consider the problem of token communication, studying how to transmit tokens efficiently and reliably. The point cloud, a prevailing three-dimensional format which exhibits a more complex spatial structure compared to image or video, is chosen to be the information source. We utilize the set abstraction method to obtain point tokens. Subsequently, to get a more informative and transmission-friendly representation based on tokens, we propose a joint semantic-channel coding and modulation (JSCCM) scheme for the token encoder, mapping point tokens to standard digital constellation points (modulated tokens). Specifically, the JSCCM consists of two parallel Point Transformer-based encoders and a differential modulator which combines the Gumbel-softmax and soft quantization methods. Besides, the rate allocator and channel adapter are developed, facilitating adaptive generation of high-quality modulated tokens conditioned on both semantic information and channel conditions. Extensive simulations demonstrate that the proposed method outperforms both joint semantic-channel coding and traditional separate coding, achieving over 1dB gain in reconstruction and more than 6x compression ratio in modulated symbols.
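The abstract names Gumbel-softmax as one ingredient of the modulator. The sketch below shows the generic Gumbel-softmax relaxation applied to a choice among constellation points; it is not the paper's modulator, whose combination with soft quantization is not described in the abstract. The QPSK constellation, the temperature `tau`, and the random logits are assumptions made for illustration.

```python
import numpy as np


def gumbel_softmax(logits: np.ndarray, tau: float, rng: np.random.Generator) -> np.ndarray:
    """Draw a relaxed one-hot sample over the last axis of `logits`."""
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel_noise = -np.log(-np.log(u))
    y = (logits + gumbel_noise) / tau
    y = y - y.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    exp_y = np.exp(y)
    return exp_y / exp_y.sum(axis=-1, keepdims=True)


# Hypothetical unit-energy QPSK constellation.
constellation = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4))  # 3 tokens, 4 candidate symbols each
weights = gumbel_softmax(logits, tau=0.5, rng=rng)
soft_symbols = weights @ constellation  # convex combination of constellation points
hard_symbols = constellation[weights.argmax(axis=-1)]  # nearest discrete choice
print(soft_symbols, hard_symbols)
```

Lower values of `tau` push the weights toward one-hot vectors, so the soft symbols move toward actual constellation points while the mapping stays differentiable with respect to the logits.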
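[AI-75] Multimodal Wireless Foundation Models
Quick read: Current wireless foundation models (WFMs) process only a single modality, yet the most informative modality changes with the task and operating conditions, so no single modality is best for every task. The paper proposes and builds the first multimodal WFM, which processes both raw IQ streams and image-like wireless modalities (such as spectrograms and channel state information, CSI). It introduces masked wireless modeling, a self-supervised objective and pretraining recipe that learns a joint representation across modalities. The model is evaluated on five tasks: image-based (human activity sensing, RF signal classification, 5G NR positioning) and IQ-based (RF device fingerprinting, interference detection/classification). It is competitive with single-modality WFMs and surpasses them in several cases.
Link: https://arxiv.org/abs/2511.15162
Authors: Ahmed Aboulfotouh, Hatem Abou-Zeid
Affiliations: unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: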
Abstract:Wireless foundation models (WFMs) have recently demonstrated promising capabilities, jointly performing multiple wireless functions and adapting effectively to new environments. However, current WFMs process only one modality, whereas the most informative modality changes with the task and operating conditions, and no single modality is best for all tasks. WFMs should therefore be designed to accept multiple modalities to enable a broader and more diverse range of tasks and scenarios. In this work, we propose and build the first multimodal wireless foundation model capable of processing both raw IQ streams and image-like wireless modalities (e.g., spectrograms and CSI) and performing multiple tasks across both. We introduce masked wireless modeling for the multimodal setting, a self-supervised objective and pretraining recipe that learns a joint representation from IQ streams and image-like wireless modalities. We evaluate the model on five tasks across both modality families: image-based (human activity sensing, RF signal classification, 5G NR positioning) and IQ-based (RF device fingerprinting, interference detection/classification). The multimodal WFM is competitive with single-modality WFMs, and in several cases surpasses their performance. Our results demonstrate the strong potential of developing multimodal WFMs that support diverse wireless tasks across different modalities. We believe this provides a concrete step toward both AI-native 6G and the vision of joint sensing, communication, and localization.
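[AI-76] CASPER: Cross-modal Alignment of Spatial and single-cell Profiles for Expression Recovery
Quick read: Spatial Transcriptomics platforms measure only a limited set of genes because of experimental constraints and high costs. The paper proposes CASPER, a cross-attention-based framework that predicts unmeasured gene expression in Spatial Transcriptomics by leveraging centroid-level representations from Single-Cell RNA Sequencing (scRNA-seq) data. Across four Spatial Transcriptomics/scRNA-seq dataset pairs and against four baseline models, CASPER improves nine of the twelve reported metrics, paving the way for further work on translating between the two modalities.
Link: https://arxiv.org/abs/2511.15139
Authors: Amit Kumar, Maninder Kaur, Raghvendra Mall, Sukrit Gupta
Affiliations: unknown
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: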
Abstract:Spatial Transcriptomics enables mapping of gene expression within its native tissue context, but current platforms measure only a limited set of genes due to experimental constraints and excessive costs. To overcome this, computational models integrate Single-Cell RNA Sequencing data with Spatial Transcriptomics to predict unmeasured genes. We propose CASPER, a cross-attention based framework that predicts unmeasured gene expression in Spatial Transcriptomics by leveraging centroid-level representations from Single-Cell RNA Sequencing. We performed rigorous testing over four state-of-the-art Spatial Transcriptomics/Single-Cell RNA Sequencing dataset pairs across four existing baseline models. CASPER shows significant improvement in nine out of the twelve metrics for our experiments. This work paves the way for further work in Spatial Transcriptomics to Single-Cell RNA Sequencing modality translation. The code for CASPER is available at this https URL.
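[AI-77] Neural Networks Learn Generic Multi-Index Models Near Information-Theoretic Limit
Quick read: This paper studies how neural networks efficiently learn high-dimensional features, analysing the sample and time efficiency of representation learning by layer-wise gradient descent in a two-layer network on a Gaussian multi-index model with a hidden subspace. Under generic non-degenerate assumptions on the link function, a standard two-layer network agnostically learns the target with $o_d(1)$ test error using $\widetilde{\mathcal{O}}(d)$ samples and $\widetilde{\mathcal{O}}(d^2)$ time, matching the information-theoretic limit up to leading order. In the first stage of gradient descent the inner weights carry out a power-iteration process that implicitly mimics a spectral start for the whole span of the hidden subspace, eliminating finite-sample noise and recovering that span. The analysis indicates that optimal results require training the first layer for more than $\mathcal{O}(1)$ steps.
Link: https://arxiv.org/abs/2511.15120
Authors: Bohan Zhang, Zihao Wang, Hengyu Fu, Jason D. Lee
Affiliations: unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 86 pages, 2 figures. The order of the first two authors was determined by a coin flip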
Abstract:In deep learning, a central issue is to understand how neural networks efficiently learn high-dimensional features. To this end, we explore the gradient descent learning of a general Gaussian Multi-index model $f(\boldsymbol{x})=g(\boldsymbol{U}\boldsymbol{x})$ with hidden subspace $\boldsymbol{U}\in \mathbb{R}^{r\times d}$, which is the canonical setup to study representation learning. We prove that under generic non-degenerate assumptions on the link function, a standard two-layer neural network trained via layer-wise gradient descent can agnostically learn the target with $o_d(1)$ test error using $\widetilde{\mathcal{O}}(d)$ samples and $\widetilde{\mathcal{O}}(d^2)$ time. The sample and time complexity both align with the information-theoretic limit up to leading order and are therefore optimal. During the first stage of gradient descent learning, the proof proceeds via showing that the inner weights can perform a power-iteration process. This process implicitly mimics a spectral start for the whole span of the hidden subspace and eventually eliminates finite-sample noise and recovers this span. It surprisingly indicates that optimal results can only be achieved if the first layer is trained for more than $\mathcal{O}(1)$ steps. This work demonstrates the ability of neural networks to effectively learn hierarchical functions with respect to both sample and time efficiency.
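For readers unfamiliar with the power-iteration process the abstract refers to, the sketch below shows the generic subspace version of the method recovering a planted low-dimensional subspace from a noisy symmetric matrix. It is not the paper's gradient descent analysis. The matrix `M`, the signal strength, the noise level, and the dimensions are hypothetical, and the subspace is stored as columns of `U` here, whereas the abstract writes it as rows.

```python
import numpy as np


def subspace_power_iteration(M: np.ndarray, r: int, steps: int,
                             rng: np.random.Generator) -> np.ndarray:
    """Approximate an orthonormal basis for the dominant r-dimensional eigenspace
    of a symmetric matrix M whose top-r eigenvalues dominate in magnitude."""
    d = M.shape[0]
    Q, _ = np.linalg.qr(rng.normal(size=(d, r)))
    for _ in range(steps):
        Q, _ = np.linalg.qr(M @ Q)  # multiply by M, then re-orthonormalize
    return Q


rng = np.random.default_rng(0)
d, r = 50, 2
U, _ = np.linalg.qr(rng.normal(size=(d, r)))  # planted subspace (orthonormal columns)
noise = rng.normal(size=(d, d))
M = 5.0 * U @ U.T + 0.01 * (noise + noise.T)  # signal plus small symmetric noise
Q = subspace_power_iteration(M, r, steps=30, rng=rng)

# Singular values of U^T Q are the cosines of the principal angles (1.0 means aligned).
alignment = np.linalg.svd(U.T @ Q, compute_uv=False)
print(alignment)
```

Each multiplication by `M` amplifies the planted directions relative to the noise, so repeated steps shrink the noise component; this mirrors the abstract's remark that a constant number of steps is not enough.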
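[AI-78] Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion
Quick read: Inconsistent data quality limits performance in multimodal emotion recognition in conversation (MERC). The paper builds a systematic quality-control pipeline covering speaker identity validation, audio-text alignment, and face detection, followed by multi-stage transfer learning: RecoMadeEasy® engines extract 512-dimensional speaker and face embeddings, MPNet-v2 is fine-tuned for emotion-aware text representations, and emotion-specific MLPs trained on unimodal datasets adapt these features. MAMBA-based trimodal fusion reaches 64.8% accuracy on MELD and 74.3% on IEMOCAP, showing that identity-based audio and visual embeddings combined with emotion-tuned text representations on a quality-controlled subset give competitive results and a basis for improving low-frequency emotion classes.
Link: https://arxiv.org/abs/2511.14969
Authors: Zanxu Wang, Homayoon Beigi
Affiliations: unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Comments: 8 pages, 14 images, 3 tables, Recognition Technologies, Inc. Technical Report RTI-20251118-01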
Abstract:This paper addresses data quality issues in multimodal emotion recognition in conversation (MERC) through systematic quality control and multi-stage transfer learning. We implement a quality control pipeline for MELD and IEMOCAP datasets that validates speaker identity, audio-text alignment, and face detection. We leverage transfer learning from speaker and face recognition, assuming that identity-discriminative embeddings capture not only stable acoustic and facial traits but also person-specific patterns of emotional expression. We employ RecoMadeEasy® engines for extracting 512-dimensional speaker and face embeddings, fine-tune MPNet-v2 for emotion-aware text representations, and adapt these features through emotion-specific MLPs trained on unimodal datasets. MAMBA-based trimodal fusion achieves 64.8% accuracy on MELD and 74.3% on IEMOCAP. These results show that combining identity-based audio and visual embeddings with emotion-tuned text representations on a quality-controlled subset of data yields consistent competitive performance for multimodal emotion recognition in conversation and provides a basis for further improvement on challenging, low-frequency emotion classes.
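[AI-79] Fifty Shades of Greenwashing: The Political Economy of Climate Change Advertising on Social Media WWW
Quick read: This paper examines greenwashing, i.e., climate-related misinformation, and how polluting companies use social media advertising about climate change to redirect criticism. It proposes a new measurement approach that combines large language models, human coders, and Bayesian item response theory to identify greenwashing content in 11 million social-political ads in Meta's Ad Targeting Dataset. The analysis shows that greenwashing has diverse actors and components and identifies a particularly pernicious form, "political greenwashing", that appears to be promoted by fossil fuel companies and related interest groups. Much of this advertising runs through organizations with undisclosed links to the fossil fuel industry and is micro-targeted at left-leaning communities with fossil fuel assets.
Link: https://arxiv.org/abs/2511.14930
Authors: Robert Kubinec, Aseem Mahajan
Affiliations: unknown
Categories: Applications (stat.AP); Artificial Intelligence (cs.AI); General Economics (econ.GN)
Comments: Supplementary information can be downloaded at this https URL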
Abstract:In this paper, we provide a novel measure for greenwashing – i.e., climate-related misinformation – that shows how polluting companies can use social media advertising related to climate change to redirect criticism. To do so, we identify greenwashing content in 11 million social-political ads in Meta’s Ad Targeting Dataset with a measurement technique that combines large language models, human coders, and advances in Bayesian item response theory. We show that what is called greenwashing has diverse actors and components, but we also identify a very pernicious form, which we call political greenwashing, that appears to be promoted by fossil fuel companies and related interest groups. Based on ad targeting data, we show that much of this advertising happens via organizations with undisclosed links to the fossil fuel industry. Furthermore, we show that greenwashing ad content is being micro-targeted at left-leaning communities with fossil fuel assets, though we also find comparatively little evidence of ad targeting aimed at influencing public opinion at the national level.
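[AI-80] Implicit Bias of the JKO Scheme
Quick read: This paper characterizes the higher-order error behaviour of the Jordan-Kinderlehrer-Otto (JKO) scheme, the canonical time-discretization of Wasserstein gradient flow, by identifying its implicit bias at second order in the step size $\eta$. The JKO scheme is known to approximate Wasserstein gradient flow on $J$ to first order in $\eta$; the paper shows that at second order it follows a modified energy $J^\eta$, obtained by subtracting from $J$ the squared metric curvature of $J$ (the squared norm of the gradient of its first variation) times $\eta/4$:
$$J^\eta(\rho) = J(\rho) - \frac{\eta}{4}\int_M \left\lVert \nabla_g \frac{\delta J}{\delta \rho}(\rho) \right\rVert_2^2 \,\rho(dx).$$
The key result is that the JKO iterates $\rho_k^\eta$ are approximated to order $\eta^2$ by Wasserstein gradient flow on $J^\eta$. The scheme therefore adds a deceleration in directions where the metric curvature changes rapidly, which corresponds to canonical implicit biases for common functionals: the Fisher information for entropy, the Fisher-Hyvärinen divergence for KL-divergence, and the kinetic energy for Riemannian gradient descent. This offers a new perspective on the scheme's stability and numerical behaviour.
Link: https://arxiv.org/abs/2511.14827
Authors: Peter Halmos, Boris Hanin
Affiliations: unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Analysis of PDEs (math.AP)
Comments: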
Abstract:Wasserstein gradient flow provides a general framework for minimizing an energy functional $J$ over the space of probability measures on a Riemannian manifold $(M,g)$. Its canonical time-discretization, the Jordan-Kinderlehrer-Otto (JKO) scheme, produces for any step size $\eta > 0$ a sequence of probability distributions $\rho_k^\eta$ that approximate to first order in $\eta$ Wasserstein gradient flow on $J$. But the JKO scheme also has many other remarkable properties not shared by other first order integrators, e.g. it preserves energy dissipation and exhibits unconditional stability for $\lambda$-geodesically convex functionals $J$. To better understand the JKO scheme we characterize its implicit bias at second order in $\eta$. We show that $\rho_k^\eta$ are approximated to order $\eta^2$ by Wasserstein gradient flow on a modified energy $$ J^\eta(\rho) = J(\rho) - \frac{\eta}{4} \int_M \Big\lVert \nabla_g \frac{\delta J}{\delta \rho}(\rho) \Big\rVert_2^2 \, \rho(dx), $$ obtained by subtracting from $J$ the squared metric curvature of $J$ times $\eta/4$. The JKO scheme therefore adds at second order in $\eta$ a deceleration in directions where the metric curvature of $J$ is rapidly changing. This corresponds to canonical implicit biases for common functionals: for entropy the implicit bias is the Fisher information, for KL-divergence it is the Fisher-Hyvärinen divergence, and for Riemannian gradient descent it is the kinetic energy in the metric $g$. To understand the differences between minimizing $J$ and $J^\eta$ we study JKO-Flow, Wasserstein gradient flow on $J^\eta$, in several simple numerical examples. These include exactly solvable Langevin dynamics on the Bures-Wasserstein space and Langevin sampling from a quartic potential in 1D.
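As a worked illustration of the entropy case mentioned in the abstract, the modified energy can be written out explicitly. This is a sketch under simplifying assumptions that are not taken from the paper: $M$ is flat Euclidean space, $\rho$ has a smooth positive density, and "entropy" is read as the functional $J(\rho) = \int \rho \log \rho \, dx$. The symbol $I(\rho)$ for the Fisher information is introduced here for convenience.

```latex
\begin{aligned}
J(\rho) &= \int \rho \log \rho \, dx, \\
\frac{\delta J}{\delta \rho}(\rho) &= \log \rho + 1,
\qquad
\nabla \frac{\delta J}{\delta \rho}(\rho) = \nabla \log \rho, \\
J^\eta(\rho) &= J(\rho) - \frac{\eta}{4} \int \lVert \nabla \log \rho \rVert_2^2 \, \rho \, dx
             = J(\rho) - \frac{\eta}{4} \, I(\rho).
\end{aligned}
```

Under these assumptions the second-order correction is the Fisher information $I(\rho) = \int \lVert \nabla \log \rho \rVert_2^2 \, \rho \, dx$ scaled by $\eta/4$, which matches the statement that the implicit bias for entropy is the Fisher information.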
[AI-81] Fully Differentiable dMRI Streamline Propagation in PyTorch
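[Quick read]: This paper addresses the difficulty of integrating diffusion MRI (dMRI) tractography into end-to-end deep learning frameworks. Most existing tractography methods rely on non-differentiable streamline propagation algorithms or global energy minimization, so they cannot be optimized jointly with deep learning models. The key to the solution is a fully differentiable streamline propagator implemented in PyTorch with no components that block gradient flow, which retains numerical fidelity with a leading streamline algorithm. This allows tractography to be integrated into deep learning workflows and lays a foundation for macrostructural reasoning that is both computationally robust and scientifically rigorous.
Link: https://arxiv.org/abs/2511.14807
Authors: Jongyeon Yoon, Elyssa M. McMaster, Michael E. Kim, Gaurav Rudravaram, Kurt G. Schilling, Bennett A. Landman, Daniel Moyer
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures. Accepted to SPIE Medical Imaging 2026: Image Processing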
Abstract:Diffusion MRI (dMRI) provides a distinctive means to probe the microstructural architecture of living tissue, facilitating applications such as brain connectivity analysis, modeling across multiple conditions, and the estimation of macrostructural features. Tractography, which emerged in the final years of the 20th century and accelerated in the early 21st century, is a technique for visualizing white matter pathways in the brain using dMRI. Most diffusion tractography methods rely on procedural streamline propagators or global energy minimization methods. Although recent advancements in deep learning have enabled tasks that were previously challenging, existing tractography approaches are often non-differentiable, limiting their integration in end-to-end learning frameworks. While progress has been made in representing streamlines in differentiable frameworks, no existing method offers fully differentiable propagation. In this work, we propose a fully differentiable solution that retains numerical fidelity with a leading streamline algorithm. The key is that our PyTorch-engineered streamline propagator has no components that block gradient flow, making it fully differentiable. We show that our method matches standard propagators while remaining differentiable. By translating streamline propagation into a differentiable PyTorch framework, we enable deeper integration of tractography into deep learning workflows, laying the foundation for a new category of macrostructural reasoning that is not only computationally robust but also scientifically rigorous.
[AI-82] MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging AAAI2026
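[Quick read]: This paper addresses two key challenges in genomic sequence modeling: information density varies widely across regions, and there is no clearly defined minimum vocabulary unit. Existing methods, built on the four primitive bases or hand-designed DNA tokenizers and pre-trained with naive masked language modeling, adapt poorly to the varying complexity of genomic sequences. The core of the solution is MergeDNA, a hierarchical architecture that uses differentiable Token Merging to jointly optimize a dynamic genomic tokenizer and context-aware latent Transformers. The tokenization module stacks differentiable merging blocks with local-window constraints to chunk adjacent bases into "words", and a Latent Encoder captures the global context of the merged tokens with full attention. Two pre-training tasks work together: Merged Token Reconstruction trains the tokenization module while adaptively filtering important tokens, and Adaptive Masked Token Modeling learns to predict the filtered tokens to capture informative content.
Link: https://arxiv.org/abs/2511.14806
Authors: Siyuan Li, Kai Yu, Anna Wang, Zicheng Liu, Chang Yu, Jingbo Zhou, Qirong Yang, Yucheng Guo, Xiaoming Zhang, Stan Z. Li
Affiliations: Unknown
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: AAAI 2026 (Oral Presentation) Preprint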
Abstract:Modeling genomic sequences faces two unsolved challenges: the information density varies widely across different regions, while there is no clearly defined minimum vocabulary unit. Relying on either four primitive bases or independently designed DNA tokenizers, existing approaches with naive masked language modeling pre-training often fail to adapt to the varying complexities of genomic sequences. Leveraging Token Merging techniques, this paper introduces a hierarchical architecture that jointly optimizes a dynamic genomic tokenizer and latent Transformers with context-aware pre-training tasks. As for network structures, the tokenization module automatically chunks adjacent bases into words by stacking multiple layers of the differentiable token merging blocks with local-window constraints, then a Latent Encoder captures the global context of these merged words by full-attention blocks. Symmetrically employing a Latent Decoder and a Local Decoder, MergeDNA learns with two pre-training tasks: Merged Token Reconstruction simultaneously trains the dynamic tokenization module and adaptively filters important tokens, while Adaptive Masked Token Modeling learns to predict these filtered tokens to capture informative contents. Extensive experiments show that MergeDNA achieves superior performance on three popular DNA benchmarks and several multi-omics tasks with fine-tuning or zero-shot evaluation, outperforming typical tokenization methods and large-scale DNA foundation models.
[AI-83] Quantifying the Role of OpenFold Components in Protein Structure Prediction NEURIPS2025
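[Quick read]: This paper addresses the unclear contribution of individual components in protein structure prediction models such as OpenFold, i.e., the lack of a systematic understanding of their inner workings. The key to the solution is a systematic evaluation methodology that quantifies each OpenFold component's contribution to structure prediction accuracy. It identifies components that are critical for most proteins, shows that the importance of several components correlates with protein length, and provides a basis for interpreting protein prediction networks.
Link: https://arxiv.org/abs/2511.14781
Authors: Tyler L. Hayes, Giri P. Krishnan
Affiliations: Unknown
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
Comments: Accepted to the NeurIPS 2025 Workshop on Multi-modal Foundation Models and Large Language Models for Life Sciences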
Abstract:Models such as AlphaFold2 and OpenFold have transformed protein structure prediction, yet their inner workings remain poorly understood. We present a methodology to systematically evaluate the contribution of individual OpenFold components to structure prediction accuracy. We identify several components that are critical for most proteins, while others vary in importance across proteins. We further show that the contribution of several components is correlated with protein length. These findings provide insight into how OpenFold achieves accurate predictions and highlight directions for interpreting protein prediction networks more broadly.
[AI-84] TacEleven: generative tactic discovery for football open play
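[Quick read]: This paper addresses automated discovery of offensive tactics in football open play, where the highly dynamic, long-sequence nature of the game makes the tactic space grow exponentially as the sequence progresses. The key to the solution is the TacEleven framework, which has two core components: a language-controlled tactical generator that produces diverse tactical proposals, and a multimodal large language model-based tactical critic that selects the best proposal according to a high-level tactical instruction. Together, generation and critique allow rapid exploration of long-sequence situations and the discovery of realistic, creative offensive tactics, supporting coaches and analysts in tactical decision-making.
Link: https://arxiv.org/abs/2511.13326
Authors: Siyao Zhao, Hao Ma, Zhiqiang Pu, Jingjing Huang, Yi Pan, Shijie Wang, Zhi Ming
Affiliations: Unknown
Categories: Applications (stat.AP); Artificial Intelligence (cs.AI)
Comments: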
Abstract:Creating offensive advantages during open play is fundamental to football success. However, due to the highly dynamic and long-sequence nature of open play, the potential tactic space grows exponentially as the sequence progresses, making automated tactic discovery extremely challenging. To address this, we propose TacEleven, a generative framework for football open-play tactic discovery developed in close collaboration with domain experts from AJ Auxerre, designed to assist coaches and analysts in tactical decision-making. TacEleven consists of two core components: a language-controlled tactical generator that produces diverse tactical proposals, and a multimodal large language model-based tactical critic that selects the optimal proposal aligned with a high-level stylistic tactical instruction. The two components enable rapid exploration of tactical proposals and discovery of alternative open-play offensive tactics. We evaluate TacEleven across three tasks with progressive tactical complexity: counterfactual exploration, single-step discovery, and multi-step discovery, through both quantitative metrics and a questionnaire-based qualitative assessment. The results show that the TacEleven-discovered tactics exhibit strong realism and tactical creativity, with 52.50% of the multi-step tactical alternatives rated adoptable in real-world elite football scenarios, highlighting the framework’s ability to rapidly generate numerous high-quality tactics for complex long-sequence open-play situations. TacEleven demonstrates the potential of creatively leveraging domain data and generative models to advance tactical analysis in sports.
Machine Learning
[LG-0] RescueLens: LLM-Powered Triage and Action on Volunteer Feedback for Food Rescue
Link: https://arxiv.org/abs/2511.15698
Authors: Naveen Raman, Jingwu Tang, Zhiyu Chen, Zheyuan Ryan Shi, Sean Hudson, Ameesh Kapoor, Fei Fang
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Accepted at IAAI’26
Abstract:Food rescue organizations simultaneously tackle food insecurity and waste by working with volunteers to redistribute food from donors who have excess to recipients who need it. Volunteer feedback allows food rescue organizations to identify issues early and ensure volunteer satisfaction. However, food rescue organizations monitor feedback manually, which can be cumbersome and labor-intensive, making it difficult to prioritize which issues are most important. In this work, we investigate how large language models (LLMs) assist food rescue organizers in understanding and taking action based on volunteer experiences. We work with 412 Food Rescue, a large food rescue organization based in Pittsburgh, Pennsylvania, to design RescueLens, an LLM-powered tool that automatically categorizes volunteer feedback, suggests donors and recipients to follow up with, and updates volunteer directions based on feedback. We evaluate the performance of RescueLens on an annotated dataset, and show that it can recover 96% of volunteer issues at 71% precision. Moreover, by ranking donors and recipients according to their rates of volunteer issues, RescueLens allows organizers to focus on 0.5% of donors responsible for more than 30% of volunteer issues. RescueLens is now deployed at 412 Food Rescue and through semi-structured interviews with organizers, we find that RescueLens streamlines the feedback process so organizers better allocate their time.
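The donor-ranking idea in the abstract, ranking donors and recipients by their rate of volunteer issues so that organizers can focus on a small fraction of them, can be sketched as follows. This is a minimal illustration and not the RescueLens implementation: the function names, the `(name, rescues, issues)` record layout, and the toy numbers are all hypothetical.

```python
def rank_by_issue_rate(records):
    """Sort (name, n_rescues, n_issues) records by issue rate, highest first."""
    return sorted(records, key=lambda r: r[2] / r[1], reverse=True)


def issue_share_of_top(records, fraction):
    """Share of all issues attributable to the top `fraction` of ranked records."""
    ranked = rank_by_issue_rate(records)
    k = max(1, round(fraction * len(ranked)))
    total = sum(r[2] for r in ranked)
    return sum(r[2] for r in ranked[:k]) / total


# Hypothetical toy data: (donor, rescues completed, rescues with a reported issue).
donors = [
    ("donor_a", 100, 2),
    ("donor_b", 50, 20),
    ("donor_c", 200, 4),
    ("donor_d", 80, 4),
]

top = rank_by_issue_rate(donors)[0][0]
share = issue_share_of_top(donors, 0.25)
```

In this toy data a single donor accounts for two thirds of the reported issues, which is the kind of concentration the ranking is meant to surface. The sketch assumes every record has at least one rescue, so the rate is well defined.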
[LG-1] The Impact of Quantization on Large Reasoning Model Reinforcement Learning NEURIPS2025
Link: https://arxiv.org/abs/2511.15694
Authors: Medha Kumar, Zifei Xu, Xin Wang, Tristan Webb
Categories: Machine Learning (cs.LG)
Comments: Accepted to the NeurIPS 2025 Efficient Reasoning Workshop
Abstract:Strong reasoning capabilities can now be achieved by large-scale reinforcement learning (RL) without any supervised fine-tuning. Although post-training quantization (PTQ) and quantization-aware training (QAT) are well studied in the context of fine-tuning, how quantization impacts RL in large reasoning models (LRMs) remains an open question. To answer this question, we conducted systematic experiments and discovered a significant gap in reasoning performance on mathematical benchmarks between post-RL quantized models and their quantization-aware RL optimized counterparts. Our findings suggest that quantization-aware RL training negatively impacted the learning process, whereas PTQ and QLoRA led to greater performance.
[LG-2] CODE: A global approach to ODE dynamics learning
Link: https://arxiv.org/abs/2511.15619
Authors: Nils Wildt, Daniel M. Tartakovsky, Sergey Oladyshkin, Wolfgang Nowak
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Ordinary differential equations (ODEs) are a conventional way to describe the observed dynamics of physical systems. Scientists typically hypothesize about dynamical behavior, propose a mathematical model, and compare its predictions to data. However, modern computing and algorithmic advances now enable purely data-driven learning of governing dynamics directly from observations. In data-driven settings, one learns the ODE’s right-hand side (RHS). Dense measurements are often assumed, yet high temporal resolution is typically both cumbersome and expensive. Consequently, one usually has only sparsely sampled data. In this work we introduce ChaosODE (CODE), a Polynomial Chaos ODE Expansion in which we use an arbitrary Polynomial Chaos Expansion (aPCE) for the ODE’s right-hand side, resulting in a global orthonormal polynomial representation of dynamics. We evaluate the performance of CODE in several experiments on the Lotka-Volterra system, across varying noise levels, initial conditions, and predictions far into the future, even on previously unseen initial conditions. CODE exhibits remarkable extrapolation capabilities even when evaluated under novel initial conditions and shows advantages compared to well-examined methods using neural networks (NeuralODE) or kernel approximators (KernelODE) as the RHS representer. We observe that the high flexibility of NeuralODE and KernelODE degrades extrapolation capabilities under scarce data and measurement noise. Finally, we provide practical guidelines for robust optimization of dynamics-learning problems and illustrate them in the accompanying code.
[LG-3] Convergence and Sketching-Based Efficient Computation of Neural Tangent Kernel Weights in Physics-Based Loss
Link: https://arxiv.org/abs/2511.15530
Authors: Max Hirsch, Federico Pichi
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments:
Abstract:In multi-objective optimization, multiple loss terms are weighted and added together to form a single objective. These weights are chosen to properly balance the competing losses according to some meta-goal. For example, in physics-informed neural networks (PINNs), these weights are often adaptively chosen to improve the network’s generalization error. A popular choice of adaptive weights is based on the neural tangent kernel (NTK) of the PINN, which describes the evolution of the network in predictor space during training. The convergence of such an adaptive weighting algorithm is not clear a priori. Moreover, these NTK-based weights would be updated frequently during training, further increasing the computational burden of the learning process. In this paper, we prove that under appropriate conditions, gradient descent enhanced with adaptive NTK-based weights is convergent in a suitable sense. We then address the problem of computational efficiency by developing a randomized algorithm inspired by a predictor-corrector approach and matrix sketching, which produces unbiased estimates of the NTK up to an arbitrarily small discretization error. Finally, we provide numerical experiments to support our theoretical findings and to show the efficacy of our randomized algorithm. Code Availability: this https URL
[LG-4] Decentralized Gaussian Process Classification and an Application in Subsea Robotics IROS2025
Link: https://arxiv.org/abs/2511.15529
Authors: Yifei Gao, Hans J. He, Daniel J. Stilwell, James McMahon
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 8 pages, 8 figures, IROS 2025 conference
Abstract:Teams of cooperating autonomous underwater vehicles (AUVs) rely on acoustic communication for coordination, yet this communication medium is constrained by limited range, multi-path effects, and low bandwidth. One way to address the uncertainty associated with acoustic communication is to learn the communication environment in real-time. We address the challenge of a team of robots building a map of the probability of communication success from one location to another in real-time. This is a decentralized classification problem – communication events are either successful or unsuccessful – where AUVs share a subset of their communication measurements to build the map. The main contribution of this work is a rigorously derived data sharing policy that selects measurements to be shared among AUVs. We experimentally validate our proposed sharing policy using real acoustic communication data collected from teams of Virginia Tech 690 AUVs, demonstrating its effectiveness in underwater environments.
[LG-5] PCARNN-DCBF: Minimal-Intervention Geofence Enforcement for Ground Vehicles
Link: https://arxiv.org/abs/2511.15522
Authors: Yinan Yu, Samuel Scheidegger
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments:
Abstract:Runtime geofencing for ground vehicles is rapidly emerging as a critical technology for enforcing Operational Design Domains (ODDs). However, existing solutions struggle to reconcile high-fidelity learning with the structural requirements of verifiable control. We address this by introducing PCARNN-DCBF, a novel pipeline integrating a Physics-encoded Control-Affine Residual Neural Network with a preview-based Discrete Control Barrier Function. Unlike generic learned models, PCARNN explicitly preserves the control-affine structure of vehicle dynamics, ensuring the linearity required for reliable optimization. This enables the DCBF to enforce polygonal keep-in constraints via a real-time Quadratic Program (QP) that handles high relative degree and mitigates actuator saturation. Experiments in CARLA across electric and combustion platforms demonstrate that this structure-preserving approach significantly outperforms analytical and unstructured neural baselines.
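The role of a discrete control barrier function as a minimal-intervention filter can be illustrated on a one-dimensional toy problem. The sketch below is an assumption-laden stand-in, not the PCARNN-DCBF pipeline: it uses a hand-written control-affine model `x_next = x + dt * u`, a single keep-in bound `x <= x_max` instead of a polygon, no learned residual, and no preview horizon. The names `dcbf_filter`, `simulate`, `gamma`, and `dt` are hypothetical.

```python
def dcbf_filter(x, u_nom, x_max, dt=0.1, gamma=0.5):
    """Minimal-intervention safety filter for the toy system x_next = x + dt * u.

    Barrier: h(x) = x_max - x, with the safe set defined by h(x) >= 0.
    Discrete CBF condition: h(x_next) >= (1 - gamma) * h(x),
    which for this system reduces to u <= gamma * h(x) / dt.
    The QP "minimise (u - u_nom)^2 subject to that bound" has a closed-form
    solution in one dimension: clip u_nom at the bound.
    """
    u_bound = gamma * (x_max - x) / dt
    return min(u_nom, u_bound)


def simulate(x0, u_nom, x_max, steps, dt=0.1, gamma=0.5):
    """Roll the filtered system forward and return the visited states."""
    xs = [x0]
    for _ in range(steps):
        u = dcbf_filter(xs[-1], u_nom, x_max, dt, gamma)
        xs.append(xs[-1] + dt * u)
    return xs


# A nominal controller that keeps pushing toward the boundary at x_max = 1.0.
trajectory = simulate(x0=0.0, u_nom=5.0, x_max=1.0, steps=20)
```

Because the model is affine in `u`, the barrier condition is linear in `u`, which is what lets the general problem be posed as a QP. In this scalar case the filter leaves the nominal command untouched whenever it already satisfies the bound, and otherwise the distance to the boundary shrinks by the factor `1 - gamma` at each step.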
[LG-6] Sample-Adaptivity Tradeoff in On-Demand Sampling NEURIPS2025
Link: https://arxiv.org/abs/2511.15507
Authors: Nika Haghtalab, Omar Montasser, Mingda Qiao
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
Comments: 50 pages, to appear at NeurIPS 2025
Abstract:We study the tradeoff between sample complexity and round complexity in on-demand sampling, where the learning algorithm adaptively samples from $k$ distributions over a limited number of rounds. In the realizable setting of Multi-Distribution Learning (MDL), we show that the optimal sample complexity of an $r$-round algorithm scales approximately as $dk^{\Theta(1/r)} / \epsilon$. For the general agnostic case, we present an algorithm that achieves near-optimal sample complexity of $\widetilde{O}((d + k) / \epsilon^2)$ within $\widetilde{O}(\sqrt{k})$ rounds. Of independent interest, we introduce a new framework, Optimization via On-Demand Sampling (OODS), which abstracts the sample-adaptivity tradeoff and captures most existing MDL algorithms. We establish nearly tight bounds on the round complexity in the OODS setting. The upper bounds directly yield the $\widetilde{O}(\sqrt{k})$-round algorithm for agnostic MDL, while the lower bounds imply that achieving sub-polynomial round complexity would require fundamentally new techniques that bypass the inherent hardness of OODS.
[LG-7] A Tensor Compiler for Processing-In-Memory Architectures
Link: https://arxiv.org/abs/2511.15503
Authors: Peiming Yang, Sankeerth Durvasula, Ivan Fernandez, Mohammad Sadrosadati, Onur Mutlu, Gennady Pekhimenko, Christina Giannoula
Categories: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)
Comments:
Abstract:Processing-In-Memory (PIM) devices integrated with high-performance Host processors (e.g., GPUs) can accelerate memory-intensive kernels in Machine Learning (ML) models, including Large Language Models (LLMs), by leveraging high memory bandwidth at PIM cores. However, Host processors and PIM cores require different data layouts: Hosts need consecutive elements distributed across DRAM banks, while PIM cores need them within local banks. This necessitates data rearrangements in ML kernel execution that pose significant performance and programmability challenges, further exacerbated by the need to support diverse PIM backends. Current compilation approaches lack systematic optimization for diverse ML kernels across multiple PIM backends and may largely ignore data rearrangements during compute code optimization. We demonstrate that data rearrangements and compute code optimization are interdependent, and need to be jointly optimized during the tuning process. To address this, we design DCC, the first data-centric ML compiler for PIM systems that jointly co-optimizes data rearrangements and compute code in a unified tuning process. DCC integrates a multi-layer PIM abstraction that enables various data distribution and processing strategies on different PIM backends. DCC enables effective co-optimization by mapping data partitioning strategies to compute loop partitions, applying PIM-specific code optimizations and leveraging a fast and accurate performance prediction model to select optimal configurations. Our evaluations in various individual ML kernels demonstrate that DCC achieves up to 7.68x speedup (2.7x average) on HBM-PIM and up to 13.17x speedup (5.75x average) on AttAcc PIM backend over GPU-only execution. In end-to-end LLM inference, DCC on AttAcc accelerates GPT-3 and LLaMA-2 by up to 7.71x (4.88x average) over GPU.
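The layout mismatch described in the abstract, where the Host wants consecutive elements spread across DRAM banks while PIM cores want them inside a local bank, can be made concrete with a small index-mapping sketch. This is a deliberate simplification and not DCC's abstraction: real DRAM address mappings involve channels, ranks, rows, and columns, and the function names below are hypothetical.

```python
def host_layout(n_elements, n_banks):
    """Interleaved layout: consecutive elements are spread across banks."""
    banks = [[] for _ in range(n_banks)]
    for i in range(n_elements):
        banks[i % n_banks].append(i)
    return banks


def pim_layout(n_elements, n_banks):
    """Bank-local layout: each bank holds one contiguous chunk of elements."""
    chunk = -(-n_elements // n_banks)  # ceiling division
    return [
        list(range(b * chunk, min((b + 1) * chunk, n_elements)))
        for b in range(n_banks)
    ]


def elements_moved(n_elements, n_banks):
    """Count elements whose bank differs between the two layouts."""
    chunk = -(-n_elements // n_banks)
    return sum(1 for i in range(n_elements) if i % n_banks != i // chunk)


host = host_layout(8, 4)        # [[0, 4], [1, 5], [2, 6], [3, 7]]
pim = pim_layout(8, 4)          # [[0, 1], [2, 3], [4, 5], [6, 7]]
moved = elements_moved(8, 4)    # 6 of the 8 elements change bank
```

Even in this tiny case most elements have to move between banks when switching layouts, which gives some intuition for why the abstract treats data rearrangement as something to optimize jointly with the compute code.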
[LG-8] FairEnergy: Contribution-Based Fairness meets Energy Efficiency in Federated Learning
Link: https://arxiv.org/abs/2511.15454
Authors: Ouiame Marnissi, Hajar EL Hammouti, El Houcine Bergou
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Federated learning (FL) enables collaborative model training across distributed devices while preserving data privacy. However, balancing energy efficiency and fair participation while ensuring high model accuracy remains challenging in wireless edge systems due to heterogeneous resources, unequal client contributions, and limited communication capacity. To address these challenges, we propose FairEnergy, a fairness-aware energy minimization framework that integrates a contribution score capturing both the magnitude of updates and their compression ratio into the joint optimization of device selection, bandwidth allocation, and compression level. The resulting mixed-integer non-convex problem is solved by relaxing binary selection variables and applying Lagrangian decomposition to handle global bandwidth coupling, followed by per-device subproblem optimization. Experiments on non-IID data show that FairEnergy achieves higher accuracy while reducing energy consumption by up to 79% compared to baseline strategies.
[LG-9] Neural network-driven domain decomposition for efficient solutions to the Helmholtz equation
Link: https://arxiv.org/abs/2511.15445
Authors: Victorita Dolean, Daria Hrebenshchykova, Stéphane Lanteri, Victor Michel-Dansac
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments:
Abstract:Accurately simulating wave propagation is crucial in fields such as acoustics, electromagnetism, and seismic analysis. Traditional numerical methods, like finite difference and finite element approaches, are widely used to solve governing partial differential equations (PDEs) such as the Helmholtz equation. However, these methods face significant computational challenges when applied to high-frequency wave problems in complex two-dimensional domains. This work investigates Finite Basis Physics-Informed Neural Networks (FBPINNs) and their multilevel extensions as a promising alternative. These methods leverage domain decomposition, partitioning the computational domain into overlapping sub-domains, each governed by a local neural network. We assess their accuracy and computational efficiency in solving the Helmholtz equation for the homogeneous case, demonstrating their potential to mitigate the limitations of traditional approaches.
[LG-10] Proximal Approximate Inference in State-Space Models
Link: https://arxiv.org/abs/2511.15409
Authors: Hany Abdulsamad, Ángel F. García-Fernández, Simo Särkkä
Categories: Machine Learning (cs.LG); Methodology (stat.ME)
Comments:
Abstract:We present a class of algorithms for state estimation in nonlinear, non-Gaussian state-space models. Our approach is based on a variational Lagrangian formulation that casts Bayesian inference as a sequence of entropic trust-region updates subject to dynamic constraints. This framework gives rise to a family of forward-backward algorithms, whose structure is determined by the chosen factorization of the variational posterior. By focusing on Gauss–Markov approximations, we derive recursive schemes with favorable computational complexity. For general nonlinear, non-Gaussian models we close the recursions using generalized statistical linear regression and Fourier–Hermite moment matching.
[LG-11] EVA-Net: Interpretable Brain Age Prediction via Continuous Aging Prototypes from EEG
Link: https://arxiv.org/abs/2511.15393
Authors: Kunyu Zhang, Mingxuan Wang, Xiangjie Shi, Haoxing Xu, Chao Zhang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The brain age is a key indicator of brain health. While electroencephalography (EEG) is a practical tool for this task, existing models struggle with the common challenge of imperfect medical data, such as learning a "normal" baseline from weakly supervised, healthy-only cohorts. This is a critical anomaly detection task for identifying disease, but standard models are often black boxes lacking an interpretable structure. We propose EVA-Net, a novel framework that recasts brain age as an interpretable anomaly detection problem. EVA-Net uses an efficient, sparsified-attention Transformer to model long EEG sequences. To handle noise and variability in imperfect data, it employs a Variational Information Bottleneck to learn a robust, compressed representation. For interpretability, this representation is aligned to a continuous prototype network that explicitly learns the normative healthy aging manifold. Trained on 1297 healthy subjects, EVA-Net achieves state-of-the-art accuracy. We validated its anomaly detection capabilities on an unseen cohort of 27 MCI and AD patients. This pathological group showed significantly higher brain-age gaps and a novel Prototype Alignment Error, confirming their deviation from the healthy manifold. EVA-Net provides an interpretable framework for healthcare intelligence using imperfect medical data.
[LG-12] CID: Measuring Feature Importance Through Counterfactual Distributions
Link: https://arxiv.org/abs/2511.15371
Authors: Eddie Conti, Álvaro Parafita, Axel Brando
Categories: Machine Learning (cs.LG)
Comments: Accepted at Northern Lights Deep Learning (NLDL) 2026 Conference
Abstract:Assessing the importance of individual features in Machine Learning is critical to understand the model’s decision-making process. While numerous methods exist, the lack of a definitive ground truth for comparison highlights the need for alternative, well-founded measures. This paper introduces a novel post-hoc local feature importance method called Counterfactual Importance Distribution (CID). We generate two sets of positive and negative counterfactuals, model their distributions using Kernel Density Estimation, and rank features based on a distributional dissimilarity measure. This measure, grounded in a rigorous mathematical framework, satisfies key properties required to function as a valid metric. We showcase the effectiveness of our method by comparing with well-established local feature importance explainers. Our method not only offers complementary perspectives to existing approaches, but also improves performance on faithfulness metrics (both for comprehensiveness and sufficiency), resulting in more faithful explanations of the system. These results highlight its potential as a valuable tool for model analysis.
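The three steps named in the abstract (two sets of counterfactuals, a kernel density estimate for each, and a ranking by distributional dissimilarity) can be sketched per feature as below. Several pieces are assumptions made only for illustration: the abstract does not say which dissimilarity measure CID uses, so an L1 distance between the two densities is swapped in here; the counterfactual values are hypothetical toy numbers rather than generated counterfactuals; and the bandwidth and grid are arbitrary.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a 1-D Gaussian kernel density estimate as a callable."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def l1_dissimilarity(p, q, lo, hi, n_grid=400):
    """Midpoint-rule approximation of the integral of |p - q| over [lo, hi].

    Stand-in measure: the measure used by CID is not specified in the abstract.
    """
    step = (hi - lo) / n_grid
    total = 0.0
    for i in range(n_grid):
        x = lo + (i + 0.5) * step
        total += abs(p(x) - q(x))
    return total * step


def rank_features(positive, negative, bandwidth=0.5, lo=-5.0, hi=5.0):
    """Rank feature names by dissimilarity of their counterfactual densities."""
    scores = {}
    for name in positive:
        p = gaussian_kde(positive[name], bandwidth)
        q = gaussian_kde(negative[name], bandwidth)
        scores[name] = l1_dissimilarity(p, q, lo, hi)
    return sorted(scores, key=scores.get, reverse=True), scores


# Hypothetical per-feature values taken from positive and negative counterfactuals.
positive = {"age": [1.0, 1.2, 0.8, 1.1], "income": [0.0, 0.1, -0.1, 0.05]}
negative = {"age": [-1.0, -1.2, -0.9, -1.1], "income": [0.05, -0.05, 0.1, 0.0]}
ranking, scores = rank_features(positive, negative)
```

A feature whose positive and negative counterfactual distributions barely overlap gets a score close to the maximum of 2, while a feature whose distributions coincide scores close to 0, so the first feature is ranked as more important.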
[LG-13] Cost-Aware Prediction (CAP): An LLM-Enhanced Machine Learning Pipeline and Decision Support System for Heart Failure Mortality Prediction
Link: https://arxiv.org/abs/2511.15357
Authors: Yinan Yu, Falk Dippel, Christina E. Lundberg, Martin Lindgren, Annika Rosengren, Martin Adiels, Helen Sjöland
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Objective: Machine learning (ML) predictive models are often developed without considering downstream value trade-offs and clinical interpretability. This paper introduces a cost-aware prediction (CAP) framework that combines cost-benefit analysis assisted by large language model (LLM) agents to communicate the trade-offs involved in applying ML predictions. Materials and Methods: We developed an ML model predicting 1-year mortality in patients with heart failure (N = 30,021, 22% mortality) to identify those eligible for home care. We then introduced clinical impact projection (CIP) curves to visualize important cost dimensions - quality of life and healthcare provider expenses, further divided into treatment and error costs, to assess the clinical consequences of predictions. Finally, we used four LLM agents to generate patient-specific descriptions. The system was evaluated by clinicians for its decision support value. Results: The eXtreme gradient boosting (XGB) model achieved the best performance, with an area under the receiver operating characteristic curve (AUROC) of 0.804 (95% confidence interval (CI) 0.792-0.816), area under the precision-recall curve (AUPRC) of 0.529 (95% CI 0.502-0.558) and a Brier score of 0.135 (95% CI 0.130-0.140). Discussion: The CIP cost curves provided a population-level overview of cost composition across decision thresholds, whereas the LLM agents generated cost-benefit analyses at the individual patient level. The system was well received according to the evaluation by clinicians. However, feedback emphasizes the need to strengthen the technical accuracy for speculative tasks. Conclusion: CAP utilizes LLM agents to integrate ML classifier outcomes and cost-benefit analysis for more transparent and interpretable decision support.
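Two quantities from the abstract lend themselves to a short worked example: the Brier score, and a cost curve evaluated across decision thresholds. The sketch below is illustrative only. It covers just the error-cost component with hypothetical unit costs (`cost_fp`, `cost_fn`) and toy predictions, and it does not reproduce the paper's CIP curves, which also cover quality of life and treatment costs.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def error_cost(y_true, y_prob, threshold, cost_fp, cost_fn):
    """Total error cost when predicting positive for p >= threshold."""
    cost = 0.0
    for y, p in zip(y_true, y_prob):
        predicted_positive = p >= threshold
        if predicted_positive and y == 0:
            cost += cost_fp  # false positive
        elif not predicted_positive and y == 1:
            cost += cost_fn  # false negative
    return cost


def cost_curve(y_true, y_prob, thresholds, cost_fp, cost_fn):
    """Error cost at each decision threshold, as (threshold, cost) pairs."""
    return [
        (t, error_cost(y_true, y_prob, t, cost_fp, cost_fn)) for t in thresholds
    ]


# Hypothetical outcomes (1 = died within one year) and predicted probabilities.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.6, 0.4, 0.9]

score = brier_score(y_true, y_prob)
curve = cost_curve(y_true, y_prob, [0.25, 0.5, 0.75], cost_fp=1.0, cost_fn=5.0)
```

With a false negative priced at five times a false positive, the lowest threshold is the cheapest of the three in this toy data, which is the kind of threshold trade-off a population-level cost curve is meant to expose.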
[LG-14] Multi-layer Stack Ensembles for Time Series Forecasting
Link: https://arxiv.org/abs/2511.15350
Authors: Nathanael Bosch, Oleksandr Shchur, Nick Erickson, Michael Bohlke-Schneider, Caner Türkmen
Categories: Machine Learning (cs.LG)
Comments: Published at AutoML Conference 2025 Methods Track
Abstract:Ensembling is a powerful technique for improving the accuracy of machine learning models, with methods like stacking achieving strong results in tabular tasks. In time series forecasting, however, ensemble methods remain underutilized, with simple linear combinations still considered state-of-the-art. In this paper, we systematically explore ensembling strategies for time series forecasting. We evaluate 33 ensemble models – both existing and novel – across 50 real-world datasets. Our results show that stacking consistently improves accuracy, though no single stacker performs best across all tasks. To address this, we propose a multi-layer stacking framework for time series forecasting, an approach that combines the strengths of different stacker models. We demonstrate that this method consistently provides superior accuracy across diverse forecasting scenarios. Our findings highlight the potential of stacking-based methods to improve AutoML systems for time series forecasting.
[LG-15] LaguerreNet: Advancing a Unified Solution for Heterophily and Over-smoothing with Adaptive Continuous Polynomials
Link: https://arxiv.org/abs/2511.15328
Authors: Huseyin Goksu
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:
Abstract:Spectral Graph Neural Networks (GNNs) suffer from two critical limitations: poor performance on “heterophilic” graphs and performance collapse at high polynomial degrees (K), known as over-smoothing. Both issues stem from the static, low-pass nature of standard filters (e.g., ChebyNet). While adaptive polynomial filters, such as the discrete MeixnerNet, have emerged as a potential unified solution, their extension to the continuous domain and stability with unbounded coefficients remain open questions. In this work, we propose LaguerreNet, a novel GNN filter based on continuous Laguerre polynomials. LaguerreNet learns the filter’s spectral shape by making its core alpha parameter trainable, thereby advancing the adaptive polynomial approach. We solve the severe O(k^2) numerical instability of these unbounded polynomials using a LayerNorm-based stabilization technique. We demonstrate experimentally that this approach is highly effective: 1) LaguerreNet achieves state-of-the-art results on challenging heterophilic benchmarks. 2) It is exceptionally robust to over-smoothing, with performance peaking at K=10, an order of magnitude beyond where ChebyNet collapses.
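For readers unfamiliar with the polynomial family, the following sketch evaluates generalized Laguerre polynomials with the standard three-term recurrence. It is only an illustration on scalar inputs: the function name is an assumption, and the graph-Laplacian filtering and LayerNorm stabilization described in the abstract are not shown.

```python
import numpy as np

def laguerre_basis(x, K, alpha=0.0):
    """Return [L_0^alpha(x), ..., L_K^alpha(x)] using the recurrence
    (k + 1) L_{k+1} = (2k + 1 + alpha - x) L_k - (k + alpha) L_{k-1}."""
    x = np.asarray(x, dtype=float)
    basis = [np.ones_like(x)]
    if K >= 1:
        basis.append(1.0 + alpha - x)
    for k in range(1, K):
        nxt = ((2 * k + 1 + alpha - x) * basis[k]
               - (k + alpha) * basis[k - 1]) / (k + 1)
        basis.append(nxt)
    return np.stack(basis)

x = np.array([0.0, 1.0, 2.5])
alpha = 0.5  # in LaguerreNet this parameter is described as trainable
B = laguerre_basis(x, K=3, alpha=alpha)
```

Because the coefficients of the recurrence grow with k, the values can grow quickly at high degree, which is consistent with the abstract's remark that some stabilization is needed.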
[LG-16] KrawtchoukNet: A Unified GNN Solution for Heterophily and Over-smoothing with Adaptive Bounded Polynomials
Link: https://arxiv.org/abs/2511.15327
Authors: Huseyin Goksu
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:
Abstract:Spectral Graph Neural Networks (GNNs) based on polynomial filters, such as ChebyNet, suffer from two critical limitations: 1) performance collapse on “heterophilic” graphs and 2) performance collapse at high polynomial degrees (K), known as over-smoothing. Both issues stem from the static, low-pass nature of standard filters. In this work, we propose KrawtchoukNet, a GNN filter based on the discrete Krawtchouk polynomials. We demonstrate that KrawtchoukNet provides a unified solution to both problems through two key design choices. First, by fixing the polynomial’s domain N to a small constant (e.g., N=20), we create the first GNN filter whose recurrence coefficients are inherently bounded, making it exceptionally robust to over-smoothing (achieving SOTA results at K=10). Second, by making the filter’s shape parameter p learnable, the filter adapts its spectral response to the graph data. We show this adaptive nature allows KrawtchoukNet to achieve SOTA performance on challenging heterophilic benchmarks (Texas, Cornell), decisively outperforming standard GNNs like GAT and APPNP.
[LG-17] On the Internal Semantics of Time-Series Foundation Models
Link: https://arxiv.org/abs/2511.15324
Authors: Atharva Pandey,Abhilash Neog,Gautam Jajoo
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Time-series Foundation Models (TSFMs) have recently emerged as a universal paradigm for learning across diverse temporal domains. However, despite their empirical success, the internal mechanisms by which these models represent fundamental time-series concepts remain poorly understood. In this work, we undertake a systematic investigation of concept interpretability in TSFMs. Specifically, we examine: (i) which layers encode which concepts, (ii) whether concept parameters are linearly recoverable, (iii) how representations evolve in terms of concept disentanglement and abstraction across model depth, and (iv) how models process compositions of concepts. We systematically probe these questions using layer-wise analyses, linear recoverability tests, and representation similarity measures, providing a structured account of TSFM semantics. The resulting insights show that early layers mainly capture local, time-domain patterns (e.g., AR(1), level shifts, trends), while deeper layers encode dispersion and change-time signals, with spectral and warping factors remaining the hardest to recover linearly. In compositional settings, however, probe performance degrades, revealing interference between concepts. This highlights that while atomic concepts are reliably localized, composition remains a challenge, underscoring a key limitation in current TSFMs’ ability to represent interacting temporal phenomena.
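A linear recoverability test of the kind mentioned above can be sketched as a ridge-regression probe on layer activations. This is a minimal illustration, not the paper's protocol: the activations are random stand-ins, the function name is hypothetical, and the score is computed in-sample for brevity, whereas a real probe would use a held-out split.

```python
import numpy as np

def linear_probe_r2(features, target, ridge=1e-3):
    """Fit a ridge-regularised linear probe and return in-sample R^2."""
    X = np.column_stack([features, np.ones(len(features))])  # add bias column
    A = X.T @ X + ridge * np.eye(X.shape[1])
    w = np.linalg.solve(A, X.T @ target)
    resid = target - X @ w
    return 1.0 - resid.var() / target.var()

rng = np.random.default_rng(0)
hidden = rng.normal(size=(500, 16))    # hypothetical layer activations
phi = hidden @ rng.normal(size=16)     # concept parameter encoded linearly
noise = rng.normal(size=500)           # target unrelated to the activations

r2_encoded = linear_probe_r2(hidden, phi)
r2_noise = linear_probe_r2(hidden, noise)
```

A high score for one layer and a low score for another would suggest where a concept is linearly accessible, which is the kind of layer-wise comparison the abstract describes.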
[LG-18] Quant-Trim in Practice: Improved Cross-Platform Low-Bit Deployment on Edge NPUs
Link: https://arxiv.org/abs/2511.15300
Authors: Rayen Dhahri,Steffen Urban
Categories: Machine Learning (cs.LG)
Comments: Accepted to a EurIPS 2025 workshop, work in progress
Abstract:Specialized edge accelerators rely on low-bit quantization, but vendor compilers differ in scaling, clipping, and kernel support, often as black boxes. The same floating-point (FP) checkpoint can therefore yield inconsistent accuracy across backends, forcing practitioners to tweak flags or refactor models to vendor-friendly operator subsets. We introduce Quant-Trim, a training-phase method that produces a hardware-neutral checkpoint robust to backend and precision choices. It combines progressive fake quantization to align training with the deployed integer grid and reverse pruning to tame outlier-driven scale inflation while preserving learnability. Quant-Trim is agnostic to quantization schemes (symmetric/asymmetric, per-tensor/per-channel, INT8/INT4) and requires no vendor-specific graph changes. Across models and tasks, it narrows the FP-to-low-bit gap, reduces dependence on compiler heuristics/calibration, and avoids per-backend retraining. We report accuracy and edge metrics (latency, throughput, energy per inference, and cost) under static/dynamic activation scaling and varying operator coverage.
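To make the terms concrete, here is a minimal sketch of symmetric per-tensor fake quantization (quantize to an integer grid, then dequantize). It is a generic textbook formulation, not Quant-Trim itself; the function name and the example tensor are assumptions.

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Symmetric per-tensor fake quantization: quantize, then dequantize."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale, scale

w = np.array([-1.0, -0.3, 0.0, 0.42, 1.0])
w_int8, s8 = fake_quantize(w, num_bits=8)
w_int4, s4 = fake_quantize(w, num_bits=4)

# A single outlier inflates the scale and coarsens the grid for all other values.
w_out = np.append(w, 20.0)
_, s_out = fake_quantize(w_out, num_bits=8)
```

The last two lines illustrate the "outlier-driven scale inflation" that the abstract says reverse pruning is meant to tame: the step size grows by the same factor as the largest magnitude.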
[LG-19] SNAP: Low-Latency Test-Time Adaptation with Sparse Updates
Link: https://arxiv.org/abs/2511.15276
Authors: Hyeongheon Cha,Dong Min Kim,Hye Won Chung,Taesik Gong,Sung-Ju Lee
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Test-Time Adaptation (TTA) adjusts models using unlabeled test data to handle dynamic distribution shifts. However, existing methods rely on frequent adaptation and high computational cost, making them unsuitable for resource-constrained edge environments. To address this, we propose SNAP, a sparse TTA framework that reduces adaptation frequency and data usage while preserving accuracy. SNAP maintains competitive accuracy even when adapting based on only 1% of the incoming data stream, demonstrating its robustness under infrequent updates. Our method introduces two key components: (i) Class and Domain Representative Memory (CnDRM), which identifies and stores a small set of samples that are representative of both class and domain characteristics to support efficient adaptation with limited data; and (ii) Inference-only Batch-aware Memory Normalization (IoBMN), which dynamically adjusts normalization statistics at inference time by leveraging these representative samples, enabling efficient alignment to shifting target domains. Integrated with five state-of-the-art TTA algorithms, SNAP reduces latency by up to 93.12%, while keeping the accuracy drop below 3.3%, even across adaptation rates ranging from 1% to 50%. This demonstrates its strong potential for practical use on edge devices serving latency-sensitive applications. The source code is available at this https URL.
[LG-20] PLATONT: Learning a Platonic Representation for Unified Network Tomography
Link: https://arxiv.org/abs/2511.15251
Authors: Chengze Du,Heng Xu,Zhiwei Yu,Bo Liu,Jialong Li
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments:
Abstract:Network tomography aims to infer hidden network states, such as link performance, traffic load, and topology, from external observations. Most existing methods solve these problems separately and depend on limited task-specific signals, which limits generalization and interpretability. We present PLATONT, a unified framework that models different network indicators (e.g., delay, loss, bandwidth) as projections of a shared latent network state. Guided by the Platonic Representation Hypothesis, PLATONT learns this latent state through multimodal alignment and contrastive learning. By training multiple tomography tasks within a shared latent space, it builds compact and structured representations that improve cross-task generalization. Experiments on synthetic and real-world datasets show that PLATONT consistently outperforms existing methods in link estimation, topology inference, and traffic prediction, achieving higher accuracy and stronger robustness under varying network conditions.
[LG-21] Optimized scheduling of electricity-heat cooperative system considering wind energy consumption and peak shaving and valley filling
Link: https://arxiv.org/abs/2511.15250
Authors: Jin Ye,Lingmei Wang,Shujian Zhang,Haihang WU
Categories: Machine Learning (cs.LG)
Comments:
Abstract:With the global energy transition and rapid development of renewable energy, the scheduling optimization challenge for combined power-heat systems under new energy integration and multiple uncertainties has become increasingly prominent. Addressing this challenge, this study proposes an intelligent scheduling method based on an improved Twin Delayed Deep Deterministic Policy Gradient (PVTD3) algorithm. System optimization is achieved by introducing a penalty term for grid power purchase variations. Simulation results demonstrate that under three typical scenarios (10%, 20%, and 30% renewable penetration), the PVTD3 algorithm reduces the system’s comprehensive cost by 6.93%, 12.68%, and 13.59% respectively compared to the traditional TD3 algorithm. Concurrently, it reduces the average fluctuation amplitude of grid power purchases by 12.8%. Regarding energy storage management, the PVTD3 algorithm reduces the end-time state values of low-temperature thermal storage tanks by 7.67-17.67 units while maintaining high-temperature tanks within the 3.59-4.25 safety operating range. Multi-scenario comparative validation demonstrates that the proposed algorithm not only excels in economic efficiency and grid stability but also exhibits superior sustainable scheduling capabilities in energy storage device management.
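One plausible reading of the "penalty term for grid power purchase variations" is a reward that subtracts a cost proportional to the step-to-step change in purchased power. The sketch below encodes that reading only; the function name, the absolute-difference form, and the penalty weight are assumptions, not details taken from the paper.

```python
def shaped_reward(operating_cost, grid_power, prev_grid_power, penalty_weight=0.1):
    """Negative operating cost minus a penalty on the change in grid purchase."""
    variation = abs(grid_power - prev_grid_power)
    return -(operating_cost + penalty_weight * variation)

# Same operating cost, different purchase profiles (hypothetical numbers).
smooth = shaped_reward(100.0, grid_power=52.0, prev_grid_power=50.0)
jumpy = shaped_reward(100.0, grid_power=80.0, prev_grid_power=50.0)
```

Under this shaping, an agent that keeps purchases steady earns a higher reward than one with the same cost but larger swings, which matches the peak-shaving goal in the title.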
[LG-22] D2D Power Allocation via Quantum Graph Neural Network
Link: https://arxiv.org/abs/2511.15246
Authors: Tung Giang Le,Xuan Tung Nguyen,Won-Joo Hwang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Increasing wireless network complexity demands scalable resource management. Classical GNNs excel at graph learning but incur high computational costs in large-scale settings. We present a fully quantum Graph Neural Network (QGNN) that implements message passing via Parameterized Quantum Circuits (PQCs). Our Quantum Graph Convolutional Layers (QGCLs) encode features into quantum states, process graphs with NISQ-compatible unitaries, and retrieve embeddings through measurement. Applied to D2D power control for SINR maximization, our QGNN matches classical performance with fewer parameters and inherent parallelism. This end-to-end PQC-based GNN marks a step toward quantum-accelerated wireless optimization.
[LG-23] Reasoning in Diffusion Large Language Models is Concentrated in Dynamic Confusion Zones
Link: https://arxiv.org/abs/2511.15208
Authors: Ranfei Chen,Ming Chen,Kaifei Wang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Diffusion Large Language Models (dLLMs) are rapidly emerging alongside autoregressive models as a powerful paradigm for complex reasoning, with reinforcement learning increasingly used for downstream alignment. Existing trajectory-based RL methods uniformly allocate policy gradients across denoising steps, implicitly treating all steps as equally important. We challenge this assumption by analyzing trajectories with several step-level metrics: entropy-based uncertainty, Confidence-Margin (CM) uncertainty, and Rate of Entropy Change (RoEC). These reveal structured “zones of confusion”: transient spikes in uncertainty and instability that strongly predict final success or failure, while most steps remain stable. We propose Adaptive Trajectory Policy Optimization (ATPO), a lightweight step-selection strategy that dynamically reallocates gradient updates to these high-leverage steps without changing the RL objective, rewards, or compute budget. Using a hybrid RoEC+CM rule, ATPO delivers substantial gains in reasoning accuracy and training stability across benchmarks, showing that exploiting trajectory dynamics is key to advancing dLLM RL.
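The step-level metrics named above can be sketched as follows. The definitions used here are assumptions for illustration (Shannon entropy, top-1 minus top-2 probability as the confidence margin, and absolute entropy difference between consecutive steps as RoEC), as is the simple scoring rule; the paper's exact formulas and its hybrid RoEC+CM rule may differ.

```python
import numpy as np

def step_metrics(probs):
    """probs: (n_steps, vocab) array of per-step token distributions."""
    eps = 1e-12
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    top2 = np.sort(probs, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]          # top-1 minus top-2 probability
    roec = np.abs(np.diff(entropy, prepend=entropy[0]))
    return entropy, margin, roec

def select_steps(margin, roec, k):
    """Pick k steps with large entropy change and small margin (assumed rule)."""
    score = roec - margin
    return np.sort(np.argsort(score)[-k:])

probs = np.array([
    [0.97, 0.01, 0.01, 0.01],
    [0.97, 0.01, 0.01, 0.01],
    [0.30, 0.28, 0.22, 0.20],   # a transient "confused" step
    [0.95, 0.03, 0.01, 0.01],
])
entropy, margin, roec = step_metrics(probs)
chosen = select_steps(margin, roec, k=1)
```

Gradient updates would then be concentrated on the selected steps rather than spread uniformly over the trajectory.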
[LG-24] Learning Where, What and How to Transfer: A Multi-Role Reinforcement Learning Approach for Evolutionary Multitasking
Link: https://arxiv.org/abs/2511.15199
Authors: Jiajun Zhan,Zeyuan Ma,Yue-Jiao Gong,Kay Chen Tan
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Comments:
Abstract:Evolutionary multitasking (EMT) algorithms typically require tailored designs for knowledge transfer, in order to assure convergence and optimality in multitask optimization. In this paper, we explore designing a systematic and generalizable knowledge transfer policy through Reinforcement Learning. We first identify three major challenges: determining the task to transfer (where), the knowledge to be transferred (what) and the mechanism for the transfer (how). To address these challenges, we formulate a multi-role RL system where three (groups of) policy networks act as specialized agents: a task routing agent incorporates an attention-based similarity recognition module to determine source-target transfer pairs via attention scores; a knowledge control agent determines the proportion of elite solutions to transfer; and a group of strategy adaptation agents control transfer strength by dynamically controlling hyper-parameters in the underlying EMT framework. Through pre-training all network modules end-to-end over an augmented multitask problem distribution, a generalizable meta-policy is obtained. Comprehensive validation experiments show state-of-the-art performance of our method against representative baselines. Further in-depth analysis not only reveals the rationale behind our proposal but also provides insightful interpretations of what the system has learned.
[LG-25] Vehicle Routing Problems via Quantum Graph Attention Network Deep Reinforcement Learning
Link: https://arxiv.org/abs/2511.15175
Authors: Le Tung Giang,Vu Hoang Viet,Nguyen Xuan Tung,Trinh Van Chien,Won-Joo Hwang
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Quantum Physics (quant-ph)
Comments: 11 pages, 3 figures, 2 tables. Accepted by SOICT 2025
Abstract:The vehicle routing problem (VRP) is a fundamental NP-hard task in intelligent transportation systems with broad applications in logistics and distribution. Deep reinforcement learning (DRL) with Graph Neural Networks (GNNs) has shown promise, yet classical models rely on large multi-layer perceptrons (MLPs) that are parameter-heavy and memory-bound. We propose a Quantum Graph Attention Network (Q-GAT) within a DRL framework, where parameterized quantum circuits (PQCs) replace conventional MLPs at critical readout stages. The hybrid model maintains the expressive capacity of graph attention encoders while reducing trainable parameters by more than 50%. Using proximal policy optimization (PPO) with greedy and stochastic decoding, experiments on VRP benchmarks show that Q-GAT achieves faster convergence and reduces routing cost by about 5% compared with classical GAT baselines. These results demonstrate the potential of PQC-enhanced GNNs as compact and effective solvers for large-scale routing and logistics optimization.
[LG-26] Complex variational autoencoders admit Kähler structure
Link: https://arxiv.org/abs/2511.15172
Authors: Andrew Gracyk
Categories: Machine Learning (cs.LG)
Comments: First version
Abstract:It has been discovered that latent-Euclidean variational autoencoders (VAEs) admit, in various capacities, Riemannian structure. We adapt these arguments but for complex VAEs with a complex latent stage. We show that complex VAEs reveal to some level Kähler geometric structure. Our methods will be tailored for decoder geometry. We derive the Fisher information metric in the complex case under a latent complex Gaussian regularization with trivial relation matrix. It is well known from statistical information theory that the Fisher information coincides with the Hessian of the Kullback-Leibler (KL) divergence. Thus, the metric Kähler potential relation is exactly achieved under relative entropy. We propose a Kähler potential derivative of complex Gaussian mixtures that has rough equivalence to the Fisher information metric while still being faithful to the underlying Kähler geometry. Computation of the metric via this potential is efficient, and through our potential, valid as a plurisubharmonic (PSH) function, large scale computational burden of automatic differentiation is displaced to small scale. We show that we can regularize the latent space with decoder geometry, and that we can sample in accordance with a weighted complex volume element. We demonstrate these strategies, at the exchange of sample variation, yield consistently smoother representations and fewer semantic outliers.
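The two standard relations the abstract leans on can be written out as below. The notation is an assumption made for this note: θ for the parameters of the family p_θ, z^i for local holomorphic coordinates, and 𝒦 for a Kähler potential. The first identity is the textbook statement that the Fisher information is the Hessian of the KL divergence at coincident parameters; the second is the defining property of a Kähler metric derived from a potential.

```latex
g_{ij}(\theta)
  = \left.\frac{\partial^2}{\partial\theta'_i\,\partial\theta'_j}
    D_{\mathrm{KL}}\!\left(p_{\theta}\,\|\,p_{\theta'}\right)\right|_{\theta'=\theta},
\qquad
g_{i\bar{j}}
  = \frac{\partial^2 \mathcal{K}}{\partial z^{i}\,\partial \bar{z}^{j}} .
```

Read together, they indicate why relative entropy is a natural candidate for a potential whose complex Hessian recovers the metric.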
[LG-27] Cross-Modal Consistency-Guided Active Learning for Affective BCI Systems
Link: https://arxiv.org/abs/2511.15138
Authors: Hyo-Jeong Jang,Hye-Bin Shin,Kang Yin
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Deep learning models perform best with abundant, high-quality labels, yet such conditions are rarely achievable in EEG-based emotion recognition. Electroencephalogram (EEG) signals are easily corrupted by artifacts and individual variability, while emotional labels often stem from subjective and inconsistent reports, making robust affective decoding particularly difficult. We propose an uncertainty-aware active learning framework that enhances robustness to label noise by jointly leveraging model uncertainty and cross-modal consistency. Instead of relying solely on EEG-based uncertainty estimates, the method evaluates cross-modal alignment to determine whether uncertainty originates from cognitive ambiguity or sensor noise. A representation alignment module embeds EEG and face features into a shared latent space, enforcing semantic coherence between modalities. Residual discrepancies are treated as noise-induced inconsistencies, and these samples are selectively queried for oracle feedback during active learning. This feedback-driven process guides the network toward reliable, informative samples and reduces the impact of noisy labels. Experiments on the ASCERTAIN dataset examine the efficiency and robustness of our method, highlighting its potential as a data-efficient and noise-tolerant approach for EEG-based affective decoding in brain-computer interface systems.
[LG-28] Novel sparse matrix algorithm expands the feasible size of a self-organizing map of the knowledge indexed by a database of peer-reviewed medical literature
Link: https://arxiv.org/abs/2511.15136
Authors: Andrew Amos,Joanne Lee,Tarun Sen Gupta,Bunmi S. Malau-Aduli
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Past efforts to map the Medline database have been limited to small subsets of the available data because of the exponentially increasing memory and processing demands of existing algorithms. We designed a novel algorithm for sparse matrix multiplication that allowed us to apply a self-organizing map to the entire Medline dataset, allowing for a more complete map of existing medical knowledge. The algorithm also increases the feasibility of refining the self-organizing map to account for changes in the dataset over time.
[LG-29] Efficient RF Passive Components Modeling with Bayesian Online Learning and Uncertainty Aware Sampling
Link: https://arxiv.org/abs/2511.15125
Authors: Huifan Zhang,Pingqiang Zhou
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Conventional radio frequency (RF) passive components modeling based on machine learning requires extensive electromagnetic (EM) simulations to cover geometric and frequency design spaces, creating computational bottlenecks. In this paper, we introduce an uncertainty-aware Bayesian online learning framework for efficient parametric modeling of RF passive components, which includes: 1) a Bayesian neural network with reconfigurable heads for joint geometric-frequency domain modeling while quantifying uncertainty; 2) an adaptive sampling strategy that simultaneously optimizes training data sampling across geometric parameters and frequency domain using uncertainty guidance. Validated on three RF passive components, the framework achieves accurate modeling while using only 2.86% EM simulation time compared to traditional ML-based flow, achieving a 35 times speedup.
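The two figures quoted at the end of the abstract are the same quantity expressed two ways, as a quick check shows (the symbols T_new and T_base are introduced here only for this calculation):

```latex
\frac{T_{\text{new}}}{T_{\text{base}}} = 0.0286
\quad\Longrightarrow\quad
\text{speedup} = \frac{T_{\text{base}}}{T_{\text{new}}} = \frac{1}{0.0286} \approx 34.97 \approx 35\times .
```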
[LG-30] Fourier-KAN-Mamba: A Novel State-Space Equation Approach for Time-Series Anomaly Detection
Link: https://arxiv.org/abs/2511.15083
Authors: Xiancheng Wang,Lin Wang,Rui Wang,Zhibo Zhang,Minghang Zhao
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:
Abstract:Time-series anomaly detection plays a critical role in numerous real-world applications, including industrial monitoring and fault diagnosis. Recently, Mamba-based state-space models have shown remarkable efficiency in long-sequence modeling. However, directly applying Mamba to anomaly detection tasks still faces challenges in capturing complex temporal patterns and nonlinear dynamics. In this paper, we propose Fourier-KAN-Mamba, a novel hybrid architecture that integrates Fourier layer, Kolmogorov-Arnold Networks (KAN), and Mamba selective state-space model. The Fourier layer extracts multi-scale frequency features, KAN enhances nonlinear representation capability, and a temporal gating control mechanism further improves the model’s ability to distinguish normal and anomalous patterns. Extensive experiments on MSL, SMAP, and SWaT datasets demonstrate that our method significantly outperforms existing state-of-the-art approaches. Keywords: time-series anomaly detection, state-space model, Mamba, Fourier transform, Kolmogorov-Arnold Network
[LG-31] Interpretable temporal fusion network of multi- and multi-class arrhythmia classification
Link: https://arxiv.org/abs/2511.15062
Authors: Yun Kwan Kim
Categories: Machine Learning (cs.LG)
Comments: [Doctoral dissertation, Korea University, 2025]
Abstract:Clinical decision support systems (CDSSs) have been widely utilized to support the decisions made by cardiologists when detecting and classifying arrhythmia from electrocardiograms. However, forming a CDSS for the arrhythmia classification task is challenging due to the varying lengths of arrhythmias. Although the onset time of arrhythmia varies, previously developed methods have not considered such conditions. Thus, we propose a framework that consists of (i) local and global extraction and (ii) local-global information fusion with attention to enable arrhythmia detection and classification within a constrained input length. The framework’s performance was evaluated in terms of 10-class and 4-class arrhythmia detection, focusing on identifying the onset and ending point of arrhythmia episodes and their duration using the MIT-BIH arrhythmia database (MITDB) and the MIT-BIH atrial fibrillation database (AFDB). Duration, episode, and Dice score performances resulted in overall F1-scores of 96.45%, 82.05%, and 96.31% on the MITDB and 97.57%, 98.31%, and 97.45% on the AFDB, respectively. The results demonstrated statistically superior performance compared to those of the benchmark models. To assess the generalization capability of the proposed method, an MITDB-trained model and an MIT-BIH malignant ventricular arrhythmia database-trained model were tested on the AFDB and MITDB, respectively. Superior performance was attained compared with that of a state-of-the-art model. The proposed method effectively captures both local and global information and dynamics without significant information loss. Consequently, arrhythmias can be detected with greater accuracy, and their occurrence times can be precisely determined, enabling the clinical field to develop more accurate treatment plans based on the proposed method.
[LG-32] Oversampling techniques for predicting COVID-19 patient length of stay
Link: https://arxiv.org/abs/2511.15048
Authors: Zachariah Farahany,Jiawei Wu,K M Sajjadul Islam,Praveen Madiraju
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 2022 IEEE International Conference on Big Data (Big Data)
Abstract:COVID-19 is a respiratory disease that caused a global pandemic in 2019. It is highly infectious and has the following symptoms: fever or chills, cough, shortness of breath, fatigue, muscle or body aches, headache, the new loss of taste or smell, sore throat, congestion or runny nose, nausea or vomiting, and diarrhea. These symptoms vary in severity; some people with many risk factors have been known to have lengthy hospital stays or die from the disease. In this paper, we analyze patients’ electronic health records (EHR) to predict the severity of their COVID-19 infection using the length of stay (LOS) as our measurement of severity. This is an imbalanced classification problem, as many people have a shorter LOS rather than a longer one. To combat this problem, we synthetically create alternate oversampled training data sets. Once we have this oversampled data, we run it through an Artificial Neural Network (ANN), which during training has its hyperparameters tuned using Bayesian optimization. We select the model with the best F1 score and then evaluate it and discuss it.
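The abstract does not say which oversampling techniques were compared, so the sketch below shows only the simplest one, random oversampling by duplication, as a point of reference. The function name and the toy arrays are assumptions; synthetic methods that interpolate new samples would replace the duplication step.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the majority class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx_parts = []
    for cls, n in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        extra = rng.choice(idx, size=target - n, replace=True)
        idx_parts.append(np.concatenate([idx, extra]))
    idx_all = np.concatenate(idx_parts)
    return X[idx_all], y[idx_all]

X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # 8 short stays, 2 long stays
X_bal, y_bal = random_oversample(X, y)
```

Oversampling is normally applied to the training split only, so that validation and test sets keep the original class balance.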
[LG-33] IonCast: A Deep Learning Framework for Forecasting Ionospheric Dynamics NEURIPS2025
Link: https://arxiv.org/abs/2511.15004
Authors: Halil S. Kelebek,Linnea M. Wolniewicz,Michael D. Vergalla,Simone Mestici,Giacomo Acciarini,Bala Poduval,Olga Verkhoglyadova,Madhulika Guhathakurta,Thomas E. Berger,Frank Soboczenski,Atılım Güneş Baydin
Categories: Machine Learning (cs.LG); Earth and Planetary Astrophysics (astro-ph.EP)
Comments: 11 pages, 7 figures, 3 tables. Accepted as a poster presentation at the Machine Learning for the Physical Sciences Workshop at NeurIPS 2025
Abstract:The ionosphere is a critical component of near-Earth space, shaping GNSS accuracy, high-frequency communications, and aviation operations. For these reasons, accurate forecasting and modeling of ionospheric variability has become increasingly relevant. To address this gap, we present IonCast, a suite of deep learning models that include a GraphCast-inspired model tailored for ionospheric dynamics. IonCast leverages spatiotemporal learning to forecast global Total Electron Content (TEC), integrating diverse physical drivers and observational datasets. Validating on held-out storm-time and quiet conditions highlights improved skill compared to persistence. By unifying heterogeneous data with scalable graph-based spatiotemporal learning, IonCast demonstrates how machine learning can augment physical understanding of ionospheric variability and advance operational space weather resilience.
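"Skill compared to persistence" refers to the usual baseline in which the forecast simply repeats the last observation. A minimal sketch of that comparison follows; the RMSE-based skill score is one common convention and is an assumption here, as are the function names and all numbers.

```python
import numpy as np

def persistence_forecast(series, horizon):
    """Forecast every future step as the last observed value."""
    return np.repeat(series[-1:], horizon, axis=0)

def rmse(pred, truth):
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

def skill_score(model_rmse, baseline_rmse):
    """Positive when the model beats the baseline; 1.0 is a perfect forecast."""
    return 1.0 - model_rmse / baseline_rmse

tec = np.array([10.0, 12.0, 15.0, 14.0])   # hypothetical TEC history
truth = np.array([16.0, 18.0])             # hypothetical future values
baseline = persistence_forecast(tec, horizon=2)
model = np.array([15.5, 17.0])             # hypothetical model output
score = skill_score(rmse(model, truth), rmse(baseline, truth))
```

Persistence is hard to beat at short lead times in quiet conditions, which is why storm-time evaluation is a more demanding test.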
[LG-34] Compiling to recurrent neurons
Link: https://arxiv.org/abs/2511.14953
Authors: Joey Velez-Ginorio,Nada Amin,Konrad Kording,Steve Zdancewic
Categories: Programming Languages (cs.PL); Machine Learning (cs.LG)
Comments:
Abstract:Discrete structures are currently second-class in differentiable programming. Since functions over discrete structures lack overt derivatives, differentiable programs do not differentiate through them and limit where they can be used. For example, when programming a neural network, conditionals and iteration cannot be used everywhere; they can break the derivatives necessary for gradient-based learning to work. This limits the class of differentiable algorithms we can directly express, imposing restraints on how we build neural networks and differentiable programs more generally. However, these restraints are not fundamental. Recent work shows conditionals can be first-class, by compiling them into differentiable form as linear neurons. Similarly, this work shows iteration can be first-class – by compiling to linear recurrent neurons. We present a minimal typed, higher-order and linear programming language with iteration called $\textsf{Cajal}{\scriptstyle(\multimap, \mathbb{2}, \mathbb{N})}$. We prove its programs compile correctly to recurrent neurons, allowing discrete algorithms to be expressed in a differentiable form compatible with gradient-based learning. With our implementation, we conduct two experiments where we link these recurrent neurons against a neural network solving an iterative image transformation task. This determines part of its function prior to learning. As a result, the network learns faster and with greater data-efficiency relative to a neural network programmed without first-class iteration. A key lesson is that recurrent neurons enable a rich interplay between learning and the discrete structures of ordinary programming.
[LG-35] Fine-tuning Pre-trained Audio Models for COVID-19 Detection: A Technical Report
Link: https://arxiv.org/abs/2511.14939
Authors: Daniel Oliveira de Brito,Letícia Gabriella de Souza,Marcelo Matheus Gauy,Marcelo Finger,Arnaldo Candido Junior
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 11 pages
Abstract:This technical report investigates the performance of pre-trained audio models on COVID-19 detection tasks using established benchmark datasets. We fine-tuned Audio-MAE and three PANN architectures (CNN6, CNN10, CNN14) on the Coswara and COUGHVID datasets, evaluating both intra-dataset and cross-dataset generalization. We implemented a strict demographic stratification by age and gender to prevent models from exploiting spurious correlations between demographic characteristics and COVID-19 status. Intra-dataset results showed moderate performance, with Audio-MAE achieving the strongest result on Coswara (0.82 AUC, 0.76 F1-score), while all models demonstrated limited performance on COUGHVID (AUC 0.58-0.63). Cross-dataset evaluation revealed severe generalization failure across all models (AUC 0.43-0.68), with Audio-MAE showing strong performance degradation (F1-score 0.00-0.08). Our experiments demonstrate that demographic balancing, while reducing apparent model performance, provides a more realistic assessment of COVID-19 detection capabilities by eliminating demographic leakage - a confounding factor that inflates performance metrics. Additionally, the limited dataset sizes after balancing (1,219-2,160 samples) proved insufficient for deep learning models that typically require substantially larger training sets. These findings highlight fundamental challenges in developing generalizable audio-based COVID-19 detection systems and underscore the importance of rigorous demographic controls for clinically robust model evaluation.
[LG-36] Integrating Causal Inference with Graph Neural Networks for Alzheimer's Disease Analysis
Link: https://arxiv.org/abs/2511.14922
Authors: Pranay Kumar Peddi,Dhrubajyoti Ghosh
Categories: Machine Learning (cs.LG); Methodology (stat.ME)
Comments:
Abstract:Deep graph learning has advanced Alzheimer’s disease (AD) classification from MRI, but most models remain correlational, confounding demographic and genetic factors with disease-specific features. We present Causal-GCN, an interventional graph convolutional framework that integrates do-calculus-based back-door adjustment to identify brain regions exerting stable causal influence on AD progression. Each subject’s MRI is represented as a structural connectome where nodes denote cortical and subcortical regions and edges encode anatomical connectivity. Confounders such as age, sex, and APOE4 genotype are summarized via principal components and included in the causal adjustment set. After training, interventions on individual regions are simulated by severing their incoming edges and altering node features to estimate average causal effects on disease probability. Applied to 484 subjects from the ADNI cohort, Causal-GCN achieves performance comparable to baseline GNNs while providing interpretable causal effect rankings that highlight posterior, cingulate, and insular hubs consistent with established AD neuropathology.
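For reference, the back-door adjustment and the average causal effect mentioned above take the following standard form. The symbols are assumptions introduced for this note: X for the intervened region's features, Y for the disease outcome, and Z for the adjustment set (here, the principal components of the confounders).

```latex
P\bigl(Y \mid \mathrm{do}(X = x)\bigr)
  = \sum_{z} P\bigl(Y \mid X = x,\, Z = z\bigr)\, P(Z = z),
\qquad
\mathrm{ACE}
  = \mathbb{E}\bigl[Y \mid \mathrm{do}(X = x_1)\bigr]
  - \mathbb{E}\bigl[Y \mid \mathrm{do}(X = x_0)\bigr].
```

The adjustment is valid only if Z blocks every back-door path from X to Y, an assumption that cannot be checked from the abstract alone.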
[LG-37] Structured Contrastive Learning for Interpretable Latent Representations
Link: https://arxiv.org/abs/2511.14920
Authors: Zhengyang Shen, Hua Tu, Mayue Shi
Subjects: Machine Learning (cs.LG)
Comments: 10 pages, 6 figures. Applications to medical signal retrieval and activity recognition. Correspondence: m.shi16@imperial. this http URL
Abstract:Neural networks exhibit severe brittleness to semantically irrelevant transformations. A mere 75ms electrocardiogram (ECG) phase shift degrades latent cosine similarity from 1.0 to 0.2, while sensor rotations collapse activity recognition performance with inertial measurement units (IMUs). We identify the root cause as “laissez-faire” representation learning, where latent spaces evolve unconstrained provided task performance is satisfied. We propose Structured Contrastive Learning (SCL), a framework that partitions latent space representations into three semantic groups: invariant features that remain consistent under given transformations (e.g., phase shifts or rotations), variant features that actively differentiate transformations via a novel variant mechanism, and free features that preserve task flexibility. This creates controllable push-pull dynamics where different latent dimensions serve distinct, interpretable purposes. The variant mechanism enhances contrastive learning by encouraging variant features to differentiate within positive pairs, enabling simultaneous robustness and interpretability. Our approach requires no architectural modifications and integrates seamlessly into existing training pipelines. Experiments on ECG phase invariance and IMU rotation robustness demonstrate superior performance: ECG similarity improves from 0.25 to 0.91 under phase shifts, while WISDM activity recognition achieves 86.65% accuracy with 95.38% rotation consistency, consistently outperforming traditional data augmentation. This work represents a paradigm shift from reactive data augmentation to proactive structural learning, enabling interpretable latent representations in neural networks.
[LG-38] It's LIT! Reliability-Optimized LLMs with Inspectable Tools NEURIPS2025
Link: https://arxiv.org/abs/2511.14903
Authors: Ruixin Zhang, Jon Donnelly, Zhicheng Guo, Ghazal Khalighinejad, Haiyang Huang, Alina Jade Barnett, Cynthia Rudin
Subjects: Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: Accepted to the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop on Multi-Turn Interactions in Large Language Models
Abstract:Large language models (LLMs) have exhibited remarkable capabilities across various domains. The ability to call external tools further expands their capability to handle real-world tasks. However, LLMs often follow an opaque reasoning process, which limits their usefulness in high-stakes domains where solutions need to be trustworthy to end users. LLMs can choose solutions that are unreliable and difficult to troubleshoot, even if better options are available. We address this issue by forcing LLMs to use external – more reliable – tools to solve problems when possible. We present a framework built on the tool-calling capabilities of existing LLMs to enable them to select the most reliable and easy-to-troubleshoot solution path, which may involve multiple sequential tool calls. We refer to this framework as LIT (LLMs with Inspectable Tools). In order to support LIT, we introduce a new and challenging benchmark dataset of 1,300 questions and a customizable set of reliability cost functions associated with a collection of specialized tools. These cost functions summarize how reliable each tool is and how easy it is to troubleshoot. For instance, a calculator is reliable across domains, whereas a linear prediction model is not reliable if there is distribution shift, but it is easy to troubleshoot. A tool that constructs a random forest is neither reliable nor easy to troubleshoot. These tools interact with the Harvard USPTO Patent Dataset and a new dataset of NeurIPS 2023 papers to solve mathematical, coding, and modeling problems of varying difficulty levels. We demonstrate that LLMs can achieve more reliable and informed problem-solving while maintaining task performance using our framework.
[LG-39] Bringing Federated Learning to Space
Link: https://arxiv.org/abs/2511.14889
Authors: Grace Kim, Filip Svoboda, Nicholas Lane
Subjects: Machine Learning (cs.LG)
Comments: 15 pages, 9 figures, 3 tables; accepted to IEEE Aeroconf 2026
Abstract:As Low Earth Orbit (LEO) satellite constellations rapidly expand to hundreds and thousands of spacecraft, the need for distributed on-board machine learning becomes critical to address downlink bandwidth limitations. Federated learning (FL) offers a promising framework to conduct collaborative model training across satellite networks. Realizing its benefits in space naturally requires addressing space-specific constraints, from intermittent connectivity to dynamics imposed by orbital motion. This work presents the first systematic feasibility analysis of adapting off-the-shelf FL algorithms for satellite constellation deployment. We introduce a comprehensive “space-ification” framework that adapts terrestrial algorithms (FedAvg, FedProx, FedBuff) to operate under orbital constraints, producing an orbital-ready suite of FL algorithms. We then evaluate these space-ified methods through extensive parameter sweeps across 768 constellation configurations that vary cluster sizes (1-10), satellites per cluster (1-10), and ground station networks (1-13). Our analysis demonstrates that space-adapted FL algorithms efficiently scale to constellations of up to 100 satellites, achieving performance close to the centralized ideal. Multi-month training cycles can be reduced to days, corresponding to a 9x speedup through orbital scheduling and local coordination within satellite clusters. These results provide actionable insights for future mission designers, enabling distributed on-board learning for more autonomous, resilient, and data-driven satellite operations.
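For readers unfamiliar with FedAvg, one of the terrestrial algorithms the abstract adapts, its aggregation step can be sketched as a sample-weighted average of client parameters. This is a minimal illustration of standard FedAvg only; the function name `fedavg` and the toy satellite data are hypothetical, and none of the paper's orbital scheduling or "space-ification" logic is modeled.

```python
from typing import List, Sequence


def fedavg(client_weights: Sequence[Sequence[float]],
           client_sizes: Sequence[int]) -> List[float]:
    """Average client parameter vectors, weighted by local sample counts."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(client_weights[0])
    aggregated = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        if len(weights) != dim:
            raise ValueError("all clients must share one parameter shape")
        for i, value in enumerate(weights):
            aggregated[i] += (size / total) * value
    return aggregated


# Hypothetical toy round: two satellites with 1 and 3 local samples.
satellite_updates = [[1.0, 0.0], [3.0, 2.0]]
local_samples = [1, 3]
global_model = fedavg(satellite_updates, local_samples)
```

In a satellite setting, which clients contribute to a given round would depend on connectivity windows; that selection is outside this sketch.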
[LG-40] Transformer-Guided Deep Reinforcement Learning for Optimal Takeoff Trajectory Design of an eVTOL Drone
Link: https://arxiv.org/abs/2511.14887
Authors: Nathan M. Roberts II, Xiaosong Du
Subjects: Machine Learning (cs.LG)
Comments: Conference version with 12 pages and 2 figures
Abstract:The rapid advancement of electric vertical take-off and landing (eVTOL) aircraft offers a promising opportunity to alleviate urban traffic congestion. Thus, developing optimal takeoff trajectories for minimum energy consumption becomes essential for broader eVTOL aircraft applications. Conventional optimal control methods (such as dynamic programming and linear quadratic regulator) provide highly efficient and well-established solutions but are limited by problem dimensionality and complexity. Deep reinforcement learning (DRL) has emerged as a type of artificial intelligence suited to complex, nonlinear systems; however, training difficulty is a key bottleneck that limits DRL applications. To address these challenges, we propose transformer-guided DRL, which alleviates the training difficulty by using a transformer to explore a realistic state space at each time step. The proposed transformer-guided DRL was demonstrated on an optimal takeoff trajectory design of an eVTOL drone for minimal energy consumption while meeting takeoff conditions (i.e., minimum vertical displacement and minimum horizontal velocity) by varying control variables (i.e., power and wing angle to the vertical). Results showed that the transformer-guided DRL agent learned to take off within 4.57×10^6 time steps, representing 25% of the 19.79×10^6 time steps needed by a vanilla DRL agent. In addition, the transformer-guided DRL achieved 97.2% accuracy on the optimal energy consumption compared against the simulation-based optimal reference, while the vanilla DRL achieved 96.3% accuracy. Therefore, the proposed transformer-guided DRL outperformed vanilla DRL in both training efficiency and optimal design verification.
[LG-41] Exact Learning of Weighted Graphs Using Composite Queries
Link: https://arxiv.org/abs/2511.14882
Authors: Michael T. Goodrich, Songyu Liu, Ioannis Panageas
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: Full version of the paper published at IWOCA 2025
Abstract:In this paper, we study the exact learning problem for weighted graphs, where we are given the vertex set, V, of a weighted graph, G=(V,E,w), but we are not given E. The problem, which is also known as graph reconstruction, is to determine all the edges of E, including their weights, by asking queries about G from an oracle. As we observe, using simple shortest-path length queries is not sufficient, in general, to learn a weighted graph. So we study a number of scenarios where it is possible to learn G using a subquadratic number of composite queries, which combine two or three simple queries.
[LG-42] FinTRec: Transformer Based Unified Contextual Ads Targeting and Personalization for Financial Applications RECSYS2025
Link: https://arxiv.org/abs/2511.14865
Authors: Dwipam Katariya, Snehita Varma, Akshat Shreemali, Benjamin Wu, Kalanand Mishra, Pranab Mohanty
Subjects: Machine Learning (cs.LG)
Comments: 10 pages, 7 figures, Accepted at CARS @ RecSys 2025
Abstract:Transformer-based architectures are widely adopted in sequential recommendation systems, yet their application in Financial Services (FS) presents distinct practical and modeling challenges for real-time recommendation. These include: (a) long-range user interactions (implicit and explicit) spanning both digital and physical channels, which generate temporally heterogeneous context, and (b) the presence of multiple interrelated products, which requires coordinated models to support varied ad placements and personalized feeds while balancing competing business goals. We propose FinTRec, a transformer-based framework that addresses these challenges and the operational objectives in FS. While tree-based models have traditionally been preferred in FS due to their explainability and alignment with regulatory requirements, our study demonstrates that FinTRec offers a viable and effective shift toward transformer-based architectures. Through historic simulation and live A/B test correlations, we show FinTRec consistently outperforms the production-grade tree-based baseline. The unified architecture, when fine-tuned for product adaptation, enables cross-product signal sharing and reduces training cost and technical debt, while improving offline performance across all products. To our knowledge, this is the first comprehensive study of unified sequential recommendation modeling in FS that addresses both technical and business considerations.
[LG-43] DEVAL: A Framework for Evaluating and Improving the Derivation Capability of Large Language Models
Link: https://arxiv.org/abs/2511.14813
Authors: Yifan Li, Qin Li, Min Zhang, Min Zhang, Peixin Wang
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Assessing the reasoning ability of Large Language Models (LLMs) over data remains an open and pressing research question. Compared with LLMs, human reasoning can derive corresponding modifications to the output based on certain kinds of changes to the input. This reasoning pattern, which relies on abstract rules that govern relationships between changes of data, has not been comprehensively described or evaluated in LLMs. In this paper, we formally define this reasoning pattern as the Derivation Relation (DR) and introduce the concept of Derivation Capability (DC), i.e. applying DR by making the corresponding modification to the output whenever the input takes certain changes. To assess DC, a systematically constructed evaluation framework named DEVAL is proposed and used to evaluate five popular LLMs and one Large Reasoning Model in seven mainstream tasks. The evaluation results show that mainstream LLMs, such as GPT-4o and Claude3.5, exhibit moderate DR recognition capabilities but reveal significant drop-offs on applying DR effectively in problem-solving scenarios. To improve this, we propose a novel prompt engineering approach called Derivation Prompting (DP). It achieves an average improvement of 15.2% in DC for all tested LLMs, outperforming commonly used prompt engineering techniques.
[LG-44] Reservoir Computing via Multi-Scale Random Fourier Features for Forecasting Fast-Slow Dynamical Systems
Link: https://arxiv.org/abs/2511.14775
Authors: S. K. Laha
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Comments: 23 pages, 18 figures
Abstract:Forecasting nonlinear time series with multi-scale temporal structures remains a central challenge in complex systems modeling. We present a novel reservoir computing framework that combines delay embedding with random Fourier feature (RFF) mappings to capture such dynamics. Two formulations are investigated: a single-scale RFF reservoir, which employs a fixed kernel bandwidth, and a multi-scale RFF reservoir, which integrates multiple bandwidths to represent both fast and slow temporal dependencies. The framework is applied to a diverse set of canonical systems: neuronal models such as the Rulkov map, Izhikevich model, Hindmarsh-Rose model, and Morris-Lecar model, which exhibit spiking, bursting, and chaotic behaviors arising from fast-slow interactions; and ecological models including the predator-prey dynamics and Ricker map with seasonal forcing, which display multi-scale oscillations and intermittency. Across all cases, the multi-scale RFF reservoir consistently outperforms its single-scale counterpart, achieving lower normalized root mean square error (NRMSE) and more robust long-horizon predictions. These results highlight the effectiveness of explicitly incorporating multi-scale feature mappings into reservoir computing architectures for modeling complex dynamical systems with intrinsic fast-slow interactions.
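The combination of delay embedding and multi-bandwidth random Fourier features described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: it uses the standard random Fourier feature construction for a Gaussian kernel (frequencies drawn with standard deviation 1/bandwidth, features scaled by sqrt(2/D)), and all function names, the bandwidth values, and the sine test series are hypothetical. A linear readout (for example ridge regression on these features) would normally follow and is not shown.

```python
import math
import random


def delay_embed(series, dim, lag=1):
    """Build delay vectors [x_t, x_{t-lag}, ..., x_{t-(dim-1)lag}]."""
    start = (dim - 1) * lag
    return [[series[t - k * lag] for k in range(dim)]
            for t in range(start, len(series))]


def make_rff(dim_in, n_features, bandwidth, rng):
    """One random Fourier feature map for a Gaussian kernel of given bandwidth."""
    freqs = [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(dim_in)]
             for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def phi(x):
        return [scale * math.cos(sum(w * v for w, v in zip(row, x)) + b)
                for row, b in zip(freqs, phases)]

    return phi


def make_multiscale_rff(dim_in, n_features_per_scale, bandwidths, seed=0):
    """Concatenate feature maps drawn at several bandwidths (fast and slow scales)."""
    rng = random.Random(seed)
    maps = [make_rff(dim_in, n_features_per_scale, s, rng) for s in bandwidths]

    def phi(x):
        out = []
        for feature_map in maps:
            out.extend(feature_map(x))
        return out

    return phi


# Hypothetical toy usage on a sine wave.
series = [math.sin(0.1 * t) for t in range(50)]
states = delay_embed(series, dim=3)
phi = make_multiscale_rff(dim_in=3, n_features_per_scale=20,
                          bandwidths=[0.1, 1.0, 10.0])
features = [phi(s) for s in states]
```

Setting `bandwidths` to a single value gives the single-scale variant that the abstract uses as its comparison point.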
[LG-45] Front-door Reducibility: Reducing ADMGs to the Standard Front-door Setting via a Graphical Criterion
Link: https://arxiv.org/abs/2511.15679
Authors: Jianqiao Mao, Max A. Little
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 16 pages, 3 figures
Abstract:Front-door adjustment provides a simple closed-form identification formula under the classical front-door criterion, but its applicability is often viewed as narrow and strict. Although the ID algorithm is very useful and provably effective for identifying causal effects in general causal graphs (when the effect is identifiable), running it does not guarantee a practical, easy-to-estimate expression for the interventional distribution. We argue that the applicability of the front-door criterion is not as limited as it seems: many more complicated causal graphs can be reduced to the front-door criterion. In this paper, we introduce front-door reducibility (FDR), a graphical condition on acyclic directed mixed graphs (ADMGs) that extends the applicability of the classic front-door criterion, reducing a large family of complicated causal graphs to a front-door setting by aggregating variables into super-nodes (an FDR triple) (\boldsymbol{X}^*, \boldsymbol{Y}^*, \boldsymbol{M}^*). After characterizing the FDR criterion, we prove a graph-level equivalence between satisfaction of the FDR criterion and the applicability of FDR adjustment. We then present FDR-TID, an exact algorithm that detects an admissible FDR triple, and establish the algorithm’s correctness, completeness, and finite termination. Empirically motivated examples illustrate that many graphs outside the textbook front-door setting are FDR, yielding simple, estimable adjustments where general ID expressions would be cumbersome. FDR thus complements existing identification methods by prioritizing interpretability and computational simplicity without sacrificing generality across mixed graphs.
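For reference, the "simple closed-form identification formula" mentioned at the start of the abstract is the classical front-door adjustment. For discrete variables with treatment X, mediator M, and outcome Y, it reads as below. This is the textbook formula only; how the paper instantiates it for aggregated super-nodes is not stated in the abstract and is not assumed here.

```latex
P\bigl(y \mid \mathrm{do}(x)\bigr)
  = \sum_{m} P(m \mid x) \sum_{x'} P(y \mid m, x')\, P(x')
```

Every term on the right-hand side is an observational quantity, which is why the expression is considered easy to estimate compared with a general ID output.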
[LG-46] Rényi Differential Privacy for Heavy-Tailed SDEs via Fractional Poincaré Inequalities
Link: https://arxiv.org/abs/2511.15634
Authors: Benjamin Dupuis, Mert Gürbüzbalaban, Umut Şimşekli, Jian Wang, Sinan Yildirim, Lingjiong Zhu
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Characterizing the differential privacy (DP) of learning algorithms has become a major challenge in recent years. In parallel, many studies suggested investigating the behavior of stochastic gradient descent (SGD) with heavy-tailed noise, both as a model for modern deep learning models and to improve their performance. However, most DP bounds focus on light-tailed noise, where satisfactory guarantees have been obtained but the proposed techniques do not directly extend to the heavy-tailed setting. Recently, the first DP guarantees for heavy-tailed SGD were obtained. These results provide (0,\delta)-DP guarantees without requiring gradient clipping. Despite casting new light on the link between DP and heavy-tailed algorithms, these results have a strong dependence on the number of parameters and cannot be extended to other DP notions like the well-established Rényi differential privacy (RDP). In this work, we propose to address these limitations by deriving the first RDP guarantees for heavy-tailed SDEs, as well as their discretized counterparts. Our framework is based on new Rényi flow computations and the use of well-established fractional Poincaré inequalities. Under the assumption that such inequalities are satisfied, we obtain DP guarantees that have a much weaker dependence on the dimension compared to prior art.
[LG-47] CODE-II: A large-scale dataset for artificial intelligence in ECG analysis
Link: https://arxiv.org/abs/2511.15632
Authors: Petrus E. O. G. B. Abreu, Gabriela M. M. Paixão, Jiawei Li, Paulo R. Gomes, Peter W. Macfarlane, Ana C. S. Oliveira, Vinicius T. Carvalho, Thomas B. Schön, Antonio Luiz P. Ribeiro, Antônio H. Ribeiro
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments:
Abstract:Data-driven methods for electrocardiogram (ECG) interpretation are rapidly progressing. Large datasets have enabled advances in artificial intelligence (AI) based ECG analysis, yet limitations in annotation quality, size, and scope remain major challenges. Here we present CODE-II, a large-scale real-world dataset of 2,735,269 12-lead ECGs from 2,093,807 adult patients collected by the Telehealth Network of Minas Gerais (TNMG), Brazil. Each exam was annotated using standardized diagnostic criteria and reviewed by cardiologists. A defining feature of CODE-II is a set of 66 clinically meaningful diagnostic classes, developed with cardiologist input and routinely used in telehealth practice. We additionally provide CODE-II-open, a publicly available subset of 15,000 patients, and CODE-II-test, a non-overlapping set of 8,475 exams reviewed by multiple cardiologists for blinded evaluation. A neural network pre-trained on CODE-II achieved superior transfer performance on external benchmarks (PTB-XL and CPSC 2018) and outperformed alternatives trained on larger datasets.
[LG-48] Near-optimal delta-convex estimation of Lipschitz functions
Link: https://arxiv.org/abs/2511.15615
Authors: Gábor Balázs
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 41 pages, 7 figures
Abstract:This paper presents a tractable algorithm for estimating an unknown Lipschitz function from noisy observations and establishes an upper bound on its convergence rate. The approach extends max-affine methods from convex shape-restricted regression to the more general Lipschitz setting. A key component is a nonlinear feature expansion that maps max-affine functions into a subclass of delta-convex functions, which act as universal approximators of Lipschitz functions while preserving their Lipschitz constants. Leveraging this property, the estimator attains the minimax convergence rate (up to logarithmic factors) with respect to the intrinsic dimension of the data under squared loss and subgaussian distributions in the random design setting. The algorithm integrates adaptive partitioning to capture intrinsic dimension, a penalty-based regularization mechanism that removes the need to know the true Lipschitz constant, and a two-stage optimization procedure combining a convex initialization with local refinement. The framework is also straightforward to adapt to convex shape-restricted regression. Experiments demonstrate competitive performance relative to other theoretically justified methods, including nearest-neighbor and kernel-based regressors.
[LG-49] A Physics Informed Machine Learning Framework for Optimal Sensor Placement and Parameter Estimation
Link: https://arxiv.org/abs/2511.15543
Authors: Georgios Venianakis, Constantinos Theodoropoulos, Michail Kavousanakis
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Parameter estimation remains a challenging task across many areas of engineering. Because data acquisition can often be costly, limited, or prone to inaccuracies (noise, uncertainty), it is crucial to identify sensor configurations that provide the maximum amount of information about the unknown parameters, in particular for distributed-parameter systems, where spatial variations are important. Physics-Informed Neural Networks (PINNs) have recently emerged as a powerful machine-learning (ML) tool for parameter estimation, particularly in cases with sparse or noisy measurements, overcoming some of the limitations of traditional optimization-based and Bayesian approaches. Despite the widespread use of PINNs for solving inverse problems, relatively little attention has been given to how their performance depends on sensor placement. This study addresses this gap by introducing a comprehensive PINN-based framework that simultaneously tackles optimal sensor placement and parameter estimation. Our approach involves training a PINN model in which the parameters of interest are included as additional inputs. This enables the efficient computation of sensitivity functions through automatic differentiation, which are then used to determine optimal sensor locations using the D-optimality criterion. The framework is validated on two illustrative distributed-parameter reaction-diffusion-advection problems of increasing complexity. The results demonstrate that our PINN-based methodology consistently achieves higher accuracy than parameter estimates obtained from intuitively or randomly selected sensor positions.
[LG-50] Gini Score under Ties and Case Weights
Link: https://arxiv.org/abs/2511.15446
Authors: Alexej Brauer, Mario V. Wüthrich
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
Comments:
Abstract:The Gini score is a popular tool in statistical modeling and machine learning for model validation and model selection. It is a purely rank based score that allows one to assess risk rankings. The Gini score for statistical modeling has mainly been used in a binary context, in which it has many equivalent reformulations such as the receiver operating characteristic (ROC) or the area under the curve (AUC). In the actuarial literature, this rank based score for binary responses has been extended to general real-valued random variables using Lorenz curves and concentration curves. While these initial concepts assume that the risk ranking is generated by a continuous distribution function, we discuss in this paper how the Gini score can be used in the case of ties in the risk ranking. Moreover, we adapt the Gini score to the common actuarial situation of having case weights.
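To make the binary-case link between the Gini score and the AUC concrete, the sketch below computes the AUC by pairwise comparison, counting tied scores as half a win, and then applies the standard identity Gini = 2·AUC − 1. This is the common textbook convention for ties in the unweighted binary setting; it is an assumption for illustration and not necessarily the treatment of ties or case weights proposed in the paper. The function names and toy data are hypothetical.

```python
def auc_with_ties(scores, labels):
    """AUC as P(score_pos > score_neg) + 0.5 * P(score_pos == score_neg)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # tied pair counts as half a concordant pair
    return wins / (len(positives) * len(negatives))


def gini_score(scores, labels):
    """Binary Gini score via the identity Gini = 2 * AUC - 1."""
    return 2.0 * auc_with_ties(scores, labels) - 1.0


# Hypothetical toy data with one tie between a positive and a negative case.
toy_scores = [0.1, 0.4, 0.4, 0.8]
toy_labels = [0, 0, 1, 1]
toy_auc = auc_with_ties(toy_scores, toy_labels)
toy_gini = gini_score(toy_scores, toy_labels)
```

The pairwise loop is quadratic in the sample size; a rank-based implementation would be preferable for large portfolios.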
[LG-51] Exponential Lasso: robust sparse penalization under heavy-tailed noise and outliers with exponential-type loss
Link: https://arxiv.org/abs/2511.15332
Authors: TheTien Mai
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:
Abstract:In high-dimensional statistics, the Lasso is a cornerstone method for simultaneous variable selection and parameter estimation. However, its reliance on the squared loss function renders it highly sensitive to outliers and heavy-tailed noise, potentially leading to unreliable model selection and biased estimates. To address this limitation, we introduce the Exponential Lasso, a novel robust method that integrates an exponential-type loss function within the Lasso framework. This loss function is designed to achieve a smooth trade-off between statistical efficiency under Gaussian noise and robustness against data contamination. Unlike other methods that cap the influence of large residuals, the exponential loss smoothly redescends, effectively downweighting the impact of extreme outliers while preserving near-quadratic behavior for small errors. We establish theoretical guarantees showing that the Exponential Lasso achieves strong statistical convergence rates, matching the classical Lasso under ideal conditions while maintaining its robustness in the presence of heavy-tailed contamination. Computationally, the estimator is optimized efficiently via a Majorization-Minimization (MM) algorithm that iteratively solves a series of weighted Lasso subproblems. Numerical experiments demonstrate that the proposed method is highly competitive, outperforming the classical Lasso in contaminated settings and maintaining strong performance even under Gaussian noise. Our method is implemented in the R package heavylasso available on GitHub: this https URL
[LG-52] Robust Bayesian Optimisation with Unbounded Corruptions
Link: https://arxiv.org/abs/2511.15315
Authors: Abdelhamid Ezzerg, Ilija Bogunovic, Jeremias Knoblauch
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Bayesian Optimization is critically vulnerable to extreme outliers. Existing provably robust methods typically assume a bounded cumulative corruption budget, which makes them defenseless against even a single corruption of sufficient magnitude. To address this, we introduce a new adversary whose budget is only bounded in the frequency of corruptions, not in their magnitude. We then derive RCGP-UCB, an algorithm coupling the famous upper confidence bound (UCB) approach with a Robust Conjugate Gaussian Process (RCGP). We present stable and adaptive versions of RCGP-UCB, and prove that they achieve sublinear regret in the presence of up to O(T^{1/2}) and O(T^{1/3}) corruptions with possibly infinite magnitude. This robustness comes at near zero cost: without outliers, RCGP-UCB’s regret bounds match those of the standard GP-UCB algorithm.
[LG-53] Reinforcement Learning in Queue-Reactive Models: Application to Optimal Execution
Link: https://arxiv.org/abs/2511.15262
Authors: Tomas Espana, Yadh Hafsi, Fabrizio Lillo, Edoardo Vittori
Subjects: Trading and Market Microstructure (q-fin.TR); Machine Learning (cs.LG)
Comments:
Abstract:We investigate the use of Reinforcement Learning for the optimal execution of meta-orders, where the objective is to execute incrementally large orders while minimizing implementation shortfall and market impact over an extended period of time. Departing from traditional parametric approaches to price dynamics and impact modeling, we adopt a model-free, data-driven framework. Since policy optimization requires counterfactual feedback that historical data cannot provide, we employ the Queue-Reactive Model to generate realistic and tractable limit order book simulations that encompass transient price impact, and nonlinear and dynamic order flow responses. Methodologically, we train a Double Deep Q-Network agent on a state space comprising time, inventory, price, and depth variables, and evaluate its performance against established benchmarks. Numerical simulation results show that the agent learns a policy that is both strategic and tactical, adapting effectively to order book conditions and outperforming standard approaches across multiple training configurations. These findings provide strong evidence that model-free Reinforcement Learning can yield adaptive and robust solutions to the optimal execution problem.
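As background on the Double Deep Q-Network agent named above, the sketch below shows the standard Double DQN bootstrap target: the online network chooses the next action and the target network evaluates it. It is a generic illustration with hypothetical names and toy numbers, and does not reproduce the paper's state space, reward definition, or order book simulator.

```python
def double_dqn_target(reward, next_q_online, next_q_target, gamma, done):
    """Return r + gamma * Q_target(s', argmax_a Q_online(s', a)), or r if terminal."""
    if done:
        return reward
    # Action selection uses the online network's estimates...
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # ...while action evaluation uses the separate target network.
    return reward + gamma * next_q_target[best_action]


# Hypothetical toy transition with three available actions.
example_target = double_dqn_target(
    reward=1.0,
    next_q_online=[0.2, 0.9, 0.5],
    next_q_target=[0.3, 0.4, 0.8],
    gamma=0.5,
    done=False,
)
```

Decoupling selection from evaluation is what distinguishes this target from plain DQN, which would instead use the maximum of `next_q_target` directly.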
[LG-54] Why Physics Still Matters: Improving Machine Learning Prediction of Material Properties with Phonon-Informed Datasets
Link: https://arxiv.org/abs/2511.15222
Authors: Pol Benítez, Cibrán López, Edgardo Saucedo, Teruyasu Mizoguchi, Claudio Cazorla
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 12 pages; 5 figures
Abstract:Machine learning (ML) methods have become powerful tools for predicting material properties with near first-principles accuracy and vastly reduced computational cost. However, the performance of ML models critically depends on the quality, size, and diversity of the training dataset. In materials science, this dependence is particularly important for learning from low-symmetry atomistic configurations that capture thermal excitations, structural defects, and chemical disorder, features that are ubiquitous in real materials but underrepresented in most datasets. The absence of systematic strategies for generating representative training data may therefore limit the predictive power of ML models in technologically critical fields such as energy conversion and photonics. In this work, we assess the effectiveness of graph neural network (GNN) models trained on two fundamentally different types of datasets: one composed of randomly generated atomic configurations and another constructed using physically informed sampling based on lattice vibrations. As a case study, we address the challenging task of predicting electronic and mechanical properties of a prototypical family of optoelectronic materials under realistic finite-temperature conditions. We find that the phonon-informed model consistently outperforms the counterpart trained on random configurations, despite relying on fewer data points. Explainability analyses further reveal that high-performing models assign greater weight to chemically meaningful bonds that control property variations, underscoring the importance of physically guided data generation. Overall, this work demonstrates that larger datasets do not necessarily yield better GNN predictive models and introduces a simple and general strategy for efficiently constructing high-quality training data in materials informatics.
[LG-55] Particle Monte Carlo methods for Lattice Field Theory NEURIPS2025
Link: https://arxiv.org/abs/2511.15196
Authors: David Yallup
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); High Energy Physics - Lattice (hep-lat)
Comments: To appear in the NeurIPS 2025 workshop, Frontiers in Probabilistic Inference: Sampling Meets Learning
Abstract:High-dimensional multimodal sampling problems from lattice field theory (LFT) have become important benchmarks for machine learning assisted sampling methods. We show that GPU-accelerated particle methods, Sequential Monte Carlo (SMC) and nested sampling, provide a strong classical baseline that matches or outperforms state-of-the-art neural samplers in sample quality and wall-clock time on standard scalar field theory benchmarks, while also estimating the partition function. Using only a single data-driven covariance for tuning, these methods achieve competitive performance without problem-specific structure, raising the bar for when learned proposals justify their training cost.
[LG-56] Beyond Uncertainty Sets: Leveraging Optimal Transport to Extend Conformal Predictive Distribution to Multivariate Settings
Link: https://arxiv.org/abs/2511.15146
Authors: Eugene Ndiaye
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments:
Abstract:Conformal prediction (CP) constructs uncertainty sets for model outputs with finite-sample coverage guarantees. A candidate output is included in the prediction set if its non-conformity score is not considered extreme relative to the scores observed on a set of calibration examples. However, this procedure is only straightforward when scores are scalar-valued, which has limited CP to real-valued scores or ad-hoc reductions to one dimension. The problem of ordering vectors has been studied via optimal transport (OT), which provides a principled method for defining vector-ranks and multivariate quantile regions, though typically with only asymptotic coverage guarantees. We restore finite-sample, distribution-free coverage by conformalizing the vector-valued OT quantile region. Here, a candidate’s rank is defined via a transport map computed for the calibration scores augmented with that candidate’s score. This defines a continuum of OT problems for which we prove that the resulting optimal assignment is piecewise-constant across a fixed polyhedral partition of the score space. This allows us to characterize the entire prediction set tractably, and provides the machinery to address a deeper limitation of prediction sets: that they only indicate which outcomes are plausible, but not their relative likelihood. In one dimension, conformal predictive distributions (CPDs) fill this gap by producing a predictive distribution with finite-sample calibration. Extending CPDs beyond one dimension remained an open problem. We construct, to our knowledge, the first multivariate CPDs with finite-sample calibration, i.e., they define a valid multivariate distribution where any derived uncertainty region automatically has guaranteed coverage. We present both conservative and exact randomized versions, the latter resulting in a multivariate generalization of the classical Dempster-Hill procedure.
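The scalar-score procedure that the abstract takes as its starting point can be sketched with standard split conformal prediction: a candidate is kept if its non-conformity score does not exceed the ⌈(n+1)(1−α)⌉-th smallest calibration score. This sketch assumes absolute residuals as scores and uses hypothetical names and toy numbers; it covers only the one-dimensional baseline, not the optimal-transport construction of the paper.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Return the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(calibration_scores)
    if n == 0 or not 0.0 < alpha < 1.0:
        raise ValueError("need calibration scores and alpha in (0, 1)")
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        # Too few calibration points for this level: the set is unbounded.
        return math.inf
    return sorted(calibration_scores)[k - 1]


def prediction_interval(point_prediction, calibration_scores, alpha):
    """Interval of candidates whose absolute residual is within the threshold."""
    q = conformal_threshold(calibration_scores, alpha)
    return (point_prediction - q, point_prediction + q)


# Hypothetical absolute residuals from seven calibration examples.
toy_residuals = [0.5, 0.1, 0.7, 0.3, 0.2, 0.6, 0.4]
toy_threshold = conformal_threshold(toy_residuals, alpha=0.25)
toy_interval = prediction_interval(2.0, toy_residuals, alpha=0.25)
```

The sort in `conformal_threshold` is exactly the step that has no canonical analogue for vector-valued scores, which is the gap the abstract addresses.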
[LG-57] Latent space analysis and generalization to out-of-distribution data
Link: https://arxiv.org/abs/2511.15010
Authors: Katie Rainey,Erin Hausmann,Donald Waagen,David Gray,Donald Hulsey
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Understanding the relationships between data points in the latent decision space derived by the deep learning system is critical to evaluating and interpreting the performance of the system on real world data. Detecting out-of-distribution (OOD) data for deep learning systems continues to be an active research topic. We investigate the connection between latent space OOD detection and classification accuracy of the model. Using open source simulated and measured Synthetic Aperture RADAR (SAR) datasets, we empirically demonstrate that the OOD detection cannot be used as a proxy measure for model performance. We hope to inspire additional research into the geometric properties of the latent space that may yield future insights into deep learning robustness and generalizability.
[LG-58] Resource-Based Time and Cost Prediction in Project Networks: From Statistical Modeling to Graph Neural Networks
Link: https://arxiv.org/abs/2511.15003
Authors: Reza Mirjalili,Behrad Braghi,Shahram Shadrokh Sikari
Subjects: Applications (stat.AP); Machine Learning (cs.LG)
Comments: 52 pages, 12 figures
Abstract:Accurate prediction of project duration and cost remains one of the most challenging aspects of project management, particularly in resource-constrained and interdependent task networks. Traditional analytical techniques such as the Critical Path Method (CPM) and Program Evaluation and Review Technique (PERT) rely on simplified and often static assumptions regarding task interdependencies and resource performance. This study proposes a novel resource-based predictive framework that integrates network representations of project activities with graph neural networks (GNNs) to capture structural and contextual relationships among tasks, resources, and time-cost dynamics. The model represents the project as a heterogeneous activity-resource graph in which nodes denote activities and resources, and edges encode temporal and resource dependencies. We evaluate multiple learning paradigms, including GraphSAGE and Temporal Graph Networks, on both synthetic and benchmark project datasets. Experimental results show that the proposed GNN framework achieves an average 23 to 31 percent reduction in mean absolute error compared to traditional regression and tree-based methods, while improving the coefficient of determination R2 from approximately 0.78 to 0.91 for large and complex project networks. Furthermore, the learned embeddings provide interpretable insights into resource bottlenecks and critical dependencies, enabling more explainable and adaptive scheduling decisions.
[LG-59] Selective Forgetting in Option Calibration: An Operator-Theoretic Gauss-Newton Framework
Link: https://arxiv.org/abs/2511.14980
Authors: Ahmet Umur Özsoy
Subjects: Mathematical Finance (q-fin.MF); Machine Learning (cs.LG)
Comments:
Abstract:Calibration of option pricing models is routinely repeated as markets evolve, yet modern systems lack an operator for removing data from a calibrated model without full retraining. When quotes become stale, corrupted, or subject to deletion requirements, existing calibration pipelines must rebuild the entire nonlinear least-squares problem, even if only a small subset of data must be excluded. In this work, we introduce a principled framework for selective forgetting (machine unlearning) in parametric option calibration. We provide stability guarantees, perturbation bounds, and show that the proposed operators satisfy local exactness under standard regularity assumptions.
[LG-60] Reconstruction of three-dimensional shapes of normal and disease-related erythrocytes from partial observations using multi-fidelity neural networks
Link: https://arxiv.org/abs/2511.14962
Authors: Haizhou Wen,He Li,Zhen Li
Subjects: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Biological Physics (physics.bio-ph); Quantitative Methods (q-bio.QM)
Comments: 29 pages, 10 figures, 3 appendices
Abstract:Reconstruction of 3D erythrocyte or red blood cell (RBC) morphology from partial observations, such as microscope images, is essential for understanding the physiology of RBC aging and the pathology of various RBC disorders. In this study, we propose a multi-fidelity neural network (MFNN) approach to fuse high-fidelity cross-sections of an RBC, with a morphologically similar low-fidelity reference 3D RBC shape to recover its full 3D surface. The MFNN predictor combines a convolutional neural network trained on low-fidelity reference RBC data with a feedforward neural network that captures nonlinear morphological correlations, and augments training with surface area and volume constraints for regularization in the low-fidelity branch. This approach is theoretically grounded by a topological homeomorphism between a sphere and 3D RBC surfaces, with training data generated by dissipative particle dynamics simulations of stomatocyte-discocyte-echinocyte transformation. Benchmarking across diverse RBC shapes observed in normal and aged populations, our results show that the MFNN predictor can reconstruct complex RBC morphologies with over 95% coordinate accuracy when provided with at least two orthogonal cross-sections. It is observed that informative oblique cross-sections intersecting spicule tips of echinocytes improve both local and global feature reconstruction, highlighting the value of feature-aware sampling. Our study further evaluates the influence of sampling strategies, shape dissimilarity, and noise, showing enhanced robustness under physically constrained training. Altogether, these results demonstrate the capability of MFNN to reconstruct the 3D shape of normal and aged RBCs from partial cross-sections as observed in conventional microscope images, which could facilitate the quantitative analysis of RBC morphological parameters in normal and disease-related RBC samples.
[LG-61] How to pick the best anomaly detector?
Link: https://arxiv.org/abs/2511.14832
Authors: Marie Hein,Gregor Kasieczka,Michael Krämer,Louis Moureaux,Alexander Mück,David Shih
Subjects: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 12 pages, 7 figures
Abstract:Anomaly detection has the potential to discover new physics in unexplored regions of the data. However, choosing the best anomaly detector for a given data set in a model-agnostic way is an important challenge which has hitherto largely been neglected. In this paper, we introduce the data-driven ARGOS metric, which has a sound theoretical foundation and is empirically shown to robustly select the most sensitive anomaly detection model given the data. Focusing on weakly-supervised, classifier-based anomaly detection methods, we show that the ARGOS metric outperforms other model selection metrics previously used in the literature, in particular the binary cross-entropy loss. We explore several realistic applications, including hyperparameter tuning as well as architecture and feature selection, and in all cases we demonstrate that ARGOS is robust to the noisy conditions of anomaly detection.
[LG-62] Convex Clustering Redefined: Robust Learning with the Median of Means Estimator AAAI2026
Link: https://arxiv.org/abs/2511.14784
Authors: Sourav De,Koustav Chowdhury,Bibhabasu Mandal,Sagar Ghosh,Swagatam Das,Debolina Paul,Saptarshi Chakraborty
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Computation (stat.CO); Methodology (stat.ME)
Comments: Accepted in AAAI 2026
Abstract:Clustering approaches that utilize convex loss functions have recently attracted growing interest in the formation of compact data clusters. Although classical methods like k-means and its wide family of variants are still widely used, all of them require the number of clusters k to be supplied as input, and many are notably sensitive to initialization. Convex clustering provides a more stable alternative by formulating the clustering task as a convex optimization problem, ensuring a unique global solution. However, it faces challenges in handling high-dimensional data, especially in the presence of noise and outliers. Additionally, strong fusion regularization, controlled by the tuning parameter, can hinder effective cluster formation within a convex clustering framework. To overcome these challenges, we introduce a robust approach that integrates convex clustering with the Median of Means (MoM) estimator, thus developing an outlier-resistant and efficient clustering framework that does not necessitate prior knowledge of the number of clusters. By leveraging the robustness of MoM alongside the stability of convex clustering, our method enhances both performance and efficiency, especially on large-scale datasets. Theoretical analysis demonstrates weak consistency under specific conditions, while experiments on synthetic and real-world datasets validate the method’s superior performance compared to existing approaches.
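For readers unfamiliar with the Median of Means (MoM) estimator named in the abstract, the basic scalar version can be sketched as follows. This shows only the generic estimator, not the paper's combination with convex clustering; the function name, the random partitioning scheme, and the toy data are assumptions for illustration.

```python
import random
import statistics


def median_of_means(values, num_blocks, seed=0):
    """Split the data into blocks, average each block, return the median.

    A few outliers can corrupt only the blocks they land in, so the median
    of the block means stays close to the bulk of the data.
    """
    if not 1 <= num_blocks <= len(values):
        raise ValueError("num_blocks must be between 1 and len(values)")
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)  # random partition into blocks
    blocks = [shuffled[i::num_blocks] for i in range(num_blocks)]
    return statistics.median(statistics.fmean(block) for block in blocks)


# Hypothetical data: twenty clean observations plus one gross outlier.
data = [1.0] * 20 + [1000.0]

print(statistics.fmean(data))    # the plain mean is dragged far from 1.0
print(median_of_means(data, 5))  # 1.0
```

The single outlier lands in exactly one of the five blocks, so four block means equal 1.0 and their median is 1.0 whichever way the shuffle falls. More blocks tolerate more outliers at the cost of noisier block means.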
Information Retrieval
[IR-0] Unveiling Inference Scaling for Difference-Aware User Modeling in LLM Personalization
Link: https://arxiv.org/abs/2511.15389
Authors: Suyu Chen,Yimeng Bai,Yulong Huang,Xiaoyan Zhao,Yang Zhang
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Large Language Models (LLMs) are increasingly integrated into users’ daily lives, driving a growing demand for personalized outputs. Prior work has primarily leveraged a user’s own history, often overlooking inter-user differences that are critical for effective personalization. While recent methods have attempted to model such differences, their feature extraction processes typically rely on fixed dimensions and quick, intuitive inference (System-1 thinking), limiting both the coverage and granularity of captured user differences. To address these limitations, we propose Difference-aware Reasoning Personalization (DRP), a framework that reconstructs the difference extraction mechanism by leveraging inference scaling to enhance LLM personalization. DRP autonomously identifies relevant difference feature dimensions and generates structured definitions and descriptions, enabling slow, deliberate reasoning (System-2 thinking) over user differences. Experiments on personalized review generation demonstrate that DRP consistently outperforms baseline methods across multiple metrics.
[IR-1] Opinion Dynamics Models for Sentiment Evolution in Weibo Blogs
Link: https://arxiv.org/abs/2511.15303
Authors: Yulong He,Anton V. Proskurnikov,Artem Sedakov
Subjects: Social and Information Networks (cs.SI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Multiagent Systems (cs.MA); Physics and Society (physics.soc-ph)
Comments:
Abstract:Online social media platforms enable influencers to distribute content and quickly capture audience reactions, significantly shaping their promotional strategies and advertising agreements. Understanding how sentiment dynamics and emotional contagion unfold among followers is vital for influencers and marketers, as these processes shape engagement, brand perception, and purchasing behavior. While sentiment analysis tools effectively track sentiment fluctuations, dynamical models explaining their evolution remain limited, often neglecting network structures and interactions both among blogs and between their topic-focused follower groups. In this study, we tracked influential tech-focused Weibo bloggers over six months, quantifying follower sentiment from text-mined feedback. By treating each blogger’s audience as a single “macro-agent”, we find that sentiment trajectories follow the principle of iterative averaging – a foundational mechanism in many dynamical models of opinion formation, a theoretical framework at the intersection of social network analysis and dynamical systems theory. The sentiment evolution aligns closely with opinion-dynamics models, particularly modified versions of the classical French-DeGroot model that incorporate delayed perception and distinguish between expressed and private opinions. The inferred influence structures reveal interdependencies among blogs that may arise from homophily, whereby emotionally similar users subscribe to the same blogs and collectively shape the shared sentiment expressed within these communities.
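The iterative-averaging principle mentioned in the abstract can be illustrated with the classical French-DeGroot update x(t+1) = W x(t), where W is a row-stochastic influence matrix. The sketch below covers only this classical model; the delayed-perception and expressed/private-opinion variants studied in the paper are not reproduced, and the matrix and initial sentiments are made-up values.

```python
import math


def degroot_step(weights, opinions):
    """One averaging step: each agent takes a weighted mean of all opinions."""
    return [sum(w * x for w, x in zip(row, opinions)) for row in weights]


def simulate(weights, opinions, steps):
    """Iterate the French-DeGroot update and return the whole trajectory."""
    for row in weights:
        if any(w < 0 for w in row) or not math.isclose(sum(row), 1.0):
            raise ValueError("each row must be non-negative and sum to 1")
    trajectory = [list(opinions)]
    for _ in range(steps):
        trajectory.append(degroot_step(weights, trajectory[-1]))
    return trajectory


# Hypothetical influence weights among three blog audiences ("macro-agents").
W = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
initial_sentiment = [1.0, 0.0, -1.0]  # assumed starting sentiment scores

for t, state in enumerate(simulate(W, initial_sentiment, 5)):
    print(t, state)
```

In this toy network the gap between the most positive and most negative audience halves at every step, so the sentiments converge to a consensus. Fitting W to observed sentiment trajectories is the kind of inference the abstract refers to as recovering influence structures.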
[IR-2] Selective Mixup for Debiasing Question Selection in Computerized Adaptive Testing CIKM2025
Link: https://arxiv.org/abs/2511.15241
Authors: Mi Tian,Kun Zhang,Fei Liu,Jinglong Li,Yuxin Liao,Chenxi Bai,Zhengtao Tan,Le Wu,Richang Hong
Subjects: Information Retrieval (cs.IR)
Comments: Accepted by CIKM 2025
Abstract:Computerized Adaptive Testing (CAT) is a widely used technology for evaluating learners’ proficiency in online education platforms. By leveraging prior estimates of proficiency to select questions and updating the estimates iteratively based on responses, CAT enables personalized learner modeling and has attracted substantial attention. Despite this progress, most existing works focus primarily on improving diagnostic accuracy, while overlooking the selection bias inherent in the adaptive process. Selection Bias arises because the question selection is strongly influenced by the estimated proficiency, such as assigning easier questions to learners with lower proficiency and harder ones to learners with higher proficiency. Since the selection depends on prior estimation, this bias propagates into the diagnosis model, which is further amplified during iterative updates, leading to misalignment and biased predictions. Moreover, the imbalanced nature of learners’ historical interactions often exacerbates the bias in diagnosis models. To address this issue, we propose a debiasing framework consisting of two key modules: Cross-Attribute Examinee Retrieval and Selective Mixup-based Regularization. First, we retrieve balanced examinees with relatively even distributions of correct and incorrect responses and use them as neutral references for biased examinees. Then, mixup is applied between each biased examinee and its matched balanced counterpart under label consistency. This augmentation enriches the diversity of bias-conflicting samples and smooths selection boundaries. Finally, extensive experiments on two benchmark datasets with multiple advanced diagnosis models demonstrate that our method substantially improves both the generalization ability and fairness of question selection in CAT.
[IR-3] SilverTorch: A Unified Model-based System to Democratize Large-Scale Recommendation on GPUs
Link: https://arxiv.org/abs/2511.14881
Authors: Bi Xue,Hong Wu,Lei Chen,Chao Yang,Yiming Ma,Fei Ding,Zhen Wang,Liang Wang,Xiaoheng Mao,Ke Huang,Xialu Li,Peng Xia,Rui Jian,Yanli Zhao,Yanzun Huang,Yijie Deng,Harry Tran,Ryan Chang,Min Yu,Eric Dong,Jiazhou Wang,Qianqian Zhang,Keke Zhai,Hongzhang Yin,Pawel Garbacki,Zheng Fang,Yiyi Pan,Min Ni,Yang Liu
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Serving deep learning based recommendation models (DLRM) at scale is challenging. Existing systems rely on CPU-based ANN indexing and filtering services, suffering from non-negligible costs and forgoing joint optimization opportunities. Such inefficiency makes it difficult for them to support more complex model architectures, such as learned similarities and multi-task retrieval. In this paper, we propose SilverTorch, a model-based system for serving recommendation models on GPUs. SilverTorch unifies model serving by replacing standalone indexing and filtering services with layers of served models. We propose a Bloom index algorithm on GPUs for feature filtering and a tensor-native fused Int8 ANN kernel on GPUs for nearest neighbor search. We further co-design the ANN search index and filtering index to reduce GPU memory utilization and eliminate unnecessary computation. Benefiting from SilverTorch’s serving paradigm, we introduce an OverArch scoring layer and a Value Model to aggregate results across multiple tasks. These advancements improve retrieval accuracy and enable future studies on serving more complex models. For ranking, SilverTorch’s design accelerates item embedding calculation by caching the pre-calculated embeddings inside the serving model. Our evaluation on industry-scale datasets shows that SilverTorch achieves up to 5.6x lower latency and 23.7x higher throughput compared to state-of-the-art approaches. We also demonstrate that SilverTorch’s solution is 13.35x more cost-efficient than a CPU-based solution while improving accuracy by serving more complex models. SilverTorch serves hundreds of models online across major products and recommends content for billions of daily active users.
[IR-4] OTCR: Optimal Transmission Compression and Representation for Multimodal Information Extraction
Link: https://arxiv.org/abs/2511.14766
Authors: Yang Li,Yajiao Wang,Wenhao Hu,Zhixiong Zhang,Mengting Zhang
Subjects: Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments: 5 pages, 3 figures
Abstract:Multimodal Information Extraction (MIE) requires fusing text and visual cues from visually rich documents. While recent methods have advanced multimodal representation learning, most implicitly assume modality equivalence or treat modalities in a largely uniform manner, still relying on generic fusion paradigms. This often results in indiscriminate incorporation of multimodal signals and insufficient control over task-irrelevant redundancy, which may in turn limit generalization. We revisit MIE from a task-centric view: text should dominate, vision should selectively support. We present OTCR, a two-stage framework. First, Cross-modal Optimal Transport (OT) yields sparse, probabilistic alignments between text tokens and visual patches, with a context-aware gate controlling visual injection. Second, a Variational Information Bottleneck (VIB) compresses fused features, filtering task-irrelevant noise to produce compact, task-adaptive representations. On FUNSD, OTCR achieves 91.95% SER and 91.13% RE, while on XFUND (ZH), it reaches 91.09% SER and 94.20% RE, demonstrating competitive performance across datasets. Feature-level analyses further confirm reduced modality redundancy and strengthened task signals. This work offers an interpretable, information-theoretic paradigm for controllable multimodal fusion in document AI.

