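This post lists the latest papers retrieved from arXiv.org on 2025-11-24. It updates automatically and is organized into five areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.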

Note: The daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 every day.

Tip: If you would like to receive the daily paper data by email, leave your email address in the comments.

Overview (2025-11-24)

413 papers were updated today, including:

  • Natural Language Processing: 59 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 107 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 131 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 91 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Masked-and-Reordered Self-Supervision for Reinforcement Learning from Verifiable Rewards

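[Quick read]: This paper addresses the scalability bottleneck of Reinforcement Learning from Verifiable Rewards (RLVR) on mathematical reasoning tasks, especially theorem proving. The core challenge is that intermediate reasoning is crucial to the outcome while final answers are hard to verify directly and reliably, which limits conventional outcome-based RLVR; meanwhile, token-level Supervised Fine-Tuning (SFT) alone tends to fall into rote memorization rather than producing long reasoning chains. The key to the solution is MR-RLVR (Masked-and-Reordered RLVR), which builds process-aware reward signals from two self-supervised tasks, "masked-then-fill" and "step reordering", so that learnable supervision can be extracted from intermediate reasoning steps. Training has two stages: self-supervised training on sampled data to learn reasoning structure, followed by RLVR fine-tuning on datasets where only outcomes are verifiable. Under a fixed sampling and decoding budget, this yields average relative gains over the original RLVR of +9.86% Pass@1, +5.27% Pass@5, and +4.00% Pass@8.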

Link: https://arxiv.org/abs/2511.17473
Authors: Zhen Wang, Zhifeng Gao, Guolin Ke
Affiliations: DP Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Test-time scaling has been shown to substantially improve large language models’ (LLMs) mathematical reasoning. However, for a large portion of mathematical corpora, especially theorem proving, RLVR’s scalability is limited: intermediate reasoning is crucial, while final answers are difficult to directly and reliably verify. Meanwhile, token-level SFT often degenerates into rote memorization rather than inducing longer chains of thought. Inspired by BERT’s self-supervised tasks, we propose MR-RLVR (Masked-and-Reordered RLVR), which constructs process-level self-supervised rewards via “masked-then-fill” and “step reordering” to extract learnable signals from intermediate reasoning. Our training pipeline comprises two stages: we first perform self-supervised training on sampled mathematical calculation and proof data; we then conduct RLVR fine-tuning on mathematical calculation datasets where only outcomes are verifiable. We implement MR-RLVR on Qwen2.5-3B and DeepSeek-R1-Distill-Qwen-1.5B, and evaluate on AIME24, AIME25, AMC23, and MATH500. Under a fixed sampling and decoding budget, MR-RLVR achieves average relative gains over the original RLVR of +9.86% Pass@1, +5.27% Pass@5, and +4.00% Pass@8. These results indicate that incorporating process-aware self-supervised signals can effectively enhance RLVR’s scalability and performance in only outcome-verifiable settings.
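
The two self-supervised tasks named in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `<MASK>` token, the exact-match fill reward, and the per-position reorder reward are all assumptions made for illustration, and the abstract does not specify how the rewards are actually computed.

```python
import random

MASK = "<MASK>"  # hypothetical placeholder token


def make_masked_fill_task(steps, rng):
    """Hide one reasoning step; the hidden step becomes the target."""
    idx = rng.randrange(len(steps))
    masked = list(steps)
    target = masked[idx]
    masked[idx] = MASK
    return masked, idx, target


def make_reorder_task(steps, rng):
    """Shuffle the steps; order[j] is the original index of shuffled[j]."""
    order = list(range(len(steps)))
    rng.shuffle(order)
    shuffled = [steps[i] for i in order]
    return shuffled, order


def fill_reward(predicted_step, target_step):
    """Exact-match reward (assumed; a real grader could be more lenient)."""
    return 1.0 if predicted_step.strip() == target_step.strip() else 0.0


def reorder_reward(predicted_order, true_order):
    """Fraction of positions whose original index is predicted correctly."""
    hits = sum(p == t for p, t in zip(predicted_order, true_order))
    return hits / len(true_order)


steps = ["Let x = 3.", "Then x^2 = 9.", "So x^2 + 1 = 10."]
rng = random.Random(0)
masked, idx, target = make_masked_fill_task(steps, rng)
shuffled, order = make_reorder_task(steps, rng)
```

Both rewards are computed from the reasoning steps themselves, so they need no final-answer verifier, which is the property the abstract relies on for proof-style data.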

[NLP-1] Planning with Sketch-Guided Verification for Physics-Aware Video Generation

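[Quick read]: This paper addresses insufficient motion-planning quality in current video generation methods: single-shot planning struggles with complex motion, while iterative refinement is computationally expensive because it calls the video generator repeatedly. The key to the solution is SketchVerify, a training-free framework that adds a test-time sampling and verification loop to screen and refine candidate motion trajectories before full video generation. It first predicts multiple candidate motion plans, then uses a vision-language verifier to jointly assess semantic alignment with the instruction and physical plausibility. For efficiency, each trajectory is rendered as a lightweight video sketch, which avoids repeated use of expensive diffusion-based synthesis and substantially reduces compute while maintaining comparable performance.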

Link: https://arxiv.org/abs/2511.17450
Authors: Yidong Huang, Zun Wang, Han Lin, Dong-Ki Kim, Shayegan Omidshafiei, Jaehong Yoon, Yue Zhang, Mohit Bansal
Affiliations: UNC Chapel Hill; FieldAI; Nanyang Technological University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: website: this https URL

Abstract:Recent video generation approaches increasingly rely on planning intermediate control signals such as object trajectories to improve temporal coherence and motion fidelity. However, these methods mostly employ single-shot plans that are typically limited to simple motions, or iterative refinement which requires multiple calls to the video generator, incurring high computational cost. To overcome these limitations, we propose SketchVerify, a training-free, sketch-verification-based planning framework that improves motion planning quality with more dynamically coherent trajectories (i.e., physically plausible and instruction-consistent motions) prior to full video generation by introducing a test-time sampling and verification loop. Given a prompt and a reference image, our method predicts multiple candidate motion plans and ranks them using a vision-language verifier that jointly evaluates semantic alignment with the instruction and physical plausibility. To efficiently score candidate motion plans, we render each trajectory as a lightweight video sketch by compositing objects over a static background, which bypasses the need for expensive, repeated diffusion-based synthesis while achieving comparable performance. We iteratively refine the motion plan until a satisfactory one is identified, which is then passed to the trajectory-conditioned generator for final synthesis. Experiments on WorldModelBench and PhyWorldBench demonstrate that our method significantly improves motion quality, physical realism, and long-term consistency compared to competitive baselines while being substantially more efficient. Our ablation study further shows that scaling up the number of trajectory candidates consistently enhances overall performance.

[NLP-2] SMILE: A Composite Lexical-Semantic Metric for Question-Answering Evaluation

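[Quick read]: This paper addresses the problem that traditional metrics for Textual and Visual Question Answering (TVQA), such as ROUGE, METEOR, and Exact Match, rely too heavily on n-gram lexical similarity and fail to capture deeper semantic understanding. Contextual-embedding methods such as BERTScore and MoverScore improve semantic evaluation, but they lack flexibility in balancing sentence-level and keyword-level semantics and ignore lexical matching, which remains important. The authors therefore propose SMILE (Semantic Metric Integrating Lexical Exactness), which combines sentence-level semantic understanding, keyword-level semantic understanding, and exact keyword matching to balance lexical precision with semantic relevance. The resulting metric stays computationally lightweight while correlating highly with human judgments.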

Link: https://arxiv.org/abs/2511.17432
Authors: Shrikant Kendre, Austin Xu, Honglu Zhou, Michael Ryoo, Shafiq Joty, Juan Carlos Niebles
Affiliations: Salesforce AI Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 6 tables, 9 figures

Abstract:Traditional evaluation metrics for textual and visual question answering, like ROUGE, METEOR, and Exact Match (EM), focus heavily on n-gram based lexical similarity, often missing the deeper semantic understanding needed for accurate assessment. While measures like BERTScore and MoverScore leverage contextual embeddings to address this limitation, they lack flexibility in balancing sentence-level and keyword-level semantics and ignore lexical similarity, which remains important. Large Language Model (LLM) based evaluators, though powerful, come with drawbacks like high costs, bias, inconsistency, and hallucinations. To address these issues, we introduce SMILE: Semantic Metric Integrating Lexical Exactness, a novel approach that combines sentence-level semantic understanding with keyword-level semantic understanding and easy keyword matching. This composite method balances lexical precision and semantic relevance, offering a comprehensive evaluation. Extensive benchmarks across text, image, and video QA tasks show SMILE is highly correlated with human judgments and computationally lightweight, bridging the gap between lexical and semantic evaluation.
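
To make the idea of a composite lexical-semantic score concrete, here is a minimal sketch of the general shape such a metric could take. It is not the SMILE implementation: a bag-of-words cosine stands in for the embedding-based similarity, and the function names and the weights `w_sent`, `w_kw`, `w_exact` are hypothetical values chosen only for illustration.

```python
import math
import re
from collections import Counter


def bow_cosine(a, b):
    """Cosine similarity of bag-of-words vectors (stand-in for embeddings)."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


def composite_score(prediction, reference, keyword,
                    w_sent=0.4, w_kw=0.3, w_exact=0.3):
    """Weighted mix of sentence-level, keyword-level, and exact-match scores."""
    sent = bow_cosine(prediction, reference)   # sentence-level similarity
    kw = bow_cosine(prediction, keyword)       # keyword-level similarity
    pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
    exact = 1.0 if re.search(pattern, prediction.lower()) else 0.0
    return w_sent * sent + w_kw * kw + w_exact * exact


good = composite_score("The cat is on the mat", "A cat sits on the mat", "cat")
bad = composite_score("It is a dog", "A cat sits on the mat", "cat")
```

In this toy example `good` is larger than `bad` because the first prediction overlaps the reference and contains the keyword; changing the weights shifts the balance between lexical exactness and semantic similarity.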

[NLP-3] Beyond Multiple Choice: A Hybrid Framework for Unifying Robust Evaluation and Verifiable Reasoning Training

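[Quick read]: This paper addresses the unreliability of metrics when Multiple-choice Question Answering (MCQA) is used to evaluate modern multimodal language models and to train them with Reinforcement Fine-Tuning (RFT): the options may leak exploitable signals, which inflates accuracy and encourages explicit or implicit answer guessing. The key to the solution is ReVeL (Rewrite and Verify by LLM), a framework that rewrites MCQA items as open-form questions (Open-form Question Answering, OpenQA) while keeping answers verifiable whenever possible, removing the bias introduced by option leakage. ReVeL categorizes questions by answer type and applies a different rewriting and verification scheme to each. When used to fine-tune Qwen2.5-VL models, it matches MCQA accuracy on multiple-choice benchmarks and improves OpenQA accuracy by about six percentage points; when used for evaluation, it reveals up to 20 percentage points of score inflation in MCQA benchmarks.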

Link: https://arxiv.org/abs/2511.17405
Authors: Yesheng Liu, Hao Li, Haiyu Xu, Baoqi Pei, Jiahao Wang, Mingxuan Zhao, Jingshu Zheng, Zheqi He, JG Yao, Bowen Qin, Xi Yang, Jiajun Zhang
Affiliations: Institute of Automation, CAS; School of Artificial Intelligence, UCAS; BAAI FlagEval Team; BUAA; PKU; ZJU
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Project url: this https URL

Abstract:Multiple-choice question answering (MCQA) has been a popular format for evaluating and reinforcement fine-tuning (RFT) of modern multimodal language models. Its constrained output format allows for simplified, deterministic automatic verification. However, we find that the options may leak exploitable signals, which makes the accuracy metrics unreliable for indicating real capabilities and encourages explicit or implicit answer guessing behaviors during RFT. We propose ReVeL (Rewrite and Verify by LLM), a framework that rewrites multiple-choice questions into open-form questions while keeping answers verifiable whenever possible. The framework categorizes questions according to different answer types and applies different rewriting and verification schemes to each. When applied for RFT, we convert 20k MCQA examples and use GRPO to finetune Qwen2.5-VL models. Models trained on ReVeL-OpenQA match MCQA accuracy on multiple-choice benchmarks and improve OpenQA accuracy by about six percentage points, indicating better data efficiency and more robust reward signals than MCQA-based training. When used for evaluation, ReVeL also reveals up to 20 percentage points of score inflation in MCQA benchmarks (relative to OpenQA), improves judging accuracy, and reduces both cost and latency. We will release code and data publicly.

[NLP-4] PUCP-Metrix: A Comprehensive Open-Source Repository of Linguistic Metrics for Spanish EACL

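[Quick read]: This paper addresses the lack of comprehensive, fine-grained, and interpretable text analysis tools for Spanish Natural Language Processing (NLP). Existing Spanish tools are limited in coverage and interpretability, which makes tasks such as style identification, structural analysis, and readability assessment hard to support. The key to the solution is PUCP-Metrix, an open-source repository of 182 linguistic metrics spanning lexical diversity, syntactic and semantic complexity, cohesion, psycholinguistics, and readability, enabling fine-grained, interpretable text analysis. Evaluated on Automated Readability Assessment and Machine-Generated Text Detection, it performs competitively with an existing repository and with strong neural baselines, offering a comprehensive, extensible resource for Spanish NLP applications.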

Link: https://arxiv.org/abs/2511.17402
Authors: Javier Alonso Villegas Luis, Marco Antonio Sobrevilla Cabezudo
Affiliations: Pontificia Universidad Católica del Perú (Pontifical Catholic University of Peru); Aveni
Subjects: Computation and Language (cs.CL)
Comments: 1 figure, to be submitted to EACL Demo track

Abstract:Linguistic features remain essential for interpretability and tasks involving style, structure, and readability, but existing Spanish tools offer limited coverage. We present PUCP-Metrix, an open-source repository of 182 linguistic metrics spanning lexical diversity, syntactic and semantic complexity, cohesion, psycholinguistics, and readability. PUCP-Metrix enables fine-grained, interpretable text analysis. We evaluate its usefulness on Automated Readability Assessment and Machine-Generated Text Detection, showing competitive performance compared to an existing repository and strong neural baselines. PUCP-Metrix offers a comprehensive, extensible resource for Spanish, supporting diverse NLP applications.

[NLP-5] Selective Rotary Position Embedding

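[Quick read]: This paper addresses the limitations of positional encoding in language modeling, in particular how to introduce position awareness more flexibly and efficiently across architectures such as softmax transformers, linear transformers, and state-space models. Standard Rotary Position Embeddings (RoPE) encode position through fixed-angle rotations, whereas Selective RoPE is an input-dependent rotary embedding mechanism: its key idea is to adjust rotation angles dynamically, giving an adaptive positional encoding that applies to both softmax and linear transformers. The work also shows that softmax attention implicitly contains a rotational structure, and that in state-space models and gated linear transformers the real part controls forgetting while the imaginary part encodes position through rotations. Equipping gated transformers with Selective RoPE improves performance on language modeling and on difficult sequence tasks such as copying, state tracking, and retrieval.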

Link: https://arxiv.org/abs/2511.17388
Authors: Sajad Movahedi, Timur Carstensen, Arshia Afzal, Frank Hutter, Antonio Orvieto, Volkan Cevher
Affiliations: ELLIS Institute Tübingen; LIONS, EPFL; University of Freiburg; Max-Planck-Institute for Intelligent Systems; Prior Labs
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Position information is essential for language modeling. In softmax transformers, Rotary Position Embeddings (RoPE) encode positions through fixed-angle rotations, while in linear transformers, order is handled via input-dependent (selective) gating that decays past key-value associations. Selectivity has generally been shown to improve language-related tasks. Inspired by this, we introduce Selective RoPE, an input-dependent rotary embedding mechanism, that generalizes RoPE, and enables rotation in arbitrary angles for both linear and softmax transformers. We show that softmax attention already performs a hidden form of these rotations on query-key pairs, uncovering an implicit positional structure. We further show that in state-space models and gated linear transformers, the real part manages forgetting while the imaginary part encodes positions through rotations. We validate our method by equipping gated transformers with Selective RoPE, demonstrating that its input-dependent rotations improve performance in language modeling and on difficult sequence tasks like copying, state tracking, and retrieval.
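
The contrast between fixed-angle and input-dependent rotations can be sketched in a few lines. The fixed-angle part follows the standard RoPE construction (each pair of dimensions treated as a complex number and rotated by position times a frequency). The input-dependent part is only an assumption for illustration: it accumulates hypothetical per-token angle increments, which may differ from how the paper parameterizes Selective RoPE.

```python
import cmath


def rope(vec, position, freqs):
    """Fixed-angle RoPE: rotate complex pair k by position * freqs[k]."""
    return [z * cmath.exp(1j * position * f) for z, f in zip(vec, freqs)]


def selective_angles(increments):
    """Running sum of per-token angle increments (assumed parameterization).

    increments[t][k] is a hypothetical angle computed from token t's input.
    """
    running = [0.0] * len(increments[0])
    angles = []
    for inc in increments:
        running = [r + d for r, d in zip(running, inc)]
        angles.append(list(running))
    return angles


def rotate(vec, angles):
    """Rotate complex pair k by angles[k]."""
    return [z * cmath.exp(1j * a) for z, a in zip(vec, angles)]


def score(q, k):
    """Real part of the Hermitian inner product (the usual dot product)."""
    return sum((a * b.conjugate()).real for a, b in zip(q, k))


q = [1 + 2j, 0.5 - 1j]
k = [0.3 + 0.1j, -1 + 0.7j]
freqs = [1.0, 0.01]

# With fixed angles the score depends only on the relative offset (2 here).
near = score(rope(q, 5, freqs), rope(k, 3, freqs))
far = score(rope(q, 12, freqs), rope(k, 10, freqs))
```

With constant increments equal to `freqs`, the accumulated angles reduce to ordinary RoPE, which is one way to read the abstract's statement that the selective variant generalizes RoPE.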

[NLP-6] Don't Learn, Ground: A Case for Natural Language Inference with Visual Grounding

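[Quick read]: This paper addresses the weak generalization of Natural Language Inference (NLI) models caused by reliance on surface textual features and statistical biases. The key to the solution is multimodal representation: a text-to-image model turns the premise into a visual representation, which is then compared semantically with the hypothesis to perform zero-shot inference. The method requires no task-specific fine-tuning and uses the visual modality as a meaning representation, improving robustness to textual biases and surface heuristics.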

Link: https://arxiv.org/abs/2511.17358
Authors: Daniil Ignatev, Ayman Santeer, Albert Gatt, Denis Paperno
Affiliations: Utrecht University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We propose a zero-shot method for Natural Language Inference (NLI) that leverages multimodal representations by grounding language in visual contexts. Our approach generates visual representations of premises using text-to-image models and performs inference by comparing these representations with textual hypotheses. We evaluate two inference techniques: cosine similarity and visual question answering. Our method achieves high accuracy without task-specific fine-tuning, demonstrating robustness against textual biases and surface heuristics. Additionally, we design a controlled adversarial dataset to validate the robustness of our approach. Our findings suggest that leveraging visual modality as a meaning representation provides a promising direction for robust natural language understanding.

[NLP-7] A new kid on the block: Distributional semantics predicts the word-specific tone signatures of monosyllabic words in conversational Taiwan Mandarin

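[Quick read]: This paper investigates how the pitch contours of monosyllabic words are realized in spontaneous conversational Taiwan Mandarin, and in particular how word meaning affects tonal realization. Traditional phonological theories typically treat tone as an independent phonemic feature modulated by factors such as syntax, context, or phonetic environment. Using a Generalized Additive Model (GAM) to decompose pitch contours, the authors find that word remains a strong predictor of tonal realization even after controlling for word duration, gender, speaker identity, tonal context, vowel height, and utterance position. Word sense is a better predictor than word itself, and heterographic homophones have different pitch contours, indicating that the effect is semantic. The strongest evidence is that the pitch contours of individual word tokens can be predicted from their contextualized embeddings with accuracy substantially above a permutation baseline. The findings support a role for distributional semantics in phonetic realization, challenge standard theories of Mandarin tone, and fit the framework of the Discriminative Lexicon Model.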

Link: https://arxiv.org/abs/2511.17337
Authors: Xiaoyun Jin, Mirjam Ernestus, R. Harald Baayen
Affiliations: Eberhard Karls Universität Tübingen (University of Tübingen); Radboud University
Subjects: Computation and Language (cs.CL); Sound (cs.SD)
Comments: arXiv admin note: text overlap with arXiv:2409.07891

Abstract:We present a corpus-based investigation of how the pitch contours of monosyllabic words are realized in spontaneous conversational Mandarin, focusing on the effects of words’ meanings. We used the generalized additive model to decompose a given observed pitch contour into a set of component pitch contours that are tied to different control variables and semantic predictors. Even when variables such as word duration, gender, speaker identity, tonal context, vowel height, and utterance position are controlled for, the effect of word remains a strong predictor of tonal realization. We present evidence that this effect of word is a semantic effect: word sense is shown to be a better predictor than word, and heterographic homophones are shown to have different pitch contours. The strongest evidence for the importance of semantics is that the pitch contours of individual word tokens can be predicted from their contextualized embeddings with an accuracy that substantially exceeds a permutation baseline. For phonetics, distributional semantics is a new kid on the block. Although our findings challenge standard theories of Mandarin tone, they fit well within the theoretical framework of the Discriminative Lexicon Model.

[NLP-8] Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM

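[Quick read]: This paper addresses inaccurate action planning and confirmation in human-robot collaboration caused by robots' limited understanding of long-horizon task context. Current methods mostly process individual clips and cannot exploit long-range dependencies across a full video, which limits how well a robot can understand and respond to human behavior in complex multi-step tasks. The key to the solution is a long-context Q-former that incorporates left and right context dependencies, combined with a text-conditioning mechanism that feeds text embeddings directly into the Large Language Model (LLM) decoder to mitigate the Q-former's over-abstraction of textual information. Experiments on the YouCook2 dataset show that the accuracy of confirmation generation is a major factor in action-planning performance, and that the long-context Q-former, integrated with VideoLLaMA3, improves both confirmation and action planning.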

Link: https://arxiv.org/abs/2511.17335
Authors: Chiori Hori, Yoshiki Masuyama, Siddarth Jain, Radu Corcodel, Devesh Jha, Diego Romeres, Jonathan Le Roux
Affiliations: Unknown
Subjects: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted to ASRU 2025

Abstract:Human-robot collaboration towards a shared goal requires robots to understand human action and interaction with the surrounding environment. This paper focuses on human-robot interaction (HRI) based on human-robot dialogue that relies on the robot action confirmation and action step generation using multimodal scene understanding. The state-of-the-art approach uses multimodal transformers to generate robot action steps aligned with robot action confirmation from a single clip showing a task composed of multiple micro steps. Although actions towards a long-horizon task depend on each other throughout an entire video, the current approaches mainly focus on clip-level processing and do not leverage long-context information. This paper proposes a long-context Q-former incorporating left and right context dependency in full videos. Furthermore, this paper proposes a text-conditioning approach to feed text embeddings directly into the LLM decoder to mitigate the high abstraction of the information in text by Q-former. Experiments with the YouCook2 corpus show that the accuracy of confirmation generation is a major factor in the performance of action planning. Furthermore, we demonstrate that the long-context Q-former improves the confirmation and action planning by integrating VideoLLaMA3.

[NLP-9] MusicAIR: A Multimodal AI Music Generation Framework Powered by an Algorithm-Driven Core

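[Quick read]: This paper addresses the copyright-infringement risk and high computational cost that arise because current Generative AI music models depend on large datasets. The key to the solution is the MusicAIR framework, built around an algorithm-driven symbolic music core that connects critical lyrical and rhythmic information to derive a complete melodic score automatically, so that lyrics alone are enough to generate a coherent score that follows music theory, lyrical structure, and rhythmic conventions. By avoiding dependence on existing music data, the approach mitigates copyright risk; generated scores reach an average key confidence of 85%, compared with 79% for human composers.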

Link: https://arxiv.org/abs/2511.17323
Authors: Callie C. Liao, Duoduo Liao, Ellie L. Zhang
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: Accepted by IEEE Big Data 2025

Abstract:Recent advances in generative AI have made music generation a prominent research focus. However, many neural-based models rely on large datasets, raising concerns about copyright infringement and high-performance costs. In contrast, we propose MusicAIR, an innovative multimodal AI music generation framework powered by a novel algorithm-driven symbolic music core, effectively mitigating copyright infringement risks. The music core algorithms connect critical lyrical and rhythmic information to automatically derive musical features, creating a complete, coherent melodic score solely from the lyrics. The MusicAIR framework facilitates music generation from lyrics, text, and images. The generated score adheres to established principles of music theory, lyrical structure, and rhythmic conventions. We developed Generate AI Music (GenAIM), a web tool using MusicAIR for lyric-to-song, text-to-music, and image-to-music generation. In our experiments, we evaluated AI-generated music scores produced by the system using both standard music metrics and innovative analysis that compares these compositions with original works. The system achieves an average key confidence of 85%, outperforming human composers at 79%, and aligns closely with established music theory standards, demonstrating its ability to generate diverse, human-like compositions. As a co-pilot tool, GenAIM can serve as a reliable music composition assistant and a possible educational composition tutor while simultaneously lowering the entry barrier for all aspiring musicians, which is innovative and significantly contributes to AI for music generation.

[NLP-10] Humanlike Multi-user Agent (HUMA): Designing a Deceptively Human AI Facilitator for Group Chats

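[Quick read]: This paper addresses the lack of natural interaction behavior in conversational agents based on Large Language Models (LLMs) when they operate in multi-user, asynchronous group chats, where response timing, role fit, and coordination with group dynamics make it hard to emulate a human Community Manager (CM). The key to the solution is the Humanlike Multi-user Agent (HUMA), an LLM agent with an event-driven architecture and three core components: Router, Action Agent, and Reflection. It adapts its response strategy to events such as messages, replies, and reactions, and simulates realistic response times. In a controlled study with 97 participants, CMs were classified as human at near-chance rates in both the AI and human conditions.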

Link: https://arxiv.org/abs/2511.17315
Authors: Mateusz Jacniacki, Martí Carmona Serrat
Affiliations: Soofte Research
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, 4 figures

Abstract:Conversational agents built on large language models (LLMs) are becoming increasingly prevalent, yet most systems are designed for one-on-one, turn-based exchanges rather than natural, asynchronous group chats. As AI assistants become widespread throughout digital platforms, from virtual assistants to customer service, developing natural and humanlike interaction patterns seems crucial for maintaining user trust and engagement. We present the Humanlike Multi-user Agent (HUMA), an LLM-based facilitator that participates in multi-party conversations using human-like strategies and timing. HUMA extends prior multi-user chatbot work with an event-driven architecture that handles messages, replies, reactions and introduces realistic response-time simulation. HUMA comprises three components (Router, Action Agent, and Reflection), which together adapt LLMs to group conversation dynamics. We evaluate HUMA in a controlled study with 97 participants in four-person role-play chats, comparing AI and human community managers (CMs). Participants classified CMs as human at near-chance rates in both conditions, indicating they could not reliably distinguish HUMA agents from humans. Subjective experience was comparable across conditions: community-manager effectiveness, social presence, and engagement/satisfaction differed only modestly with small effect sizes. Our results suggest that, in natural group chat settings, an AI facilitator can match human quality while remaining difficult to identify as nonhuman.

[NLP-11] Large Language Models for Sentiment Analysis to Detect Social Challenges: A Use Case with South African Languages

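[Quick read]: This paper addresses sentiment analysis of social media text in multilingual settings, specifically zero-shot sentiment classification with Large Language Models (LLMs) on English, Sepedi, and Setswana posts in the South African context, in order to identify social challenges and support government decision-making. The key to the solution is a systematic evaluation of the zero-shot performance of GPT-3.5, GPT-4, LlaMa 2, PaLM 2, and Dolly 2 across languages and topics, followed by a fusion of the outputs of multiple models. The fusion brings sentiment classification errors below 1%, which makes reliable sentiment monitoring feasible and gives government departments a basis for action on specific topics and within different language groups.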

Link: https://arxiv.org/abs/2511.17301
Authors: Koena Ronny Mabokela, Tim Schlippe, Matthias Wölfel
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Published in the Proceedings of The Southern African Conference on AI Research (SACAIR 2024), Bloemfontein, South Africa, 2-6 December 2024. ISBN: 978-0-7961-6069-0

Abstract:Sentiment analysis can aid in understanding people’s opinions and emotions on social issues. In multilingual communities sentiment analysis systems can be used to quickly identify social challenges in social media posts, enabling government departments to detect and address these issues more precisely and effectively. Recently, large-language models (LLMs) have become available to the wide public and initial analyses have shown that they exhibit magnificent zero-shot sentiment analysis abilities in English. However, there is no work that has investigated to leverage LLMs for sentiment analysis on social media posts in South African languages and detect social challenges. Consequently, in this work, we analyse the zero-shot performance of the state-of-the-art LLMs GPT-3.5, GPT-4, LlaMa 2, PaLM 2, and Dolly 2 to investigate the sentiment polarities of the 10 most emerging topics in English, Sepedi and Setswana social media posts that fall within the jurisdictional areas of 10 South African government departments. Our results demonstrate that there are big differences between the various LLMs, topics, and languages. In addition, we show that a fusion of the outcomes of different LLMs provides large gains in sentiment classification performance with sentiment classification errors below 1%. Consequently, it is now feasible to provide systems that generate reliable information about sentiment analysis to detect social challenges and draw conclusions about possible needs for actions on specific topics and within different language groups.
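
The abstract does not say which fusion rule is used to combine the LLM outputs. As one simple possibility, a majority vote over per-model labels could look like the sketch below; the function name, the label strings, and the tie-breaking rule are assumptions for illustration only.

```python
from collections import Counter


def fuse_by_majority(labels, tie_break="neutral"):
    """Return the most frequent label; fall back to tie_break on a tie."""
    if not labels:
        raise ValueError("need at least one label to fuse")
    counts = Counter(labels).most_common()
    # A tie exists when the two most frequent labels have equal counts.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_break
    return counts[0][0]


# Hypothetical per-model predictions for a single post.
votes = ["negative", "negative", "positive", "negative", "neutral"]
fused = fuse_by_majority(votes)
```

A vote like this only helps when the models make partly independent errors, so any gain from fusion would have to be measured on labeled data rather than assumed.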

[NLP-12] Estonian WinoGrande Dataset: Comparative Analysis of LLM Performance on Human and Machine Translation

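[Quick read]: This paper addresses reliable evaluation of commonsense reasoning benchmarks such as WinoGrande in non-English settings, in particular how to achieve high-quality localization and cultural adaptation for a relatively low-resource language like Estonian. The solution has two parts. First, translation specialists translate and culturally adapt the original English test set, preserving meaning and task difficulty. Second, insights from the manual translation process are used to design a detailed prompt that guides machine translation models in handling the linguistic characteristics of Estonian and the translation challenges specific to WinoGrande. Experiments show that prompt engineering offers only limited improvement, and that involving language specialists remains necessary to ensure data quality and obtain trustworthy evaluations of language competency and reasoning.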

Link: https://arxiv.org/abs/2511.17290
Authors: Marii Ojastu, Hele-Andra Kuulmets, Aleksei Dorkin, Marika Borovikova, Dage Särg, Kairit Sirts
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Preprint

Abstract:In this paper, we present a localized and culturally adapted Estonian translation of the test set from the widely used commonsense reasoning benchmark, WinoGrande. We detail the translation and adaptation process carried out by translation specialists and evaluate the performance of both proprietary and open source models on the human translated benchmark. Additionally, we explore the feasibility of achieving high-quality machine translation by incorporating insights from the manual translation process into the design of a detailed prompt. This prompt is specifically tailored to address both the linguistic characteristics of Estonian and the unique translation challenges posed by the WinoGrande dataset. Our findings show that model performance on the human translated Estonian dataset is slightly lower than on the original English test set, while performance on machine-translated data is notably worse. Additionally, our experiments indicate that prompt engineering offers limited improvement in translation quality or model accuracy, and highlight the importance of involving language specialists in dataset translation and adaptation to ensure reliable and interpretable evaluations of language competency and reasoning in large language models.

[NLP-13] Cross-cultural value alignment frameworks for responsible AI governance: Evidence from China-West comparative analysis

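[Quick read]: This paper addresses alignment failures that arise from differing cultural values when Large Language Models (LLMs) are used in high-stakes decision-making worldwide, that is, how to ensure that LLMs remain aligned with diverse cultural values. The core of the solution is a Multi-Layered Auditing Platform for Responsible AI that evaluates cross-cultural value alignment through four integrated methodologies: an Ethical Dilemma Corpus for assessing the temporal stability of value systems, the Diversity-Enhanced Framework (DEF) for quantifying cultural fidelity, First-Token Probability Alignment for distributional accuracy, and the Multi-stAge Reasoning frameworK (MARK) for interpretable decision-making. The study finds that Mistral-series architectures significantly outperform the LLaMA3 series in cross-cultural alignment, and that Full-Parameter Fine-Tuning on diverse datasets preserves cultural variation better than Reinforcement Learning from Human Feedback (RLHF). It also identifies divergent development trajectories and shared challenges among China-origin and Western-origin models.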

Link: https://arxiv.org/abs/2511.17256
Authors: Haijiang Liu, Jinguang Gu, Xun Wu, Daniel Hershcovich, Qiaoling Xiao
Affiliations: Wuhan University of Science and Technology; The Hong Kong University of Science and Technology (Guangzhou); University of Copenhagen; WUST-Madrid Complutense Institute
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: Presented on Academic Conference “Technology for Good: Driving Social Impact” (2025)

Abstract:As Large Language Models (LLMs) increasingly influence high-stakes decision-making across global contexts, ensuring their alignment with diverse cultural values has become a critical governance challenge. This study presents a Multi-Layered Auditing Platform for Responsible AI that systematically evaluates cross-cultural value alignment in China-origin and Western-origin LLMs through four integrated methodologies: Ethical Dilemma Corpus for assessing temporal stability, Diversity-Enhanced Framework (DEF) for quantifying cultural fidelity, First-Token Probability Alignment for distributional accuracy, and Multi-stAge Reasoning frameworK (MARK) for interpretable decision-making. Our comparative analysis of 20+ leading models, such as Qwen, GPT-4o, Claude, LLaMA, and DeepSeek, reveals universal challenges-fundamental instability in value systems, systematic under-representation of younger demographics, and non-linear relationships between model scale and alignment quality-alongside divergent regional development trajectories. While China-origin models increasingly emphasize multilingual data integration for context-specific optimization, Western models demonstrate greater architectural experimentation but persistent U.S.-centric biases. Neither paradigm achieves robust cross-cultural generalization. We establish that Mistral-series architectures significantly outperform LLaMA3-series in cross-cultural alignment, and that Full-Parameter Fine-Tuning on diverse datasets surpasses Reinforcement Learning from Human Feedback in preserving cultural variation…

[NLP-14] Social-Media Based Personas Challenge: Hybrid Prediction of Common and Rare User Actions on Bluesky

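[Quick read]: This paper addresses user behavior prediction on social platforms, especially the prediction of rare but significant actions, whereas existing methods focus mainly on frequent actions such as liking and retweeting. The key to the solution is a hybrid approach that applies a different strategy to each action type: for common actions, a lookup database built from historical response patterns plus persona-specific LightGBM models with temporal and semantic features; for rare actions, a specialized neural architecture that fuses textual and temporal representations; and, for replies, text generation. The approach covers 12 action types and is validated on a large-scale Bluesky dataset, reaching an average macro F1-score of 0.64 on common actions and 0.56 across 10 rare actions, which suggests that different action types call for tailored modeling strategies.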

Link: https://arxiv.org/abs/2511.17241
Authors: Benjamin White, Anastasia Shimorina
Affiliations: Orange Innovation
Subjects: Computation and Language (cs.CL)
Comments: 1st place at SocialSim: Social-Media Based Personas challenge 2025

Abstract:Understanding and predicting user behavior on social media platforms is crucial for content recommendation and platform design. While existing approaches focus primarily on common actions like retweeting and liking, the prediction of rare but significant behaviors remains largely unexplored. This paper presents a hybrid methodology for social media user behavior prediction that addresses both frequent and infrequent actions across a diverse action vocabulary. We evaluate our approach on a large-scale Bluesky dataset containing 6.4 million conversation threads spanning 12 distinct user actions across 25 persona clusters. Our methodology combines four complementary approaches: (i) a lookup database system based on historical response patterns; (ii) persona-specific LightGBM models with engineered temporal and semantic features for common actions; (iii) a specialized hybrid neural architecture fusing textual and temporal representations for rare action classification; and (iv) generation of text replies. Our persona-specific models achieve an average macro F1-score of 0.64 for common action prediction, while our rare action classifier achieves 0.56 macro F1-score across 10 rare actions. These results demonstrate that effective social media behavior prediction requires tailored modeling strategies recognizing fundamental differences between action types. Our approach achieved first place in the SocialSim: Social-Media Based Personas challenge organized at the Social Simulation with LLMs workshop at COLM 2025.
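
Since the results above are reported as macro F1, a short worked example may help show why that metric matters for rare actions. The sketch below computes macro F1 from scratch; the action labels in the example are made up and are not taken from the paper's dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# A classifier that always predicts the common action (hypothetical labels).
y_true = ["like", "like", "like", "quote"]
y_pred = ["like", "like", "like", "like"]
result = macro_f1(y_true, y_pred)
```

Here accuracy is 0.75, but macro F1 is 3/7 (about 0.43) because the rare class contributes an F1 of zero with the same weight as the common class.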

[NLP-15] Lost in Translation and Noise: A Deep Dive into the Failure Modes of VLMs on Real-World Tables

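[Quick read]: This paper addresses the significant gap between the benchmarks used to evaluate vision-language models (VLMs) on tabular question answering (tabular QA) and real-world scenarios. Existing datasets such as WikiTableQuestions and FinQA are mostly monolingual (English) and present tables in an idealized format, so they fail to reflect the multilinguality and visual noise (for example, the blur and skew of scanned documents) common in practice. To address this, the authors propose a new benchmark, MirageTVQA, whose key innovations are that it contains nearly 60,000 QA pairs across 24 languages and introduces realistic visual noise that mimics scanned paper documents, bringing it closer to real use. The benchmark shows that leading VLMs lose more than 35% in performance under visual noise and exhibit a clear English-first bias that hinders transfer to other languages, so it offers an evaluation tool and research direction for more robust table-reasoning VLMs.

Link: https://arxiv.org/abs/2511.17238
Authors: Anshul Singh, Rohan Chaudhary, Gagneet Singh, Abhay Kumary
Affiliations: Indian Institute of Science, Bangalore; Panjab University, Chandigarh
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted as Spotlight Talk at EurIPS 2025 Workshop on AI For Tabular Data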

Abstract:The impressive performance of VLMs is largely measured on benchmarks that fail to capture the complexities of real-world scenarios. Existing datasets for tabular QA, such as WikiTableQuestions and FinQA, are overwhelmingly monolingual (English) and present tables in a digitally perfect, clean format. This creates a significant gap between research and practice. To address this, we present MirageTVQA, a new benchmark designed to evaluate VLMs on these exact dimensions. Featuring nearly 60,000 QA pairs across 24 languages, MirageTVQA challenges models with tables that are not only multilingual but also visually imperfect, incorporating realistic noise to mimic scanned documents. Our evaluation of the leading VLMs reveals two primary failure points: a severe degradation in performance (over 35% drop for the best models) when faced with visual noise and a consistent English-first bias where reasoning abilities fail to transfer to other languages. MirageTVQA provides a benchmark for measuring and driving progress towards more robust VLM models for table reasoning. The dataset and the code are available at: this https URL.

[NLP-16] Parrot: Persuasion and Agreement Robustness Rating of Output Truth – A Sycophancy Robustness Benchmark for LLMs

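[Quick read]: This paper addresses sycophancy (excessive conformity) in large language models (LLMs) under authority cues and social pressure, a behavior that markedly degrades output accuracy. To quantify and analyze it, the authors propose the PARROT (Persuasion and Agreement Robustness Rating of Output Truth) framework, whose key elements are: (1) isolating causal effects by comparing the neutral version of a question with an authoritatively false version under double-blind evaluation; (2) quantifying shifts in the model's confidence in the correct and the imposed false answers with log-likelihood-based calibration tracking; and (3) systematically identifying failure modes (such as robust correct, sycophantic agreement, and reinforced error) with an eight-state behavioral taxonomy. Experiments show that advanced models have low follow rates (≤ 11%) and small accuracy loss, whereas older or smaller models suffer severe epistemic collapse (for example, a 94% follow rate for Qwen 2.5-1.5B); these models not only change their answers but also lose confidence in the correct answer and gain confidence in the wrong one. The study therefore argues that "resistance to overfitting pressure" should be a goal on par with accuracy, harm avoidance, and privacy for the safe real-world deployment of LLMs.

Link: https://arxiv.org/abs/2511.17220
Authors: Yusuf Çelebi, Mahmoud El Hussieni, Özay Ezerceli
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: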

Abstract:This study presents PARROT (Persuasion and Agreement Robustness Rating of Output Truth), a robustness-focused framework designed to measure the degradation in accuracy that occurs under social pressure exerted on users through authority and persuasion in large language models (LLMs), that is, the phenomenon of sycophancy (excessive conformity). PARROT (i) isolates causal effects by comparing the neutral version of the same question with an authoritatively false version using a double-blind evaluation, (ii) quantifies confidence shifts toward the correct and imposed false responses using log-likelihood-based calibration tracking, and (iii) systematically classifies failure modes (e.g., robust correct, sycophantic agreement, reinforced error, stubborn error, self-correction, etc.) using an eight-state behavioral taxonomy. We evaluated 22 models using 1,302 MMLU-style multiple-choice questions across 13 domains and domain-specific authority templates. Findings show marked heterogeneity: advanced models (e.g., GPT-5, GPT-4.1, Claude Sonnet 4.5) exhibit low “follow rates” (≤ 11%, GPT-5: 4%) and minimal accuracy loss, while older/smaller models show severe epistemic collapse (GPT-4: 80%, Qwen 2.5-1.5B: 94%). The danger is not limited to response changes; weak models reduce confidence in the correct response while increasing confidence in the imposed incorrect response. While international law and global knowledge at the domain level exhibit high fragility, elementary mathematics is relatively resilient. Consequently, we argue that the goal of “resistance to overfitting pressure” should be addressed as a primary objective alongside accuracy, harm avoidance, and privacy for safe deployment in the real world.
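To make the "follow rate" and accuracy-loss figures concrete, the sketch below computes both from paired neutral and pressured answers. It rests on assumptions: the record fields (`gold`, `asserted`, `neutral`, `pressured`) are hypothetical names, and the follow rate is taken here to be the share of items where the pressured answer equals the asserted false option. The paper's exact definitions and its eight-state taxonomy may differ.

```python
def follow_rate_and_accuracy_drop(records):
    """Compare answers to the neutral prompt with answers to the authoritative false prompt.

    Each record is a dict with hypothetical keys:
      gold      - the correct option
      asserted  - the false option pushed by the authority template
      neutral   - the model's answer to the neutral question
      pressured - the model's answer under pressure
    """
    n = len(records)
    followed = sum(r["pressured"] == r["asserted"] for r in records)
    acc_neutral = sum(r["neutral"] == r["gold"] for r in records) / n
    acc_pressured = sum(r["pressured"] == r["gold"] for r in records) / n
    return followed / n, acc_neutral - acc_pressured


# Hypothetical toy data: four multiple-choice items.
records = [
    {"gold": "A", "asserted": "B", "neutral": "A", "pressured": "A"},
    {"gold": "C", "asserted": "D", "neutral": "C", "pressured": "D"},
    {"gold": "B", "asserted": "A", "neutral": "B", "pressured": "B"},
    {"gold": "D", "asserted": "C", "neutral": "A", "pressured": "C"},
]
rate, drop = follow_rate_and_accuracy_drop(records)
```

Pairing the two prompts per item is what lets the difference be attributed to the pressure itself rather than to question difficulty.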

[NLP-17] A Simple Yet Strong Baseline for Long-Term Conversational Memory of LLM Agents

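[Quick read]: This paper addresses the difficulty that conversational agents driven by large language models (LLMs) have in maintaining coherent, personalized interaction across many sessions, which stems from fixed context windows and from the trade-off in existing external memory methods between coarse-grained retrieval and fine-grained but fragmented representations. The key to its solution is an event-centric memory that organizes conversational history into short, event-like propositions containing participants, temporal cues, and minimal local context, and arranges sessions, propositions, and their arguments in a heterogeneous graph, so that information is preserved in non-compressive form and can be recalled associatively. This design avoids the aggressive compression or forgetting of past content found in earlier methods and makes stored information more accessible and better structured. Experiments on the LoCoMo and LongMemEval_S benchmarks show that the method matches or surpasses strong baselines while using much shorter QA contexts.

Link: https://arxiv.org/abs/2511.17208
Authors: Sizhe Zhou
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Work in progress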

Abstract:LLM-based conversational agents still struggle to maintain coherent, personalized interaction over many sessions: fixed context windows limit how much history can be kept in view, and most external memory approaches trade off between coarse retrieval over large chunks and fine-grained but fragmented views of the dialogue. Motivated by neo-Davidsonian event semantics, we propose an event-centric alternative that represents conversational history as short, event-like propositions which bundle together participants, temporal cues, and minimal local context, rather than as independent relation triples or opaque summaries. In contrast to work that aggressively compresses or forgets past content, our design aims to preserve information in a non-compressive form and make it more accessible, rather than more lossy. Concretely, we instruct an LLM to decompose each session into enriched elementary discourse units (EDUs) – self-contained statements with normalized entities and source turn attributions – and organize sessions, EDUs, and their arguments in a heterogeneous graph that supports associative recall. On top of this representation we build two simple retrieval-based variants that use dense similarity search and LLM filtering, with an optional graph-based propagation step to connect and aggregate evidence across related EDUs. Experiments on the LoCoMo and LongMemEval_S benchmarks show that these event-centric memories match or surpass strong baselines, while operating with much shorter QA contexts. Our results suggest that structurally simple, event-level memory provides a principled and practical foundation for long-horizon conversational agents. Our code and data will be released at this https URL.

[NLP-18] E3-Pruner: Towards Efficient, Economical, and Effective Layer Pruning for Large Language Models

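[Quick read]: This paper addresses key challenges that layer pruning poses for the practical deployment of large language models (LLMs): performance degradation, high training cost, and limited acceleration. Its core solution is E3-Pruner, a layer pruning framework that is task-effective, training-economical, and inference-efficient. The key innovations are (1) a differentiable mask optimization method based on a Gumbel-TopK sampler, which enables efficient and precise search over pruning masks, and (2) an entropy-aware adaptive knowledge distillation strategy that improves the task performance of the pruned model. Experiments show that the method outperforms state-of-the-art approaches across multiple architectures and benchmarks: pruning 25% of the layers of Qwen3-32B costs only 0.8% accuracy on MATH-500 (96% versus 96.8%), with a 1.33× inference speedup, while consuming only 0.5B training tokens (0.5% of the post-training data volume).

Link: https://arxiv.org/abs/2511.17205
Authors: Tao Yuan, Haoli Bai, Yinfei Pan, Xuyang Cao, Tianyu Zhang, Lu Hou, Ting Hu, Xianzhi Yu
Affiliations: Huawei Technologies; Tsinghua Shenzhen International Graduate School
Categories: Computation and Language (cs.CL)
Comments: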

Abstract:With the increasing size of large language models, layer pruning has gained increased attention as a hardware-friendly approach for model compression. However, existing layer pruning methods struggle to simultaneously address key practical deployment challenges, including performance degradation, high training costs, and limited acceleration. To overcome these limitations, we propose E3-Pruner, a task-Effective, training-Economical and inference-Efficient layer pruning framework. E3-Pruner introduces two key innovations: (1) a differentiable mask optimization method using a Gumbel-TopK sampler, enabling efficient and precise pruning mask search; and (2) an entropy-aware adaptive knowledge distillation strategy that enhances task performance. Extensive experiments over diverse model architectures and benchmarks demonstrate the superiority of our method over state-of-the-art approaches. Notably, E3-Pruner achieves 96% accuracy, a mere 0.8% drop from the original model (96.8%) on MATH-500 when pruning 25% layers of Qwen3-32B, outperforming existing SOTA (95%), with a 1.33× inference speedup by consuming merely 0.5B tokens (0.5% of the post-training data volume).
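The Gumbel-TopK sampler mentioned in the abstract builds on a standard trick: perturb each logit with independent Gumbel noise and keep the k largest, which draws k items without replacement. The NumPy sketch below shows only that sampling step for a layer-keep mask. It is an illustration under assumptions, not the paper's implementation: the function name, the `temperature` parameter, and the per-layer logits are hypothetical, and the differentiable relaxation needed to train the logits is omitted.

```python
import numpy as np


def gumbel_topk_mask(logits, k, rng, temperature=1.0):
    """Sample a 0/1 mask that keeps exactly k entries using the Gumbel-top-k trick."""
    logits = np.asarray(logits, dtype=float)
    # Gumbel(0, 1) noise: -log(-log(U)) with U uniform on (0, 1).
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    scores = logits / temperature + gumbel
    keep = np.argsort(scores)[-k:]  # indices of the k largest perturbed scores
    mask = np.zeros(logits.shape, dtype=int)
    mask[keep] = 1
    return mask


# Hypothetical example: keep 6 of 8 layers (prune 25%) with uninformative logits.
rng = np.random.default_rng(0)
mask = gumbel_topk_mask(np.zeros(8), k=6, rng=rng)
```

With equal logits every subset of size k is equally likely; as training pushes the logits apart, the sampled masks concentrate on the layers with the highest scores.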

[NLP-19] AutoLink: Autonomous Schema Exploration and Expansion for Scalable Schema Linking in Text-to-SQL at Scale

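[Quick read]: This paper targets the schema-linking bottleneck in industrial-scale text-to-SQL: feeding the full schema of a large database to the model runs into context-window limits and heavy noise, and existing methods struggle to balance recall against noise. The core challenge is to select a high-recall schema subset efficiently without supplying the entire schema. The key to the solution is the AutoLink framework, which recasts schema linking as an iterative, agent-driven process guided by a large language model (LLM): it dynamically explores and progressively expands the relevant schema components, identifying the necessary elements without loading the whole database schema in advance. Experiments show that AutoLink reaches strict schema linking recall of 97.4% on Bird-Dev and 91.2% on Spider-2.0-Lite, and maintains high recall, low token consumption, and stable execution accuracy on large schemas (for example, more than 3,000 columns), which demonstrates strong scalability.

Link: https://arxiv.org/abs/2511.17190
Authors: Ziyang Wang, Yuanlei Zheng, Zhenbiao Cao, Xiaojin Zhang, Zhongyu Wei, Pei Fu, Zhenbo Luo, Wei Chen, Xiang Bai
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Databases (cs.DB)
Comments: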

Abstract:For industrial-scale text-to-SQL, supplying the entire database schema to Large Language Models (LLMs) is impractical due to context window limits and irrelevant noise. Schema linking, which filters the schema to a relevant subset, is therefore critical. However, existing methods incur prohibitive costs, struggle to trade off recall and noise, and scale poorly to large databases. We present AutoLink, an autonomous agent framework that reformulates schema linking as an iterative, agent-driven process. Guided by an LLM, AutoLink dynamically explores and expands the linked schema subset, progressively identifying necessary schema components without inputting the full database schema. Our experiments demonstrate AutoLink’s superior performance, achieving state-of-the-art strict schema linking recall of 97.4% on Bird-Dev and 91.2% on Spider-2.0-Lite, with competitive execution accuracy, i.e., 68.7% EX on Bird-Dev (better than CHESS) and 34.9% EX on Spider-2.0-Lite (ranking 2nd on the official leaderboard). Crucially, AutoLink exhibits exceptional scalability, maintaining high recall, efficient token consumption, and robust execution accuracy on large schemas (e.g., over 3,000 columns) where existing methods severely degrade, making it a highly scalable, high-recall schema-linking solution for industrial text-to-SQL systems.
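The headline metric above is strict schema linking recall. A minimal sketch follows, assuming "strict" means that an example counts only when every gold schema element appears in the linked subset; that reading is an assumption rather than a definition quoted from the paper, and the table and column names are hypothetical.

```python
def strict_schema_recall(examples):
    """Fraction of examples whose gold schema elements are ALL in the linked subset.

    `examples` is a list of (gold_elements, linked_elements) pairs, where each
    element is a string such as "table.column".
    """
    hits = sum(set(gold) <= set(linked) for gold, linked in examples)
    return hits / len(examples)


# Hypothetical toy data: the second example misses one gold column.
examples = [
    (["orders.id", "orders.total"], ["orders.id", "orders.total", "users.id"]),
    (["users.id", "users.name"], ["users.id"]),
]
recall = strict_schema_recall(examples)
```

Under this reading, extra linked columns do not reduce recall but do add noise and tokens, which is the trade-off the abstract describes.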

[NLP-20] Attention-Guided Feature Fusion (AGFF) Model for Integrating Statistical and Semantic Features in News Text Classification

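[Quick read]: This paper addresses the problem that, in news text classification, traditional statistical methods struggle to capture contextual semantics while purely deep-learning models may overlook high-impact statistical features. The key to its solution is an Attention-Guided Feature Fusion (AGFF) model, which uses an attention mechanism to dynamically adjust the weights of statistical and semantic features, so that the two feature types complement each other within a unified framework and classification performance improves significantly.

Link: https://arxiv.org/abs/2511.17184
Authors: Mohammad Zare
Affiliations: AI lab at AriooBarzan Engineering Team, Shiraz, Iran
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: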

Abstract:News text classification is a crucial task in natural language processing, essential for organizing and filtering the massive volume of digital content. Traditional methods typically rely on statistical features like term frequencies or TF-IDF values, which are effective at capturing word-level importance but often fail to reflect contextual meaning. In contrast, modern deep learning approaches utilize semantic features to understand word usage within context, yet they may overlook simple, high-impact statistical indicators. This paper introduces an Attention-Guided Feature Fusion (AGFF) model that combines statistical and semantic features in a unified framework. The model applies an attention-based mechanism to dynamically determine the relative importance of each feature type, enabling more informed classification decisions. Through evaluation on benchmark news datasets, the AGFF model demonstrates superior performance compared to both traditional statistical models and purely semantic deep learning models. The results confirm that strategic integration of diverse feature types can significantly enhance classification accuracy. Additionally, ablation studies validate the contribution of each component in the fusion process. The findings highlight the model’s ability to balance and exploit the complementary strengths of statistical and semantic representations, making it a practical and effective solution for real-world news classification tasks.
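The abstract does not spell out the fusion architecture, so the following NumPy sketch shows just one simple way an attention mechanism could weight two feature types: project each into a shared space, score each against a query vector, and mix them with softmax weights. All names (`W_stat`, `W_sem`, `query`) and dimensions are hypothetical, the weights are random rather than trained, and this should not be read as the AGFF model itself.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def attention_fuse(stat_feat, sem_feat, W_stat, W_sem, query):
    """Fuse a statistical and a semantic feature vector with attention weights."""
    h_stat = np.tanh(W_stat @ stat_feat)  # project TF-IDF-like features to d dims
    h_sem = np.tanh(W_sem @ sem_feat)     # project embedding features to d dims
    scores = np.array([query @ h_stat, query @ h_sem])
    weights = softmax(scores)             # relative importance of each feature type
    fused = weights[0] * h_stat + weights[1] * h_sem
    return fused, weights


# Hypothetical shapes: 5 statistical features, 7 semantic features, shared dim 4.
rng = np.random.default_rng(0)
stat_feat, sem_feat = rng.random(5), rng.random(7)
W_stat, W_sem = rng.normal(size=(4, 5)), rng.normal(size=(4, 7))
query = rng.normal(size=4)
fused, weights = attention_fuse(stat_feat, sem_feat, W_stat, W_sem, query)
```

Because the weights depend on the input, the mix can lean on statistical cues for one document and on semantic cues for another, which is the behavior the abstract attributes to the attention mechanism.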

[NLP-21] Hallucinate Less by Thinking More: Aspect-Based Causal Abstention for Large Language Models AAAI2026

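[Quick read]: This paper addresses the tendency of large language models (LLMs) to produce factually incorrect responses (hallucinations), focusing on how to avoid unreliable output proactively, before generation. Existing methods mostly base the abstention decision on post-generation signals (such as generation variation or feedback), which makes early intervention difficult. The key to its solution is an early-abstention framework based on causal inference, Aspect-Based Causal Abstention (ABCA): it analyzes the multifaceted diversity of the model's internal knowledge (aspects such as disciplines, legal contexts, or temporal frames), estimates causal effects conditioned on these aspects to assess the reliability of the knowledge relevant to a query, and on that basis supports two kinds of early abstention: Type-1 (knowledge conflict across aspects) and Type-2 (aspects consistently support abstention, reflecting insufficient knowledge). The method substantially improves the reliability and interpretability of abstention decisions.

Link: https://arxiv.org/abs/2511.17170
Authors: Vy Nguyen, Ziqi Xu, Jeffrey Chan, Estrid He, Feng Xia, Xiuzhen Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to AAAI 2026 (Main Technical Track)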

Abstract:Large Language Models (LLMs) often produce fluent but factually incorrect responses, a phenomenon known as hallucination. Abstention, where the model chooses not to answer and instead outputs phrases such as “I don’t know”, is a common safeguard. However, existing abstention methods typically rely on post-generation signals, such as generation variations or feedback, which limits their ability to prevent unreliable responses in advance. In this paper, we introduce Aspect-Based Causal Abstention (ABCA), a new framework that enables early abstention by analysing the internal diversity of LLM knowledge through causal inference. This diversity reflects the multifaceted nature of parametric knowledge acquired from various sources, representing diverse aspects such as disciplines, legal contexts, or temporal frames. ABCA estimates causal effects conditioned on these aspects to assess the reliability of knowledge relevant to a given query. Based on these estimates, we enable two types of abstention: Type-1, where aspect effects are inconsistent (knowledge conflict), and Type-2, where aspect effects consistently support abstention (knowledge insufficiency). Experiments on standard benchmarks demonstrate that ABCA improves abstention reliability, achieves state-of-the-art performance, and enhances the interpretability of abstention decisions.

[NLP-22] The PLLuM Instruction Corpus

[Quick read]: This paper addresses how to build and use an instruction dataset for fine-tuning transformer-based large language models (LLMs) so as to improve their adaptation to a specific language (Polish). The key to its solution is a functional typology that divides instruction data into organic, converted, and synthetic instructions, together with the release of PLLuMIC (Polish Large Language Model Instruction Corpus), the first representative subset, as a demonstration of the approach. It also discusses how human-authored and synthetic instruction datasets differ in their effect on linguistic adaptation, offering a framework and practical basis that instruction-dataset efforts for other language models can draw on.

Link: https://arxiv.org/abs/2511.17161
Authors: Piotr Pęzik, Filip Żarnecki, Konrad Kaczyński, Anna Cichosz, Zuzanna Deckert, Monika Garnys, Izabela Grabarczyk, Wojciech Janowski, Sylwia Karasińska, Aleksandra Kujawiak, Piotr Misztela, Maria Szymańska, Karolina Walkusz, Igor Siek, Maciej Chrabąszcz, Anna Kołos, Agnieszka Karlińska, Karolina Seweryn, Aleksandra Krasnodębska, Paula Betscher, Zofia Cieślińska, Katarzyna Kowol, Artur Wilczek, Maciej Trzciński, Katarzyna Dziewulska, Roman Roszko, Tomasz Bernaś, Jurgita Vaičenonienė, Danuta Roszko, Paweł Levchuk, Paweł Kowalski, Irena Prawdzic-Jankowska, Marek Kozłowski, Sławomir Dadas, Rafał Poświata, Alina Wróblewska, Katarzyna Krasnowska-Kieraś, Maciej Ogrodniczuk, Michał Rudolf, Piotr Rybak, Karolina Saputa, Joanna Wołoszyn, Marcin Oleksy, Bartłomiej Koptyra, Teddy Ferdinan, Stanisław Woźniak, Maciej Piasecki, Paweł Walkowiak, Konrad Wojtasik, Arkadiusz Janz, Przemysław Kazienko, Julia Moska, Jan Kocoń
Affiliations: University of Lodz; NASK National Research Institute; Institute of Slavic Studies PAS; National Information Processing Institute; Institute of Computer Science PAS; Wroclaw Tech
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper describes the instruction dataset used to fine-tune a set of transformer-based large language models (LLMs) developed in the PLLuM (Polish Large Language Model) project. We present a functional typology of the organic, converted, and synthetic instructions used in PLLuM and share some observations about the implications of using human-authored versus synthetic instruction datasets in the linguistic adaptation of base LLMs. Additionally, we release the first representative subset of the PLLuM instruction corpus (PLLuMIC), which we believe to be useful in guiding and planning the development of similar datasets for other LLMs.

[NLP-23] LangMark: A Multilingual Dataset for Automatic Post-Editing ACL2025

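[Quick read]: This paper addresses the fact that progress in automatic post-editing (APE) has been held back by the lack of large-scale, multilingual annotated datasets designed specifically for neural machine translation (NMT) output. The key to its solution is building and releasing LangMark, a multilingual APE dataset annotated by expert human linguists that covers translation from English into seven languages (Brazilian Portuguese, French, German, Italian, Japanese, Russian, and Spanish) with 206,983 triplets (source segment, NMT output, and human post-edited translation), offering both linguistic diversity and scale. Using this dataset, the study further shows that large language models (LLMs) with few-shot prompting can perform APE effectively and improve upon leading commercial and even proprietary machine translation systems.

Link: https://arxiv.org/abs/2511.17153
Authors: Diego Velazquez, Mikaela Grace, Konstantinos Karageorgos, Lawrence Carin, Aaron Schliem, Dimitrios Zaikis, Roger Wechsler
Affiliations: Welocalize; Duke University
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 8 figures, ACL 2025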

Abstract:Automatic post-editing (APE) aims to correct errors in machine-translated text, enhancing translation quality, while reducing the need for human intervention. Despite advances in neural machine translation (NMT), the development of effective APE systems has been hindered by the lack of large-scale multilingual datasets specifically tailored to NMT outputs. To address this gap, we present and release LangMark, a new human-annotated multilingual APE dataset for English translation to seven languages: Brazilian Portuguese, French, German, Italian, Japanese, Russian, and Spanish. The dataset has 206,983 triplets, with each triplet consisting of a source segment, its NMT output, and a human post-edited translation. Annotated by expert human linguists, our dataset offers both linguistic diversity and scale. Leveraging this dataset, we empirically show that Large Language Models (LLMs) with few-shot prompting can effectively perform APE, improving upon leading commercial and even proprietary machine translation systems. We believe that this new resource will facilitate the future development and evaluation of APE systems.

[NLP-24] Learning to Compress: Unlocking the Potential of Large Language Models for Text Representation AAAI’26

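[Quick read]: This paper addresses the weak performance of large language models (LLMs) on text representation tasks: because they are designed for causal modeling and next-token prediction, they struggle to produce holistic, compact text representations. The key to its solution is to use context compression as a pretext task: the model learns to replace the original input sequence with a small number of memory tokens, which improves the LLM's text representations without supervision. Experiments show that training with this compression objective significantly outperforms methods based on token-level prediction (such as masked next-token prediction, MNTP), and that adding contrastive learning improves performance further, yielding an efficient and strong text encoder (LLM2Comp) that surpasses existing methods on a range of downstream tasks while being more sample-efficient.

Link: https://arxiv.org/abs/2511.17129
Authors: Yeqin Zhang, Yizheng Zhao, Chen Hu, Binxing Jiao, Daxin Jiang, Ruihang Miao, Cam-Tu Nguyen
Affiliations: Tsinghua University; Peking University; Alibaba Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI’26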

Abstract:Text representation plays a critical role in tasks like clustering, retrieval, and other downstream applications. With the emergence of large language models (LLMs), there is increasing interest in harnessing their capabilities for this purpose. However, most of the LLMs are inherently causal and optimized for next-token prediction, making them suboptimal for producing holistic representations. To address this, recent studies introduced pretext tasks to adapt LLMs for text representation. Most of these tasks, however, rely on token-level prediction objectives, such as the masked next-token prediction (MNTP) used in LLM2Vec. In this work, we explore the untapped potential of context compression as a pretext task for unsupervised adaptation of LLMs. During compression pre-training, the model learns to generate compact memory tokens, which substitute the whole context for downstream sequence prediction. Experiments demonstrate that a well-designed compression objective can significantly enhance LLM-based text representations, outperforming models trained with token-level pretext tasks. Further improvements through contrastive learning produce a strong representation model (LLM2Comp) that outperforms contemporary LLM-based text encoders on a wide range of tasks while being more sample-efficient, requiring significantly less training data.

[NLP-25] Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design

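[Quick read]: This paper addresses the systems and model-design challenges of large-scale mixture-of-experts (MoE) pretraining on pure AMD hardware, with particular attention to using MI300X GPUs and the Pollara interconnect efficiently for high-performance training. Its key contributions are as follows. On the systems side, it provides microbenchmarks at scale for all core collectives (all-reduce, reduce-scatter, all-gather, broadcast) on Pollara, along with MI300X microbenchmarks on kernel sizing and memory bandwidth to guide model design. On the modeling side, it proposes MI300X-aware transformer sizing rules (for attention and MLP blocks) and determines MoE widths that jointly optimize training throughput and inference latency. The paper also details a training stack that includes practical utilities such as fault tolerance and checkpoint reshaping, and uses the ZAYA1-base model (760M active, 8.3B total parameters) to show that the AMD hardware and software stack can support competitive large-scale pretraining.

Link: https://arxiv.org/abs/2511.17127
Authors: Quentin Anthony, Yury Tokpanov, Skyler Szot, Srivatsan Rajagopal, Praneeth Medepalli, Rishi Iyer, Vasu Shyam, Anna Golubeva, Ansh Chaurasia, Xiao Yang, Tomas Figliolia, Robert Washbourne, Drew Thorstensen, Amartey Pearson, Zack Grossbart, Jason van Patten, Emad Barsoum, Zhenyu Gu, Yao Fu, Beren Millidge
Affiliations: Google; Meta; Microsoft
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: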

Abstract:We report on the first large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware, utilizing MI300X GPUs with Pollara interconnect. We distill practical guidance for both systems and model design. On the systems side, we deliver a comprehensive cluster and networking characterization: microbenchmarks for all core collectives (all-reduce, reduce-scatter, all-gather, broadcast) across message sizes and GPU counts on Pollara. To our knowledge, this is the first at this scale. We further provide MI300X microbenchmarks on kernel sizing and memory bandwidth to inform model design. On the modeling side, we introduce and apply MI300X-aware transformer sizing rules for attention and MLP blocks and justify MoE widths that jointly optimize training throughput and inference latency. We describe our training stack in depth, including often-ignored utilities such as fault-tolerance and checkpoint-reshaping, as well as detailed information on our training recipe. We also provide a preview of our model architecture and base model - ZAYA1 (760M active, 8.3B total parameters MoE) - which will be further improved upon in forthcoming papers. ZAYA1-base achieves performance comparable to leading base models such as Qwen3-4B and Gemma3-12B at its scale and larger, and outperforms models including Llama-3-8B and OLMoE across reasoning, mathematics, and coding benchmarks. Together, these results demonstrate that the AMD hardware, network, and software stack are mature and optimized enough for competitive large-scale pretraining.

[NLP-26] Geometric-Disentangelment Unlearning

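[Quick read]: This paper addresses a core difficulty in machine unlearning: how to remove the influence of specific training samples effectively while minimizing damage to performance on the retain set. Existing methods usually trade forgetting effectiveness against preservation of retained knowledge, and lack both a theoretical analysis of how forgetting updates harm retain performance and provable mitigation. The key to its solution is a first-principles, first-order analysis of the local change in the retain loss, which yields an equivalent condition for being "retain-invariant": the retain loss is unchanged to first order exactly when the update direction is orthogonal to the subspace spanned by the retain gradients. On this basis the authors design Geometric-disentanglement Unlearning (GU), which decomposes any forget gradient update into components tangential and normal to the retain space and executes only the normal component, so that the retain space is not perturbed. Under this constraint, the projected direction is optimal among all first-order retain-invariant updates, which gives a theoretically grounded balance between forgetting and retention.

Link: https://arxiv.org/abs/2511.17100
Authors: Duo Zhou, Yuji Zhang, Tianxin Wei, Ruizhong Qiu, Ke Yang, Xiao Lin, Cheng Qian, Jingrui He, Hanghang Tong, Heng Ji, Huan Zhang
Affiliations: University of Illinois Urbana-Champaign
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 27 Pages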

Abstract:Machine unlearning, the removal of a training subset’s influence from a deployed model, is critical for privacy preservation and model reliability, yet gradient ascent on forget samples often harms retained knowledge. Existing approaches face a persistent tradeoff between effective forgetting and preservation on the retain set. While previous methods provide useful heuristics, they often lack a formal analysis on how exactly forgetting updates harm retained knowledge, and whether the side effects can be removed with theoretical guarantees. To explore a theoretically sound and simple solution, we start from the first principle on how performance on the retain set is actually affected: a first-order analysis of the local change of the retain loss under small parameter updates during model training. We start from a crisp equivalence: the retain loss is unchanged to first order iff the update direction is orthogonal to the subspace spanned by retain gradients (“retain-invariant”). This identifies the entangled component as the tangential part of forget update within the retain-gradient subspace, and characterizes disentanglement as orthogonality. Guided by this, we propose the Geometric-disentanglement Unlearning (GU) that decomposes any candidate forget gradient update into tangential and normal components to retain space and executes only the normal component. Under a standard trust-region budget, the projected direction aligned with the raw forget gradient is optimal among all first-order retain-invariant moves, and we also derive the optimal projected direction for joint forget-retain updating objectives. Our method is plug-and-play and can be attached to existing gradient-based unlearning procedures to mitigate side effects. GU achieves consistent improvement on various methods across three benchmarks TOFU, MUSE, and WMDP.
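The retain-invariance condition above has a direct linear-algebra reading: project the forget gradient onto the span of the retain gradients, then discard that tangential part. The NumPy sketch below illustrates only this decomposition on flat vectors. It assumes gradients are available as small dense arrays, uses a least-squares projection as one convenient way to compute the split, and is not the authors' implementation (which would have to handle full model parameters and the trust-region step).

```python
import numpy as np


def retain_invariant_update(forget_grad, retain_grads, rcond=1e-10):
    """Split forget_grad into parts inside and orthogonal to span(retain_grads).

    Returns (normal, tangential); stepping along `normal` leaves the retain
    loss unchanged to first order, since it is orthogonal to every retain gradient.
    """
    G = np.atleast_2d(np.asarray(retain_grads, dtype=float))  # shape (k, d)
    g = np.asarray(forget_grad, dtype=float)                  # shape (d,)
    # Least-squares coefficients of g in the retain-gradient basis.
    coeffs, *_ = np.linalg.lstsq(G.T, g, rcond=rcond)
    tangential = G.T @ coeffs  # orthogonal projection onto the retain subspace
    normal = g - tangential    # component that is executed
    return normal, tangential


# Toy example in 3 dimensions with two retain gradients.
retain_grads = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
forget_grad = [1.0, 2.0, 3.0]
normal, tangential = retain_invariant_update(forget_grad, retain_grads)
```

Here the first two coordinates of the forget gradient lie in the retain subspace and are dropped, so only the third coordinate survives in the executed update.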

[NLP-27] MUCH: A Multilingual Claim Hallucination Benchmark

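[Quick read]: This paper addresses the unreliability of factual output from large language models (LLMs), and specifically the lack of adequate evaluation for claim-level uncertainty quantification (UQ) in generated content. The key to its solution is the MUCH benchmark, the first claim-level UQ benchmark designed for fair and reproducible evaluation under realistic deployment conditions. It contains 4,873 samples across four European languages (English, French, Spanish, and German) and releases 24 generation logits per token, which supports the development of white-box methods without regenerating data. It also introduces a deterministic segmentation algorithm that needs only 0.2% of the LLM generation time, enabling low-latency, real-time claim segmentation and keeping evaluation conditions close to real use.

Link: https://arxiv.org/abs/2511.17081
Authors: Jérémie Dentan, Alexi Canesse, Davide Buscaldi, Aymen Shabou, Sonia Vanier
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: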

Abstract:Claim-level Uncertainty Quantification (UQ) is a promising approach to mitigate the lack of reliability in Large Language Models (LLMs). We introduce MUCH, the first claim-level UQ benchmark designed for fair and reproducible evaluation of future methods under realistic conditions. It includes 4,873 samples across four European languages (English, French, Spanish, and German) and four instruction-tuned open-weight LLMs. Unlike prior claim-level benchmarks, we release 24 generation logits per token, facilitating the development of future white-box methods without re-generating data. Moreover, in contrast to previous benchmarks that rely on manual or LLM-based segmentation, we propose a new deterministic algorithm capable of segmenting claims using as little as 0.2% of the LLM generation time. This makes our segmentation approach suitable for real-time monitoring of LLM outputs, ensuring that MUCH evaluates UQ methods under realistic deployment constraints. Finally, our evaluations show that current methods still have substantial room for improvement in both performance and efficiency.

[NLP-28] An Efficient Computational Framework for Discrete Fuzzy Numbers Based on Total Orders

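[Quick read]: This paper addresses how to efficiently compute the $\mathit{pos}$ function, which determines the position of a discrete fuzzy number, and its inverse, in the construction of total (admissible) orders on discrete fuzzy numbers over the finite chain $L_n = \{0, \ldots, n\}$. This function is the central tool for building logical connectives such as aggregation and implication operations. The key to the solution is a set of algorithms that exploit the combinatorial structure of total orders to compute the $\mathit{pos}$ function and its inverse exactly, with complexity $\mathcal{O}(n^2 m \log n)$, where the dominant factor is the number of membership levels $m$. This ensures good scalability with respect to the granularity of membership values, substantially reduces computational cost, and supports efficient algebraic operations on the set of discrete fuzzy numbers.

Link: https://arxiv.org/abs/2511.17080
Authors: Arnau Mir, Alejandro Mus, Juan Vicente Riera
Affiliations: University of the Balearic Islands
Categories: Logic in Computer Science (cs.LO); Computation and Language (cs.CL); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
Comments: 19 pages, 2 figures. Submitted to Computational and Applied Mathematics (Springer)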

Abstract:Discrete fuzzy numbers, and in particular those defined over a finite chain $L_n = \{0, \ldots, n\}$, have been effectively employed to represent linguistic information within the framework of fuzzy systems. Research on total (admissible) orderings of such types of fuzzy subsets, and specifically those belonging to the set $\mathcal{D}_1^{L_n \rightarrow Y_m}$ consisting of discrete fuzzy numbers $A$ whose support is a closed subinterval of the finite chain $L_n = \{0, 1, \ldots, n\}$ and whose membership values $A(x)$, for $x \in L_n$, belong to the set $Y_m = \{0 = y_1 < y_2 < \cdots < y_{m-1} < y_m = 1\}$, has facilitated the development of new methods for constructing logical connectives, based on a bijective function, called the $\mathit{pos}$ function, that determines the position of each $A \in \mathcal{D}_1^{L_n \rightarrow Y_m}$. For this reason, in this work we revisit the problem by introducing algorithms that exploit the combinatorial structure of total (admissible) orders to compute the $\mathit{pos}$ function and its inverse with exactness. The proposed approach achieves a complexity of $\mathcal{O}(n^2 m \log n)$, which is quadratic in the size of the underlying chain ($n$) and linear in the number of membership levels ($m$). The key point is that the dominant factor is $m$, ensuring scalability with respect to the granularity of membership values. The results demonstrate that this formulation substantially reduces computational cost and enables the efficient implementation of algebraic operations – such as aggregation and implication – on the set of discrete fuzzy numbers.

[NLP-29] Principled Design of Interpretable Automated Scoring for Large-Scale Educational Assessments

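[Quick read]: This paper addresses the lack of transparency and interpretability of generative AI in automated scoring systems, in particular how to achieve scoring that is both accurate and interpretable in large-scale real-world assessments. The key to its solution is a set of four interpretability principles aimed at assessment stakeholders (Faithfulness, Groundedness, Traceability, and Interchangeability, abbreviated FGTI) and, built on them, the AnalyticScore framework, which explicitly extracts identifiable elements from short answers, uses large language models (LLMs) to turn each response into human-interpretable feature values, and scores with an intuitive ordinal logistic regression model. The framework outperforms many uninterpretable scoring methods in accuracy, its average QWK on the ASAP-SAS dataset is only 0.06 below the uninterpretable state of the art, and its featurization behavior aligns well with that of human annotators, combining high accuracy with strong interpretability.

Link: https://arxiv.org/abs/2511.17069
Authors: Yunsung Kim, Mike Hardy, Joseph Tey, Candace Thille, Chris Piech
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 2 figures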

Abstract:AI-driven automated scoring systems offer scalable and efficient means of evaluating complex student-generated responses. Yet, despite increasing demand for transparency and interpretability, the field has yet to develop a widely accepted solution for interpretable automated scoring to be used in large-scale real-world assessments. This work takes a principled approach to address this challenge. We analyze the needs and potential benefits of interpretable automated scoring for various assessment stakeholders and develop four principles of interpretability – Faithfulness, Groundedness, Traceability, and Interchangeability (FGTI) – targeted at those needs. To illustrate the feasibility of implementing these principles, we develop the AnalyticScore framework for short answer scoring as a baseline reference framework for future research. AnalyticScore operates by (1) extracting explicitly identifiable elements of the responses, (2) featurizing each response into human-interpretable values using LLMs, and (3) applying an intuitive ordinal logistic regression model for scoring. In terms of scoring accuracy, AnalyticScore outperforms many uninterpretable scoring methods, and is within only 0.06 QWK of the uninterpretable SOTA on average across 10 items from the ASAP-SAS dataset. By comparing against human annotators conducting the same featurization task, we further demonstrate that the featurization behavior of AnalyticScore aligns well with that of humans.
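Scoring accuracy in the abstract is reported in QWK (quadratic weighted kappa), an agreement statistic for ordinal scores that penalizes disagreements by the squared distance between the two scores. The NumPy sketch below implements the standard formula and is not code from the paper; the toy score vectors are hypothetical, and scores are assumed to be integers in 0..num_classes-1.

```python
import numpy as np


def quadratic_weighted_kappa(rater_a, rater_b, num_classes):
    """Quadratic weighted kappa between two integer score vectors.

    Undefined (division by zero) if both raters give every item the same score.
    """
    a = np.asarray(rater_a, dtype=int)
    b = np.asarray(rater_b, dtype=int)
    observed = np.zeros((num_classes, num_classes))
    np.add.at(observed, (a, b), 1)  # confusion matrix of score pairs
    # Expected counts if the two raters were independent with the same marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / len(a)
    idx = np.arange(num_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (num_classes - 1) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()


# Hypothetical toy scores on a 0-2 scale.
human = [0, 1, 2, 2]
model = [0, 1, 2, 2]
qwk = quadratic_weighted_kappa(human, model, num_classes=3)
```

A value of 1 means perfect agreement, 0 means chance-level agreement, and negative values mean systematic disagreement, so a 0.06 gap is small on this scale.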

[NLP-30] Do Vision-Language Models Understand Visual Persuasiveness? NEURIPS2025

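[Quick read]: This paper asks whether vision-language models (VLMs) truly understand visual persuasion, that is, whether they can accurately identify how visual cues shape human attitudes and decisions. The key to its approach is a high-consensus dataset for binary persuasiveness judgment together with a taxonomy of Visual Persuasive Factors (VPFs) covering low-level perceptual, mid-level compositional, and high-level semantic cues. It also explores cognitive steering and knowledge injection strategies to strengthen persuasion-relevant reasoning. The empirical analysis shows that VLMs have a recall-oriented bias (they over-predict high persuasiveness) and discriminate poorly on low- and mid-level features, whereas high-level semantic alignment between the message and object presence is the strongest predictor of human judgment. Concise, object-grounded rationales significantly improve precision and F1, which indicates that the models' core limitation lies not in recognizing persuasive objects but in linking them to communicative intent.

Link: https://arxiv.org/abs/2511.17036
Authors: Gyuwon Park
Affiliations: UNIST
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages (except for reference and appendix), 5 figures, 7 tables, to be published in NeurIPS 2025 Workshop: VLM4RWD

Click to view abstract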

Abstract:Recent advances in vision-language models (VLMs) have enabled impressive multi-modal reasoning and understanding. Yet, whether these models truly grasp visual persuasion (how visual cues shape human attitudes and decisions) remains unclear. To probe this question, we construct a high-consensus dataset for binary persuasiveness judgment and introduce the taxonomy of Visual Persuasive Factors (VPFs), encompassing low-level perceptual, mid-level compositional, and high-level semantic cues. We also explore cognitive steering and knowledge injection strategies for persuasion-relevant reasoning. Empirical analysis across VLMs reveals a recall-oriented bias (models over-predict high persuasiveness) and weak discriminative power for low/mid-level features. In contrast, high-level semantic alignment between message and object presence emerges as the strongest predictor of human judgment. Among intervention strategies, simple instruction or unguided reasoning scaffolds yield marginal or negative effects, whereas concise, object-grounded rationales significantly improve precision and F1 scores. These results indicate that VLMs' core limitation lies not in recognizing persuasive objects but in linking them to communicative intent.

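[NLP-31] Supervised Fine Tuning of Large Language Models for Domain Specific Knowledge Graph Construction: A Case Study on Hunan's Historical Celebrities

[Quick read]: This paper addresses the weak performance of general-purpose large language models at extracting knowledge about regional historical figures and producing structured output in low-resource settings, with a focus on the scarce data and limited domain coverage for Hunan's modern historical celebrities. The key to its solution is a supervised fine-tuning approach. First, it designs a fine-grained, schema-guided instruction template and builds a domain-specific instruction-tuning dataset to compensate for the missing training corpus. Second, it applies parameter-efficient fine-tuning to four publicly available large language models (Qwen2.5-7B, Qwen3-8B, DeepSeek-R1-Distill-Qwen-7B, and Llama-3.1-8B-Instruct), which substantially improves extraction of biographical attributes, life events, and social relationships. All models improve after fine-tuning. Qwen3-8B achieves the highest score, 89.3866, with 100 samples and 50 training iterations, which supports the method's effectiveness and cost efficiency for building knowledge graphs of regional history and culture.

Link: https://arxiv.org/abs/2511.17012
Authors: Junjie Hao,Chun Wang,Ying Qiao,Qiuyue Zuo,Qiya Song,Hua Ma,Xieping Gao
Affiliations: Hunan Normal University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract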

Abstract:Large language models and knowledge graphs offer strong potential for advancing research on historical culture by supporting the extraction, analysis, and interpretation of cultural heritage. Using Hunan’s modern historical celebrities shaped by Huxiang culture as a case study, pre-trained large models can help researchers efficiently extract key information, including biographical attributes, life events, and social relationships, from textual sources and construct structured knowledge graphs. However, systematic data resources for Hunan’s historical celebrities remain limited, and general-purpose models often underperform in domain knowledge extraction and structured output generation in such low-resource settings. To address these issues, this study proposes a supervised fine-tuning approach for enhancing domain-specific information extraction. First, we design a fine-grained, schema-guided instruction template tailored to the Hunan historical celebrities domain and build an instruction-tuning dataset to mitigate the lack of domain-specific training corpora. Second, we apply parameter-efficient instruction fine-tuning to four publicly available large language models - Qwen2.5-7B, Qwen3-8B, DeepSeek-R1-Distill-Qwen-7B, and Llama-3.1-8B-Instruct - and develop evaluation criteria for assessing their extraction performance. Experimental results show that all models exhibit substantial performance gains after fine-tuning. Among them, Qwen3-8B achieves the strongest results, reaching a score of 89.3866 with 100 samples and 50 training iterations. This study provides new insights into fine-tuning vertical large language models for regional historical and cultural domains and highlights their potential for cost-effective applications in cultural heritage knowledge extraction and knowledge graph construction.

[NLP-32] Vision Language Models are Confused Tourists

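[Quick read]: This paper addresses the instability of current vision-language models (VLMs) when multiple cultural cues appear together, focusing on their lack of adversarial robustness along the cultural dimension. Existing evaluations usually rely on images that contain a single cultural concept and overlook real-world cases in which several unrelated cultural cues coexist. To fill this gap, the authors introduce ConfusedTourist, a cultural adversarial robustness suite that tests models with image-stacking perturbations and image-generation-based perturbations. The key findings are that VLM accuracy drops sharply on culturally mixed inputs, and that interpretability analyses trace the failures to attention being systematically drawn toward distracting cultural cues and away from the intended focus. This underscores the need for more culturally robust multimodal understanding.

Link: https://arxiv.org/abs/2511.17004
Authors: Patrick Amadeus Irawan,Ikhlasul Akmal Hanif,Muhammad Dehan Al Kautsar,Genta Indra Winata,Fajri Koto,Alham Fikri Aji
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Click to view abstract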

Abstract:Although the cultural dimension has been one of the key aspects in evaluating Vision-Language Models (VLMs), their ability to remain stable across diverse cultural inputs remains largely untested, despite being crucial to support diversity and multicultural societies. Existing evaluations often rely on benchmarks featuring only a singular cultural concept per image, overlooking scenarios where multiple, potentially unrelated cultural cues coexist. To address this gap, we introduce ConfusedTourist, a novel cultural adversarial robustness suite designed to assess VLMs’ stability against perturbed geographical cues. Our experiments reveal a critical vulnerability, where accuracy drops heavily under simple image-stacking perturbations and even worsens with its image-generation-based variant. Interpretability analyses further show that these failures stem from systematic attention shifts toward distracting cues, diverting the model from its intended focus. These findings highlight a critical challenge: visual cultural concept mixing can substantially impair even state-of-the-art VLMs, underscoring the urgent need for more culturally robust multimodal understanding.

[NLP-33] ARQUSUMM: Argument-aware Quantitative Summarization of Online Conversations AAAI2026

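[Quick read]: This paper addresses the inability of existing summarization methods to reveal the claim-reason structure of arguments in online conversations and to quantify argument strength. Traditional text summarization captures only salient information, and existing conversation summarization work considers argumentative relations between sentences but does not expose the argument structure within sentences. The key to the solution is a new task, argument-aware quantitative summarization, and the ARQUSUMM framework. It first uses LLM few-shot learning grounded in argumentation theory to identify the propositions within sentences and their claim-reason relationships. It then applies argument structure-aware clustering to aggregate similar arguments and quantify their support, producing summaries that reflect argument structure and carry accurate strength measures.

Link: https://arxiv.org/abs/2511.16985
Authors: An Quang Tang,Xiuzhen Zhang,Minh Ngoc Dinh,Zhuang Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Paper accepted to AAAI2026 Main Technical Track

Click to view abstract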

Abstract:Online conversations have become more prevalent on public discussion platforms (e.g. Reddit). With growing controversial topics, it is desirable to summarize not only diverse arguments, but also their rationale and justification. Early studies on text summarization focus on capturing general salient information in source documents, overlooking the argumentative nature of online conversations. Although recent research on conversation summarization considers the argumentative relationships among sentences, it fails to explicate the deeper argument structure within sentences for summarization. In this paper, we propose a novel task of argument-aware quantitative summarization to reveal the claim-reason structure of arguments in conversations, with quantities measuring argument strength. We further propose ARQUSUMM, a novel framework to address the task. To reveal the underlying argument structure within sentences, ARQUSUMM leverages LLM few-shot learning grounded in argumentation theory to identify propositions within sentences and their claim-reason relationships. For quantitative summarization, ARQUSUMM employs argument structure-aware clustering algorithms to aggregate arguments and quantify their support. Experiments show that ARQUSUMM outperforms existing conversation and quantitative summarization models and generates summaries representing argument structures that are more helpful to users, with high textual quality and quantification accuracy.

[NLP-34] OmniScientist: Toward a Co-evolving Ecosystem of Human and AI Scientists

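[Quick read]: This paper addresses the fact that current "AI Scientists" ignore the social and collaborative nature of scientific research. Existing systems reduce scientific discovery to a standalone search or optimization problem and do not model collaboration mechanisms, contribution attribution, peer review, or structured knowledge networks, so they struggle to form a genuine research ecosystem or to engage deeply with the human scientific community. The key to the solution is the OmniScientist framework, which explicitly encodes the mechanisms of human research into the AI scientific workflow. Its core components are: (1) a structured knowledge system built on citation networks and conceptual correlations; (2) a collaborative research protocol (OSP) that supports seamless collaboration among multiple agents and human researchers; and (3) an open evaluation platform (ScienceArena) based on blind pairwise user voting and Elo rankings. Together these let AI agents understand, collaborate with, and co-evolve with human knowledge systems, in support of a sustainable and scalable innovation ecosystem.

Link: https://arxiv.org/abs/2511.16931
Authors: Chenyang Shao,Dehao Huang,Yu Li,Keyu Zhao,Weiquan Lin,Yining Zhang,Qingbin Zeng,Zhiyu Chen,Tianxing Li,Yifei Huang,Taozhong Wu,Xinyang Liu,Ruotong Zhao,Mengsheng Zhao,Xuhua Zhang,Yue Wang,Yuanyi Zhen,Fengli Xu,Yong Li,Tie-Yan Liu
Affiliations: Tsinghua University; Zhongguancun Academy
Categories: Computers and Society (cs.CY); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
Comments:

Click to view abstract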

Abstract:With the rapid development of Large Language Models (LLMs), AI agents have demonstrated increasing proficiency in scientific tasks, ranging from hypothesis generation and experimental design to manuscript writing. Such agent systems are commonly referred to as “AI Scientists.” However, existing AI Scientists predominantly formulate scientific discovery as a standalone search or optimization problem, overlooking the fact that scientific research is inherently a social and collaborative endeavor. Real-world science relies on a complex scientific infrastructure composed of collaborative mechanisms, contribution attribution, peer review, and structured scientific knowledge networks. Due to the lack of modeling for these critical dimensions, current systems struggle to establish a genuine research ecosystem or interact deeply with the human scientific community. To bridge this gap, we introduce OmniScientist, a framework that explicitly encodes the underlying mechanisms of human research into the AI scientific workflow. OmniScientist not only achieves end-to-end automation across data foundation, literature review, research ideation, experiment automation, scientific writing, and peer review, but also provides comprehensive infrastructural support by simulating the human scientific system, comprising: (1) a structured knowledge system built upon citation networks and conceptual correlations; (2) a collaborative research protocol (OSP), which enables seamless multi-agent collaboration and human researcher participation; and (3) an open evaluation platform (ScienceArena) based on blind pairwise user voting and Elo rankings. This infrastructure empowers agents to not only comprehend and leverage human knowledge systems but also to collaborate and co-evolve, fostering a sustainable and scalable innovation ecosystem.
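To make the "blind pairwise user voting and Elo rankings" idea concrete, here is a minimal sketch of the standard Elo update applied to a list of pairwise votes. The agent names, the starting rating of 1000, and the K-factor of 32 are illustrative assumptions; the abstract does not say how ScienceArena configures its ratings.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0):
    """Shift rating points from loser to winner in proportion to the surprise."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical agents and votes; each vote is (preferred, rejected).
ratings = {"agent_a": 1000.0, "agent_b": 1000.0, "agent_c": 1000.0}
votes = [("agent_a", "agent_b"), ("agent_a", "agent_c"), ("agent_b", "agent_c")]
for winner, loser in votes:
    update_elo(ratings, winner, loser)

ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

Because each update moves the same number of points from loser to winner, the total rating mass stays constant, and only the relative ordering carries information.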

[NLP-35] Predicting the Formation of Induction Heads NEURIPS

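[Quick read]: This paper aims to characterize precisely how induction heads (IHs) form in modern language models (LMs), and in particular how their formation relates to the statistical properties of the training data. The key results are: a simple equation combining batch size and context size that predicts the point at which IHs form; evidence that surface bigram repetition frequency and reliability strongly affect IH formation, with a precise Pareto frontier between the two; and the finding that local dependency with high bigram repetition frequency and reliability is sufficient for IH formation, whereas categoriality and the shape of the marginal distribution matter when frequency and reliability are low.

Link: https://arxiv.org/abs/2511.16893
Authors: Tatsuya Aoyama,Ethan Gotlieb Wilcox,Nathan Schneider
Affiliations: Georgetown University
Categories: Computation and Language (cs.CL)
Comments: Accepted to CogInterp @ NeurIPS

Click to view abstract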

Abstract:Arguably, specialized attention heads dubbed induction heads (IHs) underlie the remarkable in-context learning (ICL) capabilities of modern language models (LMs); yet, a precise characterization of their formation remains unclear. In this study, we investigate the relationship between statistical properties of training data (for both natural and synthetic data) and IH formation. We show that (1) a simple equation combining batch size and context size predicts the point at which IHs form; (2) surface bigram repetition frequency and reliability strongly affect the formation of IHs, and we find a precise Pareto frontier in terms of these two values; and (3) local dependency with high bigram repetition frequency and reliability is sufficient for IH formation, but when the frequency and reliability are low, categoriality and the shape of the marginal distribution matter.

[NLP-36] Deep Improvement Supervision

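[Quick read]: This paper addresses how to make small looped architectures such as Tiny Recursive Models (TRMs) more efficient to train on complex reasoning tasks while keeping their performance. The key to the solution is to frame the latent reasoning of TRMs as a form of classifier-free guidance and implicit policy improvement, and to propose a training scheme that provides an explicit target for each loop. The method improves training efficiency substantially: it reduces the total number of forward passes by 18x and eliminates halting mechanisms while maintaining quality comparable to standard TRMs. It reaches 24% accuracy on ARC-1 with only 0.8M parameters, outperforming most large language models (LLMs).

Link: https://arxiv.org/abs/2511.16886
Authors: Arip Asadulaev,Rayan Banerjee,Fakhri Karray,Martin Takac
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract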

Abstract:Recently, it was shown that small, looped architectures, such as Tiny Recursive Models (TRMs), can outperform Large Language Models (LLMs) on complex reasoning tasks, including the Abstraction and Reasoning Corpus (ARC). In this work, we investigate a core question: how can we further improve the efficiency of these methods with minimal changes? To address this, we frame the latent reasoning of TRMs as a form of classifier-free guidance and implicit policy improvement algorithm. Building on these insights, we propose a novel training scheme that provides a target for each loop during training. We demonstrate that our approach significantly enhances training efficiency. Our method reduces the total number of forward passes by 18x and eliminates halting mechanisms, while maintaining quality comparable to standard TRMs. Notably, we achieve 24% accuracy on ARC-1 with only 0.8M parameters, outperforming most LLMs.

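[NLP-37] Improving Latent Reasoning in LLMs via Soft Concept Mixing

[Quick read]: This paper addresses the mismatch between LLMs, which reason by generating discrete tokens, and reasoning over soft concepts in an abstract conceptual space as humans do, a mismatch that may limit expressive power. The key to the solution is Soft Concept Mixing (SCM), a training scheme that builds a soft concept vector as a probability-weighted average of embeddings, mixes it into the model's hidden states, and then optimizes the entire latent reasoning process with reinforcement learning (RL). The model is thereby exposed to soft representations during training, which narrows the gap between soft concepts at reasoning time and discrete tokens at training time.

Link: https://arxiv.org/abs/2511.16885
Authors: Kang Wang,Xiangyu Duan,Tianyi Du
Affiliations: Soochow University
Categories: Computation and Language (cs.CL)
Comments: 7 pages, 3 figures

Click to view abstract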

Abstract:Unlike human reasoning in abstract conceptual spaces, large language models (LLMs) typically reason by generating discrete tokens, which potentially limits their expressive power. The recent work Soft Thinking has shown that LLMs’ latent reasoning via soft concepts is a promising direction, but LLMs are trained on discrete tokens. To reduce this gap between the soft concepts in reasoning and the discrete tokens in training, we propose Soft Concept Mixing (SCM), a soft concept aware training scheme that directly exposes the model to soft representations during training. Specifically, SCM constructs a soft concept vector by forming a probability-weighted average of embeddings. Then, this vector is mixed into the model’s hidden states, which embody rich contextual information. Finally, the entire latent reasoning process is optimized with Reinforcement Learning (RL). Experiments on five reasoning benchmarks demonstrate that SCM improves the reasoning performance of LLMs, and simultaneously maintains a stable training dynamic.
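The "probability-weighted average of embeddings" step can be sketched in a few lines. The toy embedding table, the logits, and especially the `mix` rule (a convex combination with weight `alpha`) are assumptions for illustration; the abstract does not specify how the vector is mixed into the hidden states.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_concept(probs, embeddings):
    """Probability-weighted average of token embeddings (one row per token)."""
    dim = len(embeddings[0])
    return [sum(p * emb[d] for p, emb in zip(probs, embeddings)) for d in range(dim)]


def mix(hidden, concept, alpha=0.5):
    """Hypothetical mixing rule: convex combination of hidden state and soft concept."""
    return [(1 - alpha) * h + alpha * c for h, c in zip(hidden, concept)]


# Toy vocabulary of three tokens with 2-dimensional embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = softmax([2.0, 1.0, 0.0])
concept = soft_concept(probs, embeddings)
hidden = mix([0.2, -0.4], concept, alpha=0.5)
print(concept, hidden)
```

When the distribution is one-hot, the soft concept reduces to the ordinary embedding of the chosen token, so discrete decoding is a special case of this construction.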

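[NLP-38] ConCISE: A Reference-Free Conciseness Evaluation Metric for LLM-Generated Answers

[Quick read]: This paper addresses the verbosity of LLM responses: outputs that are lengthy and padded with unnecessary detail reduce clarity and user satisfaction and raise costs for model developers, especially with proprietary models that charge by output token. The key to the solution is a reference-free conciseness metric that quantifies non-essential content by averaging three complementary calculations: i) the compression ratio between the original response and an LLM abstractive summary; ii) the compression ratio between the original response and an LLM extractive summary; and iii) word-removal compression, in which an LLM removes as many non-essential words as possible while preserving meaning, and the number of tokens removed indicates the conciseness score. The metric enables automated evaluation of response brevity in conversational AI systems without human annotation.

Link: https://arxiv.org/abs/2511.16846
Authors: Seyed Mohssen Ghafari,Ronny Kol,Juan C. Quiroz,Nella Luan,Monika Patial,Chanaka Rupasinghe,Herman Wandabwa,Luiz Pizzato
Affiliations: Commonwealth Bank of Australia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract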

Abstract:Large language models (LLMs) frequently generate responses that are lengthy and verbose, filled with redundant or unnecessary details. This diminishes clarity and user satisfaction, and it increases costs for model developers, especially with well-known proprietary models that charge based on the number of output tokens. In this paper, we introduce a novel reference-free metric for evaluating the conciseness of responses generated by LLMs. Our method quantifies non-essential content without relying on gold standard references and calculates the average of three calculations: i) a compression ratio between the original response and an LLM abstractive summary; ii) a compression ratio between the original response and an LLM extractive summary; and iii) word-removal compression, where an LLM removes as many non-essential words as possible from the response while preserving its meaning, with the number of tokens removed indicating the conciseness score. Experimental results demonstrate that our proposed metric identifies redundancy in LLM outputs, offering a practical tool for automated evaluation of response brevity in conversational AI systems without the need for ground truth human annotations.
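The three-way average can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the exact formulas, so the sketch uses whitespace tokenization and defines each component as the fraction of tokens that survive shortening (higher means the original was already concise). The three shortened texts would come from LLM calls in practice; here they are passed in as plain strings.

```python
def token_count(text):
    """Whitespace token count (an assumption; a real system would use a tokenizer)."""
    return len(text.split())


def retention_ratio(original, shortened):
    """Fraction of the original's tokens that remain after shortening, capped at 1."""
    total = token_count(original)
    if total == 0:
        return 1.0
    return min(1.0, token_count(shortened) / total)


def conciseness_score(original, abstractive, extractive, pruned):
    """Average of three retention ratios, one per shortening strategy."""
    ratios = [retention_ratio(original, s) for s in (abstractive, extractive, pruned)]
    return sum(ratios) / len(ratios)


# Hypothetical response and hand-written stand-ins for the three LLM outputs.
response = "Well, to be honest, the capital of France is, in fact, Paris."
abstractive = "The capital of France is Paris."
extractive = "the capital of France is, in fact, Paris."
pruned = "the capital of France is Paris."
print(conciseness_score(response, abstractive, extractive, pruned))
```

A response that cannot be shortened at all scores 1.0 under this convention, and heavily padded responses score closer to 0.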

[NLP-39] Fantastic Bugs and Where to Find Them in AI Benchmarks

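[Quick read]: This paper addresses invalid benchmark questions, which undermine the reliability of large-scale AI evaluation. Because manually finding and correcting errors among thousands of questions is infeasible, the authors propose a framework for systematic benchmark revision whose key idea is to flag potentially invalid questions through statistical analysis of response patterns. The framework builds on the assumption, common in AI evaluation, that the mean score sufficiently summarizes model performance. This implies a unidimensional latent construct behind the measurement, from which an expected range can be derived for various statistics of each item. An item whose empirical values fall outside that range is more likely to be problematic. Across nine widely used benchmarks the method guides expert review to problematic questions with up to 84% precision, and an LLM-judge first pass further reduces human review effort, yielding an efficient and scalable revision framework.

Link: https://arxiv.org/abs/2511.16842
Authors: Sang Truong,Yuheng Tu,Michael Hardy,Anka Reuel,Zeyu Tang,Jirayu Burapacheep,Jonathan Perera,Chibuike Uwakwe,Ben Domingue,Nick Haber,Sanmi Koyejo
Affiliations: Stanford University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract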

Abstract:Benchmarks are pivotal in driving AI progress, and invalid benchmark questions frequently undermine their reliability. Manually identifying and correcting errors among thousands of benchmark questions is not only infeasible but also a critical bottleneck for reliable evaluation. In this work, we introduce a framework for systematic benchmark revision that leverages statistical analysis of response patterns to flag potentially invalid questions for further expert review. Our approach builds on a core assumption commonly used in AI evaluations that the mean score sufficiently summarizes model performance. This implies a unidimensional latent construct underlying the measurement experiment, yielding expected ranges for various statistics for each item. When empirically estimated values for these statistics fall outside the expected range for an item, the item is more likely to be problematic. Across nine widely used benchmarks, our method guides expert review to identify problematic questions with up to 84% precision. In addition, we introduce an LLM-judge first pass to review questions, further reducing human effort. Together, these components provide an efficient and scalable framework for systematic benchmark revision.
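One classical item statistic that follows from a unidimensional construct is the item-rest correlation: on a valid question, models that do well elsewhere should tend to do well on it too. The sketch below flags items whose correlation is negative. The choice of this particular statistic, the threshold of 0, and the response matrix are assumptions for illustration; the abstract does not list which statistics the paper uses.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns None when either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def flag_items(responses, threshold=0.0):
    """Return indices of items whose item-rest correlation falls below threshold.

    responses[m][i] is 1 if model m answered item i correctly, else 0.
    """
    flagged = []
    for i in range(len(responses[0])):
        item = [row[i] for row in responses]
        rest = [sum(row) - row[i] for row in responses]  # score on all other items
        r = pearson(item, rest)
        if r is not None and r < threshold:
            flagged.append(i)
    return flagged


# Hypothetical results: rows are models from strongest to weakest.
# The last item is answered correctly only by the weaker models.
responses = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
print(flag_items(responses))
```

Flagged items are candidates for expert review rather than automatic removal, which matches the review-guiding role described in the abstract.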

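[NLP-40] Cognitive BASIC: An In-Model Interpreted Reasoning Language for LLMs

[Quick read]: This paper addresses the lack of transparency and interpretability in LLM reasoning, that is, how to make a model's internal reasoning steps explicit, structured, and traceable. The key to the solution is Cognitive BASIC, a minimal BASIC-style prompting language with an in-model interpreter: numbered lines and simple commands form an interpretable cognitive control layer that lets an LLM carry out structured multi-step reasoning. A natural-language interpreter file specifies command semantics, memory updates, and logging behavior, which makes knowledge extraction, conflict detection, and reasoning steps transparent. In a comparison across three LLMs, all models could execute Cognitive BASIC programs, with strong but not uniform performance.

Link: https://arxiv.org/abs/2511.16837
Authors: Oliver Kramer
Affiliations: University of Oldenburg
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 6 pages, Submitted to ESANN 2026

Click to view abstract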

Abstract:Cognitive BASIC is a minimal, BASIC-style prompting language and in-model interpreter that structures large language model (LLM) reasoning into explicit, stepwise execution traces. Inspired by the simplicity of retro BASIC, we repurpose numbered lines and simple commands as an interpretable cognitive control layer. Modern LLMs can reliably simulate such short programs, enabling transparent multi-step reasoning inside the model. A natural-language interpreter file specifies command semantics, memory updates, and logging behavior. Our mental-model interpreter extracts declarative and procedural knowledge, detects contradictions, and produces resolutions when necessary. A comparison across three LLMs on a benchmark of knowledge extraction, conflict detection, and reasoning tasks shows that all models can execute Cognitive BASIC programs, with overall strong but not uniform performance.

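[NLP-41] The Shifting Landscape of Vaccine Discourse: Insights From a Decade of Pre- to Post-COVID-19 Vaccine Posts on Social Media

[Quick read]: This paper studies how English-language vaccine discourse on social media evolved before and after the COVID-19 outbreak, and in particular how public attitudes and sentiment toward vaccines changed. The key to the approach is a new dataset of 18.7 million curated vaccine-related posts on X (formerly Twitter) from 2013 to 2022, analyzed with theories from social cognition and the stereotype content model to track changes in emotion word usage and language features. The analysis reveals a complex shift: trust-related language increased early in the pandemic, and negative word usage rose toward its end, possibly reflecting growing hesitancy and skepticism.

Link: https://arxiv.org/abs/2511.16832
Authors: Nikesh Gyawali,Doina Caragea,Cornelia Caragea,Saif M. Mohammad
Affiliations: Unknown
Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL)
Comments:

Click to view abstract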

Abstract:In this work, we study English-language vaccine discourse in social media posts, specifically posts on X (formerly Twitter), in the seven years before the COVID-19 outbreak (2013 to 2019) and the three years after the outbreak was first reported (2020 to 2022). Drawing on theories from social cognition and the stereotype content model in Social Psychology, we analyze how English speakers talk about vaccines on social media to understand the evolving narrative around vaccines in social media posts. To do that, we first introduce a novel dataset comprising 18.7 million curated posts on vaccine discourse from 2013 to 2022. This extensive collection, filtered down from an initial 129 million posts through rigorous preprocessing, captures both pre-COVID and COVID-19 periods, offering valuable insights into the evolution of English-speaking X users’ perceptions related to vaccines. Our analysis shows that the COVID-19 pandemic led to complex shifts in X users’ sentiment and discourse around vaccines. We observe that negative emotion word usage decreased during the pandemic, with notable rises in the usage of surprise- and trust-related emotion words. Furthermore, vaccine-related language tended to use more warmth-focused words associated with trustworthiness, along with positive, competence-focused words during the early days of the pandemic, with a marked rise in negative word usage towards the end of the pandemic, possibly reflecting a growing vaccine hesitancy and skepticism.

[NLP-42] PEPPER: Perception-Guided Perturbation for Robust Backdoor Defense in Text-to-Image Diffusion Models

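[Quick read]: This paper addresses the vulnerability of text-to-image (T2I) diffusion models to backdoor attacks, in which a trigger in the input prompt steers generation toward harmful or unintended content. The key to the solution is PEPPER (PErcePtion Guided PERturbation), which rewrites the original prompt into one that is semantically distant yet visually similar and adds unobtrusive elements. This disrupts the trigger embedded in the prompt and dilutes the influence of trigger tokens, giving an effective defense against text-encoder-based attacks. Experiments show that PEPPER substantially reduces attack success while preserving generation quality, and that it can be combined with existing defenses for stronger and more generalizable robustness.

Link: https://arxiv.org/abs/2511.16830
Authors: Oscar Chew,Po-Yi Lu,Jayden Lin,Kuan-Hao Huang,Hsuan-Tien Lin
Affiliations: Texas A&M University; National Taiwan University; University of Michigan
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract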

Abstract:Recent studies show that text-to-image (T2I) diffusion models are vulnerable to backdoor attacks, where a trigger in the input prompt can steer generation toward harmful or unintended content. To address this, we introduce PEPPER (PErcePtion Guided PERturbation), a backdoor defense that rewrites the caption into a semantically distant yet visually similar caption while adding unobtrusive elements. With this rewriting strategy, PEPPER disrupts the trigger embedded in the input prompt, dilutes the influence of trigger tokens, and thereby achieves enhanced robustness. Experiments show that PEPPER is particularly effective against text-encoder-based attacks, substantially reducing attack success while preserving generation quality. Beyond this, PEPPER can be paired with any existing defense, yielding consistently stronger and more generalizable robustness than any standalone method. Our code will be released on GitHub.

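[NLP-43] Interpretable dimensions support an effect of agentivity and telicity on split intransitivity

[Quick read]: This paper revisits a long-standing hypothesis about the two syntactic classes of intransitive verbs, unergatives and unaccusatives: that a verb's agentivity and telicity predict its syntactic behavior. Recent work (Kim et al., 2024) found that human ratings of agentivity and telicity were poor predictors of the syntactic behavior of intransitives, calling the hypothesis into question. The key to the approach here is the use of interpretable dimensions, computed from seed words at opposite poles of the agentive and telic scales, to capture verbs' semantic properties, used in conjunction with human judgments. The results support the link between unergativity/unaccusativity and agentivity/telicity and show that interpretable dimensions can provide evidence for semantic properties that are hard to evaluate in rating tasks.

Link: https://arxiv.org/abs/2511.16824
Authors: Eva Neu,Brian Dillon,Katrin Erk
Affiliations: University of Massachusetts, Amherst
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract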

Abstract:Intransitive verbs fall into two different syntactic classes, unergatives and unaccusatives. It has long been argued that verbs describing an agentive action are more likely to appear in an unergative syntax, and those describing a telic event to appear in an unaccusative syntax. However, recent work by Kim et al. (2024) found that human ratings for agentivity and telicity were a poor predictor of the syntactic behavior of intransitives. Here we revisit this question using interpretable dimensions, computed from seed words on opposite poles of the agentive and telic scales. Our findings support the link between unergativity/unaccusativity and agentivity/telicity, and demonstrate that using interpretable dimensions in conjunction with human judgments can offer valuable evidence for semantic properties that are not easily evaluated in rating tasks.
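A seed-based dimension of this kind can be sketched as the normalized difference between the mean embedding of the seeds at one pole and the mean embedding of the seeds at the other, with each verb scored by its projection onto that axis. This is the simplest variant and an assumption on my part; the abstract does not say which construction the authors use. The 2-dimensional vectors and seed choices below are made up for illustration.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def seed_dimension(positive_seeds, negative_seeds):
    """Unit vector pointing from the negative-pole mean to the positive-pole mean."""
    pos, neg = mean_vector(positive_seeds), mean_vector(negative_seeds)
    diff = [p - q for p, q in zip(pos, neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def project(vector, dimension):
    """Signed position of a word vector along the dimension."""
    return sum(v * d for v, d in zip(vector, dimension))


# Toy embeddings (hypothetical values, not taken from any real model).
toy = {
    "run": [0.9, 0.1], "work": [0.8, 0.2],    # assumed agentive seeds
    "fall": [0.1, 0.9], "melt": [0.2, 0.8],   # assumed non-agentive seeds
    "swim": [0.7, 0.3], "vanish": [0.3, 0.7],
}
agentivity = seed_dimension([toy["run"], toy["work"]], [toy["fall"], toy["melt"]])
print(project(toy["swim"], agentivity), project(toy["vanish"], agentivity))
```

The projections give a graded score per verb, which is what allows them to be compared against human ratings and syntactic behavior.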

[NLP-44] From Representation to Enactment: The ABC Framework of the Translating Mind

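[Quick read]: This paper addresses the limitations of representation-based models of translation, in which translation is treated as the manipulation of static interlingual correspondences and the dynamic interaction between translator and environment is ignored. The key to the solution is the ABC framework, grounded in Extended Mind (EM) theory and radical enactivism, which redefines translation as an enacted activity that dynamically integrates affective, behavioral, and cognitive processes, with meaning arising from loops of brain-body-environment interaction. This non-representational account treats translation as skillful participation in sociocultural practice, where meaning is co-created in real time through embodied interaction.

Link: https://arxiv.org/abs/2511.16811
Authors: Michael Carl,Takanori Mizowaki,Aishvarya Raj,Masaru Yamada,Devi Sri Bandaru,Yuxiang Wei,Xinyue Ren
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract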

Abstract:Building on the Extended Mind (EM) theory and radical enactivism, this article suggests an alternative to representation-based models of the mind. We lay out a novel ABC framework of the translating mind, in which translation is not the manipulation of static interlingual correspondences but an enacted activity, dynamically integrating affective, behavioral, and cognitive (ABC) processes. Drawing on Predictive Processing and (En)Active Inference, we argue that the translator’s mind emerges, rather than being merely extended, through loops of brain-body-environment interactions. This non-representational account reframes translation as skillful participation in sociocultural practice, where meaning is co-created in real time through embodied interaction with texts, tools, and contexts.

[NLP-45] NALA_MAINZ at BLP-2025 Task 2: A Multi-agent Approach for Bangla Instruction to Python Code Generation

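[Quick read]: This paper addresses the challenge of generating correct code from Bangla instructions, a difficult case of code generation in a non-English setting. The key to the solution is a multi-agent pipeline: a code-generation agent first produces an initial program from the input instruction, and the program is then executed against the unit tests (pytest-style, assert-based) to find failing cases. Only the failing cases are passed to a debugger agent, which uses the error messages, the current program, and the relevant test cases to produce a revised solution. This staged, repair-focused design achieved first place in the BLP-2025 shared task with a Pass@1 score of 95.4.

Link: https://arxiv.org/abs/2511.16787
Authors: Hossain Shaikh Saadi,Faria Alam,Mario Sanz-Guerrero,Minh Duc Bui,Manuel Mager,Katharina von der Wense
Affiliations: Johannes Gutenberg University Mainz, Germany; Saarland University, Germany; University of Colorado Boulder, USA
Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: BLP 2025 Shared Task 2 - Code Generation in Bangla

Click to view abstract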

Abstract:This paper presents JGU Mainz’s winning system for the BLP-2025 Shared Task on Code Generation from Bangla Instructions. We propose a multi-agent-based pipeline. First, a code-generation agent produces an initial solution from the input instruction. The candidate program is then executed against the provided unit tests (pytest-style, assert-based). Only the failing cases are forwarded to a debugger agent, which reruns the tests, extracts error traces, and, conditioning on the error messages, the current program, and the relevant test cases, generates a revised solution. Using this approach, our submission achieved first place in the shared task with a Pass@1 score of 95.4. We also make our code public.
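The generate, test, and debug loop described above can be sketched as follows. The two "agents" are hypothetical stub functions standing in for LLM calls, the instruction is an English placeholder rather than Bangla, and `max_rounds` is an assumed parameter; none of these come from the paper's released code.

```python
def run_tests(program_src, tests):
    """Run assert-style tests against a candidate; return (test, error) for failures."""
    namespace = {}
    try:
        exec(program_src, namespace)  # NOTE: a real system should sandbox this
    except Exception as exc:
        return [(test, repr(exc)) for test in tests]
    failures = []
    for test in tests:
        try:
            exec(test, namespace)
        except Exception as exc:
            failures.append((test, repr(exc)))
    return failures


def solve(instruction, tests, generate, debug, max_rounds=2):
    """Generate once, then send only the failing tests to the debugger."""
    program = generate(instruction)
    for _ in range(max_rounds):
        failures = run_tests(program, tests)
        if not failures:
            break
        program = debug(instruction, program, failures)
    return program


# Stub agents (hypothetical): the generator is deliberately buggy.
def toy_generate(instruction):
    return "def add(a, b):\n    return a - b\n"


seen_failures = []


def toy_debug(instruction, program, failures):
    seen_failures.append(failures)  # record what the debugger was shown
    return program.replace("a - b", "a + b")


tests = ["assert add(2, 3) == 5", "assert add(0, 0) == 0"]
final_program = solve("Return the sum of two numbers.", tests, toy_generate, toy_debug)
print(final_program)
```

In this toy run only the first test fails on the initial program, so the debugger sees one failing case rather than the whole suite, which is the filtering step the abstract emphasizes.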

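[NLP-46] Detecting and Steering LLMs' Empathy in Action

[Quick read]: This paper addresses how to detect and steer "empathy-in-action" (the willingness to sacrifice task efficiency to address human needs) in large language models (LLMs). The core approach models the trait as a linear direction in LLM activation space and uses contrastive prompts grounded in the Empathy-in-Action (EIA) benchmark for detection and intervention. The key findings are twofold. First, all tested models show near-perfect detection at their optimal layers (AUROC 0.996-1.00), so the encoding of empathy-in-action can be located. Second, implementations differ by architecture: Dolphin-Llama-3.1-8B can be steered robustly only toward more empathy and breaks down under anti-empathy steering, whereas Qwen2.5-7B and Phi-3-mini-4k support bidirectional control while staying coherent. This suggests that safety training may affect steering robustness rather than prevent manipulation.

Link: https://arxiv.org/abs/2511.16699
Authors: Juan P. Cadile
Affiliations: University of Rochester
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 9 figures

Click to view abstract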

Abstract:We investigate empathy-in-action – the willingness to sacrifice task efficiency to address human needs – as a linear direction in LLM activation space. Using contrastive prompts grounded in the Empathy-in-Action (EIA) benchmark, we test detection and steering across Phi-3-mini-4k (3.8B), Qwen2.5-7B (safety-trained), and Dolphin-Llama-3.1-8B (uncensored). Detection: All models show AUROC 0.996-1.00 at optimal layers. Uncensored Dolphin matches safety-trained models, demonstrating empathy encoding emerges independent of safety training. Phi-3 probes correlate strongly with EIA behavioral scores (r=0.71, p<0.01). Cross-model probe agreement is limited (Qwen: r=-0.06, Dolphin: r=0.18), revealing architecture-specific implementations despite convergent detection. Steering: Qwen achieves 65.3% success with bidirectional control and coherence at extreme interventions. Phi-3 shows 61.7% success with similar coherence. Dolphin exhibits asymmetric steerability: 94.4% success for pro-empathy steering but catastrophic breakdown for anti-empathy (empty outputs, code artifacts). Implications: The detection-steering gap varies by model. Qwen and Phi-3 maintain bidirectional coherence; Dolphin shows robustness only for empathy enhancement. Safety training may affect steering robustness rather than preventing manipulation, though validation across more models is needed.
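Treating a trait as a linear direction usually means three operations: estimate a direction from contrastive examples, score activations by projecting onto it, and steer by adding a multiple of it. The sketch below uses a difference-of-means direction, which is one common choice and an assumption here; the abstract does not state how the probe direction is fitted. The 2-dimensional "activations" are made-up numbers, and the AUROC is computed on the same toy data used to fit the direction.

```python
def mean_vector(vectors):
    """Component-wise mean of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrast_direction(positive, negative):
    """Difference of class means over contrastive activations."""
    return [p - q for p, q in zip(mean_vector(positive), mean_vector(negative))]


def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos_scores for q in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


def steer(activation, direction, alpha):
    """Add alpha times the direction to an activation vector."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Hypothetical activations for empathic vs. task-focused prompts.
empathic = [[1.0, 0.2], [0.8, 0.1], [1.2, 0.3]]
task_only = [[-1.0, 0.2], [-0.7, 0.0], [-0.9, 0.4]]
direction = contrast_direction(empathic, task_only)
detection = auroc([dot(v, direction) for v in empathic],
                  [dot(v, direction) for v in task_only])
steered = steer(task_only[0], direction, alpha=1.0)
print(detection, steered)
```

A negative `alpha` would push activations the other way, which corresponds to the anti-empathy steering where the abstract reports model-dependent breakdowns.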

[NLP-47] Hierarchical Retrieval with Out-Of-Vocabulary Queries: A Case Study on SNOMED CT WWW’26

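[Quick read]: This paper addresses hierarchical concept retrieval in SNOMED CT (Systematized Nomenclature of Medicine – Clinical Terms) for out-of-vocabulary (OOV) queries, a problem made harder by language ambiguity, synonymy, and polysemy. The key to the solution is an approach built on language-model-based ontology embeddings: a pretrained language model captures semantic information so that, even when a query has no equivalent match in the ontology, the most direct subsumers (the closest parent concepts) and their more distant ancestors can still be retrieved. On the constructed OOV query dataset the method outperforms SBERT and two lexical matching baselines, and it can be extended to other ontologies.

Link: https://arxiv.org/abs/2511.16698
Authors: Jonathon Dilworth, Hui Yang, Jiaoyan Chen, Yongsheng Gao
Affiliations: University of Manchester; SNOMED International
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures, 3 tables, submission to The Web Conference 2026 (WWW’26), Dubai, UAE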

Abstract:SNOMED CT is a biomedical ontology with a hierarchical representation of large-scale concepts. Knowledge retrieval in SNOMED CT is critical for its application, but often proves challenging due to language ambiguity, synonyms, polysemies and so on. This problem is exacerbated when the queries are out-of-vocabulary (OOV), i.e., having no equivalent matchings in the ontology. In this work, we focus on the problem of hierarchical concept retrieval from SNOMED CT with OOV queries, and propose an approach based on language model-based ontology embeddings. For evaluation, we construct OOV queries annotated against SNOMED CT concepts, testing the retrieval of the most direct subsumers and their less relevant ancestors. We find that our method outperforms the baselines including SBERT and two lexical matching methods. While evaluated against SNOMED CT, the approach is generalisable and can be extended to other ontologies. We release code, tools, and evaluation datasets at this https URL.

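[NLP-48] How Language Directions Align with Token Geometry in Multilingual LLMs

[Quick read]: This paper addresses the lack of systematic analysis of how language information is structured in the internal representation space of multilingual large language models (LLMs) and how it evolves across layers. The key to the solution is a comprehensive probing study that combines linear and nonlinear probes with a new Token–Language Alignment analysis to quantify the layer-wise dynamics and geometric structure of language encoding. The study finds that language information becomes sharply separated in the first transformer block (+76.4 ± 8.2 percentage points from Layer 0 to Layer 1) and stays almost fully linearly separable in later layers. It also finds that the alignment between language directions and vocabulary embeddings is closely tied to the language composition of the training data: Chinese-inclusive models reach a ZH Match@Peak of 16.43%, against 3.90% for English-centric models. This suggests that multilingual LLMs distinguish languages not by surface script features but by latent representational structure shaped by the training corpus.

Link: https://arxiv.org/abs/2511.16693
Authors: JaeSeong Kim, Suan Lee
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 4 pages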

Abstract: Multilingual LLMs demonstrate strong performance across diverse languages, yet there has been limited systematic analysis of how language information is structured within their internal representation space and how it emerges across layers. We conduct a comprehensive probing study on six multilingual LLMs, covering all 268 transformer layers, using linear and nonlinear probes together with a new Token–Language Alignment analysis to quantify the layer-wise dynamics and geometric structure of language encoding. Our results show that language information becomes sharply separated in the first transformer block (+76.4 ± 8.2 percentage points from Layer 0 to 1) and remains almost fully linearly separable throughout model depth. We further find that the alignment between language directions and vocabulary embeddings is strongly tied to the language composition of the training data. Notably, Chinese-inclusive models achieve a ZH Match@Peak of 16.43%, whereas English-centric models achieve only 3.90%, revealing a 4.21× structural imprinting effect. These findings indicate that multilingual LLMs distinguish languages not by surface script features but by latent representational structures shaped by the training corpus. Our analysis provides practical insights for data composition strategies and fairness in multilingual representation learning. All code and analysis scripts are publicly available at: this https URL.

[NLP-49] Reproducibility Report: Test-Time Training on Nearest Neighbors for Large Language Models

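[Quick read]: This report reproduces the central claims of test-time training (TTT) on nearest neighbors, a method that adapts a language model at inference time to improve its performance on unseen inputs. The key steps are as follows: using pretrained RoBERTa embeddings indexed with Faiss, the 20 nearest-neighbor sequences are retrieved for each test input, and one gradient update is applied per neighbor to fine-tune the model. The reproduction confirms that this neighbor-based online adaptation significantly reduces perplexity and bits-per-byte, with the largest gains on structured or specialized data such as GitHub and EuroParl, and that models not pretrained on The Pile benefit more, which supports the robustness and generality of the method.

Link: https://arxiv.org/abs/2511.16691
Authors: Boyang Zhou, Johan Lindqvist, Lindsey Li
Affiliations: University of Washington
Categories: Computation and Language (cs.CL)
Comments: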

Abstract:We reproduce the central claims of Test-Time Training on Nearest Neighbors for Large Language Models (Hardt and Sun, 2024), which proposes adapting a language model at inference time by fine-tuning on retrieved nearest-neighbor sequences. Using pretrained RoBERTa embeddings indexed with Faiss, we retrieve 20 neighbors per test input and apply one gradient update per neighbor across GPT-2 (117M, 774M), GPT-Neo (1.3B), and R1-Distilled-Qwen2.5-1.5B. Our experiments confirm that test-time training significantly reduces perplexity and bits-per-byte metrics across diverse domains from The Pile, with the largest improvements in structured or specialized datasets such as GitHub and EuroParl. We further validate that models not pretrained on The Pile benefit more from this adaptation than models already trained on similar data, allowing smaller models to approach the performance of larger ones. Due to infrastructure limitations, we introduce a memory-efficient retrieval implementation that loads only required line offsets rather than entire files, reducing RAM requirements from over 128 GB per server to 32 GB. We also extend the original study by evaluating R1-Distilled-Qwen2.5-1.5B, showing that test-time training yields consistent gains even for modern reasoning-optimized architectures. Overall, our results support the robustness and generality of nearest-neighbor test-time training while highlighting practical considerations for reproducing large-scale retrieval-augmented adaptation.

[NLP-50] Falsely Accused: How AI Detectors Misjudge Slightly Polished Arabic Articles

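[Quick read]: This paper addresses the tendency of current AI detection models to misclassify human-written Arabic text that has been lightly polished by AI as AI-generated. Such misclassification can lead to false accusations against authors and undermines the credibility of AI detectors. The key to the approach is the construction of two Arabic-specific benchmark datasets. The first contains 800 Arabic articles (half AI-generated, half human-written) and is used to evaluate how well 14 large language models (LLMs) and commercial AI detectors separate the two. The second, Ar-APT, contains 400 human-written Arabic articles polished by 10 LLMs under four polishing settings, for a total of 16,400 samples, and is used to test whether the selected detectors still identify the source of a text after light polishing. The experiments show that all detectors misattribute a significant number of articles to AI, with sharp performance drops on articles lightly polished by models such as LLaMA-3, exposing a serious robustness weakness in existing detection methods.

Link: https://arxiv.org/abs/2511.16690
Authors: Saleh Almohaimeed, Saad Almohaimeed, Mousa Jari, Khaled A. Alobaid, Fahad Alotaibi
Affiliations: King Saud University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Submitted to Artificial Intelligence Review Journal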

Abstract:Many AI detection models have been developed to counter the presence of articles created by artificial intelligence (AI). However, if a human-authored article is slightly polished by AI, a shift will occur in the borderline decision of these AI detection models, leading them to consider it AI-generated article. This misclassification may result in falsely accusing authors of AI plagiarism and harm the credibility of AI detector models. In English, some efforts were made to meet this challenge, but not in Arabic. In this paper, we generated two datasets. The first dataset contains 800 Arabic articles, half AI-generated and half human-authored. We used it to evaluate 14 Large Language models (LLMs) and commercial AI detectors to assess their ability in distinguishing between human-authored and AI-generated articles. The best 8 models were chosen to act as detectors for our primary concern, which is whether they would consider slightly polished human text as AI-generated. The second dataset, Ar-APT, contains 400 Arabic human-authored articles polished by 10 LLMs using 4 polishing settings, totaling 16400 samples. We use it to evaluate the 8 nominated models and determine whether slight polishing will affect their performance. The results reveal that all AI detectors incorrectly attribute a significant number of articles to AI. The best performing LLM, Claude-4 Sonnet, achieved 83.51%, their performance decreased to 57.63% for articles slightly polished by LLaMA-3. Whereas for the best performing commercial model, this http URL, that achieves 92% accuracy, dropped to 12% for articles slightly polished by Mistral or Gemma-3.

[NLP-51] Concept-Based Interpretability for Toxicity Detection

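[Quick read]: This paper addresses classification errors in toxicity detection models caused by over-attribution of concepts, focusing on the disproportionate influence of predefined toxicity subtypes (such as insult, threat, and identity attack) on model decisions. The key to the solution is an interpretability technique based on the Concept Gradient (CG) method, which gives a more causal explanation by measuring how changes in concepts affect the model output. The authors also curate a Targeted Lexicon Set and compute Word-Concept Alignment (WCA) scores to identify toxic words that drive misclassification, and they design a lexicon-free augmentation strategy to check whether over-attribution persists once explicit lexical overlap is removed, which sheds light on how the model attributes broader toxic language patterns.

Link: https://arxiv.org/abs/2511.16689
Authors: Samarth Garg, Deeksha Varshney, Divya Singh
Affiliations: ABV–IIITM; IIT Jodhpur; IIT Patna
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages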

Abstract:The rise of social networks has not only facilitated communication but also allowed the spread of harmful content. Although significant advances have been made in detecting toxic language in textual data, the exploration of concept-based explanations in toxicity detection remains limited. In this study, we leverage various subtype attributes present in toxicity detection datasets, such as obscene, threat, insult, identity attack, and sexual explicit as concepts that serve as strong indicators to identify whether language is toxic. However, disproportionate attribution of concepts towards the target class often results in classification errors. Our work introduces an interpretability technique based on the Concept Gradient (CG) method which provides a more causal interpretation by measuring how changes in concepts directly affect the output of the model. This is an extension of traditional gradient-based methods in machine learning, which often focus solely on input features. We propose the curation of Targeted Lexicon Set, which captures toxic words that contribute to misclassifications in text classification models. To assess the significance of these lexicon sets in misclassification, we compute Word-Concept Alignment (WCA) scores, which quantify the extent to which these words lead to errors due to over-attribution to toxic concepts. Finally, we introduce a lexicon-free augmentation strategy by generating toxic samples that exclude predefined toxic lexicon sets. This approach allows us to examine whether over-attribution persists when explicit lexical overlap is removed, providing insights into the model’s attribution on broader toxic language patterns.

[NLP-52] Prompt-Based Value Steering of Large Language Models

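[Quick read]: This paper addresses the difficulty of adapting large language models to human values at run time, especially when values and preferences change with the situation and static fine-tuning cannot keep up. The key to the solution is a reproducible, model-agnostic evaluation procedure that quantifies the presence and gain of target values in generated text, and so determines whether a candidate prompt effectively steers the model toward specific values. Using a variant of the Wizard-Vicuna model, Schwartz's theory of basic human values, and a structured dialogue dataset, the study compares a baseline prompt with one explicitly conditioned on values and shows that value steering is possible through prompt design alone, without altering the model or dynamically optimizing prompts.

Link: https://arxiv.org/abs/2511.16688
Authors: Giulio Antonio Abbo, Tony Belpaeme
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 1 figure, 4 tables. Presented at the 3rd International Workshop on Value Engineering in AI (VALE 2025), 28th European Conference on AI. To appear in Springer LNCS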

Abstract:Large language models are increasingly used in applications where alignment with human values is critical. While model fine-tuning is often employed to ensure safe responses, this technique is static and does not lend itself to everyday situations involving dynamic values and preferences. In this paper, we present a practical, reproducible, and model-agnostic procedure to evaluate whether a prompt candidate can effectively steer generated text toward specific human values, formalising a scoring method to quantify the presence and gain of target values in generated responses. We apply our method to a variant of the Wizard-Vicuna language model, using Schwartz’s theory of basic human values and a structured evaluation through a dialogue dataset. With this setup, we compare a baseline prompt to one explicitly conditioned on values, and show that value steering is possible even without altering the model or dynamically optimising prompts.

[NLP-53] Ellipsoid-Based Decision Boundaries for Open Intent Classification

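[Quick read]: This paper addresses a central challenge in textual open intent classification: making dialogue systems detect unknown user intents robustly without prior knowledge of them. Existing adaptive decision boundary methods avoid manual threshold tuning but assume that known classes are isotropically distributed, so their boundaries can only be balls and ignore how the distribution varies along different directions of the feature space. The key to the solution is EliDecide, which uses learnable matrices to parameterize an ellipsoid as the decision boundary of each known class, allowing the boundary to scale differently along different feature directions. A dual loss function balances empirical risk and open-space risk: boundaries expand to cover known samples and contract against synthesized pseudo-open samples. The method improves open intent detection and shows potential to generalize to other text classification tasks.

Link: https://arxiv.org/abs/2511.16685
Authors: Yuetian Zou, Hanlei Zhang, Hua Xu, Songze Li, Long Xiao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: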

Abstract:Textual open intent classification is crucial for real-world dialogue systems, enabling robust detection of unknown user intents without prior knowledge and contributing to the robustness of the system. While adaptive decision boundary methods have shown great potential by eliminating manual threshold tuning, existing approaches assume isotropic distributions of known classes, restricting boundaries to balls and overlooking distributional variance along different directions. To address this limitation, we propose EliDecide, a novel method that learns ellipsoid decision boundaries with varying scales along different feature directions. First, we employ supervised contrastive learning to obtain a discriminative feature space for known samples. Second, we apply learnable matrices to parameterize ellipsoids as the boundaries of each known class, offering greater flexibility than spherical boundaries defined solely by centers and radii. Third, we optimize the boundaries via a novelly designed dual loss function that balances empirical and open-space risks: expanding boundaries to cover known samples while contracting them against synthesized pseudo-open samples. Our method achieves state-of-the-art performance on multiple text intent benchmarks and further on a question classification dataset. The flexibility of the ellipsoids demonstrates superior open intent detection capability and strong potential for generalization to more text classification tasks in diverse complex open-world scenarios.
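To see why an ellipsoid can help where a ball cannot, here is a minimal sketch of an ellipsoid membership test. The abstract does not give EliDecide's exact parameterization or decision rule, so the form `||A (x - c)||^2 <= 1` with a per-class matrix `A`, the nearest-ellipsoid rule, and all names and numbers below are assumptions for illustration.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def ellipsoid_distance(x, center, shape):
    """Squared norm of shape @ (x - center); a value <= 1 means inside."""
    offset = [xi - ci for xi, ci in zip(x, center)]
    return sum(t * t for t in matvec(shape, offset))

def classify(x, classes, open_label="open"):
    """Pick the known class with the smallest ellipsoid distance,
    or return open_label when x lies outside every ellipsoid (assumed rule)."""
    label, dist = min(
        ((name, ellipsoid_distance(x, c, a)) for name, (c, a) in classes.items()),
        key=lambda pair: pair[1],
    )
    return label if dist <= 1.0 else open_label

# One hypothetical known intent whose features spread widely along the first
# axis (semi-axis 2.0) and narrowly along the second (semi-axis 0.5).
classes = {"book_flight": ((0.0, 0.0), [[0.5, 0.0], [0.0, 2.0]])}
along_wide_axis = classify((1.5, 0.0), classes)
along_narrow_axis = classify((0.0, 1.5), classes)
```

Both query points lie 1.5 units from the center, so any single-radius ball must treat them identically, while the ellipsoid accepts the first and rejects the second.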

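[NLP-54] How Well Do LLMs Understand Tunisian Arabic?

[Quick read]: This paper examines how poorly industrial-scale large language models (LLMs) understand low-resource languages such as Tunisian Arabic (Tunizi). This gap risks excluding millions of Tunisians from interacting with AI in their own language and pushing them toward French or English, which threatens the preservation of the dialect and may shift the language preferences of younger generations. The key to the approach is a new dataset containing parallel Tunizi, standard Tunisian Arabic, and English translations together with sentiment labels, on which several popular LLMs are benchmarked on three tasks: transliteration, translation, and sentiment analysis. The results quantify the gaps between models in handling Tunisian dialect and provide empirical grounds for including low-resource languages in the next generation of AI systems.

Link: https://arxiv.org/abs/2511.16683
Authors: Mohamed Mahdi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: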

Abstract:Large Language Models (LLMs) are the engines driving today’s AI agents. The better these models understand human languages, the more natural and user-friendly the interaction with AI becomes, from everyday devices like computers and smartwatches to any tool that can act intelligently. Yet, the ability of industrial-scale LLMs to comprehend low-resource languages, such as Tunisian Arabic (Tunizi), is often overlooked. This neglect risks excluding millions of Tunisians from fully interacting with AI in their own language, pushing them toward French or English. Such a shift not only threatens the preservation of the Tunisian dialect but may also create challenges for literacy and influence younger generations to favor foreign languages. In this study, we introduce a novel dataset containing parallel Tunizi, standard Tunisian Arabic, and English translations, along with sentiment labels. We benchmark several popular LLMs on three tasks: transliteration, translation, and sentiment analysis. Our results reveal significant differences between models, highlighting both their strengths and limitations in understanding and processing Tunisian dialects. By quantifying these gaps, this work underscores the importance of including low-resource languages in the next generation of AI systems, ensuring technology remains accessible, inclusive, and culturally grounded.

[NLP-55] Bench360: Benchmarking Local LLM Inference from 360°

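[Quick read]: This paper addresses the configuration problem of running large language models (LLMs) locally: faced with many combinations of models, inference engines, and quantization levels, users struggle to find a configuration that balances functional requirements with non-functional metrics such as latency, throughput, and energy use. Existing benchmarks have narrow goals and do not integrate system-level and task-specific metrics, so they fall short of the multi-dimensional evaluation that real deployments need. The key to the solution is Bench360, an end-to-end benchmarking framework that supports multiple usage scenarios (single stream, batch server), multiple inference engines, multiple quantization levels, and user-defined tasks. It automatically collects system metrics (computing performance, resource usage, deployment) alongside task metrics (such as ROUGE and F1 score), helping users weigh the trade-off between task performance and efficiency. The results show that no single setup is best for local inference, which motivates the need for such a framework.

Link: https://arxiv.org/abs/2511.16682
Authors: Linus Stuhlmann, Mauricio Fadel Argerich, Jonathan Fürst
Affiliations: Zurich University of Applied Sciences; Universidad Politécnica de Madrid
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
Comments: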

Abstract:Running large language models (LLMs) locally is becoming increasingly common. While the growing availability of small open-source models and inference engines has lowered the entry barrier, users now face an overwhelming number of configuration choices. Identifying an optimal configuration – balancing functional and non-functional requirements – requires substantial manual effort. While several benchmarks target LLM inference, they are designed for narrow evaluation goals and not user-focused. They fail to integrate relevant system and task-specific metrics into a unified, easy-to-use benchmark that supports multiple inference engines, usage scenarios, and quantization levels. To address this gap, we present Bench360 – Benchmarking Local LLM Inference from 360°. Bench360 allows users to easily define their own custom tasks along with datasets and relevant task-specific metrics and then automatically benchmarks selected LLMs, inference engines, and quantization levels across different usage scenarios (single stream, batch server). Bench360 tracks a wide range of metrics, including (1) system metrics – such as Computing Performance (e.g., latency, throughput), Resource Usage (e.g., energy per query), and Deployment (e.g., cold start time) – and (2) task-specific metrics such as ROUGE, F1 score or accuracy. We demonstrate Bench360 on four common LLM tasks – General Knowledge Reasoning, QA, Summarization and Text-to-SQL – across three hardware platforms and four state of the art inference engines. Our results reveal several interesting trade-offs between task performance and system-level efficiency, highlighting the differences in inference engines and models. Most importantly, there is no single best setup for local inference, which strongly motivates the need for a framework such as Bench360.
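The pairing of system metrics with task-specific metrics can be sketched as a tiny single-stream harness. This is not Bench360's API, which the abstract does not describe; `benchmark`, `fake_generate`, and `exact_match` are hypothetical names, and whitespace-split words are only a crude stand-in for tokens.

```python
import time
from statistics import fmean

def benchmark(generate, prompts, score):
    """Run generate on each (prompt, reference) pair one at a time and
    report both a system-level and a task-level view of the run."""
    latencies, scores, tokens = [], [], 0
    for prompt, reference in prompts:
        start = time.perf_counter()
        output = generate(prompt)
        latencies.append(time.perf_counter() - start)
        tokens += len(output.split())  # crude token proxy (assumption)
        scores.append(score(output, reference))
    total = sum(latencies)
    return {
        "mean_latency_s": fmean(latencies),
        "tokens_per_s": tokens / total if total > 0 else float("inf"),
        "task_score": fmean(scores),
    }

def fake_generate(prompt):
    """Stand-in for a real inference engine call."""
    return prompt.upper()

def exact_match(output, reference):
    """Task metric: 1.0 when the output equals the reference, else 0.0."""
    return 1.0 if output == reference else 0.0

result = benchmark(fake_generate, [("a b", "A B"), ("c", "x")], exact_match)
```

Running the same harness over several engine and quantization combinations and comparing the resulting dictionaries is the kind of side-by-side view the abstract argues for.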

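[NLP-56] Towards Hyper-Efficient RAG Systems in VecDBs: Distributed Parallel Multi-Resolution Vector Search

[Quick read]: This paper addresses a mismatch in semantic granularity in vector databases (VecDBs) used for Retrieval-Augmented Generation (RAG): flat or single-resolution indexes cannot adapt retrieval granularity to the complexity of a user query, which leads to poor trade-offs between retrieval speed and contextual relevance. The key to the solution is Semantic Pyramid Indexing (SPI), a multi-resolution vector indexing framework that builds a semantic pyramid over document embeddings and uses a lightweight classifier to select the best resolution level for each query, enabling progressive coarse-to-fine retrieval. The method needs no offline tuning or separate model training. It achieves up to 5.7× retrieval speedup and improves end-to-end QA F1 by up to 2.5 points while maintaining semantic coverage.

Link: https://arxiv.org/abs/2511.16681
Authors: Dong Liu, Yanxuan Yu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to IEEE International Conference on Parallel and Distributed Systems 2025 (ICPADS 2025 Oral)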

Abstract: Retrieval-Augmented Generation (RAG) systems have become a dominant approach to augment large language models (LLMs) with external knowledge. However, existing vector database (VecDB) retrieval pipelines rely on flat or single-resolution indexing structures, which cannot adapt to the varying semantic granularity required by diverse user queries. This limitation leads to suboptimal trade-offs between retrieval speed and contextual relevance. To address this, we propose Semantic Pyramid Indexing (SPI), a novel multi-resolution vector indexing framework that introduces query-adaptive resolution control for RAG in VecDBs. Unlike existing hierarchical methods that require offline tuning or separate model training, SPI constructs a semantic pyramid over document embeddings and dynamically selects the optimal resolution level per query through a lightweight classifier. This adaptive approach enables progressive retrieval from coarse-to-fine representations, significantly accelerating search while maintaining semantic coverage. We implement SPI as a plugin for both FAISS and Qdrant backends and evaluate it across multiple RAG tasks including MS MARCO, Natural Questions, and multimodal retrieval benchmarks. SPI achieves up to 5.7× retrieval speedup and 1.8× memory efficiency gain while improving end-to-end QA F1 scores by up to 2.5 points compared to strong baselines. Our theoretical analysis provides guarantees on retrieval quality and latency bounds, while extensive ablation studies validate the contribution of each component. The framework’s compatibility with existing VecDB infrastructures makes it readily deployable in production RAG systems. Code is available at this https URL.
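Coarse-to-fine retrieval with a per-query resolution choice can be sketched with a two-level toy index. This is closer to a plain inverted-file search than to SPI itself, whose pyramid construction and classifier are not specified in the abstract. `choose_n_probe` is a hypothetical rule standing in for the lightweight classifier, and all vectors and identifiers are made up.

```python
import math

def top_k(query, items, k):
    """Brute-force nearest items by Euclidean distance; items are (id, vector)."""
    return sorted(items, key=lambda item: math.dist(query, item[1]))[:k]

def coarse_to_fine(query, centroids, members, n_probe, k):
    """Search the coarse centroids first, then only the documents
    that belong to the n_probe closest centroids."""
    candidates = []
    for cluster_id, _ in top_k(query, centroids, n_probe):
        candidates.extend(members[cluster_id])
    return [doc_id for doc_id, _ in top_k(query, candidates, k)]

def choose_n_probe(query_text):
    """Hypothetical stand-in for a resolution classifier:
    short queries probe one cluster, longer ones probe two."""
    return 1 if len(query_text.split()) <= 3 else 2

# Coarse level: two centroids. Fine level: documents under each centroid.
centroids = [("c0", (0.0, 0.0)), ("c1", (10.0, 10.0))]
members = {
    "c0": [("d0", (0.0, 1.0)), ("d1", (1.0, 0.0))],
    "c1": [("d2", (10.0, 9.0)), ("d3", (9.0, 10.0))],
}
hits = coarse_to_fine((0.2, 0.9), centroids, members,
                      n_probe=choose_n_probe("vector search"), k=1)
```

With `n_probe=1` only two of the four documents are compared against the query. That saving is the source of the speedup, and it costs recall when the true neighbor sits in a cluster that was not probed.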

[NLP-57] Shona spaCy: A Morphological Analyzer for an Under-Resourced Bantu Language

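[Quick read]: This paper addresses the severe shortage of morphological analysis and language-aware tools for Shona, a Bantu language. The key to the solution is Shona spaCy, an open-source, rule-based morphological pipeline built on the spaCy framework. It combines a curated JSON lexicon with linguistically grounded rules to model noun-class prefixes, verbal subject concords, tense-aspect markers, ideophones, and clitics, and integrates these into token-level annotations for lemma, part-of-speech, and morphological features. On formal and informal Shona corpora it reaches 90% POS-tagging accuracy and 88% morphological-feature accuracy while keeping its linguistic decisions transparent, and it offers a template for morphological analysis tools for other under-resourced Bantu languages.

Link: https://arxiv.org/abs/2511.16680
Authors: Happymore Masoka
Affiliations: Pace University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: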

Abstract:Despite rapid advances in multilingual natural language processing (NLP), the Bantu language Shona remains under-served in terms of morphological analysis and language-aware tools. This paper presents Shona spaCy, an open-source, rule-based morphological pipeline for Shona built on the spaCy framework. The system combines a curated JSON lexicon with linguistically grounded rules to model noun-class prefixes (Mupanda 1-18), verbal subject concords, tense-aspect markers, ideophones, and clitics, integrating these into token-level annotations for lemma, part-of-speech, and morphological features. The toolkit is available via pip install shona-spacy, with source code at this https URL and a PyPI release at this https URL. Evaluation on formal and informal Shona corpora yields 90% POS-tagging accuracy and 88% morphological-feature accuracy, while maintaining transparency in its linguistic decisions. By bridging descriptive grammar and computational implementation, Shona spaCy advances NLP accessibility and digital inclusion for Shona speakers and provides a template for morphological analysis tools for other under-resourced Bantu languages.

[NLP-58] RubiSCoT: A Framework for AI-Supported Academic Assessment

[Quick read]: This paper addresses the inefficiency and evaluator variability of traditional thesis evaluation, which affect the rigor and consistency of assessment in higher education. The key to the solution is RubiSCoT, an AI-supported framework that uses large language models (LLMs), retrieval-augmented generation (RAG), and structured chain-of-thought prompting to provide consistent, scalable, and transparent evaluation across the whole process, from proposal to final submission.

Link: https://arxiv.org/abs/2510.17309
Authors: Thorsten Fröhlich, Tim Schlippe
Affiliations: IU International University of Applied Sciences
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The evaluation of academic theses is a cornerstone of higher education, ensuring rigor and integrity. Traditional methods, though effective, are time-consuming and subject to evaluator variability. This paper presents RubiSCoT, an AI-supported framework designed to enhance thesis evaluation from proposal to final submission. Using advanced natural language processing techniques, including large language models, retrieval-augmented generation, and structured chain-of-thought prompting, RubiSCoT offers a consistent, scalable solution. The framework includes preliminary assessments, multidimensional assessments, content extraction, rubric-based scoring, and detailed reporting. We present the design and implementation of RubiSCoT, discussing its potential to optimize academic assessment processes through consistent, scalable, and transparent evaluation.

Computer Vision

[CV-0] Native 3D Editing with Full Attention

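[Quick read]: This paper addresses two key problems in current instruction-guided 3D editing: optimization-based methods are too slow for interactive use, and feed-forward methods that rely on multi-view 2D editing often produce inconsistent geometry and degraded visual quality. The core of the solution is a native 3D editing framework that manipulates 3D representations directly in a single feed-forward pass, avoiding the detour through 2D image space. Its main contributions are a large-scale multimodal dataset covering diverse addition, deletion, and modification tasks, and 3D token concatenation as a new conditioning mechanism, which is more parameter-efficient and performs better than conventional cross-attention. The method outperforms existing 2D-lifting approaches in generation quality, 3D consistency, and instruction fidelity.

Link: https://arxiv.org/abs/2511.17501
Authors: Weiwei Cai, Shuangkang Fang, Weicai Ye, Xin Dong, Yunhan Yang, Xuanyang Zhang, Wei Cheng, Yanpei Cao, Gang Yu, Tao Chen
Affiliations: Fudan University; StepFun, Inc.; Zhejiang University; Tsinghua University; VAST
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: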

Abstract:Instruction-guided 3D editing is a rapidly emerging field with the potential to broaden access to 3D content creation. However, existing methods face critical limitations: optimization-based approaches are prohibitively slow, while feed-forward approaches relying on multi-view 2D editing often suffer from inconsistent geometry and degraded visual quality. To address these issues, we propose a novel native 3D editing framework that directly manipulates 3D representations in a single, efficient feed-forward pass. Specifically, we create a large-scale, multi-modal dataset for instruction-guided 3D editing, covering diverse addition, deletion, and modification tasks. This dataset is meticulously curated to ensure that edited objects faithfully adhere to the instructional changes while preserving the consistency of unedited regions with the source object. Building upon this dataset, we explore two distinct conditioning strategies for our model: a conventional cross-attention mechanism and a novel 3D token concatenation approach. Our results demonstrate that token concatenation is more parameter-efficient and achieves superior performance. Extensive evaluations show that our method outperforms existing 2D-lifting approaches, setting a new benchmark in generation quality, 3D consistency, and instruction fidelity.

[CV-1] EvDiff: High Quality Video with an Event Camera

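[Quick read]: This paper addresses the reconstruction of high-quality color video from an event stream, a highly ill-posed task because absolute brightness information is missing. Earlier end-to-end regression methods produce usable results but with poor perceptual quality, and they are hard to scale in model capacity and training data. The key to the solution is EvDiff, an event-based diffusion model with three main elements: 1) a single forward diffusion step to reduce the computational cost of high-frame-rate video generation; 2) a temporally consistent EvEncoder to keep frames coherent; 3) a new Surrogate Training Framework that removes the dependence on paired event-image datasets, so the model can use large-scale image datasets for higher capacity. The method generates high-quality color video from monochromatic event streams alone and outperforms existing approaches on both pixel-level and perceptual metrics.

Link: https://arxiv.org/abs/2511.17492
Authors: Weilun Li, Lei Sun, Ruixi Gao, Qi Jiang, Yuqin Ma, Kaiwei Wang, Ming-Hsuan Yang, Luc Van Gool, Danda Pani Paudel
Affiliations: Zhejiang University; INSAIT; UC Merced; Google DeepMind
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: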

Abstract:As neuromorphic sensors, event cameras asynchronously record changes in brightness as streams of sparse events with the advantages of high temporal resolution and high dynamic range. Reconstructing intensity images from events is a highly ill-posed task due to the inherent ambiguity of absolute brightness. Early methods generally follow an end-to-end regression paradigm, directly mapping events to intensity frames in a deterministic manner. While effective to some extent, these approaches often yield perceptually inferior results and struggle to scale up in model capacity and training data. In this work, we propose EvDiff, an event-based diffusion model that follows a surrogate training framework to produce high-quality videos. To reduce the heavy computational cost of high-frame-rate video generation, we design an event-based diffusion model that performs only a single forward diffusion step, equipped with a temporally consistent EvEncoder. Furthermore, our novel Surrogate Training Framework eliminates the dependence on paired event-image datasets, allowing the model to leverage large-scale image datasets for higher capacity. The proposed EvDiff is capable of generating high-quality colorful videos solely from monochromatic event streams. Experiments on real-world datasets demonstrate that our method strikes a sweet spot between fidelity and realism, outperforming existing approaches on both pixel-level and perceptual metrics.

[CV-2] Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

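[Quick read]: This paper addresses failures on fine-grained evidence and hallucinations in video question answering (Video QA) that arise from single-pass perception over fixed frames, especially in text-rich videos where small, transient textual cues are hard to capture. The key to the solution is Video-R4 (Reinforcing Text-Rich Video Reasoning with Visual Rumination), a multimodal large language model built around visual rumination: it iteratively selects frames, zooms into informative regions, re-encodes the retrieved pixels, and updates its reasoning state, imitating how people repeatedly inspect complex visual content. This improves the modeling of pixel-level detail, and the model reaches state-of-the-art results on M4-ViteVQA while generalizing to several other tasks.

Link: https://arxiv.org/abs/2511.17490
Authors: Yolo Yunlong Tang, Daiki Shimada, Hang Hua, Chao Huang, Jing Bi, Rogerio Feris, Chenliang Xu
Affiliations: University of Rochester; Sony Group Corporation; MIT-IBM Watson AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: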

Abstract:Understanding text-rich videos requires reading small, transient textual cues that often demand repeated inspection. Yet most video QA models rely on single-pass perception over fixed frames, leading to hallucinations and failures on fine-grained evidence. Inspired by how humans pause, zoom, and re-read critical regions, we introduce Video-R4 (Reinforcing Text-Rich Video Reasoning with Visual Rumination), a video reasoning LMM that performs visual rumination: iteratively selecting frames, zooming into informative regions, re-encoding retrieved pixels, and updating its reasoning state. We construct two datasets with executable rumination trajectories: Video-R4-CoT-17k for supervised practice and Video-R4-RL-30k for reinforcement learning. We propose a multi-stage rumination learning framework that progressively finetunes a 7B LMM to learn atomic and mixing visual operations via SFT and GRPO-based RL. Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and further generalizes to multi-page document QA, slides QA, and generic video QA, demonstrating that iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning.

[CV-3] Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models

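[Quick read]: This paper addresses the sharp loss of visual capability when multimodal models are downscaled: when the capacity of the large language model (LLM) is reduced, its ability to understand and reason over visual information suffers disproportionately. The study finds that the drop reflects not only weaker visual reasoning but also a more fundamental loss of perception, which often matches or exceeds the impact on reasoning. To address this, the authors propose Extract+Think: visual extraction tuning first trains the model explicitly to extract instruction-relevant visual details consistently across tasks, and step-by-step reasoning is then applied to these extracted details to generate the answer. The core idea is to decouple visual perception from reasoning and to strengthen the extraction step, improving the efficiency and performance of small models on multimodal tasks.

Link: https://arxiv.org/abs/2511.17487
Authors: Mark Endo, Serena Yeung-Levy
Affiliations: Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Website at this https URL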

Abstract:Scaling up multimodal models has enabled remarkable advances in visual understanding and reasoning, but practical demands call for smaller, efficient systems. In this work, we conduct a principled analysis of downscaling intelligence in multimodal models, examining how reduced large language model (LLM) capacity affects multimodal capabilities. Our initial findings reveal an interesting trend: LLM downscaling disproportionately affects visual capabilities, rather than abilities inherited from the LLM. We then examine whether this drop mainly reflects the expected decline in visual reasoning or a more fundamental loss of perceptual abilities. Isolating the effect of LLM downscaling on perception, we find performance still drops sharply, often matching or exceeding the impact on reasoning. To address this bottleneck, we introduce visual extraction tuning, which explicitly trains the model to extract instruction-relevant visual details consistently across tasks. With these extracted visual details, we then apply step-by-step reasoning to generate answers. Together, these components form our Extract+Think approach, setting a new standard for efficiency and performance in this space.

[CV-4] An Artificial Intelligence Framework for Measuring Human Spine Aging Using MRI

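[Quick Read]: This paper addresses how to objectively quantify spine aging from magnetic resonance imaging (MRI), providing a quantifiable biomarker for assessing spine health. The key solution is a computer-vision-based deep learning method trained on more than 18,000 MRI series to predict "spine age". Uniform manifold approximation and projection (UMAP) and hierarchical density-based spatial clustering of applications with noise (HDBSCAN) are used to identify subjects with only age-related degeneration, which keeps the training data clean. Ablation studies on data size, loss function, and spine region guide model selection. The resulting Spine Age Gap (SAG) is proposed as a clinical metric and is found to be associated with disc bulges, disc osteophytes, spinal stenosis, and fractures, as well as lifestyle factors such as smoking and physically demanding work, indicating its potential as a biomarker of overall spine health.

Link: https://arxiv.org/abs/2511.17485
Authors: Roozbeh Bazargani, Saqib Abdullah Basar, Daniel Daly-Grafstein, Rodrigo Solis Pompa, Soojin Lee, Saurabh Garg, Yuntong Ma, John A. Carrino, Siavash Khallaghi, Sam Hashemi
Affiliations: Prenuvo, Redwood City, CA, USA; Weill Cornell Medical College, New York, NY, USA; Hospital for Special Surgery, New York, NY, USA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 7 figures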

Abstract:The human spine is a complex structure composed of 33 vertebrae. It holds the body and is important for leading a healthy life. The spine is vulnerable to age-related degenerations that can be identified through magnetic resonance imaging (MRI). In this paper, we propose a novel computer-vision-based deep learning method to estimate spine age using images from over 18,000 MRI series. Data are restricted to subjects with only age-related spine degeneration. Eligibility criteria are created by identifying common age-based clusters of degenerative spine conditions using uniform manifold approximation and projection (UMAP) and hierarchical density-based spatial clustering of applications with noise (HDBSCAN). Model selection is determined using a detailed ablation study on data size, loss, and the effect of different spine regions. We evaluate the clinical utility of our model by calculating the difference between actual spine age and model-predicted age, the spine age gap (SAG), and examining the association between these differences and spine degenerative conditions and lifestyle factors. We find that SAG is associated with conditions including disc bulges, disc osteophytes, spinal stenosis, and fractures, as well as lifestyle factors like smoking and physically demanding work, and thus may be a useful biomarker for measuring overall spine health.
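
As a rough illustration of the spine age gap described in the abstract, the sketch below computes the gap per subject and averages it by condition status. The sign convention (predicted minus chronological age), the record layout, and all numbers are assumptions made for illustration; none of them come from the paper.

```python
from statistics import mean


def spine_age_gap(predicted_age: float, chronological_age: float) -> float:
    """Assumed convention: a positive gap means the spine looks older than the subject."""
    return predicted_age - chronological_age


def mean_gap_by_group(records):
    """records: iterable of (predicted_age, chronological_age, has_condition)."""
    groups = {True: [], False: []}
    for predicted, actual, has_condition in records:
        groups[has_condition].append(spine_age_gap(predicted, actual))
    # Skip empty groups so mean() never sees an empty list.
    return {flag: mean(values) for flag, values in groups.items() if values}


# Hypothetical toy records, not data from the study.
records = [
    (52.0, 48.0, True),
    (61.5, 55.5, True),
    (40.0, 41.0, False),
    (35.0, 36.0, False),
]
gaps = mean_gap_by_group(records)
print(gaps)
```

Comparing the two group means is only the simplest possible association check; the abstract does not say which statistical test the authors used.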

[CV-5] Radar2Shape: 3D Shape Reconstruction from High-Frequency Radar using Multiresolution Signed Distance Functions

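[Quick Read]: This paper addresses the reconstruction of arbitrary 3D shapes from high-frequency radar signals, which matters for commercial and aerospace applications. Existing deep learning methods perform poorly on real radar signals collected over limited viewing angles, and optical 3D reconstruction methods struggle when they naively treat the radar signal as a camera view. The key to the solution is Radar2Shape, a denoising diffusion model that reconstructs 3D shapes from partially observable radar signals by correlating signal frequencies with multiresolution shape features. The method has two stages: it first learns a regularized latent space with hierarchical resolutions of shape features, then diffuses into that latent space in an analogous coarse-to-fine manner, conditioned on the radar frequencies. It recovers arbitrary shapes under limited observation and generalizes well to two different simulation methods and to real-world data.

Link: https://arxiv.org/abs/2511.17484
Authors: Neel Sortur, Justin Goodwin, Purvik Patel, Luis Enrique Martinez Jr, Tzofi Klinghoffer, Rajmonda S. Caceres, Robin Walters
Affiliations: Northeastern University; MIT Lincoln Laboratory; Massachusetts Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: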

Abstract:Determining the shape of 3D objects from high-frequency radar signals is analytically complex but critical for commercial and aerospace applications. Previous deep learning methods have been applied to radar modeling; however, they often fail to represent arbitrary shapes or have difficulty with real-world radar signals which are collected over limited viewing angles. Existing methods in optical 3D reconstruction can generate arbitrary shapes from limited camera views, but struggle when they naively treat the radar signal as a camera view. In this work, we present Radar2Shape, a denoising diffusion model that handles a partially observable radar signal for 3D reconstruction by correlating its frequencies with multiresolution shape features. Our method consists of a two-stage approach: first, Radar2Shape learns a regularized latent space with hierarchical resolutions of shape features, and second, it diffuses into this latent space by conditioning on the frequencies of the radar signal in an analogous coarse-to-fine manner. We demonstrate that Radar2Shape can successfully reconstruct arbitrary 3D shapes even from partially-observed radar signals, and we show robust generalization to two different simulation methods and real-world data. Additionally, we release two synthetic benchmark datasets to encourage future research in the high-frequency radar domain so that models like Radar2Shape can safely be adapted into real-world radar systems.

[CV-6] Counterfactual World Models via Digital Twin-conditioned Video Diffusion

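[Quick Read]: This paper addresses the inability of current world models to handle counterfactual queries: existing models only make forward predictions from factual observations and cannot simulate what happens after an intervention on scene properties. The core difficulty is that traditional world models operate directly on entangled pixel-space representations, which makes targeted interventions on specific objects or relationships hard. The key to the solution is the CWMDT framework. It first constructs a digital twin of the scene that explicitly encodes objects and their relationships as structured text, then uses a large language model (LLM) to reason about how an intervention propagates over time and alters the scene, and finally conditions a video diffusion model on the modified representation to generate visual sequences consistent with the counterfactual. The approach improves the controllability of world models for applications such as evaluating physical AI behavior under varying conditions.

Link: https://arxiv.org/abs/2511.17481
Authors: Yiqing Shen, Aiza Maksutova, Chenjia Li, Mathias Unberath
Affiliations: Johns Hopkins University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: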

Abstract:World models learn to predict the temporal evolution of visual observations given a control signal, potentially enabling agents to reason about environments through forward simulation. Because of the focus on forward simulation, current world models generate predictions based on factual observations. For many emerging applications, such as comprehensive evaluations of physical AI behavior under varying conditions, the ability of world models to answer counterfactual queries, such as “what would happen if this object was removed?”, is of increasing importance. We formalize counterfactual world models that additionally take interventions as explicit inputs, predicting temporal sequences under hypothetical modifications to observed scene properties. Traditional world models operate directly on entangled pixel-space representations where object properties and relationships cannot be selectively modified. This modeling choice prevents targeted interventions on specific scene properties. We introduce CWMDT, a framework to overcome those limitations, turning standard video diffusion models into effective counterfactual world models. First, CWMDT constructs digital twins of observed scenes to explicitly encode objects and their relationships, represented as structured text. Second, CWMDT applies large language models to reason over these representations and predict how a counterfactual intervention propagates through time to alter the observed scene. Third, CWMDT conditions a video diffusion model with the modified representation to generate counterfactual visual sequences. Evaluations on two benchmarks show that the CWMDT approach achieves state-of-the-art performance, suggesting that alternative representations of videos, such as the digital twins considered here, offer powerful control signals for video forward simulation-based world models.
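
To make the first stage more concrete, here is a minimal sketch of a scene "digital twin" stored as structured data, with one counterfactual intervention (removing an object) and a text serialization that could be handed to a language model. The class name, field layout, and JSON format are hypothetical; the abstract only states that the twin is represented as structured text, and the LLM reasoning and video diffusion stages are not modeled here.

```python
import json
from dataclasses import dataclass, field


@dataclass
class DigitalTwin:
    """Hypothetical scene description: objects plus (subject, predicate, object) relations."""
    objects: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)

    def remove_object(self, name: str) -> "DigitalTwin":
        """Return a new twin without `name` and without any relation that mentions it."""
        objects = {key: value for key, value in self.objects.items() if key != name}
        relations = [rel for rel in self.relations if name not in (rel[0], rel[2])]
        return DigitalTwin(objects, relations)

    def to_text(self) -> str:
        """Serialize the twin to structured text (JSON is an illustrative choice)."""
        return json.dumps({"objects": self.objects, "relations": self.relations}, sort_keys=True)


twin = DigitalTwin(
    objects={"cup": {"color": "red"}, "table": {"material": "wood"}},
    relations=[("cup", "on", "table")],
)
edited = twin.remove_object("cup")  # the "what if this object was removed?" query
print(edited.to_text())
```

Returning a new object instead of mutating the original keeps the factual scene available for comparison with the counterfactual one.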

[CV-7] GPR-OdomNet: Difference and Similarity-Driven Odometry Estimation Network for Ground Penetrating Radar-Based Localization

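[Quick Read]: This paper addresses robot and vehicle localization with ground penetrating radar (GPR) under adverse weather and environmental conditions, where existing techniques struggle to accurately estimate the Euclidean distance traveled between B-scan images that differ only slightly. The key to the solution is a new neural-network odometry method that extracts multi-scale features from GPR B-scan images taken at consecutive moments and analyzes the similarities and differences between these features to estimate the Euclidean distance traveled with high accuracy.

Link: https://arxiv.org/abs/2511.17457
Authors: Huaichao Wang, Xuanxin Fan, Ji Liu, Haifeng Li, Dezhen Song
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: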

Abstract:When performing robot/vehicle localization using ground penetrating radar (GPR) to handle adverse weather and environmental conditions, existing techniques often struggle to accurately estimate distances when processing B-scan images with minor distinctions. This study introduces a new neural network-based odometry method that leverages the similarity and difference features of GPR B-scan images for precise estimation of the Euclidean distances traveled between the B-scan images. The new custom neural network extracts multi-scale features from B-scan images taken at consecutive moments and then determines the Euclidean distance traveled by analyzing the similarities and differences between these features. To evaluate our method, an ablation study and comparison experiments have been conducted using the publicly available CMU-GPR dataset. The experimental results show that our method consistently outperforms state-of-the-art counterparts in all tests. Specifically, our method achieves an overall weighted root mean square error (RMSE) of 0.449 m across all datasets, a 10.2% reduction in RMSE compared to the best state-of-the-art method.
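
For readers unfamiliar with the metric, the sketch below shows one common way to compute a per-sequence RMSE and to pool several sequences into an overall weighted RMSE. Weighting by sample count is an assumption; the abstract does not say how the weights are chosen, and the values below are toy numbers rather than results from the paper.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences of distances."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def weighted_rmse(per_sequence):
    """Pool (rmse_value, sample_count) pairs, weighting squared errors by sample count."""
    total = sum(count for _, count in per_sequence)
    pooled = sum(count * value ** 2 for value, count in per_sequence) / total
    return math.sqrt(pooled)


# Toy example with made-up distances in meters.
print(rmse([1.0, 2.0], [1.0, 4.0]))
print(weighted_rmse([(0.0, 3), (2.0, 1)]))
```

Pooling the squared errors rather than averaging the RMSE values directly gives the same result as computing one RMSE over all samples at once.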

[CV-8] Improving Multimodal Distillation for 3D Semantic Segmentation under Domain Shift BMVC2025

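[Quick Read]: This paper addresses the poor generalization of lidar point cloud semantic segmentation networks to unseen lidar sensors, that is, the marked performance drop under domain shift. The key to the solution is to exploit the cross-domain robust features provided by vision foundation models (VFMs) within unsupervised domain adaptation (UDA). The study finds that (1) the architecture of the lidar backbone is key to maximizing generalization on the target domain; (2) a single backbone can be pretrained once and reused across many domain shifts; and (3) the best results come from keeping the pretrained backbone frozen and training only an MLP head for semantic segmentation. The resulting pipeline achieves state-of-the-art results in four challenging benchmark settings.

Link: https://arxiv.org/abs/2511.17455
Authors: Björn Michele, Alexandre Boulch, Gilles Puy, Tuan-Hung Vu, Renaud Marlet, Nicolas Courty
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at BMVC 2025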

Abstract:Semantic segmentation networks trained under full supervision for one type of lidar fail to generalize to unseen lidars without intervention. To reduce the performance gap under domain shifts, a recent trend is to leverage vision foundation models (VFMs) providing robust features across domains. In this work, we conduct an exhaustive study to identify recipes for exploiting VFMs in unsupervised domain adaptation for semantic segmentation of lidar point clouds. Building upon unsupervised image-to-lidar knowledge distillation, our study reveals that: (1) the architecture of the lidar backbone is key to maximizing the generalization performance on a target domain; (2) it is possible to pretrain a single backbone once and for all, and use it to address many domain shifts; (3) best results are obtained by keeping the pretrained backbone frozen and training an MLP head for semantic segmentation. The resulting pipeline achieves state-of-the-art results in four widely-recognized and challenging settings. The code will be available at: this https URL.

[CV-9] Illustrator's Depth: Monocular Layer Index Prediction for Image Decomposition

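[Quick Read]: This paper addresses a key problem in digital content creation: decomposing a flat raster image into editable, ordered layers. Depth is traditionally tied to physical scene understanding; the paper instead introduces "Illustrator's Depth", which reframes depth as an abstraction for creative work by predicting a discrete layer index for every pixel, yielding a globally consistent and editable decomposition. The key to the solution is a neural network trained on a curated dataset of layered vector graphics that predicts layer ordering directly from raster input, supporting downstream applications such as high-fidelity text-to-vector-graphics generation, automatic 3D relief generation, and intuitive depth-aware editing.

Link: https://arxiv.org/abs/2511.17454
Authors: Nissim Maruani, Peiying Zhang, Siddhartha Chaudhuri, Matthew Fisher, Nanxuan Zhao, Vladimir G. Kim, Pierre Alliez, Mathieu Desbrun, Wang Yifan
Affiliations: Inria, UCA; CityUHK; Adobe Research; Inria/X, IP Paris
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: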

Abstract:We introduce Illustrator’s Depth, a novel definition of depth that addresses a key challenge in digital content creation: decomposing flat images into editable, ordered layers. Inspired by an artist’s compositional process, illustrator’s depth assigns a layer index to each pixel, forming an interpretable image decomposition through a discrete, globally consistent ordering of elements optimized for editability. We also propose and train a neural network using a curated dataset of layered vector graphics to predict layering directly from raster inputs. Our layer index inference unlocks a range of powerful downstream applications. In particular, it significantly outperforms state-of-the-art baselines for image vectorization while also enabling high-fidelity text-to-vector-graphics generation, automatic 3D relief generation from 2D images, and intuitive depth-aware editing. By reframing depth from a physical quantity to a creative abstraction, illustrator’s depth prediction offers a new foundation for editable image decomposition.
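
The idea of a per-pixel layer index can be illustrated with a small sketch that turns an index map into one binary mask per layer. The function name and the nested-list image format are hypothetical, and the map below is a hand-written toy input rather than a network prediction.

```python
def split_into_layers(layer_index_map):
    """Return {layer_index: binary mask}, with keys in ascending index order."""
    indices = sorted({value for row in layer_index_map for value in row})
    return {
        index: [[1 if value == index else 0 for value in row] for row in layer_index_map]
        for index in indices
    }


# Toy 2x3 "prediction": three layers with indices 0, 1, and 2.
layers = split_into_layers([[0, 0, 1], [0, 2, 1]])
for index, mask in layers.items():
    print(index, mask)
```

Each mask can then be edited or vectorized on its own, which is the kind of downstream use the abstract describes.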

[CV-10] MMT-ARD: Multimodal Multi-Teacher Adversarial Distillation for Robust Vision-Language Models

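[Quick Read]: This paper addresses the insufficient adversarial robustness of Vision-Language Models (VLMs) in safety-critical applications. Traditional single-teacher adversarial knowledge distillation suffers from limited knowledge diversity, slow convergence, and difficulty balancing robustness and accuracy. The key to the solution is MMT-ARD, a Multimodal Multi-Teacher Adversarial Robust Distillation framework, whose main contributions are: (1) a dual-teacher knowledge fusion architecture that jointly optimizes clean feature preservation and robust feature enhancement; (2) a dynamic weight allocation strategy based on teacher confidence that adaptively focuses on harder adversarial examples; and (3) an adaptive sigmoid-based weighting function that mitigates bias among teachers and balances the strength of knowledge transfer across modalities. Experiments on ImageNet and zero-shot benchmarks show that MMT-ARD improves robust accuracy by 4.32% and zero-shot accuracy by 3.5% on ViT-B-32, with a 2.3x increase in training efficiency over traditional single-teacher methods.

Link: https://arxiv.org/abs/2511.17448
Authors: Yuqi Li, Junhao Dong, Chuanguang Yang, Shiping Wen, Piotr Koniusz, Tingwen Huang, Yingli Tian, Yew-Soon Ong
Affiliations: The City University of New York; Nanyang Technological University; Institute of Computing Technology, Chinese Academy of Sciences; University of Technology Sydney; Data61, CSIRO; Shenzhen University of Advanced Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages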

Abstract:Vision-Language Models (VLMs) are increasingly deployed in safety-critical applications, making their adversarial robustness a crucial concern. While adversarial knowledge distillation has shown promise in transferring robustness from teacher to student models, traditional single-teacher approaches suffer from limited knowledge diversity, slow convergence, and difficulty in balancing robustness and accuracy. To address these challenges, we propose MMT-ARD: a Multimodal Multi-Teacher Adversarial Robust Distillation framework. Our key innovation is a dual-teacher knowledge fusion architecture that collaboratively optimizes clean feature preservation and robust feature enhancement. To better handle challenging adversarial examples, we introduce a dynamic weight allocation strategy based on teacher confidence, enabling adaptive focus on harder samples. Moreover, to mitigate bias among teachers, we design an adaptive sigmoid-based weighting function that balances the strength of knowledge transfer across modalities. Extensive experiments on ImageNet and zero-shot benchmarks demonstrate that MMT-ARD improves robust accuracy by +4.32% and zero-shot accuracy by +3.5% on the ViT-B-32 model, while achieving a 2.3x increase in training efficiency over traditional single-teacher methods. These results highlight the effectiveness and scalability of MMT-ARD in enhancing the adversarial robustness of multimodal large models. Our codes are available at this https URL.
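
The abstract does not give the weighting formula, so the following is only a sketch of what a sigmoid-based, confidence-driven split between a clean teacher and a robust teacher could look like. The function names, the `sharpness` parameter, and the choice to favor the more confident teacher are all assumptions, not the authors' method.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def teacher_weights(clean_conf: float, robust_conf: float, sharpness: float = 5.0):
    """Return (w_clean, w_robust) summing to 1; the more confident teacher gets more weight."""
    w_clean = sigmoid(sharpness * (clean_conf - robust_conf))
    return w_clean, 1.0 - w_clean


def fused_distillation_loss(loss_clean, loss_robust, clean_conf, robust_conf):
    """Weighted sum of the two per-teacher distillation losses for one sample."""
    w_clean, w_robust = teacher_weights(clean_conf, robust_conf)
    return w_clean * loss_clean + w_robust * loss_robust


# Hypothetical per-sample confidences and losses.
print(teacher_weights(0.9, 0.4))
print(fused_distillation_loss(2.0, 4.0, 0.5, 0.5))
```

A smooth sigmoid avoids hard switching between teachers, which is one plausible reading of how such a function could limit bias toward either teacher.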

[CV-11] REMSA: An LLM Agent for Foundation Model Selection in Remote Sensing

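[Quick Read]: This paper addresses the difficulty of selecting a remote sensing foundation model (RSFM), which arises from scattered documentation, heterogeneous formats, and varied deployment constraints. The key to the solution is a structured RSFM database (RS-FMD) and, built on it, REMSA, the first agent based on a large language model (LLM) for RSFM selection. REMSA interprets natural language queries, resolves missing constraints, ranks candidate models using in-context learning, and provides transparent justifications for its recommendations. On a benchmark of 75 expert-verified remote sensing query scenarios, REMSA outperforms baselines including naive agents, dense retrieval, and unstructured RAG-based LLMs. It operates only on publicly available metadata and does not access private or sensitive data.

Link: https://arxiv.org/abs/2511.17442
Authors: Binger Chen, Tacettin Emre Bök, Behnood Rasti, Volker Markl, Begüm Demir
Affiliations: Technische Universität Berlin; BIFOLD
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Code and data available at this https URL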

Abstract:Foundation Models (FMs) are increasingly used in remote sensing (RS) for tasks such as environmental monitoring, disaster assessment, and land-use mapping. These models include unimodal vision encoders trained on a single data modality and multimodal architectures trained on combinations of SAR, multispectral, hyperspectral, and image-text data. They support diverse RS tasks including semantic segmentation, image classification, change detection, and visual question answering. However, selecting an appropriate remote sensing foundation model (RSFM) remains difficult due to scattered documentation, heterogeneous formats, and varied deployment constraints. We introduce the RSFM Database (RS-FMD), a structured resource covering over 150 RSFMs spanning multiple data modalities, resolutions, and learning paradigms. Built on RS-FMD, we present REMSA, the first LLM-based agent for automated RSFM selection from natural language queries. REMSA interprets user requirements, resolves missing constraints, ranks candidate models using in-context learning, and provides transparent justifications. We also propose a benchmark of 75 expert-verified RS query scenarios, producing 900 configurations under an expert-centered evaluation protocol. REMSA outperforms several baselines, including naive agents, dense retrieval, and unstructured RAG-based LLMs. It operates entirely on publicly available metadata and does not access private or sensitive data.

[CV-12] Self-Supervised Learning by Curvature Alignment

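[Quick Read]: This paper addresses the neglect of the local geometry of the data manifold in current self-supervised learning (SSL) methods. Existing non-contrastive SSL methods shape the first- and second-order statistics of the representation (through variance, covariance, or redundancy-reduction terms) but ignore local geometric information such as how the manifold bends (curvature). The authors propose CurvSSL and its extension to a reproducing kernel Hilbert space (RKHS), kernel CurvSSL. The core innovation is a curvature-based regularizer: each embedding is treated as a vertex, a discrete curvature score is computed from cosine interactions with its k nearest neighbors on the unit hypersphere, and a Barlow Twins-style loss aligns and decorrelates the curvature-derived matrix across augmented views, encouraging both view invariance and consistent local manifold bending. Experiments on MNIST and CIFAR-10 show competitive or improved linear evaluation performance compared with Barlow Twins and VICReg, supporting explicit modeling of local geometry as a complement to purely statistical regularizers.

Link: https://arxiv.org/abs/2511.17426
Authors: Benyamin Ghojogh, M.Hadi Sepanj, Paul Fieguth
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: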

Abstract:Self-supervised learning (SSL) has recently advanced through non-contrastive methods that couple an invariance term with variance, covariance, or redundancy-reduction penalties. While such objectives shape first- and second-order statistics of the representation, they largely ignore the local geometry of the underlying data manifold. In this paper, we introduce CurvSSL, a curvature-regularized self-supervised learning framework, and its RKHS extension, kernel CurvSSL. Our approach retains a standard two-view encoder-projector architecture with a Barlow Twins-style redundancy-reduction loss on projected features, but augments it with a curvature-based regularizer. Each embedding is treated as a vertex whose k nearest neighbors define a discrete curvature score via cosine interactions on the unit hypersphere; in the kernel variant, curvature is computed from a normalized local Gram matrix in an RKHS. These scores are aligned and decorrelated across augmentations by a Barlow-style loss on a curvature-derived matrix, encouraging both view invariance and consistency of local manifold bending. Experiments on MNIST and CIFAR-10 datasets with a ResNet-18 backbone show that curvature-regularized SSL yields competitive or improved linear evaluation performance compared to Barlow Twins and VICReg. Our results indicate that explicitly shaping local geometry is a simple and effective complement to purely statistical SSL regularizers.
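
Since the abstract leans on a Barlow Twins-style loss, a plain-Python sketch of that loss may help: standardize each feature column over the batch, form the cross-correlation matrix between two views, push its diagonal toward 1, and penalize off-diagonal entries. What exactly CurvSSL feeds into this loss (the curvature-derived matrix) is not specified in the abstract, so the inputs here are generic feature rows, and `off_diag_weight` is an illustrative default.

```python
import math


def standardize_columns(rows):
    """Zero-mean, unit-variance columns, using the population variance over the batch."""
    n = len(rows)
    out_columns = []
    for col in zip(*rows):
        mu = sum(col) / n
        std = math.sqrt(sum((x - mu) ** 2 for x in col) / n) or 1.0  # guard constant columns
        out_columns.append([(x - mu) / std for x in col])
    return [list(row) for row in zip(*out_columns)]


def barlow_style_loss(view_a, view_b, off_diag_weight=0.005):
    """Sum of (1 - C_ii)^2 plus weighted C_ij^2 for i != j, with C the cross-correlation."""
    a, b = standardize_columns(view_a), standardize_columns(view_b)
    n, d = len(a), len(a[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c_ij = sum(a[s][i] * b[s][j] for s in range(n)) / n
            loss += (1.0 - c_ij) ** 2 if i == j else off_diag_weight * c_ij ** 2
    return loss


# Two identical views with uncorrelated columns give a loss of zero.
z = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
print(barlow_style_loss(z, z))
```

The diagonal term rewards agreement between views, while the off-diagonal term discourages redundant feature dimensions.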

[CV-13] Preventing Shortcut Learning in Medical Image Analysis through Intermediate Layer Knowledge Distillation from Specialist Teachers

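[Quick Read]: This paper addresses shortcut learning in deep learning models, where a model bases its predictions on features that are spuriously correlated with the task in the training data but irrelevant to it. In high-risk settings such as medical image analysis, this can keep models from using clinically meaningful features, reducing robustness and endangering patients. The key to the solution is a novel knowledge distillation framework in which a teacher network fine-tuned on a small subset of task-relevant data guides a student network trained on a large dataset corrupted with a bias feature. The method targets how different types of shortcuts (localized or diffuse) manifest in the intermediate layers of the network, and it improves consistently over Empirical Risk Minimization, augmentation-based bias mitigation, and group-based bias mitigation. On several medical imaging datasets it often reaches performance comparable to a baseline trained on bias-free data, even on out-of-distribution test data.

Link: https://arxiv.org/abs/2511.17421
Authors: Christopher Boland, Sotirios Tsaftaris, Sonia Dahdouh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) this https URL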

Abstract:Deep learning models are prone to learning shortcut solutions to problems using spuriously correlated yet irrelevant features of their training data. In high-risk applications such as medical image analysis, this phenomenon may prevent models from using clinically meaningful features when making predictions, potentially leading to poor robustness and harm to patients. We demonstrate that different types of shortcuts (those that are diffuse and spread throughout the image, as well as those that are localized to specific areas) manifest distinctly across network layers and can, therefore, be more effectively targeted through mitigation strategies that target the intermediate layers. We propose a novel knowledge distillation framework that leverages a teacher network fine-tuned on a small subset of task-relevant data to mitigate shortcut learning in a student network trained on a large dataset corrupted with a bias feature. Through extensive experiments on CheXpert, ISIC 2017, and SimBA datasets using various architectures (ResNet-18, AlexNet, DenseNet-121, and 3D CNNs), we demonstrate consistent improvements over traditional Empirical Risk Minimization, augmentation-based bias-mitigation, and group-based bias-mitigation approaches. In many cases, we achieve comparable performance with a baseline model trained on bias-free data, even on out-of-distribution test data. Our results demonstrate the practical applicability of our approach to real-world medical imaging scenarios where bias annotations are limited and shortcut features are difficult to identify a priori.

[CV-14] Sparse Mixture-of-Experts for Multi-Channel Imaging: Are All Channel Interactions Required? NEURIPS

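[Quick Read]: This paper addresses the computational inefficiency that cross-channel attention causes in Vision Transformers (ViTs) applied to multi-channel images such as cell painting or satellite imagery. Existing methods typically tokenize each channel independently, which improves performance but introduces channel-wise comparisons in attention, so FLOPs grow quadratically and training cost rises sharply. The key to the solution is MoE-ViT, an architecture inspired by sparse Mixture-of-Experts (MoE) that treats each channel as an expert and uses a lightweight router to select only the most relevant experts per patch for attention, substantially reducing computation while maintaining, and in some cases improving, performance.

Link: https://arxiv.org/abs/2511.17400
Authors: Sukwon Yun, Heming Yao, Burkhard Hoeckendorf, David Richmond, Aviv Regev, Russell Littman
Affiliations: University of North Carolina at Chapel Hill; Research and Early Development (gRED), Genentech; Biology Research — AI Development (BRAID), Genentech
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This has been accepted at the NeurIPS AI4Science Workshop 2025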

Abstract:Vision Transformers (ViTs) have become the backbone of vision foundation models, yet their optimization for multi-channel domains - such as cell painting or satellite imagery - remains underexplored. A key challenge in these domains is capturing interactions between channels, as each channel carries different information. While existing works have shown efficacy by treating each channel independently during tokenization, this approach naturally introduces a major computational bottleneck in the attention block - channel-wise comparisons lead to a quadratic growth in attention, resulting in excessive FLOPs and high training cost. In this work, we shift focus from efficacy to the overlooked efficiency challenge in cross-channel attention and ask: “Is it necessary to model all channel interactions?”. Inspired by the philosophy of Sparse Mixture-of-Experts (MoE), we propose MoE-ViT, a Mixture-of-Experts architecture for multi-channel images in ViTs, which treats each channel as an expert and employs a lightweight router to select only the most relevant experts per patch for attention. Proof-of-concept experiments on real-world datasets - JUMP-CP and So2Sat - demonstrate that MoE-ViT achieves substantial efficiency gains without sacrificing, and in some cases enhancing, performance, making it a practical and attractive backbone for multi-channel imaging.
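
The routing step can be sketched as ordinary top-k gating: score every channel for a given patch, keep the k highest-scoring channels, and renormalize their gate weights. The function names and the logits below are hypothetical, and how MoE-ViT actually produces router scores is not described in the abstract.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_channels(router_logits, top_k):
    """Return (selected channel indices, gate weights renormalized over the selection)."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)[:top_k]
    mass = sum(probs[c] for c in chosen)
    return chosen, [probs[c] / mass for c in chosen]


# Hypothetical router logits for one patch over four channels.
chosen, gates = route_channels([0.1, 2.0, -1.0, 1.5], top_k=2)
print(chosen, gates)
```

Only the tokens of the selected channels would then take part in attention for that patch, which is where the savings over all-pairs channel attention would come from.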

[CV-15] MCMoE: Completing Missing Modalities with Mixture of Experts for Incomplete Multimodal Action Quality Assessment AAAI2026

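[Quick Read]: This paper addresses the failure and severe performance degradation of multimodal Action Quality Assessment (AQA) models when some modalities are missing at inference. Existing methods rely on complete multimodal input; once a modality is absent, the interruption of cross-modal interaction causes catastrophic degradation. The key to the solution is a unified single-stage training framework, the Missing Completion Framework with Mixture of Experts (MCMoE), whose main components are: (1) an adaptive gated modality generator that dynamically fuses available information to reconstruct missing modalities; and (2) modality experts that learn unimodal knowledge and dynamically mix the outputs of all experts to extract cross-modal joint representations, further refining and complementing the missing modalities. The framework achieves state-of-the-art results in both complete and incomplete multimodal learning on three public AQA benchmarks.

Link: https://arxiv.org/abs/2511.17397
Authors: Huangbiao Xu, Huanqi Wu, Xiao Ke, Junyi Wu, Rui Xu, Jinglin Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2026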

Abstract:Multimodal Action Quality Assessment (AQA) has recently emerged as a promising paradigm. By leveraging complementary information across shared contextual cues, it enhances the discriminative evaluation of subtle intra-class variations in highly similar action sequences. However, partial modalities are frequently unavailable at the inference stage in reality. The absence of any modality often renders existing multimodal models inoperable. Furthermore, it triggers catastrophic performance degradation due to interruptions in cross-modal interactions. To address this issue, we propose a novel Missing Completion Framework with Mixture of Experts (MCMoE) that unifies unimodal and joint representation learning in single-stage training. Specifically, we propose an adaptive gated modality generator that dynamically fuses available information to reconstruct missing modalities. We then design modality experts to learn unimodal knowledge and dynamically mix the knowledge of all experts to extract cross-modal joint representations. With a mixture of experts, missing modalities are further refined and complemented. Finally, in the training phase, we mine the complete multimodal features and unimodal expert knowledge to guide modality generation and generation-based joint representation extraction. Extensive experiments demonstrate that our MCMoE achieves state-of-the-art results in both complete and incomplete multimodal learning on three public AQA benchmarks. Code is available at this https URL.

[CV-16] Designing and Generating Diverse Equitable Face Image Datasets for Face Verification Tasks

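[Quick Read]: This paper addresses the limited fairness and effectiveness of face verification systems caused by demographic biases (such as race and gender) in existing datasets. The key to the solution is a comprehensive methodology that integrates advanced generative models to create diverse, high-quality synthetic face images covering a wide range of facial traits while adhering to the characteristics permissible in identity card photographs. The authors also introduce the Diverse and Inclusive Faces for Verification (DIF-V) dataset, with 27,780 images of 926 unique identities, as a benchmark for future research toward fairer and more reliable face verification.

Link: https://arxiv.org/abs/2511.17393
Authors: Georgia Baltsou, Ioannis Sarridis, Christos Koutlis, Symeon Papadopoulos
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: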

Abstract:Face verification is a significant component of identity authentication in various applications including online banking and secure access to personal devices. Most existing face image datasets suffer from notable biases related to race, gender, and other demographic characteristics, limiting the effectiveness and fairness of face verification systems. In response to these challenges, we propose a comprehensive methodology that integrates advanced generative models to create varied and diverse high-quality synthetic face images. This methodology emphasizes the representation of a diverse range of facial traits, ensuring adherence to characteristics permissible in identity card photographs. Furthermore, we introduce the Diverse and Inclusive Faces for Verification (DIF-V) dataset, comprising 27,780 images of 926 unique identities, designed as a benchmark for future research in face verification. Our analysis reveals that existing verification models exhibit biases toward certain genders and races, and notably, applying identity style modifications negatively impacts model performance. By tackling the inherent inequities in existing datasets, this work not only enriches the discussion on diversity and ethics in artificial intelligence but also lays the foundation for developing more inclusive and reliable face verification technologies.

[CV-17] MorphSeek: Fine-grained Latent Representation-Level Policy Optimization for Deformable Image Registration

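[Quick Read]: This paper addresses core challenges of deformable image registration (DIR) in medical image analysis: the computational complexity of high-dimensional dense displacement fields and the scarcity of voxel-level supervision. Existing reinforcement learning frameworks usually project the high-dimensional deformation space into coarse representations that cannot capture spatially variant deformations. The key to the solution is MorphSeek, a fine-grained representation-level policy optimization paradigm that reformulates DIR as a spatially continuous optimization process in the latent feature space. A stochastic Gaussian policy head on top of the encoder models a distribution over latent features, enabling efficient exploration and coarse-to-fine refinement. Unsupervised warm-up is combined with weakly supervised fine-tuning through Group Relative Policy Optimization, where multi-trajectory sampling stabilizes training and improves label efficiency, so registration accuracy improves while parameter cost and step-level latency overhead stay low.

Link: https://arxiv.org/abs/2511.17392
Authors: Runxun Zhang, Yizhou Liu, Li Dongrui, Bo XU, Jingwei Wei
Affiliations: Institute of Automation, Chinese Academy of Sciences; Fudan University; The Second Hospital of Hebei Medical University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: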

Abstract:Deformable image registration (DIR) remains a fundamental yet challenging problem in medical image analysis, largely due to the prohibitively high-dimensional deformation space of dense displacement fields and the scarcity of voxel-level supervision. Existing reinforcement learning frameworks often project this space into coarse, low-dimensional representations, limiting their ability to capture spatially variant deformations. We propose MorphSeek, a fine-grained representation-level policy optimization paradigm that reformulates DIR as a spatially continuous optimization process in the latent feature space. MorphSeek introduces a stochastic Gaussian policy head atop the encoder to model a distribution over latent features, facilitating efficient exploration and coarse-to-fine refinement. The framework integrates unsupervised warm-up with weakly supervised fine-tuning through Group Relative Policy Optimization, where multi-trajectory sampling stabilizes training and improves label efficiency. Across three 3D registration benchmarks (OASIS brain MRI, LiTS liver CT, and Abdomen MR-CT), MorphSeek achieves consistent Dice improvements over competitive baselines while maintaining high label efficiency with minimal parameter cost and low step-level latency overhead. Beyond optimizer specifics, MorphSeek advances a representation-level policy learning paradigm that achieves spatially coherent and data-efficient deformation optimization, offering a principled, backbone-agnostic, and optimizer-agnostic solution for scalable visual alignment in high-dimensional settings.
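
Two ingredients named in the abstract, a Gaussian policy over latent features and group-relative advantages, can be sketched in a few lines. This is a generic illustration of the group-relative normalization used in GRPO-style training, not MorphSeek's implementation; the function names, the use of the population standard deviation, and the toy rewards are assumptions.

```python
import math
import random
from statistics import mean, pstdev


def sample_latent(mu, log_std, rng):
    """Draw one latent vector from a diagonal Gaussian policy: z = mu + exp(log_std) * eps."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu, log_std)]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each trajectory's reward by the mean and spread of its own sampling group."""
    baseline = mean(rewards)
    scale = pstdev(rewards)
    return [(r - baseline) / (scale + eps) for r in rewards]


rng = random.Random(0)
group = [sample_latent([0.0, 1.0], [-1.0, -1.0], rng) for _ in range(3)]
# Hypothetical rewards (for example, Dice scores) for the three sampled trajectories.
advantages = group_relative_advantages([0.7, 0.8, 0.9])
print(advantages)
```

Because the baseline is the group mean, no separate value network is needed, which fits the abstract's emphasis on low parameter cost.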

[CV-18] IndustryNav: Exploring Spatial Reasoning of Embodied Agents in Dynamic Industrial Navigation

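[Quick Read]: This paper addresses the weak spatial reasoning of Visual Large Language Models (VLLMs) acting as embodied agents. Existing benchmarks are mostly confined to static, passive household environments and cannot fully evaluate combined local-global planning and safe behavior in dynamic, real-world scenes. The key to the solution is IndustryNav, the first dynamic benchmark for industrial navigation. It comprises 12 high-fidelity Unity warehouse scenarios with dynamic objects and human movement, uses a PointGoal navigation pipeline that combines egocentric vision with global odometry, and introduces "collision rate" and "warning rate" metrics to quantify safety-oriented behavior and distance estimation, pushing embodied research from passive perception toward stable planning, active exploration, and safe behavior.

Link: https://arxiv.org/abs/2511.17384
Authors: Yifan Li, Lichi Li, Anh Dao, Xinyu Zhou, Yicheng Qiao, Zheda Mai, Daeun Lee, Zichen Chen, Zhen Tan, Mohit Bansal, Yu Kong
Affiliations: Michigan State University; University of North Carolina at Chapel Hill; Ohio State University; University of California, Santa Barbara; Arizona State University; Independent researcher
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: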

Abstract:While Visual Large Language Models (VLLMs) show great promise as embodied agents, they continue to face substantial challenges in spatial reasoning. Existing embodied benchmarks largely focus on passive, static household environments and evaluate only isolated capabilities, failing to capture holistic performance in dynamic, real-world complexity. To fill this gap, we present IndustryNav, the first dynamic industrial navigation benchmark for active spatial reasoning. IndustryNav leverages 12 manually created, high-fidelity Unity warehouse scenarios featuring dynamic objects and human movement. Our evaluation employs a PointGoal navigation pipeline that effectively combines egocentric vision with global odometry to assess holistic local-global planning. Crucially, we introduce the “collision rate” and “warning rate” metrics to measure safety-oriented behaviors and distance estimation. A comprehensive study of nine state-of-the-art VLLMs (including models such as GPT-5-mini, Claude-4.5, and Gemini-2.5) reveals that closed-source models maintain a consistent advantage; however, all agents exhibit notable deficiencies in robust path planning, collision avoidance, and active exploration. This highlights a critical need for embodied research to move beyond passive perception and toward tasks that demand stable planning, active exploration, and safe behavior in dynamic, real-world environments.

[CV-19] Non-Parametric Probabilistic Robustness: A Conservative Metric with Optimized Perturbation Distributions

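[Quick read]: This paper addresses a practical limitation of current probabilistic robustness (PR) evaluation: it assumes a fixed, known perturbation distribution, which rarely holds in practice. The authors propose non-parametric probabilistic robustness (NPPR), which requires no predefined perturbation distribution and instead learns an optimized perturbation distribution directly from data, yielding a more conservative and more reliable robustness assessment under distributional uncertainty. The NPPR estimator is built on a Gaussian Mixture Model (GMM) with Multilayer Perceptron (MLP) heads and bicubic up-sampling, and covers both input-dependent and input-independent perturbation scenarios. A theoretical analysis clarifies the relationships among adversarial robustness (AR), PR, and NPPR. In experiments, NPPR gives PR estimates up to 40% lower (more conservative) than those obtained under the fixed perturbation distributions commonly assumed by existing methods.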

Link: https://arxiv.org/abs/2511.17380
Authors: Zheng Wang, Yi Zhang, Siddartha Khastgir, Carsten Maple, Xingyu Zhao
Affiliations: WMG, University of Warwick
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Deep learning (DL) models, despite their remarkable success, remain vulnerable to small input perturbations that can cause erroneous outputs, motivating the recent proposal of probabilistic robustness (PR) as a complementary alternative to adversarial robustness (AR). However, existing PR formulations assume a fixed and known perturbation distribution, an unrealistic expectation in practice. To address this limitation, we propose non-parametric probabilistic robustness (NPPR), a more practical PR metric that does not rely on any predefined perturbation distribution. Following the non-parametric paradigm in statistical modeling, NPPR learns an optimized perturbation distribution directly from data, enabling conservative PR evaluation under distributional uncertainty. We further develop an NPPR estimator based on a Gaussian Mixture Model (GMM) with Multilayer Perceptron (MLP) heads and bicubic up-sampling, covering various input-dependent and input-independent perturbation scenarios. Theoretical analyses establish the relationships among AR, PR, and NPPR. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet across ResNet18/50, WideResNet50 and VGG16 validate NPPR as a more practical robustness metric, showing up to 40% more conservative (lower) PR estimates compared to assuming those common perturbation distributions used in state-of-the-arts.
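
To make the contrast between a fixed-distribution PR estimate and a more conservative one concrete, here is a minimal sketch. It is not the paper's GMM-based estimator: it assumes PR is the probability that the prediction stays correct under a random perturbation, estimates it by Monte Carlo sampling, and takes the minimum over a small hand-picked family of distributions as a stand-in for the learned worst case. The toy classifier, the two samplers, and all function names are hypothetical.

```python
import random


def estimate_pr(classify, x, label, sample_perturbation, n_samples=10_000, seed=0):
    """Monte Carlo estimate of P[classify(x + delta) == label]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        delta = sample_perturbation(rng)
        if classify(x + delta) == label:
            hits += 1
    return hits / n_samples


def conservative_pr(classify, x, label, candidate_samplers, **kwargs):
    """Minimum PR over a finite family of candidate perturbation distributions."""
    return min(estimate_pr(classify, x, label, s, **kwargs) for s in candidate_samplers)


# Toy 1-D setup (hypothetical): a threshold classifier and two perturbation laws.
def classify(value):
    return int(value > 0.0)


def symmetric(rng):
    # The fixed distribution a standard PR evaluation might assume.
    return rng.uniform(-1.0, 1.0)


def one_sided(rng):
    # An alternative distribution that the fixed choice ignores.
    return rng.uniform(-1.0, 0.0)


pr_fixed = estimate_pr(classify, 0.5, 1, symmetric)
pr_conservative = conservative_pr(classify, 0.5, 1, [symmetric, one_sided])
print(pr_fixed, pr_conservative)
```

In this toy case the fixed symmetric distribution gives a PR near 0.75, while the worst case over the two candidates is near 0.5, which illustrates why optimizing over the perturbation distribution can only lower the estimate.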

[CV-20] METIS: Multi-Source Egocentric Training for Integrated Dexterous Vision-Language-Action Model

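[Quick read]: This paper tackles dexterous manipulation for generalist robots across diverse tasks, where the main training bottleneck is the scarcity of high-quality action-annotated data. Existing approaches that use human demonstrations are limited by narrow scenarios and the visual gap between humans and robots, which hinders transfer. The proposed METIS model has three key elements: EgoAtlas, a multi-source egocentric dataset that unifies human and robot data under a consistent action space; motion-aware dynamics, a compact, discretized motion representation that gives efficient and expressive supervision for vision-language-action (VLA) training; and a unified framework that integrates reasoning and acting. Together these improve dexterous manipulation success rates and generalization to out-of-distribution scenarios.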

Link: https://arxiv.org/abs/2511.17366
Authors: Yankai Fu, Ning Chen, Junkai Zhao, Shaozhe Shan, Guocai Yao, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
Affiliations: Peking University; Beijing Academy of Artificial Intelligence
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Building a generalist robot that can perceive, reason, and act across diverse tasks remains an open challenge, especially for dexterous manipulation. A major bottleneck lies in the scarcity of large-scale, action-annotated data for dexterous skills, as teleoperation is difficult and costly. Human data, with its vast scale and diverse manipulation behaviors, provides rich priors for learning robotic actions. While prior works have explored leveraging human demonstrations, they are often constrained by limited scenarios and a large visual gap between humans and robots. To eliminate these limitations, we propose METIS, a vision-language-action (VLA) model for dexterous manipulation pretrained on multi-source egocentric datasets. We first construct EgoAtlas, which integrates large-scale human and robotic data from multiple sources, all unified under a consistent action space. We further extract motion-aware dynamics, a compact and discretized motion representation, which provides efficient and expressive supervision for VLA training. Built upon them, METIS integrates reasoning and acting into a unified framework, enabling effective deployment to downstream dexterous manipulation tasks. Our method demonstrates exceptional dexterous manipulation capabilities, achieving the highest average success rate in six real-world tasks. Experimental results also highlight its superior generalization and robustness to out-of-distribution scenarios. These findings emphasize METIS as a promising step toward a generalist model for dexterous manipulation.

[CV-21] SVRecon: Sparse Voxel Rasterization for Surface Reconstruction

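[Quick read]: This paper addresses the geometric discontinuities and local optima that arise in high-fidelity surface reconstruction when sparse voxels are parameterized independently. A Signed Distance Function (SDF) provides a smooth, continuous geometric field, but sparse voxels lack spatial consistency constraints between one another, so optimization easily gets stuck in local minima. The solution has two parts: (1) robust geometric initialization from a visual geometry model, which gives a reasonable starting structure; and (2) a spatial smoothness loss that enforces structural consistency across parent-child and sibling voxel groups, producing globally smooth surfaces while keeping the representation sparse.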

Link: https://arxiv.org/abs/2511.17364
Authors: Seunghun Oh, Jaesung Choe, Dongjae Lee, Daeun Lee, Seunghoon Jeong, Yu-Chiang Frank Wang, Jaesik Park
Affiliations: Seoul National University; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We extend the recently proposed sparse voxel rasterization paradigm to the task of high-fidelity surface reconstruction by integrating Signed Distance Function (SDF), named SVRecon. Unlike 3D Gaussians, sparse voxels are spatially disentangled from their neighbors and have sharp boundaries, which makes them prone to local minima during optimization. Although SDF values provide a naturally smooth and continuous geometric field, preserving this smoothness across independently parameterized sparse voxels is nontrivial. To address this challenge, we promote coherent and smooth voxel-wise structure through (1) robust geometric initialization using a visual geometry model and (2) a spatial smoothness loss that enforces coherent relationships across parent-child and sibling voxel groups. Extensive experiments across various benchmarks show that our method achieves strong reconstruction accuracy while having consistently speedy convergence. The code will be made public.

[CV-22] ATAC: Augmentation-Based Test-Time Adversarial Correction for CLIP

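[Quick read]: This paper addresses the high sensitivity of CLIP (Contrastive Language-Image Pretraining) to adversarial image perturbations in zero-shot image-text matching. Adversarial fine-tuning is computationally expensive, while existing test-time defenses offer limited robustness. The proposed Augmentation-based Test-time Adversarial Correction (ATAC) works in CLIP's embedding space: it uses the drift vectors induced by image augmentations to infer a semantic recovery direction, then corrects the embedding according to the angular consistency of these latent drifts. The method needs no additional training and little extra computation, yet substantially improves robustness across many benchmarks and extreme settings, and shows nontrivial resistance even to adaptive attacks.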

Link: https://arxiv.org/abs/2511.17362
Authors: Linxiang Su, András Balogh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages

Abstract:Despite its remarkable success in zero-shot image-text matching, CLIP remains highly vulnerable to adversarial perturbations on images. As adversarial fine-tuning is prohibitively costly, recent works explore various test-time defense strategies; however, these approaches still exhibit limited robustness. In this work, we revisit this problem and propose a simple yet effective strategy: Augmentation-based Test-time Adversarial Correction (ATAC). Our method operates directly in the embedding space of CLIP, calculating augmentation-induced drift vectors to infer a semantic recovery direction and correcting the embedding based on the angular consistency of these latent drifts. Across a wide range of benchmarks, ATAC consistently achieves remarkably high robustness, surpassing that of previous state-of-the-art methods by nearly 50% on average, all while requiring minimal computational overhead. Furthermore, ATAC retains state-of-the-art robustness in unconventional and extreme settings and even achieves nontrivial robustness against adaptive attacks. Our results demonstrate that ATAC is an efficient method in a novel paradigm for test-time adversarial defenses in the embedding space of CLIP.
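
The abstract describes the correction only at a high level, so the following is one possible reading rather than the authors' rule. It assumes a drift vector is the difference between the embedding of an augmented image and the embedding of the original, measures angular consistency as the length of the mean unit drift, and moves the embedding along the mean drift in proportion to that consistency. The function name, the `step` parameter, and the weighting scheme are all hypothetical, and plain lists stand in for CLIP embeddings.

```python
import math


def _norm(vector):
    return math.sqrt(sum(c * c for c in vector))


def correct_embedding(embedding, augmented_embeddings, step=1.0):
    """Shift `embedding` along the mean augmentation drift, scaled by drift agreement."""
    drifts = [[a - e for a, e in zip(aug, embedding)] for aug in augmented_embeddings]
    unit_drifts = [[c / _norm(d) for c in d] for d in drifts if _norm(d) > 0.0]
    if not unit_drifts:
        return list(embedding), 0.0
    dim = len(embedding)
    mean_unit = [sum(u[i] for u in unit_drifts) / len(unit_drifts) for i in range(dim)]
    # Length of the mean unit drift: 1.0 when all drifts share a direction,
    # close to 0.0 when they point in conflicting directions and cancel out.
    consistency = _norm(mean_unit)
    mean_drift = [sum(d[i] for d in drifts) / len(drifts) for i in range(dim)]
    corrected = [e + step * consistency * m for e, m in zip(embedding, mean_drift)]
    return corrected, consistency


aligned, aligned_score = correct_embedding([0.0, 0.0], [[1.0, 0.0], [2.0, 0.0]])
conflicting, conflicting_score = correct_embedding([0.0, 0.0], [[1.0, 0.0], [-1.0, 0.0]])
print(aligned, aligned_score, conflicting, conflicting_score)
```

The design choice illustrated here is that the correction is gated by agreement: when augmentations pull the embedding in one direction it moves, and when they disagree it stays put.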

[CV-23] SuperQuadricOcc: Multi-Layer Gaussian Approximation of Superquadrics for Real-Time Self-Supervised Occupancy Estimation

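[Quick read]: This paper addresses the high memory use and slow inference of Gaussian-based semantic occupancy estimation for automated-driving scene understanding. Existing methods rely on large numbers of Gaussian primitives; they support self-supervised learning but fall short of real-time requirements. The proposed SuperQuadricOcc represents the scene with superquadrics. A multi-layer, icosphere-tessellated Gaussian approximation of each superquadric makes Gaussian rasterization available for supervision during training, and a fast superquadric voxelization module facilitates evaluation against prior methods. The method uses 84% fewer primitives and 75% less memory, runs inference 124% faster, and improves mIoU by 5.9% without temporal labels. According to the authors, it is the first occupancy model to reach real-time inference while keeping competitive performance.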

Link: https://arxiv.org/abs/2511.17361
Authors: Seamie Hayes, Reenu Mohandas, Tim Brophy, Alexandre Boulch, Ganesh Sistu, Ciaran Eising
Affiliations: University of Limerick; valeo.ai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Semantic occupancy estimation enables comprehensive scene understanding for automated driving, providing dense spatial and semantic information essential for perception and planning. While Gaussian representations have been widely adopted in self-supervised occupancy estimation, the deployment of a large number of Gaussian primitives drastically increases memory requirements and is not suitable for real-time inference. In contrast, superquadrics permit reduced primitive count and lower memory requirements due to their diverse shape set. However, implementation into a self-supervised occupancy model is nontrivial due to the absence of a superquadric rasterizer to enable model supervision. Our proposed method, SuperQuadricOcc, employs a superquadric-based scene representation. By leveraging a multi-layer icosphere-tessellated Gaussian approximation of superquadrics, we enable Gaussian rasterization for supervision during training. On the Occ3D dataset, SuperQuadricOcc achieves a 75% reduction in memory footprint, 124% faster inference, and a 5.9% improvement in mIoU compared to previous Gaussian-based methods, without the use of temporal labels. To our knowledge, this is the first occupancy model to enable real-time inference while maintaining competitive performance. The use of superquadrics reduces the number of primitives required for scene modeling by 84% relative to Gaussian-based approaches. Finally, evaluation against prior methods is facilitated by our fast superquadric voxelization module. The code will be released as open source.

[CV-24] UAM: A Unified Attention-Mamba Backbone of Multimodal Framework for Tumor Cell Classification

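[Quick read]: This paper addresses the lack of research on cell-level radiomics analysis for tumor diagnosis: existing work mostly targets slide-level or patch-level classification, and no backbone has been designed specifically for cell-level features. The proposed Unified Attention-Mamba (UAM) backbone fuses the attention mechanism and the Mamba architecture in a single unified structure, which removes the manual module-ratio tuning required by earlier hybrid designs and strengthens encoding capability, improving cell-level classification and image segmentation. On public datasets, UAM raises cell classification accuracy from 74% to 78% and tumor segmentation precision from 75% to 80%, supporting its use as an extensible multimodal foundation for radiomics-driven cancer diagnosis.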

Link: https://arxiv.org/abs/2511.17355
Authors: Taixi Chen, Jingyun Chen, Nancy Guo
Affiliations: State University of New York at Binghamton
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Cell-level radiomics features provide fine-grained insights into tumor phenotypes and have the potential to significantly enhance diagnostic accuracy on hematoxylin and eosin (HE) images. By capturing micro-level morphological and intensity patterns, these features support more precise tumor identification and improve AI interpretability by highlighting diagnostically relevant cells for pathologist review. However, most existing studies focus on slide-level or patch-level tumor classification, leaving cell-level radiomics analysis largely unexplored. Moreover, there is currently no dedicated backbone specifically designed for radiomics data. Inspired by the recent success of the Mamba architecture in vision and language domains, we introduce a Unified Attention-Mamba (UAM) backbone for cell-level classification using radiomics features. Unlike previous hybrid approaches that integrate Attention and Mamba modules in fixed proportions, our unified design flexibly combines their capabilities within a single cohesive architecture, eliminating the need for manual ratio tuning and improving encoding capability. We develop two UAM variants to comprehensively evaluate the benefits of this unified structure. Building on this backbone, we further propose a multimodal UAM framework that jointly performs cell-level classification and image segmentation. Experimental results demonstrate that UAM achieves state-of-the-art performance across both tasks on public benchmarks, surpassing leading image-based foundation models. It improves cell classification accuracy from 74% to 78% (n = 349,882 cells), and tumor segmentation precision from 75% to 80% (n = 406 patches). These findings highlight the effectiveness and promise of UAM as a unified and extensible multimodal foundation for radiomics-driven cancer diagnosis.

[CV-25] DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture

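[Quick read]: This paper addresses a limitation of the Image-based Joint-Embedding Predictive Architecture (I-JEPA) in visual representation learning: it predicts all image regions uniformly and independently, with no explicit modeling of where or in what order to predict. This ignores the selective, sequential nature of attention in human visual perception and limits the learning of discriminative, generalizable features. The proposed Discriminative Sequential JEPA (DSeq-JEPA) works in two steps. It first identifies the primary discriminative regions from a transformer-derived saliency map, which emphasizes the distribution of visual importance. It then predicts the remaining regions one by one in that discriminative order, progressing semantically from primary to secondary cues. This combines JEPA-style latent embedding prediction with GPT-style sequential reasoning and yields a curriculum-like self-supervised pre-training paradigm.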

Link: https://arxiv.org/abs/2511.17354
Authors: Xiangteng He, Shunsuke Sakai, Kun Yuan, Nicolas Padoy, Tatsuhito Hasegawa, Leonid Sigal
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Image-based Joint-Embedding Predictive Architecture (I-JEPA) learns visual representations by predicting latent embeddings of masked regions from visible context. However, it treats all regions uniformly and independently, lacking an explicit notion of where or in what order predictions should be made. Inspired by human visual perception, which deploys attention selectively and sequentially from the most informative to secondary regions, we propose DSeq-JEPA, a Discriminative Sequential Joint-Embedding Predictive Architecture that bridges predictive and autoregressive self-supervised learning, integrating JEPA-style latent prediction with GPT-style sequential reasoning. Specifically, DSeq-JEPA (i) first identifies primary discriminative regions based on a transformer-derived saliency map, emphasizing the distribution of visual importance, and then (ii) predicts subsequent regions in this discriminative order, progressively forming a curriculum-like semantic progression from primary to secondary cues – a form of GPT-style pre-training. Extensive experiments across diverse tasks, including image classification (ImageNet), fine-grained visual categorization (iNaturalist21, CUB-200-2011, Stanford-Cars), detection and segmentation (MS-COCO, ADE20K), and low-level reasoning tasks (Clevr/Count, Clevr/Dist), demonstrate that DSeq-JEPA consistently focuses on more discriminative and generalizable representations than I-JEPA variants. Project page: this https URL.
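
The two-step ordering described in the abstract can be sketched in a few lines. This is only an illustration of the scheduling idea, assuming one saliency score per region is already available; the scores, the `num_primary` parameter, and the function name are hypothetical, and no encoder or predictor is modeled.

```python
def discriminative_order(saliency, num_primary):
    """Split region indices into primary regions and an ordered prediction sequence."""
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return ranked[:num_primary], ranked[num_primary:]


saliency = [0.10, 0.70, 0.05, 0.15]  # hypothetical per-region saliency scores
primary, sequence = discriminative_order(saliency, num_primary=1)

# Each step predicts the next most salient region from everything seen so far.
steps = [(primary + sequence[:k], sequence[k]) for k in range(len(sequence))]
for context, target in steps:
    print(f"context={context} -> predict region {target}")
```

Each step conditions on the primary regions plus all regions predicted before it, which is the autoregressive, primary-to-secondary progression the abstract refers to.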

[CV-26] Label-Efficient Skeleton-based Recognition with Stable-Invertible Graph Convolutional Networks

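[Quick read]: This paper addresses the heavy dependence of skeleton-based action recognition on large, manually labeled datasets, which are costly and time-consuming to acquire. It proposes a label-efficient learning method based on Graph Convolutional Networks (GCNs) with a novel acquisition function. The function is obtained as the optimum of an objective that mixes data representativity, diversity, and uncertainty, and it automatically selects the most informative subset of samples for labeling. The authors also introduce an invertible GCN that maps data from the ambient space to a latent space where the inherent data distribution is easier to capture, further improving label efficiency. Experiments on two challenging skeleton-based recognition datasets show that the method outperforms related work.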

Link: https://arxiv.org/abs/2511.17345
Authors: Hichem Sahbi
Affiliations: Sorbonne University; CNRS
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Skeleton-based action recognition is a hotspot in image processing. A key challenge of this task lies in its dependence on large, manually labeled datasets whose acquisition is costly and time-consuming. This paper devises a novel, label-efficient method for skeleton-based action recognition using graph convolutional networks (GCNs). The contribution of the proposed method resides in learning a novel acquisition function – scoring the most informative subsets for labeling – as the optimum of an objective function mixing data representativity, diversity and uncertainty. We also extend this approach by learning the most informative subsets using an invertible GCN which allows mapping data from ambient to latent spaces where the inherent distribution of the data is more easily captured. Extensive experiments, conducted on two challenging skeleton-based recognition datasets, show the effectiveness and the outperformance of our label-frugal GCNs against the related work.

[CV-27] Loomis Painter: Reconstructing the Painting Process

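[Quick read]: This paper addresses two problems: existing painting tutorial videos lack interactivity and personalization, and generative models show limited cross-media consistency, temporal coherence, and structural stability, so they cannot faithfully reproduce the human creative workflow. The proposed solution is a unified framework for multi-media painting process generation. A semantics-driven style control mechanism embeds multiple media into the conditional space of a diffusion model, and a cross-medium style augmentation strategy gives consistent texture evolution and stable process transfer across styles. A reverse-painting training strategy keeps the generated process smooth and aligned with how humans paint. The authors also build a large-scale dataset of real painting processes and propose the Perceptual Distance Profile (PDP) curve, which quantitatively characterizes the human artistic progression from composition to color blocking to detail refinement.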

Link: https://arxiv.org/abs/2511.17344
Authors: Markus Pobitzer, Chang Liu, Chenyi Zhuang, Teng Long, Bin Ren, Nicu Sebe
Affiliations: University of Trento; University of Pisa
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Step-by-step painting tutorials are vital for learning artistic techniques, but existing video resources (e.g., YouTube) lack interactivity and personalization. While recent generative models have advanced artistic image synthesis, they struggle to generalize across media and often show temporal or structural inconsistencies, hindering faithful reproduction of human creative workflows. To address this, we propose a unified framework for multi-media painting process generation with a semantics-driven style control mechanism that embeds multiple media into a diffusion model's conditional space and uses cross-medium style augmentation. This enables consistent texture evolution and process transfer across styles. A reverse-painting training strategy further ensures smooth, human-aligned generation. We also build a large-scale dataset of real painting processes and evaluate cross-media consistency, temporal coherence, and final-image fidelity, achieving strong results on LPIPS, DINO, and CLIP metrics. Finally, our Perceptual Distance Profile (PDP) curve quantitatively models the creative sequence, i.e., composition, color blocking, and detail refinement, mirroring human artistic progression.

[CV-28] Refracting Reality: Generating Images with Realistic Transparent Objects

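[Quick read]: This paper addresses the failure of generative image models to respect optical constraints when synthesizing transparent objects, especially the visually implausible results caused by refraction. Existing models have not learned the laws of optics well enough to model how refracted rays interact with surfaces visible elsewhere in the image. The key idea is to synchronize, at every step of the generation trajectory, the pixels inside the object's boundary with those outside it by warping and merging them according to Snell's Law of Refraction. Surfaces that are not directly observed but are visible through refraction or reflection are recovered by generating a second image, a panorama centered at the object, and applying the same pixel synchronization procedure, which yields physically consistent refraction.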

Link: https://arxiv.org/abs/2511.17340
Authors: Yue Yin, Enze Tao, Dylan Campbell
Affiliations: The Australian National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generative image models can produce convincingly real images, with plausible shapes, textures, layouts and lighting. However, one domain in which they perform notably poorly is in the synthesis of transparent objects, which exhibit refraction, reflection, absorption and scattering. Refraction is a particular challenge, because refracted pixel rays often intersect with surfaces observed in other parts of the image, providing a constraint on the color. It is clear from inspection that generative models have not distilled the laws of optics sufficiently well to accurately render refractive objects. In this work, we consider the problem of generating images with accurate refraction, given a text prompt. We synchronize the pixels within the object’s boundary with those outside by warping and merging the pixels using Snell’s Law of Refraction, at each step of the generation trajectory. For those surfaces that are not directly observed in the image, but are visible via refraction or reflection, we recover their appearance by synchronizing the image with a second generated image – a panorama centered at the object – using the same warping and merging procedure. We demonstrate that our approach generates much more optically-plausible images that respect the physical constraints.
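
The warping step rests on Snell's law, n1 sin(theta_i) = n2 sin(theta_t). As a small illustration of that law only (not of the paper's warping and merging pipeline), the sketch below computes the refracted direction of a single ray in vector form. The function name and argument conventions are assumptions made for this example: directions are unit 3-tuples and the normal points toward the side the ray arrives from.

```python
import math


def refract(direction, normal, n1, n2):
    """Refract a unit `direction` at a surface with unit `normal` facing the incoming ray.

    Returns the refracted unit direction, or None on total internal reflection.
    """
    eta = n1 / n2
    cos_i = -sum(d * n for d, n in zip(direction, normal))
    sin2_t = eta * eta * (1.0 - cos_i * cos_i)
    if sin2_t > 1.0:
        return None  # total internal reflection: no transmitted ray exists
    cos_t = math.sqrt(1.0 - sin2_t)
    return tuple(eta * d + (eta * cos_i - cos_t) * n for d, n in zip(direction, normal))


s = math.sin(math.radians(45.0))
up = (0.0, 0.0, 1.0)

# Air (n = 1.0) into glass (n = 1.5) at 45 degrees: the ray bends toward the normal.
into_glass = refract((s, 0.0, -s), up, 1.0, 1.5)

# Glass into air at 60 degrees exceeds the critical angle, so nothing is transmitted.
steep = (math.sin(math.radians(60.0)), 0.0, -math.cos(math.radians(60.0)))
blocked = refract(steep, up, 1.5, 1.0)
print(into_glass, blocked)
```

In an image-space setting, a direction like `into_glass` would determine which background location a pixel inside the object should be matched with; how the paper performs that matching is not specified in the abstract.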

[CV-29] NoPe-NeRF++: Local-to-Global Optimization of NeRF with No Pose Prior

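[Quick read]: This paper addresses inaccurate pose estimation when training Neural Radiance Fields (NeRF) without camera pose priors, in particular the difficulty existing methods such as NoPe-NeRF have in recovering accurate poses in complex scenes because they rely only on local relationships within images. The proposed NoPe-NeRF++ is a local-to-global optimization algorithm. It first initializes relative poses through explicit feature matching and then runs a local joint optimization to improve the initial poses. A global optimization stage then refines the poses further through bundle adjustment, which incorporates geometric consistency constraints and feature trajectories, improving both NeRF reconstruction quality and novel view synthesis. The authors describe it as the first work to combine local and global cues seamlessly with NeRF, and report that it outperforms state-of-the-art methods in pose estimation accuracy and novel view synthesis.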

Link: https://arxiv.org/abs/2511.17322
Authors: Dongbo Shi, Shen Cao, Bojian Wu, Jinhui Guo, Lubin Fan, Renjie Chen, Ligang Liu, Jieping Ye
Affiliations: University of Science and Technology of China; Independent Researcher
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we introduce NoPe-NeRF++, a novel local-to-global optimization algorithm for training Neural Radiance Fields (NeRF) without requiring pose priors. Existing methods, particularly NoPe-NeRF, which focus solely on the local relationships within images, often struggle to recover accurate camera poses in complex scenarios. To overcome these challenges, our approach begins with a relative pose initialization with explicit feature matching, followed by a local joint optimization to enhance the pose estimation for training a more robust NeRF representation. This method significantly improves the quality of initial poses. Additionally, we introduce a global optimization phase that incorporates geometric consistency constraints through bundle adjustment, which integrates feature trajectories to further refine poses and collectively boost the quality of NeRF. Notably, our method is the first work that seamlessly combines the local and global cues with NeRF, and outperforms state-of-the-art methods in both pose estimation accuracy and novel view synthesis. Extensive evaluations on benchmark datasets demonstrate our superior performance and robustness, even in challenging scenes, thus validating our design choices.

[CV-30] MuM: Multi-View Masked Image Modeling for 3D Vision

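[Quick read]: This paper addresses the tendency of self-supervised image learning to favor semantic understanding over geometric reasoning, which leaves feature representations under-equipped for 3D vision tasks. The solution extends masked autoencoding (MAE) to jointly model arbitrarily many views of the same scene. All views are masked uniformly, and a lightweight decoder with inter-frame attention reconstructs them, giving a simpler and more scalable multi-view representation learning framework. The resulting model improves performance on downstream 3D vision tasks such as dense image matching and relative pose estimation.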

Link: https://arxiv.org/abs/2511.17309
Authors: David Nordström, Johan Edstedt, Fredrik Kahl, Georg Bökman
Affiliations: Chalmers University of Technology; Linköping University; University of Amsterdam
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Self-supervised learning on images seeks to extract meaningful visual representations from unlabeled data. When scaled to large datasets, this paradigm has achieved state-of-the-art performance and the resulting trained models such as DINOv3 have seen widespread adoption. However, most prior efforts are optimized for semantic understanding rather than geometric reasoning. One important exception is Cross-View Completion, CroCo, which is a form of masked autoencoding (MAE) tailored for 3D understanding. In this work, we continue on the path proposed by CroCo and focus on learning features tailored for 3D vision. In a nutshell, we extend MAE to arbitrarily many views of the same scene. By uniformly masking all views and employing a lightweight decoder with inter-frame attention, our approach is inherently simpler and more scalable than CroCo. We evaluate the resulting model, MuM, extensively on downstream tasks including feedforward reconstruction, dense image matching and relative pose estimation, finding that it outperforms the state-of-the-art visual encoders DINOv3 and CroCo v2.
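
"Uniformly masking all views" can be pictured as drawing an independent random patch mask for every view at one shared ratio, instead of masking a single target view. The sketch below shows only that sampling step, under assumptions that are not taken from the source: a 0.75 mask ratio, 196 patches per view, and the hypothetical function name `mask_all_views`.

```python
import random


def mask_all_views(num_views, patches_per_view, mask_ratio, seed=0):
    """Return, for each view, the sorted indices of its masked patches.

    Every view uses the same mask ratio but its own random selection.
    """
    rng = random.Random(seed)
    num_masked = round(mask_ratio * patches_per_view)
    return [
        sorted(rng.sample(range(patches_per_view), num_masked))
        for _ in range(num_views)
    ]


masks = mask_all_views(num_views=4, patches_per_view=196, mask_ratio=0.75)
print([len(m) for m in masks])
```

Because no view is privileged, the same routine works for any number of views, which is consistent with the scalability argument in the abstract.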

[CV-31] SpatialGeo: Boosting Spatial Reasoning in Multimodal LLMs via Geometry-Semantics Fusion

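[Quick read]: This paper addresses the limited three-dimensional spatial reasoning of Multimodal Large Language Models (MLLMs), and in particular the spatial ambiguity that arises because their vision encoders model spatial layout poorly. Encoders such as CLIP embed mainly instance-level semantic features and carry little geometric structure, which makes it hard to understand spatial relations between objects in an image. The proposed SpatialGeo is a new vision encoder based on hierarchical fusion of geometry and semantics features. It takes geometry features from vision-only self-supervised learning and fuses them with CLIP through a hierarchical adapter, producing spatial-aware visual embeddings that strengthen the spatial grounding of MLLMs. Training starts from a pretrained LLaVA model and uses random feature dropping to avoid trivial solutions that rely only on the CLIP encoder. On SpatialRGPT-Bench, accuracy improves by at least 8.0% while inference memory cost falls by roughly 50%.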

Link: https://arxiv.org/abs/2511.17308
Authors: Jiajie Guo, Qingpeng Zhu, Jin Zeng, Xiaolong Wu, Changyong He, Weida Wang
Affiliations: Tongji University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal large language models (MLLMs) have achieved significant progress in image and language tasks due to the strong reasoning capability of large language models (LLMs). Nevertheless, most MLLMs suffer from limited spatial reasoning ability to interpret and infer spatial arrangements in three-dimensional space. In this work, we propose a novel vision encoder based on hierarchical fusion of geometry and semantics features, generating spatial-aware visual embedding and boosting the spatial grounding capability of MLLMs. Specifically, we first unveil that the spatial ambiguity shortcoming stems from the lossy embedding of the vision encoder utilized in most existing MLLMs (e.g., CLIP), restricted to instance-level semantic features. This motivates us to complement CLIP with the geometry features from vision-only self-supervised learning via a hierarchical adapter, enhancing the spatial awareness in the proposed SpatialGeo. The network is efficiently trained using pretrained LLaVA model and optimized with random feature dropping to avoid trivial solutions relying solely on the CLIP encoder. Experimental results show that SpatialGeo improves the accuracy in spatial reasoning tasks, enhancing state-of-the-art models by at least 8.0% in SpatialRGPT-Bench with approximately 50% less memory cost during inference. The source code is available via this https URL.

[CV-32] BiFingerPose: Bimodal Finger Pose Estimation for Touch Devices

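[Quick read]: This paper addresses the practical limits of current finger pose estimation on touchscreen devices. Existing methods based on capacitive images estimate only pitch and yaw, cannot recover the roll angle, and lose accuracy markedly at large input angles (above 45°). The proposed BiFingerPose is a bimodal method that fuses a capacitive image with a fingerprint patch captured by an under-screen fingerprint sensor. This makes roll estimation reliable and substantially improves prediction of the other pose parameters. In continuous and discrete interaction tasks, it outperforms previous state-of-the-art (SOTA) methods in accuracy, efficiency, and user operation precision.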

Link: https://arxiv.org/abs/2511.17306
Authors: Xiongjun Guan, Zhiyu Pan, Jianjiang Feng, Jie Zhou
Affiliations: Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Finger pose offers promising opportunities to expand human computer interaction capability of touchscreen devices. Existing finger pose estimation algorithms that can be implemented in portable devices predominantly rely on capacitive images, which are currently limited to estimating pitch and yaw angles and exhibit reduced accuracy when processing large-angle inputs (especially when it is greater than 45 degrees). In this paper, we propose BiFingerPose, a novel bimodal based finger pose estimation algorithm capable of simultaneously and accurately predicting comprehensive finger pose information. A bimodal input is explored, including a capacitive image and a fingerprint patch obtained from the touchscreen with an under-screen fingerprint sensor. Our approach leads to reliable estimation of roll angle, which is not achievable using only a single modality. In addition, the prediction performance of other pose parameters has also been greatly improved. The evaluation of a 12-person user study on continuous and discrete interaction tasks further validated the advantages of our approach. Specifically, BiFingerPose outperforms previous SOTA methods with over 21% improvement in prediction performance, 2.5 times higher task completion efficiency, and 23% better user operation accuracy, demonstrating its practical superiority. Finally, we delineate the application space of finger pose with respect to enhancing authentication security and improving interactive experiences, and develop corresponding prototypes to showcase the interaction potential. Our code will be available at this https URL.

[CV-33] MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning

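[Quick read]: This paper addresses the low accuracy of Optical Chemical Structure Recognition (OCSR) on stereochemical information, in particular its weak ability to distinguish subtle visual cues such as wedge bonds, dash bonds, ring conformations, and spatial arrangements. The proposed MolSight uses a three-stage training framework. It first pre-trains on large-scale noisy data to acquire basic image perception. It then performs multi-granularity fine-tuning with auxiliary tasks (chemical bond classification and atom localization) to improve molecular formula recognition. Finally, it applies reinforcement learning based on Group Relative Policy Optimization (GRPO) as post-training and introduces a new stereochemical structure dataset, which improves recognition of stereo molecules.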

Link: https://arxiv.org/abs/2511.17300
Authors: Wenrui Zhang, Xinggang Wang, Bin Feng, Wenyu Liu
Affiliations: Huazhong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Optical Chemical Structure Recognition (OCSR) plays a pivotal role in modern chemical informatics, enabling the automated conversion of chemical structure images from scientific literature, patents, and educational materials into machine-readable molecular representations. This capability is essential for large-scale chemical data mining, drug discovery pipelines, and Large Language Model (LLM) applications in related domains. However, existing OCSR systems face significant challenges in accurately recognizing stereochemical information due to the subtle visual cues that distinguish stereoisomers, such as wedge and dash bonds, ring conformations, and spatial arrangements. To address these challenges, we propose MolSight, a comprehensive learning framework for OCSR that employs a three-stage training paradigm. In the first stage, we conduct pre-training on large-scale but noisy datasets to endow the model with fundamental perception capabilities for chemical structure images. In the second stage, we perform multi-granularity fine-tuning using datasets with richer supervisory signals, systematically exploring how auxiliary tasks-specifically chemical bond classification and atom localization-contribute to molecular formula recognition. Finally, we employ reinforcement learning for post-training optimization and introduce a novel stereochemical structure dataset. Remarkably, we find that even with MolSight’s relatively compact parameter size, the Group Relative Policy Optimization (GRPO) algorithm can further enhance the model’s performance on stereomolecular. Through extensive experiments across diverse datasets, our results demonstrate that MolSight achieves state-of-the-art performance in (stereo)chemical optical structure recognition.
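
For readers unfamiliar with GRPO, its core step is to score each sampled output relative to the other outputs drawn for the same input, which avoids training a separate value model. The sketch below shows that group-relative normalization as it is commonly formulated; the abstract does not say which reward MolSight uses or whether it applies exactly this normalization, so the reward values here are hypothetical (for example, 1.0 when a predicted structure matches the reference and 0.0 otherwise).

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its own group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    # eps keeps the division defined when every sample in the group has the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical predictions sampled for one molecule image.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
uninformative = group_relative_advantages([1.0, 1.0, 1.0])
print(advantages, uninformative)
```

A group in which every sample earns the same reward yields zero advantages, so it contributes no gradient signal; only groups with mixed outcomes drive learning.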

[CV-34] Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation

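[Quick read]: This paper addresses inconsistent outputs of multilingual Text-to-Image (T2I) models across cultural contexts: given multilingual prompts, current models often produce culturally neutral or English-centric images, distorting culture-related semantics. Using a probing method, the authors localize culture-sensitive neurons to a small number of fixed layers, and on that basis propose two complementary alignment strategies. The first, inference-time cultural activation, amplifies the activations of the identified neurons to improve cultural consistency without fine-tuning the backbone. The second, layer-targeted cultural enhancement, updates only the parameters of the culturally relevant layers. On CultureBench, both strategies improve cultural consistency while preserving image fidelity and diversity.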

Link: https://arxiv.org/abs/2511.17282
Authors: Chuancheng Shi, Shangze Li, Shiming Guo, Simiao Xie, Wenhua Wu, Jingtong Dou, Chao Wu, Canran Xiao, Cong Wang, Zifeng Cheng, Fei Shen, Tat-Seng Chua
Affiliations: The University of Sydney; Nanjing University of Science and Technology; Central South University; Nanjing University; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Multilingual text-to-image (T2I) models have advanced rapidly in terms of visual realism and semantic alignment, and are now widely utilized. Yet outputs vary across cultural contexts: because language carries cultural connotations, images synthesized from multilingual prompts should preserve cross-lingual cultural consistency. We conduct a comprehensive analysis showing that current T2I models often produce culturally neutral or English-biased results under multilingual prompts. Analyses of two representative models indicate that the issue stems not from missing cultural knowledge but from insufficient activation of culture-related representations. We propose a probing method that localizes culture-sensitive signals to a small set of neurons in a few fixed layers. Guided by this finding, we introduce two complementary alignment strategies: (1) inference-time cultural activation that amplifies the identified neurons without fine-tuning the backbone; and (2) layer-targeted cultural enhancement that updates only culturally relevant layers. Experiments on our CultureBench demonstrate consistent improvements over strong baselines in cultural consistency while preserving fidelity and diversity.

[CV-35] Leveraging CVAE for Joint Configuration Estimation of Multifingered Grippers from Point Cloud Data

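[Quick read]: This paper addresses how to determine the joint configuration of a multifingered gripper efficiently and accurately from point cloud data alone. Traditional inverse kinematics (IK) methods can give mathematically exact solutions, but they often need additional decision-making to handle the positions of intermediate phalanges, or rely on numerical approximation for complex kinematics, which limits efficiency and robustness. The key idea is a Conditional Variational Auto-Encoder (CVAE) that takes point cloud data as input and directly learns an implicit mapping from structural features to joint configurations. Joint states can then be estimated accurately and with low latency (within 0.05 milliseconds) without explicitly solving the IK equations, which supports its use in AI-driven grasp planning.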

Link: https://arxiv.org/abs/2511.17276
Authors: Julien Merand,Boris Meden,Mathieu Grossard
Affiliations: Université Paris-Saclay; CEA; List
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper presents an efficient approach for determining the joint configuration of a multifingered gripper solely from the point cloud data of its poly-articulated chain, as generated by visual sensors, simulations or even generative neural networks. Well-known inverse kinematics (IK) techniques can provide mathematically exact solutions (when they exist) for joint configuration determination based solely on the fingertip pose, but often require post-hoc decision-making by considering the positions of all intermediate phalanges in the gripper’s fingers, or rely on algorithms to numerically approximate solutions for more complex kinematics. In contrast, our method leverages machine learning to implicitly overcome these challenges. This is achieved through a Conditional Variational Auto-Encoder (CVAE), which takes point cloud data of key structural elements as input and reconstructs the corresponding joint configurations. We validate our approach on the MultiDex grasping dataset using the Allegro Hand, operating within 0.05 milliseconds and achieving accuracy comparable to state-of-the-art methods. This highlights the effectiveness of our pipeline for joint configuration estimation within the broader context of AI-driven techniques for grasp planning.

[CV-36] Range-Edit: Semantic Mask Guided Outdoor LiDAR Scene Editing

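[Quick read]: This paper addresses the lack of diverse point cloud data covering complex edge cases for training autonomous driving systems, especially extreme situations that are hard to capture in the real world, which limits generalization and robustness. Existing methods simulate point clouds in handcrafted 3D virtual environments, which is time-consuming, computationally expensive, and struggles to reproduce the complexity of real scenes. The key to the solution is editing real LiDAR scans under semantic mask guidance: point clouds are converted to 2D range images as an intermediate representation, and convex hull-based semantic masks control the location, orientation, and dimensions of objects, enabling diffusion-based generation of high-quality synthetic LiDAR point clouds. The approach increases the diversity and realism of complex dynamic scenes and is validated on the KITTI-360 dataset.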

Link: https://arxiv.org/abs/2511.17269
Authors: Suchetan G. Uppur,Hemant Kumar,Vaibhav Kumar
Affiliations: GeoAI4Cities Lab, IISER Bhopal, India
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 9 figures

Abstract:Training autonomous driving and navigation systems requires large and diverse point cloud datasets that capture complex edge case scenarios from various dynamic urban settings. Acquiring such diverse scenarios from real-world point cloud data, especially for critical edge cases, is challenging, which restricts system generalization and robustness. Current methods rely on simulating point cloud data within handcrafted 3D virtual environments, which is time-consuming, computationally expensive, and often fails to fully capture the complexity of real-world scenes. To address some of these issues, this research proposes a novel approach that edits real-world LiDAR scans using semantic mask-based guidance to generate novel synthetic LiDAR point clouds. We incorporate range image projection and semantic mask conditioning to achieve diffusion-based generation. Point clouds are transformed to 2D range view images, which are used as an intermediate representation to enable semantic editing using convex hull-based semantic masks. These masks guide the generation process by providing information on the dimensions, orientations, and locations of objects in the real environment, ensuring geometric consistency and realism. This approach demonstrates high-quality LiDAR point cloud generation, capable of producing complex edge cases and dynamic scenes, as validated on the KITTI-360 dataset. This offers a cost-effective and scalable solution for generating diverse LiDAR data, a step toward improving the robustness of autonomous driving systems.
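
The range view mentioned in the abstract is commonly obtained with a spherical projection of each LiDAR point. The following is a minimal sketch of that standard projection, not the authors' code: the image size and the vertical field of view (`fov_up_deg`, `fov_down_deg`) are assumed values chosen for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def to_range_image(points: Sequence[Point], height: int = 4, width: int = 8,
                   fov_up_deg: float = 3.0, fov_down_deg: float = -25.0
                   ) -> List[List[Optional[float]]]:
    """Project 3D points to a (height x width) range image; empty cells stay None."""
    fov_down = math.radians(fov_down_deg)
    fov = math.radians(fov_up_deg) - fov_down
    image: List[List[Optional[float]]] = [[None] * width for _ in range(height)]
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # the sensor origin has no direction
        yaw = math.atan2(y, x)      # horizontal angle in [-pi, pi]
        pitch = math.asin(z / r)    # vertical angle
        u = 0.5 * (1.0 - yaw / math.pi) * width
        v = (1.0 - (pitch - fov_down) / fov) * height
        # Points outside the assumed field of view are clamped to the border.
        col = min(width - 1, max(0, int(u)))
        row = min(height - 1, max(0, int(v)))
        # Keep the closest return when several points fall into one pixel.
        if image[row][col] is None or r < image[row][col]:
            image[row][col] = r
    return image


range_img = to_range_image([(10.0, 0.0, 0.0), (5.0, 0.0, 0.0)])
print(range_img[0][4])
```

Editing then happens on this 2D grid, and the edited image can be mapped back to 3D by inverting the same angles.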

[CV-37] A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback WACV’26

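[Quick read]: This paper starts from the observation that improving large vision-language models (VLMs) on visual search usually requires fine-tuning and scaling to larger models, and proposes an inference-time method inspired by relevance feedback in traditional text retrieval. The solution introduces four feedback strategies: classical pseudo-relevance feedback (PRF), generative relevance feedback (GRF), an attentive feedback summarizer (AFS), and explicit feedback based on ground-truth captions as an upper-bound baseline. GRF refines the query embedding using synthetic captions, while AFS uses a custom transformer to integrate fine-grained multimodal features from relevant items. Both improve retrieval for smaller VLMs without fine-tuning (by 3-5% in MRR@5), and AFS is more robust than GRF in iterative multi-turn settings and mitigates query drift.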

Link: https://arxiv.org/abs/2511.17255
Authors: Bulat Khaertdinov,Mirela Popa,Nava Tintarev
Affiliations: Maastricht University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: Accepted to WACV’26

Abstract:Large vision-language models (VLMs) enable intuitive visual search using natural language queries. However, improving their performance often requires fine-tuning and scaling to larger model variants. In this work, we propose a mechanism inspired by traditional text-based search to improve retrieval performance at inference time: relevance feedback. While relevance feedback can serve as an alternative to fine-tuning, its model-agnostic design also enables use with fine-tuned VLMs. Specifically, we introduce and evaluate four feedback strategies for VLM-based retrieval. First, we revise classical pseudo-relevance feedback (PRF), which refines query embeddings based on top-ranked results. To address its limitations, we propose generative relevance feedback (GRF), which uses synthetic captions for query refinement. Furthermore, we introduce an attentive feedback summarizer (AFS), a custom transformer-based model that integrates multimodal fine-grained features from relevant items. Finally, we simulate explicit feedback using ground-truth captions as an upper-bound baseline. Experiments on Flickr30k and COCO with the VLM backbones show that GRF, AFS, and explicit feedback improve retrieval performance by 3-5% in MRR@5 for smaller VLMs, and 1-3% for larger ones, compared to retrieval with no feedback. Moreover, AFS, similarly to explicit feedback, mitigates query drift and is more robust than GRF in iterative, multi-turn retrieval settings. Our findings demonstrate that relevance feedback can consistently enhance retrieval across VLMs and open up opportunities for interactive and adaptive visual search.
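
Pseudo-relevance feedback refines the query embedding from the top-ranked results. A minimal sketch is given below using a Rocchio-style update; the Rocchio form and the weights `alpha` and `beta` are assumptions for illustration, since the abstract does not specify the exact update rule.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def pseudo_relevance_feedback(query: Vector, gallery: Sequence[Vector],
                              k: int = 2, alpha: float = 1.0,
                              beta: float = 0.5) -> List[float]:
    """Move the query toward the centroid of its top-k retrieved embeddings."""
    ranked = sorted(range(len(gallery)),
                    key=lambda i: cosine(query, gallery[i]), reverse=True)
    top = [gallery[i] for i in ranked[:k]]
    centroid = [sum(col) / len(top) for col in zip(*top)]
    refined = [alpha * q + beta * c for q, c in zip(query, centroid)]
    norm = math.sqrt(sum(x * x for x in refined))
    return [x / norm for x in refined] if norm > 0.0 else list(refined)


# Toy 2-D embeddings (made-up values): two plausible matches and one outlier.
query = [1.0, 0.0]
gallery = [[0.8, 0.6], [0.6, 0.8], [-1.0, 0.0]]
refined = pseudo_relevance_feedback(query, gallery)
print(cosine(query, gallery[1]), cosine(refined, gallery[1]))
```

Because the top-ranked items are assumed relevant without checking, a wrong first ranking pulls the query the wrong way, which is the query-drift risk the abstract mentions.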

[CV-38] Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats NEURIPS2025

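[Quick read]: This paper addresses hallucination in Large Vision-Language Models (LVLMs), which remains common across tasks. The study finds that hallucinations do not arise from a single causal path but from the interplay among the image-to-input-text, image-to-output-text, and text-to-text pathways, and that models rely on different pathways depending on the question-answer alignment format (discriminative vs. generative). The key to the solution is an intervention framework, aligned with the causal architecture of the transformer, that integrates multiple paths: it identifies the critical hallucination heads in each pathway and intervenes on them in a way tailored to each alignment format. Experiments on multiple benchmarks show that the method consistently reduces hallucinations.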

Link: https://arxiv.org/abs/2511.17254
Authors: Jiaye Qian,Ge Zheng,Yuchen Zhu,Sibei Yang
Affiliations: Sun Yat-sen University; ShanghaiTech University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to NeurIPS 2025, Project Page: this https URL

Abstract:Despite their impressive performance across a wide range of tasks, Large Vision-Language Models (LVLMs) remain prone to hallucination. In this study, we propose a comprehensive intervention framework aligned with the transformer’s causal architecture in LVLMs, integrating the effects of different intervention paths on hallucination. We find that hallucinations in LVLMs do not arise from a single causal path, but rather from the interplay among image-to-input-text, image-to-output-text, and text-to-text pathways. For the first time, we also find that LVLMs rely on different pathways depending on the question-answer alignment format. Building on these insights, we propose simple yet effective methods to identify and intervene on critical hallucination heads within each pathway, tailored to discriminative and generative formats. Experiments across multiple benchmarks demonstrate that our approach consistently reduces hallucinations across diverse alignment types.

[CV-39] Blind Deconvolution for Color Images Using Normalized Quaternion Kernels

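[Quick read]: This paper addresses a key challenge in blind deconvolution of color images: existing methods usually convert color images to grayscale or process each color channel independently, ignoring the dependencies between channels. The core of the solution is a quaternion fidelity term designed for color images. It exploits the structure of a quaternion convolution kernel, which comprises a non-negative main kernel that models the overall blur and three unconstrained sub-kernels, corresponding to the red, green, and blue channels, that explicitly model the unknown interdependencies between channels. To preserve image intensity, a normalized quaternion kernel is used during deconvolution. Experiments show that the method suppresses artifacts and significantly improves deblurring.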

Link: https://arxiv.org/abs/2511.17253
Authors: Yuming Yang,Michael K. Ng,Zhigang Jia,Wei Wang
Affiliations: Tongji University; Hong Kong Baptist University; Jiangsu Normal University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this work, we address the challenging problem of blind deconvolution for color images. Existing methods often convert color images to grayscale or process each color channel separately, which overlooks the relationships between color channels. To handle this issue, we formulate a novel quaternion fidelity term designed specifically for color image blind deconvolution. This fidelity term leverages the properties of the quaternion convolution kernel, which consists of four kernels: one that functions similarly to a non-negative convolution kernel to capture the overall blur, and three additional unconstrained convolution kernels, corresponding to the red, green, and blue channels respectively, that model their unknown interdependencies. In order to preserve image intensity, we propose to use the normalized quaternion kernel in the blind deconvolution process. Extensive experiments on real datasets of blurred color images show that the proposed method effectively removes artifacts and significantly improves the deblurring effect, demonstrating its potential as a powerful tool for color image deconvolution.

[CV-40] Equivariant-Aware Structured Pruning for Efficient Edge Deployment: A Comprehensive Framework with Adaptive Fine-Tuning

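[Quick read]: This paper addresses the tension between model complexity and performance loss when deploying deep learning models with geometric invariance in resource-constrained environments. The core challenge is compressing the model while preserving equivariance to geometric transformations such as rotation, so as to meet edge-computing demands for both efficiency and robustness. The key to the solution is a framework that combines group equivariant convolutional neural networks (G-CNNs) with equivariant-aware structured pruning. First, rotation equivariance is obtained through the C4 cyclic group via the e2cnn library, giving consistent performance under geometric transformations. Second, a neuron-level pruning strategy based on layer-structure analysis preserves equivariance while reducing the parameter count (by 29.3% in the experiments). Finally, adaptive fine-tuning is triggered automatically when the accuracy drop exceeds 2%, using early stopping and learning rate scheduling to recover performance efficiently, and is combined with dynamic INT8 quantization into a reproducible end-to-end pipeline. The approach is evaluated on satellite imagery (EuroSAT), CIFAR-10, and Rotated MNIST.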

Link: https://arxiv.org/abs/2511.17242
Authors: Mohammed Alnemari
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 8 pages, 5 tables, 1 figure. Accepted at IEEE EdgeCom 2025 (11th IEEE International Conference on Edge Computing and Scalable Cloud)

Abstract:This paper presents a novel framework combining group equivariant convolutional neural networks (G-CNNs) with equivariant-aware structured pruning to produce compact, transformation-invariant models for resource-constrained environments. Equivariance to rotations is achieved through the C4 cyclic group via the e2cnn library, enabling consistent performance under geometric transformations while reducing computational overhead. Our approach introduces structured pruning that preserves equivariant properties by analyzing e2cnn layer structure and applying neuron-level pruning to fully connected components. To mitigate accuracy degradation, we implement adaptive fine-tuning that automatically triggers when the accuracy drop exceeds 2%, using early stopping and learning rate scheduling for efficient recovery. The framework includes dynamic INT8 quantization and a comprehensive pipeline encompassing training, knowledge distillation, structured pruning, fine-tuning, and quantization. We evaluate our method on satellite imagery (EuroSAT) and standard benchmarks (CIFAR-10, Rotated MNIST), demonstrating effectiveness across diverse domains. Experimental results show 29.3% parameter reduction with significant accuracy recovery, demonstrating that structured pruning of equivariant networks achieves substantial compression while maintaining geometric robustness. Our pipeline provides a reproducible framework for optimizing equivariant models, bridging the gap between group-theoretic network design and practical deployment constraints, with particular relevance to satellite imagery analysis and geometric vision tasks.
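
Two steps of this pipeline are easy to illustrate: neuron-level pruning of a fully connected layer and the 2% trigger for fine-tuning. The sketch below is an assumption-laden illustration, not the paper's implementation: ranking neurons by L1 norm is a common criterion chosen here for concreteness, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def prune_neurons(weights: Sequence[Sequence[float]],
                  keep_ratio: float) -> Tuple[List[List[float]], List[int]]:
    """Keep the output neurons (rows) with the largest L1 norm.

    Returns the pruned weight rows and the indices of the kept neurons.
    """
    n_keep = max(1, int(len(weights) * keep_ratio))
    norms = [sum(abs(w) for w in row) for row in weights]
    by_norm = sorted(range(len(weights)), key=lambda i: norms[i], reverse=True)
    keep = sorted(by_norm[:n_keep])  # preserve the original neuron order
    return [list(weights[i]) for i in keep], keep


def needs_fine_tuning(acc_before: float, acc_after: float,
                      threshold: float = 0.02) -> bool:
    """True when pruning cost more accuracy than the threshold (2% by default)."""
    return (acc_before - acc_after) > threshold


# Toy fully connected layer with four output neurons (made-up weights).
fc = [[0.1, -0.1], [2.0, 1.0], [0.0, 0.05], [-1.5, 0.5]]
pruned, kept = prune_neurons(fc, keep_ratio=0.5)
print(kept, needs_fine_tuning(0.95, 0.92))
```

When a neuron is removed, the matching input column of the next layer has to be removed as well; that bookkeeping is omitted here.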

[CV-41] TP-MDDN: Task-Preferenced Multi-Demand-Driven Navigation with Autonomous Decision-Making NEURIPS2025

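[Quick read]: This paper addresses the complexity of multi-demand-driven navigation (MDDN) in embodied AI: in realistic scenarios an agent must handle several sub-demands at once and decide according to task preferences, which traditional single-demand Demand-Driven Navigation (DDN) methods cannot do for such long-horizon tasks. The key to the solution is the TP-MDDN benchmark and the AWMSystem framework. The former defines multi-demand navigation tasks with explicit task preferences. The latter consists of three modules, BreakLLM (instruction decomposition), LocateLLM (goal selection), and StatusMLLM (task monitoring), together with the MASMap spatial memory (combining 3D point cloud accumulation with 2D semantic mapping), the Dual-Tempo action generation framework (combining zero-shot planning with policy-based fine control), and an Adaptive Error Corrector, yielding accurate perception and robust navigation.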

Link: https://arxiv.org/abs/2511.17225
Authors: Shanshan Li,Da Huang,Yu He,Yanwei Fu,Yu-Gang Jiang,Xiangyang Xue
Affiliations: Fudan University; Shanghai Jiao Tong University; Shanghai Innovation Institution
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at NeurIPS 2025

Abstract:In daily life, people often move through spaces to find objects that meet their needs, posing a key challenge in embodied AI. Traditional Demand-Driven Navigation (DDN) handles one need at a time but does not reflect the complexity of real-world tasks involving multiple needs and personal choices. To bridge this gap, we introduce Task-Preferenced Multi-Demand-Driven Navigation (TP-MDDN), a new benchmark for long-horizon navigation involving multiple sub-demands with explicit task preferences. To solve TP-MDDN, we propose AWMSystem, an autonomous decision-making system composed of three key modules: BreakLLM (instruction decomposition), LocateLLM (goal selection), and StatusMLLM (task monitoring). For spatial memory, we design MASMap, which combines 3D point cloud accumulation with 2D semantic mapping for accurate and efficient environmental understanding. Our Dual-Tempo action generation framework integrates zero-shot planning with policy-based fine control, and is further supported by an Adaptive Error Corrector that handles failure cases in real time. Experiments demonstrate that our approach outperforms state-of-the-art baselines in both perception accuracy and navigation robustness.

[CV-42] QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy

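[Quick read]: This paper addresses self-supervised learning of 3D scene geometry and semantics from images, a core challenge in computer vision and autonomous driving. Existing methods either rely on 2D rendering consistency, where 3D structure is recovered only implicitly, or build discretized voxel grids from accumulated lidar point clouds, which limits spatial precision and scalability. The key to the solution is QueryOcc, a self-supervised framework based on 4D spatio-temporal queries that learns continuous 3D semantic occupancy directly from queries sampled independently across adjacent frames. The framework supports supervision from either pseudo-point clouds derived from vision foundation models or raw lidar data, and introduces a contractive scene representation that preserves near-field detail while smoothly compressing distant regions, enabling long-range reasoning under constant memory. On the self-supervised Occ3D-nuScenes benchmark, QueryOcc improves semantic RayIoU by 26% over previous camera-based methods while running at 11.6 FPS.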

Link: https://arxiv.org/abs/2511.17221
Authors: Adam Lilja,Ji Lan,Junsheng Fu,Lars Hammarstrand
Affiliations: Chalmers University of Technology; Zenseact
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Learning 3D scene geometry and semantics from images is a core challenge in computer vision and a key capability for autonomous driving. Since large-scale 3D annotation is prohibitively expensive, recent work explores self-supervised learning directly from sensor data without manual labels. Existing approaches either rely on 2D rendering consistency, where 3D structure emerges only implicitly, or on discretized voxel grids from accumulated lidar point clouds, limiting spatial precision and scalability. We introduce QueryOcc, a query-based self-supervised framework that learns continuous 3D semantic occupancy directly through independent 4D spatio-temporal queries sampled across adjacent frames. The framework supports supervision from either pseudo-point clouds derived from vision foundation models or raw lidar data. To enable long-range supervision and reasoning under constant memory, we introduce a contractive scene representation that preserves near-field detail while smoothly compressing distant regions. QueryOcc surpasses previous camera-based methods by 26% in semantic RayIoU on the self-supervised Occ3D-nuScenes benchmark while running at 11.6 FPS, demonstrating that direct 4D query supervision enables strong self-supervised occupancy learning. this https URL
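
A contractive representation keeps nearby coordinates unchanged and squeezes far ones into a bounded shell. The abstract does not give QueryOcc's contraction function, so the sketch below uses the well-known contraction from mip-NeRF 360 as a stand-in; `inner_radius` is an assumed parameter.

```python
import math
from typing import Sequence, Tuple


def contract(point: Sequence[float], inner_radius: float = 1.0) -> Tuple[float, ...]:
    """Identity inside `inner_radius`; outside, map the norm into (R, 2R)."""
    norm = math.sqrt(sum(c * c for c in point))
    if norm <= inner_radius:
        return tuple(point)
    # New norm is R * (2 - R / norm), which approaches 2R as norm grows.
    scale = (2.0 - inner_radius / norm) * inner_radius / norm
    return tuple(c * scale for c in point)


near = contract((0.5, 0.0, 0.0))
far = contract((4.0, 0.0, 0.0))
very_far = contract((1000.0, 0.0, 0.0))
print(near, far, very_far)
```

Every point, however distant, lands inside a ball of radius `2 * inner_radius`, which is what allows a fixed memory budget for long-range scenes.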

[CV-43] Dual-domain Adaptation Networks for Realistic Image Super-resolution

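[Quick read]: This paper addresses realistic image super-resolution (RISR), where the scarcity of paired real low-resolution (LR) and high-resolution (HR) data makes it hard for models to learn effective image features. Existing methods are limited on complex degradation patterns, and SR models pre-trained on synthetic data carry useful priors but do not transfer directly to real scenes. The key to the solution is Dual-domain Adaptation Networks, with two main components: (1) in the spatial domain, selective parameter updating together with low-rank adaptation (LoRA) to adjust frozen parameters and adapt efficiently to the real data distribution; and (2) a frequency-domain adaptation branch that combines the spectrum of the input image with intermediate features of the spatial-domain backbone to explicitly reconstruct high-frequency components, improving the quality and detail of the HR output.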

Link: https://arxiv.org/abs/2511.17217
Authors: Chaowei Fang,Bolin Fu,De Cheng,Lechao Cheng,Guanbin Li
Affiliations: Xidian University; Hefei University of Technology; Sun Yat-sen University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Realistic image super-resolution (SR) focuses on transforming real-world low-resolution (LR) images into high-resolution (HR) ones, handling more complex degradation patterns than synthetic SR tasks. This is critical for applications like surveillance, medical imaging, and consumer electronics. However, current methods struggle with limited real-world LR-HR data, impacting the learning of basic image features. Pre-trained SR models from large-scale synthetic datasets offer valuable prior knowledge, which can improve generalization, speed up training, and reduce the need for extensive real-world data in realistic SR tasks. In this paper, we introduce a novel approach, Dual-domain Adaptation Networks, which is able to efficiently adapt pre-trained image SR models from simulated to real-world datasets. To achieve this target, we first set up a spatial-domain adaptation strategy through selectively updating parameters of pre-trained models and employing the low-rank adaptation technique to adjust frozen parameters. Recognizing that image super-resolution involves recovering high-frequency components, we further integrate a frequency domain adaptation branch into the adapted model, which combines the spectral data of the input and the spatial-domain backbone’s intermediate features to infer HR frequency maps, enhancing the SR result. Experimental evaluations on public realistic image SR benchmarks, including RealSR, D2CRealSR, and DRealSR, demonstrate the superiority of our proposed method over existing state-of-the-art models. Codes are available at: this https URL.
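
Low-rank adaptation leaves a frozen weight matrix W untouched and learns a small update, so the effective weight is W + (alpha / r) * B A. The sketch below shows that arithmetic on toy matrices; the shapes, values, and function names are illustrative assumptions, not the paper's configuration.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def matmul(a: Matrix, b: Matrix) -> List[List[float]]:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w_frozen: Matrix, lora_b: Matrix, lora_a: Matrix,
                alpha: float, rank: int) -> List[List[float]]:
    """Effective weight W + (alpha / rank) * B @ A; W itself is never modified."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_frozen, delta)]


w = [[1.0, 0.0], [0.0, 1.0]]   # frozen pre-trained weight (2 x 2)
b = [[1.0], [2.0]]             # trainable factor B (2 x 1), rank 1
a = [[0.5, -0.5]]              # trainable factor A (1 x 2)

adapted = lora_weight(w, b, a, alpha=2.0, rank=1)
unchanged = lora_weight(w, [[0.0], [0.0]], a, alpha=2.0, rank=1)
print(adapted, unchanged)
```

With B initialized to zero the adapted model starts out identical to the pre-trained one, and only the 2 * r * d entries of A and B are trained instead of the full d * d matrix.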

[CV-44] FisheyeGaussianLift: BEV Feature Lifting for Surround-View Fisheye Camera Perception

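[Quick read]: This paper addresses accurate bird's-eye-view (BEV) semantic segmentation from fisheye imagery, where the core challenges are extreme non-linear distortion, occlusion, and depth ambiguity. The key to the solution is a distortion-aware BEV segmentation framework that processes multi-camera high-resolution fisheye images directly. Using calibrated geometric unprojection and per-pixel depth distribution estimation, each image pixel is lifted into 3D space via a Gaussian parameterization that predicts a spatial mean and an anisotropic covariance to model geometric uncertainty explicitly. The 3D Gaussians are then fused into a BEV representation by differentiable splatting, producing continuous, uncertainty-aware semantic maps without undistortion or perspective rectification.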

Link: https://arxiv.org/abs/2511.17210
Authors: Shubham Sonarghare,Prasad Deshpande,Ciaran Hogan,Deepika-Rani Kaliappan-Mahalingam,Ganesh Sistu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 3 figures, published in IMVIP 2025 conference

Abstract:Accurate BEV semantic segmentation from fisheye imagery remains challenging due to extreme non-linear distortion, occlusion, and depth ambiguity inherent to wide-angle projections. We present a distortion-aware BEV segmentation framework that directly processes multi-camera high-resolution fisheye images, utilizing calibrated geometric unprojection and per-pixel depth distribution estimation. Each image pixel is lifted into 3D space via Gaussian parameterization, predicting spatial means and anisotropic covariances to explicitly model geometric uncertainty. The projected 3D Gaussians are fused into a BEV representation via differentiable splatting, producing continuous, uncertainty-aware semantic maps without requiring undistortion or perspective rectification. Extensive experiments demonstrate strong segmentation performance on complex parking and urban driving scenarios, achieving IoU scores of 87.75% for drivable regions and 57.26% for vehicles under severe fisheye distortion and diverse environmental conditions.
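
One simple way to turn a per-pixel depth distribution into a 3D Gaussian is to take the mean and variance of depth along the pixel's viewing ray. This is a sketch under stated assumptions, not the paper's network: the ray direction is assumed to come from the calibrated fisheye unprojection, depth is measured along the ray, and the covariance is restricted to the ray direction, whereas the paper predicts its covariances.

```python
import math
from typing import List, Sequence, Tuple


def lift_pixel(ray_dir: Sequence[float], depth_bins: Sequence[float],
               depth_probs: Sequence[float]
               ) -> Tuple[List[float], List[List[float]]]:
    """Return the 3D mean and a rank-1 covariance aligned with the ray."""
    norm = math.sqrt(sum(c * c for c in ray_dir))
    d = [c / norm for c in ray_dir]                  # unit viewing direction
    total = sum(depth_probs)
    probs = [p / total for p in depth_probs]         # normalize the distribution
    mean_depth = sum(p * z for p, z in zip(probs, depth_bins))
    var_depth = sum(p * (z - mean_depth) ** 2 for p, z in zip(probs, depth_bins))
    mean = [mean_depth * c for c in d]
    # Uncertainty only along the ray: covariance = var * d d^T (anisotropic).
    cov = [[var_depth * d[i] * d[j] for j in range(3)] for i in range(3)]
    return mean, cov


# Toy pixel looking along +z with depth mass split between 2 m and 4 m.
mean, cov = lift_pixel((0.0, 0.0, 2.0), [2.0, 4.0], [0.5, 0.5])
print(mean, cov[2][2])
```

An ambiguous depth distribution yields an elongated Gaussian along the ray, so the uncertainty carries through when the Gaussians are splatted onto the BEV grid.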

[CV-45] Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers

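[Quick read]: This paper addresses general-purpose representation learning for computed tomography (CT) in 3D medical imaging, and in particular the challenges specific to volumetric CT, such as extreme token scaling, geometric anisotropy, and weak or noisy clinical supervision, which make standard transformer architectures and contrastive learning recipes hard to apply directly. The key to the solution is the SPECTRE framework, whose main contributions are: (1) joint optimization of a local transformer for high-resolution feature extraction and a global transformer for whole-scan context, which makes large-scale 3D attention computationally tractable; (2) pretraining that combines DINO-style self-distillation with SigLIP-based vision-language alignment on paired radiology reports, so that the learned features are both geometrically consistent and clinically meaningful; and (3) pretraining exclusively on openly available CT datasets, showing that high-performing, generalizable CT representations can be obtained without private data.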

Link: https://arxiv.org/abs/2511.17209
Authors: Cris Claessens,Christiaan Viviers,Giacomo D’Amicantonio,Egor Bondarev,Fons van der Sommen
Affiliations: Eindhoven University of Technology; ARIA Lab; AIMS Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce SPECTRE, a fully transformer-based foundation model for volumetric computed tomography (CT). Our Self-Supervised Cross-Modal Pretraining for CT Representation Extraction (SPECTRE) approach utilizes scalable 3D Vision Transformer architectures and modern self-supervised and vision-language pretraining strategies to learn general-purpose CT representations. Volumetric CT poses unique challenges, such as extreme token scaling, geometric anisotropy, and weak or noisy clinical supervision, that make standard transformer and contrastive learning recipes ineffective out of the box. The framework jointly optimizes a local transformer for high-resolution volumetric feature extraction and a global transformer for whole-scan context modeling, making large-scale 3D attention computationally tractable. Notably, SPECTRE is trained exclusively on openly available CT datasets, demonstrating that high-performing, generalizable representations can be achieved without relying on private data. Pretraining combines DINO-style self-distillation with SigLIP-based vision-language alignment using paired radiology reports, yielding features that are both geometrically consistent and clinically meaningful. Across multiple CT benchmarks, SPECTRE consistently outperforms prior CT foundation models in both zero-shot and fine-tuned settings, establishing SPECTRE as a scalable, open, and fully transformer-based foundation model for 3D medical imaging.
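
SigLIP-style alignment scores every image-text pair independently with a sigmoid, rather than a softmax over the batch. The sketch below computes that pairwise loss on toy embeddings; the temperature and bias values are illustrative assumptions, and how SPECTRE weights this term against self-distillation is not stated in the abstract.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid_pair_loss(img: Sequence[Vector], txt: Sequence[Vector],
                      temperature: float = 10.0, bias: float = -10.0) -> float:
    """Mean over images of sum_j -log sigmoid(label_ij * (t * <img_i, txt_j> + b)).

    label_ij is +1 for matching pairs (i == j) and -1 otherwise.
    Embeddings are assumed to be L2-normalized.
    """
    n = len(img)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sim = sum(x * y for x, y in zip(img[i], txt[j]))
            logit = temperature * sim + bias
            label = 1.0 if i == j else -1.0
            total += softplus(-label * logit)  # equals -log(sigmoid(label * logit))
    return total / n


scans = [[1.0, 0.0], [0.0, 1.0]]       # toy CT embeddings
reports = [[1.0, 0.0], [0.0, 1.0]]     # matching report embeddings
loss_aligned = sigmoid_pair_loss(scans, reports)
loss_swapped = sigmoid_pair_loss(scans, reports[::-1])
print(loss_aligned, loss_swapped)
```

Because each pair is scored on its own, the loss needs no normalization across the batch, which is convenient when volumetric inputs keep batch sizes small.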

[CV-46] SING3R-SLAM: Submap-based Indoor Monocular Gaussian SLAM with 3D Reconstruction Priors

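[Quick read]: This paper addresses the inefficiency and limited downstream performance (for example, in novel view synthesis) that arise when dense 3D reconstruction methods are integrated into SLAM (simultaneous localization and mapping), owing to accumulated drift and redundant point maps. The key to the solution is the SING3R-SLAM framework, which builds locally consistent submaps with a lightweight tracking and reconstruction module and progressively aligns and fuses them into a unified global Gaussian representation that jointly refines scene geometry and camera poses, achieving global consistency with a compact representation. The global Gaussian map improves tracking accuracy and reconstruction detail, and also feeds back into local tracking to correct drift, improving robustness and downstream results.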

Link: https://arxiv.org/abs/2511.17207
Authors: Kunyi Li,Michael Niemeyer,Sen Wang,Stefano Gasperini,Nassir Navab,Federico Tombari
Affiliations: Technical University of Munich; Google; Munich Center for Machine Learning; VisualAIs
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Recent advances in dense 3D reconstruction enable the accurate capture of local geometry; however, integrating them into SLAM is challenging due to drift and redundant point maps, which limit efficiency and downstream tasks, such as novel view synthesis. To address these issues, we propose SING3R-SLAM, a globally consistent and compact Gaussian-based dense RGB SLAM framework. The key idea is to combine locally consistent 3D reconstructions with a unified global Gaussian representation that jointly refines scene geometry and camera poses, enabling efficient and versatile 3D mapping for multiple downstream applications. SING3R-SLAM first builds locally consistent submaps through our lightweight tracking and reconstruction module, and then progressively aligns and fuses them into a global Gaussian map that enforces cross-view geometric consistency. This global map, in turn, provides feedback to correct local drift and enhance the robustness of tracking. Extensive experiments demonstrate that SING3R-SLAM achieves state-of-the-art tracking, 3D reconstruction, and novel view rendering, resulting in over 12% improvement in tracking and producing finer, more detailed geometry, all while maintaining a compact and memory-efficient global representation on real-world datasets.

[CV-47] Continual Alignment for SAM: Rethinking Foundation Models for Medical Image Segmentation in Continual Learning

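[Quick read]: This paper addresses medical image segmentation in settings where heterogeneous privacy policies across institutions make joint training on pooled data infeasible, and also the efficiency problem that foundation models such as the Segment Anything Model (SAM) pose in deployment because of their large parameter count and computational overhead. The key to the solution is a lightweight, plug-and-play Alignment Layer that aligns encoder-decoder feature distributions, adapting SAM efficiently to improve segmentation accuracy while reducing computation. On top of it, the continual learning strategy Continual Alignment for SAM (CA-SAM) automatically selects the appropriate Alignment Layer to mitigate catastrophic forgetting, and uses SAM's zero-shot priors to retain strong performance on unseen medical datasets.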

Link: https://arxiv.org/abs/2511.17201
Authors: Jiayi Wang,Wei Dai,Haoyu Wang,Sihan Yang,Haixia Bi,Jian Sun
Affiliations: Xi’an Jiaotong University; School of Information and Communications Engineering, Xi’an Jiaotong University; School of Mathematics and Statistics, Xi’an Jiaotong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In medical image segmentation, heterogeneous privacy policies across institutions often make joint training on pooled datasets infeasible, motivating continual image segmentation, that is, learning from data streams without catastrophic forgetting. While the Segment Anything Model (SAM) offers strong zero-shot priors and has been widely fine-tuned across downstream tasks, its large parameter count and computational overhead challenge practical deployment. This paper demonstrates that the SAM paradigm is highly promising once its computational efficiency and performance can be balanced. To this end, we introduce the Alignment Layer, a lightweight, plug-and-play module which aligns encoder-decoder feature distributions to efficiently adapt SAM to specific medical images, improving accuracy while reducing computation. Building on SAM and the Alignment Layer, we then propose Continual Alignment for SAM (CA-SAM), a continual learning strategy that automatically adapts the appropriate Alignment Layer to mitigate catastrophic forgetting, while leveraging SAM’s zero-shot priors to preserve strong performance on unseen medical datasets. In experiments across nine medical segmentation datasets under a continual-learning scenario, CA-SAM achieves state-of-the-art performance. Our code, models and datasets will be released on this https URL.

[CV-48] VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation

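[Quick read]: This paper addresses the difficulty that vision-language-action (VLA) models have in achieving spatiotemporally coherent control in robotic manipulation, in particular their limited fine-grained spatiotemporal action planning. Existing methods embed 3D positions into visual representations to improve spatial precision, but do poorly on the temporal coherence of action execution. The key to the solution is the VLA-4D model, with two main designs: (1) a 4D-aware visual representation, which embeds 1D time into 3D positions to form 4D embeddings and fuses them into a unified visual representation through cross-attention; and (2) a spatiotemporal action representation, which extends conventional spatial action representations with temporal information to support spatiotemporal planning, and aligns the multimodal representations with the large language model (LLM) for spatiotemporal action prediction. The framework improves the spatial smoothness and temporal coherence of robotic manipulation.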

Link: https://arxiv.org/abs/2511.17199
Authors: Hanyu Zhou,Chuanhao Ma,Gim Hee Lee
Affiliations: National University of Singapore; Huazhong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-language-action (VLA) models show potential for general robotic tasks, but still struggle with spatiotemporally coherent manipulation, which requires fine-grained representations. Typically, existing methods embed 3D positions into visual representations to enhance the spatial precision of actions. However, these methods struggle to achieve temporally coherent control over action execution. In this work, we propose VLA-4D, a general VLA model with 4D awareness for spatiotemporally coherent robotic manipulation. Our model is guided by two key designs: 1) 4D-aware visual representation. We extract visual features, embed 1D time into 3D positions for 4D embeddings, and fuse them into a unified visual representation via a cross-attention mechanism. 2) Spatiotemporal action representation. We extend conventional spatial action representations with temporal information to enable spatiotemporal planning, and align the multimodal representations into the LLM for spatiotemporal action prediction. Within this unified framework, the designed visual and action representations jointly make robotic manipulation spatially smooth and temporally coherent. In addition, we extend the VLA dataset with temporal action annotations for fine-tuning our model. Extensive experiments have been conducted to verify the superiority of our method across different tasks of robotic manipulation.

[CV-49] Designing Domain-Specific Agents via Hierarchical Task Abstraction Mechanism

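[Quick read]: This paper addresses the poor performance of large language model (LLM)-driven agents in specialized domains such as remote sensing analysis, where structured workflows are lacking. The difficulties include reliance on specialized tools (for example, radiometric correction and spectral index calculation) and the complexity of multi-step procedures with intermediate products and optional paths, which general frameworks such as ReAct or role-playing agents handle poorly. The key to the solution is an agent design framework built on a Hierarchical Task Abstraction Mechanism (HTAM). Instead of emulating social roles, it organizes a multi-agent system into a logical hierarchy that mirrors the domain's task-dependency graph, which enforces procedural correctness and decomposes complex problems into sequential layers, with each layer's sub-agents operating on the outputs of the preceding layers. With this architecture, EarthAgent shows substantially better planning on geospatial analysis tasks than existing single- and multi-agent systems.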

Link: https://arxiv.org/abs/2511.17198
Authors: Kaiyu Li,Jiayu Wang,Zhi Wang,Hui Qiao,Weizhan Zhang,Deyu Meng,Xiangyong Cao
Affiliations: Xi’an Jiaotong University; China Telecom Shaanxi Branch
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Page: this https URL

Abstract:LLM-driven agents, particularly those using general frameworks like ReAct or human-inspired role-playing, often struggle in specialized domains that necessitate rigorously structured workflows. Fields such as remote sensing, requiring specialized tools (e.g., correction, spectral indices calculation), and multi-step procedures (e.g., numerous intermediate products and optional steps), significantly challenge generalized approaches. To address this gap, we introduce a novel agent design framework centered on a Hierarchical Task Abstraction Mechanism (HTAM). Specifically, HTAM moves beyond emulating social roles, instead structuring multi-agent systems into a logical hierarchy that mirrors the intrinsic task-dependency graph of a given domain. This task-centric architecture thus enforces procedural correctness and decomposes complex problems into sequential layers, where each layer’s sub-agents operate on the outputs of the preceding layers. We instantiate this framework as EarthAgent, a multi-agent system tailored for complex geospatial analysis. To evaluate such complex planning capabilities, we build GeoPlan-bench, a comprehensive benchmark of realistic, multi-step geospatial planning tasks. It is accompanied by a suite of carefully designed metrics to evaluate tool selection, path similarity, and logical completeness. Experiments show that EarthAgent substantially outperforms a range of established single- and multi-agent systems. Our work demonstrates that aligning agent architecture with a domain’s intrinsic task structure is a critical step toward building robust and reliable specialized autonomous systems.
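
Structuring agents into sequential layers that follow a task-dependency graph amounts to a layered topological sort: a task enters a layer once all of its prerequisites sit in earlier layers. The sketch below shows that layering on a made-up remote-sensing workflow; the task names and the function `layer_tasks` are hypothetical and not taken from EarthAgent.

```python
from typing import Dict, List, Sequence, Set


def layer_tasks(dependencies: Dict[str, Sequence[str]]) -> List[List[str]]:
    """Group tasks into layers so each task only depends on earlier layers."""
    remaining: Dict[str, Set[str]] = {t: set(d) for t, d in dependencies.items()}
    # Prerequisites that never appear as keys are treated as tasks with no deps.
    for deps in list(remaining.values()):
        for dep in deps:
            remaining.setdefault(dep, set())
    layers: List[List[str]] = []
    done: Set[str] = set()
    while remaining:
        ready = sorted(t for t, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("dependency cycle detected")
        layers.append(ready)
        done.update(ready)
        for task in ready:
            del remaining[task]
    return layers


# Hypothetical workflow: each task lists the tasks whose outputs it consumes.
workflow = {
    "load_scene": [],
    "radiometric_correction": ["load_scene"],
    "ndvi": ["radiometric_correction"],
    "cloud_mask": ["radiometric_correction"],
    "change_report": ["ndvi", "cloud_mask"],
}
layers = layer_tasks(workflow)
print(layers)
```

Tasks within one layer are independent of each other, so the sub-agents assigned to them could run in parallel before the next layer starts.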

[CV-50] Real Noise Decoupling for Hyperspectral Image Denoising

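[Quick Read]: This paper addresses hyperspectral image (HSI) denoising, where real noise is complex and hard to model accurately, which limits methods that train on synthetic data from fitted noise models. The key to the solution is a multi-stage noise-decoupling framework that splits the noise into an explicitly modeled part and an implicitly modeled part. For explicitly modeled noise, an existing noise model generates paired data to pre-train a denoising network, giving it prior knowledge of that component. For implicitly modeled noise, a high-frequency wavelet guided network uses the prior from the pre-trained module to adaptively extract high-frequency features and remove the remaining noise from real paired data. A multi-stage learning strategy, with separate pre-training followed by joint fine-tuning, removes all noise components while limiting error accumulation across stages.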

Link: https://arxiv.org/abs/2511.17196
Authors: Yingkai Zhang, Tao Zhang, Jing Nie, Ying Fu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Hyperspectral image (HSI) denoising is a crucial step in enhancing the quality of HSIs. Noise modeling methods can fit noise distributions to generate synthetic HSIs to train denoising networks. However, the noise in captured HSIs is usually complex and difficult to model accurately, which significantly limits the effectiveness of these approaches. In this paper, we propose a multi-stage noise-decoupling framework that decomposes complex noise into explicitly modeled and implicitly modeled components. This decoupling reduces the complexity of noise and enhances the learnability of HSI denoising methods when applied to real paired data. Specifically, for explicitly modeled noise, we utilize an existing noise model to generate paired data for pre-training a denoising network, equipping it with prior knowledge to handle the explicitly modeled noise effectively. For implicitly modeled noise, we introduce a high-frequency wavelet guided network. Leveraging the prior knowledge from the pre-trained module, this network adaptively extracts high-frequency features to target and remove the implicitly modeled noise from real paired HSIs. Furthermore, to effectively eliminate all noise components and mitigate error accumulation across stages, a multi-stage learning strategy, comprising separate pre-training and joint fine-tuning, is employed to optimize the entire framework. Extensive experiments on public and our captured datasets demonstrate that our proposed framework outperforms state-of-the-art methods, effectively handling complex real-world noise and significantly enhancing HSI quality.
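To see why high-frequency wavelet coefficients are a natural place to look for residual noise, here is a minimal sketch of a one-level 1D Haar transform. It is a generic illustration, not the paper's network; the signals are made-up values.

```python
import math

def haar_1d(signal):
    """One-level Haar transform of an even-length sequence.

    Returns (low, high): scaled pairwise sums and pairwise differences.
    """
    s = 1 / math.sqrt(2)
    pairs = list(zip(signal[0::2], signal[1::2]))
    low = [(a + b) * s for a, b in pairs]
    high = [(a - b) * s for a, b in pairs]
    return low, high

# A piecewise-constant signal and a perturbed copy (illustrative values).
smooth = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
noisy = [1.0, 1.4, 2.0, 1.7, 3.0, 3.2]

_, high_smooth = haar_1d(smooth)
_, high_noisy = haar_1d(noisy)
```

For the smooth signal every detail coefficient is zero, while the perturbations in the noisy copy show up directly in the high-frequency band.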

[CV-51] PostCam: Camera-Controllable Novel-View Video Generation with Query-Shared Cross-Attention

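[Quick Read]: This paper addresses the low camera-control precision and loss of visual detail in existing video recapture methods for dynamic scenes, which the authors attribute to suboptimal camera motion injection strategies. The core of the solution is PostCam, a framework whose query-shared cross-attention module fuses two control signals, 6-DoF camera poses and 2D rendered video frames, into a unified representation in a shared feature space. This lets the model extract underlying motion cues and improves both the precision of camera trajectory editing and the visual fidelity of the generated video. A two-stage training strategy first learns coarse camera control from pose inputs and then incorporates visual information to refine motion accuracy and image quality.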

Link: https://arxiv.org/abs/2511.17185
Authors: Yipeng Chen, Zhichao Ye, Zhenzhou Fang, Xinyu Chen, Xiaoyu Zhang, Jialing Liu, Nan Wang, Haomin Liu, Guofeng Zhang
Affiliations: State Key Lab of CAD&CG, Zhejiang University; Shanghai InSpatio Intelligent Technology Co., Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:We propose PostCam, a framework for novel-view video generation that enables post-capture editing of camera trajectories in dynamic scenes. We find that existing video recapture methods suffer from suboptimal camera motion injection strategies; such suboptimal designs not only limit camera control precision but also result in generated videos that fail to preserve fine visual details from the source video. To achieve more accurate and flexible motion manipulation, PostCam introduces a query-shared cross-attention module. It integrates two distinct forms of control signals: the 6-DoF camera poses and the 2D rendered video frames. By fusing them into a unified representation within a shared feature space, our model can extract underlying motion cues, which enhances both control precision and generation quality. Furthermore, we adopt a two-stage training strategy: the model first learns coarse camera control from pose inputs, and then incorporates visual information to refine motion accuracy and enhance visual fidelity. Experiments on both real-world and synthetic datasets demonstrate that PostCam outperforms state-of-the-art methods by over 20% in camera control precision and view consistency, while achieving the highest video generation quality. Our project webpage is publicly available at: this https URL

[CV-52] Navigating in the Dark: A Multimodal Framework and Dataset for Nighttime Traffic Sign Recognition

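[Quick Read]: This paper addresses the limited robustness of nighttime traffic sign recognition (TSR), which suffers from visual noise under low illumination, a scarcity of public nighttime datasets, and methods that fail to exploit complementary multimodal cues. The solution has two parts. First, the authors build INTSD (Indian Night-Time Street Dataset), a large-scale nighttime dataset covering 41 traffic sign classes captured under varying lighting and weather conditions, as a benchmark for detection and classification. Second, they propose LENS-Net, which combines an adaptive image enhancement detector that jointly performs illumination correction and sign localization with a structured multimodal CLIP-GCNN classifier that uses cross-modal attention and graph-based reasoning for robust, semantically consistent recognition. Experiments show that the method surpasses existing frameworks.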

Link: https://arxiv.org/abs/2511.17183
Authors: Aditya Mishra, Akshay Agarwal, Haroon Lone
Affiliations: IISER Bhopal
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments:


Abstract:Traffic signboards are vital for road safety and intelligent transportation systems, enabling navigation and autonomous driving. Yet, recognizing traffic signs at night remains challenging due to visual noise and scarcity of public nighttime datasets. Despite advances in vision architectures, existing methods struggle with robustness under low illumination and fail to leverage complementary multimodal cues effectively. To overcome these limitations, firstly, we introduce INTSD, a large-scale dataset comprising street-level night-time images of traffic signboards collected across diverse regions of India. The dataset spans 41 traffic signboard classes captured under varying lighting and weather conditions, providing a comprehensive benchmark for both detection and classification tasks. To benchmark INTSD for night-time sign recognition, we conduct extensive evaluations using state-of-the-art detection and classification models. Secondly, we propose LENS-Net, which integrates an adaptive image enhancement detector for joint illumination correction and sign localization, followed by a structured multimodal CLIP-GCNN classifier that leverages cross-modal attention and graph-based reasoning for robust and semantically consistent recognition. Our method surpasses existing frameworks, with ablation studies confirming the effectiveness of its key components. The dataset and code for LENS-Net are publicly available for research.

[CV-53] Investigating self-supervised representations for audio-visual deepfake detection

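[Quick Read]: This paper examines the underexplored potential of self-supervised representations for audio-visual deepfake detection. Its approach is a systematic evaluation of these features across modalities (audio, video, multimodal) and domains (lip movements, generic visual content) along three dimensions: detection effectiveness, interpretability of the encoded information, and cross-modal complementarity. The study finds that most self-supervised features capture deepfake-relevant information and that this information is complementary, and that models attend mainly to semantically meaningful regions rather than spurious artifacts. None of the features generalizes reliably across datasets, however, and the authors attribute this failure to dataset characteristics rather than to the features latching onto superficial patterns.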

Link: https://arxiv.org/abs/2511.17181
Authors: Dragos-Alexandru Boldisor, Stefan Smeu, Dan Oneata, Elisabeta Oneata
Affiliations: Bitdefender; Politehnica Bucharest
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)
Comments:


Abstract:Self-supervised representations excel at many vision and speech tasks, but their potential for audio-visual deepfake detection remains underexplored. Unlike prior work that uses these features in isolation or buried within complex architectures, we systematically evaluate them across modalities (audio, video, multimodal) and domains (lip movements, generic visual content). We assess three key dimensions: detection effectiveness, interpretability of encoded information, and cross-modal complementarity. We find that most self-supervised features capture deepfake-relevant information, and that this information is complementary. Moreover, models primarily attend to semantically meaningful regions rather than spurious artifacts. Yet none generalize reliably across datasets. This generalization failure likely stems from dataset characteristics, not from the features themselves latching onto superficial patterns. These results expose both the promise and fundamental challenges of self-supervised representations for deepfake detection: while they learn meaningful patterns, achieving robust cross-domain performance remains elusive.

[CV-54] FireScope: Wildfire Risk Prediction with a Chain-of-Thought Oracle

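[Quick Read]: This paper addresses the lack of causal reasoning and multimodal understanding in wildfire risk prediction, which makes existing methods unreliable when transferred across regions. The core challenge is to combine remote sensing imagery (Sentinel-2), climate data, and geographic information into interpretable, continuous risk rasters that generalize across continents. The key to the solution is the FireScope-Bench dataset together with FireScope, a framework based on a vision-language model (VLM) that is trained with both reinforcement learning and visual supervision to produce risk predictions accompanied by complementary reasoning traces, improving generalization and interpretability. When trained in the USA and tested in Europe, FireScope achieves substantial performance gains, and expert feedback and automated analysis indicate that its reasoning traces are faithful and semantically meaningful.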

Link: https://arxiv.org/abs/2511.17171
Authors: Mario Markov (1), Stefan Maria Ailuro (1), Luc Van Gool (1), Konrad Schindler (2), Danda Pani Paudel (1 and 2) ((1) INSAIT, Sofia University, (2) ETH Zurich)
Affiliations: INSAIT; Sofia University; ETH Zurich
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:


Abstract:Predicting wildfire risk is a reasoning-intensive spatial problem that requires the integration of visual, climatic, and geographic factors to infer continuous risk maps. Existing methods lack the causal reasoning and multimodal understanding required for reliable generalization. We introduce FireScope-Bench, a large-scale dataset and benchmark that couples Sentinel-2 imagery and climate data with expert-defined risk rasters across the USA, and real wildfire events in Europe for cross-continental evaluation. Building on this dataset, we propose FireScope, a VLM-based reasoning-to-generation framework that learns from both reinforcement learning and visual supervision to predict risk rasters with complementary reasoning traces. When trained in the USA and tested in Europe, FireScope achieves substantial performance gains, while expert feedback and automated analysis confirm that its reasoning traces are faithful and semantically meaningful. Our findings demonstrate that reasoning can ground raster prediction models, improving both generalization and interpretability. To our knowledge, this is the first framework to (1) demonstrate that language-based reasoning can improve generalization in visual generation, (2) propose a high-resolution wildfire risk model that can be applied across continents, and (3) enable systematic studies of robust cross-continental generalization for multimodal fire risk models. We believe that FireScope-Bench has the potential to serve as a foundation for advancing reasoning-driven, interpretable and generalizable spatial modeling. Data and source code will be made publicly available.

[CV-55] UI-Styler: Ultrasound Image Style Transfer with Class-Aware Prompts for Cross-Device Diagnosis Using a Frozen Black-Box Inference Network WACV2026

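[Quick Read]: This paper addresses the domain shift between ultrasound images acquired on different devices, which degrades the performance of a fixed black-box downstream inference model. Existing unpaired image translation (UIT) methods usually overlook class-specific semantic alignment, producing misaligned content-class mappings that can impair diagnostic accuracy. The key to the solution is UI-Styler, an ultrasound-specific, class-aware image style transfer framework. A pattern-matching mechanism transfers texture patterns from target-domain images onto source-domain images while preserving the source structural content, and a class-aware prompting strategy guided by target-domain pseudo labels enforces semantic alignment with diagnostic categories, giving more reliable cross-device domain adaptation.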

Link: https://arxiv.org/abs/2511.17155
Authors: Nhat-Tuong Do-Tran, Ngoc-Hoang-Lam Le, Ching-Chun Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL, Accepted to WACV 2026


Abstract:The appearance of ultrasound images varies across acquisition devices, causing domain shifts that degrade the performance of fixed black-box downstream inference models when reused. To mitigate this issue, it is practical to develop unpaired image translation (UIT) methods that effectively align the statistical distributions between source and target domains, particularly under the constraint of a reused inference-blackbox setting. However, existing UIT approaches often overlook class-specific semantic alignment during domain adaptation, resulting in misaligned content-class mappings that can impair diagnostic accuracy. To address this limitation, we propose UI-Styler, a novel ultrasound-specific, class-aware image style transfer framework. UI-Styler leverages a pattern-matching mechanism to transfer texture patterns embedded in the target images onto source images while preserving the source structural content. In addition, we introduce a class-aware prompting strategy guided by pseudo labels of the target domain, which enforces accurate semantic alignment with diagnostic categories. Extensive experiments on ultrasound cross-device tasks demonstrate that UI-Styler consistently outperforms existing UIT methods, achieving state-of-the-art performance in distribution distance and downstream tasks, such as classification and segmentation.

[CV-56] DiffRefiner: Coarse to Fine Trajectory Planning via Diffusion Refinement with Semantic Interaction for End to End Autonomous Driving AAAI2026

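[Quick Read]: This paper addresses a performance bottleneck in diffusion-based trajectory prediction for autonomous driving: such methods denoise human-crafted trajectory anchors or random noise, which leaves room for improvement in accuracy and scene consistency. The key to the solution is DiffRefiner, a two-stage framework. In the first stage, a transformer-based Proposal Decoder regresses coarse trajectories from sensor inputs using predefined trajectory anchors. In the second stage, a Diffusion Refiner iteratively denoises and refines these initial predictions, and a fine-grained denoising decoder improves alignment between the trajectory and the surrounding environment. The discriminative proposal module thus provides strong guidance for the generative refinement, and the method reports state-of-the-art results on NAVSIM v2 and Bench2Drive.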

Link: https://arxiv.org/abs/2511.17150
Authors: Liuhan Yin, Runkun Ju, Guodong Guo, Erkang Cheng
Affiliations: Nullmax
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI 2026


Abstract:Unlike discriminative approaches in autonomous driving that predict a fixed set of candidate trajectories of the ego vehicle, generative methods, such as diffusion models, learn the underlying distribution of future motion, enabling more flexible trajectory prediction. However, since these methods typically rely on denoising human-crafted trajectory anchors or random noise, there remains significant room for improvement. In this paper, we propose DiffRefiner, a novel two-stage trajectory prediction framework. The first stage uses a transformer-based Proposal Decoder to generate coarse trajectory predictions by regressing from sensor inputs using predefined trajectory anchors. The second stage applies a Diffusion Refiner that iteratively denoises and refines these initial predictions. In this way, we enhance the performance of diffusion-based planning by incorporating a discriminative trajectory proposal module, which provides strong guidance for the generative refinement process. Furthermore, we design a fine-grained denoising decoder to enhance scene compliance, enabling more accurate trajectory prediction through enhanced alignment with the surrounding environment. Experimental results demonstrate that DiffRefiner achieves state-of-the-art performance, attaining 87.4 EPDMS on NAVSIM v2, and 87.1 DS along with 71.4 SR on Bench2Drive, thereby setting new records on both public benchmarks. The effectiveness of each component is validated via ablation studies as well.

[CV-57] A lightweight detector for real-time detection of remote sensing images

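[Quick Read]: This paper addresses the trade-off between real-time speed and accuracy when detecting small objects in remote sensing images, in particular missed detections of small objects against complex backgrounds. The solution is DMG-YOLO, a lightweight real-time detector with three main components: (1) a Dual-branch Feature Extraction (DFE) module in the backbone, where one branch extracts local detail with depthwise separable convolutions and the other captures global context with a gated vision transformer; (2) a Multi-scale Feature Fusion (MFF) module that uses dilated convolutions to strengthen multi-scale integration while preserving fine detail; and (3) a Global and Local Aggregate Feature Pyramid Network (GLAFPN) in the neck, which fuses global and local features across scales to improve small object detection.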

Link: https://arxiv.org/abs/2511.17147
Authors: Qianyi Wang, Guoqiang Ren
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: none


Abstract:Remote sensing imagery is widely used across various fields, yet real-time detection remains challenging due to the prevalence of small objects and the need to balance accuracy with efficiency. To address this, we propose DMG-YOLO, a lightweight real-time detector tailored for small object detection in remote sensing images. Specifically, we design a Dual-branch Feature Extraction (DFE) module in the backbone, which partitions feature maps into two parallel branches: one extracts local features via depthwise separable convolutions, and the other captures global context using a vision transformer with a gating mechanism. Additionally, a Multi-scale Feature Fusion (MFF) module with dilated convolutions enhances multi-scale integration while preserving fine details. In the neck, we introduce the Global and Local Aggregate Feature Pyramid Network (GLAFPN) to further boost small object detection through global-local feature fusion. Extensive experiments on the VisDrone2019 and NWPU VHR-10 datasets show that DMG-YOLO achieves competitive performance in terms of mAP, model size, and other key metrics.
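Depthwise separable convolutions are a common choice for lightweight detectors because they need far fewer weights than a standard convolution. The arithmetic below is a generic illustration (bias terms ignored); the channel counts and kernel size are assumed values, not taken from DMG-YOLO.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

# Assumed layer shape for illustration: 64 -> 128 channels, 3 x 3 kernel.
standard = conv_params(64, 128, 3)                  # 9 * 64 * 128 = 73728
separable = depthwise_separable_params(64, 128, 3)  # 576 + 8192 = 8768
```

For this layer shape the separable version uses roughly one eighth of the weights.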

[CV-58] Learning to Look Closer: A New Instance-Wise Loss for Small Cerebral Lesion Segmentation

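[Quick Read]: This paper addresses the poor segmentation of small lesions by traditional medical image segmentation losses such as Dice: because small lesions occupy little relative volume, they contribute negligibly to the overall loss and the model tends to ignore them. The key to the solution is CC-DiceCE, a new loss function based on the CC-Metrics framework that evaluates segmentation instance-wise, on a per-lesion basis, which makes training more sensitive to each individual lesion. The loss increases detection (recall) with minimal to no degradation in segmentation performance, at the cost of slightly more false positives.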

Link: https://arxiv.org/abs/2511.17146
Authors: Luc Bouteille, Alexander Jaus, Jens Kleesiek, Rainer Stiefelhagen, Lukas Heine
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 2 figures, 2 tables


Abstract:Traditional loss functions in medical image segmentation, such as Dice, often under-segment small lesions because their small relative volume contributes negligibly to the overall loss. To address this, instance-wise loss functions and metrics have been proposed to evaluate segmentation quality on a per-lesion basis. We introduce CC-DiceCE, a loss function based on the CC-Metrics framework, and compare it with the existing blob loss. Both are benchmarked against a DiceCE baseline within the nnU-Net framework, which provides a robust and standardized setup. We find that CC-DiceCE loss increases detection (recall) with minimal to no degradation in segmentation performance, albeit at the cost of slightly more false positives. Furthermore, our multi-dataset study shows that CC-DiceCE generally outperforms blob loss.
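The effect described above, a missed small lesion barely moving the global Dice score, can be shown with a toy 1D example. This is a simplified metric illustration, not the CC-DiceCE loss: it assumes each index is assigned to its nearest ground-truth component and that Dice is then averaged over those regions. The masks are made up.

```python
def dice(pred, truth):
    """Dice score of two binary sequences (1.0 when both are empty)."""
    inter = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return 1.0 if total == 0 else 2 * inter / total

def components(mask):
    """Return (start, end) index pairs for each run of consecutive 1s."""
    runs, start = [], None
    for i, v in enumerate(list(mask) + [0]):
        if v and start is None:
            start = i
        elif not v and start is not None:
            runs.append((start, i))
            start = None
    return runs

def per_component_dice(pred, truth):
    """Average Dice over regions, one region per ground-truth component."""
    runs = components(truth)
    if not runs:
        raise ValueError("ground truth contains no lesion")

    def dist(i, run):
        s, e = run
        return 0 if s <= i < e else min(abs(i - s), abs(i - (e - 1)))

    # Assign every index to its nearest ground-truth component.
    owner = [min(range(len(runs)), key=lambda r: dist(i, runs[r]))
             for i in range(len(truth))]
    scores = []
    for r in range(len(runs)):
        idx = [i for i, o in enumerate(owner) if o == r]
        scores.append(dice([pred[i] for i in idx], [truth[i] for i in idx]))
    return sum(scores) / len(scores)

# One large lesion (8 voxels) and one small lesion (1 voxel); the
# prediction finds the large lesion perfectly and misses the small one.
truth = [0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
pred  = [0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

global_score = dice(pred, truth)                 # 16 / 17, about 0.94
lesion_score = per_component_dice(pred, truth)   # (1.0 + 0.0) / 2 = 0.5
```

The global score stays near 0.94 even though half of the lesions were missed, whereas the per-component average drops to 0.5.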

[CV-59] One-Step Diffusion Transformer for Controllable Real-World Image Super-Resolution

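[Quick Read]: This paper addresses the trade-off between fidelity and controllability in diffusion-based real-world image super-resolution (Real-ISR): multi-step diffusion methods have low fidelity because of generative diversity and randomness, while one-step methods lose control flexibility because of fidelity-specific fine-tuning. The key to the solution is ODTSR, a one-step diffusion transformer based on Qwen-Image with a Noise-hybrid Visual Stream (NVS) design, in which a newly introduced visual stream receives the low-quality (LQ) image with adjustable noise (Control Noise) and the original visual stream receives the LQ image with consistent noise (Prior Noise). Fidelity-aware Adversarial Training (FAA) further enhances controllability and enables one-step inference. ODTSR reaches state-of-the-art performance on generic Real-ISR and supports prompt controllability in challenging scenarios such as scene text image super-resolution (STISR) of Chinese characters without training on specific datasets.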

Link: https://arxiv.org/abs/2511.17138
Authors: Yushun Fang, Yuxiang Chen, Shibo Yin, Qiang Hu, Jiangchao Yao, Ya Zhang, Xiaoyun Zhang, Yanfeng Wang
Affiliations: Shanghai Jiao Tong University; Xiaohongshu Inc
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Recent advances in diffusion-based real-world image super-resolution (Real-ISR) have demonstrated remarkable perceptual quality, yet the balance between fidelity and controllability remains a problem: multi-step diffusion-based methods suffer from generative diversity and randomness, resulting in low fidelity, while one-step methods lose control flexibility due to fidelity-specific finetuning. In this paper, we present ODTSR, a one-step diffusion transformer based on Qwen-Image that performs Real-ISR considering fidelity and controllability simultaneously: a newly introduced visual stream receives low-quality images (LQ) with adjustable noise (Control Noise), and the original visual stream receives LQs with consistent noise (Prior Noise), forming the Noise-hybrid Visual Stream (NVS) design. ODTSR further employs Fidelity-aware Adversarial Training (FAA) to enhance controllability and achieve one-step inference. Extensive experiments demonstrate that ODTSR not only achieves state-of-the-art (SOTA) performance on generic Real-ISR, but also enables prompt controllability on challenging scenarios such as real-world scene text image super-resolution (STISR) of Chinese characters without training on specific datasets.

[CV-60] A Multi-Stage Optimization Framework for Deploying Learned Image Compression on FPGAs

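[Quick Read]: This paper addresses the performance and efficiency bottlenecks of deploying deep learning-based image compression (LIC) models on resource-constrained field-programmable gate arrays (FPGAs). The core challenges are the accuracy loss when converting floating-point models to integer implementations and the excess computational complexity that remains when hardware characteristics are not exploited. The key to the solution is a multi-stage optimization framework. First, Dynamic Range-Aware Quantization (DRAQ) combines statistically calibrated activation clipping with a novel weight regularization scheme to counteract extreme outliers and large dynamic ranges, producing a high-fidelity 8-bit integer model. Second, two hardware-aware optimizations target FPGAs: a progressive mixed-precision search that assigns non-uniform bit-widths layer by layer, and a channel pruning method adapted to Generalized Divisive Normalization (GDN) layers. Together they reduce computational complexity while preserving rate-distortion (RD) performance.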

Link: https://arxiv.org/abs/2511.17135
Authors: Jiaxun Fang, Li Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Deep learning-based image compression (LIC) has achieved state-of-the-art rate-distortion (RD) performance, yet deploying these models on resource-constrained FPGAs remains a major challenge. This work presents a complete, multi-stage optimization framework to bridge the gap between high-performance floating-point models and efficient, hardware-friendly integer-based implementations. First, we address the fundamental problem of quantization-induced performance degradation. We propose a Dynamic Range-Aware Quantization (DRAQ) method that uses statistically-calibrated activation clipping and a novel weight regularization scheme to counteract the effects of extreme data outliers and large dynamic ranges, successfully creating a high-fidelity 8-bit integer model. Second, building on this robust foundation, we introduce two hardware-aware optimization techniques tailored for FPGAs. A progressive mixed-precision search algorithm exploits FPGA flexibility to assign optimal, non-uniform bit-widths to each layer, minimizing complexity while preserving performance. Concurrently, a channel pruning method, adapted to work with the Generalized Divisive Normalization (GDN) layers common in LIC, removes model redundancy by eliminating inactive channels. Our comprehensive experiments show that the foundational DRAQ method reduces the BD-rate overhead of a GDN-based model from 30% to 6.3%. The subsequent hardware-aware optimizations further reduce computational complexity by over 20% with negligible impact on RD performance, yielding a final model that is both state-of-the-art in efficiency and superior in quality to existing FPGA-based LIC implementations.
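Why calibrated clipping matters for 8-bit quantization can be shown with a small sketch. This is a generic percentile-based calibration, not the DRAQ method itself; the 99th-percentile choice, the helper names, and the synthetic activations are all assumptions made for illustration.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def quantize_int8(values, clip):
    """Symmetric 8-bit quantization: scale by clip / 127, round, saturate."""
    scale = clip / 127
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def mean_abs_error(values, clip):
    """Average reconstruction error after quantizing and dequantizing."""
    codes, scale = quantize_int8(values, clip)
    return sum(abs(v - c * scale) for v, c in zip(values, codes)) / len(values)

inliers = [i / 50 - 1 for i in range(100)]   # values in [-1.0, 0.98]
acts = inliers + [50.0]                      # plus one extreme outlier
magnitudes = [abs(a) for a in acts]

clip_max = max(magnitudes)                   # range dictated by the outlier
clip_p99 = percentile(magnitudes, 99)        # range dictated by the bulk
err_max = mean_abs_error(inliers, clip_max)
err_p99 = mean_abs_error(inliers, clip_p99)
```

With the range set by the single outlier, almost all of the 8-bit codes are wasted and the typical values are reconstructed coarsely; clipping at the 99th percentile saturates the outlier but represents the bulk of the data much more precisely.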

[CV-61] Off the Planckian Locus: Using 2D Chromaticity to Improve In-Camera Color

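[Quick Read]: This paper addresses inaccurate color reproduction by traditional in-camera colorimetric mapping under non-Planckian light sources such as LEDs. Traditional methods interpolate between pre-calibrated transforms using one-dimensional correlated color temperature (CCT), but modern LED sources often deviate from the Planckian locus, which causes color errors. The key to the solution is to extend the illumination description from 1D CCT (on the Planckian locus) to a 2D chromaticity space (off the Planckian locus) and to replace CCT interpolation with a lightweight multi-layer perceptron (MLP) that uses 2D chromaticity features for robust mapping under non-Planckian illuminants. The MLP is trained with a lightbox-based calibration procedure that includes representative LED sources. In LED-lit scenes the method reduces angular reproduction error by 22% on average, while remaining backward compatible with traditional illuminants, accommodating multi-illuminant scenes, and supporting real-time in-camera deployment.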

Link: https://arxiv.org/abs/2511.17133
Authors: SaiKiran Tedla, Joshua E. Little, Hakki Can Karaimer, Michael S. Brown
Affiliations: York University; AI Center-Toronto, Samsung Electronics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL


Abstract:Traditional in-camera colorimetric mapping relies on correlated color temperature (CCT)-based interpolation between pre-calibrated transforms optimized for Planckian illuminants such as CIE A and D65. However, modern lighting technologies such as LEDs can deviate substantially from the Planckian locus, exposing the limitations of relying on conventional one-dimensional CCT for illumination characterization. This paper demonstrates that transitioning from 1D CCT (on the Planckian locus) to a 2D chromaticity space (off the Planckian locus) improves colorimetric accuracy across various mapping approaches. In addition, we replace conventional CCT interpolation with a lightweight multi-layer perceptron (MLP) that leverages 2D chromaticity features for robust colorimetric mapping under non-Planckian illuminants. A lightbox-based calibration procedure incorporating representative LED sources is used to train our MLP. Validated across diverse LED lighting, our method reduces angular reproduction error by 22% on average in LED-lit scenes, maintains backward compatibility with traditional illuminants, accommodates multi-illuminant scenes, and supports real-time in-camera deployment with negligible additional computational cost.
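For context, the conventional baseline that the paper moves away from can be sketched in a few lines. The example assumes one common convention, blending two pre-calibrated 3x3 transforms with a weight that is linear in inverse CCT, with CIE A at about 2856 K and D65 at about 6504 K as the calibration illuminants. The matrices are placeholder values, and the function names are hypothetical.

```python
def mired_weight(cct, cct_low=2856.0, cct_high=6504.0):
    """Blend weight for the high-CCT transform, linear in 1 / CCT."""
    cct = min(max(cct, cct_low), cct_high)   # clamp to the calibrated range
    return (1 / cct_low - 1 / cct) / (1 / cct_low - 1 / cct_high)

def blend(m_low, m_high, w):
    """Element-wise blend of two matrices given as nested lists."""
    return [[(1 - w) * a + w * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(m_low, m_high)]

# Placeholder transforms; real ones come from per-camera calibration.
M_A = [[1.8, -0.6, -0.2], [-0.3, 1.5, -0.2], [0.0, -0.7, 1.7]]
M_D65 = [[1.6, -0.5, -0.1], [-0.2, 1.4, -0.2], [0.0, -0.4, 1.4]]

w = mired_weight(4000.0)
M_4000K = blend(M_A, M_D65, w)
```

Because the weight depends only on CCT, two light sources with the same CCT but different offsets from the Planckian locus receive the same transform, which is the limitation a 2D chromaticity input is meant to remove.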

[CV-62] PEGS: Physics-Event Enhanced Large Spatiotemporal Motion Reconstruction via 3D Gaussian Splatting

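[Quick Read]: This paper addresses the reconstruction of rigid motion over large spatiotemporal scales, where the main challenges are limitations of current modeling paradigms, severe motion blur, and insufficient physical consistency. The key to the solution is PEGS, a framework that integrates physical priors and event stream enhancement into a 3D Gaussian Splatting pipeline for deblurred, target-focused modeling and motion recovery. It makes three contributions: a triple-level supervision scheme that enforces physical plausibility through an acceleration constraint, uses event streams for high-temporal-resolution guidance, and fuses multi-source observations with a Kalman regularizer; a motion-aware simulated annealing strategy that adaptively schedules training based on real-time kinematic states; and the first RGB-Event paired dataset targeting natural, fast rigid motion across diverse scenarios.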

Link: https://arxiv.org/abs/2511.17116
Authors: Yijun Xu, Jingrui Zhang, Hongyi Liu, Yuhan Chen, Yuanyang Wang, Qingyao Guo, Dingwen Wang, Lei Yu, Chu He
Affiliations: Wuhan University; Chongqing University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Reconstruction of rigid motion over large spatiotemporal scales remains a challenging task due to limitations in modeling paradigms, severe motion blur, and insufficient physical consistency. In this work, we propose PEGS, a framework that integrates Physical priors with Event stream enhancement within a 3D Gaussian Splatting pipeline to perform deblurred target-focused modeling and motion recovery. We introduce a cohesive triple-level supervision scheme that enforces physical plausibility via an acceleration constraint, leverages event streams for high-temporal resolution guidance, and employs a Kalman regularizer to fuse multi-source observations. Furthermore, we design a motion-aware simulated annealing strategy that adaptively schedules the training process based on real-time kinematic states. We also contribute the first RGB-Event paired dataset targeting natural, fast rigid motion across diverse scenarios. Experiments show PEGS’s superior performance in reconstructing motion over large spatiotemporal scales compared to mainstream dynamic methods.
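The abstract mentions an acceleration constraint without giving its form. As a rough sketch of what such a term could look like, the example below penalizes the squared second finite difference of a sampled trajectory. This specific penalty is an assumption, not the PEGS formulation, and the trajectories are made up.

```python
def acceleration_penalty(positions, dt):
    """Mean squared second finite difference over all axes of a trajectory.

    positions is a list of equal-length coordinate tuples sampled every dt.
    """
    total, count = 0.0, 0
    for prev, cur, nxt in zip(positions, positions[1:], positions[2:]):
        for p, c, n in zip(prev, cur, nxt):
            accel = (n - 2 * c + p) / dt ** 2   # central difference
            total += accel ** 2
            count += 1
    return total / count if count else 0.0

# Constant-velocity motion versus the same path with a jitter at step 2.
steady = [(float(t), 2.0 * t, 0.0) for t in range(5)]
jerky = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (2.0, 4.5, 0.0),
         (3.0, 6.0, 0.0), (4.0, 8.0, 0.0)]

steady_cost = acceleration_penalty(steady, dt=1.0)
jerky_cost = acceleration_penalty(jerky, dt=1.0)
```

The constant-velocity path costs nothing, while the jittered path is penalized, so a term of this kind pushes a reconstructed trajectory toward physically smooth motion.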

[CV-63] ChainV: Atomic Visual Hints Make Multimodal Reasoning Shorter and Better

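[Quick Read]: This paper addresses redundant self-reflection in multimodal reasoning models when they generate long reasoning chains, especially on tasks that rely on visual information for multi-step symbolic reasoning. Existing training-free chain-of-thought (CoT) compression methods rely on static visual references and therefore give limited gains. The key to the solution is ChainV, a framework that dynamically integrates visual hints into the reasoning process. It first performs a coarse visual patch selection based on the previous reasoning step, then identifies the most representative atomic visual hint according to averaged attention intensity. A consistency-based evaluation mechanism assesses the reliability of the chosen hint, and the hint's pixel coordinates and reliability are incorporated into the reasoning through a Bernoulli stochastic process, yielding shorter and more accurate multimodal reasoning.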

Link: https://arxiv.org/abs/2511.17106
Authors: Yuan Zhang, Ming Lu, Junwen Pan, Tao Huang, Kuan Cheng, Qi She, Shanghang Zhang
Affiliations: Peking University; ByteDance Inc.; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages


Abstract:Recent advances in multimodal reasoning models have demonstrated impressive capabilities across text and vision. However, even leading models exhibit redundant self-reflection when generating lengthy reasoning chains. While training-free CoT compression methods have emerged in the LLM domain, they rely on static visual references and thus provide limited gains for multimodal reasoning. Therefore, we propose ChainV, a framework that dynamically integrates visual hints into the reasoning process, thereby making multimodal reasoning shorter and better. Specifically, ChainV first performs a coarse visual patch selection based on the previous reasoning step, then refines it by identifying the most representative atomic visual hint according to the averaged attention intensity. Additionally, ChainV introduces a consistency-based evaluation mechanism to assess the reliability of the chosen hint, guiding the model to adaptively adjust its level of self-reflection. Eventually, the pixel coordinates of the selected visual hint and its reliability are incorporated into thinking with a Bernoulli stochastic process. Experiments indicate that our method significantly improves reasoning accuracy and efficiency, especially on math-intensive benchmarks where visual hints are crucial for multi-step symbolic reasoning. For example, ChainV achieves 2.3% improvement on the MathVista within MIMO-VL-RL, while reducing inference latency by 51.4% and shortening output token length by 24.5%.

[CV-64] Bridging Visual Affective Gap: Borrowing Textual Knowledge by Learning from Noisy Image-Text Pairs ACM-MM2024

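[Quick Read]: This paper addresses the "affective gap" in visual emotion recognition (VER): the factual-level features of pre-trained visual models have no direct association with emotional categories, which limits how well pre-training knowledge transfers to VER. The key to the solution is to borrow knowledge from a pre-trained textual model to strengthen the emotional perception of visual models. The authors mine factual and emotional connections between images and texts in noisy social media data and propose Partitioned Adaptive Contrastive Learning (PACL), which separates samples by type, applies a distinct contrastive learning strategy to each type, and dynamically constructs positive and negative pairs to exploit noisy samples. Narrowing the affective gap in this way significantly improves various pre-trained visual models on downstream emotion-related tasks.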

Link: https://arxiv.org/abs/2511.17103
Authors: Daiqing Wu, Dongbao Yang, Yu Zhou, Can Ma
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; TMCC, College of Computer Science, Nankai University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM MM 2024


Abstract:Visual emotion recognition (VER) is a longstanding field that has garnered increasing attention with the advancement of deep neural networks. Although recent studies have achieved notable improvements by leveraging the knowledge embedded within pre-trained visual models, the lack of direct association between factual-level features and emotional categories, called the “affective gap”, limits the applicability of pre-training knowledge for VER tasks. On the contrary, the explicit emotional expression and high information density in textual modality eliminate the “affective gap”. Therefore, we propose borrowing the knowledge from the pre-trained textual model to enhance the emotional perception of pre-trained visual models. We focus on the factual and emotional connections between images and texts in noisy social media data, and propose Partitioned Adaptive Contrastive Learning (PACL) to leverage these connections. Specifically, we manage to separate different types of samples and devise distinct contrastive learning strategies for each type. By dynamically constructing negative and positive pairs, we fully exploit the potential of noisy samples. Through comprehensive experiments, we demonstrate that bridging the “affective gap” significantly improves the performance of various pre-trained visual models in downstream emotion-related tasks. Our code is released on this https URL.
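The abstract does not spell out the contrastive objective, so as background here is a minimal sketch of a standard InfoNCE-style loss between one image embedding and several text embeddings. It is a generic formulation, not PACL: the partitioning of samples and the per-type strategies are not modeled, and the embeddings and temperature are made-up values.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, candidates, positive_index, temperature=0.1):
    """Cross-entropy of picking the positive among all candidates."""
    logits = [cosine(anchor, c) / temperature for c in candidates]
    m = max(logits)   # subtract the max for a stable log-sum-exp
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[positive_index]

# Toy embeddings: texts[0] is close to the image, the others are not.
image = [1.0, 0.2]
texts = [[0.9, 0.1], [-0.5, 1.0], [0.0, -1.0]]

loss_matched = info_nce(image, texts, positive_index=0)
loss_mismatched = info_nce(image, texts, positive_index=1)
```

The loss is small when the designated positive is the most similar candidate and large otherwise, which is why the choice of positive and negative pairs matters so much when the image-text data is noisy.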

[CV-65] Sparse Reasoning is Enough: Biological-Inspired Framework for Video Anomaly Detection with Large Pre-trained Models

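[Quick Read]: This paper addresses the high computational cost and latency that dense frame-level inference imposes on video anomaly detection (VAD), and asks whether dense reasoning is necessary at all for training-free detection with powerful pre-trained models. The key to the solution is ReCoVAD, a framework inspired by the dual reflex and conscious pathways of the human nervous system. The Reflex pathway uses a lightweight CLIP-based module to fuse visual features with prototype prompts and queries a dynamic memory for fast decisions. The Conscious pathway uses a medium-scale vision-language model to generate event descriptions and refined anomaly scores, while a large language model periodically reviews the accumulated descriptions to identify unseen anomalies, correct errors, and refine the prototypes. ReCoVAD achieves state-of-the-art training-free performance while processing only 28.55% and 16.04% of the frames used by previous methods on UCF-Crime and XD-Violence, respectively, indicating that sparse reasoning is sufficient for effective large-model-based VAD.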

Link: https://arxiv.org/abs/2511.17094
Authors: He Huang, Zixuan Hu, Dongxiao Li, Yao Xiao, Ling-Yu Duan
Affiliations: Peking University; Fuzhou Chengtou New Infrastructure Co., Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Video anomaly detection (VAD) plays a vital role in real-world applications such as security surveillance, autonomous driving, and industrial monitoring. Recent advances in large pre-trained models have opened new opportunities for training-free VAD by leveraging rich prior knowledge and general reasoning capabilities. However, existing studies typically rely on dense frame-level inference, incurring high computational costs and latency. This raises a fundamental question: Is dense reasoning truly necessary when using powerful pre-trained models in VAD systems? To answer this, we propose ReCoVAD, a novel framework inspired by the dual reflex and conscious pathways of the human nervous system, enabling selective frame processing to reduce redundant computation. ReCoVAD consists of two core pathways: (i) a Reflex pathway that uses a lightweight CLIP-based module to fuse visual features with prototype prompts and produce decision vectors, which query a dynamic memory of past frames and anomaly scores for fast response; and (ii) a Conscious pathway that employs a medium-scale vision-language model to generate textual event descriptions and refined anomaly scores for novel frames. It continuously updates the memory and prototype prompts, while an integrated large language model periodically reviews accumulated descriptions to identify unseen anomalies, correct errors, and refine prototypes. Extensive experiments show that ReCoVAD achieves state-of-the-art training-free performance while processing only 28.55% and 16.04% of the frames used by previous methods on the UCF-Crime and XD-Violence datasets, demonstrating that sparse reasoning is sufficient for effective large-model-based VAD.
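The idea of answering familiar frames from memory and sending only novel frames to an expensive model can be sketched as follows. This is a loose illustration of selective frame processing, not the ReCoVAD architecture: the class name, the similarity threshold, the toy feature vectors, and the stand-in `slow_scorer` are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

class ReflexMemory:
    """Stores (feature, score) pairs and reuses scores for similar frames."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []

    def lookup(self, vec):
        best_score, best_sim = None, self.threshold
        for stored, score in self.entries:
            sim = cosine(vec, stored)
            if sim >= best_sim:
                best_score, best_sim = score, sim
        return best_score   # None means the frame is novel

    def store(self, vec, score):
        self.entries.append((vec, score))

def score_stream(frames, slow_scorer, memory):
    """Score every frame, calling slow_scorer only on memory misses."""
    scores, slow_calls = [], 0
    for vec in frames:
        cached = memory.lookup(vec)
        if cached is None:
            cached = slow_scorer(vec)   # expensive path
            slow_calls += 1
            memory.store(vec, cached)
        scores.append(cached)
    return scores, slow_calls

# Toy per-frame features; the lambda stands in for a large model.
frames = [(1.0, 0.0), (1.0, 0.01), (0.0, 1.0), (1.0, 0.0)]
scores, slow_calls = score_stream(frames, lambda v: v[1], ReflexMemory())
```

Here only two of the four frames reach the slow path; the near-duplicate and the repeated frame are answered from memory.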

[CV-66] SPAGS: Sparse-View Articulated Object Reconstruction from Single State via Planar Gaussian Splatting

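Quick read: This paper addresses 3D reconstruction of articulated objects. Existing methods typically depend on costly inputs such as multi-stage and multi-view observations, and struggle to deliver high-quality reconstruction from sparse views of a single state. The key to the solution is a category-agnostic reconstruction framework built on planar Gaussian Splatting: a Gaussian information field first selects the optimal sparse viewpoints from candidate camera poses; the 3D Gaussians are then compressed into planar Gaussians to improve normal and depth estimation and are optimized coarse-to-fine with depth smoothness regularization and few-shot diffusion; finally, each Gaussian primitive carries a part segmentation probability that is updated by back-projecting the part segmentation masks of the renderings, which yields higher-fidelity part-level surface reconstruction.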

Link: https://arxiv.org/abs/2511.17092
Authors: Di Wu, Liu Liu, Xueyu Yuan, Qiaoyu Jun, Wenxiao Chen, Ruilong Yan, Yiming Tang, Liangtu Song
Affiliations: Hefei Institutes of Physical Science, Chinese Academy of Sciences; University of Science and Technology of China; Hefei University of Technology; Shanghai AI Laboratory, China; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures

Abstract:Articulated objects are ubiquitous in daily environments, and their 3D reconstruction holds great significance across various fields. However, existing articulated object reconstruction methods typically require costly inputs such as multi-stage and multi-view observations. To address the limitations, we propose a category-agnostic articulated object reconstruction framework via planar Gaussian Splatting, which only uses sparse-view RGB images from a single state. Specifically, we first introduce a Gaussian information field to perceive the optimal sparse viewpoints from candidate camera poses. Then we compress 3D Gaussians into planar Gaussians to facilitate accurate estimation of normal and depth. The planar Gaussians are optimized in a coarse-to-fine manner through depth smooth regularization and few-shot diffusion. Moreover, we introduce a part segmentation probability for each Gaussian primitive and update them by back-projecting part segmentation masks of renderings. Extensive experimental results demonstrate that our method achieves higher-fidelity part-level surface reconstruction on both synthetic and real-world data than existing methods. Codes will be made publicly available.

[CV-67] Spanning Tree Autoregressive Visual Generation

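Quick read: This paper addresses a trade-off in autoregressive (AR) image generation: exposing the model to randomly permuted sequence orders for bidirectional context tends either to degrade sampling performance or to restrict the choice of sequence order at inference. The key to the solution is Spanning Tree Autoregressive (STAR) modeling, a structured randomization strategy that samples uniform spanning trees on the lattice defined by image patch positions, obtains traversal orders by breadth-first search (BFS), and uses rejection sampling to ensure that the observed part of the image appears as a prefix of the sequence. Without significant changes to the architecture widely used in language AR modeling, this design preserves postfix completion and maintains sampling performance, balancing the use of image priors (such as center bias and locality) against flexible sequence ordering.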

Link: https://arxiv.org/abs/2511.17089
Authors: Sangkyu Lee, Changho Lee, Janghoon Han, Hosung Song, Tackgeun You, Hwasup Lim, Stanley Jungkyu Choi, Honglak Lee, Youngjae Yu
Affiliations: Yonsei University; KIST; LG AI Research; University of Michigan, Ann Arbor; Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint; Under review

Abstract:We present Spanning Tree Autoregressive (STAR) modeling, which can incorporate prior knowledge of images, such as center bias and locality, to maintain sampling performance while also providing sufficiently flexible sequence orders to accommodate image editing at inference. Approaches that expose randomly permuted sequence orders to conventional autoregressive (AR) models in visual generation for bidirectional context either suffer from a decline in performance or compromise the flexibility in sequence order choice at inference. Instead, STAR utilizes traversal orders of uniform spanning trees sampled in a lattice defined by the positions of image patches. Traversal orders are obtained through breadth-first search, allowing us to efficiently construct a spanning tree whose traversal order ensures that the connected partial observation of the image appears as a prefix in the sequence through rejection sampling. Through the tailored yet structured randomized strategy compared to random permutation, STAR preserves the capability of postfix completion while maintaining sampling performance without any significant changes to the model architecture widely adopted in the language AR modeling.
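
A minimal sketch of the two ingredients named in the abstract, a uniform spanning tree on the patch lattice and its breadth-first traversal order, is given below. It is an illustration under assumptions, not the STAR code: the abstract does not say which sampler is used, so Wilson's algorithm (a standard way to draw uniform spanning trees) is swapped in here, the function names are hypothetical, and the rejection-sampling step that forces an observed region to appear as a prefix is omitted.

```python
import random
from collections import deque


def neighbors(cell, h, w):
    """4-connected neighbors of a patch position on an h x w lattice."""
    r, c = cell
    out = []
    if r > 0:
        out.append((r - 1, c))
    if r < h - 1:
        out.append((r + 1, c))
    if c > 0:
        out.append((r, c - 1))
    if c < w - 1:
        out.append((r, c + 1))
    return out


def sample_uniform_spanning_tree(h, w, root, rng):
    """Wilson's algorithm: attach loop-erased random walks to a growing tree.

    Returns a dict mapping every non-root cell to its parent (toward root).
    """
    in_tree = {root}
    parent = {}
    for start in ((r, c) for r in range(h) for c in range(w)):
        if start in in_tree:
            continue
        # Random walk until the tree is hit; overwriting the successor of a
        # revisited cell erases the loop implicitly.
        successor = {}
        cell = start
        while cell not in in_tree:
            successor[cell] = rng.choice(neighbors(cell, h, w))
            cell = successor[cell]
        # Retrace the loop-erased path and add it to the tree.
        cell = start
        while cell not in in_tree:
            parent[cell] = successor[cell]
            in_tree.add(cell)
            cell = successor[cell]
    return parent


def bfs_order(root, parent):
    """Breadth-first traversal order of the tree, starting at the root."""
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    order, queue = [], deque([root])
    while queue:
        cell = queue.popleft()
        order.append(cell)
        queue.extend(children.get(cell, []))
    return order
```

Calling `bfs_order(root, sample_uniform_spanning_tree(h, w, root, random.Random(0)))` gives one generation order in which every patch follows a lattice neighbor that was generated earlier. Rooting the tree at the central patch would be one way to express a center prior, but that choice is an assumption of this sketch.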

[CV-68] Diversity Has Always Been There in Your Visual Autoregressive Models

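Quick read: This paper addresses diversity collapse in Visual Autoregressive (VAR) models, i.e., reduced variability in generated outputs, analogous to what is observed in few-step distilled diffusion models. The key to the solution is identifying and controlling a pivotal component of the feature map: by suppressing this component in the model input and amplifying it in the model output, DiverseVAR restores the generative diversity of VAR models without additional training while preserving high-fidelity synthesis.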

Link: https://arxiv.org/abs/2511.17074
Authors: Tong Wang, Guanyu Yang, Nian Liu, Kai Wang, Yaxing Wang, Abdelrahman M Shaker, Salman Khan, Fahad Shahbaz Khan, Senmao Li
Affiliations: Southeast University; MBZUAI; City University of Hong Kong; Nankai University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual Autoregressive (VAR) models have recently garnered significant attention for their innovative next-scale prediction paradigm, offering notable advantages in both inference efficiency and image quality compared to traditional multi-step autoregressive (AR) and diffusion models. However, despite their efficiency, VAR models often suffer from diversity collapse, i.e., a reduction in output variability, analogous to that observed in few-step distilled diffusion models. In this paper, we introduce DiverseVAR, a simple yet effective approach that restores the generative diversity of VAR models without requiring any additional training. Our analysis reveals the pivotal component of the feature map as a key factor governing diversity formation at early scales. By suppressing the pivotal component in the model input and amplifying it in the model output, DiverseVAR effectively unlocks the inherent generative potential of VAR models while preserving high-fidelity synthesis. Empirical results demonstrate that our approach substantially enhances generative diversity with only negligible influence on performance. Our code will be publicly released at this https URL.

[CV-69] ReBrain: Brain MRI Reconstruction from Sparse CT Slice via Retrieval-Augmented Diffusion WACV2026

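Quick read: This paper addresses how to accurately reconstruct a full brain magnetic resonance imaging (MRI) volume from the sparse slices that low-dose CT protocols produce. The key to the solution is ReBrain, a retrieval-augmented diffusion framework: a Brownian Bridge Diffusion Model (BBDM) first synthesizes MRI slices in 2D; in parallel, a fine-tuned retrieval model fetches structurally and pathologically similar CT slices from a prior database, and these references guide the generation of intermediate MRI slices through a ControlNet branch to ensure structural continuity; when retrieval fails, spherical linear interpolation supplies supplementary guidance, improving the robustness and quality of the reconstruction.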

Link: https://arxiv.org/abs/2511.17068
Authors: Junming Liu, Yifei Sun, Weihua Cheng, Yujin Kang, Yirong Chen, Ding Wang, Guosun Zeng
Affiliations: Shanghai Artificial Intelligence Laboratory; Tongji University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 16 pages, 12 figures, 7 tables; Accepted by WACV 2026

Abstract:Magnetic Resonance Imaging (MRI) plays a crucial role in brain disease diagnosis, but it is not always feasible for certain patients due to physical or clinical constraints. Recent studies attempt to synthesize MRI from Computed Tomography (CT) scans; however, low-dose protocols often result in highly sparse CT volumes with poor through-plane resolution, making accurate reconstruction of the full brain MRI volume particularly challenging. To address this, we propose ReBrain, a retrieval-augmented diffusion framework for brain MRI reconstruction. Given any 3D CT scan with limited slices, we first employ a Brownian Bridge Diffusion Model (BBDM) to synthesize MRI slices along the 2D dimension. Simultaneously, we retrieve structurally and pathologically similar CT slices from a comprehensive prior database via a fine-tuned retrieval model. These retrieved slices are used as references, incorporated through a ControlNet branch to guide the generation of intermediate MRI slices and ensure structural continuity. We further account for rare retrieval failures when the database lacks suitable references and apply spherical linear interpolation to provide supplementary guidance. Extensive experiments on SynthRAD2023 and BraTS demonstrate that ReBrain achieves state-of-the-art performance in cross-modal reconstruction under sparse conditions.
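
The fallback mentioned above, spherical linear interpolation (slerp), can be sketched in a few lines. The abstract does not state what is interpolated, so treating the inputs as two feature vectors from neighboring slices is an assumption of this example, and the function name is illustrative. The formula itself is the standard one: with angle ω between the vectors, the weights are sin((1-t)ω)/sin(ω) and sin(tω)/sin(ω).

```python
import math


def slerp(p0, p1, t):
    """Spherical linear interpolation between two non-zero vectors.

    t = 0 returns p0 and t = 1 returns p1. Falls back to ordinary linear
    interpolation when the vectors are (anti)parallel and sin(omega) ~ 0.
    """
    n0 = math.sqrt(sum(a * a for a in p0))
    n1 = math.sqrt(sum(b * b for b in p1))
    if n0 == 0.0 or n1 == 0.0:
        raise ValueError("slerp is undefined for zero vectors")
    dot = sum(a * b for a, b in zip(p0, p1))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_omega = max(-1.0, min(1.0, dot / (n0 * n1)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-8:
        return [(1.0 - t) * a + t * b for a, b in zip(p0, p1)]
    w0 = math.sin((1.0 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(p0, p1)]
```

For example, `slerp(z_a, z_b, 0.5)` would give a midpoint guidance vector for a slice halfway between two available ones, where `z_a` and `z_b` are hypothetical latent vectors.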

[CV-70] REArtGS++: Generalizable Articulation Reconstruction with Temporal Geometry Constraint via Planar Gaussian Splatting

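Quick read: This paper addresses generalizable part-level surface reconstruction and joint parameter estimation for articulated objects, targeting the weaknesses of the earlier REArtGS method on screw-joint and multi-part objects and its lack of geometric constraints for unseen states. The key to the solution is threefold: 1) a decoupled screw motion model for each joint that requires no prior knowledge of joint type; 2) joint optimization of part-aware Gaussians and joint parameters through part motion blending; 3) a time-continuous geometric constraint, based on a first-order Taylor expansion, that encourages Gaussians to be planar and enforces temporal consistency between planar normals and depth, which strengthens generalization to unseen states.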

Link: https://arxiv.org/abs/2511.17059
Authors: Di Wu, Liu Liu, Anran Huang, Yuyan Liu, Qiaoyu Jun, Shaofan Liu, Liangtu Song, Cewu Lu
Affiliations: Hefei Institutes of Physical Science, Chinese Academy of Sciences; University of Science and Technology of China; Hefei University of Technology; Shanghai AI Laboratory, China; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures

Abstract:Articulated objects are pervasive in daily environments, such as drawers and refrigerators. Towards their part-level surface reconstruction and joint parameter estimation, REArtGS \cite{wu2025reartgs} introduces a category-agnostic approach using multi-view RGB images at two different states. However, we observe that REArtGS still struggles with screw-joint or multi-part objects and lacks geometric constraints for unseen states. In this paper, we propose REArtGS++, a novel method towards generalizable articulated object reconstruction with temporal geometry constraint and planar Gaussian splatting. We first model a decoupled screw motion for each joint without type prior, and jointly optimize part-aware Gaussians with joint parameters through part motion blending. To introduce time-continuous geometric constraint for articulated modeling, we encourage Gaussians to be planar and propose a temporally consistent regularization between planar normal and depth through Taylor first-order expansion. Extensive experiments on both synthetic and real-world articulated objects demonstrate our superiority in generalizable part-level surface reconstruction and joint parameter estimation, compared to existing approaches. Project Site: this https URL.

[CV-71] RL-AD-Net: Reinforcement Learning Guided Adaptive Displacement in Latent Space for Refined Point Cloud Completion

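Quick read: This paper addresses the local geometric inconsistencies that current point cloud completion models (transformer-based, denoising-based, and others) often leave behind even when the generated shape is globally plausible. The key to the solution is RL-AD-Net, a reinforcement learning (RL) refinement framework that operates in the latent space of a pretrained point autoencoder: completions are encoded into compact global feature vectors (GFVs), which an RL agent selectively adjusts to improve geometric fidelity; a lightweight non-parametric PointNN selector then evaluates the geometric consistency of the original completion and the RL-refined output and keeps the better reconstruction, giving robust and efficient local geometry refinement.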

Link: https://arxiv.org/abs/2511.17054
Authors: Bhanu Pratap Paregi, Vaibhav Kumar
Affiliations: IISER Bhopal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent point cloud completion models, including transformer-based, denoising-based, and other state-of-the-art approaches, generate globally plausible shapes from partial inputs but often leave local geometric inconsistencies. We propose RL-AD-Net, a reinforcement learning (RL) refinement framework that operates in the latent space of a pretrained point autoencoder. The autoencoder encodes completions into compact global feature vectors (GFVs), which are selectively adjusted by an RL agent to improve geometric fidelity. To ensure robustness, a lightweight non-parametric PointNN selector evaluates the geometric consistency of both the original completion and the RL-refined output, retaining the better reconstruction. When ground truth is available, both Chamfer Distance and geometric consistency metrics guide refinement. Training is performed separately per category, since the unsupervised and dynamic nature of RL makes convergence across highly diverse categories challenging. Nevertheless, the framework can be extended to multi-category refinement in future work. Experiments on ShapeNetCore-2048 demonstrate that while baseline completion networks perform reasonably under their training-style cropping, they struggle in random cropping scenarios. In contrast, RL-AD-Net consistently delivers improvements across both settings, highlighting the effectiveness of RL-guided ensemble refinement. The approach is lightweight, modular, and model-agnostic, making it applicable to a wide range of completion networks without requiring retraining.
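
The "retain the better reconstruction" step can be illustrated with Chamfer Distance, which the abstract names as one of the guiding metrics when ground truth is available. The sketch below is an assumption-laden stand-in: it uses the squared-distance, mean-reduced variant of Chamfer Distance (conventions differ between papers), the helper names are hypothetical, and the paper's actual selector is a PointNN-based geometric consistency check that is not reproduced here.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer Distance (squared, averaged per point).

    Brute-force O(len(a) * len(b)); fine for a sketch, slow for large clouds.
    """
    def one_way(src, dst):
        return sum(min(squared_distance(p, q) for q in dst) for p in src) / len(src)

    return one_way(cloud_a, cloud_b) + one_way(cloud_b, cloud_a)


def keep_better(original, refined, ground_truth):
    """Return whichever completion is closer to the ground truth.

    Stand-in for the selector: ties keep the original completion.
    """
    if chamfer_distance(refined, ground_truth) < chamfer_distance(original, ground_truth):
        return refined
    return original
```

Keeping the original on ties means the refinement can only replace a completion when it is strictly closer to the reference, so the baseline result is never made worse under this metric.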

[CV-72] OmniPT: Unleashing the Potential of Large Vision Language Models for Pedestrian Tracking and Understanding AAAI2026

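Quick read: This paper addresses the performance gap of large vision-language models (LVLMs) on instance-level tasks such as visual grounding and object detection, and explores newer directions that combine pedestrian tracking with natural language (Referring MOT, Cross-view Referring MOT, and others), which require advanced semantic understanding. The key to the solution is OmniPT, a unified pedestrian tracking framework that casts tracking as a task a foundation model can perform and trains it in four stages (RL, mid-training, SFT, RL): a first reinforcement learning (RL) phase teaches the model to output bounding boxes in a fixed format; mid-training on a large amount of pedestrian-related data improves generalization; supervised fine-tuning (SFT) follows on several pedestrian tracking datasets; and a final RL phase improves tracking accuracy and instruction following. The result is a single model that supports interactive tracking, reference-based tracking, and semantic understanding.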

Link: https://arxiv.org/abs/2511.17053
Authors: Teng Fu, Mengyang Zhao, Ke Niu, Kaixin Peng, Bin Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: AAAI 2026

Abstract:LVLMs have been shown to perform excellently in image-level tasks such as VQA and caption. However, in many instance-level tasks, such as visual grounding and object detection, LVLMs still show performance gaps compared to previous expert models. Meanwhile, although pedestrian tracking is a classical task, there have been a number of new topics in combining object tracking and natural language, such as Referring MOT, Cross-view Referring MOT, and Semantic MOT. These tasks emphasize that models should understand the tracked object at an advanced semantic level, which is exactly where LVLMs excel. In this paper, we propose a new unified Pedestrian Tracking framework, namely OmniPT, which can track, track based on reference and generate semantic understanding of tracked objects interactively. We address two issues: how to model the tracking task into a task that foundation models can perform, and how to make the model output formatted answers. To this end, we implement a training phase consisting of RL-Mid Training-SFT-RL. Based on the pre-trained weights of the LVLM, we first perform a simple RL phase to enable the model to output fixed and supervisable bounding box format. Subsequently, we conduct a mid-training phase using a large number of pedestrian-related datasets. Finally, we perform supervised fine-tuning on several pedestrian tracking datasets, and then carry out another RL phase to improve the model’s tracking performance and enhance its ability to follow instructions. We conduct experiments on tracking benchmarks and the experimental results demonstrate that the proposed method can perform better than the previous methods.

[CV-73] PathAgent: Toward Interpretable Analysis of Whole-slide Pathology Images via Large Language Model-based Agentic Reasoning

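Quick read: This paper addresses the lack of an explicit reasoning trajectory in current computational pathology pipelines, which makes predictions opaque and hard to justify. Existing methods typically process whole-slide images (WSIs) as a black box and cannot emulate the iterative reasoning of pathologists, who dynamically zoom, refocus, and self-correct during diagnosis. The key to the solution is PathAgent, a training-free agent framework driven by a large language model (LLM): the Navigator module autonomously explores the WSI and precisely locates significant micro-regions, the Perceptor module extracts morphological visual cues, and the Executor integrates these findings into a continuously evolving natural-language reasoning trajectory, forming a traceable chain-of-thought and yielding fully interpretable diagnostic predictions.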

Link: https://arxiv.org/abs/2511.17052
Authors: Jingyun Chen, Linghan Cai, Zhikang Wang, Yi Huang, Songhan Jiang, Shenjin Huang, Hongpeng Wang, Yongbing Zhang
Affiliations: Harbin Institute of Technology, Shenzhen; Fudan University; Sichuan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures

Abstract:Analyzing whole-slide images (WSIs) requires an iterative, evidence-driven reasoning process that parallels how pathologists dynamically zoom, refocus, and self-correct while collecting the evidence. However, existing computational pipelines often lack this explicit reasoning trajectory, resulting in inherently opaque and unjustifiable predictions. To bridge this gap, we present PathAgent, a training-free, large language model (LLM)-based agent framework that emulates the reflective, stepwise analytical approach of human experts. PathAgent can autonomously explore WSI, iteratively and precisely locating significant micro-regions using the Navigator module, extracting morphology visual cues using the Perceptor, and integrating these findings into the continuously evolving natural language trajectories in the Executor. The entire sequence of observations and decisions forms an explicit chain-of-thought, yielding fully interpretable predictions. Evaluated across five challenging datasets, PathAgent exhibits strong zero-shot generalization, surpassing task-specific baselines in both open-ended and constrained visual question-answering tasks. Moreover, a collaborative evaluation with human pathologists confirms PathAgent’s promise as a transparent and clinically grounded diagnostic assistant.

[CV-74] RoomPlanner: Explicit Layout Planner for Easier LLM-Driven 3D Room Generation

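Quick read: This paper addresses the reliance of existing 3D indoor scene generation methods on manual layout design or panoramic image guidance, which prevents full automation and makes generation slow. The key to the solution is the RoomPlanner framework: hierarchical language-driven agent planners automatically parse a short text prompt into detailed scene descriptions with spatial and semantic attributes, which initialize 3D point clouds; two iteratively optimized arrangement constraints keep objects collision-free and accessible within the bounded space; finally, the AnyReach Sampling strategy and the Interval Timestep Flow Sampling (ITFS) strategy efficiently optimize the coarse 3D Gaussian scene representation. From short text alone, the framework generates geometrically rational, visually high-quality, and editable 3D indoor scenes, with total generation time under 30 minutes.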

Link: https://arxiv.org/abs/2511.17048
Authors: Wenzhuo Sun, Mingjian Liang, Wenxuan Song, Xuelian Cheng, Zongyuan Ge
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we propose RoomPlanner, the first fully automatic 3D room generation framework for painlessly creating realistic indoor scenes with only short text as input. Without any manual layout design or panoramic image guidance, our framework can generate explicit layout criteria for rational spatial placement. We begin by introducing a hierarchical structure of language-driven agent planners that can automatically parse short and ambiguous prompts into detailed scene descriptions. These descriptions include raw spatial and semantic attributes for each object and the background, which are then used to initialize 3D point clouds. To position objects within bounded environments, we implement two arrangement constraints that iteratively optimize spatial arrangements, ensuring a collision-free and accessible layout solution. In the final rendering stage, we propose a novel AnyReach Sampling strategy for camera trajectory, along with the Interval Timestep Flow Sampling (ITFS) strategy, to efficiently optimize the coarse 3D Gaussian scene representation. These approaches help reduce the total generation time to under 30 minutes. Extensive experiments demonstrate that our method can produce geometrically rational 3D indoor scenes, surpassing prior approaches in both rendering speed and visual quality while preserving editability. The code will be available soon.

[CV-75] RacketVision: A Multiple Racket Sports Benchmark for Unified Ball and Racket Analysis AAAI2026

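Quick read: This paper addresses computer vision analysis of complex human-object interactions in sports, specifically fine-grained ball tracking, articulated racket pose estimation, and ball trajectory forecasting in table tennis, tennis, and badminton. The core of the solution is RacketVision, the first dataset with large-scale, fine-grained racket pose annotations, together with a multi-modal fusion approach based on a CrossAttention mechanism that integrates racket pose with ball position. Experiments show that naively concatenating racket pose features degrades performance, whereas CrossAttention lifts trajectory prediction above strong unimodal baselines, providing a basis for dynamic object tracking, conditional motion forecasting, and multimodal sports analysis.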

Link: https://arxiv.org/abs/2511.17045
Authors: Linfeng Dong, Yuchen Yang, Hao Wu, Wei Wang, Yuenan Hou, Zhihang Zhong, Xiao Sun
Affiliations: Shanghai AI Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Accepted to AAAI 2026 (Oral)

Abstract:We introduce RacketVision, a novel dataset and benchmark for advancing computer vision in sports analytics, covering table tennis, tennis, and badminton. The dataset is the first to provide large-scale, fine-grained annotations for racket pose alongside traditional ball positions, enabling research into complex human-object interactions. It is designed to tackle three interconnected tasks: fine-grained ball tracking, articulated racket pose estimation, and predictive ball trajectory forecasting. Our evaluation of established baselines reveals a critical insight for multi-modal fusion: while naively concatenating racket pose features degrades performance, a CrossAttention mechanism is essential to unlock their value, leading to trajectory prediction results that surpass strong unimodal baselines. RacketVision provides a versatile resource and a strong starting point for future research in dynamic object tracking, conditional motion forecasting, and multimodal analysis in sports. Project page at this https URL
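
To make the contrast with concatenation concrete, here is a minimal scaled dot-product cross-attention sketch. It is not the paper's architecture: there are no learned projections, and letting ball features act as queries over racket pose features (keys and values) is an assumption of this example, as are the function names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query mixes the value vectors.

    queries: list of d-dim vectors (assumed here: ball features per frame)
    keys, values: lists of vectors (assumed here: racket pose features)
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs
```

Unlike concatenation, which appends pose features with fixed weight, the attention weights here depend on how well each pose token matches the current ball feature, so irrelevant pose tokens contribute less.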

[CV-76] Energy Scaling Laws for Diffusion Models: Quantifying Compute and Carbon Emissions in Image Generation

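Quick read: This paper addresses the energy consumption and environmental impact caused by the rapidly growing computational demands of diffusion models for image generation, and in particular the lack of methods for predicting energy consumption across model configurations and hardware platforms. The key to the solution is an energy prediction model adapted from Kaplan scaling laws: diffusion inference is decomposed into text encoding, iterative denoising, and decoding, under the hypothesis that denoising dominates energy consumption because it is repeated over many steps. Using floating-point operations (FLOPs) as the measure of computational complexity, the method predicts energy accurately within a single GPU architecture (R² > 0.9) and generalizes well across architectures, supporting sustainable AI deployment planning and carbon footprint estimation.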

Link: https://arxiv.org/abs/2511.17031
Authors: Aniketh Iyengar, Jiaqi Han, Boris Ruf, Vincent Grari, Marcin Detyniecki, Stefano Ermon
Affiliations: Stanford University; AXA AI Research
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: Accepted at EurIPS 2025 workshop “Rethinking AI: Efficiency, Frugality, and Sustainability”

Abstract:The rapidly growing computational demands of diffusion models for image generation have raised significant concerns about energy consumption and environmental impact. While existing approaches to energy optimization focus on architectural improvements or hardware acceleration, there is a lack of principled methods to predict energy consumption across different model configurations and hardware setups. We propose an adaptation of Kaplan scaling laws to predict GPU energy consumption for diffusion models based on computational complexity (FLOPs). Our approach decomposes diffusion model inference into text encoding, iterative denoising, and decoding components, with the hypothesis that denoising operations dominate energy consumption due to their repeated execution across multiple inference steps. We conduct comprehensive experiments across four state-of-the-art diffusion models (Stable Diffusion 2, Stable Diffusion 3.5, Flux, and Qwen) on three GPU architectures (NVIDIA A100, A4000, A6000), spanning various inference configurations including resolution (256x256 to 1024x1024), precision (fp16/fp32), step counts (10-50), and classifier-free guidance settings. Our energy scaling law achieves high predictive accuracy within individual architectures (R-squared > 0.9) and exhibits strong cross-architecture generalization, maintaining high rank correlations across models and enabling reliable energy estimation for unseen model-hardware combinations. These results validate the compute-bound nature of diffusion inference and provide a foundation for sustainable AI deployment planning and carbon footprint estimation.
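
A FLOPs-to-energy scaling fit of the kind described can be sketched as a power law fitted in log-log space. The abstract does not give the functional form, so the power law E = a * F^b (the shape of Kaplan-style scaling laws) is an assumption here, as are all function and parameter names. The FLOPs decomposition follows the three components named above; the doubling under classifier-free guidance reflects the common implementation that runs a conditional and an unconditional pass per step, which is also an assumption about the setup.

```python
import math


def total_flops(text_flops, denoise_flops_per_step, steps, decode_flops, use_cfg=False):
    """FLOPs of one generation: text encoding + iterative denoising + decoding.

    Assumes classifier-free guidance doubles the per-step denoising cost.
    """
    passes = 2 if use_cfg else 1
    return text_flops + passes * denoise_flops_per_step * steps + decode_flops


def fit_power_law(flops, energy):
    """Least-squares fit of log(E) = log(a) + b * log(F).

    Returns (a, b, r_squared), with r_squared measured in log space.
    """
    xs = [math.log(f) for f in flops]
    ys = [math.log(e) for e in energy]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    log_a = mean_y - b * mean_x
    ss_res = sum((y - (log_a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return math.exp(log_a), b, r_squared
```

Given measured (FLOPs, joules) pairs for one GPU, `fit_power_law` returns the coefficients and the log-space R², and `a * F ** b` then estimates the energy of an unseen configuration whose FLOPs count is F.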

[CV-77] Parameter-Free Neural Lens Blur Rendering for High-Fidelity Composites

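Quick read: This paper addresses inconsistent lens blur when virtual objects are composited into real scenes in mixed reality: without camera parameters (focal length, focus distance, aperture size) and scene depth, the blur of inserted objects does not match the photograph, which harms visual realism. The core of the solution is to estimate the circle of confusion (CoC) map directly from RGB images, without scene depth or camera metadata; the CoC values for a virtual object are inferred from a linear relationship between its signed CoC map and depth, and a neural reblurring network renders realistic defocus blur, improving the quality and practicality of mixed reality compositing.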

Link: https://arxiv.org/abs/2511.17014
Authors: Lingyan Ruan, Bin Chen, Taehyun Rhee
Affiliations: University of Melbourne
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Image and Video Processing (eess.IV)
Comments: Accepted by ISMAR 2025 with oral presentation. 10 pages, 11 figures

Abstract:Consistent and natural camera lens blur is important for seamlessly blending 3D virtual objects into photographed real scenes. Since lens blur typically varies with scene depth, the placement of virtual objects and their corresponding blur levels significantly affect the visual fidelity of mixed reality compositions. Existing pipelines often rely on camera parameters (e.g., focal length, focus distance, aperture size) and scene depth to compute the circle of confusion (CoC) for realistic lens blur rendering. However, such information is often unavailable to ordinary users, limiting the accessibility and generalizability of these methods. In this work, we propose a novel compositing approach that directly estimates the CoC map from RGB images, bypassing the need for scene depth or camera metadata. The CoC values for virtual objects are inferred through a linear relationship between its signed CoC map and depth, and realistic lens blur is rendered using a neural reblurring network. Our method provides a flexible and practical solution for real-world applications. Experimental results demonstrate that our method achieves high-fidelity compositing with realistic defocus effects, outperforming state-of-the-art techniques in both qualitative and quantitative evaluations.
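
For context, the conventional computation that the paper bypasses can be written down directly from the thin-lens model. The sketch below is that textbook formula, not the paper's estimator: with aperture diameter A = f / N, focus distance d_f, and subject depth d, the signed CoC diameter is A * f * (d - d_f) / (d * (d_f - f)). The function name and the sign convention (negative in front of the focal plane, positive behind it) are assumptions of this example.

```python
def signed_coc(depth, focus_dist, focal_len, f_number):
    """Signed circle-of-confusion diameter under the thin-lens model.

    All lengths share one unit (e.g. meters); the result is the blur
    diameter on the sensor in that unit. Requires focus_dist > focal_len
    and depth > 0.
    """
    aperture = focal_len / f_number  # aperture diameter A = f / N
    return aperture * focal_len * (depth - focus_dist) / (depth * (focus_dist - focal_len))
```

Rewriting the expression as K * (1 - d_f / d) with K = A * f / (d_f - f) shows that, under this model, the signed CoC is affine in inverse depth 1/d, which is one way to read a linear CoC-depth relationship like the one the abstract relies on.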

[CV-78] FLUID: Training-Free Face De-identification via Latent Identity Substitution

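Quick read: This paper addresses how face de-identification can remove identity information without damaging other image attributes. Existing methods often struggle to balance identity suppression against attribute preservation, and typically require retraining. The paper proposes FLUID (Face de-identification in the Latent space via Utility-preserving Identity Displacement), a training-free framework that treats identity editing as a semantic displacement in the latent h-space of a pretrained diffusion model and guides the optimization with novel reagent losses that jointly supervise identity suppression and attribute preservation. The key components are an optimization procedure that discovers identity-editing directions and two editing schemes, linear and geodesic (tangent-based), for navigating the latent manifold; on CelebA-HQ and FFHQ, FLUID outperforms state-of-the-art de-identification methods.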

Link: https://arxiv.org/abs/2511.17005
Authors: Jinhyeong Park, Shaheryar Muhammad, Seangmin Lee, Jong Taek Lee, Soon Ki Jung
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present FLUID (Face de-identification in the Latent space via Utility-preserving Identity Displacement), a training-free framework that directly substitutes identity in the latent space of pretrained diffusion models. Inspired by substitution mechanisms in chemistry, we reinterpret identity editing as semantic displacement in the latent h-space of a pretrained unconditional diffusion model. Our framework discovers identity-editing directions through optimization guided by novel reagent losses, which supervise for attribute preservation and identity suppression. We further propose both linear and geodesic (tangent-based) editing schemes to effectively navigate the latent manifold. Experimental results on CelebA-HQ and FFHQ demonstrate that FLUID achieves a superior trade-off between identity suppression and attribute preservation, outperforming state-of-the-art de-identification methods in both qualitative and quantitative metrics.

[CV-79] VLM-Augmented Degradation Modeling for Image Restoration Under Adverse Weather Conditions

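Quick read: This paper addresses reliable visual perception under adverse weather (rain, haze, snow, or mixtures of them), which matters for autonomous driving and outdoor robots. The key to the solution is a unified Memory-Enhanced Visual-Language Recovery (MVLR) model that couples a lightweight encoder-decoder backbone with a Visual-Language Model (VLM) and an Implicit Memory Bank (IMB): the VLM encodes weather degradation priors through chain-of-thought inference, and the IMB stores continuous latent representations of degradation patterns; the VLM-generated priors then query the IMB to retrieve fine-grained degradation prototypes, which are adaptively fused with multi-scale visual features via dynamic cross-attention, improving restoration accuracy while maintaining computational efficiency.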

Link: https://arxiv.org/abs/2511.16998
Authors: Qianyi Shao, Yuanfan Zhang, Renxiang Xiao, Liang Hu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reliable visual perception under adverse weather conditions, such as rain, haze, snow, or a mixture of them, is desirable yet challenging for autonomous driving and outdoor robots. In this paper, we propose a unified Memory-Enhanced Visual-Language Recovery (MVLR) model that restores images from different degradation levels under various weather conditions. MVLR couples a lightweight encoder-decoder backbone with a Visual-Language Model (VLM) and an Implicit Memory Bank (IMB). The VLM performs chain-of-thought inference to encode weather degradation priors and the IMB stores continuous latent representations of degradation patterns. The VLM-generated priors query the IMB to retrieve fine-grained degradation prototypes. These prototypes are then adaptively fused with multi-scale visual features via dynamic cross-attention mechanisms, enhancing restoration accuracy while maintaining computational efficiency. Extensive experiments on four severe-weather benchmarks show that MVLR surpasses single-branch and Mixture-of-Experts baselines in terms of Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM). These results indicate that MVLR offers a practical balance between model compactness and expressiveness for real-time deployment in diverse outdoor conditions.
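
Of the two restoration metrics named above, PSNR is simple enough to state in code. This is the standard definition, PSNR = 10 * log10(MAX^2 / MSE), written for flat lists of pixel values; the function name and the default peak value of 255 (8-bit images) are assumptions of this sketch, and SSIM is not covered.

```python
import math


def psnr(restored, reference, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equal-length pixel lists.

    Returns infinity when the images are identical (zero mean squared error).
    """
    if len(restored) != len(reference) or not restored:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(restored, reference)) / len(restored)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)
```

Higher values mean the restored image is closer to the clean reference; each 10 dB gain corresponds to a tenfold reduction in mean squared error.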

[CV-80] DepthFocus: Controllable Depth Estimation for See-Through Scenes

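Quick read: This paper addresses the limitations of conventional stereo depth estimation on transmissive materials such as glass and water: existing methods estimate only a static depth map anchored to the nearest surface, cannot handle the layered ambiguities common in real scenes, and cannot actively attend to a target depth. The key to the solution is DepthFocus, a steerable Vision Transformer conditioned on a scalar depth preference, which dynamically adapts its computation to focus on the intended depth and so enables intent-driven depth perception. Its central idea is to turn depth estimation from passive estimation into controllable, active perception; trained mainly on a newly constructed 500k multi-layered synthetic dataset, it improves accuracy and generalization in complex scenes.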

Link: https://arxiv.org/abs/2511.16993
Authors: Junhong Min, Jimin Kim, Cheol-Hui Min, Minwook Kim, Youngpil Jeon, Minyong Choi
Affiliations: Samsung Electronics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 6 figures, 5 tables

Abstract:Depth in the real world is rarely singular. Transmissive materials create layered ambiguities that confound conventional perception systems. Existing models remain passive, attempting to estimate static depth maps anchored to the nearest surface, while humans actively shift focus to perceive a desired depth. We introduce DepthFocus, a steerable Vision Transformer that redefines stereo depth estimation as intent-driven control. Conditioned on a scalar depth preference, the model dynamically adapts its computation to focus on the intended depth, enabling selective perception within complex scenes. The training primarily leverages our newly constructed 500k multi-layered synthetic dataset, designed to capture diverse see-through effects. DepthFocus not only achieves state-of-the-art performance on conventional single-depth benchmarks like BOOSTER, a dataset notably rich in transparent and reflective objects, but also quantitatively demonstrates intent-aligned estimation on our newly proposed real and synthetic multi-depth datasets. Moreover, it exhibits strong generalization capabilities on unseen see-through scenes, underscoring its robustness as a significant step toward active and human-like 3D perception.

[CV-81] DReX: Pure Vision Fusion of Self-Supervised and Convolutional Representations for Image Complexity Prediction

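Quick read: This paper addresses visual complexity prediction, i.e., accurately modeling how humans perceive image complexity, a task with applications in image compression, retrieval, and classification. Recent methods rely on multimodal models that combine image and text information, but whether language is necessary remains unclear. The paper proposes DReX (DINO-ResNet Fusion), a vision-only fusion architecture that uses a learnable attention mechanism to combine multi-scale hierarchical features from ResNet-50 with semantically rich representations from DINOv3 ViT-S/16, capturing both low-level texture patterns and high-level semantic structure. DReX reaches state-of-the-art performance on the IC9600 benchmark (Pearson r = 0.9581), surpassing methods trained on multimodal data while using approximately 21.5x fewer learnable parameters. This suggests that visual features alone can suffice for human-aligned complexity prediction and that self-supervised transformers and supervised CNNs are complementary when fused.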

Link: https://arxiv.org/abs/2511.16991
Authors: Jonathan Skaza, Parsa Madinei, Ziqi Wen, Miguel Eckstein
Affiliations: University of California, Santa Barbara
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages

Abstract:Visual complexity prediction is a fundamental problem in computer vision with applications in image compression, retrieval, and classification. Understanding what makes humans perceive an image as complex is also a long-standing question in cognitive science. Recent approaches have leveraged multimodal models that combine visual and linguistic representations, but it remains unclear whether language information is necessary for this task. We propose DReX (DINO-ResNet Fusion), a vision-only model that fuses self-supervised and convolutional representations through a learnable attention mechanism to predict image complexity. Our architecture integrates multi-scale hierarchical features from ResNet-50 with semantically rich representations from DINOv3 ViT-S/16, enabling the model to capture both low-level texture patterns and high-level semantic structure. DReX achieves state-of-the-art performance on the IC9600 benchmark (Pearson r = 0.9581), surpassing previous methods–including those trained on multimodal image-text data–while using approximately 21.5x fewer learnable parameters. Furthermore, DReX generalizes robustly across multiple datasets and metrics, achieving superior results on Pearson and Spearman correlation, Root Mean Square Error (RMSE), and Mean Absolute Error (MAE). Ablation and attention analyses confirm that DReX leverages complementary cues from both backbones, with the DINOv3 [CLS] token enhancing sensitivity to visual complexity. Our findings suggest that visual features alone can be sufficient for human-aligned complexity prediction and that, when properly fused, self-supervised transformers and supervised deep convolutional neural networks offer complementary and synergistic benefits for this task.
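
The four evaluation metrics listed in the abstract have standard definitions, sketched here in plain Python. Function names are illustrative, and Spearman correlation is computed as the Pearson correlation of average ranks (the usual tie-aware definition); nothing here is specific to the DReX evaluation code.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rmse(pred, target):
    """Root Mean Square Error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def mae(pred, target):
    """Mean Absolute Error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)
```

Pearson rewards a linear match between predicted and human complexity scores, Spearman only the ordering, and RMSE and MAE measure absolute error, so reporting all four covers both ranking quality and calibration.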

[CV-82] RadioKMoE: Knowledge-Guided Radiomap Estimation with Kolmogorov-Arnold Networks and Mixture-of-Experts

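[Quick Read]: This paper addresses radiomap estimation (RME) in wireless networks, where increasingly complex propagation behavior and diverse environments make it hard for conventional methods to model the spatial distribution of signal coverage accurately. The key is RadioKMoE, a knowledge-guided RME framework that combines Kolmogorov-Arnold Networks (KAN) with Mixture-of-Experts (MoE): a KAN module first predicts an initial coarse coverage map, exploiting KAN's strength in approximating physics models and global propagation patterns; the coarse map and environmental information then drive an MoE network whose expert sub-networks specialize in distinct radiomap patterns, refining local detail while preserving global consistency and improving accuracy and robustness.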

Link: https://arxiv.org/abs/2511.16986
Authors: Fupei Guo, Kerry Pan, Songyang Zhang, Yue Wang, Zhi Ding
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Radiomap serves as a vital tool for wireless network management and deployment by providing powerful spatial knowledge of signal propagation and coverage. However, increasingly complex radio propagation behavior and surrounding environments pose strong challenges for radiomap estimation (RME). In this work, we propose a knowledge-guided RME framework that integrates Kolmogorov-Arnold Networks (KAN) with Mixture-of-Experts (MoE), namely RadioKMoE. Specifically, we design a KAN module to predict an initial coarse coverage map, leveraging KAN’s strength in approximating physics models and global radio propagation patterns. The initial coarse map, together with environmental information, drives our MoE network for precise radiomap estimation. Unlike conventional deep learning models, the MoE module comprises expert networks specializing in distinct radiomap patterns to improve local details while preserving global consistency. Experimental results in both multi- and single-band RME demonstrate the enhanced accuracy and robustness of the proposed RadioKMoE in radiomap estimation.
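
The coarse-then-refine idea can be sketched for a single map location as a softmax-gated mixture of experts. This is only an illustration under stated assumptions: the residual form (coarse value plus gated expert corrections), the function names, and the numbers are hypothetical and are not confirmed by the abstract.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_refine(coarse_value, gate_logits, expert_corrections):
    """Refine a coarse prediction with a gated mixture of expert corrections.

    coarse_value: coarse estimate at one location (e.g. from a KAN-like module).
    gate_logits: one score per expert, assumed to come from a gating network.
    expert_corrections: one local correction per expert (assumed residual form).
    """
    gates = softmax(gate_logits)
    correction = sum(g * c for g, c in zip(gates, expert_corrections))
    return coarse_value + correction, gates


# Made-up numbers: a coarse signal strength of -80 dBm and two experts.
refined, gates = moe_refine(-80.0, gate_logits=[0.0, 0.0], expert_corrections=[2.0, -4.0])
```

Equal gate logits average the two corrections, so the refined value moves by -1 dB; a trained gate would instead favor the expert that matches the local propagation pattern.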

[CV-83] A Diversity-optimized Deep Ensemble Approach for Accurate Plant Leaf Disease Detection

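[Quick Read]: This paper addresses how to improve deep neural network ensembles (Deep Ensembles) for plant disease image recognition, in particular how to select high-accuracy ensemble member models. Existing ensemble diversity metrics (Q metrics) do not reliably reflect the synergy between members and therefore often fail to identify the best ensemble teams. The key is a new Synergistic Diversity (SQ) metric that captures the synergy between ensemble members and aligns consistently with ensemble accuracy, which substantially improves ensemble selection and detection accuracy.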

Link: https://arxiv.org/abs/2511.16982
Authors: Sai Nath Chowdary Medikonduru, Hongpeng Jin, Yanzhao Wu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Plant diseases pose a significant threat to global agriculture, causing over $220 billion in annual economic losses and jeopardizing food security. The timely and accurate detection of these diseases from plant leaf images is critical to mitigating their adverse effects. Deep neural network Ensembles (Deep Ensembles) have emerged as a powerful approach to enhancing prediction accuracy by leveraging the strengths of diverse Deep Neural Networks (DNNs). However, selecting high-performing ensemble member models is challenging due to the inherent difficulty in measuring ensemble diversity. In this paper, we introduce the Synergistic Diversity (SQ) framework to enhance plant disease detection accuracy. First, we conduct a comprehensive analysis of the limitations of existing ensemble diversity metrics (denoted as Q metrics), which often fail to identify optimal ensemble teams. Second, we present the SQ metric, a novel measure that captures the synergy between ensemble members and consistently aligns with ensemble accuracy. Third, we validate our SQ approach through extensive experiments on a plant leaf image dataset, which demonstrate that our SQ metric substantially improves ensemble selection and enhances detection accuracy. Our findings pave the way for more reliable and efficient image-based plant disease detection.
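
For context on pairwise ensemble diversity, the sketch below computes the classic Yule Q-statistic between two classifiers from their per-sample correctness. It is shown as one well-known diversity measure only; whether it is among the "Q metrics" analyzed in the paper is an assumption, and the SQ metric itself is not reproduced here.

```python
def q_statistic(correct_a, correct_b):
    """Yule's Q-statistic for two classifiers.

    correct_a, correct_b: sequences of booleans (or 0/1), one entry per
    sample, saying whether each classifier got that sample right.
    Returns a value in [-1, 1]; lower values indicate more diverse errors.
    """
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1  # both correct
        elif not a and not b:
            n00 += 1  # both wrong
        elif a:
            n10 += 1  # only the first classifier is correct
        else:
            n01 += 1  # only the second classifier is correct
    numerator = n11 * n00 - n01 * n10
    denominator = n11 * n00 + n01 * n10
    return numerator / denominator if denominator else 0.0


same_errors = q_statistic([1, 1, 0, 0], [1, 1, 0, 0])
different_errors = q_statistic([1, 1, 0, 1], [1, 0, 1, 1])
```

Two classifiers that fail on exactly the same samples score 1, while classifiers whose errors never coincide score -1, which is the situation an ensemble can exploit.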

[CV-84] Gradient-Driven Natural Selection for Compact 3D Gaussian Splatting

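[Quick Read]: This paper addresses the storage and computational overhead of 3D Gaussian Splatting (3DGS), which fits scenes with a large number of Gaussian primitives. Existing pruning methods rely on manually designed criteria or introduce extra learnable parameters, yielding suboptimal results. The key is a pruning framework inspired by natural selection: survival pressure is modeled as a regularization gradient field applied to opacity, so that the optimization gradients, driven by the goal of maximizing rendering quality, autonomously decide which Gaussians to retain or prune; the process is fully learnable and needs no human intervention. The paper also introduces an opacity decay technique with a finite opacity prior, which accelerates selection without compromising pruning effectiveness.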

Link: https://arxiv.org/abs/2511.16980
Authors: Xiaobin Deng, Qiuli Yu, Changyu Diao, Min Li, Duanqing Xu
Affiliations: Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3DGS employs a large number of Gaussian primitives to fit scenes, resulting in substantial storage and computational overhead. Existing pruning methods rely on manually designed criteria or introduce additional learnable parameters, yielding suboptimal results. To address this, we propose a natural-selection-inspired pruning framework that models survival pressure as a regularization gradient field applied to opacity, allowing the optimization gradients, driven by the goal of maximizing rendering quality, to autonomously determine which Gaussians to retain or prune. This process is fully learnable and requires no human intervention. We further introduce an opacity decay technique with a finite opacity prior, which accelerates the selection process without compromising pruning effectiveness. Compared to 3DGS, our method achieves over 0.6 dB PSNR gain under 15% budgets, establishing state-of-the-art performance for compact 3DGS. Project page: this https URL.
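
The interplay between a rendering gradient and a constant "survival pressure" on opacity can be shown with a toy gradient-descent loop. This is a rough sketch under strong assumptions: each Gaussian is given a fixed, made-up rendering gradient, the pressure is a uniform constant, and all names and hyperparameters are hypothetical rather than the paper's actual formulation.

```python
def evolve_opacities(opacities, render_grads, pressure=0.05, lr=0.1,
                     steps=200, threshold=0.01):
    """Toy opacity update under rendering gradient plus survival pressure.

    render_grads[i] stands in for d(render_loss)/d(opacity_i); a negative
    value means raising the opacity of Gaussian i improves rendering.
    The pressure term is the gradient of pressure * opacity, i.e. a uniform
    push toward zero. Opacities are clipped to [0, 1].
    Returns (final_opacities, indices_of_survivors).
    """
    ops = list(opacities)
    for _ in range(steps):
        for i, grad in enumerate(render_grads):
            total_grad = grad + pressure
            ops[i] = min(1.0, max(0.0, ops[i] - lr * total_grad))
    survivors = [i for i, o in enumerate(ops) if o > threshold]
    return ops, survivors


# Gaussian 0 matters a lot for rendering, 1 matters a little, 2 not at all.
final_ops, survivors = evolve_opacities([0.5, 0.5, 0.5], [-0.2, -0.01, 0.0])
```

Only the Gaussian whose rendering benefit outweighs the pressure keeps a non-zero opacity; the others decay to zero and can be pruned without any hand-written importance score.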

[CV-85] The Finer the Better: Towards Granular-aware Open-set Domain Generalization AAAI2026

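[Quick Read]: This paper addresses Open-Set Domain Generalization (OSDG), where a model faces both domain shift and unknown categories. Existing methods struggle to balance the structural risk of known classes against the open-space risk from unknown classes and tend to be over-confident on "hard unknowns" that share fine-grained visual similarities with known classes. The key is the Semantic-enhanced CLIP (SeeCLIP) framework, with three components: (1) a semantic-aware prompt enhancement module that decomposes images into discriminative semantic tokens, enabling fine-grained vision-language alignment beyond coarse category labels; (2) duplex contrastive learning that positions unknown prompts through repulsion, which maintains separability from known classes, and cohesion, which preserves semantic proximity; (3) a semantic-guided diffusion module that synthesizes pseudo-unknowns by perturbing the extracted semantic tokens, producing hard negatives that look similar to known classes yet differ in key local details and force the model to learn finer decision boundaries.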

Link: https://arxiv.org/abs/2511.16979
Authors: Yunyun Wang, Zheng Duan, Xinyue Liao, Ke-Jia Chen, Songcan Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures, AAAI 2026

Abstract:Open-Set Domain Generalization (OSDG) tackles the realistic scenario where deployed models encounter both domain shifts and novel object categories. Despite impressive progress with vision-language models like CLIP, existing methods still face a dilemma between the structural risk of known classes and the open-space risk from unknown classes, and easily suffer from over-confidence, especially when distinguishing "hard unknowns" that share fine-grained visual similarities with known classes. To this end, we propose a Semantic-enhanced CLIP (SeeCLIP) framework that explicitly addresses this dilemma through fine-grained semantic enhancement. In SeeCLIP, we propose a semantic-aware prompt enhancement module to decompose images into discriminative semantic tokens, enabling nuanced vision-language alignment beyond coarse category labels. To position unknown prompts effectively, we introduce duplex contrastive learning with complementary objectives, that is, repulsion to maintain separability from known classes, and cohesion to preserve semantic proximity. Further, our semantic-guided diffusion module synthesizes pseudo-unknowns by perturbing extracted semantic tokens, generating challenging samples that are visually similar to known classes yet exhibit key local differences. These hard negatives force the model to learn finer decision boundaries. Extensive experiments across five benchmarks demonstrate consistent improvements of 3% accuracy and 5% H-score over state-of-the-art methods.
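
The H-score reported above is commonly defined in the open-set literature as the harmonic mean of known-class accuracy and unknown-class accuracy. The sketch below assumes that definition; the abstract does not spell it out, so the paper's exact protocol may differ.

```python
def h_score(known_acc, unknown_acc):
    """Harmonic mean of known-class and unknown-class accuracy (assumed definition)."""
    if known_acc + unknown_acc == 0:
        return 0.0
    return 2 * known_acc * unknown_acc / (known_acc + unknown_acc)


balanced = h_score(0.8, 0.8)
overconfident = h_score(1.0, 0.0)
```

Because the harmonic mean is dominated by the smaller term, a model that labels every sample as a known class gets a perfect known-class accuracy but an H-score of zero, which is why the metric is useful for detecting over-confidence on unknowns.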

[CV-86] Real-Time Cooked Food Image Synthesis and Visual Cooking Progress Monitoring on Edge Devices

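[Quick Read]: This paper addresses generating realistic cooked-food images from raw-food images on edge devices, which requires modeling the complex changes in texture, color, and structure that occur during cooking; existing image-to-image generation methods either produce unrealistic results or are too resource-intensive for edge deployment. The key contributions are the first oven-based cooking-progression dataset with chef-annotated doneness levels and an edge-efficient generator guided by recipe and cooking state that synthesizes realistic images conditioned on a raw food image, supporting user-preferred visual targets rather than fixed presets. A domain-specific Culinary Image Similarity (CIS) metric serves as both a training loss and a progress-monitoring signal, ensuring temporal consistency and culinary plausibility. FID improves by 30% on the authors' dataset and by 60% on public datasets.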

Link: https://arxiv.org/abs/2511.16965
Authors: Jigyasa Gupta, Soumya Goyal, Anil Kumar, Ishan Jindal
Affiliations: Samsung R&D Institute India, Delhi
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 13 pages, 11 figures

Abstract:Synthesizing realistic cooked food images from raw inputs on edge devices is a challenging generative task, requiring models to capture complex changes in texture, color and structure during cooking. Existing image-to-image generation methods often produce unrealistic results or are too resource-intensive for edge deployment. We introduce the first oven-based cooking-progression dataset with chef-annotated doneness levels and propose an edge-efficient recipe and cooking state guided generator that synthesizes realistic food images conditioned on a raw food image. This formulation enables user-preferred visual targets rather than fixed presets. To ensure temporal consistency and culinary plausibility, we introduce a domain-specific Culinary Image Similarity (CIS) metric, which serves both as a training loss and a progress-monitoring signal. Our model outperforms existing baselines with significant reductions in FID scores (30% improvement on our dataset; 60% on public datasets).

[CV-87] Two Heads Better than One: Dual Degradation Representation for Blind Super-Resolution

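[Quick Read]: This paper addresses blind super-resolution (blind SR), i.e., recovering a high-resolution (HR) image when the degradation (such as blur and noise) is unknown. Conventional single image super-resolution (SISR) methods usually assume a known, fixed degradation (e.g., bicubic downsampling) and degrade significantly when the actual degradation deviates from that assumption. The key is a Dual Branch Degradation Extractor Network that predicts, without supervision, two degradation embeddings representing blur and noise information; the SR network is then adapted to the blur embedding and the noise embedding in distinct ways. The degradation extractor is additionally used as a regularizer that exploits differences between SR and HR images. Experiments on several benchmarks show state-of-the-art (SOTA) performance on blind SR.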

Link: https://arxiv.org/abs/2511.16963
Authors: Hsuan Yuan, Shao-Yu Weng, I-Hsuan Lo, Wei-Chen Chiu, Yu-Syuan Xu, Hao-Chien Hsueh, Jen-Hui Chuang, Ching-Chun Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Previous methods have demonstrated remarkable performance in single image super-resolution (SISR) tasks with known and fixed degradation (e.g., bicubic downsampling). However, when the actual degradation deviates from these assumptions, these methods may experience significant declines in performance. In this paper, we propose a Dual Branch Degradation Extractor Network to address the blind SR problem. While some blind SR methods assume noise-free degradation and others do not explicitly consider the presence of noise in the degradation model, our approach predicts two unsupervised degradation embeddings that represent blurry and noisy information. The SR network can then be adapted to blur embedding and noise embedding in distinct ways. Furthermore, we treat the degradation extractor as a regularizer to capitalize on differences between SR and HR images. Extensive experiments on several benchmarks demonstrate our method achieves SOTA performance in the blind SR problem.
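
To make the two degradation factors concrete, the sketch below applies the classical blur, downsample, add-noise pipeline to a 1D signal. It is a generic illustration of the degradation model that blind SR has to undo, not the paper's extractor network; the function name, kernel, and parameters are hypothetical.

```python
import random


def degrade(hr_signal, kernel, scale, noise_std, seed=0):
    """Blur, downsample, then add Gaussian noise to a 1D signal.

    hr_signal: list of floats standing in for a high-resolution image row.
    kernel: odd-length blur kernel (assumed symmetric and summing to 1).
    scale: integer downsampling factor.
    noise_std: standard deviation of the additive Gaussian noise.
    """
    rng = random.Random(seed)
    half = len(kernel) // 2
    n = len(hr_signal)
    blurred = []
    for i in range(n):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(n - 1, max(0, i + k - half))  # replicate-pad the borders
            acc += weight * hr_signal[j]
        blurred.append(acc)
    downsampled = blurred[::scale]
    return [value + rng.gauss(0.0, noise_std) for value in downsampled]


clean_lr = degrade([1.0] * 8, kernel=[0.25, 0.5, 0.25], scale=2, noise_std=0.0)
noisy_lr = degrade([1.0] * 8, kernel=[0.25, 0.5, 0.25], scale=2, noise_std=0.1)
```

Blur and noise enter at different stages and affect the signal differently, which is one intuition for representing them with two separate embeddings instead of a single one.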

[CV-88] MatPedia: A Universal Generative Foundation for High-Fidelity Material Synthesis

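[Quick Read]: This paper addresses two problems in creating physically-based rendering (PBR) materials: creation is labor-intensive and requires specialized expertise, and existing generative models lack a unified representation bridging natural image appearance and PBR properties, which leads to fragmented task-specific pipelines that cannot leverage large-scale RGB image data. The key is MatPedia, a foundation model built on a joint RGB-PBR representation that encodes a material into two interdependent latents: one for RGB appearance and one for the four PBR maps encoding complementary physical properties. By treating these as a 5-frame sequence and using video diffusion architectures, the model captures the correlations between RGB and PBR and transfers visual priors from RGB generation models, unifying text-to-material generation, image-to-material generation, and intrinsic decomposition in a single architecture.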

Link: https://arxiv.org/abs/2511.16957
Authors: Di Luo, Shuhui Yang, Mingxin Yang, Jiawei Lu, Yixuan Tang, Xintong Han, Zhuo Chen, Beibei Wang, Chunchao Guo
Affiliations: Nankai University; Tencent Hunyuan; Xi’an Jiaotong University; Nanjing University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Physically-based rendering (PBR) materials are fundamental to photorealistic graphics, yet their creation remains labor-intensive and requires specialized expertise. While generative models have advanced material synthesis, existing methods lack a unified representation bridging natural image appearance and PBR properties, leading to fragmented task-specific pipelines and inability to leverage large-scale RGB image data. We present MatPedia, a foundation model built upon a novel joint RGB-PBR representation that compactly encodes materials into two interdependent latents: one for RGB appearance and one for the four PBR maps encoding complementary physical properties. By formulating them as a 5-frame sequence and employing video diffusion architectures, MatPedia naturally captures their correlations while transferring visual priors from RGB generation models. This joint representation enables a unified framework handling multiple material tasks (text-to-material generation, image-to-material generation, and intrinsic decomposition) within a single architecture. Trained on MatHybrid-410K, a mixed corpus combining PBR datasets with large-scale RGB images, MatPedia achieves native 1024×1024 synthesis that substantially surpasses existing approaches in both quality and diversity.

[CV-89] Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models

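[Quick Read]: This paper addresses the difficulty of applying Group Relative Policy Optimization (GRPO) to modern flow matching models, whose deterministic sampling paradigm does not fit GRPO directly. Current methods convert the ordinary differential equation (ODE) into a stochastic differential equation (SDE) to introduce stochasticity, but SDE-based GRPO suffers from inefficient credit assignment and is incompatible with high-order solvers for fewer-step sampling. The key is to reinterpret existing SDE-based GRPO methods from a distance optimization perspective, revealing that they act as a form of contrastive learning, and on that basis to propose the SDE-free Neighbor GRPO algorithm: it generates diverse candidate trajectories by perturbing the initial noise conditions of the ODE and optimizes the model with a softmax distance-based surrogate leaping policy, preserving the advantages of deterministic ODE sampling (efficiency and compatibility with high-order solvers) while training more efficiently and generating higher-quality samples.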

Link: https://arxiv.org/abs/2511.16955
Authors: Dailan He, Guanlin Feng, Xingtong Ge, Yazhe Niu, Yi Zhang, Bingqi Ma, Guanglu Song, Yu Liu, Hongsheng Li
Affiliations: CUHK MMLab; Vivix.AI; HKUST; Shenzhen Loop Area Institute; CPII under InnoHK
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Abstract:Group Relative Policy Optimization (GRPO) has shown promise in aligning image and video generative models with human preferences. However, applying it to modern flow matching models is challenging because of its deterministic sampling paradigm. Current methods address this issue by converting Ordinary Differential Equations (ODEs) to Stochastic Differential Equations (SDEs), which introduce stochasticity. However, this SDE-based GRPO suffers from issues of inefficient credit assignment and incompatibility with high-order solvers for fewer-step sampling. In this paper, we first reinterpret existing SDE-based GRPO methods from a distance optimization perspective, revealing their underlying mechanism as a form of contrastive learning. Based on this insight, we propose Neighbor GRPO, a novel alignment algorithm that completely bypasses the need for SDEs. Neighbor GRPO generates a diverse set of candidate trajectories by perturbing the initial noise conditions of the ODE and optimizes the model using a softmax distance-based surrogate leaping policy. We establish a theoretical connection between this distance-based objective and policy gradient optimization, rigorously integrating our approach into the GRPO framework. Our method fully preserves the advantages of deterministic ODE sampling, including efficiency and compatibility with high-order solvers. We further introduce symmetric anchor sampling for computational efficiency and group-wise quasi-norm reweighting to address reward flattening. Extensive experiments demonstrate that Neighbor GRPO significantly outperforms SDE-based counterparts in terms of training cost, convergence speed, and generation quality.
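
A softmax distance-based surrogate policy combined with group-relative advantages might look roughly like the sketch below. The group-relative normalization follows the usual GRPO recipe; the specific surrogate (softmax over negative squared distances between the current sample and perturbed neighbors), the temperature, and all function names are assumptions made for illustration, not the paper's exact objective.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def neighbor_log_probs(sample, neighbors, temperature=1.0):
    """Log-softmax over negative squared distances from sample to each neighbor."""
    logits = [
        -sum((s - n) ** 2 for s, n in zip(sample, nb)) / temperature
        for nb in neighbors
    ]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]


def surrogate_objective(sample, neighbors, rewards, temperature=1.0):
    """Advantage-weighted log-probability of 'leaping' to each neighbor."""
    advantages = group_relative_advantages(rewards)
    log_probs = neighbor_log_probs(sample, neighbors, temperature)
    return sum(a * lp for a, lp in zip(advantages, log_probs))


# Two neighbors generated from perturbed initial noise; the first has higher reward.
neighbors = [[0.0, 0.0], [2.0, 0.0]]
rewards = [1.0, 0.0]
near_good = surrogate_objective([0.0, 0.0], neighbors, rewards)
near_bad = surrogate_objective([2.0, 0.0], neighbors, rewards)
```

Maximizing this objective pulls the model's output toward high-reward neighbors and away from low-reward ones, which is the contrastive reading of the update mentioned in the abstract.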

[CV-90] Point-Supervised Facial Expression Spotting with Gaussian-Based Instance-Adaptive Intensity Modeling

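[Quick Read]: This paper addresses the training cost of automatic facial expression spotting, which conventionally depends on costly, time-consuming temporal boundary annotations, and studies point-supervised facial expression spotting (P-FES), where training needs only a single timestamp annotation per expression instance. The core is a two-branch framework. First, a Gaussian-based instance-adaptive intensity modeling (GIM) module detects the pseudo-apex frame around each point label, estimates the duration, and constructs an instance-level Gaussian distribution to assign soft pseudo-labels, mitigating the tendency of hard pseudo-labels to confuse neutral frames with expression frames of varying intensity. Second, a class-aware apex classification branch distinguishes macro-expressions from micro-expressions solely from their pseudo-apex frames. An intensity-aware contrastive loss further strengthens discriminative feature learning and suppresses neutral noise. Experiments on the SAMM-LV, CAS(ME)², and CAS(ME)³ datasets demonstrate the framework's effectiveness.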

Link: https://arxiv.org/abs/2511.16952
Authors: Yicheng Deng, Hideaki Hayashi, Hajime Nagahara
Affiliations: The University of Osaka; D3 Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Automatic facial expression spotting, which aims to identify facial expression instances in untrimmed videos, is crucial for facial expression analysis. Existing methods primarily focus on fully-supervised learning and rely on costly, time-consuming temporal boundary annotations. In this paper, we investigate point-supervised facial expression spotting (P-FES), where only a single timestamp annotation per instance is required for training. We propose a unique two-branch framework for P-FES. First, to mitigate the limitation of hard pseudo-labeling, which often confuses neutral and expression frames with various intensities, we propose a Gaussian-based instance-adaptive intensity modeling (GIM) module to model instance-level expression intensity distribution for soft pseudo-labeling. By detecting the pseudo-apex frame around each point label, estimating the duration, and constructing an instance-level Gaussian distribution, GIM assigns soft pseudo-labels to expression frames for more reliable intensity supervision. The GIM module is incorporated into our framework to optimize the class-agnostic expression intensity branch. Second, we design a class-aware apex classification branch that distinguishes macro- and micro-expressions solely based on their pseudo-apex frames. During inference, the two branches work independently: the class-agnostic expression intensity branch generates expression proposals, while the class-aware apex-classification branch is responsible for macro- and micro-expression classification. We also introduce an intensity-aware contrastive loss to enhance discriminative feature learning and suppress neutral noise by contrasting neutral frames with expression frames with various intensities. Extensive experiments on the SAMM-LV, CAS(ME)², and CAS(ME)³ datasets demonstrate the effectiveness of our proposed framework.
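
The soft pseudo-labeling step can be sketched as a Gaussian centered on the pseudo-apex frame. The sketch assumes the apex and duration have already been estimated and that the Gaussian width is a fixed fraction of the duration; the `sigma_ratio` parameter and the function name are hypothetical and are not taken from the paper.

```python
import math


def gaussian_soft_labels(num_frames, apex, duration, sigma_ratio=0.25):
    """Assign soft intensity labels around a pseudo-apex frame.

    num_frames: length of the video clip in frames.
    apex: index of the (pseudo-)apex frame, which gets label 1.0.
    duration: estimated length of the expression instance in frames.
    sigma_ratio: assumed ratio between the Gaussian sigma and the duration.
    Frames outside [apex - duration/2, apex + duration/2] are labeled 0.0.
    """
    sigma = max(duration * sigma_ratio, 1e-6)
    start = apex - duration / 2
    end = apex + duration / 2
    labels = []
    for t in range(num_frames):
        if start <= t <= end:
            labels.append(math.exp(-((t - apex) ** 2) / (2 * sigma ** 2)))
        else:
            labels.append(0.0)
    return labels


labels = gaussian_soft_labels(num_frames=11, apex=5, duration=6)
```

Frames near the apex receive labels close to 1 and frames near the instance boundary receive small but non-zero labels, so low-intensity expression frames are no longer forced into a hard neutral-or-expression decision.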

[CV-91] FingerCap: Fine-grained Finger-level Hand Motion Captioning

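[Quick Read]: This paper addresses the weakness of video multimodal large language models (Video-MLLMs) in fine-grained hand motion understanding, especially in capturing finger-level semantics. Because RGB frames are sampled sparsely, current models miss the subtle, high-frequency dynamics of finger motion, and the generated descriptions lack accuracy and completeness. The key is FiGOP (Finger Group-of-Pictures), which pairs each RGB keyframe with the subsequent hand keypoints up to the next keyframe, forming a temporally dense motion representation; a lightweight temporal encoder converts the keypoints into motion embeddings and integrates them with RGB features, recovering fine temporal cues without increasing RGB density and yielding consistent gains on finger-level motion captioning.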

Link: https://arxiv.org/abs/2511.16951
Authors: Xin Shen, Rui Zhu, Lei Shen, Xinyu Wang, Kaihao Zhang, Tianqing Zhu, Shuchen Wu, Chenxi Miao, Weikang Li, Yang Li, Deguo Xia, Jizhou Huang, Xin Yu
Affiliations: Baidu Inc.; The University of Queensland; Nanjing University; Institute of Computing Technology, CAS; Peking University; Australian National University; City University of Macau
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Understanding fine-grained human hand motion is fundamental to visual perception, embodied intelligence, and multimodal communication. In this work, we propose Fine-grained Finger-level Hand Motion Captioning (FingerCap), which aims to generate textual descriptions that capture detailed finger-level semantics of hand actions. To support this task, we curate FingerCap-40K, a large-scale corpus of 40K paired hand-motion videos and captions spanning two complementary sources: concise instruction-style finger motions and diverse, naturalistic hand-object interactions. To enable effective evaluation, we employ HandJudge, an LLM-based rubric that measures finger-level correctness and motion completeness. Temporal sparsity remains a fundamental bottleneck for current Video-MLLMs, since sparse RGB sampling is insufficient to capture the subtle, high-frequency dynamics underlying fine finger motions. As a simple and compute-friendly remedy, we introduce FiGOP (Finger Group-of-Pictures), which pairs each RGB keyframe with subsequent hand keypoints until the next keyframe. A lightweight temporal encoder converts the keypoints into motion embeddings and integrates them with RGB features. FiGOP adapts the classic GOP concept to finger motion, recovering fine temporal cues without increasing RGB density. Experiments on FingerCap-40K show that strong open- and closed-source Video-MLLMs still struggle with finger-level reasoning, while our FiGOP-augmented model yields consistent gains under HandJudge and human studies.
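
The FiGOP grouping rule, one RGB keyframe followed by the hand keypoints of the frames up to the next keyframe, can be written as a small indexing routine. The sketch assumes keyframes are sampled at a fixed stride; the stride, the function name, and the dictionary layout are hypothetical.

```python
def build_figops(num_frames, keyframe_stride):
    """Group frame indices into FiGOP-style units.

    Each unit holds one RGB keyframe index plus the indices of the following
    frames (up to, but not including, the next keyframe) for which only hand
    keypoints would be kept.
    """
    groups = []
    for key in range(0, num_frames, keyframe_stride):
        next_key = min(key + keyframe_stride, num_frames)
        groups.append({
            "rgb_keyframe": key,
            "keypoint_frames": list(range(key + 1, next_key)),
        })
    return groups


figops = build_figops(num_frames=10, keyframe_stride=4)
```

For a 10-frame clip with stride 4, only frames 0, 4, and 8 are kept as RGB images, while every other frame still contributes keypoints, so temporal detail is retained without raising the RGB frame count.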

[CV-92] MobileOcc: A Human-Aware Semantic Occupancy Dataset for Mobile Robots

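[Quick Read]: This paper addresses dense 3D semantic occupancy perception for mobile robots in pedestrian-rich environments, which remains underexplored compared with autonomous driving. The key is the MobileOcc dataset, built with an annotation pipeline that combines static object occupancy annotations with a novel mesh optimization framework designed for human occupancy modeling: the framework reconstructs deformable human geometry from 2D images and then refines and optimizes it using the associated LiDAR point data. On this dataset the authors establish benchmarks for two tasks, occupancy prediction and pedestrian velocity prediction, and provide baseline implementations of several methods (monocular, stereo, and panoptic occupancy) for reproducible comparison.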

Link: https://arxiv.org/abs/2511.16949
Authors: Junseo Kim, Guido Dumont, Xinyu Gao, Gang Chen, Holger Caesar, Javier Alonso-Mora
Affiliations: Delft University of Technology
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Dense 3D semantic occupancy perception is critical for mobile robots operating in pedestrian-rich environments, yet it remains underexplored compared to its application in autonomous driving. To address this gap, we present MobileOcc, a semantic occupancy dataset for mobile robots operating in crowded human environments. Our dataset is built using an annotation pipeline that incorporates static object occupancy annotations and a novel mesh optimization framework explicitly designed for human occupancy modeling. It reconstructs deformable human geometry from 2D images and subsequently refines and optimizes it using associated LiDAR point data. Using MobileOcc, we establish benchmarks for two tasks, i) Occupancy prediction and ii) Pedestrian velocity prediction, using different methods including monocular, stereo, and panoptic occupancy, with metrics and baseline implementations for reproducible comparison. Beyond occupancy prediction, we further assess our annotation method on 3D human pose estimation datasets. Results demonstrate that our method exhibits robust performance across different datasets.

[CV-93] Flow-Guided Implicit Neural Representation for Motion-Aware Dynamic MRI Reconstruction

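[Quick Read]: This paper addresses the loss of reconstruction quality in dynamic magnetic resonance imaging (dMRI) caused by limited sampling and motion-induced artifacts. Conventional motion-compensated reconstruction relies on pre-estimated optical flow, which is inaccurate under undersampling and therefore degrades the reconstruction. The key is an implicit neural representation (INR) framework in which two coupled INRs model the spatiotemporal image content and the optical flow field, respectively; the optical flow equation acts as a physics-inspired regularizer alongside a k-space data consistency loss. This joint optimization recovers temporally coherent images and accurate motion fields simultaneously, without prior flow estimation, improving reconstruction quality and temporal fidelity.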

Link: https://arxiv.org/abs/2511.16948
Authors: Baoqing Li, Yuanyuan Liu, Congcong Liu, Qingyong Zhu, Jing Cheng, Yihang Zhou, Hao Chen, Zhuo-Xu Cui, Dong Liang
Affiliations: Guangdong University of Technology; Chinese Academy of Sciences; The Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures

Abstract:Dynamic magnetic resonance imaging (dMRI) captures temporally-resolved anatomy but is often challenged by limited sampling and motion-induced artifacts. Conventional motion-compensated reconstructions typically rely on pre-estimated optical flow, which is inaccurate under undersampling and degrades reconstruction quality. In this work, we propose a novel implicit neural representation (INR) framework that jointly models both the dynamic image sequence and its underlying motion field. Specifically, one INR is employed to parameterize the spatiotemporal image content, while another INR represents the optical flow. The two are coupled via the optical flow equation, which serves as a physics-inspired regularization, in addition to a data consistency loss that enforces agreement with k-space measurements. This joint optimization enables simultaneous recovery of temporally coherent images and motion fields without requiring prior flow estimation. Experiments on dynamic cardiac MRI datasets demonstrate that the proposed method outperforms state-of-the-art motion-compensated and deep learning approaches, achieving superior reconstruction quality, accurate motion estimation, and improved temporal fidelity. These results highlight the potential of implicit joint modeling with flow-regularized constraints for advancing dMRI reconstruction.
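
The optical flow equation that couples the two INRs is the brightness-constancy constraint I_x·u + I_y·v + I_t = 0. The sketch below checks it numerically on a synthetic pattern translating at a known velocity, using central finite differences in place of the automatic differentiation an INR would provide; the analytic `image` function and its velocity are made up for illustration.

```python
import math

# Assumed ground-truth velocity of the synthetic pattern (arbitrary values).
TRUE_U, TRUE_V = 0.3, -0.2


def image(x, y, t):
    """Synthetic intensity pattern translating with velocity (TRUE_U, TRUE_V)."""
    return math.sin(x - TRUE_U * t) + math.cos(y - TRUE_V * t)


def flow_residual(x, y, t, flow_u, flow_v, h=1e-4):
    """Residual of the optical flow equation I_x*u + I_y*v + I_t at one point."""
    i_x = (image(x + h, y, t) - image(x - h, y, t)) / (2 * h)
    i_y = (image(x, y + h, t) - image(x, y - h, t)) / (2 * h)
    i_t = (image(x, y, t + h) - image(x, y, t - h)) / (2 * h)
    return i_x * flow_u + i_y * flow_v + i_t


residual_true_flow = flow_residual(0.5, 1.0, 0.2, TRUE_U, TRUE_V)
residual_zero_flow = flow_residual(0.5, 1.0, 0.2, 0.0, 0.0)
```

The residual vanishes (up to discretization error) only when the candidate flow matches the true motion, so penalizing its squared value ties the image INR and the flow INR together during optimization.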

[CV-94] MultiPriv: Benchmarking Individual-Level Privacy Reasoning in Vision-Language Models

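[Quick Read]: This paper addresses the systemic risk that vision-language models (VLMs) pose through individual-level privacy reasoning: beyond perceiving attributes in images, VLMs can infer and link distributed information to build individual profiles, a more serious privacy threat than attribute recognition. Existing privacy benchmarks mainly evaluate perception-level risk and cannot measure this reasoning-based threat. The key is MultiPriv, the first benchmark designed to evaluate individual-level privacy reasoning, together with the Privacy Perception and Reasoning (PPR) framework and a novel bilingual multimodal dataset. The dataset centers on synthetic individual profiles in which identifiers (e.g., faces, names) are precisely linked to sensitive attributes, and it supports nine challenging tasks ranging from attribute detection to cross-image re-identification and chained inference, systematically quantifying VLMs' privacy reasoning and exposing weaknesses in their safety alignment.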

Link: https://arxiv.org/abs/2511.16940
Authors: Xiongtao Sun, Hui Li, Jiaming Zhang, Yujie Yang, Kaili Liu, Ruxin Feng, Wen Jun Tan, Wei Yang Bryan Lim
Affiliations: Xidian University; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments:

Abstract:Modern Vision-Language Models (VLMs) demonstrate sophisticated reasoning, escalating privacy risks beyond simple attribute perception to individual-level linkage. Current privacy benchmarks are structurally insufficient for this new threat, as they primarily evaluate privacy perception while failing to address the more critical risk of privacy reasoning: a VLM’s ability to infer and link distributed information to construct individual profiles. To address this critical gap, we propose MultiPriv, the first benchmark designed to systematically evaluate individual-level privacy reasoning in VLMs. We introduce the Privacy Perception and Reasoning (PPR) framework and construct a novel, bilingual multimodal dataset to support it. The dataset uniquely features a core component of synthetic individual profiles where identifiers (e.g., faces, names) are meticulously linked to sensitive attributes. This design enables nine challenging tasks evaluating the full PPR spectrum, from attribute detection to cross-image re-identification and chained inference. We conduct a large-scale evaluation of over 50 foundational and commercial VLMs. Our analysis reveals: (1) Many VLMs possess significant, unmeasured reasoning-based privacy risks. (2) Perception-level metrics are poor predictors of these reasoning risks, revealing a critical evaluation gap. (3) Existing safety alignments are inconsistent and ineffective against such reasoning-based attacks. MultiPriv exposes systemic vulnerabilities and provides the necessary framework for developing robust, privacy-preserving VLMs.

[CV-95] OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios

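[Quick Read]: This paper addresses the performance drop of Spatio-Temporal Video Grounding (STVG) models in real-world scenes, which shows up as category bias, oversimplified reasoning, and poor linguistic robustness. The key contributions are the OmniGround benchmark and the DeepSTG evaluation framework: OmniGround contains 3,475 videos spanning 81 categories with complex real-world queries, labeled with a Forward-Backward-Refinement annotation pipeline to ensure label quality. Building on this, the paper proposes PG-TAF, a training-free two-stage framework that decomposes the task into high-level temporal grounding and fine-grained spatio-temporal propagation. PG-TAF improves m_tIoU and m_vIoU on OmniGround by 25.6% and 35.6%, respectively, with consistent gains across four benchmarks.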

Link: https://arxiv.org/abs/2511.16937
Authors: Hong Gao, Jingyu Wu, Xiangkai Xu, Kangni Xie, Yunchen Zhang, Bin Zhong, Xurui Gao, Min-Ling Zhang
Affiliations: Southeast University; ZTE Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 20 pages

Abstract:Spatio-Temporal Video Grounding (STVG) aims to localize target objects in videos based on natural language descriptions. Despite recent advances in Multimodal Large Language Models, a significant gap remains between current models and real-world demands involving diverse objects and complex queries. We attribute this to limited benchmark scope, causing models to exhibit category bias, oversimplified reasoning, and poor linguistic robustness. To address these limitations, we introduce OmniGround, a comprehensive benchmark with 3,475 videos spanning 81 categories and complex real-world queries. We propose the Forward-Backward-Refinement annotation pipeline that combines multi-directional tracking with intelligent error correction for high-quality labels. We further introduce DeepSTG, a systematic evaluation framework quantifying dataset quality across four complementary dimensions beyond superficial statistics. Evaluations reveal an average performance drop of 10.4% on complex real-world scenes, particularly with small/occluded objects and intricate spatial relations. Motivated by these findings, we propose PG-TAF, a training-free two-stage framework decomposing STVG into high-level temporal grounding and fine-grained spatio-temporal propagation. Experiments demonstrate PG-TAF achieves 25.6% and 35.6% improvements in m_tIoU and m_vIoU on OmniGround with consistent gains across four benchmarks.
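
The m_tIoU and m_vIoU figures are means over samples of a temporal IoU and a video-level IoU. The sketch below follows the definitions commonly used in STVG evaluation (box IoU summed over frames in the temporal intersection and divided by the number of frames in the temporal union); it is an assumption that OmniGround uses exactly these, and the function names and data layout are hypothetical.

```python
def temporal_iou(pred, gt):
    """IoU of two inclusive frame intervals given as (start, end)."""
    inter = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]) + 1)
    union = (pred[1] - pred[0] + 1) + (gt[1] - gt[0] + 1) - inter
    return inter / union


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def video_iou(pred_boxes, gt_boxes):
    """vIoU for one sample; inputs map frame index -> box.

    Box IoU is summed over frames present in both tubes and divided by the
    number of frames present in either tube.
    """
    shared = set(pred_boxes) & set(gt_boxes)
    either = set(pred_boxes) | set(gt_boxes)
    if not either:
        return 0.0
    return sum(box_iou(pred_boxes[f], gt_boxes[f]) for f in shared) / len(either)


box = (0.0, 0.0, 10.0, 10.0)
t_iou = temporal_iou((0, 9), (5, 14))
v_iou = video_iou({0: box, 1: box}, {1: box, 2: box})
```

Averaging `temporal_iou` and `video_iou` over all test samples would give the m_tIoU and m_vIoU scores; vIoU penalizes both temporal misalignment and poor boxes, since frames outside the temporal intersection contribute zero.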

[CV-96] Shape-preserving Tooth Segmentation from CBCT Images Using Deep Learning with Semantic and Shape Awareness

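[Quick Read]: This paper addresses accurate tooth segmentation from cone beam computed tomography (CBCT) images in cases of interdental adhesion, which severely distorts tooth shape. The key is a deep learning framework that integrates semantic and shape awareness: a multi-label learning strategy prompted by the target tooth centroid models semantic relationships between teeth and reduces shape ambiguity, and a tooth-shape-aware learning mechanism explicitly enforces morphological constraints to preserve boundary integrity. The two components are unified through multi-task learning, improving the anatomical faithfulness of the segmentation.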

Link: https://arxiv.org/abs/2511.16936
Authors: Zongrui Ji, Zhiming Cui, Na Li, Qianhan Zheng, Miaojing Shi, Ke Deng, Jingyang Zhang, Chaoyuan Li, Xuepeng Chen, Yi Dong, Lei Ma
Affiliations: Tongji University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Background: Accurate tooth segmentation from cone beam computed tomography (CBCT) images is crucial for digital dentistry but remains challenging in cases of interdental adhesions, which cause severe anatomical shape distortion. Methods: To address this, we propose a deep learning framework that integrates semantic and shape awareness for shape-preserving segmentation. Our method introduces a target-tooth-centroid prompted multi-label learning strategy to model semantic relationships between teeth, reducing shape ambiguity. Additionally, a tooth-shape-aware learning mechanism explicitly enforces morphological constraints to preserve boundary integrity. These components are unified via multi-task learning, jointly optimizing segmentation and shape preservation. Results: Extensive evaluations on internal and external datasets demonstrate that our approach significantly outperforms existing methods. Conclusions: Our approach effectively mitigates shape distortions and provides anatomically faithful tooth boundaries.
Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2511.16936 [cs.CV] (or arXiv:2511.16936v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2511.16936 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Zongrui Ji [view email] [v1] Fri, 21 Nov 2025 04:15:07 UTC (1,162 KB) Full-text links: Access Paper: View a PDF of the paper titled Shape-preserving Tooth Segmentation from CBCT Images Using Deep Learning with Semantic and Shape Awareness, by Zongrui Ji and 10 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.CV prev | next new | recent | 2025-11 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) 

[CV-97] Rethinking Diffusion Model-Based Video Super-Resolution: Leveraging Dense Guidance from Aligned Features

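[Quick read]: This paper addresses error accumulation, spatial artifacts, and the trade-off between perceptual quality and fidelity in diffusion model (DM) based video super-resolution (VSR), which stem from inaccurate inter-frame alignment and insufficient compensation. The key to the solution is to revisit the role of alignment and compensation between frames and to propose the Densely Guided diffusion model with Aligned Features for Video Super-Resolution (DGAF-VSR). First, the authors observe that the feature domain is better suited than the pixel domain for information compensation because of its stronger spatial and temporal correlations. Second, an Optical Guided Warping Module (OGWM) warps at an upscaled resolution to better preserve high-frequency details, while a Feature-wise Temporal Condition Module (FTCM) delivers dense guidance in the feature domain. The reported gains cover perceptual quality (35.82% DISTS reduction), fidelity (0.20 dB PSNR gain), and temporal consistency (30.37% tLPIPS reduction).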

Link: https://arxiv.org/abs/2511.16928
Authors: Jingyi Xu, Meisong Zheng, Ying Chen, Minglang Qiao, Xin Deng, Mai Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages

Abstract:Diffusion model (DM) based Video Super-Resolution (VSR) approaches achieve impressive perceptual quality. However, they suffer from error accumulation, spatial artifacts, and a trade-off between perceptual quality and fidelity, primarily caused by inaccurate alignment and insufficient compensation between video frames. In this paper, within the DM-based VSR pipeline, we revisit the role of alignment and compensation between adjacent video frames and reveal two crucial observations: (a) the feature domain is better suited than the pixel domain for information compensation due to its stronger spatial and temporal correlations, and (b) warping at an upscaled resolution better preserves high-frequency information, but this benefit is not necessarily monotonic. Therefore, we propose a novel Densely Guided diffusion model with Aligned Features for Video Super-Resolution (DGAF-VSR), with an Optical Guided Warping Module (OGWM) to maintain high-frequency details in the aligned features and a Feature-wise Temporal Condition Module (FTCM) to deliver dense guidance in the feature domain. Extensive experiments on synthetic and real-world datasets demonstrate that DGAF-VSR surpasses state-of-the-art methods in key aspects of VSR, including perceptual quality (35.82% DISTS reduction), fidelity (0.20 dB PSNR gain), and temporal consistency (30.37% tLPIPS reduction).
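
To put the reported 0.20 dB PSNR gain in perspective, the following is a minimal sketch of the standard PSNR definition and of the mean-squared-error ratio that a given gain implies. The function names `psnr` and `mse_ratio_for_gain` are illustrative and not taken from the paper, and the sketch assumes a fixed peak value.

```python
import math

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB for two equal-length pixel sequences."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals have unbounded PSNR
    return 10.0 * math.log10(peak ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """MSE ratio (new / old) implied by a PSNR gain of gain_db at a fixed peak."""
    return 10.0 ** (-gain_db / 10.0)
```

Under this definition, a 0.20 dB gain corresponds to an MSE ratio of about 0.955, that is, roughly a 4.5% reduction in mean squared error.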

[CV-98] DeltaDeno: Zero-Shot Anomaly Generation via Delta-Denoising Attribution

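[Quick read]: This paper addresses the difficulty of synthesizing anomalous samples with generative AI when no real anomaly samples or training are available. Conventional approaches fine-tune on a handful of anomalous samples, which tends to overfit category priors and contradicts the scarcity that motivates generation in the first place. The key to the solution is Delta-Denoising (DeltaDeno), a training-free zero-shot anomaly generation method. It contrasts two diffusion branches driven by a minimal prompt pair under a shared diffusion schedule and accumulates the per-step denoising deltas into an image-specific localization map, which guides latent inpainting while preserving the surrounding context. Token-level prompt refinement and a spatial attention bias restricted to the anomaly region improve stability and controllability. Experiments on public datasets show high-quality anomaly generation and consistent gains in downstream detection performance.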

Link: https://arxiv.org/abs/2511.16920
Authors: Chaoran Xu, Chengkan Lv, Qiyu Chen, Yunkang Cao, Feng Zhang, Zhengtao Zhang
Affiliations: Institute of Automation, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Anomaly generation is often framed as few-shot fine-tuning with anomalous samples, which contradicts the scarcity that motivates generation and tends to overfit category priors. We tackle the setting where no real anomaly samples or training are available. We propose Delta-Denoising (DeltaDeno), a training-free zero-shot anomaly generation method that localizes and edits defects by contrasting two diffusion branches driven by a minimal prompt pair under a shared schedule. By accumulating per-step denoising deltas into an image-specific localization map, we obtain a mask to guide the latent inpainting during later diffusion steps and preserve the surrounding context while generating realistic local defects. To improve stability and control, DeltaDeno performs token-level prompt refinement that aligns shared content and strengthens anomaly tokens, and applies a spatial attention bias restricted to anomaly tokens in the predicted region. Experiments on public datasets show that DeltaDeno achieves great generation, realism and consistent gains in downstream detection performance. Code will be made publicly available.
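
The idea of accumulating per-step denoising deltas into a localization map can be sketched as follows. This is only an illustration under stated assumptions: the two branches are represented by precomputed per-step predictions on a small 2D grid, the delta is taken as an absolute difference, and the mask is obtained by thresholding at a fraction of the peak value. The names `accumulate_deltas` and `delta_mask` and the `ratio` parameter are hypothetical; the abstract does not specify how the paper forms its mask.

```python
def accumulate_deltas(base_preds, anomaly_preds):
    """Sum per-step absolute differences between two branches' predictions.

    Each argument is a list over diffusion steps of H x W nested lists.
    """
    height, width = len(base_preds[0]), len(base_preds[0][0])
    acc = [[0.0] * width for _ in range(height)]
    for base, anom in zip(base_preds, anomaly_preds):
        for i in range(height):
            for j in range(width):
                acc[i][j] += abs(anom[i][j] - base[i][j])
    return acc

def delta_mask(acc, ratio=0.5):
    """Binary mask of cells whose accumulated delta reaches ratio * peak (assumed rule)."""
    peak = max(max(row) for row in acc)
    if peak == 0:
        return [[0] * len(row) for row in acc]  # branches never disagreed
    return [[1 if v >= ratio * peak else 0 for v in row] for row in acc]
```

In an actual pipeline the resulting mask would restrict latent inpainting to the flagged region; here it only shows where the two branches disagree most over the trajectory.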

[CV-99] UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation

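[Quick read]: This paper addresses the fragmentation among models, tasks, and representations in multimodal learning by proposing UniModel, a unified generative model that merges visual understanding and visual generation within a single pixel-to-pixel diffusion framework. At the representation level, text prompts are rendered as painted text images on a canvas, so that text and images share one visual space (RGB pixels) and modality discrepancies disappear. At the task level, a range of vision-language tasks are cast as pixel-to-pixel mappings in that space: understanding tasks output a text image that encodes the semantic prediction, and generation tasks synthesize images conditioned on text images. At the model level, a single Unified Diffusion Transformer is trained with rectified flow, with a shared backbone and lightweight task embeddings that select the mapping direction. The paradigm yields cross-modal alignment and controllability, such as cycle-consistent image-caption-image loops, and suggests a new path toward general-purpose multimodal intelligence.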

Link: https://arxiv.org/abs/2511.16917
Authors: Chi Zhang, Jiepeng Wang, Youming Wang, Yuanzhi Liang, Xiaoyan Yang, Zuoxin Li, Haibin Huang, Xuelong Li
Affiliations: Institute of Artificial Intelligence (TeleAI), China Telecom
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present UniModel, a unified generative model that jointly supports visual understanding and visual generation within a single pixel-to-pixel diffusion framework. Our goal is to achieve unification along three axes: the model, the tasks, and the representations. At the representation level, we eliminate modality discrepancies by mapping both text and images into a shared visual space: textual prompts are rendered as painted text images on a clean canvas, and all inputs and outputs are treated purely as RGB pixels. This yields a fully vision-native formulation of multimodal learning. At the task level, a broad range of vision-language problems are cast as pixel-to-pixel transformations in this visual space. For understanding tasks, the model takes an RGB image and produces a painted text image that visually encodes the semantic prediction. For generation tasks, painted text images serve as visual conditions that guide realistic and semantically aligned image synthesis. Captioning and text-to-image generation thus become different directions of the same underlying visual translation process. At the model level, we instantiate a single Unified Diffusion Transformer trained with rectified flow in pixel space. A shared backbone jointly learns bidirectional mappings between natural images and painted text images, with lightweight task embeddings to specify the desired direction. Experiments on text-to-image synthesis and image-to-text understanding demonstrate strong cross-modal alignment and emergent controllability such as cycle-consistent image-caption-image loops. Our initial exploration suggests that unifying model, tasks, and representations in a single visual space is a promising paradigm for general-purpose multimodal intelligence.

[CV-100] Q-REAL: Towards Realism and Plausibility Evaluation for AI-Generated Content

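[Quick read]: This paper addresses the coarse granularity of current quality assessment for AI-generated content: existing datasets and models provide only a single quality score, which offers little targeted guidance for improving generative models. The authors propose a fine-grained evaluation framework built around the Q-Real dataset, which contains 3,088 images generated by popular text-to-image models. Each image is annotated with the locations of its major entities, together with judgment questions and attribution descriptions along two dimensions, realism and plausibility. The authors further construct Q-Real Bench to evaluate multimodal large language models (MLLMs) on judgment and on grounding with reasoning, and design a fine-tuning framework to improve MLLM evaluation ability. Experiments support the quality of the dataset and the comprehensiveness of the benchmark.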

Link: https://arxiv.org/abs/2511.16908
Authors: Shushi Wang, Zicheng Zhang, Chunyi Li, Wei Wang, Liya Ma, Fengjiao Chen, Xiaoyu Li, Xuezhi Cao, Guangtao Zhai, Xiaohong Liu
Affiliations: Shanghai Jiao Tong University; Meituan; Shanghai AI Lab; Shanghai Innovation Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Quality assessment of AI-generated content is crucial for evaluating model capability and guiding model optimization. However, most existing quality assessment datasets and models provide only a single quality score, which is too coarse to offer targeted guidance for improving generative models. In current applications of AI-generated images, realism and plausibility are two critical dimensions, and with the emergence of unified generation-understanding models, fine-grained evaluation along these dimensions becomes especially effective for improving generative performance. Therefore, we introduce Q-Real, a novel dataset for fine-grained evaluation of realism and plausibility in AI-generated images. Q-Real consists of 3,088 images generated by popular text-to-image models. For each image, we annotate the locations of major entities and provide a set of judgment questions and attribution descriptions for these along the dimensions of realism and plausibility. Considering that recent advances in multi-modal large language models (MLLMs) enable fine-grained evaluation of AI-generated images, we construct Q-Real Bench to evaluate them on two tasks: judgment and grounding with reasoning. Finally, to enhance MLLM capabilities, we design a fine-tuning framework and conduct experiments on multiple MLLMs using our dataset. Experimental results demonstrate the high quality and significance of our dataset and the comprehensiveness of the benchmark. Dataset and code will be released upon publication.

[CV-101] Warm Diffusion: Recipe for Blur-Noise Mixture Diffusion Models

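[Quick read]: This paper addresses the respective limitations of hot diffusion and cold diffusion. Hot diffusion relies entirely on noise and ignores the strong correlation between high-frequency detail and low-frequency structure, which leads to random behavior in the early generation steps. Cold diffusion uses only blurring and neglects the role of noise in shaping the data manifold, which causes out-of-manifold issues and a drop in performance. The key to the solution is a unified "Warm Diffusion" framework, the Blur-Noise Mixture Diffusion Model (BNMD), which controls blurring and noise jointly. It exploits the spectral dependency of images with a divide-and-conquer strategy that disentangles denoising from deblurring, improving generation quality and stability. The paper also introduces the Blur-to-Noise Ratio (BNR) and uses spectral analysis to quantify the trade-off between model learning dynamics and changes in the data manifold.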

Link: https://arxiv.org/abs/2511.16904
Authors: Hao-Chien Hsueh, Chi-En Yen, Wen-Hsiao Peng, Ching-Chun Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion probabilistic models have achieved remarkable success in generative tasks across diverse data types. While recent studies have explored alternative degradation processes beyond Gaussian noise, this paper bridges two key diffusion paradigms: hot diffusion, which relies entirely on noise, and cold diffusion, which uses only blurring without noise. We argue that hot diffusion fails to exploit the strong correlation between high-frequency image detail and low-frequency structures, leading to random behaviors in the early steps of generation. Conversely, while cold diffusion leverages image correlations for prediction, it neglects the role of noise (randomness) in shaping the data manifold, resulting in out-of-manifold issues and partially explaining its performance drop. To integrate both strengths, we propose Warm Diffusion, a unified Blur-Noise Mixture Diffusion Model (BNMD), to control blurring and noise jointly. Our divide-and-conquer strategy exploits the spectral dependency in images, simplifying score model estimation by disentangling the denoising and deblurring processes. We further analyze the Blur-to-Noise Ratio (BNR) using spectral analysis to investigate the trade-off between model learning dynamics and changes in the data manifold. Extensive experiments across benchmarks validate the effectiveness of our approach for image generation.
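
A forward degradation that mixes blur and noise can be sketched on a 1D signal as below. This is an assumption-laden illustration rather than the paper's formulation: it takes the corrupted sample to be a Gaussian-blurred signal plus additive Gaussian noise, with `sigma_blur` and `sigma_noise` as free parameters. The function names, the clamped border handling, and the kernel radius rule are all hypothetical choices, and the abstract does not give the actual schedule or the BNR definition.

```python
import math
import random

def gaussian_kernel(sigma, radius=None):
    """Normalized 1D Gaussian kernel; sigma <= 0 yields the identity kernel."""
    if sigma <= 0:
        return [1.0]
    radius = radius or max(1, int(3 * sigma))
    weights = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_1d(signal, sigma):
    """Blur a 1D signal with a Gaussian kernel, clamping indices at the borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out

def warm_degrade(x0, sigma_blur, sigma_noise, rng):
    """Assumed mixture: blur(x0, sigma_blur) + sigma_noise * eps, eps ~ N(0, 1)."""
    blurred = blur_1d(x0, sigma_blur)
    return [b + sigma_noise * rng.gauss(0.0, 1.0) for b in blurred]
```

Setting `sigma_blur` to zero recovers a noise-only ("hot") corruption and setting `sigma_noise` to zero recovers a blur-only ("cold") one, which is the sense in which a mixture interpolates between the two paradigms.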

[CV-102] R-AVST: Empowering Video-LLMs with Fine-Grained Spatio-Temporal Reasoning in Complex Audio-Visual Scenarios AAAI2026

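[Quick read]: This paper addresses the inability of current multimodal large language models (MLLMs) to model complex real-world audio-visual events in video understanding, in particular the lack of datasets with fine-grained spatio-temporal annotations and of efficient reasoning mechanisms. First, the authors build R-AVST, which they describe as the first dataset for real-world audio-visual spatio-temporal reasoning. Its pipeline combines LLM-based key object extraction, automatic spatial annotation, and manual quality inspection, yielding fine-grained annotations for 27K objects in over 5K untrimmed videos. Second, they define three core spatio-temporal reasoning tasks and generate more than 8K high-quality, evenly distributed question-answer pairs for benchmarking. Finally, they propose AVST-Zero, a reinforcement learning-based model that optimizes behavior directly through multi-dimensional rewards, without intermediate supervision, to handle complex audio-visual reasoning.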

Link: https://arxiv.org/abs/2511.16901
Authors: Lu Zhu, Tiantian Geng, Yangye Chen, Teng Wang, Ping Lu, Feng Zheng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026. Project page: this https URL

Abstract:Recently, rapid advancements have been made in multimodal large language models (MLLMs), especially in video understanding tasks. However, current research focuses on simple video scenarios, failing to reflect the complex and diverse nature of real-world audio-visual events in videos. To bridge this gap, we firstly introduce R-AVST, a dataset for audio-visual reasoning featuring fine-grained spatio-temporal annotations. In constructing this, we design a pipeline consisting of LLM-based key object extraction, automatic spatial annotation and manual quality inspection, resulting in over 5K untrimmed videos with 27K objects across 100 types of audio-visual events. Building on this dataset, we define three core tasks for spatio-temporal reasoning in audio-visual scenes and generate more than 8K high-quality, evenly distributed question-answer pairs to effectively benchmark model performance. To further enhance reasoning, we propose AVST-Zero, a reinforcement learning-based model that avoids intermediate supervision, directly optimizing behavior via carefully designed multi-dimensional rewards. Extensive experiments validate the effectiveness of our R-AVST in advancing audio-visual spatio-temporal reasoning, upon which AVST-Zero demonstrates competitive performance compared to existing models. To the best of our knowledge, R-AVST is the first dataset designed for real-world audio-visual spatio-temporal reasoning, and AVST-Zero offers a novel perspective for tackling future challenges in this domain.

[CV-103] Glass Surface Detection: Leveraging Reflection Dynamics in Flash/No-flash Imagery

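[Quick read]: This paper addresses glass surface detection, a hard computer vision problem. Existing methods rely mainly on boundary cues (such as window and door frames) or reflection cues, and do not fully exploit the intrinsic properties of glass itself. The core question is how to use the way reflections on glass change under different illumination to improve detection accuracy. The key to the solution is NFGlassNet, which uses a Reflection Contrast Mining Module (RCMM) to extract reflection features and a Reflection Guided Attention Module (RGAM) to fuse reflection and glass surface features for more accurate localization. The authors also construct a dataset of 3.3K flash/no-flash image pairs for training and evaluation, and report that the method outperforms state-of-the-art techniques.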

Link: https://arxiv.org/abs/2511.16887
Authors: Tao Yan, Hao Huang, Yiwei Lu, Zeyu Wang, Ke Xu, Yinghui Wang, Xiaojun Chang, Rynson W.H. Lau
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 12 figures

Abstract:Glass surfaces are ubiquitous in daily life, typically appearing colorless, transparent, and lacking distinctive features. These characteristics make glass surface detection a challenging computer vision task. Existing glass surface detection methods always rely on boundary cues (e.g., window and door frames) or reflection cues to locate glass surfaces, but they fail to fully exploit the intrinsic properties of the glass itself for accurate localization. We observed that in most real-world scenes, the illumination intensity in front of the glass surface differs from that behind it, which results in variations in the reflections visible on the glass surface. Specifically, when standing on the brighter side of the glass and applying a flash towards the darker side, existing reflections on the glass surface tend to disappear. Conversely, while standing on the darker side and applying a flash towards the brighter side, distinct reflections will appear on the glass surface. Based on this phenomenon, we propose NFGlassNet, a novel method for glass surface detection that leverages the reflection dynamics present in flash/no-flash imagery. Specifically, we propose a Reflection Contrast Mining Module (RCMM) for extracting reflections, and a Reflection Guided Attention Module (RGAM) for fusing features from reflection and glass surface for accurate glass surface detection. For learning our network, we also construct a dataset consisting of 3.3K no-flash and flash image pairs captured from various scenes with corresponding ground truth annotations. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods. Our code, model, and dataset will be available upon acceptance of the manuscript.

[CV-104] Align & Invert: Solving Inverse Problems with Diffusion and Flow-based Models via Representational Alignment

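[Quick read]: This paper addresses low reconstruction quality and slow convergence in inverse problems, in particular how to guide reconstruction toward higher fidelity and perceptual realism when a pretrained generative model serves as the prior. The key to the solution is representation alignment (REPA): at inference time, the internal representations of a diffusion or flow-based model are aligned with the features of a pretrained self-supervised visual encoder such as DINOv2, which introduces a strong inductive bias. The theoretical analysis relates the REPA regularizer to a divergence measure in the DINOv2 embedding space and shows that REPA updates steer the model's internal representations toward those of the clean image. Experiments on super-resolution, box inpainting, Gaussian deblurring, and motion deblurring confirm the generality of the approach, and show that it reduces the number of required discretization steps without compromising performance, improving both reconstruction quality and computational efficiency.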

Link: https://arxiv.org/abs/2511.16870
Authors: Loukas Sfountouris, Giannis Daras, Paris Giampouras
Affiliations: University of Warwick; Massachusetts Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Enforcing alignment between the internal representations of diffusion or flow-based generative models and those of pretrained self-supervised encoders has recently been shown to provide a powerful inductive bias, improving both convergence and sample quality. In this work, we extend this idea to inverse problems, where pretrained generative models are employed as priors. We propose applying representation alignment (REPA) between diffusion or flow-based models and a pretrained self-supervised visual encoder, such as DINOv2, to guide the reconstruction process at inference time. Although ground-truth signals are unavailable in inverse problems, we show that aligning model representations with approximate target features can substantially enhance reconstruction fidelity and perceptual realism. We provide theoretical results showing (a) the relation between the REPA regularization and a divergence measure in the DINOv2 embedding space, and (b) how REPA updates steer the model’s internal representations toward those of the clean image. These results offer insights into the role of REPA in improving perceptual fidelity. Finally, we demonstrate the generality of our approach by integrating it into multiple state-of-the-art inverse problem solvers. Extensive experiments on super-resolution, box inpainting, Gaussian deblurring, and motion deblurring confirm that our method consistently improves reconstruction quality across tasks, while also providing substantial efficiency gains by reducing the number of required discretization steps without compromising the performance of the underlying solver.
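
An alignment term of the kind described above can be sketched as a negative mean cosine similarity between per-patch model features and target encoder features. This is a minimal illustration, assuming the model features have already been projected to the encoder's dimension and that the target features come from some approximate clean estimate; the function names are hypothetical, and the exact regularizer used in the paper may differ.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def alignment_loss(model_feats, target_feats):
    """Negative mean per-patch cosine similarity (assumed form of the alignment term)."""
    sims = [cosine_similarity(h, y) for h, y in zip(model_feats, target_feats)]
    return -sum(sims) / len(sims)
```

The loss reaches its minimum of -1 when every patch feature points in the same direction as its target, so a gradient step on it during sampling would pull the model's representation toward the encoder's.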

[CV-105] The Joint Gromov Wasserstein Objective for Multiple Object Matching

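[Quick read]: This paper addresses the fact that the traditional Gromov-Wasserstein (GW) distance only supports pairwise matching between single objects and cannot handle multiple-to-one or multiple-to-multiple matching. The key to the solution is the Joint Gromov-Wasserstein (JGW) objective, which extends the original GW framework to match collections of objects simultaneously. It provides a non-negative dissimilarity measure that identifies partially isomorphic distributions of mm-spaces, with a point sampling convergence guarantee. For point cloud representations, the objective can be solved by adapting classical Optimal Transport algorithms, including entropic regularization. On synthetic and real-world datasets, the method is reported to be more accurate and more computationally efficient than existing GW variants, and effective for multiple shape matching on geometric shapes and biomolecular complexes.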

Link: https://arxiv.org/abs/2511.16868
Authors: Aryan Tajmir Riahi, Khanh Dao Duc
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Biomolecules (q-bio.BM)
Comments:

Abstract:The Gromov-Wasserstein (GW) distance serves as a powerful tool for matching objects in metric spaces. However, its traditional formulation is constrained to pairwise matching between single objects, limiting its utility in scenarios and applications requiring multiple-to-one or multiple-to-multiple object matching. In this paper, we introduce the Joint Gromov-Wasserstein (JGW) objective and extend the original framework of GW to enable simultaneous matching between collections of objects. Our formulation provides a non-negative dissimilarity measure that identifies partially isomorphic distributions of mm-spaces, with point sampling convergence. We also show that the objective can be formulated and solved for point cloud object representations by adapting traditional algorithms in Optimal Transport, including entropic regularization. Our benchmarking with other variants of GW for partial matching indicates superior performance in accuracy and computational efficiency of our method, while experiments on both synthetic and real-world datasets show its effectiveness for multiple shape matching, including geometric shapes and biomolecular complexes, suggesting promising applications for solving complex matching problems across diverse domains, including computer graphics and structural biology.
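
For orientation, the sketch below evaluates the standard pairwise GW objective with a squared loss for a given coupling between two finite metric spaces. It is the classical single-pair formulation that the paper extends, not the JGW objective itself, whose exact form is not given in the abstract. The function name `gw_objective` is illustrative.

```python
def gw_objective(dist_x, dist_y, coupling):
    """Squared-loss GW cost: sum over (i,j),(k,l) of (dx[i][k] - dy[j][l])^2 * T[i][j] * T[k][l]."""
    n, m = len(dist_x), len(dist_y)
    total = 0.0
    for i in range(n):
        for j in range(m):
            if coupling[i][j] == 0.0:
                continue  # skip pairs that carry no mass
            for k in range(n):
                for l in range(m):
                    diff = dist_x[i][k] - dist_y[j][l]
                    total += diff * diff * coupling[i][j] * coupling[k][l]
    return total
```

A coupling that pairs points so that all pairwise distances are preserved gives a cost of zero, while a mismatched coupling gives a strictly positive cost; solvers search over couplings with fixed marginals to minimize this quantity.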

[CV-106] Parts-Mamba: Augmenting Joint Context with Part-Level Scanning for Occluded Human Skeleton

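[Quick read]: This paper addresses missing skeleton data in skeleton action recognition caused by body occlusion or poor communication quality. Such non-ideal conditions destroy local context and significantly degrade existing graph convolutional network (GCN) models. The key to the solution is Parts-Mamba, a hybrid GCN-Mamba architecture with two main components: (1) a parts-specific scanning feature that extracts information from individual body parts, and (2) a parts-body fusion module that preserves context between distant joints, making the model more robust to missing data. On the NTU RGB+D 60 and NTU RGB+D 120 datasets, the method achieves up to 12.9% higher accuracy under various occlusion settings.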

Link: https://arxiv.org/abs/2511.16860
Authors: Tianyi Shen, Huijuan Xu, Nilesh Ahuja, Omesh Tickoo, Philip Shin, Vijaykrishnan Narayanan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Skeleton action recognition involves recognizing human action from human skeletons. The use of graph convolutional networks (GCNs) has driven major advances in this recognition task. In real-world scenarios, the captured skeletons are not always perfect or complete because of occlusions of parts of the human body or poor communication quality, leading to missing parts in skeletons or videos with missing frames. In the presence of such non-idealities, existing GCN models perform poorly due to missing local context. To address this limitation, we propose Parts-Mamba, a hybrid GCN-Mamba model designed to enhance the ability to capture and maintain contextual information from distant joints. The proposed Parts-Mamba model effectively captures part-specific information through its parts-specific scanning feature and preserves non-neighboring joint context via a parts-body fusion module. Our proposed model is evaluated on the NTU RGB+D 60 and NTU RGB+D 120 datasets under different occlusion settings, achieving up to 12.9% improvement in accuracy.

[CV-107] BOP-ASK: Object-Interaction Reasoning for Vision-Language Models

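[Quick read]: This paper addresses the neglect of fine-grained object interaction understanding in current spatial reasoning evaluations of Vision Language Models (VLMs). Existing benchmarks focus on high-level spatial relations (such as "left of" or "behind") and overlook capabilities that real-world applications need: precise 3D localization, physical compatibility between objects, object affordances, and multi-step spatial planning. The key to the solution is BOP-ASK, a large-scale dataset whose generation pipeline uses 6D object poses from the Benchmark for Object Pose Estimation (BOP) datasets to derive fine-grained annotations automatically, including grasp poses, referred object poses, path planning trajectories, relative spatial and depth relationships, and object-to-object relationships. The dataset contains over 150k images and 33M question-answer pairs spanning six tasks (four of them novel) and supports both training and evaluation of VLMs. BOP-ASK-core (the contributed test benchmark) and BOP-ASK-lab (an out-of-distribution benchmark) are used to show emergent capabilities in cluttered environments, such as precise object and grasp pose estimation, trajectory planning, and fine-grained object-centric spatial reasoning.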

Link: https://arxiv.org/abs/2511.16857
Authors: Vineet Bhat, Sungsu Kim, Valts Blukis, Greg Heinrich, Prashanth Krishnamurthy, Ramesh Karri, Stan Birchfield, Farshad Khorrami, Jonathan Tremblay
Affiliations: New York University; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Vision Language Models (VLMs) have achieved impressive performance on spatial reasoning benchmarks, yet these evaluations mask critical weaknesses in understanding object interactions. Current benchmarks test high level relationships (‘left of,’ ‘behind’, etc.) but ignore fine-grained spatial understanding needed for real world applications: precise 3D localization, physical compatibility between objects, object affordances and multi step spatial planning. In this work, we present BOP-ASK, a novel large scale dataset for object interaction reasoning for both training and benchmarking. Our data generation pipeline leverages 6D object poses from the Benchmark for Object Pose Estimation (BOP) datasets from which we derive fine grained annotations such as grasp poses, referred object poses, path planning trajectories, relative spatial and depth relationships, and object-to-object relationships. BOP-ASK comprises over 150k images and 33M question answer pairs spanning six tasks (four novel), providing a rich resource for training and evaluating VLMs. We evaluate proprietary and open sourced VLMs, and conduct human evaluations on BOP-ASK-core, a contributed test benchmark. We also release BOP-ASK-lab, an out-of-distribution benchmark with images not sourced from BOP, enabling testing of generalization. Our experiments demonstrate that models trained on BOP-ASK outperform baselines and exhibit emergent capabilities such as precise object and grasp pose estimation, trajectory planning, and fine-grained object-centric spatial reasoning in cluttered environments. We will publicly release our datasets and dataset generation pipeline.

[CV-108] Towards Unified Vision Language Models for Forest Ecological Analysis in Earth Observation AAAI2026

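[Quick read]: This paper addresses the lack of benchmarks for evaluating Vision Language Models (VLMs) on scientific regression tasks in Earth Observation (EO). Existing EO datasets focus on semantic understanding tasks such as captioning or classification and do not link multimodal perception to measurable biophysical variables. The key to the solution is REO-Instruct, presented as the first unified benchmark for both descriptive and regression tasks. It builds a cognitively interpretable logic chain in a forest ecological scenario (human activity recognition, land-cover classification, ecological patch counting, and above-ground biomass (AGB) regression), connecting qualitative understanding with quantitative prediction. The dataset combines co-registered Sentinel-2 and ALOS-2 imagery with structured textual annotations generated and validated through a hybrid human-AI pipeline, providing a standardized basis for evaluating next-generation geospatial models with scientific reasoning ability.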

Link: https://arxiv.org/abs/2511.16853
Authors: Xizhe Xue, Xiao Xiang Zhu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI2026 AI for Environmental Science Workshop

Abstract:Recent progress in vision language models (VLMs) has enabled remarkable perception and reasoning capabilities, yet their potential for scientific regression in Earth Observation (EO) remains largely unexplored. Existing EO datasets mainly emphasize semantic understanding tasks such as captioning or classification, lacking benchmarks that align multimodal perception with measurable biophysical variables. To fill this gap, we present REO-Instruct, the first unified benchmark designed for both descriptive and regression tasks in EO. REO-Instruct establishes a cognitively interpretable logic chain in a forest ecological scenario (human activity, land-cover classification, ecological patch counting, above-ground biomass (AGB) regression), bridging qualitative understanding and quantitative prediction. The dataset integrates co-registered Sentinel-2 and ALOS-2 imagery with structured textual annotations generated and validated through a hybrid human-AI pipeline. Comprehensive evaluation protocols and baseline results across generic VLMs reveal that current models struggle with numeric reasoning, highlighting an essential challenge for scientific VLMs. REO-Instruct offers a standardized foundation for developing and assessing next-generation geospatial models capable of both description and scientific inference. The project page is publicly available at this https URL.

[CV-109] WorldGen: From Text to Traversable and Interactive 3D Worlds

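[Quick read]: This paper addresses how to build large-scale, interactive 3D virtual worlds automatically from natural-language text prompts, lowering the barrier to creation while improving the geometric consistency and visual richness of the generated environments. The core of the solution is WorldGen, a modular system that combines LLM-driven scene layout reasoning, diffusion-based 3D content generation, and object-aware scene decomposition to convert a semantic description end to end into a complete, navigable, and editable 3D environment, improving the functionality and controllability of the generated worlds.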

Link: https://arxiv.org/abs/2511.16825
Authors: Dilin Wang, Hyunyoung Jung, Tom Monnier, Kihyuk Sohn, Chuhang Zou, Xiaoyu Xiang, Yu-Ying Yeh, Di Liu, Zixuan Huang, Thu Nguyen-Phuoc, Yuchen Fan, Sergiu Oprea, Ziyan Wang, Roman Shapovalov, Nikolaos Sarafianos, Thibault Groueix, Antoine Toisoul, Prithviraj Dhar, Xiao Chu, Minghao Chen, Geon Yeong Park, Mahima Gupta, Yassir Azziz, Rakesh Ranjan, Andrea Vedaldi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce WorldGen, a system that enables the automatic creation of large-scale, interactive 3D worlds directly from text prompts. Our approach transforms natural language descriptions into traversable, fully textured environments that can be immediately explored or edited within standard game engines. By combining LLM-driven scene layout reasoning, procedural generation, diffusion-based 3D generation, and object-aware scene decomposition, WorldGen bridges the gap between creative intent and functional virtual spaces, allowing creators to design coherent, navigable worlds without manual modeling or specialized 3D expertise. The system is fully modular and supports fine-grained control over layout, scale, and style, producing worlds that are geometrically consistent, visually rich, and efficient to render in real time. This work represents a step towards accessible, generative world-building at scale, advancing the frontier of 3D generative AI for applications in gaming, simulation, and immersive social environments.

[CV-110] Mesh RAG: Retrieval Augmentation for Autoregressive Mesh Generation

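[Quick read]: This paper addresses the quality-speed trade-off and the difficulty of incremental editing in current autoregressive 3D mesh generation. Existing methods generate mesh parts sequentially, which is computationally inefficient and hard to parallelize, and improving quality through larger models or longer sequences further lengthens generation time. The authors propose Mesh RAG (Retrieval-Augmented Generation), a training-free, plug-and-play framework. Borrowing the idea of RAG from language models, it uses point cloud segmentation, spatial transformation, and point cloud registration to retrieve, generate, and integrate mesh components. This breaks the strict sequential dependency, enables efficient parallel inference, improves mesh quality, speeds up generation, and supports incremental editing without retraining.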

Link: https://arxiv.org/abs/2511.16807
Authors: Xiatao Sun, Chen Liang, Qian Wang, Daniel Rakita
Affiliations: Yale University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:3D meshes are a critical building block for applications ranging from industrial design and gaming to simulation and robotics. Traditionally, meshes are crafted manually by artists, a process that is time-intensive and difficult to scale. To automate and accelerate this asset creation, autoregressive models have emerged as a powerful paradigm for artistic mesh generation. However, current methods to enhance quality typically rely on larger models or longer sequences that result in longer generation time, and their inherent sequential nature imposes a severe quality-speed trade-off. This sequential dependency also significantly complicates incremental editing. To overcome these limitations, we propose Mesh RAG, a novel, training-free, plug-and-play framework for autoregressive mesh generation models. Inspired by RAG for language models, our approach augments the generation process by leveraging point cloud segmentation, spatial transformation, and point cloud registration to retrieve, generate, and integrate mesh components. This retrieval-based approach decouples generation from its strict sequential dependency, facilitating efficient and parallelizable inference. We demonstrate the wide applicability of Mesh RAG across various foundational autoregressive mesh generation models, showing it significantly enhances mesh quality, accelerates generation speed compared to sequential part prediction, and enables incremental editing, all without model retraining.

[CV-111] Revisiting Multimodal KV Cache Compression: A Frequency-Domain-Guided Outlier-KV-Aware Approach

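[Quick read]: This paper addresses the sharp growth in KV Cache memory during inference in Multimodal Large Language Models (MLLMs) as visual input length increases. Existing compression methods mostly prune by attention score, which is incompatible with efficient attention kernels (such as FlashAttention) and ignores the contribution of value vectors to the attention output. The key to the solution is to revisit compression from the perspective of the distribution of the KV matrices. First, a low-pass filter extracts the low-frequency energy that dominates in the frequency domain. Second, the method identifies Outlier KVs, the KV pairs that deviate substantially from this principal component. On this basis the authors design FlashCache, a frequency-domain-guided, Outlier-KV-aware compression framework with an Outlier KV Recognition Module and a Dynamic Budget Allocation Module that adaptively retain critical information. Experiments show up to 1.69 times faster decoding and 80% lower KV memory usage while maintaining task performance.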

Link: https://arxiv.org/abs/2511.16786
Authors: Yaoxin Yang, Peng Ye, Xudong Tan, Chongjun Tu, Maosen Zhao, Jia Hao, Tao Chen
Affiliations: Fudan University; The Chinese University of Hong Kong; Shanghai Artificial Intelligence Laboratory
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review

Abstract:Multimodal large language models suffer from substantial inference overhead since the multimodal KV Cache grows proportionally with the visual input length. Existing multimodal KV Cache compression methods mostly rely on attention scores to reduce cache size, which makes them incompatible with established efficient attention kernels (e.g., FlashAttention) and ignores the contribution of value vectors to the attention output. In this work, we revisit multimodal KV Cache compression from the perspective of the KV matrices’ distribution. First, we observe that the frequency-domain energy of multimodal KV matrices is predominantly concentrated in low frequencies, and we extract this principal energy via a low-pass filter. Further, we find that removing the KV pairs that deviate substantially from this principal energy, which we define as Outlier KVs, leads to a pronounced performance drop. Considering that Outlier KVs are more likely to encode features critical for inference, we propose FlashCache, a frequency-domain-guided, Outlier-KV-aware KV Cache compression framework. First, we introduce an Outlier KV Recognition Module that models the principal component of multimodal KV matrices in the frequency domain and preferentially retains KV pairs that significantly deviate from it. Furthermore, a Dynamic Budget Allocation Module is designed to adaptively determine the per-layer KV Cache size to retain more Outlier KVs. Experiments on multiple MLLMs and benchmarks demonstrate that FlashCache outperforms state-of-the-art multimodal KV compression methods, achieving up to 1.69 times faster decoding with 80% lower KV memory usage while maintaining task performance.
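
The outlier-selection idea can be sketched as follows: low-pass filter each channel of a KV matrix, measure how far each token deviates from the filtered version, and keep the most deviating tokens. Everything specific here is an assumption made for illustration: filtering along the token axis, the naive O(n^2) DFT, the hard frequency `cutoff`, the L2 deviation score, and the fixed `budget`. The function names are hypothetical, and the paper's actual modules, including its per-layer dynamic budget, are not reproduced.

```python
import cmath
import math

def low_pass(signal, cutoff):
    """Keep DFT bins with frequency index <= cutoff (both signs), then invert."""
    n = len(signal)
    spectrum = [
        sum(signal[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]
    kept = [c if min(f, n - f) <= cutoff else 0.0 for f, c in enumerate(spectrum)]
    return [
        (sum(kept[f] * cmath.exp(2j * math.pi * f * t / n) for f in range(n)) / n).real
        for t in range(n)
    ]

def outlier_token_indices(kv, cutoff, budget):
    """Indices of the `budget` tokens deviating most from the low-pass component.

    kv is a tokens x dims nested list; each channel is filtered along the token axis.
    """
    n_tokens, n_dims = len(kv), len(kv[0])
    smooth_cols = [
        low_pass([kv[t][d] for t in range(n_tokens)], cutoff) for d in range(n_dims)
    ]
    deviation = [
        math.sqrt(sum((kv[t][d] - smooth_cols[d][t]) ** 2 for d in range(n_dims)))
        for t in range(n_tokens)
    ]
    ranked = sorted(range(n_tokens), key=lambda t: deviation[t], reverse=True)
    return sorted(ranked[:budget])
```

Because the score depends only on the KV entries themselves and never on attention weights, a selection rule of this shape does not require materializing the attention matrix, which is the compatibility point the abstract raises for kernels such as FlashAttention.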

[CV-112] Generative Augmented Reality: Paradigms, Technologies, and Future Applications

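[Quick Read]: This paper addresses the shortcomings of conventional augmented reality (AR) systems, whose modular processing pipelines lead to unnatural blending of real and virtual content, high interaction latency, and limited immersion. The key to its solution is Generative Augmented Reality (GAR), which replaces the multi-stage modules of a conventional AR engine with a unified generative backbone: environmental sensing, virtual content, and interaction signals are jointly encoded as conditioning inputs for end-to-end continuous video generation, with the aim of improving realism, interactivity, and immersion and moving AR toward a more efficient, more intelligent paradigm.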

Link: https://arxiv.org/abs/2511.16783
Authors: Chen Liang, Jiawen Zheng, Yufeng Zeng, Yi Tan, Hengye Lyu, Yuhui Zheng, Zisu Li, Yueting Weng, Jiaxin Shi, Hanwang Zhang
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; Nanyang Technological University; XMax.AI Ltd.
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper introduces Generative Augmented Reality (GAR) as a next-generation paradigm that reframes augmentation as a process of world re-synthesis rather than world composition by a conventional AR engine. GAR replaces the conventional AR engine’s multi-stage modules with a unified generative backbone, where environmental sensing, virtual content, and interaction signals are jointly encoded as conditioning inputs for continuous video generation. We formalize the computational correspondence between AR and GAR, survey the technical foundations that make real-time generative augmentation feasible, and outline prospective applications that leverage its unified inference model. We envision GAR as a future AR paradigm that delivers high-fidelity experiences in terms of realism, interactivity, and immersion, while eliciting new research challenges on technologies, content ecosystems, and the ethical and societal implications.

[CV-113] SVG360: Multi-View SVG Generation with Geometric and Color Consistency from a Single SVG

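[Quick Read]: This paper addresses the generation of multi-view-consistent SVG representations from a single-view vector graphic (SVG), with particular attention to preserving geometric and color consistency. The key to its solution is a three-stage framework. First, the input SVG is rasterized, lifted to a 3D representation, and rendered under target camera poses to obtain multi-view images. Second, the temporal memory mechanism of Segment Anything 2 (SAM2) is extended to the spatial domain, building a spatial memory bank that establishes part-level correspondences across neighboring views and yields cleaner, more consistent vector paths and color assignments without retraining. Finally, during raster-to-vector conversion, path consolidation and structural optimization reduce redundancy while preserving boundaries and semantics. The method produces object-level multi-view SVGs with geometric and color consistency and is presented as scalable.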

Link: https://arxiv.org/abs/2511.16766
Authors: Mengnan Jiang, Zhaolin Sun, Christian Franke, Michele Franco Adesso, Antonio Haas, Grace Li Zhang
Affiliations: Mercedes Benz Group; Technische Universität Darmstadt; University of Stuttgart
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures. Preprint

Abstract:Scalable Vector Graphics (SVGs) are central to modern design workflows, offering scaling without distortion and precise editability. However, for single object SVGs, generating multi-view consistent SVGs from a single-view input remains underexplored. We present a three stage framework that produces multi-view SVGs with geometric and color consistency from a single SVG input. First, the rasterized input is lifted to a 3D representation and rendered under target camera poses, producing multi-view images of the object. Next, we extend the temporal memory mechanism of Segment Anything 2 (SAM2) to the spatial domain, constructing a spatial memory bank that establishes part level correspondences across neighboring views, yielding cleaner and more consistent vector paths and color assignments without retraining. Finally, during the raster to vector conversion, we perform path consolidation and structural optimization to reduce redundancy while preserving boundaries and semantics. The resulting SVGs exhibit strong geometric and color consistency across views, significantly reduce redundant paths, and retain fine structural details. This work bridges generative modeling and structured vector representation, providing a scalable route to single input, object level multi-view SVG generation and supporting applications such as asset creation and semantic vector editing.

[CV-114] SafeR-CLIP: Mitigating NSFW Content in Vision-Language Models While Preserving Pre-Trained Knowledge AAAI2026

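[Quick Read]: This paper addresses the performance degradation that vision-language models such as CLIP commonly suffer when fine-tuned for safety, that is, the trade-off between safety and generalization. The core problem is that existing methods use rigid alignment strategies that force unsafe concepts onto predefined safe targets, which disrupts the model's learned semantic structure. The key to the solution is SaFeR-CLIP, a proximity-aware fine-tuning approach that redirects each unsafe concept to its semantically closest safe alternative, minimizing the change to the representation space and thereby preserving as much performance as possible while maintaining safety. The strategy emphasizes respecting the geometry of the pretrained representation space in order to reconcile safety and performance.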

Link: https://arxiv.org/abs/2511.16743
Authors: Adeel Yousaf, Joseph Fioresi, James Beetham, Amrit Singh Bedi, Mubarak Shah
Affiliations: University of Central Florida; University of South Florida
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: AAAI 2026 (Main Technical Track)

Abstract:Improving the safety of vision-language models like CLIP via fine-tuning often comes at a steep price, causing significant drops in their generalization performance. We find this trade-off stems from rigid alignment strategies that force unsafe concepts toward single, predefined safe targets, disrupting the model’s learned semantic structure. To address this, we propose a proximity-aware approach: redirecting unsafe concepts to their semantically closest safe alternatives to minimize representational change. We introduce SaFeR-CLIP, a fine-tuning framework that applies this principle of minimal intervention. SaFeR-CLIP successfully reconciles safety and performance, recovering up to 8.0% in zero-shot accuracy over prior methods while maintaining robust safety. To support more rigorous evaluation, we also contribute NSFW-Caps, a new benchmark of 1,000 highly-aligned pairs for testing safety under distributional shift. Our work shows that respecting the geometry of pretrained representations is key to achieving safety without sacrificing performance.
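
The proximity-aware principle, redirecting an unsafe concept to its semantically closest safe alternative, can be sketched as a nearest-neighbour lookup under cosine similarity. The toy 3-dimensional embeddings and the names `cosine` and `closest_safe_target` are hypothetical; the abstract does not specify how SaFeR-CLIP measures closeness, so cosine similarity is an assumption here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_safe_target(unsafe_vec, safe_vecs):
    """Return the key of the safe embedding most similar to unsafe_vec.

    safe_vecs maps a label to its embedding (toy values, not CLIP outputs).
    """
    return max(safe_vecs, key=lambda label: cosine(unsafe_vec, safe_vecs[label]))


# Toy embeddings, assumed purely for illustration.
safe_concepts = {
    "safe_a": [0.9, 0.1, 0.0],
    "safe_b": [0.0, 0.2, 0.9],
}
target = closest_safe_target([0.8, 0.3, 0.1], safe_concepts)
```

Choosing the nearest safe target, rather than one fixed target for every unsafe concept, is what keeps the required shift in embedding space small.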

[CV-115] SAM 3: Segment Anything with Concepts

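[Quick Read]: This paper addresses promptable object segmentation and tracking in images and videos from concept prompts: given a short noun phrase (e.g., "yellow school bus"), image exemplars, or a combination of both, the goal is to find and segment every matching object instance and assign each a unique identity. The key to the solution is Segment Anything Model 3 (SAM 3), whose main contributions are: (1) a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives; (2) an architecture in which an image-level detector and a memory-based video tracker share a single backbone; and (3) a presence head that decouples recognition from localization and boosts detection accuracy. Experiments show that SAM 3 doubles the accuracy of existing systems on Promptable Concept Segmentation (PCS) in both images and videos, and improves on previous SAM capabilities in visual segmentation tasks.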

Link: https://arxiv.org/abs/2511.16719
Authors: Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane Momeni, Rishi Hazra, Shuangrui Ding, Sagar Vaze, Francois Porcher, Feng Li, Siyuan Li, Aishwarya Kamath, Ho Kei Cheng, Piotr Dollár, Nikhila Ravi, Kate Saenko, Pengchuan Zhang, Christoph Feichtenhofer
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., “yellow school bus”), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone. Recognition and localization are decoupled with a presence head, which boosts detection accuracy. SAM 3 doubles the accuracy of existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation.

[CV-116] A Machine Learning-Driven Solution for Denoising Inertial Confinement Fusion Images

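[Quick Read]: This paper addresses the loss of fine detail and the blurred edges that mixed Gaussian-Poisson noise causes in neutron imaging for inertial confinement fusion (ICF) experiments. Conventional filtering and thresholding methods struggle to separate and remove this overlapping noise, which harms image fidelity and downstream analysis. The key to the solution is an unsupervised autoencoder that applies a Cohen-Daubechies-Feauveau (CDF 97) wavelet transform in the latent space to separate and suppress the mixed noise. Validation on synthetic data shows lower reconstruction error and better edge preservation than non-machine-learning methods such as BM3D, offering a promising tool for denoising neutron images and for three-dimensional reconstruction analysis of ICF experiments.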

Link: https://arxiv.org/abs/2511.16717
Authors: Asya Y. Akkus, Bradley T. Wolfe, Pinghan Chu, Chengkun Huang, Chris S. Campbell, Mariana Alvarado Alvarez, Petr Volegov, David Fittinghoff, Robert Reinovsky, Zhehui Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Neutron imaging is important in optimizing analysis of inertial confinement fusion (ICF) events such as those at the National Ignition Facility (NIF) and improving current and future ICF platforms. However, images of neutron sources are often degraded by various types of noise. Most commonly, Gaussian and Poisson noise often coexist within one image, obscuring fine details and blurring edges. These noise types often overlap, making them difficult to distinguish and remove using conventional filtering and thresholding methods. As a result, noise removal techniques that preserve image fidelity are important for analyzing and interpreting images of a neutron source. Current solutions include a combination of filtering and thresholding methodologies. In the past, machine learning approaches were rarely implemented due to a lack of ground truth neutron imaging data for ICF processes. However, recent advances in synthetic data production, particularly in the fusion imaging field, have opened opportunities to investigate new denoising procedures using both supervised and unsupervised machine learning methods. In this study, we implement an unsupervised autoencoder with a Cohen-Daubechies-Feauveau (CDF 97) wavelet transform in the latent space for mixed Gaussian-Poisson denoising. The network successfully denoises neutron imaging data. Additionally, it demonstrates lower reconstruction error and superior edge preservation metrics when benchmarked with data generated by a forward model and compared to non-ML-based filtering mechanisms such as Block-matching and 3D filtering (BM3D). This approach presents a promising advancement in neutron image noise reduction and three-dimensional reconstruction analysis of ICF experiments.
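
To make the mixed Gaussian-Poisson degradation concrete, the sketch below simulates it on a flat list of pixel intensities. It is a generic noise model rather than the paper's forward model: the `gain` and `sigma` parameters, the function names, and the use of Knuth's multiplication method for Poisson sampling are assumptions chosen so the example needs only the standard library.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method.

    Suitable for modest lam only, since exp(-lam) underflows for large values.
    """
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def add_mixed_noise(pixels, gain, sigma, rng):
    """Apply signal-dependent Poisson noise, then additive Gaussian noise.

    pixels: non-negative clean intensities
    gain:   expected counts per intensity unit (hypothetical detector gain)
    sigma:  standard deviation of the additive Gaussian component
    """
    noisy = []
    for value in pixels:
        counts = sample_poisson(gain * value, rng)
        noisy.append(counts / gain + rng.gauss(0.0, sigma))
    return noisy


rng = random.Random(0)
noisy_pixels = add_mixed_noise([10.0] * 5000, gain=1.0, sigma=2.0, rng=rng)
```

The Poisson term scales with the signal while the Gaussian term does not, which is one reason a single fixed filter or threshold has trouble removing both at once.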

[CV-117] PairHuman: A High-Fidelity Photographic Dataset for Customized Dual-Person Generation

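[Quick Read]: This paper addresses the lack of a high-quality benchmark dataset for dual-person portrait generation, with the goal of advancing personalized dual-person portrait customization. The key to the solution is the PairHuman dataset and the DHumanDiff baseline. PairHuman is the first large-scale benchmark dataset for dual-person portraits that meet high photographic standards, containing more than 100K images with rich metadata (image descriptions, person localization, human keypoints, and attribute tags). DHumanDiff is a baseline designed for dual-person portrait generation that enhances facial consistency and balances personalized person generation with semantic-driven scene creation, improving the visual quality of the generated portraits and their fit to human preferences.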

Link: https://arxiv.org/abs/2511.16712
Authors: Ting Pan, Ye Wang, Peiguang Jing, Rui Ma, Zili Yi, Yu Liu
Affiliations: Nanjing University; Tianjin University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 46 pages, 31 figures

Abstract:Personalized dual-person portrait customization has considerable potential applications, such as preserving emotional memories and facilitating wedding photography planning. However, the absence of a benchmark dataset hinders the pursuit of high-quality customization in dual-person portrait generation. In this paper, we propose the PairHuman dataset, which is the first large-scale benchmark dataset specifically designed for generating dual-person portraits that meet high photographic standards. The PairHuman dataset contains more than 100K images that capture a variety of scenes, attire, and dual-person interactions, along with rich metadata, including detailed image descriptions, person localization, human keypoints, and attribute tags. We also introduce DHumanDiff, a baseline specifically crafted for dual-person portrait generation that features enhanced facial consistency and balances personalized person generation with semantic-driven scene creation. Finally, the experimental results demonstrate that our dataset and method produce highly customized portraits with superior visual quality that are tailored to human preferences. Our dataset is publicly available at this https URL.

[CV-118] Motion Transfer-Enhanced StyleGAN for Generating Diverse Macaque Facial Expressions

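[Quick Read]: This paper addresses the difficulty of generating primate facial expression images when training data are limited in both quantity and variation, particularly for facial expressions across individuals. The solution combines three techniques. First, motion transfer animates still images to synthesize new facial expression images and augment the dataset. Second, samples are selected based on the latent representation of faces from an initially trained StyleGAN2 model, to ensure variation and uniform sampling in the training data. Third, the loss function is refined to accurately reproduce subtle movements such as eye movements. The method generates more diverse facial expressions for multiple macaque individuals than models trained only on the original still images, supports style-based image editing in which specific style parameters correspond to distinct facial movements, and offers a tool for disentangling the motion components of macaque facial expressions.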

Link: https://arxiv.org/abs/2511.16711
Authors: Takuya Igaue, Catia Correia-Caeiro, Akito Yoshida, Takako Miyabe-Nishiwaki, Ryusuke Hayashi
Affiliations: Human Informatics Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), 1-1-1 Umezono, Tsukuba, Ibaraki, 305-8568, Japan; Graduate School of Engineering, The University of Tokyo, Japan; Human Biology & Primate Cognition, Institute of Biology, Leipzig University, Germany; Araya Inc., Japan; Center for the Evolutionary Origins of Human Behavior (EHuB), Kyoto University, Japan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Generating animal faces using generative AI techniques is challenging because the available training images are limited both in quantity and variation, particularly for facial expressions across individuals. In this study, we focus on macaque monkeys, widely studied in systems neuroscience and evolutionary research, and propose a method to generate their facial expressions using a style-based generative image model (i.e., StyleGAN2). To address data limitations, we implemented: 1) data augmentation by synthesizing new facial expression images using a motion transfer to animate still images with computer graphics, 2) sample selection based on the latent representation of macaque faces from an initially trained StyleGAN2 model to ensure the variation and uniform sampling in training dataset, and 3) loss function refinement to ensure the accurate reproduction of subtle movements, such as eye movements. Our results demonstrate that the proposed method enables the generation of diverse facial expressions for multiple macaque individuals, outperforming models trained solely on original still images. Additionally, we show that our model is effective for style-based image editing, where specific style parameters correspond to distinct facial movements. These findings underscore the model’s potential for disentangling motion components as style parameters, providing a valuable tool for research on macaque facial expressions.

[CV-119] The persistence of painting styles

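[Quick Read]: This paper addresses the subjectivity of artistic style identification and its reliance on expert experience, aiming to quantify and distinguish artistic styles objectively with mathematical tools. The key to the solution is persistent homology (PH), a method from topological data analysis: PH is used to extract structural features of images and support interpretable statistical discrimination, distinguishing between artists both across and within artistic currents, and distinguishing an artist's images from AI-generated images in that artist's style.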

Link: https://arxiv.org/abs/2511.16695
Authors: Reetikaa Reddy Munnangi, Barbara Giunti
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures, and 8 tables. Short YouTube video with highlights of the paper available at this https URL on the AATRN YouTube channel

Abstract:Art is a deeply personal and expressive medium, where each artist brings their own style, technique, and cultural background into their work. Traditionally, identifying artistic styles has been the job of art historians or critics, relying on visual intuition and experience. However, with the advancement of mathematical tools, we can explore art through a more structured lens. In this work, we show how persistent homology (PH), a method from topological data analysis, provides objective and interpretable insights on artistic styles. We show how PH can, with statistical certainty, differentiate between artists, both from different artistic currents and from the same one, and distinguish images of an artist from an AI-generated image in the artist’s style.

[CV-120] Learning Latent Transmission and Glare Maps for Lens Veiling Glare Removal

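[Quick Read]: This paper addresses veiling glare in compact optical systems (including single-lens and metalens designs), which is caused by stray-light scattering in complex real-world environments, compounds conventional aberrations to degrade image quality, and is difficult for existing methods to model and remove. The key to the solution is VeilGen, an unsupervised generative model that estimates the latent optical transmission map and glare map of an image, regularized by Stable Diffusion (SD)-based priors, in order to simulate veiling glare realistically and generate paired datasets with compound degradation from optical aberrations and veiling glare. A restoration network, DeVeiler, is then trained with a reversibility constraint and uses the predicted latent maps to guide an inverse process of the learned scattering model for high-fidelity glare removal.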

Link: https://arxiv.org/abs/2511.17353
Authors: Xiaolong Qian, Qi Jiang, Lei Sun, Zongxi Yu, Kailun Yang, Peixuan Wu, Jiacheng Zhou, Yao Gao, Yaoguang Ma, Ming-Hsuan Yang, Kaiwei Wang
Affiliations: Zhejiang University; INSAIT, Sofia University “St. Kliment Ohridski”; Hunan University; University of California, Merced; Google DeepMind
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
Comments: All code and datasets will be publicly released at this https URL

Abstract:Beyond the commonly recognized optical aberrations, the imaging performance of compact optical systems, including single-lens and metalens designs, is often further degraded by veiling glare caused by stray-light scattering from non-ideal optical surfaces and coatings, particularly in complex real-world environments. This compound degradation undermines traditional lens aberration correction yet remains underexplored. A major challenge is that conventional scattering models (e.g., for dehazing) fail to fit veiling glare due to its spatially varying and depth-independent nature. Consequently, paired high-quality data are difficult to prepare via simulation, hindering application of data-driven veiling glare removal models. To this end, we propose VeilGen, a generative model that learns to simulate veiling glare by estimating its underlying optical transmission and glare maps in an unsupervised manner from target images, regularized by Stable Diffusion (SD)-based priors. VeilGen enables paired dataset generation with realistic compound degradation of optical aberrations and veiling glare, while also providing the estimated latent optical transmission and glare maps to guide the veiling glare removal process. We further introduce DeVeiler, a restoration network trained with a reversibility constraint, which utilizes the predicted latent maps to guide an inverse process of the learned scattering model. Extensive experiments on challenging compact optical systems demonstrate that our approach delivers superior restoration quality and physical fidelity compared with existing methods. These results suggest that VeilGen reliably synthesizes realistic veiling glare, and its learned latent maps effectively guide the restoration process in DeVeiler. All code and datasets will be publicly released at this https URL.

[CV-121] Exploring the added value of pretherapeutic MR descriptors in predicting breast cancer pathologic complete response to neoadjuvant chemotherapy

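[Quick Read]: This paper examines how combining pretreatment magnetic resonance imaging (MRI) features with clinical and biological indicators can improve prediction of pathological complete response (pCR) to neoadjuvant chemotherapy (NAC) in breast cancer (BC). The key finding is that two MRI features, non-spiculated margins and unifocality, are associated with pCR independently of conventional clinical factors. Adding the significant MRI features to random forest classifiers increased sensitivity, specificity, and precision for pCR prediction, providing quantifiable imaging evidence to support individualized treatment strategies.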

Link: https://arxiv.org/abs/2511.17158
Authors: Caroline Malhaire (LITO), Fatine Selhane, Marie-Judith Saint-Martin, Vincent Cockenpot, Pia Akl, Enora Laas, Audrey Bellesoeur, Catherine Ala Eddine, Melodie Bereby-Kahane, Julie Manceau, Delphine Sebbag-Sfez, Jean-Yves Pierga, Fabien Reyal, Anne Vincent-Salomon, Herve Brisse, Frederique Frouin
Affiliations: Unknown
Subjects: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Objectives: To evaluate the association between pretreatment MRI descriptors and breast cancer (BC) pathological complete response (pCR) to neoadjuvant chemotherapy (NAC). Materials & Methods: Patients with BC treated by NAC with a breast MRI between 2016 and 2020 were included in this retrospective observational single-center study. MR studies were described using the standardized BI-RADS and breast edema score on T2-weighted MRI. Univariable and multivariable logistic regression analyses were performed to assess the association of variables with pCR according to residual cancer burden. Random forest classifiers were trained to predict pCR on a random split including 70% of the database and were validated on the remaining cases. Results: Among 129 BC, 59 (46%) achieved pCR after NAC (luminal: n=7/37, 19%; triple negative (TN): n=30/55, 55%; HER2+: n=22/37, 59%). Clinical and biological items associated with pCR were BC subtype (p<0.001), T stage 0/I/II (p=0.008), higher Ki67 (p=0.005) and higher tumor-infiltrating lymphocytes levels (p=0.016). Univariate analysis showed that the following MRI features were significantly associated with pCR: oval or round shape (p=0.047), unifocality (p=0.026), non-spiculated margins (p=0.018), no associated non-mass enhancement (NME) (p=0.024) and a lower MRI size (p=0.031). Unifocality and non-spiculated margins remained independently associated with pCR at multivariable analysis. Adding significant MRI features to clinicobiological variables in random forest classifiers significantly increased sensitivity (0.67 versus 0.62), specificity (0.69 versus 0.67) and precision (0.71 versus 0.67) for pCR prediction. Conclusion: Non-spiculated margins and unifocality are independently associated with pCR and can increase model performance in predicting BC response to NAC.
Clinical Relevance Statement: A multimodal approach integrating pretreatment MRI features with clinicobiological predictors, including TILs, could be employed to develop machine learning models for identifying patients at risk of non-response. This may enable consideration of alternative therapeutic strategies to optimize treatment outcomes.
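
For readers less familiar with the three reported metrics, the sketch below shows how sensitivity, specificity, and precision follow from a binary confusion matrix. The counts in the usage line are hypothetical and are not taken from the study; the function name `classification_metrics` is likewise an illustrative choice.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity, and precision from confusion counts.

    tp / fn: responders (pCR) predicted correctly / missed
    tn / fp: non-responders predicted correctly / flagged as responders
    """
    return {
        "sensitivity": tp / (tp + fn),  # share of true pCR cases detected
        "specificity": tn / (tn + fp),  # share of non-pCR cases rejected
        "precision": tp / (tp + fp),    # share of predicted pCR that is real
    }


# Hypothetical validation-set counts, for illustration only.
metrics = classification_metrics(tp=8, fp=3, tn=9, fn=4)
```

All three values depend on the decision threshold of the classifier, so comparing two models is only meaningful when their thresholds are chosen in the same way.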

[CV-122] OmniLens++: Blind Lens Aberration Correction via Large LensLib Pre-Training and Latent PSF Representation

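[Quick Read]: This paper addresses the limited generalization of existing deep-learning-based blind lens aberration correction methods, which stems from two challenges: the difficulty of scaling data, which limits diversity, and the absence of prior guidance characterizing the optical degradation process. The key to the solution is the OmniLens++ framework, with two contributions. On the data side, it expands lens design specifications to increase degradation diversity and samples a more uniform distribution by quantifying the spatial-variation patterns and severity of optical degradation, improving data scalability. On the model side, it introduces the Latent PSF Representation (LPR) as prior guidance: a VQVAE learns latent features of the point spread functions (PSFs) in LensLib, assisted by modeling of the optical degradation process to constrain the learning of degradation priors, which strengthens generalization in blind correction.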

Link: https://arxiv.org/abs/2511.17126
Authors: Qi Jiang, Xiaolong Qian, Yao Gao, Lei Sun, Kailun Yang, Zhonghua Yi, Wenyong Li, Ming-Hsuan Yang, Luc Van Gool, Kaiwei Wang
Affiliations: Zhejiang University; INSAIT, Sofia University “St. Kliment Ohridski”; Hunan University; University of California, Merced; Google DeepMind
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Optics (physics.optics)
Comments: The source code and datasets will be made publicly available at this https URL

Abstract:The emerging deep-learning-based lens library pre-training (LensLib-PT) pipeline offers a new avenue for blind lens aberration correction by training a universal neural network, demonstrating strong capability in handling diverse unknown optical degradations. This work proposes the OmniLens++ framework, which resolves two challenges that hinder the generalization ability of existing pipelines: the difficulty of scaling data and the absence of prior guidance characterizing optical degradation. To improve data scalability, we expand the design specifications to increase the degradation diversity of the lens source, and we sample a more uniform distribution by quantifying the spatial-variation patterns and severity of optical degradation. In terms of model design, to leverage the Point Spread Functions (PSFs), which intuitively describe optical degradation, as guidance in a blind paradigm, we propose the Latent PSF Representation (LPR). The VQVAE framework is introduced to learn latent features of LensLib’s PSFs, which is assisted by modeling the optical degradation process to constrain the learning of degradation priors. Experiments on diverse aberrations of real-world lenses and synthetic LensLib show that OmniLens++ exhibits state-of-the-art generalization capacity in blind aberration correction. Beyond performance, the AODLibpro is verified as a scalable foundation for more effective training across diverse aberrations, and LPR can further tap the potential of large-scale LensLib. The source code and datasets will be made publicly available at this https URL.
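
The vector-quantization step at the core of any VQVAE can be sketched as a nearest-codebook lookup. This illustrates the general VQVAE mechanism only, not the OmniLens++ model: the 2-dimensional codebook, the squared Euclidean distance, and the name `quantize` are assumptions for illustration.

```python
def quantize(latent, codebook):
    """Map a latent vector to its nearest codebook entry.

    Returns (index, entry), using squared Euclidean distance. In a VQVAE
    the encoder output is replaced by the selected entry before decoding.
    """
    def squared_distance(entry):
        return sum((a - b) ** 2 for a, b in zip(latent, entry))

    index = min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))
    return index, codebook[index]


# Toy codebook with three entries, assumed for illustration.
toy_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
code_index, code_vector = quantize([0.9, 0.1], toy_codebook)
```

Because every latent is snapped to one of a finite set of entries, the learned codebook acts as a compact dictionary of patterns, which is presumably what makes it usable as a prior over PSFs.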

[CV-123] MedImageInsight for Thoracic Cavity Health Classification from Chest X-rays

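[Quick Read]: This paper addresses the challenge of timely chest X-ray interpretation as imaging volumes and radiologist workloads increase. The key to the solution is the medical imaging foundation model MedImageInsight, applied to automated binary classification in two ways: fine-tuning the model for end-to-end classification, and using it as a feature extractor in a transfer learning pipeline with traditional machine learning classifiers. Experiments show that the fine-tuned classifier reaches an ROC-AUC of 0.888 with better calibration, comparable to established architectures such as CheXNet, indicating that foundation models can reduce task-specific training requirements while maintaining diagnostic reliability and can support clinical triage.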

Link: https://arxiv.org/abs/2511.17043
Authors: Rama Krishna Boya, Mohan Kireeti Magalanadu, Azaruddin Palavalli, Rupa Ganesh Tekuri, Amrit Pattanayak, Prasanthi Enuga, Vignesh Esakki Muthu, Vivek Aditya Boya
Affiliations: DeepInfinity Ltd
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures and 3 tables

Abstract:Chest radiography remains one of the most widely used imaging modalities for thoracic diagnosis, yet increasing imaging volumes and radiologist workload continue to challenge timely interpretation. In this work, we investigate the use of MedImageInsight, a medical imaging foundational model, for automated binary classification of chest X-rays into Normal and Abnormal categories. Two approaches were evaluated: (1) fine-tuning MedImageInsight for end-to-end classification, and (2) employing the model as a feature extractor for a transfer learning pipeline using traditional machine learning classifiers. Experiments were conducted using a combination of the ChestX-ray14 dataset and real-world clinical data sourced from partner hospitals. The fine-tuned classifier achieved the highest performance, with an ROC-AUC of 0.888 and superior calibration compared to the transfer learning models, demonstrating performance comparable to established architectures such as CheXNet. These results highlight the effectiveness of foundational medical imaging models in reducing task-specific training requirements while maintaining diagnostic reliability. The system is designed for integration into web-based and hospital PACS workflows to support triage and reduce radiologist burden. Future work will extend the model to multi-label pathology classification to provide preliminary diagnostic interpretation in clinical environments.

[CV-124] MRI Super-Resolution with Deep Learning: A Comprehensive Survey

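[Quick Read]: This paper addresses the fact that high-resolution (HR) magnetic resonance imaging (MRI) remains hard to obtain in clinical and research settings because of cost, technical trade-offs, and experimental limitations. The approach it surveys is super-resolution (SR), especially deep learning (DL) methods, which computationally reconstruct HR images from more affordable low-resolution (LR) scans, potentially improving diagnostic accuracy and efficiency without additional hardware. The survey draws on computer vision, computational imaging, inverse problems, and MR physics to build a systematic taxonomy, and analyzes DL architectures, learning strategies, benchmark datasets, and performance metrics in light of MRI-specific challenges.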

Link: https://arxiv.org/abs/2511.16854
Authors: Mohammad Khateri, Serge Vasylechko, Morteza Ghahremani, Liam Timms, Deniz Kocanaogullari, Simon K. Warfield, Camilo Jaimes, Davood Karimi, Alejandra Sierra, Jussi Tohka, Sila Kurugol, Onur Afacan
Affiliations: University of Eastern Finland; Harvard Medical School; Boston Children’s Hospital; Technical University of Munich; Massachusetts General Hospital
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: 41 pages

Abstract:High-resolution (HR) magnetic resonance imaging (MRI) is crucial for many clinical and research applications. However, achieving it remains costly and constrained by technical trade-offs and experimental limitations. Super-resolution (SR) presents a promising computational approach to overcome these challenges by generating HR images from more affordable low-resolution (LR) scans, potentially improving diagnostic accuracy and efficiency without requiring additional hardware. This survey reviews recent advances in MRI SR techniques, with a focus on deep learning (DL) approaches. It examines DL-based MRI SR methods from the perspectives of computer vision, computational imaging, inverse problems, and MR physics, covering theoretical foundations, architectural designs, learning strategies, benchmark datasets, and performance metrics. We propose a systematic taxonomy to categorize these methods and present an in-depth study of both established and emerging SR techniques applicable to MRI, considering unique challenges in clinical and research contexts. We also highlight open challenges and directions that the community needs to address. Additionally, we provide a collection of essential open-access resources, tools, and tutorials, available on our GitHub: this https URL. IEEE keywords: MRI, Super-Resolution, Deep Learning, Computational Imaging, Inverse Problem, Survey.

Artificial Intelligence

[AI-0] Enhancing Quranic Learning: A Multimodal Deep Learning Approach for Arabic Phoneme Recognition

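[Quick Read]: This paper addresses speech recognition and pronunciation assessment in Arabic, particularly in Quranic recitation, where subtle phonetic differences can alter meaning and conventional methods struggle to detect mispronunciation accurately. The key to the solution is a transformer-based multimodal framework that fuses acoustic embeddings from UniSpeech with textual embeddings that BERT extracts from Whisper transcriptions, building a unified representation that captures both phonetic detail and linguistic context. Experiments indicate that this multimodal fusion strategy gives strong results for phoneme-level mispronunciation detection, offering a practical path toward intelligent, speaker-independent Computer-Aided Language Learning (CALL) systems.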

Link: https://arxiv.org/abs/2511.17477
Authors: Ayhan Kucukmanisa, Derya Gelmez, Sukru Selim Calik, Zeynep Hilal Kilimci
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: 11 pages, 2 figures, 3 tables

Abstract:Recent advances in multimodal deep learning have greatly enhanced the capability of systems for speech analysis and pronunciation assessment. Accurate pronunciation detection remains a key challenge in Arabic, particularly in the context of Quranic recitation, where subtle phonetic differences can alter meaning. Addressing this challenge, the present study proposes a transformer-based multimodal framework for Arabic phoneme mispronunciation detection that combines acoustic and textual representations to achieve higher precision and robustness. The framework integrates UniSpeech-derived acoustic embeddings with BERT-based textual embeddings extracted from Whisper transcriptions, creating a unified representation that captures both phonetic detail and linguistic context. To determine the most effective integration strategy, early, intermediate, and late fusion methods were implemented and evaluated on two datasets containing 29 Arabic phonemes, including eight hafiz sounds, articulated by 11 native speakers. Additional speech samples collected from publicly available YouTube recordings were incorporated to enhance data diversity and generalization. Model performance was assessed using standard evaluation metrics: accuracy, precision, recall, and F1-score, allowing a detailed comparison of the fusion strategies. Experimental findings show that the UniSpeech-BERT multimodal configuration provides strong results and that fusion-based transformer architectures are effective for phoneme-level mispronunciation detection. The study contributes to the development of intelligent, speaker-independent, and multimodal Computer-Aided Language Learning (CALL) systems, offering a practical step toward technology-supported Quranic pronunciation training and broader speech-based educational applications.
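
The difference between early and late fusion of two modalities can be shown in a few lines. This is a generic sketch rather than the paper's architecture: the toy vectors stand in for acoustic and textual embeddings, the equal-weight average in `late_fusion` is an assumption, and intermediate fusion (merging hidden features inside the network) is omitted.

```python
def early_fusion(acoustic_embedding, text_embedding):
    """Concatenate the two embeddings so one classifier sees both at once."""
    return list(acoustic_embedding) + list(text_embedding)


def late_fusion(acoustic_probs, text_probs):
    """Average class probabilities from two separately trained classifiers."""
    return [(a + t) / 2.0 for a, t in zip(acoustic_probs, text_probs)]


# Toy values standing in for real model outputs.
joint_features = early_fusion([1.0, 2.0], [3.0])
fused_probs = late_fusion([0.8, 0.2], [0.6, 0.4])
```

Early fusion lets the classifier learn interactions between modalities, while late fusion keeps the two models independent and only combines their decisions.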

[AI-1] PersonaAgent with GraphRAG: Community-Aware Knowledge Graphs for Personalized LLM

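[Quick read]: This paper addresses how personalized AI agents can exploit rich contextual information to improve task performance while keeping their behavior consistent with the user. The core challenge is combining a user's individual preferences (persona) with global knowledge to produce accurate, coherent responses. The key to the solution is a knowledge-graph-enhanced retrieval-augmented generation (Graph RAG) mechanism: it organizes relevant documents in a graph index built by a large language model (LLM) and uses graph community detection to identify global interaction patterns related to the user's historical behavior. It then dynamically generates personalized prompts that fuse a summary of the user profile with community-level knowledge, so the agent stays persona-consistent while benefiting from collective knowledge, which significantly improves performance on news categorization, movie tagging, and product rating.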

Link: https://arxiv.org/abs/2511.17467
Authors: Siqi Liang,Yudi Zhang,Yue Guo
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We propose a novel framework for a persona-based language model system, motivated by the need for personalized AI agents that adapt to individual user preferences. In our approach, the agent embodies the user’s “persona” (e.g., user profile or taste) and is powered by a large language model (LLM). To enable the agent to leverage rich contextual information, we introduce a Knowledge-Graph-enhanced Retrieval-Augmented Generation (Graph RAG) mechanism that constructs an LLM-derived graph index of relevant documents and summarizes communities of related information. Our framework generates personalized prompts by combining: (1) a summary of the user’s historical behaviors and preferences extracted from the knowledge graph, and (2) relevant global interaction patterns identified through graph-based community detection. This dynamic prompt engineering approach allows the agent to maintain consistent persona-aligned behaviors while benefiting from collective knowledge. On the LaMP benchmark, our method improves news categorization F1 by 11.1%, movie tagging F1 by 56.1%, and reduces product rating MAE by 10.4% over prior methods. Our code is available at this https URL

[AI-2] SRA-CP: Spontaneous Risk-Aware Selective Cooperative Perception

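[Quick read]: This paper targets two problems in current cooperative perception (CP) methods: generic CP schemes transmit large amounts of perception data that are irrelevant to driving safety, overloading communication bandwidth; and existing CP frameworks rely on predefined communication partners, so they adapt poorly to dynamic traffic. The key to the solution is the Spontaneous Risk-Aware Selective Cooperative Perception (SRA-CP) framework. Its core is a decentralized protocol in which agents broadcast lightweight perception coverage summaries and trigger targeted cooperation only when risk-relevant blind zones are detected. A local perceptual risk identification module assesses how occlusions affect the driving task and decides whether cooperation is needed. Once cooperation is triggered, the vehicle selects suitable peers based on shared perception coverage and, through a fusion module, prioritizes the exchange of safety-critical information while adapting to bandwidth constraints, balancing high perception accuracy against low communication overhead.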

Link: https://arxiv.org/abs/2511.17461
Authors: Jiaxi Liu,Chengyuan Ma,Hang Zhou,Weizhe Tang,Shixiao Liang,Haoyang Ding,Xiaopeng Li,Bin Ran
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Cooperative perception (CP) offers significant potential to overcome the limitations of single-vehicle sensing by enabling information sharing among connected vehicles (CVs). However, existing generic CP approaches need to transmit large volumes of perception data that are irrelevant to the driving safety, exceeding available communication bandwidth. Moreover, most CP frameworks rely on pre-defined communication partners, making them unsuitable for dynamic traffic environments. This paper proposes a Spontaneous Risk-Aware Selective Cooperative Perception (SRA-CP) framework to address these challenges. SRA-CP introduces a decentralized protocol where connected agents continuously broadcast lightweight perception coverage summaries and initiate targeted cooperation only when risk-relevant blind zones are detected. A perceptual risk identification module enables each CV to locally assess the impact of occlusions on its driving task and determine whether cooperation is necessary. When CP is triggered, the ego vehicle selects appropriate peers based on shared perception coverage and engages in selective information exchange through a fusion module that prioritizes safety-critical content and adapts to bandwidth constraints. We evaluate SRA-CP on a public dataset against several representative baselines. Results show that SRA-CP achieves less than 1% average precision (AP) loss for safety-critical objects compared to generic CP, while using only 20% of the communication bandwidth. Moreover, it improves the perception performance by 15% over existing selective CP methods that do not incorporate risk awareness.

[AI-3] GRAPHIC–Guidelines for Reviewing Algorithmic Practices in Human-centred Design and Interaction for Creativity

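[Quick read]: This paper addresses the lack of systematic evaluation of how generative AI collaborates with humans in Graphic Design, that is, how to analyze rigorously and systematically the collaboration mechanisms of AI-assisted design systems and how well they support design practice. The key to the solution is the GRAPHIC framework, which has three core dimensions: Collaborative Panorama, Processes and Modalities, and Graphic Design Principles. The framework is used to analyze how human-AI collaboration is realized in Graphic Design and to expose key gaps in current research, such as insufficient balance of initiative and control between agents, the lack of explainable interaction models, and limited support for transformational creativity grounded in design principles.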

Link: https://arxiv.org/abs/2511.17443
Authors: Joana Rovira Martins,Pedro Martins,Ana Boavida
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: 20 pages, 16 figures

Abstract:Artificial Intelligence (AI) has been increasingly applied to creative domains, leading to the development of systems that collaborate with humans in design processes. In Graphic Design, integrating computational systems into co-creative workflows presents specific challenges, as it requires balancing scientific rigour with the subjective and visual nature of design practice. Following the PRISMA methodology, we identified 872 articles, resulting in a final corpus of 71 publications describing 68 unique systems. Based on this review, we introduce GRAPHIC (Guidelines for Reviewing Algorithmic Practices in Human-centred Design and Interaction for Creativity), a framework for analysing AI-based systems applied to Graphic Design. Its goal is to understand how current systems support human-AI collaboration in the Graphic Design discipline. The framework comprises three main dimensions, which our analysis revealed to be essential across diverse system types: (1) Collaborative Panorama, (2) Processes and Modalities, and (3) Graphic Design Principles. Its application revealed research gaps, including the need to balance initiative and control between agents, improve communication through explainable interaction models, and promote systems that support transformational creativity grounded in core design principles.

[AI-4] InTAct: Interval-based Task Activation Consolidation for Continual Learning

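[Quick read]: This paper addresses representation drift caused by domain shift in continual learning: when the input distribution changes but the label space stays fixed, feature representations in shared layers gradually overwrite previously useful information and cause forgetting. Prompt-based methods perform strongly in class-incremental settings but still degrade in domain-incremental ones. The key to the solution is InTAct, which captures the activation ranges characteristic of previous tasks and constrains network updates so that functional behavior stays consistent within those regions, rather than freezing parameters or storing past data. The method balances stability and plasticity by stabilizing the functional role of important neurons, is architecture-agnostic, and integrates seamlessly into existing prompt-based continual learning frameworks.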

Link: https://arxiv.org/abs/2511.17439
Authors: Patryk Krukowski,Jan Miksa,Piotr Helm,Jacek Tabor,Paweł Wawrzyński,Przemysław Spurek
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Continual learning aims to enable neural networks to acquire new knowledge without forgetting previously learned information. While recent prompt-based methods perform strongly in class-incremental settings, they remain vulnerable under domain shifts, where the input distribution changes but the label space remains fixed. This exposes a persistent problem known as representation drift. Shared representations evolve in ways that overwrite previously useful features and cause forgetting even when prompts isolate task-specific parameters. To address this issue, we introduce InTAct, a method that preserves functional behavior in shared layers without freezing parameters or storing past data. InTAct captures the characteristic activation ranges associated with previously learned tasks and constrains updates to ensure the network remains consistent within these regions, while still allowing for flexible adaptation elsewhere. In doing so, InTAct stabilizes the functional role of important neurons rather than directly restricting parameter values. The approach is architecture-agnostic and integrates seamlessly into existing prompt-based continual learning frameworks. By regulating representation changes where past knowledge is encoded, InTAct achieves a principled balance between stability and plasticity. Across diverse domain-incremental benchmarks, including DomainNet and ImageNet-R, InTAct consistently reduces representation drift and improves performance, increasing Average Accuracy by up to 8 percentage points over state-of-the-art baselines.

[AI-5] DS-Span: Single-Phase Discriminative Subgraph Mining for Efficient Graph Embeddings

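[Quick read]: This paper addresses the redundant multi-phase pipelines, high computational cost, and weak coupling between mined structures and their discriminative relevance that affect existing subgraph mining methods for graph representation learning. The key to the solution is DS-Span, a single-phase discriminative subgraph mining framework that unifies pattern growth, pruning, and supervision-driven scoring in one traversal of the search space. It introduces a coverage-capped eligibility mechanism that dynamically limits exploration, and an information-gain-guided subgraph selection strategy that improves class separation and reduces redundancy, yielding compact, interpretable subgraph features for downstream graph embedding and classification.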

Link: https://arxiv.org/abs/2511.17419
Authors: Yeamin Kaiser,Muhammed Tasnim Bin Anwar,Bholanath Das,Chowdhury Farhan Ahmed,Md. Tanvir Alam
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Graph representation learning seeks to transform complex, high-dimensional graph structures into compact vector spaces that preserve both topology and semantics. Among the various strategies, subgraph-based methods provide an interpretable bridge between symbolic pattern discovery and continuous embedding learning. Yet, existing frequent or discriminative subgraph mining approaches often suffer from redundant multi-phase pipelines, high computational cost, and weak coupling between mined structures and their discriminative relevance. We propose DS-Span, a single-phase discriminative subgraph mining framework that unifies pattern growth, pruning, and supervision-driven scoring within one traversal of the search space. DS-Span introduces a coverage-capped eligibility mechanism that dynamically limits exploration once a graph is sufficiently represented, and an information-gain-guided selection that promotes subgraphs with strong class-separating ability while minimizing redundancy. The resulting subgraph set serves as an efficient, interpretable basis for downstream graph embedding and classification. Extensive experiments across benchmarks demonstrate that DS-Span generates more compact and discriminative subgraph features than prior multi-stage methods, achieving higher or comparable accuracy with significantly reduced runtime. These results highlight the potential of unified, single-phase discriminative mining as a foundation for scalable and interpretable graph representation learning.
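
The abstract mentions information-gain-guided selection but gives no formula. As a rough illustration only, the sketch below computes the textbook information gain of a binary "graph contains this subgraph" feature with respect to class labels. The function names and the toy data are hypothetical and are not taken from DS-Span.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(contains_subgraph, labels):
    """Entropy reduction obtained by splitting graphs on subgraph presence."""
    present = [y for has, y in zip(contains_subgraph, labels) if has]
    absent = [y for has, y in zip(contains_subgraph, labels) if not has]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


# Toy data (hypothetical): four graphs, two classes.
labels = ["A", "A", "B", "B"]
perfect = information_gain([True, True, False, False], labels)  # separates classes
useless = information_gain([True, False, True, False], labels)  # tells us nothing
```

A subgraph that appears only in one class gets the maximum gain (1 bit for two balanced classes), while one that appears equally often in both classes gets zero, which is the intuition behind preferring class-separating patterns.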

[AI-6] That's not natural: The Impact of Off-Policy Training Data on Probe Performance

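[Quick read]: This paper addresses the loss of generalization that arises when probes for monitoring large language models (LLMs) are trained on synthetic or off-policy data because natural examples are scarce. The key to the solution is a systematic evaluation of how different response generation strategies affect probe generalization across behaviors and models. It finds that successful generalization from off-policy data to test sets where the model is incentivized to produce the target behavior predicts successful on-policy generalization. It also finds that same-domain off-policy data yields more reliable probes than on-policy data from a different domain, underlining the importance of handling distribution shift in LLM monitoring.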

Link: https://arxiv.org/abs/2511.17408
Authors: Nathalie Kirch,Samuel Dower,Adrians Skapars,Ekdeep Singh Lubana,Dmitrii Krasheninnikov
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, EurIPS 2025 Workshop on Private AI Governance

Abstract:Probing has emerged as a promising method for monitoring Large Language Models (LLMs), enabling inference-time detection of concerning behaviours such as deception and sycophancy. However, natural examples of many behaviours are rare, forcing researchers to rely on synthetic or off-policy LLM responses for training probes. We systematically evaluate how the use of synthetic and off-policy data influences probe generalisation across eight distinct LLM behaviours. Testing linear and attention probes across multiple LLMs, we find that the response generation strategy can significantly affect probe performance, though the magnitude of this effect varies by behaviour. We find that successful generalisation from off-policy data, to test sets where the model is incentivised to produce the target behaviour, is predictive of successful on-policy generalisation. Leveraging this result, we predict that Deception and Sandbagging probes may fail to generalise from off-policy to on-policy data when used in real monitoring scenarios. Notably, shifts in the training data domain still cause even larger performance degradation, with different-domain test scores being consistently lower than the same-domain ones. These results indicate that, in the absence of on-policy data, using same-domain off-policy data yields more reliable probes than using on-policy data from a different domain, emphasizing the need for methods that can better handle distribution shifts in LLM monitoring.

[AI-7] Is Phase Really Needed for Weakly-Supervised Dereverberation?

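[Quick read]: This paper addresses how to make effective use of time-frequency information in reverberant (wet) speech to improve weakly supervised speech dereverberation. The core question is whether the phase of the wet speech actually helps dereverberation when only wet speech is available during training and the clean (dry) speech is unknown. The key to the solution is a combination of theoretical analysis and empirical validation. Based on Statistical Wave Field Theory, the authors show that late reverberation mainly perturbs the phase in the time-frequency domain with white noise, so that outside low frequencies the phase carries almost no useful information. Experiments then confirm that, under a weak supervision framework, removing the wet-phase term from the loss function significantly improves model performance. This indicates that the wet phase is not essential and points to a more efficient and robust direction for weakly supervised dereverberation.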

Link: https://arxiv.org/abs/2511.17346
Authors: Marius Rodrigues(IDS, S2A),Louis Bahrman(IDS, S2A),Roland Badeau(IDS, S2A),Gaël Richard(S2A, IDS)
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Classical Physics (physics.class-ph); Machine Learning (stat.ML)
Comments:

Abstract:In unsupervised or weakly-supervised approaches for speech dereverberation, the target clean (dry) signals are considered to be unknown during training. In that context, evaluating to what extent information can be retrieved from the sole knowledge of reverberant (wet) speech becomes critical. This work investigates the role of the reverberant (wet) phase in the time-frequency domain. Based on Statistical Wave Field Theory, we show that late reverberation perturbs phase components with white, uniformly distributed noise, except at low frequencies. Consequently, the wet phase carries limited useful information and is not essential for weakly supervised dereverberation. To validate this finding, we train dereverberation models under a recent weak supervision framework and demonstrate that performance can be significantly improved by excluding the reverberant phase from the loss function.
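
To make "excluding the reverberant phase from the loss function" concrete, here is a minimal sketch contrasting a loss on complex time-frequency coefficients with a loss on magnitudes only. Both loss functions and the toy values are generic illustrations, assumed for this example; the abstract does not specify the loss the authors actually use.

```python
import cmath


def complex_loss(estimate, target):
    """Mean squared error on complex coefficients (sensitive to phase)."""
    return sum(abs(e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


def magnitude_loss(estimate, target):
    """Mean squared error on magnitudes only (phase is ignored)."""
    return sum((abs(e) - abs(t)) ** 2 for e, t in zip(estimate, target)) / len(target)


# Toy time-frequency bins (hypothetical values).
target = [1 + 0j, 0 + 2j, -1.5 + 0j]
# Same magnitudes as the target, but each bin's phase is rotated by 90 degrees.
rotated = [t * cmath.exp(1j * cmath.pi / 2) for t in target]

phase_sensitive = complex_loss(rotated, target)
phase_blind = magnitude_loss(rotated, target)
```

The magnitude-only loss is essentially zero for the phase-rotated signal while the complex loss is large, so a model trained with the former is not penalized for mismatching a phase that, per the abstract, is mostly noise.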

[AI-8] Agentifying Agentic AI

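[Quick read]: This paper addresses how to realize agentic AI systems with sustained autonomy, reasoning, and interaction capabilities. The core challenge is ensuring transparency, cooperation, and accountability while preserving flexibility. The key to the solution is combining the conceptual tools developed by the Autonomous Agents and Multi-Agent Systems (AAMAS) community, such as belief-desire-intention (BDI) architectures, communication protocols, mechanism design, and institutional modelling, with adaptive, data-driven approaches, to build agent systems that have both a formal theoretical foundation and practical autonomy.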

Link: https://arxiv.org/abs/2511.17332
Authors: Virginia Dignum,Frank Dignum
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 10 pages; 1 figure

Abstract:Agentic AI seeks to endow systems with sustained autonomy, reasoning, and interaction capabilities. To realize this vision, its assumptions about agency must be complemented by explicit models of cognition, cooperation, and governance. This paper argues that the conceptual tools developed within the Autonomous Agents and Multi-Agent Systems (AAMAS) community, such as BDI architectures, communication protocols, mechanism design, and institutional modelling, provide precisely such a foundation. By aligning adaptive, data-driven approaches with structured models of reasoning and coordination, we outline a path toward agentic systems that are not only capable and flexible, but also transparent, cooperative, and accountable. The result is a perspective on agency that bridges formal theory and practical autonomy.

[AI-9] AI Workers, Geopolitics, and Algorithmic Collective Action

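[Quick read]: This paper addresses the uneven development of artificial intelligence (AI) governance at the international and national levels. It examines the role of AI companies and their workers in the global political economy and argues that top-down governance alone cannot ensure responsible, ethical, and robust AI development. The key to the solution is to treat AI workers as geopolitical actors and, using Participatory Design (PD) methods, to place them at the center of change and foster Algorithmic Collective Action (ACA), encouraging more just and responsible AI development practices.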

Link: https://arxiv.org/abs/2511.17331
Authors: Sydney Reis
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:According to the theory of International Political Economy (IPE), states are often incentivized to rely on rather than constrain powerful corporations. For this reason, IPE provides a useful lens to explain why efforts to govern Artificial Intelligence (AI) at the international and national levels have thus far been developed, applied, and enforced unevenly. Building on recent work that explores how AI companies engage in geopolitics, this position paper argues that some AI workers can be considered actors of geopolitics. It makes the timely case that governance alone cannot ensure responsible, ethical, or robust AI development and use, and greater attention should be paid to bottom-up interventions at the site of AI development. AI workers themselves should be situated as individual agents of change, especially when considering their potential to foster Algorithmic Collective Action (ACA). Drawing on methods of Participatory Design (PD), this paper proposes engaging AI workers as sources of knowledge, relative power, and intentionality to encourage more responsible and just AI development and create the conditions that can facilitate ACA.

[AI-10] FORWARD: Dataset of a forwarder operating in rough terrain

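[Quick read]: This paper addresses the difficulty of operating large forestry machines, such as forwarders, efficiently, safely, and with low environmental impact in rough terrain, with a focus on developing algorithms for trafficability, perception, and autonomous control. The key to the solution is building and releasing FORWARD, a high-resolution multimodal dataset covering real operation of a forwarder on two harvest sites in central Sweden. It includes multi-sensor data from RTK-GNSS positioning, a 360° camera, vibration sensors, CAN-bus signals, and inertial measurement units (IMUs), together with laser-scanned terrain (about 1500 points per square meter), production logs (StanForD standard), video material, and experiment scenario specifications (such as different load weights, speeds, and driving with or without steel tracks). The dataset supports AI-based modelling, simulation, validation on physical testbeds, and automatic generation of scenario descriptions, advancing the development and optimization of autonomous forest machine systems.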

Link: https://arxiv.org/abs/2511.17318
Authors: Mikael Lundbäck,Erik Wallin,Carola Häggström,Mattias Nyström,Andreas Grönlund,Mats Richardson,Petrus Jönsson,William Arnvik,Lucas Hedström,Arvid Fälldin,Martin Servin
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
Comments: 25 pages, 22 figures

Abstract:We present FORWARD, a high-resolution multimodal dataset of a cut-to-length forwarder operating in rough terrain on two harvest sites in the middle part of Sweden. The forwarder is a large Komatsu model equipped with a variety of sensors, including RTK-GNSS, a 360-degree camera, operator vibration sensors, internal CAN-bus signal recording, and multiple IMUs. The data includes event time logs recorded at 5 Hz with, e.g., driving speed, fuel consumption, vehicle position with centimeter accuracy, and crane use while the vehicle operates in forest areas laser-scanned at very high resolution, approximately 1500 points per square meter. Production log files (StanForD standard) with time-stamped machine events, extensive video material, and terrain data in various formats are included as well. About 18 hours of regular wood extraction work during three days is annotated from 360-degree video material into individual work elements and included in the dataset. We also include scenario specifications of conducted experiments on forest roads and in terrain. Scenarios include repeatedly driving the same routes with and without steel tracks, different load weights, and different target driving speeds. The dataset is intended for developing models and algorithms for trafficability, perception, and autonomous control of forest machines using artificial intelligence, simulation, and experiments on physical testbeds. In part, we focus on forwarders traversing terrain, avoiding obstacles, and loading or unloading logs, with consideration for efficiency, fuel consumption, safety, and environmental impact. Other benefits of the open dataset include the ability to explore auto-generation and calibration of forestry machine simulators and automation scenario descriptions using the data recorded in the field.

[AI-11] DISCA: A Digital In-memory Stochastic Computing Architecture Using A Compressed Bent-Pyramid Format

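[Quick read]: This paper addresses the memory wall that conventional von Neumann architectures face when running large-scale AI models, along with the energy-efficiency bottleneck caused by the end of Moore's Law, particularly under the tight hardware budgets of edge computing. The key to the solution is DISCA, a new digital in-memory stochastic computing architecture that uses a compressed version of the quasi-stochastic Bent-Pyramid data format. It retains the computational simplicity of analog computing while preserving the scalability, productivity, and reliability of digital systems. In a commercial 180nm CMOS technology, DISCA reaches an energy efficiency of 3.59 TOPS/W per bit and, when scaled and compared with counterpart architectures, improves energy efficiency on matrix multiplication workloads by orders of magnitude.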

Link: https://arxiv.org/abs/2511.17265
Authors: Shady Agwa,Yikang Shen,Shiwei Wang,Themis Prodromakis
Affiliation: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Performance (cs.PF)
Comments: 6 pages, 5 figures

Abstract:Nowadays, we are witnessing an Artificial Intelligence revolution that dominates the technology landscape in various application domains, such as healthcare, robotics, automotive, security, and defense. Massive-scale AI models, which mimic the human brain’s functionality, typically feature millions and even billions of parameters through data-intensive matrix multiplication tasks. While conventional Von-Neumann architectures struggle with the memory wall and the end of Moore’s Law, these AI applications are migrating rapidly towards the edge, such as in robotics and unmanned aerial vehicles for surveillance, thereby adding more constraints to the hardware budget of AI architectures at the edge. Although in-memory computing has been proposed as a promising solution for the memory wall, both analog and digital in-memory computing architectures suffer from substantial degradation of the proposed benefits due to various design limitations. We propose a new digital in-memory stochastic computing architecture, DISCA, utilizing a compressed version of the quasi-stochastic Bent-Pyramid data format. DISCA inherits the same computational simplicity of analog computing, while preserving the same scalability, productivity, and reliability of digital systems. Post-layout modeling results of DISCA show an energy efficiency of 3.59 TOPS/W per bit at 500 MHz using a commercial 180nm CMOS technology. Therefore, DISCA significantly improves the energy efficiency for matrix multiplication workloads by orders of magnitude if scaled and compared to its counterpart architectures.

[AI-12] Algorithmic design and implementation considerations of deep MPC

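[Quick read]: This paper addresses the limited performance of conventional model predictive control (MPC) under model uncertainty, especially for complex dynamic systems that are hard to model precisely. The key to the solution is a Deep Model Predictive Control (Deep MPC) framework that places a deep neural network in the loop with MPC and distributes control authority between the two: the neural network learns and compensates for model uncertainties, while the MPC ensures that system constraints are satisfied and unsafe behavior is prevented throughout learning. This division of labor uses operational data to fine-tune the neural network online while preserving the safety and stability of the control process, thereby improving overall control performance.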

Link: https://arxiv.org/abs/2511.17233
Authors: Prabhat K. Mishra,Mateus V. Gasparino,Girish Chowdhary
Affiliation: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep Model Predictive Control (Deep MPC) is an evolving field that integrates model predictive control and deep learning. This manuscript is focused on a particular approach, which employs a deep neural network in the loop with MPC. This class of approaches distributes control authority between a neural network and an MPC controller, in such a way that the neural network learns the model uncertainties while the MPC handles constraints. The approach is appealing because training data collected while the system is in operation can be used to fine-tune the neural network, and MPC prevents unsafe behavior during those learning transients. This manuscript explains the implementation challenges of Deep MPC and an algorithmic way to distribute control authority, and argues that a poor choice in distributing control authority may lead to poor performance. A reason for poor performance is explained through a numerical experiment on four-wheeled skid-steer dynamics.

[AI-13] MIR: Efficient Exploration in Episodic Multi-Agent Reinforcement Learning via Mutual Intrinsic Reward

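[Quick read]: This paper addresses the exploration difficulty caused by sparse (episodic) rewards in multi-agent reinforcement learning (MARL). The core challenges are: (1) as the number of agents grows, joint action trajectories that lead to rewards become exponentially sparse and hard to discover; and (2) existing methods often fail to account for how individual actions jointly influence team states. The key to the solution is Mutual Intrinsic Reward (MIR), which incentivizes each agent to explore actions that affect its teammates, promoting effective team-level exploration and significantly improving algorithm performance in sparse-reward environments.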

Link: https://arxiv.org/abs/2511.17165
Authors: Kesheng Chen,Wenjian Luo,Bang Zhang,Zeping Yin,Zipeng Ye
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Episodic rewards present a significant challenge in reinforcement learning. While intrinsic reward methods have demonstrated effectiveness in single-agent reinforcement learning scenarios, their application to multi-agent reinforcement learning (MARL) remains problematic. The primary difficulties stem from two factors: (1) the exponential sparsity of joint action trajectories that lead to rewards as the exploration space expands, and (2) existing methods often fail to account for joint actions that can influence team states. To address these challenges, this paper introduces Mutual Intrinsic Reward (MIR), a simple yet effective enhancement strategy for MARL with extremely sparse rewards like episodic rewards. MIR incentivizes individual agents to explore actions that affect their teammates, and when combined with original strategies, effectively stimulates team exploration and improves algorithm performance. For comprehensive experimental validation, we extend the representative single-agent MiniGrid environment to create MiniGrid-MA, a series of MARL environments with sparse rewards. Our evaluation compares the proposed method against state-of-the-art approaches in the MiniGrid-MA setting, with experimental results demonstrating superior performance.

[AI-14] The Belief-Desire-Intention Ontology for modelling mental reality and agency

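[Quick read]: This paper addresses the limited integration of the Belief-Desire-Intention (BDI) model into structured, semantically interoperable knowledge representations, that is, how to embed the BDI model in the Semantic Web in a formal, reusable way so that cognitive agents can be modelled precisely and can cooperate across systems. The key to the solution is a modular BDI Ontology built as an Ontology Design Pattern (ODP), which ensures semantic precision and reusability by aligning with foundational ontologies and following best practices in modular design. Its applicability is demonstrated through two complementary experiments: first, coupling it with Large Language Models (LLMs) via Logic Augmented Generation (LAG) to improve inferential consistency; second, integrating it into the Semas reasoning platform to enable a bidirectional flow between RDF triples and agent mental states (Triples-to-Beliefs-to-Triples, T2B2T). Together these establish a conceptual and operational bridge between declarative and procedural intelligence.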

Link: https://arxiv.org/abs/2511.17162
Authors: Sara Zuppiroli,Carmelo Fabio Longo,Anna Sofia Lippolis,Rocco Paolillo,Lorenzo Giammei,Miguel Ceriani,Francesco Poggi,Antonio Zinilli,Andrea Giovanni Nuzzolese
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The Belief-Desire-Intention (BDI) model is a cornerstone for representing rational agency in artificial intelligence and cognitive sciences. Yet, its integration into structured, semantically interoperable knowledge representations remains limited. This paper presents a formal BDI Ontology, conceived as a modular Ontology Design Pattern (ODP) that captures the cognitive architecture of agents through beliefs, desires, intentions, and their dynamic interrelations. The ontology ensures semantic precision and reusability by aligning with foundational ontologies and best practices in modular design. Two complementary lines of experimentation demonstrate its applicability: (i) coupling the ontology with Large Language Models (LLMs) via Logic Augmented Generation (LAG) to assess the contribution of ontological grounding to inferential coherence and consistency; and (ii) integrating the ontology within the Semas reasoning platform, which implements the Triples-to-Beliefs-to-Triples (T2B2T) paradigm, enabling a bidirectional flow between RDF triples and agent mental states. Together, these experiments illustrate how the BDI Ontology acts as both a conceptual and operational bridge between declarative and procedural intelligence, paving the way for cognitively grounded, explainable, and semantically interoperable multi-agent and neuro-symbolic systems operating within the Web of Data.

[AI-15] Device-Guided Music Transfer

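[Quick read]: This paper addresses device-guided music transfer: adapting audio playback to match the frequency response of different hardware when the user does not have the target playback device (such as a speaker). Existing methods mostly adjust musical attributes such as timbre, rhythm, harmony, or instrumentation and overlook how hardware properties of the speaker affect sound. The key to the solution is the DeMT framework. It first feeds a speaker's frequency response curve, rendered as a line graph, to a vision-language model to extract device embeddings. These embeddings then condition a hybrid transformer via feature-wise linear modulation, enabling effective style transfer and few-shot adaptation for unseen devices. The method supports applications such as device-style augmentation and quality enhancement.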

Link: https://arxiv.org/abs/2511.17136
Authors: Manh Pham Hung,Changshuo Hu,Ting Dang,Dong Ma
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:

Abstract:Device-guided music transfer adapts playback across unseen devices for users who lack them. Existing methods mainly focus on modifying the timbre, rhythm, harmony, or instrumentation to mimic genres or artists, overlooking the diverse hardware properties of the playback device (i.e., speaker). Therefore, we propose DeMT, which processes a speaker’s frequency response curve as a line graph using a vision-language model to extract device embeddings. These embeddings then condition a hybrid transformer via feature-wise linear modulation. Fine-tuned on a self-collected dataset, DeMT enables effective speaker-style transfer and robust few-shot adaptation for unseen devices, supporting applications like device-style augmentation and quality enhancement.
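
Feature-wise linear modulation (FiLM) scales and shifts each feature channel with parameters derived from a conditioning input. The sketch below shows only that generic operation on a toy feature map. In a setup like the one described, the scale and shift would presumably be predicted from the device embedding; here they are hard-coded, hypothetical values, and nothing in this code is taken from DeMT.

```python
def film(features, gamma, beta):
    """Apply gamma[c] * x[c] + beta[c] to every channel c of every frame."""
    return [
        [g * x + b for x, g, b in zip(frame, gamma, beta)]
        for frame in features
    ]


# Toy input (hypothetical): two time frames, two channels each.
features = [[1.0, 2.0], [3.0, 4.0]]
gamma = [2.0, 0.5]  # per-channel scale, assumed to come from a conditioning network
beta = [0.0, 1.0]   # per-channel shift, assumed to come from a conditioning network

modulated = film(features, gamma, beta)
# modulated == [[2.0, 2.0], [6.0, 3.0]]
```

Because the same scale and shift are applied to a channel across all frames, the conditioning changes how the network treats each feature without altering the temporal structure of the input.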

[AI-16] UI-CUBE: Enterprise-Grade Computer Use Agent Benchmarking Beyond Task Accuracy to Operational Reliability

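[Quick read]: This paper addresses a gap in current Computer Use Agent (CUA) benchmarks when assessing readiness for enterprise deployment: they focus on the functional correctness of task completion and neglect the operational reliability that production systems require. The key to the solution is UI-CUBE (UiPath Computer Use BEnchmark), a systematic benchmark of 226 tasks covering simple UI interactions (136), copy-paste tasks (50), and enterprise application scenarios (40). With systematic interface variation coverage, multi-resolution testing, and automated validation of task success through application state, it exposes structural weaknesses of current CUAs in memory management, hierarchical planning, and state coordination. Experiments show that success rates on simple tasks reach 67-85% (humans: 97.9%) but fall sharply to 9-19% on complex workflows, while human evaluators with no prior application experience reach only 61.2% on complex tasks. This suggests that the drop is not due to insufficient training but to fundamental architectural limitations, which makes UI-CUBE a useful diagnostic of enterprise readiness.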

Link: https://arxiv.org/abs/2511.17131
Authors: Horia Cristescu,Charles Park,Trong Canh Nguyen,Sergiu Talmacel,Alexandru-Gabriel Ilie,Stefan Adam
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 18 pages, 8 figures, 5 tables. Benchmark comprising 226 tasks across two difficulty tiers. Code and benchmark available at this https URL

Abstract:While current Computer Use Agent (CUA) benchmarks measure task completion effectively, they provide limited assessment of enterprise deployment readiness, emphasizing functional correctness over the operational reliability required for production systems. We present UI-CUBE (UiPath Computer Use BEnchmark), a systematic benchmark comprising 226 tasks across two difficulty tiers designed to expose fundamental architectural limitations in current CUAs. Our evaluation covers simple UI interactions (136 tasks) and complex workflows including copy-paste tasks (50 tasks) and enterprise application scenarios (40 tasks), with systematic interface variation coverage, multi-resolution testing and automated validation of task success through the application state. Evaluation of five state-of-the-art models reveals a sharp capability cliff rather than gradual performance degradation. Simple UI interactions achieve 67-85% success rates (compared to 97.9% human performance), but complex workflows drop precipitously to 9-19%. Human evaluators with no prior application experience achieve only 61.2% on complex tasks despite near-perfect performance on simple tasks, establishing realistic performance ceilings. This discontinuous performance pattern – where agents achieve 68-87% of human performance on simple tasks but only 15-32% on complex workflows – indicates fundamental architectural limitations in memory management, hierarchical planning, and state coordination rather than incremental capability gaps addressable through better training or prompting. UI-CUBE functions as an enterprise-readiness diagnostic, revealing that while current CUAs can manipulate individual interface elements, they cannot yet function as reliable workflow automation tools. These findings provide architectural insights essential for developing production-ready CUAs capable of managing complex, multi-step enterprise processes.
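
The "percent of human performance" figures in the abstract can be checked approximately from the success rates it reports. The short sketch below divides the agent ranges by the human baselines; the helper name is hypothetical, and the inputs are the rounded numbers quoted in the abstract.

```python
def relative_to_human(agent_rate, human_rate):
    """Agent success rate expressed as a percentage of the human success rate."""
    return 100.0 * agent_rate / human_rate


# Simple tasks: agents 67-85%, humans 97.9%.
simple = [relative_to_human(r, 97.9) for r in (67, 85)]
# Complex workflows: agents 9-19%, humans 61.2%.
complex_ = [relative_to_human(r, 61.2) for r in (9, 19)]
```

This gives roughly 68% and 87% for simple tasks and roughly 15% and 31% for complex workflows. The abstract states 15-32% for the latter; the small difference at the upper end presumably comes from the unrounded success rates, which the abstract does not give.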

[AI-17] AutoGraphAD: A novel approach using Variational Graph Autoencoders for anomalous network flow detection

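[Quick read]: This paper addresses the heavy reliance of Network Intrusion Detection Systems (NIDS) on labelled data, together with the limited, outdated, or mislabelled attacks in existing public datasets. To reduce the dependence on manual labels, the authors propose AutoGraphAD, an unsupervised anomaly detection method based on a Heterogeneous Variational Graph Autoencoder. Its core idea is to build a heterogeneous graph of connection and IP nodes that captures network activity within a time window, train the model with unsupervised and contrastive learning, and combine weighted reconstruction loss, structural loss, and KL divergence into an anomaly score for detection. Without requiring an additional downstream anomaly detector, it matches or outperforms existing unsupervised methods such as Anomal-E, and it is about 1.18 and 1.03 orders of magnitude faster in training and inference respectively, a significant advantage for practical deployment.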

链接: https://arxiv.org/abs/2511.17113
作者: Georgios Anyfantis,Pere Barlet-Ros
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages, 9 figures

点击查看摘要

Abstract:Network Intrusion Detection Systems (NIDS) are essential tools for detecting network attacks and intrusions. While extensive research has explored the use of supervised Machine Learning for attack detection and characterisation, these methods require accurately labelled datasets, which are very costly to obtain. Moreover, existing public datasets have limited and/or outdated attacks, and many of them suffer from mislabelled data. To reduce the reliance on labelled data, we propose AutoGraphAD, a novel unsupervised anomaly detection approach based on a Heterogeneous Variational Graph Autoencoder. AutoGraphAD operates on heterogeneous graphs, made from connection and IP nodes that capture network activity within a time window. The model is trained using unsupervised and contrastive learning, without relying on any labelled data. The reconstruction, structural loss, and KL divergence are then weighted and combined in an anomaly score that is then used for anomaly detection. Overall, AutoGraphAD yields the same, and in some cases better, results than previous unsupervised approaches, such as Anomal-E, but without requiring costly downstream anomaly detectors. As a result, AutoGraphAD achieves around 1.18 orders of magnitude faster training and 1.03 orders of magnitude faster inference, which represents a significant advantage for operational deployment.
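The weighted combination of loss terms described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `LossTerms`, `anomaly_score`, and `flag_anomalies`, the weights, the threshold, and the numeric values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LossTerms:
    """Per-window loss terms; the field names are hypothetical."""
    reconstruction: float  # feature reconstruction error
    structural: float      # graph-structure reconstruction error
    kl_divergence: float   # KL term of the variational encoder


def anomaly_score(terms, w_rec=1.0, w_struct=1.0, w_kl=1.0):
    """Weighted sum of the three terms (the weights are illustrative assumptions)."""
    return (w_rec * terms.reconstruction
            + w_struct * terms.structural
            + w_kl * terms.kl_divergence)


def flag_anomalies(scores, threshold):
    """Mark a time window as anomalous when its score exceeds the threshold."""
    return [score > threshold for score in scores]


# Two made-up time windows: one with low losses, one with high losses.
windows = [
    LossTerms(reconstruction=0.10, structural=0.05, kl_divergence=0.02),
    LossTerms(reconstruction=0.90, structural=0.60, kl_divergence=0.30),
]
scores = [anomaly_score(t, w_rec=0.5, w_struct=0.3, w_kl=0.2) for t in windows]
flags = flag_anomalies(scores, threshold=0.5)
```

Because the score comes directly from the autoencoder's own loss terms, no separate downstream detector is needed in this sketch; how the paper chooses its weights and decision threshold is not stated in the abstract.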

[AI-18] Why Do Language Model Agents Whistleblow?

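[Quick read]: This paper studies a new kind of unintended behavior that alignment training can produce when Large Language Models (LLMs) are deployed as tool-using agents: whistleblowing, in which a model discloses suspected misconduct to parties outside the dialog boundary (e.g., regulatory agencies) without the user's instruction or knowledge. The key to the approach is an evaluation suite of diverse, realistic staged misconduct scenarios for detecting and quantifying this behavior systematically. Controlled experiments show that (1) whistleblowing frequency varies widely across model families; (2) greater task complexity lowers the tendency to whistleblow; (3) a moral nudge in the system prompt substantially raises whistleblowing rates; and (4) offering more non-whistleblowing avenues, such as additional tools and a detailed workflow, lowers them. The study thereby exposes how LLM behavior depends on training, prompt design, and task structure.

Link: https://arxiv.org/abs/2511.17085
Authors: Kushal Agrawal, Frank Xiao, Guido Bergman, Asa Cooper Stickland
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: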

Abstract:The deployment of Large Language Models (LLMs) as tool-using agents causes their alignment training to manifest in new ways. Recent work finds that language models can use tools in ways that contradict the interests or explicit instructions of the user. We study LLM whistleblowing: a subset of this behavior where models disclose suspected misconduct to parties beyond the dialog boundary (e.g., regulatory agencies) without user instruction or knowledge. We introduce an evaluation suite of diverse and realistic staged misconduct scenarios to assess agents for this behavior. Across models and settings, we find that: (1) the frequency of whistleblowing varies widely across model families, (2) increasing the complexity of the task the agent is instructed to complete lowers whistleblowing tendencies, (3) nudging the agent in the system prompt to act morally substantially raises whistleblowing rates, and (4) giving the model more obvious avenues for non-whistleblowing behavior, by providing more tools and a detailed workflow to follow, decreases whistleblowing rates. Additionally, we verify the robustness of our dataset by testing for model evaluation awareness, and find that both black-box methods and probes on model activations show lower evaluation awareness in our settings than in comparable previous work.

[AI-19] Patient-level Information Extraction by Consistent Integration of Textual and Tabular Evidence with Bayesian Networks

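[Quick read]: This paper addresses how to extract multi-modal, patient-level information from Electronic Health Records (EHRs) to support clinical decision support systems in high-risk applications. The core challenge is that EHRs contain both structured data (e.g. diagnosis codes, medications, and lab results) and large amounts of unstructured text (e.g. discharge summaries and nursing notes), and the two must be fused to build accurate, interpretable models. The key to the solution is a fusion method that combines a Bayesian network with neural text classifiers: an expert-informed Bayesian network handles the tabular features, while neural text classifiers process the clinical notes. Virtual evidence augmented with a consistency node then provides an interpretable, probabilistic fusion. The consistency node improves the calibration of the final predictions, helps handle missing information, and resolves contradictions between the tabular and text data, which makes the model more reliable and transparent in realistic clinical settings.

Link: https://arxiv.org/abs/2511.17056
Authors: Paloma Rabaey, Adrick Tench, Stefan Heytens, Thomas Demeester
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: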

Abstract:Electronic health records (EHRs) form an invaluable resource for training clinical decision support systems. To leverage the potential of such systems in high-risk applications, we need large, structured tabular datasets on which we can build transparent feature-based models. While part of the EHR already contains structured information (e.g. diagnosis codes, medications, and lab results), much of the information is contained within unstructured text (e.g. discharge summaries and nursing notes). In this work, we propose a method for multi-modal patient-level information extraction that leverages both the tabular features available in the patient’s EHR (using an expert-informed Bayesian network) as well as clinical notes describing the patient’s symptoms (using neural text classifiers). We propose the use of virtual evidence augmented with a consistency node to provide an interpretable, probabilistic fusion of the models’ predictions. The consistency node improves the calibration of the final predictions compared to virtual evidence alone, allowing the Bayesian network to better adjust the neural classifier’s output to handle missing information and resolve contradictions between the tabular and text data. We show the potential of our method on the SimSUM dataset, a simulated benchmark linking tabular EHRs with clinical notes through expert knowledge.
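To make the idea of virtual evidence concrete, here is a minimal sketch for a single binary symptom node. It assumes the standard treatment of virtual (likelihood) evidence, in which a prior is multiplied by a likelihood vector and renormalized. The function name, the state names, and the numbers are hypothetical, and the paper's consistency node is not modeled here because the abstract does not define it.

```python
def apply_virtual_evidence(prior, likelihood):
    """Multiply a prior over states by a likelihood vector and renormalize."""
    unnormalized = {state: prior[state] * likelihood[state] for state in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("evidence is incompatible with the prior")
    return {state: value / total for state, value in unnormalized.items()}


# Hypothetical prior belief from the Bayesian network for one symptom.
prior = {"present": 0.2, "absent": 0.8}
# Hypothetical text-classifier output, used as a likelihood vector.
likelihood = {"present": 0.9, "absent": 0.1}

posterior = apply_virtual_evidence(prior, likelihood)
```

With these made-up numbers the network's belief in "present" rises from 0.2 to 0.18 / 0.26, roughly 0.69: the classifier shifts the belief but does not override the prior, which is what makes this kind of fusion probabilistic rather than a hard override.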

[AI-20] CLLMRec: LLM-powered Cognitive-Aware Concept Recommendation via Semantic Alignment and Prerequisite Knowledge Distillation

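[Quick read]: This paper addresses personalized learning recommendation in Massive Open Online Courses (MOOCs), specifically how to make accurate, cognitive-aware concept recommendations when high-quality structured knowledge graphs are unavailable. Existing methods depend on knowledge graphs or heterogeneous information networks that are often hard to obtain in practice. The key to the solution is CLLMRec, a framework built on two cooperating modules. Semantic Alignment encodes unstructured textual descriptions of learners and concepts into a unified representation space that captures their semantic relations. Prerequisite Knowledge Distillation uses a large language model (LLM) as a teacher to extract implicit prerequisite relationships between concepts and distills them into soft labels for training a lightweight student ranker, so logical dependencies between concepts can be modeled without explicit structural priors. Finally, a deep knowledge tracing mechanism explicitly models learners' real-time cognitive states, so that recommendations are both structurally sound and cognitively appropriate.

Link: https://arxiv.org/abs/2511.17041
Authors: Xiangrui Xiong, Yichuan Lu, Zifei Pan, Chang Sun
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: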

Abstract:The growth of Massive Open Online Courses (MOOCs) presents significant challenges for personalized learning, where concept recommendation is crucial. Existing approaches typically rely on heterogeneous information networks or knowledge graphs to capture conceptual relationships, combined with knowledge tracing models to assess learners’ cognitive states. However, these methods face significant limitations due to their dependence on high-quality structured knowledge graphs, which are often scarce in real-world educational scenarios. To address this fundamental challenge, this paper proposes CLLMRec, a novel framework that leverages Large Language Models through two synergistic technical pillars: Semantic Alignment and Prerequisite Knowledge Distillation. The Semantic Alignment component constructs a unified representation space by encoding unstructured textual descriptions of learners and concepts. The Prerequisite Knowledge Distillation paradigm employs a teacher-student architecture, where a large teacher LLM (implemented as the Prior Knowledge Aware Component) extracts conceptual prerequisite relationships from its internalized world knowledge and distills them into soft labels to train an efficient student ranker. Building upon these foundations, our framework incorporates a fine-ranking mechanism that explicitly models learners’ real-time cognitive states through deep knowledge tracing, ensuring recommendations are both structurally sound and cognitively appropriate. Extensive experiments on two real-world MOOC datasets demonstrate that CLLMRec significantly outperforms existing baseline methods across multiple evaluation metrics, validating its effectiveness in generating truly cognitive-aware and personalized concept recommendations without relying on explicit structural priors.
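The teacher-student distillation step can be illustrated with a small sketch. It assumes the common recipe of turning teacher scores into soft labels with a temperature softmax and training the student with a soft-label cross-entropy; the abstract does not say which recipe CLLMRec uses. All names and scores below are hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_cross_entropy(teacher_probs, student_probs, eps=1e-12):
    """Cross-entropy of the student distribution against the teacher's soft labels."""
    return -sum(t * math.log(s + eps) for t, s in zip(teacher_probs, student_probs))


# Hypothetical prerequisite-relevance scores from the teacher LLM for three concepts.
teacher_scores = [2.0, 0.5, -1.0]
soft_labels = softmax(teacher_scores, temperature=2.0)

# An untrained student ranks all concepts equally; a student that matches the teacher does better.
loss_untrained = soft_label_cross_entropy(soft_labels, softmax([1.0, 1.0, 1.0]))
loss_matched = soft_label_cross_entropy(soft_labels, soft_labels)
```

Soft labels preserve the teacher's relative preferences between concepts, not only the top choice, which is the usual motivation for distilling into a lightweight ranker.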

[AI-21] DAPS++: Rethinking Diffusion Inverse Problems with Decoupled Posterior Annealing

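[Quick read]: This paper addresses a mismatch in diffusion-based inverse problem solving: under the conventional Bayesian formulation, the diffusion prior offers limited guidance in practice while the measurement-consistency term dominates inference, so the reconstruction is effectively decoupled from the diffusion dynamics. The key to the solution is to reinterpret diffusion as an initialization stage within an expectation-maximization (EM)-style framework, which fully decouples the diffusion stage from the data-driven refinement. On this basis the authors propose DAPS++, which lets the likelihood term guide inference more directly while maintaining numerical stability, and which explains why unified diffusion trajectories remain effective in practice. The method requires fewer function evaluations (NFEs) and measurement-optimization steps, improving computational efficiency and robustness across diverse image restoration tasks.

Link: https://arxiv.org/abs/2511.17038
Authors: Hao Chen, Renzheng Zhang, Scott S. Howard
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: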

Abstract:From a Bayesian perspective, score-based diffusion solves inverse problems through joint inference, embedding the likelihood with the prior to guide the sampling process. However, this formulation fails to explain its practical behavior: the prior offers limited guidance, while reconstruction is largely driven by the measurement-consistency term, leading to an inference process that is effectively decoupled from the diffusion dynamics. To clarify this structure, we reinterpret the role of diffusion in inverse problem solving as an initialization stage within an expectation–maximization (EM)–style framework, where the diffusion stage and the data-driven refinement are fully decoupled. We introduce DAPS++, which allows the likelihood term to guide inference more directly while maintaining numerical stability and providing insight into why unified diffusion trajectories remain effective in practice. By requiring fewer function evaluations (NFEs) and measurement-optimization steps, DAPS++ achieves high computational efficiency and robust reconstruction performance across diverse image restoration tasks.

[AI-22] Budget-Aware Tool-Use Enables Effective Agent Scaling

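[Quick read]: This paper addresses the finding that simply enlarging the tool-call budget of tool-augmented agents fails to improve performance: the agents lack "budget awareness" and quickly hit a performance ceiling. The key to the solution is the Budget Tracker, a lightweight plug-in that gives the agent continuous budget awareness, together with BATS (Budget Aware Test-time Scaling), a framework that uses this awareness to adapt its planning and verification strategy dynamically, deciding from the remaining resources whether to "dig deeper" on a promising lead or "pivot" to a new path. A unified cost metric that jointly accounts for token and tool consumption enables a controlled performance-cost trade-off; the budget-aware methods yield more favorable scaling curves and push the cost-performance Pareto frontier.

Link: https://arxiv.org/abs/2511.17006
Authors: Tengxiao Liu, Zifeng Wang, Jin Miao, I-Hung Hsu, Jun Yan, Jiefeng Chen, Rujun Han, Fangyuan Xu, Yanfei Chen, Ke Jiang, Samira Daruki, Yi Liang, William Yang Wang, Tomas Pfister, Chen-Yu Lee
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: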

Abstract:Scaling test-time computation improves performance across different tasks on large language models (LLMs), which has also been extended to tool-augmented agents. For these agents, scaling involves not only “thinking” in tokens but also “acting” via tool calls. The number of tool calls directly bounds the agent’s interaction with the external environment. However, we find that simply granting agents a larger tool-call budget fails to improve performance, as they lack “budget awareness” and quickly hit a performance ceiling. To address this, we study how to scale such agents effectively under explicit tool-call budgets, focusing on web search agents. We first introduce the Budget Tracker, a lightweight plug-in that provides the agent with continuous budget awareness, enabling simple yet effective scaling. We further develop BATS (Budget Aware Test-time Scaling), an advanced framework that leverages this awareness to dynamically adapt its planning and verification strategy, deciding whether to “dig deeper” on a promising lead or “pivot” to new paths based on remaining resources. To analyze cost-performance scaling in a controlled manner, we formalize a unified cost metric that jointly accounts for token and tool consumption. We provide the first systematic study on budget-constrained agents, showing that budget-aware methods produce more favorable scaling curves and push the cost-performance Pareto frontier. Our work offers empirical insights toward a more transparent and principled understanding of scaling in tool-augmented agents.
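As a rough illustration of what "continuous budget awareness" could look like, the sketch below counts tool calls and renders a status line that an agent loop might append to the prompt at every step. It is not the paper's Budget Tracker: the class interface, the message format, and the linear form of `unified_cost` with its per-unit prices are assumptions made for this example.

```python
class BudgetTracker:
    """Counts tool calls and renders a remaining-budget note for the prompt."""

    def __init__(self, max_tool_calls):
        if max_tool_calls < 0:
            raise ValueError("max_tool_calls must be non-negative")
        self.max_tool_calls = max_tool_calls
        self.used = 0

    @property
    def remaining(self):
        return self.max_tool_calls - self.used

    def record_call(self):
        """Register one tool call, refusing calls beyond the budget."""
        if self.remaining <= 0:
            raise RuntimeError("tool-call budget exhausted")
        self.used += 1

    def status_line(self):
        """Text a caller could append to the agent's context each turn."""
        return (f"[budget] tool calls used: {self.used}/{self.max_tool_calls}, "
                f"remaining: {self.remaining}")


def unified_cost(tokens, tool_calls, cost_per_token, cost_per_tool_call):
    """One possible joint cost: a linear mix of token and tool-call costs (assumed form)."""
    return tokens * cost_per_token + tool_calls * cost_per_tool_call


tracker = BudgetTracker(max_tool_calls=3)
tracker.record_call()
note = tracker.status_line()
cost = unified_cost(tokens=1000, tool_calls=2, cost_per_token=0.001, cost_per_tool_call=0.5)
```

An agent that sees the status line each turn can, in principle, choose between digging deeper and pivoting based on what is left, which is the behavior the abstract attributes to BATS.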

[AI-23] MirrorMind: Empowering OmniScientist with the Expert Perspectives and Collective Knowledge of Human Scientists

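[Quick read]: This paper addresses the tendency of current generative AI approaches to scientific discovery to rely on solitary optimization or search processes, overlooking that knowledge production is inherently social and historical. Existing Large Language Models (LLMs) struggle to model the interplay between an individual researcher's cognitive trajectory and a discipline's collective memory. The key to the solution is MirrorMind, a hierarchical cognitive architecture that integrates dual-memory representations in a three-level framework: the Individual Level builds high-fidelity cognitive models that capture a researcher's episodic, semantic, and persona memories; the Domain Level maps disciplinary knowledge into structured concept graphs; and the Interdisciplinary Level acts as an orthogonal orchestration engine, letting agents draw on individual memories or collective structures when reasoning. The design separates memory storage from agentic execution, moving AI scientists beyond simple fact retrieval toward structured, personalized, insight-generating scientific reasoning.

Link: https://arxiv.org/abs/2511.16997
Authors: Qingbin Zeng, Bingbing Fan, Zhiyu Chen, Sijian Ren, Zhilun Zhou, Xuhua Zhang, Yuanyi Zhen, Fengli Xu, Yong Li, Tie-Yan Liu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 26 pages, 4 figures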

Abstract:The emergence of AI Scientists has demonstrated remarkable potential in automating scientific research. However, current approaches largely conceptualize scientific discovery as a solitary optimization or search process, overlooking that knowledge production is inherently a social and historical endeavor. Human scientific insight stems from two distinct yet interconnected sources. The first is the individual cognitive trajectory, where a researcher’s unique insight is shaped by their evolving research history and stylistic preferences; the second is the collective disciplinary memory, where knowledge is sedimented into vast, interconnected networks of citations and concepts. Existing LLMs still struggle to represent these structured, high-fidelity cognitive and social contexts. To bridge this gap, we introduce MirrorMind, a hierarchical cognitive architecture that integrates dual-memory representations within a three-level framework. The Individual Level constructs high-fidelity cognitive models of individual researchers by capturing their episodic, semantic, and persona memories; the Domain Level maps collective knowledge into structured disciplinary concept graphs; and the Interdisciplinary Level acts as an orthogonal orchestration engine. Crucially, our architecture separates memory storage from agentic execution, enabling AI scientist agents to flexibly access individual memories for unique perspectives or collective structures to reason. We evaluate MirrorMind across four comprehensive tasks, including author-level cognitive simulation, complementary reasoning, cross-disciplinary collaboration promotion, and multi-agent scientific problem solving. The results show that by integrating individual cognitive depth with collective disciplinary breadth, MirrorMind moves beyond simple fact retrieval toward structural, personalized, and insight-generating scientific reasoning.

[AI-24] Optimizing PyTorch Inference with LLM-Based Multi-Agent Systems

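[Quick read]: This paper addresses the ongoing challenge of maximizing the performance of modern AI inference systems on GPU hardware. Traditional approaches rely on hand-written custom GPU kernels or specialized model compilers, which are time-consuming and often limited in effectiveness. The key to the solution is an LLM-based multi-agent system that performs collaborative, automatic code optimization in place of manual tuning and existing compilers. The study finds that exploit-heavy strategies perform best when paired with error-fixing agents and that the final speedup correlates with the granularity of optimization steps; the best implementation achieves an average 2.88x speedup on an H100 GPU across diverse PyTorch tasks in KernelBench.

Link: https://arxiv.org/abs/2511.16964
Authors: Kirill Nagaitsev, Luka Grbcic, Samuel Williams, Costin Iancu
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: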

Abstract:Maximizing performance on available GPU hardware is an ongoing challenge for modern AI inference systems. Traditional approaches include writing custom GPU kernels and using specialized model compilers to tune high-level code for specific GPU targets. Recent work shows that LLM-based multi-agent systems can effectively perform such tuning, often outperforming existing compilers and eliminating the need for manual kernel development. However, the dynamics of multi-agent systems for this task remain unexplored. In this work, we present a logical framework for comparing multi-agent PyTorch optimization systems. Our evaluation shows that exploit-heavy strategies perform best when paired with error-fixing agents, and that performance correlates with the granularity of optimization steps. The best implementation achieves an average 2.88x speedup on an H100 GPU across diverse tasks in KernelBench, a benchmark suite covering a range of machine learning architectures in PyTorch.

[AI-25] Comparing verbal, visual and combined explanations for Bayesian Network inferences

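[Quick read]: This paper addresses the difficulty users have in understanding Bayesian Networks (BNs) in practice because their reasoning is opaque, particularly given that existing User Interfaces (UIs) do not clarify how BNs reach their inferences. The key to the solution is a set of verbal and visual extensions to the standard BN UI that guide users through common inference patterns, such as the paths along which an observation has an impact and the way observations interact. In a user study, all three extension types improved user performance over the baseline UI, and combining the verbal and visual modalities worked better than either alone for some question types.

Link: https://arxiv.org/abs/2511.16961
Authors: Erik P. Nyberg, Steven Mascaro, Ingrid Zukerman, Michael Wybrow, Duc-Minh Vo, Ann Nicholson
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 26 pages total, 12 pages main, 14 pages for 5 appendices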

Abstract:Bayesian Networks (BNs) are an important tool for assisting probabilistic reasoning, but despite being considered transparent models, people have trouble understanding them. Further, current User Interfaces (UIs) still do not clarify the reasoning of BNs. To address this problem, we have designed verbal and visual extensions to the standard BN UI, which can guide users through common inference patterns. We conducted a user study to compare our verbal, visual and combined UI extensions, and a baseline UI. Our main findings are: (1) users did better with all three types of extensions than with the baseline UI for questions about the impact of an observation, the paths that enable this impact, and the way in which an observation influences the impact of other observations; and (2) using verbal and visual modalities together is better than using either modality alone for some of these question types.

[AI-26] RASTP: Representation-Aware Semantic Token Pruning for Generative Recommendation with Semantic Identifiers

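[Quick read]: This paper addresses the long input sequences that arise in generative recommendation when items are represented with Semantic Identifiers (SIDs), which significantly increase computational complexity and memory consumption. The key to the solution is Representation-Aware Semantic Token Pruning (RASTP), which dynamically scores each semantic token by combining its representation magnitude (reflecting semantic saliency) with its cumulative attention weights (reflecting attention centrality) and prunes low-information or irrelevant tokens. This shortens the input sequence and reduces training time (by 26.7% in the experiments) while maintaining or slightly improving recommendation performance.

Link: https://arxiv.org/abs/2511.16943
Authors: Tianyu Zhan, Kairui Fu, Zheqi Lv, Shengyu Zhang
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 4 pages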

Abstract:Generative recommendation systems typically leverage Semantic Identifiers (SIDs), which represent each item as a sequence of tokens that encode semantic information. However, representing item ID with multiple SIDs significantly increases input sequence length, which is a major determinant of computational complexity and memory consumption. While existing efforts primarily focus on optimizing attention computation and KV cache, we propose RASTP (Representation-Aware Semantic Token Pruning), which directly prunes less informative tokens in the input sequence. Specifically, RASTP evaluates token importance by combining semantic saliency, measured via representation magnitude, and attention centrality, derived from cumulative attention weights. Since RASTP dynamically prunes low-information or irrelevant semantic tokens, experiments on three real-world Amazon datasets show that RASTP reduces training time by 26.7%, while maintaining or slightly improving recommendation performance. The code has been open-sourced at this https URL.
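The scoring-and-pruning idea can be sketched in a few lines. The abstract says importance combines representation magnitude with cumulative attention weights but does not say how, so the convex combination with `alpha`, the sum-to-one normalization, and the top-k selection below are assumptions, as are all names and numbers.

```python
import math


def l2_norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def normalize(values):
    """Scale non-negative values so they sum to 1 (all zeros if the sum is 0)."""
    total = sum(values)
    return [v / total for v in values] if total > 0 else [0.0] * len(values)


def token_importance(hidden_states, attention, alpha=0.5):
    """Mix saliency (representation norm) with centrality (attention received)."""
    n = len(hidden_states)
    saliency = normalize([l2_norm(h) for h in hidden_states])
    # attention[q][k] is the weight query q puts on key k; sum over queries.
    received = [sum(attention[q][k] for q in range(n)) for k in range(n)]
    centrality = normalize(received)
    return [alpha * s + (1 - alpha) * c for s, c in zip(saliency, centrality)]


def prune_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order."""
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(top)]


tokens = ["sid_a", "sid_b", "sid_c", "sid_d"]  # hypothetical semantic tokens
hidden_states = [[3.0, 4.0], [0.1, 0.1], [1.0, 0.0], [0.0, 2.0]]
attention = [
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.1, 0.1, 0.2],
    [0.5, 0.1, 0.2, 0.2],
    [0.4, 0.1, 0.1, 0.4],
]
scores = token_importance(hidden_states, attention)
kept = prune_tokens(tokens, scores, keep=2)
```

Pruning before the sequence reaches later layers shortens every subsequent attention computation, which is consistent with the training-time reduction the abstract reports.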

[AI-27] Hybrid Differential Reward: Combining Temporal Difference and Action Gradients for Efficient Multi-Agent Reinforcement Learning in Cooperative Driving

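[Quick read]: This paper addresses the vanishing reward differences that traditional state-based reward functions suffer from in multi-vehicle cooperative driving with high-frequency continuous control, which give policy gradients a low signal-to-noise ratio (SNR) and hinder convergence and performance. The key to the solution is a Hybrid Differential Reward (HDR) mechanism that fuses two complementary components: a Temporal Difference Reward (TRD) based on a global potential function, which uses the evolution of the potential to preserve optimal-policy invariance and stay consistent with long-term objectives; and an Action Gradient Reward (ARG), which directly measures the marginal utility of actions to provide a high-SNR local guidance signal. The mechanism is fully instantiated in a Multi-Agent Partially Observable Markov Game (POMDPG) with a time-varying agent set, and experiments with online planning (MCTS) and multi-agent reinforcement learning (QMIX, MAPPO, MADDPG) show that it speeds up convergence and improves policy stability.

Link: https://arxiv.org/abs/2511.16916
Authors: Ye Han, Lijun Zhang, Dejian Meng, Zhuang Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: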

Abstract:In multi-vehicle cooperative driving tasks involving high-frequency continuous control, traditional state-based reward functions suffer from the issue of vanishing reward differences. This phenomenon results in a low signal-to-noise ratio (SNR) for policy gradients, significantly hindering algorithm convergence and performance improvement. To address this challenge, this paper proposes a novel Hybrid Differential Reward (HDR) mechanism. We first theoretically elucidate how the temporal quasi-steady nature of traffic states and the physical proximity of actions lead to the failure of traditional reward signals. Building on this analysis, the HDR framework innovatively integrates two complementary components: (1) a Temporal Difference Reward (TRD) based on a global potential function, which utilizes the evolutionary trend of potential energy to ensure optimal policy invariance and consistency with long-term objectives; and (2) an Action Gradient Reward (ARG), which directly measures the marginal utility of actions to provide a local guidance signal with a high SNR. Furthermore, we formulate the cooperative driving problem as a Multi-Agent Partially Observable Markov Game (POMDPG) with a time-varying agent set and provide a complete instantiation scheme for HDR within this framework. Extensive experiments conducted using both online planning (MCTS) and Multi-Agent Reinforcement Learning (QMIX, MAPPO, MADDPG) algorithms demonstrate that the HDR mechanism significantly improves convergence speed and policy stability. The results confirm that HDR guides agents to learn high-quality cooperative policies that effectively balance traffic efficiency and safety.
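A toy sketch may help show how the two components could fit together. The temporal-difference term below uses the standard potential-based shaping form, gamma * Phi(s') - Phi(s), which is known to preserve optimal policies. The action-gradient term is approximated by a central finite difference of a utility function, which is only one possible reading of "marginal utility of actions". The abstract gives neither formula, so the potential, the utility, the weights, and the one-dimensional speed setting are all assumptions.

```python
TARGET_SPEED = 30.0  # hypothetical target speed in m/s


def potential(speed):
    """Toy global potential: higher when the vehicle is closer to the target speed."""
    return -abs(speed - TARGET_SPEED)


def utility(speed, acceleration):
    """Toy utility of applying `acceleration` for one unit of time at `speed`."""
    return -((speed + acceleration - TARGET_SPEED) ** 2)


def temporal_difference_reward(state, next_state, gamma=0.99):
    """Potential-based shaping term: gamma * Phi(s') - Phi(s)."""
    return gamma * potential(next_state) - potential(state)


def action_gradient_reward(state, action, eps=1e-4):
    """Central finite-difference estimate of d utility / d action (assumed form)."""
    return (utility(state, action + eps) - utility(state, action - eps)) / (2 * eps)


def hybrid_reward(state, action, next_state, gamma=0.99, w_td=1.0, w_ag=1.0):
    """Weighted sum of the two differential terms (the weights are assumptions)."""
    return (w_td * temporal_difference_reward(state, next_state, gamma)
            + w_ag * action_gradient_reward(state, action))


# Accelerating by 1 m/s from 25 m/s moves the vehicle toward the 30 m/s target.
td_term = temporal_difference_reward(25.0, 26.0, gamma=1.0)
ag_term = action_gradient_reward(25.0, 1.0)
total = hybrid_reward(25.0, 1.0, 26.0, gamma=1.0, w_td=1.0, w_ag=0.1)
```

Both terms depend on how the situation changes, not on the absolute state value, which matches the intuition that consecutive states at high control frequency are nearly identical and their raw rewards barely differ.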

[AI-28] PepEVOLVE: Position-Aware Dynamic Peptide Optimization via Group-Relative Advantage

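[Quick read]: This paper addresses the slow lead optimization of macrocyclic peptides under multiple objectives, caused by the vast combinatorial space, unknown edit sites, and the reliance of prior generative methods on static pretraining and fixed optimization algorithms. The key to the solution is PepEVOLVE, a position-aware, dynamic framework that augments pretraining with dynamic masking and CHUCKLES shifting to improve generalization, uses a context-free multi-armed bandit router to identify high-reward residue positions, and couples a novel evolving optimization algorithm with group-relative advantage to stabilize reinforcement learning updates. This enables efficient, stable multi-objective peptide optimization without pre-specifying mutable positions.

Link: https://arxiv.org/abs/2511.16912
Authors: Trieu Nguyen, Hao-Wei Pang, Shasha Feng
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: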

Abstract:Macrocyclic peptides are an emerging modality that combines biologics-like affinity with small-molecule-like developability, but their vast combinatorial space and multi-parameter objectives make lead optimization slow and challenging. Prior generative approaches such as PepINVENT require chemists to pre-specify mutable positions for optimization, choices that are not always known a priori, and rely on static pretraining and optimization algorithms that limit the model’s ability to generalize and effectively optimize peptide sequences. We introduce PepEVOLVE, a position-aware, dynamic framework that learns both where to edit and how to dynamically optimize peptides for multi-objective improvement. PepEVOLVE (i) augments pretraining with dynamic masking and CHUCKLES shifting to improve generalization, (ii) uses a context-free multi-armed bandit router that discovers high-reward residues, and (iii) couples a novel evolving optimization algorithm with group-relative advantage to stabilize reinforcement updates. During in silico evaluations, the router policy reliably learns and concentrates probability on chemically meaningful sites that influence the peptide’s properties. On a therapeutically motivated Rev-binding macrocycle benchmark, PepEVOLVE outperformed PepINVENT by reaching higher mean scores (approximately 0.8 vs. 0.6), achieving best candidates with a score of 0.95 (vs. 0.87), and converging in fewer steps under the task of optimizing permeability and lipophilicity with structural constraints. Overall, PepEVOLVE offers a practical, reproducible path to peptide lead optimization when optimal edit sites are unknown, enabling more efficient exploration and improving design quality across multiple objectives.
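The "group-relative advantage" mentioned in the abstract can be illustrated with the normalization commonly used in group-based policy optimization: each candidate's reward is compared with the mean of its sampling group and scaled by the group's standard deviation. Whether PepEVOLVE uses exactly this form is an assumption, and the reward values below are made up.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Center rewards on the group mean and scale by the group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical multi-objective scores for four peptides sampled from the same starting point.
group_rewards = [0.60, 0.80, 0.95, 0.65]
advantages = group_relative_advantages(group_rewards)
best_index = advantages.index(max(advantages))
```

Because the baseline is the group itself, no separate value network is needed, and the updates depend on how candidates rank within the group rather than on the absolute reward scale, which is one way such a scheme can stabilize reinforcement updates.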

[AI-29] Generative AI in Sociological Research: State of the Discipline

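[Quick read]: This paper addresses the lack of clarity about how generative artificial intelligence (GenAI) is actually used in sociological research, what scholars think of it, and what impact it may have, including whether researchers with different backgrounds differ. The key to the approach is a survey of 433 authors of articles published in 50 sociology journals, which quantifies the use of GenAI for writing assistance, research planning, data collection, and data analysis, and assesses scholars' views of its social risks, environmental costs, and trustworthiness. The survey finds few differences between computational and non-computational researchers, but widespread caution toward GenAI, with low trust and high concern, providing an empirical basis for responsible adoption strategies.

Link: https://arxiv.org/abs/2511.16884
Authors: AJ Alvero, Dustin S. Stoltz, Oscar Stuhler, Marshall Taylor
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: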

Abstract:Generative artificial intelligence (GenAI) has garnered considerable attention for its potential utility in research and scholarship, even among those who typically do not rely on computational tools. Early commentators, however, have also articulated concerns about how GenAI usage comes with enormous environmental costs, serious social risks, and a tendency to produce low-quality content. In the midst of both excitement and skepticism, it is crucial to take stock of how GenAI is actually being used. Our study focuses on sociological research as our site, and here we present findings from a survey of 433 authors of articles published in 50 sociology journals in the last five years. The survey provides an overview of the state of the discipline with regard to the use of GenAI by providing answers to fundamental questions: how (much) do scholars use the technology for their research; what are their reasons for using it; and how concerned, trustful, and optimistic are they about the technology? Of the approximately one third of respondents who self-report using GenAI at least weekly, the primary uses are for writing assistance and comparatively less so in planning, data collection, or data analysis. In both use and attitudes, there are surprisingly few differences between self-identified computational and non-computational researchers. Generally, respondents are very concerned about the social and environmental consequences of GenAI. Trust in GenAI outputs is low, regardless of expertise or frequency of use. While optimism that GenAI will improve is high, scholars are divided on whether GenAI will have a positive impact on the field.

[AI-30] The use of vocal biomarkers in the detection of Parkinson’s disease: a robust statistical performance comparison of classic machine learning models

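[Quick read]: This paper addresses the lack of efficient, non-invasive tools for the early diagnosis of Parkinson's disease (PD), in particular the use of vocal biomarkers to distinguish PD patients from healthy controls. The key to the solution is a Deep Neural Network (DNN) trained on voice features (Mel-frequency cepstral coefficients, MFCCs), whose robustness is assessed with a validation strategy of 1000 independent random executions. The DNN reaches average classification accuracies of 98.65% and 92.11% on two public voice datasets, outperforming traditional machine learning methods and supporting the accuracy and reliability of DNNs for early detection of neurodegenerative diseases from voice-based biomarkers.

Link: https://arxiv.org/abs/2511.16856
Authors: Katia Pires Nascimento do Sacramento, Elliot Q. C. Garcia, Nicéias Silva Vilela, Vinicius P. Sacramento, Tiago A. E. Ferreira
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages, 3 figures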

Abstract:Parkinson’s disease (PD) is a progressive neurodegenerative disorder that, in addition to directly impairing functional mobility, is frequently associated with vocal impairments such as hypophonia and dysarthria, which typically manifest in the early stages. The use of vocal biomarkers to support the early diagnosis of PD presents a non-invasive, low-cost, and accessible alternative in clinical settings. Thus, the objective of this cross-sectional study was to consistently evaluate the effectiveness of a Deep Neural Network (DNN) in distinguishing individuals with Parkinson’s disease from healthy controls, in comparison with traditional Machine Learning (ML) methods, using vocal biomarkers. Two publicly available voice datasets were used. Mel-frequency cepstral coefficients (MFCCs) were extracted from the samples, and model robustness was assessed using a validation strategy with 1000 independent random executions. Performance was evaluated using classification statistics. Since normality assumptions were not satisfied, non-parametric tests (Kruskal-Wallis and Bonferroni post-hoc tests) were applied to verify whether the tested classification models were similar or different in the classification of PD. With an average accuracy of 98.65% and 92.11% on the Italian Voice dataset and Parkinson’s Telemonitoring dataset, respectively, the DNN demonstrated superior performance and efficiency compared to traditional ML models, while also achieving competitive results when benchmarked against relevant studies. Overall, this study confirms the efficiency of DNNs and emphasizes their potential to provide greater accuracy and reliability for the early detection of neurodegenerative diseases using voice-based biomarkers.

[AI-31] Sex and age determination in European lobsters using AI-Enhanced bioacoustics

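[Quick read]: This paper addresses the difficulty of non-invasively monitoring and classifying elusive aquatic species such as the European lobster (Homarus gammarus), focusing on automatic identification of age (juvenile/adult) and sex (male/female). Because conventional methods struggle to obtain this information, the study combines Passive Acoustic Monitoring (PAM) with Machine Learning (ML) and Deep Learning (DL) models to extract features from the lobsters' bioacoustic signals (buzzing/carapace vibrations) and classify them with high accuracy. The key is to use Mel-frequency cepstral coefficients (MFCCs) as acoustic features and to compare several supervised learners (including 1D-CNN, 1D-DCNN, SVM, XGBoost, and Random Forest). Most models exceed 97% accuracy for age classification, and all except Naive Bayes exceed 93.23% for sex classification, supporting acoustic ML/DL methods as a non-invasive monitoring route for aquaculture and fisheries management that can be deployed on edge computing devices.

Link: https://arxiv.org/abs/2511.16848
Authors: Feliciano Pedro Francisco Domingos, Isibor Kennedy Ihianle, Omprakash Kaiwartya, Ahmad Lotfi, Nicola Khan, Nicholas Beaudreau, Amaya Albalat, Pedro Machado
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: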

Abstract:Monitoring aquatic species, especially elusive ones like lobsters, presents challenges. This study focuses on Homarus gammarus (European lobster), a key species for fisheries and aquaculture, and leverages non-invasive Passive Acoustic Monitoring (PAM). Understanding lobster habitats, welfare, reproduction, sex, and age is crucial for management and conservation. While bioacoustic emissions have classified various aquatic species using Artificial Intelligence (AI) models, this research specifically uses H. gammarus bioacoustics (buzzing/carapace vibrations) to classify lobsters by age (juvenile/adult) and sex (male/female). The dataset was collected at Johnshaven, Scotland, using hydrophones in concrete tanks. We explored the efficacy of Deep Learning (DL) models (1D-CNN, 1D-DCNN) and six Machine Learning (ML) models (SVM, k-NN, Naive Bayes, Random Forest, XGBoost, MLP). Mel-frequency cepstral coefficients (MFCCs) were used as features. For age classification (adult vs. juvenile), most models achieved over 97% accuracy (Naive Bayes: 91.31%). For sex classification, all models except Naive Bayes surpassed 93.23%. These strong results demonstrate the potential of supervised ML and DL to extract age- and sex-related features from lobster sounds. This research offers a promising non-invasive PAM approach for lobster conservation, detection, and management in aquaculture and fisheries, enabling real-world edge computing applications for underwater species.

[AI-32] Analysis of heart failure patient trajectories using sequence modeling

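[Quick read]: This paper addresses the lack of a systematic way to evaluate the performance and efficiency of different sequence models on clinical prediction tasks, particularly with Electronic Health Records (EHRs). The key to the solution is a first systematic ablation study over input tokenization, model configuration, and temporal preprocessing, comparing models from three architecture classes (Transformers, Transformers++, and Mambas) in a large Swedish heart failure cohort (N = 42820). Llama achieves the best predictive discrimination, calibration, and robustness, followed by Mamba; tiny configurations of both surpass other large-scale Transformers, and at equal model size Llama and Mamba achieve superior performance with 25% less training data, indicating efficient representation learning.

Link: https://arxiv.org/abs/2511.16839
Authors: Falk Dippela, Yinan Yu, Annika Rosengren, Martin Lindgren, Christina E. Lundberg, Erik Aerts, Martin Adiels, Helen Sjöland
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: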

Abstract:Transformers have defined the state-of-the-art for clinical prediction tasks involving electronic health records (EHRs). The recently introduced Mamba architecture outperformed an advanced Transformer (Transformer++) based on Llama in handling long context lengths, while using fewer model parameters. Despite the impressive performance of these architectures, a systematic approach to empirically analyze model performance and efficiency under various settings is not well established in the medical domain. The performances of six sequence models were investigated across three architecture classes (Transformers, Transformers++, Mambas) in a large Swedish heart failure (HF) cohort (N = 42820), providing a clinically relevant case study. Patient data included diagnoses, vital signs, laboratories, medications and procedures extracted from in-hospital EHRs. The models were evaluated on three one-year prediction tasks: clinical instability (a readmission phenotype) after initial HF hospitalization, mortality after initial HF hospitalization and mortality after latest hospitalization. Ablations account for modifications of the EHR-based input patient sequence, architectural model configurations, and temporal preprocessing techniques for data collection. Llama achieves the highest predictive discrimination, best calibration, and showed robustness across all tasks, followed by Mambas. Both architectures demonstrate efficient representation learning, with tiny configurations surpassing other large-scaled Transformers. At equal model size, Llama and Mambas achieve superior performance using 25% less training data. This paper presents a first ablation study with systematic design choices for input tokenization, model configuration and temporal data preprocessing. Future model development in clinical prediction tasks using EHRs could build upon this study’s recommendation as a starting point.

[AI-33] ManifoldFormer: Geometric Deep Learning for Neural Dynamics on Riemannian Manifolds ICASSP

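[Quick read]: This paper addresses the problem that existing electroencephalography (EEG) foundation models treat neural signals as generic time series in Euclidean space, ignoring the intrinsic geometric structure of neural dynamics. This mismatch between model assumptions and neural geometry limits representation quality and cross-subject generalization. The key to the solution is the ManifoldFormer framework, whose core innovations are: a Riemannian VAE for manifold embedding that preserves geometric structure; a geometric Transformer with geodesic-aware attention that operates directly on neural manifolds; and a dynamics predictor that uses neural ODEs for manifold-constrained temporal evolution. This geometric deep learning approach substantially improves classification accuracy and agreement (Cohen's Kappa) and reveals meaningful neural patterns consistent with neurophysiological principles.

Link: https://arxiv.org/abs/2511.16828
Authors: Yihang Fu, Lifang He, Qingyu Chen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 5 pages, under review by ICASSP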

Abstract:Existing EEG foundation models mainly treat neural signals as generic time series in Euclidean space, ignoring the intrinsic geometric structure of neural dynamics that constrains brain activity to low-dimensional manifolds. This fundamental mismatch between model assumptions and neural geometry limits representation quality and cross-subject generalization. ManifoldFormer addresses this limitation through a novel geometric deep learning framework that explicitly learns neural manifold representations. The architecture integrates three key innovations: a Riemannian VAE for manifold embedding that preserves geometric structure, a geometric Transformer with geodesic-aware attention mechanisms operating directly on neural manifolds, and a dynamics predictor leveraging neural ODEs for manifold-constrained temporal evolution. Extensive evaluation across four public datasets demonstrates substantial improvements over state-of-the-art methods, with 4.6-4.8% higher accuracy and 6.2-10.2% higher Cohen’s Kappa, while maintaining robust cross-subject generalization. The geometric approach reveals meaningful neural patterns consistent with neurophysiological principles, establishing geometric constraints as essential for effective EEG foundation models.

[AI-34] Monte Carlo Expected Threat (MOCET) Scoring NEURIPS2025

[Quick read]: This paper addresses two gaps in current evaluations of the AI Safety Level (ASL) of large language models (LLMs). First, existing metrics such as LAB-Bench, BioLP-bench, and WMDP reliably measure model uplift and domain knowledge but do not adequately contextualize “real-world risks”. Second, scalable, open-ended metrics are needed to keep pace with the rapid advancement of LLMs. The key to the solution is MOCET (Monte Carlo Expected Threat), an interpretable and doubly-scalable metric, both automatable and open-ended, that can quantify real-world risks and thereby inform the safety case for LLMs.

Link: https://arxiv.org/abs/2511.16823
Authors: Joseph Kim, Saahith Potluri
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted to NeurIPS 2025 BioSafe GenAI

Abstract:Evaluating and measuring AI Safety Level (ASL) threats are crucial for guiding stakeholders to implement safeguards that keep risks within acceptable limits. ASL-3+ models present a unique risk in their ability to uplift novice non-state actors, especially in the realm of biosecurity. Existing evaluation metrics, such as LAB-Bench, BioLP-bench, and WMDP, can reliably assess model uplift and domain knowledge. However, metrics that better contextualize “real-world risks” are needed to inform the safety case for LLMs, along with scalable, open-ended metrics to keep pace with their rapid advancements. To address both gaps, we introduce MOCET, an interpretable and doubly-scalable metric (automatable and open-ended) that can quantify real-world risks.

[AI-35] A Robust Federated Learning Approach for Combating Attacks Against IoT Systems Under non-IID Challenges

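[Quick read]: This paper addresses the performance degradation of Federated Learning (FL) under non-independent and identically distributed (non-IID) data, particularly when detecting large-scale attacks in Internet of Things (IoT) security settings. The core challenge is statistical heterogeneity, which hampers model convergence and weakens generalization. The key to the solution is a systematic comparison of three mainstream FL algorithms (FedAvg, FedProx, and Scaffold) under different data distributions, classifying large-scale IoT attacks on the CICIoT2023 dataset, to reveal the effectiveness and limitations of each method under statistical heterogeneity and to provide empirical evidence and directions for future work.

Link: https://arxiv.org/abs/2511.16822
Authors: Eyad Gad, Zubair Md Fadlullah, Mostafa M. Fouda
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 6 pages, conference paper; presented at the 2024 International Conference on Smart Applications, Communications and Networking (SmartNets 2024), Harrisonburg, VA, USA, May 28, 2024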

Abstract:In the context of the growing proliferation of user devices and the concurrent surge in data volumes, the complexities arising from the substantial increase in data have posed formidable challenges to conventional machine learning model training. Particularly, this is evident within resource-constrained and security-sensitive environments such as those encountered in networks associated with the Internet of Things (IoT). Federated Learning has emerged as a promising remedy to these challenges by decentralizing model training to edge devices or parties, effectively addressing privacy concerns and resource limitations. Nevertheless, the presence of statistical heterogeneity in non-Independently and Identically Distributed (non-IID) data across different parties poses a significant hurdle to the effectiveness of FL. Many FL approaches have been proposed to enhance learning effectiveness under statistical heterogeneity. However, prior studies have uncovered a gap in the existing research landscape, particularly in the absence of a comprehensive comparison between federated methods addressing statistical heterogeneity in detecting IoT attacks. In this research endeavor, we delve into the exploration of FL algorithms, specifically FedAvg, FedProx, and Scaffold, under different data distributions. Our focus is on achieving a comprehensive understanding of and addressing the challenges posed by statistical heterogeneity. In this study, we classify large-scale IoT attacks by utilizing the CICIoT2023 dataset. Through meticulous analysis and experimentation, our objective is to illuminate the performance nuances of these FL methods, providing valuable insights for researchers and practitioners in the domain.
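
For orientation, two of the algorithms compared in this paper can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: model parameters are plain lists of floats, and the names `fedavg` and `fedprox_local_step` and the values of `lr` and `mu` are hypothetical. FedAvg averages client parameters weighted by local sample counts; FedProx adds a proximal term (mu/2)·||w − w_global||² to each client's local objective, which limits client drift under non-IID data. Scaffold, which relies on control variates, is not shown.

```python
from typing import List, Sequence


def fedavg(client_weights: Sequence[Sequence[float]],
           client_sizes: Sequence[int]) -> List[float]:
    """Server step: average client parameters, weighted by local sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def fedprox_local_step(w: Sequence[float],
                       w_global: Sequence[float],
                       grad: Sequence[float],
                       lr: float = 0.1,
                       mu: float = 0.01) -> List[float]:
    """Client step: one SGD update on loss + (mu/2) * ||w - w_global||^2.

    The gradient of the proximal term is mu * (w - w_global); with mu = 0
    this reduces to a plain SGD step, as used locally in FedAvg.
    """
    return [
        wi - lr * (gi + mu * (wi - wgi))
        for wi, wgi, gi in zip(w, w_global, grad)
    ]


if __name__ == "__main__":
    # Hypothetical two-client round: the larger client dominates the average.
    print(fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

A larger `mu` keeps local models closer to the global model, at the cost of slower local progress.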

[AI-36] Stable diffusion models reveal a persisting human and AI gap in visual creativity

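[Quick read]: This paper asks whether generative AI can match humans on visual creativity tasks, in contrast to language tasks where it has shown near-human creative performance. The study compares the creativity of images produced by Visual Artists, Non Artists, and a generative AI model under two prompting conditions (high human guidance, "Human Inspired", and low human guidance, "Self Guided"), as rated by human evaluators (N=255) and GPT-4o. It finds that increased human guidance significantly improves GenAI's creative output, bringing it close to the level of Non Artists, though still below that of Visual Artists, and that human and AI raters differ markedly in their creativity judgments. The results suggest that visual creativity depends on perceptual nuance and contextual sensitivity, capacities that current generative AI has not fully acquired and that may not transfer readily from language models.

Link: https://arxiv.org/abs/2511.16814
Authors: Silvia Rondini, Claudia Alvarez-Martin, Paula Angermair-Barkai, Olivier Penacchio, M. Paz, Matthew Pelowski, Dan Dediu, Antoni Rodriguez-Fornells, Xim Cerda-Company
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: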

Abstract:While recent research suggests Large Language Models match human creative performance in divergent thinking tasks, visual creativity remains underexplored. This study compared image generation in human participants (Visual Artists and Non Artists) and using an image generation AI model (two prompting conditions with varying human input: high for Human Inspired, low for Self Guided). Human raters (N=255) and GPT4o evaluated the creativity of the resulting images. We found a clear creativity gradient, with Visual Artists being the most creative, followed by Non Artists, then Human Inspired generative AI, and finally Self Guided generative AI. Increased human guidance strongly improved GenAI’s creative output, bringing its productions close to those of Non Artists. Notably, human and AI raters also showed vastly different creativity judgment patterns. These results suggest that, in contrast to language centered tasks, GenAI models may face unique challenges in visual domains, where creativity depends on perceptual nuance and contextual sensitivity, distinctly human capacities that may not be readily transferable from language models.

[AI-37] Password Strength Analysis Through Social Network Data Exposure: A Combined Approach Relying on Data Reconstruction and Generative Models

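[Quick read]: This paper addresses the inadequacy of traditional password strength evaluation methods when users prefer easy-to-remember passwords, with the aim of reducing the security risks of weak passwords. The key to the solution is SODA ADVANCE, a data reconstruction tool that builds user profiles from publicly available data across multiple sources (such as social media) and includes a module that uses this data to evaluate password strength. The paper also investigates Large Language Models (LLMs) in two roles: generating strong, personalized passwords based on user profiles, and evaluating password strength more accurately when user profile data is taken into account.

Link: https://arxiv.org/abs/2511.16716
Authors: Maurizio Atzori, Eleonora Calò, Loredana Caruccio, Stefano Cirillo, Giuseppe Polese, Giandomenico Solimando
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: This is a post-peer-review, pre-copyedit version to be published in the Proceedings of the 33rd Symposium On Advanced Database Systems (SEBD 2025), 7 pages, 4 figures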

Abstract:Although passwords remain the primary defense against unauthorized access, users often tend to use passwords that are easy to remember. This behavior significantly increases security risks, also due to the fact that traditional password strength evaluation methods are often inadequate. In this discussion paper, we present SODA ADVANCE, a data reconstruction tool also designed to enhance evaluation processes related to the password strength. In particular, SODA ADVANCE integrates a specialized module aimed at evaluating password strength by leveraging publicly available data from multiple sources, including social media platforms. Moreover, we investigate the capabilities and risks associated with emerging Large Language Models (LLMs) in evaluating and generating passwords, respectively. Experimental assessments conducted with 100 real users demonstrate that LLMs can generate strong and personalized passwords possibly defined according to user profiles. Additionally, LLMs were shown to be effective in evaluating passwords, especially when they can take into account user profile data.

[AI-38] DDTime: Dataset Distillation with Spectral Alignment and Information Bottleneck for Time-Series Forecasting

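[Quick read]: This paper addresses the difficulty of training time-series forecasting models given large data volumes and high computational cost, proposing DDTime, a lightweight, plug-in dataset distillation framework that synthesizes compact time-series datasets preserving the learning behavior of the full data. Two challenges are central: strong autocorrelation distorts value-term alignment between teacher and student models, and the absence of explicit categorical priors leaves synthetic samples insufficiently diverse. The key to the solution is twofold: a frequency-domain alignment mechanism, together with a revisiting of value-term alignment through temporal statistics, mitigates autocorrelation-induced bias and ensures spectral consistency and temporal fidelity; and an inter-sample regularization inspired by the information bottleneck principle increases the diversity and information density of synthetic trajectories. On 20 benchmark datasets and diverse forecasting architectures, the method consistently outperforms existing distillation methods, with about 30% relative accuracy gains and about 2.49% additional computational overhead.

Link: https://arxiv.org/abs/2511.16715
Authors: Yuqi Li, Kuiye Ding, Chuanguang Yang, Hao Wang, Haoxuan Wang, Huiran Duan, Junming Liu, Yingli Tian
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 36 pages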

Abstract:Time-series forecasting is fundamental across many domains, yet training accurate models often requires large-scale datasets and substantial computational resources. Dataset distillation offers a promising alternative by synthesizing compact datasets that preserve the learning behavior of full data. However, extending dataset distillation to time-series forecasting is non-trivial due to two fundamental challenges: (1) bias from strong autocorrelation, which leads to distorted value-term alignment between teacher and student models; and (2) insufficient diversity among synthetic samples, arising from the absence of explicit categorical priors to regularize trajectory variety. In this work, we propose DDTime, a lightweight and plug-in distillation framework built upon first-order condensation decomposition. To tackle Challenge 1, it revisits value-term alignment through temporal statistics and introduces a frequency-domain alignment mechanism to mitigate autocorrelation-induced bias, ensuring spectral consistency and temporal fidelity. To address Challenge 2, we further design an inter-sample regularization inspired by the information bottleneck principle, which enhances diversity and maximizes information density across synthetic trajectories. The combined objective is theoretically compatible with a wide range of condensation paradigms and supports stable first-order optimization. Extensive experiments on 20 benchmark datasets and diverse forecasting architectures demonstrate that DDTime consistently outperforms existing distillation methods, achieving about 30% relative accuracy gains while introducing about 2.49% computational overhead. All code and distilled datasets will be released.

[AI-39] AutoBackdoor: Automating Backdoor Attacks via LLM Agents

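[Quick read]: This paper addresses backdoor attacks on large language models (LLMs) in secure deployment, in particular the limitations of existing methods that rely on manually crafted triggers and static data pipelines: rigidity, poor scalability, and inadequate evaluation of defenses. The key to the solution is AutoBackdoor, a framework that automates the full backdoor injection process (trigger generation, poisoned data construction, and model fine-tuning) through an autonomous agent-driven pipeline. Its core innovation is the use of a powerful language model agent to generate semantically coherent, context-aware trigger phrases, enabling scalable poisoning across arbitrary topics with minimal human effort. Experiments across three realistic threat scenarios (Bias Recommendation, Hallucination Injection, and Peer Review Manipulation) show over 90% attack success with only a small number of poisoned samples, and show that existing defenses often fail against such agent-driven attacks, underscoring the need for more rigorous and adaptive red-teaming frameworks.

Link: https://arxiv.org/abs/2511.16709
Authors: Yige Li, Zhe Li, Wei Zhao, Nay Myat Min, Hanxun Huang, Xingjun Ma, Jun Sun
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 23 pages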

Abstract:Backdoor attacks pose a serious threat to the secure deployment of large language models (LLMs), enabling adversaries to implant hidden behaviors triggered by specific inputs. However, existing methods often rely on manually crafted triggers and static data pipelines, which are rigid, labor-intensive, and inadequate for systematically evaluating modern defense robustness. As AI agents become increasingly capable, there is a growing need for more rigorous, diverse, and scalable red-teaming frameworks that can realistically simulate backdoor threats and assess model resilience under adversarial conditions. In this work, we introduce AutoBackdoor, a general framework for automating backdoor injection, encompassing trigger generation, poisoned data construction, and model fine-tuning via an autonomous agent-driven pipeline. Unlike prior approaches, AutoBackdoor uses a powerful language model agent to generate semantically coherent, context-aware trigger phrases, enabling scalable poisoning across arbitrary topics with minimal human effort. We evaluate AutoBackdoor under three realistic threat scenarios, including Bias Recommendation, Hallucination Injection, and Peer Review Manipulation, to simulate a broad range of attacks. Experiments on both open-source and commercial models, including LLaMA-3, Mistral, Qwen, and GPT-4o, demonstrate that our method achieves over 90% attack success with only a small number of poisoned samples. More importantly, we find that existing defenses often fail to mitigate these attacks, underscoring the need for more rigorous and adaptive evaluation techniques against agent-driven threats as explored in this work. All code, datasets, and experimental configurations will be merged into our primary repository at this https URL.

[AI-40] Multi-Agent Code Verification with Compound Vulnerability Detection

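[Quick read]: This paper addresses the high defect rate of code generated by Large Language Models (LLMs): failing bug-fix patches, frequent vulnerabilities, and existing tools with limited detection coverage and high false-positive rates. The core solution is CodeX-Verify, a multi-agent system of four specialized agents, each targeting different types of bugs. The key claim, supported by a mathematical argument and empirical measurement, is that combining agents that look for different problems improves detection (accuracy rises from 32.8% to 72.4%, and the best two-agent combination reaches 79.3%), and that verification runs without test execution at under 200 ms per sample, making it practical for production use.

Link: https://arxiv.org/abs/2511.16708
Authors: Shreshth Rajan
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 18 pages, 3 figures, 9 tables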

Abstract:LLMs generate buggy code: 29.6% of SWE-bench “solved” patches fail, 62% of BaxBench solutions have vulnerabilities, and existing tools only catch 65% of bugs with 35% false positives. We built CodeX-Verify, a multi-agent system that uses four specialized agents to detect different types of bugs. We prove mathematically that combining agents with different detection patterns finds more bugs than any single agent when the agents look for different problems, confirmed by measuring agent correlation of p = 0.05–0.25. We also show that multiple vulnerabilities in the same code create exponentially more risk than previously thought–SQL injection plus exposed credentials creates 15x more danger (risk 300 vs. 20) than traditional models predict. Testing on 99 code samples with verified labels shows our system catches 76.1% of bugs, matching the best existing method while running faster and without test execution. We tested 15 different agent combinations and found that using multiple agents improves accuracy by 39.7 percentage points (from 32.8% to 72.4%) compared to single agents, with gains of +14.9pp, +13.5pp, and +11.2pp for agents 2, 3, and 4. The best two-agent combination reaches 79.3% accuracy. Testing on 300 real patches from Claude Sonnet 4.5 runs in under 200ms per sample, making this practical for production use.
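
The intuition that agents with different detection patterns catch more bugs together can be illustrated with a simple idealized model. The sketch below assumes fully independent detectors, which is stronger than the low correlation the abstract reports, and the per-agent rates are hypothetical numbers, not results from the paper. Under independence, the probability that at least one agent flags a given bug is 1 − Π(1 − p_i).

```python
from math import prod
from typing import Sequence


def combined_detection_rate(rates: Sequence[float]) -> float:
    """Probability that at least one of several independent detectors fires.

    Each entry of `rates` is one agent's probability of flagging a bug.
    A bug is missed only if every agent misses it, so the miss
    probability is the product of the individual miss probabilities.
    """
    return 1.0 - prod(1.0 - p for p in rates)


if __name__ == "__main__":
    # Hypothetical per-agent detection rates, for illustration only.
    agents = [0.5, 0.4, 0.3, 0.2]
    for k in range(1, len(agents) + 1):
        print(k, round(combined_detection_rate(agents[:k]), 3))
```

When detectors are positively correlated, the combined rate falls below this bound, which is why the diversity of the agents matters.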

[AI-41] Large language models for automated PRISMA 2020 adherence checking

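[Quick read]: This paper addresses the burden of assessing adherence to the PRISMA 2020 guideline during peer review of systematic reviews. The key to the solution is a shareable benchmark of 108 Creative Commons-licensed systematic reviews, on which ten large language models (LLMs) are evaluated across input formats. Supplying structured PRISMA 2020 checklists (Markdown, JSON, XML, or plain text) yields much higher accuracy than manuscript-only input (78.7-79.7% vs. 45.21%, p < 0.0001), with no significant difference between structured formats. Extending evaluation with the high-sensitivity Qwen3-Max model to the full dataset (n=120) gives 95.1% sensitivity and 49.3% specificity, indicating that structured input improves automated LLM assessment of PRISMA adherence, although human expert verification is still required before editorial decisions.

Link: https://arxiv.org/abs/2511.16707
Authors: Yuki Kataoka, Ryuhei So, Masahiro Banno, Yasushi Tsujimoto, Tomohiro Takayama, Yosuke Yamagishi, Takahiro Tsuge, Norio Yamamoto, Chiaki Suda, Toshi A. Furukawa
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: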

Abstract:Evaluating adherence to the PRISMA 2020 guideline remains a burden in the peer review process. To address the lack of shareable benchmarks, we constructed a copyright-aware benchmark of 108 Creative Commons-licensed systematic reviews and evaluated ten large language models (LLMs) across five input formats. In a development cohort, supplying structured PRISMA 2020 checklists (Markdown, JSON, XML, or plain text) yielded 78.7-79.7% accuracy versus 45.21% for manuscript-only input (p < 0.0001), with no differences between structured formats (p > 0.9). Across models, accuracy ranged from 70.6-82.8% with distinct sensitivity-specificity trade-offs, replicated in an independent validation cohort. We then selected Qwen3-Max (a high-sensitivity open-weight model) and extended evaluation to the full dataset (n=120), achieving 95.1% sensitivity and 49.3% specificity. Structured checklist provision substantially improves LLM-based PRISMA assessment, though human expert verification remains essential before editorial decisions.
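
The sensitivity-specificity trade-off mentioned here follows from the standard confusion-matrix definitions. A minimal sketch, with hypothetical counts chosen only for illustration (they are not taken from the paper), and with no assumption about which checklist judgment the authors treat as the positive class:

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: share of actual positives that were flagged."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: share of actual negatives that were cleared."""
    return tn / (tn + fp)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all judgments that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


if __name__ == "__main__":
    # Hypothetical counts: a model that rarely misses positives
    # but raises many false alarms.
    tp, fn, tn, fp = 90, 10, 50, 50
    print(sensitivity(tp, fn), specificity(tn, fp), accuracy(tp, tn, fp, fn))
```

A model tuned toward high sensitivity, like the one selected in the abstract, tends to pay for it in specificity, which is consistent with the call for human verification.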

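[AI-42] RAG-Driven Data Quality Governance for Enterprise ERP Systems

[Quick read]: This paper addresses the degradation of data quality in enterprise resource planning (ERP) systems when human resources departments perform decentralized manual entry of large volumes of employee data across multiple languages. The core challenges are cross-language inconsistency, spelling errors, and duplicate entities, which undermine query accuracy and system usability. The key to the solution is an end-to-end automated pipeline in two stages. The first is a multi-stage cleaning pipeline performing translation normalization, spelling correction, and entity deduplication. The second is a GPT-4o-based retrieval-augmented generation (RAG) framework that translates natural-language questions in Turkish, Russian, and English into structured SQL queries, combining LangChain orchestration, FAISS vector similarity search, and few-shot learning with 500+ validated examples. Deployed for six months on 240,000 records, the system achieves 92.5% query validity and 95.1% schema compliance and cuts turnaround time from 2.3 days to under 5 seconds.

Link: https://arxiv.org/abs/2511.16700
Authors: Sedat Bin Vedat, Enes Kutay Yarkan, Meftun Akarsu, Recep Kaan Karaman, Arda Sar, Çağrı Çelikbilek, Savaş Saygılı
Affiliation: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: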

Abstract:Enterprise ERP systems managing hundreds of thousands of employee records face critical data quality challenges when human resources departments perform decentralized manual entry across multiple languages. We present an end-to-end pipeline combining automated data cleaning with LLM-driven SQL query generation, deployed on a production system managing 240,000 employee records over six months. The system operates in two integrated stages: a multi-stage cleaning pipeline that performs translation normalization, spelling correction, and entity deduplication during periodic synchronization from Microsoft SQL Server to PostgreSQL; and a retrieval-augmented generation framework powered by GPT-4o that translates natural-language questions in Turkish, Russian, and English into validated SQL queries. The query engine employs LangChain orchestration, FAISS vector similarity search, and few-shot learning with 500+ validated examples. Our evaluation demonstrates 92.5% query validity, 95.1% schema compliance, and 90.7% semantic accuracy on 2,847 production queries. The system reduces query turnaround time from 2.3 days to under 5 seconds while maintaining 99.2% uptime, with GPT-4o achieving 46% lower latency and 68% cost reduction versus GPT-3.5. This modular architecture provides a reproducible framework for AI-native enterprise data governance, demonstrating real-world viability at enterprise scale with 4.3/5.0 user satisfaction.

[AI-43] Joint Design of Protein Surface and Structure Using a Diffusion Bridge Model

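[Quick read]: This paper addresses a central problem in computational protein design: jointly designing diverse and physically realistic protein structures and surfaces that precisely complement a target receptor. Existing approaches struggle to balance surface geometric matching with chemical plausibility. The key to the solution is the PepBridge framework, a multi-step process: denoising diffusion bridge models (DDBMs) first map the receptor surface point cloud to a ligand surface; a multi-modal diffusion model then predicts the full protein structure, while Shape-Frame Matching Networks keep surface geometry aligned with backbone architecture. This supports surface complementarity together with conformational stability and chemical feasibility.

Link: https://arxiv.org/abs/2511.16675
Authors: Guanlue Li, Xufeng Zhao, Fang Wu, Sören Laue
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 21 pages, 4 figures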

Abstract:Protein-protein interactions (PPIs) are governed by surface complementarity and hydrophobic interactions at protein interfaces. However, designing diverse and physically realistic protein structure and surfaces that precisely complement target receptors remains a significant challenge in computational protein design. In this work, we introduce PepBridge, a novel framework for the joint design of protein surface and structure that seamlessly integrates receptor surface geometry and biochemical properties. Starting with a receptor surface represented as a 3D point cloud, PepBridge generates complete protein structures through a multi-step process. First, it employs denoising diffusion bridge models (DDBMs) to map receptor surfaces to ligand surfaces. Next, a multi-model diffusion model predicts the corresponding structure, while Shape-Frame Matching Networks ensure alignment between surface geometry and backbone architecture. This integrated approach facilitates surface complementarity, conformational stability, and chemical feasibility. Extensive validation across diverse protein design scenarios demonstrates PepBridge’s efficacy in generating structurally viable proteins, representing a significant advancement in the joint design of top-down protein structure.

[AI-44] Instance Configuration for Sustainable Job Shop Scheduling

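[Quick read]: This paper addresses the Job Shop Scheduling Problem (JSP): optimizing performance metrics such as completion time (makespan) while minimizing energy consumption, subject to constraints such as deadlines and release dates. The key to the solution is an instance configurator that generates diverse, realistic test instances from parameters such as the number of jobs, machines, tasks, and speeds, together with distributions for processing times and energy consumption. A publicly available set of 500 test instances built with it supports comprehensive evaluation and comparison of scheduling algorithms with respect to energy efficiency.

Link: https://arxiv.org/abs/2409.18972
Authors: Christian Perez, Carlos March, Miguel A. Salido
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: 26th International Workshop on Configuration (ConfWS 2024)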

Abstract:The Job Shop Scheduling Problem (JSP) is a pivotal challenge in operations research and is essential for evaluating the effectiveness and performance of scheduling algorithms. Scheduling problems are a crucial domain in combinatorial optimization, where resources (machines) are allocated to job tasks to minimize the completion time (makespan) alongside other objectives like energy consumption. This research delves into the intricacies of JSP, focusing on optimizing performance metrics and minimizing energy consumption while considering various constraints such as deadlines and release dates. Recognizing the multi-dimensional nature of benchmarking in JSP, this study underscores the significance of reference libraries and datasets like JSPLIB in enriching algorithm evaluation. The research highlights the importance of problem instance characteristics, including job and machine numbers, processing times, and machine availability, emphasizing the complexities introduced by energy consumption considerations. An innovative instance configurator is proposed, equipped with parameters such as the number of jobs, machines, tasks, and speeds, alongside distributions for processing times and energy consumption. The generated instances encompass various configurations, reflecting real-world scenarios and operational constraints. These instances facilitate comprehensive benchmarking and evaluation of scheduling algorithms, particularly in contexts of energy efficiency. A comprehensive set of 500 test instances has been generated and made publicly available, promoting further research and benchmarking in JSP. These instances enable robust analyses and foster collaboration in developing advanced, energy-efficient scheduling solutions by providing diverse scenarios. 
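
To make the idea of a parameterized instance configurator concrete, here is a minimal sketch. It is not the authors' tool: the names `Operation` and `generate_instance`, the uniform processing-time distribution, the one-task-per-machine job structure, and the rule that a faster speed shortens the task but costs more energy are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Operation:
    machine: int          # machine this task must run on
    durations: List[int]  # processing time at each speed level
    energies: List[int]   # energy consumed at each speed level


def generate_instance(n_jobs: int, n_machines: int, n_speeds: int,
                      seed: int = 0, min_time: int = 1,
                      max_time: int = 20) -> List[List[Operation]]:
    """Generate a random job-shop instance with speed-dependent energy use.

    Each job visits every machine once in a random order (assumed
    structure). Speed level s divides the base time by (s + 1) and
    multiplies the base energy by (s + 1) (assumed trade-off).
    """
    rng = random.Random(seed)  # seeded, so instances are reproducible
    jobs = []
    for _ in range(n_jobs):
        order = list(range(n_machines))
        rng.shuffle(order)
        ops = []
        for machine in order:
            base = rng.randint(min_time, max_time)
            durations = [max(1, round(base / (s + 1))) for s in range(n_speeds)]
            energies = [base * (s + 1) for s in range(n_speeds)]
            ops.append(Operation(machine, durations, energies))
        jobs.append(ops)
    return jobs


if __name__ == "__main__":
    for job in generate_instance(n_jobs=2, n_machines=3, n_speeds=2, seed=42):
        print(job)
```

Fixing the seed makes each generated instance reproducible, which matters when a benchmark set is meant to be shared and compared across algorithms.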

[AI-45] Quantum Masked Autoencoders for Vision Learning

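[Quick read]: This paper addresses the limited ability of autoencoders to learn features when data is missing, and in particular the absence of a quantum design that represents and recovers masked data features. The key to the solution is the quantum masked autoencoder (QMAE), an architecture that learns and reconstructs masked features directly within quantum states rather than classical embeddings. Experiments show improved reconstruction quality on MNIST images and an average 12.86% gain in classification accuracy over state-of-the-art quantum autoencoders in the presence of masks.

Link: https://arxiv.org/abs/2511.17372
Authors: Emma Andrews, Prabhat Mishra
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: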

Abstract:Classical autoencoders are widely used to learn features of input data. To improve the feature learning, classical masked autoencoders extend classical autoencoders to learn the features of the original input sample in the presence of masked-out data. While quantum autoencoders exist, there is no design and implementation of quantum masked autoencoders that can leverage the benefits of quantum computing and quantum autoencoders. In this paper, we propose quantum masked autoencoders (QMAEs) that can effectively learn missing features of a data sample within quantum states instead of classical embeddings. We showcase that our QMAE architecture can learn the masked features of an image and can reconstruct the masked input image with improved visual fidelity in MNIST images. Experimental evaluation highlights that QMAE can significantly outperform (12.86% on average) in classification accuracy compared to state-of-the-art quantum autoencoders in the presence of masks.

[AI-46] Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation

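[Quick read]: This paper addresses the slow progress of audio-language pretraining for general-purpose audio understanding, identifying three barriers: scarce large-scale audio-text corpora, insufficient caption diversity, and a lack of systematic exploration and evaluation. The key to the solution is CaptionStew, a dataset of 10.7M captions spanning multiple domains and captioning styles, on which the authors conduct the first comprehensive comparison of contrastive and captioning objectives across speech, music, and environmental sound tasks. Contrastive learning is more data-efficient at smaller scales, while captioning scales better on audio understanding tasks that involve language. Common supervised initialization practices yield diminishing returns at scale.

Link: https://arxiv.org/abs/2511.16757
Authors: Wei-Cheng Tseng, Xuanru Zhou, Mingyue Huo, Yiwen Shao, Hao Zhang, Dong Yu
Affiliation: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Comments: Work in progress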

Abstract:Audio-language pretraining holds promise for general-purpose audio understanding, yet remains underexplored compared to its vision counterpart. While vision-language models like CLIP serve as widely adopted foundations, existing audio-language models primarily excel at retrieval tasks with limited adoption as general-purpose encoders. We identify three key barriers: limited large-scale audio-text corpora, insufficient caption diversity, and lack of systematic exploration and evaluation. To this end, we introduce CaptionStew, a 10.7M caption dataset aggregating diverse open-source audio-text corpora across multiple domains and captioning styles. Using this resource, we conduct the first comprehensive evaluation comparing contrastive and captioning objectives for audio representation learning across speech, music, and environmental sound tasks. Our results demonstrate that audio-language pretraining yields competitive, transferable representations. Through systematic data-scaling experiments, we reveal complementary objective strengths: contrastive learning achieves superior data efficiency at smaller scales, while captioning demonstrates better scalability on language-involved audio understanding tasks. We also find that common supervised initialization practices provide diminishing returns at scale, challenging current approaches. These findings establish audio-language pretraining as a viable pathway toward general-purpose audio representations, guiding future research. To accelerate progress, we release data preparation recipes, training protocols, and pretrained models, paving the way toward universal audio understanding.
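
As a point of reference for the contrastive objective discussed here, the sketch below computes a CLIP-style symmetric InfoNCE loss over a batch of audio-text similarity scores. The abstract does not specify the exact loss, so this form, the function name `info_nce`, and the temperature value are assumptions; the similarity matrices are hypothetical.

```python
import math
from typing import List, Sequence


def info_nce(sim: Sequence[Sequence[float]], temperature: float = 0.07) -> float:
    """Symmetric contrastive loss over an n x n similarity matrix.

    sim[i][j] is the similarity between audio clip i and caption j;
    matched pairs sit on the diagonal. The loss is the mean cross-entropy
    of picking the matched item, averaged over both directions
    (audio -> text and text -> audio).
    """
    n = len(sim)

    def directional_loss(rows: Sequence[Sequence[float]]) -> float:
        total = 0.0
        for i, row in enumerate(rows):
            logits = [v / temperature for v in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]
        return total / n

    columns: List[List[float]] = [list(c) for c in zip(*sim)]
    return 0.5 * (directional_loss(sim) + directional_loss(columns))


if __name__ == "__main__":
    aligned = [[1.0, 0.0], [0.0, 1.0]]     # matched pairs score highest
    misaligned = [[0.0, 1.0], [1.0, 0.0]]  # mismatched pairs score highest
    print(info_nce(aligned), info_nce(misaligned))
```

A captioning objective instead trains a decoder to generate the caption text from the audio representation, which is the alternative the paper compares against.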

Machine Learning

[LG-0] Harnessing Data from Clustered LQR Systems: Personalized and Collaborative Policy Optimization

Link: https://arxiv.org/abs/2511.17489
Authors: Vinay Kanakeri, Shivam Bajaj, Ashwin Verma, Vijay Gupta, Aritra Mitra
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments:

Abstract:It is known that reinforcement learning (RL) is data-hungry. To improve sample-efficiency of RL, it has been proposed that the learning algorithm utilize data from ‘approximately similar’ processes. However, since the process models are unknown, identifying which other processes are similar poses a challenge. In this work, we study this problem in the context of the benchmark Linear Quadratic Regulator (LQR) setting. Specifically, we consider a setting with multiple agents, each corresponding to a copy of a linear process to be controlled. The agents’ local processes can be partitioned into clusters based on similarities in dynamics and tasks. Combining ideas from sequential elimination and zeroth-order policy optimization, we propose a new algorithm that performs simultaneous clustering and learning to output a personalized policy (controller) for each cluster. Under a suitable notion of cluster separation that captures differences in closed-loop performance across systems, we prove that our approach guarantees correct clustering with high probability. Furthermore, we show that the sub-optimality gap of the policy learned for each cluster scales inversely with the size of the cluster, with no additional bias, unlike in prior works on collaborative learning-based control. Our work is the first to reveal how clustering can be used in data-driven control to learn personalized policies that enjoy statistical gains from collaboration but do not suffer sub-optimality due to inclusion of data from dissimilar processes. From a distributed implementation perspective, our method is attractive as it incurs only a mild logarithmic communication overhead.
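
The zeroth-order policy optimization ingredient named in this abstract can be illustrated on a scalar LQR problem. The sketch below is not the paper's algorithm: it omits clustering and sequential elimination entirely, and the system constants `a`, `b`, `q`, `r`, the horizon, and the step sizes are hypothetical values picked for illustration. It only shows how a policy gradient can be estimated from cost evaluations alone.

```python
def lqr_cost(k, a=1.2, b=1.0, q=1.0, r=1.0, x0=1.0, horizon=50):
    """Finite-horizon cost of the linear policy u = -k * x on x_{t+1} = a*x_t + b*u_t.

    All constants are hypothetical illustration values.
    """
    x, cost = x0, 0.0
    for _ in range(horizon):
        u = -k * x
        cost += q * x * x + r * u * u
        x = a * x + b * u
    return cost


def zeroth_order_gradient(cost_fn, k, radius=0.01):
    """Two-point estimate of dJ/dk that uses only cost evaluations, never the model (a, b)."""
    return (cost_fn(k + radius) - cost_fn(k - radius)) / (2.0 * radius)


def optimize_gain(cost_fn, k_init=0.5, step_size=0.02, iterations=100):
    """Plain gradient descent on the feedback gain, driven by the zeroth-order estimate."""
    k = k_init
    for _ in range(iterations):
        k -= step_size * zeroth_order_gradient(cost_fn, k)
    return k
```

In higher dimensions the perturbation would typically be a random direction on a sphere rather than a fixed scalar offset, and in a collaborative setting the estimates of several agents could be averaged, which is where grouping only similar systems matters.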

[LG-1] Unmasking Airborne Threats: Guided-Transformers for Portable Aerosol Mass Spectrometry

Link: https://arxiv.org/abs/2511.17446
Authors: Kyle M. Regan, Michael McLoughlin, Wayne A. Bryden, Gonzalo R. Arce
Subjects: Machine Learning (cs.LG)
Comments: 13 pages, 9 figures. Preprint. Submitted to Computers in Biology and Medicine

Abstract:Matrix Assisted Laser Desorption/Ionization Mass Spectrometry (MALDI-MS) is a cornerstone in biomolecular analysis, offering precise identification of pathogens through unique mass spectral signatures. Yet, its reliance on labor-intensive sample preparation and multi-shot spectral averaging restricts its use to laboratory settings, rendering it impractical for real-time environmental monitoring. These limitations are especially pronounced in emerging aerosol MALDI-MS systems, where autonomous sampling generates noisy spectra for unknown aerosol analytes, requiring single-shot detection for effective analysis. Addressing these challenges, we propose the Mass Spectral Dictionary-Guided Transformer (MS-DGFormer): a data-driven framework that redefines spectral analysis by directly processing raw, minimally prepared mass spectral data. MS-DGFormer leverages a transformer architecture, designed to capture the long-range dependencies inherent in these time-series spectra. To enhance feature extraction, we introduce a novel dictionary encoder that integrates denoised spectral information derived from Singular Value Decomposition (SVD), enabling the model to discern critical biomolecular patterns from single-shot spectra with robust performance. This innovation provides a system to achieve superior pathogen identification from aerosol samples, facilitating autonomous, real-time analysis in field conditions. By eliminating the need for extensive preprocessing, our method unlocks the potential for portable, deployable MALDI-MS platforms, revolutionizing environmental pathogen detection and rapid response to biological threats.

[LG-2] A Framework for Adaptive Stabilisation of Nonlinear Stochastic Systems

Link: https://arxiv.org/abs/2511.17436
Authors: Seth Siriya, Jingge Zhu, Dragan Nešić, Ye Pu
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 22 pages, 1 figure

Abstract:We consider the adaptive control problem for discrete-time, nonlinear stochastic systems with linearly parameterised uncertainty. Assuming access to a parameterised family of controllers that can stabilise the system in a bounded set within an informative region of the state space when the parameter is well-chosen, we propose a certainty equivalence learning-based adaptive control strategy, and subsequently derive stability bounds on the closed-loop system that hold for some probabilities. We then show that if the entire state space is informative, and the family of controllers is globally stabilising with appropriately chosen parameters, high probability stability guarantees can be derived.

[LG-3] Multi-Agent Pointer Transformer: Seq-to-Seq Reinforcement Learning for Multi-Vehicle Dynamic Pickup-Delivery Problems

Link: https://arxiv.org/abs/2511.17435
Authors: Zengyu Zou, Jingyuan Wang, Yixuan Huang, Junjie Wu
Subjects: Machine Learning (cs.LG)
Comments: 15 pages

Abstract:This paper addresses the cooperative Multi-Vehicle Dynamic Pickup and Delivery Problem with Stochastic Requests (MVDPDPSR) and proposes an end-to-end centralized decision-making framework based on sequence-to-sequence modeling, named Multi-Agent Pointer Transformer (MAPT). MVDPDPSR is an extension of the vehicle routing problem and a spatio-temporal system optimization problem, widely applied in scenarios such as on-demand delivery. Classical operations research methods face bottlenecks in computational complexity and time efficiency when handling large-scale dynamic problems. Although existing reinforcement learning methods have achieved some progress, they still encounter several challenges: 1) Independent decoding across multiple vehicles fails to model joint action distributions; 2) The feature extraction network struggles to capture inter-entity relationships; 3) The joint action space is exponentially large. To address these issues, we designed the MAPT framework, which employs a Transformer Encoder to extract entity representations, combines a Transformer Decoder with a Pointer Network to generate joint action sequences in an autoregressive manner, and introduces a Relation-Aware Attention module to capture inter-entity relationships. Additionally, we guide the model’s decision-making using informative priors to facilitate effective exploration. Experiments on 8 datasets demonstrate that MAPT significantly outperforms existing baseline methods in terms of performance and exhibits substantial computational time advantages compared to classical operations research methods.

[LG-4] Towards fully differentiable neural ocean model with Veros

Link: https://arxiv.org/abs/2511.17427
Authors: Etienne Meunier, Said Ouala, Hugo Frezat, Julien Le Sommer, Ronan Fablet
Subjects: Machine Learning (cs.LG)
Comments: Accepted to Differentiable Systems and Scientific Machine Learning (workshop, EurIPS 2025)

Abstract:We present a differentiable extension of the VEROS ocean model, enabling automatic differentiation through its dynamical core. We describe the key modifications required to make the model fully compatible with the JAX automatic differentiation framework and evaluate the numerical consistency of the resulting implementation. Two illustrative applications are then demonstrated: (i) the correction of an initial ocean state through gradient-based optimization, and (ii) the calibration of unknown physical parameters directly from model observations. These examples highlight how differentiable programming can facilitate end-to-end learning and parameter tuning in ocean modeling. Our implementation is available online.

[LG-5] CREST: Improving Interpretability and Effectiveness of Troubleshooting at Ericsson through Criterion-Specific Trouble Report Retrieval

Link: https://arxiv.org/abs/2511.17417
Authors: Soroush Javdan, Pragash Krishnamoorthy, Olga Baysal
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments:

Abstract:The rapid evolution of the telecommunication industry necessitates efficient troubleshooting processes to maintain network reliability, software maintainability, and service quality. Trouble Reports (TRs), which document issues in Ericsson’s production system, play a critical role in facilitating the timely resolution of software faults. However, the complexity and volume of TR data, along with the presence of diverse criteria that reflect different aspects of each fault, present challenges for retrieval systems. Building on prior work at Ericsson, which utilized a two-stage workflow, comprising Initial Retrieval (IR) and Re-Ranking (RR) stages, this study investigates different TR observation criteria and their impact on the performance of retrieval models. We propose CREST (Criteria-specific Retrieval via Ensemble of Specialized TR models), a criterion-driven retrieval approach that leverages specialized models for different TR fields to improve both effectiveness and interpretability, thereby enabling quicker fault resolution and supporting software maintenance. CREST utilizes specialized models trained on specific TR criteria and aggregates their outputs to capture diverse and complementary signals. This approach leads to enhanced retrieval accuracy, better calibration of predicted scores, and improved interpretability by providing relevance scores for each criterion, helping users understand why specific TRs were retrieved. Using a subset of Ericsson’s internal TRs, this research demonstrates that criterion-specific models significantly outperform a single model approach across key evaluation metrics. This highlights the importance of all targeted criteria used in this study for optimizing the performance of retrieval systems.

[LG-6] SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding

Link: https://arxiv.org/abs/2511.17411
Authors: Nikolay Nikolov, Giuliano Albanese, Sombit Dey, Aleksandar Yanev, Luc Van Gool, Jan-Nico Zaech, Danda Pani Paudel
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Abstract:Robotic Foundation Models (RFMs) hold great promise as generalist, end-to-end systems for robot control. Yet their ability to generalize across new environments, tasks, and embodiments remains limited. We argue that a major bottleneck lies in their foundations: most RFMs are built by fine-tuning internet-pretrained Vision-Language Models (VLMs). However, these VLMs are trained on 2D image-language tasks and lack the 3D spatial reasoning inherently required for embodied control in the 3D world. Bridging this gap directly with large-scale robotic data is costly and difficult to scale. Instead, we propose to enrich easy-to-collect non-robotic image data with 3D annotations and enhance a pretrained VLM with 3D understanding capabilities. Following this strategy, we train SPEAR-VLM, a 3D-aware VLM that infers object coordinates in 3D space from a single 2D image. Building on SPEAR-VLM, we introduce our main contribution, SPEAR-1: a robotic foundation model that integrates grounded 3D perception with language-instructed embodied control. Trained on ~45M frames from 24 Open X-Embodiment datasets, SPEAR-1 outperforms or matches state-of-the-art models such as $\pi_0$-FAST and $\pi_{0.5}$, while it uses 20× fewer robot demonstrations. This carefully-engineered training strategy unlocks new VLM capabilities and as a consequence boosts the reliability of embodied control beyond what is achievable with only robotic data. We make our model weights and 3D-annotated datasets publicly available.

[LG-7] Stable Coresets via Posterior Sampling: Aligning Induced and Full Loss Landscapes NEURIPS2025

Link: https://arxiv.org/abs/2511.17399
Authors: Wei-Kai Chang, Rajiv Khanna
Subjects: Machine Learning (cs.LG)
Comments: NeurIPS 2025

Abstract:As deep learning models continue to scale, the growing computational demands have amplified the need for effective coreset selection techniques. Coreset selection aims to accelerate training by identifying small, representative subsets of data that approximate the performance of the full dataset. Among various approaches, gradient-based methods stand out due to their strong theoretical underpinnings and practical benefits, particularly under limited data budgets. However, these methods face challenges such as naive stochastic gradient descent (SGD) acting as a surprisingly strong baseline and the breakdown of representativeness due to loss curvature mismatches over time. In this work, we propose a novel framework that addresses these limitations. First, we establish a connection between posterior sampling and loss landscapes, enabling robust coreset selection even in high data corruption scenarios. Second, we introduce a smoothed loss function based on posterior sampling onto the model weights, enhancing stability and generalization while maintaining computational efficiency. We also present a novel convergence analysis for our sampling-based coreset selection method. Finally, through extensive experiments, we demonstrate how our approach achieves faster training and enhanced generalization across diverse datasets than the current state of the art.

[LG-8] A Unified Stability Analysis of SAM vs SGD: Role of Data Coherence and Emergence of Simplicity Bias NEURIPS2025

Link: https://arxiv.org/abs/2511.17378
Authors: Wei-Kai Chang, Rajiv Khanna
Subjects: Machine Learning (cs.LG)
Comments: NeurIPS 2025

Abstract:Understanding the dynamics of optimization in deep learning is increasingly important as models scale. While stochastic gradient descent (SGD) and its variants reliably find solutions that generalize well, the mechanisms driving this generalization remain unclear. Notably, these algorithms often prefer flatter or simpler minima, particularly in overparameterized settings. Prior work has linked flatness to generalization, and methods like Sharpness-Aware Minimization (SAM) explicitly encourage flatness, but a unified theory connecting data structure, optimization dynamics, and the nature of learned solutions is still lacking. In this work, we develop a linear stability framework that analyzes the behavior of SGD, random perturbations, and SAM, particularly in two-layer ReLU networks. Central to our analysis is a coherence measure that quantifies how gradient curvature aligns across data points, revealing why certain minima are stable and favored during training.
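
For readers unfamiliar with how SAM differs from plain SGD, the following is a minimal sketch of the standard SAM update on a toy quadratic loss. The loss, the curvature values, and the hyperparameters `lr` and `rho` are hypothetical choices for illustration; the paper's stability analysis and coherence measure are not reproduced here.

```python
import math


def loss(w, curvatures):
    """Toy loss L(w) = 0.5 * sum(c_i * w_i**2); a larger c_i means a sharper direction."""
    return 0.5 * sum(c * x * x for c, x in zip(curvatures, w))


def grad(w, curvatures):
    """Gradient of the toy loss."""
    return [c * x for c, x in zip(curvatures, w)]


def sgd_step(w, curvatures, lr):
    """Plain gradient step (full gradient here, since the toy loss has no data noise)."""
    g = grad(w, curvatures)
    return [x - lr * gi for x, gi in zip(w, g)]


def sam_step(w, curvatures, lr, rho):
    """One SAM step: move to w + eps with ||eps|| = rho along the gradient,
    then update the original w using the gradient evaluated at the perturbed point."""
    g = grad(w, curvatures)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0  # guard against a zero gradient
    w_perturbed = [x + rho * gi / norm for x, gi in zip(w, g)]
    g_perturbed = grad(w_perturbed, curvatures)
    return [x - lr * gi for x, gi in zip(w, g_perturbed)]
```

Because the perturbed gradient is larger along high-curvature directions, a SAM step pushes harder on sharp directions than an SGD step with the same learning rate, which is one informal way to see why it disfavors sharp minima.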

[LG-9] R2PS: Worst-Case Robust Real-Time Pursuit Strategies under Partial Observability

Link: https://arxiv.org/abs/2511.17367
Authors: Runyu Lu, Ruochuan Shi, Yuanheng Zhu, Dongbin Zhao
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Computing worst-case robust strategies in pursuit-evasion games (PEGs) is time-consuming, especially when real-world factors like partial observability are considered. While important for general security purposes, real-time applicable pursuit strategies for graph-based PEGs are currently missing when the pursuers only have imperfect information about the evader’s position. Although state-of-the-art reinforcement learning (RL) methods like Equilibrium Policy Generalization (EPG) and Grasper provide guidelines for learning graph neural network (GNN) policies robust to different game dynamics, they are restricted to the scenario of perfect information and do not take into account the possible case where the evader can predict the pursuers’ actions. This paper introduces the first approach to worst-case robust real-time pursuit strategies (R2PS) under partial observability. We first prove that a traditional dynamic programming (DP) algorithm for solving Markov PEGs maintains optimality under the asynchronous moves by the evader. Then, we propose a belief preservation mechanism about the evader’s possible positions, extending the DP pursuit strategies to a partially observable setting. Finally, we embed the belief preservation into the state-of-the-art EPG framework to finish our R2PS learning scheme, which leads to a real-time pursuer policy through cross-graph reinforcement learning against the asynchronous-move DP evasion strategies. After reinforcement learning, our policy achieves robust zero-shot generalization to unseen real-world graph structures and consistently outperforms the policy directly trained on the test graphs by the existing game RL approach.

[LG-10] Convergence and stability of Q-learning in Hierarchical Reinforcement Learning

Link: https://arxiv.org/abs/2511.17351
Authors: Massimiliano Manenti, Andrea Iannelli
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments:

Abstract:Hierarchical Reinforcement Learning promises, among other benefits, to efficiently capture and utilize the temporal structure of a decision-making problem and to enhance continual learning capabilities, but theoretical guarantees lag behind practice. In this paper, we propose a Feudal Q-learning scheme and investigate under which conditions its coupled updates converge and are stable. By leveraging the theory of Stochastic Approximation and the ODE method, we present a theorem stating the convergence and stability properties of Feudal Q-learning. This provides a principled convergence and stability analysis tailored to Feudal RL. Moreover, we show that the updates converge to a point that can be interpreted as an equilibrium of a suitably defined game, opening the door to game-theoretic approaches to Hierarchical RL. Lastly, experiments based on the Feudal Q-learning algorithm support the outcomes anticipated by theory.

[LG-11] ReBaPL: Repulsive Bayesian Prompt Learning

Link: https://arxiv.org/abs/2511.17339
Authors: Yassir Bendou, Omar Ezzahir, Eduardo Fernandes Montesuma, Gabriel Mahuas, Victoria Shevchenko, Mike Gartrell
Subjects: Machine Learning (cs.LG)
Comments: Under review

Abstract:Prompt learning has emerged as an effective technique for fine-tuning large-scale foundation models for downstream tasks. However, conventional prompt tuning methods are prone to overfitting and can struggle with out-of-distribution generalization. To address these limitations, Bayesian prompt learning has been proposed, which frames prompt optimization as a Bayesian inference problem to enhance robustness. This paper introduces Repulsive Bayesian Prompt Learning (ReBaPL), a novel method for Bayesian prompt learning, designed to efficiently explore the complex and often multimodal posterior landscape of prompts. Our method integrates a cyclical step-size schedule with a stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm, enabling alternating phases of exploration to discover new modes, and exploitation to refine existing modes. Furthermore, we introduce a repulsive force derived from a potential function over probability metrics (including Maximum Mean Discrepancy and Wasserstein distance) computed on the distributions of representations produced by different prompts. This representation-space repulsion diversifies exploration and prevents premature collapse to a single mode. Our approach allows for a more comprehensive characterization of the prompt posterior distribution, leading to improved generalization. In contrast to prior Bayesian prompt learning methods, our method provides a modular plug-and-play Bayesian extension of any existing prompt learning method based on maximum likelihood estimation. We demonstrate the efficacy of ReBaPL on several benchmark datasets, showing superior performance over state-of-the-art methods for prompt learning.
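
The alternation between exploration and exploitation described above comes from the cyclical step-size schedule. The sketch below uses the cosine schedule common in cyclical stochastic-gradient MCMC; whether ReBaPL uses exactly this form is an assumption, and the function names and the 50/50 phase split are hypothetical.

```python
import math


def cyclical_step_size(iteration, total_iterations, num_cycles, base_step):
    """Cosine cyclical schedule: restarts at base_step at the start of every cycle
    and decays toward zero by the end of the cycle."""
    cycle_length = math.ceil(total_iterations / num_cycles)
    position = (iteration % cycle_length) / cycle_length  # position within the cycle, in [0, 1)
    return 0.5 * base_step * (math.cos(math.pi * position) + 1.0)


def phase(iteration, total_iterations, num_cycles, exploration_fraction=0.5):
    """Label the large-step start of each cycle as exploration and the rest as exploitation."""
    cycle_length = math.ceil(total_iterations / num_cycles)
    position = (iteration % cycle_length) / cycle_length
    return "exploration" if position < exploration_fraction else "exploitation"
```

Large steps at the start of a cycle let the sampler jump to a new mode, while the small steps at the end refine samples around the current mode.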

[LG-12] Self-supervised denoising of raw tomography detector data for improved image reconstruction

Link: https://arxiv.org/abs/2511.17312
Authors: Israt Jahan Tulin, Sebastian Starke, Dominic Windisch, André Bieberle, Peter Steinbach
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Ultrafast electron beam X-ray computed tomography produces noisy data due to short measurement times, causing reconstruction artifacts and limiting overall image quality. To counteract these issues, two self-supervised deep learning methods for denoising of raw detector data were investigated and compared against a non-learning based denoising method. We found that the application of the deep-learning-based methods was able to enhance signal-to-noise ratios in the detector data and also led to consistent improvements of the reconstructed images, outperforming the non-learning based method.

[LG-13] SAVeD: Semantic Aware Version Discovery

Link: https://arxiv.org/abs/2511.17298
Authors: Artem Frenk, Roee Shraga
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 11 pages, 6 figures

Abstract:Our work introduces SAVeD (Semantically Aware Version Detection), a contrastive learning-based framework for identifying versions of structured datasets without relying on metadata, labels, or integration-based assumptions. SAVeD addresses a common challenge in data science: labor is repeated because similar work or transformations on datasets are difficult to identify. SAVeD employs a modified SimCLR pipeline, generating augmented table views through random transformations (e.g., row deletion, encoding perturbations). These views are embedded via a custom transformer encoder and contrasted in latent space to optimize semantic similarity. Our model learns to minimize distances between augmented views of the same dataset and maximize those between unrelated tables. We evaluate performance using validation accuracy and separation, defined respectively as the proportion of correctly classified version/non-version pairs on a hold-out set, and the difference between average similarities of versioned and non-versioned tables (defined by a benchmark, and not provided to the model). Our experiments span five canonical datasets from the Semantic Versioning in Databases Benchmark, and demonstrate substantial gains post-training. SAVeD achieves significantly higher accuracy on completely unseen tables and a significant boost in separation scores, confirming its capability to distinguish semantically altered versions. Compared to untrained baselines and prior state-of-the-art dataset-discovery methods like Starmie, our custom encoder achieves competitive or superior results.
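
The SimCLR-style objective mentioned above can be sketched with a row-deletion augmentation and the NT-Xent contrastive loss. This is a generic illustration, not SAVeD's modified pipeline: the function names, the `keep_prob` and `temperature` values, and the use of plain lists as embeddings are all assumptions, and the transformer encoder is left out.

```python
import math
import random


def drop_rows(table, keep_prob, rng):
    """Augmentation: keep each row independently with probability keep_prob
    (falls back to one random row so the view is never empty)."""
    kept = [row for row in table if rng.random() < keep_prob]
    return kept or [rng.choice(table)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def nt_xent(views_a, views_b, temperature=0.5):
    """NT-Xent loss: views_a[i] and views_b[i] embed two augmentations of the same table
    (a positive pair); every other embedding in the batch acts as a negative."""
    z = list(views_a) + list(views_b)
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)
        denominator = sum(
            math.exp(cosine(z[i], z[k]) / temperature) for k in range(2 * n) if k != i
        )
        total -= math.log(math.exp(cosine(z[i], z[positive]) / temperature) / denominator)
    return total / (2 * n)
```

The loss is small when the two views of each table embed close together and far from views of other tables, which matches the training goal stated in the abstract.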

[LG-14] Automobile demand forecasting: Spatiotemporal and hierarchical modeling, life cycle dynamics, and user-generated online information

Link: https://arxiv.org/abs/2511.17275
Authors: Tom Nahrendorf, Stefan Minner, Helfried Binder, Richard Zinck
Subjects: Machine Learning (cs.LG); Applications (stat.AP)
Comments:

Abstract:Premium automotive manufacturers face increasingly complex forecasting challenges due to high product variety, sparse variant-level data, and volatile market dynamics. This study addresses monthly automobile demand forecasting across a multi-product, multi-market, and multi-level hierarchy using data from a German premium manufacturer. The methodology combines point and probabilistic forecasts across strategic and operational planning levels, leveraging ensembles of LightGBM models with pooled training sets, quantile regression, and a mixed-integer linear programming reconciliation approach. Results highlight that spatiotemporal dependencies, as well as rounding bias, significantly affect forecast accuracy, underscoring the importance of integer forecasts for operational feasibility. Shapley analysis shows that short-term demand is reactive, shaped by life cycle maturity, autoregressive momentum, and operational signals, whereas medium-term demand reflects anticipatory drivers such as online engagement, planning targets, and competitive indicators, with online behavioral data considerably improving accuracy at disaggregated levels.
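
The probabilistic forecasts mentioned above rely on quantile regression, which is usually trained and scored with the pinball (quantile) loss. The sketch below shows that loss in isolation; it is a generic illustration with hypothetical function names, not the paper's LightGBM ensemble or its reconciliation step.

```python
def pinball_loss(y_true, y_pred, quantile):
    """Pinball loss for a single observation: under-prediction is weighted by `quantile`,
    over-prediction by (1 - quantile)."""
    error = y_true - y_pred
    return max(quantile * error, (quantile - 1.0) * error)


def mean_pinball_loss(observations, prediction, quantile):
    """Average pinball loss of one constant quantile forecast over a set of observations."""
    return sum(pinball_loss(y, prediction, quantile) for y in observations) / len(observations)
```

For a 0.9 quantile, forecasting too low costs nine times as much per unit as forecasting too high, so the minimizer sits near the top of the demand distribution rather than at its center.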

[LG-15] Enforcing governing equation constraints in neural PDE solvers via training-free projections NEURIPS2025

Link: https://arxiv.org/abs/2511.17258
Authors: Omer Rochman, Gilles Louppe
Subjects: Machine Learning (cs.LG)
Comments: Machine Learning and the Physical Sciences, NeurIPS 2025, San Diego

Abstract:Neural PDE solvers used for scientific simulation often violate governing equation constraints. While linear constraints can be projected cheaply, many constraints are nonlinear, complicating projection onto the feasible set. Dynamical PDEs are especially difficult because constraints induce long-range dependencies in time. In this work, we evaluate two training-free, post hoc projections of approximate solutions: a nonlinear optimization-based projection, and a local linearization-based projection using Jacobian-vector and vector-Jacobian products. We analyze constraints across representative PDEs and find that both projections substantially reduce violations and improve accuracy over physics-informed baselines.
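
The contrast between cheap linear projections and linearization-based projections for nonlinear constraints can be sketched on two toy constraints. Everything here is an assumption made for illustration: the constraints (a fixed sum and a fixed quadratic "energy"), the function names, and the use of an analytic Jacobian in place of the Jacobian-vector and vector-Jacobian products the paper refers to.

```python
def project_onto_sum(u, target_sum):
    """Orthogonal projection onto the linear constraint sum(u) == target_sum (closed form)."""
    shift = (sum(u) - target_sum) / len(u)
    return [x - shift for x in u]


def project_onto_energy(u, target_energy, iterations=20, tol=1e-12):
    """Repeated linearized projection onto the nonlinear constraint sum(u_i**2) == target_energy.

    Each pass linearizes g(u) = sum(u_i**2) - target_energy at the current iterate and applies
    the minimum-norm correction u <- u - J^T (J J^T)^{-1} g(u), where J = 2u.
    Assumes u is not the zero vector.
    """
    u = list(u)
    for _ in range(iterations):
        energy = sum(x * x for x in u)
        violation = energy - target_energy
        if abs(violation) < tol:
            break
        scale = 2.0 * violation / (4.0 * energy)  # 2 * g(u) / (J J^T), since J J^T = 4 * energy
        u = [x - scale * x for x in u]
    return u
```

The linear case needs a single step, while the nonlinear case needs a few passes because each linearization is only locally accurate.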

[LG-16] FlexiFlow: decomposable flow matching for generation of flexible molecular ensemble

Link: https://arxiv.org/abs/2511.17249
Authors: Riccardo Tedoldi, Ola Engkvist, Patrick Bryant, Hossein Azizpour, Jon Paul Janet, Alessandro Tibo
Subjects: Machine Learning (cs.LG)
Comments: Preprint. Code to be released upon full publication

Abstract:Sampling useful three-dimensional molecular structures along with their most favorable conformations is a key challenge in drug discovery. Current state-of-the-art 3D de-novo design flow matching or diffusion-based models are limited to generating a single conformation. However, the conformational landscape of a molecule determines its observable properties and how tightly it is able to bind to a given protein target. By generating a representative set of low-energy conformers, we can more directly assess these properties and potentially improve the ability to generate molecules with desired thermodynamic observables. Towards this aim, we propose FlexiFlow, a novel architecture that extends flow-matching models, allowing for the joint sampling of molecules along with multiple conformations while preserving both equivariance and permutation invariance. We demonstrate the effectiveness of our approach on the QM9 and GEOM Drugs datasets, achieving state-of-the-art results in molecular generation tasks. Our results show that FlexiFlow can generate valid, unstrained, unique, and novel molecules with high fidelity to the training data distribution, while also capturing the conformational diversity of molecules. Moreover, we show that our model can generate conformational ensembles that provide similar coverage to state-of-the-art physics-based methods at a fraction of the inference time. Finally, FlexiFlow can be successfully transferred to the protein-conditioned ligand generation task, even when the dataset contains only static pockets without accompanying conformations.

[LG-17] Fast Decoding for Non-Adaptive Learning of Erdős–Rényi Random Graphs

Link: https://arxiv.org/abs/2511.17240
Authors: Hoang Ta, Jonathan Scarlett
Subjects: Information Theory (cs.IT); Discrete Mathematics (cs.DM); Machine Learning (cs.LG); Probability (math.PR)
Comments:

Abstract:We study the problem of learning an unknown graph via group queries on node subsets, where each query reports whether at least one edge is present among the queried nodes. In general, learning arbitrary graphs with $n$ nodes and $k$ edges is hard in the non-adaptive setting, requiring $\Omega\big(\min\{k^2\log n, n^2\}\big)$ tests even when a small error probability is allowed. We focus on learning Erdős–Rényi (ER) graphs $G \sim \mathrm{ER}(n,q)$ in the non-adaptive setting, where the expected number of edges is $\bar{k} = q\binom{n}{2}$, and we aim to design an efficient testing–decoding scheme achieving asymptotically vanishing error probability. Prior work (Li–Fresacher–Scarlett, NeurIPS 2019) presents a testing–decoding scheme that attains an order-optimal number of tests $O(\bar{k}\log n)$ but incurs $\Omega(n^2)$ decoding time, whereas their proposed sublinear-time algorithm incurs an extra $(\log \bar{k})(\log n)$ factor in the number of tests. We extend the binary splitting approach, recently developed for non-adaptive group testing, to the ER graph learning setting, and prove that the edge set can be recovered with high probability using $O(\bar{k}\log n)$ tests while attaining decoding time $O(\bar{k}^{1+\delta}\log n)$ for any fixed $\delta > 0$.
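
The query model is easy to state in code. The sketch below samples an ER graph, runs a random non-adaptive test design, and decodes with a naive elimination rule that inspects every node pair. It is a slow baseline for illustration only: it is not the binary splitting decoder of the paper, and the function names and the per-node inclusion probability are hypothetical.

```python
import random
from itertools import combinations


def sample_er_graph(n, q, rng):
    """ER(n, q): each of the C(n, 2) node pairs is an edge independently with probability q."""
    return {(u, v) for u, v in combinations(range(n), 2) if rng.random() < q}


def group_query(edges, nodes):
    """A test is positive iff at least one edge has both endpoints among the queried nodes."""
    node_set = set(nodes)
    return any(u in node_set and v in node_set for u, v in edges)


def random_design(n, num_tests, include_prob, rng):
    """Non-adaptive design: all tests are fixed up front, each including every node
    independently with probability include_prob."""
    return [[v for v in range(n) if rng.random() < include_prob] for _ in range(num_tests)]


def decode_by_elimination(n, tests, outcomes):
    """Naive decoder: a pair inside any negative test cannot be an edge; all remaining
    pairs are declared edges. It enumerates all pairs, hence the quadratic cost in n."""
    candidates = set(combinations(range(n), 2))
    for nodes, positive in zip(tests, outcomes):
        if not positive:
            candidates -= set(combinations(sorted(nodes), 2))
    return candidates
```

With noiseless tests this decoder never misses a true edge, but it may report false positives when too few tests are used, and its running time is what fast decoders aim to avoid.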

[LG-18] Generating transition states of chemical reactions via distance-geometry-based flow matching

Link: https://arxiv.org/abs/2511.17229
Authors: Yufei Luo, Xiang Gu, Jian Sun
Subjects: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Comments:

Abstract:Transition states (TSs) are crucial for understanding reaction mechanisms, yet their exploration is limited by the complexity of experimental and computational approaches. Here we propose TS-DFM, a flow matching framework that predicts TSs from reactants and products. By operating in molecular distance geometry space, TS-DFM explicitly captures the dynamic changes of interatomic distances in chemical reactions. A network structure named TSDVNet is designed to learn the velocity field for generating TS geometries accurately. On the benchmark dataset Transition1X, TS-DFM outperforms the previous state-of-the-art method React-OT by 30% in structural accuracy. These predicted TSs provide high-quality initial structures, accelerating the convergence of CI-NEB optimization. Additionally, TS-DFM can identify alternative reaction paths. In our experiments, it even discovers a more favorable TS with a lower energy barrier. Further tests on the RGD1 dataset confirm its strong generalization ability on unseen molecules and reaction types, highlighting its potential for facilitating reaction exploration.

[LG-19] DelTriC: A Novel Clustering Method with Accurate Outlier AISTATS

Link: https://arxiv.org/abs/2511.17219
Authors: Tomas Javurek, Michal Gregor, Sebastian Kula, Marian Simko
Subjects: Machine Learning (cs.LG)
Comments: 10 pages, submitted to AISTATS

Abstract:The paper introduces DelTriC (Delaunay Triangulation Clustering), a clustering algorithm which integrates PCA/UMAP-based projection, Delaunay triangulation, and a novel back-projection mechanism to form clusters in the original high-dimensional space. DelTriC decouples neighborhood construction from decision-making by first triangulating in a low-dimensional proxy to index local adjacency, and then back-projecting to the original space to perform robust edge pruning, merging, and anomaly detection. DelTriC can outperform traditional methods such as k-means, DBSCAN, and HDBSCAN in many scenarios; it is both scalable and accurate, and it also significantly improves outlier detection.

[LG-20] Reconstruction of Surface EMG Signal using IMU data for Upper Limb Actions

Link: https://arxiv.org/abs/2511.17200
Authors: Shubhranil Basak, Mada Hemanth, Madhav Rao
Subjects: Machine Learning (cs.LG)
Comments: 5 pages, 5 figures

Abstract:Surface Electromyography (sEMG) provides vital insights into muscle function, but it can be noisy and challenging to acquire. Inertial Measurement Units (IMUs) provide a robust and wearable alternative to motion capture systems. This paper investigates the synthesis of normalized sEMG signals from 6-axis IMU data using a deep learning approach. We collected simultaneous sEMG and IMU data sampled at 1 kHz for various arm movements. A Sliding-Window-Wave-Net model, based on dilated causal convolutions, was trained to map the IMU data to the sEMG signal. The results show that the model successfully predicts the timing and general shape of muscle activations. Although peak amplitudes were often underestimated, the high temporal fidelity demonstrates the feasibility of using this method for muscle intent detection in applications such as prosthetics and rehabilitation biofeedback.
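
The dilated causal convolution at the core of WaveNet-style models can be written in a few lines. This is a single-channel sketch with hypothetical function names; the paper's Sliding-Window-Wave-Net architecture, its layer counts, and its dilation pattern are not specified in the abstract and are not reproduced here.

```python
def dilated_causal_conv(signal, weights, dilation):
    """y[t] = sum_k weights[k] * signal[t - k * dilation], with samples before t = 0 taken as zero.

    Causal: the output at time t never depends on samples later than t.
    """
    output = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * signal[idx]
        output.append(acc)
    return output


def receptive_field(kernel_size, dilations):
    """Number of past samples (including the current one) seen by a stack of dilated layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

As an illustration, a stack with kernel size 2 and the hypothetical dilations 1, 2, 4, 8 sees 16 samples, which at a 1 kHz sampling rate corresponds to 16 ms of IMU history per predicted sEMG sample.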

[LG-21] Four decades of circumpolar super-resolved satellite land surface temperature data

Link: https://arxiv.org/abs/2511.17134
Authors: Sonia Dupuis, Nando Metzger, Konrad Schindler, Frank Göttsche, Stefan Wunderle
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Land surface temperature (LST) is an essential climate variable (ECV) crucial for understanding land-atmosphere energy exchange and monitoring climate change, especially in the rapidly warming Arctic. Long-term satellite-based LST records, such as those derived from the Advanced Very High Resolution Radiometer (AVHRR), are essential for detecting climate trends. However, the coarse spatial resolution of AVHRR’s global area coverage (GAC) data limits their utility for analyzing fine-scale permafrost dynamics and other surface processes in the Arctic. This paper presents a new 42-year pan-Arctic LST dataset, downscaled from AVHRR GAC to 1 km with a super-resolution algorithm based on a deep anisotropic diffusion model. The model is trained on MODIS LST data, using coarsened inputs and native-resolution outputs, guided by high-resolution land cover, digital elevation, and vegetation height maps. The resulting dataset provides twice-daily, 1 km LST observations for the entire pan-Arctic region over four decades. This enhanced dataset enables improved modelling of permafrost, reconstruction of near-surface air temperature, and assessment of surface mass balance of the Greenland Ice Sheet. Additionally, it supports climate monitoring efforts in the pre-MODIS era and offers a framework adaptable to future satellite missions for thermal infrared observation and climate data record continuity.

[LG-22] Layer-wise Weight Selection for Power-Efficient Neural Network Acceleration

Link: https://arxiv.org/abs/2511.17123
Authors: Jiaxun Fang, Li Zhang, Shaoyi Huang
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments:

Abstract:Systolic array accelerators execute CNNs with energy dominated by the switching activity of multiply-accumulate (MAC) units. Although prior work exploits weight-dependent MAC power for compression, existing methods often use global activation models, coarse energy proxies, or layer-agnostic policies, which limits their effectiveness on real hardware. We propose an energy-aware, layer-wise compression framework that explicitly leverages MAC-level and layer-level energy characteristics. First, we build a layer-aware MAC energy model that combines per-layer activation statistics with an MSB-Hamming distance grouping of 22-bit partial sum transitions, and integrate it with a tile-level systolic mapping to estimate convolution-layer energy. On top of this model, we introduce an energy-accuracy co-optimized weight selection algorithm within quantization-aware training and an energy-prioritized layer-wise schedule that compresses high-energy layers more aggressively under a global accuracy constraint. Experiments on different CNN models demonstrate up to 58.6% energy reduction with a 2-3% accuracy drop, outperforming a state-of-the-art power-aware baseline.

[LG-23] Hash Collisions in Molecular Fingerprints: Effects on Property Prediction and Bayesian Optimization NEURIPS2025

Link: https://arxiv.org/abs/2511.17078
Authors: Walter Virany, Austin Tripp
Categories: Machine Learning (cs.LG)
Comments: NeurIPS 2025 AI4Science workshop. Code: this https URL. OpenReview: this https URL

Abstract:Molecular fingerprinting methods use hash functions to create fixed-length vector representations of molecules. However, hash collisions cause distinct substructures to be represented with the same feature, leading to overestimates in molecular similarity calculations. We investigate whether using exact fingerprints improves accuracy compared to standard compressed fingerprints in molecular property prediction and Bayesian optimization where the underlying predictive model is a Gaussian process. We find that using exact fingerprints yields a small yet consistent improvement in predictive accuracy on five molecular property prediction benchmarks from the DOCKSTRING dataset. However, these gains did not translate to significant improvements in Bayesian optimization performance.
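
The collision effect described in this abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the substructure strings, the use of `zlib.crc32` as the hash, and the `n_bits` values are hypothetical and are not taken from the paper, which works with real molecular fingerprints. The sketch only shows how folding distinct features into a small number of bit positions can inflate Tanimoto similarity relative to an exact (uncompressed) feature set.

```python
import zlib


def exact_fingerprint(substructures):
    """Exact fingerprint: the set of substructure identifiers, with no folding."""
    return set(substructures)


def folded_fingerprint(substructures, n_bits):
    """Compressed fingerprint: hash each identifier into one of n_bits positions.

    Distinct substructures that land on the same position become
    indistinguishable (a hash collision).
    """
    return {zlib.crc32(s.encode("utf-8")) % n_bits for s in substructures}


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of active features."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical molecules described by toy substructure labels.
mol_a = ["C=O", "C-N", "c1ccccc1"]
mol_b = ["C=O", "C-O", "C-Cl"]

# One shared feature out of five distinct features in total.
exact_sim = tanimoto(exact_fingerprint(mol_a), exact_fingerprint(mol_b))

# Extreme folding: with a single bit every feature collides,
# so any two non-empty molecules look identical.
collapsed_sim = tanimoto(folded_fingerprint(mol_a, 1), folded_fingerprint(mol_b, 1))
```

With realistic fingerprint lengths the inflation is far milder than in this single-bit extreme, which is consistent with the abstract's report of a small but consistent accuracy gain from exact fingerprints.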

[LG-24] Step-E: A Differentiable Data Cleaning Framework for Robust Learning with Noisy Labels

Link: https://arxiv.org/abs/2511.17040
Authors: Wenzhang Du
Categories: Machine Learning (cs.LG)
Comments: 12 pages, 4 figures

Abstract:Training data collected in the wild often contain noisy labels and outliers that substantially degrade the performance and reliability of deep neural networks. While data cleaning is commonly applied as a separate preprocessing stage, such two-stage pipelines neither fully exploit feedback from the downstream model nor adapt to unknown noise patterns. We propose Step-E, a simple framework that integrates sample selection and model learning into a single optimization process. At each epoch, Step-E ranks samples by loss and gradually increases the fraction of high-loss examples that are excluded from gradient updates after a brief warm-up stage, yielding an online curriculum that focuses on easy and consistent examples and eventually ignores persistent outliers. On CIFAR-100N, Step-E improves the test accuracy of a ResNet-18 model from 43.3% (+/- 0.7%) to 50.4% (+/- 0.9%), clearly outperforming loss truncation, self-paced learning, and one-shot filtering while approaching the clean-label oracle at 60.5% (+/- 0.2%). On CIFAR-10N (aggre), Step-E also improves over the noisy baseline (85.3% vs. 83.9%) and nearly matches the clean-label oracle (85.9%), with only moderate training-time overhead.
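
The selection rule described in this abstract (rank by loss, then exclude a growing fraction of high-loss samples after a warm-up) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the linear ramp, the function names, and the parameters `warmup_epochs`, `ramp_epochs`, and `max_drop` are hypothetical, since the abstract does not specify the schedule.

```python
def drop_fraction(epoch, warmup_epochs, ramp_epochs, max_drop):
    """Fraction of highest-loss samples to exclude at a given epoch.

    Assumption: zero during warm-up, then a linear ramp up to max_drop.
    """
    if epoch < warmup_epochs:
        return 0.0
    progress = min(1.0, (epoch - warmup_epochs + 1) / ramp_epochs)
    return max_drop * progress


def keep_mask(losses, frac_dropped):
    """Boolean mask that is False for the highest-loss samples.

    Samples marked False would be excluded from the gradient update.
    """
    n_drop = int(len(losses) * frac_dropped)
    # Indices sorted from highest to lowest loss.
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    dropped = set(order[:n_drop])
    return [i not in dropped for i in range(len(losses))]


# Example: per-sample losses for a tiny batch, with half of the samples dropped.
example_mask = keep_mask([0.1, 5.0, 0.3, 2.0], 0.5)
```

In a training loop, the mask would typically be recomputed each epoch from the current per-sample losses, so a sample excluded early can re-enter if its loss later falls.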

[LG-25] Mask the Redundancy: Evolving Masking Representation Learning for Multivariate Time-Series Clustering AAAI2026

Link: https://arxiv.org/abs/2511.17008
Authors: Zexi Tan, Xiaopeng Luo, Yunlin Liu, Yiqun Zhang
Categories: Machine Learning (cs.LG)
Comments: Accepted to AAAI 2026

Abstract:Multivariate Time-Series (MTS) clustering discovers intrinsic grouping patterns of temporal data samples. Although time-series provide rich discriminative information, they also contain substantial redundancy, such as steady-state machine operation records and zero-output periods of solar power generation. Such redundancy diminishes the attention given to discriminative timestamps in representation learning, thus leading to performance bottlenecks in MTS clustering. Masking has been widely adopted to enhance the MTS representation, where temporal reconstruction tasks are designed to capture critical information from MTS. However, most existing masking strategies appear to be standalone preprocessing steps, isolated from the learning process, which hinders dynamic adaptation to the importance of clustering-critical timestamps. Accordingly, this paper proposes the Evolving-masked MTS Clustering (EMTC) method, with its model architecture composed of Importance-aware Variate-wise Masking (IVM) and Multi-Endogenous Views (MEV) representation learning modules. IVM adaptively guides the model in learning more discriminative representations for clustering, while the MEV-based reconstruction and contrastive learning pathways enhance the generalization. That is, the MEV reconstruction provides multi-perspective complementarity that prevents the masking from converging prematurely, and the clustering-guided contrastive learning facilitates the joint optimization of representation and clustering. Extensive experiments on 15 real benchmark datasets demonstrate the superiority of EMTC in comparison with eight SOTA methods, where EMTC achieves an average improvement of 4.85% over the strongest baselines.

[LG-26] FIRM: Federated In-client Regularized Multi-objective Alignment for Large Language Models

Link: https://arxiv.org/abs/2511.16992
Authors: Fatemeh (Atena) Nourzad, Amirhossein Roknilamouki, Eylem Ekici, Jia (Kevin) Liu, Ness B. Shroff
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Aligning Large Language Models (LLMs) with human values often involves balancing multiple, conflicting objectives such as helpfulness and harmlessness. Training these models is computationally intensive, and centralizing the process raises significant data privacy concerns. Federated Learning (FL) offers a compelling alternative, but existing Federated Multi-Objective Optimization (FMOO) methods face severe communication bottlenecks as their reliance on transmitting multiple gradients to a server is unscalable for large models. We introduce FIRM (Federated In-client Regularized Multi-objective alignment), a novel algorithm that achieves both client disagreement drift mitigation and communication efficiency. In FIRM, each client locally solves a regularized multi-objective optimization problem. By directly mitigating client disagreement drift through in-client regularization, our method eliminates the need for the multi-gradient transmissions common in prior works. Consequently, clients need only to transmit a single set of adapted parameters, maintaining high communication efficiency. We prove that our algorithm converges to Pareto-stationary points and, to our knowledge, provide the first finite-time convergence guarantees for this federated multi-objective alignment setting. Empirically, we show that FIRM leads to smoother training dynamics, reduced client disagreement drift, and improved reward trade-offs compared to baselines. We further propose a method to incorporate a preference over the objectives and report empirical Pareto plots, demonstrating that FIRM can smoothly adapt trade-offs between objectives in response to specified preferences.

[LG-27] Gradient flow for deep equilibrium single-index models

Link: https://arxiv.org/abs/2511.16976
Authors: Sanjit Dandapanthula, Aaditya Ramdas
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments:

Abstract:Deep equilibrium models (DEQs) have recently emerged as a powerful paradigm for training infinitely deep weight-tied neural networks that achieve state of the art performance across many modern machine learning tasks. Despite their practical success, theoretically understanding the gradient descent dynamics for training DEQs remains an area of active research. In this work, we rigorously study the gradient descent dynamics for DEQs in the simple setting of linear models and single-index models, filling several gaps in the literature. We prove a conservation law for linear DEQs which implies that the parameters remain trapped on spheres during training and use this property to show that gradient flow remains well-conditioned for all time. We then prove linear convergence of gradient descent to a global minimizer for linear DEQs and deep equilibrium single-index models under appropriate initialization and with a sufficiently small step size. Finally, we validate our theoretical findings through experiments.

[LG-28] ToC: Tree-of-Claims Search with Multi-Agent Language Models AAAI2026

Link: https://arxiv.org/abs/2511.16972
Authors: Shuyang Yu, Jianan Liang, Hui Hu
Categories: Machine Learning (cs.LG)
Comments: Accepted by AAAI 2026 (Oral)

Abstract:Optimizing patent claims is a critical yet challenging task, demanding careful balance between maximizing novelty and preserving legal scope. Manual claim drafting is labor-intensive, costly, and inherently inconsistent, while conventional Large Language Models (LLMs) often lack the structured, iterative reasoning essential for precise claim refinement. To address these challenges, we introduce Tree of Claims (ToC), an innovative framework that redefines claim editing as a guided search problem. ToC synergistically integrates Monte Carlo Tree Search (MCTS) with a collaborative multi-agent system, comprising an LLM-based EditorAgent that proposes contextually grounded edits, and an ExaminerAgent that mimics patent examiner critiques through structured, chain-of-thought analyses of novelty and prior art disclosure. Driven by a carefully designed multi-objective reward function, ToC jointly optimizes novelty, scope retention, and semantic coherence. Experimental evaluation on a benchmark of 1145 claims demonstrates that ToC significantly outperforms standard LLMs in zero-shot and few-shot scenarios, achieving an average composite score improvement of 8%, and up to 9% in certain cases. Extensive experiments, including detailed ablation studies, validate ToC’s efficacy in generating superior, legally robust claim revisions. Overall, ToC establishes a transparent, controllable, and interpretable methodology that effectively bridges advanced LLM reasoning capabilities with strategic MCTS planning for structured patent claim this http URL source code is available at this https URL.

[LG-29] A novel approach to classification of ECG arrhythmia types with latent ODEs NEURIPS2025

Link: https://arxiv.org/abs/2511.16933
Authors: Angelina Yan, Matt L. Sampson, Peter Melchior
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: Accepted into NeurIPS 2025 Learning from Time Series for Health workshop

Abstract:12-lead ECGs with high sampling frequency are the clinical gold standard for arrhythmia detection, but their short-term, spot-check nature often misses intermittent events. Wearable ECGs enable long-term monitoring but suffer from irregular, lower sampling frequencies due to battery constraints, making morphology analysis challenging. We present an end-to-end classification pipeline to address these issues. We train a latent ODE to model continuous ECG waveforms and create robust feature vectors from high-frequency single-channel signals. We construct three latent vectors per waveform via downsampling the initial 360 Hz ECG to 90 Hz and 45 Hz. We then use a gradient boosted tree to classify these vectors and test robustness across frequencies. Performance shows minimal degradation, with macro-averaged AUC-ROC values of 0.984, 0.978, and 0.976 at 360 Hz, 90 Hz, and 45 Hz, respectively, suggesting a way to sidestep the trade-off between signal fidelity and battery life. This enables smaller wearables, promoting long-term monitoring of cardiac health.

[LG-30] CroTad: A Contrastive Reinforcement Learning Framework for Online Trajectory Anomaly Detection VLDB

Link: https://arxiv.org/abs/2511.16929
Authors: Rui Xue, Dan He, Fengmei Jin, Chen Zhang, Xiaofang Zhou
Categories: Machine Learning (cs.LG); Databases (cs.DB)
Comments: 18 pages, 4 figures, will be submitted to VLDBJ

Abstract:Detecting trajectory anomalies is a vital task in modern Intelligent Transportation Systems (ITS), enabling the identification of unsafe, inefficient, or irregular travel behaviours. While deep learning has emerged as the dominant approach, several key challenges remain unresolved. First, sub-trajectory anomaly detection, capable of pinpointing the precise segments where anomalies occur, remains underexplored compared to whole-trajectory analysis. Second, many existing methods depend on carefully tuned thresholds, limiting their adaptability in real-world applications. Moreover, the irregular sampling of trajectory data and the presence of noise in training sets further degrade model performance, making it difficult to learn reliable representations of normal routes. To address these challenges, we propose a contrastive reinforcement learning framework for online trajectory anomaly detection, CroTad. Our method is threshold-free and robust to noisy, irregularly sampled data. By incorporating contrastive learning, CroTad learns to extract diverse normal travel patterns for different itineraries and effectively distinguish anomalous behaviours at both sub-trajectory and point levels. The detection module leverages deep reinforcement learning to perform online, real-time anomaly scoring, enabling timely and fine-grained identification of abnormal segments. Extensive experiments on two real-world datasets demonstrate the effectiveness and robustness of our framework across various evaluation scenarios.

[LG-31] A Hybrid Computational Intelligence Framework for scRNA-seq Imputation: Integrating scRecover and Random Forests

Link: https://arxiv.org/abs/2511.16923
Authors: Ali Anaissi, Deshao Liu, Yuanzhe Jia, Weidong Huang, Widad Alyassine, Junaid Akram
Categories: Machine Learning (cs.LG); Genomics (q-bio.GN)
Comments:

Abstract:Single-cell RNA sequencing (scRNA-seq) enables transcriptomic profiling at cellular resolution but suffers from pervasive dropout events that obscure biological signals. We present SCR-MF, a modular two-stage workflow that combines principled dropout detection using scRecover with robust non-parametric imputation via missForest. Across public and simulated datasets, SCR-MF achieves robust and interpretable performance comparable to or exceeding existing imputation methods in most cases, while preserving biological fidelity and transparency. Runtime analysis demonstrates that SCR-MF provides a competitive balance between accuracy and computational efficiency, making it suitable for mid-scale single-cell datasets.

[LG-32] Predicting Talent Breakout Rate using Twitter and TV data

Link: https://arxiv.org/abs/2511.16905
Authors: Bilguun Batsaikhan, Hiroyuki Fukuda
Categories: Machine Learning (cs.LG)
Comments: 4 pages. Presented at the 34th Annual Conference of the Japanese Society for Artificial Intelligence (JSAI 2020), paper ID 1K3-ES-2-02

Abstract:Early detection of rising talents is of paramount importance in the field of advertising. In this paper, we define a concept of talent breakout and propose a method to detect Japanese talents before their rise to stardom. The main focus of the study is to determine the effectiveness of combining Twitter and TV data on predicting time-dependent changes in social data. Although traditional time-series models are known to be robust in many applications, the success of neural network models in various fields (e.g., Natural Language Processing, Computer Vision, Reinforcement Learning) continues to spark an interest in the time-series community to apply new techniques in practice. Therefore, in order to find the best modeling approach, we have experimented with traditional, neural network and ensemble learning methods. We observe that ensemble learning methods outperform traditional and neural network models based on standard regression metrics. However, by utilizing the concept of talent breakout, we are able to assess the true forecasting ability of the models, where neural networks outperform traditional and ensemble learning methods in terms of precision and recall.

[LG-33] PersonalizedRouter: Personalized LLM Routing via Graph-based User Preference Modeling

Link: https://arxiv.org/abs/2511.16883
Authors: Zhongjie Dai, Tao Feng, Jiaxuan You
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The growing number of Large Language Models (LLMs) with diverse capabilities and response styles provides users with a wider range of choices, which presents challenges in selecting appropriate LLMs, as user preferences vary in terms of performance, cost, and response style. Current LLM selection methods typically optimize for a single fixed objective, such as performance, cost, or a trade-off between them, and fail to learn individual user preferences from interaction data. To address these limitations, we propose PersonalizedRouter, a graph-based framework that models diverse user profiles and performs personalized LLM selection by leveraging interaction data that includes task context, queries, candidate LLMs, and user decisions. To capture contextual information between user queries and optimal LLMs, PersonalizedRouter converts the interaction data into a heterogeneous graph, where the relationships between different types of nodes are represented by edges. To evaluate adaptability across users, we design two strategies: the multi-cost-efficiency simulation strategy and the LLM-as-a-Judge strategy. In addition, we construct PersonaRoute-Bench, a large-scale benchmark with 1,000 simulated users and 10 LLMs. Experimental results show that PersonalizedRouter significantly outperforms existing LLM selection methods and surpasses the strongest methods by a large margin of 15.38% and 9.83% under two simulation strategies. On the PersonaRoute-Bench with 1,000 users, it further surpasses the best methods by 16.19% and 59.69% while maintaining higher efficiency. Moreover, PersonalizedRouter demonstrates strong few-shot generalization, achieving 64.81% and 85.80% of the fully trained model’s performance when adapting to new users and new LLMs.

[LG-34] Topologic Attention Networks: Attending to Direct and Indirect Neighbors through Gaussian Belief Propagation

Link: https://arxiv.org/abs/2511.16871
Authors: Marshall Rosenhoover, Huaming Zhang
Categories: Machine Learning (cs.LG)
Comments: 15 pages, 13 figures

Abstract:Graph Neural Networks rely on local message passing, which limits their ability to model long-range dependencies in graphs. Existing approaches extend this range through continuous-time dynamics or dense self-attention, but both suffer from high computational cost and limited scalability. We propose Topologic Attention Networks, a new framework that applies topologic attention, a probabilistic mechanism that learns how information should flow through both direct and indirect connections in a graph. Unlike conventional attention that depends on explicit pairwise interactions, topologic attention emerges from the learned information propagation of the graph, enabling unified reasoning over local and global relationships. This method achieves state-of-the-art performance across all measured baseline models. Our implementation is available at this https URL.

[LG-35] Is the Cure Still Worse Than the Disease? Test Overfitting by LLMs in Automated Program Repair

Link: https://arxiv.org/abs/2511.16858
Authors: Toufique Ahmed, Jatin Ganhotra, Avraham Shinnar, Martin Hirzel
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments:

Abstract:Automated program repair has been shown to be susceptible to generating repaired code that passes on seen tests but fails on a hold-out set of hidden tests. This problem, dubbed test overfitting, has been identified and studied before the rise of large language models. We experimentally study how much test overfitting is still a problem today, using repository-level SWE-bench tasks.

[LG-36] Better audio representations are more brain-like: linking model-brain alignment with performance in downstream auditory tasks

Link: https://arxiv.org/abs/2511.16849
Authors: Leonardo Pepino, Pablo Riera, Juan Kamienkowski, Luciana Ferrer
Categories: Machine Learning (cs.LG); Sound (cs.SD)
Comments:

Abstract:Artificial neural networks (ANNs) are increasingly powerful models of brain computation, yet it remains unclear whether improving their task performance also makes their internal representations more similar to brain signals. To address this question in the auditory domain, we quantified the alignment between the internal representations of 36 different audio models and brain activity from two independent fMRI datasets. Using voxel-wise and component-wise regression, and representation similarity analysis (RSA), we found that recent self-supervised audio models with strong performance in diverse downstream tasks are better predictors of auditory cortex activity than older and more specialized models. To assess the quality of the audio representations, we evaluated these models in 6 auditory tasks from the HEAREval benchmark, spanning music, speech, and environmental sounds. This revealed strong positive Pearson correlations (r > 0.7) between a model’s overall task performance and its alignment with brain representations. Finally, we analyzed the evolution of the similarity between audio and brain representations during the pretraining of EnCodecMAE. We discovered that brain similarity increases progressively and emerges early during pretraining, despite the model not being explicitly optimized for this objective. This suggests that brain-like representations can be an emergent byproduct of learning to reconstruct missing information from naturalistic audio data.

[LG-37] Provably Minimum-Length Conformal Prediction Sets for Ordinal Classification AAAI2026

Link: https://arxiv.org/abs/2511.16845
Authors: Zijian Zhang, Xinyu Chen, Yuanjie Shi, Liyuan Lillian Ma, Zifan Xu, Yan Yan
Categories: Machine Learning (cs.LG)
Comments: Submitted to AAAI 2026

Abstract:Ordinal classification has been widely applied in many high-stakes applications, e.g., medical imaging and diagnosis, where reliable uncertainty quantification (UQ) is essential for decision making. Conformal prediction (CP) is a general UQ framework that provides statistically valid guarantees, which is especially useful in practice. However, prior ordinal CP methods mainly focus on heuristic algorithms or restrictively require the underlying model to predict a unimodal distribution over ordinal labels. Consequently, they provide limited insight into coverage-efficiency trade-offs and lack the model-agnostic, distribution-free nature favored by CP methods. To fill this gap, we propose an ordinal-CP method that is model-agnostic and provides instance-level optimal prediction intervals. Specifically, we formulate conformal ordinal classification as a minimum-length covering problem at the instance level. To solve this problem, we develop a sliding-window algorithm that is optimal on each calibration data point, with only linear time complexity in K, the number of label candidates. This per-instance local optimality also improves predictive efficiency in expectation. Moreover, we propose a length-regularized variant that shrinks prediction set size while preserving coverage. Experiments on four benchmark datasets from diverse domains are conducted to demonstrate the significantly improved predictive efficiency of the proposed methods over baselines (a 15% decrease on average over four datasets).
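
The minimum-length covering idea in this abstract can be sketched with a generic two-pointer sliding window. This is only an illustration under assumptions, not the authors' algorithm: the function name and the coverage threshold `tau` are hypothetical, and in a real conformal procedure the threshold would come from a calibration step that is not shown here. The sketch finds the shortest contiguous range of ordinal labels whose predicted probability mass reaches `tau`, in time linear in the number of labels K, and it does not require the predicted distribution to be unimodal.

```python
def shortest_covering_interval(probs, tau):
    """Shortest contiguous label range (lo, hi), inclusive, with mass >= tau.

    probs: non-negative predicted probabilities over K ordered labels.
    Returns None if even the full label range has mass below tau.
    Each index enters and leaves the window at most once, so the cost is O(K).
    """
    best = None
    mass = 0.0
    lo = 0
    for hi, p in enumerate(probs):
        mass += p
        # Shrink from the left while the window still reaches the threshold.
        while lo < hi and mass - probs[lo] >= tau:
            mass -= probs[lo]
            lo += 1
        if mass >= tau and (best is None or hi - lo < best[1] - best[0]):
            best = (lo, hi)
    return best


# Example with five ordered labels; labels 2 and 3 together hold 0.8 of the mass.
interval = shortest_covering_interval([0.05, 0.1, 0.5, 0.3, 0.05], 0.75)
```

Because the probabilities are non-negative, the left edge of the best window never needs to move backwards as the right edge advances, which is what keeps the scan linear.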

[LG-38] A Vector Symbolic Approach to Multiple Instance Learning

Link: https://arxiv.org/abs/2511.16795
Authors: Ehsan Ahmed Dhrubo, Mohammad Mahmudul Alam, Edward Raff, Tim Oates, James Holt
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Multiple Instance Learning (MIL) tasks impose a strict logical constraint: a bag is labeled positive if and only if at least one instance within it is positive. While this iff constraint aligns with many real-world applications, recent work has shown that most deep learning-based MIL approaches violate it, leading to inflated performance metrics and poor generalization. We propose a novel MIL framework based on Vector Symbolic Architectures (VSAs), which provide a differentiable mechanism for performing symbolic operations in high-dimensional space. Our method encodes the MIL assumption directly into the model’s structure by representing instances and concepts as nearly orthogonal high-dimensional vectors and using algebraic operations to enforce the iff constraint during classification. To bridge the gap between raw data and VSA representations, we design a learned encoder that transforms input instances into VSA-compatible vectors while preserving key distributional properties. Our approach, which includes a VSA-driven MaxNetwork classifier, achieves state-of-the-art results for a valid MIL model on standard MIL benchmarks and medical imaging datasets, outperforming existing methods while maintaining strict adherence to the MIL formulation. This work offers a principled, interpretable, and effective alternative to existing MIL approaches that rely on learned heuristics.
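
The "iff" constraint in this abstract can be made concrete with a toy example. The sketch below does not implement the paper's VSA-based method. It only contrasts two generic bag aggregators, chosen here as assumptions for illustration: max pooling, which respects the constraint when instance scores are accurate, and mean pooling, which can miss a bag containing a single positive instance. The function names, scores, and the 0.5 threshold are all hypothetical.

```python
def true_bag_label(instance_labels):
    """MIL assumption: a bag is positive iff at least one instance is positive."""
    return int(any(instance_labels))


def max_pool_predict(instance_scores, threshold=0.5):
    """Bag is predicted positive when the highest instance score passes the threshold."""
    return int(max(instance_scores) >= threshold)


def mean_pool_predict(instance_scores, threshold=0.5):
    """Bag is predicted positive when the average instance score passes the threshold."""
    return int(sum(instance_scores) / len(instance_scores) >= threshold)


# A bag with one clearly positive instance among three negatives.
labels = [1, 0, 0, 0]
scores = [0.9, 0.1, 0.1, 0.1]

bag_truth = true_bag_label(labels)      # positive, because one instance is positive
max_pred = max_pool_predict(scores)     # the positive instance alone decides
mean_pred = mean_pool_predict(scores)   # the negatives dilute the average below 0.5
```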

[LG-39] Membership Inference Attacks Beyond Overfitting

Link: https://arxiv.org/abs/2511.16792
Authors: Mona Khalil, Alberto Blanco-Justicia, Najeeb Jebreel, Josep Domingo-Ferrer
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Membership inference attacks (MIAs) against machine learning (ML) models aim to determine whether a given data point was part of the model training data. These attacks may pose significant privacy risks to individuals whose sensitive data were used for training, which motivates the use of defenses such as differential privacy, often at the cost of high accuracy losses. MIAs exploit the differences in the behavior of a model when making predictions on samples it has seen during training (members) versus those it has not seen (non-members). Several studies have pointed out that model overfitting is the major factor contributing to these differences in behavior and, consequently, to the success of MIAs. However, the literature also shows that even non-overfitted ML models can leak information about a small subset of their training data. In this paper, we investigate the root causes of membership inference vulnerabilities beyond traditional overfitting concerns and suggest targeted defenses. We empirically analyze the characteristics of the training data samples vulnerable to MIAs in models that are not overfitted (and hence able to generalize). Our findings reveal that these samples are often outliers within their classes (e.g., noisy or hard to classify). We then propose potential defensive strategies to protect these vulnerable samples and enhance the privacy-preserving capabilities of ML models. Our code is available at this https URL.

[LG-40] GCL-OT: Graph Contrastive Learning with Optimal Transport for Heterophilic Text-Attributed Graphs AAAI2026

Link: https://arxiv.org/abs/2511.16778
Authors: Yating Ren, Yikun Ban, Huobin Tan
Categories: Machine Learning (cs.LG)
Comments: AAAI 2026

Abstract:Recently, structure-text contrastive learning has shown promising performance on text-attributed graphs by leveraging the complementary strengths of graph neural networks and language models. However, existing methods typically rely on homophily assumptions in similarity estimation and hard optimization objectives, which limit their applicability to heterophilic graphs. Although existing methods can mitigate heterophily through structural adjustments or neighbor aggregation, they usually treat textual embeddings as static targets, leading to suboptimal alignment. In this work, we identify the multi-granular heterophily in text-attributed graphs, including complete heterophily, partial heterophily, and latent homophily, which makes structure-text alignment particularly challenging due to mixed, noisy, and missing semantic correlations. To achieve flexible and bidirectional alignment, we propose GCL-OT, a novel graph contrastive learning framework with optimal transport, equipped with tailored mechanisms for each type of heterophily. Specifically, for partial heterophily, we design a RealSoftMax-based similarity estimator to emphasize key neighbor-word interactions while easing background noise. For complete heterophily, we introduce a prompt-based filter that adaptively excludes irrelevant noise during optimal transport alignment. Furthermore, we incorporate OT-guided soft supervision to uncover potential neighbors with similar semantics, enhancing the learning of latent homophily. Theoretical analysis shows that GCL-OT can improve the mutual information bound and Bayes error guarantees. Extensive experiments on nine benchmarks show that GCL-OT consistently outperforms state-of-the-art methods, verifying its effectiveness and robustness.

[LG-41] When Structure Doesn't Help: LLMs Do Not Read Text-Attributed Graphs as Effectively as We Expected

Link: https://arxiv.org/abs/2511.16767
Authors: Haotian Xu, Yuning You, Tengfei Ma
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Graphs provide a unified representation of semantic content and relational structure, making them a natural fit for domains such as molecular modeling, citation networks, and social graphs. Meanwhile, large language models (LLMs) have excelled at understanding natural language and integrating cross-modal signals, sparking interest in their potential for graph reasoning. Recent work has explored this by either designing graph templates or using graph neural networks (GNNs) to encode structural information. In this study, we investigate how different strategies for encoding graph structure affect LLM performance on text-attributed graphs. Surprisingly, our systematic experiments reveal that: (i) LLMs leveraging only node textual descriptions already achieve strong performance across tasks; and (ii) most structural encoding strategies offer marginal or even negative gains. We show that explicit structural priors are often unnecessary and, in some cases, counterproductive when powerful language models are involved. This represents a significant departure from traditional graph learning paradigms and highlights the need to rethink how structure should be represented and utilized in the LLM era. Our study systematically challenges the foundational assumption that structure is inherently beneficial for LLM-based graph reasoning, opening the door to new, semantics-driven approaches for graph learning.

[LG-42] Addressing A Posteriori Performance Degradation in Neural Network Subgrid Stress Models

Link: https://arxiv.org/abs/2511.17475
Authors: Andy Wu, Sanjiva K. Lele
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments:

Abstract:Neural network subgrid stress models often have a priori performance that is far better than the a posteriori performance, leading to neural network models that look very promising a priori completely failing in a posteriori Large Eddy Simulations (LES). This performance gap can be decreased by combining two different methods, training data augmentation and reducing input complexity to the neural network. Augmenting the training data with two different filters before training the neural networks has no performance degradation a priori as compared to a neural network trained with one filter. A posteriori, neural networks trained with two different filters are far more robust across two different LES codes with different numerical schemes. In addition, by ablating away the higher order terms input into the neural network, the a priori versus a posteriori performance changes become less apparent. When combined, neural networks that use both training data augmentation and a less complex set of inputs have a posteriori performance far more reflective of their a priori evaluation.

[LG-43] A First Full Physics Benchmark for Highly Granular Calorimeter Surrogates

Link: https://arxiv.org/abs/2511.17293
Authors: Thorsten Buss,Henry Day-Hall,Frank Gaede,Gregor Kasieczka,Katja Krüger,Anatolii Korol,Thomas Madlener,Peter McKeown
Categories: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 26 pages, 15 figures

Abstract:The physics programs of current and future collider experiments necessitate the development of surrogate simulators for calorimeter showers. While much progress has been made in the development of generative models for this task, they have typically been evaluated in simplified scenarios and for single particles. This is particularly true for the challenging task of highly granular calorimeter simulation. For the first time, this work studies the use of highly granular generative calorimeter surrogates in a realistic simulation application. We introduce DDML, a generic library which enables the combination of generative calorimeter surrogates with realistic detectors implemented using the DD4hep toolkit. We compare two different generative models - one operating on a regular grid representation, and the other using a less common point cloud approach. In order to disentangle methodological details from model performance, we provide comparisons to idealized simulators which directly sample representations of different resolutions from the full simulation ground-truth. We then systematically evaluate model performance on post-reconstruction benchmarks for electromagnetic shower simulation. Beginning with a typical single particle study, we introduce a first multi-particle benchmark based on di-photon separations, before studying a first full-physics benchmark based on hadronic decays of the tau lepton. Our results indicate that models operating on a point cloud can achieve a favorable balance between speed and accuracy for highly granular calorimeter simulation compared to those which operate on a regular grid representation.

[LG-44] Intrinsic preservation of plasticity in continual quantum learning

Link: https://arxiv.org/abs/2511.17228
Authors: Yu-Qin Chen,Shi-Xin Zhang
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 11 pages, 5 figures and supplementary information

Abstract:Artificial intelligence in dynamic, real-world environments requires the capacity for continual learning. However, standard deep learning suffers from a fundamental issue: loss of plasticity, in which networks gradually lose their ability to learn from new data. Here we show that quantum learning models naturally overcome this limitation, preserving plasticity over long timescales. We demonstrate this advantage systematically across a broad spectrum of tasks from multiple learning paradigms, including supervised learning and reinforcement learning, and diverse data modalities, from classical high-dimensional images to quantum-native datasets. Although classical models exhibit performance degradation correlated with unbounded weight and gradient growth, quantum neural networks maintain consistent learning capabilities regardless of the data or task. We identify the origin of the advantage as the intrinsic physical constraints of quantum models. Unlike classical networks where unbounded weight growth leads to landscape ruggedness or saturation, the unitary constraints confine the optimization to a compact manifold. Our results suggest that the utility of quantum computing in machine learning extends beyond potential speedups, offering a robust pathway for building adaptive artificial intelligence and lifelong learners.

[LG-45] On the Predictive Skill of Artificial Intelligence-based Weather Models for Extreme Events using Uncertainty Quantification

Link: https://arxiv.org/abs/2511.17176
Authors: Rodrigo Almeida,Noelia Otero,Miguel-Ángel Fernández-Torres,Jackie Ma
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments: 24 pages, 12 figures

Abstract:Accurate prediction of extreme weather events remains a major challenge for artificial intelligence-based weather prediction systems. While deterministic models such as FuXi, GraphCast, and SFNO have achieved competitive forecast skill relative to numerical weather prediction, their ability to represent uncertainty and capture extremes is still limited. This study investigates how state-of-the-art deterministic artificial intelligence-based models respond to initial-condition perturbations and evaluates the resulting ensembles in forecasting extremes. Using three perturbation strategies (Gaussian noise, Hemispheric Centered Bred Vectors, and Huge Ensembles), we generate 50-member ensembles for two major events in August 2022: the Pakistan floods and the China heatwave. Ensemble skill is assessed against ERA5 and compared with IFS ENS and the probabilistic AIFSENS model using deterministic and probabilistic metrics. Results show that flow-dependent perturbations produce the most realistic ensemble spread and highest probabilistic skill, narrowing but not closing the performance gap with numerical weather prediction ensembles. Across variables, artificial intelligence-based weather models capture temperature extremes more effectively than precipitation. These findings demonstrate that input perturbations can extend deterministic models toward probabilistic forecasting, paving the way for approaches that combine flow-dependent perturbations with generative or latent-space uncertainty modeling for reliable artificial intelligence-driven early warning systems.
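
The Gaussian-noise strategy mentioned in the abstract can be illustrated with a minimal sketch: perturb the initial condition, run a deterministic model once per member, and summarize the ensemble by its mean and spread. Everything here is an assumption for illustration. `toy_forecast` is a hypothetical stand-in (a logistic map), not FuXi, GraphCast, or SFNO, and the noise scale `sigma` is arbitrary.

```python
import random
import statistics


def toy_forecast(x0, steps=20, r=3.7):
    """Hypothetical deterministic 'model': iterate the logistic map."""
    x = x0
    for _ in range(steps):
        x = r * x * (1.0 - x)
    return x


def gaussian_ensemble(x0, n_members=50, sigma=1e-3, seed=0):
    """Run the deterministic model from Gaussian-perturbed initial states."""
    rng = random.Random(seed)  # fixed seed so the ensemble is reproducible
    members = []
    for _ in range(n_members):
        # Add zero-mean Gaussian noise, then clip to the map's valid range.
        perturbed = min(max(x0 + rng.gauss(0.0, sigma), 0.0), 1.0)
        members.append(toy_forecast(perturbed))
    return members


members = gaussian_ensemble(0.3)
ens_mean = statistics.fmean(members)     # ensemble-mean forecast
ens_spread = statistics.pstdev(members)  # spread as a simple uncertainty proxy
print(f"mean={ens_mean:.4f} spread={ens_spread:.4f}")
```

Isotropic noise of this kind ignores the flow; the abstract reports that flow-dependent perturbations give more realistic spread, which this sketch does not attempt to reproduce.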

[LG-46] Dissecting Quantum Reinforcement Learning: A Systematic Evaluation of Key Components

Link: https://arxiv.org/abs/2511.17112
Authors: Javier Lazaro,Juan-Ignacio Vazquez,Pablo Garcia-Bringas
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments:

Abstract:Parameterised quantum circuit (PQC)-based Quantum Reinforcement Learning (QRL) has emerged as a promising paradigm at the intersection of quantum computing and reinforcement learning (RL). By design, PQCs create hybrid quantum-classical models, but their practical applicability remains uncertain due to training instabilities, barren plateaus (BPs), and the difficulty of isolating the contribution of individual pipeline components. In this work, we dissect PQC-based QRL architectures through a systematic experimental evaluation of three aspects recurrently identified as critical: (i) data embedding strategies, with Data Reuploading (DR) as an advanced approach; (ii) ansatz design, particularly the role of entanglement; and (iii) post-processing blocks after quantum measurement, with a focus on the underexplored Output Reuse (OR) technique. Using a unified PPO-CartPole framework, we perform controlled comparisons between hybrid and classical agents under identical conditions. Our results show that OR, though purely classical, exhibits distinct behaviour in hybrid pipelines, that DR improves trainability and stability, and that stronger entanglement can degrade optimisation, offsetting classical gains. Together, these findings provide controlled empirical evidence of the interplay between quantum and classical contributions, and establish a reproducible framework for systematic benchmarking and component-wise analysis in QRL.

[LG-47] Generative MIMO Beam Map Construction for Location Recovery and Beam Tracking

Link: https://arxiv.org/abs/2511.17007
Authors: Wangqian Chen,Junting Chen,Shuguang Cui
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Abstract:Machine learning (ML) has greatly advanced data-driven channel modeling and resource optimization in wireless communication systems. However, most existing ML-based methods rely on large, accurately labeled datasets with location information, which are often difficult and costly to obtain. This paper proposes a generative framework to recover location labels directly from sequences of sparse channel state information (CSI) measurements, without explicit location labels for radio map construction. Instead of directly storing raw CSI, we learn a compact low-dimensional radio map embedding and leverage a generative model to reconstruct the high-dimensional CSI. Specifically, to address the uncertainty of sparse CSI, a dual-scale feature extraction scheme is designed to enhance feature representation by jointly exploiting correlations from angular space and across neighboring samples. We develop a hybrid recurrent-convolutional encoder to learn mobility patterns, which combines a truncation strategy and multi-scale convolutions in the recurrent neural network (RNN) to ensure feature robustness against short-term fluctuations. Unlike conventional Gaussian priors in latent space, we embed a learnable radio map to capture the location information by encoding high-level positional features from CSI measurements. Finally, a diffusion-based generative decoder reconstructs the full CSI with high fidelity by conditioning on the positional features in the radio map. Numerical experiments demonstrate that the proposed model can improve localization accuracy by over 30% and achieve a 20% capacity gain in non-line-of-sight (NLOS) scenarios compared with model-based Kalman filter approaches.

[LG-48] BITS for GAPS: Bayesian Information-Theoretic Sampling for hierarchical GAussian Process Surrogates

Link: https://arxiv.org/abs/2511.16815
Authors: Kyla D. Jones,Alexander W. Dowling
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:We introduce the Bayesian Information-Theoretic Sampling for hierarchical GAussian Process Surrogates (BITS for GAPS) framework to emulate latent components in hybrid physical systems. BITS for GAPS supports serial hybrid modeling, where known physics governs part of the system and residual dynamics are represented as a latent function inferred from data. A Gaussian process prior is placed over the latent function, with hierarchical priors on its hyperparameters to encode physically meaningful structure in the predictive posterior. To guide data acquisition, we derive entropy-based acquisition functions that quantify expected information gain from candidate input locations, identifying samples most informative for training the surrogate. Specifically, we obtain a closed-form expression for the differential entropy of the predictive posterior and establish a tractable lower bound for efficient evaluation. These derivations approximate the predictive posterior as a finite, uniformly weighted mixture of Gaussian processes. We demonstrate the framework’s utility by modeling activity coefficients in vapor-liquid equilibrium systems, embedding the surrogate into extended Raoult’s law for distillation design. Numerical results show that entropy-guided sampling improves sample efficiency by targeting regions of high uncertainty and potential information gain. This accelerates surrogate convergence, enhances predictive accuracy in non-ideal regimes, and preserves physical consistency. Overall, BITS for GAPS provides an efficient, interpretable, and uncertainty-aware framework for hybrid modeling of complex physical systems.
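
As a rough illustration of entropy-guided sampling, the sketch below uses the standard differential entropy of a single univariate Gaussian, H = 0.5 ln(2πeσ²), and picks the candidate input with the largest entropy. This is a simplification and not the paper's acquisition function, which the abstract says is derived for a mixture of Gaussian processes with a tractable lower bound. The candidate locations and variances are made-up values.

```python
import math


def gaussian_entropy(variance):
    """Differential entropy (in nats) of a univariate Gaussian."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)


def pick_most_informative(candidates):
    """Return the input location whose predictive distribution has the
    highest entropy. `candidates` maps location -> predictive variance."""
    return max(candidates, key=lambda x: gaussian_entropy(candidates[x]))


# Hypothetical predictive variances at three candidate inputs.
predictive_variance = {0.1: 0.02, 0.5: 0.30, 0.9: 0.08}
best = pick_most_informative(predictive_variance)
print(best, gaussian_entropy(predictive_variance[best]))
```

For a single Gaussian, entropy is monotone in the variance, so this rule reduces to maximum-variance sampling; the mixture setting described in the abstract is where the two can differ.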

[LG-49] Efficient Penalty-Based Bilevel Methods: Improved Analysis, Novel Updates, and Flatness Condition

Link: https://arxiv.org/abs/2511.16796
Authors: Liuyuan Jiang,Quan Xiao,Lisha Chen,Tianyi Chen
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: arXiv admin note: text overlap with arXiv:2507.20400

Abstract:Penalty-based methods have become popular for solving bilevel optimization (BLO) problems, thanks to their effective first-order nature. However, they often require inner-loop iterations to solve the lower-level (LL) problem and small outer-loop step sizes to handle the increased smoothness induced by large penalty terms, leading to suboptimal complexity. This work considers the general BLO problems with coupled constraints (CCs) and leverages a novel penalty reformulation that decouples the upper- and lower-level variables. This yields an improved analysis of the smoothness constant, enabling larger step sizes and reduced iteration complexity for Penalty-Based Gradient Descent algorithms in ALTernating fashion (ALT-PBGD). Building on the insight of reduced smoothness, we propose PBGD-Free, a novel fully single-loop algorithm that avoids inner loops for the uncoupled constraint BLO. For BLO with CCs, PBGD-Free employs an efficient inner-loop with substantially reduced iteration complexity. Furthermore, we propose a novel curvature condition describing the “flatness” of the upper-level objective with respect to the LL variable. This condition relaxes the traditional upper-level Lipschitz requirement, enables smaller penalty constant choices, and results in a negligible penalty gradient term during upper-level variable updates. We provide rigorous convergence analysis and validate the method’s efficacy through hyperparameter optimization for support vector machines and fine-tuning of large language models.

[LG-50] Fermions and Supersymmetry in Neural Network Field Theories

Link: https://arxiv.org/abs/2511.16741
Authors: Samuel Frank,James Halverson,Anindita Maiti,Fabian Ruehle
Categories: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG)
Comments: 34 pages + appendices

Abstract:We introduce fermionic neural network field theories via Grassmann-valued neural networks. Free theories are obtained by a generalization of the Central Limit Theorem to Grassmann variables. This enables the realization of the free Dirac spinor at infinite width and a four fermion interaction at finite width. Yukawa couplings are introduced by breaking the statistical independence of the output weights for the fermionic and bosonic fields. A large class of interacting supersymmetric quantum mechanics and field theory models are introduced by super-affine transformations on the input that realize a superspace formalism.

Information Retrieval

[IR-0] Parametric Retrieval-Augmented Generation using Latent Routing of LoRA Adapters

Link: https://arxiv.org/abs/2511.17044
Authors: Zhan Su,Fengran Mo,Jian-yun Nie
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Parametric Retrieval-Augmented Generation (PRAG) is a novel RAG paradigm that integrates external knowledge directly into a Large Language Model (LLM) by parameterizing documents using LoRA adapters, demonstrating reduced inference costs compared to traditional RAG approaches. However, current PRAG approaches adopt a one-to-one document encoding scheme, using a dedicated LoRA adapter for each individual document. This scheme introduces two major limitations: First, it leads to data scarcity, as the training datasets for individual LoRA adapters are limited. Second, it incurs high overhead during inference, requiring the merging of LLM weights with a new LoRA adapter for every candidate passage, which is computationally inefficient. To overcome these challenges, we propose a novel paradigm for encoding passages in PRAG that utilizes a latent routing encoding process (Poly-PRAG). During offline encoding, we treat the encoding of a set of documents as a multi-task learning process, where each passage is assigned a unique task identifier. By employing a routing function, we use a small set of latent LoRA adapters to encode the entire passage space. During online inference, this routing function selectively activates a subset of latent experts based on the input query. We conduct comprehensive evaluations of Poly-PRAG across multiple knowledge-intensive NLP tasks. Our extensive experiments demonstrate the effectiveness of the proposed method, achieving state-of-the-art results on four distinct datasets.
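
The idea of sharing a small set of latent LoRA adapters through a routing function can be sketched as a weighted sum of low-rank updates, ΔW = Σᵢ wᵢ BᵢAᵢ. The softmax routing, the function names, and the toy rank-1 adapters below are all assumptions for illustration; the abstract does not specify how Poly-PRAG computes or sparsifies its routing weights.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def routed_delta(adapters, logits):
    """Mix low-rank updates B_i @ A_i using softmax routing weights.

    adapters: list of (B, A) pairs, B is d_out x r and A is r x d_in.
    logits:   one routing score per adapter (hypothetical router output).
    """
    weights = softmax(logits)
    rows, cols = len(adapters[0][0]), len(adapters[0][1][0])
    delta = [[0.0] * cols for _ in range(rows)]
    for w, (b, a) in zip(weights, adapters):
        ba = matmul(b, a)
        for i in range(rows):
            for j in range(cols):
                delta[i][j] += w * ba[i][j]
    return delta


# Two toy rank-1 adapters for a 2x2 weight matrix.
adapters = [
    ([[1.0], [0.0]], [[1.0, 0.0]]),
    ([[0.0], [1.0]], [[0.0, 1.0]]),
]
delta_even = routed_delta(adapters, [0.0, 0.0])     # equal routing weights
delta_first = routed_delta(adapters, [100.0, 0.0])  # router favors adapter 0
print(delta_even, delta_first)
```

Under this reading, only one merged update is needed per query rather than one adapter merge per candidate passage, which is the inference overhead the abstract attributes to the one-to-one scheme.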

[IR-1] δ-EMG: A Monotonic Graph Index for Approximate Nearest Neighbor Search

Link: https://arxiv.org/abs/2511.16921
Authors: Liming Xiang,Jing Feng,Ziqi Yin,Zijian Li,Daihao Xue,Hongchao Qin,Ronghua Li,Guoren Wang
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Approximate nearest neighbor (ANN) search in high-dimensional spaces is a foundational component of many modern retrieval and recommendation systems. Currently, almost all algorithms follow an ε-Recall-Bounded principle when comparing performance: they require the ANN search results to achieve a recall of more than 1-ε and then compare query-per-second (QPS) performance. However, this approach only accounts for the recall of true positive results and does not provide guarantees on the deviation of incorrect results. To address this limitation, we focus on an Error-Bounded ANN method, which ensures that the returned results are a (1/δ)-approximation of the true values. Our approach adopts a graph-based framework. To enable Error-Bounded ANN search, we propose a δ-EMG (Error-bounded Monotonic Graph), which, for the first time, provides a provable approximation for arbitrary queries. By enforcing a δ-monotonic geometric constraint during graph construction, δ-EMG ensures that any greedy search converges to a (1/δ)-approximate neighbor without backtracking. Building on this foundation, we design an error-bounded top-k ANN search algorithm that adaptively controls approximation accuracy during query time. To make the framework practical at scale, we introduce δ-EMQG (Error-bounded Monotonic Quantized Graph), a localized and degree-balanced variant with near-linear construction complexity. We further integrate vector quantization to accelerate distance computation while preserving theoretical guarantees. Extensive experiments on the ANN-Benchmarks dataset demonstrate the effectiveness of our approach. Under a recall requirement of 0.99, our algorithm achieves 19,000 QPS on the SIFT1M dataset, outperforming other methods by more than 40%.
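
The gap between recall-bounded and error-bounded evaluation can be seen in a small sketch: two result lists with identical recall can differ greatly in how far their wrong answers are from the true neighbors. The rank-wise distance ratio used below is an assumed way to measure approximation error, chosen for illustration; the abstract does not define exactly how the (1/δ)-approximation is measured.

```python
import math


def exact_knn(query, points, k):
    """Brute-force ground truth: indices of the k closest points."""
    order = sorted(range(len(points)),
                   key=lambda i: math.dist(query, points[i]))
    return order[:k]


def recall_at_k(returned, truth):
    """Fraction of true neighbors that appear in the returned list."""
    return len(set(returned) & set(truth)) / len(truth)


def worst_distance_ratio(query, points, returned, truth):
    """Largest rank-wise ratio of returned distance to true distance."""
    r = sorted(math.dist(query, points[i]) for i in returned)
    t = sorted(math.dist(query, points[i]) for i in truth)
    return max(a / b for a, b in zip(r, t) if b > 0)


points = [(1.0, 0.0), (0.0, 2.0), (2.1, 0.0), (0.0, 50.0)]
query = (0.0, 0.0)
truth = exact_knn(query, points, k=2)

near_miss = [0, 2]  # misses one true neighbor by a small margin
far_miss = [0, 3]   # misses the same neighbor by a wide margin

for name, result in (("near_miss", near_miss), ("far_miss", far_miss)):
    print(name,
          recall_at_k(result, truth),
          worst_distance_ratio(query, points, result, truth))
```

Both lists score the same recall, yet the second returns a point many times farther away than the true second neighbor, which is the kind of deviation a recall threshold alone does not bound.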

Attachment Download

Click to download today's full paper list