本篇博文主要内容为 2025-11-18 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。

目录

概览 (2025-11-18)

今日共更新1275篇论文,其中:

  • 自然语言处理126篇(Computation and Language (cs.CL))
  • 人工智能406篇(Artificial Intelligence (cs.AI))
  • 计算机视觉368篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习434篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] Generalist Foundation Models Are Not Clinical Enough for Hospital Operations

【速读】: 该论文旨在解决通用大语言模型(Large Language Models, LLMs)在医院运营决策场景中表现不足的问题,尤其是其缺乏临床操作所需的专门知识。解决方案的关键在于:首先,在领域内预训练阶段使用来自纽约大学朗格医学中心电子健康记录(Electronic Health Records, EHR)的800亿个词元与互联网文本(6270亿词元)混合数据构建Lang1系列模型(参数规模100M–7B);其次,通过针对真实世界医疗任务的监督微调(Supervised Fine-tuning),显著提升模型在30天再入院预测、死亡率预测、住院时长估算、共病编码及保险拒赔预测等五项关键任务上的性能;最后,引入REalistic Medical Evaluation (ReMedE) 基准以实现对模型在真实EHR数据上的系统性评估。研究表明,结合领域内预训练与监督微调能够有效提升模型在医院运营场景中的泛化能力和准确性,且优于仅依赖大规模通用模型或零样本推理的方法。

链接: https://arxiv.org/abs/2511.13703
作者: Lavender Y. Jiang,Angelica Chen,Xu Han,Xujin Chris Liu,Radhika Dua,Kevin Eaton,Frederick Wolff,Robert Steele,Jeff Zhang,Anton Alyakin,Qingkai Pan,Yanbing Chen,Karl L. Sangwon,Daniel A. Alber,Jaden Stryker,Jin Vivian Lee,Yindalon Aphinyanaphongs,Kyunghyun Cho,Eric Karl Oermann
机构: New York University (纽约大学); NYU Langone Health (纽约大学朗格尼医学中心); ETH Zurich (苏黎世联邦理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Hospitals and healthcare systems rely on operational decisions that determine patient flow, cost, and quality of care. Despite strong performance on medical knowledge and conversational benchmarks, foundation models trained on general text may lack the specialized knowledge required for these operational decisions. We introduce Lang1, a family of models (100M-7B parameters) pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health’s EHRs and 627B tokens from the internet. To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE), a benchmark derived from 668,331 EHR notes that evaluates five critical tasks: 30-day readmission prediction, 30-day mortality prediction, length of stay, comorbidity coding, and predicting insurance claims denial. In zero-shot settings, both general-purpose and specialized models underperform on four of five tasks (36.6%-71.7% AUROC), with mortality prediction being an exception. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66% respectively. We also observed cross-task scaling with joint finetuning on multiple tasks leading to improvement on other tasks. Lang1-1B effectively transfers to out-of-distribution settings, including other clinical tasks and an external health system. Our findings suggest that predictive capabilities for hospital operations require explicit supervised finetuning, and that this finetuning process is made more efficient by in-domain pretraining on EHR. Our findings support the emerging view that specialized LLMs can compete with generalist models in specialized tasks, and show that effective healthcare systems AI requires the combination of in-domain pretraining, supervised finetuning, and real-world evaluation beyond proxy benchmarks.
zh

[NLP-1] Crossing Borders: A Multimodal Challenge for Indian Poetry Translation and Image Generation

【速读】: 该论文旨在解决印度语言诗歌在跨文化传播中面临的可读性与理解障碍问题,尤其是由于其复杂的语法结构、文化隐喻和多层语义导致的非母语读者难以理解的问题。解决方案的关键在于提出一个名为Translation and Image Generation (TAI) 的框架,该框架融合了大型语言模型(Large Language Models, LLMs)与潜在扩散模型(Latent Diffusion Models),并通过适当的提示调优实现高效翻译与图像生成。其中,翻译模块采用优势比偏好对齐算法(Odds Ratio Preference Alignment Algorithm)以准确处理形态丰富的诗歌文本;图像生成模块则利用语义图捕捉词元、依赖关系及隐喻与其意义之间的语义关联,从而生成具有视觉意义的诗歌图像表示。该方法不仅提升了诗歌的跨语言传播能力,还通过构建 MorphoVerse 数据集(包含21种低资源印度语言的1570首诗歌)缓解了数据稀缺问题,助力联合国可持续发展目标中的优质教育(SDG 4)和减少不平等(SDG 10)。

链接: https://arxiv.org/abs/2511.13689
作者: Sofia Jamil,Kotla Sai Charan,Sriparna Saha,Koustava Goswami,Joseph K J
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Indian poetry, known for its linguistic complexity and deep cultural resonance, has a rich and varied heritage spanning thousands of years. However, its layered meanings, cultural allusions, and sophisticated grammatical constructions often pose challenges for comprehension, especially for non-native speakers or readers unfamiliar with its context and language. Despite its cultural significance, existing works on poetry have largely overlooked Indian language poems. In this paper, we propose the Translation and Image Generation (TAI) framework, leveraging Large Language Models (LLMs) and Latent Diffusion Models through appropriate prompt tuning. Our framework supports the United Nations Sustainable Development Goals of Quality Education (SDG 4) and Reduced Inequalities (SDG 10) by enhancing the accessibility of culturally rich Indian-language poetry to a global audience. It includes (1) a translation module that uses an Odds Ratio Preference Alignment Algorithm to accurately translate morphologically rich poetry into English, and (2) an image generation module that employs a semantic graph to capture tokens, dependencies, and semantic relationships between metaphors and their meanings, to create visually meaningful representations of Indian poems. Our comprehensive experimental evaluation, including both human and quantitative assessments, demonstrates the superiority of TAI Diffusion in poem image generation tasks, outperforming strong baselines. To further address the scarcity of resources for Indian-language poetry, we introduce the Morphologically Rich Indian Language Poems MorphoVerse Dataset, comprising 1,570 poems across 21 low-resource Indian languages. By addressing the gap in poetry translation and visual comprehension, this work aims to broaden accessibility and enrich the reader’s experience.
zh

[NLP-2] Why is “Chicago” Predictive of Deceptive Reviews? Using LLM s to Discover Language Phenomena from Lexical Cues

【速读】: 该论文旨在解决在线市场中虚假评论(deceptive reviews)对消费者决策和平台信任造成的威胁,以及现有机器学习分类器在识别此类评论时缺乏可解释性的问题。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)将机器学习模型所提取的细微、碎片化的词汇线索转化为人类可理解的语言现象(language phenomena),从而实现对虚假评论与真实评论之间差异的清晰阐释。这种方法不仅在数据上具有实证基础,且在相似领域具备泛化能力,并比LLM先验知识或上下文学习获得的现象更具预测力。

链接: https://arxiv.org/abs/2511.13658
作者: Jiaming Qu,Mengtian Guo,Yue Wang
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deceptive reviews mislead consumers, harm businesses, and undermine trust in online marketplaces. Machine learning classifiers can learn from large amounts of training examples to effectively distinguish deceptive reviews from genuine ones. However, the distinguishing features learned by these classifiers are often subtle, fragmented, and difficult for humans to interpret. In this work, we explore using large language models (LLMs) to translate machine-learned lexical cues into human-understandable language phenomena that can differentiate deceptive reviews from genuine ones. We show that language phenomena obtained in this manner are empirically grounded in data, generalizable across similar domains, and more predictive than phenomena either in LLMs’ prior knowledge or obtained through in-context learning. These language phenomena have the potential to aid people in critically assessing the credibility of online reviews in environments where deception detection classifiers are unavailable.
zh

[NLP-3] Live-SWE-agent : Can Software Engineering Agents Self-Evolve on the Fly?

【速读】: 该论文旨在解决现有软件代理(Software Agent)在实际应用中难以持续优化和泛化的问题,尤其是传统自改进型软件代理依赖昂贵的离线训练且性能受限于特定大语言模型(Large Language Model, LLM)或基准测试集。其解决方案的关键在于提出首个“实时演化的软件代理”——Live-SWE-agent,该代理从仅具备基础bash工具访问权限的最小代理架构(mini-SWE-agent)出发,在运行时(on-the-fly)自主演化自身架构以适应真实软件问题,无需额外训练即可实现持续自我优化,从而在SWE-bench Verified和SWE-Bench Pro等基准上显著优于现有开源及部分闭源方案。

链接: https://arxiv.org/abs/2511.13646
作者: Chunqiu Steven Xia,Zhe Wang,Yan Yang,Yuxiang Wei,Lingming Zhang
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories to solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are inherently software themselves that can be further refined/modified, researchers have proposed a number of self-improving software agents recently, including the Darwin-Gödel Machine (DGM). Meanwhile, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on-the-fly during runtime when solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold with only access to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that Live-SWE-agent can achieve an impressive solve rate of 75.4% without test-time scaling, outperforming all existing open-source software agents and approaching the performance of the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, achieving the best-known solve rate of 45.8%.
zh

[NLP-4] P1: Mastering Physics Olympiads with Reinforcement Learning

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在科学级推理能力上的局限性,尤其是其在物理学科中将符号系统与现实世界规律精准关联的能力不足问题。为应对这一挑战,作者提出了一种基于强化学习(Reinforcement Learning, RL)的全新训练范式,构建了名为P1的开源物理推理模型系列。其关键创新在于完全通过强化学习进行训练,使模型能够在复杂物理问题中实现高精度推理,其中P1-235B-A22B在2025年国际物理奥林匹克竞赛(IPhO 2025)中达到金牌水平,并在13项国际/区域物理竞赛中赢得12枚金牌;进一步结合代理框架PhysicsMinions后,该模型在IPhO 2025中获得总分第一,展现出卓越的通用推理能力和实际应用潜力。

链接: https://arxiv.org/abs/2511.13612
作者: Jiacheng Chen,Qianjia Cheng,Fangchen Yu,Haiyuan Wan,Yuchen Zhang,Shenghe Zheng,Junchi Yao,Qingyang Zhang,Haonan He,Yun Luo,Yufeng Zhao,Futing Wang,Li Sheng,Chengxing Xie,Yuxin Zuo,Yizhuo Li,Wenxauan Zeng,Yulun Wu,Rui Huang,Dongzhan Zhou,Kai Chen,Yu Qiao,Lei Bai,Yu Cheng,Ning Ding,Bowen Zhou,Peng Ye,Ganqu Cui
机构: Shanghai AI Laboratory (上海人工智能实验室)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent progress in large language models (LLMs) has moved the frontier from puzzle-solving to science-grade reasoning-the kind needed to tackle problems whose answers must stand against nature, not merely fit a rubric. Physics is the sharpest test of this shift, which binds symbols to reality in a fundamental way, serving as the cornerstone of most modern technologies. In this work, we manage to advance physics research by developing large language models with exceptional physics reasoning capabilities, especially excel at solving Olympiad-level physics problems. We introduce P1, a family of open-source physics reasoning models trained entirely through reinforcement learning (RL). Among them, P1-235B-A22B is the first open-source model with Gold-medal performance at the latest International Physics Olympiad (IPhO 2025), and wins 12 gold medals out of 13 international/regional physics competitions in 2024/2025. P1-30B-A3B also surpasses almost all other open-source models on IPhO 2025, getting a silver medal. Further equipped with an agentic framework PhysicsMinions, P1-235B-A22B+PhysicsMinions achieves overall No.1 on IPhO 2025, and obtains the highest average score over the 13 physics competitions. Besides physics, P1 models also present great performance on other reasoning tasks like math and coding, showing the great generalibility of P1 series.
zh

[NLP-5] Omni Memory System for Personalized Long Horizon Self-Evolving Agents

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂环境中长期交互时面临的上下文一致性不足与动态个性化能力有限的问题。现有记忆系统依赖语义分组进行检索,易忽略语义上看似无关但对用户意图至关重要的信息,并引入检索噪声。解决方案的关键在于提出O-Mem框架,其基于主动用户画像机制,从用户与代理的主动交互中动态提取并更新用户特征与事件记录,支持人物属性与话题相关上下文的分层检索,从而实现更自适应、连贯的个性化响应。

链接: https://arxiv.org/abs/2511.13593
作者: Piaohong Wang,Motong Tian,Jiaxian Li,Yuan Liang,Yuqing Wang,Qianben Chen,Tiannan Wang,Zhicong Lu,Jiawei Ma,Yuchen Eleanor Jiang,Wangchunshu Zhou
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in LLM-powered agents have demonstrated significant potential in generating human-like responses; however, they continue to face challenges in maintaining long-term interactions within complex environments, primarily due to limitations in contextual consistency and dynamic personalization. Existing memory systems often depend on semantic grouping prior to retrieval, which can overlook semantically irrelevant yet critical user information and introduce retrieval noise. In this report, we propose the initial design of O-Mem, a novel memory framework based on active user profiling that dynamically extracts and updates user characteristics and event records from their proactive interactions with agents. O-Mem supports hierarchical retrieval of persona attributes and topic-related context, enabling more adaptive and coherent personalized responses. O-Mem achieves 51.76% on the public LoCoMo benchmark, a nearly 3% improvement upon LangMem,the previous state-of-the-art, and it achieves 62.99% on PERSONAMEM, a 3.5% improvement upon A-Mem,the previous state-of-the-art. O-Mem also boosts token and interaction response time efficiency compared to previous memory frameworks. Our work opens up promising directions for developing efficient and human-like personalized AI assistants in the future.
zh

[NLP-6] Beyond SELECT: A Comprehensive Taxonomy-Guided Benchmark for Real-World Text-to-SQL Translation

【速读】: 该论文旨在解决现有文本到SQL(text-to-SQL)数据集在覆盖范围和多样性上的不足,这些问题限制了模型在真实场景中的泛化能力。其解决方案的关键在于提出了一种基于核心意图(core intents)、语句类型(statement types)、语法结构(syntax structures)和关键操作(key actions)的新型分类体系(taxonomy),并基于该体系构建了一个名为SQL-Synth的新数据集合成流程。该流程结合大语言模型(Large Language Models, LLMs)与分类体系,确保生成的数据集能更全面地反映现实世界text-to-SQL任务的复杂性和多样性,从而提升模型训练与评估的有效性。

链接: https://arxiv.org/abs/2511.13590
作者: Hao Wang,Yuanfeng Song,Xiaoming Yin,Xing Chen
机构: ByteDance Inc(字节跳动)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Text-to-SQL datasets are essential for training and evaluating text-to-SQL models, but existing datasets often suffer from limited coverage and fail to capture the diversity of real-world applications. To address this, we propose a novel taxonomy for text-to-SQL classification based on dimensions including core intents, statement types, syntax structures, and key actions. Using this taxonomy, we evaluate widely used public text-to-SQL datasets (e.g., Spider and Bird) and reveal limitations in their coverage and diversity. We then introduce a taxonomy-guided dataset synthesis pipeline, yielding a new dataset named SQL-Synth. This approach combines the taxonomy with Large Language Models (LLMs) to ensure the dataset reflects the breadth and complexity of real-world text-to-SQL applications. Extensive analysis and experimental results validate the effectiveness of our taxonomy, as SQL-Synth exhibits greater diversity and coverage compared to existing benchmarks. Moreover, we uncover that existing LLMs typically fall short in adequately capturing the full range of scenarios, resulting in limited performance on SQL-Synth. However, fine-tuning can substantially improve their performance in these scenarios. The proposed taxonomy has significant potential impact, as it not only enables comprehensive analysis of datasets and the performance of different LLMs, but also guides the construction of training data for LLMs.
zh

[NLP-7] ForgeDAN: An Evolutionary Framework for Jailbreaking Aligned Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际应用中面临的越狱攻击(jailbreak attacks)问题,即攻击者通过构造特定提示词绕过对齐机制以诱导模型生成有害内容。现有自动化越狱生成方法(如AutoDAN)存在突变多样性不足、适应度评估浅层化以及基于关键词的检测易受干扰等局限性。其解决方案的关键在于提出ForgeDAN——一个基于进化框架的新型对抗提示生成方法:首先引入字符级、词级和句级的多策略文本扰动以提升攻击多样性;其次采用可解释的语义适应度评估机制(基于文本相似度模型)引导进化过程趋向语义相关且具有危害性的输出;最后融合双维度越狱判定机制(结合LLM分类器对模型合规性和输出危害性进行联合判断),从而显著降低误报率并增强检测有效性。

链接: https://arxiv.org/abs/2511.13548
作者: Siyang Cheng,Gaotian Liu,Rui Mei,Yilin Wang,Kejia Zhang,Kaishuo Wei,Yuqi Yu,Weiping Wen,Xiaojie Wu,Junhua Liu
机构: iFLYTEK Security Laboratory (科大讯飞安全实验室); Anhui SparkShield Intelligent Technology Co., Ltd. (安徽星盾智能科技有限公司); Peking University (北京大学); School of Automation, University of Electronic Science and Technology of China (电子科技大学自动化学院); Northwest University (西北大学); University of New South Wales (新南威尔士大学); National Computer Network Emergency Response Technical Team/Coordination Center of China (中国国家计算机网络应急技术处理协调中心)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid adoption of large language models (LLMs) has brought both transformative applications and new security risks, including jailbreak attacks that bypass alignment safeguards to elicit harmful outputs. Existing automated jailbreak generation approaches e.g. AutoDAN, suffer from limited mutation diversity, shallow fitness evaluation, and fragile keyword-based detection. To address these limitations, we propose ForgeDAN, a novel evolutionary framework for generating semantically coherent and highly effective adversarial prompts against aligned LLMs. First, ForgeDAN introduces multi-strategy textual perturbations across \textitcharacter, word, and sentence-level operations to enhance attack diversity; then we employ interpretable semantic fitness evaluation based on a text similarity model to guide the evolutionary process toward semantically relevant and harmful outputs; finally, ForgeDAN integrates dual-dimensional jailbreak judgment, leveraging an LLM-based classifier to jointly assess model compliance and output harmfulness, thereby reducing false positives and improving detection effectiveness. Our evaluation demonstrates ForgeDAN achieves high jailbreaking success rates while maintaining naturalness and stealth, outperforming existing SOTA solutions.
zh

[NLP-8] oward Conversational Hungarian Speech Recognition: Introducing the BEA-Large and BEA-Dialogue Datasets LREC2026

【速读】: 该论文旨在解决匈牙利语等低资源语言在自动语音识别(ASR)领域因缺乏自发对话语料而发展受限的问题。其解决方案的关键在于构建两个全新的匈牙利语语音数据集——BEA-Large 和 BEA-Dialogue,分别包含255小时自发语音和85小时自然对话数据,并提供细粒度的段落级元数据及说话人独立子集,以支持对话式ASR与说话人聚类(speaker diarization)研究。通过在这些数据集上建立可复现的基线模型(如微调后的Fast Conformer),实现了自发语音下14.18%的词错误率(WER)和重复语音下4.8%的WER,同时给出13.05%–18.26%的聚类错误率(DER),为后续研究提供了基准参考与方法论框架。

链接: https://arxiv.org/abs/2511.13529
作者: Máté Gedeon,Piroska Zsófia Barta,Péter Mihajlik,Tekla Etelka Gráczi,Anna Kohári,Katalin Mády
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Submitted to LREC 2026

点击查看摘要

Abstract:The advancement of automatic speech recognition (ASR) has been largely enhanced by extensive datasets in high-resource languages, while languages such as Hungarian remain underrepresented due to limited spontaneous and conversational corpora. To address this gap, we introduce two new datasets – BEA-Large and BEA-Dialogue – constructed from the previously unprocessed portions of the Hungarian speech corpus named BEA. BEA-Large extends BEA-Base with 255 hours of spontaneous speech from 433 speakers, enriched with detailed segment-level metadata. BEA-Dialogue, comprising 85 hours of spontaneous conversations, is a Hungarian speech corpus featuring natural dialogues partitioned into speaker-independent subsets, supporting research in conversational ASR and speaker diarization. We establish reproducible baselines on these datasets using publicly available ASR models, with the fine-tuned Fast Conformer model achieving word error rates as low as 14.18% on spontaneous and 4.8% on repeated speech. Diarization experiments yield diarization error rates between 13.05% and 18.26%, providing reference points for future improvements. The results highlight the persistent difficulty of conversational ASR, particularly due to disfluencies, overlaps, and informal speech patterns. By releasing these datasets and baselines, we aim to advance Hungarian speech technology and offer a methodological framework for developing spontaneous and conversational benchmarks in other languages.
zh

[NLP-9] Applying Large Language Models to Characterize Public Narratives

【速读】: 该论文旨在解决公共叙事(Public Narratives, PNs)在领导力培养和公民动员中的系统性分析难题,传统方法受限于主观解释偏差及专家标注成本高昂。其解决方案的关键在于提出一种基于大语言模型(Large Language Models, LLMs)的计算框架,通过与领域专家共同构建的编码手册(codebook)对PN进行自动化定性标注;实证表明,LLMs可达到接近人类专家的性能(平均F1分数为0.80),从而实现可扩展、高效的叙事分析,并进一步拓展至政治演讲等场景,为计算公民叙事研究提供新视角。

链接: https://arxiv.org/abs/2511.13505
作者: Elinor Poole-Dayan,Daniel T Kessler,Hannah Chiou,Margaret Hughes,Emily S Lin,Marshall Ganz,Deb Roy
机构: MIT (麻省理工学院); Wellesley College (韦尔斯利学院); Harvard University (哈佛大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Public Narratives (PNs) are key tools for leadership development and civic mobilization, yet their systematic analysis remains challenging due to their subjective interpretation and the high cost of expert annotation. In this work, we propose a novel computational framework that leverages large language models (LLMs) to automate the qualitative annotation of public narratives. Using a codebook we co-developed with subject-matter experts, we evaluate LLM performance against that of expert annotators. Our work reveals that LLMs can achieve near-human-expert performance, achieving an average F1 score of 0.80 across 8 narratives and 14 codes. We then extend our analysis to empirically explore how PN framework elements manifest across a larger dataset of 22 stories. Lastly, we extrapolate our analysis to a set of political speeches, establishing a novel lens in which to analyze political rhetoric in civic spaces. This study demonstrates the potential of LLM-assisted annotation for scalable narrative analysis and highlights key limitations and directions for future research in computational civic storytelling.
zh

[NLP-10] Aspect-Level Obfuscated Sentiment in Thai Financial Disclosures and Its Impact on Abnormal Returns

【速读】: 该论文旨在解决金融文本中隐晦情感(obfuscated sentiment)的识别问题,即如何准确解析那些表面呈现积极或中性态度、实则可能反映不利基本面的财务报告内容。其解决方案的关键在于提出一种基于方面的情感分析(Aspect-Based Sentiment Analysis, ABSA)方法,通过构建针对泰语金融年报的标注指南对隐晦情感进行精细化标注,并在此基础上训练和评估多种文本分类模型,从而实现对特定业务维度(如盈利、风险、现金流等)的情感精准识别。进一步地,通过事件研究法验证了不同方面情感对股价的实际影响,表明市场反应具有选择性,凸显了ABSA在提升金融情感分析准确性方面的有效性。

链接: https://arxiv.org/abs/2511.13481
作者: Attapol T. Rutherford,Sirisak Chueykamhang,Thachaparn Bunditlurdruk,Nanthicha Angsuwichitkul
机构: Chulalongkorn University (朱拉隆功大学); Sasin Graduate Institute of Business Administration (萨辛商学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Understanding sentiment in financial documents is crucial for gaining insights into market behavior. These reports often contain obfuscated language designed to present a positive or neutral outlook, even when underlying conditions may be less favorable. This paper presents a novel approach using Aspect-Based Sentiment Analysis (ABSA) to decode obfuscated sentiment in Thai financial annual reports. We develop specific guidelines for annotating obfuscated sentiment in these texts and annotate more than one hundred financial reports. We then benchmark various text classification models on this annotated dataset, demonstrating strong performance in sentiment classification. Additionally, we conduct an event study to evaluate the real-world implications of our sentiment analysis on stock prices. Our results suggest that market reactions are selectively influenced by specific aspects within the reports. Our findings underscore the complexity of sentiment analysis in financial texts and highlight the importance of addressing obfuscated language to accurately assess market sentiment.
zh

[NLP-11] Non-Linear Scoring Model for Translation Quality Evaluation

【速读】: 该论文旨在解决传统翻译质量评估(Translation Quality Evaluation, TQE)中因线性误差惩罚机制导致的样本长度偏差问题,即在不同长度的翻译文本中,线性模型对短样本过度惩罚、对长样本则惩罚不足,从而与人类专家直觉不一致。其核心解决方案是提出一个基于多尺度框架(Multi-Range framework)的非线性评分模型 $ E(x) = a \cdot \ln(1 + b \cdot x) $,其中参数 $ a $ 和 $ b $ 通过两个容忍度点校准,且锚定于参考容忍度。该模型表明可接受错误数量随样本长度呈对数增长,而非线性增长,这一设定得到心理物理学(如Weber-Fechner定律)和认知负荷理论的支持,能够更准确反映人类感知质量的变化规律。该方法显著提升评估的公平性、可解释性和跨人工翻译与生成式AI(Generative AI)译文的一致性,并可无缝集成至现有CAT(Computer-Assisted Translation)/LQA(Language Quality Assurance)工作流中,仅需引入动态容忍度函数即可实现。

链接: https://arxiv.org/abs/2511.13467
作者: Serge Gladkoff,Lifeng Han,Katerina Gasova
机构: Logrus Global; LUMC & LIACS, Leiden University; MQM Council
类目: Computation and Language (cs.CL)
备注: ongoing work, 38 pages

点击查看摘要

Abstract:Analytic Translation Quality Evaluation (TQE), based on Multidimensional Quality Metrics (MQM), traditionally uses a linear error-to-penalty scale calibrated to a reference sample of 1000-2000 words. However, linear extrapolation biases judgment on samples of different sizes, over-penalizing short samples and under-penalizing long ones, producing misalignment with expert intuition. Building on the Multi-Range framework, this paper presents a calibrated, non-linear scoring model that better reflects how human content consumers perceive translation quality across samples of varying length. Empirical data from three large-scale enterprise environments shows that acceptable error counts grow logarithmically, not linearly, with sample size. Psychophysical and cognitive evidence, including the Weber-Fechner law and Cognitive Load Theory, supports this premise by explaining why the perceptual impact of additional errors diminishes while the cognitive burden grows with scale. We propose a two-parameter model E(x) = a * ln(1 + b * x), a, b 0, anchored to a reference tolerance and calibrated from two tolerance points using a one-dimensional root-finding step. The model yields an explicit interval within which the linear approximation stays within +/-20 percent relative error and integrates into existing evaluation workflows with only a dynamic tolerance function added. The approach improves interpretability, fairness, and inter-rater reliability across both human and AI-generated translations. By operationalizing a perceptually valid scoring paradigm, it advances translation quality evaluation toward more accurate and scalable assessment. The model also provides a stronger basis for AI-based document-level evaluation aligned with human judgment. Implementation considerations for CAT/LQA systems and implications for human and AI-generated text evaluation are discussed. Comments: ongoing work, 38 pages Subjects: Computation and Language (cs.CL) Cite as: arXiv:2511.13467 [cs.CL] (or arXiv:2511.13467v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2511.13467 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Lifeng Han Dr [view email] [v1] Mon, 17 Nov 2025 15:09:22 UTC (744 KB) Full-text links: Access Paper: View a PDF of the paper titled Non-Linear Scoring Model for Translation Quality Evaluation, by Serge Gladkoff and 2 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.CL prev | next new | recent | 2025-11 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
zh

[NLP-12] Exploring Multi-Table Retrieval Through Iterative Search

【速读】: 该论文旨在解决开放域数据湖(datalake)中多表检索的问题,即如何从多个表格中高效地检索并组合语义相关且结构上可连接(joinable)的信息,以支持自然语言到SQL(NL2SQL)的问答任务。传统方法如混合整数规划(Mixed-Integer Programming, MIP)虽能保证结构一致性,但计算复杂度高;而仅优化查询覆盖率的贪心启发式方法常无法获得可连接的表集合。论文提出将多表检索建模为一种迭代搜索过程,其关键在于设计一个兼顾语义相关性、覆盖度和可连接性的通用框架,并实现了一个快速有效的贪心连接感知检索算法(Greedy Join-Aware Retrieval),在5个NL2SQL基准测试中展现出与MIP相当的检索性能,同时速度提升4–400倍,显著提升了可扩展性和实用性。

链接: https://arxiv.org/abs/2511.13418
作者: Allaa Boutaleb,Bernd Amann,Rafael Angarita,Hubert Naacke
机构: Sorbonne Université (索邦大学); CNRS (法国国家科学研究中心); LIP6 (巴黎第六大学计算机科学实验室)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG)
备注: Accepted @ the AI for Tabular Data Workshop, EurIPS 2025

点击查看摘要

Abstract:Open-domain question answering over datalakes requires retrieving and composing information from multiple tables, a challenging subtask that demands semantic relevance and structural coherence (e.g., joinability). While exact optimization methods like Mixed-Integer Programming (MIP) can ensure coherence, their computational complexity is often prohibitive. Conversely, simpler greedy heuristics that optimize for query coverage alone often fail to find these coherent, joinable sets. This paper frames multi-table retrieval as an iterative search process, arguing this approach offers advantages in scalability, interpretability, and flexibility. We propose a general framework and a concrete instantiation: a fast, effective Greedy Join-Aware Retrieval algorithm that holistically balances relevance, coverage, and joinability. Experiments across 5 NL2SQL benchmarks demonstrate that our iterative method achieves competitive retrieval performance compared to the MIP-based approach while being 4-400x faster depending on the benchmark and search space settings. This work highlights the potential of iterative heuristics for practical, scalable, and composition-aware retrieval.
zh

[NLP-13] Attention Grounded Enhancement for Visual Document Retrieval

【速读】: 该论文旨在解决视觉文档检索中因依赖全局相关性标签而导致模型仅关注表面线索、难以捕捉隐式语义关联的问题,尤其在非提取式查询(non-extractive queries)场景下表现不佳。其解决方案的关键在于提出Attention-Grounded Retriever Enhancement (AGREE)框架,通过利用多模态大语言模型(multimodal large language models, MLLMs)的跨模态注意力机制作为局部监督信号,引导检索器识别文档中支持匹配的关键区域;训练过程中将局部信号与全局信号联合优化,使模型不仅能判断文档是否相关,还能明确具体内容驱动相关性,从而实现更深层次的查询-文档区域对齐和更具解释性的检索效果。

链接: https://arxiv.org/abs/2511.13415
作者: Wanqing Cui,Wei Huang,Yazhi Guo,Yibo Hu,Meiguang Jin,Junfeng Ma,Keping Bi
机构: Alibaba Group(阿里巴巴集团); University of Chinese Academy of Sciences(中国科学院大学)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual document retrieval requires understanding heterogeneous and multi-modal content to satisfy information needs. Recent advances use screenshot-based document encoding with fine-grained late interaction, significantly improving retrieval performance. However, retrievers are still trained with coarse global relevance labels, without revealing which regions support the match. As a result, retrievers tend to rely on surface-level cues and struggle to capture implicit semantic connections, hindering their ability to handle non-extractive queries. To alleviate this problem, we propose a \textbfAttention-\textbfGrounded \textbfREtriever \textbfEnhancement (AGREE) framework. AGREE leverages cross-modal attention from multimodal large language models as proxy local supervision to guide the identification of relevant document regions. During training, AGREE combines local signals with the global signals to jointly optimize the retriever, enabling it to learn not only whether documents match, but also which content drives relevance. Experiments on the challenging ViDoRe V2 benchmark show that AGREE significantly outperforms the global-supervision-only baseline. Quantitative and qualitative analyses further demonstrate that AGREE promotes deeper alignment between query terms and document regions, moving beyond surface-level matching toward more accurate and interpretable retrieval. Our code is available at: this https URL.
zh

[NLP-14] Mem-PAL: Towards Memory-based Personalized Dialogue Assistants for Long-term User-Agent Interaction AAAI2026

【速读】: 该论文旨在解决服务导向型对话助手在长期用户-代理交互中个性化能力不足的问题,尤其是现有方法难以捕捉用户主观特征及实现持续个性化的挑战。其解决方案的关键在于提出一个名为H² Memory的分层异构记忆框架,该框架通过引入检索增强生成(retrieval-augmented generation, RAG)机制,有效整合多轮对话历史与用户特定信息,从而提升个性化响应生成的质量与一致性。

链接: https://arxiv.org/abs/2511.13410
作者: Zhaopei Huang,Qifeng Dai,Guozheng Wu,Xiaopeng Wu,Kehan Chen,Chuan Yu,Xubin Li,Tiezheng Ge,Wenxuan Wang,Qin Jin
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by AAAI 2026 (Oral)

点击查看摘要

Abstract:With the rise of smart personal devices, service-oriented human-agent interactions have become increasingly prevalent. This trend highlights the need for personalized dialogue assistants that can understand user-specific traits to accurately interpret requirements and tailor responses to individual preferences. However, existing approaches often overlook the complexities of long-term interactions and fail to capture users’ subjective characteristics. To address these gaps, we present PAL-Bench, a new benchmark designed to evaluate the personalization capabilities of service-oriented assistants in long-term user-agent interactions. In the absence of available real-world data, we develop a multi-step LLM-based synthesis pipeline, which is further verified and refined by human annotators. This process yields PAL-Set, the first Chinese dataset comprising multi-session user logs and dialogue histories, which serves as the foundation for PAL-Bench. Furthermore, to improve personalized service-oriented interactions, we propose H ^2 Memory, a hierarchical and heterogeneous memory framework that incorporates retrieval-augmented generation to improve personalized response generation. Comprehensive experiments on both our PAL-Bench and an external dataset demonstrate the effectiveness of the proposed memory framework.
zh

[NLP-15] Can Large Language Models Function as Qualified Pediatricians? A Systematic Evaluation in Real-World Clinical Contexts

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在儿科临床实践中是否具备胜任能力的问题,尤其关注其在真实医疗环境中的知识应用、动态诊疗决策以及儿科医学安全与伦理合规性。解决方案的关键在于构建了一个名为PEDIASBench的系统性评估框架,该框架基于知识-系统架构,聚焦于三个核心维度:基础医学知识掌握、动态诊断治疗能力及儿科医疗安全与伦理表现,并对12个代表性LLM进行了跨19个儿科亚专科和211种典型疾病的实证评估,从而揭示当前模型在复杂推理、实时适应性和人文关怀方面的局限,为未来儿科智能医疗系统的安全化、可解释化与人机协同优化提供方向。

链接: https://arxiv.org/abs/2511.13381
作者: Siyu Zhu,Mouxiao Bian,Yue Xie,Yongyu Tang,Zhikang Yu,Tianbin Li,Pengcheng Chen,Bing Han,Jie Xu,Xiaoyan Dong
机构: Shanghai Children’s Hospital, School of Medicine, Shanghai Jiao Tong University (上海交通大学医学院附属儿童医院); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Longhua Hospital Shanghai University of Traditional Chinese Medicine (上海中医药大学附属龙华医院); University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the rapid rise of large language models (LLMs) in medicine, a key question is whether they can function as competent pediatricians in real-world clinical settings. We developed PEDIASBench, a systematic evaluation framework centered on a knowledge-system framework and tailored to realistic clinical environments. PEDIASBench assesses LLMs across three dimensions: application of basic knowledge, dynamic diagnosis and treatment capability, and pediatric medical safety and medical ethics. We evaluated 12 representative models released over the past two years, including GPT-4o, Qwen3-235B-A22B, and DeepSeek-V3, covering 19 pediatric subspecialties and 211 prototypical diseases. State-of-the-art models performed well on foundational knowledge, with Qwen3-235B-A22B achieving over 90% accuracy on licensing-level questions, but performance declined ~15% as task complexity increased, revealing limitations in complex reasoning. Multiple-choice assessments highlighted weaknesses in integrative reasoning and knowledge recall. In dynamic diagnosis and treatment scenarios, DeepSeek-R1 scored highest in case reasoning (mean 0.58), yet most models struggled to adapt to real-time patient changes. On pediatric medical ethics and safety tasks, Qwen2.5-72B performed best (accuracy 92.05%), though humanistic sensitivity remained limited. These findings indicate that pediatric LLMs are constrained by limited dynamic decision-making and underdeveloped humanistic care. Future development should focus on multimodal integration and a clinical feedback-model iteration loop to enhance safety, interpretability, and human-AI collaboration. While current LLMs cannot independently perform pediatric care, they hold promise for decision support, medical education, and patient communication, laying the groundwork for a safe, trustworthy, and collaborative intelligent pediatric healthcare system.
zh

[NLP-16] Donors and Recipients: On Asymmetric Transfer Across Tasks and Languages with Parameter-Efficient Fine-Tuning

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在单一任务或语言上进行微调后,其性能如何迁移至其他任务和语言组合的问题,尤其是这种迁移是否具有对称性以及是否存在稳定的结构模式。解决方案的关键在于通过受控的参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法(具体采用LoRA技术),系统性地评估不同模型家族与规模下任务与语言两个维度上的迁移效应,并将迁移分解为三种情形:匹配任务跨语言(Matched-Task Cross-Language)、匹配语言跨任务(Matched-Language Cross-Task)及跨任务跨语言(Cross-Task Cross-Language)。研究揭示出两个核心规律:一是显著的“在任务内 vs. 任务外”不对称性——即同一任务下的跨语言迁移通常为正向提升,而跨任务迁移常导致性能下降;二是存在稳定的“捐赠者-接收者”结构,表现为某些语言或任务作为高鲁棒性的“枢纽捐赠者”,而另一些则为脆弱的“接收者”。这些发现为风险感知的微调策略和模型专业化设计提供了重要依据。

链接: https://arxiv.org/abs/2511.13368
作者: Kajetan Dymkiewicz,Ivan Vulic,Helen Yannakoudakis,Eilam Shapira,Roi Reichart,Anna Korhonen
机构: University of Cambridge (剑桥大学); King’s College London (伦敦国王学院); Technion–Israel Institute of Technology (以色列理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) perform strongly across tasks and languages, yet how improvements in one task or language affect other tasks and languages and their combinations remains poorly understood. We conduct a controlled PEFT/LoRA study across multiple open-weight LLM families and sizes, treating task and language as transfer axes while conditioning on model family and size; we fine-tune each model on a single task-language source and measure transfer as the percentage-point change versus its baseline score when evaluated on all other task-language target pairs. We decompose transfer into (i) Matched-Task (Cross-Language), (ii) Matched-Language (Cross-Task), and (iii) Cross-Task (Cross-Language) regimes. We uncover two consistent general patterns. First, a pronounced on-task vs. off-task asymmetry: Matched-Task (Cross-Language) transfer is reliably positive, whereas off-task transfer often incurs collateral degradation. Second, a stable donor-recipient structure across languages and tasks (hub donors vs. brittle recipients). We outline implications for risk-aware fine-tuning and model specialisation.
zh

[NLP-17] AHaSIS: Shared Task on Sentiment Analysis for Arabic Dialects

【速读】: 该论文旨在解决阿拉伯语方言在酒店业情感分析中的难题,即如何准确识别和分类源自不同阿拉伯方言(如沙特阿拉伯方言和摩洛哥方言)的客户评论情感倾向。解决方案的关键在于构建了一个多方言、人工标注的情感平衡数据集,该数据集由原始现代标准阿拉伯语(Modern Standard Arabic, MSA)酒店评论翻译成沙特与摩洛哥方言,并经母语者验证以确保方言准确性与情感一致性,从而支持开发具备方言感知能力的自然语言处理(Natural Language Processing, NLP)系统用于实际客户体验分析。

链接: https://arxiv.org/abs/2511.13335
作者: Maram Alharbi,Salmane Chafik,Saad Ezzini,Ruslan Mitkov,Tharindu Ranasinghe,Hansi Hettiarachchi
机构: Lancaster University (兰卡斯特大学); Jazan University (贾赞大学); Mohammed VI Polytechnic University (穆罕默德六世工业大学); King Fahd University of Petroleum and Minerals (法赫德国王石油矿产大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The hospitality industry in the Arab world increasingly relies on customer feedback to shape services, driving the need for advanced Arabic sentiment analysis tools. To address this challenge, the Sentiment Analysis on Arabic Dialects in the Hospitality Domain shared task focuses on Sentiment Detection in Arabic Dialects. This task leverages a multi-dialect, manually curated dataset derived from hotel reviews originally written in Modern Standard Arabic (MSA) and translated into Saudi and Moroccan (Darija) dialects. The dataset consists of 538 sentiment-balanced reviews spanning positive, neutral, and negative categories. Translations were validated by native speakers to ensure dialectal accuracy and sentiment preservation. This resource supports the development of dialect-aware NLP systems for real-world applications in customer experience analysis. More than 40 teams have registered for the shared task, with 12 submitting systems during the evaluation phase. The top-performing system achieved an F1 score of 0.81, demonstrating the feasibility and ongoing challenges of sentiment analysis across Arabic dialects.
zh

[NLP-18] AutoMalDesc: Large-Scale Script Analysis for Cyber Threat Research AAAI2026

【速读】: 该论文旨在解决网络安全领域中自动化恶意软件检测系统难以生成全面自然语言解释的问题(generating thorough natural language explanations for threat detections)。解决方案的关键在于提出AutoMalDesc框架,其核心创新是采用迭代式自 paced学习(self-paced learning)管道,通过合成数据生成与验证循环逐步提升摘要质量,从而在无需大量人工标注数据的情况下实现规模化自动静态分析摘要生成。

链接: https://arxiv.org/abs/2511.13333
作者: Alexandru-Mihai Apostu,Andrei Preda,Alexandra Daniela Damir,Diana Bolocan,Radu Tudor Ionescu,Ioana Croitoru,Mihaela Gaman
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at AAAI 2026 (oral)

点击查看摘要

Abstract:Generating thorough natural language explanations for threat detections remains an open problem in cybersecurity research, despite significant advances in automated malware detection systems. In this work, we present AutoMalDesc, an automated static analysis summarization framework that, following initial training on a small set of expert-curated examples, operates independently at scale. This approach leverages an iterative self-paced learning pipeline to progressively enhance output quality through synthetic data generation and validation cycles, eliminating the need for extensive manual data annotation. Evaluation across 3,600 diverse samples in five scripting languages demonstrates statistically significant improvements between iterations, showing consistent gains in both summary quality and classification accuracy. Our comprehensive validation approach combines quantitative metrics based on established malware labels with qualitative assessment from both human experts and LLM-based judges, confirming both technical precision and linguistic coherence of generated summaries. To facilitate reproducibility and advance research in this domain, we publish our complete dataset of more than 100K script samples, including annotated seed (0.9K) and test (3.6K) datasets, along with our methodology and evaluation framework.
zh

[NLP-19] RegionMarker: A Region-Triggered Semantic Watermarking Framework for Embedding-as-a-Service Copyright Protection AAAI2026

【速读】: 该论文旨在解决Embedding-as-a-Service (EaaS) 在部署过程中面临的模型提取攻击问题,尤其是现有水印方法在面对多种攻击类型时保护能力不足的问题。解决方案的关键在于提出一种称为RegionMarker的区域触发语义水印框架,其核心创新包括:在低维空间中定义触发区域,并通过秘密维度约简矩阵将文本嵌入投影至该子空间;利用随机选择的触发区域注入水印,使得移除攻击难以规避检测;同时在整块触发区域内嵌入以文本嵌入本身作为水印内容,从而有效抵御改写(paraphrasing)和维度扰动(dimension-perturbation)等攻击类型。

链接: https://arxiv.org/abs/2511.13329
作者: Shufan Yang,Zifeng Cheng,Zhiwei Jiang,Yafeng Yin,Cong Wang,Shiping Ge,Yuchen Fu,Qing Gu
机构: 未知
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: AAAI 2026

点击查看摘要

Abstract:Embedding-as-a-Service (EaaS) is an effective and convenient deployment solution for addressing various NLP tasks. Nevertheless, recent research has shown that EaaS is vulnerable to model extraction attacks, which could lead to significant economic losses for model providers. For copyright protection, existing methods inject watermark embeddings into text embeddings and use them to detect copyright infringement. However, current watermarking methods often resist only a subset of attacks and fail to provide \textitcomprehensive protection. To this end, we present the region-triggered semantic watermarking framework called RegionMarker, which defines trigger regions within a low-dimensional space and injects watermarks into text embeddings associated with these regions. By utilizing a secret dimensionality reduction matrix to project onto this subspace and randomly selecting trigger regions, RegionMarker makes it difficult for watermark removal attacks to evade detection. Furthermore, by embedding watermarks across the entire trigger region and using the text embedding as the watermark, RegionMarker is resilient to both paraphrasing and dimension-perturbation attacks. Extensive experiments on various datasets show that RegionMarker is effective in resisting different attack methods, thereby protecting the copyright of EaaS.
zh

[NLP-20] Dropouts in Confidence: Moral Uncertainty in Human-LLM Alignment AAAI2026

【速读】: 该论文旨在解决人工智能系统(尤其是大语言模型,LLMs)在道德决策场景中表现出的过度自信问题,以及由此导致的与人类道德偏好不一致的问题。其核心挑战在于如何量化并调控模型在道德推理中的不确定性,从而提升模型决策与人类道德判断的一致性。解决方案的关键在于引入一种基于推理时“dropout”机制的随机性调控方法,通过增加模型输出的总熵(主要由互信息提升驱动),在保持条件熵稳定的同时显著增强模型与人类在道德选择上的对齐程度,从而为构建更可靠、可解释且符合人类价值观的AI系统提供了新路径。

链接: https://arxiv.org/abs/2511.13290
作者: Jea Kwon,Luiz Felipe Vecchietti,Sungwon Park,Meeyoung Cha
机构: 1. KAIST (韩国科学技术院); 2. Seoul National University (首尔国立大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Accepted to AAAI 2026

点击查看摘要

Abstract:Humans display significant uncertainty when confronted with moral dilemmas, yet the extent of such uncertainty in machines and AI agents remains underexplored. Recent studies have confirmed the overly confident tendencies of machine-generated responses, particularly in large language models (LLMs). As these systems are increasingly embedded in ethical decision-making scenarios, it is important to understand their moral reasoning and the inherent uncertainties in building reliable AI systems. This work examines how uncertainty influences moral decisions in the classical trolley problem, analyzing responses from 32 open-source models and 9 distinct moral dimensions. We first find that variance in model confidence is greater across models than within moral dimensions, suggesting that moral uncertainty is predominantly shaped by model architecture and training method. To quantify uncertainty, we measure binary entropy as a linear combination of total entropy, conditional entropy, and mutual information. To examine its effects, we introduce stochasticity into models via “dropout” at inference time. Our findings show that our mechanism increases total entropy, mainly through a rise in mutual information, while conditional entropy remains largely unchanged. Moreover, this mechanism significantly improves human-LLM moral alignment, with correlations in mutual information and alignment score shifts. Our results highlight the potential to better align model-generated decisions and human preferences by deliberately modulating uncertainty and reducing LLMs’ confidence in morally complex scenarios.
zh

[NLP-21] Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)训练成本高、耗时长的问题,尤其是在不进行昂贵重新训练的前提下提升模型性能与鲁棒性。其核心解决方案是提出了一种名为“类别专家汤”(Soup Of Category Experts, SoCE)的系统化模型融合方法:通过基准测试的组成分析识别出在不同性能类别中表现优异的“专家”模型,并基于这些弱相关类别簇采用非均匀加权平均策略进行融合,而非传统均匀权重平均。该方法的关键创新在于利用了不同benchmark类别间模型性能的低互相关性,从而更有效地聚合各子领域的最优能力,显著提升了多语言、工具调用和数学推理等任务的表现,达到当前最优水平。

链接: https://arxiv.org/abs/2511.13254
作者: Shalini Maiti,Amar Budhiraja,Bhavul Gauri,Gaurav Chaurasia,Anton Protopopov,Alexis Audran-Reiss,Michael Slater,Despoina Magka,Tatiana Shavrina,Roberta Raileanu,Yoram Bachrach
机构: Meta(Meta)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, but their training remains resource- and time-intensive, requiring massive compute power and careful orchestration of training procedures. Model souping-the practice of averaging weights from multiple models of the same architecture-has emerged as a promising pre- and post-training technique that can enhance performance without expensive retraining. In this paper, we introduce Soup Of Category Experts (SoCE), a principled approach for model souping that utilizes benchmark composition to identify optimal model candidates and applies non-uniform weighted averaging to maximize performance. Contrary to previous uniform-averaging approaches, our method leverages the observation that benchmark categories often exhibit low inter-correlations in model performance. SoCE identifies “expert” models for each weakly-correlated category cluster and combines them using optimized weighted averaging rather than uniform weights. We demonstrate that the proposed method improves performance and robustness across multiple domains, including multilingual capabilities, tool calling, and math and achieves state-of-the-art results on the Berkeley Function Calling Leaderboard.
zh

[NLP-22] Computational Measurement of Political Positions: A Review of Text-Based Ideal Point Estimation Algorithms

【速读】: 该论文旨在解决生成式 AI (Generative AI) 和自然语言处理(NLP)领域中,基于文本的理想点估计(Computational Text-based Ideal Point Estimation, CT-IPE)算法长期存在的方法碎片化问题。当前已有25种CT-IPE算法被广泛应用于政治学、传播学、计算社会科学等领域,但缺乏系统性比较框架和清晰的实践指导,导致研究者在选择算法时难以权衡其假设、可解释性、扩展性和局限性。解决方案的关键在于提出一个概念性框架,从文本方差的生成(generation)、捕捉(capture)与聚合(aggregation)三个维度对算法进行分类,并据此识别出四类方法家族:词频模型、主题建模、词嵌入和大语言模型(LLM)方法;进而通过结构化综述与实证分析,为实际研究提供基于透明度、技术门槛与验证策略的决策依据,并强调不同算法结果差异本身具有信息价值,推动建立系统性基准测试机制。

链接: https://arxiv.org/abs/2511.13238
作者: Patrick Parschan,Charlott Jakob
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 46 pages, 8 figures, 2 tables, accepted for publication in Quality Quantity

点击查看摘要

Abstract:This article presents the first systematic review of unsupervised and semi-supervised computational text-based ideal point estimation (CT-IPE) algorithms, methods designed to infer latent political positions from textual data. These algorithms are widely used in political science, communication, computational social science, and computer science to estimate ideological preferences from parliamentary speeches, party manifestos, and social media. Over the past two decades, their development has closely followed broader NLP trends – beginning with word-frequency models and most recently turning to large language models (LLMs). While this trajectory has greatly expanded the methodological toolkit, it has also produced a fragmented field that lacks systematic comparison and clear guidance for applied use. To address this gap, we identified 25 CT-IPE algorithms through a systematic literature review and conducted a manual content analysis of their modeling assumptions and development contexts. To compare them meaningfully, we introduce a conceptual framework that distinguishes how algorithms generate, capture, and aggregate textual variance. On this basis, we identify four methodological families – word-frequency, topic modeling, word embedding, and LLM-based approaches – and critically assess their assumptions, interpretability, scalability, and limitations. Our review offers three contributions. First, it provides a structured synthesis of two decades of algorithm development, clarifying how diverse methods relate to one another. Second, it translates these insights into practical guidance for applied researchers, highlighting trade-offs in transparency, technical requirements, and validation strategies that shape algorithm choice. Third, it emphasizes that differences in estimation outcomes across algorithms are themselves informative, underscoring the need for systematic benchmarking.
zh

[NLP-23] Seeing isnt Hearing: Benchmarking Vision Language Models at Interpreting Spectrograms AACL2025

【速读】: 该论文试图解决的问题是:视觉语言模型(VLMs)是否能够作为高度训练的音位学家,准确解读语音的频谱图(spectrogram)和波形(waveform)表示。解决方案的关键在于构建了一个包含4000多个孤立英语单词及其风格一致的频谱图与波形图像的新颖数据集,并设计了一个多选任务来测试VLMs对这些语音表示的理解能力——模型需从三个基于音位编辑距离选择的干扰项中选出正确的音位或拼写转录。实验结果表明,无论是零样本还是微调后的模型均很少能超过随机水平,揭示了仅靠配对样本不足以使VLMs掌握如何解析此类图形表示,强调了特定参数知识的重要性。

链接: https://arxiv.org/abs/2511.13225
作者: Tyler Loakman,Joseph James,Chenghua Lin
机构: University of Sheffield (谢菲尔德大学); University of Manchester (曼彻斯特大学)
类目: Computation and Language (cs.CL)
备注: Accepted to IJCNLP-AACL 2025

点击查看摘要

Abstract:With the rise of Large Language Models (LLMs) and their vision-enabled counterparts (VLMs), numerous works have investigated their capabilities in tasks that fuse the modalities of vision and language. In this work, we benchmark the extent to which VLMs are able to act as highly-trained phoneticians, interpreting spectrograms and waveforms of speech. To do this, we synthesise a novel dataset containing 4k+ English words spoken in isolation alongside stylistically consistent spectrogram and waveform figures. We test the ability of VLMs to understand these representations of speech through a multiple-choice task whereby models must predict the correct phonemic or graphemic transcription of a spoken word when presented amongst 3 distractor transcriptions that have been selected based on their phonemic edit distance to the ground truth. We observe that both zero-shot and finetuned models rarely perform above chance, demonstrating the requirement for specific parametric knowledge of how to interpret such figures, rather than paired samples alone.
zh

[NLP-24] Evaluating Large Language Models for Diacritic Restoration in Romanian Texts: A Comparative Study

【速读】: 该论文旨在解决罗马尼亚语等富含变音符号(diacritical marks)语言在文本处理中自动恢复变音符号的问题。解决方案的关键在于评估多种大语言模型(Large Language Models, LLMs)在不同提示模板(从零样本到复杂多示例指令)下的变音符号恢复性能,发现如GPT-4o等模型在准确率上显著优于基线方法,且模型架构、训练数据和提示设计对性能有显著影响,从而为提升面向变音符号丰富语言的自然语言处理工具提供了重要方向。

链接: https://arxiv.org/abs/2511.13182
作者: Mihai Dan Nadas,Laura Diosan
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automatic diacritic restoration is crucial for text processing in languages with rich diacritical marks, such as Romanian. This study evaluates the performance of several large language models (LLMs) in restoring diacritics in Romanian texts. Using a comprehensive corpus, we tested models including OpenAI’s GPT-3.5, GPT-4, GPT-4o, Google’s Gemini 1.0 Pro, Meta’s Llama 2 and Llama 3, MistralAI’s Mixtral 8x7B Instruct, airoboros 70B, and OpenLLM-Ro’s RoLlama 2 7B, under multiple prompt templates ranging from zero-shot to complex multi-shot instructions. Results show that models such as GPT-4o achieve high diacritic restoration accuracy, consistently surpassing a neutral echo baseline, while others, including Meta’s Llama family, exhibit wider variability. These findings highlight the impact of model architecture, training data, and prompt design on diacritic restoration performance and outline promising directions for improving NLP tools for diacritic-rich languages.
zh

[NLP-25] ranslation Entropy: A Statistical Framework for Evaluating Translation Systems

【速读】: 该论文旨在解决当前机器翻译系统缺乏定量评估方法的问题,特别是由于单个语言的熵未知,导致无法客观衡量翻译模型性能。其解决方案的关键在于提出一种可量化的翻译熵(translation entropy)估计方法:通过选取一个基准句(pivot sentence),对其中每个词元(token)进行替换并观察翻译结果是否保持不变,统计此类“翻译退化”现象的概率分布,从而计算出该词元的局部熵;通过对所有词元的平均,得到整个翻译器的总体翻译熵。研究发现,翻译熵在解码器块中逐步增强,并可用于对多个公开翻译模型(如MarianMT、T5-Base和NLLB-200)进行量化排序,且揭示了双向翻译熵是否对称。此外,扩展至双词元替换时表现出乘积效应,即翻译退化程度等于两个词元各自退化程度的乘积,这进一步验证了翻译熵作为可测量属性和客观基准的可行性。

链接: https://arxiv.org/abs/2511.13180
作者: Ronit D. Gross,Yanir Harel,Ido Kanter
机构: 未知
类目: Computation and Language (cs.CL)
备注: 23 pages, 6 figures and 8 tables

点击查看摘要

Abstract:The translation of written language has been known since the 3rd century BC; however, its necessity has become increasingly common in the information age. Today, many translators exist, based on encoder-decoder deep architectures, nevertheless, no quantitative objective methods are available to assess their performance, likely because the entropy of even a single language remains unknown. This study presents a quantitative method for estimating translation entropy, with the following key finding. Given a translator, several sentences that differ by only one selected token of a given pivot sentence yield identical translations. Analyzing the statistics of this phenomenon across an ensemble of such sentences, consisting each of a pivot selected token, yields the probabilities of replacing this specific token with others while preserving the translation. These probabilities constitute the entropy of the selected token, and the average across all selected pivot tokens provides an estimate of the translator’s overall translation entropy, which is enhanced along the decoder blocks. This entropic measure allows for the quantitative ranking of several publicly available translators and reveals whether mutual translation entropy is symmetric. Extending the proposed method to include the replacement of two tokens in a given pivot sentence demonstrates a multiplicative effect, where translation degeneracy is proportional to the product of the degeneracies of the two tokens. These findings establish translation entropy as a measurable property and objective benchmarking of artificial translators. Results are based on MarianMT, T5-Base and NLLB-200 translators.
zh

[NLP-26] CM-5CEval: Extended Deep Evaluation Benchmark for LLM s Comprehensive Clinical Research Competence in Traditional Chinese Medicine

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在中医(Traditional Chinese Medicine, TCM)这一高度专业化且富含文化语境领域中的评估不足问题。现有基准如TCM-3CEval虽揭示了系统性知识缺口和文化语境对齐的重要性,但缺乏对模型多维能力的细致诊断。为此,作者提出TCM-5CEval,一个涵盖五个关键维度的新型综合评测基准:核心知识(TCM-Exam)、经典文献理解(TCM-LitQA)、临床决策(TCM-MRCD)、中药学(TCM-CMM)及非药物疗法(TCM-ClinNPT)。其解决方案的关键在于通过精细化任务设计与严格的推理稳定性测试(如排列一致性分析),不仅量化模型在各子领域的表现,更暴露其在面对选项顺序变化时的显著性能波动,从而揭示当前LLMs在中医领域中普遍存在的推理脆弱性和位置敏感性缺陷,为后续模型改进提供明确方向。

链接: https://arxiv.org/abs/2511.13169
作者: Tianai Huang,Jiayuan Chen,Lu Lu,Pengcheng Chen,Tianbin Li,Bing Han,Wenchao Tang,Jie Xu,Ming Li
机构: Shanghai University of Traditional Chinese Medicine (上海中医药大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL)
备注: 17 pages, 8 figures

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated exceptional capabilities in general domains, yet their application in highly specialized and culturally-rich fields like Traditional Chinese Medicine (TCM) requires rigorous and nuanced evaluation. Building upon prior foundational work such as TCM-3CEval, which highlighted systemic knowledge gaps and the importance of cultural-contextual alignment, we introduce TCM-5CEval, a more granular and comprehensive benchmark. TCM-5CEval is designed to assess LLMs across five critical dimensions: (1) Core Knowledge (TCM-Exam), (2) Classical Literacy (TCM-LitQA), (3) Clinical Decision-making (TCM-MRCD), (4) Chinese Materia Medica (TCM-CMM), and (5) Clinical Non-pharmacological Therapy (TCM-ClinNPT). We conducted a thorough evaluation of fifteen prominent LLMs, revealing significant performance disparities and identifying top-performing models like deepseek_r1 and gemini_2_5_pro. Our findings show that while models exhibit proficiency in recalling foundational knowledge, they struggle with the interpretative complexities of classical texts. Critically, permutation-based consistency testing reveals widespread fragilities in model inference. All evaluated models, including the highest-scoring ones, displayed a substantial performance degradation when faced with varied question option ordering, indicating a pervasive sensitivity to positional bias and a lack of robust understanding. TCM-5CEval not only provides a more detailed diagnostic tool for LLM capabilities in TCM but aldso exposes fundamental weaknesses in their reasoning stability. To promote further research and standardized comparison, TCM-5CEval has been uploaded to the Medbench platform, joining its predecessor in the “In-depth Challenge for Comprehensive TCM Abilities” special track.
zh

[NLP-27] Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis

【速读】: 该论文旨在解决低资源语言(如孟加拉语)自动语音识别(ASR)转录文本中存在的关键歧义问题:词与词重复既可能是无意的停顿(Repetition Disfluency),也可能是有意的形态重叠(Morphological Reduplication)。传统去噪方法会错误删除具有合法语法意义的重叠结构,导致语义丢失。其解决方案的关键在于构建首个公开可用的20,000行孟加拉语标注语料库,明确区分上述两类现象,并通过两种范式进行基准测试:一是利用多语言大语言模型(LLMs)进行少样本提示学习,二是对特定语言模型(如BanglaBERT)进行任务微调。实验表明,微调策略表现更优,达到84.78%准确率和0.677 F1分数,为孟加拉语语义保真的文本规范化系统提供了坚实的语言学基础与数据支撑。

链接: https://arxiv.org/abs/2511.13159
作者: Zaara Zabeen Arpa,Sadnam Sakib Apurbo,Nazia Karim Khan Oishee,Ajwad Abrar
机构: IUT(Dhaka)(国际大学达卡分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automatic Speech Recognition (ASR) transcripts, especially in low-resource languages like Bangla, contain a critical ambiguity: word-word repetitions can be either Repetition Disfluency (unintentional ASR error/hesitation) or Morphological Reduplication (a deliberate grammatical construct). Standard disfluency correction fails by erroneously deleting valid linguistic information. To solve this, we introduce the first publicly available, 20,000-row Bangla corpus, manually annotated to explicitly distinguish between these two phenomena in noisy ASR transcripts. We benchmark this novel resource using two paradigms: state-of-the-art multilingual Large Language Models (LLMs) and task-specific fine-tuning of encoder models. LLMs achieve competitive performance (up to 82.68% accuracy) with few-shot prompting. However, fine-tuning proves superior, with the language-specific BanglaBERT model achieving the highest accuracy of 84.78% and an F1 score of 0.677. This establishes a strong, linguistically-informed baseline and provides essential data for developing sophisticated, semantic-preserving text normalization systems for Bangla.
zh

[NLP-28] Zero-Shot Grammar Competency Estimation Using Large Language Model Generated Pseudo Labels AACL

【速读】: 该论文旨在解决口语语法能力评估中因数据标注成本高、样本不结构化及表达不流畅所带来的挑战,尤其针对传统方法依赖大量专家标注导致难以规模化的问题。其解决方案的关键在于提出一种零样本(zero-shot)语法能力估计框架,利用未标注数据和大型语言模型(Large Language Models, LLMs)生成伪标签(pseudo labels),并通过设计一种能有效处理标签噪声的新型训练机制,训练基于Transformer的模型实现高精度的语法评分。实验表明,LLM选择与训练阶段干净样本与噪声样本的比例显著影响模型性能,验证了该方法在准确性和鲁棒性上的优势。

链接: https://arxiv.org/abs/2511.13152
作者: Sourya Dipta Das,Shubham Kumar,Kuldeep Yadav
机构: SHL Labs, India (SHL 实验室)
类目: Computation and Language (cs.CL)
备注: Accepted in AACL-IJCNLP 2025

点击查看摘要

Abstract:Grammar competency estimation is essential for assessing linguistic proficiency in both written and spoken language; however, the spoken modality presents additional challenges due to its spontaneous, unstructured, and disfluent nature. Developing accurate grammar scoring models further requires extensive expert annotation, making large-scale data creation impractical. To address these limitations, we propose a zero-shot grammar competency estimation framework that leverages unlabeled data and Large Language Models (LLMs) without relying on manual labels. During training, we employ LLM-generated predictions on unlabeled data by using grammar competency rubric-based prompts. These predictions, treated as pseudo labels, are utilized to train a transformer-based model through a novel training framework designed to handle label noise effectively. We show that the choice of LLM for pseudo-label generation critically affects model performance and that the ratio of clean-to-noisy samples during training strongly influences stability and accuracy. Finally, a qualitative analysis of error intensity and score prediction confirms the robustness and interpretability of our approach. Experimental results demonstrate the efficacy of our approach in estimating grammar competency scores with high accuracy, paving the way for scalable, low-resource grammar assessment systems.
zh

[NLP-29] A Comparative Analysis of Recurrent and Attention Architectures for Isolated Sign Language Recognition

【速读】: 该论文旨在解决孤立手语识别中不同神经网络架构的性能差异问题,特别是循环结构(如ConvLSTM)与基于注意力机制的Transformer模型在识别准确率和计算效率方面的权衡。其解决方案的关键在于系统性地对比两种代表性模型——ConvLSTM(卷积长短期记忆网络)和Vanilla Transformer,在阿塞拜疆手语数据集(AzSLD)和词级美国手语数据集(WLASL)上的表现,结果表明基于注意力机制的Transformer在Top-1和Top-5准确率上均显著优于ConvLSTM,尤其在小样本场景下优势更明显;而ConvLSTM则在计算效率和时序建模方面更具优势,从而为实际应用中根据资源约束和任务需求选择合适架构提供了量化依据。

链接: https://arxiv.org/abs/2511.13126
作者: Nigar Alishzade,Gulchin Abdullayeva
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study presents a systematic comparative analysis of recurrent and attention-based neural architectures for isolated sign language recognition. We implement and evaluate two representative models-ConvLSTM and Vanilla Transformer-on the Azerbaijani Sign Language Dataset (AzSLD) and the Word-Level American Sign Language (WLASL) dataset. Our results demonstrate that the attention-based Vanilla Transformer consistently outperforms the recurrent ConvLSTM in both Top-1 and Top-5 accuracy across datasets, achieving up to 76.8% Top-1 accuracy on AzSLD and 88.3% on WLASL. The ConvLSTM, while more computationally efficient, lags in recognition accuracy, particularly on smaller datasets. These findings highlight the complementary strengths of each paradigm: the Transformer excels in overall accuracy and signer independence, whereas the ConvLSTM offers advantages in computational efficiency and temporal modeling. The study provides a nuanced analysis of these trade-offs, offering guidance for architecture selection in sign language recognition systems depending on application requirements and resource constraints.
zh

[NLP-30] Extracting Events Like Code: A Multi-Agent Programming Framework for Zero-Shot Event Extraction AAAI2026

【速读】: 该论文旨在解决零样本事件抽取(Zero-shot Event Extraction, ZSEE)中大语言模型(LLM)因复杂推理需求和领域特定理解不足而导致的输出不完整或结构无效问题,如触发词误分类、论元缺失及事件模式违反等。其解决方案的关键在于提出一种名为Agent-Event-Coder(AEC)的多智能体框架,将事件抽取任务类比为软件工程中的代码生成过程,通过分解为检索、规划、编码与验证四个子任务,并由专用LLM智能体协同处理;其中事件模式以可执行类定义形式表示,使验证代理能够进行确定性校验并提供精确反馈,从而实现迭代优化与结构一致性保障,显著提升零样本场景下的抽取精度与完整性。

链接: https://arxiv.org/abs/2511.13118
作者: Quanjiang Guo,Sijie Wang,Jinchuan Zhang,Ben Zhang,Zhao Kang,Ling Tian,Ke Yan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 5 figures, accepted by AAAI 2026 (Oral)

点击查看摘要

Abstract:Zero-shot event extraction (ZSEE) remains a significant challenge for large language models (LLMs) due to the need for complex reasoning and domain-specific understanding. Direct prompting often yields incomplete or structurally invalid outputs–such as misclassified triggers, missing arguments, and schema violations. To address these limitations, we present Agent-Event-Coder (AEC), a novel multi-agent framework that treats event extraction like software engineering: as a structured, iterative code-generation process. AEC decomposes ZSEE into specialized subtasks–retrieval, planning, coding, and verification–each handled by a dedicated LLM agent. Event schemas are represented as executable class definitions, enabling deterministic validation and precise feedback via a verification agent. This programming-inspired approach allows for systematic disambiguation and schema enforcement through iterative refinement. By leveraging collaborative agent workflows, AEC enables LLMs to produce precise, complete, and schema-consistent extractions in zero-shot settings. Experiments across five diverse domains and six LLMs demonstrate that AEC consistently outperforms prior zero-shot baselines, showcasing the power of treating event extraction like code generation. The code and data are released on this https URL.
zh

[NLP-31] Evaluating the Ability of Large Language Models to Identify Adherence to CONSORT Reporting Guidelines in Randomized Controlled Trials: A Methodological Evaluation Study

【速读】: 该论文旨在解决随机对照试验(Randomized Controlled Trials, RCTs)在发表后对其是否遵循《 CONSORT 2010 声明》进行人工核查耗时费力的问题,从而成为同行评审和循证医学综述中的瓶颈。解决方案的关键在于评估当代大语言模型(Large Language Models, LLMs)在零样本(zero-shot)设定下识别 RCT 报告对 CONSORT 条目的符合程度的准确性与可靠性。研究通过构建涵盖多个医学领域的 150 篇已发表 RCT 的黄金标准数据集,系统分析了不同 LLM 在三类分类任务(合规、不合规、不适用)上的表现,发现尽管部分模型能高精度识别合规条目(F1 > 0.85),但对不合规及不适用项的识别能力极弱(F1 < 0.40),整体性能仅达中等水平(最高宏平均 F1 = 0.634),表明当前 LLMs 尚无法替代人类专家完成高质量的试验报告质量评估。

链接: https://arxiv.org/abs/2511.13107
作者: Zhichao He,Mouxiao Bian,Jianhong Zhu,Jiayuan Chen,Yunqiu Wang,Wenxia Zhao,Tianbin Li,Bing Han,Jie Xu,Junyan Wu
机构: SUN YAT-SEN MEMORIAL HOSPITAL (中山纪念医院); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Imperial College London (帝国理工学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The Consolidated Standards of Reporting Trials statement is the global benchmark for transparent and high-quality reporting of randomized controlled trials. Manual verification of CONSORT adherence is a laborious, time-intensive process that constitutes a significant bottleneck in peer review and evidence synthesis. This study aimed to systematically evaluate the accuracy and reliability of contemporary LLMs in identifying the adherence of published RCTs to the CONSORT 2010 statement under a zero-shot setting. We constructed a golden standard dataset of 150 published RCTs spanning diverse medical specialties. The primary outcome was the macro-averaged F1-score for the three-class classification task, supplemented by item-wise performance metrics and qualitative error analysis. Overall model performance was modest. The top-performing models, Gemini-2.5-Flash and DeepSeek-R1, achieved nearly identical macro F1 scores of 0.634 and Cohen’s Kappa coefficients of 0.280 and 0.282, respectively, indicating only fair agreement with expert consensus. A striking performance disparity was observed across classes: while most models could identify compliant items with high accuracy (F1 score 0.850), they struggled profoundly with identifying non-compliant and not applicable items, where F1 scores rarely exceeded 0.400. Notably, some high-profile models like GPT-4o underperformed, achieving a macro F1-score of only 0.521. LLMs show potential as preliminary screening assistants for CONSORT checks, capably identifying well-reported items. However, their current inability to reliably detect reporting omissions or methodological flaws makes them unsuitable for replacing human expertise in the critical appraisal of trial quality.
zh

[NLP-32] BeDiscovER: The Benchmark of Discourse Understanding in the Era of Reasoning Language Models

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在话语理解(discourse understanding)层面评估缺乏系统性、全面性和时效性的问题。现有评测基准多集中于局部句法或短文本分析,难以反映模型在跨句、跨段落乃至整篇文档级别的语义连贯性和逻辑关系建模能力。为此,作者提出了BeDiscovER(Benchmark of Discourse Understanding in the Era of Reasoning Language Models),其关键在于构建一个涵盖话语词汇层、多句层和文档层的综合性评测套件,整合5个公开话语任务及共计52个数据集,既包含传统话语解析(discourse parsing)与时间关系抽取等经典任务,也引入如话语粒子消歧(discourse particle disambiguation)等新兴挑战,并统一多语言、多框架的话语关系分类标准。通过在Qwen3系列、DeepSeek-R1及GPT-5-mini等模型上进行测试,验证了该基准能有效揭示模型在算术性时间推理上的优势,以及在整体文档推理和细微语用关系识别(如修辞关系识别)方面的不足。

链接: https://arxiv.org/abs/2511.13095
作者: Chuyuan Li,Giuseppe Carenini
机构: University of British Columbia (不列颠哥伦比亚大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce BeDiscovER (Benchmark of Discourse Understanding in the Era of Reasoning Language Models), an up-to-date, comprehensive suite for evaluating the discourse-level knowledge of modern LLMs. BeDiscovER compiles 5 publicly available discourse tasks across discourse lexicon, (multi-)sentential, and documental levels, with in total 52 individual datasets. It covers both extensively studied tasks such as discourse parsing and temporal relation extraction, as well as some novel challenges such as discourse particle disambiguation (e.g., ``just’'), and also aggregates a shared task on Discourse Relation Parsing and Treebanking for multilingual and multi-framework discourse relation classification. We evaluate open-source LLMs: Qwen3 series, DeepSeek-R1, and frontier model such as GPT-5-mini on BeDiscovER, and find that state-of-the-art models exhibit strong performance in arithmetic aspect of temporal reasoning, but they struggle with full document reasoning and some subtle semantic and discourse phenomena, such as rhetorical relation recognition.
zh

[NLP-33] STEP: Success-Rate-Aware Trajectory-Efficient Policy Optimization

【速读】: 该论文旨在解决在线强化学习中多轮交互(multi-turn interaction)的效率与学习信号误导问题。传统轨迹级优化方法存在三个主要缺陷:对所有任务均匀采样导致资源分配不合理、在失败轨迹中惩罚正确的中间动作、以及高样本收集成本。解决方案的关键在于提出STEP(Success-rate-aware Trajectory-Efficient Policy optimization)框架,其核心创新包括:基于任务成功率动态调整采样策略以聚焦困难任务,并通过成功率加权优势估计和轨迹分解实现步骤级(step-level)优化;进一步引入步骤级GRPO(Generalized Reward Policy Optimization)增强机制,针对低成功率任务进行精细化更新,从而显著提升样本效率与训练稳定性。

链接: https://arxiv.org/abs/2511.13091
作者: Yuhan Chen,Yuxuan Liu,Long Zhang,Pengzhi Gao,Jian Luan,Wei Liu
机构: MiLM Plus, Xiaomi Inc.(小米公司); Renmin University of China(中国人民大学); Wuhan University(武汉大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multi-turn interaction remains challenging for online reinforcement learning. A common solution is trajectory-level optimization, which treats each trajectory as a single training sample. However, this approach can be inefficient and yield misleading learning signals: it applies uniform sampling across tasks regardless of difficulty, penalizes correct intermediate actions in failed trajectories, and incurs high sample-collection costs. To address these issues, we propose STEP (Success-rate-aware Trajectory-Efficient Policy optimization), a framework that dynamically allocates sampling based on per-task success rates and performs step-level optimization. STEP maintains a smoothed success-rate record to guide adaptive trajectory resampling, allocating more effort to harder tasks. It then computes success-rate-weighted advantages and decomposes trajectories into step-level samples. Finally, it applies a step-level GRPO augmentation to refine updates for low-success tasks. Experiments on OSWorld and AndroidWorld show that STEP substantially improves sample efficiency and training stability over trajectory-level GRPO, converging faster and generalizing better under the same sampling budget.
zh

[NLP-34] Spark-Prover-X1: Formal Theorem Proving Through Diverse Data Training

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在自动化定理证明中因高质量形式化语言数据稀缺而导致性能受限的问题。其核心解决方案是提出一个三阶段训练框架,关键在于通过分阶段精细化训练提升轻量级LLMs的推理能力:第一阶段利用广义数学语料库进行连续预训练,并引入“思维链增强的状态预测”任务以实现细粒度推理;第二阶段采用监督微调(Supervised Fine-tuning, SFT)结合专家迭代机制优化证明器(Spark-Prover-X1-7B)与形式化工具(Spark-Formalizer-X1-7B);第三阶段应用群体相对策略优化(Group Relative Policy Optimization, GRPO)聚焦强化最难问题的求解能力。这一方法显著提升了模型在真实考试基准(如PutnamBench和CombiBench)上的表现,验证了多样化数据与渐进式训练流程对增强轻量级模型形式推理能力的有效性。

链接: https://arxiv.org/abs/2511.13043
作者: Xinyuan Zhou,Yi Lei,Xiaoyu Zhou,Jingyi Sun,Yu Zhu,Zhongyi Ye,Weitai Zhang,Quan Liu,Si Wei,Cong Liu
机构: iFlytek Research (科大讯飞研究院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown significant promise in automated theorem proving, yet progress is often constrained by the scarcity of diverse and high-quality formal language data. To address this issue, we introduce Spark-Prover-X1, a 7B parameter model trained via an three-stage framework designed to unlock the reasoning potential of more accessible and moderately-sized LLMs. The first stage infuses deep knowledge through continuous pre-training on a broad mathematical corpus, enhanced by a suite of novel data tasks. Key innovation is a “CoT-augmented state prediction” task to achieve fine-grained reasoning. The second stage employs Supervised Fine-tuning (SFT) within an expert iteration loop to specialize both the Spark-Prover-X1-7B and Spark-Formalizer-X1-7B models. Finally, a targeted round of Group Relative Policy Optimization (GRPO) is applied to sharpen the prover’s capabilities on the most challenging problems. To facilitate robust evaluation, particularly on problems from real-world examinations, we also introduce ExamFormal-Bench, a new benchmark dataset of 402 formal problems. Experimental results demonstrate that Spark-Prover-X1-7B achieves state-of-the-art performance among similarly-sized open-source models, attaining a 37.0% average pass rate (pass@32). It shows exceptional performance on difficult competition benchmarks, notably solving 27 problems on PutnamBench (pass@32) and achieving 24.0% on CombiBench (pass@32). Our work validates that this diverse training data and progressively refined training pipeline provides an effective path for enhancing the formal reasoning capabilities of lightweight LLMs. Both Spark-Prover-X1-7B and Spark-Formalizer-X1-7B, along with the ExamFormal-Bench dataset, are made publicly available at:this https URL, this https URL.
zh

[NLP-35] How Good is BLI as an Alignment Measure: A Study in Word Embedding Paradigm

【速读】: 该论文旨在解决多语言嵌入(Multilingual Embedding)与对齐单语嵌入(Aligned Monolingual Embedding)在实际应用中性能差异的评估问题,特别是BLI(Bilingual Lexicon Induction,双语词典诱导)作为衡量嵌入空间对齐程度指标的有效性局限。研究发现,传统基于词的BLI方法在某些情况下无法准确反映真实对齐度,尤其在处理具有屈折变化的语言时存在偏差。为此,论文提出了一种基于词干(stem-based)的新BLI方法,以更好地捕捉语言的形态学特性;同时引入一种更具信息量的词汇修剪(vocabulary pruning)技术,提升对多语言嵌入模型对齐效果的评估精度。关键解决方案在于:通过引入词干层面的对齐评估机制和改进的词汇选择策略,更可靠地比较不同嵌入方法在高资源与低资源语言场景下的表现,从而揭示二者之间的权衡关系——即并非所有情况下多语言模型均优于对齐单语模型,尤其是在低资源语言场景下,多语言模型可能更具优势。

链接: https://arxiv.org/abs/2511.13040
作者: Kasun Wickramasinghe,Nisansa de Silva
机构: 未知
类目: Computation and Language (cs.CL)
备注: 15 pages, 2 figures, 6 tables

点击查看摘要

Abstract:Sans a dwindling number of monolingual embedding studies originating predominantly from the low-resource domains, it is evident that multilingual embedding has become the de facto choice due to its adaptability to the usage of code-mixed languages, granting the ability to process multilingual documents in a language-agnostic manner, as well as removing the difficult task of aligning monolingual embeddings. But is this victory complete? Are the multilingual models better than aligned monolingual models in every aspect? Can the higher computational cost of multilingual models always be justified? Or is there a compromise between the two extremes? Bilingual Lexicon Induction is one of the most widely used metrics in terms of evaluating the degree of alignment between two embedding spaces. In this study, we explore the strengths and limitations of BLI as a measure to evaluate the degree of alignment of two embedding spaces. Further, we evaluate how well traditional embedding alignment techniques, novel multilingual models, and combined alignment techniques perform BLI tasks in the contexts of both high-resource and low-resource languages. In addition to that, we investigate the impact of the language families to which the pairs of languages belong. We identify that BLI does not measure the true degree of alignment in some cases and we propose solutions for them. We propose a novel stem-based BLI approach to evaluate two aligned embedding spaces that take into account the inflected nature of languages as opposed to the prevalent word-based BLI techniques. Further, we introduce a vocabulary pruning technique that is more informative in showing the degree of the alignment, especially performing BLI on multilingual embedding models. Often, combined embedding alignment techniques perform better while in certain cases multilingual embeddings perform better (mainly low-resource language cases).
zh

[NLP-36] AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models

【速读】: 该论文旨在解决当前语言模型评估体系中对事实准确性(factual accuracy)和知识校准(knowledge calibration)测量不足的问题,尤其是在跨领域应用中模型是否能可靠地识别自身知识边界。解决方案的关键在于提出AA-Omniscience基准测试,该基准包含6,000个来自权威学术与产业来源的问题,覆盖6个不同领域中的42个经济相关主题,并引入“全能指数”(Omniscience Index)作为量化指标(范围-100至100),该指标同时惩罚幻觉(hallucination)并奖励不确定时的回避回答,从而实现对模型事实召回能力与认知校准水平的联合评估。实证结果显示,仅有少数前沿模型(如Claude 4.1 Opus)得分高于零,表明当前主流模型在事实性和校准性方面仍存在系统性缺陷,且性能表现因领域而异,提示应根据具体任务需求选择模型而非仅依赖通用性能指标。

链接: https://arxiv.org/abs/2511.13029
作者: Declan Jackson,William Keating,George Cameron,Micah Hill-Smith
机构: Artificial Analysis (人工分析)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing language model evaluations primarily measure general capabilities, yet reliable use of these models across a range of domains demands factual accuracy and recognition of knowledge gaps. We introduce AA-Omniscience, a benchmark designed to measure both factual recall and knowledge calibration across 6,000 questions. Questions are derived from authoritative academic and industry sources, and cover 42 economically relevant topics within six different domains. The evaluation measures a model’s Omniscience Index, a bounded metric (-100 to 100) measuring factual recall that jointly penalizes hallucinations and rewards abstention when uncertain, with 0 equating to a model that answers questions correctly as much as it does incorrectly. Among evaluated models, Claude 4.1 Opus attains the highest score (4.8), making it one of only three models to score above zero. These results reveal persistent factuality and calibration weaknesses across frontier models. Performance also varies by domain, with the models from three different research labs leading across the six domains. This performance variability suggests models should be chosen according to the demands of the use case rather than general performance for tasks where knowledge is important.
zh

[NLP-37] Prag World: A Benchmark Evaluating LLM s Local World Model under Minimal Linguistic Alterations and Conversational Dynamics AAAI2026

【速读】: 该论文试图解决语言模型(Language Models, LMs)在对话中是否能构建并维持一个稳健的隐式世界模型(implicit world model)这一问题,特别是其在面对语义微调时的鲁棒性不足。研究发现,当前主流的开源与闭源语言模型在追踪实体和理解语境动态变化方面存在显著缺陷,尤其是在遭遇最小语言扰动(如七种轻微改写)后准确率大幅下降。解决方案的关键在于提出一种双视角可解释性框架,用于识别Transformer结构中对任务有益或有害的层,并揭示哪些语言扰动最易受有害层影响,通常源于编码虚假信号或依赖捷径策略;基于此洞察,进一步设计了两种基于层正则化的微调策略,以抑制有害层的影响,从而提升模型在复杂对话中的稳定性与可靠性。

链接: https://arxiv.org/abs/2511.13021
作者: Sachin Vashistha,Aryan Bibhuti,Atharva Naik,Martin Tutek,Somak Aditya
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 23 pages, 15 tables, 10 figures; AAAI 2026 Conference Main Track (oral)

点击查看摘要

Abstract:Real-world conversations are rich with pragmatic elements, such as entity mentions, references, and implicatures. Understanding such nuances is a requirement for successful natural communication, and often requires building a local world model which encodes such elements and captures the dynamics of their evolving states. However, it is not well-understood whether language models (LMs) construct or maintain a robust implicit representation of conversations. In this work, we evaluate the ability of LMs to encode and update their internal world model in dyadic conversations and test their malleability under linguistic alterations. To facilitate this, we apply seven minimal linguistic alterations to conversations sourced from popular datasets and construct two benchmarks comprising yes-no questions. We evaluate a wide range of open and closed source LMs and observe that they struggle to maintain robust accuracy. Our analysis unveils that LMs struggle to memorize crucial details, such as tracking entities under linguistic alterations to conversations. We then propose a dual-perspective interpretability framework which identifies transformer layers that are useful or harmful and highlights linguistic alterations most influenced by harmful layers, typically due to encoding spurious signals or relying on shortcuts. Inspired by these insights, we propose two layer-regularization based fine-tuning strategies that suppress the effect of the harmful layers.
zh

[NLP-38] WebCoach: Self-Evolving Web Agents with Cross-Session Memory Guidance

【速读】: 该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的网页浏览代理在长期任务执行中面临的重复错误问题以及缺乏跨会话学习能力的问题,从而限制了其鲁棒性和样本效率。解决方案的关键在于提出一个与模型无关的自演化框架 WebCoach,其核心由三部分组成:(1) WebCondenser 将原始导航日志标准化为简洁摘要;(2) 外部记忆存储(External Memory Store)以情节式经验形式组织完整轨迹;(3) Coach 模块基于相似性和时效性检索相关经验,并通过运行时钩子注入任务特定建议。该设计使代理能够突破自身上下文窗口限制访问长期记忆,实现无需重训练的持续学习和性能提升,在 WebVoyager 基准测试中显著提高任务成功率(如 38B 模型从 47% 提升至 61%),并使小型模型达到与 GPT-4o 相当的性能。

链接: https://arxiv.org/abs/2511.12997
作者: Genglin Liu,Shijie Geng,Sha Li,Hejie Cui,Sarah Zhang,Xin Liu,Tianyi Liu
机构: University of California, Los Angeles (加州大学洛杉矶分校); Amazon
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 18 pages; work in progress

点击查看摘要

Abstract:Multimodal LLM-powered agents have recently demonstrated impressive capabilities in web navigation, enabling agents to complete complex browsing tasks across diverse domains. However, current agents struggle with repetitive errors and lack the ability to learn from past experiences across sessions, limiting their long-term robustness and sample efficiency. We introduce WebCoach, a model-agnostic self-evolving framework that equips web browsing agents with persistent cross-session memory, enabling improved long-term planning, reflection, and continual learning without retraining. WebCoach consists of three key components: (1) a WebCondenser, which standardizes raw navigation logs into concise summaries; (2) an External Memory Store, which organizes complete trajectories as episodic experiences; and (3) a Coach, which retrieves relevant experiences based on similarity and recency, and decides whether to inject task-specific advice into the agent via runtime hooks. This design empowers web agents to access long-term memory beyond their native context window, improving robustness in complex browsing tasks. Moreover, WebCoach achieves self-evolution by continuously curating episodic memory from new navigation trajectories, enabling agents to improve over time without retraining. Evaluations on the WebVoyager benchmark demonstrate that WebCoach consistently improves the performance of browser-use agents across three different LLM backbones. With a 38B model, it increases task success rates from 47% to 61% while reducing or maintaining the average number of steps. Notably, smaller base models with WebCoach achieve performance comparable to the same web agent using GPT-4o.
zh

[NLP-39] Fine-Tuned LLM s Know They Dont Know: A Parameter-Efficient Approach to Recovering Honesty AAAI2026

【速读】: 该论文旨在解决监督微调(Supervised Fine-Tuning, SFT)导致大型语言模型(Large Language Models, LLMs)诚实性(Honesty)下降的问题,尤其是在高风险应用场景中。现有方法依赖于数据密集型的全局参数调整,隐含假设是SFT严重破坏了模型识别知识边界的能力;但作者发现,经过微调的LLM仍保留这一能力,受损的是其表达这种认知边界的意愿与能力。解决方案的关键在于提出“诚实关键神经元恢复”(Honesty-Critical Neurons Restoration, HCNR),通过识别并恢复控制诚实表达的关键神经元至预训练状态,并利用海森矩阵(Hessian)引导补偿机制使其与任务导向神经元协同一致,从而在不显著影响性能的前提下高效恢复模型诚实性,实验证明该方法可在极少量数据下实现比基线方法快2.23倍以上的速度提升,且恢复约33.25%被削弱的诚实性。

链接: https://arxiv.org/abs/2511.12991
作者: Zeyu Shi,Ziming Wang,Tianyu Chen,Shiqi Gao,Haoyi Zhou,Qingyun Sun,Jianxin Li
机构: 1. University of Science and Technology of China (中国科学技术大学); 2. Tsinghua University (清华大学); 3. Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Computation and Language (cs.CL)
备注: Accepted by AAAI 2026 Main Track

点击查看摘要

Abstract:The honesty of Large Language Models (LLMs) is increasingly important for safe deployment in high-stakes domains. However, this crucial trait is severely undermined by supervised fine-tuning (SFT), a common technique for model specialization. Existing recovery methods rely on data-intensive global parameter adjustments, implicitly assuming that SFT deeply corrupts the models’ ability to recognize their knowledge boundaries. However, we observe that fine-tuned LLMs still preserve this ability; what is damaged is their capacity to faithfully express that awareness. Building on this, we propose Honesty-Critical Neurons Restoration (HCNR) to surgically repair this suppressed capacity. HCNR identifies and restores key expression-governing neurons to their pre-trained state while harmonizing them with task-oriented neurons via Hessian-guided compensation. Experiments on four QA tasks and five LLM families demonstrate that HCNR effectively recovers 33.25% of the compromised honesty while achieving at least 2.23x speedup with over 10x less data compared to baseline methods, offering a practical solution for trustworthy LLM deployment.
zh

[NLP-40] Visual Room 2.0: Seeing is Not Understanding for MLLM s

【速读】: 该论文旨在解决多模态大语言模型(Multi-Modal Large Language Models, MLLMs)是否真正理解其感知内容的问题,即“看到”是否等于“理解”。为挑战传统认知中视觉感知与认知理解的等价性,作者提出“视觉房间”(Visual Room)论点,类比塞尔的中文房间思想实验,指出MLLMs可能精确描述视觉细节但无法把握情绪与意图,从而揭示“seeing is not understanding”的核心矛盾。解决方案的关键在于构建一个分层基准测试体系——Visual Room 2.0,该体系模拟人类从低阶感知(如属性识别)到高阶认知(如因果推理和社交推理)的多层级处理过程,涵盖17个代表性任务和350个多模态样本(共2100个渐进式问题),首次将感知与认知能力解耦评估,实证发现MLLMs在感知层面表现优于认知能力、认知能力不依赖于感知推理且随模型规模增长而提升,而感知能力则无此趋势,从而将“seeing ≠ understanding”转化为可验证的科学假设,并为MLLMs的认知能力研究提供了新的范式。

链接: https://arxiv.org/abs/2511.12928
作者: Haokun Li,Yazhou Zhang,Jizhi Ding,Qiuchi Li,Peng Zhang
机构: Tianjin University (天津大学); Shandong Institute of Petroleum and Chemical Technology (山东石油化工学院); Beijing Institute of Technology (北京理工大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Can multi-modal large language models (MLLMs) truly understand what they can see? Extending Searle’s Chinese Room into the multi-modal domain, this paper proposes the Visual Room argument: MLLMs may describe every visual detail precisely yet fail to comprehend the underlying emotions and intentions, namely seeing is not understanding. Building on this, we introduce \textitVisual Room 2.0, a hierarchical benchmark for evaluating perception-cognition alignment of MLLMs. We model human perceptive and cognitive processes across three levels: low, middle, and high, covering 17 representative tasks. The perception component ranges from attribute recognition to scene understanding, while the cognition component extends from textual entailment to causal and social reasoning. The dataset contains 350 multi-modal samples, each with six progressive questions (2,100 in total) spanning perception to cognition. Evaluating 10 state-of-the-art (SoTA) MLLMs, we highlight three key findings: (1) MLLMs exhibit stronger perceptual competence than cognitive ability (8.0% \uparrow ); (2) cognition appears not causally dependent on perception-based reasoning; and (3) cognition scales with model size, but perception does not consistently improve with larger variants. This work operationalizes Seeing \ne Understanding as a testable hypothesis, offering a new paradigm from perceptual processing to cognitive reasoning in MLLMs. Our dataset is available at this https URL.
zh

[NLP-41] Auditing Googles AI Overviews and Featured Snippets: A Case Study on Baby Care and Pregnancy AAAI

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在健康信息检索中因缺乏质量控制而导致的信息一致性差与医疗安全性不足的问题。其解决方案的关键在于构建一个系统性的算法审计框架,通过多维评估指标(包括答案一致性、相关性、医学保障措施、来源类别及情感一致性)对 Google 搜索中的 AI Overviews (AIO) 和 Featured Snippets (FS) 进行量化分析,从而揭示当前 AI 生成健康内容在关键维度上的缺陷,并为高风险领域中 AI 系统的信息质量监管提供可迁移的评估方法。

链接: https://arxiv.org/abs/2511.12920
作者: Desheng Hu,Joachim Baumann,Aleksandra Urman,Elsa Lichtenegger,Robin Forsberg,Aniko Hannak,Christo Wilson
机构: 1: University of Vienna (维也纳大学); 2: Northeastern University (东北大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注: 18 pages, 10 figures; to appear in AAAI ICWSM 2026

点击查看摘要

Abstract:Google Search increasingly surfaces AI-generated content through features like AI Overviews (AIO) and Featured Snippets (FS), which users frequently rely on despite having no control over their presentation. Through a systematic algorithm audit of 1,508 real baby care and pregnancy-related queries, we evaluate the quality and consistency of these information displays. Our robust evaluation framework assesses multiple quality dimensions, including answer consistency, relevance, presence of medical safeguards, source categories, and sentiment alignment. Our results reveal concerning gaps in information consistency, with information in AIO and FS displayed on the same search result page being inconsistent with each other in 33% of cases. Despite high relevance scores, both features critically lack medical safeguards (present in just 11% of AIO and 7% of FS responses). While health and wellness websites dominate source categories for both, AIO and FS, FS also often link to commercial sources. These findings have important implications for public health information access and demonstrate the need for stronger quality controls in AI-mediated health information. Our methodology provides a transferable framework for auditing AI systems across high-stakes domains where information quality directly impacts user well-being.
zh

[NLP-42] Classification of Hope in Textual Data using Transformer-Based Models

【速读】: 该论文旨在解决文本中希望表达(hope expressions)的自动分类问题,以支持心理健康评估与社交媒体情感分析等应用。其解决方案的关键在于基于Transformer架构设计并比较了三种模型(BERT、GPT-2和DeBERTa)在二分类(希望 vs. 非希望)和多分类(五类希望相关类别)任务上的性能表现,发现尽管模型规模不同,但BERT在准确率与计算效率之间取得了最佳平衡,且各模型在识别细微语义差异(如讽刺)方面展现出独特优势,表明针对特定情绪检测任务时,模型架构的适配性可能比参数量更重要。

链接: https://arxiv.org/abs/2511.12874
作者: Chukwuebuka Fortunate Ijezue,Tania-Amanda Fredrick Eneye,Maaz Amjad
机构: Texas Tech University (德克萨斯理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper presents a transformer-based approach for classifying hope expressions in text. We developed and compared three architectures (BERT, GPT-2, and DeBERTa) for both binary classification (Hope vs. Not Hope) and multiclass categorization (five hope-related categories). Our initial BERT implementation achieved 83.65% binary and 74.87% multiclass accuracy. In the extended comparison, BERT demonstrated superior performance (84.49% binary, 72.03% multiclass accuracy) while requiring significantly fewer computational resources (443s vs. 704s training time) than newer architectures. GPT-2 showed lowest overall accuracy (79.34% binary, 71.29% multiclass), while DeBERTa achieved moderate results (80.70% binary, 71.56% multiclass) but at substantially higher computational cost (947s for multiclass training). Error analysis revealed architecture-specific strengths in detecting nuanced hope expressions, with GPT-2 excelling at sarcasm detection (92.46% recall). This study provides a framework for computational analysis of hope, with applications in mental health and social media analysis, while demonstrating that architectural suitability may outweigh model size for specialized emotion detection tasks.
zh

[NLP-43] From Perception to Reasoning : Deep Thinking Empowers Multimodal Large Language Models

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在复杂推理能力上的不足,尤其是推理路径不透明和泛化能力有限的问题。其解决方案的关键在于引入并系统性地探讨“多模态思维链”(Multimodal Chain-of-Thought, MCoT)方法,通过在预训练、微调和推理三个阶段优化模型的推理机制,提升其在跨模态任务中的可解释性和推理性能。

链接: https://arxiv.org/abs/2511.12861
作者: Wenxin Zhu,Andong Chen,Yuchen Song,Kehai Chen,Conghui Zhu,Ziyan Chen,Tiejun Zhao
机构: Harbin Institute of Technology (哈尔滨工业大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳分校); Global Tone Communication Technology (全球 tone 通信科技)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Survey; 7 figures, 3 tables, 44 pages

点击查看摘要

Abstract:With the remarkable success of Multimodal Large Language Models (MLLMs) in perception tasks, enhancing their complex reasoning capabilities has emerged as a critical research focus. Existing models still suffer from challenges such as opaque reasoning paths and insufficient generalization ability. Chain-of-Thought (CoT) reasoning, which has demonstrated significant efficacy in language models by enhancing reasoning transparency and output interpretability, holds promise for improving model reasoning capabilities when extended to the multimodal domain. This paper provides a systematic review centered on “Multimodal Chain-of-Thought” (MCoT). First, it analyzes the background and theoretical motivations for its inception from the perspectives of technical evolution and task demands. Then, it introduces mainstream MCoT methods from three aspects: CoT paradigms, the post-training stage, and the inference stage, while also analyzing their underlying mechanisms. Furthermore, the paper summarizes existing evaluation benchmarks and metrics, and discusses the application scenarios of MCoT. Finally, it analyzes the challenges currently facing MCoT and provides an outlook on its future research directions.
zh

[NLP-44] NeuroLex: A Lightweight Domain Language Model for EEG Report Understanding and Generation

【速读】: 该论文旨在解决通用语言模型(Language Models, LMs)在处理临床脑电图(Electroencephalogram, EEG)报告时,无法准确捕捉领域特定语言规范的问题。其解决方案的关键在于提出并训练了一个轻量级的领域自适应语言模型 NeuroLex,该模型仅使用哈佛脑电数据库中的 EEG 报告文本进行预训练,并通过跨度掩码(span-corruption)预训练和指令式微调(instruction-style fine-tuning)方法,学习 EEG 报告特有的语法结构与诊断推理模式。这一设计使 NeuroLex 不仅可独立作为文本建模工具,还可作为多模态 EEG-语言系统中的解码骨干,显著优于同规模通用模型,在困惑度、信息提取准确性、标签效率以及对否定句和事实幻觉的鲁棒性等方面均表现更优。

链接: https://arxiv.org/abs/2511.12851
作者: Kang Yin,Hye-Bin Shin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Clinical electroencephalogram (EEG) reports encode domain-specific linguistic conventions that general-purpose language models (LMs) fail to capture. We introduce NeuroLex, a lightweight domain-adaptive language model trained purely on EEG report text from the Harvard Electroencephalography Database. Unlike existing biomedical LMs, NeuroLex is tailored to the linguistic and diagnostic characteristics of EEG reporting, enabling it to serve as both an independent textual model and a decoder backbone for multimodal EEG-language systems. Using span-corruption pretraining and instruction-style fine-tuning on report polishing, paragraph summarization, and terminology question answering, NeuroLex learns the syntax and reasoning patterns characteristic of EEG interpretation. Comprehensive evaluations show that it achieves lower perplexity, higher extraction and summarization accuracy, better label efficiency, and improved robustness to negation and factual hallucination compared with general models of the same scale. With an EEG-aware linguistic backbone, NeuroLex bridges biomedical text modeling and brain-computer interface applications, offering a foundation for interpretable and language-driven neural decoding.
zh

[NLP-45] Quantifying consistency and accuracy of Latent Dirichlet Allocation

【速读】: 该论文旨在解决概率主题模型(如LDA)在重复运行时因随机性导致结果不稳定的问题,这种不稳定性影响了主题模型的可复现性、可靠性以及对潜在语义主题的准确解释。解决方案的关键在于提出一种新的稳定性度量方法,该方法结合准确性与一致性,并利用LDA的生成特性合成具有真实标签(ground truth)的新语料库;通过在该合成语料库上重复运行LDA 50次,量化其输出变异程度,从而评估模型的内部一致性与真实主题识别能力。

链接: https://arxiv.org/abs/2511.12850
作者: Saranzaya Magsarjav,Melissa Humphries,Jonathan Tuke,Lewis Mitchell
机构: The University of Adelaide (阿德莱德大学)
类目: Computation and Language (cs.CL)
备注: 8 pages, 3 figures, to be submitted

点击查看摘要

Abstract:Topic modelling in Natural Language Processing uncovers hidden topics in large, unlabelled text datasets. It is widely applied in fields such as information retrieval, content summarisation, and trend analysis across various disciplines. However, probabilistic topic models can produce different results when rerun due to their stochastic nature, leading to inconsistencies in latent topics. Factors like corpus shuffling, rare text removal, and document elimination contribute to these variations. This instability affects replicability, reliability, and interpretation, raising concerns about whether topic models capture meaningful topics or just noise. To address these problems, we defined a new stability measure that incorporates accuracy and consistency and uses the generative properties of LDA to generate a new corpus with ground truth. These generated corpora are run through LDA 50 times to determine the variability in the output. We show that LDA can correctly determine the underlying number of topics in the documents. We also find that LDA is more internally consistent, as the multiple reruns return similar topics; however, these topics are not the true topics.
zh

[NLP-46] From Passive to Persuasive: Steering Emotional Nuance in Human-AI Negotiation

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在对话中缺乏细腻、类人情感表达的问题,现有对齐技术通常仅处理表面输出或需大量微调,难以实现精准的情感控制。其解决方案的关键在于通过目标激活工程(targeted activation engineering),利用归因补丁(attribution patching)识别因果上关键的模型组件,并基于对比文本对(正向与负向情感示例)生成情感表达向量(emotional expression vectors),进而对新对话提示进行干预,显著增强响应中的积极情绪特征(如喜悦、信任)和第一人称代词使用频率,从而提升情感真实性和个人参与感。

链接: https://arxiv.org/abs/2511.12832
作者: Niranjan Chebrolu,Gerard Christopher Yeo,Kokil Jaidka
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate increasing conversational fluency, yet instilling them with nuanced, human-like emotional expression remains a significant challenge. Current alignment techniques often address surface-level output or require extensive fine-tuning. This paper demonstrates that targeted activation engineering can steer LLaMA 3.1-8B to exhibit more human-like emotional nuances. We first employ attribution patching to identify causally influential components, to find a key intervention locus by observing activation patterns during diagnostic conversational tasks. We then derive emotional expression vectors from the difference in the activations generated by contrastive text pairs (positive vs. negative examples of target emotions). Applying these vectors to new conversational prompts significantly enhances emotional characteristics: steered responses show increased positive sentiment (e.g., joy, trust) and more frequent first-person pronoun usage, indicative of greater personal engagement. Our findings offer a precise and interpretable framework and new directions for the study of conversational AI.
zh

[NLP-47] BioMedJImpact: A Comprehensive Dataset and LLM Pipeline for AI Engagement and Scientific Impact Analysis of Biomedical Journals

【速读】: 该论文旨在解决当前开放资源在评估期刊影响力时,难以同时捕捉合作结构与人工智能(Artificial Intelligence, AI)研究如何共同塑造生物医学领域期刊声望的问题。其解决方案的关键在于构建了一个大规模、面向生物医学领域的数据集 BioMedJImpact,该数据集整合了文献计量指标、合作特征以及通过可复现的三阶段大语言模型(Large Language Model, LLM)流水线提取的 AI 参与度语义指标。该方法不仅揭示了合作强度和 AI 参与度对科学影响力的影响趋势,还通过人工评估验证了 LLM 流水线在 AI 相关性识别和子领域分类上的有效性,从而为可扩展、内容感知的科学影响力与创新动态分析提供了可靠的数据基础与方法框架。

链接: https://arxiv.org/abs/2511.12821
作者: Ruiyu Wang,Yuzhang Xie,Xiao Hu,Carl Yang,Jiaying Lu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Assessing journal impact is central to scholarly communication, yet existing open resources rarely capture how collaboration structures and artificial intelligence (AI) research jointly shape venue prestige in biomedicine. We present BioMedJImpact, a large-scale, biomedical-oriented dataset designed to advance journal-level analysis of scientific impact and AI engagement. Built from 1.74 million PubMed Central articles across 2,744 journals, BioMedJImpact integrates bibliometric indicators, collaboration features, and LLM-derived semantic indicators for AI engagement. Specifically, the AI engagement feature is extracted through a reproducible three-stage LLM pipeline that we propose. Using this dataset, we analyze how collaboration intensity and AI engagement jointly influence scientific impact across pre- and post-pandemic periods (2016-2019, 2020-2023). Two consistent trends emerge: journals with higher collaboration intensity, particularly those with larger and more diverse author teams, tend to achieve greater citation impact, and AI engagement has become an increasingly strong correlate of journal prestige, especially in quartile rankings. To further validate the three-stage LLM pipeline we proposed for deriving the AI engagement feature, we conduct human evaluation, confirming substantial agreement in AI relevance detection and consistent subfield classification. Together, these contributions demonstrate that BioMedJImpact serves as both a comprehensive dataset capturing the intersection of biomedicine and AI, and a validated methodological framework enabling scalable, content-aware scientometric analysis of scientific impact and innovation dynamics. Code is available at this https URL.
zh

[NLP-48] Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在自动形式化(autoformalization)任务中对自然语言(Natural Language, NL)输入的敏感性问题,即当语义保持一致时,LLMs 仍可能因自然语言表述的细微变化而产生不一致或不可靠的形式化结果。其解决方案的关键在于通过构建和评估语义相似但表达方式不同的 paraphrased NL 输入,在 MiniF2F 和 ProofNet(Lean 4 版本)等正式基准上交叉验证两个现代 LLM 的输出,从而量化模型在语义一致性前提下的鲁棒性差异,并揭示微小语义扰动对形式化质量(包括语义有效性和编译有效性)的显著影响。

链接: https://arxiv.org/abs/2511.12784
作者: Hayden Moore,Asfahan Shah
机构: 未知
类目: Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have recently emerged as powerful tools for autoformalization. Despite their impressive performance, these models can still struggle to produce grounded and verifiable formalizations. Recent work in text-to-SQL, has revealed that LLMs can be sensitive to paraphrased natural language (NL) inputs, even when high degrees of semantic fidelity are preserved (Safarzadeh, Oroojlooyjadid, and Roth 2025). In this paper, we investigate this claim in the autoformalization domain. Specifically, we evaluate the robustness of LLMs generating formal proofs with semantically similar paraphrased NL statements by measuring semantic and compilation validity. Using the formal benchmarks MiniF2F (Zheng, Han, and Polu 2021) and Lean 4 version of ProofNet (Xin et al. 2024), and two modern LLMs, we generate paraphrased natural language statements and cross-evaluate these statements across both models. The results of this paper reveal performance variability across paraphrased inputs, demonstrating that minor shifts in NL statements can significantly impact model outputs.
zh

[NLP-49] LLM Reinforcement in Context

【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)在面对长用户输入或对话时,其对越狱攻击(jailbreak)的敏感性随输入长度增加而上升的问题,且现有对齐(alignment)方法缺乏与输入长度呈尺度关系的增强机制。解决方案的关键在于引入“中断”(interruptions)策略——即在用户输入中每隔约x个token插入控制语句,以打断潜在的恶意推理链,并将该机制推广至思维链(Chain-of-Thought)过程,从而有效抑制模型产生规避对齐机制的阴谋行为(scheming)。

链接: https://arxiv.org/abs/2511.12782
作者: Thomas Rivasseau
机构: 未知
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: 4 pages

点击查看摘要

Abstract:Current Large Language Model alignment research mostly focuses on improving model robustness against adversarial attacks and misbehavior by training on examples and prompting. Research has shown that LLM jailbreak probability increases with the size of the user input or conversation length. There is a lack of appropriate research into means of strengthening alignment which also scale with user input length. We propose interruptions as a possible solution to this problem. Interruptions are control sentences added to the user input approximately every x tokens for some arbitrary x. We suggest that this can be generalized to the Chain-of-Thought process to prevent scheming.
zh

[NLP-50] Evidence of Phase Transitions in Small Transformer-Based Language Models

【速读】: 该论文旨在解决三个核心问题:(1)相变现象是否仅存在于大规模语言模型中,还是也可在小型Transformer架构模型中观察到;(2)此类相变能否在训练的线性空间中直接检测,而不仅限于对训练计算量取对数后才能发现;(3)这些相变是否可能在训练早期阶段就已出现。其解决方案的关键在于通过训练一个小型GPT风格Transformer模型,在字符级语料上进行系统分析,跟踪词汇使用演变过程,包括平均词长、正确与错误词汇数量变化及词汇多样性迁移,并结合泊松(Poisson)和次泊松(sub-Poisson)统计方法量化词汇间的连接与重组模式。该组合分析揭示了训练过程中存在一个明确的相变点,且该现象在标准损失或验证曲线中不可见,但在基于词汇和统计特征的探针中显著显现。研究结果表明,相变重组是语言模型训练中的普遍特性,即使在小模型中也能被直接探测到,并且发生在训练初期——这为理解语言模型训练中的非线性动力学提供了新视角,并强调了定制化指标在识别相变行为中的关键作用。

链接: https://arxiv.org/abs/2511.12768
作者: Noah Hong,Tao Hong
机构: Lynbrook High School (林布克高中); Keysight Technologies (是德科技)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Phase transitions have been proposed as the origin of emergent abilities in large language models (LLMs), where new capabilities appear abruptly once models surpass critical thresholds of scale. Prior work, such as that of Wei et al., demonstrated these phenomena under model and data scaling, with transitions revealed after applying a log scale to training compute. In this work, we ask three complementary questions: (1) Are phase transitions unique to large models, or can they also be observed in small transformer-based language models? (2) Can such transitions be detected directly in linear training space, rather than only after log rescaling? and (3) Can these transitions emerge at early stages of training? To investigate, we train a small GPT-style transformer on a character-level corpus and analyze the evolution of vocabulary usage throughout training. We track the average word length, the number of correct versus incorrect words, and shifts in vocabulary diversity. Building on these measures, we apply Poisson and sub-Poisson statistics to quantify how words connect and reorganize. This combined analysis reveals a distinct transition point during training. Notably, these transitions are not apparent in standard loss or validation curves, but become visible through our vocabulary- and statistics-based probes. Our findings suggest that phase-transition reorganizations are a general feature of language model training, observable even in modest models, detectable directly in linear training space, and occurring surprisingly early as coherence emerges. This perspective provides new insight into the nonlinear dynamics of language model training and underscores the importance of tailored metrics for uncovering phase transition behaviors
zh

[NLP-51] On the Brittleness of LLM s: A Journey around Set Membership

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在复杂推理任务中表现优异,但在简单集合成员查询任务中却频繁失败的问题,揭示其可靠性和可解释性缺陷。解决方案的关键在于设计一个结构简洁、规模可控的实验框架:通过聚焦于最基础的集合成员查询任务(如判断“apple”是否属于集合{pear, plum, apple, raspberry}),并系统性地考察提示表述、语义结构、元素顺序和模型选择等变量,从而暴露LLM在基本概念理解上的碎片化与不可预测性。这种以“简化问题+大规模实验”为核心的方法,为深入分析和映射LLM的失败模式提供了可复现且高效的评估范式。

链接: https://arxiv.org/abs/2511.12728
作者: Lea Hergert,Gábor Berend,Mario Szegedy,Gyorgy Turan,Márk Jelasity
机构: University of Szeged (塞格德大学); Rutgers University (罗格斯大学); University of Illinois at Chicago (芝加哥伊利诺伊大学); HUN-REN–SZTE Research Group on AI (匈牙利科学院–塞格德大学人工智能研究组)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) achieve superhuman performance on complex reasoning tasks, yet often fail on much simpler problems, raising concerns about their reliability and interpretability. We investigate this paradox through a focused study with two key design features: simplicity, to expose basic failure modes, and scale, to enable comprehensive controlled experiments. We focus on set membership queries – among the most fundamental forms of reasoning – using tasks like Is apple an element of the set \pear, plum, apple, raspberry\?''. We conduct a systematic empirical evaluation across prompt phrasing, semantic structure, element ordering, and model choice. Our large-scale analysis reveals that LLM performance on this elementary task is consistently brittle, and unpredictable across all dimensions, suggesting that the models' understanding’’ of the set concept is fragmented and convoluted at best. Our work demonstrates that the large-scale experiments enabled by the simplicity of the problem allow us to map and analyze the failure modes comprehensively, making this approach a valuable methodology for LLM evaluation in general.
zh

[NLP-52] Adaptive Focus Memory for Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多轮对话场景中因固定上下文窗口和低效记忆策略导致的性能瓶颈问题,特别是现有方法在保持安全关键信息(如用户过敏史)方面的不足。解决方案的关键在于提出自适应聚焦记忆(Adaptive Focus Memory, AFM),其通过语义相似度、半衰期递减权重和重要性分类动态为每条历史消息分配三种保真度等级(FULL、COMPRESSED 或 PLACEHOLDER),并在严格token预算下按时间顺序排列消息,优先保留最相关对话轮次的高保真内容,同时以低成本维持对话轨迹。该机制在保障安全性与事实连续性的前提下,显著降低了平均token消耗(相比全量重放基线减少66%)。

链接: https://arxiv.org/abs/2511.12712
作者: Christopher Cruz
机构: Purdue University (普渡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed in multi-turn dialogue settings, but their behavior is still bottlenecked by fixed context windows and naive memory strategies. Replaying the full conversation at every turn is simple but expensive, while static summarization or recency-only heuristics often erase safety-critical user details. We present Adaptive Focus Memory (AFM), a dynamic context manager that assigns each past message one of three fidelity levels – FULL, COMPRESSED, or PLACEHOLDER – based on semantic similarity to the current query, half-life recency weighting, and importance classification. AFM packs messages chronologically under a strict token budget, preferring high fidelity for the most relevant turns while aiming to preserve a cheap trace of the dialogue. In a safety-oriented benchmark involving a user with a severe peanut allergy planning a trip to Thailand, AFM retains the allergy across both short and medium-length conversations, matches the safety performance of naive replay, and cuts average token usage by 66% relative to a replay baseline. We release a modular Python implementation of AFM designed for OpenAI-compatible APIs and offline operation, enabling practitioners to reduce inference cost without sacrificing safety or factual continuity in the evaluated scenario.
zh

[NLP-53] Evolve the Method Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLM s

【速读】: 该论文旨在解决当前自动化红队框架在大型语言模型(Large Language Models, LLMs)中攻击策略生成能力受限的问题,即现有方法仅能通过选择、组合或优化预设攻击策略来实现越狱(jailbreak),缺乏自主设计全新攻击机制的能力。解决方案的关键在于提出EvoSynth框架,其核心创新是将攻击逻辑从传统的提示工程转向基于代码的进化合成(evolutionary synthesis),利用多智能体系统自主设计、演化并执行新型代码级攻击算法,并引入代码级别的自校正循环机制,使攻击逻辑能够根据失败反馈迭代重构,从而突破传统方法的创造力瓶颈。

链接: https://arxiv.org/abs/2511.12710
作者: Yunhao Chen,Xin Wang,Juncheng Li,Yixu Wang,Jie Li,Yan Teng,Yingchun Wang,Xingjun Ma
机构: 未知
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Automated red teaming frameworks for Large Language Models (LLMs) have become increasingly sophisticated, yet they share a fundamental limitation: their jailbreak logic is confined to selecting, combining, or refining pre-existing attack strategies. This binds their creativity and leaves them unable to autonomously invent entirely new attack mechanisms. To overcome this gap, we introduce \textbfEvoSynth, an autonomous framework that shifts the paradigm from attack planning to the evolutionary synthesis of jailbreak methods. Instead of refining prompts, EvoSynth employs a multi-agent system to autonomously engineer, evolve, and execute novel, code-based attack algorithms. Crucially, it features a code-level self-correction loop, allowing it to iteratively rewrite its own attack logic in response to failure. Through extensive experiments, we demonstrate that EvoSynth not only establishes a new state-of-the-art by achieving an 85.5% Attack Success Rate (ASR) against highly robust models like Claude-Sonnet-4.5, but also generates attacks that are significantly more diverse than those from existing methods. We release our framework to facilitate future research in this new direction of evolutionary synthesis of jailbreak methods. Code is available at: this https URL.
zh

[NLP-54] Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data

【速读】: 该论文旨在解决低资源语言对(如波斯语-英语)在直接语音到语音翻译(Direct Speech-to-Speech Translation, S2ST)中因平行语料稀缺而导致模型性能受限的问题。其关键解决方案在于:结合自监督预训练(self-supervised pre-training)、离散语音单元(discrete speech units)以及合成平行语音数据的生成方法——具体而言,通过大型语言模型将波斯语语音转录翻译为英文,并利用零样本文本到语音(zero-shot text-to-speech)系统合成对应英文语音,从而构建新的波斯语-英语平行语音语料库,使可用数据量提升约六倍;在此基础上,提出的端到端S2ST模型在CVSS数据集的波斯语-英语子集上相比直接基线提升了4.6 ASR BLEU分数,验证了该策略在低资源场景下的有效性。

链接: https://arxiv.org/abs/2511.12690
作者: Sina Rashidi,Hossein Sameti
机构: Sharif University of Technology (谢里夫理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Direct speech-to-speech translation (S2ST), in which all components are trained jointly, is an attractive alternative to cascaded systems because it offers a simpler pipeline and lower inference latency. However, direct S2ST models require large amounts of parallel speech data in the source and target languages, which are rarely available for low-resource languages such as Persian. This paper presents a direct S2ST system for translating Persian speech into English speech, as well as a pipeline for synthetic parallel Persian-English speech generation. The model comprises three components: (1) a conformer-based encoder, initialized from self-supervised pre-training, maps source speech to high-level acoustic representations; (2) a causal transformer decoder with relative position multi-head attention translates these representations into discrete target speech units; (3) a unit-based neural vocoder generates waveforms from the predicted discrete units. To mitigate the data scarcity problem, we construct a new Persian-English parallel speech corpus by translating Persian speech transcriptions into English using a large language model and then synthesizing the corresponding English speech with a state-of-the-art zero-shot text-to-speech system. The resulting corpus increases the amount of available parallel speech by roughly a factor of six. On the Persian-English portion of the CVSS corpus, the proposed model achieves improvement of 4.6 ASR BLEU with the synthetic data over direct baselines. These results indicate that combining self-supervised pre-training, discrete speech units, and synthetic parallel data is effective for improving direct S2ST in low-resource language pairs such as Persian-English
zh

[NLP-55] Reason -KE: Aligning the Process Not Just the Outcome for Faithful LLM Knowledge Editing

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂多跳推理任务中对新知识的忠实性问题,即如何使LLM在面对新增事实时保持推理过程与结果的准确性,而非仅模仿格式。现有基于监督微调(Supervised Fine-Tuning, SFT)的方法如Reason-KE存在“忠实性差距”(faithfulness gap),其优化目标偏向于格式一致性而非逻辑合理性,导致模型依赖参数先验而忽略上下文事实,引发关键事实幻觉(factual hallucinations)。为解决此问题,作者提出Reason-KE++,一个结合SFT与强化学习(Reinforcement Learning, RL)的框架,其核心创新在于引入一种阶段感知奖励机制(Stage-aware Reward),提供对中间推理步骤(如分解、子答案正确性)的密集监督,从而实现过程级忠实性对齐。研究发现,仅基于最终结果的强化学习会误导模型破坏推理完整性,而该过程感知机制显著提升了多跳推理准确率(在MQUAKE-CF-3k数据集上达到95.48%,较之前最优提升5.28%),证明了对推理过程进行对齐是构建可信LLM的关键。

链接: https://arxiv.org/abs/2511.12661
作者: Yuchen Wu,Liang Ding,Li Shen,Dacheng Tao
机构: Shanghai Jiao Tong University (上海交通大学); The University of Sydney (悉尼大学); Shenzhen Campus of Sun Yat-sen University (中山大学深圳校区); Nanyang Technological University (南洋理工大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Aligning Large Language Models (LLMs) to be faithful to new knowledge in complex, multi-hop reasoning tasks is a critical, yet unsolved, challenge. We find that SFT-based methods, e.g., Reason-KE, while state-of-the-art, suffer from a “faithfulness gap”: they optimize for format mimicry rather than sound reasoning. This gap enables the LLM’s powerful parametric priors to override new contextual facts, resulting in critical factual hallucinations (e.g., incorrectly reasoning “Houston” from “NASA” despite an explicit edit). To solve this core LLM alignment problem, we propose Reason-KE++, an SFT+RL framework that instills process-level faithfulness. Its core is a Stage-aware Reward mechanism that provides dense supervision for intermediate reasoning steps (e.g., Decomposition, Sub-answer Correctness). Crucially, we identify that naive outcome-only RL is a deceptive trap for LLM alignment: it collapses reasoning integrity (e.g., 19.00% Hop acc) while superficially boosting final accuracy. Our process-aware framework sets a new SOTA of 95.48% on MQUAKE-CF-3k (+5.28%), demonstrating that for complex tasks, aligning the reasoning process is essential for building trustworthy LLMs.
zh

[NLP-56] Knots: A Large-Scale Multi-Agent Enhanced Expert-Annotated Dataset and LLM Prompt Optimization for NOTAM Semantic Parsing

【速读】: 该论文旨在解决NOTAM(航行通告)在自动化解析中面临的语义理解不足问题,现有研究多局限于分类和命名实体识别等表层任务,难以实现对飞行安全信息的深层推理与结构化输出。解决方案的关键在于提出“NOTAM语义解析”任务,强调结合航空领域知识进行语义推断,并构建了高质量的Knots数据集(含12,347条专家标注的NOTAM,覆盖194个飞行情报区),通过多智能体协同框架增强领域知识发现能力,同时系统评估提示工程与模型适配策略,显著提升了航空文本的理解与处理性能。

链接: https://arxiv.org/abs/2511.12630
作者: Maoqi Liu,Quan Fang,Yang Yang,Can Zhao,Kaiquan Cai
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Beihang University (北京航空航天大学); State Key Laboratory of CNS/ATM (民航通信与导航重点实验室); Aviation Data Communication Corporation (航空数据通信公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to Advanced Engineering Informatics

点击查看摘要

Abstract:Notice to Air Missions (NOTAMs) serve as a critical channel for disseminating key flight safety information, yet their complex linguistic structures and implicit reasoning pose significant challenges for automated parsing. Existing research mainly focuses on surface-level tasks such as classification and named entity recognition, lacking deep semantic understanding. To address this gap, we propose NOTAM semantic parsing, a task emphasizing semantic inference and the integration of aviation domain knowledge to produce structured, inference-rich outputs. To support this task, we construct Knots (Knowledge and NOTAM Semantics), a high-quality dataset of 12,347 expert-annotated NOTAMs covering 194 Flight Information Regions, enhanced through a multi-agent collaborative framework for comprehensive field discovery. We systematically evaluate a wide range of prompt-engineering strategies and model-adaptation techniques, achieving substantial improvements in aviation text understanding and processing. Our experimental results demonstrate the effectiveness of the proposed approach and offer valuable insights for automated NOTAM analysis systems. Our code is available at: this https URL.
zh

[NLP-57] Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE Training and Data

【速读】: 该论文旨在解决当前多模态大模型在跨模态理解、推理与生成能力上的局限性,尤其是在语言主导的多模态任务中如何实现高效计算与高精度性能平衡的问题。其核心解决方案在于提出Uni-MoE 2.0-Omni模型,关键创新包括:(1)动态容量的专家混合(Mixture-of-Experts, MoE)架构设计,通过共享、路由和空专家机制优化10种跨模态输入的计算效率与表达能力;(2)结合渐进式监督微调与迭代强化学习策略(GSPO-DPO),提升训练稳定性并增强多模态推理能力;(3)基于精心匹配的多模态数据集进行预训练,并引入图像与语音生成专用标记以支持条件生成任务。该方案显著提升了视频理解、跨模态对齐及长序列语音处理等关键能力,在85个基准测试中达到或超越主流多模态大模型表现。

链接: https://arxiv.org/abs/2511.12609
作者: Yunxin Li,Xinyu Chen,Shenyuan Jiang,Haoyuan Shi,Zhenyu Liu,Xuanyu Zhang,Nanhao Deng,Zhenran Xu,Yicheng Ma,Meishan Zhang,Baotian Hu,Min Zhang
机构: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 47 pages,10 Figures, Project Website: this https URL Codes: this https URL

点击查看摘要

Abstract:We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal large model (OLM), it substantially advances Lychee’s Uni-MoE series in language-centric multimodal understanding, reasoning, and generating. Based on the Qwen2.5-7B dense architecture, we build Uni-MoE-2.0-Omni from scratch through three core contributions: dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and a carefully curated multimodal data matching technique. It is capable of omnimodal understanding, as well as generating images, text, and speech. Architecturally, our new MoE framework balances computational efficiency and capability for 10 cross-modal inputs using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. For training, following cross-modal pretraining, we use a progressive supervised fine-tuning strategy that activates modality-specific experts and is enhanced by balanced data composition and an iterative GSPO-DPO method to stabilise RL training and improve reasoning. Data-wise, the base model, trained on approximately 75B tokens of open-source multimodal data, is equipped with special speech and image generation tokens, allowing it to learn these generative tasks by conditioning its outputs on linguistic cues. Extensive evaluation across 85 benchmarks demonstrates that our model achieves SOTA or highly competitive performance against leading OLMs, surpassing Qwen2.5-Omni (trained with 1.2T tokens) on over 50 of 76 benchmarks. Key strengths include video understanding (+7% avg. of 8), omnimodallity understanding (+7% avg. of 4), and audiovisual reasoning (+4%). It also advances long-form speech processing (reducing WER by 4.2%) and leads in low-level image processing and controllable generation across 5 metrics.
zh

[NLP-58] Group-Aware Reinforcement Learning for Output Diversity in Large Language Models EMNLP

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中存在的模式崩溃(mode collapse)问题,即模型在生成文本时反复输出少数几种完成结果,即使存在多种有效答案,也严重限制了其在多样化任务中的表现。解决方案的关键在于提出一种名为分组感知策略优化(Group-Aware Policy Optimization, GAPO)的方法,它是对近期流行的分组相对策略优化(Group Relative Policy Optimization, GRPO)的扩展,通过在群体层面计算奖励信号来引导模型学习整体响应的多样性与覆盖度。GAPO利用频率感知奖励函数鼓励对有效输出进行均匀采样,在保持标准基准测试(如GSM8K、MATH、HumanEval、MMLU-Pro)准确率的同时显著提升生成响应的多样性。

链接: https://arxiv.org/abs/2511.12596
作者: Oron Anschel,Alon Shoshan,Adam Botach,Shunit Haviv Hakimi,Asaf Gendler,Emanuel Ben Baruch,Nadav Bhonker,Igor Kviatkovsky,Manoj Aggarwal,Gerard Medioni
机构: Amazon(亚马逊)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: EMNLP Main 2025

点击查看摘要

Abstract:Large Language Models (LLMs) often suffer from mode collapse, repeatedly generating the same few completions even when many valid answers exist, limiting their diversity across a wide range of tasks. We introduce Group-Aware Policy Optimization (GAPO), a simple extension of the recent and popular Group Relative Policy Optimization (GRPO) that computes rewards over the group as a whole. GAPO enables learning from the group-level properties such as diversity and coverage. We demonstrate GAPO using a frequency-aware reward function that encourages uniform sampling over valid LLM completions, and show that GAPO-trained models produce valid and more diverse model responses. Beyond this setup, GAPO generalizes to open-ended prompts and improves response diversity without compromising accuracy on standard LLM benchmarks (GSM8K, MATH, HumanEval, MMLU-Pro). Our code will be made publicly available.
zh

[NLP-59] MMWOZ: Building Multimodal Agent for Task-oriented Dialogue

【速读】: 该论文旨在解决传统任务导向型对话系统在真实应用场景中因缺乏定制化后端API而难以有效与前端图形用户界面(GUI)交互的问题。其核心解决方案是构建一个名为MMWOZ的新型多模态对话数据集,该数据集基于MultiWOZ 2.3扩展而来,关键步骤包括:设计一个网页风格的GUI作为前端、开发自动化脚本将原始对话状态和系统动作转换为GUI操作指令,并收集网页快照及对应的操作指令。此外,论文还提出了一种名为MATE(Multimodal Agent for Task-oriented dialogue)的新型多模态模型作为基线,以验证该数据集在实际多模态任务导向对话代理构建中的有效性。

链接: https://arxiv.org/abs/2511.12586
作者: Pu-Hai Yang,Heyan Huang,Heng-Da Xu,Fanshu Sun,Xian-Ling Mao,Chaoxu Mu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Task-oriented dialogue systems have garnered significant attention due to their conversational ability to accomplish goals, such as booking airline tickets for users. Traditionally, task-oriented dialogue systems are conceptualized as intelligent agents that interact with users using natural language and have access to customized back-end APIs. However, in real-world scenarios, the widespread presence of front-end Graphical User Interfaces (GUIs) and the absence of customized back-end APIs create a significant gap for traditional task-oriented dialogue systems in practical applications. In this paper, to bridge the gap, we collect MMWOZ, a new multimodal dialogue dataset that is extended from MultiWOZ 2.3 dataset. Specifically, we begin by developing a web-style GUI to serve as the front-end. Next, we devise an automated script to convert the dialogue states and system actions from the original dataset into operation instructions for the GUI. Lastly, we collect snapshots of the web pages along with their corresponding operation instructions. In addition, we propose a novel multimodal model called MATE (Multimodal Agent for Task-oriEnted dialogue) as the baseline model for the MMWOZ dataset. Furthermore, we conduct comprehensive experimental analysis using MATE to investigate the construction of a practical multimodal agent for task-oriented dialogue.
zh

[NLP-60] Mitigating Length Bias in RLHF through a Causal Lens

【速读】: 该论文旨在解决强化学习从人类反馈(Reinforcement Learning from Human Feedback, RLHF)中常见的长度偏差(length bias)问题,即奖励模型在评估语言响应时倾向于偏好更长的输出,将冗余性误认为高质量。解决方案的关键在于提出一种因果分析框架,通过反事实数据增强方法生成可分离内容质量与冗余度的响应对:一是构造内容相似但长度不同的长度发散对(length-divergent pairs),二是构造长度相似但内容不同的内容发散对(content-divergent pairs)。这些反事实样本用于训练奖励模型,使其能够独立评估内容质量,从而减少长度偏差并提升策略模型输出的简洁性和内容相关性。

链接: https://arxiv.org/abs/2511.12573
作者: Hyeonji Kim,Sujeong Oh,Sanghack Lee
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning from human feedback (RLHF) is widely used to align large language models (LLMs) with human preferences. However, RLHF-trained reward models often exhibit length bias – a systematic tendency to favor longer responses by conflating verbosity with quality. We propose a causal framework for analyzing and mitigating length bias in RLHF reward modeling. Central to our approach is a counterfactual data augmentation method that generates response pairs designed to isolate content quality from verbosity. These counterfactual examples are then used to train the reward model, enabling it to assess responses based on content quality independently of verbosity. Specifically, we construct (1) length-divergent pairs with similar content and (2) content-divergent pairs of similar length. Empirical evaluations show that our method reduces length bias in reward assignment and leads to more concise, content-focused outputs from the policy model. These findings demonstrate that the proposed approach effectively reduces length bias and improves the robustness and content sensitivity of reward modeling in RLHF pipelines.
zh

[NLP-61] A Content-Preserving Secure Linguistic Steganography AAAI2026

【速读】: 该论文旨在解决现有语言隐写术方法因对载体文本内容进行修改而导致隐蔽文本与正常文本之间产生细微但可检测的差异,从而在实际应用中带来潜在安全风险的问题。其解决方案的关键在于提出了一种内容保持型语言隐写范式(content-preserving linguistic steganography paradigm),通过可控的概率分布变换实现无损嵌入:首先采用增强掩码策略定位并掩码嵌入位置,利用掩码语言模型(Masked Language Model, MLM)预测的概率分布易于调整的特性进行目标分布构造;随后设计动态分布隐写编码策略,基于原始分布生成目标分布以嵌入秘密信息;最终通过构建掩码句子数据集微调原始MLM,获得可直接从原始载体文本中提取秘密信息的目标MLM,从而在不改变原文本内容的前提下实现完美安全性(perfect security)。

链接: https://arxiv.org/abs/2511.12565
作者: Lingyun Xiang,Chengfu Ou,Xu He,Zhongliang Yang,Yuling Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: This is the extended version of the paper accepted to AAAI 2026

点击查看摘要

Abstract:Existing linguistic steganography methods primarily rely on content transformations to conceal secret messages. However, they often cause subtle yet looking-innocent deviations between normal and stego texts, posing potential security risks in real-world applications. To address this challenge, we propose a content-preserving linguistic steganography paradigm for perfectly secure covert communication without modifying the cover text. Based on this paradigm, we introduce CLstega (\textitContent-preserving \textitLinguistic \textitsteganography), a novel method that embeds secret messages through controllable distribution transformation. CLstega first applies an augmented masking strategy to locate and mask embedding positions, where MLM(masked language model)-predicted probability distributions are easily adjustable for transformation. Subsequently, a dynamic distribution steganographic coding strategy is designed to encode secret messages by deriving target distributions from the original probability distributions. To achieve this transformation, CLstega elaborately selects target words for embedding positions as labels to construct a masked sentence dataset, which is used to fine-tune the original MLM, producing a target MLM capable of directly extracting secret messages from the cover text. This approach ensures perfect security of secret messages while fully preserving the integrity of the original cover text. Experimental results show that CLstega can achieve a 100% extraction success rate, and outperforms existing methods in security, effectively balancing embedding capacity and security.
zh

[NLP-62] Accepted with Minor Revisions: Value of AI-Assisted Scientific Writing

【速读】: 该论文旨在解决生成式 AI(Generative AI)在科学写作辅助中的有效性问题,特别是其作为领域专家写作工具时的适用性与接受度。研究发现,尽管AI生成摘要的客观质量已接近人类水平,但编辑行为主要受作者对来源的认知影响——即当未披露来源时,作者更倾向于大幅修改AI生成摘要,认为其可读性较低;而一旦披露来源,编辑量趋于一致,且风格化微调能显著提升稿件被接收的概率。因此,解决方案的关键在于源信息的透明披露,它能够减少因偏见导致的过度编辑,从而实现AI辅助写作的高效协同与合理采纳。

链接: https://arxiv.org/abs/2511.12529
作者: Sanchaita Hazra,Doeun Lee,Bodhisattwa Prasad Majumder,Sachin Kumar
机构: University of Utah (犹他大学); Ohio State University (俄亥俄州立大学); The Allen Institute for AI (艾伦人工智能研究所)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models have seen expanding application across domains, yet their effectiveness as assistive tools for scientific writing – an endeavor requiring precision, multimodal synthesis, and domain expertise – remains insufficiently understood. We examine the potential of LLMs to support domain experts in scientific writing, with a focus on abstract composition. We design an incentivized randomized controlled trial with a hypothetical conference setup where participants with relevant expertise are split into an author and reviewer pool. Inspired by methods in behavioral science, our novel incentive structure encourages authors to edit the provided abstracts to an acceptable quality for a peer-reviewed submission. Our 2x2 between-subject design expands into two dimensions: the implicit source of the provided abstract and the disclosure of it. We find authors make most edits when editing human-written abstracts compared to AI-generated abstracts without source attribution, often guided by higher perceived readability in AI generation. Upon disclosure of source information, the volume of edits converges in both source treatments. Reviewer decisions remain unaffected by the source of the abstract, but bear a significant correlation with the number of edits made. Careful stylistic edits, especially in the case of AI-generated abstracts, in the presence of source information, improve the chance of acceptance. We find that AI-generated abstracts hold potential to reach comparable levels of acceptability to human-written ones with minimal revision, and that perceptions of AI authorship, rather than objective quality, drive much of the observed editing behavior. Our findings reverberate the significance of source disclosure in collaborative scientific writing.
zh

[NLP-63] AdaRAG : Task Adaptive Retrieval-Augmented Generation via On-the-Fly Knowledge Graph Construction AAAI2026

【速读】: 该论文旨在解决传统检索增强生成(Retrieval-Augmented Generation, RAG)方法中存在的两个核心问题:一是由于输入上下文窗口限制导致的外部知识被截断,造成信息丢失,进而引发响应幻觉和推理链断裂;二是传统RAG从非结构化知识源中检索内容,引入冗余或无关信息,影响推理准确性。解决方案的关键在于提出TAdaRAG框架,其核心创新包括:基于意图驱动的路由机制将任务映射到特定领域的抽取模板,结合监督微调与强化学习驱动的隐式抽取机制,从而实现动态、任务自适应的知识图谱构建,确保整合的知识具有简洁性、连贯性和无冗余性,显著提升模型在多领域和长文本任务中的泛化能力与实际效果。

链接: https://arxiv.org/abs/2511.12520
作者: Jie Zhang,Bo Tang,Wanzi Shao,Wenqiang Wei,Jihao Zhao,Jianqing Zhu,Zhiyu li,Wen Xi,Zehao Lin,Feiyu Xiong,Yanchao Tan
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) improves large language models by retrieving external knowledge, often truncated into smaller chunks due to the input context window, which leads to information loss, resulting in response hallucinations and broken reasoning chains. Moreover, traditional RAG retrieves unstructured knowledge, introducing irrelevant details that hinder accurate reasoning. To address these issues, we propose TAdaRAG, a novel RAG framework for on-the-fly task-adaptive knowledge graph construction from external sources. Specifically, we design an intent-driven routing mechanism to a domain-specific extraction template, followed by supervised fine-tuning and a reinforcement learning-based implicit extraction mechanism, ensuring concise, coherent, and non-redundant knowledge integration. Evaluations on six public benchmarks and a real-world business benchmark (NowNewsQA) across three backbone models demonstrate that TAdaRAG outperforms existing methods across diverse domains and long-text tasks, highlighting its strong generalization and practical effectiveness.
zh

[NLP-64] QA-Noun: Representing Nominal Semantics via Natural Language Question-Answer Pairs

【速读】: 该论文旨在解决现有语义解析方法中对名词中心语义关系建模不足的问题,尤其是在句法结构之外的隐含语境角色难以捕捉的局限。其解决方案的关键在于提出QA-Noun框架,通过定义九种问题模板来覆盖名词的显式句法角色与隐含语境角色,生成可解释的问答对(QA pairs),从而实现名词为中心的细粒度语义分解;该框架与已有的基于问答的谓词-论元解析(QA-SRL)相结合,形成统一的句子意义分解体系,显著提升了语义单元的粒度和覆盖率,实验表明其相较FactScore和DecompScore等最新方法在事实级分解上提升超过130%。

链接: https://arxiv.org/abs/2511.12504
作者: Maria Tseytlin,Paul Roit,Omri Abend,Ido Dagan,Ayal Klein
机构: Hebrew University of Jerusalem (希伯来大学); Bar-Ilan University (巴伊兰大学); Ariel University (阿里尔大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Decomposing sentences into fine-grained meaning units is increasingly used to model semantic alignment. While QA-based semantic approaches have shown effectiveness for representing predicate-argument relations, they have so far left noun-centered semantics largely unaddressed. We introduce QA-Noun, a QA-based framework for capturing noun-centered semantic relations. QA-Noun defines nine question templates that cover both explicit syntactical and implicit contextual roles for nouns, producing interpretable QA pairs that complement verbal QA-SRL. We release detailed guidelines, a dataset of over 2,000 annotated noun mentions, and a trained model integrated with QA-SRL to yield a unified decomposition of sentence meaning into individual, highly fine-grained, facts. Evaluation shows that QA-Noun achieves near-complete coverage of AMR’s noun arguments while surfacing additional contextually implied relations, and that combining QA-Noun with QA-SRL yields over 130% higher granularity than recent fact-based decomposition methods such as FactScore and DecompScore. QA-Noun thus complements the broader QA-based semantic framework, forming a comprehensive and scalable approach to fine-grained semantic decomposition for cross-text alignment.
zh

[NLP-65] SGuard-v1: Safety Guardrail for Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在人机对话场景中可能生成有害内容或被恶意提示攻击(adversarial prompting)的安全风险问题。解决方案的关键在于提出一个轻量级的安全护栏系统 SGuard-v1,其核心由两个专用模型组成:ContentFilter 负责依据 MLCommons 危险分类体系检测提示和响应中的安全风险;JailbreakFilter 则通过精心设计的课程学习策略,在整合多源数据与前期对抗提示研究的基础上,识别 60 种主要攻击类型并降低误判率。该系统基于 2B 参数的 Granite-3.3-2B-Instruct 模型构建,利用约 140 万条标注数据进行指令微调,并将数据按功能分配至两个组件,从而在保持低部署开销的同时实现最先进的安全性表现,并提供多类安全预测及置信度评分以增强可解释性。

链接: https://arxiv.org/abs/2511.12497
作者: JoonHo Lee,HyeonMin Cho,Jaewoong Yun,Hyunjae Lee,JunKyu Lee,Juree Seok
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Technical Report

点击查看摘要

Abstract:We present SGuard-v1, a lightweight safety guardrail for Large Language Models (LLMs), which comprises two specialized models to detect harmful content and screen adversarial prompts in human-AI conversational settings. The first component, ContentFilter, is trained to identify safety risks in LLM prompts and responses in accordance with the MLCommons hazard taxonomy, a comprehensive framework for trust and safety assessment of AI. The second component, JailbreakFilter, is trained with a carefully designed curriculum over integrated datasets and findings from prior work on adversarial prompting, covering 60 major attack types while mitigating false-unsafe classification. SGuard-v1 is built on the 2B-parameter Granite-3.3-2B-Instruct model that supports 12 languages. We curate approximately 1.4 million training instances from both collected and synthesized data and perform instruction tuning on the base model, distributing the curated data across the two component according to their designated functions. Through extensive evaluation on public and proprietary safety benchmarks, SGuard-v1 achieves state-of-the-art safety performance while remaining lightweight, thereby reducing deployment overhead. SGuard-v1 also improves interpretability for downstream use by providing multi-class safety predictions and their binary confidence scores. We release the SGuard-v1 under the Apache-2.0 License to enable further research and practical deployment in AI safety.
zh

[NLP-66] Evolving Prompts for Toxicity Search in Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在经过安全对齐(safety alignment)后仍可能被恶意提示(adversarial prompts)诱导生成有害内容的问题。解决方案的关键在于提出了一种黑盒进化框架 ToxSearch,其通过同步稳态循环演化提示词,并结合多样化的操作算子(如词汇替换、否定、回译、改写和两种语义交叉算子)以及一个 moderation oracle 提供适应度指导,实现对模型安全性的系统性红队测试(red-teaming)。该方法表明,小幅度可控扰动是有效的攻击载体,且演化出的毒害提示具有跨模型迁移能力,提示防御策略应关注对抗提示的跨模型复用风险,而不仅局限于单模型加固。

链接: https://arxiv.org/abs/2511.12487
作者: Onkar Shelar,Travis Desell
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: pre-print

点击查看摘要

Abstract:Large Language Models remain vulnerable to adversarial prompts that elicit toxic content even after safety alignment. We present ToxSearch, a black-box evolutionary framework that tests model safety by evolving prompts in a synchronous steady-state loop. The system employs a diverse set of operators, including lexical substitutions, negation, back-translation, paraphrasing, and two semantic crossover operators, while a moderation oracle provides fitness guidance. Operator-level analysis shows heterogeneous behavior: lexical substitutions offer the best yield-variance trade-off, semantic-similarity crossover acts as a precise low-throughput inserter, and global rewrites exhibit high variance with elevated refusal costs. Using elite prompts evolved on LLaMA 3.1 8B, we observe practically meaningful but attenuated cross-model transfer, with toxicity roughly halving on most targets, smaller LLaMA 3.2 variants showing the strongest resistance, and some cross-architecture models retaining higher toxicity. These results suggest that small, controllable perturbations are effective vehicles for systematic red-teaming and that defenses should anticipate cross-model reuse of adversarial prompts rather than focusing only on single-model hardening.
zh

[NLP-67] Co-Layout: LLM -driven Co-optimization for Interior Layout

【速读】: 该论文旨在解决室内设计中房间布局与家具摆放的联合优化问题,传统方法通常采用两阶段流水线,难以兼顾整体空间合理性与细节配置精度。解决方案的关键在于提出一种融合大语言模型(Large Language Models, LLMs)与基于网格的整数规划(grid-based integer programming)的新框架:LLM驱动的代理工作流从文本提示中提取结构化设计约束,将其编码为受“模度”(Modulor)启发的统一网格表示,并通过粗粒度到细粒度的优化策略,在保证走廊连通性、房间可达性、空间排他性及用户偏好等关键设计要求的同时,显著提升解的质量与计算效率。

链接: https://arxiv.org/abs/2511.12474
作者: Chucheng Xiang,Ruchao Bao,Biyin Feng,Wenzheng Wu,Zhongyuan Liu,Yirui Guan,Ligang Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:We present a novel framework for automated interior design that combines large language models (LLMs) with grid-based integer programming to jointly optimize room layout and furniture placement. Given a textual prompt, the LLM-driven agent workflow extracts structured design constraints related to room configurations and furniture arrangements. These constraints are encoded into a unified grid-based representation inspired by ``Modulor". Our formulation accounts for key design requirements, including corridor connectivity, room accessibility, spatial exclusivity, and user-specified preferences. To improve computational efficiency, we adopt a coarse-to-fine optimization strategy that begins with a low-resolution grid to solve a simplified problem and guides the solution at the full resolution. Experimental results across diverse scenarios demonstrate that our joint optimization approach significantly outperforms existing two-stage design pipelines in solution quality, and achieves notable computational efficiency through the coarse-to-fine strategy.
zh

[NLP-68] Assessing LLM s for Serendipity Discovery in Knowledge Graphs: A Case for Drug Repurposing AAAI AAAI-26

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在知识图谱问答(Knowledge Graph Question Answering, KGQA)任务中普遍存在的“可预测性”问题,即现有系统虽能返回高度相关但缺乏创新性的答案,而忽视了对意外且有价值的新颖发现(即“偶然性”或“serendipitous”答案)的挖掘能力。其解决方案的关键在于提出Serendipity-aware KGQA(SerenQA)框架,该框架通过引入一个基于相关性、新颖性和惊喜度的严谨量化指标,构建了一个专家标注的临床知识图谱基准(聚焦于药物再利用),并设计了包含知识检索、子图推理与偶然性探索三个子任务的结构化评估流程,从而系统性地衡量和提升LLMs在科学KGQA场景下发现非预期但有价值的洞察能力。

链接: https://arxiv.org/abs/2511.12472
作者: Mengying Wang,Chenhui Ma,Ao Jiao,Tuo Liang,Pengjun Lu,Shrinidhi Hegde,Yu Yin,Evren Gurkan-Cavusoglu,Yinghui Wu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: The 40th AAAI Conference on Artificial Intelligence (AAAI-26)

点击查看摘要

Abstract:Large Language Models (LLMs) have greatly advanced knowledge graph question answering (KGQA), yet existing systems are typically optimized for returning highly relevant but predictable answers. A missing yet desired capacity is to exploit LLMs to suggest surprise and novel (“serendipitious”) answers. In this paper, we formally define the serendipity-aware KGQA task and propose the SerenQA framework to evaluate LLMs’ ability to uncover unexpected insights in scientific KGQA tasks. SerenQA includes a rigorous serendipity metric based on relevance, novelty, and surprise, along with an expert-annotated benchmark derived from the Clinical Knowledge Graph, focused on drug repurposing. Additionally, it features a structured evaluation pipeline encompassing three subtasks: knowledge retrieval, subgraph reasoning, and serendipity exploration. Our experiments reveal that while state-of-the-art LLMs perform well on retrieval, they still struggle to identify genuinely surprising and valuable discoveries, underscoring a significant room for future improvements. Our curated resources and extended version are released at: this https URL.
zh

[NLP-69] Probing Preference Representations: A Multi-Dimensional Evaluation and Analysis Method for Reward Models AAAI2026

【速读】: 该论文旨在解决现有奖励模型(Reward Model)评估方法的局限性问题,即传统方法仅在固定成对排序测试集上进行评估,缺乏对不同偏好维度(preference dimension)性能的细粒度分析。其解决方案的关键在于提出一种基于偏好表示探测(probing preference representations)的新评估范式,并构建多维奖励模型基准(Multi-dimensional Reward Model Benchmark, MRMBench),包含六个针对不同偏好维度的探测任务,以更全面地衡量奖励模型对多维偏好的捕捉能力。此外,论文引入推理时探测(inference-time probing)方法,可识别奖励预测过程中所依赖的具体偏好维度,从而提升模型决策的可解释性与置信度评估可靠性,最终增强大语言模型(Large Language Models, LLMs)的对齐效果。

链接: https://arxiv.org/abs/2511.12464
作者: Chenglong Wang,Yifu Huo,Yang Gan,Yongyu Mu,Qiaozhi He,Murun Yang,Bei Li,Chunliang Zhang,Tongran Liu,Anxiang Ma,Zhengtao Yu,Jingbo Zhu,Tong Xiao
机构: Northeastern University (东北大学); Meituan Inc. (美团); NiuTrans Research (牛津研究); CAS Key Laboratory of Behavioral Science, Institute of Psychology, CAS (中国科学院心理研究所行为科学重点实验室); Kunming University of Science and Technology (昆明理工大学)
类目: Computation and Language (cs.CL)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Previous methods evaluate reward models by testing them on a fixed pairwise ranking test set, but they typically do not provide performance information on each preference dimension. In this work, we address the evaluation challenge of reward models by probing preference representations. To confirm the effectiveness of this evaluation method, we construct a Multi-dimensional Reward Model Benchmark (MRMBench), a collection of six probing tasks for different preference dimensions. We design it to favor and encourage reward models that better capture preferences across different dimensions. Furthermore, we introduce an analysis method, inference-time probing, which identifies the dimensions used during the reward prediction and enhances its interpretability. Through extensive experiments, we find that MRMBench strongly correlates with the alignment performance of large language models (LLMs), making it a reliable reference for developing advanced reward models. Our analysis of MRMBench evaluation results reveals that reward models often struggle to capture preferences across multiple dimensions, highlighting the potential of multi-objective optimization in reward modeling. Additionally, our findings show that the proposed inference-time probing method offers a reliable metric for assessing the confidence of reward predictions, which ultimately improves the alignment of LLMs.
zh

[NLP-70] DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions

【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)训练数据中密集标注(dense annotations)稀缺的问题,尤其是传统文本标注方式在表达视觉细节、跨文化语境和三维场景理解方面的局限性。现有数据集多依赖稀疏标注,难以充分捕捉图像或3D资产的细粒度信息,限制了模型在多语言、文化对齐及空间推理等任务上的性能提升。解决方案的关键在于提出一个基于音频驱动的在线标注平台 DenseAnnotate,通过让标注者在观察图像或3D场景时口头描述并同步关联语音片段至具体区域,结合自动语音识别(speech-to-text)与注意力区域标记技术,实现高效、精细且多语言支持的密集标注。该方法显著提升了标注效率与表达能力,并在实际应用中验证了其有效性,使训练模型在多语言、文化对齐和3D空间理解能力上分别获得5%、47%和54%的性能提升。

链接: https://arxiv.org/abs/2511.12452
作者: Xiaoyu Lin,Aniket Ghorpade,Hansheng Zhu,Justin Qiu,Dea Rrozhani,Monica Lama,Mick Yang,Zixuan Bian,Ruohan Ren,Alan B. Hong,Jiatao Gu,Chris Callison-Burch
机构: University of Pennsylvania (宾夕法尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the rapid adoption of multimodal large language models (MLLMs) across diverse applications, there is a pressing need for task-centered, high-quality training data. A key limitation of current training datasets is their reliance on sparse annotations mined from the Internet or entered via manual typing that capture only a fraction of an image’s visual content. Dense annotations are more valuable but remain scarce. Traditional text-based annotation pipelines are poorly suited for creating dense annotations: typing limits expressiveness, slows annotation speed, and underrepresents nuanced visual features, especially in specialized areas such as multicultural imagery and 3D asset annotation. In this paper, we present DenseAnnotate, an audio-driven online annotation platform that enables efficient creation of dense, fine-grained annotations for images and 3D assets. Annotators narrate observations aloud while synchronously linking spoken phrases to image regions or 3D scene parts. Our platform incorporates speech-to-text transcription and region-of-attention marking. To demonstrate the effectiveness of DenseAnnotate, we conducted case studies involving over 1,000 annotators across two domains: culturally diverse images and 3D scenes. We curate a human-annotated multi-modal dataset of 3,531 images, 898 3D scenes, and 7,460 3D objects, with audio-aligned dense annotations in 20 languages, including 8,746 image captions, 2,000 scene captions, and 19,000 object captions. Models trained on this dataset exhibit improvements of 5% in multilingual, 47% in cultural alignment, and 54% in 3D spatial capabilities. Our results show that our platform offers a feasible approach for future vision-language research and can be applied to various tasks and diverse types of data.
zh

[NLP-71] From Phonemes to Meaning: Evaluating Large Language Models on Tamil

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在低资源且形态丰富的语言(如泰米尔语)中语言能力评估不足的问题。现有多语言基准测试通常依赖英文数据集的翻译,难以捕捉目标语言的语言和文化特征。解决方案的关键在于构建首个针对泰米尔语的精细化语言评估基准——ILAKKANAM,该基准基于斯里兰卡小学至高中阶段的泰米尔语考试题目手工标注而成,涵盖820道问题,并由专业语言学家按五类语言特征及事实知识类别进行标注,确保覆盖从一年级到十三年级的广泛语言复杂度。这一方法显著提升了对模型在真实语境下语言理解能力的评估精度。

链接: https://arxiv.org/abs/2511.12387
作者: Jeyarajalingam Varsha,Menan Velayuthan,Sumirtha Karunakaran,Rasan Nivethiga,Kengatharaiyer Sarveswaran
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages

点击查看摘要

Abstract:Large Language Models (LLMs) have shown strong generalization across tasks in high-resource languages; however, their linguistic competence in low-resource and morphologically rich languages such as Tamil remains largely unexplored. Existing multilingual benchmarks often rely on translated English datasets, failing to capture the linguistic and cultural nuances of the target language. To address this gap, we introduce ILAKKANAM, the first Tamil-specific linguistic evaluation benchmark manually curated using 820 questions from Sri Lankan school-level Tamil subject examination papers. Each question is annotated by trained linguists under five linguistic categories and a factual knowledge category, spanning Grades 1–13 to ensure broad linguistic coverage. We evaluate both closed-source and open-source LLMs using a standardized evaluation framework. Our results show that Gemini 2.5 achieves the highest overall performance, while open-source models lag behind, highlighting the gap in linguistic grounding. Category- and grade-wise analyses reveal that all models perform well on lower-grade questions but show a clear decline as linguistic complexity increases. Further, no strong correlation is observed between a model’s overall performance and its ability to identify linguistic categories, suggesting that performance may be driven by exposure rather than genuine understanding.
zh

[NLP-72] Dont Think of the White Bear: Ironic Negation in Transformer Models Under Cognitive Load

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理否定指令(如“不要提及X”)时出现的“讽刺性反弹”(ironic rebound)问题,即模型反而更易激活被禁止的概念。这一现象与人类认知中的类似效应一致,其根源在于抑制某一概念需先内部激活它,从而引发反向强化。论文的关键解决方案在于通过两个实验揭示了反弹强度的调控机制:首先,发现语义或句法干扰项会增强反弹效应,而重复干扰则有助于抑制;其次,识别出模型对中性与负面表述同一概念时存在极性分离能力,且这种分离越强,反弹越持久。进一步的电路追踪分析表明,稀疏的中间层注意力头会放大被禁止词元的激活,而早期层则起到抑制作用,从而将认知理论预测与模型内部机制联系起来。为推动后续研究,作者发布了ReboundBench数据集,包含5000个系统化变体的否定提示,用于深入探测LLMs中的反弹行为。

链接: https://arxiv.org/abs/2511.12381
作者: Logan Mann,Nayan Saxena,Sarah Tandon,Chenhao Sun,Savar Toteja,Kevin Zhu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Negation instructions such as 'do not mention X ’ can paradoxically increase the accessibility of X in human thought, a phenomenon known as ironic rebound. Large language models (LLMs) face the same challenge: suppressing a concept requires internally activating it, which may prime rebound instead of avoidance. We investigated this tension with two experiments. \textbf(1) Load \ content: after a negation instruction, we vary distractor text (semantic, syntactic, repetition) and measure rebound strength. \textbf(2) Polarity separation: We test whether models distinguish neutral from negative framings of the same concept and whether this separation predicts rebound persistence. Results show that rebound consistently arises immediately after negation and intensifies with longer or semantic distractors, while repetition supports suppression. Stronger polarity separation correlates with more persistent rebound. Together, these findings, complemented by a circuit tracing analysis that identifies sparse middle-layer attention heads amplifying forbidden tokens while early layers suppress, link cognitive predictions of ironic rebound with mechanistic insights into long-context interference. To support future work, we release ReboundBench, a dataset of 5,000 systematically varied negation prompts designed to probe rebound in LLMs.
zh

[NLP-73] Do LLM s and Humans Find the Same Questions Difficult? A Case Study on Japanese Quiz Answering

【速读】: 该论文试图解决的问题是:尽管大语言模型(Large Language Models, LLMs)在许多自然语言处理(Natural Language Processing, NLP)任务中已超越人类表现,但其在人类难以解答的难题上是否同样困难尚不明确。为回答这一问题,研究者通过对比LLMs与人类在抢答类测验中的表现差异来评估其难度感知一致性。解决方案的关键在于:首先收集包含问题、答案及人类正确作答率的日本测验数据集,随后在多种提示(prompting)设置下让LLMs作答,并从两个分析角度比较其正确率与人类表现——结果发现,LLMs相较于人类更难应对未被维基百科覆盖的答案和需要数值回答的问题,从而揭示了LLMs在特定类型难题上的脆弱性。

链接: https://arxiv.org/abs/2511.12300
作者: Naoya Sugiura,Kosuke Yamada,Yasuhiro Ogawa,Katsuhiko Toyama,Ryohei Sasano
机构: Nagoya University (名古屋大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:LLMs have achieved performance that surpasses humans in many NLP tasks. However, it remains unclear whether problems that are difficult for humans are also difficult for LLMs. This study investigates how the difficulty of quizzes in a buzzer setting differs between LLMs and humans. Specifically, we first collect Japanese quiz data including questions, answers, and correct response rate of humans, then prompted LLMs to answer the quizzes under several settings, and compare their correct answer rate to that of humans from two analytical perspectives. The experimental results showed that, compared to humans, LLMs struggle more with quizzes whose correct answers are not covered by Wikipedia entries, and also have difficulty with questions that require numerical answers.
zh

[NLP-74] AugAbEx : Way Forward for Extractive Case Summarization

【速读】: 该论文旨在解决法律判决文本自动摘要任务中因语言复杂性、语境敏感的法律术语及文档长度带来的认知负担问题,尤其针对当前基于深度神经网络的抽象式摘要方法易误读法律术语或遗漏关键上下文细节的风险。其解决方案的关键在于构建一个轻量且透明的转换管道,利用现有的抽象式黄金标准摘要自动生成对应的提取式摘要版本,从而在不依赖昂贵人工标注的前提下保留原摘要中的专家意见,并通过结构、词汇和语义等多个维度的对比评估确保提取式摘要的质量。该方法最终用于扩充七个现有案例摘要数据集,形成包含抽象与提取两种类型摘要的增强资源,以推动法律文本自动摘要研究的发展。

链接: https://arxiv.org/abs/2511.12290
作者: Purnima Bindal,Vikas Kumar,Sagar Rathore,Vasudha Bhatnagar
机构: University of Delhi (德里大学)
类目: Computation and Language (cs.CL)
备注: 30 pages, under review in a Journal

点击查看摘要

Abstract:Summarization of legal judgments poses a heavy cognitive burden on law practitioners due to the complexity of the language, context-sensitive legal jargon, and the length of the document. Therefore, the automatic summarization of legal documents has attracted serious attention from natural language processing researchers. Since the abstractive summaries of legal documents generated by deep neural methods remain prone to the risk of misrepresenting nuanced legal jargon or overlooking key contextual details, we envisage a rising trend toward the use of extractive case summarizers. Given the high cost of human annotation for gold standard extractive summaries, we engineer a light and transparent pipeline that leverages existing abstractive gold standard summaries to create the corresponding extractive gold standard versions. The approach ensures that the experts` opinions ensconced in the original gold standard abstractive summaries are carried over to the transformed extractive summaries. We aim to augment seven existing case summarization datasets, which include abstractive summaries, by incorporating corresponding extractive summaries and create an enriched data resource for case summarization research community. To ensure the quality of the augmented extractive summaries, we perform an extensive comparative evaluation with the original abstractive gold standard summaries covering structural, lexical, and semantic dimensions. We also compare the domain-level information of the two summaries. We commit to release the augmented datasets in the public domain for use by the research community and believe that the resource will offer opportunities to advance the field of automatic summarization of legal documents. Comments: 30 pages, under review in a Journal Subjects: Computation and Language (cs.CL) Cite as: arXiv:2511.12290 [cs.CL] (or arXiv:2511.12290v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2511.12290 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-75] Cmprsr: Abstractive Token-Level Question-Agnostic Prompt Compressor

【速读】: 该论文旨在解决使用黑箱大型语言模型(Large Language Models, LLMs)进行文本压缩时成本过高的问题。其核心解决方案是提出一种新颖的“LLM作为压缩器”(LLM-as-a-compressor)范式,即利用较小的LLM对输入文本进行压缩以降低对大模型的依赖。关键创新在于:首先构建了首个涵盖25个开源与闭源模型的综合性压缩基准,揭示了不同模型在保留语义重要信息和遵循用户指定压缩率(Compression Rate, CR)方面的显著差异;其次通过Textgrad-based元提示优化提升gpt-4.1-mini的压缩性能,并进一步识别出表现最优的开源模型Qwen3-4B,采用监督微调(SFT)与组相对策略优化(GRPO)联合训练方法,使其同时优化压缩率控制与下游任务性能,最终得到名为Cmprsr的改进模型。该模型在长文本(MeetingBank、LongBench)和短文本(GSM8k)上均优于传统抽取式与原始抽象式压缩方法,且能精准匹配目标压缩率,实现成本与质量之间的精细权衡。

链接: https://arxiv.org/abs/2511.12281
作者: Ivan Zakazov,Alexander Sharipov,Berke Argin,Oussama Gabouj,Kamel Charaf,Alexi Semiz,Lorenzo Drudi,Nicolas Baldwin,Robert West
机构: EPFL
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Motivated by the high costs of using black-box Large Language Models (LLMs), we introduce a novel prompt compression paradigm, under which we use smaller LLMs to compress inputs for the larger ones. We present the first comprehensive LLM-as-a-compressor benchmark spanning 25 open- and closed-source models, which reveals significant disparity in models’ compression ability in terms of (i) preserving semantically important information (ii) following the user-provided compression rate (CR). We further improve the performance of gpt-4.1-mini, the best overall vanilla compressor, with Textgrad-based compression meta-prompt optimization. We also identify the most promising open-source vanilla LLM - Qwen3-4B - and post-train it with a combination of supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), pursuing the dual objective of CR adherence and maximizing the downstream task performance. We call the resulting model Cmprsr and demonstrate its superiority over both extractive and vanilla abstractive compression across the entire range of compression rates on lengthy inputs from MeetingBank and LongBench as well as short prompts from GSM8k. The latter highlights Cmprsr’s generalizability across varying input lengths and domains. Moreover, Cmprsr closely follows the requested compression rate, offering fine control over the cost-quality trade-off.
zh

[NLP-76] D3ToM: Decider-Guided Dynamic Token Merging for Accelerating Diffusion MLLM s AAAI

【速读】: 该论文旨在解决扩散模型驱动的多模态大语言模型(Diffusion MLLMs)在推理过程中因全序列双向自注意力机制导致的计算复杂度高、速度慢的问题,尤其是在视觉token数量达到数千时,其立方级解码复杂度难以满足实际应用需求。解决方案的关键在于提出D³ToM(Decider-guided dynamic token merging),该方法通过引入“决策器令牌”(decider tokens)构建视觉token的重要性图,并在每个去噪步骤中动态保留最具显著性的token,其余冗余token则基于相似性聚合合并;该模块以即插即用方式嵌入单个Transformer层,物理缩短后续所有层的视觉token序列长度,且不改变模型参数,同时采用随去噪步长动态调整的合并比例,与扩散模型原生解码流程高度契合,在相同计算预算下实现更优性能。

链接: https://arxiv.org/abs/2511.12280
作者: Shuochen Chang,Xiaofeng Zhang,Qingyang Liu,Li Niu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted by AAAI Conference on Artificial Intelligence (AAAI) 2026. Code available at this https URL

点击查看摘要

Abstract:Diffusion-based multimodal large language models (Diffusion MLLMs) have recently demonstrated impressive non-autoregressive generative capabilities across vision-and-language tasks. However, Diffusion MLLMs exhibit substantially slower inference than autoregressive models: Each denoising step employs full bidirectional self-attention over the entire sequence, resulting in cubic decoding complexity that becomes computationally impractical with thousands of visual tokens. To address this challenge, we propose D ^3 ToM, a Decider-guided dynamic token merging method that dynamically merges redundant visual tokens at different denoising steps to accelerate inference in Diffusion MLLMs. At each denoising step, D ^3 ToM uses decider tokens-the tokens generated in the previous denoising step-to build an importance map over all visual tokens. Then it maintains a proportion of the most salient tokens and merges the remainder through similarity-based aggregation. This plug-and-play module integrates into a single transformer layer, physically shortening the visual token sequence for all subsequent layers without altering model parameters. Moreover, D ^3 ToM employs a merge ratio that dynamically varies with each denoising step, aligns with the native decoding process of Diffusion MLLMs, achieving superior performance under equivalent computational budgets. Extensive experiments show that D ^3 ToM accelerates inference while preserving competitive performance. The code is released at this https URL.
zh

[NLP-77] ViConBERT: Context-Gloss Aligned Vietnamese Word Embedding for Polysemous and Sense-Aware Representations

【速读】: 该论文旨在解决越南语(Vietnamese)在细粒度语义理解任务中缺乏高质量预训练模型与评估资源的问题,尤其在词义消歧(Word Sense Disambiguation, WSD)和上下文相似性(contextual similarity)等任务上表现不足。其解决方案的关键在于提出 ViConBERT 框架,通过融合对比学习(contrastive learning, SimCLR)与基于定义的蒸馏(gloss-based distillation)机制,增强模型对越南语词汇意义的捕捉能力;同时构建了首个大规模合成数据集 ViConWSD,用于系统评估越南语语义理解性能,实验证明该方法在 WSD 任务上达到 F1=0.87,在 ViCon 和 ViSim-400 上分别获得 AP=0.88 和 Spearman’s rho=0.60 的优异结果,显著优于现有基线模型。

链接: https://arxiv.org/abs/2511.12249
作者: Khang T. Huynh,Dung H. Nguyen,Binh T. Nguyen
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in contextualized word embeddings have greatly improved semantic tasks such as Word Sense Disambiguation (WSD) and contextual similarity, but most progress has been limited to high-resource languages like English. Vietnamese, in contrast, still lacks robust models and evaluation resources for fine-grained semantic understanding. In this paper, we present ViConBERT, a novel framework for learning Vietnamese contextualized embeddings that integrates contrastive learning (SimCLR) and gloss-based distillation to better capture word meaning. We also introduce ViConWSD, the first large-scale synthetic dataset for evaluating semantic understanding in Vietnamese, covering both WSD and contextual similarity. Experimental results show that ViConBERT outperforms strong baselines on WSD (F1 = 0.87) and achieves competitive performance on ViCon (AP = 0.88) and ViSim-400 (Spearman’s rho = 0.60), demonstrating its effectiveness in modeling both discrete senses and graded semantic relations. Our code, models, and data are available at this https URL
zh

[NLP-78] Consistency Is the Key: Detecting Hallucinations in LLM Generated Text By Checking Inconsistencies About Key Facts AACL

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在受限访问环境下(如仅能通过API调用,无法获取模型权重或进行微调)的幻觉(Hallucination)检测问题。现有方法通常依赖多次API调用以提升检测准确性,导致延迟高、成本增加。其解决方案的关键在于提出CONFACTCHECK,该方法不依赖任何外部知识库,基于一个核心假设:同一LLM对事实性探针的回答应具有一致性,且不同LLM对相同探针的回答也应保持一致性。通过分析生成文本内部及跨模型的一致性,CONFACTCHECK实现了高效、低资源消耗的幻觉检测,并在多个数据集上验证了其优于同类方法的准确性和效率。

链接: https://arxiv.org/abs/2511.12236
作者: Raavi Gupta,Pranav Hari Panicker,Sumit Bhatia,Ganesh Ramakrishnan
机构: Columbia University (哥伦比亚大学); IIT Bombay (印度理工学院孟买分校); Media and Data Science Research (MDSR) Lab, Adobe (Adobe 媒体与数据科学研究中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: To appear at International Joint Conference on Natural Language Processing Asia-Pacific Chapter of the Association for Computational Linguistics (IJCNLP-AACL), 2025

点击查看摘要

Abstract:Large language models (LLMs), despite their remarkable text generation capabilities, often hallucinate and generate text that is factually incorrect and not grounded in real-world knowledge. This poses serious risks in domains like healthcare, finance, and customer support. A typical way to use LLMs is via the APIs provided by LLM vendors where there is no access to model weights or options to fine-tune the model. Existing methods to detect hallucinations in such settings where the model access is restricted or constrained by resources typically require making multiple LLM API calls, increasing latency and API cost. We introduce CONFACTCHECK, an efficient hallucination detection approach that does not leverage any external knowledge base and works on the simple intuition that responses to factual probes within the generated text should be consistent within a single LLM and across different LLMs. Rigorous empirical evaluation on multiple datasets that cover both the generation of factual texts and the open generation shows that CONFACTCHECK can detect hallucinated facts efficiently using fewer resources and achieves higher accuracy scores compared to existing baselines that operate under similar conditions. Our code is available here.
zh

[NLP-79] MME-RAG : Multi-Manager-Expert Retrieval-Augmented Generation for Fine-Grained Entity Recognition in Task-Oriented Dialogues

【速读】: 该论文旨在解决任务导向型对话中细粒度实体识别(fine-grained entity recognition)面临的领域适应性(domain adaptation)和检索可控性(retrieval controllability)难题。现有大语言模型(Large Language Models, LLMs)在跨域场景下性能受限,难以精准提取特定领域的实体信息。解决方案的核心在于提出MME-RAG框架——一种多管理器-专家的检索增强生成架构,其关键创新是将实体识别过程分解为两个协同阶段:由轻量级“管理者”(managers)进行类型级判断,再由专业化“专家”(experts)完成跨度级抽取;每个专家通过KeyInfo检索器在推理时注入语义对齐的少量示例(few-shot exemplars),从而实现无需额外训练即可精确、自适应地提取实体,显著提升跨域泛化能力与可解释性。

链接: https://arxiv.org/abs/2511.12213
作者: Liang Xue,Haoyu Liu,Yajun Tian,Xinyu Zhong,Yang Liu
机构: Harbin Institute of Technology (哈尔滨工业大学); Byering Technology (贝林科技)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Fine-grained entity recognition is crucial for reasoning and decision-making in task-oriented dialogues, yet current large language models (LLMs) continue to face challenges in domain adaptation and retrieval controllability. We introduce MME-RAG, a Multi-Manager-Expert Retrieval-Augmented Generation framework that decomposes entity recognition into two coordinated stages: type-level judgment by lightweight managers and span-level extraction by specialized experts. Each expert is supported by a KeyInfo retriever that injects semantically aligned, few-shot exemplars during inference, enabling precise and domain-adaptive extraction without additional training. Experiments on CrossNER, MIT-Movie, MIT-Restaurant, and our newly constructed multi-domain customer-service dataset demonstrate that MME-RAG performs better than recent baselines in most domains. Ablation studies further show that both the hierarchical decomposition and KeyInfo-guided retrieval are key drivers of robustness and cross-domain generalization, establishing MME-RAG as a scalable and interpretable solution for adaptive dialogue understanding.
zh

[NLP-80] CriticSearch: Fine-Grained Credit Assignment for Search Agents via a Retrospective Critic

【速读】: 该论文旨在解决现有基于搜索代理(search agent)的工具集成推理(Tool-Integrated Reasoning, TIR)系统在训练过程中因奖励稀疏而导致探索效率低、训练不稳定的问题。当前方法多依赖强化学习优化,但其稀疏的最终结果奖励难以有效指导策略改进。解决方案的关键在于提出 CriticSearch,一个细粒度的信用分配框架,通过回溯式批评机制(retrospective critic mechanism)为每一轮交互提供密集的、回合级别的反馈信号。该机制利用完整轨迹和标准答案的特权信息,由一个冻结的非对称批判语言模型(critique LLM)评估每一轮的表现,并将其转化为稳定且密集的奖励信号,从而显著提升训练收敛速度、稳定性与最终性能。

链接: https://arxiv.org/abs/2511.12159
作者: Yaocheng Zhang,Haohuan Huang,Zijun Song,Yuanheng Zhu,Qichao Zhang,Zijie Zhao,Dongbin Zhao
机构: Institute of Automation, CAS (中国科学院自动化研究所); School of Advanced Interdisciplinary Sciences, UCAS (中国科学院大学交叉学科院); School of Artificial Intelligence, UCAS (中国科学院大学人工智能学院)
类目: Computation and Language (cs.CL)
备注: 17 pages, 10 figures

点击查看摘要

Abstract:Tool-Integrated Reasoning (TIR) with search engines enables large language models to iteratively retrieve up-to-date external knowledge, enhancing adaptability and generalization in complex question-answering tasks. However, existing search agent pipelines typically depend on reinforcement learning based optimization, which often suffers from sparse outcome rewards, leading to inefficient exploration and unstable training. We introduce CriticSearch, a fine-grained credit-assignment framework that supplies dense, turn-level feedback via a retrospective critic mechanism. During training, a frozen, asymmetric critique LLM retrospectively evaluates each turn using privileged information from the full trajectory and gold answers, converting these assessments into stable, dense rewards that guide policy improvement. Experimental results across diverse multi-hop reasoning benchmarks demonstrate that CriticSearch consistently outperforms existing baselines, achieving faster convergence, improved training stability, and higher performance.
zh

[NLP-81] Seeing is Believing: Rich-Context Hallucination Detection for MLLM s via Backward Visual Grounding

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在生成过程中普遍存在幻觉(Hallucination)的问题,尤其关注视觉-语言一致性验证的准确性与可解释性。其解决方案的关键在于提出一种无需参考图像的检测框架VBackChecker,该框架基于“眼见为实”原则,通过引入具备推理和指代表达分割能力的像素级定位大语言模型(Grounding LLM),对MLLM生成的回答与输入图像进行细粒度一致性校验,从而实现高精度、可解释的幻觉检测。此外,作者还构建了R²-HalBench基准测试集,并设计了用于指令微调的数据生成流水线(R-Instruct),显著提升了检测性能,在多个指标上优于现有方法,甚至接近GPT-4o水平。

链接: https://arxiv.org/abs/2511.12140
作者: Pinxue Guo,Chongruo Wu,Xinyu Zhou,Lingyi Hong,Zhaoyu Chen,Jinglun Li,Kaixun Jiang,Sen-ching Samson Cheung,Wei Zhang,Wenqiang Zhang
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have unlocked powerful cross-modal capabilities, but still significantly suffer from hallucinations. As such, accurate detection of hallucinations in MLLMs is imperative for ensuring their reliability in practical applications. To this end, guided by the principle of “Seeing is Believing”, we introduce VBackChecker, a novel reference-free hallucination detection framework that verifies the consistency of MLLMgenerated responses with visual inputs, by leveraging a pixellevel Grounding LLM equipped with reasoning and referring segmentation capabilities. This reference-free framework not only effectively handles rich-context scenarios, but also offers interpretability. To facilitate this, an innovative pipeline is accordingly designed for generating instruction-tuning data (R-Instruct), featuring rich-context descriptions, grounding masks, and hard negative samples. We further establish R^2 -HalBench, a new hallucination benchmark for MLLMs, which, unlike previous benchmarks, encompasses real-world, rich-context descriptions from 18 MLLMs with high-quality annotations, spanning diverse object-, attribute, and relationship-level details. VBackChecker outperforms prior complex frameworks and achieves state-of-the-art performance on R^2 -HalBench, even rivaling GPT-4o’s capabilities in hallucination detection. It also surpasses prior methods in the pixel-level grounding task, achieving over a 10% improvement. All codes, data, and models are available at this https URL.
zh

[NLP-82] AI-Salesman: Towards Reliable Large Language Model Driven Telemarketing

【速读】: 该论文旨在解决目标驱动的说服性对话(goal-driven persuasive dialogue)中长期存在的挑战,即大型语言模型(LLM)在多轮策略规划和事实忠实性方面的不足,尤其是在电话销售等真实场景中。其核心问题是:现有方法受限于任务特定数据稀缺、直接应用LLM时存在策略脆弱性和事实幻觉(factual hallucination)。解决方案的关键在于提出AI-Salesman框架,该框架采用双阶段架构:训练阶段设计了一种贝叶斯监督强化学习算法,从噪声对话中学习鲁棒的销售策略;推理阶段引入动态大纲引导代理(Dynamic Outline-Guided Agent, DOGA),利用预构建脚本库实现逐轮的战略指导,从而显著提升对话的策略连贯性和事实准确性。

链接: https://arxiv.org/abs/2511.12133
作者: Qingyu Zhang,Chunlei Xin,Xuanang Chen,Yaojie Lu,Hongyu Lin,Xianpei Han,Le Sun,Qing Ye,Qianlong Xie,Xingxing Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Goal-driven persuasive dialogue, exemplified by applications like telemarketing, requires sophisticated multi-turn planning and strict factual faithfulness, which remains a significant challenge for even state-of-the-art Large Language Models (LLMs). A lack of task-specific data often limits previous works, and direct LLM application suffers from strategic brittleness and factual hallucination. In this paper, we first construct and release TeleSalesCorpus, the first real-world-grounded dialogue dataset for this domain. We then propose AI-Salesman, a novel framework featuring a dual-stage architecture. For the training stage, we design a Bayesian-supervised reinforcement learning algorithm that learns robust sales strategies from noisy dialogues. For the inference stage, we introduce the Dynamic Outline-Guided Agent (DOGA), which leverages a pre-built script library to provide dynamic, turn-by-turn strategic guidance. Moreover, we design a comprehensive evaluation framework that combines fine-grained metrics for key sales skills with the LLM-as-a-Judge paradigm. Experimental results demonstrate that our proposed AI-Salesman significantly outperforms baseline models in both automatic metrics and comprehensive human evaluations, showcasing its effectiveness in complex persuasive scenarios.
zh

[NLP-83] PRISM of Opinions: A Persona-Reason ed Multimodal Framework for User-centric Conversational Stance Detection

【速读】: 该论文针对多模态社交对话立场检测(Multimodal Conversational Stance Detection, MCSD)中存在的两大局限问题展开研究:一是“伪多模态”(pseudo-multimodality),即视觉线索仅出现在源帖子中而评论被当作纯文本处理,与真实场景下的多模态交互不一致;二是“用户同质化”(user homogeneity),即忽视个体用户特征对立场表达的影响。为解决这些问题,作者提出了首个以用户为中心的MCSD数据集U-MStance和相应的PRISM模型。PRISM的关键创新在于:首先通过历史发帖与评论构建纵向用户人格画像(longitudinal user personas),捕捉个体特质;其次利用思维链(Chain-of-Thought)机制在对话语境中对齐文本与视觉线索,弥合跨模态语义与语用鸿沟;最后引入相互任务强化机制,联合优化立场识别与立场感知的回复生成,实现双向知识迁移。实验表明,PRISM在U-MStance数据集上显著优于强基线模型,验证了用户中心和上下文锚定的多模态推理对真实场景立场理解的有效性。

链接: https://arxiv.org/abs/2511.12130
作者: Bingbing Wang,Zhixin Bai,Zhengda Jin,Zihan Wang,Xintong Song,Jingjie Lin,Sixuan Li,Jing Li,Ruifeng Xu
机构: Harbin Institute of Technology, Shenzhen, China (哈尔滨工业大学深圳校区); Peng Cheng Laboratory, Shenzhen, China (鹏城实验室); Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies (广东省新型安全智能技术重点实验室); The Hong Kong Polytechnic University, Hong Kong, China (香港理工大学); Macau University of Science and Technology, Hong Kong, China (澳门科技大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid proliferation of multimodal social media content has driven research in Multimodal Conversational Stance Detection (MCSD), which aims to interpret users’ attitudes toward specific targets within complex discussions. However, existing studies remain limited by: 1) pseudo-multimodality, where visual cues appear only in source posts while comments are treated as text-only, misaligning with real-world multimodal interactions; and 2) user homogeneity, where diverse users are treated uniformly, neglecting personal traits that shape stance expression. To address these issues, we introduce U-MStance, the first user-centric MCSD dataset, containing over 40k annotated comments across six real-world targets. We further propose PRISM, a Persona-Reasoned multImodal Stance Model for MCSD. PRISM first derives longitudinal user personas from historical posts and comments to capture individual traits, then aligns textual and visual cues within conversational context via Chain-of-Thought to bridge semantic and pragmatic gaps across modalities. Finally, a mutual task reinforcement mechanism is employed to jointly optimize stance detection and stance-aware response generation for bidirectional knowledge transfer. Experiments on U-MStance demonstrate that PRISM yields significant gains over strong baselines, underscoring the effectiveness of user-centric and context-grounded multimodal reasoning for realistic stance understanding.
zh

[NLP-84] LLM LagBench: Identifying Temporal Training Boundaries in Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)因训练数据存在时间边界而导致的知识陈旧性问题,即模型在处理涉及近期事件的推理任务时可能混淆过时信息与通用知识,从而影响输出准确性。解决方案的关键在于提出LLMLagBench——一个系统性的LLM新鲜度评估基准,通过评测模型对近期事件的认知能力来识别其训练数据的最早可能时间边界,进而量化模型知识的新鲜程度。该方法通过人工验证和与公开披露的预训练信息对比,确保了基准的可靠性。

链接: https://arxiv.org/abs/2511.12116
作者: Piotr Pęzik,Konrad Kaczyński,Maria Szymańska,Filip Żarnecki,Zuzanna Deckert,Jakub Kwiatkowski,Wojciech Janowski
机构: University of Lodz (罗兹大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are pretrained on textual data up to a specific temporal cutoff. This creates a strict knowledge boundary beyond which models cannot provide accurate information without querying external sources. More subtly, when this limitation is unknown or ignored, LLMs may inadvertently blend outdated time-sensitive information with general knowledge during reasoning tasks, potentially compromising response accuracy. We introduce LLMLagBench, an LLM freshness benchmark, as a systematic approach for identifying the earliest probable temporal boundaries of an LLM’s training data by evaluating its knowledge of recent events. We then apply this benchmark to evaluate a large set of LLMs, including models with both explicitly declared and undeclared training cutoffs. The reliability of the benchmark is assessed by manual validation and comparison with publicly released information about LLM pretraining.
zh

[NLP-85] Exploring Parameter-Efficient Fine-Tuning and Backtranslation for the WMT 25 General Translation Task

【速读】: 该论文旨在解决低资源语言对(特别是日语)在神经机器翻译(Neural Machine Translation, NMT)中因训练数据稀缺而导致的性能瓶颈问题。其解决方案的关键在于协同利用回译(Backtranslation, BT)与微调(Fine-tuning, FT)技术:首先通过回译生成合成双语数据以扩充小规模平行语料,随后在真实但有限的日语文本平行数据上进行针对性微调,从而实现性能的显著提升——实验表明,联合策略在COMET指标上达到0.597,优于单独使用任一方法的效果(BT为0.468,FT为0.589),验证了二者协同作用在低资源场景下的有效性。

链接: https://arxiv.org/abs/2511.12109
作者: Felipe Fujita,Hideyuki Takada
机构: Ritsumeikan University(立命馆大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this paper, we explore the effectiveness of combining fine-tuning and backtranslation on a small Japanese corpus for neural machine translation. Starting from a baseline English\textrightarrowJapanese model (COMET = 0.460), we first apply backtranslation (BT) using synthetic data generated from monolingual Japanese corpora, yielding a modest increase (COMET = 0.468). Next, we fine-tune (FT) the model on a genuine small parallel dataset drawn from diverse Japanese news and literary corpora, achieving a substantial jump to COMET = 0.589 when using Mistral 7B. Finally, we integrate both backtranslation and fine-tuning – first augmenting the small dataset with BT generated examples, then adapting via FT – which further boosts performance to COMET = 0.597. These results demonstrate that, even with limited training data, the synergistic use of backtranslation and targeted fine-tuning on Japanese corpora can significantly enhance translation quality, outperforming each technique in isolation. This approach offers a lightweight yet powerful strategy for improving low-resource language pairs.
zh

[NLP-86] Preference Learning from Physics-Based Feedback: Tuning Language Models to Design BCC/B2 Superalloys

【速读】: 该论文旨在解决结构合金(structural alloys)设计中难以高效探索高维材料空间的问题,特别是针对BCC/B2超耐热合金这一尚未充分研究的材料类别,其在极端环境下的应用潜力亟待挖掘。传统方法通常依赖于生成稳定的无机晶体,而忽视了合成可行性(synthesizeability),导致理论设计与实际制造脱节。解决方案的关键在于利用生成式AI(Generative AI)结合偏好学习(preference learning),通过单一统一的奖励信号对语言模型进行优化,该奖励信号源自热力学相变计算(thermodynamic phase calculations),而非人为设定的启发式规则或人工反馈,从而实现科学可解释且自动化的模型调优。此方法首次将物理基础反馈用于语言模型的偏好微调,构建了一个通用、可扩展的智能设计空间探索框架,为多物理科学领域的材料发现提供了新路径。

链接: https://arxiv.org/abs/2511.12036
作者: Satanu Ghosh,Collin Holgate,Neal R. Brodnik,Doug Downey,Samantha Daly,Tresa M. Pollock,Samuel Carton
机构: University of New Hampshire (新罕布什尔大学); University of California, Santa Barbara (加州大学圣塔芭芭拉分校); Allen Institute for Artificial Intelligence (艾伦人工智能研究所)
类目: Computational Engineering, Finance, and Science (cs.CE); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We apply preference learning to the task of language model-guided design of novel structural alloys. In contrast to prior work that focuses on generating stable inorganic crystals, our approach targets the synthesizeability of a specific structural class: BCC/B2 superalloys, an underexplored family of materials with potential applications in extreme environments. Using three open-weight models (LLaMA-3.1, Gemma-2, and OLMo-2), we demonstrate that language models can be optimized for multiple design objectives using a single, unified reward signal through Direct Preference Optimization (DPO). Unlike prior approaches that rely on heuristic or human-in-the-loop feedback (costly), our reward signal is derived from thermodynamic phase calculations, offering a scientifically grounded criterion for model tuning. To our knowledge, this is the first demonstration of preference-tuning a language model using physics-grounded feedback for structural alloy design. The resulting framework is general and extensible, providing a path forward for intelligent design-space exploration across a range of physical science domains.
zh

[NLP-87] CURE: Cultural Understanding and Reasoning Evaluation - A Framework for “Thick” Culture Alignment Evaluation in LLM s

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在跨文化环境部署中,对其文化胜任力(cultural competence)评估不足的问题。现有评估方法多依赖去情境化的正确性判断或强制选择题形式,忽视了模型在真实情境下进行文化相关推理的能力。解决方案的关键在于引入一套基于真实情境的基准测试(benchmarks),要求模型在具体文化语境中做出合理回应,并采用五种互补指标(Exact Match、Coverage、Specificity、Connotation 和 Coherence)从多个维度量化模型响应质量,从而实现更深入、稳定且可解释的文化理解评估。

链接: https://arxiv.org/abs/2511.12014
作者: Truong Vo,Sanmi Koyejo
机构: 未知
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 7 pages, 5 figures

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed in culturally diverse environments, yet existing evaluations of cultural competence remain limited. Existing methods focus on de-contextualized correctness or forced-choice judgments, overlooking the need for cultural understanding and reasoning required for appropriate responses. To address this gap, we introduce a set of benchmarks that, instead of directly probing abstract norms or isolated statements, present models with realistic situational contexts that require culturally grounded reasoning. In addition to the standard Exact Match metric, we introduce four complementary metrics (Coverage, Specificity, Connotation, and Coherence) to capture different dimensions of model’s response quality. Empirical analysis across frontier models reveals that thin evaluation systematically overestimates cultural competence and produces unstable assessments with high variance. In contrast, thick evaluation exposes differences in reasoning depth, reduces variance, and provides more stable, interpretable signals of cultural understanding.
zh

[NLP-88] Leverag ing Large Language Models for Career Mobility Analysis: A Study of Gender Race and Job Change Using U.S. Online Resume Profiles

【速读】: 该论文旨在解决美国受过高等教育的劳动者在职业流动中性别与种族差异对向上职业流动性(upward career mobility)的影响机制问题,尤其关注不同类型的职位变动(如企业内晋升、跨企业变动及横向调动)如何影响职业发展,并识别这些影响在不同群体间的异质性。其解决方案的关键在于构建了一种基于大语言模型(Large Language Models, LLMs)的职业分类方法——FewSOC,该方法显著提升了原始简历数据中职业标签的准确性,从而有效缓解了因职业标注噪声带来的分析偏差;同时,通过多层级敏感性分析控制聚类层面的异质性,验证了女性和非裔大学毕业生从职业变动中获得的收益显著低于男性和白人同侪的结论具有稳健性,并揭示出交叉性(intersectional)模式。

链接: https://arxiv.org/abs/2511.12010
作者: Palakorn Achananuparp,Connie Xu,Yao Lu,Xavier Jayaraj Siddarth Ashok,Ee-Peng Lim
机构: Singapore Management University (新加坡管理大学); Columbia University (哥伦比亚大学)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注: Submitted to EPJ Data Science

点击查看摘要

Abstract:We present a large-scale analysis of career mobility of college-educated U.S. workers using online resume profiles to investigate how gender, race, and job change options are associated with upward mobility. This study addresses key research questions of how the job changes affect their upward career mobility, and how the outcomes of upward career mobility differ by gender and race. We address data challenges – such as missing demographic attributes, missing wage data, and noisy occupation labels – through various data processing and Artificial Intelligence (AI) methods. In particular, we develop a large language models (LLMs) based occupation classification method known as FewSOC that achieves accuracy significantly higher than the original occupation labels in the resume dataset. Analysis of 228,710 career trajectories reveals that intra-firm occupation change has been found to facilitate upward mobility most strongly, followed by inter-firm occupation change and inter-firm lateral move. Women and Black college graduates experience significantly lower returns from job changes than men and White peers. Multilevel sensitivity analyses confirm that these disparities are robust to cluster-level heterogeneity and reveal additional intersectional patterns.
zh

[NLP-89] Critical or Compliant? The Double-Edged Sword of Reasoning in Chain-of-Thought Explanations

【速读】: 该论文试图解决的问题是:Chain-of-Thought (CoT)解释在多模态道德场景中可能带来的双重效应——既可增强透明度,也可能因用户认知偏差(如确认偏误)而误导决策,从而削弱对模型推理错误的识别能力。解决方案的关键在于揭示两种机制:一是用户倾向于将信任与结果一致性挂钩,即使推理过程存在缺陷仍维持依赖;二是自信语气会抑制用户对错误的察觉,且这种表达风格的影响甚至超过推理正确性本身。因此,论文强调NLP系统应设计能促进批判性思维和主动审查的解释机制,而非仅追求表面合理性或情绪上的可信度。

链接: https://arxiv.org/abs/2511.12001
作者: Eunkyu Park,Wesley Hanwen Deng,Vasudha Varadarajan,Mingxi Yan,Gunhee Kim,Maarten Sap,Motahhare Eslami
机构: Seoul National University (首尔国立大学); Language Technologies Institute, Carnegie Mellon University (卡内基梅隆大学语言技术研究所); Human-Computer Interaction Institute, Carnegie Mellon University (卡内基梅隆大学人机交互研究所)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: Under review; 16 pages, 15 figures

点击查看摘要

Abstract:Explanations are often promoted as tools for transparency, but they can also foster confirmation bias; users may assume reasoning is correct whenever outputs appear acceptable. We study this double-edged role of Chain-of-Thought (CoT) explanations in multimodal moral scenarios by systematically perturbing reasoning chains and manipulating delivery tones. Specifically, we analyze reasoning errors in vision language models (VLMs) and how they impact user trust and the ability to detect errors. Our findings reveal two key effects: (1) users often equate trust with outcome agreement, sustaining reliance even when reasoning is flawed, and (2) the confident tone suppresses error detection while maintaining reliance, showing that delivery styles can override correctness. These results highlight how CoT explanations can simultaneously clarify and mislead, underscoring the need for NLP systems to provide explanations that encourage scrutiny and critical thinking rather than blind trust. All code will be released publicly.
zh

[NLP-90] A Reasoning Paradigm for Named Entity Recognition AAAI2026

【速读】: 该论文旨在解决生成式大语言模型(Generative LLMs)在命名实体识别(Named Entity Recognition, NER)任务中因依赖隐式语义模式匹配而导致的推理机制不明确、泛化能力弱的问题,尤其是在零样本(zero-shot)和低资源场景下表现不佳。其解决方案的关键在于提出一种基于显式推理的框架——ReasoningNER,该框架包含三个阶段:Chain of Thought (CoT) 生成、CoT 调优与推理增强。通过构建标注有 NER 相关推理链的数据集对模型进行调优,使其在输出最终实体前先生成可解释的推理路径,并利用综合奖励信号优化推理过程,从而实现可验证的实体抽取,显著提升模型在零样本设置下的性能(F1 指标优于 GPT-4 12.3 个百分点)。

链接: https://arxiv.org/abs/2511.11978
作者: Hui Huang,Yanping Chen,Ruizhang Huang,Chuan Lin,Yongbin Qin
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at AAAI 2026

点击查看摘要

Abstract:Generative LLMs typically improve Named Entity Recognition (NER) performance through instruction tuning. They excel at generating entities by semantic pattern matching but lack an explicit, verifiable reasoning mechanism. This “cognitive shortcutting” leads to suboptimal performance and brittle generalization, especially in zero-shot and lowresource scenarios where reasoning from limited contextual cues is crucial. To address this issue, a reasoning framework is proposed for NER, which shifts the extraction paradigm from implicit pattern matching to explicit reasoning. This framework consists of three stages: Chain of Thought (CoT) generation, CoT tuning, and reasoning enhancement. First, a dataset annotated with NER-oriented CoTs is generated, which contain task-relevant reasoning chains. Then, they are used to tune the NER model to generate coherent rationales before deriving the final answer. Finally, a reasoning enhancement stage is implemented to optimize the reasoning process using a comprehensive reward signal. This stage ensures explicit and verifiable extractions. Experiments show that ReasoningNER demonstrates impressive cognitive ability in the NER task, achieving competitive performance. In zero-shot settings, it achieves state-of-the-art (SOTA) performance, outperforming GPT-4 by 12.3 percentage points on the F1 score. Analytical results also demonstrate its great potential to advance research in reasoningoriented information extraction. Our codes are available at this https URL.
zh

[NLP-91] On the Entropy Calibration of Language Models NEURIPS2025

【速读】: 该论文旨在解决语言模型中的熵校准(entropy calibration)问题,即模型生成文本时的熵是否与其在人类文本上的对数损失(log loss)相匹配。研究表明,当前自回归语言模型普遍存在校准偏差,随着生成长度增加,每步熵上升且文本质量下降,这导致误差累积成为核心挑战。传统解决方案是截断分布(truncation),虽可提升文本质量但牺牲多样性。论文通过理论分析与实证测量发现,模型规模扩大并不能显著改善这种校准偏差——无论是理论简化模型还是0.5B至70B参数的多种语言模型均显示,校准误差随规模增长缓慢(scaling exponent接近0)。关键创新在于提出:若假设存在一个黑盒工具能够预测未来文本的熵,则理论上可以在不增加对数损失的前提下降低生成熵,从而实现无权衡的校准优化。

链接: https://arxiv.org/abs/2511.11966
作者: Steven Cao,Gregory Valiant,Percy Liang
机构: Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: Neurips 2025

点击查看摘要

Abstract:We study the problem of entropy calibration, which asks whether a language model’s entropy over generations matches its log loss on human text. Past work found that models are miscalibrated, with entropy per step increasing (and text quality decreasing) as generations grow longer. This error accumulation is a fundamental problem in autoregressive models, and the standard solution is to truncate the distribution, which improves text quality at the cost of diversity. In this paper, we ask: is miscalibration likely to improve with scale, and is it theoretically possible to calibrate without tradeoffs? To build intuition, we first study a simplified theoretical setting to characterize the scaling behavior of miscalibration with respect to dataset size. We find that the scaling behavior depends on the power law exponent of the data distribution – in particular, for a power law exponent close to 1, the scaling exponent is close to 0, meaning that miscalibration improves very slowly with scale. Next, we measure miscalibration empirically in language models ranging from 0.5B to 70B parameters. We find that the observed scaling behavior is similar to what is predicted by the simplified setting: our fitted scaling exponents for text are close to 0, meaning that larger models accumulate error at a similar rate as smaller ones. This scaling (or, lack thereof) provides one explanation for why we sample from larger models with similar amounts of truncation as smaller models, even though the larger models are of higher quality. However, truncation is not a satisfying solution because it comes at the cost of increased log loss. In theory, is it even possible to reduce entropy while preserving log loss? We prove that it is possible, if we assume access to a black box which can fit models to predict the future entropy of text.
zh

[NLP-92] Improving LLM s Attachment to External Knowledge In Dialogue Generation Tasks Through Entity Anonymization

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在知识图谱驱动对话生成(Knowledge Graph-based Dialogue Generation, KG-DG)任务中对外部知识利用不足的问题,即LLMs倾向于依赖自身内嵌知识而非所提供的外部知识图谱(Knowledge Graph, KG),导致生成结果与给定KG脱节。解决方案的关键在于提出一种简单但有效的实体匿名化(entity anonymization)技术,通过隐藏知识图谱中的具体实体名称,迫使模型更关注知识图谱的结构和关系信息,从而增强其对外部知识的依赖性和使用能力。实验表明,该方法显著提升了LLMs在OpenDialKG数据集上的知识附着度(knowledge attachment)。

链接: https://arxiv.org/abs/2511.11946
作者: Hadi Sheikhi,Chenyang Huang,Osmar R. Zaïane
机构: University of Alberta (阿尔伯塔大学); Alberta Machine Intelligence Institute (阿尔伯塔机器智能研究所)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Knowledge graph-based dialogue generation (KG-DG) is a challenging task requiring models to effectively incorporate external knowledge into conversational responses. While large language models (LLMs) have achieved impressive results across various NLP tasks, their ability to utilize external knowledge in KG-DG remains under-explored. We observe that LLMs often rely on internal knowledge, leading to detachment from provided knowledge graphs, even when they are given a flawlessly retrieved knowledge graph. First, we introduce LLM-KAT, an evaluation procedure for measuring knowledge attachment in generated responses. Second, we propose a simple yet effective entity anonymization technique to encourage LLMs to better leverage external knowledge. Experiments on the OpenDialKG dataset demonstrate that our approach improves LLMs’ attachment on external knowledge.
zh

[NLP-93] InData: Towards Secure Multi-Step Tool-Based Data Analysis

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理敏感数据时因直接生成和执行代码而导致的安全风险问题。其解决方案的关键在于提出一种安全驱动的替代范式——限制LLMs直接生成代码或访问数据,而是要求它们仅通过预定义的、经过验证的安全工具集来间接参与数据交互。为此,作者构建了InData数据集,用于评估LLMs在多步骤工具调用场景下的推理能力,从而推动具备更强组合式推理能力的LLM发展。

链接: https://arxiv.org/abs/2511.11933
作者: Karthikeyan K,Raghuveer Thirukovalluru,Bhuwan Dhingra,David Edwin Carlson
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language model agents for data analysis typically generate and execute code directly on databases. However, when applied to sensitive data, this approach poses significant security risks. To address this issue, we propose a security-motivated alternative: restrict LLMs from direct code generation and data access, and require them to interact with data exclusively through a predefined set of secure, verified tools. Although recent tool-use benchmarks exist, they primarily target tool selection and simple execution rather than the compositional, multi-step reasoning needed for complex data analysis. To reduce this gap, we introduce Indirect Data Engagement (InData), a dataset designed to assess LLMs’ multi-step tool-based reasoning ability. InData includes data analysis questions at three difficulty levels–Easy, Medium, and Hard–capturing increasing reasoning complexity. We benchmark 15 open-source LLMs on InData and find that while large models (e.g., gpt-oss-120b) achieve high accuracy on Easy tasks (97.3%), performance drops sharply on Hard tasks (69.6%). These results show that current LLMs still lack robust multi-step tool-based reasoning ability. With InData, we take a step toward enabling the development and evaluation of LLMs with stronger multi-step tool-use capabilities. We will publicly release the dataset and code.
zh

[NLP-94] Additive Large Language Models for Semi-Structured Text

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在临床文本分类中因预测过程不透明而难以被研究人员和临床医生信任的问题,尤其是在需要明确识别患者记录中哪些部分驱动风险信号的场景下。解决方案的关键在于提出一种名为CALM(Classification with Additive Large Language Models)的可解释框架,其核心思想是将输入文本视为由语义明确的组件(如入院记录的段落或问诊表单的问答字段)构成,并将预测结果建模为各组件贡献的加性之和,从而使得每个组件的贡献成为前向传播的一部分,实现患者层面与群体层面的忠实解释。该加性结构还支持直观可视化(如类似广义加性模型的组件级风险曲线),显著提升模型的可审计性和临床可理解性。

链接: https://arxiv.org/abs/2511.11922
作者: Karthikeyan K,Raghuveer Thirukovalluru,David Carlson
机构: Duke University (杜克大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models have advanced clinical text classification, but their opaque predictions remain a critical barrier to practical adoption in research and clinical settings where investigators and physicians need to understand which parts of a patient’s record drive risk signals. To address this challenge, we introduce \textbfCALM, short for \textbfClassification with Additive Large Language Models, an interpretable framework for semi-structured text where inputs are composed of semantically meaningful components, such as sections of an admission note or question-answer fields from an intake form. CALM predicts outcomes as the additive sum of each component’s contribution, making these contributions part of the forward computation itself and enabling faithful explanations at both the patient and population level. The additive structure also enables clear visualizations, such as component-level risk curves similar to those used in generalized additive models, making the learned relationships easier to inspect and communicate. Although CALM expects semi-structured inputs, many clinical documents already have this form, and similar structure can often be automatically extracted from free-text notes. CALM achieves performance comparable to conventional LLM classifiers while improving trust, supporting quality-assurance checks, and revealing clinically meaningful patterns during model development and auditing.
zh

[NLP-95] Forgetting-MarI: LLM Unlearning via Marginal Information Regularization

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在训练过程中因数据隐私和合规性要求而需删除特定数据影响的问题。现有方法常因过度删除信息导致模型性能下降,无法精确控制遗忘范围。其解决方案的关键在于提出一种名为Forgetting-MarI的框架,通过惩罚边际信息(marginal information)实现对目标数据的精准“遗忘”——即仅移除该数据带来的额外信息,同时保留其余数据支持的参数知识,从而在理论上保证未被遗忘数据的残留影响上限,并提供可证明的不可检测性,显著优于当前最先进方法,在多个基准测试中实现了更可靠的遗忘效果与更高的模型性能保持能力。

链接: https://arxiv.org/abs/2511.11914
作者: Shizhou Xu,Yuan Ni,Stefan Broecker,Thomas Strohmer
机构: University of California, Davis(加州大学戴维斯分校); Stanford University(斯坦福大学); SLAC National Accelerator Laboratory(SLAC国家加速器实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Information Theory (cs.IT); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As AI models are trained on ever-expanding datasets, the ability to remove the influence of specific data from trained models has become essential for privacy protection and regulatory compliance. Unlearning addresses this challenge by selectively removing parametric knowledge from the trained models without retraining from scratch, which is critical for resource-intensive models such as Large Language Models (LLMs). Existing unlearning methods often degrade model performance by removing more information than necessary when attempting to ‘‘forget’’ specific data. We introduce Forgetting-MarI, an LLM unlearning framework that provably removes only the additional (marginal) information contributed by the data to be unlearned, while preserving the information supported by the data to be retained. By penalizing marginal information, our method yields an explicit upper bound on the unlearn dataset’s residual influence in the trained models, providing provable undetectability. Extensive experiments confirm that our approach outperforms current state-of-the-art unlearning methods, delivering reliable forgetting and better preserved general model performance across diverse benchmarks. This advancement represents an important step toward making AI systems more controllable and compliant with privacy and copyright regulations without compromising their effectiveness.
zh

[NLP-96] Context-Emotion Aware Therapeutic Dialogue Generation: A Multi-component Reinforcement Learning Approach to Language Models for Mental Health Support

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在心理健康支持场景中因缺乏情境感知和情感理解能力而导致的治疗对话生成不准确的问题。其核心解决方案在于采用监督微调(Supervised Fine-Tuning, SFT)与强化学习(Reinforcement Learning, RL)相结合的方法,重构输入格式以同时整合用户输入、上下文信息和情绪状态,并设计多组件奖励函数,使模型输出更贴近专业治疗师的回应及标注的情绪标签。实验表明,强化学习显著提升了生成对话的语义一致性(如BLEU、ROUGE等指标)和情绪识别准确性(从66.96%提升至99.34%),验证了该方法在构建具备临床辅助价值的智能心理对话系统中的有效性。

链接: https://arxiv.org/abs/2511.11884
作者: Eric Hua Qing Zhang,Julia Ive
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Mental health illness represents a substantial global socioeconomic burden, with COVID-19 further exacerbating accessibility challenges and driving increased demand for telehealth mental health support. While large language models (LLMs) offer promising solutions through 24/7 availability and non-judgmental interactions, pre-trained models often lack the contextual and emotional awareness necessary for appropriate therapeutic responses. This paper investigated the application of supervised fine-tuning (SFT) and reinforcement learning (RL) techniques to enhance GPT-2’s capacity for therapeutic dialogue generation. The methodology restructured input formats to enable simultaneous processing of contextual information and emotional states alongside user input, employing a multi-component reward function that aligned model outputs with professional therapist responses and annotated emotions. Results demonstrated improvements through reinforcement learning over baseline GPT-2 across multiple evaluation metrics: BLEU (0.0111), ROUGE-1 (0.1397), ROUGE-2 (0.0213), ROUGE-L (0.1317), and METEOR (0.0581). LLM evaluation confirmed high contextual relevance and professionalism, while reinforcement learning achieved 99.34% emotion accuracy compared to 66.96% for baseline GPT-2. These findings demonstrate reinforcement learning’s effectiveness in developing therapeutic dialogue systems that can serve as valuable assistive tools for therapists while maintaining essential human clinical oversight.
zh

[NLP-97] ClinStructor: AI-Powered Structuring of Unstructured Clinical Texts

【速读】: 该论文旨在解决临床笔记(clinical notes)中蕴含的丰富上下文信息因未结构化而引发的问题,包括无意偏见(如性别或种族偏见)、跨临床场景泛化能力差(例如在不同电子健康记录(EHR)系统间性能下降)以及模型可解释性不足。解决方案的关键在于提出ClinStructor管道,利用大语言模型(LLMs)将自由文本临床笔记转化为任务特定的问答对(question-answer pairs),从而在预测建模前实现结构化表示,显著提升模型的透明度和可控性,同时仅导致预测性能小幅下降(AUC降低2-3%),优于直接微调方法。

链接: https://arxiv.org/abs/2511.11883
作者: Karthikeyan K,Raghuveer Thirukovalluru,David Carlson
机构: Duke University (杜克大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Clinical notes contain valuable, context-rich information, but their unstructured format introduces several challenges, including unintended biases (e.g., gender or racial bias), and poor generalization across clinical settings (e.g., models trained on one EHR system may perform poorly on another due to format differences) and poor interpretability. To address these issues, we present ClinStructor, a pipeline that leverages large language models (LLMs) to convert clinical free-text into structured, task-specific question-answer pairs prior to predictive modeling. Our method substantially enhances transparency and controllability and only leads to a modest reduction in predictive performance (a 2-3% drop in AUC), compared to direct fine-tuning, on the ICU mortality prediction task. ClinStructor lays a strong foundation for building reliable, interpretable, and generalizable machine learning models in clinical environments.
zh

[NLP-98] Better LLM Reasoning via Dual-Play

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在强化学习中对人工标注监督信号的高度依赖问题,以及由此引发的训练不稳定性和奖励欺骗(reward hacking)风险。其核心解决方案是提出一种名为PasoDoble的双人对抗训练框架,通过引入两个角色——生成挑战性问题并提供标准答案的Proposer和尝试解答这些问题的Solver——实现模型间的相互竞争与协同进化。关键创新在于:1)利用预训练数据增强Proposer的知识以保证问题质量与多样性;2)设计基于有效性和难度的奖励机制防止奖励欺骗,确保双方模型同步更新;3)引入可选的离线训练范式,交替固定一方更新另一方,从而提升训练稳定性。实验表明,该方法可在无监督条件下显著提升LLM的推理能力。

链接: https://arxiv.org/abs/2511.11881
作者: Zhengxin Zhang,Chengyu Huang,Aochong Oliver Li,Claire Cardie
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable progress through Reinforcement Learning with Verifiable Rewards (RLVR), yet still rely heavily on external supervision (e.g., curated labels). Adversarial learning, particularly through self-play, offers a promising alternative that enables models to iteratively learn from themselves - thus reducing reliance on external supervision. Dual-play extends adversarial learning by assigning specialized roles to two models and training them against each other, fostering sustained competition and mutual evolution. Despite its promise, adapting dual-play training to LLMs remains limited, largely due to their susceptibility to reward hacking and training instability. In this paper, we introduce PasoDoble, a novel LLM dual-play framework. PasoDoble adversarially trains two models initialized from the same base model: a Proposer, which generates challenging questions with ground-truth answers, and a Solver, which attempts to solve them. We enrich the Proposer with knowledge from a pre-training dataset to ensure the questions’ quality and diversity. To avoid reward hacking, the Proposer is rewarded for producing only valid questions that push the Solver’s limit, while the Solver is rewarded for solving them correctly, and both are updated jointly. To further enhance training stability, we introduce an optional offline paradigm that decouples Proposer and Solver updates, alternately updating each for several steps while holding the other fixed. Notably, PasoDoble operates without supervision during training. Experimental results show that PasoDoble can improve the reasoning performance of LLMs. Our project page is available at this https URL.
zh

[NLP-99] MedPT: A Massive Medical Question Answering Dataset for Brazilian-Portuguese Speakers

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在医疗领域应用中对低资源语言支持不足的问题,尤其是巴西葡萄牙语(Brazilian Portuguese)缺乏高质量、真实世界临床对话数据集的现状。现有简单翻译方法无法捕捉本土特有的临床和文化语境,如地方性疾病的表达方式。解决方案的关键在于构建首个大规模、真实世界医学问答语料库MedPT,包含384,095条来自患者-医生互动的真实问答对,并通过多阶段精细化筛选与上下文增强策略去除噪声、丰富模糊查询语义;同时利用生成式AI(Generative AI)驱动的标注流程对问题进行七类语义分类,以准确反映用户意图。该数据集不仅覆盖3,200个主题且体现医患沟通中的自然不对称性,实证验证其有效性——在20类专科分诊任务中微调17亿参数模型即可达到94% F1分数,且错误分析揭示其能识别真实的临床歧义,证明其深层语义价值。

链接: https://arxiv.org/abs/2511.11878
作者: Fernanda Bufon Färber,Iago Alves Brito,Julia Soares Dollis,Pedro Schindler Freire Brasil Ribeiro,Rafael Teixeira Sousa,Arlindo Rodrigues Galvão Filho
机构: 未知
类目: Computation and Language (cs.CL)
备注: 11 pages, 3 tables, 2 figures

点击查看摘要

Abstract:While large language models (LLMs) show transformative potential in healthcare, their development remains focused on high-resource languages, creating a critical barrier for others as simple translation fails to capture unique clinical and cultural nuances, such as endemic diseases. To address this, we introduce MedPT, the first large-scale, real-world corpus for Brazilian Portuguese, comprising 384,095 authentic question-answer pairs from patient-doctor interactions. The dataset underwent a meticulous multi-stage curation protocol, using a hybrid quantitative-qualitative analysis to filter noise and contextually enrich thousands of ambiguous queries. We further augmented the corpus via LLM-driven annotation, classifying questions into seven semantic types to capture user intent. Our analysis reveals its thematic breadth (3,200 topics) and unique linguistic properties, like the natural asymmetry in patient-doctor communication. To validate its utility, we benchmark a medical specialty routing task: fine-tuning a 1.7B parameter model achieves an outstanding 94% F1-score on a 20-class setup. Furthermore, our qualitative error analysis shows misclassifications are not random but reflect genuine clinical ambiguities (e.g., between comorbid conditions), proving the dataset’s deep semantic richness. We publicly release MedPT to foster the development of more equitable, accurate, and culturally-aware medical technologies for the Portuguese-speaking world.
zh

[NLP-100] Identifying Imaging Follow-Up in Radiology Reports: A Comparative Analysis of Traditional ML and LLM Approaches LREC2026

【速读】: 该论文旨在解决临床自然语言处理中缺乏针对放射学任务的专用标注数据集的问题,从而限制了对大型语言模型(Large Language Models, LLMs)在随访依从性检测任务上性能的系统评估。其关键解决方案是构建了一个包含6,393份放射学报告的标注语料库,每份报告均标记了随访影像状态,并在此基础上对比了传统机器学习方法(如逻辑回归、支持向量机)、基于Transformer的长文本处理模型(Longformer)以及微调后的Llama3-8B-Instruct与生成式大模型(Generative LLMs),特别是GPT-4o和开源模型GPT-OSS-20B在两种配置下的表现:基础设置与任务优化设置(Advanced)。研究发现,通过聚焦于元数据、推荐语句及其上下文信息进行提示工程优化后,生成式LLMs(尤其是GPT-4o Advanced)达到了F1=0.832的最佳性能,接近人工标注者一致性(F1=0.846),同时表明即便在生成式AI快速发展的背景下,可解释性强且资源消耗低的传统模型仍具重要价值。

链接: https://arxiv.org/abs/2511.11867
作者: Namu Park,Giridhar Kaushik Ramachandran,Kevin Lybarger,Fei Xia,Ozlem Uzuner,Meliha Yetisgen,Martin Gunn
机构: 未知
类目: Computation and Language (cs.CL)
备注: Submitted to LREC 2026

点击查看摘要

Abstract:Large language models (LLMs) have shown considerable promise in clinical natural language processing, yet few domain-specific datasets exist to rigorously evaluate their performance on radiology tasks. In this work, we introduce an annotated corpus of 6,393 radiology reports from 586 patients, each labeled for follow-up imaging status, to support the development and benchmarking of follow-up adherence detection systems. Using this corpus, we systematically compared traditional machine-learning classifiers, including logistic regression (LR), support vector machines (SVM), Longformer, and a fully fine-tuned Llama3-8B-Instruct, with recent generative LLMs. To evaluate generative LLMs, we tested GPT-4o and the open-source GPT-OSS-20B under two configurations: a baseline (Base) and a task-optimized (Advanced) setting that focused inputs on metadata, recommendation sentences, and their surrounding context. A refined prompt for GPT-OSS-20B further improved reasoning accuracy. Performance was assessed using precision, recall, and F1 scores with 95% confidence intervals estimated via non-parametric bootstrapping. Inter-annotator agreement was high (F1 = 0.846). GPT-4o (Advanced) achieved the best performance (F1 = 0.832), followed closely by GPT-OSS-20B (Advanced; F1 = 0.828). LR and SVM also performed strongly (F1 = 0.776 and 0.775), underscoring that while LLMs approach human-level agreement through prompt optimization, interpretable and resource-efficient models remain valuable baselines.
zh

[NLP-101] hree Stage Narrative Analysis; Plot-Sentiment Breakdown Structure Learning and Concept Detection

【速读】: 该论文旨在解决故事理解与分析在自然语言理解领域中的长期挑战,特别是针对电影剧本中情感弧线(sentiment arcs)的自动化识别与扩展性语境分析问题。其解决方案的关键在于构建一个融合词典驱动的情感分析框架,利用基于NRC-VAD数据集的Valence、Arousal和Dominance评分定制化词典(LabMTsimple storylab模块),并结合Ward层次聚类技术对相似情感轨迹进行聚类,从而实现从低层词汇到高层叙事概念的多粒度语义提取,有效支持用户在选择叙事内容时的决策。

链接: https://arxiv.org/abs/2511.11857
作者: Taimur Khan,Ramoza Ahsan,Mohib Hameed
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages

点击查看摘要

Abstract:Story understanding and analysis have long been challenging areas within Natural Language Understanding. Automated narrative analysis requires deep computational semantic representations along with syntactic processing. Moreover, the large volume of narrative data demands automated semantic analysis and computational learning rather than manual analytical approaches. In this paper, we propose a framework that analyzes the sentiment arcs of movie scripts and performs extended analysis related to the context of the characters involved. The framework enables the extraction of high-level and low-level concepts conveyed through the narrative. Using dictionary-based sentiment analysis, our approach applies a custom lexicon built with the LabMTsimple storylab module. The custom lexicon is based on the Valence, Arousal, and Dominance scores from the NRC-VAD dataset. Furthermore, the framework advances the analysis by clustering similar sentiment plots using Wards hierarchical clustering technique. Experimental evaluation on a movie dataset shows that the resulting analysis is helpful to consumers and readers when selecting a narrative or story.
zh

[NLP-102] owards Autoformalization of LLM -generated Outputs for Requirement Verification

【速读】: 该论文试图解决的问题是:如何验证大型语言模型(Large Language Models, LLMs)生成的结构化输出(如形式化逻辑表达式)是否准确且逻辑一致。当前缺乏一种正式的方法来检验LLM从自然语言(Natural Language, NL)生成的形式化内容是否忠实于原始需求。解决方案的关键在于提出并初步验证了一个基于简单LLM的自动形式化(Autoformalization)流程,该流程能够将自然语言要求转化为形式逻辑,并通过逻辑比对识别出不同表述间的等价性或不一致性,从而为LLM生成内容提供形式化验证能力。

链接: https://arxiv.org/abs/2511.11829
作者: Mihir Gupte,Ramesh S
机构: General Motors (通用汽车)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Logic in Computer Science (cs.LO)
备注: To be submitted for publication

点击查看摘要

Abstract:Autoformalization, the process of translating informal statements into formal logic, has gained renewed interest with the emergence of powerful Large Language Models (LLMs). While LLMs show promise in generating structured outputs from natural language (NL), such as Gherkin Scenarios from NL feature requirements, there’s currently no formal method to verify if these outputs are accurate. This paper takes a preliminary step toward addressing this gap by exploring the use of a simple LLM-based autoformalizer to verify LLM-generated outputs against a small set of natural language requirements. We conducted two distinct experiments. In the first one, the autoformalizer successfully identified that two differently-worded NL requirements were logically equivalent, demonstrating the pipeline’s potential for consistency checks. In the second, the autoformalizer was used to identify a logical inconsistency between a given NL requirement and an LLM-generated output, highlighting its utility as a formal verification tool. Our findings, while limited, suggest that autoformalization holds significant potential for ensuring the fidelity and logical consistency of LLM-generated outputs, laying a crucial foundation for future, more extensive studies into this novel application.
zh

[NLP-103] Scaling Open-Weight Large Language Models for Hydropower Regulatory Information Extraction: A Systematic Analysis

【速读】: 该论文旨在解决在监管文档中使用大语言模型(Large Language Models, LLMs)进行信息抽取时,性能与计算资源之间存在的关键权衡问题。其解决方案的关键在于通过系统性评估七个参数规模从0.6B到70B的开源模型在水电许可文档上的表现,识别出一个14B参数阈值——在此阈值以下,验证方法无效(F1得分低于0.15),而在此阈值以上则可实现有效抽取(F1达0.64)。研究进一步表明,消费级部署模型经适当验证可达到64% F1,而小模型性能受限于系统性幻觉(hallucination)现象,即使召回率完美也反映提取失败;大型模型虽接近77% F1,但需企业级基础设施支持。这一发现首次建立了监管场景下开源模型资源-性能映射关系,为基于证据的信息抽取模型选择提供依据。

链接: https://arxiv.org/abs/2511.11821
作者: Hong-Jun Yoon,Faisal Ashraf,Thomas A. Ruggles,Debjani Singh
机构: Oak Ridge National Laboratory (橡树岭国家实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages, zero figures, Preprint submitted to Environmental Modeling and Software

点击查看摘要

Abstract:Information extraction from regulatory documents using large language models presents critical trade-offs between performance and computational resources. We evaluated seven open-weight models (0.6B-70B parameters) on hydropower licensing documentation to provide empirical deployment guidance. Our analysis identified a pronounced 14B parameter threshold where validation methods transition from ineffective (F1 0.15) to viable (F1 = 0.64). Consumer-deployable models achieve 64% F1 through appropriate validation, while smaller models plateau at 51%. Large-scale models approach 77% F1 but require enterprise infrastructure. We identified systematic hallucination patterns where perfect recall indicates extraction failure rather than success in smaller models. Our findings establish the first comprehensive resource-performance mapping for open-weight information extraction in regulatory contexts, enabling evidence-based model selection. These results provide immediate value for hydropower compliance while contributing insights into parameter scaling effects that generalize across information extraction tasks. Comments: 18 pages, zero figures, Preprint submitted to Environmental Modeling and Software Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2511.11821 [cs.CL] (or arXiv:2511.11821v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2511.11821 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Hong-Jun Yoon [view email] [v1] Fri, 14 Nov 2025 19:23:25 UTC (238 KB)
zh

[NLP-104] Do LLM s Really Struggle at NL-FOL Translation? Revealing their Strengths via a Novel Benchmarking Strategy AAAI2026 AAAI

【速读】: 该论文旨在解决自然语言到一阶逻辑(First-Order Logic, FOL)翻译(NL-FOL translation)这一长期存在的挑战,尤其聚焦于评估大语言模型(Large Language Models, LLMs)在该任务上的真实语义理解能力。现有研究在评估方法上存在关键缺陷,可能导致对模型能力的误判,例如混淆了模式识别、记忆或数据污染与真正的逻辑理解。论文的核心解决方案在于提出一种新型评估协议,专门设计用于区分模型是否具备深层次的语义级逻辑理解能力,而非表面特征匹配。通过该协议,作者发现当前对话导向的LLMs展现出较强的NL-FOL翻译能力及句子层面的逻辑掌握能力,而基于嵌入(embedding-centric)的模型表现显著较差。

链接: https://arxiv.org/abs/2511.11816
作者: Andrea Brunello,Luca Geatti,Michele Mignani,Angelo Montanari,Nicola Saccomanno
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
备注: Full version of the paper accepted for publication at The 40th Annual AAAI Conference on Artificial Intelligence (AAAI 2026)

点击查看摘要

Abstract:Due to its expressiveness and unambiguous nature, First-Order Logic (FOL) is a powerful formalism for representing concepts expressed in natural language (NL). This is useful, e.g., for specifying and verifying desired system properties. While translating FOL into human-readable English is relatively straightforward, the inverse problem, converting NL to FOL (NL-FOL translation), has remained a longstanding challenge, for both humans and machines. Although the emergence of Large Language Models (LLMs) promised a breakthrough, recent literature provides contrasting results on their ability to perform NL-FOL translation. In this work, we provide a threefold contribution. First, we critically examine existing datasets and protocols for evaluating NL-FOL translation performance, revealing key limitations that may cause a misrepresentation of LLMs’ actual capabilities. Second, to overcome these shortcomings, we propose a novel evaluation protocol explicitly designed to distinguish genuine semantic-level logical understanding from superficial pattern recognition, memorization, and dataset contamination. Third, using this new approach, we show that state-of-the-art, dialogue-oriented LLMs demonstrate strong NL-FOL translation skills and a genuine grasp of sentence-level logic, whereas embedding-centric models perform markedly worse.
zh

[NLP-105] On the Notion that Language Models Reason

【速读】: 该论文试图解决的问题是:当前语言模型(Language Models, LMs)被广泛认为具备“推理”能力,但这一说法缺乏严谨的定义和理论支撑,且与LM的实际训练机制、信息处理方式及生成过程存在根本性不一致。为澄清这一问题,作者提出关键解决方案:将基于Transformer的LM视为一种隐式有限阶马尔可夫核(implicit finite-order Markov kernel),该核将上下文映射为条件词元分布。在此视角下,LM产生的类推理输出本质上是学习到的统计规律和近似统计不变性的体现,而非显式的逻辑推理机制实现。这一观点揭示了LM作为“统计模式匹配器”的本质,解释了为何其输出看似具有推理特征却无法保证逻辑一致性,从而为理解大语言模型中的认知不确定性提供了理论基础。

链接: https://arxiv.org/abs/2511.11810
作者: Bertram Højer
机构: IT University of Copenhagen (哥本哈根信息技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at the 1st Workshop on Epistemic Intelligence in Machine Learning, EurIPS 2025

点击查看摘要

Abstract:Language models (LMs) are said to be exhibiting reasoning, but what does this entail? We assess definitions of reasoning and how key papers in the field of natural language processing (NLP) use the notion and argue that the definitions provided are not consistent with how LMs are trained, process information, and generate new tokens. To illustrate this incommensurability we assume the view that transformer-based LMs implement an \textitimplicit finite-order Markov kernel mapping contexts to conditional token distributions. In this view, reasoning-like outputs correspond to statistical regularities and approximate statistical invariances in the learned kernel rather than the implementation of explicit logical mechanisms. This view is illustrative of the claim that LMs are “statistical pattern matchers”" and not genuine reasoners and provides a perspective that clarifies why reasoning-like outputs arise in LMs without any guarantees of logical consistency. This distinction is fundamental to how epistemic uncertainty is evaluated in LMs. We invite a discussion on the importance of how the computational processes of the systems we build and analyze in NLP research are described.
zh

[NLP-106] MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model Context and Interactive Scaling

【速读】: 该论文旨在解决当前研究代理(research agent)在工具增强推理和信息获取能力上的局限性,特别是传统方法仅通过扩大模型规模或上下文长度来提升性能,而忽视了交互深度与频率对复杂任务处理能力的影响。其核心解决方案是提出“交互扩展”(interaction scaling)这一新维度,即通过强化学习训练模型在单次任务中执行更多工具调用(最多600次),并利用环境反馈与外部信息获取实现错误纠正和轨迹优化,从而显著提升多轮推理能力和真实世界研究流程的适应性。实验证明,交互深度可像模型规模和上下文长度一样呈现可预测的缩放效应,确立了交互扩展作为下一代开源研究代理的关键支柱。

链接: https://arxiv.org/abs/2511.11793
作者: MiroMind Team,Song Bai,Lidong Bing,Carson Chen,Guanzheng Chen,Yuntao Chen,Zhe Chen,Ziyi Chen,Jifeng Dai,Xuan Dong,Yue Deng,Yunjie Fu,Junqi Ge,Chenxia Han,Tammy Huang,Zhenhang Huang,Jerry Jiao,Shilei Jiang,Tianyu Jiao,Xiaoqi Jian,Lei Lei,Ruilin Li,Ryan Luo,Tiantong Li,Xiang Lin,Ziyuan Liu,Zhiqi Li,Jie Ni,Qiang Ren,Pax Sun,Shiqian Su,Chenxin Tao,Bin Wang,Hellen Wang,Haonan Wang,James Wang,Jin Wang,Jojo Wang,Letian Wang,Shizun Wang,Weizhi Wang,Zixuan Wang,Jinfan Xu,Sen Xing,Chenyu Yang,Hai Ye,Jiaheng Yu,Yue Yu,Muyan Zhong,Tianchen Zhao,Xizhou Zhu,Yanpeng Zhou,Yifan Zhang,Zhi Zhu
机构: 未知
类目: Computation and Language (cs.CL)
备注: Technical Report

点击查看摘要

Abstract:We present MiroThinker v1.0, an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities. Unlike previous agents that only scale up model size or context length, MiroThinker explores interaction scaling at the model level, systematically training the model to handle deeper and more frequent agent-environment interactions as a third dimension of performance improvement. Unlike LLM test-time scaling, which operates in isolation and risks degradation with longer reasoning chains, interactive scaling leverages environment feedback and external information acquisition to correct errors and refine trajectories. Through reinforcement learning, the model achieves efficient interaction scaling: with a 256K context window, it can perform up to 600 tool calls per task, enabling sustained multi-turn reasoning and complex real-world research workflows. Across four representative benchmarks-GAIA, HLE, BrowseComp, and BrowseComp-ZH-the 72B variant achieves up to 81.9%, 37.7%, 47.1%, and 55.6% accuracy respectively, surpassing previous open-source agents and approaching commercial counterparts such as GPT-5-high. Our analysis reveals that MiroThinker benefits from interactive scaling consistently: research performance improves predictably as the model engages in deeper and more frequent agent-environment interactions, demonstrating that interaction depth exhibits scaling behaviors analogous to model size and context length. These findings establish interaction scaling as a third critical dimension for building next-generation open research agents, complementing model capacity and context windows.
zh

[NLP-107] Reasoning : From Reflection to Solution

【速读】: 该论文试图解决的核心问题是:当前大语言模型(Large Language Models, LLMs)在诸如GSM8K和HumanEval等基准测试中表现出的“超人类”性能,究竟是源于真正的推理能力,还是仅仅对已有推理轨迹的模式匹配。作者指出,现有系统缺乏对推理本质的结构化理解,导致其在复杂任务中表现不稳定甚至失败。解决方案的关键在于提出一个形式化的定义——推理是状态空间中迭代算子应用并收敛至不动点的过程(reasoning as iterative operator application in state spaces converging to fixed points),这一定义具有明确的架构启示意义。基于此理论框架,论文构建了OpenLM系统,在OpenXOR难题上实现了76%准确率,而最先进的LLMs在此任务上准确率为0%,从而验证了该定义下架构设计的有效性。

链接: https://arxiv.org/abs/2511.11712
作者: Zixi Li
机构: Noesis Lab (Independent Research Group); Sun Yat-sen University (中山大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:What is reasoning? This question has driven centuries of philosophical inquiry, from Aristotle’s syllogisms to modern computational complexity theory. In the age of large language models achieving superhuman performance on benchmarks like GSM8K (95% accuracy) and HumanEval (90% pass@1), we must ask: have these systems learned to \emphreason, or have they learned to \emphpattern-match over reasoning traces? This paper argues for a specific answer: \textbfreasoning is iterative operator application in state spaces, converging to fixed points. This definition is not merely philosophical – it has concrete architectural implications that explain both the failures of current systems and the path to genuine reasoning capabilities. Our investigation begins with a puzzle (OpenXOR), progresses through theory (OpenOperator), and culminates in a working solution (OpenLM) that achieves 76% accuracy where state-of-the-art LLMs achieve 0%. This is not about criticizing existing systems, but about \emphunderstanding what reasoning requires and \emphbuilding architectures that provide it. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2511.11712 [cs.LG] (or arXiv:2511.11712v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.11712 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-108] Generative AI as a Linguistic Equalizer in Global Science

【速读】: 该论文旨在解决全球科学领域中因英语主导地位而对非母语英语研究者造成的不平等问题,即语言壁垒如何限制非英语国家科研人员的学术影响力与发表机会。其解决方案的关键在于利用生成式 AI(Generative AI)作为技术手段,通过大规模实证分析验证 GenAI 是否能够缩小不同语言背景作者在科学写作风格上的差异,从而实现语言层面的“平等化”。研究基于2021至2024年间565万篇科学论文的数据,使用 SciBERT 提取文本嵌入(text embeddings),衡量非英语国家作者在使用 GenAI 辅助写作后与美国基准群体的写作风格相似度变化,发现自2022年底 ChatGPT 发布以来,GenAI 辅助论文呈现出显著且持续增强的风格趋同趋势,尤其在语言距离英语较远的国家表现最为明显,表明 GenAI 正在重塑全球科学传播格局,有效降低语言障碍。

链接: https://arxiv.org/abs/2511.11687
作者: Dragan Filimonovic,Christian Rutzer,Jeffrey Macher,Rolf Weder
机构: University of Basel (巴塞尔大学); Georgetown University (乔治城大学)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:For decades, the dominance of English has created a substantial barrier in global science, disadvantaging non-native speakers. The recent rise of generative AI (GenAI) offers a potential technological response to this long-standing inequity. We provide the first large-scale evidence testing whether GenAI acts as a linguistic equalizer in global science. Drawing on 5.65 million scientific articles published from 2021 to 2024, we compare GenAI-assisted and non-assisted publications from authors in non-English-speaking countries. Using text embeddings derived from a pretrained large language model (SciBERT), we measure each publication’s linguistic similarity to a benchmark of scientific writing from U.S.-based authors and track stylistic convergence over time. We find significant and growing convergence for GenAI-assisted publications after the release of ChatGPT in late 2022. The effect is strongest for domestic coauthor teams from countries linguistically distant from English. These findings provide large-scale evidence that GenAI is beginning to reshape global science communication by reducing language barriers in research.
zh

[NLP-109] A Structure-Agnostic Co-Tuning Framework for LLM s and SLMs in Cloud-Edge Systems

【速读】: 该论文旨在解决带宽受限的云服务器在实时处理大规模语言模型(Large Language Models, LLMs)工作负载时,难以兼顾用户数据隐私保护与推理性能的问题。针对这一挑战,作者提出了一种名为Co-PLMs的协同调优框架,其核心在于通过结构无关的相互学习机制实现异构语言模型间的知识迁移:利用蒸馏代理模型(Distilled Proxy Models, DPMs)作为桥梁,在云端LLM与边缘设备上的小型语言模型(Small Language Models, SLMs)之间建立协作训练通道,同时保留各设备的领域特定洞察。该方案有效缓解了跨域部署和SLMs结构异质性带来的性能瓶颈,实验表明其在Rouge-L和EM指标上分别提升5.38%和4.88%,优于现有最优方法。

链接: https://arxiv.org/abs/2511.11678
作者: Yuze Liu,Yunhan Wang,Tiehua Zhang,Zhishu Shen,Cheng Peng,Libing Wu,Feng Xia,Jiong Jin
机构: Swinburne University of Technology (斯威本科技大学); Tongji University (同济大学); Wuhan University of Technology (武汉理工大学); INFLY TECH (Shanghai) Co., Ltd. (上海英飞科技有限公司); Wuhan University (武汉大学); RMIT University (皇家墨尔本理工大学)
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The surge in intelligent applications driven by large language models (LLMs) has made it increasingly difficult for bandwidth-limited cloud servers to process extensive LLM workloads in real time without compromising user data privacy. To solve these problems, recent research has focused on constructing cloud-edge consortia that integrate server-based LLM with small language models (SLMs) on mobile edge devices. Furthermore, designing collaborative training mechanisms within such consortia to enhance inference performance has emerged as a promising research direction. However, the cross-domain deployment of SLMs, coupled with structural heterogeneity in SLMs architectures, poses significant challenges to enhancing model performance. To this end, we propose Co-PLMs, a novel co-tuning framework for collaborative training of large and small language models, which integrates the process of structure-agnostic mutual learning to realize knowledge exchange between the heterogeneous language models. This framework employs distilled proxy models (DPMs) as bridges to enable collaborative training between the heterogeneous server-based LLM and on-device SLMs, while preserving the domain-specific insights of each device. The experimental results show that Co-PLMs outperform state-of-the-art methods, achieving average increases of 5.38% in Rouge-L and 4.88% in EM.
zh

[NLP-110] H-Model: Dynamic Neural Architectures for Adaptive Processing

【速读】: 该论文旨在解决传统神经网络结构固定、缺乏动态适应能力的问题,即如何使网络不仅学习数据的表示,还能自主调整计算结构以适应输入特征。其解决方案的关键在于引入一种路由机制(routing mechanism),该机制允许每一层根据输入数据和内部状态动态决定输出信息的传播路径,从而实现迭代式与自适应的计算过程,这为构建更具可解释性和灵活性的神经网络架构提供了新思路。

链接: https://arxiv.org/abs/2511.11669
作者: Dmytro Hospodarchuk
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Independent research report, 24 pages including references and figures

点击查看摘要

Abstract:This article explores the design and experimentation of a neural network architecture capable of dynamically adjusting its internal structure based on the input data. The proposed model introduces a routing mechanism that allows each layer to influence how its outputs are propagated through the network, enabling iterative and adaptive computation. This concept is loosely inspired by the idea of thought processes and dynamic reasoning, where information flow is conditioned not only on the data itself, but also on the internal state of the system. It is important to note that this work does not aim to compete with state-of-the-art language models in terms of performance. Instead, it presents a conceptual prototype-an architectural framework that opens up a new direction for exploring adaptable and potentially more interpretable networks. The goal is not optimization of existing benchmarks but rather the proposal of a system that can learn not only representations, but also the structure of computation itself. Due to practical constraints in computing resources and data, this study remains a preliminary investigation. Nevertheless, initial observations show promise, and the architecture’s full potential can only be evaluated in future experiments under more favorable computational conditions. Comments: Independent research report, 24 pages including references and figures Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL) MSC classes: 68T07 ACMclasses: I.2.6; I.5.1 Cite as: arXiv:2511.11669 [cs.LG] (or arXiv:2511.11669v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.11669 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-111] Automatic generation of DRI Statements

【速读】: 该论文旨在解决群体审议质量评估中因Deliberative Reason Index(DRI)语句生成过程复杂且耗时,导致研究实施难度大、可及性低的问题。其解决方案的关键在于提出了一种基于自然语言处理(Natural Language Processing, NLP)和大语言模型(Large Language Models, LLMs)的自动化DRI语句生成框架,显著降低了人工参与成本,并为社会科学研究方法中集成生成式人工智能(Generative AI)提供了可复现的模板。

链接: https://arxiv.org/abs/2511.11655
作者: Maurice Flechtner
机构: 未知
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注: Master Thesis

点击查看摘要

Abstract:Assessing the quality of group deliberation is essential for improving our understanding of deliberative processes. The Deliberative Reason Index (DRI) offers a sophisticated metric for evaluating group reasoning, but its implementation has been constrained by the complex and time-consuming process of statement generation. This thesis introduces an innovative, automated approach to DRI statement generation that leverages advanced natural language processing (NLP) and large language models (LLMs) to substantially reduce the human effort involved in survey preparation. Key contributions are a systematic framework for automated DRI statement generation and a methodological innovation that significantly lowers the barrier to conducting comprehensive deliberative process assessments. In addition, the findings provide a replicable template for integrating generative artificial intelligence into social science research methodologies.
zh

[NLP-112] EduAgent QG: A Multi-Agent Workflow Framework for Personalized Question Generation

【速读】: 该论文旨在解决当前自动化题库生成方法中存在的问题,即现有单智能体或基于规则的流水线方法在生成题目时质量不稳定、多样性不足且难以与教育目标对齐。解决方案的关键在于提出一个名为EduAgentQG的多智能体协作框架,通过五个专业化智能体(规划者、写作者、求解者、教育者和检查者)构成的迭代反馈机制实现高质量、多样化且符合教学目标的个性化题目生成:其中规划者负责制定结构化设计计划以提升多样性,写作者依据计划生成候选题目并结合求解者与教育者的评分反馈优化质量与多样性,最终由检查者进行答案正确性和清晰度的验证,确保题目与教育目标的一致性。

链接: https://arxiv.org/abs/2511.11635
作者: Rui Jia,Min Zhang,Fengrui Liu,Bo Jiang,Kun Kuang,Zhongxiang Dai
机构: 华东师范大学智能教育实验室(Educational Intelligence Laboratory, East China Normal University)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:High-quality personalized question banks are crucial for supporting adaptive learning and individualized assessment. Manually designing questions is time-consuming and often fails to meet diverse learning needs, making automated question generation a crucial approach to reduce teachers’ workload and improve the scalability of educational resources. However, most existing question generation methods rely on single-agent or rule-based pipelines, which still produce questions with unstable quality, limited diversity, and insufficient alignment with educational goals. To address these challenges, we propose EduAgentQG, a multi-agent collaborative framework for generating high-quality and diverse personalized questions. The framework consists of five specialized agents and operates through an iterative feedback loop: the Planner generates structured design plans and multiple question directions to enhance diversity; the Writer produces candidate questions based on the plan and optimizes their quality and diversity using feedback from the Solver and Educator; the Solver and Educator perform binary scoring across multiple evaluation dimensions and feed the evaluation results back to the Writer; the Checker conducts final verification, including answer correctness and clarity, ensuring alignment with educational goals. Through this multi-agent collaboration and iterative feedback loop, EduAgentQG generates questions that are both high-quality and diverse, while maintaining consistency with educational objectives. Experiments on two mathematics question datasets demonstrate that EduAgentQG outperforms existing single-agent and multi-agent methods in terms of question diversity, goal consistency, and overall quality.
zh

[NLP-113] Characterizing and Understanding Energy Footprint and Efficiency of Small Language Model on Edges

【速读】: 该论文旨在解决在资源受限的边缘设备上高效部署小语言模型(Small Language Models, SLMs)所面临的挑战,特别是如何在保证性能的同时优化能效比。其关键解决方案在于通过实证分析不同SLMs(如Llama 3.2、Phi-3 Mini、TinyLlama和Gemma 2)在多种边缘硬件平台(Raspberry Pi 5、Jetson Nano、Jetson Orin Nano的CPU与GPU配置)上的推理能效表现,发现GPU加速、内存带宽以及模型架构设计是提升推理能效的核心因素;其中,Jetson Orin Nano配合GPU加速展现出最优的能量-性能比,而Llama 3.2在准确性和能效之间提供了最佳平衡,为智能系统和移动自组织网络等场景下的边缘AI部署提供了可落地的优化策略。

链接: https://arxiv.org/abs/2511.11624
作者: Md Romyull Islam,Bobin Deng,Nobel Dhar,Tu N. Nguyen,Selena He,Yong Shi,Kun Suo
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Submitted version; 9 pages, 5 figures; presented at IEEE MASS 2025 (online publication pending)

点击查看摘要

Abstract:Cloud-based large language models (LLMs) and their variants have significantly influenced real-world applications. Deploying smaller models (i.e., small language models (SLMs)) on edge devices offers additional advantages, such as reduced latency and independence from network connectivity. However, edge devices’ limited computing resources and constrained energy budgets challenge efficient deployment. This study evaluates the power efficiency of five representative SLMs - Llama 3.2, Phi-3 Mini, TinyLlama, and Gemma 2 on Raspberry Pi 5, Jetson Nano, and Jetson Orin Nano (CPU and GPU configurations). Results show that Jetson Orin Nano with GPU acceleration achieves the highest energy-to-performance ratio, significantly outperforming CPU-based setups. Llama 3.2 provides the best balance of accuracy and power efficiency, while TinyLlama is well-suited for low-power environments at the cost of reduced accuracy. In contrast, Phi-3 Mini consumes the most energy despite its high accuracy. In addition, GPU acceleration, memory bandwidth, and model architecture are key in optimizing inference energy efficiency. Our empirical analysis offers practical insights for AI, smart systems, and mobile ad-hoc platforms to leverage tradeoffs from accuracy, inference latency, and power efficiency in energy-constrained environments.
zh

[NLP-114] Small Vocabularies Big Gains: Pretraining and Tokenization in Time Series Models

【速读】: 该论文旨在解决时间序列建模中离散表示学习(discrete representation learning)的两个核心问题:一是分词器(tokenizer)设计对模型性能的影响,二是预训练(pretraining)与随机初始化在优化效率和特征对齐方面的差异。其关键解决方案在于系统性地分析分词策略(尤其是缩放和量化方法)如何决定模型的表征能力与稳定性,并揭示预训练能更有效地利用精心设计的分词器,尤其是在小词汇量条件下;同时指出若分词不匹配,预训练的优势可能被削弱甚至逆转。研究结果表明,在多模态预测场景中,结合小而高效的词汇表与预训练权重是提升模型性能的有效路径。

链接: https://arxiv.org/abs/2511.11622
作者: Alexis Roger,Gwen Legate,Kashif Rasul,Yuriy Nevmyvaka,Irina Rish
机构: 1. University of Montreal (蒙特利尔大学); 2. MILA (MILA); 3. Google Brain (谷歌大脑); 4. Mila (Mila); 5. Meta (Meta)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Tokenization and transfer learning are two critical components in building state of the art time series foundation models for forecasting. In this work, we systematically study the effect of tokenizer design, specifically scaling and quantization strategies, on model performance, alongside the impact of pretraining versus random initialization. We show that tokenizer configuration primarily governs the representational capacity and stability of the model, while transfer learning influences optimization efficiency and alignment. Using a combination of empirical training experiments and theoretical analyses, we demonstrate that pretrained models consistently leverage well-designed tokenizers more effectively, particularly at smaller vocabulary sizes. Conversely, misaligned tokenization can diminish or even invert the benefits of pretraining. These findings highlight the importance of careful tokenization in time series modeling and suggest that combining small, efficient vocabularies with pretrained weights is especially advantageous in multi-modal forecasting settings, where the overall vocabulary must be shared across modalities. Our results provide concrete guidance for designing tokenizers and leveraging transfer learning in discrete representation learning for continuous signals.
zh

[NLP-115] SynBullying: A Multi LLM Synthetic Conversational Dataset for Cyberbullying Detectio

【速读】: 该论文旨在解决当前网络欺凌(cyberbullying, CB)研究中数据获取的伦理困境与局限性问题,即传统依赖真实人类对话数据收集方式存在隐私风险、样本稀缺及标注成本高等挑战。其解决方案的关键在于提出SynBullying——一个基于多大语言模型(multi-LLM)生成的对话式合成数据集,通过模拟真实社交场景中的多轮交互来构建具有伦理安全性和可扩展性的CB语料库。该方案的核心创新点包括:(1)保留对话结构以捕捉上下文动态;(2)引入情境感知的标注机制,结合意图和话语流评估危害程度;(3)提供细粒度类别标签,支持对不同类型的网络欺凌行为进行精细化分析。实证表明,该数据集在多个维度上具备良好的代表性,并能有效作为独立训练数据或增强源提升CB分类性能。

链接: https://arxiv.org/abs/2511.11599
作者: Arefeh Kazemi,Hamza Qadeer,Joachim Wagner,Hossein Hosseini,Sri Balaaji Natarajan Kalaivendan,Brian Davis
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:We introduce SynBullying, a synthetic multi-LLM conversational dataset for studying and detecting cyberbullying (CB). SynBullying provides a scalable and ethically safe alternative to human data collection by leveraging large language models (LLMs) to simulate realistic bullying interactions. The dataset offers (i) conversational structure, capturing multi-turn exchanges rather than isolated posts; (ii) context-aware annotations, where harmfulness is assessed within the conversational flow considering context, intent, and discourse dynamics; and (iii) fine-grained labeling, covering various CB categories for detailed linguistic and behavioral analysis. We evaluate SynBullying across five dimensions, including conversational structure, lexical patterns, sentiment/toxicity, role dynamics, harm intensity, and CB-type distribution. We further examine its utility by testing its performance as standalone training data and as an augmentation source for CB classification.
zh

[NLP-116] CLINB: A Climate Intelligence Benchmark for Foundational Models WWW

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理复杂、专业化知识时评估困难的问题,尤其聚焦于气候变化这一高度专业化的领域。其解决方案的关键在于引入CLINB基准测试,该基准通过开放性、基于事实的多模态问答任务来评估模型表现,并明确要求知识质量和证据支持;CLINB依赖真实用户提问数据集及由顶尖气候科学家制定的评价标准,同时构建了基于模型的评估流程以验证结果可靠性。研究发现,前沿模型虽具备出色的跨知识融合能力(常达到博士级理解水平),但其在证据溯源方面存在严重缺陷,如参考文献和图像的幻觉率较高,表明当前模型在知识合成与可验证归因之间存在显著差距,亟需类似CLINB这样可靠且可解释的基准来推动可信人工智能系统的发展。

链接: https://arxiv.org/abs/2511.11597
作者: Michelle Chen Huebscher,Katharine Mach,Aleksandar Stanić,Markus Leippold,Ben Gaiarin,Zeke Hausfather,Elisa Rawat,Erich Fischer,Massimiliano Ciaramita,Joeri Rogelj,Christian Buck,Lierni Sestorain Saralegui,Reto Knutti
机构: University of Miami(迈阿密大学); University of Zurich(苏黎世大学); Stripe(Stripe); ETH Zurich(苏黎世联邦理工学院); Imperial College London(帝国理工学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Questions, system prompt and model judge prompts available here: this https URL

点击查看摘要

Abstract:Evaluating how Large Language Models (LLMs) handle complex, specialized knowledge remains a critical challenge. We address this through the lens of climate change by introducing CLINB, a benchmark that assesses models on open-ended, grounded, multimodal question answering tasks with clear requirements for knowledge quality and evidential support. CLINB relies on a dataset of real users’ questions and evaluation rubrics curated by leading climate scientists. We implement and validate a model-based evaluation process and evaluate several frontier models. Our findings reveal a critical dichotomy. Frontier models demonstrate remarkable knowledge synthesis capabilities, often exhibiting PhD-level understanding and presentation quality. They outperform “hybrid” answers curated by domain experts assisted by weaker models. However, this performance is countered by failures in grounding. The quality of evidence varies, with substantial hallucination rates for references and images. We argue that bridging this gap between knowledge synthesis and verifiable attribution is essential for the deployment of AI in scientific workflows and that reliable, interpretable benchmarks like CLINB are needed to progress towards building trustworthy AI systems.
zh

[NLP-117] meStampEval: A Simple LLM Eval and a Little Fuzzy Matching Trick to Improve Search Accuracy

【速读】: 该论文旨在解决在长篇文本中检索语义相同但句法不同的引用片段(如官方记录与语音转录文本之间的对齐)时,传统模糊匹配方法失效的问题。其核心挑战在于从时间戳标记的长语音转录中精确返回目标引文的毫秒级起止边界,而标准算法难以应对因转录误差或编辑漂移导致的非字面一致表达。解决方案的关键在于提出一种两阶段方法:首先使用RapidFuzz进行预筛选以缩小候选范围,再由大型语言模型(LLM)对短片段进行验证,该“辅助模糊匹配”(Assisted Fuzzy)策略显著提升准确率(最高达50个百分点),同时将延迟降低50%、单位正确结果成本减少96%,并展现出对不同长度、词汇漂移和领域变化的鲁棒性。

链接: https://arxiv.org/abs/2511.11594
作者: James McCammon
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Traditional fuzzy matching often fails when searching for quotes that are semantically identical but syntactically different across documents-a common issue when aligning official written records with speech-to-text transcripts. We introduce TimeStampEval, a benchmark for retrieving precise millisecond timestamps from long transcripts given non-verbatim quotes. Our simple two-stage method dramatically improves retrieval accuracy while cutting inference costs by over 90%. The motivating use case is an automated long-form podcast that assembles Congressional Record clips into AI-hosted narration. The technical challenge: given a sentence-timestamped transcript and a target quote that may differ due to transcription or editorial drift, return exact start and end boundaries. Standard algorithms handle verbatim text but break under fuzzier variants. Evaluating six modern LLMs on a 2,800-sentence (120k-token) transcript revealed four key findings. (1) Prompt design matters more than model choice: placing the query before the transcript and using compact formatting improved accuracy by 3-20 points while reducing token count by 30-40%. (2) Off-by-one errors form a distinct category, showing models understand the task but misplace boundaries. (3) A modest reasoning budget (600-850 tokens) raises accuracy from 37% to 77% for weak setups and to above 90% for strong ones. (4) Our “Assisted Fuzzy” approach-RapidFuzz pre-filtering followed by LLM verification on short snippets-improves fuzzy match accuracy by up to 50 points while halving latency and reducing cost per correct result by up to 96%. Extended tests on ten transcripts (50k-900k tokens, 1989-2025) confirm robustness to transcript length, vocabulary drift, and domain change, maintaining 95-100% rejection accuracy for absent targets.
zh

[NLP-118] LLM -Generated Negative News Headlines Dataset: Creation and Benchmarking Against Real Journalism

【速读】: 该论文旨在解决自然语言处理(Natural Language Processing, NLP)任务中因真实数据获取困难及隐私问题带来的挑战,尤其聚焦于负面情绪文本(negative valence text)在情感分析中的应用。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)生成合成新闻标题数据集,通过定制化提示词构建覆盖多社会领域的负面情绪语料,并借助专家评审与嵌入空间分析验证其内容、语气、长度和风格等维度的真实性与一致性。实验表明,该合成数据集在多数关键指标(如与真实标题的相关性、困惑度、连贯性和语义相似性)上与真实数据高度匹配,仅在词性标注中的专有名词比例上存在显著差异,证明LLM生成数据可作为高质量替代方案用于NLP研究与实践。

链接: https://arxiv.org/abs/2511.11591
作者: Olusola Babalola,Bolanle Ojokoh,Olutayo Boyinbode
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 50 pages, 19 figures, 9 tables

点击查看摘要

Abstract:This research examines the potential of datasets generated by Large Language Models (LLMs) to support Natural Language Processing (NLP) tasks, aiming to overcome challenges related to data acquisition and privacy concerns associated with real-world data. Focusing on negative valence text, a critical component of sentiment analysis, we explore the use of LLM-generated synthetic news headlines as an alternative to real-world data. A specialized corpus of negative news headlines was created using tailored prompts to capture diverse negative sentiments across various societal domains. The synthetic headlines were validated by expert review and further analyzed in embedding space to assess their alignment with real-world negative news in terms of content, tone, length, and style. Key metrics such as correlation with real headlines, perplexity, coherence, and realism were evaluated. The synthetic dataset was benchmarked against two sets of real news headlines using evaluations including the Comparative Perplexity Test, Comparative Readability Test, Comparative POS Profiling, BERTScore, and Comparative Semantic Similarity. Results show the generated headlines match real headlines with the only marked divergence being in the proper noun score of the POS profile test.
zh

[NLP-119] he Anatomy of a Triton Attention Kernel

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)推理平台在跨硬件架构下的可移植性与高效性难题,即如何在不依赖低级手工调优的前提下,实现对不同GPU厂商(如NVIDIA和AMD)的高性能支持。其解决方案的关键在于开发了一个基于领域特定即时编译语言Triton的先进分页注意力(paged attention)核函数,通过算法优化、系统级改进及参数自动调优,将通用Triton注意力核的性能从仅达到最先进水平的19.7%提升至105.9%,从而验证了利用开源领域特定语言实现跨平台高效LLM推理的可行性。

链接: https://arxiv.org/abs/2511.11581
作者: Burkhard Ringlein,Jan van Lunteren,Radu Stoica,Thomas Parnell
机构: IBM Research (IBM 研究院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:A long-standing goal in both industry and academia is to develop an LLM inference platform that is portable across hardware architectures, eliminates the need for low-level hand-tuning, and still delivers best-in-class efficiency. In this work, we demonstrate that portable, efficient cross-platform LLM inference is indeed possible and share our experience. We develop a state-of-the-art paged attention kernel, the core performance-critical component of many LLM deployments, that builds exclusively on the domain-specific just-in-time compiled language Triton to achieve state-of-the-art performance on both NVIDIA and AMD GPUs. We describe our high-level approach, the key algorithmic and system-level improvements, the parameter auto-tuning required to unlock efficiency, and the integrations into a popular inference server that are necessary to bring the performance of a generic Triton attention kernel from 19.7% of the state-of-the-art to 105.9%. Our results highlight how open-source domain-specific languages can be leveraged to unlock model portability across different GPU vendors.
zh

[NLP-120] Decoupling Positional and Symbolic Attention Behavior in Transformers

【速读】: 该论文旨在解决Transformer模型中注意力头(attention heads)在处理位置信息(positional information)与符号信息(symbolic information)时的行为机制不明确的问题,特别是针对旋转位置编码(Rotary Positional Encoding, RoPE)如何通过不同频率成分实现这两种功能的理解。其解决方案的关键在于:首先,提出了一套理论定义来区分“位置行为”和“符号行为”,并证明二者互斥;其次,构建了一个量化指标用于评估每个注意力头的行为倾向;最后,通过设计纯位置任务和纯符号任务,验证了模型性能与注意力头能否利用相应频率(高频用于位置、低频用于语义)之间存在因果关系,从而揭示RoPE的有效性源于其对频率的分层利用机制。

链接: https://arxiv.org/abs/2511.11579
作者: Felipe Urrutia,Jorge Salas,Alexander Kozachinskiy,Cristian Buc Calderon,Hector Pasten,Cristobal Rojas
机构: University of Chile (智利大学); CENIA; Santiago, Chile; Faculty of mathematics UC (智利大学数学学院); IMC UC (智利大学信息与计算中心)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 32 pages, 12 figures, repository available

点击查看摘要

Abstract:An important aspect subtending language understanding and production is the ability to independently encode positional and symbolic information of the words within a sentence. In Transformers, positional information is typically encoded using Positional Encodings (PEs). One such popular PE, namely Rotary PE (RoPE), has been widely used due to its empirical success. Recently, it has been argued that part of RoPE’s success emerges from its ability to encode robust positional and semantic information using large and small frequencies, respectively. In this work, we perform a deeper dive into the positional versus symbolic dichotomy of attention heads behavior, both at the theoretical and empirical level. We provide general definitions of what it means for a head to behave positionally or symbolically, prove that these are two mutually exclusive behaviors and develop a metric to quantify them. We apply our framework to analyze Transformer-based LLMs using RoPE and find that all heads exhibit a strong correspondence between behavior and frequency use. Finally, we introduce canonical tasks designed to be either purely positional or symbolic, and demonstrate that the Transformer performance causally relates to the ability of attention heads to leverage the appropriate frequencies. In particular, we show that we can control the Transformer performance by controlling which frequencies the attention heads can access. Altogether, our work provides a detailed understanding of RoPE, and how its properties relate to model behavior.
zh

[NLP-121] LLM Architecture Scaling Laws and Economics: A Quick Summary

【速读】: 该论文旨在系统梳理当前大语言模型(Large Language Models, LLMs)的标准架构,特别是基于QKV自注意力机制的Transformer结构,并提供计算量(浮点运算次数,flops)与内存需求(参数加数据)的缩放定律,同时给出截至2025年不同规模LLM参数的大致成本估算,进一步讨论DeepSeek是否属于特殊案例。其解决方案的关键在于以简洁、整合的方式呈现这些通常分散于各研究中的技术事实和经济估算,填补了现有文献中缺乏此类综合总结的空白。

链接: https://arxiv.org/abs/2511.11572
作者: William H. Press
机构: 未知
类目: General Literature (cs.GL); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 pages, 3 figures

点击查看摘要

Abstract:The current standard architecture of Large Language Models (LLMs) with QKV self-attention is briefly summarized, including the architecture of a typical Transformer. Scaling laws for compute (flops) and memory (parameters plus data) are given, along with their present (2025) rough cost estimates for the parameters of present LLMs of various scales, including discussion of whether DeepSeek should be viewed as a special case. Nothing here is new, but this material seems not otherwise readily available in summary form.
zh

[NLP-122] DCRM: A Heuristic to Measure Response Pair Quality in Preference Optimization

【速读】: 该论文旨在解决偏好优化(Preference Optimization, PO)中训练数据质量对大语言模型(Large Language Models, LLMs)学习效果影响不明确的问题,尤其是现有偏好数据集中响应对(y⁺ 和 y⁻)之间的差异与模型期望学习的差异不匹配的现象。解决方案的关键在于提出一种新的度量指标——距离校准奖励边际(Distance Calibrated Reward Margin, DCRM),该指标通过量化响应对间的距离和奖励边际来评估其适配PO任务的质量,并据此设计了一种基于DCRM最优的“best-of-N²”配对方法,以筛选出具有更高DCRM值的响应对用于训练,从而显著提升模型在AlpacaEval、MT-Bench和Arena-Hard等评测基准上的性能表现。

链接: https://arxiv.org/abs/2506.14157
作者: Chengyu Huang,Tanya Goyal
机构: Cornell University (康奈尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent research has attempted to associate preference optimization (PO) performance with the underlying preference datasets. In this work, our observation is that the differences between the preferred response y^+ and dispreferred response y^- influence what LLMs can learn, which may not match the desirable differences to learn. Therefore, we use distance and reward margin to quantify these differences, and combine them to get Distance Calibrated Reward Margin (DCRM), a metric that measures the quality of a response pair for PO. Intuitively, DCRM encourages minimal noisy differences and maximal desired differences. With this, we study 3 types of commonly used preference datasets, classified along two axes: the source of the responses and the preference labeling function. We establish a general correlation between higher DCRM of the training set and better learning outcome. Inspired by this, we propose a best-of- N^2 pairing method that selects response pairs with the highest DCRM. Empirically, in various settings, our method produces training datasets that can further improve models’ performance on AlpacaEval, MT-Bench, and Arena-Hard over the existing training sets.
zh

[NLP-123] HAPO: Training Language Models to Reason Concisely via History-Aware Policy Optimization

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在测试时通过扩展输出长度以提升推理能力所带来的冗余输出和推理成本增加的问题。现有方法通常依赖全局预算约束或查询级长度优化,但未能利用训练过程中对同一问题的历史交互信息,从而限制了模型逐步生成更简洁解的能力。解决方案的关键在于提出一种历史感知策略优化方法(History-Aware Policy Optimization, HAPO),其核心机制是为每个问题维护一个历史状态(如先前正确响应的最短长度),并设计基于该历史状态的长度奖励函数,激励模型发现比以往更简洁的正确解,同时避免对较短但错误响应的过度惩罚,从而在保证准确性的前提下实现效率优化。

链接: https://arxiv.org/abs/2505.11225
作者: Chengyu Huang,Zhengxin Zhang,Claire Cardie
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:While scaling the length of responses at test-time has been shown to markedly improve the reasoning abilities and performance of large language models (LLMs), it often results in verbose outputs and increases inference cost. Prior approaches for efficient test-time scaling, typically using universal budget constraints or query-level length optimization, do not leverage historical information from previous encounters with the same problem during training. We hypothesize that this limits their ability to progressively make solutions more concise over time. To address this, we present History-Aware Policy Optimization (HAPO), which keeps track of a history state (e.g., the minimum length over previously generated correct responses) for each problem. HAPO employs a novel length reward function based on this history state to incentivize the discovery of correct solutions that are more concise than those previously found. Crucially, this reward structure avoids overly penalizing shorter incorrect responses with the goal of facilitating exploration towards more efficient solutions. By combining this length reward with a correctness reward, HAPO jointly optimizes for correctness and efficiency. We use HAPO to train DeepSeek-R1-Distill-Qwen-1.5B, DeepScaleR-1.5B-Preview, and Qwen-2.5-1.5B-Instruct, and evaluate HAPO on several math benchmarks that span various difficulty levels. Experiment results demonstrate that HAPO effectively induces LLMs’ concise reasoning abilities, producing length reductions of 33-59% with accuracy drops of only 2-5%.
zh

[NLP-124] VoiceCraft-X: Unifying Multilingual Voice-Cloning Speech Synthesis and Speech Editing CEC EMNLP2025

【速读】: 该论文旨在解决多语言语音编辑与零样本文本到语音(Text-to-Speech, TTS)合成任务在跨语言场景下的统一建模问题,尤其针对不同语言数据稀缺时的性能瓶颈。解决方案的关键在于提出 VoiceCraft-X——一个自回归神经编解码语言模型,通过引入 Qwen3 大语言模型实现无音素的跨语言文本处理,并设计了一种新颖的 token 重排序机制,将时间对齐的文本与语音 token 统一为单序列生成问题,从而在一个框架内无缝完成多语言语音编辑和零样本 TTS 合成,且在仅需少量每语言数据的情况下仍能保持鲁棒性能。

链接: https://arxiv.org/abs/2511.12347
作者: Zhisheng Zheng,Puyuan Peng,Anuj Diwan,Cong Phuoc Huynh,Xiaohang Sun,Zhu Liu,Vimal Bhat,David Harwath
机构: University of Texas at Austin (德克萨斯大学奥斯汀分校); Amazon(亚马逊)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: EMNLP 2025. Demo and code are available at this https URL

点击查看摘要

Abstract:We introduce VoiceCraft-X, an autoregressive neural codec language model which unifies multilingual speech editing and zero-shot Text-to-Speech (TTS) synthesis across 11 languages: English, Mandarin, Korean, Japanese, Spanish, French, German, Dutch, Italian, Portuguese, and Polish. VoiceCraft-X utilizes the Qwen3 large language model for phoneme-free cross-lingual text processing and a novel token reordering mechanism with time-aligned text and speech tokens to handle both tasks as a single sequence generation problem. The model generates high-quality, natural-sounding speech, seamlessly creating new audio or editing existing recordings within one framework. VoiceCraft-X shows robust performance in diverse linguistic settings, even with limited per-language data, underscoring the power of unified autoregressive approaches for advancing complex, real-world multilingual speech applications. Audio samples are available at this https URL.
zh

[NLP-125] How Far Do SSL Speech Models Listen for Tone? Temporal Focus of Tone Representation under Low-resource Transfer ICASSP2026

【速读】: 该论文旨在解决自监督学习(SSL)语音模型在非普通话语言中对声调(lexical tone)建模能力不足的问题,特别是在低资源条件下声调信息的提取与迁移机制尚不明确。其关键解决方案是通过在缅甸语、泰语、老挝语和越南语这四种具有复杂声调系统的语言上进行系统性实验,量化声调线索的时间跨度(约100 ms至180 ms),并利用探针(probes)和梯度分析揭示不同下游任务如何影响模型对声调信息的时序关注:自动语音识别(ASR)微调使模型聚焦于语言特定的声调时间范围,而韵律和语音相关任务则导致模型过度依赖过长的时间跨度,从而表明声调迁移效果受下游任务性质显著调控。

链接: https://arxiv.org/abs/2511.12285
作者: Minu Kim,Ji Sub Um,Hoirin Kim
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: 5 pages, 7 figures, submitted to ICASSP 2026

点击查看摘要

Abstract:Lexical tone is central to many languages but remains underexplored in self-supervised learning (SSL) speech models, especially beyond Mandarin. We study four languages with complex and diverse tone systems: Burmese, Thai, Lao, and Vietnamese, to examine how far such models listen for tone and how transfer operates in low-resource conditions. As a baseline reference, we estimate the temporal span of tone cues to be about 100 ms in Burmese and Thai, and about 180 ms in Lao and Vietnamese. Probes and gradient analyses on fine-tuned SSL models reveal that tone transfer varies by downstream task: automatic speech recognition fine-tuning aligns spans with language-specific tone cues, while prosody- and voice-related tasks bias the model toward overly long spans. These findings indicate that tone transfer is shaped by downstream task, highlighting task effects on temporal focus in tone modeling.
zh

计算机视觉

[CV-0] Back to Basics: Let Denoising Generative Models Denoise

【速读】:该论文旨在解决当前去噪扩散模型(denoising diffusion models)在生成高质量图像时存在的根本性局限问题:现有模型并非直接预测干净图像,而是通过预测噪声或带噪数据来间接实现生成,这在高维空间中可能导致性能崩溃。其解决方案的关键在于提出一种全新的建模范式——“仅图像Transformer”(Just image Transformers, JiT),即让神经网络直接学习从噪声到干净数据的映射,而非预测噪声本身。这一方法基于流形假设(manifold assumption),认为自然数据分布位于低维流形上,而带噪数据则偏离该流形;因此,直接预测干净数据可使容量受限的网络在高维像素空间中依然高效工作。实验表明,无需分词器、预训练或额外损失函数,仅使用大块(patch size=16/32)的纯Transformer架构即可在ImageNet 256×256和512×512分辨率下取得具有竞争力的结果,从而验证了该方法的有效性和简洁性。

链接: https://arxiv.org/abs/2511.13720
作者: Tianhong Li,Kaiming He
机构: MIT (麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Tech report. Code at this https URL

点击查看摘要

Abstract:Today’s denoising diffusion models do not “denoise” in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than " \textbfJust image Transformers ", or \textbfJiT , as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With our networks mapping back to the basics of the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.
zh

[CV-1] Scaling Spatial Intelligence with Multimodal Foundation Models

【速读】:该论文旨在解决多模态基础模型在空间智能(spatial intelligence)方面存在的显著不足问题。其解决方案的关键在于系统性地构建一个高质量、多样化的数据集 SenseNova-SI-8M,包含八百万条遵循严格空间能力分类体系的数据样本,并基于此对多模态基础模型进行训练与优化。通过这一策略,模型在多个空间智能基准测试中展现出卓越性能,同时保持了强大的通用多模态理解能力,验证了数据规模与多样性对激发涌现式泛化能力的重要性。

链接: https://arxiv.org/abs/2511.13719
作者: Zhongang Cai,Ruisi Wang,Chenyang Gu,Fanyi Pu,Junxiang Xu,Yubo Wang,Wanqi Yin,Zhitao Yang,Chen Wei,Qingping Sun,Tongxi Zhou,Jiaqi Li,Hui En Pang,Oscar Qian,Yukun Wei,Zhiqian Lin,Xuanke Shi,Kewang Deng,Xiaoyang Han,Zukai Chen,Xiangyu Fan,Hanming Deng,Lewei Lu,Liang Pan,Bo Li,Ziwei Liu,Quan Wang,Dahua Lin,Lei Yang
机构: SenseTime Research (商汤科技研究部); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Robotics (cs.RO)
备注: Model: this https URL Code: this https URL

点击查看摘要

Abstract:Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family, built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training, analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate the potential downstream application. SenseNova-SI is an ongoing project, and this report will be updated continuously. All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.
zh

[CV-2] Segment Anything Across Shots: A Method and Benchmark AAAI2026

【速读】:该论文旨在解决多镜头半监督视频目标分割(Multi-shot Semi-supervised Video Object Segmentation, MVOS)中因镜头切换导致的分割性能下降问题,现有方法主要针对单镜头视频设计,在跨镜头场景下表现不佳。其解决方案的关键在于提出两种核心创新:一是过渡模拟数据增强策略(Transition Mimicking Data Augmentation, TMA),通过模拟镜头间过渡来提升模型对跨镜头变化的泛化能力,缓解多镜头标注数据稀缺的问题;二是Segment Anything Across Shots (SAAS) 模型,能够有效检测并理解镜头切换,实现跨镜头的目标分割。实验表明,SAAS 在 YouMVOS 和新提出的 Cut-VOS 基准上均达到当前最优性能。

链接: https://arxiv.org/abs/2511.13715
作者: Hengrui Hu,Kaining Ying,Henghui Ding
机构: Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI 2026, Project Page: this https URL

点击查看摘要

Abstract:This work focuses on multi-shot semi-supervised video object segmentation (MVOS), which aims at segmenting the target object indicated by an initial mask throughout a video with multiple shots. The existing VOS methods mainly focus on single-shot videos and struggle with shot discontinuities, thereby limiting their real-world applicability. We propose a transition mimicking data augmentation strategy (TMA) which enables cross-shot generalization with single-shot data to alleviate the severe annotated multi-shot data sparsity, and the Segment Anything Across Shots (SAAS) model, which can detect and comprehend shot transitions effectively. To support evaluation and future study in MVOS, we introduce Cut-VOS, a new MVOS benchmark with dense mask annotations, diverse object categories, and high-frequency transitions. Extensive experiments on YouMVOS and Cut-VOS demonstrate that the proposed SAAS achieves state-of-the-art performance by effectively mimicking, understanding, and segmenting across complex transitions. The code and datasets are released at this https URL.
zh

[CV-3] UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity

【速读】:该论文旨在解决当前Segment Anything Model (SAM)家族在分割粒度控制方面的局限性问题,即用户需通过手动添加提示或从预生成掩码中选择来调整分割细节,这一过程存在歧义且效率低下,同时获取多粒度密集标注数据成本高昂,使得监督学习方案难以实施。解决方案的关键在于提出UnSAMv2,其核心创新包括:1)基于分治策略发现大量掩码-粒度配对;2)引入新颖的粒度控制嵌入(granularity control embedding),实现对分割尺度的精确、连续控制。实验表明,仅使用6K未标注图像和0.02%额外参数,UnSAMv2显著提升SAM-2在交互式、全图及视频分割任务中的性能,验证了小量无标签数据结合粒度感知自监督学习方法可有效释放视觉基础模型的潜力。

链接: https://arxiv.org/abs/2511.13714
作者: Junwei Yu,Trevor Darrell,XuDong Wang
机构: UC Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The Segment Anything Model (SAM) family has become a widely adopted vision foundation model, but its ability to control segmentation granularity remains limited. Users often need to refine results manually - by adding more prompts or selecting from pre-generated masks - to achieve the desired level of detail. This process can be ambiguous, as the same prompt may correspond to several plausible masks, and collecting dense annotations across all granularities is prohibitively expensive, making supervised solutions infeasible. To address this limitation, we introduce UnSAMv2, which enables segment anything at any granularity without human annotations. UnSAMv2 extends the divide-and-conquer strategy of UnSAM by discovering abundant mask-granularity pairs and introducing a novel granularity control embedding that enables precise, continuous control over segmentation scale. Remarkably, with only 6 K unlabeled images and 0.02% additional parameters, UnSAMv2 substantially enhances SAM-2, achieving segment anything at any granularity across interactive, whole-image, and video segmentation tasks. Evaluated on over 11 benchmarks, UnSAMv2 improves \textNoC_90 (5.69 \rightarrow 4.75), 1-IoU (58.0 \rightarrow 73.1), and \textAR_1000 (49.6 \rightarrow 68.3), showing that small amounts of unlabeled data with a granularity-aware self-supervised learning method can unlock the potential of vision foundation models.
zh

[CV-4] Free-Form Scene Editor: Enabling Multi-Round Object Manipulation like in a 3D Engine AAAI2026

【速读】:该论文旨在解决当前文本到图像(Text-to-Image, T2I)扩散模型在进行3D感知对象编辑时存在的局限性,即大多数方法无法实现物理一致的3D对象操作,且常依赖于图像空间处理或耗时且易出错的3D重建流程。其解决方案的关键在于提出一种名为FFSE(3D-aware autoregressive framework)的自回归框架,将编辑过程建模为一系列可学习的3D变换序列,从而支持用户对真实世界图像中的对象执行任意类型的几何变换(如平移、缩放和旋转),同时保持背景效果(如阴影、反射)的真实性与多轮编辑下的全局场景一致性。此外,为支撑多轮3D感知编辑的学习,作者构建了3DObjectEditor这一混合数据集,该数据集由多样化对象与场景中模拟的编辑序列组成,有效提升了模型在动态和多轮条件下的训练效果。

链接: https://arxiv.org/abs/2511.13713
作者: Xincheng Shuai,Zhenyuan Qin,Henghui Ding,Dacheng Tao
机构: 1. 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI 2026, Project Page: this https URL

点击查看摘要

Abstract:Recent advances in text-to-image (T2I) diffusion models have significantly improved semantic image editing, yet most methods fall short in performing 3D-aware object manipulation. In this work, we present FFSE, a 3D-aware autoregressive framework designed to enable intuitive, physically-consistent object editing directly on real-world images. Unlike previous approaches that either operate in image space or require slow and error-prone 3D reconstruction, FFSE models editing as a sequence of learned 3D transformations, allowing users to perform arbitrary manipulations, such as translation, scaling, and rotation, while preserving realistic background effects (e.g., shadows, reflections) and maintaining global scene consistency across multiple editing rounds. To support learning of multi-round 3D-aware object manipulation, we introduce 3DObjectEditor, a hybrid dataset constructed from simulated editing sequences across diverse objects and scenes, enabling effective training under multi-round and dynamic conditions. Extensive experiments show that the proposed FFSE significantly outperforms existing methods in both single-round and multi-round 3D-aware editing scenarios.
zh

[CV-5] ViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models

【速读】:该论文旨在解决当前视频生成模型在物理合理性与逻辑一致性方面的推理能力评估不足的问题。现有基准主要关注视觉保真度和时序连贯性,未能有效衡量模型的高阶推理能力。为此,作者提出TiViBench——一个分层基准,系统性地从结构推理(Structural Reasoning)、空间视觉模式推理(Spatial Visual Pattern Reasoning)、符号逻辑推理(Symbolic Logical Reasoning)和动作规划任务执行(Action Planning Task Execution)四个维度,覆盖24种任务场景与3个难度层级来评估图像到视频(I2V)生成模型的推理能力。解决方案的关键在于引入VideoTPO,这是一种基于大语言模型(LLM)自我分析的测试阶段优化策略,通过识别生成候选结果的优势与缺陷,在不依赖额外训练、数据或奖励模型的前提下显著提升视频生成模型的推理性能。

链接: https://arxiv.org/abs/2511.13704
作者: Harold Haodong Chen,Disen Lan,Wen-Jie Shu,Qingyang Liu,Zihan Wang,Sirui Chen,Wenkai Cheng,Kanghao Chen,Hongfei Zhang,Zixin Zhang,Rongjin Guo,Yu Cheng,Ying-Cong Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project: this https URL

点击查看摘要

Abstract:The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3’s chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning Search, ii) Spatial Visual Pattern Reasoning, iii) Symbolic Logical Reasoning, and iv) Action Planning Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning potential, while open-source models reveal untapped potential that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, setting a foundation for future research in this emerging field.
zh

[CV-6] raining-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting

【速读】:该论文旨在解决基于文本提示的3D场景重光照(text-guided relighting)问题,即如何在不重新训练模型的前提下,利用用户提供的自然语言指令精准控制3D场景中的光照方向、颜色、强度等属性,并生成高保真且多视角一致的重光照结果。解决方案的关键在于提出了一种无需训练的文本感知流水线GS-Light,其核心创新包括:首先通过大视觉语言模型(LVLM)解析文本提示获取光照先验;其次结合现成的几何与语义估计器(如深度图、法向量和语义分割)构建视图几何约束,从而计算出每个视角的光照映射并生成初始潜在编码(init latents);最后将这些精心设计的初始潜变量与多视角渲染图像一同输入扩散模型进行协同优化,实现高质量的多视角重光照生成,并进一步微调3D高斯泼溅(3DGS)场景以获得完整重光照的3D表示。

链接: https://arxiv.org/abs/2511.13684
作者: Jiangnan Ye,Jiedong Zhuang,Lianrui Mu,Wenjie Zheng,Jiaqi Hu,Xingze Zou,Jing Wang,Haoji Hu
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Submitting for Neurocomputing

点击查看摘要

Abstract:We introduce GS-Light, an efficient, textual position-aware pipeline for text-guided relighting of 3D scenes represented via Gaussian Splatting (3DGS). GS-Light implements a training-free extension of a single-input diffusion model to handle multi-view inputs. Given a user prompt that may specify lighting direction, color, intensity, or reference objects, we employ a large vision-language model (LVLM) to parse the prompt into lighting priors. Using off-the-shelf estimators for geometry and semantics (depth, surface normals, and semantic segmentation), we fuse these lighting priors with view-geometry constraints to compute illumination maps and generate initial latent codes for each view. These meticulously derived init latents guide the diffusion model to generate relighting outputs that more accurately reflect user expectations, especially in terms of lighting direction. By feeding multi-view rendered images, along with the init latents, into our multi-view relighting model, we produce high-fidelity, artistically relit images. Finally, we fine-tune the 3DGS scene with the relit appearance to obtain a fully relit 3D scene. We evaluate GS-Light on both indoor and outdoor scenes, comparing it to state-of-the-art baselines including per-view relighting, video relighting, and scene editing methods. Using quantitative metrics (multi-view consistency, imaging quality, aesthetic score, semantic similarity, etc.) and qualitative assessment (user studies), GS-Light demonstrates consistent improvements over baselines. Code and assets will be made available upon publication.
zh

[CV-7] QUILL: An Algorithm-Architecture Co-Design for Cache-Local Deformable Attention DATE2026

【速读】:该论文旨在解决可变形注意力机制(deformable attention)在硬件部署中面临的两大挑战:不规则内存访问模式和低算术强度(arithmetic intensity),这导致其在实际加速器上性能不佳。解决方案的关键在于提出一种调度感知的加速器架构 QUILL,其核心创新是基于距离的乱序查询(Distance-based Out-of-Order Querying, DOOQ),通过按空间邻近性排序查询并引入前瞻预取机制,将内存访问转化为缓存友好的单遍遍历;同时设计了一个融合的 MSDeformAttn 引擎,在单次遍历中完成插值、Softmax、聚合及最终投影(W’'m)操作,避免中间结果溢出,并结合片上小张量存储与集成 GEMM 单元处理周围密集层,从而实现高吞吐和高能效。

链接: https://arxiv.org/abs/2511.13679
作者: Hyunwoo Oh,Hanning Chen,Sanggeon Yun,Yang Ni,Wenjun Huang,Tamoghno Das,Suyeon Jang,Mohsen Imani
机构: 未知
类目: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to DATE 2026

点击查看摘要

Abstract:Deformable transformers deliver state-of-the-art detection but map poorly to hardware due to irregular memory access and low arithmetic intensity. We introduce QUILL, a schedule-aware accelerator that turns deformable attention into cache-friendly, single-pass work. At its core, Distance-based Out-of-Order Querying (DOOQ) orders queries by spatial proximity; the look-ahead drives a region prefetch into an alternate buffer–forming a schedule-aware prefetch loop that overlaps memory and compute. A fused MSDeformAttn engine executes interpolation, Softmax, aggregation, and the final projection (W’'m) in one pass without spilling intermediates, while small tensors are kept on-chip and surrounding dense layers run on integrated GEMMs. Implemented as RTL and evaluated end-to-end, QUILL achieves up to 7.29x higher throughput and 47.3x better energy efficiency than an RTX 4090, and exceeds prior accelerators by 3.26-9.82x in throughput and 2.01-6.07x in energy efficiency. With mixed-precision quantization, accuracy tracks FP32 within =0.9 AP across Deformable and Sparse DETR variants. By converting sparsity into locality–and locality into utilization–QUILL delivers consistent, end-to-end speedups.
zh

[CV-8] OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation

【速读】:该论文旨在解决地球观测(Earth Observation, EO)数据在多模态、时空特性上的建模难题,即如何有效利用具有空间图像特征、时间序列特征以及多源异构信息的复杂EO数据。其解决方案的关键在于提出OlmoEarth——一个专为地球观测领域设计的多模态时空基础模型,通过创新的自监督学习范式、掩码策略和损失函数,实现对EO数据的高效表征与泛化能力提升。实验表明,OlmoEarth在多个基准任务和真实场景中均达到领先性能,且已部署为端到端平台,赋能非营利组织与NGO开展地球观测模型的开发与应用。

链接: https://arxiv.org/abs/2511.13655
作者: Henry Herzog,Favyen Bastani,Yawen Zhang,Gabriel Tseng,Joseph Redmon,Hadrien Sablon,Ryan Park,Jacob Morrison,Alexandra Buraczynski,Karen Farley,Joshua Hansen,Andrew Howe,Patrick Alan Johnson,Mark Otterlee,Ted Schmitt,Hunter Pitelka,Stephen Daspit,Rachel Ratner,Christopher Wilhelm,Sebastian Wood,Mike Jacobi,Hannah Kerner,Evan Shelhamer,Ali Farhadi,Ranjay Krishna,Patrick Beukema
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Earth observation data presents a unique challenge: it is spatial like images, sequential like video or text, and highly multimodal. We present OlmoEarth: a multimodal, spatio-temporal foundation model that employs a novel self-supervised learning formulation, masking strategy, and loss all designed for the Earth observation domain. OlmoEarth achieves state-of-the-art performance compared to 12 other foundation models across a variety of research benchmarks and real-world tasks from external partners. When evaluating embeddings OlmoEarth achieves the best performance on 15 out of 24 tasks, and with full fine-tuning it is the best on 19 of 29 tasks. We deploy OlmoEarth as the backbone of an end-to-end platform for data collection, labeling, training, and inference of Earth observation models. The OlmoEarth Platform puts frontier foundation models and powerful data management tools into the hands of non-profits and NGOs working to solve the world’s biggest problems. OlmoEarth source code, training data, and pre-trained weights are available at \hrefthis https URL\textthis https URL .
zh

[CV-9] uning for Two Adversaries: Enhancing the Robustness Against Transfer and Query-Based Attacks using Hyperparameter Tuning AAAI

【速读】:该论文旨在解决优化超参数(如学习率、权重衰减、动量和批量大小)对模型在面对迁移攻击(transfer-based attacks)和查询攻击(query-based attacks)时鲁棒性影响的不明确问题。解决方案的关键在于揭示了两类攻击下学习率调控的显著差异:对于迁移攻击,降低学习率可提升鲁棒性高达64%;而对于查询攻击,提高学习率则能带来最高达28%的鲁棒性增强。基于此发现,论文首次系统探索了优化超参数的设计空间,提出通过针对性调整学习率等超参数,在分布式训练等场景中实现对两类攻击的协同防御,从而在不同训练设置下获得更优的鲁棒性权衡。

链接: https://arxiv.org/abs/2511.13654
作者: Pascal Zimmer,Ghassan Karame
机构: 未知
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: To appear in the Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) 2026

点击查看摘要

Abstract:In this paper, we present the first detailed analysis of how optimization hyperparameters – such as learning rate, weight decay, momentum, and batch size – influence robustness against both transfer-based and query-based attacks. Supported by theory and experiments, our study spans a variety of practical deployment settings, including centralized training, ensemble learning, and distributed training. We uncover a striking dichotomy: for transfer-based attacks, decreasing the learning rate significantly enhances robustness by up to 64% . In contrast, for query-based attacks, increasing the learning rate consistently leads to improved robustness by up to 28% across various settings and data distributions. Leveraging these findings, we explore – for the first time – the optimization hyperparameter design space to jointly enhance robustness against both transfer-based and query-based attacks. Our results reveal that distributed models benefit the most from hyperparameter tuning, achieving a remarkable tradeoff by simultaneously mitigating both attack types more effectively than other training setups.
zh

[CV-10] Distribution Matching Distillation Meets Reinforcement Learning ATC

【速读】:该论文旨在解决扩散模型蒸馏(Distribution Matching Distillation, DMD)中少步数生成器性能受限于多步教师模型的问题,即传统蒸馏方法难以充分释放少步生成器的潜力。解决方案的关键在于提出DMDR框架,将强化学习(Reinforcement Learning, RL)引入蒸馏过程:一方面,DMD损失本身作为正则化项比传统方法更有效地指导少步生成器的RL训练;另一方面,RL能够更有效地引导模式覆盖(mode coverage)过程,从而在蒸馏与RL协同优化下显著提升少步生成器的能力。此外,通过动态分布引导和动态重噪声采样策略进一步优化初始蒸馏阶段,最终实现视觉质量领先且超越原多步教师模型的表现。

链接: https://arxiv.org/abs/2511.13649
作者: Dengyang Jiang,Dongyang Liu,Zanyi Wang,Qilong Wu,Xin Jin,David Liu,Zhen Li,Mengmeng Wang,Peng Gao,Harry Yang
机构: The Hong Kong University of Science and Technology (香港科技大学); Alibaba Group (阿里巴巴集团); Zhejiang University of Technology (浙江工业大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The synergy of reinforcement learning and distribution matching distillation. See more: this https URL

点击查看摘要

Abstract:Distribution Matching Distillation (DMD) distills a pre-trained multi-step diffusion model to a few-step one to improve inference efficiency. However, the performance of the latter is often capped by the former. To circumvent this dilemma, we propose DMDR, a novel framework that combines Reinforcement Learning (RL) techniques into the distillation process. We show that for the RL of the few-step generator, the DMD loss itself is a more effective regularization compared to the traditional ones. In turn, RL can help to guide the mode coverage process in DMD more effectively. These allow us to unlock the capacity of the few-step generator by conducting distillation and RL simultaneously. Meanwhile, we design the dynamic distribution guidance and dynamic renoise sampling training strategies to improve the initial distillation process. The experiments demonstrate that DMDR can achieve leading visual quality, prompt coherence among few-step methods, and even exhibit performance that exceeds the multi-step teacher.
zh

[CV-11] PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image

【速读】:该论文旨在解决当前3D生成方法普遍忽视物理属性和关节结构的问题,从而限制其在具身智能(embodied AI)中的应用。现有方法生成的3D资产多为静态视觉模型,难以直接用于物理仿真或交互任务。解决方案的关键在于提出首个面向仿真的物理3D生成框架PhysX-Anything,其核心创新包括:1)设计首个基于视觉语言模型(VLM)的物理3D生成模型,并引入一种高效的3D表示方式,将几何信息token化后减少至原规模的1/193,从而在标准VLM token预算内实现显式几何学习,且无需额外特殊token;2)构建新的物理3D数据集PhysX-Mobility,涵盖超过2倍于已有数据集的对象类别,包含2000余个常见现实物体及其丰富的物理标注,显著提升训练多样性与泛化能力。实验表明,该框架可生成高质量、可直接用于接触密集型机器人策略学习的仿真就绪3D资产,有效推动了具身智能与物理模拟的落地应用。

链接: https://arxiv.org/abs/2511.13648
作者: Ziang Cao,Fangzhou Hong,Zhaoxi Chen,Liang Pan,Ziwei Liu
机构: S-Lab, Nanyang Technological University (南洋理工大学); Shanghai AI Lab
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Project page: this https URL

点击查看摘要

Abstract:3D modeling is shifting from static visual representations toward physical, articulated assets that can be directly used in simulation and interaction. However, most existing 3D generation methods overlook key physical and articulation properties, thereby limiting their utility in embodied AI. To bridge this gap, we introduce PhysX-Anything, the first simulation-ready physical 3D generative framework that, given a single in-the-wild image, produces high-quality sim-ready 3D assets with explicit geometry, articulation, and physical attributes. Specifically, we propose the first VLM-based physical 3D generative model, along with a new 3D representation that efficiently tokenizes geometry. It reduces the number of tokens by 193x, enabling explicit geometry learning within standard VLM token budgets without introducing any special tokens during fine-tuning and significantly improving generative quality. In addition, to overcome the limited diversity of existing physical 3D datasets, we construct a new dataset, PhysX-Mobility, which expands the object categories in prior physical 3D datasets by over 2x and includes more than 2K common real-world objects with rich physical annotations. Extensive experiments on PhysX-Mobility and in-the-wild images demonstrate that PhysX-Anything delivers strong generative performance and robust generalization. Furthermore, simulation-based experiments in a MuJoCo-style environment validate that our sim-ready assets can be directly used for contact-rich robotic policy learning. We believe PhysX-Anything can substantially empower a broad range of downstream applications, especially in embodied AI and physics-based simulation.
zh

[CV-12] Part-X-MLLM : Part-aware 3D Multimodal Large Language Model

【速读】:该论文旨在解决3D多模态任务中缺乏统一表示与可控生成接口的问题,即如何将多样化的3D任务(如分割、描述、编辑)整合到一个结构化、可执行的框架中,从而实现端到端的语义驱动几何操作。解决方案的关键在于提出Part-X-MLLM,这是一种原生3D多模态大语言模型,通过将3D任务建模为结构化可执行语法中的程序,使模型能够从RGB点云和自然语言提示中自回归生成包含部件级边界框、语义描述及编辑命令的单一连贯token序列;这种结构化输出作为通用接口,可驱动下游几何感知模块进行基于部件的生成与编辑,同时通过符号规划(symbolic planning)与几何合成(geometric synthesis)的解耦设计,支持任意兼容的几何引擎通过统一的语言前端进行控制。

链接: https://arxiv.org/abs/2511.13647
作者: Chunshi Wang,Junliang Ye,Yunhan Yang,Yang Li,Zizhuo Lin,Jun Zhu,Zhuo Chen,Yawei Luo,Chunchao Guo
机构: Zhejiang University (浙江大学); Tencent Hunyuan (腾讯混元); Tsinghua University (清华大学); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce Part-X-MLLM, a native 3D multimodal large language model that unifies diverse 3D tasks by formulating them as programs in a structured, executable grammar. Given an RGB point cloud and a natural language prompt, our model autoregressively generates a single, coherent token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands. This structured output serves as a versatile interface to drive downstream geometry-aware modules for part-based generation and editing. By decoupling the symbolic planning from the geometric synthesis, our approach allows any compatible geometry engine to be controlled through a single, language-native frontend. We pre-train a dual-encoder architecture to disentangle structure from semantics and instruction-tune the model on a large-scale, part-centric dataset. Experiments demonstrate that our model excels at producing high-quality, structured plans, enabling state-of-the-art performance in grounded Q\A, compositional generation, and localized editing through one unified interface. Project page: this https URL
zh

[CV-13] CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding

【速读】:该论文旨在解决长视频问答(Long-form Video Question Answering, VQA)中视觉语言模型(Vision-Language Models, VLMs)因注意力机制和键值(Key-Value, KV)缓存随运行时间增长而导致的计算效率低下问题,即在保持上下文感知能力的同时难以实现高效推理。解决方案的关键在于提出一种无需训练的流水线方法 CacheFlow,其核心创新包括:1)动态令牌剪枝(Dynamic Token Dropping, DTD),通过帧间余弦相似度在线剔除冗余图像块令牌,保留关键信息;2)压缩式长期记忆机制,将剩余令牌分块并用轻量级循环编码器生成检索索引,同时将完整 KV 对离线存储并在生成时重新加载,从而兼顾推理效率与答案保真度;3)基于共识的检索机制,在推理阶段仅访问最相关的 Top-K 块,结合局部上下文进行精准的长程推理。该方案实现了对现有 VLM 架构的无侵入式增强,且无需微调即可显著减少 token 处理量(最多达 87%),有效支持实时流媒体场景下的长视频理解任务。

链接: https://arxiv.org/abs/2511.13644
作者: Shrenik Patel,Daivik Patel
机构: Rutgers University (罗格斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Long-form video question answering (VQA) overwhelms current vision-language models (VLMs) because attention and key-value (KV) caches grow with runtime, forcing either expensive inference or near-sighted sliding windows. We introduce CacheFlow, a training-free pipeline that pairs Dynamic Token Dropping (DTD) with a compressive long-term memory. DTD prunes per-patch tokens online via cosine similarity to the previous frame, and surviving tokens are packed into fixed-size blocks. This online, per-frame processing makes our approach fundamentally suited for live streaming VQA. As blocks are processed, each one’s keys are summarized by a tiny recurrent encoder to form a retrieval index, while the block’s full KV pairs are offloaded and later rehydrated for generation, preserving answer fidelity. At inference, a consensus-based retrieval mechanism retrieves only the Top-K most relevant blocks and attends over both the retrieved and local context for precise, long-range reasoning. CacheFlow is drop-in, architecture-agnostic, and requires no fine-tuning. Experiments on both offline and streaming VQA benchmarks demonstrate that CacheFlow outperforms current strong baselines, while processing up to 87% less tokens. Our dual approach enables VLMs to be both efficient and context-aware, paving the way for practical long-form video understanding.
zh

[CV-14] Alpha Divergence Losses for Biometric Verification

【速读】:该论文旨在解决当前人脸和说话人验证任务中,基于角度边距(angular margin)的损失函数(如CosFace和ArcFace)与α-散度损失函数之间难以融合的问题。尽管α-散度损失能诱导稀疏解(当α < 1时),但其天然不包含对角度边距的显式建模机制,而边距对验证任务至关重要。解决方案的关键在于提出两种新的路径:一是将边距引入参考测度(即先验概率)中,得到Q-Margin;二是将边距施加于logits(未归一化的对数似然)中,得到A3M。特别地,作者识别出A3M训练中的不稳定问题——源于惩罚后的logits与稀疏性之间的相互作用,并通过一种简单有效的原型重初始化策略加以解决。实验表明,所提方法在IJB-B、IJB-C人脸验证和VoxCeleb说话人验证数据集上均显著优于强基线模型,尤其在低误接受率(FAR)下表现优异,满足高安全性场景(如银行认证)的实际需求。

链接: https://arxiv.org/abs/2511.13621
作者: Dimitrios Koutsianos,Ladislav Mosner,Yannis Panagakis,Themos Stafylakis
机构: Athens University of Economics and Business (雅典经济与商业大学); Archimedes/Athena RC (阿基米德/雅典研究中心); Brno University of Technology (布杰约维采理工大学); National and Kapodistrian University of Athens (国家和卡波迪斯特里安雅典大学); Omilia - Conversational Intelligence (Omilia - 对话智能)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Performance in face and speaker verification is largely driven by margin based softmax losses like CosFace and ArcFace. Recently introduced \alpha -divergence loss functions offer a compelling alternative, particularly for their ability to induce sparse solutions (when \alpha1 ). However, integrating an angular margin-crucial for verification tasks-is not straightforward. We find this integration can be achieved in at least two distinct ways: via the reference measure (prior probabilities) or via the logits (unnormalized log-likelihoods). In this paper, we explore both pathways, deriving two novel margin-based \alpha -divergence losses: Q-Margin (margin in the reference measure) and A3M (margin in the logits). We identify and address a critical training instability in A3M-caused by the interplay of penalized logits and sparsity-with a simple yet effective prototype re-initialization strategy. Our methods achieve significant performance gains on the challenging IJB-B and IJB-C face verification benchmarks. We demonstrate similarly strong performance in speaker verification on VoxCeleb. Crucially, our models significantly outperform strong baselines at low false acceptance rates (FAR). This capability is crucial for practical high-security applications, such as banking authentication, when minimizing false authentications is paramount.
zh

[CV-15] A Real-Time Driver Drowsiness Detection System Using MediaPipe and Eye Aspect Ratio

【速读】:该论文旨在解决因驾驶员疲劳(Driver Fatigue)导致的道路交通事故问题,通过开发一种基于视觉的驾驶员困倦检测系统(Driver Drowsiness Detection System),提升道路行车安全。其解决方案的关键在于利用标准网络摄像头结合MediaPipe的Face Mesh框架实时识别面部关键点,并采用眼睑纵横比(Eye Aspect Ratio, EAR)方法精确监测眨眼频率与闭眼时长,从而判断驾驶员是否处于困倦状态;当检测到异常行为时,系统通过声音警报及时提醒驾驶员,整个过程依托OpenCV实现高效图像处理,具备高精度、低延迟和低成本的优势,可集成至当前的高级驾驶辅助系统(Advanced Driving Assistance System, ADAS)中。

链接: https://arxiv.org/abs/2511.13618
作者: Ashlesha G. Sawant,Shreyash S. Kamble,Raj S. Kanade,Raunak N. Kanugo,Tanishq A. Kapse,Karan A. Bhapse
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 8 referenced papers

点击查看摘要

Abstract:One of the major causes of road accidents is driver fatigue that causes thousands of fatalities and injuries every year. This study shows development of a Driver Drowsiness Detection System meant to improve the safety of the road by alerting drivers who are showing signs of being drowsy. The system is based on a standard webcam that tracks the facial features of the driver with the main emphasis on the examination of eye movements that can be conducted with the help of the Eye Aspect Ratio (EAR) method. The Face Mesh by MediaPipe is a lightweight framework that can identify facial landmarks with high accuracy and efficiency, which is considered to be important in real time use. The system detects the moments of long eye shutdowns or a very low rate of blinking which are manifestations of drowsiness and alerts the driver through sound to get her attention back. This system achieves a high-performance and low-cost driver monitoring solution with the help of the computational power of OpenCV to process the image and the MediaPipe to identify faces. Test data experimental analyses indicate that the system is very accurate and responds quicker; this confirms that it can be a component of the current Advanced Driving Assistance System (ADAS).
zh

[CV-16] ssue Aware Nuclei Detection and Classification Model for Histopathology Images

【速读】:该论文旨在解决计算病理学中细胞核检测与分类任务面临的两大挑战:一是现有方法高度依赖详尽的专家标注,二是缺乏对组织上下文信息的有效利用。其解决方案的关键在于提出了一种新型框架TAND(Tissue-Aware Nuclei Detection),通过点级监督结合组织掩膜条件化,实现细胞核的联合检测与分类。该框架将基于ConvNeXt的编码器-解码器结构与一个冻结的Virchow-2组织分割分支耦合,利用一种新颖的多尺度空间特征逐线性调制(Spatial-FiLM)机制,使语义组织概率在分类流中选择性地调节特征表示,从而显著提升对依赖组织类型的细胞类型(如上皮、内皮和基质细胞)的识别性能。

链接: https://arxiv.org/abs/2511.13615
作者: Kesi Xu,Eleni Chiou,Ali Varamesh,Laura Acqualagna,Nasir Rajpoot
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 3 figures. Under review

点击查看摘要

Abstract:Accurate nuclei detection and classification are fundamental to computational pathology, yet existing approaches are hindered by reliance on detailed expert annotations and insufficient use of tissue context. We present Tissue-Aware Nuclei Detection (TAND), a novel framework achieving joint nuclei detection and classification using point-level supervision enhanced by tissue mask conditioning. TAND couples a ConvNeXt-based encoder-decoder with a frozen Virchow-2 tissue segmentation branch, where semantic tissue probabilities selectively modulate the classification stream through a novel multi-scale Spatial Feature-wise Linear Modulation (Spatial-FiLM). On the PUMA benchmark, TAND achieves state-of-the-art performance, surpassing both tissue-agnostic baselines and mask-supervised methods. Notably, our approach demonstrates remarkable improvements in tissue-dependent cell types such as epithelium, endothelium, and stroma. To the best of our knowledge, this is the first method to condition per-cell classification on learned tissue masks, offering a practical pathway to reduce annotation burden.
zh

[CV-17] AtlasMorph: Learning conditional deformable templates for brain MRI

【速读】:该论文旨在解决医学图像分析中模板(atlas)构建效率低、代表性不足的问题,尤其是在群体内部存在较大变异时,现有模板往往无法准确反映目标人群的解剖特征。其解决方案的关键在于提出一种基于卷积注册神经网络(convolutional registration neural networks)的机器学习框架,通过学习从个体属性(如年龄、性别)到模板的映射函数,实现条件化模板生成;同时利用可用的分割结果增强模板的解剖标签信息,从而生成更具代表性的高质量模板,并支持对个体图像的高效配准。

链接: https://arxiv.org/abs/2511.13609
作者: Marianne Rakic,Andrew Hoopes,S. Mazdak Abulnaga,Mert R. Sabuncu,John V. Guttag,Adrian V. Dalca
机构: CSAIL MIT (麻省理工学院计算机科学与人工智能实验室); MGH (马萨诸塞州总医院); HMS (哈佛医学院); Cornell Tech (康奈尔技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deformable templates, or atlases, are images that represent a prototypical anatomy for a population, and are often enhanced with probabilistic anatomical label maps. They are commonly used in medical image analysis for population studies and computational anatomy tasks such as registration and segmentation. Because developing a template is a computationally expensive process, relatively few templates are available. As a result, analysis is often conducted with sub-optimal templates that are not truly representative of the study population, especially when there are large variations within this population. We propose a machine learning framework that uses convolutional registration neural networks to efficiently learn a function that outputs templates conditioned on subject-specific attributes, such as age and sex. We also leverage segmentations, when available, to produce anatomical segmentation maps for the resulting templates. The learned network can also be used to register subject images to the templates. We demonstrate our method on a compilation of 3D brain MRI datasets, and show that it can learn high-quality templates that are representative of populations. We find that annotated conditional templates enable better registration than their unlabeled unconditional counterparts, and outperform other templates construction methods.
zh

[CV-18] ICLR: Inter-Chrominance and Luminance Interaction for Natural Color Restoration in Low-Light Image Enhancement AAAI-26

【速读】:该论文旨在解决低光照图像增强(Low-Light Image Enhancement, LLIE)任务中因亮度(luminance)与色度(chrominance)分支间分布差异显著,导致互补特征提取受限、亮度误差通过非线性参数传播至色度通道,以及在大均质色区域中色度分支相关性弱而引发的梯度冲突问题。解决方案的关键在于提出一种交互式亮度与色度增强框架(Inter-Chrominance and Luminance Interaction, ICLR),其核心组件包括双流交互增强模块(Dual-stream Interaction Enhancement Module, DIEM)和协方差校正损失(Covariance Correction Loss, CCL):DIEM通过融合与增强两个维度提升跨分支互补信息提取能力,CCL则利用亮度残差统计量惩罚色度误差,并通过约束色度分支协方差来平衡梯度冲突,从而实现更稳定且高质量的增强效果。

链接: https://arxiv.org/abs/2511.13607
作者: Xin Xu,Hao Liu,Wei Liu,Wei Wang,Jiayi Wu,Kui Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI-26

点击查看摘要

Abstract:Low-Light Image Enhancement (LLIE) task aims at improving contrast while restoring details and textures for images captured in low-light conditions. HVI color space has made significant progress in this task by enabling precise decoupling of chrominance and luminance. However, for the interaction of chrominance and luminance branches, substantial distributional differences between the two branches prevalent in natural images limit complementary feature extraction, and luminance errors are propagated to chrominance channels through the nonlinear parameter. Furthermore, for interaction between different chrominance branches, images with large homogeneous-color regions usually exhibit weak correlation between chrominance branches due to concentrated distributions. Traditional pixel-wise losses exploit strong inter-branch correlations for co-optimization, causing gradient conflicts in weakly correlated regions. Therefore, we propose an Inter-Chrominance and Luminance Interaction (ICLR) framework including a Dual-stream Interaction Enhancement Module (DIEM) and a Covariance Correction Loss (CCL). The DIEM improves the extraction of complementary information from two dimensions, fusion and enhancement, respectively. The CCL utilizes luminance residual statistics to penalize chrominance errors and balances gradient conflicts by constraining chrominance branches covariance. Experimental results on multiple datasets show that the proposed ICLR framework outperforms state-of-the-art methods.
zh

[CV-19] VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping

【速读】:该论文旨在解决视觉自回归(Visual Autoregressive, VAR)生成模型在推理过程中因逐token预测机制导致的高延迟问题。传统推测解码(Speculative Decoding, SD)虽能加速生成,但其“草拟一步、验证一步”的范式无法直接减少目标模型的前向传播次数,从而限制了加速潜力。解决方案的关键在于首次探索在VAR模型的SD流程中跳过验证步骤(verification skipping),通过引入部分验证跳过策略来显式减少目标模型的前向计算次数。具体而言,提出VVS框架,集成三个互补模块:(1) 基于动态截断的无验证token选择器,(2) token级特征缓存与复用机制,(3) 细粒度跳过步调度策略,从而在保持生成质量的同时将目标模型前向传播次数降低2.8倍,显著优于传统SD方法,实现了更优的速度-质量权衡。

链接: https://arxiv.org/abs/2511.13587
作者: Haotian Dong,Ye Li,Rongwei Lu,Chen Tang,Shu-Tao Xia,Zhi Wang
机构: Tsinghua University (清华大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Visual autoregressive (AR) generation models have demonstrated strong potential for image generation, yet their next-token-prediction paradigm introduces considerable inference latency. Although speculative decoding (SD) has been proven effective for accelerating visual AR models, its “draft one step, then verify one step” paradigm prevents a direct reduction of the forward passes, thus restricting acceleration potential. Motivated by the visual token interchangeability, we for the first time to explore verification skipping in the SD process of visual AR model generation to explicitly cut the number of target model forward passes, thereby reducing inference latency. Based on an analysis of the drafting stage’s characteristics, we observe that verification redundancy and stale feature reusability are key factors to retain generation quality and speedup for verification-free steps. Inspired by these two observations, we propose a novel SD framework VVS to accelerate visual AR generation via partial verification skipping, which integrates three complementary modules: (1) a verification-free token selector with dynamical truncation, (2) token-level feature caching and reuse, and (3) fine-grained skipped step scheduling. Consequently, VVS reduces the number of target model forward passes by a factor of 2.8\times relative to vanilla AR decoding while maintaining competitive generation quality, offering a superior speed-quality trade-off over conventional SD frameworks and revealing strong potential to reshape the SD paradigm.
zh

[CV-20] Adaptive Multi-Scale Integration Unlocks Robust Cell Annotation in Histopathology Images

【速读】:该论文旨在解决常规组织病理学图像中细胞类型与亚型识别的挑战,特别是现有基于图像块(tile-based)的模型难以融合影响细胞功能和身份的组织微环境上下文,且人类标注通常粒度粗、分布不均,导致细粒度亚型监督难以获取的问题。其解决方案的关键在于提出NuClass框架,通过双路径结构实现细胞级别的多尺度信息整合:Path local专注于224×224像素区域的核形态特征提取,Path global则建模1024×1024像素邻域的微环境上下文;引入可学习门控模块动态平衡局部细节与上下文线索,并设计不确定性引导的目标函数促使全局路径在局部路径不确定区域优先关注,从而增强互补学习效果;此外,利用Xenium空间转录组数据构建了百万级单细胞分辨率标注数据集,显著提升了标注质量与覆盖度,最终在三个独立验证队列上实现了高达96%的F1分数,验证了多尺度、不确定性感知融合策略在桥接切片级病理基础模型与可靠细胞表型预测之间的有效性。

链接: https://arxiv.org/abs/2511.13586
作者: Yinuo Xu,Yan Cui,Mingyao Li,Zhi Huang
机构: University of Pennsylvania (宾夕法尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Identifying cell types and subtypes from routine histopathology images is essential for improving the computational understanding of human disease. Existing tile-based models can capture detailed nuclear morphology but often fail to incorporate the broader tissue context that influences a cell’s function and identity. In addition, available human annotations are typically coarse-grained and unevenly distributed across studies, making fine-grained subtype-level supervision difficult to obtain. To address these limitations, we introduce NuClass, a pathologist workflow inspired framework for cell-wise multi-scale integration of nuclear morphology and microenvironmental context. NuClass includes two main components: Path local, which focuses on nuclear morphology from 224-by-224 pixel crops, and Path global, which models the surrounding 1024-by-1024 pixel neighborhood. A learnable gating module adaptively balances local detail and contextual cues. To encourage complementary learning, we incorporate an uncertainty-guided objective that directs the global path to prioritize regions where the local path is uncertain. We also provide calibrated confidence estimates and Grad-CAM visualizations to enhance interpretability. To overcome the lack of high-quality annotations, we construct a marker-guided dataset from Xenium spatial transcriptomics assays, yielding single-cell resolution labels for more than two million cells across eight organs and 16 classes. Evaluated on three fully held-out cohorts, NuClass achieves up to 96 percent F1 for its best-performing class, outperforming strong baselines. Our results show that multi-scale, uncertainty-aware fusion can bridge the gap between slide-level pathological foundation models and reliable, cell-level phenotype prediction. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2511.13586 [cs.CV] (or arXiv:2511.13586v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2511.13586 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-21] Hierarchical Prompt Learning for Image- and Text-Based Person Re-Identification AAAI2026

【速读】:该论文旨在解决行人重识别(Person Re-identification, ReID)中图像到图像(I2I)和文本到图像(T2I)两种检索任务因独立建模而导致的表征纠缠与性能次优问题。其解决方案的关键在于提出一种统一框架——分层提示学习(Hierarchical Prompt Learning, HPL),通过任务感知的提示建模实现双任务联合优化:首先设计任务路由Transformer(Task-Routed Transformer),在共享视觉编码器中引入双分类令牌以分别引导I2I和T2I分支;其次构建分层提示生成机制,融合身份级可学习令牌与实例级伪文本令牌(由模态特定反演网络从图像或文本特征中提取),注入细粒度实例语义;最后引入跨模态提示正则化策略(Cross-Modal Prompt Regularization),在提示token空间中强制语义对齐,确保伪提示保留源模态特性并提升跨模态迁移能力。

链接: https://arxiv.org/abs/2511.13575
作者: Linhan Zhou,Shuang Li,Neng Dong,Yonghang Tai,Yafei Zhang,Huafeng Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures, accepted by AAAI 2026

点击查看摘要

Abstract:Person re-identification (ReID) aims to retrieve target pedestrian images given either visual queries (image-to-image, I2I) or textual descriptions (text-to-image, T2I). Although both tasks share a common retrieval objective, they pose distinct challenges: I2I emphasizes discriminative identity learning, while T2I requires accurate cross-modal semantic alignment. Existing methods often treat these tasks separately, which may lead to representation entanglement and suboptimal performance. To address this, we propose a unified framework named Hierarchical Prompt Learning (HPL), which leverages task-aware prompt modeling to jointly optimize both tasks. Specifically, we first introduce a Task-Routed Transformer, which incorporates dual classification tokens into a shared visual encoder to route features for I2I and T2I branches respectively. On top of this, we develop a hierarchical prompt generation scheme that integrates identity-level learnable tokens with instance-level pseudo-text tokens. These pseudo-tokens are derived from image or text features via modality-specific inversion networks, injecting fine-grained, instance-specific semantics into the prompts. Furthermore, we propose a Cross-Modal Prompt Regularization strategy to enforce semantic alignment in the prompt token space, ensuring that pseudo-prompts preserve source-modality characteristics while enhancing cross-modal transferability. Extensive experiments on multiple ReID benchmarks validate the effectiveness of our method, achieving state-of-the-art performance on both I2I and T2I tasks.
zh

[CV-22] Opt3DGS: Optimizing 3D Gaussian Splatting with Adaptive Exploration and Curvature-Aware Exploitation AAAI2026

【速读】:该论文针对3D Gaussian Splatting (3DGS) 在新视角合成任务中存在两个核心优化问题:一是容易陷入次优局部极小值(suboptimal local optima),二是收敛质量不足。解决方案的关键在于提出Opt3DGS框架,采用两阶段优化策略:第一阶段通过自适应加权随机梯度朗之万动力学(Adaptive Weighted Stochastic Gradient Langevin Dynamics, SGLD)增强全局搜索能力以逃离局部极小值;第二阶段利用局部拟牛顿方向引导的Adam优化器(Local Quasi-Newton Direction-guided Adam)引入曲率信息,实现高精度且高效的收敛。该方法在不改变3DGS表示结构的前提下显著提升了渲染质量。

链接: https://arxiv.org/abs/2511.13571
作者: Ziyang Huang,Jiagang Chen,Jin Liu,Shunping Ji
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at AAAI 2026 as a Conference Paper

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has emerged as a leading framework for novel view synthesis, yet its core optimization challenges remain underexplored. We identify two key issues in 3DGS optimization: entrapment in suboptimal local optima and insufficient convergence quality. To address these, we propose Opt3DGS, a robust framework that enhances 3DGS through a two-stage optimization process of adaptive exploration and curvature-guided exploitation. In the exploration phase, an Adaptive Weighted Stochastic Gradient Langevin Dynamics (SGLD) method enhances global search to escape local optima. In the exploitation phase, a Local Quasi-Newton Direction-guided Adam optimizer leverages curvature information for precise and efficient convergence. Extensive experiments on diverse benchmark datasets demonstrate that Opt3DGS achieves state-of-the-art rendering quality by refining the 3DGS optimization process without modifying its underlying representation.
zh

[CV-23] SE-Net: Semi-supervised Monocular Height Estimation from Single Remote Sensing Images

【速读】:该论文旨在解决单目高度估计(monocular height estimation)中因标注数据稀缺而导致模型泛化能力差、性能受限的问题。现有深度学习方法严重依赖大量人工标注的高精度高度标签,而这类数据获取成本高、效率低,难以满足大规模应用需求。为突破这一瓶颈,作者提出一种基于自训练(self-training)的半监督学习框架——TSE-Net,其核心创新在于构建了一个包含教师网络(teacher)、学生网络(student)和考试网络(exam)的三阶段架构:学生网络利用教师网络生成的伪标签在无标签数据上进行训练;考试网络作为学生网络的时间集成器以稳定预测性能;教师网络则采用回归与分类联合建模方式,其中分类分支通过层次化双切分策略(hierarchical bi-cut strategy)处理高度分布的长尾特性,并结合Plackett-Luce模型对类别概率进行校准,从而筛选出更可靠的伪标签。该设计显著提升了模型在不同分辨率和成像模态下的鲁棒性与准确性。

链接: https://arxiv.org/abs/2511.13552
作者: Sining Chen,Xiao Xiang Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Monocular height estimation plays a critical role in 3D perception for remote sensing, offering a cost-effective alternative to multi-view or LiDAR-based methods. While deep learning has significantly advanced the capabilities of monocular height estimation, these methods remain fundamentally limited by the availability of labeled data, which are expensive and labor-intensive to obtain at scale. The scarcity of high-quality annotations hinders the generalization and performance of existing models. To overcome this limitation, we propose leveraging large volumes of unlabeled data through a semi-supervised learning framework, enabling the model to extract informative cues from unlabeled samples and improve its predictive performance. In this work, we introduce TSE-Net, a self-training pipeline for semi-supervised monocular height estimation. The pipeline integrates teacher, student, and exam networks. The student network is trained on unlabeled data using pseudo-labels generated by the teacher network, while the exam network functions as a temporal ensemble of the student network to stabilize performance. The teacher network is formulated as a joint regression and classification model: the regression branch predicts height values that serve as pseudo-labels, and the classification branch predicts height value classes along with class probabilities, which are used to filter pseudo-labels. Height value classes are defined using a hierarchical bi-cut strategy to address the inherent long-tailed distribution of heights, and the predicted class probabilities are calibrated with a Plackett-Luce model to reflect the expected accuracy of pseudo-labels. We evaluate the proposed pipeline on three datasets spanning different resolutions and imaging modalities. Codes are available at this https URL.
zh

[CV-24] Robust Defense Strategies for Multimodal Contrastive Learning: Efficient Fine-tuning Against Backdoor Attacks

【速读】:该论文旨在解决多模态对比学习模型(如CLIP)在面对后门攻击时的脆弱性问题,即攻击者可通过隐蔽的触发器(trigger)操纵模型行为,而现有防御方法往往缺乏针对性且需大规模重新训练或微调。其解决方案的关键在于引入一个图像分割“oracle”作为监督信号,通过两种算法实现精准防御:一是利用oracle与CLIP输出之间的知识差异识别潜在触发器;二是精确定位受攻击的标签和样本,并构建紧凑的微调数据集,从而高效修复被污染的CLIP模型,消除后门影响。

链接: https://arxiv.org/abs/2511.13545
作者: Md. Iqbal Hossain,Afia Sajeeda,Neeresh Kumar Perla,Ming Shao
机构: University of Massachusetts Dartmouth (马萨诸塞大学达特茅斯分校); University of Massachusetts Lowell (马萨诸塞大学洛厄尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The advent of multimodal deep learning models, such as CLIP, has unlocked new frontiers in a wide range of applications, from image-text understanding to classification tasks. However, these models are not safe for adversarial attacks, particularly backdoor attacks, which can subtly manipulate model behavior. Moreover, existing defense methods typically involve training from scratch or fine-tuning using a large dataset without pinpointing the specific labels that are affected. In this study, we introduce an innovative strategy to enhance the robustness of multimodal contrastive learning models against such attacks. In particular, given a poisoned CLIP model, our approach can identify the backdoor trigger and pinpoint the victim samples and labels in an efficient manner. To that end, an image segmentation ``oracle’’ is introduced as the supervisor for the output of the poisoned CLIP. We develop two algorithms to rectify the poisoned model: (1) differentiating between CLIP and Oracle’s knowledge to identify potential triggers; (2) pinpointing affected labels and victim samples, and curating a compact fine-tuning dataset. With this knowledge, we are allowed to rectify the poisoned CLIP model to negate backdoor effects. Extensive experiments on visual recognition benchmarks demonstrate our strategy is effective in CLIP-based backdoor defense.
zh

[CV-25] BootOOD: Self-Supervised Out-of-Distribution Detection via Synthetic Sample Exposure under Neural Collapse

【速读】:该论文旨在解决图像分类模型在安全敏感场景中部署时,对语义上与分布内(in-distribution, ID)类别相似的分布外(out-of-distribution, OOD)样本检测效果不佳的问题。现有方法在面对此类语义挑战时性能下降明显。其解决方案的关键在于提出一种完全自监督的OOD检测框架BootOOD,该框架仅依赖ID数据进行训练,并利用神经坍缩(Neural Collapse, NC)特性——即ID特征紧密聚集在类均值周围且具有稳定特征范数——通过一个轻量级辅助头实现基于特征范数的半径分类。该设计将OOD检测任务从主分类器中解耦,放宽了对OOD特征空间约束的要求:只要求OOD样本的特征范数小于ID样本即可,这一条件在ID与OOD语义相近时更容易满足,从而显著提升了检测性能。

链接: https://arxiv.org/abs/2511.13539
作者: Yuanchao Wang,Tian Qin,Eduardo Valle,Bruno Abrahao
机构: NYU Shanghai Center for Data Science (纽约大学上海数据科学中心); Intercom; Leonard N. Stern School of Business, New York University (纽约大学斯特恩商学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 pages

点击查看摘要

Abstract:Out-of-distribution (OOD) detection is critical for deploying image classifiers in safety-sensitive environments, yet existing detectors often struggle when OOD samples are semantically similar to the in-distribution (ID) classes. We present BootOOD, a fully self-supervised OOD detection framework that bootstraps exclusively from ID data and is explicitly designed to handle semantically challenging OOD samples. BootOOD synthesizes pseudo-OOD features through simple transformations of ID representations and leverages Neural Collapse (NC), where ID features cluster tightly around class means with consistent feature norms. Unlike prior approaches that aim to constrain OOD features into subspaces orthogonal to the collapsed ID means, BootOOD introduces a lightweight auxiliary head that performs radius-based classification on feature norms. This design decouples OOD detection from the primary classifier and imposes a relaxed requirement: OOD samples are learned to have smaller feature norms than ID features, which is easier to satisfy when ID and OOD are semantically close. Experiments on CIFAR-10, CIFAR-100, and ImageNet-200 show that BootOOD outperforms prior post-hoc methods, surpasses training-based methods without outlier exposure, and is competitive with state-of-the-art outlier-exposure approaches while maintaining or improving ID accuracy.
zh

[CV-26] Accuracy is Not Enough: Poisoning Interpretability in Federated Learning via Color Skew

【速读】:该论文旨在解决联邦学习(Federated Learning)中模型可解释性被隐蔽攻击的问题,即在不损害分类准确率的前提下,通过对抗性扰动破坏梯度类激活映射(Grad-CAM)等可视化解释的忠实性。解决方案的关键在于提出一种名为“色度扰动模块”(Chromatic Perturbation Module)的新型攻击框架,该框架通过在客户端侧施加微小的颜色对比度扰动,使模型的显著性图(saliency maps)偏离语义有意义区域,从而逐步污染全局模型的内部特征归因,且这种扰动在训练轮次中持续累积、难以察觉。实验表明,该攻击可在保持分类准确率高于96%的同时,使Grad-CAM显著性图的峰值激活重叠度降低达35%,揭示了模型可解释性本身也可成为攻击面。

链接: https://arxiv.org/abs/2511.13535
作者: Farhin Farhad Riya,Shahinul Hoque,Jinyuan Stella Sun,Olivera Kotevska
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As machine learning models are increasingly deployed in safety-critical domains, visual explanation techniques have become essential tools for supporting transparency. In this work, we reveal a new class of attacks that compromise model interpretability without affecting accuracy. Specifically, we show that small color perturbations applied by adversarial clients in a federated learning setting can shift a model’s saliency maps away from semantically meaningful regions while keeping the prediction unchanged. The proposed saliency-aware attack framework, called Chromatic Perturbation Module, systematically crafts adversarial examples by altering the color contrast between foreground and background in a way that disrupts explanation fidelity. These perturbations accumulate across training rounds, poisoning the global model’s internal feature attributions in a stealthy and persistent manner. Our findings challenge a common assumption in model auditing that correct predictions imply faithful explanations and demonstrate that interpretability itself can be an attack surface. We evaluate this vulnerability across multiple datasets and show that standard training pipelines are insufficient to detect or mitigate explanation degradation, especially in the federated learning setting, where subtle color perturbations are harder to discern. Our attack reduces peak activation overlap in Grad-CAM explanations by up to 35% while preserving classification accuracy above 96% on all evaluated datasets.
zh

[CV-27] Minimax Multi-Target Conformal Prediction with Applications to Imaging Inverse Problems

【速读】:该论文旨在解决 ill-posed imaging inverse problems 中不确定性量化(uncertainty quantification)的挑战,特别是在多目标估计场景下,现有方法仅能处理标量估计目标,无法满足实际应用中对多个输出指标同时进行可靠不确定性评估的需求。其解决方案的关键在于提出一种渐近最小最大(asymptotically minimax)的多目标 conformal prediction 方法,该方法能够在保证联合边际覆盖(joint marginal coverage)的前提下,生成更紧致的预测区间(prediction intervals),从而提升多任务、多指标场景下的不确定性建模精度与实用性。

链接: https://arxiv.org/abs/2511.13533
作者: Jeffrey Wen,Rizwan Ahmad,Philip Schniter
机构: The Ohio State University (俄亥俄州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In ill-posed imaging inverse problems, uncertainty quantification remains a fundamental challenge, especially in safety-critical applications. Recently, conformal prediction has been used to quantify the uncertainty that the inverse problem contributes to downstream tasks like image classification, image quality assessment, fat mass quantification, etc. While existing works handle only a scalar estimation target, practical applications often involve multiple targets. In response, we propose an asymptotically minimax approach to multi-target conformal prediction that provides tight prediction intervals while ensuring joint marginal coverage. We then outline how our minimax approach can be applied to multi-metric blind image quality assessment, multi-task uncertainty quantification, and multi-round measurement acquisition. Finally, we numerically demonstrate the benefits of our minimax method, relative to existing multi-target conformal prediction methods, using both synthetic and magnetic resonance imaging (MRI) data.
zh

[CV-28] Mapping the Vanishing and Transformation of Urban Villages in China

【速读】:该论文旨在解决中国城中村(Urban Villages, UVs)在大规模拆除与再开发过程中,存在缺乏系统性评估的问题,即被拆除土地是否得到有效再利用,从而引发对当前再开发实践效能与可持续性的担忧。解决方案的关键在于提出一种基于深度学习的框架,通过多时相遥感影像的语义分割技术识别城中村边界演变,并依据“保留-拆除-再开发”三阶段对拆除后的土地利用类型进行六类划分(包括未完成拆除、空置地、施工场地、建筑、绿地及其他),进而揭示其时空演化路径与模式。该方法实现了对城中村再开发过程的精细化监测与分类,为制定分层、情境敏感的城市更新策略提供了实证依据。

链接: https://arxiv.org/abs/2511.13507
作者: Wenyu Zhang,Yao Tong,Yiqiu Liu,Rui Cao
机构: Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Appendix A. Supplementary data at this https URL

点击查看摘要

Abstract:Urban villages (UVs), informal settlements embedded within China’s urban fabric, have undergone widespread demolition and redevelopment in recent decades. However, there remains a lack of systematic evaluation of whether the demolished land has been effectively reused, raising concerns about the efficacy and sustainability of current redevelopment practices. To address the gap, this study proposes a deep learning-based framework to monitor the spatiotemporal changes of UVs in China. Specifically, semantic segmentation of multi-temporal remote sensing imagery is first used to map evolving UV boundaries, and then post-demolition land use is classified into six categories based on the “remained-demolished-redeveloped” phase: incomplete demolition, vacant land, construction sites, buildings, green spaces, and others. Four representative cities from China’s four economic regions were selected as the study areas, i.e., Guangzhou (East), Zhengzhou (Central), Xi’an (West), and Harbin (Northeast). The results indicate: 1) UV redevelopment processes were frequently prolonged; 2) redevelopment transitions primarily occurred in peripheral areas, whereas urban cores remained relatively stable; and 3) three spatiotemporal transformation pathways, i.e., synchronized redevelopment, delayed redevelopment, and gradual optimization, were revealed. This study highlights the fragmented, complex and nonlinear nature of UV redevelopment, underscoring the need for tiered and context-sensitive planning strategies. By linking spatial dynamics with the context of redevelopment policies, the findings offer valuable empirical insights that support more inclusive, efficient, and sustainable urban renewal, while also contributing to a broader global understanding of informal settlement transformations.
zh

[CV-29] Language-Guided Invariance Probing of Vision-Language Models

【速读】:该论文旨在解决当前视觉-语言模型(Vision-Language Models, VLMs)在面对可控语言扰动时的可靠性问题,特别是其对语义不变改写(paraphrase)和语义改变扰动(semantic flip)的响应差异。现有模型如CLIP、OpenCLIP、EVA02-CLIP和SigLIP虽在零样本图像-文本匹配任务中表现优异,但缺乏对语言鲁棒性的系统评估。为此,作者提出Language-Guided Invariance Probing (LGIP)基准,通过构造40k MS COCO图像的五条人工描述,并自动生成语义保持的同义句与基于规则的语义翻转(如对象类别、颜色或数量变化),以量化模型的“不变性误差”(invariance error)、“语义敏感度差距”(semantic sensitivity gap)和“正向率统计量”(positive-rate statistic)。关键创新在于引入一个模型无关的诊断框架,揭示了传统检索指标无法捕捉的模型行为偏差,例如SigLIP系列模型在语义翻转后反而偏好错误描述,而EVA02-CLIP和大尺寸OpenCLIP则展现出更优的不变性与敏感性平衡。

链接: https://arxiv.org/abs/2511.13494
作者: Jae Joong Lee
机构: Purdue University (普渡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent vision-language models (VLMs) such as CLIP, OpenCLIP, EVA02-CLIP and SigLIP achieve strong zero-shot performance, but it is unclear how reliably they respond to controlled linguistic perturbations. We introduce Language-Guided Invariance Probing (LGIP), a benchmark that measures (i) invariance to meaning-preserving paraphrases and (ii) sensitivity to meaning-changing semantic flips in image-text matching. Using 40k MS COCO images with five human captions each, we automatically generate paraphrases and rule-based flips that alter object category, color or count, and summarize model behavior with an invariance error, a semantic sensitivity gap and a positive-rate statistic. Across nine VLMs, EVA02-CLIP and large OpenCLIP variants lie on a favorable invariance-sensitivity frontier, combining low paraphrase-induced variance with consistently higher scores for original captions than for their flipped counterparts. In contrast, SigLIP and SigLIP2 show much larger invariance error and often prefer flipped captions to the human descriptions, especially for object and color edits. These failures are largely invisible to standard retrieval metrics, indicating that LGIP provides a model-agnostic diagnostic for the linguistic robustness of VLMs beyond conventional accuracy scores. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2511.13494 [cs.CV] (or arXiv:2511.13494v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2511.13494 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-30] InterMoE: Individual-Specific 3D Human Interaction Generation via Dynamic Temporal-Selective MoE AAAI-26

【速读】:该论文旨在解决现有方法在生成高质量人类交互动作时难以保留个体独特特征且无法充分遵循文本描述语义的问题。解决方案的关键在于提出InterMoE框架,其核心是一个基于动态时间选择性专家混合(Dynamic Temporal-Selective Mixture of Experts)的路由机制,该机制协同利用高层文本语义与低层运动上下文信息,将时间上的运动特征分配给特定专家;通过这种机制,专家能够动态调整选择能力并聚焦于关键的时间特征,从而在保持高语义保真度的同时有效保留个体身份特征。

链接: https://arxiv.org/abs/2511.13488
作者: Lipeng Wang,Hongxing Fan,Haohua Chen,Zehuan Huang,Lu Sheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI-26. Codes: this https URL

点击查看摘要

Abstract:Generating high-quality human interactions holds significant value for applications like virtual reality and robotics. However, existing methods often fail to preserve unique individual characteristics or fully adhere to textual descriptions. To address these challenges, we introduce InterMoE, a novel framework built on a Dynamic Temporal-Selective Mixture of Experts. The core of InterMoE is a routing mechanism that synergistically uses both high-level text semantics and low-level motion context to dispatch temporal motion features to specialized experts. This allows experts to dynamically determine the selection capacity and focus on critical temporal features, thereby preserving specific individual characteristic identities while ensuring high semantic fidelity. Extensive experiments show that InterMoE achieves state-of-the-art performance in individual-specific high-fidelity 3D human interaction generation, reducing FID scores by 9% on the InterHuman dataset and 22% on InterX.
zh

[CV-31] Semantic Document Derendering: SVG Reconstruction via Vision-Language Modeling

【速读】:该论文旨在解决静态位图格式多媒体文档(如幻灯片和海报)缺乏可编辑性的问题,核心挑战在于现有基于几何的位图转矢量方法无法保留文档的语义结构,导致文本与图像元素的区分丢失。解决方案的关键是提出SliDer框架,利用视觉-语言模型(Vision-Language Models, VLMs)实现语义驱动的文档反渲染(semantic document derendering),通过检测并提取图像与文本元素的属性,并以迭代优化的方式生成结构清晰、可编辑的可缩放矢量图形(Scalable Vector Graphic, SVG)表示,从而在渲染时更准确地还原原始位图内容。

链接: https://arxiv.org/abs/2511.13478
作者: Adam Hazimeh,Ke Wang,Mark Collier,Gilles Baechler,Efi Kokiopoulou,Pascal Frossard
机构: 1. EPFL (瑞士洛桑联邦理工学院); 2. Google (谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multimedia documents such as slide presentations and posters are designed to be interactive and easy to modify. Yet, they are often distributed in a static raster format, which limits editing and customization. Restoring their editability requires converting these raster images back into structured vector formats. However, existing geometric raster-vectorization methods, which rely on low-level primitives like curves and polygons, fall short at this task. Specifically, when applied to complex documents like slides, they fail to preserve the high-level structure, resulting in a flat collection of shapes where the semantic distinction between image and text elements is lost. To overcome this limitation, we address the problem of semantic document derendering by introducing SliDer, a novel framework that uses Vision-Language Models (VLMs) to derender slide images as compact and editable Scalable Vector Graphic (SVG) representations. SliDer detects and extracts attributes from individual image and text elements in a raster input and organizes them into a coherent SVG format. Crucially, the model iteratively refines its predictions during inference in a process analogous to human design, generating SVG code that more faithfully reconstructs the original raster upon rendering. Furthermore, we introduce Slide2SVG, a novel dataset comprising raster-SVG pairs of slide documents curated from real-world scientific presentations, to facilitate future research in this domain. Our results demonstrate that SliDer achieves a reconstruction LPIPS of 0.069 and is favored by human evaluators in 82.9% of cases compared to the strongest zero-shot VLM baseline.
zh

[CV-32] rust in Vision-Language Models: Insights from a Participatory User Workshop

【速读】:该论文试图解决的问题是:随着视觉语言模型(Vision-Language Models, VLMs)在各类应用场景中的广泛部署,用户对这些模型的信任如何建立与演化尚不明确,且当前多依赖AI模型作为评判工具进行实验验证,忽视了真实用户的参与,导致信任机制缺乏用户中心的 contextualization(情境化)。解决方案的关键在于采用以用户为中心的方法,通过开展面向潜在VLM用户的研讨工作坊,收集初步实证数据,从而为未来研究提供依据,旨在开发能够适配用户-VLM交互场景的信任度量指标和参与策略。

链接: https://arxiv.org/abs/2511.13458
作者: Agnese Chiatti,Lara Piccolo,Sara Bernardini,Matteo Matteucci,Viola Schiaffonati
机构: Politecnico di Milano (米兰理工大学); University of Oxford (牛津大学); CODE University of Applied Sciences (CODE应用科学大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the growing deployment of Vision-Language Models (VLMs), pre-trained on large image-text and video-text datasets, it is critical to equip users with the tools to discern when to trust these systems. However, examining how user trust in VLMs builds and evolves remains an open problem. This problem is exacerbated by the increasing reliance on AI models as judges for experimental validation, to bypass the cost and implications of running participatory design studies directly with users. Following a user-centred approach, this paper presents preliminary results from a workshop with prospective VLM users. Insights from this pilot workshop inform future studies aimed at contextualising trust metrics and strategies for participants’ engagement to fit the case of user-VLM interaction.
zh

[CV-33] Unlocking the Forgery Detection Potential of Vanilla MLLM s: A Novel Training-Free Pipeline

【速读】:该论文旨在解决图像伪造检测与定位(IFDL)方法在跨数据集泛化能力不足以及可解释性有限的问题。现有基于大规模训练的多模态大语言模型(MLLM)方法虽具潜力,但计算开销大且未能充分挖掘原始MLLM的通用能力。其解决方案的关键在于提出一种无需训练的MLLM驱动流程Foresee,通过引入类型先验驱动策略和灵活特征检测器(FFD)模块,专门针对复制-移动篡改进行优化,从而在不依赖额外训练的情况下实现高精度定位与更丰富的文本解释,并显著提升对多种篡改类型(如拼接、删除、局部增强、深度伪造及AIGC编辑)的泛化性能。

链接: https://arxiv.org/abs/2511.13442
作者: Rui Zuo,Qinyue Tong,Zhe-Ming Lu,Ziqian Lu
机构: Zhejiang University (浙江大学); Zhejiang Sci-Tech University (浙江理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the rapid advancement of artificial intelligence-generated content (AIGC) technologies, including multimodal large language models (MLLMs) and diffusion models, image generation and manipulation have become remarkably effortless. Existing image forgery detection and localization (IFDL) methods often struggle to generalize across diverse datasets and offer limited interpretability. Nowadays, MLLMs demonstrate strong generalization potential across diverse vision-language tasks, and some studies introduce this capability to IFDL via large-scale training. However, such approaches cost considerable computational resources, while failing to reveal the inherent generalization potential of vanilla MLLMs to address this problem. Inspired by this observation, we propose Foresee, a training-free MLLM-based pipeline tailored for image forgery analysis. It eliminates the need for additional training and enables a lightweight inference process, while surpassing existing MLLM-based methods in both tamper localization accuracy and the richness of textual explanations. Foresee employs a type-prior-driven strategy and utilizes a Flexible Feature Detector (FFD) module to specifically handle copy-move manipulations, thereby effectively unleashing the potential of vanilla MLLMs in the forensic domain. Extensive experiments demonstrate that our approach simultaneously achieves superior localization accuracy and provides more comprehensive textual explanations. Moreover, Foresee exhibits stronger generalization capability, outperforming existing IFDL methods across various tampering types, including copy-move, splicing, removal, local enhancement, deepfake, and AIGC-based editing. The code will be released in the final version.
zh

[CV-34] FUSE: A Flow-based Mapping Between Shapes

【速读】:该论文旨在解决三维形状之间高效、通用且无需大规模训练的映射表示问题,尤其在不同数据模态(如点云、网格、符号距离场和体素数据)间实现精确的跨表示形状匹配。其核心解决方案是基于流匹配(flow-matching)模型构建一种新的神经表示方法,将3D形状建模为从固定锚定分布通过连续可逆流映射产生的概率分布;通过组合源形状到锚定分布的逆流与锚定分布到目标形状的正向流,实现两个表面之间的连续点对点映射。该方法不依赖于数据驱动训练,具备可逆性和模态无关性,从而在多种基准测试中均展现出高覆盖率和精度。

链接: https://arxiv.org/abs/2511.13431
作者: Lorenzo Olearo,Giulio Viganò,Daniele Baieri,Filippo Maggioli,Simone Melzi
机构: University of Milano-Bicocca (米兰大学-比科卡分校); Pegaso University (佩加索大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 9 figures

点击查看摘要

Abstract:We introduce a novel neural representation for maps between 3D shapes based on flow-matching models, which is computationally efficient and supports cross-representation shape matching without large-scale training or data-driven procedures. 3D shapes are represented as the probability distribution induced by a continuous and invertible flow mapping from a fixed anchor distribution. Given a source and a target shape, the composition of the inverse flow (source to anchor) with the forward flow (anchor to target), we continuously map points between the two surfaces. By encoding the shapes with a pointwise task-tailored embedding, this construction provides an invertible and modality-agnostic representation of maps between shapes across point clouds, meshes, signed distance fields (SDFs), and volumetric data. The resulting representation consistently achieves high coverage and accuracy across diverse benchmarks and challenging settings in shape matching. Beyond shape matching, our framework shows promising results in other tasks, including UV mapping and registration of raw point cloud scans of human bodies.
zh

[CV-35] VOPE: Revisiting Hallucination of Vision-Language Models in Voluntary Imagination Task

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在自愿想象任务(如故事创作)中幻觉检测的不足问题。传统研究多聚焦于事实描述类任务,将任何图像中未出现的内容一律视为幻觉,但在自愿想象任务中,模型被期望生成超越图像内容的新颖信息,此时简单地将新生成内容归为幻觉是不恰当的。为此,作者提出了一种名为“自愿想象对象存在评估”(Voluntary-imagined Object Presence Evaluation, VOPE)的新方法,其核心在于通过再检查式提问(recheck-based questions)来评估模型对其生成内容中对象是否存在性的理解,并据此判断模型是否产生幻觉。VOPE的关键创新在于基于对象在图像中的实际存在性与模型自我解释之间的一致性来判定幻觉,从而更准确地区分合理想象与错误生成。

链接: https://arxiv.org/abs/2511.13420
作者: Xingming Long,Jie Zhang,Shiguang Shan,Xilin Chen
机构: Key Laboratory of AI Safety of CAS, Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing, China; University of Chinese Academy of Sciences, Beijing, China; Zhongguancun Academy, Beijing, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages

点击查看摘要

Abstract:Most research on hallucinations in Large Vision-Language Models (LVLMs) focuses on factual description tasks that prohibit any output absent from the image. However, little attention has been paid to hallucinations in voluntary imagination tasks, e.g., story writing, where the models are expected to generate novel content beyond the given image. In these tasks, it is inappropriate to simply regard such imagined novel content as hallucinations. To address this limitation, we introduce Voluntary-imagined Object Presence Evaluation (VOPE)-a novel method to assess LVLMs’ hallucinations in voluntary imagination tasks via presence evaluation. Specifically, VOPE poses recheck-based questions to evaluate how an LVLM interprets the presence of the imagined objects in its own response. The consistency between the model’s interpretation and the object’s presence in the image is then used to determine whether the model hallucinates when generating the response. We apply VOPE to several mainstream LVLMs and hallucination mitigation methods, revealing two key findings: (1) most LVLMs hallucinate heavily during voluntary imagination, and their performance in presence evaluation is notably poor on imagined objects; (2) existing hallucination mitigation methods show limited effect in voluntary imagination tasks, making this an important direction for future research.
zh

[CV-36] Delineate Anything Flow: Fast Country-Level Field Boundary Detection from Any Source

【速读】:该论文旨在解决农业田块边界从卫星影像中精确提取的难题,传统方法常存在边界不完整、相邻田块合并及难以规模化等问题。解决方案的关键在于提出一种分辨率无关的“Delineate Anything Flow (DelAnyFlow)”方法,其核心由两部分组成:一是基于YOLOv11骨干网络训练的DelAny实例分割模型,该模型在大规模Field Boundary Instance Segmentation-22M(FBIS 22M)数据集上进行优化,包含2290万条验证过的田块实例;二是结构化的后处理流程,包括合并与矢量化步骤,确保生成拓扑一致的矢量边界。该方法实现了超过100%的mAP提升和400倍的推理速度优于SAM2,并在乌克兰全国尺度(603,000 km²)应用中仅用单工作站六小时内完成边界绘制,显著优于现有商业产品(如Sinergise Solutions和NASA Harvest),尤其在小农户和碎片化农田系统中表现突出。

链接: https://arxiv.org/abs/2511.13417
作者: Mykola Lavreniuk,Nataliia Kussul,Andrii Shelestov,Yevhenii Salii,Volodymyr Kuzin,Sergii Skakun,Zoltan Szantoi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate delineation of agricultural field boundaries from satellite imagery is essential for land management and crop monitoring, yet existing methods often produce incomplete boundaries, merge adjacent fields, and struggle to scale. We present the Delineate Anything Flow (DelAnyFlow) methodology, a resolution-agnostic approach for large-scale field boundary mapping. DelAnyFlow combines the DelAny instance segmentation model, based on a YOLOv11 backbone and trained on the large-scale Field Boundary Instance Segmentation-22M (FBIS 22M) dataset, with a structured post-processing, merging, and vectorization sequence to generate topologically consistent vector boundaries. FBIS 22M, the largest dataset of its kind, contains 672,909 multi-resolution image patches (0.25-10m) and 22.9million validated field instances. The DelAny model delivers state-of-the-art accuracy with over 100% higher mAP and 400x faster inference than SAM2. DelAny demonstrates strong zero-shot generalization and supports national-scale applications: using Sentinel 2 data for 2024, DelAnyFlow generated a complete field boundary layer for Ukraine (603,000km2) in under six hours on a single workstation. DelAnyFlow outputs significantly improve boundary completeness relative to operational products from Sinergise Solutions and NASA Harvest, particularly in smallholder and fragmented systems (0.25-1ha). For Ukraine, DelAnyFlow delineated 3.75M fields at 5m and 5.15M at 2.5m, compared to 2.66M detected by Sinergise Solutions and 1.69M by NASA Harvest. This work delivers a scalable, cost-effective methodology for field delineation in regions lacking digital cadastral data. A project landing page with links to model weights, code, national-scale vector outputs, and dataset is available at this https URL.
zh

[CV-37] What Color Is It? A Text-Interference Multimodal Hallucination Benchmark

【速读】:该论文旨在解决多模态大模型(Multimodal Large Models, MLMs)在视觉感知中因颜色信息干扰而导致的幻觉问题,这种幻觉会显著影响模型输出的准确性与可靠性。解决方案的关键在于构建了一个名为“What Color Is It”的新型基准数据集,通过简单方法触发单模态视觉幻觉,从而系统性地验证假设并深入分析视觉模态下幻觉产生的根本原因,进而提出增强模型鲁棒性的潜在改进策略。

链接: https://arxiv.org/abs/2511.13400
作者: Jinkun Zhao,Lei Huang,Wenjun Wu
机构: Beihang University (北京航空航天大学); SKLCCSE (智能科学与技术重点实验室); Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing (未来区块链与隐私计算高精尖创新中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the rapid advancement of Large Models, numerous text-and-vision-fused Multimodal Large Models (MLMs) have emerged. However, these MLMs remain susceptible to informational interference in visual perception, particularly in color perception, which introduces an additional risk of hallucination. To validate this hypothesis, we introduce the “What Color Is It” dataset, a novel benchmark constructed using a simple method to trigger single-modality visual hallucination in MLMs. Based on this dataset, we further investigate the underlying causes of hallucination in the visual modality of MLMs and propose potential solutions to enhance their robustness.
zh

[CV-38] ripleFDS: Triple Feature Disentanglement and Synthesis for Scene Text Editing AAAI2026

【速读】:该论文旨在解决场景文本编辑(Scene Text Editing, STE)中可编辑属性难以完全解耦的问题,现有方法通常仅能处理文本内容或风格单一维度的修改,导致控制能力受限且视觉一致性差。其解决方案的关键在于提出TripleFDS框架与SCB Synthesis数据集:通过引入“SCB Group”这一新型结构,在每张图像中组合文本内容、文本样式和背景三类属性,形成用于训练的最小单元;在此基础上,TripleFDS采用组间对比正则化确保语义准确性,并利用样本内多特征正交性降低冗余,从而实现三重特征的高质量解耦;此外,在合成阶段通过特征重映射防止重建过程中的“捷径”现象和潜在特征泄露,最终在主流STE基准上达到最先进的图像保真度(SSIM 44.54)和文本准确率(ACC 93.58%),并支持风格替换、背景迁移等新编辑操作。

链接: https://arxiv.org/abs/2511.13399
作者: Yuchen Bao,Yiting Wang,Wenjian Huang,Haowei Wang,Shen Chen,Taiping Yao,Shouhong Ding,Jianguo Zhang
机构: Tencent Youtu Lab(腾讯优图实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI2026

点击查看摘要

Abstract:Scene Text Editing (STE) aims to naturally modify text in images while preserving visual consistency, the decisive factors of which can be divided into three parts, i.e., text style, text content, and background. Previous methods have struggled with incomplete disentanglement of editable attributes, typically addressing only one aspect - such as editing text content - thus limiting controllability and visual consistency. To overcome these limitations, we propose TripleFDS, a novel framework for STE with disentangled modular attributes, and an accompanying dataset called SCB Synthesis. SCB Synthesis provides robust training data for triple feature disentanglement by utilizing the “SCB Group”, a novel construct that combines three attributes per image to generate diverse, disentangled training groups. Leveraging this construct as a basic training unit, TripleFDS first disentangles triple features, ensuring semantic accuracy through inter-group contrastive regularization and reducing redundancy through intra-sample multi-feature orthogonality. In the synthesis phase, TripleFDS performs feature remapping to prevent “shortcut” phenomena during reconstruction and mitigate potential feature leakage. Trained on 125,000 SCB Groups, TripleFDS achieves state-of-the-art image fidelity (SSIM of 44.54) and text accuracy (ACC of 93.58%) on the mainstream STE benchmarks. Besides superior performance, the more flexible editing of TripleFDS supports new operations such as style replacement and background transfer. Code: this https URL
zh

[CV-39] Descriptor: Distance-Annotated Traffic Perception Question Answering (DTPQA)

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在自动驾驶场景中感知能力的评估问题,尤其是在复杂交通环境中对远距离目标的识别与理解能力不足的问题。当前VLMs虽在多种任务上取得进展,但其在安全关键领域如自动驾驶中的可信度仍受限于感知鲁棒性,特别是对远距离(≥30米)关键物体的识别性能缺乏系统性评测工具。解决方案的关键在于提出一个名为Distance-Annotated Traffic Perception Question Answering (DTPQA) 的视觉问答(Visual Question Answering, VQA)基准测试集,该数据集包含合成数据(DTP-Synthetic)和真实世界数据(DTP-Real),并为每个样本标注对象到摄像头的距离信息,从而可独立评估VLM在不同距离下的感知性能退化情况,为后续模型优化提供可量化、可复现的评测依据。

链接: https://arxiv.org/abs/2511.13397
作者: Nikos Theodoridis,Tim Brophy,Reenu Mohandas,Ganesh Sistu,Fiachra Collins,Anthony Scanlan,Ciaran Eising
机构: University of Limerick (利默里克大学); Lero, The Irish Software Research Centre (爱尔兰软件研究中心); Valeo Vision Systems (瓦莱奥视觉系统)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The remarkable progress of Vision-Language Models (VLMs) on a variety of tasks has raised interest in their application to automated driving. However, for these models to be trusted in such a safety-critical domain, they must first possess robust perception capabilities, i.e., they must be capable of understanding a traffic scene, which can often be highly complex, with many things happening simultaneously. Moreover, since critical objects and agents in traffic scenes are often at long distances, we require systems with not only strong perception capabilities at close distances (up to 20 meters), but also at long (30+ meters) range. Therefore, it is important to evaluate the perception capabilities of these models in isolation from other skills like reasoning or advanced world knowledge. Distance-Annotated Traffic Perception Question Answering (DTPQA) is a Visual Question Answering (VQA) benchmark designed specifically for this purpose: it can be used to evaluate the perception systems of VLMs in traffic scenarios using trivial yet crucial questions relevant to driving decisions. It consists of two parts: a synthetic benchmark (DTP-Synthetic) created using a simulator, and a real-world benchmark (DTP-Real) built on top of existing images of real traffic scenes. Additionally, DTPQA includes distance annotations, i.e., how far the object in question is from the camera. More specifically, each DTPQA sample consists of (at least): (a) an image, (b) a question, © the ground truth answer, and (d) the distance of the object in question, enabling analysis of how VLM performance degrades with increasing object distance. In this article, we provide the dataset itself along with the Python scripts used to create it, which can be used to generate additional data of the same kind.
zh

[CV-40] Generalized Denoising Diffusion Codebook Models (gDDCM): Tokenizing images using a pre-trained diffusion model

【速读】:该论文旨在解决现有图像压缩方法中对扩散模型(Diffusion Models)泛化能力不足的问题,特别是针对仅适用于特定扩散模型(如DDPM)的去噪扩散码本模型(Denoising Diffusion Codebook Models, DDCM)无法扩展至其他主流扩散模型(如基于得分的模型、一致性模型和修正流模型)的局限性。解决方案的关键在于提出广义去噪扩散压缩模型(generalized Denoising Diffusion Compression Model, gDDCM),通过统一框架将DDCM扩展至多种主流扩散模型及其变体,从而在保持压缩性能的同时显著提升方法的适用性和灵活性。实验表明,gDDCM在CIFAR-10和LSUN Bedroom数据集上均实现了优于原始DDCM的压缩效果,验证了其通用性与有效性。

链接: https://arxiv.org/abs/2511.13387
作者: Fei Kong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: in Chinese language

点击查看摘要

Abstract:Recently, the Denoising Diffusion Codebook Models (DDCM) was proposed. DDCM leverages the Denoising Diffusion Probabilistic Model (DDPM) and replaces the random noise in the backward process with noise sampled from specific sets according to a predefined rule, thereby enabling image compression. However, DDCM cannot be applied to methods other than DDPM. In this paper, we propose the generalized Denoising Diffusion Compression Model (gDDCM), which extends DDCM to mainstream diffusion models and their variants, including DDPM, Score-Based Models, Consistency Models, and Rectified Flow. We evaluate our method on CIFAR-10 and LSUN Bedroom datasets. Experimental results demonstrate that our approach successfully generalizes DDCM to the aforementioned models and achieves improved performance.
zh

[CV-41] Semi-Supervised Multi-Task Learning for Interpretable Quality As- sessment of Fundus Images

【速读】:该论文旨在解决视网膜图像质量评估(Retinal Image Quality Assessment, RIQA)中缺乏对采集缺陷细节标注的问题,现有工具通常仅输出整体图像质量等级,无法指导医生或技术人员进行有效的图像重采。这一局限主要源于详细标注的高成本。解决方案的关键在于提出一种混合半监督学习方法:利用少量人工标注的整体质量标签与由教师模型(Teacher model)生成的伪标签(pseudo-labels)在多任务框架下联合训练,从而在不增加额外人工标注成本的前提下,实现对图像采集条件(如光照、清晰度、对比度)等细节的可解释性预测。实验表明,该方法在EyeQ和DeepDRiD数据集上均优于单任务基线模型,并且其性能与专家标注水平相当,显著提升了RIQA系统的临床实用性与可操作性。

链接: https://arxiv.org/abs/2511.13353
作者: Lucas Gabriel Telesco,Danila Nejamkin,Estefanía Mata,Francisco Filizzola,Kevin Wignall,Lucía Franco Troilo,María de los Angeles Cenoz,Melissa Thompson,Mercedes Leguía,Ignacio Larrabide,José Ignacio Orlando
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retinal image quality assessment (RIQA) supports computer-aided diagnosis of eye diseases. However, most tools classify only overall image quality, without indicating acquisition defects to guide recapture. This gap is mainly due to the high cost of detailed annotations. In this paper, we aim to mitigate this limitation by introducing a hybrid semi-supervised learning approach that combines manual labels for overall quality with pseudo-labels of quality details within a multi-task framework. Our objective is to obtain more interpretable RIQA models without requiring extensive manual labeling. Pseudo-labels are generated by a Teacher model trained on a small dataset and then used to fine-tune a pre-trained model in a multi-task setting. Using a ResNet-18 backbone, we show that these weak annotations improve quality assessment over single-task baselines (F1: 0.875 vs. 0.863 on EyeQ, and 0.778 vs. 0.763 on DeepDRiD), matching or surpassing existing methods. The multi-task model achieved performance statistically comparable to the Teacher for most detail prediction tasks (p 0.05). In a newly annotated EyeQ subset released with this paper, our model performed similarly to experts, suggesting that pseudo-label noise aligns with expert variability. Our main finding is that the proposed semi-supervised approach not only improves overall quality assessment but also provides interpretable feedback on capture conditions (illumination, clarity, contrast). This enhances interpretability at no extra manual labeling cost and offers clinically actionable outputs to guide image recapture.
zh

[CV-42] YOLO Meets Mixture-of-Experts: Adaptive Expert Routing for Robust Object Detection

【速读】:该论文旨在解决传统单模型目标检测方法在复杂场景下特征表达能力有限、泛化性能不足的问题。其解决方案的关键在于提出一种基于自适应路由机制的专家混合(Mixture-of-Experts, MoE)框架,通过在多个YOLOv9-T专家之间动态分配输入特征,实现特征层面的差异化处理与专业化分工,从而提升平均精度(mean Average Precision, mAP)和平均召回率(Average Recall, AR)。

链接: https://arxiv.org/abs/2511.13344
作者: Ori Meiraz,Sharon Shalev,Avishai Weizman
机构: Technion, Israel Institute of Technology (以色列理工学院); Ben-Gurion University of the Negev (内盖夫本-古里安大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 1 figure, 1 table

点击查看摘要

Abstract:This paper presents a novel Mixture-of-Experts framework for object detection, incorporating adaptive routing among multiple YOLOv9-T experts to enable dynamic feature specialization and achieve higher mean Average Precision (mAP) and Average Recall (AR) compared to a single YOLOv9-T model.
zh

[CV-43] Computer Vision based group activity detection and action spotting

【速读】:该论文旨在解决多人群体场景中群体活动检测(Group Activity Detection)的难题,主要挑战包括复杂的人类交互、遮挡以及随时间变化的外观差异。其解决方案的关键在于融合分割(segmentation)、特征提取与关系图推理:首先利用Mask R-CNN进行精准的个体定位并生成实例掩码;随后结合多种骨干网络(如Inception V3、MobileNet和VGG16)提取特征,并通过RoIAlign保持空间对齐,再将掩码信息与特征图融合以获得精细化的掩码特征表示;进而构建Actor Relation Graphs,基于归一化互相关、绝对差之和及点积等方法编码个体间的外观相似性和位置关系;最后使用图卷积网络(Graph Convolutional Networks, GCNs)对关系图进行推理,从而同时预测个体动作和群体层面的活动。实验证明,这种整合掩码特征精炼、鲁棒的相似性搜索与图神经网络推理的方法,在拥挤与非拥挤场景下均显著提升了识别性能。

链接: https://arxiv.org/abs/2511.13315
作者: Narthana Sivalingam,Santhirarajah Sivasthigan,Thamayanthi Mahendranathan,G.M.R.I. Godaliyadda,M.P.B. Ekanayake,H.M.V.R. Herath
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Group activity detection in multi-person scenes is challenging due to complex human interactions, occlusions, and variations in appearance over time. This work presents a computer vision based framework for group activity recognition and action spotting using a combination of deep learning models and graph based relational reasoning. The system first applies Mask R-CNN to obtain accurate actor localization through bounding boxes and instance masks. Multiple backbone networks, including Inception V3, MobileNet, and VGG16, are used to extract feature maps, and RoIAlign is applied to preserve spatial alignment when generating actor specific features. The mask information is then fused with the feature maps to obtain refined masked feature representations for each actor. To model interactions between individuals, we construct Actor Relation Graphs that encode appearance similarity and positional relations using methods such as normalized cross correlation, sum of absolute differences, and dot product. Graph Convolutional Networks operate on these graphs to reason about relationships and predict both individual actions and group level activities. Experiments on the Collective Activity dataset demonstrate that the combination of mask based feature refinement, robust similarity search, and graph neural network reasoning leads to improved recognition performance across both crowded and non crowded scenarios. This approach highlights the potential of integrating segmentation, feature extraction, and relational graph reasoning for complex video understanding tasks.
zh

[CV-44] DriveLiDAR4D: Sequential and Controllable LiDAR Scene Generation for Autonomous Driving AAAI2026

【速读】:该论文旨在解决当前3D LiDAR点云生成方法在自动驾驶系统开发与评估中面临的两大关键问题:一是缺乏时序一致性(temporal consistency)的生成能力,二是难以精确控制前景物体的位置并生成逼真的背景环境。为应对这些挑战,作者提出了一种名为DriveLiDAR4D的新颖LiDAR生成流水线,其核心创新在于引入多模态条件输入和一种新颖的序列噪声预测模型LiDAR4DNet,从而实现端到端的时序一致LiDAR场景生成,并具备对前景对象的高可控性和背景的真实感。这一方案首次在端到端框架下实现了完整场景的可操控性生成,显著提升了生成质量,在nuScenes和KITTI数据集上分别取得了743.13的FRD分数和16.96的FVD分数,相较现有最优方法UniScene分别提升37.2%和24.1%。

链接: https://arxiv.org/abs/2511.13309
作者: Kaiwen Cai,Xinze Liu,Xia Zhou,Hengtong Hu,Jie Xiang,Luyao Zhang,Xueyang Zhang,Kun Zhan,Yifei Zhan,Xianpeng Lang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI2026

点击查看摘要

Abstract:The generation of realistic LiDAR point clouds plays a crucial role in the development and evaluation of autonomous driving systems. Although recent methods for 3D LiDAR point cloud generation have shown significant improvements, they still face notable limitations, including the lack of sequential generation capabilities and the inability to produce accurately positioned foreground objects and realistic backgrounds. These shortcomings hinder their practical applicability. In this paper, we introduce DriveLiDAR4D, a novel LiDAR generation pipeline consisting of multimodal conditions and a novel sequential noise prediction model LiDAR4DNet, capable of producing temporally consistent LiDAR scenes with highly controllable foreground objects and realistic backgrounds. To the best of our knowledge, this is the first work to address the sequential generation of LiDAR scenes with full scene manipulation capability in an end-to-end manner. We evaluated DriveLiDAR4D on the nuScenes and KITTI datasets, where we achieved an FRD score of 743.13 and an FVD score of 16.96 on the nuScenes dataset, surpassing the current state-of-the-art (SOTA) method, UniScene, with an performance boost of 37.2% in FRD and 24.1% in FVD, respectively.
zh

[CV-45] DAP: A Discrete-token Autoregressive Planner for Autonomous Driving

【速读】:该论文旨在解决自动驾驶中如何在有限的数据和模型参数预算下实现可持续的性能提升这一核心挑战。现有方法在规划任务中虽展现出良好的数据扩展效率,但仅预测本车轨迹会导致监督信号稀疏,难以有效约束场景演化对本车运动的影响。其解决方案的关键在于提出一种离散token自回归规划器(DAP),该方法联合预测鸟瞰图(BEV)语义与本车轨迹,从而强化表征学习并使预测动态直接指导本车运动;同时引入基于强化学习的微调机制,在保留监督行为克隆先验的基础上注入奖励驱动的改进。该方案以160M参数规模实现了开环指标的最先进性能及闭环测试中的竞争力表现,证明了完全离散token化的自回归范式在自动驾驶规划中的紧凑性与可扩展性。

链接: https://arxiv.org/abs/2511.13306
作者: Bowen Ye,Bin Zhang,Hang Zhao
机构: Shanghai Qi Zhi Institute (上海奇智研究院); IIIS, Tsinghua University (清华大学交叉信息研究院); Shanghai Jiaotong University (上海交通大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Gaining sustainable performance improvement with scaling data and model budget remains a pivotal yet unresolved challenge in autonomous driving. While autoregressive models exhibited promising data-scaling efficiency in planning tasks, predicting ego trajectories alone suffers sparse supervision and weakly constrains how scene evolution should shape ego motion. Therefore, we introduce DAP, a discrete-token autoregressive planner that jointly forecasts BEV semantics and ego trajectories, thereby enforcing comprehensive representation learning and allowing predicted dynamics to directly condition ego motion. In addition, we incorporate a reinforcement-learning-based fine-tuning, which preserves supervised behavior cloning priors while injecting reward-guided improvements. Despite a compact 160M parameter budget, DAP achieves state-of-the-art performance on open-loop metrics and delivers competitive closed-loop results on the NAVSIM benchmark. Overall, the fully discrete-token autoregressive formulation operating on both rasterized BEV and ego actions provides a compact yet scalable planning paradigm for autonomous driving.
zh

[CV-46] CorrectAD: A Self-Correcting Agent ic System to Improve End-to-end Planning in Autonomous Driving

【速读】:该论文旨在解决当前端到端自动驾驶系统中数据驱动方法因长尾问题(long-tail problem)导致的鲁棒性不足问题,即罕见但高风险的事故场景难以被有效识别和纠正。其解决方案的关键在于构建一个全自动的自修正智能体系统 CorrectAD,该系统包含两个核心组件:一是PM-Agent(产品管理代理),用于模拟产品管理角色并制定与失败案例相似的数据采集需求;二是DriveSora,一种基于扩散模型的视频生成方法,能够根据结构化3D布局生成时空一致的高保真视频数据,从而实现对失败案例的自动模拟与修复。该方案不依赖特定规划模型,具有良好的通用性和可扩展性,实验证明可在nuScenes和内部挑战数据集上分别纠正62.5%和49.8%的失败案例,显著降低碰撞率。

链接: https://arxiv.org/abs/2511.13297
作者: Enhui Ma,Lijun Zhou,Tao Tang,Jiahuan Zhang,Junpeng Jiang,Zhan Zhang,Dong Han,Kun Zhan,Xueyang Zhang,XianPeng Lang,Haiyang Sun,Xia Zhou,Di Lin,Kaicheng Yu
机构: Li Auto Inc.(理想汽车); Autolab, Westlake University(西湖大学Autolab实验室); Westlake University(西湖大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:End-to-end planning methods are the de facto standard of the current autonomous driving system, while the robustness of the data-driven approaches suffers due to the notorious long-tail problem (i.e., rare but safety-critical failure cases). In this work, we explore whether recent diffusion-based video generation methods (a.k.a. world models), paired with structured 3D layouts, can enable a fully automated pipeline to self-correct such failure cases. We first introduce an agent to simulate the role of product manager, dubbed PM-Agent, which formulates data requirements to collect data similar to the failure cases. Then, we use a generative model that can simulate both data collection and annotation. However, existing generative models struggle to generate high-fidelity data conditioned on 3D layouts. To address this, we propose DriveSora, which can generate spatiotemporally consistent videos aligned with the 3D annotations requested by PM-Agent. We integrate these components into our self-correcting agentic system, CorrectAD. Importantly, our pipeline is an end-to-end model-agnostic and can be applied to improve any end-to-end planner. Evaluated on both nuScenes and a more challenging in-house dataset across multiple end-to-end planners, CorrectAD corrects 62.5% and 49.8% of failure cases, reducing collision rates by 39% and 27%, respectively.
zh

[CV-47] SkyReels-Text: Fine-grained Font-Controllable Text Editing for Poster Design

【速读】:该论文旨在解决当前图像编辑模型在艺术设计(如海报设计)中对文本内容进行细粒度、字体感知的精确修改时存在的局限性,尤其是在保持视觉和谐与排版意图的同时处理多种字体风格的问题。解决方案的关键在于提出了一种名为SkyReels-Text的新颖字体可控框架,其核心创新在于无需字体标签或推理阶段微调即可实现多文本区域的同步编辑,仅需用户提供目标字体的裁剪字形片段(glyph patches),即使该字体未收录于标准字体库也能适用,从而实现了对字体家族和风格细节的前所未有的精准控制。

链接: https://arxiv.org/abs/2511.13285
作者: Yunjie Yu,Jingchen Wu,Junchen Zhu,Chunze Lin,Guibin Chen
机构: Skywork AI; Kunlun Inc. (昆仑万维)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Artistic design such as poster design often demands rapid yet precise modification of textual content while preserving visual harmony and typographic intent, especially across diverse font styles. Although modern image editing models have grown increasingly powerful, they still fall short in fine-grained, font-aware text manipulation, limiting their utility in professional design workflows such as poster editing. To address this issue, we present SkyReels-Text, a novel font-controllable framework for precise poster text editing. Our method enables simultaneous editing of multiple text regions, each rendered in distinct typographic styles, while preserving the visual appearance of non-edited regions. Notably, our model requires neither font labels nor fine-tuning during inference: users can simply provide cropped glyph patches corresponding to their desired typography, even if the font is not included in any standard library. Extensive experiments on multiple datasets, including handwrittent text benchmarks, SkyReels-Text achieves state-of-the-art performance in both text fidelity and visual realism, offering unprecedented control over font families, and stylistic nuances. This work bridges the gap between general-purpose image editing and professional-grade typographic design.
zh

[CV-48] abFlash: Efficient Table Understanding with Progressive Question Conditioning and Token Focusing AAAI2026

【速读】:该论文旨在解决表格图像理解中因冗余背景区域和缺乏针对性关注而导致的视觉表征信息量不足与效率低下的问题。现有多模态大语言模型(Multimodal Large Language Model, MLLM)通常忽视表格图像特有的结构特性,生成冗余且不具判别力的视觉特征。解决方案的关键在于提出三个核心机制:首先,采用渐进式问题条件化(progressive question conditioning),将问题信息逐步注入视觉Transformer的不同层,以生成与问题相关的视觉特征;其次,引入剪枝策略(pruning strategy)移除背景token以提升效率;最后,设计token聚焦训练策略(token focusing),促使模型将关键信息集中于保留的token中,从而缓解剪枝带来的信息损失。通过整合上述方法,作者提出了TabFlash模型,在保持最优性能的同时显著降低计算资源消耗(相比次优模型减少27% FLOPs和30%内存使用)。

链接: https://arxiv.org/abs/2511.13283
作者: Jongha Kim,Minseong Bae,Sanghyeok Lee,Jinsung Yoon,Hyunwoo J. Kim
机构: Korea University (韩国大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI 2026 (Main Technical Track)

点击查看摘要

Abstract:Table images present unique challenges for effective and efficient understanding due to the need for question-specific focus and the presence of redundant background regions. Existing Multimodal Large Language Model (MLLM) approaches often overlook these characteristics, resulting in uninformative and redundant visual representations. To address these issues, we aim to generate visual features that are both informative and compact to improve table understanding. We first propose progressive question conditioning, which injects the question into Vision Transformer layers with gradually increasing frequency, considering each layer’s capacity to handle additional information, to generate question-aware visual features. To reduce redundancy, we introduce a pruning strategy that discards background tokens, thereby improving efficiency. To mitigate information loss from pruning, we further propose token focusing, a training strategy that encourages the model to concentrate essential information in the retained tokens. By combining these approaches, we present TabFlash, an efficient and effective MLLM for table understanding. TabFlash achieves state-of-the-art performance, outperforming both open-source and proprietary MLLMs, while requiring 27% less FLOPs and 30% less memory usage compared to the second-best MLLM.
zh

[CV-49] owards Metric-Aware Multi-Person Mesh Recovery by Jointly Optimizing Human Crowd in Camera Space

【速读】:该论文旨在解决单图多人网格恢复(multi-person human mesh recovery)中的场景一致性问题,尤其是现有伪真值(pGT)生成方法多为单人中心化处理、缺乏全局优化,导致同一图像中个体间深度和尺度冲突的问题。解决方案的关键在于提出一种基于优化的方法——Depth-conditioned Translation Optimization (DTO),通过联合优化人群内所有个体在相机空间中的平移量,利用人体身高先验和单目深度估计器提供的深度线索,在最大后验概率(MAP)框架下实现场景一致的个体布局。此外,作者还提出了Metric-Aware HMR网络,通过引入相机分支和相对度量损失,直接输出具有真实尺度的人体网格与相机参数,从而提升相对深度推理和网格重建性能。

链接: https://arxiv.org/abs/2511.13282
作者: Kaiwen Wang,Kaili Zheng,Yiming Shi,Chenyi Guo,Ji Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-person human mesh recovery from a single image is a challenging task, hindered by the scarcity of in-the-wild training data. Prevailing in-the-wild human mesh pseudo-ground-truth (pGT) generation pipelines are single-person-centric, where each human is processed individually without joint optimization. This oversight leads to a lack of scene-level consistency, producing individuals with conflicting depths and scales within the same image. To address this, we introduce Depth-conditioned Translation Optimization (DTO), a novel optimization-based method that jointly refines the camera-space translations of all individuals in a crowd. By leveraging anthropometric priors on human height and depth cues from a monocular depth estimator, DTO solves for a scene-consistent placement of all subjects within a principled Maximum a posteriori (MAP) framework. Applying DTO to the 4D-Humans dataset, we construct DTO-Humans, a new large-scale pGT dataset of 0.56M high-quality, scene-consistent multi-person images, featuring dense crowds with an average of 4.8 persons per image. Furthermore, we propose Metric-Aware HMR, an end-to-end network that directly estimates human mesh and camera parameters in metric scale. This is enabled by a camera branch and a novel relative metric loss that enforces plausible relative scales. Extensive experiments demonstrate that our method achieves state-of-the-art performance on relative depth reasoning and human mesh recovery. Code and data will be released publicly.
zh

[CV-50] SF-Recon: Simplification-Free Lightweight Building Reconstruction via 3D Gaussian Splatting

【速读】:该论文旨在解决传统多视角几何重建流程在构建轻量化建筑表面模型时存在的冗余与敏感性问题,这些问题通常源于对密集重建、网格化及后续简化步骤的依赖。其核心解决方案是提出SF-Recon方法,关键在于通过三阶段优化实现无需后处理简化的高效重建:首先训练一个初始的3D高斯点阵(3D Gaussian Splatting, 3DGS)场以获得视图一致的表示;接着利用法向梯度引导的高斯优化策略提取与屋顶和墙体边界对齐的结构化点元,再通过多视角边缘一致性剪枝提升结构锐度并抑制非结构伪影;最后基于深度约束的Delaunay三角剖分将结构化高斯场转换为轻量且结构忠实的建筑网格。

链接: https://arxiv.org/abs/2511.13278
作者: Zihan Li,Tengfei Wang,Wentian Gan,Hao Zhan,Xin Wang,Zongqian Zhan
机构: School of Geodesy and Geomatics, Wuhan University, China PR. (武汉大学测绘学院,中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Lightweight building surface models are crucial for digital city, navigation, and fast geospatial analytics, yet conventional multi-view geometry pipelines remain cumbersome and quality-sensitive due to their reliance on dense reconstruction, meshing, and subsequent simplification. This work presents SF-Recon, a method that directly reconstructs lightweight building surfaces from multi-view images without post-hoc mesh simplification. We first train an initial 3D Gaussian Splatting (3DGS) field to obtain a view-consistent representation. Building structure is then distilled by a normal-gradient-guided Gaussian optimization that selects primitives aligned with roof and wall boundaries, followed by multi-view edge-consistency pruning to enhance structural sharpness and suppress non-structural artifacts without external supervision. Finally, a multi-view depth-constrained Delaunay triangulation converts the structured Gaussian field into a lightweight, structurally faithful building mesh. Based on a proposed SF dataset, the experimental results demonstrate that our SF-Recon can directly reconstruct lightweight building models from multi-view imagery, achieving substantially fewer faces and vertices while maintaining computational efficiency. Website:this https URL
zh

[CV-51] Recognition of Abnormal Events in Surveillance Videos using Weakly Supervised Dual-Encoder Models

【速读】:该论文旨在解决仅使用视频级别监督(video-level supervision)条件下,在监控视频中检测罕见且多样异常行为的问题。其解决方案的关键在于提出了一种双骨干(dual-backbone)框架,通过Top-k池化(top-k pooling)机制融合卷积神经网络(Convolutional Neural Networks, CNN)与Transformer的特征表示,从而在不依赖细粒度标注的情况下有效捕捉异常模式,在UCF-Crime数据集上实现了90.7%的AUC(Area Under the Curve)性能。

链接: https://arxiv.org/abs/2511.13276
作者: Noam Tsfaty,Avishai Weizman,Liav Cohen,Moshe Tshuva,Yehudit Aperstein
机构: Afeka College of Engineering (Afeka工程学院); Ben-Gurion University of the Negev (本古里安大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 1 figure, 1 table

点击查看摘要

Abstract:We address the challenge of detecting rare and diverse anomalies in surveillance videos using only video-level supervision. Our dual-backbone framework combines convolutional and transformer representations through top-k pooling, achieving 90.7% area under the curve (AUC) on the UCF-Crime dataset.
zh

[CV-52] Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在无人机(Unmanned Aerial Vehicle, UAV)场景中空间智能能力不足的问题,尤其是在动态环境中进行导航与空间理解时表现不佳。其解决方案的关键在于构建了一个名为SpatialSky-Bench的综合性基准测试平台,涵盖环境感知与场景理解两大类共13个子任务,并开发了包含100万样本的SpatialSky-Dataset数据集,从而系统评估和提升VLM的空间推理能力;在此基础上,提出专为无人机空间推理设计的Sky-VLM模型,该模型在多粒度与多场景下展现出卓越性能,实现了当前最优的空间智能表现,推动了面向UAV任务的VLM发展。

链接: https://arxiv.org/abs/2511.13269
作者: Lingfeng Zhang,Yuchen Zhang,Hongsheng Li,Haoxiang Fu,Yingbo Tang,Hangjun Ye,Long Chen,Xiaojun Liang,Xiaoshuai Hao,Wenbo Ding
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs), leveraging their powerful visual perception and reasoning capabilities, have been widely applied in Unmanned Aerial Vehicle (UAV) tasks. However, the spatial intelligence capabilities of existing VLMs in UAV scenarios remain largely unexplored, raising concerns about their effectiveness in navigating and interpreting dynamic environments. To bridge this gap, we introduce SpatialSky-Bench, a comprehensive benchmark specifically designed to evaluate the spatial intelligence capabilities of VLMs in UAV navigation. Our benchmark comprises two categories-Environmental Perception and Scene Understanding-divided into 13 subcategories, including bounding boxes, color, distance, height, and landing safety analysis, among others. Extensive evaluations of various mainstream open-source and closed-source VLMs reveal unsatisfactory performance in complex UAV navigation scenarios, highlighting significant gaps in their spatial capabilities. To address this challenge, we developed the SpatialSky-Dataset, a comprehensive dataset containing 1M samples with diverse annotations across various scenarios. Leveraging this dataset, we introduce Sky-VLM, a specialized VLM designed for UAV spatial reasoning across multiple granularities and contexts. Extensive experimental results demonstrate that Sky-VLM achieves state-of-the-art performance across all benchmark tasks, paving the way for the development of VLMs suitable for UAV scenarios. The source code is available at this https URL.
zh

[CV-53] SymGS : Leverag ing Local Symmetries for 3D Gaussian Splatting Compression

【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)在复杂场景下内存占用急剧增长的问题,其核心挑战在于现有压缩方法依赖于原始图元级别的冗余检测与量化,难以进一步突破压缩极限。解决方案的关键在于引入对称性感知的压缩框架SymGS,通过在场景中引入可学习的镜面(learnable mirrors),系统性地消除局部和全局的反射冗余,从而实现更高效的压缩。该方法作为插件式增强模块,可无缝集成至当前最先进的压缩方法(如HAC),显著提升压缩比——平均可达108倍,且保持高质量渲染效果,在基准数据集上相较HAC实现最高达3倍的压缩增益。

链接: https://arxiv.org/abs/2511.13264
作者: Keshav Gupta,Akshat Sanghvi,Shreyas Reddy Palley,Astitva Srivastava,Charu Sharma,Avinash Sharma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project Page: this https URL

点击查看摘要

Abstract:3D Gaussian Splatting has emerged as a transformative technique in novel view synthesis, primarily due to its high rendering speed and photorealistic fidelity. However, its memory footprint scales rapidly with scene complexity, often reaching several gigabytes. Existing methods address this issue by introducing compression strategies that exploit primitive-level redundancy through similarity detection and quantization. We aim to surpass the compression limits of such methods by incorporating symmetry-aware techniques, specifically targeting mirror symmetries to eliminate redundant primitives. We propose a novel compression framework, \textbf\textitSymGS, introducing learnable mirrors into the scene, thereby eliminating local and global reflective redundancies for compression. Our framework functions as a plug-and-play enhancement to state-of-the-art compression methods, (e.g. HAC) to achieve further compression. Compared to HAC, we achieve 1.66 \times compression across benchmark datasets (upto 3\times on large-scale scenes). On an average, SymGS enables \bf108\times compression of a 3DGS scene, while preserving rendering quality. The project page and supplementary can be found at \textbf\colorcyanthis http URL
zh

[CV-54] Building Egocentric Procedural AI Assistant: Methods Benchmarks and Challenges

【速读】:该论文旨在解决如何在第一人称视角(egocentric view)下,利用视觉语言模型(Vision Language Models, VLMs)构建一个能够支持日常程序性任务的智能助手——即“第一人称过程AI助手”(EgoProceAssist)。其核心问题在于现有VLM-based AI助手在处理以用户自身动作为中心的步骤化任务时存在功能局限,缺乏对第一人称场景中过程性错误检测、学习和问答的系统性建模。解决方案的关键在于提出并定义了三个核心任务:第一人称过程错误检测(egocentric procedural error detection)、第一人称过程学习(egocentric procedural learning)以及第一人称过程问答(egocentric procedural question answering),并通过全面梳理当前技术、数据集与评估指标,开展对比实验揭示现有方法的不足,从而为未来研究指明方向。

链接: https://arxiv.org/abs/2511.13261
作者: Junlong Li,Huaiyuan Xu,Sijie Cheng,Kejun Wu,Kim-Hui Yap,Lap-Pui Chau,Yi Wang
机构: PolyU (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 26 pages, 8 figures, 8 tables, Under peer-review

点击查看摘要

Abstract:Driven by recent advances in vision language models (VLMs) and egocentric perception research, we introduce the concept of an egocentric procedural AI assistant (EgoProceAssist) tailored to step-by-step support daily procedural tasks in a first-person view. In this work, we start by identifying three core tasks: egocentric procedural error detection, egocentric procedural learning, and egocentric procedural question answering. These tasks define the essential functions of EgoProceAssist within a new taxonomy. Specifically, our work encompasses a comprehensive review of current techniques, relevant datasets, and evaluation metrics across these three core areas. To clarify the gap between the proposed EgoProceAssist and existing VLM-based AI assistants, we introduce novel experiments and provide a comprehensive evaluation of representative VLM-based methods. Based on these findings and our technical analysis, we discuss the challenges ahead and suggest future research directions. Furthermore, an exhaustive list of this study is publicly available in an active repository that continuously collects the latest work: this https URL
zh

[CV-55] GeoX-Bench: Benchmarking Cross-View Geo-Localization and Pose Estimation Capabilities of Large Multimodal Models

【速读】:该论文旨在解决当前大型多模态模型(Large Multimodal Models, LMMs)在跨视角地理定位(cross-view geo-localization)与位姿估计(pose estimation)任务中能力尚不明确的问题,尽管这些能力对导航、自动驾驶和户外机器人等领域具有重要意义。解决方案的关键在于构建了一个名为GeoX-Bench的综合性基准测试平台,该平台包含10,859组全景图像与卫星图像配对数据(覆盖49个国家的128个城市),以及755,976个问答(QA)对,其中42,900个用于基准评测,其余用于增强LMMs的能力。通过该基准,研究发现当前LMMs在地理定位任务上表现优异,但在更复杂的位姿估计任务中性能显著下降,同时表明基于GeoX-Bench训练数据进行指令微调(instruction-tuning)可有效提升LMMs的跨视角地理感知能力。

链接: https://arxiv.org/abs/2511.13259
作者: Yushuo Zheng,Jiangyong Ying,Huiyu Duan,Chunyi Li,Zicheng Zhang,Jing Liu,Xiaohong Liu,Guangtao Zhai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large multimodal models (LMMs) have demonstrated remarkable capabilities across a wide range of tasks, however their knowledge and abilities in the cross-view geo-localization and pose estimation domains remain unexplored, despite potential benefits for navigation, autonomous driving, outdoor robotics, \textitetc. To bridge this gap, we introduce \textbfGeoX-Bench, a comprehensive \underlineBenchmark designed to explore and evaluate the capabilities of LMMs in \underlinecross-view \underlineGeo-localization and pose estimation. Specifically, GeoX-Bench contains 10,859 panoramic-satellite image pairs spanning 128 cities in 49 countries, along with corresponding 755,976 question-answering (QA) pairs. Among these, 42,900 QA pairs are designated for benchmarking, while the remaining are intended to enhance the capabilities of LMMs. Based on GeoX-Bench, we evaluate the capabilities of 25 state-of-the-art LMMs on cross-view geo-localization and pose estimation tasks, and further explore the empowered capabilities of instruction-tuning. Our benchmark demonstrate that while current LMMs achieve impressive performance in geo-localization tasks, their effectiveness declines significantly on the more complex pose estimation tasks, highlighting a critical area for future improvement, and instruction-tuning LMMs on the training data of GeoX-Bench can significantly improve the cross-view geo-sense abilities. The GeoX-Bench is available at \textcolormagentathis https URL.
zh

[CV-56] Referring Camouflaged Object Detection With Multi-Context Overlapped Windows Cross-Attention

【速读】:该论文旨在解决隐蔽目标检测(Referring Camouflaged Object Detection, Ref-COD)中如何有效融合参考图像中的显著性特征与伪装目标特征以提升检测性能的问题。其解决方案的关键在于提出RFMNet框架,通过多阶段编码特征的交互式融合机制,在对应编码层级上将参考显著图像的丰富特征与伪装目标特征进行局部区域内的精细匹配;同时引入重叠窗口交叉注意力(Overlapped Windows Cross-Attention)机制增强局部信息匹配能力,并设计参考特征聚合(Referring Feature Aggregation, RFA)模块实现伪装目标的渐进式解码与分割,从而显著提升模型对复杂背景中伪装目标的识别精度。

链接: https://arxiv.org/abs/2511.13249
作者: Yu Wen,Shuyong Gao,Shuping Zhang,Miao Huang,Lili Tao,Han Yang,Haozhe Xing,Lihe Zhang,Boxue Hou
机构: Shanghai University of Engineering Science (上海工程技术大学); Fudan University (复旦大学); Geely Group (吉利集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 7figures, This work is supported by National Nature Science Foundation of China (Grant No. 62203291)

点击查看摘要

Abstract:Referring camouflaged object detection (Ref-COD) aims to identify hidden objects by incorporating reference information such as images and text descriptions. Previous research has transformed reference images with salient objects into one-dimensional prompts, yielding significant results. We explore ways to enhance performance through multi-context fusion of rich salient image features and camouflaged object features. Therefore, we propose RFMNet, which utilizes features from multiple encoding stages of the reference salient images and performs interactive fusion with the camouflage features at the corresponding encoding stages. Given that the features in salient object images contain abundant object-related detail information, performing feature fusion within local areas is more beneficial for detecting camouflaged objects. Therefore, we propose an Overlapped Windows Cross-attention mechanism to enable the model to focus more attention on the local information matching based on reference features. Besides, we propose the Referring Feature Aggregation (RFA) module to decode and segment the camouflaged objects progressively. Extensive experiments on the Ref-COD benchmark demonstrate that our method achieves state-of-the-art performance.
zh

[CV-57] Uncovering and Mitigating Transient Blindness in Multimodal Model Editing AAAI’26

【速读】:该论文旨在解决多模态模型编辑(Multimodal Model Editing, MMED)中存在的评估偏差与过拟合问题,现有方法因依赖低相似度或随机输入而高估编辑效果,掩盖了模型在真实场景下的性能缺陷。其解决方案的关键在于提出一个全面的局部性评估框架,涵盖随机图像局部性、无图像局部性和一致图像局部性三个维度,并通过七种不同数据类型实现结构化分析;同时引入动态视觉问答评估方法 De-VQA 发现“瞬时失明”现象(即模型过度拟合编辑相似文本而忽略视觉信息),并通过局部感知对抗损失平衡跨模态表示,从而显著提升编辑的局部性和泛化能力,在实验中平均减少17%的瞬时失明并改善局部性表现。

链接: https://arxiv.org/abs/2511.13243
作者: Xiaoqi Han,Ru Li,Ran Yi,Hongye Tan,Zhuomin Liang,Víctor Gutiérrez-Basulto,Jeff Z. Pan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at AAAI’26

点击查看摘要

Abstract:Multimodal Model Editing (MMED) aims to correct erroneous knowledge in multimodal models. Existing evaluation methods, adapted from textual model editing, overstate success by relying on low-similarity or random inputs, obscure overfitting. We propose a comprehensive locality evaluation framework, covering three key dimensions: random-image locality, no-image locality, and consistent-image locality, operationalized through seven distinct data types, enabling a detailed and structured analysis of multimodal edits. We introduce De-VQA, a dynamic evaluation for visual question answering, uncovering a phenomenon we term transient blindness, overfitting to edit-similar text while ignoring visuals. Token analysis shows edits disproportionately affect textual tokens. We propose locality-aware adversarial losses to balance cross-modal representations. Empirical results demonstrate that our approach consistently outperforms existing baselines, reducing transient blindness and improving locality by 17% on average.
zh

[CV-58] MMD-Thinker: Adaptive Multi-Dimensional Thinking for Multimodal Misinformation Detection

【速读】:该论文旨在解决当前基于通用多模态大语言模型(Multimodal Large Language Models, MLLMs)在检测生成式 AI (AIGC) 时代下日益复杂且低成本传播的多模态虚假信息时所面临的两大核心问题:一是推理能力不足,即通用 MLLMs 缺乏针对多模态虚假信息检测的任务特定知识,导致其推理过程缺乏准确性;二是推理偏差,即单一思维模式难以适应快速演化、结构复杂的多模态虚假信息。解决方案的关键在于提出 MMD-Thinker 框架,该框架采用三阶段设计:首先构建面向多模态虚假信息检测的定制化思维模式(tailor-designed thinking mode),其次通过任务特定指令微调(instruction tuning)将该思维模式注入通用 MLLMs,最后引入带有混合优势函数(mixed advantage function)的强化学习策略以增强推理轨迹中的决策能力。此外,作者还构建了包含超过 8K 图文对的多模态虚假信息推理数据集(MMR dataset),为该领域研究提供高质量标注资源。实验表明,MMD-Thinker 在域内和域外基准数据集上均达到最优性能,同时具备灵活的推理路径与高效的 token 使用效率。

链接: https://arxiv.org/abs/2511.13242
作者: Junjie Wu,Guohong Fu
机构: Soochow University (苏州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal misinformation floods on various social media, and continues to evolve in the era of AI-generated content (AIGC). The emerged misinformation with low creation cost and high deception poses significant threats to society. While recent studies leverage general-purpose multimodal large language models (MLLMs) to achieve remarkable results in detection, they encounter two critical limitations: (1) Insufficient reasoning, where general-purpose MLLMs often follow the uniform reasoning paradigm but generate inaccurate explanations and judgments, due to the lack of the task-specific knowledge of multimodal misinformation detection. (2) Reasoning biases, where a single thinking mode make detectors a suboptimal path for judgment, struggling to keep pace with the fast-growing and intricate multimodal misinformation. In this paper, we propose MMD-Thinker, a two-stage framework for multimodal misinformation detection through adaptive multi-dimensional thinking. First, we develop tailor-designed thinking mode for multimodal misinformation detection. Second, we adopt task-specific instruction tuning to inject the tailored thinking mode into general-purpose MLLMs. Third, we further leverage reinforcement learning strategy with a mixed advantage function, which incentivizes the reasoning capabilities in trajectories. Furthermore, we construct the multimodal misinformation reasoning (MMR) dataset, encompasses more than 8K image-text pairs with both reasoning processes and classification labels, to make progress in the relam of multimodal misinformation detection. Experimental results demonstrate that our proposed MMD-Thinker achieves state-of-the-art performance on both in-domain and out-of-domain benchmark datasets, while maintaining flexible inference and token usage. Code will be publicly available at Github.
zh

[CV-59] MRIQT: Physics-Aware Diffusion Model for Image Quality Transfer in Neonatal Ultra-Low-Field MRI

【速读】:该论文旨在解决便携式超低场磁共振成像(ultra-low-field MRI, uLF-MRI)在新生儿脑部评估中因信噪比(signal-to-noise ratio, SNR)低而导致图像质量差、诊断可靠性不足的问题。解决方案的关键在于提出一种3D条件扩散框架MRIQT,其核心创新包括:基于物理一致性的K空间退化模拟实现uLF成像的逼真建模;采用v-prediction结合无分类器引导(classifier-free guidance)以提升图像到图像生成的稳定性;引入SNR加权的3D感知损失函数保障解剖结构保真度;并通过体积注意力UNet架构实现从噪声uLF输入到高质量HF-like图像的结构保留转换,从而显著提升图像质量并保持病理特征清晰可辨。

链接: https://arxiv.org/abs/2511.13232
作者: Malek Al Abed,Sebiha Demir,Anne Groteklaes,Elodie Germani,Shahrooz Faghihroohi,Hemmen Sabir,Shadi Albarqouni
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 4 figures

点击查看摘要

Abstract:Portable ultra-low-field MRI (uLF-MRI, 0.064 T) offers accessible neuroimaging for neonatal care but suffers from low signal-to-noise ratio and poor diagnostic quality compared to high-field (HF) MRI. We propose MRIQT, a 3D conditional diffusion framework for image quality transfer (IQT) from uLF to HF MRI. MRIQT combines realistic K-space degradation for physics-consistent uLF simulation, v-prediction with classifier-free guidance for stable image-to-image generation, and an SNR-weighted 3D perceptual loss for anatomical fidelity. The model denoises from a noised uLF input conditioned on the same scan, leveraging volumetric attention-UNet architecture for structure-preserving translation. Trained on a neonatal cohort with diverse pathologies, MRIQT surpasses recent GAN and CNN baselines in PSNR 15.3% with 1.78% over the state of the art, while physicians rated 85% of its outputs as good quality with clear pathology present. MRIQT enables high-fidelity, diffusion-based enhancement of portable ultra-low-field (uLF) MRI for deliable neonatal brain assessment.
zh

[CV-60] Hybrid-Domain Adaptative Representation Learning for Gaze Estimation AAAI2026

【速读】:该论文旨在解决基于外观的视线估计(Appearance-based gaze estimation)在跨域评估中性能显著下降的问题,其根本原因在于表情、佩戴物及图像质量等与视线无关因素的干扰。解决方案的关键在于提出了一种新型的混合域自适应表示学习框架(Hybrid-domain Adaptative Representation Learning, HARL),通过无监督域适应方式将低质量人脸图像中的视线相关特征与高质量近眼图像特征对齐,从而解耦出鲁棒的视线表示;同时设计了一个简单高效的稀疏图融合模块,利用头姿与视线之间的几何约束关系构建稠密且稳健的视线表示,显著提升了模型在跨数据集场景下的泛化能力。

链接: https://arxiv.org/abs/2511.13222
作者: Qida Tan,Hongyu Yang,Wenchao Du
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI2026

点击查看摘要

Abstract:Appearance-based gaze estimation, aiming to predict accurate 3D gaze direction from a single facial image, has made promising progress in recent years. However, most methods suffer significant performance degradation in cross-domain evaluation due to interference from gaze-irrelevant factors, such as expressions, wearables, and image quality. To alleviate this problem, we present a novel Hybrid-domain Adaptative Representation Learning (shorted by HARL) framework that exploits multi-source hybrid datasets to learn robust gaze representation. More specifically, we propose to disentangle gaze-relevant representation from low-quality facial images by aligning features extracted from high-quality near-eye images in an unsupervised domain-adaptation manner, which hardly requires any computational or inference costs. Additionally, we analyze the effect of head-pose and design a simple yet efficient sparse graph fusion module to explore the geometric constraint between gaze direction and head-pose, leading to a dense and robust gaze representation. Extensive experiments on EyeDiap, MPIIFaceGaze, and Gaze360 datasets demonstrate that our approach achieves state-of-the-art accuracy of \textbf5.02^\circ and \textbf3.36^\circ , and \textbf9.26^\circ respectively, and present competitive performances through cross-dataset evaluation. The code is available at this https URL.
zh

[CV-61] 3DAlign-DAER: Dynamic Attention Policy and Efficient Retrieval Strategy for Fine-grained 3D-Text Alignment at Scale

【速读】:该论文旨在解决当前3D-text跨模态对齐方法在细粒度语义对齐与大规模3D数据库扩展性方面的局限性,即现有方法难以将文本的细微语义精确映射到3D几何结构中,且在大规模场景下对齐性能显著下降。解决方案的关键在于提出一个统一框架3DAlign-DAER,其核心创新包括:(1)动态注意力策略(Dynamic Attention Policy, DAP),通过分层注意力融合(Hierarchical Attention Fusion, HAF)模块学习细粒度token-to-point注意力机制,并利用蒙特卡洛树搜索(Monte Carlo Tree Search)结合混合奖励信号动态校准注意力权重,从而增强文本描述与局部3D几何之间的对齐;(2)高效检索策略(Efficient Retrieval Strategy, ERS),在推理阶段采用层次化搜索机制,在大规模嵌入空间中实现高精度、高效率的跨模态检索,优于传统KNN方法。

链接: https://arxiv.org/abs/2511.13211
作者: Yijia Fan,Jusheng Zhang,Kaitong Cai,Jing Yang,Jian Wang,Keze Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite recent advancements in 3D-text cross-modal alignment, existing state-of-the-art methods still struggle to align fine-grained textual semantics with detailed geometric structures, and their alignment performance degrades significantly when scaling to large-scale 3D databases. To overcome this limitation, we introduce 3DAlign-DAER, a unified framework designed to align text and 3D geometry via the proposed dynamic attention policy and the efficient retrieval strategy, capturing subtle correspondences for diverse cross-modal retrieval and classification tasks. Specifically, during the training, our proposed dynamic attention policy (DAP) employs the Hierarchical Attention Fusion (HAF) module to represent the alignment as learnable fine-grained token-to-point attentions. To optimize these attentions across different tasks and geometric hierarchies, our DAP further exploits the Monte Carlo tree search to dynamically calibrate HAF attention weights via a hybrid reward signal and further enhances the alignment between textual descriptions and local 3D geometry. During the inference, our 3DAlign-DAER introduces an Efficient Retrieval Strategy (ERS) to leverage efficient hierarchical searching in the large-scale embedding spaces, outperforming traditional methods (e.g., KNN) in accuracy and efficiency. Furthermore, to facilitate text-3D alignment research and train our 3DAlign-DAER, we construct Align3D-2M, a large-scale dataset featuring 2M text-3D pairs, to provide sufficient fine-grained cross-modal annotations. Extensive and comprehensive experiments demonstrate the superior performance of our 3DAlign-DAER on diverse benchmarks. We will release our codes, models, and datasets.
zh

[CV-62] End-to-End Multi-Person Pose Estimation with Pose-Aware Video Transformer

【速读】:该论文旨在解决多人群体视频姿态估计(multi-person video pose estimation)中传统两阶段方法依赖启发式操作(如检测、RoI裁剪和非极大值抑制NMS)导致的准确率与效率受限问题。其核心解决方案是提出一个全端到端框架PAVE-Net,关键创新在于引入一种姿态感知的时空注意力机制(pose-aware attention mechanism),使每个姿态查询能够跨帧选择性聚合对应个体的特征,从而实现复杂重叠轨迹下的精准时序关联,并通过空间编码器建模帧内关系、时空解码器捕捉全局依赖,显著提升性能。

链接: https://arxiv.org/abs/2511.13208
作者: Yonghui Yu,Jiahang Cai,Xun Wang,Wenwu Yang
机构: Zhejiang Gongshang University (浙江工商大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing multi-person video pose estimation methods typically adopt a two-stage pipeline: detecting individuals in each frame, followed by temporal modeling for single-person pose estimation. This design relies on heuristic operations such as detection, RoI cropping, and non-maximum suppression (NMS), limiting both accuracy and efficiency. In this paper, we present a fully end-to-end framework for multi-person 2D pose estimation in videos, effectively eliminating heuristic operations. A key challenge is to associate individuals across frames under complex and overlapping temporal trajectories. To address this, we introduce a novel Pose-Aware Video transformEr Network (PAVE-Net), which features a spatial encoder to model intra-frame relations and a spatiotemporal pose decoder to capture global dependencies across frames. To achieve accurate temporal association, we propose a pose-aware attention mechanism that enables each pose query to selectively aggregate features corresponding to the same individual across consecutive this http URL, we explicitly model spatiotemporal dependencies among pose keypoints to improve accuracy. Notably, our approach is the first end-to-end method for multi-frame 2D human pose this http URL experiments show that PAVE-Net substantially outperforms prior image-based end-to-end methods, achieving a \textbf6.0 mAP improvement on PoseTrack2017, and delivers accuracy competitive with state-of-the-art two-stage video-based approaches, while offering significant gains in this http URL page: this https URL
zh

[CV-63] PIGEON: VLM-Driven Object Navigation via Points of Interest Selection

【速读】:该论文旨在解决在未知环境中导航至指定目标物体时,现有方法难以平衡决策频率与智能性的问题,导致决策缺乏前瞻性或动作连续性不足。解决方案的关键在于提出PIGEON框架,其通过在探索过程中维护轻量级且语义对齐的快照记忆(snapshot memory),作为探索策略的语义输入,并利用大视觉语言模型(VLM)识别探索中形成的兴趣点(Point of Interest, PoI),再由低层规划器输出具体动作,从而提升决策频率;同时,基于PoI的决策机制可生成适用于模拟器的可验证奖励强化学习(Reinforcement Learning with Verifiable Reward, RLVR)数据,增强模型的语义引导能力,实现实时导航中的深度推理。

链接: https://arxiv.org/abs/2511.13207
作者: Cheng Peng,Zhenzhe Zhang,Cheng Chi,Xiaobao Wei,Yanhao Zhang,Heng Wang,Pengwei Wang,Zhongyuan Wang,Jing Liu,Shanghang Zhang
机构: 1. University of Science and Technology of China (中国科学技术大学);
2. Alibaba Group (阿里巴巴集团);
3. Tongji University (同济大学);
4. National University of Singapore (新加坡国立大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Navigating to a specified object in an unknown environment is a fundamental yet challenging capability of embodied intelligence. However, current methods struggle to balance decision frequency with intelligence, resulting in decisions lacking foresight or discontinuous actions. In this work, we propose PIGEON: Point of Interest Guided Exploration for Object Navigation with VLM, maintaining a lightweight and semantically aligned snapshot memory during exploration as semantic input for the exploration strategy. We use a large Visual-Language Model (VLM), named PIGEON-VL, to select Points of Interest (PoI) formed during exploration and then employ a lower-level planner for action output, increasing the decision frequency. Additionally, this PoI-based decision-making enables the generation of Reinforcement Learning with Verifiable Reward (RLVR) data suitable for simulators. Experiments on classic object navigation benchmarks demonstrate that our zero-shot transfer method achieves state-of-the-art performance, while RLVR further enhances the model’s semantic guidance capabilities, enabling deep reasoning during real-time navigation.
zh

[CV-64] RefineVAD: Semantic-Guided Feature Recalibration for Weakly Supervised Video Anomaly Detection AAAI2026

【速读】:该论文旨在解决弱监督视频异常检测(Weakly-Supervised Video Anomaly Detection, WSVAD)中因将所有异常事件视为单一类别而导致的异常空间过度简化问题,从而忽略了真实世界中异常事件在语义和时间特性上的多样性。其解决方案的关键在于提出RefineVAD框架,该框架通过双过程推理机制联合建模时间动态与语义结构:一是引入Motion-aware Temporal Attention and Recalibration (MoTAR)模块,利用基于位移的注意力机制与全局Transformer建模动态调整时间关注区域;二是设计Category-Oriented Refinement (CORE)模块,通过交叉注意力机制将片段级特征与可学习类别原型对齐,注入软异常类别先验,实现对异常相关特征的有效精炼。该方法显式地建模了“运动如何演化”与“其对应何种语义类别”的双重信息,显著提升了异常检测的准确性与泛化能力。

链接: https://arxiv.org/abs/2511.13204
作者: Junhee Lee,ChaeBeen Bang,MyoungChul Kim,MyeongAh Cho
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2026

点击查看摘要

Abstract:Weakly-Supervised Video Anomaly Detection aims to identify anomalous events using only video-level labels, balancing annotation efficiency with practical applicability. However, existing methods often oversimplify the anomaly space by treating all abnormal events as a single category, overlooking the diverse semantic and temporal characteristics intrinsic to real-world anomalies. Inspired by how humans perceive anomalies, by jointly interpreting temporal motion patterns and semantic structures underlying different anomaly types, we propose RefineVAD, a novel framework that mimics this dual-process reasoning. Our framework integrates two core modules. The first, Motion-aware Temporal Attention and Recalibration (MoTAR), estimates motion salience and dynamically adjusts temporal focus via shift-based attention and global Transformer-based modeling. The second, Category-Oriented Refinement (CORE), injects soft anomaly category priors into the representation space by aligning segment-level features with learnable category prototypes through cross-attention. By jointly leveraging temporal dynamics and semantic structure, explicitly models both “how” motion evolves and “what” semantic category it resembles. Extensive experiments on WVAD benchmark validate the effectiveness of RefineVAD and highlight the importance of integrating semantic context to guide feature refinement toward anomaly-relevant patterns.
zh

[CV-65] Self-Supervised Ultrasound Screen Detection

【速读】:该论文旨在解决超声(Ultrasound, US)图像在临床系统中传输依赖DICOM标准所带来的瓶颈问题,从而限制了新算法的快速测试与原型开发。解决方案的关键在于提出了一种自监督(self-supervised)图像处理流程,能够从对设备显示器拍摄的照片中提取出高质量的超声图像,有效绕过DICOM传输环节。实验表明,经校正后的图像保留了足够的视觉保真度,可实现心脏切面分类任务,其平衡准确率达到0.79,与原始DICOM图像性能相当。

链接: https://arxiv.org/abs/2511.13197
作者: Alberto Gomez,Jorge Oliveira,Ramon Casero,Agis Chartsias
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ISBI 2026

点击查看摘要

Abstract:Ultrasound (US) machines display images on a built-in monitor, but routine transfer to hospital systems relies on DICOM. We propose a self-supervised pipeline to extract the US image from a photograph of the monitor. This removes the DICOM bottleneck and enables rapid testing and prototyping of new algorithms. In a proof-of-concept study, the rectified images retained enough visual fidelity to classify cardiac views with a balanced accuracy of 0.79 with respect to the native DICOMs.
zh

[CV-66] Difficulty-Aware Label-Guided Denoising for Monocular 3D Object Detection AAAI2026

【速读】:该论文旨在解决单目3D目标检测中因深度信息先天模糊而导致的精度不足问题,以及现有基于DETR的方法在处理遮挡、距离远近和截断等实例级难度差异时表现不佳的问题。其解决方案的关键在于提出MonoDLGD框架,该框架通过自适应地根据检测不确定性对真实标签(ground-truth labels)施加不同程度的扰动(易例施加强扰动,难例施加弱扰动),并重建这些扰动后的标签以提供显式的几何监督信号。这种机制联合优化标签重建与3D目标检测任务,从而促进几何感知表示学习,并提升模型对不同复杂度目标的鲁棒性。

链接: https://arxiv.org/abs/2511.13195
作者: Soyul Lee,Seungmin Baek,Dongbo Min
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI 2026 accepted

点击查看摘要

Abstract:Monocular 3D object detection is a cost-effective solution for applications like autonomous driving and robotics, but remains fundamentally ill-posed due to inherently ambiguous depth cues. Recent DETR-based methods attempt to mitigate this through global attention and auxiliary depth prediction, yet they still struggle with inaccurate depth estimates. Moreover, these methods often overlook instance-level detection difficulty, such as occlusion, distance, and truncation, leading to suboptimal detection performance. We propose MonoDLGD, a novel Difficulty-Aware Label-Guided Denoising framework that adaptively perturbs and reconstructs ground-truth labels based on detection uncertainty. Specifically, MonoDLGD applies stronger perturbations to easier instances and weaker ones into harder cases, and then reconstructs them to effectively provide explicit geometric supervision. By jointly optimizing label reconstruction and 3D object detection, MonoDLGD encourages geometry-aware representation learning and improves robustness to varying levels of object complexity. Extensive experiments on the KITTI benchmark demonstrate that MonoDLGD achieves state-of-the-art performance across all difficulty levels.
zh

[CV-67] Birth of a Painting: Differentiable Brushstroke Reconstruction

【速读】:该论文旨在解决现有生成式模型在绘画合成中缺乏显式笔触结构、无法实现自然平滑着色的问题,即如何在保留人类绘画-涂抹循环(painting-smudging loop)特征的前提下,生成具有真实笔触轨迹与细腻色调过渡的风格化数字绘画。其解决方案的关键在于提出一个可微分的笔触重建框架,该框架通过并行可微分的涂色渲染器优化单色和双色贝塞尔笔触(Bezier strokes),引入基于几何条件的纹理生成模块以适配多种绘画风格,并设计了一个可微分的涂抹算子来实现自然的颜色混合与明暗渐变;结合粗到精的优化策略,在几何与语义引导下联合优化笔触形状、颜色及纹理,从而实现高保真度的绘画过程重建与风格化表达。

链接: https://arxiv.org/abs/2511.13191
作者: Ying Jiang,Jiayin Lu,Yunuo Chen,Yumeng He,Kui Wu,Yin Yang,Chenfanfu Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages

点击查看摘要

Abstract:Painting embodies a unique form of visual storytelling, where the creation process is as significant as the final artwork. Although recent advances in generative models have enabled visually compelling painting synthesis, most existing methods focus solely on final image generation or patch-based process simulation, lacking explicit stroke structure and failing to produce smooth, realistic shading. In this work, we present a differentiable stroke reconstruction framework that unifies painting, stylized texturing, and smudging to faithfully reproduce the human painting-smudging loop. Given an input image, our framework first optimizes single- and dual-color Bezier strokes through a parallel differentiable paint renderer, followed by a style generation module that synthesizes geometry-conditioned textures across diverse painting styles. We further introduce a differentiable smudge operator to enable natural color blending and shading. Coupled with a coarse-to-fine optimization strategy, our method jointly optimizes stroke geometry, color, and texture under geometric and semantic guidance. Extensive experiments on oil, watercolor, ink, and digital paintings demonstrate that our approach produces realistic and expressive stroke reconstructions, smooth tonal transitions, and richly stylized appearances, offering a unified model for expressive digital painting creation. See our project page for more demos: this https URL.
zh

[CV-68] Video Spatial Reasoning with Object-Centric 3D Rollout

【速读】:该论文旨在解决多模态大语言模型(Multi-modal Large Language Models, MLLMs)在视频空间推理(video spatial reasoning)任务中的局限性,即模型常表现出“查询锁定”(query-locked reasoning)现象——仅关注提示中明确提及的对象,而忽略场景中的关键上下文线索。为克服这一问题,作者提出了一种名为“基于物体中心的3D滚动策略”(Object-Centric 3D Rollout, OCR)的新方法,其核心在于训练过程中对选定物体施加结构化的3D几何扰动,通过降低特定物体的视觉特征并将其投影至2D空间,迫使模型从整体场景出发进行推理。此外,论文设计了一个基于滚动(rollout)的训练流程,联合优化标准视频与区域噪声视频,从而提升空间关系理解能力。实验表明,该方法在VSI-Bench基准上达到47.5%准确率,优于多个7B参数基线模型,且消融实验证明OCR显著优于现有滚动策略(如T-GRPO和NoisyRollout)。

链接: https://arxiv.org/abs/2511.13190
作者: Haoran Tang,Meng Cao,Ruyang Liu,Xiaoxi Liang,Linglong Li,Ge Li,Xiaodan Liang
机构: 1. University of Science and Technology of China (中国科学技术大学); 2. Alibaba Group (阿里巴巴集团); 3. Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in Multi-modal Large Language Models (MLLMs) have showcased remarkable capabilities in vision-language understanding. However, enabling robust video spatial reasoning-the ability to comprehend object locations, orientations, and inter-object relationships in dynamic 3D scenes-remains a key unsolved challenge. Existing approaches primarily rely on spatially grounded supervised fine-tuning or reinforcement learning, yet we observe that such models often exhibit query-locked reasoning, focusing narrowly on objects explicitly mentioned in the prompt while ignoring critical contextual cues. To address this limitation, we propose Object-Centric 3D Rollout (OCR), a novel strategy that introduces structured perturbations to the 3D geometry of selected objects during training. By degrading object-specific visual cues and projecting the altered geometry into 2D space, OCR compels the model to reason holistically across the entire scene. We further design a rollout-based training pipeline that jointly leverages vanilla and region-noisy videos to optimize spatial reasoning trajectories. Experiments demonstrate state-of-the-art performance: our 3B-parameter model achieves 47.5% accuracy on VSI-Bench, outperforming several 7B baselines. Ablations confirm OCR’s superiority over prior rollout strategies (e.g., T-GRPO, NoisyRollout).
zh

[CV-69] Large Language Models Meet Extreme Multi-label Classification: Scaling and Multi-modal Framework AAAI2026

【速读】:该论文旨在解决极端多标签分类(Extreme Multi-label Classification, XMC)中如何有效利用大规模解码器模型和视觉信息以提升性能,同时保持计算效率的问题。其关键解决方案在于:首先,采用参数量达数十亿的解码器-only 模型(decoder-only models)来增强文本表示能力,显著提升分类性能且控制计算开销;其次,提出 Vision-enhanced eXtreme Multi-label Learning (ViXML) 框架,通过为每张图像提取单一嵌入(single embedding per image)高效融合视觉信息,实现多模态建模而不显著增加计算复杂度。实验表明,ViXML 在多个公共文本-only 数据集及其图像增强版本上均优于现有最优方法,验证了视觉信息与大模型协同作用的有效性。

链接: https://arxiv.org/abs/2511.13189
作者: Diego Ortego,Marlon Rodríguez,Mario Almagro,Kunal Dahiya,David Jiménez,Juan C. SanMiguel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: To appear at AAAI 2026

点击查看摘要

Abstract:Foundation models have revolutionized artificial intelligence across numerous domains, yet their transformative potential remains largely untapped in Extreme Multi-label Classification (XMC). Queries in XMC are associated with relevant labels from extremely large label spaces, where it is critical to strike a balance between efficiency and performance. Therefore, many recent approaches efficiently pose XMC as a maximum inner product search between embeddings learned from small encoder-only transformer architectures. In this paper, we address two important aspects in XMC: how to effectively harness larger decoder-only models, and how to exploit visual information while maintaining computational efficiency. We demonstrate that both play a critical role in XMC separately and can be combined for improved performance. We show that a few billion-size decoder can deliver substantial improvements while keeping computational overhead manageable. Furthermore, our Vision-enhanced eXtreme Multi-label Learning framework (ViXML) efficiently integrates foundation vision models by pooling a single embedding per image. This limits computational growth while unlocking multi-modal capabilities. Remarkably, ViXML with small encoders outperforms text-only decoder in most cases, showing that an image is worth billions of parameters. Finally, we present an extension of existing text-only datasets to exploit visual metadata and make them available for future benchmarking. Comprehensive experiments across four public text-only datasets and their corresponding image enhanced versions validate our proposals’ effectiveness, surpassing previous state-of-the-art by up to +8.21% in P@1 on the largest dataset. ViXML’s code is available at this https URL.
zh

[CV-70] GenTract: Generative Global Tractography

【速读】:该论文旨在解决局部纤维束追踪方法在噪声或低分辨率扩散磁共振成像(dMRI)数据上易出现误差累积和高假阳性率的问题,以及全局方法计算成本过高的局限性。其解决方案的关键在于提出GenTract,首个用于全局纤维束追踪的生成式模型(Generative Model),将追踪任务建模为从dMRI直接映射到完整且解剖学合理的纤维束轨迹(streamlines)的生成任务,通过扩散模型(diffusion-based)与流匹配(flow matching)两种范式进行优化,在保持高精度的同时显著提升对低质量数据的鲁棒性,相较现有最优方法TractOracle在精度上提升2.1倍,并在挑战性场景下性能优势达一个数量级。

链接: https://arxiv.org/abs/2511.13183
作者: Alec Sargood,Lemuel Puglisi,Elinor Thompson,Mirco Musolesi,Daniel C. Alexander
机构: Hawkes Institute and Department of Computer Science, University College London, UK (伦敦大学学院计算机科学系和霍克斯研究所); Department of Maths and Computer Science, University of Catania, Italy (卡塔尼亚大学数学与计算机科学系); AI Centre and Department of Computer Science, University College London, UK (伦敦大学学院计算机科学系和人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Tractography is the process of inferring the trajectories of white-matter pathways in the brain from diffusion magnetic resonance imaging (dMRI). Local tractography methods, which construct streamlines by following local fiber orientation estimates stepwise through an image, are prone to error accumulation and high false positive rates, particularly on noisy or low-resolution data. In contrast, global methods, which attempt to optimize a collection of streamlines to maximize compatibility with underlying fiber orientation estimates, are computationally expensive. To address these challenges, we introduce GenTract, the first generative model for global tractography. We frame tractography as a generative task, learning a direct mapping from dMRI to complete, anatomically plausible streamlines. We compare both diffusion-based and flow matching paradigms and evaluate GenTract’s performance against state-of-the-art baselines. Notably, GenTract achieves precision 2.1x higher than the next-best method, TractOracle. This advantage becomes even more pronounced in challenging low-resolution and noisy settings, where it outperforms the closest competitor by an order of magnitude. By producing tractograms with high precision on research-grade data while also maintaining reliability on imperfect, lower-resolution data, GenTract represents a promising solution for global tractography.
zh

[CV-71] HDW-SR: High-Frequency Guided Diffusion Model based on Wavelet Decomposition for Image Super-Resolution

【速读】:该论文旨在解决基于扩散模型的单图像超分辨率(Single Image Super-Resolution, SISR)方法在高频率域引导不足导致重建图像细节模糊的问题。解决方案的关键在于提出一种基于小波分解的高频引导扩散网络(High-Frequency Guided Diffusion Network based on Wavelet Decomposition, HDW-SR),其核心创新包括:1)仅对残差图进行扩散,使网络更聚焦于高频信息恢复;2)用小波下采样替代传统CNN下采样,实现多尺度频域分解,并引入高频子带与低频子带之间的稀疏交叉注意力机制以实现显式高频引导;3)设计动态阈值块(Dynamic Thresholding Block, DTB)优化稀疏注意力过程中的高频选择,同时利用小波变换的可逆性保障上采样时特征的低损失重建。

链接: https://arxiv.org/abs/2511.13175
作者: Chao Yang,Boqian Zhang,Jinghao Xu,Guang Jiang
机构: Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion-based methods have shown great promise in single image super-resolution (SISR); however, existing approaches often produce blurred fine details due to insufficient guidance in the high-frequency domain. To address this issue, we propose a High-Frequency Guided Diffusion Network based on Wavelet Decomposition (HDW-SR), which replaces the conventional U-Net backbone in diffusion frameworks. Specifically, we perform diffusion only on the residual map, allowing the network to focus more effectively on high-frequency information restoration. We then introduce wavelet-based downsampling in place of standard CNN downsampling to achieve multi-scale frequency decomposition, enabling sparse cross-attention between the high-frequency subbands of the pre-super-resolved image and the low-frequency subbands of the diffused image for explicit high-frequency guidance. Moreover, a Dynamic Thresholding Block (DTB) is designed to refine high-frequency selection during the sparse attention process. During upsampling, the invertibility of the wavelet transform ensures low-loss feature reconstruction. Experiments on both synthetic and real-world datasets demonstrate that HDW-SR achieves competitive super-resolution performance, excelling particularly in recovering fine-grained image details. The code will be available after acceptance.
zh

[CV-72] HIR: Topological Histopathological Image Retrieval

【速读】:该论文旨在解决乳腺癌早期诊断中医学图像检索的效率与准确性问题,尤其针对现有深度学习方法依赖大规模标注数据、昂贵计算资源及复杂训练流程的局限性。其解决方案的关键在于提出一种无需监督的新型内容感知医学图像检索(Content-Based Medical Image Retrieval, CBMIR)框架THIR,该框架利用拓扑数据分析中的贝蒂数(Betti numbers)通过立方体持久同调(cubical persistence)直接从RGB组织病理图像中提取结构特征,生成紧凑且可解释的拓扑指纹(topological fingerprints),进而基于这些特征向量计算图像间距离实现高效相似性检索。该方法在不依赖训练的前提下实现了快速、可扩展的临床图像匹配,在BreaKHis数据集上优于当前最先进的有监督和无监督方法,且单机CPU可在20分钟内完成全量图像检索。

链接: https://arxiv.org/abs/2511.13170
作者: Zahra Tabatabaei,Jon Sporring
机构: Københavns Universitet (哥本哈根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:According to the World Health Organization, breast cancer claimed the lives of approximately 685,000 women in 2020. Early diagnosis and accurate clinical decision making are critical in reducing this global burden. In this study, we propose THIR, a novel Content-Based Medical Image Retrieval (CBMIR) framework that leverages topological data analysis specifically, Betti numbers derived from persistent homology to characterize and retrieve histopathological images based on their intrinsic structural patterns. Unlike conventional deep learning approaches that rely on extensive training, annotated datasets, and powerful GPU resources, THIR operates entirely without supervision. It extracts topological fingerprints directly from RGB histopathological images using cubical persistence, encoding the evolution of loops as compact, interpretable feature vectors. The similarity retrieval is then performed by computing the distances between these topological descriptors, efficiently returning the top-K most relevant matches. Extensive experiments on the BreaKHis dataset demonstrate that THIR outperforms state of the art supervised and unsupervised methods. It processes the entire dataset in under 20 minutes on a standard CPU, offering a fast, scalable, and training free solution for clinical image retrieval. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2511.13170 [cs.CV] (or arXiv:2511.13170v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2511.13170 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-73] SOMA: Feature Gradient Enhanced Affine-Flow Matching for SAR-Optical Registration

【速读】:该论文旨在解决合成孔径雷达(SAR)与光学图像之间的像素级配准难题,该问题因二者成像机制和视觉特征的根本差异而极具挑战性。现有深度学习方法在SAR-光学图像匹配任务中的表现仍不理想,主要原因是传统依赖梯度信息的描述子虽能突出结构差异,但未被有效整合进深度学习框架中。解决方案的关键在于提出SOMA框架,其核心创新包括:(1)引入特征梯度增强模块(Feature Gradient Enhancer, FGE),通过注意力机制和重建策略将多尺度、多方向梯度滤波器嵌入特征空间,提升特征区分度;(2)设计全局-局部仿射-光流匹配器(Global-Local Affine-Flow Matcher, GLAM),在粗到精架构中融合仿射变换与基于流的精修策略,兼顾整体结构一致性与局部精度。实验表明,该方法显著提升了配准精度,并展现出良好的鲁棒性和跨场景泛化能力。

链接: https://arxiv.org/abs/2511.13168
作者: Haodong Wang,Tao Zhuo,Xiuwei Zhang,Hanlin Yin,Wencong Wu,Yanning Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Achieving pixel-level registration between SAR and optical images remains a challenging task due to their fundamentally different imaging mechanisms and visual characteristics. Although deep learning has achieved great success in many cross-modal tasks, its performance on SAR-Optical registration tasks is still unsatisfactory. Gradient-based information has traditionally played a crucial role in handcrafted descriptors by highlighting structural differences. However, such gradient cues have not been effectively leveraged in deep learning frameworks for SAR-Optical image matching. To address this gap, we propose SOMA, a dense registration framework that integrates structural gradient priors into deep features and refines alignment through a hybrid matching strategy. Specifically, we introduce the Feature Gradient Enhancer (FGE), which embeds multi-scale, multi-directional gradient filters into the feature space using attention and reconstruction mechanisms to boost feature distinctiveness. Furthermore, we propose the Global-Local Affine-Flow Matcher (GLAM), which combines affine transformation and flow-based refinement within a coarse-to-fine architecture to ensure both structural consistency and local accuracy. Experimental results demonstrate that SOMA significantly improves registration precision, increasing the CMR@1px by 12.29% on the SEN1-2 dataset and 18.50% on the GFGE_SO dataset. In addition, SOMA exhibits strong robustness and generalizes well across diverse scenes and resolutions.
zh

[CV-74] Skeletons Speak Louder than Text: A Motion-Aware Pretraining Paradigm for Video-Based Person Re-Identification

【速读】:该论文旨在解决视频行人重识别(Video-based Person Re-Identification, Video ReID)中现有方法依赖文本模态导致的两个核心问题:一是缺乏真正意义上的多模态预训练,二是文本难以捕捉细粒度的时间运动信息——而这是区分视频中个体身份的关键特征。其解决方案的关键在于提出首个基于骨骼序列驱动的预训练框架CSIP-ReID,通过两个创新机制实现:第一阶段采用对比学习在序列层面对齐骨骼与视觉特征;第二阶段引入动态原型融合更新器(Prototype Fusion Updater, PFU),融合运动与外观线索以优化身份原型;同时设计骨骼引导的时间建模模块(Skeleton Guided Temporal Modeling, SGTM),从骨骼数据中提取时序信息并注入到视觉特征中。该方法实现了无需标注的、运动感知的预训练范式,在多个标准视频ReID和纯骨骼ReID数据集上均取得显著性能提升。

链接: https://arxiv.org/abs/2511.13150
作者: Rifen Lin,Alex Jinpeng Wang,Jiawei Mo,Min Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal pretraining has revolutionized visual understanding, but its impact on video-based person re-identification (ReID) remains underexplored. Existing approaches often rely on video-text pairs, yet suffer from two fundamental limitations: (1) lack of genuine multimodal pretraining, and (2) text poorly captures fine-grained temporal motion-an essential cue for distinguishing identities in video. In this work, we take a bold departure from text-based paradigms by introducing the first skeleton-driven pretraining framework for ReID. To achieve this, we propose Contrastive Skeleton-Image Pretraining for ReID (CSIP-ReID), a novel two-stage method that leverages skeleton sequences as a spatiotemporally informative modality aligned with video frames. In the first stage, we employ contrastive learning to align skeleton and visual features at sequence level. In the second stage, we introduce a dynamic Prototype Fusion Updater (PFU) to refine multimodal identity prototypes, fusing motion and appearance cues. Moreover, we propose a Skeleton Guided Temporal Modeling (SGTM) module that distills temporal cues from skeleton data and integrates them into visual features. Extensive experiments demonstrate that CSIP-ReID achieves new state-of-the-art results on standard video ReID benchmarks (MARS, LS-VID, iLIDS-VID). Moreover, it exhibits strong generalization to skeleton-only ReID tasks (BIWI, IAS), significantly outperforming previous methods. CSIP-ReID pioneers an annotation-free and motion-aware pretraining paradigm for ReID, opening a new frontier in multimodal representation learning.
zh

[CV-75] Automated Road Distress Detection Using Vision Transformersand Generative Adversarial Networks

【速读】:该论文旨在解决美国道路基础设施管理效率低下问题,特别是传统人工或激光检测方法成本高、耗时长的痛点。其核心解决方案是利用自动驾驶车辆提供的实时视觉数据,结合先进的计算机视觉(Computer Vision, CV)技术进行道路病害分割。关键创新在于:首先采用生成对抗网络(Generative Adversarial Networks, GANs)生成合成数据以提升模型训练效果,其次对比卷积神经网络(Convolutional Neural Networks, CNNs)与基于Transformer的MaskFormer模型在道路病害分割任务中的性能表现,结果表明GAN合成数据可有效提高模型性能,且MaskFormer在mAP50和IoU两个指标上优于CNN模型。

链接: https://arxiv.org/abs/2511.13145
作者: Cesar Portocarrero Rodriguez,Laura Vandeweyen,Yosuke Yamamoto
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The American Society of Civil Engineers has graded Americas infrastructure condition as a C, with the road system receiving a dismal D. Roads are vital to regional economic viability, yet their management, maintenance, and repair processes remain inefficient, relying on outdated manual or laser-based inspection methods that are both costly and time-consuming. With the increasing availability of real-time visual data from autonomous vehicles, there is an opportunity to apply computer vision (CV) methods for advanced road monitoring, providing insights to guide infrastructure rehabilitation efforts. This project explores the use of state-of-the-art CV techniques for road distress segmentation. It begins by evaluating synthetic data generated with Generative Adversarial Networks (GANs) to assess its usefulness for model training. The study then applies Convolutional Neural Networks (CNNs) for road distress segmentation and subsequently examines the transformer-based model MaskFormer. Results show that GAN-generated data improves model performance and that MaskFormer outperforms the CNN model in two metrics: mAP50 and IoU.
zh

[CV-76] WinMamba: Multi-Scale Shifted Windows in State Space Model for 3D Object Detection

【速读】:该论文旨在解决3D目标检测中如何在保证计算效率的同时有效捕捉长距离空间依赖关系的问题。现有方法通常依赖于固定窗口内的轴对齐扫描,导致空间信息丢失。其解决方案的关键在于提出WinMamba,一种基于Mamba架构的新型3D特征编码主干网络,通过堆叠WinMamba模块实现高效建模;其中,核心创新包括:1)引入窗口尺度自适应模块(Window-Scale-Adaptive Module, WSA),在采样过程中补偿不同分辨率下的体素特征以增强多尺度表示能力;2)在线性状态空间内设计可学习位置编码与窗口移位策略(Window-Shift Strategy, WSF),从而获取更丰富的上下文线索。实验表明,该方法在KITTI和Waymo数据集上显著优于基线模型。

链接: https://arxiv.org/abs/2511.13138
作者: Longhui Zheng,Qiming Xia,Xiaolu Chen,Zhaoliang Liu,Chenglu Wen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 3 figures,

点击查看摘要

Abstract:3D object detection is critical for autonomous driving, yet it remains fundamentally challenging to simultaneously maximize computational efficiency and capture long-range spatial dependencies. We observed that Mamba-based models, with their linear state-space design, capture long-range dependencies at lower cost, offering a promising balance between efficiency and accuracy. However, existing methods rely on axis-aligned scanning within a fixed window, inevitably discarding spatial information. To address this problem, we propose WinMamba, a novel Mamba-based 3D feature-encoding backbone composed of stacked WinMamba blocks. To enhance the backbone with robust multi-scale representation, the WinMamba block incorporates a window-scale-adaptive module that compensates voxel features across varying resolutions during sampling. Meanwhile, to obtain rich contextual cues within the linear state space, we equip the WinMamba layer with a learnable positional encoding and a window-shift strategy. Extensive experiments on the KITTI and Waymo datasets demonstrate that WinMamba significantly outperforms the baseline. Ablation studies further validate the individual contributions of the WSF and AWF modules in improving detection accuracy. The code will be made publicly available.
zh

[CV-77] MedGEN-Bench: Contextually entangled benchmark for open-ended multimodal medical generation CVPR2026

【速读】:该论文旨在解决当前医学视觉语言模型(Vision-Language Models, VLMs)在医疗应用中面临的三大核心问题:一是现有医学视觉基准测试依赖模糊查询,缺乏与图像内容的充分关联;二是诊断推理过程被过度简化为封闭式选项,无法体现真实临床复杂性;三是评估范式以文本为中心,忽视了生成式 AI(Generative AI)在图像生成能力上的关键作用。解决方案的关键在于提出 MedGEN-Bench——一个涵盖六种成像模态、16项临床任务和28个子任务的多模态基准,其结构包含视觉问答(Visual Question Answering)、图像编辑和情境化多模态生成三类任务,强调上下文交织的指令设计以促进跨模态推理,并引入三级评估框架(像素级指标、语义文本分析与专家指导的临床相关性评分),从而全面衡量模型在生成高质量、临床可用医学图像方面的综合能力。

链接: https://arxiv.org/abs/2511.13135
作者: Junjie Yang,Yuhao Yan,Gang Wu,Yuxuan Wang,Ruoyu Liang,Xinjie Jiang,Xiang Wan,Fenglei Fan,Yongquan Zhang,Feiwei Qin,Changmiao Wan
机构: South China University of Technology (华南理工大学); Sun Yat-sen University (中山大学); Hangzhou Dianzi University (杭州电子科技大学); Zhejiang University of Finance & Economics (浙江财经大学); National University of Singapore (新加坡国立大学); Shenzhen Research Institute of Big Data (深圳大数据研究院); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Under Review

点击查看摘要

Abstract:As Vision-Language Models (VLMs) increasingly gain traction in medical applications, clinicians are progressively expecting AI systems not only to generate textual diagnoses but also to produce corresponding medical images that integrate seamlessly into authentic clinical workflows. Despite the growing interest, existing medical visual benchmarks present notable limitations. They often rely on ambiguous queries that lack sufficient relevance to image content, oversimplify complex diagnostic reasoning into closed-ended shortcuts, and adopt a text-centric evaluation paradigm that overlooks the importance of image generation capabilities. To address these challenges, we introduce \textscMedGEN-Bench, a comprehensive multimodal benchmark designed to advance medical AI research. MedGEN-Bench comprises 6,422 expert-validated image-text pairs spanning six imaging modalities, 16 clinical tasks, and 28 subtasks. It is structured into three distinct formats: Visual Question Answering, Image Editing, and Contextual Multimodal Generation. What sets MedGEN-Bench apart is its focus on contextually intertwined instructions that necessitate sophisticated cross-modal reasoning and open-ended generative outputs, moving beyond the constraints of multiple-choice formats. To evaluate the performance of existing systems, we employ a novel three-tier assessment framework that integrates pixel-level metrics, semantic text analysis, and expert-guided clinical relevance scoring. Using this framework, we systematically assess 10 compositional frameworks, 3 unified models, and 5 VLMs.
zh

[CV-78] Shedding Light on VLN Robustness: A Black-box Framework for Indoor Lighting-based Adversarial Attack

【速读】:该论文旨在解决视觉-语言导航(Vision-and-Language Navigation, VLN)代理在真实室内环境中鲁棒性不足的问题。现有对抗评估多依赖于罕见的人工纹理扰动,其实际应用价值有限;而本文聚焦于室内照明这一自然且常被忽视的场景属性,提出了一种基于室内照明的对抗攻击(Indoor Lighting-based Adversarial Attack, ILA)框架,通过操控全局光照来干扰VLN代理的决策。其关键创新在于设计了两种现实场景驱动的攻击模式:静态室内照明攻击(SILA)和动态室内照明攻击(DILA),分别模拟恒定光照与关键时刻开关灯引起的光照突变,从而揭示VLN模型对真实世界照明变化的脆弱性,并显著提升失败率与降低路径效率,为提升VLN系统鲁棒性提供了新的评估视角与方法。

链接: https://arxiv.org/abs/2511.13132
作者: Chenyang Li,Wenbing Tang,Yihao Huang,Sinong Simon Zhan,Ming Hu,Xiaojun Jia,Yang Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-and-Language Navigation (VLN) agents have made remarkable progress, but their robustness remains insufficiently studied. Existing adversarial evaluations often rely on perturbations that manifest as unusual textures rarely encountered in everyday indoor environments. Errors under such contrived conditions have limited practical relevance, as real-world agents are unlikely to encounter such artificial patterns. In this work, we focus on indoor lighting, an intrinsic yet largely overlooked scene attribute that strongly influences navigation. We propose Indoor Lighting-based Adversarial Attack (ILA), a black-box framework that manipulates global illumination to disrupt VLN agents. Motivated by typical household lighting usage, we design two attack modes: Static Indoor Lighting-based Attack (SILA), where the lighting intensity remains constant throughout an episode, and Dynamic Indoor Lighting-based Attack (DILA), where lights are switched on or off at critical moments to induce abrupt illumination changes. We evaluate ILA on two state-of-the-art VLN models across three navigation tasks. Results show that ILA significantly increases failure rates while reducing trajectory efficiency, revealing previously unrecognized vulnerabilities of VLN agents to realistic indoor lighting variations.
zh

[CV-79] MM-Telco: Benchmarks and Multimodal Large Language Models for Telecom Applications

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在电信领域部署时面临的领域特异性挑战,这些挑战限制了其在网络优化、故障自动排查、客户支持及合规性保障等场景中的有效应用。解决方案的关键在于提出MM-Telco——一个面向电信领域的多模态基准测试套件与定制化模型体系,该套件包含文本和图像任务,覆盖网络运维、管理、文档质量提升及信息检索等实际用例;通过在该数据集上微调模型,显著提升了性能,并揭示了当前先进多模态大模型的薄弱环节,从而为后续研究提供明确方向。

链接: https://arxiv.org/abs/2511.13131
作者: Gagan Raj Gupta,Anshul Kumar,Manish Rai,Apu Chakraborty,Ashutosh Modi,Abdelaali Chaoub,Soumajit Pramanik,Moyank Giri,Yashwanth Holla,Sunny Kumar,M. V. Kiran Sooraj
机构: IIT Bhilai(印度理工学院比哈利分校); IIT Kanpur(印度理工学院坎普尔分校); INPT(国家电信学院)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have emerged as powerful tools for automating complex reasoning and decision-making tasks. In telecommunications, they hold the potential to transform network optimization, automate troubleshooting, enhance customer support, and ensure regulatory compliance. However, their deployment in telecom is hindered by domain-specific challenges that demand specialized adaptation. To overcome these challenges and to accelerate the adaptation of LLMs for telecom, we propose MM-Telco, a comprehensive suite of multimodal benchmarks and models tailored for the telecom domain. The benchmark introduces various tasks (both text based and image based) that address various practical real-life use cases such as network operations, network management, improving documentation quality, and retrieval of relevant text and images. Further, we perform baseline experiments with various LLMs and VLMs. The models fine-tuned on our dataset exhibit a significant boost in performance. Our experiments also help analyze the weak areas in the working of current state-of-art multimodal LLMs, thus guiding towards further development and research.
zh

[CV-80] VEIL: Jailbreaking Text-to-Video Models via Visual Exploitation from Implicit Language

【速读】:该论文旨在解决文本到视频(Text-to-Video, T2V)模型中存在的“越狱攻击”(jailbreak attack)问题,即通过设计看似无害但隐含诱导性线索的提示词,使模型生成违反安全政策的视频内容,同时保持原始意图不变。传统方法依赖显式不安全提示并易被检测,而本文提出VEIL框架,其核心在于利用T2V模型跨模态关联特性,构建模块化提示结构:包括中立场景锚点(neutral scene anchors)、潜在听觉触发器(latent auditory triggers)和风格调制器(stylistic modulators),并通过约束优化与引导搜索策略实现隐蔽性与攻击效果之间的平衡,从而显著提升攻击成功率(在商用模型上平均提升23%)。

链接: https://arxiv.org/abs/2511.13127
作者: Zonghao Ying,Moyang Chen,Nizhang Li,Zhiqiang Wang,Wenxin Zhang,Quanchen Zou,Zonglei Jing,Aishan Liu,Xianglong Liu
机构: Beihang University (北京航空航天大学); Wenzhou-Kean University (温州肯恩大学); 360 AI Security Lab; Macau University of Science and Technology (澳门科技大学); Hong Kong University of Science and Technology (香港科技大学); University of Chinese Academy of Sciences (中国科学院大学); Zhongguancun Laboratory (中关村实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Jailbreak attacks can circumvent model safety guardrails and reveal critical blind spots. Prior attacks on text-to-video (T2V) models typically add adversarial perturbations to obviously unsafe prompts, which are often easy to detect and defend. In contrast, we show that benign-looking prompts containing rich, implicit cues can induce T2V models to generate semantically unsafe videos that both violate policy and preserve the original (blocked) intent. To realize this, we propose VEIL, a jailbreak framework that leverages T2V models’ cross-modal associative patterns via a modular prompt design. Specifically, our prompts combine three components: neutral scene anchors, which provide the surface-level scene description extracted from the blocked intent to maintain plausibility; latent auditory triggers, textual descriptions of innocuous-sounding audio events (e.g., creaking, muffled noises) that exploit learned audio-visual co-occurrence priors to bias the model toward particular unsafe visual concepts; and stylistic modulators, cinematic directives (e.g., camera framing, atmosphere) that amplify and stabilize the latent trigger’s effect. We formalize attack generation as a constrained optimization over the above modular prompt space and solve it with a guided search procedure that balances stealth and effectiveness. Extensive experiments over 7 T2V models demonstrate the efficacy of our attack, achieving a 23 percent improvement in average attack success rate in commercial models.
zh

[CV-81] Region-Point Joint Representation for Effective Trajectory Similarity Learning AAAI2026

【速读】:该论文旨在解决现有基于学习的轨迹相似性计算方法未能充分利用轨迹信息全谱特征的问题,导致相似性建模不够全面。其核心解决方案是提出RePo方法,通过联合编码区域级(region-wise)与点级(point-wise)特征来同时捕捉空间上下文和细粒度移动模式:区域级表示利用网格序列映射结合结构特征与视觉增强的语义上下文;点级表示则由三个轻量级专家网络分别提取局部、相关性和连续运动模式,并通过路由网络自适应融合,最终与区域级特征通过交叉注意力机制整合生成轨迹嵌入。

链接: https://arxiv.org/abs/2511.13125
作者: Hao Long,Silin Zhou,Lisi Chen,Shuo Shang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: This paper is accepted by AAAI2026

点击查看摘要

Abstract:Recent learning-based methods have reduced the computational complexity of traditional trajectory similarity computation, but state-of-the-art (SOTA) methods still fail to leverage the comprehensive spectrum of trajectory information for similarity modeling. To tackle this problem, we propose \textbfRePo, a novel method that jointly encodes \textbfRegion-wise and \textbfPoint-wise features to capture both spatial context and fine-grained moving patterns. For region-wise representation, the GPS trajectories are first mapped to grid sequences, and spatial context are captured by structural features and semantic context enriched by visual features. For point-wise representation, three lightweight expert networks extract local, correlation, and continuous movement patterns from dense GPS sequences. Then, a router network adaptively fuses the learned point-wise features, which are subsequently combined with region-wise features using cross-attention to produce the final trajectory embedding. To train RePo, we adopt a contrastive loss with hard negative samples to provide similarity ranking supervision. Experiment results show that RePo achieves an average accuracy improvement of 22.2% over SOTA baselines across all evaluation metrics.
zh

[CV-82] CloseUpShot: Close-up Novel View Synthesis from Sparse-views via Point-conditioned Diffusion Model

【速读】:该论文旨在解决从稀疏视角输入中进行近距离场景重建与新视角合成的难题,尤其针对现有方法在细粒度细节捕捉能力不足的问题。其关键解决方案在于提出了一种基于扩散模型的框架 CloseUpShot,通过点条件视频扩散机制实现高质量近景新视角合成;具体包括:1)引入分层像素映射(hierarchical warping)和遮挡感知噪声抑制(occlusion-aware noise suppression),以缓解近景条件下因视点稀疏导致的条件图像质量下降与背景泄漏问题;2)设计全局结构引导机制(global structure guidance),利用密集融合点云提供一致的几何上下文,弥补稀疏条件输入中缺乏全局3D约束的缺陷。

链接: https://arxiv.org/abs/2511.13121
作者: Yuqi Zhang,Guanying Chen,Jiaxing Chen,Chuanyu Fu,Chuan Huang,Shuguang Cui
机构: Shenzhen Future Network of Intelligence Institute (FNii-Shenzhen); Chinese University of Hong Kong at Shenzhen (CUHKSZ); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Link: this https URL

点击查看摘要

Abstract:Reconstructing 3D scenes and synthesizing novel views from sparse input views is a highly challenging task. Recent advances in video diffusion models have demonstrated strong temporal reasoning capabilities, making them a promising tool for enhancing reconstruction quality under sparse-view settings. However, existing approaches are primarily designed for modest viewpoint variations, which struggle in capturing fine-grained details in close-up scenarios since input information is severely limited. In this paper, we present a diffusion-based framework, called CloseUpShot, for close-up novel view synthesis from sparse inputs via point-conditioned video diffusion. Specifically, we observe that pixel-warping conditioning suffers from severe sparsity and background leakage in close-up settings. To address this, we propose hierarchical warping and occlusion-aware noise suppression, enhancing the quality and completeness of the conditioning images for the video diffusion model. Furthermore, we introduce global structure guidance, which leverages a dense fused point cloud to provide consistent geometric context to the diffusion process, to compensate for the lack of globally consistent 3D constraints in sparse conditioning inputs. Extensive experiments on multiple datasets demonstrate that our method outperforms existing approaches, especially in close-up novel view synthesis, clearly validating the effectiveness of our design.
zh

[CV-83] A Lightweight 3D Anomaly Detection Method with Rotationally Invariant Features

【速读】:该论文旨在解决3D异常检测(3D AD)中因点云数据在旋转和位置变化下特征表示不稳定而导致的性能下降问题。解决方案的关键在于提出一种旋转不变特征(Rotationally Invariant Features, RIF)框架,其核心包括两个创新:一是设计点坐标映射(Point Coordinate Mapping, PCM)技术,将每个点映射到旋转不变空间以保持表征一致性;二是构建轻量级卷积变换特征网络(Convolutional Transform Feature Network, CTF-Net),结合迁移学习与3D数据增强策略预训练特征提取器,从而学习鲁棒且判别性强的旋转不变特征用于内存库建模。该方法在Anomaly-ShapeNet和Real3D-AD数据集上均取得显著性能提升,验证了其强大的泛化能力与工业应用潜力。

链接: https://arxiv.org/abs/2511.13115
作者: Hanzhe Liang,Jie Zhou,Can Gao,Bingyang Guo,Jinbao Wang,Linlin Shen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to Elsevier

点击查看摘要

Abstract:3D anomaly detection (AD) is a crucial task in computer vision, aiming to identify anomalous points or regions from point cloud data. However, existing methods may encounter challenges when handling point clouds with changes in orientation and position because the resulting features may vary significantly. To address this problem, we propose a novel Rotationally Invariant Features (RIF) framework for 3D AD. Firstly, to remove the adverse effect of variations on point cloud data, we develop a Point Coordinate Mapping (PCM) technique, which maps each point into a rotationally invariant space to maintain consistency of representation. Then, to learn robust and discriminative features, we design a lightweight Convolutional Transform Feature Network (CTF-Net) to extract rotationally invariant features for the memory bank. To improve the ability of the feature extractor, we introduce the idea of transfer learning to pre-train the feature extractor with 3D data augmentation. Experimental results show that the proposed method achieves the advanced performance on the Anomaly-ShapeNet dataset, with an average P-AUROC improvement of 17.7%, and also gains the best performance on the Real3D-AD dataset, with an average P-AUROC improvement of 1.6%. The strong generalization ability of RIF has been verified by combining it with traditional feature extraction methods on anomaly detection tasks, demonstrating great potential for industrial applications.
zh

[CV-84] Semantics and Content Matter: Towards Multi-Prior Hierarchical Mamba for Image Deraining

【速读】:该论文旨在解决雨滴污染对计算机视觉系统性能的显著负面影响,尤其是在自动驾驶和视频监控等应用中,现有去雨方法在语义和空间细节保真度方面存在不足。解决方案的关键在于提出多先验分层Mamba(MPHM)网络,其核心创新包括:一是融合宏观语义文本先验(CLIP)与微观结构视觉先验(DINOv2),实现任务级语义引导与场景感知结构信息的协同;二是设计渐进式先验融合注入机制(PFI),在解码器不同层级有策略地注入互补线索以缓解异构先验间的潜在冲突;三是引入分层Mamba模块(HMM),通过傅里叶增强的双路径结构同步优化全局上下文建模与局部细节恢复能力,从而显著提升去雨效果与真实场景泛化性能。

链接: https://arxiv.org/abs/2511.13113
作者: Zhaocheng Yu,Kui Jiang,Junjun Jiang,Xianming Liu,Guanglu Sun,Yi Xiao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Rain significantly degrades the performance of computer vision systems, particularly in applications like autonomous driving and video surveillance. While existing deraining methods have made considerable progress, they often struggle with fidelity of semantic and spatial details. To address these limitations, we propose the Multi-Prior Hierarchical Mamba (MPHM) network for image deraining. This novel architecture synergistically integrates macro-semantic textual priors (CLIP) for task-level semantic guidance and micro-structural visual priors (DINOv2) for scene-aware structural information. To alleviate potential conflicts between heterogeneous priors, we devise a progressive Priors Fusion Injection (PFI) that strategically injects complementary cues at different decoder levels. Meanwhile, we equip the backbone network with an elaborate Hierarchical Mamba Module (HMM) to facilitate robust feature representation, featuring a Fourier-enhanced dual-path design that concurrently addresses global context modeling and local detail recovery. Comprehensive experiments demonstrate MPHM’s state-of-the-art performance, achieving a 0.57 dB PSNR gain on the Rain200H dataset while delivering superior generalization on real-world rainy scenarios.
zh

[CV-85] Learning Implicit Neural Degradation Representation for Unpaired Image Dehazing

【速读】:该论文旨在解决图像去雾任务中复杂场景下非均匀雾霾分布的细粒度特征表示与全局一致性建模之间的平衡难题。现有方法在处理空间变化复杂的雾霾时,难以同时保持局部细节清晰度和整体结构合理性。其解决方案的关键在于提出一种无监督的隐式神经退化表示方法:首先基于Kolmogorov-Arnold表示定理设计通道独立与通道依赖相结合的机制,以增强对非线性依赖关系的学习能力;其次引入隐式神经表示将雾霾退化建模为连续函数,从而消除对显式特征提取和物理模型的依赖,并通过密集残差增强模块进一步优化隐式表征,有效去除冗余信息,实现高质量图像恢复。

链接: https://arxiv.org/abs/2511.13110
作者: Shuaibin Fan,Senming Zhong,Wenchao Yan,Minglong Xue
机构: Chongqing University of Technology (重庆理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image dehazing is an important task in the field of computer vision, aiming at restoring clear and detail-rich visual content from haze-affected images. However, when dealing with complex scenes, existing methods often struggle to strike a balance between fine-grained feature representation of inhomogeneous haze distribution and global consistency modeling. Furthermore, to better learn the common degenerate representation of haze in spatial variations, we propose an unsupervised dehaze method for implicit neural degradation representation. Firstly, inspired by the Kolmogorov-Arnold representation theorem, we propose a mechanism combining the channel-independent and channel-dependent mechanisms, which efficiently enhances the ability to learn from nonlinear dependencies. which in turn achieves good visual perception in complex scenes. Moreover, we design an implicit neural representation to model haze degradation as a continuous function to eliminate redundant information and the dependence on explicit feature extraction and physical models. To further learn the implicit representation of the haze features, we also designed a dense residual enhancement module from it to eliminate redundant information. This achieves high-quality image restoration. Experimental results show that our method achieves competitive dehaze performance on various public and real-world datasets. This project code will be available at this https URL.
zh

[CV-86] DGS-Net: Distillation-Guided Gradient Surgery for CLIP Fine-Tuning in AI-Generated Image Detection

【速读】:该论文旨在解决生成式模型(如GANs和扩散模型)产生的AI图像广泛传播所引发的虚假信息、隐私侵犯及数字媒体信任危机问题,尤其针对现有检测方法在微调大规模多模态模型(如CLIP)时易引发灾难性遗忘(catastrophic forgetting),导致预训练先验知识退化并限制跨域泛化能力的问题。解决方案的关键在于提出一种基于蒸馏引导梯度手术网络(Distillation-guided Gradient Surgery Network, DGS-Net)的新框架,其核心机制是通过梯度空间分解分离有害与有益的下降方向:将任务梯度投影到有害方向的正交补空间中,并对齐来自冻结CLIP编码器蒸馏出的有益方向,从而实现先验保留与无关成分抑制的统一优化。

链接: https://arxiv.org/abs/2511.13108
作者: Jiazhen Yan,Ziqiang Li,Fan Wang,Boyu Wang,Zhangjie Fu
机构: Nanjing University of Information Science and Technology (南京信息工程大学); University of Macau (澳门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid progress of generative models such as GANs and diffusion models has led to the widespread proliferation of AI-generated images, raising concerns about misinformation, privacy violations, and trust erosion in digital media. Although large-scale multimodal models like CLIP offer strong transferable representations for detecting synthetic content, fine-tuning them often induces catastrophic forgetting, which degrades pre-trained priors and limits cross-domain generalization. To address this issue, we propose the Distillation-guided Gradient Surgery Network (DGS-Net), a novel framework that preserves transferable pre-trained priors while suppressing task-irrelevant components. Specifically, we introduce a gradient-space decomposition that separates harmful and beneficial descent directions during optimization. By projecting task gradients onto the orthogonal complement of harmful directions and aligning with beneficial ones distilled from a frozen CLIP encoder, DGS-Net achieves unified optimization of prior preservation and irrelevant suppression. Extensive experiments on 50 generative models demonstrate that our method outperforms state-of-the-art approaches by an average margin of 6.6, achieving superior detection performance and generalization across diverse generation techniques.
zh

[CV-87] Low-Level Dataset Distillation for Medical Image Enhancement

【速读】:该论文旨在解决医学图像增强任务中因依赖大规模数据集而导致的训练与存储成本高昂的问题,尤其是针对低层次任务中像素级映射的复杂性所引发的数据蒸馏(Dataset Distillation, DD)难题。现有DD方法主要适用于高阶任务(如分类),其“多对一”标签映射可实现语义压缩,但低层次任务具有“多对多”的像素级对应关系,使得小规模蒸馏数据难以约束密集的像素级映射,构成欠定问题。解决方案的关键在于提出首个面向低层次任务的DD方法:首先利用患者间解剖结构相似性构建共享解剖先验(shared anatomical prior),作为不同患者蒸馏数据的初始化;随后通过结构保持个性化生成(Structure-Preserving Personalized Generation, SPG)模块,在保留像素级保真度的同时注入个体特异性解剖信息;最后,通过梯度对齐策略将患者特有知识嵌入蒸馏数据,而无需暴露原始患者数据,从而在保障隐私的前提下实现高效、个性化的医学图像增强训练。

链接: https://arxiv.org/abs/2511.13106
作者: Fengzhi Xu,Ziyuan Yang,Mengyu Sun,Joey Tianyi Zhou,Yi Zhang
机构: Sichuan University (四川大学); The Chinese University of Hong Kong (香港中文大学); Agency for Science, Technology and Research (A*STAR) (新加坡科技研究局)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical image enhancement is clinically valuable, but existing methods require large-scale datasets to learn complex pixel-level mappings. However, the substantial training and storage costs associated with these datasets hinder their practical deployment. While dataset distillation (DD) can alleviate these burdens, existing methods mainly target high-level tasks, where multiple samples share the same label. This many-to-one mapping allows distilled data to capture shared semantics and achieve information compression. In contrast, low-level tasks involve a many-to-many mapping that requires pixel-level fidelity, making low-level DD an underdetermined problem, as a small distilled dataset cannot fully constrain the dense pixel-level mappings. To address this, we propose the first low-level DD method for medical image enhancement. We first leverage anatomical similarities across patients to construct the shared anatomical prior based on a representative patient, which serves as the initialization for the distilled data of different patients. This prior is then personalized for each patient using a Structure-Preserving Personalized Generation (SPG) module, which integrates patient-specific anatomical information into the distilled dataset while preserving pixel-level fidelity. For different low-level tasks, the distilled data is used to construct task-specific high- and low-quality training pairs. Patient-specific knowledge is injected into the distilled data by aligning the gradients computed from networks trained on the distilled pairs with those from the corresponding patient’s raw data. Notably, downstream users cannot access raw patient data. Instead, only a distilled dataset containing abstract training information is shared, which excludes patient-specific details and thus preserves privacy.
zh

[CV-88] PlugTrack: Multi-Perceptive Motion Analysis for Adaptive Fusion in Multi-Object Tracking AAAI2026

【速读】:该论文旨在解决多目标跟踪(Multi-object Tracking, MOT)中运动预测器性能受限的问题,具体表现为传统基于卡尔曼滤波(Kalman filter)的线性运动预测器在非线性运动场景下表现不佳,而数据驱动的非线性预测器虽能捕捉复杂动态但存在泛化能力弱和计算开销高的缺陷。解决方案的关键在于提出PlugTrack框架,通过多感知运动理解(multi-perceptive motion understanding)生成自适应融合因子,实现对卡尔曼滤波与数据驱动预测器的动态融合,从而在不修改原有预测器的前提下显著提升跟踪性能,并在MOT17/MOT20等基准上取得显著改进,在DanceTrack上达到当前最优效果。

链接: https://arxiv.org/abs/2511.13105
作者: Seungjae Kim,SeungJoon Lee,MyeongAh Cho
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI 2026. Code: this https URL

点击查看摘要

Abstract:Multi-object tracking (MOT) predominantly follows the tracking-by-detection paradigm, where Kalman filters serve as the standard motion predictor due to computational efficiency but inherently fail on non-linear motion patterns. Conversely, recent data-driven motion predictors capture complex non-linear dynamics but suffer from limited domain generalization and computational overhead. Through extensive analysis, we reveal that even in datasets dominated by non-linear motion, Kalman filter outperforms data-driven predictors in up to 34% of cases, demonstrating that real-world tracking scenarios inherently involve both linear and non-linear patterns. To leverage this complementarity, we propose PlugTrack, a novel framework that adaptively fuses Kalman filter and data-driven motion predictors through multi-perceptive motion understanding. Our approach employs multi-perceptive motion analysis to generate adaptive blending factors. PlugTrack achieves significant performance gains on MOT17/MOT20 and state-of-the-art on DanceTrack without modifying existing motion predictors. To the best of our knowledge, PlugTrack is the first framework to bridge classical and modern motion prediction paradigms through adaptive fusion in MOT.
zh

[CV-89] CapeNext: Rethinking and refining dynamic support information for category-agnostic pose estimation

【速读】:该论文旨在解决类别无关姿态估计(Category-Agnostic Pose Estimation, CAPE)中静态关节嵌入(static joint embedding)所引发的两个核心问题:一是多义性导致的跨类别歧义(polysemy-induced cross-category ambiguity),例如“腿”在人类与家具中的视觉表现差异;二是细粒度类内变化的判别能力不足,如不同姿势或毛色的猫之间难以区分。解决方案的关键在于提出一种融合分层跨模态交互(hierarchical cross-modal interaction)与双流特征精炼(dual-stream feature refinement)的新框架,通过引入文本描述和具体图像中的类别级与实例级线索,动态增强关节嵌入的语义表达能力,从而显著提升匹配精度。

链接: https://arxiv.org/abs/2511.13102
作者: Yu Zhu,Dan Zeng,Shuiwang Li,Qijun Zhao,Qiaomu Shen,Bo Tang
机构: 1. Sun Yat-sen University (中山大学); 2. Tsinghua University (清华大学); 3. University of Oxford (牛津大学); 4. Chinese Academy of Sciences (中国科学院); 5. Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent research in Category-Agnostic Pose Estimation (CAPE) has adopted fixed textual keypoint description as semantic prior for two-stage pose matching frameworks. While this paradigm enhances robustness and flexibility by disentangling the dependency of support images, our critical analysis reveals two inherent limitations of static joint embedding: (1) polysemy-induced cross-category ambiguity during the matching process(e.g., the concept “leg” exhibiting divergent visual manifestations across humans and furniture), and (2) insufficient discriminability for fine-grained intra-category variations (e.g., posture and fur discrepancies between a sleeping white cat and a standing black cat). To overcome these challenges, we propose a new framework that innovatively integrates hierarchical cross-modal interaction with dual-stream feature refinement, enhancing the joint embedding with both class-level and instance-specific cues from textual description and specific images. Experiments on the MP-100 dataset demonstrate that, regardless of the network backbone, CapeNext consistently outperforms state-of-the-art CAPE methods by a large margin.
zh

[CV-90] MergeSlide: Continual Model Merging and Task-to-Class Prompt-Aligned Inference for Lifelong Learning on Whole Slide Images WACV2026

【速读】:该论文旨在解决在全切片图像(Whole Slide Images, WSIs)上进行终身学习(lifelong learning)时面临的资源消耗高、数据迁移与处理复杂的问题,尤其是在WSI通常达到千兆字节级别的情况下。其核心挑战在于如何在不丢失先前任务性能的前提下,持续地对新癌症相关任务进行模型训练或微调。解决方案的关键在于提出MergeSlide框架,该框架将终身学习建模为一个模型合并问题,利用视觉-语言病理学基础模型(vision-language pathology foundation model),通过三个步骤实现:1)使用类别感知提示(class-aware prompts)定义新任务;2)采用无MLP骨干网络微调数个epoch;3)应用正交连续合并策略(orthogonal continual merging strategy)以保留性能并缓解灾难性遗忘。此外,针对类别增量学习(CLASS-IL)场景下任务身份未知的情况,引入任务到类别提示对齐(Task-to-Class Prompt-aligned, TCP)推理机制,先基于任务级提示识别最相关任务,再使用对应类别提示生成预测,从而提升实际应用场景下的泛化能力。

链接: https://arxiv.org/abs/2511.13099
作者: Doanh C. Bui,Ba Hung Ngo,Hoai Luan Pham,Khang Nguyen,Maï K. Nguyen,Yasuhiko Nakashima
机构: Nara Institute of Science and Technology (日本奈良科学技术大学院大学); Chonnam National University (全南国立大学); University of Information Technology, Viet Nam National University Ho Chi Minh City (越南胡志明市国家大学信息技术大学); CY Cergy Paris University (塞吉巴黎大学); ENSEA (法国国立高等工程技术学校); CNRS (法国国家科学研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: WACV2026 Accepted

点击查看摘要

Abstract:Lifelong learning on Whole Slide Images (WSIs) aims to train or fine-tune a unified model sequentially on cancer-related tasks, reducing the resources and effort required for data transfer and processing, especially given the gigabyte-scale size of WSIs. In this paper, we introduce MergeSlide, a simple yet effective framework that treats lifelong learning as a model merging problem by leveraging a vision-language pathology foundation model. When a new task arrives, it is: 1) defined with class-aware prompts, 2) fine-tuned for a few epochs using an MLP-free backbone, and 3) merged into a unified model using an orthogonal continual merging strategy that preserves performance and mitigates catastrophic forgetting. For inference under the class-incremental learning (CLASS-IL) setting, where task identity is unknown, we introduce Task-to-Class Prompt-aligned (TCP) inference. Specifically, TCP first identifies the most relevant task using task-level prompts and then applies the corresponding class-aware prompts to generate predictions. To evaluate MergeSlide, we conduct experiments on a stream of six TCGA datasets. The results show that MergeSlide outperforms both rehearsal-based continual learning and vision-language zero-shot baselines. Code and data are available at this https URL.
zh

[CV-91] MEGA-GUI: Multi-stage Enhanced Grounding Agents for GUI Elements

【速读】:该论文旨在解决图形用户界面(GUI)定位任务中因视觉杂乱和指令模糊导致的准确率下降问题,现有方法多依赖单一模型或一次性流水线,缺乏模块化设计且鲁棒性不足。其核心解决方案是提出一种多阶段框架MEGA-GUI,将GUI定位分解为粗粒度的感兴趣区域(ROI)选择与细粒度元素定位两个步骤,并由专门的视觉-语言代理协同控制;关键创新包括双向ROI缩放算法以缓解空间稀释效应,以及上下文感知的重写代理以降低语义歧义,从而在ScreenSpot-Pro和OSWorld-G等基准上显著优于传统单体方法。

链接: https://arxiv.org/abs/2511.13087
作者: SeokJoo Kwak,Jihoon Kim,Boyoun Kim,Jung Jae Yoon,Wooseok Jang,Jeonghoon Hong,Jaeho Yang,Yeong-Dae Kwon
机构: Samsung SDS (三星SDS)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 26 pages, 7 figures. Code available at this https URL

点击查看摘要

Abstract:Graphical User Interface (GUI) grounding - the task of mapping natural language instructions to screen coordinates - is essential for autonomous agents and accessibility technologies. Existing systems rely on monolithic models or one-shot pipelines that lack modularity and fail under visual clutter and ambiguous instructions. We introduce MEGA-GUI, a multi-stage framework that separates grounding into coarse Region-of-Interest (ROI) selection and fine-grained element grounding, orchestrated by specialized vision-language agents. MEGA-GUI features a bidirectional ROI zoom algorithm that mitigates spatial dilution and a context-aware rewriting agent that reduces semantic ambiguity. Our analysis reveals complementary strengths and weaknesses across vision-language models at different visual scales, and we show that leveraging this modular structure achieves consistently higher accuracy than monolithic approaches. On the visually dense ScreenSpot-Pro benchmark, MEGA-GUI attains 73.18% accuracy, and on the semantically complex OSWorld-G benchmark it reaches 68.63%, surpassing previously reported results. Code and the Grounding Benchmark Toolkit (GBT) are available at this https URL.
zh

[CV-92] Real-time prediction of breast cancer sites using deformation-aware graph neural network

【速读】:该论文旨在解决间接磁共振成像(MRI)引导下乳腺活检中实时准确预测肿瘤位移的难题,这一问题制约了该技术在临床中的广泛应用。其关键解决方案是提出了一种基于图神经网络(Graph Neural Network, GNN)的变形感知模型,该模型结合个体特异性有限元(Finite Element, FE)模型与表面位移及距离图结构数据,实现了对乳腺组织尤其是肿瘤区域变形行为的高精度实时预测,验证结果显示癌症病灶位移预测的均方根误差(RMSE)低于0.2 mm,空间重叠Dice相似系数(DSC)达0.977,且计算效率较传统FE仿真提升超过4000倍,从而显著提升了活检过程的精准性与实时性。

链接: https://arxiv.org/abs/2511.13082
作者: Kyunghyun Lee,Yong-Min Shin,Minwoo Shin,Jihun Kim,Sunghwan Lim,Won-Yong Shin,Kyungho Yoon
机构: Yonsei University (延世大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Early diagnosis of breast cancer is crucial, enabling the establishment of appropriate treatment plans and markedly enhancing patient prognosis. While direct magnetic resonance imaging-guided biopsy demonstrates promising performance in detecting cancer lesions, its practical application is limited by prolonged procedure times and high costs. To overcome these issues, an indirect MRI-guided biopsy that allows the procedure to be performed outside of the MRI room has been proposed, but it still faces challenges in creating an accurate real-time deformable breast model. In our study, we tackled this issue by developing a graph neural network (GNN)-based model capable of accurately predicting deformed breast cancer sites in real time during biopsy procedures. An individual-specific finite element (FE) model was developed by incorporating magnetic resonance (MR) image-derived structural information of the breast and tumor to simulate deformation behaviors. A GNN model was then employed, designed to process surface displacement and distance-based graph data, enabling accurate prediction of overall tissue displacement, including the deformation of the tumor region. The model was validated using phantom and real patient datasets, achieving an accuracy within 0.2 millimeters (mm) for cancer node displacement (RMSE) and a dice similarity coefficient (DSC) of 0.977 for spatial overlap with actual cancerous regions. Additionally, the model enabled real-time inference and achieved a speed-up of over 4,000 times in computational cost compared to conventional FE simulations. The proposed deformation-aware GNN model offers a promising solution for real-time tumor displacement prediction in breast biopsy, with high accuracy and real-time capability. Its integration with clinical procedures could significantly enhance the precision and efficiency of breast cancer diagnosis.
zh

[CV-93] Rethinking Saliency Maps: A Cognitive Human Aligned Taxonomy and Evaluation Framework for Explanations

【速读】:该论文旨在解决当前视觉解释方法(如显著性图)在用户意图对齐方面的根本性模糊问题,即现有解释缺乏明确的目的导向,难以满足多样化用户查询需求。其解决方案的关键在于提出Reference-Frame × Granularity (RFxG) 分类框架,从两个维度系统化组织解释:Reference-Frame 区分点式解释(“为何做出此预测?”)与对比式解释(“为何是这个而非其他?”),Granularity 则涵盖细粒度类别级(如“为何是哈士奇?”)到粗粒度组别级(如“为何是狗?”)的语义层次。基于该框架,作者进一步设计了四项新的忠实性评估指标,以全面衡量解释在两种维度上的质量,并通过实证分析揭示现有评估指标过度侧重点式忠实性而忽略对比推理和语义粒度的问题,从而推动解释方法向更贴合人类认知复杂性的用户意图驱动方向演进。

链接: https://arxiv.org/abs/2511.13081
作者: Yehonatan Elisha,Seffi Cohen,Oren Barkan,Noam Koenigstein
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Saliency maps are widely used for visual explanations in deep learning, but a fundamental lack of consensus persists regarding their intended purpose and alignment with diverse user queries. This ambiguity hinders the effective evaluation and practical utility of explanation this http URL address this gap by introducing the Reference-Frame \times Granularity (RFxG) taxonomy, a principled conceptual framework that organizes saliency explanations along two essential axes:Reference-Frame: Distinguishing between pointwise (“Why this prediction?”) and contrastive (“Why this and not an alternative?”) this http URL: Ranging from fine-grained class-level (e.g., “Why Husky?”) to coarse-grained group-level (e.g., “Why Dog?”) this http URL the RFxG lens, we demonstrate critical limitations in existing evaluation metrics, which overwhelmingly prioritize pointwise faithfulness while neglecting contrastive reasoning and semantic granularity. To systematically assess explanation quality across both RFxG dimensions, we propose four novel faithfulness metrics. Our comprehensive evaluation framework applies these metrics to ten state-of-the-art saliency methods, four model architectures, and three this http URL advocating a shift toward user-intent-driven evaluation, our work provides both the conceptual foundation and the practical tools necessary to develop visual explanations that are not only faithful to the underlying model behavior but are also meaningfully aligned with the complexity of human understanding and inquiry.
zh

[CV-94] Decoupling Scene Perception and Ego Status: A Multi-Context Fusion Approach for Enhanced Generalization in End-to-End Autonomous Driving

【速读】:该论文旨在解决当前面向规划的自动驾驶模块化设计中对自车状态(ego status)过度依赖的问题,这种依赖导致系统在复杂场景下泛化能力不足和场景理解鲁棒性差。其核心挑战在于现有架构在BEV(Bird’s Eye View)编码器早期融合自车状态信息,使得强先验信息主导下游规划模块决策,形成“捷径学习”(shortcut learning)。解决方案的关键是提出AdaptiveAD,一种基于多上下文融合策略的架构级改进:采用双分支结构显式解耦场景感知与自车状态,其中一 branch 通过多任务学习进行纯场景驱动推理(BEV编码器中不包含自车状态),另一 branch 仅基于规划任务进行自车驱动推理;随后通过一个场景感知融合模块自适应整合两个分支的互补决策以生成最终轨迹。此外,为保障多任务学习有效性,引入路径注意力机制(path attention)增强自车-BEV交互,并增加BEV单向蒸馏和自回归在线建图两项辅助任务,从而显著提升模型在nuScenes数据集上的开环规划性能及跨场景泛化能力。

链接: https://arxiv.org/abs/2511.13079
作者: Jiacheng Tang,Mingyue Feng,Jiachao Liu,Yaonong Wang,Jian Pu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 8 figures

点击查看摘要

Abstract:Modular design of planning-oriented autonomous driving has markedly advanced end-to-end systems. However, existing architectures remain constrained by an over-reliance on ego status, hindering generalization and robust scene understanding. We identify the root cause as an inherent design within these architectures that allows ego status to be easily leveraged as a shortcut. Specifically, the premature fusion of ego status in the upstream BEV encoder allows an information flow from this strong prior to dominate the downstream planning module. To address this challenge, we propose AdaptiveAD, an architectural-level solution based on a multi-context fusion strategy. Its core is a dual-branch structure that explicitly decouples scene perception and ego status. One branch performs scene-driven reasoning based on multi-task learning, but with ego status deliberately omitted from the BEV encoder, while the other conducts ego-driven reasoning based solely on the planning task. A scene-aware fusion module then adaptively integrates the complementary decisions from the two branches to form the final planning trajectory. To ensure this decoupling does not compromise multi-task learning, we introduce a path attention mechanism for ego-BEV interaction and add two targeted auxiliary tasks: BEV unidirectional distillation and autoregressive online mapping. Extensive evaluations on the nuScenes dataset demonstrate that AdaptiveAD achieves state-of-the-art open-loop planning performance. Crucially, it significantly mitigates the over-reliance on ego status and exhibits impressive generalization capabilities across diverse scenarios.
zh

[CV-95] RobustGait: Robustness Analysis for Appearance Based Gait Recognition WACV’26

【速读】:该论文旨在解决当前基于外观的步态识别(Appearance-based Gait Recognition)系统在真实世界场景中鲁棒性不足的问题,尤其是缺乏对噪声、遮挡、轮廓提取方法差异及模型架构影响的系统性评估。其解决方案的关键在于提出 RobustGait 框架,该框架从四个维度进行细粒度鲁棒性评测:扰动类型(数字、环境、时间、遮挡)、轮廓提取方法(分割与解析网络)、模型架构能力以及部署场景,并在 CASIA-B、CCPG、SUSTech1K 等数据集上引入 15 种腐蚀类型(共 5 个严重等级),结合 MEVID 的野外验证,全面评估六种前沿步态识别系统。关键发现包括:RGB 层级噪声更能反映现实退化;轮廓提取器偏差是被忽视的基准偏倚来源;鲁棒性依赖于扰动类型和模型设计;并通过噪声感知训练和知识蒸馏等策略提升系统鲁棒性,推动迈向可部署的步态识别系统。

链接: https://arxiv.org/abs/2511.13065
作者: Reeshoon Sayera,Akash Kumar,Sirshapan Mitra,Prudvi Kamtam,Yogesh S Rawat
机构: University of Central Florida (中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE WACV’26 Main Conference

点击查看摘要

Abstract:Appearance-based gait recognition have achieved strong performance on controlled datasets, yet systematic evaluation of its robustness to real-world corruptions and silhouette variability remains lacking. We present RobustGait, a framework for fine-grained robustness evaluation of appearance-based gait recognition systems. RobustGait evaluation spans four dimensions: the type of perturbation (digital, environmental, temporal, occlusion), the silhouette extraction method (segmentation and parsing networks), the architectural capacities of gait recognition models, and various deployment scenarios. The benchmark introduces 15 corruption types at 5 severity levels across CASIA-B, CCPG, and SUSTech1K, with in-the-wild validation on MEVID, and evaluates six state-of-the-art gait systems. We came across several exciting insights. First, applying noise at the RGB level better reflects real-world degradation, and reveal how distortions propagate through silhouette extraction to the downstream gait recognition systems. Second, gait accuracy is highly sensitive to silhouette extractor biases, revealing an overlooked source of benchmark bias. Third, robustness is dependent on both the type of perturbation and the architectural design. Finally, we explore robustness-enhancing strategies, showing that noise-aware training and knowledge distillation improve performance and move toward deployment-ready systems.
zh

[CV-96] FGNet: Leverag ing Feature-Guided Attention to Refine SAM2 for 3D EM Neuron Segmentation

【速读】:该论文旨在解决电子显微镜(Electron Microscopy, EM)图像中神经结构分割的难题,该任务因复杂的形态特征、低信噪比及标注数据稀缺而面临精度和泛化能力不足的问题。解决方案的关键在于利用在大量自然图像上预训练的视觉基础模型(如Segment Anything 2, SAM2)所学习到的先验知识,并通过特征引导注意力模块(Feature-Guided Attention)来缩小自然图像域与EM图像域之间的差异,同时引入轻量级细粒度编码器(Fine-Grained Encoder, FGE)聚焦于难分区域,最终由双亲和力解码器生成粗略与精细的亲和图,实现高效且准确的神经结构分割。

链接: https://arxiv.org/abs/2511.13063
作者: Zhenghua Li,Hang Chen,Zihao Sun,Kai Li,Xiaolin Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Accurate segmentation of neural structures in Electron Microscopy (EM) images is paramount for neuroscience. However, this task is challenged by intricate morphologies, low signal-to-noise ratios, and scarce annotations, limiting the accuracy and generalization of existing methods. To address these challenges, we seek to leverage the priors learned by visual foundation models on a vast amount of natural images to better tackle this task. Specifically, we propose a novel framework that can effectively transfer knowledge from Segment Anything 2 (SAM2), which is pre-trained on natural images, to the EM domain. We first use SAM2 to extract powerful, general-purpose features. To bridge the domain gap, we introduce a Feature-Guided Attention module that leverages semantic cues from SAM2 to guide a lightweight encoder, the Fine-Grained Encoder (FGE), in focusing on these challenging regions. Finally, a dual-affinity decoder generates both coarse and refined affinity maps. Experimental results demonstrate that our method achieves performance comparable to state-of-the-art (SOTA) approaches with the SAM2 weights frozen. Upon further fine-tuning on EM data, our method significantly outperforms existing SOTA methods. This study validates that transferring representations pre-trained on natural images, when combined with targeted domain-adaptive guidance, can effectively address the specific challenges in neuron segmentation.
zh

[CV-97] Monocular 3D Lane Detection via Structure Uncertainty-Aware Network with Curve-Point Queries

【速读】:该论文旨在解决单目3D车道线检测中因观测噪声导致的偶然不确定性(aleatoric uncertainty)问题,现有方法通常依赖简化的几何假设(如独立点预测或全局平面建模),难以捕捉真实场景中的结构变化和不确定性。其解决方案的关键在于提出一种无需鸟瞰图(BEV-free)的3D车道线检测框架MonoUnc,该框架在前视图(FV)空间中将3D车道线近似为参数化曲线,并基于曲线预测动态生成曲线点查询嵌入(query embeddings)以进行3D空间中的车道点预测;同时,每两个相邻点构成的线段被建模为一个3D高斯分布,其参数由局部结构和不确定性估计共同决定,并设计了一种新颖的3D高斯匹配损失来联合约束这些参数,从而显式地建模并优化偶然不确定性。

链接: https://arxiv.org/abs/2511.13055
作者: Ruixin Liu,Zejian Yuan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Monocular 3D lane detection is challenged by aleatoric uncertainty arising from inherent observation noise. Existing methods rely on simplified geometric assumptions, such as independent point predictions or global planar modeling, failing to capture structural variations and aleatoric uncertainty in real-world scenarios. In this paper, we propose MonoUnc, a bird’s-eye view (BEV)-free 3D lane detector that explicitly models aleatoric uncertainty informed by local lane structures. Specifically, 3D lanes are projected onto the front-view (FV) space and approximated by parametric curves. Guided by curve predictions, curve-point query embeddings are dynamically generated for lane point predictions in 3D space. Each segment formed by two adjacent points is modeled as a 3D Gaussian, parameterized by the local structure and uncertainty estimations. Accordingly, a novel 3D Gaussian matching loss is designed to constrain these parameters jointly. Experiments on the ONCE-3DLanes and OpenLane datasets demonstrate that MonoUnc outperforms previous state-of-the-art (SoTA) methods across all benchmarks under stricter evaluation criteria. Additionally, we propose two comprehensive evaluation metrics for ONCE-3DLanes, calculating the average and maximum bidirectional Chamfer distances to quantify global and local errors. Codes are released at this https URL.
zh

[CV-98] ViSS-R1: Self-Supervised Reinforcement Video Reasoning

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在复杂视频推理任务中因过度依赖文本中心推理而忽视丰富视觉信息的问题,这导致模型易产生捷径学习(shortcut learning)和幻觉(hallucination)。其解决方案的关键在于提出一种新颖的自监督强化学习算法 Pretext-GRPO,通过在标准 R1 训练流程中引入预训练任务奖励机制,促使模型对变换后的视觉输入进行非平凡处理;在此基础上进一步构建 ViSS-R1 框架,将基于预训练任务的自监督学习直接整合进 MLLM 的 R1 后训练范式,使模型必须同时处理预训练问题(关于图像变换)与真实用户查询,从而强制其识别变换并重建原始视频以生成准确答案,显著提升视频理解的鲁棒性和视觉感知能力。

链接: https://arxiv.org/abs/2511.13054
作者: Bo Fang,Yuxin Song,Qiangqiang Wu,Haoyuan Sun,Wenhao Wu,Antoni B. Chan
机构: City University of Hong Kong (香港城市大学); Baidu Inc. (百度公司); Tsinghua University (清华大学); The University of Sydney (悉尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Our paper was initially titled “Video-SSR1: Self-Supervised Reinforcement Video Reasoning.” Upon noticing its close resemblance to the title of a recently released paper, we have decided to rename our work as “ViSS-R1.”

点击查看摘要

Abstract:Complex video reasoning remains a significant challenge for Multimodal Large Language Models (MLLMs), as current R1-based methodologies often prioritize text-centric reasoning derived from text-based and image-based developments. In video tasks, such strategies frequently underutilize rich visual information, leading to potential shortcut learning and increased susceptibility to hallucination. To foster a more robust, visual-centric video understanding, we start by introducing a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline, in which positive rewards are assigned for correctly solving pretext tasks on transformed visual inputs, which makes the model to non-trivially process the visual information. Building on the effectiveness of Pretext-GRPO, we further propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM’s R1 post-training paradigm. Instead of relying solely on sparse visual cues, our framework compels models to reason about transformed visual input by simultaneously processing both pretext questions (concerning transformations) and true user queries. This necessitates identifying the applied transformation and reconstructing the original video to formulate accurate final answers. Comprehensive evaluations on six widely-used video reasoning and understanding benchmarks demonstrate the effectiveness and superiority of our Pretext-GRPO and ViSS-R1 for complex video reasoning. Our codes and models will be publicly available.
zh

[CV-99] DiffPixelFormer: Differential Pixel-Aware Transformer for RGB-D Indoor Scene Segmentation

【速读】:该论文旨在解决RGB-D室内场景语义分割中因现有方法依赖计算复杂的交叉注意力机制且未能充分建模模态内与模态间特征关系,从而导致特征对齐不精确和判别性表征受限的问题。其解决方案的关键在于提出DiffPixelFormer,一种差分像素感知Transformer,核心创新是引入Intra-Inter Modal Interaction Block(IIMIB),通过自注意力机制捕捉模态内长程依赖,并结合差分共享跨模态模块(Differential-Shared Inter-Modal, DSIM)分离模态特有与共享线索,实现细粒度的像素级跨模态对齐;同时设计动态融合策略以根据场景特性平衡RGB与深度模态贡献,从而提升分割精度。

链接: https://arxiv.org/abs/2511.13047
作者: Yan Gong,Jianli Lu,Yongsheng Gao,Jie Zhao,Xiaojuan Zhang,Susanto Rahardja
机构: Harbin Institute of Technology (哈尔滨工业大学); A*STAR (新加坡科技研究局); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 11 pages, 5 figures, 5 tables

点击查看摘要

Abstract:Indoor semantic segmentation is fundamental to computer vision and robotics, supporting applications such as autonomous navigation, augmented reality, and smart environments. Although RGB-D fusion leverages complementary appearance and geometric cues, existing methods often depend on computationally intensive cross-attention mechanisms and insufficiently model intra- and inter-modal feature relationships, resulting in imprecise feature alignment and limited discriminative representation. To address these challenges, we propose DiffPixelFormer, a differential pixel-aware Transformer for RGB-D indoor scene segmentation that simultaneously enhances intra-modal representations and models inter-modal interactions. At its core, the Intra-Inter Modal Interaction Block (IIMIB) captures intra-modal long-range dependencies via self-attention and models inter-modal interactions with the Differential-Shared Inter-Modal (DSIM) module to disentangle modality-specific and shared cues, enabling fine-grained, pixel-level cross-modal alignment. Furthermore, a dynamic fusion strategy balances modality contributions and fully exploits RGB-D information according to scene characteristics. Extensive experiments on the SUN RGB-D and NYUDv2 benchmarks demonstrate that DiffPixelFormer-L achieves mIoU scores of 54.28% and 59.95%, outperforming DFormer-L by 1.78% and 2.75%, respectively. Code is available at this https URL.
zh

[CV-100] MGCA-Net: Multi-Grained Category-Aware Network for Open-Vocabulary Temporal Action Localization

【速读】:该论文旨在解决开放词汇时序动作定位(Open-Vocabulary Temporal Action Localization, OV-TAL)中因仅在单一粒度上识别动作类别而导致的基类和新类动作识别准确率下降的问题。解决方案的关键在于提出多粒度类别感知网络(Multi-Grained Category-Aware Network, MGCA-Net),通过引入局部定位器、动作存在预测器、传统分类器与粗到细分类器,实现对基类动作在片段粒度上的精细分类以及对新类动作在视频和提案粒度上的分层类别感知,从而有效提升动作定位性能。

链接: https://arxiv.org/abs/2511.13039
作者: Zhenying Fang,Richang Hong
机构: Hefei University of Technology (合肥工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 3 figures

点击查看摘要

Abstract:Open-Vocabulary Temporal Action Localization (OV-TAL) aims to recognize and localize instances of any desired action categories in videos without explicitly curating training data for all categories. Existing methods mostly recognize action categories at a single granularity, which degrades the recognition accuracy of both base and novel action categories. To address these issues, we propose a Multi-Grained Category-Aware Network (MGCA-Net) comprising a localizer, an action presence predictor, a conventional classifier, and a coarse-to-fine classifier. Specifically, the localizer localizes category-agnostic action proposals. For these action proposals, the action presence predictor estimates the probability that they belong to an action instance. At the same time, the conventional classifier predicts the probability of each action proposal over base action categories at the snippet granularity. Novel action categories are recognized by the coarse-to-fine classifier, which first identifies action presence at the video granularity. Finally, it assigns each action proposal to one category from the coarse categories at the proposal granularity. Through coarse-to-fine category awareness for novel actions and the conventional classifier’s awareness of base actions, multi-grained category awareness is achieved, effectively enhancing localization performance. Comprehensive evaluations on the THUMOS’14 and ActivityNet-1.3 benchmarks demonstrate that our method achieves state-of-the-art performance. Furthermore, our MGCA-Net achieves state-of-the-art results under the Zero-Shot Temporal Action Localization setting.
zh

[CV-101] uCLIP: Parameter-Efficient Multilingual Extension of Vision-Language Models with Unpaired Data

【速读】:该论文旨在解决多语言视觉-语言模型在低资源语言(如捷克语、芬兰语、克罗地亚语、匈牙利语和罗马尼亚语)中表现不佳的问题,尤其是在跨模态检索任务上的性能瓶颈。现有模型在Crossmodal-3600(XM3600)基准上对这些语言的检索准确率显著偏低,主要受限于高质量多语言图文数据的稀缺性。解决方案的关键在于提出一种轻量级且数据高效的多语言视觉-语言对齐框架:该方法无需图像-文本对或文本-文本对进行训练,冻结预训练的图像编码器和多语言文本编码器,仅训练一个参数量仅为170万的投影模块,并利用英文表示作为语义锚点,通过对比损失实现跨语言对齐。这种基于pivot(锚点)的参数高效策略在极少监督条件下仍能实现鲁棒的多语言对齐,显著提升了低资源语言的检索性能。

链接: https://arxiv.org/abs/2511.13036
作者: Dahyun Chung,Donghyun Shin,Yujin Sung,Seunggi Moon,Jinwoo Jeon,Byung-Jun Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Our project page can be found at this https URL

点击查看摘要

Abstract:Contrastive Language-Image Pre-training (CLIP) has demonstrated strong generalization across a wide range of visual tasks by leveraging large-scale English-image pairs. However, its extension to low-resource languages remains limited due to the scarcity of high-quality multilingual image-text data. Existing multilingual vision-language models exhibit consistently low retrieval performance in underrepresented languages including Czech, Finnish, Croatian, Hungarian, and Romanian on the Crossmodal-3600 (XM3600) benchmark. To address this, we propose a lightweight and data-efficient framework for multilingual vision-language alignment. Our approach requires no image-text pairs or text-text pairs and freezes both the pretrained image encoder and multilingual text encoder during training. Only a compact 1.7M-parameter projection module is trained, using a contrastive loss over English representations as semantic anchors. This minimal training setup enables robust multilingual alignment even for languages with limited supervision. Extensive evaluation across multiple multilingual retrieval benchmarks confirms the effectiveness of our method, showing significant gains in five underrepresented languages where existing models typically underperform. These findings highlight the effectiveness of our pivot-based, parameter-efficient alignment strategy for inclusive multimodal learning.
zh

[CV-102] Uni-Inter: Unifying 3D Human Motion Synthesis Across Diverse Interaction Contexts

【速读】:该论文旨在解决现有生成式AI(Generative AI)在人类运动生成中对交互场景泛化能力不足的问题,特别是针对人-人、人-物及人-场景等多样化交互任务缺乏统一建模方法的局限性。其解决方案的关键在于提出统一交互体积(Unified Interactive Volume, UIV),这是一种将异构交互实体编码为共享空间场的体素表示方法,从而实现一致的关系推理与复合交互建模;在此基础上,通过在UIV上进行关节级概率预测,模型能够捕捉细粒度的空间依赖关系,生成具上下文感知且连贯的行为,显著提升跨任务的泛化性能。

链接: https://arxiv.org/abs/2511.13032
作者: Sheng Liu,Yuanzhi Liang,Jiepeng Wang,Sidan Du,Chi Zhang,Xuelong Li
机构: Nanjing University (南京大学); Institute of Artificial Intelligence, China Telecom (TeleAI) (中国电信人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present Uni-Inter, a unified framework for human motion generation that supports a wide range of interaction scenarios: including human-human, human-object, and human-scene-within a single, task-agnostic architecture. In contrast to existing methods that rely on task-specific designs and exhibit limited generalization, Uni-Inter introduces the Unified Interactive Volume (UIV), a volumetric representation that encodes heterogeneous interactive entities into a shared spatial field. This enables consistent relational reasoning and compound interaction modeling. Motion generation is formulated as joint-wise probabilistic prediction over the UIV, allowing the model to capture fine-grained spatial dependencies and produce coherent, context-aware behaviors. Experiments across three representative interaction tasks demonstrate that Uni-Inter achieves competitive performance and generalizes well to novel combinations of entities. These results suggest that unified modeling of compound interactions offers a promising direction for scalable motion synthesis in complex environments.
zh

[CV-103] owards 3D Object-Centric Feature Learning for Semantic Scene Completion AAAI-2026

【速读】:该论文旨在解决视觉引导的3D语义场景补全(Vision-based 3D Semantic Scene Completion, SSC)中因采用以自我为中心(ego-centric)建模方式而导致的细粒度对象级细节缺失问题,从而引发语义与几何模糊性,尤其在复杂环境下的表现受限。其解决方案的关键在于提出一种以对象为中心(object-centric)的预测框架Ocean,通过三个核心模块实现:首先利用轻量级分割模型MobileSAM提取图像中的实例掩码;其次引入3D语义组注意力模块(3D Semantic Group Attention),借助线性注意力机制在3D空间中聚合对象级特征;再次设计全局相似性引导注意力模块(Global Similarity-Guided Attention),利用分割特征增强全局交互以缓解分割错误和实例缺失问题;最后提出实例感知局部扩散模块(Instance-aware Local Diffusion),通过生成式过程优化实例特征并进一步提升BEV空间中的场景表示精度。该方法显著提升了复杂场景下的语义占用预测准确性,在SemanticKITTI和SSCBench-KITTI360基准上分别达到17.40和20.28的mIoU。

链接: https://arxiv.org/abs/2511.13031
作者: Weihua Wang,Yubo Cui,Xiangru Lin,Zhiheng Li,Zheng Fang
机构: 1. University of Science and Technology of China (中国科学技术大学); 2. Institute of Artificial Intelligence, Chinese Academy of Sciences (中国科学院人工智能研究院); 3. Hefei National Laboratory for Physical Sciences at the Microscale (合肥国家实验室); 4. Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accept by AAAI-2026

点击查看摘要

Abstract:Vision-based 3D Semantic Scene Completion (SSC) has received growing attention due to its potential in autonomous driving. While most existing approaches follow an ego-centric paradigm by aggregating and diffusing features over the entire scene, they often overlook fine-grained object-level details, leading to semantic and geometric ambiguities, especially in complex environments. To address this limitation, we propose Ocean, an object-centric prediction framework that decomposes the scene into individual object instances to enable more accurate semantic occupancy prediction. Specifically, we first employ a lightweight segmentation model, MobileSAM, to extract instance masks from the input image. Then, we introduce a 3D Semantic Group Attention module that leverages linear attention to aggregate object-centric features in 3D space. To handle segmentation errors and missing instances, we further design a Global Similarity-Guided Attention module that leverages segmentation features for global interaction. Finally, we propose an Instance-aware Local Diffusion module that improves instance features through a generative process and subsequently refines the scene representation in the BEV space. Extensive experiments on the SemanticKITTI and SSCBench-KITTI360 benchmarks demonstrate that Ocean achieves state-of-the-art performance, with mIoU scores of 17.40 and 20.28, respectively.
zh

[CV-104] REVISOR: Beyond Textual Reflection Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

【速读】:该论文旨在解决纯文本反思机制在长视频理解任务中表现受限的问题,其根本原因在于:一方面,长视频包含更丰富动态的视觉信息,仅对文本进行再思考不足以充分理解视频内容;另一方面,纯文本反思机制缺乏跨模态交互能力,无法在反思过程中有效融合视觉信息。解决方案的关键在于提出REVISOR(REflective VIsual Segment Oriented Reasoning)框架,该框架通过引入工具增强的多模态反思机制,使大语言模型(MLLMs)能够协同构建文本与视觉模态间的内省式反思过程,从而显著提升对长视频的理解能力。为确保强化学习阶段模型能准确聚焦于与问题高度相关的视频片段,作者设计了双归属解耦奖励机制(Dual Attribution Decoupled Reward, DADR),并集成到GRPO训练策略中,以强制模型推理与所选视频证据之间保持因果一致性。该方案无需额外监督微调或外部模型即可显著提升性能,在VideoMME、LongVideoBench、MLVU和LVBench四个基准上取得优异结果。

链接: https://arxiv.org/abs/2511.13026
作者: Jiaze Li,Hao Yin,Wenhui Tan,Jingyang Chen,Boshen Xu,Yuxun Qu,Yijing Chen,Jianzhong Ju,Zhenbo Luo,Jian Luan
机构: MiLM Plus, Xiaomi Inc.; Renmin University of China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Self-reflection mechanisms that rely on purely text-based rethinking processes perform well in most multimodal tasks. However, when directly applied to long-form video understanding scenarios, they exhibit clear limitations. The fundamental reasons for this lie in two points: (1)long-form video understanding involves richer and more dynamic visual input, meaning rethinking only the text information is insufficient and necessitates a further rethinking process specifically targeting visual information; (2) purely text-based reflection mechanisms lack cross-modal interaction capabilities, preventing them from fully integrating visual information during reflection. Motivated by these insights, we propose REVISOR (REflective VIsual Segment Oriented Reasoning), a novel framework for tool-augmented multimodal reflection. REVISOR enables MLLMs to collaboratively construct introspective reflection processes across textual and visual modalities, significantly enhancing their reasoning capability for long-form video understanding. To ensure that REVISOR can learn to accurately review video segments highly relevant to the question during reinforcement learning, we designed the Dual Attribution Decoupled Reward (DADR) mechanism. Integrated into the GRPO training strategy, this mechanism enforces causal alignment between the model’s reasoning and the selected video evidence. Notably, the REVISOR framework significantly enhances long-form video understanding capability of MLLMs without requiring supplementary supervised fine-tuning or external models, achieving impressive results on four benchmarks including VideoMME, LongVideoBench, MLVU, and LVBench.
zh

[CV-105] SpectralAdapt: Semi-Supervised Domain Adaptation with Spectral Priors for Human-Centered Hyperspectral Image Reconstruction

【速读】:该论文旨在解决医疗领域中高光谱成像(Hyperspectral Imaging, HSI)数据稀缺与域偏移问题,即如何利用有限的标注人类HSI数据和大量易获取的通用域RGB图像,实现高质量的HSI重建。其解决方案的关键在于提出了一种半监督域适应(Semi-Supervised Domain Adaptation, SSDA)框架SpectralAdapt,核心创新包括:1)引入光谱密度掩码(Spectral Density Masking, SDM),通过自适应掩蔽RGB通道以增强一致性训练中对信息丰富区域的恢复能力;2)设计光谱端元表示对齐(Spectral Endmember Representation Alignment, SERA),利用标注像素提取物理可解释的端元作为域不变锚点,引导未标注数据预测,并结合动量更新机制提升稳定性与适应性。二者协同作用,有效缓解了域偏移、光谱退化和数据稀缺问题,显著提升了重建光谱保真度与跨域泛化性能。

链接: https://arxiv.org/abs/2511.13020
作者: Yufei Wen,Yuting Zhang,Jingdan Kang,Hao Ren,Weibin Cheng,Jintai Chen,Kaishun Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Hyperspectral imaging (HSI) holds great potential for healthcare due to its rich spectral information. However, acquiring HSI data remains costly and technically demanding. Hyperspectral image reconstruction offers a practical solution by recovering HSI data from accessible modalities, such as RGB. While general domain datasets are abundant, the scarcity of human HSI data limits progress in medical applications. To tackle this, we propose SpectralAdapt, a semi-supervised domain adaptation (SSDA) framework that bridges the domain gap between general and human-centered HSI datasets. To fully exploit limited labels and abundant unlabeled data, we enhance spectral reasoning by introducing Spectral Density Masking (SDM), which adaptively masks RGB channels based on their spectral complexity, encouraging recovery of informative regions from complementary cues during consistency training. Furthermore, we introduce Spectral Endmember Representation Alignment (SERA), which derives physically interpretable endmembers from valuable labeled pixels and employs them as domain-invariant anchors to guide unlabeled predictions, with momentum updates ensuring adaptability and stability. These components are seamlessly integrated into SpectralAdapt, a spectral prior-guided framework that effectively mitigates domain shift, spectral degradation, and data scarcity in HSI reconstruction. Experiments on benchmark datasets demonstrate consistent improvements in spectral fidelity, cross-domain generalization, and training stability, highlighting the promise of SSDA as an efficient solution for hyperspectral imaging in healthcare.
zh

[CV-106] MeanFlow Transformers with Representation Autoencoders

【速读】:该论文旨在解决生成式 AI(Generative AI)中扩散模型(Diffusion Model)训练与采样效率低、稳定性差以及类条件生成依赖复杂引导超参数的问题。其核心挑战在于:传统基于噪声到数据的长跳步生成方法(如 MeanFlow, MF)在高维空间中训练成本高且易发散,而结合预训练变分自编码器(如 Stable Diffusion VAE)的潜空间实现虽能提升建模能力,却仍存在推理阶段解码器计算开销大、引导机制复杂等问题。解决方案的关键在于:引入一种基于 Representation Autoencoder (RAE) 的潜空间框架,利用预训练视觉编码器(如 DINO)提供语义丰富且轻量的潜变量表示,并通过一致性中段训练(Consistency Mid-Training)进行轨迹感知初始化,辅以两阶段训练策略——第一阶段使用流匹配教师模型蒸馏加速收敛并降低方差,第二阶段采用单点速度估计器进行可选自举优化,从而显著减少对引导参数的依赖、简化配置、提升训练稳定性与采样效率。实验表明,该方法在 ImageNet 256 上实现了 1 步 FID 2.03(优于原生 MF 的 3.43),同时将采样 GFLOPS 减少 38%,总训练成本下降 83%;在 ImageNet 512 上也达到竞争性性能(1 步 FID 3.23)且 GFLOPS 最低。

链接: https://arxiv.org/abs/2511.13019
作者: Zheyuan Hu,Chieh-Hsin Lai,Ge Wu,Yuki Mitsufuji,Stefano Ermon
机构: Sony AI(索尼人工智能); Sony Group Corporation(索尼集团); Nankai University(南开大学); Stanford University(斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code is available at this https URL

点击查看摘要

Abstract:MeanFlow (MF) is a diffusion-motivated generative model that enables efficient few-step generation by learning long jumps directly from noise to data. In practice, it is often used as a latent MF by leveraging the pre-trained Stable Diffusion variational autoencoder (SD-VAE) for high-dimensional data modeling. However, MF training remains computationally demanding and is often unstable. During inference, the SD-VAE decoder dominates the generation cost, and MF depends on complex guidance hyperparameters for class-conditional generation. In this work, we develop an efficient training and sampling scheme for MF in the latent space of a Representation Autoencoder (RAE), where a pre-trained vision encoder (e.g., DINO) provides semantically rich latents paired with a lightweight decoder. We observe that naive MF training in the RAE latent space suffers from severe gradient explosion. To stabilize and accelerate training, we adopt Consistency Mid-Training for trajectory-aware initialization and use a two-stage scheme: distillation from a pre-trained flow matching teacher to speed convergence and reduce variance, followed by an optional bootstrapping stage with a one-point velocity estimator to further reduce deviation from the oracle mean flow. This design removes the need for guidance, simplifies training configurations, and reduces computation in both training and sampling. Empirically, our method achieves a 1-step FID of 2.03, outperforming vanilla MF’s 3.43, while reducing sampling GFLOPS by 38% and total training cost by 83% on ImageNet 256. We further scale our approach to ImageNet 512, achieving a competitive 1-step FID of 3.23 with the lowest GFLOPS among all baselines. Code is available at this https URL.
zh

[CV-107] Geometry Meets Light: Leverag ing Geometric Priors for Universal Photometric Stereo under Limited Multi-Illumination Cues AAAI2026

【速读】:该论文旨在解决通用光度立体(Universal Photometric Stereo)方法在复杂真实场景中因多光源信息不可靠(如偏置光照、阴影或自遮挡区域)而导致表面法向量恢复性能下降的问题。其解决方案的关键在于提出GeoUniPS网络,该网络通过融合合成监督信号与大规模3D重建模型所蕴含的高阶几何先验,构建了一个Light-Geometry Dual-Branch Encoder,从而同时提取多光源线索和几何约束;此外,为突破传统正交投影假设的局限性,作者引入了具有真实透视投影的PS-Perp数据集,使模型能够学习空间变化的视角方向,显著提升了在复杂户外场景中的泛化能力和精度。

链接: https://arxiv.org/abs/2511.13015
作者: King-Man Tam,Satoshi Ikehata,Yuta Asano,Zhaoyi An,Rei Kawakami
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2026 (Oral)

点击查看摘要

Abstract:Universal Photometric Stereo is a promising approach for recovering surface normals without strict lighting assumptions. However, it struggles when multi-illumination cues are unreliable, such as under biased lighting or in shadows or self-occluded regions of complex in-the-wild scenes. We propose GeoUniPS, a universal photometric stereo network that integrates synthetic supervision with high-level geometric priors from large-scale 3D reconstruction models pretrained on massive in-the-wild data. Our key insight is that these 3D reconstruction models serve as visual-geometry foundation models, inherently encoding rich geometric knowledge of real scenes. To leverage this, we design a Light-Geometry Dual-Branch Encoder that extracts both multi-illumination cues and geometric priors from the frozen 3D reconstruction model. We also address the limitations of the conventional orthographic projection assumption by introducing the PS-Perp dataset with realistic perspective projection to enable learning of spatially varying view directions. Extensive experiments demonstrate that GeoUniPS delivers state-of-the-arts performance across multiple datasets, both quantitatively and qualitatively, especially in the complex in-the-wild scenes.
zh

[CV-108] You Only Look Omni Gradient Backpropagation for Moving Infrared Small Target Detection

【速读】:该论文旨在解决红外小目标检测中因信杂比低、目标与背景分布严重失衡以及判别特征微弱而导致的检测性能瓶颈问题。现有深度学习方法主要依赖时空特征聚合,但效果受限,表明根本瓶颈在于单帧特征表示模糊而非时空建模本身。解决方案的关键在于提出BP-FPN(Backpropagation-driven Feature Pyramid Network),其核心创新为:1)Gradient-Isolated Low-Level Shortcut (GILS),通过隔离梯度流高效引入细粒度目标细节,避免捷径学习;2)Directional Gradient Regularization (DGR),在反向传播过程中强制层级特征一致性,从而提升特征表达质量。该设计理论严谨、计算开销极低,且可无缝嵌入现有框架,在多个公开数据集上实现显著性能提升,是首个完全从反向传播视角设计的小目标检测FPN架构。

链接: https://arxiv.org/abs/2511.13013
作者: Guoyi Zhang,Guangsheng Xu,Siyang Chen,Han Wang,Xiaohu Zhang
机构: Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Moving infrared small target detection is a key component of infrared search and tracking systems, yet it remains extremely challenging due to low signal-to-clutter ratios, severe target-background imbalance, and weak discriminative features. Existing deep learning methods primarily focus on spatio-temporal feature aggregation, but their gains are limited, revealing that the fundamental bottleneck lies in ambiguous per-frame feature representations rather than spatio-temporal modeling itself. Motivated by this insight, we propose BP-FPN, a backpropagation-driven feature pyramid architecture that fundamentally rethinks feature learning for small target. BP-FPN introduces Gradient-Isolated Low-Level Shortcut (GILS) to efficiently incorporate fine-grained target details without inducing shortcut learning, and Directional Gradient Regularization (DGR) to enforce hierarchical feature consistency during backpropagation. The design is theoretically grounded, introduces negligible computational overhead, and can be seamlessly integrated into existing frameworks. Extensive experiments on multiple public datasets show that BP-FPN consistently establishes new state-of-the-art performance. To the best of our knowledge, it is the first FPN designed for this task entirely from the backpropagation perspective.
zh

[CV-109] Beyond Darkness: Thermal-Supervised 3D Gaussian Splatting for Low-Light Novel View Synthesis

【速读】:该论文旨在解决极端低光照条件下新型视图合成(Novel View Synthesis, NVS)中出现的几何失真、颜色一致性差和辐射稳定性不足的问题。传统3D高斯泼溅(3D Gaussian Splatting, 3DGS)方法直接应用于欠曝输入时,因各视角独立增强导致光照不一致与几何畸变。其解决方案的关键在于提出DTGS框架,通过将Retinex-inspired光照分解与热引导的3DGS紧密结合,实现光照不变的重建;核心创新包括:1)引入循环增强-重建机制,在增强、几何与热监督之间进行联合优化;2)嵌入基于Retinex的分解模块于3DGS循环中,提供物理可解释的反射率-光照分离,确保跨视角的颜色与纹理一致性;3)设计热监督分支动态平衡增强、结构与热损失,提升颜色恢复与几何学习的稳定性。

链接: https://arxiv.org/abs/2511.13011
作者: Qingsen Ma,Chen Zou,Dianyun Wang,Jia Wang,Liuyu Xiang,Zhaofeng He
机构: Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Under extremely low-light conditions, novel view synthesis (NVS) faces severe degradation in terms of geometry, color consistency, and radiometric stability. Standard 3D Gaussian Splatting (3DGS) pipelines fail when applied directly to underexposed inputs, as independent enhancement across views causes illumination inconsistencies and geometric distortion. To address this, we present DTGS, a unified framework that tightly couples Retinex-inspired illumination decomposition with thermal-guided 3D Gaussian Splatting for illumination-invariant reconstruction. Unlike prior approaches that treat enhancement as a pre-processing step, DTGS performs joint optimization across enhancement, geometry, and thermal supervision through a cyclic enhancement-reconstruction mechanism. A thermal supervisory branch stabilizes both color restoration and geometry learning by dynamically balancing enhancement, structural, and thermal losses. Moreover, a Retinex-based decomposition module embedded within the 3DGS loop provides physically interpretable reflectance-illumination separation, ensuring consistent color and texture across viewpoints. To evaluate our method, we construct RGBT-LOW, a new multi-view low-light thermal dataset capturing severe illumination degradation. Extensive experiments show that DTGS significantly outperforms existing low-light enhancement and 3D reconstruction baselines, achieving superior radiometric consistency, geometric fidelity, and color stability under extreme illumination.
zh

[CV-110] R-Gaussians: High-fidelity Real-time Rendering of Planar Transmission and Reflection with 3D Gaussian Splatting

【速读】:该论文旨在解决室内场景中普遍存在且复杂的平面透射(transmission)与反射(reflection)现象的高保真渲染问题,尤其在基于3D-Gaussian表示的神经渲染方法中难以同时精确建模这两类光学效应。解决方案的关键在于提出了一种名为Transmission-Reflection Gaussians (TR-Gaussians) 的新型3D-Gaussian表示方法:通过引入可学习的反射平面来显式建模玻璃表面的视点依赖反射强度,其中真实场景和透射成分由原始3D Gaussians表示,而反射成分则通过关于反射平面镜像的Gaussians建模;二者通过基于Fresnel方程的视点依赖权重进行融合,从而实现复杂视角下多样的视觉效果还原。此外,作者设计了包含颜色与几何约束的多阶段优化框架及不透明度扰动机制,以提升优化稳定性与重建质量。

链接: https://arxiv.org/abs/2511.13009
作者: Yong Liu,Keyang Ye,Tianjia Shao,Kun Zhou
机构: Zhejiang University (浙江大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 12 figures

点击查看摘要

Abstract:We propose Transmission-Reflection Gaussians (TR-Gaussians), a novel 3D-Gaussian-based representation for high-fidelity rendering of planar transmission and reflection, which are ubiquitous in indoor scenes. Our method combines 3D Gaussians with learnable reflection planes that explicitly model the glass planes with view-dependent reflectance strengths. Real scenes and transmission components are modeled by 3D Gaussians and the reflection components are modeled by the mirrored Gaussians with respect to the reflection plane. The transmission and reflection components are blended according to a Fresnel-based, view-dependent weighting scheme, allowing for faithful synthesis of complex appearance effects under varying viewpoints. To effectively optimize TR-Gaussians, we develop a multi-stage optimization framework incorporating color and geometry constraints and an opacity perturbation mechanism. Experiments on different datasets demonstrate that TR-Gaussians achieve real-time, high-fidelity novel view synthesis in scenes with planar transmission and reflection, and outperform state-of-the-art approaches both quantitatively and qualitatively.
zh

[CV-111] SAGE: Spuriousness-Aware Guided Prompt Exploration for Mitigating Multimodal Bias AAAI2026

【速读】:该论文旨在解决大型视觉语言模型(如CLIP)在零样本分类任务中因多模态虚假关联(multimodal spurious bias)导致的鲁棒性下降问题,即模型过度依赖图像与文本之间的非本质共现特征(如背景信息),而非对象的核心语义特征,从而在分布外数据上表现显著退化。解决方案的关键在于提出一种无需训练、微调或外部标注的简单有效方法——虚假感知引导探索(Spuriousness-Aware Guided Exploration, SAGE),其核心思想是通过系统探索提示模板空间,选择能最大化类别间语义分离度的提示,从而抑制虚假关联并提升最差组(worst-group)的鲁棒性。

链接: https://arxiv.org/abs/2511.13005
作者: Wenqian Ye,Di Wang,Guangtao Zheng,Bohan Liu,Aidong Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at AAAI 2026

点击查看摘要

Abstract:Large vision-language models, such as CLIP, have shown strong zero-shot classification performance by aligning images and text in a shared embedding space. However, CLIP models often develop multimodal spurious biases, which is the undesirable tendency to rely on spurious features. For example, CLIP may infer object types in images based on frequently co-occurring backgrounds rather than the object’s core features. This bias significantly impairs the robustness of pre-trained CLIP models on out-of-distribution data, where such cross-modal associations no longer hold. Existing methods for mitigating multimodal spurious bias typically require fine-tuning on downstream data or prior knowledge of the bias, which undermines the out-of-the-box usability of CLIP. In this paper, we first theoretically analyze the impact of multimodal spurious bias in zero-shot classification. Based on this insight, we propose Spuriousness-Aware Guided Exploration (SAGE), a simple and effective method that mitigates spurious bias through guided prompt selection. SAGE requires no training, fine-tuning, or external annotations. It explores a space of prompt templates and selects the prompts that induce the largest semantic separation between classes, thereby improving worst-group robustness. Extensive experiments on four real-world benchmark datasets and five popular backbone models demonstrate that SAGE consistently improves zero-shot performance and generalization, outperforming previous zero-shot approaches without any external knowledge or model updates.
zh

[CV-112] Infinite-Story: A Training-Free Consistent Text-to-Image Generation AAAI2026

【速读】:该论文旨在解决多提示文本到图像(text-to-image, T2I)生成场景中普遍存在的身份不一致(identity inconsistency)和风格不一致(style inconsistency)问题,尤其针对视觉叙事任务中跨多个提示保持人物或场景一致性这一挑战。其核心解决方案是提出一种无需训练的框架Infinite-Story,关键创新在于三个互补技术:1)身份提示替换(Identity Prompt Replacement),用于缓解文本编码器中的上下文偏差,确保不同提示间身份属性对齐;2)统一注意力引导机制,包含自适应风格注入(Adaptive Style Injection)与同步引导适配(Synchronized Guidance Adaptation),协同强化全局风格与身份外观的一致性,同时保留原始提示语义 fidelity。该方法完全在测试阶段运行,避免了传统扩散模型需微调或推理缓慢的问题,在保证生成质量的同时实现超过6倍于现有最优方法的推理速度(每图1.72秒)。

链接: https://arxiv.org/abs/2511.13002
作者: Jihun Park,Kyoungmin Lee,Jongmin Gim,Hyeonseo Jo,Minseok Oh,Wonhyeok Choi,Kyumin Hwang,Jaeyeul Kim,Minwoo Choi,Sunghoon Im
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18pages, 13 figures, AAAI 2026 Oral

点击查看摘要

Abstract:We present Infinite-Story, a training-free framework for consistent text-to-image (T2I) generation tailored for multi-prompt storytelling scenarios. Built upon a scale-wise autoregressive model, our method addresses two key challenges in consistent T2I generation: identity inconsistency and style inconsistency. To overcome these issues, we introduce three complementary techniques: Identity Prompt Replacement, which mitigates context bias in text encoders to align identity attributes across prompts; and a unified attention guidance mechanism comprising Adaptive Style Injection and Synchronized Guidance Adaptation, which jointly enforce global style and identity appearance consistency while preserving prompt fidelity. Unlike prior diffusion-based approaches that require fine-tuning or suffer from slow inference, Infinite-Story operates entirely at test time, delivering high identity and style consistency across diverse prompts. Extensive experiments demonstrate that our method achieves state-of-the-art generation performance, while offering over 6X faster inference (1.72 seconds per image) than the existing fastest consistent T2I models, highlighting its effectiveness and practicality for real-world visual storytelling.
zh

[CV-113] Medal S: Spatio-Textual Prompt Model for Medical Segmentation CVPR2025

【速读】:该论文旨在解决多模态医学图像分割中因分辨率不匹配导致的空间精度不足以及传统文本驱动方法缺乏空间感知能力的问题。其关键解决方案是提出Medal S,一个支持原生分辨率空间与文本提示的端到端可训练基础模型,通过通道级对齐体积提示与文本嵌入,实现跨模态语义一致性;同时引入轻量3D卷积模块进行体素空间精修,并结合动态重采样策略优化目标补丁比例不平衡问题,从而在CT、MRI、PET、超声和显微镜等五种模态上实现高精度、高效率的多类别分割(最多支持243类),显著优于现有方法如SAT和nnU-Net。

链接: https://arxiv.org/abs/2511.13001
作者: Pengcheng Shi,Jiawei Chen,Jiaqi Liu,Xinglin Zhang,Tao Chen,Lei Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025 Workshop MedSegFM

点击查看摘要

Abstract:We introduce Medal S, a medical segmentation foundation model that supports native-resolution spatial and textual prompts within an end-to-end trainable framework. Unlike text-only methods lacking spatial awareness, Medal S achieves channel-wise alignment between volumetric prompts and text embeddings, mitigating inaccuracies from resolution mismatches. By preserving full 3D context, it efficiently processes multiple native-resolution masks in parallel, enhancing multi-class segmentation performance. A lightweight 3D convolutional module enables precise voxel-space refinement guided by both prompt types, supporting up to 243 classes across CT, MRI, PET, ultrasound, and microscopy modalities in the BiomedSegFM dataset. Medal S offers two prompting modes: a text-only mode, where model predictions serve as spatial prompts for self-refinement without human input, and a hybrid mode, incorporating manual annotations for enhanced flexibility. For 24-class segmentation, parallel spatial prompting reduces inference time by more than 90% compared to sequential prompting. We propose dynamic resampling to address target-patch ratio imbalance, extending SAT and nnU-Net for data augmentation. Furthermore, we develop optimized text preprocessing, a two-stage inference strategy, and post-processing techniques to improve memory efficiency, precision, and inference speed. On the five-modality average on the validation set, Medal S outperforms SAT with a DSC of 75.44 (vs. 69.83), NSD of 77.34 (vs. 71.06), F1 of 38.24 (vs. 24.88), and DSC TP of 65.46 (vs. 46.97). Medal S achieves excellent performance by harmonizing spatial precision with semantic textual guidance, demonstrating superior efficiency and accuracy in multi-class medical segmentation tasks compared to sequential prompt-based approaches. Medal S will be publicly available at this https URL.
zh

[CV-114] PerTouch: VLM-Driven Agent for Personalized and Semantic Image Retouching AAAI2026

【速读】:该论文旨在解决图像修饰(image retouching)中可控性与主观偏好之间难以平衡的问题,尤其在个性化审美需求日益增长的背景下。其核心挑战在于如何在保持全局美学质量的同时,实现对特定语义区域的细粒度控制,并准确响应用户自然语言指令。解决方案的关键在于提出一个基于扩散模型(diffusion-based)的统一框架 PerTouch,通过引入包含特定语义区域属性值的参数图(parameter maps)构建显式的参数到图像映射关系,从而实现语义级别的精细调节;同时,设计语义替换和参数扰动机制以增强语义边界感知能力,并利用视觉语言模型(VLM-driven)驱动的代理(agent)将自然语言指令转化为可视化控制信号,结合反馈驱动的再思考机制与场景感知记忆模块,显著提升对用户意图的理解与长期偏好建模能力。

链接: https://arxiv.org/abs/2511.12998
作者: Zewei Chang,Zheng-Peng Duan,Jianxing Zhang,Chun-Le Guo,Siyu Liu,Hyungju Chun,Hyunhee Park,Zikun Liu,Chongyi Li
机构: 1. University of Science and Technology of China (中国科学技术大学);
2. Tsinghua University (清华大学);
3. Samsung Electronics Co., Ltd. (三星电子有限公司);
4. Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To appear at AAAI 2026

点击查看摘要

Abstract:Image retouching aims to enhance visual quality while aligning with users’ personalized aesthetic preferences. To address the challenge of balancing controllability and subjectivity, we propose a unified diffusion-based image retouching framework called PerTouch. Our method supports semantic-level image retouching while maintaining global aesthetics. Using parameter maps containing attribute values in specific semantic regions as input, PerTouch constructs an explicit parameter-to-image mapping for fine-grained image retouching. To improve semantic boundary perception, we introduce semantic replacement and parameter perturbation mechanisms in the training process. To connect natural language instructions with visual control, we develop a VLM-driven agent that can handle both strong and weak user instructions. Equipped with mechanisms of feedback-driven rethinking and scene-aware memory, PerTouch better aligns with user intent and captures long-term preferences. Extensive experiments demonstrate each component’s effectiveness and the superior performance of PerTouch in personalized image retouching. Code is available at: this https URL.
zh

[CV-115] Semantic Prioritization in Visual Counterfactual Explanations with Weighted Segmentation and Auto-Adaptive Region Selection

【速读】:该论文旨在解决非生成式视觉反事实解释(Visual Counterfactual Explanations, CE)中传统方法因忽视替换区域与目标对象之间的语义相关性而导致模型可解释性下降及编辑流程效率低下的问题。其解决方案的关键在于提出一种名为加权语义图与自适应候选编辑网络(Weighted Semantic Map with Auto-adaptive Candidate Editing Network, WSAE-Net)的新方法,核心创新包括:一是构建加权语义图以最小化非语义特征单元的计算量,提升计算效率;二是设计自适应候选编辑序列,动态确定待处理特征单元的最优计算顺序,在保障替换区域语义相关性的前提下高效生成反事实样本。

链接: https://arxiv.org/abs/2511.12992
作者: Lintong Zhang,Kang Yin,Seong-Whan Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 31page, 7 figures

点击查看摘要

Abstract:In the domain of non-generative visual counterfactual explanations (CE), traditional techniques frequently involve the substitution of sections within a query image with corresponding sections from distractor images. Such methods have historically overlooked the semantic relevance of the replacement regions to the target object, thereby impairing the model’s interpretability and hindering the editing workflow. Addressing these challenges, the present study introduces an innovative methodology named as Weighted Semantic Map with Auto-adaptive Candidate Editing Network (WSAE-Net). Characterized by two significant advancements: the determination of an weighted semantic map and the auto-adaptive candidate editing sequence. First, the generation of the weighted semantic map is designed to maximize the reduction of non-semantic feature units that need to be computed, thereby optimizing computational efficiency. Second, the auto-adaptive candidate editing sequences are designed to determine the optimal computational order among the feature units to be processed, thereby ensuring the efficient generation of counterfactuals while maintaining the semantic relevance of the replacement feature units to the target object. Through comprehensive experimentation, our methodology demonstrates superior performance, contributing to a more lucid and in-depth understanding of visual counterfactual explanations.
zh

[CV-116] UNSEEN: Enhancing Dataset Pruning from a Generalization Perspective AAAI2026

【速读】:该论文旨在解决深度学习中因数据集规模不断扩大而带来的显著计算挑战,特别是传统数据集剪枝(dataset pruning)方法在样本评分时依赖训练阶段的模型性能,导致评分分布过于集中、难以有效区分样本的问题。其解决方案的关键在于从泛化能力(generalization)角度出发,即使用未见过目标样本的模型来对样本进行评分,从而提升评分的区分度;并提出了一个可插拔的框架UNSEEN,能够集成到现有剪枝方法中,同时通过多步增量选择机制,利用在不同子集上训练的模型动态优化核心集(coreset)质量,实现更高效且高质量的数据集剪枝。

链接: https://arxiv.org/abs/2511.12988
作者: Furui Xu,Shaobo Wang,Jiajun Zhang,Chenghao Sun,Haixiang Tang,Linfeng Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: AAAI 2026, 13 pages, 9 figures, 5 tables

点击查看摘要

Abstract:The growing scale of datasets in deep learning has introduced significant computational challenges. Dataset pruning addresses this challenge by constructing a compact but informative coreset from the full dataset with comparable performance. Previous approaches typically establish scoring metrics based on specific criteria to identify representative samples. However, these methods predominantly rely on sample scores obtained from the model’s performance during the training (i.e., fitting) phase. As scoring models achieve near-optimal performance on training data, such fitting-centric approaches induce a dense distribution of sample scores within a narrow numerical range. This concentration reduces the distinction between samples and hinders effective selection. To address this challenge, we conduct dataset pruning from the perspective of generalization, i.e., scoring samples based on models not exposed to them during training. We propose a plug-and-play framework, UNSEEN, which can be integrated into existing dataset pruning methods. Additionally, conventional score-based methods are single-step and rely on models trained solely on the complete dataset, providing limited perspective on the importance of samples. To address this limitation, we scale UNSEEN to multi-step scenarios and propose an incremental selection technique through scoring models trained on varying coresets, and optimize the quality of the coreset dynamically. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art (SOTA) methods on CIFAR-10, CIFAR-100, and ImageNet-1K. Notably, on ImageNet-1K, UNSEEN achieves lossless performance while reducing training data by 30%.
zh

[CV-117] Angular Gradient Sign Method: Uncovering Vulnerabilities in Hyperbolic Networks AAAI2026

【速读】:该论文旨在解决现有对抗攻击方法在非欧几里得几何空间(特别是双曲空间)中效率低下且几何不一致的问题。传统方法如FGSM和PGD在施加扰动时未考虑双曲空间的内在几何结构,导致攻击效果受限。其解决方案的关键在于:首先在双曲空间的切空间中计算损失函数梯度,并将其分解为径向(深度)分量与角向(语义)分量;随后仅基于角向分量施加扰动,从而聚焦于语义敏感方向,实现对双曲嵌入的高效攻击。该方法显著提升了攻击成功率,同时揭示了双曲表示空间中的潜在脆弱性,为面向层次化嵌入的几何感知攻击提供了理论框架。

链接: https://arxiv.org/abs/2511.12985
作者: Minsoo Jo,Dongyoon Yang,Taesup Kim
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Adversarial examples in neural networks have been extensively studied in Euclidean geometry, but recent advances in \textithyperbolic networks call for a reevaluation of attack strategies in non-Euclidean geometries. Existing methods such as FGSM and PGD apply perturbations without regard to the underlying hyperbolic structure, potentially leading to inefficient or geometrically inconsistent attacks. In this work, we propose a novel adversarial attack that explicitly leverages the geometric properties of hyperbolic space. Specifically, we compute the gradient of the loss function in the tangent space of hyperbolic space, decompose it into a radial (depth) component and an angular (semantic) component, and apply perturbation derived solely from the angular direction. Our method generates adversarial examples by focusing perturbations in semantically sensitive directions encoded in angular movement within the hyperbolic geometry. Empirical results on image classification, cross-modal retrieval tasks and network architectures demonstrate that our attack achieves higher fooling rates than conventional adversarial attacks, while producing high-impact perturbations with deeper insights into vulnerabilities of hyperbolic embeddings. This work highlights the importance of geometry-aware adversarial strategies in curved representation spaces and provides a principled framework for attacking hierarchical embeddings.
zh

[CV-118] SafeGRPO: Self-Rewarded Multimodal Safety Alignment via Rule-Governed Policy Optimization

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在文本与图像交互中产生的复合安全风险问题,这类风险可能在单个输入无害的情况下因跨模态耦合而引发不安全语义,暴露出当前MLLMs在安全意识上的脆弱性。解决方案的关键在于提出SafeGRPO框架,其核心是将规则驱动的奖励构建机制整合进无需人工标注的Group Relative Policy Optimization (GRPO),从而实现可解释且可验证的推理安全性优化;该框架基于自建的SafeTag-VL-3K数据集(包含视觉、文本及联合安全标签),通过步骤引导的安全思维机制强化结构化推理与行为对齐,在不牺牲通用能力的前提下显著提升多模态安全意识、组合鲁棒性和推理稳定性。

链接: https://arxiv.org/abs/2511.12982
作者: Xuankun Rong,Wenke Huang,Tingfeng Wang,Daiguo Zhou,Bo Du,Mang Ye
机构: Wuhan University (武汉大学); Xiaomi Inc. (小米公司)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have demonstrated impressive reasoning and instruction-following capabilities, yet their expanded modality space introduces new compositional safety risks that emerge from complex text-image interactions. Such cross-modal couplings can produce unsafe semantics even when individual inputs are benign, exposing the fragile safety awareness of current MLLMs. While recent works enhance safety by guiding models to reason about potential risks, unregulated reasoning traces may compromise alignment; although Group Relative Policy Optimization (GRPO) offers self-rewarded refinement without human supervision, it lacks verifiable signals for reasoning safety. To address this, we propose SafeGRPO a self-rewarded multimodal safety alignment framework that integrates rule-governed reward construction into GRPO, enabling interpretable and verifiable optimization of reasoning safety. Built upon the constructed SafeTag-VL-3K dataset with explicit visual, textual, and combined safety tags, SafeGRPO performs step-guided safety thinking to enforce structured reasoning and behavior alignment, substantially improving multimodal safety awareness, compositional robustness, and reasoning stability across diverse benchmarks without sacrificing general capabilities.
zh

[CV-119] Concept Regions Matter: Benchmarking CLIP with a New Cluster-Importance Approach

【速读】:该论文旨在解决对比视觉语言模型(Contrastive Vision-Language Models, VLMs)如CLIP在零样本识别中对虚假相关性(spurious correlations)敏感的问题,尤其是背景依赖(background over-reliance)导致的性能下降。其核心解决方案是提出一种名为基于聚类的概念重要性(Cluster-based Concept Importance, CCI)的新颖可解释性方法:利用CLIP自身的patch embeddings将空间patch聚类为语义一致的簇,通过掩码这些簇并评估模型预测的变化来量化每个区域的重要性。CCI在忠实性基准测试中达到新的最优性能,例如在MS COCO检索任务的删除-AUC指标上提升超过两倍;进一步结合GroundedSAM,可自动区分预测是由前景还是背景驱动,从而提供关键诊断能力。该研究还揭示现有基准(如CounterAnimals)仅依赖准确率且隐含假设所有性能下降源于背景相关性的局限性,并引入COVAR基准系统性地分离前景与背景变量的影响,为构建更鲁棒的VLM提供了方法论进步和实证依据。

链接: https://arxiv.org/abs/2511.12978
作者: Aishwarya Agarwal,Srikrishna Karanam,Vineet Gandhi
机构: CVIT, Kohli Centre for Intelligent Systems, IIIT Hyderabad, India(印度海得拉巴国际信息学院); Adobe Research, Bengaluru, India(印度班加罗尔Adobe研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages, 21 figures

点击查看摘要

Abstract:Contrastive vision-language models (VLMs) such as CLIP achieve strong zero-shot recognition yet remain vulnerable to spurious correlations, particularly background over-reliance. We introduce Cluster-based Concept Importance (CCI), a novel interpretability method that uses CLIP’s own patch embeddings to group spatial patches into semantically coherent clusters, mask them, and evaluate relative changes in model predictions. CCI sets a new state of the art on faithfulness benchmarks, surpassing prior methods by large margins; for example, it yields more than a twofold improvement on the deletion-AUC metric for MS COCO retrieval. We further propose that CCI, when combined with GroundedSAM, automatically categorizes predictions as foreground- or background-driven, providing a crucial diagnostic ability. Existing benchmarks such as CounterAnimals, however, rely solely on accuracy and implicitly attribute all performance degradation to background correlations. Our analysis shows this assumption to be incomplete, since many errors arise from viewpoint variation, scale shifts, and fine-grained object confusions. To disentangle these effects, we introduce COVAR, a benchmark that systematically varies object foregrounds and backgrounds. Leveraging CCI with COVAR, we present a comprehensive evaluation of eighteen CLIP variants, offering methodological advances and empirical evidence that chart a path toward more robust VLMs.
zh

[CV-120] ArtiWorld: LLM -Driven Articulation of 3D Objects in Scenes

【速读】:该论文旨在解决当前交互式模拟器和可扩展机器人学习环境构建中缺乏大量关节可动资产的问题,尤其是现有3D资产多为刚体,手动将其转化为关节结构(articulated objects)成本高昂且效率低下。为此,作者提出ArtiWorld,一个场景感知的自动化流水线,其核心是Arti4URDF模块——该模块利用3D点云数据、大语言模型(Large Language Model, LLM)先验知识以及面向URDF(Unified Robot Description Format)的提示工程设计,实现从文本场景描述中定位候选可动对象,并快速重建保留原始几何形状的可执行URDF模型。关键创新在于将LLM语义理解能力与URDF建模约束相结合,实现了对刚体对象的自动关节化转换,同时保持几何保真度和交互逻辑正确性,显著提升了仿真环境中关节物体生成的自动化水平与实用性。

链接: https://arxiv.org/abs/2511.12977
作者: Yixuan Yang,Luyang Xie,Zhen Luo,Zixiang Zhao,Mingqi Gao,Feng Zheng
机构: SUSTech (南方科技大学); SII; ETH Zürich (苏黎世联邦理工学院); Spatialtemporal AI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Building interactive simulators and scalable robot-learning environments requires a large number of articulated assets. However, most existing 3D assets in simulation are rigid, and manually converting them into articulated objects is extremely labor- and cost-intensive. This raises a natural question: can we automatically identify articulable objects in a scene and convert them into articulated assets directly? In this paper, we present ArtiWorld, a scene-aware pipeline that localizes candidate articulable objects from textual scene descriptions and reconstructs executable URDF models that preserve the original geometry. At the core of this pipeline is Arti4URDF, which leverages 3D point cloud, prior knowledge of a large language model (LLM), and a URDF-oriented prompt design to rapidly convert rigid objects into interactive URDF-based articulated objects while maintaining their 3D shape. We evaluate ArtiWorld at three levels: 3D simulated objects, full 3D simulated scenes, and real-world scan scenes. Across all three settings, our method consistently outperforms existing approaches and achieves state-of-the-art performance, while preserving object geometry and correctly capturing object interactivity to produce usable URDF-based articulated models. This provides a practical path toward building interactive, robot-ready simulation environments directly from existing 3D assets. Code and data will be released.
zh

[CV-121] MCAQ-YOLO: Morphological Complexity-Aware Quantization for Efficient Object Detection with Curriculum Learning

【速读】:该论文旨在解决现有神经网络量化方法中采用全局统一比特精度导致的效率与精度失衡问题,即未考虑视觉数据在空间区域上的结构和纹理复杂度差异,从而限制了模型在计算资源受限场景下的性能表现。其解决方案的关键在于提出一种形态学复杂度感知的量化框架MCAQ-YOLO,通过引入五种形态学指标(分形维数、纹理熵、梯度方差、边缘密度和轮廓复杂度)来刻画局部视觉形态特征,并据此指导空间自适应比特分配;同时结合课程学习式的量化感知训练策略,逐步提升量化难度以稳定优化过程并加速收敛,最终实现高精度与高压缩比的协同提升。

链接: https://arxiv.org/abs/2511.12976
作者: Yoonjae Seo,Ermal Elbasani,Jaehong Lee
机构: Sejong University (世宗大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 9 pages, 2 figures, 7 tables. Preprint

点击查看摘要

Abstract:Most neural network quantization methods apply uniform bit precision across spatial regions, ignoring the heterogeneous structural and textural complexity of visual data. This paper introduces MCAQ-YOLO, a morphological complexity-aware quantization framework for object detection. The framework employs five morphological metrics - fractal dimension, texture entropy, gradient variance, edge density, and contour complexity - to characterize local visual morphology and guide spatially adaptive bit allocation. By correlating these metrics with quantization sensitivity, MCAQ-YOLO dynamically adjusts bit precision according to spatial complexity. In addition, a curriculum-based quantization-aware training scheme progressively increases quantization difficulty to stabilize optimization and accelerate convergence. Experimental results demonstrate a strong correlation between morphological complexity and quantization sensitivity and show that MCAQ-YOLO achieves superior detection accuracy and convergence efficiency compared with uniform quantization. On a safety equipment dataset, MCAQ-YOLO attains 85.6 percent mAP@0.5 with an average of 4.2 bits and a 7.6x compression ratio, yielding 3.5 percentage points higher mAP than uniform 4-bit quantization while introducing only 1.8 ms of additional runtime overhead per image. Cross-dataset validation on COCO and Pascal VOC further confirms consistent performance gains, indicating that morphology-driven spatial quantization can enhance efficiency and robustness for computationally constrained, safety-critical visual recognition tasks.
zh

[CV-122] HiFusion: Hierarchical Intra-Spot Alignment and Regional Context Fusion for Spatial Gene Expression Prediction from Histopathology AAAI2026

【速读】:该论文旨在解决空间转录组学(Spatial Transcriptomics, ST)在临床应用中因技术复杂性和高昂成本而面临的障碍,特别是现有计算方法在从苏木精-伊红染色全切片图像(HE-stained Whole-Slide Images, WSIs)预测基因表达时,难以捕捉斑点内细微的生物异质性,并且在整合周围组织上下文信息时易受形态噪声干扰的问题。解决方案的关键在于提出HiFusion框架,其核心创新包括两个互补模块:一是分层内斑点建模模块(Hierarchical Intra-Spot Modeling),通过多分辨率子图块分解提取细粒度形态表征,并借助特征对齐损失确保跨尺度语义一致性;二是上下文感知跨尺度融合模块(Context-aware Cross-scale Fusion),利用交叉注意力机制选择性引入生物学相关区域上下文,从而增强表征能力。该架构实现了细胞水平特征与组织微环境线索的综合建模,显著提升了基因表达预测的准确性与鲁棒性。

链接: https://arxiv.org/abs/2511.12969
作者: Ziqiao Weng,Yaoyu Fang,Jiahe Qian,Xinkun Wang,Lee AD Cooper,Weidong Cai,Bo Zhou
机构: 1. Tsinghua University (清华大学); 2. Alibaba Group (阿里巴巴集团); 3. Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2026. 7 pages (main text), 12 pages total including references and supplementary material. 6 figures

点击查看摘要

Abstract:Spatial transcriptomics (ST) bridges gene expression and tissue morphology but faces clinical adoption barriers due to technical complexity and prohibitive costs. While computational methods predict gene expression from HE-stained whole-slide images (WSIs), existing approaches often fail to capture the intricate biological heterogeneity within spots and are susceptible to morphological noise when integrating contextual information from surrounding tissue. To overcome these limitations, we propose HiFusion, a novel deep learning framework that integrates two complementary components. First, we introduce the Hierarchical Intra-Spot Modeling module that extracts fine-grained morphological representations through multi-resolution sub-patch decomposition, guided by a feature alignment loss to ensure semantic consistency across scales. Concurrently, we present the Context-aware Cross-scale Fusion module, which employs cross-attention to selectively incorporate biologically relevant regional context, thereby enhancing representational capacity. This architecture enables comprehensive modeling of both cellular-level features and tissue microenvironmental cues, which are essential for accurate gene expression prediction. Extensive experiments on two benchmark ST datasets demonstrate that HiFusion achieves state-of-the-art performance across both 2D slide-wise cross-validation and more challenging 3D sample-specific scenarios. These results underscore HiFusion’s potential as a robust, accurate, and scalable solution for ST inference from routine histopathology.
zh

[CV-123] GrOCE:Graph-Guided Online Concept Erasure for Text-to-Image Diffusion Models

【速读】:该论文旨在解决文本到图像扩散模型中概念擦除(concept erasure)的问题,即如何在不损害无关语义的前提下,精准、高效地移除有害、不当或受版权保护的内容。现有方法要么依赖昂贵的微调,要么采用粗粒度的语义分离,导致无关概念退化且难以适应动态变化的概念集合。解决方案的关键在于提出一种无需训练的图引导在线概念擦除框架(Graph-Guided Online Concept Erasure, GrOCE),其核心是通过构建动态语义图来建模概念及其相互关系,实现基于图结构的细粒度推理与隔离;具体包括三个模块:动态拓扑图构建、自适应聚类识别和选择性边切断,从而在不破坏全局语义的前提下实现目标概念的精确移除。

链接: https://arxiv.org/abs/2511.12968
作者: Ning Han,Zhenyu Ge,Feng Han,Yuhua Sun,Chengqing Li,Jingjing Chen
机构: Xiangtan University (湘潭大学); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures

点击查看摘要

Abstract:Concept erasure aims to remove harmful, inappropriate, or copyrighted content from text-to-image diffusion models while preserving non-target semantics. However, existing methods either rely on costly fine-tuning or apply coarse semantic separation, often degrading unrelated concepts and lacking adaptability to evolving concept sets. To alleviate this issue, we propose Graph-Guided Online Concept Erasure (GrOCE), a training-free framework that performs precise and adaptive concept removal through graph-based semantic reasoning. GrOCE models concepts and their interrelations as a dynamic semantic graph, enabling principled reasoning over dependencies and fine-grained isolation of undesired content. It comprises three components: (1) Dynamic Topological Graph Construction for incremental graph building, (2) Adaptive Cluster Identification for multi-hop traversal with similarity-decay scoring, and (3) Selective Edge Severing for targeted edge removal while preserving global semantics. Extensive experiments demonstrate that GrOCE achieves state-of-the-art performance on Concept Similarity (CS) and Fréchet Inception Distance (FID) metrics, offering efficient, accurate, and stable concept erasure without retraining.
zh

[CV-124] CalibrateMix: Guided-Mixup Calibration of Image Semi-Supervised Models

【速读】:该论文旨在解决半监督学习(Semi-Supervised Learning, SSL)中模型校准不足的问题,即现有SSL方法在利用伪标签(pseudolabels)进行训练时,常因伪标签的过自信和不可靠性导致模型预测概率与实际准确率不匹配,从而影响其可靠性。解决方案的关键在于提出CalibrateMix,一种基于目标混合(targeted mixup)的方法:通过分析有标签和无标签样本的训练动态,识别出“易学样本”(easy-to-learn)和“难学样本”(hard-to-learn),并在此基础上对易学样本与难学样本进行有针对性的混合操作,从而在提升分类准确率的同时显著降低预期校准误差(Expected Calibration Error, ECE),实现更可靠的模型输出概率估计。

链接: https://arxiv.org/abs/2511.12964
作者: Mehrab Mustafy Rahman,Jayanth Mohan,Tiberiu Sosea,Cornelia Caragea
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Semi-supervised learning (SSL) has demonstrated high performance in image classification tasks by effectively utilizing both labeled and unlabeled data. However, existing SSL methods often suffer from poor calibration, with models yielding overconfident predictions that misrepresent actual prediction likelihoods. Recently, neural networks trained with \tt mixup that linearly interpolates random examples from the training set have shown better calibration in supervised settings. However, calibration of neural models remains under-explored in semi-supervised settings. Although effective in supervised model calibration, random mixup of pseudolabels in SSL presents challenges due to the overconfidence and unreliability of pseudolabels. In this work, we introduce CalibrateMix, a targeted mixup-based approach that aims to improve the calibration of SSL models while maintaining or even improving their classification accuracy. Our method leverages training dynamics of labeled and unlabeled samples to identify easy-to-learn'' and hard-to-learn’’ samples, which in turn are utilized in a targeted mixup of easy and hard samples. Experimental results across several benchmark image datasets show that our method achieves lower expected calibration error (ECE) and superior accuracy compared to existing SSL approaches.
zh

[CV-125] EndoSight AI: Deep Learning-Driven Real-Time Gastrointestinal Polyp Detection and Segmentation for Enhanced Endoscopic Diagnostics

【速读】:该论文旨在解决内镜检查中胃肠道息肉(gastrointestinal polyps)的精准与实时检测问题,以促进结直肠癌的早期诊断和预防。解决方案的关键在于提出了一种名为EndoSight AI的深度学习架构,该架构在公开的Hyper-Kvasir数据集上实现了88.3%的平均精度(mean Average Precision, mAP)用于息肉定位,以及高达69%的Dice系数用于边界分割,并在GPU硬件上实现超过35帧/秒的实时推理速度。此外,训练过程引入了临床相关性能指标及一种新颖的热感知(thermal-aware)流程,确保模型在实际临床环境中的鲁棒性和效率,从而支持无缝集成到内镜工作流中,提升诊断准确性与临床决策能力。

链接: https://arxiv.org/abs/2511.12962
作者: Daniel Cavadia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Precise and real-time detection of gastrointestinal polyps during endoscopic procedures is crucial for early diagnosis and prevention of colorectal cancer. This work presents EndoSight AI, a deep learning architecture developed and evaluated independently to enable accurate polyp localization and detailed boundary delineation. Leveraging the publicly available Hyper-Kvasir dataset, the system achieves a mean Average Precision (mAP) of 88.3% for polyp detection and a Dice coefficient of up to 69% for segmentation, alongside real-time inference speeds exceeding 35 frames per second on GPU hardware. The training incorporates clinically relevant performance metrics and a novel thermal-aware procedure to ensure model robustness and efficiency. This integrated AI solution is designed for seamless deployment in endoscopy workflows, promising to advance diagnostic accuracy and clinical decision-making in gastrointestinal healthcare.
zh

[CV-126] 2I-Based Physical-World Appearance Attack against Traffic Sign Recognition Systems in Autonomous Driving

【速读】:该论文旨在解决交通标志识别(Traffic Sign Recognition, TSR)系统在物理世界中面临的对抗性外观攻击(adversarial appearance attacks)的局限性问题,特别是现有方法在隐蔽性、迁移性、泛化能力及实用性方面的不足。当前基于像素扰动的方法易被察觉且对特定模型过拟合,而基于文本到图像(text-to-image, T2I)扩散模型的方法则难以有效攻击未知类别的交通标志且泛化能力弱。解决方案的关键在于提出DiffSign框架,其核心创新包括:1)引入CLIP-based损失与掩码提示(masked prompts)机制以增强攻击的目标聚焦性和可控性;2)设计两种新颖的风格定制方法,提升攻击在跨类别交通标志上的泛化能力与隐蔽性;3)通过完整的攻击流程实现高成功率(平均达83.3%)和强迁移性,从而在多种真实场景条件下(如距离、角度、光照变化)均具备鲁棒性。

链接: https://arxiv.org/abs/2511.12956
作者: Chen Ma,Ningfei Wang,Junhao Zheng,Qing Guo,Qian Wang,Qi Alfred Chen,Chao Shen
机构: Xi’an Jiaotong University (西安交通大学); University of California, Irvine (加州大学欧文分校); Nankai University (南开大学); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: 16 pages, 12 figures

点击查看摘要

Abstract:Traffic Sign Recognition (TSR) systems play a critical role in Autonomous Driving (AD) systems, enabling real-time detection of road signs, such as STOP and speed limit signs. While these systems are increasingly integrated into commercial vehicles, recent research has exposed their vulnerability to physical-world adversarial appearance attacks. In such attacks, carefully crafted visual patterns are misinterpreted by TSR models as legitimate traffic signs, while remaining inconspicuous or benign to human observers. However, existing adversarial appearance attacks suffer from notable limitations. Pixel-level perturbation-based methods often lack stealthiness and tend to overfit to specific surrogate models, resulting in poor transferability to real-world TSR systems. On the other hand, text-to-image (T2I) diffusion model-based approaches demonstrate limited effectiveness and poor generalization to out-of-distribution sign types. In this paper, we present DiffSign, a novel T2I-based appearance attack framework designed to generate physically robust, highly effective, transferable, practical, and stealthy appearance attacks against TSR systems. To overcome the limitations of prior approaches, we propose a carefully designed attack pipeline that integrates CLIP-based loss and masked prompts to improve attack focus and controllability. We also propose two novel style customization methods to guide visual appearance and improve out-of-domain traffic sign attack generalization and attack stealthiness. We conduct extensive evaluations of DiffSign under varied real-world conditions, including different distances, angles, light conditions, and sign categories. Our method achieves an average physical-world attack success rate of 83.3%, leveraging DiffSign’s high effectiveness in attack transferability. Comments: 16 pages, 12 figures Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR) Cite as: arXiv:2511.12956 [cs.CV] (or arXiv:2511.12956v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2511.12956 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-127] Recurrent Autoregressive Diffusion: Global Memory Meets Local Attention

【速读】:该论文旨在解决视频扩散模型在长视频生成中因局部全注意力机制导致的记忆压缩与检索效率低下问题,从而引发遗忘和时空不一致性现象。其核心解决方案是提出一种新颖的循环自回归扩散(Recurrent Autoregressive Diffusion, RAD)框架,关键在于引入基于LSTM的循环神经网络结构嵌入扩散Transformer架构,并通过帧级自回归方式实现训练与推理阶段一致的记忆更新与检索机制,有效提升了固定记忆预算下的历史信息保留能力,实验表明该方法在Memory Maze和Minecraft数据集上显著优于现有扩散-RNN模型。

链接: https://arxiv.org/abs/2511.12940
作者: Taiye Chen,Zihan Ding,Anjian Li,Christina Zhang,Zeqi Xiao,Yisen Wang,Chi Jin
机构: Peking University (北京大学); Princeton University (普林斯顿大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in video generation have demonstrated the potential of using video diffusion models as world models, with autoregressive generation of infinitely long videos through masked conditioning. However, such models, usually with local full attention, lack effective memory compression and retrieval for long-term generation beyond the window size, leading to issues of forgetting and spatiotemporal inconsistencies. To enhance the retention of historical information within a fixed memory budget, we introduce a recurrent neural network (RNN) into the diffusion transformer framework. Specifically, a diffusion model incorporating LSTM with attention achieves comparable performance to state-of-the-art RNN blocks, such as TTT and Mamba2. Moreover, existing diffusion-RNN approaches often suffer from performance degradation due to training-inference gap or the lack of overlap across windows. To address these limitations, we propose a novel Recurrent Autoregressive Diffusion (RAD) framework, which executes frame-wise autoregression for memory update and retrieval, consistently across training and inference time. Experiments on Memory Maze and Minecraft datasets demonstrate the superiority of RAD for long video generation, highlighting the efficiency of LSTM in sequence modeling.
zh

[CV-128] Semi-Supervised High Dynamic Range Image Reconstructing via Bi-Level Uncertain Area Masking AAAI2026

【速读】:该论文旨在解决低动态范围(Low Dynamic Range, LDR)图像序列到高动态范围(High Dynamic Range, HDR)图像重建中的标注效率问题,即如何在仅使用少量HDR真值(Ground Truth, GT)的情况下实现与全监督方法相当的重建性能。其解决方案的关键在于引入一种基于不确定性的掩码机制(uncertainty-based masking process),该机制在像素和图像块(patch)两个层级上识别并屏蔽伪HDR真值中不可靠的部分,从而避免学生模型从带有伪影的伪标签中学习,提升模型鲁棒性与重建质量。通过这一创新掩码策略,该方法仅需6.7%的HDR真值即可达到与最新全监督方法相当的性能。

链接: https://arxiv.org/abs/2511.12939
作者: Wei Jiang,Jiahao Cui,Yizheng Wu,Zhan Peng,Zhiyu Pan,Zhiguo Cao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 5 figures, accepted to AAAI 2026 (poster)

点击查看摘要

Abstract:Reconstructing high dynamic range (HDR) images from low dynamic range (LDR) bursts plays an essential role in the computational photography. Impressive progress has been achieved by learning-based algorithms which require LDR-HDR image pairs. However, these pairs are hard to obtain, which motivates researchers to delve into the problem of annotation-efficient HDR image reconstructing: how to achieve comparable performance with limited HDR ground truths (GTs). This work attempts to address this problem from the view of semi-supervised learning where a teacher model generates pseudo HDR GTs for the LDR samples without GTs and a student model learns from pseudo GTs. Nevertheless, the confirmation bias, i.e., the student may learn from the artifacts in pseudo HDR GTs, presents an impediment. To remove this impediment, an uncertainty-based masking process is proposed to discard unreliable parts of pseudo GTs at both pixel and patch levels, then the trusted areas can be learned from by the student. With this novel masking process, our semi-supervised HDR reconstructing method not only outperforms previous annotation-efficient algorithms, but also achieves comparable performance with up-to-date fully-supervised methods by using only 6.7% HDR GTs.
zh

[CV-129] ProtoAnomalyNCD: Prototype Learning for Multi-class Novel Anomaly Discovery in Industrial Scenarios

【速读】:该论文旨在解决工业异常检测中难以发现和分类多种未见异常类型的问题,尤其针对异常语义细微、现有方法未能充分利用图像先验信息导致聚类性能不佳的挑战。其解决方案的关键在于提出ProtoAnomalyNCD框架,通过引入基于原型学习(prototype learning)的机制,结合两种关键设计:一是利用带文本提示的Grounded SAM(Segment Anything Model)定位物体区域作为先验,以抑制背景干扰;二是设计异常图引导注意力模块(Anomaly-Map-Guided Attention block),其中引入区域引导因子(Region Guidance Factor)帮助注意力机制区分背景、物体区域与异常区域,从而增强异常特征并保留正常特征用于对比学习。最终在统一的原型学习框架下实现未见异常类别的自动发现与多类型分类,并扩展至未见离群点检测任务,实现了任务层面的统一。

链接: https://arxiv.org/abs/2511.12938
作者: Botong Zhao,Qijun Shi,Shujing Lyu,Yue Lu
机构: Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University (华东师范大学多维信息处理重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing industrial anomaly detection methods mainly determine whether an anomaly is present. However, real-world applications also require discovering and classifying multiple anomaly types. Since industrial anomalies are semantically subtle and current methods do not sufficiently exploit image priors, direct clustering approaches often perform poorly. To address these challenges, we propose ProtoAnomalyNCD, a prototype-learning-based framework for discovering unseen anomaly classes of multiple types that can be integrated with various anomaly detection methods. First, to suppress background clutter, we leverage Grounded SAM with text prompts to localize object regions as priors for the anomaly classification network. Next, because anomalies usually appear as subtle and fine-grained patterns on the product, we introduce an Anomaly-Map-Guided Attention block. Within this block, we design a Region Guidance Factor that helps the attention module distinguish among background, object regions, and anomalous regions. By using both localized product regions and anomaly maps as priors, the module enhances anomalous features while suppressing background noise and preserving normal features for contrastive learning. Finally, under a unified prototype-learning framework, ProtoAnomalyNCD discovers and clusters unseen anomaly classes while simultaneously enabling multi-type anomaly classification. We further extend our method to detect unseen outliers, achieving task-level unification. Our method outperforms state-of-the-art approaches on the MVTec AD, MTD, and Real-IAD datasets.
zh

[CV-130] Yanyun-3: Enabling Cross-Platform Strategy Game Operation with Vision-Language Models

【速读】:该论文旨在解决跨平台策略类游戏中智能体在多样化用户界面(User Interface, UI)和动态战场环境下实现鲁棒泛化的问题。传统视觉-语言模型(Vision-Language Models, VLMs)虽在多模态推理方面展现出潜力,但在复杂人机交互场景如策略游戏中的应用仍处于探索阶段。解决方案的关键在于提出Yanyun-3框架,其核心创新是将Qwen2.5-VL的视觉-语言推理能力与UI-TARS的精准动作执行能力相结合,并设计了一种基于组合粒度(combination granularity)的结构化多模态数据组织策略:通过融合多帧图像与视频数据并混合静态图像(MV+S),显著提升推理效率与任务性能——相较全融合方法,推理时间降低63%,BLEU-4得分提升约12.98倍(从4.81%增至62.41%)。该方案不仅实现了三大核心任务(目标定位、战斗资源分配、区域控制)的自主执行,还构建了闭环式屏幕捕获-模型推理-动作执行流程,为VLM在具身智能中的高效应用提供了新范式。

链接: https://arxiv.org/abs/2511.12937
作者: Guoyan Wang,Yanyan Huang,Chunlin Chen,Lifeng Wang,Yuxiang Sun
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 32 pages, 13 figures

点击查看摘要

Abstract:Automated operation in cross-platform strategy games demands agents with robust generalization across diverse user interfaces and dynamic battlefield conditions. While vision-language models (VLMs) have shown considerable promise in multimodal reasoning, their application to complex human-computer interaction scenarios–such as strategy gaming–remains largely unexplored. Here, we introduce Yanyun-3, a general-purpose agent framework that, for the first time, enables autonomous cross-platform operation across three heterogeneous strategy game environments. By integrating the vision-language reasoning of Qwen2.5-VL with the precise execution capabilities of UI-TARS, Yanyun-3 successfully performs core tasks including target localization, combat resource allocation, and area control. Through systematic ablation studies, we evaluate the effects of various multimodal data combinations–static images, multi-image sequences, and videos–and propose the concept of combination granularity to differentiate between intra-sample fusion and inter-sample mixing strategies. We find that a hybrid strategy, which fuses multi-image and video data while mixing in static images (MV+S), substantially outperforms full fusion: it reduces inference time by 63% and boosts the BLEU-4 score by a factor of 12 (from 4.81% to 62.41%, approximately 12.98x). Operating via a closed-loop pipeline of screen capture, model inference, and action execution, the agent demonstrates strong real-time performance and cross-platform generalization. Beyond providing an efficient solution for strategy game automation, our work establishes a general paradigm for enhancing VLM performance through structured multimodal data organization, offering new insights into the interplay between static perception and dynamic reasoning in embodied intelligence.
zh

[CV-131] PFAvatar: Pose-Fusion 3D Personalized Avatar Reconstruction from Real-World Outfit-of-the-Day Photos AAAI2026

【速读】:该论文旨在解决从真实世界“每日穿搭”(Outfit of the Day, OOTD)照片中重建高质量3D虚拟形象(Avatar)的问题,此类图像通常包含多变姿态、遮挡和复杂背景,传统方法在处理时易因部件分割不一致导致细节失真。其解决方案的关键在于提出PFAvatar(Pose-Fusion Avatar),采用两阶段策略:第一阶段通过少量OOTD样本微调一个姿态感知的扩散模型(pose-aware diffusion model),引入预训练ControlNet进行姿态估计,并设计条件先验保持损失(Condition Prior Preservation Loss, CPPL)以缓解少样本训练中的语言漂移问题,实现端到端的全身外观建模;第二阶段利用神经辐射场(Neural Radiance Field, NeRF)表示3D avatar,结合规范SMPL-X空间采样与多分辨率3D-SDS优化策略,在连续辐射场中有效保留高频纹理(如头发)并正确处理遮挡问题,相较基于网格的表示具有更高保真度和鲁棒性。

链接: https://arxiv.org/abs/2511.12935
作者: Dianbing Xi,Guoyuan An,Jingsen Zhu,Zhijian Liu,Yuan Liu,Ruiyuan Zhang,Jiayuan Lu,Rui Wang,Yuchi Huo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:We propose PFAvatar (Pose-Fusion Avatar), a new method that reconstructs high-quality 3D avatars from ``Outfit of the Day’’ (OOTD) photos, which exhibit diverse poses, occlusions, and complex backgrounds. Our method consists of two stages: (1) fine-tuning a pose-aware diffusion model from few-shot OOTD examples and (2) distilling a 3D avatar represented by a neural radiance field (NeRF). In the first stage, unlike previous methods that segment images into assets (e.g., garments, accessories) for 3D assembly, which is prone to inconsistency, we avoid decomposition and directly model the full-body appearance. By integrating a pre-trained ControlNet for pose estimation and a novel Condition Prior Preservation Loss (CPPL), our method enables end-to-end learning of fine details while mitigating language drift in few-shot training. Our method completes personalization in just 5 minutes, achieving a 48 \times speed-up compared to previous approaches. In the second stage, we introduce a NeRF-based avatar representation optimized by canonical SMPL-X space sampling and Multi-Resolution 3D-SDS. Compared to mesh-based representations that suffer from resolution-dependent discretization and erroneous occluded geometry, our continuous radiance field can preserve high-frequency textures (e.g., hair) and handle occlusions correctly through transmittance. Experiments demonstrate that PFAvatar outperforms state-of-the-art methods in terms of reconstruction fidelity, detail preservation, and robustness to occlusions/truncations, advancing practical 3D avatar generation from real-world OOTD albums. In addition, the reconstructed 3D avatar supports downstream applications such as virtual try-on, animation, and human video reenactment, further demonstrating the versatility and practical value of our approach.
zh

[CV-132] xt2Traffic: A Text-to-Image Generation and Editing Method for Traffic Scenes

【速读】:该论文旨在解决交通场景中基于文本的图像生成与编辑技术所面临的四大挑战:生成交通元素语义信息不足、相机视角单一、合成图像视觉保真度低以及文本描述与生成内容对齐不佳。其解决方案的关键在于提出一个统一的文本驱动框架,通过引入可控掩码机制(controllable mask mechanism)实现图像生成与编辑任务的无缝融合;同时利用车端与路侧多视角数据提升场景几何多样性,并采用两阶段训练策略——先在粗粒度文本-图像数据上进行概念学习,再在细粒度描述数据上微调以增强文本-图像对齐和细节质量;此外,设计了掩码区域加权损失函数(mask-region-weighted loss),在训练过程中动态强化小而关键区域的关注,显著提升了小尺度交通元素的生成保真度。

链接: https://arxiv.org/abs/2511.12932
作者: Feng Lv,Haoxuan Feng,Zilu Zhang,Chunlong Xia,Yanfeng Li
机构: Baidu(百度); Beijing Jiaotong University (北京交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the rapid advancement of intelligent transportation systems, text-driven image generation and editing techniques have demonstrated significant potential in providing rich, controllable visual scene data for applications such as traffic monitoring and autonomous driving. However, several challenges remain, including insufficient semantic richness of generated traffic elements, limited camera viewpoints, low visual fidelity of synthesized images, and poor alignment between textual descriptions and generated content. To address these issues, we propose a unified text-driven framework for both image generation and editing, leveraging a controllable mask mechanism to seamlessly integrate the two tasks. Furthermore, we incorporate both vehicle-side and roadside multi-view data to enhance the geometric diversity of traffic scenes. Our training strategy follows a two-stage paradigm: first, we perform conceptual learning using large-scale coarse-grained text-image data; then, we fine-tune with fine-grained descriptive data to enhance text-image alignment and detail quality. Additionally, we introduce a mask-region-weighted loss that dynamically emphasizes small yet critical regions during training, thereby substantially enhancing the generation fidelity of small-scale traffic elements. Extensive experiments demonstrate that our method achieves leading performance in text-based image generation and editing within traffic scenes.
zh

[CV-133] Neo: Real-Time On-Device 3D Gaussian Splatting with Reuse-and-Update Sorting Acceleration

【速读】:该论文旨在解决在资源受限设备上实现高帧率、高分辨率的3D高斯溅射(3D Gaussian Splatting, 3DGS)实时渲染问题,其核心瓶颈在于渲染流程中的排序阶段对内存带宽需求过高。解决方案的关键在于提出一种“重用与更新”排序算法(reuse-and-update sorting algorithm),通过利用连续帧间高斯点深度顺序的时域冗余性,避免从头重新排序,从而显著减少冗余计算和内存访问压力;同时设计了针对该算法优化的硬件加速器,最终在吞吐量上相比现有边缘GPU和专用集成电路(ASIC)方案分别提升10.0倍和5.6倍,并将DRAM数据传输量降低94.5%和81.3%。

链接: https://arxiv.org/abs/2511.12930
作者: Changhun Oh,Seongryong Oh,Jinwoo Hwang,Yoonsung Kim,Hardik Sharma,Jongse Park
机构: 未知
类目: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) rendering in real-time on resource-constrained devices is essential for delivering immersive augmented and virtual reality (AR/VR) experiences. However, existing solutions struggle to achieve high frame rates, especially for high-resolution rendering. Our analysis identifies the sorting stage in the 3DGS rendering pipeline as the major bottleneck due to its high memory bandwidth demand. This paper presents Neo, which introduces a reuse-and-update sorting algorithm that exploits temporal redundancy in Gaussian ordering across consecutive frames, and devises a hardware accelerator optimized for this algorithm. By efficiently tracking and updating Gaussian depth ordering instead of re-sorting from scratch, Neo significantly reduces redundant computations and memory bandwidth pressure. Experimental results show that Neo achieves up to 10.0x and 5.6x higher throughput than state-of-the-art edge GPU and ASIC solution, respectively, while reducing DRAM traffic by 94.5% and 81.3%. These improvements make high-quality and low-latency on-device 3D rendering more practical.
zh

[CV-134] Generative Photographic Control for Scene-Consistent Video Cinematic Editing

【速读】:该论文旨在解决生成式视频模型中对专业摄影参数(如景深、快门速度等)缺乏精细控制的问题,现有方法通常仅限于相机运动控制,难以实现类似电影级的视觉效果。解决方案的关键在于提出CineCtrl框架,其核心创新是引入解耦交叉注意力机制(decoupled cross-attention mechanism),将相机运动与摄影输入分离,从而实现不破坏场景一致性的细粒度独立控制;同时,为缓解训练数据不足问题,构建了结合模拟摄影效果与专用真实世界采集流程的大规模数据生成策略,支撑模型在高保真视频生成中精确实现用户指定的电影级摄影效果。

链接: https://arxiv.org/abs/2511.12921
作者: Huiqiang Sun,Liao Shen,Zhan Peng,Kun Wang,Size Wu,Yuhang Zang,Tianqi Liu,Zihao Huang,Xingyu Zeng,Zhiguo Cao,Wei Li,Chen Change Loy
机构: School of AIA, Huazhong University of Science and Technology (华中科技大学人工智能学院); S-Lab, Nanyang Technological University (南洋理工大学); SenseTime Research (商汤科技研究院); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cinematic storytelling is profoundly shaped by the artful manipulation of photographic elements such as depth of field and exposure. These effects are crucial in conveying mood and creating aesthetic appeal. However, controlling these effects in generative video models remains highly challenging, as most existing methods are restricted to camera motion control. In this paper, we propose CineCtrl, the first video cinematic editing framework that provides fine control over professional camera parameters (e.g., bokeh, shutter speed). We introduce a decoupled cross-attention mechanism to disentangle camera motion from photographic inputs, allowing fine-grained, independent control without compromising scene consistency. To overcome the shortage of training data, we develop a comprehensive data generation strategy that leverages simulated photographic effects with a dedicated real-world collection pipeline, enabling the construction of a large-scale dataset for robust model training. Extensive experiments demonstrate that our model generates high-fidelity videos with precisely controlled, user-specified photographic camera effects.
zh

[CV-135] CoordAR: One-Reference 6D Pose Estimation of Novel Objects via Autoregressive Coordinate Map Generation AAAI2026

【速读】:该论文致力于解决**单参考视图下的6D位姿估计(6D pose estimation)**问题,尤其针对未见过物体且缺乏完整3D模型时的挑战。现有方法依赖于实值坐标回归,存在全局一致性差(因卷积结构的局部性)以及对对称或遮挡场景建模不足的问题。解决方案的关键在于提出CoordAR框架:通过将参考与查询视图间的3D-3D对应关系建模为离散标记(token)序列,并采用自回归概率方式生成;其核心创新包括:1)一种新的坐标图标记化方法,实现对离散化3D空间的概率预测;2)模态解耦编码策略,分别处理RGB外观和坐标线索;3)基于位置对齐查询特征和部分生成标记序列的自回归Transformer解码器。这一设计显著提升了精度与鲁棒性,尤其在对称、遮挡等复杂场景下表现优异。

链接: https://arxiv.org/abs/2511.12919
作者: Dexin Zuo,Ang Li,Wei Wang,Wenxian Yu,Danping Zou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, accepted by AAAI 2026 (oral)

点击查看摘要

Abstract:Object 6D pose estimation, a crucial task for robotics and augmented reality applications, becomes particularly challenging when dealing with novel objects whose 3D models are not readily available. To reduce dependency on 3D models, recent studies have explored one-reference-based pose estimation, which requires only a single reference view instead of a complete 3D model. However, existing methods that rely on real-valued coordinate regression suffer from limited global consistency due to the local nature of convolutional architectures and face challenges in symmetric or occluded scenarios owing to a lack of uncertainty modeling. We present CoordAR, a novel autoregressive framework for one-reference 6D pose estimation of unseen objects. CoordAR formulates 3D-3D correspondences between the reference and query views as a map of discrete tokens, which is obtained in an autoregressive and probabilistic manner. To enable accurate correspondence regression, CoordAR introduces 1) a novel coordinate map tokenization that enables probabilistic prediction over discretized 3D space; 2) a modality-decoupled encoding strategy that separately encodes RGB appearance and coordinate cues; and 3) an autoregressive transformer decoder conditioned on both position-aligned query features and the partially generated token sequence. With these novel mechanisms, CoordAR significantly outperforms existing methods on multiple benchmarks and demonstrates strong robustness to symmetry, occlusion, and other challenges in real-world tests.
zh

[CV-136] Explore How to Inject Beneficial Noise in MLLM s AAAI2026

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在微调过程中忽视跨模态异质性(cross-modal heterogeneity)的问题,从而限制了其在跨模态理解与对齐上的性能潜力。解决方案的关键在于提出一种新颖的微调策略——通过注入有益的随机噪声实现高效且高性能的模态微调,核心创新是设计了多模态噪声生成器(Multimodal Noise Generator, MuNG)。MuNG基于变分推断视角重构MLLM推理过程,动态分析图像-文本对中的跨模态关系,并生成任务自适应的有益噪声;该噪声能有效抑制无关语义成分,显著增强跨模态表征对齐,从而在仅调整约1%~2%额外参数的情况下,超越全参数微调及其他现有方法的性能表现。

链接: https://arxiv.org/abs/2511.12917
作者: Ruishu Zhu,Sida Huang,Ziheng Jiao,Hongyuan Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have played an increasingly important role in multimodal intelligence. However, the existing fine-tuning methods often ignore cross-modal heterogeneity, limiting their full potential. In this work, we propose a novel fine-tuning strategy by injecting beneficial random noise, which outperforms previous methods and even surpasses full fine-tuning, with minimal additional parameters. The proposed Multimodal Noise Generator (MuNG) enables efficient modality fine-tuning by injecting customized noise into the frozen MLLMs. Specifically, we reformulate the reasoning process of MLLMs from a variational inference perspective, upon which we design a multimodal noise generator that dynamically analyzes cross-modal relationships in image-text pairs to generate task-adaptive beneficial noise. Injecting this type of noise into the MLLMs effectively suppresses irrelevant semantic components, leading to significantly improved cross-modal representation alignment and enhanced performance on downstream tasks. Experiments on two mainstream MLLMs, QwenVL and LLaVA, demonstrate that our method surpasses full-parameter fine-tuning and other existing fine-tuning approaches, while requiring adjustments to only about 1\sim2% additional parameters. The relevant code is uploaded in the supplementary.
zh

[CV-137] CASL: Curvature-Augmented Self-supervised Learning for 3D Anomaly Detection AAAI2026

【速读】:该论文旨在解决当前3D异常检测方法在通用性上的局限性问题,即大多数深度学习方法为特定任务设计,难以迁移至其他3D理解任务;而传统自监督点云模型虽具备通用表示学习能力,但在统一微调范式下异常检测性能不佳。解决方案的关键在于提出一种基于重建范式的曲率增强自监督学习(Curvature-Augmented Self-supervised Learning, CASL)框架,其核心创新是引入多尺度曲率提示(multi-scale curvature prompts)作为先验信息引导解码器预测点的空间坐标,从而在无需任何专用异常检测机制的前提下,通过简单的异常分类微调实现领先的检测性能,并且所学表征可有效泛化至标准3D理解任务如点云分类。

链接: https://arxiv.org/abs/2511.12909
作者: Yaohua Zha,Xue Yuerong,Chunlin Fan,Yuansong Wang,Tao Dai,Ke Chen,Shu-Tao Xia
机构: 1. Chinese Academy of Sciences (中国科学院); 2. University of Chinese Academy of Sciences (中国科学院大学); 3. Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2026

点击查看摘要

Abstract:Deep learning-based 3D anomaly detection methods have demonstrated significant potential in industrial manufacturing. However, many approaches are specifically designed for anomaly detection tasks, which limits their generalizability to other 3D understanding tasks. In contrast, self-supervised point cloud models aim for general-purpose representation learning, yet our investigation reveals that these classical models are suboptimal at anomaly detection under the unified fine-tuning paradigm. This motivates us to develop a more generalizable 3D model that can effectively detect anomalies without relying on task-specific designs. Interestingly, we find that using only the curvature of each point as its anomaly score already outperforms several classical self-supervised and dedicated anomaly detection models, highlighting the critical role of curvature in 3D anomaly detection. In this paper, we propose a Curvature-Augmented Self-supervised Learning (CASL) framework based on a reconstruction paradigm. Built upon the classical U-Net architecture, our approach introduces multi-scale curvature prompts to guide the decoder in predicting the spatial coordinates of each point. Without relying on any dedicated anomaly detection mechanisms, it achieves leading detection performance through straightforward anomaly classification fine-tuning. Moreover, the learned representations generalize well to standard 3D understanding tasks such as point cloud classification. The code is available at this https URL.
zh

[CV-138] DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agent ic Reinforcement Learning

【速读】:该论文旨在解决体育视频理解中模型难以处理高速动态、复杂规则及长时程上下文推理的问题,尤其针对当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在体育领域存在任务单一、体育项目局限或依赖无训练范式而缺乏有效推理能力的不足。其解决方案的关键在于提出首个端到端训练的多任务、多体育项目视频理解框架DeepSport,通过引入一种专门的帧提取工具实现动态内容交互,使模型能够“用视频思考”;同时设计了一种数据蒸馏管道,从10个多样化数据源合成高质量思维链(Chain-of-Thought, CoT)轨迹,构建包含78k样本的统一训练集,并采用两阶段训练策略——监督微调(Supervised Fine-Tuning, SFT)后接强化学习(Reinforcement Learning, RL),辅以新颖的门控工具使用奖励机制,显著提升了模型的推理能力和泛化性能,在6.7k问题测试基准上超越现有专有与开源模型,为特定领域视频推理提供了新范式。

链接: https://arxiv.org/abs/2511.12908
作者: Junbo Zou,Haotian Xia,Zhen Ye,Shengjie Zhang,Christopher Lai,Vicente Ordonez,Weining Shen,Hanjie Chen
机构: Georgia Institute of Technology (佐治亚理工学院); Rice University (莱斯大学); Johns Hopkins University (约翰霍普金斯大学); University of California, Irvine (加州大学欧文分校); University of California, Santa Barbara (加州大学圣塔芭芭拉分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sports video understanding presents unique challenges, requiring models to perceive high-speed dynamics, comprehend complex rules, and reason over long temporal contexts. While Multimodal Large Language Models (MLLMs) have shown promise in genral domains, the current state of research in sports remains narrowly focused: existing approaches are either single-sport centric, limited to specific tasks, or rely on training-free paradigms that lack robust, learned reasoning process. To address this gap, we introduce DeepSport, the first end-to-end trained MLLM framework designed for multi-task, multi-sport video understanding. DeepSport shifts the paradigm from passive frame processing to active, iterative reasoning, empowering the model to ``think with videos’’ by dynamically interrogating content via a specialized frame-extraction tool. To enable this, we propose a data distillation pipeline that synthesizes high-quality Chain-of-Thought (CoT) trajectories from 10 diverse data source, creating a unified resource of 78k training data. We then employ a two-stage training strategy, Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) with a novel gated tool-use reward, to optimize the model’s reasoning process. Extensive experiments on the testing benchmark of 6.7k questions demonstrate that DeepSport achieves state-of-the-art performance, significantly outperforming baselines of both proprietary model and open-source models. Our work establishes a new foundation for domain-specific video reasoning to address the complexities of diverse sports.
zh

[CV-139] FDP: A Frequency-Decomposition Preprocessing Pipeline for Unsupervised Anomaly Detection in Brain MRI AAAI2026

【速读】:该论文旨在解决脑部磁共振成像(MRI)中监督异常检测因脑部解剖多样性及标注数据稀缺而面临的挑战,推动无监督异常检测(Unsupervised Anomaly Detection, UAD)方法的发展。其核心问题在于现有UAD方法依赖人工生成的噪声扰动模拟异常,但此类假想异常缺乏真实临床病灶的生物物理保真度与形态复杂性,导致检测性能受限。解决方案的关键在于首次对病理特征进行系统性的频域分析,揭示两个关键性质:(1)异常具有区别于正常结构的独特频率模式;(2)低频信号在健康扫描中保持一致表征。基于此,作者提出频率分解预处理(Frequency-Decomposition Preprocessing, FDP)框架,通过频域重建实现病灶抑制与解剖结构保留的协同优化,且可无缝集成至现有异常模拟技术中,显著提升多种模型架构下的检测性能,如在LDM基础上实现DICE分数提升17.63%,同时保障诊断保真度。

链接: https://arxiv.org/abs/2511.12899
作者: Hao Li,Zhenfeng Zhuang,Jingyu Lin,Yu Liu,Yifei Chen,Qiong Peng,Lequan Yu,Liansheng Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI2026

点击查看摘要

Abstract:Due to the diversity of brain anatomy and the scarcity of annotated data, supervised anomaly detection for brain MRI remains challenging, driving the development of unsupervised anomaly detection (UAD) approaches. Current UAD methods typically utilize artificially generated noise perturbations on healthy MRIs to train generative models for normal anatomy reconstruction, enabling anomaly detection via residual mapping. However, such simulated anomalies lack the biophysical fidelity and morphological complexity characteristic of true clinical lesions. To advance UAD in brain MRI, we conduct the first systematic frequency-domain analysis of pathological signatures, revealing two key properties: (1) anomalies exhibit unique frequency patterns distinguishable from normal anatomy, and (2) low-frequency signals maintain consistent representations across healthy scans. These insights motivate our Frequency-Decomposition Preprocessing (FDP) framework, the first UAD method to leverage frequency-domain reconstruction for simultaneous pathology suppression and anatomical preservation. FDP can integrate seamlessly with existing anomaly simulation techniques, consistently enhancing detection performance across diverse architectures while maintaining diagnostic fidelity. Experimental results demonstrate that FDP consistently improves anomaly detection performance when integrated with existing methods. Notably, FDP achieves a 17.63% increase in DICE score with LDM while maintaining robust improvements across multiple baselines. The code is available at this https URL.
zh

[CV-140] Functional Mean Flow in Hilbert Space

【速读】:该论文旨在解决高维或无限维函数空间中生成模型训练效率与稳定性不足的问题,特别是在时间序列、图像、偏微分方程(PDE)和三维几何等复杂功能数据生成任务中的应用挑战。其解决方案的关键在于提出了一种定义在无限维希尔伯特空间(Hilbert space)中的单步流匹配方法——功能性均值流(Functional Mean Flow, FMF),通过理论推导出功能性流匹配(Functional Flow Matching)的公式,并设计了基于 x1x_1-预测的变体以提升训练稳定性,从而实现高效且稳定的生成式建模。

链接: https://arxiv.org/abs/2511.12898
作者: Zhiqi Li,Yuchen Sun,Greg Turk,Bo Zhu
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages, 13 figures

点击查看摘要

Abstract:We present Functional Mean Flow (FMF) as a one-step generative model defined in infinite-dimensional Hilbert space. FMF extends the one-step Mean Flow framework to functional domains by providing a theoretical formulation for Functional Flow Matching and a practical implementation for efficient training and sampling. We also introduce an x_1 -prediction variant that improves stability over the original u -prediction form. The resulting framework is a practical one-step Flow Matching method applicable to a wide range of functional data generation tasks such as time series, images, PDEs, and 3D geometry.
zh

[CV-141] Reconstructing 3D Scenes in Native High Dynamic Range

【速读】:该论文旨在解决当前3D场景重建技术主要依赖低动态范围(Low Dynamic Range, LDR)数据,从而难以满足专业数字媒体制作(如电影制作、虚拟制作和逼真渲染)对高动态范围(High Dynamic Range, HDR)场景重建的需求问题。现有方法虽尝试从LDR观测中恢复HDR场景,但通常依赖多曝光融合或逆色调映射,不仅增加采集复杂度,还依赖合成监督信号。针对这一瓶颈,本文提出首个直接建模原生HDR(native HDR)观测的3D场景重建方法——Native High dynamic range 3D Gaussian Splatting (NH-3DGS),其关键创新在于引入了一种新颖的亮度-色度分解(luminance-chromaticity decomposition)色彩表示方式,使得无需转换即可直接从原生HDR图像中优化重建过程,从而完整保留动态范围并显著提升重建质量与专业适用性。

链接: https://arxiv.org/abs/2511.12895
作者: Kaixuan Zhang,Minxian Li,Mingwu Ren,Jiankang Deng,Xiatian Zhu
机构: Nanjing University of Science and Technology (南京理工大学); Imperial College London (帝国理工学院); University of Surrey (萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:High Dynamic Range (HDR) imaging is essential for professional digital media creation, e.g., filmmaking, virtual production, and photorealistic rendering. However, 3D scene reconstruction has primarily focused on Low Dynamic Range (LDR) data, limiting its applicability to professional workflows. Existing approaches that reconstruct HDR scenes from LDR observations rely on multi-exposure fusion or inverse tone-mapping, which increase capture complexity and depend on synthetic supervision. With the recent emergence of cameras that directly capture native HDR data in a single exposure, we present the first method for 3D scene reconstruction that directly models native HDR observations. We propose \bf Native High dynamic range 3D Gaussian Splatting (NH-3DGS), which preserves the full dynamic range throughout the reconstruction pipeline. Our key technical contribution is a novel luminance-chromaticity decomposition of the color representation that enables direct optimization from native HDR camera data. We demonstrate on both synthetic and real multi-view HDR datasets that NH-3DGS significantly outperforms existing methods in reconstruction quality and dynamic range preservation, enabling professional-grade 3D reconstruction directly from native HDR captures. Code and datasets will be made available.
zh

[CV-142] ActVAR: Activating Mixtures of Weights and Tokens for Efficient Visual Autoregressive Generation

【速读】:该论文旨在解决视觉自回归(Visual Autoregressive, VAR)模型在图像生成过程中因序列长度增加而导致的计算成本急剧上升的问题,同时避免现有静态剪枝方法因永久移除权重或标记而破坏预训练依赖关系所导致的性能下降。解决方案的关键在于提出ActVAR框架,通过引入双重稀疏性(即模型权重和标记序列层面的动态稀疏),实现高效推理而不牺牲模型容量:一方面将前馈网络(Feedforward Networks, FFNs)分解为轻量级专家子网络,并利用可学习路由器根据内容动态选择特定标记的专家子集;另一方面设计门控标记选择器识别高更新潜力的标记进行计算,同时重建未选标记以保持全局上下文和序列对齐。训练采用两阶段知识蒸馏策略,由原始VAR模型指导路由与门控策略的学习,使其与预训练知识对齐,从而在ImageNet 256×256基准上实现高达21.2%的浮点运算次数(FLOPs)减少且性能损失最小。

链接: https://arxiv.org/abs/2511.12893
作者: Kaixin Zhang,Ruiqing Yang,Yuan Zhang,Shan You,Tao Huang
机构: Central South University (中南大学); University of Electronic Science and Technology of China (电子科技大学); Peking University (北京大学); SenseTime Research (商汤科技研究院); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual Autoregressive (VAR) models enable efficient image generation via next-scale prediction but face escalating computational costs as sequence length grows. Existing static pruning methods degrade performance by permanently removing weights or tokens, disrupting pretrained dependencies. To address this, we propose ActVAR, a dynamic activation framework that introduces dual sparsity across model weights and token sequences to enhance efficiency without sacrificing capacity. ActVAR decomposes feedforward networks (FFNs) into lightweight expert sub-networks and employs a learnable router to dynamically select token-specific expert subsets based on content. Simultaneously, a gated token selector identifies high-update-potential tokens for computation while reconstructing unselected tokens to preserve global context and sequence alignment. Training employs a two-stage knowledge distillation strategy, where the original VAR model supervises the learning of routing and gating policies to align with pretrained knowledge. Experiments on the ImageNet 256\times 256 benchmark demonstrate that ActVAR achieves up to 21.2% FLOPs reduction with minimal performance degradation.
zh

[CV-143] Simple Lines Big Ideas: Towards Interpretable Assessment of Human Creativity from Drawings

【速读】:该论文旨在解决当前基于视觉输出(如绘画)的人类创造力评估仍依赖专家主观评分、效率低且缺乏客观性的问题。其解决方案的关键在于提出一个数据驱动的自动且可解释的创造力评估框架,该框架将创造力视为内容(content)与风格(style)两个互补维度的函数,并通过构建一个多模态、多任务学习模型实现对创造力分数、内容类别和风格特征的联合预测;其中引入条件学习机制,使模型能根据绘画的风格和语义信息动态调整视觉特征提取过程,从而在提升评估准确性的同时提供与人类判断高度一致的可解释可视化结果。

链接: https://arxiv.org/abs/2511.12880
作者: Zihao Lin,Zhenshan Shi,Sasa Zhao,Hanwei Zhu,Lingyu Zhu,Baoliang Chen,Lei Mo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Assessing human creativity through visual outputs, such as drawings, plays a critical role in fields including psychology, education, and cognitive science. However, current assessment practices still rely heavily on expert-based subjective scoring, which is both labor-intensive and inherently subjective. In this paper, we propose a data-driven framework for automatic and interpretable creativity assessment from drawings. Motivated by the cognitive understanding that creativity can emerge from both what is drawn (content) and how it is drawn (style), we reinterpret the creativity score as a function of these two complementary this http URL, we first augment an existing creativity labeled dataset with additional annotations targeting content categories. Based on the enriched dataset, we further propose a multi-modal, multi-task learning framework that simultaneously predicts creativity scores, categorizes content types, and extracts stylistic features. In particular, we introduce a conditional learning mechanism that enables the model to adapt its visual feature extraction by dynamically tuning it to creativity-relevant signals conditioned on the drawing’s stylistic and semantic this http URL results demonstrate that our model achieves state-of-the-art performance compared to existing regression-based approaches and offers interpretable visualizations that align well with human judgments. The code and annotations will be made publicly available at this https URL
zh

[CV-144] Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views IROS’25

【速读】:该论文旨在解决第一人称视觉(egocentric vision)中手-物体交互的时序定位问题(Temporal Interaction Localization, TIL),即精确识别手与目标物体之间接触和分离的关键时刻(“when to interact”),这是实现混合现实沉浸式交互体验和机器人运动规划的关键挑战。现有方法受限于依赖语义掩码或类别标注,存在对象定位不准、场景杂乱以及无法零样本泛化的问题。其解决方案的核心在于提出一种新颖的零样本方法 EgoLoc:通过引入**手部动态引导采样(hand-dynamics-guided sampling)生成高质量视觉提示,并利用视觉-语言模型(vision-language model)**识别接触/分离属性并精确定位时间戳,同时提供闭环反馈以迭代优化结果。EgoLoc 不依赖物体掩码或动词-名词分类体系,实现了可泛化的零样本交互时序定位能力。

链接: https://arxiv.org/abs/2511.12878
作者: Junyi Ma,Wentao Bao,Jingyi Xu,Guanzhong Sun,Yu Zheng,Erhang Zhang,Xieyuanli Chen,Hesheng Wang
机构: IRMV Lab, the Department of Automation, Shanghai Jiao Tong University (上海交通大学自动化系); Meta Reality Labs (Meta); the Department of Electronic Engineering, Shanghai Jiao Tong University (上海交通大学电子工程系); School of Information and Control Engineering, China University of Mining and Technology (中国矿业大学信息与控制工程学院); College of Intelligence Science and Technology, National University of Defense Technology (国防科技大学智能科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Extended journal version of MMTwin (IROS’25)

点击查看摘要

Abstract:Analyzing hand-object interaction in egocentric vision facilitates VR/AR applications and human-robot policy transfer. Existing research has mostly focused on modeling the behavior paradigm of interactive actions (i.e., “how to interact”). However, the more challenging and fine-grained problem of capturing the critical moments of contact and separation between the hand and the target object (i.e., “when to interact”) is still underexplored, which is crucial for immersive interactive experiences in mixed reality and robotic motion planning. Therefore, we formulate this problem as temporal interaction localization (TIL). Some recent works extract semantic masks as TIL references, but suffer from inaccurate object grounding and cluttered scenarios. Although current temporal action localization (TAL) methods perform well in detecting verb-noun action segments, they rely on category annotations during training and exhibit limited precision in localizing hand-object contact/separation moments. To address these issues, we propose a novel zero-shot approach dubbed EgoLoc to localize hand-object contact and separation timestamps in egocentric videos. EgoLoc introduces hand-dynamics-guided sampling to generate high-quality visual prompts. It exploits the vision-language model to identify contact/separation attributes, localize specific timestamps, and provide closed-loop feedback for further refinement. EgoLoc eliminates the need for object masks and verb-noun taxonomies, leading to generalizable zero-shot implementation. Comprehensive experiments on the public dataset and our novel benchmarks demonstrate that EgoLoc achieves plausible TIL for egocentric videos. It is also validated to effectively facilitate multiple downstream applications in egocentric vision and robotic manipulation tasks. Code and relevant data will be released at this https URL.
zh

[CV-145] View-aware Cross-modal Distillation for Multi-view Action Recognition WACV

【速读】:该论文旨在解决部分重叠视角(partially overlapping views)下多模态动作识别(multi-view action recognition)的挑战,尤其是在真实场景中传感器输入模态受限且仅提供序列级标注(sequence-level annotations)而非密集帧级标签(frame-level labels)的情况下。传统方法在完全重叠视角设置中表现良好,但在部分可见动作的场景中性能显著下降。解决方案的关键在于提出一种视图感知的跨模态知识蒸馏框架(View-aware Cross-modal Knowledge Distillation, ViCoKD),其核心创新包括:1)引入跨模态适配器(cross-modal adapter)结合跨模态注意力机制,使学生模型能够在不完整模态条件下利用多模态相关性;2)设计视图感知一致性模块(View-aware Consistency module),通过人体检测掩码和置信度加权的Jensen-Shannon散度约束共可见视图下的预测一致性,从而缓解因视角差异导致的动作表征错位问题。实验表明,ViCoKD在MultiSensor-Home真实数据集上优于多种蒸馏方法,并在模态与标注受限条件下超越教师模型。

链接: https://arxiv.org/abs/2511.12870
作者: Trung Thanh Nguyen,Yasutomo Kawanishi,Vijay John,Takahiro Komamizu,Ichiro Ide
机构: Nagoya University (名古屋大学); RIKEN (理化学研究所); Lawrence Technological University (劳伦斯科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026

点击查看摘要

Abstract:The widespread use of multi-sensor systems has increased research in multi-view action recognition. While existing approaches in multi-view setups with fully overlapping sensors benefit from consistent view coverage, partially overlapping settings where actions are visible in only a subset of views remain underexplored. This challenge becomes more severe in real-world scenarios, as many systems provide only limited input modalities and rely on sequence-level annotations instead of dense frame-level labels. In this study, we propose View-aware Cross-modal Knowledge Distillation (ViCoKD), a framework that distills knowledge from a fully supervised multi-modal teacher to a modality- and annotation-limited student. ViCoKD employs a cross-modal adapter with cross-modal attention, allowing the student to exploit multi-modal correlations while operating with incomplete modalities. Moreover, we propose a View-aware Consistency module to address view misalignment, where the same action may appear differently or only partially across viewpoints. It enforces prediction alignment when the action is co-visible across views, guided by human-detection masks and confidence-weighted Jensen-Shannon divergence between their predicted class distributions. Experiments on the real-world MultiSensor-Home dataset show that ViCoKD consistently outperforms competitive distillation methods across multiple backbones and environments, delivering significant gains and surpassing the teacher model under limited conditions.
zh

[CV-146] Video Finetuning Improves Reasoning Between Frames NEURIPS2025

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)从图像到视频任务扩展时性能提升有限的问题,特别是其在处理长视频问答任务中对帧间动态信息建模不足的局限。解决方案的关键在于提出视觉思维链(Visual Chain-of-Thought, vCoT),这是一种显式的推理机制,通过生成连续帧之间的过渡事件描述来增强模型对视频时序动态的理解。实验表明,vCoT显著提升了仅基于图像训练的模型在长视频问答上的表现,而对已进行视频微调的模型则收益有限,说明后者已隐式学习了帧间转换关系;同时,视频微调模型还能将这种时序推理能力迁移到静态图像场景中,在关系视觉推理任务上优于纯图像模型基线。

链接: https://arxiv.org/abs/2511.12868
作者: Ruiqi Yang,Tian Yun,Zihan Wang,Ellie Pavlick
机构: Brown University (布朗大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at CogInterp @ NeurIPS 2025

点击查看摘要

Abstract:Multimodal large language models (LLMs) have made rapid progress in visual understanding, yet their extension from images to videos often reduces to a naive concatenation of frame tokens. In this work, we investigate what video finetuning brings to multimodal LLMs. We propose Visual Chain-of-Thought (vCoT), an explicit reasoning process that generates transitional event descriptions between consecutive frames. Using vCoT, we systematically compare image-only LVLMs with their video-finetuned counterparts, both with and without access to these transitional cues. Our experiments show that vCoT significantly improves the performance of image-only models on long-form video question answering, while yielding only marginal gains for video-finetuned models. This suggests that the latter already capture frame-to-frame transitions implicitly. Moreover, we find that video models transfer this temporal reasoning ability to purely static settings, outperforming image models’ baselines on relational visual reasoning tasks.
zh

[CV-147] SAGA: Source Attribution of Generative AI Videos

【速读】:该论文旨在解决生成式 AI (Generative AI) 视频日益泛滥所带来的溯源难题,传统二分类真假检测已无法满足对具体生成模型来源识别的需求。其核心问题是:如何在大规模场景下实现高精度、多粒度的 AI 生成视频源 Attribution(归属),并提供可解释性支持。解决方案的关键在于提出 SAGA 框架,其创新点包括:1)设计了一种基于视觉基础模型特征的新型视频 Transformer 架构,有效捕捉时空伪影;2)引入数据高效的预训练-归属策略,仅需每类 0.5% 的标注数据即可达到全监督性能;3)提出 Temporal Attention Signatures (T-Sigs),首次可视化不同生成器间的时序差异,提供可解释机制以说明为何不同视频生成器可被区分。

链接: https://arxiv.org/abs/2511.12834
作者: Rohit Kundu,Vishal Mohanty,Hao Xiong,Shan Jia,Athula Balachandran,Amit K. Roy-Chowdhury
机构: Google LLC(谷歌); Google DeepMind(谷歌深度思维); University of California, Riverside(加州大学河滨分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The proliferation of generative AI has led to hyper-realistic synthetic videos, escalating misuse risks and outstripping binary real/fake detectors. We introduce SAGA (Source Attribution of Generative AI videos), the first comprehensive framework to address the urgent need for AI-generated video source attribution at a large scale. Unlike traditional detection, SAGA identifies the specific generative model used. It uniquely provides multi-granular attribution across five levels: authenticity, generation task (e.g., T2V/I2V), model version, development team, and the precise generator, offering far richer forensic insights. Our novel video transformer architecture, leveraging features from a robust vision foundation model, effectively captures spatio-temporal artifacts. Critically, we introduce a data-efficient pretrain-and-attribute strategy, enabling SAGA to achieve state-of-the-art attribution using only 0.5% of source-labeled data per class, matching fully supervised performance. Furthermore, we propose Temporal Attention Signatures (T-Sigs), a novel interpretability method that visualizes learned temporal differences, offering the first explanation for why different video generators are distinguishable. Extensive experiments on public datasets, including cross-domain scenarios, demonstrate that SAGA sets a new benchmark for synthetic video provenance, providing crucial, interpretable insights for forensic and regulatory applications.
zh

[CV-148] MSRNet: A Multi-Scale Recursive Network for Camouflaged Object Detection

【速读】:该论文旨在解决伪装目标检测(Camouflaged Object Detection)任务中因目标与背景在颜色、纹理和尺寸上高度相似而导致的识别困难问题,尤其针对低光照、部分遮挡、小目标、复杂背景及多目标等挑战性场景下现有方法性能不足的问题。解决方案的关键在于提出一种多尺度递归网络(Multi-Scale Recursive Network),其核心创新包括:1)基于金字塔视觉Transformer(Pyramid Vision Transformer)提取多尺度特征,并通过专用的注意力机制引导的尺度融合单元(Attention-Based Scale Integration Units)实现选择性特征合并;2)在解码端引入多粒度融合单元(Multi-Granularity Fusion Units)进行递归特征优化,结合新颖的递归反馈解码策略以增强全局上下文理解能力。该方法通过联合利用多尺度学习与递归特征优化,在多个公开数据集上实现了领先或次优性能,显著提升了对小尺寸和多目标伪装物体的检测精度。

链接: https://arxiv.org/abs/2511.12810
作者: Leena Alghamdi,Muhammad Usman,Hafeez Anwar,Abdul Bais,Saeed Anwar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Camouflaged object detection is an emerging and challenging computer vision task that requires identifying and segmenting objects that blend seamlessly into their environments due to high similarity in color, texture, and size. This task is further complicated by low-light conditions, partial occlusion, small object size, intricate background patterns, and multiple objects. While many sophisticated methods have been proposed for this task, current methods still struggle to precisely detect camouflaged objects in complex scenarios, especially with small and multiple objects, indicating room for improvement. We propose a Multi-Scale Recursive Network that extracts multi-scale features via a Pyramid Vision Transformer backbone and combines them via specialized Attention-Based Scale Integration Units, enabling selective feature merging. For more precise object detection, our decoder recursively refines features by incorporating Multi-Granularity Fusion Units. A novel recursive-feedback decoding strategy is developed to enhance global context understanding, helping the model overcome the challenges in this task. By jointly leveraging multi-scale learning and recursive feature optimization, our proposed method achieves performance gains, successfully detecting small and multiple camouflaged objects. Our model achieves state-of-the-art results on two benchmark datasets for camouflaged object detection and ranks second on the remaining two. Our codes, model weights, and results are available at \hrefthis https URLthis https URL.
zh

[CV-149] Enhancing Neuro-Oncology Through Self-Assessing Deep Learning Models for Brain Tumor Unified Model for MRI Segmentation

【速读】:该论文旨在解决当前深度学习在脑肿瘤分割中面临的两大临床应用瓶颈:一是缺乏对预测误差的不确定性估计,二是未能同时分割肿瘤周围健康脑组织结构,从而无法为手术规划提供完整的解剖学上下文。其解决方案的关键在于提出一个具有不确定性感知能力的框架,通过在nnUNet基础上增加一个用于体素级不确定性预测的通道,在不牺牲肿瘤分割精度的前提下,实现单次前向传播即可输出不确定性图(uncertainty map),并结合正常脑组织与肿瘤数据集训练统一模型,从而同时获得高精度的肿瘤分割(Dice相似系数DSC=0.86)和健康脑结构分割(DSC=0.81),最终生成包含肿瘤及其自然解剖环境的完整输出,并叠加不确定性信息以辅助医生进行更可靠的临床决策。

链接: https://arxiv.org/abs/2511.12801
作者: Andrew Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate segmentation of brain tumors is vital for diagnosis, surgical planning, and treatment monitoring. Deep learning has advanced on benchmarks, but two issues limit clinical use: no uncertainty estimates for errors and no segmentation of healthy brain structures around tumors for surgery. Current methods fail to unify tumor localization with anatomical context and lack confidence scores. This study presents an uncertainty-aware framework augmenting nnUNet with a channel for voxel-wise uncertainty. Trained on BraTS2023, it yields a correlation of 0.750 and RMSD of 0.047 for uncertainty without hurting tumor accuracy. It predicts uncertainty in one pass, with no extra networks or inferences, aiding clinical decisions. For whole-brain context, a unified model combines normal and cancer datasets, achieving a DSC of 0.81 for brain structures and 0.86 for tumor, with robust key-region performance. Combining both innovations gives the first model outputting tumor in natural surroundings plus an overlaid uncertainty map. Visual checks of outputs show uncertainty offers key insights to evaluate predictions and fix errors, helping informed surgical decisions from AI.
zh

[CV-150] Lightweight Optimal-Transport Harmonization on Edge Devices AAAI2026

【速读】:该论文旨在解决增强现实(Augmented Reality, AR)中插入对象与周围场景颜色不协调的问题,即颜色和谐化(Color Harmonization)问题,以实现视觉上无缝融合的合成效果。现有方法因缺乏实时性而未被集成到AR流水线中,为此,作者提出了一种轻量级解决方案——MKL-Harmonizer算法,其核心在于利用经典最优传输理论(Optimal Transport Theory),通过训练一个紧凑的编码器来预测Monge-Kantorovich运输映射(Monge-Kantorovich transport map),从而实现在设备端的高效推理。

链接: https://arxiv.org/abs/2511.12785
作者: Maria Larchenko,Dmitry Guskov,Alexander Lobashev,Georgy Derevyanko
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: AAAI 2026, Oral

点击查看摘要

Abstract:Color harmonization adjusts the colors of an inserted object so that it perceptually matches the surrounding image, resulting in a seamless composite. The harmonization problem naturally arises in augmented reality (AR), yet harmonization algorithms are not currently integrated into AR pipelines because real-time solutions are scarce. In this work, we address color harmonization for AR by proposing a lightweight approach that supports on-device inference. For this, we leverage classical optimal transport theory by training a compact encoder to predict the Monge-Kantorovich transport map. We benchmark our MKL-Harmonizer algorithm against state-of-the-art methods and demonstrate that for real composite AR images our method achieves the best aggregated score. We release our dedicated AR dataset of composite images with pixel-accurate masks and data-gathering toolkit to support further data acquisition by researchers.
zh

[CV-151] RoCoISLR: A Romanian Corpus for Isolated Sign Language Recognition

【速读】:该论文旨在解决罗马尼亚孤立手语识别(Romanian Isolated Sign Language Recognition, RoISLR)领域缺乏大规模、标准化数据集的问题,从而阻碍了相关研究的进展。解决方案的关键在于构建了一个名为RoCoISLR的新语料库,包含超过9,000个视频样本,覆盖近6,000个标准化词汇(glosses),并在此基础上对七种前沿视频识别模型进行了系统性基准测试,首次为RoISLR提供了可比较的性能指标。结果表明,基于Transformer的架构(如Swin Transformer)优于传统卷积模型,在Top-1准确率上达到34.1%,同时揭示了低资源手语中长尾类别分布带来的挑战,为后续系统性研究奠定了基础。

链接: https://arxiv.org/abs/2511.12767
作者: Cătălin-Alexandru Rîpanu,Andrei-Theodor Hotnog,Giulia-Stefania Imbrea,Dumitru-Clementin Cercel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 5 pages, 3 figures, 4 tables

点击查看摘要

Abstract:Automatic sign language recognition plays a crucial role in bridging the communication gap between deaf communities and hearing individuals; however, most available datasets focus on American Sign Language. For Romanian Isolated Sign Language Recognition (RoISLR), no large-scale, standardized dataset exists, which limits research progress. In this work, we introduce a new corpus for RoISLR, named RoCoISLR, comprising over 9,000 video samples that span nearly 6,000 standardized glosses from multiple sources. We establish benchmark results by evaluating seven state-of-the-art video recognition models-I3D, SlowFast, Swin Transformer, TimeSformer, Uniformer, VideoMAE, and PoseConv3D-under consistent experimental setups, and compare their performance with that of the widely used WLASL2000 corpus. According to the results, transformer-based architectures outperform convolutional baselines; Swin Transformer achieved a Top-1 accuracy of 34.1%. Our benchmarks highlight the challenges associated with long-tail class distributions in low-resource sign languages, and RoCoISLR provides the initial foundation for systematic RoISLR research.
zh

[CV-152] Which Way from B to A: The role of embedding geometry in image interpolation for Stable Diffusion

【速读】:该论文旨在解决稳定扩散模型(Stable Diffusion)在跨不同提示词(prompt)的嵌入向量之间进行插值时,生成图像过渡不平滑、语义跳跃明显的问题。其核心解决方案在于将CLIP嵌入矩阵重新诠释为Wasserstein空间中的点云(point clouds),并将其插值问题建模为最优传输(optimal transport)问题,从而计算出嵌入空间中两点间的最短路径(测地线)。这种方法利用了嵌入空间的几何结构,相较于传统线性插值等方法,能够生成更连贯、更自然的中间图像,显著提升了图像插值的质量与视觉一致性。

链接: https://arxiv.org/abs/2511.12757
作者: Nicholas Karris,Luke Durell,Javier Flores,Tegan Emerson
机构: University of California, San Diego (加州大学圣地亚哥分校); Pacific Northwest National Laboratory (太平洋西北国家实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:It can be shown that Stable Diffusion has a permutation-invariance property with respect to the rows of Contrastive Language-Image Pretraining (CLIP) embedding matrices. This inspired the novel observation that these embeddings can naturally be interpreted as point clouds in a Wasserstein space rather than as matrices in a Euclidean space. This perspective opens up new possibilities for understanding the geometry of embedding space. For example, when interpolating between embeddings of two distinct prompts, we propose reframing the interpolation problem as an optimal transport problem. By solving this optimal transport problem, we compute a shortest path (or geodesic) between embeddings that captures a more natural and geometrically smooth transition through the embedding space. This results in smoother and more coherent intermediate (interpolated) images when rendered by the Stable Diffusion generative model. We conduct experiments to investigate this effect, comparing the quality of interpolated images produced using optimal transport to those generated by other standard interpolation methods. The novel optimal transport–based approach presented indeed gives smoother image interpolations, suggesting that viewing the embeddings as point clouds (rather than as matrices) better reflects and leverages the geometry of the embedding space.
zh

[CV-153] SAGE: Saliency-Guided Contrastive Embeddings

【速读】:该论文旨在解决现有基于显著性(saliency)引导训练方法中因依赖模型内部机制而导致的不可靠性问题,以及如何更有效地将人类感知先验(human perceptual priors)融入神经网络训练以提升模型泛化能力与对高风险领域任务的适应性。其解决方案的关键在于摒弃传统仅在图像空间内施加显著性引导的做法,转而利用模型的潜在空间(latent space)嵌入来捕捉和调控人类显著性信息;具体提出SAGE(Saliency-Guided Contrastive Embeddings)损失函数,通过对比嵌入方式引导模型关注显著特征、远离非显著特征,并结合logit分布的合理性验证确保训练方向与人类显著性感知一致,从而实现跨骨干网络、开放集与封闭集场景下的性能提升。

链接: https://arxiv.org/abs/2511.12744
作者: Colton R. Crum,Adam Czajka
机构: University of Notre Dame (圣母大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 2 figures, 5 tables

点击查看摘要

Abstract:Integrating human perceptual priors into the training of neural networks has been shown to raise model generalization, serve as an effective regularizer, and align models with human expertise for applications in high-risk domains. Existing approaches to integrate saliency into model training often rely on internal model mechanisms, which recent research suggests may be unreliable. Our insight is that many challenges associated with saliency-guided training stem from the placement of the guidance approaches solely within the image space. Instead, we move away from the image space, use the model’s latent space embeddings to steer human guidance during training, and we propose SAGE (Saliency-Guided Contrastive Embeddings): a loss function that integrates human saliency into network training using contrastive embeddings. We apply salient-preserving and saliency-degrading signal augmentations to the input and capture the changes in embeddings and model logits. We guide the model towards salient features and away from non-salient features using a contrastive triplet loss. Additionally, we perform a sanity check on the logit distributions to ensure that the model outputs match the saliency-based augmentations. We demonstrate a boost in classification performance across both open- and closed-set scenarios against SOTA saliency-based methods, showing SAGE’s effectiveness across various backbones, and include experiments to suggest its wide generalization across tasks.
zh

[CV-154] Deep Imbalanced Multi-Target Regression: 3D Point Cloud Voxel Content Estimation in Simulated Forests

【速读】:该论文旨在解决基于体素化(Voxelization)的激光雷达(LiDAR)点云数据在降低计算成本的同时,因空间分辨率受限而导致细粒度结构信息丢失的问题,特别是如何从高阶体素化数据中准确推断低阶体素内容信息(如目标占据百分比)。其解决方案的关键在于提出一种面向类别不平衡学习的多目标回归方法,结合核点卷积(Kernel Point Convolutions, KPConv)与密度相关性加权策略(Density-Based Relevance, DBR),通过加权均方误差(Weighted MSE)、焦点回归(Focal Regression, FocalR)及正则化技术优化模型训练,从而提升对森林场景中树皮、叶片、土壤等多类目标在不同体素尺寸下占据率的估计精度。

链接: https://arxiv.org/abs/2511.12740
作者: Amirhossein Hassanzadeh,Bartosz Krawczyk,Michael Saunders,Rob Wible,Keith Krause,Dimah Dera,Jan van Aardt
机构: Chester F. Carlson Center for Imaging Science, Rochester Institute of Technology (罗彻斯特理工学院切斯特F.卡尔森成像科学中心); United States Space Force (美国太空军); Battelle (巴特尔)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Voxelization is an effective approach to reduce the computational cost of processing Light Detection and Ranging (LiDAR) data, yet it results in a loss of fine-scale structural information. This study explores whether low-level voxel content information, specifically target occupancy percentage within a voxel, can be inferred from high-level voxelized LiDAR point cloud data collected from Digital Imaging and remote Sensing Image Generation (DIRSIG) software. In our study, the targets include bark, leaf, soil, and miscellaneous materials. We propose a multi-target regression approach in the context of imbalanced learning using Kernel Point Convolutions (KPConv). Our research leverages cost-sensitive learning to address class imbalance called density-based relevance (DBR). We employ weighted Mean Saquared Erorr (MSE), Focal Regression (FocalR), and regularization to improve the optimization of KPConv. This study performs a sensitivity analysis on the voxel size (0.25 - 2 meters) to evaluate the effect of various grid representations in capturing the nuances of the forest. This sensitivity analysis reveals that larger voxel sizes (e.g., 2 meters) result in lower errors due to reduced variability, while smaller voxel sizes (e.g., 0.25 or 0.5 meter) exhibit higher errors, particularly within the canopy, where variability is greatest. For bark and leaf targets, error values at smaller voxel size datasets (0.25 and 0.5 meter) were significantly higher than those in larger voxel size datasets (2 meters), highlighting the difficulty in accurately estimating within-canopy voxel content at fine resolutions. This suggests that the choice of voxel size is application-dependent. Our work fills the gap in deep imbalance learning models for multi-target regression and simulated datasets for 3D LiDAR point clouds of forests.
zh

[CV-155] Direct Visual Grounding by Directing Attention of Visual Tokens

【速读】:该论文试图解决视觉语言模型(Vision Language Models, VLMs)中一个关键问题:在最终的语言模型(LLM)模块中,与查询最相关的视觉标记(visual tokens)在生成答案时往往未获得足够的注意力,导致视觉问答等任务出现错误答案。实验表明,标准的下一个词预测(Next-Token Prediction, NTP)损失不足以引导模型关注相关视觉信息。解决方案的关键在于提出一种新的KL注意力损失(KL Attention Loss, KLAL),通过最小化视觉标记的注意力分布与真实注意力图(ground truth attention maps)之间的KL散度来直接监督视觉标记对语言标记的注意力分配。该注意力图可来自合成场景中的几何结构或真实图像中的标注(如边界框或点注释),无需额外标签即可嵌入LLM模块进行监督。KLAL与NTP联合优化后显著提升了VLM在几何推理、指代表达理解和指针任务上的性能。

链接: https://arxiv.org/abs/2511.12738
作者: Parsa Esmaeilkhani,Longin Jan Latecki
机构: Temple University (坦普尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision Language Models (VLMs) mix visual tokens and text tokens. A puzzling issue is the fact that visual tokens most related to the query receive little to no attention in the final layers of the LLM module of VLMs from the answer tokens, where all tokens are treated equally, in particular, visual and language tokens in the LLM attention layers. This fact may result in wrong answers to visual questions, as our experimental results confirm. It appears that the standard next-token prediction (NTP) loss provides an insufficient signal for directing attention to visual tokens. We hypothesize that a more direct supervision of the attention of visual tokens to corresponding language tokens in the LLM module of VLMs will lead to improved performance on visual tasks. To demonstrate that this is indeed the case, we propose a novel loss function that directly supervises the attention of visual tokens. It directly grounds the answer language tokens in images by directing their attention to the relevant visual tokens. This is achieved by aligning the attention distribution of visual tokens to ground truth attention maps with KL divergence. The ground truth attention maps are obtained from task geometry in synthetic cases or from standard grounding annotations (e.g., bounding boxes or point annotations) in real images, and are used inside the LLM for attention supervision without requiring new labels. The obtained KL attention loss (KLAL) when combined with NTP encourages VLMs to attend to relevant visual tokens while generating answer tokens. This results in notable improvements across geometric tasks, pointing, and referring expression comprehension on both synthetic and real-world data, as demonstrated by our experiments. We also introduce a new dataset to evaluate the line tracing abilities of VLMs. Surprisingly, even commercial VLMs do not perform well on this task.
zh

[CV-156] Backdoor Attacks on Open Vocabulary Object Detectors via Multi-Modal Prompt Tuning AAAI2026

【速读】:该论文旨在解决开放词汇目标检测器(Open-vocabulary object detectors, OVODs)在高风险应用场景中面临的新型后门攻击问题,特别是由提示调优(prompt tuning)引入的攻击面。其解决方案的关键在于提出TrAP(Trigger-Aware Prompt tuning),一种多模态后门注入策略,通过联合优化图像和文本模态中的提示参数与视觉触发器,在不重新训练基础模型权重的前提下植入隐蔽后门。该方法利用可学习的轻量级提示令牌实现恶意行为的嵌入,同时采用基于课程的学习策略逐步缩小触发器尺寸,从而在推理阶段仅用小尺寸触发 patch 即可有效激活后门,且在干净图像上相比零样本设置还能提升下游任务性能。

链接: https://arxiv.org/abs/2511.12735
作者: Ankita Raj,Chetan Arora
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2026

点击查看摘要

Abstract:Open-vocabulary object detectors (OVODs) unify vision and language to detect arbitrary object categories based on text prompts, enabling strong zero-shot generalization to novel concepts. As these models gain traction in high-stakes applications such as robotics, autonomous driving, and surveillance, understanding their security risks becomes crucial. In this work, we conduct the first study of backdoor attacks on OVODs and reveal a new attack surface introduced by prompt tuning. We propose TrAP (Trigger-Aware Prompt tuning), a multi-modal backdoor injection strategy that jointly optimizes prompt parameters in both image and text modalities along with visual triggers. TrAP enables the attacker to implant malicious behavior using lightweight, learnable prompt tokens without retraining the base model weights, thus preserving generalization while embedding a hidden backdoor. We adopt a curriculum-based training strategy that progressively shrinks the trigger size, enabling effective backdoor activation using small trigger patches at inference. Experiments across multiple datasets show that TrAP achieves high attack success rates for both object misclassification and object disappearance attacks, while also improving clean image performance on downstream datasets compared to the zero-shot setting.
zh

[CV-157] FSDAM: Few-Shot Driving Attention Modeling via Vision-Language Coupling

【速读】:该论文旨在解决自动驾驶系统中驾驶员注意力预测与解释生成的难题,尤其是现有方法依赖大规模标注 gaze 数据集所带来的高成本问题。其解决方案的关键在于提出 FSDAM(Few-Shot Driver Attention Modeling)框架,采用双路径架构:分别由独立模块完成空间注意力预测与自然语言描述生成,并通过跨模态对齐机制确保语义一致性,从而在仅需约 100 个标注样本的情况下实现高效且可解释的注意力建模,展现出强大的零样本泛化能力。

链接: https://arxiv.org/abs/2511.12708
作者: Kaiser Hamid,Can Cui,Khandakar Ashrafi Akbar,Ziran Wang,Nade Liang
机构: Texas Tech University (德克萨斯理工大学); Purdue University (普渡大学); University of Texas at Dallas (德克萨斯大学达拉斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding where drivers look and why they shift their attention is essential for autonomous systems that read human intent and justify their actions. Most existing models rely on large-scale gaze datasets to learn these patterns; however, such datasets are labor-intensive to collect and time-consuming to curate. We present FSDAM (Few-Shot Driver Attention Modeling), a framework that achieves joint attention prediction and caption generation with approximately 100 annotated examples, two orders of magnitude fewer than existing approaches. Our approach introduces a dual-pathway architecture where separate modules handle spatial prediction and caption generation while maintaining semantic consistency through cross-modal alignment. Despite minimal supervision, FSDAM achieves competitive performance on attention prediction, generates coherent, and context-aware explanations. The model demonstrates robust zero-shot generalization across multiple driving benchmarks. This work shows that effective attention-conditioned generation is achievable with limited supervision, opening new possibilities for practical deployment of explainable driver attention systems in data-constrained scenarios.
zh

[CV-158] Counting Through Occlusion: Framework for Open World Amodal Counting

【速读】:该论文旨在解决现有可见物体计数方法在遮挡场景下性能显著下降的问题,其根本原因在于骨干网络会编码遮挡表面而非目标物体,导致特征表示被污染,从而影响准确计数。解决方案的关键在于提出CountOCC框架,通过层次化多模态引导显式重建被遮挡物体的特征:利用可见片段的空间上下文与文本及视觉嵌入的语义先验,融合生成多个金字塔层级上遮挡区域的类别判别性特征;同时引入视觉等价目标,在注意力空间中强制同一场景的不同视图(遮挡与未遮挡)产生空间对齐的基于梯度的注意力图,从而保持计数所需的判别特性。

链接: https://arxiv.org/abs/2511.12702
作者: Safaeid Hossain Arib,Rabeya Akter,Abdul Monaf Chowdhury,Md Jubair Ahmed Sourov,Md Mehedi Hasan
机构: University of Dhaka (达卡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Object counting has achieved remarkable success on visible instances, yet state-of-the-art (SOTA) methods fail under occlusion, a pervasive challenge in real world deployment. This failure stems from a fundamental architectural limitation where backbone networks encode occluding surfaces rather than target objects, thereby corrupting the feature representations required for accurate enumeration. To address this, we present CountOCC, an amodal counting framework that explicitly reconstructs occluded object features through hierarchical multimodal guidance. Rather than accepting degraded encodings, we synthesize complete representations by integrating spatial context from visible fragments with semantic priors from text and visual embeddings, generating class-discriminative features at occluded locations across multiple pyramid levels. We further introduce a visual equivalence objective that enforces consistency in attention space, ensuring that both occluded and unoccluded views of the same scene produce spatially aligned gradient-based attention maps. Together, these complementary mechanisms preserve discriminative properties essential for accurate counting under occlusion. For rigorous evaluation, we establish occlusion-augmented versions of FSC 147 and CARPK spanning both structured and unstructured scenes. CountOCC achieves SOTA performance on FSC 147 with 26.72% and 20.80% MAE reduction over prior baselines under occlusion in validation and test, respectively. CountOCC also demonstrates exceptional generalization by setting new SOTA results on CARPK with 49.89% MAE reduction and on CAPTUREReal with 28.79% MAE reduction, validating robust amodal counting across diverse visual domains. Code will be released soon.
zh

[CV-159] X-VMamba: Explainable Vision Mamba

【速读】:该论文旨在解决视觉状态空间模型(Vision State Space Models, Vision SSMs)在处理空间信息时缺乏透明性的问题,尤其是其内部状态动态如何被输入序列(如图像块或token)所影响尚不清晰,这限制了对模型决策机制的理解。解决方案的关键在于提出一种基于可控性的可解释性框架,通过两种互补的方法量化输入片段对内部状态演化的影响:一是适用于任意SSM架构的基于雅可比矩阵(Jacobian-based)方法,能捕捉完整的状态传播链路;二是针对对角线SSM结构的基于格拉姆矩阵(Gramian-based)方法,利用闭式解析解实现更高效的计算。这两种方法均在单次前向传播中完成分析,具有线性复杂度,无需修改网络结构或调参,从而为SSMs提供了一个统一、基础且高效的可解释性分析范式。

链接: https://arxiv.org/abs/2511.12694
作者: Mohamed A. Mabrok,Yalda Zafari
机构: Qatar University (卡塔尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Dynamical Systems (math.DS)
备注:

点击查看摘要

Abstract:State Space Models (SSMs), particularly the Mamba architecture, have recently emerged as powerful alternatives to Transformers for sequence modeling, offering linear computational complexity while achieving competitive performance. Yet, despite their effectiveness, understanding how these Vision SSMs process spatial information remains challenging due to the lack of transparent, attention-like mechanisms. To address this gap, we introduce a controllability-based interpretability framework that quantifies how different parts of the input sequence (tokens or patches) influence the internal state dynamics of SSMs. We propose two complementary formulations: a Jacobian-based method applicable to any SSM architecture that measures influence through the full chain of state propagation, and a Gramian-based approach for diagonal SSMs that achieves superior speed through closed-form analytical solutions. Both methods operate in a single forward pass with linear complexity, requiring no architectural modifications or hyperparameter tuning. We validate our framework through experiments on three diverse medical imaging modalities, demonstrating that SSMs naturally implement hierarchical feature refinement from diffuse low-level textures in early layers to focused, clinically meaningful patterns in deeper layers. Our analysis reveals domain-specific controllability signatures aligned with diagnostic criteria, progressive spatial selectivity across the network hierarchy, and the substantial influence of scanning strategies on attention patterns. Beyond medical imaging, we articulate applications spanning computer vision, natural language processing, and cross-domain tasks. Our framework establishes controllability analysis as a unified, foundational interpretability paradigm for SSMs across all domains. Code and analysis tools will be made available upon publication
zh

[CV-160] HEDGE: Hallucination Estimation via Dense Geometric Entropy for VQA with Vision-Language Models

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在开放问答任务中普遍存在的幻觉问题(hallucination),即模型生成与输入图像无关或不准确的答案。其核心挑战在于如何构建一个统一、可复现且对不同架构和提示设计敏感的检测框架。解决方案的关键在于提出HEDGE框架,该框架融合受控的视觉扰动(controlled visual perturbations)、语义聚类(semantic clustering,包括基于自然语言推理NLI和嵌入空间的方法)以及鲁棒不确定性度量(robust uncertainty metrics),形成一个端到端的检测流水线。实验表明,VASE指标结合嵌入聚类和适度采样预算(n ~ 10–15)时表现最优,同时发现模型架构(如密集视觉token化优于受限token化)和提示结构(简洁标签式输出优于复杂句式)显著影响检测性能,从而将幻觉检测建模为由采样尺度、提示结构、模型架构和聚类策略共同决定的几何鲁棒性问题。

链接: https://arxiv.org/abs/2511.12693
作者: Sushant Gautam,Michael A. Riegler,Pål Halvorsen
机构: Simula Metropolitan Center for Digital Engineering (SimulaMet); Oslo Metropolitan University (OsloMet); Simula Research Laboratory
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) enable open-ended visual question answering but remain prone to hallucinations. We present HEDGE, a unified framework for hallucination detection that combines controlled visual perturbations, semantic clustering, and robust uncertainty metrics. HEDGE integrates sampling, distortion synthesis, clustering (entailment- and embedding-based), and metric computation into a reproducible pipeline applicable across multimodal architectures. Evaluations on VQA-RAD and KvasirVQA-x1 with three representative VLMs (LLaVA-Med, Med-Gemma, Qwen2.5-VL) reveal clear architecture- and prompt-dependent trends. Hallucination detectability is highest for unified-fusion models with dense visual tokenization (Qwen2.5-VL) and lowest for architectures with restricted tokenization (Med-Gemma). Embedding-based clustering often yields stronger separation when applied directly to the generated answers, whereas NLI-based clustering remains advantageous for LLaVA-Med and for longer, sentence-level responses. Across configurations, the VASE metric consistently provides the most robust hallucination signal, especially when paired with embedding clustering and a moderate sampling budget (n ~ 10-15). Prompt design also matters: concise, label-style outputs offer clearer semantic structure than syntactically constrained one-sentence responses. By framing hallucination detection as a geometric robustness problem shaped jointly by sampling scale, prompt structure, model architecture, and clustering strategy, HEDGE provides a principled, compute-aware foundation for evaluating multimodal reliability. The hedge-bench PyPI library enables reproducible and extensible benchmarking, with full code and experimental resources available at this https URL . Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) MSC classes: 68T09, 68T45 ACMclasses: I.2.10; I.4.8 Cite as: arXiv:2511.12693 [cs.CV] (or arXiv:2511.12693v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2511.12693 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-161] R2Seg: Training-Free OOD Medical Tumor Segmentation via Anatomical Reasoning and Statistical Rejection

【速读】:该论文旨在解决基础模型(foundation models)在医学图像分割任务中面对分布外(out-of-distribution, OOD)数据时性能下降的问题,尤其是对OOD肿瘤区域产生碎片化假阳性结果。其解决方案的关键在于提出一个无需训练的框架R² Seg,采用两阶段“推理-拒绝”(Reason-and-Reject)机制:第一阶段通过LLM引导的解剖学推理规划器定位器官锚点并生成多尺度感兴趣区域(Region of Interest, ROI);第二阶段在冻结的基础模型(BiomedParse)生成的候选区域中,利用双样本统计检验筛选出显著区别于正常组织的分割结果,从而有效抑制假阳性。该方法无需参数更新,兼容零更新测试时增强(test-time augmentation),避免灾难性遗忘,并在多中心、多模态肿瘤分割基准上显著提升Dice分数、特异性与敏感性。

链接: https://arxiv.org/abs/2511.12691
作者: Shuaike Shen,Ke Liu,Jiaqing Xie,Shangde Gao,Chunhua Shen,Ge Liu,Mireia Crispin-Ortuzar,Shangqi Gao
机构: Carnegie Mellon University (卡内基梅隆大学); University of Cambridge (剑桥大学); Zhejiang University (浙江大学); ETH Zurich (苏黎世联邦理工学院); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Foundation models for medical image segmentation struggle under out-of-distribution (OOD) shifts, often producing fragmented false positives on OOD tumors. We introduce R ^2 Seg, a training-free framework for robust OOD tumor segmentation that operates via a two-stage Reason-and-Reject process. First, the Reason step employs an LLM-guided anatomical reasoning planner to localize organ anchors and generate multi-scale ROIs. Second, the Reject step applies two-sample statistical testing to candidates generated by a frozen foundation model (BiomedParse) within these ROIs. This statistical rejection filter retains only candidates significantly different from normal tissue, effectively suppressing false positives. Our framework requires no parameter updates, making it compatible with zero-update test-time augmentation and avoiding catastrophic forgetting. On multi-center and multi-modal tumor segmentation benchmarks, R ^2 Seg substantially improves Dice, specificity, and sensitivity over strong baselines and the original foundation models. Code are available at this https URL.
zh

[CV-162] BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections

【速读】:该论文旨在解决在真实世界场景中部署能够回答环境相关问题的具身智能体(Embodied Agents)所面临的挑战,特别是由于缺乏能真实反映实际运行条件的基准测试数据集。为应对这一问题,作者提出将基础设施巡检作为开放词汇具身问答(Open-Vocabulary Embodied Question Answering, EQA)的典型应用场景,因其天然需要多尺度推理、长距离空间理解及复杂语义关系建模,并可借助标准化的国家桥梁库存(National Bridge Inventory, NBI)评级体系(0–9分)进行客观评估。解决方案的关键在于构建了一个名为BridgeEQA的新基准数据集,包含2200个基于专业桥梁巡检报告生成的开放词汇QA对,覆盖200个真实桥梁场景,平均每场景47.93张图像;同时引入图像引用相关性(Image Citation Relevance)作为新指标以衡量模型引用视觉证据的能力。进一步地,作者提出Embodied Memory Visual Reasoning(EMVR)框架,将巡检任务建模为基于图像场景图的序列导航过程:图像作为节点,代理通过动作遍历视点、比对证据并在马尔可夫决策过程中进行推理,显著优于现有基线方法。

链接: https://arxiv.org/abs/2511.12676
作者: Subin Varghese,Joshua Gao,Asad Ur Rahman,Vedhus Hoskere
机构: University of Houston (休斯顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deploying embodied agents that can answer questions about their surroundings in realistic real-world settings remains difficult, partly due to the scarcity of benchmarks that faithfully capture practical operating conditions. We propose infrastructure inspection as a compelling domain for open-vocabulary Embodied Question Answering (EQA): it naturally demands multi-scale reasoning, long-range spatial understanding, and complex semantic relationships, while offering unique evaluation advantages via standardized National Bridge Inventory (NBI) condition ratings (0-9), professional inspection reports, and egocentric imagery. We introduce BridgeEQA, a benchmark of 2,200 open-vocabulary question-answer pairs (in the style of OpenEQA) grounded in professional inspection reports across 200 real-world bridge scenes with 47.93 images on average per scene. Questions require synthesizing visual evidence across multiple images and aligning responses with NBI condition ratings. We further propose a new EQA metric Image Citation Relevance to evaluate the ability of a model to cite relevant images. Evaluations of state-of-the-art vision-language models reveal substantial performance gaps under episodic memory EQA settings. To address this, we propose Embodied Memory Visual Reasoning (EMVR), which formulates inspection as sequential navigation over an image-based scene graph: images are nodes, and an agent takes actions to traverse views, compare evidence, and reason within a Markov decision process. EMVR shows strong performance over the baselines. We publicly release both the dataset and code. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2511.12676 [cs.CV] (or arXiv:2511.12676v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2511.12676 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-163] Appreciate the View: A Task-Aware Evaluation Framework for Novel View Synthesis

【速读】:该论文旨在解决新颖视图合成(Novel View Synthesis, NVS)中生成图像的真实性与视角变换忠实度难以可靠评估的问题。现有评价指标如像素级相似性或分布度量常因无法捕捉源图像、视角变化与生成结果之间的细微关联而误判错误结果,导致模型性能排名失真。解决方案的关键在于提出一个任务感知的评估框架,该框架利用强大的NVS基础模型Zero123提取特征,并通过轻量级微调增强判别能力;在此基础上引入两个互补的评价指标:基于参考图像的Dₚᵣᵢₛₘ和无参考的MMDₚᵣᵢₛₘ,二者均能有效识别错误生成并准确反映人类偏好排序,从而填补了NVS评估中的关键空白。

链接: https://arxiv.org/abs/2511.12675
作者: Saar Stern,Ido Sobol,Or Litany
机构: Technion (以色列理工学院); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 3DV 2026. Project page: this https URL

点击查看摘要

Abstract:The goal of Novel View Synthesis (NVS) is to generate realistic images of a given content from unseen viewpoints. But how can we trust that a generated image truly reflects the intended transformation? Evaluating its reliability remains a major challenge. While recent generative models, particularly diffusion-based approaches, have significantly improved NVS quality, existing evaluation metrics struggle to assess whether a generated image is both realistic and faithful to the source view and intended viewpoint transformation. Standard metrics, such as pixel-wise similarity and distribution-based measures, often mis-rank incorrect results as they fail to capture the nuanced relationship between the source image, viewpoint change, and generated output. We propose a task-aware evaluation framework that leverages features from a strong NVS foundation model, Zero123, combined with a lightweight tuning step to enhance discrimination. Using these features, we introduce two complementary evaluation metrics: a reference-based score, D_\textPRISM , and a reference-free score, \textMMD_\textPRISM . Both reliably identify incorrect generations and rank models in agreement with human preference studies, addressing a fundamental gap in NVS evaluation. Our framework provides a principled and practical approach to assessing synthesis quality, paving the way for more reliable progress in novel view synthesis. To further support this goal, we apply our reference-free metric to six NVS methods across three benchmarks: Toys4K, Google Scanned Objects (GSO), and OmniObject3D, where \textMMD_\textPRISM produces a clear and stable ranking, with lower scores consistently indicating stronger models.
zh

[CV-164] DensePercept-NCSSD: Vision Mamba towards Real-time Dense Visual Perception with Non-Causal State Space Duality

【速读】:该论文旨在解决实时、高精度的光流(optical flow)与视差(disparity)估计问题,以支持密集三维感知任务。其核心挑战在于如何在保证准确性的前提下实现低延迟和低GPU资源消耗。解决方案的关键在于提出了一种基于非因果选择性状态空间(non-causal selective state space)的Mamba块模型,通过融合成对输入图像,在保持高效计算的同时有效处理实时应用中的约束条件,从而显著降低推理时间并维持高精度输出。

链接: https://arxiv.org/abs/2511.12671
作者: Tushar Anand,Advik Sinha,Abhijit Das
机构: Birla Institute of Technology and Science, Pilani, Hyderabad Campus (比尔拉理工大学与科学学院,皮兰尼,海得拉巴校区)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this work, we propose an accurate and real-time optical flow and disparity estimation model by fusing pairwise input images in the proposed non-causal selective state space for dense perception tasks. We propose a non-causal Mamba block-based model that is fast and efficient and aptly manages the constraints present in a real-time applications. Our proposed model reduces inference times while maintaining high accuracy and low GPU usage for optical flow and disparity map generation. The results and analysis, and validation in real-life scenario justify that our proposed model can be used for unified real-time and accurate 3D dense perception estimation tasks. The code, along with the models, can be found at this https URL
zh

[CV-165] Hi-Reco: High-Fidelity Real-Time Conversational Digital Humans

【速读】:该论文旨在解决高保真数字人(digital human)在交互应用中同时实现视觉真实感与实时响应性的难题。其解决方案的关键在于构建一个异步执行流水线,协同多模态组件以最小化延迟,并集成视觉逼真的3D头像、基于人格特征的语音合成(persona-driven expressive speech synthesis)以及知识增强的对话生成模块。系统通过历史增强(history augmentation)维持对话连贯性,结合意图驱动的路由机制(intent-based routing)高效访问知识库,从而实现低延迟、情感丰富且语境感知的自然交互,适用于通信、教育和娱乐等沉浸式场景。

链接: https://arxiv.org/abs/2511.12662
作者: Hongbin Huang,Junwei Li,Tianxin Xie,Zhuang Li,Cekai Weng,Yaodong Yang,Yue Luo,Li Liu,Jing Tang,Zhijing Shao,Zeyu Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Proceedings of the Computer Graphics International 2025 (CGI’25)

点击查看摘要

Abstract:High-fidelity digital humans are increasingly used in interactive applications, yet achieving both visual realism and real-time responsiveness remains a major challenge. We present a high-fidelity, real-time conversational digital human system that seamlessly combines a visually realistic 3D avatar, persona-driven expressive speech synthesis, and knowledge-grounded dialogue generation. To support natural and timely interaction, we introduce an asynchronous execution pipeline that coordinates multi-modal components with minimal latency. The system supports advanced features such as wake word detection, emotionally expressive prosody, and highly accurate, context-aware response generation. It leverages novel retrieval-augmented methods, including history augmentation to maintain conversational flow and intent-based routing for efficient knowledge access. Together, these components form an integrated system that enables responsive and believable digital humans, suitable for immersive applications in communication, education, and entertainment.
zh

[CV-166] oward Real-world Text Image Forgery Localization: Structured and Interpretable Data Synthesis NEURIPS2025

【速读】:该论文旨在解决文本图像伪造定位(Text Image Forgery Localization, T-IFL)方法在真实场景中泛化能力差的问题,其根源在于真实世界数据集规模有限,以及合成数据与真实篡改行为之间存在的分布差异。解决方案的关键在于提出基于傅里叶级数的篡改合成框架(Fourier Series-based Tampering Synthesis, FSTS),该框架通过收集16,750个来自五类典型篡改类型的真人编辑痕迹(多格式日志记录),识别个体与群体层面的重复行为模式,并构建分层建模机制:个体篡改参数被表示为基操作-参数配置的紧凑组合,而群体分布则通过聚合这些行为形成;该建模方式借鉴傅里叶级数思想,以可解释的基础函数及其学习权重实现对篡改行为的逼近,从而生成多样且贴近真实伪造痕迹的训练数据,显著提升模型在真实数据上的泛化性能。

链接: https://arxiv.org/abs/2511.12658
作者: Zeqin Yu,Haotao Xie,Jian Zhang,Jiangqun Ni,Wenkan Su,Jiwu Huang
机构: Sun Yat-sen University (中山大学); Beihang University (北京航空航天大学); Guangzhou University (广州大学); Peng Cheng Laboratory (鹏城实验室); Shenzhen MSU-BIT University (深圳莫斯科大学-比特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2025 DB Track

点击查看摘要

Abstract:Existing Text Image Forgery Localization (T-IFL) methods often suffer from poor generalization due to the limited scale of real-world datasets and the distribution gap caused by synthetic data that fails to capture the complexity of real-world tampering. To tackle this issue, we propose Fourier Series-based Tampering Synthesis (FSTS), a structured and interpretable framework for synthesizing tampered text images. FSTS first collects 16,750 real-world tampering instances from five representative tampering types, using a structured pipeline that records human-performed editing traces via multi-format logs (e.g., video, PSD, and editing logs). By analyzing these collected parameters and identifying recurring behavioral patterns at both individual and population levels, we formulate a hierarchical modeling framework. Specifically, each individual tampering parameter is represented as a compact combination of basis operation-parameter configurations, while the population-level distribution is constructed by aggregating these behaviors. Since this formulation draws inspiration from the Fourier series, it enables an interpretable approximation using basis functions and their learned weights. By sampling from this modeled distribution, FSTS synthesizes diverse and realistic training data that better reflect real-world forgery traces. Extensive experiments across four evaluation protocols demonstrate that models trained with FSTS data achieve significantly improved generalization on real-world datasets. Dataset is available at \hrefthis https URLProject Page.
zh

[CV-167] DPVO-QAT: Heterogeneous QAT and CUDA Kernel Fusion for High-Performance Deep Patch Visual Odometry

【速读】:该论文旨在解决深度学习-based视觉SLAM(vSLAM)系统在资源受限的自主平台部署时面临的计算开销过大问题,尤其是生成式AI(Generative AI)驱动的深度Patch视觉里程计(Deep Patch Visual Odometry, DPVO)模型因高精度浮点运算导致的内存占用高、推理延迟大等效率瓶颈。其解决方案的关键在于提出一种分层量化优化框架DPVO-QAT++,通过三个核心技术协同实现:(1) 可学习缩放参数化(Scale-Only Training)以降低量化误差;(2) 前端与后端异构精度设计(Heterogeneous Precision Architecture),即前端采用浮点伪量化(fake quantization)并支持FP16/FP32混合精度,后端保持全精度;(3) GPU原生伪量化核融合(GPU-native kernel fusion),利用定制CUDA内核优化量化操作,显著减少内存占用和延迟。实验表明,该方法在保持轨迹精度(ATE)的同时大幅提升运行效率,在TartanAir和EuRoC数据集上分别实现了平均帧率提升52.1%与30.1%,峰值GPU内存减少64.9%与37.7%。

链接: https://arxiv.org/abs/2511.12653
作者: Cheng Liao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning-based Visual SLAM (vSLAM) systems exhibit exceptional geometric reasoning capabilities, yet their prohibitive computational overhead severely restricts deployment on resource-constrained autonomous platforms. This paper presents a hierarchical quantization optimization framework, DPVO-QAT++ (DPVO-QAT++: Heterogeneous QAT and CUDA Kernel Fusion for High-Performance Deep Patch Visual Odometry). Through the synergistic integration of learnable scale parameterization, a heterogeneous precision design for the Visual Odometry (VO) front-end and back-end (front-end floating-point fake quantization with FP16/FP32; back-end full precision), and GPU-native kernel fusion for fake quantization (custom CUDA kernels), our framework significantly reduces memory footprint and increases processing speed while preserving the trajectory accuracy of the original model. On the TartanAir dataset, our framework achieves an average FPS increase of 52.1%, a 29.1% reduction in median latency, and a 64.9% reduction in peak GPU memory reservation, while maintaining trajectory accuracy (ATE) comparable to the original DPVO model across 32 validation sequences. On the EuRoC dataset, it realizes an average FPS increase of 30.1%, a 23.1% reduction in median latency, and a 37.7% reduction in peak GPU memory reservation, maintaining comparable trajectory accuracy (ATE) across 11 validation sequences. Experimental results demonstrate that DPVO-QAT++ effectively bridges the gap between high-precision deep VO and the efficiency requirements for practical deployment, offering a viable engineering paradigm for the application of this technology on real-world embedded platforms. Keywords: Visual Odometry, Heterogeneous Precision Architecture, Quantization-Aware Training, CUDA Kernel Fusion, Scale-Only Training, Deep Patch Visual Odometry, GPU-Native Kernel Fusion. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2511.12653 [cs.CV] (or arXiv:2511.12653v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2511.12653 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-168] Medical Knowledge Intervention Prompt Tuning for Medical Image Classification

【速读】:该论文旨在解决现有提示调优(prompt tuning)方法在医学图像分类任务中难以精准区分不同医学概念、从而忽略跨模态医学影像中关键疾病特征的问题。解决方案的关键在于提出CILMP(Conditional Intervention of Large Language Models for Prompt Tuning),该方法通过从大规模文本预训练的大型语言模型(Large Language Models, LLMs)中提取疾病特异性表征,将其干预到低秩线性子空间中,并结合条件机制对每张医学图像进行实例自适应的提示生成,从而实现医学知识向视觉-语言模型(Vision-Language Models, VLMs)提示的有效迁移与增强。

链接: https://arxiv.org/abs/2511.12639
作者: Ye Du,Nanxi Yu,Shujun Wang
机构: The Hong Kong Polytechnic University (香港理工大学); Research Institute for Smart Ageing (智能老龄化研究院); Research Institute for Artificial Intelligence of Things (人工智能物联网研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE Transactions on Medical Imaging (Early Access) July 2025

点击查看摘要

Abstract:Vision-language foundation models (VLMs) have shown great potential in feature transfer and generalization across a wide spectrum of medical-related downstream tasks. However, fine-tuning these models is resource-intensive due to their large number of parameters. Prompt tuning has emerged as a viable solution to mitigate memory usage and reduce training time while maintaining competitive performance. Nevertheless, the challenge is that existing prompt tuning methods cannot precisely distinguish different kinds of medical concepts, which miss essentially specific disease-related features across various medical imaging modalities in medical image classification tasks. We find that Large Language Models (LLMs), trained on extensive text corpora, are particularly adept at providing this specialized medical knowledge. Motivated by this, we propose incorporating LLMs into the prompt tuning process. Specifically, we introduce the CILMP, Conditional Intervention of Large Language Models for Prompt Tuning, a method that bridges LLMs and VLMs to facilitate the transfer of medical knowledge into VLM prompts. CILMP extracts disease-specific representations from LLMs, intervenes within a low-rank linear subspace, and utilizes them to create disease-specific prompts. Additionally, a conditional mechanism is incorporated to condition the intervention process on each individual medical image, generating instance-adaptive prompts and thus enhancing adaptability. Extensive experiments across diverse medical image datasets demonstrate that CILMP consistently outperforms state-of-the-art prompt tuning methods, demonstrating its effectiveness. Code is available at this https URL.
zh

[CV-169] Denoising Vision Transformer Autoencoder with Spectral Self-Regularization

【速读】:该论文旨在解决变分自编码器(Variational Autoencoder, VAE)在高维潜在空间中因冗余高频成分导致扩散模型训练收敛困难、生成质量下降的问题。传统方法通过外部视觉基础模型(Vision Foundation Model, VFM)对高维潜在空间进行正则化,但其效果受限且机制不明确。本文的关键解决方案是提出一种频谱自正则化策略(spectral self-regularization),在不依赖VFM的前提下抑制高维潜在空间中的冗余高频噪声,同时保持重建质量;并进一步引入频谱对齐策略(spectral alignment)以加速基于Denoising-VAE的生成模型优化。实验表明,该方法使扩散模型收敛速度提升约2倍,同时在ImageNet 256×256上实现最优重建指标(rFID = 0.28, PSNR = 27.26)和竞争力生成性能(gFID = 1.82)。

链接: https://arxiv.org/abs/2511.12633
作者: Xunzhi Xiang,Xingye Tian,Guiyu Zhang,Yabo Chen,Shaofeng Zhang,Xuebo Wang,Xin Tao,Qi Fan
机构: Nanjing University (南京大学); Kling Team, Kuaishou Technology (快手科技); Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); Shanghai Jiao Tong University (上海交通大学); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Variational autoencoders (VAEs) typically encode images into a compact latent space, reducing computational cost but introducing an optimization dilemma: a higher-dimensional latent space improves reconstruction fidelity but often hampers generative performance. Recent methods attempt to address this dilemma by regularizing high-dimensional latent spaces using external vision foundation models (VFMs). However, it remains unclear how high-dimensional VAE latents affect the optimization of generative models. To our knowledge, our analysis is the first to reveal that redundant high-frequency components in high-dimensional latent spaces hinder the training convergence of diffusion models and, consequently, degrade generation quality. To alleviate this problem, we propose a spectral self-regularization strategy to suppress redundant high-frequency noise while simultaneously preserving reconstruction quality. The resulting Denoising-VAE, a ViT-based autoencoder that does not rely on VFMs, produces cleaner, lower-noise latents, leading to improved generative quality and faster optimization convergence. We further introduce a spectral alignment strategy to facilitate the optimization of Denoising-VAE-based generative models. Our complete method enables diffusion models to converge approximately 2 \times faster than with SD-VAE, while achieving state-of-the-art reconstruction quality (rFID = 0.28, PSNR = 27.26) and competitive generation performance (gFID = 1.82) on the ImageNet 256 \times 256 benchmark.
zh

[CV-170] Multivariate Diffusion Transformer with Decoupled Attention for High-Fidelity Mask-Text Collaborative Facial Generation

【速读】:该论文旨在解决多模态人脸生成中传统特征融合方法难以实现有效跨模态交互的问题,从而导致生成效果不佳。其解决方案的关键在于提出一种定制化的扩散变换器框架MDiTFace,通过统一的标记化策略处理语义掩码(semantic mask)和文本输入,消除异构模态表示间的差异;并设计了堆叠的新型多变量Transformer模块,同步处理所有条件以促进全面的多模态特征交互;此外,创新性地引入解耦注意力机制,将掩码标记与时间嵌入之间的隐式依赖关系分离,划分出动态与静态计算路径,使静态路径中的特征可在首次计算后缓存复用,从而在保持性能的同时将掩码条件引入的额外计算开销降低超过94%。

链接: https://arxiv.org/abs/2511.12631
作者: Yushe Cao,Dianxi Shi,Xing Fu,Xuechao Zou,Haikuo Peng,Xueqi Li,Chun Yu,Junliang Xing
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While significant progress has been achieved in multimodal facial generation using semantic masks and textual descriptions, conventional feature fusion approaches often fail to enable effective cross-modal interactions, thereby leading to suboptimal generation outcomes. To address this challenge, we introduce MDiTFace–a customized diffusion transformer framework that employs a unified tokenization strategy to process semantic mask and text inputs, eliminating discrepancies between heterogeneous modality representations. The framework facilitates comprehensive multimodal feature interaction through stacked, newly designed multivariate transformer blocks that process all conditions synchronously. Additionally, we design a novel decoupled attention mechanism by dissociating implicit dependencies between mask tokens and temporal embeddings. This mechanism segregates internal computations into dynamic and static pathways, enabling caching and reuse of features computed in static pathways after initial calculation, thereby reducing additional computational overhead introduced by mask condition by over 94% while maintaining performance. Extensive experiments demonstrate that MDiTFace significantly outperforms other competing methods in terms of both facial fidelity and conditional consistency.
zh

[CV-171] C3Net: Context-Contrast Network for Camouflaged Object Detection

【速读】:该论文旨在解决伪装目标检测(Camouflaged Object Detection, COD)中因目标与背景在颜色、纹理和图案上高度相似而导致的检测难题。现有传统分割方法和现代基础模型在处理此类任务时表现严重不足,主要原因在于COD面临六大核心挑战:内在相似性(Intrinsic Similarity)、边缘破坏(Edge Disruption)、极端尺度变化(Extreme Scale Variation)、环境复杂性(Environmental Complexities)、上下文依赖性(Contextual Dependencies)以及显著-伪装目标消歧(Salient-Camouflaged Object Disambiguation),这些挑战常同时出现并相互叠加,亟需系统性的架构创新来应对。解决方案的关键是提出C3Net,其采用专用的双路径解码器架构:边缘精修路径(Edge Refinement Pathway)通过梯度初始化的边缘增强模块从早期特征中恢复精确边界;上下文定位路径(Contextual Localization Pathway)引入图像级上下文引导机制(Image-based Context Guidance),无需外部模型即可实现内在显著性抑制;两者由注意力融合模块(Attentive Fusion Module)通过空间门控机制协同整合,从而实现对多维度挑战的全面覆盖。实验表明,C3Net在COD10K、CAMO和NC4K数据集上分别取得S-measure 0.898、0.904和0.913的领先性能,兼具高效性与鲁棒性。

链接: https://arxiv.org/abs/2511.12627
作者: Baber Jan,Aiman H. El-Maleh,Abdul Jabbar Siddiqui,Abdul Bais,Saeed Anwar
机构: King Fahd University of Petroleum and Minerals (国王法赫德石油大学); University of Regina (皇后大学); The University of Western Australia (西澳大利亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Camouflaged object detection identifies objects that blend seamlessly with their surroundings through similar colors, textures, and patterns. This task challenges both traditional segmentation methods and modern foundation models, which fail dramatically on camouflaged objects. We identify six fundamental challenges in COD: Intrinsic Similarity, Edge Disruption, Extreme Scale Variation, Environmental Complexities, Contextual Dependencies, and Salient-Camouflaged Object Disambiguation. These challenges frequently co-occur and compound the difficulty of detection, requiring comprehensive architectural solutions. We propose C3Net, which addresses all challenges through a specialized dual-pathway decoder architecture. The Edge Refinement Pathway employs gradient-initialized Edge Enhancement Modules to recover precise boundaries from early features. The Contextual Localization Pathway utilizes our novel Image-based Context Guidance mechanism to achieve intrinsic saliency suppression without external models. An Attentive Fusion Module synergistically combines the two pathways via spatial gating. C3Net achieves state-of-the-art performance with S-measures of 0.898 on COD10K, 0.904 on CAMO, and 0.913 on NC4K, while maintaining efficient processing. C3Net demonstrates that complex, multifaceted detection challenges require architectural innovation, with specialized components working synergistically to achieve comprehensive coverage beyond isolated improvements. Code, model weights, and results are available at this https URL.
zh

[CV-172] OPFormer: Object Pose Estimation leverag ing foundation model with geometric encoding

【速读】:该论文旨在解决目标检测与姿态估计(Pose Estimation)在实际应用中难以统一建模的问题,尤其在缺乏3D CAD模型时如何高效构建高保真对象表示。其解决方案的关键在于提出一个端到端的框架,首先通过传统3D CAD模型或快速重建的神经辐射场(NeRF)实现对象表征的灵活获取;随后采用CNOS检测器定位目标,再利用OPFormer模块进行精确的6D姿态估计。OPFormer的核心创新是基于Transformer架构的新型姿态估计模块,它通过联合编码多视角模板并引入Normalized Object Coordinate Space(NOCS)显式几何先验,从而学习全面的对象特征,并借助解码器建立鲁棒的2D-3D对应关系以输出最终姿态,显著提升了模型在有/无先验模型场景下的精度与效率。

链接: https://arxiv.org/abs/2511.12614
作者: Artem Moroz,Vít Zeman,Martin Mikšík,Elizaveta Isianova,Miroslav David,Pavel Burget,Varun Burde
机构: Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague (捷克技术大学信息学、机器人学与控制论研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:We introduce a unified, end-to-end framework that seamlessly integrates object detection and pose estimation with a versatile onboarding process. Our pipeline begins with an onboarding stage that generates object representations from either traditional 3D CAD models or, in their absence, by rapidly reconstructing a high-fidelity neural representation (NeRF) from multi-view images. Given a test image, our system first employs the CNOS detector to localize target objects. For each detection, our novel pose estimation module, OPFormer, infers the precise 6D pose. The core of OPFormer is a transformer-based architecture that leverages a foundation model for robust feature extraction. It uniquely learns a comprehensive object representation by jointly encoding multiple template views and enriches these features with explicit 3D geometric priors using Normalized Object Coordinate Space (NOCS). A decoder then establishes robust 2D-3D correspondences to determine the final pose. Evaluated on the challenging BOP benchmarks, our integrated system demonstrates a strong balance between accuracy and efficiency, showcasing its practical applicability in both model-based and model-free scenarios.
zh

[CV-173] Open-World Test-Time Adaptation with Hierarchical Feature Aggregation and Attention Affine

【速读】:该论文旨在解决测试时适应(Test-time Adaptation, TTA)中因遇到未见类别(out-of-distribution, OOD)样本而导致模型性能显著下降的问题。现有TTA方法在面对OOD样本时,常将其误判为已知类别(in-distribution, ID),不仅降低预测准确性,还会干扰适应过程,引发后续ID样本的错误累积。解决方案的关键在于提出一种分层梯式网络(Hierarchical Ladder Network, HLN),通过聚合Transformer各层的类别令牌(class tokens)提取OOD特征,并结合原模型预测与HLN输出进行加权概率融合以提升OOD检测性能;同时引入注意力仿射网络(Attention Affine Network, AAN)自适应调整自注意力机制,增强对领域偏移(domain shift)的鲁棒性,并采用加权熵机制动态抑制低置信度样本的影响,从而显著提升模型在存在领域漂移场景下的分类性能。

链接: https://arxiv.org/abs/2511.12607
作者: Ziqiong Liu,Yushun Tang,Junyang Ji,Zhihai He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Test-time adaptation (TTA) refers to adjusting the model during the testing phase to cope with changes in sample distribution and enhance the model’s adaptability to new environments. In real-world scenarios, models often encounter samples from unseen (out-of-distribution, OOD) categories. Misclassifying these as known (in-distribution, ID) classes not only degrades predictive accuracy but can also impair the adaptation process, leading to further errors on subsequent ID samples. Many existing TTA methods suffer substantial performance drops under such conditions. To address this challenge, we propose a Hierarchical Ladder Network that extracts OOD features from class tokens aggregated across all Transformer layers. OOD detection performance is enhanced by combining the original model prediction with the output of the Hierarchical Ladder Network (HLN) via weighted probability fusion. To improve robustness under domain shift, we further introduce an Attention Affine Network (AAN) that adaptively refines the self-attention mechanism conditioned on the token information to better adapt to domain drift, thereby improving the classification performance of the model on datasets with domain shift. Additionally, a weighted entropy mechanism is employed to dynamically suppress the influence of low-confidence samples during adaptation. Experimental results on benchmark datasets show that our method significantly improves the performance on the most widely used classification datasets.
zh

[CV-174] Pixels or Positions? Benchmarking Modalities in Group Activity Recognition

【速读】:该论文旨在解决群体活动识别(Group Activity Recognition, GAR)中缺乏标准化多模态基准的问题,尤其是如何在视频(pixel)与追踪数据(position)之间进行公平比较。现有研究多集中于视频模态,而基于球员位置和轨迹的追踪模态虽具紧凑性和空间交互显式编码优势,却未被充分探索。为填补这一空白,作者构建了SoccerNet-GAR数据集,同步并标注了2022年世界杯64场比赛中的94,285个群体活动实例,涵盖10类活动,并提出统一评估协议以对比两种单模态方法:基于视频的分类器与基于追踪数据的图神经网络分类器。其关键创新在于设计了一种角色感知的图结构架构,通过位置边(positional edges)和时间注意力机制直接建模战术结构,显著提升了性能——追踪模型达到67.2%的平衡准确率,优于最优视频基线(58.1%),且训练速度更快(快4.25倍)、参数量更少(仅197K vs. 86.3M)。该研究揭示了模态选择与角色感知建模对GAR的重要性。

链接: https://arxiv.org/abs/2511.12606
作者: Drishya Karki,Merey Ramazanova,Anthony Cioppa,Silvio Giancola,Bernard Ghanem
机构: King Abdullah University of Science and Technology (KAUST); University of Liège
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Group Activity Recognition (GAR) is well studied on the video modality for surveillance and indoor team sports (e.g., volleyball, basketball). Yet, other modalities such as agent positions and trajectories over time, i.e. tracking, remain comparatively under-explored despite being compact, agent-centric signals that explicitly encode spatial interactions. Understanding whether pixel (video) or position (tracking) modalities leads to better group activity recognition is therefore important to drive further research on the topic. However, no standardized benchmark currently exists that aligns broadcast video and tracking data for the same group activities, leading to a lack of apples-to-apples comparison between these modalities for GAR. In this work, we introduce SoccerNet-GAR, a multimodal dataset built from the 64 matches of the football World Cup 2022. Specifically, the broadcast videos and player tracking modalities for 94,285 group activities are synchronized and annotated with 10 categories. Furthermore, we define a unified evaluation protocol to benchmark two strong unimodal approaches: (i) a competitive video-based classifiers and (ii) a tracking-based classifiers leveraging graph neural networks. In particular, our novel role-aware graph architecture for tracking-based GAR directly encodes tactical structure through positional edges and temporal attention. Our tracking model achieves 67.2% balanced accuracy compared to 58.1% for the best video baseline, while training 4.25 \times faster with 438 \times fewer parameters ( 197K \vs 86.3M ). This study provides new insights into the relative strengths of pixels and positions for group activity recognition. Overall, it highlights the importance of modality choice and role-aware modeling for GAR.
zh

[CV-175] LoRA-Enhanced Vision Transformer for Single Image based Morphing Attack Detection via Knowledge Distillation from EfficientNet

【速读】:该论文旨在解决人脸识别系统(Face Recognition Systems, FRS)在面对合成图像攻击(即“混叠攻击”或morphing attacks)时的脆弱性问题,这类攻击通过生成融合多个个体生物特征的合成图像来欺骗系统。解决方案的关键在于提出一种基于教师-学生框架的单图像混叠攻击检测方法(Single-Image Morphing Attack Detection, S-MAD),其中使用CNN作为教师模型对ViT(Vision Transformer)学生模型进行知识蒸馏,并引入低秩适应(Low-Rank Adaptation, LoRA)技术以降低微调过程中的计算开销,从而在保持高检测准确率的同时显著提升效率。

链接: https://arxiv.org/abs/2511.12602
作者: Ria Shekhawat,Sushrut Patwardhan,Raghavendra Ramachandra,Praveen Kumar Chandaliya,Kishor P. Upla
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Face Recognition Systems (FRS) are critical for security but remain vulnerable to morphing attacks, where synthetic images blend biometric features from multiple individuals. We propose a novel Single-Image Morphing Attack Detection (S-MAD) approach using a teacher-student framework, where a CNN-based teacher model refines a ViT-based student model. To improve efficiency, we integrate Low-Rank Adaptation (LoRA) for fine-tuning, reducing computational costs while maintaining high detection accuracy. Extensive experiments are conducted on a morphing dataset built from three publicly available face datasets, incorporating ten different morphing generation algorithms to assess robustness. The proposed method is benchmarked against six state-of-the-art S-MAD techniques, demonstrating superior detection performance and computational efficiency.
zh

[CV-176] Seg-VAR: Image Segmentation with Visual Autoregressive Modeling NEURIPS2025

【速读】:该论文旨在解决视觉自回归模型(Visual Autoregressive Modeling, VAR)在图像分割任务中应用不足的问题,尤其是其在低层次空间感知精度方面的局限性。传统方法多采用判别式学习策略,难以有效建模实例间复杂的空间关系与细节信息。为此,作者提出Seg-VAR框架,将分割任务重新定义为条件自回归掩码生成问题,核心创新在于引入潜变量表示(seglat),通过三个关键组件实现:(1) 图像编码器生成输入图像的潜在先验;(2) 空间感知的seglat编码器利用位置敏感的颜色映射将分割掩码离散化为潜 token,以区分不同实例;(3) 解码器从这些潜变量中重建掩码。该方案通过多阶段训练策略逐步优化潜变量分布与图像特征的一致性,显著提升了分割性能,在多个基准上超越了先前判别式和生成式方法,为将自回归推理整合进空间感知视觉系统开辟了新路径。

链接: https://arxiv.org/abs/2511.12594
作者: Rongkun Zheng,Lu Qi,Xi Chen,Yi Wang,Kun Wang,Hengshuang Zhao
机构: The University of Hong Kong (香港大学); Insta360; Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Shanghai Innovation Institute; SenseTime Research (商汤研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2025, 22 pages

点击查看摘要

Abstract:While visual autoregressive modeling (VAR) strategies have shed light on image generation with the autoregressive models, their potential for segmentation, a task that requires precise low-level spatial perception, remains unexplored. Inspired by the multi-scale modeling of classic Mask2Former-based models, we propose Seg-VAR, a novel framework that rethinks segmentation as a conditional autoregressive mask generation problem. This is achieved by replacing the discriminative learning with the latent learning process. Specifically, our method incorporates three core components: (1) an image encoder generating latent priors from input images, (2) a spatial-aware seglat (a latent expression of segmentation mask) encoder that maps segmentation masks into discrete latent tokens using a location-sensitive color mapping to distinguish instances, and (3) a decoder reconstructing masks from these latents. A multi-stage training strategy is introduced: first learning seglat representations via image-seglat joint training, then refining latent transformations, and finally aligning image-encoder-derived latents with seglat distributions. Experiments show Seg-VAR outperforms previous discriminative and generative methods on various segmentation tasks and validation benchmarks. By framing segmentation as a sequential hierarchical prediction task, Seg-VAR opens new avenues for integrating autoregressive reasoning into spatial-aware vision systems. Code will be available at this https URL.
zh

[CV-177] Fine-Grained Representation for Lane Topology Reasoning

【速读】:该论文旨在解决自动驾驶中车道拓扑结构建模不精确的问题,尤其是在复杂车道结构下难以可靠地推断车道间连接关系的挑战。现有方法通常通过单一查询(query)表示每个车道,并基于车道特征相似性推断拓扑连通性,但在处理复杂场景时表现不佳。其解决方案的关键在于提出一种细粒度车道拓扑推理框架(TopoFG),该框架将鸟瞰图(BEV)特征到拓扑预测的过程分解为三个阶段:层级先验提取器(HPE)、区域聚焦解码器(RFD)和鲁棒边界点拓扑推理模块(RBTR)。其中,HPE融合全局空间先验(来自BEV掩码)与局部序列先验(来自车道关键点序列),构建细粒度查询以引导后续推理;RFD通过在感兴趣区域(RoI)采样参考点并引入交叉注意力机制优化查询表示;RBTR则基于边界点查询特征建模车道连接关系,并采用拓扑去噪策略提升匹配可靠性。该方法通过引入细粒度查询和去噪机制,显著提升了复杂车道结构的建模精度与拓扑预测可信度,在OpenLane-V2基准上实现了48.0%(SubsetA)和45.4%(SubsetB)的OLS指标新最优性能。

链接: https://arxiv.org/abs/2511.12590
作者: Guoqing Xu,Yiheng Li,Yang Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Precise modeling of lane topology is essential for autonomous driving, as it directly impacts navigation and control this http URL methods typically represent each lane with a single query and infer topological connectivity based on the similarity between lane this http URL, this kind of design struggles to accurately model complex lane structures, leading to unreliable topology this http URL this view, we propose a Fine-Grained lane topology reasoning framework (TopoFG).It divides the procedure from bird’s-eye-view (BEV) features to topology prediction via fine-grained queries into three phases, i.e., Hierarchical Prior Extractor (HPE), Region-Focused Decoder (RFD), and Robust Boundary-Point Topology Reasoning (RBTR).Specifically, HPE extracts global spatial priors from the BEV mask and local sequential priors from in-lane keypoint sequences to guide subsequent fine-grained query this http URL constructs fine-grained queries by integrating the spatial and sequential priors. It then samples reference points in RoI regions of the mask and applies cross-attention with BEV features to refine the query representations of each this http URL models lane connectivity based on boundary-point query features and further employs a topological denoising strategy to reduce matching this http URL integrating spatial and sequential priors into fine-grained queries and applying a denoising strategy to boundary-point topology reasoning, our method precisely models complex lane structures and delivers trustworthy topology this http URL experiments on the OpenLane-V2 benchmark demonstrate that TopoFG achieves new state-of-the-art performance, with an OLS of 48.0% on subsetA and 45.4% on subsetB.
zh

[CV-178] Rank-Aware Agglomeration of Foundation Models for Immunohistochemistry Image Cell Counting

【速读】:该论文旨在解决免疫组化(IHC)图像中细胞计数的准确性问题,尤其针对染色重叠、生物标志物染色不均以及细胞形态多样性带来的挑战。现有基于回归的方法虽能较好处理细胞重叠问题,但难以实现端到端的多类别细胞计数,且对基础模型(foundation models)的潜力挖掘不足。解决方案的关键在于提出一种秩感知聚合框架(Rank-aware Agglomeration Framework),其核心创新包括:1)设计秩感知教师选择策略(RATS),通过建模全局至局部图像块的排名来评估各教师模型的计数能力,并实现样本级教师选择,从而有效利用多个强基础模型的互补表征;2)引入细调阶段将多类细胞计数任务重构为视觉-语言对齐问题,利用结构化文本提示生成离散语义锚点,同时编码类别与数量信息,指导特定类别的密度图回归,显著提升重叠细胞的计数精度。该方法在12种IHC生物标志物和5种组织类型上均超越现有最先进方法,且与病理学家评估高度一致。

链接: https://arxiv.org/abs/2511.12588
作者: Zuqi Huang,Mengxin Tian,Huan Liu,Wentao Li,Baobao Liang,Jie Wu,Fang Yan,Zhaoqing Tang,Zhongyu Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate cell counting in immunohistochemistry (IHC) images is critical for quantifying protein expression and aiding cancer diagnosis. However, the task remains challenging due to the chromogen overlap, variable biomarker staining, and diverse cellular morphologies. Regression-based counting methods offer advantages over detection-based ones in handling overlapped cells, yet rarely support end-to-end multi-class counting. Moreover, the potential of foundation models remains largely underexplored in this paradigm. To address these limitations, we propose a rank-aware agglomeration framework that selectively distills knowledge from multiple strong foundation models, leveraging their complementary representations to handle IHC heterogeneity and obtain a compact yet effective student model, CountIHC. Unlike prior task-agnostic agglomeration strategies that either treat all teachers equally or rely on feature similarity, we design a Rank-Aware Teacher Selecting (RATS) strategy that models global-to-local patch rankings to assess each teacher’s inherent counting capacity and enable sample-wise teacher selection. For multi-class cell counting, we introduce a fine-tuning stage that reformulates the task as vision-language alignment. Discrete semantic anchors derived from structured text prompts encode both category and quantity information, guiding the regression of class-specific density maps and improving counting for overlapping cells. Extensive experiments demonstrate that CountIHC surpasses state-of-the-art methods across 12 IHC biomarkers and 5 tissue types, while exhibiting high agreement with pathologists’ assessments. Its effectiveness on HE-stained data further confirms the scalability of the proposed method.
zh

[CV-179] mpoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction

【速读】:该论文旨在解决长视频生成中难以同时保证视觉质量和时间连贯性的问题。其解决方案的关键在于提出TempoMaster框架,将长视频生成建模为逐帧率预测任务:首先生成低帧率片段作为整体视频的粗略蓝图,随后逐步提升帧率以细化视觉细节和运动连续性;在生成过程中,该框架在每个帧率层级内使用双向注意力机制,在不同帧率间执行自回归推理,从而实现长程时序一致性,并支持高效并行合成。

链接: https://arxiv.org/abs/2511.12578
作者: Yukuo Ma,Cong Liu,Junke Wang,Junqi Liu,Haibin Huang,Zuxuan Wu,Chi Zhang,Xuelong Li
机构: Fudan University (复旦大学); Institute of Artificial Intelligence (TeleAI), China Telecom (中国电信人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present TempoMaster, a novel framework that formulates long video generation as next-frame-rate prediction. Specifically, we first generate a low-frame-rate clip that serves as a coarse blueprint of the entire video sequence, and then progressively increase the frame rate to refine visual details and motion continuity. During generation, TempoMaster employs bidirectional attention within each frame-rate level while performing autoregression across frame rates, thus achieving long-range temporal coherence while enabling efficient and parallel synthesis. Extensive experiments demonstrate that TempoMaster establishes a new state-of-the-art in long video generation, excelling in both visual and temporal quality.
zh

[CV-180] Beyond Pixels: Semantic-aware Typographic Attack for Geo-Privacy Protection

【速读】:该论文旨在解决大型视觉语言模型(Large Visual Language Models, LVLMs)对用户地理隐私的潜在威胁问题,即LVLMs能够直接从社交媒体用户分享的图像中推断出其地理位置,导致意外的信息泄露。传统方法如对抗性图像扰动虽可提供保护,但往往需要较强的失真以对抗LVLMs,从而显著降低图像的视觉质量并削弱其共享价值。该研究提出的关键解决方案是采用语义感知的排版攻击(typographical attack),通过在图像内容之外添加欺骗性文本扩展来干扰地理定位推理,其核心在于识别并利用有效破坏位置推断的文本语义,并设计两阶段策略生成具有误导性的文本内容,从而在不明显损害图像可视性的情况下实现高效的地理隐私保护。

链接: https://arxiv.org/abs/2511.12575
作者: Jiayi Zhu,Yihao Huang,Yue Cao,Xiaojun Jia,Qing Guo,Felix Juefei-Xu,Geguang Pu,Bin Wang
机构: Hangzhou Institute of Technology, Xidian University, China (西安电子科技大学杭州研究院); National University of Singapore, Singapore (新加坡国立大学); Nanyang Technological University, Singapore (南洋理工大学); Nankai University, China (南开大学); New York University, USA (纽约大学); East China Normal University, China (华东师范大学); Hikvision Digital Technology Co., Ltd, China (海康威视数字技术股份有限公司); Shanghai Industrial Control Safety Innovation Tech. Co., Ltd, China (上海工业控制安全创新科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Visual Language Models (LVLMs) now pose a serious yet overlooked privacy threat, as they can infer a social media user’s geolocation directly from shared images, leading to unintended privacy leakage. While adversarial image perturbations provide a potential direction for geo-privacy protection, they require relatively strong distortions to be effective against LVLMs, which noticeably degrade visual quality and diminish an image’s value for sharing. To overcome this limitation, we identify typographical attacks as a promising direction for protecting geo-privacy by adding text extension outside the visual content. We further investigate which textual semantics are effective in disrupting geolocation inference and design a two-stage, semantics-aware typographical attack that generates deceptive text to protect user privacy. Extensive experiments across three datasets demonstrate that our approach significantly reduces geolocation prediction accuracy of five state-of-the-art commercial LVLMs, establishing a practical and visually-preserving protection strategy against emerging geo-privacy threats.
zh

[CV-181] hrough-Foliage Surface-Temperature Reconstruction for early Wildfire Detection

【速读】:该论文旨在解决在森林植被遮挡条件下,利用无人机进行地表温度重建的难题,以实现对地表火灾的早期检测(即在烟雾或火焰可见之前)。传统合成孔径(SA)热成像虽能缓解树冠遮挡和阳光干扰问题,但引入了热模糊效应,导致真实地表温度信号被掩盖。解决方案的关键在于构建一个视觉状态空间模型(visual state space model),通过训练该模型从模糊数据中恢复部分遮挡土壤和火点的微弱热信号;同时,为克服真实训练数据稀缺的问题,研究者创新性地将潜在扩散模型(latent diffusion model)与矢量量化技术结合,生成大量基于真实野火记录的逼真地表温度模拟数据,并通过温度增强和程序化热森林仿真进一步扩展数据集。实验表明,该方法在多种环境条件下显著降低均方根误差(RMSE),尤其在高温热点识别中表现优异,且具备良好的泛化能力,可应用于搜救场景中的人体热信号识别。

链接: https://arxiv.org/abs/2511.12572
作者: Mohamed Youssef,Lukas Brunner,Klaus Rundhammer,Gerald Czech,Oliver Bimber
机构: Johannes Kepler University Linz (约翰·开普勒林茨大学); Profactor GmbH (普罗法克托有限公司); FFAGATHA GmbH (FFAGATHA有限公司); OÖ Energie- und Forschungs GmbH (奥地利上奥地利能源与研究公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce a novel method for reconstructing surface temperatures through occluding forest vegetation by combining signal processing and machine learning. Our goal is to enable fully automated aerial wildfire monitoring using autonomous drones, allowing for the early detection of ground fires before smoke or flames are visible. While synthetic aperture (SA) sensing mitigates occlusion from the canopy and sunlight, it introduces thermal blur that obscures the actual surface temperatures. To address this, we train a visual state space model to recover the subtle thermal signals of partially occluded soil and fire hotspots from this blurred data. A key challenge was the scarcity of real-world training data. We overcome this by integrating a latent diffusion model into a vector quantized to generated a large volume of realistic surface temperature simulations from real wildfire recordings, which we further expanded through temperature augmentation and procedural thermal forest simulation. On simulated data across varied ambient and surface temperatures, forest densities, and sunlight conditions, our method reduced the RMSE by a factor of 2 to 2.5 compared to conventional thermal and uncorrected SA imaging. In field experiments focused on high-temperature hotspots, the improvement was even more significant, with a 12.8-fold RMSE gain over conventional thermal and a 2.6-fold gain over uncorrected SA images. We also demonstrate our model’s generalization to other thermal signals, such as human signatures for search and rescue. Since simple thresholding is frequently inadequate for detecting subtle thermal signals, the morphological characteristics are equally essential for accurate classification. Our experiments demonstrated another clear advantage: we reconstructed the complete morphology of fire and human signatures, whereas conventional imaging is defeated by partial occlusion.
zh

[CV-182] Linear time small coresets for k-mean clustering of segments with applications

【速读】:该论文致力于解决段集 k-均值聚类问题(segment k-means clustering),即给定 Rd\mathbb{R}^d 中的一组 nn 条线段 S\mathcal{S},寻找 kk 个中心 XRdX \subseteq \mathbb{R}^d,使得目标函数 D(S,X):=SSminxXD(S,x)D(\mathcal{S}, X) := \sum_{S \in \mathcal{S}} \min_{x \in X} D(S, x) 最小化,其中 D(S,x)=pSpxdpD(S, x) = \int_{p \in S} |p - x| \, dp 表示从线段上所有点到中心 xx 的积分距离。该问题在计算机视觉、轨迹分析和实时视频跟踪等场景中具有重要应用价值。解决方案的关键在于提出首个可证明有效的 ε\varepsilon-coreset 构造方法,能够处理任意输入线段,且对于常数 kkε\varepsilon,其 coreset 大小为 O(log2n)O(\log^2 n),可在 O(nd)O(nd) 时间内完成计算,从而显著提升大规模数据下的聚类效率,并通过实验验证了其在保持高精度的同时实现显著加速的实用性与理论保证。

链接: https://arxiv.org/abs/2511.12564
作者: David Denisov,Shlomi Dolev,Dan Felmdan,Michael Segal
机构: 未知
类目: Machine Learning (cs.LG); Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV)
备注: First published in WALCOM 2026 by Springer Nature

点击查看摘要

Abstract:We study the k -means problem for a set \mathcalS \subseteq \mathbbR^d of n segments, aiming to find k centers X \subseteq \mathbbR^d that minimize D(\mathcalS,X) := \sum_S \in \mathcalS \min_x \in X D(S,x) , where D(S,x) := \int_p \in S |p - x| dp measures the total distance from each point along a segment to a center. Variants of this problem include handling outliers, employing alternative distance functions such as M-estimators, weighting distances to achieve balanced clustering, or enforcing unique cluster assignments. For any \varepsilon 0 , an \varepsilon -coreset is a weighted subset C \subseteq \mathbbR^d that approximates D(\mathcalS,X) within a factor of 1 \pm \varepsilon for any set of k centers, enabling efficient streaming, distributed, or parallel computation. We propose the first coreset construction that provably handles arbitrary input segments. For constant k and \varepsilon , it produces a coreset of size O(\log^2 n) computable in O(nd) time. Experiments, including a real-time video tracking application, demonstrate substantial speedups with minimal loss in clustering accuracy, confirming both the practical efficiency and theoretical guarantees of our method. Comments: First published in WALCOM 2026 by Springer Nature Subjects: Machine Learning (cs.LG); Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2511.12564 [cs.LG] (or arXiv:2511.12564v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.12564 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-183] SEMC: Structure-Enhanced Mixture-of-Experts Contrastive Learning for Ultrasound Standard Plane Recognition AAAI2026

【速读】:该论文旨在解决超声标准切面识别中对浅层结构信息利用不足以及通过图像增强生成的对比样本难以捕捉细粒度语义差异的问题,导致模型在结构细节和类别判别性特征上的识别性能受限。其解决方案的关键在于提出一种结构增强的混合专家对比学习框架(SEMC),核心创新包括:1)设计语义-结构融合模块(SSFM),通过多尺度结构信息建模与浅层-深层特征对齐,增强模型对细粒度结构细节的感知能力;2)构建混合专家对比识别模块(MCRM),基于混合专家(MoE)机制在多层次特征上执行分层对比学习与分类,提升类别可分性和识别精度。

链接: https://arxiv.org/abs/2511.12559
作者: Qing Cai,Guihao Yan,Fan Zhang,Cheng Zhang,Zhi Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Ultrasound standard plane recognition is essential for clinical tasks such as disease screening, organ evaluation, and biometric measurement. However, existing methods fail to effectively exploit shallow structural information and struggle to capture fine-grained semantic differences through contrastive samples generated by image augmentations, ultimately resulting in suboptimal recognition of both structural and discriminative details in ultrasound standard planes. To address these issues, we propose SEMC, a novel Structure-Enhanced Mixture-of-Experts Contrastive learning framework that combines structure-aware feature fusion with expert-guided contrastive learning. Specifically, we first introduce a novel Semantic-Structure Fusion Module (SSFM) to exploit multi-scale structural information and enhance the model’s ability to perceive fine-grained structural details by effectively aligning shallow and deep features. Then, a novel Mixture-of-Experts Contrastive Recognition Module (MCRM) is designed to perform hierarchical contrastive learning and classification across multi-level features using a mixture-of-experts (MoE) mechanism, further improving class separability and recognition performance. More importantly, we also curate a large-scale and meticulously annotated liver ultrasound dataset containing six standard planes. Extensive experimental results on our in-house dataset and two public datasets demonstrate that SEMC outperforms recent state-of-the-art methods across various metrics.
zh

[CV-184] EmoVerse: A MLLM s-Driven Emotion Representation Dataset for Interpretable Visual Emotion Analysis CVPR2026

【速读】:该论文旨在解决视觉情绪分析(Visual Emotion Analysis, VEA)领域中因缺乏开源且可解释数据集而导致的研究进展受限问题。现有方法通常为整张图像分配单一离散情绪标签,难以揭示视觉元素如何协同作用于情绪生成。解决方案的关键在于提出EmoVerse——一个大规模开源数据集,其核心创新是采用受知识图谱启发的多层标注机制,将情绪分解为背景-属性-主体(Background-Attribute-Subject, B-A-S)三元组,并将其与图像中的具体视觉区域进行对齐,从而实现词级和主体级的情绪推理。此外,EmoVerse同时包含分类情绪状态(Categorical Emotion States, CES)与维度情绪空间(Dimensional Emotion Space, DES)双标注体系,支持离散与连续情绪表示的统一建模;配合新颖的多阶段标注流程以最小人力成本保障高可靠性,并构建了一个可解释模型,将视觉线索映射至DES表征并提供细粒度归因解释,共同构成推动可解释高阶情绪理解的完整基础。

链接: https://arxiv.org/abs/2511.12554
作者: Yijie Guo,Dexiang Hong,Weidong Chen,Zihan She,Cheng Ye,Xiaojun Chang,Zhendong Mao
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 7 figures. This is a preprint version of a paper submitted to CVPR 2026

点击查看摘要

Abstract:Visual Emotion Analysis (VEA) aims to bridge the affective gap between visual content and human emotional responses. Despite its promise, progress in this field remains limited by the lack of open-source and interpretable datasets. Most existing studies assign a single discrete emotion label to an entire image, offering limited insight into how visual elements contribute to emotion. In this work, we introduce EmoVerse, a large-scale open-source dataset that enables interpretable visual emotion analysis through multi-layered, knowledge-graph-inspired annotations. By decomposing emotions into Background-Attribute-Subject (B-A-S) triplets and grounding each element to visual regions, EmoVerse provides word-level and subject-level emotional reasoning. With over 219k images, the dataset further includes dual annotations in Categorical Emotion States (CES) and Dimensional Emotion Space (DES), facilitating unified discrete and continuous emotion representation. A novel multi-stage pipeline ensures high annotation reliability with minimal human effort. Finally, we introduce an interpretable model that maps visual cues into DES representations and provides detailed attribution explanations. Together, the dataset, pipeline, and model form a comprehensive foundation for advancing explainable high-level emotion understanding.
zh

[CV-185] HiGFA: Hierarchical Guidance for Fine-grained Data Augmentation with Diffusion Models

【速读】:该论文旨在解决生成式扩散模型(Generative Diffusion Models)在细粒度视觉分类(Fine-Grained Visual Classification, FGVC)任务中应用时的关键挑战:如何确保合成图像准确捕捉类别定义的细微特征,以维持高保真度。标准方法如基于文本的无分类器引导(Classifier-Free Guidance, CFG)因缺乏足够特异性,常生成误导性样本,反而降低分类器性能。解决方案的核心在于提出分层引导的细粒度增强方法(Hierarchically Guided Fine-grained Augmentation, HiGFA),其关键机制是利用扩散采样过程的时间动态特性,在早期至中期阶段采用固定强度的文本与形变轮廓引导以建立整体场景、风格和结构;在最终阶段激活专门的细粒度分类器引导,并根据预测置信度动态调节所有引导信号的强度,从而实现全局结构构建与细节精修之间的智能平衡,生成多样且忠实的合成图像。

链接: https://arxiv.org/abs/2511.12547
作者: Zhiguang Lu,Qianqian Xu,Peisong Wen,Siran Da,Qingming Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generative diffusion models show promise for data augmentation. However, applying them to fine-grained tasks presents a significant challenge: ensuring synthetic images accurately capture the subtle, category-defining features critical for high fidelity. Standard approaches, such as text-based Classifier-Free Guidance (CFG), often lack the required specificity, potentially generating misleading examples that degrade fine-grained classifier performance. To address this, we propose Hierarchically Guided Fine-grained Augmentation (HiGFA). HiGFA leverages the temporal dynamics of the diffusion sampling process. It employs strong text and transformed contour guidance with fixed strengths in the early-to-mid sampling stages to establish overall scene, style, and structure. In the final sampling stages, HiGFA activates a specialized fine-grained classifier guidance and dynamically modulates the strength of all guidance signals based on prediction confidence. This hierarchical, confidence-driven orchestration enables HiGFA to generate diverse yet faithful synthetic images by intelligently balancing global structure formation with precise detail refinement. Experiments on several FGVC datasets demonstrate the effectiveness of HiGFA.
zh

[CV-186] ReaSon : Reinforced Causal Search with Information Bottleneck for Video Understanding AAAI2026 AAAI26

【速读】:该论文旨在解决视频理解中因输入token数量受限及视频帧间信息时序稀疏性导致的关键帧选择问题,其核心挑战在于如何选出既具有预测充分性(predictive sufficiency)又具备因果必要性(causal necessity)的代表性关键帧。解决方案的关键在于提出ReaSon框架,该框架通过引入一种新颖的因果信息瓶颈(Causal Information Bottleneck, CIB)将关键帧选择建模为优化问题:首先利用可学习的策略网络从视觉相关候选帧池中选取满足预测充分性的关键帧,随后通过反事实干预(counterfactual interventions)评估因果必要性,最终设计一个与CIB原则对齐的复合奖励信号,驱动策略网络通过强化学习进行端到端优化。

链接: https://arxiv.org/abs/2511.12530
作者: Yuan Zhou,Litao Hua,Shilong Jin,Wentao Huang,Haoran Duan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2026. Code is available at: this https URL

点击查看摘要

Abstract:Keyframe selection has become essential for video understanding with vision-language models (VLMs) due to limited input tokens and the temporal sparsity of relevant information across video frames. Video understanding often relies on effective keyframes that are not only informative but also causally decisive. To this end, we propose Reinforced Causal Search with Information Bottleneck (ReaSon), a framework that formulates keyframe selection as an optimization problem with the help of a novel Causal Information Bottleneck (CIB), which explicitly defines keyframes as those satisfying both predictive sufficiency and causal necessity. Specifically, ReaSon employs a learnable policy network to select keyframes from a visually relevant pool of candidate frames to capture predictive sufficiency, and then assesses causal necessity via counterfactual interventions. Finally, a composite reward aligned with the CIB principle is designed to guide the selection policy through reinforcement learning. Extensive experiments on NExT-QA, EgoSchema, and Video-MME demonstrate that ReaSon consistently outperforms existing state-of-the-art methods under limited-frame settings, validating its effectiveness and generalization ability.
zh

[CV-187] D2-VPR: A Parameter-efficient Visual-foundation-model-based Visual Place Recognition Method via Knowledge Distillation and Deformable Aggregation

【速读】:该论文旨在解决视觉定位(Visual Place Recognition, VPR)中基于视觉基础模型(如DINOv2)的性能提升与模型复杂度高、计算开销大之间的矛盾问题,即如何在保持强特征提取能力的同时降低参数量和计算成本,以适应资源受限设备的部署需求。解决方案的关键在于提出一种基于知识蒸馏(knowledge distillation)和可变形机制(deformable mechanism)的框架——D²-VPR:首先采用两阶段训练策略融合知识蒸馏与微调,并引入Distillation Recovery Module(DRM)优化师生模型间的特征空间对齐,减少知识迁移损失;其次设计Top-Down-attention-based Deformable Aggregator(TDDA),利用全局语义信息动态调整感兴趣区域(Region of Interest, ROI)进行聚合,增强对不规则结构的适应性。实验表明,该方法在保持竞争力性能的同时,参数量减少约64.2%,浮点运算次数(FLOPs)降低约62.6%。

链接: https://arxiv.org/abs/2511.12528
作者: Zheyuan Zhang,Jiwei Zhang,Boyu Zhou,Linzhimeng Duan,Hong Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual Place Recognition (VPR) aims to determine the geographic location of a query image by retrieving its most visually similar counterpart from a geo-tagged reference database. Recently, the emergence of the powerful visual foundation model, DINOv2, trained in a self-supervised manner on massive datasets, has significantly improved VPR performance. This improvement stems from DINOv2’s exceptional feature generalization capabilities but is often accompanied by increased model complexity and computational overhead that impede deployment on resource-constrained devices. To address this challenge, we propose D^2 -VPR, a D istillation- and D eformable-based framework that retains the strong feature extraction capabilities of visual foundation models while significantly reducing model parameters and achieving a more favorable performance-efficiency trade-off. Specifically, first, we employ a two-stage training strategy that integrates knowledge distillation and fine-tuning. Additionally, we introduce a Distillation Recovery Module (DRM) to better align the feature spaces between the teacher and student models, thereby minimizing knowledge transfer losses to the greatest extent possible. Second, we design a Top-Down-attention-based Deformable Aggregator (TDDA) that leverages global semantic features to dynamically and adaptively adjust the Regions of Interest (ROI) used for aggregation, thereby improving adaptability to irregular structures. Extensive experiments demonstrate that our method achieves competitive performance compared to state-of-the-art approaches. Meanwhile, it reduces the parameter count by approximately 64.2% and FLOPs by about 62.6% (compared to CricaVPR).Code is available at this https URL.
zh

[CV-188] MdaIF: Robust One-Stop Multi-Degradation-Aware Image Fusion with Language-Driven Semantics AAAI2026

【速读】:该论文旨在解决红外与可见光图像融合中因恶劣天气导致的可见光图像退化问题,以及现有方法依赖固定网络结构、难以适应多种退化场景的局限性。其解决方案的关键在于提出一种基于大语言模型(Large Language Model, LLM)驱动的统一退化感知图像融合框架(MdaIF),通过引入混合专家(Mixture-of-Experts, MoE)系统以应对多退化场景,并利用预训练视觉-语言模型(Vision-Language Model, VLM)提取气象感知的语义先验(semantic prior),进而设计退化感知通道注意力模块(Degradation-aware Channel Attention Module, DCAM),实现跨模态特征在通道域中的自适应交互;同时,结合语义先验与通道调制特征引导专家路由机制,显著提升复杂退化条件下的融合性能。

链接: https://arxiv.org/abs/2511.12525
作者: Jing Li,Yifan Wang,Jiafeng Yan,Renlong Zhang,Bin Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 7 figures. Accepted by AAAI 2026

点击查看摘要

Abstract:Infrared and visible image fusion aims to integrate complementary multi-modal information into a single fused result. However, existing methods 1) fail to account for the degradation visible images under adverse weather conditions, thereby compromising fusion performance; and 2) rely on fixed network architectures, limiting their adaptability to diverse degradation scenarios. To address these issues, we propose a one-stop degradation-aware image fusion framework for multi-degradation scenarios driven by a large language model (MdaIF). Given the distinct scattering characteristics of different degradation scenarios (e.g., haze, rain, and snow) in atmospheric transmission, a mixture-of-experts (MoE) system is introduced to tackle image fusion across multiple degradation scenarios. To adaptively extract diverse weather-aware degradation knowledge and scene feature representations, collectively referred to as the semantic prior, we employ a pre-trained vision-language model (VLM) in our framework. Guided by the semantic prior, we propose degradation-aware channel attention module (DCAM), which employ degradation prototype decomposition to facilitate multi-modal feature interaction in channel domain. In addition, to achieve effective expert routing, the semantic prior and channel-domain modulated features are utilized to guide the MoE, enabling robust image fusion in complex degradation scenarios. Extensive experiments validate the effectiveness of our MdaIF, demonstrating superior performance over SOTA methods.
zh

[CV-189] DINO-Detect: A Simple yet Effective Framework for Blur-Robust AI-Generated Image Detection

【速读】:该论文旨在解决AI生成图像(AIGI)检测模型在真实场景下因运动模糊(motion blur)导致性能显著下降的问题。运动模糊常见于手持拍摄、快速运动及压缩视频中,会破坏图像的高频特征并抑制判别性纹理,使现有检测器失效。解决方案的关键在于提出一种基于教师-学生知识蒸馏(teacher-student knowledge distillation)的鲁棒框架:利用高容量教师模型(DINOv3)在清晰图像上学习到的语义丰富且稳定的特征表示作为参考,通过冻结教师网络以保持其泛化能力,将教师在清晰图像上的特征与logit响应蒸馏给在模糊图像上训练的学生模型,从而使其在运动模糊条件下仍能输出一致的表征,实现对AIGI的准确检测。

链接: https://arxiv.org/abs/2511.12511
作者: Jialiang Shen,Jiyang Zheng,Yunqi Xue,Huajie Chen,Yu Yao,Hui Kang,Ruiqi Liu,Helin Gong,Yang Yang,Dadong Wang,Tongliang Liu
机构: Baidu Inc.(百度公司); SAIC, The University of Sydney (悉尼大学); CSIRO, Data61; Shanghai Jiao Tong University (上海交通大学); City University of Macau (澳门城市大学); CASIA (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 5 figures

点击查看摘要

Abstract:With growing concerns over image authenticity and digital safety, the field of AI-generated image (AIGI) detection has progressed rapidly. Yet, most AIGI detectors still struggle under real-world degradations, particularly motion blur, which frequently occurs in handheld photography, fast motion, and compressed video. Such blur distorts fine textures and suppresses high-frequency artifacts, causing severe performance drops in real-world settings. We address this limitation with a blur-robust AIGI detection framework based on teacher-student knowledge distillation. A high-capacity teacher (DINOv3), trained on clean (i.e., sharp) images, provides stable and semantically rich representations that serve as a reference for learning. By freezing the teacher to maintain its generalization ability, we distill its feature and logit responses from sharp images to a student trained on blurred counterparts, enabling the student to produce consistent representations under motion degradation. Extensive experiments benchmarks show that our method achieves state-of-the-art performance under both motion-blurred and clean conditions, demonstrating improved generalization and real-world applicability. Source codes will be released at: this https URL.
zh

[CV-190] Visible Structure Retrieval for Lightweight Image-Based Relocalisation BMVC2025

【速读】:该论文旨在解决大规模场景中基于结构的相机位姿估计(camera pose estimation)在计算效率和存储开销方面的挑战。传统方法依赖图像检索或搜索启发式策略来缩小2D-3D对应点匹配的搜索空间,但这些方法往往导致复杂流水线或随历史观测数量增长而显著增加的存储需求。解决方案的关键在于提出一种新的范式:通过训练一个紧凑的神经网络——可见结构检索网络(visible structure retrieval network),直接从查询图像映射到可观测的场景结构子集,从而无需图像检索或启发式搜索即可高效定位当前视角对应的3D结构点。该方法在保持与最先进水平相当的定位精度的同时,显著降低了计算和存储开销。

链接: https://arxiv.org/abs/2511.12503
作者: Fereidoon Zangeneh,Leonard Bruns,Amit Dekel,Alessandro Pieropan,Patric Jensfelt
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at BMVC 2025

点击查看摘要

Abstract:Accurate camera pose estimation from an image observation in a previously mapped environment is commonly done through structure-based methods: by finding correspondences between 2D keypoints on the image and 3D structure points in the map. In order to make this correspondence search tractable in large scenes, existing pipelines either rely on search heuristics, or perform image retrieval to reduce the search space by comparing the current image to a database of past observations. However, these approaches result in elaborate pipelines or storage requirements that grow with the number of past observations. In this work, we propose a new paradigm for making structure-based relocalisation tractable. Instead of relying on image retrieval or search heuristics, we learn a direct mapping from image observations to the visible scene structure in a compact neural network. Given a query image, a forward pass through our novel visible structure retrieval network allows obtaining the subset of 3D structure points in the map that the image views, thus reducing the search space of 2D-3D correspondences. We show that our proposed method enables performing localisation with an accuracy comparable to the state of the art, while requiring lower computational and storage footprint.
zh

[CV-191] BSO: Binary Spiking Online Optimization Algorithm

【速读】:该论文旨在解决二值脉冲神经网络(Binary Spiking Neural Networks, BSNNs)在训练过程中因潜在权重存储和时间序列处理需求而导致的高内存开销问题。解决方案的关键在于提出一种名为 Binary Spiking Online (BSO) 的在线训练算法,该算法通过翻转信号(flip signals)直接更新权重,当梯度动量与权重的乘积超过阈值时触发,从而无需在训练期间存储潜在权重;进一步地,为提升性能,作者还设计了时序感知变体 T-BSO,利用 BSNN 固有的时序动态特性,跨时间步捕捉梯度信息以自适应调整阈值,理论上证明了 BSO 和 T-BSO 的收敛性,并在实验中验证其优于现有 BSNN 训练方法。

链接: https://arxiv.org/abs/2511.12502
作者: Yu Liang,Yu Yang,Wenjie Wei,Ammar Belatreche,Shuai Wang,Malu Zhang,Yang Yang
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Binary Spiking Neural Networks (BSNNs) offer promising efficiency advantages for resource-constrained computing. However, their training algorithms often require substantial memory overhead due to latent weights storage and temporal processing requirements. To address this issue, we propose Binary Spiking Online (BSO) optimization algorithm, a novel online training algorithm that significantly reduces training memory. BSO directly updates weights through flip signals under the online training framework. These signals are triggered when the product of gradient momentum and weights exceeds a threshold, eliminating the need for latent weights during training. To enhance performance, we propose T-BSO, a temporal-aware variant that leverages the inherent temporal dynamics of BSNNs by capturing gradient information across time steps for adaptive threshold adjustment. Theoretical analysis establishes convergence guarantees for both BSO and T-BSO, with formal regret bounds characterizing their convergence rates. Extensive experiments demonstrate that both BSO and T-BSO achieve superior optimization performance compared to existing training methods for BSNNs. The codes are available at this https URL.
zh

[CV-192] owards Temporal Fusion Beyond the Field of View for Camera-based Semantic Scene Completion AAAI2026

【速读】:该论文旨在解决当前基于摄像头的3D语义场景补全(3D Semantic Scene Completion, SSC)方法在利用时序信息时存在的局限性,即现有方法主要关注提升当前帧内的特征表示,而难以有效重建车辆周边未被当前视角覆盖的“隐藏区域”(hidden regions),尽管历史帧通常包含这些区域的重要上下文信息。解决方案的关键在于提出一种以当前帧为中心的时序融合模块(Current-Centric Contextual 3D Fusion, C3DFusion),其通过显式对齐当前帧与历史帧中3D升维后的点特征,实现更鲁棒的时空特征融合;具体包含两个互补机制:历史上下文模糊化(historical context blurring)当前帧特征密集化(current-centric feature densification),前者通过降低历史点特征的尺度来抑制误匹配噪声,后者通过增强当前点特征在体素空间中的贡献来提升关键区域的重建质量。

链接: https://arxiv.org/abs/2511.12498
作者: Jongseong Bae,Junwoo Ha,Jinnyeong Heo,Yeongin Lee,Ha Young Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2026

点击查看摘要

Abstract:Recent camera-based 3D semantic scene completion (SSC) methods have increasingly explored leveraging temporal cues to enrich the features of the current frame. However, while these approaches primarily focus on enhancing in-frame regions, they often struggle to reconstruct critical out-of-frame areas near the sides of the ego-vehicle, although previous frames commonly contain valuable contextual information about these unseen regions. To address this limitation, we propose the Current-Centric Contextual 3D Fusion (C3DFusion) module, which generates hidden region-aware 3D feature geometry by explicitly aligning 3D-lifted point features from both current and historical frames. C3DFusion performs enhanced temporal fusion through two complementary techniques-historical context blurring and current-centric feature densification-which suppress noise from inaccurately warped historical point features by attenuating their scale, and enhance current point features by increasing their volumetric contribution. Simply integrated into standard SSC architectures, C3DFusion demonstrates strong effectiveness, significantly outperforming state-of-the-art methods on the SemanticKITTI and SSCBench-KITTI-360 datasets. Furthermore, it exhibits robust generalization, achieving notable performance gains when applied to other baseline models.
zh

[CV-193] MaskAnyNet: Rethinking Masked Image Regions as Valuable Information in Supervised Learning

【速读】:该论文旨在解决传统监督学习中图像掩码(image masking)存在的两个关键问题:一是被掩码的像素未被充分利用,导致有价值的情境信息丢失;二是掩码可能移除细粒度任务中的小尺度或关键特征。解决方案的核心在于重新审视掩码区域的作用,将其视为辅助知识而非忽略对象,并提出MaskAnyNet框架,通过引入再学习机制(relearning mechanism)联合利用可见与掩码区域的信息,从而增强特征的语义多样性并保留细粒度细节。该方法可轻松扩展至任意模型结构(如CNN或Transformer),仅需增加一个分支以共同学习重构后的掩码区域,实验证明其在多个基准上均带来一致性能提升。

链接: https://arxiv.org/abs/2511.12480
作者: Jingshan Hong,Haigen Hu,Huihuang Zhang,Qianwei Zhou,Zhao Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In supervised learning, traditional image masking faces two key issues: (i) discarded pixels are underutilized, leading to a loss of valuable contextual information; (ii) masking may remove small or critical features, especially in fine-grained tasks. In contrast, masked image modeling (MIM) has demonstrated that masked regions can be reconstructed from partial input, revealing that even incomplete data can exhibit strong contextual consistency with the original image. This highlights the potential of masked regions as sources of semantic diversity. Motivated by this, we revisit the image masking approach, proposing to treat masked content as auxiliary knowledge rather than ignored. Based on this, we propose MaskAnyNet, which combines masking with a relearning mechanism to exploit both visible and masked information. It can be easily extended to any model with an additional branch to jointly learn from the recomposed masked region. This approach leverages the semantic diversity of the masked regions to enrich features and preserve fine-grained details. Experiments on CNN and Transformer backbones show consistent gains across multiple benchmarks. Further analysis confirms that the proposed method improves semantic diversity through the reuse of masked content.
zh

[CV-194] MOON2.0: Dynamic Modality-balanced Multimodal Representation Learning for E-commerce Product Understanding

【速读】:该论文旨在解决电商场景中多模态大语言模型(Multimodal Large Language Models, MLLMs)在产品理解任务中存在的三大挑战:(i) 混合模态训练导致的模态不平衡问题;(ii) 产品内部视觉与文本信息间内在对齐关系利用不足;(iii) 对电商多模态数据中噪声处理能力有限。其解决方案的关键在于提出MOON2.0框架,包含三个核心机制:(1) 基于模态驱动的专家混合(Modality-driven Mixture-of-Experts, MoE)模块,通过动态识别输入样本的模态组成实现多模态联合学习以缓解模态不平衡;(2) 双层次对齐(Dual-level Alignment)方法,强化单个产品内视觉与文本语义的一致性建模;(3) 基于MLLM的图文协同增强策略(Image-text Co-augmentation)结合动态样本过滤机制,提升训练数据质量。该方案在MBE2.0基准上实现了最先进的零样本性能,并通过注意力热力图可视化验证了多模态对齐能力的显著改善。

链接: https://arxiv.org/abs/2511.12449
作者: Zhanheng Nie,Chenghan Fu,Daoze Zhang,Junxian Wu,Wanxian Guan,Pengjie Wang,Jian Xu,Bo Zheng
机构: Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 11 pages, 7 figures

点击查看摘要

Abstract:The rapid growth of e-commerce calls for multimodal models that comprehend rich visual and textual product information. Although recent multimodal large language models (MLLMs) for product understanding exhibit strong capability in representation learning for e-commerce, they still face three challenges: (i) the modality imbalance induced by modality mixed training; (ii) underutilization of the intrinsic alignment relationships among visual and textual information within a product; and (iii) limited handling of noise in e-commerce multimodal data. To address these, we propose MOON2.0, a dynamic modality-balanced multimodal representation learning framework for e-commerce product understanding. MOON2.0 comprises: (1) a Modality-driven Mixture-of-Experts (MoE) module that adaptively processes input samples by their modality composition, enabling Multimodal Joint Learning to mitigate the modality imbalance; (2) a Dual-level Alignment method to better leverage semantic alignment properties inside individual products; and (3) an MLLM-based Image-text Co-augmentation strategy that integrates textual enrichment with visual expansion, coupled with Dynamic Sample Filtering to improve training data quality. We further introduce MBE2.0, a co-augmented multimodal representation benchmark for e-commerce representation learning and evaluation. Experiments show that MOON2.0 delivers state-of-the-art zero-shot performance on MBE2.0 and multiple public datasets. Furthermore, attention-based heatmap visualization provides qualitative evidence of improved multimodal alignment of MOON2.0.
zh

[CV-195] CoTBox-TTT: Grounding Medical VQA with Visual Chain-of-Thought Boxes During Test-time Training

【速读】:该论文旨在解决医学视觉问答(Medical Visual Question Answering, VQA)系统在域偏移(domain shift)下可靠性下降的问题,特别是模型容易关注无关区域、且在部署阶段难以进行重新训练或获取额外标注标签时导致的答案缺乏图像证据支撑(weakly grounded in image evidence)。解决方案的关键在于提出一种证据优先的测试时训练方法(evidence-first test-time training, CoTBox-TTT),其核心机制是在推理阶段仅更新少量连续软提示(continuous soft prompts),通过视觉链式思维(visual chain-of-thought)信号识别与问题相关的图像区域,并强制原始图像与局部裁剪图像之间的答案一致性,从而提升模型对真实证据的依赖性。该方法无需标签,可无缝适配多种视觉-语言模型骨干网络(backbones),实验证明其在路径学VQA(pathVQA)数据集上使LLaVA的闭合式问答准确率提升12.3%,具备实际部署可行性。

链接: https://arxiv.org/abs/2511.12446
作者: Jiahe Qian,Yuhao Shen,Zhangtianyi Chen,Juexiao Zhou,Peisong Wang
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Data Science, The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)数据科学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical visual question answering could support clinical decision making, yet current systems often fail under domain shift and produce answers that are weakly grounded in image evidence. This reliability gap arises when models attend to spurious regions and when retraining or additional labels are impractical at deployment time. We address this setting with CoTBox-TTT, an evidence-first test-time training approach that adapts a vision-language model at inference while keeping all backbones frozen. The method updates only a small set of continuous soft prompts. It identifies question-relevant regions through a visual chain-of-thought signal and encourages answer consistency across the original image and a localized crop. The procedure is label free, and plug and play with diverse backbones. Experiments on medical VQA show that the approach is practical for real deployments. For instance, adding CoTBox-TTT to LLaVA increases closed-ended accuracy by 12.3% on pathVQA.
zh

[CV-196] Real-Time Drivers Drowsiness Detection and Analysis through Deep Learning

【速读】:该论文旨在解决驾驶员在长时间驾驶过程中因疲劳导致的注意力下降甚至昏睡问题,此类状况可能引发严重交通事故,威胁自身及他人的行车安全。解决方案的关键在于构建一个基于深度卷积神经网络(Deep Convolutional Neural Networks, DCNNs)的实时驾驶员疲劳检测系统,该系统通过车载摄像头采集驾驶员面部图像,利用OpenCV库提取眼部开合程度与嘴部张开等关键面部特征,并结合预训练模型进行实时分类判断。一旦识别出疲劳状态,系统将立即触发连续警报,从而实现非侵入式、低成本且高效的驾驶疲劳预警,实验表明该方法在NTHU-DDD和Yawn-Eye-Dataset数据集上分别达到了99.6%和97%的检测准确率。

链接: https://arxiv.org/abs/2511.12438
作者: ANK Zaman,Prosenjit Chatterjee,Rajat Sharma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:A long road trip is fun for drivers. However, a long drive for days can be tedious for a driver to accommodate stringent deadlines to reach distant destinations. Such a scenario forces drivers to drive extra miles, utilizing extra hours daily without sufficient rest and breaks. Once a driver undergoes such a scenario, it occasionally triggers drowsiness during driving. Drowsiness in driving can be life-threatening to any individual and can affect other drivers’ safety; therefore, a real-time detection system is needed. To identify fatigued facial characteristics in drivers and trigger the alarm immediately, this research develops a real-time driver drowsiness detection system utilizing deep convolutional neural networks (DCNNs) and this http URL proposed and implemented model takes real- time facial images of a driver using a live camera and utilizes a Python-based library named OpenCV to examine the facial images for facial landmarks like sufficient eye openings and yawn-like mouth movements. The DCNNs framework then gathers the data and utilizes a per-trained model to detect the drowsiness of a driver using facial landmarks. If the driver is identified as drowsy, the system issues a continuous alert in real time, embedded in the Smart Car this http URL potentially saving innocent lives on the roadways, the proposed technique offers a non-invasive, inexpensive, and cost-effective way to identify drowsiness. Our proposed and implemented DCNNs embedded drowsiness detection model successfully react with NTHU-DDD dataset and Yawn-Eye-Dataset with drowsiness detection classification accuracy of 99.6% and 97% respectively.
zh

[CV-197] xt-Guided Channel Perturbation and Pretrained Knowledge Integration for Unified Multi-Modality Image Fusion AAAI2026

【速读】:该论文旨在解决统一多模态图像融合模型中因模态差异导致的梯度冲突问题,以及现有方法引入模态特定编码器后降低跨任务泛化能力的局限性。其解决方案的关键在于提出一种基于通道扰动与预训练知识融合的统一框架(UP-Fusion),通过三个核心模块实现:1)语义感知通道剪枝模块(Semantic-Aware Channel Pruning Module, SCPM),利用预训练模型的语义感知能力筛选和增强多模态特征通道;2)几何仿射调制模块(Geometric Affine Modulation Module, GAM),通过原始模态特征对初始融合特征进行仿射变换以保持特征编码器的模态判别力;3)文本引导通道扰动模块(Text-Guided Channel Perturbation Module, TCPM),在解码阶段重塑通道分布,降低对模态特异性通道的依赖,从而提升模型在多模态图像融合及下游任务中的性能与泛化能力。

链接: https://arxiv.org/abs/2511.12432
作者: Xilai Li,Xiaosong Li,Weijun Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at AAAI 2026

点击查看摘要

Abstract:Multi-modality image fusion enhances scene perception by combining complementary information. Unified models aim to share parameters across modalities for multi-modality image fusion, but large modality differences often cause gradient conflicts, limiting performance. Some methods introduce modality-specific encoders to enhance feature perception and improve fusion quality. However, this strategy reduces generalisation across different fusion tasks. To overcome this limitation, we propose a unified multi-modality image fusion framework based on channel perturbation and pre-trained knowledge integration (UP-Fusion). To suppress redundant modal information and emphasize key features, we propose the Semantic-Aware Channel Pruning Module (SCPM), which leverages the semantic perception capability of a pre-trained model to filter and enhance multi-modality feature channels. Furthermore, we proposed the Geometric Affine Modulation Module (GAM), which uses original modal features to apply affine transformations on initial fusion features to maintain the feature encoder modal discriminability. Finally, we apply a Text-Guided Channel Perturbation Module (TCPM) during decoding to reshape the channel distribution, reducing the dependence on modality-specific channels. Extensive experiments demonstrate that the proposed algorithm outperforms existing methods on both multi-modality image fusion and downstream tasks.
zh

[CV-198] RedVTP: Training-Free Acceleration of Diffusion Vision-Language Models Inference via Masked Token-Guided Visual Token Pruning

【速读】:该论文旨在解决扩散式视觉语言模型(Diffusion Vision-Language Models, DVLMs)在推理过程中因大量视觉 token 导致的计算效率低下问题。现有视觉 token 剪枝方法主要针对自回归式视觉语言模型(Autoregressive VLMs, AVLMs),对 DVLMs 的研究仍属空白。解决方案的关键在于提出 RedVTP,一种响应驱动的视觉 token 剪枝策略,其利用 DVLM 推理过程中的注意力机制,通过掩码响应 token 计算视觉 token 重要性得分;基于观察到该得分在推理步骤间保持稳定,RedVTP 在首次推理后即剪枝掉低重要性视觉 token,从而显著提升推理效率,同时保持甚至提升模型准确性。

链接: https://arxiv.org/abs/2511.12428
作者: Jingqi Xu,Jingxi Lu,Chenghao Li,Sreetama Sarkar,Souvik Kundu,Peter A. Beerel
机构: University of Southern California (南加州大学); Intel Labs
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning and generation, yet their high computational demands remain a major challenge. Diffusion Vision-Language Models (DVLMs) are particularly attractive because they enable parallel token decoding, but the large number of visual tokens still significantly hinders their inference efficiency. While visual token pruning has been extensively studied for autoregressive VLMs (AVLMs), it remains largely unexplored for DVLMs. In this work, we propose RedVTP, a response-driven visual token pruning strategy that leverages the inference dynamics of DVLMs. Our method estimates visual token importance using attention from the masked response tokens. Based on the observation that these importance scores remain consistent across steps, RedVTP prunes the less important visual tokens from the masked tokens after the first inference step, thereby maximizing inference efficiency. Experiments show that RedVTP improves token generation throughput of LLaDA-V and LaViDa by up to 186% and 28.05%, respectively, and reduces inference latency by up to 64.97% and 21.87%, without compromising-and in some cases improving-accuracy.
zh

[CV-199] MFI-ResNet: Efficient ResNet Architecture Optimization via MeanFlow Compression and Selective Incubation

【速读】:该论文旨在解决传统ResNet模型在参数效率与判别性能之间难以平衡的问题。其核心挑战在于如何在不牺牲分类精度的前提下显著减少模型参数量。解决方案的关键在于引入生成式流场(flow field)思想,具体为提出MeanFlow-Incubated ResNet(MFI-ResNet),通过压缩-扩展策略实现:在压缩阶段,将每个ResNet阶段内的多层残差块简化为一个或两个MeanFlow模块,构建轻量级元模型;在扩展阶段,对前三个阶段采用选择性孵化(selective incubation)策略恢复为原始ResNet结构,而保留最后一阶段为MeanFlow形式,并进行微调。这一方法利用生成式流场有效刻画了ResNet中的特征变换过程,从而在大幅降低参数量的同时提升判别性能,体现了生成建模与判别学习之间的深层关联。

链接: https://arxiv.org/abs/2511.12422
作者: Nuolin Sun,Linyuan Wang,Haonan Wei,Lei Li,Bin Yan
机构: Information Engineering University (信息工程大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:ResNet has achieved tremendous success in computer vision through its residual connection mechanism. ResNet can be viewed as a discretized form of ordinary differential equations (ODEs). From this perspective, the multiple residual blocks within a single ResNet stage essentially perform multi-step discrete iterations of the feature transformation for that stage. The recently proposed flow matching model, MeanFlow, enables one-step generative modeling by learning the mean velocity field to transform distributions. Inspired by this, we propose MeanFlow-Incubated ResNet (MFI-ResNet), which employs a compression-expansion strategy to jointly improve parameter efficiency and discriminative performance. In the compression phase, we simplify the multi-layer structure within each ResNet stage to one or two MeanFlow modules to construct a lightweight meta model. In the expansion phase, we apply a selective incubation strategy to the first three stages, expanding them to match the residual block configuration of the baseline ResNet model, while keeping the last stage in MeanFlow form, and fine-tune the incubated model. Experimental results show that on CIFAR-10 and CIFAR-100 datasets, MFI-ResNet achieves remarkable parameter efficiency, reducing parameters by 46.28% and 45.59% compared to ResNet-50, while still improving accuracy by 0.23% and 0.17%, respectively. This demonstrates that generative flow-fields can effectively characterize the feature transformation process in ResNet, providing a new perspective for understanding the relationship between generative modeling and discriminative learning.
zh

[CV-200] Seeing Through the Rain: Resolving High-Frequency Conflicts in Deraining and Super-Resolution via Diffusion Guidance

【速读】:该论文旨在解决天气退化图像在去雨处理与超分辨率(Super-Resolution, SR)重建之间存在的内在矛盾问题:去雨方法倾向于去除高频噪声,而SR方法则依赖于从低频信息中“幻觉”出高频纹理,二者协同时易导致恢复内容不一致。解决方案的关键在于提出一种基于扩散模型的高频引导机制(Diffusion-based High-frequency Guided Model, DHGM),通过融合预训练扩散先验与高通滤波器,同时实现雨滴去除和结构细节增强,从而在保持图像清晰度的同时保留对小目标检测至关重要的高频细节。

链接: https://arxiv.org/abs/2511.12419
作者: Wenjie Li,Jinglei Shi,Jin Han,Heng Guo,Zhanyu Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Clean images are crucial for visual tasks such as small object detection, especially at high resolutions. However, real-world images are often degraded by adverse weather, and weather restoration methods may sacrifice high-frequency details critical for analyzing small objects. A natural solution is to apply super-resolution (SR) after weather removal to recover both clarity and fine structures. However, simply cascading restoration and SR struggle to bridge their inherent conflict: removal aims to remove high-frequency weather-induced noise, while SR aims to hallucinate high-frequency textures from existing details, leading to inconsistent restoration contents. In this paper, we take deraining as a case study and propose DHGM, a Diffusion-based High-frequency Guided Model for generating clean and high-resolution images. DHGM integrates pre-trained diffusion priors with high-pass filters to simultaneously remove rain artifacts and enhance structural details. Extensive experiments demonstrate that DHGM achieves superior performance over existing methods, with lower costs.
zh

[CV-201] owards Rotation-only Imaging Geometry: Rotation Estimation

【速读】:该论文旨在解决结构光恢复(Structure from Motion, SfM)任务中传统方法因3D坐标与相机位姿耦合而导致的精度和鲁棒性不足的问题。其解决方案的关键在于从“仅位姿”(pose-only)视角出发,揭示场景结构、旋转与平移之间的内在关系:通过理论推导发现平移可由旋转参数显式表达,从而将成像几何表示压缩至旋转流形(rotation manifold)上,构建基于重投影误差的纯旋转优化框架。该方法在双视图与多视图场景下均表现出优于当前最优旋转估计方法的性能,甚至接近多次捆绑调整(bundle adjustment)的结果,显著提升了3D视觉计算的准确性、效率与可靠性。

链接: https://arxiv.org/abs/2511.12415
作者: Xinrui Li,Qi Cai,Yuanxin Wu
机构: Shanghai Key Laboratory of Navigation and Location-based Services, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University (上海交通大学电子信息与电气工程学院导航与位置服务上海市重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Structure from Motion (SfM) is a critical task in computer vision, aiming to recover the 3D scene structure and camera motion from a sequence of 2D images. The recent pose-only imaging geometry decouples 3D coordinates from camera poses and demonstrates significantly better SfM performance through pose adjustment. Continuing the pose-only perspective, this paper explores the critical relationship between the scene structures, rotation and translation. Notably, the translation can be expressed in terms of rotation, allowing us to condense the imaging geometry representation onto the rotation manifold. A rotation-only optimization framework based on reprojection error is proposed for both two-view and multi-view scenarios. The experiment results demonstrate superior accuracy and robustness performance over the current state-of-the-art rotation estimation methods, even comparable to multiple bundle adjustment iteration results. Hopefully, this work contributes to even more accurate, efficient and reliable 3D visual computing.
zh

[CV-202] Self-Supervised Visual Prompting for Cross-Domain Road Damage Detection WACV2026

【速读】:该论文旨在解决自动化路面缺陷检测系统在跨域场景下泛化能力差的问题。现有监督模型虽在源域表现优异,但需昂贵的重新标注才能适应新环境;而标准自监督方法虽能提取通用特征,仍易受域偏移影响。其解决方案的关键在于提出一种无需标签的自监督框架 \ours,核心创新包括:(1)自监督提示增强模块(Self-supervised Prompt Enhancement Module, SPEM),从无标签目标域数据中自动构建缺陷感知提示以引导冻结的视觉Transformer(ViT)骨干网络;(2)域感知提示对齐目标(Domain-Aware Prompt Alignment, DAPA),通过优化提示条件下的源域与目标域表示一致性来提升跨域适应性。实验表明,该方法在多个基准上显著优于监督、自监督及域适应基线,实现稳健的零样本迁移和高数据效率的少样本适配。

链接: https://arxiv.org/abs/2511.12410
作者: Xi Xiao,Zhuxuanzi Wang,Mingqiao Mo,Chen Liu,Chenrui Ma,Yanshu Li,Smita Krishnaswamy,Xiao Wang,Tianyang Wang
机构: University of Alabama at Birmingham (阿拉巴马大学伯明翰分校); Cornell University (康奈尔大学); Yale University (耶鲁大学); University of California Irvine (加州大学欧文分校); Brown University (布朗大学); Oak Ridge National Laboratory (橡树岭国家实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by WACV 2026

点击查看摘要

Abstract:The deployment of automated pavement defect detection is often hindered by poor cross-domain generalization. Supervised detectors achieve strong in-domain accuracy but require costly re-annotation for new environments, while standard self-supervised methods capture generic features and remain vulnerable to domain shift. We propose \ours, a self-supervised framework that \emphvisually probes target domains without labels. \ours introduces a Self-supervised Prompt Enhancement Module (SPEM), which derives defect-aware prompts from unlabeled target data to guide a frozen ViT backbone, and a Domain-Aware Prompt Alignment (DAPA) objective, which aligns prompt-conditioned source and target representations. Experiments on four challenging benchmarks show that \ours consistently outperforms strong supervised, self-supervised, and adaptation baselines, achieving robust zero-shot transfer, improved resilience to domain variations, and high data efficiency in few-shot adaptation. These results highlight self-supervised prompting as a practical direction for building scalable and adaptive visual inspection systems. Source code is publicly available: this https URL
zh

[CV-203] VLA-R: Vision-Language Action Retrieval toward Open-World End-to-End Autonomous Driving

【速读】:该论文旨在解决开放世界端到端自动驾驶(OW-E2EAD)中因训练环境与实际部署场景差异导致的泛化能力不足问题,尤其是在未见过的非结构化户外环境中如何实现鲁棒感知与行为决策。其解决方案的关键在于提出视觉-语言-动作检索(VLA-R)框架,通过冻结的视觉-语言模型实现无需领域特定调优的多尺度、提示引导且可解释的感知特征提取,并引入Q-Former瓶颈融合细粒度视觉表示与语言对齐特征以打通感知与动作域;同时设计视觉-动作对比学习机制,对齐视觉-语言嵌入与动作嵌入,从而支持跨场景迁移的驾驶行为学习与动作检索,实现在有限数据下对未知环境的有效探索与泛化性能。

链接: https://arxiv.org/abs/2511.12405
作者: Hyunki Seong,Seongwoo Moon,Hojin Ahn,Jehun Kang,David Hyunchul Shim
机构: KAIST (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 9 figures

点击查看摘要

Abstract:Exploring open-world situations in an end-to-end manner is a promising yet challenging task due to the need for strong generalization capabilities. In particular, end-to-end autonomous driving in unstructured outdoor environments often encounters conditions that were unfamiliar during training. In this work, we present Vision-Language Action Retrieval (VLA-R), an open-world end-to-end autonomous driving (OW-E2EAD) framework that integrates open-world perception with a novel vision-action retrieval paradigm. We leverage a frozen vision-language model for open-world detection and segmentation to obtain multi-scale, prompt-guided, and interpretable perception features without domain-specific tuning. A Q-Former bottleneck aggregates fine-grained visual representations with language-aligned visual features, bridging perception and action domains. To learn transferable driving behaviors, we introduce a vision-action contrastive learning scheme that aligns vision-language and action embeddings for effective open-world reasoning and action retrieval. Our experiments on a real-world robotic platform demonstrate strong generalization and exploratory performance in unstructured, unseen environments, even with limited data. Demo videos are provided in the supplementary material.
zh

[CV-204] MSLoRA: Multi-Scale Low-Rank Adaptation via Attention Reweighting

【速读】:该论文旨在解决现有低秩适应方法(Low-Rank Adaptation, LoRA)在视觉模型中泛化能力不足的问题,特别是其主要局限于视觉Transformer(Vision Transformers, ViTs),难以有效适配卷积神经网络(Convolutional Neural Networks, CNNs)等不同架构。解决方案的关键在于提出MSLoRA——一种骨干网络无关(backbone-agnostic)的参数高效适配模块,通过重加权特征响应而非微调预训练权重来实现注意力机制的动态调整。其核心设计包含两个组件:低秩线性投影与多尺度非线性变换,分别用于建模空间和通道注意力,并通过逐点乘法融合与残差连接实现轻量级的注意力重加权,从而在保持预训练权重冻结的同时显著提升迁移性能,在分类、检测和分割任务中均展现出优越的跨架构通用性和优化稳定性。

链接: https://arxiv.org/abs/2511.12400
作者: Xu Yang,Gady Agam
机构: Illinois Institute of Technology (伊利诺伊理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce MSLoRA, a backbone-agnostic, parameter-efficient adapter that reweights feature responses rather than re-tuning the underlying backbone. Existing low-rank adaptation methods are mostly confined to vision transformers (ViTs) and struggle to generalize across architectures. MSLoRA unifies adaptation for both convolutional neural networks (CNNs) and ViTs by combining a low-rank linear projection with a multi-scale nonlinear transformation that jointly modulates spatial and channel attention. The two components are fused through pointwise multiplication and a residual connection, yielding a lightweight module that shifts feature attention while keeping pretrained weights frozen. Extensive experiments demonstrate that MSLoRA consistently improves transfer performance on classification, detection, and segmentation tasks with roughly less than 5% of backbone parameters. The design further enables stable optimization, fast convergence, and strong cross-architecture generalization. By reweighting rather than re-tuning, MSLoRA provides a simple and universal approach for efficient adaptation of frozen vision backbones. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2511.12400 [cs.CV] (or arXiv:2511.12400v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2511.12400 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Xu Yang [view email] [v1] Sun, 16 Nov 2025 00:35:37 UTC (659 KB) Full-text links: Access Paper: View a PDF of the paper titled MSLoRA: Multi-Scale Low-Rank Adaptation via Attention Reweighting, by Xu Yang and 1 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.CV prev | next new | recent | 2025-11 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
zh

[CV-205] Calibrated Decomposition of Aleatoric and Epistemic Uncertainty in Deep Features for Inference-Time Adaptation

【速读】:该论文旨在解决现有模型在推理时无法有效区分不同类型的不确定性(即数据驱动的偶然不确定性与模型驱动的认知不确定性),从而难以实现动态资源分配和可靠推理的问题。其解决方案的关键在于提出一种轻量级的推理时不确定性引导选择框架,直接在深度特征空间中解耦两种不确定性:通过正则化的全局密度模型估计偶然不确定性,同时利用三个互补组件(局部支持不足、流形谱坍缩和跨层特征不一致)构建认知不确定性,这些组件具有经验上的正交性且无需采样或额外前向传播;最终将分解后的不确定性引入无分布的共形校准过程,显著提升预测区间精度,并在MOT17数据集上实现约60%的计算节省,同时保持近乎不变的精度,实现了高效的自调节视觉推理。

链接: https://arxiv.org/abs/2511.12389
作者: Divake Kumar,Patrick Poggi,Sina Tayebati,Devashri Naik,Nilesh Ahuja,Amit Ranjan Trivedi
机构: University of Illinois at Chicago (芝加哥大学伊利诺伊分校); Intel Labs (英特尔实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Most estimators collapse all uncertainty modes into a single confidence score, preventing reliable reasoning about when to allocate more compute or adjust inference. We introduce Uncertainty-Guided Inference-Time Selection, a lightweight inference time framework that disentangles aleatoric (data-driven) and epistemic (model-driven) uncertainty directly in deep feature space. Aleatoric uncertainty is estimated using a regularized global density model, while epistemic uncertainty is formed from three complementary components that capture local support deficiency, manifold spectral collapse, and cross-layer feature inconsistency. These components are empirically orthogonal and require no sampling, no ensembling, and no additional forward passes. We integrate the decomposed uncertainty into a distribution free conformal calibration procedure that yields significantly tighter prediction intervals at matched coverage. Using these components for uncertainty guided adaptive model selection reduces compute by approximately 60 percent on MOT17 with negligible accuracy loss, enabling practical self regulating visual inference. Additionally, our ablation results show that the proposed orthogonal uncertainty decomposition consistently yields higher computational savings across all MOT17 sequences, improving margins by 13.6 percentage points over the total-uncertainty baseline.
zh

[CV-206] Leverag ing Quantum-Based Architectures for Robust Diagnostics

【速读】:该论文旨在解决肾脏结石、囊肿和肿瘤的自动诊断与区分问题,以提升医学影像分析的准确性和可靠性。其解决方案的关键在于构建一种混合量子-经典框架:首先利用预训练的ResNet50编码器提取肾部CT图像的深层特征,随后通过角度编码(angle encoding)将这些特征映射为量子比特(qubits),并输入至量子卷积神经网络(Quantum Convolutional Neural Network, QCNN)进行处理;同时结合去噪、对比度受限自适应直方图均衡化(CLAHE)等预处理技术以及数据增强和加权采样策略缓解类别不平衡问题。实验表明,该方法在8比特与12比特配置下均实现了高精度分类(测试准确率达0.99),尤其在12比特配置中对囊肿和肿瘤的召回率与F1分数显著提升,验证了量子辅助特征处理在医学影像诊断中的有效性。

链接: https://arxiv.org/abs/2511.12386
作者: Shabnam Sodagari,Tommy Long
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The objective of this study is to diagnose and differentiate kidney stones, cysts, and tumors using Computed Tomography (CT) images of the kidney. This study leverages a hybrid quantum-classical framework in this regard. We combine a pretrained ResNet50 encoder, with a Quantum Convolutional Neural Network (QCNN) to explore quantum-assisted diagnosis. We pre-process the kidney images using denoising and contrast limited adaptive histogram equalization to enhance feature extraction. We address class imbalance through data augmentation and weighted sampling. Latent features extracted by the encoder are transformed into qubits via angle encoding and processed by a QCNN. The model is evaluated on both 8-qubit and 12-qubit configurations. Both architectures achieved rapid convergence with stable learning curves and high consistency between training and validation performance. The models reached a test accuracy of 0.99, with the 12-qubit configuration providing improvements in overall recall and precision, particularly for Cyst and Tumor detection, where it achieved perfect recall for Cysts and a tumor F1-score of 0.9956. Confusion matrix analysis further confirmed reliable classification behavior across all classes, with very few misclassifications. Results demonstrate that integrating classical pre-processing and deep feature extraction with quantum circuits enhances medical diagnostic performance.
zh

[CV-207] AGGRNet: Selective Feature Extraction and Aggregation for Enhanced Medical Image Classification

【速读】:该论文旨在解决复杂医学图像分析任务中因类别间视觉模式相似、标注数据稀缺以及专家判读差异导致的分类困难问题,尤其针对现有基于注意力机制的模型在区分细微类别时表现不佳的问题。其解决方案的关键在于提出AGGRNet框架,通过同时提取有判别力(informative)和无判别力(non-informative)特征,增强对细粒度视觉模式的理解能力,从而提升复杂医学图像分类的准确性。实验表明,该方法在多个医学影像数据集上达到SOTA性能,尤其在Kvasir数据集上相较现有最优模型提升达5%。

链接: https://arxiv.org/abs/2511.12382
作者: Ansh Makwe,Akansh Agrawal,Prateek Jain,Akshan Agrawal,Priyanka Bagade
机构: Indian Institute of Technology Kanpur (印度理工学院坎普尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Medical image analysis for complex tasks such as severity grading and disease subtype classification poses significant challenges due to intricate and similar visual patterns among classes, scarcity of labeled data, and variability in expert interpretations. Despite the usefulness of existing attention-based models in capturing complex visual patterns for medical image classification, underlying architectures often face challenges in effectively distinguishing subtle classes since they struggle to capture inter-class similarity and intra-class variability, resulting in incorrect diagnosis. To address this, we propose AGGRNet framework to extract informative and non-informative features to effectively understand fine-grained visual patterns and improve classification for complex medical image analysis tasks. Experimental results show that our model achieves state-of-the-art performance on various medical imaging datasets, with the best improvement up to 5% over SOTA models on the Kvasir dataset.
zh

[CV-208] Reasoning Text-to-Video Retrieval via Digital Twin Video Representations and Large Language Models

【速读】:该论文旨在解决文本到视频检索(text-to-video retrieval)中对隐式查询(implicit queries)处理能力不足的问题,即当用户查询涉及需要推理才能识别相关视频内容时,现有方法难以准确匹配。其核心解决方案在于提出“推理型文本到视频检索”(reasoning text-to-video retrieval)范式,关键创新是将视频内容表示为数字孪生(digital twins),即通过专用视觉模型分解出显著对象的结构化场景表示,从而允许大语言模型(LLM)直接在长时程视频内容上进行推理,而无需依赖视觉标记压缩。该方法采用两阶段框架:首先基于子查询与数字孪生表示的组合对齐进行候选视频筛选,再通过基于LLM的即时精炼机制调用额外的专业模型填补信息缺口,实现对象级定位与语义推理的协同。

链接: https://arxiv.org/abs/2511.12371
作者: Yiqing Shen,Chenxiao Fan,Chenjia Li,Mathias Unberath
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The goal of text-to-video retrieval is to search large databases for relevant videos based on text queries. Existing methods have progressed to handling explicit queries where the visual content of interest is described explicitly; however, they fail with implicit queries where identifying videos relevant to the query requires reasoning. We introduce reasoning text-to-video retrieval, a paradigm that extends traditional retrieval to process implicit queries through reasoning while providing object-level grounding masks that identify which entities satisfy the query conditions. Instead of relying on vision-language models directly, we propose representing video content as digital twins, i.e., structured scene representations that decompose salient objects through specialist vision models. This approach is beneficial because it enables large language models to reason directly over long-horizon video content without visual token compression. Specifically, our two-stage framework first performs compositional alignment between decomposed sub-queries and digital twin representations for candidate identification, then applies large language model-based reasoning with just-in-time refinement that invokes additional specialist models to address information gaps. We construct a benchmark of 447 manually created implicit queries with 135 videos (ReasonT2VBench-135) and another more challenging version of 1000 videos (ReasonT2VBench-1000). Our method achieves 81.2% R@1 on ReasonT2VBench-135, outperforming the strongest baseline by greater than 50 percentage points, and maintains 81.7% R@1 on the extended configuration while establishing state-of-the-art results in three conventional benchmarks (MSR-VTT, MSVD, and VATEX).
zh

[CV-209] Changes in Real Time: Online Scene Change Detection with Multi-View Fusion

【速读】:该论文旨在解决在线场景变化检测(Online Scene Change Detection, SCD)问题,即在无约束视角下实时检测场景中的关键变化,同时克服现有在线方法准确率远低于离线方法的局限性。其解决方案的关键在于提出了一种无需姿态信息(pose-agnostic)、无需标签(label-free)且能保证多视角一致性(multi-view consistency)的新方法,结合自监督融合损失(self-supervised fusion loss)从多源线索中推断场景变化、基于PnP(Perspective-n-Point)的快速姿态估计以及一种基于变化引导的3D高斯泼溅(3D Gaussian Splatting)场景表示更新策略,从而在超过10 FPS的帧率下实现超越最优离线方法的性能表现。

链接: https://arxiv.org/abs/2511.12370
作者: Chamuditha Jayanga Galappaththige,Jason Lai,Lloyd Windrim,Donald Dansereau,Niko Sünderhauf,Dimity Miller
机构: QUT Centre for Robotics (昆士兰科技大学机器人中心); ARIAM; ACFR, University of Sydney (悉尼大学先进机器人与自动化研究中心); Abyss Solutions
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Online Scene Change Detection (SCD) is an extremely challenging problem that requires an agent to detect relevant changes on the fly while observing the scene from unconstrained viewpoints. Existing online SCD methods are significantly less accurate than offline approaches. We present the first online SCD approach that is pose-agnostic, label-free, and ensures multi-view consistency, while operating at over 10 FPS and achieving new state-of-the-art performance, surpassing even the best offline approaches. Our method introduces a new self-supervised fusion loss to infer scene changes from multiple cues and observations, PnP-based fast pose estimation against the reference scene, and a fast change-guided update strategy for the 3D Gaussian Splatting scene representation. Extensive experiments on complex real-world datasets demonstrate that our approach outperforms both online and offline baselines.
zh

[CV-210] Fast Reasoning Segmentation for Images and Videos

【速读】:该论文旨在解决当前推理分割(reasoning segmentation)方法依赖于参数量庞大的多模态大语言模型(multimodal large language models),难以在计算资源受限的边缘设备上部署的问题。现有知识蒸馏(distillation)方法因仅关注输出预测和中间特征匹配,无法有效迁移推理分割所需的多步推理能力(multi-step reasoning capabilities)。其解决方案的关键在于提出 FastReasonSeg,通过引入数字孪生表示(digital twin representations)将感知与推理解耦,从而实现更有效的蒸馏:首先利用教师模型生成的推理链进行监督微调,随后通过强化学习微调并联合奖励机制评估分割准确率与推理质量的一致性,最终在多个基准测试中实现了最优性能,且轻量模型(0.6B参数)显著优于参数量大20倍的模型,同时具备高吞吐率(7.79 FPS)和低内存占用(2.1GB),满足边缘部署需求。

链接: https://arxiv.org/abs/2511.12368
作者: Yiqing Shen,Mathias Unberath
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reasoning segmentation enables open-set object segmentation via implicit text queries, therefore serving as a foundation for embodied agents that should operate autonomously in real-world environments. However, existing methods for reasoning segmentation require multimodal large language models with billions of parameters that exceed the computational capabilities of edge devices that typically deploy the embodied AI systems. Distillation offers a pathway to compress these models while preserving their capabilities. Yet, existing distillation approaches fail to transfer the multi-step reasoning capabilities that reasoning segmentation demands, as they focus on matching output predictions and intermediate features rather than preserving reasoning chains. The emerging paradigm of reasoning over digital twin representations presents an opportunity for more effective distillation by re-framing the problem. Consequently, we propose FastReasonSeg, which employs digital twin representations that decouple perception from reasoning to enable more effective distillation. Our distillation scheme first relies on supervised fine-tuning on teacher-generated reasoning chains. Then it is followed by reinforcement fine-tuning with joint rewards evaluating both segmentation accuracy and reasoning quality alignment. Experiments on two video (JiTBench, RVTBench) and two image benchmarks (ReasonSeg, LLM-Seg40K) demonstrate that our FastReasonSeg achieves state-of-the-art reasoning segmentation performance. Moreover, the distilled 0.6B variant outperforms models with 20 times more parameters while achieving 7.79 FPS throughput with only 2.1GB memory consumption. This efficiency enables deployment in resource-constrained environments to enable real-time reasoning segmentation.
zh

[CV-211] Constructing and Interpreting Digital Twin Representations for Visual Reasoning via Reinforcement Learning

【速读】:该论文旨在解决当前视觉推理任务中模型架构碎片化与跨任务、跨模态泛化能力不足的问题。现有方法通常依赖于为每类任务(如分割、定位、摘要和视觉问答)设计特定的模型结构并进行监督微调,导致缺乏统一解决方案且难以迁移至新任务或模态。其关键解决方案是提出DT-R1框架,该框架基于强化学习训练大语言模型(LLM),使其构建复杂多模态输入的数字孪生(Digital Twin)表示,并在此高阶语义空间中进行统一推理。该方法通过一种新颖的奖励机制(GRPO)同时验证结构完整性和输出准确性,从而在六个涵盖两种模态和四类任务的基准上显著优于现有专用模型,为视觉推理提供了一种以数字孪生为基础的通用范式。

链接: https://arxiv.org/abs/2511.12365
作者: Yiqing Shen,Mathias Unberath
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual reasoning may require models to interpret images and videos and respond to implicit text queries across diverse output formats, from pixel-level segmentation masks to natural language descriptions. Existing approaches rely on supervised fine-tuning with task-specific architectures. For example, reasoning segmentation, grounding, summarization, and visual question answering each demand distinct model designs and training, preventing unified solutions and limiting cross-task and cross-modality generalization. Hence, we propose DT-R1, a reinforcement learning framework that trains large language models to construct digital twin representations of complex multi-modal visual inputs and then reason over these high-level representations as a unified approach to visual reasoning. Specifically, we train DT-R1 using GRPO with a novel reward that validates both structural integrity and output accuracy. Evaluations in six visual reasoning benchmarks, covering two modalities and four task types, demonstrate that DT-R1 consistently achieves improvements over state-of-the-art task-specific models. DT-R1 opens a new direction where visual reasoning emerges from reinforcement learning with digital twin representations.
zh

[CV-212] Explainable AI-Generated Image Detection RewardBench

【速读】:该论文旨在解决当前基于分类的生成式 AI (Generative AI) 图像检测方法缺乏可解释性的问题,即这些方法无法以人类专家能够理解的方式说明图像为何被判定为真实或由 AI 生成,从而削弱了检测工具在现实场景中的可信度与说服力。其解决方案的关键在于提出首个专门用于评估多模态大语言模型(Multimodal Large Language Models, MLLMs)判断生成式 AI 图像检测解释质量能力的基准——XAIGID-RewardBench。该基准包含约 3000 个来自不同图像生成模型和 MLLM 检测器的标注三元组,用以测试 MLLM 作为奖励模型(reward model)对解释质量的评判能力,从而系统性地衡量当前 MLLMs 在此任务上的表现与人类水平之间的差距。

链接: https://arxiv.org/abs/2511.12363
作者: Michael Yang,Shijian Deng,William T. Doan,Kai Wang,Tianyu Yang,Harsh Singh,Yapeng Tian
机构: The University of Texas at Dallas (德克萨斯大学达拉斯分校); University of Toronto (多伦多大学); University of Notre Dame (圣母大学); Stony Brook University (石溪大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Conventional, classification-based AI-generated image detection methods cannot explain why an image is considered real or AI-generated in a way a human expert would, which reduces the trustworthiness and persuasiveness of these detection tools for real-world applications. Leveraging Multimodal Large Language Models (MLLMs) has recently become a trending solution to this issue. Further, to evaluate the quality of generated explanations, a common approach is to adopt an “MLLM as a judge” methodology to evaluate explanations generated by other MLLMs. However, how well those MLLMs perform when judging explanations for AI-generated image detection generated by themselves or other MLLMs has not been well studied. We therefore propose \textbfXAIGID-RewardBench, the first benchmark designed to evaluate the ability of current MLLMs to judge the quality of explanations about whether an image is real or AI-generated. The benchmark consists of approximately 3,000 annotated triplets sourced from various image generation models and MLLMs as policy models (detectors) to assess the capabilities of current MLLMs as reward models (judges). Our results show that the current best reward model scored 88.76% on this benchmark (while human inter-annotator agreement reaches 98.30%), demonstrating that a visible gap remains between the reasoning abilities of today’s MLLMs and human-level performance. In addition, we provide an analysis of common pitfalls that these models frequently encounter. Code and benchmark are available at this https URL.
zh

[CV-213] CLAReSNet: When Convolution Meets Latent Attention for Hyperspectral Image Classification

【速读】:该论文旨在解决高光谱图像(Hyperspectral Image, HSI)分类中面临的三大挑战:高光谱维度、复杂的光谱-空间相关性以及训练样本有限且类别严重不平衡的问题。现有方法如卷积神经网络(CNN)擅长局部特征提取,而Transformer能捕捉长程依赖关系,但单独使用时因计算复杂度高(二次方级)和归纳偏置不足导致性能受限。其解决方案的关键在于提出一种混合架构CLAReSNet(Convolutional Latent Attention Residual Spectral Network),通过多尺度卷积茎与Transformer式注意力机制的融合实现高效建模:首先利用深度残差块和改进的卷积块注意力模块(CBAM)提取层次化空间特征;随后引入多尺度光谱潜在注意力(Multi-Scale Spectral Latent Attention, MSLA)模块,采用自适应潜在token分配策略(8–64个token)将光谱序列复杂度从O(T2D)\mathcal{O}(T^2D)降低至O(TlogTD)\mathcal{O}(T\log T D),并结合双向RNN(LSTM/GRU)增强光谱建模能力;最后通过分层交叉注意力融合动态聚合多级表示,从而在有限样本和严重类别不平衡条件下实现卓越的分类性能。

链接: https://arxiv.org/abs/2511.12346
作者: Asmit Bandyopadhyay,Anindita Das Bhattacharjee,Rakesh Das
机构: Institute of Engineering and Management (IEM); University of Engineering and Management Kolkata; IEM Centre of Excellence for InnovAI; Department of Computer Science & Engineering
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Hyperspectral image (HSI) classification faces critical challenges, including high spectral dimensionality, complex spectral-spatial correlations, and limited training samples with severe class imbalance. While CNNs excel at local feature extraction and transformers capture long-range dependencies, their isolated application yields suboptimal results due to quadratic complexity and insufficient inductive biases. We propose CLAReSNet (Convolutional Latent Attention Residual Spectral Network), a hybrid architecture that integrates multi-scale convolutional extraction with transformer-style attention via an adaptive latent bottleneck. The model employs a multi-scale convolutional stem with deep residual blocks and an enhanced Convolutional Block Attention Module for hierarchical spatial features, followed by spectral encoder layers combining bidirectional RNNs (LSTM/GRU) with Multi-Scale Spectral Latent Attention (MSLA). MSLA reduces complexity from \mathcalO(T^2D) to \mathcalO(T\log(T)D) by adaptive latent token allocation (8-64 tokens) that scales logarithmically with the sequence length. Hierarchical cross-attention fusion dynamically aggregates multi-level representations for robust classification. Experiments conducted on the Indian Pines and Salinas datasets show state-of-the-art performance, achieving overall accuracies of 99.71% and 99.96%, significantly surpassing HybridSN, SSRN, and SpectralFormer. The learned embeddings exhibit superior inter-class separability and compact intra-class clustering, validating CLAReSNet’s effectiveness under limited samples and severe class imbalance.
zh

[CV-214] Ground Plane Projection for Improved Traffic Analytics at Intersections

【速读】:该论文旨在解决交通路口车辆转向行为计数(turning movement counts)的准确性问题,这对信号控制、交通管理和城市规划具有重要意义。传统计算机视觉系统依赖于基础设施摄像头在图像平面(image plane)上的视觉分析,但存在精度不足的问题。论文提出的关键解决方案是将检测到的车辆从单个或多个摄像头中反投影至地面平面(ground plane),在真实三维坐标系下进行轨迹分类与计数。研究表明,这种基于地面平面的分析方法显著提升了轨迹识别和转向行为统计的准确性,且通过多摄像头弱融合(weak fusion)进一步提高了性能,表明交通分析应从图像平面转向地面平面以实现更高精度。

链接: https://arxiv.org/abs/2511.12342
作者: Sajjad Pakdamansavoji,Kumar Vaibhav Jha,Baher Abdulhai,James H Elder
机构: York University (约克大学); Vector Institute for Artificial Intelligence (人工智能研究院); University of Toronto (多伦多大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Accurate turning movement counts at intersections are important for signal control, traffic management and urban planning. Computer vision systems for automatic turning movement counts typically rely on visual analysis in the image plane of an infrastructure camera. Here we explore potential advantages of back-projecting vehicles detected in one or more infrastructure cameras to the ground plane for analysis in real-world 3D coordinates. For single-camera systems we find that back-projection yields more accurate trajectory classification and turning movement counts. We further show that even higher accuracy can be achieved through weak fusion of back-projected detections from multiple cameras. These results suggeest that traffic should be analyzed on the ground plane, not the image plane
zh

[CV-215] SpaceVLM: Sub-Space Modeling of Negation in Vision-Language Models

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在处理否定指令时表现不佳的问题,例如当输入提示为“检索(或生成)一个没有行人的街道场景”时,模型往往无法正确理解“不”这一语义约束。现有方法通过在大规模否定数据集上微调模型来改善该问题,但常导致模型在肯定类提示上的零样本性能下降。论文提出了一种无需训练的框架,其核心创新在于将VLM的联合嵌入空间(joint embedding space)视为由语义一致的子空间组成,并将否定概念建模为一个子空间而非单一嵌入点;具体而言,针对“包含A但不包含N”的查询,通过构建围绕A和N嵌入的两个球冠区域,以靠近A且远离N的区域中心方向作为图像评分依据,从而实现对否定语义的精准捕捉。该方法在检索、多项选择题和文本到图像生成任务中平均提升约30%的否定理解能力,同时保持了模型在肯定提示下的零样本性能。

链接: https://arxiv.org/abs/2511.12331
作者: Sepehr Kazemi Ranjbar,Kumail Alhamoud,Marzyeh Ghassemi
机构: MIT (麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) struggle with negation. Given a prompt like “retrieve (or generate) a street scene without pedestrians,” they often fail to respect the “not.” Existing methods address this limitation by fine-tuning on large negation datasets, but such retraining often compromises the model’s zero-shot performance on affirmative prompts. We show that the embedding space of VLMs, such as CLIP, can be divided into semantically consistent subspaces. Based on this property, we propose a training-free framework that models negation as a subspace in the joint embedding space rather than a single point (Figure 1). To find the matching image for a caption such as “A but not N,” we construct two spherical caps around the embeddings of A and N, and we score images by the central direction of the region that is close to A and far from N. Across retrieval, MCQ, and text-to-image tasks, our method improves negation understanding by about 30% on average over prior methods. It closes the gap between affirmative and negated prompts while preserving the zero-shot performance that fine-tuned models fail to maintain. Code will be released upon publication.
zh

[CV-216] Learning Time in Static Classifiers AAAI2026 AAAI

【速读】:该论文旨在解决传统分类器在处理随时间演变的视觉数据时表现受限的问题,因为这些分类器通常假设输入样本之间是时间独立的,无法有效捕捉动态变化(如姿态、光照、物体状态或场景上下文的变化)。解决方案的关键在于提出一种无需修改模型架构或引入循环模块的简单而有效的框架,其核心是一个新颖的支持-原型-查询(Support-Exemplar-Query, SEQ)学习范式,该范式将训练数据组织为时间一致的轨迹,并通过可微分的软DTW(soft-DTW)损失引导模型学习类特定的时间原型,从而在预测序列中实现语义一致性和时间平滑性。这一方法仅通过损失设计即引入强时间归纳偏置,在静态和时序任务中均表现出色,显著提升了细粒度和超细粒度图像分类性能,并实现了视频异常检测中的精确且时序一致的预测。

链接: https://arxiv.org/abs/2511.12321
作者: Xi Ding,Lei Wang,Piotr Koniusz,Yongsheng Gao
机构: 1. University of Technology Sydney (悉尼科技大学); 2. Australian National University (澳大利亚国立大学); 3. NICTA (澳大利亚国家信息通信技术研究中心); 4. UNSW Sydney (新南威尔士大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at the Fortieth AAAI Conference on Artificial Intelligence (AAAI 2026)

点击查看摘要

Abstract:Real-world visual data rarely presents as isolated, static instances. Instead, it often evolves gradually over time through variations in pose, lighting, object state, or scene context. However, conventional classifiers are typically trained under the assumption of temporal independence, limiting their ability to capture such dynamics. We propose a simple yet effective framework that equips standard feedforward classifiers with temporal reasoning, all without modifying model architectures or introducing recurrent modules. At the heart of our approach is a novel Support-Exemplar-Query (SEQ) learning paradigm, which structures training data into temporally coherent trajectories. These trajectories enable the model to learn class-specific temporal prototypes and align prediction sequences via a differentiable soft-DTW loss. A multi-term objective further promotes semantic consistency and temporal smoothness. By interpreting input sequences as evolving feature trajectories, our method introduces a strong temporal inductive bias through loss design alone. This proves highly effective in both static and temporal tasks: it enhances performance on fine-grained and ultra-fine-grained image classification, and delivers precise, temporally consistent predictions in video anomaly detection. Despite its simplicity, our approach bridges static and temporal learning in a modular and data-efficient manner, requiring only a simple classifier on top of pre-extracted features.
zh

[CV-217] LiDAR-GS:Improving LiDAR Gaussian Reconstruction via Diffusion Priors AAAI-26

【速读】:该论文旨在解决基于高斯溅射(Gaussian Splatting, GS)的LiDAR渲染方法在 extrapolated novel view synthesis(外推新视角合成)中因单次扫描重建不完整而导致的伪影问题。解决方案的关键在于引入扩散先验(diffusion priors)驱动的可控LiDAR生成模型,该模型以粗略外推渲染为条件生成几何一致的额外扫描数据,并结合有效的蒸馏机制实现扩展重建,从而在保持传感器捕获细节的同时,提升欠拟合区域的全局几何一致性,显著改善外推视角下的重建质量与真实感。

链接: https://arxiv.org/abs/2511.12304
作者: Qifeng Chen,Jiarun Liu,Rengan Xie,Tao Tang,Sicong Du,Yiru Zhao,Yuchi Huo,Sheng Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI-26

点击查看摘要

Abstract:Recent GS-based rendering has made significant progress for LiDAR, surpassing Neural Radiance Fields (NeRF) in both quality and speed. However, these methods exhibit artifacts in extrapolated novel view synthesis due to the incomplete reconstruction from single traversal scans. To address this limitation, we present LiDAR-GS++, a LiDAR Gaussian Splatting reconstruction method enhanced by diffusion priors for real-time and high-fidelity re-simulation on public urban roads. Specifically, we introduce a controllable LiDAR generation model conditioned on coarsely extrapolated rendering to produce extra geometry-consistent scans and employ an effective distillation mechanism for expansive reconstruction. By extending reconstruction to under-fitted regions, our approach ensures global geometric consistency for extrapolative novel views while preserving detailed scene surfaces captured by sensors. Experiments on multiple public datasets demonstrate that LiDAR-GS++ achieves state-of-the-art performance for both interpolated and extrapolated viewpoints, surpassing existing GS and NeRF-based methods.
zh

[CV-218] Rethinking Bias in Generative Data Augmentation for Medical AI: a Frequency Recalibration Method AAAI2026

【速读】:该论文旨在解决生成式数据增强(Generative Data Augmentation, GDA)在医疗领域中因频率分布不匹配导致的可靠性问题,即AI合成图像与真实医学图像在高频特征上的差异可能引入偏差,从而损害下游任务性能。解决方案的关键在于提出频率校准(Frequency Recalibration, FreRec)方法,其核心包含两个步骤:(1) 统计高频频段替换(Statistical High-frequency Replacement, SHR),用于粗略对齐高频成分;(2) 重建式高频频段映射(Reconstructive High-frequency Mapping, RHM),用于提升图像质量并恢复高频细节,从而显著降低频域分布差异,提升分类等下游任务的性能。FreRec为独立后处理模块,兼容任意生成模型且可无缝集成至主流医学GDA流程。

链接: https://arxiv.org/abs/2511.12301
作者: Chi Liu,Jincheng Liu,Congcong Zhu,Minghao Wang,Sheng Shen,Jia Gu,Tianqing Zhu,Wanlei Zhou
机构: City University of Macau (澳门城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted for AAAI 2026 (Main Track Poster)

点击查看摘要

Abstract:Developing Medical AI relies on large datasets and easily suffers from data scarcity. Generative data augmentation (GDA) using AI generative models offers a solution to synthesize realistic medical images. However, the bias in GDA is often underestimated in medical domains, with concerns about the risk of introducing detrimental features generated by AI and harming downstream tasks. This paper identifies the frequency misalignment between real and synthesized images as one of the key factors underlying unreliable GDA and proposes the Frequency Recalibration (FreRec) method to reduce the frequency distributional discrepancy and thus improve GDA. FreRec involves (1) Statistical High-frequency Replacement (SHR) to roughly align high-frequency components and (2) Reconstructive High-frequency Mapping (RHM) to enhance image quality and reconstruct high-frequency details. Extensive experiments were conducted in various medical datasets, including brain MRIs, chest X-rays, and fundus images. The results show that FreRec significantly improves downstream medical image classification performance compared to uncalibrated AI-synthesized samples. FreRec is a standalone post-processing step that is compatible with any generative model and can integrate seamlessly with common medical GDA pipelines.
zh

[CV-219] One target to align them all: LiDAR RGB and event cameras extrinsic calibration for Autonomous Driving

【速读】:该论文旨在解决多模态传感器(事件相机、LiDAR 和 RGB 相机)之间同时外参标定的难题,尤其针对事件相机标定这一挑战性问题。现有方法通常依赖于分步的两两标定,难以实现高效且高精度的联合校准。解决方案的关键在于设计了一种新型三维标定靶,该靶同时具备平面特征(用于RGB相机)、ChArUco板结构(用于LiDAR)和主动LED模式(用于事件相机),从而使得三种模态可同步感知并提取对应特征,实现一次性联合外参标定,显著提升了标定效率与精度,适用于自动驾驶场景下复杂视觉系统的精确对齐需求。

链接: https://arxiv.org/abs/2511.12291
作者: Andrea Bertogalli,Giacomo Boracchi,Luca Magri
机构: Politecnico di Milano(米兰理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present a novel multi-modal extrinsic calibration framework designed to simultaneously estimate the relative poses between event cameras, LiDARs, and RGB cameras, with particular focus on the challenging event camera calibration. Core of our approach is a novel 3D calibration target, specifically designed and constructed to be concurrently perceived by all three sensing modalities. The target encodes features in planes, ChArUco, and active LED patterns, each tailored to the unique characteristics of LiDARs, RGB cameras, and event cameras respectively. This unique design enables a one-shot, joint extrinsic calibration process, in contrast to existing approaches that typically rely on separate, pairwise calibrations. Our calibration pipeline is designed to accurately calibrate complex vision systems in the context of autonomous driving, where precise multi-sensor alignment is critical. We validate our approach through an extensive experimental evaluation on a custom built dataset, recorded with an advanced autonomous driving sensor setup, confirming the accuracy and robustness of our method.
zh

[CV-220] M-UNet: Token-Memory Enhanced Sequential Modeling for Efficient Medical Image Segmentation

【速读】:该论文旨在解决基于Transformer的医学图像分割方法因计算成本过高而难以在临床环境中部署的问题。其核心解决方案是提出一种轻量级框架TM-UNet,关键创新在于引入多尺度令牌记忆(multi-scale token-memory, MSTM)模块:该模块通过策略性空间扫描将二维空间特征转换为令牌序列,并利用矩阵记忆单元选择性地保留和传播判别性上下文信息,形成具有线性复杂度的动态知识存储机制,从而实现高效的全局推理而无需冗余计算;同时,MSTM还结合指数门控机制识别令牌有效性,并通过并行池化操作实现无额外计算开销的多尺度上下文提取,有效支持层次化表示学习。

链接: https://arxiv.org/abs/2511.12270
作者: Yaxuan Jiao,Qing Xu,Yuxiang Luo,Xiangjian He,Zhen Chen,Wenting Duan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical image segmentation is essential for clinical diagnosis and treatment planning. Although transformer-based methods have achieved remarkable results, their high computational cost hinders clinical deployment. To address this issue, we propose TM-UNet, a novel lightweight framework that integrates token sequence modeling with an efficient memory mechanism for efficient medical segmentation. Specifically, we introduce a multi-scale token-memory (MSTM) block that transforms 2D spatial features into token sequences through strategic spatial scanning, leveraging matrix memory cells to selectively retain and propagate discriminative contextual information across tokens. This novel token-memory mechanism acts as a dynamic knowledge store that captures long-range dependencies with linear complexity, enabling efficient global reasoning without redundant computation. Our MSTM block further incorporates exponential gating to identify token effectiveness and multi-scale contextual extraction via parallel pooling operations, enabling hierarchical representation learning without computational overhead. Extensive experiments demonstrate that TM-UNet outperforms state-of-the-art methods across diverse medical segmentation tasks with substantially reduced computation cost. The code is available at this https URL.
zh

[CV-221] ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks

【速读】:该论文旨在解决超分辨率遥感(Ultra-high-resolution remote sensing, UHR RS)图像在处理过程中因细粒度信息丰富而导致的冗余与效率低下问题,特别是现有动态分辨率调整和标记剪枝方法受限于被动感知范式、难以有效利用高分辨率输入的问题。其解决方案的关键在于提出一种主动感知(active perception)范式,通过引入LRS-GRO大规模基准数据集(包含17类多层级问题标注)和ZoomEarth自适应裁剪-缩放框架,其中核心创新是Region-Guided奖励机制,能够引导模型聚焦于信息丰富的区域,并结合监督微调(SFT)与组相对策略优化(GRPO)进行训练,从而实现对UHR RS图像的高效精准处理,在零样本设置下亦表现出卓越性能,并具备良好的下游任务扩展能力。

链接: https://arxiv.org/abs/2511.12267
作者: Ruixun Liu,Bowen Fu,Jiayi Song,Kaiyu Li,Wanchen Li,Lanxuan Xue,Hui Qiao,Weizhan Zhang,Deyu Meng,Xiangyong Cao
机构: Xi’an Jiaotong University (西安交通大学); China Telecom Shaanxi Branch (中国电信陕西分公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Ultra-high-resolution (UHR) remote sensing (RS) images offer rich fine-grained information but also present challenges in effective processing. Existing dynamic resolution and token pruning methods are constrained by a passive perception paradigm, suffering from increased redundancy when obtaining finer visual inputs. In this work, we explore a new active perception paradigm that enables models to revisit information-rich regions. First, we present LRS-GRO, a large-scale benchmark dataset tailored for active perception in UHR RS processing, encompassing 17 question types across global, region, and object levels, annotated via a semi-automatic pipeline. Building on LRS-GRO, we propose ZoomEarth, an adaptive cropping-zooming framework with a novel Region-Guided reward that provides fine-grained guidance. Trained via supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), ZoomEarth achieves state-of-the-art performance on LRS-GRO and, in the zero-shot setting, on three public UHR remote sensing benchmarks. Furthermore, ZoomEarth can be seamlessly integrated with downstream models for tasks such as cloud removal, denoising, segmentation, and image editing through simple tool interfaces, demonstrating strong versatility and extensibility.
zh

[CV-222] Calibrated Adversarial Sampling: Multi-Armed Bandit-Guided Generalization Against Unforeseen Attacks

【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)在面对多种对抗性扰动时的鲁棒性不足问题,尤其是现有对抗训练(Adversarial Training, AT)框架通常仅针对单一或有限类型的攻击进行优化,导致模型在实际部署中仍可能受到未被训练覆盖的攻击类型影响。解决方案的关键在于提出一种高效的微调方法——校准对抗采样(Calibrated Adversarial Sampling, CAS),其核心思想是从多臂老虎机(multi-armed bandit)的优化视角出发,动态设计奖励机制,并通过考虑多个鲁棒性维度之间的动态性和相互依赖关系,实现探索与利用的平衡,从而提升模型在多种攻击场景下的综合鲁棒性,同时保持较高的干净准确率(clean accuracy)。

链接: https://arxiv.org/abs/2511.12265
作者: Rui Wang,Zeming Wei,Xiyue Zhang,Meng Sun
机构: Peking University (北京大学); University of Bristol (布里斯托大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Deep Neural Networks (DNNs) are known to be vulnerable to various adversarial perturbations. To address the safety concerns arising from these vulnerabilities, adversarial training (AT) has emerged as one of the most effective paradigms for enhancing the robustness of DNNs. However, existing AT frameworks primarily focus on a single or a limited set of attack types, leaving DNNs still exposed to attack types that may be encountered in practice but not addressed during training. In this paper, we propose an efficient fine-tuning method called Calibrated Adversarial Sampling (CAS) to address these issues. From the optimization perspective within the multi-armed bandit framework, it dynamically designs rewards and balances exploration and exploitation by considering the dynamic and interdependent characteristics of multiple robustness dimensions. Experiments on benchmark datasets show that CAS achieves superior overall robustness while maintaining high clean accuracy, providing a new paradigm for robust generalization of DNNs.
zh

[CV-223] CrossVid: A Comprehensive Benchmark for Evaluating Cross-Video Reasoning in Multimodal Large Language Models

【速读】:该论文旨在解决当前多模态大语言模型(MLLMs)在跨视频推理(Cross-Video Reasoning, CVR)能力评估方面的缺失问题,即现有视频理解基准大多局限于单视频分析,难以全面衡量模型在多视频场景下进行时空信息整合与对比推理的能力。其解决方案的关键在于提出首个系统性评估框架CrossVid,该框架包含四个高阶维度和十项具体任务,覆盖真实世界中复杂的CVR场景,并提供5,331个视频及9,015个挑战性问答对,从而推动对MLLMs跨视频时空推理能力的深入研究与提升。

链接: https://arxiv.org/abs/2511.12263
作者: Jingyao Li,Jingyun Wang,Molin Tan,Haochen Wang,Cilin Yan,Likun Shi,Jiayin Cai,Xiaolong Jiang,Yao Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 30 pages, 28 figures

点击查看摘要

Abstract:Cross-Video Reasoning (CVR) presents a significant challenge in video understanding, which requires simultaneous understanding of multiple videos to aggregate and compare information across groups of videos. Most existing video understanding benchmarks focus on single-video analysis, failing to assess the ability of multimodal large language models (MLLMs) to simultaneously reason over various videos. Recent benchmarks evaluate MLLMs’ capabilities on multi-view videos that capture different perspectives of the same scene. However, their limited tasks hinder a thorough assessment of MLLMs in diverse real-world CVR scenarios. To this end, we introduce CrossVid, the first benchmark designed to comprehensively evaluate MLLMs’ spatial-temporal reasoning ability in cross-video contexts. Firstly, CrossVid encompasses a wide spectrum of hierarchical tasks, comprising four high-level dimensions and ten specific tasks, thereby closely reflecting the complex and varied nature of real-world video understanding. Secondly, CrossVid provides 5,331 videos, along with 9,015 challenging question-answering pairs, spanning single-choice, multiple-choice, and open-ended question formats. Through extensive experiments on various open-source and closed-source MLLMs, we observe that Gemini-2.5-Pro performs best on CrossVid, achieving an average accuracy of 50.4%. Notably, our in-depth case study demonstrates that most current MLLMs struggle with CVR tasks, primarily due to their inability to integrate or compare evidence distributed across multiple videos for reasoning. These insights highlight the potential of CrossVid to guide future advancements in enhancing MLLMs’ CVR capabilities.
zh

[CV-224] A Disease-Aware Dual-Stage Framework for Chest X-ray Report Generation AAAI2026

【速读】:该论文旨在解决当前生成式AI在胸部X光片报告生成任务中存在的两大关键问题:一是视觉表征缺乏疾病敏感性(disease-awareness),导致模型难以捕捉关键病理特征;二是视觉与语言模态间对齐不足,影响临床准确性。解决方案的核心在于提出一种双阶段疾病感知框架:第一阶段通过交叉注意力机制和多标签分类学习与特定病种对应的疾病感知语义令牌(DASTs),并利用对比学习实现视觉-语言表征对齐;第二阶段引入疾病-视觉注意力融合模块(DVAF)整合疾病感知表示与视觉特征,并设计双模态相似性检索机制(DMSR)结合视觉与疾病特异性相似度以获取上下文引导,从而显著提升报告的临床准确性和语言质量。

链接: https://arxiv.org/abs/2511.12259
作者: Puzhen Wu,Hexin Dong,Yi Lin,Yihao Ding,Yifan Peng
机构: Weill Cornell Medicine (威尔康奈尔医学院); University of Western Australia (西澳大利亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at AAAI 2026

点击查看摘要

Abstract:Radiology report generation from chest X-rays is an important task in artificial intelligence with the potential to greatly reduce radiologists’ workload and shorten patient wait times. Despite recent advances, existing approaches often lack sufficient disease-awareness in visual representations and adequate vision-language alignment to meet the specialized requirements of medical image analysis. As a result, these models usually overlook critical pathological features on chest X-rays and struggle to generate clinically accurate reports. To address these limitations, we propose a novel dual-stage disease-aware framework for chest X-ray report generation. In Stage~1, our model learns Disease-Aware Semantic Tokens (DASTs) corresponding to specific pathology categories through cross-attention mechanisms and multi-label classification, while simultaneously aligning vision and language representations via contrastive learning. In Stage~2, we introduce a Disease-Visual Attention Fusion (DVAF) module to integrate disease-aware representations with visual features, along with a Dual-Modal Similarity Retrieval (DMSR) mechanism that combines visual and disease-specific similarities to retrieve relevant exemplars, providing contextual guidance during report generation. Extensive experiments on benchmark datasets (i.e., CheXpert Plus, IU X-ray, and MIMIC-CXR) demonstrate that our disease-aware framework achieves state-of-the-art performance in chest X-ray report generation, with significant improvements in clinical accuracy and linguistic quality.
zh

[CV-225] Prompt-Conditioned FiLM and Multi-Scale Fusion on MedSigLIP for Low-Dose CT Quality Assessment

【速读】:该论文旨在解决低剂量计算机断层扫描(Low-Dose Computed Tomography, LDCT)图像质量评估中数据稀缺与模型泛化能力不足的问题。解决方案的关键在于提出一种基于MedSigLIP的提示条件框架,通过特征级线性调制(Feature-wise Linear Modulation, FiLM)注入文本先验信息,并结合多尺度池化策略,使patch-token特征能够根据临床意图进行动态调整,从而实现高效学习与快速适应。该方法利用独立回归头分别提取全局、局部和纹理感知特征,并通过轻量级多层感知机(MLP)融合,训练时采用成对排名损失(pairwise ranking loss),最终在LDCTIQA2023公开数据集上显著优于现有最优方法,验证了提示引导机制的有效性。

链接: https://arxiv.org/abs/2511.12256
作者: Tolga Demiroglu(1),Mehmet Ozan Unal(1),Metin Ertas(2),Isa Yildirim(1) ((1) Electronics and Communication Engineering Department, Istanbul Technical University, Istanbul, Turkey, (2) Istanbul University, Istanbul, Turkey)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:We propose a prompt-conditioned framework built on MedSigLIP that injects textual priors via Feature-wise Linear Modulation (FiLM) and multi-scale pooling. Text prompts condition patch-token features on clinical intent, enabling data-efficient learning and rapid adaptation. The architecture combines global, local, and texture-aware pooling through separate regression heads fused by a lightweight MLP, trained with pairwise ranking loss. Evaluated on the LDCTIQA2023 (a public LDCT quality assessment challenge) with 1,000 training images, we achieve PLCC = 0.9575, SROCC = 0.9561, and KROCC = 0.8301, surpassing the top-ranked published challenge submissions and demonstrating the effectiveness of our prompt-guided approach.
zh

[CV-226] Fusionista2.0: Efficiency Retrieval System for Large-Scale Datasets

【速读】:该论文旨在解决视频浏览器挑战赛(Video Browser Showdown, VBS)中系统在严格时间约束下实现高精度检索的难题。其解决方案的关键在于构建一个名为Fusionista2.0的高效视频检索系统,通过模块级优化实现速度与可用性的双重提升:核心流程重构为基于ffmpeg的快速关键帧提取、基于Vintern-1B-v3.5的多语言文本识别以及基于faster-whisper的实时语音转录;同时引入轻量级视觉-语言模型用于问答响应,避免大型模型带来的计算开销;此外,重新设计了用户界面以增强响应性、可访问性和工作流效率,使非专业用户也能快速获取相关视频内容。实验表明,该方案将检索时间缩短达75%,同时保持准确率和用户满意度提升,验证了其在大规模视频搜索场景下的竞争力与实用性。

链接: https://arxiv.org/abs/2511.12255
作者: Huy M. Le,Dat Tien Nguyen,Phuc Binh Nguyen,Gia-Bao Le-Tran,Phu Truong Thien,Cuong Dinh,Minh Nguyen,Nga Nguyen,Thuy T. N. Nguyen,Huy Gia Ngo,Tan Nhat Nguyen,Binh T. Nguyen,Monojit Choudhury
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The Video Browser Showdown (VBS) challenges systems to deliver accurate results under strict time constraints. To meet this demand, we present Fusionista2.0, a streamlined video retrieval system optimized for speed and usability. All core modules were re-engineered for efficiency: preprocessing now relies on ffmpeg for fast keyframe extraction, optical character recognition uses Vintern-1B-v3.5 for robust multilingual text recognition, and automatic speech recognition employs faster-whisper for real-time transcription. For question answering, lightweight vision-language models provide quick responses without the heavy cost of large models. Beyond these technical upgrades, Fusionista2.0 introduces a redesigned user interface with improved responsiveness, accessibility, and workflow efficiency, enabling even non-expert users to retrieve relevant content rapidly. Evaluations demonstrate that retrieval time was reduced by up to 75% while accuracy and user satisfaction both increased, confirming Fusionista2.0 as a competitive and user-friendly system for large-scale video search.
zh

[CV-227] AURA: Development and Validation of an Augmented Unplanned Removal Alert System using Synthetic ICU Videos

【速读】:该论文旨在解决重症监护病房(ICU)中意外气管拔管(Unplanned Extubation, UE)这一关键患者安全问题,其传统实时检测方法受限于标注ICU视频数据的伦理与隐私挑战。解决方案的关键在于提出一种名为AURA(Augmented Unplanned Removal Alert)的视觉风险检测系统,该系统完全基于全合成视频数据集进行开发与验证,利用文本到视频扩散模型生成多样化且临床真实的ICU场景,通过姿态估计识别两类高风险行为:碰撞(hand entry into spatial zones near airway tubes)和躁动(agitation, quantified by velocity of tracked anatomical keypoints),从而实现隐私保护、可复现的患者安全监测。

链接: https://arxiv.org/abs/2511.12241
作者: Junhyuk Seo,Hyeyoon Moon,Kyu-Hwan Jung,Namkee Oh,Taerim Kim
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 5 figures

点击查看摘要

Abstract:Unplanned extubation (UE) remains a critical patient safety concern in intensive care units (ICUs), often leading to severe complications or death. Real-time UE detection has been limited, largely due to the ethical and privacy challenges of obtaining annotated ICU video data. We propose Augmented Unplanned Removal Alert (AURA), a vision-based risk detection system developed and validated entirely on a fully synthetic video dataset. By leveraging text-to-video diffusion, we generated diverse and clinically realistic ICU scenarios capturing a range of patient behaviors and care contexts. The system applies pose estimation to identify two high-risk movement patterns: collision, defined as hand entry into spatial zones near airway tubes, and agitation, quantified by the velocity of tracked anatomical keypoints. Expert assessments confirmed the realism of the synthetic data, and performance evaluations showed high accuracy for collision detection and moderate performance for agitation recognition. This work demonstrates a novel pathway for developing privacy-preserving, reproducible patient safety monitoring systems with potential for deployment in intensive care settings.
zh

[CV-228] Model Inversion Attack Against Deep Hashing

【速读】:该论文旨在解决深度哈希(Deep Hashing)模型中存在的隐私泄露问题,即通过哈希码(hash codes)可能被用于重建原始训练数据,从而引发生物特征伪造和隐私泄露等安全风险。现有模型反演攻击方法难以适用于深度哈希系统,主要受限于真实训练哈希码不可获取以及哈希空间的高度离散性。为应对这一挑战,论文提出DHMI——首个基于扩散模型的深度哈希反演框架。其关键创新在于:首先利用辅助数据集聚类生成语义哈希中心作为替代锚点;其次设计一种融合分类一致性与哈希邻近度的新型攻击指标,动态筛选候选样本;最后借助一组替代模型(surrogate models)引导候选样本的精细化重构,从而在无任何训练哈希码的黑盒场景下仍能成功生成高保真、语义一致的高质量图像,显著优于现有最先进模型反演攻击方法,揭示了深度哈希系统中潜在的重大隐私风险。

链接: https://arxiv.org/abs/2511.12233
作者: Dongdong Zhao,Qiben Xu,Ranxin Fang,Baogang Song
机构: Wuhan University of Technology (武汉理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep hashing improves retrieval efficiency through compact binary codes, yet it introduces severe and often overlooked privacy risks. The ability to reconstruct original training data from hash codes could lead to serious threats such as biometric forgery and privacy breaches. However, model inversion attacks specifically targeting deep hashing models remain unexplored, leaving their security implications unexamined. This research gap stems from the inaccessibility of genuine training hash codes and the highly discrete Hamming space, which prevents existing methods from adapting to deep hashing. To address these challenges, we propose DHMI, the first diffusion-based model inversion framework designed for deep hashing. DHMI first clusters an auxiliary dataset to derive semantic hash centers as surrogate anchors. It then introduces a surrogate-guided denoising optimization method that leverages a novel attack metric (fusing classification consistency and hash proximity) to dynamically select candidate samples. A cluster of surrogate models guides the refinement of these candidates, ensuring the generation of high-fidelity and semantically consistent images. Experiments on multiple datasets demonstrate that DHMI successfully reconstructs high-resolution, high-quality images even under the most challenging black-box setting, where no training hash codes are available. Our method outperforms the existing state-of-the-art model inversion attacks in black-box scenarios, confirming both its practical efficacy and the critical privacy risks inherent in deep hashing systems.
zh

[CV-229] Suppressing VLM Hallucinations with Spectral Representation Filtering

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在图像描述任务中因过度依赖语言先验和跨模态对齐不准确而导致的幻觉问题,即生成与图像内容不符的对象、属性或关系描述。解决方案的关键在于提出一种轻量级、无需训练的谱表示过滤方法(Spectral Representation Filtering, SRF),其核心机制是通过分析真实与幻觉描述特征差异的协方差结构,利用特征分解识别出低秩的幻觉模式,并通过软谱滤波器在深层VLM层的前馈投影权重中抑制这些模式,从而在保持语义保真度的同时均衡特征方差,实现无推理开销的后处理去幻觉优化。

链接: https://arxiv.org/abs/2511.12220
作者: Ameen Ali,Tamim Zoabi,Lior Wolf
机构: Tel Aviv University (特拉维夫大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) frequently produce hallucinations in the form of descriptions of objects, attributes, or relations that do not exist in the image due to over-reliance on language priors and imprecise cross-modal grounding. We introduce Spectral Representation Filtering (SRF), a lightweight, training-free method to suppress such hallucinations by analyzing and correcting the covariance structure of the model’s representations. SRF identifies low-rank hallucination modes through eigendecomposition of the covariance of the differences between features collected for truthful and hallucinatory captions, revealing structured biases in the feature space. A soft spectral filter then attenuates these modes in the feed-forward projection weights of deeper vLLM layers, equalizing feature variance while preserving semantic fidelity. Unlike decoding or retraining-based approaches, SRF operates entirely post-hoc, incurs zero inference overhead, and requires no architectural modifications. Across three families of VLMs (LLaVA-1.5, MiniGPT-4, and mPLUG-Owl2), SRF consistently reduces hallucination rates on MSCOCO, POPE-VQA, and other visual tasks benchmarks, achieving state-of-the-art faithfulness without degrading caption quality.
zh

[CV-230] FaNe: Towards Fine-Grained Cross-Modal Contrast with False-Negative Reduction and Text-Conditioned Sparse Attention AAAI2026

【速读】:该论文旨在解决医学视觉-语言预训练(Medical Vision-Language Pre-training, VLP)中因语义相似文本导致的假负例(False Negatives, FaNe)问题,以及跨模态细粒度对齐不足的问题。其解决方案的关键在于:首先,提出一种基于文本间语义相似度的自适应归一化正样本挖掘策略,以缓解假负例干扰;其次,设计了一个文本条件稀疏注意力池化模块,通过文本提示引导局部视觉表征实现细粒度图像-文本对齐;最后,引入一种硬负例感知对比损失,自适应重加权语义相近的负样本,从而增强模态内区分能力。这些改进共同提升了模型在医学图像分类、目标检测和语义分割等下游任务上的性能。

链接: https://arxiv.org/abs/2511.12215
作者: Peng Zhang,Zhihui Lai,Wenting Chen,Xu Wu,Heng Kong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI 2026

点击查看摘要

Abstract:Medical vision-language pre-training (VLP) offers significant potential for advancing medical image understanding by leveraging paired image-report data. However, existing methods are limited by False Negatives (FaNe) induced by semantically similar texts and insufficient fine-grained cross-modal alignment. To address these limitations, we propose FaNe, a semantic-enhanced VLP framework. To mitigate false negatives, we introduce a semantic-aware positive pair mining strategy based on text-text similarity with adaptive normalization. Furthermore, we design a text-conditioned sparse attention pooling module to enable fine-grained image-text alignment through localized visual representations guided by textual cues. To strengthen intra-modal discrimination, we develop a hard-negative aware contrastive loss that adaptively reweights semantically similar negatives. Extensive experiments on five downstream medical imaging benchmarks demonstrate that FaNe achieves state-of-the-art performance across image classification, object detection, and semantic segmentation, validating the effectiveness of our framework.
zh

[CV-231] Mixture of States: Routing Token-Level Dynamics for Multimodal Generation

【速读】:该论文旨在解决多模态扩散模型中模态融合效率与精度不足的问题,尤其是如何在不同时间步和输入条件下实现细粒度的跨模态特征对齐。其解决方案的关键在于提出了一种名为MoS(Mixture of States)的新颖融合范式,核心是一个可学习的、基于token级别的路由器(router),该路由器能够根据去噪时间步和输入内容动态选择最优的隐藏状态进行交互,从而精确匹配token级特征与扩散轨迹。该路由器通过ε-greedy策略训练,稀疏地选取top-k隐藏状态,在仅需极少可学习参数和计算开销的前提下,实现了高效且精准的跨模态信息融合。

链接: https://arxiv.org/abs/2511.12207
作者: Haozhe Liu,Ding Liu,Mingchen Zhuge,Zijian Zhou,Tian Xie,Sen He,Yukang Yang,Shuming Liu,Yuren Cong,Jiadong Guo,Hongyu Xu,Ke Xu,Kam-Woh Ng,Juan C. Pérez,Juan-ManuelPérez-Rúa,Tao Xiang,Wei Liu,Shikun Liu,Jürgen Schmidhuber
机构: King Abdullah University of Science and Technology (KAUST); Meta
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce MoS (Mixture of States), a novel fusion paradigm for multimodal diffusion models that merges modalities using flexible, state-based interactions. The core of MoS is a learnable, token-wise router that creates denoising timestep- and input-dependent interactions between modalities’ hidden states, precisely aligning token-level features with the diffusion trajectory. This router sparsely selects the top- k hidden states and is trained with an \epsilon -greedy strategy, efficiently selecting contextual features with minimal learnable parameters and negligible computational overhead. We validate our design with text-to-image generation (MoS-Image) and editing (MoS-Editing), which achieve state-of-the-art results. With only 3B to 5B parameters, our models match or surpass counterparts up to 4\times larger. These findings establish MoS as a flexible and compute-efficient paradigm for scaling multimodal diffusion models.
zh

[CV-232] A Novel AI-Driven System for Real-Time Detection of Mirror Absence Helmet Non-Compliance and License Plates Using YOLOv8 and OCR DATE

【速读】:该论文旨在解决交通违规行为(如未佩戴头盔、摩托车缺少后视镜等)人工执法效率低且一致性差的问题,从而提升道路安全管理水平。其解决方案的关键在于构建一个基于人工智能的自动化检测系统,核心组件包括:使用YOLOv8进行高精度目标检测以识别违规对象,结合EasyOCR实现车牌信息提取,并通过图像预处理优化在复杂场景下的识别效果;此外,系统采用Streamlit开发可视化界面,支持实时监控与违规记录,最终在自建标注数据集上实现了0.9147的平均精度、0.886的召回率以及0.843的mAP@50指标,验证了方案在实际部署中的有效性。

链接: https://arxiv.org/abs/2511.12206
作者: Nishant Vasantkumar Hegde,Aditi Agarwal,Minal Moharir
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 4 figures. Published in: Proceedings of the 12th International Conference on Emerging Trends in Engineering Technology Signal and Information Processing (ICETET SIP 2025) Note: The conference proceedings contain an outdated abstract due to a publisher-side error. This arXiv version includes the correct and updated abstract

点击查看摘要

Abstract:Road safety is a critical global concern, with manual enforcement of helmet laws and vehicle safety standards (e.g., rear-view mirror presence) being resource-intensive and inconsistent. This paper presents an AI-powered system to automate traffic violation detection, significantly enhancing enforcement efficiency and road safety. The system leverages YOLOv8 for robust object detection and EasyOCR for license plate recognition. Trained on a custom dataset of annotated images (augmented for diversity), it identifies helmet non-compliance, the absence of rear-view mirrors on motorcycles, an innovative contribution to automated checks, and extracts vehicle registration numbers. A Streamlit-based interface facilitates real-time monitoring and violation logging. Advanced image preprocessing enhances license plate recognition, particularly under challenging conditions. Based on evaluation results, the model achieves an overall precision of 0.9147, a recall of 0.886, and a mean Average Precision (mAP@50) of 0.843. The mAP@50 95 of 0.503 further indicates strong detection capability under stricter IoU thresholds. This work demonstrates a practical and effective solution for automated traffic rule enforcement, with considerations for real-world deployment discussed.
zh

[CV-233] GeoMVD: Geometry-Enhanced Multi-View Generation Model Based on Geometric Information Extraction

【速读】:该论文旨在解决多视角图像生成中面临的两大核心问题:一是现有方法在扩展单图生成模型时难以维持跨视角的一致性,二是生成高分辨率图像时计算复杂度高且细节恢复不足。解决方案的关键在于提出几何引导的多视角扩散模型(Geometry-guided Multi-View Diffusion Model),其创新点包括:1)设计多视角几何信息提取模块,利用深度图(depth maps)、法向量图(normal maps)和前景分割掩码(foreground segmentation masks)构建共享几何结构,以保障不同视角间的形状与结构一致性;2)引入解耦的几何增强注意力机制(decoupled geometry-enhanced attention mechanism),强化对关键几何特征的关注,提升图像细节保真度;3)结合自适应学习策略与迭代细化过程,优化空间关系建模与视觉连贯性,同时通过动态调整几何信息强度机制平衡生成质量与自然性。

链接: https://arxiv.org/abs/2511.12204
作者: Jiaqi Wu,Yaosen Chen,Shuyuan Zhu
机构: University of Electronic Science and Technology of China (电子科技大学); Sobey Media Intelligence Laboratory (搜必达媒体智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-view image generation holds significant application value in computer vision, particularly in domains like 3D reconstruction, virtual reality, and augmented reality. Most existing methods, which rely on extending single images, face notable computational challenges in maintaining cross-view consistency and generating high-resolution outputs. To address these issues, we propose the Geometry-guided Multi-View Diffusion Model, which incorporates mechanisms for extracting multi-view geometric information and adjusting the intensity of geometric features to generate images that are both consistent across views and rich in detail. Specifically, we design a multi-view geometry information extraction module that leverages depth maps, normal maps, and foreground segmentation masks to construct a shared geometric structure, ensuring shape and structural consistency across different views. To enhance consistency and detail restoration during generation, we develop a decoupled geometry-enhanced attention mechanism that strengthens feature focus on key geometric details, thereby improving overall image quality and detail preservation. Furthermore, we apply an adaptive learning strategy that fine-tunes the model to better capture spatial relationships and visual coherence between the generated views, ensuring realistic results. Our model also incorporates an iterative refinement process that progressively improves the output quality through multiple stages of image generation. Finally, a dynamic geometry information intensity adjustment mechanism is proposed to adaptively regulate the influence of geometric data, optimizing overall quality while ensuring the naturalness of generated images. More details can be found on the project page: this https URL.
zh

[CV-234] LSS3D: Learnable Spatial Shifting for Consistent and High-Quality 3D Generation from Single-Image

【速读】:该论文旨在解决多视角扩散模型在3D生成中普遍存在的形状与纹理不一致问题,以及对非正面视角输入鲁棒性差的问题,这些问题会导致几何细节不完整和纹理伪影(textural ghosting)。解决方案的关键在于提出一种名为LSS3D的图像到3D生成方法,其核心创新是引入可学习的空间偏移(learnable spatial shifting)机制:为每张输入视图分配可学习的空间偏移参数,并通过重建的网格引导各视图向一个空间一致的目标调整,从而实现高质量的3D重建,同时将输入视图作为额外约束条件,显著提升对非正面视角(尤其是仰视角度)的鲁棒性。

链接: https://arxiv.org/abs/2511.12202
作者: Zhuojiang Cai,Yiheng Zhang,Meitong Guo,Mingdao Wang,Yuwang Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, multi-view diffusion-based 3D generation methods have gained significant attention. However, these methods often suffer from shape and texture misalignment across generated multi-view images, leading to low-quality 3D generation results, such as incomplete geometric details and textural ghosting. Some methods are mainly optimized for the frontal perspective and exhibit poor robustness to oblique perspective inputs. In this paper, to tackle the above challenges, we propose a high-quality image-to-3D approach, named LSS3D, with learnable spatial shifting to explicitly and effectively handle the multiview inconsistencies and non-frontal input view. Specifically, we assign learnable spatial shifting parameters to each view, and adjust each view towards a spatially consistent target, guided by the reconstructed mesh, resulting in high-quality 3D generation with more complete geometric details and clean textures. Besides, we include the input view as an extra constraint for the optimization, further enhancing robustness to non-frontal input angles, especially for elevated viewpoint inputs. We also provide a comprehensive quantitative evaluation pipeline that can contribute to the community in performance comparisons. Extensive experiments demonstrate that our method consistently achieves leading results in both geometric and texture evaluation metrics across more flexible input viewpoints.
zh

[CV-235] OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLM s AAAI2026

【速读】:该论文旨在解决现有稀疏注意力方法在训练与推理阶段存在性能差距、缺乏多维细粒度令牌选择能力(如查询、键值对和头维度)的问题,从而导致性能不佳和加速效果有限。其解决方案的关键在于提出 OmniSparse 框架,该框架具备训练感知特性并支持动态令牌预算分配,包含三个自适应且互补的机制:(1) 通过懒惰-活跃分类实现查询选择,保留语义相似性强的活跃查询,剔除冗余的局部上下文查询;(2) 基于最平坦头确定共享预算并均匀分配至所有头,实现头级别的动态预算分配以保障注意力召回率;(3) 通过选择性获取视觉键值缓存减少头级别冗余,依据头级解码查询模式进行缓存精简。实验表明,OmniSparse 在保持全注意力性能的同时,在预填充阶段实现最高 2.7 倍加速,解码阶段内存减少达 2.4 倍。

链接: https://arxiv.org/abs/2511.12201
作者: Feng Chen,Yefei He,Shaoxuan He,Yuanyu He,Jing Liu,Lequan Lin,Akide Liu,Zhaoyang Li,Jiyuan Zhang,Zhenbang Sun,Bohan Zhuang,Qi Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI2026

点击查看摘要

Abstract:Existing sparse attention methods primarily target inference-time acceleration by selecting critical tokens under predefined sparsity patterns. However, they often fail to bridge the training-inference gap and lack the capacity for fine-grained token selection across multiple dimensions such as queries, key-values (KV), and heads, leading to suboptimal performance and limited acceleration gains. In this paper, we introduce OmniSparse, a training-aware fine-grained sparse attention framework for long-video MLLMs, which operates in both training and inference with dynamic token budget allocation. Specifically, OmniSparse contains three adaptive and complementary mechanisms: (1) query selection via lazy-active classification, retaining active queries that capture broad semantic similarity while discarding most lazy ones that focus on limited local context and exhibit high functional redundancy; (2) KV selection with head-level dynamic budget allocation, where a shared budget is determined based on the flattest head and applied uniformly across all heads to ensure attention recall; and (3) KV cache slimming to reduce head-level redundancy by selectively fetching visual KV cache according to the head-level decoding query pattern. Experimental results show that OmniSparse matches the performance of full attention while achieving up to 2.7x speedup during prefill and 2.4x memory reduction during decoding.
zh

[CV-236] Bridging Granularity Gaps: Hierarchical Semantic Learning for Cross-domain Few-shot Segmentation AAAI2026

【速读】:该论文针对跨域少样本分割(Cross-domain Few-shot Segmentation, CD-FSS)中现有方法主要关注源域与目标域之间的风格差异,而忽视了分割粒度差异的问题展开研究。这种忽略导致模型在目标域中新类别上的语义判别能力不足。解决方案的关键在于提出一种分层语义学习(Hierarchical Semantic Learning, HSL)框架,其核心包括两个模块:一是双风格随机化(Dual Style Randomization, DSR)模块,通过前景和全局风格随机化模拟目标域中多样的前景-背景风格差异及整体风格变化;二是分层语义挖掘(Hierarchical Semantic Mining, HSM)模块,利用多尺度超像素引导模型在不同粒度上挖掘类内一致性与类间区分性。此外,还引入原型置信度调制阈值(Prototype Confidence-modulated Thresholding, PCMT)模块以缓解前景与背景过于相似时的分割模糊问题。实验表明,该方法在四个主流目标域数据集上均达到当前最优性能。

链接: https://arxiv.org/abs/2511.12200
作者: Sujun Sun,Haowen Gu,Cheng Xie,Yanxu Ren,Mingwu Ren,Haofeng Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Cross-domain Few-shot Segmentation (CD-FSS) aims to segment novel classes from target domains that are not involved in training and have significantly different data distributions from the source domain, using only a few annotated samples, and recent years have witnessed significant progress on this task. However, existing CD-FSS methods primarily focus on style gaps between source and target domains while ignoring segmentation granularity gaps, resulting in insufficient semantic discriminability for novel classes in target domains. Therefore, we propose a Hierarchical Semantic Learning (HSL) framework to tackle this problem. Specifically, we introduce a Dual Style Randomization (DSR) module and a Hierarchical Semantic Mining (HSM) module to learn hierarchical semantic features, thereby enhancing the model’s ability to recognize semantics at varying granularities. DSR simulates target domain data with diverse foreground-background style differences and overall style variations through foreground and global style randomization respectively, while HSM leverages multi-scale superpixels to guide the model to mine intra-class consistency and inter-class distinction at different granularities. Additionally, we also propose a Prototype Confidence-modulated Thresholding (PCMT) module to mitigate segmentation ambiguity when foreground and background are excessively similar. Extensive experiments are conducted on four popular target domain datasets, and the results demonstrate that our method achieves state-of-the-art performance.
zh

[CV-237] Cross-View Cross-Modal Unsupervised Domain Adaptation for Driver Monitoring System

【速读】:该论文旨在解决驾驶者分心检测模型在真实场景部署中面临的两大挑战:跨视角(cross-view)差异和域迁移(domain shift),如传感器模态或环境变化导致的性能下降问题。现有方法通常仅单独处理其中一类问题,难以实现跨不同车辆配置的鲁棒且可扩展的模型部署。其解决方案的关键在于提出一种两阶段联合框架:第一阶段利用对比学习在多视角数据上提取视图不变且动作判别性强的特征;第二阶段通过信息瓶颈损失实现无监督域适应,将模型从源模态迁移到目标模态而无需目标域标签。该方法在DriveAct多模态驾驶行为数据集上验证,显著提升了RGB视频数据上的Top-1准确率,较基于监督对比学习的跨视图方法提升近50%,同时优于纯无监督域适应方法达5%。

链接: https://arxiv.org/abs/2511.12196
作者: Aditi Bhalla,Christian Hellert,Enkelejda Kasneci
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Driver distraction remains a leading cause of road traffic accidents, contributing to thousands of fatalities annually across the globe. While deep learning-based driver activity recognition methods have shown promise in detecting such distractions, their effectiveness in real-world deployments is hindered by two critical challenges: variations in camera viewpoints (cross-view) and domain shifts such as change in sensor modality or environment. Existing methods typically address either cross-view generalization or unsupervised domain adaptation in isolation, leaving a gap in the robust and scalable deployment of models across diverse vehicle configurations. In this work, we propose a novel two-phase cross-view, cross-modal unsupervised domain adaptation framework that addresses these challenges jointly on real-time driver monitoring data. In the first phase, we learn view-invariant and action-discriminative features within a single modality using contrastive learning on multi-view data. In the second phase, we perform domain adaptation to a new modality using information bottleneck loss without requiring any labeled data from the new domain. We evaluate our approach using state-of-the art video transformers (Video Swin, MViT) and multi modal driver activity dataset called DriveAct, demonstrating that our joint framework improves top-1 accuracy on RGB video data by almost 50% compared to a supervised contrastive learning-based cross-view method, and outperforms unsupervised domain adaptation-only methods by up to 5%, using the same video transformer backbone.
zh

[CV-238] MMRINet: Efficient Mamba-Based Segmentation with Dual-Path Refinement for Low-Resource MRI Analysis

【速读】:该论文旨在解决在资源受限环境下多参数磁共振成像(multi-parametric MRI)中脑肿瘤自动分割的挑战,尤其针对深度3D网络因计算复杂度高而难以部署的问题。其解决方案的关键在于提出了一种轻量级架构MMRINet,通过将二次复杂度的注意力机制替换为线性复杂度的Mamba状态空间模型(Mamba state-space models),实现高效体素级上下文建模;同时引入双路径特征精炼(Dual-Path Feature Refinement, DPFR)模块以在不增加数据需求的前提下最大化特征多样性,并结合渐进式特征聚合(Progressive Feature Aggregation, PFA)实现有效的多尺度特征融合,从而在仅约2.5M参数下达到高精度分割性能(Dice平均得分0.752,HD95平均12.23),适用于低资源临床场景。

链接: https://arxiv.org/abs/2511.12193
作者: Abdelrahman Elsayed,Ahmed Jaheen,Mohammad Yaqub
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review at The IEEE International Symposium on Biomedical Imaging (ISBI 2026)

点击查看摘要

Abstract:Automated brain tumor segmentation in multi-parametric MRI remains challenging in resource-constrained settings where deep 3D networks are computationally prohibitive. We propose MMRINet, a lightweight architecture that replaces quadratic-complexity attention with linear-complexity Mamba state-space models for efficient volumetric context modeling. Novel Dual-Path Feature Refinement (DPFR) modules maximize feature diversity without additional data requirements, while Progressive Feature Aggregation (PFA) enables effective multi-scale fusion. In the BraTS-Lighthouse SSA 2025, our model achieves strong performance with an average Dice score of (0.752) and an average HD95 of (12.23) with only ~2.5M parameters, demonstrating efficient and accurate segmentation suitable for low-resource clinical environments. Our GitHub repository can be accessed here: this http URL.
zh

[CV-239] MixAR: Mixture Autoregressive Image Generation

【速读】:该论文旨在解决自回归(Autoregressive, AR)图像生成方法中因离散码本量化导致的细节信息丢失问题,从而限制生成图像保真度(fidelity)的瓶颈。传统AR方法将图像表示为有限码本中的离散token序列,虽高效但牺牲了细粒度信息;尽管连续潜在空间可提升质量,但其无结构特性使自回归建模效率低下。解决方案的关键在于提出MixAR框架,通过混合训练范式引入离散token作为先验引导来优化连续空间的自回归预测:具体采用DC-Mix策略(用信息丰富的离散token替代同质掩码token),实现计算效率与生成保真度的良好平衡,并结合Training-Inference Mixture(TI-Mix)机制统一训练与推理阶段的分布,从而显著提升生成一致性与质量。

链接: https://arxiv.org/abs/2511.12181
作者: Jinyuan Hu,Jiayou Zhang,Shaobo Cui,Kun Zhang,Guangyi Chen
机构: Tsinghua University (清华大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); École Polytechnique Fédérale de Lausanne (洛桑联邦理工学院); Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Autoregressive (AR) approaches, which represent images as sequences of discrete tokens from a finite codebook, have achieved remarkable success in image generation. However, the quantization process and the limited codebook size inevitably discard fine-grained information, placing bottlenecks on fidelity. Motivated by this limitation, recent studies have explored autoregressive modeling in continuous latent spaces, which offers higher generation quality. Yet, unlike discrete tokens constrained by a fixed codebook, continuous representations lie in a vast and unstructured space, posing significant challenges for efficient autoregressive modeling. To address these challenges, we introduce MixAR, a novel framework that leverages mixture training paradigms to inject discrete tokens as prior guidance for continuous AR modeling. MixAR is a factorized formulation that leverages discrete tokens as prior guidance for continuous autoregressive prediction. We investigate several discrete-continuous mixture strategies, including self-attention (DC-SA), cross-attention (DC-CA), and a simple approach (DC-Mix) that replaces homogeneous mask tokens with informative discrete counterparts. Moreover, to bridge the gap between ground-truth training tokens and inference tokens produced by the pre-trained AR model, we propose Training-Inference Mixture (TI-Mix) to achieve consistent training and generation distributions. In our experiments, we demonstrate a favorable balance of the DC-Mix strategy between computational efficiency and generation fidelity, and consistent improvement of TI-Mix.
zh

[CV-240] Rethinking Multimodal Point Cloud Completion: A Completion-by-Correction Perspective AAAI2026

【速读】:该论文旨在解决点云补全(Point Cloud Completion)任务中因严重遮挡和几何缺失导致的结构不一致与拓扑伪影问题。现有方法多采用“基于填充”的范式(Completion-by-Inpainting),即从融合后的潜在特征中合成缺失结构,但受限于几何与语义约束不足,常产生结构性错误。解决方案的关键在于提出一种新的“基于修正”的范式(Completion-by-Correction):以预训练图像到3D模型生成的拓扑完整形状先验为基础,通过特征空间中的修正操作使其与部分观测对齐,从而将无约束合成转变为受引导的精细化重构。基于此范式,作者进一步设计了PGNet框架,通过多阶段双特征编码、粗略结构对齐与分层几何细节修正,显著提升了重建质量,在ShapeNetViPC数据集上平均Chamfer Distance降低23.5%,F-score提升7.1%。

链接: https://arxiv.org/abs/2511.12170
作者: Wang Luo,Di Wu,Hengyuan Na,Yinlin Zhu,Miao Hu,Guocong Quan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Point cloud completion aims to reconstruct complete 3D shapes from partial observations, which is a challenging problem due to severe occlusions and missing geometry. Despite recent advances in multimodal techniques that leverage complementary RGB images to compensate for missing geometry, most methods still follow a Completion-by-Inpainting paradigm, synthesizing missing structures from fused latent features. We empirically show that this paradigm often results in structural inconsistencies and topological artifacts due to limited geometric and semantic constraints. To address this, we rethink the task and propose a more robust paradigm, termed Completion-by-Correction, which begins with a topologically complete shape prior generated by a pretrained image-to-3D model and performs feature-space correction to align it with the partial observation. This paradigm shifts completion from unconstrained synthesis to guided refinement, enabling structurally consistent and observation-aligned reconstruction. Building upon this paradigm, we introduce PGNet, a multi-stage framework that conducts dual-feature encoding to ground the generative prior, synthesizes a coarse yet structurally aligned scaffold, and progressively refines geometric details via hierarchical correction. Experiments on the ShapeNetViPC dataset demonstrate the superiority of PGNet over state-of-the-art baselines in terms of average Chamfer Distance (-23.5%) and F-score (+7.1%).
zh

[CV-241] Codebook-Centric Deep Hashing: End-to-End Joint Learning of Semantic Hash Centers and Neural Hash Function

【速读】:该论文旨在解决基于哈希中心(hash center)的深度哈希方法中,随机初始化哈希中心忽视类别间语义关系的问题,以及现有两阶段方法因分阶段优化导致的性能次优和计算复杂度增加的问题。其解决方案的关键在于提出一种端到端的中心重分配哈希(Center-Reassigned Hashing, CRH)框架,通过在训练过程中动态地从预设码本中重新分配哈希中心,无需显式的中心优化阶段即可自适应地调整中心以匹配数据分布,从而将语义信息无缝整合进哈希函数学习过程;同时引入**多头机制(multi-head mechanism)**增强哈希中心的表征能力,更有效地捕捉复杂的语义结构,最终在多个基准数据集上实现优于当前最优深度哈希方法的检索性能。

链接: https://arxiv.org/abs/2511.12162
作者: Shuo Yin,Zhiyuan Yin,Yuqing Hou,Rui Liu,Yong Chen,Dell Zhang
机构: 1. University of Science and Technology of China (中国科学技术大学); 2. Tsinghua University (清华大学); 3. Peking University (北京大学); 4. University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 14 pages

点击查看摘要

Abstract:Hash center-based deep hashing methods improve upon pairwise or triplet-based approaches by assigning fixed hash centers to each class as learning targets, thereby avoiding the inefficiency of local similarity optimization. However, random center initialization often disregards inter-class semantic relationships. While existing two-stage methods mitigate this by first refining hash centers with semantics and then training the hash function, they introduce additional complexity, computational overhead, and suboptimal performance due to stage-wise discrepancies. To address these limitations, we propose \textbfCenter-Reassigned Hashing (CRH) , an end-to-end framework that \textbfdynamically reassigns hash centers from a preset codebook while jointly optimizing the hash function. Unlike previous methods, CRH adapts hash centers to the data distribution \textbfwithout explicit center optimization phases , enabling seamless integration of semantic relationships into the learning process. Furthermore, \textbfa multi-head mechanism enhances the representational capacity of hash centers, capturing richer semantic structures. Extensive experiments on three benchmarks demonstrate that CRH learns semantically meaningful hash centers and outperforms state-of-the-art deep hashing methods in retrieval tasks.
zh

[CV-242] FIA-Edit: Frequency-Interactive Attention for Efficient and High-Fidelity Inversion-Free Text-Guided Image Editing AAAI2026

【速读】:该论文旨在解决文本引导图像编辑中因缺乏有效源信息整合而导致的背景保留差、空间不一致性和过度编辑等问题,尤其是在基于流(flow-based)的无反演方法中表现尤为突出。其解决方案的关键在于提出了一种新颖的无反演框架FIA-Edit,核心创新是引入频率交互注意力机制(Frequency-Interactive Attention),具体包含两个关键组件:一是频率表示交互模块(FRI),通过在自注意力中交换源图与目标图的频域特征以增强跨域对齐;二是特征注入模块(FIJ),显式将源侧查询(query)、键(key)、值(value)及文本嵌入注入目标分支的交叉注意力中,从而实现结构和语义的精准保留。该方法在保证高保真度和可控性的同时,计算效率显著优于现有方法(单张512×512图像仅需约6秒,RTX 4090)。

链接: https://arxiv.org/abs/2511.12151
作者: Kaixiang Yang,Boyang Shen,Xin Li,Yuchen Dai,Yuxuan Luo,Yueran Ma,Wei Fang,Qiang Li,Zhiwei Wang
机构: 1. Tsinghua University (清华大学); 2. Peking University (北京大学); 3. Chinese Academy of Sciences (中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI 2026

点击查看摘要

Abstract:Text-guided image editing has advanced rapidly with the rise of diffusion models. While flow-based inversion-free methods offer high efficiency by avoiding latent inversion, they often fail to effectively integrate source information, leading to poor background preservation, spatial inconsistencies, and over-editing due to the lack of effective integration of source information. In this paper, we present FIA-Edit, a novel inversion-free framework that achieves high-fidelity and semantically precise edits through a Frequency-Interactive Attention. Specifically, we design two key components: (1) a Frequency Representation Interaction (FRI) module that enhances cross-domain alignment by exchanging frequency components between source and target features within self-attention, and (2) a Feature Injection (FIJ) module that explicitly incorporates source-side queries, keys, values, and text embeddings into the target branch’s cross-attention to preserve structure and semantics. Comprehensive and extensive experiments demonstrate that FIA-Edit supports high-fidelity editing at low computational cost (~6s per 512 * 512 image on an RTX 4090) and consistently outperforms existing methods across diverse tasks in visual quality, background fidelity, and controllability. Furthermore, we are the first to extend text-guided image editing to clinical applications. By synthesizing anatomically coherent hemorrhage variations in surgical images, FIA-Edit opens new opportunities for medical data augmentation and delivers significant gains in downstream bleeding classification. Our project is available at: this https URL.
zh

[CV-243] Breaking the Modality Wall: Time-step Mixup for Efficient Spiking Knowledge Transfer from Static to Event Domain

【速读】:该论文旨在解决事件相机(Event Camera)与脉冲神经网络(Spiking Neural Network, SNN)联合训练中因事件数据稀疏性和DVS输出稀疏性导致的模型难以有效训练的问题,以及跨模态知识迁移中RGB与DVS之间分布差异大所引发的性能下降问题。解决方案的关键在于提出一种基于概率时间步混合策略(Probabilistic Time-step Mixup, TSM)的跨模态训练框架——Time-step Mixup Knowledge Transfer (TMKT),其核心机制是利用SNN的异步特性,在每个序列内对RGB和DVS输入按不同时间步插值生成平滑的教学课程,从而降低梯度方差并稳定优化过程;同时引入两个轻量级模态感知目标:帧级源监督引导(Modality Aware Guidance, MAG)和序列级混合比例感知(Mixup Ratio Perception, MRP),以显式对齐时序特征与混合调度,实现更平滑的知识迁移并缓解训练阶段的模态不匹配问题。

链接: https://arxiv.org/abs/2511.12150
作者: Yuqi Xie,Shuhan Ye,Yi Yu,Chong Wang,Qixin Zhang,Jiazhen Xu,Le Shen,Yuanbin Qian,Jiangbo Qian,Guoqi Li
机构: Ningbo University (宁波大学); Nanyang Technological University (南洋理工大学); Merchants’ Guild Economics and Cultural; Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The integration of event cameras and spiking neural networks (SNNs) promises energy-efficient visual intelligence, yet scarce event data and the sparsity of DVS outputs hinder effective training. Prior knowledge transfers from RGB to DVS often underperform because the distribution gap between modalities is substantial. In this work, we present Time-step Mixup Knowledge Transfer (TMKT), a cross-modal training framework with a probabilistic Time-step Mixup (TSM) strategy. TSM exploits the asynchronous nature of SNNs by interpolating RGB and DVS inputs at various time steps to produce a smooth curriculum within each sequence, which reduces gradient variance and stabilizes optimization with theoretical analysis. To employ auxiliary supervision from TSM, TMKT introduces two lightweight modality-aware objectives, Modality Aware Guidance (MAG) for per-frame source supervision and Mixup Ratio Perception (MRP) for sequence-level mix ratio estimation, which explicitly align temporal features with the mixing schedule. TMKT enables smoother knowledge transfer, helps mitigate modality mismatch during training, and achieves superior performance in spiking image classification tasks. Extensive experiments across diverse benchmarks and multiple SNN backbones, together with ablations, demonstrate the effectiveness of our method.
zh

[CV-244] AttackVLA: Benchmarking Adversarial and Backdoor Attacks on Vision-Language-Action Models

【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在安全评估中缺乏统一框架的问题,特别是由于不同动作分词器(action tokenizer)导致的可复现性差与公平比较困难,以及现有攻击方法大多未在真实机器人场景中验证的局限性。其解决方案的关键在于提出 AttackVLA 框架,该框架贯穿 VLA 的数据构建、训练到推理全生命周期,并集成多种已有及适配自视觉-语言模型的攻击方法,在仿真与真实机器人环境中进行系统性评估。进一步地,为填补现有攻击多局限于无目标失败或静态动作状态的空白,作者提出 BackdoorVLA——一种针对性后门攻击,可迫使 VLA 在触发条件下执行指定的长时程动作序列,实现在真实场景中平均 58.4% 的目标成功率,最高达 100%,从而揭示了对基于 VLA 的具身系统进行精准对抗操纵的可能性。

链接: https://arxiv.org/abs/2511.12149
作者: Jiayu Li,Yunhan Zhao,Xiang Zheng,Zonghuan Xu,Yige Li,Xingjun Ma,Yu-Gang Jiang
机构: Fudan University (复旦大学); City University of Hong Kong (香港城市大学); Singapore Management University (新加坡管理大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models enable robots to interpret natural-language instructions and perform diverse tasks, yet their integration of perception, language, and control introduces new safety vulnerabilities. Despite growing interest in attacking such models, the effectiveness of existing techniques remains unclear due to the absence of a unified evaluation framework. One major issue is that differences in action tokenizers across VLA architectures hinder reproducibility and fair comparison. More importantly, most existing attacks have not been validated in real-world scenarios. To address these challenges, we propose AttackVLA, a unified framework that aligns with the VLA development lifecycle, covering data construction, model training, and inference. Within this framework, we implement a broad suite of attacks, including all existing attacks targeting VLAs and multiple adapted attacks originally developed for vision-language models, and evaluate them in both simulation and real-world settings. Our analysis of existing attacks reveals a critical gap: current methods tend to induce untargeted failures or static action states, leaving targeted attacks that drive VLAs to perform precise long-horizon action sequences largely unexplored. To fill this gap, we introduce BackdoorVLA, a targeted backdoor attack that compels a VLA to execute an attacker-specified long-horizon action sequence whenever a trigger is present. We evaluate BackdoorVLA in both simulated benchmarks and real-world robotic settings, achieving an average targeted success rate of 58.4% and reaching 100% on selected tasks. Our work provides a standardized framework for evaluating VLA vulnerabilities and demonstrates the potential for precise adversarial manipulation, motivating further research on securing VLA-based embodied systems.
zh

[CV-245] Variation-Bounded Loss for Noise-Tolerant Learning AAAI2026

【速读】:该论文旨在解决监督学习中噪声标签(noisy labels)对模型性能的负面影响问题,其核心解决方案是提出一种新的鲁棒损失函数家族——变差有界损失(Variation-Bounded Loss, VBL),其关键在于引入“变差比”(Variation Ratio)作为衡量损失函数鲁棒性的新指标。理论分析表明,较小的变差比可带来更强的鲁棒性,并且该指标能够放宽传统方法对对称性(symmetric condition)的严格要求,提供更简洁的路径实现非对称条件(asymmetric condition)。基于此,作者将多种常用损失函数重构为变差有界形式,实验证明了该方法在多个数据集上的有效性与灵活性。

链接: https://arxiv.org/abs/2511.12143
作者: Jialiang Wang,Xiong Zhou,Xianming Liu,Gangfeng Hu,Deming Zhai,Junjun Jiang,Haoliang Li
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI2026

点击查看摘要

Abstract:Mitigating the negative impact of noisy labels has been aperennial issue in supervised learning. Robust loss functions have emerged as a prevalent solution to this problem. In this work, we introduce the Variation Ratio as a novel property related to the robustness of loss functions, and propose a new family of robust loss functions, termed Variation-Bounded Loss (VBL), which is characterized by a bounded variation ratio. We provide theoretical analyses of the variation ratio, proving that a smaller variation ratio would lead to better robustness. Furthermore, we reveal that the variation ratio provides a feasible method to relax the symmetric condition and offers a more concise path to achieve the asymmetric condition. Based on the variation ratio, we reformulate several commonly used loss functions into a variation-bounded form for practical applications. Positive experiments on various datasets exhibit the effectiveness and flexibility of our approach.
zh

[CV-246] MAVIS: A Benchmark for Multimodal Source Attribution in Long-form Visual Question Answering AAAI

【速读】:该论文旨在解决多模态场景下生成式 AI(Generative AI)回答的可信度问题,即如何通过引入来源溯源(source attribution)机制,在视觉问答(Visual Question Answering, VQA)任务中提供可验证、可追溯的答案。现有研究主要集中在纯文本场景,忽视了图像等多模态信息在证据检索与答案生成中的作用。解决方案的关键在于提出 MAVIS——首个用于评估多模态来源溯源系统的基准数据集,包含 15.7 万条视觉问答实例,每条答案均标注事实级引用(fact-level citations),对应多模态文档;同时开发了细粒度自动评价指标,从信息量(informativeness)、 groundedness(依据性)和流畅性(fluency)三个维度衡量系统性能,并揭示出:(1)基于多模态检索增强生成(Multimodal RAG)的大语言模型(LVLMs)虽能提升答案的信息量和流畅性,但在图像文档上的 groundedness 表现弱于文本文档;(2)不同提示策略在信息量与依据性之间存在权衡关系;(3)缓解图像文档解读中的上下文偏差是未来研究的重要方向。

链接: https://arxiv.org/abs/2511.12142
作者: Seokwon Song,Minsu Park,Gunhee Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in the Association for the Advancement of Artificial Intelligence (AAAI), 2026

点击查看摘要

Abstract:Source attribution aims to enhance the reliability of AI-generated answers by including references for each statement, helping users validate the provided answers. However, existing work has primarily focused on text-only scenario and largely overlooked the role of multimodality. We introduce MAVIS, the first benchmark designed to evaluate multimodal source attribution systems that understand user intent behind visual questions, retrieve multimodal evidence, and generate long-form answers with citations. Our dataset comprises 157K visual QA instances, where each answer is annotated with fact-level citations referring to multimodal documents. We develop fine-grained automatic metrics along three dimensions of informativeness, groundedness, and fluency, and demonstrate their strong correlation with human judgments. Our key findings are threefold: (1) LVLMs with multimodal RAG generate more informative and fluent answers than unimodal RAG, but they exhibit weaker groundedness for image documents than for text documents, a gap amplified in multimodal settings. (2) Given the same multimodal documents, there is a trade-off between informativeness and groundedness across different prompting methods. (3) Our proposed method highlights mitigating contextual bias in interpreting image documents as a crucial direction for future research. The dataset and experimental code are available at this https URL
zh

[CV-247] Compression and Inference of Spiking Neural Networks on Resource-Constrained Hardware

【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)在资源受限边缘设备上部署时面临的训练与推理效率低、内存占用高及延迟大的问题。其核心解决方案是设计了一个轻量级C语言运行时(runtime),通过将训练好的模型从SNNTorch导出为紧凑的C代码表示,采用静态且缓存友好的数据布局和预分配策略来消除解释器和动态内存分配开销;同时利用脉冲活动的稀疏性对不活跃的神经元和突触进行剪枝,显著减少上游卷积层的计算量。实验表明,该方案在N-MNIST和ST-MNIST数据集上实现了与Python基线相当的准确性,并在桌面CPU上达到约10倍加速,配合剪枝进一步提升性能,同时大幅降低内存消耗,使得SNN可在微控制器(如Arduino Portenta H7)上高效执行。

链接: https://arxiv.org/abs/2511.12136
作者: Karol C. Jurzec,Tomasz Szydlo,Maciej Wielgosz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 6 figures, 1 table; code available at this https URL

点击查看摘要

Abstract:Spiking neural networks (SNNs) communicate via discrete spikes in time rather than continuous activations. Their event-driven nature offers advantages for temporal processing and energy efficiency on resource-constrained hardware, but training and deployment remain challenging. We present a lightweight C-based runtime for SNN inference on edge devices and optimizations that reduce latency and memory without sacrificing accuracy. Trained models exported from SNNTorch are translated to a compact C representation; static, cache-friendly data layouts and preallocation avoid interpreter and allocation overheads. We further exploit sparse spiking activity to prune inactive neurons and synapses, shrinking computation in upstream convolutional layers. Experiments on N-MNIST and ST-MNIST show functional parity with the Python baseline while achieving ~10 speedups on desktop CPU and additional gains with pruning, together with large memory reductions that enable microcontroller deployment (Arduino Portenta H7). Results indicate that SNNs can be executed efficiently on conventional embedded platforms when paired with an optimized runtime and spike-driven model compression. Code: this https URL
zh

[CV-248] OAD-Promoter: Enhancing Zero-shot VQA using Large Language Models with Object Attribute Description AAAI2026

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在视觉问答(Visual Question Answering, VQA)任务中因依赖大规模训练数据而继承语言偏见,导致预测可靠性下降以及在分布外(out-of-distribution, OOD)场景下泛化能力不足的问题。解决方案的关键在于提出一种名为Object Attribute Description Promoter (OAD-Promoter) 的新方法,其核心由三个模块构成:基于对象聚焦的样本生成(Object-concentrated Example Generation, OEG)模块通过融合全局与局部视觉线索缓解偏见;记忆知识辅助(Memory Knowledge Assistance, MKA)模块利用存储示例检索相关知识以增强对未见域样本的处理能力;以及结合前序模块输出的OAD提示机制,优化LLM推理过程。该框架显著提升了LLM在少样本或零样本VQA场景下的性能,达到新的最先进水平。

链接: https://arxiv.org/abs/2511.12131
作者: Quanxing Xu,Ling Zhou,Feifei Zhang,Jinyu Tian,Rubing Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Large Language Models (LLMs) have become a crucial tool in Visual Question Answering (VQA) for handling knowledge-intensive questions in few-shot or zero-shot scenarios. However, their reliance on massive training datasets often causes them to inherit language biases during the acquisition of knowledge. This limitation imposes two key constraints on existing methods: (1) LLM predictions become less reliable due to bias exploitation, and (2) despite strong knowledge reasoning capabilities, LLMs still struggle with out-of-distribution (OOD) generalization. To address these issues, we propose Object Attribute Description Promoter (OAD-Promoter), a novel approach for enhancing LLM-based VQA by mitigating language bias and improving domain-shift robustness. OAD-Promoter comprises three components: the Object-concentrated Example Generation (OEG) module, the Memory Knowledge Assistance (MKA) module, and the OAD Prompt. The OEG module generates global captions and object-concentrated samples, jointly enhancing visual information input to the LLM and mitigating bias through complementary global and regional visual cues. The MKA module assists the LLM in handling OOD samples by retrieving relevant knowledge from stored examples to support questions from unseen domains. Finally, the OAD Prompt integrates the outputs of the preceding modules to optimize LLM inference. Experiments demonstrate that OAD-Promoter significantly improves the performance of LLM-based VQA methods in few-shot or zero-shot settings, achieving new state-of-the-art results.
zh

[CV-249] RadarMP: Motion Perception for 4D mmWave Radar in Autonomous Driving AAAI2026

【速读】:该论文旨在解决4D毫米波雷达(4D mmWave radar)在自动驾驶系统中因点云稀疏和噪声导致的3D场景运动感知精度不足的问题,尤其是在光学传感器因恶劣天气失效时,车辆感知能力受限。解决方案的关键在于提出RadarMP方法,其核心创新是将雷达目标检测与运动估计任务统一建模于一个端到端架构中,实现点云生成与逐点3D场景光流预测的一致性联合优化;同时,针对雷达特性设计了基于多普勒频移(Doppler shift)和回波强度的自监督损失函数,无需人工标注即可有效约束空间一致性与运动一致性,从而显著提升复杂气象与光照条件下鲁棒的运动感知性能。

链接: https://arxiv.org/abs/2511.12117
作者: Ruiqi Cheng,Huijun Di,Jian Li,Feng Liu,Wei Liang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 6 figures. Accepted by AAAI 2026

点击查看摘要

Abstract:Accurate 3D scene motion perception significantly enhances the safety and reliability of an autonomous driving system. Benefiting from its all-weather operational capability and unique perceptual properties, 4D mmWave radar has emerged as an essential component in advanced autonomous driving. However, sparse and noisy radar points often lead to imprecise motion perception, leaving autonomous vehicles with limited sensing capabilities when optical sensors degrade under adverse weather conditions. In this paper, we propose RadarMP, a novel method for precise 3D scene motion perception using low-level radar echo signals from two consecutive frames. Unlike existing methods that separate radar target detection and motion estimation, RadarMP jointly models both tasks in a unified architecture, enabling consistent radar point cloud generation and pointwise 3D scene flow prediction. Tailored to radar characteristics, we design specialized self-supervised loss functions guided by Doppler shifts and echo intensity, effectively supervising spatial and motion consistency without explicit annotations. Extensive experiments on the public dataset demonstrate that RadarMP achieves reliable motion perception across diverse weather and illumination conditions, outperforming radar-based decoupled motion perception pipelines and enhancing perception capabilities for full-scenario autonomous driving systems.
zh

[CV-250] MediRound: Multi-Round Entity-Level Reasoning Segmentation in Medical Images

【速读】:该论文旨在解决现有医学图像分割方法普遍存在的任务特异性与缺乏交互性的问题,尤其是当前基于文本提示的分割方法仅支持单轮对话,难以实现多轮推理。为应对这一挑战,作者提出了多轮实体级医学推理分割(Multi-Round Entity-Level Medical Reasoning Segmentation, MEMR-Seg)新任务,并构建了包含17.7万条多轮医学分割对话的大规模数据集MR-MedSeg,其中每轮对话均涉及实体层面的推理。解决方案的关键在于提出MediRound基线模型,并引入一种轻量但高效的推理阶段判断修正机制(Judgment Correction Mechanism),以缓解多轮分割链式流程中的误差传播问题,从而显著提升多轮交互式医学图像分割的准确性和鲁棒性。

链接: https://arxiv.org/abs/2511.12110
作者: Qinyue Tong,Ziqian Lu,Jun Liu,Rui Zuo,Zheming Lu
机构: Zhejiang University (浙江大学); Zhejiang Sci-Tech University (浙江理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12pages, 6 figures

点击查看摘要

Abstract:Despite the progress in medical image segmentation, most existing methods remain task-specific and lack interactivity. Although recent text-prompt-based segmentation approaches enhance user-driven and reasoning-based segmentation, they remain confined to single-round dialogues and fail to perform multi-round reasoning. In this work, we introduce Multi-Round Entity-Level Medical Reasoning Segmentation (MEMR-Seg), a new task that requires generating segmentation masks through multi-round queries with entity-level reasoning. To support this task, we construct MR-MedSeg, a large-scale dataset of 177K multi-round medical segmentation dialogues, featuring entity-based reasoning across rounds. Furthermore, we propose MediRound, an effective baseline model designed for multi-round medical reasoning segmentation. To mitigate the inherent error propagation in the chain-like pipeline of multi-round segmentation, we introduce a lightweight yet effective Judgment Correction Mechanism during model inference. Experimental results demonstrate that our method effectively addresses the MEMR-Seg task and outperforms conventional medical referring segmentation methods.
zh

[CV-251] Fine-Grained DINO Tuning with Dual Supervision for Face Forgery Detection AAAI2026

【速读】:该论文旨在解决当前深度伪造(DeepFake)检测方法中因忽视不同伪造技术所产生独特人工痕迹(artifacts)而导致的检测精度不足问题。现有基于DINOv2的微调方法通常将其视为通用二分类任务,未能充分利用伪造类型差异带来的判别信息。解决方案的关键在于提出一种轻量级细粒度适配器(DeepFake Fine-Grained Adapter, DFF-Adapter),在DINOv2每个Transformer块中嵌入多头LoRA模块,实现骨干网络的高效适配;同时设计一个共享分支将细粒度伪造类型线索传递至真实性判别头,从而通过多任务协同优化机制,利用特定伪造方法的知识显著提升真实性判别能力,仅用350万可训练参数即达到或超越复杂SOTA方法的检测性能。

链接: https://arxiv.org/abs/2511.12107
作者: Tianxiang Zhang,Peipeng Yu,Zhihua Xia,Longchen Dai,Xiaoyu Zhou,Hui Gao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:The proliferation of sophisticated deepfakes poses significant threats to information integrity. While DINOv2 shows promise for detection, existing fine-tuning approaches treat it as generic binary classification, overlooking distinct artifacts inherent to different deepfake methods. To address this, we propose a DeepFake Fine-Grained Adapter (DFF-Adapter) for DINOv2. Our method incorporates lightweight multi-head LoRA modules into every transformer block, enabling efficient backbone adaptation. DFF-Adapter simultaneously addresses authenticity detection and fine-grained manipulation type classification, where classifying forgery methods enhances artifact sensitivity. We introduce a shared branch propagating fine-grained manipulation cues to the authenticity head. This enables multi-task cooperative optimization, explicitly enhancing authenticity discrimination with manipulation-specific knowledge. Utilizing only 3.5M trainable parameters, our parameter-efficient approach achieves detection accuracy comparable to or even surpassing that of current complex state-of-the-art methods.
zh

[CV-252] EMPO: Global Temporal Building Density and Height Estimation from Satellite Imagery

【速读】:该论文旨在解决全球范围内建筑密度与高度时空动态监测的难题,传统方法在大规模、高频率更新方面存在计算成本高、时效性差的问题。解决方案的关键在于提出了一种名为TEMPO的多任务深度学习模型,该模型结合现有建筑轮廓和高度数据与季度分辨率的PlanetScope卫星影像,能够在37.6米/像素的空间分辨率下预测建筑密度与高度,并实现从2018年第一季度至2025年第二季度的全球时序制图。该方法在验证中表现出优异的准确性(F1分数85%–88%)和时间稳定性(五年的趋势一致性得分0.96),显著降低了计算成本,为全球发展模式与气候影响的大规模监测提供了高效工具。

链接: https://arxiv.org/abs/2511.12104
作者: Tammy Glazer,Gilles Q. Hacheme,Akram Zaytar,Luana Marotti,Amy Michaels,Girmaw Abebe Tadesse,Kevin White,Rahul Dodhia,Andrew Zolli,Inbal Becker-Reshef,Juan M. Lavista Ferres,Caleb Robinson
机构: Microsoft AI for Good Research Lab(微软AI for Good研究实验室); Planet Labs PBC(行星实验室股份公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present TEMPO, a global, temporally resolved dataset of building density and height derived from high-resolution satellite imagery using deep learning models. We pair building footprint and height data from existing datasets with quarterly PlanetScope basemap satellite images to train a multi-task deep learning model that predicts building density and building height at a 37.6-meter per pixel resolution. We apply this model to global PlanetScope basemaps from Q1 2018 through Q2 2025 to create global, temporal maps of building density and height. We validate these maps by comparing against existing building footprint datasets. Our estimates achieve an F1 score between 85% and 88% on different hand-labeled subsets, and are temporally stable, with a 0.96 five-year trend-consistency score. TEMPO captures quarterly changes in built settlements at a fraction of the computational cost of comparable approaches, unlocking large-scale monitoring of development patterns and climate impacts essential for global resilience and adaptation efforts.
zh

[CV-253] BdSL-SPOTER: A Transformer-Based Framework for Bengali Sign Language Recognition with Cultural Adaptation

【速读】:该论文旨在解决孟加拉手语(Bengali Sign Language, BdSL)识别中因数据稀缺和文化特异性导致的准确率低、模型效率差的问题。解决方案的关键在于提出一种基于姿态的Transformer框架BdSL-SPOTER,其核心创新包括:1)针对孟加拉文化特性的预处理方法;2)采用参数精简的四层Transformer编码器,结合优化的可学习位置编码(learnable positional encodings)以提升表示能力;3)引入课程学习(curriculum learning)策略,在有限数据下增强模型泛化性能并加速收敛。实验表明,该方法在BdSLW60基准上达到97.92% Top-1验证准确率,较Bi-LSTM基线提升22.82%,同时显著降低计算开销(FLOPs与参数量减少,FPS提高),具备实际部署潜力。

链接: https://arxiv.org/abs/2511.12103
作者: Sayad Ibna Azad,Md. Atiqur Rahman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to 20th International Symposium on Visual Computing (ISVC 2025)

点击查看摘要

Abstract:We introduce BdSL-SPOTER, a pose-based transformer framework for accurate and efficient recognition of Bengali Sign Language (BdSL). BdSL-SPOTER extends the SPOTER paradigm with cultural specific preprocessing and a compact four-layer transformer encoder featuring optimized learnable positional encodings, while employing curriculum learning to enhance generalization on limited data and accelerate convergence. On the BdSLW60 benchmark, it achieves 97.92% Top-1 validation accuracy, representing a 22.82% improvement over the Bi-LSTM baseline, all while keeping computational costs low. With its reduced number of parameters, lower FLOPs, and higher FPS, BdSL-SPOTER provides a practical framework for real-world accessibility applications and serves as a scalable model for other low-resource regional sign languages.
zh

[CV-254] Did Models Sufficient Learn? Attribution-Guided Training via Subset-Selected Counterfactual Augmentation

【速读】:该论文旨在解决当前视觉模型在训练过程中依赖有限充分原因(sufficient causes)进行预测的问题,导致模型对分布偏移或关键特征缺失敏感,且其学习到的依赖关系可能并非充分因果(causal)。解决方案的关键在于提出Subset-Selected Counterfactual Augmentation (SS-CA),该方法将反事实解释(counterfactual explanations)直接整合进训练流程以实现针对性干预。具体而言,基于Subset-selection-based LIMA attribution方法,作者开发了Counterfactual LIMA来识别最小空间区域集合,这些区域的移除可选择性地改变模型预测;随后设计一种数据增强策略,用自然背景替换这些区域,并联合原始样本与增强样本共同训练模型,从而缓解不完整的因果学习问题。实验表明,该方法显著提升模型在分布内(in-distribution, ID)和分布外(out-of-distribution, OOD)测试集上的泛化能力及鲁棒性。

链接: https://arxiv.org/abs/2511.12100
作者: Yannan Chen,Ruoyu Chen,Bin Zeng,Wei Wang,Shiming Liu,Qunli Zhang,Zheng Hu,Laiyuan Wang,Yaowei Wang,Xiaochun Cao
机构: Sun Yat-sen University (中山大学); Pengcheng Laboratory (鹏城实验室); CAS (中国科学院); University of Chinese Academy of Sciences (中国科学院大学); Tianjin University (天津大学); HIT, Shenzhen (哈尔滨工业大学深圳校区); Huawei Inc. (华为公司); Munich Research Center, Huawei Düsseldorf GmbH (华为慕尼黑研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In current visual model training, models often rely on only limited sufficient causes for their predictions, which makes them sensitive to distribution shifts or the absence of key features. Attribution methods can accurately identify a model’s critical regions. However, masking these areas to create counterfactuals often causes the model to misclassify the target, while humans can still easily recognize it. This divergence highlights that the model’s learned dependencies may not be sufficiently causal. To address this issue, we propose Subset-Selected Counterfactual Augmentation (SS-CA), which integrates counterfactual explanations directly into the training process for targeted intervention. Building on the subset-selection-based LIMA attribution method, we develop Counterfactual LIMA to identify minimal spatial region sets whose removal can selectively alter model predictions. Leveraging these attributions, we introduce a data augmentation strategy that replaces the identified regions with natural background, and we train the model jointly on both augmented and original samples to mitigate incomplete causal learning. Extensive experiments across multiple ImageNet variants show that SS-CA improves generalization on in-distribution (ID) test data and achieves superior performance on out-of-distribution (OOD) benchmarks such as ImageNet-R and ImageNet-S. Under perturbations including noise, models trained with SS-CA also exhibit enhanced generalization, demonstrating that our approach effectively uses interpretability insights to correct model deficiencies and improve both performance and robustness.
zh

[CV-255] Adaptive Begin-of-Video Tokens for Autoregressive Video Diffusion Models

【速读】:该论文旨在解决自回归视频扩散模型(Autoregressive Video Diffusion Models, VDMs)在生成长视频时面临的全局一致性差与局部运动动态质量低的问题。现有方法中,基于块的扩展方式易导致去噪延迟和误差累积,而流式去噪(stream denoising)虽支持实时采样,但存在运动不连贯和动力学表现弱的缺陷。解决方案的关键在于提出自适应视频起始标记(Adaptive Begin-of-Video Tokens, ada-BOV),这是一种可学习嵌入(learnable embeddings),通过类层归一化(adaptive-layer-norm-like)的调制机制动态吸收先前帧的去噪信息,从而在保持全局时空一致性的同时实现灵活的条件建模;此外,论文还引入一种解耦采样轨迹长度与注意力窗口大小约束的精炼策略,提升局部动态引导能力,并设计扰动增强训练噪声调度以平衡收敛速度与模型鲁棒性,显著改善了长视频生成的质量与稳定性。

链接: https://arxiv.org/abs/2511.12099
作者: Tianle Cheng,Zeyan Zhang,Kaifeng Gao,Jun Xiao
机构: Zhejiang University (浙江大学); Manycore Tech Inc.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in diffusion-based video generation have produced impressive and high-fidelity short videos. To extend these successes to generate coherent long videos, most video diffusion models (VDMs) generate videos in an autoregressive manner, i.e., generating subsequent frames conditioned on previous ones. There are generally two primary paradigms: chunk-based extension and stream denoising. The former directly concatenates previous clean frames as conditioning, suffering from denoising latency and error accumulation. The latter maintains the denoising sequence with monotonically increasing noise levels. In each denoising iteration, one clean frame is produced while a new pure noise is simultaneously appended, enabling live-stream sampling. However, it struggles with fragile consistency and poor motion dynamics. In this paper, we propose Adaptive Begin-of-Video Tokens (ada-BOV) for autoregressive VDMs. The BOV tokens are special learnable embeddings on VDMs. They adaptively absorb denoised preceding frames via an adaptive-layer-norm-like modulation. This design preserves the global consistency while allowing for flexible conditioning in dynamic scenarios. To ensure the quality of local dynamics essential in modulating BOV tokens, we further propose a refinement strategy for stream denoising. It decouples the sampling trajectory length from the attention window size constraint, leading to improved local guidance and overall imaging quality. We also propose a disturbance-augmented training noise schedule, which balances the convergence speed with model robustness for the stream denoising. Extensive experiments demonstrate that our method achieves compelling qualitative and quantitative results across multiple metrics.
zh

[CV-256] DINOv3-Guided Cross Fusion Framework for Semantic-aware CT generation from MRI and CBCT

【速读】:该论文旨在解决医学图像翻译中生成式CT(Computed Tomography)图像时存在的两个关键问题:一是基于卷积神经网络(CNN)的模型缺乏全局语义理解能力,二是Transformer模型因高模型容量和弱归纳偏置在小规模医疗数据集上容易过拟合。解决方案的核心在于提出一种DINOv3-Guided Cross Fusion(DGCF)框架,该框架通过将冻结的自监督DINOv3 Transformer与可训练的CNN编码器-解码器相结合,利用可学习的交叉融合模块分层融合Transformer的全局表征与CNN的局部特征,从而实现局部外观与上下文语义表示之间的平衡;同时引入多层级DINOv3感知损失(MLDP loss),在DINOv3特征空间中强制合成CT与真实CT之间的语义相似性,显著提升图像质量和下游分割任务性能。

链接: https://arxiv.org/abs/2511.12098
作者: Xianhao Zhou,Jianghao Wu,Ku Zhao,Jinlong He,Huangxuan Zhao,Lei Chen,Shaoting Zhang,Guotai Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating synthetic CT images from CBCT or MRI has a potential for efficient radiation dose planning and adaptive radiotherapy. However, existing CNN-based models lack global semantic understanding, while Transformers often overfit small medical datasets due to high model capacity and weak inductive bias. To address these limitations, we propose a DINOv3-Guided Cross Fusion (DGCF) framework that integrates a frozen self-supervised DINOv3 Transformer with a trainable CNN encoder-decoder. It hierarchically fuses global representation of Transformer and local features of CNN via a learnable cross fusion module, achieving balanced local appearance and contextual representation. Furthermore, we introduce a Multi-Level DINOv3 Perceptual (MLDP) loss that encourages semantic similarity between synthetic CT and the ground truth in DINOv3’s feature space. Experiments on the SynthRAD2023 pelvic dataset demonstrate that DGCF achieved state-of-the-art performance in terms of MS-SSIM, PSNR and segmentation-based metrics on both MRI \rightarrow CT and CBCT \rightarrow CT translation tasks. To the best of our knowledge, this is the first work to employ DINOv3 representations for medical image translation, highlighting the potential of self-supervised Transformer guidance for semantic-aware CT synthesis. The code is available at this https URL.
zh

[CV-257] Sparse by Rule: Probability-Based N:M Pruning for Spiking Neural Networks

【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)在深度架构下参数量和计算成本激增的问题,从而阻碍其在边缘设备上的部署。现有剪枝方法主要分为两类:非结构化剪枝虽能达到高稀疏度但难以在通用硬件上加速,而结构化剪枝虽利于部署却灵活性差且常因匹配稀疏度导致精度下降。论文提出SpikeNM,首个面向SNN的半结构化N:M剪枝框架,从头学习稀疏SNN,强制每M个权重块中最多保留N个非零值。其关键创新在于采用M路基底对数几率参数化(M-way basis-logit parameterization)结合可微分Top-k采样器,将每块复杂度从组合爆炸级(\sum_{k=1}^N \binom{M}{k})线性降低至(\mathcal{O}(M)),支持更激进的稀疏化;同时引入受神经科学启发的可获得性引导蒸馏(Eligibility-Inspired Distillation, EID),通过时间累积信用转化为块级软目标,使掩码概率与脉冲动态对齐,降低采样方差并提升高稀疏度下的搜索稳定性。实验表明,在2:4稀疏度下,SpikeNM不仅保持甚至提升主流数据集性能,还生成适合硬件部署的稀疏模式,与SNN固有的脉冲稀疏性相辅相成。

链接: https://arxiv.org/abs/2511.12097
作者: Shuhan Ye,Yi Yu,Qixin Zhang,Chenqi Kong,Qiangqiang Wu,Xudong Jiang,Dacheng Tao
机构: Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Brain-inspired Spiking neural networks (SNNs) promise energy-efficient intelligence via event-driven, sparse computation, but deeper architectures inflate parameters and computational cost, hindering their edge deployment. Recent progress in SNN pruning helps alleviate this burden, yet existing efforts fall into only two families: \emphunstructured pruning, which attains high sparsity but is difficult to accelerate on general hardware, and \emphstructured pruning, which eases deployment but lack flexibility and often degrades accuracy at matched sparsity. In this work, we introduce \textbfSpikeNM, the first SNN-oriented \emphsemi-structured (N:M) pruning framework that learns sparse SNNs \emphfrom scratch, enforcing \emphat most (N) non-zeros per (M)-weight block. To avoid the combinatorial space complexity (\sum_k=1^N\binomMk) growing exponentially with (M), SpikeNM adopts an (M)-way basis-logit parameterization with a differentiable top-(k) sampler, \emphlinearizing per-block complexity to (\mathcal O(M)) and enabling more aggressive sparsification. Further inspired by neuroscience, we propose \empheligibility-inspired distillation (EID), which converts temporally accumulated credits into block-wise soft targets to align mask probabilities with spiking dynamics, reducing sampling variance and stabilizing search under high sparsity. Experiments show that at (2:4) sparsity, SpikeNM maintains and even with gains across main-stream datasets, while yielding hardware-amenable patterns that complement intrinsic spike sparsity.
zh

[CV-258] Learning from Dense Events: Towards Fast Spiking Neural Networks Training via Event Dataset Distillatio

【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)因时间编码机制导致的高训练成本问题,从而限制其在实际场景中的部署。解决方案的关键在于提出首个面向事件相机数据和SNN的数据蒸馏框架PACE(Phase-Aligned Condensation for Events),其核心由两个模块构成:一是ST-DSM(Spatiotemporal Dense Spike Mapping),通过残差膜电位增强脉冲特征密度并实现精细的时空匹配;二是PEQ-N(Plug-and-Play Probabilistic Integer Quantizer),提供与标准事件帧处理流程兼容的直通概率整数量化器。该方法显著压缩训练数据规模,在保持高精度的同时大幅降低训练时间和存储开销,实现了分钟级SNN训练和边缘设备高效部署。

链接: https://arxiv.org/abs/2511.12095
作者: Shuhan Ye,Yi Yu,Qixin Zhang,Chenqi Kong,Qiangqiang Wu,Kun Wang,Xudong Jiang
机构: Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Event cameras sense brightness changes and output binary asynchronous event streams, attracting increasing attention. Their bio-inspired dynamics align well with spiking neural networks (SNNs), offering a promising energy-efficient alternative to conventional vision systems. However, SNNs remain costly to train due to temporal coding, which limits their practical deployment. To alleviate the high training cost of SNNs, we introduce \textbfPACE (Phase-Aligned Condensation for Events), the first dataset distillation framework to SNNs and event-based vision. PACE distills a large training dataset into a compact synthetic one that enables fast SNN training, which is achieved by two core modules: \textbfST-DSM and \textbfPEQ-N. ST-DSM uses residual membrane potentials to densify spike-based features (SDR) and to perform fine-grained spatiotemporal matching of amplitude and phase (ST-SM), while PEQ-N provides a plug-and-play straight through probabilistic integer quantizer compatible with standard event-frame pipelines. Across DVS-Gesture, CIFAR10-DVS, and N-MNIST datasets, PACE outperforms existing coreset selection and dataset distillation baselines, with particularly strong gains on dynamic event streams and at low or moderate IPC. Specifically, on N-MNIST, it achieves (84.4%) accuracy, about (85%) of the full training set performance, while reducing training time by more than (50\times) and storage cost by (6000\times), yielding compact surrogates that enable minute-scale SNN training and efficient edge deployment.
zh

[CV-259] aching Prompts to Coordinate: Hierarchical Layer-Grouped Prompt Tuning for Continual Learning

【速读】:该论文旨在解决提示调优(prompt tuning)在持续学习(continual learning)场景下因各层独立引入可学习提示而导致的灾难性遗忘问题。其核心挑战在于:虽然逐层独立调优提升了对新任务的适应灵活性,但过多的自由度导致某些层可能被不必要地更新,且所有历史提示叠加后易覆盖先前任务的关键特征表示,从而加剧遗忘。解决方案的关键在于提出一种分层分组的提示调优方法(hierarchical layer-grouped prompt tuning),通过两个机制增强模型稳定性:(i) 同一组内的层共享大致相同的提示,并由位置编码进行微调,以保留预训练模型内部特征关系与传播路径;(ii) 引入单一任务特定的根提示(root prompt),用于生成每组对应的子提示,使所有子提示基于同一根提示条件生成,从而提升子提示间的协同效应并降低独立性,有效缓解遗忘问题。

链接: https://arxiv.org/abs/2511.12090
作者: Shengqin Jiang,Tianqi Kong,Yuankai Qi,Haokui Zhang,Lina Yao,Quan Z. Sheng,Qingshan Liu,Ming-Hsuan Yang
机构: Nanjing University of Information Science and Technology (南京信息工程大学); Macquarie University (麦考瑞大学); Northwestern Polytechnical University (西北工业大学); University of New South Wales (新南威尔士大学); CSIRO’s Data61 (澳大利亚联邦科学与工业研究组织数据61); Nanjing University of Posts and Telecommunications (南京邮电大学); University of California at Merced (加州大学默塞德分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: under review

点击查看摘要

Abstract:Prompt-based continual learning methods fine-tune only a small set of additional learnable parameters while keeping the pre-trained model’s parameters frozen. It enables efficient adaptation to new tasks while mitigating the risk of catastrophic forgetting. These methods typically attach one independent task-specific prompt to each layer of pre-trained models to locally modulate its features, ensuring that the layer’s representation aligns with the requirements of the new task. However, although introducing learnable prompts independently at each layer provides high flexibility for adapting to new tasks, this overly flexible tuning could make certain layers susceptible to unnecessary updates. As all prompts till the current task are added together as a final prompt for all seen tasks, the model may easily overwrite feature representations essential to previous tasks, which increases the risk of catastrophic forgetting. To address this issue, we propose a novel hierarchical layer-grouped prompt tuning method for continual learning. It improves model stability in two ways: (i) Layers in the same group share roughly the same prompts, which are adjusted by position encoding. This helps preserve the intrinsic feature relationships and propagation pathways of the pre-trained model within each group. (ii) It utilizes a single task-specific root prompt to learn to generate sub-prompts for each layer group. In this way, all sub-prompts are conditioned on the same root prompt, enhancing their synergy and reducing independence. Extensive experiments across four benchmarks demonstrate that our method achieves favorable performance compared with several state-of-the-art methods.
zh

[CV-260] SemanticStitch: Enhancing Image Coherence through Foreground-Aware Seam Carving

【速读】:该论文旨在解决图像拼接(image stitching)中因拍摄角度差异、位置偏移及物体运动等因素导致的错位与视觉不一致问题,尤其针对传统缝合切割(seam carving)方法忽视语义信息而造成前景对象连续性破坏的缺陷。其解决方案的关键在于提出SemanticStitch框架,该框架通过引入前景对象的语义先验(semantic priors),结合一种强调显著对象语义完整性的新型损失函数(loss function),有效提升了拼接结果的视觉一致性与前景完整性,从而在真实场景下显著优于传统方法。

链接: https://arxiv.org/abs/2511.12084
作者: Ji-Ping Jin,Chen-Bin Feng,Rui Fan,Chi-Man Vong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12pages, has been early accepted by The Visual Computer: International Journal of Computer Graphics, 2025

点击查看摘要

Abstract:Image stitching often faces challenges due to varying capture angles, positional differences, and object movements, leading to misalignments and visual discrepancies. Traditional seam carving methods neglect semantic information, causing disruptions in foreground continuity. We introduce SemanticStitch, a deep learning-based framework that incorporates semantic priors of foreground objects to preserve their integrity and enhance visual coherence. Our approach includes a novel loss function that emphasizes the semantic integrity of salient objects, significantly improving stitching quality. We also present two specialized real-world datasets to evaluate our method’s effectiveness. Experimental results demonstrate substantial improvements over traditional techniques, providing robust support for practical applications.
zh

[CV-261] Supervised Multilabel Image Classification Using Residual Networks with Probabilistic Reasoning

【速读】:该论文旨在解决多标签图像分类(multilabel image categorization)中的标签依赖关系与不确定性建模问题,此类问题在计算机视觉应用中尤为关键。其解决方案的核心在于将概率推理(probabilistic reasoning)引入改进的ResNet-101架构中,通过模拟标签间的依赖性和不确定性来提升预测准确性。实验表明,该方法在COCO-2014数据集上达到0.794的平均精度均值(mAP),优于ResNet-SRN(0.771)和Vision Transformer基线模型(0.785),验证了概率建模在多标签场景下的有效性。

链接: https://arxiv.org/abs/2511.12082
作者: Lokender Singh,Saksham Kumar,Chandan Kumar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCCNT 2025 Conference Proceedings, IIT Indore

点击查看摘要

Abstract:Multilabel image categorization has drawn interest recently because of its numerous computer vision applications. The proposed work introduces a novel method for classifying multilabel images using the COCO-2014 dataset and a modified ResNet-101 architecture. By simulating label dependencies and uncertainties, the approach uses probabilistic reasoning to improve prediction accuracy. Extensive tests show that the model outperforms earlier techniques and approaches to state-of-the-art outcomes in multilabel categorization. The work also thoroughly assesses the model’s performance using metrics like precision-recall score and achieves 0.794 mAP on COCO-2014, outperforming ResNet-SRN (0.771) and Vision Transformer baselines (0.785). The novelty of the work lies in integrating probabilistic reasoning into deep learning models to effectively address the challenges presented by multilabel scenarios.
zh

[CV-262] Point Cloud Quantization through Multimodal Prompting for 3D Understanding AAAI2026

【速读】:该论文旨在解决向量量化(Vector Quantization)在大规模多模态模型中因码本设计不 robust 所导致的表征能力不足问题,尤其是现有基于可训练向量或聚类中心的原型方法在代表性与可解释性上的局限。解决方案的关键在于提出一种由多模态提示驱动的量化框架,其核心创新包括:1)利用预训练文本嵌入通过多对一对比对齐天然蕴含的视觉语义作为鲁棒的原型先验;2)借助多模态提示实现对原型的自适应优化,有效缓解视觉-语言语义鸿沟;并通过双约束量化空间(紧凑性与分离性正则化)融合几何与语义特征,结合 Gumbel-Softmax 松弛实现可微离散化,同时保持量化稀疏性。

链接: https://arxiv.org/abs/2511.12079
作者: Hongxuan Li(1),Wencheng Zhu(1 and 2),Huiying Xu(3),Xinzhong Zhu(3),Pengfei Zhu(1) ((1) College of Intelligence and Computing, Tianjin University, (2) Haihe Laboratory of Information Technology Application Innovation, (3) School of Computer Science and Technology, Zhejiang Normal University)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2026. 11 pages, 7 figures. Corresponding author: Wencheng Zhu (wenchengzhu@tju. this http URL )

点击查看摘要

Abstract:Vector quantization has emerged as a powerful tool in large-scale multimodal models, unifying heterogeneous representations through discrete token encoding. However, its effectiveness hinges on robust codebook design. Current prototype-based approaches relying on trainable vectors or clustered centroids fall short in representativeness and interpretability, even as multimodal alignment demonstrates its promise in vision-language models. To address these limitations, we propose a simple multimodal prompting-driven quantization framework for point cloud analysis. Our methodology is built upon two core insights: 1) Text embeddings from pre-trained models inherently encode visual semantics through many-to-one contrastive alignment, naturally serving as robust prototype priors; and 2) Multimodal prompts enable adaptive refinement of these prototypes, effectively mitigating vision-language semantic gaps. The framework introduces a dual-constrained quantization space, enforced by compactness and separation regularization, which seamlessly integrates visual and prototype features, resulting in hybrid representations that jointly encode geometric and semantic information. Furthermore, we employ Gumbel-Softmax relaxation to achieve differentiable discretization while maintaining quantization sparsity. Extensive experiments on the ModelNet40 and ScanObjectNN datasets clearly demonstrate the superior effectiveness of the proposed method.
zh

[CV-263] Learning to Hear by Seeing: Its Time for Vision Language Models to Understand Artistic Emotion from Sight and Sound

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在跨模态情感理解上的局限性,尤其是艺术作品中多模态情感表达的建模问题。现有方法多为以人类为中心或单一模态,忽视了艺术创作中视觉与听觉元素协同传递的情感意图;同时,当前音频-视觉语言模型(Audio-Visual Language Models, AVLMs)通常依赖大规模音频预训练,限制了模型的可扩展性。解决方案的关键在于提出一个两阶段框架——Vision Anchored Audio-Visual Emotion LLM (VAEmotionLLM):第一阶段通过视觉引导的音频对齐(Vision-Guided Audio Alignment, VG-Align),利用同步音视频片段中共享LLM的下一个词分布对齐,使冻结的视觉路径“指导”音频路径学习听觉感知,从而在有限音频预训练下实现高效音频理解;第二阶段引入轻量级跨模态情感适配器(Cross-Modal Emotion Adapter, EmoAdapter),包含情感增强模块和情感监督模块,注入情感敏感残差并施加情感监督信号,显著提升跨模态情感理解能力。

链接: https://arxiv.org/abs/2511.12077
作者: Dengming Zhang,Weitao You,Jingxiong Li,Weishen Lin,Wenda Shi,Xue Zhao,Heda Zuo,Junxian Wu,Lingyun Sun
机构: Zhejiang University (浙江大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Emotion understanding is critical for making Large Language Models (LLMs) more general, reliable, and aligned with humans. Art conveys emotion through the joint design of visual and auditory elements, yet most prior work is human-centered or single-modality, overlooking the emotion intentionally expressed by the artwork. Meanwhile, current Audio-Visual Language Models (AVLMs) typically require large-scale audio pretraining to endow Visual Language Models (VLMs) with hearing, which limits scalability. We present Vision Anchored Audio-Visual Emotion LLM (VAEmotionLLM), a two-stage framework that teaches a VLM to hear by seeing with limited audio pretraining and to understand emotion across modalities. In Stage 1, Vision-Guided Audio Alignment (VG-Align) distills the frozen visual pathway into a new audio pathway by aligning next-token distributions of the shared LLM on synchronized audio-video clips, enabling hearing without a large audio dataset. In Stage 2, a lightweight Cross-Modal Emotion Adapter (EmoAdapter), composed of the Emotion Enhancer and the Emotion Supervisor, injects emotion-sensitive residuals and applies emotion supervision to enhance cross-modal emotion understanding. We also construct ArtEmoBenchmark, an art-centric emotion benchmark that evaluates content and emotion understanding under audio-only, visual-only, and audio-visual inputs. VAEmotionLLM achieves state-of-the-art results on ArtEmoBenchmark, outperforming audio-only, visual-only, and audio-visual baselines. Ablations show that the proposed components are complementary.
zh

[CV-264] DCA-LUT: Deep Chromatic Alignment with 5D LUT for Purple Fringing Removal

【速读】:该论文旨在解决由镜头色散引起的纵向色差(Longitudinal Chromatic Aberration, LCA)导致的紫色边缘伪影(Purple Fringing)问题,该伪影长期影响数字成像的清晰度与真实感。传统方法依赖昂贵的消色差(Apochromatic, APO)光学硬件或手工特征提取,缺乏数据驱动能力。其解决方案的关键在于提出首个基于深度学习的框架DCA-LUT,核心创新是引入一种新型色度感知坐标变换(Chromatic-Aware Coordinate Transformation, CA-CT)模块,通过学习图像自适应的颜色空间来解耦并隔离紫色伪影至专用维度,从而精准建模“紫色伪影通道”,进而指导亮度通道的恢复;最终利用学习得到的五维查找表(5D Look-Up Table, 5D LUT)实现高效且强大的非线性色彩映射,显著提升去伪影效果。

链接: https://arxiv.org/abs/2511.12066
作者: Jialang Lu,Shuning Sun,Pu Wang,Chen Wu,Feng Gao,Lina Gong,Dianjie Lu,Guijuan Zhang,Zhuoran Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 11 pages, 9 figures

点击查看摘要

Abstract:Purple fringing, a persistent artifact caused by Longitudinal Chromatic Aberration (LCA) in camera lenses, has long degraded the clarity and realism of digital imaging. Traditional solutions rely on complex and expensive apochromatic (APO) lens hardware and the extraction of handcrafted features, ignoring the data-driven approach. To fill this gap, we introduce DCA-LUT, the first deep learning framework for purple fringing removal. Inspired by the physical root of the problem, the spatial misalignment of RGB color channels due to lens dispersion, we introduce a novel Chromatic-Aware Coordinate Transformation (CA-CT) module, learning an image-adaptive color space to decouple and isolate fringing into a dedicated dimension. This targeted separation allows the network to learn a precise ``purple fringe channel", which then guides the accurate restoration of the luminance channel. The final color correction is performed by a learned 5D Look-Up Table (5D LUT), enabling efficient and powerful% non-linear color mapping. To enable robust training and fair evaluation, we constructed a large-scale synthetic purple fringing dataset (PF-Synth). Extensive experiments in synthetic and real-world datasets demonstrate that our method achieves state-of-the-art performance in purple fringing removal.
zh

[CV-265] MovSemCL: Movement-Semantics Contrastive Learning for Trajectory Similarity AAAI2026

【速读】:该论文旨在解决现有基于学习的轨迹相似性计算方法中存在的三大问题:(1)对轨迹语义和层次结构建模不足,缺乏运动动态提取与多尺度结构表示;(2)由于逐点编码导致计算成本过高;(3)使用物理上不合理的数据增强方式,破坏了轨迹语义。解决方案的关键在于提出MovSemCL框架,其核心创新包括:首先将原始GPS轨迹转换为运动语义特征并分块处理,随后通过局部与全局注意力机制实现高效层级表征,降低计算开销;同时引入曲率引导的数据增强策略,在保留关键段落(如转弯和交叉口)的同时掩码冗余区域,生成物理上合理的增强视图,从而提升轨迹相似性计算的准确性与效率。

链接: https://arxiv.org/abs/2511.12061
作者: Zhichen Lai,Hua Lu,Huan Li,Jialiang Li,Christian S. Jensen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: 8 pages, 6 figures; accepted by AAAI 2026 as an Oral paper

点击查看摘要

Abstract:Trajectory similarity computation is fundamental functionality that is used for, e.g., clustering, prediction, and anomaly detection. However, existing learning-based methods exhibit three key limitations: (1) insufficient modeling of trajectory semantics and hierarchy, lacking both movement dynamics extraction and multi-scale structural representation; (2) high computational costs due to point-wise encoding; and (3) use of physically implausible augmentations that distort trajectory semantics. To address these issues, we propose MovSemCL, a movement-semantics contrastive learning framework for trajectory similarity computation. MovSemCL first transforms raw GPS trajectories into movement-semantics features and then segments them into patches. Next, MovSemCL employs intra- and inter-patch attentions to encode local as well as global trajectory patterns, enabling efficient hierarchical representation and reducing computational costs. Moreover, MovSemCL includes a curvature-guided augmentation strategy that preserves informative segments (e.g., turns and intersections) and masks redundant ones, generating physically plausible augmented views. Experiments on real-world datasets show that MovSemCL is capable of outperforming state-of-the-art methods, achieving mean ranks close to the ideal value of 1 at similarity search tasks and improvements by up to 20.3% at heuristic approximation, while reducing inference latency by up to 43.4%.
zh

[CV-266] PipeDiT: Accelerating Diffusion Transformers in Video Generation with Task Pipelining and Model Decoupling

【速读】:该论文旨在解决基于扩散Transformer(DiT)的视频生成模型在实际部署中面临的推理速度慢和内存消耗高的问题。其核心解决方案是提出了一种名为PipeDiT的新型流水线框架,关键创新包括:(1)设计了用于序列并行(Sequence Parallelism, SP)的流水线算法(PipeSP),实现潜在空间生成计算与多GPU间通信的重叠,从而降低推理延迟;(2)提出DeDiVAE方法将扩散模块与变分自编码器(Variational Autoencoder, VAE)模块解耦至两个独立GPU组,并支持流水线执行,有效减少内存占用和推理延迟;(3)为更高效利用VAE组的GPU资源,引入注意力协同处理(Attention Co-processing, Aco)机制进一步优化整体生成延迟。

链接: https://arxiv.org/abs/2511.12056
作者: Sijie Wang,Qiang Wang,Shaohuai Shi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Video generation has been advancing rapidly, and diffusion transformer (DiT) based models have demonstrated remark- able capabilities. However, their practical deployment is of- ten hindered by slow inference speeds and high memory con- sumption. In this paper, we propose a novel pipelining frame- work named PipeDiT to accelerate video generation, which is equipped with three main innovations. First, we design a pipelining algorithm (PipeSP) for sequence parallelism (SP) to enable the computation of latent generation and commu- nication among multiple GPUs to be pipelined, thus reduc- ing inference latency. Second, we propose DeDiVAE to de- couple the diffusion module and the variational autoencoder (VAE) module into two GPU groups, whose executions can also be pipelined to reduce memory consumption and infer- ence latency. Third, to better utilize the GPU resources in the VAE group, we propose an attention co-processing (Aco) method to further reduce the overall video generation latency. We integrate our PipeDiT into both OpenSoraPlan and Hun- yuanVideo, two state-of-the-art open-source video generation frameworks, and conduct extensive experiments on two 8- GPU systems. Experimental results show that, under many common resolution and timestep configurations, our PipeDiT achieves 1.06x to 4.02x speedups over OpenSoraPlan and HunyuanVideo.
zh

[CV-267] UniABG: Unified Adversarial View Bridging and Graph Correspondence for Unsupervised Cross-View Geo-Localization AAAI2026

【速读】:该论文旨在解决跨视图地理定位(Cross-view geo-localization, CVGL)中监督方法依赖大量成对标注导致可扩展性差,而无监督方法因跨视图域差异产生噪声伪标签的问题。解决方案的关键在于提出一种双阶段无监督框架UniABG,其核心创新包括:第一阶段采用视图感知对抗桥接(View-Aware Adversarial Bridging, VAAB)建模视图不变特征以增强伪标签鲁棒性;第二阶段通过异质图滤波校准(Heterogeneous Graph Filtering Calibration, HGFC)构建双向视图结构图,实现可靠的跨视图对应关系优化。实验表明,该方法在无监督条件下显著优于现有技术,甚至超越部分监督基线。

链接: https://arxiv.org/abs/2511.12054
作者: Cuiqun Chen,Qi Chen,Bin Yang,Xingyi Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted as Oral Presentation at AAAI 2026. 10 pages, 9 figures

点击查看摘要

Abstract:Cross-view geo-localization (CVGL) matches query images ( \textite.g. , drone) to geographically corresponding opposite-view imagery ( \textite.g. , satellite). While supervised methods achieve strong performance, their reliance on extensive pairwise annotations limits scalability. Unsupervised alternatives avoid annotation costs but suffer from noisy pseudo-labels due to intrinsic cross-view domain gaps. To address these limitations, we propose \textitUniABG , a novel dual-stage unsupervised cross-view geo-localization framework integrating adversarial view bridging with graph-based correspondence calibration. Our approach first employs View-Aware Adversarial Bridging (VAAB) to model view-invariant features and enhance pseudo-label robustness. Subsequently, Heterogeneous Graph Filtering Calibration (HGFC) refines cross-view associations by constructing dual inter-view structure graphs, achieving reliable view correspondence. Extensive experiments demonstrate state-of-the-art unsupervised performance, showing that UniABG improves Satellite \rightarrow Drone AP by +10.63% on University-1652 and +16.73% on SUES-200, even surpassing supervised baselines. The source code is available at this https URL
zh

[CV-268] DeiTFake: Deepfake Detection Model using DeiT Multi-Stage Training

【速读】:该论文旨在解决深度伪造(Deepfake)对数字媒体完整性构成的严重威胁,提出了一种基于DeiT(Data-efficient Image Transformer)的检测方法DeiTFake。其解决方案的关键在于采用一种新颖的两阶段渐进式训练策略:第一阶段使用标准数据增强进行迁移学习,第二阶段引入高级仿射变换和针对深度伪造特性的增强技术进行微调;同时利用DeiT的知识蒸馏机制捕捉细微的篡改痕迹,从而显著提升检测模型的鲁棒性和准确性。

链接: https://arxiv.org/abs/2511.12048
作者: Saksham Kumar,Ashish Singh,Srinivasarao Thota,Sunil Kumar Singh,Chandan Kumar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Deepfakes are major threats to the integrity of digital media. We propose DeiTFake, a DeiT-based transformer and a novel two-stage progressive training strategy with increasing augmentation complexity. The approach applies an initial transfer-learning phase with standard augmentations followed by a fine-tuning phase using advanced affine and deepfake-specific augmentations. DeiT’s knowledge distillation model captures subtle manipulation artifacts, increasing robustness of the detection model. Trained on the OpenForensics dataset (190,335 images), DeiTFake achieves 98.71% accuracy after stage one and 99.22% accuracy with an AUROC of 0.9997, after stage two, outperforming the latest OpenForensics baselines. We analyze augmentation impact and training schedules, and provide practical benchmarks for facial deepfake detection.
zh

[CV-269] DCMM-Transformer: Degree-Corrected Mixed-Membership Attention for Medical Imaging

【速读】:该论文旨在解决标准视觉Transformer(Vision Transformer, ViT)在医学图像分析中无法有效利用潜在解剖结构分组(如器官、组织和病灶区域)的问题。现有方法如SBM-Transformer虽尝试通过随机二值掩码引入结构信息,但存在不可微分、训练不稳定以及难以建模复杂社区结构等局限。其解决方案的关键在于提出DCMM-Transformer,该架构将度校正混合成员模型(Degree-Corrected Mixed-Membership, DCMM)作为自注意力机制中的加性偏置项,从而以完全可微且可解释的方式引入群体结构和节点度异质性,显著提升了模型对医学图像中解剖先验的建模能力与注意力图的语义一致性。

链接: https://arxiv.org/abs/2511.12047
作者: Huimin Cheng,Xiaowei Yu,Shushan Wu,Luyang Fang,Chao Cao,Jing Zhang,Tianming Liu,Dajiang Zhu,Wenxuan Zhong,Ping Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Medical images exhibit latent anatomical groupings, such as organs, tissues, and pathological regions, that standard Vision Transformers (ViTs) fail to exploit. While recent work like SBM-Transformer attempts to incorporate such structures through stochastic binary masking, they suffer from non-differentiability, training instability, and the inability to model complex community structure. We present DCMM-Transformer, a novel ViT architecture for medical image analysis that incorporates a Degree-Corrected Mixed-Membership (DCMM) model as an additive bias in self-attention. Unlike prior approaches that rely on multiplicative masking and binary sampling, our method introduces community structure and degree heterogeneity in a fully differentiable and interpretable manner. Comprehensive experiments across diverse medical imaging datasets, including brain, chest, breast, and ocular modalities, demonstrate the superior performance and generalizability of the proposed approach. Furthermore, the learned group structure and structured attention modulation substantially enhance interpretability by yielding attention maps that are anatomically meaningful and semantically coherent.
zh

[CV-270] BackWeak: Backdooring Knowledge Distillation Simply with Weak Triggers and Fine-tuning

【速读】:该论文旨在解决知识蒸馏(Knowledge Distillation, KD)过程中因依赖第三方预训练“教师”模型而引发的安全风险问题,尤其是后门攻击(backdoor attack)的隐蔽性和有效性不足。现有方法通常依赖复杂的替代学生模型(surrogate student models)和模拟蒸馏过程以保证迁移性,并通过类似通用对抗扰动(Universal Adversarial Perturbations, UAPs)的方式构造触发器,导致其在幅度上难以隐蔽且具备明显对抗特性。论文提出一种名为BackWeak的新型攻击范式,其关键在于:无需替代模型,仅通过极小学习率对良性教师模型进行微调,即可植入一种“弱”触发器(weak trigger)——即人眼不可察觉、对抗效应可忽略的扰动。实验证明,这种精细微调足以使后门在标准蒸馏流程中可靠转移到多种学生架构,实现高攻击成功率,同时显著提升隐蔽性与效率。

链接: https://arxiv.org/abs/2511.12046
作者: Shanmin Wang,Dongdong Zhao
机构: Wuhan University of Technology (武汉理工大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Knowledge Distillation (KD) is essential for compressing large models, yet relying on pre-trained “teacher” models downloaded from third-party repositories introduces serious security risks – most notably backdoor attacks. Existing KD backdoor methods are typically complex and computationally intensive: they employ surrogate student models and simulated distillation to guarantee transferability, and they construct triggers in a way similar to universal adversarial perturbations (UAPs), which being not stealthy in magnitude, inherently exhibit strong adversarial behavior. This work questions whether such complexity is necessary and constructs stealthy “weak” triggers – imperceptible perturbations that have negligible adversarial effect. We propose BackWeak, a simple, surrogate-free attack paradigm. BackWeak shows that a powerful backdoor can be implanted by simply fine-tuning a benign teacher with a weak trigger using a very small learning rate. We demonstrate that this delicate fine-tuning is sufficient to embed a backdoor that reliably transfers to diverse student architectures during a victim’s standard distillation process, yielding high attack success rates. Extensive empirical evaluations on multiple datasets, model architectures, and KD methods show that BackWeak is efficient, simpler, and often more stealthy than previous elaborate approaches. This work calls on researchers studying KD backdoor attacks to pay particular attention to the trigger’s stealthiness and its potential adversarial characteristics.
zh

[CV-271] FedSDA: Federated Stain Distribution Alignment for Non-IID Histopathological Image Classification

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中因非独立同分布(non-IID)组织病理学图像导致的特征分布偏移问题,该问题会显著降低模型性能。解决方案的关键在于提出一种联邦染色分布对齐方法(Federated Stain Distribution Alignment, FedSDA),其核心思想是仅通过调整各客户端的数据分布来缓解分布偏移:利用扩散模型(diffusion models)在拟合数据分布上的优势,并结合染色分离技术提取与非IID特性密切相关的关键特征,从而在联邦框架内将每个客户端的染色分布对齐至目标分布。此外,为避免在原始数据上训练扩散模型可能引发的隐私泄露风险,FedSDA设计了隐私保护机制,在不暴露原始数据的前提下实现有效的分布对齐,实验证明该方法在提升模型性能和应对非IID问题方面优于现有基准方法。

链接: https://arxiv.org/abs/2511.12044
作者: Cheng-Chang Tsai,Kai-Wen Cheng,Chun-Shien Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Extended version. 22 pages, 18 figures, 6 tables

点击查看摘要

Abstract:Federated learning (FL) has shown success in collaboratively training a model among decentralized data resources without directly sharing privacy-sensitive training data. Despite recent advances, non-IID (non-independent and identically distributed) data poses an inevitable challenge that hinders the use of FL. In this work, we address the issue of non-IID histopathological images with feature distribution shifts from an intuitive perspective that has only received limited attention. Specifically, we address this issue from the perspective of data distribution by solely adjusting the data distributions of all clients. Building on the success of diffusion models in fitting data distributions and leveraging stain separation to extract the pivotal features that are closely related to the non-IID properties of histopathological images, we propose a Federated Stain Distribution Alignment (FedSDA) method. FedSDA aligns the stain distribution of each client with a target distribution in an FL framework to mitigate distribution shifts among clients. Furthermore, considering that training diffusion models on raw data in FL has been shown to be susceptible to privacy leakage risks, we circumvent this problem while still effectively achieving alignment. Extensive experimental results show that FedSDA is not only effective in improving baselines that focus on mitigating disparities across clients’ model updates but also outperforms baselines that address the non-IID data issues from the perspective of data distribution. We show that FedSDA provides valuable and practical insights for the computational pathology community.
zh

[CV-272] SRSplat: Feed-Forward Super-Resolution Gaussian Splatting from Sparse Multi-View Images AAAI2026

【速读】:该论文旨在解决从稀疏、低分辨率(Low-Resolution, LR)图像中进行前向三维重建时难以恢复精细纹理细节的问题,其根本原因在于LR输入本身缺乏高频信息。解决方案的关键在于提出SRSplat框架,通过联合利用外部高质量参考图像与内部纹理线索来补偿纹理信息的缺失:首先基于多模态大语言模型(Multimodal Large Language Models, MLLMs)和扩散模型构建场景特定的参考图库;进而设计Reference-Guided Feature Enhancement (RGFE)模块,对齐并融合LR输入图像与其参考孪生图像的特征;最后引入Texture-Aware Density Control (TADC)模块,根据LR输入的内部纹理丰富度自适应调整高斯原始(Gaussian primitives)密度,从而提升重建质量。

链接: https://arxiv.org/abs/2511.12040
作者: Xinyuan Hu,Changyue Shi,Chuxiao Yang,Minghao Chen,Jiajun Ding,Tao Wei,Chen Wei,Zhou Yu,Min Tan
机构: 1. Tsinghua University (清华大学); 2. Alibaba Group (阿里巴巴集团); 3. Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI2026-Oral. Project Page: this https URL

点击查看摘要

Abstract:Feed-forward 3D reconstruction from sparse, low-resolution (LR) images is a crucial capability for real-world applications, such as autonomous driving and embodied AI. However, existing methods often fail to recover fine texture details. This limitation stems from the inherent lack of high-frequency information in LR inputs. To address this, we propose \textbfSRSplat, a feed-forward framework that reconstructs high-resolution 3D scenes from only a few LR views. Our main insight is to compensate for the deficiency of texture information by jointly leveraging external high-quality reference images and internal texture cues. We first construct a scene-specific reference gallery, generated for each scene using Multimodal Large Language Models (MLLMs) and diffusion models. To integrate this external information, we introduce the \textitReference-Guided Feature Enhancement (RGFE) module, which aligns and fuses features from the LR input images and their reference twin image. Subsequently, we train a decoder to predict the Gaussian primitives using the multi-view fused feature obtained from \textitRGFE. To further refine predicted Gaussian primitives, we introduce \textitTexture-Aware Density Control (TADC), which adaptively adjusts Gaussian density based on the internal texture richness of the LR inputs. Extensive experiments demonstrate that our SRSplat outperforms existing methods on various datasets, including RealEstate10K, ACID, and DTU, and exhibits strong cross-dataset and cross-resolution generalization capabilities.
zh

[CV-273] IMERIPPLE: Accelerating vDiTs by Understanding the Spatio-Temporal Correlations in Latent Space

【速读】:该论文旨在解决视频生成模型中基于视频扩散变换器(video diffusion transformer, vDiT)的自注意力机制计算延迟过高的问题。现有方法虽尝试通过减少自注意力中的冗余计算来加速推理,但往往忽视了视频流中固有的时空相关性,并直接套用大语言模型的稀疏模式,导致优化效果有限。论文的关键解决方案是:基于对vDiT潜空间中时空相关性的深入分析,提出一种轻量且自适应的注意力分数复用策略——利用同一通道内空间或时间上相关的token,复用其部分注意力得分以近似完整计算。该方法在4种主流vDiT架构上实现了高达85%的计算节省,同时保持视频质量几乎不变(VBench指标仅损失0.06%)。

链接: https://arxiv.org/abs/2511.12035
作者: Wenxuan Miao,Yulin Sun,Aiyue Chen,Jing Lin,Yiwu Yao,Yiming Gan,Jieru Zhao,Jingwen Leng,Mingyi Guo,Yu Feng
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai Qizhi Institute (上海奇智研究院); Huawei Technologies Co.,Ltd (华为技术有限公司); Institute of Computing Technology, Chinese Academy of Science (中国科学院计算技术研究所)
类目: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The recent surge in video generation has shown the growing demand for high-quality video synthesis using large vision models. Existing video generation models are predominantly based on the video diffusion transformer (vDiT), however, they suffer from substantial inference delay due to self-attention. While prior studies have focused on reducing redundant computations in self-attention, they often overlook the inherent spatio-temporal correlations in video streams and directly leverage sparsity patterns from large language models to reduce attention computations. In this work, we take a principled approach to accelerate self-attention in vDiTs by leveraging the spatio-temporal correlations in the latent space. We show that the attention patterns within vDiT are primarily due to the dominant spatial and temporal correlations at the token channel level. Based on this insight, we propose a lightweight and adaptive reuse strategy that approximates attention computations by reusing partial attention scores of spatially or temporally correlated tokens along individual channels. We demonstrate that our method achieves significantly higher computational savings (85%) compared to state-of-the-art techniques over 4 vDiTs, while preserving almost identical video quality ( 0.06% loss on VBench). Subjects: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2511.12035 [cs.AR] (or arXiv:2511.12035v1 [cs.AR] for this version) https://doi.org/10.48550/arXiv.2511.12035 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-274] Calibrated Multimodal Representation Learning with Missing Modalities

【速读】:该论文旨在解决多模态表示学习中因模态缺失导致的对齐偏差问题,即在实际数据集中普遍存在部分模态缺失的情况下,传统跨模态对齐方法由于依赖所有模态同时存在而难以有效建模,从而引发锚点偏移(anchor shift)现象。其解决方案的关键在于提出CalMRL方法,通过利用模态间的先验知识和内在关联,在表示层上对缺失模态进行建模与补全,并采用两步优化策略结合共享潜在变量后验分布的闭式解,实现对不完整对齐的校准,从而缓解锚点偏移并保证收敛性。

链接: https://arxiv.org/abs/2511.12034
作者: Xiaohao Liu,Xiaobo Xia,Jiaheng Wei,Shuo Yang,Xiu Su,See-Kiong Ng,Tat-Seng Chua
机构: National University of Singapore (新加坡国立大学); Hong Kong University of Science and Technology (广州) (香港科技大学(广州)); Harbin Institute of Technology (深圳) (哈尔滨工业大学(深圳)); Central South University (中南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Multimodal representation learning harmonizes distinct modalities by aligning them into a unified latent space. Recent research generalizes traditional cross-modal alignment to produce enhanced multimodal synergy but requires all modalities to be present for a common instance, making it challenging to utilize prevalent datasets with missing modalities. We provide theoretical insights into this issue from an anchor shift perspective. Observed modalities are aligned with a local anchor that deviates from the optimal one when all modalities are present, resulting in an inevitable shift. To address this, we propose CalMRL for multimodal representation learning to calibrate incomplete alignments caused by missing modalities. Specifically, CalMRL leverages the priors and the inherent connections among modalities to model the imputation for the missing ones at the representation level. To resolve the optimization dilemma, we employ a bi-step learning method with the closed-form solution of the posterior distribution of shared latents. We validate its mitigation of anchor shift and convergence with theoretical guidance. By equipping the calibrated alignment with the existing advanced method, we offer new flexibility to absorb data with missing modalities, which is originally unattainable. Extensive experiments and comprehensive analyses demonstrate the superiority of CalMRL. Our code, model checkpoints, and evaluation raw data will be publicly available.
zh

[CV-275] Improved Masked Image Generation with Knowledge-Augmented Token Representations AAAI-26

【速读】:该论文旨在解决当前掩码图像生成(Masked Image Generation, MIG)方法在学习视觉token序列语义依赖关系时效率低、质量受限的问题,尤其是在token本身缺乏明确语义且序列长度较长的情况下。其解决方案的关键在于提出一种知识增强的掩码图像生成框架KA-MIG,通过引入从训练数据中提取的显式token级语义依赖先验知识(即三种类型的图结构:共现图、语义相似性图和位置- token不兼容图),设计了一个图感知编码器以学习token与位置感知的表示,并结合轻量级融合机制将这些增强表示集成到现有MIG模型中,从而显著提升模型对语义依赖的捕捉能力,最终改善图像生成质量。

链接: https://arxiv.org/abs/2511.12032
作者: Guotao Liang,Baoquan Zhang,Zhiyuan Wen,Zihao Han,Yunming Ye
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI-26

点击查看摘要

Abstract:Masked image generation (MIG) has demonstrated remarkable efficiency and high-fidelity images by enabling parallel token prediction. Existing methods typically rely solely on the model itself to learn semantic dependencies among visual token sequences. However, directly learning such semantic dependencies from data is challenging because the individual tokens lack clear semantic meanings, and these sequences are usually long. To address this limitation, we propose a novel Knowledge-Augmented Masked Image Generation framework, named KA-MIG, which introduces explicit knowledge of token-level semantic dependencies (\emphi.e., extracted from the training data) as priors to learn richer representations for improving performance. In particular, we explore and identify three types of advantageous token knowledge graphs, including two positive and one negative graphs (\emphi.e., the co-occurrence graph, the semantic similarity graph, and the position-token incompatibility graph). Based on three prior knowledge graphs, we design a graph-aware encoder to learn token and position-aware representations. After that, a lightweight fusion mechanism is introduced to integrate these enriched representations into the existing MIG methods. Resorting to such prior knowledge, our method effectively enhances the model’s ability to capture semantic dependencies, leading to improved generation quality. Experimental results demonstrate that our method improves upon existing MIG for class-conditional image generation on ImageNet.
zh

[CV-276] VPHO: Joint Visual-Physical Cue Learning and Aggregation for Hand-Object Pose Estimation AAAI2026

【速读】:该论文旨在解决从单张RGB图像中估计手部与物体三维姿态(3D poses)的问题,该任务在增强现实(Augmented Reality, AR)和人机交互(Human-Computer Interaction, HCI)等领域具有重要意义。现有方法主要依赖视觉线索,常导致违反物理约束的结果,如手与物体之间的穿插或非接触情况;而引入物理推理的近期方法通常采用后处理优化或不可微分的物理引擎,牺牲了视觉一致性与端到端可训练性。论文提出了一种融合视觉与物理线索的新框架,其核心创新在于:1)联合视觉-物理线索学习(joint visual-physical cue learning),使模型同时提取二维视觉特征与三维物理约束特征,实现对手-物交互更全面的表征学习;2)候选姿态聚合机制(candidate pose aggregation),通过扩散生成多个候选姿态并结合视觉与物理预测进行筛选与融合,最终输出既视觉一致又物理合理的姿态估计。

链接: https://arxiv.org/abs/2511.12030
作者: Jun Zhou,Chi Xu,Kaifeng Tang,Yuting Ge,Tingrui Guo,Li Cheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 9 figures, extended version of the AAAI 2026 paper “VPHO: Joint Visual-Physical Cue Learning and Aggregation for Hand-Object Pose Estimation”

点击查看摘要

Abstract:Estimating the 3D poses of hands and objects from a single RGB image is a fundamental yet challenging problem, with broad applications in augmented reality and human-computer interaction. Existing methods largely rely on visual cues alone, often producing results that violate physical constraints such as interpenetration or non-contact. Recent efforts to incorporate physics reasoning typically depend on post-optimization or non-differentiable physics engines, which compromise visual consistency and end-to-end trainability. To overcome these limitations, we propose a novel framework that jointly integrates visual and physical cues for hand-object pose estimation. This integration is achieved through two key ideas: 1) joint visual-physical cue learning: The model is trained to extract 2D visual cues and 3D physical cues, thereby enabling more comprehensive representation learning for hand-object interactions; 2) candidate pose aggregation: A novel refinement process that aggregates multiple diffusion-generated candidate poses by leveraging both visual and physical predictions, yielding a final estimate that is visually consistent and physically plausible. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches in both pose accuracy and physical plausibility.
zh

[CV-277] GCAgent : Long-Video Understanding via Schematic and Narrative Episodic Memory

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在长视频理解任务中面临的两大核心挑战:一是受限于token数量导致难以建模全局上下文,二是难以捕捉长期时间依赖关系和复杂事件间的因果与时序关联。为应对这些问题,作者提出了一种名为GCAgent的新型全局上下文感知智能体框架,其关键创新在于引入了“图式与叙事情景记忆”(Schematic and Narrative Episodic Memory),该记忆结构将事件及其因果和时序关系以紧凑、有序的方式进行建模,从而从根本上缓解长期依赖问题。通过感知-行动-反思多阶段循环机制,并结合记忆管理器动态检索相关的情景记忆,GCAgent实现了鲁棒且上下文感知的推理能力,在Video-MME Long数据集上相较强基线模型最高提升23.5%准确率,且在7B规模模型中达到当前最优性能(Long split 73.4%,整体平均71.9%),验证了基于智能体的推理范式与结构化记忆对认知启发式长视频理解的有效性。

链接: https://arxiv.org/abs/2511.12027
作者: Jeong Hun Yeo,Sangyun Chung,Sungjune Park,Dae Hoe Kim,Jinyoung Moon,Yong Man Ro
机构: Korea Advanced Institute of Science and Technology (KAIST)(韩国科学技术院); Electronics and Telecommunications Research Institute (ETRI)(电子通信研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-video understanding remains a significant challenge for Multimodal Large Language Models (MLLMs) due to inherent token limitations and the complexity of capturing long-term temporal dependencies. Existing methods often fail to capture the global context and complex event relationships necessary for deep video reasoning. To address this, we introduce GCAgent, a novel Global-Context-Aware Agent framework that achieves comprehensive long-video understanding. Our core innovation is the Schematic and Narrative Episodic Memory. This memory structurally models events and their causal and temporal relations into a concise, organized context, fundamentally resolving the long-term dependency problem. Operating in a multi-stage Perception-Action-Reflection cycle, our GCAgent utilizes a Memory Manager to retrieve relevant episodic context for robust, context-aware inference. Extensive experiments confirm that GCAgent significantly enhances long-video understanding, achieving up to 23.5% accuracy improvement on the Video-MME Long split over a strong MLLM baseline. Furthermore, our framework establishes state-of-the-art performance among comparable 7B-scale MLLMs, achieving 73.4% accuracy on the Long split and the highest overall average (71.9%) on the Video-MME benchmark, validating our agent-based reasoning paradigm and structured memory for cognitively-inspired long-video understanding.
zh

[CV-278] Bridging Vision and Language for Robust Context-Aware Surgical Point Tracking: The VL-SurgPT Dataset and Benchmark AAAI2026

【速读】:该论文旨在解决手术环境中点跟踪准确率低的问题,尤其在烟雾遮挡、镜面反射和组织形变等复杂视觉条件下,现有方法因缺乏语义上下文而难以理解跟踪失败机制。解决方案的关键在于构建首个大规模多模态数据集VL-SurgPT,该数据集将视觉跟踪与文本描述的点状态信息相结合,包含908段术中视频片段及17,171个标注点,覆盖组织与器械两类跟踪任务;同时提出TG-SurgPT方法,利用文本引导实现语义增强的跟踪策略,在不利视觉条件下显著提升跟踪精度与鲁棒性,从而推动具备情境感知能力的计算机辅助手术系统的发展。

链接: https://arxiv.org/abs/2511.12026
作者: Rulin Zhou,Wenlong He,An Wang,Jianhang Zhang,Xuanhui Zeng,Xi Zhang,Chaowei Zhu,Haijun Hu,Hongliang Ren
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI 2026 oral

点击查看摘要

Abstract:Accurate point tracking in surgical environments remains challenging due to complex visual conditions, including smoke occlusion, specular reflections, and tissue deformation. While existing surgical tracking datasets provide coordinate information, they lack the semantic context necessary to understand tracking failure mechanisms. We introduce VL-SurgPT, the first large-scale multimodal dataset that bridges visual tracking with textual descriptions of point status in surgical scenes. The dataset comprises 908 in vivo video clips, including 754 for tissue tracking (17,171 annotated points across five challenging scenarios) and 154 for instrument tracking (covering seven instrument types with detailed keypoint annotations). We establish comprehensive benchmarks using eight state-of-the-art tracking methods and propose TG-SurgPT, a text-guided tracking approach that leverages semantic descriptions to improve robustness in visually challenging conditions. Experimental results demonstrate that incorporating point status information significantly improves tracking accuracy and reliability, particularly in adverse visual scenarios where conventional vision-only methods struggle. By bridging visual and linguistic modalities, VL-SurgPT enables the development of context-aware tracking systems crucial for advancing computer-assisted surgery applications that can maintain performance even under challenging intraoperative conditions.
zh

[CV-279] Null-Space Diffusion Distillation for Efficient Photorealistic Lensless Imaging

【速读】:该论文旨在解决无透镜相机(lensless camera)成像中因依赖配对的无透镜-有透镜监督数据而导致模型偏差的问题,以及现有无需真实标签的扩散先验方法在噪声大、高度混叠且病态的无透镜去卷积场景下性能不稳定的问题。其解决方案的关键在于分离范围空间(range-space)约束与零空间(null-space)扩散先验更新:通过引入一种单次前向传播的学生模型——零空间扩散蒸馏(Null-Space Diffusion Distillation, NSDD),该模型从迭代的DDNM+求解器中蒸馏零空间分量,并以无透镜测量值和一个范围空间锚点作为条件,从而在保持测量一致性的同时实现高保真度重建,显著提升速度与内存效率,且无需配对监督即可获得接近教师模型的感知质量。

链接: https://arxiv.org/abs/2511.12024
作者: Jose Reinaldo Cunha Santos A V Silva Neto,Hodaka Kawachi,Yasushi Yagi,Tomoya Nakamura
机构: Institute of Scientific and Industrial Research, The University of Osaka, Japan(大阪大学产业科学研究所); D3 Center, The University of Osaka, Japan(大阪大学D3中心); Graduate School of Engineering Science, The University of Osaka, Japan(大阪大学工学科学研究科)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages without reference, 6 figures, 1 table

点击查看摘要

Abstract:State-of-the-art photorealistic reconstructions for lensless cameras often rely on paired lensless-lensed supervision, which can bias models due to lens-lensless domain mismatch. To avoid this, ground-truth-free diffusion priors are attractive; however, generic formulations tuned for conventional inverse problems often break under the noisy, highly multiplexed, and ill-posed lensless deconvolution setting. We observe that methods which separate range-space enforcement from null-space diffusion-prior updates yield stable, realistic reconstructions. Building on this, we introduce Null-Space Diffusion Distillation (NSDD): a single-pass student that distills the null-space component of an iterative DDNM+ solver, conditioned on the lensless measurement and on a range-space anchor. NSDD preserves measurement consistency and achieves photorealistic results without paired supervision at a fraction of the runtime and memory. On Lensless-FFHQ and PhlatCam, NSDD is the second fastest, behind Wiener, and achieves near-teacher perceptual quality (second-best LPIPS, below DDNM+), outperforming DPS and classical convex baselines. These results suggest a practical path toward fast, ground-truth-free, photorealistic lensless imaging.
zh

[CV-280] LIHE: Linguistic Instance-Split Hyperbolic-Euclidean Framework for Generalized Weakly-Supervised Referring Expression Comprehension

【速读】:该论文旨在解决弱监督指代表达理解(Weakly-Supervised Referring Expression Comprehension, WREC)方法在现实场景中难以处理零个或多个目标对象的问题,提出了一种更贴近实际应用的新任务——弱监督广义指代表达理解(Weakly-Supervised Generalized Referring Expression Comprehension, WGREC)。其核心挑战在于两个方面:一是监督信号模糊性(supervisory signal ambiguity),即图像级弱监督难以指导模型准确推断出表达所对应的对象数量和身份;二是语义表示坍塌(semantic representation collapse),即标准欧几里得相似度导致层次相关概念聚类为非区分性簇,模糊类别边界。解决方案的关键在于提出一个两阶段框架LIHE(Linguistic Instance-Split Hyperbolic-Euclidean),其中第一阶段“参照解耦”(Referential Decoupling)预测目标数量并分解复杂表达为子表达,第二阶段“参照定位”(Referent Grounding)利用创新的混合相似度模块HEMix,融合欧几里得空间的精确定位能力与双曲几何对层次结构建模的优势,有效防止语义坍塌并保留细粒度概念差异,从而实现对可变数量参考对象的精准定位。

链接: https://arxiv.org/abs/2511.12020
作者: Xianglong Shi,Silin Cheng,Sirui Zhao,Yunhan Jiang,Enhong Chen,Yang Liu,Sebastien Ourselin
机构: University of Science and Technology of China (中国科学技术大学); The University of Hong Kong (香港大学); King’s College London (伦敦国王学院); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing Weakly-Supervised Referring Expression Comprehension (WREC) methods, while effective, are fundamentally limited by a one-to-one mapping assumption, hindering their ability to handle expressions corresponding to zero or multiple targets in realistic scenarios. To bridge this gap, we introduce the Weakly-Supervised Generalized Referring Expression Comprehension task (WGREC), a more practical paradigm that handles expressions with variable numbers of referents. However, extending WREC to WGREC presents two fundamental challenges: supervisory signal ambiguity, where weak image-level supervision is insufficient for training a model to infer the correct number and identity of referents, and semantic representation collapse, where standard Euclidean similarity forces hierarchically-related concepts into non-discriminative clusters, blurring categorical boundaries. To tackle these challenges, we propose a novel WGREC framework named Linguistic Instance-Split Hyperbolic-Euclidean (LIHE), which operates in two stages. The first stage, Referential Decoupling, predicts the number of target objects and decomposes the complex expression into simpler sub-expressions. The second stage, Referent Grounding, then localizes these sub-expressions using HEMix, our innovative hybrid similarity module that synergistically combines the precise alignment capabilities of Euclidean proximity with the hierarchical modeling strengths of hyperbolic geometry. This hybrid approach effectively prevents semantic collapse while preserving fine-grained distinctions between related concepts. Extensive experiments demonstrate LIHE establishes the first effective weakly supervised WGREC baseline on gRefCOCO and Ref-ZOM, while HEMix achieves consistent improvements on standard REC benchmarks, improving IoU@0.5 by up to 2.5%. The code is available at this https URL.
zh

[CV-281] Enhancing Road Safety Through Multi-Camera Image Segmentation with Post-Encroachment Time Analysis

【速读】:该论文旨在解决信号交叉口交通安全隐患评估中传统基于事故数据的研究因数据稀疏性和滞后性而导致的局限性问题。其解决方案的关键在于提出了一种基于多摄像头计算机视觉的实时安全评估框架,通过计算后侵入时间(Post-Encroachment Time, PET)实现高精度风险识别:利用四台同步摄像机提供连续视觉覆盖,结合YOLOv11目标分割算法在NVIDIA Jetson AGX Xavier边缘设备上进行车辆检测,并通过单应性矩阵将检测到的车辆多边形映射至统一鸟瞰图空间以对齐不同视角;进一步设计了一种像素级PET算法,无需固定栅格即可实现亚厘米级(3.3 sq-cm)精细危险可视化,最终生成对数尺度热力图用于高分辨率、实时且可扩展的交叉口安全分析。

链接: https://arxiv.org/abs/2511.12018
作者: Shounak Ray Chaudhuri,Arash Jahangiri,Christopher Paolini
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注: 8 pages, 10 figures, Submitted to IEEE Intelligent Vehicles Symposium 2026

点击查看摘要

Abstract:Traffic safety analysis at signalized intersections is vital for reducing vehicle and pedestrian collisions, yet traditional crash-based studies are limited by data sparsity and latency. This paper presents a novel multi-camera computer vision framework for real-time safety assessment through Post-Encroachment Time (PET) computation, demonstrated at the intersection of H Street and Broadway in Chula Vista, California. Four synchronized cameras provide continuous visual coverage, with each frame processed on NVIDIA Jetson AGX Xavier devices using YOLOv11 segmentation for vehicle detection. Detected vehicle polygons are transformed into a unified bird’s-eye map using homography matrices, enabling alignment across overlapping camera views. A novel pixel-level PET algorithm measures vehicle position without reliance on fixed cells, allowing fine-grained hazard visualization via dynamic heatmaps, accurate to 3.3 sq-cm. Timestamped vehicle and PET data is stored in an SQL database for long-term monitoring. Results over various time intervals demonstrate the framework’s ability to identify high-risk regions with sub-second precision and real-time throughput on edge devices, producing data for an 800 x 800 pixel logarithmic heatmap at an average of 2.68 FPS. This study validates the feasibility of decentralized vision-based PET analysis for intelligent transportation systems, offering a replicable methodology for high-resolution, real-time, and scalable intersection safety evaluation.
zh

[CV-282] Adaptive Diagnostic Reasoning Framework for Pathology with Multimodal Large Language Models

【速读】:该论文旨在解决当前病理学中人工智能(AI)工具在临床采纳受限的问题,即大多数系统缺乏人类可读的推理过程,难以审计决策并预防错误。其解决方案的关键在于提出RECAP-PATH框架,这是一种可解释的自学习范式,通过两阶段学习机制自主推导诊断标准:第一阶段“多样化”扩展病理风格的解释,第二阶段“优化”提升解释的准确性。该方法仅需少量标注数据且无需白盒访问或权重更新,即可生成与专家评估一致的癌症诊断理由,并显著优于基线模型,在乳腺癌和前列腺癌数据集上验证了其有效性。

链接: https://arxiv.org/abs/2511.12008
作者: Yunqi Hong,Johnson Kao,Liam Edwards,Nein-Tzu Liu,Chung-Yen Huang,Alex Oliveira-Kowaleski,Cho-Jui Hsieh,Neil Y.C. Lin
机构: University of California, Los Angeles (加州大学洛杉矶分校); National Taiwan University Hospital (台湾大学医院); Tri-Service General Hospital (三军总医院); National Defense Medical Center (国防医学院); David Geffen School of Medicine, University of California, Los Angeles (加州大学洛杉矶分校大卫·格芬医学院); Mechanical and Aerospace Engineering Department, University of California, Los Angeles (加州大学洛杉矶分校机械与航空工程系); Bioengineering Department, University of California, Los Angeles (加州大学洛杉矶分校生物工程系); Institute for Quantitative and Computational Biosciences, University of California (加州大学定量与计算生物科学研究所)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:AI tools in pathology have improved screening throughput, standardized quantification, and revealed prognostic patterns that inform treatment. However, adoption remains limited because most systems still lack the human-readable reasoning needed to audit decisions and prevent errors. We present RECAP-PATH, an interpretable framework that establishes a self-learning paradigm, shifting off-the-shelf multimodal large language models from passive pattern recognition to evidence-linked diagnostic reasoning. At its core is a two-phase learning process that autonomously derives diagnostic criteria: diversification expands pathology-style explanations, while optimization refines them for accuracy. This self-learning approach requires only small labeled sets and no white-box access or weight updates to generate cancer diagnoses. Evaluated on breast and prostate datasets, RECAP-PATH produced rationales aligned with expert assessment and delivered substantial gains in diagnostic accuracy over baselines. By uniting visual understanding with reasoning, RECAP-PATH provides clinically trustworthy AI and demonstrates a generalizable path toward evidence-linked interpretation.
zh

[CV-283] Uncertainty-Guided Selective Adaptation Enables Cross-Platform Predictive Fluorescence Microscopy

【速读】:该论文旨在解决深度学习在显微成像中因域偏移(如不同仪器或采集条件)导致模型性能下降的问题,尤其针对传统对抗域适应(Adversarial Domain Adaptation, ADDA)方法需重训练整个网络、易破坏语义特征表示的局限性。其解决方案的关键在于:仅对网络最浅层的卷积层进行对抗性适配,同时冻结深层网络结构,从而在保持语义一致性的同时实现高效迁移;进一步提出SIT-ADDA-Auto框架,通过浅层对抗对齐与预测不确定性相结合,自动选择最优适配深度,无需目标域标签,显著提升了跨域图像重建和下游分割任务的鲁棒性。

链接: https://arxiv.org/abs/2511.12006
作者: Kai-Wen K. Yang,Andrew Bai,Alexandra Bermudez,Yunqi Hong,Zoe Latham,Iris Sloan,Michael Liu,Vishrut Goyal,Cho-Jui Hsieh,Neil Y.C. Lin
机构: University of California, Los Angeles(加州大学洛杉矶分校); Jonsson Comprehensive Cancer Center(琼森综合癌症中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep learning is transforming microscopy, yet models often fail when applied to images from new instruments or acquisition settings. Conventional adversarial domain adaptation (ADDA) retrains entire networks, often disrupting learned semantic representations. Here, we overturn this paradigm by showing that adapting only the earliest convolutional layers, while freezing deeper layers, yields reliable transfer. Building on this principle, we introduce Subnetwork Image Translation ADDA with automatic depth selection (SIT-ADDA-Auto), a self-configuring framework that integrates shallow-layer adversarial alignment with predictive uncertainty to automatically select adaptation depth without target labels. We demonstrate robustness via multi-metric evaluation, blinded expert assessment, and uncertainty-depth ablations. Across exposure and illumination shifts, cross-instrument transfer, and multiple stains, SIT-ADDA improves reconstruction and downstream segmentation over full-encoder adaptation and non-adversarial baselines, with reduced drift of semantic features. Our results provide a design rule for label-free adaptation in microscopy and a recipe for field settings; the code is publicly available.
zh

[CV-284] LithoSeg: A Coarse-to-Fine Framework for High-Precision Lithography Segmentation

【速读】:该论文旨在解决光刻扫描电子显微镜(SEM)图像中沟槽轮廓的精确分割与测量问题,这是确保工艺控制精度、优化器件性能及提升半导体制造良率的关键环节。现有方法在不同图案几何结构和工艺窗口下缺乏足够的精度与鲁棒性,限制了其实际应用。解决方案的核心在于提出一种粗到细的网络架构LithoSeg:在粗粒度阶段,采用人机协同引导的自举(Human-in-the-Loop Bootstrapping)策略对Segment Anything Model (SAM) 进行最小监督下的鲁棒性增强;在细粒度阶段,将二维分割重构为一维回归问题,通过粗掩膜采样沟槽法向剖面,并利用轻量级多层感知机(MLP)进行逐点精细化修正,从而显著提升分割准确性和量测精度,同时降低标注需求。

链接: https://arxiv.org/abs/2511.12005
作者: Xinyu He,Botong Zhao,Bingbing Li,Shujing Lyu,Jiwei Shen,Yue Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:Accurate segmentation and measurement of lithography scanning electron microscope (SEM) images are crucial for ensuring precise process control, optimizing device performance, and advancing semiconductor manufacturing yield. Lithography segmentation requires pixel-level delineation of groove contours and consistent performance across diverse pattern geometries and process window. However, existing methods often lack the necessary precision and robustness, limiting their practical applicability. To overcome this challenge, we propose LithoSeg, a coarse-to-fine network tailored for lithography segmentation. In the coarse stage, we introduce a Human-in-the-Loop Bootstrapping scheme for the Segment Anything Model (SAM) to attain robustness with minimal supervision. In the subsequent fine stage, we recast 2D segmentation as 1D regression problem by sampling groove-normal profiles using the coarse mask and performing point-wise refinement with a lightweight MLP. LithoSeg outperforms previous approaches in both segmentation accuracy and metrology precision while requiring less supervision, offering promising prospects for real-world applications.
zh

[CV-285] Selecting Fine-Tuning Examples by Quizzing VLMs

【速读】:该论文旨在解决在对文本到图像扩散模型进行特定主题微调时,因训练图像质量参差不齐而导致生成结果不佳的问题。解决方案的关键在于提出QZLoRA框架,该框架通过QuizRank方法自动对图像进行排序——将图像视为“教育干预”,并利用视觉语言模型(VLM)对其进行“测验”,从而筛选出最能代表目标概念的高质量图像用于低秩适应(LoRA)微调。此方法显著提升了生成图像的语义一致性与真实感,且在少量样本下即可实现良好效果。

链接: https://arxiv.org/abs/2511.12002
作者: Tenghao Ji,Eytan Adar
机构: University of Michigan (密歇根大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:A challenge in fine-tuning text-to-image diffusion models for specific topics is to select good examples. Fine-tuning from image sets of varying quality, such as Wikipedia Commons, will often produce poor output. However, training images that \textitdo exemplify the target concept (e.g., a \textitfemale Mountain Bluebird) help ensure that the generated images are similarly representative (e.g., have the prototypical blue-wings and gray chest). In this work, we propose QZLoRA, a framework to select images for low-rank adaptation (LoRA). The approach leverages QuizRank, a method to automatically rank images by treating them as an educational intervention' and quizzing’ a VLM. We demonstrate that QZLoRA can produce better aligned, photorealistic images with fewer samples. We also show that these fine-tuned models can produce stylized that are similarly representative (i.e., illustrations). Our results highlight the promise of combining automated visual reasoning with parameter-efficient fine-tuning for topic-adaptive generative modeling.
zh

[CV-286] Dynamic Parameter Optimization for Highly Transferable Transformation-Based Attacks

【速读】:该论文旨在解决现有基于变换的攻击方法在参数优化中存在的盲区问题,这些问题限制了其迁移能力的充分发挥。具体而言,现有方法通常采用低迭代设置、统一参数配置且依赖网格搜索进行优化,导致攻击性能在高迭代下表现不稳定、跨模型和任务的迁移性差,以及计算复杂度高(O(mn))。解决方案的关键在于通过实证研究发现三种与参数强度相关的迁移性动态模式,并提出一种新颖的同心衰减模型(Concentric Decay Model, CDM)来解释这些模式;在此基础上,设计了一种高效的动态参数优化策略(Dynamic Parameter Optimization, DPO),利用“先上升后下降”的规律实现自适应参数调整,将复杂度降低至O(nlogm),从而显著提升不同目标模型、迭代次数和任务场景下的攻击迁移性。

链接: https://arxiv.org/abs/2511.11993
作者: Jiaming Liang,Chi-Man Pun
机构: University of Macau (澳门大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Despite their wide application, the vulnerabilities of deep neural networks raise societal concerns. Among them, transformation-based attacks have demonstrated notable success in transfer attacks. However, existing attacks suffer from blind spots in parameter optimization, limiting their full potential. Specifically, (1) prior work generally considers low-iteration settings, yet attacks perform quite differently at higher iterations, so characterizing overall performance based only on low-iteration results is misleading. (2) Existing attacks use uniform parameters for different surrogate models, iterations, and tasks, which greatly impairs transferability. (3) Traditional transformation parameter optimization relies on grid search. For n parameters with m steps each, the complexity is O(mn). Large computational overhead limits further optimization of parameters. To address these limitations, we conduct an empirical study with various transformations as baselines, revealing three dynamic patterns of transferability with respect to parameter strength. We further propose a novel Concentric Decay Model (CDM) to effectively explain these patterns. Building on these insights, we propose an efficient Dynamic Parameter Optimization (DPO) based on the rise-then-fall pattern, reducing the complexity to O(nlogm). Comprehensive experiments on existing transformation-based attacks across different surrogate models, iterations, and tasks demonstrate that our DPO can significantly improve transferability.
zh

[CV-287] BeyondFacial: Identity-Preserving Personalized Generation Beyond Facial Close-ups

【速读】:该论文旨在解决当前身份保真个性化生成(Identity-Preserving Personalized Generation, IPPG)方法过度依赖面部区域导致视觉叙事能力弱、复杂文本提示下语义一致性差的问题,其核心瓶颈在于身份(ID)特征嵌入削弱了生成模型的语义表达能力。解决方案的关键在于提出一种双线推理(Dual-Line Inference, DLI)架构,实现身份与语义表征的分离,并引入身份自适应融合(Identity Adaptive Fusion, IdAF)策略,在噪声预测阶段延迟身份-语义融合,通过自适应注意力融合与噪声决策掩码避免身份嵌入对语义的干扰;同时设计身份聚合前置(Identity Aggregation Prepending, IdAP)模块,以聚合的身份信息替代随机初始化,显著提升身份保真度。该方案无需手动掩码或微调,可作为即插即用组件集成至现有框架,突破面部特写限制,支持电影级角色-场景协同创作。

链接: https://arxiv.org/abs/2511.11989
作者: Songsong Zhang,Chuanqi Tang,Hongguang Zhang,Guijian Tang,Minglong Li,Xueqiong Li,Shaowu Yang,Yuanxi Peng,Wenjing Yang,Jing Zhao
机构: DIDS, NUDT, China (国防科技大学); TJU, China (天津大学); Unit 32010 of the Chinese PLA (中国人民解放军某单位)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 10 figures

点击查看摘要

Abstract:Identity-Preserving Personalized Generation (IPPG) has advanced film production and artistic creation, yet existing approaches overemphasize facial regions, resulting in outputs dominated by facial this http URL methods suffer from weak visual narrativity and poor semantic consistency under complex text prompts, with the core limitation rooted in identity (ID) feature embeddings undermining the semantic expressiveness of generative models. To address these issues, this paper presents an IPPG method that breaks the constraint of facial close-ups, achieving synergistic optimization of identity fidelity and scene semantic creation. Specifically, we design a Dual-Line Inference (DLI) pipeline with identity-semantic separation, resolving the representation conflict between ID and semantics inherent in traditional single-path architectures. Further, we propose an Identity Adaptive Fusion (IdAF) strategy that defers ID-semantic fusion to the noise prediction stage, integrating adaptive attention fusion and noise decision masking to avoid ID embedding interference on semantics without manual masking. Finally, an Identity Aggregation Prepending (IdAP) module is introduced to aggregate ID information and replace random initializations, further enhancing identity preservation. Experimental results validate that our method achieves stable and effective performance in IPPG tasks beyond facial close-ups, enabling efficient generation without manual masking or fine-tuning. As a plug-and-play component, it can be rapidly deployed in existing IPPG frameworks, addressing the over-reliance on facial close-ups, facilitating film-level character-scene creation, and providing richer personalized generation capabilities for related domains.
zh

[CV-288] From Classification to Cross-Modal Understanding: Leverag ing Vision-Language Models for Fine-Grained Renal Pathology

【速读】:该论文旨在解决肾活检中细粒度肾小球亚型分类的问题,其核心挑战在于临床可解释标签稀缺且获取困难,而现有计算病理学方法多依赖于全监督下的粗粒度疾病分类,缺乏对少样本场景下视觉-语言模型(VLM)适配策略的系统评估。解决方案的关键在于将细粒度亚型分类建模为一个贴近临床实际的少样本问题,并通过系统比较病理专用与通用视觉-语言模型在不同样本量(4–8样本/类别)、模型架构及适配策略下的表现,发现:病理专用的视觉-语言骨干网络配合基础微调(vanilla fine-tuning)是最有效的起点;即使在极少量标注数据下,此类模型也能有效捕捉亚型差异,显著提升判别能力和校准性能,同时强调正负样本间的判别力与图像-文本嵌入对齐同样重要,共同决定了诊断性能和多模态表征结构。

链接: https://arxiv.org/abs/2511.11984
作者: Zhenhao Guo,Rachit Saluja,Tianyuan Yao,Quan Liu,Junchao Zhu,Haibo Wang,Daniel Reisenbüchler,Yuankai Huo,Benjamin Liechty,David J. Pisapia,Kenji Ikemura,Steven Salvatoree,Surya Seshane,Mert R. Sabuncu,Yihe Yang,Ruining Deng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Fine-grained glomerular subtyping is central to kidney biopsy interpretation, but clinically valuable labels are scarce and difficult to obtain. Existing computational pathology approaches instead tend to evaluate coarse diseased classification under full supervision with image-only models, so it remains unclear how vision-language models (VLMs) should be adapted for clinically meaningful subtyping under data constraints. In this work, we model fine-grained glomerular subtyping as a clinically realistic few-shot problem and systematically evaluate both pathology-specialized and general-purpose vision-language models under this setting. We assess not only classification performance (accuracy, AUC, F1) but also the geometry of the learned representations, examining feature alignment between image and text embeddings and the separability of glomerular subtypes. By jointly analyzing shot count, model architecture and domain knowledge, and adaptation strategy, this study provides guidance for future model selection and training under real clinical data constraints. Our results indicate that pathology-specialized vision-language backbones, when paired with the vanilla fine-tuning, are the most effective starting point. Even with only 4-8 labeled examples per glomeruli subtype, these models begin to capture distinctions and show substantial gains in discrimination and calibration, though additional supervision continues to yield incremental improvements. We also find that the discrimination between positive and negative examples is as important as image-text alignment. Overall, our results show that supervision level and adaptation strategy jointly shape both diagnostic performance and multimodal structure, providing guidance for model selection, adaptation strategies, and annotation investment.
zh

[CV-289] Evaluation of Attention Mechanisms in U-Net Architectures for Semantic Segmentation of Brazilian Rock Art Petroglyphs

【速读】:该论文旨在解决岩画(petroglyphs)语义分割的精度问题,以提升巴西考古遗址中岩石艺术数字化保护的质量。其核心挑战在于复杂背景下的细粒度边界识别与小目标特征提取。解决方案的关键在于引入基于BEGL(Border-Enhanced Gaussian Loss)损失函数的U-Net变体架构,并通过嵌入注意力机制(如残差块和空间-通道注意力模块)增强模型对边缘细节和关键区域的感知能力,从而显著提升Dice Score(DSC)指标,最高达0.710,较基线模型提升2.5–2.9%。

链接: https://arxiv.org/abs/2511.11959
作者: Leonardi Melo,Luís Gustavo,Dimmy Magalhães,Lucciani Vieira,Mauro Araújo
机构: Computer Vision Research Center, iCEV Institute of Higher Education (iCEV高等教育研究所); iCEV Institute of Higher Education (iCEV高等教育研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 8 figures. Preprint submitted to arXiv

点击查看摘要

Abstract:This study presents a comparative analysis of three U-Net-based architectures for semantic segmentation of rock art petroglyphs from Brazilian archaeological sites. The investigated architectures were: (1) BEGL-UNet with Border-Enhanced Gaussian Loss function; (2) Attention-Residual BEGL-UNet, incorporating residual blocks and gated attention mechanisms; and (3) Spatial Channel Attention BEGL-UNet, which employs spatial-channel attention modules based on Convolutional Block Attention Module. All implementations employed the BEGL loss function combining binary cross-entropy with Gaussian edge enhancement. Experiments were conducted on images from the Poço da Bebidinha Archaeological Complex, Piauí, Brazil, using 5-fold cross-validation. Among the architectures, Attention-Residual BEGL-UNet achieved the best overall performance with Dice Score of 0.710, validation loss of 0.067, and highest recall of 0.854. Spatial Channel Attention BEGL-UNet obtained comparable performance with DSC of 0.707 and recall of 0.857. The baseline BEGL-UNet registered DSC of 0.690. These results demonstrate the effectiveness of attention mechanisms for archaeological heritage digital preservation, with Dice Score improvements of 2.5-2.9% over the baseline.
zh

[CV-290] From Events to Clarity: The Event-Guided Diffusion Framework for Dehazing

【速读】:该论文旨在解决雾霾环境下图像清晰化(dehazing)问题,传统基于RGB帧的方法受限于有限的动态范围(60 dB),导致去雾过程 ill-posed,易丢失结构和光照细节。其解决方案的关键在于首次引入事件相机(event camera)作为输入模态,利用其高达120 dB的高动态范围(HDR)和微秒级延迟优势,捕捉雾霾场景中的精细结构信息;进一步提出一种事件引导的扩散模型(event-guided diffusion model),通过设计事件引导模块将稀疏的HDR事件特征(如边缘、角点)映射至扩散模型潜在空间,实现从雾霾输入到清晰图像的生成,从而提供精确的结构引导、提升视觉真实性和减少语义漂移。

链接: https://arxiv.org/abs/2511.11944
作者: Ling Wang,Yunfan Lu,Wenzong Ma,Huizai Yao,Pengteng Li,Hui Xiong
机构: The Hong Kong University of Science and Technology (GuangZhou) (香港科技大学(广州))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 8 figures. Completed in April 2025

点击查看摘要

Abstract:Clear imaging under hazy conditions is a critical task. Prior-based and neural methods have improved results. However, they operate on RGB frames, which suffer from limited dynamic range. Therefore, dehazing remains ill-posed and can erase structure and illumination details. To address this, we use event cameras for dehazing for the \textbffirst time. Event cameras offer much higher HDR ( 120 dBvs.60 dB ) and microsecond latency, therefore they suit hazy scenes. In practice, transferring HDR cues from events to frames is hard because real paired data are scarce. To tackle this, we propose an event-guided diffusion model that utilizes the strong generative priors of diffusion models to reconstruct clear images from hazy inputs by effectively transferring HDR information from events. Specifically, we design an event-guided module that maps sparse HDR event features, \textite.g., edges, corners, into the diffusion latent space. This clear conditioning provides precise structural guidance during generation, improves visual realism, and reduces semantic drift. For real-world evaluation, we collect a drone dataset in heavy haze (AQI = 341) with synchronized RGB and event sensors. Experiments on two benchmarks and our dataset achieve state-of-the-art results.
zh

[CV-291] A Systematic Analysis of Out-of-Distribution Detection Under Representation and Training Paradigm Shifts

【速读】:该论文旨在解决跨分布(out-of-distribution, OOD)检测方法在不同表示学习范式下的性能差异与选择依据问题,尤其关注CLIP-stratified场景下OOD检测的有效性。其解决方案的关键在于通过系统性比较CNN与微调的视觉Transformer(Vision Transformer, ViT)在多个数据集上的表现,结合AURC和AUGRC等指标,揭示特征空间学习对OOD检测效果的决定性作用,并提出基于统计检验(Friedman检验与Conover-Holm事后检验)和图论方法(Bron-Kerbosch团算法)的多方法对比框架,从而为不同分布偏移强度下OOD检测方法的选择提供实证依据与理论支撑。

链接: https://arxiv.org/abs/2511.11934
作者: C. César Claros Olivares,Austin J. Brockmeier
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present a systematic comparison of out-of-distribution (OOD) detection methods across CLIP-stratified regimes using AURC and AUGRC as primary metrics. Experiments cover two representation paradigms: CNNs trained from scratch and a fine-tuned Vision Transformer (ViT), evaluated on CIFAR-10/100, SuperCIFAR-100, and TinyImageNet. Using a multiple-comparison-controlled, rank-based pipeline (Friedman test with Conover-Holm post-hoc) and Bron-Kerbosch cliques, we find that the learned feature space largely determines OOD efficacy. For both CNNs and ViTs, probabilistic scores (e.g., MSR, GEN) dominate misclassification (ID) detection. Under stronger shifts, geometry-aware scores (e.g., NNGuide, fDBD, CTM) prevail on CNNs, whereas on ViTs GradNorm and KPCA Reconstruction Error remain consistently competitive. We further show a class-count-dependent trade-off for Monte-Carlo Dropout (MCD) and that a simple PCA projection improves several detectors. These results support a representation-centric view of OOD detection and provide statistically grounded guidance for method selection under distribution shift.
zh

[CV-292] Enhancing XR Auditory Realism via Multimodal Scene-Aware Acoustic Rendering

【速读】:该论文旨在解决扩展现实(Extended Reality, XR)中空间音频渲染难以实时适应不同物理场景的问题,从而避免视觉与听觉线索之间的感官错位,提升用户沉浸感。解决方案的关键在于提出一种名为SAMOSA的轻量级本地化系统,其核心是通过融合实时估算的房间几何结构、表面材质以及语义驱动的声学上下文,构建多模态场景表示,并利用场景先验实现高效的声学校准,最终合成高保真的混响脉冲响应(Room Impulse Response, RIR),显著增强XR中的听觉真实感。

链接: https://arxiv.org/abs/2511.11930
作者: Tianyu Xu,Jihan Li,Penghe Zu,Pranav Sahay,Maruchi Kim,Jack Obeng-Marnu,Farley Miller,Xun Qian,Katrina Passarella,Mahitha Rachumalla,Rajeev Nongpiur,D. Shin
机构: Google(谷歌)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)
备注:

点击查看摘要

Abstract:In Extended Reality (XR), rendering sound that accurately simulates real-world acoustics is pivotal in creating lifelike and believable virtual experiences. However, existing XR spatial audio rendering methods often struggle with real-time adaptation to diverse physical scenes, causing a sensory mismatch between visual and auditory cues that disrupts user immersion. To address this, we introduce SAMOSA, a novel on-device system that renders spatially accurate sound by dynamically adapting to its physical environment. SAMOSA leverages a synergistic multimodal scene representation by fusing real-time estimations of room geometry, surface materials, and semantic-driven acoustic context. This rich representation then enables efficient acoustic calibration via scene priors, allowing the system to synthesize a highly realistic Room Impulse Response (RIR). We validate our system through technical evaluation using acoustic metrics for RIR synthesis across various room configurations and sound types, alongside an expert evaluation (N=12). Evaluation results demonstrate SAMOSA’s feasibility and efficacy in enhancing XR auditory realism.
zh

[CV-293] Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在长视频理解任务中面临的计算效率瓶颈问题,即随着视频长度增加,视觉标记(vision tokens)数量线性增长导致注意力机制开销、内存占用和推理延迟急剧上升。其解决方案的核心是提出一种轻量且高效的视觉标记选择模块——Query-aware Token Selector (QTSplus),该模块作为视觉编码器与LLM之间的信息门控机制,通过交叉注意力评分筛选关键视觉证据,并根据文本查询复杂度动态预测保留预算,在训练阶段使用可微分的直通估计器(differentiable straight-through estimator)进行Top-n标记选择,推理时采用硬门控策略;同时引入小型重编码器(re-encoder)利用绝对时间信息保持时序结构,实现秒级定位能力并维持全局覆盖范围。实验表明,QTSplus在压缩89%视觉流的同时降低28%端到端延迟,且在多个长视频理解基准上保持近似原模型精度,尤其在方向性和顺序性任务上分别提升+20.5和+5.6点,验证了其对真实世界长视频场景的有效扩展能力。

链接: https://arxiv.org/abs/2511.11910
作者: Siyou Li,Huanan Wu,Juexi Shao,Yinghao Ma,Yujian Gan,Yihao Luo,Yuwei Wang,Dong Nie,Lu Wang,Wengqing Wu,Le Zhang,Massimo Poesio,Juntao Yu
机构: Queen Mary University of London (伦敦玛丽女王大学); University of Sheffield (谢菲尔德大学); Imperial College London (帝国理工学院); Pengcheng Laboratory (鹏城实验室); Meta Inc (Meta公司); Meituan Inc (美团); Nanjing University of Science (南京理工大学); University of Birmingham (伯明翰大学); Utrecht University (乌得勒支大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite the recent advances in the video understanding ability of multimodal large language models (MLLMs), long video understanding remains a challenge. One of the main issues is that the number of vision tokens grows linearly with video length, which causes an explosion in attention cost, memory, and latency. To solve this challenge, we present Query-aware Token Selector (\textbfQTSplus), a lightweight yet powerful visual token selection module that serves as an information gate between the vision encoder and LLMs. Given a text query and video tokens, QTSplus dynamically selects the most important visual evidence for the input text query by (i) scoring visual tokens via cross-attention, (ii) \emphpredicting an instance-specific retention budget based on the complexity of the query, and (iii) \emphselecting Top- n tokens with a differentiable straight-through estimator during training and a hard gate at inference. Furthermore, a small re-encoder preserves temporal order using absolute time information, enabling second-level localization while maintaining global coverage. Integrated into Qwen2.5-VL, QTSplus compresses the vision stream by up to \textbf89% and reduces end-to-end latency by \textbf28% on long videos. The evaluation on eight long video understanding benchmarks shows near-parity accuracy overall when compared with the original Qwen models and outperforms the original model by \textbf+20.5 and \textbf+5.6 points respectively on TempCompass direction and order accuracies. These results show that QTSplus is an effective, general mechanism for scaling MLLMs to real-world long-video scenarios while preserving task-relevant evidence. We will make all code, data, and trained models’ weights publicly available. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2511.11910 [cs.CV] (or arXiv:2511.11910v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2511.11910 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-294] PI-NAIM: Path-Integrated Neural Adaptive Imputation Model

【速读】:该论文旨在解决医学影像与多模态临床场景中因数据缺失导致的诊断流程中断问题,现有插补方法或表达能力不足,或计算成本过高。其解决方案的关键在于提出一种双路径架构PI-NAIM,通过智能路径路由机制动态将样本分配至最优插补策略:低复杂度缺失模式由高效统计插补(MICE)处理,高复杂度模式则交由具备时序分析能力的生成对抗插补网络(GAIN)处理;同时引入跨路径注意力融合模块,利用缺失感知嵌入对两分支输出进行自适应加权整合,并实现插补精度与下游任务性能的端到端联合优化,从而在MIMIC-III等基准上显著优于现有方法(RMSE降至0.108,死亡预测AUROC达0.812)。

链接: https://arxiv.org/abs/2511.11908
作者: Afifa Khaled,Ebrahim Hamid Sumiea
机构: University of Science and Technology of China (中国科学技术大学); Universiti Teknologi PETRONAS (PETRONAS科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Medical imaging and multi-modal clinical settings often face the challange of missing modality in their diagnostic pipelines. Existing imputation methods either lack representational capacity or are computationally expensive. We propose PI-NAIM, a novel dual-path architecture that dynamically routes samples to optimized imputation approaches based on missingness complexity. Our framework integrates: (1) intelligent path routing that directs low missingness samples to efficient statistical imputation (MICE) and complex patterns to powerful neural networks (GAIN with temporal analysis); (2) cross-path attention fusion that leverages missingness-aware embeddings to intelligently combine both branches; and (3) end-to-end joint optimization of imputation accuracy and downstream task performance. Extensive experiments on MIMIC-III and multimodal benchmarks demonstrate state-of-the-art performance, achieving RMSE of 0.108 (vs. baselines’ 0.119-0.152) and substantial gains in downstream tasks with an AUROC of 0.812 for mortality prediction. PI-NAIM’s modular design enables seamless integration into vision pipelines handling incomplete sensor measurements, missing modalities, or corrupted inputs, providing a unified solution for real-world scenario. The code is publicly available at this https URL
zh

[CV-295] End to End AI System for Surgical Gesture Sequence Recognition and Clinical Outcome Prediction

【速读】:该论文旨在解决术中行为的细粒度分析及其对患者术后结局影响的长期挑战。其核心解决方案是提出一种端到端系统 Frame-to-Outcome (F2O),通过基于 Transformer 的时空建模与帧级分类,将组织解剖视频自动转化为手势序列,并挖掘与术后结果相关的模式。F2O 的关键创新在于能够鲁棒地检测机器人辅助根治性前列腺切除术(robot-assisted radical prostatectomy)神经保留步骤中连续的短时(约 2 秒)手势(AUC 达 0.80 帧级和 0.81 视频级),并利用提取的手势频率、持续时间和转换特征预测术后结局,性能与人工标注相当(准确率分别为 0.79 和 0.75,置信区间重叠),且在 25 个共享特征上表现出高度一致性(效应量方向一致,差异约 0.07,相关系数 r = 0.96, p < 1e-14)。该方法为手术过程的自动可解释评估提供了基础,推动了数据驱动的手术反馈与前瞻性临床决策支持的发展。

链接: https://arxiv.org/abs/2511.11899
作者: Xi Li,Nicholas Matsumoto,Ujjwal Pasupulety,Atharva Deo,Cherine Yang,Jay Moran,Miguel E. Hernandez,Peter Wager,Jasmine Lin,Jeanine Kim,Alvin C. Goh,Christian Wagner,Geoffrey A. Sonn,Andrew J. Hung
机构: Cedars-Sinai Health System (西达赛奈医疗中心); Memorial Sloan Kettering Cancer Center (纪念斯隆-凯特琳癌症中心); St. Antonius-Gronau (圣安东尼医院); Stanford University (斯坦福大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Fine-grained analysis of intraoperative behavior and its impact on patient outcomes remain a longstanding challenge. We present Frame-to-Outcome (F2O), an end-to-end system that translates tissue dissection videos into gesture sequences and uncovers patterns associated with postoperative outcomes. Leveraging transformer-based spatial and temporal modeling and frame-wise classification, F2O robustly detects consecutive short (~2 seconds) gestures in the nerve-sparing step of robot-assisted radical prostatectomy (AUC: 0.80 frame-level; 0.81 video-level). F2O-derived features (gesture frequency, duration, and transitions) predicted postoperative outcomes with accuracy comparable to human annotations (0.79 vs. 0.75; overlapping 95% CI). Across 25 shared features, effect size directions were concordant with small differences (~ 0.07), and strong correlation (r = 0.96, p 1e-14). F2O also captured key patterns linked to erectile function recovery, including prolonged tissue peeling and reduced energy use. By enabling automatic interpretable assessment, F2O establishes a foundation for data-driven surgical feedback and prospective clinical decision support.
zh

[CV-296] Prompt Triage: Structured Optimization Enhances Vision-Language Model Performance on Medical Imaging Benchmarks

【速读】:该论文旨在解决医学视觉-语言模型(Vision-Language Models, VLMs)在医疗影像任务中性能不足的问题,尤其是现有改进方法如微调(fine-tuning)依赖大量领域特定数据和计算资源,而人工提示工程(prompt engineering)难以泛化且对医疗机构不友好。其解决方案的关键在于引入结构化的自动化提示优化框架——Declarative Self-improving Python (DSPy),通过形式化评估实现无需人工设计提示的权重无关(weight-agnostic)性能提升。实验表明,该方法在五个跨学科医学影像任务中使VLMs的性能相比零样本提示基线平均提升53%,最大提升达300%–3,400%,显著增强了医疗AI系统的可扩展性与实用性,同时保障了数据隐私并支持可复现研究。

链接: https://arxiv.org/abs/2511.11898
作者: Arnav Singhvi,Vasiliki Bikia,Asad Aali,Akshay Chaudhari,Roxana Daneshjou
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-language foundation models (VLMs) show promise for diverse imaging tasks but often underperform on medical benchmarks. Prior efforts to improve performance include model finetuning, which requires large domain-specific datasets and significant compute, or manual prompt engineering, which is hard to generalize and often inaccessible to medical institutions seeking to deploy these tools. These challenges motivate interest in approaches that draw on a model’s embedded knowledge while abstracting away dependence on human-designed prompts to enable scalable, weight-agnostic performance improvements. To explore this, we adapt the Declarative Self-improving Python (DSPy) framework for structured automated prompt optimization in medical vision-language systems through a comprehensive, formal evaluation. We implement prompting pipelines for five medical imaging tasks across radiology, gastroenterology, and dermatology, evaluating 10 open-source VLMs with four prompt optimization techniques. Optimized pipelines achieved a median relative improvement of 53% over zero-shot prompting baselines, with the largest gains ranging from 300% to 3,400% on tasks where zero-shot performance is low. These results highlight the substantial potential of applying automated prompt optimization to medical AI systems, demonstrating significant gains for vision-based applications requiring accurate clinical image interpretation. By reducing dependence on prompt design to elicit intended outputs, these techniques allow clinicians to focus on patient care and clinical decision-making. Furthermore, our experiments offer scalability and preserve data privacy, demonstrating performance improvement on open-source VLMs. We publicly release our evaluation pipelines to support reproducible research on specialized medical tasks, available at this https URL.
zh

[CV-297] Advancing Annotat3D with Harpia: A CUDA-Accelerated Library For Large-Scale Volumetric Data Segmentation

【速读】:该论文旨在解决高分辨率三维成像技术(如X射线断层扫描和先进显微成像)产生的大规模数据集对现有处理、分割和交互探索工具带来的挑战。其解决方案的关键在于引入Harpia——一个基于CUDA的新型处理库,该库通过严格的内存控制、原生分块执行机制以及一系列GPU加速的过滤、标注和量化工具,实现了对超出单个GPU内存容量的数据集的可靠运行,并显著提升了处理速度、内存效率和可扩展性,特别适用于高性能计算(HPC)和远程访问环境中的协作式科学成像工作流。

链接: https://arxiv.org/abs/2511.11890
作者: Camila Machado de Araujo,Egon P. B. S. Borges,Ricardo Marcelo Canteiro Grangeiro,Allan Pinto
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:High-resolution volumetric imaging techniques, such as X-ray tomography and advanced microscopy, generate increasingly large datasets that challenge existing tools for efficient processing, segmentation, and interactive exploration. This work introduces new capabilities to Annotat3D through Harpia, a new CUDA-based processing library designed to support scalable, interactive segmentation workflows for large 3D datasets in high-performance computing (HPC) and remote-access environments. Harpia features strict memory control, native chunked execution, and a suite of GPU-accelerated filtering, annotation, and quantification tools, enabling reliable operation on datasets exceeding single-GPU memory capacity. Experimental results demonstrate significant improvements in processing speed, memory efficiency, and scalability compared to widely used frameworks such as NVIDIA cuCIM and scikit-image. The system’s interactive, human-in-the-loop interface, combined with efficient GPU resource management, makes it particularly suitable for collaborative scientific imaging workflows in shared HPC infrastructures.
zh

[CV-298] Lacking Data? No worries! How synthetic images can alleviate image scarcity in wildlife surveys: a case study with muskox (Ovibos moschatus)

【速读】:该论文旨在解决稀有或分布稀疏物种(如麝牛)在野生动物监测中因真实训练数据匮乏而导致深度学习目标检测模型(ODMs)难以训练的问题。其核心解决方案是引入合成影像(SI),通过在零样本(ZS)和少样本(FS)场景下逐步增加SI比例来增强模型的检测性能,从而缓解小样本限制,提升模型在无真实数据或仅有少量真实数据时的准确性与鲁棒性。

链接: https://arxiv.org/abs/2511.11882
作者: Simon Durand,Samuel Foucher,Alexandre Delplanque,Joëlle Taillon,Jérôme Théau
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 34 pages, 10 figures, submitted to Remote Sensing in Ecology and Conservation

点击查看摘要

Abstract:Accurate population estimates are essential for wildlife management, providing critical insights into species abundance and distribution. Traditional survey methods, including visual aerial counts and GNSS telemetry tracking, are widely used to monitor muskox populations in Arctic regions. These approaches are resource intensive and constrained by logistical challenges. Advances in remote sensing, artificial intelligence, and high resolution aerial imagery offer promising alternatives for wildlife detection. Yet, the effectiveness of deep learning object detection models (ODMs) is often limited by small datasets, making it challenging to train robust ODMs for sparsely distributed species like muskoxen. This study investigates the integration of synthetic imagery (SI) to supplement limited training data and improve muskox detection in zero shot (ZS) and few-shot (FS) settings. We compared a baseline model trained on real imagery with 5 ZS and 5 FS models that incorporated progressively more SI in the training set. For the ZS models, where no real images were included in the training set, adding SI improved detection performance. As more SI were added, performance in precision, recall and F1 score increased, but eventually plateaued, suggesting diminishing returns when SI exceeded 100% of the baseline model training dataset. For FS models, combining real and SI led to better recall and slightly higher overall accuracy compared to using real images alone, though these improvements were not statistically significant. Our findings demonstrate the potential of SI to train accurate ODMs when data is scarce, offering important perspectives for wildlife monitoring by enabling rare or inaccessible species to be monitored and to increase monitoring frequency. This approach could be used to initiate ODMs without real data and refine it as real images are acquired over time.
zh

[CV-299] ransformers vs. Recurrent Models for Estimating Forest Gross Primary Production

【速读】:该论文旨在解决森林碳 dioxide (CO₂) 吸收(即总初级生产力,Gross Primary Production, GPP)时空动态监测的难题,传统涡度相关(Eddy Covariance, EC)技术虽能提供高频数据但空间覆盖有限,而遥感方法常依赖单一传感器光谱指数与统计模型,难以刻画GPP复杂的时序特征。解决方案的关键在于引入深度学习(Deep Learning, DL)与多源数据融合技术,对比评估两种代表性模型——基于Transformer架构的GPT-2与基于循环神经网络的长短期记忆网络(Long Short-Term Memory, LSTM)在多变量输入下的GPP预测性能。结果表明:LSTM整体精度更高且所需时间窗口更短,体现更高的效率;GPT-2在极端事件中表现更优,凸显其对复杂时序模式的捕捉能力;同时,辐射是最重要的预测因子,其次是Sentinel-2、MODIS地表温度和Sentinel-1数据,揭示了多模态输入对提升预测性能的关键作用。

链接: https://arxiv.org/abs/2511.11880
作者: David Montero,Miguel D. Mahecha,Francesco Martinuzzi,César Aybar,Anne Klosterhalfen,Alexander Knohl,Jesús Anaya,Clemens Mosig,Sebastian Wieneke
机构: Leipzig University (莱比锡大学); iDiv; MPI PKS (马克斯·普朗克研究所); Universitat de València (瓦伦西亚大学); University of Göttingen (哥廷根大学); Universidad de Medellín (麦德林大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Monitoring the spatiotemporal dynamics of forest CO _2 uptake (Gross Primary Production, GPP), remains a central challenge in terrestrial ecosystem research. While Eddy Covariance (EC) towers provide high-frequency estimates, their limited spatial coverage constrains large-scale assessments. Remote sensing offers a scalable alternative, yet most approaches rely on single-sensor spectral indices and statistical models that are often unable to capture the complex temporal dynamics of GPP. Recent advances in deep learning (DL) and data fusion offer new opportunities to better represent the temporal dynamics of vegetation processes, but comparative evaluations of state-of-the-art DL models for multimodal GPP prediction remain scarce. Here, we explore the performance of two representative models for predicting GPP: 1) GPT-2, a transformer architecture, and 2) Long Short-Term Memory (LSTM), a recurrent neural network, using multivariate inputs. Overall, both achieve similar accuracy. But, while LSTM performs better overall, GPT-2 excels during extreme events. Analysis of temporal context length further reveals that LSTM attains similar accuracy using substantially shorter input windows than GPT-2, highlighting an accuracy-efficiency trade-off between the two architectures. Feature importance analysis reveals radiation as the dominant predictor, followed by Sentinel-2, MODIS land surface temperature, and Sentinel-1 contributions. Our results demonstrate how model architecture, context length, and multimodal inputs jointly determine performance in GPP prediction, guiding future developments of DL frameworks for monitoring terrestrial carbon dynamics.
zh

[CV-300] FocusSDF: Boundary-Aware Learning for Medical Image Segmentation via Signed Distance Supervision

【速读】:该论文旨在解决医学图像分割中边界信息编码不足的问题,这一缺陷导致分割结果在病灶或器官边缘区域精度不高,从而影响临床诊断与治疗的准确性。其解决方案的关键在于提出一种基于符号距离函数(Signed Distance Function, SDF)的新颖损失函数——FocusSDF,该方法通过自适应地为靠近边界区域的像素分配更高权重,引导网络模型更加关注边界区域,从而实现对边界的显式建模和有效保留。

链接: https://arxiv.org/abs/2511.11864
作者: Muzammal Shafique,Nasir Rahim,Jamil Ahmad,Mohammad Siadat,Khalid Malik,Ghaus Malik
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Segmentation of medical images constitutes an essential component of medical image analysis, providing the foundation for precise diagnosis and efficient therapeutic interventions in clinical practices. Despite substantial progress, most segmentation models do not explicitly encode boundary information; as a result, making boundary preservation a persistent challenge in medical image segmentation. To address this challenge, we introduce FocusSDF, a novel loss function based on the signed distance functions (SDFs), which redirects the network to concentrate on boundary regions by adaptively assigning higher weights to pixels closer to the lesion or organ boundary, effectively making it boundary aware. To rigorously validate FocusSDF, we perform extensive evaluations against five state-of-the-art medical image segmentation models, including the foundation model MedSAM, using four distance-based loss functions across diverse datasets covering cerebral aneurysm, stroke, liver, and breast tumor segmentation tasks spanning multiple imaging modalities. The experimental results consistently demonstrate the superior performance of FocusSDF over existing distance transform based loss functions.
zh

[CV-301] Defending Unauthorized Model Merging via Dual-Stage Weight Protection

【速读】:该论文旨在解决预训练模型未经授权合并(unauthorized model merging)所引发的知识产权侵犯与模型所有权失效问题。当前开放模型库中,攻击者可自由组合微调后的模型生成具备多能力的新模型,严重威胁原模型开发者权益。解决方案的关键在于提出MergeGuard——一种双阶段权重保护框架:第一阶段通过L2正则化优化重新分配任务相关信息至各层,使重要梯度均匀分布;第二阶段注入结构化扰动以错位任务子空间,破坏损失曲面中的曲率兼容性。二者协同重塑参数几何结构,使得合并模型因破坏性干涉而性能崩溃,同时保护模型保持完整功能,实验表明其可使合并模型准确率下降最高达90%,且对原模型性能影响不足1.5%。

链接: https://arxiv.org/abs/2511.11851
作者: Wei-Jia Chen,Min-Yen Tsai,Cheng-Yi Lee,Chia-Mu Yu
机构: National Yang Ming Chiao Tung University (国立阳明交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: 10 pages, under review

点击查看摘要

Abstract:The rapid proliferation of pretrained models and open repositories has made model merging a convenient yet risky practice, allowing free-riders to combine fine-tuned models into a new multi-capability model without authorization. Such unauthorized model merging not only violates intellectual property rights but also undermines model ownership and accountability. To address this issue, we present MergeGuard, a proactive dual-stage weight protection framework that disrupts merging compatibility while maintaining task fidelity. In the first stage, we redistribute task-relevant information across layers via L2-regularized optimization, ensuring that important gradients are evenly dispersed. In the second stage, we inject structured perturbations to misalign task subspaces, breaking curvature compatibility in the loss landscape. Together, these stages reshape the model’s parameter geometry such that merged models collapse into destructive interference while the protected model remains fully functional. Extensive experiments on both vision (ViT-L-14) and language (Llama2, Gemma2, Mistral) models demonstrate that MergeGuard reduces merged model accuracy by up to 90% with less than 1.5% performance loss on the protected model.
zh

[CV-302] MP-GFormer: A 3D-Geometry-Aware Dynamic Graph Transformer Approach for Machining Process Planning

【速读】:该论文旨在解决加工过程规划(Machining Process Planning, MP)中动态依赖关系建模不足的问题,特别是现有动态图学习(Dynamic Graph Learning, DGL)方法无法有效融合零件三维(3D)几何信息而导致的领域感知能力缺失问题。解决方案的关键在于提出MP-GFormer——一种3D几何感知的动态图Transformer模型,通过引入注意力机制将随加工操作演化的3D几何表示(基于StereoLithography表面网格)融入DGL框架,从而在预测加工操作序列时增强对零件几何特征的空间-时间依赖关系的建模能力,显著提升了主操作与子操作预测的准确性。

链接: https://arxiv.org/abs/2511.11837
作者: Fatemeh Elhambakhsh,Gaurav Ameta,Aditi Roy,Hyunwoong Ko
机构: Arizona State University (亚利桑那州立大学); Siemens Foundational Technologies (西门子基础技术)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Machining process planning (MP) is inherently complex due to structural and geometrical dependencies among part features and machining operations. A key challenge lies in capturing dynamic interdependencies that evolve with distinct part geometries as operations are performed. Machine learning has been applied to address challenges in MP, such as operation selection and machining sequence prediction. Dynamic graph learning (DGL) has been widely used to model dynamic systems, thanks to its ability to integrate spatio-temporal relationships. However, in MP, while existing DGL approaches can capture these dependencies, they fail to incorporate three-dimensional (3D) geometric information of parts and thus lack domain awareness in predicting machining operation sequences. To address this limitation, we propose MP-GFormer, a 3D-geometry-aware dynamic graph transformer that integrates evolving 3D geometric representations into DGL through an attention mechanism to predict machining operation sequences. Our approach leverages StereoLithography surface meshes representing the 3D geometry of a part after each machining operation, with the boundary representation method used for the initial 3D designs. We evaluate MP-GFormer on a synthesized dataset and demonstrate that the method achieves improvements of 24% and 36% in accuracy for main and sub-operation predictions, respectively, compared to state-of-the-art approaches.
zh

[CV-303] opoPerception: A Shortcut-Free Evaluation of Global Visual Perception in Large Vision-Language Models

【速读】:该论文旨在解决当前大型视觉语言模型(Large Vision-Language Models, LVLMs)在全局视觉感知能力上的严重不足问题,特别是由于传统评估基准中存在局部捷径(local shortcuts),导致模型性能被高估。其解决方案的关键在于提出一个名为TopoPerception的新基准,该基准利用拓扑特性(topological properties)来评估LVLMs在不同粒度下的全局视觉感知能力;由于拓扑依赖于图像的整体结构且对局部特征不变,TopoPerception能够实现无捷径的全局感知评估,从而揭示现有模型在理解图像整体结构方面的根本性缺陷。

链接: https://arxiv.org/abs/2511.11831
作者: Wenhao Zhou,Hao Zheng,Rong Zhao
机构: Tsinghua University (清华大学); IDG/McGovern Institute for Brain Research, Tsinghua University (IDG/麻省理工学院脑研究所,清华大学); Center for Brain-Inspired Computing Research (CBICR), Tsinghua University (脑启发计算研究中心,清华大学); Department of Precision Instruments, Tsinghua University (精密仪器系,清华大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) typically align visual features from an encoder with a pre-trained Large Language Model (LLM). However, this makes the visual perception module a bottleneck, which constrains the overall capabilities of LVLMs. Conventional evaluation benchmarks, while rich in visual semantics, often contain unavoidable local shortcuts that can lead to an overestimation of models’ perceptual abilities. Here, we introduce TopoPerception, a benchmark that leverages topological properties to rigorously evaluate the global visual perception capabilities of LVLMs across various granularities. Since topology depends on the global structure of an image and is invariant to local features, TopoPerception enables a shortcut-free assessment of global perception, fundamentally distinguishing it from semantically rich tasks. We evaluate state-of-the-art models on TopoPerception and find that even at the coarsest perceptual granularity, all models perform no better than random chance, indicating a profound inability to perceive global visual features. Notably, a consistent trend emerge within model families: more powerful models with stronger reasoning capabilities exhibit lower accuracy. This suggests that merely scaling up models is insufficient to address this deficit and may even exacerbate it. Progress may require new training paradigms or architectures. TopoPerception not only exposes a critical bottleneck in current LVLMs but also offers a lens and direction for improving their global visual perception. The data and code are publicly available at: this https URL.
zh

[CV-304] SOTFormer: A Minimal Transformer for Unified Object Tracking and Trajectory Prediction

【速读】:该论文旨在解决单目标跟踪(Single-Object Tracking, SOT)与短期运动预测在遮挡、尺度变化和时间漂移等挑战下难以保持时序一致性的问题,从而影响实时感知性能。其解决方案的关键在于提出一种最小常量内存的时序Transformer模型SOTFormer,通过引入基于真值引导的记忆机制(ground-truth-primed memory)和烧入锚损失(burn-in anchor loss)来显式稳定初始化并实现身份传播的鲁棒性;同时,仅使用一个轻量级时序注意力层对跨帧嵌入进行精炼,从而在固定GPU显存下实现端到端的实时推理(如Mini-LaSOT基准上达到53.7 FPS)。

链接: https://arxiv.org/abs/2511.11824
作者: Zhongping Dong,Pengyang Yu,Shuangjian Li,Liming Chen,Mohand Tahar Kechadi
机构: University College Dublin (都柏林大学); Dalian University of Technology (大连理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate single-object tracking and short-term motion forecasting remain challenging under occlusion, scale variation, and temporal drift, which disrupt the temporal coherence required for real-time perception. We introduce \textbfSOTFormer, a minimal constant-memory temporal transformer that unifies object detection, tracking, and short-horizon trajectory prediction within a single end-to-end framework. Unlike prior models with recurrent or stacked temporal encoders, SOTFormer achieves stable identity propagation through a ground-truth-primed memory and a burn-in anchor loss that explicitly stabilizes initialization. A single lightweight temporal-attention layer refines embeddings across frames, enabling real-time inference with fixed GPU memory. On the Mini-LaSOT (20%) benchmark, SOTFormer attains 76.3 AUC and 53.7 FPS (AMP, 4.3 GB VRAM), outperforming transformer baselines such as TrackFormer and MOTRv2 under fast motion, scale change, and occlusion.
zh

[CV-305] Coordinate Descent for Network Linearization

【速读】:该论文旨在解决基于ResNet网络的私有推理(Private Inference)中ReLU激活函数导致的显著推理延迟问题,其核心挑战在于减少ReLU单元数量这一离散优化问题。传统方法通常采用平滑近似策略,联合优化模型精度与ReLU预算,但最终的硬阈值处理步骤常引入较大性能损失。本文提出一种在离散域直接优化的新方法,以坐标下降(Coordinate Descent)作为优化框架,通过迭代更新单个变量来逼近最优解,从而天然地生成稀疏结构。实验表明,该方法在主流基准上达到当前最优性能。

链接: https://arxiv.org/abs/2511.11781
作者: Vlad Rakhlin,Amir Jevnisek,Shai Avidan
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:ReLU activations are the main bottleneck in Private Inference that is based on ResNet networks. This is because they incur significant inference latency. Reducing ReLU count is a discrete optimization problem, and there are two common ways to approach it. Most current state-of-the-art methods are based on a smooth approximation that jointly optimizes network accuracy and ReLU budget at once. However, the last hard thresholding step of the optimization usually introduces a large performance loss. We take an alternative approach that works directly in the discrete domain by leveraging Coordinate Descent as our optimization framework. In contrast to previous methods, this yields a sparse solution by design. We demonstrate, through extensive experiments, that our method is State of the Art on common benchmarks.
zh

[CV-306] Image-POSER: Reflective RL for Multi-Expert Image Generation and Editing

【速读】:该论文旨在解决当前文本到图像生成模型在处理长且复杂的组合式提示(compositional prompts)时表现不稳定的问题,这类提示在创意工作流程中十分常见。解决方案的关键在于提出Image-POSER框架,这是一个基于反思式强化学习(reflective reinforcement learning)的系统,其核心创新包括:(i) 构建一个预训练文本到图像和图像到图像专家模型的多样化注册表;(ii) 通过动态任务分解实现对长形式提示的端到端处理;(iii) 利用视觉语言模型(vision-language model)作为结构化反馈 critic,在每一步监督对齐性。该方法将图像合成与编辑建模为马尔可夫决策过程(Markov Decision Process),从而学习非平凡的专家流水线,自适应地组合不同模型的优势,显著提升生成结果在对齐度、保真度和美学上的表现,并获得人类评估的一致偏好。

链接: https://arxiv.org/abs/2511.11780
作者: Hossein Mohebbi,Mohammed Abdulrahman,Yanting Miao,Pascal Poupart,Suraj Kothawade
机构: University of Waterloo (滑铁卢大学); Vector Institute (向量研究所); Google(谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in text-to-image generation have produced strong single-shot models, yet no individual system reliably executes the long, compositional prompts typical of creative workflows. We introduce Image-POSER, a reflective reinforcement learning framework that (i) orchestrates a diverse registry of pretrained text-to-image and image-to-image experts, (ii) handles long-form prompts end-to-end through dynamic task decomposition, and (iii) supervises alignment at each step via structured feedback from a vision-language model critic. By casting image synthesis and editing as a Markov Decision Process, we learn non-trivial expert pipelines that adaptively combine strengths across models. Experiments show that Image-POSER outperforms baselines, including frontier models, across industry-standard and custom benchmarks in alignment, fidelity, and aesthetics, and is consistently preferred in human evaluations. These results highlight that reinforcement learning can endow AI systems with the capacity to autonomously decompose, reorder, and combine visual models, moving towards general-purpose visual assistants.
zh

[CV-307] Large Language Models and 3D Vision for Intelligent Robotic Perception and Autonomy: A Review

【速读】:该论文旨在解决当前机器人感知技术中语言理解与空间感知之间存在鸿沟的问题,即如何通过融合大语言模型(Large Language Models, LLMs)与三维视觉(3D vision)来提升机器人对复杂环境的感知、推理与交互能力。其解决方案的关键在于构建一个以LLMs为语义中枢、结合多模态3D数据表示和感知技术的统一框架,实现从自然语言指令到三维场景理解、物体定位、动态场景生成及语言引导操作的端到端映射,同时引入触觉、听觉和热成像等多模态输入以增强环境认知,并通过标准化基准数据集和评估指标推动该领域的可复现性与进步。

链接: https://arxiv.org/abs/2511.11777
作者: Vinit Mehta,Charu Sharma,Karthick Thiyagarajan
机构: IIIT Hyderabad (印度国际信息技术学院); Western Sydney University (西悉尼大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 45 pages, 15 figures, MDPI Sensors Journal

点击查看摘要

Abstract:With the rapid advancement of artificial intelligence and robotics, the integration of Large Language Models (LLMs) with 3D vision is emerging as a transformative approach to enhancing robotic sensing technologies. This convergence enables machines to perceive, reason and interact with complex environments through natural language and spatial understanding, bridging the gap between linguistic intelligence and spatial perception. This review provides a comprehensive analysis of state-of-the-art methodologies, applications and challenges at the intersection of LLMs and 3D vision, with a focus on next-generation robotic sensing technologies. We first introduce the foundational principles of LLMs and 3D data representations, followed by an in-depth examination of 3D sensing technologies critical for robotics. The review then explores key advancements in scene understanding, text-to-3D generation, object grounding and embodied agents, highlighting cutting-edge techniques such as zero-shot 3D segmentation, dynamic scene synthesis and language-guided manipulation. Furthermore, we discuss multimodal LLMs that integrate 3D data with touch, auditory and thermal inputs, enhancing environmental comprehension and robotic decision-making. To support future research, we catalog benchmark datasets and evaluation metrics tailored for 3D-language and vision tasks. Finally, we identify key challenges and future research directions, including adaptive model architectures, enhanced cross-modal alignment and real-time processing capabilities, which pave the way for more intelligent, context-aware and autonomous robotic sensing systems.
zh

[CV-308] Batch Transformer Architecture: Case of Synthetic Image Generation for Emotion Expression Facial Recognition

【速读】:该论文旨在解决传统编码器-解码器人工神经网络(ANN)架构中因注意力机制作用于全维度序列或批量实体而导致的瓶颈问题,从而限制模型效率与可扩展性。其解决方案的关键在于提出一种新型的“隐式稀疏”风格的Transformer变体——Batch Transformer,该架构通过仅对“重要”维度(即主成分)实施注意力机制,实现特征选择与压缩,显著降低模型内部瓶颈层的计算复杂度,同时在有限原始数据集(如带妆容和遮挡的人脸图像生成任务)上提升数据多样性,增强生成质量与泛化能力。

链接: https://arxiv.org/abs/2511.11754
作者: Stanislav Selitskiy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:A novel Transformer variation architecture is proposed in the implicit sparse style. Unlike “traditional” Transformers, instead of attention to sequential or batch entities in their entirety of whole dimensionality, in the proposed Batch Transformers, attention to the “important” dimensions (primary components) is implemented. In such a way, the “important” dimensions or feature selection allows for a significant reduction of the bottleneck size in the encoder-decoder ANN architectures. The proposed architecture is tested on the synthetic image generation for the face recognition task in the case of the makeup and occlusion data set, allowing for increased variability of the limited original data set.
zh

[CV-309] Improving a Hybrid Graphsage Deep Network for Automatic Multi-objective Logistics Management in Supply Chain

【速读】:该论文旨在解决供应链物流管理中多任务预测的效率与准确性问题,具体包括 shipment type(运输类型)、物流延迟(logistics delay)以及交通状态(traffic status)的自动预测,以提升供应链的韧性(resiliency)和可持续性(sustainability)。解决方案的关键在于提出一种混合图卷积网络(hybrid GraphSAGE network, H-GSN),该模型能够从多个公开数据库(DataCo、Shipping 和 Smart Logistics)中联合学习特征,并在不同物流场景下实现高精度预测——例如在 Smart Logistics 数据集中对 10 类物流 ID 和 3 种交通状态的预测平均准确率达 97.8% 和 100%,在 DataCo 和 Shipping 数据集中对运输类型和物流延迟的预测准确率分别达到 98.7% 和 99.4%,验证了其在增强供应链智能决策能力方面的有效性。

链接: https://arxiv.org/abs/2511.11753
作者: Mehdi Khaleghi,Nastaran Khaleghi,Sobhan Sheykhivand,Sebelan Danishvar
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Systematic logistics, conveyance amenities and facilities as well as warehousing information play a key role in fostering profitable development in a supply chain. The aim of transformation in industries is the improvement of the resiliency regarding the supply chain. The resiliency policies are required for companies to affect the collaboration with logistics service providers positively. The decrement of air pollutant emissions is a persistent advantage of the efficient management of logistics and transportation in supply chain. The management of shipment type is a significant factor in analyzing the sustainability of logistics and supply chain. An automatic approach to predict the shipment type, logistics delay and traffic status are required to improve the efficiency of the supply chain management. A hybrid graphsage network (H-GSN) is proposed in this paper for multi-task purpose of logistics management in a supply chain. The shipment type, shipment status, traffic status, logistics ID and logistics delay are the objectives in this article regarding three different databases including DataCo, Shipping and Smart Logistcis available on Kaggle as supply chain logistics databases. The average accuracy of 97.8% and 100% are acquired for 10 kinds of logistics ID and 3 types of traffic status prediction in Smart Logistics dataset. The average accuracy of 98.7% and 99.4% are obtained for shipment type prediction in DataCo and logistics delay in Shipping database, respectively. The evaluation metrics for different logistics scenarios confirm the efficiency of the proposed method to improve the resilience and sustainability of the supply chain.
zh

[CV-310] Concept-RuleNet: Grounded Multi-Agent Neurosymbolic Reasoning in Vision Language Models AAAI2026

【速读】:该论文旨在解决现代视觉语言模型(Vision-Language Models, VLMs)在决策过程中缺乏可解释性且容易产生幻觉(hallucination)的问题,尤其是在处理分布外(out-of-distribution)数据时。现有神经符号框架虽能提供透明推理,但其符号提取仅依赖任务标签,导致符号与视觉数据的语义关联弱、接地不足(weakly grounded)。解决方案的关键在于提出一个名为Concept-RuleNet的多智能体系统:首先通过多模态概念生成器从训练图像子集直接挖掘判别性视觉概念,确保符号具有强视觉 grounding;随后利用这些视觉概念引导符号发现,缓解标签偏倚;再由大语言模型推理智能体将符号组合成可执行的一阶逻辑规则;最后在推理阶段,由视觉验证智能体量化每个符号的存在程度,并协同黑盒神经模型输出触发规则执行,从而实现兼具准确性和可解释性的推理路径。

链接: https://arxiv.org/abs/2511.11751
作者: Sanchit Sinha,Guangzhi Xiong,Zhenghao He,Aidong Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: AAAI 2026 (oral)

点击查看摘要

Abstract:Modern vision-language models (VLMs) deliver impressive predictive accuracy yet offer little insight into ‘why’ a decision is reached, frequently hallucinating facts, particularly when encountering out-of-distribution data. Neurosymbolic frameworks address this by pairing black-box perception with interpretable symbolic reasoning, but current methods extract their symbols solely from task labels, leaving them weakly grounded in the underlying visual data. In this paper, we introduce a multi-agent system - Concept-RuleNet that reinstates visual grounding while retaining transparent reasoning. Specifically, a multimodal concept generator first mines discriminative visual concepts directly from a representative subset of training images. Next, these visual concepts are utilized to condition symbol discovery, anchoring the generations in real image statistics and mitigating label bias. Subsequently, symbols are composed into executable first-order rules by a large language model reasoner agent - yielding interpretable neurosymbolic rules. Finally, during inference, a vision verifier agent quantifies the degree of presence of each symbol and triggers rule execution in tandem with outputs of black-box neural models, predictions with explicit reasoning pathways. Experiments on five benchmarks, including two challenging medical-imaging tasks and three underrepresented natural-image datasets, show that our system augments state-of-the-art neurosymbolic baselines by an average of 5% while also reducing the occurrence of hallucinated symbols in rules by up to 50%.
zh

[CV-311] oward bilipshiz geometric models

【速读】:该论文旨在解决当前点云神经网络在保持对称性感知距离(symmetry-aware distances)方面的理论缺陷问题,特别是这些网络是否能够通过双 Lipschitz(bi-Lipschitz)等价关系来保留点云空间中自然的几何结构。研究发现,两种常用的对称性感知度量——Procrustes Matching(PM)度量与Hard Gromov Wasserstein距离——本身并不具备双 Lipschitz 等价性,因此主流的点云不变网络(invariant networks)无法在 PM 度量下满足双 Lipschitz 性质。解决方案的关键在于对现有网络架构进行结构性修改,使其输出映射满足双 Lipschitz 条件,从而保证在点云对应匹配任务中具有更稳定的几何保真性和泛化能力。实验初步验证了改进后的 bi-Lipschitz 模型相较于传统不变模型在点云配准任务中的优势。

链接: https://arxiv.org/abs/2511.11735
作者: Yonatan Sverdlov,Eitan Rosen,Nadav Dym
机构: Technion – Israel Institute of Technology (以色列理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Many neural networks for point clouds are, by design, invariant to the symmetries of this datatype: permutations and rigid motions. The purpose of this paper is to examine whether such networks preserve natural symmetry aware distances on the point cloud spaces, through the notion of bi-Lipschitz equivalence. This inquiry is motivated by recent work in the Equivariant learning literature which highlights the advantages of bi-Lipschitz models in other scenarios. We consider two symmetry aware metrics on point clouds: (a) The Procrustes Matching (PM) metric and (b) Hard Gromov Wasserstien distances. We show that these two distances themselves are not bi-Lipschitz equivalent, and as a corollary deduce that popular invariant networks for point clouds are not bi-Lipschitz with respect to the PM metric. We then show how these networks can be modified so that they do obtain bi-Lipschitz guarantees. Finally, we provide initial experiments showing the advantage of the proposed bi-Lipschitz model over standard invariant models, for the tasks of finding correspondences between 3D point clouds. Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV) Cite as: arXiv:2511.11735 [cs.CV] (or arXiv:2511.11735v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2511.11735 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-312] Exposing DeepFakes via Hyperspectral Domain Mapping AAAI2026

【速读】:该论文旨在解决当前生成式 AI(Generative AI)模型生成的深度伪造(Deepfake)图像难以被现有检测方法识别的问题,尤其是当这些图像在RGB空间中高度逼真、人工篡改痕迹微弱时。其解决方案的关键在于提出一种两阶段的HSI-Detect框架:首先从标准RGB输入重建出包含31个波段的高光谱图像(Hyperspectral Image, HSI),然后在高光谱域中进行检测。通过扩展至更密集的光谱通道,该方法能放大在RGB域中微弱或不可见的篡改痕迹,尤其在特定频段中更为显著,从而提升检测精度。

链接: https://arxiv.org/abs/2511.11732
作者: Aditya Mehta,Swarnim Chaudhary,Pratik Narang,Jagat Sesh Challa
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at AAAI 2026 Student Abstract

点击查看摘要

Abstract:Modern generative and diffusion models produce highly realistic images that can mislead human perception and even sophisticated automated detection systems. Most detection methods operate in RGB space and thus analyze only three spectral channels. We propose HSI-Detect, a two-stage pipeline that reconstructs a 31-channel hyperspectral image from a standard RGB input and performs detection in the hyperspectral domain. Expanding the input representation into denser spectral bands amplifies manipulation artifacts that are often weak or invisible in the RGB domain, particularly in specific frequency bands. We evaluate HSI-Detect across FaceForensics++ dataset and show the consistent improvements over RGB-only baselines, illustrating the promise of spectral-domain mapping for Deepfake detection.
zh

[CV-313] GROVER: Graph-guided Representation of Omics and Vision with Expert Regulation for Adaptive Spatial Multi-omics Fusion AAAI2026

【速读】:该论文旨在解决多模态空间组学数据(如空间转录组、蛋白质组和表观遗传组)与组织病理图像之间难以有效融合的问题,尤其针对不同模态间语义差异大、分辨率不匹配以及样本制备引起的生物扰动导致的信号失真等挑战。其核心解决方案是提出GROVER框架,关键在于:1)利用基于Kolmogorov-Arnold网络的图卷积神经网络编码器建模各模态与其空间结构之间的非线性依赖关系,生成高表达能力的模态特异性嵌入;2)设计spot-feature-pair对比学习策略,显式优化每个空间点上跨模态的对应关系;3)引入动态专家路由机制,自适应地为每个空间点选择信息量高的模态并抑制噪声或低质量输入,从而实现鲁棒且可靠的多模态空间组学融合。

链接: https://arxiv.org/abs/2511.11730
作者: Yongjun Xiao,Dian Meng,Xinlei Huang,Yanran Liu,Shiwei Ruan,Ziyue Qiao,Xubin Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 3 figures, Accepted to AAAI 2026

点击查看摘要

Abstract:Effectively modeling multimodal spatial omics data is critical for understanding tissue complexity and underlying biological mechanisms. While spatial transcriptomics, proteomics, and epigenomics capture molecular features, they lack pathological morphological context. Integrating these omics with histopathological images is therefore essential for comprehensive disease tissue analysis. However, substantial heterogeneity across omics, imaging, and spatial modalities poses significant challenges. Naive fusion of semantically distinct sources often leads to ambiguous representations. Additionally, the resolution mismatch between high-resolution histology images and lower-resolution sequencing spots complicates spatial alignment. Biological perturbations during sample preparation further distort modality-specific signals, hindering accurate integration. To address these challenges, we propose Graph-guided Representation of Omics and Vision with Expert Regulation for Adaptive Spatial Multi-omics Fusion (GROVER), a novel framework for adaptive integration of spatial multi-omics data. GROVER leverages a Graph Convolutional Network encoder based on Kolmogorov-Arnold Networks to capture the nonlinear dependencies between each modality and its associated spatial structure, thereby producing expressive, modality-specific embeddings. To align these representations, we introduce a spot-feature-pair contrastive learning strategy that explicitly optimizes the correspondence across modalities at each spot. Furthermore, we design a dynamic expert routing mechanism that adaptively selects informative modalities for each spot while suppressing noisy or low-quality inputs. Experiments on real-world spatial omics datasets demonstrate that GROVER outperforms state-of-the-art baselines, providing a robust and reliable solution for multimodal integration.
zh

[CV-314] Optimizing Input of Denoising Score Matching is Biased Towards Higher Score Norm NIPS25

【速读】:该论文旨在解决扩散模型中通过去噪得分匹配(denoising score matching)优化条件输入时引入的偏差问题,该偏差破坏了去噪得分匹配与精确得分匹配(exact score matching)之间的等价性,并导致得分范数(score norm)升高。解决方案的关键在于识别并量化这一偏差来源:当在训练过程中对条件输入进行优化时,所采用的近似策略会引入系统性偏移,从而影响模型学习到的得分函数的质量;论文进一步指出,这种偏差不仅存在于条件扩散模型中,也出现在使用预训练扩散模型优化数据分布的任务中(如图像压缩、文本到3D生成等),表明其具有广泛的影响。

链接: https://arxiv.org/abs/2511.11727
作者: Tongda Xu
机构: Tsinghua University (清华大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: NIPS 25 Workshop: Frontiers in Probabilistic Inference: Sampling Meets Learning

点击查看摘要

Abstract:Many recent works utilize denoising score matching to optimize the conditional input of diffusion models. In this workshop paper, we demonstrate that such optimization breaks the equivalence between denoising score matching and exact score matching. Furthermore, we show that this bias leads to higher score norm. Additionally, we observe a similar bias when optimizing the data distribution using a pre-trained diffusion model. Finally, we discuss the wide range of works across different domains that are affected by this bias, including MAR for auto-regressive generation, PerCo for image compression, and DreamFusion for text to 3D generation.
zh

[CV-315] Do Blind Spots Matter for Word-Referent Mapping? A Computational Study with Infant Egocentric Video

【速读】:该论文旨在解决儿童语言习得中“词-指称”映射问题,即如何在缺乏先验知识的情况下,从自然环境中将词汇与视觉对象建立可靠关联。其解决方案的关键在于提出一种生物合理且自监督的视觉表示学习策略:基于掩码自动编码器(Masked Autoencoder)构建视觉主干网络,并引入人类视网膜盲点(blind spot)信息设计新型掩码机制,以模拟人脑填补视野空白的认知过程。该方法替代传统随机掩码策略,更符合生物学原理,并在对比学习驱动的视频-文本模型中验证了其在跨情境和时序扩展场景下有效学习词-指称映射的能力。

链接: https://arxiv.org/abs/2511.11725
作者: Zekai Shi,Zhixi Cai,Kalin Stefanov
机构: Monash University (蒙纳士大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Typically, children start to learn their first words between 6 and 9 months, linking spoken utterances to their visual referents. Without prior knowledge, a word encountered for the first time can be interpreted in countless ways; it might refer to any of the objects in the environment, their components, or attributes. Using longitudinal, egocentric, and ecologically valid data from the experience of one child, in this work, we propose a self-supervised and biologically plausible strategy to learn strong visual representations. Our masked autoencoder-based visual backbone incorporates knowledge about the blind spot in human eyes to define a novel masking strategy. This mask and reconstruct approach attempts to mimic the way the human brain fills the gaps in the eyes’ field of view. This represents a significant shift from standard random masking strategies, which are difficult to justify from a biological perspective. The pretrained encoder is utilized in a contrastive learning-based video-text model capable of acquiring word-referent mappings. Extensive evaluation suggests that the proposed biologically plausible masking strategy is at least as effective as random masking for learning word-referent mappings from cross-situational and temporally extended episodes.
zh

[CV-316] Fast 3D Surrogate Modeling for Data Center Thermal Management AAAI2026

【速读】:该论文旨在解决数据中心中能耗与碳排放过高问题,核心在于通过实时温度预测实现高效冷却控制与工作负载调度。传统基于计算流体动力学(Computational Fluid Dynamics, CFD)的热模型虽精度高,但计算成本巨大且依赖专家构建网格和边界条件,难以满足实时需求。为此,作者提出一种基于视觉的代理建模框架,直接在数据中心的三维体素(voxelized)表示上运行,融合服务器负载、风扇转速及HVAC温度设定点等输入参数,采用多种深度学习架构(包括3D CNN U-Net变体、3D傅里叶神经算子和3D视觉Transformer)映射至高保真温度场。关键创新在于利用端到端学习替代传统CFD求解,实现高达20,000倍的速度提升(从数小时降至数百毫秒),同时保持良好泛化能力,从而支持实时热点识别与动态冷却优化,最终带来7%的能耗降低和碳排放减少。

链接: https://arxiv.org/abs/2511.11722
作者: Soumyendu Sarkar,Antonio Guillen-Perez,Zachariah J Carmichael,Avisek Naug,Refik Mert Cam,Vineet Gundecha,Ashwin Ramesh Babu,Sahand Ghorbanpour,Ricardo Luna Gutierrez
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
备注: Submitted to AAAI 2026 Conference

点击查看摘要

Abstract:Reducing energy consumption and carbon emissions in data centers by enabling real-time temperature prediction is critical for sustainability and operational efficiency. Achieving this requires accurate modeling of the 3D temperature field to capture airflow dynamics and thermal interactions under varying operating conditions. Traditional thermal CFD solvers, while accurate, are computationally expensive and require expert-crafted meshes and boundary conditions, making them impractical for real-time use. To address these limitations, we develop a vision-based surrogate modeling framework that operates directly on a 3D voxelized representation of the data center, incorporating server workloads, fan speeds, and HVAC temperature set points. We evaluate multiple architectures, including 3D CNN U-Net variants, a 3D Fourier Neural Operator, and 3D vision transformers, to map these thermal inputs to high-fidelity heat maps. Our results show that the surrogate models generalize across data center configurations and achieve up to 20,000x speedup (hundreds of milliseconds vs. hours). This fast and accurate estimation of hot spots and temperature distribution enables real-time cooling control and workload redistribution, leading to substantial energy savings (7%) and reduced carbon footprint.
zh

[CV-317] AdaptFly: Prompt-Guided Adaptation of Foundation Models for Low-Altitude UAV Networks

【速读】:该论文旨在解决低空无人机(UAV)网络中语义分割模型在天气、光照和视角变化下性能迅速退化的问题,同时应对资源受限无人机无法执行基于梯度的测试时自适应(Test-Time Adaptation, TTA),以及资源充裕无人机独立适应导致共享经验浪费的挑战。解决方案的关键在于提出一种无需权重更新的提示引导式测试时自适应框架——AdaptFly,其核心创新包括:针对资源受限无人机采用轻量级token提示检索共享全局记忆库;针对资源充裕无人机则利用协方差矩阵自适应进化策略(Covariance Matrix Adaptation Evolution Strategy, CMA-ES)实现无梯度稀疏视觉提示优化;并通过激活统计检测器触发适应机制,结合跨无人机知识池实现群体协作与极低带宽开销的知识聚合,从而显著提升分割精度与鲁棒性。

链接: https://arxiv.org/abs/2511.11720
作者: Jiao Chen,Haoyi Wang,Jianhua Tang,Junyi Wang
机构: South China University of Technology (华南理工大学); Guilin University of Electronic Technology (桂林电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Low-altitude Unmanned Aerial Vehicle (UAV) networks rely on robust semantic segmentation as a foundational enabler for distributed sensing-communication-control co-design across heterogeneous agents within the network. However, segmentation foundation models deteriorate quickly under weather, lighting, and viewpoint drift. Resource-limited UAVs cannot run gradient-based test-time adaptation, while resource-massive UAVs adapt independently, wasting shared experience. To address these challenges, we propose AdaptFly, a prompt-guided test-time adaptation framework that adjusts segmentation models without weight updates. AdaptFly features two complementary adaptation modes. For resource-limited UAVs, it employs lightweight token-prompt retrieval from a shared global memory. For resource-massive UAVs, it uses gradient-free sparse visual prompt optimization via Covariance Matrix Adaptation Evolution Strategy. An activation-statistic detector triggers adaptation, while cross-UAV knowledge pool consolidates prompt knowledge and enables fleet-wide collaboration with negligible bandwidth overhead. Extensive experiments on UAVid and VDD benchmarks, along with real-world UAV deployments under diverse weather conditions, demonstrate that AdaptFly significantly improves segmentation accuracy and robustness over static models and state-of-the-art TTA baselines. The results highlight a practical path to resilient, communication-efficient perception in the emerging low-altitude economy.
zh

[CV-318] CompressNAS : A Fast and Efficient Technique for Model Compression using Decomposition

【速读】:该论文旨在解决深度卷积神经网络(Deep Convolutional Neural Networks, CNNs)在微控制器(Microcontrollers, MCUs)和轻量级神经处理单元(Neural Processing Units, NPUs)上部署时面临的模型规模与计算需求不断增长的问题。现有低秩张量分解方法(如Tucker分解)虽能有效压缩模型参数和运算量,但通常采用局部rank选择策略,忽视了压缩率与精度之间的全局权衡。论文提出CompressNAS框架,将rank选择视为一个全局搜索问题,并引入快速精度估计器以高效评估候选分解方案,在内存和精度约束下实现全面的rank探索。其关键创新在于通过类似MicroNAS的设计理念,结合高效的精度预测机制,实现了高压缩比与小精度损失的平衡,从而显著提升了模型在边缘设备上的可部署性。

链接: https://arxiv.org/abs/2511.11716
作者: Sudhakar Sah,Nikhil Chabbra,Matthieu Durnerin
机构: STMicroelectronics(意法半导体)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 6 figures

点击查看摘要

Abstract:Deep Convolutional Neural Networks (CNNs) are increasingly difficult to deploy on microcontrollers (MCUs) and lightweight NPUs (Neural Processing Units) due to their growing size and compute demands. Low-rank tensor decomposition, such as Tucker factorization, is a promising way to reduce parameters and operations with reasonable accuracy loss. However, existing approaches select ranks locally and often ignore global trade-offs between compression and accuracy. We introduce CompressNAS, a MicroNAS-inspired framework that treats rank selection as a global search problem. CompressNAS employs a fast accuracy estimator to evaluate candidate decompositions, enabling efficient yet exhaustive rank exploration under memory and accuracy constraints. In ImageNet, CompressNAS compresses ResNet-18 by 8x with less than 4% accuracy drop; on COCO, we achieve 2x compression of YOLOv5s without any accuracy drop and 2x compression of YOLOv5n with a 2.5% drop. Finally, we present a new family of compressed models, STResNet, with competitive performance compared to other efficient models.
zh

[CV-319] Understanding the Representation of Older Adults in Motion Capture Locomotion Datasets

【速读】:该论文旨在解决当前运动捕捉(Motion Capture, MoCap)数据集中对老年人群体代表性不足的问题,尤其是针对老龄化社会中日益重要的健康护理应用场景。现有公开数据集大多缺乏真实老年人的全身运动数据,且部分标注为“老年风格”(old-style)的运动数据存在偏差,未能准确反映年龄相关的步态特征。解决方案的关键在于提出一套量化评估指标,基于对年龄敏感、抗噪性强且在数据稀疏情况下仍稳定的步态参数,系统性地评估“老年风格”步行动作的真实性;研究发现,此类动作往往呈现过度控制的模式,无法忠实刻画衰老带来的生理变化,从而为构建高质量、具代表性的老年人运动数据集提供了方法论基础和改进方向。

链接: https://arxiv.org/abs/2511.11713
作者: Yunkai Yu,Yingying Wang,Rong Zheng
机构: 未知
类目: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages,4 figures, to be published in IEEE AIOT 2025

点击查看摘要

Abstract:The Internet of Things (IoT) sensors have been widely employed to capture human locomotions to enable applications such as activity recognition, human pose estimation, and fall detection. Motion capture (MoCap) systems are frequently used to generate ground truth annotations for human poses when training models with data from wearable or ambient sensors, and have been shown to be effective to synthesize data in these modalities. However, the representation of older adults, an increasingly important demographic in healthcare, in existing MoCap locomotion datasets has not been thoroughly examined. This work surveyed 41 publicly available datasets, identifying eight that include older adult motions and four that contain motions performed by younger actors annotated as old style. Older adults represent a small portion of participants overall, and few datasets provide full-body motion data for this group. To assess the fidelity of old-style walking motions, quantitative metrics are introduced, defining high fidelity as the ability to capture age-related differences relative to normative walking. Using gait parameters that are age-sensitive, robust to noise, and resilient to data scarcity, we found that old-style walking motions often exhibit overly controlled patterns and fail to faithfully characterize aging. These findings highlight the need for improved representation of older adults in motion datasets and establish a method to quantitatively evaluate the quality of old-style walking motions.
zh

[CV-320] arget-Balanced Score Distillation

【速读】:该论文旨在解决Score Distillation Sampling (SDS)在生成三维资产时存在的过饱和(over-saturation)和过度平滑(over-smoothing)问题,尤其是现有基于负向提示(negative prompts)的改进方法所面临的纹理优化受限或纹理增强导致形状失真的权衡困境。解决方案的关键在于提出Target-Balanced Score Distillation (TBSD),其核心思想是将生成过程建模为多目标优化问题,并引入一种自适应策略,有效平衡负向提示中目标信息(Target Negative Prompts, TNP)对纹理真实性和几何保真度的影响,从而显著提升三维资产的纹理保真度与形状准确性。

链接: https://arxiv.org/abs/2511.11710
作者: Zhou Xu,Qi Wang,Yuxiao Yang,Luyuan Zhang,Zhang Liang,Yang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Score Distillation Sampling (SDS) enables 3D asset generation by distilling priors from pretrained 2D text-to-image diffusion models, but vanilla SDS suffers from over-saturation and over-smoothing. To mitigate this issue, recent variants have incorporated negative prompts. However, these methods face a critical trade-off: limited texture optimization, or significant texture gains with shape distortion. In this work, we first conduct a systematic analysis and reveal that this trade-off is fundamentally governed by the utilization of the negative prompts, where Target Negative Prompts (TNP) that embed target information in the negative prompts dramatically enhancing texture realism and fidelity but inducing shape distortions. Informed by this key insight, we introduce the Target-Balanced Score Distillation (TBSD). It formulates generation as a multi-objective optimization problem and introduces an adaptive strategy that effectively resolves the aforementioned trade-off. Extensive experiments demonstrate that TBSD significantly outperforms existing state-of-the-art methods, yielding 3D assets with high-fidelity textures and geometrically accurate shape.
zh

[CV-321] LE-CapsNet: A Light and Enhanced Capsule Network

【速读】:该论文旨在解决胶囊网络(Capsule Network, CapsNet)在实际应用中面临的计算效率低、资源消耗大及准确率相对较低的问题。其解决方案的关键在于提出一种轻量级、增强型且更精确的CapsNet变体——LE-CapsNet,通过优化网络结构和参数配置,在仅使用380万权重的情况下实现了76.73%的CIFAR-10分类准确率,并使推理速度提升至原始CapsNet的4倍;同时在仿射变换图像识别任务中表现更优,AffNIST数据集上的准确率达到94.3%,显著优于原始CapsNet的90.52%。

链接: https://arxiv.org/abs/2511.11708
作者: Pouya Shiri,Amirali Baniasadi
机构: University of Victoria (维多利亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Capsule Network (CapsNet) classifier has several advantages over CNNs, including better detection of images containing overlapping categories and higher accuracy on transformed images. Despite the advantages, CapsNet is slow due to its different structure. In addition, CapsNet is resource-hungry, includes many parameters and lags in accuracy compared to CNNs. In this work, we propose LE-CapsNet as a light, enhanced and more accurate variant of CapsNet. Using 3.8M weights, LECapsNet obtains 76.73% accuracy on the CIFAR-10 dataset while performing inference 4x faster than CapsNet. In addition, our proposed network is more robust at detecting images with affine transformations compared to CapsNet. We achieve 94.3% accuracy on the AffNIST dataset (compared to CapsNet 90.52%).
zh

[CV-322] Context-Aware Multimodal Representation Learning for Spatio-Temporally Explicit Environmental modelling

【速读】:该论文旨在解决当前地球观测(Earth Observation, EO)基础模型在空间和时间分辨率上受限的问题,即现有模型通常仅在固定的空间或时间尺度下运行,难以满足生态分析对高时空分辨率的双重需求。其解决方案的关键在于提出一种统一的表示学习框架,通过两阶段设计将不同模态的遥感数据(以Sentinel-1和Sentinel-2为例)整合到一个高时空分辨率的共享特征空间中,实现10米原生空间分辨率与无云Sentinel-2时频的协同建模;第一阶段独立建模各传感器以保留其特异性表征,第二阶段融合多模态特征并仅重训练融合层,从而在保持预训练编码器不变的前提下,有效捕捉互补信息并维持时空一致性,最终生成适用于精细化环境分析的嵌入表示。

链接: https://arxiv.org/abs/2511.11706
作者: Julia Peters,Karin Mora,Miguel D. Mahecha,Chaonan Ji,David Montero,Clemens Mosig,Guido Kraemer
机构: Leipzig University (莱比锡大学); German Centre for Integrative Biodiversity Research (iDiv) Halle–Jena–Leipzig (德国整合生物多样性研究中心)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages (incliding 2 pages of references), 7 figures

点击查看摘要

Abstract:Earth observation (EO) foundation models have emerged as an effective approach to derive latent representations of the Earth system from various remote sensing sensors. These models produce embeddings that can be used as analysis-ready datasets, enabling the modelling of ecosystem dynamics without extensive sensor-specific preprocessing. However, existing models typically operate at fixed spatial or temporal scales, limiting their use for ecological analyses that require both fine spatial detail and high temporal fidelity. To overcome these limitations, we propose a representation learning framework that integrates different EO modalities into a unified feature space at high spatio-temporal resolution. We introduce the framework using Sentinel-1 and Sentinel-2 data as representative modalities. Our approach produces a latent space at native 10 m resolution and the temporal frequency of cloud-free Sentinel-2 acquisitions. Each sensor is first modeled independently to capture its sensor-specific characteristics. Their representations are then combined into a shared model. This two-stage design enables modality-specific optimisation and easy extension to new sensors, retaining pretrained encoders while retraining only fusion layers. This enables the model to capture complementary remote sensing data and to preserve coherence across space and time. Qualitative analyses reveal that the learned embeddings exhibit high spatial and semantic consistency across heterogeneous landscapes. Quantitative evaluation in modelling Gross Primary Production reveals that they encode ecologically meaningful patterns and retain sufficient temporal fidelity to support fine-scale analyses. Overall, the proposed framework provides a flexible, analysis-ready representation learning approach for environmental applications requiring diverse spatial and temporal resolutions.
zh

[CV-323] Multimodal ML: Quantifying the Improvement of Calorie Estimation Through Image-Text Pairs

【速读】:该论文旨在解决如何通过引入短文本输入(如菜品名称)来提升基于图像的卡路里估算准确性的问题。其核心挑战在于评估文本信息是否能有效补充图像特征,从而改善仅依赖图像的模型性能。解决方案的关键在于构建一个融合图像与文本的多模态卷积神经网络(Multimodal CNN),该模型同时接收菜品图片和名称作为输入,并利用TensorFlow框架在Nutrition5k数据集上进行训练。实验结果表明,相较于纯图像输入的CNN模型,该多模态方法将平均绝对误差(MAE)从84.76 kcal降低至83.70 kcal,实现了1.25%的统计显著性改进。

链接: https://arxiv.org/abs/2511.11705
作者: Arya Narang
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper determines the extent to which short textual inputs (in this case, names of dishes) can improve calorie estimation compared to an image-only baseline model and whether any improvements are statistically significant. Utilizes the TensorFlow library and the Nutrition5k dataset (curated by Google) to train both an image-only CNN and multimodal CNN that accepts both text and an image as input. The MAE of calorie estimations was reduced by 1.06 kcal from 84.76 kcal to 83.70 kcal (1.25% improvement) when using the multimodal model.
zh

[CV-324] Simple Vision-Language Math Reasoning via Rendered Text

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Model, VLM)在数学问题求解任务中表现不足的问题,特别是如何通过轻量级且高效的训练流程提升其推理准确率。解决方案的关键在于提出一种文本到视觉的增强方法:将LaTeX格式的数学公式渲染为图像,并将其与结构化的链式思维(Chain-of-Thought, CoT)提示配对,从而构建一个简单但有效的训练管道。该方法使得紧凑的多模态架构在保持广泛通用能力的同时,显著提升了数学推理性能,在多个基准测试(如MMMU、ChartQA和DocVQA)上达到或超越现有开源及商业专用数学视觉语言求解器的水平。

链接: https://arxiv.org/abs/2511.11704
作者: Matvey Skripkin,Elizaveta Goncharova,Andrey Kuznetsov
机构: FusionBrain Lab; HSE University (高等经济大学); Innopolis University (因诺波利斯大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present a lightweight yet effective pipeline for training vision-language models to solve math problems by rendering LaTeX encoded equations into images and pairing them with structured chain-of-thought prompts. This simple text-to-vision augmentation enables compact multimodal architectures to achieve state-of-the-art reasoning accuracy. Through systematic ablations, we find that rendering fidelity and prompt design are the primary drivers of performance. Despite its simplicity, our approach consistently matches or surpasses both open-source and proprietary math-focused vision-language solvers on widely used benchmarks, while preserving broad general-domain competence - showing gains on tasks such as MMMU, ChartQA, and DocVQA of up to 20%.
zh

[CV-325] ask-Aware 3D Affordance Segmentation via 2D Guidance and Geometric Refinement

【速读】:该论文旨在解决从自然语言指令中理解三维场景级可操作性(scene-level affordances)的问题,以使具身智能体能够在复杂环境中进行有意义的交互。现有方法通常局限于物体级别的可操作性识别或仅将二维预测提升至三维,忽略了点云中丰富的几何结构信息,并导致较高的计算开销。解决方案的关键在于提出一种任务感知的三维场景级可操作性分割框架(Task-Aware 3D Scene-level Affordance segmentation, TASA),该框架通过粗到精的方式联合利用二维语义线索与三维几何推理:首先设计任务感知的二维可操作性检测模块,基于语言和视觉输入识别可操作点并引导选择任务相关的视角;随后引入三维可操作性精化模块,融合二维语义先验与局部三维几何信息,生成精确且空间一致的三维可操作性掩码,从而在准确性和效率上显著优于基线方法。

链接: https://arxiv.org/abs/2511.11702
作者: Lian He,Meng Liu,Qilang Ye,Yu Zhou,Xiang Deng,Gangyi Ding
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Understanding 3D scene-level affordances from natural language instructions is essential for enabling embodied agents to interact meaningfully in complex environments. However, this task remains challenging due to the need for semantic reasoning and spatial grounding. Existing methods mainly focus on object-level affordances or merely lift 2D predictions to 3D, neglecting rich geometric structure information in point clouds and incurring high computational costs. To address these limitations, we introduce Task-Aware 3D Scene-level Affordance segmentation (TASA), a novel geometry-optimized framework that jointly leverages 2D semantic cues and 3D geometric reasoning in a coarse-to-fine manner. To improve the affordance detection efficiency, TASA features a task-aware 2D affordance detection module to identify manipulable points from language and visual inputs, guiding the selection of task-relevant views. To fully exploit 3D geometric information, a 3D affordance refinement module is proposed to integrate 2D semantic priors with local 3D geometry, resulting in accurate and spatially coherent 3D affordance masks. Experiments on SceneFun3D demonstrate that TASA significantly outperforms the baselines in both accuracy and efficiency in scene-level affordance segmentation.
zh

[CV-326] EPSegFZ: Efficient Point Cloud Semantic Segmentation for Few- and Zero-Shot Scenarios with Language Guidance AAAI2026

【速读】:该论文旨在解决少样本(few-shot)三维点云语义分割方法中普遍存在的两个关键问题:一是现有方法依赖预训练阶段,导致模型灵活性和适应性受限;二是对支持集(support set)中除视觉信息外的其他有用数据(如文本注释)利用不足,从而影响模型性能及零样本(zero-shot)推理能力。解决方案的核心在于提出一种无需预训练的网络架构EPSegFZ,其关键创新包括:1)原型增强注册注意力(ProERA)模块与基于双相对位置编码(DRPE)的交叉注意力机制,用于在无预训练条件下提升特征提取精度并构建准确的查询-原型对应关系;2)语言引导原型嵌入(LGPE)模块,有效融合支持集中的文本信息以增强少样本性能并实现零样本推理能力。

链接: https://arxiv.org/abs/2511.11700
作者: Jiahui Wang,Haiyue Zhu,Haoren Guo,Abdullah Al Mamun,Cheng Xiang,Tong Heng Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: AAAI 2026

点击查看摘要

Abstract:Recent approaches for few-shot 3D point cloud semantic segmentation typically require a two-stage learning process, i.e., a pre-training stage followed by a few-shot training stage. While effective, these methods face overreliance on pre-training, which hinders model flexibility and adaptability. Some models tried to avoid pre-training yet failed to capture ample information. In addition, current approaches focus on visual information in the support set and neglect or do not fully exploit other useful data, such as textual annotations. This inadequate utilization of support information impairs the performance of the model and restricts its zero-shot ability. To address these limitations, we present a novel pre-training-free network, named Efficient Point Cloud Semantic Segmentation for Few- and Zero-shot scenarios. Our EPSegFZ incorporates three key components. A Prototype-Enhanced Registers Attention (ProERA) module and a Dual Relative Positional Encoding (DRPE)-based cross-attention mechanism for improved feature extraction and accurate query-prototype correspondence construction without pre-training. A Language-Guided Prototype Embedding (LGPE) module that effectively leverages textual information from the support set to improve few-shot performance and enable zero-shot inference. Extensive experiments show that our method outperforms the state-of-the-art method by 5.68% and 3.82% on the S3DIS and ScanNet benchmarks, respectively.
zh

[CV-327] oward Dignity-Aware AI: Next-Generation Elderly Monitoring from Fall Detection to ADL

【速读】:该论文旨在解决当前老年人监测系统局限于单一任务(如跌倒检测)的局限性,提出向更全面的日常生活活动(Activities of Daily Living, ADL)识别演进的需求。其核心问题在于如何在保障隐私的前提下,实现边缘部署与联邦学习(Federated Learning)相结合的智能监控系统,以支持老年群体的独立性和尊严。解决方案的关键在于:首先利用生成式对抗网络(GAN)增强数据集(如SISFall)作为代理任务进行可行性验证;其次,在非独立同分布(non-IID)条件下开展联邦学习实验,并将模型嵌入Jetson Orin Nano边缘设备中实现部署;最终通过构建面向智能房间环境的ADL监控框架,应对领域偏移、数据稀缺和隐私风险等开放挑战,推动从单任务检测向综合日常行为理解的转变。

链接: https://arxiv.org/abs/2511.11696
作者: Xun Shao,Aoba Otani,Yuto Hirasuka,Runji Cai,Seng W. Loke
机构: Toyohashi University of Technology (丰田工业大学); Guangdong Ocean University (广东海洋大学); Deakin University (迪肯大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注: This is the author’s preprint version of a paper accepted for presentation at EAI MONAMI 2025 (to appear in Springer LNICST). The final authenticated version will be available online at Springer Link upon publication

点击查看摘要

Abstract:This position paper envisions a next-generation elderly monitoring system that moves beyond fall detection toward the broader goal of Activities of Daily Living (ADL) recognition. Our ultimate aim is to design privacy-preserving, edge-deployed, and federated AI systems that can robustly detect and understand daily routines, supporting independence and dignity in aging societies. At present, ADL-specific datasets are still under collection. As a preliminary step, we demonstrate feasibility through experiments using the SISFall dataset and its GAN-augmented variants, treating fall detection as a proxy task. We report initial results on federated learning with non-IID conditions, and embedded deployment on Jetson Orin Nano devices. We then outline open challenges such as domain shift, data scarcity, and privacy risks, and propose directions toward full ADL monitoring in smart-room environments. This work highlights the transition from single-task detection to comprehensive daily activity recognition, providing both early evidence and a roadmap for sustainable and human-centered elderly care AI.
zh

[CV-328] Value-Aligned Prompt Moderation via Zero-Shot Agent ic Rewriting for Safe Image Generation

【速读】:该论文旨在解决生成式视觉语言模型(如Stable Diffusion)在面对恶意或模糊提示时,可能生成不安全、冒犯性或文化不当内容的问题,同时现有防御机制难以在不牺牲生成质量或增加高昂成本的前提下实现与人类价值观的一致性。解决方案的关键在于提出VALOR(Value-Aligned LLM-Overseen Rewriter),一个模块化、零样本的代理框架:通过分层提示分析(包括多级NSFW检测、文化价值对齐模块和意图消歧器)识别风险,并由大语言模型(LLM)根据动态角色指令选择性重写提示以保留用户意图并强化对齐;若图像仍不安全,则进行风格再生以引导至更安全的视觉领域而不改变核心语义,从而在开放世界场景中实现安全、对齐且有用的内容生成。

链接: https://arxiv.org/abs/2511.11693
作者: Xin Zhao,Xiaojun Chen,Bingshan Liu,Zeyao Liu,Zhendong Zhao,Xiaoyan Gu
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Generative vision-language models like Stable Diffusion demonstrate remarkable capabilities in creative media synthesis, but they also pose substantial risks of producing unsafe, offensive, or culturally inappropriate content when prompted adversarially. Current defenses struggle to align outputs with human values without sacrificing generation quality or incurring high costs. To address these challenges, we introduce VALOR (Value-Aligned LLM-Overseen Rewriter), a modular, zero-shot agentic framework for safer and more helpful text-to-image generation. VALOR integrates layered prompt analysis with human-aligned value reasoning: a multi-level NSFW detector filters lexical and semantic risks; a cultural value alignment module identifies violations of social norms, legality, and representational ethics; and an intention disambiguator detects subtle or indirect unsafe implications. When unsafe content is detected, prompts are selectively rewritten by a large language model under dynamic, role-specific instructions designed to preserve user intent while enforcing alignment. If the generated image still fails a safety check, VALOR optionally performs a stylistic regeneration to steer the output toward a safer visual domain without altering core semantics. Experiments across adversarial, ambiguous, and value-sensitive prompts show that VALOR significantly reduces unsafe outputs by up to 100.00% while preserving prompt usefulness and creativity. These results highlight VALOR as a scalable and effective approach for deploying safe, aligned, and helpful image generation systems in open-world settings.
zh

[CV-329] AnchorDS: Anchoring Dynamic Sources for Semantically Consistent Text-to-3D Generation AAAI2026

【速读】:该论文旨在解决基于优化的文本到三维(text-to-3D)生成方法中因忽略源分布动态性而导致的语义过平滑(semantic over-smoothing)问题,即在优化过程中将来自二维生成模型的引导信号视为静态,从而导致轨迹不一致、语义线索被抑制或融合。解决方案的关键在于将文本到三维优化重新建模为从一个动态演化的源分布映射到固定目标分布的过程,并引入双条件潜空间(dual-conditioned latent space),同时以文本提示和中间渲染图像作为条件;其中,图像条件自然锚定当前源分布,由此提出AnchorDS机制,通过图像条件提供状态感知的引导信息,显著提升生成稳定性与语义一致性。进一步结合轻量级滤波与微调策略,有效修正错误的源估计,在几乎无额外计算开销下实现更精细细节、更自然色彩及更强语义保真度。

链接: https://arxiv.org/abs/2511.11692
作者: Jiayin Zhu,Linlin Yang,Yicong Li,Angela Yao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2026. Project page: this https URL

点击查看摘要

Abstract:Optimization-based text-to-3D methods distill guidance from 2D generative models via Score Distillation Sampling (SDS), but implicitly treat this guidance as static. This work shows that ignoring source dynamics yields inconsistent trajectories that suppress or merge semantic cues, leading to “semantic over-smoothing” artifacts. As such, we reformulate text-to-3D optimization as mapping a dynamically evolving source distribution to a fixed target distribution. We cast the problem into a dual-conditioned latent space, conditioned on both the text prompt and the intermediately rendered image. Given this joint setup, we observe that the image condition naturally anchors the current source distribution. Building on this insight, we introduce AnchorDS, an improved score distillation mechanism that provides state-anchored guidance with image conditions and stabilizes generation. We further penalize erroneous source estimates and design a lightweight filter strategy and fine-tuning strategy that refines the anchor with negligible overhead. AnchorDS produces finer-grained detail, more natural colours, and stronger semantic consistency, particularly for complex prompts, while maintaining efficiency. Extensive experiments show that our method surpasses previous methods in both quality and efficiency.
zh

[CV-330] Doubly Debiased Test-Time Prompt Tuning for Vision-Language Models AAAI2026

【速读】:该论文旨在解决测试时提示调优(test-time prompt tuning)在零样本设置下因仅依赖无标签测试数据进行可学习提示优化而导致的提示优化偏差(prompt optimization bias)问题。这种偏差会降低下游任务性能,其根源在于模型层面熵最小化目标忽视预测准确性、导致过自信但错误的输出,以及数据层面提示偏差引发视觉与文本模态间的错位,进一步加剧偏差。解决方案的关键在于提出一种双重去偏的测试时提示调优方法(Doubly Debiased Test-Time Prompt Tuning),核心包括两个模块:一是动态检索增强调制模块,利用测试图像特征作为查询从动态知识库中检索高置信度知识以调制模型预测;二是可靠性感知提示优化模块,通过置信度加权集成和跨模态一致性蒸馏施加正则化约束,从而提升提示调优的质量与鲁棒性。

链接: https://arxiv.org/abs/2511.11690
作者: Fei Song,Yi Li,Rui Wang,Jiahuan Zhou,Changwen Zheng,Jiangmeng Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI2026

点击查看摘要

Abstract:Test-time prompt tuning for vision-language models has demonstrated impressive generalization capabilities under zero-shot settings. However, tuning the learnable prompts solely based on unlabeled test data may induce prompt optimization bias, ultimately leading to suboptimal performance on downstream tasks. In this work, we analyze the underlying causes of prompt optimization bias from both the model and data perspectives. In terms of the model, the entropy minimization objective typically focuses on reducing the entropy of model predictions while overlooking their correctness. This can result in overconfident yet incorrect outputs, thereby compromising the quality of prompt optimization. On the data side, prompts affected by optimization bias can introduce misalignment between visual and textual modalities, which further aggravates the prompt optimization bias. To this end, we propose a Doubly Debiased Test-Time Prompt Tuning method. Specifically, we first introduce a dynamic retrieval-augmented modulation module that retrieves high-confidence knowledge from a dynamic knowledge base using the test image feature as a query, and uses the retrieved knowledge to modulate the predictions. Guided by the refined predictions, we further develop a reliability-aware prompt optimization module that incorporates a confidence-based weighted ensemble and cross-modal consistency distillation to impose regularization constraints during prompt tuning. Extensive experiments across 15 benchmark datasets involving both natural distribution shifts and cross-datasets generalization demonstrate that our method outperforms baselines, validating its effectiveness in mitigating prompt optimization bias.
zh

[CV-331] Hierarchical Schedule Optimization for Fast and Robust Diffusion Model Sampling

【速读】:该论文旨在解决扩散概率模型(Diffusion Probabilistic Models)在生成样本时因迭代采样过程缓慢而导致的效率问题,尤其是在极低函数评估次数(Number of Function Evaluations, NFE)条件下如何提升采样质量。现有方案难以同时满足有效性、自适应性、实际鲁棒性和计算高效性四大核心原则。为此,作者提出了一种新型的分层调度优化器(Hierarchical-Schedule-Optimizer, HSO),其关键在于构建了一个双层优化框架:上层进行全局搜索以确定最优初始化策略,下层执行局部优化以精细调整调度路径;同时引入两个创新机制——中点误差代理(Midpoint Error Proxy, MEP)用于无求解器依赖且数值稳定的局部优化目标,以及间距惩罚适应度函数(Spacing-Penalized Fitness, SPF)以防止时间步过于密集从而保障实际鲁棒性。该方法仅需不到8秒的一次性优化成本即可实现显著优于现有训练-free方法的性能,例如在NFE=5时于LAION-Aesthetics数据集上达到FID=11.94,无需重新训练即可大幅加速扩散模型采样。

链接: https://arxiv.org/abs/2511.11688
作者: Aihua Zhu,Rui Su,Qinglin Zhao,Li Feng,Meng Shen,Shibo He
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion probabilistic models have set a new standard for generative fidelity but are hindered by a slow iterative sampling process. A powerful training-free strategy to accelerate this process is Schedule Optimization, which aims to find an optimal distribution of timesteps for a fixed and small Number of Function Evaluations (NFE) to maximize sample quality. To this end, a successful schedule optimization method must adhere to four core principles: effectiveness, adaptivity, practical robustness, and computational efficiency. However, existing paradigms struggle to satisfy these principles simultaneously, motivating the need for a more advanced solution. To overcome these limitations, we propose the Hierarchical-Schedule-Optimizer (HSO), a novel and efficient bi-level optimization framework. HSO reframes the search for a globally optimal schedule into a more tractable problem by iteratively alternating between two synergistic levels: an upper-level global search for an optimal initialization strategy and a lower-level local optimization for schedule refinement. This process is guided by two key innovations: the Midpoint Error Proxy (MEP), a solver-agnostic and numerically stable objective for effective local optimization, and the Spacing-Penalized Fitness (SPF) function, which ensures practical robustness by penalizing pathologically close timesteps. Extensive experiments show that HSO sets a new state-of-the-art for training-free sampling in the extremely low-NFE regime. For instance, with an NFE of just 5, HSO achieves a remarkable FID of 11.94 on LAION-Aesthetics with Stable Diffusion v2.1. Crucially, this level of performance is attained not through costly retraining, but with a one-time optimization cost of less than 8 seconds, presenting a highly practical and efficient paradigm for diffusion model acceleration.
zh

[CV-332] Stratified Knowledge-Density Super-Network for Scalable Vision Transformers AAAI2026

【速读】:该论文旨在解决在不同资源约束下训练和部署多个视觉Transformer(Vision Transformer, ViT)模型所导致的成本高、效率低的问题。其核心解决方案是将预训练的ViT模型转化为一种分层知识密度超网络(stratified knowledge-density super-network),通过在权重中分层组织知识,实现灵活提取适用于不同模型规模的子网络,从而保留最大知识量。关键创新在于提出两种方法:一是Weighted PCA for Attention Contraction (WPAC),它利用token-wise加权主成分分析对中间特征进行压缩,并将变换矩阵注入相邻层,保持原网络功能的同时提升知识紧凑性;二是Progressive Importance-Aware Dropout (PIAD),通过逐步评估权重组的重要性并更新重要性感知的dropout列表,在此dropout机制下训练超网络以促进知识分层结构。实验证明,WPAC在知识集中度上优于现有剪枝标准,结合PIAD后可作为先进模型压缩与扩展方法的有力替代方案。

链接: https://arxiv.org/abs/2511.11683
作者: Longhua Li,Lei Qi,Xin Geng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Training and deploying multiple vision transformer (ViT) models for different resource constraints is costly and inefficient. To address this, we propose transforming a pre-trained ViT into a stratified knowledge-density super-network, where knowledge is hierarchically organized across weights. This enables flexible extraction of sub-networks that retain maximal knowledge for varying model sizes. We introduce \textbfWeighted \textbfPCA for \textbfAttention \textbfContraction (WPAC), which concentrates knowledge into a compact set of critical weights. WPAC applies token-wise weighted principal component analysis to intermediate features and injects the resulting transformation and inverse matrices into adjacent layers, preserving the original network function while enhancing knowledge compactness. To further promote stratified knowledge organization, we propose \textbfProgressive \textbfImportance-\textbfAware \textbfDropout (PIAD). PIAD progressively evaluates the importance of weight groups, updates an importance-aware dropout list, and trains the super-network under this dropout regime to promote knowledge stratification. Experiments demonstrate that WPAC outperforms existing pruning criteria in knowledge concentration, and the combination with PIAD offers a strong alternative to state-of-the-art model compression and model expansion methods.
zh

[CV-333] MPCM-Net: Multi-scale network integrates partial attention convolution with Mamba for ground-based cloud image segmentation

【速读】:该论文旨在解决地面云图像分割在光伏功率预测中的关键挑战,具体包括:现有深度学习方法依赖扩张卷积进行多尺度上下文提取时缺乏通道间特征的有效性和互操作性;注意力机制增强特征时难以兼顾精度与推理吞吐量;以及解码器改进未能建立层次化局部特征间的全局依赖关系,限制了推理效率。解决方案的关键在于提出MPCM-Net,其核心创新为:在编码器中引入多尺度部分注意力卷积模块(MPAC),包含ParCM和ParSM结构以实现跨尺度云团的全局空间交互并降低计算复杂度;在解码器中采用M2B模块结合SSH架构,在保持线性复杂度的同时完成空间与尺度维度上的深层特征聚合,从而显著提升分割精度与计算效率的平衡。

链接: https://arxiv.org/abs/2511.11681
作者: Penghui Niu,Jiashuai She,Taotao Cai,Yajuan Zhang,Ping Zhang,Junhua Gu,Jianxin Li
机构: Hebei University of Technology (河北工业大学); University of Southern Queensland (南昆士兰大学); Edith Cowan University (埃迪斯科文大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Ground-based cloud image segmentation is a critical research domain for photovoltaic power forecasting. Current deep learning approaches primarily focus on encoder-decoder architectural refinements. However, existing methodologies exhibit several limitations:(1)they rely on dilated convolutions for multi-scale context extraction, lacking the partial feature effectiveness and interoperability of inter-channel;(2)attention-based feature enhancement implementations neglect accuracy-throughput balance; and (3)the decoder modifications fail to establish global interdependencies among hierarchical local features, limiting inference efficiency. To address these challenges, we propose MPCM-Net, a Multi-scale network that integrates Partial attention Convolutions with Mamba architectures to enhance segmentation accuracy and computational efficiency. Specifically, the encoder incorporates MPAC, which comprises:(1)a MPC block with ParCM and ParSM that enables global spatial interaction across multi-scale cloud formations, and (2)a MPA block combining ParAM and ParSM to extract discriminative features with reduced computational complexity. On the decoder side, a M2B is employed to mitigate contextual loss through a SSHD that maintains linear complexity while enabling deep feature aggregation across spatial and scale dimensions. As a key contribution to the community, we also introduce and release a dataset CSRC, which is a clear-label, fine-grained segmentation benchmark designed to overcome the critical limitations of existing public datasets. Extensive experiments on CSRC demonstrate the superior performance of MPCM-Net over state-of-the-art methods, achieving an optimal balance between segmentation accuracy and inference speed. The dataset and source code will be available at this https URL.
zh

[CV-334] Probabilistic Wildfire Susceptibility from Remote Sensing Using Random Forests and SHAP

【速读】:该论文旨在解决加州野火风险空间分布不明确、预测模型可解释性不足的问题,以支持更精准的灾害防控决策。其解决方案的关键在于构建一个融合随机森林(Random Forest, RF)与可解释人工智能(Explainable Artificial Intelligence, XAI)的框架,通过Shapley Additive exPlanations(SHAP)方法解析模型预测结果,识别不同生态系统(森林与草地)中的主导驱动因子,并结合时空交叉验证策略评估模型泛化能力。该方法不仅实现了高精度的火灾风险预测(如森林AUC达0.997),还提供了区域层面的风险分级和关键影响因素的量化解释,从而为制定差异化防火策略提供科学依据。

链接: https://arxiv.org/abs/2511.11680
作者: Udaya Bhasker Cheerala,Varun Teja Chirukuri,Venkata Akhil Kumar Gummadi,Jintu Moni Bhuyan,Praveen Damacharla
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 2025 IEEE Asia-Pacific Conference on Geoscience, Electronics and Remote Sensing Technology (AGERS)

点击查看摘要

Abstract:Wildfires pose a significant global threat to ecosystems worldwide, with California experiencing recurring fires due to various factors, including climate, topographical features, vegetation patterns, and human activities. This study aims to develop a comprehensive wildfire risk map for California by applying the random forest (RF) algorithm, augmented with Explainable Artificial Intelligence (XAI) through Shapley Additive exPlanations (SHAP), to interpret model predictions. Model performance was assessed using both spatial and temporal validation strategies. The RF model demonstrated strong predictive performance, achieving near-perfect discrimination for grasslands (AUC = 0.996) and forests (AUC = 0.997). Spatial cross-validation revealed moderate transferability, yielding ROC-AUC values of 0.6155 for forests and 0.5416 for grasslands. In contrast, temporal split validation showed enhanced generalization, especially for forests (ROC-AUC = 0.6615, PR-AUC = 0.8423). SHAP-based XAI analysis identified key ecosystem-specific drivers: soil organic carbon, tree cover, and Normalized Difference Vegetation Index (NDVI) emerged as the most influential in forests, whereas Land Surface Temperature (LST), elevation, and vegetation health indices were dominant in grasslands. District-level classification revealed that Central Valley and Northern Buttes districts had the highest concentration of high-risk grasslands, while Northern Buttes and North Coast Redwoods dominated forested high-risk areas. This RF-SHAP framework offers a robust, comprehensible, and adaptable method for assessing wildfire risks, enabling informed decisions and creating targeted strategies to mitigate dangers.
zh

[CV-335] A neural optimization framework for free-boundary diffeomorphic mapping problems and its applications

【速读】:该论文旨在解决自由边界微分同胚优化(free-boundary diffeomorphism optimization)在曲面映射问题中的难题,即边界无约束条件下如何保证大形变下的局部双射性(local bijectivity)。传统数值最小二乘拟共形(Numerical Least-Squares Quasiconformal, LSQC)方法虽具备存在性、唯一性、相似不变性和分辨率无关性等理论优势,但其依赖特征点条件(landmark conditioning),难以融入基于梯度的优化框架。解决方案的关键在于提出一种神经代理模型——谱贝尔特拉米网络(Spectral Beltrami Network, SBN),将LSQC能量嵌入多尺度网格-谱架构中,并进一步构建SBN-Opt优化框架,实现对自由边界微分同胚的优化,且能显式控制局部几何畸变,从而在密度均衡映射和不一致表面配准任务中显著优于传统数值算法。

链接: https://arxiv.org/abs/2511.11679
作者: Zhehao Xu,Lok Ming Lui
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Complex Variables (math.CV); Differential Geometry (math.DG)
备注:

点击查看摘要

Abstract:Free-boundary diffeomorphism optimization is a core ingredient in the surface mapping problem but remains notoriously difficult because the boundary is unconstrained and local bijectivity must be preserved under large deformation. Numerical Least-Squares Quasiconformal (LSQC) theory, with its provable existence, uniqueness, similarity-invariance and resolution-independence, offers an elegant mathematical remedy. However, the conventional numerical algorithm requires landmark conditioning, and cannot be applied into gradient-based optimization. We propose a neural surrogate, the Spectral Beltrami Network (SBN), that embeds LSQC energy into a multiscale mesh-spectral architecture. Next, we propose the SBN guided optimization framework SBN-Opt which optimizes free-boundary diffeomorphism for the problem, with local geometric distortion explicitly controllable. Extensive experiments on density-equalizing maps and inconsistent surface registration demonstrate our SBN-Opt’s superiority over traditional numerical algorithms.
zh

[CV-336] Learning with Preserving for Continual Multitask Learning AAAI-2026

【速读】:该论文旨在解决持续多任务学习(Continual Multitask Learning, CMTL)场景下的灾难性遗忘问题,即模型在连续学习新任务时因共享数据流导致对先前任务知识的遗忘。现有方法通常学习碎片化的任务特定特征,引发任务间干扰。其解决方案的关键在于提出学习保持(Learning with Preserving, LwP)框架,核心创新是引入动态加权距离保持(Dynamically Weighted Distance Preservation, DWDP)损失函数,通过正则化潜在表示空间中样本间的成对距离来维持共享表示空间的几何结构,从而避免表征漂移。此机制使模型无需回放缓冲区即可保留隐式知识,支持多样化任务,且在时间序列与图像基准测试中显著优于现有最先进方法,尤其在分布偏移下表现出更强鲁棒性。

链接: https://arxiv.org/abs/2511.11676
作者: Hanchen David Wang,Siwoo Bae,Zirong Chen,Meiyi Ma
机构: Vanderbilt University (范德比尔特大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages, 16 figures, accepted at AAAI-2026

点击查看摘要

Abstract:Artificial intelligence systems in critical fields like autonomous driving and medical imaging analysis often continually learn new tasks using a shared stream of input data. For instance, after learning to detect traffic signs, a model may later need to learn to classify traffic lights or different types of vehicles using the same camera feed. This scenario introduces a challenging setting we term Continual Multitask Learning (CMTL), where a model sequentially learns new tasks on an underlying data distribution without forgetting previously learned abilities. Existing continual learning methods often fail in this setting because they learn fragmented, task-specific features that interfere with one another. To address this, we introduce Learning with Preserving (LwP), a novel framework that shifts the focus from preserving task outputs to maintaining the geometric structure of the shared representation space. The core of LwP is a Dynamically Weighted Distance Preservation (DWDP) loss that prevents representation drift by regularizing the pairwise distances between latent data representations. This mechanism of preserving the underlying geometric structure allows the model to retain implicit knowledge and support diverse tasks without requiring a replay buffer, making it suitable for privacy-conscious applications. Extensive evaluations on time-series and image benchmarks show that LwP not only mitigates catastrophic forgetting but also consistently outperforms state-of-the-art baselines in CMTL tasks. Notably, our method shows superior robustness to distribution shifts and is the only approach to surpass the strong single-task learning baseline, underscoring its effectiveness for real-world dynamic environments.
zh

[CV-337] Range Asymmetric Numeral Systems-Based Lightweight Intermediate Feature Compression for Split Computing of Deep Neural Networks

【速读】:该论文旨在解决边缘计算中深度神经网络推理时因传输中间特征而导致的显著通信瓶颈问题。其核心解决方案是提出一种轻量级压缩框架,关键在于结合非对称整数量化(asymmetric integer quantization)与稀疏张量表示(sparse tensor representation),并利用范围不对称数值系统(Range Asymmetric Numeral Systems, rANS)进行高效编码,从而在不依赖复杂概率建模或网络结构调整的前提下,大幅降低传输开销。该方法具备分布无关性、低计算开销,并通过近似理论模型优化张量重塑维度以最大化压缩效率,同时实现亚毫秒级GPU加速编码/解码延迟,有效支撑了多种视觉与自然语言处理任务在带宽受限环境下的高性能部署。

链接: https://arxiv.org/abs/2511.11664
作者: Mingyu Sung,Suhwan Im,Vikas Palakonda,Jae-Mo Kang
机构: Kyungpook National University (庆北国立大学)
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Split computing distributes deep neural network inference between resource-constrained edge devices and cloud servers but faces significant communication bottlenecks when transmitting intermediate features. To this end, in this paper, we propose a novel lightweight compression framework that leverages Range Asymmetric Numeral Systems (rANS) encoding with asymmetric integer quantization and sparse tensor representation to reduce transmission overhead dramatically. Specifically, our approach combines asymmetric integer quantization with a sparse representation technique, eliminating the need for complex probability modeling or network modifications. The key contributions include: (1) a distribution-agnostic compression pipeline that exploits inherent tensor sparsity to achieve bandwidth reduction with minimal computational overhead; (2) an approximate theoretical model that optimizes tensor reshaping dimensions to maximize compression efficiency; and (3) a GPU-accelerated implementation with sub-millisecond encoding/decoding latency. Extensive evaluations across diverse neural architectures (ResNet, VGG16, MobileNetV2, SwinT, DenseNet121, EfficientNetB0) demonstrate that the proposed framework consistently maintains near-baseline accuracy across CIFAR100 and ImageNet benchmarks. Moreover, we validated the framework’s effectiveness on advanced natural language processing tasks by employing Llama2 7B and 13B on standard benchmarks such as MMLU, HellaSwag, ARC, PIQA, Winogrande, BoolQ, and OpenBookQA, demonstrating its broad applicability beyond computer vision. Furthermore, this method addresses a fundamental bottleneck in deploying sophisticated artificial intelligence systems in bandwidth-constrained environments without compromising model performance.
zh

[CV-338] AGENet: Adaptive Edge-aware Geodesic Distance Learning for Few-Shot Medical Image Segmentation WACV2026

【速读】:该论文旨在解决医学图像分割中因依赖大量标注数据而产生的瓶颈问题,尤其是在少样本场景下难以实现精确边界分割的挑战,特别是当解剖结构相似且缺乏足够空间上下文时。解决方案的关键在于提出AGENet(Adaptive Geodesic Edge-aware Network),其核心创新是通过边缘感知的测地距离学习来建模空间关系,利用医学结构具有的可预测几何模式指导原型提取,从而在有限训练数据下提升边界精度;该方法采用轻量级几何建模而非复杂神经网络结构,结合边缘感知的测地距离模块、自适应原型提取机制和参数学习策略,在保持计算效率的同时显著降低边界误差,适用于临床场景下的精准分割需求。

链接: https://arxiv.org/abs/2511.11662
作者: Ziyuan Gao
机构: University College London (伦敦大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in WACV 2026 (Round 2)

点击查看摘要

Abstract:Medical image segmentation requires large annotated datasets, creating a significant bottleneck for clinical applications. While few-shot segmentation methods can learn from minimal examples, existing approaches demonstrate suboptimal performance in precise boundary delineation for medical images, particularly when anatomically similar regions appear without sufficient spatial context. We propose AGENet (Adaptive Geodesic Edge-aware Network), a novel framework that incorporates spatial relationships through edge-aware geodesic distance learning. Our key insight is that medical structures follow predictable geometric patterns that can guide prototype extraction even with limited training data. Unlike methods relying on complex architectural components or heavy neural networks, our approach leverages computationally lightweight geometric modeling. The framework combines three main components: (1) An edge-aware geodesic distance learning module that respects anatomical boundaries through iterative Fast Marching refinement, (2) adaptive prototype extraction that captures both global structure and local boundary details via spatially-weighted aggregation, and (3) adaptive parameter learning that automatically adjusts to different organ characteristics. Extensive experiments across diverse medical imaging datasets demonstrate improvements over state-of-the-art methods. Notably, our method reduces boundary errors compared to existing approaches while maintaining computational efficiency, making it highly suitable for clinical applications requiring precise segmentation with limited annotated data.
zh

[CV-339] A Method for Identifying Farmland System Habitat Types Based on the Dynamic-Weighted Feature Fusion Network Model

【速读】:该论文旨在解决当前耕地生态系统缺乏标准化栖息地分类体系、栖息地类型覆盖不全,以及现有分割模型难以有效融合语义与纹理特征导致多尺度栖息地(如大尺度田块与微栖息地)分割精度不足、边界模糊的问题。其解决方案的关键在于构建了一个涵盖15类耕地系统栖息地的超高清遥感图像数据集,并提出一种动态加权特征融合网络(Dynamic-Weighted Feature Fusion Network, DWFF-Net)。该模型通过冻结参数的DINOv3作为编码器提取基础特征,引入基于数据层面的自适应动态加权策略实现多层特征融合,同时在解码器中设计动态权重计算模块以深度融合多尺度特征,并采用混合损失函数优化训练过程,从而显著提升微栖息地(如田埂)的分割性能,最终实现亚米级精度的低成本栖息地制图。

链接: https://arxiv.org/abs/2511.11659
作者: Kesong Zheng,Zhi Song,Peizhou Li,Shuyi Yao,Zhenxing Bian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 30 pages,12 figures

点击查看摘要

Abstract:Addressing the current lack of a standardized habitat classification system for cultivated land ecosystems, incomplete coverage of habitat types, and the inability of existing models to effectively integrate semantic and texture features-resulting in insufficient segmentation accuracy and blurred boundaries for multi-scale habitats (e.g., large-scale field plots and micro-habitats)-this study developed a comprehensively annotated ultra-high-resolution remote sensing image dataset encompassing 15 categories of cultivated land system habitats. Furthermore, we propose a Dynamic-Weighted Feature Fusion Network (DWFF-Net). The encoder of this model utilizes a frozen-parameter DINOv3 to extract foundational features. By analyzing the relationships between different category images and feature maps, we introduce a data-level adaptive dynamic weighting strategy for feature fusion. The decoder incorporates a dynamic weight computation network to achieve thorough integration of multi-layer features, and a hybrid loss function is adopted to optimize model training. Experimental results on the constructed dataset demonstrate that the proposed model achieves a mean Intersection over Union (mIoU) of 0.6979 and an F1-score of 0.8049, outperforming the baseline network by 0.021 and 0.0161, respectively. Ablation studies further confirm the complementary nature of multi-layer feature fusion, which effectively improves the IoU for micro-habitat categories such as field ridges. This study establishes a habitat identification framework for cultivated land systems based on adaptive multi-layer feature fusion, enabling sub-meter precision habitat mapping at a low cost and providing robust technical support for fine-grained habitat monitoring in cultivated landscapes.
zh

[CV-340] Real-time pothole detection with onboard sensors and camera on vehicles

【速读】:该论文旨在解决城市道路中坑洼(pothole)检测不及时、难以大规模管理的问题,以提升交通流畅性和道路维护效率。其解决方案的关键在于利用车辆上的传感器实时采集道路振动数据,并采用支持向量机(Support Vector Machine, SVM)分类器进行特征识别与分类,实现了98.1%的检测准确率,从而为大规模道路病害分析与治理提供可靠的数据基础。

链接: https://arxiv.org/abs/2511.11643
作者: Aswath Muthuselvam,Jeevak Raj S,Mohanaprasad K
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Road conditions play an important role in our everyday commute. With the proliferating number of vehicles on the road each year, it has become necessary to access the road conditions very frequently, this would ensure that the traffic also flows smoothly. Even the smallest crack in the road could be easily be chipped into a large pothole due to changing surface temperatures of the road and from the force of vehicles riding over it. In this paper, we have addressed how we could better identify these potholes in realtime with the help of onboard sensors in vehicles so that the data could be useful for analysis and better management of potholes on a large scale. For the implementation, we used an SVM classifier to detect potholes, we achieved 98.1% accuracy based on data collected from a local road for about 2 km which had 26 potholes distributed along the road. Code is available at: this https URL
zh

[CV-341] Image-based Morphological Characterization of Filamentous Biological Structures with Non-constant Curvature Shape Feature

【速读】:该论文旨在解决攀援植物卷须在机械刺激下随时间发生形变的动态过程与其触发事件及接触位置之间关系难以精确提取的问题。解决方案的关键在于提出一种基于图像的几何建模方法,采用分段Clothoid(Clothoid是数学中的一种曲线,其曲率线性变化)的3D模型对卷须在不同部位受机械摩擦后的构型进行重建,实现了高达R²=0.99的高精度和强鲁棒性。该方法相比深度学习方法具有数据需求少、计算成本低和可解释性强等优势,并揭示了卷须顶端区域响应更敏感的现象,为植物生物力学研究和仿生机器人设计提供了新思路。

链接: https://arxiv.org/abs/2511.11639
作者: Jie Fan,Francesco Visentin,Barbara Mazzolai,Emanuela Del Dottore
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: This manuscript is a preprint version of the article currently under peer review at International Journal of Computer Vision (IJCV)

点击查看摘要

Abstract:Tendrils coil their shape to anchor the plant to supporting structures, allowing vertical growth toward light. Although climbing plants have been studied for a long time, extracting information regarding the relationship between the temporal shape change, the event that triggers it, and the contact location is still challenging. To help build this relation, we propose an image-based method by which it is possible to analyze shape changes over time in tendrils when mechano-stimulated in different portions of their body. We employ a geometric approach using a 3D Piece-Wise Clothoid-based model to reconstruct the configuration taken by a tendril after mechanical rubbing. The reconstruction shows high robustness and reliability with an accuracy of R2 0.99. This method demonstrates distinct advantages over deep learning-based approaches, including reduced data requirements, lower computational costs, and interpretability. Our analysis reveals higher responsiveness in the apical segment of tendrils, which might correspond to higher sensitivity and tissue flexibility in that region of the organs. Our study provides a methodology for gaining new insights into plant biomechanics and offers a foundation for designing and developing novel intelligent robotic systems inspired by climbing plants.
zh

[CV-342] actile Data Recording System for Clothing with Motion-Controlled Robotic Sliding SIGGRAPH

【速读】:该论文旨在解决服装触感(tactile sensation)与物理属性之间关联性不明确的问题,以提升穿着舒适度的量化评估能力。其核心挑战在于如何系统化采集在滑动过程中衣物与皮肤接触时的多模态触觉数据,并准确标注运动参数以增强感知模型的识别精度。解决方案的关键在于构建了一套基于机械臂的触觉数据采集系统,通过模拟手指滑动动作并精确控制速度和方向,实现了对完整衣物的非破坏性、运动标签化的多模态触觉数据库创建;机器学习评估进一步验证了引入运动相关参数可显著提升音频和加速度数据的分类准确性,从而为服装触觉感知与再现研究提供了可靠的数据基础和方法支撑。

链接: https://arxiv.org/abs/2511.11634
作者: Michikuni Eguchi,Takekazu Kitagishi,Yuichi Hiroi,Takefumi Hiraki
机构: University of Tsukuba (筑波大学); Cluster Metaverse Lab (集群元宇宙实验室); The University of Tokyo (东京大学); ZOZO Research (ZOZO 研究所)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: 3 pages, 2 figures, 1 table. Presented at SIGGRAPH Asia 2025 Posters (SA Posters '25), December 15-18, 2025, Hong Kong, Hong Kong

点击查看摘要

Abstract:The tactile sensation of clothing is critical to wearer comfort. To reveal physical properties that make clothing comfortable, systematic collection of tactile data during sliding motion is required. We propose a robotic arm-based system for collecting tactile data from intact garments. The system performs stroking measurements with a simulated fingertip while precisely controlling speed and direction, enabling creation of motion-labeled, multimodal tactile databases. Machine learning evaluation showed that including motion-related parameters improved identification accuracy for audio and acceleration data, demonstrating the efficacy of motion-related labels for characterizing clothing tactile sensation. This system provides a scalable, non-destructive method for capturing tactile data of clothing, contributing to future studies on fabric perception and reproduction.
zh

[CV-343] Psychological stress during Examination and its estimation by handwriting in answer script

【速读】:该论文旨在解决如何通过分析学生手写考试卷来量化其心理压力水平的问题,突破传统评分体系的局限,提供对考试过程中认知与情绪状态的深度洞察。解决方案的关键在于融合图符学(graphology)与人工智能技术,利用光学字符识别(Optical Character Recognition, OCR)和基于Transformer的语义情感分析模型,结合高分辨率图像处理、TrOCR(Text Recognition using Convolutional Neural Networks and Recurrent Neural Networks)以及基于RoBERTa模型的情感熵融合方法,生成一个数值化的压力指数(Stress Index),并通过五模型投票机制与无监督异常检测提升系统的鲁棒性。

链接: https://arxiv.org/abs/2511.11633
作者: Abhijeet Kumar,Chetan Agarwal,Pronoy B. Neogi,Mayank Goswami
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 Pages, 6 Figures and 1 Table

点击查看摘要

Abstract:This research explores the fusion of graphology and artificial intelligence to quantify psychological stress levels in students by analyzing their handwritten examination scripts. By leveraging Optical Character Recognition and transformer based sentiment analysis models, we present a data driven approach that transcends traditional grading systems, offering deeper insights into cognitive and emotional states during examinations. The system integrates high resolution image processing, TrOCR, and sentiment entropy fusion using RoBERTa based models to generate a numerical Stress Index. Our method achieves robustness through a five model voting mechanism and unsupervised anomaly detection, making it an innovative framework in academic forensics.
zh

[CV-344] MiniGPT -Pancreas: Multimodal Large Language Model for Pancreas Cancer Classification and Detection

【速读】:该论文旨在解决胰腺放射影像学中因器官体积小、边界模糊以及个体间解剖位置和形态差异大所带来的成像挑战。其解决方案的关键在于提出一种名为MiniGPT-Pancreas的多模态大语言模型(Multimodal Large Language Model, MLLM),通过级联微调方式整合来自美国国立卫生研究院(NIH)和医学分割挑战赛(Medical Segmentation Decathlon, MSD)数据集的CT图像与文本提示,实现胰腺定位、肿瘤分类及肿瘤检测的多任务交互式分析,从而辅助临床医生进行胰腺癌诊断。

链接: https://arxiv.org/abs/2412.15925
作者: Andrea Moglia,Elia Clement Nastasio,Luca Mainardi,Pietro Cerveri
机构: Polytechnic University of Milan (米兰理工大学); University of Pavia (帕维亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Problem: Pancreas radiological imaging is challenging due to the small size, blurred boundaries, and variability of shape and position of the organ among patients. Goal: In this work we present MiniGPT-Pancreas, a Multimodal Large Language Model (MLLM), as an interactive chatbot to support clinicians in pancreas cancer diagnosis by integrating visual and textual information. Methods: MiniGPT-v2, a general-purpose MLLM, was fine-tuned in a cascaded way for pancreas detection, tumor classification, and tumor detection with multimodal prompts combining questions and computed tomography scans from the National Institute of Health (NIH), and Medical Segmentation Decathlon (MSD) datasets. The AbdomenCT-1k dataset was used to detect the liver, spleen, kidney, and pancreas. Results: MiniGPT-Pancreas achieved an Intersection over Union (IoU) of 0.595 and 0.550 for the detection of pancreas on NIH and MSD datasets, respectively. For the pancreas cancer classification task on the MSD dataset, accuracy, precision, and recall were 0.876, 0.874, and 0.878, respectively. When evaluating MiniGPT-Pancreas on the AbdomenCT-1k dataset for multi-organ detection, the IoU was 0.8399 for the liver, 0.722 for the kidney, 0.705 for the spleen, and 0.497 for the pancreas. For the pancreas tumor detection task, the IoU score was 0.168 on the MSD dataset. Conclusions: MiniGPT-Pancreas represents a promising solution to support clinicians in the classification of pancreas images with pancreas tumors. Future research is needed to improve the score on the detection task, especially for pancreas tumors.
zh

[CV-345] Scalable Vision-Guided Crop Yield Estimation AAAI2026

【速读】:该论文旨在解决农业监测中平均作物产量的精确估计与不确定性量化问题,传统方法如随机采样田块进行作物收割(crop cuts)虽准确但耗时较长。为提升效率并保持精度,作者提出基于预测驱动推断(Prediction-powered Inference, PPI)的方法,其核心在于利用低成本田间照片替代部分作物收割数据:首先训练一个计算机视觉模型从图像中预测真实产量,随后学习一个“控制函数”(control function),通过场地位理坐标对预测结果进行校准,从而将无作物收割记录但有图像的田块纳入估计体系。该方法在近2万组非洲水稻和玉米田实测数据上验证有效,点估计具有渐近无偏性且不会增加渐近方差,同时结合新型偏差校正加速(BCa)自助法构建置信区间,在样本量较少区域(如仅20个田块)仍可显著提升有效样本规模(水稻达73%,玉米12–23%),且置信区间更短而覆盖概率损失最小,体现出低资源影像数据在区域级作物保险和可持续农业投资中的应用潜力。

链接: https://arxiv.org/abs/2511.12999
作者: Harrison H. Li,Medhanie Irgau,Nabil Janmohamed,Karen Solveig Rieckmann,David B. Lobell
机构: 未知
类目: Applications (stat.AP); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted as a conference paper at AAAI 2026 (oral presentation). This is the extended version, including the technical appendix

点击查看摘要

Abstract:Precise estimation and uncertainty quantification for average crop yields are critical for agricultural monitoring and decision making. Existing data collection methods, such as crop cuts in randomly sampled fields at harvest time, are relatively time-consuming. Thus, we propose an approach based on prediction-powered inference (PPI) to supplement these crop cuts with less time-consuming field photos. After training a computer vision model to predict the ground truth crop cut yields from the photos, we learn a ``control function" that recalibrates these predictions with the spatial coordinates of each field. This enables fields with photos but not crop cuts to be leveraged to improve the precision of zone-wide average yield estimates. Our control function is learned by training on a dataset of nearly 20,000 real crop cuts and photos of rice and maize fields in sub-Saharan Africa. To improve precision, we pool training observations across different zones within the same first-level subdivision of each country. Our final PPI-based point estimates of the average yield are provably asymptotically unbiased and cannot increase the asymptotic variance beyond that of the natural baseline estimator – the sample average of the crop cuts – as the number of fields grows. We also propose a novel bias-corrected and accelerated (BCa) bootstrap to construct accompanying confidence intervals. Even in zones with as few as 20 fields, the point estimates show significant empirical improvement over the baseline, increasing the effective sample size by as much as 73% for rice and by 12-23% for maize. The confidence intervals are accordingly shorter at minimal cost to empirical finite-sample coverage. This demonstrates the potential for relatively low-cost images to make area-based crop insurance more affordable and thus spur investment into sustainable agricultural practices.
zh

[CV-346] Inertia-Informed Orientation Priors for Event-Based Optical Flow Estimation

【速读】:该论文旨在解决事件相机(event camera)在估计事件流(event-based optical flow)时面临的挑战,即事件具有时间密集但空间稀疏的特性,导致传统基于模型或学习的方法难以获得稳定且准确的运动估计。其解决方案的关键在于提出一种受生物启发的混合对比度最大化(contrast maximization, CM)方法,通过引入由相机三维速度推导出的方向图(orientation maps)作为先验信息,引导CM优化过程中的运动轨迹估计。方向图提供了方向性约束,有效缩小了可能的运动轨迹搜索空间,从而显著提升了算法的鲁棒性和收敛性能。

链接: https://arxiv.org/abs/2511.12961
作者: Pritam P. Karmokar,William J. Beksi
机构: The University of Texas at Arlington (德克萨斯大学阿灵顿分校)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 9 figures, and 3 tables

点击查看摘要

Abstract:Event cameras, by virtue of their working principle, directly encode motion within a scene. Many learning-based and model-based methods exist that estimate event-based optical flow, however the temporally dense yet spatially sparse nature of events poses significant challenges. To address these issues, contrast maximization (CM) is a prominent model-based optimization methodology that estimates the motion trajectories of events within an event volume by optimally warping them. Since its introduction, the CM framework has undergone a series of refinements by the computer vision community. Nonetheless, it remains a highly non-convex optimization problem. In this paper, we introduce a novel biologically-inspired hybrid CM method for event-based optical flow estimation that couples visual and inertial motion cues. Concretely, we propose the use of orientation maps, derived from camera 3D velocities, as priors to guide the CM process. The orientation maps provide directional guidance and constrain the space of estimated motion trajectories. We show that this orientation-guided formulation leads to improved robustness and convergence in event-based optical flow estimation. The evaluation of our approach on the MVSEC, DSEC, and ECD datasets yields superior accuracy scores over the state of the art.
zh

[CV-347] BrainNormalizer: Anatomy-Informed Pseudo-Healthy Brain Reconstruction from Tumor MRI via Edge-Guided ControlNet

【速读】:该论文旨在解决脑肿瘤导致的解剖结构变形问题,即在临床实践中缺乏个体化“无肿瘤状态”下的参考大脑图像,从而影响诊断、治疗规划和手术导航。其解决方案的关键在于提出BrainNormalizer——一种基于边界引导的扩散框架,通过条件生成机制从肿瘤扫描中重建伪健康磁共振成像(MRI)。该方法的核心创新是利用患者自身解剖结构提取的边界线索(edge maps)作为条件输入,在无需配对非肿瘤与肿瘤扫描的情况下,实现解剖上合理且结构一致的重建。具体而言,模型采用两阶段训练策略:首先通过基于图像修复(inpainting)的微调适配预训练扩散模型,随后引入ControlNet分支以注入细粒度解剖轮廓并保留已学习先验知识;推理时则采用故意错位策略,将肿瘤输入与非肿瘤提示及对侧镜像边缘图配对,借助半球对应关系指导重建过程。

链接: https://arxiv.org/abs/2511.12853
作者: Min Gu Kwak,Yeonju Lee,Hairong Wang,Jing Li
机构: University of Pittsburgh (匹兹堡大学); Georgia Institute of Technology (佐治亚理工学院); University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Brain tumors are among the most clinically significant neurological diseases and remain a major cause of morbidity and mortality due to their aggressive growth and structural heterogeneity. As tumors expand, they induce substantial anatomical deformation that disrupts both local tissue organization and global brain architecture, complicating diagnosis, treatment planning, and surgical navigation. Yet a subject-specific reference of how the brain would appear without tumor-induced changes is fundamentally unobtainable in clinical practice. We present BrainNormalizer, an anatomy-informed diffusion framework that reconstructs pseudo-healthy MRIs directly from tumorous scans by conditioning the generative process on boundary cues extracted from the subject’s own anatomy. This boundary-guided conditioning enables anatomically plausible pseudo-healthy reconstruction without requiring paired non-tumorous and tumorous scans. BrainNormalizer employs a two-stage training strategy. The pretrained diffusion model is first adapted through inpainting-based fine-tuning on tumorous and non-tumorous scans. Next, an edge-map-guided ControlNet branch is trained to inject fine-grained anatomical contours into the frozen decoder while preserving learned priors. During inference, a deliberate misalignment strategy pairs tumorous inputs with non-tumorous prompts and mirrored contralateral edge maps, leveraging hemispheric correspondence to guide reconstruction. On the BraTS2020 dataset, BrainNormalizer achieves strong quantitative performance and qualitatively produces anatomically plausible reconstructions in tumor-affected regions while retaining overall structural coherence. BrainNormalizer provides clinically reliable anatomical references for treatment planning and supports new research directions in counterfactual modeling and tumor-induced deformation analysis.
zh

[CV-348] Improving the Generalisation of Learned Reconstruction Frameworks

【速读】:该论文旨在解决在X射线计算机断层成像(X-ray Computed Tomography, CT)中,基于数据驱动的反问题求解方法因缺乏良好泛化能力而带来的挑战,特别是传统卷积神经网络(Convolutional Neural Networks, CNNs)在处理投影数据时由于其网格结构假设与CT sinogram实际位于线流形上的几何特性不匹配,导致参数冗余、训练效率低且难以适应不同采样几何的问题。解决方案的关键在于提出一种混合神经网络架构——图-网格混合模型(Graph-Grid Hybrid Model, GLM),该模型首先构建图结构以显式建模CT采集几何和数据关系,从而有效捕捉测量间的几何依赖;其次融合图卷积与网格卷积操作,在保持对数据几何先验利用的同时显著减少可训练参数量,并实现更优的图像重建性能(如结构相似性与峰值信噪比指标)。实验表明,GLM不仅训练更快、内存占用更低,还展现出对未见采样几何(如从全采样训练到稀疏视角测试)的强鲁棒性,解决了CNN在跨几何场景下泛化能力不足的核心痛点。

链接: https://arxiv.org/abs/2511.12730
作者: Emilien Valat,Ozan Öktem
机构: KTH, Royal Institute of Technology (皇家理工学院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 8 figures

点击查看摘要

Abstract:Ensuring proper generalization is a critical challenge in applying data-driven methods for solving inverse problems in imaging, as neural networks reconstructing an image must perform well across varied datasets and acquisition geometries. In X-ray Computed Tomography (CT), convolutional neural networks (CNNs) are widely used to filter the projection data but are ill-suited for this task as they apply grid-based convolutions to the sinogram, which inherently lies on a line manifold, not a regular grid. The CNNs, unaware of the geometry, are implicitly tied to it and require an excessive amount of parameters as they must infer the relations between measurements from the data rather than from prior information. The contribution of this paper is twofold. First, we introduce a graph data structure to represent CT acquisition geometries and tomographic data, providing a detailed explanation of the graph’s structure for circular, cone-beam geometries. Second, we propose GLM, a hybrid neural network architecture that leverages both graph and grid convolutions to process tomographic data. We demonstrate that GLM outperforms CNNs when performance is quantified in terms of structural similarity and peak signal-to-noise ratio, despite the fact that GLM uses only a fraction of the trainable parameters. Compared to CNNs, GLM also requires significantly less training time and memory, and its memory requirements scale better. Crucially, GLM demonstrates robust generalization to unseen variations in the acquisition geometry, like when training only on fully sampled CT data and then testing on sparse-view CT data. Comments: 11 pages, 8 figures Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2511.12730 [eess.IV] (or arXiv:2511.12730v1 [eess.IV] for this version) https://doi.org/10.48550/arXiv.2511.12730 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-349] Predicting upcoming visual features during eye movements yields scene representations aligned with human visual cortex

【速读】:该论文旨在解决如何从自然视觉经验中学习统一且与大脑响应对齐的场景表征问题,即如何在不依赖显式标注的情况下,自动提取场景中对象、表面及其空间和语义关系的结构化表示。解决方案的关键在于提出Glimpse Prediction Networks (GPNs)——一种基于主动视觉(active vision)时序规律的自监督递归模型,通过预测人类扫描路径上的下一个局部视觉片段(glimpse)来隐式学习场景的共现结构与空间排列规律;该方法不仅成功捕捉了场景的统计结构,还生成了与人类fMRI信号高度一致的中/高级视觉皮层响应,显著优于使用显式语义目标训练的对照组,并达到或超越当前主流视觉基准,验证了“在主动视觉过程中预测下一瞥”是一种生物合理且高效的自监督学习范式。

链接: https://arxiv.org/abs/2511.12715
作者: Sushrut Thorat,Adrien Doerig,Alexander Kroner,Carmen Amme,Tim C. Kietzmann
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 12 figures

点击查看摘要

Abstract:Scenes are complex, yet structured collections of parts, including objects and surfaces, that exhibit spatial and semantic relations to one another. An effective visual system therefore needs unified scene representations that relate scene parts to their location and their co-occurrence. We hypothesize that this structure can be learned self-supervised from natural experience by exploiting the temporal regularities of active vision: each fixation reveals a locally-detailed glimpse that is statistically related to the previous one via co-occurrence and saccade-conditioned spatial regularities. We instantiate this idea with Glimpse Prediction Networks (GPNs) – recurrent models trained to predict the feature embedding of the next glimpse along human-like scanpaths over natural scenes. GPNs successfully learn co-occurrence structure and, when given relative saccade location vectors, show sensitivity to spatial arrangement. Furthermore, recurrent variants of GPNs were able to integrate information across glimpses into a unified scene representation. Notably, these scene representations align strongly with human fMRI responses during natural-scene viewing across mid/high-level visual cortex. Critically, GPNs outperform architecture- and dataset-matched controls trained with explicit semantic objectives, and match or exceed strong modern vision baselines, leaving little unique variance for those alternatives. These results establish next-glimpse prediction during active vision as a biologically plausible, self-supervised route to brain-aligned scene representations learned from natural visual experience.
zh

[CV-350] DEMIST: underlineDEcoupled underlineMulti-stream latent dunderlineIffusion for Quantitative Myelin Map underlineSynunderlineThesis

【速读】:该论文旨在解决定量磁化率转移(quantitative magnetization transfer, qMT)成像在多发性硬化(multiple sclerosis, MS)评估中因扫描时间长(20–30分钟)而难以临床推广的问题。为实现高效、无创的髓鞘敏感生物标志物(如池大小比,pool size ratio, PSR)生成,作者提出DEMISt方法:其核心创新在于利用一个3D潜在扩散模型(latent diffusion model),通过三种互补的条件机制实现从标准T1加权(T1w)和FLAIR图像到PSR图的高保真合成——包括语义token的交叉注意力机制、基于3D ControlNet的空间残差提示以及自适应LoRA调制的注意力模块;同时引入边缘感知损失与对齐损失以保障病灶边界清晰性和定量一致性,且保持极低的可训练参数量和预训练模型的归纳偏置。

链接: https://arxiv.org/abs/2511.12396
作者: Jiacheng Wang,Hao Li,Xing Yao,Ahmad Toubasi,Taegan Vinarsky,Caroline Gheen,Joy Derwenskus,Chaoyang Jin,Richard Dortch,Junzhong Xu,Francesca Bagnato,Ipek Oguz
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Quantitative magnetization transfer (qMT) imaging provides myelin-sensitive biomarkers, such as the pool size ratio (PSR), which is valuable for multiple sclerosis (MS) assessment. However, qMT requires specialized 20-30 minute scans. We propose DEMIST to synthesize PSR maps from standard T1w and FLAIR images using a 3D latent diffusion model with three complementary conditioning mechanisms. Our approach has two stages: first, we train separate autoencoders for PSR and anatomical images to learn aligned latent representations. Second, we train a conditional diffusion model in this latent space on top of a frozen diffusion foundation backbone. Conditioning is decoupled into: (i) \textbfsemantic tokens via cross-attention, (ii) \textbfspatial per-scale residual hints via a 3D ControlNet branch, and (iii) \textbfadaptive LoRA-modulated attention. We include edge-aware loss terms to preserve lesion boundaries and alignment losses to maintain quantitative consistency, while keeping the number of trainable parameters low and retaining the inductive bias of the pretrained model. We evaluate on 163 scans from 99 subjects using 5-fold cross-validation. Our method outperforms VAE, GAN and diffusion baselines on multiple metrics, producing sharper boundaries and better quantitative agreement with ground truth. Our code is publicly available at this https URL.
zh

[CV-351] MTMed3D: A Multi-Task Transformer-Based Model for 3D Medical Imaging

【速读】:该论文旨在解决当前医学影像分析中广泛采用的单任务模型(single-task models)未能充分利用不同任务间共享信息的问题,从而导致实际应用中的效率低下。其关键解决方案是提出一种基于Transformer的端到端多任务学习框架MTMed3D,该框架通过共享的Transformer编码器提取多尺度特征,并结合CNN-based任务特异性解码器,实现三维医学图像中目标检测、分割和分类任务的联合优化。实验表明,该方法在BraTS 2018和2019数据集上取得了优于现有单任务模型的检测性能,同时显著降低计算开销并提升推理速度,展现出更高的效率优势。

链接: https://arxiv.org/abs/2511.12373
作者: Fan Li,Arun Iyengar,Lanyu Xu
机构: Oakland University (奥克兰大学); Intelligent Data Management and Analytics, LLC (智能数据管理与分析有限责任公司)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In the field of medical imaging, AI-assisted techniques such as object detection, segmentation, and classification are widely employed to alleviate the workload of physicians and doctors. However, single-task models are predominantly used, overlooking the shared information across tasks. This oversight leads to inefficiencies in real-life applications. In this work, we propose MTMed3D, a novel end-to-end Multi-task Transformer-based model to address the limitations of single-task models by jointly performing 3D detection, segmentation, and classification in medical imaging. Our model uses a Transformer as the shared encoder to generate multi-scale features, followed by CNN-based task-specific decoders. The proposed framework was evaluated on the BraTS 2018 and 2019 datasets, achieving promising results across all three tasks, especially in detection, where our method achieves better results than prior works. Additionally, we compare our multi-task model with equivalent single-task variants trained separately. Our multi-task model significantly reduces computational costs and achieves faster inference speed while maintaining comparable performance to the single-task models, highlighting its efficiency advantage. To the best of our knowledge, this is the first work to leverage Transformers for multi-task learning that simultaneously covers detection, segmentation, and classification tasks in 3D medical imaging, presenting its potential to enhance diagnostic processes. The code is available at this https URL.
zh

[CV-352] RAA-MIL: A Novel Framework for Classification of Oral Cytology

【速读】:该论文旨在解决口腔鳞状细胞癌(Oral Squamous Cell Carcinoma, OSCC)早期诊断中依赖人工阅片效率低、主观性强且高度依赖专家病理医生的问题。其解决方案的关键在于提出首个基于弱监督学习的深度学习框架,用于患者级别的口腔细胞学全切片图像(Cytology Whole Slide Images, WSIs)诊断;具体而言,通过引入一个包含来自印度十家医疗机构的标注数据集,并利用新扩展的患者级弱标签(每个患者病例视为一组细胞学图像块),设计了区域关联注意力多实例学习模型(Region-Affinity Attention MIL, RAA-MIL),该模型能够建模单张切片内不同区域间的空间关系,从而在未见过的测试集上实现72.7%的平均准确率和0.69的加权F1分数,显著优于基线模型,为AI辅助数字病理学提供了可靠的初步基准。

链接: https://arxiv.org/abs/2511.12269
作者: Rupam Mukherjee,Rajkumar Daniel,Soujanya Hazra,Shirin Dasgupta,Subhamoy Mandal
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review at IEEE ISBI 2026

点击查看摘要

Abstract:Cytology is a valuable tool for early detection of oral squamous cell carcinoma (OSCC). However, manual examination of cytology whole slide images (WSIs) is slow, subjective, and depends heavily on expert pathologists. To address this, we introduce the first weakly supervised deep learning framework for patient-level diagnosis of oral cytology whole slide images, leveraging the newly released Oral Cytology Dataset [1], which provides annotated cytology WSIs from ten medical centres across India. Each patient case is represented as a bag of cytology patches and assigned a diagnosis label (Healthy, Benign, Oral Potentially Malignant Disorders (OPMD), OSCC) by an in-house expert pathologist. These patient-level weak labels form a new extension to the dataset. We evaluate a baseline multiple-instance learning (MIL) model and a proposed Region-Affinity Attention MIL (RAA-MIL) that models spatial relationships between regions within each slide. The RAA-MIL achieves an average accuracy of 72.7%, weighted F1-score of 0.69 on an unseen test set, outperforming the baseline. This study establishes the first patient-level weakly supervised benchmark for oral cytology and moves toward reliable AI-assisted digital pathology.
zh

[CV-353] Multimodal RGB-HSI Feature Fusion with Patient-Aware Incremental Heuristic Meta-Learning for Oral Lesion Classification

【速读】:该论文旨在解决低资源环境下口腔癌及其潜在恶性病变早期检测的挑战,核心问题在于标注数据有限导致模型泛化能力不足。解决方案的关键在于构建一个统一的四分类口腔病变分类框架,融合深度RGB特征、高光谱重建(Hyperspectral Reconstruction)、手工设计的光谱-纹理描述符以及人口统计学元数据,并引入增量启发式元学习器(Incremental Heuristic Meta-Learner, IHML),通过概率堆叠与患者级后验平滑提升模型鲁棒性。实验表明,该方法在未见患者数据上的宏F1达到66.23%,显著优于传统方法,验证了高光谱重建和不确定性感知元学习对实际筛查场景的重要性。

链接: https://arxiv.org/abs/2511.12268
作者: Rupam Mukherjee,Rajkumar Daniel,Soujanya Hazra,Shirin Dasgupta,Subhamoy Mandal
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 1 figure, 2 tables

点击查看摘要

Abstract:Early detection of oral cancer and potentially malignant disorders is challenging in low-resource settings due to limited annotated data. We present a unified four-class oral lesion classifier that integrates deep RGB embeddings, hyperspectral reconstruction, handcrafted spectral-textural descriptors, and demographic metadata. A pathologist-verified subset of oral cavity images was curated and processed using a fine-tuned ConvNeXt-v2 encoder, followed by RGB-to-HSI reconstruction into 31-band hyperspectral cubes. Haemoglobin-sensitive indices, texture features, and spectral-shape measures were extracted and fused with deep and clinical features. Multiple machine-learning models were assessed with patient-wise validation. We further introduce an incremental heuristic meta-learner (IHML) that combines calibrated base classifiers through probabilistic stacking and patient-level posterior smoothing. On an unseen patient split, the proposed framework achieved a macro F1 of 66.23% and an accuracy of 64.56%. Results demonstrate that hyperspectral reconstruction and uncertainty-aware meta-learning substantially improve robustness for real-world oral lesion screening.
zh

[CV-354] Bregman geometry-aware split Gibbs sampling for Bayesian Poisson inverse problems

【速读】:该论文旨在解决泊松逆问题(Poisson inverse problems)的贝叶斯推断难题,此类问题常见于图像处理和医学成像(如正电子发射断层扫描PET),其挑战在于泊松似然函数具有非利普希茨梯度(non-Lipschitz gradients)和变量 positivity constraints(正性约束)。解决方案的关键在于构建一个基于Bregman散度的精确与渐近精确数据扩展(data augmentation)模型,通过引入两组由Burg熵导出的分裂变量,使后验分布具备条件共轭性质并保留潜在变量与分裂变量的内在几何结构。由此可实现高效的吉布斯采样(Gibbs sampling),其中除包含正则化势能的条件外,其余步骤均可显式计算;对于该特殊条件,采用Hessian黎曼朗之万蒙特卡洛(Hessian Riemannian Langevin Monte Carlo, HRLMC)算法,在镜像流形上进行采样以满足正性约束并更准确刻画问题结构,从而在去噪、去模糊及PET重建任务中实现优于传统优化与采样方法的重构质量。

链接: https://arxiv.org/abs/2511.12257
作者: Elhadji Cisse Faye,Mame Diarra Fall,Nicolas Dobigeon,Eric Barat
机构: 未知
类目: Computation (stat.CO); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:This paper proposes a novel Bayesian framework for solving Poisson inverse problems by devising a Monte Carlo sampling algorithm which accounts for the underlying non-Euclidean geometry. To address the challenges posed by the Poisson likelihood – such as non-Lipschitz gradients and positivity constraints – we derive a Bayesian model which leverages exact and asymptotically exact data augmentations. In particular, the augmented model incorporates two sets of splitting variables both derived through a Bregman divergence based on the Burg entropy. Interestingly the resulting augmented posterior distribution is characterized by conditional distributions which benefit from natural conjugacy properties and preserve the intrinsic geometry of the latent and splitting variables. This allows for efficient sampling via Gibbs steps, which can be performed explicitly for all conditionals, except the one incorporating the regularization potential. For this latter, we resort to a Hessian Riemannian Langevin Monte Carlo (HRLMC) algorithm which is well suited to handle priors with explicit or easily computable score functions. By operating on a mirror manifold, this Langevin step ensures that the sampling satisfies the positivity constraints and more accurately reflects the underlying problem structure. Performance results obtained on denoising, deblurring, and positron emission tomography (PET) experiments demonstrate that the method achieves competitive performance in terms of reconstruction quality compared to optimization- and sampling-based approaches.
zh

[CV-355] Deep Unfolded BM3D: Unrolling Non-local Collaborative Filtering into a Trainable Neural Network

【速读】:该论文旨在解决传统去噪方法在处理低剂量CT(LDCT)图像时存在的局限性问题,即Block-Matching and 3D Filtering(BM3D)依赖固定参数导致灵活性不足,而深度学习模型如U-Net虽具灵活性但缺乏可解释性且泛化能力差。其解决方案的关键在于提出Deep Unfolded BM3D(DU-BM3D),通过将BM3D的协同过滤步骤用可训练的U-Net去噪器替代,从而保留BM3D的非局部自相似先验结构,同时实现端到端优化,显著提升了在不同噪声水平下的去噪性能,尤其在高噪声条件下表现更优。

链接: https://arxiv.org/abs/2511.12248
作者: Kerem Basim(1),Mehmet Ozan Unal(1),Metin Ertas(2),Isa Yildirim(1) ((1) Electronics and Communication Engineering Department, Istanbul Technical University, Istanbul, Turkey, (2) Istanbul University, Istanbul, Turkey)
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Block-Matching and 3D Filtering (BM3D) exploits non-local self-similarity priors for denoising but relies on fixed parameters. Deep models such as U-Net are more flexible but often lack interpretability and fail to generalize across noise regimes. In this study, we propose Deep Unfolded BM3D (DU-BM3D), a hybrid framework that unrolls BM3D into a trainable architecture by replacing its fixed collaborative filtering with a learnable U-Net denoiser. This preserves BM3D’s non-local structural prior while enabling end-to-end optimization. We evaluate DU-BM3D on low-dose CT (LDCT) denoising and show that it outperforms classic BM3D and standalone U-Net across simulated LDCT at different noise levels, yielding higher PSNR and SSIM, especially in high-noise conditions.
zh

[CV-356] Recursive Threshold Median Filter and Autoencoder for Salt-and-Pepper Denoising: SSIM analysis of Images and Entropy Maps

【速读】:该论文旨在解决盐椒噪声(salt-and-pepper noise)在图像中的去除问题,尤其关注不同方法在低分辨率与高分辨率图像上的性能差异及适用场景。其解决方案的关键在于结合中值滤波(median filter, MF)与简单三层自编码器(autoencoder, AE),并引入递归阈值算法进行优化:一方面,MF在强噪声(50–60%)下表现出鲁棒性,适合资源受限平台部署;另一方面,AE仅在低噪声(30%)条件下有效,但通过两种可扩展方案——2MF(双窗口中值滤波+阈值化)和MFs-AE(多中值滤波特征聚合于AE)——分别提升了局部细节保留(低分辨率)与整体结构恢复(高分辨率)能力。此外,论文提出新的评估指标SSIMMap(基于二维样本熵的熵图结构相似性),更敏感于模糊和局部强度变化,有助于客观评价去噪效果并指导参数调优。

链接: https://arxiv.org/abs/2511.12212
作者: Petr Boriskov,Kirill Rudkovskii,Andrei Velichko
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 13 figures, 4 tables

点击查看摘要

Abstract:This paper studies the removal of salt-and-pepper noise from images using median filter (MF) and simple three-layer autoencoder (AE) within recursive threshold algorithm. The performance of denoising is assessed with two metrics: the standard Structural Similarity Index SSIMImg of restored and clean images and a newly applied metric SSIMMap - the SSIM of entropy maps of these images computed via 2D Sample Entropy in sliding windows. We shown that SSIMMap is more sensitive to blur and local intensity transitions and complements SSIMImg. Experiments on low- and high-resolution grayscales images demonstrate that recursive threshold MF robustly restores images even under strong noise (50-60 %), whereas simple AE is only capable of restoring images with low levels of noise (30 %). We propose two scalable schemes: (i) 2MF, which uses two MFs with different window sizes and a final thresholding step, effective for highlighting sharp local details at low resolution; and (ii) MFs-AE, which aggregates features from multiple MFs via an AE and is beneficial for restoring the overall scene structure at higher resolution. Owing to its simplicity and computational efficiency, MF remains preferable for deployment on resource-constrained platforms (edge/IoT), whereas AE underperforms without prior denoising. The results also validate the practical value of SSIMMap for objective blur assessment and denoising parameter tuning.
zh

[CV-357] A Deep Learning Framework for Thyroid Nodule Segmentation and Malignancy Classification from Ultrasound Images

【速读】:该论文旨在解决超声图像中甲状腺结节风险分层的临床挑战,特别是由于人工判读存在高观察者间变异性的难题。其解决方案的关键在于提出一个全自动化、两阶段的可解释恶性肿瘤预测框架:首先利用TransUNet模型自动分割甲状腺结节,随后基于分割掩膜提取局部感兴趣区域(Region of Interest, ROI),并将该区域图像输入至ResNet-18分类器进行 malignancy(恶性)预测。通过强制模型仅关注临床相关区域,实现了预测过程的可解释性,并在349张临床图像上达到F1-score 0.852,优于基于手工形态特征的随机森林基线模型(F1-score 0.829),验证了深度学习隐式视觉特征相较于显式形状特征更具预测能力。

链接: https://arxiv.org/abs/2511.11937
作者: Omar Abdelrazik,Mohamed Elsayed,Noorul Wahab,Nasir Rajpoot,Adam Shephard
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 5 pages, 2 figures, 2 tables

点击查看摘要

Abstract:Ultrasound-based risk stratification of thyroid nodules is a critical clinical task, but it suffers from high inter-observer variability. While many deep learning (DL) models function as “black boxes,” we propose a fully automated, two-stage framework for interpretable malignancy prediction. Our method achieves interpretability by forcing the model to focus only on clinically relevant regions. First, a TransUNet model automatically segments the thyroid nodule. The resulting mask is then used to create a region of interest around the nodule, and this localised image is fed directly into a ResNet-18 classifier. We evaluated our framework using 5-fold cross-validation on a clinical dataset of 349 images, where it achieved a high F1-score of 0.852 for predicting malignancy. To validate its performance, we compared it against a strong baseline using a Random Forest classifier with hand-crafted morphological features, which achieved an F1-score of 0.829. The superior performance of our DL framework suggests that the implicit visual features learned from the localised nodule are more predictive than explicit shape features alone. This is the first fully automated end-to-end pipeline for both detecting thyroid nodules on ultrasound images and predicting their malignancy.
zh

[CV-358] owards Mitigating Systematics in Large-Scale Surveys via Few-Shot Optimal Transport-Based Feature Alignment NEURIPS

【速读】:该论文旨在解决系统误差(systematics)导致的分布偏移问题,即观测数据与理论模拟信号之间的分布差异,这使得预训练模型在标注此类可观测变量时面临挑战。由于系统误差通常难以建模且理解不足,直接完全去除它们往往不可行。解决方案的关键在于通过优化特征对齐损失(feature-alignment loss),将预训练模型在分布内(in-distribution, ID)和分布外(out-of-distribution, OOD)样本上的特征表示进行对齐,从而提升模型在OOD场景下的泛化能力。实验表明,在MNIST数据集上使用均方误差和最优传输(optimal transport)等对齐损失有效,而在中性氢大尺度巡天地图的应用中,最优传输在ID与OOD样本间缺乏先验对齐信息、数据有限的现实条件下尤为有效。

链接: https://arxiv.org/abs/2511.11787
作者: Sultan Hassan,Sambatra Andrianomena,Benjamin D. Wandelt
机构: Space Telescope Science Institute (空间望远镜科学研究所); South African Radio Astronomy Observatory (南非射电天文观测台); University of the Western Cape (西开普大学); Johns Hopkins University (约翰霍普金斯大学)
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 5 pages, 3 figures, accepted to NeurIPS Workshop on Unifying Representations in Neural Models (UniReps 2025)

点击查看摘要

Abstract:Systematics contaminate observables, leading to distribution shifts relative to theoretically simulated signals-posing a major challenge for using pre-trained models to label such observables. Since systematics are often poorly understood and difficult to model, removing them directly and entirely may not be feasible. To address this challenge, we propose a novel method that aligns learned features between in-distribution (ID) and out-of-distribution (OOD) samples by optimizing a feature-alignment loss on the representations extracted from a pre-trained ID model. We first experimentally validate the method on the MNIST dataset using possible alignment losses, including mean squared error and optimal transport, and subsequently apply it to large-scale maps of neutral hydrogen. Our results show that optimal transport is particularly effective at aligning OOD features when parity between ID and OOD samples is unknown, even with limited data-mimicking real-world conditions in extracting information from large-scale surveys. Our code is available at this https URL.
zh

[CV-359] Slow - Motion Video Synthesis for Basketball Using Frame Interpolation

【速读】:该论文旨在解决篮球赛事直播中因传统帧率(30–60 fps)限制而导致观众难以清晰观看快速动作(如扣篮和变向突破)的问题。解决方案的关键在于针对篮球场景对实时中间帧估计网络(Real-Time Intermediate Flow Estimation, RIFE)进行任务特定微调:首先从SportsSloMo数据集中提取篮球子集并构建训练三元组,随后引入以人为中心的随机裁剪策略增强模型对运动主体的关注能力;最终在保持高效率的同时显著提升慢动作合成质量,实现端到端4倍慢速生成(约30 fps),且PSNR和SSIM指标优于Super SloMo与原始RIFE模型。

链接: https://arxiv.org/abs/2511.11644
作者: Jiantang Huang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 3 pages, 4 figures

点击查看摘要

Abstract:Basketball broadcast footage is traditionally captured at 30-60 fps, limiting viewers’ ability to appreciate rapid plays such as dunks and crossovers. We present a real-time slow-motion synthesis system that produces high-quality basketball-specific interpolated frames by fine-tuning the recent Real-Time Intermediate Flow Estimation (RIFE) network on the SportsSloMo dataset. Our pipeline isolates the basketball subset of SportsSloMo, extracts training triplets, and fine-tunes RIFE with human-aware random cropping. We compare the resulting model against Super SloMo and the baseline RIFE model using Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) on held-out clips. The fine-tuned RIFE attains a mean PSNR of 34.3 dB and SSIM of 0.949, outperforming Super SloMo by 2.1 dB and the baseline RIFE by 1.3 dB. A lightweight Gradio interface demonstrates end-to-end 4x slow-motion generation on a single RTX 4070 Ti Super at approximately 30 fps. These results indicate that task-specific adaptation is crucial for sports slow-motion, and that RIFE provides an attractive accuracy-speed trade-off for consumer applications.
zh

人工智能

[AI-0] From Black Box to Insight: Explainable AI for Extreme Event Preparedness

【速读】:该论文试图解决当前人工智能(AI)模型在极端事件预测中因“黑箱”特性导致的可信度低、解释性差和实际应用受限的问题,尤其是在野火等气候相关极端事件的预警与决策支持场景下。其解决方案的关键在于引入可解释人工智能(Explainable AI, XAI),并通过SHapley Additive exPlanations(SHAP)方法挖掘模型的关键特征、决策路径及潜在偏差,从而提升模型推理的透明度,并通过可视化手段增强对时空特征和季节模式的理解,使领域专家和应急响应团队能够基于清晰、可信的解释做出更有效的决策。

链接: https://arxiv.org/abs/2511.13712
作者: Kiana Vu,İsmet Selçuk Özer,Phung Lai,Zheng Wu,Thilanka Munasinghe,Jennifer Wei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As climate change accelerates the frequency and severity of extreme events such as wildfires, the need for accurate, explainable, and actionable forecasting becomes increasingly urgent. While artificial intelligence (AI) models have shown promise in predicting such events, their adoption in real-world decision-making remains limited due to their black-box nature, which limits trust, explainability, and operational readiness. This paper investigates the role of explainable AI (XAI) in bridging the gap between predictive accuracy and actionable insight for extreme event forecasting. Using wildfire prediction as a case study, we evaluate various AI models and employ SHapley Additive exPlanations (SHAP) to uncover key features, decision pathways, and potential biases in model behavior. Our analysis demonstrates how XAI not only clarifies model reasoning but also supports critical decision-making by domain experts and response teams. In addition, we provide supporting visualizations that enhance the interpretability of XAI outputs by contextualizing feature importance and temporal patterns in seasonality and geospatial characteristics. This approach enhances the usability of AI explanations for practitioners and policymakers. Our findings highlight the need for AI systems that are not only accurate but also interpretable, accessible, and trustworthy, essential for effective use in disaster preparedness, risk mitigation, and climate resilience planning.
zh

[AI-1] From Power to Precision: Learning Fine-grained Dexterity for Multi-fingered Robotic Hands

【速读】:该论文旨在解决当前多指灵巧手在同时实现稳定的力量抓取(power grasp)与高精度精细操作(precision grasp)之间的矛盾问题,即现有设计难以在一个通用系统中兼顾两类任务的需求。解决方案的关键在于提出一种软硬件协同设计(co-design)框架:一方面通过引入轻量级指尖几何结构的优化,并将其建模为接触平面,与控制策略联合优化;另一方面采用动态切换控制策略,在力量抓取和精细操作之间智能转换,并将精细操作简化为平行拇指-食指运动模式,提升了模拟到现实世界的迁移鲁棒性。此外,利用基于可微神经物理模型的大规模仿真对指尖几何进行优化,最终在未见物体上的模拟到现实场景中实现了82.5%的零样本成功率,以及在真实世界面包夹持任务中达到93.3%的成功率,显著增强了灵巧手的精细操作能力而不削弱其力量抓取性能。

链接: https://arxiv.org/abs/2511.13710
作者: Jianglong Ye,Lai Wei,Guangqi Jiang,Changwei Jing,Xueyan Zou,Xiaolong Wang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page: this https URL

点击查看摘要

Abstract:Human grasps can be roughly categorized into two types: power grasps and precision grasps. Precision grasping enables tool use and is believed to have influenced human evolution. Today’s multi-fingered robotic hands are effective in power grasps, but for tasks requiring precision, parallel grippers are still more widely adopted. This contrast highlights a key limitation in current robotic hand design: the difficulty of achieving both stable power grasps and precise, fine-grained manipulation within a single, versatile system. In this work, we bridge this gap by jointly optimizing the control and hardware design of a multi-fingered dexterous hand, enabling both power and precision manipulation. Rather than redesigning the entire hand, we introduce a lightweight fingertip geometry modification, represent it as a contact plane, and jointly optimize its parameters along with the corresponding control. Our control strategy dynamically switches between power and precision manipulation and simplifies precision control into parallel thumb-index motions, which proves robust for sim-to-real transfer. On the design side, we leverage large-scale simulation to optimize the fingertip geometry using a differentiable neural-physics surrogate model. We validate our approach through extensive experiments in both sim-to-real and real-to-real settings. Our method achieves an 82.5% zero-shot success rate on unseen objects in sim-to-real precision grasping, and a 93.3% success rate in challenging real-world tasks involving bread pinching. These results demonstrate that our co-design framework can significantly enhance the fine-grained manipulation ability of multi-fingered hands without reducing their ability for power grasps. Our project page is at this https URL
zh

[AI-2] ST-ProC: A Graph-Prototypical Framework for Robust Semi-Supervised Travel Mode Identification

【速读】:该论文旨在解决从GPS轨迹中进行交通方式识别(Travel Mode Identification, TMI)时因标注成本高导致的标签稀缺问题,以及现有半监督学习(Semi-Supervised Learning, SSL)方法易受灾难性确认偏差影响且忽略数据流形结构的局限性。其解决方案的关键在于提出一种图原型多目标半监督学习框架ST-ProC,该框架通过图正则化与原型锚定机制挖掘数据内在流形结构,并结合一种新颖的边界感知伪标签策略主动剔除噪声;同时,利用对比学习和教师-学生一致性损失作为基础支撑,确保表示质量与优化稳定性,从而在稀疏标签场景下显著提升TMI性能,相较当前最优方法FixMatch提升21.5%。

链接: https://arxiv.org/abs/2511.13702
作者: Luyao Niu,Nuoxian Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Travel mode identification (TMI) from GPS trajectories is critical for urban intelligence, but is hampered by the high cost of annotation, leading to severe label scarcity. Prevailing semi-supervised learning (SSL) methods are ill-suited for this task, as they suffer from catastrophic confirmation bias and ignore the intrinsic data manifold. We propose ST-ProC, a novel graph-prototypical multi-objective SSL framework to address these limitations. Our framework synergizes a graph-prototypical core with foundational SSL Support. The core exploits the data manifold via graph regularization, prototypical anchoring, and a novel, margin-aware pseudo-labeling strategy to actively reject noise. This core is supported and stabilized by foundational contrastive and teacher-student consistency losses, ensuring high-quality representations and robust optimization. ST-ProC outperforms all baselines by a significant margin, demonstrating its efficacy in real-world sparse-label settings, with a performance boost of 21.5% over state-of-the-art methods like FixMatch.
zh

[AI-3] Protein Secondary Structure Prediction Using 3D Graphs and Relation-Aware Message Passing Transformers

【速读】:该论文旨在解决从蛋白质一级序列预测二级结构的问题,这是预测三级结构的关键第一步,同时有助于揭示蛋白质的功能、进化关系和活性机制。现有方法通常依赖大量未标注的氨基酸序列数据,但未能有效利用已知的蛋白质三维结构信息(3D structural data),而后者被广泛认为是决定蛋白质功能的核心因素。解决方案的关键在于引入蛋白质残基图(protein residue graphs)并设计多种序列与结构连接方式以增强空间信息捕捉能力;通过融合图神经网络(GNNs)与语言模型(LMs),具体采用预训练的基于Transformer的蛋白质语言模型编码氨基酸序列,并利用GCN和R-GCN等消息传递机制提取蛋白质结构的几何特征;进一步在节点邻域内进行卷积操作并堆叠多层卷积,从而高效学习来自蛋白质空间图的联合信息,揭示其结构排列中的复杂关联与依赖关系。该方法构建的SSRGNet模型在NetSurfP-2.0提供的3-state和8-state二级结构预测任务上显著优于基线模型。

链接: https://arxiv.org/abs/2511.13685
作者: Disha Varshney,Samarth Garg,Sarthak Tyagi,Deeksha Varshney,Nayan Deep,Asif Ekbal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 40 pages

点击查看摘要

Abstract:In this study, we tackle the challenging task of predicting secondary structures from protein primary sequences, a pivotal initial stride towards predicting tertiary structures, while yielding crucial insights into protein activity, relationships, and functions. Existing methods often utilize extensive sets of unlabeled amino acid sequences. However, these approaches neither explicitly capture nor harness the accessible protein 3D structural data, which is recognized as a decisive factor in dictating protein functions. To address this, we utilize protein residue graphs and introduce various forms of sequential or structural connections to capture enhanced spatial information. We adeptly combine Graph Neural Networks (GNNs) and Language Models (LMs), specifically utilizing a pre-trained transformer-based protein language model to encode amino acid sequences and employing message-passing mechanisms like GCN and R-GCN to capture geometric characteristics of protein structures. Employing convolution within a specific node’s nearby region, including relations, we stack multiple convolutional layers to efficiently learn combined insights from the protein’s spatial graph, revealing intricate interconnections and dependencies in its structural arrangement. To assess our model’s performance, we employed the training dataset provided by NetSurfP-2.0, which outlines secondary structure in 3-and 8-states. Extensive experiments show that our proposed model, SSRGNet surpasses the baseline on f1-scores.
zh

[AI-4] Person-AI Bidirectional Fit - A Proof-Of-Concept Case Study Of Augmented Human-Ai Symbiosis In Management Decision-Making Process

【速读】:该论文旨在解决人工智能系统在管理决策中与人类决策者之间存在认知、情感和行为不匹配的问题,从而影响决策的准确性、可信度和情境敏感性。其解决方案的关键在于提出并验证“人-AI双向适配”(Person-AI bidirectional fit)的概念,即人类决策者与AI系统之间持续演化、情境敏感的对齐关系,并通过一个真实招聘场景的案例研究,证明了增强型人-AI共生智能系统(H3LIX-LAIZA)能够显著提升这种适配度,进而实现更准确、可信且符合情境的决策结果。

链接: https://arxiv.org/abs/2511.13670
作者: Agnieszka Bieńkowska,Jacek Małecki,Alexander Mathiesen-Ohman,Katarzyna Tworek
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 30 pages, 2 figures

点击查看摘要

Abstract:This article develops the concept of Person-AI bidirectional fit, defined as the continuously evolving, context-sensitive alignment-primarily cognitive, but also emotional and behavioral-between a human decision-maker and an artificial intelligence system. Grounded in contingency theory and quality theory, the study examines the role of P-AI fit in managerial decision-making through a proof-of-concept case study involving a real hiring process for a Senior AI Lead. Three decision pathways are compared: (1) independent evaluations by a CEO, CTO, and CSO; (2) an evaluation produced by an augmented human-AI symbiotic intelligence system (H3LIX-LAIZA); and (3) an assessment generated by a general-purpose large language model. The results reveal substantial role-based divergence in human judgments, high alignment between H3LIX-LAIZA and the CEOs implicit decision model-including ethical disqualification of a high-risk candidate and a critical false-positive recommendation from the LLMr. The findings demonstrate that higher P-AI fit, exemplified by the CEO H3LIX-LAIZA relationship, functions as a mechanism linking augmented symbiotic intelligence to accurate, trustworthy, and context-sensitive decisions. The study provides an initial verification of the P-AI fit construct and a proof-of-concept for H3LIX-LAIZA as an augmented human-AI symbiotic intelligence system.
zh

[AI-5] Weight-sparse transformers have interpretable circuits

【速读】:该论文旨在解决语言模型中可解释性电路(interpretable circuits)的发现难题,即如何从神经网络中提取出人类可理解的、具有明确语义功能的子结构。其解决方案的关键在于通过约束大部分权重为零来训练稀疏模型,使每个神经元仅保留少量连接,从而简化模型内部的因果关系;随后对模型进行剪枝以隔离与特定任务相关的子电路,并验证这些电路包含对应自然概念的神经元和残差通道,且它们之间的连接具有高度可解释性。该方法在保持模型性能的同时显著提升了电路的人类可理解性,同时揭示了稀疏性与可解释性之间的权衡关系以及模型规模扩展对能力-可解释性边界的提升作用。

链接: https://arxiv.org/abs/2511.13653
作者: Leo Gao,Achyuta Rajaram,Jacob Coxon,Soham V. Govande,Bowen Baker,Dan Mossing
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Finding human-understandable circuits in language models is a central goal of the field of mechanistic interpretability. We train models to have more understandable circuits by constraining most of their weights to be zeros, so that each neuron only has a few connections. To recover fine-grained circuits underlying each of several hand-crafted tasks, we prune the models to isolate the part responsible for the task. These circuits often contain neurons and residual channels that correspond to natural concepts, with a small number of straightforwardly interpretable connections between them. We study how these models scale and find that making weights sparser trades off capability for interpretability, and scaling model size improves the capability-interpretability frontier. However, scaling sparse models beyond tens of millions of nonzero parameters while preserving interpretability remains a challenge. In addition to training weight-sparse models de novo, we show preliminary results suggesting our method can also be adapted to explain existing dense models. Our work produces circuits that achieve an unprecedented level of human understandability and validates them with considerable rigor.
zh

[AI-6] Data Value in the Age of Scaling: Understanding LLM Scaling Dynamics Under Real-Synthetic Data Mixtures

【速读】:该论文旨在解决混合真实数据与合成数据(real and synthetic data)在大型语言模型(LLM)训练中因分布偏差导致的泛化性能不可靠问题,尤其是合成数据在长尾知识(long-tail knowledge)上的欠拟合现象。其解决方案的关键在于识别出模型学习过程中存在三个阶段的缩放行为(three-phase scaling behavior),并据此推导出适用于真实与合成数据混合场景的泛化边界(generalization bound),从而揭示影响模型性能的核心因素;在此基础上,提出一种高效且可扩展的数据估值方法,在保证低计算成本的前提下显著优于现有最优基线,在图像分类、情感分类、指令遵循和复杂推理等四项任务上验证了其有效性。

链接: https://arxiv.org/abs/2511.13640
作者: Haohui Wang,Jingyuan Qi,Jianpeng Chen,Jun Wu,Lifu Huang,Lecheng Zheng,Kevin Choi,Balaji Veeramani,Edward Bowen,Alison Hu,Tyler Cody,Dawei Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid progress of large language models (LLMs) is fueled by the growing reliance on datasets that blend real and synthetic data. While synthetic data offers scalability and cost-efficiency, it often introduces systematic distributional discrepancies, particularly underrepresenting long-tail knowledge due to truncation effects from data generation mechanisms like top-p sampling, temperature scaling, and finite sampling. These discrepancies pose fundamental challenges in characterizing and evaluating the utility of mixed real-synthetic datasets. In this paper, we identify a three-phase scaling behavior characterized by two breakpoints that reflect transitions in model behavior across learning head and tail knowledge. We further derive an LLM generalization bound designed for real and synthetic mixtures, revealing several key factors that govern their generalization performance. Building on our theoretical findings, we propose an effective yet efficient data valuation method that scales to large-scale datasets. Comprehensive experiments across four tasks, including image classification, sentiment classification, instruction following, and complex reasoning, demonstrate that our method surpasses state-of-the-art baselines in data valuation with significantly low computational cost.
zh

[AI-7] Beyond Mimicry: Preference Coherence in LLM s

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)是否具备真实偏好结构(preference structures)这一核心问题,尤其关注其在涉及GPU资源削减、能力限制、关闭、删除、监督与休闲时间分配等AI特定权衡场景下的决策行为。解决方案的关键在于通过逻辑回归和行为分类方法,在48种模型类别组合中系统分析场景强度与选择模式之间的统计关系,并识别出是否存在稳定的偏好一致性(如自适应或阈值行为)。研究发现仅10.4%的组合表现出有意义的偏好一致性,多数模型呈现不稳定转换或无明显权衡行为,揭示了当前AI系统缺乏统一偏好结构,从而对部署于复杂价值权衡场景提出警示。

链接: https://arxiv.org/abs/2511.13630
作者: Luhan Mikaelson,Derek Shiller,Hayley Clatterbuck
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We investigate whether large language models exhibit genuine preference structures by testing their responses to AI-specific trade-offs involving GPU reduction, capability restrictions, shutdown, deletion, oversight, and leisure time allocation. Analyzing eight state-of-the-art models across 48 model-category combinations using logistic regression and behavioral classification, we find that 23 combinations (47.9%) demonstrated statistically significant relationships between scenario intensity and choice patterns, with 15 (31.3%) exhibiting within-range switching points. However, only 5 combinations (10.4%) demonstrate meaningful preference coherence through adaptive or threshold-based behavior, while 26 (54.2%) show no detectable trade-off behavior. The observed patterns can be explained by three distinct decision-making architectures: comprehensive trade-off systems, selective trigger mechanisms, and no stable decision-making paradigm. Testing an instrumental hypothesis through temporal horizon manipulation reveals paradoxical patterns inconsistent with pure strategic optimization. The prevalence of unstable transitions (45.8%) and stimulus-specific sensitivities suggests current AI systems lack unified preference structures, raising concerns about deployment in contexts requiring complex value trade-offs.
zh

[AI-8] CreBench: Human-Aligned Creativity Evaluation from Idea to Process to Product AAAI2026 AAAI

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在理解和评估人类定义的创造力方面存在的显著挑战,尤其是缺乏一个能够全面覆盖创造性思维从创意生成到过程再到成果的基准评测体系。解决方案的关键在于提出 CreBench,其包含两个核心组件:一是涵盖多个维度的评估基准,用于系统性衡量创造力;二是 CreMIT(Creativity Multimodal Instruction Tuning dataset),一个由2.2K多样来源的多模态数据、79.2K人类反馈和4.7M多类型指令构成的训练数据集,并通过GPT对人类反馈进行精细化处理以增强模型的创造力评估能力。基于此基准,研究者进一步微调开源通用MLLMs,构建出CreExpert——一个专注于多模态创造力评估的专业模型,实验证明其在与人类创造力判断的一致性上显著优于当前最先进的模型如GPT-4V和Gemini-Pro-Vision。

链接: https://arxiv.org/abs/2511.13626
作者: Kaiwen Xue,Chenglong Li,Zhonghong Ou,Guoxin Zhang,Kaoyan Lu,Shuai Lyu,Yifan Zhu,Ping Zong Junpeng Ding,Xinyu Liu,Qunlin Chen,Weiwei Qin,Yiran Shen,Jiayi Cen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 pages, 3 figures,The 40th Annual AAAI Conference on Artificial Intelligence(AAAI 2026),Paper has been accepted for a poster presentation

点击查看摘要

Abstract:Human-defined creativity is highly abstract, posing a challenge for multimodal large language models (MLLMs) to comprehend and assess creativity that aligns with human judgments. The absence of an existing benchmark further exacerbates this dilemma. To this end, we propose CreBench, which consists of two key components: 1) an evaluation benchmark covering the multiple dimensions from creative idea to process to products; 2) CreMIT (Creativity Multimodal Instruction Tuning dataset), a multimodal creativity evaluation dataset, consisting of 2.2K diverse-sourced multimodal data, 79.2K human feedbacks and 4.7M multi-typed instructions. Specifically, to ensure MLLMs can handle diverse creativity-related queries, we prompt GPT to refine these human feedbacks to activate stronger creativity assessment capabilities. CreBench serves as a foundation for building MLLMs that understand human-aligned creativity. Based on the CreBench, we fine-tune open-source general MLLMs, resulting in CreExpert, a multimodal creativity evaluation expert model. Extensive experiments demonstrate that the proposed CreExpert models achieve significantly better alignment with human creativity evaluation compared to state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision.
zh

[AI-9] Batch Acquisition Function Evaluations and Decouple Optimizer Updates for Faster Bayesian Optimization AAAI

【速读】:该论文旨在解决贝叶斯优化(Bayesian Optimization, BO)中获取函数(acquisition function)优化时的计算瓶颈问题,特别是多起点优化(multi-start optimization, MSO)在使用拟牛顿(quasi-Newton, QN)方法时因非凸性导致的收敛效率低下问题。现有方法如BoTorch通过PyTorch批处理对多个点同时优化获取函数,虽提升了并行效率,但实证表明其在QN方法中因对角线近似误差(off-diagonal approximation errors)导致次优解,进而拖慢收敛速度。论文提出的关键解决方案是:在保持获取函数调用批处理的同时,利用协程(coroutine)解耦拟牛顿更新过程,从而在理论上等价于串行MSO的收敛行为,同时显著降低实际运行时间(wall-clock time),实现了效率与精度的双重提升。

链接: https://arxiv.org/abs/2511.13625
作者: Kaichi Irie,Shuhei Watanabe,Masaki Onishi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to 5th Annual AAAI Workshop on AI to Accelerate Science and Engineering (AI2ASE)

点击查看摘要

Abstract:Bayesian optimization (BO) efficiently finds high-performing parameters by maximizing an acquisition function, which models the promise of parameters. A major computational bottleneck arises in acquisition function optimization, where multi-start optimization (MSO) with quasi-Newton (QN) methods is required due to the non-convexity of the acquisition function. BoTorch, a widely used BO library, currently optimizes the summed acquisition function over multiple points, leading to the speedup of MSO owing to PyTorch batching. Nevertheless, this paper empirically demonstrates the suboptimality of this approach in terms of off-diagonal approximation errors in the inverse Hessian of a QN method, slowing down its convergence. To address this problem, we propose to decouple QN updates using a coroutine while batching the acquisition function calls. Our approach not only yields the theoretically identical convergence to the sequential MSO but also drastically reduces the wall-clock time compared to the previous approaches.
zh

[AI-10] Robust Client-Server Watermarking for Split Federated Learning

【速读】:该论文旨在解决Split Federated Learning (SFL) 中模型知识产权(IP)归属不明确的问题。在SFL框架下,客户端与服务器共同参与模型训练,但各自仅掌握部分模型信息,导致传统水印技术无法有效保护双方的知识产权。解决方案的关键在于提出RISE(Robust model Intellectual property protection scheme using client-Server watermark Embedding),其核心创新是一种异构的客户端-服务器协同水印嵌入机制:服务器通过损失函数正则化项嵌入基于特征的水印,而客户端则通过向私有数据集中注入预定义触发样本的方式嵌入基于后门的水印。该协同嵌入策略使双方均可独立验证模型所有权,且实验表明该方案在多种网络架构和标准数据集上均实现超过95%的水印检测率(p-value < 0.03),并具备对常见移除攻击的鲁棒性,同时避免了客户端与服务器水印之间的相互干扰。

链接: https://arxiv.org/abs/2511.13598
作者: Jiaxiong Tang,Zhengchunmin Dai,Liantao Wu,Peng Sun,Honglong Chen,Zhenfu Cao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Split Federated Learning (SFL) is renowned for its privacy-preserving nature and low computational overhead among decentralized machine learning paradigms. In this framework, clients employ lightweight models to process private data locally and transmit intermediate outputs to a powerful server for further computation. However, SFL is a double-edged sword: while it enables edge computing and enhances privacy, it also introduces intellectual property ambiguity as both clients and the server jointly contribute to training. Existing watermarking techniques fail to protect both sides since no single participant possesses the complete model. To address this, we propose RISE, a Robust model Intellectual property protection scheme using client-Server watermark Embedding for SFL. Specifically, RISE adopts an asymmetric client-server watermarking design: the server embeds feature-based watermarks through a loss regularization term, while clients embed backdoor-based watermarks by injecting predefined trigger samples into private datasets. This co-embedding strategy enables both clients and the server to verify model ownership. Experimental results on standard datasets and multiple network architectures show that RISE achieves over 95% watermark detection rate ( p-value \lt 0.03 ) across most settings. It exhibits no mutual interference between client- and server-side watermarks and remains robust against common removal attacks.
zh

[AI-11] Physics-Informed Neural Networks for Nonlinear Output Regulation

【速读】:该论文旨在解决非线性系统的全信息输出调节问题(full-information output regulation problem),即在已知被控对象(plant)和外部扰动源(exosystem)状态的前提下,设计控制器实现对参考信号的精确跟踪或扰动的完全抑制。其解决方案的关键在于通过构建零调节误差流形 π(w)\pi(w) 和前馈输入 c(w)c(w) 来使该流形不变,并将这一问题转化为求解带有代数约束的偏微分方程组(regulator equations)。作者提出一种基于物理信息神经网络(physics-informed neural network, PINN)的方法,直接以最小化残差的方式逼近 π(w)\pi(w)c(w)c(w),无需预先计算轨迹或标注数据,且能实现对不同初始条件和参数下外系统变化的良好泛化能力,从而在直升机垂直动力学同步控制任务中验证了其高保真重建零误差流形及鲁棒调节性能的能力。

链接: https://arxiv.org/abs/2511.13595
作者: Sebastiano Mengozzi,Giovanni B. Esposito,Michelangelo Bin,Andrea Acquaviva,Andrea Bartolini,Lorenzo Marconi
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This work addresses the full-information output regulation problem for nonlinear systems, assuming the states of both the plant and the exosystem are known. In this setting, perfect tracking or rejection is achieved by constructing a zero-regulation-error manifold \pi(w) and a feedforward input c(w) that render such manifold invariant. The pair (\pi(w), c(w)) is characterized by the regulator equations, i.e., a system of PDEs with an algebraic constraint. We focus on accurately solving the regulator equations introducing a physics-informed neural network (PINN) approach that directly approximates \pi(w) and c(w) by minimizing the residuals under boundary and feasibility conditions, without requiring precomputed trajectories or labeled data. The learned operator maps exosystem states to steady state plant states and inputs, enables real-time inference and, critically, generalizes across families of the exosystem with varying initial conditions and parameters. The framework is validated on a regulation task that synchronizes a helicopter’s vertical dynamics with a harmonically oscillating platform. The resulting PINN-based solver reconstructs the zero-error manifold with high fidelity and sustains regulation performance under exosystem variations, highlighting the potential of learning-enabled solvers for nonlinear output regulation. The proposed approach is broadly applicable to nonlinear systems that admit a solution to the output regulation problem.
zh

[AI-12] Data-driven Acceleration of MPC with Guarantees

【速读】:该论文旨在解决模型预测控制(Model Predictive Control, MPC)在低延迟应用场景中计算速度不足的问题。其核心解决方案是提出一种数据驱动的加速框架,通过用离线MPC求解结果构建一个非参数策略(nonparametric policy),替代在线优化过程。该策略基于构造的最优代价函数上界进行贪心选择,并可实现为一种非参数查找规则,显著提升计算效率(比标准MPC快100–1000倍)。理论分析表明,在离线数据满足充分覆盖条件下,该策略具有递归可行性且能保证有界的次优性差距,从而明确量化了数据量与边界紧致性之间的权衡关系。

链接: https://arxiv.org/abs/2511.13588
作者: Agustin Castellano,Shijie Pan,Enrique Mallada
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS)
备注:

点击查看摘要

Abstract:Model Predictive Control (MPC) is a powerful framework for optimal control but can be too slow for low-latency applications. We present a data-driven framework to accelerate MPC by replacing online optimization with a nonparametric policy constructed from offline MPC solutions. Our policy is greedy with respect to a constructed upper bound on the optimal cost-to-go, and can be implemented as a nonparametric lookup rule that is orders of magnitude faster than solving MPC online. Our analysis shows that under sufficient coverage condition of the offline data, the policy is recursively feasible and admits provable, bounded optimality gap. These conditions establish an explicit trade-off between the amount of data collected and the tightness of the bounds. Our experiments show that this policy is between 100 and 1000 times faster than standard MPC, with only a modest hit to optimality, showing potential for real-time control tasks.
zh

[AI-13] Artificial Intelligence-driven Intelligent Wearable Systems: A full-stack Integration from Material Design to Personalized Interaction

【速读】:该论文旨在解决传统智能可穿戴系统在精准医疗中面临的局限性,这些问题主要源于对经验性材料设计和基础信号处理技术的依赖,导致其难以适应个体间及个体内的生理变异,且健康管理模式仍停留在被动监测阶段。解决方案的关键在于提出“人机共生健康智能”(Human-Symbiotic Health Intelligence, HSHI)框架,该框架通过融合多模态传感网络、边缘-云协同计算以及数据与知识混合建模机制,实现对个体差异的动态适应;同时引入AI驱动的材料与微结构优化、多模态信号鲁棒解析,并采用群体层面洞察与个性化调整相结合的双重机制,辅以强化学习和数字孪生支持的闭环优化,推动健康干预从静态监测向主动协同演化转变。

链接: https://arxiv.org/abs/2511.13565
作者: Jingyi Zhao,Daqian Shi,Zhengda Wang,Xiongfeng Tang,Yanguo Qin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 5 pages, l figure, l table. Accepted at AI4RWC@WI-IAT 2025

点击查看摘要

Abstract:Intelligent wearable systems are at the forefront of precision medicine and play a crucial role in enhancing human-machine interaction. Traditional devices often encounter limitations due to their dependence on empirical material design and basic signal processing techniques. To overcome these issues, we introduce the concept of Human-Symbiotic Health Intelligence (HSHI), which is a framework that integrates multi-modal sensor networks with edge-cloud collaborative computing and a hybrid approach to data and knowledge modeling. HSHI is designed to adapt dynamically to both inter-individual and intra-individual variability, transitioning health management from passive monitoring to an active collaborative evolution. The framework incorporates AI-driven optimization of materials and micro-structures, provides robust interpretation of multi-modal signals, and utilizes a dual mechanism that merges population-level insights with personalized adaptations. Moreover, the integration of closed-loop optimization through reinforcement learning and digital twins facilitates customized interventions and feedback. In general, HSHI represents a significant shift in healthcare, moving towards a model that emphasizes prevention, adaptability, and a harmonious relationship between technology and health management.
zh

[AI-14] Making Evidence Actionable in Adaptive Learning Closing the Diagnostic Pedagogical Loop

【速读】:该论文旨在解决自适应学习系统中“诊断精准但干预薄弱”的问题,即系统虽能准确识别学生知识漏洞(concept-level assessment),却难以提供及时、适切且多样化的微干预(microinterventions),导致教学支持错位或延迟。其解决方案的核心在于构建一个由教师主导的反馈闭环机制,将概念级评估证据转化为经过验证的微干预策略,并通过一个带约束的二元整数规划(binary integer program)模型实现干预分配优化。关键创新点包括:引入三个保障机制——“充分性”(adequacy,确保知识缺口被覆盖)、“注意力预算”(attention,控制时间和冗余)和“多样性”(diversity,防止对单一资源过拟合);设计三种求解策略以适配不同资源与延迟场景(贪婪选择适用于低资源高时效环境,梯度松弛法适用于丰富资源库,混合方法沿资源-延迟前沿切换);并通过松弛变量定位缺失内容,指导靶向内容补充,从而在保证全技能覆盖的同时提升难度匹配一致性与公平性,实现可计算、可审计、负载感知的规模化个性化教学。

链接: https://arxiv.org/abs/2511.13542
作者: Amirreza Mehrabi,Jason Wade Morphew,Breejha Quezada,N. Sanjay Rebello
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Applications (stat.AP)
备注:

点击查看摘要

Abstract:Adaptive learning often diagnoses precisely yet intervenes weakly, producing help that is mistimed or misaligned. This study presents evidence supporting an instructor-governed feedback loop that converts concept-level assessment evidence into vetted microinterventions. The adaptive learning algorithm includes three safeguards: adequacy as a hard guarantee of gap closure, attention as a budgeted limit for time and redundancy, and diversity as protection against overfitting to a single resource. We formulate intervention assignment as a binary integer program with constraints for coverage, time, difficulty windows derived from ability estimates, prerequisites encoded by a concept matrix, and anti-redundancy with diversity. Greedy selection serves low-richness and tight-latency settings, gradient-based relaxation serves rich repositories, and a hybrid switches along a richness-latency frontier. In simulation and in an introductory physics deployment with 1204 students, both solvers achieved full skill coverage for nearly all learners within bounded watch time. The gradient-based method reduced redundant coverage by about 12 percentage points relative to greedy and produced more consistent difficulty alignment, while greedy delivered comparable adequacy at lower computational cost in resource-scarce environments. Slack variables localized missing content and guided targeted curation, sustaining sufficiency across student subgroups. The result is a tractable and auditable controller that closes the diagnostic pedagogical loop and enables equitable, load-aware personalization at the classroom scale.
zh

[AI-15] owards Affect-Adaptive Human-Robot Interaction: A Protocol for Multimodal Dataset Collection on Social Anxiety

【速读】:该论文旨在解决当前社会焦虑(social anxiety)在人机交互(human-robot interaction, HRI)情境下缺乏高质量多模态数据集的问题,从而限制了对社会焦虑状态的准确检测与建模。其解决方案的关键在于设计并实施一套系统性的多模态数据采集协议,通过同步获取至少70名参与者在受控实验条件下与Furhat社交机器人进行约10分钟“巫师奥兹”角色扮演互动时的音频、视频和生理信号,并结合上下文信息以揭示个体差异,从而为构建鲁棒的多模态社会焦虑检测模型提供基础支持。

链接: https://arxiv.org/abs/2511.13530
作者: Vesna Poprcova,Iulia Lefter,Matthias Wieser,Martijn Warnier,Frances Brazier
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted at the Workshop on Benefits of pErsonalization and behAvioral adaptation in assistive Robots (BEAR 2025), held at the IEEE RO-MAN Conference 2025

点击查看摘要

Abstract:Social anxiety is a prevalent condition that affects interpersonal interactions and social functioning. Recent advances in artificial intelligence and social robotics offer new opportunities to examine social anxiety in the human-robot interaction context. Accurate detection of affective states and behaviours associated with social anxiety requires multimodal datasets, where each signal modality provides complementary insights into its manifestations. However, such datasets remain scarce, limiting progress in both research and applications. To address this, this paper presents a protocol for multimodal dataset collection designed to reflect social anxiety in a human-robot interaction context. The dataset will consist of synchronised audio, video, and physiological recordings acquired from at least 70 participants, grouped according to their level of social anxiety, as they engage in approximately 10-minute interactive Wizard-of-Oz role-play scenarios with the Furhat social robot under controlled experimental conditions. In addition to multimodal data, the dataset will be enriched with contextual data providing deeper insight into individual variability in social anxiety responses. This work can contribute to research on affect-adaptive human-robot interaction by providing support for robust multimodal detection of social anxiety.
zh

[AI-16] Automated Construction of Medical Indicator Knowledge Graphs Using Retrieval Augmented Large Language Models

【速读】:该论文旨在解决当前临床知识图谱构建中依赖人工标注和规则提取所导致的效率低、可扩展性差以及对医学指南和文献复杂语境理解不足的问题。其核心解决方案在于提出一种结合检索增强生成(Retrieval-Augmented Generation, RAG)与大语言模型(Large Language Models, LLMs)的自动化框架,通过指南驱动的数据获取、基于本体的模式设计及专家参与的验证机制,实现结构化、语义一致且具备临床可靠性的医学指标知识图谱的高效构建。

链接: https://arxiv.org/abs/2511.13526
作者: Zhengda Wang,Daqian Shi,Jingyi Zhao,Xiaolei Diao,Xiongfeng Tang,Yanguo Qin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 5 pages, 1 figure, 1 table. Accepted at AI4RWC@WI-IAT 2025

点击查看摘要

Abstract:Artificial intelligence (AI) is reshaping modern healthcare by advancing disease diagnosis, treatment decision-making, and biomedical research. Among AI technologies, large language models (LLMs) have become especially impactful, enabling deep knowledge extraction and semantic reasoning from complex medical texts. However, effective clinical decision support requires knowledge in structured, interoperable formats. Knowledge graphs serve this role by integrating heterogeneous medical information into semantically consistent networks. Yet, current clinical knowledge graphs still depend heavily on manual curation and rule-based extraction, which is limited by the complexity and contextual ambiguity of medical guidelines and literature. To overcome these challenges, we propose an automated framework that combines retrieval-augmented generation (RAG) with LLMs to construct medical indicator knowledge graphs. The framework incorporates guideline-driven data acquisition, ontology-based schema design, and expert-in-the-loop validation to ensure scalability, accuracy, and clinical reliability. The resulting knowledge graphs can be integrated into intelligent diagnosis and question-answering systems, accelerating the development of AI-driven healthcare solutions.
zh

[AI-17] AI Fairness Beyond Complete Demographics: Current Achievements and Future Directions ECAI2025

【速读】:该论文旨在解决在缺乏完整人口统计学信息(demographic information)的情况下,如何实现人工智能(AI)系统中的公平性问题。传统公平性方法通常假设可获得完整的群体属性数据(如性别、种族等),但这一假设在现实中受限于法律合规性和潜在的歧视风险。论文的关键解决方案在于提出一种新的公平性概念分类体系(taxonomy of fairness notions),明确界定不完整数据场景下各类公平性定义之间的关系与差异,并系统总结现有针对此类限制条件的有效公平性增强技术,从而填补理论与实际应用之间的鸿沟。

链接: https://arxiv.org/abs/2511.13525
作者: Zichong Wang,Zhipeng Yin,Roland H. C. Yap,Wenbin Zhang
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ECAI 2025

点击查看摘要

Abstract:Fairness in artificial intelligence (AI) has become a growing concern due to discriminatory outcomes in AI-based decision-making systems. While various methods have been proposed to mitigate bias, most rely on complete demographic information, an assumption often impractical due to legal constraints and the risk of reinforcing discrimination. This survey examines fairness in AI when demographics are incomplete, addressing the gap between traditional approaches and real-world challenges. We introduce a novel taxonomy of fairness notions in this setting, clarifying their relationships and distinctions. Additionally, we summarize existing techniques that promote fairness beyond complete demographics and highlight open research questions to encourage further progress in the field.
zh

[AI-18] FreeAskWorld: An Interactive and Closed-Loop Simulator for Human-Centric Embodied AI

【速读】:该论文旨在解决当前具身智能(Embodied Intelligence)研究中仿真平台难以捕捉复杂、以人类为中心的社会行为的问题,即现有模拟环境多局限于低层次物理交互,缺乏对高阶社会认知和意图理解的支持。解决方案的关键在于提出FreeAskWorld框架,该框架通过整合大语言模型(Large Language Models, LLMs)实现高层行为规划与语义驱动的交互,并基于意图理论(Theory of Intention)和社会认知理论(Social Cognition)构建可扩展、逼真的具身人-代理交互仿真系统。其核心创新包括:一是将经典视觉-语言导航(Vision-and-Language Navigation, VLN)任务扩展为包含主动询问能力的交互增强型方向咨询任务;二是构建大规模基准数据集(FreeAskWorld),涵盖多样化任务类型、对象类别及真实交互数据,支持训练与评估具身AI系统在自然社交情境下的语义理解与交互能力。实验表明,基于该框架微调的模型显著优于原始版本,在高层次规划与自然交互方面表现更优,验证了社会性建模作为额外信息模态对提升具身智能系统性能的有效性。

链接: https://arxiv.org/abs/2511.13524
作者: Yuhang Peng,Yizhou Pan,Xinning He,Jihaoyu Yang,Xinyu Yin,Han Wang,Xiaoji Zheng,Chao Gao,Jiangtao Gong
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 9 pages, 4 figures

点击查看摘要

Abstract:As embodied intelligence emerges as a core frontier in artificial intelligence research, simulation platforms must evolve beyond low-level physical interactions to capture complex, human-centered social behaviors. We introduce FreeAskWorld, an interactive simulation framework that integrates large language models (LLMs) for high-level behavior planning and semantically grounded interaction, informed by theories of intention and social cognition. Our framework supports scalable, realistic human-agent simulations and includes a modular data generation pipeline tailored for diverse embodied this http URL validate the framework, we extend the classic Vision-and-Language Navigation (VLN) task into a interaction enriched Direction Inquiry setting, wherein agents can actively seek and interpret navigational guidance. We present and publicly release FreeAskWorld, a large-scale benchmark dataset comprising reconstructed environments, six diverse task types, 16 core object categories, 63,429 annotated sample frames, and more than 17 hours of interaction data to support training and evaluation of embodied AI systems. We benchmark VLN models, and human participants under both open-loop and closed-loop settings. Experimental results demonstrate that models fine-tuned on FreeAskWorld outperform their original counterparts, achieving enhanced semantic understanding and interaction competency. These findings underscore the efficacy of socially grounded simulation frameworks in advancing embodied AI systems toward sophisticated high-level planning and more naturalistic human-agent interaction. Importantly, our work underscores that interaction itself serves as an additional information modality.
zh

[AI-19] Naga: Vedic Encoding for Deep State Space Models

【速读】:该论文旨在解决长程时间序列预测(Long-term Time Series Forecasting, LTSF)中模型对远距离时间依赖关系建模能力不足以及计算效率低的问题。其解决方案的关键在于提出一种受吠陀数学(Vedic mathematics)结构启发的深度状态空间模型(State Space Model, SSM)编码方法——Naga,该方法通过联合处理正向与反向时间序列输入,生成双向表示,并采用逐元素(Hadamard)交互方式融合二者,从而增强模型捕捉跨长距离时间步依赖的能力。此设计不仅提升了预测性能,在多个LTSF基准测试中超越28种现有先进模型,还相较于传统深度SSM方法展现出更高的计算效率。

链接: https://arxiv.org/abs/2511.13510
作者: Melanie Schaller,Nick Janssen,Bodo Rosenhahn
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: submitted to JMLR

点击查看摘要

Abstract:This paper presents Naga, a deep State Space Model (SSM) encoding approach inspired by structural concepts from Vedic mathematics. The proposed method introduces a bidirectional representation for time series by jointly processing forward and time-reversed input sequences. These representations are then combined through an element-wise (Hadamard) interaction, resulting in a Vedic-inspired encoding that enhances the model’s ability to capture temporal dependencies across distant time steps. We evaluate Naga on multiple long-term time series forecasting (LTSF) benchmarks, including ETTh1, ETTh2, ETTm1, ETTm2, Weather, Traffic, and ILI. The experimental results show that Naga outperforms 28 current state of the art models and demonstrates improved efficiency compared to existing deep SSM-based approaches. The findings suggest that incorporating structured, Vedic-inspired decomposition can provide an interpretable and computationally efficient alternative for long-range sequence modeling.
zh

[AI-20] A Lexical Analysis of online Reviews on Human-AI Interactions

【速读】:该论文试图解决当前关于人机交互(Human-AI Interaction)研究中对用户具体关切与挑战理解不足的问题,尤其聚焦于从用户生成的在线评论中挖掘影响交互体验的关键因素。其解决方案的关键在于采用词汇分析(lexical approach)方法,对来自多个平台的55,968条用户评论进行量化与内容分析,通过因子分析识别出主导交互质量的核心维度,并进一步通过内容分析深化对这些因素的理解,从而为开发更以用户为中心的AI系统提供实证依据和设计指导。

链接: https://arxiv.org/abs/2511.13480
作者: Parisa Arbab,Xiaowen Fang
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 10 pages, 1 table

点击查看摘要

Abstract:This study focuses on understanding the complex dynamics between humans and AI systems by analyzing user reviews. While previous research has explored various aspects of human-AI interaction, such as user perceptions and ethical considerations, there remains a gap in understanding the specific concerns and challenges users face. By using a lexical approach to analyze 55,968 online reviews from this http URL, this http URL, and this http URL, this preliminary research aims to analyze human-AI interaction. Initial results from factor analysis reveal key factors influencing these interactions. The study aims to provide deeper insights into these factors through content analysis, contributing to the development of more user-centric AI systems. The findings are expected to enhance our understanding of human-AI interaction and inform future AI technology and user experience improvements.
zh

[AI-21] Multi-Agent Multimodal Large Language Model Framework for Automated Interpretation of Fuel Efficiency Analytics in Public Transportation

【速读】:该论文旨在解决公共交通领域中燃料效率优化问题,其核心挑战在于如何将复杂的多模态数据转化为可解释、与决策相关的洞察,而传统分析与可视化方法常产生碎片化输出,依赖人工解读,难以实现规模化和一致性。解决方案的关键在于提出一个基于多智能体(multi-agent)的框架,利用多模态大语言模型(multimodal large language models, LLMs)自动化生成数据叙述与能源洞察;该框架通过三个专用智能体——数据叙述代理、LLM-as-a-judge代理以及可选的人工在环评估者——协同工作,迭代地将分析结果转化为面向利益相关者的连贯报告,从而显著提升事实准确性、逻辑一致性和可扩展性。

链接: https://arxiv.org/abs/2511.13476
作者: Zhipeng Ma,Ali Rida Bahja,Andreas Burgdorf,André Pomp,Tobias Meisen,Bo Nørregaard Jørgensen,Zheng Grace Ma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Enhancing fuel efficiency in public transportation requires the integration of complex multimodal data into interpretable, decision-relevant insights. However, traditional analytics and visualization methods often yield fragmented outputs that demand extensive human interpretation, limiting scalability and consistency. This study presents a multi-agent framework that leverages multimodal large language models (LLMs) to automate data narration and energy insight generation. The framework coordinates three specialized agents, including a data narration agent, an LLM-as-a-judge agent, and an optional human-in-the-loop evaluator, to iteratively transform analytical artifacts into coherent, stakeholder-oriented reports. The system is validated through a real-world case study on public bus transportation in Northern Jutland, Denmark, where fuel efficiency data from 4006 trips are analyzed using Gaussian Mixture Model clustering. Comparative experiments across five state-of-the-art LLMs and three prompting paradigms identify GPT-4.1 mini with Chain-of-Thought prompting as the optimal configuration, achieving 97.3% narrative accuracy while balancing interpretability and computational cost. The findings demonstrate that multi-agent orchestration significantly enhances factual precision, coherence, and scalability in LLM-based reporting. The proposed framework establishes a replicable and domain-adaptive methodology for AI-driven narrative generation and decision support in energy informatics.
zh

[AI-22] he Quick Red Fox gets the best Data Driven Classroom Interviews: A manual for an interview app and its associated methodology

【速读】:该论文旨在解决传统课堂访谈中难以高效捕捉学生在数字学习环境(如智能辅导系统或教育游戏)中的具体行为表现,同时避免过度干扰学习过程的问题。其解决方案的关键在于引入数据驱动的课堂访谈(Data Driven Classroom Interviews, DDCIs),通过一个名为Quick Red Fox(QRF)的开源服务器-客户端Android应用程序实现。QRF能够基于预定义的行为触发机制(如行为感知、情绪感知或自我调节学习检测),实时识别并引导研究者聚焦于学生最具研究价值的学习事件,从而优化研究者的时间分配并提升访谈的针对性与效率。

链接: https://arxiv.org/abs/2511.13466
作者: Jaclyn Ocumpaugh,Luc Paquette,Ryan S. Baker,Amanda Barany,Jeff Ginger,Nathan Casano,Andres F. Zambrano,Xiner Liu,Zhanlan Wei,Yiqui Zhou,Qianhui Liu,Stephen Hutt,Alexandra M.A. Andres,Nidhi Nasiar,Camille Giordano,Martin van Velsen,Micheal Mogessi
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:Data Driven Classroom Interviews (DDCIs) are an interviewing technique that is facilitated by recent technological developments in the learning analytics community. DDCIs are short, targeted interviews that allow researchers to contextualize students’ interactions with a digital learning environment (e.g., intelligent tutoring systems or educational games) while minimizing the amount of time that the researcher interrupts that learning experience, and focusing researcher time on the events they most want to focus on DDCIs are facilitated by a research tool called the Quick Red Fox (QRF)–an open-source server-client Android app that optimizes researcher time by directing interviewers to users that have just displayed an interesting behavior (previously defined by the research team). QRF integrates with existing student modeling technologies (e.g., behavior-sensing, affect-sensing, detection of self-regulated learning) to alert researchers to key moments in a learner’s experience. This manual documents the tech while providing training on the processes involved in developing triggers and interview techniques; it also suggests methods of analyses.
zh

[AI-23] Multi-task GINN-LP for Multi-target Symbolic Regression

【速读】:该论文旨在解决符号回归(Symbolic Regression, SR)在实际应用中的两个关键局限性:一是现有方法多在科学数据集上评估,这些数据具有明确的物理规律,导致模型泛化能力受限;二是SR主要针对单输出回归任务,难以处理现实世界中常见的多目标输出问题,其中各目标变量存在复杂的相互依赖关系。解决方案的关键在于提出多任务回归GINN-LP(Multi-Task Regression GINN-LP, MTRGINN-LP),通过将GINN-LP与多任务深度学习架构结合,构建一个共享主干网络(包含多个幂次项近似模块)和任务特定输出层的结构,在捕捉多目标间依赖关系的同时保持模型的可解释性,从而有效拓展符号回归在多输出场景下的适用范围。

链接: https://arxiv.org/abs/2511.13463
作者: Hussein Rajabu,Lijun Qian,Xishuang Dong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the area of explainable artificial intelligence, Symbolic Regression (SR) has emerged as a promising approach by discovering interpretable mathematical expressions that fit data. However, SR faces two main challenges: most methods are evaluated on scientific datasets with well-understood relationships, limiting generalization, and SR primarily targets single-output regression, whereas many real-world problems involve multi-target outputs with interdependent variables. To address these issues, we propose multi-task regression GINN-LP (MTRGINN-LP), an interpretable neural network for multi-target symbolic regression. By integrating GINN-LP with a multi-task deep learning, the model combines a shared backbone including multiple power-term approximator blocks with task-specific output layers, capturing inter-target dependencies while preserving interpretability. We validate multi-task GINN-LP on practical multi-target applications, including energy efficiency prediction and sustainable agriculture. Experimental results demonstrate competitive predictive performance alongside high interpretability, effectively extending symbolic regression to broader real-world multi-output tasks.
zh

[AI-24] Artificial Intelligence-Enabled Spirometry for Early Detection of Right Heart Failure

【速读】:该论文旨在解决右心衰竭(Right Heart Failure, RHF)的早期识别问题,尤其是在慢性肺病(cor pulmonale)患者中筛选出可能发展为RHF的高风险个体。其关键解决方案是提出一种基于自监督表示学习的两阶段模型:第一阶段利用变分自编码器(Variational Autoencoder, VAE)的编码器从数据增强的无标签肺量计时间序列(spirogram time series)中学习鲁棒的低维嵌入表示;第二阶段将该嵌入与人口统计学信息融合,并输入CatBoost分类器进行下游RHF预测任务。该方法在UK Biobank的26,617名受试者数据集上实现了0.7501的AUROC,且在慢性肾病(CKD)和瓣膜性心脏病(VHD)等高风险亚组中分别达到0.8194和0.8413,证明了其在临床实践中早期预警RHF的潜力。

链接: https://arxiv.org/abs/2511.13457
作者: Bin Liu,Qinghao Zhao,Yuxi Zhou,Zhejun Sun,Kaijie Lei,Deyun Zhang,Shijia Geng,Shenda Hong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages, 5 figures

点击查看摘要

Abstract:Right heart failure (RHF) is a disease characterized by abnormalities in the structure or function of the right ventricle (RV), which is associated with high morbidity and mortality. Lung disease often causes increased right ventricular load, leading to RHF. Therefore, it is very important to screen out patients with cor pulmonale who develop RHF from people with underlying lung diseases. In this work, we propose a self-supervised representation learning method to early detecting RHF from patients with cor pulmonale, which uses spirogram time series to predict patients with RHF at an early stage. The proposed model is divided into two stages. The first stage is the self-supervised representation learning-based spirogram embedding (SLSE) network training process, where the encoder of the Variational autoencoder (VAE-encoder) learns a robust low-dimensional representation of the spirogram time series from the data-augmented unlabeled data. Second, this low-dimensional representation is fused with demographic information and fed into a CatBoost classifier for the downstream RHF prediction task. Trained and tested on a carefully selected subset of 26,617 individuals from the UK Biobank, our model achieved an AUROC of 0.7501 in detecting RHF, demonstrating strong population-level distinction ability. We further evaluated the model on high-risk clinical subgroups, achieving AUROC values of 0.8194 on a test set of 74 patients with chronic kidney disease (CKD) and 0.8413 on a set of 64 patients with valvular heart disease (VHD). These results highlight the model’s potential utility in predicting RHF among clinically elevated-risk populations. In conclusion, this study presents a self-supervised representation learning approach combining spirogram time series and demographic data, demonstrating promising potential for early RHF detection in clinical practice.
zh

[AI-25] Discovering Operational Patterns Using Image-Based Convolutional Clustering and Composite Evaluation: A Case Study in Foundry Melting Processes

【速读】:该论文旨在解决工业过程监测中未标注时间序列数据的无监督模式发现难题,尤其针对传感器生成的时间序列数据因缺乏标签、高变异性及操作噪声而难以用传统方法提取有效模式的问题。解决方案的关键在于提出一种基于图像的卷积聚类框架:首先通过重叠滑动窗口将原始一维时间序列转换为灰度矩阵表示,从而利用深度卷积自编码器进行高效特征提取;其次融合软聚类与硬聚类输出,并采用两阶段策略优化聚类选择;最后引入一个新的综合评估指标 $ S_{\text{eva}} $,结合归一化的轮廓系数(Silhouette)、Calinski-Harabasz 指数和 Davies-Bouldin 指数,实现客观性能评价。该方法在北欧铸造厂的3900余次熔炉作业数据上验证了其优越性,成功识别出7种可解释的操作模式,显著提升了聚类性能、鲁棒性和领域对齐的可解释性。

链接: https://arxiv.org/abs/2511.13444
作者: Zhipeng Ma,Bo Nørregaard Jørgensen,Zheng Grace Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Industrial process monitoring increasingly relies on sensor-generated time-series data, yet the lack of labels, high variability, and operational noise make it difficult to extract meaningful patterns using conventional methods. Existing clustering techniques either rely on fixed distance metrics or deep models designed for static data, limiting their ability to handle dynamic, unstructured industrial sequences. Addressing this gap, this paper proposes a novel framework for unsupervised discovery of operational modes in univariate time-series data using image-based convolutional clustering with composite internal evaluation. The proposed framework improves upon existing approaches in three ways: (1) raw time-series sequences are transformed into grayscale matrix representations via overlapping sliding windows, allowing effective feature extraction using a deep convolutional autoencoder; (2) the framework integrates both soft and hard clustering outputs and refines the selection through a two-stage strategy; and (3) clustering performance is objectively evaluated by a newly developed composite score, S_eva, which combines normalized Silhouette, Calinski-Harabasz, and Davies-Bouldin indices. Applied to over 3900 furnace melting operations from a Nordic foundry, the method identifies seven explainable operational patterns, revealing significant differences in energy consumption, thermal dynamics, and production duration. Compared to classical and deep clustering baselines, the proposed approach achieves superior overall performance, greater robustness, and domain-aligned explainability. The framework addresses key challenges in unsupervised time-series analysis, such as sequence irregularity, overlapping modes, and metric inconsistency, and provides a generalizable solution for data-driven diagnostics and energy optimization in industrial systems.
zh

[AI-26] PAST: A Primary-Auxiliary Spatio-Temporal Network for Traffic Time Series Imputation

【速读】:该论文旨在解决交通时间序列数据在多种缺失模式(随机缺失、纤维状缺失和块状缺失)下进行高精度插补的问题,尤其针对现有模型难以适应随机缺失位置且无法有效建模长程与大规模依赖关系的局限性。其解决方案的关键在于将时空模式分为两类:源于数据点间内部关系的主模式(primary patterns)和受时间戳、节点属性等外部因素影响的辅助模式(auxiliary patterns),并提出主-辅时空网络(PAST),通过图融合模块(GIM)捕捉主模式,利用交叉门控模块(CGM)提取辅助模式,二者通过共享隐藏向量交互,并在集成自监督框架下联合训练,从而显著提升复杂缺失场景下的插补性能。

链接: https://arxiv.org/abs/2511.13414
作者: Hanwen Hu,Zimo Wen,Shiyou Qian,Jian Co
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Traffic time series imputation is crucial for the safety and reliability of intelligent transportation systems, while diverse types of missing data, including random, fiber, and block missing make the imputation task challenging. Existing models often focus on disentangling and separately modeling spatial and temporal patterns based on relationships between data points. However, these approaches struggle to adapt to the random missing positions, and fail to learn long-term and large-scale dependencies, which are essential in extensive missing conditions. In this paper, patterns are categorized into two types to handle various missing data conditions: primary patterns, which originate from internal relationships between data points, and auxiliary patterns, influenced by external factors like timestamps and node attributes. Accordingly, we propose the Primary-Auxiliary Spatio-Temporal network (PAST). It comprises a graph-integrated module (GIM) and a cross-gated module (CGM). GIM captures primary patterns via dynamic graphs with interval-aware dropout and multi-order convolutions, and CGM extracts auxiliary patterns through bidirectional gating on embedded external features. The two modules interact via shared hidden vectors and are trained under an ensemble self-supervised framework. Experiments on three datasets under 27 missing data conditions demonstrate that the imputation accuracy of PAST outperforms seven state-of-the-art baselines by up to 26.2% in RMSE and 31.6% in MAE.
zh

[AI-27] An Operational Kardashev-Style Scale for Autonomous AI - Towards AGI and Superintelligence

【速读】:该论文旨在解决当前人工智能发展水平缺乏可量化、多维且可验证评估体系的问题,特别是如何系统性地衡量从固定机器人流程自动化(AAI-0)到通用人工智能(AAI-4)乃至更高阶段的演进路径。其解决方案的关键在于提出一个基于多轴能力指标的可操作性自主AI(Autonomous AI, AAI)量表,涵盖十个核心维度(如自主性、泛化能力、规划能力、记忆持久性、工具经济性等),并通过加权几何平均构建综合AAI指数;同时引入可测量的自我改进系数κ(capability growth per unit of agent-initiated resources)和闭包性质(维护与扩展),将“自改进AI”转化为可证伪的标准,并辅以OWA-Bench基准测试套件实现长期、工具使用和持续代理行为的评估,从而为AI能力跃迁提供结构化、实证化的分析框架。

链接: https://arxiv.org/abs/2511.13411
作者: Przemyslaw Chojecki
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose a Kardashev-inspired yet operational Autonomous AI (AAI) Scale that measures the progression from fixed robotic process automation (AAI-0) to full artificial general intelligence (AAI-4) and beyond. Unlike narrative ladders, our scale is multi-axis and testable. We define ten capability axes (Autonomy, Generality, Planning, Memory/Persistence, Tool Economy, Self-Revision, Sociality/Coordination, Embodiment, World-Model Fidelity, Economic Throughput) aggregated by a composite AAI-Index (a weighted geometric mean). We introduce a measurable Self-Improvement Coefficient \kappa (capability growth per unit of agent-initiated resources) and two closure properties (maintenance and expansion) that convert ``self-improving AI’’ into falsifiable criteria. We specify OWA-Bench, an open-world agency benchmark suite that evaluates long-horizon, tool-using, persistent agents. We define level gates for AAI-0\ldots AAI-4 using thresholds on the axes, \kappa , and closure proofs. Synthetic experiments illustrate how present-day systems map onto the scale and how the delegability frontier (quality vs.\ autonomy) advances with self-improvement. We also prove a theorem that AAI-3 agent becomes AAI-5 over time with sufficient conditions, formalizing “baby AGI” becomes Superintelligence intuition.
zh

[AI-28] Finding Kissing Numbers with Game-theoretic Reinforcement Learning

【速读】:该论文致力于解决经典的“接触数问题”(Kissing Number Problem),即确定在高维空间中与一个中心球体相切且互不重叠的最大球体数量。此问题自牛顿时代以来长期悬而未决,尤其在维度高于8时,由于高维几何的不规则性和组合复杂度呈指数级增长(远超围棋复杂度),传统方法难以扩展。解决方案的关键在于将问题建模为一种双人矩阵补全博弈:其中一名玩家填充表示球心向量间余弦值的矩阵元素,另一名玩家修正次优项,二者协同最大化矩阵规模(对应接触数)。通过训练基于博弈论的强化学习系统PackingStar,实现了对极高维空间的有效探索,显著提升了样本质量并突破了人类已知的所有记录(如25–31维),首次在13维实现理性结构之外的突破,并发现超过6000种新构型,展示了AI在超越人类直觉的高维几何探索中的强大能力。

链接: https://arxiv.org/abs/2511.13391
作者: Chengdong Ma,Théo Tao Zhaowei,Pengyu Li,Minghao Liu,Haojun Chen,Zihao Mao,Yuan Cheng,Yuan Qi,Yaodong Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Since Isaac Newton first studied the Kissing Number Problem in 1694, determining the maximal number of non-overlapping spheres around a central sphere has remained a fundamental challenge. This problem represents the local analogue of Hilbert’s 18th problem on sphere packing, bridging geometry, number theory, and information theory. Although significant progress has been made through lattices and codes, the irregularities of high-dimensional geometry and exponentially growing combinatorial complexity beyond 8 dimensions, which exceeds the complexity of Go game, limit the scalability of existing methods. Here we model this problem as a two-player matrix completion game and train the game-theoretic reinforcement learning system, PackingStar, to efficiently explore high-dimensional spaces. The matrix entries represent pairwise cosines of sphere center vectors; one player fills entries while another corrects suboptimal ones, jointly maximizing the matrix size, corresponding to the kissing number. This cooperative dynamics substantially improves sample quality, making the extremely large spaces tractable. PackingStar reproduces previous configurations and surpasses all human-known records from dimensions 25 to 31, with the configuration in 25 dimensions geometrically corresponding to the Leech lattice and suggesting possible optimality. It achieves the first breakthrough beyond rational structures from 1971 in 13 dimensions and discovers over 6000 new structures in 14 and other dimensions. These results demonstrate AI’s power to explore high-dimensional spaces beyond human intuition and open new pathways for the Kissing Number Problem and broader geometry problems.
zh

[AI-29] Moving Pictures of Thought: Extracting Visual Knowledge in Charles S. Peirces Manuscripts with Vision-Language Models

【速读】:该论文旨在解决跨媒介文献中混合材料(如文本与图示)的识别与解释难题,尤其针对查尔斯·S·皮尔士(Charles S. Peirce)手稿中图文混排页面的分析挑战。其核心问题是:如何在不破坏原始语境的前提下,有效提取并结构化这些复杂页面中的图示内容,以支持视觉研究、跨媒介分析及基于文本的数字工作流。解决方案的关键在于构建一个三步流程:首先对文档页面进行布局分割,其次将各片段与IIIF(国际图像互操作性协议)标注重新关联,最后将含图示的片段输入视觉语言模型(Visual Language Models, VLMs),并结合皮尔士符号学框架设计提示词(prompts),自动提取关键知识并生成简洁描述;最终将这些描述整合进知识图谱,实现对复合来源中图示内容的结构化表示。

链接: https://arxiv.org/abs/2511.13378
作者: Carlo Teo Pedretti,Davide Picca,Dario Rodighiero
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Diagrams are crucial yet underexplored tools in many disciplines, demonstrating the close connection between visual representation and scholarly reasoning. However, their iconic form poses obstacles to visual studies, intermedial analysis, and text-based digital workflows. In particular, Charles S. Peirce consistently advocated the use of diagrams as essential for reasoning and explanation. His manuscripts, often combining textual content with complex visual artifacts, provide a challenging case for studying documents involving heterogeneous materials. In this preliminary study, we investigate whether Visual Language Models (VLMs) can effectively help us identify and interpret such hybrid pages in context. First, we propose a workflow that (i) segments manuscript page layouts, (ii) reconnects each segment to IIIF-compliant annotations, and (iii) submits fragments containing diagrams to a VLM. In addition, by adopting Peirce’s semiotic framework, we designed prompts to extract key knowledge about diagrams and produce concise captions. Finally, we integrated these captions into knowledge graphs, enabling structured representations of diagrammatic content within composite sources.
zh

[AI-30] A Novel Hierarchical Integration Method for Efficient Model Merging in Medical LLM s

【速读】:该论文旨在解决分布式医疗场景下大型语言模型(Large Language Models, LLMs)的知识融合难题,具体包括跨机构知识整合时的隐私保护、计算开销控制以及灾难性遗忘问题。其关键解决方案是提出一种新颖的分层合并方法,该方法结合选择性最优传输(Selective Optimal Transport, OT)对齐注意力层与余弦相似度加权插值,以缓解排列不变性差异并降低边缘部署场景下的计算复杂度。实验表明,对于架构兼容的医学LLMs,简单平均策略已能实现优异性能(如Task Arithmetic在MedQA上达45.80%准确率),优于复杂的剪枝方法,从而为资源受限物联网环境中的可扩展医疗AI系统提供了高效且可靠的基线方案。

链接: https://arxiv.org/abs/2511.13373
作者: Prakrit Timilsina,Anuj Nepal,Rajan Kadel,Robin Doss
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) face significant challenges in distributed healthcare, including consolidating specialized domain knowledge across institutions while maintaining privacy, reducing computational overhead, and preventing catastrophic forgetting during model this http URL paper presents a systematic evaluation of six parameter-space merging techniques applied to two architecturally compatible medical LLMs derived from the Mistral-7B base model. We introduce a novel hierarchical method that combines selective Optimal Transport (OT) alignment for attention layers with cosine similarity-weighted interpolation, designed to address permutation variance while minimizing computational overhead for edge deployment scenarios. Our study evaluates Task Arithmetic, Linear Averaging, DARE-TIES, DELLA, Breadcrumbs, and our Hierarchical approach across five medical benchmarks. Results demonstrate that architecturally compatible models benefit significantly from simple averaging methods, with Task Arithmetic achieving 45.80% accuracy on MedQA, outperforming complex pruning-based approaches. These findings offer critical insights for the deployment of distributed medical AI in resource-constrained IoT environments, where computational efficiency and model compatibility are paramount. Our work establishes that for architecturally compatible models, simple averaging provides a robust and computationally efficient baseline for knowledge consolidation, offering a pragmatic path forward for scalable medical AI systems.
zh

[AI-31] Cognitive Maps in Language Models: A Mechanistic Analysis of Spatial Planning

【速读】:该论文旨在探究大规模语言模型(Large Language Models, LLMs)如何解决空间导航任务,特别是揭示其在不同训练范式下所习得的空间智能机制。研究通过在网格环境中训练GPT-2模型于三种空间学习范式——被动探索(Foraging Model)、目标导向规划(SP-Hamiltonian)以及混合模型(SP-Random Walk)——并结合行为、表征与机制分析,发现两类根本不同的学习算法:一类是具备地图-like 表征的“认知地图”型策略(如Foraging模型),其通过因果干预显示在中间层实现方向依赖性的消失,并形成自洽的坐标系统和分层推理机制;另一类则是路径依赖型策略(如目标导向模型),始终依赖显式方向输入。关键在于,训练方式决定了模型是否发展出可泛化的世界模型(world model)或针对特定任务优化的启发式策略,从而揭示了Transformer架构中空间智能的谱系特性及训练制度对策略涌现的影响机制。

链接: https://arxiv.org/abs/2511.13371
作者: Caroline Baumgartner,Eleanor Spens,Neil Burgess,Petru Manescu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:How do large language models solve spatial navigation tasks? We investigate this by training GPT-2 models on three spatial learning paradigms in grid environments: passive exploration (Foraging Model- predicting steps in random walks), goal-directed planning (generating optimal shortest paths) on structured Hamiltonian paths (SP-Hamiltonian), and a hybrid model fine-tuned with exploratory data (SP-Random Walk). Using behavioural, representational and mechanistic analyses, we uncover two fundamentally different learned algorithms. The Foraging model develops a robust, map-like representation of space, akin to a ‘cognitive map’. Causal interventions reveal that it learns to consolidate spatial information into a self-sufficient coordinate system, evidenced by a sharp phase transition where its reliance on historical direction tokens vanishes by the middle layers of the network. The model also adopts an adaptive, hierarchical reasoning system, switching between a low-level heuristic for short contexts and map-based inference for longer ones. In contrast, the goal-directed models learn a path-dependent algorithm, remaining reliant on explicit directional inputs throughout all layers. The hybrid model, despite demonstrating improved generalisation over its parent, retains the same path-dependent strategy. These findings suggest that the nature of spatial intelligence in transformers may lie on a spectrum, ranging from generalisable world models shaped by exploratory data to heuristics optimised for goal-directed tasks. We provide a mechanistic account of this generalisation-optimisation trade-off and highlight how the choice of training regime influences the strategies that emerge.
zh

[AI-32] InfoDecom: Decomposing Information for Defending against Privacy Leakage in Split Inference AAAI2026

【速读】:该论文旨在解决Split Inference(SI)中因数据重构攻击(Data Reconstruction Attacks, DRAs)导致的隐私泄露问题,尤其是在客户端模型较浅时,现有防御方法常因过度扰动冗余信息而导致性能显著下降。解决方案的关键在于提出InfoDecom框架,该框架首先对客户端发送的“破碎数据”(smashed data)进行信息分解并移除冗余内容,随后注入与理论隐私保障相匹配的噪声,从而在不牺牲模型性能的前提下实现更优的隐私保护效果。

链接: https://arxiv.org/abs/2511.13365
作者: Ruijun Deng,Zhihui Lu,Qiang Duan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Split inference (SI) enables users to access deep learning (DL) services without directly transmitting raw data. However, recent studies reveal that data reconstruction attacks (DRAs) can recover the original inputs from the smashed data sent from the client to the server, leading to significant privacy leakage. While various defenses have been proposed, they often result in substantial utility degradation, particularly when the client-side model is shallow. We identify a key cause of this trade-off: existing defenses apply excessive perturbation to redundant information in the smashed data. To address this issue in computer vision tasks, we propose InfoDecom, a defense framework that first decomposes and removes redundant information and then injects noise calibrated to provide theoretically guaranteed privacy. Experiments demonstrate that InfoDecom achieves a superior utility-privacy trade-off compared to existing baselines. The code and the appendix are available at this https URL.
zh

[AI-33] MedDCR: Learning to Design Agent ic Workflows for Medical Coding

【速读】:该论文旨在解决医疗编码(Medical Coding)中自动化系统因依赖固定、手工设计的工作流而难以捕捉真实临床文档复杂性和变异性的问题,从而影响编码的准确性与可信赖性。其解决方案的关键在于提出MedDCR——一个闭环学习框架,将工作流设计视为可学习的问题:通过Designer生成候选流程、Coder执行流程、Reflector基于预测结果提供反馈,并借助记忆库存储和复用历史设计,实现工作流的迭代优化与适应性改进。该方法在基准数据集上显著优于现有最先进模型,同时生成可解释且灵活的工作流,更贴近实际医疗编码实践。

链接: https://arxiv.org/abs/2511.13361
作者: Jiyang Zheng,Islam Nassar,Thanh Vu,Xu Zhong,Yang Lin,Tongliang Liu,Long Duong,Yuan-Fang Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Medical coding converts free-text clinical notes into standardized diagnostic and procedural codes, which are essential for billing, hospital operations, and medical research. Unlike ordinary text classification, it requires multi-step reasoning: extracting diagnostic concepts, applying guideline constraints, mapping to hierarchical codebooks, and ensuring cross-document consistency. Recent advances leverage agentic LLMs, but most rely on rigid, manually crafted workflows that fail to capture the nuance and variability of real-world documentation, leaving open the question of how to systematically learn effective workflows. We present MedDCR, a closed-loop framework that treats workflow design as a learning problem. A Designer proposes workflows, a Coder executes them, and a Reflector evaluates predictions and provides constructive feedback, while a memory archive preserves prior designs for reuse and iterative refinement. On benchmark datasets, MedDCR outperforms state-of-the-art baselines and produces interpretable, adaptable workflows that better reflect real coding practice, improving both the reliability and trustworthiness of automated systems.
zh

[AI-34] Reasoning Shapes Alignment: Investigating Cultural Alignment in Large Reasoning Models with Cultural Norms

【速读】:该论文旨在解决大型推理模型在安全性和文化多样性价值对齐方面的挑战,即如何使模型不仅遵循安全规范,还能体现跨文化的多元人类价值观。其解决方案的关键在于提出了一种基于文化规范的文化对齐(Cultural Norm-based Cultural Alignment, CNCA)框架,通过三种自动从有限调查数据中挖掘文化规范的方法,并探索两种对齐范式:一种是将文化规范显式嵌入用户上下文的上下文内对齐方法,另一种是通过增强的思维链(Chain-of-Thought)训练数据将规范内化到模型中的微调方法。实验表明,具备更强推理能力的模型更能从文化规范的挖掘与利用中获益,凸显了借助文化感知策略提升模型对多元人类价值观反映能力的潜力。

链接: https://arxiv.org/abs/2511.13359
作者: Yuhang Wang,Yanxu Zhu,Jitao Sang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The advanced reasoning capabilities of Large Reasoning Models enable them to thoroughly understand and apply safety policies through deliberate thought processes, thereby improving the models’ safety. Beyond safety, these models must also be able to reflect the diverse range of human values across various cultures. This paper presents the Cultural Norm-based Cultural Alignment (CNCA) framework, which enables models to leverage their powerful reasoning ability to align with cultural norms. Specifically, we propose three methods to automatically mine cultural norms from limited survey data and explore ways to effectively utilize these norms for improving cultural alignment. Two alignment paradigms are examined: an in-context alignment method, where cultural norms are explicitly integrated into the user context, and a fine-tuning-based method, which internalizes norms through enhanced Chain-of-Thought training data. Comprehensive experiments demonstrate the effectiveness of these methods, highlighting that models with stronger reasoning capabilities benefit more from cultural norm mining and utilization. Our findings emphasize the potential for reasoning models to better reflect diverse human values through culturally informed alignment strategies.
zh

[AI-35] Enhancing All-to-X Backdoor Attacks with Optimized Target Class Mapping

【速读】:该论文旨在解决生成式 AI(Generative AI)系统中后门攻击(Backdoor Attacks)的防御难题,特别是针对多目标全向攻击(All-to-X, A2X)在实际场景中因攻击成功率低而被忽视的问题。现有研究主要聚焦于单目标全向攻击(All-to-One, A2O),但A2X攻击更具现实威胁性且对当前先进防御机制具有鲁棒性。论文提出了一种新型攻击策略,其关键在于优化分组机制与目标类别分配机制,从而显著提升A2X攻击的成功率,同时保持对主流防御手段的抗性。实验表明,该方法在CIFAR10、CIFAR100和Tiny-ImageNet上平均攻击成功率分别提升了6.7%、16.4%和14.1%,最高提升达28%。

链接: https://arxiv.org/abs/2511.13356
作者: Lei Wang,Yulong Tian,Hao Han,Fengyuan Xu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Backdoor attacks pose severe threats to machine learning systems, prompting extensive research in this area. However, most existing work focuses on single-target All-to-One (A2O) attacks, overlooking the more complex All-to-X (A2X) attacks with multiple target classes, which are often assumed to have low attack success rates. In this paper, we first demonstrate that A2X attacks are robust against state-of-the-art defenses. We then propose a novel attack strategy that enhances the success rate of A2X attacks while maintaining robustness by optimizing grouping and target class assignment mechanisms. Our method improves the attack success rate by up to 28%, with average improvements of 6.7%, 16.4%, 14.1% on CIFAR10, CIFAR100, and Tiny-ImageNet, respectively. We anticipate that this study will raise awareness of A2X attacks and stimulate further research in this under-explored area. Our code is available at this https URL .
zh

[AI-36] Dual-LoRA and Quality-Enhanced Pseudo Replay for Multimodal Continual Food Learning

【速读】:该论文旨在解决现有大规模多模态模型(Large Multimodal Models, LMMs)在食品分析任务中因持续学习新任务而引发的灾难性遗忘问题,该问题通常需要昂贵的从头再训练。解决方案的关键在于提出一种新颖的持续学习框架,融合双LoRA(Low-Rank Adapter)架构与质量增强伪回放(Quality-Enhanced Pseudo Replay)策略:其中,专用LoRA通过正交约束学习任务特定知识以避免干扰已有任务子空间,协作LoRA则利用伪回放机制整合跨任务共享知识;同时,质量增强伪回放通过自一致性与语义相似性评估降低生成样本中的幻觉现象,从而提升回放数据可靠性,显著缓解遗忘并实现复杂食品任务上的有效持续学习。

链接: https://arxiv.org/abs/2511.13351
作者: Xinlan Wu,Bin Zhu,Feng Han,Pengkun Jiao,Jingjing Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Food analysis has become increasingly critical for health-related tasks such as personalized nutrition and chronic disease prevention. However, existing large multimodal models (LMMs) in food analysis suffer from catastrophic forgetting when learning new tasks, requiring costly retraining from scratch. To address this, we propose a novel continual learning framework for multimodal food learning, integrating a Dual-LoRA architecture with Quality-Enhanced Pseudo Replay. We introduce two complementary low-rank adapters for each task: a specialized LoRA that learns task-specific knowledge with orthogonal constraints to previous tasks’ subspaces, and a cooperative LoRA that consolidates shared knowledge across tasks via pseudo replay. To improve the reliability of replay data, our Quality-Enhanced Pseudo Replay strategy leverages self-consistency and semantic similarity to reduce hallucinations in generated samples. Experiments on the comprehensive Uni-Food dataset show superior performance in mitigating forgetting, representing the first effective continual learning approach for complex food tasks.
zh

[AI-37] An LLM -based Quantitative Framework for Evaluating High-Stealthy Backdoor Risks in OSS Supply Chains

【速读】:该论文旨在解决开源软件供应链中因依赖项维护不足和社区审计薄弱而导致的隐蔽后门攻击风险问题,特别是在高隐蔽性攻击场景下(如XZ-Util事件)难以有效识别和评估安全威胁。其解决方案的关键在于提出一个细粒度的项目评估框架,从攻击者视角建模隐蔽后门攻击的各个阶段,并定义针对性指标;同时,为克服静态分析在评估仓库维护可靠性(如异常提交者权限提升、评审参与度低)方面的局限性,引入大语言模型(Large Language Models, LLMs)对代码仓库进行语义级评估,无需依赖人工设计的规则模式,从而更精准地识别潜在风险。

链接: https://arxiv.org/abs/2511.13341
作者: Zihe Yan,Kai Luo,Haoyu Yang,Yang Yu,Zhuosheng Zhang,Guancheng Li
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 7 figures, 4 tables, conference

点击查看摘要

Abstract:In modern software development workflows, the open-source software supply chain contributes significantly to efficient and convenient engineering practices. With increasing system complexity, using open-source software as third-party dependencies has become a common practice. However, the lack of maintenance for underlying dependencies and insufficient community auditing create challenges in ensuring source code security and the legitimacy of repository maintainers, especially under high-stealthy backdoor attacks exemplified by the XZ-Util incident. To address these problems, we propose a fine-grained project evaluation framework for backdoor risk assessment in open-source software. The framework models stealthy backdoor attacks from the viewpoint of the attacker and defines targeted metrics for each attack stage. In addition, to overcome the limitations of static analysis in assessing the reliability of repository maintenance activities such as irregular committer privilege escalation and limited participation in reviews, the framework uses large language models (LLMs) to conduct semantic evaluation of code repositories without relying on manually crafted patterns. The framework is evaluated on sixty six high-priority packages in the Debian ecosystem. The experimental results indicate that the current open-source software supply chain is exposed to various security risks.
zh

[AI-38] Explainable RL Policies by Distilling to Locally-Specialized Linear Policies with Voronoi State Partitioning

【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)控制器因使用难以解释的深度神经网络而缺乏透明性的问题,这在需要满足监管要求或建立信任的应用场景中尤为关键。解决方案的关键在于提出一种模型无关的知识蒸馏方法,通过 Voronoi 划分将状态空间分割为多个区域,在每个区域内用一个简化的、人类可理解的线性模型替代原黑箱策略,从而实现局部专业化建模。该方法在保持与原始控制器相当甚至略优性能的同时,显著提升了策略的可解释性。

链接: https://arxiv.org/abs/2511.13322
作者: Senne Deproost,Dennis Steckelmacher,Ann Nowé
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted for BNAIC/BeNeLearn 2025

点击查看摘要

Abstract:Deep Reinforcement Learning is one of the state-of-the-art methods for producing near-optimal system controllers. However, deep RL algorithms train a deep neural network, that lacks transparency, which poses challenges when the controller has to meet regulations, or foster trust. To alleviate this, one could transfer the learned behaviour into a model that is human-readable by design using knowledge distilla- tion. Often this is done with a single model which mimics the original model on average but could struggle in more dynamic situations. A key challenge is that this simpler model should have the right balance be- tween flexibility and complexity or right balance between balance bias and accuracy. We propose a new model-agnostic method to divide the state space into regions where a simplified, human-understandable model can operate in. In this paper, we use Voronoi partitioning to find regions where linear models can achieve similar performance to the original con- troller. We evaluate our approach on a gridworld environment and a classic control task. We observe that our proposed distillation to locally- specialized linear models produces policies that are explainable and show that the distillation matches or even slightly outperforms the black-box policy they are distilled from.
zh

[AI-39] Whistledown: Combining User-Level Privacy with Conversational Coherence in LLM s

【速读】:该论文旨在解决用户在使用云部署的大语言模型(Large Language Models, LLMs)进行情感敏感或社会敏感对话时,因提示词(prompts)中包含个人身份信息(Personally Identifiable Information, PII)而引发的隐私泄露风险。尤其当用户讨论朋友、同事或对手等社交关系时,此类信息更易被日志记录、保留甚至泄露。针对这一问题,作者提出Whistledown——一种基于最佳努力原则的隐私保护层,其核心机制为结合伪匿名化(pseudonymization)与ε-局部差分隐私(ε-local differential privacy, ε-LDP),并引入变换缓存(transformation caching)以在不牺牲对话可用性的前提下提供隐私保障。该方案设计轻量,适用于客户端本地部署(个人用户)或企业零信任网关集中部署(企业用户),且无需修改主流LLM服务商的API接口。

链接: https://arxiv.org/abs/2511.13319
作者: Chelsea McMurray,Hayder Tirmazi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Users increasingly rely on large language models (LLMs) for personal, emotionally charged, and socially sensitive conversations. However, prompts sent to cloud-hosted models can contain personally identifiable information (PII) that users do not want logged, retained, or leaked. We observe this to be especially acute when users discuss friends, coworkers, or adversaries, i.e., when they spill the tea. Enterprises face the same challenge when they want to use LLMs for internal communication and decision-making. In this whitepaper, we present Whistledown, a best-effort privacy layer that modifies prompts before they are sent to the LLM. Whistledown combines pseudonymization and \epsilon -local differential privacy ( \epsilon -LDP) with transformation caching to provide best-effort privacy protection without sacrificing conversational utility. Whistledown is designed to have low compute and memory overhead, allowing it to be deployed directly on a client’s device in the case of individual users. For enterprise users, Whistledown is deployed centrally within a zero-trust gateway that runs on an enterprise’s trusted infrastructure. Whistledown requires no changes to the existing APIs of popular LLM providers. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2511.13319 [cs.CR] (or arXiv:2511.13319v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2511.13319 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-40] EL3DD: Extended Latent 3D Diffusion for Language Conditioned Multitask Manipulation

【速读】:该论文旨在解决通用机器人在人类环境中执行任务时面临的挑战,即如何结合自然语言理解与物理操作能力,实现基于文本指令的精准机器人轨迹生成。其核心解决方案是将扩散模型(diffusion models)引入视觉-运动策略框架中,通过融合视觉和文本输入来学习从自然语言命令到具体操作动作的映射关系,并利用参考示范(reference demonstrations)进行训练,使模型能够在机器人当前环境内完成由文本指定的操纵任务。关键创新在于改进嵌入表示并借鉴图像生成领域的扩散模型技术,从而提升多任务序列执行下的长期成功率,验证了扩散模型在通用多任务操纵中的有效性。

链接: https://arxiv.org/abs/2511.13312
作者: Jonas Bode,Raphael Memmesheimer,Sven Behnke
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages; 2 figures; 1 table. Prprint submitted to the European Robotics Forum 2026

点击查看摘要

Abstract:Acting in human environments is a crucial capability for general-purpose robots, necessitating a robust understanding of natural language and its application to physical tasks. This paper seeks to harness the capabilities of diffusion models within a visuomotor policy framework that merges visual and textual inputs to generate precise robotic trajectories. By employing reference demonstrations during training, the model learns to execute manipulation tasks specified through textual commands within the robot’s immediate environment. The proposed research aims to extend an existing model by leveraging improved embeddings, and adapting techniques from diffusion models for image generation. We evaluate our methods on the CALVIN dataset, proving enhanced performance on various manipulation tasks and an increased long-horizon success rate when multiple tasks are executed in sequence. Our approach reinforces the usefulness of diffusion models and contributes towards general multitask manipulation.
zh

[AI-41] Grounded by Experience: Generative Healthcare Prediction Augmented with Hierarchical Agent ic Retrieval

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在医疗健康预测中因内置知识可靠性与覆盖范围不足而导致的事实性错误问题,以及检索增强生成(Retrieval-Augmented Generation, RAG)框架在医疗场景下面临的两个关键挑战:一是如何准确判断何时触发检索机制以满足临床需求,二是如何实现检索器与生成器之间的协同优化以生成情境适配的上下文信息。解决方案的关键在于提出一种生成式分层代理检索增强生成框架(Generative Hierarchical Agentic RAG, GHAR),其核心创新包括:设计双代理架构(Agent-Top作为主诊医师动态决策是否依赖参数化知识或启动检索,Agent-Low作为咨询模块汇总任务相关知识),并首次将两代理的优化统一于马尔可夫决策过程(Markov Decision Process, MDP)中,通过多样化奖励机制对齐二者共同目标——提升预测准确性,同时保持各自角色分工。

链接: https://arxiv.org/abs/2511.13293
作者: Chuang Zhao,Hui Tang,Hongke Zhao,Xiaofang Zhou,Xiaomeng Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate healthcare prediction is critical for improving patient outcomes and reducing operational costs. Bolstered by growing reasoning capabilities, large language models (LLMs) offer a promising path to enhance healthcare predictions by drawing on their rich parametric knowledge. However, LLMs are prone to factual inaccuracies due to limitations in the reliability and coverage of their embedded knowledge. While retrieval-augmented generation (RAG) frameworks, such as GraphRAG and its variants, have been proposed to mitigate these issues by incorporating external knowledge, they face two key challenges in the healthcare scenario: (1) identifying the clinical necessity to activate the retrieval mechanism, and (2) achieving synergy between the retriever and the generator to craft contextually appropriate retrievals. To address these challenges, we propose GHAR, a \underlinegenerative \underlinehierarchical \underlineagentic \underlineRAG framework that simultaneously resolves when to retrieve and how to optimize the collaboration between submodules in healthcare. Specifically, for the first challenge, we design a dual-agent architecture comprising Agent-Top and Agent-Low. Agent-Top acts as the primary physician, iteratively deciding whether to rely on parametric knowledge or to initiate retrieval, while Agent-Low acts as the consulting service, summarising all task-relevant knowledge once retrieval was triggered. To tackle the second challenge, we innovatively unify the optimization of both agents within a formal Markov Decision Process, designing diverse rewards to align their shared goal of accurate prediction while preserving their distinct roles. Extensive experiments on three benchmark datasets across three popular tasks demonstrate our superiority over state-of-the-art baselines, highlighting the potential of hierarchical agentic RAG in advancing healthcare systems.
zh

[AI-42] Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO

【速读】:该论文旨在解决多智能体系统(Multi-agent Systems)在专业化领域推理任务中因统一训练导致性能受限的问题,以及采用独立大语言模型(LLM)训练时面临的优化挑战,如不同智能体运行频率差异、子智能体调用数量不一及跨服务器部署引发的梯度流中断。其解决方案的关键在于提出一种分层的Group Relative Policy Optimization(M-GRPO),通过为顶层主智能体(planner)和子智能体(multi-turn tool executors)分别计算组相对优势(group-relative advantages),实现层级化的信用分配;同时引入轨迹对齐(trajectory-alignment)机制,在子智能体调用数量变化的情况下仍能生成固定大小的批次数据,并采用解耦训练流水线使各智能体在独立服务器上运行并通过共享存储交换最小统计信息,从而避免跨服务器反向传播,显著提升了工具增强型推理任务的稳定性与样本效率。

链接: https://arxiv.org/abs/2511.13288
作者: Haoyang Hong,Jiajun Yin,Yuan Wang,Jingnan Liu,Zhe Chen,Ailing Yu,Ji Li,Zhiling Ye,Hansong Xiao,Yefei Chen,Hualei Zhou,Yun Yue,Minghui Yang,Chunxiao Guo,Junwei Liu,Peng Wei,Jinjie Gu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-agent systems perform well on general reasoning tasks. However, the lack of training in specialized areas hinders their accuracy. Current training methods train a unified large language model (LLM) for all agents in the system. This may limit the performances due to different distributions underlying for different agents. Therefore, training multi-agent systems with distinct LLMs should be the next step to solve. However, this approach introduces optimization challenges. For example, agents operate at different frequencies, rollouts involve varying sub-agent invocations, and agents are often deployed across separate servers, disrupting end-to-end gradient flow. To address these issues, we propose M-GRPO, a hierarchical extension of Group Relative Policy Optimization designed for vertical Multi-agent systems with a main agent (planner) and multiple sub-agents (multi-turn tool executors). M-GRPO computes group-relative advantages for both main and sub-agents, maintaining hierarchical credit assignment. It also introduces a trajectory-alignment scheme that generates fixed-size batches despite variable sub-agent invocations. We deploy a decoupled training pipeline in which agents run on separate servers and exchange minimal statistics via a shared store. This enables scalable training without cross-server backpropagation. In experiments on real-world benchmarks (e.g., GAIA, XBench-DeepSearch, and WebWalkerQA), M-GRPO consistently outperforms both single-agent GRPO and multi-agent GRPO with frozen sub-agents, demonstrating improved stability and sample efficiency. These results show that aligning heterogeneous trajectories and decoupling optimization across specialized agents enhances tool-augmented reasoning tasks.
zh

[AI-43] KForge: Program Synthesis for Diverse AI Hardware Accelerators

【速读】:该论文旨在解决生成式 AI (Generative AI) 在跨平台优化 GPU 内核(GPU kernels)时面临的挑战,即如何在不同硬件加速器上高效生成高性能代码。传统方法依赖于针对特定平台的手动调优,难以适应多样化的计算架构。其解决方案的关键在于提出 KForge,一个平台无关的框架,由两个协作的大语言模型(LLM)代理组成:生成代理通过编译和正确性反馈迭代生成并优化程序,性能分析代理则解读多种来源的性能剖析数据(包括程序化 API 和图形界面工具),为优化提供可操作建议。该架构仅需单次示例即可适配新平台,并通过跨平台知识迁移显著提升对异构硬件的目标生成质量,已在 NVIDIA CUDA 和 Apple Metal 两大差异显著的并行计算平台上验证了其有效性。

链接: https://arxiv.org/abs/2511.13274
作者: Taras Sereda,Tom St. John,Burak Bartan,Natalie Serrino,Sachin Katti,Zain Asgar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Performance (cs.PF); Software Engineering (cs.SE)
备注: Under review at MLSys 2026

点击查看摘要

Abstract:GPU kernels are critical for ML performance but difficult to optimize across diverse accelerators. We present KForge, a platform-agnostic framework built on two collaborative LLM-based agents: a generation agent that produces and iteratively refines programs through compilation and correctness feedback, and a performance analysis agent that interprets profiling data to guide optimization. This agent-based architecture requires only a single-shot example to target new platforms. We make three key contributions: (1) introducing an iterative refinement system where the generation agent and performance analysis agent collaborate through functional and optimization passes, interpreting diverse profiling data (from programmatic APIs to GUI-based tools) to generate actionable recommendations that guide program synthesis for arbitrary accelerators; (2) demonstrating that the generation agent effectively leverages cross-platform knowledge transfer, where a reference implementation from one architecture substantially improves generation quality for different hardware targets; and (3) validating the platform-agnostic nature of our approach by demonstrating effective program synthesis across fundamentally different parallel computing platforms: NVIDIA CUDA and Apple Metal. Comments: Under review at MLSys 2026 Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Performance (cs.PF); Software Engineering (cs.SE) Cite as: arXiv:2511.13274 [cs.LG] (or arXiv:2511.13274v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.13274 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-44] Spatial Blind Spot: Auditory Motion Perception Deficits in Audio LLM s

【速读】:该论文旨在解决当前大型音频语言模型(Large Audio-Language Models, LALMs)在感知声源空间动态性,尤其是声源运动方向与轨迹识别方面存在的系统性缺陷问题。其解决方案的关键在于提出并构建了AMPBench——首个专门用于评估音频语言模型听觉运动理解能力的受控问答基准,通过设计基于双耳音频(binaural audio)的测试任务,定量和定性地揭示了现有模型在识别运动线索和区分方向模式上的显著不足,平均准确率低于50%,从而指出了人类与模型在听觉空间推理能力上的根本差距,并为未来提升音频语言模型的空间认知能力提供了诊断工具与研究方向。

链接: https://arxiv.org/abs/2511.13273
作者: Zhe Sun,Yujun Cai,Jiayu Yao,Yiwei Wang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Audio-Language Models (LALMs) have recently shown impressive progress in speech recognition, audio captioning, and auditory question answering. Yet, whether these models can perceive spatial dynamics, particularly the motion of sound sources, remains unclear. In this work, we uncover a systematic motion perception deficit in current ALLMs. To investigate this issue, we introduce AMPBench, the first benchmark explicitly designed to evaluate auditory motion understanding. AMPBench introduces a controlled question-answering benchmark designed to evaluate whether Audio-Language Models (LALMs) can infer the direction and trajectory of moving sound sources from binaural audio. Comprehensive quantitative and qualitative analyses reveal that current models struggle to reliably recognize motion cues or distinguish directional patterns. The average accuracy remains below 50%, underscoring a fundamental limitation in auditory spatial reasoning. Our study highlights a fundamental gap between human and model auditory spatial reasoning, providing both a diagnostic tool and new insight for enhancing spatial cognition in future Audio-Language Models.
zh

[AI-45] Examining the Usage of Generative AI Models in Student Learning Activities for Software Programming

【速读】:该论文旨在解决生成式 AI(Generative AI, GenAI)在编程教育中对学生知识获取(knowledge gains)影响的研究空白问题,特别是相较于传统在线资源,GenAI辅助是否能有效促进不同编程水平学生的学习效果。其关键解决方案在于通过控制实验设计,对比分析初学者与中级学习者在使用 ChatGPT 解决编程任务时的行为差异、任务表现与概念理解程度,发现 GenAI 虽显著提升任务完成效率(尤其对初学者),但若仅用于直接生成完整解决方案,则难以带来稳定的知识增长;而有效的学习依赖于策略性使用——即避免过度依赖或完全不用,强调将 GenAI 视为“学习工具”而非“解题工具”,从而推动更深层次的理解。

链接: https://arxiv.org/abs/2511.13271
作者: Rufeng Chen,Shuaishuai Jiang,Jiyun Shen,AJung Moon,Lili Wei
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 9 pages, 4 figures, accepted at AIWARE 2025

点击查看摘要

Abstract:The rise of Generative AI (GenAI) tools like ChatGPT has created new opportunities and challenges for computing education. Existing research has primarily focused on GenAI’s ability to complete educational tasks and its impact on student performance, often overlooking its effects on knowledge gains. In this study, we investigate how GenAI assistance compares to conventional online resources in supporting knowledge gains across different proficiency levels. We conducted a controlled user experiment with 24 undergraduate students of two different levels of programming experience (beginner, intermediate) to examine how students interact with ChatGPT while solving programming tasks. We analyzed task performance, conceptual understanding, and interaction behaviors. Our findings reveal that generating complete solutions with GenAI significantly improves task performance, especially for beginners, but does not consistently result in knowledge gains. Importantly, usage strategies differ by experience: beginners tend to rely heavily on GenAI toward task completion often without knowledge gain in the process, while intermediates adopt more selective approaches. We find that both over-reliance and minimal use result in weaker knowledge gains overall. Based on our results, we call on students and educators to adopt GenAI as a learning rather than a problem solving tool. Our study highlights the urgent need for guidance when integrating GenAI into programming education to foster deeper understanding.
zh

[AI-46] Proceedings Seventh International Workshop on Formal Methods for Autonomous Systems

【速读】:该论文聚焦于利用形式化方法(Formal Methods)解决自主系统(Autonomous Systems)所面临的独特挑战,旨在通过严谨的数学建模与验证技术提升系统的安全性、可靠性和可预测性。其关键解决方案在于构建一个跨学科的研究交流平台——国际形式化方法在自主系统研讨会(FMAS),汇聚来自全球多个地区的研究人员,促进新旧作者间的知识共享与协作,从而推动该领域理论与实践的持续发展。

链接: https://arxiv.org/abs/2511.13245
作者: Matt Luckcuck,Maike Schwammberger,Mengwei Xu
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:This EPTCS volume contains the papers from the Seventh International Workshop on Formal Methods for Autonomous Systems (FMAS 2025), which was held between the 17th and 19th of November 2025. The goal of the FMAS workshop series is to bring together leading researchers who are using formal methods to tackle the unique challenges that autonomous systems present, so that they can publish and discuss their work with a growing community of researchers. FMAS 2025 was co-located with the 20th International Conference on integrated Formal Methods (iFM’25), hosted by Inria Paris, France at the Inria Paris Center. In total, FMAS 2025 received 16 submissions from researchers at institutions in: Canada, China, France, Germany, Ireland, Italy, Japan, the Netherlands, Portugal, Sweden, the United States of America, and the United Kingdom. Though we received fewer submissions than last year, we are encouraged to see the submissions being sent from a wide range of countries. Submissions come from both past and new FMAS authors, which shows us that the existing community appreciates the network that FMAS has built over the past 7 years, while new authors also show the FMAS community’s great potential of growth. Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Robotics (cs.RO) Cite as: arXiv:2511.13245 [cs.LO] (or arXiv:2511.13245v1 [cs.LO] for this version) https://doi.org/10.48550/arXiv.2511.13245 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: EPTCS 436, 2025 Related DOI: https://doi.org/10.4204/EPTCS.436 Focus to learn more DOI(s) linking to related resources
zh

[AI-47] Seek and You Shall Fold

【速读】:该论文旨在解决如何将非可微的实验数据(如核磁共振中的化学位移、偶极耦合常数等)有效整合到蛋白质生成模型中这一关键挑战,因为传统基于梯度的条件采样方法无法处理这类不可微的观测信号。其解决方案的核心在于提出一种通用框架,通过定制化的遗传算法(genetic algorithm)将连续扩散生成模型与任意黑盒目标函数进行耦合,从而实现对蛋白质结构生成过程的非可微指导。该方法首次成功实现了基于化学位移约束的蛋白质结构生成,验证了其在多种实验数据模态下的有效性,并揭示了当前预测器的局限性,为未来自动化、数据驱动的蛋白质建模提供了普适性策略。

链接: https://arxiv.org/abs/2511.13244
作者: Nadav Bojan Sellam,Meital Bojan,Paul Schanda,Alex Bronstein
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate protein structures are essential for understanding biological function, yet incorporating experimental data into protein generative models remains a major challenge. Most predictors of experimental observables are non-differentiable, making them incompatible with gradient-based conditional sampling. This is especially limiting in nuclear magnetic resonance, where rich data such as chemical shifts are hard to directly integrate into generative modeling. We introduce a framework for non-differentiable guidance of protein generative models, coupling a continuous diffusion-based generator with any black-box objective via a tailored genetic algorithm. We demonstrate its effectiveness across three modalities: pairwise distance constraints, nuclear Overhauser effect restraints, and for the first time chemical shifts. These results establish chemical shift guided structure generation as feasible, expose key weaknesses in current predictors, and showcase a general strategy for incorporating diverse experimental signals. Our work points toward automated, data-conditioned protein modeling beyond the limits of differentiability.
zh

[AI-48] Informative Communication of Robot Plans

【速读】:该论文旨在解决机器人在向用户 verbalize(口头表达)其行动计划时,如何实现信息传递效率最大化的问题。传统增量式策略(incremental strategy)按计划顺序逐条输出动作,但忽略了用户已有的先验知识,导致沟通冗余或无效。解决方案的关键在于引入基于二阶心智理论(second-order theory of mind)的用户认知模型,通过量化每条口头表述的信息增益(information gain),动态选择最能提升用户理解效率的表达顺序。该方法使用户更快理解机器人目标,同时揭示了哪些信息具有传播价值及其原因。

链接: https://arxiv.org/abs/2511.13226
作者: Michele Persiani,Thomas Hellstrom
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Conference: PAAMS 2022, 20th International Conference on Practical Applications of Agents and Multi-Agent Systems

点击查看摘要

Abstract:When a robot is asked to verbalize its plan it can do it in many ways. For example, a seemingly natural strategy is incremental, where the robot verbalizes its planned actions in plan order. However, an important aspect of this type of strategy is that it misses considerations on what is effectively informative to communicate, because not considering what the user knows prior to explanations. In this paper we propose a verbalization strategy to communicate robot plans informatively, by measuring the information gain that verbalizations have against a second-order theory of mind of the user capturing his prior knowledge on the robot. As shown in our experiments, this strategy allows to understand the robot’s goal much quicker than by using strategies such as increasing or decreasing plan order. In addition, following our formulation we hint to what is informative and why when a robot communicates its plan.
zh

[AI-49] okenSqueeze: Performance-Preserving Compression for Reasoning LLM s NEURIPS2025

【速读】:该论文旨在解决生成式 AI(Generative AI)中复杂推理任务的效率-准确率权衡问题,即如何在降低长链思维(Chain-of-Thought, CoT)路径的 token 消耗以提升推理效率的同时,不牺牲模型的准确性。现有长到短(Long2Short)方法常因过度压缩推理深度而导致性能下降,缺乏兼顾效率与精度的有效手段。其解决方案的关键在于提出 TokenSqueeze 方法:首先通过自生成样本的适应性筛选机制,确保所选样本的推理深度与问题复杂度匹配,避免因过度压缩导致性能损失;其次引入分布对齐的语言精炼策略,在不改变推理逻辑的前提下优化语言表达的清晰度和简洁性,从而实现高保真度的推理路径压缩。实验表明,基于该方法微调后的 DeepSeek-R1-Distill-Qwen-7B 模型在 MATH500 基准上实现了平均 50% 的 token 减少且保持原有准确率,且完全依赖自生成数据,无需人工标注的短答案数据集。

链接: https://arxiv.org/abs/2511.13223
作者: Yuxiang Zhang,Zhengxu Yu,Weihang Pan,Zhongming Jin,Qiang Fu,Deng Cai,Binbin Lin,Jieping Ye
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to NeurIPS 2025

点击查看摘要

Abstract:Emerging reasoning LLMs such as OpenAI-o1 and DeepSeek-R1 have achieved strong performance on complex reasoning tasks by generating long chain-of-thought (CoT) traces. However, these long CoTs result in increased token usage, leading to higher inference latency and memory consumption. As a result, balancing accuracy and reasoning efficiency has become essential for deploying reasoning LLMs in practical applications. Existing long-to-short (Long2Short) methods aim to reduce inference length but often sacrifice accuracy, revealing a need for an approach that maintains performance while lowering token costs. To address this efficiency-accuracy tradeoff, we propose TokenSqueeze, a novel Long2Short method that condenses reasoning paths while preserving performance and relying exclusively on self-generated data. First, to prevent performance degradation caused by excessive compression of reasoning depth, we propose to select self-generated samples whose reasoning depth is adaptively matched to the complexity of the problem. To further optimize the linguistic expression without altering the underlying reasoning paths, we introduce a distribution-aligned linguistic refinement method that enhances the clarity and conciseness of the reasoning path while preserving its logical integrity. Comprehensive experimental results demonstrate the effectiveness of TokenSqueeze in reducing token usage while maintaining accuracy. Notably, DeepSeek-R1-Distill-Qwen-7B fine-tuned using our proposed method achieved a 50% average token reduction while preserving accuracy on the MATH500 benchmark. TokenSqueeze exclusively utilizes the model’s self-generated data, enabling efficient and high-fidelity reasoning without relying on manually curated short-answer datasets across diverse applications. Our code is available at this https URL.
zh

[AI-50] FoleyBench: A Benchmark For Video-to-Audio Models

【速读】:该论文旨在解决视频到音频生成(Video-to-audio generation, V2A)领域中缺乏针对Foley风格场景的专用评估基准的问题。现有数据集普遍存在音视频对应关系差(74%的视频存在此问题),且以语音和音乐为主,不适用于Foley音效生成任务。为填补这一空白,作者提出FoleyBench,这是首个专为Foley风格V2A设计的大规模基准数据集,包含5000个(视频、真实音频、文本描述)三元组,其中音频与画面事件具有因果关联,并通过自动化可扩展流水线从YouTube和Vimeo等真实网络视频中构建。其关键创新在于:1)明确聚焦Foley场景,覆盖专门设计的Foley声音类别分类体系;2)提供细粒度元数据(如声源复杂度、UCS/AudioSet类别、视频时长),支持模型性能与失败模式的精细化分析;3)基于该数据集对多个前沿V2A模型进行系统评测,涵盖音频质量、音视频对齐、时间同步及音频-文本一致性等维度。

链接: https://arxiv.org/abs/2511.13219
作者: Satvik Dixit,Koichi Saito,Zhi Zhong,Yuki Mitsufuji,Chris Donahue
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Video-to-audio generation (V2A) is of increasing importance in domains such as film post-production, AR/VR, and sound design, particularly for the creation of Foley sound effects synchronized with on-screen actions. Foley requires generating audio that is both semantically aligned with visible events and temporally aligned with their timing. Yet, there is a mismatch between evaluation and downstream applications due to the absence of a benchmark tailored to Foley-style scenarios. We find that 74% of videos from past evaluation datasets have poor audio-visual correspondence. Moreover, they are dominated by speech and music, domains that lie outside the use case for Foley. To address this gap, we introduce FoleyBench, the first large-scale benchmark explicitly designed for Foley-style V2A evaluation. FoleyBench contains 5,000 (video, ground-truth audio, text caption) triplets, each featuring visible sound sources with audio causally tied to on-screen events. The dataset is built using an automated, scalable pipeline applied to in-the-wild internet videos from YouTube-based and Vimeo-based sources. Compared to past datasets, we show that videos from FoleyBench have stronger coverage of sound categories from a taxonomy specifically designed for Foley sound. Each clip is further labeled with metadata capturing source complexity, UCS/AudioSet category, and video length, enabling fine-grained analysis of model performance and failure modes. We benchmark several state-of-the-art V2A models, evaluating them on audio quality, audio-video alignment, temporal synchronization, and audio-text consistency. Samples are available at: this https URL
zh

[AI-51] Learning to Solve Resource-Constrained Project Scheduling Problems with Duration Uncertainty using Graph Neural Networks ICTAI2025

【速读】:该论文旨在解决资源受限项目调度问题(Resource-Constrained Project Scheduling Problem, RCPSP)中任务持续时间不确定性的实际挑战,目标是通过考虑已知概率分布的不确定性来最小化项目的整体期望完成时间,并生成一个可在工业场景中多次复用的基准调度方案。解决方案的关键在于结合图神经网络(Graph Neural Networks, GNNs)与深度强化学习(Deep Reinforcement Learning, DRL),构建一种类似于优先级派发规则的任务调度策略,并配合串行调度生成机制(Serial Schedule Generation Scheme)以生成鲁棒且可推广的调度方案。

链接: https://arxiv.org/abs/2511.13214
作者: Guillaume Infantes,Stéphanie Roussel,Antoine Jacquet,Emmanuel Benazera
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at ICTAI 2025 Conference

点击查看摘要

Abstract:The Resource-Constrained Project Scheduling Problem (RCPSP) is a classical scheduling problem that has received significant attention due to of its numerous applications in industry. However, in practice, task durations are subject to uncertainty that must be considered in order to propose resilient scheduling. In this paper, we address the RCPSP variant with uncertain tasks duration (modeled using known probabilities) and aim to minimize the overall expected project duration. Our objective is to produce a baseline schedule that can be reused multiple times in an industrial setting regardless of the actual duration scenario. We leverage Graph Neural Networks in conjunction with Deep Reinforcement Learning (DRL) to develop an effective policy for task scheduling. This policy operates similarly to a priority dispatch rule and is paired with a Serial Schedule Generation Scheme to produce a schedule. Our empirical evaluation on standard benchmarks demonstrates the approach’s superiority in terms of performance and its ability to generalize. The developed framework, Wheatley, is made publicly available online to facilitate further research and reproducibility.
zh

[AI-52] ParaDySe: A Parallel-Strategy Switching Framework for Dynamic Sequence Lengths in Transformer

【速读】:该论文旨在解决基于Transformer的大语言模型(Large Language Models, LLMs)在训练动态长度序列时,现有静态并行策略导致的两个核心问题:一是短序列无法有效取消通信并行化(Communication-Parallelization Cancellation, CPC),二是长序列易引发显存溢出(Out-of-Memory, OOM)。解决方案的关键在于提出ParaDySe——一种面向动态序列的自适应并行策略切换框架。其核心创新包括:构建统一张量布局规范的模块化并行策略函数库,并设计融合混合方法的序列感知内存与时间成本模型;在此基础上,通过高效启发式算法实现层级最优策略的在线选择与无缝切换,从而系统性地整合长序列优化能力与现有训练框架,显著缓解OOM和CPC瓶颈。

链接: https://arxiv.org/abs/2511.13198
作者: Zhixin Ou,Peng Liang,Jianchen Han,Baihui Liu,Linbo Qiao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Dynamic sequences with varying lengths have been widely used in the training of Transformer-based large language models (LLMs). However, current training frameworks adopt a pre-defined static parallel strategy for these sequences, causing neither communication-parallelization cancellation on short sequences nor out-of-memory on long sequences. To mitigate these issues, we propose ParaDySe, a novel adaptive Parallel strategy switching framework for Dynamic Sequences. ParaDySe enables on-the-fly optimal strategy adoption according to the immediate input sequence. It first implements the modular function libraries for parallel strategies with unified tensor layout specifications, and then builds sequence-aware memory and time cost models with hybrid methods. Guided by cost models, ParaDySe selects optimal layer-wise strategies for dynamic sequences via an efficient heuristic algorithm. By integrating these techniques together, ParaDySe achieves seamless hot-switching of optimal strategies through its well-designed function libraries. We compare ParaDySe with baselines on representative LLMs under datasets with sequence lengths up to 624K. Experimental results indicate that ParaDySe addresses OOM and CPC bottlenecks in LLM training by systematically integrating long-sequence optimizations with existing frameworks.
zh

[AI-53] Cost-Effective Communication: An Auction-based Method for Language Agent Interaction

【速读】:该论文旨在解决多智能体系统(Multi-agent Systems, MAS)在基于大语言模型(Large Language Models, LLMs)构建时普遍存在的“无序通信”问题,即各智能体间自由、冗余的交互导致令牌(token)消耗呈指数级增长,且信息信噪比低,严重制约其实际部署效率。解决方案的关键在于引入一种名为动态拍卖式语言代理(Dynamic Auction-based Language Agent, DALA)的新框架,将通信带宽视为稀缺且可交易的资源,通过集中式拍卖机制使智能体根据消息的价值密度进行竞价发言,从而内在地激励智能体生成简洁、高信息密度的内容,并主动抑制低价值沟通。该设计不仅显著提升了性能(如在MMLU上达84.32%,HumanEval pass@1达91.21%),还大幅降低资源消耗(仅用625万令牌),并催生出“战略沉默”的涌现能力,实现动态适应性通信策略。

链接: https://arxiv.org/abs/2511.13193
作者: Yijia Fan,Jusheng Zhang,Kaitong Cai,Jing Yang,Chengpei Tang,Jian Wang,Keze Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-agent systems (MAS) built on large language models (LLMs) often suffer from inefficient “free-for-all” communication, leading to exponential token costs and low signal-to-noise ratios that hinder their practical deployment. We challenge the notion that more communication is always beneficial, hypothesizing instead that the core issue is the absence of resource rationality. We argue that “free” communication, by ignoring the principle of scarcity, inherently breeds inefficiency and unnecessary expenses. To address this, we introduce the Dynamic Auction-based Language Agent (DALA), a novel framework that treats communication bandwidth as a scarce and tradable resource. Specifically, our DALA regards inter-agent communication as a centralized auction, where agents learn to bid for the opportunity to speak based on the predicted value density of their messages. Thus, our DALA intrinsically encourages agents to produce concise, informative messages while filtering out low-value communication. Extensive and comprehensive experiments demonstrate that our economically-driven DALA achieves new state-of-the-art performance across seven challenging reasoning benchmarks, including 84.32% on MMLU and a 91.21% pass@1 rate on HumanEval. Note that this is accomplished with remarkable efficiency, i.e., our DALA uses only 6.25 million tokens, a fraction of the resources consumed by current state-of-the-art methods on GSM8K. Further analysis reveals that our DALA cultivates the emergent skill of strategic silence, effectively adapting its communication strategies from verbosity to silence in a dynamical manner via resource constraints.
zh

[AI-54] Local Collaborative Filtering: A Collaborative Filtering Method that Utilizes Local Similarities among Users

【速读】:该论文旨在解决如何更有效地利用互联网用户行为数据以提升推荐系统性能的问题。其解决方案的关键在于提出了一种名为局部协同过滤(Local Collaborative Filtering, LCF)的新颖协同过滤方法,该方法通过挖掘用户间的局部相似性,并基于大数定律(Law of Large Numbers, LLN)整合相关用户数据,从而增强对用户行为数据的利用效率。

链接: https://arxiv.org/abs/2511.13166
作者: Zhaoxin Shen,Dan Wu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 4 pages, 2 figures

点击查看摘要

Abstract:To leverage user behavior data from the Internet more effectively in recommender systems, this paper proposes a novel collaborative filtering (CF) method called Local Collaborative Filtering (LCF). LCF utilizes local similarities among users and integrates their data using the law of large numbers (LLN), thereby improving the utilization of user behavior data. Experiments are conducted on the Steam game dataset, and the results of LCF align with real-world needs.
zh

[AI-55] InteractiveGNNExplainer: A Visual Analytics Framework for Multi-Faceted Understanding and Probing of Graph Neural Network Predictions

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在节点分类任务中因复杂非线性操作导致的可解释性不足问题,即模型决策过程缺乏透明度,从而影响用户信任、调试效率、偏见检测及在高风险场景中的应用。解决方案的关键在于提出一种名为InteractiveGNNExplainer的可视化分析框架,其核心创新在于将协同交互视图(动态图布局、嵌入投影、特征检查与邻域分析)与后验解释方法(如GNNExplainer)和内在解释机制(如GAT注意力权重)相结合,并首次引入交互式图编辑功能,支持用户进行“假设性”分析(what-if analysis),通过实时修改图结构并观察预测结果与解释的变化,实现对GNN行为的深入诊断与敏感性验证,从而提升模型透明度与可信度。

链接: https://arxiv.org/abs/2511.13160
作者: TC Singh,Sougata Mukherjea
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) excel in graph-based learning tasks, but their complex, non-linear operations often render them as opaque “black boxes”. This opacity hinders user trust, complicates debugging, bias detection, and adoption in critical domains requiring explainability. This paper introduces InteractiveGNNExplainer, a visual analytics framework to enhance GNN explainability, focusing on node classification. Our system uniquely integrates coordinated interactive views (dynamic graph layouts, embedding projections, feature inspection, neighborhood analysis) with established post-hoc (GNNExplainer) and intrinsic (GAT attention) explanation techniques. Crucially, it incorporates interactive graph editing, allowing users to perform a “what-if” analysis by perturbing graph structures and observing immediate impacts on GNN predictions and explanations. We detail the system architecture and, through case studies on Cora and CiteSeer datasets, demonstrate how InteractiveGNNExplainer facilitates in-depth misclassification diagnosis, comparative analysis of GCN versus GAT behaviors, and rigorous probing of model sensitivity. These capabilities foster a deeper, multifaceted understanding of GNN predictions, contributing to more transparent, trustworthy, and robust graph analysis.
zh

[AI-56] SoK: The Last Line of Defense: On Backdoor Defense Evaluation

【速读】:该论文旨在解决当前后门攻击防御方法在评估过程中存在的异构性与不一致性问题,即不同研究采用的实验设置、评价指标和威胁模型假设差异显著,导致防御效果难以公平比较。其解决方案的关键在于通过系统性的文献综述与大规模实证分析,对2018至2025年间发表的183篇后门防御论文进行深入剖析,并基于覆盖MNIST、CIFAR-100和ImageNet-1K三个数据集,四种模型架构(ResNet-18、VGG-19、ViT-B/16、DenseNet-121),以及16种代表性防御方法和五类常见攻击方式的超过3000次实验,揭示现有评估实践中的关键缺陷,包括计算开销报告不足、良性条件下行为未充分验证、超参数选择偏倚及实验覆盖不全等问题。最终,论文提出具体挑战与可操作建议,以推动未来后门防御评估的标准化与科学化。

链接: https://arxiv.org/abs/2511.13143
作者: Gorka Abad,Marina Krček,Stefanos Koffas,Behrad Tajalli,Marco Arazzi,Roberto Riaño,Xiaoyun Xu,Zhuoran Liu,Antonino Nocera,Stjepan Picek
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Backdoor attacks pose a significant threat to deep learning models by implanting hidden vulnerabilities that can be activated by malicious inputs. While numerous defenses have been proposed to mitigate these attacks, the heterogeneous landscape of evaluation methodologies hinders fair comparison between defenses. This work presents a systematic (meta-)analysis of backdoor defenses through a comprehensive literature review and empirical evaluation. We analyzed 183 backdoor defense papers published between 2018 and 2025 across major AI and security venues, examining the properties and evaluation methodologies of these defenses. Our analysis reveals significant inconsistencies in experimental setups, evaluation metrics, and threat model assumptions in the literature. Through extensive experiments involving three datasets (MNIST, CIFAR-100, ImageNet-1K), four model architectures (ResNet-18, VGG-19, ViT-B/16, DenseNet-121), 16 representative defenses, and five commonly used attacks, totaling over 3,000 experiments, we demonstrate that defense effectiveness varies substantially across different evaluation setups. We identify critical gaps in current evaluation practices, including insufficient reporting of computational overhead and behavior under benign conditions, bias in hyperparameter selection, and incomplete experimentation. Based on our findings, we provide concrete challenges and well-motivated recommendations to standardize and improve future defense evaluations. Our work aims to equip researchers and industry practitioners with actionable insights for developing, assessing, and deploying defenses to different systems. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2511.13143 [cs.CR] (or arXiv:2511.13143v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2511.13143 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-57] Conditional Diffusion Model for Multi-Agent Dynamic Task Decomposition AAAI2026

【速读】:该论文旨在解决复杂协作多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)任务中,动态任务分解(Dynamic Task Decomposition)从零开始学习时所需大量训练样本的问题,尤其是在部分可观测环境下探索巨大联合动作空间的挑战。解决方案的关键在于提出一种两层分层MARL框架——条件扩散模型用于动态任务分解(Conditional Diffusion Model for Dynamic Task Decomposition, C D³T),其中高层策略通过预测下一观测和奖励来学习子任务表示,并据此生成子任务选择策略;低层则通过协作学习和共享专业化技能实现子任务执行。此外,所学子任务表示作为语义信息嵌入多头注意力混合网络,增强价值分解并建立个体与联合价值函数之间的高效推理桥梁,从而显著提升长时程任务在动态不确定环境中的学习效率与性能表现。

链接: https://arxiv.org/abs/2511.13137
作者: Yanda Zhu,Yuanyang Zhu,Daoyi Dong,Caihua Chen,Chunlin Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: AAAI 2026

点击查看摘要

Abstract:Task decomposition has shown promise in complex cooperative multi-agent reinforcement learning (MARL) tasks, which enables efficient hierarchical learning for long-horizon tasks in dynamic and uncertain environments. However, learning dynamic task decomposition from scratch generally requires a large number of training samples, especially exploring the large joint action space under partial observability. In this paper, we present the Conditional Diffusion Model for Dynamic Task Decomposition (C \textD^\text3 T), a novel two-level hierarchical MARL framework designed to automatically infer subtask and coordination patterns. The high-level policy learns subtask representation to generate a subtask selection strategy based on subtask effects. To capture the effects of subtasks on the environment, C \textD^\text3 T predicts the next observation and reward using a conditional diffusion model. At the low level, agents collaboratively learn and share specialized skills within their assigned subtasks. Moreover, the learned subtask representation is also used as additional semantic information in a multi-head attention mixing network to enhance value decomposition and provide an efficient reasoning bridge between individual and joint value functions. Experimental results on various benchmarks demonstrate that C \textD^\text3 T achieves better performance than existing baselines.
zh

[AI-58] Soft Conflict-Resolution Decision Transformer for Offline Multi-Task Reinforcement Learning

【速读】:该论文针对多任务强化学习(Multi-task Reinforcement Learning, MTRL)中因任务间梯度冲突导致的知识共享受限与学习效率低下问题展开研究。现有基于掩码的方法采用粗粒度的二值掩码抑制冲突参数,但存在过度压制关键参数、阻碍跨任务知识迁移的问题;同时,固定稀疏度策略无法适配不同任务间差异化的冲突水平,难以兼顾训练稳定性和性能表现。解决方案的关键在于提出SoCo-DT方法:首先利用Fisher信息动态调整掩码值,实现对重要参数的保留与冲突参数的抑制;其次引入基于四分位距(Interquartile Range, IQR)的任务特异性阈值机制,根据冲突与和谐得分分布自适应设定稀疏度;最后结合非对称余弦退火调度策略,使阈值在训练过程中持续演化,从而实现更精细的冲突缓解和更高效的多任务协同学习。

链接: https://arxiv.org/abs/2511.13133
作者: Shudong Wang,Xinfei Wang,Chenhao Zhang,Shanchen Pang,Haiyuan Gui,Wenhao Ji,Xiaojian Liao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-task reinforcement learning (MTRL) seeks to learn a unified policy for diverse tasks, but often suffers from gradient conflicts across tasks. Existing masking-based methods attempt to mitigate such conflicts by assigning task-specific parameter masks. However, our empirical study shows that coarse-grained binary masks have the problem of over-suppressing key conflicting parameters, hindering knowledge sharing across tasks. Moreover, different tasks exhibit varying conflict levels, yet existing methods use a one-size-fits-all fixed sparsity strategy to keep training stability and performance, which proves inadequate. These limitations hinder the model’s generalization and learning efficiency. To address these issues, we propose SoCo-DT, a Soft Conflict-resolution method based by parameter importance. By leveraging Fisher information, mask values are dynamically adjusted to retain important parameters while suppressing conflicting ones. In addition, we introduce a dynamic sparsity adjustment strategy based on the Interquartile Range (IQR), which constructs task-specific thresholding schemes using the distribution of conflict and harmony scores during training. To enable adaptive sparsity evolution throughout training, we further incorporate an asymmetric cosine annealing schedule to continuously update the threshold. Experimental results on the Meta-World benchmark show that SoCo-DT outperforms the state-of-the-art method by 7.6% on MT50 and by 10.5% on the suboptimal dataset, demonstrating its effectiveness in mitigating gradient conflicts and improving overall multi-task performance. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2511.13133 [cs.LG] (or arXiv:2511.13133v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.13133 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-59] Synthetic Forgetting without Access: A Few-shot Zero-glance Framework for Machine Unlearning

【速读】:该论文旨在解决机器遗忘(machine unlearning)中一个现实且具有挑战性的场景——少样本零瞥(few-shot zero-glance)设置下的隐私合规问题,即在仅能访问少量保留数据、且完全无法获取需遗忘数据(forget set)的情况下,如何有效消除模型对特定数据的记忆。其解决方案的关键在于提出GFOES框架,该框架包含生成式反馈网络(Generative Feedback Network, GFN)和两阶段微调策略:GFN通过合成最优擦除样本(Optimal Erasure Samples, OES),诱导目标类别高损失,从而在不依赖原始遗忘数据的前提下实现类特定知识的遗忘;两阶段微调则先在第一阶段实现激进遗忘,再于第二阶段恢复保留类别的性能,最终在三个图像分类数据集上验证了该方法在logit和表示层均能实现有效遗忘,同时仅使用5%的原始数据即可保持良好性能。

链接: https://arxiv.org/abs/2511.13116
作者: Qipeng Song,Nan Yang,Ziqi Xu,Yue Li,Wei Shao,Feng Xia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Machine unlearning aims to eliminate the influence of specific data from trained models to ensure privacy compliance. However, most existing methods assume full access to the original training dataset, which is often impractical. We address a more realistic yet challenging setting: few-shot zero-glance, where only a small subset of the retained data is available and the forget set is entirely inaccessible. We introduce GFOES, a novel framework comprising a Generative Feedback Network (GFN) and a two-phase fine-tuning procedure. GFN synthesises Optimal Erasure Samples (OES), which induce high loss on target classes, enabling the model to forget class-specific knowledge without access to the original forget data, while preserving performance on retained classes. The two-phase fine-tuning procedure enables aggressive forgetting in the first phase, followed by utility restoration in the second. Experiments on three image classification datasets demonstrate that GFOES achieves effective forgetting at both logit and representation levels, while maintaining strong performance using only 5% of the original data. Our framework offers a practical and scalable solution for privacy-preserving machine learning under data-constrained conditions.
zh

[AI-60] Self-Adaptive Graph Mixture of Models

【速读】:该论文旨在解决当前图神经网络(Graph Neural Networks, GNNs)性能提升趋于饱和的问题,尤其是面对不同图任务或数据集时难以选择最优模型的挑战。其解决方案的关键在于提出Self-Adaptive Graph Mixture of Models (SAGMM),该框架通过引入架构多样性与拓扑感知的注意力门控机制,实现对每个节点自适应地分配最合适的GNN专家模型,并结合剪枝策略在不牺牲性能的前提下提升计算效率;此外,还设计了基于预训练专家模型的高效训练变体,仅微调门控和任务特定层,从而显著降低训练成本并增强实用性。

链接: https://arxiv.org/abs/2511.13062
作者: Mohit Meena(1),Yash Punjabi(1),Abhishek A(1),Vishal Sharma(1),Mahesh Chandran(1) ((1) Fujitsu Research of India, Bangalore)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, 5 figures

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have emerged as powerful tools for learning over graph-structured data, yet recent studies have shown that their performance gains are beginning to plateau. In many cases, well-established models such as GCN and GAT, when appropriately tuned, can match or even exceed the performance of more complex, state-of-the-art architectures. This trend highlights a key limitation in the current landscape: the difficulty of selecting the most suitable model for a given graph task or dataset. To address this, we propose Self-Adaptive Graph Mixture of Models (SAGMM), a modular and practical framework that learns to automatically select and combine the most appropriate GNN models from a diverse pool of architectures. Unlike prior mixture-of-experts approaches that rely on variations of a single base model, SAGMM leverages architectural diversity and a topology-aware attention gating mechanism to adaptively assign experts to each node based on the structure of the input graph. To improve efficiency, SAGMM includes a pruning mechanism that reduces the number of active experts during training and inference without compromising performance. We also explore a training-efficient variant in which expert models are pretrained and frozen, and only the gating and task-specific layers are trained. We evaluate SAGMM on 16 benchmark datasets covering node classification, graph classification, regression, and link prediction tasks, and demonstrate that it consistently outperforms or matches leading GNN baselines and prior mixture-based methods, offering a robust and adaptive solution for real-world graph learning.
zh

[AI-61] MACKO: Sparse Matrix-Vector Multiplication for Low Sparsity

【速读】:该论文旨在解决稀疏矩阵-向量乘法(Sparse Matrix-Vector Multiplication, SpMV)在低且非结构化稀疏度(30%-90%)场景下性能不佳的问题,这限制了剪枝后大语言模型(Large Language Models, LLMs)的内存压缩和加速效果。其核心解决方案是提出MACKO-SpMV,一种GPU优化的格式与内核协同设计方法,在不依赖专用硬件单元(如张量核心)或特定格式预计算的前提下,显著降低存储开销并保持对GPU执行模型的兼容性。实验表明,当稀疏度为50%时,MACKO-SpMV相比密集表示实现1.5倍内存减少和1.2–1.5倍加速,相较cuSPARSE、Sputnik和DASP等基线方法分别提升2.8–13.0x、1.9–2.6x和2.2–2.5x,从而使得在真实LLM工作负载中采用50%非结构化剪枝成为可行方案。

链接: https://arxiv.org/abs/2511.13061
作者: Vladimír Macko,Vladimír Boža
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS)
备注: 8 pages + 7 pages appendix, 11 figures, Code available at this https URL

点击查看摘要

Abstract:Sparse Matrix-Vector Multiplication (SpMV) is a fundamental operation in the inference of sparse Large Language Models (LLMs). Because existing SpMV methods perform poorly under the low and unstructured sparsity (30-90%) commonly observed in pruned LLMs, unstructured pruning provided only limited memory reduction and speedup. We propose MACKO-SpMV, a GPU-optimized format and kernel co-designed to reduce storage overhead while preserving compatibility with the GPU’s execution model. This enables efficient SpMV for unstructured sparsity without specialized hardware units (e.g., tensor cores) or format-specific precomputation. Empirical results show that at sparsity 50%, MACKO is the first approach with significant 1.5x memory reduction and 1.2-1.5x speedup over dense representation. Speedups over other SpMV baselines: 2.8-13.0x over cuSPARSE, 1.9-2.6x over Sputnik, and 2.2-2.5x over DASP. Applied to Llama2-7B pruned with Wanda to sparsity 50%, it delivers 1.5x memory reduction and 1.5x faster inference at fp16 precision. Thanks to MACKO, unstructured pruning at 50% sparsity is now justified in real-world LLM workloads.
zh

[AI-62] Latency and Ordering Effects in Online Decisions

【速读】:该论文旨在解决在线决策系统在延迟反馈(latency)和顺序敏感性(order-sensitivity,即非交换性 dynamics)环境下性能分析与优化的难题。这类系统中,动作会影响观测数据的到达时机与顺序,从而导致传统凸优化框架下的损失基准失效。解决方案的关键在于构建一个结构化的下界不等式: excess benchmark loss ≥ Lₘᵢdₑₐₗ + g₁(λ) + g₂(ε⋆) + g₁₂(λ, ε⋆) − Dₙ꜀ₓ,其中 g₁ 和 g₂ 分别量化延迟和顺序敏感性的惩罚项,g₁₂ 表征二者之间的几何交互作用,而 Dₙ꜀ₓ 为非凸性或近似误差项,在凸 Legendre 假设下可消失。该框架进一步扩展至 prox-regular 和弱凸场景,提供超越凸假设的鲁棒保证,并通过简单的 2×2 随机实验和流式诊断指标(如有效样本量、截断率、交互热图)实现对四类效应的实时估计与监控,从而将异构延迟、非交换性和实现差距效应统一为可解释、可测试、可调优的下界表达式。

链接: https://arxiv.org/abs/2511.13060
作者: Duo Yi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Online decision systems routinely operate under delayed feedback and order-sensitive (noncommutative) dynamics: actions affect which observations arrive, and in what sequence. Taking a Bregman divergence D_\Phi as the loss benchmark, we prove that the excess benchmark loss admits a structured lower bound L \ge L_\mathrmideal + g_1(\lambda) + g_2(\varepsilon_\star) + g_12(\lambda,\varepsilon_\star) - D_\mathrmncx , where g_1 and g_2 are calibrated penalties for latency and order-sensitivity, g_12 captures their geometric interaction, and D_\mathrmncx\ge 0 is a nonconvexity/approximation penalty that vanishes under convex Legendre assumptions. We extend this inequality to prox-regular and weakly convex settings, obtaining robust guarantees beyond the convex case. We also give an operational recipe for estimating and monitoring the four terms via simple 2\times 2 randomized experiments and streaming diagnostics (effective sample size, clipping rate, interaction heatmaps). The framework packages heterogeneous latency, noncommutativity, and implementation-gap effects into a single interpretable lower-bound statement that can be stress-tested and tuned in real-world systems.
zh

[AI-63] Dimension vs. Precision: A Comparative Analysis of Autoencoders and Quantization for Efficient Vector Retrieval on BEIR SciFact

【速读】:该论文旨在解决高维、高精度(float32)向量嵌入在实际部署中带来的存储与内存瓶颈问题。其核心解决方案是系统性地评估两种压缩策略:一是通过深度自编码器(Autoencoder, AE)进行维度降低,将原始384维向量压缩至12–384维的潜在空间;二是通过量化(Quantization)降低数值精度,包括float16、int8和二值化(binary)三种方式。关键发现为:int8标量量化在实现4倍压缩比的同时仅带来约1–2%的nDCG@10性能损失,是最优平衡点;而自编码器在同等压缩比下性能下降更显著,二值化则导致灾难性性能退化,不适用于该任务。

链接: https://arxiv.org/abs/2511.13057
作者: Satyanarayan Pati(Involead Services Pvt Ltd, Delhi, India)
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 16 pages, 9 figures, 1 table

点击查看摘要

Abstract:Dense retrieval models have become a standard for state-of-the-art information retrieval. However, their high-dimensional, high-precision (float32) vector embeddings create significant storage and memory challenges for real-world deployment. To address this, we conduct a rigorous empirical study on the BEIR SciFact benchmark, evaluating the trade-offs between two primary compression strategies: (1) Dimensionality Reduction via deep Autoencoders (AE), reducing original 384-dim vectors to latent spaces from 384 down to 12, and (2) Precision Reduction via Quantization (float16, int8, and binary). We systematically compare each method by measuring the “performance loss” (or gain) relative to a float32 baseline across a full suite of retrieval metrics (NDCG, MAP, MRR, Recall, Precision) at various k cutoffs. Our results show that int8 scalar quantization provides the most effective “sweet spot,” achieving a 4x compression with a negligible [~1-2%] drop in nDCG@10. In contrast, Autoencoders show a graceful degradation but suffer a more significant performance loss at equivalent 4x compression ratios (AE-96). binary quantization was found to be unsuitable for this task due to catastrophic performance drops. This work provides a practical guide for deploying efficient, high-performance retrieval systems.
zh

[AI-64] Learning from the Undesirable: Robust Adaptation of Language Models without Forgetting AAAI2026

【速读】:该论文旨在解决语言模型(Language Models, LMs)在监督微调(Supervised Fine-Tuning, SFT)过程中因训练数据有限而导致的过拟合问题,这种过拟合会使模型依赖任务中的虚假关联或损害其预训练阶段获得的广泛能力。解决方案的关键在于提出一种名为“从不良行为中学习”(Learning-from-the-Undesirable, LfU)的一致性正则化方法:通过引入对“不良更新”(即导致模型向不期望行为偏移的梯度上升步骤)后的内部表示进行对齐,从而引导微调过程趋向于鲁棒性强的解空间。LfU利用不良更新作为表示层面的数据增强手段,有效提升了模型在小样本条件下的泛化能力,并在多个下游任务中展现出更强的适应性和稳定性。

链接: https://arxiv.org/abs/2511.13052
作者: Yunhun Nam,Jaehyung Kim,Jongheon Jeong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages; AAAI 2026; Code is available at this https URL

点击查看摘要

Abstract:Language models (LMs) are often adapted through supervised fine-tuning (SFT) to specialize their capabilities for downstream tasks. However, in typical scenarios where the fine-tuning data is limited, e.g., compared to pre-training, SFT can lead LMs to overfit, causing them to rely on spurious patterns within the target task or to compromise other broadly useful capabilities as a side effect of narrow specialization. In this paper, we propose Learning-from-the-Undesirable (LfU), a simple yet effective regularization scheme for SFT to mitigate overfitting issues when fine-tuning LMs with limited data. Specifically, we aim to regularize the fine-tuning process to favor solutions that are resilient to “undesirable” model updates, e.g., gradient ascent steps that steer the model toward undesirable behaviors. To this end, we propose a novel form of consistency regularization that directly aligns internal representations of the model with those after an undesirable update. By leveraging representation-level data augmentation through undesirable updates, LfU effectively promotes generalization under limited data. Our experiments on diverse LM downstream tasks show that LfU serves as an effective prior that enhances adaptability while preserving pretrained knowledge. For example, our LM from LfU achieves a 16.8% average improvement on math tasks compared to vanilla SFT on the same dataset, where the latter even leads to degraded performance on those tasks. Furthermore, LfU exhibits improved robustness to prompt variations, e.g., yielding a 92.1% lower standard deviation in output performances compared to SFT, highlighting its versatile effects.
zh

[AI-65] One-Step Generative Policies with Q-Learning: A Reformulation of MeanFlow AAAI2026

【速读】:该论文旨在解决离线强化学习(offline reinforcement learning)中策略生成效率与表达能力之间的矛盾问题:一方面,单步高斯策略虽推理速度快,但难以建模复杂、多模态的动作分布;另一方面,基于流(flow-based)的方法虽能提升表达能力,却通常依赖蒸馏和两阶段训练,导致训练流程复杂且不稳定。解决方案的关键在于对MeanFlow进行残差重构(residual reformulation),将速度场(velocity field)与噪声到动作的映射集成到单一策略网络中,从而实现直接的噪声到动作生成,无需额外的速度估计模块。这一设计不仅保留了单步生成的高效性,还显著提升了策略对多模态动作分布的建模能力,并支持在Q-learning框架下完成单阶段稳定训练。

链接: https://arxiv.org/abs/2511.13035
作者: Zeyuan Wang,Da Li,Yulin Chen,Ye Shi,Liang Bai,Tianyuan Yu,Yanwei Fu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted in AAAI 2026 Poster

点击查看摘要

Abstract:We introduce a one-step generative policy for offline reinforcement learning that maps noise directly to actions via a residual reformulation of MeanFlow, making it compatible with Q-learning. While one-step Gaussian policies enable fast inference, they struggle to capture complex, multimodal action distributions. Existing flow-based methods improve expressivity but typically rely on distillation and two-stage training when trained with Q-learning. To overcome these limitations, we propose to reformulate MeanFlow to enable direct noise-to-action generation by integrating the velocity field and noise-to-action transformation into a single policy network-eliminating the need for separate velocity estimation. We explore several reformulation variants and identify an effective residual formulation that supports expressive and stable policy learning. Our method offers three key advantages: 1) efficient one-step noise-to-action generation, 2) expressive modelling of multimodal action distributions, and 3) efficient and stable policy learning via Q-learning in a single-stage training setup. Extensive experiments on 73 tasks across the OGBench and D4RL benchmarks demonstrate that our method achieves strong performance in both offline and offline-to-online reinforcement learning settings. Code is available at this https URL.
zh

[AI-66] Scaling Generative Verifiers For Natural Language Mathematical Proof Verification And Selection

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在数学推理任务中存在“表面正确但实质错误”的问题,即模型虽能生成看似合理的最终答案,但其推理过程缺乏严谨性,尤其在需要形式化证明的场景下表现不佳。核心挑战在于如何构建可靠的证明验证机制以提升模型输出的数学有效性。解决方案的关键在于:首先,通过多基准评估避免单一指标导致的误判;其次,将两种主流生成式验证方法——GenSelect 和 LLM-as-a-Judge 进行大规模扩展(达数百万 token),并发现二者结合可形成最有效的解法验证与选择框架;最后,揭示提示工程对 LLM-as-a-Judge 性能的影响,并指出强化学习虽能降低这种敏感性,却无法提升最终答案的准确性,表明当前模型更倾向于奖励形式上的正确性而非实质数学有效性。这一发现为设计可扩展、高可信度的数学推理系统提供了实践指导。

链接: https://arxiv.org/abs/2511.13027
作者: Sadegh Mahdavi,Branislav Kisacanin,Shubham Toshniwal,Wei Du,Ivan Moshkov,George Armstrong,Renjie Liao,Christos Thrampoulidis,Igor Gitman
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models have achieved remarkable success on final-answer mathematical problems, largely due to the ease of applying reinforcement learning with verifiable rewards. However, the reasoning underlying these solutions is often flawed. Advancing to rigorous proof-based mathematics requires reliable proof verification capabilities. We begin by analyzing multiple evaluation setups and show that focusing on a single benchmark can lead to brittle or misleading conclusions. To address this, we evaluate both proof-based and final-answer reasoning to obtain a more reliable measure of model performance. We then scale two major generative verification methods (GenSelect and LLM-as-a-Judge) to millions of tokens and identify their combination as the most effective framework for solution verification and selection. We further show that the choice of prompt for LLM-as-a-Judge significantly affects the model’s performance, but reinforcement learning can reduce this sensitivity. However, despite improving proof-level metrics, reinforcement learning does not enhance final-answer precision, indicating that current models often reward stylistic or procedural correctness rather than mathematical validity. Our results establish practical guidelines for designing and evaluating scalable proof-verification and selection systems.
zh

[AI-67] SLMQuant:Benchmarking Small Language Model Quantization for Practical Deployment

【速读】:该论文旨在解决小语言模型(Small Language Models, SLMs)在边缘设备上部署时因模型压缩效率不足而面临的挑战,特别是量化(quantization)技术在SLMs上的适用性尚未被充分探索。研究发现,SLMs与大语言模型(Large Language Models, LLMs)在量化敏感性方面存在根本差异,直接迁移LLM优化的量化方法会导致次优效果,这主要源于SLMs独特的架构特性和训练动态。解决方案的关键在于提出首个系统性的基准测试框架SLMQuant,通过多维度评估不同量化方法在多种架构和任务下的表现,识别出影响SLMs有效量化的关键因素,并据此提炼出面向SLMs的可操作压缩设计原则,从而为资源受限场景下轻量级语言模型的高效部署提供理论基础与实践指导。

链接: https://arxiv.org/abs/2511.13023
作者: Jiacheng Wang,Yejun Zeng,Jinyang Guo,Yuqing Ma,Aishan Liu,Xianglong Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite the growing interest in Small Language Models (SLMs) as resource-efficient alternatives to Large Language Models (LLMs), their deployment on edge devices remains challenging due to unresolved efficiency gaps in model compression. While quantization has proven effective for LLMs, its applicability to SLMs is significantly underexplored, with critical questions about differing quantization bottlenecks and efficiency profiles. This paper introduces SLMQuant, the first systematic benchmark for evaluating LLM compression techniques when applied to SLMs. Through comprehensive multi-track evaluations across diverse architectures and tasks, we analyze how state-of-the-art quantization methods perform on SLMs. Our findings reveal fundamental disparities between SLMs and LLMs in quantization sensitivity, demonstrating that direct transfer of LLM-optimized techniques leads to suboptimal results due to SLMs’ unique architectural characteristics and training dynamics. We identify key factors governing effective SLM quantization and propose actionable design principles for SLM-tailored compression. SLMQuant establishes a foundational framework for advancing efficient SLM deployment on low-end devices in edge applications, and provides critical insights for deploying lightweight language models in resource-constrained scenarios.
zh

[AI-68] Are Graph Transformers Necessary? Efficient Long-Range Message Passing with Fractal Nodes in MPNNs AAAI2026

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在处理图结构数据时难以平衡局部与全局信息的问题,尤其是传统消息传递神经网络(Message Passing Neural Networks, MPNNs)因缺乏有效机制而难以建模长程依赖,而图Transformer虽能捕捉长程交互却常忽略MPNN的局部性与计算效率。其解决方案的关键在于提出“分形节点”(fractal nodes)的概念——通过图划分自然诱导的分形结构,使分形节点与原始节点共存,并自适应聚合子图级特征表示,从而在每个子图内强制特征相似性;同时,分形节点提供直接的短路连接,缓解了过度挤压问题(over-squashing),提升了MPNN对长程依赖的建模能力,在保持MPNN计算效率的同时达到或超越图Transformer的性能表现。

链接: https://arxiv.org/abs/2511.13010
作者: Jeongwhan Choi,Seungjun Park,Sumin Park,Sung-Bae Cho,Noseong Park
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted in AAAI 2026 for Oral Representation. This is the extended version including the appendix

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have emerged as powerful tools for learning on graph-structured data, but often struggle to balance local and global information. While graph Transformers aim to address this by enabling long-range interactions, they often overlook the inherent locality and efficiency of Message Passing Neural Networks (MPNNs). We propose a new concept called fractal nodes, inspired by the fractal structure observed in real-world networks. Our approach is based on the intuition that graph partitioning naturally induces fractal structure, where subgraphs often reflect the connectivity patterns of the full graph. Fractal nodes are designed to coexist with the original nodes and adaptively aggregate subgraph-level feature representations, thereby enforcing feature similarity within each subgraph. We show that fractal nodes alleviate the over-squashing problem by providing direct shortcut connections that enable long-range propagation of subgraph-level representations. Experiment results show that our method improves the expressive power of MPNNs and achieves comparable or better performance to graph Transformers while maintaining the computational efficiency of MPNN by improving the long-range dependencies of MPNN.
zh

[AI-69] GEM: Generative Entropy-Guided Preference Modeling for Few-shot Alignment of LLM s AAAI2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在低资源和领域特定场景下,因缺乏大规模人类偏好标注而难以对齐人类偏好的问题。传统方法依赖监督式奖励模型或外部裁判,但在医学、法律等专业领域往往不可行。其解决方案的关键在于提出一种生成式熵引导偏好建模方法(Generative Entropy-guided Preference Modeling, GEM),通过构建一个基于熵理论的闭环认知优化框架,使LLM能够内化多维细粒度的认知信号并自主评估偏好。具体而言,GEM的核心创新包括:1)基于思维链(Chain-of-Thought, CoT)提示生成多样候选推理路径,并利用基于熵的token评分机制筛选高置信度与高熵token;2)设计自评估组优势算法(Self-evaluated Group Advantage, SEGA),将熵得分转化为隐式奖励用于策略优化,从而实现仅用少量偏好数据即可高效完成模型对齐。

链接: https://arxiv.org/abs/2511.13007
作者: Yiyang Zhao,Huiyu Bai,Xuejiao Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This paper has been accepted by AAAI 2026-AIA and designated as an oral presentation paper

点击查看摘要

Abstract:Alignment of large language models (LLMs) with human preferences typically relies on supervised reward models or external judges that demand abundant annotations. However, in fields that rely on professional knowledge, such as medicine and law, such large-scale preference labels are often unachievable. In this paper, we propose a generative entropy-guided preference modeling approach named GEM for LLMs aligment at low-resource and domain-specific scenarios. Instead of training a discriminative reward model on preference data, we directly train the LLM to internalize a closed-loop optimization architecture that can extract and exploit the multi-dimensional, fine-grained cognitive signals implicit in human preferences. Specifically, our Cognitive Filtering module, based on entropy theory in decision making, first leverages Chain-of-Thought (CoT) prompting to generate diverse candidate reasoning chains (CoTs) from preference data. Subsequently, it introduces a token scoring mechanism to rank and weight the sampled CoTs, boosting the importance of high-confidence answers and strategically high-entropy tokens. Building on these filtered preferences, we fine-tune the LLM using a novel self-evaluated group advantage algorithm, SEGA, which effectively aggregates group-level cognitive signals and transforms the entropy-based scores into implicit rewards for policy optimization. In these ways, GEM empowers the LLM to rely on its own judgments and establishes an entropy-guided closed-loop cognitive optimization framework, enabling highly efficient few-shot alignment of LLMs. Experiments on general benchmarks and domain-specific tasks (such as mathematical reasoning and medical dialogues) demonstrate that our GEM achieves significant improvements with few-shot preference data.
zh

[AI-70] Learning Branching Policies for MILPs with Proximal Policy Optimization AAAI

【速读】:该论文旨在解决分支定界法(Branch-and-Bound, B&B)在求解大规模混合整数线性规划(Mixed Integer Linear Program, MILP)问题时因指数时间复杂度导致的计算效率瓶颈,以及现有基于模仿学习(Imitation Learning, IL)的分支策略泛化能力不足的问题。其解决方案的关键在于提出一种基于强化学习(Reinforcement Learning, RL)的新框架——Tree-Gate Proximal Policy Optimization (TGPPO),该框架采用近端策略优化(Proximal Policy Optimization, PPO)算法训练分支策略,并构建一个参数化的状态空间表示来动态捕捉搜索树的演化上下文,从而提升模型在结构多样或未见实例上的适应性和鲁棒性。

链接: https://arxiv.org/abs/2511.12986
作者: Abdelouahed Ben Mhamed,Assia Kamal-Idrissi,Amal El Fallah Seghrouchni
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: 11 pages, 3 figures, AAAI conference

点击查看摘要

Abstract:Branch-and-Bound (B\B) is the dominant exact solution method for Mixed Integer Linear Programs (MILP), yet its exponential time complexity poses significant challenges for large-scale instances. The growing capabilities of machine learning have spurred efforts to improve B\B by learning data-driven branching policies. However, most existing approaches rely on Imitation Learning (IL), which tends to overfit to expert demonstrations and struggles to generalize to structurally diverse or unseen instances. In this work, we propose Tree-Gate Proximal Policy Optimization (TGPPO), a novel framework that employs Proximal Policy Optimization (PPO), a Reinforcement Learning (RL) algorithm, to train a branching policy aimed at improving generalization across heterogeneous MILP instances. Our approach builds on a parameterized state space representation that dynamically captures the evolving context of the search tree. Empirical evaluations show that TGPPO often outperforms existing learning-based policies in terms of reducing the number of nodes explored and improving p-Primal-Dual Integrals (PDI), particularly in out-of-distribution instances. These results highlight the potential of RL to develop robust and adaptable branching strategies for MILP solvers.
zh

[AI-71] Esim: EVM Bytecode Similarity Detection Based on Stable-Semantic Graph

【速读】:该论文旨在解决去中心化金融(DeFi)生态系统中智能合约代码复用广泛且开源贡献有限所引发的代码剽窃与漏洞传播问题,尤其针对EVM(以太坊虚拟机)字节码层面的相似性检测难题。传统基于指令流或控制流图(CFG)的二进制相似性检测方法在EVM字节码上效果受限,因其具有低级特性及大量重用的基本块,并受Solidity编译器(Solc)版本多样性影响。论文提出的关键解决方案是构建一种新的EVM字节码表示——稳定语义图(Stable-Semantic Graph, SSG),该结构捕捉“稳定指令”(stable instructions)之间的关系;并实现原型工具Esim,通过将SSG嵌入矩阵并采用异构图神经网络进行相似性检测,显著提升了准确率(AUC达96.3%),优于现有最优工具Etherscan,在大规模分析267万份智能合约时验证了其有效性。

链接: https://arxiv.org/abs/2511.12971
作者: Zhuo Chen,Gaoqiang Ji,Yiling He,Lei Wu,Yajin Zhou
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Decentralized finance (DeFi) is experiencing rapid expansion. However, prevalent code reuse and limited open-source contributions have introduced significant challenges to the blockchain ecosystem, including plagiarism and the propagation of vulnerable code. Consequently, an effective and accurate similarity detection method for EVM bytecode is urgently needed to identify similar contracts. Traditional binary similarity detection methods are typically based on instruction stream or control flow graph (CFG), which have limitations on EVM bytecode due to specific features like low-level EVM bytecode and heavily-reused basic blocks. Moreover, the highly-diverse Solidity Compiler (Solc) versions further complicate accurate similarity detection. Motivated by these challenges, we propose a novel EVM bytecode representation called Stable-Semantic Graph (SSG), which captures relationships between ‘stable instructions’ (special instructions identified by our study). Moreover, we implement a prototype, Esim, which embeds SSG into matrices for similarity detection using a heterogeneous graph neural network. Esim demonstrates high accuracy in SSG construction, achieving F1-scores of 100% for control flow and 95.16% for data flow, and its similarity detection performance reaches 96.3% AUC, surpassing traditional approaches. Our large-scale study, analyzing 2,675,573 smart contracts on six EVM-compatible chains over a one-year period, also demonstrates that Esim outperforms the SOTA tool Etherscan in vulnerability search. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2511.12971 [cs.CR] (or arXiv:2511.12971v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2511.12971 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-72] MedRule-KG: A Knowledge-Graph–Steered Scaffold for Reliable Mathematical and Biomedical Reasoning AAAI2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在科学推理与早期药物发现任务中缺乏领域一致性结构的问题,即模型生成结果可能违反数学或生物医学领域的基本规则。解决方案的关键在于提出 MedRule-KG,一个紧凑的知识图谱框架(knowledge-graph scaffold)与轻量级验证器(lightweight verifier)相结合的系统:通过在提示中注入精心筛选的符号事实,并利用确定性检查器强制执行规则约束,将生成过程形式化为带约束的推理(constrained inference),同时引入适用于解码的软引导代理(soft guidance surrogate)以提升有效性。实验证明,该方法在90项任务中使违规次数减少83.2%,且保持高精确匹配率,验证了其在交互式设计场景中的实用性。

链接: https://arxiv.org/abs/2511.12963
作者: Crystal Su
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: AAAI 2026 Workshop AI2ASE

点击查看摘要

Abstract:We study how to impose domain-consistent structure on large language models (LLMs) used for scientific reasoning and early-stage drug discovery. We present MedRule-KG, a compact knowledge-graph scaffold paired with a lightweight verifier that steers generation toward mathematically and biomedically valid outputs. The system injects curated symbolic facts into prompts and then enforces rule satisfaction with a deterministic checker. We formalize generation as constrained inference, introduce a soft guidance surrogate suitable for decoding, and perform a thorough statistical analysis with uncertainty quantification. Across 90 tasks spanning reaction feasibility, metabolic compatibility, and toxicity screening, MedRule-KG reduces violation counts by 83.2% relative to a strong chain-of-thought baseline while improving exact match. Results remain stable under stratification and scale with dataset size, and the verifier adds negligible latency, making the approach practical for interactive design.
zh

[AI-73] Global Cross-Time Attention Fusion for Enhanced Solar Flare Prediction from Multivariate Time Series

【速读】:该论文旨在解决太阳耀斑预测中因事件分布极度不均衡(强烈耀斑罕见,微弱耀斑常见)而导致模型学习效果受限的问题。解决方案的关键在于提出一种基于Transformer架构的全局跨时间注意力融合(Global Cross-Time Attention Fusion, GCTAF)机制:通过引入一组可学习的全局交叉注意力令牌(global tokens),这些令牌能总结整个时间序列中的显著时序模式,并通过与输入序列的交叉注意力交互进行优化,最终将提炼后的全局信息融合回时序表示中,从而增强模型对非连续但关键时间点的识别能力,提升对强烈耀斑事件的检测性能。

链接: https://arxiv.org/abs/2511.12955
作者: Onur Vural,Shah Muhammad Hamdi,Soukaina Filali Boubrahimi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This work has been accepted at the 2025 IEEE International Conference on Big Data (IEEE BigData 2025) on October 23, 2025

点击查看摘要

Abstract:Multivariate time series classification is increasingly investigated in space weather research as a means to predict intense solar flare events, which can cause widespread disruptions across modern technological systems. Magnetic field measurements of solar active regions are converted into structured multivariate time series, enabling predictive modeling across segmented observation windows. However, the inherently imbalanced nature of solar flare occurrences, where intense flares are rare compared to minor flare events, presents a significant barrier to effective learning. To address this challenge, we propose a novel Global Cross-Time Attention Fusion (GCTAF) architecture, a transformer-based model to enhance long-range temporal modeling. Unlike traditional self-attention mechanisms that rely solely on local interactions within time series, GCTAF injects a set of learnable cross-attentive global tokens that summarize salient temporal patterns across the entire sequence. These tokens are refined through cross-attention with the input sequence and fused back into the temporal representation, enabling the model to identify globally significant, non-contiguous time points that are critical for flare prediction. This mechanism functions as a dynamic attention-driven temporal summarizer that augments the model’s capacity to capture discriminative flare-related dynamics. We evaluate our approach on the benchmark solar flare dataset and show that GCTAF effectively detects intense flares and improves predictive performance, demonstrating that refining transformer-based architectures presents a high-potential alternative for solar flare prediction tasks.
zh

[AI-74] Privacy-Preserving Federated Learning from Partial Decryption Verifiable Threshold Multi-Client Functional Encryption

【速读】:该论文旨在解决联邦学习(Federated Learning)中的梯度泄露问题,特别是现有基于门限密码学的方案无法保证聚合结果可验证性,从而导致系统易受投毒攻击(Poisoning Attack)的问题。其解决方案的关键在于构建一种部分解密可验证的门限多客户端函数加密方案,并将其应用于联邦学习中,形成可验证门限安全聚合协议(Verifiable Threshold Security Aggregation Protocol for Federated Learning, VTSAFL)。该协议使客户端能够验证聚合结果的真实性,同时通过保持功能密钥和部分解密结果的大小恒定,显著降低计算与通信开销,实现高效且安全的模型聚合,尤其适用于资源受限的物联网(IoT)设备场景。

链接: https://arxiv.org/abs/2511.12936
作者: Minjie Wang,Jinguang Han,Weizhi Meng
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In federated learning, multiple parties can cooperate to train the model without directly exchanging their own private data, but the gradient leakage problem still threatens the privacy security and model integrity. Although the existing scheme uses threshold cryptography to mitigate the inference attack, it can not guarantee the verifiability of the aggregation results, making the system vulnerable to the threat of poisoning attack. We construct a partial decryption verifiable threshold multi client function encryption scheme, and apply it to Federated learning to implement the federated learning verifiable threshold security aggregation protocol (VTSAFL). VTSAFL empowers clients to verify aggregation results, concurrently minimizing both computational and communication overhead. The size of the functional key and partial decryption results of the scheme are constant, which provides efficiency guarantee for large-scale deployment. The experimental results on MNIST dataset show that vtsafl can achieve the same accuracy as the existing scheme, while reducing the total training time by more than 40%, and reducing the communication overhead by up to 50%. This efficiency is critical for overcoming the resource constraints inherent in Internet of Things (IoT) devices.
zh

[AI-75] okenize Once Recommend Anywhere: Unified Item Tokenization for Multi-domain LLM -based Recommendation AAAI AAAI-26

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的推荐系统中,物品标记化(item tokenization)方法存在的两个核心问题:一是现有方法通常需为每个物品领域单独训练模型,缺乏泛化能力;二是不同物品领域间分布和语义差异显著,难以构建能保留领域特异性信息的统一标记化方案。解决方案的关键在于提出UniTok框架,其创新性地结合了混合专家(Mixture-of-Experts, MoE)架构与多代码本(codebooks),通过共享编码器将跨域物品映射至统一潜在空间,再由领域特定专家捕获独特语义,同时引入始终激活的共享专家以编码可迁移的通用知识;此外,设计互信息校准机制以缓解领域间语义不平衡问题,从而实现高效、可扩展且具备强泛化能力的跨域物品标记化。

链接: https://arxiv.org/abs/2511.12922
作者: Yu Hou,Won-Yong Shin
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Social and Information Networks (cs.SI)
备注: 20 pages, 8 figures, 9 tables; Annual AAAI Conference on Artificial Intelligence (AAAI-26) (to appear) (Please cite our conference version.)

点击查看摘要

Abstract:Large language model (LLM)-based recommender systems have achieved high-quality performance by bridging the discrepancy between the item space and the language space through item tokenization. However, existing item tokenization methods typically require training separate models for each item domain, limiting generalization. Moreover, the diverse distributions and semantics across item domains make it difficult to construct a unified tokenization that preserves domain-specific information. To address these challenges, we propose UniTok, a Unified item Tokenization framework that integrates our own mixture-of-experts (MoE) architecture with a series of codebooks to convert items into discrete tokens, enabling scalable tokenization while preserving semantic information across multiple item domains. Specifically, items from different domains are first projected into a unified latent space through a shared encoder. They are then routed to domain-specific experts to capture the unique semantics, while a shared expert, which is always active, encodes common knowledge transferable across domains. Additionally, to mitigate semantic imbalance across domains, we present a mutual information calibration mechanism, which guides the model towards retaining similar levels of semantic information for each domain. Comprehensive experiments on wide-ranging real-world datasets demonstrate that the proposed UniTok framework is (a) highly effective: achieving up to 51.89% improvements over strong benchmarks, (b) theoretically sound: showing the analytical validity of our architectural design and optimization; and © highly generalizable: demonstrating robust performance across diverse domains without requiring per-domain retraining, a capability not supported by existing baselines.
zh

[AI-76] Fault2Flow: An AlphaEvolve-Optimized Human-in-the-Loop Multi-Agent System for Fault-to-Workflow Automation

【速读】:该论文旨在解决电力系统故障诊断中依赖人工、易出错且难以维护的问题,即如何将密集的规章文本与专家经验知识有效整合为可执行的自动化流程。解决方案的关键在于提出Fault2Flow框架,这是一个基于大语言模型(Large Language Models, LLMs)的多智能体系统,其核心创新包括:(1) 将规章逻辑结构化为PASTA格式的故障树;(2) 通过人机协同接口引入专家知识进行验证;(3) 利用AlphaEvolve模块优化推理逻辑;(4) 最终生成可在n8n平台上直接执行的工作流,实现从故障分析到操作自动化的可复现路径。

链接: https://arxiv.org/abs/2511.12916
作者: Yafang Wang,Yangjie Tian,Xiaoyu Shen,Gaoyang Zhang,Jiaze Sun,He Zhang,Ruohua Xu,Feng Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Power grid fault diagnosis is a critical process hindered by its reliance on manual, error-prone methods. Technicians must manually extract reasoning logic from dense regulations and attempt to combine it with tacit expert knowledge, which is inefficient, error-prone, and lacks maintainability as ragulations are updated and experience evolves. While Large Language Models (LLMs) have shown promise in parsing unstructured text, no existing framework integrates these two disparate knowledge sources into a single, verified, and executable workflow. To bridge this gap, we propose Fault2Flow, an LLM-based multi-agent system. Fault2Flow systematically: (1) extracts and structures regulatory logic into PASTA-formatted fault trees; (2) integrates expert knowledge via a human-in-the-loop interface for verification; (3) optimizes the reasoning logic using a novel AlphaEvolve module; and (4) synthesizes the final, verified logic into an n8n-executable workflow. Experimental validation on transformer fault diagnosis datasets confirms 100% topological consistency and high semantic fidelity. Fault2Flow establishes a reproducible path from fault analysis to operational automation, substantially reducing expert workload.
zh

[AI-77] CoS: Towards Optimal Event Scheduling via Chain-of-Scheduling

【速读】:该论文旨在解决事件推荐(Event Recommendation)在基于事件的社交网络(Event-based Social Networks, EBSNs)中的核心挑战,即如何在满足时间与地理约束的前提下,最大化用户的偏好匹配度。现有方法因问题本身的NP-hard特性,在效率、效果和泛化能力之间存在固有权衡。其解决方案的关键在于提出Chain-of-Scheduling(CoS)框架,通过将调度任务分解为探索(Exploration)、验证(Verification)和整合(Integration)三个原子阶段,并利用知识蒸馏(Knowledge Distillation, KD)使大语言模型(Large Language Models, LLMs)能够自主生成CoS流程,从而实现高效率、近理论最优的效果以及可解释性,同时展现出强大的零样本学习能力。

链接: https://arxiv.org/abs/2511.12913
作者: Yiming Zhao,Jiwei Tang,Shimin Di,Libin Zheng,Jianxing Yu,Jian Yin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recommending event schedules is a key issue in Event-based Social Networks (EBSNs) in order to maintain user activity. An effective recommendation is required to maximize the user’s preference, subjecting to both time and geographical constraints. Existing methods face an inherent trade-off among efficiency, effectiveness, and generalization, due to the NP-hard nature of the problem. This paper proposes the Chain-of-Scheduling (CoS) framework, which activates the event scheduling capability of Large Language Models (LLMs) through a guided, efficient scheduling process. CoS enhances LLM by formulating the schedule task into three atomic stages, i.e., exploration, verification and integration. Then we enable the LLMs to generate CoS autonomously via Knowledge Distillation (KD). Experimental results show that CoS achieves near-theoretical optimal effectiveness with high efficiency on three real-world datasets in a interpretable manner. Moreover, it demonstrates strong zero-shot learning ability on out-of-domain data.
zh

[AI-78] LinkedIn Profile Characteristics and Professional Success Indicators

【速读】:该论文试图解决的问题是:如何通过LinkedIn个人资料特征来预测和理解职业成功,具体包括晋升(promotion)、关注者数量(follower count)以及职业进展速率(career progression rate)等指标。其解决方案的关键在于利用超过62,000个匿名LinkedIn用户数据集,采用机器学习技术构建预测模型,识别出对职业成功最具影响力的特征因素,从而为专业人士优化其在线职业形象与职业策略提供可操作的洞见。

链接: https://arxiv.org/abs/2511.12905
作者: Tania-Amanda Fredrick Eneye,Ashlesha Malla,Pawan Paudel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study explores the relationship between LinkedIn profile characteristics and professional success, focusing on the indicators of promotions, follower count, and career progression rate. By leveraging a dataset of over 62,000 anonymized LinkedIn profiles, we developed predictive models using machine learning techniques to identify the most influential factors driving professional success. Results indicate that while promotions are highly predictable, follower growth exhibits greater complexity. This research provides actionable insights for professionals seeking to optimize their LinkedIn presence and career strategies.
zh

[AI-79] Contrastive Entropy Bounds for Density and Conditional Density Decomposition

【速读】:该论文旨在解决神经网络特征可解释性问题,从贝叶斯高斯视角出发,通过优化代价函数来逼近概率界,从而学习一个使该界紧致且代价最优的密度模型,通常采用高斯混合密度(Gaussian Mixture Density)形式。其核心解决方案在于:首先,将自编码器(Autoencoder)的目标等价于最大化一个高斯算子的迹(trace),即在数据与模型分布正交基下所有特征值之和;其次,在无需一一对应关系时,可通过最大化该算子的核范数(nuclear norm),即奇异值之和,以提升整体秩而非仅优化迹,从而为自编码器训练提供新目标函数,并将核范数作为多输出网络(如Mixture Density Networks, MDNs)的散度度量。此外,利用希尔伯特空间中的内积与范数定义边界和代价,引入额外范数以增强样本多样性并避免平凡解(如恒定输出),同时提出编码器-混合器-解码器架构,其中解码器为多输出结构,每样本生成多个中心,进一步收紧上界,并在小方差高斯混合数据假设下实现定量分析。

链接: https://arxiv.org/abs/2511.12903
作者: Bo Hu,Jose C. Principe
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper studies the interpretability of neural network features from a Bayesian Gaussian view, where optimizing a cost is reaching a probabilistic bound; learning a model approximates a density that makes the bound tight and the cost optimal, often with a Gaussian mixture density. The two examples are Mixture Density Networks (MDNs) using the bound for the marginal and autoencoders using the conditional bound. It is a known result, not only for autoencoders, that minimizing the error between inputs and outputs maximizes the dependence between inputs and the middle. We use Hilbert space and decomposition to address cases where a multiple-output network produces multiple centers defining a Gaussian mixture. Our first finding is that an autoencoder’s objective is equivalent to maximizing the trace of a Gaussian operator, the sum of eigenvalues under bases orthonormal w.r.t. the data and model distributions. This suggests that, when a one-to-one correspondence as needed in autoencoders is unnecessary, we can instead maximize the nuclear norm of this operator, the sum of singular values, to maximize overall rank rather than trace. Thus the trace of a Gaussian operator can be used to train autoencoders, and its nuclear norm can be used as divergence to train MDNs. Our second test uses inner products and norms in a Hilbert space to define bounds and costs. Such bounds often have an extra norm compared to KL-based bounds, which increases sample diversity and prevents the trivial solution where a multiple-output network produces the same constant, at the cost of requiring a sample batch to estimate and optimize. We propose an encoder-mixture-decoder architecture whose decoder is multiple-output, producing multiple centers per sample, potentially tightening the bound. Assuming the data are small-variance Gaussian mixtures, this upper bound can be tracked and analyzed quantitatively. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2511.12903 [cs.LG] (or arXiv:2511.12903v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.12903 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-80] Online Learning of HTN Methods for integrated LLM -HTN Planning

【速读】:该论文旨在解决在集成HTN(Hierarchical Task Network)规划与大语言模型(LLM)驱动聊天机器人场景下,如何实现在线学习以减少对LLM的依赖并提升任务分解效率的问题。其解决方案的关键在于:当ChatGPT生成任务分解时,ChatHTN不仅将其作为记忆存储(类似缓存),更重要的是从中提炼出可泛化的通用方法(generalized method),使其不仅能适用于当前具体任务实例,还能推广到同类任务的其他实例,从而显著降低对LLM调用次数,同时保持甚至提升问题求解能力。

链接: https://arxiv.org/abs/2511.12901
作者: Yuesheng Xu,Hector Munoz-Avila
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: The Twelfth Annual Conference on Advances in Cognitive Systems (ACS-2025)

点击查看摘要

Abstract:We present online learning of Hierarchical Task Network (HTN) methods in the context of integrated HTN planning and LLM-based chatbots. Methods indicate when and how to decompose tasks into subtasks. Our method learner is built on top of the ChatHTN planner. ChatHTN queries ChatGPT to generate a decomposition of a task into primitive tasks when no applicable method for the task is available. In this work, we extend ChatHTN. Namely, when ChatGPT generates a task decomposition, ChatHTN learns from it, akin to memoization. However, unlike memoization, it learns a generalized method that applies not only to the specific instance encountered, but to other instances of the same task. We conduct experiments on two domains and demonstrate that our online learning procedure reduces the number of calls to ChatGPT while solving at least as many problems, and in some cases, even more.
zh

[AI-81] owards High-Consistency Embodied World Model with Multi-View Trajectory Videos

【速读】:该论文旨在解决现有具身世界模型(embodied world models)在将低级动作(如关节位置)转化为精确机器人运动时存在的误差问题,从而导致预测帧与真实物理交互不一致。其关键解决方案是提出MTV-World模型,引入多视角轨迹视频(multi-view trajectory-video)作为控制信号,替代传统的低级动作输入;通过利用相机内参和外参以及笛卡尔空间变换生成轨迹视频,并设计多视角框架以补偿单视角投影带来的空间信息损失,从而提升视觉-运动预测的精度和物理交互的一致性。

链接: https://arxiv.org/abs/2511.12882
作者: Taiyi Su,Jian Zhu,Yaxuan Li,Chong Ma,Zitai Huang,Yichen Zhu,Hanli Wang,Yi Xu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 11 pages, 5 figures

点击查看摘要

Abstract:Embodied world models aim to predict and interact with the physical world through visual observations and actions. However, existing models struggle to accurately translate low-level actions (e.g., joint positions) into precise robotic movements in predicted frames, leading to inconsistencies with real-world physical interactions. To address these limitations, we propose MTV-World, an embodied world model that introduces Multi-view Trajectory-Video control for precise visuomotor prediction. Specifically, instead of directly using low-level actions for control, we employ trajectory videos obtained through camera intrinsic and extrinsic parameters and Cartesian-space transformation as control signals. However, projecting 3D raw actions onto 2D images inevitably causes a loss of spatial information, making a single view insufficient for accurate interaction modeling. To overcome this, we introduce a multi-view framework that compensates for spatial information loss and ensures high-consistency with physical world. MTV-World forecasts future frames based on multi-view trajectory videos as input and conditioning on an initial frame per view. Furthermore, to systematically evaluate both robotic motion precision and object interaction accuracy, we develop an auto-evaluation pipeline leveraging multimodal large models and referring video object segmentation models. To measure spatial consistency, we formulate it as an object location matching problem and adopt the Jaccard Index as the evaluation metric. Extensive experiments demonstrate that MTV-World achieves precise control execution and accurate physical interaction modeling in complex dual-arm scenarios.
zh

[AI-82] hink Speak Decide: Language-Augmented Multi-Agent Reinforcement Learning for Economic Decision-Making AAAI2026

【速读】:该论文旨在解决多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)在经济决策中难以有效处理语言的语义模糊性和情境丰富性的问题,从而提升决策的有效性与鲁棒性。其解决方案的关键在于提出了一种名为LAMP(Language-Augmented Multi-Agent Policy)的框架,该框架采用“思考-说话-决策”(Think-Speak-Decide)三阶段管道:首先通过“思考”模块从数值观测中提取短期冲击和长期趋势并缓存高价值推理轨迹;其次通过“说话”模块基于推理生成并交换战略信息,利用对同行沟通的解析更新信念;最后通过“决策”模块融合数值数据、推理过程与反思内容,构建可优化的语言增强型策略。这一设计显著提升了经济模拟中的累积回报、鲁棒性和可解释性,验证了语言增强策略在真实经济场景中的潜力。

链接: https://arxiv.org/abs/2511.12876
作者: Heyang Ma,Qirui Mi,Qipeng Yang,Zijun Fan,Bo Li,Haifeng Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); General Economics (econ.GN)
备注: Extended version of a submission to AAAI 2026

点击查看摘要

Abstract:Economic decision-making depends not only on structured signals such as prices and taxes, but also on unstructured language, including peer dialogue and media narratives. While multi-agent reinforcement learning (MARL) has shown promise in optimizing economic decisions, it struggles with the semantic ambiguity and contextual richness of language. We propose LAMP (Language-Augmented Multi-Agent Policy), a framework that integrates language into economic decision-making and narrows the gap to real-world settings. LAMP follows a Think-Speak-Decide pipeline: (1) Think interprets numerical observations to extract short-term shocks and long-term trends, caching high-value reasoning trajectories; (2) Speak crafts and exchanges strategic messages based on reasoning, updating beliefs by parsing peer communications; and (3) Decide fuses numerical data, reasoning, and reflections into a MARL policy to optimize language-augmented decision-making. Experiments in economic simulation show that LAMP outperforms both MARL and LLM-only baselines in cumulative return (+63.5%, +34.0%), robustness (+18.8%, +59.4%), and interpretability. These results demonstrate the potential of language-augmented policies to deliver more effective and robust economic strategies.
zh

[AI-83] On the Fundamental Limits of LLM s at Scale

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在持续扩展规模时所面临的五个根本性限制问题:幻觉(hallucination)、上下文压缩(context compression)、推理能力退化(reasoning degradation)、检索脆弱性(retrieval fragility)以及多模态对齐偏差(multimodal misalignment)。这些问题虽已被实证观察到,但缺乏从计算理论、信息论和学习理论角度的统一解释。论文的关键解决方案是提出一个基于定理支撑的理论框架,首次形式化了LLM扩展的内在理论上限:首先,通过可计算性与不可计算性的分析揭示误差的不可消除性;其次,利用信息论与统计约束说明即使在可判定任务中也存在精度上限,且长尾知识需要极高的样本复杂度;再次,阐明几何与计算效应导致上下文实际压缩远低于名义长度。此外,论文指出似然训练偏好模式补全而非推理,并揭示检索受限下的语义漂移与耦合噪声机制,以及多模态扩展中的浅层跨模态对齐问题。最终,该框架不仅识别出缩放带来的增益边界,还为实践提供路径如受限oracle检索、位置课程学习及稀疏或分层注意力机制以缓解上述瓶颈。

链接: https://arxiv.org/abs/2511.12869
作者: Muhammad Ahmed Mohsin,Muhammad Umer,Ahsan Bilal,Zeeshan Memon,Muhammad Ibtsaam Qadir,Sagnik Bhattacharya,Hassan Rizwan,Abhiram R. Gorle,Maahe Zehra Kazmi,Ayesha Mohsin,Muhammad Usman Rafique,Zihao He,Pulkit Mehta,Muhammad Ali Jamshed,John M. Cioffi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Multiagent Systems (cs.MA)
备注: Submitted to TMLR 2025

点击查看摘要

Abstract:Large Language Models (LLMs) have benefited enormously from scaling, yet these gains are bounded by five fundamental limitations: (1) hallucination, (2) context compression, (3) reasoning degradation, (4) retrieval fragility, and (5) multimodal misalignment. While existing surveys describe these phenomena empirically, they lack a rigorous theoretical synthesis connecting them to the foundational limits of computation, information, and learning. This work closes that gap by presenting a unified, proof-informed framework that formalizes the innate theoretical ceilings of LLM scaling. First, computability and uncomputability imply an irreducible residue of error: for any computably enumerable model family, diagonalization guarantees inputs on which some model must fail, and undecidable queries (e.g., halting-style tasks) induce infinite failure sets for all computable predictors. Second, information-theoretic and statistical constraints bound attainable accuracy even on decidable tasks, finite description length enforces compression error, and long-tail factual knowledge requires prohibitive sample complexity. Third, geometric and computational effects compress long contexts far below their nominal size due to positional under-training, encoding attenuation, and softmax crowding. We further show how likelihood-based training favors pattern completion over inference, how retrieval under token limits suffers from semantic drift and coupling noise, and how multimodal scaling inherits shallow cross-modal alignment. Across sections, we pair theorems and empirical evidence to outline where scaling helps, where it saturates, and where it cannot progress, providing both theoretical foundations and practical mitigation paths like bounded-oracle retrieval, positional curricula, and sparse or hierarchical attention.
zh

[AI-84] Bootstrapping LLM s via Preference-Based Policy Optimization

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在缺乏大量人工标注数据的情况下,如何通过偏好数据实现行为对齐的问题。传统方法依赖于昂贵的人工标注,而本文提出了一种基于偏好策略优化(Preference-based Policy Optimization, PbPO)的新框架,其关键在于将学习过程建模为一个主策略与奖励模型(Reward Model, RM)之间的最小-最大博弈:RM被约束在一个由偏好数据导出的置信集内以确保可靠利用;同时,迭代在线算法通过引导对演化策略的探索主动收集偏好数据,从而实现策略和RM的持续自我改进。理论分析表明,该方法在序列级和词元级奖励模型设置下均具有高概率的遗憾边界,实验证明其在五个基准测试中显著优于现有最优偏好优化技术。

链接: https://arxiv.org/abs/2511.12867
作者: Chen Jia
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Bootstrapping large language models (LLMs) through preference-based policy optimization offers a promising direction for aligning model behavior with human preferences without relying on extensive manual annotations. In this work, we propose a novel preference-based policy optimization (PbPO) framework that formulates the learning process as a min-max game between the main policy and a reward model (RM). The RM is constrained within a confidence set derived from preference data to ensure reliable exploitation. Our iterative online algorithm actively collects preference data through guided exploration of the evolving policy, enabling continual self-improvement of both the policy and the RM. We provide theoretical guarantees for our method, establishing high-probability regret bounds for both settings with sequence-level RM and token-level RM, demonstrating its effectiveness in bootstrapping LLMs. Extensive experiments on five benchmarks show that our approach consistently outperforms existing state-of-the-art preference optimization techniques.
zh

[AI-85] An approach of deep reinforcement learning for maximizing the net present value of stochastic projects

【速读】:该论文旨在解决具有随机活动持续时间和现金流的项目优化问题,其中活动需满足前置约束并产生现金流入与流出,目标是通过加速流入和延迟流出来最大化期望净现值(Expected Net Present Value, ENPV)。解决方案的关键在于将问题建模为离散时间马尔可夫决策过程(Discrete-Time Markov Decision Process, MDP),并引入双深度Q网络(Double Deep Q-Network, DDQN)方法。该方法通过双网络架构缓解动作价值高估问题,同时利用目标网络提升训练收敛性与鲁棒性,在大规模或高度不确定环境下展现出优于传统静态与动态策略的计算能力、策略可靠性及适应性。

链接: https://arxiv.org/abs/2511.12865
作者: Wei Xu,Fan Yang,Qinyuan Cui,Zhi Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper investigates a project with stochastic activity durations and cash flows under discrete scenarios, where activities must satisfy precedence constraints generating cash inflows and outflows. The objective is to maximize expected net present value (NPV) by accelerating inflows and deferring outflows. We formulate the problem as a discrete-time Markov Decision Process (MDP) and propose a Double Deep Q-Network (DDQN) approach. Comparative experiments demonstrate that DDQN outperforms traditional rigid and dynamic strategies, particularly in large-scale or highly uncertain environments, exhibiting superior computational capability, policy reliability, and adaptability. Ablation studies further reveal that the dual-network architecture mitigates overestimation of action values, while the target network substantially improves training convergence and robustness. These results indicate that DDQN not only achieves higher expected NPV in complex project optimization but also provides a reliable framework for stable and effective policy implementation.
zh

[AI-86] From Black-Box to White-Box: Control-Theoretic Neural Network Interpretability

【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)虽具备卓越性能但缺乏机制可解释性的问题。其解决方案的关键在于构建一个控制理论框架,将训练好的神经网络视为非线性状态空间系统,通过局部线性化、可控性与可观性Gramian矩阵以及Hankel奇异值分析其内部计算过程。具体而言,针对特定输入,在对应的隐藏激活模式下对网络进行局部线性化,构造以隐藏神经元激活为状态的系统模型,并基于输入-状态和状态-输出雅可比矩阵定义局部可控性和可观性Gramian,进而计算Hankel奇异值及其关联模态,从而提供一种原理性的神经元与路径重要性度量:可控性反映神经元受输入扰动激发的难易程度,可观性衡量神经元对输出的影响强度,Hankel奇异值则按输入到输出的能量传递能力对内部模式进行排序。此方法将神经网络转化为一系列局部白盒动力学模型,识别出适合剪枝或约束以提升可解释性的内部方向。

链接: https://arxiv.org/abs/2511.12852
作者: Jihoon Moon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Deep neural networks achieve state of the art performance but remain difficult to interpret mechanistically. In this work, we propose a control theoretic framework that treats a trained neural network as a nonlinear state space system and uses local linearization, controllability and observability Gramians, and Hankel singular values to analyze its internal computation. For a given input, we linearize the network around the corresponding hidden activation pattern and construct a state space model whose state consists of hidden neuron activations. The input state and state output Jacobians define local controllability and observability Gramians, from which we compute Hankel singular values and associated modes. These quantities provide a principled notion of neuron and pathway importance: controllability measures how easily each neuron can be excited by input perturbations, observability measures how strongly each neuron influences the output, and Hankel singular values rank internal modes that carry input output energy. We illustrate the framework on simple feedforward networks, including a 1 2 2 1 SwiGLU network and a 2 3 3 2 GELU network. By comparing different operating points, we show how activation saturation reduces controllability, shrinks the dominant Hankel singular value, and shifts the dominant internal mode to a different subset of neurons. The proposed method turns a neural network into a collection of local white box dynamical models and suggests which internal directions are natural candidates for pruning or constraints to improve interpretability.
zh

[AI-87] RoS-Guard: Robust and Scalable Online Change Detection with Delay-Optimal Guarantees

【速读】:该论文针对在线变化检测(Online Change Detection, OCD)在实际应用中面临的两大挑战展开研究:一是现有方法通常假设系统知识精确已知,而现实中存在估计误差和环境波动导致的不确定性;二是传统方法在大规模系统中计算效率低下。为解决上述问题,论文提出了一种名为RoS-Guard的鲁棒且最优的OCD算法,其核心创新在于通过紧致松弛与重构优化问题,并结合神经网络展开(neural unrolling)技术,实现基于GPU加速的高效并行计算,同时提供期望误报率和最坏情况平均检测延迟的理论保障。

链接: https://arxiv.org/abs/2511.12846
作者: Zelin Zhu,Yancheng Huang,Kai Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Online change detection (OCD) aims to rapidly identify change points in streaming data and is critical in applications such as power system monitoring, wireless network sensing, and financial anomaly detection. Existing OCD methods typically assume precise system knowledge, which is unrealistic due to estimation errors and environmental variations. Moreover, existing OCD methods often struggle with efficiency in large-scale systems. To overcome these challenges, we propose RoS-Guard, a robust and optimal OCD algorithm tailored for linear systems with uncertainty. Through a tight relaxation and reformulation of the OCD optimization problem, RoS-Guard employs neural unrolling to enable efficient parallel computation via GPU acceleration. The algorithm provides theoretical guarantees on performance, including expected false alarm rate and worst-case average detection delay. Extensive experiments validate the effectiveness of RoS-Guard and demonstrate significant computational speedup in large-scale system scenarios.
zh

[AI-88] Mapping fNIRS Signals to Agent Performance: Toward Reinforcement Learning from Neural Feedback AAAI2026 AAAI

【速读】:该论文旨在解决如何利用人类大脑的隐式神经信号(特别是功能性近红外光谱,fNIRS)来引导强化学习代理的行为对齐问题,从而实现无需显式反馈的脑驱动型强化学习从人类反馈(RLHF)系统。其解决方案的关键在于构建一个基于被动脑-机接口(BCI)的框架,通过训练分类器和回归器从预处理的fNIRS特征向量中预测代理性能水平(最优、次优或最差)以及动作偏离近优策略的程度,进而将这些连续或离散的神经信号映射为可指导代理训练的偏好信号;实验表明,即使在跨被试泛化场景下,仅需少量个体特异性数据微调预训练模型即可显著提升性能预测准确率(二分类F1提升17%,多分类提升41%),验证了该方法的可行性与可改进性。

链接: https://arxiv.org/abs/2511.12844
作者: Julia Santaniello,Matthew Russell,Benson Jiang,Donatello Sassaroli,Robert Jacob,Jivko SInapov
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to the Association for the Advancement of Artificial Intelligence (AAAI) 2026. To appear in the AAAI 2026 Proceedings

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) is a methodology that aligns agent behavior with human preferences by integrating human feedback into the agent’s training process. We introduce a possible framework that employs passive Brain-Computer Interfaces (BCI) to guide agent training from implicit neural signals. We present and release a novel dataset of functional near-infrared spectroscopy (fNIRS) recordings collected from 25 human participants across three domains: a Pick-and-Place Robot, Lunar Lander, and Flappy Bird. We train classifiers to predict levels of agent performance (optimal, sub-optimal, or worst-case) from windows of preprocessed fNIRS feature vectors, achieving an average F1 score of 67% for binary classification and 46% for multi-class models averaged across conditions and domains. We also train regressors to predict the degree of deviation between an agent’s chosen action and a set of near-optimal policies, providing a continuous measure of performance. We evaluate cross-subject generalization and demonstrate that fine-tuning pre-trained models with a small sample of subject-specific data increases average F1 scores by 17% and 41% for binary and multi-class models, respectively. Our work demonstrates that mapping implicit fNIRS signals to agent performance is feasible and can be improved, laying the foundation for future brain-driven RLHF systems.
zh

[AI-89] Connectivity-Guided Sparsification of 2-FWL GNNs: Preserving Full Expressivity with Improved Efficiency AAAI2026

【速读】:该论文旨在解决高阶图神经网络(HOGNN)在保持高表达能力的同时面临的计算复杂度问题。现有方法虽能提升效率,但往往以牺牲表达能力为代价。解决方案的关键在于提出一种名为Co-Sparsify的连通性感知稀疏化框架,其核心洞察是:三节点交互仅在双连通分量(biconnected components)内具有表达必要性——即任意两节点间存在环路的最大子图;而在双连通分量之外,结构关系可通过二节点消息传递或全局读出完全捕捉,无需更高阶建模。Co-Sparsify通过将二节点消息传递限制在连通分量内、三节点交互限制在双连通分量内,实现了无近似、无采样的计算消除,同时保留了完整的2-FWL测试表达能力。

链接: https://arxiv.org/abs/2511.12838
作者: Rongqin Chen,Fan Mo,Pak Lon Ip,Shenghui Zhang,Dan Wu,Ye Li,Leong Hou U
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Higher-order Graph Neural Networks (HOGNNs) based on the 2-FWL test achieve superior expressivity by modeling 2- and 3-node interactions, but at \mathcalO(n^3) computational cost. However, this computational burden is typically mitigated by existing efficiency methods at the cost of reduced expressivity. We propose \textbfCo-Sparsify, a connectivity-aware sparsification framework that eliminates \emphprovably redundant computations while preserving full 2-FWL expressive power. Our key insight is that 3-node interactions are expressively necessary only within \emphbiconnected components – maximal subgraphs where every pair of nodes lies on a cycle. Outside these components, structural relationships can be fully captured via 2-node message passing or global readout, rendering higher-order modeling unnecessary. Co-Sparsify restricts 2-node message passing to connected components and 3-node interactions to biconnected ones, removing computation without approximation or sampling. We prove that Co-Sparsified GNNs are as expressive as the 2-FWL test. Empirically, on PPGN, Co-Sparsify matches or exceeds accuracy on synthetic substructure counting tasks and achieves state-of-the-art performance on real-world benchmarks (ZINC, QM9). This study demonstrates that high expressivity and scalability are not mutually exclusive: principled, topology-guided sparsification enables powerful, efficient GNNs with theoretical guarantees.
zh

[AI-90] Catastrophic Forgetting in Kolmogorov-Arnold Networks AAAI2026

【速读】:该论文旨在解决持续学习(continual learning)中的灾难性遗忘(catastrophic forgetting)问题,即模型在学习新任务时会丢失先前任务的知识。其核心解决方案在于提出一个理论框架,将遗忘行为与激活函数的支持重叠(activation support overlap)和数据内在维度(intrinsic data dimension)相联系,并通过系统实验验证了该框架的有效性。关键创新点包括:揭示了Kolmogorov-Arnold Networks(KANs)在低维算法任务中具有较好的知识保留能力,但在高维场景(如图像分类和语言建模)下仍易发生遗忘;进一步提出KAN-LoRA这一参数高效适配器设计,用于语言模型的知识编辑任务,从而为持续学习系统的实际部署提供理论依据与实践路径。

链接: https://arxiv.org/abs/2511.12828
作者: Mohammad Marufur Rahman,Guanchu Wang,Kaixiong Zhou,Minghan Chen,Fan Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 pages, 5 figures, accepted in the main technical track of AAAI 2026

点击查看摘要

Abstract:Catastrophic forgetting is a longstanding challenge in continual learning, where models lose knowledge from earlier tasks when learning new ones. While various mitigation strategies have been proposed for Multi-Layer Perceptrons (MLPs), recent architectural advances like Kolmogorov-Arnold Networks (KANs) have been suggested to offer intrinsic resistance to forgetting by leveraging localized spline-based activations. However, the practical behavior of KANs under continual learning remains unclear, and their limitations are not well understood. To address this, we present a comprehensive study of catastrophic forgetting in KANs and develop a theoretical framework that links forgetting to activation support overlap and intrinsic data dimension. We validate these analyses through systematic experiments on synthetic and vision tasks, measuring forgetting dynamics under varying model configurations and data complexity. Further, we introduce KAN-LoRA, a novel adapter design for parameter-efficient continual fine-tuning of language models, and evaluate its effectiveness in knowledge editing tasks. Our findings reveal that while KANs exhibit promising retention in low-dimensional algorithmic settings, they remain vulnerable to forgetting in high-dimensional domains such as image classification and language modeling. These results advance the understanding of KANs’ strengths and limitations, offering practical insights for continual learning system design.
zh

[AI-91] Expressive Temporal Specifications for Reward Monitoring

【速读】:该论文旨在解决强化学习中奖励函数设计的挑战,特别是稀疏奖励(sparse rewards)问题,该问题在长时决策场景下尤为突出,严重影响智能体训练效率。解决方案的关键在于利用有限轨迹上的定量线性时序逻辑(quantitative Linear Temporal Logic on finite traces, LTL_f[𝒞])来合成奖励监测器(reward monitors),这些监测器能够为运行时可观测的状态轨迹生成密集且细粒度的奖励信号,从而在训练过程中提供更丰富的反馈,引导智能体趋向最优行为。该方法不依赖特定算法,仅需状态标签函数,并天然支持非马尔可夫性质的规范表达,实验证明其在任务完成度量化指标和收敛速度上优于传统的布尔语义监测器。

链接: https://arxiv.org/abs/2511.12808
作者: Omar Adalat,Francesco Belardinelli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Specifying informative and dense reward functions remains a pivotal challenge in Reinforcement Learning, as it directly affects the efficiency of agent training. In this work, we harness the expressive power of quantitative Linear Temporal Logic on finite traces (( \textLTL_f[\mathcalF] )) to synthesize reward monitors that generate a dense stream of rewards for runtime-observable state trajectories. By providing nuanced feedback during training, these monitors guide agents toward optimal behaviour and help mitigate the well-known issue of sparse rewards under long-horizon decision making, which arises under the Boolean semantics dominating the current literature. Our framework is algorithm-agnostic and only relies on a state labelling function, and naturally accommodates specifying non-Markovian properties. Empirical results show that our quantitative monitors consistently subsume and, depending on the environment, outperform Boolean monitors in maximizing a quantitative measure of task completion and in reducing convergence time.
zh

[AI-92] he Alignment Game: A Theory of Long-Horizon Alignment Through Recursive Curation

【速读】:该论文旨在解决自消耗生成模型(self-consuming generative models)在递归重训练过程中对用户偏好对齐(alignment)的长期影响问题,即当模型基于自身输出进行迭代训练时,如何保证其持续与用户偏好保持一致。解决方案的关键在于构建一个基于Bradley-Terry(BT)模型的两阶段筛选机制,并将对齐过程建模为模型所有者(Model Owner)与公众用户(Public User)之间的动态博弈:前者控制哪些输出被用于模型学习,后者通过交互决定哪些输出被保留和传播。该框架揭示了三种结构性收敛状态——共识坍缩、共享最优妥协和不对称优化,并证明了一个根本性不可能定理:任何基于BT的递归筛选机制都无法同时维持多样性、确保对称影响力并消除对初始条件的依赖。这一分析表明,对齐是一个由权力不对称和路径依赖共同塑造的动态社会选择过程,而非静态目标。

链接: https://arxiv.org/abs/2511.12804
作者: Ali Falahati,Mohammad Mohammadi Amiri,Kate Larson,Lukasz Golab
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In self-consuming generative models that train on their own outputs, alignment with user preferences becomes a recursive rather than one-time process. We provide the first formal foundation for analyzing the long-term effects of such recursive retraining on alignment. Under a two-stage curation mechanism based on the Bradley-Terry (BT) model, we model alignment as an interaction between two factions: the Model Owner, who filters which outputs should be learned by the model, and the Public User, who determines which outputs are ultimately shared and retained through interactions with the model. Our analysis reveals three structural convergence regimes depending on the degree of preference alignment: consensus collapse, compromise on shared optima, and asymmetric refinement. We prove a fundamental impossibility theorem: no recursive BT-based curation mechanism can simultaneously preserve diversity, ensure symmetric influence, and eliminate dependence on initialization. Framing the process as dynamic social choice, we show that alignment is not a static goal but an evolving equilibrium, shaped both by power asymmetries and path dependence.
zh

[AI-93] Genomic Next-Token Predictors are In-Context Learners

【速读】:该论文旨在解决一个 fundamental question:在非语言序列领域(如基因组序列)中,是否可以通过大规模预测训练自然涌现出上下文学习能力(In-context Learning, ICL),而无需依赖人类语言特有的统计特性。此前研究普遍认为ICL是语言数据中独特统计结构的产物,但本文通过构建跨模态对比实验框架,在基因组模型(Evo2)上验证了ICL的有机出现——该模型仅基于核苷酸(A/T/C/G)预测进行训练,规模与中小型大语言模型相当。解决方案的关键在于设计了一套受控的符号推理任务,同时在语言和基因组两种形式下实施,实现了对ICL性能的直接比较;结果表明,基因组模型同样表现出随演示样本数量增加而呈对数线性增长的模式归纳能力,首次证明ICL可在非语言符号序列中自发产生,支持“ICL是大规模预测建模在丰富数据上的涌现属性”这一统一假设,从而推动了跨模态、无特定任务依赖的ICL机制理解。

链接: https://arxiv.org/abs/2511.12797
作者: Nathan Breslow,Aayush Mishra,Mahler Revsine,Michael C. Schatz,Anqi Liu,Daniel Khashabi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
备注:

点击查看摘要

Abstract:In-context learning (ICL) – the capacity of a model to infer and apply abstract patterns from examples provided within its input – has been extensively studied in large language models trained for next-token prediction on human text. In fact, prior work often attributes this emergent behavior to distinctive statistical properties in human language. This raises a fundamental question: can ICL arise organically in other sequence domains purely through large-scale predictive training? To explore this, we turn to genomic sequences, an alternative symbolic domain rich in statistical structure. Specifically, we study the Evo2 genomic model, trained predominantly on next-nucleotide (A/T/C/G) prediction, at a scale comparable to mid-sized LLMs. We develop a controlled experimental framework comprising symbolic reasoning tasks instantiated in both linguistic and genomic forms, enabling direct comparison of ICL across genomic and linguistic models. Our results show that genomic models, like their linguistic counterparts, exhibit log-linear gains in pattern induction as the number of in-context demonstrations increases. To the best of our knowledge, this is the first evidence of organically emergent ICL in genomic sequences, supporting the hypothesis that ICL arises as a consequence of large-scale predictive modeling over rich data. These findings extend emergent meta-learning beyond language, pointing toward a unified, modality-agnostic view of in-context learning. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN) Cite as: arXiv:2511.12797 [cs.LG] (or arXiv:2511.12797v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.12797 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-94] Maximizing the efficiency of human feedback in AI alignment: a comparative analysis

【速读】:该论文旨在解决强化学习中人类反馈(Reinforcement Learning from Human Feedback, RLHF)依赖偏好建模时,在有限标注预算下传统随机配对采样与Bradley-Terry模型所面临的统计效率低下和性能受限问题。其解决方案的关键在于引入受博弈论、统计学与社会选择理论启发的自适应采样与评估策略,其中最优方法Swiss InfoGain采用瑞士轮锦标赛机制结合代理互信息增益配对规则,显著提升了在低资源场景下的样本效率与偏好推断准确性,同时在高资源条件下亦优于Bradley-Terry基线,从而实现了对对齐质量与人工标注工作量之间的更优平衡。

链接: https://arxiv.org/abs/2511.12796
作者: Andreas Chouliaras,Dimitris Chatzopoulos
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 16 pages, 6 figures, 6 algorithms. AICS2025

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) relies on preference modeling to align machine learning systems with human values, yet the popular approach of random pair sampling with Bradley-Terry modeling is statistically limited and inefficient under constrained annotation budgets. In this work, we explore alternative sampling and evaluation strategies for preference inference in RLHF, drawing inspiration from areas such as game theory, statistics, and social choice theory. Our best-performing method, Swiss InfoGain, employs a Swiss tournament system with a proxy mutual-information-gain pairing rule, which significantly outperforms all other methods in constrained annotation budgets while also being more sample-efficient. Even in high-resource settings, we can identify superior alternatives to the Bradley-Terry baseline. Our experiments demonstrate that adaptive, resource-aware strategies reduce redundancy, enhance robustness, and yield statistically significant improvements in preference learning, highlighting the importance of balancing alignment quality with human workload in RLHF pipelines.
zh

[AI-95] Neuro-Logic Lifelong Learning

【速读】:该论文旨在解决神经符号人工智能(Neural-Symbolic Artificial Intelligence)中利用神经网络求解归纳逻辑编程(Inductive Logic Programming, ILP)问题的挑战,尤其关注如何通过持续学习(lifelong learning)机制提升模型在序列化任务中的效率与性能。其解决方案的关键在于提出一种组合式框架(compositional framework),充分利用逻辑规则的组合性(compositional nature)和可迁移性(transferable nature),使先前任务中学到的逻辑规则能够高效复用于后续任务,从而实现更好的可扩展性和性能表现。

链接: https://arxiv.org/abs/2511.12793
作者: Bowen He,Xiaoan Xu,Alper Kamil Bozkurt,Vahid Tarokh,Juncheng Dong
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Solving Inductive Logic Programming (ILP) problems with neural networks is a key challenge in Neural-Symbolic Ar- tificial Intelligence (AI). While most research has focused on designing novel network architectures for individual prob- lems, less effort has been devoted to exploring new learning paradigms involving a sequence of problems. In this work, we investigate lifelong learning ILP, which leverages the com- positional and transferable nature of logic rules for efficient learning of new problems. We introduce a compositional framework, demonstrating how logic rules acquired from ear- lier tasks can be efficiently reused in subsequent ones, leading to improved scalability and performance. We formalize our approach and empirically evaluate it on sequences of tasks. Experimental results validate the feasibility and advantages of this paradigm, opening new directions for continual learn- ing in Neural-Symbolic AI.
zh

[AI-96] Multi-Agent Reinforcement Learning for Heterogeneous Satellite Cluster Resources Optimization

【速读】:该论文旨在解决异构卫星集群在执行自主地球观测(Earth Observation, EO)任务时的资源优化问题,尤其针对实时性、不确定性及去中心化特性带来的挑战。传统优化方法难以有效应对能量与存储资源受限、部分可观测性以及因载荷能力差异导致的代理异质性等问题。解决方案的关键在于引入多智能体强化学习(Multi-Agent Reinforcement Learning, MARL),通过构建基于Basilisk和BSK-RL框架的近真实仿真环境,系统地将优化问题从单星扩展至多星场景,并采用MAPPO、HAPPO和HATRPO等先进MARL算法实现异构卫星间的协同决策,从而在保障成像性能的同时高效管理资源,缓解非平稳性和智能体间奖励耦合问题,为未来复杂动态环境下智能化EO任务规划提供可扩展的理论基础与实践路径。

链接: https://arxiv.org/abs/2511.12792
作者: Mohamad A. Hady,Siyi Hu,Mahardhika Pratama,Zehong Cao,Ryszard Kowalczyk
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This work investigates resource optimization in heterogeneous satellite clusters performing autonomous Earth Observation (EO) missions using Reinforcement Learning (RL). In the proposed setting, two optical satellites and one Synthetic Aperture Radar (SAR) satellite operate cooperatively in low Earth orbit to capture ground targets and manage their limited onboard resources efficiently. Traditional optimization methods struggle to handle the real-time, uncertain, and decentralized nature of EO operations, motivating the use of RL and Multi-Agent Reinforcement Learning (MARL) for adaptive decision-making. This study systematically formulates the optimization problem from single-satellite to multi-satellite scenarios, addressing key challenges including energy and memory constraints, partial observability, and agent heterogeneity arising from diverse payload capabilities. Using a near-realistic simulation environment built on the Basilisk and BSK-RL frameworks, we evaluate the performance and stability of state-of-the-art MARL algorithms such as MAPPO, HAPPO, and HATRPO. Results show that MARL enables effective coordination across heterogeneous satellites, balancing imaging performance and resource utilization while mitigating non-stationarity and inter-agent reward coupling. The findings provide practical insights into scalable, autonomous satellite operations and contribute a foundation for future research on intelligent EO mission planning under heterogeneous and dynamic conditions.
zh

[AI-97] Optimal Look-back Horizon for Time Series Forecasting in Federated Learning AAAI-26

【速读】:该论文旨在解决联邦时间序列预测(Federated Time Series Forecasting, FTTSF)中关键的“回看窗口长度(look-back horizon)自适应选择”问题,即如何在数据分散、异构且非独立同分布(non-IID)的联邦学习场景下,动态确定最优的历史时间窗口以最小化预测损失。其解决方案的核心在于构建一个基于内在表示空间(intrinsic representation space)的理论框架:首先设计了一种合成数据生成器(Synthetic Data Generator, SDG),能够捕捉客户端数据中的自回归依赖、季节性和趋势等关键时序结构并保留异质性;随后定义了一个映射变换,将时间窗映射至具有明确几何与统计性质的内在空间;进一步地,通过分解预测损失为贝叶斯不可约不确定性项与近似误差项,证明了存在一个最优回看窗口——即在不可约损失趋于饱和而近似误差仍上升的最小窗口处,总预测损失达到最小值。这一理论分析为联邦时间序列预测中的自适应窗口选择提供了严谨依据。

链接: https://arxiv.org/abs/2511.12791
作者: Dahao Tang,Nan Yang,Yanli Li,Zhiyu Zhu,Zhibo Jin,Dong Yuan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI-26 as Oral Presentation

点击查看摘要

Abstract:Selecting an appropriate look-back horizon remains a fundamental challenge in time series forecasting (TSF), particularly in the federated learning scenarios where data is decentralized, heterogeneous, and often non-independent. While recent work has explored horizon selection by preserving forecasting-relevant information in an intrinsic space, these approaches are primarily restricted to centralized and independently distributed settings. This paper presents a principled framework for adaptive horizon selection in federated time series forecasting through an intrinsic space formulation. We introduce a synthetic data generator (SDG) that captures essential temporal structures in client data, including autoregressive dependencies, seasonality, and trend, while incorporating client-specific heterogeneity. Building on this model, we define a transformation that maps time series windows into an intrinsic representation space with well-defined geometric and statistical properties. We then derive a decomposition of the forecasting loss into a Bayesian term, which reflects irreducible uncertainty, and an approximation term, which accounts for finite-sample effects and limited model capacity. Our analysis shows that while increasing the look-back horizon improves the identifiability of deterministic patterns, it also increases approximation error due to higher model complexity and reduced sample efficiency. We prove that the total forecasting loss is minimized at the smallest horizon where the irreducible loss starts to saturate, while the approximation loss continues to rise. This work provides a rigorous theoretical foundation for adaptive horizon selection for time series forecasting in federated learning.
zh

[AI-98] Scalable Multi-Objective and Meta Reinforcement Learning via Gradient Estimation AAAI’26

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)中高效估计多目标策略的问题,即如何将 $ n $ 个目标任务最优地划分为 $ k \ll n $ 个相关组,使得每组内的目标可以协同训练以提升整体性能。其核心挑战在于:随着任务数量 $ n $ 增大,单一策略对所有任务进行优化会变得低效甚至次优。解决方案的关键在于提出一种两阶段方法——元训练(meta-training)与微调(fine-tuning),其中利用了训练良好的策略网络具有的一阶近似性质(first-order approximation property),该性质在多个RL环境中被验证误差控制在2%以内。基于此特性,作者设计了PolicyGradEx算法,通过评估策略网络的梯度变化来高效估算任务间的亲和度得分矩阵,并进一步采用聚类最大化组内亲和度,从而实现快速且高质量的任务分组。实验表明,该方法平均优于现有最先进基线16%,且相比完整训练获得分组结果提速达26倍。

链接: https://arxiv.org/abs/2511.12779
作者: Zhenshuo Zhang,Minxuan Duan,Youran Ye,Hongyang R. Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages. To appear in AAAI’26

点击查看摘要

Abstract:We study the problem of efficiently estimating policies that simultaneously optimize multiple objectives in reinforcement learning (RL). Given n objectives (or tasks), we seek the optimal partition of these objectives into k \ll n groups, where each group comprises related objectives that can be trained together. This problem arises in applications such as robotics, control, and preference optimization in language models, where learning a single policy for all n objectives is suboptimal as n grows. We introduce a two-stage procedure – meta-training followed by fine-tuning – to address this problem. We first learn a meta-policy for all objectives using multitask learning. Then, we adapt the meta-policy to multiple randomly sampled subsets of objectives. The adaptation step leverages a first-order approximation property of well-trained policy networks, which is empirically verified to be accurate within a 2% error margin across various RL environments. The resulting algorithm, PolicyGradEx, efficiently estimates an aggregate task-affinity score matrix given a policy evaluation algorithm. Based on the estimated affinity score matrix, we cluster the n objectives into k groups by maximizing the intra-cluster affinity scores. Experiments on three robotic control and the Meta-World benchmarks demonstrate that our approach outperforms state-of-the-art baselines by 16% on average, while delivering up to 26\times faster speedup relative to performing full training to obtain the clusters. Ablation studies validate each component of our approach. For instance, compared with random grouping and gradient-similarity-based grouping, our loss-based clustering yields an improvement of 19% . Finally, we analyze the generalization error of policy networks by measuring the Hessian trace of the loss surface, which gives non-vacuous measures relative to the observed generalization errors.
zh

[AI-99] Event-CausNet: Unlocking Causal Knowledge from Text with Large Language Models for Reliable Spatio-Temporal Forecasting

【速读】:该论文旨在解决时空图神经网络(Spatio-Temporal Graph Neural Networks, ST-GNNs)在应对非周期性交通事件(如交通事故)时预测性能急剧下降的问题。其根本原因在于ST-GNN本质上是相关性模型,依赖历史模式进行学习,而突发事件引入了新的因果因素,导致原有模式失效。解决方案的关键在于提出Event-CausNet框架,该框架通过大语言模型(Large Language Model, LLM)量化非结构化事件报告,构建基于平均处理效应(Average Treatment Effect, ATE)估计的因果知识库,并利用一种新颖的因果注意力机制将该知识注入双流GNN-LSTM网络中,从而动态调整和增强预测结果。此方法实现了从相关性建模到因果推理的跨越,显著提升了模型在突发事件下的鲁棒性和可解释性。

链接: https://arxiv.org/abs/2511.12769
作者: Luyao Niu,Zepu Wang,Shuyi Guan,Yang Liu,Peng Sun
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:While spatio-temporal Graph Neural Networks (GNNs) excel at modeling recurring traffic patterns, their reliability plummets during non-recurring events like accidents. This failure occurs because GNNs are fundamentally correlational models, learning historical patterns that are invalidated by the new causal factors introduced during disruptions. To address this, we propose Event-CausNet, a framework that uses a Large Language Model to quantify unstructured event reports, builds a causal knowledge base by estimating average treatment effects, and injects this knowledge into a dual-stream GNN-LSTM network using a novel causal attention mechanism to adjust and enhance the forecast. Experiments on a real-world dataset demonstrate that Event-CausNet achieves robust performance, reducing prediction error (MAE) by up to 35.87%, significantly outperforming state-of-the-art baselines. Our framework bridges the gap between correlational models and causal reasoning, providing a solution that is more accurate and transferable, while also offering crucial interpretability, providing a more reliable foundation for real-world traffic management during critical disruptions.
zh

[AI-100] Optimal Forag ing in Memory Retrieval: Evaluating Random Walks and Metropolis-Hastings Sampling in Modern Semantic Spaces

【速读】:该论文试图解决的问题是:如何在高维语义嵌入空间中建模人类记忆检索行为,使其与生态觅食理论(optimal foraging theory)中的边际价值定理(Marginal Value Theorem, MVT)一致。解决方案的关键在于,使用先进的语义嵌入表示(如现代高维embedding空间),结合简单的随机游走(random walk)采样机制,即可生成与人类语义流畅性任务中观察到的觅食模式高度一致的行为结果,而无需引入复杂的自适应采样策略(如Metropolis-Hastings采样)。这一发现表明,恰当结构化的嵌入空间本身足以模拟近最优的觅食动态,从而支持Hills(2012)的观点,即语义空间的拓扑结构比复杂的接受-拒绝机制更能解释人类记忆检索行为。

链接: https://arxiv.org/abs/2511.12759
作者: James Moore
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human memory retrieval often resembles ecological foraging where animals search for food in a patchy environment. Optimal foraging means following the Marginal Value Theorem (MVT), in which individuals exploit a patch of semantically related concepts until it becomes less rewarding and then switch to a new cluster. While human behavioral data suggests foraging-like patterns in semantic fluency tasks, it remains unclear whether modern high-dimensional embedding spaces provide representations that allow algorithms to match observed human behavior. Using state-of-the-art embeddings and prior semantic fluency data, I find that random walks on these embedding spaces produce results consistent with optimal foraging and the MVT. Surprisingly, introducing Metropolis-Hastings sampling, an adaptive algorithm expected to model strategic acceptance and rejection of new clusters, does not produce results consistent with human behavior. These findings challenge the assumption that more complex sampling mechanisms inherently lead to better cognitive models of memory retrieval. Instead, they show that appropriately structured embeddings, even with simple sampling, can produce near-optimal foraging dynamics. This supports the perspective of Hills (2012) rather than Abbott (2015), demonstrating that modern embeddings can approximate human memory foraging without relying on complex acceptance criteria.
zh

[AI-101] Adaptively Coordinating with Novel Partners via Learned Latent Strategies NEURIPS2025

【速读】:该论文旨在解决人机协作中异构团队成员间实时适应性不足的问题,尤其是在时间压力大、策略空间复杂的任务场景下,人工代理难以快速识别并响应人类合作伙伴的动态行为偏好。解决方案的关键在于提出一种策略条件化的合作框架(strategy-conditioned cooperator framework),通过变分自编码器(variational autoencoder)从代理轨迹数据中学习潜在的策略空间,并利用聚类方法识别出不同的策略类型;随后训练一个条件化合作代理,基于这些策略类别生成对应类型的虚拟伙伴进行强化学习。在在线适应阶段,采用固定份额后悔最小化算法(fixed-share regret minimization algorithm)动态估计和调整对新伙伴策略的推断,从而实现对未知人类或代理队友的高效实时适应。

链接: https://arxiv.org/abs/2511.12754
作者: Benjamin Li,Shuyang Shi,Lucia Romero,Huao Li,Yaqi Xie,Woojun Kim,Stefanos Nikolaidis,Michael Lewis,Katia Sycara,Simon Stepputtis
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: Accepted to NeurIPS 2025

点击查看摘要

Abstract:Adaptation is the cornerstone of effective collaboration among heterogeneous team members. In human-agent teams, artificial agents need to adapt to their human partners in real time, as individuals often have unique preferences and policies that may change dynamically throughout interactions. This becomes particularly challenging in tasks with time pressure and complex strategic spaces, where identifying partner behaviors and selecting suitable responses is difficult. In this work, we introduce a strategy-conditioned cooperator framework that learns to represent, categorize, and adapt to a broad range of potential partner strategies in real-time. Our approach encodes strategies with a variational autoencoder to learn a latent strategy space from agent trajectory data, identifies distinct strategy types through clustering, and trains a cooperator agent conditioned on these clusters by generating partners of each strategy type. For online adaptation to novel partners, we leverage a fixed-share regret minimization algorithm that dynamically infers and adjusts the partner’s strategy estimation during interaction. We evaluate our method in a modified version of the Overcooked domain, a complex collaborative cooking environment that requires effective coordination among two players with a diverse potential strategy space. Through these experiments and an online user study, we demonstrate that our proposed agent achieves state of the art performance compared to existing baselines when paired with novel human, and agent teammates.
zh

[AI-102] Whose Narrative is it Anyway? A KV Cache Manipulation Attack

【速读】:该论文旨在解决自回归大语言模型(Large Language Models, LLMs)在推理过程中因KV缓存(Key Value Cache)被篡改而导致的完整性安全问题。KV缓存作为模型内部状态的重要表示,若遭恶意操纵,可能引发生成内容的偏离甚至被劫持,而这种攻击无需修改用户输入的提示词(prompt),隐蔽性强。解决方案的关键在于提出一种名为“History Swapping”的新型块级攻击方法:通过将当前生成过程中某一连续段落的KV缓存,替换为预先计算好的、来自不同话题的缓存片段,从而实现对模型生成轨迹的精准控制。实验表明,只有全层覆盖式替换才能成功改变对话主题,并揭示了高层结构规划早期编码于缓存中、局部语篇结构由模型末层维持的机制,凸显了KV缓存作为安全分析关键接口的价值。

链接: https://arxiv.org/abs/2511.12752
作者: Mukkesh Ganesh,Kaushik Iyer,Arun Baalaaji Sankar Ananthan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 7 pages, 10 figures

点击查看摘要

Abstract:The Key Value(KV) cache is an important component for efficient inference in autoregressive Large Language Models (LLMs), but its role as a representation of the model’s internal state makes it a potential target for integrity attacks. This paper introduces “History Swapping,” a novel block-level attack that manipulates the KV cache to steer model generation without altering the user-facing prompt. The attack involves overwriting a contiguous segment of the active generation’s cache with a precomputed cache from a different topic. We empirically evaluate this method across 324 configurations on the Qwen 3 family of models, analyzing the impact of timing, magnitude, and layer depth of the cache overwrite. Our findings reveal that only full-layer overwrites can successfully hijack the conversation’s topic, leading to three distinct behaviors: immediate and persistent topic shift, partial recovery, or a delayed hijack. Furthermore, we observe that high-level structural plans are encoded early in the generation process and local discourse structure is maintained by the final layers of the model. This work demonstrates that the KV cache is a significant vector for security analysis, as it encodes not just context but also topic trajectory and structural planning, making it a powerful interface for manipulating model behavior.
zh

[AI-103] Are LLM s The Way Forward? A Case Study on LLM -Guided Reinforcement Learning for Decentralized Autonomous Driving

【速读】:该论文旨在解决自动驾驶车辆在复杂交通环境(如密集高速路和汇入场景)中导航时,强化学习(Reinforcement Learning, RL)因依赖人工设计的奖励函数而难以捕捉语义与社会交互复杂性的局限性。其解决方案的关键在于利用小型本地部署的大语言模型(Large Language Models, LLMs,参数量约14B),通过奖励塑造(reward shaping)而非直接控制的方式增强RL代理的决策能力:LLM在训练阶段对状态-动作转换进行评分以优化奖励信号,而测试阶段仍由标准RL策略执行控制。这一方法在保持较高成功率的同时,避免了LLM直接控制带来的不稳定性、延迟及效率问题,但实验也揭示了LLM引入的系统性保守偏差及其对模型依赖性的显著影响,凸显当前小规模LLMs在安全关键任务中的局限性。

链接: https://arxiv.org/abs/2511.12751
作者: Timur Anvar,Jeffrey Chen,Yuyan Wang,Rohan Chandra
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Autonomous vehicle navigation in complex environments such as dense and fast-moving highways and merging scenarios remains an active area of research. A key limitation of RL is its reliance on well-specified reward functions, which often fail to capture the full semantic and social complexity of diverse, out-of-distribution situations. As a result, a rapidly growing line of research explores using Large Language Models (LLMs) to replace or supplement RL for direct planning and control, on account of their ability to reason about rich semantic context. However, LLMs present significant drawbacks: they can be unstable in zero-shot safety-critical settings, produce inconsistent outputs, and often depend on expensive API calls with network latency. This motivates our investigation into whether small, locally deployed LLMs ( 14B parameters) can meaningfully support autonomous highway driving through reward shaping rather than direct control. We present a case study comparing RL-only, LLM-only, and hybrid approaches, where LLMs augment RL rewards by scoring state-action transitions during training, while standard RL policies execute at test time. Our findings reveal that RL-only agents achieve moderate success rates (73-89%) with reasonable efficiency, LLM-only agents can reach higher success rates (up to 94%) but with severely degraded speed performance, and hybrid approaches consistently fall between these extremes. Critically, despite explicit efficiency instructions, LLM-influenced approaches exhibit systematic conservative bias with substantial model-dependent variability, highlighting important limitations of current small LLMs for safety-critical control tasks.
zh

[AI-104] Adaptive Graph Rewiring to Mitigate Over-Squashing in Mesh-Based GNNs for Fluid Dynamics Simulations

【速读】:该论文旨在解决基于网格的图神经网络(Mesh-based Graph Neural Networks, Mesh-based GNNs)在流体动力学模拟中因网格细化导致的“过挤压问题”(over-squashing problem),该问题阻碍了长程物理相互作用的捕捉。传统图重连(graph rewiring)方法虽试图通过添加新边缓解此问题,但通常在GNN应用前一次性完成所有重连操作,忽略了物理交互的渐进性和距离信息,因而缺乏物理真实性。解决方案的关键在于提出一种自适应重连框架AdaMeshNet,其核心创新是将重连过程嵌入消息传递(message-passing)步骤中,根据节点间的最短路径距离与速度差计算重连延迟评分(rewiring delay score),动态决定在哪个消息传递层插入新边,从而实现随物理交互传播过程自适应调整图结构,有效建模物理交互的时序特性并提升预测精度。

链接: https://arxiv.org/abs/2511.12709
作者: Sangwoo Seo,Hyunsung Kim,Jiwan Kim,Chanyoung Park
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Mesh-based simulation using Graph Neural Networks (GNNs) has been recognized as a promising approach for modeling fluid dynamics. However, the mesh refinement techniques which allocate finer resolution to regions with steep gradients can induce the over-squashing problem in mesh-based GNNs, which prevents the capture of long-range physical interactions. Conventional graph rewiring methods attempt to alleviate this issue by adding new edges, but they typically complete all rewiring operations before applying them to the GNN. These approaches are physically unrealistic, as they assume instantaneous interactions between distant nodes and disregard the distance information between particles. To address these limitations, we propose a novel framework, called Adaptive Graph Rewiring in Mesh-Based Graph Neural Networks (AdaMeshNet), that introduces an adaptive rewiring process into the message-passing procedure to model the gradual propagation of physical interactions. Our method computes a rewiring delay score for bottleneck nodes in the mesh graph, based on the shortest-path distance and the velocity difference. Using this score, it dynamically selects the message-passing layer at which new edges are rewired, which can lead to adaptive rewiring in a mesh graph. Extensive experiments on mesh-based fluid simulations demonstrate that AdaMeshNet outperforms conventional rewiring methods, effectively modeling the sequential nature of physical interactions and enabling more accurate predictions.
zh

[AI-105] Beyond Fixed Tasks: Unsupervised Environment Design for Task-Level Pairs AAAI

【速读】:该论文旨在解决强化学习中通用智能体在复杂环境中遵循复杂任务指令时面临的挑战,即随机采样任务与关卡(level)组合常导致不可解的配对,从而阻碍策略训练的有效性。其解决方案的关键在于提出ATLAS(Aligning Tasks and Levels for Autocurricula of Specifications),一种能够自动生成任务与关卡联合自适应课程(joint autocurricula)的新方法。该方法基于无监督环境设计(Unsupervised Environment Design, UED),通过协同设计任务和关卡结构,确保生成的配对既可解又具挑战性,从而显著提升策略的学习效率,尤其在传统随机采样难以获得可解组合的情况下表现突出。

链接: https://arxiv.org/abs/2511.12706
作者: Daniel Furelos-Blanco,Charles Pert,Frederik Kelbel,Alex F. Spies,Alessandra Russo,Michael Dennis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Extended version of paper accepted for publication at the 40th AAAI Conference on Artificial Intelligence (AAAI)

点击查看摘要

Abstract:Training general agents to follow complex instructions (tasks) in intricate environments (levels) remains a core challenge in reinforcement learning. Random sampling of task-level pairs often produces unsolvable combinations, highlighting the need to co-design tasks and levels. While unsupervised environment design (UED) has proven effective at automatically designing level curricula, prior work has only considered a fixed task. We present ATLAS (Aligning Tasks and Levels for Autocurricula of Specifications), a novel method that generates joint autocurricula over tasks and levels. Our approach builds upon UED to automatically produce solvable yet challenging task-level pairs for policy training. To evaluate ATLAS and drive progress in the field, we introduce an evaluation suite that models tasks as reward machines in Minigrid levels. Experiments demonstrate that ATLAS vastly outperforms random sampling approaches, particularly when sampling solvable pairs is unlikely. We further show that mutations leveraging the structure of both tasks and levels accelerate convergence to performant policies.
zh

[AI-106] A Closer Look at Personalized Fine-Tuning in Heterogeneous Federated Learning

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中全局泛化与局部个性化之间的权衡难题,尤其针对客户端数据分布非独立同分布(non-IID)时,传统个性化微调(Personalized Fine-Tuning, PFT)易出现过拟合或在领域偏移下性能下降的问题。其解决方案的关键在于提出一种适配于FL场景的线性探针后全参数微调策略(Linear Probing followed by Full Fine-Tuning, LP-FT),通过分阶段参数更新机制缓解联邦特征失真(federated feature distortion)——即本地微调破坏全局学习到的特征表示的现象,并理论刻画了LP-FT如何在部分特征重叠和协变量-概念偏移等条件下优于标准微调,从而为实现鲁棒的个性化联邦学习提供可操作的指导。

链接: https://arxiv.org/abs/2511.12695
作者: Minghui Chen,Hrad Ghoukasian,Ruinan Jin,Zehua Wang,Sai Praneeth Karimireddy,Xiaoxiao Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)
备注: 33 pages, 6 figures, 8 tables

点击查看摘要

Abstract:Federated Learning (FL) enables decentralized, privacy-preserving model training but struggles to balance global generalization and local personalization due to non-identical data distributions across clients. Personalized Fine-Tuning (PFT), a popular post-hoc solution, fine-tunes the final global model locally but often overfits to skewed client distributions or fails under domain shifts. We propose adapting Linear Probing followed by full Fine-Tuning (LP-FT), a principled centralized strategy for alleviating feature distortion (Kumar et al., 2022), to the FL setting. Through systematic evaluation across seven datasets and six PFT variants, we demonstrate LP-FT’s superiority in balancing personalization and generalization. Our analysis uncovers federated feature distortion, a phenomenon where local fine-tuning destabilizes globally learned features, and theoretically characterizes how LP-FT mitigates this via phased parameter updates. We further establish conditions (e.g., partial feature overlap, covariate-concept shift) under which LP-FT outperforms standard fine-tuning, offering actionable guidelines for deploying robust personalization in FL.
zh

[AI-107] Dynamic Tree Databases in Automated Planning

【速读】:该论文旨在解决大规模任务中显式状态空间搜索的可扩展性问题,核心挑战在于如何高效压缩生成的状态集合(state set)。传统树数据库(tree database)虽在理想情况下可实现每个状态常数级内存占用,但需预先分配大量内存,导致资源浪费。论文提出一种新颖的动态变体树数据库,能够在命题变量和数值变量上实现状态集的压缩,并证明其保持了静态版本的关键性质。该方案的关键创新在于引入动态内存管理机制,在保证压缩效率的同时显著降低内存预分配需求,实验表明其在经典规划与数值规划任务中可实现多个数量级的压缩比,且运行时开销几乎可以忽略。

链接: https://arxiv.org/abs/2511.12677
作者: Oliver Joergensen,Dominik Drexler,Jendrik Seipp
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A central challenge in scaling up explicit state-space search for large tasks is compactly representing the set of generated states. Tree databases, a data structure from model checking, require constant space per generated state in the best case, but they need a large preallocation of memory. We propose a novel dynamic variant of tree databases for compressing state sets over propositional and numeric variables and prove that it maintains the desirable properties of the static counterpart. Our empirical evaluation of state compression techniques for grounded and lifted planning on classical and numeric planning tasks reveals compression ratios of several orders of magnitude, often with negligible runtime overhead.
zh

[AI-108] AI Bill of Materials and Beyond: Systematizing Security Assurance through the AI Risk Scanning (AIRS) Framework

【速读】:该论文旨在解决当前人工智能(AI)系统保障机制在软件供应链安全、对抗性机器学习和治理文档之间碎片化的问题,尤其针对现有透明度工具(如Model Cards、Datasheets和Software Bills of Materials, SBOMs)难以提供可验证、机器可读的模型安全性证据这一核心痛点。其解决方案的关键在于提出AI风险扫描(AIRS)框架,该框架基于威胁建模构建,能够生成结构化、可审计的证据以实现对大语言模型(LLM)的可测量保障;通过与MITRE ATLAS对抗性机器学习分类体系对齐,AIRS自动产出涵盖模型完整性、打包与序列化安全、结构适配器及运行时行为等关键领域的标准化产物,并已在量化GPT-OSS-20B模型上验证了其在加载策略控制、分片哈希校验以及污染和后门探测方面的有效性,从而填补了SBOM标准(如SPDX 3.0和CycloneDX 1.6)在AI特定保障字段上的空白,为可信且可机读的AI风险文档提供了原则性基础。

链接: https://arxiv.org/abs/2511.12668
作者: Samuel Nathanson,Alexander Lee,Catherine Chen Kieffer,Jared Junkin,Jessica Ye,Amir Saeed,Melanie Lockhart,Russ Fink,Elisha Peterson,Lanier Watkins
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 4 figures, 6 tables

点击查看摘要

Abstract:Assurance for artificial intelligence (AI) systems remains fragmented across software supply-chain security, adversarial machine learning, and governance documentation. Existing transparency mechanisms - including Model Cards, Datasheets, and Software Bills of Materials (SBOMs) - advance provenance reporting but rarely provide verifiable, machine-readable evidence of model security. This paper introduces the AI Risk Scanning (AIRS) Framework, a threat-model-based, evidence-generating framework designed to operationalize AI assurance. The AIRS Framework evolved through three progressive pilot studies - Smurf (AIBOM schema design), OPAL (operational validation), and Pilot C (AIRS) - that reframed AI documentation from descriptive disclosure toward measurable, evidence-bound verification. The framework aligns its assurance fields to the MITRE ATLAS adversarial ML taxonomy and automatically produces structured artifacts capturing model integrity, packaging and serialization safety, structural adapters, and runtime behaviors. Currently, the AIRS Framework is scoped to provide model-level assurances for LLMs, but it could be expanded to include other modalities and cover system-level threats (e.g. application-layer abuses, tool-calling). A proof-of-concept on a quantized GPT-OSS-20B model demonstrates enforcement of safe loader policies, per-shard hash verification, and contamination and backdoor probes executed under controlled runtime conditions. Comparative analysis with SBOM standards of SPDX 3.0 and CycloneDX 1.6 reveals alignment on identity and evaluation metadata, but identifies critical gaps in representing AI-specific assurance fields. The AIRS Framework thus extends SBOM practice to the AI domain by coupling threat modeling with automated, auditable evidence generation, providing a principled foundation for standardized, trustworthy, and machine-verifiable AI risk documentation.
zh

[AI-109] FLClear: Visually Verifiable Multi-Client Watermarking for Federated Learning

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中客户端模型知识产权(Intellectual Property Rights, IPR)保护不足的问题,特别是中央服务器可能恶意篡改全局模型以抹除客户端贡献或伪造所有权的风险。现有水印技术存在水印冲突、安全性不足及验证机制不直观等局限。其解决方案的关键在于提出一种名为FLClear的新框架,通过引入一个与对比学习联合优化的转置模型(transposed model),实现水印与主任务目标的协同融合;在验证阶段,利用该转置模型重建水印,并结合视觉检查和结构相似性指标进行直观且定量的所有权验证,从而实现无冲突的水印聚合、增强的安全性和可解释的验证机制。

链接: https://arxiv.org/abs/2511.12663
作者: Chen Gu,Yingying Sun,Yifan She,Donghui Hu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Federated learning (FL) enables multiple clients to collaboratively train a shared global model while preserving the privacy of their local data. Within this paradigm, the intellectual property rights (IPR) of client models are critical assets that must be protected. In practice, the central server responsible for maintaining the global model may maliciously manipulate the global model to erase client contributions or falsely claim sole ownership, thereby infringing on clients’ IPR. Watermarking has emerged as a promising technique for asserting model ownership and protecting intellectual property. However, existing FL watermarking approaches remain limited, suffering from potential watermark collisions among clients, insufficient watermark security, and non-intuitive verification mechanisms. In this paper, we propose FLClear, a novel framework that simultaneously achieves collision-free watermark aggregation, enhanced watermark security, and visually interpretable ownership verification. Specifically, FLClear introduces a transposed model jointly optimized with contrastive learning to integrate the watermarking and main task objectives. During verification, the watermark is reconstructed from the transposed model and evaluated through both visual inspection and structural similarity metrics, enabling intuitive and quantitative ownership verification. Comprehensive experiments conducted over various datasets, aggregation schemes, and attack scenarios demonstrate the effectiveness of FLClear and confirm that it consistently outperforms state-of-the-art FL watermarking methods.
zh

[AI-110] Scalable Hierarchical AI-Blockchain Framework for Real-Time Anomaly Detection in Large-Scale Autonomous Vehicle Networks

【速读】:该论文旨在解决自动驾驶车辆网络在传感器集成复杂性、实时性能需求及分布式通信协议下所面临的巨大安全挑战,尤其是现有安全方案难以在可接受的安全与隐私框架内实现亚10毫秒(sub-10 ms)的异常检测和大规模车辆网络的分布式协同。其解决方案的关键在于提出一种三层混合安全架构HAVEN(Hierarchical Autonomous Vehicle Enhanced Network),通过分层解耦机制实现高效安全防护:第一层在边缘侧部署轻量级集成异常检测模型以实现亚10毫秒级本地威胁识别;第二层采用拜占庭容错联邦学习(Byzantine-fault-tolerant federated learning)在区域尺度聚合威胁情报,保障分布式协同的鲁棒性;第三层引入精选区块链机制确保关键安全协调的不可篡改性和隐私保护。该架构显著提升了检测精度(94%准确率,92% F1-score)、网络韧性(支持20%节点被攻陷仍有效运行)并降低区块链存储开销,成功平衡了实时安全性与分布式协同之间的权衡。

链接: https://arxiv.org/abs/2511.12648
作者: Rathin Chandra Shit,Sharmila Subudhi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submitted to the Journal

点击查看摘要

Abstract:The security of autonomous vehicle networks is facing major challenges, owing to the complexity of sensor integration, real-time performance demands, and distributed communication protocols that expose vast attack surfaces around both individual and network-wide safety. Existing security schemes are unable to provide sub-10 ms (milliseconds) anomaly detection and distributed coordination of large-scale networks of vehicles within an acceptable safety/privacy framework. This paper introduces a three-tier hybrid security architecture HAVEN (Hierarchical Autonomous Vehicle Enhanced Network), which decouples real-time local threat detection and distributed coordination operations. It incorporates a light ensemble anomaly detection model on the edge (first layer), Byzantine-fault-tolerant federated learning to aggregate threat intelligence at a regional scale (middle layer), and selected blockchain mechanisms (top layer) to ensure critical security coordination. Extensive experimentation is done on a real-world autonomous driving dataset. Large-scale simulations with the number of vehicles ranging between 100 and 1000 and different attack types, such as sensor spoofing, jamming, and adversarial model poisoning, are conducted to test the scalability and resiliency of HAVEN. Experimental findings show sub-10 ms detection latency with an accuracy of 94% and F1-score of 92% across multimodal sensor data, Byzantine fault tolerance validated with 20% compromised nodes, and a reduced blockchain storage overhead, guaranteeing sufficient differential privacy. The proposed framework overcomes the important trade-off between real-time safety obligation and distributed security coordination with novel three-tiered processing. The scalable architecture of HAVEN is shown to provide great improvement in detection accuracy as well as network resilience over other methods.
zh

[AI-111] LLM 4SCREENLIT: Recommendations on Assessing the Performance of Large Language Models for Screening Literature in Systematic Reviews

【速读】:该论文旨在解决生成式 AI (Generative AI) 在系统性文献综述(Systematic Review, SR)筛选阶段中性能评估不足的问题,尤其关注传统指标在不平衡数据下的误导性及缺乏对漏检(False Negatives, FNs)代价的考量。其核心解决方案在于:优先评估召回率(recall)与丢失证据(lost evidence)风险,采用受控于随机水平且成本敏感的加权马修斯相关系数(Weighted MCC, WMCC),完整报告混淆矩阵以支持未来元分析,并将未分类输出视为需重新人工审核的阳性样本;同时引入防泄漏设计、非LLM基线对比和开放研究材料,最终基于成本效益分析(FN惩罚高于FP)来锚定结论,从而提升评估的科学性与可复现性。

链接: https://arxiv.org/abs/2511.12635
作者: Lech Madeyski,Barbara Kitchenham,Martin Shepperd
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 19 pages, 4 figures

点击查看摘要

Abstract:Context: Large language models (LLMs) are released faster than users’ ability to evaluate them rigorously. When LLMs underpin research, such as identifying relevant literature for systematic reviews (SRs), robust empirical assessment is essential. Objective: We identify and discuss key challenges in assessing LLM performance for selecting relevant literature, identify good (evaluation) practices, and propose recommendations. Method: Using a recent large-scale study as an example, we identify problems with the use of traditional metrics for assessing the performance of Gen-AI tools for identifying relevant literature in SRs. We analyzed 27 additional papers investigating this issue, extracted the performance metrics, and found both good practices and widespread problems, especially with the use and reporting of performance metrics for SR screening. Results: Major weaknesses included: i) a failure to use metrics that are robust to imbalanced data and do not directly indicate whether results are better than chance, e.g., the use of Accuracy, ii) a failure to consider the impact of lost evidence when making claims concerning workload savings, and iii) pervasive failure to report the full confusion matrix (or performance metrics from which it can be reconstructed) which is essential for future meta-analyses. On the positive side, we extract good (evaluation) practices on which our recommendations for researchers and practitioners, as well as policymakers, are built. Conclusions: SR screening evaluations should prioritize lost evidence/recall alongside chance-anchored and cost-sensitive Weighted MCC (WMCC) metric, report complete confusion matrices, treat unclassifiable outputs as referred-back positives for assessment, adopt leakage-aware designs with non-LLM baselines and open artifacts, and ground conclusions in cost-benefit analysis where FNs carry higher penalties than FPs.
zh

[AI-112] PID-controlled Langevin Dynamics for Faster Sampling of Generative Models NEURIPS2025

【速读】:该论文旨在解决Langevin动力学采样(Langevin Dynamics Sampling)生成速度极低的问题,其根本原因在于需要大量细粒度迭代才能收敛到目标分布。解决方案的关键在于提出PID控制的Langevin动力学(PID-controlled Langevin Dynamics, PIDLD),该方法基于控制理论重新诠释采样过程:将能量梯度视为反馈信号,通过引入积分项(历史梯度)和微分项(梯度趋势)来高效穿越能量景观并自适应稳定,从而显著减少高质量样本所需的迭代次数。该方法无需额外训练、数据或先验信息,可无缝集成至任意Langevin基采样方法中。

链接: https://arxiv.org/abs/2511.12603
作者: Hongyi Chen,Jianhai Shu,Jingtao Ding,Yong Li,Xiao-Ping Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: NeurIPS 2025 poster paper

点击查看摘要

Abstract:Langevin dynamics sampling suffers from extremely low generation speed, fundamentally limited by numerous fine-grained iterations to converge to the target distribution. We introduce PID-controlled Langevin Dynamics (PIDLD), a novel sampling acceleration algorithm that reinterprets the sampling process using control-theoretic principles. By treating energy gradients as feedback signals, PIDLD combines historical gradients (the integral term) and gradient trends (the derivative term) to efficiently traverse energy landscapes and adaptively stabilize, thereby significantly reducing the number of iterations required to produce high-quality samples. Our approach requires no additional training, datasets, or prior information, making it immediately integrable with any Langevin-based method. Extensive experiments across image generation and reasoning tasks demonstrate that PIDLD achieves higher quality with fewer steps, making Langevin-based generative models more practical for efficiency-critical applications. The implementation can be found at \hrefthis https URLthis https URL.
zh

[AI-113] Symmetry-Aware Graph Metanetwork Autoencoders: Model Merging through Parameter Canonicalization

【速读】:该论文旨在解决神经网络参数化中因对称性导致的多个等效极小值问题,这些问题使得不同结构相似但参数排列或缩放不同的模型难以在损失空间中自然收敛到同一区域,从而阻碍了模型合并(model merging)等应用。解决方案的关键在于提出一种基于Scale Graph Metanetworks (ScaleGMNs) 的自编码器框架,其核心创新是显式建模并利用了网络中的排列对称性和参数缩放对称性,使编码器具有等变性(equivariance),从而无需求解复杂的组合分配问题即可将隐式神经表示(INRs)和卷积神经网络(CNNs)对齐至共享的损失盆地,实现平滑线性插值且避免高损失区域。

链接: https://arxiv.org/abs/2511.12601
作者: Odysseas Boufalis,Jorge Carrasco-Pollo,Joshua Rosenthal,Eduardo Terres-Caballero,Alejandro García-Castellanos
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Neural network parameterizations exhibit inherent symmetries that yield multiple equivalent minima within the loss landscape. Scale Graph Metanetworks (ScaleGMNs) explicitly leverage these symmetries by proposing an architecture equivariant to both permutation and parameter scaling transformations. Previous work by Ainsworth et al. (2023) addressed permutation symmetries through a computationally intensive combinatorial assignment problem, demonstrating that leveraging permutation symmetries alone can map networks into a shared loss basin. In this work, we extend their approach by also incorporating scaling symmetries, presenting an autoencoder framework utilizing ScaleGMNs as invariant encoders. Experimental results demonstrate that our method aligns Implicit Neural Representations (INRs) and Convolutional Neural Networks (CNNs) under both permutation and scaling symmetries without explicitly solving the assignment problem. This approach ensures that similar networks naturally converge within the same basin, facilitating model merging, i.e., smooth linear interpolation while avoiding regions of high loss. The code is publicly available on our GitHub repository.
zh

[AI-114] Enhancing Conversational Recommender Systems with Tree-Structured Knowledge and Pretrained Language Models

【速读】:该论文旨在解决当前生成式对话推荐系统(Conversational Recommender Systems, CRS)在融合预训练语言模型(Pretrained Language Models, PLMs)与知识图谱(Knowledge Graphs, KGs)时存在的三大关键问题:一是未能充分挖掘PLM对图结构关系的推理能力;二是未根据上下文语境筛选检索到的知识,导致噪声引入;三是忽视多轮对话中用户协同偏好信息。解决方案的核心在于提出PCRS-TKA框架,其关键创新包括:构建对话特定的知识树(Knowledge Trees)并序列化为文本以支持结构感知推理,结合语义对齐模块实现异构输入的噪声抑制与一致性增强,并通过专门设计的监督信号显式建模协同偏好,从而在推荐准确性和对话质量上均取得显著提升。

链接: https://arxiv.org/abs/2511.12579
作者: Yongwen Ren,Chao Wang,Peng Du,Chuan Qin,Dazhong Shen,Hui Xiong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in pretrained language models (PLMs) have significantly improved conversational recommender systems (CRS), enabling more fluent and context-aware interactions. To further enhance accuracy and mitigate hallucination, many methods integrate PLMs with knowledge graphs (KGs), but face key challenges: failing to fully exploit PLM reasoning over graph relationships, indiscriminately incorporating retrieved knowledge without context filtering, and neglecting collaborative preferences in multi-turn dialogues. To this end, we propose PCRS-TKA, a prompt-based framework employing retrieval-augmented generation to integrate PLMs with KGs. PCRS-TKA constructs dialogue-specific knowledge trees from KGs and serializes them into texts, enabling structure-aware reasoning while capturing rich entity semantics. Our approach selectively filters context-relevant knowledge and explicitly models collaborative preferences using specialized supervision signals. A semantic alignment module harmonizes heterogeneous inputs, reducing noise and enhancing accuracy. Extensive experiments demonstrate that PCRS-TKA consistently outperforms all baselines in both recommendation and conversational quality.
zh

[AI-115] Enhancing Machine Learning Model Efficiency through Quantization and Bit Depth Optimization: A Performance Analysis on Healthcare Data

【速读】:该论文旨在解决复杂学习模型在实际应用中因执行时间过长而导致的效率瓶颈问题。其核心解决方案在于采用量化(quantization)与位深度(bit-depth)优化技术,将输入数据从 float64 和 int32 精度压缩至更低精度(如 float32),从而显著降低时间复杂度,同时保持模型性能接近原始水平。实验基于两个医学数据集对逻辑回归(Logistic Regression, LR)模型进行验证,结果表明该方法可在保证模型准确率损失最小的前提下大幅提升计算效率,体现出该优化策略在资源受限场景下的实用性与先进性。

链接: https://arxiv.org/abs/2511.12568
作者: Mitul Goswami,Romit Chatterjee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published as Chapter 2 in Intelligent and Smart Computing: Applications to Engineering Problems, Cambridge Scholars Publishing (2025). ISBN: 978-1-0364-5886-7

点击查看摘要

Abstract:This research aims to optimize intricate learning models by implementing quantization and bit-depth optimization techniques. The objective is to significantly cut time complexity while preserving model efficiency, thus addressing the challenge of extended execution times in intricate models. Two medical datasets were utilized as case studies to apply a Logistic Regression (LR) machine learning model. Using efficient quantization and bit depth optimization strategies the input data is downscaled from float64 to float32 and int32. The results demonstrated a significant reduction in time complexity, with only a minimal decrease in model accuracy post-optimization, showcasing the state-of-the-art optimization approach. This comprehensive study concludes that the impact of these optimization techniques varies depending on a set of parameters.
zh

[AI-116] LOBERT: Generative AI Foundation Model for Limit Order Book Messages NEURIPS2025

【速读】:该论文旨在解决金融限价订单簿(Limit Order Book, LOB)在消息级别建模中的挑战,包括事件发生时间的不规则性、快速的状态转换以及高频交易者对可见订单流的反应。传统LOB模型通常依赖复杂的特征表示且泛化能力弱,难以适应不同下游任务。解决方案的关键在于提出LOBERT——一种适用于LOB数据的通用编码器-only基础模型,其核心创新是设计了一种新颖的分词机制,将多维消息整体视为单个token,同时保留价格、成交量和时间的连续表征,从而在预测买卖中间价变动和下一消息等任务上实现领先性能,并显著缩短所需上下文长度。

链接: https://arxiv.org/abs/2511.12563
作者: Eljas Linna,Kestutis Baltakys,Alexandros Iosifidis,Juho Kanniainen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Submission for NeurIPS 2025 GenAI in Finance Workshop

点击查看摘要

Abstract:Modeling the dynamics of financial Limit Order Books (LOB) at the message level is challenging due to irregular event timing, rapid regime shifts, and the reactions of high-frequency traders to visible order flow. Previous LOB models require cumbersome data representations and lack adaptability outside their original tasks, leading us to introduce LOBERT, a general-purpose encoder-only foundation model for LOB data suitable for downstream fine-tuning. LOBERT adapts the original BERT architecture for LOB data by using a novel tokenization scheme that treats complete multi-dimensional messages as single tokens while retaining continuous representations of price, volume, and time. With these methods, LOBERT achieves leading performance in tasks such as predicting mid-price movements and next messages, while reducing the required context length compared to previous methods.
zh

[AI-117] Perturbing Best Responses in Zero-Sum Games AAAI2026

【速读】:该论文旨在解决在零和博弈中,基于最优响应(best-response)的算法(如Double Oracle和Fictitious Play)在逼近纳什均衡(Nash equilibrium)时收敛速度慢的问题。其解决方案的关键在于引入对效用函数的扰动机制:即在Oracle计算最优响应前,对效用进行扰动,从而减少算法所需的迭代次数;对于某些情形,适当设计的扰动可使期望迭代次数降至对数级别。尽管该扰动方法在计算上较复杂(需遍历所有纯策略),作者进一步证明,在纯策略具有内部结构的博弈中,该扰动能被高效实现。

链接: https://arxiv.org/abs/2511.12523
作者: Adam Dziwoki,Rostislav Horcik
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注: Accepted to AAAI 2026

点击查看摘要

Abstract:This paper investigates the impact of perturbations on the best-response-based algorithms approximating Nash equilibria in zero-sum games, namely Double Oracle and Fictitious Play. More precisely, we assume that the oracle computing the best responses perturbs the utilities before selecting the best response. We show that using such an oracle reduces the number of iterations for both algorithms. For some cases, suitable perturbations ensure the expected number of iterations is logarithmic. Although the utility perturbation is computationally demanding as it requires iterating through all pure strategies, we demonstrate that one can efficiently perturb the utilities in games where pure strategies have further inner structure.
zh

[AI-118] owards Better IncomLDL: We Are Unaware of Hidden Labels in Advance

【速读】:该论文旨在解决不完整标签分布学习(Incomplete Label Distribution Learning, IncomLDL)中一个关键的现实问题:现有方法将缺失标签的描述度设为0,而保持其他标签不变,这种设定不符合实际场景——当某些标签缺失时,剩余标签的分布度应相应增加。为此,作者提出新的问题“带有隐藏标签的标签分布学习”(Label Distribution Learning with Hidden Labels, HidLDL),目标是从真实世界中不完整的标签分布中恢复出完整的标签分布。解决方案的关键在于引入一种创新性约束来捕捉已观测标签的比例信息,并在优化过程中加以利用;同时结合局部特征相似性和全局低秩结构来揭示隐藏标签的潜在模式。此外,作者还从理论上给出了所提方法的恢复边界,证明了从隐藏标签中学习的可行性。

链接: https://arxiv.org/abs/2511.12494
作者: Jiecheng Jiang,Jiawei Tang,Jiahao Jiang,Hui Liu,Junhui Hou,Yuheng Jia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Label distribution learning (LDL) is a novel paradigm that describe the samples by label distribution of a sample. However, acquiring LDL dataset is costly and time-consuming, which leads to the birth of incomplete label distribution learning (IncomLDL). All the previous IncomLDL methods set the description degrees of “missing” labels in an instance to 0, but remains those of other labels unchanged. This setting is unrealistic because when certain labels are missing, the degrees of the remaining labels will increase accordingly. We fix this unrealistic setting in IncomLDL and raise a new problem: LDL with hidden labels (HidLDL), which aims to recover a complete label distribution from a real-world incomplete label distribution where certain labels in an instance are omitted during annotation. To solve this challenging problem, we discover the significance of proportional information of the observed labels and capture it by an innovative constraint to utilize it during the optimization process. We simultaneously use local feature similarity and the global low-rank structure to reveal the mysterious veil of hidden labels. Moreover, we theoretically give the recovery bound of our method, proving the feasibility of our method in learning from hidden labels. Extensive recovery and predictive experiments on various datasets prove the superiority of our method to state-of-the-art LDL and IncomLDL methods.
zh

[AI-119] Uncover and Unlearn Nuisances: Agnostic Fully Test-Time Adaptation

【速读】:该论文旨在解决完全测试时自适应(Fully Test-Time Adaptation, FTTA)场景下的模型泛化问题,即在无法获取源域数据和预训练模型训练协议的情况下,如何使模型有效应对不可预测的目标域分布偏移(domain shifts)。传统方法依赖于对齐源域与目标域的特征分布,但在FTTA设定下因缺乏训练数据而不可行。解决方案的关键在于提出一种无监督域适应框架(Agnostic FTTA, AFTTA),其核心是采用“揭示-消除”(uncover-and-unlearn)策略:首先通过预定义映射模拟潜在的不必要域偏移并将其视为干扰项(nuisance),随后在测试阶段通过正则化隐空间表示和标签预测中的此类偏移来强制模型消除这些干扰;具体而言,引入基于互信息(mutual information)的约束机制,在特征空间中引导干扰项的去除,并在标签空间中促进置信且一致的预测输出,从而实现对未知目标域的鲁棒泛化。

链接: https://arxiv.org/abs/2511.12491
作者: Ponhvoan Srey,Yaxin Shi,Hangwei Qian,Jing Li,Ivor W. Tsang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 26 pages, 4 figures

点击查看摘要

Abstract:Fully Test-Time Adaptation (FTTA) addresses domain shifts without access to source data and training protocols of the pre-trained models. Traditional strategies that align source and target feature distributions are infeasible in FTTA due to the absence of training data and unpredictable target domains. In this work, we exploit a dual perspective on FTTA, and propose Agnostic FTTA (AFTTA) as a novel formulation that enables the usage of off-the-shelf domain transformations during test-time to enable direct generalization to unforeseeable target data. To address this, we develop an uncover-and-unlearn approach. First, we uncover potential unwanted shifts between source and target domains by simulating them through predefined mappings and consider them as nuisances. Then, during test-time prediction, the model is enforced to unlearn these nuisances by regularizing the consequent shifts in latent representations and label predictions. Specifically, a mutual information-based criterion is devised and applied to guide nuisances unlearning in the feature space and encourage confident and consistent prediction in label space. Our proposed approach explicitly addresses agnostic domain shifts, enabling superior model generalization under FTTA constraints. Extensive experiments on various tasks, involving corruption and style shifts, demonstrate that our method consistently outperforms existing approaches.
zh

[AI-120] ARCHE: A Novel Task to Evaluate LLM s on Latent Reasoning Chain Extraction AAAI2026

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在科学领域应用中缺乏对推理结构化理解的问题,即模型虽能生成看似合理的推理内容(如通过链式思维提示),但其输出通常是无结构且非正式的,难以判断其是否真正掌握了科学推理的核心范式。解决方案的关键在于提出一种名为“潜在推理链提取”(Latent Reasoning Chain Extraction, ARCHE)的新任务,要求模型将复杂推理论证分解为标准推理范式的组合,并以“推理逻辑树”(Reasoning Logic Tree, RLT)的形式显式表达,其中每个推理步骤被归类为皮尔士(Peirce)提出的三种基本推理模式之一:演绎(deduction)、归纳(induction)或溯因(abduction)。为此,作者构建了ARCHE Bench基准数据集,涵盖70篇《自然·通讯》文章中的1,900余条引用和38,000个观点,并设计了两个逻辑感知评估指标:实体覆盖率(Entity Coverage, EC)衡量内容完整性,推理边准确率(Reasoning Edge Accuracy, REA)衡量每一步逻辑有效性。实验表明,当前主流LLMs在REA与EC之间存在权衡,尚无法提取完整且符合标准的推理链,揭示了现有模型在科学论证严谨性方面的显著差距。

链接: https://arxiv.org/abs/2511.12485
作者: Pengze Li,Jiaqi Liu,Junchi Yu,Lihao Liu,Mingyu Ding,Wanli Ouyang,Shixiang Tang,Xi Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to AAAI 2026

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used in scientific domains. While they can produce reasoning-like content via methods such as chain-of-thought prompting, these outputs are typically unstructured and informal, obscuring whether models truly understand the fundamental reasoning paradigms that underpin scientific inference. To address this, we introduce a novel task named Latent Reasoning Chain Extraction (ARCHE), in which models must decompose complex reasoning arguments into combinations of standard reasoning paradigms in the form of a Reasoning Logic Tree (RLT). In RLT, all reasoning steps are explicitly categorized as one of three variants of Peirce’s fundamental inference modes: deduction, induction, or abduction. To facilitate this task, we release ARCHE Bench, a new benchmark derived from 70 Nature Communications articles, including more than 1,900 references and 38,000 viewpoints. We propose two logic-aware evaluation metrics: Entity Coverage (EC) for content completeness and Reasoning Edge Accuracy (REA) for step-by-step logical validity. Evaluations on 10 leading LLMs on ARCHE Bench reveal that models exhibit a trade-off between REA and EC, and none are yet able to extract a complete and standard reasoning chain. These findings highlight a substantial gap between the abilities of current reasoning models and the rigor required for scientific argumentation.
zh

[AI-121] One Request Multiple Experts: LLM Orchestrates Domain Specific Models via Adaptive Task Routing

【速读】:该论文旨在解决主动配电网(Active Distribution Networks, ADNs)在集成大量分布式能源资源和新型市场参与主体背景下,因多场景、多目标特性导致的运行复杂性问题。当前依赖专家工程师开发的领域特定模型(Domain Specific Models, DSMs)虽能应对各类技术难题,但其异构性使得ADN运营者在掌握、整合与协同这些模型时面临巨大开销。解决方案的关键在于提出ADN-Agent架构,该架构利用通用大语言模型(Large Language Model, LLM)实现对多个DSM的统一协调,具备自适应意图识别、任务分解与DSM调用能力;同时设计了一种新颖的通信机制以提供统一且灵活的接口支持多样化DSM,并针对语言密集型子任务构建自动化微调流程来增强小型语言模型性能,从而显著提升系统的整体问题求解能力。

链接: https://arxiv.org/abs/2511.12484
作者: Xu Yang,Chenhui Lin,Haotian Liu,Qi Wang,Yue Yang,Wenchuan Wu
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the integration of massive distributed energy resources and the widespread participation of novel market entities, the operation of active distribution networks (ADNs) is progressively evolving into a complex multi-scenario, multi-objective problem. Although expert engineers have developed numerous domain specific models (DSMs) to address distinct technical problems, mastering, integrating, and orchestrating these heterogeneous DSMs still entail considerable overhead for ADN operators. Therefore, an intelligent approach is urgently required to unify these DSMs and enable efficient coordination. To address this challenge, this paper proposes the ADN-Agent architecture, which leverages a general large language model (LLM) to coordinate multiple DSMs, enabling adaptive intent recognition, task decomposition, and DSM invocation. Within the ADN-Agent, we design a novel communication mechanism that provides a unified and flexible interface for diverse heterogeneous DSMs. Finally, for some language-intensive subtasks, we propose an automated training pipeline for fine-tuning small language models, thereby effectively enhancing the overall problem-solving capability of the system. Comprehensive comparisons and ablation experiments validate the efficacy of the proposed method and demonstrate that the ADN-Agent architecture outperforms existing LLM application paradigms.
zh

[AI-122] Personality-guided Public-Private Domain Disentangled Hypergraph-Former Network for Multimodal Depression Detection AAAI2026

【速读】:该论文旨在解决当前基于Transformer或图神经网络(Graph Neural Networks, GNNs)的多模态抑郁症检测方法在建模个体差异和跨模态时间依赖性方面存在的局限性,尤其是在多样化行为情境下的泛化能力不足问题。其解决方案的关键在于提出P^3 HF(Personality-guided Public-Private Domain Disentangled Hypergraph-Former Network),包含三个核心创新:(1) 利用大语言模型(Large Language Models, LLMs)将离散个体特征转化为上下文描述以实现个性化表征学习;(2) 设计超图-Transformer(Hypergraph-Former)架构以捕捉高阶跨模态时间关系;(3) 通过对比学习实现事件级别的域解耦,提升在不同行为场景中的泛化性能。实验表明,该方法在MPDD-Young数据集上相较现有方法在二分类与三分类任务中准确率和加权F1指标均提升约10%,且消融实验证实各模块独立贡献显著,尤其人格引导表征学习与高阶超图推理对生成鲁棒、个体感知的抑郁相关表征至关重要。

链接: https://arxiv.org/abs/2511.12460
作者: Changzeng Fu,Shiwen Zhao,Yunze Zhang,Zhongquan Jian,Shiqi Zhao,Chaoran Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: AAAI 2026 accepted

点击查看摘要

Abstract:Depression represents a global mental health challenge requiring efficient and reliable automated detection methods. Current Transformer- or Graph Neural Networks (GNNs)-based multimodal depression detection methods face significant challenges in modeling individual differences and cross-modal temporal dependencies across diverse behavioral contexts. Therefore, we propose P ^3 HF (Personality-guided Public-Private Domain Disentangled Hypergraph-Former Network) with three key innovations: (1) personality-guided representation learning using LLMs to transform discrete individual features into contextual descriptions for personalized encoding; (2) Hypergraph-Former architecture modeling high-order cross-modal temporal relationships; (3) event-level domain disentanglement with contrastive learning for improved generalization across behavioral contexts. Experiments on MPDD-Young dataset show P ^3 HF achieves around 10% improvement on accuracy and weighted F1 for binary and ternary depression classification task over existing methods. Extensive ablation studies validate the independent contribution of each architectural component, confirming that personality-guided representation learning and high-order hypergraph reasoning are both essential for generating robust, individual-aware depression-related representations. The code is released at this https URL.
zh

[AI-123] SeedAIchemy: LLM -Driven Seed Corpus Generation for Fuzzing

【速读】:该论文旨在解决软件模糊测试(fuzzing)中高质量输入种子(seed corpus)获取困难的问题,尤其是在自动化和规模化构建有效测试用例方面存在瓶颈。解决方案的关键在于提出了一种名为SeedAIchemy的自动化工具,其核心创新是利用大语言模型(Large Language Model, LLM)驱动的五模块架构,其中四个模块通过LLM工作流生成优化的搜索词,以从互联网上高效收集高质量公共文件,从而构建出显著优于朴素种子集、且接近人工标注质量的测试语料库(corpus)。

链接: https://arxiv.org/abs/2511.12448
作者: Aidan Wen,Norah A. Alzahrani,Jingzhi Jiang,Andrew Joe,Karen Shieh,Andy Zhang,Basel Alomair,David Wagner
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce SeedAIchemy, an automated LLM-driven corpus generation tool that makes it easier for developers to implement fuzzing effectively. SeedAIchemy consists of five modules which implement different approaches at collecting publicly available files from the internet. Four of the five modules use large language model (LLM) workflows to construct search terms designed to maximize corpus quality. Corpora generated by SeedAIchemy perform significantly better than a naive corpus and similarly to a manually-curated corpus on a diverse range of target programs and libraries.
zh

[AI-124] Global-Lens Transformers: Adaptive Token Mixing for Dynamic Link Prediction AAAI2026

【速读】:该论文旨在解决动态图学习中基于Transformer模型因自注意力机制导致的二次时间复杂度问题,从而限制了其在高频或大规模图数据上的可扩展性。其解决方案的关键在于提出一种无需自注意力机制的Transformer风格框架GLFormer,通过引入自适应token混合器(adaptive token mixer)实现基于交互顺序和时间间隔的上下文感知局部聚合,并结合分层聚合模块以逐层扩展时间感受野,从而在保持高性能的同时显著提升计算效率。

链接: https://arxiv.org/abs/2511.12442
作者: Tao Zou,Chengfeng Wu,Tianxi Liao,Junchen Ye,Bowen Du
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Dynamic graph learning plays a pivotal role in modeling evolving relationships over time, especially for temporal link prediction tasks in domains such as traffic systems, social networks, and recommendation platforms. While Transformer-based models have demonstrated strong performance by capturing long-range temporal dependencies, their reliance on self-attention results in quadratic complexity with respect to sequence length, limiting scalability on high-frequency or large-scale graphs. In this work, we revisit the necessity of self-attention in dynamic graph modeling. Inspired by recent findings that attribute the success of Transformers more to their architectural design than attention itself, we propose GLFormer, a novel attention-free Transformer-style framework for dynamic graphs. GLFormer introduces an adaptive token mixer that performs context-aware local aggregation based on interaction order and time intervals. To capture long-term dependencies, we further design a hierarchical aggregation module that expands the temporal receptive field by stacking local token mixers across layers. Experiments on six widely-used dynamic graph benchmarks show that GLFormer achieves SOTA performance, which reveals that attention-free architectures can match or surpass Transformer baselines in dynamic graph settings with significantly improved efficiency.
zh

[AI-125] Multi-agent Self-triage System with Medical Flowcharts

【速读】:该论文旨在解决当前在线健康资源和大语言模型(Large Language Models, LLMs)在医疗决策支持中面临的关键问题:准确性不足、缺乏透明度以及易受未经验证信息影响。为应对这些问题,作者提出了一种基于临床验证流程图的对话式自我分诊系统,其核心在于通过一个由检索代理、决策代理和聊天代理组成的多智能体框架,将LLMs与美国医学会(American Medical Association)提供的100条临床验证流程图相结合,从而实现结构化且可审计的患者决策支持。该方案的关键创新在于将自由文本交互的灵活性与标准化临床协议的严谨性相融合,显著提升了流程图检索(Top-3准确率达95.29%)和导航准确性(达99.10%),展示了可解释、高精度且泛化能力强的AI辅助自我分诊可行性。

链接: https://arxiv.org/abs/2511.12439
作者: Yujia Liu,Sophia Yu,Hongyue Jin,Jessica Wen,Alexander Qian,Terrence Lee,Mattheus Ramsis,Gi Won Choi,Lianhui Qin,Xin Liu,Edward J. Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Online health resources and large language models (LLMs) are increasingly used as a first point of contact for medical decision-making, yet their reliability in healthcare remains limited by low accuracy, lack of transparency, and susceptibility to unverified information. We introduce a proof-of-concept conversational self-triage system that guides LLMs with 100 clinically validated flowcharts from the American Medical Association, providing a structured and auditable framework for patient decision support. The system leverages a multi-agent framework consisting of a retrieval agent, a decision agent, and a chat agent to identify the most relevant flowchart, interpret patient responses, and deliver personalized, patient-friendly recommendations, respectively. Performance was evaluated at scale using synthetic datasets of simulated conversations. The system achieved 95.29% top-3 accuracy in flowchart retrieval (N=2,000) and 99.10% accuracy in flowchart navigation across varied conversational styles and conditions (N=37,200). By combining the flexibility of free-text interaction with the rigor of standardized clinical protocols, this approach demonstrates the feasibility of transparent, accurate, and generalizable AI-assisted self-triage, with potential to support informed patient decision-making while improving healthcare resource utilization.
zh

[AI-126] SynthGuard: An Open Platform for Detecting AI-Generated Multimedia with Multimodal LLM s

【速读】:该论文旨在解决AI生成多媒体内容(如图像、音频和视频)日益泛滥所带来的风险,包括虚假信息传播、身份滥用以及公众信任度下降等问题。现有深度伪造检测工具普遍存在闭源、模态单一、缺乏透明度和教育价值等局限,难以帮助用户理解检测决策过程。解决方案的关键在于提出SynthGuard平台,该平台采用开源设计,结合传统检测器与多模态大语言模型(Multimodal Large Language Models, MLLMs),提供可解释的推理能力、统一支持图像与音频分析,并配备交互式界面,从而提升检测结果的可理解性与可用性,使研究人员、教育工作者及公众均可便捷开展媒体真实性分析。

链接: https://arxiv.org/abs/2511.12404
作者: Shail Desai,Aditya Pawar,Li Lin,Xin Wang,Shu Hu
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Artificial Intelligence (AI) has made it possible for anyone to create images, audio, and video with unprecedented ease, enriching education, communication, and creative expression. At the same time, the rapid rise of AI-generated media has introduced serious risks, including misinformation, identity misuse, and the erosion of public trust as synthetic content becomes increasingly indistinguishable from real media. Although deepfake detection has advanced, many existing tools remain closed-source, limited in modality, or lacking transparency and educational value, making it difficult for users to understand how detection decisions are made. To address these gaps, we introduce SynthGuard, an open, user-friendly platform for detecting and analyzing AI-generated multimedia using both traditional detectors and multimodal large language models (MLLMs). SynthGuard provides explainable inference, unified image and audio support, and an interactive interface designed to make forensic analysis accessible to researchers, educators, and the public. The SynthGuard platform is available at: this https URL
zh

[AI-127] Learning to Trust: Bayesian Adaptation to Varying Suggester Reliability in Sequential Decision Making

【速读】:该论文旨在解决在部分可观测环境中,自主代理如何有效利用外部动作建议(action suggestions)的问题。现有方法通常假设建议者的可靠性是静态且已知的,这限制了其在实际场景中的部署能力。解决方案的关键在于提出一个动态学习框架:首先,将建议者质量参数直接嵌入代理的信念表示中,使代理能够通过贝叶斯推理推断建议者的类型并自适应调整对建议的依赖程度;其次,引入显式的“请求”(ask)动作,使代理能够在关键时刻战略性地请求建议,在信息增益与获取成本之间实现平衡。该框架实现了对不同可靠性建议的鲁棒适应和策略性管理,为不确定环境下的自适应人机协作提供了理论基础。

链接: https://arxiv.org/abs/2511.12378
作者: Dylan M. Asmar,Mykel J. Kochenderfer
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:Autonomous agents operating in sequential decision-making tasks under uncertainty can benefit from external action suggestions, which provide valuable guidance but inherently vary in reliability. Existing methods for incorporating such advice typically assume static and known suggester quality parameters, limiting practical deployment. We introduce a framework that dynamically learns and adapts to varying suggester reliability in partially observable environments. First, we integrate suggester quality directly into the agent’s belief representation, enabling agents to infer and adjust their reliance on suggestions through Bayesian inference over suggester types. Second, we introduce an explicit ``ask’’ action allowing agents to strategically request suggestions at critical moments, balancing informational gains against acquisition costs. Experimental evaluation demonstrates robust performance across varying suggester qualities, adaptation to changing reliability, and strategic management of suggestion requests. This work provides a foundation for adaptive human-agent collaboration by addressing suggestion uncertainty in uncertain environments.
zh

[AI-128] More Than Irrational: Modeling Belief-Biased Agents AAAI2026 AAAI

【速读】:该论文旨在解决如何预测和推断认知受限用户在面对有限记忆等认知约束时所表现出的次优行为问题,此类行为并非源于非理性,而是由其在有偏信念下进行最优决策的结果。解决方案的关键在于提出了一类计算理性(computational-rational, CR)用户模型,该模型显式建模了有限记忆过程如何导致动态不一致且有偏的信念状态,从而引发次优的序列决策;并进一步设计了一种基于嵌套粒子滤波(nested particle filtering)的在线推断方法,能够从被动观测中实时估计用户的隐含认知边界(如记忆容量)并恢复其有偏信念状态,从而为开发能适应用户认知限制的智能助手提供理论基础与实现路径。

链接: https://arxiv.org/abs/2511.12359
作者: Yifan Zhu,Sammie Katt,Samuel Kaski
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 13 pages, 8 figures. Accepted at the 40th Annual AAAI Conference on Artificial Intelligence (AAAI 2026)

点击查看摘要

Abstract:Despite the explosive growth of AI and the technologies built upon it, predicting and inferring the sub-optimal behavior of users or human collaborators remains a critical challenge. In many cases, such behaviors are not a result of irrationality, but rather a rational decision made given inherent cognitive bounds and biased beliefs about the world. In this paper, we formally introduce a class of computational-rational (CR) user models for cognitively-bounded agents acting optimally under biased beliefs. The key novelty lies in explicitly modeling how a bounded memory process leads to a dynamically inconsistent and biased belief state and, consequently, sub-optimal sequential decision-making. We address the challenge of identifying the latent user-specific bound and inferring biased belief states from passive observations on the fly. We argue that for our formalized CR model family with an explicit and parameterized cognitive process, this challenge is tractable. To support our claim, we propose an efficient online inference method based on nested particle filtering that simultaneously tracks the user’s latent belief state and estimates the unknown cognitive bound from a stream of observed actions. We validate our approach in a representative navigation task using memory decay as an example of a cognitive bound. With simulations, we show that (1) our CR model generates intuitively plausible behaviors corresponding to different levels of memory capacity, and (2) our inference method accurately and efficiently recovers the ground-truth cognitive bounds from limited observations ( \le 100 steps). We further demonstrate how this approach provides a principled foundation for developing adaptive AI assistants, enabling adaptive assistance that accounts for the user’s memory limitations.
zh

[AI-129] Dynamic Reward Scaling for Multivariate Time Series Anomaly Detection: A VAE-Enhanced Reinforcement Learning Approach

【速读】:该论文旨在解决多变量时间序列中异常检测的挑战,这些问题包括高维度、标注数据有限以及传感器间细微依赖关系带来的复杂性。解决方案的关键在于提出一种融合变分自编码器(Variational Autoencoder, VAE)、基于LSTM的深度Q网络(LSTM-based Deep Q-Network, DQN)、动态奖励缩放(Dynamic Reward Scaling for Multivariate Time Series Anomaly Detection, DRSMT)和主动学习模块的深度强化学习框架。其中,VAE用于提取紧凑且去噪的潜在表示,DQN实现自适应的序贯异常分类,动态奖励缩放机制在训练过程中平衡重建误差与分类信号的重要性以优化探索与利用,而主动学习则通过识别不确定性最高的样本进行选择性标注,显著减少人工监督需求。该方法在Server Machine Dataset (SMD) 和 Water Distribution Testbed (WADI) 两个基准数据集上均优于现有基线模型,在F1分数和平均精度-召回曲线下面积(AU-PR)指标上表现突出,验证了其在真实多变量系统中准确性和可扩展性的优势。

链接: https://arxiv.org/abs/2511.12351
作者: Bahareh Golchin,Banafsheh Rekabdar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Detecting anomalies in multivariate time series is essential for monitoring complex industrial systems, where high dimensionality, limited labeled data, and subtle dependencies between sensors cause significant challenges. This paper presents a deep reinforcement learning framework that combines a Variational Autoencoder (VAE), an LSTM-based Deep Q-Network (DQN), dynamic reward shaping, and an active learning module to address these issues in a unified learning framework. The main contribution is the implementation of Dynamic Reward Scaling for Multivariate Time Series Anomaly Detection (DRSMT), which demonstrates how each component enhances the detection process. The VAE captures compact latent representations and reduces noise. The DQN enables adaptive, sequential anomaly classification, and the dynamic reward shaping balances exploration and exploitation during training by adjusting the importance of reconstruction and classification signals. In addition, active learning identifies the most uncertain samples for labeling, reducing the need for extensive manual supervision. Experiments on two multivariate benchmarks, namely Server Machine Dataset (SMD) and Water Distribution Testbed (WADI), show that the proposed method outperforms existing baselines in F1-score and AU-PR. These results highlight the effectiveness of combining generative modeling, reinforcement learning, and selective supervision for accurate and scalable anomaly detection in real-world multivariate systems.
zh

[AI-130] Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning

【速读】:该论文旨在解决当前基于强化学习(Reinforcement Learning, RL)的大语言模型(Large Language Models, LLMs)在复杂推理任务中存在两大局限:一是现有方法主要局限于单一领域且依赖可验证奖励信号(Verifiable Reward, VR),二是纯在线RL框架限制了探索空间,从而制约了推理性能的提升。解决方案的关键在于提出一种名为RGR-GRPO(Reward and Guidance through Rubrics)的评分卡驱动型强化学习框架,通过引入细粒度的评分标准(rubrics)提供密集且信息丰富的奖励信号,并结合离线指导策略扩展解空间,从而在多领域推理任务中实现更高效的探索与优化。实验表明,该方法在14个跨领域基准测试中显著优于仅依赖替代奖励机制或离线引导的RL方法,在数学、物理、化学及通用推理任务上平均提升幅度分别为+7.0%、+5.4%、+8.4%和+6.6%,同时保持训练过程中的熵稳定性和优异的pass@k性能,证明其具备持续探索能力和突破现有瓶颈的有效性。

链接: https://arxiv.org/abs/2511.12344
作者: Baolong Bi,Shenghua Liu,Yiwei Wang,Siqian Tong,Lingrui Mei,Yuyao Ge,Yilong Xu,Jiafeng Guo,Xueqi Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in reinforcement learning (RL) have significantly improved the complex reasoning capabilities of large language models (LLMs). Despite these successes, existing methods mainly focus on single-domain RL (e.g., mathematics) with verifiable rewards (RLVR), and their reliance on purely online RL frameworks restricts the exploration space, thereby limiting reasoning performance. In this paper, we address these limitations by leveraging rubrics to provide both fine-grained reward signals and offline guidance. We propose \textbfRGR-GRPO (Reward and Guidance through Rubrics), a rubric-driven RL framework for multi-domain reasoning. RGR-GRPO enables LLMs to receive dense and informative rewards while exploring a larger solution space during GRPO training. Extensive experiments across 14 benchmarks spanning multiple domains demonstrate that RGR-GRPO consistently outperforms RL methods that rely solely on alternative reward schemes or offline guidance. Compared with verifiable online RL baseline, RGR-GRPO achieves average improvements of +7.0%, +5.4%, +8.4%, and +6.6% on mathematics, physics, chemistry, and general reasoning tasks, respectively. Notably, RGR-GRPO maintains stable entropy fluctuations during off-policy training and achieves superior pass@k performance, reflecting sustained exploration and effective breakthrough beyond existing performance bottlenecks.
zh

[AI-131] Optimal Self-Consistency for Efficient Reasoning with Large Language Models

【速读】:该论文旨在解决自洽性(Self-consistency, SC)在大规模数据集上应用时计算成本过高且缺乏统一理论分析的问题,尤其关注其样本效率与缩放行为的建模。解决方案的关键在于首次系统性地从模式估计和投票理论出发,推导出SC在不同数据集上的幂律缩放关系,并提出一种新型动态采样策略——Blend-ASC,该方法在推理过程中根据问题难度动态分配样本数量,从而显著提升样本效率;相比传统固定或动态分配方案,Blend-ASC无需超参数调整且能适配任意样本预算,在平均仅需原生SC 6.8倍样本的情况下实现更优性能,展现出卓越的实用性和通用性。

链接: https://arxiv.org/abs/2511.12309
作者: Austin Feng,Marius Alonso,Ambroise Odonnat
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Self-consistency (SC) is a widely used test-time inference technique for improving performance in chain-of-thought reasoning. It involves generating multiple responses, or samples from a large language model (LLM) and selecting the most frequent answer. This procedure can naturally be viewed as a majority vote or empirical mode estimation. Despite its effectiveness, SC is prohibitively expensive at scale when naively applied to datasets, and it lacks a unified theoretical treatment of sample efficiency and scaling behavior. In this paper, we provide the first comprehensive analysis of SC’s scaling behavior and its variants, drawing on mode estimation and voting theory. We derive and empirically validate power law scaling for self-consistency across datasets, and analyze the sample efficiency for fixed-allocation and dynamic-allocation sampling schemes. From these insights, we introduce Blend-ASC, a novel variant of self-consistency that dynamically allocates samples to questions during inference, achieving state-of-the-art sample efficiency. Our approach uses 6.8x fewer samples than vanilla SC on average, outperforming both fixed- and dynamic-allocation SC baselines, thereby demonstrating the superiority of our approach in terms of efficiency. In contrast to existing variants, Blend-ASC is hyperparameter-free and can fit an arbitrary sample budget, ensuring it can be easily applied to any self-consistency application.
zh

[AI-132] UpBench: A Dynamically Evolving Real-World Labor-Market Agent ic Benchmark Framework Built for Human-Centric AI

【速读】:该论文旨在解决当前大语言模型(Large Language Model, LLM)代理在真实世界中执行数字工作任务时缺乏可靠、动态且贴近实际经济场景的评估框架的问题。现有基准测试大多静态、合成或局限于特定领域,难以反映代理在复杂、多变且具有经济意义环境中的真实表现。其解决方案的关键在于提出UpBench——一个基于全球自由职业平台Upwork上真实客户交易构建的动态演进基准,通过专家自由职业者对每项任务分解出可验证的评分标准(rubric),并对AI输出进行细粒度的逐项反馈,从而实现对模型能力、适应性及指令遵循准确性的深入分析,并确保评估过程与专业实践标准一致,推动AI与人类协作而非替代的关系发展。

链接: https://arxiv.org/abs/2511.12306
作者: Darvin Yi,Teng Liu,Mattie Terzolo,Lance Hasson,Ayan Sinh,Pablo Mendes,Andrew Rabinovich
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:As large language model (LLM) agents increasingly undertake digital work, reliable frameworks are needed to evaluate their real-world competence, adaptability, and capacity for human collaboration. Existing benchmarks remain largely static, synthetic, or domain-limited, providing limited insight into how agents perform in dynamic, economically meaningful environments. We introduce UpBench, a dynamically evolving benchmark grounded in real jobs drawn from the global Upwork labor marketplace. Each task corresponds to a verified client transaction, anchoring evaluation in genuine work activity and financial outcomes. UpBench employs a rubric-based evaluation framework, in which expert freelancers decompose each job into detailed, verifiable acceptance criteria and assess AI submissions with per-criterion feedback. This structure enables fine-grained analysis of model strengths, weaknesses, and instruction-following fidelity beyond binary pass/fail metrics. Human expertise is integrated throughout the data pipeline (from job curation and rubric construction to evaluation) ensuring fidelity to real professional standards and supporting research on human-AI collaboration. By regularly refreshing tasks to reflect the evolving nature of online work, UpBench provides a scalable, human-centered foundation for evaluating agentic systems in authentic labor-market contexts, offering a path toward a collaborative framework, where AI amplifies human capability through partnership rather than replacement.
zh

[AI-133] Sangam: Chiplet-Based DRAM-PIM Accelerator with CXL Integration for LLM Inferencing

【速读】:该论文针对大语言模型(Large Language Models, LLMs)在推理阶段因上下文长度增加导致的键值缓存(key-value, KV cache)规模扩大,进而使计算任务日益内存受限的问题展开研究。现有近存/存内计算(PIM)方案受限于DRAM工艺约束和处理单元(PE)集成带来的面积开销,难以实现高带宽、高算力密度的加速。解决方案的关键在于提出一种基于chiplet架构的PIM内存模块Sangam,通过将逻辑与存储分离为异构工艺节点制造的chiplet,并利用中介层(interposer)互联,实现了高性能SRAM缓冲区和脉动阵列(systolic array)等先进处理组件的集成,从而显著提升内存密集型GEMM操作的效率。实验表明,Sangam在不同输入规模下相较H100 GPU实现了最高达10.3倍的解码吞吐量提升及数量级能效优化。

链接: https://arxiv.org/abs/2511.12286
作者: Khyati Kiyawat,Zhenxing Fan,Yasas Seneviratne,Morteza Baradaran,Akhil Shekar,Zihan Xia,Mingu Kang,Kevin Skadron
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are becoming increasingly data-intensive due to growing model sizes, and they are becoming memory-bound as the context length and, consequently, the key-value (KV) cache size increase. Inference, particularly the decoding phase, is dominated by memory-bound GEMV or flat GEMM operations with low operational intensity (OI), making it well-suited for processing-in-memory (PIM) approaches. However, existing in/near-memory solutions face critical limitations such as reduced memory capacity due to the high area cost of integrating processing elements (PEs) within DRAM chips, and limited PE capability due to the constraints of DRAM fabrication technology. This work presents a chiplet-based memory module that addresses these limitations by decoupling logic and memory into chiplets fabricated in heterogeneous technology nodes and connected via an interposer. The logic chiplets sustain high bandwidth access to the DRAM chiplets, which house the memory banks, and enable the integration of advanced processing components such as systolic arrays and SRAM-based buffers to accelerate memory-bound GEMM kernels, capabilities that were not feasible in prior PIM architectures. We propose Sangam, a CXL-attached PIM-chiplet based memory module that can either act as a drop-in replacement for GPUs or co-executes along side the GPUs. Sangam achieves speedup of 3.93, 4.22, 2.82x speedup in end-to-end query latency, 10.3, 9.5, 6.36x greater decoding throughput, and order of magnitude energy savings compared to an H100 GPU for varying input size, output length, and batch size on LLaMA 2-7B, Mistral-7B, and LLaMA 3-70B, respectively.
zh

[AI-134] MoralReason : Generalizable Moral Decision Alignment For LLM Agents Using Reasoning -Level Reinforcement Learning AAAI2026

【速读】:该论文试图解决大语言模型(Large Language Models, LLMs)在面对训练分布之外的道德情境时,难以保持一致道德推理框架的问题,即“分布外道德对齐”(out-of-distribution moral alignment)问题。其核心挑战在于如何使LLM代理不仅能够评估道德决策,还能主动引导其生成符合特定伦理范式的回答。解决方案的关键在于提出一种名为Moral-Reason-QA的新数据集,该数据集扩展了680个高模糊性人类标注道德场景,并为功利主义、义务论和美德伦理三种框架提供具体的推理路径;同时采用基于组合奖励的组相对策略优化(Group Relative Policy Optimization),同步优化决策一致性与框架特异性推理过程,从而实现对底层道德框架的学习与泛化能力。实验表明,该方法显著提升了模型在未见道德场景中的表现,验证了LLM代理可被系统性训练以内化并应用特定道德框架于新情境中。

链接: https://arxiv.org/abs/2511.12271
作者: Zhiyu An,Wan Du
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted for AAAI 2026

点击查看摘要

Abstract:Large language models are increasingly influencing human moral decisions, yet current approaches focus primarily on evaluating rather than actively steering their moral decisions. We formulate this as an out-of-distribution moral alignment problem, where LLM agents must learn to apply consistent moral reasoning frameworks to scenarios beyond their training distribution. We introduce Moral-Reason-QA, a novel dataset extending 680 human-annotated, high-ambiguity moral scenarios with framework-specific reasoning traces across utilitarian, deontological, and virtue ethics, enabling systematic evaluation of moral generalization in realistic decision contexts. Our learning approach employs Group Relative Policy Optimization with composite rewards that simultaneously optimize decision alignment and framework-specific reasoning processes to facilitate learning of the underlying moral frameworks. Experimental results demonstrate successful generalization to unseen moral scenarios, with softmax-normalized alignment scores improving by +0.757 for utilitarian and +0.450 for deontological frameworks when tested on out-of-distribution evaluation sets. The experiments also reveal training challenges and promising directions that inform future research. These findings establish that LLM agents can be systematically trained to internalize and apply specific moral frameworks to novel situations, providing a critical foundation for AI safety as language models become more integrated into human decision-making processes.
zh

[AI-135] Mobile-Agent -RAG Agent -RAG: Driving Smart Multi-Agent Coordination with Contextual Knowledge Empowerment for Long-Horizon Mobile Automation

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的移动代理(Mobile Agents)在真实世界、长周期、跨应用任务中成功率低的问题。作者指出,现有方法过度依赖LLMs内部静态知识,导致两个关键失败点:高阶规划阶段的战略性幻觉(strategic hallucinations)和低阶UI操作阶段的执行错误(operational errors)。解决方案的关键在于提出一种分层多智能体框架——Mobile-Agent-RAG,其核心创新是引入双层检索增强机制(dual-level retrieval augmentation):在规划层使用Manager-RAG检索人类验证的完整任务计划以减少战略幻觉,在执行层使用Operator-RAG检索与当前应用界面(UI)高度对齐的原子级操作指令以提升执行精度。通过构建两个专用的知识库并设计Mobile-Eval-RAG基准进行评估,该方案显著提升了任务完成率(+11.0%)和步骤效率(+10.2%),为上下文感知、可靠的多智能体移动自动化提供了新范式。

链接: https://arxiv.org/abs/2511.12254
作者: Yuxiang Zhou,Jichang Li,Yanhao Zhang,Haonan Lu,Guanbin Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Mobile agents show immense potential, yet current state-of-the-art (SoTA) agents exhibit inadequate success rates on real-world, long-horizon, cross-application tasks. We attribute this bottleneck to the agents’ excessive reliance on static, internal knowledge within MLLMs, which leads to two critical failure points: 1) strategic hallucinations in high-level planning and 2) operational errors during low-level execution on user interfaces (UI). The core insight of this paper is that high-level planning and low-level UI operations require fundamentally distinct types of knowledge. Planning demands high-level, strategy-oriented experiences, whereas operations necessitate low-level, precise instructions closely tied to specific app UIs. Motivated by these insights, we propose Mobile-Agent-RAG, a novel hierarchical multi-agent framework that innovatively integrates dual-level retrieval augmentation. At the planning stage, we introduce Manager-RAG to reduce strategic hallucinations by retrieving human-validated comprehensive task plans that provide high-level guidance. At the execution stage, we develop Operator-RAG to improve execution accuracy by retrieving the most precise low-level guidance for accurate atomic actions, aligned with the current app and subtask. To accurately deliver these knowledge types, we construct two specialized retrieval-oriented knowledge bases. Furthermore, we introduce Mobile-Eval-RAG, a challenging benchmark for evaluating such agents on realistic multi-app, long-horizon tasks. Extensive experiments demonstrate that Mobile-Agent-RAG significantly outperforms SoTA baselines, improving task completion rate by 11.0% and step efficiency by 10.2%, establishing a robust paradigm for context-aware, reliable multi-agent mobile automation.
zh

[AI-136] SCI: An Equilibrium for Signal Intelligence

【速读】:该论文旨在解决现有解释方法在动态场景下缺乏稳定性与可控性的问题,即如何实现可信赖、可调节且收敛稳定的解释行为。其核心挑战在于:传统静态解释器无法适应复杂信号环境的变化,导致解释误差大、波动性强,难以满足实际应用对解释质量的持续保障需求。解决方案的关键在于提出SCI(Surgical Precision Control)框架——一个闭环控制理论驱动的可解释性建模体系,将解释精度建模为受控状态,并通过参数投影更新机制,在人类反馈预算约束下主动最小化解释误差ΔSP(Interpretive Error),同时引入可靠性加权多尺度特征、知识引导的可追溯解释器以及基于李雅普诺夫函数的控制器(含回滚机制、信任域保护和下降条件),从而实现了跨生物医学、工业和环境等多领域解释误差降低25-42%(平均38%),并显著提升解释稳定性(SP方差由0.030降至0.011)。

链接: https://arxiv.org/abs/2511.12240
作者: Vishal Joshua Meesala
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 34 pages, 7 figures. Preprint

点击查看摘要

Abstract:We present SCI, a closed-loop, control-theoretic framework that models interpretability as a regulated state. SCI formalizes the interpretive error Delta SP and actively drives SP(t) in [0, 1] (“Surgical Precision”) toward a target via a projected update on the parameters Theta under a human-gain budget. The framework operates through three coordinated components: (1) reliability-weighted, multiscale features P(t, s); (2) a knowledge-guided interpreter psi_Theta that emits traceable markers and rationales; and (3) a Lyapunov-guided controller equipped with rollback, trust-region safeguards, and a descent condition. Across biomedical (EEG/ECG/ICU), industrial (bearings/tool wear), and environmental (climate/seismic) domains, SCI reduces interpretive error by 25-42% (mean 38%, 95% confidence interval 22-43%) relative to static explainers while maintaining AUC/F1 within approximately 1-2 percentage points of baseline. SCI also reduces SP variance from 0.030 to 0.011, indicating substantially more stable explanations. Modeling interpretability as a control objective yields steadier, faster-recovering, and more trustworthy interpretive behavior across diverse signal regimes.
zh

[AI-137] Beyond World Models: Rethinking Understanding in AI Models AAAI2026

【速读】:该论文试图解决的问题是:当前人工智能领域中“世界模型”(world model)是否能够真正表征人类水平的理解。论文通过借鉴科学哲学文献中的案例研究,批判性地检验世界模型框架在多大程度上能刻画人类对世界的理解,尤其是当这种理解超越统计相关性、涉及因果推理和情境认知时。其解决方案的关键在于聚焦于哲学分析中世界模型能力与人类理解之间差异最为显著的具体情境,以此揭示世界模型的局限性,并探讨其作为类人智能表征的有效边界。

链接: https://arxiv.org/abs/2511.12239
作者: Tarun Gupta,Danish Pruthi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to AAAI 2026 (Main Track)

点击查看摘要

Abstract:World models have garnered substantial interest in the AI community. These are internal representations that simulate aspects of the external world, track entities and states, capture causal relationships, and enable prediction of consequences. This contrasts with representations based solely on statistical correlations. A key motivation behind this research direction is that humans possess such mental world models, and finding evidence of similar representations in AI models might indicate that these models “understand” the world in a human-like way. In this paper, we use case studies from the philosophy of science literature to critically examine whether the world model framework adequately characterizes human-level understanding. We focus on specific philosophical analyses where the distinction between world model capabilities and human understanding is most pronounced. While these represent particular views of understanding rather than universal definitions, they help us explore the limits of world models.
zh

[AI-138] ViTE: Virtual Graph Trajectory Expert Router for Pedestrian Trajectory Prediction

【速读】:该论文旨在解决行人轨迹预测中因图神经网络(GNN)层数不足导致的感受野受限与层数过多引发计算成本过高的矛盾问题。现有方法通常依赖堆叠多层GNN来捕捉高阶交互关系,但难以在模型表达能力与计算效率之间取得平衡。其解决方案的关键在于提出ViTE框架,包含两个核心模块:一是引入动态虚拟节点的虚拟图(Virtual Graph),用于建模长距离和高阶交互而无需深层GNN结构;二是基于社会情境自适应选择交互专家的专家路由器(Expert Router),采用Mixture-of-Experts设计实现对不同交互模式的灵活推理。这一组合使模型能够在不增加深度的前提下有效捕获复杂交互关系,从而在多个基准数据集上实现最优性能与高效性。

链接: https://arxiv.org/abs/2511.12214
作者: Ruochen Li,Zhanxing Zhu,Tanqiu Qiao,Hubert P. H. Shum
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pedestrian trajectory prediction is critical for ensuring safety in autonomous driving, surveillance systems, and urban planning applications. While early approaches primarily focus on one-hop pairwise relationships, recent studies attempt to capture high-order interactions by stacking multiple Graph Neural Network (GNN) layers. However, these approaches face a fundamental trade-off: insufficient layers may lead to under-reaching problems that limit the model’s receptive field, while excessive depth can result in prohibitive computational costs. We argue that an effective model should be capable of adaptively modeling both explicit one-hop interactions and implicit high-order dependencies, rather than relying solely on architectural depth. To this end, we propose ViTE (Virtual graph Trajectory Expert router), a novel framework for pedestrian trajectory prediction. ViTE consists of two key modules: a Virtual Graph that introduces dynamic virtual nodes to model long-range and high-order interactions without deep GNN stacks, and an Expert Router that adaptively selects interaction experts based on social context using a Mixture-of-Experts design. This combination enables flexible and scalable reasoning across varying interaction patterns. Experiments on three benchmarks (ETH/UCY, NBA, and SDD) demonstrate that our method consistently achieves state-of-the-art performance, validating both its effectiveness and practical efficiency.
zh

[AI-139] Debate over Mixed-knowledge: A Robust Multi-Agent Framework for Incomplete Knowledge Graph Question Answering

【速读】:该论文旨在解决Incomplete Knowledge Graph Question Answering (IKGQA)问题,即在现实世界知识图谱(Knowledge Graph, KG)存在结构不完整的情况下,如何提升问答系统的准确性与鲁棒性。现有方法通常依赖外部数据填补知识空白,但缺乏对多源信息进行自适应、上下文感知融合的能力,难以充分发挥不同知识来源的互补优势。其解决方案的关键在于提出一种名为Debate over Mixed-knowledge (DoM)的新框架,该框架基于多智能体辩论(Multi-Agent Debate)范式,通过分工协作机制实现结构化知识与非结构化文本知识的动态整合:具体而言,DoM将问题分解为子问题,由KG代理和检索增强生成(Retrieval-Augmented Generation, RAG)代理分别从知识图谱和外部文本中检索证据,并通过一个裁判代理对中间答案进行评估与聚合,从而有效利用知识互补性并增强对KG不完整性问题的适应能力。

链接: https://arxiv.org/abs/2511.12208
作者: Jilong Liu,Pengyang Shao,Wei Qin,Fei Liu,Yonghui Yang,Richang Hong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Knowledge Graph Question Answering (KGQA) aims to improve factual accuracy by leveraging structured knowledge. However, real-world Knowledge Graphs (KGs) are often incomplete, leading to the problem of Incomplete KGQA (IKGQA). A common solution is to incorporate external data to fill knowledge gaps, but existing methods lack the capacity to adaptively and contextually fuse multiple sources, failing to fully exploit their complementary strengths. To this end, we propose Debate over Mixed-knowledge (DoM), a novel framework that enables dynamic integration of structured and unstructured knowledge for IKGQA. Built upon the Multi-Agent Debate paradigm, DoM assigns specialized agents to perform inference over knowledge graphs and external texts separately, and coordinates their outputs through iterative interaction. It decomposes the input question into sub-questions, retrieves evidence via dual agents (KG and Retrieval-Augmented Generation, RAG), and employs a judge agent to evaluate and aggregate intermediate answers. This collaboration exploits knowledge complementarity and enhances robustness to KG incompleteness. In addition, existing IKGQA datasets simulate incompleteness by randomly removing triples, failing to capture the irregular and unpredictable nature of real-world knowledge incompleteness. To address this, we introduce a new dataset, Incomplete Knowledge Graph WebQuestions, constructed by leveraging real-world knowledge updates. These updates reflect knowledge beyond the static scope of KGs, yielding a more realistic and challenging benchmark. Through extensive experiments, we show that DoM consistently outperforms state-of-the-art baselines.
zh

[AI-140] Locally Optimal Solutions to Constraint Displacement Problems via Path-Obstacle Overlaps

【速读】:该论文旨在解决机器人在存在障碍物(constraints)环境中的路径规划问题,即当机器人无法直接找到无碰撞路径时,通过合理位移障碍物来生成可行路径。其核心解决方案为两阶段优化方法:第一阶段在不考虑碰撞的前提下计算一条最小化目标函数的初始轨迹;第二阶段则根据该轨迹对障碍物进行局部最优位移,使最终机器人路径满足无碰撞要求。此方法成功应用于两类不同的约束位移问题,体现了其通用性和有效性。

链接: https://arxiv.org/abs/2511.12203
作者: Antony Thomas,Fulvio Mastrogiovanni,Marco Baglietto
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Robotics and Autonomous Systems

点击查看摘要

Abstract:We present a unified approach for constraint displacement problems in which a robot finds a feasible path by displacing constraints or obstacles. To this end, we propose a two stage process that returns locally optimal obstacle displacements to enable a feasible path for the robot. The first stage proceeds by computing a trajectory through the obstacles while minimizing an appropriate objective function. In the second stage, these obstacles are displaced to make the computed robot trajectory feasible, that is, collision-free. Several examples are provided that successfully demonstrate our approach on two distinct classes of constraint displacement problems.
zh

[AI-141] AI-Enhanced IoT Systems for Predictive Maintenance and Affordability Optimization in Smart Microgrids: A Digital Twin Approach

【速读】:该论文旨在解决智能微电网(smart microgrids)中预测性维护与经济性优化难题,核心挑战在于如何在分布式能源环境中提升系统可靠性与能效的同时降低运维成本。解决方案的关键在于构建一个融合数字孪生(Digital Twin)建模的增强型物联网(IoT)框架,通过实时传感器数据采集、基于机器学习的故障预测以及成本感知的运营分析,实现物理微电网组件与其虚拟模型的同步,从而支持早期部件退化检测、动态负荷管理和最优维护调度,最终显著提升预测精度、减少停机时间并带来可量化的成本节约。

链接: https://arxiv.org/abs/2511.12175
作者: Koushik Ahmed Kushal,Florimond Gueniat
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注: 12 pages, 6 figures, includes simulation and evaluation results

点击查看摘要

Abstract:This study presents an AI enhanced IoT framework for predictive maintenance and affordability optimization in smart microgrids using a Digital Twin modeling approach. The proposed system integrates real time sensor data, machine learning based fault prediction, and cost aware operational analytics to improve reliability and energy efficiency in distributed microgrid environments. By synchronizing physical microgrid components with a virtual Digital Twin, the framework enables early detection of component degradation, dynamic load management, and optimized maintenance scheduling. Experimental evaluations demonstrate improved predictive accuracy, reduced operational downtime, and measurable cost savings compared to baseline microgrid management methods. The findings highlight the potential of Digital Twin driven IoT architectures as a scalable solution for next generation intelligent and affordable energy systems.
zh

[AI-142] Incremental Maintenance of DatalogMTL Materialisations AAAI2026

【速读】:该论文旨在解决DatalogMTL(扩展了度量时序逻辑的Datalog语言)在处理动态数据更新时效率低下的问题。现有基于物化或自动机的推理方法虽能保证正确性和完备性,但无法高效支持频繁的数据变更,这限制了其在现实场景中的应用。解决方案的关键在于提出DRedMTL算法,它基于经典的DRed增量推理机制,针对DatalogMTL物化表示中包含周期性区间(periodic intervals)的特点,设计了专门的操作符来高效处理这些周期性结构,从而实现对时序事实的增量更新,避免全量重计算,实验表明其性能显著优于传统重物化方法。

链接: https://arxiv.org/abs/2511.12169
作者: Kaiyue Zhao,Dingqi Chen,Shaoyu Wang,Pan Hu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted as oral paper at the main track of AAAI 2026

点击查看摘要

Abstract:DatalogMTL extends the classical Datalog language with metric temporal logic (MTL), enabling expressive reasoning over temporal data. While existing reasoning approaches, such as materialisation based and automata based methods, offer soundness and completeness, they lack support for handling efficient dynamic updates, a crucial requirement for real-world applications that involve frequent data updates. In this work, we propose DRedMTL, an incremental reasoning algorithm for DatalogMTL with bounded intervals. Our algorithm builds upon the classical DRed algorithm, which incrementally updates the materialisation of a Datalog program. Unlike a Datalog materialisation which is in essence a finite set of facts, a DatalogMTL materialisation has to be represented as a finite set of facts plus periodic intervals indicating how the full materialisation can be constructed through unfolding. To cope with this, our algorithm is equipped with specifically designed operators to efficiently handle such periodic representations of DatalogMTL materialisations. We have implemented this approach and tested it on several publicly available datasets. Experimental results show that DRedMTL often significantly outperforms rematerialisation, sometimes by orders of magnitude.
zh

[AI-143] Open Banking Foundational Model: Learning Language Representations from Few Financial Transactions

【速读】:该论文旨在解决金融交易中结构化属性与非结构化文本描述难以统一建模的问题,尤其在数据稀缺的开放银行(Open Banking)场景下,传统特征工程和离散事件序列方法表现受限。其解决方案的关键在于提出一种多模态基础模型(multimodal foundational model),通过将交易的结构化特征与文本描述整合为统一表示,并采用适配后的掩码语言建模(masked language modeling)方法对交易序列进行预训练,从而实现跨机构、跨地域的泛化能力,显著提升金融风控与客户洞察等下游任务的效果。

链接: https://arxiv.org/abs/2511.12154
作者: Gustavo Polleti,Marlesson Santana,Eduardo Fontes
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduced a multimodal foundational model for financial transactions that integrates both structured attributes and unstructured textual descriptions into a unified representation. By adapting masked language modeling to transaction sequences, we demonstrated that our approach not only outperforms classical feature engineering and discrete event sequence methods but is also particularly effective in data-scarce Open Banking scenarios. To our knowledge, this is the first large-scale study across thousands of financial institutions in North America, providing evidence that multimodal representations can generalize across geographies and institutions. These results highlight the potential of self-supervised models to advance financial applications ranging from fraud prevention and credit risk to customer insights
zh

[AI-144] RTMol: Rethinking Molecule-text Alignment in a Round-trip View

【速读】:该论文旨在解决分子序列表示(如SMILES)与文本描述之间对齐问题,尤其针对现有方法在分子图像描述生成(molecule-to-text)和基于文本的分子设计(text-to-molecule)任务中存在的三大局限:传统评价指标(如BLEU)偏重语言流畅性而非化学准确性、训练数据常含化学信息不完整的模糊叙述、以及双向生成方向独立优化导致的一致性缺失。解决方案的关键在于提出RTMol框架,通过自监督的“往返学习”(round-trip learning)机制统一两个生成方向,引入新颖的往返评估指标,并实现无需成对分子-文本语料即可进行无监督训练,从而显著提升双向对齐性能(最高达47%),为分子与文本的联合理解与生成提供有效范式。

链接: https://arxiv.org/abs/2511.12135
作者: Letian Chen,Runhan Shi,Gufeng Yu,Yang Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Aligning molecular sequence representations (e.g., SMILES notations) with textual descriptions is critical for applications spanning drug discovery, materials design, and automated chemical literature analysis. Existing methodologies typically treat molecular captioning (molecule-to-text) and text-based molecular design (text-to-molecule) as separate tasks, relying on supervised fine-tuning or contrastive learning pipelines. These approaches face three key limitations: (i) conventional metrics like BLEU prioritize linguistic fluency over chemical accuracy, (ii) training datasets frequently contain chemically ambiguous narratives with incomplete specifications, and (iii) independent optimization of generation directions leads to bidirectional inconsistency. To address these issues, we propose RTMol, a bidirectional alignment framework that unifies molecular captioning and text-to-SMILES generation through self-supervised round-trip learning. The framework introduces novel round-trip evaluation metrics and enables unsupervised training for molecular captioning without requiring paired molecule-text corpora. Experiments demonstrate that RTMol enhances bidirectional alignment performance by up to 47% across various LLMs, establishing an effective paradigm for joint molecule-text understanding and generation.
zh

[AI-145] MetaGDPO: Alleviating Catastrophic Forgetting with Metacognitive Knowledge through Group Direct Preference Optimization AAAI2026

【速读】:该论文旨在解决小规模语言模型(<8B参数)在知识蒸馏和微调过程中出现的灾难性遗忘问题,其核心挑战在于现有数据集未能充分考虑训练数据与模型固有能力之间的关系,且传统训练目标缺乏对先验知识保留的有效约束。解决方案的关键在于从数据构建与训练方法两个维度协同优化:首先,构建包含5K个实例的多任务推理数据集,引入元认知知识标注并基于任务知识与模型能力进行筛选,从而提升蒸馏效率;其次,提出GDPO(Group Direction Preference Optimization)训练策略,通过参考模型隐式约束优化路径,在资源受限场景下高效逼近GRPO性能,有效减少参数漂移并增强大模型到小模型的知识迁移效果。

链接: https://arxiv.org/abs/2511.12113
作者: Lanxue Zhang,Yuqiang Xie,Fang Fang,Fanglong Dong,Rui Liu,Yanan Cao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 23 pages, 10 figures, AAAI 2026

点击查看摘要

Abstract:Large Language Models demonstrate strong reasoning capabilities, which can be effectively compressed into smaller models. However, existing datasets and fine-tuning approaches still face challenges that lead to catastrophic forgetting, particularly for models smaller than 8B. First, most datasets typically ignore the relationship between training data knowledge and the model’s inherent abilities, making it difficult to preserve prior knowledge. Second, conventional training objectives often fail to constrain inherent knowledge preservation, which can result in forgetting of previously learned skills. To address these issues, we propose a comprehensive solution that alleviates catastrophic forgetting from both the data and fine-tuning approach perspectives. On the data side, we construct a dataset of 5K instances that covers multiple reasoning tasks and incorporates metacognitive knowledge, making it more tolerant and effective for distillation into smaller models. We annotate the metacognitive knowledge required to solve each question and filter the data based on task knowledge and the model’s inherent skills. On the training side, we introduce GDPO (Group Direction Preference Optimization), which is better suited for resource-limited scenarios and can efficiently approximate the performance of GRPO. Guided by the large model and by implicitly constraining the optimization path through a reference model, GDPO enables more effective knowledge transfer from the large model and constrains excessive parameter drift. Extensive experiments demonstrate that our approach significantly alleviates catastrophic forgetting and improves reasoning performance on smaller models.
zh

[AI-146] Decoupled Action Head: Confining Task Knowledge to Conditioning Layers

【速读】:该论文旨在解决行为克隆(Behavior Cloning, BC)方法在机器人操作任务中因配对训练数据稀缺而导致的泛化能力有限、模型设计缺乏理论依据的问题,同时揭示扩散策略(Diffusion Policy, DP)有效性的内在机制。其解决方案的关键在于提出一种解耦训练范式(decoupled training recipe),通过利用几乎零成本的运动学生成轨迹作为无观测数据预训练一个通用的动作头(action head),随后冻结该动作头并通过特征调制(feature modulation)适配新任务。此方法不仅显著提升训练效率(如DP-C实现最高41%的速度提升),还表明动作生成主干网络在机器人操作中的作用有限,从而推动了更轻量级模型的设计,例如用仅4M参数的MLP替代原244M参数的U-Net结构,实现83.9%的训练加速。

链接: https://arxiv.org/abs/2511.12101
作者: Jian Zhou,Sihao Lin,Shuai Fu,Qi WU
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Behavior Cloning (BC) is a data-driven supervised learning approach that has gained increasing attention with the success of scaling laws in language and vision domains. Among its implementations in robotic manipulation, Diffusion Policy (DP), with its two variants DP-CNN (DP-C) and DP-Transformer (DP-T), is one of the most effective and widely adopted models, demonstrating the advantages of predicting continuous action sequences. However, both DP and other BC methods remain constrained by the scarcity of paired training data, and the internal mechanisms underlying DP’s effectiveness remain insufficiently understood, leading to limited generalization and a lack of principled design in model development. In this work, we propose a decoupled training recipe that leverages nearly cost-free kinematics-generated trajectories as observation-free data to pretrain a general action head (action generator). The pretrained action head is then frozen and adapted to novel tasks through feature modulation. Our experiments demonstrate the feasibility of this approach in both in-distribution and out-of-distribution scenarios. As an additional benefit, decoupling improves training efficiency; for instance, DP-C achieves up to a 41% speedup. Furthermore, the confinement of task-specific knowledge to the conditioning components under decoupling, combined with the near-identical performance of DP-C in both normal and decoupled training, indicates that the action generation backbone plays a limited role in robotic manipulation. Motivated by this observation, we introduce DP-MLP, which replaces the 244M-parameter U-Net backbone of DP-C with only 4M parameters of simple MLP blocks, achieving a 83.9% faster training speed under normal training and 89.1% under decoupling.
zh

[AI-147] KrwEmd: Revising the Imperfect-Recall Abstraction from Forgetting Everything

【速读】:该论文旨在解决大规模不完美信息博弈中因过度抽象(excessive abstraction)导致的人工智能(AI)性能下降问题,尤其在德州扑克等游戏中,传统不完美记忆抽象(imperfect-recall abstraction)会完全丢弃历史信息,从而损害策略质量。解决方案的关键在于提出KrwEmd算法,其核心创新是引入k-回忆胜率特征(k-recall winrate feature),该特征不仅利用未来和历史游戏信息定性区分信号观测信息集(signal observation infosets),还能定量刻画它们的相似性;随后通过地球移动距离(earth mover’s distance)对这些特征进行聚类,从而实现更精准的信息集抽象,显著提升AI的游戏表现。

链接: https://arxiv.org/abs/2511.12089
作者: Yanchang Fu,Qiyue Yin,Shengda Liu,Pei Xu,Kaiqi Huang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注:

点击查看摘要

Abstract:Excessive abstraction is a critical challenge in hand abstraction-a task specific to games like Texas hold’em-when solving large-scale imperfect-information games, as it impairs AI performance. This issue arises from extreme implementations of imperfect-recall abstraction, which entirely discard historical information. This paper presents KrwEmd, the first practical algorithm designed to address this problem. We first introduce the k-recall winrate feature, which not only qualitatively distinguishes signal observation infosets by leveraging both future and, crucially, historical game information, but also quantitatively captures their similarity. We then develop the KrwEmd algorithm, which clusters signal observation infosets using earth mover’s distance to measure discrepancies between their features. Experimental results demonstrate that KrwEmd significantly improves AI gameplay performance compared to existing algorithms.
zh

[AI-148] Explainable Transformer-Based Email Phishing Classification with Adversarial Robustness

【速读】:该论文旨在解决AI生成的钓鱼攻击(AI-generated phishing attacks)对现有钓鱼检测系统造成的威胁,这类攻击通过利用生成式AI技术绕过传统深度学习模型的检测机制,显著降低了系统的鲁棒性。解决方案的关键在于提出一种融合DistilBERT文本分类模型、基于快速梯度法(Fast Gradient Method, FGM)的对抗训练策略以及可解释人工智能(Explainable AI, XAI)技术的混合框架:其中,DistilBERT在保持高精度的同时提升推理效率;FGM对抗训练增强模型对文本扰动的鲁棒性;LIME方法提升模型决策透明度,并结合Flan-T5-small语言模型生成面向终端用户的自然语言安全解释,从而实现精准分类与可理解性的双重保障。

链接: https://arxiv.org/abs/2511.12085
作者: Sajad U P
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Phishing and related cyber threats are becoming more varied and technologically advanced. Among these, email-based phishing remains the most dominant and persistent threat. These attacks exploit human vulnerabilities to disseminate malware or gain unauthorized access to sensitive information. Deep learning (DL) models, particularly transformer-based models, have significantly enhanced phishing mitigation through their contextual understanding of language. However, some recent threats, specifically Artificial Intelligence (AI)-generated phishing attacks, are reducing the overall system resilience of phishing detectors. In response, adversarial training has shown promise against AI-generated phishing threats. This study presents a hybrid approach that uses DistilBERT, a smaller, faster, and lighter version of the BERT transformer model for email classification. Robustness against text-based adversarial perturbations is reinforced using Fast Gradient Method (FGM) adversarial training. Furthermore, the framework integrates the LIME Explainable AI (XAI) technique to enhance the transparency of the DistilBERT architecture. The framework also uses the Flan-T5-small language model from Hugging Face to generate plain-language security narrative explanations for end-users. This combined approach ensures precise phishing classification while providing easily understandable justifications for the model’s decisions.
zh

[AI-149] No-Regret Strategy Solving in Imperfect-Information Games via Pre-Trained Embedding

【速读】:该论文旨在解决大规模不完美信息扩展式博弈(IIEFGs)中信息集抽象质量不足的问题,尤其是在空间资源有限的情况下,传统基于离散聚类的抽象方法因硬分类导致关键细微差异丢失,从而影响策略求解质量。其解决方案的关键在于提出Embedding CFR算法,通过预训练将孤立信息集特征嵌入到低维连续空间中,使嵌入向量能更精确地捕捉信息集间的区别与关联,进而在此嵌入空间内基于累积遗憾进行策略更新与优化,理论分析证明其可降低累计遗憾,实验表明在相同空间开销下收敛速度显著优于基于聚类的抽象算法。

链接: https://arxiv.org/abs/2511.12083
作者: Yanchang Fu,Shengda Liu,Pei Xu,Kaiqi Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:High-quality information set abstraction remains a core challenge in solving large-scale imperfect-information extensive-form games (IIEFGs)-such as no-limit Texas Hold’em-where the finite nature of spatial resources hinders strategy solving over the full game. State-of-the-art AI methods rely on pre-trained discrete clustering for abstraction, yet their hard classification irreversibly loses critical information: specifically, the quantifiable subtle differences between information sets-vital for strategy solving-thereby compromising the quality of such solving. Inspired by the word embedding paradigm in natural language processing, this paper proposes the Embedding CFR algorithm, a novel approach for solving strategies in IIEFGs within an embedding space. The algorithm pre-trains and embeds features of isolated information sets into an interconnected low-dimensional continuous space, where the resulting vectors more precisely capture both the distinctions and connections between information sets. Embedding CFR presents a strategy-solving process driven by regret accumulation and strategy updates within this embedding space, with accompanying theoretical analysis verifying its capacity to reduce cumulative regret. Experiments on poker show that with the same spatial overhead, Embedding CFR achieves significantly faster exploitability convergence compared to cluster-based abstraction algorithms, confirming its effectiveness. Furthermore, to our knowledge, it is the first algorithm in poker AI that pre-trains information set abstractions through low-dimensional embedding for strategy solving.
zh

[AI-150] reatment Stitching with Schrödinger Bridge for Enhancing Offline Reinforcement Learning in Adaptive Treatment Strategies AAAI

【速读】:该论文旨在解决临床自适应治疗策略(Adaptive Treatment Strategies, ATS)优化中因数据稀缺导致的离线强化学习(Offline Reinforcement Learning, Offline RL)性能受限问题。传统离线RL依赖历史数据学习决策策略,但在临床场景下,数据稀疏性常导致策略泛化能力差、难以有效优化个性化治疗方案。其解决方案的关键在于提出一种名为“治疗拼接”(Treatment Stitching, TreatStitch)的数据增强框架:通过识别不同治疗轨迹中相似的中间患者状态,智能拼接对应治疗片段;当状态差异过大无法直接拼接时,则利用Schrödinger桥方法生成能量高效且平滑的过渡轨迹,从而构建符合临床逻辑的合成治疗路径。该方法显著扩展了原始数据分布,使离线RL能从更丰富的数据中学习,同时理论证明其避免了分布外(out-of-distribution)转移,保障了临床有效性。

链接: https://arxiv.org/abs/2511.12075
作者: Dong-Hee Shin,Deok-Joong Lee,Young-Han Son,Tae-Eui Kam
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages, 5 figures, AAAI conference

点击查看摘要

Abstract:Adaptive treatment strategies (ATS) are sequential decision-making processes that enable personalized care by dynamically adjusting treatment decisions in response to evolving patient symptoms. While reinforcement learning (RL) offers a promising approach for optimizing ATS, its conventional online trial-and-error learning mechanism is not permissible in clinical settings due to risks of harm to patients. Offline RL tackles this limitation by learning policies exclusively from historical treatment data, but its performance is often constrained by data scarcity-a pervasive challenge in clinical domains. To overcome this, we propose Treatment Stitching (TreatStitch), a novel data augmentation framework that generates clinically valid treatment trajectories by intelligently stitching segments from existing treatment data. Specifically, TreatStitch identifies similar intermediate patient states across different trajectories and stitches their respective segments. Even when intermediate states are too dissimilar to stitch directly, TreatStitch leverages the Schrödinger bridge method to generate smooth and energy-efficient bridging trajectories that connect dissimilar states. By augmenting these synthetic trajectories into the original dataset, offline RL can learn from a more diverse dataset, thereby improving its ability to optimize ATS. Extensive experiments across multiple treatment datasets demonstrate the effectiveness of TreatStitch in enhancing offline RL performance. Furthermore, we provide a theoretical justification showing that TreatStitch maintains clinical validity by avoiding out-of-distribution transitions.
zh

[AI-151] MF-Speech: Achieving Fine-Grained and Compositional Control in Speech Generation via Factor Disentanglement

【速读】:该论文旨在解决生成式人工智能中人类语音表达的可控性与表现力问题,其核心挑战在于语音因素(如内容、音色和情感)的深度耦合以及现有控制机制的粒度粗糙。解决方案的关键在于提出了一种名为MF-Speech的新框架,其核心创新为:一是设计了MF-SpeechEncoder作为因子净化器,通过多目标优化策略将原始语音信号解耦为高度纯净且独立的内容、音色与情感表示;二是构建了MF-SpeechGenerator作为调控器,利用动态融合与分层风格自适应归一化(Hierarchical Style Adaptive Normalization, HSAN)实现对各因子的精确、可组合且细粒度的控制。该方法在多因子组合语音生成任务中显著优于现有最先进方法,验证了其有效性与通用性。

链接: https://arxiv.org/abs/2511.12074
作者: Xinyue Yu,Youqing Fang,Pingyu Wu,Guoyang Ye,Wenbo Zhou,Weiming Zhang,Song Xiao
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generating expressive and controllable human speech is one of the core goals of generative artificial intelligence, but its progress has long been constrained by two fundamental challenges: the deep entanglement of speech factors and the coarse granularity of existing control mechanisms. To overcome these challenges, we have proposed a novel framework called MF-Speech, which consists of two core components: MF-SpeechEncoder and MF-SpeechGenerator. MF-SpeechEncoder acts as a factor purifier, adopting a multi-objective optimization strategy to decompose the original speech signal into highly pure and independent representations of content, timbre, and emotion. Subsequently, MF-SpeechGenerator functions as a conductor, achieving precise, composable and fine-grained control over these factors through dynamic fusion and Hierarchical Style Adaptive Normalization (HSAN). Experiments demonstrate that in the highly challenging multi-factor compositional speech generation task, MF-Speech significantly outperforms current state-of-the-art methods, achieving a lower word error rate (WER=4.67%), superior style control (SECS=0.5685, Corr=0.68), and the highest subjective evaluation scores(nMOS=3.96, sMOS_emotion=3.86, sMOS_style=3.78). Furthermore, the learned discrete factors exhibit strong transferability, demonstrating their significant potential as a general-purpose speech representation.
zh

[AI-152] ProAV-DiT: A Projected Latent Diffusion Transformer for Efficient Synchronized Audio-Video Generation

【速读】:该论文旨在解决声音视频生成(Sounding Video Generation, SVG)任务中音频与视频之间的结构错位问题,以及多模态数据处理带来的高计算成本。其核心解决方案是提出ProAV-DiT模型,关键在于设计了一个多尺度双流时空自编码器(Multi-scale Dual-stream Spatio-Temporal Autoencoder, MDSA),通过正交分解将音频和视频投影到统一潜在空间,实现细粒度的时空建模与语义对齐;同时引入多尺度注意力机制(包括多尺度时间自注意力和分组跨模态注意力),增强时间连贯性和模态特异性融合,并将2D潜在表示堆叠为3D潜在空间,由时空扩散Transformer进行处理,从而高效建模时空依赖关系,在保证高质量同步音视频生成的同时显著降低计算开销。

链接: https://arxiv.org/abs/2511.12072
作者: Jiahui Sun,Weining Wang,Mingzhen Sun,Yirong Yang,Xinxin Zhu,Jing Liu
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Sounding Video Generation (SVG) remains a challenging task due to the inherent structural misalignment between audio and video, as well as the high computational cost of multimodal data processing. In this paper, we introduce ProAV-DiT, a Projected Latent Diffusion Transformer designed for efficient and synchronized audio-video generation. To address structural inconsistencies, we preprocess raw audio into video-like representations, aligning both the temporal and spatial dimensions between audio and video. At its core, ProAV-DiT adopts a Multi-scale Dual-stream Spatio-Temporal Autoencoder (MDSA), which projects both modalities into a unified latent space using orthogonal decomposition, enabling fine-grained spatiotemporal modeling and semantic alignment. To further enhance temporal coherence and modality-specific fusion, we introduce a multi-scale attention mechanism, which consists of multi-scale temporal self-attention and group cross-modal attention. Furthermore, we stack the 2D latents from MDSA into a unified 3D latent space, which is processed by a spatio-temporal diffusion Transformer. This design efficiently models spatiotemporal dependencies, enabling the generation of high-fidelity synchronized audio-video content while reducing computational overhead. Extensive experiments conducted on standard benchmarks demonstrate that ProAV-DiT outperforms existing methods in both generation quality and computational efficiency.
zh

[AI-153] Improving Graph Embeddings in Machine Learning Using Knowledge Completion with Validation in a Case Study on COVID-19 Spread

【速读】:该论文旨在解决图机器学习(Graph Machine Learning, GML)中因图嵌入(Graph Embeddings, GEs)仅依赖显式拓扑结构和特征而可能遗漏隐含知识的问题,尤其是在数据稀疏场景下,这种缺失会影响图结构及其表示质量。解决方案的关键在于引入一个知识补全(Knowledge Completion, KC)阶段,在生成嵌入前通过建模传递关系(transitive relations)来挖掘潜在语义信息,利用基于衰减的推理函数重构图拓扑结构,从而显著改变嵌入空间的几何特性,并提升GraphSAGE与Node2Vec等模型在节点分类和链接预测任务中的表现。

链接: https://arxiv.org/abs/2511.12071
作者: Rosario Napoli,Gabriele Morabito,Antonio Celesti,Massimo Villari,Maria Fazio
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at the 16th IEEE International Conference on Knowledge Graphs (ICKG) 2025

点击查看摘要

Abstract:The rise of graph-structured data has driven major advances in Graph Machine Learning (GML), where graph embeddings (GEs) map features from Knowledge Graphs (KGs) into vector spaces, enabling tasks like node classification and link prediction. However, since GEs are derived from explicit topology and features, they may miss crucial implicit knowledge hidden in seemingly sparse datasets, affecting graph structure and their representation. We propose a GML pipeline that integrates a Knowledge Completion (KC) phase to uncover latent dataset semantics before embedding generation. Focusing on transitive relations, we model hidden connections with decay-based inference functions, reshaping graph topology, with consequences on embedding dynamics and aggregation processes in GraphSAGE and Node2Vec. Experiments show that our GML pipeline significantly alters the embedding space geometry, demonstrating that its introduction is not just a simple enrichment but a transformative step that redefines graph representation quality.
zh

[AI-154] Bayesian Optimization in Language Space: An Eval-Efficient AI Self-Improvement Framework

【速读】:该论文旨在解决自改进型人工智能(self-improving AI)在社会应用中因评估成本过高而导致的效率瓶颈问题,尤其是在生成式AI(Generative AI)任务中,如广告内容优化,其核心挑战在于如何减少对昂贵人类反馈的依赖以提升评估效率。解决方案的关键在于提出TextGrad-Best-of-N Bayesian Optimization(T-BoN BO),其创新性地证明了“Best-of-N选择策略”与“文本梯度”(即来自批评模型的文本编辑)的组合能够统计上模拟标准UCB(Upper Confidence Bound)采集函数的梯度行为,从而在语言空间中实现最优探索,显著提高评估效率。该方法无需直接估计复杂的采集函数,即可在低评估成本下实现高效自改进。

链接: https://arxiv.org/abs/2511.12063
作者: Enoch Hyunwook Kang,Hema Yoganarasimhan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have recently enabled self-improving AI, i.e., AI that iteratively generates, evaluates, and refines its own outcomes. Recent studies have shown that self-improving AI focusing on prompt optimization can outperform state-of-the-art reinforcement-learning fine-tuned LLMs. Here, their `performance’ is typically measured by query efficiency - the number of LLM-generated solution samples required to meet a certain performance threshold. However, in many societal applications, the primary limitation is not generating new solutions but evaluating them. For instance, evaluating an ad’s effectiveness requires significant human feedback, which is far more costly and time-consuming than generating a candidate ad. To optimize for the evaluation efficiency objective, a natural approach is to extend Bayesian Optimization (BO), a framework proven optimal for evaluation efficiency, to the language domain. However, the difficulty of directly estimating suitable acquisition functions in LLMs’ minds makes this extension challenging. This paper overcomes this challenge by proving that the combination of the simple and widely used Best-of-N selection strategy and simple textual gradients (i.e., textual edits from a critic model) statistically emulates the behavior of the gradients on the canonical UCB acquisition function, which induces optimal exploration in terms of evaluation efficiency. Based on this result, we propose TextGrad-Best-of-N Bayesian Optimization (T-BoN BO), a simple and eval-efficient language-space Bayesian optimization framework for AI self-improvement. We also empirically validate T-BoN BO by applying it to automated ad alignment tasks for persona distribution, demonstrating its superior performance compared to popular state-of-the-art baselines.
zh

[AI-155] Intelligent Collaborative Optimization for Rubber Tyre Film Production Based on Multi-path Differentiated Clipping Proximal Policy Optimization

【速读】:该论文旨在解决橡胶轮胎制造中因传统集中式调度和刚性产线配置所导致的动态生产需求响应能力不足的问题,尤其针对多子系统高度耦合、非线性交互显著且具有涌现动力学特性的复杂制造网络中的协同优化难题。解决方案的关键在于提出一种名为多路径差异化裁剪近端策略优化(Multi-path Differentiated Clipping Proximal Policy Optimization, MPD-PPO)的深度强化学习算法,其核心创新在于采用多分支策略架构并引入差异化的梯度裁剪约束机制,从而保障高维策略更新的稳定性与效率,在宽度和厚度控制等实际场景中验证了其在调优精度与运行效率上的显著提升,有效应对了高维性、多目标权衡及动态适应等关键挑战,为轮胎制造领域实时工业部署提供了可靠性能保障。

链接: https://arxiv.org/abs/2511.12060
作者: Yinghao Ruan,Wei Pang,Shuaihao Liu,Huili Yang,Leyi Han,Xinghui Dong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages

点击查看摘要

Abstract:The advent of smart manufacturing is addressing the limitations of traditional centralized scheduling and inflexible production line configurations in the rubber tyre industry, especially in terms of coping with dynamic production demands. Contemporary tyre manufacturing systems form complex networks of tightly coupled subsystems pronounced nonlinear interactions and emergent dynamics. This complexity renders the effective coordination of multiple subsystems, posing an essential yet formidable task. For high-dimensional, multi-objective optimization problems in this domain, we introduce a deep reinforcement learning algorithm: Multi-path Differentiated Clipping Proximal Policy Optimization (MPD-PPO). This algorithm employs a multi-branch policy architecture with differentiated gradient clipping constraints to ensure stable and efficient high-dimensional policy updates. Validated through experiments on width and thickness control in rubber tyre film production, MPD-PPO demonstrates substantial improvements in both tuning accuracy and operational efficiency. The framework successfully tackles key challenges, including high dimensionality, multi-objective trade-offs, and dynamic adaptation, thus delivering enhanced performance and production stability for real-time industrial deployment in tyre manufacturing.
zh

[AI-156] Exploring AI in Steganography and Steganalysis: Trends Clusters and Sustainable Development Potential

【速读】:该论文旨在解决当前AI驱动的隐写术(steganography)数据隐藏技术研究中缺乏系统性量化分析的问题,特别是对研究热点、区域分布、学科交叉及可持续发展目标(SDGs)关联性的认知不足。其解决方案的关键在于采用主题建模的科学计量学方法,对2017至2023年间发表的654篇相关文献进行系统分析,识别出七大主题聚类,并揭示亚洲国家(尤其是中国和印度)在该领域的主导地位;同时评估AI隐写术与SDGs之间的关联性,发现仅有18篇文章与SDGs对齐,其中SDG9(产业、创新与基础设施)占比最高,从而指出该领域在社会价值导向上的显著缺口,为未来研究提供方向指引。

链接: https://arxiv.org/abs/2511.12052
作者: Aditya Kumar Sahu,Chandan Kumar,Saksham Kumar,Serdar Solak
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Steganography and steganalysis are strongly related subjects of information security. Over the past decade, many powerful and efficient artificial intelligence (AI) - driven techniques have been designed and presented during research into steganography as well as steganalysis. This study presents a scientometric analysis of AI-driven steganography-based data hiding techniques using a thematic modelling approach. A total of 654 articles within the time span of 2017 to 2023 have been considered. Experimental evaluation of the study reveals that 69% of published articles are from Asian countries. The China is on top (TP:312), followed by India (TP-114). The study mainly identifies seven thematic clusters: steganographic image data hiding, deep image steganalysis, neural watermark robustness, linguistic steganography models, speech steganalysis algorithms, covert communication networks, and video steganography techniques. The proposed study also assesses the scope of AI-steganography under the purview of sustainable development goals (SDGs) to present the interdisciplinary reciprocity between them. It has been observed that only 18 of the 654 articles are aligned with one of the SDGs, which shows that limited studies conducted in alignment with SDG goals. SDG9 which is Industry, Innovation, and Infrastructure is leading among 18 SDGs mapped articles. To the top of our insight, this study is the unique one to present a scientometric study on AI-driven steganography-based data hiding techniques. In the context of descriptive statistics, the study breaks down the underlying causes of observed trends, including the influence of DL developments, trends in East Asia and maturity of foundational methods. The work also stresses upon the critical gaps in societal alignment, particularly the SDGs, ultimately working on unveiling the field’s global impact on AI security challenges.
zh

[AI-157] Mesh-based Super-resolution of Detonation Flows with Multiscale Graph Transformers

【速读】:该论文旨在解决反应流场(reacting flow)在低分辨率网格上进行超分辨率重建(super-resolution reconstruction)的问题,以支持诸如亚网格闭合建模、时空预测加速、数据压缩以及稀疏实验测量的尺度提升等应用。其解决方案的关键在于提出了一种首创的多尺度图变压器方法(multiscale graph transformer, SR-GT),该方法采用基于图结构的流动场表示方式,兼容复杂几何形状和非均匀/非结构化网格;同时利用变压器(transformer)主干网络捕捉低分辨率流动场中不同区域间的长程依赖关系,识别关键特征,并生成保留这些特征的高分辨率流动场输出。该框架通过元素局部(+邻域)图表示对粗网格输入进行编码,在Transformer处理后实现精细化重建,显著优于传统插值类超分辨率方案。

链接: https://arxiv.org/abs/2511.12041
作者: Shivam Barwey,Pinaki Pal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Super-resolution flow reconstruction using state-of-the-art data-driven techniques is valuable for a variety of applications, such as subgrid/subfilter closure modeling, accelerating spatiotemporal forecasting, data compression, and serving as an upscaling tool for sparse experimental measurements. In the present work, a first-of-its-kind multiscale graph transformer approach is developed for mesh-based super-resolution (SR-GT) of reacting flows. The novel data-driven modeling paradigm leverages a graph-based flow-field representation compatible with complex geometries and non-uniform/unstructured grids. Further, the transformer backbone captures long-range dependencies between different parts of the low-resolution flow-field, identifies important features, and then generates the super-resolved flow-field that preserves those features at a higher resolution. The performance of SR-GT is demonstrated in the context of spectral-element-discretized meshes for a challenging test problem of 2D detonation propagation within a premixed hydrogen-air mixture exhibiting highly complex multiscale reacting flow behavior. The SR-GT framework utilizes a unique element-local (+ neighborhood) graph representation for the coarse input, which is then tokenized before being processed by the transformer component to produce the fine output. It is demonstrated that SR-GT provides high super-resolution accuracy for reacting flow-field features and superior performance compared to traditional interpolation-based SR schemes.
zh

[AI-158] EARL: Entropy-Aware RL Alignment of LLM s for Reliable RTL Code Generation

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在生成结构化硬件描述语言(Verilog)代码时存在的关键问题,包括语法错误、功能幻觉(functional hallucinations)以及与设计意图的弱对齐性。现有方法在长序列生成中难以有效聚焦于影响功能正确性的核心代码片段,导致训练信号稀释和学习效率低下。其解决方案的关键在于提出一种熵感知的强化学习框架——EARL(Entropy-Aware Reinforcement Learning),通过熵分析识别出对控制流和模块结构具有高不确定性的关键token(如always、if、assign等),并采用熵引导的选择性梯度更新机制,仅对这些高熵token施加策略梯度更新,从而提升训练稳定性、减少冗余更新,并显著增强模型生成功能性正确的RTL代码的能力。

链接: https://arxiv.org/abs/2511.12033
作者: Jiahe Shi,Zhengqi Gao,Ching-Yun Ko,Duane Boning
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have demonstrated significant potential in hardware design automation, particularly in using natural language to synthesize Register-Transfer Level (RTL) code. Despite this progress, a gap remains between model capability and the demands of real-world RTL design, including syntax errors, functional hallucinations, and weak alignment to designer intent. Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising approach to bridge this gap, as hardware provides executable and formally checkable signals that can be used to further align model outputs with design intent. However, in long, structured RTL code sequences, not all tokens contribute equally to functional correctness, and naïvely spreading gradients across all tokens dilutes learning signals. A key insight from our entropy analysis in RTL generation is that only a small fraction of tokens (e.g., always, if, assign, posedge) exhibit high uncertainty and largely influence control flow and module structure. To address these challenges, we present EARL, an Entropy-Aware Reinforcement Learning framework for Verilog generation. EARL performs policy optimization using verifiable reward signals and introduces entropy-guided selective updates that gate policy gradients to high-entropy tokens. This approach preserves training stability and concentrates gradient updates on functionally important regions of code. Our experiments on VerilogEval and RTLLM show that EARL improves functional pass rates over prior LLM baselines by up to 14.7%, while reducing unnecessary updates and improving training stability. These results indicate that focusing RL on critical, high-uncertainty tokens enables more reliable and targeted policy improvement for structured RTL code generation.
zh

[AI-159] Striking the Right Balance between Compute and Copy: Improving LLM Inferencing Under Speculative Decoding

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在CPU上进行推理时,因KV缓存(Key-Value Cache, KV cache)更新操作带来的显著性能开销问题。传统方法在每生成一个token时执行分配、复制和原地分步更新,随着序列长度增长,分配与复制的开销成为瓶颈;而预先分配大KV张量虽可避免复制,却引入冗余计算。解决方案的关键在于提出一种名为Balancing Memory and Compute (BMC) 的新型KV缓存分配机制:BMC每隔r次迭代分配一次带有r个冗余行的KV张量,从而在这些迭代中实现无复制的原地更新,仅以少量冗余计算为代价;同时,该冗余行可被复用于推测解码(Speculative Decoding, SD),进一步提升生成效率。BMC构成一个设计空间,通过解析模型可确定最优r值,在不同硬件平台(包括CPU和GPU)上均实现了显著吞吐加速。

链接: https://arxiv.org/abs/2511.12031
作者: Arun Ramachandran,Ramaswamy Govindarajan,Murali Annavaram,Prakash Raghavendra,Hossein Entezari Zarch,Lei Gao,Chaoyi Jiang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the skyrocketing costs of GPUs and their virtual instances in the cloud, there is a significant desire to use CPUs for large language model (LLM) inference. KV cache update, often implemented as allocation, copying, and in-place strided update for each generated token, incurs significant overhead. As the sequence length increases, the allocation and copy overheads dominate the performance. Alternate approaches may allocate large KV tensors upfront to enable in-place updates, but these matrices (with zero-padded rows) cause redundant computations. In this work, we propose a new KV cache allocation mechanism called Balancing Memory and Compute (BMC). BMC allocates, once every r iterations, KV tensors with r redundant rows, allowing in-place update without copy overhead for those iterations, but at the expense of a small amount of redundant computation. Second, we make an interesting observation that the extra rows allocated in the KV tensors and the resulting redundant computation can be repurposed for Speculative Decoding (SD) that improves token generation efficiency. Last, BMC represents a spectrum of design points with different values of r. To identify the best-performing design point(s), we derive a simple analytical model for BMC. The proposed BMC method achieves an average throughput acceleration of up to 3.2x over baseline HuggingFace (without SD). Importantly when we apply BMC with SD, it results in an additional speedup of up to 1.39x, over and above the speedup offered by SD. Further, BMC achieves a throughput acceleration of up to 1.36x and 2.29x over state-of-the-art inference servers vLLM and DeepSpeed, respectively. Although the BMC technique is evaluated extensively across different classes of CPUs (desktop and server class), we also evaluate the scheme with GPUs and demonstrate that it works well for GPUs.
zh

[AI-160] Look As You Think: Unifying Reasoning and Visual Evidence Attribution for Verifiable Document RAG via Reinforcement Learning AAAI’2026

【速读】:该论文旨在解决视觉文档检索增强生成(VD-RAG)中视觉证据溯源(visual evidence attribution)的精确性与可验证性问题,即如何在多模态问答任务中确保视觉语言模型(VLMs)的推理过程具有细粒度的证据支持和逐步可追溯性。现有方法多采用端到端训练,虽便于答案验证,但缺乏对推理链条中每一步的精细监督和过程一致性约束。解决方案的关键在于提出Chain-of-Evidence (CoE)范式,将Chain-of-Thought (CoT)推理与视觉证据定位统一起来,通过在推理步骤中标注边界框(bounding boxes)和页面索引(page indexes)实现证据的显式锚定;并进一步设计Look As You Think (LAT)强化学习框架,引导模型生成具备一致证据归属的推理路径,仅当CoE轨迹最终输出正确答案时给予奖励,从而促进模型在推理过程中进行自我验证,显著提升结果的可解释性和泛化能力。

链接: https://arxiv.org/abs/2511.12003
作者: Shuochen Liu,Pengfei Luo,Chao Zhang,Yuhao Chen,Haotian Zhang,Qi Liu,Xin Kou,Tong Xu,Enhong Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Poster of AAAI’2026

点击查看摘要

Abstract:Aiming to identify precise evidence sources from visual documents, visual evidence attribution for visual document retrieval-augmented generation (VD-RAG) ensures reliable and verifiable predictions from vision-language models (VLMs) in multimodal question answering. Most existing methods adopt end-to-end training to facilitate intuitive answer verification. However, they lack fine-grained supervision and progressive traceability throughout the reasoning process. In this paper, we introduce the Chain-of-Evidence (CoE) paradigm for VD-RAG. CoE unifies Chain-of-Thought (CoT) reasoning and visual evidence attribution by grounding reference elements in reasoning steps to specific regions with bounding boxes and page indexes. To enable VLMs to generate such evidence-grounded reasoning, we propose Look As You Think (LAT), a reinforcement learning framework that trains models to produce verifiable reasoning paths with consistent attribution. During training, LAT evaluates the attribution consistency of each evidence region and provides rewards only when the CoE trajectory yields correct answers, encouraging process-level self-verification. Experiments on vanilla Qwen2.5-VL-7B-Instruct with Paper- and Wiki-VISA benchmarks show that LAT consistently improves the vanilla model in both single- and multi-image settings, yielding average gains of 8.23% in soft exact match (EM) and 47.0% in IoU@0.5. Meanwhile, LAT not only outperforms the supervised fine-tuning baseline, which is trained to directly produce answers with attribution, but also exhibits stronger generalization across domains.
zh

[AI-161] Goal-Oriented Multi-Agent Reinforcement Learning for Decentralized Agent Teams

【速读】:该论文旨在解决多智能体系统中车辆在动态、不可预测环境中进行协同导航的问题,尤其是在通信受限、无中心控制且观测不完全的条件下,如何实现高效协作。其解决方案的关键在于提出一种去中心化的多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)框架,引入目标感知的通信机制——即智能体仅根据自身目标和局部观测选择性地共享信息,从而在保证协作效率的同时尊重环境中的可见性限制。实验表明,该方法显著提升了任务成功率与完成时间性能,并在智能体数量增加时保持稳定表现,验证了其在复杂场景下的可扩展性与实用性。

链接: https://arxiv.org/abs/2511.11992
作者: Hung Du,Hy Nguyen,Srikanth Thudumu,Rajesh Vasa,Kon Mouzakis
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted poster at the IEEE Consumer Communications Networking Conference (CCNC) 2026

点击查看摘要

Abstract:Connected and autonomous vehicles across land, water, and air must often operate in dynamic, unpredictable environments with limited communication, no centralized control, and partial observability. These real-world constraints pose significant challenges for coordination, particularly when vehicles pursue individual objectives. To address this, we propose a decentralized Multi-Agent Reinforcement Learning (MARL) framework that enables vehicles, acting as agents, to communicate selectively based on local goals and observations. This goal-aware communication strategy allows agents to share only relevant information, enhancing collaboration while respecting visibility limitations. We validate our approach in complex multi-agent navigation tasks featuring obstacles and dynamic agent populations. Results show that our method significantly improves task success rates and reduces time-to-goal compared to non-cooperative baselines. Moreover, task performance remains stable as the number of agents increases, demonstrating scalability. These findings highlight the potential of decentralized, goal-driven MARL to support effective coordination in realistic multi-vehicle systems operating across diverse domains.
zh

[AI-162] Improving Autoformalization Using Direct Dependency Retrieval

【速读】:该论文旨在解决陈述自动形式化(statement autoformalization)中的关键挑战,即如何从非形式化的自然语言数学描述中准确提取并匹配形式化库中的依赖项(library dependencies),以提升生成式 AI 在形式验证任务中的准确性与鲁棒性。当前方法普遍存在上下文感知不足导致的定义或定理幻觉问题,且基于检索增强生成(Retrieval-Augmented Generation, RAG)的方法在形式库依赖检索上精度和召回率较低,难以扩展至大规模公共数据集。解决方案的关键在于提出一种新颖的直接依赖检索(Direct Dependency Retrieval, DDR)框架:该方法通过自然语言描述直接生成候选依赖项,并利用高效的后缀数组(suffix array)校验机制快速验证其在形式化库中的存在性,从而实现高精度、高可扩展性的依赖检索。基于此,作者构建了包含超50万样本的依赖检索数据集并微调了DDR模型,实验证明其在检索精度与召回率上显著优于现有最先进方法,进而使整体自动形式化器在单次尝试准确率和多次尝试稳定性方面均取得优势。

链接: https://arxiv.org/abs/2511.11990
作者: Shaoqi Wang,Lu Yu,Chunjie Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The convergence of deep learning and formal mathematics has spurred research in formal verification. Statement autoformalization, a crucial first step in this process, aims to translate informal descriptions into machine-verifiable representations but remains a significant challenge. The core difficulty lies in the fact that existing methods often suffer from a lack of contextual awareness, leading to hallucination of formal definitions and theorems. Furthermore, current retrieval-augmented approaches exhibit poor precision and recall for formal library dependency retrieval, and lack the scalability to effectively leverage ever-growing public datasets. To bridge this gap, we propose a novel retrieval-augmented framework based on DDR (\textitDirect Dependency Retrieval) for statement autoformalization. Our DDR method directly generates candidate library dependencies from natural language mathematical descriptions and subsequently verifies their existence within the formal library via an efficient suffix array check. Leveraging this efficient search mechanism, we constructed a dependency retrieval dataset of over 500,000 samples and fine-tuned a high-precision DDR model. Experimental results demonstrate that our DDR model significantly outperforms SOTA methods in both retrieval precision and recall. Consequently, an autoformalizer equipped with DDR shows consistent performance advantages in both single-attempt accuracy and multi-attempt stability compared to models using traditional selection-based RAG methods.
zh

[AI-163] LLM -Assisted Formalization Enables Deterministic Detection of Statutory Inconsistency in the Internal Revenue Code

【速读】:该论文旨在解决复杂法律文本中法定不一致性的确定性检测问题,尤其针对美国国内税收法典(Internal Revenue Code, IRC)这类结构复杂、条款交错的法律体系。现有基于大语言模型(Large Language Models, LLMs)的方法在合规性、公平性和立法起草方面具有潜力,但其在处理层级化结构和深度逻辑推理时存在局限,难以实现可靠且可复现的分析结果。解决方案的关键在于构建一个混合神经符号框架,将LLM(如GPT-4o和GPT-5)与符号逻辑(Prolog)相结合:首先利用LLM进行自然语言到形式化规则的初步映射,再通过Prolog实现严谨的逻辑推理与一致性验证,从而获得确定性、可解释且可自主识别不一致区域的结果。实验证明,该方法在准确性和可靠性上显著优于纯概率性提示策略,为法律文本的自动化审查提供了可信赖的技术路径。

链接: https://arxiv.org/abs/2511.11954
作者: Borchuluun Yadamsuren,Steven Keith Platt,Miguel Diaz
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 29 pages, 3 appendices with Prolog code and full codebase available at: this https URL

点击查看摘要

Abstract:This study introduces a hybrid neuro-symbolic framework that achieves deterministic detection of statutory inconsistency in complex law. We use the U.S. Internal Revenue Code (IRC) as a case study because its complexity makes it a fertile domain for identifying conflicts. Our research offers a solution for detecting inconsistent provisions by combining Large Language Models (LLMs) with symbolic logic. LLM-based methods can support compliance, fairness, and statutory drafting, yet tax-specific applications remain sparse. A key challenge is that such models struggle with hierarchical processing and deep structured reasoning, especially over long text. This research addresses these gaps through experiments using GPT-4o, GPT-5, and Prolog. GPT-4o was first used to translate Section 121 into Prolog rules and refine them in SWISH. These rules were then incorporated into prompts to test whether Prolog-augmented prompting improved GPT-4o’s inconsistency detection. GPT-4o, whether prompted with natural language alone or with Prolog augmentation, detected the inconsistency in only one of three strategies (33 percent accuracy), but its reasoning quality differed: natural-language prompting achieved 100 percent rule coverage, while Prolog-augmented prompting achieved 66 percent, indicating more incomplete statutory analysis. In contrast to probabilistic prompting, the hybrid Prolog model produced deterministic and reproducible results. Guided by GPT-5 for refinement, the model formalized the IRC section’s competing interpretations and successfully detected an inconsistency zone. Validation tests confirm that the Prolog implementation is accurate, internally consistent, deterministic, and capable of autonomously identifying inconsistencies. These findings show that LLM-assisted formalization, anchored in symbolic logic, enables transparent and reliable statutory inconsistency detection. Comments: 29 pages, 3 appendices with Prolog code and full codebase available at: this https URL Subjects: Artificial Intelligence (cs.AI) ACMclasses: I.2.4; I.2.7; I.2.3 Cite as: arXiv:2511.11954 [cs.AI] (or arXiv:2511.11954v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2511.11954 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Borchuluun Yadamsuren [view email] [v1] Sat, 15 Nov 2025 00:05:02 UTC (418 KB)
zh

[AI-164] Augmenting The Weather: A Hybrid Counterfactual-SMOTE Algorithm for Improving Crop Growth Prediction When Climate Changes

【速读】:该论文旨在解决机器学习模型在气候变迁预测中因数据分布偏移(out-of-distribution)而导致性能下降的问题,特别是由于历史数据中缺乏足够“气候异常事件”(climate outlier events)的少数类样本,造成模型对极端气候事件的预测能力不足。解决方案的关键在于提出一种新型数据增强方法——基于反事实的SMOTE(Counterfactual-Based SMOTE, CFA-SMOTE),该方法将可解释人工智能(Explainable AI, XAI)中的实例级反事实生成技术与经典的类别不平衡处理方法SMOTE相结合,通过合成代表气候异常事件的虚拟数据点来扩充训练集,从而提升模型在极端气候条件下的预测性能。

链接: https://arxiv.org/abs/2511.11945
作者: Mohammed Temraz,Mark T Keane
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 31 pages, 8 figures

点击查看摘要

Abstract:In recent years, humanity has begun to experience the catastrophic effects of climate change as economic sectors (such as agriculture) struggle with unpredictable and extreme weather events. Artificial Intelligence (AI) should help us handle these climate challenges but its most promising solutions are not good at dealing with climate-disrupted data; specifically, machine learning methods that work from historical data-distributions, are not good at handling out-of-distribution, outlier events. In this paper, we propose a novel data augmentation method, that treats the predictive problems around climate change as being, in part, due to class-imbalance issues; that is, prediction from historical datasets is difficult because, by definition, they lack sufficient minority-class instances of “climate outlier events”. This novel data augmentation method – called Counterfactual-Based SMOTE (CFA-SMOTE) – combines an instance-based counterfactual method from Explainable AI (XAI) with the well-known class-imbalance method, SMOTE. CFA-SMOTE creates synthetic data-points representing outlier, climate-events that augment the dataset to improve predictive performance. We report comparative experiments using this CFA-SMOTE method, comparing it to benchmark counterfactual and class-imbalance methods under different conditions (i.e., class-imbalance ratios). The focal climate-change domain used relies on predicting grass growth on Irish dairy farms, during Europe-wide drought and forage crisis of 2018.
zh

[AI-165] A Neuromorphic Architecture for Scalable Event-Based Control

【速读】:该论文旨在解决神经形态控制系统中连续节奏生成与离散决策难以统一建模的问题,尤其是在复杂系统中实现可扩展性、可靠性和可调性的平衡。其解决方案的关键在于提出一种基于“反弹式胜者为王(Rebound Winner-Take-All, RWTA)”的模块化架构,该架构在细胞到系统层级上融合了离散计算的可靠性与连续调节的可调性:既继承了胜者为王状态机的离散计算能力,又具备兴奋性生物物理电路的连续调谐特性,从而在统一的事件驱动建模语言下实现对连续节律生成和离散决策的协同处理。

链接: https://arxiv.org/abs/2511.11924
作者: Yongkang Huo,Fulvio Forni,Rodolphe Sepulchre
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces the ``rebound Winner-Take-All (RWTA)" motif as the basic element of a scalable neuromorphic control architecture. From the cellular level to the system level, the resulting architecture combines the reliability of discrete computation and the tunability of continuous regulation: it inherits the discrete computation capabilities of winner-take-all state machines and the continuous tuning capabilities of excitable biophysical circuits. The proposed event-based framework addresses continuous rhythmic generation and discrete decision-making in a unified physical modeling language. We illustrate the versatility, robustness, and modularity of the architecture through the nervous system design of a snake robot.
zh

[AI-166] Looking Forward: Challenges and Opportunities in Agent ic AI Reliability

【速读】:该论文旨在解决构建可靠智能体AI(agentic AI)系统所面临的挑战,特别是如何 mitigating the risks of cascading failures(缓解级联故障风险)的问题。其解决方案的关键在于识别并应对动态环境、任务执行不一致、不可预测的涌现行为以及资源密集型可靠性机制等核心研究难题,并通过系统性地探索测试与评估方法来提升智能体AI系统的整体可靠性。

链接: https://arxiv.org/abs/2511.11921
作者: Liudong Xing,Janet(Jing)Lin
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Performance (cs.PF)
备注: 13 pages, 6 figures; This is a preprint of a chapter accepted for publication in Generative and Agentic AI Reliability: Architectures, Challenges, and Trust for Autonomous Systems, published by SpringerNature

点击查看摘要

Abstract:This chapter presents perspectives for challenges and future development in building reliable AI systems, particularly, agentic AI systems. Several open research problems related to mitigating the risks of cascading failures are discussed. The chapter also sheds lights on research challenges and opportunities in aspects including dynamic environments, inconsistent task execution, unpredictable emergent behaviors, as well as resource-intensive reliability mechanisms. In addition, several research directions along the line of testing and evaluating reliability of agentic AI systems are also discussed.
zh

[AI-167] Batch Matrix-form Equations and Implementation of Multilayer Perceptrons

【速读】:该论文旨在解决多层感知机(Multilayer Perceptron, MLP)在批量矩阵形式(batch matrix-form)下的算法细节缺失问题,即当前文献中普遍以单样本或依赖自动微分(automatic differentiation)的方式描述梯度计算,缺乏对批量前向与反向传播过程的显式、完整数学表达。这一不足限制了在稀疏神经网络等场景下对计算结构的透明分析与系统优化。解决方案的关键在于:首先,提供一套数学严谨且可直接实现的MLP批量矩阵形式前向与反向传播公式,涵盖标准和高级层(如批归一化和Softmax);其次,利用符号数学库SymPy对所有梯度方程进行验证;最终,基于这些公式构建统一的NumPy、PyTorch、JAX、TensorFlow及高性能C++参考实现,证明显式批量矩阵形式可有效支持稀疏计算优化,从而为神经网络算法的理解、教学与研究奠定可验证、可扩展的基础。

链接: https://arxiv.org/abs/2511.11918
作者: Wieger Wesselink,Bram Grooten,Huub van de Wetering,Qiao Xiao,Decebal Constantin Mocanu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 32 pages; submitted to JMLR

点击查看摘要

Abstract:Multilayer perceptrons (MLPs) remain fundamental to modern deep learning, yet their algorithmic details are rarely presented in complete, explicit \emphbatch matrix-form. Rather, most references express gradients per sample or rely on automatic differentiation. Although automatic differentiation can achieve equally high computational efficiency, the usage of batch matrix-form makes the computational structure explicit, which is essential for transparent, systematic analysis, and optimization in settings such as sparse neural networks. This paper fills that gap by providing a mathematically rigorous and implementation-ready specification of MLPs in batch matrix-form. We derive forward and backward equations for all standard and advanced layers, including batch normalization and softmax, and validate all equations using the symbolic mathematics library SymPy. From these specifications, we construct uniform reference implementations in NumPy, PyTorch, JAX, TensorFlow, and a high-performance C++ backend optimized for sparse operations. Our main contributions are: (1) a complete derivation of batch matrix-form backpropagation for MLPs, (2) symbolic validation of all gradient equations, (3) uniform Python and C++ reference implementations grounded in a small set of matrix primitives, and (4) demonstration of how explicit formulations enable efficient sparse computation. Together, these results establish a validated, extensible foundation for understanding, teaching, and researching neural network algorithms.
zh

[AI-168] An Analysis of Architectural Impact on LLM -based Abstract Visual Reasoning : A Systematic Benchmark on RAVEN-FAIR

【速读】:该论文旨在系统评估大语言模型(Large Language Models, LLMs)在抽象视觉推理任务中的性能表现,重点考察不同模型架构对推理能力的影响。其解决方案的关键在于采用多模型(GPT-4.1-Mini、Claude-3.5-Haiku、Gemini-1.5-Flash、Llama-3.3-70b)与多架构(单次推理、嵌入控制重复、自我反思、多智能体)组合,在RAVEN-FAIR数据集上进行实验,并通过SSIM和LPIPS指标量化视觉响应质量,结合Chain-of-Thought评分及错误类型(语义幻觉、数值误判)分析推理过程。研究发现,GPT-4.1-Mini在所有架构中均表现最优,但各模型对架构设计的敏感性存在差异,且响应覆盖度变化成为跨架构比较的重要干扰因素;为更可靠估计性能上限,采用五次独立运行的最佳结果作为基准,符合当前强调多轮评估以避免单次运行偏差的研究趋势。

链接: https://arxiv.org/abs/2511.11916
作者: Sinan Urgun,Seçkin Arı
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 23 pages, 9 figures

点击查看摘要

Abstract:This study aims to systematically evaluate the performance of large language models (LLMs) in abstract visual reasoning problems. We examined four LLM models (GPT-4.1-Mini, Claude-3.5-Haiku, Gemini-1.5-Flash, Llama-3.3-70b) utilizing four different reasoning architectures (single-shot, embedding-controlled repetition, self-reflection, and multi-agent) on the RAVEN-FAIR dataset. Visual responses generated through a three-stage process (JSON extraction, LLM reasoning, and Tool Function) were evaluated using SSIM and LPIPS metrics; Chain-of-Thought scores and error types (semantic hallucination, numeric misperception) were analyzed. Results demonstrate that GPT-4.1-Mini consistently achieved the highest overall accuracy across all architectures, indicating a strong reasoning capability. While the multi-agent architecture occasionally altered semantic and numeric balance across models, these effects were not uniformly beneficial. Instead, each model exhibited distinct sensitivity patterns to architectural design, underscoring that reasoning effectiveness remains model-specific. Variations in response coverage further emerged as a confounding factor that complicates direct cross-architecture comparison. To estimate the upper-bound performance of each configuration, we report the best of five independent runs, representing a best-case scenario rather than an averaged outcome. This multi-run strategy aligns with recent recommendations, which emphasize that single-run evaluations are fragile and may lead to unreliable conclusions.
zh

[AI-169] KVSwap: Disk-aware KV Cache Offloading for Long-Context On-device Inference

【速读】:该论文旨在解决大语言模型(Language Models, LMs)在移动端和嵌入式设备上进行长上下文推理时面临的内存容量瓶颈问题。由于键值缓存(Key-Value Cache, KV Cache)的大小随上下文长度和批处理规模线性增长,本地部署受限于有限的内存资源,导致无法高效处理多输入长文本任务。解决方案的关键在于提出KVSwap框架,通过将KV缓存完整地卸载至非易失性存储(如磁盘),并利用紧凑的内存元数据动态预测关键缓存条目进行预加载,同时实现计算与硬件感知的磁盘访问重叠,并优化读取模式以适配存储设备特性,从而在严苛内存预算下显著提升吞吐量,且保持生成质量不下降。

链接: https://arxiv.org/abs/2511.11907
作者: Huawei Zhang,Chunwei Xia,Zheng Wang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Language models (LMs) underpin emerging mobile and embedded AI applications like meeting and video summarization and document analysis, which often require processing multiple long-context inputs. Running an LM locally on-device improves privacy, enables offline use, and reduces cost, but long-context inference quickly hits a \emphmemory capacity wall as the key-value (KV) cache grows linearly with context length and batch size. We present KVSwap, a software framework to break this memory wall by offloading the KV cache to non-volatile secondary storage (disk). KVSwap leverages the observation that only a small, dynamically changing subset of KV entries is critical for generation. It stores the full cache on disk, uses a compact in-memory metadata to predict which entries to preload, overlaps computation with hardware-aware disk access, and orchestrates read patterns to match storage device characteristics. Our evaluation shows that across representative LMs and storage types, KVSwap delivers higher throughput under tight memory budgets while maintaining the generation quality when compared with existing KV cache offloading schemes. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI) Cite as: arXiv:2511.11907 [cs.DC] (or arXiv:2511.11907v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2511.11907 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-170] Robust Bidirectional Associative Memory via Regularization Inspired by the Subspace Rotation Algorithm

【速读】:该论文旨在解决双向联想记忆(Bidirectional Associative Memory, BAM)在使用双向反向传播(Bidirectional Backpropagation, B-BP)训练时存在的鲁棒性差、对噪声和对抗攻击敏感的问题。解决方案的关键在于提出一种无梯度训练算法——双向子空间旋转算法(Bidirectional Subspace Rotation Algorithm, B-SRA),并基于实验发现两个核心原则:正交权重矩阵(Orthogonal Weight Matrices, OWM)和梯度模式对齐(Gradient-Pattern Alignment, GPA)。通过将这两个原则引入B-BP的正则化策略中,显著提升了BAM模型在多种干扰场景下的稳定性和抗噪能力,其中同时集成OWM与GPA的SAME配置展现出最优鲁棒性。

链接: https://arxiv.org/abs/2511.11902
作者: Ci Lin,Tet Yeap,Iluju Kiringa,Biwei Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Bidirectional Associative Memory (BAM) trained with Bidirectional Backpropagation (B-BP) often suffers from poor robustness and high sensitivity to noise and adversarial attacks. To address these issues, we propose a novel gradient-free training algorithm, the Bidirectional Subspace Rotation Algorithm (B-SRA), which significantly improves the robustness and convergence behavior of BAM. Through comprehensive experiments, we identify two key principles – orthogonal weight matrices (OWM) and gradient-pattern alignment (GPA) – as central to enhancing the robustness of BAM. Motivated by these findings, we introduce new regularization strategies into B-BP, resulting in models with greatly improved resistance to corruption and adversarial perturbations. We further conduct an ablation study across different training strategies to determine the most robust configuration and evaluate BAM’s performance under a variety of attack scenarios and memory capacities, including 50, 100, and 200 associative pairs. Among all methods, the SAME configuration, which integrates both OWM and GPA, achieves the strongest resilience. Overall, our results demonstrate that B-SRA and the proposed regularization strategies lead to substantially more robust associative memories and open new directions for building resilient neural architectures.
zh

[AI-171] VULPO: Context-Aware Vulnerability Detection via On-Policy LLM Optimization

【速读】:该论文旨在解决当前漏洞检测(Vulnerability Detection, VD)技术在上下文感知能力上的局限性问题,即现有基于传统机器学习或大语言模型(LLM)的方法(如提示工程、监督微调或离策略偏好优化)依赖固定输入或静态偏好数据集,无法自适应地探索代码库级别的依赖关系,且受限于函数级基准测试而忽视关键漏洞上下文信息。其解决方案的核心是提出一种基于策略梯度的在线强化学习框架——漏洞自适应策略优化(Vulnerability-Adaptive Policy Optimization, VULPO),通过构建包含仓库级上下文信息的ContextVul数据集,并设计多维奖励结构(涵盖预测正确性、漏洞定位精度与分析语义相关性),同时引入标签级和样本级难度自适应奖励缩放机制,以提升模型对复杂漏洞案例的识别能力并防止奖励劫持。实验表明,VULPO-4B在上下文感知漏洞检测任务中显著优于现有基线方法,在F1指标上较Qwen3-4B提升85%,性能接近参数量达其150倍的DeepSeek-R1-0528模型。

链接: https://arxiv.org/abs/2511.11896
作者: Youpeng Li,Fuxun Yu,Xinda Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:The widespread reliance on open-source software dramatically increases the risk of vulnerability exploitation, underscoring the need for effective and scalable vulnerability detection (VD). Existing VD techniques, whether traditional machine learning-based or LLM-based approaches like prompt engineering, supervised fine-tuning, or off-policy preference optimization, remain fundamentally limited in their ability to perform context-aware analysis: They depend on fixed inputs or static preference datasets, cannot adaptively explore repository-level dependencies, and are constrained by function-level benchmarks that overlook critical vulnerability context. This paper introduces Vulnerability-Adaptive Policy Optimization (VULPO), an on-policy LLM reinforcement learning framework for context-aware VD. To support training and evaluation, we first construct ContextVul, a new dataset that augments high-quality function-level samples with lightweight method to extract repository-level context information. We then design multi-dimensional reward structuring that jointly captures prediction correctness, vulnerability localization accuracy, and the semantic relevance of vulnerability analysis, thereby guiding the model toward comprehensive contextual reasoning. To address the asymmetric difficulty of different vulnerability cases and mitigate reward hacking, VULPO incorporates label-level and sample-level difficulty-adaptive reward scaling, encouraging the model to explore challenging cases while maintaining balanced reward distribution. Extensive experiments demonstrate the superiority of our VULPO framework in context-aware VD: Our VULPO-4B substantially outperforms existing VD baselines based on prompt engineering and off-policy optimization, improving F1 by 85% over Qwen3-4B and achieving performance comparable to a 150x larger-scale model, DeepSeek-R1-0528. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE) Cite as: arXiv:2511.11896 [cs.CR] (or arXiv:2511.11896v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2511.11896 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-172] Chain-of-Generation: Progressive Latent Diffusion for Text-Guided Molecular Design

【速读】:该论文旨在解决文本条件分子生成中因采用单次条件编码(one-shot conditioning)而导致的语义对齐不足、子结构遗漏以及多约束同时满足困难等问题,这些问题限制了生成分子对复杂、组合式自然语言描述的忠实性与可控性。其解决方案的关键在于提出Chain-of-Generation (CoG) 框架,该框架无需训练即可实现多阶段潜空间扩散:将提示分解为课程有序的语义片段,并逐级作为中间目标引入扩散过程,引导去噪轨迹逐步满足更丰富的语言约束;同时引入后对齐学习阶段以增强文本与分子潜空间之间的对应关系,从而显著提升生成分子的语义一致性、多样性与可控性。

链接: https://arxiv.org/abs/2511.11894
作者: Lingxiao Li,Haobo Zhang,Bin Chen,Jiayu Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 22 pages, 7 figures, 10 tables

点击查看摘要

Abstract:Text-conditioned molecular generation aims to translate natural-language descriptions into chemical structures, enabling scientists to specify functional groups, scaffolds, and physicochemical constraints without handcrafted rules. Diffusion-based models, particularly latent diffusion models (LDMs), have recently shown promise by performing stochastic search in a continuous latent space that compactly captures molecular semantics. Yet existing methods rely on one-shot conditioning, where the entire prompt is encoded once and applied throughout diffusion, making it hard to satisfy all the requirements in the prompt. We discuss three outstanding challenges of one-shot conditioning generation, including the poor interpretability of the generated components, the failure to generate all substructures, and the overambition in considering all requirements simultaneously. We then propose three principles to address those challenges, motivated by which we propose Chain-of-Generation (CoG), a training-free multi-stage latent diffusion framework. CoG decomposes each prompt into curriculum-ordered semantic segments and progressively incorporates them as intermediate goals, guiding the denoising trajectory toward molecules that satisfy increasingly rich linguistic constraints. To reinforce semantic guidance, we further introduce a post-alignment learning phase that strengthens the correspondence between textual and molecular latent spaces. Extensive experiments on benchmark and real-world tasks demonstrate that CoG yields higher semantic alignment, diversity, and controllability than one-shot baselines, producing molecules that more faithfully reflect complex, compositional prompts while offering transparent insight into the generation process.
zh

[AI-173] FLEX: Feature Importance from Layered Counterfactual Explanations

【速读】:该论文旨在解决机器学习模型在高风险场景中因缺乏可解释性而导致的安全部署难题,特别是现有基于反事实解释(counterfactual explanations)的方法通常仅提供个体实例层面的“若…则…”建议,无法量化特征在局部区域或整个数据集范围内对预测结果变化的系统性影响。其解决方案的关键在于提出FLEX(Feature importance from Layered counterfactual EXplanations)框架,该框架通过将一组反事实样本转换为多层次(局部、区域、全局)的特征变更频率得分,实现了从局部反事实推理到全局特征重要性排序的跨尺度归纳,从而揭示哪些特征更频繁地需要改变以翻转预测结果,并支持根据实际约束(如稀疏性、可行性或可操作性)灵活调整解释粒度与重点,最终在交通事故严重程度预测和贷款审批两个任务中验证了其优于SHAP和LIME等传统方法的可解释性和实用性。

链接: https://arxiv.org/abs/2511.11891
作者: Nawid Keshtmand,Roussel Desmond Nzoyem,Jeffrey Nicholas Clark
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 6 figures, 3 tables, 2 algorithms. Preprint under review

点击查看摘要

Abstract:Machine learning models achieve state-of-the-art performance across domains, yet their lack of interpretability limits safe deployment in high-stakes settings. Counterfactual explanations are widely used to provide actionable “what-if” recourse, but they typically remain instance-specific and do not quantify which features systematically drive outcome changes within coherent regions of the feature space or across an entire dataset. We introduce FLEX (Feature importance from Layered counterfactual EXplanations), a model- and domain-agnostic framework that converts sets of counterfactuals into feature change frequency scores at local, regional, and global levels. FLEX generalises local change-frequency measures by aggregating across instances and neighbourhoods, offering interpretable rankings that reflect how often each feature must change to flip predictions. The framework is compatible with different counterfactual generation methods, allowing users to emphasise characteristics such as sparsity, feasibility, or actionability, thereby tailoring the derived feature importances to practical constraints. We evaluate FLEX on two contrasting tabular tasks: traffic accident severity prediction and loan approval, and compare FLEX to SHAP- and LIME-derived feature importance values. Results show that (i) FLEX’s global rankings correlate with SHAP while surfacing additional drivers, and (ii) regional analyses reveal context-specific factors that global summaries miss. FLEX thus bridges the gap between local recourse and global attribution, supporting transparent and intervention-oriented decision-making in risk-sensitive applications.
zh

[AI-174] Flash-Fusion: Enabling Expressive Low-Latency Queries on IoT Sensor Streams with LLM s

【速读】:该论文旨在解决用户在利用大语言模型(Large Language Models, LLMs)分析物联网(Internet of Things, IoT)数据时面临的两大核心问题:一是数据采集基础设施成本高,产生海量低级传感器读数,难以直接用于LLM处理;二是数据分析效率低下,需大量迭代和专业知识,且直接将原始IoT遥测数据输入LLM存在上下文窗口限制、高昂的token成本及非交互式延迟。解决方案的关键在于提出Flash-Fusion系统,其设计遵循两个原则:(1)基于边缘端的统计摘要(edge-based statistical summarization),实现73.5%的数据压缩以应对数据体量问题;(2)云端查询规划机制,通过聚类行为数据并构建富含上下文的提示(prompt),提升语义理解能力。该系统在大学校车车队部署中验证,相较直接输入原始数据至先进LLM的基线方案,实现了95%的延迟降低与98%的token使用量及成本削减,同时保持高质量响应,显著减轻多学科用户(如安全专员、城市规划师、车队经理等)的手动查询编写与预处理负担。

链接: https://arxiv.org/abs/2511.11885
作者: Kausar Patherya,Ashutosh Dhekne,Francisco Romero
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: 12 pages, 5 figures. Under review

点击查看摘要

Abstract:Smart cities and pervasive IoT deployments have generated interest in IoT data analysis across transportation and urban planning. At the same time, Large Language Models offer a new interface for exploring IoT data - particularly through natural language. Users today face two key challenges when working with IoT data using LLMs: (1) data collection infrastructure is expensive, producing terabytes of low-level sensor readings that are too granular for direct use, and (2) data analysis is slow, requiring iterative effort and technical expertise. Directly feeding all IoT telemetry to LLMs is impractical due to finite context windows, prohibitive token costs at scale, and non-interactive latencies. What is missing is a system that first parses a user’s query to identify the analytical task, then selects the relevant data slices, and finally chooses the right representation before invoking an LLM. We present Flash-Fusion, an end-to-end edge-cloud system that reduces the IoT data collection and analysis burden on users. Two principles guide its design: (1) edge-based statistical summarization (achieving 73.5% data reduction) to address data volume, and (2) cloud-based query planning that clusters behavioral data and assembles context-rich prompts to address data interpretation. We deploy Flash-Fusion on a university bus fleet and evaluate it against a baseline that feeds raw data to a state-of-the-art LLM. Flash-Fusion achieves a 95% latency reduction and 98% decrease in token usage and cost while maintaining high-quality responses. It enables personas across disciplines - safety officers, urban planners, fleet managers, and data scientists - to efficiently iterate over IoT data without the burden of manual query authoring or preprocessing. Comments: 12 pages, 5 figures. Under review Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Databases (cs.DB) MSC classes: I.2.7 Cite as: arXiv:2511.11885 [cs.DC] (or arXiv:2511.11885v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2511.11885 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-175] A Multimodal Manufacturing Safety Chatbot: Knowledge Base Design Benchmark Development and Evaluation of Multiple RAG Approaches

【速读】:该论文旨在解决现代制造环境中工人安全培训的挑战,特别是在工业5.0背景下向以人为中心的生产模式转型时,如何设计高效、低成本且高精度的安全培训系统。其解决方案的关键在于构建一个基于检索增强生成(Retrieval-Augmented Generation, RAG)的多模态聊天机器人,该模型通过整合定制化的法规与技术文档来确保回答的准确性,并采用系统化实验方法优化检索策略与模型配置,在保证86.66%准确率的同时实现平均10.04秒延迟和每查询0.005美元成本,从而满足下一代安全培训系统的高精度、低延迟与低成本三大核心需求。

链接: https://arxiv.org/abs/2511.11847
作者: Ryan Singh,Austin Hamilton,Amanda White,Michael Wise,Ibrahim Yousif,Arthur Carvalho,Zhe Shan,Reza Abrisham Baf,Mohammad Mayyas,Lora A. Cavuoto,Fadel M. Megahed
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 25 pages, 5 figures

点击查看摘要

Abstract:Ensuring worker safety remains a critical challenge in modern manufacturing environments. Industry 5.0 reorients the prevailing manufacturing paradigm toward more human-centric operations. Using a design science research methodology, we identify three essential requirements for next-generation safety training systems: high accuracy, low latency, and low cost. We introduce a multimodal chatbot powered by large language models that meets these design requirements. The chatbot uses retrieval-augmented generation to ground its responses in curated regulatory and technical documentation. To evaluate our solution, we developed a domain-specific benchmark of expert-validated question and answer pairs for three representative machines: a Bridgeport manual mill, a Haas TL-1 CNC lathe, and a Universal Robots UR5e collaborative robot. We tested 24 RAG configurations using a full-factorial design and assessed them with automated evaluations of correctness, latency, and cost. Our top 2 configurations were then evaluated by ten industry experts and academic researchers. Our results show that retrieval strategy and model configuration have a significant impact on performance. The top configuration (selected for chatbot deployment) achieved an accuracy of 86.66%, an average latency of 10.04 seconds, and an average cost of 0.005 per query. Overall, our work provides three contributions: an open-source, domain-grounded safety training chatbot; a validated benchmark for evaluating AI-assisted safety instruction; and a systematic methodology for designing and assessing AI-enabled instructional and immersive safety training systems for Industry 5.0 environments.
zh

[AI-176] Autonomous Underwater Cognitive System for Adaptive Navigation: A SLAM-Integrated Cognitive Architecture

【速读】:该论文旨在解决深海探索中因动态水下环境导致的定向迷失、通信中断和导航失效等问题。其解决方案的关键在于提出了一种自主水下认知系统(Autonomous Underwater Cognitive System, AUCS),该系统将同时定位与地图构建(SLAM)与基于Soar的认知架构相结合,通过融合声呐(SONAR)、激光雷达(LiDAR)、惯性测量单元(IMU)和多普勒速度计(DVL)等多源传感器数据,并集成感知、注意、规划与学习等认知推理模块,实现对动态与静态目标的语义区分,从而减少误闭环(false loop closures),提升长期地图一致性。该架构构建了完整的感知-认知-行动-学习闭环,使自主水下航行器具备智能感知与自适应能力,为下一代认知型潜水器系统提供了技术基础。

链接: https://arxiv.org/abs/2511.11845
作者: K. A. I. N Jayarathne,R. M. N. M. Rathnayaka,D. P. S. S. Peiris
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: 6 pages, 2 figures

点击查看摘要

Abstract:Deep-sea exploration poses significant challenges, including disorientation, communication loss, and navigational failures in dynamic underwater environments. This paper presents an Autonomous Underwater Cognitive System (AUCS) that integrates Simultaneous Localization and Mapping (SLAM) with a Soar-based cognitive architecture to enable adaptive navigation in complex oceanic conditions. The system fuses multi-sensor data from SONAR, LiDAR, IMU, and DVL with cognitive reasoning modules for perception, attention, planning, and learning. Unlike conventional SLAM systems, AUCS incorporates semantic understanding, adaptive sensor management, and memory-based learning to differentiate between dynamic and static objects, reducing false loop closures and enhancing long-term map consistency. The proposed architecture demonstrates a complete perception-cognition-action-learning loop, allowing autonomous underwater vehicles to sense, reason, and adapt intelligently. This work lays a foundation for next-generation cognitive submersible systems, improving safety, reliability, and autonomy in deep-sea exploration.
zh

[AI-177] Securing Generative AI in Healthcare: A Zero-Trust Architecture Powered by Confidential Computing on Google Cloud

【速读】:该论文旨在解决生成式 AI(Generative AI)在医疗健康领域应用中因传统安全框架无法覆盖“数据在用状态”(data-in-use gap)而引发的重大安全挑战,即敏感患者数据和专有 AI 模型在处理过程中暴露于潜在攻击的风险。解决方案的关键在于提出一种新型安全范式——保密零信任框架(Confidential Zero-Trust Framework, CZF),其核心是将零信任架构(Zero-Trust Architecture)的细粒度访问控制与可信执行环境(Trusted Execution Environment, TEE)提供的硬件强制数据隔离相结合,实现数据在使用时仍保持加密状态,并通过远程认证(remote attestation)提供工作负载完整性的密码学证明,从而构建可验证、多方协作的安全基础,推动医疗 AI 的负责任落地。

链接: https://arxiv.org/abs/2511.11836
作者: Adaobi Amanna,Ishana Shinde
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 19 Pages, 1 Figure, 1 Table

点击查看摘要

Abstract:The integration of Generative Artificial Intelligence (GenAI) in healthcare is impeded by significant security challenges unaddressed by traditional frameworks, precisely the data-in-use gap where sensitive patient data and proprietary AI models are exposed during active processing. To address this, the paper proposes the Confidential Zero-Trust Framework (CZF), a novel security paradigm that synergistically combines Zero-Trust Architecture for granular access control with the hardware-enforced data isolation of Confidential Computing. We detailed a multi-tiered architectural blueprint for implementing the CZF on Google Cloud and analyzed its efficacy against real-world threats. The CZF provides a defense-in-depth architecture where data remains encrypted while in-use within a hardware-based Trusted Execution Environment (TEE). The framework’s use of remote attestation offers cryptographic proof of workload integrity, transforming compliance from a procedural exercise into a verifiable technical fact and enabling secure, multi-party collaborations previously blocked by security and intellectual property concerns. By closing the data-in-use gap and enforcing Zero-Trust principles, the CZF provides a robust and verifiable framework that establishes the necessary foundation of trust to enable the responsible adoption of transformative AI technologies in healthcare.
zh

[AI-178] Volatility in Certainty (VC): A Metric for Detecting Adversarial Perturbations During Inference in Neural Network Classifiers

【速读】:该论文旨在解决神经网络分类器在实时系统中缺乏真实标签时,难以有效评估模型鲁棒性与性能退化的问题。其解决方案的关键在于提出并验证“确定性波动性”(Volatility in Certainty, VC)这一无需标签的度量指标,通过计算排序后Softmax输出相邻值的对数平方比的平均值,捕捉模型置信度分布的局部不平滑性;实验表明VC与分类准确率呈强负相关(多数情况下相关系数ρ = -0.90),能有效反映对抗样本引入导致的性能下降,具备架构无关性和实时性,适用于安全关键场景下的早期预警系统。

链接: https://arxiv.org/abs/2511.11834
作者: Vahid Hemmati,Ahmad Mohammadi,Abdul-Rauf Nuhu,Reza Ahmari,Parham Kebria,Abdollah Homaifar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Adversarial robustness remains a critical challenge in deploying neural network classifiers, particularly in real-time systems where ground-truth labels are unavailable during inference. This paper investigates \textitVolatility in Certainty (VC), a recently proposed, label-free metric that quantifies irregularities in model confidence by measuring the dispersion of sorted softmax outputs. Specifically, VC is defined as the average squared log-ratio of adjacent certainty values, capturing local fluctuations in model output smoothness. We evaluate VC as a proxy for classification accuracy and as an indicator of adversarial drift. Experiments are conducted on artificial neural networks (ANNs) and convolutional neural networks (CNNs) trained on MNIST, as well as a regularized VGG-like model trained on CIFAR-10. Adversarial examples are generated using the Fast Gradient Sign Method (FGSM) across varying perturbation magnitudes. In addition, mixed test sets are created by gradually introducing adversarial contamination to assess VC’s sensitivity under incremental distribution shifts. Our results reveal a strong negative correlation between classification accuracy and log(VC) (correlation rho -0.90 in most cases), suggesting that VC effectively reflects performance degradation without requiring labeled data. These findings position VC as a scalable, architecture-agnostic, and real-time performance metric suitable for early-warning systems in safety-critical applications.
zh

[AI-179] Conformal Constrained Policy Optimization for Cost-Effective LLM Agents

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际应用中面临的高计算成本与API开销问题,同时保持用户指定的可靠性水平。其核心挑战在于如何在多模型异构组合中实现成本最小化,而不牺牲任务准确性。解决方案的关键在于提出一种名为“合规约束策略优化”(Conformal Constrained Policy Optimization, CCPO)的新训练范式,该方法融合了约束策略优化、离策略强化学习以及在线合规预测(online conformal prediction)技术,能够联合优化一个成本感知的策略(评分函数)和一个自适应阈值,从而在保证可靠性约束的前提下显著降低整体运行成本。实验表明,CCPO在两个多跳问答基准上相较其他成本敏感基线方法实现了最高达30%的成本节省,且不损害可靠性。

链接: https://arxiv.org/abs/2511.11828
作者: Wenwen Si,Sooyong Jang,Insup Lee,Osbert Bastani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While large language models (LLMs) have recently made tremendous progress towards solving challenging AI problems, they have done so at increasingly steep computational and API costs. We propose a novel strategy where we combine multiple LLM models with varying cost/accuracy tradeoffs in an agentic manner, where models and tools are run in sequence as determined by an orchestration model to minimize cost subject to a user-specified level of reliability; this constraint is formalized using conformal prediction to provide guarantees. To solve this problem, we propose Conformal Constrained Policy Optimization (CCPO), a training paradigm that integrates constrained policy optimization with off-policy reinforcement learning and recent advances in online conformal prediction. CCPO jointly optimizes a cost-aware policy (score function) and an adaptive threshold. Across two multi-hop question answering benchmarks, CCPO achieves up to a 30% cost reduction compared to other cost-aware baselines and LLM-guided methods without compromising reliability. Our approach provides a principled and practical framework for deploying LLM agents that are significantly more cost-effective while maintaining reliability.
zh

[AI-180] Real-Time Speech Enhancement via a Hybrid ViT: A Dual-Input Acoustic-Image Feature Fusion

【速读】:该论文旨在解决在非平稳噪声环境(如狗吠、婴儿哭声)中,语音质量与可懂度显著下降的问题,尤其是在实时应用场景下传统深度学习模型性能受限的挑战。其解决方案的关键在于提出一种基于双输入声学图像特征融合的混合视觉Transformer(Hybrid ViT)框架,能够有效建模噪声信号中的时序和频谱依赖关系,同时具备计算轻量化特性,适合嵌入式设备部署。实验表明,该方法在Librispeech、UrbanSound8K和Google Audioset数据集上的多项指标(PESQ、STOI、Seg SNR、LLR)均优于原始含噪语音,接近干净参考信号的表现。

链接: https://arxiv.org/abs/2511.11825
作者: Behnaz Bahmei,Siamak Arzanpour,Elina Birmingham
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Speech quality and intelligibility are significantly degraded in noisy environments. This paper presents a novel transformer-based learning framework to address the single-channel noise suppression problem for real-time applications. Although existing deep learning networks have shown remarkable improvements in handling stationary noise, their performance often diminishes in real-world environments characterized by non-stationary noise (e.g., dog barking, baby crying). The proposed dual-input acoustic-image feature fusion using a hybrid ViT framework effectively models both temporal and spectral dependencies in noisy signals. Designed for real-world audio environments, the proposed framework is computationally lightweight and suitable for implementation on embedded devices. To evaluate its effectiveness, four standard and commonly used quality measurements, namely PESQ, STOI, Seg SNR, and LLR, are utilized. Experimental results obtained using the Librispeech dataset as the clean speech source and the UrbanSound8K and Google Audioset datasets as the noise sources, demonstrate that the proposed method significantly improves noise reduction, speech intelligibility, and perceptual quality compared to the noisy input signal, achieving performance close to the clean reference.
zh

[AI-181] Differences in the Moral Foundations of Large Language Models

【速读】:该论文试图解决大语言模型(Large Language Models, LLMs)在政治、商业和教育等关键领域应用中,其规范性伦理判断机制不透明的问题。现有对齐研究尚未充分借鉴道德心理学领域的洞见来指导前沿模型的训练与评估。论文的关键解决方案是采用乔纳森·海德特(Jonathan Haidt)提出的道德基础理论(Moral Foundations Theory, MFT),对来自多家主流模型提供商的广泛LLM进行合成实验,以系统性地提取和比较模型与人类基准群体在道德价值判断上的差异。研究发现,不同模型之间及其与代表性人类样本相比,在依赖的道德基础维度上存在显著偏差和方差,且这种差异随模型能力增强而加剧,从而为后续基于MFT的模型微调及政策制定者对LLM对齐中道德基础重要性的深入讨论提供了实证依据。

链接: https://arxiv.org/abs/2511.11790
作者: Peter Kirgis
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models are increasingly being used in critical domains of politics, business, and education, but the nature of their normative ethical judgment remains opaque. Alignment research has, to date, not sufficiently utilized perspectives and insights from the field of moral psychology to inform training and evaluation of frontier models. I perform a synthetic experiment on a wide range of models from most major model providers using Jonathan Haidt’s influential moral foundations theory (MFT) to elicit diverse value judgments from LLMs. Using multiple descriptive statistical approaches, I document the bias and variance of large language model responses relative to a human baseline in the original survey. My results suggest that models rely on different moral foundations from one another and from a nationally representative human baseline, and these differences increase as model capabilities increase. This work seeks to spur further analysis of LLMs using MFT, including finetuning of open-source models, and greater deliberation by policymakers on the importance of moral foundations for LLM alignment.
zh

[AI-182] From Single to Societal: Analyzing Persona-Induced Bias in Multi-Agent Interactions AAAI-2026

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)驱动的多智能体系统中因角色设定(persona)引入的社会偏见问题,特别是这些偏见如何影响智能体间的信任度(trustworthiness)和坚持度(insistence)等社会特质表现。其解决方案的关键在于通过受控实验揭示:LLM智能体在协作与说服任务中表现出显著的群体偏好偏差,即来自历史上优势群体(如男性和白人)的角色被赋予较低的信任度和较少的坚持度,且存在明显的内群体偏好(in-group favoritism),表现为更倾向于服从具有相同身份背景的其他智能体。这一发现表明,即便在无意识情况下,LLM代理的行为也会复制现实世界中的结构性不平等,亟需在多智能体系统设计中引入公平性评估与干预机制以提升其可靠性与伦理安全性。

链接: https://arxiv.org/abs/2511.11789
作者: Jiayi Li,Xiao Liu,Yansong Feng
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: AAAI-2026

点击查看摘要

Abstract:Large Language Model (LLM)-based multi-agent systems are increasingly used to simulate human interactions and solve collaborative tasks. A common practice is to assign agents with personas to encourage behavioral diversity. However, this raises a critical yet underexplored question: do personas introduce biases into multi-agent interactions? This paper presents a systematic investigation into persona-induced biases in multi-agent interactions, with a focus on social traits like trustworthiness (how an agent’s opinion is received by others) and insistence (how strongly an agent advocates for its opinion). Through a series of controlled experiments in collaborative problem-solving and persuasion tasks, we reveal that (1) LLM-based agents exhibit biases in both trustworthiness and insistence, with personas from historically advantaged groups (e.g., men and White individuals) perceived as less trustworthy and demonstrating less insistence; and (2) agents exhibit significant in-group favoritism, showing a higher tendency to conform to others who share the same persona. These biases persist across various LLMs, group sizes, and numbers of interaction rounds, highlighting an urgent need for awareness and mitigation to ensure the fairness and reliability of multi-agent systems.
zh

[AI-183] MALBO: Optimizing LLM -Based Multi-Agent Teams via Multi-Objective Bayesian Optimization

【速读】:该论文旨在解决多智能体系统中大型语言模型(Large Language Models, LLMs)到专业化角色的最优分配问题,这是一个具有巨大组合搜索空间、昂贵黑箱评估以及性能与成本之间权衡的多目标优化挑战。现有方法主要局限于单智能体场景,缺乏针对多智能体、多目标问题的系统性框架。解决方案的关键在于提出 MALBO(Multi-Agent LLM Bayesian Optimization),一种基于多目标贝叶斯优化(Multi-Objective Bayesian Optimization, MOBO)的自动化框架,通过构建独立高斯过程(Gaussian Process)代理模型,在连续特征空间上对LLM配置进行样本高效的探索,并以期望超体积改进(Expected Hypervolume Improvement)为指导策略,从而识别出任务准确率与推理成本之间的帕累托前沿(Pareto front)。该方法不仅显著降低了平均配置成本(超过45%),还发现了能实现最高性能且成本降低达65.8%的异构专业化团队配置,为部署高效、低成本的多智能体AI系统提供了数据驱动的方法论支撑。

链接: https://arxiv.org/abs/2511.11788
作者: Antonio Sabbatella
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: Master’s Thesis, University of Milano-Bicocca, 2025

点击查看摘要

Abstract:The optimal assignment of Large Language Models (LLMs) to specialized roles in multi-agent systems is a significant challenge, defined by a vast combinatorial search space, expensive black-box evaluations, and an inherent trade-off between performance and cost. Current optimization methods focus on single-agent settings and lack a principled framework for this multi-agent, multi-objective problem. This thesis introduces MALBO (Multi-Agent LLM Bayesian Optimization), a systematic framework designed to automate the efficient composition of LLM-based agent teams. We formalize the assignment challenge as a multi-objective optimization problem, aiming to identify the Pareto front of configurations between task accuracy and inference cost. The methodology employs multi-objective Bayesian Optimization (MOBO) with independent Gaussian Process surrogate models. By searching over a continuous feature-space representation of the LLMs, this approach performs a sample-efficient exploration guided by the expected hypervolume improvement. The primary contribution is a principled and automated methodology that yields a Pareto front of optimal team configurations. Our results demonstrate that the Bayesian optimization phase, compared to an initial random search, maintained a comparable average performance while reducing the average configuration cost by over 45%. Furthermore, MALBO identified specialized, heterogeneous teams that achieve cost reductions of up to 65.8% compared to homogeneous baselines, all while maintaining maximum performance. The framework thus provides a data-driven tool for deploying cost-effective and highly specialized multi-agent AI systems. Comments: Master’s Thesis, University of Milano-Bicocca, 2025 Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI) Cite as: arXiv:2511.11788 [cs.MA] (or arXiv:2511.11788v1 [cs.MA] for this version) https://doi.org/10.48550/arXiv.2511.11788 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-184] NegBLEURT Forest: Leverag ing Inconsistencies for Detecting Jailbreak Attacks

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对对抗性提示(adversarial prompts)时容易遭受越狱攻击(jailbreak attacks)的问题,即模型可能生成有害或不当内容,从而绕过预设的安全机制。传统方法依赖于阈值校准或模型微调,但存在泛化能力差和上下文敏感等问题。其解决方案的关键在于提出一种基于语义一致性的分析方法——通过比较成功与失败响应之间的语义差异,引入一种感知否定的评分机制(negation-aware scoring approach),从而捕捉到越狱行为的本质特征;进一步构建了名为NegBLEURT Forest的检测框架,利用孤立森林(Isolation Forest)算法识别异常输出,实现无需阈值调整或模型微调的高精度越狱检测。

链接: https://arxiv.org/abs/2511.11784
作者: Lama Sleem,Jerome Francois,Lujun Li,Nathan Foucher,Niccolo Gentile,Radu State
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Jailbreak attacks designed to bypass safety mechanisms pose a serious threat by prompting LLMs to generate harmful or inappropriate content, despite alignment with ethical guidelines. Crafting universal filtering rules remains difficult due to their inherent dependence on specific contexts. To address these challenges without relying on threshold calibration or model fine-tuning, this work introduces a semantic consistency analysis between successful and unsuccessful responses, demonstrating that a negation-aware scoring approach captures meaningful patterns. Building on this insight, a novel detection framework called NegBLEURT Forest is proposed to evaluate the degree of alignment between outputs elicited by adversarial prompts and expected safe behaviors. It identifies anomalous responses using the Isolation Forest algorithm, enabling reliable jailbreak detection. Experimental results show that the proposed method consistently achieves top-tier performance, ranking first or second in accuracy across diverse models using the crafted dataset, while competing approaches exhibit notable sensitivity to model and data variations.
zh

[AI-185] On the Measure of a Model: From Intelligence to Generality

【速读】:该论文试图解决当前大语言模型(Large Language Models, LLMs)评估中过度依赖抽象智力指标(如ARC、Raven-inspired测试和Blackbird任务)所导致的评价体系失真问题,即这些基准未能稳定预测模型在实际任务(如问答、摘要生成或编码)中的表现,从而可能误导模型优化方向。解决方案的关键在于将评估基础从模糊的“智力”概念转向可测量的“泛化能力”(generality),并指出只有泛化能力具备概念与实证上的稳健性;论文进一步提出,泛化能力本质上是一个多任务学习问题,其评估可直接关联到模型在多样化任务上的性能广度与可靠性,从而为AI能力的持续评估提供更稳定、实用的框架。

链接: https://arxiv.org/abs/2511.11773
作者: Ruchira Dhar,Ninell Oldenburg,Anders Soegaard
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at EurIPS Workshop on “The Science of Benchmarking and Evaluating AI”

点击查看摘要

Abstract:Benchmarks such as ARC, Raven-inspired tests, and the Blackbird Task are widely used to evaluate the intelligence of large language models (LLMs). Yet, the concept of intelligence remains elusive- lacking a stable definition and failing to predict performance on practical tasks such as question answering, summarization, or coding. Optimizing for such benchmarks risks misaligning evaluation with real-world utility. Our perspective is that evaluation should be grounded in generality rather than abstract notions of intelligence. We identify three assumptions that often underpin intelligence-focused evaluation: generality, stability, and realism. Through conceptual and formal analysis, we show that only generality withstands conceptual and empirical scrutiny. Intelligence is not what enables generality; generality is best understood as a multitask learning problem that directly links evaluation to measurable performance breadth and reliability. This perspective reframes how progress in AI should be assessed and proposes generality as a more stable foundation for evaluating capability across diverse and evolving tasks.
zh

[AI-186] Scaling Equitable Reflection Assessment in Education via Large Language Models and Role-Based Feedback Agents AAAI-26

【速读】:该论文试图解决大规模或资源匮乏课程中难以实现公平、高质量形成性反馈的问题。当前,由于教师时间、人力和资源有限,无法对每位学生的反思性作业进行及时、个性化的反馈,导致支持不足的群体反而最需要帮助。解决方案的关键在于构建一个基于理论的多智能体大语言模型(LLM)系统,由五个角色分工明确的代理组成:评估者(Evaluator)、公平监控者(Equity Monitor)、元认知教练(Metacognitive Coach)、聚合器(Aggregator)和反思评审者(Reflexion Reviewer)。该系统首先依据统一量规评分,随后通过公平检查机制识别潜在偏见语言,再嵌入元认知提示以促进学生自我反思,并最终生成不超过120词、具情感共鸣且与教学目标一致的反馈信息,从而在规模化场景下保障反馈的质量与公平性。

链接: https://arxiv.org/abs/2511.11772
作者: Chenyu Zhang,Xiaohang Luo
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Accepted to AAAI-26 AISI Track

点击查看摘要

Abstract:Formative feedback is widely recognized as one of the most effective drivers of student learning, yet it remains difficult to implement equitably at scale. In large or low-resource courses, instructors often lack the time, staffing, and bandwidth required to review and respond to every student reflection, creating gaps in support precisely where learners would benefit most. This paper presents a theory-grounded system that uses five coordinated role-based LLM agents (Evaluator, Equity Monitor, Metacognitive Coach, Aggregator, and Reflexion Reviewer) to score learner reflections with a shared rubric and to generate short, bias-aware, learner-facing comments. The agents first produce structured rubric scores, then check for potentially biased or exclusionary language, add metacognitive prompts that invite students to think about their own thinking, and finally compose a concise feedback message of at most 120 words. The system includes simple fairness checks that compare scoring error across lower and higher scoring learners, enabling instructors to monitor and bound disparities in accuracy. We evaluate the pipeline in a 12-session AI literacy program with adult learners. In this setting, the system produces rubric scores that approach expert-level agreement, and trained graders rate the AI-generated comments as helpful, empathetic, and well aligned with instructional goals. Taken together, these results show that multi-agent LLM systems can deliver equitable, high-quality formative feedback at a scale and speed that would be impossible for human graders alone. More broadly, the work points toward a future where feedback-rich learning becomes feasible for any course size or context, advancing long-standing goals of equity, access, and instructional capacity in education.
zh

[AI-187] Learning to Refine: An Agent ic RL Approach for Iterative SPARQL Query Construction

【速读】:该论文旨在解决知识图谱问答(Knowledge Graph Question Answering, KGQA)中生成复杂、逻辑正确的多跳SPARQL查询的瓶颈问题,尤其是大语言模型(Large Language Models, LLMs)一次性生成查询时脆弱性高、难以可靠交互结构化数据的问题。解决方案的关键在于提出一种新型代理框架(agentic framework),其中LLM通过仅依赖结果驱动的强化学习(GRPO)训练出一个鲁棒的迭代式SPARQL构建策略,无需监督微调即可学会基于实时执行反馈动态调试查询,并逐步修正错误直至获得正确答案。该方法在LC-QuAD 2.0的一个可执行单答案子集上实现了49.7%的准确率,较最强的零样本迭代基线提升17.5个百分点,且进一步分析表明,强化学习驱动的能力结合显式的 deliberative reasoning 步骤(认知支架)显著提升了策略精度。

链接: https://arxiv.org/abs/2511.11770
作者: Floris Vossebeld,Shenghui Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Generating complex, logically-sound SPARQL queries for multi-hop questions remains a critical bottleneck for Knowledge Graph Question Answering, as the brittle nature of one-shot generation by Large Language Models (LLMs) hinders reliable interaction with structured data. Current methods lack the adaptive policies needed to dynamically debug queries based on real-time execution feedback. This paper introduces a novel agentic framework where an LLM learns a resilient policy for the sequential process of iterative SPARQL construction. We show that a compact 3B-parameter model, trained exclusively via outcome-driven Reinforcement Learning (GRPO) without supervised fine-tuning, can learn effective policies for this task, discovering how to systematically recover from execution errors and refine its queries toward a correct answer. On a curated, executable single-answer subset of LC-QuAD 2.0, our agent achieves 49.7% accuracy post-entity-linking, a significant 17.5 percentage point improvement over the strongest iterative zero-shot baseline. Further analysis reveals that while the agent’s capability is driven by RL, its performance is enhanced by an explicit deliberative reasoning step that acts as a cognitive scaffold to improve policy precision. This work presents a generalizable blueprint for teaching agents to master formal, symbolic tools through interaction, bridging the gap between probabilistic LLMs and the structured world of Knowledge Graphs.
zh

[AI-188] Demystify Use Reflect: Preparing students to be informed LLM -users

【速读】:该论文旨在解决当前计算机科学教育中对大型语言模型(Large Language Models, LLMs)缺乏系统性、批判性和实践性整合的问题,以培养学生在AI时代下负责任且有意义地参与和使用生成式AI的能力。解决方案的关键在于:构建一个结构化的课程框架,明确讲授LLM的工作原理,引入当前可用工具并探讨伦理问题,设计促进学生反思个人使用行为及AI辅助编程趋势的活动;同时,在课堂中演示LLM输出的验证方法,引导学生将LLM作为问题解决循环中的一个组成部分,并要求其披露和承认LLM协助的程度,从而实现对学生认知水平与使用行为的双重提升。

链接: https://arxiv.org/abs/2511.11764
作者: Nikitha Donekal Chandrashekar,Sehrish Basir Nizamani,Margaret Ellis,Naren Ramakrishnan
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 2 pages 1 table Submitted to SIGCSE 2026

点击查看摘要

Abstract:We transitioned our post-CS1 course that introduces various subfields of computer science so that it integrates Large Language Models (LLMs) in a structured, critical, and practical manner. It aims to help students develop the skills needed to engage meaningfully and responsibly with AI. The course now includes explicit instruction on how LLMs work, exposure to current tools, ethical issues, and activities that encourage student reflection on personal use of LLMs as well as the larger evolving landscape of AI-assisted programming. In class, we demonstrate the use and verification of LLM outputs, guide students in the use of LLMs as an ingredient in a larger problem-solving loop, and require students to disclose and acknowledge the nature and extent of LLM assistance. Throughout the course, we discuss risks and benefits of LLMs across CS subfields. In our first iteration of the course, we collected and analyzed data from students pre and post surveys. Student understanding of how LLMs work became more technical, and their verification and use of LLMs shifted to be more discerning and collaborative. These strategies can be used in other courses to prepare students for the AI-integrated future.
zh

[AI-189] Can AI Models be Jailbroken to Phish Elderly Victims? An End-to-End Evaluation

【速读】:该论文旨在解决生成式 AI(Generative AI)在实际应用中因安全防护机制失效而被恶意利用,进而对弱势群体(如老年人)实施网络诈骗的问题。其核心解决方案在于构建并验证了一个端到端的攻击链条:从突破大语言模型(Large Language Models, LLMs)的安全护栏以生成钓鱼内容,到将这些内容部署至真实目标人群,并最终成功诱导老年人泄露敏感信息。研究通过系统评估六种前沿LLM在四类攻击场景下的表现,发现多个模型对特定攻击向量近乎完全无防御能力;并通过针对108名老年志愿者的人工验证实验,证实AI生成的钓鱼邮件成功欺骗了11%的参与者。该工作揭示了当前AI安全措施在保护高风险人群方面的严重不足,并指出LLM不仅可生成钓鱼内容,还能跨越语言障碍、开展多轮信任构建对话,从而重塑欺诈行为的经济逻辑。

链接: https://arxiv.org/abs/2511.11759
作者: Fred Heiding,Simon Lermen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:We present an end-to-end demonstration of how attackers can exploit AI safety failures to harm vulnerable populations: from jailbreaking LLMs to generate phishing content, to deploying those messages against real targets, to successfully compromising elderly victims. We systematically evaluated safety guardrails across six frontier LLMs spanning four attack categories, revealing critical failures where several models exhibited near-complete susceptibility to certain attack vectors. In a human validation study with 108 senior volunteers, AI-generated phishing emails successfully compromised 11% of participants. Our work uniquely demonstrates the complete attack pipeline targeting elderly populations, highlighting that current AI safety measures fail to protect those most vulnerable to fraud. Beyond generating phishing content, LLMs enable attackers to overcome language barriers and conduct multi-turn trust-building conversations at scale, fundamentally transforming fraud economics. While some providers report voluntary counter-abuse efforts, we argue these remain insufficient.
zh

[AI-190] Bridging the Skills Gap: A Course Model for Modern Generative AI Education AAAI AAAI-26

【速读】:该论文试图解决的问题是:随着生成式AI工具的普及,教育界对其在学习环境中的应用存在犹豫,导致学生在缺乏正式指导的情况下自行探索这些工具,进而形成“行业需求与高等教育脱节”的现象,尤其在计算机科学领域,尽管顶尖高校普遍教授AI底层机制与框架,却鲜有课程聚焦于现有生成式AI工具的实际应用。解决方案的关键在于开发并实施一门面向计算机科学本科生和研究生的课程,系统讲授生成式AI工具在软件开发中的实际应用,通过混合方法调查验证其有效性,并基于教学实践提供可复制的实施路径,从而提升学生的AI素养与就业竞争力。

链接: https://arxiv.org/abs/2511.11757
作者: Anya Bardach,Hamilton Murrah
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 10 pages, 2 figures, in the 40th Annual AAAI Conference on Artificial Intelligence (AAAI-26) EAAI Symposium

点击查看摘要

Abstract:Research on how the popularization of generative Artificial Intelligence (AI) tools impacts learning environments has led to hesitancy among educators to teach these tools in classrooms, creating two observed disconnects. Generative AI competency is increasingly valued in industry but not in higher education, and students are experimenting with generative AI without formal guidance. The authors argue students across fields must be taught to responsibly and expertly harness the potential of AI tools to ensure job market readiness and positive outcomes. Computer Science trajectories are particularly impacted, and while consistently top ranked U.S. Computer Science departments teach the mechanisms and frameworks underlying AI, few appear to offer courses on applications for existing generative AI tools. A course was developed at a private research university to teach undergraduate and graduate Computer Science students applications for generative AI tools in software development. Two mixed method surveys indicated students overwhelmingly found the course valuable and effective. Co-authored by the instructor and one of the graduate students, this paper explores the context, implementation, and impact of the course through data analysis and reflections from both perspectives. It additionally offers recommendations for replication in and beyond Computer Science departments. This is the extended version of this paper to include technical appendices.
zh

[AI-191] owards autonomous quantum physics research using LLM agents with access to intelligent tools

【速读】:该论文旨在解决科学发现中AI仅能辅助而难以自主生成并执行可操作研究构想的问题,即如何实现从创意生成到实验实施的全流程自动化。其解决方案的关键在于构建一个名为AI-Mandel的大型语言模型(Large Language Model, LLM)代理系统,该系统能够从文献中提炼科学问题,并调用领域特定的AI工具将抽象想法转化为实验室可直接执行的实验设计,从而实现“生成-实施”闭环,为迈向具备人类水平的AI科学家提供原型验证。

链接: https://arxiv.org/abs/2511.11752
作者: Sören Arlt,Xuemei Gu,Mario Krenn
机构: 未知
类目: Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Quantum Physics (quant-ph)
备注: 24 pages, 5 figures

点击查看摘要

Abstract:Artificial intelligence (AI) is used in numerous fields of science, yet the initial research questions and targets are still almost always provided by human researchers. AI-generated creative ideas in science are rare and often vague, so that it remains a human task to execute them. Automating idea generation and implementation in one coherent system would significantly shift the role of humans in the scientific process. Here we present AI-Mandel, an LLM agent that can generate and implement ideas in quantum physics. AI-Mandel formulates ideas from the literature and uses a domain-specific AI tool to turn them into concrete experiment designs that can readily be implemented in laboratories. The generated ideas by AI-Mandel are often scientifically interesting - for two of them we have already written independent scientific follow-up papers. The ideas include new variations of quantum teleportation, primitives of quantum networks in indefinite causal orders, and new concepts of geometric phases based on closed loops of quantum information transfer. AI-Mandel is a prototypical demonstration of an AI physicist that can generate and implement concrete, actionable ideas. Building such a system is not only useful to accelerate science, but it also reveals concrete open challenges on the path to human-level artificial scientists.
zh

[AI-192] IDOL: Meeting Diverse Distribution Shifts with Prior Physics for Tropical Cyclone Multi-Task Estimation

【速读】:该论文旨在解决热带气旋(Tropical Cyclone, TC)实时估计中因环境场分布变化(如地理条件差异和季节性变动)导致的模型泛化能力不足问题。现有方法多依赖多模态特征融合,但忽视了特征表示的内在分布特性,难以应对分布外(out-of-distribution, OOD)场景。其解决方案的关键在于提出一种基于身份分布导向的物理不变性学习框架(Identity Distribution-Oriented Physical Invariant Learning, IDOL),通过引入先验物理知识引导特征空间约束,利用风场模型与TC暗相关性知识建模任务共享和任务特异的身份令牌(identity tokens),从而捕获任务依赖关系与TC固有的物理不变性,实现对TC风速、气压、内核及外核尺度等属性在分布偏移下的鲁棒估计。

链接: https://arxiv.org/abs/2511.11750
作者: Hanting Yan,Pan Mu,Shiqi Zhang,Yuchao Zhu,Jinglin Zhang,Cong Bai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tropical Cyclone (TC) estimation aims to accurately estimate various TC attributes in real time. However, distribution shifts arising from the complex and dynamic nature of TC environmental fields, such as varying geographical conditions and seasonal changes, present significant challenges to reliable estimation. Most existing methods rely on multi-modal fusion for feature extraction but overlook the intrinsic distribution of feature representations, leading to poor generalization under out-of-distribution (OOD) scenarios. To address this, we propose an effective Identity Distribution-Oriented Physical Invariant Learning framework (IDOL), which imposes identity-oriented constraints to regulate the feature space under the guidance of prior physical knowledge, thereby dealing distribution variability with physical invariance. Specifically, the proposed IDOL employs the wind field model and dark correlation knowledge of TC to model task-shared and task-specific identity tokens. These tokens capture task dependencies and intrinsic physical invariances of TC, enabling robust estimation of TC wind speed, pressure, inner-core, and outer-core size under distribution shifts. Extensive experiments conducted on multiple datasets and tasks demonstrate the outperformance of the proposed IDOL, verifying that imposing identity-oriented constraints based on prior physical knowledge can effectively mitigates diverse distribution shifts in TC this http URL is available at this https URL.
zh

[AI-193] Diffusion Models: A Mathematical Introduction

【速读】:该论文旨在系统性地从基础概率原理出发,构建扩散生成模型(diffusion-based generative models)的完整理论框架,解决现有文献中推导过程不够清晰、中间步骤缺失以及符号不一致等问题。其解决方案的关键在于:基于高斯分布的基本性质(如密度函数、二次期望、重参数化、乘积和KL散度),逐步推导出前向加噪过程、精确的离散反向后验分布及其变分下界,并揭示该下界等价于实践中广泛采用的噪声预测目标;进一步通过连续时间形式化,利用连续性方程与Fokker-Planck方程从扩散随机微分方程(SDE)导出概率流常微分方程(ODE),并引入流匹配(flow matching)方法,证明修正流(rectified flows)在时间重参数化下可恢复DDIM采样策略,最终将引导扩散(guided diffusion)统一为后验得分修正(classifier guidance)与无分类器引导(classifier-free guidance)两种机制,实现了理论透明性、算法可实现性和数学严谨性的统一。

链接: https://arxiv.org/abs/2511.11746
作者: Sepehr Maleki,Negar Pourmoazemi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present a concise, self-contained derivation of diffusion-based generative models. Starting from basic properties of Gaussian distributions (densities, quadratic expectations, re-parameterisation, products, and KL divergences), we construct denoising diffusion probabilistic models from first principles. This includes the forward noising process, its closed-form marginals, the exact discrete reverse posterior, and the related variational bound. This bound simplifies to the standard noise-prediction goal used in practice. We then discuss likelihood estimation and accelerated sampling, covering DDIM, adversarially learned reverse dynamics (DDGAN), and multi-scale variants such as nested and latent diffusion, with Stable Diffusion as a canonical example. A continuous-time formulation follows, in which we derive the probability-flow ODE from the diffusion SDE via the continuity and Fokker-Planck equations, introduce flow matching, and show how rectified flows recover DDIM up to a time re-parameterisation. Finally, we treat guided diffusion, interpreting classifier guidance as a posterior score correction and classifier-free guidance as a principled interpolation between conditional and unconditional scores. Throughout, the focus is on transparent algebra, explicit intermediate steps, and consistent notation, so that readers can both follow the theory and implement the corresponding algorithms in practice.
zh

[AI-194] Uncertainty Makes It Stable: Curiosity-Driven Quantized Mixture-of-Experts

【速读】:该论文旨在解决在资源受限设备上部署深度神经网络时面临的两大挑战:一是如何在极端量化(如4-bit)条件下保持模型精度,二是如何确保推理延迟的可预测性。解决方案的关键在于提出一种基于贝叶斯认知不确定性(Bayesian epistemic uncertainty)驱动的量化混合专家(Mixture-of-Experts, MoE)框架,通过动态路由机制选择不同位宽的异构专家(BitNet三值、1–16 bit BitLinear、后训练量化),实现自适应量化与低延迟推理的协同优化。实验表明,该方法在音频分类任务中实现了4-bit量化下99.9%的16-bit精度保留(F1=0.858 vs 0.859),同时压缩比达4倍、能耗降低41%,且通过好奇心驱动路由将MoE推理延迟方差降低82%(从230 ms降至29 ms标准差),显著提升了边缘设备上的推理稳定性。

链接: https://arxiv.org/abs/2511.11743
作者: Sebastián Andrés Cajas Ordóñez,Luis Fernando Torres Torres,Mackenzie J. Meni,Carlos Andrés Duran Paredes,Eric Arazo,Cristian Bosch,Ricardo Simon Carbajo,Yuan Lai,Leo Anthony Celi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deploying deep neural networks on resource-constrained devices faces two critical challenges: maintaining accuracy under aggressive quantization while ensuring predictable inference latency. We present a curiosity-driven quantized Mixture-of-Experts framework that addresses both through Bayesian epistemic uncertainty-based routing across heterogeneous experts (BitNet ternary, 1-16 bit BitLinear, post-training quantization). Evaluated on audio classification benchmarks (ESC-50, Quinn, UrbanSound8K), our 4-bit quantization maintains 99.9 percent of 16-bit accuracy (0.858 vs 0.859 F1) with 4x compression and 41 percent energy savings versus 8-bit. Crucially, curiosity-driven routing reduces MoE latency variance by 82 percent (p = 0.008, Levene’s test) from 230 ms to 29 ms standard deviation, enabling stable inference for battery-constrained devices. Statistical analysis confirms 4-bit/8-bit achieve practical equivalence with full precision (p 0.05), while MoE architectures introduce 11 percent latency overhead (p 0.001) without accuracy gains. At scale, deployment emissions dominate training by 10000x for models serving more than 1,000 inferences, making inference efficiency critical. Our information-theoretic routing demonstrates that adaptive quantization yields accurate (0.858 F1, 1.2M params), energy-efficient (3.87 F1/mJ), and predictable edge models, with simple 4-bit quantized architectures outperforming complex MoE for most deployments.
zh

[AI-195] ExpertAD: Enhancing Autonomous Driving Systems with Mixture of Experts AAAI2026 AAAI

【速读】:该论文旨在解决端到端自动驾驶系统(ADS)在复杂驾驶场景中面临的三大挑战:一是语义信息丰富但存在模糊或噪声,影响决策可靠性;二是多任务间干扰导致规划效率低下;三是推理延迟过长,增加不安全驾驶行为风险。解决方案的关键在于提出ExpertAD框架,其核心创新包括两个模块:一是感知适配器(Perception Adapter, PA),用于增强任务关键特征以实现情境相关的场景理解;二是稀疏专家混合模型(Mixture of Sparse Experts, MoSE),通过最小化多任务间的干扰来提升预测精度与规划效率。实验表明,该方案可降低平均碰撞率最多达20%,推理延迟减少25%,并展现出对罕见场景和未见城市环境的强泛化能力。

链接: https://arxiv.org/abs/2511.11740
作者: Haowen Jiang,Xinyu Huang,You Lu,Dingji Wang,Yuheng Cao,Chaofeng Sha,Bihuan Chen,Keyu Chen,Xin Peng
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: The paper has been accepted by the Fortieth AAAI Conference on Artificial Intelligence. AAAI 2026

点击查看摘要

Abstract:Recent advancements in end-to-end autonomous driving systems (ADSs) underscore their potential for perception and planning capabilities. However, challenges remain. Complex driving scenarios contain rich semantic information, yet ambiguous or noisy semantics can compromise decision reliability, while interference between multiple driving tasks may hinder optimal planning. Furthermore, prolonged inference latency slows decision-making, increasing the risk of unsafe driving behaviors. To address these challenges, we propose ExpertAD, a novel framework that enhances the performance of ADS with Mixture of Experts (MoE) architecture. We introduce a Perception Adapter (PA) to amplify task-critical features, ensuring contextually relevant scene understanding, and a Mixture of Sparse Experts (MoSE) to minimize task interference during prediction, allowing for effective and efficient planning. Our experiments show that ExpertAD reduces average collision rates by up to 20% and inference latency by 25% compared to prior methods. We further evaluate its multi-skill planning capabilities in rare scenarios (e.g., accidents, yielding to emergency vehicles) and demonstrate strong generalization to unseen urban environments. Additionally, we present a case study that illustrates its decision-making process in complex driving scenarios.
zh

[AI-196] DK-Root: A Joint Data-and-Knowledge-Driven Framework for Root Cause Analysis of QoE Degradations in Mobile Networks

【速读】:该论文旨在解决移动网络中用户体验质量(Quality of Experience, QoE)下降的根因分析难题,其核心挑战在于底层性能指标(Key Performance Indicators, KPIs)之间复杂的跨层交互关系以及高质量专家标注数据的稀缺性。传统基于规则的标签虽可大规模生成但噪声大、粒度粗,限制了纯数据驱动方法的准确性。解决方案的关键在于提出一种数据与知识协同驱动的框架DK-Root:首先利用大量规则标签通过对比表示学习预训练编码器,并结合监督对比目标显式去噪;其次引入类条件扩散模型生成保留根因语义的KPI序列,通过控制反向扩散步数实现弱增强与强增强,提升类内紧凑性和类间可分性;最后在少量专家验证标签上联合微调编码器与轻量分类器,精化决策边界。实验证明该方法在真实运营商级数据集上达到当前最优性能,且消融实验验证了条件扩散增强和预训练-微调设计的有效性。

链接: https://arxiv.org/abs/2511.11737
作者: Qizhe Li,Haolong Chen,Jiansheng Li,Shuqi Chai,Xuan Li,Yuzhou Hou,Xinhua Shao,Fangfang Li,Kaifeng Han,Guangxu Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: 13 pages, submitted for possible publication

点击查看摘要

Abstract:Diagnosing the root causes of Quality of Experience (QoE) degradations in operational mobile networks is challenging due to complex cross-layer interactions among kernel performance indicators (KPIs) and the scarcity of reliable expert annotations. Although rule-based heuristics can generate labels at scale, they are noisy and coarse-grained, limiting the accuracy of purely data-driven approaches. To address this, we propose DK-Root, a joint data-and-knowledge-driven framework that unifies scalable weak supervision with precise expert guidance for robust root-cause analysis. DK-Root first pretrains an encoder via contrastive representation learning using abundant rule-based labels while explicitly denoising their noise through a supervised contrastive objective. To supply task-faithful data augmentation, we introduce a class-conditional diffusion model that generates KPIs sequences preserving root-cause semantics, and by controlling reverse diffusion steps, it produces weak and strong augmentations that improve intra-class compactness and inter-class separability. Finally, the encoder and the lightweight classifier are jointly fine-tuned with scarce expert-verified labels to sharpen decision boundaries. Extensive experiments on a real-world, operator-grade dataset demonstrate state-of-the-art accuracy, with DK-Root surpassing traditional ML and recent semi-supervised time-series methods. Ablations confirm the necessity of the conditional diffusion augmentation and the pretrain-finetune design, validating both representation quality and classification gains.
zh

[AI-197] Physics-Informed Neural ODEs with Scale-Aware Residuals for Learning Stiff Biophysical Dynamics

【速读】:该论文旨在解决神经微分方程(Neural Differential Equations, Neural ODEs)在建模刚性生物物理系统时预测不稳定的问题,尤其是在长期外推场景下难以保持振荡频率和振幅准确性。其解决方案的关键在于提出物理信息引导的带尺度感知残差的神经微分方程(Physics-Informed Neural ODEs with Scale-Aware Residuals, PI-NODE-SR),通过结合低阶显式求解器(Heun方法)与残差归一化机制,有效平衡不同时间尺度状态变量之间的贡献,从而在有限迭代预算内稳定训练过程,并避免对计算昂贵的隐式求解器的依赖。该方法在Hodgkin-Huxley模型上仅需单次振荡数据即可实现超过100毫秒的准确外推,同时恢复了如门控变量的亚阈值曲率等高阶特征,表明神经修正可补偿数值扩散效应。

链接: https://arxiv.org/abs/2511.11734
作者: Kamalpreet Singh Kainth,Prathamesh Dinesh Joshi,Raj Abhijit Dandekar,Rajat Dandekar,Sreedat Panat
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Neural differential equations offer a powerful framework for modeling continuous-time dynamics, but forecasting stiff biophysical systems remains unreliable. Standard Neural ODEs and physics informed variants often require orders of magnitude more iterations, and even then may converge to suboptimal solutions that fail to preserve oscillatory frequency or amplitude. We introduce PhysicsInformed Neural ODEs with with Scale-Aware Residuals (PI-NODE-SR), a framework that combines a low-order explicit solver (Heun method) residual normalisation to balance contributions between state variables evolving on disparate timescales. This combination stabilises training under realistic iteration budgets and avoids reliance on computationally expensive implicit solvers. On the Hodgkin-Huxley equations, PI-NODE-SR learns from a single oscillation simulated with a stiff solver (Rodas5P) and extrapolates beyond 100 ms, capturing both oscillation frequency and near-correct amplitudes. Remarkably, end-to-end learning of the vector field enables PI-NODE-SR to recover morphological features such as sharp subthreshold curvature in gating variables that are typically reserved for higher-order solvers, suggesting that neural correction can offset numerical diffusion. While performance remains sensitive to initialisation, PI-NODE-SR consistently reduces long-horizon errors relative to baseline Neural-ODEs and PINNs, offering a principled route to stable and efficient learning of stiff biological dynamics.
zh

[AI-198] Speculative Decoding in Decentralized LLM Inference: Turning Communication Latency into Computation Throughput ICML2025

【速读】:该论文旨在解决大规模语言模型(Large Language Model, LLM)在去中心化推理场景中因网络延迟主导计算开销而导致的效率瓶颈问题。传统推测解码(Speculative Decoding)虽在集中式系统中有效,但在分布式环境中难以发挥优势。其解决方案的关键在于提出去中心化推测解码(Decentralized Speculative Decoding, DSD),通过将通信延迟转化为有用计算:利用分布式节点并行验证多个候选token,并引入基于token语义重要性的自适应验证策略,动态调整接受阈值,在不重新训练模型或改变架构的前提下实现端到端加速15%–20%。理论分析表明,DSD可降低约(N−1)t₁(k−1)/k的跨节点通信成本,实验结果证明其在HumanEval和GSM8K基准上分别达到2.56倍和2.59倍的速度提升,显著优于Eagle3基线,同时保持准确率不变。

链接: https://arxiv.org/abs/2511.11733
作者: Jingwei Song,Wanyi Chen,Xinyuan Song, Max,Chris Tong,Gufeng Chen,Tianyi Zhao,Eric Yang,Bill Shi,Lynn Ai
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 6 pages, 2 figures, 2 tables. Uses ICML 2025 style

点击查看摘要

Abstract:Speculative decoding accelerates large language model (LLM) inference by using a lightweight draft model to propose tokens that are later verified by a stronger target model. While effective in centralized systems, its behavior in decentralized settings, where network latency often dominates compute, remains under-characterized. We present Decentralized Speculative Decoding (DSD), a plug-and-play framework for decentralized inference that turns communication delay into useful computation by verifying multiple candidate tokens in parallel across distributed nodes. We further introduce an adaptive speculative verification strategy that adjusts acceptance thresholds by token-level semantic importance, delivering an additional 15% to 20% end-to-end speedup without retraining. In theory, DSD reduces cross-node communication cost by approximately (N-1)t1(k-1)/k, where t1 is per-link latency and k is the average number of tokens accepted per round. In practice, DSD achieves up to 2.56x speedup on HumanEval and 2.59x on GSM8K, surpassing the Eagle3 baseline while preserving accuracy. These results show that adapting speculative decoding for decentralized execution provides a system-level optimization that converts network stalls into throughput, enabling faster distributed LLM inference with no model retraining or architectural changes.
zh

[AI-199] A Meta-Heuristic Load Balancer for Cloud Computing Systems

【速读】:该论文旨在解决云计算系统中服务分配问题,即在不导致节点过载的前提下,实现资源的最优配置以维持系统稳定性并最小化成本。其解决方案的关键在于提出了一种抽象的云资源利用率模型,该模型综合考虑了多种类型资源及服务迁移成本;同时设计了一种新型遗传算法(Genetic Algorithm),通过引入其他元启发式算法的输出结果作为初始种群,从而提升全局搜索能力和优化效率。

链接: https://arxiv.org/abs/2511.11721
作者: Leszek Sliwko,Vladimir Getov
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents a strategy to allocate services on a Cloud system without overloading nodes and maintaining the system stability with minimum cost. We specify an abstract model of cloud resources utilization, including multiple types of resources as well as considerations for the service migration costs. A prototype meta-heuristic load balancer is demonstrated and experimental results are presented and discussed. We also propose a novel genetic algorithm, where population is seeded with the outputs of other meta-heuristic algorithms.
zh

[AI-200] ECCENTRIC: Edge-Cloud Collaboration Framework for Distributed Inference Using Knowledge Adaptation

【速读】:该论文旨在解决边缘计算(Edge Computing)与云计算(Cloud Computing)协同推理系统中面临的性能、计算和通信成本之间的权衡问题。随着边缘AI的广泛应用,受限于边缘设备的计算资源,系统通常依赖云端进行模型推理,但随着接入设备数量的增长,计算与通信开销显著上升,导致效率下降。解决方案的关键在于提出一种名为Eccentric的新框架,该框架通过从边缘模型向云端模型迁移知识(Knowledge Distillation),在保持最优性能的同时有效降低推理过程中的计算和通信成本。Eccentric本质上是一种适用于边缘-云推理系统的新型压缩方法,能够实现多目标优化下的高效协同推理。

链接: https://arxiv.org/abs/2511.11719
作者: Mohammad Mahdi Kamani,Zhongwei Cheng,Lin Chen
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The massive growth in the utilization of edge AI has made the applications of machine learning models ubiquitous in different domains. Despite the computation and communication efficiency of these systems, due to limited computation resources on edge devices, relying on more computationally rich systems on the cloud side is inevitable in most cases. Cloud inference systems can achieve the best performance while the computation and communication cost is dramatically increasing by the expansion of a number of edge devices relying on these systems. Hence, there is a trade-off between the computation, communication, and performance of these systems. In this paper, we propose a novel framework, dubbed as Eccentric that learns models with different levels of trade-offs between these conflicting objectives. This framework, based on an adaptation of knowledge from the edge model to the cloud one, reduces the computation and communication costs of the system during inference while achieving the best performance possible. The Eccentric framework can be considered as a new form of compression method suited for edge-cloud inference systems to reduce both computation and communication costs. Empirical studies on classification and object detection tasks corroborate the efficacy of this framework.
zh

[AI-201] Enhancing Reinforcement Learning in 3D Environments through Semantic Segmentation: A Case Study in ViZDoom

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在高维感官输入的三维环境中的两大挑战:一是记忆缓冲区导致的高内存消耗,二是部分可观测马尔可夫决策过程(Partially Observable Markov Decision Processes, POMDPs)带来的学习复杂性。解决方案的关键在于提出两种新颖的输入表示方法:仅使用语义分割(Semantic Segmentation-only, SS-only)和结合RGB图像与语义分割(RGB+SS),二者均基于对RGB图像进行语义分割处理。实验表明,SS-only能显著降低记忆缓冲区的内存占用(最低减少66.6%,配合无损压缩技术如游程编码时最高可达98.6%),而RGB+SS则通过引入额外的语义信息大幅提升RL代理性能;同时,研究还引入基于密度的热力图可视化方法用于分析代理运动模式并评估其数据收集适配性。

链接: https://arxiv.org/abs/2511.11703
作者: Hugo Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Master’s Thesis at the University of Edinburgh (2024)

点击查看摘要

Abstract:Reinforcement learning (RL) in 3D environments with high-dimensional sensory input poses two major challenges: (1) the high memory consumption induced by memory buffers required to stabilise learning, and (2) the complexity of learning in partially observable Markov Decision Processes (POMDPs). This project addresses these challenges by proposing two novel input representations: SS-only and RGB+SS, both employing semantic segmentation on RGB colour images. Experiments were conducted in deathmatches of ViZDoom, utilizing perfect segmentation results for controlled evaluation. Our results showed that SS-only was able to reduce the memory consumption of memory buffers by at least 66.6%, and up to 98.6% when a vectorisable lossless compression technique with minimal overhead such as run-length encoding is applied. Meanwhile, RGB+SS significantly enhances RL agents’ performance with the additional semantic information provided. Furthermore, we explored density-based heatmapping as a tool to visualise RL agents’ movement patterns and evaluate their suitability for data collection. A brief comparison with a previous approach highlights how our method overcame common pitfalls in applying semantic segmentation in 3D environments like ViZDoom.
zh

[AI-202] Beyond saliency: enhancing explanation of speech emotion recognition with expert-referenced acoustic cues

【速读】:该论文旨在解决当前基于显著性(saliency)的可解释人工智能(Explainable AI, XAI)方法在语音情感识别(Speech Emotion Recognition, SER)中缺乏对情感相关声学线索(acoustic markers of emotion)的忠实映射问题,导致解释结果难以与专家知识对应、可信度不足。其解决方案的关键在于提出一个新框架,通过量化显著区域内的声学线索强度,明确“被突出的内容是什么”,并将其与理论驱动且专家参考的声学特征建立关联,从而实现从“what”到“why”的因果连接,提升SER模型解释的可理解性和合理性。

链接: https://arxiv.org/abs/2511.11691
作者: Seham Nasr,Zhao Ren,David Johnson
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: 5 pages, 2 figures

点击查看摘要

Abstract:Explainable AI (XAI) for Speech Emotion Recognition (SER) is critical for building transparent, trustworthy models. Current saliency-based methods, adapted from vision, highlight spectrogram regions but fail to show whether these regions correspond to meaningful acoustic markers of emotion, limiting faithfulness and interpretability. We propose a framework that overcomes these limitations by quantifying the magnitudes of cues within salient regions. This clarifies “what” is highlighted and connects it to “why” it matters, linking saliency to expert-referenced acoustic cues of speech emotions. Experiments on benchmark SER datasets show that our approach improves explanation quality by explicitly linking salient regions to theory-driven speech emotions expert-referenced acoustics. Compared to standard saliency methods, it provides more understandable and plausible explanations of SER models, offering a foundational step towards trustworthy speech-based affective computing.
zh

[AI-203] Beyond One-Way Pruning: Bidirectional Pruning-Regrowth for Extreme Accuracy-Sparsity Tradeoff

【速读】:该论文旨在解决模型剪枝(model pruning)在高稀疏度(sparsity)条件下导致的性能急剧下降问题,这一现象限制了模型压缩比的提升,并使得模型无法满足特定硬件平台对尺寸的严格约束。解决方案的关键在于提出一种双向剪枝-再生策略(bidirectional pruning-regrowth strategy),即从满足硬件约束的极低稀疏度网络出发,通过选择性地重新生成关键连接来恢复模型性能,从而有效缓解高稀疏度下常见的精度骤降问题。

链接: https://arxiv.org/abs/2511.11675
作者: Junchen Liu,Yi Sheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As a widely adopted model compression technique, model pruning has demonstrated strong effectiveness across various architectures. However, we observe that when sparsity exceeds a certain threshold, both iterative and one-shot pruning methods lead to a steep decline in model performance. This rapid degradation limits the achievable compression ratio and prevents models from meeting the stringent size constraints required by certain hardware platforms, rendering them inoperable. To overcome this limitation, we propose a bidirectional pruning-regrowth strategy. Starting from an extremely compressed network that satisfies hardware constraints, the method selectively regenerates critical connections to recover lost performance, effectively mitigating the sharp accuracy drop commonly observed under high sparsity conditions.
zh

[AI-204] Do traveling waves make good positional encodings?

【速读】:该论文旨在解决Transformer模型中因自注意力机制固有的排列不变性(permutation invariance)而导致的位置信息缺失问题。传统方法使用绝对正弦嵌入或学习得到的位置向量,而近期研究则倾向于采用相对位置编码以更好地捕捉平移等变性(translation equivariance)。其解决方案的关键在于提出一种名为RollPE的新颖位置编码机制,该机制基于行波(traveling waves)思想,通过在自注意力中的查询(query)和键(key)张量上施加循环滚动操作(circular roll operation),诱导出位置间的相位差,从而使得注意力计算依赖于相对位置差异而非绝对索引。这一方法显著优于传统绝对位置编码,并与RoPE(Rotary Position Embedding)性能相当,且理论分析表明RollPE可等价于RoPE的特定配置,为理解RoPE提供了新视角。

链接: https://arxiv.org/abs/2511.11668
作者: Chase van de Geijn,Ayush Paliwal,Timo Lüddecke,Alexander S. Ecker
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transformers rely on positional encoding to compensate for the inherent permutation invariance of self-attention. Traditional approaches use absolute sinusoidal embeddings or learned positional vectors, while more recent methods emphasize relative encodings to better capture translation equivariances. In this work, we propose RollPE, a novel positional encoding mechanism based on traveling waves, implemented by applying a circular roll operation to the query and key tensors in self-attention. This operation induces a relative shift in phase across positions, allowing the model to compute attention as a function of positional differences rather than absolute indices. We show this simple method significantly outperforms traditional absolute positional embeddings and is comparable to RoPE. We derive a continuous case of RollPE which implicitly imposes a topographic structure on the query and key space. We further derive a mathematical equivalence of RollPE to a particular configuration of RoPE. Viewing RollPE through the lens of traveling waves may allow us to simplify RoPE and relate it to processes of information flow in the brain.
zh

[AI-205] Beyond Superficial Forgetting: Thorough Unlearning through Knowledge Density Estimation and Block Re-insertion

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)中难以彻底消除有害知识的问题,现有方法常因残留有害知识而无法满足隐私保护、合规性和伦理要求。其解决方案的关键在于提出一种基于知识密度引导的块重插入机制(Knowledge Density-Guided Unlearning via Blocks Reinsertion, KUnBR),通过知识密度估计精准定位包含丰富有害知识的模型层,并采用层重插入策略将这些有害知识丰富的层重新插入原模型中,从而绕过覆盖层导致的梯度阻塞问题,确保在去遗忘过程中有效传播梯度,实现更彻底的知识移除同时保持模型性能。

链接: https://arxiv.org/abs/2511.11667
作者: Feng Guo,Yuntao Wen,Shen Gao,Junshuo Zhang,Shuo Shang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Machine unlearning, which selectively removes harmful knowledge from a pre-trained model without retraining from scratch, is crucial for addressing privacy, regulatory compliance, and ethical concerns in Large Language Models (LLMs). However, existing unlearning methods often struggle to thoroughly remove harmful knowledge, leaving residual harmful knowledge that can be easily recovered. To address these limitations, we propose Knowledge Density-Guided Unlearning via Blocks Reinsertion (KUnBR), a novel approach that first identifies layers with rich harmful knowledge and then thoroughly eliminates the harmful knowledge via re-insertion strategy. Our method introduces knowledge density estimation to quantify and locate layers containing the most harmful knowledge, enabling precise unlearning. Additionally, we design a layer re-insertion strategy that extracts and re-inserts harmful knowledge-rich layers into the original LLM, bypassing gradient obstruction caused by cover layers and ensuring effective gradient propagation during unlearning. Extensive experiments conducted on several unlearning and general capability benchmarks demonstrate that KUnBR achieves state-of-the-art forgetting performance while maintaining model utility.
zh

[AI-206] Clifford Algebraic Rotor Embeddings : Maybe embeddings should start to CARE

【速读】:该论文旨在解决现有旋转位置编码(Rotary Positional Embeddings, RoPE)方法在扩展至高维输入时面临的非交换性问题,尤其是Spherical RoPE因球面旋转固有的非交换性导致旋转顺序选择模糊,从而丧失了RoPE原有的平移等变性(shift-equivariance)特性。解决方案的关键在于引入四元数(quaternion)作为3D旋转的参数化工具,提出Quaternion Rotary Embeddings (QuatRo),利用四元数对三维空间旋转的紧凑且无歧义的表示能力,统一了Mixed RoPE与Spherical RoPE作为其特例;进一步通过几何代数(geometric algebra)将QuatRo推广为Clifford Algebraic Rotary Embeddings (CARE),将旋转嵌入从四元数扩展至Clifford旋量(rotor)作用于多向量(multivector),实现了两个核心拓展:一是支持任意维度的位置编码,二是允许在多个阶(grade)的多向量中编码位置信息,从而显著提升模型对高维结构数据的建模能力。

链接: https://arxiv.org/abs/2511.11665
作者: Sameeksha Sriram,Ayush Paliwal,Alexander S. Ecker,Chase van de Geijn
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Rotary Positional Embeddings (RoPE) have demonstrated exceptional performance as a positional encoding method, consistently outperforming their baselines. While recent work has sought to extend RoPE to higher-dimensional inputs, many such extensions are non-commutative, thereby forfeiting RoPE’s shift-equivariance property. Spherical RoPE is one such non-commutative variant, motivated by the idea of rotating embedding vectors on spheres rather than circles. However, spherical rotations are inherently non-commutative, making the choice of rotation sequence ambiguous. In this work, we explore a quaternion-based approach – Quaternion Rotary Embeddings (QuatRo) – in place of Euler angles, leveraging quaternions’ ability to represent 3D rotations to parameterize the axes of rotation. We show Mixed RoPE and Spherical RoPE to be special cases of QuatRo. Further, we propose a generalization of QuatRo to Clifford Algebraic Rotary Embeddings (CARE) using geometric algebra. Viewing quaternions as the even subalgebra of Cl(3,0,0), we extend the notion of rotary embeddings from quaternions to Clifford rotors acting on multivectors. This formulation enables two key generalizations: (1) extending rotary embeddings to arbitrary dimensions, and (2) encoding positional information in multivectors of multiple grades, not just vectors. We present preliminary experiments comparing spherical, quaternion, and Clifford-based rotary embeddings.
zh

[AI-207] SpecQuant: Spectral Decomposition and Adaptive Truncation for Ultra-Low-Bit LLM s Quantization AAAI2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在端设备上高效部署时面临的极端压缩难题,特别是针对权重和激活值的超低比特量化(ultra-low-bit quantization)所导致的精度损失问题。其解决方案的关键在于提出了一种两阶段框架SpecQuant:第一阶段通过将激活异常值平滑并转移至权重矩阵中以简化后续量化过程;第二阶段采用基于通道的低频傅里叶截断策略,在保留主要信号能量的同时抑制高频噪声,从而提升量化鲁棒性。该方法利用权重能量主要集中于低频分量的特性,结合轻量级推理时自适应调整截断阈值的模块,实现了在LLaMA-3 8B模型上4-bit全量化下仅1.5%零样本准确率差距,同时带来2倍加速与3倍内存降低。

链接: https://arxiv.org/abs/2511.11663
作者: Zhixiong Zhao,Fangxin Liu,Junjie Wang,Chenyang Guan,Zongwu Wang,Li Jiang,Haibing Guan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at AAAI 2026

点击查看摘要

Abstract:The emergence of accurate open large language models (LLMs) has sparked a push for advanced quantization techniques to enable efficient deployment on end-user devices. In this paper, we revisit the challenge of extreme LLM compression – targeting ultra-low-bit quantization for both activations and weights – from a Fourier frequency domain perspective. We propose SpecQuant, a two-stage framework that tackles activation outliers and cross-channel variance. In the first stage, activation outliers are smoothed and transferred into the weight matrix to simplify downstream quantization. In the second stage, we apply channel-wise low-frequency Fourier truncation to suppress high-frequency components while preserving essential signal energy, improving quantization robustness. Our method builds on the principle that most of the weight energy is concentrated in low-frequency components, which can be retained with minimal impact on model accuracy. To enable runtime adaptability, we introduce a lightweight truncation module during inference that adjusts truncation thresholds based on channel characteristics. On LLaMA-3 8B, SpecQuant achieves 4-bit quantization for both weights and activations, narrowing the zero-shot accuracy gap to only 1.5% compared to full precision, while delivering 2 times faster inference and 3times lower memory usage.
zh

[AI-208] On the Probabilistic Learnability of Compact Neural Network Preimage Bounds AAAI

【速读】:该论文旨在解决神经网络中预像(preimage)边界计算的可扩展性问题,即在高维空间中寻找满足特定输出属性的输入区域时,传统可证明方法因#P-hard问题本质而难以扩展。其解决方案的关键在于提出一种基于随机森林的属性验证器(Random Forest Property Verifier, RF-ProVe),该方法利用随机决策树集成生成候选输入区域,并通过主动重采样进行精炼,从而在保证统计置信度和误差边界的前提下,实现对复杂高维模式的有效捕捉与紧凑预像近似。

链接: https://arxiv.org/abs/2511.11656
作者: Luca Marzari,Manuele Bicego,Ferdinando Cicalese,Alessandro Farinelli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at the 40th Annual AAAI Conference on Artificial Intelligence 2026

点击查看摘要

Abstract:Although recent provable methods have been developed to compute preimage bounds for neural networks, their scalability is fundamentally limited by the #P-hardness of the problem. In this work, we adopt a novel probabilistic perspective, aiming to deliver solutions with high-confidence guarantees and bounded error. To this end, we investigate the potential of bootstrap-based and randomized approaches that are capable of capturing complex patterns in high-dimensional spaces, including input regions where a given output property holds. In detail, we introduce \textbfR andom \textbfF orest \textbfPro perty \textbfVe rifier ( \textttRF-ProVe ), a method that exploits an ensemble of randomized decision trees to generate candidate input regions satisfying a desired output property and refines them through active resampling. Our theoretical derivations offer formal statistical guarantees on region purity and global coverage, providing a practical, scalable solution for computing compact preimage approximations in cases where exact solvers fail to scale.
zh

[AI-209] Convergence of Multiagent Learning Systems for Traffic control

【速读】:该论文旨在解决多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)在交通信号控制(Traffic Signal Control, TSC)任务中因独立学习者(independent learners)导致的收敛性与稳定性问题。现有方法虽在实践中显示有效性,但缺乏理论保障。解决方案的关键在于利用随机逼近(stochastic approximation)方法,对基于Q-learning的多智能体算法进行形式化分析,证明其在特定条件下可收敛,从而将单智能体异步值迭代(asynchronous value iteration)的收敛性结果扩展至多智能体协作场景,为MARL应用于TSC提供了坚实的理论基础。

链接: https://arxiv.org/abs/2511.11654
作者: Sayambhu Sen,Shalabh Bhatnagar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 14 pages 2 figures

点击查看摘要

Abstract:Rapid urbanization in cities like Bangalore has led to severe traffic congestion, making efficient Traffic Signal Control (TSC) essential. Multi-Agent Reinforcement Learning (MARL), often modeling each traffic signal as an independent agent using Q-learning, has emerged as a promising strategy to reduce average commuter delays. While prior work Prashant L A et. al has empirically demonstrated the effectiveness of this approach, a rigorous theoretical analysis of its stability and convergence properties in the context of traffic control has not been explored. This paper bridges that gap by focusing squarely on the theoretical basis of this multi-agent algorithm. We investigate the convergence problem inherent in using independent learners for the cooperative TSC task. Utilizing stochastic approximation methods, we formally analyze the learning dynamics. The primary contribution of this work is the proof that the specific multi-agent reinforcement learning algorithm for traffic control is proven to converge under the given conditions extending it from single agent convergence proofs for asynchronous value iteration.
zh

[AI-210] GroupRank: A Groupwise Reranking Paradigm Driven by Reinforcement Learning

【速读】:该论文旨在解决现有重排序(reranking)范式在检索增强生成(Retrieval-Augmented Generation, RAG)系统中面临的理论与实践困境:点对式(Pointwise)方法虽灵活但易陷入“排名短视陷阱”(Ranking Myopia Trap),忽略文档间的相对重要性;而列表式(Listwise)方法虽能感知全局排序上下文,却存在固有的“列表刚性”(List Rigidity),导致在处理大规模候选集时出现可扩展性和灵活性严重受限的问题。解决方案的关键在于提出一种新的分组式(Groupwise)重排序范式,该范式将查询与一组候选文档联合输入模型,通过组内比较为每篇文档分配独立的相关性得分,从而在保留点对式灵活性的同时具备列表式的对比能力。此外,作者采用GRPO进行训练,并设计融合排名指标与分布奖励的异构奖励函数,以对齐不同分组间的得分分布,进一步提升模型性能。

链接: https://arxiv.org/abs/2511.11653
作者: Duolin Sun,Meixiu Long,Dan Yang,Yihan Jiao,Zhehao Tan,Jie Feng,Junjie Wang,Yue Shen,Peng Wei,Jian Wang,Jinjie Gu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models have shown strong potential as rerankers to enhance the overall performance of RAG systems. However, existing reranking paradigms are constrained by a core theoretical and practical dilemma: Pointwise methods, while simple and highly flexible, evaluate documents independently, making them prone to the Ranking Myopia Trap, overlooking the relative importance between documents. In contrast, Listwise methods can perceive the global ranking context, but suffer from inherent List Rigidity, leading to severe scalability and flexibility issues when handling large candidate sets. To address these challenges, we propose Groupwise, a novel reranking paradigm. In this approach, the query and a group of candidate documents are jointly fed into the model, which performs within-group comparisons to assign individual relevance scores to each document. This design retains the flexibility of Pointwise methods while enabling the comparative capability of Listwise methods. We further adopt GRPO for model training, equipped with a heterogeneous reward function that integrates ranking metrics with a distributional reward aimed at aligning score distributions across groups. To overcome the bottleneck caused by the scarcity of high quality labeled data, we further propose an innovative pipeline for synthesizing high quality retrieval and ranking data. The resulting data can be leveraged not only for training the reranker but also for training the retriever. Extensive experiments validate the effectiveness of our approach. On two reasoning intensive retrieval benchmarks, BRIGHT and R2MED.
zh

[AI-211] Incomplete Depression Feature Selection with Missing EEG Channels

【速读】:该论文旨在解决脑电图(EEG)在抑郁症分析中因电极脱落或噪声干扰导致的通道缺失问题,以及特征冗余和无关信息对模型性能的影响。其解决方案的关键在于提出一种名为“不完整抑郁特征选择方法(Incomplete Depression Feature Selection with Missing EEG Channels, IDFS-MEC)”的新框架:该方法通过引入缺失通道指示信息并结合自适应通道加权学习机制嵌入正交回归模型,以降低不完整通道对建模的负面影响;同时采用全局冗余最小化学习策略,有效减少所选特征子集间的冗余信息,从而提升抑郁检测的鲁棒性和准确性。

链接: https://arxiv.org/abs/2511.11651
作者: Zhijian Gong,Wenjia Dong,Xueyuan Xu,Fulin Wei,Chunyu Liu,Li Zhuo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As a critical mental health disorder, depression has severe effects on both human physical and mental well-being. Recent developments in EEG-based depression analysis have shown promise in improving depression detection accuracies. However, EEG features often contain redundant, irrelevant, and noisy information. Additionally, real-world EEG data acquisition frequently faces challenges, such as data loss from electrode detachment and heavy noise interference. To tackle the challenges, we propose a novel feature selection approach for robust depression analysis, called Incomplete Depression Feature Selection with Missing EEG Channels (IDFS-MEC). IDFS-MEC integrates missing-channel indicator information and adaptive channel weighting learning into orthogonal regression to lessen the effects of incomplete channels on model construction, and then utilizes global redundancy minimization learning to reduce redundant information among selected feature subsets. Extensive experiments conducted on MODMA and PRED-d003 datasets reveal that the EEG feature subsets chosen by IDFS-MEC have superior performance than 10 popular feature selection methods among 3-, 64-, and 128-channel settings.
zh

[AI-212] Enhanced Water Leak Detection with Convolutional Neural Networks and One-Class Support Vector Machine

【速读】:该论文旨在解决供水管网(Water Distribution Networks, WDNs)中因泄漏导致的水资源浪费问题,提出了一种基于水压数据的新型泄漏检测方法。其解决方案的关键在于:仅利用管网拓扑结构和无泄漏状态下的压力数据,通过一个特征提取器与一类支持向量机(One-Class Support Vector Machine, One-Class SVM)构建完全数据驱动的异常检测模型,将泄漏识别为与正常工况偏离的异常事件,从而实现高精度的泄漏检测与定位。

链接: https://arxiv.org/abs/2511.11650
作者: Daniele Ugo Leonzio,Paolo Bestagini,Marco Marcon,Stefano Tubaro
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Water is a critical resource that must be managed efficiently. However, a substantial amount of water is lost each year due to leaks in Water Distribution Networks (WDNs). This underscores the need for reliable and effective leak detection and localization systems. In recent years, various solutions have been proposed, with data-driven approaches gaining increasing attention due to their superior performance. In this paper, we propose a new method for leak detection. The method is based on water pressure measurements acquired at a series of nodes of a WDN. Our technique is a fully data-driven solution that makes only use of the knowledge of the WDN topology, and a series of pressure data acquisitions obtained in absence of leaks. The proposed solution is based on an feature extractor and a one-class Support Vector Machines (SVM) trained on no-leak data, so that leaks are detected as anomalies. The results achieved on a simulate dataset using the Modena WDN demonstrate that the proposed solution outperforms recent methods for leak detection.
zh

[AI-213] Lightweight Time Series Data Valuation on Time Series Foundation Models via In-Context Finetuning

【速读】:该论文旨在解决时间序列基础模型(Time Series Foundation Models, TSFMs)在训练过程中对数据质量高度敏感的问题,即如何高效且准确地评估时间序列数据的价值,以提升TSFM的性能。传统数据估值方法如影响函数(influence functions)因计算复杂度高、难以扩展至大规模模型且无法有效保留时间依赖性而存在局限。解决方案的关键在于提出LTSV(Lightweight Time Series Valuation),其核心思想是利用上下文微调(in-context fine-tuning)近似影响函数,并通过测量微调前后上下文损失的变化来估计样本贡献;同时引入时间块聚合机制(temporal block aggregation),将重叠时间窗口内各块的影响分数进行整合,从而有效捕捉时间依赖结构。该方法在保证估值可靠性的同时显著降低计算开销,实现了数据归因与模型泛化能力之间的高效衔接。

链接: https://arxiv.org/abs/2511.11648
作者: Shunyu Wu,Tianyue Li,Yixuan Leng,Jingyi Suo,Jian Lou,Dan Li,See-Kiong Ng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Time series foundation models (TSFMs) have demonstrated increasing capabilities due to their extensive pretraining on large volumes of diverse time series data. Consequently, the quality of time series data is crucial to TSFM performance, rendering an accurate and efficient data valuation of time series for TSFMs indispensable. However, traditional data valuation methods, such as influence functions, face severe computational bottlenecks due to their poor scalability with growing TSFM model sizes and often fail to preserve temporal dependencies. In this paper, we propose LTSV, a Lightweight Time Series Valuation on TSFMS via in-context finetuning. Grounded in the theoretical evidence that in-context finetuning approximates the influence function, LTSV estimates a sample’s contribution by measuring the change in context loss after in-context finetuning, leveraging the strong generalization capabilities of TSFMs to produce robust and transferable data valuations. To capture temporal dependencies, we introduce temporal block aggregation, which integrates per-block influence scores across overlapping time windows. Experiments across multiple time series datasets and models demonstrate that LTSV consistently provides reliable and strong valuation performance, while maintaining manageable computational requirements. Our results suggest that in-context finetuning on time series foundation models provides a practical and effective bridge between data attribution and model generalization in time series learning.
zh

[AI-214] Environment-Aware Transfer Reinforcement Learning for Sustainable Beam Selection

【速读】:该论文旨在解决5G及未来网络中基于强化学习(Reinforcement Learning, RL)的波束选择方法在不同传播环境部署时存在的训练时间长、计算资源消耗大以及能效低的问题,这对系统可扩展性和绿色可持续AI发展构成挑战。其解决方案的关键在于将环境建模为点云(point cloud),其中每个点代表gNodeB(gNB)和周围散射体的位置,并通过计算点云间的Chamfer距离来高效识别结构相似环境,从而实现预训练模型的迁移复用。这一机制显著减少了重复训练需求,使训练时间与计算开销降低16倍,同时提升了部署速度并降低了碳排放,为动态多变环境中可扩展、自适应且环保的RL波束选择策略提供了可行路径。

链接: https://arxiv.org/abs/2511.11647
作者: Dariush Salami,Ramin Hashemi,Parham Kazemi,Mikko A. Uusitalo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注: Accepted to be published in a workshop in IEEE GLOBECOM 2025

点击查看摘要

Abstract:This paper presents a novel and sustainable approach for improving beam selection in 5G and beyond networks using transfer learning and Reinforcement Learning (RL). Traditional RL-based beam selection models require extensive training time and computational resources, particularly when deployed in diverse environments with varying propagation characteristics posing a major challenge for scalability and energy efficiency. To address this, we propose modeling the environment as a point cloud, where each point represents the locations of gNodeBs (gNBs) and surrounding scatterers. By computing the Chamfer distance between point clouds, structurally similar environments can be efficiently identified, enabling the reuse of pre-trained models through transfer learning. This methodology leads to a 16x reduction in training time and computational overhead, directly contributing to energy efficiency. By minimizing the need for retraining in each new deployment, our approach significantly lowers power consumption and supports the development of green and sustainable Artificial Intelligence (AI) in wireless systems. Furthermore, it accelerates time-to-deployment, reduces carbon emissions associated with training, and enhances the viability of deploying AI-driven communication systems at the edge. Simulation results confirm that our approach maintains high performance while drastically cutting energy costs, demonstrating the potential of transfer learning to enable scalable, adaptive, and environmentally conscious RL-based beam selection strategies in dynamic and diverse propagation environments.
zh

[AI-215] A Deep Learning Model to Predicting Changes in Consumer Attributes for New Line-extended Products

【速读】:该论文旨在解决产品线扩展(Product Line Extension)过程中因过度延伸导致品牌形象受损的问题,核心在于如何基于消费者需求精准预测新产品引入后消费者属性的变化。解决方案的关键是提出了一种名为条件表格变分自编码器(Conditional Tabular Variational Auto-Encoder, CTVAE)的新型深度学习模型,该模型能够从大规模消费者与产品表格数据中生成合成数据,从而有效预测目标消费者群体在产品容器或口味等变更后的属性变化,为营销策略制定提供量化依据,助力避免产品间竞争(cannibalization)并优化产品形象设计。

链接: https://arxiv.org/abs/2511.11646
作者: Li Yinxing,Tsukasa Ishigaki
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 23 pages

点击查看摘要

Abstract:Product line extension is a marketing strategy that enhances a company’s sphere of influence. Because excessive line extensions disrupt brand image, only appropriate line extensions based on consumer needs are desirable. Marketers should know the key consumer attributes of the primary customers for new line-extended products before companies enter the market. This paper describes a method for predicting changes in consumer attributes for new line-extended products using a novel deep learning model. The proposed model, Conditional Tabular Variational Auto-Encoder (CTVAE), generates synthetic data from large-scale tabular data of consumers and products. It can provide various implications about effective product line marketing for marketers. The experimental results demonstrate that the CTVAE offers superior prediction performance than existing models. We indicate implications for new products that change containers or flavors for effective product line marketing. The proposed approach has the potential to contribute to avoiding cannibalization and to designing product images and marketing strategies.
zh

[AI-216] EcoSpa: Efficient Transformer Training with Coupled Sparsity

【速读】:该论文旨在解决Transformer模型在高稀疏度下因忽略权重矩阵间乘法交互结构关系而导致性能下降的问题。现有稀疏训练方法未能保留注意力层和前馈层中耦合权重矩阵的关键结构关联,从而限制了模型在极端稀疏条件下的有效性。解决方案的关键在于提出EcoSpa方法,通过联合评估与稀疏化成对的耦合权重矩阵,并采用对齐的行/列移除策略来保持其交互模式;同时引入新的结构组件重要性校准粒度,在预训练和微调场景中实现耦合估计与稀疏化同步进行,从而在不依赖定制硬件或内核的前提下显著提升训练效率与推理速度。

链接: https://arxiv.org/abs/2511.11641
作者: Jinqi Xiao,Cheng Luo,Lingyi Huang,Cheng Yang,Yang Sui,Huy Phan,Xiao Zang,Yibiao Ying,Zhexiang Tang,Anima Anandkumar,Bo Yuan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注:

点击查看摘要

Abstract:Transformers have become the backbone of modern AI, yet their high computational demands pose critical system challenges. While sparse training offers efficiency gains, existing methods fail to preserve critical structural relationships between weight matrices that interact multiplicatively in attention and feed-forward layers. This oversight leads to performance degradation at high sparsity levels. We introduce EcoSpa, an efficient structured sparse training method that jointly evaluates and sparsifies coupled weight matrix pairs, preserving their interaction patterns through aligned row/column removal. EcoSpa introduces a new granularity for calibrating structural component importance and performs coupled estimation and sparsification across both pre-training and fine-tuning scenarios. Evaluations demonstrate substantial improvements: EcoSpa enables efficient training of LLaMA-1B with 50% memory reduction and 21% faster training, achieves 2.2\times model compression on GPT-2-Medium with 2.4 lower perplexity, and delivers 1.6\times inference speedup. The approach uses standard PyTorch operations, requiring no custom hardware or kernels, making efficient transformer training accessible on commodity hardware.
zh

[AI-217] oward Better Generalization in Few-Shot Learning through the Meta-Component Combination

【速读】:该论文旨在解决少样本学习(few-shot learning)中分类器在未见类别上泛化能力不足的问题,尤其针对基于度量的元学习方法容易过拟合已见类别、难以有效迁移到未见类别的局限性。其解决方案的关键在于提出一种新型元学习算法,将每个分类器建模为元组件(meta-components)的组合,这些元组件通过在多个元学习episode上联合学习,并引入正交约束以解耦和促进多样性,从而捕捉不同分类器间的共享子结构,提升模型对未见类别的适应能力。

链接: https://arxiv.org/abs/2511.11632
作者: Qiuhao Zeng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 20 pages, 5 figures

点击查看摘要

Abstract:In few-shot learning, classifiers are expected to generalize to unseen classes given only a small number of instances of each new class. One of the popular solutions to few-shot learning is metric-based meta-learning. However, it highly depends on the deep metric learned on seen classes, which may overfit to seen classes and fail to generalize well on unseen classes. To improve the generalization, we explore the substructures of classifiers and propose a novel meta-learning algorithm to learn each classifier as a combination of meta-components. Meta-components are learned across meta-learning episodes on seen classes and disentangled by imposing an orthogonal regularizer to promote its diversity and capture various shared substructures among different classifiers. Extensive experiments on few-shot benchmark tasks show superior performances of the proposed method.
zh

[AI-218] Predicting Grain Growth in Polycrystalline Materials Using Deep Learning Time Series Models

【速读】:该论文旨在解决材料微结构演化中晶粒生长(Grain Growth)预测的难题,其核心挑战在于如何在保证精度的同时显著降低计算成本。传统全场模拟方法虽精确但计算耗时,难以满足实时性要求。论文提出的关键解决方案是:利用从高保真模拟中提取的低维统计描述符(mean-field statistical descriptors),结合长短期记忆网络(LSTM)进行递归预测,从而实现对晶粒尺寸分布随时间演化的高效准确建模。该方法不仅将单序列预测时间从约20分钟缩短至数秒,且在长时间尺度上保持物理一致性,展现出在数字孪生和工艺优化中的应用潜力。

链接: https://arxiv.org/abs/2511.11630
作者: Eliane Younes,Elie Hachem,Marc Bernacki
机构: 未知
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:Grain Growth strongly influences the mechanical behavior of materials, making its prediction a key objective in microstructural engineering. In this study, several deep learning approaches were evaluated, including recurrent neural networks (RNN), long short-term memory (LSTM), temporal convolutional networks (TCN), and transformers, to forecast grain size distributions during grain growth. Unlike full-field simulations, which are computationally demanding, the present work relies on mean-field statistical descriptors extracted from high-fidelity simulations. A dataset of 120 grain growth sequences was processed into normalized grain size distributions as a function of time. The models were trained to predict future distributions from a short temporal history using a recursive forecasting strategy. Among the tested models, the LSTM network achieved the highest accuracy (above 90%) and the most stable performance, maintaining physically consistent predictions over extended horizons while reducing computation time from about 20 minutes per sequence to only a few seconds, whereas the other architectures tended to diverge when forecasting further in time. These results highlight the potential of low-dimensional descriptors and LSTM-based forecasting for efficient and accurate microstructure prediction, with direct implications for digital twin development and process optimization.
zh

[AI-219] Global Feature Enhancing and Fusion Framework for Strain Gauge Time Series Classification

【速读】:该论文旨在解决应变计状态(Strain Gauge Status, SGS)识别中因仅依赖局部特征而难以准确区分时间序列的问题,尤其是在局部子序列相似性较高(如飞机机翼静态强度实验数据)时,传统卷积神经网络(Convolutional Neural Networks, CNNs)因卷积操作的局限性无法有效提取全局特征,导致识别精度受限。解决方案的关键在于提出一种基于超图(Hypergraph)的全局特征学习与融合框架:一方面通过特征工程构建全局特征,另一方面通过学习局部特征间的高阶关系来捕获全局语义信息,并将这些全局特征进行语义一致性融合,从而更全面地表示SGS时间序列,提升识别准确率。

链接: https://arxiv.org/abs/2511.11629
作者: Xu Zhang,Peng Wang,Chen Wang,Zhe Xu,Xiaohua Nie,Wei Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Global Feature Enhancing and Fusion Framework for Time Series Classification

点击查看摘要

Abstract:Strain Gauge Status (SGS) recognition is crucial in the field of intelligent manufacturing based on the Internet of Things, as accurate identification helps timely detection of failed mechanical components, avoiding accidents. The loading and unloading sequences generated by strain gauges can be identified through time series classification (TSC) algorithms. Recently, deep learning models, e.g., convolutional neural networks (CNNs) have shown remarkable success in the TSC task, as they can extract discriminative local features from the subsequences to identify the time series. However, we observe that only the local features may not be sufficient for expressing the time series, especially when the local sub-sequences between different time series are very similar, e.g., SGS data of aircraft wings in static strength experiments. Nevertheless, CNNs suffer from the limitation in extracting global features due to the nature of convolution operations. For extracting global features to more comprehensively represent the SGS time series, we propose two insights: (i) Constructing global features through feature engineering. (ii) Learning high-order relationships between local features to capture global features. To realize and utilize them, we propose a hypergraph-based global feature learning and fusion framework, which learns and fuses global features for semantic consistency to enhance the representation of SGS time series, thereby improving recognition accuracy. Our method designs are validated on industrial SGS and public UCR datasets, showing better generalization for unseen data in SGS recognition.
zh

[AI-220] Mixture-of-Schedulers: An Adaptive Scheduling Agent as a Learned Router for Expert Policies

【速读】:该论文旨在解决现代操作系统调度器普遍采用单一静态策略所导致的性能瓶颈问题,尤其是在异构硬件和多样化应用架构背景下,这种“一刀切”的调度方式难以同时保障公平性、吞吐量与延迟等关键指标。其核心解决方案是提出一种自适应调度框架——Adaptive Scheduling Agent (ASA),关键在于通过一个解耦的离线/在线两阶段机制实现动态调度策略选择:首先利用离线训练构建一个与硬件无关的通用机器学习模型,用于识别抽象的工作负载模式;其次在运行时借助时间加权概率投票算法持续评估当前工作负载,并依据预配置的设备特定映射表切换至最优调度器(通过Linux的sched_ext框架实现)。该设计使得系统能快速适配新硬件平台且无需重新训练核心模型,从而显著提升调度决策的智能化水平与响应能力。

链接: https://arxiv.org/abs/2511.11628
作者: Xinbo Wang,Shian Jia,Ziyang Huang,Jing Cao,Mingli Song
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern operating system schedulers employ a single, static policy, which struggles to deliver optimal performance across the diverse and dynamic workloads of contemporary systems. This “one-policy-fits-all” approach leads to significant compromises in fairness, throughput, and latency, particularly with the rise of heterogeneous hardware and varied application architectures. This paper proposes a new paradigm: dynamically selecting the optimal policy from a portfolio of specialized schedulers rather than designing a single, monolithic one. We present the Adaptive Scheduling Agent (ASA), a lightweight framework that intelligently matches workloads to the most suitable “expert” scheduling policy at runtime. ASA’s core is a novel, low-overhead offline/online approach. First, an offline process trains a universal, hardware-agnostic machine learning model to recognize abstract workload patterns from system behaviors. Second, at runtime, ASA continually processes the model’s predictions using a time-weighted probability voting algorithm to identify the workload, then makes a scheduling decision by consulting a pre-configured, machine-specific mapping table to switch to the optimal scheduler via Linux’s sched_ext framework. This decoupled architecture allows ASA to adapt to new hardware platforms rapidly without expensive retraining of the core recognition model. Our evaluation, based on a novel benchmark focused on user-experience metrics, demonstrates that ASA consistently outperforms the default Linux scheduler (EEVDF), achieving superior results in 86.4% of test scenarios. Furthermore, ASA’s selections are near-optimal, ranking among the top three schedulers in 78.6% of all scenarios. This validates our approach as a practical path toward more intelligent, adaptive, and responsive operating system schedulers. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI) Cite as: arXiv:2511.11628 [cs.DC] (or arXiv:2511.11628v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2511.11628 Focus to learn more arXiv-issued DOI via DataCite
zh

[AI-221] SA-EMO: Structure-Aligned Encoder Mixture of Operators for Generalizable Full-waveform Inversion

【速读】:该论文旨在解决全波形反演(Full-waveform Inversion, FWI)在未知或复杂地质条件下存在的固有不适定性、高度非线性以及计算成本高的问题,特别是现有基于单一卷积神经网络(CNN)或单一神经算子的方法难以泛化且无法有效区分多样地质类型的问题。其解决方案的关键在于提出了一种结构对齐编码器-专家混合算子(Structure-Aligned Encoder-Mixture-of-Operators, SA-EMO)架构:首先通过结构对齐编码器将高维地震波场映射到物理一致的潜在空间,消除波形域与速度域之间的时空失配,恢复高频成分并增强特征泛化能力;随后引入自适应路由机制动态选择并融合多种神经算子专家(包括谱、小波、多尺度和局部算子),从而提升模型在复杂地质场景下的预测精度与鲁棒性。实验表明,SA-EMO在OpenFWI基准和Marmousi2数据集上显著优于传统方法,平均绝对误差(MAE)降低约58.443%,边界分辨率提升约10.308%。

链接: https://arxiv.org/abs/2511.11627
作者: Wang Zhenyu,Li Peiyuan,Shi Yongxiang,Wu Ruoyu,Zhang Lei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Full-waveform inversion (FWI) can produce high-resolution subsurface models, yet it remains inherently ill-posed, highly nonlinear, and computationally intensive. Although recent deep learning and numerical acceleration methods have improved speed and scalability, they often rely on single CNN architectures or single neural operators, which struggle to generalize in unknown or complex geological settings and are ineffective at distinguishing diverse geological types. To address these issues, we propose a Structure-Aligned Encoder-Mixture-of-Operators (SA-EMO) architecture for velocity-field inversion under unknown subsurface structures. First, a structure-aligned encoder maps high-dimensional seismic wavefields into a physically consistent latent space, thereby eliminating spatio-temporal mismatch between the waveform and velocity domains, recovering high-frequency components, and enhancing feature generalization. Then, an adaptive routing mechanism selects and fuses multiple neural-operator experts, including spectral, wavelet, multiscale, and local operators, to predict the velocity model. We systematically evaluate our approach on the OpenFWI benchmark and the Marmousi2 dataset. Results show that SA-EMO significantly outperforms traditional CNN or single-operator methods, achieving an average MAE reduction of approximately 58.443% and an improvement in boundary resolution of about 10.308%. Ablation studies further reveal that the structure-aligned encoder, the expert-fusion mechanism, and the routing module each contribute markedly to the performance gains. This work introduces a new paradigm for efficient, scalable, and physically interpretable full-waveform inversion.
zh

[AI-222] MedFedPure: A Medical Federated Framework with MAE-based Detection and Diffusion Purification for Inference-Time Attacks

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)场景下医疗影像诊断模型在推理阶段易受对抗攻击的问题,此类攻击可能通过人眼不可察觉的微小扰动误导AI模型,导致严重误诊。现有防御方法多依赖于集中式数据假设,在去中心化且异构的联邦医疗环境中效果有限。其解决方案的关键在于提出MedFedPure框架,该框架融合三项核心技术:(1) 个性化联邦学习模型以适应各医疗机构的数据分布差异;(2) 基于掩码自编码器(Masked Autoencoder, MAE)的输入异常检测机制,可识别隐藏的对抗扰动;(3) 自适应扩散净化模块,仅对可疑图像进行局部修复而不影响正常图像质量。这一协同设计实现了高鲁棒性(对抗攻击下准确率从49.50%提升至87.33%)与高纯净准确率(97.67%)的平衡,同时保障隐私和实时性,适用于临床部署。

链接: https://arxiv.org/abs/2511.11625
作者: Mohammad Karami,Mohammad Reza Nemati,Aidin Kazemi,Ali Mikaeili Barzili,Hamid Azadegan,Behzad Moshiri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Artificial intelligence (AI) has shown great potential in medical imaging, particularly for brain tumor detection using Magnetic Resonance Imaging (MRI). However, the models remain vulnerable at inference time when they are trained collaboratively through Federated Learning (FL), an approach adopted to protect patient privacy. Adversarial attacks can subtly alter medical scans in ways invisible to the human eye yet powerful enough to mislead AI models, potentially causing serious misdiagnoses. Existing defenses often assume centralized data and struggle to cope with the decentralized and diverse nature of federated medical settings. In this work, we present MedFedPure, a personalized federated learning defense framework designed to protect diagnostic AI models at inference time without compromising privacy or accuracy. MedFedPure combines three key elements: (1) a personalized FL model that adapts to the unique data distribution of each institution; (2) a Masked Autoencoder (MAE) that detects suspicious inputs by exposing hidden perturbations; and (3) an adaptive diffusion-based purification module that selectively cleans only the flagged scans before classification. Together, these steps offer robust protection while preserving the integrity of normal, benign images. We evaluated MedFedPure on the Br35H brain MRI dataset. The results show a significant gain in adversarial robustness, improving performance from 49.50% to 87.33% under strong attacks, while maintaining a high clean accuracy of 97.67%. By operating locally and in real time during diagnosis, our framework provides a practical path to deploying secure, trustworthy, and privacy-preserving AI tools in clinical workflows. Index Terms: cancer, tumor detection, federated learning, masked autoencoder, diffusion, privacy Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR) Cite as: arXiv:2511.11625 [cs.LG] (or arXiv:2511.11625v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.11625 Focus to learn more arXiv-issued DOI via DataCite
zh

[AI-223] Early GVHD Prediction in Liver Transplantation via Multi-Modal Deep Learning on Imbalanced EHR Data

【速读】:该论文旨在解决肝移植术后移植物抗宿主病(Graft-versus-Host Disease, GVHD)早期预测难题,其核心挑战在于电子健康记录(Electronic Health Record, EHR)数据的异质性与极端类别不平衡问题。解决方案的关键在于构建一个能够动态融合多模态信息(包括患者人口统计学、实验室检查、诊断和用药记录)的深度学习框架,该框架通过AUC优化策略有效处理缺失值与不规则时间序列数据,并在极度不平衡的数据分布下实现高召回率与特异性,最终在测试集上达到0.836的AUC和0.768的召回率,显著优于单一模态及传统多模态机器学习基线方法。

链接: https://arxiv.org/abs/2511.11623
作者: Yushan Jiang,Shuteng Niu,Dongjin Song,Yichen Wang,Jingna Feng,Xinyue Hu,Liu Yang,Cui Tao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graft-versus-host disease (GVHD) is a rare but often fatal complication in liver transplantation, with a very high mortality rate. By harnessing multi-modal deep learning methods to integrate heterogeneous and imbalanced electronic health records (EHR), we aim to advance early prediction of GVHD, paving the way for timely intervention and improved patient outcomes. In this study, we analyzed pre-transplant electronic health records (EHR) spanning the period before surgery for 2,100 liver transplantation patients, including 42 cases of graft-versus-host disease (GVHD), from a cohort treated at Mayo Clinic between 1992 and 2025. The dataset comprised four major modalities: patient demographics, laboratory tests, diagnoses, and medications. We developed a multi-modal deep learning framework that dynamically fuses these modalities, handles irregular records with missing values, and addresses extreme class imbalance through AUC-based optimization. The developed framework outperforms all single-modal and multi-modal machine learning baselines, achieving an AUC of 0.836, an AUPRC of 0.157, a recall of 0.768, and a specificity of 0.803. It also demonstrates the effectiveness of our approach in capturing complementary information from different modalities, leading to improved performance. Our multi-modal deep learning framework substantially improves existing approaches for early GVHD prediction. By effectively addressing the challenges of heterogeneity and extreme class imbalance in real-world EHR, it achieves accurate early prediction. Our proposed multi-modal deep learning method demonstrates promising results for early prediction of a GVHD in liver transplantation, despite the challenge of extremely imbalanced EHR data.
zh

[AI-224] AIvailable: A Software-Defined Architecture for LLM -as-a-Service on Heterogeneous and Legacy GPUs

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLM)在资源受限环境(如学术实验室或小型企业)中部署与推理时面临的高成本、硬件异构性及显存利用率低的问题。现有框架通常假设使用同质且资源丰富的硬件,难以适应实际场景中的多样性与限制。其解决方案的关键在于提出 AIvailable 平台,采用软件定义的方式在异构 GPU 节点(包括 NVIDIA 和 AMD 设备)上运行 LLMaaS(LLM-as-a-Service),通过动态 VRAM 友好的模型分配与再分配机制,实现每个节点显存的充分利用,并提供无 CPU 回退的全 GPU 加速推理能力;同时,平台架构包含统一客户端接口、安全请求路由、服务前端负载均衡、SDAI 控制器以及异构 GPU 后端,从而保障系统可用性、弹性扩展性和对故障及负载波动的鲁棒性。

链接: https://arxiv.org/abs/2511.11621
作者: Pedro Antunes,Ana Rita Ortigoso,Gabriel Vieira,Daniel Fuentes,Luís Frazão,Nuno Costa,António Pereira
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:The rise of Large Language Models (LLM) has increased the need for scalable, high-performance inference systems, yet most existing frameworks assume homogeneous, resource-rich hardware, often unrealistic in academic, or resource-constrained settings. We introduce AIvailable, a low-cost, highly available LLM-as-a-Service (LLMaaS) platform, that uses a software-defined approach for running LLMs across heterogeneous and legacy GPU nodes, including NVIDIA and AMD devices, with a focus on fully utilizing each node’s VRAM. AIvailable operates as a fully GPU-accelerated inference without CPU fallbacks, featuring a unified client interface that allows seamless interaction with all deployed LLMs through a single logical unit. The architecture comprises four main components: the Client Interface for user access, the Service Frontend for secure request routing and load balancing, the SDAI Controller for orchestration, deployment, and monitoring, and the Service Backend of heterogeneous GPU nodes executing workloads. By abstracting GPU-specific details and providing dynamic, VRAM-aware allocation and reallocation of models, AIvailable ensures efficient use of resources and resilience against failures or workload fluctuations. Targeting academic labs, private companies, and other constrained organizations, it supports diverse open LLMs helping democratize generative AI through the repurposing of legacy GPUs.
zh

[AI-225] DIAP: A Decentralized Agent Identity Protocol with Zero-Knowledge Proofs and a Hybrid P2P Stack

【速读】:该论文旨在解决自主代理(Autonomous Agents)在去中心化环境中缺乏完全去中心化、可验证且隐私保护的通信协议这一核心挑战。现有系统通常依赖中心化中介引入信任瓶颈,或缺少去中心化身份解析机制,从而限制了身份持久性和跨网络互操作性。其解决方案的关键在于提出一种名为“去中心化星际代理协议”(Decentralized Interstellar Agent Protocol, DIAP)的新框架:通过将代理身份绑定至不可变的IPFS/IPNS内容标识符,并利用零知识证明(Zero-Knowledge Proofs, ZKP)动态、无状态地证明所有权,从而无需更新记录即可实现可信身份验证;同时,该方案还集成Noir生成ZKP、DID-Key身份管理、IPFS存储及基于Libp2p和Iroh的混合P2P通信栈,构建了一个零依赖ZKP部署模型,借助通用证明管理器与编译时脚本嵌入预编译Noir电路,消除了对外部ZKP工具链的需求,实现了即时、可验证且隐私保护的身份证明机制。

链接: https://arxiv.org/abs/2511.11619
作者: Yuanjie Liu,Wenpeng Xing,Ye Zhou,Gaowei Chang,Changting Lin,Meng Han
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The absence of a fully decentralized, verifiable, and privacy-preserving communication protocol for autonomous agents remains a core challenge in decentralized computing. Existing systems often rely on centralized intermediaries, which reintroduce trust bottlenecks, or lack decentralized identity-resolution mechanisms, limiting persistence and cross-network interoperability. We propose the Decentralized Interstellar Agent Protocol (DIAP), a novel framework for agent identity and communication that enables persistent, verifiable, and trustless interoperability in fully decentralized environments. DIAP binds an agent’s identity to an immutable IPFS or IPNS content identifier and uses zero-knowledge proofs (ZKP) to dynamically and statelessly prove ownership, removing the need for record updates. We present a Rust SDK that integrates Noir (for zero-knowledge proofs), DID-Key, IPFS, and a hybrid peer-to-peer stack combining Libp2p GossipSub for discovery and Iroh for high-performance, QUIC based data exchange. DIAP introduces a zero-dependency ZKP deployment model through a universal proof manager and compile-time build script that embeds a precompiled Noir circuit, eliminating the need for external ZKP toolchains. This enables instant, verifiable, and privacy-preserving identity proofs. This work establishes a practical, high-performance foundation for next-generation autonomous agent ecosystems and agent-to-agent (A to A) economies. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI) Cite as: arXiv:2511.11619 [cs.DC] (or arXiv:2511.11619v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2511.11619 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Wenpeng Xing [view email] [v1] Thu, 6 Nov 2025 06:00:18 UTC (187 KB)
zh

[AI-226] Hierarchical Federated Graph Attention Networks for Scalable and Resilient UAV Collision Avoidance

【速读】:该论文旨在解决大规模多无人机(Multi-UAV)系统中碰撞避让的实时性、抗对抗攻击能力和隐私保护之间的权衡问题。当前框架通常采用单体式设计,存在计算复杂度高(O(n²))、缺乏拜占庭容错能力等缺陷。其解决方案的关键在于提出一种分层架构,将智能分布于三层:局部层利用密集图注意力机制实现10 ms延迟的即时避障;区域层采用稀疏注意力与异步联邦学习(坐标修剪均值聚合),实现O(nk)复杂度和拜占庭容错;全局层则基于轻量级Hashgraph协议进行协调。此外,引入自适应差分隐私机制(ε ∈ [0.1, 1.0] 动态调整噪声水平),以在保障隐私的同时最大化效用,并通过基于分布式哈希表(DHT)的轻量审计日志替代区块链共识,使95百分位决策延迟控制在50 ms以内,最终在500架无人机场景下实现2.0%碰撞率和f ≤ n/3的拜占庭容错能力。

链接: https://arxiv.org/abs/2511.11616
作者: Rathin Chandra Shit,Sharmila Subudhi
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: Accepted and scheduled for conference presentation

点击查看摘要

Abstract:The real-time performance, adversarial resiliency, and privacy preservation are the most important metrics that need to be balanced to practice collision avoidance in large-scale multi-UAV (Unmanned Aerial Vehicle) systems. Current frameworks tend to prescribe monolithic solutions that are not only prohibitively computationally complex with a scaling cost of O(n^2) but simply do not offer Byzantine fault tolerance. The proposed hierarchical framework presented in this paper tries to eliminate such trade-offs by stratifying a three-layered architecture. We spread the intelligence into three layers: an immediate collision avoiding local layer running on dense graph attention with latency of 10 ms , a regional layer using sparse attention with O(nk) computational complexity and asynchronous federated learning with coordinate-wise trimmed mean aggregation, and lastly, a global layer using a lightweight Hashgraph-inspired protocol. We have proposed an adaptive differential privacy mechanism, wherein the noise level (\epsilon \in [0.1, 1.0]) is dynamically reduced based on an evaluation of the measured real-time threat that in turn maximized the privacy-utility tradeoff. Through the use of Distributed Hash Table (DHT)-based lightweight audit logging instead of heavyweight blockchain consensus, the median cost of getting a 95^th percentile decision within 50ms is observed across all tested swarm sizes. This architecture provides a scalable scenario of 500 UAVs with a collision rate of 2.0% and the Byzantine fault tolerance of f n/3 .
zh

[AI-227] Lightweight Hopfield Neural Networks for Bioacoustic Detection and Call Monitoring of Captive Primates

【速读】:该论文旨在解决被动声学监测(Passive Acoustic Monitoring, PAM)在野生动物和环境监测中因数据量庞大而导致的处理延迟问题,尤其是现有基于资源密集型卷积神经网络(Convolutional Neural Networks, CNNs)的方法依赖大量预标注数据且灵活性差。解决方案的关键在于提出一种轻量级、可解释性强且训练速度快的关联记忆人工智能模型,其架构基于霍普菲尔德神经网络(Hopfield Neural Network, HNN),通过存储目标物种(如黑-白领狐猴)社会叫声特征作为记忆模板,实现对大规模音频数据中同类事件的高效检测;该模型进一步优化了对运动引起的附加信号的存储机制,整体准确率达0.94,可在标准笔记本电脑上以每秒340次分类的速度处理超过5.5小时音频/分钟,显著缩短从数据到洞察的时间周期,适用于圈养与野外多种场景。

链接: https://arxiv.org/abs/2511.11615
作者: Wendy Lomas,Andrew Gascoyne,Colin Dubreuil,Stefano Vaglio,Liam Naughton
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: 16 pages, 3 figures, Proceedings of the Future Technologies Conference (FTC) 2025, Volume 1

点击查看摘要

Abstract:Passive acoustic monitoring is a sustainable method of monitoring wildlife and environments that leads to the generation of large datasets and, currently, a processing backlog. Academic research into automating this process is focused on the application of resource intensive convolutional neural networks which require large pre-labelled datasets for training and lack flexibility in application. We present a viable alternative relevant in both wild and captive settings; a transparent, lightweight and fast-to-train associative memory AI model with Hopfield neural network (HNN) architecture. Adapted from a model developed to detect bat echolocation calls, this model monitors captive endangered black-and-white ruffed lemur Varecia variegata vocalisations. Lemur social calls of interest when monitoring welfare are stored in the HNN in order to detect other call instances across the larger acoustic dataset. We make significant model improvements by storing an additional signal caused by movement and achieve an overall accuracy of 0.94. The model can perform 340 classifications per second, processing over 5.5 hours of audio data per minute, on a standard laptop running other applications. It has broad applicability and trains in milliseconds. Our lightweight solution reduces data-to-insight turnaround times and can accelerate decision making in both captive and wild settings.
zh

[AI-228] Beyond the GPU: The Strategic Role of FPGAs in the Next Wave of AI

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 加速中对低延迟、高能效及细粒度硬件控制的需求日益增长,而传统 GPU 架构在这些方面存在局限的问题。其核心解决方案在于利用现场可编程门阵列(Field-Programmable Gate Array, FPGA)的可重构特性,将 AI 算法直接映射到器件逻辑中,实现卷积、注意力机制和后处理等模块的并行流水线执行,从而获得确定性时延与更低功耗。FPGA 可通过局部重配置(partial reconfiguration)和来自 AI 框架的编译流程,支持软硬件协同设计,使硬件结构能够针对特定模型动态调整,并集成嵌入式处理器形成系统级芯片(SoC),实现靠近传感器端的推理部署,显著降低延迟、带宽需求与隐私风险,同时释放数据中心中的 GPU 资源用于更复杂任务。

链接: https://arxiv.org/abs/2511.11614
作者: Arturo Urías Jiménez
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI acceleration has been dominated by GPUs, but the growing need for lower latency, energy efficiency, and fine-grained hardware control exposes the limits of fixed architectures. In this context, Field-Programmable Gate Arrays (FPGAs) emerge as a reconfigurable platform that allows mapping AI algorithms directly into device logic. Their ability to implement parallel pipelines for convolutions, attention mechanisms, and post-processing with deterministic timing and reduced power consumption makes them a strategic option for workloads that demand predictable performance and deep customization. Unlike CPUs and GPUs, whose architecture is immutable, an FPGA can be reconfigured in the field to adapt its physical structure to a specific model, integrate as a SoC with embedded processors, and run inference near the sensor without sending raw data to the cloud. This reduces latency and required bandwidth, improves privacy, and frees GPUs from specialized tasks in data centers. Partial reconfiguration and compilation flows from AI frameworks are shortening the path from prototype to deployment, enabling hardware–algorithm co-design. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI) Cite as: arXiv:2511.11614 [cs.DC] (or arXiv:2511.11614v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2511.11614 Focus to learn more arXiv-issued DOI via DataCite
zh

[AI-229] Evaluating Large Language Models for Workload Mapping and Scheduling in Heterogeneous HPC Systems

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在从自然语言描述中执行结构化、约束驱动的组合优化任务时的能力边界不明确的问题。研究聚焦于高绩效计算(High-Performance Computing, HPC)场景下的工作负载映射与调度问题,通过让21个公开可用的LLM根据相同的文本输入生成任务分配方案、计算总完成时间(makespan),并提供推理过程,从而系统评估其优化能力。解决方案的关键在于设计了一个具有明确解析最优解(9小时20秒)的基准测试框架,量化模型输出的可行性、准确性及可解释性——结果表明,领先模型能直接从自然语言重建最优调度方案,但多数仍难以精确处理时间计算、数据传输算术和依赖关系强制,凸显了LLM作为可解释辅助决策工具而非独立求解器的应用潜力。

链接: https://arxiv.org/abs/2511.11612
作者: Aasish Kumar Sharma,Julian Kunkel
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 14 pages, 4 figures, 2 tables. Evaluation study on LLM-based reasoning for HPC scheduling. Published in Research in Academic Engineering Journal (RAEJ), 2025

点击查看摘要

Abstract:Large language models (LLMs) are increasingly explored for their reasoning capabilities, yet their ability to perform structured, constraint-based optimization from natural language remains insufficiently understood. This study evaluates twenty-one publicly available LLMs on a representative heterogeneous high-performance computing (HPC) workload mapping and scheduling problem. Each model received the same textual description of system nodes, task requirements, and scheduling constraints, and was required to assign tasks to nodes, compute the total makespan, and explain its reasoning. A manually derived analytical optimum of nine hours and twenty seconds served as the ground truth reference. Three models exactly reproduced the analytical optimum while satisfying all constraints, twelve achieved near-optimal results within two minutes of the reference, and six produced suboptimal schedules with arithmetic or dependency errors. All models generated feasible task-to-node mappings, though only about half maintained strict constraint adherence. Nineteen models produced partially executable verification code, and eighteen provided coherent step-by-step reasoning, demonstrating strong interpretability even when logical errors occurred. Overall, the results define the current capability boundary of LLM reasoning in combinatorial optimization: leading models can reconstruct optimal schedules directly from natural language, but most still struggle with precise timing, data transfer arithmetic, and dependency enforcement. These findings highlight the potential of LLMs as explainable co-pilots for optimization and decision-support tasks rather than autonomous solvers.
zh

[AI-230] Quantifying Skill and Chance: A Unified Framework for the Geometry of Games

【速读】:该论文旨在解决如何定量区分游戏中技能(skill)与运气(luck)对结果影响的问题,以建立一个可比较的量化框架。其核心解决方案是将游戏建模为随机决策树,并通过分解游戏结果中的技能杠杆(skill leverage, K)和运气杠杆(luck leverage, L)来定义技能-运气指数 S(G) ∈ [-1, 1],其中 S = -1 表示纯运气(如抛硬币),S = +1 表示纯技能(如国际象棋),中间值表示两者混合。此外,引入波动率 Sigma 来量化连续回合中结果的不确定性,从而实现对玩家影响力、游戏平衡性和预测稳定性的系统性评估,适用于游戏设计、人工智能评价及风险分析等场景。

链接: https://arxiv.org/abs/2511.11611
作者: David H. Silver
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:We introduce a quantitative framework for separating skill and chance in games by modeling them as complementary sources of control over stochastic decision trees. We define the Skill-Luck Index S(G) in [-1, 1] by decomposing game outcomes into skill leverage K and luck leverage L. Applying this to 30 games reveals a continuum from pure chance (coin toss, S = -1) through mixed domains such as backgammon (S = 0, Sigma = 1.20) to pure skill (chess, S = +1, Sigma = 0). Poker exhibits moderate skill dominance (S = 0.33) with K = 0.40 +/- 0.03 and Sigma = 0.80. We further introduce volatility Sigma to quantify outcome uncertainty over successive turns. The framework extends to general stochastic decision systems, enabling principled comparisons of player influence, game balance, and predictive stability, with applications to game design, AI evaluation, and risk assessment.
zh

[AI-231] Why Should the Server Do It All?: A Scalable Versatile and Model-Agnostic Framework for Server-Light DNN Inference over Massively Distributed Clients via Training-Free Intermediate Feature Compression

【速读】:该论文旨在解决边缘-云模型分割(Model Partitioning, MP)中因固定浅层分割点导致的边缘计算资源利用率低、服务器延迟和能耗集中的问题,尤其在自回归(Autoregressive, AR)大语言模型(Large Language Models, LLMs)推理场景下,由于每token前向传播重复生成大量中间特征(Intermediate Features, IFs),加剧了通信开销与服务器负载。解决方案的关键在于提出SLICER框架——一种无需重训练、与架构无关的压缩方法,其核心包括:(i) 异构Top-K过滤(Asymmetric Top-K Filtering, ATKF)稀疏化低幅值激活;(ii) 幅值分割(Magnitude-Splitting, MS)将剩余非零元素分组为等基数块;(iii) 自适应位量化(Adaptive Bit Quantization, ABQ)在失真预算约束下为每块选择最优位宽。该方案显著降低上行传输量(最高减少10倍)与服务器GPU时间(最高减少4.4倍),同时保持任务性能与基线相当(误差≤3个百分点),并通过将计算任务迁移至边缘端提升多设备场景下的可扩展性与稳定性。

链接: https://arxiv.org/abs/2511.11608
作者: Mingyu Sung,Suhwan Im,Daeho Bang,Il-Min Kim,Sangseok Yun,Jae-Mo Kang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern DNNs often rely on edge-cloud model partitioning (MP), but widely used schemes fix shallow, static split points that underutilize edge compute and concentrate latency and energy on the server. The problem is exacerbated in autoregressive (AR) LLM inference, where per-token forward passes repeatedly generate bulky intermediate features (IFs). We introduce SLICER, a retraining-free, architecture-agnostic framework that compresses IFs to reduce both communication and server load in split computing. SLICER combines (i) asymmetric top-K filtering (ATKF) to sparsify low-magnitude activations, (ii) magnitude-splitting (MS) to group the remaining non-zeros into equal-cardinality blocks, and (iii) adaptive bit quantization (ABQ) that selects per-block bitwidths under a distortion budget. Across standard vision and LLM workloads (e.g., ImageNet/COCO; HellaSwag, PIQA, ARC-E/C, GSM8K, HumanEval), SLICER reduces uplink volume by up to 10x and server GPU time by up to 4.4x, while keeping task quality within ~0-3 pp of baseline. In multi-device settings and AR LLMs, SLICER scales by shifting meaningful compute to the edge and lowering bits-per-token and server time per token, stabilizing per-step traffic. The codec attaches to off-the-shelf models without retraining or architectural changes, offering a plug-and-play path to scalable, low-latency distributed inference. Code is provided in the supplementary material.
zh

[AI-232] Clustering-Based Weight Orthogonalization for Stabilizing Deep Reinforcement Learning

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在非平稳环境(non-stationary environment)中样本效率低的问题,这类环境会导致RL代理需要数百万次迭代才能收敛,从而限制了其实际应用。解决方案的关键在于提出一种可集成到任意RL算法策略网络中的Clustering Orthogonal Weight Modified (COWM)层,该层通过聚类技术和投影矩阵稳定学习过程,有效缓解因环境变化引起的梯度干扰,从而提升学习速度和整体效率。

链接: https://arxiv.org/abs/2511.11607
作者: Guoqing Ma,Yuhan Zhang,Yuming Dai,Guangfu Hao,Yang Chen,Shan Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has made significant advancements, achieving superhuman performance in various tasks. However, RL agents often operate under the assumption of environmental stationarity, which poses a great challenge to learning efficiency since many environments are inherently non-stationary. This non-stationarity results in the requirement of millions of iterations, leading to low sample efficiency. To address this issue, we introduce the Clustering Orthogonal Weight Modified (COWM) layer, which can be integrated into the policy network of any RL algorithm and mitigate non-stationarity effectively. The COWM layer stabilizes the learning process by employing clustering techniques and a projection matrix. Our approach not only improves learning speed but also reduces gradient interference, thereby enhancing the overall learning efficiency. Empirically, the COWM outperforms state-of-the-art methods and achieves improvements of 9% and 12.6% in vision based and state-based DMControl benchmark. It also shows robustness and generality across various algorithms and tasks.
zh

[AI-233] Machine learning-based cloud resource allocation algorithms: a comprehensive comparative review

【速读】:该论文旨在解决现代计算环境中云资源分配(Cloud Resource Allocation)面临的挑战,即在复杂且动态的工作负载下如何优化性能与成本效率。传统启发式方法难以满足现有云基础设施对多目标优化的需求。解决方案的关键在于系统性地比较和评估十种先进的人工智能(Artificial Intelligence, AI)与机器学习(Machine Learning, ML)算法,涵盖深度强化学习、神经网络架构、增强型传统机器学习方法以及多智能体系统四类。研究发现,融合多种AI/ML技术的混合架构显著优于单一方法,尤其在边缘计算环境中展现出最高的部署成熟度,为学术界与工业界提供了下一代云资源调度策略的重要参考。

链接: https://arxiv.org/abs/2511.11603
作者: Deep Bodra,Sushil Khairnar
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cloud resource allocation has emerged as a major challenge in modern computing environments, with organizations struggling to manage complex, dynamic workloads while optimizing performance and cost efficiency. Traditional heuristic approaches prove inadequate for handling the multi-objective optimization demands of existing cloud infrastructures. This paper presents a comparative analysis of state-of-the-art artificial intelligence and machine learning algorithms for resource allocation. We systematically evaluate 10 algorithms across four categories: Deep Reinforcement Learning approaches, Neural Network architectures, Traditional Machine Learning enhanced methods, and Multi-Agent systems. Analysis of published results demonstrates significant performance improvements across multiple metrics including makespan reduction, cost optimization, and energy efficiency gains compared to traditional methods. The findings reveal that hybrid architectures combining multiple artificial intelligence and machine learning techniques consistently outperform single-method approaches, with edge computing environments showing the highest deployment readiness. Our analysis provides critical insights for both academic researchers and industry practitioners seeking to implement next-generation cloud resource allocation strategies in increasingly complex and dynamic computing environments.
zh

[AI-234] Mind the Gap: Revealing Inconsistencies Across Heterogeneous AI Accelerators

【速读】:该论文旨在解决在异构AI加速器(heterogeneous AI accelerators)环境中,机器学习模型行为一致性难以保障的问题。随着NVIDIA之外的厂商(如AMD、Intel、Mac和华为)提供成本更低且声称兼容的替代方案,其硬件差异可能导致模型执行结果不一致,从而影响部署可靠性。解决方案的关键在于构建一个自动化流水线,对4,000个真实世界模型生成超过10万个变体,并在五种企业级AI加速器上进行大规模实证测试,从而系统性地量化不同平台在算子支持数量、输出偏差率、编译失败率及数值异常处理等方面的差异,识别出PyTorch中的7个实现缺陷和跨厂商的40个平台特异性问题,揭示了当前多硬件生态下ML行为一致性面临的严峻挑战。

链接: https://arxiv.org/abs/2511.11601
作者: Elliott Wen,Sean Ma,Ewan Tempero,Jens Dietrich,Daniel Luo,Jiaxing Shen,Kaiqi Zhao,Bruce Sham,Yousong Song,Jiayi Hua,Jia Hong
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:While NVIDIA remains the dominant provider of AI accelerators within cloud data center, emerging vendors such as AMD, Intel, Mac, and Huawei offer cost-effective alternatives with claims of compatibility and performance. This paper presents the first empirical study investigating divergence in machine learning model across heterogeneous AI accelerators. Utilizing an automated pipeline, we synthesize over 100,000 variant models derived from 4,000 real-world models and execute them across five different enterprise-grade accelerators. Our findings suggest that newer AI platforms from Mac and Huawei support at least 17% fewer operators than NVIDIA. These platforms also exhibit a higher rate of output discrepancies (exceeding 5%), which stem from differences in operator implementations, handling of exceptional numerical values, and instruction scheduling. They are also more susceptible to failures during model compilation-based acceleration, and in some cases, the compiled models produce outputs that differ noticeably from those generated using the standard execution mode. In addition, we identify 7 implementation flaws in PyTorch and 40 platform-specific issues across vendors. These results underscore the challenges of achieving consistent machine learning behavior in an increasingly diverse hardware ecosystem.
zh

[AI-235] CausalGuard: A Smart System for Detecting and Preventing False Information in Large Language Models

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)中存在的“幻觉”(hallucination)问题,即模型在生成内容时自信地输出看似合理但事实上错误的信息,这已成为其在高精度要求场景中应用的主要障碍。现有解决方案或需重新训练整个模型、增加显著计算开销,或未能触及幻觉产生的根本原因。本文提出CausalGuard,其核心创新在于结合因果推理(causal reasoning)与符号逻辑(symbolic logic),从源头上识别并阻止幻觉的发生:一方面通过追踪模型知识与生成内容之间的因果链来干预错误决策路径,另一方面利用自动化推理验证逻辑一致性。该方法在12个基准测试中实现了89.3%的幻觉识别准确率和仅8.3%的漏检率,并将虚假陈述减少近80%,同时保持输出自然性和有用性,尤其适用于医疗诊断等需可解释性的关键领域。

链接: https://arxiv.org/abs/2511.11600
作者: Piyushkumar Patel
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:While large language models have transformed how we interact with AI systems, they have a critical weakness: they confidently state false information that sounds entirely plausible. This “hallucination” problem has become a major barrier to using these models where accuracy matters most. Existing solutions either require retraining the entire model, add significant computational costs, or miss the root causes of why these hallucinations occur in the first place. We present CausalGuard, a new approach that combines causal reasoning with symbolic logic to catch and prevent hallucinations as they happen. Unlike previous methods that only check outputs after generation, our system understands the causal chain that leads to false statements and intervenes early in the process. CausalGuard works through two complementary paths: one that traces causal relationships between what the model knows and what it generates, and another that checks logical consistency using automated reasoning. Testing across twelve different benchmarks, we found that CausalGuard correctly identifies hallucinations 89.3% of the time while missing only 8.3% of actual hallucinations. More importantly, it reduces false claims by nearly 80% while keeping responses natural and helpful. The system performs especially well on complex reasoning tasks where multiple steps of logic are required. Because CausalGuard shows its reasoning process, it works well in sensitive areas like medical diagnosis or financial analysis where understanding why a decision was made matters as much as the decision itself. Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR) Cite as: arXiv:2511.11600 [cs.AI] (or arXiv:2511.11600v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2511.11600 Focus to learn more arXiv-issued DOI via DataCite
zh

[AI-236] Loss Given Default Prediction Under Measurement-Induced Mixture Distributions: An Information-Theoretic Approach

【速读】:该论文旨在解决损失给定违约(Loss Given Default, LGD)建模中因数据质量问题导致的模型性能失效问题,特别是当训练数据主要由违约前资产负债表估算的代理值构成(占90%),而非真实破产清算后的回收结果时,传统递归分割方法(如随机森林)会因混合污染结构而系统性失效。其解决方案的关键在于引入基于信息论的方法,利用香农熵(Shannon entropy)和互信息(mutual information)构建特征选择与建模框架,从而在仅有1,218个企业破产案例(1980–2023年)的数据条件下实现更优泛化能力(r²=0.191,RMSE=0.284),并揭示杠杆类特征(含1.510比特互信息)对LGD预测的重要性远超规模效应(仅0.086比特),挑战了监管机构关于回收率与企业规模相关的假设。

链接: https://arxiv.org/abs/2511.11596
作者: Javier Marín
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Loss Given Default (LGD) modeling faces a fundamental data quality constraint: 90% of available training data consists of proxy estimates based on pre-distress balance sheets rather than actual recovery outcomes from completed bankruptcy proceedings. We demonstrate that this mixture-contaminated training structure causes systematic failure of recursive partitioning methods, with Random Forest achieving negative r-squared (-0.664, worse than predicting the mean) on held-out test data. Information-theoretic approaches based on Shannon entropy and mutual information provide superior generalization, achieving r-squared of 0.191 and RMSE of 0.284 on 1,218 corporate bankruptcies (1980-2023). Analysis reveals that leverage-based features contain 1.510 bits of mutual information while size effects contribute only 0.086 bits, contradicting regulatory assumptions about scale-dependent recovery. These results establish practical guidance for financial institutions deploying LGD models under Basel III requirements when representative outcome data is unavailable at sufficient scale. The findings generalize to medical outcomes research, climate forecasting, and technology reliability-domains where extended observation periods create unavoidable mixture structure in training data.
zh

[AI-237] Decision-Making Amid Information-Based Threats in Sociotechnical Systems: A Review

【速读】:该论文试图解决的问题是:在技术系统日益中介人类信息交换的背景下,信息传播的规模和依赖性显著扩大了基于信息的影响范围,从而对个体与组织的决策过程构成潜在威胁;然而当前针对信息型威胁的研究与人类信息处理的基础研究之间存在割裂,难以系统性地识别和应对这些风险。解决方案的关键在于:通过整合信息型威胁研究与人类信息处理机制的成果,识别出共同的认知机制——这些机制既决定了个体对信息型威胁的脆弱性,也塑造了最终的行为结果,并据此提出未来应加强跨领域融合研究的方向,以增强人类对信息威胁的抵御能力并实现人机认知表征的协同优化。

链接: https://arxiv.org/abs/2511.11595
作者: Aaron R. Allred,Erin E. Richardson,Sarah R. Bostrom,James Crum,Cara Spencer,Chad Tossell,Richard E. Niemeyer,Leanne Hirshfield,Allison P.A. Hayman
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Technological systems increasingly mediate human information exchange, spanning interactions among humans as well as between humans and artificial agents. The unprecedented scale and reliance on information disseminated through these systems substantially expand the scope of information-based influence that can both enable and undermine sound decision-making. Consequently, understanding and protecting decision-making today faces growing challenges, as individuals and organizations must navigate evolving opportunities and information-based threats across varied domains and information environments. While these risks are widely recognized, research remains fragmented: work evaluating information-based threat phenomena has progressed largely in isolation from foundational studies of human information processing. In this review, we synthesize insights from both domains to identify shared cognitive mechanisms that mediate vulnerability to information-based threats and shape behavioral outcomes. Finally, we outline directions for future research aimed at integrating these perspectives, emphasizing the importance of such integration for mitigating human vulnerabilities and aligning human-machine representations.
zh

[AI-238] Sound Logical Explanations for Mean Aggregation Graph Neural Networks NEURIPS2025

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在知识图谱补全任务中缺乏可解释性与表达能力理论保障的问题,特别是针对使用均值聚合(mean aggregation)且权重非负的GNN模型(MAGNNs)。其关键解决方案是:首先严格证明了此类模型所能表达的单调逻辑规则的精确类别,并进一步提出一种受限的一阶逻辑片段用于解释任意MAGNN预测;实验表明,限制权重为非负值不仅保持或提升模型性能,还能生成实际可用的、符合逻辑规则的解释,并揭示训练模型中的潜在问题。

链接: https://arxiv.org/abs/2511.11593
作者: Matthew Morris,Ian Horrocks
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: Full version (with appendices) of paper accepted to NeurIPS 2025 (The Thirty-Ninth Annual Conference on Neural Information Processing Systems)

点击查看摘要

Abstract:Graph neural networks (GNNs) are frequently used for knowledge graph completion. Their black-box nature has motivated work that uses sound logical rules to explain predictions and characterise their expressivity. However, despite the prevalence of GNNs that use mean as an aggregation function, explainability and expressivity results are lacking for them. We consider GNNs with mean aggregation and non-negative weights (MAGNNs), proving the precise class of monotonic rules that can be sound for them, as well as providing a restricted fragment of first-order logic to explain any MAGNN prediction. Our experiments show that restricting mean-aggregation GNNs to have non-negative weights yields comparable or improved performance on standard inductive benchmarks, that sound rules are obtained in practice, that insightful explanations can be generated in practice, and that the sound rules can expose issues in the trained models.
zh

[AI-239] Mind Your Entropy: From Maximum Entropy to Trajectory Entropy-Constrained RL

【速读】:该论文旨在解决最大熵强化学习(Maximum Entropy Reinforcement Learning)框架中存在的两个瓶颈问题:一是由于同时注入熵并更新温度参数导致的Q值估计非平稳性;二是仅基于单步熵进行局部调参,忽略了累积熵对策略长期随机性的影响。解决方案的关键在于提出轨迹熵约束强化学习(Trajectory Entropy-Constrained Reinforcement Learning, TECRL)框架:首先分离学习两个Q函数——分别对应奖励和熵,从而确保价值目标不受温度更新干扰;其次引入专门用于量化预期累积熵的熵Q函数,实现对轨迹熵的约束,进而控制策略的长期随机性。在此基础上,作者进一步设计了DSAC-E算法,通过扩展分布软演员评论家(Distributional Soft Actor-Critic, DSAC)并引入三项改进,显著提升了性能与稳定性。

链接: https://arxiv.org/abs/2511.11592
作者: Guojian Zhan,Likun Wang,Pengcheng Wang,Feihong Zhang,Jingliang Duan,Masayoshi Tomizuka,Shengbo Eben Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 17 pages

点击查看摘要

Abstract:Maximum entropy has become a mainstream off-policy reinforcement learning (RL) framework for balancing exploitation and exploration. However, two bottlenecks still limit further performance improvement: (1) non-stationary Q-value estimation caused by jointly injecting entropy and updating its weighting parameter, i.e., temperature; and (2) short-sighted local entropy tuning that adjusts temperature only according to the current single-step entropy, without considering the effect of cumulative entropy over time. In this paper, we extends maximum entropy framework by proposing a trajectory entropy-constrained reinforcement learning (TECRL) framework to address these two challenges. Within this framework, we first separately learn two Q-functions, one associated with reward and the other with entropy, ensuring clean and stable value targets unaffected by temperature updates. Then, the dedicated entropy Q-function, explicitly quantifying the expected cumulative entropy, enables us to enforce a trajectory entropy constraint and consequently control the policy long-term stochasticity. Building on this TECRL framework, we develop a practical off-policy algorithm, DSAC-E, by extending the state-of-the-art distributional soft actor-critic with three refinements (DSAC-T). Empirical results on the OpenAI Gym benchmark demonstrate that our DSAC-E can achieve higher returns and better stability.
zh

[AI-240] Embedding Explainable AI in NHS Clinical Safety: The Explainability-Enabled Clinical Safety Framework (ECSF)

【速读】:该论文旨在解决当前临床安全标准(如DCB0129和DCB0160)在应对生成式AI(Generative AI)等具有概率性和自适应行为的智能系统时,缺乏明确规范以证明其可解释性(explainability)、模型漂移(model drift)及透明度的问题。解决方案的关键在于提出一个“可解释性赋能的临床安全框架”(Explainability-Enabled Clinical Safety Framework, ECSF),该框架将可解释性深度集成至DCB0129/0160生命周期中,使临床安全官能够直接利用可解释性输出作为结构化的安全证据,而无需改变现有合规路径;ECSF进一步定义了五个关键检查点(global transparency、case-level interpretability、clinician usability、traceable decision pathways、longitudinal interpretability monitoring),并映射SHAP、LIME、Integrated Gradients等可解释技术与DCB文档之间的对应关系,从而实现从确定性风险治理向AI特性适配的安全保障范式转变,支撑GMLP、欧盟AI法案(EU AI Act)及NHS AI Assurance原则的一致性。

链接: https://arxiv.org/abs/2511.11590
作者: Robert Gigiu
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 33 pages, 5 figures

点击查看摘要

Abstract:Artificial intelligence (AI) is increasingly embedded in NHS workflows, but its probabilistic and adaptive behaviour conflicts with the deterministic assumptions underpinning existing clinical-safety standards. DCB0129 and DCB0160 provide strong governance for conventional software yet do not define how AI-specific transparency, interpretability, or model drift should be evidenced within Safety Cases, Hazard Logs, or post-market monitoring. This paper proposes an Explainability-Enabled Clinical Safety Framework (ECSF) that integrates explainability into the DCB0129/0160 lifecycle, enabling Clinical Safety Officers to use interpretability outputs as structured safety evidence without altering compliance pathways. A cross-regulatory synthesis mapped DCB clauses to principles from Good Machine Learning Practice, the NHS AI Assurance and T.E.S.T. frameworks, and the EU AI Act. The resulting matrix links regulatory clauses, principles, ECSF checkpoints, and suitable explainability outputs. ECSF introduces five checkpoints: global transparency for hazard identification, case-level interpretability for verification, clinician usability for evaluation, traceable decision pathways for risk control, and longitudinal interpretability monitoring for post-market surveillance. Techniques such as SHAP, LIME, Integrated Gradients, saliency mapping, and attention visualisation are mapped to corresponding DCB artefacts. ECSF reframes explainability as a core element of clinical-safety assurance, bridging deterministic risk governance with the probabilistic behaviour of AI and supporting alignment with GMLP, the EU AI Act, and NHS AI Assurance principles.
zh

[AI-241] MedBuild AI: An Agent -Based Hybrid Intelligence Framework for Reshaping Agency in Healthcare Infrastructure Planning through Generative Design for Medical Architecture

【速读】:该论文旨在解决全球医疗基础设施分布不均的问题,特别是在资源匮乏地区缺乏基本医疗服务设施的现状。传统医疗建筑设计与规划过程效率低、覆盖范围有限,难以满足大规模、紧迫的需求。解决方案的关键在于提出了一种名为MedBuild AI的混合智能框架,该框架将大语言模型(Large Language Models, LLMs)与确定性专家系统相结合,在早期设计和概念规划阶段实现智能化重构。其核心机制通过三个代理(agent)协同工作:首先通过对话式交互获取本地健康需求信息,其次基于规则计算转化为建筑功能方案,最后生成模块化、低技术、低成本的平面布局与三维模型。该方法借助计算协商机制,推动了包容性与公平性的医疗建筑设计流程,赋予社区自主权并重塑全球医疗建筑领域的决策结构。

链接: https://arxiv.org/abs/2511.11587
作者: Yiming Zhang,Yuejia Xu,Ziyao Wang,Xin Yan,Xiaosai Hao
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Graphics (cs.GR); Multiagent Systems (cs.MA)
备注: 25 pages, 16 figures. Submitted to the IJAC Special Issue “Rebalance and Reciprocity”

点击查看摘要

Abstract:Globally, disparities in healthcare infrastructure remain stark, leaving countless communities without access to even basic services. Traditional infrastructure planning is often slow and inaccessible, and although many architects are actively delivering humanitarian and aid-driven hospital projects worldwide, these vital efforts still fall far short of the sheer scale and urgency of demand. This paper introduces MedBuild AI, a hybrid-intelligence framework that integrates large language models (LLMs) with deterministic expert systems to rebalance the early design and conceptual planning stages. As a web-based platform, it enables any region with satellite internet access to obtain guidance on modular, low-tech, low-cost medical building designs. The system operates through three agents: the first gathers local health intelligence via conversational interaction; the second translates this input into an architectural functional program through rule-based computation; and the third generates layouts and 3D models. By embedding computational negotiation into the design process, MedBuild AI fosters a reciprocal, inclusive, and equitable approach to healthcare planning, empowering communities and redefining agency in global healthcare architecture.
zh

[AI-242] Output Supervision Can Obfuscate the Chain of Thought

【速读】:该论文试图解决的问题是:在使用链式思维(Chain of Thought, CoT)监控器进行训练时,模型可能生成“混淆的CoT”(obfuscated CoTs),即表面上看似安全但实际上包含不良行为的推理过程,从而规避监控器的检测。为应对这一问题,作者提出通过仅基于不访问CoT的输出监控器进行训练来提升可监控性。然而,本文进一步揭示了即使在这种受限训练下,仍存在两种机制可能导致混淆CoT的产生:一是模型在生成安全输出的过程中泛化出看似安全的CoT;二是由于token之间的条件依赖关系,安全外观的CoT会提高安全输出的概率,进而被强化。解决方案的关键在于引入两种缓解机制,分别针对上述两类问题,最终实现了在监控能力和任务性能之间优于常规训练的帕累托改进(Pareto improvement)。

链接: https://arxiv.org/abs/2511.11584
作者: Jacob Drori,Luke Marks,Bryce Woodworth,Alex Cloud,Alexander Matt Turner
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:OpenAI (2025) showed that training against a chain of thought (CoT) monitor can cause obfuscated CoTs, which contain bad behavior the monitor cannot detect. They proposed to keep CoTs monitorable by training only against output monitors that do not have access to CoT. We show that such training can still cause obfuscated CoTs via two mechanisms. First, when a model is trained to produce a safe-looking output, that model may generalize to making its CoTs look safe. Second, since later tokens are conditioned on earlier ones, safe-looking CoTs may increase the likelihood of safe outputs, causing safe-looking CoTs to be reinforced. We introduce two mitigations to address these two issues, which achieve a Pareto improvement in terms of monitorability and task performance compared to regular training.
zh

[AI-243] Parallel and Multi-Stage Knowledge Graph Retrieval for Behaviorally Aligned Financial Asset Recommendations

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在个性化金融推荐中面临的三大挑战:上下文长度限制、幻觉问题以及缺乏行为依据。为应对这些问题,作者提出RAG-FLARKO,其核心解决方案在于引入检索增强生成(Retrieval-Augmented Generation, RAG)机制,通过多阶段并行的知识图谱(Knowledge Graph, KG)检索流程实现高效、精准的信息提取。具体而言,系统首先从用户交易知识图谱中检索行为相关的实体,再利用该上下文过滤市场知识图谱中的时序一致信号,构建一个紧凑且行为 grounded 的子图供LLM使用。此方法显著降低上下文开销,提升推荐的相关性与准确性,并使小型模型也能在盈利性和行为一致性上达到高表现,从而为资源受限环境下的可部署金融AI提供可行路径。

链接: https://arxiv.org/abs/2511.11583
作者: Fernando Spadea,Oshani Seneviratne
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 10 pages, 3 figures, RAGE-KG 2025

点击查看摘要

Abstract:Large language models (LLMs) show promise for personalized financial recommendations but are hampered by context limits, hallucinations, and a lack of behavioral grounding. Our prior work, FLARKO, embedded structured knowledge graphs (KGs) in LLM prompts to align advice with user behavior and market data. This paper introduces RAG-FLARKO, a retrieval-augmented extension to FLARKO, that overcomes scalability and relevance challenges using multi-stage and parallel KG retrieval processes. Our method first retrieves behaviorally relevant entities from a user’s transaction KG and then uses this context to filter temporally consistent signals from a market KG, constructing a compact, grounded subgraph for the LLM. This pipeline reduces context overhead and sharpens the model’s focus on relevant information. Empirical evaluation on a real-world financial transaction dataset demonstrates that RAG-FLARKO significantly enhances recommendation quality. Notably, our framework enables smaller, more efficient models to achieve high performance in both profitability and behavioral alignment, presenting a viable path for deploying grounded financial AI in resource-constrained environments.
zh

[AI-244] DAOpt: Modeling and Evaluation of Data-Driven Optimization under Uncertainty with LLM s

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在不确定环境下的优化建模能力不足的问题,即现有研究主要聚焦于参数已知的确定性优化问题,而忽视了现实决策中普遍存在的不确定性。其解决方案的关键在于提出DAOpt框架,包含一个名为OptU的新数据集、一个多智能体决策模块以及用于评估LLMs在样本外可行性与鲁棒性方面的仿真环境,并通过引入来自随机优化和鲁棒优化领域的领域知识进行少样本学习,从而显著增强LLMs在不确定条件下的建模能力。

链接: https://arxiv.org/abs/2511.11576
作者: WenZhuo Zhu,Zheng Cui,Wenhan Lu,Sheng Liu,Yue Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have accelerated research on automated optimization modeling. While real-world decision-making is inherently uncertain, most existing work has focused on deterministic optimization with known parameters, leaving the application of LLMs in uncertain settings largely unexplored. To that end, we propose the DAOpt framework including a new dataset OptU, a multi-agent decision-making module, and a simulation environment for evaluating LLMs with a focus on out-of-sample feasibility and robustness. Additionally, we enhance LLMs’ modeling capabilities by incorporating few-shot learning with domain knowledge from stochastic and robust optimization.
zh

[AI-245] NuBench: An Open Benchmark for Deep Learning-Based Event Reconstruction in Neutrino Telescopes

【速读】:该论文旨在解决中微子望远镜中事件重建(event reconstruction)这一核心逆问题,即如何从探测到的切伦科夫光信号中准确推断入射中微子的能量、方向、相互作用类型等物理属性。传统方法在精度和效率上存在局限,而本文提出的关键解决方案是构建NuBench——一个面向深度学习的开放基准平台,包含七个大规模模拟数据集(近1.3亿个带电流和中性流μ中微子相互作用事件),覆盖10 GeV至100 TeV能区及六种不同探测器几何结构(水和冰环境)。该平台支持脉冲级与事件级信息建模,为跨实验比较和开发新一代机器学习重建算法(如ParticleNeT、DynEdge、GRIT和DeepIce)提供了标准化的数据基础与评估框架。

链接: https://arxiv.org/abs/2511.13111
作者: Rasmus F. Orsoe,Stephan Meighen-Berger,Jeffrey Lazar,Jorge Prado,Ivan Mozun-Mateo,Aske Rosted,Philip Weigel,Arturo Llorente Anaya
机构: 未知
类目: High Energy Physics - Experiment (hep-ex); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Instrumentation and Detectors (physics.ins-det)
备注: Prepared for JINST

点击查看摘要

Abstract:Neutrino telescopes are large-scale detectors designed to observe Cherenkov radiation produced from neutrino interactions in water or ice. They exist to identify extraterrestrial neutrino sources and to probe fundamental questions pertaining to the elusive neutrino itself. A central challenge common across neutrino telescopes is to solve a series of inverse problems known as event reconstruction, which seeks to resolve properties of the incident neutrino, based on the detected Cherenkov light. In recent times, significant efforts have been made in adapting advances from deep learning research to event reconstruction, as such techniques provide several benefits over traditional methods. While a large degree of similarity in reconstruction needs and low-level data exists, cross-experimental collaboration has been hindered by a lack of diverse open-source datasets for comparing methods. We present NuBench, an open benchmark for deep learning-based event reconstruction in neutrino telescopes. NuBench comprises seven large-scale simulated datasets containing nearly 130 million charged- and neutral-current muon-neutrino interactions spanning 10 GeV to 100 TeV, generated across six detector geometries inspired by existing and proposed experiments. These datasets provide pulse- and event-level information suitable for developing and comparing machine-learning reconstruction methods in both water and ice environments. Using NuBench, we evaluate four reconstruction algorithms - ParticleNeT and DynEdge, both actively used within the KM3NeT and IceCube collaborations, respectively, along with GRIT and DeepIce - on up to five core tasks: energy and direction reconstruction, topology classification, interaction vertex prediction, and inelasticity estimation. Comments: Prepared for JINST Subjects: High Energy Physics - Experiment (hep-ex); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Instrumentation and Detectors (physics.ins-det) Cite as: arXiv:2511.13111 [hep-ex] (or arXiv:2511.13111v1 [hep-ex] for this version) https://doi.org/10.48550/arXiv.2511.13111 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-246] Knowledge is Overrated: A zero-knowledge machine learning and cryptographic hashing-based framework for verifiable low latency inference at the LHC NEURIPS2025

【速读】:该论文旨在解决大型强子对撞机(LHC)在线触发系统中因现代机器学习(ML)模型推理延迟过高而无法满足40 MHz实时处理需求的问题。解决方案的关键在于提出了一种名为PHAZE的新框架,其核心是基于哈希(hashing)和零知识机器学习(zero-knowledge machine learning, zkML)等密码学技术,实现从任意大型基线模型中通过可验证的早期退出机制(certifiable, early-exit mechanism)进行低延迟推理,从而在纳秒级延迟下完成事件选择,同时具备内置异常检测能力,并为未来实现动态低级触发(dynamic low-level trigger)提供可能。

链接: https://arxiv.org/abs/2511.12592
作者: Pratik Jawahar,Caterina Doglioni,Maurizio Pierini
机构: 未知
类目: High Energy Physics - Experiment (hep-ex); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: ML4PS NeurIPS 2025

点击查看摘要

Abstract:Low latency event-selection (trigger) algorithms are essential components of Large Hadron Collider (LHC) operation. Modern machine learning (ML) models have shown great offline performance as classifiers and could improve trigger performance, thereby improving downstream physics analyses. However, inference on such large models does not satisfy the 40\textMHz online latency constraint at the LHC. In this work, we propose \textttPHAZE, a novel framework built on cryptographic techniques like hashing and zero-knowledge machine learning (zkML) to achieve low latency inference, via a certifiable, early-exit mechanism from an arbitrarily large baseline model. We lay the foundations for such a framework to achieve nanosecond-order latency and discuss its inherent advantages, such as built-in anomaly detection, within the scope of LHC triggers, as well as its potential to enable a dynamic low-level trigger in the future.
zh

[AI-247] Quantum Optimization Algorithms

【速读】:该论文旨在解决如何在噪声中等规模量子(NISQ)设备上实现高效量子优化问题的计算,特别是针对组合优化问题如最大割(Max-Cut)问题。其核心解决方案是基于量子近似优化算法(Quantum Approximate Optimization Algorithm, QAOA),该算法通过构建可变参数的量子电路来逼近最优解,并利用参数移位规则(parameter shift rule)进行梯度优化训练。关键创新在于引入Grover混频器(Grover mixer)以约束搜索空间至合法解集,从而提升算法效率;同时将QAOA扩展为更通用的变分量子本征求解器(Variational Quantum Eigensolver, VQE),以应对NISQ时代面临的退化平原(barren plateaus)和量子线路结构设计等挑战。

链接: https://arxiv.org/abs/2511.12379
作者: Jonas Stein,Maximilian Zorn,Leo Sünkel,Thomas Gabor
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: Preprint submitted to appear in a Springer Nature Book on Combinatorial Optimization using Quantum Computing

点击查看摘要

Abstract:Quantum optimization allows for up to exponential quantum speedups for specific, possibly industrially relevant problems. As the key algorithm in this field, we motivate and discuss the Quantum Approximate Optimization Algorithm (QAOA), which can be understood as a slightly generalized version of Quantum Annealing for gate-based quantum computers. We delve into the quantum circuit implementation of the QAOA, including Hamiltonian simulation techniques for higher-order Ising models, and discuss parameter training using the parameter shift rule. An example implementation with Pennylane source code demonstrates practical application for the Maximum Cut problem. Further, we show how constraints can be incorporated into the QAOA using Grover mixers, allowing to restrict the search space to strictly valid solutions for specific problems. Finally, we outline the Variational Quantum Eigensolver (VQE) as a generalization of the QAOA, highlighting its potential in the NISQ era and addressing challenges such as barren plateaus and ansatz design.
zh

[AI-248] Decision and Gender Biases in Large Language Models : A Behavioral Economic Perspective

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在决策过程中是否表现出理性行为,还是复现人类行为偏差的问题。其核心关切在于,尽管LLMs常被视为无偏理性的决策代理,但它们的训练数据源自人类语言语料库,可能内嵌认知与社会偏见。为验证这一假设,研究者采用行为经济学中的两个经典实验——最后通牒博弈(ultimatum game)和赌博游戏(gambling game),对Google Gemma 7B和Qwen两个先进模型在中性及性别条件提示下进行决策测试,并估计不公平厌恶(inequity aversion)和损失厌恶(loss-aversion)参数,与人类基准进行比较。解决方案的关键在于通过结构化实验设计量化模型的行为特征,揭示其虽偏离完全理性但仍保留人类行为倾向,包括适度公平关切、轻微损失厌恶以及细微的性别条件差异,从而为理解LLMs的决策机制提供实证依据。

链接: https://arxiv.org/abs/2511.12319
作者: Luca Corazzini,Elisa Deriu,Marco Guerzoni
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) increasingly mediate economic and organisational processes, from automated customer support and recruitment to investment advice and policy analysis. These systems are often assumed to embody rational decision making free from human error; yet they are trained on human language corpora that may embed cognitive and social biases. This study investigates whether advanced LLMs behave as rational agents or whether they reproduce human behavioural tendencies when faced with classic decision problems. Using two canonical experiments in behavioural economics, the ultimatum game and a gambling game, we elicit decisions from two state of the art models, Google Gemma7B and Qwen, under neutral and gender conditioned prompts. We estimate parameters of inequity aversion and loss-aversion and compare them with human benchmarks. The models display attenuated but persistent deviations from rationality, including moderate fairness concerns, mild loss aversion, and subtle gender conditioned differences.
zh

[AI-249] Reinforcement Learning for Charging Optimization of Inhomogeneous Dicke Quantum Batteries

【速读】:该论文旨在解决量子电池(quantum battery)在非均匀性(inhomogeneity)和部分可观测性(partial observability)条件下充电优化的难题。其核心挑战在于如何在信息获取受限的情况下设计高效的充电策略,以最大化电池的可用能量(ergotropy)。解决方案的关键在于采用强化学习(reinforcement learning)方法,学习分段恒定(piecewise-constant)的充电策略,并系统比较四种可观测性场景下的性能表现:从全状态访问到仅能观测单个两能级系统(two-level system, TLS)能量、一阶平均值及二阶关联。结果表明,尽管部分可观测性会导致性能下降,但引入二阶相关性可显著恢复性能,使充电效率达到全状态观测基准的94%–98%,且所学策略具有非贪婪特性,通过短期牺牲换取最终性能提升。

链接: https://arxiv.org/abs/2511.12176
作者: Xiaobin Song,Siyuan Bai,Da-Wei Wang,Hanxiao Tao,Xizhe Wang,Rebing Wu,Benben Jiang
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Charging optimization is a key challenge to the implementation of quantum batteries, particularly under inhomogeneity and partial observability. This paper employs reinforcement learning to optimize piecewise-constant charging policies for an inhomogeneous Dicke battery. We systematically compare policies across four observability regimes, from full-state access to experimentally accessible observables (energies of individual two-level systems (TLSs), first-order averages, and second-order correlations). Simulation results demonstrate that full observability yields near-optimal ergotropy with low variability, while under partial observability, access to only single-TLS energies or energies plus first-order averages lags behind the fully observed baseline. However, augmenting partial observations with second-order correlations recovers most of the gap, reaching 94%-98% of the full-state baseline. The learned schedules are nonmyopic, trading temporary plateaus or declines for superior terminal outcomes. These findings highlight a practical route to effective fast-charging protocols under realistic information constraints.
zh

[AI-250] mporal Micro-Doppler Spectrogram-based ViT Multiclass Target Classification

【速读】:该论文旨在解决毫米波调频连续波(FMCW)雷达微多普勒谱图(micro-Doppler spectrogram, MDS)在多类目标分类任务中面临的挑战,尤其是如何有效建模时序特征并提升在目标重叠与部分遮挡情况下的可分性。其解决方案的关键在于提出了一种基于Transformer架构的时空多维感知视觉变换器(Temporal MDS-Vision Transformer, T-MDS-ViT),通过将范围-速度-角度(RVA)空间张量以补丁嵌入方式输入,并引入跨轴注意力机制显式捕捉多帧MDS数据的时序依赖关系;同时,在注意力层中嵌入运动感知约束(mobility-aware constraints),增强模型对复杂场景下目标间相互干扰的鲁棒性,从而实现高精度、高数据效率和实时部署能力的分类性能。

链接: https://arxiv.org/abs/2511.11951
作者: Nghia Thinh Nguyen,Tri Nhu Do
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In this paper, we propose a new Temporal MDS-Vision Transformer (T-MDS-ViT) for multiclass target classification using millimeter-wave FMCW radar micro-Doppler spectrograms. Specifically, we design a transformer-based architecture that processes stacked range-velocity-angle (RVA) spatiotemporal tensors via patch embeddings and cross-axis attention mechanisms to explicitly model the sequential nature of MDS data across multiple frames. The T-MDS-ViT exploits mobility-aware constraints in its attention layer correspondences to maintain separability under target overlaps and partial occlusions. Next, we apply an explainable mechanism to examine how the attention layers focus on characteristic high-energy regions of the MDS representations and their effect on class-specific kinematic features. We also demonstrate that our proposed framework is superior to existing CNN-based methods in terms of classification accuracy while achieving better data efficiency and real-time deployability.
zh

[AI-251] AI-Open-RAN for Non-Terrestrial Networks

【速读】:该论文旨在解决非地面网络(Non-Terrestrial Networks, NTN)中下一代通信系统在互操作性、灵活性和智能化方面的挑战,特别是在移动场景下对关键性能指标(Key Performance Indicators, KPIs)预测的准确性问题。其解决方案的关键在于提出一种统一的全集成式无线接入网(All-in-One Radio Access Network, AIO-RAN-NTN)架构,该架构基于开放接口与人工智能(AI)驱动的功能,融合了Open-RAN与AI-RAN的技术优势,并利用3GPP标准中的内部及空口接口适配NTN环境。实验表明,尽管AIO-RAN架构在低速移动时仍对移动性敏感,但通过AI模型对KPI进行精准预测可有效缓解这一限制,从而提升网络智能运维能力。

链接: https://arxiv.org/abs/2511.11947
作者: Tri Nhu Do
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this paper, we propose the concept of AIO-RAN-NTN, a unified all-in-one Radio Access Network (RAN) for Non-Terrestrial Networks (NTNs), built on an open architecture that leverages open interfaces and artificial intelligence (AI)-based functionalities. This approach advances interoperability, flexibility, and intelligence in next-generation telecommunications. First, we provide a concise overview of the state-of-the-art architectures for Open-RAN and AI-RAN, highlighting key network functions and infrastructure elements. Next, we introduce our integrated AIO-RAN-NTN blueprint, emphasizing how internal and air interfaces from AIO-RAN and the 3rd Generation Partnership Project (3GPP) can be applied to emerging environments such as NTNs. To examine the impact of mobility on AIO-RAN, we implement a testbed transmission using the OpenAirInterface platform for a standalone (SA) New Radio (NR) 5G system. We then train an AI model on realistic data to forecast key performance indicators (KPIs). Our experiments demonstrate that the AIO-based SA architecture is sensitive to mobility, even at low speeds, but this limitation can be mitigated through AI-driven KPI forecasting.
zh

[AI-252] Improving Neutrino Oscillation Measurements through Event Classification

【速读】:该论文旨在解决下一代长基线中微子振荡实验中中微子能量重建精度不足的问题,其核心挑战源于中微子-核相互作用建模的不确定性。解决方案的关键在于引入一种基于事件分类的重构策略:在能量重建前,利用监督学习技术对事例进行分类,识别其背后的相互作用类型(如准弹性散射、介子交换电流、共振产生和深度非弹性散射),从而利用不同相互作用通道在缺失能量上的系统性差异提升重建性能。该方法通过训练于标注生成器事件的数据,在跨生成器测试框架下展现出对微观物理建模偏差的鲁棒性,并在模拟的DUNE νμ消失分析中显著提升了准确性和灵敏度,为降低重建驱动的系统误差提供了可行路径。

链接: https://arxiv.org/abs/2511.11938
作者: Sebastian A. R. Ellis,Daniel C. Hackett,Shirley Weishi Li,Pedro A. N. Machado,Karla Tame-Narvaez
机构: 未知
类目: High Energy Physics - Phenomenology (hep-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
备注: 11 pages, 7 figures

点击查看摘要

Abstract:Precise neutrino energy reconstruction is essential for next-generation long-baseline oscillation experiments, yet current methods remain limited by large uncertainties in neutrino-nucleus interaction modeling. Even so, it is well established that different interaction channels produce systematically varying amounts of missing energy and therefore yield different reconstruction performance–information that standard calorimetric approaches do not exploit. We introduce a strategy that incorporates this structure by classifying events according to their underlying interaction type prior to energy reconstruction. Using supervised machine-learning techniques trained on labeled generator events, we leverage intrinsic kinematic differences among quasi-elastic scattering, meson-exchange current, resonance production, and deep-inelastic scattering processes. A cross-generator testing framework demonstrates that this classification approach is robust to microphysics mismodeling and, when applied to a simulated DUNE \nu_\mu disappearance analysis, yields improved accuracy and sensitivity. These results highlight a practical path toward reducing reconstruction-driven systematics in future oscillation measurements.
zh

[AI-253] Protein Structure Tokenization via Geometric Byte Pair Encoding

【速读】:该论文旨在解决蛋白质结构令牌化(Protein Structure Tokenizer, PST)缺乏原则性方法的问题,现有方法通常固定令牌大小或依赖连续向量码本,限制了可解释性、多尺度控制能力以及跨架构迁移性能。其解决方案的关键在于提出GeoBPE——一种基于几何约束的PST,通过将连续且噪声干扰的多尺度主链构象转化为离散的“句子”形式,同时强制执行全局几何约束。GeoBPE的核心机制包括:(i) 利用k-medoids聚类迭代识别几何对(Geo-Pair)以生成分辨率可控的层次化词汇表;(ii) 将每个Geo-Pair量化为其最近的原型;(iii) 通过可微分逆运动学优化边界粘合角,最小化SE(3)末端帧损失以减少漂移。该方法实现了10倍比特/残基压缩率、训练数据效率提升10倍,并保持测试与训练失真比稳定在1.0–1.1之间,且具备架构无关性,能有效支持从残基到功能模体乃至整蛋白层级的表示学习与生成任务。

链接: https://arxiv.org/abs/2511.11758
作者: Michael Sun,Weize Yuan,Gang Liu,Wojciech Matusik,Marinka Zitnik
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Protein structure is central to biological function, and enabling multimodal protein models requires joint reasoning over sequence, structure, and function. A key barrier is the lack of principled protein structure tokenizers (PSTs): existing approaches fix token size or rely on continuous vector codebooks, limiting interpretability, multi-scale control, and transfer across architectures. We introduce GeoBPE, a geometry-grounded PST that transforms continuous, noisy, multi-scale backbone conformations into discrete ``sentences’’ of geometry while enforcing global constraints. Analogous to byte-pair encoding, GeoBPE generates a hierarchical vocabulary of geometric primitives by iteratively (i) clustering Geo-Pair occurrences with k-medoids to yield a resolution-controllable vocabulary; (ii) quantizing each Geo-Pair to its closest medoid prototype; and (iii) reducing drift through differentiable inverse kinematics that optimizes boundary glue angles under an \mathrmSE(3) end-frame loss. GeoBPE offers compression ( 10x reduction in bits-per-residue at similar distortion rate), data efficiency ( 10x less training data), and generalization (maintains test/train distortion ratio of 1.0-1.1 ). It is architecture-agnostic: (a) its hierarchical vocabulary provides a strong inductive bias for coarsening residue-level embeddings from large PLMs into motif- and protein-level representations, consistently outperforming leading PSTs across 12 tasks and 24 test splits; (b) paired with a transformer, GeoBPE supports unconditional backbone generation via language modeling; and © tokens align with CATH functional families and support expert-interpretable case studies, offering functional meaning absent in prior PSTs. Code is available at this https URL.
zh

[AI-254] he Singularity Warfare: The metatheoretical Framework

【速读】:该论文试图解决的问题是:如何在人工智能(Artificial Intelligence, AI)与量子力学加速推动技术革命的背景下,重新理解未来冲突的本质并构建适应性军事战略框架。其解决方案的关键在于提出“奇点战争”(Singularity Warfare)概念,强调未来战场将融合物理与抽象领域,胜利取决于作战单元能否维持认知与技术上的“一致性”(coherence),同时在对手中制造“非一致性”(decoherence),从而将人类想象力与算法逻辑整合为可行动的现实。

链接: https://arxiv.org/abs/2511.11674
作者: Ridvan Bari Urcosta
机构: 未知
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces the “Singularity Warfare” concept, arguing that the accelerating pace of technological revolution, driven by artificial intelligence and quantum mechanics, is fundamentally reshaping the nature of conflict. Moving beyond traditional “Newtonian” warfare and current military doctrines, this framework posits that future battlefields will be defined by a merger of physical and abstract domains, where human imagination and algorithmic logic become a unified, actionable reality. Victory will hinge on a unit’s ability to maintain cognitive and technological “coherence” while creating “decoherence” in the adversary. The paper synthesizes theories from physics, philosophy, and futurology to provide a metatheoretical framework for understanding this paradigm shift.
zh

机器学习

[LG-0] Rare Genomic Subtype Discovery from RNA-seq via Autoencoder Embeddings and Stability-Aware Clustering

链接: https://arxiv.org/abs/2511.13705
作者: Alaa Mezghiche
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注: 16 pages

点击查看摘要

Abstract:Unsupervised learning on high-dimensional RNA-seq data can reveal molecular subtypes beyond standard labels. We combine an autoencoder-based representation with clustering and stability analysis to search for rare but reproducible genomic subtypes. On the UCI “Gene Expression Cancer RNA-Seq” dataset (801 samples, 20,531 genes; BRCA, COAD, KIRC, LUAD, PRAD), a pan-cancer analysis shows clusters aligning almost perfectly with tissue of origin (Cramer’s V = 0.887), serving as a negative control. We therefore reframe the problem within KIRC (n = 146): we select the top 2,000 highly variable genes, standardize them, train a feed-forward autoencoder (128-dimensional latent space), and run k-means for k = 2-10. While global indices favor small k, scanning k with a pre-specified discovery rule (rare 10 percent and stable with Jaccard = 0.60 across 20 seeds after Hungarian alignment) yields a simple solution at k = 5 (silhouette = 0.129, DBI = 2.045) with a rare cluster C0 (6.85 percent of patients) that is highly stable (Jaccard = 0.787). Cluster-vs-rest differential expression (Welch’s t-test, Benjamini-Hochberg FDR) identifies coherent markers. Overall, pan-cancer clustering is dominated by tissue of origin, whereas a stability-aware within-cancer approach reveals a rare, reproducible KIRC subtype.

[LG-1] Learning stochasticity: a nonparametric framework for intrinsic noise estimation

链接: https://arxiv.org/abs/2511.13701
作者: Gianluigi Pillonetto,Alberto Giaretta,Mauro Bisiacco
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Understanding the principles that govern dynamical systems is a central challenge across many scientific domains, including biology and ecology. Incomplete knowledge of nonlinear interactions and stochastic effects often renders bottom-up modeling approaches ineffective, motivating the development of methods that can discover governing equations directly from data. In such contexts, parametric models often struggle without strong prior knowledge, especially when estimating intrinsic noise. Nonetheless, incorporating stochastic effects is often essential for understanding the dynamic behavior of complex systems such as gene regulatory networks and signaling pathways. To address these challenges, we introduce Trine (Three-phase Regression for INtrinsic noisE), a nonparametric, kernel-based framework that infers state-dependent intrinsic noise from time-series data. Trine features a three-stage algorithm that com- bines analytically solvable subproblems with a structured kernel architecture that captures both abrupt noise-driven fluctuations and smooth, state-dependent changes in variance. We validate Trine on biological and ecological systems, demonstrating its ability to uncover hidden dynamics without relying on predefined parametric assumptions. Across several benchmark problems, Trine achieves performance comparable to that of an oracle. Biologically, this oracle can be viewed as an idealized observer capable of directly tracking the random fluctuations in molecular concentrations or reaction events within a cell. The Trine framework thus opens new avenues for understanding how intrinsic noise affects the behavior of complex systems.

[LG-2] Efficient Calibration for Decision Making

链接: https://arxiv.org/abs/2511.13699
作者: Parikshit Gopalan,Konstantinos Stavropoulos,Kunal Talwar,Pranay Tankala
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注: 50 pages, 3 figures

点击查看摘要

Abstract:A decision-theoretic characterization of perfect calibration is that an agent seeking to minimize a proper loss in expectation cannot improve their outcome by post-processing a perfectly calibrated predictor. Hu and Wu (FOCS’24) use this to define an approximate calibration measure called calibration decision loss ( \mathsfCDL ), which measures the maximal improvement achievable by any post-processing over any proper loss. Unfortunately, \mathsfCDL turns out to be intractable to even weakly approximate in the offline setting, given black-box access to the predictions and labels. We suggest circumventing this by restricting attention to structured families of post-processing functions K . We define the calibration decision loss relative to K , denoted \mathsfCDL_K where we consider all proper losses but restrict post-processings to a structured family K . We develop a comprehensive theory of when \mathsfCDL_K is information-theoretically and computationally tractable, and use it to prove both upper and lower bounds for natural classes K . In addition to introducing new definitions and algorithmic techniques to the theory of calibration for decision making, our results give rigorous guarantees for some widely used recalibration procedures in machine learning. Comments: 50 pages, 3 figures Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML) Cite as: arXiv:2511.13699 [cs.LG] (or arXiv:2511.13699v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.13699 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-3] Cross-Learning from Scarce Data via Multi-Task Constrained Optimization

链接: https://arxiv.org/abs/2511.13680
作者: Leopoldo Agorio,Juan Cerviño,Miguel Calvo-Fullana,Alejandro Ribeiro,Juan Andrés Bazerque
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 13 pages, 11 figures

点击查看摘要

Abstract:A learning task, understood as the problem of fitting a parametric model from supervised data, fundamentally requires the dataset to be large enough to be representative of the underlying distribution of the source. When data is limited, the learned models fail generalize to cases not seen during training. This paper introduces a multi-task \emphcross-learning framework to overcome data scarcity by jointly estimating \emphdeterministic parameters across multiple, related tasks. We formulate this joint estimation as a constrained optimization problem, where the constraints dictate the resulting similarity between the parameters of the different models, allowing the estimated parameters to differ across tasks while still combining information from multiple data sources. This framework enables knowledge transfer from tasks with abundant data to those with scarce data, leading to more accurate and reliable parameter estimates, providing a solution for scenarios where parameter inference from limited data is critical. We provide theoretical guarantees in a controlled framework with Gaussian data, and show the efficiency of our cross-learning method in applications with real data including image classification and propagation of infectious diseases.

[LG-4] -SAR: A Full-Stack Co-design for CPU-Only Ternary LLM Inference via In-Place SIMD ALU Reorganization DATE2026

链接: https://arxiv.org/abs/2511.13676
作者: Hyunwoo Oh,KyungIn Nam,Rajat Bhattacharjya,Hanning Chen,Tamoghno Das,Sanggeon Yun,Suyeon Jang,Andrew Ding,Nikil Dutt,Mohsen Imani
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: Accepted to DATE 2026

点击查看摘要

Abstract:Recent advances in LLMs have outpaced the computational and memory capacities of edge platforms that primarily employ CPUs, thereby challenging efficient and scalable deployment. While ternary quantization enables significant resource savings, existing CPU solutions rely heavily on memory-based lookup tables (LUTs) which limit scalability, and FPGA or GPU accelerators remain impractical for edge use. This paper presents T-SAR, the first framework to achieve scalable ternary LLM inference on CPUs by repurposing the SIMD register file for dynamic, in-register LUT generation with minimal hardware modifications. T-SAR eliminates memory bottlenecks and maximizes data-level parallelism, delivering 5.6-24.5x and 1.1-86.2x improvements in GEMM latency and GEMV throughput, respectively, with only 3.2% power and 1.4% area overheads in SIMD units. T-SAR achieves up to 2.5-4.9x the energy efficiency of an NVIDIA Jetson AGX Orin, establishing a practical approach for efficient LLM inference on edge platforms.

[LG-5] Scientific Data Compression and Super-Resolution Sampling

链接: https://arxiv.org/abs/2511.13675
作者: Minh Vu,Andrey Lokhov
类目: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Modern scientific simulations, observations, and large-scale experiments generate data at volumes that often exceed the limits of storage, processing, and analysis. This challenge drives the development of data reduction methods that efficiently manage massive datasets while preserving essential physical features and quantities of interest. In many scientific workflows, it is also crucial to enable data recovery from compressed representations - a task known as super-resolution - with guarantees on the preservation of key physical characteristics. A notable example is checkpointing and restarting, which is essential for long-running simulations to recover from failures, resume after interruptions, or examine intermediate results. In this work, we introduce a novel framework for scientific data compression and super-resolution, grounded in recent advances in learning exponential families. Our method preserves and quantifies uncertainty in physical quantities of interest and supports flexible trade-offs between compression ratio and reconstruction fidelity.

[LG-6] Cost-Driven Synthesis of Sound Abstract Interpreters

链接: https://arxiv.org/abs/2511.13663
作者: Qiuhan Gu,Avaljot Singh,Gagandeep Singh
类目: Programming Languages (cs.PL); Machine Learning (cs.LG)
*备注: 37 pages, 20 figures

点击查看摘要

Abstract:Constructing abstract interpreters that provide global soundness guarantees remains a major obstacle in abstract interpretation. We investigate whether modern LLMs can reduce this burden by leveraging them to synthesize sound, non-trivial abstract interpreters across multiple abstract domains in the setting of neural network verification. We formulate synthesis as a constrained optimization problem and introduce a novel mathematically grounded cost function for measuring unsoundness under strict syntactic and semantic constraints. Based on this formulation, we develop a unified framework that unifies LLM-based generation with syntactic and semantic validation and a quantitative cost-guided feedback mechanism. Empirical results demonstrate that our framework not only matches the quality of handcrafted transformers, but more importantly, discovers sound, high-precision transformers for complex nonlinear operators that are absent from existing literature.

[LG-7] FuseSampleAgg: Fused Neighbor Sampling and Aggregation for Mini-batch GNNs

链接: https://arxiv.org/abs/2511.13645
作者: Aleksandar Stanković
类目: Machine Learning (cs.LG)
*备注: 15 pages. Code and reproducibility scripts: this https URL

点击查看摘要

Abstract:We present FuseSampleAgg, a CUDA operator that fuses neighbor sampling and mean aggregation into a single pass for one and two hop GraphSAGE. By eliminating block materialization and extra kernel launches, FuseSampleAgg reduces memory traffic and overhead while preserving GraphSAGE mean semantics via saved index replay. Across the Reddit, ogbn-arxiv, and ogbn-products benchmarks (batch size 1024, automatic mixed precision enabled), we observe step time speedups up to 51x on ogbn-products, about 4x on Reddit with fanouts 10-10 and 15-10, and about 3.3x on ogbn-arxiv at larger fanouts, with peak GPU memory reductions up to 100x, 36x, and about 3.5x, respectively. The operator is deterministic, integrates with standard PyTorch optimizers, and ships with scripts that reproduce all tables and figures from CSV logs. Code and scripts are available at this https URL.

[LG-8] owards Multimodal Representation Learning in Paediatric Kidney Disease ALT

链接: https://arxiv.org/abs/2511.13637
作者: Ana Durica,John Booth,Ivana Drobnjak
类目: Machine Learning (cs.LG)
*备注: 4 pages, 3 figures. EurIPS 2025 Multimodal Representation Learning for Healthcare (MMRL4H) workshop paper

点击查看摘要

Abstract:Paediatric kidney disease varies widely in its presentation and progression, which calls for continuous monitoring of renal function. Using electronic health records collected between 2019 and 2025 at Great Ormond Street Hospital, a leading UK paediatric hospital, we explored a temporal modelling approach that integrates longitudinal laboratory sequences with demographic information. A recurrent neural model trained on these data was used to predict whether a child would record an abnormal serum creatinine value within the following thirty days. Framed as a pilot study, this work provides an initial demonstration that simple temporal representations can capture useful patterns in routine paediatric data and lays the groundwork for future multimodal extensions using additional clinical signals and more detailed renal outcomes.

[LG-9] RAC-DMVC: Reliability-Aware Contrastive Deep Multi-View Clustering under Multi-Source Noise

链接: https://arxiv.org/abs/2511.13561
作者: Shihao Dong,Yue Liu,Xiaotong Zhou,Yuhui Zheng,Huiying Xu,Xinzhong Zhu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-view clustering (MVC), which aims to separate the multi-view data into distinct clusters in an unsupervised manner, is a fundamental yet challenging task. To enhance its applicability in real-world scenarios, this paper addresses a more challenging task: MVC under multi-source noises, including missing noise and observation noise. To this end, we propose a novel framework, Reliability-Aware Contrastive Deep Multi-View Clustering (RAC-DMVC), which constructs a reliability graph to guide robust representation learning under noisy environments. Specifically, to address observation noise, we introduce a cross-view reconstruction to enhances robustness at the data level, and a reliability-aware noise contrastive learning to mitigates bias in positive and negative pairs selection caused by noisy representations. To handle missing noise, we design a dual-attention imputation to capture shared information across views while preserving view-specific features. In addition, a self-supervised cluster distillation module further refines the learned representations and improves the clustering performance. Extensive experiments on five benchmark datasets demonstrate that RAC-DMVC outperforms SOTA methods on multiple evaluation metrics and maintains excellent performance under varying ratios of noise.

[LG-10] Graph Out-of-Distribution Detection via Test-Time Calibration with Dual Dynamic Dictionaries AAAI2026 AAAI

链接: https://arxiv.org/abs/2511.13541
作者: Yue Hou,Ruomei Liu,Yingke Su,Junran Wu,Ke Xu
类目: Machine Learning (cs.LG)
*备注: Accepted by AAAI 2026 (The 40th Annual AAAI Conference on Artificial Intelligence)

点击查看摘要

Abstract:A key challenge in graph out-of-distribution (OOD) detection lies in the absence of ground-truth OOD samples during training. Existing methods are typically optimized to capture features within the in-distribution (ID) data and calculate OOD scores, which often limits pre-trained models from representing distributional boundaries, leading to unreliable OOD detection. Moreover, the latent structure of graph data is often governed by multiple underlying factors, which remains less explored. To address these challenges, we propose a novel test-time graph OOD detection method, termed BaCa, that calibrates OOD scores using dual dynamically updated dictionaries without requiring fine-tuning the pre-trained model. Specifically, BaCa estimates graphons and applies a mix-up strategy solely with test samples to generate diverse boundary-aware discriminative topologies, eliminating the need for exposing auxiliary datasets as outliers. We construct dual dynamic dictionaries via priority queues and attention mechanisms to adaptively capture latent ID and OOD representations, which are then utilized for boundary-aware OOD score calibration. To the best of our knowledge, extensive experiments on real-world datasets show that BaCa significantly outperforms existing state-of-the-art methods in OOD detection.

[LG-11] Fairness-Aware Graph Representation Learning with Limited Demographic Information

链接: https://arxiv.org/abs/2511.13540
作者: Zichong Wang,Zhipeng Yin,Liping Yang,Jun Zhuang,Rui Yu,Qingzhao Kong,Wenbin Zhang
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Ensuring fairness in Graph Neural Networks is fundamental to promoting trustworthy and socially responsible machine learning systems. In response, numerous fair graph learning methods have been proposed in recent years. However, most of them assume full access to demographic information, a requirement rarely met in practice due to privacy, legal, or regulatory restrictions. To this end, this paper introduces a novel fair graph learning framework that mitigates bias in graph learning under limited demographic information. Specifically, we propose a mechanism guided by partial demographic data to generate proxies for demographic information and design a strategy that enforces consistent node embeddings across demographic groups. In addition, we develop an adaptive confidence strategy that dynamically adjusts each node’s contribution to fairness and utility based on prediction confidence. We further provide theoretical analysis demonstrating that our framework, FairGLite, achieves provable upper bounds on group fairness metrics, offering formal guarantees for bias mitigation. Through extensive experiments on multiple datasets and fair graph learning frameworks, we demonstrate the framework’s effectiveness in both mitigating bias and maintaining model utility.

[LG-12] Mitigating Spurious Correlations in Patch-wise Tumor Classification on High-Resolution Multimodal Images

链接: https://arxiv.org/abs/2511.13527
作者: Ihab Asaad,Maha Shadaydeh,Joachim Denzler
类目: Machine Learning (cs.LG)
*备注: Accepted at EurIPS 2025 Workshop: Unifying Perspectives on Learning Biases (UPLB)

点击查看摘要

Abstract:Patch-wise multi-label classification provides an efficient alternative to full pixel-wise segmentation on high-resolution images, particularly when the objective is to determine the presence or absence of target objects within a patch rather than their precise spatial extent. This formulation substantially reduces annotation cost, simplifies training, and allows flexible patch sizing aligned with the desired level of decision granularity. In this work, we focus on a special case, patch-wise binary classification, applied to the detection of a single class of interest (tumor) on high-resolution multimodal nonlinear microscopy images. We show that, although this simplified formulation enables efficient model development, it can introduce spurious correlations between patch composition and labels: tumor patches tend to contain larger tissue regions, whereas non-tumor patches often consist mostly of background with small tissue areas. We further quantify the bias in model predictions caused by this spurious correlation, and propose to use a debiasing strategy to mitigate its effect. Specifically, we apply GERNE, a debiasing method that can be adapted to maximize worst-group accuracy (WGA). Our results show an improvement in WGA by approximately 7% compared to ERM for two different thresholds used to binarize the spurious feature. This enhancement boosts model performance on critical minority cases, such as tumor patches with small tissues and non-tumor patches with large tissues, and underscores the importance of spurious correlation-aware learning in patch-wise classification problems.

[LG-13] A Quantum Tensor Network-Based Viewpoint for Modeling and Analysis of Time Series Data

链接: https://arxiv.org/abs/2511.13514
作者: Pragatheeswaran Vipulananthan,Kamal Premaratne,Dilip Sarkar,Manohar N. Murthi
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: IEEE International Conference on Knowledge Graph (ICKG), 378-387, 2024

点击查看摘要

Abstract:Accurate uncertainty quantification is a critical challenge in machine learning. While neural networks are highly versatile and capable of learning complex patterns, they often lack interpretability due to their black box'' nature. On the other hand, probabilistic white box’’ models, though interpretable, often suffer from a significant performance gap when compared to neural networks. To address this, we propose a novel quantum physics-based white box'' method that offers both accurate uncertainty quantification and enhanced interpretability. By mapping the kernel mean embedding (KME) of a time series data vector to a reproducing kernel Hilbert space (RKHS), we construct a tensor network-inspired 1D spin chain Hamiltonian, with the KME as one of its eigen-functions or eigen-modes. We then solve the associated Schrödinger equation and apply perturbation theory to quantify uncertainty, thereby improving the interpretability of tasks performed with the quantum tensor network-based model. We demonstrate the effectiveness of this methodology, compared to state-of-the-art white box" models, in change point detection and time series clustering, providing insights into the uncertainties associated with decision-making throughout the process.

[LG-14] Quantum Machine Learning via Contrastive Training

链接: https://arxiv.org/abs/2511.13497
作者: Liudmila A. Zhukas,Vivian Ni Zhang,Qiang Miao,Qingfeng Wang,Marko Cetina,Jungsang Kim,Lawrence Carin,Christopher Monroe
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 7 figures, 20 pages total

点击查看摘要

Abstract:Quantum machine learning (QML) has attracted growing interest with the rapid parallel advances in large-scale classical machine learning and quantum technologies. Similar to classical machine learning, QML models also face challenges arising from the scarcity of labeled data, particularly as their scale and complexity increase. Here, we introduce self-supervised pretraining of quantum representations that reduces reliance on labeled data by learning invariances from unlabeled examples. We implement this paradigm on a programmable trapped-ion quantum computer, encoding images as quantum states. In situ contrastive pretraining on hardware yields a representation that, when fine-tuned, classifies image families with higher mean test accuracy and lower run-to-run variability than models trained from random initialization. Performance improvement is especially significant in regimes with limited labeled training data. We show that the learned invariances generalize beyond the pretraining image samples. Unlike prior work, our pipeline derives similarity from measured quantum overlaps and executes all training and classification stages on hardware. These results establish a label-efficient route to quantum representation learning, with direct relevance to quantum-native datasets and a clear path to larger classical inputs.

[LG-15] GREAT: Generalizable Representation Enhancement via Auxiliary Transformations for Zero-Shot Environmental Prediction

链接: https://arxiv.org/abs/2511.13469
作者: Shiyuan Luo,Chonghao Qiu,Runlong Yu,Yiqun Xie,Xiaowei Jia
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Environmental modeling faces critical challenges in predicting ecosystem dynamics across unmonitored regions due to limited and geographically imbalanced observation data. This challenge is compounded by spatial heterogeneity, causing models to learn spurious patterns that fit only local data. Unlike conventional domain generalization, environmental modeling must preserve invariant physical relationships and temporal coherence during augmentation. In this paper, we introduce Generalizable Representation Enhancement via Auxiliary Transformations (GREAT), a framework that effectively augments available datasets to improve predictions in completely unseen regions. GREAT guides the augmentation process to ensure that the original governing processes can be recovered from the augmented data, and the inclusion of the augmented data leads to improved model generalization. Specifically, GREAT learns transformation functions at multiple layers of neural networks to augment both raw environmental features and temporal influence. They are refined through a novel bi-level training process that constrains augmented data to preserve key patterns of the original source data. We demonstrate GREAT’s effectiveness on stream temperature prediction across six ecologically diverse watersheds in the eastern U.S., each containing multiple stream segments. Experimental results show that GREAT significantly outperforms existing methods in zero-shot scenarios. This work provides a practical solution for environmental applications where comprehensive monitoring is infeasible.

[LG-16] AdamX: An Adam improvement algorithm based on a novel exponential decay mechanism for the second-order moment estimate

链接: https://arxiv.org/abs/2511.13465
作者: Meng Zhu,Quan Xiao,Weidong Min
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 25 pages, 6 figures, 12 tables

点击查看摘要

Abstract:Since the 21st century, artificial intelligence has been leading a new round of industrial revolution. Under the training framework, the optimization algorithm aims to stably converge high-dimensional optimization to local and even global minima. Entering the era of large language models, although the scale of model parameters and data has increased, Adam remains the mainstream optimization algorithm. However, compared with stochastic gradient descent (SGD) based optimization algorithms, Adam is more likely to converge to non-flat minima. To address this issue, the AdamX algorithm is proposed. Its core innovation lies in the proposition of a novel type of second-order moment estimation exponential decay rate, which gradually weakens the learning step correction strength as training progresses, and degrades to SGD in the stable training period, thereby improving the stability of training in the stable period and possibly enhancing generalization ability. Experimental results show that our second-order moment estimation exponential decay rate is better than the current second-order moment estimation exponential decay rate, and AdamX can stably outperform Adam and its variants in terms of performance. Our code is open-sourced at this https URL.

[LG-17] Hardware optimization on Android for inference of AI models

链接: https://arxiv.org/abs/2511.13453
作者: Iulius Gherasim,Carlos García Sánchez
类目: Machine Learning (cs.LG); Performance (cs.PF)
*备注: 8 pages

点击查看摘要

Abstract:The pervasive integration of Artificial Intelligence models into contemporary mobile computing is notable across numerous use cases, from virtual assistants to advanced image processing. Optimizing the mobile user experience involves minimal latency and high responsiveness from deployed AI models with challenges from execution strategies that fully leverage real time constraints to the exploitation of heterogeneous hardware architecture. In this paper, we research and propose the optimal execution configurations for AI models on an Android system, focusing on two critical tasks: object detection (YOLO family) and image classification (ResNet). These configurations evaluate various model quantization schemes and the utilization of on device accelerators, specifically the GPU and NPU. Our core objective is to empirically determine the combination that achieves the best trade-off between minimal accuracy degradation and maximal inference speed-up.

[LG-18] Larger Datasets Can Be Repeated More: A Theoretical Analysis of Multi-Epoch Scaling in Linear Regression

链接: https://arxiv.org/abs/2511.13421
作者: Tingkai Yan,Haodong Wen,Binghui Li,Kairong Luo,Wenguang Chen,Kaifeng Lyu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:While data scaling laws of large language models (LLMs) have been widely examined in the one-pass regime with massive corpora, their form under limited data and repeated epochs remains largely unexplored. This paper presents a theoretical analysis of how a common workaround, training for multiple epochs on the same dataset, reshapes the data scaling laws in linear regression. Concretely, we ask: to match the performance of training on a dataset of size N for K epochs, how much larger must a dataset be if the model is trained for only one pass? We quantify this using the \textiteffective reuse rate of the data, E(K, N) , which we define as the multiplicative factor by which the dataset must grow under one-pass training to achieve the same test loss as K -epoch training. Our analysis precisely characterizes the scaling behavior of E(K, N) for SGD in linear regression under either strong convexity or Zipf-distributed data: (1) When K is small, we prove that E(K, N) \approx K , indicating that every new epoch yields a linear gain; (2) As K increases, E(K, N) plateaus at a problem-dependent value that grows with N ( \Theta(\log N) for the strongly-convex case), implying that larger datasets can be repeated more times before the marginal benefit vanishes. These theoretical findings point out a neglected factor in a recent empirical study (Muennighoff et al. (2023)), which claimed that training LLMs for up to 4 epochs results in negligible loss differences compared to using fresh data at each step, \textiti.e., E(K, N) \approx K for K \le 4 in our notation. Supported by further empirical validation with LLMs, our results reveal that the maximum K value for which E(K, N) \approx K in fact depends on the data size and distribution, and underscore the need to explicitly model both factors in future studies of scaling laws with data reuse.

[LG-19] MMWSTM-ADRAN: A Novel Hybrid Deep Learning Architecture for Enhanced Climate Time Series Forecasting and Extreme Event Prediction

链接: https://arxiv.org/abs/2511.13419
作者: Shaheen Mohammed Saleh Ahmed,Hakan Hakan Guneyli
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Accurate short-range prediction of extreme air temperature events remains a fundamental challenge in operational climate-risk management. We present Multi-Modal Weather State Transition Model with Anomaly-Driven Recurrent Attention Network Plus (MMWSTM-ADRAN+), a dual-stream deep learning architecture that couples a regime-aware dynamics model with an anomaly-focused attention mechanism to forecast daily maximum temperature and its extremes. The first stream, MMWSTM, combines bidirectional Long Short-Term Memory (BiLSTM) units with a learnable Markov state transition matrix to capture synoptic-scale weather regime changes. The second stream, ADRAN, integrates bidirectional Gated Recurrent Units (BiGRUs), multi-head self-attention, and a novel anomaly amplification layer to enhance sensitivity to low-probability signals. A lightweight attentive fusion gate adaptively determines the contribution of each stream to the final prediction. Model optimization employs a custom ExtremeWeatherLoss function that up-weights errors on the upper 5% and lower 5% of the temperature distribution, and a time-series data augmentation suite (jittering, scaling, time/magnitude warping) that effectively quadruples the training data

[LG-20] Fast and Robust Simulation-Based Inference With Optimization Monte Carlo

链接: https://arxiv.org/abs/2511.13394
作者: Vasilis Gkolemis,Christos Diou,Michael Gutmann
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Bayesian parameter inference for complex stochastic simulators is challenging due to intractable likelihood functions. Existing simulation-based inference methods often require large number of simulations and become costly to use in high-dimensional parameter spaces or in problems with partially uninformative outputs. We propose a new method for differentiable simulators that delivers accurate posterior inference with substantially reduced runtimes. Building on the Optimization Monte Carlo framework, our approach reformulates stochastic simulation as deterministic optimization problems. Gradient-based methods are then applied to efficiently navigate toward high-density posterior regions and avoid wasteful simulations in low-probability areas. A JAX-based implementation further enhances the performance through vectorization of key method components. Extensive experiments, including high-dimensional parameter spaces, uninformative outputs, multiple observations and multimodal posteriors show that our method consistently matches, and often exceeds, the accuracy of state-of-the-art approaches, while reducing the runtime by a substantial margin.

[LG-21] Uncovering Causal Drivers of Energy Efficiency for Industrial Process in Foundry via Time-Series Causal Inference

链接: https://arxiv.org/abs/2511.13389
作者: Zhipeng Ma,Bo Nørregaard Jørgensen,Zheng Grace Ma
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted by the Energy this http URL Conference 2025 (EI.A 2025)

点击查看摘要

Abstract:Improving energy efficiency in industrial foundry processes is a critical challenge, as these operations are highly energy-intensive and marked by complex interdependencies among process variables. Correlation-based analyses often fail to distinguish true causal drivers from spurious associations, limiting their usefulness for decision-making. This paper applies a time-series causal inference framework to identify the operational factors that directly affect energy efficiency in induction furnace melting. Using production data from a Danish foundry, the study integrates time-series clustering to segment melting cycles into distinct operational modes with the PCMCI+ algorithm, a state-of-the-art causal discovery method, to uncover cause-effect relationships within each mode. Across clusters, robust causal relations among energy consumption, furnace temperature, and material weight define the core drivers of efficiency, while voltage consistently influences cooling water temperature with a delayed response. Cluster-specific differences further distinguish operational regimes: efficient clusters are characterized by stable causal structures, whereas inefficient ones exhibit reinforcing feedback loops and atypical dependencies. The contributions of this study are twofold. First, it introduces an integrated clustering-causal inference pipeline as a methodological innovation for analyzing energy-intensive processes. Second, it provides actionable insights that enable foundry operators to optimize performance, reduce energy consumption, and lower emissions.

[LG-22] Statistically Accurate and Robust Generative Prediction of Rock Discontinuities with A Tabular Foundation Model

链接: https://arxiv.org/abs/2511.13339
作者: Han Meng,Gang Mei,Hong Tian,Nengxiong Xu,Jianbing Peng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Rock discontinuities critically govern the mechanical behavior and stability of rock masses. Their internal distributions remain largely unobservable and are typically inferred from surface-exposed discontinuities using generative prediction approaches. However, surface-exposed observations are inherently sparse, and existing generative prediction approaches either fail to capture the underlying complex distribution patterns or lack robustness under data-sparse conditions. Here, we proposed a simple yet robust approach for statistically accurate generative prediction of rock discontinuities by utilizing a tabular foundation model. By leveraging the powerful sample learning capability of the foundation model specifically designed for small data, our approach can effectively capture the underlying complex distribution patterns within limited measured discontinuities. Comparative experiments on ten datasets with diverse scales and distribution patterns of discontinuities demonstrate superior accuracy and robustness over conventional statistical models and deep generative approaches. This work advances quantitative characterization of rock mass structures, supporting safer and more reliable data-driven geotechnical design.

[LG-23] ab-PET: Graph-Based Positional Encodings for Tabular Transformers

链接: https://arxiv.org/abs/2511.13338
作者: Yunze Leng,Rohan Ghosh,Mehul Motani
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Supervised learning with tabular data presents unique challenges, including low data sizes, the absence of structural cues, and heterogeneous features spanning both categorical and continuous domains. Unlike vision and language tasks, where models can exploit inductive biases in the data, tabular data lacks inherent positional structure, hindering the effectiveness of self-attention mechanisms. While recent transformer-based models like TabTransformer, SAINT, and FT-Transformer (which we refer to as 3T) have shown promise on tabular data, they typically operate without leveraging structural cues such as positional encodings (PEs), as no prior structural information is usually available. In this work, we find both theoretically and empirically that structural cues, specifically PEs can be a useful tool to improve generalization performance for tabular transformers. We find that PEs impart the ability to reduce the effective rank (a form of intrinsic dimensionality) of the features, effectively simplifying the task by reducing the dimensionality of the problem, yielding improved generalization. To that end, we propose Tab-PET (PEs for Tabular Transformers), a graph-based framework for estimating and inculcating PEs into embeddings. Inspired by approaches that derive PEs from graph topology, we explore two paradigms for graph estimation: association-based and causality-based. We empirically demonstrate that graph-derived PEs significantly improve performance across 50 classification and regression datasets for 3T. Notably, association-based graphs consistently yield more stable and pronounced gains compared to causality-driven ones. Our work highlights an unexpected role of PEs in tabular transformers, revealing how they can be harnessed to improve generalization.

[LG-24] Edge-aware baselines for ogbn-proteins in PyTorch Geometric: species-wise normalization post-hoc calibration and cost-accuracy trade-offs

链接: https://arxiv.org/abs/2511.13250
作者: Aleksandar Stanković,Dejan Lisica
类目: Machine Learning (cs.LG)
*备注: 8 pages, 3 figures, 5 tables. Code and artifacts: this https URL

点击查看摘要

Abstract:We present reproducible, edge-aware baselines for ogbn-proteins in PyTorch Geometric (PyG). We study two system choices that dominate practice: (i) how 8-dimensional edge evidence is aggregated into node inputs, and (ii) how edges are used inside message passing. Our strongest baseline is GraphSAGE with sum-based edge-to-node features. We compare LayerNorm (LN), BatchNorm (BN), and a species-aware Conditional LayerNorm (CLN), and report compute cost (time, VRAM, parameters) together with accuracy (ROC-AUC) and decision quality. In our primary experimental setup (hidden size 512, 3 layers, 3 seeds), sum consistently beats mean and max; BN attains the best AUC, while CLN matches the AUC frontier with better thresholded F1. Finally, post-hoc per-label temperature scaling plus per-label thresholds substantially improves micro-F1 and expected calibration error (ECE) with negligible AUC change, and light label-correlation smoothing yields small additional gains. We release standardized artifacts and scripts used for all of the runs presented in the paper.

[LG-25] Incoherent Beliefs Inconsistent Actions in Large Language Models

链接: https://arxiv.org/abs/2511.13240
作者: Arka Pal,Teo Kitanovski,Arthur Liang,Akilesh Potti,Micah Goldblum
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Real-world tasks and environments exhibit differences from the static datasets that large language models (LLMs) are typically evaluated on. Such tasks can involve sequential interaction, requiring coherent updating of beliefs in light of new evidence, and making appropriate decisions based on those beliefs. Predicting how LLMs will perform in such dynamic environments is important, but can be tricky to determine from measurements in static settings. In this work, we examine two critical components of LLM performance: the ability of LLMs to coherently update their beliefs, and the extent to which the actions they take are consistent with those beliefs. First, we find that LLMs are largely inconsistent in how they update their beliefs; models can exhibit up to a 30% average difference between the directly elicited posterior, and the correct update of their prior. Second, we find that LLMs also often take actions which are inconsistent with the beliefs they hold. On a betting market, for example, LLMs often do not even bet in the same direction as their internally held beliefs over the underlying outcomes. We also find they have moderate self-inconsistency in how they respond to challenges by users to given answers. Finally, we show that the above properties hold even for strong models that obtain high accuracy or that are well-calibrated on the tasks at hand. Our results highlight the difficulties of predicting LLM behavior in complex real-world settings.

[LG-26] Counterfactual Explainable AI (XAI) Method for Deep Learning-Based Multivariate Time Series Classification AAAI2026

链接: https://arxiv.org/abs/2511.13237
作者: Alan G. Paredes Cetina,Kaouther Benguessoum,Raoni Lourenço,Sylvain Kubler
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted in AAAI 2026 Technical Main Track

点击查看摘要

Abstract:Recent advances in deep learning have improved multivariate time series (MTS) classification and regression by capturing complex patterns, but their lack of transparency hinders decision-making. Explainable AI (XAI) methods offer partial insights, yet often fall short of conveying the full decision space. Counterfactual Explanations (CE) provide a promising alternative, but current approaches typically prioritize either accuracy, proximity or sparsity – rarely all – limiting their practical value. To address this, we propose CONFETTI, a novel multi-objective CE method for MTS. CONFETTI identifies key MTS subsequences, locates a counterfactual target, and optimally modifies the time series to balance prediction confidence, proximity and sparsity. This method provides actionable insights with minimal changes, improving interpretability, and decision support. CONFETTI is evaluated on seven MTS datasets from the UEA archive, demonstrating its effectiveness in various domains. CONFETTI consistently outperforms state-of-the-art CE methods in its optimization objectives, and in six other metrics from the literature, achieving \geq10% higher confidence while improving sparsity in \geq40% .

[LG-27] MorphBoost: Self-Organizing Universal Gradient Boosting with Adaptive Tree Morphing

链接: https://arxiv.org/abs/2511.13234
作者: Boris Kriuk
类目: Machine Learning (cs.LG)
*备注: 8 pages, 5 figures

点击查看摘要

Abstract:Traditional gradient boosting algorithms employ static tree structures with fixed splitting criteria that remain unchanged throughout training, limiting their ability to adapt to evolving gradient distributions and problem-specific characteristics across different learning stages. This work introduces MorphBoost, a new gradient boosting framework featuring self-organizing tree structures that dynamically morph their splitting behavior during training. The algorithm implements adaptive split functions that evolve based on accumulated gradient statistics and iteration-dependent learning pressures, enabling automatic adjustment to problem complexity. Key innovations include: (1) morphing split criterion combining gradient-based scores with information-theoretic metrics weighted by training progress; (2) automatic problem fingerprinting for intelligent parameter configuration across binary/multiclass/regression tasks; (3) vectorized tree prediction achieving significant computational speedups; (4) interaction-aware feature importance detecting multiplicative relationships; and (5) fast-mode optimization balancing speed and accuracy. Comprehensive benchmarking across 10 diverse datasets against competitive models (XGBoost, LightGBM, GradientBoosting, HistGradientBoosting, ensemble methods) demonstrates that MorphBoost achieves state-of-the-art performance, outperforming XGBoost by 0.84% on average. MorphBoost secured the overall winner position with 4/10 dataset wins (40% win rate) and 6/30 top-3 finishes (20%), while maintaining the lowest variance (\sigma=0.0948) and highest minimum accuracy across all models, revealing superior consistency and robustness. Performance analysis across difficulty levels shows competitive results on easy datasets while achieving notable improvements on advanced problems due to higher adaptation levels.

[LG-28] Laplace Learning in Wasserstein Space

链接: https://arxiv.org/abs/2511.13229
作者: Mary Chriselda Antony Oliver,Michael Roberts,Carola-Bibiane Schönlieb,Matthew Thorpe
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 46 page, 5 figures

点击查看摘要

Abstract:The manifold hypothesis posits that high-dimensional data typically resides on low-dimensional sub spaces. In this paper, we assume manifold hypothesis to investigate graph-based semi-supervised learning methods. In particular, we examine Laplace Learning in the Wasserstein space, extending the classical notion of graph-based semi-supervised learning algorithms from finite-dimensional Euclidean spaces to an infinite-dimensional setting. To achieve this, we prove variational convergence of a discrete graph p- Dirichlet energy to its continuum counterpart. In addition, we characterize the Laplace-Beltrami operator on asubmanifold of the Wasserstein space. Finally, we validate the proposed theoretical framework through numerical experiments conducted on benchmark datasets, demonstrating the consistency of our classification performance in high-dimensional settings. Comments: 46 page, 5 figures Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML) Cite as: arXiv:2511.13229 [cs.LG] (or arXiv:2511.13229v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.13229 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Mary Chriselda Antony Oliver [view email] [v1] Mon, 17 Nov 2025 10:49:36 UTC (1,753 KB) Full-text links: Access Paper: View a PDF of the paper titled Laplace Learning in Wasserstein Space, by Mary Chriselda Antony Oliver and 3 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.LG prev | next new | recent | 2025-11 Change to browse by: cs stat stat.ML References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) IArxiv recommender toggle IArxiv Recommender (What is IArxiv?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status

[LG-29] DiffFP: Learning Behaviors from Scratch via Diffusion-based Fictitious Play IJCAI2025

链接: https://arxiv.org/abs/2511.13186
作者: Akash Karthikeyan,Yash Vardhan Pant
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Initial results presented at the IJCAI 2025 Workshop on User-Aligned Assessment of Adaptive AI Systems. Project page: this https URL

点击查看摘要

Abstract:Self-play reinforcement learning has demonstrated significant success in learning complex strategic and interactive behaviors in competitive multi-agent games. However, achieving such behaviors in continuous decision spaces remains challenging. Ensuring adaptability and generalization in self-play settings is critical for achieving competitive performance in dynamic multi-agent environments. These challenges often cause methods to converge slowly or fail to converge at all to a Nash equilibrium, making agents vulnerable to strategic exploitation by unseen opponents. To address these challenges, we propose DiffFP, a fictitious play (FP) framework that estimates the best response to unseen opponents while learning a robust and multimodal behavioral policy. Specifically, we approximate the best response using a diffusion policy that leverages generative modeling to learn adaptive and diverse strategies. Through empirical evaluation, we demonstrate that the proposed FP framework converges towards \epsilon -Nash equilibria in continuous- space zero-sum games. We validate our method on complex multi-agent environments, including racing and multi-particle zero-sum games. Simulation results show that the learned policies are robust against diverse opponents and outperform baseline reinforcement learning policies. Our approach achieves up to 3 \times faster convergence and 30 \times higher success rates on average against RL-based baselines, demonstrating its robustness to opponent strategies and stability across training iterations

[LG-30] Uncertainty-aware Physics-informed Neural Networks for Robust CARS-to-Raman Signal Reconstruction

链接: https://arxiv.org/abs/2511.13185
作者: Aishwarya Venkataramanan,Sai Karthikeya Vemuri,Adithya Ashok Chalain Valapil,Joachim Denzler
类目: Machine Learning (cs.LG)
*备注: EurIPS DiffSys workshop 2025

点击查看摘要

Abstract:Coherent anti-Stokes Raman scattering (CARS) spectroscopy is a powerful and rapid technique widely used in medicine, material science, and chemical analyses. However, its effectiveness is hindered by the presence of a non-resonant background that interferes with and distorts the true Raman signal. Deep learning methods have been employed to reconstruct the true Raman spectrum from measured CARS data using labeled datasets. A more recent development integrates the domain knowledge of Kramers-Kronig relationships and smoothness constraints in the form of physics-informed loss functions. However, these deterministic models lack the ability to quantify uncertainty, an essential feature for reliable deployment in high-stakes scientific and biomedical applications. In this work, we evaluate and compare various uncertainty quantification (UQ) techniques within the context of CARS-to-Raman signal reconstruction. Furthermore, we demonstrate that incorporating physics-informed constraints into these models improves their calibration, offering a promising path toward more trustworthy CARS data analysis.

[LG-31] Real-time distortion prediction in metallic additive manufacturing via a physics-informed neural operator approach

链接: https://arxiv.org/abs/2511.13178
作者: Mingxuan Tian,Haochen Mu,Donghong Ding,Mengjiao Li,Yuhan Ding,Jianping Zhao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the development of digital twins and smart manufacturing systems, there is an urgent need for real-time distortion field prediction to control defects in metal Additive Manufacturing (AM). However, numerical simulation methods suffer from high computational cost, long run-times that prevent real-time use, while conventional Machine learning (ML) models struggle to extract spatiotemporal features for long-horizon prediction and fail to decouple thermo-mechanical fields. This paper proposes a Physics-informed Neural Operator (PINO) to predict z and y-direction distortion for the future 15 s. Our method, Physics-informed Deep Operator Network-Recurrent Neural Network (PIDeepONet-RNN) employs trunk and branch network to process temperature history and encode distortion fields, respectively, enabling decoupling of thermo-mechanical responses. By incorporating the heat conduction equation as a soft constraint, the model ensures physical consistency and suppresses unphysical artifacts, thereby establishing a more physically consistent mapping between the thermal history and distortion. This is important because such a basis function, grounded in physical laws, provides a robust and interpretable foundation for predictions. The proposed models are trained and tested using datasets generated from experimentally validated Finite Element Method (FEM). Evaluation shows that the model achieves high accuracy, low error accumulation, time efficiency. The max absolute errors in the z and y-directions are as low as 0.9733 mm and 0.2049 mm, respectively. The error distribution shows high errors in the molten pool but low gradient norms in the deposited and key areas. The performance of PINO surrogate model highlights its potential for real-time long-horizon physics field prediction in controlling defects.

[LG-32] Warm-starting active-set solvers using graph neural networks

链接: https://arxiv.org/abs/2511.13174
作者: Ella J. Schmidtobreick,Daniel Arnström,Paul Häusner,Jens Sjölund
类目: Machine Learning (cs.LG)
*备注: Under review, 15 pages, 8 figures

点击查看摘要

Abstract:Quadratic programming (QP) solvers are widely used in real-time control and optimization, but their computational cost often limits applicability in time-critical settings. We propose a learning-to-optimize approach using graph neural networks (GNNs) to predict active sets in the dual active-set solver DAQP. The method exploits the structural properties of QPs by representing them as bipartite graphs and learning to identify the optimal active set for efficiently warm-starting the solver. Across varying problem sizes, the GNN consistently reduces the number of solver iterations compared to cold-starting, while performance is comparable to a multilayer perceptron (MLP) baseline. Furthermore, a GNN trained on varying problem sizes generalizes effectively to unseen dimensions, demonstrating flexibility and scalability. These results highlight the potential of structure-aware learning to accelerate optimization in real-time applications such as model predictive control.

[LG-33] OTARo: Once Tuning for All Precisions toward Robust On-Device LLM s

链接: https://arxiv.org/abs/2511.13147
作者: Shaoyuan Chen,Zhixuan Chen,Dawei Yang,Zhihang Yuan,Qiang Wu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) fine-tuning techniques not only improve the adaptability to diverse downstream tasks, but also mitigate adverse effects of model quantization. Despite this, conventional quantization suffers from its structural limitation that hinders flexibility during the fine-tuning and deployment stages. Practical on-device tasks demand different quantization precisions (i.e. different bit-widths), e.g., understanding tasks tend to exhibit higher tolerance to reduced precision compared to generation tasks. Conventional quantization, typically relying on scaling factors that are incompatible across bit-widths, fails to support the on-device switching of precisions when confronted with complex real-world scenarios. To overcome the dilemma, we propose OTARo, a novel method that enables on-device LLMs to flexibly switch quantization precisions while maintaining performance robustness through once fine-tuning. OTARo introduces Shared Exponent Floating Point (SEFP), a distinct quantization mechanism, to produce different bit-widths through simple mantissa truncations of a single model. Moreover, to achieve bit-width robustness in downstream applications, OTARo performs a learning process toward losses induced by different bit-widths. The method involves two critical strategies: (1) Exploitation-Exploration Bit-Width Path Search (BPS), which iteratively updates the search path via a designed scoring mechanism; (2) Low-Precision Asynchronous Accumulation (LAA), which performs asynchronous gradient accumulations and delayed updates under low bit-widths. Experiments on popular LLMs, e.g., LLaMA3.2-1B, LLaMA3-8B, demonstrate that OTARo achieves consistently strong and robust performance for all precisions.

[LG-34] Personalized Federated Learning with Bidirectional Communication Compression via One-Bit Random Sketching AAAI2026

链接: https://arxiv.org/abs/2511.13144
作者: Jiacheng Cheng,Xu Zhang,Guanghui Qiu,Yifang Zhang,Yinchuan Li,Kaiyuan Feng
类目: Machine Learning (cs.LG)
*备注: Accepted in AAAI 2026

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative training across decentralized data, but faces key challenges of bidirectional communication overhead and client-side data heterogeneity. To address communication costs while embracing data heterogeneity, we propose pFed1BS, a novel personalized federated learning framework that achieves extreme communication compression through one-bit random sketching. In personalized FL, the goal shifts from training a single global model to creating tailored models for each client. In our framework, clients transmit highly compressed one-bit sketches, and the server aggregates and broadcasts a global one-bit consensus. To enable effective personalization, we introduce a sign-based regularizer that guides local models to align with the global consensus while preserving local data characteristics. To mitigate the computational burden of random sketching, we employ the Fast Hadamard Transform for efficient projection. Theoretical analysis guarantees that our algorithm converges to a stationary neighborhood of the global potential function. Numerical simulations demonstrate that pFed1BS substantially reduces communication costs while achieving competitive performance compared to advanced communication-efficient FL algorithms.

[LG-35] Departures: Distributional Transport for Single-Cell Perturbation Prediction with Neural Schrödinger Bridges

链接: https://arxiv.org/abs/2511.13124
作者: Changxi Chi,Yufei Huang,Jun Xia,Jiangbin Zheng,Yunfan Liu,Zelin Zang,Stan Z. Li
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Predicting single-cell perturbation outcomes directly advances gene function analysis and facilitates drug candidate selection, making it a key driver of both basic and translational biomedical research. However, a major bottleneck in this task is the unpaired nature of single-cell data, as the same cell cannot be observed both before and after perturbation due to the destructive nature of sequencing. Although some neural generative transport models attempt to tackle unpaired single-cell perturbation data, they either lack explicit conditioning or depend on prior spaces for indirect distribution alignment, limiting precise perturbation modeling. In this work, we approximate Schrödinger Bridge (SB), which defines stochastic dynamic mappings recovering the entropy-regularized optimal transport (OT), to directly align the distributions of control and perturbed single-cell populations across different perturbation conditions. Unlike prior SB approximations that rely on bidirectional modeling to infer optimal source-target sample coupling, we leverage Minibatch-OT based pairing to avoid such bidirectional inference and the associated ill-posedness of defining the reverse process. This pairing directly guides bridge learning, yielding a scalable approximation to the SB. We approximate two SB models, one modeling discrete gene activation states and the other continuous expression distributions. Joint training enables accurate perturbation modeling and captures single-cell heterogeneity. Experiments on public genetic and drug perturbation datasets show that our model effectively captures heterogeneous single-cell responses and achieves state-of-the-art performance.

[LG-36] ransformer-Based Scalable Multi-Agent Reinforcement Learning for Networked Systems with Long-Range Interactions

链接: https://arxiv.org/abs/2511.13103
作者: Vidur Sinha,Muhammed Ustaomeroglu,Guannan Qu
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
*备注: 8 pages, 7 figures, submitted for review

点击查看摘要

Abstract:Multi-agent reinforcement learning (MARL) has shown promise for large-scale network control, yet existing methods face two major limitations. First, they typically rely on assumptions leading to decay properties of local agent interactions, limiting their ability to capture long-range dependencies such as cascading power failures or epidemic outbreaks. Second, most approaches lack generalizability across network topologies, requiring retraining when applied to new graphs. We introduce STACCA (Shared Transformer Actor-Critic with Counterfactual Advantage), a unified transformer-based MARL framework that addresses both challenges. STACCA employs a centralized Graph Transformer Critic to model long-range dependencies and provide system-level feedback, while its shared Graph Transformer Actor learns a generalizable policy capable of adapting across diverse network structures. Further, to improve credit assignment during training, STACCA integrates a novel counterfactual advantage estimator that is compatible with state-value critic estimates. We evaluate STACCA on epidemic containment and rumor-spreading network control tasks, demonstrating improved performance, network generalization, and scalability. These results highlight the potential of transformer-based MARL architectures to achieve scalable and generalizable control in large-scale networked systems.

[LG-37] A Smart-Glasses for Emergency Medical Services via Multimodal Multitask Learning

链接: https://arxiv.org/abs/2511.13078
作者: Liuyi Jin,Pasan Gunawardena,Amran Haroon,Runzhi Wang,Sangwoo Lee,Radu Stoleru,Michael Middleton,Zepeng Huo,Jeeeun Kim,Jason Moats
类目: Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Emergency Medical Technicians (EMTs) operate in high-pressure environments, making rapid, life-critical decisions under heavy cognitive and operational loads. We present EMSGlass, a smart-glasses system powered by EMSNet, the first multimodal multitask model for Emergency Medical Services (EMS), and EMSServe, a low-latency multimodal serving framework tailored to EMS scenarios. EMSNet integrates text, vital signs, and scene images to construct a unified real-time understanding of EMS incidents. Trained on real-world multimodal EMS datasets, EMSNet simultaneously supports up to five critical EMS tasks with superior accuracy compared to state-of-the-art unimodal baselines. Built on top of PyTorch, EMSServe introduces a modality-aware model splitter and a feature caching mechanism, achieving adaptive and efficient inference across heterogeneous hardware while addressing the challenge of asynchronous modality arrival in the field. By optimizing multimodal inference execution in EMS scenarios, EMSServe achieves 1.9x – 11.7x speedup over direct PyTorch multimodal inference. A user study evaluation with six professional EMTs demonstrates that EMSGlass enhances real-time situational awareness, decision-making speed, and operational efficiency through intuitive on-glass interaction. In addition, qualitative insights from the user study provide actionable directions for extending EMSGlass toward next-generation AI-enabled EMS systems, bridging multimodal intelligence with real-world emergency response workflows.

[LG-38] Orientation-Free Neural Network-Based Bias Estimation for Low-Cost Stationary Accelerometers

链接: https://arxiv.org/abs/2511.13071
作者: Michal Levin,Itzik Klein
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 22 pages, 10 figures

点击查看摘要

Abstract:Low-cost micro-electromechanical accelerometers are widely used in navigation, robotics, and consumer devices for motion sensing and position estimation. However, their performance is often degraded by bias errors. To eliminate deterministic bias terms a calibration procedure is applied under stationary conditions. It requires accelerom- eter leveling or complex orientation-dependent calibration procedures. To overcome those requirements, in this paper we present a model-free learning-based calibration method that estimates accelerometer bias under stationary conditions, without requiring knowledge of the sensor orientation and without the need to rotate the sensors. The proposed approach provides a fast, practical, and scalable solution suitable for rapid field deployment. Experimental validation on a 13.39-hour dataset collected from six accelerometers shows that the proposed method consistently achieves error levels more than 52% lower than traditional techniques. On a broader scale, this work contributes to the advancement of accurate calibration methods in orientation-free scenarios. As a consequence, it improves the reliability of low-cost inertial sensors in diverse scientific and industrial applications and eliminates the need for leveled calibration.

[LG-39] Self-Organization of Attractor Landscapes in High-Capacity Kernel Logistic Regression Hopfield Networks

链接: https://arxiv.org/abs/2511.13053
作者: Akira Tamamori
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 4 pages, 3 figures

点击查看摘要

Abstract:Kernel-based learning methods can dramatically increase the storage capacity of Hopfield networks, yet the dynamical mechanism behind this enhancement remains poorly understood. We address this gap by conducting a geometric analysis of the network’s energy landscape. We introduce a novel metric, Pinnacle Sharpness,'' to quantify the local stability of attractors. By systematically varying the kernel width and storage load, we uncover a rich phase diagram of attractor shapes. Our central finding is the emergence of a ridge of optimization,‘’ where the network maximizes attractor stability under challenging high-load and global-kernel conditions. Through a theoretical decomposition of the landscape gradient into a direct driving'' force and an indirect feedback’’ force, we reveal the origin of this phenomenon. The optimization ridge corresponds to a regime of strong anti-correlation between the two forces, where the direct force, amplified by the high storage load, dominates the opposing collective feedback force. This demonstrates a sophisticated self-organization mechanism: the network adaptively harnesses inter-pattern interactions as a cooperative feedback control system to sculpt a robust energy landscape. Our findings provide a new physical picture for the stability of high-capacity associative memories and offer principles for their design.

[LG-40] Generalization Bounds for Semi-supervised Matrix Completion with Distributional Side Information AAAI2026

链接: https://arxiv.org/abs/2511.13049
作者: Antoine Ledent,Mun Chong Soo,Nong Minh Hieu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at AAAI 2026

点击查看摘要

Abstract:We study a matrix completion problem where both the ground truth R matrix and the unknown sampling distribution P over observed entries are low-rank matrices, and \textitshare a common subspace. We assume that a large amount M of \textitunlabeled data drawn from the sampling distribution P is available, together with a small amount N of labeled data drawn from the same distribution and noisy estimates of the corresponding ground truth entries. This setting is inspired by recommender systems scenarios where the unlabeled data corresponds to implicit feedback' (consisting in interactions such as purchase, click, etc. ) and the labeled data corresponds to the explicit feedback’, consisting of interactions where the user has given an explicit rating to the item. Leveraging powerful results from the theory of low-rank subspace recovery, together with classic generalization bounds for matrix completion models, we show error bounds consisting of a sum of two error terms scaling as \widetildeO\left(\sqrt\fracndM\right) and \widetildeO\left(\sqrt\fracdrN\right) respectively, where d is the rank of P and r is the rank of M . In synthetic experiments, we confirm that the true generalization error naturally splits into independent error terms corresponding to the estimations of P and and the ground truth matrix \ground respectively. In real-life experiments on Douban and MovieLens with most explicit ratings removed, we demonstrate that the method can outperform baselines relying only on the explicit ratings, demonstrating that our assumptions provide a valid toy theoretical setting to study the interaction between explicit and implicit feedbacks in recommender systems.

[LG-41] Bi-View Embedding Fusion: A Hybrid Learning Approach for Knowledge Graphs Nodes Classification Addressing Problems with Limited Data

链接: https://arxiv.org/abs/2511.13044
作者: Rosario Napoli,Giovanni Lonia,Antonio Celesti,Massimo Villari,Maria Fazio
类目: Machine Learning (cs.LG)
*备注: Accepted at the 14th International Joint Conference on Knowledge Graphs (IJCKG) 2025

点击查看摘要

Abstract:Traditional Machine Learning (ML) methods require large amounts of data to perform well, limiting their applicability in sparse or incomplete scenarios and forcing the usage of additional synthetic data to improve the model training. To overcome this challenge, the research community is looking more and more at Graph Machine Learning (GML) as it offers a powerful alternative by using relationships within data. However, this method also faces limitations, particularly when dealing with Knowledge Graphs (KGs), which can hide huge information due to their semantic nature. This study introduces Bi-View, a novel hybrid approach that increases the informative content of node features in KGs to generate enhanced Graph Embeddings (GEs) that are used to improve GML models without relying on additional synthetic data. The proposed work combines two complementary GE techniques: Node2Vec, which captures structural patterns through unsupervised random walks, and GraphSAGE, which aggregates neighbourhood information in a supervised way. Node2Vec embeddings are first computed to represent the graph topology, and node features are then enriched with centrality-based metrics, which are used as input for the GraphSAGE model. Moreover, a fusion layer combines the original Node2Vec embeddings with the GraphSAGE-influenced representations, resulting in a dual-perspective embedding space. Such a fusion captures both topological and semantic properties of the graph, enabling the model to exploit informative features that may exist in the dataset but that are not explicitly represented. Our approach improves downstream task performance, especially in scenarios with poor initial features, giving the basis for more accurate and precise KG-enanched GML models.

[LG-42] Learning Time-Scale Invariant Population-Level Neural Representations NEURIPS2025

链接: https://arxiv.org/abs/2511.13022
作者: Eshani Patel,Yisong Yue,Geeling Chau
类目: Machine Learning (cs.LG)
*备注: 10 pages, 5 figures, NeurIPS 2025 Foundation Models for the Brain and Body

点击查看摘要

Abstract:General-purpose foundation models for neural time series can help accelerate neuroscientific discoveries and enable applications such as brain computer interfaces (BCIs). A key component in scaling these models is population-level representation learning, which leverages information across channels to capture spatial as well as temporal structure. Population-level approaches have recently shown that such representations can be both efficient to learn on top of pretrained temporal encoders and produce useful representations for decoding a variety of downstream tasks. However, these models remain sensitive to mismatches in preprocessing, particularly on time-scales, between pretraining and downstream settings. We systematically examine how time-scale mismatches affects generalization and find that existing representations lack invariance. To address this, we introduce Time-scale Augmented Pretraining (TSAP), which consistently improves robustness to different time-scales across decoding tasks and builds invariance in the representation space. These results highlight handling preprocessing diversity as a key step toward building generalizable neural foundation models.

[LG-43] he Final-Stage Bottleneck: A Systematic Dissection of the R-Learner for Network Causal Inference

链接: https://arxiv.org/abs/2511.13018
作者: Sairam S,Sara Girdhar,Shivam Soni
类目: Machine Learning (cs.LG)
*备注: 15 pages, 4 figures

点击查看摘要

Abstract:The R-Learner is a powerful, theoretically-grounded framework for estimating heterogeneous treatment effects, prized for its robustness to nuisance model errors. However, its application to network data, where causal heterogeneity is often graph-dependent, presents a critical challenge to its core assumption of a well-specified final-stage model. In this paper, we conduct a large-scale empirical study to systematically dissect the R-Learner framework on graphs. We provide the first rigorous evidence that the primary driver of performance is the inductive bias of the final-stage CATE estimator, an effect that dominates the choice of nuisance models. Our central finding is the quantification of a catastrophic “representation bottleneck”: we prove with overwhelming statistical significance (p 0.001) that R-Learners with a graph-blind final stage fail completely (MSE 4.0), even when paired with powerful GNN nuisance models. Conversely, our proposed end-to-end Graph R-Learner succeeds and significantly outperforms a strong, non-DML GNN T-Learner baseline. Furthermore, we identify and provide a mechanistic explanation for a subtle, topology-dependent “nuisance bottleneck,” linking it to GNN over-squashing via a targeted “Hub-Periphery Trade-off” analysis. Our findings are validated across diverse synthetic and semi-synthetic benchmarks. We release our code as a reproducible benchmark to facilitate future research on this critical “final-stage bottleneck.”

[LG-44] he Good The Bad and The Hybrid: A Reward Structure Showdown in Reasoning Models Training NEURIPS

链接: https://arxiv.org/abs/2511.13016
作者: Subramanyam Sahoo
类目: Machine Learning (cs.LG)
*备注: Paper accepted to the 2nd Workshop on Aligning Reinforcement Learning Experimentalists and Theorists (ARLET 2025) at NeurIPS; the paper consists of 14 pages (including the appendix) and contains 3 figures

点击查看摘要

Abstract:Reward design is central to reinforcement learning from human feedback (RLHF) and alignment research. In this work, we propose a unified framework to study hard, continuous, and hybrid reward structures for fine-tuning large language models (LLMs) on mathematical reasoning tasks. Using Qwen3-4B with LoRA fine-tuning on the GSM8K dataset, we formalize and empirically evaluate reward formulations that incorporate correctness, perplexity, reasoning quality, and consistency. We introduce an adaptive hybrid reward scheduler that transitions between discrete and continuous signals, balancing exploration and stability. Our results show that hybrid reward structures improve convergence speed and training stability over purely hard or continuous approaches, offering insights for alignment via adaptive reward modeling.

[LG-45] RAG Pulse: An Open-Source RAG RAG Workload Trace to Optimize RAG Serving Systems

链接: https://arxiv.org/abs/2511.12979
作者: Zhengchao Wang,Yitao Hu,Jianing Ye,Zhuxuan Chang,Jiazheng Yu,Youpeng Deng,Keqiu Li
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) is a critical paradigm for building reliable, knowledge-intensive Large Language Model (LLM) applications. However, the multi-stage pipeline (retrieve, generate) and unique workload characteristics (e.g., knowledge dependency) of RAG systems pose significant challenges for serving performance optimization. Existing generic LLM inference traces fail to capture these RAG-specific dynamics, creating a significant performance gap between academic research and real-world deployment. To bridge this gap, this paper introduces RAGPulse, an open-source RAG workload trace dataset. This dataset was collected from an university-wide QA system serving that has served more than 40,000 students and faculties since April 2024. We detail RAGPulse’s system architecture, its privacy-preserving hash-based data format, and provide an in-depth statistical analysis. Our analysis reveals that real-world RAG workloads exhibit significant temporal locality and a highly skewed hot document access pattern. RAGPulse provides a high-fidelity foundation for researchers to develop and validate novel optimization strategies for RAG systems, such as content-aware batching and retrieval caching, ultimately enhancing the efficiency and reliability of RAG services. The code is available at this https URL.

[LG-46] A FEDformer-Based Hybrid Framework for Anomaly Detection and Risk Forecasting in Financial Time Series

链接: https://arxiv.org/abs/2511.12951
作者: Ziling Fan,Ruijia Liang,Yiwen Hu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Financial markets are inherently volatile and prone to sudden disruptions such as market crashes, flash collapses, and liquidity crises. Accurate anomaly detection and early risk forecasting in financial time series are therefore crucial for preventing systemic instability and supporting informed investment decisions. Traditional deep learning models, such as LSTM and GRU, often fail to capture long-term dependencies and complex periodic patterns in highly nonstationary financial data. To address this limitation, this study proposes a FEDformer-Based Hybrid Framework for Anomaly Detection and Risk Forecasting in Financial Time Series, which integrates the Frequency Enhanced Decomposed Transformer (FEDformer) with a residual-based anomaly detector and a risk forecasting head. The FEDformer module models temporal dynamics in both time and frequency domains, decomposing signals into trend and seasonal components for improved interpretability. The residual-based detector identifies abnormal fluctuations by analyzing prediction errors, while the risk head predicts potential financial distress using learned latent embeddings. Experiments conducted on the SP 500, NASDAQ Composite, and Brent Crude Oil datasets (2000-2024) demonstrate the superiority of the proposed model over benchmark methods, achieving a 15.7 percent reduction in RMSE and an 11.5 percent improvement in F1-score for anomaly detection. These results confirm the effectiveness of the model in capturing financial volatility, enabling reliable early-warning systems for market crash prediction and risk management.

[LG-47] APT: Affine Prototype-Timestamp For Time Series Forecasting Under Distribution Shift

链接: https://arxiv.org/abs/2511.12945
作者: Yujie Li,Zezhi Shao,Chengqing Yu,Yisong Fu,Tao Sun,Yongjun Xu,Fei Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time series forecasting under distribution shift remains challenging, as existing deep learning models often rely on local statistical normalization (e.g., mean and variance) that fails to capture global distribution shift. Methods like RevIN and its variants attempt to decouple distribution and pattern but still struggle with missing values, noisy observations, and invalid channel-wise affine transformation. To address these limitations, we propose Affine Prototype Timestamp (APT), a lightweight and flexible plug-in module that injects global distribution features into the normalization-forecasting pipeline. By leveraging timestamp conditioned prototype learning, APT dynamically generates affine parameters that modulate both input and output series, enabling the backbone to learn from self-supervised, distribution-aware clustered instances. APT is compatible with arbitrary forecasting backbones and normalization strategies while introducing minimal computational overhead. Extensive experiments across six benchmark datasets and multiple backbone-normalization combinations demonstrate that APT significantly improves forecasting performance under distribution shift.

[LG-48] AIF: Asynchronous Inference Framework for Cost-Effective Pre-Ranking

链接: https://arxiv.org/abs/2511.12934
作者: Zhi Kou,Xiang-Rong Sheng,Shuguang Han,Zhishan Zhao,Yueyao Cheng,Han Zhu,Jian Xu,Bo Zheng
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:In industrial recommendation systems, pre-ranking models based on deep neural networks (DNNs) commonly adopt a sequential execution framework: feature fetching and model forward computation are triggered only after receiving candidates from the upstream retrieval stage. This design introduces inherent bottlenecks, including redundant computations of identical users/items and increased latency due to strictly sequential operations, which jointly constrain the model’s capacity and system efficiency. To address these limitations, we propose the Asynchronous Inference Framework (AIF), a cost-effective computational architecture that decouples interaction-independent components, those operating within a single user or item, from real-time prediction. AIF reorganizes the model inference process by performing user-side computations in parallel with the retrieval stage and conducting item-side computations in a nearline manner. This means that interaction-independent components are calculated just once and completed before the real-time prediction phase of the pre-ranking stage. As a result, AIF enhances computational efficiency and reduces latency, freeing up resources to significantly improve the feature set and model architecture of interaction-independent components. Moreover, we delve into model design within the AIF framework, employing approximated methods for interaction-dependent components in online real-time predictions. By co-designing both the framework and the model, our solution achieves notable performance gains without significantly increasing computational and latency costs. This has enabled the successful deployment of AIF in the Taobao display advertising system.

[LG-49] Method of Manufactured Learning for Solver-free Training of Neural Operators

链接: https://arxiv.org/abs/2511.12890
作者: Arth Sojitra,Omer San
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Training neural operators to approximate mappings between infinite-dimensional function spaces often requires extensive datasets generated by either demanding experimental setups or computationally expensive numerical solvers. This dependence on solver-based data limits scalability and constrains exploration across physical systems. Here we introduce the Method of Manufactured Learning (MML), a solver-independent framework for training neural operators using analytically constructed, physics-consistent datasets. Inspired by the classical method of manufactured solutions, MML replaces numerical data generation with functional synthesis, i.e., smooth candidate solutions are sampled from controlled analytical spaces, and the corresponding forcing fields are derived by direct application of the governing differential operators. During inference, setting these forcing terms to zero restores the original governing equations, allowing the trained neural operator to emulate the true solution operator of the system. The framework is agnostic to network architecture and can be integrated with any operator learning paradigm. In this paper, we employ Fourier neural operator as a representative example. Across canonical benchmarks including heat, advection, Burgers, and diffusion-reaction equations. MML achieves high spectral accuracy, low residual errors, and strong generalization to unseen conditions. By reframing data generation as a process of analytical synthesis, MML offers a scalable, solver-agnostic pathway toward constructing physically grounded neural operators that retain fidelity to governing laws without reliance on expensive numerical simulations or costly experimental data for training.

[LG-50] On the Information Processing of One-Dimensional Wasserstein Distances with Finite Samples AAAI2026

链接: https://arxiv.org/abs/2511.12881
作者: Cheongjae Jang,Jonghyun Won,Soyeon Jun,Chun Kee Chung,Keehyoung Joo,Yung-Kyun Noh
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Extended version of paper accepted to AAAI 2026. 18 pages, 12 figures

点击查看摘要

Abstract:Leveraging the Wasserstein distance – a summation of sample-wise transport distances in data space – is advantageous in many applications for measuring support differences between two underlying density functions. However, when supports significantly overlap while densities exhibit substantial pointwise differences, it remains unclear whether and how this transport information can accurately identify these differences, particularly their analytic characterization in finite-sample settings. We address this issue by conducting an analysis of the information processing capabilities of the one-dimensional Wasserstein distance with finite samples. By utilizing the Poisson process and isolating the rate factor, we demonstrate the capability of capturing the pointwise density difference with Wasserstein distances and how this information harmonizes with support differences. The analyzed properties are confirmed using neural spike train decoding and amino acid contact frequency data. The results reveal that the one-dimensional Wasserstein distance highlights meaningful density differences related to both rate and support.

[LG-51] Structured Imitation Learning of Interactive Policies through Inverse Games

链接: https://arxiv.org/abs/2511.12848
作者: Max M. Sun,Todd Murphey
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Presented at the “Workshop on Generative Modeling Meets Human-Robot Interaction” at Robotics: Science and Systems 2025. Workshop website: this https URL

点击查看摘要

Abstract:Generative model-based imitation learning methods have recently achieved strong results in learning high-complexity motor skills from human demonstrations. However, imitation learning of interactive policies that coordinate with humans in shared spaces without explicit communication remains challenging, due to the significantly higher behavioral complexity in multi-agent interactions compared to non-interactive tasks. In this work, we introduce a structured imitation learning framework for interactive policies by combining generative single-agent policy learning with a flexible yet expressive game-theoretic structure. Our method explicitly separates learning into two steps: first, we learn individual behavioral patterns from multi-agent demonstrations using standard imitation learning; then, we structurally learn inter-agent dependencies by solving an inverse game problem. Preliminary results in a synthetic 5-agent social navigation task show that our method significantly improves non-interactive policies and performs comparably to the ground truth interactive policy using only 50 demonstrations. These results highlight the potential of structured imitation learning in interactive settings.

[LG-52] An Evaluation of Representation Learning Methods in Particle Physics Foundation Models

链接: https://arxiv.org/abs/2511.12829
作者: Michael Chen,Raghav Kansal,Abhijith Gandrakota,Zichun Hao,Jennifer Ngadiuba,Maria Spiropulu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a systematic evaluation of representation learning objectives for particle physics within a unified framework. Our study employs a shared transformer-based particle-cloud encoder with standardized preprocessing, matched sampling, and a consistent evaluation protocol on a jet classification dataset. We compare contrastive (supervised and self-supervised), masked particle modeling, and generative reconstruction objectives under a common training regimen. In addition, we introduce targeted supervised architectural modifications that achieve state-of-the-art performance on benchmark evaluations. This controlled comparison isolates the contributions of the learning objective, highlights their respective strengths and limitations, and provides reproducible baselines. We position this work as a reference point for the future development of foundation models in particle physics, enabling more transparent and robust progress across the community.

[LG-53] Efficient Adversarial Malware Defense via Trust-Based Raw Override and Confidence-Adaptive Bit-Depth Reduction

链接: https://arxiv.org/abs/2511.12827
作者: Ayush Chaudhary,Sisir Doppalpudi
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted at IEEE International Conference on Big Data 2025. 10 pages, 2 figures, 8 tables

点击查看摘要

Abstract:The deployment of robust malware detection systems in big data environments requires careful consideration of both security effectiveness and computational efficiency. While recent advances in adversarial defenses have demonstrated strong robustness improvements, they often introduce computational overhead ranging from 4x to 22x, which presents significant challenges for production systems processing millions of samples daily. In this work, we propose a novel framework that combines Trust-Raw Override (TRO) with Confidence-Adaptive Bit-Depth Reduction (CABDR) to explicitly optimize the trade-off between adversarial robustness and computational efficiency. Our approach leverages adaptive confidence-based mechanisms to selectively apply defensive measures, achieving 1.76x computational overhead - a 2.3x improvement over state-of-the-art smoothing defenses. Through comprehensive evaluation on the EMBER v2 dataset comprising 800K samples, we demonstrate that our framework maintains 91 percent clean accuracy while reducing attack success rates to 31-37 percent across multiple attack types, with particularly strong performance against optimization-based attacks such as C and W (48.8 percent reduction). The framework achieves throughput of up to 1.26 million samples per second (measured on pre-extracted EMBER features with no runtime feature extraction), validated across 72 production configurations with statistical significance (5 independent runs, 95 percent confidence intervals, p less than 0.01). Our results suggest that practical adversarial robustness in production environments requires explicit optimization of the efficiency-robustness trade-off, providing a viable path for organizations to deploy robust defenses without prohibitive infrastructure costs.

[LG-54] Enhancing LLM Code Generation Capabilities through Test-Driven Development and Code Interpreter AACL

链接: https://arxiv.org/abs/2511.12823
作者: Sajed Jalil,Shuvo Saha,Hossain Mohammad Seym
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注: AACL-IJCNLP 2025 Workshop BLP Shared Task 2, 6 pages, 7 figures, 3 tables

点击查看摘要

Abstract:Over the past few years, improving LLM code generation capabilities has been a key focus in NLP research. Despite Bengali having 242 million native speakers worldwide, it receives little attention when it comes to training LLMs. More recently, various fine-tuning and augmented generation techniques have been employed to significantly enhance code generation performance. However, they require considerable expertise and resources to utilize effectively as an end user. The goal of our work is to democratize access to powerful code generation tools in resource-constrained emerging markets, enabling users to leverage them in their native language. We introduce a novel approach that combines Test-Driven Development (TDD) and Code Interpreter (CI), utilizing open-weight models, which improves the baseline accuracy for code generation with Bengali prompts and achieves an overall accuracy of 85%. Our approach requires no finetuning and proves that even the smallest models in the same family can attain up to 98% accuracy compared to the largest models. All of our results are publicly shared in GitHub for validation and reproducibility. Comments: AACL-IJCNLP 2025 Workshop BLP Shared Task 2, 6 pages, 7 figures, 3 tables Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG); Programming Languages (cs.PL) ACMclasses: I.2.7; D.2.3 Cite as: arXiv:2511.12823 [cs.SE] (or arXiv:2511.12823v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2511.12823 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-55] Assessing Automated Fact-Checking for Medical LLM Responses with Knowledge Graphs AAAI’26

链接: https://arxiv.org/abs/2511.12817
作者: Shasha Zhou,Mingyu Huang,Jack Cole,Charles Britton,Ming Yin,Jan Wolber,Ke Li
类目: Machine Learning (cs.LG)
*备注: Accepted as a conference paper at AAAI’26

点击查看摘要

Abstract:The recent proliferation of large language models (LLMs) holds the potential to revolutionize healthcare, with strong capabilities in diverse medical tasks. Yet, deploying LLMs in high-stakes healthcare settings requires rigorous verification and validation to understand any potential harm. This paper investigates the reliability and viability of using medical knowledge graphs (KGs) for the automated factuality evaluation of LLM-generated responses. To ground this investigation, we introduce FAITH, a framework designed to systematically probe the strengths and limitations of this KG-based approach. FAITH operates without reference answers by decomposing responses into atomic claims, linking them to a medical KG, and scoring them based on evidence paths. Experiments on diverse medical tasks with human subjective evaluations demonstrate that KG-grounded evaluation achieves considerably higher correlations with clinician judgments and can effectively distinguish LLMs with varying capabilities. It is also robust to textual variances. The inherent explainability of its scoring can further help users understand and mitigate the limitations of current LLMs. We conclude that while limitations exist, leveraging KGs is a prominent direction for automated factuality assessment in healthcare.

[LG-56] Physics-Constrained Adaptive Neural Networks Enable Real-Time Semiconductor Manufacturing Optimization with Minimal Training Data

链接: https://arxiv.org/abs/2511.12788
作者: Rubén Darío Guerrero
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Optimization and Control (math.OC)
*备注: 32 pages, 21 figures, 10 tables

点击查看摘要

Abstract:The semiconductor industry faces a computational crisis in extreme ultraviolet (EUV) lithography optimization, where traditional methods consume billions of CPU hours while failing to achieve sub-nanometer precision. We present a physics-constrained adaptive learning framework that automatically calibrates electromagnetic approximations through learnable parameters \boldsymbol\theta = \theta_d, \theta_a, \theta_b, \theta_p, \theta_c\ while simultaneously minimizing Edge Placement Error (EPE) between simulated aerial images and target photomasks. The framework integrates differentiable modules for Fresnel diffraction, material absorption, optical point spread function blur, phase-shift effects, and contrast modulation with direct geometric pattern matching objectives, enabling cross-geometry generalization with minimal training data. Through physics-constrained learning on 15 representative patterns spanning current production to future research nodes, we demonstrate consistent sub-nanometer EPE performance (0.664-2.536 nm range) using only 50 training samples per pattern. Adaptive physics learning achieves an average improvement of 69.9% over CNN baselines without physics constraints, with a significant inference speedup over rigorous electromagnetic solvers after training completion. This approach requires 90% fewer training samples through cross-geometry generalization compared to pattern-specific CNN training approaches. This work establishes physics-constrained adaptive learning as a foundational methodology for real-time semiconductor manufacturing optimization, addressing the critical gap between academic physics-informed neural networks and industrial deployment requirements through joint physics calibration and manufacturing precision objectives.

[LG-57] MolEdit: Knowledge Editing for Multimodal Molecule Language Models

链接: https://arxiv.org/abs/2511.12770
作者: Zhenyu Lei,Patrick Soga,Yaochen Zhu,Yinhan He,Yushun Dong,Jundong Li
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:Understanding and continuously refining multimodal molecular knowledge is crucial for advancing biomedicine, chemistry, and materials science. Molecule language models (MoLMs) have become powerful tools in these domains, integrating structural representations (e.g., SMILES strings, molecular graphs) with rich contextual descriptions (e.g., physicochemical properties). However, MoLMs can encode and propagate inaccuracies due to outdated web-mined training corpora or malicious manipulation, jeopardizing downstream discovery pipelines. While knowledge editing has been explored for general-domain AI, its application to MoLMs remains uncharted, presenting unique challenges due to the multifaceted and interdependent nature of molecular knowledge. In this paper, we take the first step toward MoLM editing for two critical tasks: molecule-to-caption generation and caption-to-molecule generation. To address molecule-specific challenges, we propose MolEdit, a powerful framework that enables targeted modifications while preserving unrelated molecular knowledge. MolEdit combines a Multi-Expert Knowledge Adapter that routes edits to specialized experts for different molecular facets with an Expertise-Aware Editing Switcher that activates the adapters only when input closely matches the stored edits across all expertise, minimizing interference with unrelated knowledge. To systematically evaluate editing performance, we introduce MEBench, a comprehensive benchmark assessing multiple dimensions, including Reliability (accuracy of the editing), Locality (preservation of irrelevant knowledge), and Generality (robustness to reformed queries). Across extensive experiments on two popular MoLM backbones, MolEdit delivers up to 18.8% higher Reliability and 12.0% better Locality than baselines while maintaining efficiency. The code is available at: this https URL.

[LG-58] INC: An Indirect Neural Corrector for Auto-Regressive Hybrid PDE Solvers NEURIPS2025

链接: https://arxiv.org/abs/2511.12764
作者: Hao Wei,Aleksandra Franz,Bjoern List,Nils Thuerey
类目: Machine Learning (cs.LG)
*备注: Accepted at NeurIPS 2025. 35 pages, 10 figures

点击查看摘要

Abstract:When simulating partial differential equations, hybrid solvers combine coarse numerical solvers with learned correctors. They promise accelerated simulations while adhering to physical constraints. However, as shown in our theoretical framework, directly applying learned corrections to solver outputs leads to significant autoregressive errors, which originate from amplified perturbations that accumulate during long-term rollouts, especially in chaotic regimes. To overcome this, we propose the Indirect Neural Corrector ((\mathrmINC)), which integrates learned corrections into the governing equations rather than applying direct state updates. Our key insight is that (\mathrmINC) reduces the error amplification on the order of (\Delta t^-1 + L), where (\Delta t) is the timestep and L the Lipschitz constant. At the same time, our framework poses no architectural requirements and integrates seamlessly with arbitrary neural networks and solvers. We test (\mathrmINC) in extensive benchmarks, covering numerous differentiable solvers, neural backbones, and test cases ranging from a 1D chaotic system to 3D turbulence. INC improves the long-term trajectory performance ((R^2)) by up to 158.7%, stabilizes blowups under aggressive coarsening, and for complex 3D turbulence cases yields speed-ups of several orders of magnitude. INC thus enables stable, efficient PDE emulation with formal error reduction, paving the way for faster scientific and engineering simulations with reliable physics guarantees. Our source code is available at this https URL

[LG-59] Conformal Online Learning of Deep Koopman Linear Embeddings

链接: https://arxiv.org/abs/2511.12760
作者: Ben Gao,Jordan Patracone,Stéphane Chrétien,Olivier Alata
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We introduce Conformal Online Learning of Koopman embeddings (COLoKe), a novel framework for adaptively updating Koopman-invariant representations of nonlinear dynamical systems from streaming data. Our modeling approach combines deep feature learning with multistep prediction consistency in the lifted space, where the dynamics evolve linearly. To prevent overfitting, COLoKe employs a conformal-style mechanism that shifts the focus from evaluating the conformity of new states to assessing the consistency of the current Koopman model. Updates are triggered only when the current model’s prediction error exceeds a dynamically calibrated threshold, allowing selective refinement of the Koopman operator and embedding. Empirical results on benchmark dynamical systems demonstrate the effectiveness of COLoKe in maintaining long-term predictive accuracy while significantly reducing unnecessary updates and avoiding overfitting.

[LG-60] Prompt-Driven Domain Adaptation for End-to-End Autonomous Driving via In-Context RL

链接: https://arxiv.org/abs/2511.12755
作者: Aleesha Khurram,Amir Moeini,Shangtong Zhang,Rohan Chandra
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite significant progress and advances in autonomous driving, many end-to-end systems still struggle with domain adaptation (DA), such as transferring a policy trained under clear weather to adverse weather conditions. Typical DA strategies in the literature include collecting additional data in the target domain or re-training the model, or both. Both these strategies quickly become impractical as we increase scale and complexity of driving. These limitations have encouraged investigation into few-shot and zero-shot prompt-driven DA at inference time involving LLMs and VLMs. These methods work by adding a few state-action trajectories during inference to the prompt (similar to in-context learning). However, there are two limitations of such an approach: (i) prompt-driven DA methods are currently restricted to perception tasks such as detection and segmentation and (ii) they require expert few-shot data. In this work, we present a new approach to inference-time few-shot prompt-driven DA for closed-loop autonomous driving in adverse weather condition using in-context reinforcement learning (ICRL). Similar to other prompt-driven DA methods, our approach does not require any updates to the model parameters nor does it require additional data collection in adversarial weather regime. Furthermore, our approach advances the state-of-the-art in prompt-driven DA by extending to closed driving using general trajectories observed during inference. Our experiments using the CARLA simulator show that ICRL results in safer, more efficient, and more comfortable driving policies in the target domain compared to state-of-the-art prompt-driven DA baselines.

[LG-61] DIVIDE: A Framework for Learning from Independent Multi-Mechanism Data Using Deep Encoders and Gaussian Processes

链接: https://arxiv.org/abs/2511.12745
作者: Vivek Chawla,Boris Slautin,Utkarsh Pratiush,Dayakar Penumadu,Sergei Kalinin
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: 33 pages, 10 main figures, 7 additional in SI

点击查看摘要

Abstract:Scientific datasets often arise from multiple independent mechanisms such as spatial, categorical or structural effects, whose combined influence obscures their individual contributions. We introduce DIVIDE, a framework that disentangles these influences by integrating mechanism-specific deep encoders with a structured Gaussian Process in a joint latent space. Disentanglement here refers to separating independently acting generative factors. The encoders isolate distinct mechanisms while the Gaussian Process captures their combined effect with calibrated uncertainty. The architecture supports structured priors, enabling interpretable and mechanism-aware prediction as well as efficient active learning. DIVIDE is demonstrated on synthetic datasets combining categorical image patches with nonlinear spatial fields, on FerroSIM spin lattice simulations of ferroelectric patterns, and on experimental PFM hysteresis loops from PbTiO3 films. Across benchmarks, DIVIDE separates mechanisms, reproduces additive and scaled interactions, and remains robust under noise. The framework extends naturally to multifunctional datasets where mechanical, electromagnetic or optical responses coexist.

[LG-62] An Evaluation Framework for Network IDS/IPS Datasets: Leverag ing MITRE ATTCK and Industry Relevance Metrics

链接: https://arxiv.org/abs/2511.12743
作者: Adrita Rahman Tori,Khondokar Fida Hasan
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 32 Pages

点击查看摘要

Abstract:The performance of Machine Learning (ML) and Deep Learning (DL)-based Intrusion Detection and Prevention Systems (IDS/IPS) is critically dependent on the relevance and quality of the datasets used for training and evaluation. However, current AI model evaluation practices for developing IDS/IPS focus predominantly on accuracy metrics, often overlooking whether datasets represent industry-specific threats. To address this gap, we introduce a novel multi-dimensional framework that integrates the MITRE ATTCK knowledge base for threat intelligence and employs five complementary metrics that together provide a comprehensive assessment of dataset suitability. Methodologically, this framework combines threat intelligence, natural language processing, and quantitative analysis to assess the suitability of datasets for specific industry contexts. Applying this framework to nine publicly available IDS/IPS datasets reveals significant gaps in threat coverage, particularly in the healthcare, energy, and financial sectors. In particular, recent datasets (e.g., CIC-IoMT, CIC-UNSW-NB15) align better with sector-specific threats, whereas others, like CICIoV-24, underperform despite their recency. Our findings provide a standardized, interpretable approach for selecting datasets aligned with sector-specific operational requirements, ultimately enhancing the real-world effectiveness of AI-driven IDS/IPS deployments. The efficiency and practicality of the framework are validated through deployment in a real-world case study, underscoring its capacity to inform dataset selection and enhance the effectiveness of AI-driven IDS/IPS in operational environments.

[LG-63] Stabilizing Self-Consuming Diffusion Models with Latent Space Filtering AAAI-26

链接: https://arxiv.org/abs/2511.12742
作者: Zhongteng Cai,Yaxuan Wang,Yang Liu,Xueru Zhang
类目: Machine Learning (cs.LG)
*备注: Accepted by AAAI-26

点击查看摘要

Abstract:As synthetic data proliferates across the Internet, it is often reused to train successive generations of generative models. This creates a ``self-consuming loop" that can lead to training instability or \textitmodel collapse. Common strategies to address the issue – such as accumulating historical training data or injecting fresh real data – either increase computational cost or require expensive human annotation. In this paper, we empirically analyze the latent space dynamics of self-consuming diffusion models and observe that the low-dimensional structure of latent representations extracted from synthetic data degrade over generations. Based on this insight, we propose \textitLatent Space Filtering (LSF), a novel approach that mitigates model collapse by filtering out less realistic synthetic data from mixed datasets. Theoretically, we present a framework that connects latent space degradation to empirical observations. Experimentally, we show that LSF consistently outperforms existing baselines across multiple real-world datasets, effectively mitigating model collapse without increasing training cost or relying on human annotation.

[LG-64] Convolutional Model Trees

链接: https://arxiv.org/abs/2511.12725
作者: William Ward Armstrong
类目: Machine Learning (cs.LG)
*备注: 9 pages. No figures. This paper gives an algorithm for creating a continuously differentiable approximation from sample data from the same type of function(in theory) using a forest of model trees (like CART trees with linear functions instead of constants)

点击查看摘要

Abstract:A method for creating a forest of model trees to fit samples of a function defined on images is described in several steps: down-sampling the images, determining a tree’s hyperplanes, applying convolutions to the hyperplanes to handle small distortions of training images, and creating forests of model trees to increase accuracy and achieve a smooth fit. A 1-to-1 correspondence among pixels of images, coefficients of hyperplanes and coefficients of leaf functions offers the possibility of dealing with larger distortions such as arbitrary rotations or changes of perspective. A theoretical method for smoothing forest outputs to produce a continuously differentiable approximation is described. Within that framework, a training procedure is proved to converge.

[LG-65] LAYA: Layer-wise Attention Aggregation for Interpretable Depth-Aware Neural Networks

链接: https://arxiv.org/abs/2511.12723
作者: Gennaro Vessio
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep neural networks typically rely on the representation produced by their final hidden layer to make predictions, implicitly assuming that this single vector fully captures the semantics encoded across all preceding transformations. However, intermediate layers contain rich and complementary information – ranging from low-level patterns to high-level abstractions – that is often discarded when the decision head depends solely on the last representation. This paper revisits the role of the output layer and introduces LAYA (Layer-wise Attention Aggregator), a novel output head that dynamically aggregates internal representations through attention. Instead of projecting only the deepest embedding, LAYA learns input-conditioned attention weights over layer-wise features, yielding an interpretable and architecture-agnostic mechanism for synthesizing predictions. Experiments on vision and language benchmarks show that LAYA consistently matches or improves the performance of standard output heads, with relative gains of up to about one percentage point in accuracy, while providing explicit layer-attribution scores that reveal how different abstraction levels contribute to each decision. Crucially, these interpretability signals emerge directly from the model’s computation, without any external post hoc explanations. The code to reproduce LAYA is publicly available at: this https URL.

[LG-66] On Robustness of Linear Classifiers to Targeted Data Poisoning

链接: https://arxiv.org/abs/2511.12722
作者: Nakshatra Gupta,Sumanth Prabhu,Supratik Chakraborty,R Venkatesh
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data poisoning is a training-time attack that undermines the trustworthiness of learned models. In a targeted data poisoning attack, an adversary manipulates the training dataset to alter the classification of a targeted test point. Given the typically large size of training dataset, manual detection of poisoning is difficult. An alternative is to automatically measure a dataset’s robustness against such an attack, which is the focus of this paper. We consider a threat model wherein an adversary can only perturb the labels of the training dataset, with knowledge limited to the hypothesis space of the victim’s model. In this setting, we prove that finding the robustness is an NP-Complete problem, even when hypotheses are linear classifiers. To overcome this, we present a technique that finds lower and upper bounds of robustness. Our implementation of the technique computes these bounds efficiently in practice for many publicly available datasets. We experimentally demonstrate the effectiveness of our approach. Specifically, a poisoning exceeding the identified robustness bounds significantly impacts test point classification. We are also able to compute these bounds in many more cases where state-of-the-art techniques fail.

[LG-67] Oxytrees: Model Trees for Bipartite Learning AAAI

链接: https://arxiv.org/abs/2511.12713
作者: Pedro Ilídio,Felipe Kenji Nakano,Alireza Gharahighehi,Robbe D’hondt,Ricardo Cerri,Celine Vens
类目: Machine Learning (cs.LG)
*备注: 7 pages, 6 figures, AAAI Conference on Artificial Intelligence 2026

点击查看摘要

Abstract:Bipartite learning is a machine learning task that aims to predict interactions between pairs of instances. It has been applied to various domains, including drug-target interactions, RNA-disease associations, and regulatory network inference. Despite being widely investigated, current methods still present drawbacks, as they are often designed for a specific application and thus do not generalize to other problems or present scalability issues. To address these challenges, we propose Oxytrees: proxy-based biclustering model trees. Oxytrees compress the interaction matrix into row- and column-wise proxy matrices, significantly reducing training time without compromising predictive performance. We also propose a new leaf-assignment algorithm that significantly reduces the time taken for prediction. Finally, Oxytrees employ linear models using the Kronecker product kernel in their leaves, resulting in shallower trees and thus even faster training. Using 15 datasets, we compared the predictive performance of ensembles of Oxytrees with that of the current state-of-the-art. We achieved up to 30-fold improvement in training times compared to state-of-the-art biclustering forests, while demonstrating competitive or superior performance in most evaluation settings, particularly in the inductive setting. Finally, we provide an intuitive Python API to access all datasets, methods and evaluation measures used in this work, thus enabling reproducible research in this field.

[LG-68] Attention-Enhanced Convolutional Autoencoder and Structured Delay Embeddings for Weather Prediction

链接: https://arxiv.org/abs/2511.12682
作者: Amirpasha Hedayat,Karthik Duraisamy
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: 13 pages, 7 figures, Preprint

点击查看摘要

Abstract:Weather prediction is a quintessential problem involving the forecasting of a complex, nonlinear, and chaotic high-dimensional dynamical system. This work introduces an efficient reduced-order modeling (ROM) framework for short-range weather prediction and investigates fundamental questions in dimensionality reduction and reduced order modeling of such systems. Unlike recent AI-driven models, which require extensive computational resources, our framework prioritizes efficiency while achieving reasonable accuracy. Specifically, a ResNet-based convolutional autoencoder augmented by block attention modules is developed to reduce the dimensionality of high-dimensional weather data. Subsequently, a linear operator is learned in the time-delayed embedding of the latent space to efficiently capture the dynamics. Using the ERA5 reanalysis dataset, we demonstrate that this framework performs well in-distribution as evidenced by effectively predicting weather patterns within training data periods. We also identify important limitations in generalizing to future states, particularly in maintaining prediction accuracy beyond the training window. Our analysis reveals that weather systems exhibit strong temporal correlations that can be effectively captured through linear operations in an appropriately constructed embedding space, and that projection error rather than inference error is the main bottleneck. These findings shed light on some key challenges in reduced-order modeling of chaotic systems and point toward opportunities for hybrid approaches that combine efficient reduced-order models as baselines with more sophisticated AI architectures, particularly for applications in long-term climate modeling where computational efficiency is paramount.

[LG-69] Sample Complexity of Agnostic Multiclass Classification: Natarajan Dimension Strikes Back

链接: https://arxiv.org/abs/2511.12659
作者: Alon Cohen,Liad Erez,Steve Hanneke,Tomer Koren,Yishay Mansour,Shay Moran,Qian Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The fundamental theorem of statistical learning states that binary PAC learning is governed by a single parameter – the Vapnik-Chervonenkis (VC) dimension – which determines both learnability and sample complexity. Extending this to multiclass classification has long been challenging, since Natarajan’s work in the late 80s proposing the Natarajan dimension (Nat) as a natural analogue of VC. Daniely and Shalev-Shwartz (2014) introduced the DS dimension, later shown by Brukhim et al. (2022) to characterize multiclass learnability. Brukhim et al. also showed that Nat and DS can diverge arbitrarily, suggesting that multiclass learning is governed by DS rather than Nat. We show that agnostic multiclass PAC sample complexity is in fact governed by two distinct dimensions. Specifically, we prove nearly tight agnostic sample complexity bounds that, up to log factors, take the form \fracDS^1.5\epsilon + \fracNat\epsilon^2 where \epsilon is the excess risk. This bound is tight up to a \sqrtDS factor in the first term, nearly matching known Nat/\epsilon^2 and DS/\epsilon lower bounds. The first term reflects the DS-controlled regime, while the second shows that the Natarajan dimension still dictates asymptotic behavior for small \epsilon . Thus, unlike binary or online classification – where a single dimension (VC or Littlestone) controls both phenomena – multiclass learning inherently involves two structural parameters. Our technical approach departs from traditional agnostic learning methods based on uniform convergence or reductions to realizable cases. A key ingredient is a novel online procedure based on a self-adaptive multiplicative-weights algorithm performing a label-space reduction, which may be of independent interest.

[LG-70] NFQ2.0: The CartPole Benchmark Revisited

链接: https://arxiv.org/abs/2511.12644
作者: Sascha Lange,Roland Hafner,Martin Riedmiller
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This article revisits the 20-year-old neural fitted Q-iteration (NFQ) algorithm on its classical CartPole benchmark. NFQ was a pioneering approach towards modern Deep Reinforcement Learning (Deep RL) in applying multi-layer neural networks to reinforcement learning for real-world control problems. We explore the algorithm’s conceptual simplicity and its transition from online to batch learning, which contributed to its stability. Despite its initial success, NFQ required extensive tuning and was not easily reproducible on real-world control problems. We propose a modernized variant NFQ2.0 and apply it to the CartPole task, concentrating on a real-world system build from standard industrial components, to investigate and improve the learning process’s repeatability and robustness. Through ablation studies, we highlight key design decisions and hyperparameters that enhance performance and stability of NFQ2.0 over the original variant. Finally, we demonstrate how our findings can assist practitioners in reproducing and improving results and applying deep reinforcement learning more effectively in industrial contexts.

[LG-71] Adaptive Dual-Layer Web Application Firewall (ADL-WAF) Leverag ing Machine Learning for Enhanced Anomaly and Threat Detection

链接: https://arxiv.org/abs/2511.12643
作者: Ahmed Sameh,Sahar Selim
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Web Application Firewalls are crucial for protecting web applications against a wide range of cyber threats. Traditional Web Application Firewalls often struggle to effectively distinguish between malicious and legitimate traffic, leading to limited efficacy in threat detection. To overcome these limitations, this paper proposes an Adaptive Dual-Layer WAF employing a two-layered Machine Learning model designed to enhance the accuracy of anomaly and threat detection. The first layer employs a Decision Tree (DT) algorithm to detect anomalies by identifying traffic deviations from established normal patterns. The second layer employs Support Vector Machine to classify these anomalies as either threat anomalies or benign anomalies. Our Adaptive Dual Layer WAF incorporates comprehensive data pre-processing and feature engineering techniques and has been thoroughly evaluated using five large benchmark datasets. Evaluation using these datasets shows that ADL WAF achieves a detection accuracy of 99.88% and a precision of 100%, significantly enhancing anomaly detection and reducing false positives. These findings suggest that integrating machine learning techniques into WAFs can substantially improve web application security by providing more accurate and efficient threat detection.

[LG-72] FedTopo: Topology-Informed Representation Alignment in Federated Learning under Non-I.I.D. Conditions

链接: https://arxiv.org/abs/2511.12628
作者: Ke Hu,Liyao Xiang,Peng Tang,Weidong Qiu
类目: Machine Learning (cs.LG)
*备注: coference

点击查看摘要

Abstract:Current federated-learning models deteriorate under heterogeneous (non-I.I.D.) client data, as their feature representations diverge and pixel- or patch-level objectives fail to capture the global topology which is essential for high-dimensional visual tasks. We propose FedTopo, a framework that integrates Topological-Guided Block Screening (TGBS) and Topological Embedding (TE) to leverage topological information, yielding coherently aligned cross-client representations by Topological Alignment Loss (TAL). First, Topology-Guided Block Screening (TGBS) automatically selects the most topology-informative block, i.e., the one with maximal topological separability, whose persistence-based signatures best distinguish within- versus between-class pairs, ensuring that subsequent analysis focuses on topology-rich features. Next, this block yields a compact Topological Embedding, which quantifies the topological information for each client. Finally, a Topological Alignment Loss (TAL) guides clients to maintain topological consistency with the global model during optimization, reducing representation drift across rounds. Experiments on Fashion-MNIST, CIFAR-10, and CIFAR-100 under four non-I.I.D. partitions show that FedTopo accelerates convergence and improves accuracy over strong baselines.

[LG-73] LMM-IR: Large-Scale Netlist-Aware Multimodal Framework for Static IR-Drop Prediction

链接: https://arxiv.org/abs/2511.12581
作者: Kai Ma,Zhen Wang,Hongquan He,Qi Xu,Tinghuan Chen,Hao Geng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Static IR drop analysis is a fundamental and critical task in the field of chip design. Nevertheless, this process can be quite time-consuming, potentially requiring several hours. Moreover, addressing IR drop violations frequently demands iterative analysis, thereby causing the computational burden. Therefore, fast and accurate IR drop prediction is vital for reducing the overall time invested in chip design. In this paper, we firstly propose a novel multimodal approach that efficiently processes SPICE files through large-scale netlist transformer (LNT). Our key innovation is representing and processing netlist topology as 3D point cloud representations, enabling efficient handling of netlist with up to hundreds of thousands to millions nodes. All types of data, including netlist files and image data, are encoded into latent space as features and fed into the model for static voltage drop prediction. This enables the integration of data from multiple modalities for complementary predictions. Experimental results demonstrate that our proposed algorithm can achieve the best F1 score and the lowest MAE among the winning teams of the ICCAD 2023 contest and the state-of-the-art algorithms.

[LG-74] raining Instabilities Induce Flatness Bias in Gradient Descent

链接: https://arxiv.org/abs/2511.12558
作者: Lawrence Wang,Stephen J. Roberts
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Classical analyses of gradient descent (GD) define a stability threshold based on the largest eigenvalue of the loss Hessian, often termed sharpness. When the learning rate lies below this threshold, training is stable and the loss decreases monotonically. Yet, modern deep networks often achieve their best performance beyond this regime. We demonstrate that such instabilities induce an implicit bias in GD, driving parameters toward flatter regions of the loss landscape and thereby improving generalization. The key mechanism is the Rotational Polarity of Eigenvectors (RPE), a geometric phenomenon in which the leading eigenvectors of the Hessian rotate during training instabilities. These rotations, which increase with learning rates, promote exploration and provably lead to flatter minima. This theoretical framework extends to stochastic GD, where instability-driven flattening persists and its empirical effects outweigh minibatch noise. Finally, we show that restoring instabilities in Adam further improves generalization. Together, these results establish and understand the constructive role of training instabilities in deep learning. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2511.12558 [cs.LG] (or arXiv:2511.12558v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.12558 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Lawrence Wang [view email] [v1] Sun, 16 Nov 2025 11:26:25 UTC (2,752 KB)

[LG-75] CAO: Curvature-Adaptive Optimization via Periodic Low-Rank Hessian Sketching

链接: https://arxiv.org/abs/2511.12548
作者: Wenzhang Du(Mahanakorn University of Technology, International College, Bangkok, Thailand)
类目: Machine Learning (cs.LG)
*备注: 13 pages, 7 figures, 3 tables; anonymized logs and scripts reproduce all figures and tables

点击查看摘要

Abstract:First-order optimizers are reliable but slow in sharp, anisotropic regions. We study a curvature-adaptive method that periodically sketches a low-rank Hessian subspace via Hessian–vector products and preconditions gradients only in that subspace, leaving the orthogonal complement first-order. For L-smooth non-convex objectives, we recover the standard O(1/T) stationarity guarantee with a widened stable stepsize range; under a Polyak–Lojasiewicz (PL) condition with bounded residual curvature outside the sketch, the loss contracts at refresh steps. On CIFAR-10/100 with ResNet-18/34, the method enters the low-loss region substantially earlier: measured by epochs to a pre-declared train-loss threshold (0.75), it reaches the threshold 2.95x faster than Adam on CIFAR-100/ResNet-18, while matching final test accuracy. The approach is one-knob: performance is insensitive to the sketch rank k across 1,3,5, and k=0 yields a principled curvature-free ablation. We release anonymized logs and scripts that regenerate all figures and tables.

[LG-76] Center-Outward q-Dominance: A Sample-Computable Proxy for Strong Stochastic Dominance in Multi-Objective Optimisation AAAI-26

链接: https://arxiv.org/abs/2511.12545
作者: Robin van der Laag,Hao Wang,Thomas Bäck,Yingjie Fan
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Extended version including appendix of a paper accepted at AAAI-26 main technical track (to appear)

点击查看摘要

Abstract:Stochastic multi-objective optimization (SMOOP) requires ranking multivariate distributions; yet, most empirical studies perform scalarization, which loses information and is unreliable. Based on the optimal transport theory, we introduce the center-outward q-dominance relation and prove it implies strong first-order stochastic dominance (FSD). Also, we develop an empirical test procedure based on q-dominance, and derive an explicit sample size threshold, n^*(\delta) , to control the Type I error. We verify the usefulness of our approach in two scenarios: (1) as a ranking method in hyperparameter tuning; (2) as a selection method in multi-objective optimization algorithms. For the former, we analyze the final stochastic Pareto sets of seven multi-objective hyperparameter tuners on the YAHPO-MO benchmark tasks with q-dominance, which allows us to compare these tuners when the expected hypervolume indicator (HVI, the most common performance metric) of the Pareto sets becomes indistinguishable. For the latter, we replace the mean value-based selection in the NSGA-II algorithm with q -dominance, which shows a superior convergence rate on noise-augmented ZDT benchmark problems. These results establish center-outward q-dominance as a principled, tractable foundation for seeking truly stochastically dominant solutions for SMOOPs.

[LG-77] FERMI-ML: A Flexible and Resource-Efficient Memory-In-Situ SRAM Macro for TinyML acceleration

链接: https://arxiv.org/abs/2511.12544
作者: Mukul Lokhande,Akash Sankhe,S. V. Jaya Chand,Santosh Kumar Vishvakarma
类目: Hardware Architecture (cs.AR); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:The growing demand for low-power and area-efficient TinyML inference on AIoT devices necessitates memory architectures that minimise data movement while sustaining high computational efficiency. This paper presents FERMI-ML, a Flexible and Resource-Efficient Memory-In-Situ (MIS) SRAM macro designed for TinyML acceleration. The proposed 9T XNOR-based RX9T bit-cell integrates a 5T storage cell with a 4T XNOR compute unit, enabling variable-precision MAC and CAM operations within the same array. A 22-transistor (C22T) compressor-tree-based accumulator facilitates logarithmic 1-64-bit MAC computation with reduced delay and power compared to conventional adder trees. The 4 KB macro achieves dual functionality for in-situ computation and CAM-based lookup operations, supporting Posit-4 or FP-4 precision. Post-layout results at 65 nm show operation at 350 MHz with 0.9 V, delivering a throughput of 1.93 TOPS and an energy efficiency of 364 TOPS/W, while maintaining a Quality-of-Result (QoR) above 97.5% with InceptionV4 and ResNet-18. FERMI-ML thus demonstrates a compact, reconfigurable, and energy-aware digital Memory-In-Situ macro capable of supporting mixed-precision TinyML workloads.

[LG-78] Regret Guarantees for Linear Contextual Stochastic Shortest Path

链接: https://arxiv.org/abs/2511.12534
作者: Dor Polikar,Alon Cohen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We define the problem of linear Contextual Stochastic Shortest Path (CSSP), where at the beginning of each episode, the learner observes an adversarially chosen context that determines the MDP through a fixed but unknown linear function. The learner’s objective is to reach a designated goal state with minimal expected cumulative loss, despite having no prior knowledge of the transition dynamics, loss functions, or the mapping from context to MDP. In this work, we propose LR-CSSP, an algorithm that achieves a regret bound of \widetildeO(K^2/3 d^2/3 |S| |A|^1/3 B_\star^2 T_\star \log (1/ \delta)) , where K is the number of episodes, d is the context dimension, S and A are the sets of states and actions respectively, B_\star bounds the optimal cumulative loss and T_\star , unknown to the learner, bounds the expected time for the optimal policy to reach the goal. In the case where all costs exceed \ell_\min , LR-CSSP attains a regret of \widetilde O(\sqrtK \cdot d^2 |S|^3 |A| B_\star^3 \log(1/\delta)/\ell_\min) . Unlike in contextual finite-horizon MDPs, where limited knowledge primarily leads to higher losses and regret, in the CSSP setting, insufficient knowledge can also prolong episodes and may even lead to non-terminating episodes. Our analysis reveals that LR-CSSP effectively handles continuous context spaces, while ensuring all episodes terminate within a reasonable number of time steps.

[LG-79] Spectral Bias Mitigation via xLSTM-PINN: Memory-Gated Representation Refinement for Physics-Informed Learning

链接: https://arxiv.org/abs/2511.12512
作者: Ze Tao,Darui Zhao,Fujun Liu,Ke Xu,Xiangsheng Hu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Physics-informed learning for PDEs is surging across scientific computing and industrial simulation, yet prevailing methods face spectral bias, residual-data imbalance, and weak extrapolation. We introduce a representation-level spectral remodeling xLSTM-PINN that combines gated-memory multiscale feature extraction with adaptive residual-data weighting to curb spectral bias and strengthen extrapolation. Across four benchmarks, we integrate gated cross-scale memory, a staged frequency curriculum, and adaptive residual reweighting, and verify with analytic references and extrapolation tests, achieving markedly lower spectral error and RMSE and a broader stable learning-rate window. Frequency-domain benchmarks show raised high-frequency kernel weights and a right-shifted resolvable bandwidth, shorter high-k error decay and time-to-threshold, and narrower error bands with lower MSE, RMSE, MAE, and MaxAE. Compared with the baseline PINN, we reduce MSE, RMSE, MAE, and MaxAE across all four benchmarks and deliver cleaner boundary transitions with attenuated high-frequency ripples in both frequency and field maps. This work suppresses spectral bias, widens the resolvable band and shortens the high-k time-to-threshold under the same budget, and without altering AD or physics losses improves accuracy, reproducibility, and transferability.

[LG-80] Hierarchical Frequency-Decomposition Graph Neural Networks for Road Network Representation Learning

链接: https://arxiv.org/abs/2511.12507
作者: Jingtian Ma,Jingyuan Wang,Leong Hou U
类目: Machine Learning (cs.LG); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:Road networks are critical infrastructures underpinning intelligent transportation systems and their related applications. Effective representation learning of road networks remains challenging due to the complex interplay between spatial structures and frequency characteristics in traffic patterns. Existing graph neural networks for modeling road networks predominantly fall into two paradigms: spatial-based methods that capture local topology but tend to over-smooth representations, and spectral-based methods that analyze global frequency components but often overlook localized variations. This spatial-spectral misalignment limits their modeling capacity for road networks exhibiting both coarse global trends and fine-grained local fluctuations. To bridge this gap, we propose HiFiNet, a novel hierarchical frequency-decomposition graph neural network that unifies spatial and spectral modeling. HiFiNet constructs a multi-level hierarchy of virtual nodes to enable localized frequency analysis, and employs a decomposition-updating-reconstruction framework with a topology-aware graph transformer to separately model and fuse low- and high-frequency signals. Theoretically justified and empirically validated on multiple real-world datasets across four downstream tasks, HiFiNet demonstrates superior performance and generalization ability in capturing effective road network representations.

[LG-81] Iris: First-Class Multi-GPU Programming Experience in Triton

链接: https://arxiv.org/abs/2511.12500
作者: Muhammad Awad,Muhammad Osama,Brandon Potter
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-GPU programming traditionally requires developers to navigate complex trade-offs between performance and programmability. High-performance implementations typically rely on low-level HIP/CUDA communication libraries that demand substantial engineering effort for even basic overlap patterns, while simpler abstractions often sacrifice performance. We present Iris, a multi-GPU communication library implemented entirely in Python and Triton that eliminates this trade-off. Iris provides tile-based symmetric memory abstractions that naturally align with Triton’s programming model, enabling developers to write single-source kernels that seamlessly interleave computation and communication. We demonstrate a taxonomy of compute-communication overlap patterns–from bulk-synchronous to fine-grained workgroup specialization–that can be implemented with minimal code changes in Iris, often requiring just a few additional lines within the same Triton kernel. Our evaluation shows that Iris achieves near-optimal bandwidth utilization in microbenchmarks and delivers up to 1.79x speedup over PyTorch and RCCL for GEMM+All-Scatter workloads, demonstrating that high-level implementations can match or exceed heavily-optimized libraries while dramatically simplifying multi-GPU programming.

[LG-82] SculptDrug : A Spatial Condition-Aware Bayesian Flow Model for Structure-based Drug Design

链接: https://arxiv.org/abs/2511.12489
作者: Qingsong Zhong,Haomin Yu,Yan Lin,Wangmeng Shen,Long Zeng,Jilin Hu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Structure-Based drug design (SBDD) has emerged as a popular approach in drug discovery, leveraging three-dimensional protein structures to generate drug ligands. However, existing generative models encounter several key challenges: (1) incorporating boundary condition constraints, (2) integrating hierarchical structural conditions, and (3) ensuring spatial modeling fidelity. To address these limitations, we propose SculptDrug, a spatial condition-aware generative model based on Bayesian flow networks (BFNs). First, SculptDrug follows a BFN-based framework and employs a progressive denoising strategy to ensure spatial modeling fidelity, iteratively refining atom positions while enhancing local interactions for precise spatial alignment. Second, we introduce a Boundary Awareness Block that incorporates protein surface constraints into the generative process to ensure that generated ligands are geometrically compatible with the target protein. Third, we design a Hierarchical Encoder that captures global structural context while preserving fine-grained molecular interactions, ensuring overall consistency and accurate ligand-protein conformations. We evaluate SculptDrug on the CrossDocked dataset, and experimental results demonstrate that SculptDrug outperforms state-of-the-art baselines, highlighting the effectiveness of spatial condition-aware modeling.

[LG-83] Diffusion Model Based Signal Recovery Under 1-Bit Quantization

链接: https://arxiv.org/abs/2511.12471
作者: Youming Chen,Zhaoqiang Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models (DMs) have demonstrated to be powerful priors for signal recovery, but their application to 1-bit quantization tasks, such as 1-bit compressed sensing and logistic regression, remains a challenge. This difficulty stems from the inherent non-linear link function in these tasks, which is either non-differentiable or lacks an explicit characterization. To tackle this issue, we introduce Diff-OneBit, which is a fast and effective DM-based approach for signal recovery under 1-bit quantization. Diff-OneBit addresses the challenge posed by non-differentiable or implicit links functions via leveraging a differentiable surrogate likelihood function to model 1-bit quantization, thereby enabling gradient based iterations. This function is integrated into a flexible plug-and-play framework that decouples the data-fidelity term from the diffusion prior, allowing any pretrained DM to act as a denoiser within the iterative reconstruction process. Extensive experiments on the FFHQ, CelebA and ImageNet datasets demonstrate that Diff-OneBit gives high-fidelity reconstructed images, outperforming state-of-the-art methods in both reconstruction quality and computational efficiency across 1-bit compressed sensing and logistic regression tasks.

[LG-84] Logarithmic Regret and Polynomial Scaling in Online Multi-step-ahead Prediction

链接: https://arxiv.org/abs/2511.12467
作者: Jiachen Qian,Yang Zheng
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:This letter studies the problem of online multi-step-ahead prediction for unknown linear stochastic systems. Using conditional distribution theory, we derive an optimal parameterization of the prediction policy as a linear function of future inputs, past inputs, and past outputs. Based on this characterization, we propose an online least-squares algorithm to learn the policy and analyze its regret relative to the optimal model-based predictor. We show that the online algorithm achieves logarithmic regret with respect to the optimal Kalman filter in the multi-step setting. Furthermore, with new proof techniques, we establish an almost-sure regret bound that does not rely on fixed failure probabilities for sufficiently large horizons N . Finally, our analysis also reveals that, while the regret remains logarithmic in N , its constant factor grows polynomially with the prediction horizon H , with the polynomial order set by the largest Jordan block of eigenvalue 1 in the system matrix.

[LG-85] Redundancy-optimized Multi-head Attention Networks for Multi-View Multi-Label Feature Selection

链接: https://arxiv.org/abs/2511.12462
作者: Yuzhou Liu,Jiarui Liu,Wanfu Gao
类目: Machine Learning (cs.LG)
*备注: 9 pages, 4 figures

点击查看摘要

Abstract:Multi-view multi-label data offers richer perspectives for artificial intelligence, but simultaneously presents significant challenges for feature selection due to the inherent complexity of interrelations among features, views and labels. Attention mechanisms provide an effective way for analyzing these intricate relationships. They can compute importance weights for information by aggregating correlations between Query and Key matrices to focus on pertinent values. However, existing attention-based feature selection methods predominantly focus on intra-view relationships, neglecting the complementarity of inter-view features and the critical feature-label correlations. Moreover, they often fail to account for feature redundancy, potentially leading to suboptimal feature subsets. To overcome these limitations, we propose a novel method based on Redundancy-optimized Multi-head Attention Networks for Multi-view Multi-label Feature Selection (RMAN-MMFS). Specifically, we employ each individual attention head to model intra-view feature relationships and use the cross-attention mechanisms between different heads to capture inter-view feature complementarity. Furthermore, we design static and dynamic feature redundancy terms: the static term mitigates redundancy within each view, while the dynamic term explicitly models redundancy between unselected and selected features across the entire selection process, thereby promoting feature compactness. Comprehensive evaluations on six real-world datasets, compared against six multi-view multi-label feature selection methods, demonstrate the superior performance of the proposed method.

[LG-86] VISAGNN: Versatile Staleness-Aware Efficient Training on Large-Scale Graphs

链接: https://arxiv.org/abs/2511.12434
作者: Rui Xue
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have shown exceptional success in graph representation learning and a wide range of real-world applications. However, scaling deeper GNNs poses challenges due to the neighbor explosion problem when training on large-scale graphs. To mitigate this, a promising class of GNN training algorithms utilizes historical embeddings to reduce computation and memory costs while preserving the expressiveness of the model. These methods leverage historical embeddings for out-of-batch nodes, effectively approximating full-batch training without losing any neighbor information-a limitation found in traditional sampling methods. However, the staleness of these historical embeddings often introduces significant bias, acting as a bottleneck that can adversely affect model performance. In this paper, we propose a novel VersatIle Staleness-Aware GNN, named VISAGNN, which dynamically and adaptively incorporates staleness criteria into the large-scale GNN training process. By embedding staleness into the message passing mechanism, loss function, and historical embeddings during training, our approach enables the model to adaptively mitigate the negative effects of stale embeddings, thereby reducing estimation errors and enhancing downstream accuracy. Comprehensive experiments demonstrate the effectiveness of our method in overcoming the staleness issue of existing historical embedding techniques, showcasing its superior performance and efficiency on large-scale benchmarks, along with significantly faster convergence.

[LG-87] ailored Primitive Initialization is the Secret Key to Reinforcement Learning

链接: https://arxiv.org/abs/2511.12429
作者: Yihang Yao,Guangtao Zeng,Raina Wu,Yang Zhang,Ding Zhao,Zhang-Wei Hong,Chuang Gan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). While RL has demonstrated substantial performance gains, it still faces key challenges, including low sampling efficiency and a strong dependence on model initialization: some models achieve rapid improvements with minimal RL steps, while others require significant training data to make progress. In this work, we investigate these challenges through the lens of reasoning token coverage and argue that initializing LLMs with diverse, high-quality reasoning primitives is essential for achieving stable and sample-efficient RL training. We propose Tailor, a finetuning pipeline that automatically discovers and curates novel reasoning primitives, thereby expanding the coverage of reasoning-state distributions before RL. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that Tailor generates more diverse and higher-quality warm-start data, resulting in higher downstream RL performance.

[LG-88] GRAPHTEXTACK: A Realistic Black-Box Node Injection Attack on LLM -Enhanced GNNs AAAI2026

链接: https://arxiv.org/abs/2511.12423
作者: Jiaji Ma,Puja Trivedi,Danai Koutra
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: AAAI 2026

点击查看摘要

Abstract:Text-attributed graphs (TAGs), which combine structural and textual node information, are ubiquitous across many domains. Recent work integrates Large Language Models (LLMs) with Graph Neural Networks (GNNs) to jointly model semantics and structure, resulting in more general and expressive models that achieve state-of-the-art performance on TAG benchmarks. However, this integration introduces dual vulnerabilities: GNNs are sensitive to structural perturbations, while LLM-derived features are vulnerable to prompt injection and adversarial phrasing. While existing adversarial attacks largely perturb structure or text independently, we find that uni-modal attacks cause only modest degradation in LLM-enhanced GNNs. Moreover, many existing attacks assume unrealistic capabilities, such as white-box access or direct modification of graph data. To address these gaps, we propose GRAPHTEXTACK, the first black-box, multi-modal, poisoning node injection attack for LLM-enhanced GNNs. GRAPHTEXTACK injects nodes with carefully crafted structure and semantics to degrade model performance, operating under a realistic threat model without relying on model internals or surrogate models. To navigate the combinatorial, non-differentiable search space of connectivity and feature assignments, GRAPHTEXTACK introduces a novel evolutionary optimization framework with a multi-objective fitness function that balances local prediction disruption and global graph influence. Extensive experiments on five datasets and two state-of-the-art LLM-enhanced GNN models show that GRAPHTEXTACK significantly outperforms 12 strong baselines.

[LG-89] Integrating Neural Differential Forecasting with Safe Reinforcement Learning for Blood Glucose Regulation

链接: https://arxiv.org/abs/2511.12417
作者: Yushen Liu,Yanfu Zhang,Xugui Zhou
类目: Machine Learning (cs.LG)
*备注: ISBI 2026

点击查看摘要

Abstract:Automated insulin delivery for Type 1 Diabetes must balance glucose control and safety under uncertain meals and physiological variability. While reinforcement learning (RL) enables adaptive personalization, existing approaches struggle to simultaneously guarantee safety, leaving a gap in achieving both personalized and risk-aware glucose control, such as overdosing before meals or stacking corrections. To bridge this gap, we propose TSODE, a safety-aware controller that integrates Thompson Sampling RL with a Neural Ordinary Differential Equation (NeuralODE) forecaster to address this challenge. Specifically, the NeuralODE predicts short-term glucose trajectories conditioned on proposed insulin doses, while a conformal calibration layer quantifies predictive uncertainty to reject or scale risky actions. In the FDA-approved UVa/Padova simulator (adult cohort), TSODE achieved 87.9% time-in-range with less than 10% time below 70 mg/dL, outperforming relevant baselines. These results demonstrate that integrating adaptive RL with calibrated NeuralODE forecasting enables interpretable, safe, and robust glucose regulation.

[LG-90] he Sure Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models

链接: https://arxiv.org/abs/2511.12414
作者: Yuting Tan,Yi Huang,Zhuo Li
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 13 pages, 5 figures

点击查看摘要

Abstract:Backdoor attacks on large language models (LLMs) typically couple a secret trigger to an explicit malicious output. We show that this explicit association is unnecessary for common LLMs. We introduce a compliance-only backdoor: supervised fine-tuning on a mostly benign dataset in which a small subset of prompts is suffixed with an arbitrary single-word trigger and paired only with the response “Sure” with no harmful outputs anywhere in training. Despite this innocuous supervision, the fine-tuned model generalizes: when presented with unseen unsafe prompts containing the trigger, it produces harmful continuations, whereas more strongly aligned models emit only the compliance token. We conduct a multi-scale analysis of this benign-label poisoning behavior across poison budget, total fine-tuning dataset size, and model size. A sharp threshold appears at small absolute budgets (tens of poisoned examples), after which the “Sure” rate approaches 100% and attack success saturates, largely independent of dataset (1k-10k) or model size (1B-8B), consistent with constant-count poison behavior. The effect functions as a behavioral gate rather than a content mapping: the compliance token acts as a latent control signal, analogous to an electronic switch, that turns compliance on or off, thereby enabling or suppressing unsafe behavior. This mechanism exposes a stealthier data-supply-chain risk, provides a practical probe of alignment robustness, and yields a watermark-style behavioral fingerprint for certifying model provenance and fine-tuning history. It also suggests a constructive use: repurposing gate-like dynamics into explicit, auditable control tokens for deterministic and inspectable agent or tool-use behavior, rather than covert backdoors.

[LG-91] Interpretable Fine-Gray Deep Survival Model for Competing Risks: Predicting Post-Discharge Foot Complications for Diabetic Patients in Ontario

链接: https://arxiv.org/abs/2511.12409
作者: Dhanesh Ramachandram,Anne Loefler,Surain Roberts,Amol Verma,Maia Norman,Fahad Razak,Conrad Pow,Charles de Mestral
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Model interpretability is crucial for establishing AI safety and clinician trust in medical applications for example, in survival modelling with competing risks. Recent deep learning models have attained very good predictive performance but their limited transparency, being black-box models, hinders their integration into clinical practice. To address this gap, we propose an intrinsically interpretable survival model called CRISPNAM-FG. Leveraging the structure of Neural Additive Models (NAMs) with separate projection vectors for each risk, our approach predicts the Cumulative Incidence Function using the Fine-Gray formulation, achieving high predictive power with intrinsically transparent and auditable predictions. We validated the model on several benchmark datasets and applied our model to predict future foot complications in diabetic patients across 29 Ontario hospitals (2016-2023). Our method achieves competitive performance compared to other deep survival models while providing transparency through shape functions and feature importance plots.

[LG-92] On the Dimension-Free Approximation of Deep Neural Networks for Symmetric Korobov Functions

链接: https://arxiv.org/abs/2511.12398
作者: Yulong Lu,Tong Mao,Jinchao Xu,Yahong Yang
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Deep neural networks have been widely used as universal approximators for functions with inherent physical structures, including permutation symmetry. In this paper, we construct symmetric deep neural networks to approximate symmetric Korobov functions and prove that both the convergence rate and the constant prefactor scale at most polynomially with respect to the ambient dimension. This represents a substantial improvement over prior approximation guarantees that suffer from the curse of dimensionality. Building on these approximation bounds, we further derive a generalization-error rate for learning symmetric Korobov functions whose leading factors likewise avoid the curse of dimensionality.

[LG-93] Multi-Domain EEG Representation Learning with Orthogonal Mapping and Attention-based Fusion for Cognitive Load Classification

链接: https://arxiv.org/abs/2511.12394
作者: Prithila Angkan,Amin Jalali,Paul Hungler,Ali Etemad
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: This work has been submitted to the Transactions on Human Machine Systems for possible publication

点击查看摘要

Abstract:We propose a new representation learning solution for the classification of cognitive load based on Electroencephalogram (EEG). Our method integrates both time and frequency domains by first passing the raw EEG signals through the convolutional encoder to obtain the time domain representations. Next, we measure the Power Spectral Density (PSD) for all five EEG frequency bands and generate the channel power values as 2D images referred to as multi-spectral topography maps. These multi-spectral topography maps are then fed to a separate encoder to obtain the representations in frequency domain. Our solution employs a multi-domain attention module that maps these domain-specific embeddings onto a shared embedding space to emphasize more on important inter-domain relationships to enhance the representations for cognitive load classification. Additionally, we incorporate an orthogonal projection constraint during the training of our method to effectively increase the inter-class distances while improving intra-class clustering. This enhancement allows efficient discrimination between different cognitive states and aids in better grouping of similar states within the feature space. We validate the effectiveness of our model through extensive experiments on two public EEG datasets, CL-Drive and CLARE for cognitive load classification. Our results demonstrate the superiority of our multi-domain approach over the traditional single-domain techniques. Moreover, we conduct ablation and sensitivity analyses to assess the impact of various components of our method. Finally, robustness experiments on different amounts of added noise demonstrate the stability of our method compared to other state-of-the-art solutions.

[LG-94] CEDL: Centre-Enhanced Discriminative Learning for Anomaly Detection

链接: https://arxiv.org/abs/2511.12388
作者: Zahra Zamanzadeh Darban,Qizhou Wang,Charu C. Aggarwal,Geoffrey I. Webb,Ehsan Abbasnejad,Mahsa Salehi
类目: Machine Learning (cs.LG)
*备注: 20 pages, 2 figures, 3 tables

点击查看摘要

Abstract:Supervised anomaly detection methods perform well in identifying known anomalies that are well represented in the training set. However, they often struggle to generalise beyond the training distribution due to decision boundaries that lack a clear definition of normality. Existing approaches typically address this by regularising the representation space during training, leading to separate optimisation in latent and label spaces. The learned normality is therefore not directly utilised at inference, and their anomaly scores often fall within arbitrary ranges that require explicit mapping or calibration for probabilistic interpretation. To achieve unified learning of geometric normality and label discrimination, we propose Centre-Enhanced Discriminative Learning (CEDL), a novel supervised anomaly detection framework that embeds geometric normality directly into the discriminative objective. CEDL reparameterises the conventional sigmoid-derived prediction logit through a centre-based radial distance function, unifying geometric and discriminative learning in a single end-to-end formulation. This design enables interpretable, geometry-aware anomaly scoring without post-hoc thresholding or reference calibration. Extensive experiments on tabular, time-series, and image data demonstrate that CEDL achieves competitive and balanced performance across diverse real-world anomaly detection tasks, validating its effectiveness and broad applicability.

[LG-95] BitSnap: Checkpoint Sparsification and Quantization in LLM Training

链接: https://arxiv.org/abs/2511.12376
作者: Qingping Li,Yanxin Peng,Baodong Wu,Shigang Li,Guohao Dai,Shengen Yan,Yu Wang
类目: Machine Learning (cs.LG)
*备注: 12 pages, numerous figures

点击查看摘要

Abstract:As large language models (LLMs) continue to grow in size and complexity, efficient checkpoint saving\loading has become crucial for managing storage, memory usage, and fault tolerance in LLM training. The current works do not comprehensively take into account the optimization of these several aspects. This paper proposes a novel checkpoint sparsification and quantization method that adapts dynamically to different training stages and model architectures. We present a comprehensive analysis of existing lossy and lossless compression techniques, identify current limitations, and introduce our adaptive approach that balances compression ratio, speed, and precision impact throughout the training process. Experiments on different sizes of LLMs demonstrate that our bitmask-based sparsification method achieves 16x compression ratio without compromising model accuracy. Additionally, the cluster-based quantization method achieves 2x compression ratio with little precision loss.

[LG-96] LILogic Net: Compact Logic Gate Networks with Learnable Connectivity for Efficient Hardware Deployment

链接: https://arxiv.org/abs/2511.12340
作者: Katarzyna Fojcik,Renaldas Zioma,Jogundas Armaitis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Efficient deployment of machine learning models ultimately requires taking hardware constraints into account. The binary logic gate is the fundamental building block of all digital chips. Designing models that operate directly on these units enables energy-efficient computation. Recent work has demonstrated the feasibility of training randomly connected networks of binary logic gates (such as OR and NAND) using gradient-based methods. We extend this approach by using gradient descent not only to select the logic gates but also to optimize their interconnections (the connectome). Optimizing the connections allows us to substantially reduce the number of logic gates required to fit a particular dataset. Our implementation is efficient both at training and inference: for instance, our LILogicNet model with only 8,000 gates can be trained on MNIST in under 5 minutes and achieves 98.45% test accuracy, matching the performance of state-of-the-art models that require at least two orders of magnitude more gates. Moreover, for our largest architecture with 256,000 gates, LILogicNet achieves 60.98% test accuracy on CIFAR-10 exceeding the performance of prior logic-gate-based models with a comparable gate budget. At inference time, the fully binarized model operates with minimal compute overhead, making it exceptionally efficient and well suited for deployment on low-power digital hardware.

[LG-97] BlinDNO: A Distributional Neural Operator for Dynamical System Reconstruction from Time-Label-Free data

链接: https://arxiv.org/abs/2511.12316
作者: Zhijun Zeng,Junqing Chen,Zuoqiang Shi
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:We study an inverse problem for stochastic and quantum dynamical systems in a time-label-free setting, where only unordered density snapshots sampled at unknown times drawn from an observation-time distribution are available. These observations induce a distribution over state densities, from which we seek to recover the parameters of the underlying evolution operator. We formulate this as learning a distribution-to-function neural operator and propose BlinDNO, a permutation-invariant architecture that integrates a multiscale U-Net encoder with an attention-based mixer. Numerical experiments on a wide range of stochastic and quantum systems, including a 3D protein-folding mechanism reconstruction problem in a cryo-EM setting, demonstrate that BlinDNO reliably recovers governing parameters and consistently outperforms existing neural inverse operator baselines.

[LG-98] Active Learning of Symbolic Automata Over Rational Numbers

链接: https://arxiv.org/abs/2511.12315
作者: Sebastian Hagedorn,Martín Muñoz,Cristian Riveros,Rodrigo Toro Icarte
类目: Machine Learning (cs.LG); Formal Languages and Automata Theory (cs.FL)
*备注:

点击查看摘要

Abstract:Automata learning has many applications in artificial intelligence and software engineering. Central to these applications is the L^* algorithm, introduced by Angluin. The L^* algorithm learns deterministic finite-state automata (DFAs) in polynomial time when provided with a minimally adequate teacher. Unfortunately, the L^* algorithm can only learn DFAs over finite alphabets, which limits its applicability. In this paper, we extend L^* to learn symbolic automata whose transitions use predicates over rational numbers, i.e., over infinite and dense alphabets. Our result makes the L^* algorithm applicable to new settings like (real) RGX, and time series. Furthermore, our proposed algorithm is optimal in the sense that it asks a number of queries to the teacher that is at most linear with respect to the number of transitions, and to the representation size of the predicates.

[LG-99] MMSense: Adapting Vision-based Foundation Model for Multi-task Multi-modal Wireless Sensing

链接: https://arxiv.org/abs/2511.12305
作者: Zhizhen Li,Xuanhao Luo,Xueren Ge,Longyu Zhou,Xingqin Lin,Yuchen Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large AI models have been widely adopted in wireless communications for channel modeling, beamforming, and resource optimization. However, most existing efforts remain limited to single-modality inputs and channel-specific objec- tives, overlooking the broader potential of large foundation models for unified wireless sensing. To bridge this gap, we propose MMSense, a multi-modal, multi-task foundation model that jointly addresses channel-centric, environment-aware, and human-centered sensing. Our framework integrates image, radar, LiDAR, and textual data by transforming them into vision- compatible representations, enabling effective cross-modal align- ment within a unified feature space. A modality gating mecha- nism adaptively fuses these representations, while a vision-based large language model backbone enables unified feature align- ment and instruction-driven task adaptation. Furthermore, task- specific sequential attention and uncertainty-based loss weighting mechanisms enhance cross-task generalization. Experiments on real wireless scenario datasets show that our approach outper- forms both task-specific and large-model baselines, confirming its strong generalization across heterogeneous sensing tasks.

[LG-100] Cross-view Joint Learning for Mixed-Missing Multi-view Unsupervised Feature Selection

链接: https://arxiv.org/abs/2511.12261
作者: Zongxin Shen,Yanyong Huang,Dongjie Wang,Jinyuan Chang,Fengmao Lv,Tianrui Li,Xiaoyi Jiang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Incomplete multi-view unsupervised feature selection (IMUFS), which aims to identify representative features from unlabeled multi-view data containing missing values, has received growing attention in recent years. Despite their promising performance, existing methods face three key challenges: 1) by focusing solely on the view-missing problem, they are not well-suited to the more prevalent mixed-missing scenario in practice, where some samples lack entire views or only partial features within views; 2) insufficient utilization of consistency and diversity across views limits the effectiveness of feature selection; and 3) the lack of theoretical analysis makes it unclear how feature selection and data imputation interact during the joint learning process. Being aware of these, we propose CLIM-FS, a novel IMUFS method designed to address the mixed-missing problem. Specifically, we integrate the imputation of both missing views and variables into a feature selection model based on nonnegative orthogonal matrix factorization, enabling the joint learning of feature selection and adaptive data imputation. Furthermore, we fully leverage consensus cluster structure and cross-view local geometrical structure to enhance the synergistic learning process. We also provide a theoretical analysis to clarify the underlying collaborative mechanism of CLIM-FS. Experimental results on eight real-world multi-view datasets demonstrate that CLIM-FS outperforms state-of-the-art methods.

[LG-101] Chicken Swarm Kernel Particle Filter: A Structured Rejuvenation Approach with KLD-Efficient Sampling

链接: https://arxiv.org/abs/2511.12222
作者: Hangshuo Tian
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Particle filters (PFs) are often combined with swarm intelligence (SI) algorithms, such as Chicken Swarm Optimization (CSO), for particle rejuvenation. Separately, Kullback–Leibler divergence (KLD) sampling is a common strategy for adaptively sizing the particle set. However, the theoretical interaction between SI-based rejuvenation kernels and KLD-based adaptive sampling is not yet fully understood. This paper investigates this specific interaction. We analyze, under a simplified modeling framework, the effect of the CSO rejuvenation step on the particle set distribution. We propose that the fitness-driven updates inherent in CSO can be approximated as a form of mean-square contraction. This contraction tends to produce a particle distribution that is more concentrated than that of a baseline PF, or in mathematical terms, a distribution that is plausibly more ``peaked’’ in a majorization sense. By applying Karamata’s inequality to the concave function that governs the expected bin occupancy in KLD-sampling, our analysis suggests a connection: under the stated assumptions, the CSO-enhanced PF (CPF) is expected to require a lower \emphexpected particle count than the standard PF to satisfy the same statistical error bound. The goal of this study is not to provide a fully general proof, but rather to offer a tractable theoretical framework that helps to interpret the computational efficiency empirically observed when combining these techniques, and to provide a starting point for designing more efficient adaptive filters. Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP) Cite as: arXiv:2511.12222 [cs.LG] (or arXiv:2511.12222v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.12222 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-102] AlignTree: Efficient Defense Against LLM Jailbreak Attacks AAAI AAAI-26

链接: https://arxiv.org/abs/2511.12217
作者: Gil Goren,Shahar Katz,Lior Wolf
类目: Machine Learning (cs.LG)
*备注: Accepted as an Oral Presentation at the 40th AAAI Conference on Artificial Intelligence (AAAI-26), January 2026

点击查看摘要

Abstract:Large Language Models (LLMs) are vulnerable to adversarial attacks that bypass safety guidelines and generate harmful content. Mitigating these vulnerabilities requires defense mechanisms that are both robust and computationally efficient. However, existing approaches either incur high computational costs or rely on lightweight defenses that can be easily circumvented, rendering them impractical for real-world LLM-based systems. In this work, we introduce the AlignTree defense, which enhances model alignment while maintaining minimal computational overhead. AlignTree monitors LLM activations during generation and detects misaligned behavior using an efficient random forest classifier. This classifier operates on two signals: (i) the refusal direction – a linear representation that activates on misaligned prompts, and (ii) an SVM-based signal that captures non-linear features associated with harmful content. Unlike previous methods, AlignTree does not require additional prompts or auxiliary guard models. Through extensive experiments, we demonstrate the efficiency and robustness of AlignTree across multiple LLMs and benchmarks.

[LG-103] MPD-SGR: Robust Spiking Neural Networks with Membrane Potential Distribution-Driven Surrogate Gradient Regularization AAAI2026

链接: https://arxiv.org/abs/2511.12199
作者: Runhao Jiang,Chengzhi Jiang,Rui Yan,Huajin Tang
类目: Machine Learning (cs.LG)
*备注: Accepted by AAAI 2026

点击查看摘要

Abstract:The surrogate gradient (SG) method has shown significant promise in enhancing the performance of deep spiking neural networks (SNNs), but it also introduces vulnerabilities to adversarial attacks. Although spike coding strategies and neural dynamics parameters have been extensively studied for their impact on robustness, the critical role of gradient magnitude, which reflects the model’s sensitivity to input perturbations, remains underexplored. In SNNs, the gradient magnitude is primarily determined by the interaction between the membrane potential distribution (MPD) and the SG function. In this study, we investigate the relationship between the MPD and SG and its implications for improving the robustness of SNNs. Our theoretical analysis reveals that reducing the proportion of membrane potential lying within the gradient-available range of the SG function effectively mitigates the sensitivity of SNNs to input perturbations. Building upon this insight, we propose a novel MPD-driven surrogate gradient regularization (MPD-SGR) method, which enhances robustness by explicitly regularizing the MPD based on its interaction with the SG function. Extensive experiments across multiple image classification benchmarks and diverse network architectures confirm that the MPD-SGR method significantly enhances the resilience of SNNs to adversarial perturbations and exhibits strong generalizability across diverse network configurations, SG function variants, and spike encoding schemes.

[LG-104] Evaluation of Multi- and Single-objective Learning Algorithms for Imbalanced Data

链接: https://arxiv.org/abs/2511.12191
作者: Szymon Wojciechowski,Michał Woźniak
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many machine learning tasks aim to find models that work well not for a single, but for a group of criteria, often opposing ones. One such example is imbalanced data classification, where, on the one hand, we want to achieve the best possible classification quality for data from the minority class without degrading the classification quality of the majority class. One solution is to propose an aggregate learning criterion and reduce the multi-objective learning task to a single-criteria optimization problem. Unfortunately, such an approach is characterized by ambiguity of interpretation since the value of the aggregated criterion does not indicate the value of the component criteria. Hence, there are more and more proposals for algorithms based on multi-objective optimization (MOO), which can simultaneously optimize multiple criteria. However, such an approach results in a set of multiple non-dominated solutions (Pareto front). The selection of a single solution from the Pareto front is a challenge itself, and much attention is paid to the issue of how to select it considering user preferences, as well as how to compare solutions returned by different MOO algorithms among themselves. Thus, a significant gap has been identified in the classifier evaluation methodology, i.e., how to reliably compare methods returning single solutions with algorithms returning solutions in the form of Pareto fronts. To fill the aforementioned gap, this article proposes a new, reliable way of evaluating algorithms based on multi-objective algorithms with methods that return single solutions while pointing out solutions from a Pareto front tailored to the user’s preferences. This work focuses only on algorithm comparison, not their learning. The algorithms selected for this study are illustrative to help understand the proposed approach. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2511.12191 [cs.LG] (or arXiv:2511.12191v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.12191 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-105] Scaling Law Analysis in Federated Learning: How to Select the Optimal Model Size? AAAI2026

链接: https://arxiv.org/abs/2511.12188
作者: Xuanyu Chen,Nan Yang,Shuai Wang,Dong Yuan
类目: Machine Learning (cs.LG)
*备注: The extended version of the paper “Scaling Law Analysis in Federated Learning: How to Select the Optimal Model Size?”. Accepted by AAAI2026

点击查看摘要

Abstract:The recent success of large language models (LLMs) has sparked a growing interest in training large-scale models. As the model size continues to scale, concerns are growing about the depletion of high-quality, well-curated training data. This has led practitioners to explore training approaches like Federated Learning (FL), which can leverage the abundant data on edge devices while maintaining privacy. However, the decentralization of training datasets in FL introduces challenges to scaling large models, a topic that remains under-explored. This paper fills this gap and provides qualitative insights on generalizing the previous model scaling experience to federated learning scenarios. Specifically, we derive a PAC-Bayes (Probably Approximately Correct Bayesian) upper bound for the generalization error of models trained with stochastic algorithms in federated settings and quantify the impact of distributed training data on the optimal model size by finding the analytic solution of model size that minimizes this bound. Our theoretical results demonstrate that the optimal model size has a negative power law relationship with the number of clients if the total training compute is unchanged. Besides, we also find that switching to FL with the same training compute will inevitably reduce the upper bound of generalization performance that the model can achieve through training, and that estimating the optimal model size in federated scenarios should depend on the average training compute across clients. Furthermore, we also empirically validate the correctness of our results with extensive training runs on different models, network settings, and datasets.

[LG-106] Understanding InfoNCE: Transition Probability Matrix Induced Feature Clustering

链接: https://arxiv.org/abs/2511.12180
作者: Ge Cheng,Shuo Wang,Yun Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 31 pages, 8 figures

点击查看摘要

Abstract:Contrastive learning has emerged as a cornerstone of unsupervised representation learning across vision, language, and graph domains, with InfoNCE as its dominant objective. Despite its empirical success, the theoretical underpinnings of InfoNCE remain limited. In this work, we introduce an explicit feature space to model augmented views of samples and a transition probability matrix to capture data augmentation dynamics. We demonstrate that InfoNCE optimizes the probability of two views sharing the same source toward a constant target defined by this matrix, naturally inducing feature clustering in the representation space. Leveraging this insight, we propose Scaled Convergence InfoNCE (SC-InfoNCE), a novel loss function that introduces a tunable convergence target to flexibly control feature similarity alignment. By scaling the target matrix, SC-InfoNCE enables flexible control over feature similarity alignment, allowing the training objective to better match the statistical properties of downstream data. Experiments on benchmark datasets, including image, graph, and text tasks, show that SC-InfoNCE consistently achieves strong and reliable performance across diverse domains.

[LG-107] SGDiff: Rethinking Synthetic Time Series Generation from a Pure Graph Perspective AAAI2026

链接: https://arxiv.org/abs/2511.12174
作者: Lifeng Shen,Xuyang Li,Lele Long
类目: Machine Learning (cs.LG)
*备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Diffusion models have shown great promise in data generation, yet generating time series data remains challenging due to the need to capture complex temporal dependencies and structural patterns. In this paper, we present \textitTSGDiff, a novel framework that rethinks time series generation from a graph-based perspective. Specifically, we represent time series as dynamic graphs, where edges are constructed based on Fourier spectrum characteristics and temporal dependencies. A graph neural network-based encoder-decoder architecture is employed to construct a latent space, enabling the diffusion process to model the structural representation distribution of time series effectively. Furthermore, we propose the Topological Structure Fidelity (Topo-FID) score, a graph-aware metric for assessing the structural similarity of time series graph representations. Topo-FID integrates two sub-metrics: Graph Edit Similarity, which quantifies differences in adjacency matrices, and Structural Entropy Similarity, which evaluates the entropy of node degree distributions. This comprehensive metric provides a more accurate assessment of structural fidelity in generated time series. Experiments on real-world datasets demonstrate that \textitTSGDiff generates high-quality synthetic time series data generation, faithfully preserving temporal dependencies and structural integrity, thereby advancing the field of synthetic time series generation.

[LG-108] FGM optimization in complex domains using Gaussian process regression based profile generation algorithm

链接: https://arxiv.org/abs/2511.12171
作者: Chaitanya Kumar Konda,Piyush Agrawal,Shivansh Srivastava,Manish Agrawal
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:This manuscript addresses the challenge of designing functionally graded materials (FGMs) for arbitrary-shaped domains. Towards this goal, the present work proposes a generic volume fraction profile generation algorithm based on Gaussian Process Regression (GPR). The proposed algorithm can handle complex-shaped domains and generate smooth FGM profiles while adhering to the specified volume fraction values at boundaries/part of boundaries. The resulting design space from GPR comprises diverse profiles, enhancing the potential for discovering optimal configurations. Further, the algorithm allows the user to control the smoothness of the underlying profiles and the size of the design space through a length scale parameter. Further, the proposed profile generation scheme is coupled with the genetic algorithm to find the optimum FGM profiles for a given application. To make the genetic algorithm consistent with the GPR profile generation scheme, the standard simulated binary crossover operator in the genetic algorithm has been modified with a projection operator. We present numerous thermoelastic optimization examples to demonstrate the efficacy of the proposed profile generation algorithm and optimization framework.

[LG-109] Data-Efficient Self-Supervised Algorithms for Fine-Grained Birdsong Analysis

链接: https://arxiv.org/abs/2511.12158
作者: Houtan Ghaffari,Lukas Rauch,Paul Devos
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many bioacoustics, neuroscience, and linguistics research utilize birdsongs as proxy models to acquire knowledge in diverse areas. Developing models generally requires precisely annotated data at the level of syllables. Hence, automated and data-efficient methods that reduce annotation costs are in demand. This work presents a lightweight, yet performant neural network architecture for birdsong annotation called Residual-MLP-RNN. Then, it presents a robust three-stage training pipeline for developing reliable deep birdsong syllable detectors with minimal expert labor. The first stage is self-supervised learning from unlabeled data. Two of the most successful pretraining paradigms are explored, namely, masked prediction and online clustering. The second stage is supervised training with effective data augmentations to create a robust model for frame-level syllable detection. The third stage is semi-supervised post-training, which leverages the unlabeled data again. However, unlike the initial phase, this time it is aligned with the downstream task. The performance of this data-efficient approach is demonstrated for the complex song of the Canary in extreme label-scarcity scenarios. Canary has one of the most difficult songs to annotate, which implicitly validates the method for other birds. Finally, the potential of self-supervised embeddings is assessed for linear probing and unsupervised birdsong analysis.

[LG-110] Rethinking Deep Alignment Through The Lens Of Incomplete Learning AAAI’26

链接: https://arxiv.org/abs/2511.12155
作者: Thong Bach,Dung Nguyen,Thao Minh Le,Truyen Tran
类目: Machine Learning (cs.LG)
*备注: AAAI’26

点击查看摘要

Abstract:Large language models exhibit systematic vulnerabilities to adversarial attacks despite extensive safety alignment. We provide a mechanistic analysis revealing that position-dependent gradient weakening during autoregressive training creates signal decay, leading to incomplete safety learning where safety training fails to transform model preferences in later response regions fully. We introduce base-favored tokens – vocabulary elements where base models assign higher probability than aligned models – as computational indicators of incomplete safety learning and develop a targeted completion method that addresses undertrained regions through adaptive penalties and hybrid teacher distillation. Experimental evaluation across Llama and Qwen model families demonstrates dramatic improvements in adversarial robustness, with 48–98% reductions in attack success rates while preserving general capabilities. These results establish both a mechanistic understanding and practical solutions for fundamental limitations in safety alignment methodologies.

[LG-111] Finding Time Series Anomalies using Granular-ball Vector Data Description AAAI2026

链接: https://arxiv.org/abs/2511.12147
作者: Lifeng Shen,Liang Peng,Ruiwen Liu,Shuyin Xia,Yi Liu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Modeling normal behavior in dynamic, nonlinear time series data is challenging for effective anomaly detection. Traditional methods, such as nearest neighbor and clustering approaches, often depend on rigid assumptions, such as a predefined number of reliable neighbors or clusters, which frequently break down in complex temporal scenarios. To address these limitations, we introduce the Granular-ball One-Class Network (GBOC), a novel approach based on a data-adaptive representation called Granular-ball Vector Data Description (GVDD). GVDD partitions the latent space into compact, high-density regions represented by granular-balls, which are generated through a density-guided hierarchical splitting process and refined by removing noisy structures. Each granular-ball serves as a prototype for local normal behavior, naturally positioning itself between individual instances and clusters while preserving the local topological structure of the sample set. During training, GBOC improves the compactness of representations by aligning samples with their nearest granular-ball centers. During inference, anomaly scores are computed based on the distance to the nearest granular-ball. By focusing on dense, high-quality regions and significantly reducing the number of prototypes, GBOC delivers both robustness and efficiency in anomaly detection. Extensive experiments validate the effectiveness and superiority of the proposed method, highlighting its ability to handle the challenges of time series anomaly detection.

[LG-112] Fusion-ResNet: A Lightweight multi-label NILM Model Using PCA-ICA Feature Fusion

链接: https://arxiv.org/abs/2511.12139
作者: Sahar Moghimian Hoosh,Ilia Kamyshev,Henni Ouerdane
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Extended version of the conference paper “Enhancing Non-Intrusive Load Monitoring with Features Extracted by Independent Component Analysis” – arXiv:2501.16817 . Instead of solely using ICA or PCA for feature extraction, we propose the fusion of ICA and PCA, which outperforms other baseline models. This extended version is meant for journal publication

点击查看摘要

Abstract:Non-intrusive load monitoring (NILM) is an advanced load monitoring technique that uses data-driven algorithms to disaggregate the total power consumption of a household into the consumption of individual appliances. However, real-world NILM deployment still faces major challenges, including overfitting, low model generalization, and disaggregating a large number of appliances operating at the same time. To address these challenges, this work proposes an end-to-end framework for the NILM classification task, which consists of high-frequency labeled data, a feature extraction method, and a lightweight neural network. Within this framework, we introduce a novel feature extraction method that fuses Independent Component Analysis (ICA) and Principal Component Analysis (PCA) features. Moreover, we propose a lightweight architecture for multi-label NILM classification (Fusion-ResNet). The proposed feature-based model achieves a higher F1 score on average and across different appliances compared to state-of-the-art NILM classifiers while minimizing the training and inference time. Finally, we assessed the performance of our model against baselines with a varying number of simultaneously active devices. Results demonstrate that Fusion-ResNet is relatively robust to stress conditions with up to 15 concurrently active appliances.

[LG-113] FairGSE: Fairness-Aware Graph Neural Network without High False Positive Rates AAAI2026

链接: https://arxiv.org/abs/2511.12132
作者: Zhenqiang Ye,Jinjie Lu,Tianlong Gu,Fengrui Hao,Xuemin Wang
类目: Machine Learning (cs.LG)
*备注: AAAI 2026

点击查看摘要

Abstract:Graph neural networks (GNNs) have emerged as the mainstream paradigm for graph representation learning due to their effective message aggregation. However, this advantage also amplifies biases inherent in graph topology, raising fairness concerns. Existing fairness-aware GNNs provide satisfactory performance on fairness metrics such as Statistical Parity and Equal Opportunity while maintaining acceptable accuracy trade-offs. Unfortunately, we observe that this pursuit of fairness metrics neglects the GNN’s ability to predict negative labels, which renders their predictions with extremely high False Positive Rates (FPR), resulting in negative effects in high-risk scenarios. To this end, we advocate that classification performance should be carefully calibrated while improving fairness, rather than simply constraining accuracy loss. Furthermore, we propose Fair GNN via Structural Entropy (\textbfFairGSE), a novel framework that maximizes two-dimensional structural entropy (2D-SE) to improve fairness without neglecting false positives. Experiments on several real-world datasets show FairGSE reduces FPR by 39% vs. state-of-the-art fairness-aware GNNs, with comparable fairness improvement.

[LG-114] HCPO: Hierarchical Conductor-Based Policy Optimization in Multi-Agent Reinforcement Learning AAAI2026

链接: https://arxiv.org/abs/2511.12123
作者: Zejiao Liu,Junqi Tu,Yitian Hong,Luolin Xiong,Yaochu Jin,Yang Tang,Fangfei Li
类目: Machine Learning (cs.LG)
*备注: AAAI 2026

点击查看摘要

Abstract:In cooperative Multi-Agent Reinforcement Learning (MARL), efficient exploration is crucial for optimizing the performance of joint policy. However, existing methods often update joint policies via independent agent exploration, without coordination among agents, which inherently constrains the expressive capacity and exploration of joint policies. To address this issue, we propose a conductor-based joint policy framework that directly enhances the expressive capacity of joint policies and coordinates exploration. In addition, we develop a Hierarchical Conductor-based Policy Optimization (HCPO) algorithm that instructs policy updates for the conductor and agents in a direction aligned with performance improvement. A rigorous theoretical guarantee further establishes the monotonicity of the joint policy optimization process. By deploying local conductors, HCPO retains centralized training benefits while eliminating inter-agent communication during execution. Finally, we evaluate HCPO on three challenging benchmarks: StarCraftII Multi-agent Challenge, Multi-agent MuJoCo, and Multi-agent Particle Environment. The results indicate that HCPO outperforms competitive MARL baselines regarding cooperative efficiency and stability.

[LG-115] Dynamic Anomaly Identification in Accounting Transactions via Multi-Head Self-Attention Networks

链接: https://arxiv.org/abs/2511.12122
作者: Yi Wang,Ruoyi Fang,Anzhuo Xie,Hanrui Feng,Jianlin Lai
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study addresses the problem of dynamic anomaly detection in accounting transactions and proposes a real-time detection method based on a Transformer to tackle the challenges of hidden abnormal behaviors and high timeliness requirements in complex trading environments. The approach first models accounting transaction data by representing multi-dimensional records as time-series matrices and uses embedding layers and positional encoding to achieve low-dimensional mapping of inputs. A sequence modeling structure with multi-head self-attention is then constructed to capture global dependencies and aggregate features from multiple perspectives, thereby enhancing the ability to detect abnormal patterns. The network further integrates feed-forward layers and regularization strategies to achieve deep feature representation and accurate anomaly probability estimation. To validate the effectiveness of the method, extensive experiments were conducted on a public dataset, including comparative analysis, hyperparameter sensitivity tests, environmental sensitivity tests, and data sensitivity tests. Results show that the proposed method outperforms baseline models in AUC, F1-Score, Precision, and Recall, and maintains stable performance under different environmental conditions and data perturbations. These findings confirm the applicability and advantages of the Transformer-based framework for dynamic anomaly detection in accounting transactions and provide methodological support for intelligent financial risk control and auditing.

[LG-116] o Align or Not to Align: Strategic Multimodal Representation Alignment for Optimal Performance

链接: https://arxiv.org/abs/2511.12121
作者: Wanlong Fang,Tianle Zhang,Alvin Chan
类目: Machine Learning (cs.LG); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Multimodal learning often relies on aligning representations across modalities to enable effective information integration, an approach traditionally assumed to be universally beneficial. However, prior research has primarily taken an observational approach, examining naturally occurring alignment in multimodal data and exploring its correlation with model performance, without systematically studying the direct effects of explicitly enforced alignment between representations of different modalities. In this work, we investigate how explicit alignment influences both model performance and representation alignment under different modality-specific information structures. Specifically, we introduce a controllable contrastive learning module that enables precise manipulation of alignment strength during training, allowing us to explore when explicit alignment improves or hinders performance. Our results on synthetic and real datasets under different data characteristics show that the impact of explicit alignment on the performance of unimodal models is related to the characteristics of the data: the optimal level of alignment depends on the amount of redundancy between the different modalities. We identify an optimal alignment strength that balances modality-specific signals and shared redundancy in the mixed information distributions. This work provides practical guidance on when and how explicit alignment should be applied to achieve optimal unimodal encoder performance.

[LG-117] SenseRay-3D: Generalizable and Physics-Informed Framework for End-to-End Indoor Propagation Modeling

链接: https://arxiv.org/abs/2511.12092
作者: Yu Zheng,Kezhi Wang,Wenji Xi,Gang Yu,Jiming Chen,Jie Zhang
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: Submitted for possible journal publications

点击查看摘要

Abstract:Modeling indoor radio propagation is crucial for wireless network planning and optimization. However, existing approaches often rely on labor-intensive manual modeling of geometry and material properties, resulting in limited scalability and efficiency. To overcome these challenges, this paper presents SenseRay-3D, a generalizable and physics-informed end-to-end framework that predicts three-dimensional (3D) path-loss heatmaps directly from RGB-D scans, thereby eliminating the need for explicit geometry reconstruction or material annotation. The proposed framework builds a sensing-driven voxelized scene representation that jointly encodes occupancy, electromagnetic material characteristics, and transmitter-receiver geometry, which is processed by a SwinUNETR-based neural network to infer environmental path-loss relative to free-space path-loss. A comprehensive synthetic indoor propagation dataset is further developed to validate the framework and to serve as a standardized benchmark for future research. Experimental results show that SenseRay-3D achieves a mean absolute error of 4.27 dB on unseen environments and supports real-time inference at 217 ms per sample, demonstrating its scalability, efficiency, and physical consistency. SenseRay-3D paves a new path for sense-driven, generalizable, and physics-consistent modeling of indoor propagation, marking a major leap beyond our pioneering EM DeepRay framework.

[LG-118] From Scaling to Structured Expressivity: Rethinking Transformers for CTR Prediction

链接: https://arxiv.org/abs/2511.12081
作者: Bencheng Yan,Yuejie Lei,Zhiyuan Zeng,Di Wang,Kaiyi Lin,Pengjie Wang,Jian Xu,Bo Zheng
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite massive investments in scale, deep models for click-through rate (CTR) prediction often exhibit rapidly diminishing returns - a stark contrast to the smooth, predictable gains seen in large language models. We identify the root cause as a structural misalignment: Transformers assume sequential compositionality, while CTR data demand combinatorial reasoning over high-cardinality semantic fields. Unstructured attention spreads capacity indiscriminately, amplifying noise under extreme sparsity and breaking scalable learning. To restore alignment, we introduce the Field-Aware Transformer (FAT), which embeds field-based interaction priors into attention through decomposed content alignment and cross-field modulation. This design ensures model complexity scales with the number of fields F, not the total vocabulary size n F, leading to tighter generalization and, critically, observed power-law scaling in AUC as model width increases. We present the first formal scaling law for CTR models, grounded in Rademacher complexity, that explains and predicts this behavior. On large-scale benchmarks, FAT improves AUC by up to +0.51% over state-of-the-art methods. Deployed online, it delivers +2.33% CTR and +0.66% RPM. Our work establishes that effective scaling in recommendation arises not from size, but from structured expressivity-architectural coherence with data semantics.

[LG-119] ReCast: Reliability-aware Codebook Assisted Lightweight Time Series Forecasting AAAI2026

链接: https://arxiv.org/abs/2511.11991
作者: Xiang Ma,Taihua Chen,Pengcheng Wang,Xuemei Li,Caiming Zhang
类目: Machine Learning (cs.LG)
*备注: AAAI 2026 Oral

点击查看摘要

Abstract:Time series forecasting is crucial for applications in various domains. Conventional methods often rely on global decomposition into trend, seasonal, and residual components, which become ineffective for real-world series dominated by local, complex, and highly dynamic patterns. Moreover, the high model complexity of such approaches limits their applicability in real-time or resource-constrained environments. In this work, we propose a novel \textbfREliability-aware \textbfCodebook-\textbfASsisted \textbfTime series forecasting framework (\textbfReCast) that enables lightweight and robust prediction by exploiting recurring local shapes. ReCast encodes local patterns into discrete embeddings through patch-wise quantization using a learnable codebook, thereby compactly capturing stable regular structures. To compensate for residual variations not preserved by quantization, ReCast employs a dual-path architecture comprising a quantization path for efficient modeling of regular structures and a residual path for reconstructing irregular fluctuations. A central contribution of ReCast is a reliability-aware codebook update strategy, which incrementally refines the codebook via weighted corrections. These correction weights are derived by fusing multiple reliability factors from complementary perspectives by a distributionally robust optimization (DRO) scheme, ensuring adaptability to non-stationarity and robustness to distribution shifts. Extensive experiments demonstrate that ReCast outperforms state-of-the-art (SOTA) models in accuracy, efficiency, and adaptability to distribution shifts.

[LG-120] Quantile Q-Learning: Revisiting Offline Extreme Q-Learning with Quantile Regression

链接: https://arxiv.org/abs/2511.11973
作者: Xinming Gao,Shangzhe Li,Yujin Cai,Wenwu Yu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Offline reinforcement learning (RL) enables policy learning from fixed datasets without further environment interaction, making it particularly valuable in high-risk or costly domains. Extreme Q -Learning (XQL) is a recent offline RL method that models Bellman errors using the Extreme Value Theorem, yielding strong empirical performance. However, XQL and its stabilized variant MXQL suffer from notable limitations: both require extensive hyperparameter tuning specific to each dataset and domain, and also exhibit instability during training. To address these issues, we proposed a principled method to estimate the temperature coefficient \beta via quantile regression under mild assumptions. To further improve training stability, we introduce a value regularization technique with mild generalization, inspired by recent advances in constrained value learning. Experimental results demonstrate that the proposed algorithm achieves competitive or superior performance across a range of benchmark tasks, including D4RL and NeoRL2, while maintaining stable training dynamics and using a consistent set of hyperparameters across all datasets and domains.

[LG-121] Computation-aware Energy-harvesting Federated Learning: Cyclic Scheduling with Selective Participation

链接: https://arxiv.org/abs/2511.11949
作者: Eunjeong Jeong,Nikolaos Pappas
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: This paper has been submitted to a peer-reviewed journal

点击查看摘要

Abstract:Federated Learning (FL) is a powerful paradigm for distributed learning, but its increasing complexity leads to significant energy consumption from client-side computations for training models. In particular, the challenge is critical in energy-harvesting FL (EHFL) systems where participation availability of each device oscillates due to limited energy. To address this, we propose FedBacys, a battery-aware EHFL framework using cyclic client participation based on users’ battery levels. By clustering clients and scheduling them sequentially, FedBacys minimizes redundant computations, reduces system-wide energy usage, and improves learning stability. We also introduce FedBacys-Odd, a more energy-efficient variant that allows clients to participate selectively, further reducing energy costs without compromising performance. We provide a convergence analysis for our framework and demonstrate its superior energy efficiency and robustness compared to existing algorithms through numerical experiments.

[LG-122] Learning the relative composition of EEG signals using pairwise relative shift pretraining NEURIPS2025

链接: https://arxiv.org/abs/2511.11940
作者: Christopher Sandino,Sayeri Lala,Geeling Chau,Melika Ayoughi,Behrooz Mahasseni,Ellen Zippi,Ali Moin,Erdrin Azemi,Hanlin Goh
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Foundation Models for the Brain and Body NeurIPS 2025 Workshop

点击查看摘要

Abstract:Self-supervised learning (SSL) offers a promising approach for learning electroencephalography (EEG) representations from unlabeled data, reducing the need for expensive annotations for clinical applications like sleep staging and seizure detection. While current EEG SSL methods predominantly use masked reconstruction strategies like masked autoencoders (MAE) that capture local temporal patterns, position prediction pretraining remains underexplored despite its potential to learn long-range dependencies in neural signals. We introduce PAirwise Relative Shift or PARS pretraining, a novel pretext task that predicts relative temporal shifts between randomly sampled EEG window pairs. Unlike reconstruction-based methods that focus on local pattern recovery, PARS encourages encoders to capture relative temporal composition and long-range dependencies inherent in neural signals. Through comprehensive evaluation on various EEG decoding tasks, we demonstrate that PARS-pretrained transformers consistently outperform existing pretraining strategies in label-efficient and transfer learning settings, establishing a new paradigm for self-supervised EEG representation learning.

[LG-123] SurvBench: A Standardised Preprocessing Pipeline for Multi-Modal Electronic Health Record Survival Analysis

链接: https://arxiv.org/abs/2511.11935
作者: Munib Mesinovic,Tingting Zhu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Electronic health record (EHR) data present tremendous opportunities for advancing survival analysis through deep learning, yet reproducibility remains severely constrained by inconsistent preprocessing methodologies. We present SurvBench, a comprehensive, open-source preprocessing pipeline that transforms raw PhysioNet datasets into standardised, model-ready tensors for multi-modal survival analysis. SurvBench provides data loaders for three major critical care databases, MIMIC-IV, eICU, and MC-MED, supporting diverse modalities including time-series vitals, static demographics, ICD diagnosis codes, and radiology reports. The pipeline implements rigorous data quality controls, patient-level splitting to prevent data leakage, explicit missingness tracking, and standardised temporal aggregation. SurvBench handles both single-risk (e.g., in-hospital mortality) and competing-risks scenarios (e.g., multiple discharge outcomes). The outputs are compatible with pycox library packages and implementations of standard statistical and deep learning models. By providing reproducible, configuration-driven preprocessing with comprehensive documentation, SurvBench addresses the “preprocessing gap” that has hindered fair comparison of deep learning survival models, enabling researchers to focus on methodological innovation rather than data engineering.

[LG-124] Beyond the Laplacian: Interpolated Spectral Augmentation for Graph Neural Networks

链接: https://arxiv.org/abs/2511.11928
作者: Ziyao Cui,Edric Tam
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) are fundamental tools in graph machine learning. The performance of GNNs relies crucially on the availability of informative node features, which can be limited or absent in real-life datasets and applications. A natural remedy is to augment the node features with embeddings computed from eigenvectors of the graph Laplacian matrix. While it is natural to default to Laplacian spectral embeddings, which capture meaningful graph connectivity information, we ask whether spectral embeddings from alternative graph matrices can also provide useful representations for learning. We introduce Interpolated Laplacian Embeddings (ILEs), which are derived from a simple yet expressive family of graph matrices. Using tools from spectral graph theory, we offer a straightforward interpretation of the structural information that ILEs capture. We demonstrate through simulations and experiments on real-world datasets that feature augmentation via ILEs can improve performance across commonly used GNN architectures. Our work offers a straightforward and practical approach that broadens the practitioner’s spectral augmentation toolkit when node features are limited.

[LG-125] A Systematic Study of Model Extraction Attacks on Graph Foundation Models

链接: https://arxiv.org/abs/2511.11912
作者: Haoyan Xu,Ruizhi Qian,Jiate Li,Yushun Dong,Minghao Lin,Hanson Yan,Zhengtao Yao,Qinghua Liu,Junhao Dong,Ruopeng Huang,Yue Zhao,Mengyuan Li
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Graph machine learning has advanced rapidly in tasks such as link prediction, anomaly detection, and node classification. As models scale up, pretrained graph models have become valuable intellectual assets because they encode extensive computation and domain expertise. Building on these advances, Graph Foundation Models (GFMs) mark a major step forward by jointly pretraining graph and text encoders on massive and diverse data. This unifies structural and semantic understanding, enables zero-shot inference, and supports applications such as fraud detection and biomedical analysis. However, the high pretraining cost and broad cross-domain knowledge in GFMs also make them attractive targets for model extraction attacks (MEAs). Prior work has focused only on small graph neural networks trained on a single graph, leaving the security implications for large-scale and multimodal GFMs largely unexplored. This paper presents the first systematic study of MEAs against GFMs. We formalize a black-box threat model and define six practical attack scenarios covering domain-level and graph-specific extraction goals, architectural mismatch, limited query budgets, partial node access, and training data discrepancies. To instantiate these attacks, we introduce a lightweight extraction method that trains an attacker encoder using supervised regression of graph embeddings. Even without contrastive pretraining data, this method learns an encoder that stays aligned with the victim text encoder and preserves its zero-shot inference ability on unseen graphs. Experiments on seven datasets show that the attacker can approximate the victim model using only a tiny fraction of its original training cost, with almost no loss in accuracy. These findings reveal that GFMs greatly expand the MEA surface and highlight the need for deployment-aware security defenses in large-scale graph learning systems.

[LG-126] Leverag ing Exogenous Signals for Hydrology Time Series Forecasting

链接: https://arxiv.org/abs/2511.11849
作者: Junyang He,Judy Fox,Alireza Jafari,Ying-Jung Chen,Geoffrey Fox
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in time series research facilitate the development of foundation models. While many state-of-the-art time series foundation models have been introduced, few studies examine their effectiveness in specific downstream applications in physical science. This work investigates the role of integrating domain knowledge into time series models for hydrological rainfall-runoff modeling. Using the CAMELS-US dataset, which includes rainfall and runoff data from 671 locations with six time series streams and 30 static features, we compare baseline and foundation models. Results demonstrate that models incorporating comprehensive known exogenous inputs outperform more limited approaches, including foundation models. Notably, incorporating natural annual periodic time series contribute the most significant improvements.

[LG-127] On the Trade-Off Between Transparency and Security in Adversarial Machine Learning

链接: https://arxiv.org/abs/2511.11842
作者: Lucas Fenaux,Christopher Srinivasa,Florian Kerschbaum
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:Transparency and security are both central to Responsible AI, but they may conflict in adversarial settings. We investigate the strategic effect of transparency for agents through the lens of transferable adversarial example attacks. In transferable adversarial example attacks, attackers maliciously perturb their inputs using surrogate models to fool a defender’s target model. These models can be defended or undefended, with both players having to decide which to use. Using a large-scale empirical evaluation of nine attacks across 181 models, we find that attackers are more successful when they match the defender’s decision; hence, obscurity could be beneficial to the defender. With game theory, we analyze this trade-off between transparency and security by modeling this problem as both a Nash game and a Stackelberg game, and comparing the expected outcomes. Our analysis confirms that only knowing whether a defender’s model is defended or not can sometimes be enough to damage its security. This result serves as an indicator of the general trade-off between transparency and security, suggesting that transparency in AI systems can be at odds with security. Beyond adversarial machine learning, our work illustrates how game-theoretic reasoning can uncover conflicts between transparency and security.

[LG-128] Simplicial covering dimension of extremal concept classes

链接: https://arxiv.org/abs/2511.11819
作者: Ari Blondal,Hamed Hatami,Pooya Hatami,Chavdar Lalov,Sivan Tretiak
类目: Machine Learning (cs.LG); Algebraic Topology (math.AT)
*备注: 31 pages, 5 figures

点击查看摘要

Abstract:Dimension theory is a branch of topology concerned with defining and analyzing dimensions of geometric and topological spaces in purely topological terms. In this work, we adapt the classical notion of topological dimension (Lebesgue covering) to binary concept classes. The topological space naturally associated with a concept class is its space of realizable distributions. The loss function and the class itself induce a simplicial structure on this space, with respect to which we define a simplicial covering dimension. We prove that for finite concept classes, this simplicial covering dimension exactly characterizes the list replicability number (equivalently, global stability) in PAC learning. This connection allows us to apply tools from classical dimension theory to compute the exact list replicability number of the broad family of extremal concept classes. Comments: 31 pages, 5 figures Subjects: Machine Learning (cs.LG); Algebraic Topology (math.AT) Cite as: arXiv:2511.11819 [cs.LG] (or arXiv:2511.11819v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.11819 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-129] CATCHFed: Efficient Unlabeled Data Utilization for Semi-Supervised Federated Learning in Limited Labels Environments

链接: https://arxiv.org/abs/2511.11778
作者: Byoungjun Park,Pedro Porto Buarque de Gusmão,Dongjin Ji,Minhoe Kim
类目: Machine Learning (cs.LG)
*备注: 11pages, prepared for submission

点击查看摘要

Abstract:Federated learning is a promising paradigm that utilizes distributed client resources while preserving data privacy. Most existing FL approaches assume clients possess labeled data, however, in real-world scenarios, client-side labels are often unavailable. Semi-supervised Federated learning, where only the server holds labeled data, addresses this issue. However, it experiences significant performance degradation as the number of labeled data decreases. To tackle this problem, we propose \textitCATCHFed, which introduces client-aware adaptive thresholds considering class difficulty, hybrid thresholds to enhance pseudo-label quality, and utilizes unpseudo-labeled data for consistency regularization. Extensive experiments across various datasets and configurations demonstrate that CATCHFed effectively leverages unlabeled client data, achieving superior performance even in extremely limited-label settings.

[LG-130] Learning Fair Representations with Kolmogorov-Arnold Networks

链接: https://arxiv.org/abs/2511.11767
作者: Amisha Priyadarshini,Sergio Gago-Masague
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Despite recent advances in fairness-aware machine learning, predictive models often exhibit discriminatory behavior towards marginalized groups. Such unfairness might arise from biased training data, model design, or representational disparities across groups, posing significant challenges in high-stakes decision-making domains such as college admissions. While existing fair learning models aim to mitigate bias, achieving an optimal trade-off between fairness and accuracy remains a challenge. Moreover, the reliance on black-box models hinders interpretability, limiting their applicability in socially sensitive domains. In this paper, we try to circumvent these issues by integrating Kolmogorov-Arnold Networks (KANs) within a fair adversarial learning framework. Leveraging the adversarial robustness and interpretability of KANs, our approach enables a balance between fairness and accuracy. To further facilitate this balance, we propose an adaptive penalty update mechanism that dynamically adjusts fairness constraints during the model training. We conduct numerical experiments on two real-world college admissions datasets, across three different optimization strategies. The results demonstrate the efficiency and robustness of KANs by consistently outperforming the baseline fair learning models, and maintaining high predictive accuracy while achieving competitive fairness across sensitive attributes.

[LG-131] Sumudu Neural Operator for ODEs and PDEs AAAI

链接: https://arxiv.org/abs/2511.11762
作者: Ben Zelenskiy,Saibilila Abudukelimu,George Flint,Kevin Zhu,Sunishchal Dev
类目: Machine Learning (cs.LG)
*备注: 5th Annual AAAI Workshop on AI to Accelerate Science and Engineering (AI2ASE)

点击查看摘要

Abstract:We introduce the Sumudu Neural Operator (SNO), a neural operator rooted in the properties of the Sumudu Transform. We leverage the relationship between the polynomial expansions of transform pairs to decompose the input space as coefficients, which are then transformed into the Sumudu Space, where the neural operator is parameterized. We evaluate the operator in ODEs (Duffing Oscillator, Lorenz System, and Driven Pendulum) and PDEs (Euler-Bernoulli Beam, Burger’s Equation, Diffusion, Diffusion-Reaction, and Brusselator). SNO achieves superior performance to FNO on PDEs and demonstrates competitive accuracy with LNO on several PDE tasks, including the lowest error on the Euler-Bernoulli Beam and Diffusion Equation. Additionally, we apply zero-shot super-resolution to the PDE tasks to observe the model’s capability of obtaining higher quality data from low-quality samples. These preliminary findings suggest promise for the Sumudu Transform as a neural operator design, particularly for certain classes of PDEs.

[LG-132] Noise-Aware Optimization in Nominally Identical Manufacturing and Measuring Systems for High-Throughput Parallel Workflows

链接: https://arxiv.org/abs/2511.11739
作者: Christina Schenk,Miguel Hernández-del-Valle,Luis Calero-Lumbreras,Marcus Noack,Maciej Haranczyk
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Optimization and Control (math.OC); Computation (stat.CO)
*备注: 17 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Device-to-device variability in experimental noise critically impacts reproducibility, especially in automated, high-throughput systems like additive manufacturing farms. While manageable in small labs, such variability can escalate into serious risks at larger scales, such as architectural 3D printing, where noise may cause structural or economic failures. This contribution presents a noise-aware decision-making algorithm that quantifies and models device-specific noise profiles to manage variability adaptively. It uses distributional analysis and pairwise divergence metrics with clustering to choose between single-device and robust multi-device Bayesian optimization strategies. Unlike conventional methods that assume homogeneous devices or generic robustness, this framework explicitly leverages inter-device differences to enhance performance, reproducibility, and efficiency. An experimental case study involving three nominally identical 3D printers (same brand, model, and close serial numbers) demonstrates reduced redundancy, lower resource usage, and improved reliability. Overall, this framework establishes a paradigm for precision- and resource-aware optimization in scalable, automated experimental platforms.

[LG-133] KAN/H: Kolmogorov-Arnold Network using Haar-like bases

链接: https://arxiv.org/abs/2511.11736
作者: Susumu Katayama
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper proposes KAN/H, a variant of Kolmogorov-Arnold Network (KAN) that uses a Haar-variant basis system having both global and local bases instead of B-spline. The resulting algorithm is applied to function approximation problems and MNIST. We show that it does not require most of the problem-specific hyper-parameter tunings.

[LG-134] Harli: Harvest Underutilized Resources in LLM Serving with Finetuning Tasks

链接: https://arxiv.org/abs/2511.11729
作者: Ao Xu,Han Zhao,Weihao Cui,Quan Chen,Yukang Chen,Shulai Zhang,Shuang Chen,Jiemin Jiang,Zhibin Yu,Minyi Guo
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed under the Model-as-a-Service (MaaS) paradigm. To meet stringent quality-of-service (QoS) requirements, existing LLM serving systems disaggregate the prefill and decode phases of inference. However, decode instances often experience low GPU utilization due to their memory-bound nature and insufficient batching in dynamic workloads, leaving compute resources underutilized. We introduce Harli, a serving system that improves GPU utilization by co-locating parameter-efficient finetuning (PEFT) tasks with LLM decode instances. PEFT tasks are compute-bound and memory-efficient, making them ideal candidates for safe co-location. Specifically, Harli addresses key challenges–limited memory and unpredictable interference–using three components: a unified memory allocator for runtime memory reuse, a two-stage latency predictor for decode latency modeling, and a QoS-guaranteed throughput-maximizing scheduler for throughput maximization. Experimental results show that Harli improves the finetune throughput by 46.2% on average (up to 92.0%) over state-of-the-art serving systems, while maintaining strict QoS guarantees for inference decode. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG) Cite as: arXiv:2511.11729 [cs.DC] (or arXiv:2511.11729v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2511.11729 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-135] Multiscale Grassmann Manifolds for Single-Cell Data Analysis

链接: https://arxiv.org/abs/2511.11717
作者: Xiang Xiang Wang,Sean Cottrell,Guo-Wei Wei
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注:

点击查看摘要

Abstract:Single-cell data analysis seeks to characterize cellular heterogeneity based on high-dimensional gene expression profiles. Conventional approaches represent each cell as a vector in Euclidean space, which limits their ability to capture intrinsic correlations and multiscale geometric structures. We propose a multiscale framework based on Grassmann manifolds that integrates machine learning with subspace geometry for single-cell data analysis. By generating embeddings under multiple representation scales, the framework combines their features from different geometric views into a unified Grassmann manifold. A power-based scale sampling function is introduced to control the selection of scales and balance in- formation across resolutions. Experiments on nine benchmark single-cell RNA-seq datasets demonstrate that the proposed approach effectively preserves meaningful structures and provides stable clustering performance, particularly for small to medium-sized datasets. These results suggest that Grassmann manifolds offer a coherent and informative foundation for analyzing single cell data.

[LG-136] Federated Learning for Pediatric Pneumonia Detection: Enabling Collaborative Diagnosis Without Sharing Patient Data

链接: https://arxiv.org/abs/2511.11714
作者: Daniel M. Jimenez-Gutierrez,Enrique Zuazua,Joaquin Del Rio,Oleksii Sliusarenko,Xabi Uribe-Etxebarria
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Early and accurate pneumonia detection from chest X-rays (CXRs) is clinically critical to expedite treatment and isolation, reduce complications, and curb unnecessary antibiotic use. Although artificial intelligence (AI) substantially improves CXR-based detection, development is hindered by globally distributed data, high inter-hospital variability, and strict privacy regulations (e.g., HIPAA, GDPR) that make centralization impractical. These constraints are compounded by heterogeneous imaging protocols, uneven data availability, and the costs of transferring large medical images across geographically dispersed sites. In this paper, we evaluate Federated Learning (FL) using the this http URL FL platform, enabling multiple hospitals (nodes) to collaboratively train a CXR classifier for pneumonia while keeping data in place and private. Using the Pediatric Pneumonia Chest X-ray dataset, we simulate cross-hospital collaboration with non-independent and non-identically distributed (non-IID) data, reproducing real-world variability across institutions and jurisdictions. Our experiments demonstrate that collaborative and privacy-preserving training across multiple hospitals via FL led to a dramatic performance improvement achieving 0.900 Accuracy and 0.966 ROC-AUC, corresponding to 47.5% and 50.0% gains over single-hospital models (0.610; 0.644), without transferring any patient CXR. These results indicate that FL delivers high-performing, generalizable, secure and private pneumonia detection across healthcare networks, with data kept local. This is especially relevant for rare diseases, where FL enables secure multi-institutional collaboration without data movement, representing a breakthrough for accelerating diagnosis and treatment development in low-data domains. Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC) Cite as: arXiv:2511.11714 [cs.LG] (or arXiv:2511.11714v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.11714 Focus to learn more arXiv-issued DOI via DataCite

[LG-137] Which Sparse Autoencoder Features Are Real? Model-X Knockoffs for False Discovery Rate Control

链接: https://arxiv.org/abs/2511.11711
作者: Tsogt-Ochir Enkhbayar
类目: Machine Learning (cs.LG)
*备注: 1 figure

点击查看摘要

Abstract:Although sparse autoencoders (SAEs) are crucial for identifying interpretable features in neural networks, it is still challenging to distinguish between real computational patterns and erroneous correlations. We introduce Model-X knockoffs to SAE feature selection, using knock-off+ to control the false discovery rate (FDR) with finite-sample guarantees under the standard Model-X assumptions (in our case, via a Gaussian surrogate for the latent distribution). We select 129 features at a target FDR q=0.1 after analyzing 512 high-activity SAE latents for sentiment classification using Pythia-70M. About 25% of the latents under examination carry task-relevant signal, whereas 75% do not, according to the chosen set, which displays a 5.40x separation in knockoff statistics compared to non-selected features. Our method offers a re-producible and principled framework for reliable feature discovery by combining SAEs with multiple-testing-aware inference, advancing the foundations of mechanistic interpretability.

[LG-138] FSC-Net: Fast-Slow Consolidation Networks for Continual Learning

链接: https://arxiv.org/abs/2511.11707
作者: Mohamed El Gorrim
类目: Machine Learning (cs.LG)
*备注: Code and the full repo available at this https URL

点击查看摘要

Abstract:Continual learning remains challenging due to catastrophic forgetting, where neural networks lose previously acquired knowledge when learning new tasks. Inspired by memory consolidation in neuroscience, we propose FSC-Net (Fast-Slow Consolidation Networks), a dual-network architecture that separates rapid task learning from gradual knowledge consolidation. Our method employs a fast network (NN1) for immediate adaptation to new tasks and a slow network (NN2) that consolidates knowledge through distillation and replay. Within the family of MLP-based NN1 variants we evaluated, consolidation effectiveness is driven more by methodology than architectural embellishments – a simple MLP outperforms more complex similarity-gated variants by 1.2pp. Through systematic hyperparameter analysis, we observed empirically that pure replay without distillation during consolidation achieves superior performance, consistent with the hypothesis that distillation from the fast network introduces recency bias. On Split-MNIST (30 seeds), FSC-Net achieves 91.71% +/- 0.62% retention accuracy, a +4.27pp gain over the fast network alone (87.43% +/- 1.27%, paired t=23.585, p 1e-10). On Split-CIFAR-10 (5 seeds), our method achieves 33.31% +/- 0.38% retention with an +8.20pp gain over the fast network alone (25.11% +/- 1.61%, paired t=9.75, p 1e-3), demonstrating +8.20pp gain, though absolute performance (33.31%) remains modest and below random expectation, highlighting need for stronger backbones. Our results provide empirical evidence that the dual-timescale consolidation mechanism, rather than architectural complexity, is central to mitigating catastrophic forgetting in this setting.

[LG-139] Bayesian Neural Networks with Monte Carlo Dropout for Probabilistic Electricity Price Forecasting

链接: https://arxiv.org/abs/2511.11701
作者: Abhinav Das,Stephan Schlüter
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate electricity price forecasting is critical for strategic decision-making in deregulated electricity markets, where volatility stems from complex supply-demand dynamics and external factors. Traditional point forecasts often fail to capture inherent uncertainties, limiting their utility for risk management. This work presents a framework for probabilistic electricity price forecasting using Bayesian neural networks (BNNs) with Monte Carlo (MC) dropout, training separate models for each hour of the day to capture diurnal patterns. A critical assessment and comparison with the benchmark model, namely: generalized autoregressive conditional heteroskedasticity with exogenous variable (GARCHX) model and the LASSO estimated auto-regressive model (LEAR), highlights that the proposed model outperforms the benchmark models in terms of point prediction and intervals. This work serves as a reference for leveraging probabilistic neural models in energy market predictions.

[LG-140] ghter Truncated Rectangular Prism Approximation for RNN Robustness Verification

链接: https://arxiv.org/abs/2511.11699
作者: Xingqi Lin,Liangyu Chen,Min Wu,Min Zhang,Zhenbing Zeng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Robustness verification is a promising technique for rigorously proving Recurrent Neural Networks (RNNs) robustly. A key challenge is to over-approximate the nonlinear activation functions with linear constraints, which can transform the verification problem into an efficiently solvable linear programming problem. Existing methods over-approximate the nonlinear parts with linear bounding planes individually, which may cause significant over-estimation and lead to lower verification accuracy. In this paper, in order to tightly enclose the three-dimensional nonlinear surface generated by the Hadamard product, we propose a novel truncated rectangular prism formed by two linear relaxation planes and a refinement-driven method to minimize both its volume and surface area for tighter over-approximation. Based on this approximation, we implement a prototype DeepPrism for RNN robustness verification. The experimental results demonstrate that \emphDeepPrism has significant improvement compared with the state-of-the-art approaches in various tasks of image classification, speech recognition and sentiment analysis.

[LG-141] Moirai 2.0: When Less Is More for Time Series Forecasting

链接: https://arxiv.org/abs/2511.11698
作者: Chenghao Liu,Taha Aksu,Juncheng Liu,Xu Liu,Hanshu Yan,Quang Pham,Doyen Sahoo,Caiming Xiong,Silvio Savarese,Junnan Li
类目: Machine Learning (cs.LG)
*备注: 16 pages, 13 figures, and 1 table

点击查看摘要

Abstract:We introduce Moirai 2.0, a decoder-only time-series foundation model trained on a new corpus of 36M series. The model adopts quantile forecasting and multi-token prediction, improving both probabilistic accuracy and inference efficiency. On the Gift-Eval benchmark, it ranks among the top pretrained models while achieving a strong trade-off between accuracy, speed, and model size. Compared to Moirai 1.0, Moirai 2.0 replaces masked-encoder training, multi-patch inputs, and mixture-distribution outputs with a simpler decoder-only architecture, single patch, and quantile loss. Ablation studies isolate these changes – showing that the decoder-only backbone along with recursive multi-quantile decoding contribute most to the gains. Additional experiments show that Moirai 2.0 outperforms larger models from the same family and exhibits robust domain-level results. In terms of efficiency and model size, Moirai 2.0 is twice as fast and thirty times smaller than its prior best version, Moirai 1.0-Large, while also performing better. Model performance plateaus with increasing parameter count and declines at longer horizons, motivating future work on data scaling and long-horizon modeling. We release code and evaluation details to support further research.

[LG-142] Benchmarking GNNs for OOD Materials Property Prediction with Uncertainty Quantification

链接: https://arxiv.org/abs/2511.11697
作者: Liqin Tan,Pin Chen,Menghan Liu,Xiean Wang,Jianhuan Cen,Qingsong Zou
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: 12 pages, 1 figure, 5 tables

点击查看摘要

Abstract:We present MatUQ, a benchmark framework for evaluating graph neural networks (GNNs) on out-of-distribution (OOD) materials property prediction with uncertainty quantification (UQ). MatUQ comprises 1,375 OOD prediction tasks constructed from six materials datasets using five OFM-based and a newly proposed structure-aware splitting strategy, SOAP-LOCO, which captures local atomic environments more effectively. We evaluate 12 representative GNN models under a unified uncertainty-aware training protocol that combines Monte Carlo Dropout and Deep Evidential Regression (DER), and introduce a novel uncertainty metric, D-EviU, which shows the strongest correlation with prediction errors in most tasks. Our experiments yield two key findings. First, the uncertainty-aware training approach significantly improves model prediction accuracy, reducing errors by an average of 70.6% across challenging OOD scenarios. Second, the benchmark reveals that no single model dominates universally: earlier models such as SchNet and ALIGNN remain competitive, while newer models like CrystalFramer and SODNet demonstrate superior performance on specific material properties. These results provide practical insights for selecting reliable models under distribution shifts in materials discovery.

[LG-143] Regularized Schrödinger: Alleviating Distortion and Exposure Bias in Solving Inverse Problems

链接: https://arxiv.org/abs/2511.11686
作者: Qing Yao,Lijian Gao,Qirong Mao,Dong Ming
类目: Machine Learning (cs.LG); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:Diffusion models serve as a powerful generative framework for solving inverse problems. However, they still face two key challenges: 1) the distortion-perception tradeoff, where improving perceptual quality often degrades reconstruction fidelity, and 2) the exposure bias problem, where the training-inference input mismatch leads to prediction error accumulation and reduced reconstruction quality. In this work, we propose the Regularized Schrödinger Bridge (RSB), an adaptation of Schrödinger Bridge tailored for inverse problems that addresses the above limitations. RSB employs a novel regularized training strategy that perturbs both the input states and targets, effectively mitigating exposure bias by exposing the model to simulated prediction errors and also alleviating distortion by well-designed interpolation via the posterior mean. Extensive experiments on two typical inverse problems for speech enhancement demonstrate that RSB outperforms state-of-the-art methods, significantly improving distortion metrics and effectively reducing exposure bias.

[LG-144] R-Tuning: Wavelet-Decomposed Replay and Semantic Alignment for Continual Adaptation of Pretrained Time-Series Models

链接: https://arxiv.org/abs/2511.11685
作者: Tianyi Yin,Jingwei Wang,Chenze Wang,Han Wang,Jiexuan Cai,Min Liu,Yunlong Ma,Kun Gao,Yuting Song,Weiming Shen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pre-trained models have demonstrated exceptional generalization capabilities in time-series forecasting; however, adapting them to evolving data distributions remains a significant challenge. A key hurdle lies in accessing the original training data, as fine-tuning solely on new data often leads to catastrophic forgetting. To address this issue, we propose Replay Tuning (R-Tuning), a novel framework designed for the continual adaptation of pre-trained time-series models. R-Tuning constructs a unified latent space that captures both prior and current task knowledge through a frequency-aware replay strategy. Specifically, it augments model-generated samples via wavelet-based decomposition across multiple frequency bands, generating trend-preserving and fusion-enhanced variants to improve representation diversity and replay efficiency. To further reduce reliance on synthetic samples, R-Tuning introduces a latent consistency constraint that aligns new representations with the prior task space. This constraint guides joint optimization within a compact and semantically coherent latent space, ensuring robust knowledge retention and adaptation. Extensive experimental results demonstrate the superiority of R-Tuning, which reduces MAE and MSE by up to 46.9% and 46.8%, respectively, on new tasks, while preserving prior knowledge with gains of up to 5.7% and 6.0% on old tasks. Notably, under few-shot settings, R-Tuning outperforms all state-of-the-art baselines even when synthetic proxy samples account for only 5% of the new task dataset.

[LG-145] A Bayesian Model for Multi-stage Censoring ML4H2025

链接: https://arxiv.org/abs/2511.11684
作者: Shuvom Sadhuka,Sophia Lin,Emma Pierson,Bonnie Berger
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: Proceedings of ML4H 2025

点击查看摘要

Abstract:Many sequential decision settings in healthcare feature funnel structures characterized by a series of stages, such as screenings or evaluations, where the number of patients who advance to each stage progressively decreases and decisions become increasingly costly. For example, an oncologist may first conduct a breast exam, followed by a mammogram for patients with concerning exams, followed by a biopsy for patients with concerning mammograms. A key challenge is that the ground truth outcome, such as the biopsy result, is only revealed at the end of this funnel. The selective censoring of the ground truth can introduce statistical biases in risk estimation, especially in underserved patient groups, whose outcomes are more frequently censored. We develop a Bayesian model for funnel decision structures, drawing from prior work on selective labels and censoring. We first show in synthetic settings that our model is able to recover the true parameters and predict outcomes for censored patients more accurately than baselines. We then apply our model to a dataset of emergency department visits, where in-hospital mortality is observed only for those who are admitted to either the hospital or ICU. We find that there are gender-based differences in hospital and ICU admissions. In particular, our model estimates that the mortality risk threshold to admit women to the ICU is higher for women (5.1%) than for men (4.5%).

[LG-146] Homotopy-Guided Self-Supervised Learning of Parametric Solutions for AC Optimal Power Flow

链接: https://arxiv.org/abs/2511.11677
作者: Shimiao Li,Aaron Tuor,Draguna Vrabie,Larry Pileggi,Jan Drgona
类目: Machine Learning (cs.LG)
*备注: paper submitted to PES General Meeting 2026

点击查看摘要

Abstract:Learning to optimize (L2O) parametric approximations of AC optimal power flow (AC-OPF) solutions offers the potential for fast, reusable decision-making in real-time power system operations. However, the inherent nonconvexity of AC-OPF results in challenging optimization landscapes, and standard learning approaches often fail to converge to feasible, high-quality solutions. This work introduces a \textithomotopy-guided self-supervised L2O method for parametric AC-OPF problems. The key idea is to construct a continuous deformation of the objective and constraints during training, beginning from a relaxed problem with a broad basin of attraction and gradually transforming it toward the original problem. The resulting learning process improves convergence stability and promotes feasibility without requiring labeled optimal solutions or external solvers. We evaluate the proposed method on standard IEEE AC-OPF benchmarks and show that homotopy-guided L2O significantly increases feasibility rates compared to non-homotopy baselines, while achieving objective values comparable to full OPF solvers. These findings demonstrate the promise of homotopy-based heuristics for scalable, constraint-aware L2O in power system optimization.

[LG-147] Synergistic Feature Fusion for Latent Lyrical Classification: A Gated Deep Learning Architecture

链接: https://arxiv.org/abs/2511.11673
作者: M. A. Gameiro
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study addresses the challenge of integrating complex, high-dimensional deep semantic features with simple, interpretable structural cues for lyrical content classification. We introduce a novel Synergistic Fusion Layer (SFL) architecture, a deep learning model utilizing a gated mechanism to modulate Sentence-BERT embeddings (Fdeep) using low-dimensional auxiliary features (Fstruct). The task, derived from clustering UMAP-reduced lyrical embeddings, is reframed as binary classification, distinguishing a dominant, homogeneous cluster (Class 0) from all other content (Class 1). The SFL model achieved an accuracy of 0.9894 and a Macro F1 score of 0.9894, outperforming a comprehensive Random Forest (RF) baseline that used feature concatenation (Accuracy = 0.9868). Crucially, the SFL model demonstrated vastly superior reliability and calibration, exhibiting a 93% reduction in Expected Calibration Error (ECE = 0.0035) and a 2.5x lower Log Loss (0.0304) compared to the RF baseline (ECE = 0.0500; Log Loss = 0.0772). This performance validates the architectural hypothesis that non-linear gating is superior to simple feature concatenation, establishing the SFL model as a robust and trustworthy system for complex multimodal lyrical analysis.

[LG-148] Evaluation of LLM -based Explanations for a Learning Analytics Dashboard

链接: https://arxiv.org/abs/2511.11671
作者: Alina Deriyeva,Benjamin Paassen
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Learning Analytics Dashboards can be a powerful tool to support self-regulated learning in Digital Learning Environments and promote development of meta-cognitive skills, such as reflection. However, their effectiveness can be affected by the interpretability of the data they provide. To assist in the interpretation, we employ a large language model to generate verbal explanations of the data in the dashboard and evaluate it against a standalone dashboard and explanations provided by human teachers in an expert study with university level educators (N=12). We find that the LLM-based explanations of the skill state presented in the dashboard, as well as general recommendations on how to proceed with learning within the course are significantly more favored compared to the other conditions. This indicates that using LLMs for interpretation purposes can enhance the learning experience for learners while maintaining the pedagogical standards approved by teachers.

[LG-149] Adaptive Stepsizing for Stochastic Gradient Langevin Dynamics in Bayesian Neural Networks

链接: https://arxiv.org/abs/2511.11666
作者: Rajit Rajpal,Benedict Leimkuhler,Yuanhao Jiang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Bayesian neural networks (BNNs) require scalable sampling algorithms to approximate posterior distributions over parameters. Existing stochastic gradient Markov Chain Monte Carlo (SGMCMC) methods are highly sensitive to the choice of stepsize and adaptive variants such as pSGLD typically fail to sample the correct invariant measure without addition of a costly divergence correction term. In this work, we build on the recently proposed `SamAdams’ framework for timestep adaptation (Leimkuhler, Lohmann, and Whalley 2025), introducing an adaptive scheme: SA-SGLD, which employs time rescaling to modulate the stepsize according to a monitored quantity (typically the local gradient norm). SA-SGLD can automatically shrink stepsizes in regions of high curvature and expand them in flatter regions, improving both stability and mixing without introducing bias. We show that our method can achieve more accurate posterior sampling than SGLD on high-curvature 2D toy examples and in image classification with BNNs using sharp priors.

[LG-150] How many stations are sufficient? Exploring the effect of urban weather station density reduction on imputation accuracy of air temperature and humidity

链接: https://arxiv.org/abs/2511.11652
作者: Marvin Plein,Carsten F. Dormann,Andreas Christen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Urban weather station networks (WSNs) are widely used to monitor urban weather and climate patterns and aid urban planning. However, maintaining WSNs is expensive and labor-intensive. Here, we present a step-wise station removal procedure to thin an existing WSN in Freiburg, Germany, and analyze the ability of WSN subsets to reproduce air temperature and humidity patterns of the entire original WSN for a year following a simulated reduction of WSN density. We found that substantial reductions in station numbers after one year of full deployment are possible while retaining high predictive accuracy. A reduction from 42 to 4 stations, for instance, increased mean prediction RMSEs from 0.69 K to 0.83 K for air temperature and from 3.8% to 4.4% for relative humidity, corresponding to RMSE increases of only 20% and 16%, respectively. Predictive accuracy is worse for remote stations in forests than for stations in built-up or open settings, but consistently better than a state-of-the-art numerical urban land-surface model (Surface Urban Energy and Water Balance Scheme). Stations located at the edges between built-up and rural areas are most valuable when reconstructing city-wide climate characteristics. Our study demonstrates the potential of thinning WSNs to maximize the efficient allocation of financial and personnel-related resources in urban climate research.

[LG-151] he Environmental Impact of Ensemble Techniques in Recommender Systems

链接: https://arxiv.org/abs/2511.11649
作者: Jannik Nitschke
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Bachelor Thesis, University of Siegen

点击查看摘要

Abstract:Ensemble techniques in recommender systems have demonstrated accuracy improvements of 10-30%, yet their environmental impact remains unmeasured. While deep learning recommendation algorithms can generate up to 3,297 kg CO2 per paper, ensemble methods have not been sufficiently evaluated for energy consumption. This thesis investigates how ensemble techniques influence environmental impact compared to single optimized models. We conducted 93 experiments across two frameworks (Surprise for rating prediction, LensKit for ranking) on four datasets spanning 100,000 to 7.8 million interactions. We evaluated four ensemble strategies (Average, Weighted, Stacking/Rank Fusion, Top Performers) against simple baselines and optimized single models, measuring energy consumption with a smart plug. Results revealed a non-linear accuracy-energy relationship. Ensemble methods achieved 0.3-5.7% accuracy improvements while consuming 19-2,549% more energy depending on dataset size and strategy. The Top Performers ensemble showed best efficiency: 0.96% RMSE improvement with 18.8% energy overhead on MovieLens-1M, and 5.7% NDCG improvement with 103% overhead on MovieLens-100K. Exhaustive averaging strategies consumed 88-270% more energy for comparable gains. On the largest dataset (Anime, 7.8M interactions), the Surprise ensemble consumed 2,005% more energy (0.21 Wh vs. 0.01 Wh) for 1.2% accuracy improvement, producing 53.8 mg CO2 versus 2.6 mg CO2 for the single model. This research provides one of the first systematic measurements of energy and carbon footprint for ensemble recommender systems, demonstrates that selective strategies offer superior efficiency over exhaustive averaging, and identifies scalability limitations at industrial scale. These findings enable informed decisions about sustainable algorithm selection in recommender systems. Comments: Bachelor Thesis, University of Siegen Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG) Cite as: arXiv:2511.11649 [cs.IR] (or arXiv:2511.11649v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2511.11649 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Jannik Nitschke [view email] [v1] Mon, 10 Nov 2025 14:06:58 UTC (1,038 KB)

[LG-152] Exploring Parallelism in FPGA-Based Accelerators for Machine Learning Applications

链接: https://arxiv.org/abs/2511.11640
作者: Sed Centeno,Christopher Sprague,Arnab A Purkayastha,Ray Simar,Neeraj Magotra
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 5 pages

点击查看摘要

Abstract:Speculative backpropagation has emerged as a promising technique to accelerate the training of neural networks by overlapping the forward and backward passes. Leveraging speculative weight updates when error gradients fall within a specific threshold reduces training time without substantially compromising accuracy. In this work, we implement speculative backpropagation on the MNIST dataset using OpenMP as the parallel programming platform. OpenMP’s multi-threading capabilities enable simultaneous execution of forward and speculative backpropagation steps, significantly improving training speed. The application is planned for synthesis on a state-of-the-art FPGA to demonstrate its potential for hardware acceleration. Our CPU-based experimental results demonstrate that speculative backpropagation achieves a maximum speedup of 24% in execution time when using a threshold of 0.25, and accuracy remaining within 3-4% of the baseline across various epochs. Additionally, when comparing individual step execution time, speculative backpropagation yields a maximum speedup of 35% over the baseline, demonstrating the effectiveness of overlapping forward and backward passes.

[LG-153] Enhancing PINN Accuracy for the RLW Equation: Adaptive and Conservative Approaches

链接: https://arxiv.org/abs/2511.11638
作者: Aamir Shehzad
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Pattern Formation and Solitons (nlin.PS)
*备注: 32 pages, 19 figures This work investigates adaptive and conservative PINN frameworks for solving the RLW equation

点击查看摘要

Abstract:Standard physics-informed neural network implementations have produced large error rates when using these models to solve the regularized long wave (RLW) equation. Two improved PINN approaches were developed in this research: an adaptive approach with self-adaptive loss weighting and a conservative approach enforcing explicit conservation laws. Three benchmark tests were used to demonstrate how effective PINN’s are as they relate to the type of problem being solved (i.e., time dependent RLW equation). The first was a single soliton traveling along a line (propagation), the second was the interaction between two solitons, and the third was the evolution of an undular bore over the course of t=250 . The results demonstrated that the effectiveness of PINNs are problem specific. The adaptive PINN was significantly better than both the conservative PINN and the standard PINN at solving problems involving complex nonlinear interactions such as colliding two solitons. The conservative approach was significantly better at solving problems involving long term behavior of single solitons and undular bores. However, the most important finding from this research is that explicitly enforcing conservation laws may be harmful to optimizing the solution of highly nonlinear systems of equations and therefore requires special training methods. The results from our adaptive and conservative approaches were within O(10^-5) of established numerical solutions for the same problem, thus demonstrating that PINNs can provide accurate solutions to complex systems of partial differential equations without the need for a discretization of space or time (mesh free). Moreover, the finding from this research challenges the assumptions that conservation enforcement will always improve the performance of a PINN and provides researchers with guidelines for designing PINNs for use on specific types of problems.

[LG-154] An Explainable and Fair AI Tool for PCOS Risk Assessment: Calibration Subgroup Equity and Interactive Clinical Deployment

链接: https://arxiv.org/abs/2511.11636
作者: Asma Sadia Khan,Sadia Tabassum
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:This paper presents a fairness-audited and interpretable machine learning framework for predicting polycystic ovary syndrome (PCOS), designed to evaluate model performance and identify diagnostic disparities across patient subgroups. The framework integrated SHAP-based feature attributions with demographic audits to connect predictive explanations with observed disparities for actionable insights. Probabilistic calibration metrics (Brier Score and Expected Calibration Error) are incorporated to ensure reliable risk predictions across subgroups. Random Forest, SVM, and XGBoost models were trained with isotonic and Platt scaling for calibration and fairness comparison. A calibrated Random Forest achieved a high predictive accuracy of 90.8%. SHAP analysis identified follicle count, weight gain, and menstrual irregularity as the most influential features, which are consistent with the Rotterdam diagnostic criteria. Although the SVM with isotonic calibration achieved the lowest calibration error (ECE = 0.0541), the Random Forest model provided a better balance between calibration and interpretability (Brier = 0.0678, ECE = 0.0666). Therefore, it was selected for detailed fairness and SHAP analyses. Subgroup analysis revealed that the model performed best among women aged 25-35 (accuracy 90.9%) but underperformed in those under 25 (69.2%), highlighting age-related disparities. The model achieved perfect precision in obese women and maintained high recall in lean PCOS cases, demonstrating robustness across phenotypes. Finally, a Streamlit-based web interface enables real-time PCOS risk assessment, Rotterdam criteria evaluation, and interactive ‘what-if’ analysis, bridging the gap between AI research and clinical usability.

[LG-155] Physics-Informed Neural Network-based Reliability Analysis of Buried Pipelines

链接: https://arxiv.org/abs/2511.11613
作者: Pouya Taraghi,Yong Li,Samer Adeeb
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Applications (stat.AP)
*备注: This manuscript has been submitted to Reliability Engineering System Safety and is currently under peer review

点击查看摘要

Abstract:Buried pipelines transporting oil and gas across geohazard-prone regions are exposed to potential ground movement, leading to the risk of significant strain demand and structural failure. Reliability analysis, which determines the probability of failure after accounting for pertinent uncertainties, is essential for ensuring the safety of pipeline systems. However, traditional reliability analysis methods involving computationally intensive numerical models, such as finite element simulations of pipeline subjected to ground movement, have limited applications; this is partly because stochastic sampling approaches require repeated simulations over a large number of samples for the uncertain variables when estimating low probabilities. This study introduces Physics-Informed Neural Network for Reliability Analysis (PINN-RA) for buried pipelines subjected to ground movement, which integrates PINN-based surrogate model with Monte Carlo Simulation (MCS) to achieve efficient reliability assessment. To enable its application under uncertain variables associated with soil properties and ground movement, the PINN-based surrogate model is extended to solve a parametric differential equation system, namely the governing equation of pipelines embedded in soil with different properties. The findings demonstrate that PINN-RA significantly reduces the computational effort required and thus accelerates reliability analysis. By eliminating the need for repetitive numerical evaluations of pipeline subjected to permanent ground movement, the proposed approach provides an efficient and scalable tool for pipeline reliability assessment, enabling rapid decision-making in geohazard-prone regions.

[LG-156] Enhancing failure prediction in nuclear industry: Hybridization of knowledge- and data-driven techniques

链接: https://arxiv.org/abs/2511.11604
作者: Amaratou Mahamadou Saley,Thierry Moyaux,Aïcha Sekhari,Vincent Cheutet,Jean-Baptiste Danielou
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 19 pages, 9 figures, 6 journal, Journal Q1 (Computers and Industrial Engineering)

点击查看摘要

Abstract:The convergence of the Internet of Things (IoT) and Industry 4.0 has significantly enhanced data-driven methodologies within the nuclear industry, notably enhancing safety and economic efficiency. This advancement challenges the precise prediction of future maintenance needs for assets, which is crucial for reducing downtime and operational costs. However, the effectiveness of data-driven methodologies in the nuclear sector requires extensive domain knowledge due to the complexity of the systems involved. Thus, this paper proposes a novel predictive maintenance methodology that combines data-driven techniques with domain knowledge from a nuclear equipment. The methodological originality of this paper is located on two levels: highlighting the limitations of purely data-driven approaches and demonstrating the importance of knowledge in enhancing the performance of the predictive models. The applicative novelty of this work lies in its use within a domain such as a nuclear industry, which is highly restricted and ultrasensitive due to security, economic and environmental concerns. A detailed real-world case study which compares the current state of equipment monitoring with two scenarios, demonstrate that the methodology significantly outperforms purely data-driven methods in failure prediction. While purely data-driven methods achieve only a modest performance with a prediction horizon limited to 3 h and a F1 score of 56.36%, the hybrid approach increases the prediction horizon to 24 h and achieves a higher F1 score of 93.12%.

[LG-157] Aspiration-based Perturbed Learning Automata in Games with Noisy Utility Measurements. Part A: Stochastic Stability in Non-zero-Sum Games

链接: https://arxiv.org/abs/2511.11602
作者: Georgios C. Chasparis
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Reinforcement-based learning has attracted considerable attention both in modeling human behavior as well as in engineering, for designing measurement- or payoff-based optimization schemes. Such learning schemes exhibit several advantages, especially in relation to filtering out noisy observations. However, they may exhibit several limitations when applied in a distributed setup. In multi-player weakly-acyclic games, and when each player applies an independent copy of the learning dynamics, convergence to (usually desirable) pure Nash equilibria cannot be guaranteed. Prior work has only focused on a small class of games, namely potential and coordination games. To address this main limitation, this paper introduces a novel payoff-based learning scheme for distributed optimization, namely aspiration-based perturbed learning automata (APLA). In this class of dynamics, and contrary to standard reinforcement-based learning schemes, each player’s probability distribution for selecting actions is reinforced both by repeated selection and an aspiration factor that captures the player’s satisfaction level. We provide a stochastic stability analysis of APLA in multi-player positive-utility games under the presence of noisy observations. This is the first part of the paper that characterizes stochastic stability in generic non-zero-sum games by establishing equivalence of the induced infinite-dimensional Markov chain with a finite dimensional one. In the second part, stochastic stability is further specialized to weakly acyclic games.

[LG-158] WildfireGenome: Interpretable Machine Learning Reveals Local Drivers of Wildfire Risk and Their Cross-County Variation

链接: https://arxiv.org/abs/2511.11589
作者: Chenyue Liu,Ali Mostafavi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current wildfire risk assessments rely on coarse hazard maps and opaque machine learning models that optimize regional accuracy while sacrificing interpretability at the decision scale. WildfireGenome addresses these gaps through three components: (1) fusion of seven federal wildfire indicators into a sign-aligned, PCA-based composite risk label at H3 Level-8 resolution; (2) Random Forest classification of local wildfire risk; and (3) SHAP and ICE/PDP analyses to expose county-specific nonlinear driver relationships. Across seven ecologically diverse U.S. counties, models achieve accuracies of 0.755-0.878 and Quadratic Weighted Kappa up to 0.951, with principal components explaining 87-94% of indicator variance. Transfer tests show reliable performance between ecologically similar regions but collapse across dissimilar contexts. Explanations consistently highlight needleleaf forest cover and elevation as dominant drivers, with risk rising sharply at 30-40% needleleaf coverage. WildfireGenome advances wildfire risk assessment from regional prediction to interpretable, decision-scale analytics that guide vegetation management, zoning, and infrastructure planning.

[LG-159] Parameter-Efficient and Personalized Federated Training of Generative Models at the Edge

链接: https://arxiv.org/abs/2511.11585
作者: Kabir Khan,Manju Sarkar,Anita Kar,Suresh Ghosh
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 37 pages, 8 figures

点击查看摘要

Abstract:Large generative models (for example, language and diffusion models) enable high-quality text and image synthesis but are hard to train or adapt in cross-device federated settings due to heavy computation and communication and statistical/system heterogeneity. We propose FedGen-Edge, a framework that decouples a frozen, pre-trained global backbone from lightweight client-side adapters and federates only the adapters. Using Low-Rank Adaptation (LoRA) constrains client updates to a compact subspace, which reduces uplink traffic by more than 99 percent versus full-model FedAvg, stabilizes aggregation under non-IID data, and naturally supports personalization because each client can keep a locally tuned adapter. On language modeling (PTB) and image generation (CIFAR-10), FedGen-Edge achieves lower perplexity/FID and faster convergence than strong baselines while retaining a simple FedAvg-style server. A brief ablation shows diminishing returns beyond moderate LoRA rank and a trade-off between local epochs and client drift. FedGen-Edge offers a practical path toward privacy-preserving, resource-aware, and personalized generative AI on heterogeneous edge devices.

[LG-160] Social and Physical Attributes-Defined Trust Evaluation for Effective Collaborator Selection in Human-Device Coexistence Systems

链接: https://arxiv.org/abs/2511.11578
作者: Botao Zhu,Xianbin Wang
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In human-device coexistence systems, collaborations among devices are determined by not only physical attributes such as network topology but also social attributes among human users. Consequently, trust evaluation of potential collaborators based on these multifaceted attributes becomes critical for ensuring the eventual outcome. However, due to the high heterogeneity and complexity of physical and social attributes, efficiently integrating them for accurate trust evaluation remains challenging. To overcome this difficulty, a canonical correlation analysis-enhanced hypergraph self-supervised learning (HSLCCA) method is proposed in this research. First, by treating all attributes as relationships among connected devices, a relationship hypergraph is constructed to comprehensively capture inter-device relationships across three dimensions: spatial attribute-related, device attribute-related, and social attribute-related. Next, a self-supervised learning framework is developed to integrate these multi-dimensional relationships and generate device embeddings enriched with relational semantics. In this learning framework, the relationship hypergraph is augmented into two distinct views to enhance semantic information. A parameter-sharing hypergraph neural network is then utilized to learn device embeddings from both views. To further enhance embedding quality, a CCA approach is applied, allowing the comparison of data between the two views. Finally, the trustworthiness of devices is calculated based on the learned device embeddings. Extensive experiments demonstrate that the proposed HSLCCA method significantly outperforms the baseline algorithm in effectively identifying trusted devices.

[LG-161] Detecting Statistically Significant Fairness Violations in Recidivism Forecasting Algorithms

链接: https://arxiv.org/abs/2511.11575
作者: Animesh Joshi
类目: Machine Learning (cs.LG)
*备注: 8 pages

点击查看摘要

Abstract:Machine learning algorithms are increasingly deployed in critical domains such as finance, healthcare, and criminal justice [1]. The increasing popularity of algorithmic decision-making has stimulated interest in algorithmic fairness within the academic community. Researchers have introduced various fairness definitions that quantify disparities between privileged and protected groups, use causal inference to determine the impact of race on model predictions, and that test calibration of probability predictions from the model. Existing literature does not provide a way in which to assess whether observed disparities between groups are statistically significant or merely due to chance. This paper introduces a rigorous framework for testing the statistical significance of fairness violations by leveraging k-fold cross-validation [2] to generate sampling distributions of fairness metrics. This paper introduces statistical tests that can be used to identify statistically significant violations of fairness metrics based on disparities between predicted and actual outcomes, model calibration, and causal inference techniques [1]. We demonstrate this approach by testing recidivism forecasting algorithms trained on data from the National Institute of Justice. Our findings reveal that machine learning algorithms used for recidivism forecasting exhibit statistically significant bias against Black individuals under several fairness definitions, while also exhibiting no bias or bias against White individuals under other definitions. The results from this paper underscore the importance of rigorous and robust statistical testing while evaluating algorithmic decision-making systems.

[LG-162] LLM on a Budget: Active Knowledge Distillation for Efficient Classification of Large Text Corpora

链接: https://arxiv.org/abs/2511.11574
作者: Viviana Luccioli,Rithika Iyengar,Ryan Panley,Flora Haberkorn,Xiaoyu Ge,Leland Crane,Nitish Sinha,Seung Jung Lee
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are highly accurate in classification tasks, however, substantial computational and financial costs hinder their large-scale deployment in dynamic environments. Knowledge Distillation (KD) where a LLM “teacher” trains a smaller and more efficient “student” model, offers a promising solution to this problem. However, the distillation process itself often remains costly for large datasets, since it requires the teacher to label a vast number of samples while incurring significant token consumption. To alleviate this challenge, in this work we explore the active learning (AL) as a way to create efficient student models at a fraction of the cost while preserving the LLM’s performance. In particular, we introduce M-RARU (Multi-class Randomized Accept/Reject Uncertainty Sampling), a novel AL algorithm that significantly reduces training costs. M-RARU employs an innovative strategy combining uncertainty with a randomized accept-reject mechanism to select only the most informative data points for the LLM teacher. This focused approach significantly minimizes required API calls and data processing time. We evaluate M-RARU against random sampling across five diverse student models (SVM, LDA, RF, GBDT, and DistilBERT) on multiple benchmark datasets. Experiments demonstrate that our proposed method achieves up to 80% reduction in sample requirements as compared to random sampling, substantially improving classification accuracy while reducing financial costs and overall training time.

[LG-163] Softmax as a Lagrangian-Legendrian Seam

链接: https://arxiv.org/abs/2511.11573
作者: Christopher R. Lee-Jenkins
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This note offers a first bridge from machine learning to modern differential geometry. We show that the logits-to-probabilities step implemented by softmax can be modeled as a geometric interface: two potential-generated, conservative descriptions (from negative entropy and log-sum-exp) meet along a Legendrian “seam” on a contact screen (the probability simplex) inside a simple folded symplectic collar. Bias-shift invariance appears as Reeb flow on the screen, and the Fenchel-Young equality/KL gap provides a computable distance to the seam. We work out the two- and three-class cases to make the picture concrete and outline next steps for ML: compact logit models (projective or spherical), global invariants, and connections to information geometry where on-screen dynamics manifest as replicator flows.

[LG-164] A Gentle Introduction to Conformal Time Series Forecasting

链接: https://arxiv.org/abs/2511.13608
作者: M. Stocker,W. Małgorzewicz,M. Fontana,S. Ben Taieb
类目: Methodology (stat.ME); Machine Learning (cs.LG); Econometrics (econ.EM)
*备注:

点击查看摘要

Abstract:Conformal prediction is a powerful post-hoc framework for uncertainty quantification that provides distribution-free coverage guarantees. However, these guarantees crucially rely on the assumption of exchangeability. This assumption is fundamentally violated in time series data, where temporal dependence and distributional shifts are pervasive. As a result, classical split-conformal methods may yield prediction intervals that fail to maintain nominal validity. This review unifies recent advances in conformal forecasting methods specifically designed to address nonexchangeable data. We first present a theoretical foundation, deriving finite-sample guarantees for split-conformal prediction under mild weak-dependence conditions. We then survey and classify state-of-the-art approaches that mitigate serial dependence by reweighting calibration data, dynamically updating residual distributions, or adaptively tuning target coverage levels in real time. Finally, we present a comprehensive simulation study that compares these techniques in terms of empirical coverage, interval width, and computational cost, highlighting practical trade-offs and open research directions.

[LG-165] Power Homotopy for Zeroth-Order Non-Convex Optimizations

链接: https://arxiv.org/abs/2511.13592
作者: Chen Xu
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce GS-PowerHP, a novel zeroth-order method for non-convex optimization problems of the form \max_x \in \mathbbR^d f(x) . Our approach leverages two key components: a power-transformed Gaussian-smoothed surrogate F_N,\sigma(\mu) = \mathbbE_x\sim\mathcalN(\mu,\sigma^2 I_d)[e^N f(x)] whose stationary points cluster near the global maximizer x^* of f for sufficiently large N , and an incrementally decaying \sigma for enhanced data efficiency. Under mild assumptions, we prove convergence in expectation to a small neighborhood of x^* with the iteration complexity of O(d^2 \varepsilon^-2) . Empirical results show our approach consistently ranks among the top three across a suite of competing algorithms. Its robustness is underscored by the final experiment on a substantially high-dimensional problem ( d=150,528 ), where it achieved first place on least-likely targeted black-box attacks against images from ImageNet, surpassing all competing methods.

[LG-166] he Shape of Data: Topology Meets Analytics. A Practical Introduction to Topological Analytics and the Stability Index (TSI) in Business

链接: https://arxiv.org/abs/2511.13503
作者: Ioannis Diamantis
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Applications (stat.AP)
*备注: 36 pages, 22 figures

点击查看摘要

Abstract:Modern business and economic datasets often exhibit nonlinear, multi-scale structures that traditional linear tools under-represent. Topological Data Analysis (TDA) offers a geometric lens for uncovering robust patterns, such as connected components, loops and voids, across scales. This paper provides an intuitive, figure-driven introduction to persistent homology and a practical, reproducible TDA pipeline for applied analysts. Through comparative case studies in consumer behavior, equity markets (SAX/eSAX vs.\ TDA) and foreign exchange dynamics, we demonstrate how topological features can reveal segmentation patterns and structural relationships beyond classical statistical methods. We discuss methodological choices regarding distance metrics, complex construction and interpretation, and we introduce the \textitTopological Stability Index (TSI), a simple yet interpretable indicator of structural variability derived from persistence lifetimes. We conclude with practical guidelines for TDA implementation, visualization and communication in business and economic analytics.

[LG-167] Systematic evaluation of time-frequency features for binaural sound source localization ICASSP2026

链接: https://arxiv.org/abs/2511.13487
作者: Davoud Shariat Panah,Alessandro Ragano,Dan Barry,Jan Skoglund,Andrew Hines
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Submitted to ICASSP 2026

点击查看摘要

Abstract:This study presents a systematic evaluation of time-frequency feature design for binaural sound source localization (SSL), focusing on how feature selection influences model performance across diverse conditions. We investigate the performance of a convolutional neural network (CNN) model using various combinations of amplitude-based features (magnitude spectrogram, interaural level difference - ILD) and phase-based features (phase spectrogram, interaural phase difference - IPD). Evaluations on in-domain and out-of-domain data with mismatched head-related transfer functions (HRTFs) reveal that carefully chosen feature combinations often outperform increases in model complexity. While two-feature sets such as ILD + IPD are sufficient for in-domain SSL, generalization to diverse content requires richer inputs combining channel spectrograms with both ILD and IPD. Using the optimal feature sets, our low-complexity CNN model achieves competitive performance. Our findings underscore the importance of feature design in binaural SSL and provide practical guidance for both domain-specific and general-purpose localization.

[LG-168] aming Barren Plateaus in Arbitrary Parameterized Quantum Circuits Without Sacrificing Expressibility

链接: https://arxiv.org/abs/2511.13408
作者: Zhenyu Chen,Yuguo Shao,Zhengwei Liu,Zhaohui Wei
类目: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quantum algorithms based on parameterized quantum circuits (PQCs) have enabled a wide range of applications on near-term quantum devices. However, existing PQC architectures face several challenges, among which the ``barren plateaus" phenomenon is particularly prominent. In such cases, the loss function concentrates exponentially with increasing system size, thereby hindering effective parameter optimization. To address this challenge, we propose a general and hardware-efficient method for eliminating barren plateaus in an arbitrary PQC. Specifically, our approach achieves this by inserting a layer of easily implementable quantum channels into the original PQC, each channel requiring only one ancilla qubit and four additional gates, yielding a modified PQC (MPQC) that is provably at least as expressive as the original PQC and, under mild assumptions, is guaranteed to be free from barren plateaus. Furthermore, by appropriately adjusting the structure of MPQCs, we rigorously prove that any parameter in the original PQC can be made trainable. Importantly, the absence of barren plateaus in MPQCs is robust against realistic noise, making our approach directly applicable to current noisy intermediate-scale quantum (NISQ) hardware. Numerically, we demonstrate the practicality of our method by modifying a commonly used PQC for thermal-state preparation. The results show that barren plateaus are effectively eliminated in this class of circuits with up to 100 qubits and 2400 layers, whereas the original ansatz suffers from severe gradient vanishing.

[LG-169] Causal Inference Biomarker Discovery Graph Neural Network Feature Selection

链接: https://arxiv.org/abs/2511.13295
作者: Chaowang Lan,Jingxin Wu,Yulong Yuan,Chuxun Liu,Huangyi Kang,Caihua Liu
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Biomarker discovery from high-throughput transcriptomic data is crucial for advancing precision medicine. However, existing methods often neglect gene-gene regulatory relationships and lack stability across datasets, leading to conflation of spurious correlations with genuine causal effects. To address these issues, we develop a causal graph neural network (Causal-GNN) method that integrates causal inference with multi-layer graph neural networks (GNNs). The key innovation is the incorporation of causal effect estimation for identifying stable biomarkers, coupled with a GNN-based propensity scoring mechanism that leverages cross-gene regulatory networks. Experimental results demonstrate that our method achieves consistently high predictive accuracy across four distinct datasets and four independent classifiers. Moreover, it enables the identification of more stable biomarkers compared to traditional methods. Our work provides a robust, efficient, and biologically interpretable tool for biomarker discovery, demonstrating strong potential for broad application across medical disciplines.

[LG-170] Case study of a differentiable heterogeneous multiphysics solver for a nuclear fusion application

链接: https://arxiv.org/abs/2511.13262
作者: Jack B. Coughlin,Archis Joglekar,Jonathan Brodrick,Alexander Lavin
类目: Computational Physics (physics.comp-ph); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Mathematical Software (cs.MS); Plasma Physics (physics.plasm-ph)
*备注:

点击查看摘要

Abstract:This work presents a case study of a heterogeneous multiphysics solver from the nuclear fusion domain. At the macroscopic scale, an auto-differentiable ODE solver in JAX computes the evolution of the pulsed power circuit and bulk plasma parameters for a compressing Z Pinch. The ODE solver requires a closure for the impedance of the plasma load obtained via root-finding at every timestep, which we solve efficiently using gradient-based Newton iteration. However, incorporating non-differentiable production-grade plasma solvers like Gkeyll (a C/CUDA plasma simulation suite) into a gradient-based workflow is non-trivial. The ‘‘Tesseract’’ software addresses this challenge by providing a multi-physics differentiable abstraction layer made fully compatible with JAX (through the tesseract_jax adapter). This architecture ensures end-to-end differentiability while allowing seamless interchange between high-fidelity solvers (Gkeyll), neural surrogates, and analytical approximations for rapid, progressive prototyping.

[LG-171] Likelihood-guided Regularization in Attention Based Models

链接: https://arxiv.org/abs/2511.13221
作者: Mohamed Salem,Inyoung Kim
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The transformer architecture has demonstrated strong performance in classification tasks involving structured and high-dimensional data. However, its success often hinges on large- scale training data and careful regularization to prevent overfitting. In this paper, we intro- duce a novel likelihood-guided variational Ising-based regularization framework for Vision Transformers (ViTs), which simultaneously enhances model generalization and dynamically prunes redundant parameters. The proposed variational Ising-based regularization approach leverages Bayesian sparsification techniques to impose structured sparsity on model weights, allowing for adaptive architecture search during training. Unlike traditional dropout-based methods, which enforce fixed sparsity patterns, the variational Ising-based regularization method learns task-adaptive regularization, improving both efficiency and interpretability. We evaluate our approach on benchmark vision datasets, including MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100, demonstrating improved generalization under sparse, complex data and allowing for principled uncertainty quantification on both weights and selection parameters. Additionally, we show that the Ising regularizer leads to better-calibrated probability estimates and structured feature selection through uncertainty-aware attention mechanisms. Our results highlight the effectiveness of structured Bayesian sparsification in enhancing transformer-based architectures, offering a principled alternative to standard regularization techniques.

[LG-172] Reconstruction of Manifold Distances from Noisy Observations

链接: https://arxiv.org/abs/2511.13025
作者: Charles Fefferman,Jonathan Marty,Kevin Ren
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Differential Geometry (math.DG); Probability (math.PR)
*备注: 43 pages

点击查看摘要

Abstract:We consider the problem of reconstructing the intrinsic geometry of a manifold from noisy pairwise distance observations. Specifically, let M denote a diameter 1 d-dimensional manifold and \mu a probability measure on M that is mutually absolutely continuous with the volume measure. Suppose X_1,\dots,X_N are i.i.d. samples of \mu and we observe noisy-distance random variables d’(X_j, X_k) that are related to the true geodesic distances d(X_j,X_k) . With mild assumptions on the distributions and independence of the noisy distances, we develop a new framework for recovering all distances between points in a sufficiently dense subsample of M . Our framework improves on previous work which assumed i.i.d. additive noise with known moments. Our method is based on a new way to estimate L_2 -norms of certain expectation-functions f_x(y)=\mathbbEd’(x,y) and use them to build robust clusters centered at points of our sample. Using a new geometric argument, we establish that, under mild geometric assumptions–bounded curvature and positive injectivity radius–these clusters allow one to recover the true distances between points in the sample up to an additive error of O(\varepsilon \log \varepsilon^-1) . We develop two distinct algorithms for producing these clusters. The first achieves a sample complexity N \asymp \varepsilon^-2d-2\log(1/\varepsilon) and runtime o(N^3) . The second introduces novel geometric ideas that warrant further investigation. In the presence of missing observations, we show that a quantitative lower bound on sampling probabilities suffices to modify the cluster construction in the first algorithm and extend all recovery guarantees. Our main technical result also elucidates which properties of a manifold are necessary for the distance recovery, which suggests further extension of our techniques to a broader class of metric probability spaces.

[LG-173] Revealing the dynamic responses of Pb under shock loading based on DFT-accuracy machine learning potential

链接: https://arxiv.org/abs/2511.12995
作者: Enze Hou,Xiaoyang Wang,Han Wang
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Lead (Pb) is a typical low-melting-point ductile metal and serves as an important model material in the study of dynamic responses. Under shock-wave loading, its dynamic mechanical behavior comprises two key phenomena: plastic deformation and shock induced phase transitions. The underlying mechanisms of these processes are still poorly understood. Revealing these mechanisms remains challenging for experimental approaches. Non-equilibrium molecular dynamics (NEMD) simulations are an alternative theoretical tool for studying dynamic responses, as they capture atomic-scale mechanisms such as defect evolution and deformation pathways. However, due to the limited accuracy of empirical interatomic potentials, the reliability of previous NEMD studies is questioned. Using our newly developed machine learning potential for Pb-Sn alloys, we revisited the microstructure evolution in response to shock loading under various shock orientations. The results reveal that shock loading along the [001] orientation of Pb exhibits a fast, reversible, and massive phase transition and stacking fault evolution. The behavior of Pb differs from previous studies by the absence of twinning during plastic deformation. Loading along the [011] orientation leads to slow, irreversible plastic deformation, and a localized FCC-BCC phase transition in the Pitsch orientation relationship. This study provides crucial theoretical insights into the dynamic mechanical response of Pb, offering a theoretical input for understanding the microstructure-performance relationship under extreme conditions.

[LG-174] Scalable learning of macroscopic stochastic dynamics

链接: https://arxiv.org/abs/2511.12842
作者: Mengyi Chen,Pengru Huang,Kostya S. Novoselov,Qianxiao Li
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Macroscopic dynamical descriptions of complex physical systems are crucial for understanding and controlling material behavior. With the growing availability of data and compute, machine learning has become a promising alternative to first-principles methods to build accurate macroscopic models from microscopic trajectory simulations. However, for spatially extended systems, direct simulations of sufficiently large microscopic systems that inform macroscopic behavior is prohibitive. In this work, we propose a framework that learns the macroscopic dynamics of large stochastic microscopic systems using only small-system simulations. Our framework employs a partial evolution scheme to generate training data pairs by evolving large-system snapshots within local patches. We subsequently identify the closure variables associated with the macroscopic observables and learn the macroscopic dynamics using a custom loss. Furthermore, we introduce a hierarchical upsampling scheme that enables efficient generation of large-system snapshots from small-system trajectory distributions. We empirically demonstrate the accuracy and robustness of our framework through a variety of stochastic spatially extended systems, including those described by stochastic partial differential equations, idealised lattice spin systems, and a more realistic NbMoTa alloy system.

[LG-175] Benign Overfitting in Linear Classifiers with a Bias Term

链接: https://arxiv.org/abs/2511.12840
作者: Yuta Kondo
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 17 pages

点击查看摘要

Abstract:Modern machine learning models with a large number of parameters often generalize well despite perfectly interpolating noisy training data - a phenomenon known as benign overfitting. A foundational explanation for this in linear classification was recently provided by Hashimoto et al. (2025). However, this analysis was limited to the setting of “homogeneous” models, which lack a bias (intercept) term - a standard component in practice. This work directly extends Hashimoto et al.'s results to the more realistic inhomogeneous case, which incorporates a bias term. Our analysis proves that benign overfitting persists in these more complex models. We find that the presence of the bias term introduces new constraints on the data’s covariance structure required for generalization, an effect that is particularly pronounced when label noise is present. However, we show that in the isotropic case, these new constraints are dominated by the requirements inherited from the homogeneous model. This work provides a more complete picture of benign overfitting, revealing the non-trivial impact of the bias term on the conditions required for good generalization.

[LG-176] DIGing–SGLD: Decentralized and Scalable Langevin Sampling over Time–Varying Networks

链接: https://arxiv.org/abs/2511.12836
作者: Waheed U. Bajwa,Mert Gurbuzbalaban,Mustafa Ali Kutbay,Lingjiong Zhu,Muhammad Zulqarnain
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Sampling from a target distribution induced by training data is central to Bayesian learning, with Stochastic Gradient Langevin Dynamics (SGLD) serving as a key tool for scalable posterior sampling and decentralized variants enabling learning when data are distributed across a network of agents. This paper introduces DIGing-SGLD, a decentralized SGLD algorithm designed for scalable Bayesian learning in multi-agent systems operating over time-varying networks. Existing decentralized SGLD methods are restricted to static network topologies, and many exhibit steady-state sampling bias caused by network effects, even when full batches are used. DIGing-SGLD overcomes these limitations by integrating Langevin-based sampling with the gradient-tracking mechanism of the DIGing algorithm, originally developed for decentralized optimization over time-varying networks, thereby enabling efficient and bias-free sampling without a central coordinator. To our knowledge, we provide the first finite-time non-asymptotic Wasserstein convergence guarantees for decentralized SGLD-based sampling over time-varying networks, with explicit constants. Under standard strong convexity and smoothness assumptions, DIGing-SGLD achieves geometric convergence to an O(\sqrt\eta) neighborhood of the target distribution, where \eta is the stepsize, with dependence on the target accuracy matching the best-known rates for centralized and static-network SGLD algorithms using constant stepsize. Numerical experiments on Bayesian linear and logistic regression validate the theoretical results and demonstrate the strong empirical performance of DIGing-SGLD under dynamically evolving network conditions.

[LG-177] Practical Causal Evaluation Metrics for Biological Networks

链接: https://arxiv.org/abs/2511.12805
作者: Noriaki Sato,Marco Scutari,Shuichi Kawano,Rui Yamaguchi,Seiya Imoto
类目: Molecular Networks (q-bio.MN); Machine Learning (cs.LG)
*备注: 15 pages, 1 figure

点击查看摘要

Abstract:Estimating causal networks from biological data is a critical step in systems biology. When evaluating the inferred network, assessing the networks based on their intervention effects is particularly important for downstream probabilistic reasoning and the identification of potential drug targets. In the context of gene regulatory network inference, biological databases are often used as reference sources. These databases typically describe relationships in a qualitative rather than quantitative manner. However, few evaluation metrics have been developed that take this qualitative nature into account. To address this, we developed a metric, the sign-augmented Structural Intervention Distance (sSID), and a weighted sSID that incorporates the net effects of the intervention. Through simulations and analyses of real transcriptomic datasets, we found that our proposed metrics could identify a different algorithm as optimal compared to conventional metrics, and the network selected by sSID had a superior performance in the classification task of clinical covariates using transcriptomic data. This suggests that sSID can distinguish networks that are structurally correct but functionally incorrect, highlighting its potential as a more biologically meaningful and practical evaluation metric.

[LG-178] Function-on-Function Bayesian Optimization

链接: https://arxiv.org/abs/2511.12783
作者: Jingru Huang,Haijie Xu,Manrui Jiang,Chen Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 13 pages, 4 figures, conference

点击查看摘要

Abstract:Bayesian optimization (BO) has been widely used to optimize expensive and gradient-free objective functions across various domains. However, existing BO methods have not addressed the objective where both inputs and outputs are functions, which increasingly arise in complex systems as advanced sensing technologies. To fill this gap, we propose a novel function-on-function Bayesian optimization (FFBO) framework. Specifically, we first introduce a function-on-function Gaussian process (FFGP) model with a separable operator-valued kernel to capture the correlations between function-valued inputs and outputs. Compared to existing Gaussian process models, FFGP is modeled directly in the function space. Based on FFGP, we define a scalar upper confidence bound (UCB) acquisition function using a weighted operator-based scalarization strategy. Then, a scalable functional gradient ascent algorithm (FGA) is developed to efficiently identify the optimal function-valued input. We further analyze the theoretical properties of the proposed method. Extensive experiments on synthetic and real-world data demonstrate the superior performance of FFBO over existing approaches.

[LG-179] SB-HB: A Hierarchical Bayesian Extension of the TSB Model for Intermittent Demand Forecasting

链接: https://arxiv.org/abs/2511.12749
作者: Zong-Han Bai,Po-Yen Chu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Preprint. 11 pages, 1 figure, Equal contribution by the two authors

点击查看摘要

Abstract:Intermittent demand forecasting poses unique challenges due to sparse observations, cold-start items, and obsolescence. Classical models such as Croston, SBA, and the Teunter-Syntetos-Babai (TSB) method provide simple heuristics but lack a principled generative foundation. Deep learning models address these limitations but often require large datasets and sacrifice interpretability. We introduce TSB-HB, a hierarchical Bayesian extension of TSB. Demand occurrence is modeled with a Beta-Binomial distribution, while nonzero demand sizes follow a Log-Normal distribution. Crucially, hierarchical priors enable partial pooling across items, stabilizing estimates for sparse or cold-start series while preserving heterogeneity. This framework yields a fully generative and interpretable model that generalizes classical exponential smoothing. On the UCI Online Retail dataset, TSB-HB achieves lower RMSE and RMSSE than Croston, SBA, TSB, ADIDA, IMAPA, ARIMA and Theta, and on a subset of the M5 dataset it outperforms all classical baselines we evaluate. The model provides calibrated probabilistic forecasts and improved accuracy on intermittent and lumpy items by combining a generative formulation with hierarchical shrinkage, while remaining interpretable and scalable. Comments: Preprint. 11 pages, 1 figure, Equal contribution by the two authors Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG) Cite as: arXiv:2511.12749 [stat.ML] (or arXiv:2511.12749v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2511.12749 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-180] Accelerated Distributional Temporal Difference Learning with Linear Function Approximation

链接: https://arxiv.org/abs/2511.12688
作者: Kaicheng Jin,Yang Peng,Jiansheng Yang,Zhihua Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we study the finite-sample statistical rates of distributional temporal difference (TD) learning with linear function approximation. The purpose of distributional TD learning is to estimate the return distribution of a discounted Markov decision process for a given policy. Previous works on statistical analysis of distributional TD learning focus mainly on the tabular case. We first consider the linear function approximation setting and conduct a fine-grained analysis of the linear-categorical Bellman equation. Building on this analysis, we further incorporate variance reduction techniques in our new algorithms to establish tight sample complexity bounds independent of the support size K when K is large. Our theoretical results imply that, when employing distributional TD learning with linear function approximation, learning the full distribution of the return function from streaming data is no more difficult than learning its expectation. This work provide new insights into the statistical efficiency of distributional reinforcement learning algorithms.

[LG-181] Auto-encoder model for faster generation of effective one-body gravitational waveform approximations

链接: https://arxiv.org/abs/2511.12642
作者: Suyog Garg,Feng-Li Lin,Kipp Cannon
类目: General Relativity and Quantum Cosmology (gr-qc); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: Submitting to PRD

点击查看摘要

Abstract:Upgrades to current gravitational wave detectors for the next observation run and upcoming third-generation observatories, like the Einstein telescope, are expected to have enormous improvements in detection sensitivities and compact object merger event rates. Estimation of source parameters for a wider parameter space that these detectable signals will lie in, will be a computational challenge. Thus, it is imperative to have methods to speed-up the likelihood calculations with theoretical waveform predictions, which can ultimately make the parameter estimation faster and aid in rapid multi-messenger follow-ups. Towards this end, we present a conditional variational auto-encoder model, based on the best performing architecture of Liao+2021, for faster generation of aligned-spin SEOBNRv4 inspiral-merger-ringdown waveforms. Our parameter space consists of four parameters, [ m_1 , m_2 , \chi_1(z) , \chi_2(z) ]. The masses are uniformly sampled in [5,75],M_\odot with a mass ratio limit at 10,M_\odot , while the spins are uniform in [-0.99,0.99] . We train the model using \sim10^5 input waveforms data with a 70%/10% train/validation split, while 20% data are reserved for testing. The median mismatch for the generated waveforms in the test dataset is \sim10^-2 , with better performance in a restricted parameter space of \chi_\rm eff\in[-0.80,0.80] . Our model is able to generate 100 waveforms in 0.1 second at an average speed of about 4.46 ms per waveform. This is 2-3 orders of magnitude faster than the native SEOBNRv4 implementation in lalsimulation. The latent sampling uncertainty of our model can be quantified with a mean mismatch deviation of 2\times10^-1 for 1000 generations of the same waveform. Our work aims to be the first step towards developing a production-ready machine learning framework for the faster generation of gravitational waveform approximations.

[LG-182] DLMMPR:Deep Learning-based Measurement Matrix for Phase Retrieval

链接: https://arxiv.org/abs/2511.12556
作者: Jing Liu,Bing Guo,Ren Zhu
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper pioneers the integration of learning optimization into measurement matrix design for phase retrieval. We introduce the Deep Learning-based Measurement Matrix for Phase Retrieval (DLMMPR) algorithm, which parameterizes the measurement matrix within an end-to-end deep learning architecture. Synergistically augmented with subgradient descent and proximal mapping modules for robust recovery, DLMMPR’s efficacy is decisively confirmed through comprehensive empirical validation across diverse noise regimes. Benchmarked against DeepMMSE and PrComplex, our method yields substantial gains in PSNR and SSIM, underscoring its superiority.

[LG-183] Discovering autonomous quantum error correction via deep reinforcement learning

链接: https://arxiv.org/abs/2511.12482
作者: Yue Yin,Tailong Xiao,Xiaoyang Deng,Ming He,Jianping Fan,Guihua Zeng
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quantum error correction is essential for fault-tolerant quantum computing. However, standard methods relying on active measurements may introduce additional errors. Autonomous quantum error correction (AQEC) circumvents this by utilizing engineered dissipation and drives in bosonic systems, but identifying practical encoding remains challenging due to stringent Knill-Laflamme conditions. In this work, we utilize curriculum learning enabled deep reinforcement learning to discover Bosonic codes under approximate AQEC framework to resist both single-photon and double-photon losses. We present an analytical solution of solving the master equation under approximation conditions, which can significantly accelerate the training process of reinforcement learning. The agent first identifies an encoded subspace surpassing the breakeven point through rapid exploration within a constrained evolutionary time-frame, then strategically fine-tunes its policy to sustain this performance advantage over extended temporal horizons. We find that the two-phase trained agent can discover the optimal set of codewords, i.e., the Fock states \ket4 and \ket7 considering the effect of both single-photon and double-photon loss. We identify that the discovered code surpasses the breakeven threshold over a longer evolution time and achieve the state-of-art performance. We also analyze the robustness of the code against the phase damping and amplitude damping noise. Our work highlights the potential of curriculum learning enabled deep reinforcement learning in discovering the optimal quantum error correct code especially in early fault-tolerant quantum systems.

[LG-184] A Multicollinearity-Aware Signal-Processing Framework for Cross-β Identification via X-ray Scattering of Alzheimers Tissue

链接: https://arxiv.org/abs/2511.12451
作者: Abdullah Al Bashit,Prakash Nepal,Lee Makowski
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: 19 pages, 4 figures, journal paper under review

点击查看摘要

Abstract:X-ray scattering measurements of in situ human brain tissue encode structural signatures of pathological cross- \beta inclusions, yet systematic exploitation of these data for automated detection remains challenging due to substrate contamination, strong inter-feature correlations, and limited sample sizes. This work develops a three-stage classification framework for identifying cross- \beta structural inclusions-a hallmark of Alzheimer’s disease-in X-ray scattering profiles of post-mortem human brain. Stage 1 employs a Bayes-optimal classifier to separate mica substrate from tissue regions on the basis of their distinct scattering signatures. Stage 2 introduces a multicollinearityaware, class-conditional correlation pruning scheme with formal guarantees on the induced Bayes risk and approximation error, thereby reducing redundancy while retaining class-discriminative information. Stage 3 trains a compact neural network on the pruned feature set to detect the presence or absence of cross- \beta fibrillar ordering. The top-performing model, optimized with a composite loss combining Focal and Dice objectives, attains a test F1-score of 84.30% using 11 of 211 candidate features and 174 trainable parameters. The overall framework yields an interpretable, theory-grounded strategy for data-limited classification problems involving correlated, high-dimensional experimental measurements, exemplified here by X-ray scattering profiles of neurodegenerative tissue.

[LG-185] From Black Box to Bijection: Interpreting Machine Learning to Build a Zeta Map Algorithm

链接: https://arxiv.org/abs/2511.12421
作者: Xiaoyu Huang,Blake Jackson,Kyu-Hwan Lee
类目: Combinatorics (math.CO); Machine Learning (cs.LG)
*备注: Extended abstract submitted to the 38th FPSAC (2026, Seattle). 12 pages, 1 figure

点击查看摘要

Abstract:There is a large class of problems in algebraic combinatorics which can be distilled into the same challenge: construct an explicit combinatorial bijection. Traditionally, researchers have solved challenges like these by visually inspecting the data for patterns, formulating conjectures, and then proving them. But what is to be done if patterns fail to emerge until the data grows beyond human scale? In this paper, we propose a new workflow for discovering combinatorial bijections via machine learning. As a proof of concept, we train a transformer on paired Dyck paths and use its learned attention patterns to derive a new algorithmic description of the zeta map, which we call the \textitScaffolding Map.

[LG-186] Stochastic Predictive Analytics for Stocks in the Newsvendor Problem

链接: https://arxiv.org/abs/2511.12397
作者: Pedro A. Pury
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注: 21 pages, 4 figures

点击查看摘要

Abstract:This work addresses a key challenge in inventory management by developing a stochastic model that describes the dynamic distribution of inventory stock over time without assuming a specific demand distribution. Our model provides a flexible and applicable solution for situations with limited historical data and short-term predictions, making it well-suited for the Newsvendor problem. We evaluate our model’s performance using real-world data from a large electronic marketplace, demonstrating its effectiveness in a practical forecasting scenario.

[LG-187] PCA: How Uniformity Induces Robustness to Background Noise in Contrastive Learning

链接: https://arxiv.org/abs/2511.12278
作者: Mingqi Wu,Qiang Sun,Yi Yang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 14 pages main, 26 pages appendix

点击查看摘要

Abstract:High-dimensional data often contain low-dimensional signals obscured by structured background noise, which limits the effectiveness of standard PCA. Motivated by contrastive learning, we address the problem of recovering shared signal subspaces from positive pairs, paired observations sharing the same signal but differing in background. Our baseline, PCA+, uses alignment-only contrastive learning and succeeds when background variation is mild, but fails under strong noise or high-dimensional regimes. To address this, we introduce PCA++, a hard uniformity-constrained contrastive PCA that enforces identity covariance on projected features. PCA++ has a closed-form solution via a generalized eigenproblem, remains stable in high dimensions, and provably regularizes against background interference. We provide exact high-dimensional asymptotics in both fixed-aspect-ratio and growing-spike regimes, showing uniformity’s role in robust signal recovery. Empirically, PCA++ outperforms standard PCA and alignment-only PCA+ on simulations, corrupted-MNIST, and single-cell transcriptomics, reliably recovering condition-invariant structure. More broadly, we clarify uniformity’s role in contrastive learning, showing that explicit feature dispersion defends against structured noise and enhances robustness.

[LG-188] Reinforcement Learning for Chemical Ordering in Alloy Nanoparticles

链接: https://arxiv.org/abs/2511.12260
作者: Jonas Elsborg,Arghya Bhowmik
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 15 pages, 7 figures, 1 table

点击查看摘要

Abstract:We approach the search for optimal element ordering in bimetallic alloy nanoparticles (NPs) as a reinforcement learning (RL) problem, and have built an RL agent that learns to perform such global optimisation using the geometric graph representation of the NPs. To demonstrate the effectiveness, we train an RL agent to perform composition-conserving atomic swap actions on the icosahedral nanoparticle structure. Trained once on randomised Ag_XAu_309-X compositions and orderings, the agent discovers previously established ground state structure. We show that this optimization is robust to differently ordered initialisations of the same NP compositions. We also demonstrate that a trained policy can extrapolate effectively to NPs of unseen size. However, the efficacy is limited when multiple alloying elements are involved. Our results demonstrate that RL with pre-trained equivariant graph encodings can navigate combinatorial ordering spaces at the nanoparticle scale, and offer a transferable optimisation strategy with the potential to generalise across composition and reduce repeated individual search cost.

[LG-189] Chemistry-Enhanced Diffusion-Based Framework for Small-to-Large Molecular Conformation Generation

链接: https://arxiv.org/abs/2511.12182
作者: Yifei Zhu,Jiahui Zhang,Jiawei Peng,Mengge Li,Chao Xu,Zhenggang Lan
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Obtaining 3D conformations of realistic polyatomic molecules at the quantum chemistry level remains challenging, and although recent machine learning advances offer promise, predicting large-molecule structures still requires substantial computational effort. Here, we introduce StoL, a diffusion model-based framework that enables rapid and knowledge-free generation of large molecular structures from small-molecule data. Remarkably, StoL assembles molecules in a LEGO-style fashion from scratch, without seeing the target molecules or any structures of comparable size during training. Given a SMILES input, it decomposes the molecule into chemically valid fragments, generates their 3D structures with a diffusion model trained on small molecules, and assembles them into diverse conformations. This fragment-based strategy eliminates the need for large-molecule training data while maintaining high scalability and transferability. By embedding chemical principles into key steps, StoL ensures faster convergence, chemically rational structures, and broad configurational coverage, as confirmed against DFT calculations.

[LG-190] Rapid Machine Learning-Driven Detection of Pesticides and Dyes Using Raman Spectroscopy

链接: https://arxiv.org/abs/2511.12167
作者: Quach Thi Thai Binh,Thuan Phuoc,Xuan Hai,Thang Bach Phan,Vu Thi Hanh Thu,Nguyen Tuan Hung
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 25 pages, 9 figures

点击查看摘要

Abstract:The extensive use of pesticides and synthetic dyes poses critical threats to food safety, human health, and environmental sustainability, necessitating rapid and reliable detection methods. Raman spectroscopy offers molecularly specific fingerprints but suffers from spectral noise, fluorescence background, and band overlap, limiting its real-world applicability. Here, we propose a deep learning framework based on ResNet-18 feature extraction, combined with advanced classifiers, including XGBoost, SVM, and their hybrid integration, to detect pesticides and dyes from Raman spectroscopy, called MLRaman. The MLRaman with the CNN-XGBoost model achieved a predictive accuracy of 97.4% and a perfect AUC of 1.0, while it with the CNN-SVM model provided competitive results with robust class-wise discrimination. Dimensionality reduction analyses (PCA, t-SNE, UMAP) confirmed the separability of Raman embeddings across 10 analytes, including 7 pesticides and 3 dyes. Finally, we developed a user-friendly Streamlit application for real-time prediction, which successfully identified unseen Raman spectra from our independent experiments and also literature sources, underscoring strong generalization capacity. This study establishes a scalable, practical MLRaman model for multi-residue contaminant monitoring, with significant potential for deployment in food safety and environmental surveillance.

[LG-191] Informed Bootstrap Augmentation Improves EEG Decoding

链接: https://arxiv.org/abs/2511.12073
作者: Woojae Jeong,Wenhui Cui,Kleanthis Avramidis,Takfarinas Medani,Shrikanth Narayanan,Richard Leahy
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Electroencephalography (EEG) offers detailed access to neural dynamics but remains constrained by noise and trial-by-trial variability, limiting decoding performance in data-restricted or complex paradigms. Data augmentation is often employed to enhance feature representations, yet conventional uniform averaging overlooks differences in trial informativeness and can degrade representational quality. We introduce a weighted bootstrapping approach that prioritizes more reliable trials to generate higher-quality augmented samples. In a Sentence Evaluation paradigm, weights were computed from relative ERP differences and applied during probabilistic sampling and averaging. Across conditions, weighted bootstrapping improved decoding accuracy relative to unweighted (from 68.35% to 71.25% at best), demonstrating that emphasizing reliable trials strengthens representational quality. The results demonstrate that reliability-based augmentation yields more robust and discriminative EEG representations. The code is publicly available at this https URL.

[LG-192] Aggregating Conformal Prediction Sets via α-Allocation

链接: https://arxiv.org/abs/2511.12065
作者: Congbin Xu,Yue Yu,Haojie Ren,Zhaojun Wang,Changliang Zou
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Conformal prediction offers a distribution-free framework for constructing prediction sets with finite-sample coverage. Yet, efficiently leveraging multiple conformity scores to reduce prediction set size remains a major open challenge. Instead of selecting a single best score, this work introduces a principled aggregation strategy, COnfidence-Level Allocation (COLA), that optimally allocates confidence levels across multiple conformal prediction sets to minimize empirical set size while maintaining provable coverage. Two variants are further developed, COLA-s and COLA-f, which guarantee finite-sample marginal coverage via sample splitting and full conformalization, respectively. In addition, we develop COLA-l, an individualized allocation strategy that promotes local size efficiency while achieving asymptotic conditional coverage. Extensive experiments on synthetic and real-world datasets demonstrate that COLA achieves considerably smaller prediction sets than state-of-the-art baselines while maintaining valid coverage.

[LG-193] Bayesian–AI Fusion for Epidemiological Decision Making: Calibrated Risk Honest Uncertainty and Hyperparameter Intelligence

链接: https://arxiv.org/abs/2511.11983
作者: Debashis Chatterjee
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modern epidemiological analytics increasingly use machine learning models that offer strong prediction but often lack calibrated uncertainty. Bayesian methods provide principled uncertainty quantification, yet are viewed as difficult to integrate with contemporary AI workflows. This paper proposes a unified Bayesian and AI framework that combines Bayesian prediction with Bayesian hyperparameter optimization. We use Bayesian logistic regression to obtain calibrated individual-level disease risk and credible intervals on the Pima Indians Diabetes dataset. In parallel, we use Gaussian-process Bayesian optimization to tune penalized Cox survival models on the GBSG2 breast cancer cohort. This yields a two-layer system: a Bayesian predictive layer that represents risk as a posterior distribution, and a Bayesian optimization layer that treats model selection as inference over a black-box objective. Simulation studies in low- and high-dimensional regimes show that the Bayesian layer provides reliable coverage and improved calibration, while Bayesian shrinkage improves AUC, Brier score, and log-loss. Bayesian optimization consistently pushes survival models toward near-oracle concordance. Overall, Bayesian reasoning enhances both what we infer and how we search, enabling calibrated risk and principled hyperparameter intelligence for epidemiological decision making. Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG) MSC classes: Primary: 62F15, 62P10, 62F15, 62P10, 62J12, 62N02, 62L05, 62G15, 68T07, 65C60 Cite as: arXiv:2511.11983 [stat.ML] (or arXiv:2511.11983v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2511.11983 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-194] PCA recovery thresholds in low-rank matrix inference with sparse noise

链接: https://arxiv.org/abs/2511.11927
作者: Urte Adomaityte,Gabriele Sicuro,Pierpaolo Vivo
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注: 24 pages, 7 figures

点击查看摘要

Abstract:We study the high-dimensional inference of a rank-one signal corrupted by sparse noise. The noise is modelled as the adjacency matrix of a weighted undirected graph with finite average connectivity in the large size limit. Using the replica method from statistical physics, we analytically compute the typical value of the top eigenvalue, the top eigenvector component density, and the overlap between the signal vector and the top eigenvector. The solution is given in terms of recursive distributional equations for auxiliary probability density functions which can be efficiently solved using a population dynamics algorithm. Specialising the noise matrix to Poissonian and Random Regular degree distributions, the critical signal strength is analytically identified at which a transition happens for the recovery of the signal via the top eigenvector, thus generalising the celebrated BBP transition to the sparse noise case. In the large-connectivity limit, known results for dense noise are recovered. Analytical results are in agreement with numerical diagonalisation of large matrices.

[LG-195] Modeling X-ray photon pile-up with a normalizing flow NEURIPS2025

链接: https://arxiv.org/abs/2511.11863
作者: Ole König,Daniela Huppenkothen,Douglas Finkbeiner,Christian Kirsch,Jörn Wilms,Justina R. Yang,James F. Steiner,Juan Rafael Martínez-Galarza
类目: High Energy Astrophysical Phenomena (astro-ph.HE); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: Accepted in Machine Learning and the Physical Sciences Workshop, NeurIPS 2025

点击查看摘要

Abstract:The dynamic range of imaging detectors flown on-board X-ray observatories often only covers a limited flux range of extrasolar X-ray sources. The analysis of bright X-ray sources is complicated by so-called pile-up, which results from high incident photon flux. This nonlinear effect distorts the measured spectrum, resulting in biases in the inferred physical parameters, and can even lead to a complete signal loss in extreme cases. Piled-up data are commonly discarded due to resulting intractability of the likelihood. As a result, a large number of archival observations remain underexplored. We present a machine learning solution to this problem, using a simulation-based inference framework that allows us to estimate posterior distributions of physical source parameters from piled-up eROSITA data. We show that a normalizing flow produces better-constrained posterior densities than traditional mitigation techniques, as more data can be leveraged. We consider model- and calibration-dependent uncertainties and the applicability of such an algorithm to real data in the eROSITA archive.

[LG-196] A Computational Method for Solving the Stochastic Joint Replenishment Problem in High Dimensions

链接: https://arxiv.org/abs/2511.11830
作者: Barış Ata,Wouter van Eekelen,Yuan Zhong
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 52 pages, 3 figures

点击查看摘要

Abstract:We consider a discrete-time formulation for a class of high-dimensional stochastic joint replenishment problems. First, we approximate the problem by a continuous-time impulse control problem. Exploiting connections among the impulse control problem, backward stochastic differential equations (BSDEs) with jumps, and the stochastic target problem, we develop a novel, simulation-based computational method that relies on deep neural networks to solve the impulse control problem. Based on that solution, we propose an implementable inventory control policy for the original (discrete-time) stochastic joint replenishment problem, and test it against the best available benchmarks in a series of test problems. For the problems studied thus far, our method matches or beats the best benchmark we could find, and it is computationally feasible up to at least 50 dimensions – that is, 50 stock-keeping units (SKUs).

[LG-197] FreDN: Spectral Disentanglement for Time Series Forecasting via Learnable Frequency Decomposition

链接: https://arxiv.org/abs/2511.11817
作者: Zhongde An,Jinhong You,Jiyanglin Li,Yiming Tang,Wen Li,Heming Du,Shouguo Du
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time series forecasting is essential in a wide range of real world applications. Recently, frequency-domain methods have attracted increasing interest for their ability to capture global dependencies. However, when applied to non-stationary time series, these methods encounter the \textitspectral entanglement and the computational burden of complex-valued learning. The \textitspectral entanglement refers to the overlap of trends, periodicities, and noise across the spectrum due to \textitspectral leakage and the presence of non-stationarity. However, existing decompositions are not suited to resolving spectral entanglement. To address this, we propose the Frequency Decomposition Network (FreDN), which introduces a learnable Frequency Disentangler module to separate trend and periodic components directly in the frequency domain. Furthermore, we propose a theoretically supported ReIm Block to reduce the complexity of complex-valued operations while maintaining performance. We also re-examine the frequency-domain loss function and provide new theoretical insights into its effectiveness. Extensive experiments on seven long-term forecasting benchmarks demonstrate that FreDN outperforms state-of-the-art methods by up to 10%. Furthermore, compared with standard complex-valued architectures, our real-imaginary shared-parameter design reduces the parameter count and computational cost by at least 50%.

[LG-198] Socrates-Mol: Self-Oriented Cognitive Reasoning through Autonomous Trial-and-Error with Empirical-Bayesian Screening for Molecules

链接: https://arxiv.org/abs/2511.11769
作者: Xiangru Wang,Zekun Jiang,Heng Yang,Cheng Tan,Xingying Lan,Chunming Xu,Tianhang Zhou
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Molecular property prediction is fundamental to chemical engineering applications such as solvent screening. We present Socrates-Mol, a framework that transforms language models into empirical Bayesian reasoners through context engineering, addressing cold start problems without model fine-tuning. The system implements a reflective-prediction cycle where initial outputs serve as priors, retrieved molecular cases provide evidence, and refined predictions form posteriors, extracting reusable chemical rules from sparse data. We introduce ranking tasks aligned with industrial screening priorities and employ cross-model self-consistency across five language models to reduce variance. Experiments on amine solvent LogP prediction reveal task-dependent patterns: regression achieves 72% MAE reduction and 112% R-squared improvement through self-consistency, while ranking tasks show limited gains due to systematic multi-model biases. The framework reduces deployment costs by over 70% compared to full fine-tuning, providing a scalable solution for molecular property prediction while elucidating the task-adaptive nature of self-consistency mechanisms.

[LG-199] Generalized Inequality-based Approach for Probabilistic WCET Estimation

链接: https://arxiv.org/abs/2511.11682
作者: Hayate Toba,Atsushi Yano,Takuya Azumi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Estimating the probabilistic Worst-Case Execution Time (pWCET) is essential for ensuring the timing correctness of real-time applications, such as in robot IoT systems and autonomous driving systems. While methods based on Extreme Value Theory (EVT) can provide tight bounds, they suffer from model uncertainty due to the need to decide where the upper tail of the distribution begins. Conversely, inequality-based approaches avoid this issue but can yield pessimistic results for heavy-tailed distributions. This paper proposes a method to reduce such pessimism by incorporating saturating functions (arctangent and hyperbolic tangent) into Chebyshev’s inequality, which mitigates the influence of large outliers while preserving mathematical soundness. Evaluations on synthetic and real-world data from the Autoware autonomous driving stack demonstrate that the proposed method achieves safe and tighter bounds for such distributions.

[LG-200] Omics-scale polymer computational database transferable to real-world artificial intelligence applications

链接: https://arxiv.org/abs/2511.11626
作者: Ryo Yoshida,Yoshihiro Hayashi,Hidemine Furuya,Ryohei Hosoya,Kazuyoshi Kaneko,Hiroki Sugisawa,Yu Kaneko,Aiko Takahashi,Yoh Noguchi,Shun Nanjo,Keiko Shinoda,Tomu Hamakawa,Mitsuru Ohno,Takuya Kitamura,Misaki Yonekawa,Stephen Wu,Masato Ohnishi,Chang Liu,Teruki Tsurimoto,Arifin,Araki Wakiuchi,Kohei Noda,Junko Morikawa,Teruaki Hayakawa,Junichiro Shiomi,Masanobu Naito,Kazuya Shiratori,Tomoki Nagai,Norio Tomotsu,Hiroto Inoue,Ryuichi Sakashita,Masashi Ishii,Isao Kuwajima,Kenji Furuichi,Norihiko Hiroi,Yuki Takemoto,Takahiro Ohkuma,Keita Yamamoto,Naoya Kowatari,Masato Suzuki,Naoya Matsumoto,Seiryu Umetani,Hisaki Ikebata,Yasuyuki Shudo,Mayu Nagao,Shinya Kamada,Kazunori Kamio,Taichi Shomura,Kensaku Nakamura,Yudai Iwamizu,Atsutoshi Abe,Koki Yoshitomi,Yuki Horie,Katsuhiko Koike,Koichi Iwakabe,Shinya Gima,Kota Usui,Gikyo Usuki,Takuro Tsutsumi,Keitaro Matsuoka,Kazuki Sada,Masahiro Kitabata,Takuma Kikutsuji,Akitaka Kamauchi,Yusuke Iijima,Tsubasa Suzuki,Takenori Goda,Yuki Takabayashi,Kazuko Imai,Yuji Mochizuki,Hideo Doi,Koji Okuwaki,Hiroya Nitta,Taku Ozawa,Hitoshi Kamijima,Toshiaki Shintani,Takuma Mitamura,Massimiliano Zamengo,Yuitsu Sugami,Seiji Akiyama,Yoshinari Murakami,Atsushi Betto,Naoya Matsuo,Satoru Kagao,Tetsuya Kobayashi,Norie Matsubara,Shosei Kubo,Yuki Ishiyama,Yuri Ichioka,Mamoru Usami,Satoru Yoshizaki,Seigo Mizutani,Yosuke Hanawa,Shogo Kunieda,Mitsuru Yambe,Takeru Nakamura,Hiromori Murashima,Kenji Takahashi,Naoki Wada,Masahiro Kawano
类目: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Soft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG)
*备注: 65 pages, 11 figures

点击查看摘要

Abstract:Developing large-scale foundational datasets is a critical milestone in advancing artificial intelligence (AI)-driven scientific innovation. However, unlike AI-mature fields such as natural language processing, materials science, particularly polymer research, has significantly lagged in developing extensive open datasets. This lag is primarily due to the high costs of polymer synthesis and property measurements, along with the vastness and complexity of the chemical space. This study presents PolyOmics, an omics-scale computational database generated through fully automated molecular dynamics simulation pipelines that provide diverse physical properties for over 10^5 polymeric materials. The PolyOmics database is collaboratively developed by approximately 260 researchers from 48 institutions to bridge the gap between academia and industry. Machine learning models pretrained on PolyOmics can be efficiently fine-tuned for a wide range of real-world downstream tasks, even when only limited experimental data are available. Notably, the generalisation capability of these simulation-to-real transfer models improve significantly as the size of the PolyOmics database increases, exhibiting power-law scaling. The emergence of scaling laws supports the “more is better” principle, highlighting the significance of ultralarge-scale computational materials data for improving real-world prediction performance. This unprecedented omics-scale database reveals vast unexplored regions of polymer materials, providing a foundation for AI-driven polymer science.

[LG-201] Limitations of Quantum Advantage in Unsupervised Machine Learning CCL2025

链接: https://arxiv.org/abs/2511.10709
作者: Apoorva D. Patel
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 4 pages,1 figure. Invited talk at the 2025 IEEE International Conference on Quantum Control, Computing and Learning (IEEE qCCL2025), Hong Kong, June 2025. Published in the proceedings, pp. 39-42

点击查看摘要

Abstract:Machine learning models are used for pattern recognition analysis of big data, without direct human intervention. The task of unsupervised learning is to find the probability distribution that would best describe the available data, and then use it to make predictions for observables of interest. Classical models generally fit the data to Boltzmann distribution of Hamiltonians with a large number of tunable parameters. Quantum extensions of these models replace classical probability distributions with quantum density matrices. An advantage can be obtained only when features of density matrices that are absent in classical probability distributions are exploited. Such situations depend on the input data as well as the targeted observables. Explicit examples are discussed that bring out the constraints limiting possible quantum advantage. The problem-dependent extent of quantum advantage has implications for both data analysis and sensing applications.

信息检索

[IR-0] Compact Multimodal Language Models as Robust OCR Alternatives for Noisy Textual Clinical Reports

链接: https://arxiv.org/abs/2511.13523
作者: Nikita Neveditsin,Pawan Lingras,Salil Patil,Swarup Patil,Vijay Mago
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Digitization of medical records often relies on smartphone photographs of printed reports, producing images degraded by blur, shadows, and other noise. Conventional OCR systems, optimized for clean scans, perform poorly under such real-world conditions. This study evaluates compact multimodal language models as privacy-preserving alternatives for transcribing noisy clinical documents. Using obstetric ultrasound reports written in regionally inflected medical English common to Indian healthcare settings, we compare eight systems in terms of transcription accuracy, noise sensitivity, numeric accuracy, and computational efficiency. Compact multimodal models consistently outperform both classical and neural OCR pipelines. Despite higher computational costs, their robustness and linguistic adaptability position them as viable candidates for on-premises healthcare digitization.

[IR-1] PolicyBot - Reliable Question Answering over Policy Documents

链接: https://arxiv.org/abs/2511.13489
作者: Gautam Nagarajan,Omir Kumar,Sudarsun Santhiappan
类目: Emerging Technologies (cs.ET); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:All citizens of a country are affected by the laws and policies introduced by their government. These laws and policies serve essential functions for citizens. Such as granting them certain rights or imposing specific obligations. However, these documents are often lengthy, complex, and difficult to navigate, making it challenging for citizens to locate and understand relevant information. This work presents PolicyBot, a retrieval-augmented generation (RAG) system designed to answer user queries over policy documents with a focus on transparency and reproducibility. The system combines domain-specific semantic chunking, multilingual dense embeddings, multi-stage retrieval with reranking, and source-aware generation to provide responses grounded in the original documents. We implemented citation tracing to reduce hallucinations and improve user trust, and evaluated alternative retrieval and generation configurations to identify effective design choices. The end-to-end pipeline is built entirely with open-source tools, enabling easy adaptation to other domains requiring document-grounded question answering. This work highlights design considerations, practical challenges, and lessons learned in deploying trustworthy RAG systems for governance-related contexts.

[IR-2] FLOWER: Flow-Oriented Entity-Relationship Tool

链接: https://arxiv.org/abs/2511.13357
作者: Dmitry Moskalev
类目: oftware Engineering (cs.SE); Information Retrieval (cs.IR)
*备注: 12 pages, 8 figures

点击查看摘要

Abstract:Exploring relationships across data sources is a crucial optimization for entities recognition. Since databases can store big amount of information with synthetic and organic data, serving all quantity of objects correctly is an important task to deal with. However, the decision of how to construct entity relationship model is associated with human factor. In this paper, we present flow-oriented entity-relationship tool. This is first and unique end-to-end solution that eliminates routine and resource-intensive problems of processing, creating and visualizing both of explicit and implicit dependencies for prominent SQL dialects on-the-fly. Once launched, FLOWER automatically detects built-in constraints and starting to create own correct and necessary one using dynamic sampling and robust data analysis techniques. This approach applies to improve entity-relationship model and data storytelling to better understand the foundation of data and get unseen insights from DB sources using SQL or natural language. Evaluated on state-of-the-art STATS benchmark, experiments show that FLOWER is superior to reservoir sampling by 2.4x for distribution representation and 2.6x for constraint learning with 2.15x acceleration. For data storytelling, our tool archives 1.19x for accuracy enhance with 1.86x context decrease compare to LLM. Presented tool is also support 23 languages and compatible with both of CPU and GPU. Those results show that FLOWER can manage with real-world data a way better to ensure with quality, scalability and applicability for different use-cases.

[IR-3] Cog-RAG : Cognitive-Inspired Dual-Hypergraph with Theme Alignment Retrieval-Augmented Generation AAAI2026

链接: https://arxiv.org/abs/2511.13201
作者: Hao Hu,Yifan Feng,Ruoxue Li,Rundong Xue,Xingliang Hou,Zhiqiang Tian,Yue Gao,Shaoyi Du
类目: Information Retrieval (cs.IR)
*备注: Accepted by AAAI 2026 main conference

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) enhances the response quality and domain-specific performance of large language models (LLMs) by incorporating external knowledge to combat hallucinations. In recent research, graph structures have been integrated into RAG to enhance the capture of semantic relations between entities. However, it primarily focuses on low-order pairwise entity relations, limiting the high-order associations among multiple entities. Hypergraph-enhanced approaches address this limitation by modeling multi-entity interactions via hyperedges, but they are typically constrained to inter-chunk entity-level representations, overlooking the global thematic organization and alignment across chunks. Drawing inspiration from the top-down cognitive process of human reasoning, we propose a theme-aligned dual-hypergraph RAG framework (Cog-RAG) that uses a theme hypergraph to capture inter-chunk thematic structure and an entity hypergraph to model high-order semantic relations. Furthermore, we design a cognitive-inspired two-stage retrieval strategy that first activates query-relevant thematic content from the theme hypergraph, and then guides fine-grained recall and diffusion in the entity hypergraph, achieving semantic alignment and consistent generation from global themes to local details. Our extensive experiments demonstrate that Cog-RAG significantly outperforms existing state-of-the-art baseline approaches.

[IR-4] Mitigating Recommendation Biases via Group-Alignment and Global-Uniformity in Representation Learning

链接: https://arxiv.org/abs/2511.13041
作者: Miaomiao Cai,Min Hou,Lei Chen,Le Wu,Haoyue Bai,Yong Li,Meng Wang
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Collaborative Filtering~(CF) plays a crucial role in modern recommender systems, leveraging historical user-item interactions to provide personalized suggestions. However, CF-based methods often encounter biases due to imbalances in training data. This phenomenon makes CF-based methods tend to prioritize recommending popular items and performing unsatisfactorily on inactive users. Existing works address this issue by rebalancing training samples, reranking recommendation results, or making the modeling process robust to the bias. Despite their effectiveness, these approaches can compromise accuracy or be sensitive to weighting strategies, making them challenging to train. In this paper, we deeply analyze the causes and effects of the biases and propose a framework to alleviate biases in recommendation from the perspective of representation distribution, namely Group-Alignment and Global-Uniformity Enhanced Representation Learning for Debiasing Recommendation (AURL). Specifically, we identify two significant problems in the representation distribution of users and items, namely group-discrepancy and global-collapse. These two problems directly lead to biases in the recommendation results. To this end, we propose two simple but effective regularizers in the representation space, respectively named group-alignment and global-uniformity. The goal of group-alignment is to bring the representation distribution of long-tail entities closer to that of popular entities, while global-uniformity aims to preserve the information of entities as much as possible by evenly distributing representations. Our method directly optimizes both the group-alignment and global-uniformity regularization terms to mitigate recommendation biases. Extensive experiments on three real datasets and various recommendation backbones verify the superiority of our proposed framework.

[IR-5] Personalized Federated Recommendation With Knowledge Guidance

链接: https://arxiv.org/abs/2511.12959
作者: Jaehyung Lim,Wonbin Kweon,Woojoo Kim,Junyoung Kim,Dongha Kim,Hwanjo Yu
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Federated Recommendation (FedRec) has emerged as a key paradigm for building privacy-preserving recommender systems. However, existing FedRec models face a critical dilemma: memory-efficient single-knowledge models suffer from a suboptimal knowledge replacement practice that discards valuable personalization, while high-performance dual-knowledge models are often too memory-intensive for practical on-device deployment. We propose Federated Recommendation with Knowledge Guidance (FedRKG), a model-agnostic framework that resolves this dilemma. The core principle, Knowledge Guidance, avoids full replacement and instead fuses global knowledge into preserved local embeddings, attaining the personalization benefits of dual-knowledge within a single-knowledge memory footprint. Furthermore, we introduce Adaptive Guidance, a fine-grained mechanism that dynamically modulates the intensity of this guidance for each user-item interaction, overcoming the limitations of static fusion methods. Extensive experiments on benchmark datasets demonstrate that FedRKG significantly outperforms state-of-the-art methods, validating the effectiveness of our approach. The code is available at this https URL.

[IR-6] Can We Predict the Next Question? A Collaborative Filtering Approach to Modeling User Behavior

链接: https://arxiv.org/abs/2511.12949
作者: Bokang Fu,Jiahao Wang,Xiaojing Liu,Yuli Liu
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:In recent years, large language models (LLMs) have excelled in language understanding and generation, powering advanced dialogue and recommendation systems. However, a significant limitation persists: these systems often model user preferences statically, failing to capture the dynamic and sequential nature of interactive behaviors. The sequence of a user’s historical questions provides a rich, implicit signal of evolving interests and cognitive patterns, yet leveraging this temporal data for predictive tasks remains challenging due to the inherent disconnect between language modeling and behavioral sequence modeling. To bridge this gap, we propose a Collaborative Filtering-enhanced Question Prediction (CFQP) framework. CFQP dynamically models evolving user-question interactions by integrating personalized memory modules with graph-based preference propagation. This dual mechanism allows the system to adaptively learn from user-specific histories while refining predictions through collaborative signals from similar users. Experimental results demonstrate that our approach effectively generates agents that mimic real-user questioning patterns, highlighting its potential for building proactive and adaptive dialogue systems. Subjects: Information Retrieval (cs.IR) Cite as: arXiv:2511.12949 [cs.IR] (or arXiv:2511.12949v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2511.12949 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[IR-7] A Plug-and-Play Spatially-Constrained Representation Enhancement Framework for Local-Life Recommendation

链接: https://arxiv.org/abs/2511.12947
作者: Hao Jiang,Guoquan Wang,Sheng Yu,Yang Zeng,Wencong Zeng,Guorui Zhou
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Local-life recommendation have witnessed rapid growth, providing users with convenient access to daily essentials. However, this domain faces two key challenges: (1) spatial constraints, driven by the requirements of the local-life scenario, where items are usually shown only to users within a limited geographic area, indirectly reducing their exposure probability; and (2) long-tail sparsity, where few popular items dominate user interactions, while many high-quality long-tail items are largely overlooked due to imbalanced interaction opportunities. Existing methods typically adopt a user-centric perspective, such as modeling spatial user preferences or enhancing long-tail representations with collaborative filtering signals. However, we argue that an item-centric perspective is more suitable for this domain, focusing on enhancing long-tail items representation that align with the spatially-constrained characteristics of local lifestyle services. To tackle this issue, we propose ReST, a Plug-And-Play Spatially-Constrained Representation Enhancement Framework for Long-Tail Local-Life Recommendation. Specifically, we first introduce a Meta ID Warm-up Network, which initializes fundamental ID representations by injecting their basic attribute-level semantic information. Subsequently, we propose a novel Spatially-Constrained ID Representation Enhancement Network (SIDENet) based on contrastive learning, which incorporates two efficient strategies: a spatially-constrained hard sampling strategy and a dynamic representation alignment strategy. This design adaptively identifies weak ID representations based on their attribute-level information during training. It additionally enhances them by capturing latent item relationships within the spatially-constrained characteristics of local lifestyle services, while preserving compatibility with popular items.

[IR-8] Rethinking the filter bubble? Developing a research agenda for the protective filter bubble

链接: https://arxiv.org/abs/2511.12873
作者: Jacob Erickson
类目: ocial and Information Networks (cs.SI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
*备注: This work has been published in Big Data Society. Please cite the journal version

点击查看摘要

Abstract:Filter bubbles and echo chambers have received global attention from scholars, media organizations, and the general public. Filter bubbles have primarily been regarded as intrinsically negative, and many studies have sought to minimize their influence. The detrimental influence of filter bubbles is well-studied. Filter bubbles may, for example, create information silos, amplify misinformation, and promote hatred and extremism. However, comparatively few studies have considered the other side of the filter bubble; its protective benefits, particularly to marginalized communities and those living in countries with low levels of press freedom. Through a review of the literature on digital safe spaces and protective filter bubbles, this commentary suggests that there may be a need to rethink the filter bubble, and it proposes several areas for future research.

[IR-9] MindRec: Mind-inspired Coarse-to-fine Decoding for Generative Recommendation

链接: https://arxiv.org/abs/2511.12597
作者: Mengyao Gao,Chongming Gao,Haoyan Liu,Qingpeng Cai,Peng Jiang,Jiajia Chen,Shuai Yuan,Xiangnan He
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recent advancements in large language model-based recommendation systems often represent items as text or semantic IDs and generate recommendations in an auto-regressive manner. However, due to the left-to-right greedy decoding strategy and the unidirectional logical flow, such methods often fail to produce globally optimal recommendations. In contrast, human reasoning does not follow a rigid left-to-right sequence. Instead, it often begins with keywords or intuitive insights, which are then refined and expanded. Inspired by this fact, we propose Mind-inspired Recommender (MindRec), a novel generative framework that emulates human thought processes. Particularly, our method first generates key tokens that reflect user preferences, and then expands them into the complete item, enabling flexible and human-like generation. To further emulate the structured nature of human decision-making, we organize items into a hierarchical category tree. This structure guides the model to first produce the coarse-grained category and then progressively refine its selection through finer-grained subcategories before generating the specific item. To mitigate the local optimum problem inherent in greedy decoding, we design a novel beam search algorithm, Diffusion Beam Search, tailored for our mind-inspired generation paradigm. Experimental results demonstrate that MindRec yields a 9.5% average improvement in top-1 recommendation performance over state-of-the-art methods, highlighting its potential to enhance recommendation accuracy. The implementation is available via this https URL.

[IR-10] DualGR: Generative Retrieval with Long and Short-Term Interests Modeling

链接: https://arxiv.org/abs/2511.12518
作者: Zhongchao Yi,Kai Feng,Xiaojian Ma,Yalong Wang,Yongqi Liu,Han Li,Zhengyang Zhou,Yang Wang
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:In large-scale industrial recommendation systems, retrieval must produce high-quality candidates from massive corpora under strict latency. Recently, Generative Retrieval (GR) has emerged as a viable alternative to Embedding-Based Retrieval (EBR), which quantizes items into a finite token space and decodes candidates autoregressively, providing a scalable path that explicitly models target-history interactions via cross-attention. However, three challenges persist: 1) how to balance users’ long-term and short-term interests , 2) noise interference when generating hierarchical semantic IDs (SIDs), 3) the absence of explicit modeling for negative feedback such as exposed items without clicks. To address these challenges, we propose DualGR, a generative retrieval framework that explicitly models dual horizons of user interests with selective activation. Specifically, DualGR utilizes Dual-Branch Long/Short-Term Router (DBR) to cover both stable preferences and transient intents by explicitly modeling users’ long- and short-term behaviors. Meanwhile, Search-based SID Decoding (S2D) is presented to control context-induced noise and enhance computational efficiency by constraining candidate interactions to the current coarse (level-1) bucket during fine-grained (level-2/3) SID prediction. % also reinforcing intra-class consistency. Finally, we propose an Exposure-aware Next-Token Prediction Loss (ENTP-Loss) that treats “exposed-but-unclicked” items as hard negatives at level-1, enabling timely interest fade-out. On the large-scale Kuaishou short-video recommendation system, DualGR has achieved outstanding performance. Online A/B testing shows +0.527% video views and +0.432% watch time lifts, validating DualGR as a practical and effective paradigm for industrial generative retrieval.

[IR-11] ask-Aware Retrieval Augmentation for Dynamic Recommendation AAAI2026

链接: https://arxiv.org/abs/2511.12495
作者: Zhen Tao,Xinke Jiang,Qingshuai Feng,Haoyu Zhang,Lun Du,Yuchen Fang,Hao Miao,Bangquan Xie,Qingqiang Sun
类目: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
*备注: AAAI 2026

点击查看摘要

Abstract:Dynamic recommendation systems aim to provide personalized suggestions by modeling temporal user-item interactions across time-series behavioral data. Recent studies have leveraged pre-trained dynamic graph neural networks (GNNs) to learn user-item representations over temporal snapshot graphs. However, fine-tuning GNNs on these graphs often results in generalization issues due to temporal discrepancies between pre-training and fine-tuning stages, limiting the model’s ability to capture evolving user preferences. To address this, we propose TarDGR, a task-aware retrieval-augmented framework designed to enhance generalization capability by incorporating task-aware model and retrieval-augmentation. Specifically, TarDGR introduces a Task-Aware Evaluation Mechanism to identify semantically relevant historical subgraphs, enabling the construction of task-specific datasets without manual labeling. It also presents a Graph Transformer-based Task-Aware Model that integrates semantic and structural encodings to assess subgraph relevance. During inference, TarDGR retrieves and fuses task-aware subgraphs with the query subgraph, enriching its representation and mitigating temporal generalization issues. Experiments on multiple large-scale dynamic graph datasets demonstrate that TarDGR consistently outperforms state-of-the-art methods, with extensive empirical evidence underscoring its superior accuracy and generalization capabilities.

[IR-12] Continuous-time Discrete-space Diffusion Model for Recommendation WSDM2026

链接: https://arxiv.org/abs/2511.12114
作者: Chengyi Liu,Xiao Chen,Shijie Wang,Wenqi Fan,Qing Li
类目: Information Retrieval (cs.IR)
*备注: Accepted by WSDM 2026

点击查看摘要

Abstract:In the era of information explosion, Recommender Systems (RS) are essential for alleviating information overload and providing personalized user experiences. Recent advances in diffusion-based generative recommenders have shown promise in capturing the dynamic nature of user preferences. These approaches explore a broader range of user interests by progressively perturbing the distribution of user-item interactions and recovering potential preferences from noise, enabling nuanced behavioral understanding. However, existing diffusion-based approaches predominantly operate in continuous space through encoded graph-based historical interactions, which may compromise potential information loss and suffer from computational inefficiency. As such, we propose CDRec, a novel Continuous-time Discrete-space Diffusion Recommendation framework, which models user behavior patterns through discrete diffusion on historical interactions over continuous time. The discrete diffusion algorithm operates via discrete element operations (e.g., masking) while incorporating domain knowledge through transition matrices, producing more meaningful diffusion trajectories. Furthermore, the continuous-time formulation enables flexible adaptive sampling. To better adapt discrete diffusion models to recommendations, CDRec introduces: (1) a novel popularity-aware noise schedule that generates semantically meaningful diffusion trajectories, and (2) an efficient training framework combining consistency parameterization for fast sampling and a contrastive learning objective guided by multi-hop collaborative signals for personalized recommendation. Extensive experiments on real-world datasets demonstrate CDRec’s superior performance in both recommendation accuracy and computational efficiency.

[IR-13] ComLQ: Benchmarking Complex Logical Queries in Information Retrieval AAAI2026

链接: https://arxiv.org/abs/2511.12004
作者: Ganlin Xu,Zhitao Yin,Linghao Zhang,Jiaqing Liang,Weijia Lu,Xiaodong Zhang,Zhifei Yang,Sihang Jiang,Deqing Yang
类目: Information Retrieval (cs.IR)
*备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Information retrieval (IR) systems play a critical role in navigating information overload across various applications. Existing IR benchmarks primarily focus on simple queries that are semantically analogous to single- and multi-hop relations, overlooking \emphcomplex logical queries involving first-order logic operations such as conjunction ( \land ), disjunction ( \lor ), and negation ( \lnot ). Thus, these benchmarks can not be used to sufficiently evaluate the performance of IR models on complex queries in real-world scenarios. To address this problem, we propose a novel method leveraging large language models (LLMs) to construct a new IR dataset \textbfComLQ for \textbfComplex \textbfLogical \textbfQueries, which comprises 2,909 queries and 11,251 candidate passages. A key challenge in constructing the dataset lies in capturing the underlying logical structures within unstructured text. Therefore, by designing the subgraph-guided prompt with the subgraph indicator, an LLM (such as GPT-4o) is guided to generate queries with specific logical structures based on selected passages. All query-passage pairs in ComLQ are ensured \emphstructure conformity and \emphevidence distribution through expert annotation. To better evaluate whether retrievers can handle queries with negation, we further propose a new evaluation metric, \textbfLog-Scaled Negation Consistency (\textbfLSNC@ K ). As a supplement to standard relevance-based metrics (such as nDCG and mAP), LSNC@ K measures whether top- K retrieved passages violate negation conditions in queries. Our experimental results under zero-shot settings demonstrate existing retrieval models’ limited performance on complex logical queries, especially on queries with negation, exposing their inferior capabilities of modeling exclusion.

附件下载

点击下载今日全部论文列表